Differences between Solr's seemingly equivalent queries

As I understand the Solr scoring function, the following two queries should be equivalent.

Namely, score(q1, d) = score(q2, d) for each document d in the corpus.

Query 1: evolution OR selection OR germline OR dna OR rna OR mitochondria

Query 2: (evolution OR selection OR germline) OR (dna OR rna OR mitochondria)

The queries are obviously logically equivalent (they both return the same set of documents). In addition, both queries consist of the same 6 terms, and each term has a boost of 1 in both queries. Therefore, each term should make the same contribution to the total score (same TF, same IDF, same boost).
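Spelled out, under the assumption stated above that the score is a plain sum of per-term contributions (the symbol $s_i(d)$ is introduced here only for illustration and is not Solr's notation; a term absent from $d$ contributes 0):

```latex
\begin{align*}
\mathrm{score}(q_1, d) &= s_1(d) + s_2(d) + s_3(d) + s_4(d) + s_5(d) + s_6(d) \\
\mathrm{score}(q_2, d) &= \bigl(s_1(d) + s_2(d) + s_3(d)\bigr) + \bigl(s_4(d) + s_5(d) + s_6(d)\bigr)
\end{align*}
```

In exact arithmetic the two right-hand sides are equal, because addition is associative.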

Despite this, the two queries do not produce the same scores.

In general, a disjunction of terms ( a OR b OR c OR d ) does not score the same as a disjunction of queries ( (a OR b) OR (c OR d) ). What is the semantic difference between the two kinds of query? What causes them to score differently?

The reason I ask is that I am writing a custom request handler in which I build the second kind of query (a disjunction of queries), whereas I may need to build the first kind (a disjunction of terms). In other words, this is what I am doing:

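    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    Query q1 = ... // disjunction of the terms evolution, selection, germline
    Query q2 = ... // disjunction of the terms dna, rna, mitochondria
    // Declared as BooleanQuery (not Query) so that add() is available.
    BooleanQuery disjunctionOfQueries = new BooleanQuery();
    disjunctionOfQueries.add(q1, BooleanClause.Occur.SHOULD);
    disjunctionOfQueries.add(q2, BooleanClause.Occur.SHOULD);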

although maybe I should do this:

    List<String> terms = ... // extract all 6 terms from q1 and q2
    List<TermQuery> termQueries = ... // create a new TermQuery from each term in terms
    // Declared as BooleanQuery (not Query) so that add() is available.
    BooleanQuery disjunctionOfTerms = new BooleanQuery();
    for (TermQuery t : termQueries) {
        disjunctionOfTerms.add(t, BooleanClause.Occur.SHOULD);
    }
1 answer

I followed femtoRgon's advice and checked the debug output of the score calculation. I found that the two calculations really are mathematically equivalent. The only difference is that, when scoring the disjunction of queries, intermediate results are stored: the contribution of each sub-query to the sum is saved in a variable. Apparently, stopping to store these intermediate results lets numerical error accumulate: each time an intermediate result is stored, some precision is lost. Since the actual queries in the application are quite large (unlike the trivial sample query), there is a lot of precision to lose, and the accumulated error sometimes even changes the ranking order of the returned documents.

Thus, a disjunction-of-terms query can be expected to give a slightly better ranking than a disjunction-of-queries query, since the latter incurs a larger numerical error.
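The effect can be reproduced outside Solr. The following is a minimal sketch, not Lucene code: it assumes the scores are accumulated in single precision, the class and method names (`ScoreSumDemo`, `flatSum`, `groupedSum`) are hypothetical, and the six values are contrived so that the difference is easy to see. It shows only that regrouping a float sum can change the rounded result, not that one grouping is always the more accurate one.

```java
public class ScoreSumDemo {

    /** Adds all values left to right in one float accumulator (flat OR of terms). */
    public static float flatSum(float[] scores) {
        float sum = 0f;
        for (float s : scores) {
            sum += s;
        }
        return sum;
    }

    /**
     * Sums each run of groupSize values into its own float first,
     * then adds those partial sums together (OR of sub-queries).
     */
    public static float groupedSum(float[] scores, int groupSize) {
        float total = 0f;
        for (int start = 0; start < scores.length; start += groupSize) {
            float partial = 0f; // intermediate result stored per group
            int end = Math.min(start + groupSize, scores.length);
            for (int i = start; i < end; i++) {
                partial += scores[i];
            }
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        // Contrived "per-term contributions"; near 1e8 adjacent floats are 8 apart,
        // so every addition is rounded to a multiple of 8.
        float[] scores = {1.0e8f, 5f, 5f, 2f, 2f, 2f};
        float flat = flatSum(scores);
        float grouped = groupedSum(scores, 3);
        System.out.println("flat:    " + flat);
        System.out.println("grouped: " + grouped);
        System.out.println("equal:   " + (flat == grouped));
    }
}
```

With these inputs the flat sum comes out as 100000016 and the grouped sum as 100000024, although both add exactly the same six numbers. With realistic scores the discrepancy is far smaller, but it can still be enough to reorder documents whose scores are nearly tied.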


Source: https://habr.com/ru/post/1490714/

