Scoring - fieldNorm

I have the following records and the scores they get when I search for "iphone":

Record1: FieldName - DisplayName: "Iphone" FieldName - Name: "Iphone"

 11.654595 = (MATCH) sum of:
   11.654595 = (MATCH) max plus 0.01 times others of:
     7.718274 = (MATCH) weight(DisplayName:iphone^10.0 in 915195), product of:
       0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
         10.0 = boost
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.0057376726 = queryNorm
       11.598244 = (MATCH) fieldWeight(DisplayName:iphone in 915195), product of:
         1.0 = tf(termFreq(DisplayName:iphone)=1)
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         1.0 = fieldNorm(field=DisplayName, doc=915195)
     11.577413 = (MATCH) weight(Name:iphone^15.0 in 915195), product of:
       0.99820393 = queryWeight(Name:iphone^15.0), product of:
         15.0 = boost
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.0057376726 = queryNorm
       11.598244 = (MATCH) fieldWeight(Name:iphone in 915195), product of:
         1.0 = tf(termFreq(Name:iphone)=1)
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         1.0 = fieldNorm(field=Name, doc=915195)

Record2: FieldName - DisplayName: "The Iphone Book" FieldName - Name: "The Iphone Book"

 7.284122 = (MATCH) sum of:
   7.284122 = (MATCH) max plus 0.01 times others of:
     4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 453681), product of:
       0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
         10.0 = boost
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.0057376726 = queryNorm
       7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 453681), product of:
         1.0 = tf(termFreq(DisplayName:iphone)=1)
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.625 = fieldNorm(field=DisplayName, doc=453681)
     7.2358828 = (MATCH) weight(Name:iphone^15.0 in 453681), product of:
       0.99820393 = queryWeight(Name:iphone^15.0), product of:
         15.0 = boost
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.0057376726 = queryNorm
       7.2489023 = (MATCH) fieldWeight(Name:iphone in 453681), product of:
         1.0 = tf(termFreq(Name:iphone)=1)
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.625 = fieldNorm(field=Name, doc=453681)

Record3: FieldName - DisplayName: "iPhone" FieldName - Name: "iPhone"

 7.284122 = (MATCH) sum of:
   7.284122 = (MATCH) max plus 0.01 times others of:
     4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 5737775), product of:
       0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
         10.0 = boost
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.0057376726 = queryNorm
       7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 5737775), product of:
         1.0 = tf(termFreq(DisplayName:iphone)=1)
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.625 = fieldNorm(field=DisplayName, doc=5737775)
     7.2358828 = (MATCH) weight(Name:iphone^15.0 in 5737775), product of:
       0.99820393 = queryWeight(Name:iphone^15.0), product of:
         15.0 = boost
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.0057376726 = queryNorm
       7.2489023 = (MATCH) fieldWeight(Name:iphone in 5737775), product of:
         1.0 = tf(termFreq(Name:iphone)=1)
         11.598244 = idf(docFreq=484, maxDocs=19431244)
         0.625 = fieldNorm(field=Name, doc=5737775)

Why do Record2 and Record3 get the same score, when Record2 has 3 words and Record3 has only one? Record3 should rank higher than Record2. Why is the fieldNorm the same for both Record2 and Record3?

QueryParser: dismax. FieldType: the default text field type (as defined in schema.xml).

Adding the data feed for the three records:

Record 1: Iphone

 { "ListPrice":1184.526, "ShipsTo":1, "OID":"190502", "EAN":"9780596804299", "ISBN":"0596804296", "Author":"Pogue, David", "product_type_fq":"Books", "ShipmentDurationDays":"21", "CurrencyValue":"24.9900", "ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS", "Availability":0, "COD":0, "PublicationDate":"2009-08-07 00:00:00.0", "Discount":"25", "SubCategory_fq":"Hardware", "Binding":"Paperback", "Category_fq":"Non Classifiable", "ShippingCharges":"0", "OIDType":8, "Pages":"397", "CallOrder":"0", "TrackInventory":"Ingram", "Author_fq":"Pogue, David", "DisplayName":"Iphone", "url":"/iphone-pogue-david/books/9780596804299.htm", "CurrencyType":"USD", "SubSubCategory":"Handheld Devices", "Mask":0, "Publisher":"Oreilly & Associates Inc", "Name":"Iphone", "Language":"English", "DisplayPriority":"999", "rowid":"books_9780596804299" } 

Record 2: Iphone Book

 { "ListPrice":1184.526, "ShipsTo":1, "OID":"94694", "EAN":"9780321534101", "ISBN":"0321534107", "Author":"Kelby, Scott/ White, Terry", "product_type_fq":"Books", "ShipmentDurationDays":"21", "CurrencyValue":"24.9900", "ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS", "Availability":1, "COD":0, "PublicationDate":"2007-08-13 00:00:00.0", "Discount":"25", "SubCategory_fq":"Handheld Devices", "Binding":"Paperback", "BAMcategory_src":"Computers", "Category_fq":"Computers", "ShippingCharges":"0", "OIDType":8, "Pages":"219", "CallOrder":"0", "TrackInventory":"Ingram", "Author_fq":"Kelby, Scott/ White, Terry", "DisplayName":"The Iphone Book", "url":"/iphone-book-kelby-scott-white-terry/books/9780321534101.htm", "CurrencyType":"USD", "SubSubCategory":" Handheld Devices", "BAMcategory_fq":"Computers", "Mask":0, "Publisher":"Pearson PTR", "Name":"The Iphone Book", "Language":"English", "DisplayPriority":"999", "rowid":"books_9780321534101" } 

Record 3: iPhone

 { "ListPrice":278.46, "ShipsTo":1, "OID":"694715", "EAN":"9781411423527", "ISBN":"1411423526", "Author":"Quamut (COR)", "product_type_fq":"Books", "ShipmentDurationDays":"21", "CurrencyValue":"5.9500", "ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS", "Availability":0, "COD":0, "PublicationDate":"2010-08-03 00:00:00.0", "Discount":"25", "SubCategory_fq":"Hardware", "Binding":"Paperback", "Category_fq":"Non Classifiable", "ShippingCharges":"0", "OIDType":8, "CallOrder":"0", "TrackInventory":"BNT", "Author_fq":"Quamut (COR)", "DisplayName":"iPhone", "url":"/iphone-quamut-cor/books/9781411423527.htm", "CurrencyType":"USD", "SubSubCategory":"Handheld Devices", "Mask":0, "Publisher":"Sterling Pub Co Inc", "Name":"iPhone", "Language":"English", "DisplayPriority":"999", "rowid":"books_9781411423527" } 
2 answers

fieldNorm takes into account the length of the field, i.e. the number of terms. The field type used for DisplayName and Name is text, whose analysis chain includes a stop word filter and a word delimiter filter.

Record 1 - Iphone
Generates one token - Iphone

Record 2 - The Iphone Book
Generates 2 tokens - Iphone, Book
"The" is dropped by the stop word filter.

Record 3 - iPhone
Also generates 2 tokens - i, Phone
Because iPhone has a lower-to-upper case change in the middle, the word delimiter filter with splitOnCaseChange enabled splits it into the 2 tokens i and Phone, so it ends up with the same fieldNorm as Record 2. A toy sketch of this tokenization follows below.
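
To make that concrete, here is a minimal plain-Java sketch with no Lucene dependency. The three-entry stop list and the case-change regex are illustrative stand-ins for Solr's StopFilter and WordDelimiterFilter, not the actual implementations:

 import java.util.*;

 // Toy model of the two analysis steps at play: a splitOnCaseChange-style
 // split followed by stop word removal. Stop list and regex are
 // hypothetical stand-ins, only meant to reproduce the token counts.
 public class TokenCountDemo {

     static final Set<String> STOP_WORDS = Set.of("the", "a", "an");

     static List<String> analyze(String text) {
         List<String> tokens = new ArrayList<>();
         for (String word : text.split("\\s+")) {
             // split at every lower-to-upper case boundary, e.g. "iPhone" -> "i", "Phone"
             for (String part : word.split("(?<=\\p{Lower})(?=\\p{Upper})")) {
                 if (!STOP_WORDS.contains(part.toLowerCase())) {
                     tokens.add(part);
                 }
             }
         }
         return tokens;
     }

     public static void main(String[] args) {
         System.out.println(analyze("Iphone"));          // [Iphone]       -> 1 token,  fieldNorm 1.0
         System.out.println(analyze("The Iphone Book")); // [Iphone, Book] -> 2 tokens, fieldNorm 0.625
         System.out.println(analyze("iPhone"));          // [i, Phone]     -> 2 tokens, fieldNorm 0.625
     }
 }

Records 2 and 3 both end up with two tokens, hence identical length norms; why that shared value is exactly 0.625 is explained by the encoding walk-through in the next answer.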


This is a follow-up to user1021590's question and answer about the "da vinci" search example.

The reason these documents get the same score is the lossy encoding of lengthNorm. The Lucene TFIDFSimilarity documentation states the following about norm(t, d):

the resulting norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.

If you dig into the code, you will see that this float-to-byte encoding is implemented as follows:

 public static byte floatToByte315(float f) {
     int bits = Float.floatToRawIntBits(f);
     int smallfloat = bits >> (24 - 3);
     if (smallfloat <= ((63 - 15) << 3)) {
         return (bits <= 0) ? (byte) 0 : (byte) 1;
     }
     if (smallfloat >= ((63 - 15) << 3) + 0x100) {
         return -1;
     }
     return (byte) (smallfloat - ((63 - 15) << 3));
 }

and decoding that byte back to a float is done as:

 public static float byte315ToFloat(byte b) {
     if (b == 0) return 0.0f;
     int bits = (b & 0xff) << (24 - 3);
     bits += (63 - 15) << 24;
     return Float.intBitsToFloat(bits);
 }

lengthNorm is calculated as 1 / sqrt(number of terms in the field). This is then encoded for storage using floatToByte315. For a field with 3 terms we get:

floatToByte315( 1/sqrt(3.0) ) = 120

and for a field with 4 terms we get:

floatToByte315( 1/sqrt(4.0) ) = 120

therefore both of them are decoded to:

byte315ToFloat(120) = 0.5.
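
You can verify the whole chain with a small self-contained program. The two methods are the ones quoted above (in Lucene they live in org.apache.lucene.util.SmallFloat); the class name and the loop are mine, purely for illustration:

 // Shows how the lossy norm encoding collapses nearby lengthNorm values
 // onto the same byte, using the encode/decode methods quoted above.
 public class NormPrecisionDemo {

     public static byte floatToByte315(float f) {
         int bits = Float.floatToRawIntBits(f);
         int smallfloat = bits >> (24 - 3);
         if (smallfloat <= ((63 - 15) << 3)) {
             return (bits <= 0) ? (byte) 0 : (byte) 1;
         }
         if (smallfloat >= ((63 - 15) << 3) + 0x100) {
             return -1;
         }
         return (byte) (smallfloat - ((63 - 15) << 3));
     }

     public static float byte315ToFloat(byte b) {
         if (b == 0) return 0.0f;
         int bits = (b & 0xff) << (24 - 3);
         bits += (63 - 15) << 24;
         return Float.intBitsToFloat(bits);
     }

     public static void main(String[] args) {
         for (int terms : new int[] {1, 2, 3, 4}) {
             float norm = (float) (1.0 / Math.sqrt(terms));
             byte encoded = floatToByte315(norm);
             System.out.printf("%d term(s): lengthNorm=%.4f -> byte %d -> decoded %.4f%n",
                     terms, norm, encoded, byte315ToFloat(encoded));
         }
         // Prints:
         //   1 term(s): lengthNorm=1.0000 -> byte 124 -> decoded 1.0000
         //   2 term(s): lengthNorm=0.7071 -> byte 121 -> decoded 0.6250
         //   3 term(s): lengthNorm=0.5774 -> byte 120 -> decoded 0.5000
         //   4 term(s): lengthNorm=0.5000 -> byte 120 -> decoded 0.5000
     }
 }

Note how the 2-term norm 0.7071 decodes to 0.625, exactly the fieldNorm reported for Records 2 and 3 in the question, while 3-term and 4-term fields collapse to the same 0.5.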

The documentation also states the following:

The rationale supporting such lossy compression of norm values is that, given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.

UPDATE: as of Solr 4.10, this implementation and the related methods are part of DefaultSimilarity.


Source: https://habr.com/ru/post/1380239/

