I have a long list of words that I loaded into a very simple SOLR / Lucene database. My goal is to find “similar” words from the list for one-off queries, where “similarity” is specifically understood as (Damerau-)Levenshtein edit distance. I understand that SOLR provides such a distance for spelling suggestions.
In my SOLR schema.xml, I configured the field type string:
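To be concrete about the kind of distance I mean, here is a minimal Python sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). It is only an illustration of the metric, not how SOLR or Lucene computes it internally, and the function name `osa_distance` is made up for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and swaps of two adjacent
    characters. Illustrative only; not SOLR's or Lucene's implementation.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and the first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


if __name__ == "__main__":
    print(osa_distance("webspace", "webpage"))
```

What I would like back from SOLR is a ranking based on a number like this (or a normalised version of it), and nothing else.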
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
which I use to define the field
<field name='term' type='string' indexed='true' stored='true' required='true'/>
I want to search this field and have the results returned in order of their Levenshtein edit distance. However, when I run a query such as webspace~0.1 against SOLR with debugging and explanations enabled, the report shows that a number of other factors went into calculating the scores, for example:
"1582":"
1.1353534 = (MATCH) sum of:
  1.1353534 = (MATCH) weight(term:webpage^0.8148148 in 1581), product of:
    0.08618848 = queryWeight(term:webpage^0.8148148), product of:
      0.8148148 = boost
      13.172914 = idf(docFreq=1, maxDocs=386954)
      0.008029869 = queryNorm
    13.172914 = (MATCH) fieldWeight(term:webpage in 1581), product of:
      1.0 = tf(termFreq(term:webpage)=1)
      13.172914 = idf(docFreq=1, maxDocs=386954)
      1.0 = fieldNorm(field=term, doc=1581)
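To show how the final score is assembled from these factors, the arithmetic in the explanation above can be reproduced with a few lines of Python. The numbers are copied from the debug output; the variable names are my own, and this is just a check of the products listed there, not SOLR code.

```python
# Values copied from the explain output above (variable names are illustrative).
boost = 0.8148148
idf = 13.172914          # idf(docFreq=1, maxDocs=386954)
query_norm = 0.008029869
tf = 1.0                 # tf(termFreq(term:webpage)=1)
field_norm = 1.0         # fieldNorm(field=term, doc=1581)

# queryWeight is the product of boost, idf and queryNorm.
query_weight = boost * idf * query_norm

# fieldWeight is the product of tf, idf and fieldNorm.
field_weight = tf * idf * field_norm

# The reported score is the product of the two weights.
score = query_weight * field_weight

print(f"queryWeight = {query_weight:.8f}")
print(f"fieldWeight = {field_weight:.6f}")
print(f"score       = {score:.7f}")
```

So the only part of the score that seems to come from the fuzzy match itself is the boost; idf enters twice and accounts for most of the value.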
Clearly, for my application, term frequency, idf, etc. are meaningless, since each document contains only one term. I tried to use the spelling suggestion component, but could not get it to return the actual similarity scores.
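How can I configure SOLR to run Levenshtein / Jaro-Winkler / n-gram searches that return the similarity scores, without additional factors such as tf, idf and boost being included?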