Fuzzy text search in Python

I am wondering if there is a Python library for fuzzy text search. For instance:

  • I have three keywords: letter, stamp, and mail.
  • I would like a function that checks whether these three words appear in the same paragraph (or within a certain distance, such as one page).
  • In addition, the words must keep the same order. It is fine if other words appear between them.

I tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find the right function ...

1 answer

{1} You can do this in Whoosh 2.7. It supports fuzzy search through the whoosh.qparser.FuzzyTermPlugin plugin:

whoosh.qparser.FuzzyTermPlugin lets you search for "fuzzy" terms, that is, terms that do not have to match exactly. A fuzzy term matches any similar term within a certain number of "edits" (character insertions, deletions, and/or transpositions; this is called the "Damerau-Levenshtein edit distance").

To add a fuzzy plugin:

    from whoosh import qparser

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.add_plugin(qparser.FuzzyTermPlugin())

Once you add the fuzzy plugin to the parser, you can mark a term as fuzzy by appending ~, optionally followed by a maximum edit distance. If you do not specify an edit distance, the default is 1.

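For example, the following are all fuzzy term queries. The first uses the default edit distance of 1, the second allows up to 2 edits, and the third allows 2 edits but requires the first 3 characters to match exactly:

    letter~
    letter~2
    letter~2/3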
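
To make the edit distance concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant). It is only an illustration of the idea and not Whoosh's implementation, which may differ in detail. The names edit_distance and fuzzy_match, and the way the prefix check is modelled, are assumptions made for this example.

```python
def edit_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    two adjacent characters, each at a cost of 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(term, candidate, maxdist=1, prefixlength=0):
    """Hypothetical helper: True if candidate is within maxdist edits of
    term and the first prefixlength characters are identical."""
    if term[:prefixlength] != candidate[:prefixlength]:
        return False
    return edit_distance(term, candidate) <= maxdist


if __name__ == "__main__":
    for term, candidate in [("letter", "letters"), ("mail", "mial"), ("stamp", "stomp")]:
        print(term, candidate, edit_distance(term, candidate))
```

Under this sketch, "mial" is one transposition away from "mail", so it would match a query term such as mail~ with the default distance of 1.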

{2} To keep the words in order, use a whoosh.query.Phrase query. To use fuzzy terms inside a phrase, you must replace the default phrase plugin with whoosh.qparser.SequencePlugin, which allows queries such as:

 "letter~ stamp~ mail~" 

To replace the default plugin with the sequence plugin:

    from whoosh import qparser

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.remove_plugin_class(qparser.PhrasePlugin)
    parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words between the terms, set the slop argument of the phrase query to a larger number:

 whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None) 

slop: the number of words allowed between each "word" in the phrase. The default of 1 means the phrase must match exactly.
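
The slop rule can be illustrated without Whoosh. The sketch below checks whether a list of words occurs in order in a tokenized text, with each word at most slop positions after the previous one, so slop=1 means the words must be adjacent. This is one reading of the description above and not Whoosh's actual matching code. The names tokenize and in_order_within_slop, and the optional matches predicate, are made up for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def in_order_within_slop(tokens, words, slop=1, matches=None):
    """Return True if every word occurs in order in tokens, each one
    at most `slop` positions after the previous word.

    `matches(word, token)` decides whether a token counts as a hit;
    it defaults to exact equality.
    """
    if matches is None:
        matches = lambda word, token: word == token
    if not words:
        return True
    # positions where the first word can be placed
    reachable = {i for i, tok in enumerate(tokens) if matches(words[0], tok)}
    for word in words[1:]:
        # positions of this word that follow a reachable position
        # of the previous word within the allowed distance
        reachable = {
            i
            for i, tok in enumerate(tokens)
            if matches(word, tok) and any(0 < i - q <= slop for q in reachable)
        }
        if not reachable:
            return False
    return bool(reachable)


if __name__ == "__main__":
    tokens = tokenize("letter first, stamp second, mail third")
    words = ["letter", "stamp", "mail"]
    for slop in (1, 2, 10):
        print(slop, in_order_within_slop(tokens, words, slop=slop))
```

The matches parameter is there so that a fuzzy comparison (for example one based on edit distance) can be plugged in instead of exact equality.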

You can also set slop directly in the query string:

 "letter~ stamp~ mail~"~10 

{4} General solution:

{4.a} The indexer will look like this:

    import os

    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT

    schema = Schema(title=TEXT(stored=True), content=TEXT)

    # create_in() expects the index directory to exist already
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(title=u"First document", content=u"This is the first document we've added!")
    writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
    writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
    writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
    writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
    # "mial" is misspelled on purpose to exercise the fuzzy match
    writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
    writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
    writer.commit()

{4.b} The searcher will look like this:

    from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

    with ix.searcher() as searcher:
        parser = QueryParser(u"content", ix.schema)
        # enable the word~n syntax
        parser.add_plugin(FuzzyTermPlugin())
        # swap the phrase plugin for the sequence plugin so that
        # fuzzy terms are allowed inside a quoted phrase
        parser.remove_plugin_class(PhrasePlugin)
        parser.add_plugin(SequencePlugin())
        # each word within edit distance 2, in order, with slop 10
        query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
        results = searcher.search(query)
        print("nb of results = %d" % len(results))
        for r in results:
            print(r)

This gives the result:

    nb of results = 2
    <Hit {'title': u'Sixth document'}>
    <Hit {'title': u'Third document'}>

{5} If you want fuzzy matching by default, without using the word~n syntax on each query word, you can initialize QueryParser as follows:

    from whoosh.query import FuzzyTerm

    parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default edit distance of maxdist=1. Subclass it if you want a larger edit distance:

    class MyFuzzyTerm(FuzzyTerm):
        def __init__(self, fieldname, text, boost=1.0, maxdist=2,
                     prefixlength=1, constantscore=True):
            # In Python 3 this can be shortened to super().__init__(...)
            super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist,
                                              prefixlength, constantscore)

    # pass the subclass to the parser instead of FuzzyTerm
    parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)


Source: https://habr.com/ru/post/987898/

