Fuzzy text search in python

Question


I am wondering if there is any Python library to search for fuzzy text. For instance:

I have three keywords: letter, stamp, and mail.
I would like to have a function to check whether these three words appear in the same paragraph (or within a certain distance, such as one page).
In addition, the words must keep the same order. It is fine if other words appear between them.

I tried fuzzywuzzy, which did not solve my problem. Another library, Whoosh, looks powerful, but I did not find the right function ...


python full-text-search elasticsearch fuzzy-search whoosh

tao.hong May 26 '15 at 4:25


1 answer

Assem chelli · Accepted Answer · 2015-05-26T05:51:08+0000

{1} You can do this in Whoosh 2.7. It supports fuzzy search through the whoosh.qparser.FuzzyTermPlugin plugin:

whoosh.qparser.FuzzyTermPlugin lets you search for "fuzzy" terms, that is, terms that do not have to match exactly. A fuzzy term will match any similar term within a certain number of "edits" (character insertions, deletions, and/or transpositions; this is called the "Damerau-Levenshtein edit distance").
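To make the notion of edit distance concrete, here is a minimal pure-Python sketch of the restricted Damerau-Levenshtein ("optimal string alignment") distance. The function name osa_distance is hypothetical, the sketch also counts substitutions as the standard definition does, and it is only an illustration of the metric, not Whoosh's own implementation.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance between strings a and b.

    Insertions, deletions, substitutions and swaps of two adjacent
    characters each cost one edit. (Illustrative helper, not part of Whoosh.)
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] = distance between the first i chars of a and first j chars of b
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters
    for j in range(cols):
        d[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # adjacent transposition, e.g. "ai" <-> "ia"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]
```

Under this metric "mial" is one edit away from "mail" (a single transposition) and "letters" is one edit away from "letter" (a single deletion), so both fall within a maximum distance of 1.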

To add a fuzzy plugin:

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.add_plugin(qparser.FuzzyTermPlugin())

Once you add the fuzzy plugin to the parser, you can mark a term as fuzzy by appending ~, optionally followed by a maximum edit distance. If you do not specify an edit distance, the default is 1.

For example, the following are all fuzzy term queries:

    letter~
    letter~2
    letter~2/3

In the last form, the number after the slash is a prefix length: that many leading characters must match exactly.

{2} To keep the words in order, use a whoosh.query.Phrase query. To allow fuzzy terms inside a phrase, you must replace the default phrase plugin with whoosh.qparser.SequencePlugin:

 "letter~ stamp~ mail~"

To replace the default plugin with the sequence plugin:

    parser = qparser.QueryParser("fieldname", my_index.schema)
    parser.remove_plugin_class(qparser.PhrasePlugin)
    parser.add_plugin(qparser.SequencePlugin())

{3} To allow other words between the terms, set the slop argument of the phrase query to a larger number:

 whoosh.query.Phrase(fieldname, words, slop=1, boost=1.0, char_ranges=None)

slop: the number of words allowed between each "word" in the phrase; the default of 1 means the phrase must match exactly.

You can also set the slop in the query string as follows:

 "letter~ stamp~ mail~"~10
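To illustrate what such a query asks for (fuzzy terms, in order, with a bounded gap), here is a rough standard-library sketch. It is not how Whoosh evaluates the query: it swaps in difflib.SequenceMatcher's similarity ratio for the edit-distance test, tokenizes with a simple regular expression, and treats slop as the maximum position difference between consecutive matched words. The names ordered_fuzzy_match, is_similar and cutoff are hypothetical.

```python
import re
from difflib import SequenceMatcher


def is_similar(word, term, cutoff=0.7):
    """Fuzzy test based on difflib's similarity ratio (a stand-in for edit distance)."""
    return SequenceMatcher(None, word, term).ratio() >= cutoff


def ordered_fuzzy_match(text, terms, slop=10, cutoff=0.7):
    """Return True if all terms occur in text, in order, each within
    `slop` word positions of the previous match (slop=1 means adjacent)."""
    tokens = re.findall(r"\w+", text.lower())

    def match_from(term_index, start, end):
        # All terms matched: success.
        if term_index == len(terms):
            return True
        # Try every candidate position in the allowed window, backtracking on failure.
        for pos in range(start, min(end, len(tokens))):
            if is_similar(tokens[pos], terms[term_index], cutoff):
                if match_from(term_index + 1, pos + 1, pos + 1 + slop):
                    return True
        return False

    # The first term may appear anywhere in the text.
    return match_from(0, 0, len(tokens))
```

For example, ordered_fuzzy_match("letters first, stamps second, mial third wrong", ["letter", "stamp", "mail"]) accepts the text, while "stamp first, letters second, mail third" is rejected because the words are out of order. For anything beyond small inputs, an indexed search such as Whoosh is the better fit.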

{4} General solution:

{4.a} The indexer will look like this:

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT

    schema = Schema(title=TEXT(stored=True), content=TEXT)
    # create_in() expects the index directory to exist already
    if not os.path.exists("indexdir"):
        os.mkdir("indexdir")
    ix = create_in("indexdir", schema)

    writer = ix.writer()
    writer.add_document(title=u"First document", content=u"This is the first document we've added!")
    writer.add_document(title=u"Second document", content=u"The second one is even more interesting!")
    writer.add_document(title=u"Third document", content=u"letter first, stamp second, mail third")
    writer.add_document(title=u"Fourth document", content=u"stamp first, mail third")
    writer.add_document(title=u"Fifth document", content=u"letter first, mail third")
    # "mial" is a deliberate misspelling to exercise the fuzzy matching
    writer.add_document(title=u"Sixth document", content=u"letters first, stamps second, mial third wrong")
    writer.add_document(title=u"Seventh document", content=u"stamp first, letters second, mail third")
    writer.commit()

{4.b} The searcher will look like this:

    from whoosh.qparser import QueryParser, FuzzyTermPlugin, PhrasePlugin, SequencePlugin

    with ix.searcher() as searcher:
        parser = QueryParser(u"content", ix.schema)
        parser.add_plugin(FuzzyTermPlugin())
        # Swap the phrase plugin for the sequence plugin so fuzzy terms work inside quotes
        parser.remove_plugin_class(PhrasePlugin)
        parser.add_plugin(SequencePlugin())
        query = parser.parse(u"\"letter~2 stamp~2 mail~2\"~10")
        results = searcher.search(query)
        print("nb of results = %d" % len(results))
        for r in results:
            print(r)

This gives the result:

    nb of results = 2
    <Hit {'title': u'Sixth document'}>
    <Hit {'title': u'Third document'}>

{5} If you want fuzzy matching by default, without writing the word~n syntax for each query term, you can initialize QueryParser as follows:

    from whoosh.query import FuzzyTerm

    parser = QueryParser(u"content", ix.schema, termclass=FuzzyTerm)

Now you can use the query "letter stamp mail"~10, but keep in mind that FuzzyTerm has a default maximum edit distance of maxdist=1. Subclass it if you want a larger edit distance:

    class MyFuzzyTerm(FuzzyTerm):
        def __init__(self, fieldname, text, boost=1.0, maxdist=2,
                     prefixlength=1, constantscore=True):
            # In Python 3 this can be shortened to super().__init__(...)
            super(MyFuzzyTerm, self).__init__(fieldname, text, boost, maxdist,
                                              prefixlength, constantscore)

    parser = QueryParser(u"content", ix.schema, termclass=MyFuzzyTerm)
