English lexicon for search query correction

I am building a spelling corrector for search queries by implementing the method described in "Spelling correction as an iterative process that exploits the collective knowledge of web users."

The high-level approach is as follows: for a given query, generate candidate corrections (words in the query log within a certain edit distance) for each unigram and bigram, then perform a modified Viterbi search to find the most likely sequence of candidates given the bigram frequencies. Repeat this process until the sequence reaches maximum probability, that is, until another pass no longer changes it.
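The loop described above can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the function names, the add-one smoothing, the `max_dist` and `max_iters` parameters, and the toy counts are all assumptions made for the example, and it handles only per-word candidates (no bigram-level candidates or word splitting and merging).

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def candidates(word, unigrams, max_dist=1):
    """Query-log words within max_dist edits, plus the word itself."""
    found = {w for w in unigrams if edit_distance(word, w) <= max_dist}
    found.add(word)
    return sorted(found)


def viterbi(query, unigrams, bigrams, max_dist=1):
    """Most likely candidate sequence under an add-one smoothed bigram model."""
    words = query.split()
    if not words:
        return []
    total = sum(unigrams.values())
    vocab = len(unigrams)

    def log_uni(w):
        return math.log((unigrams[w] + 1) / (total + vocab))

    def log_bi(prev, w):
        return math.log((bigrams[(prev, w)] + 1) / (unigrams[prev] + vocab))

    # best[c] = (log-probability of the best path ending in c, that path)
    best = {c: (log_uni(c), [c])
            for c in candidates(words[0], unigrams, max_dist)}
    for word in words[1:]:
        nxt = {}
        for c in candidates(word, unigrams, max_dist):
            score, path = max(
                ((s + log_bi(p, c), pth) for p, (s, pth) in best.items()),
                key=lambda t: t[0])
            nxt[c] = (score, path + [c])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]


def correct(query, unigrams, bigrams, max_iters=5):
    """Re-run the search until the query stops changing."""
    for _ in range(max_iters):
        new_query = " ".join(viterbi(query, unigrams, bigrams))
        if new_query == query:
            break
        query = new_query
    return query


# Toy query-log counts (made up for illustration).
unigrams = Counter({"britney": 50, "spears": 40, "brittany": 5, "spars": 1})
bigrams = Counter({("britney", "spears"): 30})
print(correct("britny spears", unigrams, bigrams))
```

Scanning the whole log for every word is quadratic and would not scale to a real query log; a production version would need an index over the vocabulary, which is left out here to keep the sketch short.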

The modification to the Viterbi search is that if two adjacent words are both found in a trusted lexicon, at most one of them can be corrected. This is especially important for preventing correctly spelled single-word queries from being "corrected" to higher-frequency words.
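One way to picture this constraint is as a filter on Viterbi transitions. The sketch below is only an illustration under my own assumptions: `transition_allowed`, `allowed_pairs`, and the tiny `trusted` set are hypothetical names and data, not anything taken from the paper. Because "was this word changed?" can be read off by comparing a candidate with the original word at its position, the check needs only the two adjacent positions and so fits an ordinary first-order Viterbi search.

```python
from itertools import product


def transition_allowed(orig_prev, cand_prev, orig_cur, cand_cur, trusted):
    """Reject a transition that changes two adjacent trusted words at once."""
    both_trusted = orig_prev in trusted and orig_cur in trusted
    both_changed = cand_prev != orig_prev and cand_cur != orig_cur
    return not (both_trusted and both_changed)


def allowed_pairs(orig_prev, cands_prev, orig_cur, cands_cur, trusted):
    """All candidate pairs for two adjacent positions that pass the check."""
    return [(p, c) for p, c in product(cands_prev, cands_cur)
            if transition_allowed(orig_prev, p, orig_cur, c, trusted)]


# Hypothetical trusted lexicon, for illustration only.
trusted = {"new", "york", "times"}

# Both original words are trusted, so ("now", "work") is filtered out.
print(allowed_pairs("new", ["new", "now"], "york", ["york", "work"], trusted))

# Neither original word is trusted, so every pair is allowed.
print(allowed_pairs("nwe", ["nwe", "new"], "yrok", ["yrok", "york"], trusted))
```

Inside the search itself, a disallowed transition would simply be skipped (or given a score of negative infinity) when extending paths from one position to the next.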

My question is where to find such a lexicon. It must be in English and contain proper nouns (first and last names, places, company names, etc.) that can appear in search queries, as well as both common and uncommon English words. A push in the right direction would be helpful.

In addition, if anyone reading this has suggestions for improving on the methodology given in the paper, I am open to them. I should also make clear that this is my first foray into NLP.


One option is the Google Web 1T 5-gram corpus.

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T13

The corpus is distributed by the LDC.

, Python NLTK, Google , .

Source: https://habr.com/ru/post/1730428/