Tokenize Search Text in Python

I'm looking for strategies for tokenizing search text, along with ideas on how to implement them.

In particular, we are working with customer business reviews to support our search engine. All code is in Python.

I think we need to do at least the following:

  • Converting plural nouns to singular form
    I found a library called inflect that seems to handle this well; does anyone have experience with it?

  • Stripping out all non-alphanumeric characters
    This seems like a job for regular expressions, but I would love to hear other suggestions.

  • Whitespace-based tokenization, collapsing runs of consecutive spaces into a single one
    I think this is doable with some custom string manipulation in Python, but there might be a better way (a rough sketch combining all three steps follows this list).
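
Here is a rough sketch of how I imagine combining these three steps, assuming the inflect package and the standard-library re module; the exact cleanup rules would surely need refinement:

 import re
 import inflect

 _inflect = inflect.engine()

 def tokenize(text):
     # Step 2: replace everything that is not a letter, digit or whitespace.
     cleaned = re.sub(r"[^A-Za-z0-9\s]", " ", text)
     # Step 3: split on whitespace; str.split() already collapses runs of spaces.
     tokens = cleaned.lower().split()
     # Step 1: singularize plural nouns; singular_noun() returns False when
     # inflect thinks the word is already singular, so keep the original then.
     return [_inflect.singular_noun(t) or t for t in tokens]

 print(tokenize("Great pizzas &  friendly   servers!!"))
 # -> ['great', 'pizza', 'friendly', 'server'] (inflect can be quirky on non-nouns, so verify)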

Does anyone have any other ideas on what I need to do to tokenize the text? In addition, what do you think of the methods and tools mentioned for implementing the above strategies?

Background information (added in response to Doug T.'s suggestion in the comments to look at Solr or ElasticSearch):
We already use ElasticSearch, and we use its facilities for basic tokenization. We want to do the tokenization described above because, after tokenizing, we need to apply a fairly involved semantic analysis to extract meaning from the text. We want the flexibility to control exactly how tokens are produced, and the convenience of storing the tokens in our own format with our own annotations attached to them.
The one thing we absolutely need is a single (large) database record per token, accessible and modifiable on the fly, holding everything related to that token's usage. I think that rules out relying only on the tokenization ES performs as documents are indexed. We could instead use the ES analysis module to analyze the text without indexing it, and then process each token ourselves to create/update its database entry. We are looking for suggestions on this approach as well (see the sketch below).
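
For the record, here is a minimal sketch of that last idea using the elasticsearch Python client; the cluster URL and the upsert_token_record helper are placeholders for illustration, not anything we have actually built:

 from elasticsearch import Elasticsearch

 es = Elasticsearch("http://localhost:9200")  # assumed local cluster

 def upsert_token_record(tok):
     # Hypothetical helper: create/update our own per-token database record,
     # attaching whatever annotations we need.
     print(tok["token"], tok["position"], tok["start_offset"], tok["end_offset"])

 def tokenize_without_indexing(text, analyzer="standard"):
     # The _analyze endpoint runs an analysis chain over arbitrary text
     # without indexing any document.
     resp = es.indices.analyze(body={"analyzer": analyzer, "text": text})
     for tok in resp["tokens"]:
         upsert_token_record(tok)

 tokenize_without_indexing("Great pizzas and friendly servers")

(Newer versions of the client prefer analyzer= and text= keyword arguments over body=.)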

1 answer

I think you want to take a look at a full-text search solution that provides the features you describe, instead of building something of your own in Python. The two big open-source players in this space are ElasticSearch and Solr.

With these products you can configure, per field, things like custom tokenization, removal of punctuation, synonyms to aid searching, tokenizing on more than just spaces, and so on. You can also easily add plugins to modify this analysis chain.

Here is an example from a Solr schema that contains some of this useful stuff:

Defining Field Types

 <fieldType class="solr.TextField" name="text_en" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
             catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.ASCIIFoldingFilterFactory"/>
   </analyzer>
 </fieldType>

Define Field

 <field indexed="true" name="text_body" stored="false" type="text_en"/> 

Then you can work with the search server from Python through its nice REST API, or just query Solr/ElasticSearch directly.
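
For instance, a quick query from Python over plain HTTP might look something like this; the core name "reviews" is an assumption, and text_body matches the field defined above:

 import requests

 # Assumed: a Solr core named "reviews" containing the text_body field above.
 resp = requests.get(
     "http://localhost:8983/solr/reviews/select",
     params={"q": "text_body:pizza", "wt": "json"},
 )
 for doc in resp.json()["response"]["docs"]:
     print(doc)

(Note that text_body is stored="false" in the schema above, so the returned documents will only contain the other stored fields.)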


Source: https://habr.com/ru/post/1446303/

