I think you want to look at a full-text search engine that already provides the features you describe, instead of building something of your own in Python. The two big open-source players in this space are Elasticsearch and Solr.
With these products you can define, per field, how text is analyzed: how it is tokenized (which can be smarter than just splitting on whitespace), how punctuation is stripped, which synonyms are expanded at index or query time, and so on. You can also easily add plugins to extend this analysis chain.
Here is an example of a solr schema that has useful stuff:
Define the field type
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
  </analyzer>
</fieldType>
Define the field
<field name="text_body" type="text_en" indexed="true" stored="false"/>
Then you can talk to the search server over its REST API from Python, or use one of the Solr/Elasticsearch client libraries directly.
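As a minimal sketch of what that looks like from Python: Solr exposes a `/select` endpoint that takes the query as URL parameters, so you can build the request with nothing but the standard library. The host, core name (`mycore` here), and search terms are placeholders; adjust them to your setup.

```python
from urllib.parse import urlencode

def build_solr_query(base_url, field, terms, rows=10):
    """Build a Solr /select URL that searches `field` for `terms`.

    `base_url` is the core URL, e.g. http://localhost:8983/solr/mycore
    (hypothetical core name for illustration).
    """
    params = urlencode({
        "q": f"{field}:({terms})",  # field-scoped query
        "wt": "json",               # ask Solr for a JSON response
        "rows": rows,               # max number of hits to return
    })
    return f"{base_url}/select?{params}"

url = build_solr_query("http://localhost:8983/solr/mycore",
                       "text_body", "quick brown fox")
# Fetch with e.g. urllib.request.urlopen(url) or the requests library
# and parse the JSON body to get the matching documents.
```

The same idea works for Elasticsearch, except queries are usually sent as a JSON body to the `_search` endpoint rather than as URL parameters.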