Efficiently matching a very large list of keywords in a string

I need to match a really large list of keywords (> 1,000,000) in a string, efficiently, using Python. I found some really good libraries that try to do this quickly:

1) FlashText ( https://github.com/vi3k6i5/flashtext )

2) Aho-Corasick algorithm, etc.

However, I have a peculiar requirement: in my context, a keyword such as "XXXX YYYY" should return a match if my string is "XXXX is a very good sign of YYYY." Note that "XXXX YYYY" is not a substring of that sentence, but both XXXX and YYYY appear in it, and that is good enough for me to count as a match.
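To make that concrete, here is a minimal sketch with FlashText (the keyword and sentence are just the illustration above, not real data): because FlashText treats "XXXX YYYY" as one contiguous phrase, it finds nothing when the two words are separated, which is exactly the case I need to handle.

    from flashtext import KeywordProcessor

    kp = KeywordProcessor()
    kp.add_keyword("XXXX YYYY")

    # The phrase is not contiguous in the sentence, so FlashText reports no match,
    # even though both words are present.
    print(kp.extract_keywords("XXXX is a very good sign of YYYY"))   # -> []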

I know how to do this naively. What I'm looking for is an efficient approach, or better libraries for this.

2 answers

What you are asking for is known as a full-text search task. There is a Python search package for this called whoosh. @derek's corpus can be indexed and searched in memory as shown below.

    from whoosh.filedb.filestore import RamStorage
    from whoosh.qparser import QueryParser
    from whoosh import fields

    texts = [
        "Here a sentence with dog and apple in it",
        "Here a sentence with dog and poodle in it",
        "Here a sentence with poodle and apple in it",
        "Here a dog with and apple and a poodle in it",
        "Here an apple with a dog to show that order is irrelevant"
    ]

    # Define a schema with a single text field and build an in-memory index.
    schema = fields.Schema(text=fields.TEXT(stored=True))
    storage = RamStorage()
    index = storage.create_index(schema)

    # Add every document to the index.
    writer = index.writer()
    for t in texts:
        writer.add_document(text=t)
    writer.commit()

    # The default query parser ANDs the terms, so 'dog apple' matches documents
    # containing both words, in any order and at any position.
    query = QueryParser('text', schema).parse('dog apple')
    results = index.searcher().search(query)

    for r in results:
        print(r)

This gives:

    <Hit {'text': "Here a sentence with dog and apple in it"}>
    <Hit {'text': "Here a dog with and apple and a poodle in it"}>
    <Hit {'text': "Here an apple with a dog to show that order is irrelevant"}>

You can also persist your index using FileStorage, as described in the whoosh documentation under "How to index documents".
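For completeness, here is a minimal sketch of persisting the same kind of index to disk with FileStorage; the directory name "indexdir" is just an example, not something from the answer above.

    import os
    from whoosh.filedb.filestore import FileStorage
    from whoosh import fields

    schema = fields.Schema(text=fields.TEXT(stored=True))

    # Create the index directory on first use, then open file-backed storage.
    os.makedirs("indexdir", exist_ok=True)
    storage = FileStorage("indexdir")
    index = storage.create_index(schema)   # use storage.open_index() on later runs

    writer = index.writer()
    writer.add_document(text="Here a sentence with dog and apple in it")
    writer.commit()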


This falls into the "naive" camp, but here is a method that uses sets, as food for thought:

    docs = [
        "Here a sentence with dog and apple in it",
        "Here a sentence with dog and poodle in it",
        "Here a sentence with poodle and apple in it",
        "Here a dog with and apple and a poodle in it",
        "Here an apple with a dog to show that order is irrelevant"
    ]

    query = ['dog', 'apple']

    def get_similar(query, docs):
        res = []
        query_set = set(query)
        for doc in docs:
            # if every term of the query appears among the words of doc, keep doc
            if query_set & set(doc.split(" ")) == query_set:
                res.append(doc)
        return res

Calling get_similar(query, docs) returns:

  ["Here a sentence with dog and apple in it", 
 "Here a dog with and apple and a poodle in it", 
 "Here an apple with a dog to show that order is irrelevant"]

Of course, the time complexity here is nothing special, but it is much, much faster than scanning lists, because set membership checks are hash-based.
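If the documents are fixed and the keyword list is huge, one obvious refinement (not part of the answer above, just a sketch along the same lines) is to tokenise each document into a set once and reuse it for every keyword:

    docs = [
        "Here a sentence with dog and apple in it",
        "Here an apple with a dog to show that order is irrelevant"
    ]
    keywords = ["dog apple", "poodle apple"]   # multi-word keywords, order-insensitive

    # Tokenise every document once; each later keyword check is then only a few
    # hash lookups instead of re-splitting the document for every keyword.
    doc_sets = [(d, set(d.split())) for d in docs]

    for kw in keywords:
        kw_set = set(kw.split())
        matches = [d for d, words in doc_sets if kw_set <= words]
        print(kw, "->", matches)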


As a second suggestion, Elasticsearch is a great candidate for this if you are willing to put in the setup effort and you are dealing with a lot of data.
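As a rough illustration only (assuming a local Elasticsearch instance, an index name "docs", and the 8.x Python client, none of which come from the answer itself), a match query with operator "and" returns documents that contain all query terms in any order:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")   # assumed local instance

    # Index a couple of example documents (the index name "docs" is arbitrary).
    es.index(index="docs", document={"text": "Here a sentence with dog and apple in it"})
    es.index(index="docs", document={"text": "Here a sentence with dog and poodle in it"})
    es.indices.refresh(index="docs")

    # operator "and" requires every term to be present; position and order do not matter.
    resp = es.search(index="docs", query={
        "match": {"text": {"query": "dog apple", "operator": "and"}}
    })
    for hit in resp["hits"]["hits"]:
        print(hit["_source"]["text"])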


Source: https://habr.com/ru/post/1274706/

