Find many strings in a text - Python

I am looking for the best algorithm for this problem: given a list (or dict, or set) of short sentences, find all occurrences of these sentences in a larger text. The list holds about 600 thousand sentences, averaging 3 words each. The text averages 25 words. I have already normalized the text (removed punctuation, lowercased everything, and so on).

Here is what I tried (Python):

    to_find_sentences = [
        'bla bla',
        'have a tea',
        'hy im luca',
        'i love android',
        'i love ios',
        .....
    ]

    text = 'i love android and i think i will have a tea with john'

    def find_sentence(to_find_sentences, text):
        text = text.split()
        res = []
        w = len(text)
        # Try every contiguous run of words text[i:j]
        for i in range(w):
            for j in range(i+1, w+1):
                tmp = ' '.join(text[i:j])
                if tmp in to_find_sentences:
                    res.append(tmp)
        return res

    print(find_sentence(to_find_sentences, text))

Output:

 ['i love android', 'have a tea'] 

In my actual code, I used a set to speed up the "in" lookup.
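For reference, the brute-force idea above can be bounded so that it only joins windows no longer than the longest sentence in the set. This is a minimal sketch, not part of the original post. The name `find_sentences` and the `max_words` parameter are illustrative assumptions.

```python
def find_sentences(to_find, text, max_words=None):
    """Return every phrase in `to_find` that occurs in `text` as a run of whole words."""
    to_find = set(to_find)  # set membership tests are O(1) on average
    if max_words is None:
        # Longest phrase, in words; longer windows can never match.
        max_words = max((len(s.split()) for s in to_find), default=0)
    words = text.split()
    found = []
    for i in range(len(words)):
        # Only consider windows of at most `max_words` words starting at i.
        for j in range(i + 1, min(i + max_words, len(words)) + 1):
            candidate = ' '.join(words[i:j])
            if candidate in to_find:
                found.append(candidate)
    return found


phrases = {'bla bla', 'have a tea', 'hy im luca', 'i love android', 'i love ios'}
text = 'i love android and i think i will have a tea with john'
print(find_sentences(phrases, text))  # ['i love android', 'have a tea']
```

With 3-word sentences and 25-word texts, this checks at most about 75 candidates per text instead of every possible word range.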

1 answer

A fast solution would be to build a trie from your sentences and convert this trie into a regular expression. For your example, the pattern would look like this:

    (?:bla\ bla|h(?:ave\ a\ tea|y\ im\ luca)|i\ love\ (?:android|ios))

[Figure: the pattern above visualized on Debuggex]

It might be a good idea to add '\b' as word boundaries on both sides, to avoid matching "have a team".
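A small illustration of the difference, using a hand-written alternation as a stand-in for the generated pattern (the variable names are illustrative only):

```python
import re

# Hand-written stand-in for a trie-generated alternation.
alternation = r"(?:have\ a\ tea|i\ love\ android)"
text = "we have a team"

unanchored = re.compile(alternation)
anchored = re.compile(r"\b" + alternation + r"\b")

# Without boundaries, the phrase matches inside "have a team".
print(unanchored.findall(text))  # ['have a tea']

# With \b on both sides, the match must end at a word boundary.
print(anchored.findall(text))    # []
```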

You will need a small Trie script. It is not an official package yet, but you can simply download it and save it as trie.py in your current directory.
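To give an idea of what such a script does, here is a minimal sketch of a character trie that can render itself as a regular expression. It is not the script referred to above. It is a hypothetical stand-in that only assumes the same two method names, `add` and `pattern`, and the helper `_to_regex` is made up for illustration.

```python
import re


class Trie:
    """Character trie that can be rendered as a single regular expression."""

    def __init__(self):
        self.root = {}

    def add(self, phrase):
        node = self.root
        for char in phrase:
            node = node.setdefault(char, {})
        node[''] = True  # the empty key marks the end of a complete phrase

    def _to_regex(self, node):
        # A node holding only the end marker contributes nothing further.
        if len(node) == 1 and '' in node:
            return ''
        # A phrase ends here, but longer phrases continue past this node.
        optional = '' in node
        branches = [
            re.escape(char) + self._to_regex(node[char])
            for char in sorted(key for key in node if key)
        ]
        if len(branches) == 1 and not optional:
            return branches[0]  # no grouping needed for a single path
        group = '(?:' + '|'.join(branches) + ')'
        return group + '?' if optional else group

    def pattern(self):
        return self._to_regex(self.root)


trie = Trie()
for phrase in ['bla bla', 'have a tea', 'hy im luca', 'i love android', 'i love ios']:
    trie.add(phrase)

print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ im\ luca)|i\ love\ (?:android|ios))
```

Shared prefixes are factored out, so the regex engine reads each prefix once instead of trying 600 thousand alternatives one by one.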

You can then use this code to generate the trie and the regex:

    import re
    from trie import Trie

    to_find_sentences = [
        'bla bla',
        'have a tea',
        'hy im luca',
        'i love android',
        'i love ios',
    ]

    # Build the trie once from all sentences
    trie = Trie()
    for sentence in to_find_sentences:
        trie.add(sentence)

    print(trie.pattern())
    # (?:bla\ bla|h(?:ave\ a\ tea|y\ im\ luca)|i\ love\ (?:android|ios))

    # Compile once, with word boundaries on both sides
    pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
    text = 'i love android and i think i will have a tea with john'

    print(re.findall(pattern, text))
    # ['i love android', 'have a tea']

It takes some time to build the trie and compile the regex, but processing each text should then be very fast.

Here's a related answer (Speed up millions of regex replacements in Python 3) if you need more information.

Note that it will not find overlapping sentences:

    to_find_sentences = [
        'i love android',
        'android Marshmallow'
    ]
    # ...
    print(re.findall(pattern, "I love android Marshmallow"))
    # ['I love android']

You would need to modify the regex with a positive lookahead in order to find overlapping sentences.
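One possible way to do this, shown as a sketch with a hand-written alternation instead of the trie output: wrap the alternation in a capturing group inside a zero-width lookahead, so that each match consumes no characters and the next search can start inside the previous phrase.

```python
import re

# Hand-written stand-in for the trie-generated alternation.
alternation = r"(?:android\ marshmallow|i\ love\ android)"

# The lookahead itself matches the empty string; the phrase is captured in group 1.
overlapping = re.compile(r"\b(?=(" + alternation + r")\b)", re.IGNORECASE)

text = "I love android Marshmallow"
print(overlapping.findall(text))
# ['I love android', 'android Marshmallow']
```

This still reports at most one phrase per starting position, so if one sentence is a prefix of another at the same position, only one of the two is returned.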


Source: https://habr.com/ru/post/1267173/

