Algorithm for checking multiple substrings in multiple lines

I have several million lines, X, each of which contains fewer than 20 words. I also have a list, C, of several thousand strings. For each x in X, I want to know whether any string in C is contained in x. Right now I am using a naive double for-loop, but it has been running for a while and still is not finished... Any suggestions? I use Python, so a pointer to a good implementation would be welcome, but links for any language, or general algorithms, would be good too.
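For reference, the naive double for-loop described above presumably looks roughly like this (a sketch, assuming X holds the lines and C the candidate strings; Python's `in` does one substring search per pair, so the total cost is on the order of |X| * |C| searches):

# Naive approach: test every candidate string against every line.
hits = []
for x in X:
    for c in C:
        if c in x:       # substring test
            hits.append(x)
            break        # no need to try the remaining candidates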

+3
5 answers

Encode one of your string sets as a trie (I suggest encoding the larger set). Search time should be faster than with an imperfect hash, and you will also save some memory.
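A minimal sketch of the trie idea (my own illustration, not the answerer's code; it assumes C and X from the question). The trie is built over the strings in C, and each line is scanned by attempting a match from every starting position:

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker
    return root

def contains_any(line, trie):
    # Try a match starting at every position in the line.
    for i in range(len(line)):
        node = trie
        for ch in line[i:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:  # a complete string from C ends here
                return True
    return False

trie = build_trie(C)
matching = [x for x in X if contains_any(x, trie)]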

+4

This will take a long time. You must check each of those several million lines against each of those several thousand substrings, which means performing (several million * several thousand) string comparisons. Yes, it will take a while.

If this is something you are going to do only once or infrequently, I would suggest using fgrep. If this is something you are going to do often, then you want to look into an implementation of something like the Aho-Corasick string matching algorithm.
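For the one-off case, `fgrep -f patterns.txt lines.txt` reads the fixed strings to search for from a file. For the repeated case, here is a sketch using the third-party pyahocorasick package (my choice of library, not the answerer's; any Aho-Corasick implementation would do), again assuming C and X from the question:

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for i, c in enumerate(C):
    automaton.add_word(c, (i, c))
automaton.make_automaton()  # build the failure links

for x in X:
    # iter() yields every occurrence; one hit is enough to flag the line
    if next(automaton.iter(x), None) is not None:
        print('found a match in %r' % x)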

+1

If the strings in C are whole words, i.e. delimited by spaces in x, you can put them in a set.

Set membership is O(1) on average, so checking one line x costs O(n), where n is the number of words in x.

For example:

keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split():      # split the line into words
        if w in keywords:    # O(1) average-case set lookup
            pass  # do what you need to do

Also, Google's re2 library compiles regular expressions to automata that match in linear time, which may be worth a look. (http://code.google.com/p/re2/)

EDIT: this assumes the strings in C are whole words. If they can be arbitrary substrings, this approach won't work, and you will need something like the other answers suggest.

0

import re

# Compile one alternation of all candidate strings; re.escape guards against
# regex metacharacters in C.
subs = re.compile('|'.join(re.escape(c) for c in C))
for x in X:
    if subs.search(x):
        print('found')
0

Source: https://habr.com/ru/post/1792036/

