Algorithm for checking multiple substrings in multiple lines

I have several million lines, X, each of which contains fewer than 20 words. I also have a list, C, of several thousand strings. For each x in X, I want to know whether any string in C is contained in x. Right now I am using a naive double for-loop, but it has been running for a while and still is not finished... Any suggestions? I use Python, so a pointer to a good implementation would be welcome, but links for any language, or general algorithms, would be good too.
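For reference, the naive double for-loop described above presumably looks roughly like this (a sketch, assuming X holds the lines and C the candidate strings; Python's `in` does one substring search per pair, so the total cost is on the order of |X| * |C| searches):

# Naive approach: test every candidate string against every line.
hits = []
for x in X:
    for c in C:
        if c in x:       # substring test
            hits.append(x)
            break        # no need to try the remaining candidates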

+3
5 answers

Encode one of your string sets as a trie (I suggest encoding the larger set). Search time should be faster than with an imperfect hash, and you will also save some memory.
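A minimal sketch of the trie idea (my own illustration, not the answerer's code; it assumes C and X from the question). The trie is built over the strings in C, and each line is scanned by attempting a match from every starting position:

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['$'] = True  # end-of-word marker
    return root

def contains_any(line, trie):
    # Try a match starting at every position in the line.
    for i in range(len(line)):
        node = trie
        for ch in line[i:]:
            if ch not in node:
                break
            node = node[ch]
            if '$' in node:  # a complete string from C ends here
                return True
    return False

trie = build_trie(C)
matching = [x for x in X if contains_any(x, trie)]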

+4

This will take a long time. You must check each of those several million lines against each of those several thousand substrings, which means performing (several million * several thousand) string comparisons. Yes, it will take a while.

If this is something you are going to do only once or infrequently, I would suggest using fgrep. If this is something you are going to do often, then you want to look into an implementation of something like the Aho-Corasick string matching algorithm.
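For the one-off case, `fgrep -f patterns.txt lines.txt` reads the fixed strings to search for from a file. For the repeated case, here is a sketch using the third-party pyahocorasick package (my choice of library, not the answerer's; any Aho-Corasick implementation would do), again assuming C and X from the question:

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for i, c in enumerate(C):
    automaton.add_word(c, (i, c))
automaton.make_automaton()  # build the failure links

for x in X:
    # iter() yields every occurrence; one hit is enough to flag the line
    if next(automaton.iter(x), None) is not None:
        print('found a match in %r' % x)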

+1

If the strings in C are whole words, i.e. delimited by spaces in x, you can put them in a set.

Set membership is O(1) on average, so checking one line x costs O(n), where n is the number of words in x.

For example:

keywords = set(['bla', 'fubar'])
for x in X:
    for w in x.split():      # split the line into words
        if w in keywords:    # O(1) average-case set lookup
            pass  # do what you need to do

Also, Google's re2 library compiles regular expressions to automata that match in linear time, which may be worth a look. (http://code.google.com/p/re2/)

EDIT: this assumes the strings in C are whole words. If they can be arbitrary substrings, this approach won't work, and you will need something like the other answers suggest.

0

import re

# Compile one alternation of all candidate strings; re.escape guards against
# regex metacharacters in C.
subs = re.compile('|'.join(re.escape(c) for c in C))
for x in X:
    if subs.search(x):
        print('found')
0

Source: https://habr.com/ru/post/1792036/

