Effective data structure for substring search?

Question

Effective data structure for substring search?

Suppose I have a set of strings S and a query string q. I want to know if any member of S is a substring of q. (For the purposes of this question, a substring includes equality, for example, “foo” - a substring of “foo”.) For example, suppose a function that does what I need is called anySubstring :

 S = ["foo", "baz"] q = "foobar" assert anySubstring(S, q) # "foo" is a substring of "foobar" S = ["waldo", "baz"] assert not anySubstring(S, q)

Is there any easy-to-implement algorithm for this with sublinear time complexity in len(S) ? This is normal if S must first be processed in some smart data structure, because I will query each S with a large number of rows q, so the amortized cost of this preprocessing can be reasonable.

EDIT: To clarify, I don't care which member S is a substring of q, only if at least one of them. In other words, I only care about the logical answer.

+6

string algorithm data-structures

dsimcha Mar 09 '12 at 15:13

source share

3 answers

aelguindy · Answer 1 · 2012-03-09T16:15:24+0000

I think the Aho-Corasick algorithm does what you want. I think that there is another solution that is very simple to implement, it is the Karp-Rabin algorithm .

Ivaylo strandjev · Answer 2 · 2012-03-09T15:25:43+0000

So, if the length of S is less than the sum of the long potential substrings, the best option would be to build a tree of suffixes from S, and then search in it. This is linear with respect to the length S plus the total length of the candidate substrings. Of course, there cannot be an algorithm with better complexity, since you have to go through all the input at least. If the opposite is the case, then the length s is longer than the total length of the substrings, aho-corasick is the best option.

Hope this helps.

Dimple kapadia · Answer 3 · 2012-03-09T22:23:44+0000

Create a regular expression .*(S1|S2|...|Sn).* And build its minimum DFA.

Run the query string through DFA.

Effective data structure for substring search?

More articles: