Which string search algorithm is suitable for this?

I have a large string, say: "aaaaaaaaaaababbbbbbbbbccccccccccccccdddddddddddddd" (but possibly much longer), and I have a collection of many small strings. I want to count how many times each small string occurs in the big string (overlapping matches count). I only care about speed. KMP seemed good, and Rabin-Karp seemed able to handle several patterns at once, but it was slow.

+6
5 answers

The problem with most string search algorithms is that they take at least O(k) time to report k matches, so if you have a string with a million "a"s and a million small-string queries of "a", then computing all the matches will take on the order of a million million (10^12) iterations!

An alternative linear-time method is as follows:

  • Build a suffix tree of the big string: O(n), where n is len(big string)
  • Precompute the number of suffixes below each node in the suffix tree: O(n)
  • For each small string, find the node in the suffix tree where that small string ends: O(m), where m is len(small string)
  • Add the number of suffixes below that node to the total. (Each suffix corresponds to a different match of the small string in the big string.)

This takes O(n + p) time, where n is the length of the large string and p is the total length of all the small strings.

Code example

As requested, here is some small(ish) sample code in Python that uses this approach:

class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T=''
        self.E={}
        self.nodes=[-1] # 0th node is empty string

    def add(self,s):
        """Adds the input string to the suffix tree.

        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.

        Produces an edge graph keyed by (node,character) that gives (first,last,end).
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin,first,last = 0,len(self.T),len(self.T)-1
        self.T+=s
        nc = len(self.nodes)
        self.nodes += [-1]*(2*len(s))
        T=self.T
        E=self.E
        nodes=self.nodes
        Lm1=len(T)-1
        for last_char_index in range(first,len(T)):
            c=T[last_char_index]
            last_parent_node = -1
            while 1:
                parent_node = origin
                if first>last:
                    if (origin,c) in E:
                        break
                else:
                    key = origin,T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first+span
                    m = T[A+1]
                    if m==c:
                        break
                    # Split the edge at the mismatch, creating a new internal node nc
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc,m] = (A+1,edge_last,edge_end)
                    parent_node = nc
                    nc+=1
                E[parent_node,c] = (last_char_index, Lm1, nc)
                nc+=1
                if last_parent_node>0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin==0:
                    first+=1
                else:
                    origin = nodes[origin]
                # Canonize the active point
                if first <= last:
                    edge_first,edge_last,edge_end = E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_last,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first
            if last_parent_node>0:
                nodes[last_parent_node] = parent_node
            last+=1
            # Canonize again after extending the active length
            if first <= last:
                edge_first,edge_last,edge_end = E[origin,T[first]]
                span = edge_last-edge_first
                while span <= last - first:
                    first+=span+1
                    origin = edge_end
                    if first <= last:
                        edge_first,edge_last,edge_end = E[origin,T[first]]
                        span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in range(len(self.nodes))] # Contains set of choices for each node
        for (origin,c),edge in self.E.items():
            choices[origin].append(c)
        choices=[sorted(s) for s in choices] # should not have any repeats by construction
        self.choices=choices
        return choices

    def count_suffixes(self,term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings assumed to use term as the terminating character"""
        C = self.suffix_counts = [0]*len(self.nodes)
        choices = self.make_choices()
        def f(node=0):
            t=0
            X=choices[node]
            if len(X)==0:
                t+=1 # this node is a leaf node
            else:
                for c in X:
                    if c==term:
                        t+=1
                        continue
                    first,last,end = self.E[node,c]
                    t+=f(end)
            C[node]=t
            return t
        return f()

    def count_matches(self,needle):
        """Return the count of matches for this needle in the suffix tree"""
        i=0
        node=0
        E=self.E
        T=self.T
        while i<len(needle):
            c=needle[i]
            key=node,c
            if key not in E:
                return 0
            first,last,node = E[key]
            while i<len(needle) and first<=last:
                if needle[i]!=T[first]:
                    return 0
                i+=1
                first+=1
        return self.suffix_counts[node]

big="aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings=["a","ab","abc"]
s=SuffixTree()
term=chr(0)
s.add(big+term)
s.count_suffixes(term)

for needle in small_strings:
    x=s.count_matches(needle)
    print(needle,'has',x,'matches')

It prints:

 a has 11 matches
 ab has 1 matches
 abc has 0 matches

However, in practice, I would recommend that you simply use a pre-existing Aho-Corasick implementation, as I expect it to be much faster in your particular case.
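
For instance, a minimal sketch of the counting loop using the third-party pyahocorasick package (my choice of library, not something the question mentions). Note that this still enumerates every match, so it keeps the O(k) behaviour described above; it is just very fast in practice:

import ahocorasick   # pip install pyahocorasick

def count_matches(big, small_strings):
    A = ahocorasick.Automaton()
    for s in set(small_strings):
        A.add_word(s, s)                      # store the pattern itself as the payload
    A.make_automaton()
    counts = dict.fromkeys(small_strings, 0)
    for end_index, pattern in A.iter(big):    # reports every occurrence, overlaps included
        counts[pattern] += 1
    return counts

print(count_matches("aaaaaaaaaaabbbbbbbbbccccccccccc", ["a", "ab", "abc"]))
# -> {'a': 11, 'ab': 1, 'abc': 0}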

+7

Matching against a large collection of strings sounds like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to me. It finds matches one at a time, so Peter de Rivaz's idea could be better if there are a huge number of matches. On the other hand, Aho-Corasick does not need to keep the large string in memory - you can just stream it past - and it is very practical to implement and tune - the Wikipedia link notes that the original fgrep used it.

Thinking about it, you can work around the mega-match problem. Aho-Corasick can be viewed as creating a deterministic finite-state machine capable of recognizing each of the strings it is searching for. The state of the machine corresponds to the last N characters seen. If you want to match two strings and one is a suffix of the other, you need to be careful that when you are in the state that says you have just matched the longer string, you also recognize that this means you have just matched the shorter string. If you deliberately choose not to do this, then the counts you accumulate for the shorter string will be wrong - but you know they are too low by exactly the number of times the longer string was seen. So if you modify Aho-Corasick so that only the longest match at each point is recognized and counted, the cost of matching stays linear in the number of characters in the string you are searching, and you can fix up the counters at the end by going through the long strings and then incrementing the counts for shorter strings that are suffixes of the long strings. This will take time at most linear in the total size of the strings being searched for.
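
One way to sketch that fix-up in Python (a hand-rolled automaton rather than a library; the node layout and helper names are my own, so treat it as an illustration of the idea, not a tuned implementation): during the scan only the counter of the current state - i.e. the longest recognized match - is bumped, and at the end the counts are pushed down the failure (suffix) links so that every pattern that is a suffix of a longer recognized string gets credited too.

from collections import deque

def build_automaton(patterns):
    # Trie: per node a dict of transitions, a failure link and the pattern ending there (if any).
    goto, fail, ends = [{}], [0], [None]
    for p in patterns:
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); ends.append(None)
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        ends[node] = p
    order = []                                  # BFS order: a node's failure link always precedes it
    q = deque(goto[0].values())
    while q:
        node = q.popleft()
        order.append(node)
        for ch, nxt in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f][ch] if ch in goto[f] else 0
            q.append(nxt)
    return goto, fail, ends, order

def count_all(text, patterns):
    goto, fail, ends, order = build_automaton(patterns)
    hits = [0] * len(goto)
    node = 0
    for ch in text:                             # one counter bump per character: longest match only
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        hits[node] += 1
    for node in reversed(order):                # fix-up pass: credit every suffix via the failure links
        hits[fail[node]] += hits[node]
    return {p: hits[n] for n, p in enumerate(ends) if p is not None}

print(count_all("aaaaaaaaaaabbbbbbbbbccccccccccc", ["a", "ab", "abc"]))
# -> {'a': 11, 'ab': 1, 'abc': 0}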

+6

Building on another answer (and hopefully convincing you that it is the best kind of answer), you can look up http://en.wikipedia.org/wiki/Suffix_tree and also follow the links listed there to learn about suffix trees, if you really want the fastest solution for your problem - one that also lets you get the number of matches without iterating over all the matches, and whose running time and memory requirements are the best possible for any substring matching or match counting algorithm. Once you understand how a suffix tree works and how to build and use one, the only extra tweak you need is to store, at each internal node of the tree, the number of distinct starting positions of the string that pass through that node - a small modification that you can do easily and efficiently (in linear time, as already claimed) by recursively getting the counts from the child nodes and adding them up to get the count at the current node. These counts then let you count substring matches without iterating over all of them.
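
Just to make the "count without enumerating" idea concrete, here is a toy sketch that uses a naively built sorted suffix list and binary search instead of the linear-time suffix tree this answer describes (the quadratic construction and the sentinel character are my shortcuts, purely for illustration):

import bisect

text = "aaaaaaaaaaabbbbbbbbbccccccccccc"
suffixes = sorted(text[i:] for i in range(len(text)))    # toy O(n^2 log n) build; a suffix tree/array does better

def count(pattern):
    # All suffixes that start with `pattern` form one contiguous block in sorted order,
    # so two binary searches give the number of occurrences without visiting each match.
    lo = bisect.bisect_left(suffixes, pattern)
    hi = bisect.bisect_right(suffixes, pattern + "\U0010FFFF")   # assumes this char is larger than any in text
    return hi - lo

for p in ["a", "ab", "abc"]:
    print(p, count(p))       # -> a 11, ab 1, abc 0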

0

1) I would go with state machines. I cannot think of a specialized library right now, but general-purpose PCRE can be used to build an automaton that efficiently searches for the given substrings. For the substrings "foo" and "bar" you can build the pattern /(foo)|(bar)/, scan the string, and get the "id" number of the matched substring by iterating over the ovector and checking which group matched (a rough Python sketch of this alternation idea appears after the Spirit snippet below).
RE2::FindAndConsume is good if you only need the total count and not grouping by substring.
P.S. An example of using Boost.Xpressive and loading the strings from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
P.P.S. Recently I had a good time building a Ragel machine for a similar task. For a small set of search strings a "normal" DFA will work, but if you have a larger rule set, then Ragel scanners show better results (here is the answer).
P.P.P.S. PCRE has the MARK keyword, which is very useful for this kind of subpattern classification (cf.).

2) Some time ago I wrote a Trie-based thing in Scala: https://gist.github.com/ArtemGr/6150594 ; Trie.search walks over the string, trying to match the current position against a number encoded in the Trie. The Trie is encoded in a single cache-friendly array; I expect it to be as good as a non-JIT DFA.

3) I have used boost::spirit for substring matching, but never got around to measuring how it compares with other approaches. Spirit uses an efficient structure for matching the symbols (qi::symbols); perhaps that structure could be used on its own, without the overhead of Spirit.

 #include <boost/spirit/include/qi.hpp>
 #include <boost/spirit/include/phoenix.hpp>

 namespace qi = boost::spirit::qi;
 using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1;
 using boost::phoenix::ref;

 static struct exact_t: qi::symbols<char, int> {
   exact_t() {add ("foo", 1) ("bar", 2);}
 } exact;

 // `haystack` and `haystackLen` are assumed to point at the text being searched.
 int32_t match = -1;
 qi::rule<const char*, int()> rule = +alnum >> exact [ref (match) = _1];
 const char* it = haystack; // Mutable iterator for Spirit.
 qi::parse (it, haystack + haystackLen, rule);
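
And, as referenced in 1), here is a rough Python equivalent of the alternation idea (Python's re stands in for PCRE, which is my substitution; the lookahead makes the matches zero-width so overlapping occurrences are still counted):

import re
from collections import Counter

small_strings = ["foo", "bar"]
# One capture group per small string, wrapped in a lookahead so matches may overlap.
pattern = re.compile("(?=" + "|".join("(%s)" % re.escape(s) for s in small_strings) + ")")

def count(big):
    counts = Counter()
    for m in pattern.finditer(big):
        counts[small_strings[m.lastindex - 1]] += 1    # lastindex says which group matched
    return counts

print(count("foobarfoofoo"))   # -> Counter({'foo': 3, 'bar': 1})

Note that at each position only the first alternative that matches is counted, so if one small string is a prefix of another you would still need a fix-up pass like the one described in the Aho-Corasick answer above.
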
0

If I understood correctly, your input string consists of blocks of repeated single characters (runs).

In this case, you can compress your text using run-length encoding.

For instance:

 s = aaabbbbcc 

encoded as:

 encoded_s = (a3)(b4)(c2) 

Now you can search for patterns in encoded text.

If you need a specific algorithm, just do a web search for pattern matching in run-length encoded strings. You can achieve a time complexity of O(N + M), where N and M are the lengths of the compressed text and the compressed pattern. Both N and M are generally much shorter than the original lengths, so this beats any standard pattern matching algorithm such as KMP.
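
A simplistic sketch of matching over the runs (my own O(N*M) comparison, not the O(N + M) algorithms that web search would turn up; the helper names are mine): the pattern's first and last runs only have to fit inside the corresponding text runs, while its middle runs must match exactly.

from itertools import groupby

def rle(s):
    return [(ch, len(list(grp))) for ch, grp in groupby(s)]   # "aaabbbbcc" -> [('a',3),('b',4),('c',2)]

def count_occurrences(text, pattern):
    t, p = rle(text), rle(pattern)
    if len(p) == 1:                      # single-run pattern: count overlaps inside each same-char run
        ch, plen = p[0]
        return sum(tlen - plen + 1 for tch, tlen in t if tch == ch and tlen >= plen)
    total = 0
    for i in range(len(t) - len(p) + 1):
        first_ok = t[i][0] == p[0][0] and t[i][1] >= p[0][1]
        last_ok = t[i + len(p) - 1][0] == p[-1][0] and t[i + len(p) - 1][1] >= p[-1][1]
        middle_ok = t[i + 1:i + len(p) - 1] == p[1:-1]
        if first_ok and middle_ok and last_ok:
            total += 1                   # the start position is forced, so exactly one match per alignment
    return total

s = "aaabbbbcc"
for pat in ["ab", "aabb", "b", "bc"]:
    print(pat, count_occurrences(s, pat))   # -> ab 1, aabb 1, b 4, bc 1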

0
