1) I would go with state machines. It is impossible to imagine a specialized library right now, but general-purpose PCRE can be used to build automata that efficiently look for a given substring. For the substring "foo" and "bar", you can build the template / (foo) | (bar) /, scan the string and get the "id" number of the substring by repeating the ovector check which group was matched.
RE2 :: FindAndConsume is good if you only need a common score, and not a substring grouping.
Postscript Example of using Boost.Xpressive and loading lines from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
PPS Recently, I managed to create a Ragel machine for a similar task. For a small set of search strings, the โnormalโ DFA will work, buf if you have a larger set of rules, and then with the help of Ragel scanners good results are displayed (here is the answer).
PPPS PCRE has the MARK keyword, which is very useful for this classification of subpatterns ( cf ).
2) Some time ago I wrote for Scala a thing based on Trie in Scala: https://gist.github.com/ArtemGr/6150594 ; Trie.search iterates over a string that is trying to match the current position with the number encoded in Trie. Trie is encoded in a single cache insensitive array, I expect it to be as good as non-JIT DFA.
3) I used boost :: spirit for a substring, but never before it evaluated how it relates to other approaches. Spirit uses some effective structure to match symbols , perhaps the structure can be used on its own without the overhead of the Spirit.
#include <boost/spirit/include/qi.hpp> #include <boost/spirit/include/phoenix.hpp> using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1; using boost::phoenix::ref; static struct exact_t: qi::symbols<char, int> { exact_t() {add ("foo", 1) ("bar", 2);} } exact; int32_t match = -1; qi::rule<const char*, int()> rule = +alnum >> exact [ref (match) = _1]; const char* it = haystack; // Mutable iterator for Spirit. qi::parse (it, haystack + haystackLen, rule);
source share