Fast algorithm for extracting thousands of simple patterns from a lot of text

I want to be able to effectively expose thousands of regular expressions from GBs of text, knowing that most of these regular expressions will be quite simple, for example:

  \ bBarack \ s (Hussein \ s)? Obama \ b
 \ b (John | J \.) \ sBoehner \ b

and etc.

My current idea is to try to extract a long substring from each regular expression, and then use Aho-Corasick to match those substrings and eliminate most of the regular expression, and then match all the remaining regular expressions. Can anyone think of something better?

+4
source share
4 answers

You can use (f) lex to create a DFA that recognizes all literals in parallel. This may seem complicated if there are too many wildcards, but it works up to about 100 literals (for a 4-letter alphabet, perhaps more for natural text). You can disable the default action (ECHO) and only print lines + column numbers of matches.

[I assume grep -F does pretty much the same thing)

%{ /* C code to be copied verbatim */ #include <stdio.h> %} %% "TTGATTCACCAGCGCGTATTGTC" { printf("@%d: %d:%s\n", yylineno, yycolumn, "OMG! the TTGA pattern again" ); } "AGGTATCTGCTTCAATCAGCG" { printf("@%d: %d:%s\n", yylineno, yycolumn, "WTF?!" ); } ... more lines ... [bd-fh-su-z]+ {;} [ \t\r\n]+ {;} . {;} %% int main(void) { /* Call the lexer, then quit. */ yylex(); return 0; } 

A script like the one above can be generated using txt input with awk or any other script language.

+2
source

A slightly smarter implementation than running each regular expression for each file:

 For each regex: load regex into a regex engine assemble a list of regex engines For each byte in the file: insert byte to every regex engine print results if there are matches 

But I do not know about any programs that do this already - you will have to encode it yourself. It also means that you have a ram to maintain the state of the regular expression, and that you do not have any evil regular expressions

0
source

I'm not sure if you press the regular expression size limit, but you could just OR put them together into one giant regular expression:

 ((\bBarack\s(Hussein\s)?Obama\b)|(\b(John|J\.)\sBoehner\b)|(etc)|(etc)) 

If you hit some limit, you can do it with pieces of 100 at a time, or how many of them you can control

0
source

If you need a really fast implementation of a particular case, you can implement the Aho-Corasick algorithm . But in most cases, combining all your regular expressions into one regular expression, as recommended earlier, will not be too bad.

0
source

Source: https://habr.com/ru/post/1388881/


All Articles