Python: regex matching a list of words

I have a Python script with roughly 100 lines of regular expressions, each line matching specific words.

The script clearly uses up to 100% of the CPU every time it runs (I pass it a sentence and it returns the matching words it finds).

I want to combine them into 4 or 5 different compiled regular expressions, such as:

>>> words = ('hello', r'good\-bye', 'red', 'blue')
>>> pattern = re.compile('(' + '|'.join(words) + ')', re.IGNORECASE)

How many words can I safely put into one pattern, and will it make a difference? Right now, if I run a loop over a thousand random sentences, it processes maybe 10 sentences per second. I am trying to increase that speed dramatically, to something like 500 per second (if possible).

Also, is a list like this possible?

>>> words = (r'\d{4,4}\.\d{2,2}\.\d{2,2}', r'\d{2,2}\s\d{2,2}\s\d{4,4}\.')
>>> pattern = re.compile('(' + '|'.join(words) + ')', re.IGNORECASE)
>>> print(pattern.findall("Today is 2010 11 08"))

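Your current approach is essentially O(N*M*L) per sentence (N is the length of the sentence, M is the number of words you are looking for, and L is the length of the longest word). Using a regular expression will not make this meaningfully faster than a plain find.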
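To make the O(N*M*L) estimate concrete, here is a minimal sketch of the two approaches being compared: a loop that calls find once per word, and a single compiled alternation. The word list and the names `WORDS`, `find_naive`, `PATTERN` and `find_regex` are assumptions made up for this illustration, not part of the original script.

```python
import re

# Hypothetical word list, used only for illustration.
WORDS = ("hello", "good-bye", "red", "blue")

def find_naive(sentence, words):
    """Scan the sentence once per word: roughly O(N*M*L) work overall."""
    lowered = sentence.lower()
    return [w for w in words if lowered.find(w.lower()) != -1]

# One compiled alternation. re.escape keeps characters such as '-'
# from being read as regex syntax.
PATTERN = re.compile("|".join(re.escape(w) for w in WORDS), re.IGNORECASE)

def find_regex(sentence):
    """Return every match of the combined pattern, lowercased."""
    return [m.lower() for m in PATTERN.findall(sentence)]
```

Both functions return the same words for simple input, but note the difference in shape: `find_naive` reports each word at most once, in word-list order, while `find_regex` reports every occurrence in the order it appears in the sentence.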

If you only need to find plain words, a Trie is a better approach. A simple implementation:

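TERMINAL = 'TERMINAL'  # Marks the end of a word

def build(*words, trie=None):
    # Start from a fresh root unless an existing trie is passed in to extend.
    # (A mutable default like trie={} would be shared between calls.)
    if trie is None:
        trie = {TERMINAL: False}
    for word in words:
        pt = trie
        for ch in word:
            # Walk down one level, creating the child node if it is missing
            pt = pt.setdefault(ch, {TERMINAL: False})
        pt[TERMINAL] = True
    return trie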

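def find(text, trie):
    results = []
    # Try every start position i and walk the trie as far as the text allows
    for i in range(len(text)):
        pt = trie
        for j in range(i, len(text) + 1):
            # text[i:j] is a complete word if the current node is terminal
            if pt.get(TERMINAL, False):
                results.append(text[i:j])
            if j >= len(text) or text[j] not in pt:
                break
            pt = pt[text[j]]
    return results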

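Build the trie once and reuse it for every sentence. The search then costs O(N*L), regardless of how many words are in the list.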
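The pattern in the question uses re.IGNORECASE, while a trie keyed on raw characters is case-sensitive. One possible way to get the same behaviour is to lowercase both the words and the input, as in the sketch below. The names `END`, `build_lower`, `find_lower`, `TRIE`, `SENTENCES` and `RESULTS` are hypothetical, and the trie here is a compact standalone variant rather than the exact code above.

```python
END = "END"  # hypothetical end-of-word marker; never collides with a 1-char key

def build_lower(words):
    """Build a nested-dict trie from lowercased words."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_lower(text, trie):
    """Return every trie word found in text, ignoring case."""
    text = text.lower()
    found = []
    for start in range(len(text)):
        node = trie
        for pos in range(start, len(text)):
            node = node.get(text[pos])
            if node is None:
                break  # no word in the trie continues with this character
            if END in node:
                found.append(text[start:pos + 1])
    return found

# Build once, then reuse the same trie for every sentence.
TRIE = build_lower(["hello", "good-bye", "red", "blue"])
SENTENCES = ["Hello there", "RED and Blue", "nothing here"]
RESULTS = [find_lower(s, TRIE) for s in SENTENCES]
```

Like the find function above, this matches substrings, not whole words, so "red" would also be reported inside "bored". Matches are returned in lowercase.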


Source: https://habr.com/ru/post/1773751/