Regex replacement takes time for millions of documents, how to do it faster?

Question

I have documents like:

documents = [
    "I work on c programing.",
    "I work on c coding.",
]

I have a synonym file, for example:

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing"
}

I want to replace all synonyms for which I wrote this code:

# Pre-compile every regex once to save compilation time (credit: alec_djinn).
import re

compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for doc in documents:
    document = doc
    for value in compiled_dict:
        lowercase = compiled_dict[value]
        document = lowercase.sub(synonyms[value], document)
    print(document)

Output:

I work on c programing.
I work on c programing.

But with several million documents and about 10 thousand synonyms, this code is expected to take 10 days to finish.

Is there a faster way to do this?

PS: I want to use the output to train a word2vec model.

Any help is appreciated. I was thinking of writing some cpython code and putting it in parallel threads.

+4

python cpython parallel-processing word2vec

Vikash Singh May 25 '17 at 10:38

4 answers

Split the N documents into batches of N/x (for example, x = 4 for 4 parallel workers) and process the batches in parallel.
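A minimal sketch of processing the documents in parallel batches, assuming the standard-library multiprocessing module and the synonyms dict from the question. The function names, the worker count, and the chunk size are made up for illustration and would need tuning. Processes are used rather than threads because regex substitution is CPU-bound and CPython threads would not run it in parallel.

```python
import re
from multiprocessing import Pool

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}

# Compile once at import time so every worker process has the patterns ready.
compiled = [
    (re.compile(r'\b' + re.escape(key) + r'\b'), replacement)
    for key, replacement in synonyms.items()
]

def replace_synonyms(document):
    """Apply every synonym replacement to a single document."""
    for pattern, replacement in compiled:
        document = pattern.sub(replacement, document)
    return document

def replace_parallel(documents, workers=4, chunksize=1000):
    """Hand the documents to `workers` processes in batches of `chunksize`."""
    with Pool(processes=workers) as pool:
        return pool.map(replace_synonyms, documents, chunksize=chunksize)

if __name__ == "__main__":
    documents = [
        "I work on c programing.",
        "I work on c coding.",
    ]
    for doc in replace_parallel(documents):
        print(doc)
```

This only divides the running time by roughly the number of cores; each document is still scanned once per synonym.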

+1

Ovidiu Dolha May 25 '17 at 10:43

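Start by building a dict of compiled regexes, then iterate over it for each document. That should be faster.

Something like: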

import re

# Compile each synonym's regex once, outside the document loop
compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for document in documents:
    for value, regex in compiled_dict.items():
        document = regex.sub(synonyms[value], document)
    print(document)
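A related variation, not part of this answer: instead of one pass per synonym, all synonyms can be joined into a single alternation so that each document is scanned once and the replacement is looked up in the dict. This is only a sketch; build_replacer is a made-up name, and it assumes a non-empty synonyms dict. How much it helps with 10 thousand alternatives would have to be measured.

```python
import re

def build_replacer(synonyms):
    """Return a function that replaces every synonym in one pass.

    Assumes `synonyms` is non-empty.
    """
    # Longest keys first, so a longer synonym wins over one that is its prefix.
    keys = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(key) for key in keys) + r')\b'
    )

    def replace(document):
        # The matched text is always one of the keys, so the lookup is safe.
        return pattern.sub(lambda match: synonyms[match.group(0)], document)

    return replace

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}
replace = build_replacer(synonyms)
for doc in ["I work on c programing.", "I work on c coding."]:
    print(replace(doc))
```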

+1

alec_djinn May 25 '17 at 12:02

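Note that your patterns fail for synonyms that end in a non-word character, such as c++: in "c++." there is no word boundary after the last "+", so the trailing \b does not match. Replace the trailing \b with (?!\w).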
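A quick check of the difference between a trailing \b and (?!\w), using a made-up sentence:

```python
import re

text = "I like c++."

# Trailing \b: needs a word character on exactly one side of the position.
# Both "+" and "." are non-word characters, so there is no boundary here.
with_b = re.compile(r'\b' + re.escape('c++') + r'\b')

# Trailing (?!\w): only requires that no word character follows.
with_lookahead = re.compile(r'\b' + re.escape('c++') + r'(?!\w)')

print(with_b.search(text))                   # None
print(with_lookahead.search(text).group(0))  # c++
```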

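Synonyms that are single words do not need a regex each: match every word in the document once and look it up in the synonyms dict.

The remaining synonyms (those that are not single words) still need one regex each.

For example: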

import re

# Get those synonyms that are not single words and turn them into regexes:
# Don't use \b to end a pattern; just require that no \w should follow 
complex_synonyms = [(r'\b' + re.escape(key) + r'(?!\w)', synonyms[key]) for key in synonyms if not re.match(r'[\w+]+$', key)]

for i, document in enumerate(documents):
    # Deal with the easy cases (words) in one go, by checking each word in the document
    document = re.sub(r'[\w+]+', lambda word: synonyms[word[0]] if word[0] in synonyms else word[0], document)
    # Replace the remaining synonyms by using regular expressions
    for find, repl in complex_synonyms:
        document = re.sub(find, repl, document)
    # Store the result back into the document
    documents[i] = document

0

trincot May 25 '17 at 13:00

wildwilhelm · Accepted Answer · 2017-05-25T12:51:19+0000

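I have done string replacement like this before, also to train word2vec models. With this many search terms, the Aho-Corasick algorithm is a better fit than regexes, because it finds all the terms in a single pass over the text. fsed (written in Python) does this.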
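To show the idea, here is a minimal pure-Python sketch of Aho-Corasick replacement. It is not fsed's code or API; the function names are made up, and the word-boundary check (no letter, digit, or underscore directly before or after a match) is an assumption meant to approximate the \b in the question. Overlapping matches are resolved leftmost first, then longest.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie (goto), failure links, and outputs for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto[state][ch] = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: a state's failure link points to the longest
    # proper suffix of its path that is also a path in the trie.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_all(text, automaton):
    """Yield (start, end, pattern) for every occurrence, in one pass."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            yield i - len(pattern) + 1, i + 1, pattern

def is_word_char(ch):
    return ch.isalnum() or ch == '_'

def replace_all(text, synonyms, automaton):
    """Replace whole-word occurrences of the synonym keys in `text`."""
    matches = [
        (start, end, pattern)
        for start, end, pattern in find_all(text, automaton)
        if (start == 0 or not is_word_char(text[start - 1]))
        and (end == len(text) or not is_word_char(text[end]))
    ]
    matches.sort(key=lambda m: (m[0], m[0] - m[1]))  # leftmost, then longest
    pieces, pos = [], 0
    for start, end, pattern in matches:
        if start < pos:
            continue  # overlaps a match that was already replaced
        pieces.append(text[pos:start])
        pieces.append(synonyms[pattern])
        pos = end
    pieces.append(text[pos:])
    return ''.join(pieces)

synonyms = {"c programing": "c programing", "c coding": "c programing"}
automaton = build_automaton(synonyms)
for doc in ["I work on c programing.", "I work on c coding."]:
    print(replace_all(doc, synonyms, automaton))
```

The automaton is built once for all synonyms, and the scanning step then visits each character of a document once, however many synonyms there are. A C-backed implementation would be much faster than this sketch.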
