Regex replacement takes a long time for millions of documents; how can I make it faster?

I have documents like:

documents = [
    "I work on c programing.",
    "I work on c coding.",
]

I have a synonym file, for example:

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing"
}

I want to replace all synonyms for which I wrote this code:

import re

# Pre-compile all regexes to save compilation time (credit: alec_djinn)
compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for doc in documents:
    document = doc
    for value in compiled_dict:
        pattern = compiled_dict[value]
        document = pattern.sub(synonyms[value], document)
    print(document)

Output:

I work on c programing.
I work on c programing.

But since the number of documents is several million, and the number of synonyms is 10 thousand, the expected time to complete this code is 10 days.

Is there a faster way to do this?

PS: I want to use the output to train a word2vec model.

Any help is appreciated. I was thinking of writing some cpython code and putting it in parallel threads.

+4
4 answers

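For this kind of replacement, for example when preparing text for word2vec (replacing "strings" with their synonyms), a good fit is the Aho-Corasick algorithm, which matches many patterns in a single pass over the text. fsed (written in Python) is a tool that does this.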
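A minimal pure-Python sketch of Aho-Corasick-based replacement, for illustration only. The names build_automaton, find_matches and replace_synonyms are hypothetical helpers, not fsed's API. The sketch assumes that keys start and end with word characters, that the longest match at the earliest position wins, and that each document is rewritten in a single pass (so a replacement is never re-scanned, unlike the loop in the question).

```python
from collections import deque

def build_automaton(keys):
    """Build the trie, failure links and output lists for the given keys."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # failure link of each state
    out = [[]]    # keys that end at each state
    for key in keys:
        state = 0
        for ch in key:
            nxt = goto[state].get(ch)
            if nxt is None:
                nxt = len(goto)
                goto[state][ch] = nxt
                goto.append({})
                fail.append(0)
                out.append([])
            state = nxt
        out[state].append(key)
    # Breadth-first traversal to fill in the failure links.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_matches(text, automaton):
    """Yield (start, end, key) for every occurrence of every key in text."""
    goto, fail, out = automaton
    state = 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for key in out[state]:
            yield i - len(key) + 1, i + 1, key

def is_word_char(ch):
    # Rough equivalent of the regex class \w (an assumption of this sketch).
    return ch.isalnum() or ch == "_"

def replace_synonyms(text, synonyms, automaton):
    """Replace whole-word occurrences of the keys in one pass over text."""
    candidates = []
    for start, end, key in find_matches(text, automaton):
        left_ok = start == 0 or not is_word_char(text[start - 1])
        right_ok = end == len(text) or not is_word_char(text[end])
        if left_ok and right_ok:
            candidates.append((start, end, key))
    # Earliest start first; among equal starts, the longest match first.
    candidates.sort(key=lambda m: (m[0], -(m[1] - m[0])))
    pieces, pos = [], 0
    for start, end, key in candidates:
        if start < pos:
            continue  # overlaps a replacement that was already made
        pieces.append(text[pos:start])
        pieces.append(synonyms[key])
        pos = end
    pieces.append(text[pos:])
    return "".join(pieces)

synonyms = {"c programing": "c programing", "c coding": "c programing"}
automaton = build_automaton(synonyms)  # built once, reused for every document
for doc in ["I work on c programing.", "I work on c coding."]:
    print(replace_synonyms(doc, synonyms, automaton))
```

The automaton is built once from all the synonyms, so the scanning work per document depends on the document length and the number of matches rather than on running one pattern per synonym.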

+12

A suggestion:

  • Split the N documents into chunks of N/x documents and process the chunks in parallel (for example, x = 4 on a machine with 4 cores).
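Splitting the N documents into x chunks and handing each chunk to its own worker could look roughly like the sketch below. It uses processes rather than threads, on the assumption that the work is CPU-bound. The names chunked, process_chunk and run_parallel are hypothetical, and the synonyms are the two from the question.

```python
import re
from multiprocessing import Pool

SYNONYMS = {"c programing": "c programing", "c coding": "c programing"}

# Compiled once per process, at import time.
PATTERNS = [
    (re.compile(r"\b" + re.escape(key) + r"\b"), value)
    for key, value in SYNONYMS.items()
]

def chunked(items, n_chunks):
    """Split items into at most n_chunks consecutive slices of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    """Apply every synonym pattern to every document in one chunk."""
    result = []
    for doc in chunk:
        for pattern, replacement in PATTERNS:
            doc = pattern.sub(replacement, doc)
        result.append(doc)
    return result

def run_parallel(documents, workers=4):
    """Process the chunks in a pool of worker processes, keeping the order."""
    with Pool(workers) as pool:
        parts = pool.map(process_chunk, chunked(documents, workers))
    return [doc for part in parts for doc in part]
```

On platforms that start workers by spawning a fresh interpreter, run_parallel(documents) should be called from under an `if __name__ == "__main__":` guard. Parallelism divides the running time by at most the number of cores, so it combines with, rather than replaces, a faster matching strategy.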
+1

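Pre-compile the regular expressions and store them in a dict, so that they are not recompiled for every document.

Example: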

import re

# Compile each pattern once, outside the document loop
compiled_dict = {}
for value in synonyms:
    compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

for document in documents:
    for value in synonyms:
        pattern = compiled_dict[value]
        document = pattern.sub(synonyms[value], document)
    print(document)
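A variant that goes one step further than one compiled pattern per synonym is to join all keys into a single alternation and look up the replacement in a callback. This is a different technique from the loop above, shown only as a sketch; the names combined and replace_all are hypothetical. It makes one pass per document, so a replacement is not re-scanned by later patterns. Sorting the keys longest first is an assumption made so that a longer synonym is preferred over a shorter one that is its prefix.

```python
import re

synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}

# Longest keys first, so the alternation prefers the longest synonym.
keys = sorted(synonyms, key=len, reverse=True)
combined = re.compile(r"\b(?:" + "|".join(re.escape(k) for k in keys) + r")\b")

def replace_all(document):
    """Replace every synonym in one pass, using the dict as a lookup table."""
    return combined.sub(lambda match: synonyms[match.group(0)], document)

for doc in ["I work on c programing.", "I work on c coding."]:
    print(replace_all(doc))
```

Whether this is faster with 10 thousand alternatives depends on the data, so it is worth timing on a sample of the documents first.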
+1

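If a synonym ends in non-word characters, such as c++, or is followed by punctuation as in "c++.", a trailing \b does not behave as intended. Replace the closing \b with (?!\w).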
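The difference between a closing \b and (?!\w) can be checked with a small experiment. The sample text is made up for illustration.

```python
import re

text = "I use c++ daily and c++x rarely."

with_b = re.compile(r"\b" + re.escape("c++") + r"\b")
with_lookahead = re.compile(r"\b" + re.escape("c++") + r"(?!\w)")

# \b needs a word character on exactly one side, so after "+" it only
# matches when a word character follows: the wrong occurrence is found.
b_matches = [m.span() for m in with_b.finditer(text)]
print(b_matches)   # [(20, 23)], the "c++" inside "c++x"

# (?!\w) only requires that no word character follows.
la_matches = [m.span() for m in with_lookahead.finditer(text)]
print(la_matches)  # [(6, 9)], the standalone "c++"
```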

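Synonyms that are single words can be handled in one go, by looking up each word of the document in the dict. Only the remaining synonyms (for example, those containing spaces) need their own regular expressions.

Code: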

import re

# Get those synonyms that are not single words and turn them into regexes:
# Don't use \b to end a pattern; just require that no \w should follow 
complex_synonyms = [(re.compile(r'\b' + re.escape(key) + r'(?!\w)'), synonyms[key]) for key in synonyms if not re.match(r'[\w+]+$', key)]

for i, document in enumerate(documents):
    # Deal with the easy cases (words) in one go, by checking each word in the document
    document = re.sub(r'[\w+]+', lambda word: synonyms.get(word[0], word[0]), document)
    # Replace the remaining synonyms by using the pre-compiled regular expressions
    for find, repl in complex_synonyms:
        document = find.sub(repl, document)
    # Store the result back into the document
    documents[i] = document
0

Source: https://habr.com/ru/post/1677905/

