I have documents like:
    documents = [
        "I work on c programing.",
        "I work on c coding.",
    ]
I have a synonym file, for example:
    synonyms = {
        "c programing": "c programing",
        "c coding": "c programing"
    }
I want to replace every synonym with its canonical form, so I wrote this code:
    import re

    # Compile one word-bounded pattern per synonym key.
    compiled_dict = {}
    for value in synonyms:
        compiled_dict[value] = re.compile(r'\b' + re.escape(value) + r'\b')

    for doc in documents:
        document = doc
        # Apply every pattern to the document in turn.
        for value in compiled_dict:
            pattern = compiled_dict[value]
            document = pattern.sub(synonyms[value], document)
        print(document)
Output:
I work on c programing.
I work on c programing.
However, I have several million documents and about 10 thousand synonyms, so this code is expected to take about 10 days to finish.
Is there a faster way to do this?
PS: I want to use the output to train a word2vec model.
Any help is appreciated. I was thinking of writing some cpython code and putting it in parallel threads.
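For reference, one direction that might be worth trying before moving to compiled code is sketched below: instead of running 10 thousand separate patterns over each document, all synonym keys are joined into a single alternation so each document is scanned once. This is only a sketch under assumptions, not a measured solution. The names `replacements`, `pattern`, `replace_synonyms` and `results` are made up for illustration, and the approach assumes replacements never need to chain (the text produced by one replacement is not matched again by another synonym), which differs from the sequential loop above.

```python
import re

documents = [
    "I work on c programing.",
    "I work on c coding.",
]
synonyms = {
    "c programing": "c programing",
    "c coding": "c programing",
}

# Keys that map to themselves change nothing, so they are skipped.
replacements = {k: v for k, v in synonyms.items() if k != v}

# Longest keys first, so a longer phrase wins over a shorter one
# that starts at the same position.
alternation = "|".join(
    re.escape(k) for k in sorted(replacements, key=len, reverse=True)
)
pattern = re.compile(r"\b(?:" + alternation + r")\b")


def replace_synonyms(text):
    """Replace every synonym in `text` in one pass over the string."""
    if not replacements:
        return text  # nothing to replace; avoid matching the empty pattern
    # The callback looks up the canonical form of the matched phrase.
    return pattern.sub(lambda m: replacements[m.group(0)], text)


results = [replace_synonyms(doc) for doc in documents]
for line in results:
    print(line)
```

Whether this is actually faster on several million documents would have to be measured. Since each document is processed independently, the same function could probably also be mapped over chunks of documents with `multiprocessing.Pool`, which sidesteps the GIL in a way plain threads do not.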