I have a data set of more than 5 million rows that contains many noisy features (words), so I thought of doing spell correction and abbreviation handling.
When I googled for spell-correction packages in Python, I found packages like autocorrect, textblob, hunspell, and the Peter Norvig method.
Below is a sample of my dataset
Id description
1 switvch for air conditioner..............
2 control tfrmr...........
3 coling pad.................
4 DRLG machine
5 hair smothing kit...............
I tried the spell-correction function from these packages with code like:

dataset['description'] = dataset['description'].apply(lambda x: ' '.join(spellcorrection_function(word) for word in x.split()))
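For reference, here is a minimal sketch of what spellcorrection_function can look like, assuming the autocorrect package's Speller API (textblob, hunspell, and the Norvig method expose similar one-word correctors):

from autocorrect import Speller

spell = Speller(lang='en')  # word-level corrector from the autocorrect package

def spellcorrection_function(word):
    # returns the most likely correction for a single word
    return spell(word)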
Spell correction over the entire data set took more than 12 hours, and it also introduced noise for about 20% of the important words.
For example, in the last row "smothing" was corrected to "something", but it should be "smoothing" ("something" makes no sense in this context).
Approaching further: when I looked at the data, a word's spelling was not always wrong; correct spellings of the same word also occur elsewhere in the data set. So I tokenized the entire data set, split the tokens into correct and wrong words using a dictionary, applied the Jaro-Winkler similarity method between all pairs of words, and selected the pairs with a similarity score of 0.93 or more (a sketch of this step follows the table below):
Wrong word    Correct word    Similarity score
switvch       switch          0.98
coling        cooling         0.98
smothing      smoothing       0.99
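The pair mining itself looks roughly like this minimal sketch, assuming the jellyfish package for Jaro-Winkler; wrong_words and correct_words are hypothetical names for the token sets produced by the dictionary split described above:

import itertools
import jellyfish  # any Jaro-Winkler implementation would work here

THRESHOLD = 0.93

similar_word_dictionary = {}
for wrong, correct in itertools.product(wrong_words, correct_words):
    # older jellyfish versions call this jaro_winkler
    if jellyfish.jaro_winkler_similarity(wrong, correct) >= THRESHOLD:
        similar_word_dictionary[wrong] = correct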
I got more than 50k such pairs of similar words, which I put into a dictionary with the wrong word as the key and the correct word as the value.
I also kept a dictionary of words with their abbreviations (~3k pairs):
Key      Value
tfrmr    transformer
drlg     drilling
Then I search and replace the wrong words and abbreviations using this code:
dataset['description'] = dataset['description'].replace(similar_word_dictionary, regex=True)
dataset['description'] = dataset['description'].replace(abbreviation_dictionary, regex=True)
This code took more than a day to complete for only 10% of my entire data set, which I found is not efficient.
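I suspect part of the slowness is that Series.replace with a dict runs one regex pass per key and matches substrings; a single compiled alternation with word boundaries would scan each row only once, roughly like this sketch over the two dictionaries from above:

import re

replacements = {**similar_word_dictionary, **abbreviation_dictionary}

# one alternation, longest keys first so longer matches win;
# \b restricts the substitution to whole words only
pattern = re.compile(
    r'\b(' + '|'.join(map(re.escape, sorted(replacements, key=len, reverse=True))) + r')\b'
)

dataset['description'] = dataset['description'].apply(
    lambda text: pattern.sub(lambda m: replacements[m.group(0)], text)
)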
While googling around Python packages and blogs, I found that RNN-LSTM models can be used for spell correction.
I trained an RNN-LSTM on some pairs of data, but it did not work well (perhaps I need more data or more tuning).
Since RNNs are complex (or maybe I am wrong about that), is there any other way to do spell correction and abbreviation handling efficiently on a data set of this size? Any help would be greatly appreciated.