Dataset for RNN-LSTM as spell checker in Python

I have a collection of more than 5 million entries containing many noisy words (misspellings and abbreviations), so I was thinking about doing spell correction and abbreviation expansion.

When I googled for spelling-correction packages in Python, I found packages such as autocorrect, textblob, hunspell, and Peter Norvig's method.

Below is a sample of my dataset:

Id      description
 1      switvch for air conditioner..............
 2      control tfrmr...........
 3      coling pad.................
 4      DRLG machine
 5      hair smothing kit...............

I tried the spelling-correction function of these packages on each word of the description:

dataset['description'] = dataset['description'].apply(lambda x: " ".join(spellcorrection_function(item) for item in x.split()))

Spell correction over the whole dataset took more than 12 hours to complete, and it also introduced noise (for about 20% of the words that matter).

For example, "smothing" was changed to "something" instead of "smoothing".
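One reason the row-wise apply is so slow is that the same misspelling gets re-corrected millions of times. A minimal sketch of memoizing corrections per unique word — here `spellcorrection_function` is a stand-in for whichever package's function is actually used, not a real API:

```python
from functools import lru_cache

# Stand-in for autocorrect / textblob / hunspell etc. (illustrative, not the real API)
def spellcorrection_function(word):
    fixes = {"switvch": "switch", "coling": "cooling", "smothing": "smoothing"}
    return fixes.get(word, word)

@lru_cache(maxsize=None)
def correct_word(word):
    # Each unique word is corrected only once, however often it appears
    return spellcorrection_function(word)

def correct_text(text):
    return " ".join(correct_word(w) for w in text.split())

print(correct_text("switvch for air conditioner"))  # switch for air conditioner
```

With a few million rows but a much smaller set of unique words, this alone can cut the runtime dramatically, regardless of which correction package is behind it.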

So I took a different approach: I extracted the unique words from the dataset, computed pairwise string similarity with the Jaro-Winkler method, and kept word pairs whose similarity score was above 0.93:

  Wrong word   correct word  similarity score
    switvch      switch           0.98
    coling       cooling          0.98
    smothing     smoothing        0.99
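The similarity filtering above can be sketched in pure Python. Libraries such as jellyfish or textdistance provide the same metric; the implementation below is for illustration only, and the word pairs are the ones from the table:

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(i + window + 1, len2)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # Count transpositions between the matched characters
    k = transpositions = 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

pairs = [("switvch", "switch"), ("coling", "cooling"), ("smothing", "smoothing")]
similar = {wrong: right for wrong, right in pairs if jaro_winkler(wrong, right) > 0.93}
print(similar)
```

In practice the candidate pairs would come from comparing unique dataset words against each other (or against a reference vocabulary), which is quadratic and worth blocking, e.g. by first letter or word length.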

From these pairs I built a dictionary of similar words (about 50k key-value pairs).

I also manually prepared a dictionary of abbreviations (~3k pairs), for example:

 key     value
 tfrmr   transformer
 drlg    drilling

Then I applied both dictionaries to the dataset:

dataset['description'] = dataset['description'].replace(similar_word_dictionary, regex=True)



dataset['description'] = dataset['description'].replace(abbreviation_dictionary, regex=True)
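Note that with `regex=True`, pandas treats the dictionary keys as patterns and replaces substrings, so a short key can match inside a longer word. A sketch of doing both replacements in one pass with a single compiled pattern and explicit word boundaries — the dictionary contents here are illustrative, taken from the tables above:

```python
import re

# Illustrative fragments of the two dictionaries built above
similar_word_dictionary = {"switvch": "switch", "coling": "cooling", "smothing": "smoothing"}
abbreviation_dictionary = {"tfrmr": "transformer", "drlg": "drilling"}

replacements = {**similar_word_dictionary, **abbreviation_dictionary}
# \b...\b keeps matches on whole words only; IGNORECASE catches "DRLG" etc.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, replacements)) + r")\b",
    re.IGNORECASE,
)

def normalise(text):
    return pattern.sub(lambda m: replacements[m.group(0).lower()], text)

print(normalise("switvch for air conditioner"))  # switch for air conditioner
print(normalise("DRLG machine"))                 # drilling machine
```

A single compiled alternation scans each description once, instead of one pass per dictionary key, which matters when the dictionaries have tens of thousands of entries.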

Even after this, about 10% of the words are still wrong, so this approach does not feel good enough.

While searching for a better approach in Python, I came across articles describing spell checking with an RNN-LSTM model.

I would like to build such an RNN-LSTM spell checker, but I could not find a dataset to train it on.

My question is: to train an RNN model for this task, do I need

  • a dataset of (wrong word - correct word) pairs, or
  • a corpus of correct text only (so the model learns valid words from context)?
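On the dataset question: when no labeled pairs exist, a common approach is to take a corpus of correct text and inject synthetic edits (deletions, swaps, substitutions, insertions) to generate (noisy word, correct word) training pairs. A minimal sketch, with an illustrative vocabulary:

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_noise(word, rng):
    """Apply one random edit: delete, swap, replace, or insert a character."""
    if len(word) < 3:
        return word
    op = rng.choice(["delete", "swap", "replace", "insert"])
    i = rng.randrange(len(word) - 1)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "replace":
        return word[:i] + rng.choice(ALPHABET) + word[i + 1:]
    return word[:i] + rng.choice(ALPHABET) + word[i:]  # insert

rng = random.Random(0)  # fixed seed so the dataset is reproducible
vocab = ["switch", "cooling", "smoothing", "transformer", "drilling"]

# Several noisy variants per correct word -> (input, target) pairs for training
pairs = [(add_noise(w, rng), w) for w in vocab for _ in range(3)]
print(pairs[:3])
```

In practice the clean vocabulary would come from the dataset itself (or a domain word list), and each pair would then be encoded character by character as encoder input and decoder target for a seq2seq RNN-LSTM.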

Where can I get such a dataset, or how should I prepare one? Any pointers, articles, or examples would be appreciated.


Source: https://habr.com/ru/post/1694149/

