How to remove "similar" but not identical content in a MySQL database

Suppose I have this table:

ID | description
-------------------
5  | The bird flew over the tree.
2  | The birds, flew over the tree

These two rows have "similar" content. How do I remove the row with ID 2?

  • What algorithm should be used for "similar" text?
  • How do I do this with Python?

Thanks!

+3
4 answers

One thing you could try is to strip out the punctuation and run each sentence through a stemmer (such as the Porter stemmer).

Python implementations of the Porter stemmer are available.
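
The strip-and-stem idea can be sketched in Python roughly as follows. This is only an illustration: `crude_stem` is a hypothetical, very rough suffix stripper standing in for a real Porter stemmer, and the `rows` dictionary is just the two example rows from the question rather than a real database query.

```python
import string

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Lowercase the text, drop punctuation, and stem each word."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(crude_stem(w) for w in text.split())

# Example rows from the question (ID -> description).
rows = {5: "The bird flew over the tree.", 2: "The birds, flew over the tree"}

seen = {}        # normalized text -> ID of the first row that produced it
duplicates = []  # IDs whose normalized text was already seen
for row_id, description in rows.items():
    key = normalize(description)
    if key in seen:
        duplicates.append(row_id)
    else:
        seen[key] = row_id

print(duplicates)
```

Rows that normalize to the same string are treated as duplicates, and the first one encountered is kept. A real stemmer would handle far more word forms than this stand-in does.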

+5

You could use LIKE:

DELETE FROM myTable WHERE description LIKE 'The bird%flew over the tree%';
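
One thing to watch with a pattern like this is that it matches both example rows, so it may be worth checking what it selects before deleting. The sketch below uses Python's built-in `sqlite3` with an in-memory table purely as a stand-in for MySQL (an assumption made so the example is self-contained), and it excludes the row to keep by ID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real MySQL connection
conn.execute("CREATE TABLE myTable (ID INTEGER PRIMARY KEY, description TEXT)")
conn.executemany(
    "INSERT INTO myTable (ID, description) VALUES (?, ?)",
    [(5, "The bird flew over the tree."), (2, "The birds, flew over the tree")],
)

pattern = "The bird%flew over the tree%"

# Check which rows the pattern matches before deleting anything.
matches = conn.execute(
    "SELECT ID FROM myTable WHERE description LIKE ? ORDER BY ID", (pattern,)
).fetchall()

# Delete every match except the row we want to keep (ID 5 here).
conn.execute(
    "DELETE FROM myTable WHERE description LIKE ? AND ID <> ?", (pattern, 5)
)
remaining = conn.execute("SELECT ID, description FROM myTable").fetchall()
print(matches, remaining)
```

With a MySQL driver the placeholder syntax would likely differ (many use `%s` instead of `?`), but the `AND ID <> ...` guard works the same way.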
0

http://en.wikipedia.org/wiki/Edit_distance

http://en.wikipedia.org/wiki/Levenshtein_distance

http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance

This may not be a good idea if your database is large.

In pseudocode, you would basically say:

if editDist(stringA, stringB) <= editThreshold
    remove stringA %or stringB
end
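
A minimal Python sketch of this check, treating two strings as "similar" when their Levenshtein distance is at or below a threshold. The names `edit_distance` and `find_near_duplicates` and the threshold value are illustrative assumptions, and the pairwise comparison is quadratic in the number of rows, which is why it may not suit a large database.

```python
def edit_distance(a, b):
    """Levenshtein distance using a two-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def find_near_duplicates(rows, threshold):
    """Return IDs to remove, keeping the first row of each similar group."""
    kept, to_remove = [], []
    for row_id, text in rows:
        if any(edit_distance(text, other) <= threshold for _, other in kept):
            to_remove.append(row_id)
        else:
            kept.append((row_id, text))
    return to_remove

# Example rows from the question as (ID, description) pairs.
rows = [(5, "The bird flew over the tree."), (2, "The birds, flew over the tree")]
print(find_near_duplicates(rows, threshold=3))
```

The returned IDs could then be passed to a `DELETE ... WHERE ID IN (...)` statement. The right threshold depends on how long the strings are and how aggressive the deduplication should be.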
0

Source: https://habr.com/ru/post/1719212/