Is there an implementation of this string matching method in Python?

I'm trying to figure out which records in my datastore are near-duplicates, using approximate string matching.

Is there an existing implementation of the following approach in Python, or do I need to write it myself?

Thanks :)

From Wikipedia:

...

The brute-force approach is to compute the edit distance to P for every substring of T, and then select the substring with the minimum distance. However, this algorithm would have a running time of O(n³ m).

The better solution [3] [4], using dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the first i characters of the pattern, P_i, and any substring T_j',j of T that ends at position j.
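The recurrence quoted above (often attributed to Sellers) differs from plain Levenshtein only in that a match may start anywhere in T, i.e. the row for the empty pattern prefix is all zeros. A minimal sketch, with a function name of my choosing:

```python
def fuzzy_substring_distance(pattern, text):
    """Minimum edit distance between `pattern` and any substring of `text`.

    Same DP table as Levenshtein, except the row for the empty pattern
    prefix is all zeros, so a match may begin at any position in `text`.
    Runs in O(len(pattern) * len(text)) time, O(len(text)) space.
    """
    prev = [0] * (len(text) + 1)        # empty pattern matches anywhere at cost 0
    for i, pc in enumerate(pattern, 1):
        curr = [i]                      # matching against an empty text prefix
        for j, tc in enumerate(text, 1):
            cost = 0 if pc == tc else 1
            curr.append(min(prev[j] + 1,          # skip a pattern character
                            curr[j - 1] + 1,      # skip a text character
                            prev[j - 1] + cost))  # substitute / match
        prev = curr
    return min(prev)                    # best match may end at any position

print(fuzzy_substring_distance("abc", "xxabcxx"))  # -> 0 (exact substring)
print(fuzzy_substring_distance("abc", "xxaXcxx"))  # -> 1 (one substitution)
```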

What is the most efficient way to apply this to many lines?

4 answers

Yes.

google("python levenshtein")
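If you would rather avoid a third-party dependency, plain Levenshtein distance is only a few lines of dynamic programming. A minimal sketch (the packages that search turns up, such as python-Levenshtein, provide the same function as a C extension):

```python
def levenshtein(a, b):
    """Classic Wagner-Fischer DP with two rolling rows."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute / match
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # -> 3
```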

difflib.get_close_matches should do the job.
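For example (the word list below is just illustrative; `get_close_matches` returns the candidates scoring above `cutoff`, best first):

```python
from difflib import get_close_matches

candidates = ["ape", "apple", "peach", "puppy"]
# n=3 and cutoff=0.6 are the defaults; raise cutoff for stricter matching
print(get_close_matches("appel", candidates))  # -> ['apple', 'ape']
```

To find near-duplicate records, you can run each record against the list of the others with a high cutoff.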


With difflib, for example:

from difflib import context_diff

a = 'acaacbaaca'
b = 'accabcaacc'

# context_diff expects sequences of lines; passing plain strings
# compares them character by character
print(''.join(context_diff(a, b)))

fuzzywuzzy provides ratio() out of the box. fuzzywuzzy is built on top of difflib; see http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python

fuzzywuzzy: https://github.com/seatgeek/fuzzywuzzy

from fuzzywuzzy import fuzz

fuzz.ratio("this is a test", "this is a test!")
# 96
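Under the hood, fuzz.ratio is essentially difflib.SequenceMatcher.ratio() scaled to 0-100, so a stdlib-only approximation is possible. A sketch (fuzzywuzzy adds preprocessing, and rounding can differ by a point between versions):

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # SequenceMatcher.ratio() is 2*M / T, where M = number of matched
    # characters and T = total length of both strings; scale to 0-100
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(simple_ratio("this is a test", "this is a test!"))
```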

Source: https://habr.com/ru/post/1796254/

