I'm trying to figure out which records in my datastore are near-duplicates, using approximate string matching.
Is there an existing implementation of the following approach in Python, or do I need to roll my own?
Thanks :)
From Wikipedia:
...
The brute-force approach would be to compute the edit distance to P for all substrings of T, and then choose the substring with the minimum distance. However, this algorithm would have a running time of O(n^3 m).
A better solution [3][4], using dynamic programming, uses an alternative formulation of the problem: for each position j in the text T and each position i in the pattern P, compute the minimum edit distance between the first i characters of the pattern, P_i, and any substring T_{j',j} of T that ends at position j.
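To make the quoted formulation concrete, here is a minimal sketch of that dynamic program in plain Python. The function name `best_approx_match` and its return shape are my own choices for illustration, not taken from any library. It keeps only one column of the table at a time, so it runs in O(n m) time and O(m) memory.

```python
def best_approx_match(pattern, text):
    """Return (distance, end): the smallest edit distance between `pattern`
    and any substring of `text`, and the position where that substring ends.

    Hypothetical helper written for illustration only.
    """
    m = len(pattern)
    # prev[i] = min edit distance between pattern[:i] and any substring
    # of the text ending at the current position (initially position 0).
    prev = list(range(m + 1))
    best_dist, best_end = prev[m], 0
    for j, tc in enumerate(text, start=1):
        # curr[0] stays 0: a match is allowed to start anywhere in the text.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == tc else 1
            curr[i] = min(prev[i - 1] + cost,  # match or substitution
                          prev[i] + 1,         # extra character in the text
                          curr[i - 1] + 1)     # pattern character missing
        if curr[m] < best_dist:
            best_dist, best_end = curr[m], j
        prev = curr
    return best_dist, best_end
```

For example, `best_approx_match("abc", "xxabcxx")` finds an exact occurrence ending at position 5, while `best_approx_match("abc", "xxabdxx")` reports a distance of 1.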
What is the most efficient way to apply this to many records?
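For reference, this is the naive baseline I have in mind, as a rough sketch only. It assumes each record is a single string and that "near-duplicate" means a whole-string edit distance of at most `max_dist`; both `edit_distance` and `near_duplicates` are hypothetical helpers, not library functions. The only optimization is skipping pairs whose lengths differ by more than `max_dist`, since the edit distance can never be smaller than the length difference. It still compares O(N^2) pairs, which is what I'd like to avoid.

```python
from itertools import combinations


def edit_distance(a, b):
    """Plain Levenshtein distance, keeping one row of the table at a time."""
    if len(a) < len(b):
        a, b = b, a  # keep the stored row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j - 1] + cost,  # match or substitution
                            prev[j] + 1,         # deletion from a
                            curr[j - 1] + 1))    # insertion into a
        prev = curr
    return prev[-1]


def near_duplicates(records, max_dist=2):
    """Return (i, j, distance) for every pair of records within max_dist."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(records), 2):
        # Cheap filter: the distance is at least the length difference.
        if abs(len(a) - len(b)) > max_dist:
            continue
        d = edit_distance(a, b)
        if d <= max_dist:
            pairs.append((i, j, d))
    return pairs


records = ["John Smith", "Jon Smith", "Jane Doe", "John Smyth"]
print(near_duplicates(records, max_dist=2))
```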