Computing string proximity in Python

I want to compare string representations of 3- or 4-digit integers and group them into pairs by "proximity". For example, pc_dud[3] should be paired with dud[0].

    dud = ['3433', '3467', '3551', '345', '345', '3613']
    pc_dud = ['3401', '3402', '3430', '0342', '3584']

Does anyone know of a good tool for this (I thought maybe something like jellyfish)? Another solution, of course, would be to use the arithmetic difference as a measure of "proximity". Any thoughts?

4 answers

You can use the difflib module:

Example:

    >>> import difflib
    >>> dud = ['3433', '3467', '3551', '345', '345', '3613']
    >>> pc_dud = ['3401', '3402', '3430', '0342', '3584']
    >>> difflib.get_close_matches(dud[0], pc_dud)
    ['3430']
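To pair every entry of dud rather than just dud[0], the same call can be wrapped in a loop. This is only a sketch: the helper name closest_by_similarity is made up for illustration, and the cutoff of 0.6 is simply difflib's default, not a value tuned for this data.

```python
import difflib

dud = ['3433', '3467', '3551', '345', '345', '3613']
pc_dud = ['3401', '3402', '3430', '0342', '3584']

def closest_by_similarity(items, candidates, cutoff=0.6):
    """Pair each item with its most similar candidate, or None if nothing passes the cutoff.

    Hypothetical helper; it returns a list of tuples (not a dict) so that
    duplicate items such as '345' each keep their own entry.
    """
    pairs = []
    for item in items:
        # n=1 asks difflib for the single best match at or above the cutoff.
        matches = difflib.get_close_matches(item, candidates, n=1, cutoff=cutoff)
        pairs.append((item, matches[0] if matches else None))
    return pairs

pairs = closest_by_similarity(dud, pc_dud)
for item, match in pairs:
    print(item, match)
```

Note that several entries of dud may end up sharing the same match, since candidates are not removed once used.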

If you are looking for arithmetic proximity, then pc_dud[3] will not match dud[0].
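A minimal sketch of what arithmetic proximity means here, assuming every string parses as a base-10 integer (so '0342' is treated as 342). The helper name nearest_by_value is hypothetical. Candidates may be reused, and ties go to whichever candidate comes first in the list.

```python
dud = ['3433', '3467', '3551', '345', '345', '3613']
pc_dud = ['3401', '3402', '3430', '0342', '3584']

def nearest_by_value(item, candidates):
    """Return the candidate whose integer value is closest to item's."""
    return min(candidates, key=lambda c: abs(int(c) - int(item)))

nearest = [(d, nearest_by_value(d, pc_dud)) for d in dud]
for d, p in nearest:
    print(d, p)
```

Under this measure dud[0] ('3433') pairs with '3430', while '0342' ends up closest to '345'.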

If you want to keep only unique pairs, use this:

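    import itertools

    dud_ = dud[:]
    pc_dud_ = pc_dud[:]
    # Repeatedly take the closest remaining pair, then remove both members
    # so that each string is used at most once.
    while dud_ and pc_dud_:
        d1, d2 = min(itertools.product(dud_, pc_dud_),
                     key=lambda pair: abs(int(pair[0]) - int(pair[1])))
        dud_.remove(d1)
        pc_dud_.remove(d2)
        print(d1, d2)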

One approach you can take is to compute the Levenshtein distance between each string in dud and each string in pc_dud. This gives you the edit distance between each pair of digit sequences, that is, the minimum number of single-character steps required to convert one sequence into the other. The steps it counts are insertion, deletion, and substitution.

You will need a data structure to store each edit distance together with the pair of sequences it was computed for. You can then find the smallest edit distance for each sequence and group the two strings as a pair.
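The approach described above could be sketched roughly as follows. This is a plain dynamic-programming implementation written for illustration, not the code behind the tool linked below. The distances dictionary keyed by index pairs is just one possible choice of data structure, chosen so that the duplicate '345' entries stay distinct.

```python
dud = ['3433', '3467', '3551', '345', '345', '3613']
pc_dud = ['3401', '3402', '3430', '0342', '3584']

def levenshtein(a, b):
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Store every distance keyed by (index in dud, index in pc_dud).
distances = {(i, j): levenshtein(d, p)
             for i, d in enumerate(dud)
             for j, p in enumerate(pc_dud)}

# For each entry of dud, pick the pc_dud entry with the smallest distance.
for i, d in enumerate(dud):
    best_j = min(range(len(pc_dud)), key=lambda j: distances[(i, j)])
    print(d, pc_dud[best_j], distances[(i, best_j)])
```

When several candidates share the smallest distance, this sketch keeps the first one in pc_dud; a real grouping step would need an explicit tie-breaking rule.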

Here is an example of how two of the number sequences from your lists compare under Levenshtein distance. I used this tool to create the screenshot: http://www.let.rug.nl/kleiweg/lev/

http://imgur.com/OTJia7W

The number in the lower right corner is the minimum edit distance.


If you go the Levenshtein route, as Paul recommended, check out Apache PyLucene. I believe Lucene's FuzzyQuery is based on it.


Source: https://habr.com/ru/post/1494526/

