How does Python's difflib.get_close_matches() function work?

Below are two arrays:

    import difflib
    import scipy
    import numpy

    a1 = numpy.array(['198.129.254.73', '134.55.221.58', '134.55.219.121',
                      '134.55.41.41', '198.124.252.101'], dtype='|S15')
    b1 = numpy.array(['198.124.252.102', '134.55.41.41', '134.55.219.121',
                      '134.55.219.137', '134.55.220.45', '198.124.252.130'],
                     dtype='|S15')
    difflib.get_close_matches(a1[-1], b1, 2)

Output:

    ['198.124.252.130', '198.124.252.102']

Shouldn't '198.124.252.102' be the closest match for '198.124.252.101'?

I looked at the documentation, which mentions some floating-point scores, but it says nothing about the algorithm used.

I need to find out whether the absolute difference between the last octets of two addresses is 1 (provided the first three octets are the same).

So I first find the closest string, and then check that closest string against the above condition.

Is there any other function or way to achieve this? Also how does get_close_matches() behave?

ipaddr doesn't seem to offer this kind of manipulation for IP addresses.

2 answers

Well, there is this part in the docs explaining your problem:

This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.

You can use the Levenshtein distance to get the expected results.
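As a rough illustration of that suggestion, here is a minimal pure-Python sketch of the Levenshtein distance using the classic dynamic-programming recurrence. The function name `levenshtein` and the variable names are made up for this example and are not part of difflib or any library.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] is the distance between the first i-1 chars of a
    # and the first j chars of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


target = '198.124.252.101'
candidates = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
              '134.55.219.137', '134.55.220.45', '198.124.252.130']

# Pick the candidate with the fewest single-character edits.
closest = min(candidates, key=lambda ip: levenshtein(target, ip))
print(closest)  # 198.124.252.102
```

Here '198.124.252.102' is one substitution away from the target, while '198.124.252.130' needs two edits, so the edit distance ranks them the way the question expects.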

But for comparing IP addresses, I would suggest using integer comparison:

    >>> parts = [int(s) for s in '198.124.252.130'.split('.')]
    >>> parts2 = [int(s) for s in '198.124.252.101'.split('.')]
    >>> from operator import sub
    >>> diff = sum(d * 10**(3-pos) for pos,d in enumerate(map(sub, parts, parts2)))
    >>> diff
    29

This style can be used to create a comparison function:

    from functools import cmp_to_key, partial
    from operator import sub

    def compare_ips(base, ip1, ip2):
        # Three-way comparison: negative if ip1 is closer to base than ip2.
        base = [int(s) for s in base.split('.')]
        parts1 = (int(s) for s in ip1.split('.'))
        parts2 = (int(s) for s in ip2.split('.'))
        # Weight each octet difference by its position, leftmost octet first.
        test1 = sum(abs(d * 10**(3-pos)) for pos, d in enumerate(map(sub, base, parts1)))
        test2 = sum(abs(d * 10**(3-pos)) for pos, d in enumerate(map(sub, base, parts2)))
        # Python 3 has no cmp(); this expression gives the same -1/0/1 result.
        return (test1 > test2) - (test1 < test2)

    base = '198.124.252.101'
    test_list = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
                 '134.55.219.137', '134.55.220.45', '198.124.252.130']
    # sorted() no longer accepts cmp=, so wrap the comparator with cmp_to_key.
    sorted(test_list, key=cmp_to_key(partial(compare_ips, base)))
    # yields:
    # ['198.124.252.102', '198.124.252.130', '134.55.219.121', '134.55.219.137',
    #  '134.55.220.45', '134.55.41.41']
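A possible variant of the same idea, sketched with the standard-library `ipaddress` module (Python 3) instead of hand-written weights: each address is converted to its 32-bit integer value and the list is sorted by absolute distance from the base. The helper name `ip_distance` is made up for this example. This is not the weighting used above (octets count in base 256 here rather than powers of 10), so distant addresses can come out in a different order.

```python
import ipaddress


def ip_distance(base, ip):
    """Absolute difference between two IPv4 addresses viewed as 32-bit integers."""
    return abs(int(ipaddress.IPv4Address(base)) - int(ipaddress.IPv4Address(ip)))


base = '198.124.252.101'
test_list = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
             '134.55.219.137', '134.55.220.45', '198.124.252.130']

# A key function is enough here because every address gets a single number.
ordered = sorted(test_list, key=lambda ip: ip_distance(base, ip))
print(ordered)
# ['198.124.252.102', '198.124.252.130', '134.55.220.45', '134.55.219.137',
#  '134.55.219.121', '134.55.41.41']
```

The two nearest addresses are the same as before; only the ordering of the far-away 134.55.x.x entries changes, because numeric distance and the positional weights disagree there.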

Some hints from the difflib documentation:

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching." The idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.

Regarding your requirement to compare IP addresses using custom logic: first validate that each string is a well-formed IP address. After that, writing the comparison logic with simple integer arithmetic should be an easy task. A library is not needed at all.
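To see what that description means for the addresses in the question, the following sketch prints the matching blocks and the similarity ratio that `SequenceMatcher` reports for the two top candidates. Only documented difflib calls are used; the variable names are arbitrary.

```python
import difflib

word = '198.124.252.101'
candidates = ['198.124.252.102', '198.124.252.130']

scores = {}
for candidate in candidates:
    # Compare the candidate (first sequence) against the word (second sequence).
    matcher = difflib.SequenceMatcher(None, candidate, word)
    # Each non-empty matching block is one contiguous run found by the
    # recursive longest-match search.
    blocks = [candidate[m.a:m.a + m.size]
              for m in matcher.get_matching_blocks() if m.size]
    scores[candidate] = matcher.ratio()
    print(candidate, blocks, round(scores[candidate], 4))

# 198.124.252.102 ['198.124.252.10'] 0.9333
# 198.124.252.130 ['198.124.252.1', '0'] 0.9333
```

Both candidates share 14 matched characters with the word (one run of 14, or a run of 13 plus a stray '0'), so both get the ratio 2 * 14 / 30. The score is a tie; in the CPython implementation, as far as I can tell, results are ranked as (score, string) pairs, which would explain why '198.124.252.130' is listed first.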
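A minimal sketch of that approach, assuming plain dotted-quad IPv4 strings. The helper names `parse_ipv4` and `is_neighbour` are invented for this example, and the validation rules (four decimal octets, each 0 to 255) are an assumption about what "well-formed" should mean here.

```python
def parse_ipv4(text):
    """Return the four octets of a dotted-quad string as ints, or raise ValueError."""
    parts = text.split('.')
    if len(parts) != 4:
        raise ValueError('expected four dot-separated octets: %r' % text)
    octets = []
    for part in parts:
        # Accept only plain ASCII decimal digits, no signs or whitespace.
        if not (part.isascii() and part.isdigit()) or int(part) > 255:
            raise ValueError('invalid octet %r in %r' % (part, text))
        octets.append(int(part))
    return octets


def is_neighbour(ip1, ip2):
    """True if the first three octets match and the last octets differ by exactly 1."""
    a, b = parse_ipv4(ip1), parse_ipv4(ip2)
    return a[:3] == b[:3] and abs(a[3] - b[3]) == 1


base = '198.124.252.101'
addresses = ['198.124.252.102', '134.55.41.41', '134.55.219.121',
             '134.55.219.137', '134.55.220.45', '198.124.252.130']
neighbours = [ip for ip in addresses if is_neighbour(base, ip)]
print(neighbours)  # ['198.124.252.102']
```

This checks the condition from the question directly, so no fuzzy string matching step is needed beforehand.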


Source: https://habr.com/ru/post/917353/

