Measure the graphical similarity of two strings

I have had no luck finding such a package, ideally in Python. Is there a library that can compare two strings graphically?

For example, this would be useful for fighting spam, where people use Я instead of R or, worse, things like Α (capital alpha, 0x0391) instead of A to obfuscate their strings.

The interface of such a package may be something like

    distance("Foo", "Bar")  # large distance
    distance("Αe", "Are")   # small distance

Thanks!

2 answers

I am not aware of a package that does this. However, you can use resources such as a homoglyph attack generator, the Unicode Consortium's confusables list, the links from the Wikipedia page on IDN homograph attacks, or similar sources to build your own look-alike library and compute a score based on it.

EDIT: The Unicode folks seem to have put together a very large list of characters that look alike. It is available here. If I were you, I would write a script that reads it into a Python dictionary and then scans your string for matches. An excerpt:

    FF4A ;  006A ;  MA  # ( ｊ → j ) FULLWIDTH LATIN SMALL LETTER J → LATIN SMALL LETTER J  # →ϳ→
    2149 ;  006A ;  MA  # ( ⅉ → j ) DOUBLE-STRUCK ITALIC SMALL J → LATIN SMALL LETTER J  #
    1D423 ; 006A ;  MA  # ( 𝐣 → j ) MATHEMATICAL BOLD SMALL J → LATIN SMALL LETTER J  #
    1D457 ; 006A ;  MA  # ( 𝑗 → j ) MATHEMATICAL ITALIC SMALL J → LATIN SMALL LETTER J  #
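The "read it into a dictionary, then scan the string" idea could look roughly like the sketch below. It parses four abridged lines in the format of the excerpt rather than the real file, and the names `parse_confusables` and `find_lookalikes` are made up for illustration.

```python
# Four abridged lines in the format of the excerpt; the real file is far longer.
SAMPLE = """\
FF4A ; 006A ; MA # FULLWIDTH LATIN SMALL LETTER J
2149 ; 006A ; MA # DOUBLE-STRUCK ITALIC SMALL J
1D423 ; 006A ; MA # MATHEMATICAL BOLD SMALL J
1D457 ; 006A ; MA # MATHEMATICAL ITALIC SMALL J
"""


def parse_confusables(lines):
    """Map each source character to its prototype string."""
    table = {}
    for line in lines:
        # Everything after '#' is a comment; skip lines that are empty afterwards.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        source = chr(int(fields[0], 16))
        # The second field may hold several space-separated code points.
        prototype = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = prototype
    return table


def find_lookalikes(text, table):
    """Return (index, character, prototype) for each character found in the table."""
    return [(i, ch, table[ch]) for i, ch in enumerate(text) if ch in table]


table = parse_confusables(SAMPLE.splitlines())
hits = find_lookalikes("\uff4aoin now", table)  # starts with a fullwidth j
print(hits)
```

Each hit reports where a look-alike character sits and which plain character it imitates, which is enough to flag or rewrite a suspicious string.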

With the information @Richard provided in his answer, I came up with this short Python 3 script that implements UTS #39:

    """Implement the simple algorithm laid out in UTS#39, paragraph 4
    """

    import csv
    import re
    import unicodedata

    comment_pattern = re.compile(r'\s*#.*$')


    def skip_comments(lines):
        """
        A filter which skips/strips the comments and yields the
        rest of the lines

        :param lines: any object which we can iterate through such as a file
            object, list, tuple, or generator
        """
        for line in lines:
            line = comment_pattern.sub('', line).strip()
            if line:
                yield line


    def normalize(s):
        return unicodedata.normalize("NFD", s)


    def to_unicode(code_point):
        # Fields look like "FF4A "; strip the padding before parsing the hex value
        return chr(int(code_point.strip(), 16))


    def read_table(file_name):
        d = {}
        # "utf-8-sig" skips a leading byte order mark if the file has one
        with open(file_name, encoding="utf-8-sig") as f:
            reader = csv.reader(skip_comments(f), delimiter=";")
            for row in reader:
                source = to_unicode(row[0])
                # The prototype may consist of several code points
                prototypes = map(to_unicode, row[1].strip().split())
                d[source] = ''.join(prototypes)
        return d


    TABLE = read_table("confusables.txt")


    def skeleton(s):
        s = normalize(s)
        s = ''.join(TABLE.get(c, c) for c in s)
        return normalize(s)


    def confusable(s1, s2):
        return skeleton(s1) == skeleton(s2)


    if __name__ == "__main__":
        # The last pair is a fullwidth j (U+FF4A) and an ASCII j
        for strings in [("Foo", "Bar"), ("Αe", "Are"), ("ｊ", "j")]:
            print(*strings)
            print("Equal:", strings[0] == strings[1])
            print("Confusable:", confusable(*strings), "\n")

The file confusables.txt must be in the directory from which the script is run. In addition, I had to delete the first byte of this file because it was some strange, non-printable character (most likely a UTF-8 byte order mark, which Python's "utf-8-sig" codec skips automatically).

This follows only the simplest algorithm outlined at the beginning of paragraph 4, and not the more complicated whole-script and mixed-script confusable cases described in 4.1 and 4.2. Those are left as an exercise for the reader.

Note that "Я" and "R" are not considered confusable by the Unicode Consortium, so False will be returned for these two strings.
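The question asked for a distance rather than a yes/no answer. One possible way to get there, sketched below, is to compare the skeletons with `difflib.SequenceMatcher` from the standard library. This is not part of UTS #39, just an assumption about what a useful score might look like. `TINY_TABLE` is a hypothetical two-entry stand-in for the full table loaded from confusables.txt, so the snippet runs without the file.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical stand-in for the full table read from confusables.txt.
TINY_TABLE = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A
    "\uff4a": "j",  # FULLWIDTH LATIN SMALL LETTER J -> LATIN SMALL LETTER J
}


def skeleton(s, table=TINY_TABLE):
    """Normalize, replace each character by its prototype, normalize again."""
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(c, c) for c in s)
    return unicodedata.normalize("NFD", s)


def distance(s1, s2, table=TINY_TABLE):
    """Return 0.0 for identical skeletons, up to 1.0 for nothing in common."""
    matcher = SequenceMatcher(None, skeleton(s1, table), skeleton(s2, table))
    return 1.0 - matcher.ratio()


for pair in [("Foo", "Bar"), ("\u0391e", "Are"), ("\uff4a", "j")]:
    print(pair, round(distance(*pair), 2))
```

With this toy table, "Foo" and "Bar" come out at the maximum distance, while the alpha-for-A string lands close to "Are". Any other string metric, such as an edit distance, could be applied to the skeletons in the same way.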


Source: https://habr.com/ru/post/1275232/

