With the information @Richard provided in his answer, I came up with this short Python 3 script that implements UTS #39:
"""Implement the simple algorithm laid out in UTS#39, paragraph 4 """ import csv import re import unicodedata comment_pattern = re.compile(r'\s*#.*$') def skip_comments(lines): """ A filter which skip/strip the comments and yield the rest of the lines :param lines: any object which we can iterate through such as a file object, list, tuple, or generator """ for line in lines: line = comment_pattern.sub('', line).strip() if line: yield line def normalize(s): return unicodedata.normalize("NFD", s) def to_unicode(code_point): return chr(int("0x" + code_point.lower(), 16)) def read_table(file_name): d = {} with open(file_name) as f: reader = csv.reader(skip_comments(f), delimiter=";") for row in reader: source = to_unicode(row[0]) prototypes = map(to_unicode, row[1].strip().split()) d[source] = ''.join(prototypes) return d TABLE = read_table("confusables.txt") def skeleton(s): s = normalize(s) s = ''.join(TABLE.get(c, c) for c in s) return normalize(s) def confusable(s1, s2): return skeleton(s1) == skeleton(s2) if __name__ == "__main__": for strings in [("Foo", "Bar"), ("Αe", "Are"), ("j", "j")]: print(*strings) print("Equal:", strings[0] == strings[1]) print("Confusable:", confusable(*strings), "\n")
The file confusables.txt must be in the directory from which the script is executed. In addition, I had to delete the first character of this file, because it was a strange, non-printable character (presumably a UTF-8 byte order mark). Opening the file with encoding="utf-8-sig" should make Python skip it instead.
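The non-printable first character is most likely a UTF-8 byte order mark (BOM). As far as I can tell, it survives the comment stripping and then gets parsed as if it were a hexadecimal code point, which fails. The sketch below shows the difference between the "utf-8" and "utf-8-sig" codecs on a made-up two-line sample; the bytes in `raw` are an illustration, not the real file contents.

```python
import io

# Made-up sample: BOM (EF BB BF), a comment line, and one illustrative data line.
raw = b"\xef\xbb\xbf# confusables.txt\n0430 ;\t0061 ;\tMA\n"

plain = raw.decode("utf-8")         # keeps the BOM as U+FEFF
stripped = raw.decode("utf-8-sig")  # drops a leading BOM if present

print(repr(plain[0]))     # '\ufeff'
print(repr(stripped[0]))  # '#'

# The same codec works for file-like objects, e.g. open(..., encoding="utf-8-sig").
with io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8-sig") as f:
    first_line = f.readline()

print(first_line.startswith("#"))  # True
```

Since "utf-8-sig" also reads files without a BOM, it should be safe to use whether or not the first character has already been deleted by hand.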
This follows only the simplest algorithm outlined at the beginning of section 4, and not the more complicated cases of whole-script and mixed-script confusables outlined in 4.1 and 4.2. Those remain as an exercise for the reader.
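To try the simple skeleton comparison without downloading the data file, here is a minimal sketch that swaps the real table for a tiny hand-written one. `TOY_TABLE`, `toy_skeleton` and `toy_confusable` are hypothetical names, and the three mappings are illustrative stand-ins for entries in confusables.txt.

```python
import unicodedata

# Hypothetical three-entry stand-in for the table built from confusables.txt.
TOY_TABLE = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A    -> LATIN SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE   -> LATIN SMALL LETTER E
}


def toy_skeleton(s, table=TOY_TABLE):
    # NFD, replace each character by its prototype, then NFD again.
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(c, c) for c in s)
    return unicodedata.normalize("NFD", s)


def toy_confusable(s1, s2, table=TOY_TABLE):
    return toy_skeleton(s1, table) == toy_skeleton(s2, table)


latin = "Area"
mixed = "\u0391r\u0435\u0430"  # Greek Alpha, Latin r, Cyrillic ie, Cyrillic a

print(latin == mixed)                # False
print(toy_confusable(latin, mixed))  # True
print(toy_confusable("Foo", "Bar"))  # False
```

The two strings compare unequal code point by code point, but their skeletons are identical, which is all the simple test checks.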
Note that "I" and "R" are not treated as confusable by the Unicode Consortium, so False will be returned for those two strings.