Detect arbitrary character string / script

I am working on cleaning up a database of "profiles" of objects (people, organizations, etc.). One part of each profile is the person's name in their native script (for example, Thai), encoded in UTF-8. The previous data structure did not constrain the character set of the name field, so we now have more records with invalid values than can be reviewed manually.

What I need to do now is determine programmatically which language / script any given name is in. Given a sample dataset:

Name: "แผ่นดินต้น"
Script: NULL

Name: "አብርሃም"
Script: NULL

I need to end up with:

Name: "แผ่นดินต้น"
Script: Thai

Name: "አብርሃም"
Script: Amharic

I do not need to translate the names, just determine which script they are in. Is there an established methodology for determining this kind of thing?

The charnames module that ships with Perl can give you the Unicode name of any character:

use strict;
use warnings;
use charnames ();
use feature 'say';
use utf8;

say charnames::viacode(ord 'Բ');

__END__
ARMENIAN CAPITAL LETTER BEN

The character names follow a pattern: the script name comes first, e.g. ARMENIAN. So for a quick heuristic, you can take the first word of each character's name as its script. Keep in mind that this identifies the script, not the language, and that some characters (spaces, punctuation, digits) are shared across scripts, so you may want to skip those.

If you need something more robust than parsing names, there are modules on CPAN for this. For example, Unicode::UCD (also in the Perl core) provides a charscript function that returns the Unicode script property of a code point directly.
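For comparison (my addition, not part of the original answer), the same character-name lookup is available in Python's standard unicodedata module, and the first word of the name is usually the script:

```python
import unicodedata

# U+0532 is an Armenian letter; its Unicode name starts with the script name.
name = unicodedata.name(u'Բ')
print(name)             # ARMENIAN CAPITAL LETTER BEN
print(name.split()[0])  # ARMENIAN
```

As with the Perl version, this is a heuristic: names of shared characters start with words like DIGIT or FULL (as in FULL STOP) rather than a script name.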


The unicodedata2 module for Python exposes the Unicode script property of each character, which you can aggregate over a whole name:

#!/usr/bin/env python2
# coding: utf-8

import collections

import unicodedata2


def scripts(name):
    # Look up the Unicode script property of every character, then list
    # the scripts from most to least frequent, comma-separated.
    counts = collections.Counter(unicodedata2.script(char) for char in name)
    return ', '.join(script for script, _ in counts.most_common())


assert scripts(u'Rob') == 'Latin'
assert scripts(u'Robᵩ') == 'Latin, Greek'
assert scripts(u'Aarón') == 'Latin'
assert scripts(u'แผ่นดินต้น') == 'Thai'
assert scripts(u'አብርሃም') == 'Ethiopic'
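If installing unicodedata2 is not an option, a rough stdlib-only fallback (my sketch, not from the answer above) is to count the first word of each character's Unicode name, which is usually the script:

```python
import collections
import unicodedata


def scripts_stdlib(name):
    """Guess the script(s) of a name using only the standard library.

    Heuristic: the first word of a character's Unicode name is usually
    the script, e.g. 'THAI CHARACTER KO KAI' -> 'THAI'. Unnamed code
    points are skipped.
    """
    words = []
    for char in name:
        try:
            words.append(unicodedata.name(char).split()[0])
        except ValueError:  # code point has no name
            continue
    counts = collections.Counter(words)
    return ', '.join(word for word, _ in counts.most_common())

print(scripts_stdlib(u'แผ่นดินต้น'))  # THAI
print(scripts_stdlib(u'አብርሃም'))      # ETHIOPIC
```

Note that shared characters will surface non-script words such as DIGIT, so for database cleanup you would want to filter those out or fall back to the real script property.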

Source: https://habr.com/ru/post/1649195/
