I am cleaning up a database of "profiles" of entities (people, organizations, etc.). One field of each profile is the person's name in their native script (for example, Thai), encoded in UTF-8. The previous data structure did not record the script (character set) of the name, so we now have more records with missing or invalid values than can be reviewed manually.
What I need to do now is determine programmatically which language / script any given name is in. With a sample dataset:
Name: "แผ่นดินต้น"
Script: NULL
Name: "አብርሃም"
Script: NULL
I need to end up with:
Name: "แผ่นดินต้น"
Script: Thai
Name: "አብርሃም"
Script: Amharic
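To make the desired mapping concrete, here is a rough sketch of the kind of thing I have in mind, not a vetted solution. It assumes that the first word of each character's Unicode name (as returned by Python's standard `unicodedata.name`) is a good enough stand-in for the script, which is only a heuristic: the standard library does not expose the actual Unicode Script property. The function name `guess_script` is my own.

```python
import unicodedata
from collections import Counter


def guess_script(name):
    """Guess the dominant script of `name` from Unicode character names.

    Heuristic only: uses the first word of each character's Unicode name
    (e.g. "THAI CHARACTER PHO PHUNG" -> "THAI") as a proxy for its script.
    Returns None if the string contains no letters or combining marks.
    """
    counts = Counter()
    for ch in name:
        # Keep letters and combining marks; skip digits, spaces, punctuation.
        is_mark = unicodedata.category(ch).startswith("M")
        if not (ch.isalpha() or is_mark):
            continue
        char_name = unicodedata.name(ch, "")  # "" for unnamed code points
        if char_name:
            counts[char_name.split()[0]] += 1
    if not counts:
        return None
    # The most frequent label wins, so a stray foreign character is tolerated.
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    for sample in ("แผ่นดินต้น", "አብርሃም"):
        print(sample, guess_script(sample))
```

Note that this reports the script rather than the language: the second name comes back as Ethiopic, the script in which Amharic is written, and telling Amharic apart from other languages that use the same script would need something beyond character ranges.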
I do not need to translate the names, just determine which script they are written in. Is there an established methodology for determining this kind of thing?