Determine the language of a Unicode string in Java

If I have a string in Java, how can I determine which language it is written in? Does the Unicode specification define this?

1 answer

A Unicode string carries no metadata indicating which language it is in, or even whether it is a word or phrase at all.

You can guess the language from the characters the string contains. For example, the Unicode range U+30A0 to U+30FF is the Katakana block, which is used to write Japanese. So if most of your string consists of characters in this range, you can make an educated guess that it is Japanese. This is not reliable, however. For example, what if the string is just random Katakana characters?
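The range-based guess described above can be sketched in Java roughly as follows. This is only an illustration: the class name `ScriptGuesser` and the methods `katakanaFraction` and `dominantScript` are made up for this example, while `Character.UnicodeBlock` and `Character.UnicodeScript` are standard JDK classes. It tells you which script the characters belong to, not which language the text is in.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical helper for illustration; not part of any library.
final class ScriptGuesser {

    /** Fraction of code points that fall in the Katakana block (U+30A0 to U+30FF). */
    static double katakanaFraction(String text) {
        int total = text.codePointCount(0, text.length());
        if (total == 0) {
            return 0.0;
        }
        long katakana = text.codePoints()
                .filter(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.KATAKANA)
                .count();
        return (double) katakana / total;
    }

    /**
     * Script that covers the most code points, ignoring COMMON (digits, spaces,
     * punctuation), INHERITED and UNKNOWN. Returns null if nothing else is found.
     */
    static Character.UnicodeScript dominantScript(String text) {
        Map<Character.UnicodeScript, Integer> counts =
                new EnumMap<>(Character.UnicodeScript.class);
        text.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON
                    && script != Character.UnicodeScript.INHERITED
                    && script != Character.UnicodeScript.UNKNOWN) {
                counts.merge(script, 1, Integer::sum);
            }
        });
        Character.UnicodeScript best = null;
        int bestCount = 0;
        for (Map.Entry<Character.UnicodeScript, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String sample = "\u30AB\u30BF\u30AB\u30CA"; // four Katakana letters
        System.out.println(katakanaFraction(sample));
        System.out.println(dominantScript(sample));
        System.out.println(dominantScript("Hello, world!"));
    }
}
```

A result such as `LATIN` narrows the candidates only slightly, since English, French, Turkish and many other languages share that script, which is why this approach stays a guess.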

For reliable language detection, I would abandon the idea of using Unicode ranges as the basis for determining the language and focus on language recognition algorithms instead.


Source: https://habr.com/ru/post/1347111/

