Is there a list of language-only character ranges for UTF-8 somewhere?

I am trying to parse some UTF-8 encoded documents in a way that recognizes the characters of different languages. For my approach to work, I need to ignore non-linguistic characters such as control characters, mathematical symbols, and so on. Just trying to analyze the basic Latin part of the Unicode standard turned up several separate regions, with characters like the division sign sitting in the middle of the range of valid Latin characters.

Is there a list that identifies these regions? Or better yet, a regex that defines the regions, or something in C# that can identify the different kinds of characters?

+3
3 answers

Take a look at the Unicode character categories. You can match them in C# regular expressions with the character class syntax \p{catname}. So, to match a lowercase letter, use \p{Ll}. You can also combine them: [\p{Ll}\p{Lu}] matches characters in either the Ll or the Lu category.
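As a rough illustration of the category syntax described above, here is a minimal sketch that keeps only the letters of a string. It assumes the document has already been decoded from UTF-8 into a .NET string; the class name LetterFilter and the helper KeepLetters are made up for this example. Note that .NET regular expressions work on UTF-16 code units, so letters outside the Basic Multilingual Plane would need separate handling.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class LetterFilter
{
    // \p{L} matches any character in one of the Unicode letter
    // categories (Lu, Ll, Lt, Lm, Lo).
    private static readonly Regex Letter = new Regex(@"\p{L}");

    // Hypothetical helper: returns the input with every non-letter removed.
    public static string KeepLetters(string text)
    {
        return string.Concat(
            Letter.Matches(text).Cast<Match>().Select(m => m.Value));
    }

    public static void Main()
    {
        // U+00F7 (division sign) and U+00D7 (multiplication sign) are in
        // category Sm, so they are dropped; U+00E9 (e with acute) is in Ll.
        // Letters kept: a, b, U+00E9.
        Console.WriteLine(KeepLetters("a\u00F7b \u00D7 2 = \u00E9\t!"));
    }
}
```

Swapping \p{L} for a narrower class such as [\p{Ll}\p{Lu}] would restrict the result to lowercase and uppercase letters only.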

+5

You can use \p{XXX} to match Unicode categories. For example, \p{Cc} matches all control characters.

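I think you can use \w to match all letters (the L* categories). In Unicode mode it is equivalent to [\p{L}\p{Mn}\p{Nd}\p{Pc}], so it also matches combining marks, decimal digits, and connector punctuation such as the underscore. If you want letters only, use \p{L} instead.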

See the category list at http://www.fileformat.info/info/unicode/category/index.htm.
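To see the difference between \w and the letter categories in practice, a small check like the following may help. It is only a sketch; the class name WordVsLetter and its two helper methods are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordVsLetter
{
    // True if the whole string is a single character matched by \w.
    public static bool IsWordChar(string s)
    {
        return Regex.IsMatch(s, @"^\w$");
    }

    // True if the whole string is a single character in a letter category.
    public static bool IsLetter(string s)
    {
        return Regex.IsMatch(s, @"^\p{L}$");
    }

    public static void Main()
    {
        // A letter, a decimal digit (Nd), an underscore (Pc),
        // and the division sign U+00F7 (Sm).
        string[] samples = { "a", "7", "_", "\u00F7" };

        foreach (string s in samples)
        {
            Console.WriteLine(
                $"U+{(int)s[0]:X4}: \\w={IsWordChar(s)}, \\p{{L}}={IsLetter(s)}");
        }
    }
}
```

The digit and the underscore match \w but not \p{L}, while the division sign matches neither, which is why \p{L} is the safer choice when only letters should pass.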

+1

, , C.

+1

Source: https://habr.com/ru/post/1745717/

