Is there a list of language-only character regions for UTF-8 somewhere?

Question

I am trying to parse some UTF-8 encoded documents in a way that recognizes the different characters of each language. For my approach to work, I need to ignore non-language characters such as control characters, mathematical symbols, etc. Just trying to analyze the basic Latin part of the Unicode standard turned up several separate regions, with characters like the division sign sitting in the middle of a range of valid Latin characters.

Is there a list that identifies these regions? Or better yet, a regex that defines the regions, or something in C# that can identify the different kinds of characters?

+3

utf-8 character-encoding nlp

Laserjesus May 17, '10 at 3:15

3 answers

You can use \p{XXX} to match Unicode categories. For example, \p{Cc} matches all control characters.

I think you can use \w to match all letters (L*). In Unicode mode it is roughly equivalent to [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}], so it also matches decimal digits and connector punctuation such as the underscore.

See the category list at http://www.fileformat.info/info/unicode/category/index.htm.
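
A minimal sketch of those two patterns in C#, assuming .NET's System.Text.RegularExpressions. The class and method names (CategoryDemo, StripControls, IsAllWordChars) are made up for illustration and are not part of any library. It strips \p{Cc} characters and shows that \w accepts digits and the underscore but rejects the division sign (U+00F7, category Sm).

```csharp
using System;
using System.Text.RegularExpressions;

public static class CategoryDemo
{
    // Removes every character in Unicode category Cc (control characters).
    public static string StripControls(string input)
    {
        return Regex.Replace(input, @"\p{Cc}", "");
    }

    // True when the whole string consists of \w characters.
    public static bool IsAllWordChars(string input)
    {
        return Regex.IsMatch(input, @"^\w+$");
    }

    public static void Main()
    {
        // BEL (U+0007) and TAB (U+0009) are both in category Cc.
        Console.WriteLine(StripControls("ab\u0007c\td"));    // abcd

        // Accented Latin letters are still letters (category Ll).
        Console.WriteLine(IsAllWordChars("\u00E9t\u00E9"));  // True

        // \w is broader than "letters": digits and underscore pass too.
        Console.WriteLine(IsAllWordChars("a_1"));            // True

        // The division sign is a math symbol (Sm), so it is rejected.
        Console.WriteLine(IsAllWordChars("a\u00F7b"));       // False
    }
}
```

If you only want letters, \w is probably too permissive because of the digits and underscore; a class built from the L* categories is closer to what the question asks for.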

+1

J-16 SDiZ May 17, '10 at 3:25

, , C.

+1

BCS 17 '10 13:09

Matthew Flaschen · Accepted Answer · 2010-05-17T03:21:28+0000

Take a look at the Unicode character categories. You can match them in C# regular expressions with the character class syntax \p{catname}. So, to match a lowercase letter, you would use \p{Ll}. You can also combine them: [\p{Ll}\p{Lu}] matches any character in category Ll or Lu.
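
As a rough illustration of the category approach, here is a small C# sketch. The names LetterFilter, KeepLetters and CountCasedLetters are hypothetical, chosen only for this example. It assumes the document has already been decoded from UTF-8 into a string, since the categories apply to decoded characters and not to the raw bytes.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class LetterFilter
{
    // Drops every character that is not in a letter category.
    // \P{L} (capital P) negates \p{L}, the union of Lu, Ll, Lt, Lm and Lo.
    public static string KeepLetters(string input)
    {
        return Regex.Replace(input, @"\P{L}", "");
    }

    // Counts characters in Ll or Lu only, using the combined class above.
    public static int CountCasedLetters(string input)
    {
        return Regex.Matches(input, @"[\p{Ll}\p{Lu}]").Count;
    }

    public static void Main()
    {
        // "x÷y" as UTF-8 bytes; the division sign U+00F7 encodes as C3 B7.
        byte[] utf8 = { 0x78, 0xC3, 0xB7, 0x79 };
        string text = Encoding.UTF8.GetString(utf8);

        // The division sign is category Sm, so it is filtered out.
        Console.WriteLine(KeepLetters(text));                     // xy

        // 'A' is Lu and 'b' is Ll; U+4E2D is Lo and '1' is Nd, so neither counts.
        Console.WriteLine(CountCasedLetters("Ab\u00F7\u4E2D1"));  // 2
    }
}
```

Using \p{L} rather than listing Ll and Lu keeps letters from scripts without case (category Lo), which is likely what you want when the documents mix languages.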
