I am trying to parse some UTF-8 encoded documents in a way that recognizes the different characters of a language. For my approach to work, I need to ignore non-linguistic characters such as control characters, mathematical symbols, and so on. Even a quick look at just the basic Latin portion of the Unicode standard turned up several separate regions, with characters like the division sign sitting in the middle of a range of otherwise valid Latin letters.
Is there a list that identifies these regions? Or better yet, a regex that defines the regions, or something in C# that can identify the different kinds of characters?