Lexers / Tokenizers and Character Sets

When building a lexer / tokenizer, is it a mistake to rely on functions (in C) such as isdigit / isalpha / ...? As far as I know, they depend on the locale. Should I choose a character set, concentrate on it, and build a character mapping myself from which I look up classifications? Then the problem becomes being able to lex multiple character sets: do I create one lexer / tokenizer per character set, or do I try to write the code so that the only thing I need to change is the character mapping? What are common practices?

+4
4 answers

For now, I would concentrate on getting the lexer working first with the plain ASCII character set; then, once the lexer works, add mapping support for different character types such as UTF-16, and locale support.

And no, it is not a mistake to rely on the ctype functions such as isdigit , isalpha and so on...

Actually, there is a POSIX equivalent of ctype for wide characters, wctype.h, so it might be in your interest to define a macro at a later stage... so that you can transparently change the code to handle different locale sets...

    /* Select character classification at compile time: the wide-character
       functions from <wctype.h> when LEX_WIDECHARS is defined, plain <ctype.h> otherwise. */
    #ifdef LEX_WIDECHARS
    #include <wctype.h>
    #define lex_isdigit iswdigit
    #else
    #include <ctype.h>
    #define lex_isdigit isdigit
    #endif

Something along those lines would be defined in this context...
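As an illustrative sketch of how a lexer might then call such a wrapper (the lex_isdigit name, the narrow fallback definition and the digit-scanning helper below are illustrative assumptions, not part of the original answer):

    #include <ctype.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Narrow fallback, matching the #else branch of the snippet above. */
    #define lex_isdigit isdigit

    /* Count how many consecutive digit characters start at p. */
    static size_t scan_digits(const char *p)
    {
        size_t len = 0;
        while (lex_isdigit((unsigned char)p[len]))
            len++;
        return len;
    }

    int main(void)
    {
        printf("%zu\n", scan_digits("1234abc")); /* prints 4 */
        return 0;
    }

Because only the lex_is* wrapper would change, switching the lexer to wide characters later becomes a recompile rather than a rewrite.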

Hope this helps, Regards, Tom.

+3

The ctype.h functions are not really usable for anything but plain ASCII characters. The default locale is C (essentially the same as ASCII on most machines), regardless of the system's language settings. Even if you use setlocale to change the locale, chances are the system uses a character set with characters wider than 8 bits (for example, UTF-8), in which case you cannot tell anything useful from a single char.

Wide characters handle more cases properly, but even they fail too often.

So, if you want to support a non-ASCII isspace (and friends) reliably, you have to do it yourself (or possibly use an existing library).
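A minimal sketch of the do-it-yourself approach (the table layout and flag names below are illustrative, not from the answer): a 256-entry lookup table filled for one chosen character set, which the lexer indexes instead of calling the ctype functions.

    /* Classification flags for one chosen 8-bit character set. */
    enum {
        CC_DIGIT = 1 << 0,
        CC_ALPHA = 1 << 1,
        CC_SPACE = 1 << 2
    };

    static unsigned char char_class[256];

    /* Fill the table for plain ASCII; supporting another character set
       (e.g. ISO-8859-1) would mean marking additional entries here. */
    void init_char_class(void)
    {
        for (int c = '0'; c <= '9'; c++) char_class[c] |= CC_DIGIT;
        for (int c = 'a'; c <= 'z'; c++) char_class[c] |= CC_ALPHA;
        for (int c = 'A'; c <= 'Z'; c++) char_class[c] |= CC_ALPHA;
        char_class[' ']  |= CC_SPACE;
        char_class['\t'] |= CC_SPACE;
        char_class['\n'] |= CC_SPACE;
        char_class['\r'] |= CC_SPACE;
    }

    /* The lexer asks the table, not the locale. */
    int cc_isdigit(unsigned char c) { return char_class[c] & CC_DIGIT; }
    int cc_isalpha(unsigned char c) { return char_class[c] & CC_ALPHA; }
    int cc_isspace(unsigned char c) { return char_class[c] & CC_SPACE; }

Swapping character sets then only means swapping the table initialization, which is essentially the "change the character mapping" idea from the question.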

Note: ASCII only has character codes 0-127 (or 32-127), and what some call 8-bit ASCII is actually some other character set (commonly CP437, CP1252 or ISO-8859-1, and often something else entirely).

+2

You probably won't get very far trying to build a locale-sensitive parser - it will drive you crazy. ASCII is fine for most parsing needs - don't fight it :D

If you do want to tackle it and use proper character classifications, you should look at the ICU library, which implements Unicode religiously.

+2

Usually, you need to ask yourself:

  • What exactly do you want to parse?
  • Which languages do you want to support: a wide range, or only Western European ones?
  • Which encoding do you want to use: UTF-8 or a localized 8-bit encoding?
  • What OS are you using?

To begin with: if you work with Western languages in a localized 8-bit encoding, then probably yes, you can rely on the is* functions, provided the locales are installed and configured.
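For that case, a minimal sketch (the locale is taken from the environment, so it assumes a suitable 8-bit locale is actually installed):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* Use whatever LC_CTYPE the environment is configured with. */
        if (setlocale(LC_CTYPE, "") == NULL)
            fprintf(stderr, "requested locale is not available\n");

        /* In an ISO-8859-1 locale, 0xE4 is 'a-umlaut' and isalpha() reports it
           as a letter. Always cast char to unsigned char before calling the
           ctype functions to avoid undefined behaviour for negative values. */
        unsigned char c = 0xE4;
        printf("isalpha(0x%X) = %d\n", c, isalpha(c) != 0);
        return 0;
    }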

However:

  • If you work with UTF-8, you cannot, because only ASCII would be covered: everything outside ASCII takes more than one byte (see the sketch after this list).
  • If you want to support East Asian languages, many of your parsing assumptions will be wrong; Chinese, for example, does not use spaces to separate words. Many languages do not even have upper and lower case letters, including alphabetic ones such as Hebrew and Arabic.
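To make the UTF-8 point concrete, here is a minimal sketch (assuming a UTF-8 locale is configured) that decodes one multi-byte character with mbrtowc before classifying it; classifying the raw bytes one at a time would be meaningless.

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");          /* assumes e.g. en_US.UTF-8 */

        const char *s = "\xC3\xA9tat";    /* "état" encoded as UTF-8 */
        mbstate_t st;
        memset(&st, 0, sizeof st);

        /* A single byte of a multi-byte sequence is not a character; decode
           the whole sequence into a wide character, then classify that. */
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, strlen(s), &st);
        if (n != (size_t)-1 && n != (size_t)-2)
            printf("first char is %zu bytes, iswalpha = %d\n",
                   n, iswalpha((wint_t)wc) != 0);
        return 0;
    }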

So what exactly do you want to do?

I would suggest looking at the ICU library, which has various break iterators, or at other toolkits such as Qt that provide some basic boundary analysis.
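For reference, a minimal sketch of word boundary analysis with ICU's C API (ubrk_*); the build command, locale name and sample text are assumptions, and error handling is kept to a minimum.

    /* Build (assumption): cc icu_words.c $(pkg-config --cflags --libs icu-uc) */
    #include <stdio.h>
    #include <unicode/ubrk.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UChar text[64];
        int32_t len = 0;

        /* ICU works on UTF-16 UChar strings, so convert the UTF-8 input first. */
        u_strFromUTF8(text, 64, &len, "Hello wide world", -1, &status);

        /* Word-boundary iterator for an English locale. */
        UBreakIterator *bi = ubrk_open(UBRK_WORD, "en_US", text, len, &status);
        if (U_FAILURE(status)) {
            fprintf(stderr, "ICU error: %s\n", u_errorName(status));
            return 1;
        }

        /* Each pair of successive boundaries delimits a word or a gap. */
        int32_t prev = ubrk_first(bi);
        for (int32_t pos = ubrk_next(bi); pos != UBRK_DONE; pos = ubrk_next(bi)) {
            printf("boundary %d..%d\n", (int)prev, (int)pos);
            prev = pos;
        }
        ubrk_close(bi);
        return 0;
    }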

+1

Source: https://habr.com/ru/post/1300981/

