Lexers / Tokenizers and Character Sets

When building a lexer / tokenizer, is it a mistake to rely on functions (in C) such as isdigit / isalpha / ...? As far as I know, they depend on the locale. Should I choose a character set, concentrate on it, and build a character mapping myself from which I look up classifications? Then the problem becomes being able to lex multiple character sets: do I create one lexer / tokenizer per character set, or do I try to write the code so that the only thing I need to change is the character mapping? What are common practices?

+4
4 answers

For now, I would concentrate on getting the lexer working first with the plain ASCII character set; then, once the lexer works, add mapping support for different character types such as UTF-16, and locale support.

And no, it is not a mistake to rely on the ctype functions such as isdigit , isalpha and so on...

Actually, there is a POSIX equivalent of ctype for wide characters, wctype.h, so it might be in your interest to define a macro at a later stage... so that you can transparently change the code to handle different locale sets...

    /* Select character classification at compile time: the wide-character
       functions from <wctype.h> when LEX_WIDECHARS is defined, plain <ctype.h> otherwise. */
    #ifdef LEX_WIDECHARS
    #include <wctype.h>
    #define lex_isdigit iswdigit
    #else
    #include <ctype.h>
    #define lex_isdigit isdigit
    #endif

Something along those lines would be defined in this context...
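As an illustrative sketch of how a lexer might then call such a wrapper (the lex_isdigit name, the narrow fallback definition and the digit-scanning helper below are illustrative assumptions, not part of the original answer):

    #include <ctype.h>
    #include <stdio.h>
    #include <stddef.h>

    /* Narrow fallback, matching the #else branch of the snippet above. */
    #define lex_isdigit isdigit

    /* Count how many consecutive digit characters start at p. */
    static size_t scan_digits(const char *p)
    {
        size_t len = 0;
        while (lex_isdigit((unsigned char)p[len]))
            len++;
        return len;
    }

    int main(void)
    {
        printf("%zu\n", scan_digits("1234abc")); /* prints 4 */
        return 0;
    }

Because only the lex_is* wrapper would change, switching the lexer to wide characters later becomes a recompile rather than a rewrite.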

Hope this helps, Regards, Tom.

+3

The ctype.h functions are not really usable for anything but plain ASCII characters. The default locale is C (essentially the same as ASCII on most machines), regardless of the system's language settings. Even if you use setlocale to change the locale, chances are the system uses a character set with characters wider than 8 bits (for example, UTF-8), in which case you cannot tell anything useful from a single char.

Wide characters handle more cases properly, but even they fail too often.

So, if you want to support a non-ASCII isspace (and friends) reliably, you have to do it yourself (or possibly use an existing library).
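A minimal sketch of the do-it-yourself approach (the table layout and flag names below are illustrative, not from the answer): a 256-entry lookup table filled for one chosen character set, which the lexer indexes instead of calling the ctype functions.

    /* Classification flags for one chosen 8-bit character set. */
    enum {
        CC_DIGIT = 1 << 0,
        CC_ALPHA = 1 << 1,
        CC_SPACE = 1 << 2
    };

    static unsigned char char_class[256];

    /* Fill the table for plain ASCII; supporting another character set
       (e.g. ISO-8859-1) would mean marking additional entries here. */
    void init_char_class(void)
    {
        for (int c = '0'; c <= '9'; c++) char_class[c] |= CC_DIGIT;
        for (int c = 'a'; c <= 'z'; c++) char_class[c] |= CC_ALPHA;
        for (int c = 'A'; c <= 'Z'; c++) char_class[c] |= CC_ALPHA;
        char_class[' ']  |= CC_SPACE;
        char_class['\t'] |= CC_SPACE;
        char_class['\n'] |= CC_SPACE;
        char_class['\r'] |= CC_SPACE;
    }

    /* The lexer asks the table, not the locale. */
    int cc_isdigit(unsigned char c) { return char_class[c] & CC_DIGIT; }
    int cc_isalpha(unsigned char c) { return char_class[c] & CC_ALPHA; }
    int cc_isspace(unsigned char c) { return char_class[c] & CC_SPACE; }

Swapping character sets then only means swapping the table initialization, which is essentially the "change the character mapping" idea from the question.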

Note: ASCII only has character codes 0-127 (or 32-127), and what some call 8-bit ASCII is actually some other character set (commonly CP437, CP1252 or ISO-8859-1, and often something else entirely).

+2

You probably won't get very far trying to build a locale-sensitive parser - it will drive you crazy. ASCII is fine for most parsing needs - don't fight it :D

If you do want to tackle it and use proper character classifications, you should look at the ICU library, which implements Unicode religiously.

+2

Usually, you need to ask yourself:

  • What exactly do you want to parse?
  • Which languages do you want to support: a wide range, or only Western European ones?
  • Which encoding do you want to use: UTF-8 or a localized 8-bit encoding?
  • What OS are you using?

To begin with: if you work with Western languages in a localized 8-bit encoding, then probably yes, you can rely on the is* functions, provided the locales are installed and configured.
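For that case, a minimal sketch (the locale is taken from the environment, so it assumes a suitable 8-bit locale is actually installed):

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>

    int main(void)
    {
        /* Use whatever LC_CTYPE the environment is configured with. */
        if (setlocale(LC_CTYPE, "") == NULL)
            fprintf(stderr, "requested locale is not available\n");

        /* In an ISO-8859-1 locale, 0xE4 is 'a-umlaut' and isalpha() reports it
           as a letter. Always cast char to unsigned char before calling the
           ctype functions to avoid undefined behaviour for negative values. */
        unsigned char c = 0xE4;
        printf("isalpha(0x%X) = %d\n", c, isalpha(c) != 0);
        return 0;
    }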

However:

  • If you work with UTF-8, you cannot, because only ASCII would be covered: everything outside ASCII takes more than one byte (see the sketch after this list).
  • If you want to support East Asian languages, many of your parsing assumptions will be wrong; Chinese, for example, does not use spaces to separate words. Many languages do not even have upper and lower case letters, including alphabetic ones such as Hebrew and Arabic.
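To make the UTF-8 point concrete, here is a minimal sketch (assuming a UTF-8 locale is configured) that decodes one multi-byte character with mbrtowc before classifying it; classifying the raw bytes one at a time would be meaningless.

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <wchar.h>
    #include <wctype.h>

    int main(void)
    {
        setlocale(LC_CTYPE, "");          /* assumes e.g. en_US.UTF-8 */

        const char *s = "\xC3\xA9tat";    /* "état" encoded as UTF-8 */
        mbstate_t st;
        memset(&st, 0, sizeof st);

        /* A single byte of a multi-byte sequence is not a character; decode
           the whole sequence into a wide character, then classify that. */
        wchar_t wc;
        size_t n = mbrtowc(&wc, s, strlen(s), &st);
        if (n != (size_t)-1 && n != (size_t)-2)
            printf("first char is %zu bytes, iswalpha = %d\n",
                   n, iswalpha((wint_t)wc) != 0);
        return 0;
    }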

So what exactly do you want to do?

I would suggest looking at the ICU library, which has various break iterators, or at other toolkits such as Qt that provide some basic boundary analysis.
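For reference, a minimal sketch of word boundary analysis with ICU's C API (ubrk_*); the build command, locale name and sample text are assumptions, and error handling is kept to a minimum.

    /* Build (assumption): cc icu_words.c $(pkg-config --cflags --libs icu-uc) */
    #include <stdio.h>
    #include <unicode/ubrk.h>
    #include <unicode/ustring.h>

    int main(void)
    {
        UErrorCode status = U_ZERO_ERROR;
        UChar text[64];
        int32_t len = 0;

        /* ICU works on UTF-16 UChar strings, so convert the UTF-8 input first. */
        u_strFromUTF8(text, 64, &len, "Hello wide world", -1, &status);

        /* Word-boundary iterator for an English locale. */
        UBreakIterator *bi = ubrk_open(UBRK_WORD, "en_US", text, len, &status);
        if (U_FAILURE(status)) {
            fprintf(stderr, "ICU error: %s\n", u_errorName(status));
            return 1;
        }

        /* Each pair of successive boundaries delimits a word or a gap. */
        int32_t prev = ubrk_first(bi);
        for (int32_t pos = ubrk_next(bi); pos != UBRK_DONE; pos = ubrk_next(bi)) {
            printf("boundary %d..%d\n", (int)prev, (int)pos);
            prev = pos;
        }
        ubrk_close(bi);
        return 0;
    }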

+1

Source: https://habr.com/ru/post/1300981/

