A token is whatever you want it to be. Traditionally (and for good reason), language specifications broke the analysis into two levels: the first broke the input stream up into tokens, and the second parsed the tokens. (In theory, I think you could write any grammar at a single level, without tokens, or, what amounts to the same thing, using individual characters as the tokens. I wouldn't want to see the results of that for a language like C++, however.) But the definition of what a token is depends entirely on the language you are parsing: most languages, for example, treat whitespace as a separator (but not Fortran); most languages predefine punctuation/operators using punctuation characters and do not allow those characters in symbols (but not COBOL, where "abc-def" would be a single symbol). In some cases (including the C++ preprocessor), what constitutes a token depends on context, so you may need some feedback from the parser. (Hopefully not; leave that sort of thing to the truly experienced.)
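To make "token" a bit more concrete, here is one possible representation as a data structure. The names are purely illustrative, not from any particular library, and the set of kinds would of course depend on the language:

```cpp
#include <string>

// One possible representation of a token: a language-specific kind,
// plus the characters the token was built from.
enum class TokenKind { Identifier, Number, Punctuator, EndOfInput };

struct Token {
    TokenKind   kind;
    std::string text;
};
```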
One thing is probably certain (unless each character is a token): you will have to read ahead in the stream. You usually can't tell whether a token is complete by looking at just a single character. In fact, I've generally found it most useful for the tokenizer to read an entire token at a time, and to keep it for as long as the parser needs it. A function like hasMoreTokens would actually scan a complete token.
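A minimal sketch of that approach, assuming the source is an std::istream; the class name and the token rules (identifiers are runs of letters, everything else is a single character, whitespace separates tokens) are just for illustration, and for brevity a token here is only its text rather than the Token struct above:

```cpp
#include <cctype>
#include <cstdio>    // EOF
#include <istream>
#include <string>

// Sketch of a tokenizer that always scans one complete token ahead and
// buffers it until the parser asks for it.  hasMoreTokens() does the
// real work; next() just hands out the buffered token.
class Tokenizer {
public:
    explicit Tokenizer(std::istream& source) : source_(source) {}

    bool hasMoreTokens()
    {
        if (!buffered_) {
            scanToken();            // read ahead: scan one whole token
            buffered_ = true;
        }
        return !current_.empty();
    }

    std::string next()
    {
        hasMoreTokens();            // make sure a token is buffered
        buffered_ = false;
        return current_;
    }

private:
    void scanToken()
    {
        current_.clear();
        int c = source_.get();
        while (c != EOF && std::isspace(c)) {
            c = source_.get();      // skip whitespace between tokens
        }
        if (c == EOF) {
            return;                 // empty current_ means no more tokens
        }
        current_ += static_cast<char>(c);
        if (std::isalpha(c)) {      // identifier: keep reading letters
            while (std::isalpha(source_.peek())) {
                current_ += static_cast<char>(source_.get());
            }
        }
        // anything else is left as a one-character token
    }

    std::istream& source_;
    std::string   current_;
    bool          buffered_ = false;
};
```

Usage would then be along the lines of: while (tok.hasMoreTokens()) { use(tok.next()); }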
(And while I'm at it: if the source is an istream, istream::peek does not return a char, but an int.)
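For example, a trivial (purely illustrative) use of peek that keeps the result in an int so the end-of-stream value can still be checked:

```cpp
#include <cstdio>    // EOF
#include <iostream>
#include <sstream>

int main()
{
    std::istringstream source("abc");

    int c = source.peek();              // int, not char: may be EOF
    if (c != EOF) {
        std::cout << "next character: " << static_cast<char>(c) << '\n';
    }
}
```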