Note that it is not necessary to distinguish between keywords and symbols at this stage. Doing so reduces the time spent later comparing strings in the parser.
Thanks for the reply.
BTW, I'd like to say a few words about your idea.
For a fixed keyword set, I think a DFA in the lexer can be much faster. Here are my observations.
1. Most compilers' lexers work the way you describe (gcc, clang). I think this is done for flexibility, and I don't think it is the fastest way to scan.
For example, gcc has to support several languages (C, C++, Objective-C), and each language has a different set of keywords. So when the lexer produces a "word token", the parser later looks it up in a symbol table to decide whether it is a keyword in the current language or a general identifier.
Usually this symbol table is a hashtable, so looking up the "word token" is quite fast.
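To make the hashtable approach concrete, here is a minimal sketch (names and token kinds are illustrative, not taken from gcc or clang): the lexer emits every word as a generic token, and a separate lookup decides keyword vs. identifier.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token kinds for illustration only.
enum TokenKind { TK_IDENTIFIER, TK_KW_IF, TK_KW_WHILE, TK_KW_RETURN };

// After the lexer produces a generic "word token", this lookup decides
// whether the text is a keyword of the current language or an identifier.
TokenKind classify_word(const std::string& text) {
    static const std::unordered_map<std::string, TokenKind> keywords = {
        {"if", TK_KW_IF}, {"while", TK_KW_WHILE}, {"return", TK_KW_RETURN},
    };
    auto it = keywords.find(text);
    return it != keywords.end() ? it->second : TK_IDENTIFIER;
}
```

The flexibility is that swapping the keyword table (per language, per dialect) requires no change to the lexer itself.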
2. In my implementation I use the quex lexer generator, which internally generates a DFA (code-directed, which is much faster than a table-driven lexer like flex). Since my parser is specifically a C++ parser, it has a fixed set of keywords that can be defined directly in the lexer grammar, so the lexer itself can distinguish a C++ keyword from a general identifier.
When it meets a keyword, it just returns a type id (an int value); no text is needed, which avoids the hashtable lookup stage.
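A hand-written sketch of the idea behind code-directed matching (this is not quex's generated code, just an illustration with made-up token ids): the character comparisons are compiled into branches, so the keyword is recognized while scanning and only an int comes out.

```cpp
#include <cctype>
#include <cstddef>

// Hypothetical token ids for illustration only.
enum { TK_IDENT = 1, TK_KW_IF = 100, TK_KW_INT = 101 };

// Scan one word starting at p; set *end past it and return a token id.
// Keywords are recognized by direct character branches (no table, no
// hashtable, no string retained), mimicking a code-directed DFA.
int scan_word(const char* p, const char** end) {
    const char* s = p;
    while (std::isalnum(static_cast<unsigned char>(*p)) || *p == '_') ++p;
    *end = p;
    std::size_t n = static_cast<std::size_t>(p - s);
    switch (s[0]) {
    case 'i':
        if (n == 2 && s[1] == 'f') return TK_KW_IF;
        if (n == 3 && s[1] == 'n' && s[2] == 't') return TK_KW_INT;
        break;
    }
    return TK_IDENT;
}
```

Because every branch is ordinary compiled code, there is no per-word hashing or string comparison after scanning, which is where the speedup over the generic approach comes from.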
From my point of view, this approach should be faster. The disadvantage is that the DFA is fixed after generation and cannot vary dynamically; e.g. I can't make the quex-generated lexer recognize a newly added keyword at run time.