Another issue is string construction. As you know, every token string is in fact a sub-string of the source file. (In some special cases the token comes from a macro expansion, but then we can create an auxiliary source string to hold all the expanded text.)
What a lexer does is locate the start point and the end point of the lexeme. For example, in the source code
int main ( ) { int a; .....
    ^  $
Note that when a lexeme is found, the lexer (a Quex lexer here) knows the start position "^" and the end position "$", and it also carries a type enum; in this case it is an "identifier". It is up to the user to handle this information, so if you have a Token class like the one below:
class CCToken
{
    std::string name;
    TokenType type;
};
then the user has to construct each CCToken instance with a memory copy from the source code into the name member, and then set the type member.
I think a better way is:
class CCToken
{
    int source_index;
    int lexeme_start;
    int lexeme_length;
    TokenType type;
};
Here, the first member is an index into the source buffer, and the other two record the start position and length of the lexeme.
Maybe we can also supply a member function such as "std::string CCToken::ToStdString()", which returns a genuinely new std::string. In most cases I think we won't even need lexeme_start and lexeme_length, because we only need to know the TokenType; for example, there are TokenTypes like "keyword_class", "keyword_public", and so on.