But I think things get more complex when parsing C++, because the C++ language is not context free, so syntax analysis alone cannot produce the correct tree; it needs semantic information. So we cannot create a Bison grammar to parse C++ code, because syntax and semantics have to be combined.
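A classic example of that ambiguity (this snippet is only an illustration): whether the token sequence a * b; is a declaration or an expression depends on what the name a refers to, and only a symbol table can tell:

```cpp
// Two readings of the same token sequence "a * b;":
void declaration_reading() {
    using a = int;   // here 'a' names a type...
    a * b;           // ...so this line declares 'b' as a pointer
    (void)b;         // suppress the unused-variable warning
}

void expression_reading() {
    int a = 2, b = 3;
    a * b;           // here 'a' is a variable, so this multiplies 'a' by 'b' (result discarded)
}
```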
I think you are getting the job of the semantic analysis wrong. Let us say we have this code:
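For example, something along these lines (a guess that matches the description below):

```cpp
int main() {
    float x;
    x = "oops";   // problem 1: assigning a string literal to a float
    ++x;          // problem 2 (as described below): pre-incrementing a float
}
```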
You can build an AST from that, and the symtab will record that x is of type float. It is when you run the semantic analysis on it that you will find that both lines have problems: assigning a string literal to a float, and pre-incrementing a float.
And on the preprocessor side, checking an identifier (to see whether it is a keyword, a general variable name, or a function name) is really time consuming and context sensitive. If we skip expanding the #include directives, we always get a partial preprocessing result, so macros may be lost, and messages such as #error will also be incorrect.
Right, I totally forgot to specify that.
During the preprocessing stage you need to build a macro replacements map, or whatever you want to call it. It would record every identifier appearing in a #define identifier directive as an "identifier to be replaced afterwards" or, simply, a "macro". That map would be indexed by the identifier (the macro's name), and store the macro's type (plain macro or function-like macro), its parameters (for function-like macros), and the plain sequence of tokens that follows (the replacement). Keep in mind that that sequence of tokens must NOT be macro-expanded when stored.
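A rough sketch of what such a map could store, with made-up names, just to illustrate the idea:

```cpp
#include <map>
#include <string>
#include <vector>

// One token of preprocessor output (hypothetical representation).
struct Token {
    std::string text;
};

// Entry describing a single #define (names here are illustrative).
struct MacroDefinition {
    bool isFunctionLike = false;          // plain macro vs. function-like macro
    std::vector<std::string> parameters;  // only meaningful for function-like macros
    std::vector<Token> replacement;       // stored UNexpanded, exactly as written
};

// Indexed by the macro's name.
using MacroMap = std::map<std::string, MacroDefinition>;
```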
When the preprocessor finds an identifier, it will search for it in the map. If it is found, build the list of arguments (each argument being a sequence of non-expanded tokens) in case it is a function-like macro, and proceed to do the replacement (expand it) over and over again until no more replacements are made. During this stage you need to keep a sort of call stack to properly handle what could otherwise become a recursive replacement (probably leading to an endless loop). Recursion is something the preprocessor must not do (check the standard).
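A very simplified sketch of that replacement loop, using a set of "active" macro names as the call stack that blocks recursive expansion. Function-like macro arguments and rescanning details are left out, the names are made up, and it reuses the Token and MacroMap types from the sketch above:

```cpp
#include <set>
#include <string>
#include <vector>

// Expand one identifier. Macros already "active" are not re-expanded,
// which is (roughly) the standard's rule that a macro is not replaced
// within its own expansion.
std::vector<Token> expand(const std::string& name,
                          const MacroMap& macros,
                          std::set<std::string>& active) {
    auto it = macros.find(name);
    if (it == macros.end() || active.count(name))
        return { Token{name} };           // not a macro, or would recurse: emit as-is

    active.insert(name);
    std::vector<Token> result;
    for (const Token& tok : it->second.replacement) {
        // Each token of the replacement may itself name a macro; expand it too.
        std::vector<Token> sub = expand(tok.text, macros, active);
        result.insert(result.end(), sub.begin(), sub.end());
    }
    active.erase(name);
    return result;
}
```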
How do we avoid parsing a header file time after time? Is there a PCH-like mechanism?
Well, you could store the result of preprocessing any file found through a #include. It would be indexed by the full file location, "sub-indexed" by the context plus dependencies, and it would store the macro replacements map and the output (the final list of tokens).
The "sub-indexing" is important for a proper handling. The 'context plus dependencies' refers to all macros that were defined just before the file was #include'd, and their values. It is also important to know which other macros would cause the header to produce a different output (due to #if*). It is rather tricky to get it right, and it may cause the parsing to be a lot slower, although quite accurate. That is why, when programming, preprocessed headers should always be the first ones to be included (so they carry in as little context as possible).
In order to improve speed, as well as to simplify the implementation, the "sub-indexing" could be discarded. In other words: parse it once, store it like that, do not care about context. Handling multiple inclusion turns into the annoying part, though (as per this topic, you may want it, but, most of the time, you will not).
[We are still on topic] "Stable" header files (like those that come with the compiler) should be parsed once and stored. You do not want to parse them every single time.
The last two paragraphs are, as far as I know, how the guys at Whole Tomato do it for Visual Assist X.