
Code completion doesn't follow #include in struct


JGM:

--- Quote from: Ceniza on March 26, 2011, 10:23:40 pm ---Not necessarily. It could just store it somewhere for later retrieval. The parsing should continue in case it is a false positive.

--- End quote ---

Yep, I thought that after implementing the ErrorException class :D


--- Quote from: Ceniza on March 26, 2011, 10:23:40 pm ---The preprocessing stage should output tokens, not text.

--- End quote ---

Ahhh true, and I kept thinking how to do it lol

ollydbg:

--- Quote from: Ceniza on March 26, 2011, 10:23:40 pm ---This is what the whole thing would, roughly, look like:

Preprocessor's Lexer -> Preprocessor -> Lexer -> Syntax analysis + symtab generation -> Semantic analysis.

Preprocessor's Lexer: Turns text into preprocessor tokens. Integral and floating-point values would be just "numbers". Keywords should be read as plain identifiers since the preprocessor does not care about them being a separate thing. File and line information is retrieved here.
Preprocessor: Resolves directives (#include, #if*, ...), discards tokens and builds new tokens when necessary (## and # operations). White spaces (space, newline, comments, ...) are, in theory, discarded as well.
Lexer: Converts "numbers" into proper tokens, concatenates contiguous string literals into a single string literal token and turns identifiers into keywords (the ones that are actually keywords, of course).
Syntax analysis: Checks that everything is properly "written" (class_decl ::= ttClass ttIdentifier ttSemiColon). An Abstract Syntax Tree can be built here, plus a symbols table.
Semantic analysis: Checks that everything makes sense: x = 3; // Is x a symbol in the current or a parent scope? Can it be assigned an integral type in any way (x is not const, x is integral, x has an overload of operator = that can be used, 3 can be turned into x's type and assigned, ...)?

That means some token types would not be seen by the preprocessor because its lexer would not produce them, most token types specifically for the preprocessor would have been consumed before reaching the lexer (at the next stage), and those few ones reaching it would be converted before being fed to the syntax analysis stage.

I hope it is clear enough, despite its "roughness".

--- End quote ---
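For illustration, here is a minimal sketch of the two token vocabularies such a staged pipeline would distinguish. The names below are made up for the example (loosely following the ttXxx style above) and are not taken from the real Code::Blocks parser:

--- Code: ---// Sketch only: two separate token vocabularies, one per lexer.
enum PPTokenType            // produced by the preprocessor's lexer
{
    ppIdentifier,           // keywords are still plain identifiers here
    ppNumber,               // integral and floating-point values, undistinguished
    ppStringLiteral,
    ppCharLiteral,
    ppOperator,             // includes # and ##, consumed by the preprocessor
    ppHeaderName,           // only meaningful inside an #include
    ppNewline               // delimits directives, discarded afterwards
};

enum TokenType              // produced by the lexer, fed to syntax analysis
{
    ttIdentifier,
    ttKeyword,              // identifiers promoted to real keywords here
    ttIntLiteral,
    ttFloatLiteral,         // "numbers" split into proper literal tokens
    ttStringLiteral,        // contiguous string literals already concatenated
    ttCharLiteral,
    ttOperator
};
--- End code ---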
very nice info.
But I think things get more complex when parsing C++, because the C++ language is not context-free: the syntax analysis cannot build the correct tree on its own, since it needs semantic information. So we cannot simply write a Bison grammar to parse C++ code; syntax and semantic analysis have to be combined.

And on the preprocessor side, checking an identifier (to see whether it is a keyword or a general variable/function name) is really time-consuming and context-sensitive. If we skip expanding the #include directives, we always get a partial preprocessing result, so macros may be lost, and any later #error-style messages will not be correct either.

How do we avoid parsing a header file time after time? Do we have a PCH-like mechanism?

Ceniza:

--- Quote from: ollydbg on March 27, 2011, 07:57:42 am ---But I think things get more complex when parsing C++, because the C++ language is not context-free: the syntax analysis cannot build the correct tree on its own, since it needs semantic information. So we cannot simply write a Bison grammar to parse C++ code; syntax and semantic analysis have to be combined.

--- End quote ---

I think you are getting the job of the semantic analysis wrong. Let us say we have this code:


--- Code: ---float x = "a value";
++x;
--- End code ---

You can build an AST from that, and the symtab will record that x is of type float. It is only when you run the semantic analysis on it that you will find the problem in the first line: a string literal cannot be assigned to a float. (The second line checks out fine, precisely because the symtab says x is a float, and ++ is valid on arithmetic types.)


--- Quote from: ollydbg on March 27, 2011, 07:57:42 am ---And on the preprocessor side, checking an identifier (to see whether it is a keyword or a general variable/function name) is really time-consuming and context-sensitive. If we skip expanding the #include directives, we always get a partial preprocessing result, so macros may be lost, and any later #error-style messages will not be correct either.

--- End quote ---

Right, I totally forgot to specify that.

During the preprocessing stage you need to build a macro replacements map, or whatever you want to call it. It would register the identifier of every #define as an "identifier to be replaced afterwards" or, simply, a "macro". That map would be indexed by the identifier (the macro's name), and would store the macro's type (plain macro or function-like macro), its parameters (for function-like macros), and the plain sequence of tokens that follows (the replacement). Keep in mind that that sequence of tokens must NOT be macro-expanded when stored.

When the preprocessor finds an identifier, it will search for it in the map. If it is found, build the list of arguments in case it is a function-like macro (each argument being a sequence of non-expanded tokens), and proceed to do the replacement (expand it) over and over again until no more replacements are made. During this stage you need to keep a sort of call stack to properly handle what could otherwise become a recursive replacement (probably leading to an endless loop). Recursion is something the preprocessor must not do (check the standard).
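A very stripped-down sketch of that idea (tokens kept as plain strings, only object-like replacement is actually expanded, no argument substitution and no # / ## handling; all names are invented for the example):

--- Code: ---#include <map>
#include <set>
#include <string>
#include <vector>

struct Macro
{
    bool                     isFunctionLike;
    std::vector<std::string> parameters;   // only used for function-like macros
    std::vector<std::string> replacement;  // stored UNexpanded, as noted above
};

typedef std::map<std::string, Macro> MacroMap;

// 'active' plays the role of the "call stack": a macro that is currently being
// expanded is never expanded again, so recursive replacement cannot happen.
std::vector<std::string> Expand(const std::string&     name,
                                const MacroMap&        macros,
                                std::set<std::string>& active)
{
    MacroMap::const_iterator it = macros.find(name);
    if (it == macros.end() || active.count(name))
        return std::vector<std::string>(1, name); // plain token, or self-reference

    active.insert(name);
    std::vector<std::string> result;
    for (size_t i = 0; i < it->second.replacement.size(); ++i)
    {
        std::vector<std::string> expanded = Expand(it->second.replacement[i], macros, active);
        result.insert(result.end(), expanded.begin(), expanded.end());
    }
    active.erase(name);
    return result;
}
--- End code ---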


--- Quote from: ollydbg on March 27, 2011, 07:57:42 am ---How do we avoid parsing a header file time after time? Do we have a PCH-like mechanism?

--- End quote ---

Well, you could store the result of preprocessing any file found through a #include. It would be indexed by the full file location, "sub-indexed" by the context plus dependencies, and would store the macro replacements map and the output (the final list of tokens).

The "sub-indexing" is important for a proper handling. The 'context plus dependencies' refers to all macros that were defined just before the file was #include'd, and their values. It is also important to know which other macros would cause the header to produce a different output (due to #if*). It is rather tricky to get it right, and it may cause the parsing to be a lot slower, although quite accurate. That is why, when programming, preprocessed headers should always be the first ones to be included (so they carry in as little context as possible).

In order to improve speed, as well as to simplify the implementation, the "sub-indexing" could be discarded. In other words: parse it once, store it like that, do not care about context. Handling multiple inclusion turns into the annoying part, though (as per this topic, you may want it, but, most of the time, you will not). [We are still on topic :P]

"Stable" header files (like those that come with the compiler) should be parsed once and stored. You do not want to parse them every single time.

The last two paragraphs are, as far as I know, how the guys at Whole Tomato do it for Visual Assist X.

ollydbg:

--- Quote from: Ceniza on March 27, 2011, 10:53:03 am ---I think you are getting the job of the semantic analysis wrong. Let us say we have this code:


--- Code: ---float x = "a value";
++x;
--- End code ---

You can build an AST from that, and the symtab will record that x is of type float. It is only when you run the semantic analysis on it that you will find the problem in the first line: a string literal cannot be assigned to a float. (The second line checks out fine, precisely because the symtab says x is a float, and ++ is valid on arithmetic types.)


--- End quote ---
No, I have read some posts/threads on the web; look here:
7 Dealing with Ambiguities
And there is much more about parsing template instantiation code.
This always needs semantic (type) information about the current identifier to support the syntax analysis.

Ceniza:

--- Quote ---The CDT parsers do not compute type information, the result being that some language constructs can be ambiguous, ...

--- End quote ---

According to that, they delay extracting type information until the semantic analysis stage, which is not practical for C/C++. That is completely unnecessary, as the syntax analysis stage knows perfectly well what introduces new types. Since the symtab is populated as you build the AST, and you can also populate a "typetab", you can query that information right away. It is, therefore, possible to know whether x * y; is a declaration or an expression depending on whether or not x is in the symtab/"typetab". Otherwise, you will have to resort to the kind of trickery (ambiguity nodes) the CDT guys did.
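A tiny self-contained example of that ambiguity (the names are made up): the two lines inside f() have exactly the same shape, yet one is a declaration and the other an expression, and only a symtab/"typetab" lookup on the first identifier can tell them apart.

--- Code: ---struct A {};
int B = 1, y = 2;

void f()
{
    A * p;   // A names a type:     declaration of p as a pointer to A
    B * y;   // B names a variable: expression statement (a multiplication)
}
--- End code ---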

Templates, on the other hand, require you to use the keyword typename to resolve the ambiguity in favor of a type (and therefore a declaration); otherwise the dependent name is treated as an expression.
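For instance (an illustrative snippet, not taken from any real code):

--- Code: ---template <typename T>
void g(int n)
{
    T::iterator * n;            // dependent name, assumed NOT to be a type:
                                // parsed as an expression (a multiplication by n)
    typename T::iterator * it;  // 'typename' says it is a type:
                                // this is a declaration of 'it'
}
--- End code ---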
