CodeCompletion plugin

killerbot:
Sounds good. Long live our CC/parsing experts :-)

oBFusCATed:

--- Quote from: ollydbg on April 30, 2011, 05:10:09 pm ---As using quex lexer, parsing is much easier than the current implementation. :D

--- End quote ---
I see no patch that proves it will be good for the CC in C::B :lol:

ollydbg:
@oBFusCATed
There is no such patch, because I use another kind of Token.

CC's current Token (what the Tokenizer class supplies) is just a wxString, so comparing tokens is not very efficient.
The code snippet in DoParse() looks like the one below. Note: the Tokenizer has a hand-written lexer, which just returns a lexeme (a wxString) without any type ID information. Comparing strings is not ideal: we first switch on the token's length, then compare the text again.

--- Code: ---case 6:
    if (token == ParserConsts::kw_delete)
    {
        m_Str.Clear();
        SkipToOneOfChars(ParserConsts::semicolonclbrace);
    }
    else if (token == ParserConsts::kw_switch)
    {
        if (!m_Options.useBuffer || m_Options.bufferSkipBlocks)
            SkipToOneOfChars(ParserConsts::semicolonclbrace, true);
        else
            m_Tokenizer.GetToken(); // skip args
        m_Str.Clear();
    }
    else if (token == ParserConsts::kw_return)
    {
        SkipToOneOfChars(ParserConsts::semicolonclbrace, true);
        m_Str.Clear();
    }
    else if (token == ParserConsts::kw_extern)
    ...

--- End code ---

In my implementation, the Token class carries more precise information (the Quex lexer does the work of filling in these fields). If a token is an identifier, its text field holds the actual lexeme string; if it is a keyword or punctuation, it only needs a type ID and its text can be empty. Briefly, the Token class looks like this:

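--- Code: ---class Token
{
    int type_id;        // token kind: keyword, punctuation, identifier, ...
    string text;        // lexeme; may be empty for keywords and punctuation
    int line_number;
    int column_number;
};
--- End code ---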

So, in my implementation, I use code like the following:

--- Code: ---while (true)
{
    RawToken* tk = PeekToken();

    switch (tk->type_id())
    {
    case TKN_L_BRACE: // {
        {
            SkipBrace();
            break;
        }
    case TKN_R_BRACE: // }
        {
            // the only time we get to find a } is when recursively called by e.g. HandleClass
            // we have to return now...
            cout<<"DoParse(): return from"<<*tk<<tk->line_number()<<":"<<tk->column_number()<<endl;
            ConsumeToken();
            return;
        }
    case TKN_R_PAREN: // )
        {
            cout<<"DoParse(): return from"<<*tk<<tk->line_number()<<":"<<tk->column_number()<<endl;
            ConsumeToken();
            return;
        }
    case TKN_L_PAREN: // (
        {
            SkipParentheses();
            break;
        }
    case TKN_FOR:
    case TKN_WHILE:
        {
            TRACE("handling for or while block");
            HandleForWhile();
        }
    .....

--- End code ---
As you can see, I compare type IDs to distinguish different tokens, so it is just an int comparison instead of a string comparison. The Token also supplies both line and column information.
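
To illustrate the point about integer comparison, here is a minimal self-contained sketch. It is not the actual parser code: the `TokenType` values, the plain `Token` struct and this free-function `SkipBrace` are assumptions made up for the example. It finds the end of a `{ ... }` block by looking only at type IDs, never at token text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds; the real TKN_* values come from the generated lexer.
enum TokenType { TKN_L_BRACE, TKN_R_BRACE, TKN_SEMICOLON, TKN_IDENTIFIER };

// Simplified stand-in for the Token class described above.
struct Token
{
    TokenType   type_id;
    std::string text;          // only filled for identifiers
    int         line_number;
    int         column_number;
};

// Returns the index just past the '}' that matches the '{' at position `start`,
// or tokens.size() if the block is never closed.
std::size_t SkipBrace(const std::vector<Token>& tokens, std::size_t start)
{
    int depth = 0;
    for (std::size_t i = start; i < tokens.size(); ++i)
    {
        switch (tokens[i].type_id)   // integer comparison only, no string compare
        {
        case TKN_L_BRACE:
            ++depth;
            break;
        case TKN_R_BRACE:
            if (--depth == 0)
                return i + 1;
            break;
        default:
            break;
        }
    }
    return tokens.size();
}
```

For the token sequence `{ x { } } ;`, calling `SkipBrace(tokens, 0)` walks over the nested block and returns the index of the `;`.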

I also use separate layers: parserthread->preprocessor->tokenizer. CC's current implementation does preprocessing and parsing in a single class layer, which makes the code hard to read and maintain. :D
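
A rough sketch of what such layering could look like, under heavy assumptions: the class interfaces, the `TKN_HASH_LINE` token kind and the "count statements" job of the parser are all invented for the example and are not taken from the real code. The point is only that each layer pulls tokens from the one below it and has a single responsibility.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token kinds for this example only.
enum TokenType { TKN_HASH_LINE, TKN_IDENTIFIER, TKN_SEMICOLON, TKN_EOF };

struct Token { TokenType type_id; std::string text; };

// Layer 1: hands out raw tokens (from a prepared list instead of a real lexer).
class Tokenizer
{
public:
    explicit Tokenizer(std::vector<Token> tokens) : m_Tokens(std::move(tokens)) {}
    Token Next()
    {
        if (m_Pos < m_Tokens.size())
            return m_Tokens[m_Pos++];
        return Token{TKN_EOF, ""};
    }
private:
    std::vector<Token> m_Tokens;
    std::size_t m_Pos = 0;
};

// Layer 2: swallows directive tokens so the parser never sees them.
// A real preprocessor would evaluate #if, expand macros, and so on.
class Preprocessor
{
public:
    explicit Preprocessor(Tokenizer& tokenizer) : m_Tokenizer(tokenizer) {}
    Token Next()
    {
        Token tk = m_Tokenizer.Next();
        while (tk.type_id == TKN_HASH_LINE)
            tk = m_Tokenizer.Next();
        return tk;
    }
private:
    Tokenizer& m_Tokenizer;
};

// Layer 3: deals with the language grammar only; here it just counts statements.
class ParserThread
{
public:
    explicit ParserThread(Preprocessor& pp) : m_Preprocessor(pp) {}
    int CountStatements()
    {
        int count = 0;
        for (Token tk = m_Preprocessor.Next(); tk.type_id != TKN_EOF; tk = m_Preprocessor.Next())
            if (tk.type_id == TKN_SEMICOLON)
                ++count;
        return count;
    }
private:
    Preprocessor& m_Preprocessor;
};
```

With this split, the parser layer contains no preprocessing logic at all, which is the readability benefit described above.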

I'd like to say: if we adopt a new parser, we will need to change a lot of code...

ptDev:

--- Quote from: ollydbg on May 01, 2011, 04:27:44 am ---@oBFusCATed
There is no such patch, because I use another kind of Token.

CC's current Token (what the Tokenizer class supplies) is just a wxString, so comparing tokens is not very efficient.

[..]

I'd like to say: if we adopt a new parser, we will need to change a lot of code...

--- End quote ---

Please forgive my intrusion.

I am working on a parser for D for a project of my own, and I too have concluded that tokens need an initial classification, both for better efficiency and for better preparation for semantic analysis. Outputting just strings may be handy as an initial approach and may sound like a good idea at first, but some form of "predigestion" is very useful.

Basically, my "tokenizer" (in my case, the class is called Scanner) preliminarily classifies certain tokens such as braces, parentheses, operators, etc. through an enum, and only stores the string in the case of a "word token". Note that it is not necessary to distinguish between keywords and symbols at this stage yet. Doing this reduces the time spent later on comparing strings in the parser.

Example:

--- Code: ---struct Token
{
    TokenType _type;
    wxString _word;
};

--- End code ---

Many simple operators, parentheses, semicolons, commas and braces (the most common tokens in most source code) can then be handled or skipped by comparing an integer, avoiding strcmp()-type operations.
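
A minimal sketch of this kind of scanner, with assumptions stated up front: `std::string` stands in for wxString, and the `TT_*` enum values and the `Scan` function are invented for the example (the real Scanner class is not shown in this thread). Punctuation is classified into an enum value, and the text is stored only for "word tokens".

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical classification; keywords and identifiers are both TT_WORD here.
enum TokenType { TT_WORD, TT_L_BRACE, TT_R_BRACE, TT_L_PAREN, TT_R_PAREN,
                 TT_SEMICOLON, TT_COMMA, TT_OTHER };

struct Token
{
    TokenType   _type;
    std::string _word;   // empty unless _type == TT_WORD
};

static TokenType ClassifyPunctuation(char c)
{
    switch (c)
    {
    case '{': return TT_L_BRACE;
    case '}': return TT_R_BRACE;
    case '(': return TT_L_PAREN;
    case ')': return TT_R_PAREN;
    case ';': return TT_SEMICOLON;
    case ',': return TT_COMMA;
    default:  return TT_OTHER;   // digits, operators etc. are not handled in this sketch
    }
}

std::vector<Token> Scan(const std::string& src)
{
    std::vector<Token> out;
    std::size_t i = 0;
    while (i < src.size())
    {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c))
        {
            ++i;
        }
        else if (std::isalpha(c) || c == '_')
        {
            // Word token: collect [A-Za-z0-9_]* and keep the text.
            const std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_'))
                ++i;
            out.push_back(Token{TT_WORD, src.substr(start, i - start)});
        }
        else
        {
            // Single-character token: only the enum value is stored.
            out.push_back(Token{ClassifyPunctuation(src[i]), std::string()});
            ++i;
        }
    }
    return out;
}
```

Scanning `void f(int a) { }` this way yields eight tokens, of which only `void`, `f`, `int` and `a` carry a string; the parser can test the other four with a plain integer comparison.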

Just to say: ollydbg is spot on, as far as I can see.

ollydbg:

--- Quote ---Note that it is not necessary to distinguish between keywords and symbols at this stage yet. Doing this reduces the time spent later on comparing strings in the parser.
--- End quote ---
Thanks for the reply.

BTW: I have a few comments on your idea.
For a fixed keyword set, I think a DFA in the lexer can be much faster. :D Here are my observations.

1, Most compilers' lexers (gcc, clang) work the way you describe. I think this is done for flexibility, and I don't think it is the fastest way to do scanning. :D
e.g. gcc has to support several languages (C, C++, Objective-C), and different languages have different keyword definitions. So, when the lexer gets a "word token", the parser later checks a symbol table to see whether the "word token" is a keyword in the language or a general identifier.
Usually, this symbol table is a hash table, so looking up the "word token" is quite fast.

2, In my implementation, I use the Quex lexer generator, which internally generates a DFA (directly coded, which is much faster than a table-driven lexer like flex). As my parser is strictly a C++ parser, it has a fixed set of keywords that can be defined in the lexer grammar, so the lexer can distinguish a C++ keyword from a general identifier.
When it meets a keyword, it just returns a type ID (an int value) and no text is needed; this avoids the hash table lookup stage.

From my point of view, this way should be faster. The disadvantage is that the DFA is fixed once it has been generated and cannot vary dynamically, e.g. I can't make the Quex-generated lexer recognize a newly added keyword at run time.
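
The two approaches compared above can be sketched side by side. This is only an illustration under assumptions: the `TKN_*` values, `KeywordTable` and `ClassifyHardCoded` are invented names, and the hand-written nested switch merely stands in for the much larger state machine a generator such as Quex would emit. Words are assumed not to contain embedded '\0' characters.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token kinds for this example only.
enum TokenType { TKN_IDENTIFIER, TKN_CASE, TKN_CLASS, TKN_FOREACH };

// Approach 1: the lexer returns a "word token" and a hash table decides later
// whether it is a keyword. The keyword set can change at run time.
class KeywordTable
{
public:
    KeywordTable()
    {
        m_Keywords["case"]  = TKN_CASE;
        m_Keywords["class"] = TKN_CLASS;
    }
    void AddKeyword(const std::string& word, TokenType type) { m_Keywords[word] = type; }
    TokenType Classify(const std::string& word) const
    {
        const auto it = m_Keywords.find(word);
        return it == m_Keywords.end() ? TKN_IDENTIFIER : it->second;
    }
private:
    std::unordered_map<std::string, TokenType> m_Keywords;
};

// Approach 2: the keyword set is compiled into the code. Each switch level is
// one state of a tiny hard-coded automaton; '\0' from c_str() marks end of input.
TokenType ClassifyHardCoded(const std::string& word)
{
    const char* p = word.c_str();
    switch (*p++)
    {
    case 'c':
        switch (*p++)
        {
        case 'a':
            if (*p++ == 's' && *p++ == 'e' && *p == '\0')
                return TKN_CASE;
            break;
        case 'l':
            if (*p++ == 'a' && *p++ == 's' && *p++ == 's' && *p == '\0')
                return TKN_CLASS;
            break;
        }
        break;
    }
    return TKN_IDENTIFIER;   // anything else is a general identifier
}
```

Both classifiers agree on `case`, `class` and ordinary identifiers. The difference shows up when a keyword is added later: `KeywordTable::AddKeyword("foreach", TKN_FOREACH)` takes effect immediately, while `ClassifyHardCoded` keeps treating `foreach` as an identifier until the code is regenerated and recompiled.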
