Code::Blocks Forums

Developer forums (C::B DEVELOPMENT STRICTLY!) => Development => CodeCompletion redesign => Topic started by: ollydbg on June 29, 2009, 04:10:29 am

Title: Macro replacement in CC(tokenizer) suggestion
Post by: ollydbg on June 29, 2009, 04:10:29 am
Hi, in the current architecture, the function ThisOrReplacement(m_Token) is only called in Tokenizer::GetToken():

Code
wxString Tokenizer::GetToken()
{
    m_UndoTokenIndex = m_TokenIndex;
    m_UndoLineNumber = m_LineNumber;
    m_UndoNestLevel  = m_NestLevel;

    if(m_PeekAvailable)
    {
        m_TokenIndex = m_PeekTokenIndex;
        m_LineNumber = m_PeekLineNumber;
        m_NestLevel  = m_PeekNestLevel;
        m_Token   = m_PeekToken;
    }
    else
        m_Token = DoGetToken();

    m_PeekAvailable = false;

    return ThisOrReplacement(m_Token);
}


To accelerate this macro replacement, I think it should be moved into DoGetToken().

Here are the reasons:

1. ThisOrReplacement(m_Token) internally uses a wxString --> wxString map container, so every call performs a search in this map (normally a search in a balanced BST), and this takes a lot of time.

2. We can avoid calling this function in many situations, for example when m_Token is '{', wxEmptyString, or one of many other strings that can never need macro expansion.

Any comments?

Thanks

 

Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: ollydbg on June 29, 2009, 04:33:14 am
Also, I suggest that when the Tokenizer returns a token (wxString), it should also be combined with a "type", so that the parser can use this type information to do syntax analysis.

If I remember correctly, Ceniza calls this a "Smart lexer"  :D
Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: thomas on June 29, 2009, 09:40:37 am
internally uses a wxString --> wxString map container, so every call performs a search in this map (normally a search in a balanced BST), and this takes a lot of time.

2. We can avoid calling this function in many situations, for example when m_Token is '{', wxEmptyString, or one of many other strings that can never need macro expansion.
I'd be careful with such an optimisation: a map lookup rarely needs more than 4-5 comparisons in total, so if you add too many special cases, the resulting code will be slower (and at the same time more difficult to maintain).
Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: ollydbg on June 29, 2009, 09:57:35 am
I'd be careful with such an optimisation: a map lookup rarely needs more than 4-5 comparisons in total, so if you add too many special cases, the resulting code will be slower (and at the same time more difficult to maintain).
I don't fully understand your comments :(

I mean that when DoGetToken() returns a wxString such as '{', we know that '{' certainly doesn't need macro replacement, so we can avoid calling ThisOrReplacement('{').

Also, there are many tokens like '{' :D



Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: thomas on June 30, 2009, 12:23:17 pm
I'd be careful with such an optimisation: a map lookup rarely needs more than 4-5 comparisons in total, so if you add too many special cases, the resulting code will be slower (and at the same time more difficult to maintain).
I don't fully understand your comments :(
I understood that your idea is to catch special cases which cannot possibly be macros, so they need not be looked up in the map<wxString, wxString>.

In other words, replace code that looks like:
std::map<wxString, wxString>::const_iterator it = the_map.find(token);
return (it != the_map.end()) ? it->second : token;

with something like:
if (token.IsEmpty() || (token == one_constant) || (token == two_constant) || (token == three_constant))
    return token;

std::map<wxString, wxString>::const_iterator it = the_map.find(token);
return (it != the_map.end()) ? it->second : token;



My point is that maps have O(log(n)) lookup, so unless a source has 20 billion preprocessor defines, it is really nothing to worry about. For "normal" amounts, log(n) will be something like 4, maybe 5. Let's assume the worst case of 5. One "operation" is a compare and a branch.

Adding a line like the above will remove 5 operations done by the map lookup in the best case, at a cost of 1-4 additional operations (average 2). So we save 5-1 = 4 operations in the best, and 5-2 = 3 operations in the average case.
In the worst case, it will add 4 operations to the existing 5, almost doubling the work.
This scenario might still be advantageous, but it's not likely that it will be a big win.

Now, you were talking of "many" cases. Let's say "many" means 10. In this case, we will do 1-10 operations (5 average) to eliminate the 5 lookups done by the map.
So, on the average, we replace 5 operations with 5 operations (zero win, but more complicated code), and in the worst case, we add 10 operations, tripling the amount of work done.
Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: oBFusCATed on June 30, 2009, 01:37:44 pm
Another option is to replace std::map with std::tr1::unordered_map, if it is available (gcc > 4.x.y, not sure which is the minimal x though).

std::tr1::unordered_map is a hash map, so the lookup is an O(1) operation on average.
Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: ollydbg on June 30, 2009, 05:13:34 pm

with something like:
if (token.IsEmpty() || (token == one_constant) || (token == two_constant) || (token == three_constant))
    return token;

std::map<wxString, wxString>::const_iterator it = the_map.find(token);
return (it != the_map.end()) ? it->second : token;



Hi thomas, thanks for the full explanation of your point.

But I think you still misunderstand my idea.  :(

Look at the code in DoGetToken(); I have written some pseudo code below. We can add a bool variable:

Code

bool NeedReplacement = false;

if (c == '_' || wxIsalpha(c))
{
    ..........
    NeedReplacement = true;
}
else if (wxIsdigit(CurrentChar()))
{
    .........
    NeedReplacement = false;
}
else if (CurrentChar() == '"' || CurrentChar() == '\'')
{
    ......
    NeedReplacement = false;
}
else if (CurrentChar() == '(')
{
    ......
    NeedReplacement = false;
}
else
{
    .......
}

if (!NeedReplacement)
    return token;
else
    return the_map.find(token);

Because we already know this information about the current token, we can still save quite a lot of time.  :D

Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: ollydbg on June 30, 2009, 05:15:45 pm
Another option is to replace std::map with std::tr1::unordered_map, if it is available (gcc > 4.x.y, not sure which is the minimal x though).

std::tr1::unordered_map is a hash map, so the lookup is an O(1) operation on average.
This is great news.
If it is really true that a hash map runs faster, I do suggest changing to std::tr1::unordered_map. Thanks for the suggestion!!!
Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: Kazade on June 30, 2009, 05:56:14 pm
Another option is to replace std::map with std::tr1::unordered_map, if it is available (gcc > 4.x.y, not sure which is the minimal x though).

std::tr1::unordered_map is a hash map, so the lookup is an O(1) operation on average.

I personally have been using std::tr1::unordered_map for everything for ages (at least a year), and I definitely think it should be used if it's available. Apparently GCC has a __GXX_EXPERIMENTAL_CXX0X__ macro, e.g.:

Code
#ifdef __GXX_EXPERIMENTAL_CXX0X__
    #define CBMap std::tr1::unordered_map
#else
    #define CBMap std::map
#endif

CBMap<wxString, wxString> myMap;

or something like that.

Title: Re: Macro replacement in CC(tokenizer) suggestion
Post by: oBFusCATed on June 30, 2009, 07:10:47 pm
Kazade: unordered containers do not require the experimental C++0x mode; they are in TR1.