Author Topic: ParserThread - switch on token strings  (Read 13226 times)

Offline Alpha

  • Developer
  • Lives here!
  • *****
  • Posts: 1511
ParserThread - switch on token strings
« on: September 05, 2014, 08:03:21 pm »
In ParserThread::DoParse() there are a number of string comparisons.  What if you could just send it through a switch?
Code: [Select]
#include "ugly_macros.h"
switch (HashStr(token))
{
    case ID('f','o','r'):
        break;
    case ID('e','l','s','e'):
        break;
    case ID('e','n','u','m'):
        break;
    default:
        break;
}
Yes, sort of ugly, but it works.  (If C++11 constexpr were allowed, it could be much more beautiful :) .)  I constructed the following macro/template hacks for some other project, and thought they might be useful here.  However, I do not currently have time to test integrating it myself.

Is it actually faster?  I have no idea...

Code: Text
  1. /* helper, truncate to 20 args */
  2. #define _GET_20_ARGS(                                                                                  \
  3.     _0,  _1,  _2,  _3,  _4,  _5,  _6,  _7,  _8,  _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19, \
  4.     ... )                                                                                              \
  5.     _0,  _1,  _2,  _3,  _4,  _5,  _6,  _7,  _8,  _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19
  6.  
  7. /* results in up to 20 arguments, padded with null characters
  8.  * requires at least 1 argument */
  9. #define GET_20_ARGS(...)                                                             \
  10.     _GET_20_ARGS( __VA_ARGS__, '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', \
  11.                    '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0' )
  12.  
  13. /* user interface */
  14. #define ID(...) CompileStrHasher<GET_20_ARGS(__VA_ARGS__)>::value
  15.  
  16. /* hash function (FNV in reverse) */
  17. template< char  _0, char  _1, char  _2, char  _3, char  _4, char  _5, char  _6, char  _7, char  _8, char  _9,
  18.           char _10, char _11, char _12, char _13, char _14, char _15, char _16, char _17, char _18, char _19  >
  19. struct CompileStrHasher
  20. {
  21.     enum
  22.     {
  23.         value = (_0 ^ ID(_1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, _18, _19)) * 16777619u
  24.     };
  25. };
  26.  
  27. /* seed value */
  28. template<>
  29. struct CompileStrHasher<GET_20_ARGS('\0')>
  30. {
  31.     enum
  32.     {
  33.         value = 2166136261u
  34.     };
  35. };
  36.  
  37.  
  38.  
  39. // example usage
  40.  
  41. #if __cplusplus < 201103L
  42. // flat solution (run-time execution)
  43. unsigned HashStr(const char* str)
  44. {
  45.     unsigned value = 2166136261u;
  46.     if (!*str)
  47.         return value;
  48.     const char* end = str;
  49.     for (int i = 0; *(end + 1) && i < 20; ++end, ++i)
  50.         ;
  51.     do
  52.     {
  53.         value = (*end ^ value) * 16777619u;
  54.     } while (str != (end--));
  55.     return value;
  56. }
  57. #else
  58. // recursive C++11 constexpr (compile-time execution)
  59. constexpr unsigned HashStr(const char* str, int depth = 0)
  60. {
  61.     return (*str && depth < 20) ? (*str ^ HashStr(str+1, depth+1)) * 16777619u : 2166136261u;
  62. }
  63. #endif
  64.  
  65. int main()
  66. {
  67.     switch (HashStr("str"))
  68.     {
  69.         case ID('s','t','r'):
  70.             return 0;
  71.  
  72.         default:
  73.             break;
  74.     }
  75.     return 1;
  76. }
  77.  
(Note: I did not spend time trying to optimize the run-time counterpart much.)

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 10369
Re: ParserThread - switch on token strings
« Reply #1 on: September 06, 2014, 01:33:59 am »
I don't know the exact place you're trying to optimize, but I'm almost 100% sure that HashStr will be always called with a dynamic string not known at compile time. So you'll have a loop there.

Keep in mind that all the strxxx functions are pretty well optimized in the libc! And a simple for will be slower.
<debugger plugin maintainer>
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline Alpha

  • Developer
  • Lives here!
  • *****
  • Posts: 1511
Re: ParserThread - switch on token strings
« Reply #2 on: September 06, 2014, 02:24:00 am »
Yes, HashStr() will have to be a run time cost.  However, if the switch itself saves enough over repeated if/else if's on token, it might be worth it.

Okay, no loops :P ... sort of.  (except strlen())
Code: [Select]
#include <cstring>
unsigned HashStr(const char* str)
{
    unsigned value = 2166136261u;
    switch (strlen(str))
    {
        default:
        case 20: value = (str[19] ^ value) * 16777619u;
        case 19: value = (str[18] ^ value) * 16777619u;
        case 18: value = (str[17] ^ value) * 16777619u;
        case 17: value = (str[16] ^ value) * 16777619u;
        case 16: value = (str[15] ^ value) * 16777619u;
        case 15: value = (str[14] ^ value) * 16777619u;
        case 14: value = (str[13] ^ value) * 16777619u;
        case 13: value = (str[12] ^ value) * 16777619u;
        case 12: value = (str[11] ^ value) * 16777619u;
        case 11: value = (str[10] ^ value) * 16777619u;
        case 10: value = (str[ 9] ^ value) * 16777619u;
        case  9: value = (str[ 8] ^ value) * 16777619u;
        case  8: value = (str[ 7] ^ value) * 16777619u;
        case  7: value = (str[ 6] ^ value) * 16777619u;
        case  6: value = (str[ 5] ^ value) * 16777619u;
        case  5: value = (str[ 4] ^ value) * 16777619u;
        case  4: value = (str[ 3] ^ value) * 16777619u;
        case  3: value = (str[ 2] ^ value) * 16777619u;
        case  2: value = (str[ 1] ^ value) * 16777619u;
        case  1: value = (str[ 0] ^ value) * 16777619u;
        case  0:
            return value;
    }
}


Maybe I can find time this weekend to run a few tests.  We shall see...

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 10369
Re: ParserThread - switch on token strings
« Reply #3 on: September 06, 2014, 03:09:44 am »
Then how would you handle hash collisions?
<debugger plugin maintainer>
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 4985
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: ParserThread - switch on token strings
« Reply #4 on: September 06, 2014, 07:26:53 am »
Maybe some off topic, but to accelerate the token compare speed, I think we need a fast lexer, I have both C preprocessor grammar for both Quex - A Fast Universal Lexical Analyzer Generator and re2c scanner generator | SourceForge.net, they can generate very fast lexer, which not only get a lexeme string (like what currently we do in Tokenizer::GetToken()), but also an ID value. So, finally we can compare all the keyword tokens by IDs, not the literal strings. But using them will change our CC's code a lot :(
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline Alpha

  • Developer
  • Lives here!
  • *****
  • Posts: 1511
Re: ParserThread - switch on token strings
« Reply #5 on: September 09, 2014, 02:56:35 am »
Then how would you handle hash collisions?
If the performance from this conversion is significant, and there are no collisions with the list of built in keywords (from testing on a large selection of code, e.g. STL, Boost, wxWidgets, C::B, Linux kernel), I would argue do not handle it.  Why build purposefully wrong code?  Because the hiccups it might cause would be rare in comparison to the rest of the fallacies of the parser (but only worth the rare chance if speed up is significant).

If from testing there are collisions, I would test if performance is still improved when following each switch case by a string check.

[...] I have both C preprocessor grammar for both Quex - A Fast Universal Lexical Analyzer Generator and re2c scanner generator | SourceForge.net, they can generate very fast lexer, which not only get a lexeme string (like what currently we do in Tokenizer::GetToken()), but also an ID value. So, finally we can compare all the keyword tokens by IDs, not the literal strings. But using them will change our CC's code a lot :(
This would be ideal, but a significant job to rewrite.  I personally like Ragel, since mixing it in code feels cleaner/more flexible (and it says it has full DFA minimization, whereas re2c states it cannot yet do that; I do not have much experience with Quex).  Do you have a link to your C preprocessor grammars?  I am curious what they look like.
I have yet to find numbers comparing the speeds of code generated by these programs, though.

... Maybe next weekend I will have time to test...

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 4985
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: ParserThread - switch on token strings
« Reply #6 on: September 09, 2014, 05:10:45 am »
[...] I have both C preprocessor grammar for both Quex - A Fast Universal Lexical Analyzer Generator and re2c scanner generator | SourceForge.net, they can generate very fast lexer, which not only get a lexeme string (like what currently we do in Tokenizer::GetToken()), but also an ID value. So, finally we can compare all the keyword tokens by IDs, not the literal strings. But using them will change our CC's code a lot :(
This would be ideal, but a significant job to rewrite.  I personally like Ragel, since mixing it in code feels cleaner/more flexible (and it says it has full DFA minimization, whereas re2c states it cannot yet do that; I do not have much experience with Quex).  Do you have a link to your C preprocessor grammars?  I am curious what they look like.
Hi, Alpha.
I haven't use Ragle before.
Yes, I have a very dummy project here https://code.google.com/p/quexparser/ which I try to use quex as the lexer, and also a dummy C::B project to create a parser and a symbol tree, but since the I don't have much ability to build a "good" parser, I don't maintain the project for one or two years. :(

About the quex c++ grammar, it is the file: cpp.qx, and I use the file cpp.bat to generate the code from the grammar file. BTW: the grammar file are some complex, it have some modes, and can switch from Proprocessor mode to normal code mode.

To build the generated cpp lexer file, you need lots of headers files from Quex, there is an option to copy the header file needed to the target directory. Compared with Quex, the code base generated by re2c is much simpler and smaller. I don't compare the speed, but I see that Quex has some performance compare about re2c in there benchmark test code.

EDIT
I have my cpp lexer test project for re2c(which contains a grammar file for re2c) uploaded for testing.

EDIT2

For re2c: I see "The generated DFA is not minimal" in its document http://re2c.org/manual.html, its license is public domain.
For Ragel: I see that it "Minimize state machines using Hopcroft's algorithm", so it generates minimal DFA, see http://www.complang.org/ragel/, its license is GPL.
And a CPP grammar for Ragel is already there, see: cppscan.rl, so I will check it. :)


« Last Edit: September 09, 2014, 11:00:42 am by ollydbg »
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 4985
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: ParserThread - switch on token strings
« Reply #7 on: February 28, 2016, 06:20:56 am »
Hi, alpha and all, here are some news about the re2c Release 0.16.
Quote
This release adds a very important step in the process of code generation: minimization of the underlying DFA (deterministic finite automaton). Simply speaking, this means that re2c now generates less code (while the generated code behaves in exactly the same way).

DFA minimization is a very well-known technique and one might expect that any self-respecting lexer generator would definitely use it. So how could it be that re2c didn't? In fact, re2c did use a couple of self-invented tricks to compress the generated code (one interesting technique is constructing tunnel automaton). Some of these tricks were quite buggy (see this bug report for example). Now that re2c does canonical DFA minimization all this stuff is obsolete and has been dropped.
...

The grammar is inside in the C code, see an example here: C++98 lexer
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.