ParserThread - switch on token strings

Developer forums (C::B DEVELOPMENT STRICTLY!) > CodeCompletion redesign


Alpha:

--- Quote from: oBFusCATed on September 06, 2014, 03:09:44 am ---Then how would you handle hash collisions?

--- End quote ---
If the performance gain from this conversion is significant, and testing on a large selection of code (e.g. STL, Boost, wxWidgets, C::B, Linux kernel) shows no collisions with the list of built-in keywords, I would argue for not handling them. Why build purposefully wrong code? Because the hiccups it might cause would be rare compared to the parser's other flaws (but the rare risk is only worth taking if the speedup is significant).

If testing does reveal collisions, I would check whether performance still improves when each switch case is followed by a string comparison.
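
A minimal sketch of what "switch on the hash, then confirm with a string check" could look like. Everything here is an assumption for illustration: the FNV-1a hash, the names HashStr, KeywordId and ClassifyToken, and the three keywords are hypothetical and are not taken from the existing ParserThread code.

```cpp
#include <cstdint>
#include <string>

// FNV-1a (32-bit), picked only for illustration; the thread does not name a hash.
// Written as a single-return recursive constexpr so it also compiles as C++11.
constexpr std::uint32_t HashStr(const char* s, std::uint32_t h = 2166136261u)
{
    return *s ? HashStr(s + 1, (h ^ static_cast<unsigned char>(*s)) * 16777619u) : h;
}

// Hypothetical keyword IDs, not the parser's real ones.
enum KeywordId { kwNone, kwClass, kwStruct, kwNamespace };

KeywordId ClassifyToken(const std::string& token)
{
    // The case labels are hashed at compile time. If two keywords in this
    // list ever hashed to the same value, the duplicate label would be a
    // compile error, so only non-keyword tokens can collide at run time.
    switch (HashStr(token.c_str()))
    {
        case HashStr("class"):
            if (token == "class") return kwClass;         // string check guards against a collision
            break;
        case HashStr("struct"):
            if (token == "struct") return kwStruct;
            break;
        case HashStr("namespace"):
            if (token == "namespace") return kwNamespace;
            break;
        default:
            break;
    }
    return kwNone; // not a keyword, or a colliding non-keyword token
}
```

With the extra comparison, a colliding identifier falls through to kwNone instead of being mistaken for a keyword. Whether the one additional string compare per matched case still leaves a measurable speedup is exactly what would need testing.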

--- Quote from: ollydbg on September 06, 2014, 07:26:53 am ---[...] I have both C preprocessor grammar for both Quex - A Fast Universal Lexical Analyzer Generator and re2c scanner generator | SourceForge.net, they can generate very fast lexer, which not only get a lexeme string (like what currently we do in Tokenizer::GetToken()), but also an ID value. So, finally we can compare all the keyword tokens by IDs, not the literal strings. But using them will change our CC's code a lot :(

--- End quote ---
This would be ideal, but a significant job to rewrite. I personally like Ragel, since mixing it in code feels cleaner/more flexible (and it says it has full DFA minimization, whereas re2c states it cannot yet do that; I do not have much experience with Quex). Do you have a link to your C preprocessor grammars? I am curious what they look like.
I have yet to find numbers comparing the speeds of code generated by these programs, though.
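
To make the "compare by IDs, not the literal strings" idea concrete, here is a hand-written sketch. It is not output from Quex, re2c or Ragel, and the names TokenId, Token, MakeToken and StartsScope are hypothetical; a generated lexer would assign the ID while scanning instead of doing the map lookup shown here.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical token IDs; a generated lexer would define its own set.
enum TokenId { tkIdentifier, tkClass, tkStruct, tkNamespace };

// The lexer hands back both the ID and the lexeme text.
struct Token
{
    TokenId     id;
    std::string lexeme;
};

// Stand-in for the lexer: classify the lexeme once, when the token is created.
Token MakeToken(const std::string& lexeme)
{
    static const std::unordered_map<std::string, TokenId> keywords = {
        {"class", tkClass}, {"struct", tkStruct}, {"namespace", tkNamespace}
    };
    const auto it = keywords.find(lexeme);
    return Token{ it != keywords.end() ? it->second : tkIdentifier, lexeme };
}

// Parser side: a plain integer switch, with no string comparison at all.
bool StartsScope(const Token& tok)
{
    switch (tok.id)
    {
        case tkClass:
        case tkStruct:
        case tkNamespace:
            return true;
        default:
            return false;
    }
}
```

The point of the design is that the string is examined once, in the lexer, and every later decision in the parser works on the integer ID.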

... Maybe next weekend I will have time to test...

ollydbg:

--- Quote from: Alpha on September 09, 2014, 02:56:35 am ---
--- Quote from: ollydbg on September 06, 2014, 07:26:53 am ---[...] I have both C preprocessor grammar for both Quex - A Fast Universal Lexical Analyzer Generator and re2c scanner generator | SourceForge.net, they can generate very fast lexer, which not only get a lexeme string (like what currently we do in Tokenizer::GetToken()), but also an ID value. So, finally we can compare all the keyword tokens by IDs, not the literal strings. But using them will change our CC's code a lot :(

--- End quote ---
This would be ideal, but a significant job to rewrite. I personally like Ragel, since mixing it in code feels cleaner/more flexible (and it says it has full DFA minimization, whereas re2c states it cannot yet do that; I do not have much experience with Quex). Do you have a link to your C preprocessor grammars? I am curious what they look like.

--- End quote ---
Hi, Alpha.
I haven't used Ragel before.
Yes, I have a very dummy project here https://code.google.com/p/quexparser/ in which I tried to use Quex as the lexer, along with a dummy C::B project that creates a parser and a symbol tree. But since I don't have much ability to build a "good" parser, I haven't maintained the project for one or two years. :(

About the Quex C++ grammar: it is in the file cpp.qx, and I use the file cpp.bat to generate the code from the grammar file. BTW: the grammar file is somewhat complex; it has several modes and can switch from preprocessor mode to normal code mode.

To build the generated cpp lexer file, you need a lot of header files from Quex; there is an option to copy the needed header files to the target directory. Compared with Quex, the code generated by re2c is much simpler and smaller. I haven't compared the speed myself, but I see that Quex has some performance comparisons against re2c in its benchmark test code.

EDIT
I have uploaded my cpp lexer test project for re2c (which contains a grammar file for re2c) for testing.

EDIT2

For re2c: I see "The generated DFA is not minimal" in its documentation at http://re2c.org/manual.html; its license is public domain.
For Ragel: I see that it can "Minimize state machines using Hopcroft's algorithm", so it generates a minimal DFA, see http://www.complang.org/ragel/; its license is GPL.
And a C++ grammar for Ragel already exists, see cppscan.rl, so I will check it. :)

ollydbg:
Hi, Alpha and all, here is some news about re2c release 0.16.

--- Quote ---This release adds a very important step in the process of code generation: minimization of the underlying DFA (deterministic finite automaton). Simply speaking, this means that re2c now generates less code (while the generated code behaves in exactly the same way).

DFA minimization is a very well-known technique and one might expect that any self-respecting lexer generator would definitely use it. So how could it be that re2c didn't? In fact, re2c did use a couple of self-invented tricks to compress the generated code (one interesting technique is constructing tunnel automaton). Some of these tricks were quite buggy (see this bug report for example). Now that re2c does canonical DFA minimization all this stuff is obsolete and has been dropped.
...

--- End quote ---

The grammar is embedded in the C code; see an example here: C++98 lexer
