Author Topic: Regular expressions  (Read 74109 times)

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #60 on: March 28, 2008, 10:37:43 am »
"Stay tuned, and stay infected." (I wonder if anyone knows where this quote comes from :D)

I tried to google cheat.  It linked back to this post.  Damn their spiders are getting fast.

You didn't search carefully. It's in the first results page ;)

My expectations were wrong. The truth is that I've been working more on the parser than anything else. My first attempt was to create a simple parser like the one I used to evaluate constant expressions for #if directives, but the grammar as it is defined is not good for such a parser (LL(0)). LL(1) would help a bit more, but it wouldn't be enough. Besides, I don't want to calculate FIRST and FOLLOW to implement the parser. However, the LL(0) attempt gave me an idea: a brute force parser. I'm completely unaware of its performance, but it should just work. Quoting the description found in the YARD page: "The distinguishing characteristic of a R-D parser, is that it employs a recursive brute-force algorithm to attempt to match the various grammar production rules. If a match fails, the parser backs up to the position it was when it attempted to match the failing rule."1. Yet another parser I found, linked from YARD's website, is Biscuit. Right now I'm experimenting with something similar to Biscuit's syntax, but not exactly it. BTW, even Yacc has a special implementation that splits the parser when a grammar is detected as ambiguous, so it analyzes more than one production at a time.

To avoid more false expectations, I'll just say that I'll report more things when I have something to show.

Just for the sake of sharing, you can get the current bizarre code experiment from here.

[1]

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #61 on: March 28, 2008, 02:45:10 pm »
thanks for the link of the source!  :D I was trying to check the sources of your svn repository to study them, but i don't know how  :oops:

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #62 on: March 28, 2008, 02:50:17 pm »
I have a questions, I'm curious to know if the implementation is going to support different file encodings, that should be something to visualize or anticipate, I think :roll:

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #63 on: March 28, 2008, 07:28:00 pm »
thanks for the link of the source!  :D I was trying to check the sources of your svn repository to study them, but i don't know how  :oops:

I have to enable anonymous SVN access first :)

svn://svn.cenizasoft.cjb.net/svnrepos/CCP/trunk

I have a questions, I'm curious to know if the implementation is going to support different file encodings, that should be something to visualize or anticipate, I think :roll:

Actually, the current implementation needs you to provide it with a few classes to be able to work. Right now I have ones using std::string and std::istream. If you want your own, you create them and feed the classes through template arguments. However, I still need to work a bit on that part.

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #64 on: June 14, 2008, 08:07:52 am »
It's been a while, and the news are _NOT_ as good as you would like them to be.

If you check the SVN commit log you'll see I started working on the project again about three weeks ago; however, I changed my proposal to implement ONLY the preprocessor. Implementing the whole thing would take me longer, and I do really want to graduate soon.

Well, the new proposal was approved, the project is finished, and the document is done. However, I might need to change a few things in the document if my director considers it necessary. I hope not :D

I wouldn't mind sharing the document, but it's in Spanish. Maybe the library reference would be of more help, yet it's not as interesting to read as the document itself :P

The next step: optimizing the library. I'm even planning to write a special String type for it to see if it improves performance (I'll do profiling later :P). I was even considering writing an article about the creation process of that String type plus the performance considerations and final implementation. Who knows? Maybe some of you may learn one or two things from it, or I just make you believe something that could not be true :P

Morten promised me to give me his blood in exchange for a good code-completion library. He threatened me too, but that's way OT :)

Let's see how it goes from here. At least the first step has been taken.

There's also a plugin laying around that can be installed in the latest Nightly build that tests the library. It doesn't offer any benefits, but it's good to see that its minimum functionality works.

-- END OF THE MESSAGE --

This message was written using only recycled bytes. The colors for the smileys do NOT use any dangerous substances. If your children eat this message take them to the hospital immediately.

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #65 on: June 14, 2008, 07:33:52 pm »
I wouldn't mind sharing the document, but it's in Spanish. Maybe the library reference would be of more help, yet it's not as interesting to read as the document itself :P

My native language is spanish  :D

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #66 on: June 15, 2008, 07:20:23 am »
I wouldn't mind sharing the document, but it's in Spanish. Maybe the library reference would be of more help, yet it's not as interesting to read as the document itself :P

My native language is spanish  :D

I know :D

If that means you want to read the file then suit yourself.

BTW, that's just a preliminary version... which I hope becomes the final one.

I also started writing the document about the creation process of the String type. So far it only has the preliminary analysis, but may be a good read too. Link.

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #67 on: June 16, 2008, 06:20:24 pm »
I was reading your project documents and liked very much the irc chat conversation  :P I felt like the beginner. I haven't finished reading but I'm at page 37 (and reading) I can say that you could be a great professor (you are really good with words)  :) Your university faculty should approve that immediately is an A++.

Also I partially read your document for the new string implementation class and it's true. I confronted  all those things while implementing a csv reader and writer I made, reading one byte at time was pretty slow, opening the hole file at once was amazingly fast, but using a bigger buffer should help. There should be functions to detect the size of the string to store and if it's to big for memory to handle parse it using a big amount of buffer depending on the size of the string, else store it completely on memory and parse it there since it is much faster.

The reading was pretty cool, implementing all those things properly should produce a really high speed c/c++ parser.

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #68 on: June 17, 2008, 02:29:22 am »
I was reading your project documents and liked very much the irc chat conversation  :P I felt like the beginner. I haven't finished reading but I'm at page 37 (and reading) I can say that you could be a great professor (you are really good with words)  :) Your university faculty should approve that immediately is an A++.

Heh, thanks. Actually, I was a bit afraid of having such an introduction for the document, but my thesis director said it was OK. I also handed over the document today. Let's hope I graduate soon :)

Thanks for the 'professor' thing too. In fact, two proposals that I've received have been to teach: one at the University and the other one in an English institution. The idea sounds nice, but I prefer to write programs at the moment. I could teach later :P

The reading was pretty cool, implementing all those things properly should produce a really high speed c/c++ parser.

That's the idea. I'm not completely satisfied by the speed of the current implementation. If you want to add a syntactic and semantic analysis too, the speed of the preprocessor should be very good. I'll finish that string type first, check the performance gain (or loss) and try to optimize for real after doing some profiling.

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #69 on: July 02, 2008, 03:55:54 am »
Quick news: my thesis has been approved and was graded with a "sobresaliente". The meaning of that word is something like "pretty good or ~excellent". Thesis are graded like this in here: not-approved, approved, "sobresaliente" and laureate. Not bad :D

In other news... I've finished the document where I implement the new string type for the preprocessor. It also means that I've finished the whole thing. Quoting from my own document:

The tests used three different types of string: STDString, CCPString<SimpleAllocator> and CCPString<CCPStringAllocator>. Here are the averaged results:

STDString: 0.409 s
CCPString<SimpleAllocator>: 0.193 s
CCPString<CCPStringAllocator>: 0.182 s

For being a SWAG we achieved a 2.12x performance gain using SimpleAllocator and 2.25x using CCPStringAllocator. Not bad!

What about memory usage? Let's take a look:

STDString: 11864 KiB
CCPString<SimpleAllocator>: 11792 KiB
CCPString<CCPStringAllocator>: 18496 KiB

Now the situation changes a bit. When using the SimpleAllocator we save less than 1% of memory, but using CCPStringAllocator increases the memory usage by a 56%, both compared to the memory usage when using STDString. I'm still curious about it. It'd be a good idea to play with different values for blockSize and see how the program behaves.

For those who also care about executable size, here are the results for each case:

STDString: 266.87 KiB
CCPString<SimpleAllocator>: 210.80 KiB
CCPString<CCPStringAllocator>: 238.81 KiB

Overall, SimpleAllocator has the best performance/memory usage/executable size ratio. Still, CCPStringAllocator gives the highest performance, but not for much compared to SimpleAllocator (about 6%).

I'd say we achieved our goal.

More than a 2x performance gain, and the library itself can be optimized too.

Next step: the painful parser.

Finally, to comply with my own disclaimer:

Quote from: Disclaimer
This document is copyrighted by me (Paúl Andrés Jiménez). If you reproduce this file or parts of it in your website or a document, print it, save it, borrow it, share it, whatever... you're gonna Burn in Hell (it's actually a nice song by Dimmu Borgir). If you try to take ownership of this file, I'll f*cking sue you! That is, if I ever find out that you did. If burning in Hell is good enough for you, just don't forget to include this disclaimer. Also, all the names of companies, products and such are owned by their owners (doh!).

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #70 on: July 03, 2008, 08:17:31 am »
great! congratulations!

That disclaimer is funny  :lol: