Author Topic: Regular expressions  (Read 73260 times)

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #45 on: March 13, 2008, 10:50:40 pm »
Right now I think the best step forward would include the following:
...
3. have someone play with flex, come up with a good language spec for c++(worry about C later)
4. have someone else play with bison, and work on the C++ file for the parser.

3. Why flex? Besides, I already have the whole preprocessor's lexer working. The only symbols not yet generated (because they belong to another stage) are integer- and floating_point-literals. The preprocessor's lexer generates ppnumbers instead. Another missing part is string-literal concatenation (which will be done by the real lexer after it has retrieved preprocessed tokens).
4. I won't be up for that task. I already tried it with bison, and it was painful enough. If anyone feels like trying, please remember to use a code generation that plays nice with multi-threading (bison's default one doesn't).

For languages other than C and C++, flex and bison would be just fine. C and, even more, C++ are a PITA (Pain In The Ass, before someone asks) to parse by any means.

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #46 on: March 14, 2008, 04:27:27 am »
PITA (Pain In The Ass, before someone asks)

Thanks for anticipating  :lol:

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #47 on: March 14, 2008, 04:34:32 am »
Wow. I go away for a day or two and look what happens. This is really starting to come together as a team effort.

I like your attitude  :)

JGM> As for decoupling the language, the reason why you can't simply make the symbol table abstract is because languages are very different.

Yep, i will start having a really bad time, but then if theres is going to be a different parser for each language, I think would be good to standardize the parser structures, or create some abstract (interfaces) classes where others may derivate.

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #48 on: March 14, 2008, 06:07:11 pm »
Yep, i will start having a really bad time, but then if theres is going to be a different parser for each language, I think would be good to standardize the parser structures, or create some abstract (interfaces) classes where others may derivate.
True. Parsing in general is a PITA. My brain is fried from this week, but here goes:

1. Symbols from a language are placed into a symbol table.
2. The user requests a completion of the current symbol.
3. the C::C plugin determines the correct scope of the symbol(global, local, class)
4. Compare the current symbol against the proper scopes.
5. Provide a composite list to the user.

That kind of glazes over the general process, and I'm sure that eran can explain it in further detail. As for providing this functionality for multiple languages, it should be pretty easy. Figuring out how to generalize the symbol class should be a priority in order to achieve that. What we have should handle most languages, but it may need more tweaking.

Cenzina> Don't worry about it. I threw out flex as an option because it's more or less a standard. I also mentioned it because it may be easier to generate the lexer from a document specification, rather than a hard-coded version. Besides, we can always edit the code generated by yacc or bison or flex if it doesn't suit our needs.

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #49 on: March 14, 2008, 08:18:11 pm »
1. Symbols from a language are placed into a symbol table.
2. The user requests a completion of the current symbol.
3. the C::C plugin determines the correct scope of the symbol(global, local, class)
4. Compare the current symbol against the proper scopes.
5. Provide a composite list to the user.

Also we need to know a detailed insight on how wxScintilla works for this purposes, if some one that has already worked with it could explain something  :)

Offline byo

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 836
Re: Regular expressions
« Reply #50 on: March 14, 2008, 09:55:09 pm »
Also we need to know a detailed insight on how wxScintilla works for this purposes, if some one that has already worked with it could explain something  :)

I've done some investigation in this area and here are results (may not be complete list but hope it will be useful anyway):

wxScintilla does notify about each it's event using wxScintillaEvent - this one is processed inside cbEditor class. The most imoprtant for us is cbEditor::OnScintillaEvent (around line 2614 in cbEditor.cpp) - it call editor hooks. Editor hooks are generally some functions from plugins which are called when something inside editor change.

Next one place where such hook arrives is CodeCompletion::EditorEventHook function (around line 1731 in plugins/codecompletion/codecompletion.cpp). This function is responsible for detecting whether we should show completion window / function tooltip.

Function tooltip is fired when user enters '(' key and is closed when user enters ')' one. Tooltip is prepared in CodeCompletion::ShowCallTip() function (line 734) - it uses NativeParser::GetCallTips and NativeParser::GetCallTipComas() but I haven't look into them yet.

Code completion window may be fired when user enters '<' key (for list of includes), '.' , '->' or '::'. Some basic  checks are performed to detect whether we can complete by fetching style of entered character ( control->GetStyleAt() ) - this style is generated by wxScintilla while colourising the code.
Code completion window can also be fired when user entered required number of characters in identifier and when required timeout occured. Window can also be fired when user press ctrl+space.

The processing of completion list is done inside CodeCompletion::DoCodeComplete. If this function detects that we're in preprocessor block, it calls CodeCompleteIncludes(), otherwise it calls CodeComplete(). I didn't look into these functions either.

So in general, I've written what happens in 2. point in stevens' list.
3 / 4 and 5 are done in CodeCompleteIncludes and CodeComplete.
1 is done either when project is being loaded or when file is being saved (I tend to press ctrl+shift+S to save all files frequently so it forces symbols to be refreshed ;) ).

Regards
   BYO


Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #51 on: March 15, 2008, 12:02:11 am »
Nice!  :D

The processing of completion list is done inside CodeCompletion::DoCodeComplete. If this function detects that we're in preprocessor block, it calls CodeCompleteIncludes(), otherwise it calls CodeComplete(). I didn't look into these functions either.

This is an important fact since we should divide the preprocessors symbols on another table, since like discussed before we are working we 2 different languages

What I still wonder is how we know what text the user entered before the symbols '.' '->' '::' etc...

Does we have to check in current row and column back until a space is found to see what the user typed, or this process is done by re-parsing the file again and again to store the position of symbols?

For example:

AnObject MyObject:

//When I press the '.' character
MyObject.
             ^We start counting back until we fine a space or \r, \n to get the word the user typed previously to
                search on the symbols table

I don't know if wxScintilla already provides this functionality.

Another problems are tooltips shown on codeblocks when opening a '(' are not persistent.

For Example:

//Function Prototype
void MyFunction(string str, int num, double pi, bool isGreat);

main()
{
   string myString;
    MyFunction(
                      ^Ok here is shown a tool tip with void MyFunction(string str, int num, double pi, bool isGreat)
}

The problem is when I write the first parameter, I use ctrl-space to search for myString and the tool tip disappears, so theres also the need to check the source when the user press the ',' to see if inside a functions parameters.

I should launch codeblocks get wxScintilla and start playing with it and the documentation :oops:
« Last Edit: March 15, 2008, 12:11:02 am by JGM »

Offline byo

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 836
Re: Regular expressions
« Reply #52 on: March 15, 2008, 12:59:55 am »
What I still wonder is how we know what text the user entered before the symbols '.' '->' '::' etc...

Does we have to check in current row and column back until a space is found to see what the user typed, or this process is done by re-parsing the file again and again to store the position of symbols?


The most certain solution would be like this:
  • We fire parser from the beginning of the file
  • Parse files up till position of newly entered character (each #include directive should try to use some cached list of symbols instead of including real file content)
  • When parser reaches current position, it has some local lookup table (or other structures) which would be used when current/next token would be parsed - these table is  exactly what we need - list of symbols which match at current position.

Ok, but this may be time-consuming. So we could optimize it by skipping bodies of functions as I mentioned before.

To do this:
  • Start parsing from the beginning
  • when parser detects function body ( '{' token after function header ), it skipps tokens till there's closing '}' token (watch out for internal {} blocks) - it should use standard tokenization and preprocessing to skip all '{' and '}' inside comments strings, macro expansions, etc.
  • If we jump out of function body, we can continue normal parsing
  • Now if during the skipping processing we pass current position, this function is the one we have to parse correctly so we get back (or if toknes inside funciton were stored somewhere we jump to the beginning of this list) and do proper parsing.
  • If current token comes outside function we can handle it depending on symbol's context (like helping with includes)

This can be optimized futrher by checking what has changed in the file since previous processing. In most cases, code will be added sequentially so we would be able to use parser's local symbol tables from previous processing only altering them a little bit.

I may be wrong but current implementation in C::B looks like this:
  • Parse "using namespace" declarations
  • Detect what function we're in
  • Parse arguments of that function
  • Parse body of the function
  • Add symbols from function's scope
  • Build list of symbols

BYO

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #53 on: March 16, 2008, 01:34:18 am »
BYO&JGM> Interesting discussion. The way I see it, there are 2 ways to handle calls:

1. Reparse the entire file in an optimized fashion
2. moving backwards, skip whitespace, then grab the token. Perhaps a quick function that takes a file position and grabs the token at that location would be useful. Note that this can be somewhat dangerous, as it could lead to a quickly outdated symbol table, unless all the conditions for adding/removing symbols are carefully designated.

So while we probably want to start out with the first, we can consider the second for a possible extension if optimization of this process is needed.

Also, I thought of another feature and a few more ways C::C should be called. The new feature would be tips on variables, showing the type, and possibly, the scope (if relevant). As for the events, C::C should be called on a mouse hover over an identifier after a timeout. Useful, in a pinch.

I'm updating the wiki with some of this, and rewriting a good chunk of what's already there. I can't sleep, and I've got a full pot of coffee.   :mrgreen:

Almost forgot. I'm also looking at sources of a BNF definition of C++ so we can feed that into bison&flex. I've seen some promising stuff.

Edit:
I've been poking around, and found some interesting stuff. First off, C++ grammar is not LALR compatible. So without some inaccuracy, flex is unusable, although bison supposedly supports the more general parsing method. The tradeoff is a big hike in inefficiency in order to work out all the accuracy issues(which would only pop up in somewhat extreme cases, as I'm told).

JGM> found an interesting read you might like: here

Also, I was able to find a BNF of C++ here, but it's put in with hyperlinks, and I'm not in a mood to remove by hand (or even write a script to do it for me), so I'll search some more. the GCC page may have something.

Ummm...we have a problem. We've been talking about creating a code completion plugin for C++. Which version of C++? Reading the gcc manual, I remembered that the language is still being developed, and that there are many(at least 4 official, final) standards out there.

In any case, I've found a few files:
1. Appendix A of "The C++ Programming Language", Special Edition, which is apparently C++98, and is a bit convoluted, to say the least.
2. A supposedly broken grammar with and without actions(no clue)

In any case, apparently gcc switched from using a yacc/flex style parser to a hand-written recursive one a few years ago(around 2002), and have been using that since then. Poking around in a copy of the 4.2.1 branch, there are a few yacc files, but nothing really all that useful. We may need to write a new grammar from scratch(painful to think about, but unless anyone can come up with a better alternative...)
« Last Edit: March 16, 2008, 04:14:34 am by stevenkaras »

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #54 on: March 16, 2008, 04:34:38 am »
Wow, just read the wiki and I'm getting scare with all those examples on Complex cases of usage, and this is something that really needs some deep thinking  :shock: we should wrap g++ code and modify it to fit all these needs (just a joke) :lol:

Well so we have a symbols class and initial requirements. should we start thinking on a parser class?
Also, does we are going to use 2 different parsers? preprocessors and normal c++ syntax?

typedef map<tokenIdentifier, stringRepresentation> Tokens;
//Maybe not quite nice, just thought about, token[PLUS] == CurrentString found by parser, maybe would create a nice and readable syntax

class PreprocessorsParser
{
    Tokens m_Tokens;
    vector<Symbols> m_SymbolsTable;
    vector<string> m_Defines;
    vector<string> m_Directories;
    string m_Extensions;
   
    void AddDefine(const string& define);
    void AddDirectory(const string& directory);
    void SetFilesExtension(const string& extensions); //Should be a comma or space seperated
    vector<Symbols> GetSymbolsTable();
    string GetParsedFile(); //This would return a string representing a cleaned file of unneeded macros such as #if, #ifndef #elif, #endif
    void StartParsing();
};

class Parser
{
    Tokens m_Tokens;
    vector<Symbols> m_SymbolsTable;
    vector<string> m_Directories;
    string m_Extensions;
   
    void AddDirectory(const string& directory);
    void SetFilesExtension(const string& extensions); //Should be a comma or space seperated
    vector<Symbols> GetSymbolsTable();
    void StartParsing();
};

Two really incomplete examples just for further working. :wink: just keep it simple and easy to follow.
I'm really sleepy  :oops:

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #55 on: March 17, 2008, 07:22:59 pm »
If anyone feels interested, I just finished documenting the code of what I have so far. The documentation uses Doxygen, and the output is here.

The 7z file expands to what the html folder has.

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #56 on: March 17, 2008, 10:53:23 pm »
good news!  :D

Offline rathapolk

  • Multiple posting newcomer
  • *
  • Posts: 10
Re: Regular expressions
« Reply #57 on: March 18, 2008, 08:42:09 am »
Other Opinion

1. Move the cursor back to find the beginning of the first deepest local scope '{'.
2. Parse from the beginning of the scope to the user's cursor position and create a local symbol table.
3. if the symbol not found, move the cursor back to the bigger scope and parse till the beginning of the previous scope. go on.. until the cursor moved out of a function, then use the global symbol table.

this may help when scope braces are not perfectly matched, in upper functions.

Rathapol.K

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #58 on: March 20, 2008, 12:00:55 am »
Just to keep you informed, I'm about to start working on the parser. I'll surely work a bit more on the preprocessor (bugfixing, retouching, cleaning up code, ...), but, as it is right now, I have enough to start. In fact, I could say it is 99% complete by now. Unfortunately, the computer where I'm storing it is currently stuck (surely online, but unresponsive), and I'm a bit too far away from it to reset it myself (8.886,28 km according to Google Earth).

Also, just for fun, I ran through the current implementation of my preprocessor some evil code using macros that I wrote today to implement this (it's part of the Lexer, and it tells you in a single pass if a PPNUMBER is an INTEGERLITERAL or a FLOATINGLITERAL), then took that output, ran it through the AStyle plugin, copy&pasted it into the original file (disabling temporarily the one with macros), compiled and ran the program. It compiled and ran as expected, so I guess it's working fine so far.

Now, before I start working on the parser, I'll design the symbol table for it and start analyzing some productions. I'll also check your ideas and see what I can borrow from them. I expect to have something to show you next week.

"Stay tuned, and stay infected." (I wonder if anyone knows where this quote comes from :D)

Offline Seronis

  • Almost regular
  • **
  • Posts: 197
Re: Regular expressions
« Reply #59 on: March 20, 2008, 02:50:47 am »
"Stay tuned, and stay infected." (I wonder if anyone knows where this quote comes from :D)

I tried to google cheat.  It linked back to this post.  Damn their spiders are getting fast.