Author Topic: Regular expressions  (Read 65605 times)

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Regular expressions
« on: December 05, 2007, 04:02:35 am »
Could it be possible to create a parser using regular expressions?

I was thinking on the design of a code completion plugin using regex with container classes.

For example:
FileArray - a container class that stores all the files.

The FileArray will use an array of file classes with members that indicate if header or source file, it's full path and name. Each file class would have other containers with it's contents, like includes, namespaces, defines, classes, structs, enums, typedef, unions, functions, variables etc...

So first we open a file as a stream search for includes like this "incomplete example": (#include)( )*([\"\<])(.*)([\"\>]) then parse it and remove that entry from the stream to continue parsing: namespaces "another incomplete or bad regexp" (namespace)([\r\n\t ])+([_0-9a-zA-Z])+([\r\n\t ])+([\{])+(.*)*([\}])+, also remove all the results found by the regexp from the stream an continue parsing other elements of c/c++.

Also the FileArray would have member functions to indicate the include paths to search for sources and includes.

I think it will be more easy to maintain since all the elements of the language would be divided in classes and the parsing made by the regexp engine.

There should be a class for every element like the ones mentioned above: functions, defines, variables, namespaces, blah blah blah with an interface that make some functions and members available to all, like file to parse, parsefunction, etc...

An example would be a class for Classes that is included as an array(container) on the FileArray class. The class would use some arrays(containers) of Function class, Variables class, Enum class, Struct class, Union class, etc...

I would like to work on something like this, but I don't know if it's a silly idea. More experienced programmers opinions would be nice.

Offline eranif

  • Regular
  • ***
  • Posts: 256
Re: Regular expressions
« Reply #1 on: December 05, 2007, 07:26:37 am »
Could it be possible to create a parser using regular expressions?

I dont want to be the party pooper, but C++ is way to complex for regex (C++ contains too many ambiguities for regex to handle).
You will have the bring in the big guns to do this (some advance parsers such has ANTLR, PCCTS and other parsers that has N look ahead).

GNU's parser for C++, is based on Yacc & Flex. And even in those tools, parsing of C++ is too complex (I think their grammar contains some shift/shift, shift/reduce conflicts).

Still, I think it is possible to create a parser based on yacc & flex which will do most of the work (since you are not building a compiler here, it should be enough).

Eran


Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #2 on: December 05, 2007, 02:22:32 pm »
Still, I think it is possible to create a parser based on yacc & flex which will do most of the work

Never heard of that!
I'm reading this little guide http://epaperpress.com/lexandyacc/  and http://flex.sourceforge.net/manual/
But at first it's to complex to digest  :?

May be regexp could be used to parse simple things as unions, enums, typedef, variables, defines and others, while some manual intervention on classes, namespaces, templates since I dont know if it is possible to check on regexp for the right } ending bracket, since namespaces use {} and inside of them the use of {} is also. With regexp we could search for the words "namespace alfanum", but I don't know if regexp libraries return you the position on the string where it found it, so we jump there and search for the corresponding ending } bracket. Then with the content of the namespace {...} perform other parsing functions for advanced types as classes leaving the basic types as variable declarations, unions, enums, for later.

Offline eranif

  • Regular
  • ***
  • Posts: 256
Re: Regular expressions
« Reply #3 on: December 05, 2007, 03:13:04 pm »
At most, you will succeed in parsing very basic C++ expression, but once it comes to really complex expressions, such as:

Code
namespace MyNS {
template <typename T, typename Ty>
class MyClass :
/* some comment here */
public Singleton<MyClass> , public Factory<MyClass>, private SomeOtherClass<MyClass>
{
//Now make sure you ignore this comment as well
};
};//NS
(and I have seen such code...), I fear the you will hit some serious issues.

Also, to make a parser you need lookup table to distinguish between typenames and identifier,
for example:

Code
class MyClass{};
At this line the parser should consider MyClass as identifier, once the parsing of this line is completed, MyClass is inserted into a lookup table and marked as typename, so line like this:
Code
MyClass cls;
will be parsed correctly and considered cls as identifier (The parser should recognize MyClass as the typename and cls as the identifier)

And dont let me start talking about scoping... :wink:

If you really want to create a plugin to be a competitor for the current CodeCompletion plugin, I suggest you go and have a look at ctags
http://ctags.sf.net (contact me if you need more help)

Regex is good for lexing (tokenizing, and in fact the flex grammar uses regex a lot) not for grammar.

There are other advantages of using generated parsers and hand crafted ones:
Changing a grammar file is a lot easier than modifying an existing code, which contains patch over patch, until time comes and the code is smarter than its creator :D
 
 Eran


Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Regular expressions
« Reply #4 on: December 05, 2007, 03:29:03 pm »
Still, I think it is possible to create a parser based on yacc & flex which will do most of the work
Never heard of that!
Ceniza has done that, worked ok too.
Regex is in my opinion unsuitable mostly because it is far too slow. yacc and flex evaluate rules at compile time (or rather at pre-compile time) and hack together a C file from that which does just one thing, and nothing more.
Regex does everything at runtime.
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #5 on: December 05, 2007, 04:12:36 pm »
Ceniza has done that, worked ok too.

So the work of Ceniza will be implemented for the future code completion plugin?

If that's true, it is great! Then there should not be any other work done by others.

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Regular expressions
« Reply #6 on: December 05, 2007, 05:00:33 pm »
So the work of Ceniza will be implemented for the future code completion plugin?
If that's true, it is great! Then there should not be any other work done by others.
Of course it's true, why would I make up such a story. It's not even April Fool's day  :)
However, "the future code completion" will probably be using a different implementation, not that one.

Then there should not be any other work done by others.
There is no reason why someone capable of coming up with a better solution than the present one should not work on it. It can only be for good.
Either way, parsing is not the only serious issue in code completion. In fact, the parser is not that bad at all any more, it only has a handful of issues. The GUI, update, and threading issues are a lot worse, I think.
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #7 on: December 05, 2007, 07:03:50 pm »
Right now I'm planning how to do it, a rough sketch of how it should work. I'll surely base the new lexer on the previous handcrafted one, but it won't be the same. As for the parser, I'm not that sure about using yacc this time. However, the time for that decision is yet to come.

JGM: IIRC you speak Spanish. If you want you can check my proposal right here (which is in Spanish).

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #8 on: December 06, 2007, 02:20:54 am »
JGM: IIRC you speak Spanish. If you want you can check my proposal right here (which is in Spanish).

I read your project document and is really nice, you have talent to write. It would be great to see a project of that magnitude working, it is really interesting. Also I learned a lot with your document, thanks for sharing it! :) By the way what does it means IIRC?

If there is something I could do to help would be nice.
« Last Edit: December 06, 2007, 02:27:27 am by JGM »

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #9 on: December 06, 2007, 03:38:19 am »
Quote from: JGM
It would be great to see a project of that magnitude working

Thing is: I need to get that project done to graduate.

Quote from: JGM
By the way what does it means IIRC?

IIRC: If I Recall Correctly.

Quote from: JGM
If there is something I could do to help would be nice.

Thank you :)

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #10 on: December 10, 2007, 08:08:15 pm »
I have been reading some c++ books, and on each one I encounter a new feature of the language that I didn't know. I have to admit that I'm an amateur on c++, it has so many features that the language is pluginable, some things on the language have so many definitions or uses.  :?

I see why there are little software alternatives to accomplish code completion. Making an intelligent c++ parser would take a significant amount of time (maybe years).

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #11 on: February 20, 2008, 01:26:00 am »
I've always said when in doubt, check the source. In this case, the best source is the ARM:
http://www.research.att.com/~bs/arm.html

and after that, that Bjarne book(my own name):
http://www.research.att.com/~bs/3rd.html
http://en.wikipedia.org/wiki/The_C%2B%2B_Programming_Language

As for my opinions about ctags: I've tried using ctags in the past, and failed spectacularly. Not really sure why, but I've had an instinctual dislike of the package since then. But perhaps we can steal the core of it, and then just update that, rather than using an external program. Also, when we internalize it, we take control about when it can reparse the project.

In any case, I'd like to see some documentation put here about how the code completion works now, so we can perhaps improve upon it. If not, at least discuss how code completion works, rather than everyone just saying: "let's use regex!" or "it should look like visual assistant x!"

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #12 on: February 20, 2008, 02:13:55 am »
rather than everyone just saying: "let's use regex!" or "it should look like visual assistant x!"

Don't listen to us we are just a bunch of frustrated programmers that can't make a decent auto completion plugin  :lol: (talking for my self)

By the way as I know the code completion plug in is going to be redesigned ( a new and better one). Theres already people working on it (or at least a library). Also you are welcome to develop one that may take place on a future contest I guess  :wink:.

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #13 on: February 26, 2008, 11:50:13 am »
I suppose what I'm trying to say is I think seeing some more discussion on how code completion actually works would create results faster than simple suggestions, and telling everyone: "Work alone". This is a group project, where everyone takes part. Not a project where one person puts together a whole bunch of other people's work. I'm not saying that working alone is bad (i prefer it myself), but on a complicated problem such as this, the design stages should be done in a public forum.

So, how does code completion work? Rather than throw around buzzwords and feature names, let's talk about what we actually want it to do:
Provide, in an automatic/seamless way, all possible/likely solutions to "complete" the current identifier.

1. To do so, you need to have a list of all the identifiers in the current scope. (That last sentence was important).
2. Then, reduce the list to what already begins with what has been typed.
3. Show that list to the user
4. If the user wants to, he can select the proper solution from that list, and press enter/tab to insert the solution.

Let me know if you have a correction/suggestion. I'll think about how to break down the problem further, and add that on.

Edit:

I've put some more thought into the problem. As for getting the list of identifiers, the problem is threefold. You need to generate 3 lists. The first list should contain global identifiers with external linkage in the current target(a stumbling block, as I see it). The second list should contain global identifiers with internal linkage, and the last list should contain all the local scope identifiers. A couple of solutions:

1. Keep 2 lists on hand for each file. One is external linkage, the other is internal linkage. Generate the list needed by merging all the lists for the current target on demand.
2. Keep a list on hand for each target (storing external linkage only). Generate the internal linkage list on the fly.

In any case, the local scope list should be generated on demand (it shouldn't take that much time, and makes the plugin more useful).

There are a few potential problems, as I see it:

1. What happens when the current scope is global?
2. What happens in an unbalanced block? (say I'm writing a function and forgot a brace...)
3. What should the data format be?
« Last Edit: February 26, 2008, 11:59:32 am by stevenkaras »

Offline byo

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 837
Re: Regular expressions
« Reply #14 on: February 26, 2008, 08:57:13 pm »
Provide, in an automatic/seamless way, all possible/likely solutions to "complete" the current identifier.

That's only one thing. To make really good code-completion you will also have to:
  • Provide call tips while writing function-call function (maybe with some documentation gathered from comments)
  • Provide reliable list of include files when user type  #include macro
  • Help with coding initializer lists on constructors
  • and probably much more

Additionally current implementation also does few more things like:
  • Providing class browser
  • Providing "goto declaration/definition" stuff
  • Detecting current scope we're in

Ok, now we can not find all identifiers in current scope. Why? Because in c++ we always include global namespace which contains really huge amount of symbols so even enumerating them may be a problem. Also we can not only take symbols from one scope - for example when we're in member function of class we must look in local variables of this function, members of the class, all members of base classes, global namespaces and all namespaces included through "using" keyword - so it's not so simple.

It even looks worse when dealing with templates - let's take an example:

Code
teplate<typename T> class Template
{
    public:
        T& GetInstance() { return m_Instance; }
    private:
        T m_Instance;
};

class Parameter
{
    public:
        void PrintfText() { printf("Text"); }
};

int main(int,char**)
{
    Template<Parameter> Object;
   
    Object.GetInstance().Printf <<<<< What to complete here ?
}

In this example we don't only need to know that Object has function named GetInstance. If we detect what this function returns we see that this is template parameter - so such lists also have to be accessible from somwhere. And believe me - stl, boost and other modern libraries won't easily work without it.


Now to do the completion really good we can't overcome the problem of parsing most of the code in include files. Of course it don't require full parsing but to have accurate results we can only skip bodies of functions, everything else must be parsed as in standard c++ parser. But since we need to parse current context while completing the code, body of function we're in must be parsed then (so this shows that really good code-completion is in fact full c++ parser).

We can cache result of parsing include files (of course there are exceptions for this but they're rare) and we can quickly parse current file (when skipping function bodies) so the performance should be acceptable.

I thought lately about the cc plugin and how to improve it becuase the current implemetnation is not good enough for me (I started to really love templates and cc fails in this area) so maybe I could help too. But I think that maybe we shoudn't start any huge changes before the release (so it's not so long ;) ) and we should focus on the release right now.

But just one suggestion which should be considered before starting the change (I already written about this some time ago but it was long time ago so I don't even remember where I wrote this ;) ).

I think that current cc plugin should be splitted into three parts:
  • Parser plugin (or maybe language-support plugin) - this one will be responsible for parsing files / detecting completion context etc. etc. - mainly analysing source code and changes made to them. I also thought about adding possibility to define own lexers and fold detectors for scintilla editor here - and I can say that I already have this part done
  • Code-completion plugin - here we could leave all things related to managing threads / gui updates etc
     
  • Parser manager - which would provide some bridge between those two (maybe more) plugins - it would manage parser plugins (for example give proper parser for given language) and it could provide some interface used to access symbols.
If parser manager and interfaces for accessing symbols will be really good designed, we can separate parsing from completion in most of areas.

Regards

   BYO