Regular expressions



JGM:
Could it be possible to create a parser using regular expressions?

I was thinking about the design of a code completion plugin that uses regex together with container classes.

For example:
FileArray - a container class that stores all the files.

The FileArray would hold an array of file classes with members that indicate whether each one is a header or source file, plus its full path and name. Each file class would have other containers for its contents, such as includes, namespaces, defines, classes, structs, enums, typedefs, unions, functions, variables, etc...

So first we open a file as a stream and search for includes with something like this (incomplete example): (#include)( )*([\"\<])(.*)([\"\>]). We parse the match and remove that entry from the stream, then continue with namespaces (another incomplete or bad regexp): (namespace)([\r\n\t ])+([_0-9a-zA-Z])+([\r\n\t ])+([\{])+(.*)*([\}])+. Again we remove everything the regexp found from the stream and continue parsing the other elements of C/C++.

Also the FileArray would have member functions to indicate the include paths to search for sources and includes.

I think it would be easier to maintain, since all the elements of the language would be divided into classes and the parsing would be done by the regexp engine.

There should be a class for every element like the ones mentioned above (functions, defines, variables, namespaces, blah blah blah), with a common interface that makes some functions and members available to all of them, like the file to parse, a parse function, etc...

An example would be a class for Classes that is included as an array (container) in the FileArray class. That class would in turn use arrays (containers) of the Function class, Variable class, Enum class, Struct class, Union class, etc...
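
The container layout described above could be sketched roughly as follows. Only the name FileArray comes from the post; every other type, member and function name here is an assumption made for illustration, and the parsing itself is left out.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical element types; the fields are illustrative only.
struct Variable { std::string type; std::string name; };
struct Function { std::string return_type; std::string name; };
struct Enum     { std::string name; std::vector<std::string> enumerators; };

struct Class {
    std::string name;
    std::vector<Function> functions;
    std::vector<Variable> variables;
    std::vector<Enum> enums;
};

struct File {
    std::string path;
    bool is_header = false;
    std::vector<std::string> includes;
    std::vector<Class> classes;
    std::vector<Function> functions;
    std::vector<Variable> variables;
};

class FileArray {
public:
    void add_include_path(std::string dir) { include_paths_.push_back(std::move(dir)); }

    // Returns a reference that stays valid only until the next add_file call.
    File& add_file(std::string path, bool is_header) {
        File f;
        f.path = std::move(path);
        f.is_header = is_header;
        files_.push_back(std::move(f));
        return files_.back();
    }

    // Linear search by path; returns nullptr when the file is unknown.
    const File* find(const std::string& path) const {
        for (const File& f : files_) {
            if (f.path == path) return &f;
        }
        return nullptr;
    }

    const std::vector<std::string>& include_paths() const { return include_paths_; }

private:
    std::vector<std::string> include_paths_;
    std::vector<File> files_;
};
```

A parser, whatever technique it uses, would fill in the File returned by add_file, and the completion code would later query it through find.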

I would like to work on something like this, but I don't know if it's a silly idea. Opinions from more experienced programmers would be nice.

eranif:

--- Quote from: JGM on December 05, 2007, 04:02:35 am ---Could it be possible to create a parser using regular expressions?

--- End quote ---

I don't want to be the party pooper, but C++ is way too complex for regex (C++ contains too many ambiguities for regex to handle).
You will have to bring in the big guns to do this (advanced parser generators such as ANTLR, PCCTS and others that have N-token lookahead).

GNU's parser for C++ was, for a long time, based on Yacc. And even with those tools, parsing C++ was too complex (I think their grammar contained some shift/reduce and reduce/reduce conflicts).

Still, I think it is possible to create a parser based on yacc & flex which will do most of the work (since you are not building a compiler here, it should be enough).

Eran

JGM:

--- Quote from: eranif on December 05, 2007, 07:26:37 am ---Still, I think it is possible to create a parser based on yacc & flex which will do most of the work

--- End quote ---

Never heard of that!
I'm reading this little guide http://epaperpress.com/lexandyacc/ and http://flex.sourceforge.net/manual/
But at first it's too complex to digest :?

Maybe regexp could be used to parse simple things such as unions, enums, typedefs, variables, defines and others, with some manual intervention for classes, namespaces and templates. I don't know if a regexp can find the right closing } bracket, since namespaces use {} and other {} pairs can be nested inside them. With regexp we could search for the word "namespace" followed by an alphanumeric name, but I don't know if regexp libraries return the position in the string where the match was found, so that we could jump there and search for the corresponding closing } bracket. Then, with the content of the namespace {...}, we could run other parsing functions for advanced types such as classes, leaving the basic types (variable declarations, unions, enums) for later.
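
On the position question: with C++11 std::regex, a match result does report where it was found, through position() and length(). A rough sketch of the "jump there and look for the closing bracket" idea follows, using a plain depth counter for the braces. Both helper names are hypothetical, and the counter is fooled by braces inside comments, string literals or character literals.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: index of the '}' closing the '{' at open_pos,
// or npos if open_pos is not a '{' or the braces are unbalanced.
std::size_t find_matching_brace(const std::string& text, std::size_t open_pos) {
    if (open_pos >= text.size() || text[open_pos] != '{') return std::string::npos;
    int depth = 0;
    for (std::size_t i = open_pos; i < text.size(); ++i) {
        if (text[i] == '{') {
            ++depth;
        } else if (text[i] == '}' && --depth == 0) {
            return i;
        }
    }
    return std::string::npos;
}

// Hypothetical helper: find the first "namespace <name> {" and extract the
// name and the text between its outer braces.
bool first_namespace_body(const std::string& text, std::string& name, std::string& body) {
    static const std::regex ns_re(R"(\bnamespace\s+([A-Za-z_]\w*)\s*\{)");
    std::smatch m;
    if (!std::regex_search(text, m, ns_re)) return false;

    // The match ends with '{', so its index is position + length - 1.
    const std::size_t open = static_cast<std::size_t>(m.position(0) + m.length(0)) - 1;
    const std::size_t close = find_matching_brace(text, open);
    if (close == std::string::npos) return false;

    name = m[1].str();
    body = text.substr(open + 1, close - open - 1);
    return true;
}
```

The extracted body can then be handed to the same function again to find nested namespaces, or to other searches for classes and simpler declarations.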

eranif:
At most, you will succeed in parsing very basic C++ expressions, but once it comes to really complex ones, such as:

--- Code: ---namespace MyNS {
template <typename T, typename Ty>
class MyClass :
/* some comment here */
public Singleton<MyClass> , public Factory<MyClass>, private SomeOtherClass<MyClass>
{
//Now make sure you ignore this comment as well
};
};//NS

--- End code ---
(and I have seen such code...), I fear that you will hit some serious issues.

Also, to make a parser you need a lookup table to distinguish between typenames and identifiers,
for example:

--- Code: ---class MyClass{};
--- End code ---
At this line the parser should treat MyClass as an identifier. Once the parsing of this line is completed, MyClass is inserted into a lookup table and marked as a typename, so a line like this:

--- Code: ---MyClass cls;
--- End code ---
will be parsed correctly, with cls treated as the identifier (the parser should recognize MyClass as the typename and cls as the identifier).
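
The lookup table idea can be sketched in a few lines. This is a minimal illustration only: TypeTable, NameKind and the member names are assumptions, and it keeps a single flat set of names with no notion of scope.

```cpp
#include <set>
#include <string>

enum class NameKind { TypeName, Identifier };

// Hypothetical lookup table: names registered with declare_type are
// classified as typenames, everything else as plain identifiers.
class TypeTable {
public:
    // Called once a declaration such as "class MyClass{};" has been parsed.
    void declare_type(const std::string& name) { types_.insert(name); }

    NameKind classify(const std::string& name) const {
        return types_.count(name) != 0 ? NameKind::TypeName : NameKind::Identifier;
    }

private:
    std::set<std::string> types_;
};
```

Before class MyClass{}; is processed, classify("MyClass") yields Identifier; after declare_type("MyClass") it yields TypeName, so in MyClass cls; the first word is read as the type and cls as the identifier.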

And don't get me started on scoping... :wink:

If you really want to create a plugin to be a competitor for the current CodeCompletion plugin, I suggest you go and have a look at ctags
http://ctags.sf.net (contact me if you need more help)

Regex is good for lexing (tokenizing, and in fact flex grammars use regex a lot), not for grammar.
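
To illustrate the lexing side, here is a small tokenizer sketch built on C++11 std::regex and std::sregex_iterator. The function name tokenize and the token rules are assumptions chosen for brevity; a real C++ lexer also has to handle comments, string literals, multi-character operators and preprocessor lines.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical tokenizer: splits source text into identifiers/keywords,
// integer literals, the "::" operator, and single punctuation characters.
std::vector<std::string> tokenize(const std::string& source) {
    // Alternatives are tried in order at each position; whitespace is skipped
    // because no alternative matches it.
    static const std::regex token_re(R"([A-Za-z_]\w*|\d+|::|\S)");

    std::vector<std::string> tokens;
    for (std::sregex_iterator it(source.begin(), source.end(), token_re), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

For the input MyClass cls; this yields the three tokens MyClass, cls and ;. Deciding that the first is a typename and the second an identifier is the grammar's job, not the tokenizer's.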

There are other advantages of using generated parsers over hand-crafted ones:
Changing a grammar file is a lot easier than modifying existing code that contains patch over patch, until the time comes when the code is smarter than its creator :D

Eran

thomas:

--- Quote from: JGM on December 05, 2007, 02:22:32 pm ---
--- Quote from: eranif on December 05, 2007, 07:26:37 am ---Still, I think it is possible to create a parser based on yacc & flex which will do most of the work

--- End quote ---
Never heard of that!
--- End quote ---
Ceniza has done that, worked ok too.
Regex is in my opinion unsuitable mostly because it is far too slow. yacc and flex evaluate rules at compile time (or rather at pre-compile time) and hack together a C file from that which does just one thing, and nothing more.
Regex does everything at runtime.
