Print Page - Regular expressions

Developer forums (C::B DEVELOPMENT STRICTLY!) => Development => CodeCompletion redesign => Topic started by: JGM on December 05, 2007, 04:02:35 am

Title: Regular expressions
Post by: JGM on December 05, 2007, 04:02:35 am

Could it be possible to create a parser using regular expressions?

I was thinking on the design of a code completion plugin using regex with container classes.

For example:
FileArray - a container class that stores all the files.

The FileArray will use an array of file classes with members that indicate if header or source file, it's full path and name. Each file class would have other containers with it's contents, like includes, namespaces, defines, classes, structs, enums, typedef, unions, functions, variables etc...

So first we open a file as a stream search for includes like this "incomplete example": (#include)( )*([\"\<])(.*)([\"\>]) then parse it and remove that entry from the stream to continue parsing: namespaces "another incomplete or bad regexp" (namespace)([\r\n\t ])+([_0-9a-zA-Z])+([\r\n\t ])+([\{])+(.*)*([\}])+, also remove all the results found by the regexp from the stream an continue parsing other elements of c/c++.

Also the FileArray would have member functions to indicate the include paths to search for sources and includes.

I think it will be more easy to maintain since all the elements of the language would be divided in classes and the parsing made by the regexp engine.

There should be a class for every element like the ones mentioned above: functions, defines, variables, namespaces, blah blah blah with an interface that make some functions and members available to all, like file to parse, parsefunction, etc...

An example would be a class for Classes that is included as an array(container) on the FileArray class. The class would use some arrays(containers) of Function class, Variables class, Enum class, Struct class, Union class, etc...

I would like to work on something like this, but I don't know if it's a silly idea. More experienced programmers opinions would be nice.

Title: Re: Regular expressions
Post by: eranif on December 05, 2007, 07:26:37 am

Quote from: JGM on December 05, 2007, 04:02:35 am

Could it be possible to create a parser using regular expressions?

I dont want to be the party pooper, but C++ is way to complex for regex (C++ contains too many ambiguities for regex to handle).
You will have the bring in the big guns to do this (some advance parsers such has ANTLR, PCCTS and other parsers that has N look ahead).

GNU's parser for C++, is based on Yacc & Flex. And even in those tools, parsing of C++ is too complex (I think their grammar contains some shift/shift, shift/reduce conflicts).

Still, I think it is possible to create a parser based on yacc & flex which will do most of the work (since you are not building a compiler here, it should be enough).

Eran

Title: Re: Regular expressions
Post by: JGM on December 05, 2007, 02:22:32 pm

Quote from: eranif on December 05, 2007, 07:26:37 am

Still, I think it is possible to create a parser based on yacc & flex which will do most of the work

Never heard of that!
I'm reading this little guide http://epaperpress.com/lexandyacc/ and http://flex.sourceforge.net/manual/
But at first it's to complex to digest :?

May be regexp could be used to parse simple things as unions, enums, typedef, variables, defines and others, while some manual intervention on classes, namespaces, templates since I dont know if it is possible to check on regexp for the right } ending bracket, since namespaces use {} and inside of them the use of {} is also. With regexp we could search for the words "namespace alfanum", but I don't know if regexp libraries return you the position on the string where it found it, so we jump there and search for the corresponding ending } bracket. Then with the content of the namespace {...} perform other parsing functions for advanced types as classes leaving the basic types as variable declarations, unions, enums, for later.

Title: Re: Regular expressions
Post by: eranif on December 05, 2007, 03:13:04 pm

At most, you will succeed in parsing very basic C++ expression, but once it comes to really complex expressions, such as:

Code

namespace MyNS {
template <typename T, typename Ty>
class MyClass : 
/* some comment here */
public Singleton<MyClass> , public Factory<MyClass>, private SomeOtherClass<MyClass>
{
//Now make sure you ignore this comment as well
};
};//NS

(and I have seen such code...), I fear the you will hit some serious issues.

Also, to make a parser you need lookup table to distinguish between typenames and identifier,
for example:

Code

class MyClass{};

At this line the parser should consider MyClass as identifier, once the parsing of this line is completed, MyClass is inserted into a lookup table and marked as typename, so line like this:

Code

MyClass cls;

will be parsed correctly and considered cls as identifier (The parser should recognize MyClass as the typename and cls as the identifier)

And dont let me start talking about scoping... :wink:

If you really want to create a plugin to be a competitor for the current CodeCompletion plugin, I suggest you go and have a look at ctags
http://ctags.sf.net (http://ctags.sf.net) (contact me if you need more help)

Regex is good for lexing (tokenizing, and in fact the flex grammar uses regex a lot) not for grammar.

There are other advantages of using generated parsers and hand crafted ones:
Changing a grammar file is a lot easier than modifying an existing code, which contains patch over patch, until time comes and the code is smarter than its creator :D

Eran

Title: Re: Regular expressions
Post by: thomas on December 05, 2007, 03:29:03 pm

Quote from: JGM on December 05, 2007, 02:22:32 pm

Quote from: eranif on December 05, 2007, 07:26:37 am
Still, I think it is possible to create a parser based on yacc & flex which will do most of the work
Never heard of that!

Ceniza has done that, worked ok too.
Regex is in my opinion unsuitable mostly because it is far too slow. yacc and flex evaluate rules at compile time (or rather at pre-compile time) and hack together a C file from that which does just one thing, and nothing more.
Regex does everything at runtime.

Title: Re: Regular expressions
Post by: JGM on December 05, 2007, 04:12:36 pm

Quote from: thomas on December 05, 2007, 03:29:03 pm

Ceniza has done that, worked ok too.

So the work of Ceniza will be implemented for the future code completion plugin?

If that's true, it is great! Then there should not be any other work done by others.

Title: Re: Regular expressions
Post by: thomas on December 05, 2007, 05:00:33 pm

Quote from: JGM on December 05, 2007, 04:12:36 pm

So the work of Ceniza will be implemented for the future code completion plugin?
If that's true, it is great! Then there should not be any other work done by others.

Of course it's true, why would I make up such a story. It's not even April Fool's day :)
However, "the future code completion" will probably be using a different implementation, not that one.

Quote from: JGM on December 05, 2007, 04:12:36 pm

Then there should not be any other work done by others.

There is no reason why someone capable of coming up with a better solution than the present one should not work on it. It can only be for good.
Either way, parsing is not the only serious issue in code completion. In fact, the parser is not that bad at all any more, it only has a handful of issues. The GUI, update, and threading issues are a lot worse, I think.

Title: Re: Regular expressions
Post by: Ceniza on December 05, 2007, 07:03:50 pm

Right now I'm planning how to do it, a rough sketch of how it should work. I'll surely base the new lexer on the previous handcrafted one, but it won't be the same. As for the parser, I'm not that sure about using yacc this time. However, the time for that decision is yet to come.

JGM: IIRC you speak Spanish. If you want you can check my proposal right here (http://ceniza666.googlepages.com/Anteproyecto.pdf) (which is in Spanish).

Title: Re: Regular expressions
Post by: JGM on December 06, 2007, 02:20:54 am

Quote from: Ceniza on December 05, 2007, 07:03:50 pm

JGM: IIRC you speak Spanish. If you want you can check my proposal right here (http://ceniza666.googlepages.com/Anteproyecto.pdf) (which is in Spanish).

I read your project document and is really nice, you have talent to write. It would be great to see a project of that magnitude working, it is really interesting. Also I learned a lot with your document, thanks for sharing it! :) By the way what does it means IIRC?

If there is something I could do to help would be nice.

Title: Re: Regular expressions
Post by: Ceniza on December 06, 2007, 03:38:19 am

Quote from: JGM

It would be great to see a project of that magnitude working

Thing is: I need to get that project done to graduate.

Quote from: JGM

By the way what does it means IIRC?

IIRC: If I Recall Correctly.

Quote from: JGM

If there is something I could do to help would be nice.

Thank you :)

Title: Re: Regular expressions
Post by: JGM on December 10, 2007, 08:08:15 pm

I have been reading some c++ books, and on each one I encounter a new feature of the language that I didn't know. I have to admit that I'm an amateur on c++, it has so many features that the language is pluginable, some things on the language have so many definitions or uses. :?

I see why there are little software alternatives to accomplish code completion. Making an intelligent c++ parser would take a significant amount of time (maybe years).

Title: Re: Regular expressions
Post by: stevenkaras on February 20, 2008, 01:26:00 am

I've always said when in doubt, check the source. In this case, the best source is the ARM:
http://www.research.att.com/~bs/arm.html (http://www.research.att.com/~bs/arm.html)

and after that, that Bjarne book(my own name):
http://www.research.att.com/~bs/3rd.html (http://www.research.att.com/~bs/3rd.html)
http://en.wikipedia.org/wiki/The_C%2B%2B_Programming_Language (http://en.wikipedia.org/wiki/The_C%2B%2B_Programming_Language)

As for my opinions about ctags: I've tried using ctags in the past, and failed spectacularly. Not really sure why, but I've had an instinctual dislike of the package since then. But perhaps we can steal the core of it, and then just update that, rather than using an external program. Also, when we internalize it, we take control about when it can reparse the project.

In any case, I'd like to see some documentation put here about how the code completion works now, so we can perhaps improve upon it. If not, at least discuss how code completion works, rather than everyone just saying: "let's use regex!" or "it should look like visual assistant x!"

Title: Re: Regular expressions
Post by: JGM on February 20, 2008, 02:13:55 am

Quote from: stevenkaras on February 20, 2008, 01:26:00 am

rather than everyone just saying: "let's use regex!" or "it should look like visual assistant x!"

Don't listen to us we are just a bunch of frustrated programmers that can't make a decent auto completion plugin :lol: (talking for my self)

By the way as I know the code completion plug in is going to be redesigned ( a new and better one). Theres already people working on it (or at least a library). Also you are welcome to develop one that may take place on a future contest I guess :wink:.

Title: Re: Regular expressions
Post by: stevenkaras on February 26, 2008, 11:50:13 am

I suppose what I'm trying to say is I think seeing some more discussion on how code completion actually works would create results faster than simple suggestions, and telling everyone: "Work alone". This is a group project, where everyone takes part. Not a project where one person puts together a whole bunch of other people's work. I'm not saying that working alone is bad (i prefer it myself), but on a complicated problem such as this, the design stages should be done in a public forum.

So, how does code completion work? Rather than throw around buzzwords and feature names, let's talk about what we actually want it to do:
Provide, in an automatic/seamless way, all possible/likely solutions to "complete" the current identifier.

1. To do so, you need to have a list of all the identifiers in the current scope. (That last sentence was important).
2. Then, reduce the list to what already begins with what has been typed.
3. Show that list to the user
4. If the user wants to, he can select the proper solution from that list, and press enter/tab to insert the solution.

Let me know if you have a correction/suggestion. I'll think about how to break down the problem further, and add that on.

Edit:

I've put some more thought into the problem. As for getting the list of identifiers, the problem is threefold. You need to generate 3 lists. The first list should contain global identifiers with external linkage in the current target(a stumbling block, as I see it). The second list should contain global identifiers with internal linkage, and the last list should contain all the local scope identifiers. A couple of solutions:

1. Keep 2 lists on hand for each file. One is external linkage, the other is internal linkage. Generate the list needed by merging all the lists for the current target on demand.
2. Keep a list on hand for each target (storing external linkage only). Generate the internal linkage list on the fly.

In any case, the local scope list should be generated on demand (it shouldn't take that much time, and makes the plugin more useful).

There are a few potential problems, as I see it:

1. What happens when the current scope is global?
2. What happens in an unbalanced block? (say I'm writing a function and forgot a brace...)
3. What should the data format be?

Title: Re: Regular expressions
Post by: byo on February 26, 2008, 08:57:13 pm

Quote from: stevenkaras on February 26, 2008, 11:50:13 am

Provide, in an automatic/seamless way, all possible/likely solutions to "complete" the current identifier.

That's only one thing. To make really good code-completion you will also have to:

Provide call tips while writing function-call function (maybe with some documentation gathered from comments)
Provide reliable list of include files when user type #include macro
Help with coding initializer lists on constructors
and probably much more

Additionally current implementation also does few more things like:

Providing class browser
Providing "goto declaration/definition" stuff
Detecting current scope we're in

Ok, now we can not find all identifiers in current scope. Why? Because in c++ we always include global namespace which contains really huge amount of symbols so even enumerating them may be a problem. Also we can not only take symbols from one scope - for example when we're in member function of class we must look in local variables of this function, members of the class, all members of base classes, global namespaces and all namespaces included through "using" keyword - so it's not so simple.

It even looks worse when dealing with templates - let's take an example:

Code

teplate<typename T> class Template
{
    public:
        T& GetInstance() { return m_Instance; }
    private:
        T m_Instance;
};

class Parameter
{
    public:
        void PrintfText() { printf("Text"); }
};

int main(int,char**)
{
    Template<Parameter> Object;
    
    Object.GetInstance().Printf <<<<< What to complete here ?
}

In this example we don't only need to know that Object has function named GetInstance. If we detect what this function returns we see that this is template parameter - so such lists also have to be accessible from somwhere. And believe me - stl, boost and other modern libraries won't easily work without it.

Now to do the completion really good we can't overcome the problem of parsing most of the code in include files. Of course it don't require full parsing but to have accurate results we can only skip bodies of functions, everything else must be parsed as in standard c++ parser. But since we need to parse current context while completing the code, body of function we're in must be parsed then (so this shows that really good code-completion is in fact full c++ parser).

We can cache result of parsing include files (of course there are exceptions for this but they're rare) and we can quickly parse current file (when skipping function bodies) so the performance should be acceptable.

I thought lately about the cc plugin and how to improve it becuase the current implemetnation is not good enough for me (I started to really love templates and cc fails in this area) so maybe I could help too. But I think that maybe we shoudn't start any huge changes before the release (so it's not so long ;) ) and we should focus on the release right now.

But just one suggestion which should be considered before starting the change (I already written about this some time ago but it was long time ago so I don't even remember where I wrote this ;) ).

I think that current cc plugin should be splitted into three parts:

Parser plugin (or maybe language-support plugin) - this one will be responsible for parsing files / detecting completion context etc. etc. - mainly analysing source code and changes made to them. I also thought about adding possibility to define own lexers and fold detectors for scintilla editor here - and I can say that I already have this part done
Code-completion plugin - here we could leave all things related to managing threads / gui updates etc
Parser manager - which would provide some bridge between those two (maybe more) plugins - it would manage parser plugins (for example give proper parser for given language) and it could provide some interface used to access symbols.

If parser manager and interfaces for accessing symbols will be really good designed, we can separate parsing from completion in most of areas.

Regards

BYO

Title: Re: Regular expressions
Post by: stevenkaras on March 02, 2008, 01:40:04 am

I enjoyed reading what you had to say. It was particularly well written and well thought out. I had put some more thought into the problem, and wrote a few pages on it. Here's more or less what I've written. It's lacking in some places, but I'm still working on it:

Having put yet more thought into this(what can I say? I'm bored), I've come up with a list of what the structures should look like:

Namespace(of which the global namespace is a special case of):
- list of all functions
- list of all variables
- list of all enumerations
- list of all typedefs
- list of all classes
- list of all namespaces (note that children of these can be moved up by using the using keyword)
- list of all preprocessor definitions (all defs go in the global object)

Class:
- list of base classes
- list of variables
- list of enumerations
- list of typedefs
- list of methods
- list of static variables
- list of static methods
- list of classes(always static)

Now we just need to pick what the best container would be for the lists, and write the code to merge these lists into the main one, which would differ in that instead of actually holding content, it would hold pointers to said content.

There are a few more questions, of course (along with my own answers):
* What sort of information should be held about the various identifiers?
- type for variables
- fuction signature for functions/methods
- the other end of a typedef
- Classes should check for inheritance
- Classes should check for storage class (private, protected, or public), and store that info for all members
- line number of declaration/prototype/implementation
* What will the processor time be for all of this?
* What big will the memory footprint be?
* Are we missing anything?
- yes, but I can't put my finger on it...
* Are there any easily-foreseen problems/difficulties?
- Typedef walking (checking for members of a typedeffed object)
- Array support
- Parameterized preprocessor definitions
- unnamed namespaces. Sometimes I want to strangle the ANSI/ISO committee.

In addition, I just looked at the Visual Assist X website. They have some really good ideas there. So good, in fact, that I'd like to change number 2 from the main procedure to this:
2. Reduce the list to the most likely/probable solutions to the current identifier

I've written out more or less what we'll need to implement the structures. I figure there's one thing in common with all the elements, and since C++ lends itself to this so well, we should have a base class for all identifiers, which all of the various types can be derived classes of:

1. Identifiers (the base class)

Code

class identifier
{
	string name;				// the identifier name
	int decl_line;				// line number of declaration (prototype for functions)
	virtual string tooltip(void) = 0;	// returns what a tooltip should display for the identifier
	virtual string listname(void) = 0;	// returns what the list name should look like
};

2. Variables
As much as it's tempting to add all sorts of flags about storage classes and modifiers, remember that when you do that, you increase the storage space by o(n). If we store it in a different list, you increase the storage space by o(1). Besides, the string is a descriptor, nothing more.

Code

class variable : public identifier
{
	string type;		// the type of the variable
};

3. Enumerations

Code

class enumeration : public identifier
{
};

4. Typedef
Yes, I'm aware typedef is a keyword. I don't particularly care (it's not real code)

Code

class typedef : public identifier
{
	string type;		// the base type
};

5. Method/Function

Code

class function : public identifier
{
	int impl_line;		// line number of the definition
	string returns;		// the return type
	string signature;	// The parameter list
};

6. Preprocessor defs
I've got absolutely no clue how to handle parameterized macros

Code

class preprocdef : public identifier
{
	string macro;		// the other side of the macro
};

7. Namespaces

Code

class namespace : public identifier
{
	list variables;
	list enumerations;
	list typedefs;
	list functions;
	list classes;
	list namespaces;
	void using(identifer);		// to support the using keyword(it brings something into the current namespace)
};

8. Classes
There's a slight problem here. Classes can be split into 6 sections: static/nonstatic, and then public, protected, and private. And they're all relevant.

Code

class class : public identifier
{
	list base_classes;
	list variables;
	list enumerations;
	list typedefs;
	list functions;
	list classes;
	list namespaces;
	list static_variables;
	list static_enumerations;
	list static_typedefs;
	list static_functions;
	list static_classes;
	list static_namespaces;
}

9. File
We only really want to cache external linkage(because the internal linkage of a file changes too quickly when we write code, and I'm not interested in telling the parser to reparse after every character typed.

Code

class file
{
	string filepath;	// the filename + path (to open it quickly for reference use)
	namespace global;	// the global namespace
}

I'll check my books if there's anything I missed language-wise. So far the only things missing/un-supported are unnamed namespaces and parameterized macros.

Title: Re: Regular expressions
Post by: JGM on March 02, 2008, 06:46:35 am

I think you are missing #defines well, macros, if defined then parse this area or the other one

Title: Re: Regular expressions
Post by: byo on March 03, 2008, 12:18:43 am

I also thought about data structures in cc. And in my opinion, each symbol (variable, class, typedef, enum etc.) should use one class to represent all informations they provide. That would really help streaming the data and if it's flexible enough it could be used for languages other that c++ too (although it's not it's main purpose).

And here's my concept. Each symbol should have following data:

Code

class symbol
{
    string name;              // name of the symbol
    int    id;                // Id of the symbol, should be unique in the project
    int    file_id;           // if od file where the symbol has been declared
    int    filepos_begin;     // Position where declaration of the symbol starts
    int    filepos_end;       // Position where declaration of the symbol ends
    int    type;              // Type of the symbol: macro / class / typedef / variable / function
    int    modifiers;         // Bitfield used to mark some estra properties of symbol like that it is static or inline
    int    value_type_id;     // Id of symbol which represents c++ type of current symbol (like type of variable or type of returned value from function)
    int    extra_type_id;      // Extra type used in some cases
    list   children;          // List of child elements of this symbol (members in class etc)
    list   extra_lists[3];    // Some extra lists which can provide additional symbols depending on type of current
                              // symbol - like list of base classes or list of template arguments, maybe we could give
                              // more than 3 lists, but I didn't found any reason for that now.
    map    extra_values;      // int -> string map which can keep some extra data
}

Each element of symbol list should have following properties:

Code

class list
{
    int symbol_id;        // Id of referenced symbol
    int scope;            // Scope of the symbol (public / private / protected ... ), don't have to be used
}

Most fields of the symbol class are easy-to-understand and their usage is rather straightforward. But interpretation of data should depend on symbol type.

Ok, here's detailed information on how the fields should be used in case of symbol types:

type

modifiers

value_type_id

extra_type

children

extra_lists[0]

extra_lists[1]

extra_lists[2]

----------------

namespace

|
|
|

declarations in
namespace

|
|
|

"using" namespaces

|
|
|

----------------

class / struct / union

|
|

members of class

|
|

base classes

|
|

template args

|
|

----------------

variable

|
|

extern, static, volatile, const

|
|

type of variable

|
|

----------------

function

|
|

static, inline, const ...

|
|

returned value

|
|

arguments

|
|

template arguments

|
|

----------------

typedef

|
|

pointer, array, reference, pointer_to_member

|
|

base type

|
|

type of class in pointer_to_member

|
|

----------------

enum

|
|

items in enum

|
|

----------------

enum item

|
|

id of enum

|
|

----------------

macro

|
|

macro parts

|
|

----------------

macro part

|
|

arg_to_string, va_args

|
|

number of arg or -1

|
|

Such representation would require some extra "hidden" symbols - for example when some complex type is returned from function, extra symbol of typedef representing proper value would be required.

Also in case of templates, typeid's should be threated in special way - negative value could mean to use template argument instead of some real type. Base types (the POD ones) should have some predefined type ids.

This is my proposition. May not be perfect (it's late here and my memory tends to forget important things) but is a sum of few months when I thought about CC from time to time ;)

Regards
BYO

Title: Re: Regular expressions
Post by: stevenkaras on March 05, 2008, 01:30:24 am

Looked really good. But I would have to disagree with placing the static keyword as a modifier in and of itself. Keep in mind what effect static has:

symbol type	\|without static	\|with static
global variable	\|external linkage	\|internal linkage
local variable	\|automatic storage	\|program storage
function	\|external linkage	\|internal linkage
class member	\|only accessed through an object	\|only accessed through the qualified name

Note: I probably forgot something

Couple of other things:
1. As for the extra lists, wouldn't it be better to use inheritance to implement that concept, rather than placing it in the base class?
2. Again, you use the extra list to show use of the using keyword, but I think it'd be simpler to effectively allow the transfer of symbols between namespaces. Especially once you consider the various ways you can use the using keyword (using namspace std; using std::cout; using ::myVar; etc)
3. JGM> I didn't include the preprocessor definitions because the preprocessor has it's own syntax, and is loosely connected to the C++. Keep in mind that you're actually programming in 2 languages at the same time.
4. BYO>Providing a symbol table is a hard task. But I like to see that everyone has put some thought into it. I got where you were going with the class, trying to avoid using inheritance, and the virtual table, but the code can be inefficient at first, and we can always re-implement it later as a monolithic class.
5. Some of the things you had in your class were very good (and even more importantly, it included some essentials that I forgot) such as file_id.
6. I think we should back off on storing everything about each variable for a bit, as it's just unnecessary, and focus on getting a basic implementation working. But I would like to mention that we should keep in mind that there are other uses for a symbol table other than code completion: a symbol browser, improved syntax highlighting (that would catch non-identifiers), and code refactoring.

So I'll sit down later on this week and work out a second draft of the classes.

-steven

PS. Does anyone ever post here when it isn't late at night?

Title: Re: Regular expressions
Post by: Ceniza on March 05, 2008, 09:10:44 am

Quote

3. JGM> I didn't include the preprocessor definitions because the preprocessor has it's own syntax, and is loosely connected to the C++. Keep in mind that you're actually programming in 2 languages at the same time.

Yup, that's right. Trying to mix both wouldn't be easy at all. Macros can create new names, new functions, new everything. That's why a preprocessing stage is necessary if we really want to know what's available from the code to the developer.

Saving information from the preprocessing stage is simpler. In fact, I have that already. To continue the discussion, my current implementation of the preprocessor's SymTab Element is as follows:

Code

    template <class StringType, class TokenType>
    struct PPSymTabElement
    {
        typedef std::vector<StringType> ParameterListType;
        typedef std::vector<TokenType> ReplacementListType;

        bool valid;
        bool isFunc;
        ParameterListType parameters;
        ReplacementListType replacement;

        PPSymTabElement() : valid(false), isFunc(false) {}
    };

Inheritance is very important when designing the nodes for the Abstract Syntax Tree that the parser will create. Having in mind the whole bunch of different types you can have in C++, the parser's SymTab Element must be very flexible, so inheritance should be considered too.

Title: Re: Regular expressions
Post by: stevenkaras on March 05, 2008, 05:00:44 pm

Quote

6. I think we should back off on storing everything about each variable for a bit, as it's just unnecessary, and focus on getting a basic implementation working. But I would like to mention that we should keep in mind that there are other uses for a symbol table other than code completion: a symbol browser, improved syntax highlighting (that would catch non-identifiers), and code refactoring.

I've thought about this even more, and it makes sense to split up code completion into several plugins. The basic one would provide the symbol table, along with an API for accessing that table(remind me how document trees are handled by wx?). Others can build on top of that, providing functionality for code completion, a symbol browser, and code refactoring.

Ceniza, could you explain a bit more about your class? I didn't quite grasp what you tried to do(although it looks sound).

Other than that, I'm a bit swamped this week, so at least until Sunday I won't have much time to think for myself :?

Title: Re: Regular expressions
Post by: Ceniza on March 05, 2008, 10:42:07 pm

Quote from: stevenkaras on March 05, 2008, 05:00:44 pm

Ceniza, could you explain a bit more about your class? I didn't quite grasp what you tried to do(although it looks sound).

I guess you're asking about the code I pasted. If that's the case, it's quite simple: there're two different types of macros. The first type are object-like macros, and the second type are function-like macros. isFunc indicates whether it's a function-like macro or not (obviously, it's an object-like one if false). If it's a function-like macro, then the parameters list (or vector if you prefer) will be populated with the name of all its parameters (the comma separated list of names that follows the left-parenthesis and which ends with a right-parenthesis), and if it's an object-like macro, then it 'll be empty. replacement is what follows the macro name and parameters (if it's a function-like one). valid... well, let me explain something first.

You may be wondering where the heck I'm storing the name of the macro. The name of the macro is the key of a LookupTable (std::map for example), and the element is an instance of PPSymTabElement. A requirement of the LookupTable is to return a reference to the element stored in it. If the element doesn't exist, then it returns a reference to a PPSymTabElement created with the default constructor. Now, valid is set to false by default, and it must be set to true before inserting a new element to the LookupTable. That's how you tell the difference between found and not found. It's surely not the most elegant solution, but, since it's my own design, I can do whatever I want. Furthermore, the whole code is full of strange design decisions just for fun... and to later test the parser's ability to understand itself.

I hope it's clearer now.

Now... to finish this post... what everyone was waiting for: pictures of naked women!

LOADING ...

Title: Re: Regular expressions
Post by: byo on March 05, 2008, 11:36:32 pm

Quote from: stevenkaras on March 05, 2008, 01:30:24 am

symbol type |without static |with static
global variable |external linkage |internal linkage
local variable |automatic storage |program storage
function |external linkage |internal linkage
class member |only accessed through an object |only accessed through the qualified name

Ok, I still don't agree that separate lists are required. When you want to have code-completion. In most cases it will cause us to process few symbol tables instead of one. And let's take an example:

Code

class Stuff
{
    public:
        static int m_StaticMember;
};
int Stuff::m_StaticMember = 10;

void func(void)
{
    Stuff object;

    Stuff::m_StaticMember = 11;    // This is valid code
    object.m_StaticMember = 12;   // This is valid code too
}

Now if we want to complete in the first case (through scope name) we can skip all non-static variables and functions, in second case we have to enumerate all no matter what modifiers are applied. And it may be helpful to provide whole list in first case just to give user's some hint on the content of class.

Quote

1. As for the extra lists, wouldn't it be better to use inheritance to implement that concept, rather than placing it in the base class?

That was caused becuase I wanted to have one class for any type of symbol that will be good enough without any inheritance. And we can always create some helper classes that will hide the "ugly" interface of general symbol class under some nice well-defined interface - for example:

Code

class Symbol
{
public:
    list m_ExtraStuff[3];
};

class ClassSymbol
{
    Symbol * m_Symbol;
public:
    ClassSymbol(Symbol* symbol): m_Symbol(symbol) {}
    list& GetMembers() { return m_Symbol->m_ExtraStuff[0]; }
};

Now we have few benefits of one unified interface:
* sizeof(Symbol) is always the same, no matter what symbol contains - this allows to create arrays of objects. When using inheritance only arrays of pointers can be created.
* if you look into current code::completion sources you may find that symbols use block-allocated method of memory allocation - this gives some nice boost when objects are allocated and deallocated frequently and this method can not be used in case of inheritance
* saving symbols cache and restoring from files is really easy. In case of inheritance we would have to use different method of saving and loading for each derived class so we wouldn't be able to do this without symbol factories etc.

Quote

2. Again, you use the extra list to show use of the using keyword, but I think it'd be simpler to effectively allow the transfer of symbols between namespaces. Especially once you consider the various ways you can use the using keyword (using namspace std; using std::cout; using ::myVar; etc)

I thought about such concept but let's take an example again: you create 10 namespaces and in each of those namespaces you put using std; - we can really quickly run out of memory.
And one more think - when symbol-lookup will be made it will usually have to lookup few namespaces (in most cases we will have at least 3: global namespace, local symbols and class members). So adding few more into current list of namespaces shouldn't be really so expensive.

Quote

4. BYO>Providing a symbol table is a hard task. But I like to see that everyone has put some thought into it. I got where you were going with the class, trying to avoid using inheritance, and the virtual table, but the code can be inefficient at first, and we can always re-implement it later as a monolithic class.

Sure, I'm not the boss here nor the most c++ expert :) I just want to have really good C::C 8)

Quote

6. I think we should back off on storing everything about each variable for a bit, as it's just unnecessary, and focus on getting a basic implementation working. But I would like to mention that we should keep in mind that there are other uses for a symbol table other than code completion: a symbol browser, improved syntax highlighting (that would catch non-identifiers), and code refactoring.

Ok, but let's still keep one thing on mind: c++ parsing is really _complex_ task because syntax of this language is so messed up. Some mistakes on design stage may be really painfull later ;). But I agree that anything working is better than something that does not.
[/quote]

Regards
BYO

Title: Re: Regular expressions
Post by: stevenkaras on March 09, 2008, 06:23:21 pm

Sorry for the delay. Real life did what it does best: rears its ugly head and distracts you from the fun.

In any case, I'm really glad to see this sort of discussion going on. It shows that there are people who care very deeply about C::C and C::B. At this stage, I have a few things to say:

1. BYO> Wonderful reply. You caught some mistakes of mine and rationalized your arguments for the symbol class quite well. Although I did check out the static thing. The reason why I had written that was that I was thinking of typedefs in classes, which must be addressed through a qualified name, rather than an object. Good for us that this gives us an easy way out (there's only one exception, which makes it easier). In addition, the concept of a wrapper class to decouple the implementation of the symbols is wonderful.
2. As for handling multiple lists, yes, 3 does seem to be the most relevant. In addition, the reason why we should keep in mind linkage is that we want to cache the external linkage of each file, not the internal(which I think should be generated on the fly).
3. If we can define a wrapper class, or at least a less ugly interface for the symbol class, this would be a great step in the right direction.
4. We all forgot about friends of classes. These have access to the class scope.
5. I'll sit down over the next few days and write a revised proposal, and post it up here once I'm finished.
6. Again, sorry for the delay, but reality is a bit more important, or at least urgent.

Title: Re: Regular expressions
Post by: byo on March 10, 2008, 10:53:25 pm

Quote from: stevenkaras on March 09, 2008, 06:23:21 pm

Sorry for the delay. Real life did what it does best: rears its ugly head and distracts you from the fun.

It's rather usual in open-source projects :) We all have to work for food ;)

Quote

2. As for handling multiple lists, yes, 3 does seem to be the most relevant. In addition, the reason why we should keep in mind linkage is that we want to cache the external linkage of each file, not the internal(which I think should be generated on the fly).

Ok, but it's as linker sees it. The problem is that c++ is really poor in terms of external / internal linkage. There's really nothing like "using" keyword from java in c++ (well, standard says that some cases of #include may have such behavior but I didn't found it so far).
Each #include does only mean: "put the content of file blahblah here". So the only assumption we can do here is that everything declared in header files should be threated as external stuff and everything in cpp files should be threated as internal stuff (as long as someone do not include cpp file ;) ). So external keyword is only a hint to compiler that such symbol is only declared here and definition stays elsewhere (and will be found while linking).

So my suggestion is like this:

Keep all known header files preprocessed somewhere (memory / files / mixed storages)
When working on some file, reparse it as it's changed (this will require some performance tuning - like skipping bodies of functions that are not user), if this file is header such change may fire some update of cache listed in previous point
When working with file in second point, each #include directive which is on the global scope (so not inside any namespaces, not in classes etc. because that require special threating) adds all keywords from such file into current lookup structures, if it's not in cache or we can not use the cache, we behave as normal preprocessor by including content of such file

Hmm, or maybe I misunderstood something about this external / internal linkage ;)

Quote

5. I'll sit down over the next few days and write a revised proposal, and post it up here once I'm finished.

Looking formard for the proposal :) maybe it could be put onto wiki so it would be easier to work on it ?

Regards
BYO

Title: Re: Regular expressions
Post by: stevenkaras on March 11, 2008, 04:19:17 am

Quote from: byo on March 10, 2008, 10:53:25 pm

Looking formard for the proposal :) maybe it could be put onto wiki so it would be easier to work on it ?

Done. Here's the link:
http://wiki.codeblocks.org/index.php?title=Code_Completion_Rewrite (http://wiki.codeblocks.org/index.php?title=Code_Completion_Rewrite)

Actual meat to follow.

mmmm...hamburger...*wipes drool*

--steven

Title: Re: Regular expressions
Post by: byo on March 11, 2008, 01:20:55 pm

Quote from: stevenkaras on March 11, 2008, 04:19:17 am

Done. Here's the link:
http://wiki.codeblocks.org/index.php?title=Code_Completion_Rewrite (http://wiki.codeblocks.org/index.php?title=Code_Completion_Rewrite)

I've added section where we can put some more complex C::C usage examples. Hope C::C will be able to handle them all soon :) And this can be a good list of test to perform.

Regards
BYO

Title: Re: Regular expressions
Post by: JGM on March 11, 2008, 01:38:00 pm

Cool to see so many interest and effort on planning a new code completion plugin. I just wanted to point that eranif has made a great job, I think that his library support everything that has been discussed here. Check his sample videos here http://codelite.org. His code completion library has involved into a full C/C++ IDE. Some time ago I talk to him to implement a plugin for codeblocks that used his library but I have become some lazy and also out of time.

Maybe theres no need to reinvent the wheel if is good enough :D

Title: Re: Regular expressions
Post by: stevenkaras on March 11, 2008, 05:12:29 pm

I've fleshed out what was there, and updated it to BYO's proposal with a few minor modifications :wink:

I'll get around to revising it sometime in the near future. In the meantime, take a look, edit it if you think you've worked out something better, etc.

Cheers!
steven

Title: Re: Regular expressions
Post by: JGM on March 11, 2008, 07:57:47 pm

Nice! that's really team work :D If things keep going like this we could have the best generic parser out there, so I would say that 2 brains work better than a lonely one 8). Everybody Join the C::C plugin revolution! :) I feel like a cheers leader :oops:

Title: Re: Regular expressions
Post by: JGM on March 11, 2008, 08:52:12 pm

I saw this on wiki

Quote

Questions:

* why are we storing filepos_end? Wouldn't it be much more useful to store declaration, definition info?

I think thats an easy way to help on refactoring and less memory consuming, unless using a database system like sqlite for storing all symbols data, that should be the case when implementing the culmination of all these ideas.

Title: Re: Regular expressions
Post by: byo on March 11, 2008, 10:21:10 pm

Quote from: JGM on March 11, 2008, 01:38:00 pm

Cool to see so many interest and effort on planning a new code completion plugin. I just wanted to point that eranif has made a great job, I think that his library support everything that has been discussed here. Check his sample videos here http://codelite.org. His code completion library has involved into a full C/C++ IDE. Some time ago I talk to him to implement a plugin for codeblocks that used his library but I have become some lazy and also out of time.

Maybe theres no need to reinvent the wheel if is good enough :D

I've looked into it and... well... I'm impressed on what this ide can do now. I'll look into source code to see whether it would be hard to integrate with C::B (user interface and overall app design looks very simillarily to C::B ;) ).

BYO

Title: Re: Regular expressions
Post by: JGM on March 12, 2008, 09:11:33 pm

Actually it uses most of the components that use code blocks, as scintilla and wxWidgets, maybe is not so hard to use it on codeblocks :)

Title: Re: Regular expressions
Post by: byo on March 12, 2008, 11:42:46 pm

Quote from: JGM on March 12, 2008, 09:11:33 pm

Actually it uses most of the components that use code blocks, as scintilla and wxWidgets, maybe is not so hard to use it on codeblocks :)

I've done some more investigatinos - CodeLite works perfect in many cases but it still have some problems with templates (for example I couldn't make function templates to work). And any more complex template-based code won't work - but that's probably just a temporary issue ;)

And there's another problem - CodeLite uses same approach like C::B in case of code::completion - everything is inside one library so we would have to isolate some parts (like parser) first. We can also drop the idea of splitting cc into smaller parts but I see this as a short-term solution - just imagine that someone wants to improove support for D language providing code::completion for it - everything should be created again - class browsers, symbol storages etc.

And next thing - if we decide to try CodeLite's stuff, we will probably have to branch it's code. So any further updates made in CodeLite would require some work to include in C::B - as long as we don't change it much everything will be fine but If we will have to make some bigger changes then we got a problem with keeping the code up-to-date.

One more question to Ceniza: You said something about new parser, what's the current progress ? Any results now ?

Regards
BYO

Title: Re: Regular expressions
Post by: Seronis on March 13, 2008, 12:34:52 am

Just curious, has anyone emailed the codelite dev and asked them if they would be interested in trying to integrate it with C::B ? They -might- be intersted in that and might appreciate the design discussion going on already.

Title: Re: Regular expressions
Post by: JGM on March 13, 2008, 01:19:30 am

Quote from: byo on March 12, 2008, 11:42:46 pm

One more question to Ceniza: You said something about new parser, what's the current progress ? Any results now ?

I would also like to know the status on the work of Ceniza :D

Quote from: Seronis on March 13, 2008, 12:34:52 am

Just curious, has anyone emailed the codelite dev and asked them if they would be interested in trying to integrate it with C::B ?

I don't think thats possible since eranif (codelite author) is really busy working on the CodeLite project.

Title: Re: Regular expressions
Post by: Jenna on March 13, 2008, 01:28:38 am

Quote from: JGM on March 13, 2008, 01:19:30 am

I don't think thats possible since eranif (codelite author) is really busy working on the CodeLite project.

and active C::B forum-member.

Title: Re: Regular expressions
Post by: JGM on March 13, 2008, 01:34:21 am

Quote from: jens on March 13, 2008, 01:28:38 am

and active C::B forum-member.

theres some irony on that :roll:

Title: Re: Regular expressions
Post by: eranif on March 13, 2008, 06:36:49 am

Quote from: byo

I've done some more investigatinos - CodeLite works perfect in many cases but it still have some problems with templates (for example I couldn't make function templates to work). And any more complex template-based code won't work - but that's probably just a temporary issue

Indeed, the CC for template functions is not supported yet, but can be added with relatively small effort.

Quote from: JGM

I don't think thats possible since eranif (codelite author) is really busy working on the CodeLite project.

Indeed, many tasks to complete and only one developer ... (myself)
However, I will be happy to assist to integrate it into C::B plugin

Quote from: JGM

theres some irony on that

I fail to see the irony in that, for me hanging in C::B forum is like being a member in a forum dealing with the my interest - IDEs.
I myself dont use C::B, but the topics here are interested, nothing more.

About codelite CC: It uses ctags only for creating the symbol table (kept inside SQLite database), the actual parsing of expressions and statements is done using yacc & flex. The advantage of using yacc is that the grammar is maintainable and can be updated very easily.
Also, by using the database as symbol table, you gain the ability to do more neat stuff, such as:
- codelite offers code-completion for files
- it offers to automatically add include files for classes/functions/structs etc
and more

Eran

Title: Re: Regular expressions
Post by: Ceniza on March 13, 2008, 09:33:54 am

Quote from: byo on March 12, 2008, 11:42:46 pm

One more question to Ceniza: You said something about new parser, what's the current progress ? Any results now ?

I'm still working on the Preprocessor, and I'm also almost done with it (I still need to implement a few minor details). I'm also planning to change a bit the design of a few things (which should take just a few hours of work).

If you are even more interested in following the development of it, go to http://svn.cenizasoft.cjb.net/ -> CCP. Oh, and don't let yourself be fooled by the SourceParser and SourceFormatter projects... I abandoned those.

Right now there's only access to the repository thru ssh, but anonymous access should be just a matter of redirecting a port in the firewall. Just tell me if you are interested in trying the mess... er... sources, so I can check that.

Title: Re: Regular expressions
Post by: JGM on March 13, 2008, 03:06:39 pm

Cool! Does that code could be configured to work with any other language?

It would be great to have a C::C SDK where you can define the tokens and what kind of token it's so when passing that to the parser it takes care of generating a generic symbols table (like the one described by BYO).

For Example, on BASIC the token to finalize a line of code is the '\n' (i think) on C++ is ';'
For beginning a function body and ending it on BASIC "sub" or "function", and "end sub" or "end function" while on C++ '{' and '}'

There should be some method (strategy?) to create a C::C SDK that is configurable to fit into any language need, that at least could generate a symbols table based on some options that the developer is telling to the framework using some configuration options. I learned here it exist this yacc and flex tools, but they are some kind of complex to use and adapt for different languages.

Maybe if we could describe the order or structure in which every token could appear so when the parser encounters that it take the necessary actions to populate the symbols table with a specified format.

Example:

TokenContainer.Add("(");
TokenContainer["("].Describe(BEGIN_BLOCK);

TokenContainer.Add(")");
TokenContainer[")"].Describe(END_BLOCK("(")); //here we tell it ends the "(" block

TokenContainer.Add("{");
TokenContainer["{"].Describe(BEGIN_BLOCK);

TokenContainer.Add("}");
TokenContainer["}"].Describe(END_BLOCK("{")); //here we tell it ends the "{" block

TokenContainer.Add("-");
TokenContainer["-"].Describe(MINUS_OPERATOR);

TokenContainer.Add("+");
TokenContainer["+"].Describe(PLUS_OPERATOR);

TokenContainer.Add("operator");
TokenContainer["operator"].AddConfiguration("+ || - && ( ... ) || { ... } || ;");
TokenContainer["operator"].AddConfiguration("* || / && ( ... ) || { ... } || ;");

Parser.AddDirectory("include");
Parser.FileExtensions("h cpp")
Parser.AddTokens(TokenContainer);
Parser.StartParsing();

SymbolsTable[Parser.GetSymbols().Count()] Symbols = Parser.GetSymbols();

Legend:
|| = or
&& = and
... = anything that should parsed and is part of the current token

So when the parser starts working and it encounters the "operator" token it could be followed by a + or - tokens followed by a pair of parenthesis with anything inside followed by and optional body {} and optional ;
This is just an idea and could be improved.

Also there should be another configurations to tell the parser how to store what it found on the symbols table. Thats something to think on :)

Programming languages have many things in common, if we could wrap all those things and think in a generic way, but it is difficult :(

Title: Re: Regular expressions
Post by: Ceniza on March 13, 2008, 03:54:34 pm

Quote

Cool! Does that code could be configured to work with any other language?

Not really. The preprocessing stage is only for C++. I think the multi-language functionality should be implemented differently, perhaps creating a plugin where parser plugins can be connected. I don't really see a reason to mix many languages in the same symtab.

Title: Re: Regular expressions
Post by: JGM on March 13, 2008, 04:23:59 pm

Quote from: Ceniza on March 13, 2008, 03:54:34 pm

I don't really see a reason to mix many languages in the same symtab.

To be able to create a generic and customizable parser that with configurations is capable of parsing any language? The symbols table will be used in different languages, with some sort of documents that explain how is used, like the Tar File Struct. That stores a string on a specific location to indicate if ustar or not, keeping its native format.

Also to serve as an interface (base) that helps a developer re-implement it for other languages, accelerating the process of creating other parsers.

Maybe theres a way to make the symbols table more customizable.

Example:
=================================================================
SymbolsContainer.SetLanguage("C++");
SymbolsContainer.AddSymbol(theSymbolToAdd);

template <typename Types, typename Flags, typename ChildrensList, typename ExtraList, typename ExtraValues>
Class Symbols{
string name; // name of the symbol
int id; // Id of the symbol, should be unique in the workspace
int file_id; // Id of file where the symbol has been declared
int filepos_begin; // Position where declaration of the symbol starts
int filepos_end; // Position where declaration of the symbol ends
Types type; // Type of the symbol: macro / class / typedef / variable / function
Flags modifiers; // Bitfield used to mark some estra properties of symbol
// like that it is static or inline
int value_type_id; // Id of symbol which represents c++ type of current symbol
// (like type of variable or type of returned value from function)
int extra_type_id; // Extra type used in some cases
ChildrensList children; // List of child elements of this symbol (members in class etc)
ExtraList extra_lists[3]; // See table below
ExtraValues extra_values; // int -> string map which can keep some extra data
}
=================================================================

Title: Re: Regular expressions
Post by: byo on March 13, 2008, 09:27:50 pm

Quote from: Ceniza on March 13, 2008, 03:54:34 pm

Not really. The preprocessing stage is only for C++. I think the multi-language functionality should be implemented differently, perhaps creating a plugin where parser plugins can be connected. I don't really see a reason to mix many languages in the same symtab.

The goal is not to mix symbols in one table but if we could guarantee that most languages could use one unique representation of symbol table, it will be possible to ass support for new language only by writing it's parser - other stuff like symbol browser, completion window, call tooltip etc would be the same, no need to duplicate the code.

Regards
BYO

Title: Re: Regular expressions
Post by: stevenkaras on March 13, 2008, 09:48:36 pm

Wow. I go away for a day or two and look what happens. This is really starting to come together as a team effort.

JGM> As for decoupling the language, the reason why you can't simply make the symbol table abstract is because languages are very different. For example, java doesn't have multiple inheritance, basic doesn't support classes, and so on. The syntaxes are very similar, but the actual structure of the language is so different, that you'd have a hard time describing conflicting language elements. How would you implement Basic's GOSUB and PROCEDURES as compared with C functions. However, as BYO pointed out, as long as the symbol table is abstract enough in it's use, the chances of successfully extending the language features is very likely.

Right now I think the best step forward would include the following:
1. doing some test with ctags for the symbol table
2. decide on a storage format for the symbols to be provided to the other components
3. have someone play with flex, come up with a good language spec for c++(worry about C later)
4. have someone else play with bison, and work on the C++ file for the parser.

I think that should be about it. Let me know if I forgot something.

Cheers!
steven

Title: Re: Regular expressions
Post by: Ceniza on March 13, 2008, 10:50:40 pm

Quote from: stevenkaras on March 13, 2008, 09:48:36 pm

Right now I think the best step forward would include the following:
...
3. have someone play with flex, come up with a good language spec for c++(worry about C later)
4. have someone else play with bison, and work on the C++ file for the parser.

3. Why flex? Besides, I already have the whole preprocessor's lexer working. The only symbols not yet generated (because they belong to another stage) are integer- and floating_point-literals. The preprocessor's lexer generates ppnumbers instead. Another missing part is string-literal concatenation (which will be done by the real lexer after it has retrieved preprocessed tokens).
4. I won't be up for that task. I already tried it with bison, and it was painful enough. If anyone feels like trying, please remember to use a code generation that plays nice with multi-threading (bison's default one doesn't).

For languages other than C and C++, flex and bison would be just fine. C and, even more, C++ are a PITA (Pain In The Ass, before someone asks) to parse by any means.

Title: Re: Regular expressions
Post by: JGM on March 14, 2008, 04:27:27 am

Quote from: Ceniza on March 13, 2008, 10:50:40 pm

PITA (Pain In The Ass, before someone asks)

Thanks for anticipating :lol:

Title: Re: Regular expressions
Post by: JGM on March 14, 2008, 04:34:32 am

Quote from: stevenkaras on March 13, 2008, 09:48:36 pm

Wow. I go away for a day or two and look what happens. This is really starting to come together as a team effort.

I like your attitude :)

Quote from: stevenkaras on March 13, 2008, 09:48:36 pm

JGM> As for decoupling the language, the reason why you can't simply make the symbol table abstract is because languages are very different.

Yep, i will start having a really bad time, but then if theres is going to be a different parser for each language, I think would be good to standardize the parser structures, or create some abstract (interfaces) classes where others may derivate.

Title: Re: Regular expressions
Post by: stevenkaras on March 14, 2008, 06:07:11 pm

Quote from: JGM on March 14, 2008, 04:34:32 am

Yep, i will start having a really bad time, but then if theres is going to be a different parser for each language, I think would be good to standardize the parser structures, or create some abstract (interfaces) classes where others may derivate.

True. Parsing in general is a PITA. My brain is fried from this week, but here goes:

1. Symbols from a language are placed into a symbol table.
2. The user requests a completion of the current symbol.
3. the C::C plugin determines the correct scope of the symbol(global, local, class)
4. Compare the current symbol against the proper scopes.
5. Provide a composite list to the user.

That kind of glazes over the general process, and I'm sure that eran can explain it in further detail. As for providing this functionality for multiple languages, it should be pretty easy. Figuring out how to generalize the symbol class should be a priority in order to achieve that. What we have should handle most languages, but it may need more tweaking.

Cenzina> Don't worry about it. I threw out flex as an option because it's more or less a standard. I also mentioned it because it may be easier to generate the lexer from a document specification, rather than a hard-coded version. Besides, we can always edit the code generated by yacc or bison or flex if it doesn't suit our needs.

Title: Re: Regular expressions
Post by: JGM on March 14, 2008, 08:18:11 pm

Quote from: stevenkaras on March 14, 2008, 06:07:11 pm

1. Symbols from a language are placed into a symbol table.
2. The user requests a completion of the current symbol.
3. the C::C plugin determines the correct scope of the symbol(global, local, class)
4. Compare the current symbol against the proper scopes.
5. Provide a composite list to the user.

Also we need to know a detailed insight on how wxScintilla works for this purposes, if some one that has already worked with it could explain something :)

Title: Re: Regular expressions
Post by: byo on March 14, 2008, 09:55:09 pm

Quote from: JGM on March 14, 2008, 08:18:11 pm

Also we need to know a detailed insight on how wxScintilla works for this purposes, if some one that has already worked with it could explain something :)

I've done some investigation in this area and here are results (may not be complete list but hope it will be useful anyway):

wxScintilla does notify about each it's event using wxScintillaEvent - this one is processed inside cbEditor class. The most imoprtant for us is cbEditor::OnScintillaEvent (around line 2614 in cbEditor.cpp) - it call editor hooks. Editor hooks are generally some functions from plugins which are called when something inside editor change.

Next one place where such hook arrives is CodeCompletion::EditorEventHook function (around line 1731 in plugins/codecompletion/codecompletion.cpp). This function is responsible for detecting whether we should show completion window / function tooltip.

Function tooltip is fired when user enters '(' key and is closed when user enters ')' one. Tooltip is prepared in CodeCompletion::ShowCallTip() function (line 734) - it uses NativeParser::GetCallTips and NativeParser::GetCallTipComas() but I haven't look into them yet.

Code completion window may be fired when user enters '<' key (for list of includes), '.' , '->' or '::'. Some basic checks are performed to detect whether we can complete by fetching style of entered character ( control->GetStyleAt() ) - this style is generated by wxScintilla while colourising the code.
Code completion window can also be fired when user entered required number of characters in identifier and when required timeout occured. Window can also be fired when user press ctrl+space.

The processing of completion list is done inside CodeCompletion::DoCodeComplete. If this function detects that we're in preprocessor block, it calls CodeCompleteIncludes(), otherwise it calls CodeComplete(). I didn't look into these functions either.

So in general, I've written what happens in 2. point in stevens' list.
3 / 4 and 5 are done in CodeCompleteIncludes and CodeComplete.
1 is done either when project is being loaded or when file is being saved (I tend to press ctrl+shift+S to save all files frequently so it forces symbols to be refreshed ;) ).

Regards
BYO

Title: Re: Regular expressions
Post by: JGM on March 15, 2008, 12:02:11 am

Nice! :D

Quote from: byo on March 14, 2008, 09:55:09 pm

The processing of completion list is done inside CodeCompletion::DoCodeComplete. If this function detects that we're in preprocessor block, it calls CodeCompleteIncludes(), otherwise it calls CodeComplete(). I didn't look into these functions either.

This is an important fact since we should divide the preprocessors symbols on another table, since like discussed before we are working we 2 different languages

What I still wonder is how we know what text the user entered before the symbols '.' '->' '::' etc...

Does we have to check in current row and column back until a space is found to see what the user typed, or this process is done by re-parsing the file again and again to store the position of symbols?

For example:

AnObject MyObject:

//When I press the '.' character
MyObject.
^We start counting back until we fine a space or \r, \n to get the word the user typed previously to
search on the symbols table

I don't know if wxScintilla already provides this functionality.

Another problems are tooltips shown on codeblocks when opening a '(' are not persistent.

For Example:

//Function Prototype
void MyFunction(string str, int num, double pi, bool isGreat);

main()
{
string myString;
MyFunction(
^Ok here is shown a tool tip with void MyFunction(string str, int num, double pi, bool isGreat)
}

The problem is when I write the first parameter, I use ctrl-space to search for myString and the tool tip disappears, so theres also the need to check the source when the user press the ',' to see if inside a functions parameters.

I should launch codeblocks get wxScintilla and start playing with it and the documentation :oops:

Title: Re: Regular expressions
Post by: byo on March 15, 2008, 12:59:55 am

Quote from: JGM on March 15, 2008, 12:02:11 am

What I still wonder is how we know what text the user entered before the symbols '.' '->' '::' etc...

Does we have to check in current row and column back until a space is found to see what the user typed, or this process is done by re-parsing the file again and again to store the position of symbols?

The most certain solution would be like this:

We fire parser from the beginning of the file
Parse files up till position of newly entered character (each #include directive should try to use some cached list of symbols instead of including real file content)
When parser reaches current position, it has some local lookup table (or other structures) which would be used when current/next token would be parsed - these table is exactly what we need - list of symbols which match at current position.

Ok, but this may be time-consuming. So we could optimize it by skipping bodies of functions as I mentioned before.

To do this:

Start parsing from the beginning
when parser detects function body ( '{' token after function header ), it skipps tokens till there's closing '}' token (watch out for internal {} blocks) - it should use standard tokenization and preprocessing to skip all '{' and '}' inside comments strings, macro expansions, etc.
If we jump out of function body, we can continue normal parsing
Now if during the skipping processing we pass current position, this function is the one we have to parse correctly so we get back (or if toknes inside funciton were stored somewhere we jump to the beginning of this list) and do proper parsing.
If current token comes outside function we can handle it depending on symbol's context (like helping with includes)

This can be optimized futrher by checking what has changed in the file since previous processing. In most cases, code will be added sequentially so we would be able to use parser's local symbol tables from previous processing only altering them a little bit.

I may be wrong but current implementation in C::B looks like this:

Parse "using namespace" declarations
Detect what function we're in
Parse arguments of that function
Parse body of the function
Add symbols from function's scope
Build list of symbols

BYO

Title: Re: Regular expressions
Post by: stevenkaras on March 16, 2008, 01:34:18 am

BYO&JGM> Interesting discussion. The way I see it, there are 2 ways to handle calls:

1. Reparse the entire file in an optimized fashion
2. moving backwards, skip whitespace, then grab the token. Perhaps a quick function that takes a file position and grabs the token at that location would be useful. Note that this can be somewhat dangerous, as it could lead to a quickly outdated symbol table, unless all the conditions for adding/removing symbols are carefully designated.

So while we probably want to start out with the first, we can consider the second for a possible extension if optimization of this process is needed.

Also, I thought of another feature and a few more ways C::C should be called. The new feature would be tips on variables, showing the type, and possibly, the scope (if relevant). As for the events, C::C should be called on a mouse hover over an identifier after a timeout. Useful, in a pinch.

I'm updating the wiki with some of this, and rewriting a good chunk of what's already there. I can't sleep, and I've got a full pot of coffee. :mrgreen:

Almost forgot. I'm also looking at sources of a BNF definition of C++ so we can feed that into bison&flex. I've seen some promising stuff.

Edit:
I've been poking around, and found some interesting stuff. First off, C++ grammar is not LALR (http://en.wikipedia.org/wiki/LALR_parser) compatible. So without some inaccuracy, flex is unusable, although bison supposedly supports the more general parsing method (http://en.wikipedia.org/wiki/GLR_parser). The tradeoff is a big hike in inefficiency in order to work out all the accuracy issues(which would only pop up in somewhat extreme cases, as I'm told).

JGM> found an interesting read you might like: here (http://en.wikipedia.org/wiki/Name_mangling)

Also, I was able to find a BNF of C++ here (http://www.nongnu.org/hcb/), but it's put in with hyperlinks, and I'm not in a mood to remove by hand (or even write a script to do it for me), so I'll search some more. the GCC page (http://www.gnu.org/software/gcc/gcc.html) may have something.

Ummm...we have a problem. We've been talking about creating a code completion plugin for C++. Which version of C++? Reading the gcc manual, I remembered that the language is still being developed, and that there are many(at least 4 official, final) standards out there.

In any case, I've found a few files:
1. Appendix A of "The C++ Programming Language", Special Edition, which is apparently C++98, and is a bit convoluted, to say the least.
2. A supposedly broken grammar with (http://www.computing.surrey.ac.uk/research/dsrg/fog/v/CxxTester.y) and without (http://www.computing.surrey.ac.uk/research/dsrg/fog/v/CxxGrammar.y) actions(no clue)

In any case, apparently gcc switched from using a yacc/flex style parser to a hand-written recursive one a few years ago(around 2002), and have been using that since then. Poking around in a copy of the 4.2.1 branch, there are a few yacc files, but nothing really all that useful. We may need to write a new grammar from scratch(painful to think about, but unless anyone can come up with a better alternative...)

Title: Re: Regular expressions
Post by: JGM on March 16, 2008, 04:34:38 am

Wow, just read the wiki and I'm getting scare with all those examples on Complex cases of usage, and this is something that really needs some deep thinking :shock: we should wrap g++ code and modify it to fit all these needs (just a joke) :lol:

Well so we have a symbols class and initial requirements. should we start thinking on a parser class?
Also, does we are going to use 2 different parsers? preprocessors and normal c++ syntax?

typedef map<tokenIdentifier, stringRepresentation> Tokens;
//Maybe not quite nice, just thought about, token[PLUS] == CurrentString found by parser, maybe would create a nice and readable syntax

class PreprocessorsParser
{
Tokens m_Tokens;
vector<Symbols> m_SymbolsTable;
vector<string> m_Defines;
vector<string> m_Directories;
string m_Extensions;

void AddDefine(const string& define);
void AddDirectory(const string& directory);
void SetFilesExtension(const string& extensions); //Should be a comma or space seperated
vector<Symbols> GetSymbolsTable();
string GetParsedFile(); //This would return a string representing a cleaned file of unneeded macros such as #if, #ifndef #elif, #endif
void StartParsing();
};

class Parser
{
Tokens m_Tokens;
vector<Symbols> m_SymbolsTable;
vector<string> m_Directories;
string m_Extensions;

void AddDirectory(const string& directory);
void SetFilesExtension(const string& extensions); //Should be a comma or space seperated
vector<Symbols> GetSymbolsTable();
void StartParsing();
};

Two really incomplete examples just for further working. :wink: just keep it simple and easy to follow.
I'm really sleepy :oops:

Title: Re: Regular expressions
Post by: Ceniza on March 17, 2008, 07:22:59 pm

If anyone feels interested, I just finished documenting the code of what I have so far. The documentation uses Doxygen, and the output is here (http://ccp.cenizasoft.cjb.net/).

The 7z file expands to what the html folder has.

Title: Re: Regular expressions
Post by: JGM on March 17, 2008, 10:53:23 pm

good news! :D

Title: Re: Regular expressions
Post by: rathapolk on March 18, 2008, 08:42:09 am

Other Opinion

1. Move the cursor back to find the beginning of the first deepest local scope '{'.
2. Parse from the beginning of the scope to the user's cursor position and create a local symbol table.
3. if the symbol not found, move the cursor back to the bigger scope and parse till the beginning of the previous scope. go on.. until the cursor moved out of a function, then use the global symbol table.

this may help when scope braces are not perfectly matched, in upper functions.

Rathapol.K

Title: Re: Regular expressions
Post by: Ceniza on March 20, 2008, 12:00:55 am

Just to keep you informed, I'm about to start working on the parser. I'll surely work a bit more on the preprocessor (bugfixing, retouching, cleaning up code, ...), but, as it is right now, I have enough to start. In fact, I could say it is 99% complete by now. Unfortunately, the computer where I'm storing it is currently stuck (surely online, but unresponsive), and I'm a bit too far away from it to reset it myself (8.886,28 km according to Google Earth).

Also, just for fun, I ran through the current implementation of my preprocessor some evil code using macros that I wrote today to implement this (http://ceniza666.googlepages.com/INT_FLOAT_LITERAL.png) (it's part of the Lexer, and it tells you in a single pass if a PPNUMBER is an INTEGERLITERAL or a FLOATINGLITERAL), then took that output, ran it through the AStyle plugin, copy&pasted it into the original file (disabling temporarily the one with macros), compiled and ran the program. It compiled and ran as expected, so I guess it's working fine so far.

Now, before I start working on the parser, I'll design the symbol table for it and start analyzing some productions. I'll also check your ideas and see what I can borrow from them. I expect to have something to show you next week.

"Stay tuned, and stay infected." (I wonder if anyone knows where this quote comes from :D)

Title: Re: Regular expressions
Post by: Seronis on March 20, 2008, 02:50:47 am

Quote from: Ceniza on March 20, 2008, 12:00:55 am

"Stay tuned, and stay infected." (I wonder if anyone knows where this quote comes from :D)

I tried to google cheat. It linked back to this post. Damn their spiders are getting fast.

Title: Re: Regular expressions
Post by: Ceniza on March 28, 2008, 10:37:43 am

Quote from: Seronis on March 20, 2008, 02:50:47 am

Quote from: Ceniza on March 20, 2008, 12:00:55 am
"Stay tuned, and stay infected." (I wonder if anyone knows where this quote comes from :D)

I tried to google cheat. It linked back to this post. Damn their spiders are getting fast.

You didn't search carefully. It's in the first results page ;)

My expectations were wrong. The truth is that I've been working more on the parser than anything else. My first attempt was to create a simple parser like the one I used to evaluate constant expressions for #if directives, but the grammar as it is defined is not good for such a parser (LL(0)). LL(1) would help a bit more, but it wouldn't be enough. Besides, I don't want to calculate FIRST and FOLLOW to implement the parser. However, the LL(0) attempt gave me an idea: a brute force parser. I'm completely unaware of its performance, but it should just work. Quoting the description found in the YARD (http://www.ootl.org/yard/) page: "The distinguishing characteristic of a R-D parser, is that it employs a recursive brute-force algorithm to attempt to match the various grammar production rules. If a match fails, the parser backs up to the position it was when it attempted to match the failing rule."¹. Yet another parser I found, linked from YARD's website, is Biscuit (http://p-stade.sourceforge.net/biscuit/index.html). Right now I'm experimenting with something similar to Biscuit's syntax, but not exactly it. BTW, even Yacc has a special implementation that splits the parser when a grammar is detected as ambiguous, so it analyzes more than one production at a time.

To avoid more false expectations, I'll just say that I'll report more things when I have something to show.

Just for the sake of sharing, you can get the current bizarre code experiment from here (http://ceniza666.googlepages.com/BruteForce.cpp).

[1] (http://www.ootl.org/yard/#rd)

Title: Re: Regular expressions
Post by: JGM on March 28, 2008, 02:45:10 pm

thanks for the link of the source! :D I was trying to check the sources of your svn repository to study them, but i don't know how :oops:

Title: Re: Regular expressions
Post by: JGM on March 28, 2008, 02:50:17 pm

I have a questions, I'm curious to know if the implementation is going to support different file encodings, that should be something to visualize or anticipate, I think :roll:

Title: Re: Regular expressions
Post by: Ceniza on March 28, 2008, 07:28:00 pm

Quote from: JGM on March 28, 2008, 02:45:10 pm

thanks for the link of the source! :D I was trying to check the sources of your svn repository to study them, but i don't know how :oops:

I have to enable anonymous SVN access first :)

svn://svn.cenizasoft.cjb.net/svnrepos/CCP/trunk

Quote from: JGM on March 28, 2008, 02:50:17 pm

I have a questions, I'm curious to know if the implementation is going to support different file encodings, that should be something to visualize or anticipate, I think :roll:

Actually, the current implementation needs you to provide it with a few classes to be able to work. Right now I have ones using std::string and std::istream. If you want your own, you create them and feed the classes through template arguments. However, I still need to work a bit on that part.

Title: Re: Regular expressions
Post by: Ceniza on June 14, 2008, 08:07:52 am

It's been a while, and the news are _NOT_ as good as you would like them to be.

If you check the SVN commit log you'll see I started working on the project again about three weeks ago; however, I changed my proposal to implement ONLY the preprocessor. Implementing the whole thing would take me longer, and I do really want to graduate soon.

Well, the new proposal was approved, the project is finished, and the document is done. However, I might need to change a few things in the document if my director considers it necessary. I hope not :D

I wouldn't mind sharing the document, but it's in Spanish. Maybe the library reference would be of more help, yet it's not as interesting to read as the document itself :P

The next step: optimizing the library. I'm even planning to write a special String type for it to see if it improves performance (I'll do profiling later :P). I was even considering writing an article about the creation process of that String type plus the performance considerations and final implementation. Who knows? Maybe some of you may learn one or two things from it, or I just make you believe something that could not be true :P

Morten promised me to give me his blood in exchange for a good code-completion library. He threatened me too, but that's way OT :)

Let's see how it goes from here. At least the first step has been taken.

There's also a plugin laying around that can be installed in the latest Nightly build that tests the library. It doesn't offer any benefits, but it's good to see that its minimum functionality works.

-- END OF THE MESSAGE --

This message was written using only recycled bytes. The colors for the smileys do NOT use any dangerous substances. If your children eat this message take them to the hospital immediately.

Title: Re: Regular expressions
Post by: JGM on June 14, 2008, 07:33:52 pm

Quote from: Ceniza on June 14, 2008, 08:07:52 am

I wouldn't mind sharing the document, but it's in Spanish. Maybe the library reference would be of more help, yet it's not as interesting to read as the document itself :P

My native language is spanish :D

Title: Re: Regular expressions
Post by: Ceniza on June 15, 2008, 07:20:23 am

Quote from: JGM on June 14, 2008, 07:33:52 pm

Quote from: Ceniza on June 14, 2008, 08:07:52 am
I wouldn't mind sharing the document, but it's in Spanish. Maybe the library reference would be of more help, yet it's not as interesting to read as the document itself :P

My native language is spanish :D

I know :D

If that means you want to read the file then suit yourself (http://ceniza666.googlepages.com/DocumentoFinal.pdf).

BTW, that's just a preliminary version... which I hope becomes the final one.

I also started writing the document about the creation process of the String type. So far it only has the preliminary analysis, but may be a good read too. Link (http://ceniza666.googlepages.com/caost4ccp.html).

Title: Re: Regular expressions
Post by: JGM on June 16, 2008, 06:20:24 pm

I was reading your project documents and liked very much the irc chat conversation :P I felt like the beginner. I haven't finished reading but I'm at page 37 (and reading) I can say that you could be a great professor (you are really good with words) :) Your university faculty should approve that immediately is an A++.

Also I partially read your document for the new string implementation class and it's true. I confronted all those things while implementing a csv reader and writer I made, reading one byte at time was pretty slow, opening the hole file at once was amazingly fast, but using a bigger buffer should help. There should be functions to detect the size of the string to store and if it's to big for memory to handle parse it using a big amount of buffer depending on the size of the string, else store it completely on memory and parse it there since it is much faster.

The reading was pretty cool, implementing all those things properly should produce a really high speed c/c++ parser.

Title: Re: Regular expressions
Post by: Ceniza on June 17, 2008, 02:29:22 am

Quote from: JGM on June 16, 2008, 06:20:24 pm

I was reading your project documents and liked very much the irc chat conversation :P I felt like the beginner. I haven't finished reading but I'm at page 37 (and reading) I can say that you could be a great professor (you are really good with words) :) Your university faculty should approve that immediately is an A++.

Heh, thanks. Actually, I was a bit afraid of having such an introduction for the document, but my thesis director said it was OK. I also handed over the document today. Let's hope I graduate soon :)

Thanks for the 'professor' thing too. In fact, two proposals that I've received have been to teach: one at the University and the other one in an English institution. The idea sounds nice, but I prefer to write programs at the moment. I could teach later :P

Quote from: JGM on June 16, 2008, 06:20:24 pm

The reading was pretty cool, implementing all those things properly should produce a really high speed c/c++ parser.

That's the idea. I'm not completely satisfied by the speed of the current implementation. If you want to add a syntactic and semantic analysis too, the speed of the preprocessor should be very good. I'll finish that string type first, check the performance gain (or loss) and try to optimize for real after doing some profiling.

Title: Re: Regular expressions
Post by: Ceniza on July 02, 2008, 03:55:54 am

Quick news: my thesis has been approved and was graded with a "sobresaliente". The meaning of that word is something like "pretty good or ~excellent". Thesis are graded like this in here: not-approved, approved, "sobresaliente" and laureate. Not bad :D

In other news... I've finished the document where I implement the new string type for the preprocessor. It also means that I've finished the whole thing. Quoting from my own document:

Quote from: http://ceniza666.googlepages.com/caost4ccp.html

The tests used three different types of string: STDString, CCPString<SimpleAllocator> and CCPString<CCPStringAllocator>. Here are the averaged results:

STDString: 0.409 s
CCPString<SimpleAllocator>: 0.193 s
CCPString<CCPStringAllocator>: 0.182 s

For being a SWAG we achieved a 2.12x performance gain using SimpleAllocator and 2.25x using CCPStringAllocator. Not bad!

What about memory usage? Let's take a look:

STDString: 11864 KiB
CCPString<SimpleAllocator>: 11792 KiB
CCPString<CCPStringAllocator>: 18496 KiB

Now the situation changes a bit. When using the SimpleAllocator we save less than 1% of memory, but using CCPStringAllocator increases the memory usage by a 56%, both compared to the memory usage when using STDString. I'm still curious about it. It'd be a good idea to play with different values for blockSize and see how the program behaves.

For those who also care about executable size, here are the results for each case:

STDString: 266.87 KiB
CCPString<SimpleAllocator>: 210.80 KiB
CCPString<CCPStringAllocator>: 238.81 KiB

Overall, SimpleAllocator has the best performance/memory usage/executable size ratio. Still, CCPStringAllocator gives the highest performance, but not for much compared to SimpleAllocator (about 6%).

I'd say we achieved our goal.

More than a 2x performance gain, and the library itself can be optimized too.

Next step: the painful parser.

Finally, to comply with my own disclaimer:

Quote from: Disclaimer

This document is copyrighted by me (Paúl Andrés Jiménez). If you reproduce this file or parts of it in your website or a document, print it, save it, borrow it, share it, whatever... you're gonna Burn in Hell (it's actually a nice song by Dimmu Borgir). If you try to take ownership of this file, I'll f*cking sue you! That is, if I ever find out that you did. If burning in Hell is good enough for you, just don't forget to include this disclaimer. Also, all the names of companies, products and such are owned by their owners (doh!).

Title: Re: Regular expressions
Post by: JGM on July 03, 2008, 08:17:31 am

great! congratulations!

That disclaimer is funny :lol:

Code::Blocks Forums

Developer forums (C::B DEVELOPMENT STRICTLY!) => Development => CodeCompletion redesign => Topic started by: JGM on December 05, 2007, 04:02:35 am