Author Topic: Regular expressions  (Read 73273 times)

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #15 on: March 02, 2008, 01:40:04 am »
I enjoyed reading what you had to say. It was particularly well written and well thought out. I had put some more thought into the problem, and wrote a few pages on it. Here's more or less what I've written. It's lacking in some places, but I'm still working on it:



Having put yet more thought into this(what can I say? I'm bored), I've come up with a list of what the structures should look like:

Namespace(of which the global namespace is a special case of):
- list of all functions
- list of all variables
- list of all enumerations
- list of all typedefs
- list of all classes
- list of all namespaces (note that children of these can be moved up by using the using keyword)
- list of all preprocessor definitions (all defs go in the global object)

Class:
- list of base classes
- list of variables
- list of enumerations
- list of typedefs
- list of methods
- list of static variables
- list of static methods
- list of classes(always static)

Now we just need to pick what the best container would be for the lists, and write the code to merge these lists into the main one, which would differ in that instead of actually holding content, it would hold pointers to said content.

There are a few more questions, of course (along with my own answers):
* What sort of information should be held about the various identifiers?
  - type for variables
  - fuction signature for functions/methods
  - the other end of a typedef
  - Classes should check for inheritance
  - Classes should check for storage class (private, protected, or public), and store that info for all members
  - line number of declaration/prototype/implementation
* What will the processor time be for all of this?
* What big will the memory footprint be?
* Are we missing anything?
  - yes, but I can't put my finger on it...
* Are there any easily-foreseen problems/difficulties?
  - Typedef walking (checking for members of a typedeffed object)
  - Array support
  - Parameterized preprocessor definitions
  - unnamed namespaces. Sometimes I want to strangle the ANSI/ISO committee.
 
In addition, I just looked at the Visual Assist X website. They have some really good ideas there. So good, in fact, that I'd like to change number 2 from the main procedure to this:
2. Reduce the list to the most likely/probable solutions to the current identifier



I've written out more or less what we'll need to implement the structures. I figure there's one thing in common with all the elements, and since C++ lends itself to this so well, we should have a base class for all identifiers, which all of the various types can be derived classes of:

1. Identifiers (the base class)
Code
class identifier
{
string name; // the identifier name
int decl_line; // line number of declaration (prototype for functions)
virtual string tooltip(void) = 0; // returns what a tooltip should display for the identifier
virtual string listname(void) = 0; // returns what the list name should look like
};

2. Variables
As much as it's tempting to add all sorts of flags about storage classes and modifiers, remember that when you do that, you increase the storage space by o(n). If we store it in a different list, you increase the storage space by o(1). Besides, the string is a descriptor, nothing more.

Code
class variable : public identifier
{
string type; // the type of the variable
};

3. Enumerations
Code
class enumeration : public identifier
{
};

4. Typedef
Yes, I'm aware typedef is a keyword. I don't particularly care (it's not real code)
Code
class typedef : public identifier
{
string type; // the base type
};

5. Method/Function
Code
class function : public identifier
{
int impl_line; // line number of the definition
string returns; // the return type
string signature; // The parameter list
};

6. Preprocessor defs
I've got absolutely no clue how to handle parameterized macros
Code
class preprocdef : public identifier
{
string macro; // the other side of the macro
};

7. Namespaces
Code
class namespace : public identifier
{
list variables;
list enumerations;
list typedefs;
list functions;
list classes;
list namespaces;
void using(identifer); // to support the using keyword(it brings something into the current namespace)
};

8. Classes
There's a slight problem here. Classes can be split into 6 sections: static/nonstatic, and then public, protected, and private. And they're all relevant.
Code
class class : public identifier
{
list base_classes;
list variables;
list enumerations;
list typedefs;
list functions;
list classes;
list namespaces;
list static_variables;
list static_enumerations;
list static_typedefs;
list static_functions;
list static_classes;
list static_namespaces;
}

9. File
We only really want to cache external linkage(because the internal linkage of a file changes too quickly when we write code, and I'm not interested in telling the parser to reparse after every character typed.
Code
class file
{
string filepath; // the filename + path (to open it quickly for reference use)
namespace global; // the global namespace
}

I'll check my books if there's anything I missed language-wise. So far the only things missing/un-supported are unnamed namespaces and parameterized macros.

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #16 on: March 02, 2008, 06:46:35 am »
I think you are missing #defines well, macros, if defined then parse this area or the other one

Offline byo

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 836
Re: Regular expressions
« Reply #17 on: March 03, 2008, 12:18:43 am »
I also thought about data structures in cc. And in my opinion, each symbol (variable, class, typedef, enum etc.) should use one class to represent all informations they provide. That would really help streaming the data and if it's flexible enough it could be used for languages other that c++ too (although it's not it's main purpose).

And here's my concept. Each symbol should have following data:

Code
class symbol
{
    string name;              // name of the symbol
    int    id;                // Id of the symbol, should be unique in the project
    int    file_id;           // if od file where the symbol has been declared
    int    filepos_begin;     // Position where declaration of the symbol starts
    int    filepos_end;       // Position where declaration of the symbol ends
    int    type;              // Type of the symbol: macro / class / typedef / variable / function
    int    modifiers;         // Bitfield used to mark some estra properties of symbol like that it is static or inline
    int    value_type_id;     // Id of symbol which represents c++ type of current symbol (like type of variable or type of returned value from function)
    int    extra_type_id;      // Extra type used in some cases
    list   children;          // List of child elements of this symbol (members in class etc)
    list   extra_lists[3];    // Some extra lists which can provide additional symbols depending on type of current
                              // symbol - like list of base classes or list of template arguments, maybe we could give
                              // more than 3 lists, but I didn't found any reason for that now.
    map    extra_values;      // int -> string map which can keep some extra data
}

Each element of symbol list should have following properties:
Code
class list
{
    int symbol_id;        // Id of referenced symbol
    int scope;            // Scope of the symbol (public / private / protected ... ), don't have to be used
}

Most fields of the symbol class are easy-to-understand and their usage is rather straightforward. But interpretation of data should depend on symbol type.

Ok, here's detailed information on how the fields should be used in case of symbol types:

type|modifiers|value_type_id|extra_type|children|extra_lists[0]|extra_lists[1]|extra_lists[2]
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
namespace|
|
|
|
|
|
|
|
|
|
|
|
declarations in
namespace
|
|
|
"using" namespaces|
|
|
|
|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
class / struct / union|
|
|
|
|
|
|
|
members of class|
|
base classes|
|
template args|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
variable|
|
extern, static, volatile, const|
|
type of variable|
|
|
|
|
|
|
|
|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
function|
|
static, inline, const ...|
|
returned value|
|
|
|
arguments|
|
template arguments|
|
|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
typedef|
|
pointer, array, reference, pointer_to_member|
|
base type|
|
type of class in pointer_to_member|
|
|
|
|
|
|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
enum|
|
|
|
|
|
|
|
items in enum|
|
|
|
|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
enum item|
|
|
|
|
|
id of enum|
|
|
|
|
|
|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
macro|
|
|
|
|
|
|
|
macro parts|
|
|
|
|
|
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
macro part|
|
arg_to_string, va_args|
|
number of arg or -1|
|
|
|
|
|
|
|
|
|

Such representation would require some extra "hidden" symbols - for example when some complex type is returned from function, extra symbol of typedef representing proper value would be required.

Also in case of templates, typeid's should be threated in special way - negative value could mean to use template argument instead of some real type. Base types (the POD ones) should have some predefined type ids.

This is my proposition. May not be perfect (it's late here and my memory tends to forget important things) but is a sum of few months when I thought about CC from time to time ;)

Regards
   BYO

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #18 on: March 05, 2008, 01:30:24 am »
Looked really good. But I would have to disagree with placing the static keyword as a modifier in and of itself. Keep in mind what effect static has:

symbol type|without static|with static
global variable|external linkage|internal linkage
local variable|automatic storage|program storage
function|external linkage|internal linkage
class member|only accessed through an object|only accessed through the qualified name
Note: I probably forgot something

Couple of other things:
1. As for the extra lists, wouldn't it be better to use inheritance to implement that concept, rather than placing it in the base class?
2. Again, you use the extra list to show use of the using keyword, but I think it'd be simpler to effectively allow the transfer of symbols between namespaces. Especially once you consider the various ways you can use the using keyword (using namspace std; using std::cout; using ::myVar; etc)
3. JGM> I didn't include the preprocessor definitions because the preprocessor has it's own syntax, and is loosely connected to the C++. Keep in mind that you're actually programming in 2 languages at the same time.
4. BYO>Providing a symbol table is a hard task. But I like to see that everyone has put some thought into it. I got where you were going with the class, trying to avoid using inheritance, and the virtual table, but the code can be inefficient at first, and we can always re-implement it later as a monolithic class.
5. Some of the things you had in your class were very good (and even more importantly, it included some essentials that I forgot) such as file_id.
6. I think we should back off on storing everything about each variable for a bit, as it's just unnecessary, and focus on getting a basic implementation working. But I would like to mention that we should keep in mind that there are other uses for a symbol table other than code completion: a symbol browser, improved syntax highlighting (that would catch non-identifiers), and code refactoring.

So I'll sit down later on this week and work out a second draft of the classes.

-steven

PS. Does anyone ever post here when it isn't late at night?

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #19 on: March 05, 2008, 09:10:44 am »
Quote
3. JGM> I didn't include the preprocessor definitions because the preprocessor has it's own syntax, and is loosely connected to the C++. Keep in mind that you're actually programming in 2 languages at the same time.

Yup, that's right. Trying to mix both wouldn't be easy at all. Macros can create new names, new functions, new everything. That's why a preprocessing stage is necessary if we really want to know what's available from the code to the developer.

Saving information from the preprocessing stage is simpler. In fact, I have that already. To continue the discussion, my current implementation of the preprocessor's SymTab Element is as follows:

Code
    template <class StringType, class TokenType>
    struct PPSymTabElement
    {
        typedef std::vector<StringType> ParameterListType;
        typedef std::vector<TokenType> ReplacementListType;

        bool valid;
        bool isFunc;
        ParameterListType parameters;
        ReplacementListType replacement;

        PPSymTabElement() : valid(false), isFunc(false) {}
    };

Inheritance is very important when designing the nodes for the Abstract Syntax Tree that the parser will create. Having in mind the whole bunch of different types you can have in C++, the parser's SymTab Element must be very flexible, so inheritance should be considered too.

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #20 on: March 05, 2008, 05:00:44 pm »
Quote
6. I think we should back off on storing everything about each variable for a bit, as it's just unnecessary, and focus on getting a basic implementation working. But I would like to mention that we should keep in mind that there are other uses for a symbol table other than code completion: a symbol browser, improved syntax highlighting (that would catch non-identifiers), and code refactoring.

I've thought about this even more, and it makes sense to split up code completion into several plugins. The basic one would provide the symbol table, along with an API for accessing that table(remind me how document trees are handled by wx?). Others can build on top of that, providing functionality for code completion, a symbol browser, and code refactoring.

Ceniza, could you explain a bit more about your class? I didn't quite grasp what you tried to do(although it looks sound).

Other than that, I'm a bit swamped this week, so at least until Sunday I won't have much time to think for myself  :?

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: Regular expressions
« Reply #21 on: March 05, 2008, 10:42:07 pm »
Ceniza, could you explain a bit more about your class? I didn't quite grasp what you tried to do(although it looks sound).

I guess you're asking about the code I pasted. If that's the case, it's quite simple: there're two different types of macros. The first type are object-like macros, and the second type are function-like macros. isFunc indicates whether it's a function-like macro or not (obviously, it's an object-like one if false). If it's a function-like macro, then the parameters list (or vector if you prefer) will be populated with the name of all its parameters (the comma separated list of names that follows the left-parenthesis and which ends with a right-parenthesis), and if it's an object-like macro, then it 'll be empty. replacement is what follows the macro name and parameters (if it's a function-like one). valid... well, let me explain something first.

You may be wondering where the heck I'm storing the name of the macro. The name of the macro is the key of a LookupTable (std::map for example), and the element is an instance of PPSymTabElement. A requirement of the LookupTable is to return a reference to the element stored in it. If the element doesn't exist, then it returns a reference to a PPSymTabElement created with the default constructor. Now, valid is set to false by default, and it must be set to true before inserting a new element to the LookupTable. That's how you tell the difference between found and not found. It's surely not the most elegant solution, but, since it's my own design, I can do whatever I want. Furthermore, the whole code is full of strange design decisions just for fun... and to later test the parser's ability to understand itself.

I hope it's clearer now.

Now... to finish this post... what everyone was waiting for: pictures of naked women!

LOADING ...

Offline byo

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 836
Re: Regular expressions
« Reply #22 on: March 05, 2008, 11:36:32 pm »
symbol type|without static|with static
global variable|external linkage|internal linkage
local variable|automatic storage|program storage
function|external linkage|internal linkage
class member|only accessed through an object|only accessed through the qualified name

Ok, I still don't agree that separate lists are required. When you want to have code-completion. In most cases it will cause us to process few symbol tables instead of one. And let's take an example:

Code
class Stuff
{
    public:
        static int m_StaticMember;
};
int Stuff::m_StaticMember = 10;

void func(void)
{
    Stuff object;

    Stuff::m_StaticMember = 11;    // This is valid code
    object.m_StaticMember = 12;   // This is valid code too
}

Now if we want to complete in the first case (through scope name) we can skip all non-static variables and functions, in second case we have to enumerate all no matter what modifiers are applied. And it may be helpful to provide whole list in first case just to give user's some hint on the content of class.

Quote
1. As for the extra lists, wouldn't it be better to use inheritance to implement that concept, rather than placing it in the base class?

That was caused becuase I wanted to have one class for any type of symbol that will be good enough without any inheritance. And we can always create some helper classes that will hide the "ugly" interface of general symbol class under some nice well-defined interface - for example:

Code
class Symbol
{
public:
    list m_ExtraStuff[3];
};

class ClassSymbol
{
    Symbol * m_Symbol;
public:
    ClassSymbol(Symbol* symbol): m_Symbol(symbol) {}
    list& GetMembers() { return m_Symbol->m_ExtraStuff[0]; }
};

Now we have few benefits of one unified interface:
* sizeof(Symbol) is always the same, no matter what symbol contains - this allows to create arrays of objects. When using inheritance only arrays of pointers can be created.
* if you look into current code::completion sources you may find that symbols use block-allocated method of memory allocation - this gives some nice boost when objects are allocated and deallocated frequently and this method can not be used in case of inheritance
* saving symbols cache and restoring from files is really easy. In case of inheritance we would have to use different method of saving and loading for each derived class  so we wouldn't be able to do this without symbol factories etc.

Quote
2. Again, you use the extra list to show use of the using keyword, but I think it'd be simpler to effectively allow the transfer of symbols between namespaces. Especially once you consider the various ways you can use the using keyword (using namspace std; using std::cout; using ::myVar; etc)

I thought about such concept but let's take an example again: you create 10 namespaces and in each of those namespaces you put using std; - we can really quickly run out of memory.
And one more think - when symbol-lookup will be made it will usually have to lookup few namespaces (in most cases we will have at least 3: global namespace, local symbols and class members). So adding few more into current list of namespaces shouldn't be really so expensive.

Quote
4. BYO>Providing a symbol table is a hard task. But I like to see that everyone has put some thought into it. I got where you were going with the class, trying to avoid using inheritance, and the virtual table, but the code can be inefficient at first, and we can always re-implement it later as a monolithic class.
Sure, I'm not the boss here nor the most c++ expert :) I just want to have really good C::C  8)

Quote
6. I think we should back off on storing everything about each variable for a bit, as it's just unnecessary, and focus on getting a basic implementation working. But I would like to mention that we should keep in mind that there are other uses for a symbol table other than code completion: a symbol browser, improved syntax highlighting (that would catch non-identifiers), and code refactoring.

Ok, but let's still keep one thing on mind: c++ parsing is really _complex_ task because syntax of this language is so messed up. Some mistakes on design stage may be really painfull later ;). But I agree that anything working is better than something that does not.
[/quote]

Regards
   BYO

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #23 on: March 09, 2008, 06:23:21 pm »
Sorry for the delay. Real life did what it does best: rears its ugly head and distracts you from the fun.

In any case, I'm really glad to see this sort of discussion going on. It shows that there are people who care very deeply about C::C and C::B. At this stage, I have a few things to say:

1. BYO> Wonderful reply. You caught some mistakes of mine and rationalized your arguments for the symbol class quite well. Although I did check out the static thing. The reason why I had written that was that I was thinking of typedefs in classes, which must be addressed through a qualified name, rather than an object. Good for us that this gives us an easy way out (there's only one exception, which makes it easier). In addition, the concept of a wrapper class to decouple the implementation of the symbols is wonderful.
2. As for handling multiple lists, yes, 3 does seem to be the most relevant. In addition, the reason why we should keep in mind linkage is that we want to cache the external linkage of each file, not the internal(which I think should be generated on the fly).
3. If we can define a wrapper class, or at least a less ugly interface for the symbol class, this would be a great step in the right direction.
4. We all forgot about friends of classes. These have access to the class scope.
5. I'll sit down over the next few days and write a revised proposal, and post it up here once I'm finished.
6. Again, sorry for the delay, but reality is a bit more important, or at least urgent.

Offline byo

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 836
Re: Regular expressions
« Reply #24 on: March 10, 2008, 10:53:25 pm »
Sorry for the delay. Real life did what it does best: rears its ugly head and distracts you from the fun.

It's rather usual in open-source projects :) We all have to work for food ;)

Quote
2. As for handling multiple lists, yes, 3 does seem to be the most relevant. In addition, the reason why we should keep in mind linkage is that we want to cache the external linkage of each file, not the internal(which I think should be generated on the fly).

Ok, but it's as linker sees it. The problem is that c++ is really poor in terms of external / internal linkage. There's really nothing like "using" keyword from java in c++ (well, standard says that some cases of #include may have such behavior but I didn't found it so far).
Each #include does only mean: "put the content of file blahblah here". So the only assumption we can do here is that everything declared in header files should be threated as external stuff and everything in cpp files should be threated as internal stuff (as long as someone do not include cpp file ;) ). So external keyword is only a hint to compiler that such symbol is only declared here and definition stays elsewhere (and will be found while linking).

So my suggestion is like this:
  • Keep all known header files preprocessed somewhere (memory / files / mixed storages)
  • When working on some file, reparse it as it's changed (this will require some performance tuning - like skipping bodies of functions that are not user), if this file is header such change may fire some update of cache listed in previous point
  • When working with file in second point, each #include directive which is on the global scope (so not inside any namespaces, not in classes etc. because that require special threating) adds all keywords from such file into current lookup structures, if it's not in cache or we can not use the cache, we behave as normal preprocessor by including content of such file

Hmm, or maybe I misunderstood something about this external / internal linkage ;)

Quote
5. I'll sit down over the next few days and write a revised proposal, and post it up here once I'm finished.

Looking formard for the proposal :) maybe it could be put onto wiki so it would be easier to work on it ?

Regards
   BYO

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #25 on: March 11, 2008, 04:19:17 am »
Looking formard for the proposal :) maybe it could be put onto wiki so it would be easier to work on it ?
Done. Here's the link:
http://wiki.codeblocks.org/index.php?title=Code_Completion_Rewrite

Actual meat to follow.

mmmm...hamburger...*wipes drool*

--steven

Offline byo

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 836
Re: Regular expressions
« Reply #26 on: March 11, 2008, 01:20:55 pm »
Done. Here's the link:
http://wiki.codeblocks.org/index.php?title=Code_Completion_Rewrite

I've added section where we can put some more complex C::C usage examples. Hope C::C will be able to handle them all soon :) And this can be a good list of test to perform.

Regards
   BYO

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #27 on: March 11, 2008, 01:38:00 pm »
Cool to see so many interest and effort on planning a new code completion plugin. I just wanted to point that eranif has made a great job, I think that his library support everything that has been discussed here. Check his sample videos here http://codelite.org. His code completion library has involved into a full C/C++ IDE. Some time ago I talk to him to implement a plugin for codeblocks that used his library but I have become some lazy and also out of time.

Maybe theres no need to reinvent the wheel if is good enough  :D
« Last Edit: March 11, 2008, 01:40:37 pm by JGM »

Offline stevenkaras

  • Multiple posting newcomer
  • *
  • Posts: 18
Re: Regular expressions
« Reply #28 on: March 11, 2008, 05:12:29 pm »
I've fleshed out what was there, and updated it to BYO's proposal with a few minor modifications :wink:

I'll get around to revising it sometime in the near future. In the meantime, take a look, edit it if you think you've worked out something better, etc.

Cheers!
steven

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: Regular expressions
« Reply #29 on: March 11, 2008, 07:57:47 pm »
Nice! that's really team work  :D If things keep going like this we could have the best generic parser  out there, so I would say that 2 brains work better than a lonely one  8). Everybody Join the C::C plugin revolution! :) I feel like a cheers leader  :oops: