Regular expressions

Developer forums (C::B DEVELOPMENT STRICTLY!) > CodeCompletion redesign

stevenkaras:
I enjoyed reading what you had to say. It was particularly well written and well thought out. I had put some more thought into the problem, and wrote a few pages on it. Here's more or less what I've written. It's lacking in some places, but I'm still working on it:

Having put yet more thought into this (what can I say? I'm bored), I've come up with a list of what the structures should look like:

Namespace (the global namespace is a special case of this):
- list of all functions
- list of all variables
- list of all enumerations
- list of all typedefs
- list of all classes
- list of all namespaces (note that children of these can be moved up by using the using keyword)
- list of all preprocessor definitions (all defs go in the global object)

Class:
- list of base classes
- list of variables
- list of enumerations
- list of typedefs
- list of methods
- list of static variables
- list of static methods
- list of classes (always static)

Now we just need to pick what the best container would be for the lists, and write the code to merge these lists into the main one, which would differ in that instead of actually holding content, it would hold pointers to said content.
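
A minimal sketch of the merge step described above, assuming a name-sorted `std::multimap` as the container for the main list. The names `Identifier`, `Scope`, `MergedIndex` and `mergeScope` are hypothetical and only illustrate the idea that the main list stores pointers rather than content:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical minimal record; a real one would also carry type, line numbers, etc.
struct Identifier {
    std::string name;
};

// Owns the content for one scope (a namespace or a class).
struct Scope {
    std::vector<Identifier> functions;
    std::vector<Identifier> variables;
};

// The main list holds pointers only, ordered by name.
using MergedIndex = std::multimap<std::string, const Identifier*>;

// Adds every identifier of one scope to the main list without copying it.
// The scope must outlive the index and must not be resized afterwards,
// otherwise the stored pointers dangle.
void mergeScope(const Scope& scope, MergedIndex& index) {
    for (const Identifier& f : scope.functions) index.emplace(f.name, &f);
    for (const Identifier& v : scope.variables) index.emplace(v.name, &v);
}
```

A multimap is used here because overloaded functions share a name; whether it is the best container is exactly the open question.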

There are a few more questions, of course (along with my own answers):
* What sort of information should be held about the various identifiers?
- type for variables
- function signature for functions/methods
- the other end of a typedef
- Classes should check for inheritance
- Classes should check for storage class (private, protected, or public), and store that info for all members
- line number of declaration/prototype/implementation
* What will the processor time be for all of this?
* How big will the memory footprint be?
* Are we missing anything?
- yes, but I can't put my finger on it...
* Are there any easily-foreseen problems/difficulties?
- Typedef walking (checking for members of a typedeffed object)
- Array support
- Parameterized preprocessor definitions
- unnamed namespaces. Sometimes I want to strangle the ANSI/ISO committee.

In addition, I just looked at the Visual Assist X website. They have some really good ideas there. So good, in fact, that I'd like to change number 2 from the main procedure to this:
2. Reduce the list to the most likely/probable solutions to the current identifier
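
One simple way to reduce the list, sketched under the assumption that "most likely" starts with "matches what has been typed so far". The function name `candidatesFor` is hypothetical, and this is a plain prefix filter over a sorted list, not what Visual Assist X actually does:

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Returns all names in a sorted list that start with the given prefix.
// `sortedNames` must already be sorted in ascending order.
std::vector<std::string> candidatesFor(const std::vector<std::string>& sortedNames,
                                       const std::string& prefix) {
    std::vector<std::string> result;
    // Binary search for the first name that is not less than the prefix.
    auto it = std::lower_bound(sortedNames.begin(), sortedNames.end(), prefix);
    // All matches are contiguous from there on.
    for (; it != sortedNames.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        result.push_back(*it);
    return result;
}
```

Ranking the remaining candidates (by scope, by recent use, and so on) would be a separate step on top of this filter.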

I've written out more or less what we'll need to implement the structures. I figure there's one thing in common with all the elements, and since C++ lends itself to this so well, we should have a base class for all identifiers, which all of the various types can be derived classes of:

1. Identifiers (the base class)

--- Code: ---
class identifier
{
    string name; // the identifier name
    int decl_line; // line number of declaration (prototype for functions)
    virtual string tooltip(void) = 0; // returns what a tooltip should display for the identifier
    virtual string listname(void) = 0; // returns what the list name should look like
};
--- End code ---

2. Variables
As tempting as it is to add all sorts of flags for storage classes and modifiers, remember that every flag you add to the class increases the storage space by O(n): one copy per variable. If we instead store such variables in a different list, the extra storage is O(1). Besides, the type string is a descriptor, nothing more.

--- Code: ---
class variable : public identifier
{
    string type; // the type of the variable
};
--- End code ---

3. Enumerations

--- Code: ---
class enumeration : public identifier
{
};
--- End code ---

4. Typedef
Yes, I'm aware typedef is a keyword. I don't particularly care (it's not real code)

--- Code: ---
class typedef : public identifier
{
    string type; // the base type
};
--- End code ---

5. Method/Function

--- Code: ---
class function : public identifier
{
    int impl_line; // line number of the definition
    string returns; // the return type
    string signature; // the parameter list
};
--- End code ---

6. Preprocessor defs
I've got absolutely no clue how to handle parameterized macros

--- Code: ---
class preprocdef : public identifier
{
    string macro; // the other side of the macro
};
--- End code ---

7. Namespaces

--- Code: ---
class namespace : public identifier
{
    list variables;
    list enumerations;
    list typedefs;
    list functions;
    list classes;
    list namespaces;
    void using(identifier); // to support the using keyword (it brings something into the current namespace)
};
--- End code ---

8. Classes
There's a slight problem here. Classes can be split into 6 sections: static/nonstatic, and then public, protected, and private. And they're all relevant.

--- Code: ---
class class : public identifier
{
    list base_classes;
    list variables;
    list enumerations;
    list typedefs;
    list functions;
    list classes;
    list namespaces;
    list static_variables;
    list static_enumerations;
    list static_typedefs;
    list static_functions;
    list static_classes;
    list static_namespaces;
};
--- End code ---

9. File
We only really want to cache external linkage (because the internal linkage of a file changes too quickly when we write code, and I'm not interested in telling the parser to reparse after every character typed).

--- Code: ---
class file
{
    string filepath; // the filename + path (to open it quickly for reference use)
    namespace global; // the global namespace
};
--- End code ---

I'll check my books if there's anything I missed language-wise. So far the only things missing/un-supported are unnamed namespaces and parameterized macros.
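
For reference, a compilable sketch of the same base-class idea. It assumes capitalised class names (`Identifier`, `Variable`, `Function`) to get around the keyword clashes, makes the members public so derived classes can reach them, and picks an arbitrary format for the tooltip text. None of that is settled in the draft above:

```cpp
#include <string>
#include <utility>

// Base class for everything the parser can name.
class Identifier {
public:
    Identifier(std::string name, int declLine)
        : name(std::move(name)), declLine(declLine) {}
    virtual ~Identifier() = default;

    virtual std::string tooltip() const = 0;               // text for the tooltip
    virtual std::string listName() const { return name; }  // text for the completion list

    std::string name;
    int declLine;  // line number of the declaration
};

class Variable : public Identifier {
public:
    Variable(std::string name, int declLine, std::string type)
        : Identifier(std::move(name), declLine), type(std::move(type)) {}
    std::string tooltip() const override { return type + " " + name; }

    std::string type;  // the type of the variable
};

class Function : public Identifier {
public:
    Function(std::string name, int declLine, std::string returns, std::string signature)
        : Identifier(std::move(name), declLine),
          returns(std::move(returns)), signature(std::move(signature)) {}
    std::string tooltip() const override { return returns + " " + name + signature; }
    std::string listName() const override { return name + "()"; }

    std::string returns;    // the return type
    std::string signature;  // the parameter list, parentheses included
};
```

With this, `Variable("count", 10, "int").tooltip()` yields `int count`, and a list of `Identifier*` can hold variables and functions side by side.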

JGM:
I think you are missing #defines, or rather macros and conditional compilation: if something is defined, parse this area, otherwise the other one.

byo:
I've also thought about data structures in CC. In my opinion, every symbol (variable, class, typedef, enum etc.) should be represented by a single class that holds all the information it provides. That would really help with streaming the data, and if it's flexible enough it could be used for languages other than C++ too (although that's not its main purpose).

And here's my concept. Each symbol should have the following data:

--- Code: ---
class symbol
{
    string name; // Name of the symbol
    int id; // Id of the symbol, should be unique in the project
    int file_id; // Id of the file where the symbol has been declared
    int filepos_begin; // Position where the declaration of the symbol starts
    int filepos_end; // Position where the declaration of the symbol ends
    int type; // Type of the symbol: macro / class / typedef / variable / function
    int modifiers; // Bitfield used to mark some extra properties of the symbol, e.g. that it is static or inline
    int value_type_id; // Id of the symbol which represents the C++ type of the current symbol (like the type of a variable or the return type of a function)
    int extra_type_id; // Extra type used in some cases
    list children; // List of child elements of this symbol (members of a class etc.)
    list extra_lists[3]; // Some extra lists which can provide additional symbols depending on the type of the current
                         // symbol - like a list of base classes or a list of template arguments. Maybe we could give
                         // more than 3 lists, but I haven't found any reason for that so far.
    map extra_values; // int -> string map which can keep some extra data
};
--- End code ---

Each element of a symbol list should have the following properties:

--- Code: ---
class list
{
    int symbol_id; // Id of the referenced symbol
    int scope; // Scope of the symbol (public / private / protected ...), doesn't have to be used
};
--- End code ---

Most fields of the symbol class are easy-to-understand and their usage is rather straightforward. But interpretation of data should depend on symbol type.

Ok, here's detailed information on how the fields should be used in case of symbol types:

type | modifiers | value_type_id | extra_type | children | extra_lists[0] | extra_lists[1] | extra_lists[2]
----------------|----------------|----------------|----------------|----------------|----------------|----------------|----------------
namespace | | | | declarations in namespace | "using" namespaces | |
class / struct / union | | | | members of class | base classes | template args |
variable | extern, static, volatile, const | type of variable | | | | |
function | static, inline, const ... | returned value | | arguments | template arguments | |
typedef | pointer, array, reference, pointer_to_member | base type | type of class in pointer_to_member | | | |
enum | | | | items in enum | | |
enum item | | | id of enum | | | |
macro | | | | macro parts | | |
macro part | arg_to_string, va_args | number of arg or -1 | | | | |

Such a representation would require some extra "hidden" symbols. For example, when a function returns some complex type, an extra typedef symbol representing that type would be required.

Also, in the case of templates, type ids should be treated in a special way: a negative value could mean "use a template argument instead of some real type". Base types (the POD ones) should have some predefined type ids.
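
A small sketch of how such type ids could be resolved. Everything concrete here is an assumption made for illustration: the enum values, the rule that id -n means the n-th template argument (1-based), and 0 as the "unknown" id:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical predefined ids for built-in types; ids of parsed symbols
// would start at kFirstUserTypeId.
enum BuiltinTypeId { kUnknownId = 0, kVoidId = 1, kIntId = 2, kDoubleId = 3, kFirstUserTypeId = 100 };

// A negative id -n refers to the n-th template argument (1-based).
inline bool isTemplateArg(int typeId) { return typeId < 0; }

// Resolves a type id against the actual arguments of one template instantiation.
// Returns kUnknownId when the argument index is out of range.
int resolveTypeId(int typeId, const std::vector<int>& templateArgs) {
    if (!isTemplateArg(typeId)) return typeId;  // already a real type
    std::size_t index = static_cast<std::size_t>(-typeId) - 1;
    return index < templateArgs.size() ? templateArgs[index] : kUnknownId;
}
```

For a member declared as `T value;` inside `template <class T> class Box`, `value_type_id` would be -1, and resolving it for `Box<double>` would give `kDoubleId`.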

This is my proposal. It may not be perfect (it's late here and my memory tends to forget important things), but it is the sum of a few months of thinking about CC from time to time ;)

Regards
BYO

stevenkaras:
Looked really good. But I would have to disagree with placing the static keyword as a modifier in and of itself. Keep in mind what effect static has:

symbol type | without static | with static
----------------|----------------|----------------
global variable | external linkage | internal linkage
local variable | automatic storage | program storage
function | external linkage | internal linkage
class member | only accessed through an object | only accessed through the qualified name

Note: I probably forgot something
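
The different meanings of `static` can be seen side by side in a short example. All names in it are made up for illustration:

```cpp
// File scope: static changes linkage, not lifetime.
int visibleCounter = 0;         // external linkage: other files can declare it extern
static int hiddenCounter = 0;   // internal linkage: private to this file

// Internal linkage as well: this function cannot be called from another file.
static int helper() { return ++hiddenCounter; }

// Block scope: static changes the storage duration.
int nextId() {
    static int last = 0;  // initialised once, keeps its value between calls
    return ++last;
}

// Class scope: static detaches the member from any particular object.
struct Widget {
    Widget() { ++count; }
    static int created() { return count; }  // normally called as Widget::created()
    static int count;                       // one copy shared by all Widgets
};
int Widget::count = 0;
```

A single "static" bit in a modifiers field would therefore have to be interpreted differently depending on where the symbol was declared.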

Couple of other things:
1. As for the extra lists, wouldn't it be better to use inheritance to implement that concept, rather than placing it in the base class?
2. Again, you use the extra list to show use of the using keyword, but I think it'd be simpler to effectively allow the transfer of symbols between namespaces. Especially once you consider the various ways you can use the using keyword (using namespace std; using std::cout; using ::myVar; etc.)
3. JGM> I didn't include the preprocessor definitions because the preprocessor has its own syntax, and is only loosely connected to C++. Keep in mind that you're actually programming in 2 languages at the same time.
4. BYO>Providing a symbol table is a hard task. But I like to see that everyone has put some thought into it. I got where you were going with the class, trying to avoid using inheritance, and the virtual table, but the code can be inefficient at first, and we can always re-implement it later as a monolithic class.
5. Some of the things you had in your class were very good (and even more importantly, it included some essentials that I forgot) such as file_id.
6. I think we should back off on storing everything about each variable for a bit, as it's just unnecessary, and focus on getting a basic implementation working. But I would like to mention that we should keep in mind that there are other uses for a symbol table other than code completion: a symbol browser, improved syntax highlighting (that would catch non-identifiers), and code refactoring.
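
To make point 2 concrete, here is a rough sketch of "transferring" symbols between namespaces. `Symbol`, `NamespaceScope`, `useDeclaration` and `useDirective` are hypothetical names. Copying pointers on a using-directive is a simplification: real C++ name lookup treats directives and declarations differently, and this sketch ignores that.

```cpp
#include <map>
#include <string>

struct Symbol {
    std::string name;
};

struct NamespaceScope {
    // name -> symbol, non-owning; the symbols live in whichever scope declared them
    std::map<std::string, const Symbol*> visible;

    // Models "using std::cout;": import a single symbol if it exists.
    void useDeclaration(const NamespaceScope& from, const std::string& name) {
        auto it = from.visible.find(name);
        if (it != from.visible.end()) visible.insert(*it);
    }

    // Models "using namespace std;": import everything visible in `from`.
    // std::map::insert keeps existing entries, so local names win on a clash.
    void useDirective(const NamespaceScope& from) {
        visible.insert(from.visible.begin(), from.visible.end());
    }
};
```

Since only pointers are moved around, the cost of a `using` is proportional to the number of imported names, not to the size of their descriptions.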

So I'll sit down later on this week and work out a second draft of the classes.

-steven

PS. Does anyone ever post here when it isn't late at night?

Ceniza:

--- Quote ---
3. JGM> I didn't include the preprocessor definitions because the preprocessor has its own syntax, and is only loosely connected to C++. Keep in mind that you're actually programming in 2 languages at the same time.
--- End quote ---

Yup, that's right. Trying to mix both wouldn't be easy at all. Macros can create new names, new functions, new everything. That's why a preprocessing stage is necessary if we really want to know what's available from the code to the developer.
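
A tiny, made-up example of a macro creating names that appear nowhere in the source text. A parser that skips preprocessing would never learn that `get_x` and `get_y` exist:

```cpp
// The ## operator pastes "get_" and the argument into one new identifier.
#define DEFINE_GETTER(field) \
    int get_##field() const { return field; }

struct Point {
    int x = 1;
    int y = 2;
    DEFINE_GETTER(x)  // expands to: int get_x() const { return x; }
    DEFINE_GETTER(y)  // expands to: int get_y() const { return y; }
};
```

Code completion on a `Point` object should offer `get_x()` and `get_y()`, which is only possible after macro expansion.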

Saving information from the preprocessing stage is simpler. In fact, I have that already. To continue the discussion, my current implementation of the preprocessor's SymTab Element is as follows:

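--- Code: ---
#include <vector>

// One entry of the preprocessor's symbol table, i.e. one #define.
template <class StringType, class TokenType>
struct PPSymTabElement
{
    typedef std::vector<StringType> ParameterListType;
    typedef std::vector<TokenType> ReplacementListType;

    bool valid;                      // whether the entry is valid
    bool isFunc;                     // true for function-like macros
    ParameterListType parameters;    // parameter names of a function-like macro
    ReplacementListType replacement; // tokens of the replacement list

    PPSymTabElement() : valid(false), isFunc(false) {}
};
--- End code ---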
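
A sketch of how such an element could describe a parameterized macro, and a deliberately naive expansion. The struct is repeated so that the example compiles on its own. Using `std::string` as the token type, plus the helpers `makeMaxMacro` and `expand`, are assumptions made here and are not part of the implementation described above:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Copy of the posted element, repeated so this sketch is self-contained.
template <class StringType, class TokenType>
struct PPSymTabElement {
    typedef std::vector<StringType> ParameterListType;
    typedef std::vector<TokenType> ReplacementListType;
    bool valid;
    bool isFunc;
    ParameterListType parameters;
    ReplacementListType replacement;
    PPSymTabElement() : valid(false), isFunc(false) {}
};

// Assumption: tokens are stored as plain strings.
typedef PPSymTabElement<std::string, std::string> Macro;

// Builds the entry for: #define MAX(a, b) ((a) > (b) ? (a) : (b))
Macro makeMaxMacro() {
    Macro m;
    m.valid = true;
    m.isFunc = true;
    m.parameters = {"a", "b"};
    m.replacement = {"(", "(", "a", ")", ">", "(", "b", ")", "?",
                     "(", "a", ")", ":", "(", "b", ")", ")"};
    return m;
}

// Replaces every parameter token with the matching argument.
// No rescanning and no handling of # or ##: a real preprocessor does much more.
std::vector<std::string> expand(const Macro& m, const std::vector<std::string>& args) {
    std::vector<std::string> out;
    for (const std::string& tok : m.replacement) {
        std::string replaced = tok;
        for (std::size_t i = 0; i < m.parameters.size() && i < args.size(); ++i)
            if (tok == m.parameters[i]) replaced = args[i];
        out.push_back(replaced);
    }
    return out;
}
```

Storing the parameter list next to the replacement list is what makes parameterized macros tractable: expansion becomes a token-by-token substitution.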

Inheritance is very important when designing the nodes for the Abstract Syntax Tree that the parser will create. Having in mind the whole bunch of different types you can have in C++, the parser's SymTab Element must be very flexible, so inheritance should be considered too.
