So you have given up on using a real C++ parser that builds a full AST, such as Elsa or the ANTLR C++ grammar?
Not so much given up on as moved beyond. As it happens, I spent the past two to three months updating and integrating the ANTLR C++ grammar, to the point where it now successfully parses most of the GCC libstdc++ headers. With that done, it should be a fairly simple exercise to integrate it with any existing code-completion base. Why, then, have I moved beyond it? Simply:
It's SLOW. Parsing the iostream header takes in excess of 5 seconds on my machine, and that while consuming all available CPU, rather than playing nicely in background threads as a good code-completion module should. Granted, the iostream header is one of the bigger ones -- but it's also one of the more commonly used ones. Also granted, 5 seconds isn't a terribly long amount of time -- but it's too long for a code-completion parser.
I know this is going to come on strong, but I have no faith in a custom-built C++ parser providing any better functionality than the current plugin. Instead of spending hundreds of hours reinventing the wheel, we should port Elsa (only a few small Linux-dependent pieces exist) or get the ANTLR-based parser working. Then we can do a whole lot more than just code completion. Elsa, at least, builds a full AST out of all valid C++ code, including the STL headers, so it has full namespace and template support. It has a prebuilt visitor-based system for pulling information out of the AST, so you can gather all the information you need.
Because Elsa won't compile for Windows without porting work, I've passed over it thus far.
Now, you could integrate the Elsa or ANTLR-based parser as your "cxxparser", but I don't think you need to go through all the extra work of building a parser smart enough to filter out that information. The parsers listed above may keep the comments in the AST; if they do, that will save you a lot of work. They also don't care about extra preprocessor tokens. Of course, you could always use ucpp or Wave to preprocess everything for you.
I integrated ucpp with the ANTLR grammar as part of my previous work, and that is where I encountered what I see as the core problem: loss of positional data. How does the preprocessor tell the parser where each token it emits resided in the original text? It would take modifications to the ANTLR framework itself to carry file-position data in its tokens, and to ucpp to generate such tokens.
My CxxMultiLexer sidesteps that problem: it generates tokens WITH positional data and filters nothing, leaving that to the CxxOverseer, which in essence "splits" the incoming token stream and sends the relevant tokens to each task-specific class.
Remember, I'm not even an official Code::Blocks developer at this point in time. Most of what I'm working on right now is an experimental learning process, so that, when I'm finally ready to contribute to the code completion module, I'll be able to back the contribution with experience and know-how.