Author Topic: CodeCompletion plugin performance: usage of KMP search algorithm (tokenizer.cpp)  (Read 308 times)

Offline J.

  • Multiple posting newcomer
  • *
  • Posts: 37
I am wondering why it is the KMP search algorithm that is used in the CodeCompletion plugin (tokenizer.cpp).

The KMP algorithm optimizes the search for any sub-matches in arbitrary strings, while it is my understanding that
  • the CB tokenizer passes tokens (i.e. words) to the parser and
  • matches can only happen at the beginning of a token (which would then make KMP obsolete).

Moreover, the KMP preprocessing routine is called for every next-match search regardless if the search word (key) has changed or not. A performance downside.

... or is it intended to move to another more robust and more performant lexer/parser combination in the very next future for CB?

Any views?

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 11279
    • Travis build status
... or is it intended to move to another more robust and more performant lexer/parser combination in the very next future for CB?
My personal goal is to move to a language server plugin and use clangd for parsing. I have a working prototype, but I have to spend a lot of time in polishing also clangd has to implement a lot of features to be really useful.
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5128
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
kmp algorithm in the tokenizer class is only used for macro string expansion.

Parsing the C++ source code is quite hard, I have been maintaining the parsing algorithm for a long time, but it still has a lot of things to implement.

If you have motivation and time, we welcome any contribution, thanks
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline J.

  • Multiple posting newcomer
  • *
  • Posts: 37
My personal goal is to move to a language server plugin and use clangd for parsing. I have a working prototype, but I have to spend a lot of time in polishing also clangd has to implement a lot of features to be really useful.
That's intriguing. Assuming that clangd would then essentially replace the current code completion module, there should not be too much efforts put into it until then from my point of view.   

Offline J.

  • Multiple posting newcomer
  • *
  • Posts: 37
kmp algorithm in the tokenizer class is only used for macro string expansion.

Parsing the C++ source code is quite hard, I have been maintaining the parsing algorithm for a long time, but it still has a lot of things to implement.
But c++ parsing together with cpp macro string expansion would then be covered by clangd once implemented, no? See above.
Quote
If you have motivation and time, we welcome any contribution, thanks
The question is by when clangd should be expected to be part of CB and then - if I am correct here - life would become much easier with regards to c/c++ parsing incl. cpp stages. 
I will see if I can spare some time for the code completion part.  I am inclined to think that KMP is over the top, since we are not looking for substrings in words, but the matches are always at the beginning of a word.  My hope would be that dropping KMP and using PCRE regular expressions speed things up.  Currently I am not sure if the advanced wx regexp flavour is sufficient what I have in mind.

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 11279
    • Travis build status
The question is by when clangd should be expected to be part of CB and then...
When it is ready, as always. I hope to spend some time on the LSP plugin soon...
But clangd as a server has limitations, too. If you want to contribute to a parser, contributing to clangd is a good option. :)
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5128
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Language Server Protocol - Wikipedia sounds interesting, and it's initially designed by Microsoft. :)
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.