Topic: Improving "search in files" with a word index? And other ideas with metadata

Offline rickg22

  • Lives here!
  • ****
  • Posts: 2283
Hi guys. I was wondering about something... Recently I've been using the search-in-files feature a lot, and I realized that perhaps things could speed up a bit if we maintained a "global dictionary of tokens" and kept a list of tokens per file (this list could possibly be updated on file save). Search in files would tokenize the search string, find which files contain all of the tokens, and then refine the search from there.

Another idea that I had was to revamp the "TODO" plugin to use metadata for TODOs, including their dates, files/lines and priorities. So when I open the project, I can see the latest TODOs that I have added without having to search in all the files. The latest TODO would open the corresponding file whenever I open the project.
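
A minimal sketch of what such a TODO record might look like, assuming one record per TODO with the fields named above. The names `TodoItem` and `latest_todo` are hypothetical and do not come from the existing TODO plugin.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TodoItem:
    """One TODO entry stored as project metadata (hypothetical layout)."""
    text: str
    file: str
    line: int
    priority: int
    added: date


def latest_todo(items):
    """Return the most recently added TODO, or None if there are none."""
    return max(items, key=lambda item: item.added, default=None)
```

On project load, the file and line of the returned item would tell the editor which file to open.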

Another metadata idea would be an expansion of the TODO concept, and I don't know if it could be implemented. How about adding "notes" per file, so that we could have more thorough comments (maybe even including graphics in later versions)? So, instead of having a comment like // TODO, we could have //EXTNOTE:45, and if we hovered the mouse over that line, a "hint" would pop up displaying the notes file.

What do you think?

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13406
Sounds great, but someone has to implement it. Would you?  :lol:

The string tokenization sounds pretty good. For the last couple of days I've been wondering how Visual Studio does find in files so fast; maybe they do something like this.

TODO notes are a great idea, too, and sound doable.
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline rickg22

Okay, I'm going on vacation soon. Perhaps I'll implement this.

Offline oBFusCATed

For the search it will be best to modify the ThreadSearch plugin, because it is way better than the normal "find in files"...

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 6077
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Quote
Hi guys. I was wondering about something... Recently I've been using the search-in-files feature a lot, and I realized that perhaps things could speed up a bit if we maintained a "global dictionary of tokens" and kept a list of tokens per file (this list could possibly be updated on file save). Search in files would tokenize the search string, find which files contain all of the tokens, and then refine the search from there.
So, it looks like you want to implement a plain-text search, not a regex search, right?
The dictionary could contain something like:
Code
keyword(string) -> [file index(int), offset in the file(int)] 
That's all.
As for the tokenizer, QUEX could be a strong candidate. It is extremely fast, and it natively supports reporting the character offset in the file. We could also record the "line" and "column" information.

The dictionary is much like the tokenstree in the CodeCompletion plugin, which I think you are quite familiar with. It could be a self-made Patricia tree or a database such as SQLite.
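
A rough sketch of the keyword -> [file index, offset] dictionary described above. It assumes identifier-like tokens picked out with a plain regular expression, which stands in for a real tokenizer such as QUEX, and a plain in-memory dict instead of a Patricia tree or a database. The name `build_index` is made up for illustration.

```python
import re
from collections import defaultdict

# Assumption: a token is an identifier-like word. A real lexer would differ.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_index(files):
    """Map each keyword to a list of (file index, character offset) pairs.

    `files` is a list of (name, text) tuples; the file index is the
    position of the file in that list.
    """
    index = defaultdict(list)
    for file_index, (_name, text) in enumerate(files):
        for match in TOKEN_RE.finditer(text):
            index[match.group()].append((file_index, match.start()))
    return index


files = [
    ("main.cpp", "int main() { return helper(); }"),
    ("util.cpp", "int helper() { return 42; }"),
]
index = build_index(files)
```

Looking up `index["helper"]` then yields every occurrence of the keyword across both files without reading them again.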

Quote
Another metadata Idea would be an expansion of the todo concept, and I don't know if it could be implemented. How about adding "notes" per file, so that we could have more thorough comments  (maybe even including graphics in later versions)? So, instead of having a comment like // TODO, we could have //EXTNOTE:45, and if we hovered the mouse over that line, a "hint" would popup displaying the notes file.
Currently, if you use doxygen-style comments, I think we can add something like
Code
@CBNOTE:45
Doxygen already supports putting links to images and LaTeX-style formulas in comments, so we would only need to interpret that special command and show the image when the mouse hovers over it.
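
A small sketch of how the editor side might recognize such a command, assuming the number after `@CBNOTE:` is the ID of a note stored elsewhere. The `notes` mapping and the function name `note_for_line` are hypothetical.

```python
import re

# Matches the special command and captures the numeric note ID.
NOTE_RE = re.compile(r"@CBNOTE:(\d+)")


def note_for_line(line, notes):
    """Return the note text referenced on this source line, or None.

    `notes` maps a note ID (int) to its text; where it is stored is an
    assumption left open here.
    """
    match = NOTE_RE.search(line)
    if match is None:
        return None
    return notes.get(int(match.group(1)))
```

A hover handler would call this with the text of the line under the mouse and show the result as a tooltip when it is not None.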
« Last Edit: October 13, 2011, 08:15:18 am by ollydbg »
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline Jenna

  • Administrator
  • Lives here!
  • *****
  • Posts: 7252
Quote
Code
@CBNOTE:45
Doxygen already supports putting links to images and LaTeX-style formulas in comments, so we would only need to interpret that special command and show the image when the mouse hovers over it.
But that would mean that TODOs written for developers become part of the doxygen-generated documentation. Or is there a way to exclude some of them?

Offline rickg22

Quote
So, it looks like you want to implement a plain-text search, not a regex search, right?
The dictionary could contain something like:
Code
keyword(string) -> [file index(int), offset in the file(int)] 
That's all.

Yes, it would be for normal searches (not regex). To find a string, we would just use the index to check whether all of its words (or tokens, if you prefer) are present in the file. This way we can discard files that cannot possibly contain our search string.
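
A minimal sketch of this filtering step, assuming each file already has a set of its tokens and that the query starts and ends on word boundaries (a query such as "help" that is only part of the word "helper" would need a fallback to a full scan). The names `tokenize` and `search` are made up for illustration.

```python
import re

# Assumption: tokens are runs of word characters.
TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Return the set of tokens that occur in the text."""
    return set(TOKEN_RE.findall(text))


def search(query, texts, file_tokens):
    """Return the names of the files that contain the query string.

    `texts` maps a file name to its contents, and `file_tokens` maps a
    file name to its precomputed token set.
    """
    wanted = tokenize(query)
    hits = []
    for name, tokens in file_tokens.items():
        if not wanted <= tokens:
            continue  # a query token is missing, so the file cannot match
        if query in texts[name]:  # confirm the exact string in the candidate
            hits.append(name)
    return hits
```

Only files that pass the token check are actually scanned, which is where the speed-up would come from.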

But I'm wondering how to do it in the most efficient and least convoluted way. I think maintaining a global index would be overkill; perhaps doing it on a per-file basis would be best. This way, each time a file is saved, only its index would be updated. Otherwise, we would need to use a database engine for it.

Maybe we could allow the user to have an (optional) SQL engine (with username, password) to store the offset values instead of flat data - or do we have an SQLite engine running with C::B already?

So, instead of searching through all the files, we would just consult each file's index for the search.

Offline oBFusCATed

Quote
This way, each time a file was saved, only its index would be updated. Otherwise, we would need to use a database engine for it.
I think it would be better to update the index/db when the user searches and the file's timestamp is newer than the database's.
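
A sketch of that lazy update, assuming an in-memory cache keyed by path that remembers the modification time seen when the file was last indexed. The class name `LazyFileIndex` is hypothetical, and a real plugin would persist the cache between sessions.

```python
import os
import re

TOKEN_RE = re.compile(r"\w+")


class LazyFileIndex:
    """Per-file token sets, re-read only when the file is newer than the cache."""

    def __init__(self):
        self._entries = {}  # path -> (mtime in ns when indexed, token set)

    def tokens_for(self, path):
        mtime = os.stat(path).st_mtime_ns
        cached = self._entries.get(path)
        if cached is None or mtime > cached[0]:
            # File is new or was modified after indexing: rebuild its entry.
            with open(path, encoding="utf-8", errors="replace") as handle:
                tokens = set(TOKEN_RE.findall(handle.read()))
            cached = (mtime, tokens)
            self._entries[path] = cached
        return cached[1]
```

The cost of re-indexing is then paid at search time, and only for files that actually changed since the previous search.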

Quote
Maybe we could allow the user to have an (optional) SQL engine (with username, password) to store the offset values instead of flat data - or do we have an SQLite engine running with C::B already?
C::B does not use SQLite at the moment, but that engine is pretty slow anyway.

Offline eranif

  • Regular
  • ***
  • Posts: 256
Quote
C::B does not use SQLite at the moment, but that engine is pretty slow anyway.

I recently used QDBM for a project of mine, and I can tell you that it is *way* faster than SQLite.
You interact directly with the B-tree, it has cursor functionality, and it is even a transactional storage ;)

The Odeum API (exactly what you are looking for):
http://fallabs.com/qdbm/spex.html#odeumapi

Villa API (B-tree API with transaction support):
http://fallabs.com/qdbm/spex.html#villaapi
It is licensed under the LGPL, which I guess is OK for C::B.

I used the Villa API because I needed the cursor functionality (it allows a very fast search for a given prefix).
It also supports an inverted index for full-text search.

You can also replace its default compare function per search, so you could perform case-sensitive or case-insensitive searches.
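
To show why an ordered store with a cursor makes prefix search fast, here is a sketch that imitates the idea with a sorted Python list and `bisect`. This is not the QDBM or Villa API, just the same technique: jump to the first key that is not less than the prefix, then walk forward until the keys stop matching.

```python
import bisect


def keys_with_prefix(sorted_keys, prefix):
    """Return all keys starting with `prefix` from an ascending sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)  # like positioning a cursor
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are ordered, so no later key can match
        result.append(key)
    return result
```

For a case-insensitive variant, the keys could be stored lowercased and the prefix lowercased before the lookup, which plays the role of swapping the compare function.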
Eran
« Last Edit: October 15, 2011, 10:37:27 am by eranif »

Offline ollydbg

I just found that codelite now has a branch using QDBM.  :D

But what I see on the QDBM main page is:
Quote
Copyright (C) 2000-2007 Mikio Hirabayashi
Last Update: Thu, 26 Oct 2006 15:00:20 +0900

Sounds like it has had no updates in the last five years. :(

Offline Freem

  • Almost regular
  • **
  • Posts: 218
In fact, you are just looking for an RDBMS that can be used without a big installation procedure and is actively developed?
If so, I remember that Firebird can be embedded, too. I don't know if its speed matches your needs, but it is easy to use (in fact, I don't remember any differences in code between using it the classical way and embedded).