Developer forums (C::B DEVELOPMENT STRICTLY!) => Development => Topic started by: rickg22 on October 12, 2011, 06:06:28 pm

Title: Improving "search in files" with a word index? And other ideas with metadata
Post by: rickg22 on October 12, 2011, 06:06:28 pm

Hi guys. I was wondering about something... Recently I've been using the search-in-files feature a lot, and I realized that perhaps things could speed up a bit if we maintained a "global dictionary of tokens" and kept a list of tokens per file (this list could be updated on file save). Search in files would tokenize the search string, find which files contain all of the tokens, and refine the search from there.

Another idea that I had was to revamp the "TODO" plugin to use metadata for TODOs, including their dates, files/lines and priorities. That way, when I open the project, I can see the latest TODOs that I have added without having to search in all the files. The latest TODO could open the corresponding file whenever I open the project.
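A minimal sketch of what such TODO metadata could look like, assuming one record per TODO. The `TodoEntry` class, its field names, and the "1 = most urgent" priority convention are all hypothetical and not part of the existing plugin.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class TodoEntry:
    # All field names are hypothetical, chosen only for this sketch.
    text: str
    file: str
    line: int
    priority: int  # assumed convention: 1 = most urgent
    added: date


def latest_todos(entries, limit=5):
    """Return the most recently added TODOs, newest first."""
    return sorted(entries, key=lambda e: e.added, reverse=True)[:limit]


todos = [
    TodoEntry("Handle empty input", "src/parser.cpp", 120, 2, date(2011, 10, 1)),
    TodoEntry("Remove debug output", "src/main.cpp", 42, 3, date(2011, 10, 11)),
    TodoEntry("Fix crash on save", "src/editor.cpp", 310, 1, date(2011, 10, 5)),
]

# The newest entry tells us which file to open when the project loads.
newest = latest_todos(todos, limit=1)[0]
print(f"{newest.file}:{newest.line}: {newest.text}")
```

With the records stored next to the project, listing recent TODOs needs no scan of the source files.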

Another metadata idea would be an expansion of the TODO concept, and I don't know if it could be implemented. How about adding "notes" per file, so that we could have more thorough comments (maybe even including graphics in later versions)? So, instead of having a comment like // TODO, we could have //EXTNOTE:45, and if we hovered the mouse over that line, a "hint" would pop up displaying the notes file.
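The marker lookup could be sketched roughly as follows. The `//EXTNOTE:<id>` syntax comes from the post; the `notes` dictionary is only a stand-in for whatever per-file notes store would be chosen, and the function name is hypothetical.

```python
import re

# Marker syntax taken from the post: //EXTNOTE:<id>
NOTE_MARKER = re.compile(r"//\s*EXTNOTE:(\d+)")


def note_for_line(source_line, notes):
    """Return the note text referenced by a source line, or None."""
    match = NOTE_MARKER.search(source_line)
    if match is None:
        return None
    return notes.get(int(match.group(1)))


# Stand-in for a per-file notes store; the on-disk format is left open.
notes = {45: "Explains why the cache is flushed before every save."}

print(note_for_line("flush_cache(); //EXTNOTE:45", notes))
print(note_for_line("int x = 0; // TODO", notes))
```

An editor hover handler would call something like `note_for_line` with the text of the line under the mouse and show the result as a tooltip.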

What do you think?

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: oBFusCATed on October 12, 2011, 06:28:08 pm

Sounds great, but someone has to implement it. Would you? :lol:

The string tokenization sounds pretty good. For the last couple of days I've been wondering how Visual Studio does find in files so fast; maybe they do something like this.

The TODO notes are a great idea, too, and sound doable.

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: rickg22 on October 12, 2011, 07:20:54 pm

Okay, I'm going on vacation soon. Perhaps I'll implement this.

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: oBFusCATed on October 12, 2011, 08:03:39 pm

For the search, it would be best to modify the ThreadSearch plugin, because it is way better than the normal "find in files"...

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: ollydbg on October 13, 2011, 03:18:24 am

Quote from: rickg22 on October 12, 2011, 06:06:28 pm

Hi guys. I was wondering about something... Recently I've been using the search-in-files feature a lot, and I realized that perhaps things could speed up a bit if we maintained a "global dictionary of tokens" and kept a list of tokens per file (this list could be updated on file save). Search in files would tokenize the search string, find which files contain all of the tokens, and refine the search from there.

So, it looks like you want to implement a plain-text search, not a regex search, right?
The dictionary could contain something like:

Code

keyword(string) -> [file index(int), offset in the file(int)]

That's all.
As for the tokenizer, QUEX could be a strong candidate. It is extremely fast, and it natively supports reporting each token's "offset in the file" in characters. We could also record the "line" and "column" information.
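As an illustration of the dictionary layout described above, here is a small sketch that maps each keyword to its positions. It uses a plain regular expression as the tokenizer rather than QUEX, and the token definition (runs of word characters) is an assumption made for this sketch.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")  # assumed token definition: runs of word characters


def build_index(files):
    """Map each token to a list of (file index, offset, line, column)."""
    index = defaultdict(list)
    for file_index, text in enumerate(files):
        for match in TOKEN.finditer(text):
            offset = match.start()
            # Counting newlines per token is simple but slow; a real
            # tokenizer would track line and column while scanning.
            line = text.count("\n", 0, offset) + 1  # 1-based line
            line_start = text.rfind("\n", 0, offset) + 1
            column = offset - line_start + 1  # 1-based column
            index[match.group()].append((file_index, offset, line, column))
    return index


files = ["int main()\n{\n    return 0;\n}\n", "int helper();\n"]
index = build_index(files)
print(index["int"])     # positions in both files
print(index["return"])  # one position in the first file
```

Storing the offset alone is enough to jump to a match; line and column are kept here only because the post suggests recording them as well.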

The dictionary is much like the tokenstree in the CodeCompletion plugin, which I think you are quite familiar with. It could be a self-made Patricia tree or a database such as SQLite.

Quote

Another metadata idea would be an expansion of the TODO concept, and I don't know if it could be implemented. How about adding "notes" per file, so that we could have more thorough comments (maybe even including graphics in later versions)? So, instead of having a comment like // TODO, we could have //EXTNOTE:45, and if we hovered the mouse over that line, a "hint" would pop up displaying the notes file.

If you use doxygen-style comments, I think we could add something like:

Code

@CBNOTE:45

Doxygen already supports putting links to images and LaTeX-style formulas in comments, so we would only need to interpret that special command and show the image when the mouse hovers over it.

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: Jenna on October 13, 2011, 06:25:56 am

Quote from: ollydbg on October 13, 2011, 03:18:24 am

Code
@CBNOTE:45
Doxygen already supports putting links to images and LaTeX-style formulas in comments, so we would only need to interpret that special command and show the image when the mouse hovers over it.

But that would mean that TODOs written for developers become part of the doxygen-generated documentation. Or is there a way to exclude some of them?

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: rickg22 on October 15, 2011, 01:26:54 am

Quote from: ollydbg on October 13, 2011, 03:18:24 am

So, it looks like you want to implement a plain-text search, not a regex search, right?
The dictionary could contain something like:
Code
keyword(string) -> [file index(int), offset in the file(int)] 
That's all.

Yes, it would be for normal searches (not regex). To find a string, we would use the index to check whether all of its words (or tokens, if you prefer) are present in a file. This way we can discard files that cannot possibly contain our search string.
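The filtering step might look roughly like this. It is a sketch under the assumption that the query is made of whole words; the tokenizer and all names are hypothetical.

```python
import re

TOKEN = re.compile(r"\w+")  # assumed token definition


def tokens(text):
    """Return the set of distinct tokens in a piece of text."""
    return set(TOKEN.findall(text))


def search(query, files, file_tokens):
    """Return names of files containing `query`, assumed to be whole words."""
    wanted = tokens(query)
    hits = []
    for name, text in files.items():
        # Cheap check first: skip files missing any token of the query.
        if not wanted <= file_tokens[name]:
            continue
        # The index only narrows the candidates; confirm with a real search.
        if query in text:
            hits.append(name)
    return hits


files = {
    "a.cpp": "int count = 0; // reset count",
    "b.cpp": "reset(); int total = count;",
    "c.cpp": "void draw();",
}
file_tokens = {name: tokens(text) for name, text in files.items()}
print(search("reset count", files, file_tokens))
```

Here `b.cpp` passes the token check but fails the real search, which is why the second step is still needed. A query that starts or ends in the middle of a word would be wrongly discarded by the token check, so partial words would need separate handling.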

But I'm wondering how to do it in the most efficient and least convoluted way. I think maintaining a global index would be overkill - perhaps doing it on a per-file basis would be best. This way, each time a file was saved, only its index would be updated. Otherwise, we would need to use a database engine for it.

Maybe we could allow the user to have an (optional) SQL engine (with username, password) to store the offset values instead of flat data - or do we have an SQLite engine running with C::B already?

So, instead of searching through all the files, we would just consult the file index for the search.
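For the database option, a sketch using an embedded SQLite table through Python's standard `sqlite3` module could look like this. The table and column names are hypothetical, and an in-memory database stands in for a file stored with the project. An embedded engine like this needs no username or password.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # an on-disk path would be used in practice
conn.execute(
    "CREATE TABLE token_pos (token TEXT, file_id INTEGER, byte_offset INTEGER)"
)
conn.execute("CREATE INDEX idx_token ON token_pos (token)")

rows = [
    ("reset", 0, 18), ("count", 0, 4), ("count", 0, 24),
    ("reset", 1, 0), ("total", 1, 13),
]
conn.executemany("INSERT INTO token_pos VALUES (?, ?, ?)", rows)


def files_with_all(conn, wanted):
    """Return ids of files that contain every (distinct) token in `wanted`."""
    marks = ",".join("?" * len(wanted))
    query = (
        f"SELECT file_id FROM token_pos WHERE token IN ({marks}) "
        "GROUP BY file_id HAVING COUNT(DISTINCT token) = ? ORDER BY file_id"
    )
    return [row[0] for row in conn.execute(query, (*wanted, len(wanted)))]


print(files_with_all(conn, ["reset", "count"]))
```

Re-indexing one file on save would then be a `DELETE` of its rows followed by fresh inserts, so the global table can still be updated one file at a time.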

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: oBFusCATed on October 15, 2011, 10:15:52 am

Quote from: rickg22 on October 15, 2011, 01:26:54 am

This way, each time a file was saved, only its index would be updated. Otherwise, we would need to use a database engine for it.

I think it would be better to update the index/db when the user searches and the file's timestamp is newer than the database's.
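A rough sketch of that lazy update check is shown below. It assumes the index records a modification time per file rather than one timestamp for the whole database, and the function name is hypothetical.

```python
import os
import tempfile


def stale_files(paths, indexed_mtime):
    """Return paths whose modification time is newer than the recorded one."""
    stale = []
    for path in paths:
        mtime = os.path.getmtime(path)
        # Files never indexed before count as stale.
        if mtime > indexed_mtime.get(path, float("-inf")):
            stale.append(path)
    return stale


# Demonstration with a temporary file standing in for a project file.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "main.cpp")
    with open(path, "w") as handle:
        handle.write("int main() { return 0; }\n")

    indexed_mtime = {}  # nothing indexed yet
    first = stale_files([path], indexed_mtime)

    indexed_mtime[path] = os.path.getmtime(path)  # pretend we re-indexed
    second = stale_files([path], indexed_mtime)

    later = indexed_mtime[path] + 10
    os.utime(path, (later, later))  # simulate a later save
    third = stale_files([path], indexed_mtime)

print(len(first), len(second), len(third))
```

Only the files returned by `stale_files` would be re-tokenized before the search runs, so saving a file costs nothing until the next search.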

Quote from: rickg22 on October 15, 2011, 01:26:54 am

Maybe we could allow the user to have an (optional) SQL engine (with username, password) to store the offset values instead of flat data - or do we have an SQLite engine running with C::B already?

SQLite is not used in C::B at the moment, and this engine is pretty slow anyway.

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: eranif on October 15, 2011, 10:35:09 am

Quote from: oBFusCATed on October 15, 2011, 10:15:52 am

SQLite is not used in C::B at the moment, and this engine is pretty slow anyway.

I recently used QDBM for a project of mine - and I can tell you that it is *way* faster than SQLite.
You interact directly with the B-tree, it has cursor functionality, and it is even a transactional storage ;)

The Odeum API (exactly what you are looking for):
http://fallabs.com/qdbm/spex.html#odeumapi

Villa API (B-tree API with transaction support):
http://fallabs.com/qdbm/spex.html#villaapi
It is licensed under the LGPL, which I guess is OK for C::B.

I used the Villa API because I needed the cursor functionality (it allows very fast searches for a given prefix).
It also supports an inverted index for full-text search.

You can also replace its default compare function per search, so you could perform case-sensitive or case-insensitive searches.
Eran
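To show why an ordered store with a cursor makes prefix search cheap, here is a sketch that uses a sorted Python list and the standard `bisect` module as a stand-in for the B-tree. It does not use the QDBM API, and the function name is hypothetical. Changing the sort key plays the role of swapping the compare function.

```python
import bisect


def prefix_matches(sorted_keys, prefix):
    """Return all keys starting with `prefix` from a sorted list."""
    start = bisect.bisect_left(sorted_keys, prefix)  # the "cursor" jumps here
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        matches.append(key)
    return matches


words = ["Token", "tokenize", "TODO", "total", "thread", "tokens"]

# Case-sensitive ordering: plain string comparison.
exact = sorted(words)
# Case-insensitive ordering: keys are lowercased before being stored.
folded = sorted(word.lower() for word in words)

print(prefix_matches(exact, "tok"))
print(prefix_matches(folded, "tok"))
```

The lookup touches only the keys that match plus one more, which is the property the cursor-based search relies on.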

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: ollydbg on October 28, 2011, 08:43:26 am

I just found that CodeLite now has a branch using QDBM. :D

But what I see on the QDBM main page is:

Quote

Copyright (C) 2000-2007 Mikio Hirabayashi
Last Update: Thu, 26 Oct 2006 15:00:20 +0900

Sounds like it has had no updates in the last five years. :(

Title: Re: Improving "search in files" with a word index? And other ideas with metadata
Post by: Freem on October 28, 2011, 03:37:46 pm

So in fact, you are just looking for an RDBMS that can be used without a big installation procedure and is actively developed?
If so, I remember that Firebird can be embedded, too. I don't know if its speed matches your needs, but it is easy to use (in fact, I don't remember any differences in code between using it the classic way and using it embedded).
