Author Topic: Suggested framework for search-in-files index  (Read 4371 times)

Offline rickg22

  • Lives here!
  • Posts: 2283
Suggested framework for search-in-files index
« on: November 08, 2011, 12:18:24 am »
Here's my more-or-less formal proposal to accelerate find-in-files (edit: if you don't see the attachment, you need to log in)

(Sorry for not copying the text earlier, I was in some kind of hurry)

In memory (and also stored on disk, of course) we would have a global index of tokens and filenames. To search for a string, we would first look it up in the in-memory global token list. The only drawback I see is memory consumption, but some sacrifice has to be made (the same old trade-off: disk vs. memory vs. speed).

Code
Global Token list:
1: abcd
2: test
3: myClass
4: for
5: class
6: public

Code
Global filename list:
1: main.cpp
2: main.h
3: include/mylib.h
4: libs/mylib.cpp
5: gui/qt/main_gui.cpp
6: ...

Code
local subdirectory index, e.g. /home/me/projects/proj1/src/sdk/search.cbcache

1 (this is for token #1): Files with ids 1,3,4,17,25,30,32,33,41,42,43
2: 2,3,5,10,12,...
3: ...
4: ...
...
(we could encode the ids in base64 to minimize disk usage)
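The structures above can be sketched roughly like this (a Python sketch for illustration only; the actual implementation would be C++, and all names here are made up):

```python
class SearchIndex:
    """Interns tokens and filenames as sequential integer ids and keeps,
    per token id, the set of file ids containing that token."""

    def __init__(self):
        self.tokens = {}     # token string -> token id
        self.filenames = {}  # filename -> file id
        self.postings = {}   # token id -> set of file ids

    def _intern(self, table, key):
        # Assign the next sequential id on first sight, matching the
        # numbered lists above.
        return table.setdefault(key, len(table) + 1)

    def add_file(self, filename, file_tokens):
        file_id = self._intern(self.filenames, filename)
        for tok in file_tokens:
            tok_id = self._intern(self.tokens, tok)
            self.postings.setdefault(tok_id, set()).add(file_id)

index = SearchIndex()
index.add_file("main.cpp", ["abcd", "test", "for"])
index.add_file("main.h", ["test", "class", "public"])
```

Each posting list could then be serialized compactly per directory (the base64 idea above, or something like a delta encoding).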

To search for a string, we would follow these steps:

1) Tokenize the search string (the "needle") so we have several smaller needles. We then search for these needles in our haystack, that is, the in-memory token list, and see which tokens match. This is the part where we can apply options like whole-word or beginning-of-word matching. By doing so, we transform our needles into numeric ids that will then be looked up in the directory indexes.

2) Some tokens will match one part of our search string, while other tokens match other parts. We can build a temporary "dictionary" so that when we search later, we can check whether a file (or directory) has matches for ALL the tokens we're searching for (ollydbg explains this further below).

3) We can have a global index of directories to minimize searching time (see below)

4) Each directory will have a local index of tokens, each followed by a list of the files that contain that token. If we need a match for tokens A and B, and a file has only token A or only token B, we dismiss that file.

Explanation of 3): The global index of directories would work like the per-directory index explained above: the tokens of all the files in a directory appear in that directory's entry. If this entry cannot match our search tokens, none of the files in the directory can match either. (However, if one file matches only token A and another file matches only token B, and we need a single file to have both A and B, the directory will trigger a false positive. This is normal; we can still search the directory index file by file and dismiss the false positives.)

5) Notice that all the matching files are only POTENTIAL matches; we could still have false positives. This is where we search the contents of each file as we normally do. But by the time we reach this final step, we will already have dismissed a great deal of non-matching files and directories, saving us the time it takes to scan each one. Also, I'm wondering if we should offer search-on-demand for indexed files, where clicking an entry in the results list would search that particular file. Until then, the file would appear in the search results marked with "?", indicating that it only PROBABLY matches our search criteria.
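The five steps could look roughly like this (Python for illustration; the data structures and names are invented, not actual Code::Blocks code):

```python
import re

def tokenize(needle):
    # Step 1: split the search string into smaller needles.
    return re.findall(r"\w+", needle)

def search(needle, tokens, dir_indexes):
    # tokens: token string -> global token id
    # dir_indexes: directory name -> {token id -> set of file names}
    ids = []
    for n in tokenize(needle):
        if n not in tokens:        # unknown token: nothing can match
            return set()
        ids.append(tokens[n])
    if not ids:
        return set()
    candidates = set()
    for directory, index in dir_indexes.items():
        # Step 3: if the directory's entry lacks any token, prune it whole.
        if not all(tid in index for tid in ids):
            continue
        # Step 4: a file survives only if it contains ALL the tokens.
        candidates |= set.intersection(*(index[tid] for tid in ids))
    return candidates              # Step 5: POTENTIAL matches only

tokens = {"int": 1, "aaaa": 2}
dir_indexes = {
    "src": {1: {"src/a.cpp", "src/b.cpp"}, 2: {"src/a.cpp"}},
    "gui": {1: {"gui/m.cpp"}},  # no "aaaa" anywhere here: pruned whole
}
```

Here `search("int aaaa", tokens, dir_indexes)` never even looks inside the `gui` directory, which is the whole point of step 3.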

Note that to build these indexes we need to scan the entire project once, then re-scan each file and rewrite the indexes after every save.

--- end of explanation ---

Please share your opinions :)
« Last Edit: November 08, 2011, 04:52:23 am by rickg22 »

Offline ollydbg

  • Developer
  • Lives here!
  • Posts: 5930
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: Suggested framework for search-in-files index
« Reply #1 on: November 08, 2011, 02:28:12 am »
Good!

One thing I don't understand is: when you have a long search string, e.g. I would like to search for "int aaaa" in all files. What you plan to do is:
1. Search for "int" in the caches, resulting in a file set A.
2. Search for "aaaa" in the caches, resulting in a file set B.
3. Collect the common files into a set C, where C = A ∩ B.
4. Do a deeper search in the final file set C.

Right?
If some piece of memory should be reused, turn it into a variable (or a const variable).
If some operations should be reused, turn them into functions.
If they happen together, turn them into a class.

Offline rickg22

  • Lives here!
  • Posts: 2283
Re: Suggested framework for search-in-files index
« Reply #2 on: November 08, 2011, 04:23:42 am »
Quote
Good!

One thing I don't understand is: when you have a long search string, e.g. I would like to search for "int aaaa" in all files. What you plan to do is:
1. Search for "int" in the caches, resulting in a file set A.
2. Search for "aaaa" in the caches, resulting in a file set B.
3. Collect the common files into a set C, where C = A ∩ B.
4. Do a deeper search in the final file set C.

Right?

Yes, that's correct. A full-text index would require building a complete suffix tree over all the files (or at least all the lines), and that would imply a huge complexity and memory cost. Of course, for each file I could store a *little* more info, like the next 2 or 3 tokens that follow a particular token.
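ollydbg's four steps, plus the false-positive case, in a tiny sketch (Python, with made-up file contents):

```python
import re

files = {
    "a.cpp": "int aaaa = 0;",
    "b.cpp": "int bbbb = 1;",
    "c.cpp": "aaaa; int x;",  # has both tokens, but not the phrase
}

def token_set(text):
    return set(re.findall(r"\w+", text))

A = {f for f, t in files.items() if "int" in token_set(t)}   # step 1
B = {f for f, t in files.items() if "aaaa" in token_set(t)}  # step 2
C = A & B                                                    # step 3
# Step 4: the deeper search runs only over C and weeds out c.cpp,
# which contains both tokens but not the actual string.
hits = {f for f in C if "int aaaa" in files[f]}
```

So `b.cpp` is never opened at all, and `c.cpp` survives the index but dies in the deeper search.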

For example, if the file "myfile.cpp" contains "int aaaa", then in the index entry for "int" we would have (before replacing the filenames and tokens with their corresponding ids): "myfile.cpp" aaaa|b|i|j; "myfile2.cpp" i|j|counter. This would be an index of "order 1". An order-2 index would store the two words after "int", like "myfile.cpp" aaaa bbbb|i j (note: the commas would be filtered out); "myfile3.cpp" x float; and so on.
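The "order 1" idea could be sketched like this (Python; file contents invented): for each token, record per file the set of tokens seen immediately after it.

```python
import re
from collections import defaultdict

def build_order1(files):
    # following[token][filename] = tokens seen right after `token`
    following = defaultdict(lambda: defaultdict(set))
    for name, text in files.items():
        toks = re.findall(r"\w+", text)  # commas etc. are filtered out
        for cur, nxt in zip(toks, toks[1:]):
            following[cur][name].add(nxt)
    return following

idx = build_order1({
    "myfile.cpp":  "int aaaa; int i; int j;",
    "myfile2.cpp": "int i, j, counter;",
})
```

A search for "int aaaa" can now reject myfile2.cpp without opening it, because "aaaa" never follows "int" there.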

ADDITIONAL NOTE: oBFusCATed just suggested in another thread that I update the index not on file save, but when doing a find-in-files, for each file that has been modified since the last indexing. This is actually a very good idea, as we would only update the index when actually using it.
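That lazy-update idea, sketched with modification times (Python; the bookkeeping names are mine, not from any existing code):

```python
import os
import tempfile

class LazyIndex:
    def __init__(self):
        self.indexed_mtime = {}  # path -> mtime when last indexed
        self.reindexed = []      # work log, just for this demo

    def search(self, paths):
        # Refresh stale entries only when a search actually runs,
        # instead of re-indexing on every file save.
        for p in paths:
            mtime = os.path.getmtime(p)
            if self.indexed_mtime.get(p) != mtime:
                self.reindexed.append(p)  # re-tokenize + rewrite index here
                self.indexed_mtime[p] = mtime
        # ... then run the indexed search as usual ...

idx = LazyIndex()
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"int aaaa;")
    path = f.name
idx.search([path])              # first search: indexes the file
idx.search([path])              # unchanged: no work done
os.utime(path, (12345, 12345))  # simulate a save
idx.search([path])              # stale: re-indexed once more
os.unlink(path)
```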
« Last Edit: November 08, 2011, 05:11:10 am by rickg22 »

Offline oBFusCATed

  • Developer
  • Lives here!
  • Posts: 13413
    • Travis build status
Re: Suggested framework for search-in-files index
« Reply #3 on: November 08, 2011, 10:24:28 am »
What will happen if I search for:
1. 'Class::method', when there are method1, method2, method_sadf, etc.
2. '"my string'

What is your plan for this feature: provide an API, or just hack the 'Find in files' feature?
Because I haven't used 'Find in files' for a long time; ThreadSearch is far more useful.
I would like it to be faster :)
« Last Edit: November 08, 2011, 10:26:30 am by oBFusCATed »
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline rickg22

  • Lives here!
  • Posts: 2283
Re: Suggested framework for search-in-files index
« Reply #4 on: November 08, 2011, 03:45:10 pm »
Quote
What will happen if I search for:
1. 'Class::method', when there are method1, method2, method_sadf, etc.
2. '"my string'

The tokenizer would split that into "Class" (end of word) and "method" (beginning of word). If you want an exact match, then you'd check "full word", and we'd have it split into "Class" (full word) and "method" (full word). The search algorithm is not psychic; you need to tell it exactly what you want to search for.
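One way to carry those word-boundary flags, as a hypothetical sketch (Python; the flag names are mine, not from any existing code):

```python
import re

def split_needle(needle, whole_words=False):
    # "Class::method" -> [("Class", "end"), ("method", "begin")]
    parts = re.findall(r"\w+", needle)
    flagged = []
    for i, part in enumerate(parts):
        if whole_words:
            mode = "full"    # exact match requested
        elif i == 0:
            mode = "end"     # token must END with this fragment
        elif i == len(parts) - 1:
            mode = "begin"   # token must BEGIN with this fragment
        else:
            mode = "full"    # middle fragments must match whole tokens
        flagged.append((part, mode))
    return flagged

def token_matches(token, fragment, mode):
    if mode == "full":
        return token == fragment
    if mode == "begin":
        return token.startswith(fragment)
    return token.endswith(fragment)  # mode == "end"
```

With this, searching 'Class::method' matches the token 'method1' (beginning of word), but not once the full-word flag is set.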

Quote
What is your plan for this feature: provide an API, or just hack the 'Find in files' feature?
Because I haven't used 'Find in files' for a long time; ThreadSearch is far more useful.
I would like it to be faster :)

At first I'm only hacking "Find in files", but I'll need to give the search index a reasonable API for it to be useful.