Author Topic: lexer file loading ... (Read 18027 times)

tiwag · « **on:** April 04, 2006, 03:12:29 pm »

... lasts ages

i think this point should really be improved.

what can be done to improve the loading time of these lexer files ?

Michael · « **Reply #1 on:** April 04, 2006, 03:25:00 pm »

Hello,

May be it could be useful to just load only the "most important" of them at the beginning (C::B default) and then let the user decides if she/he needs additional ones or not (as with the plugins). For example, I do not use f77 lexer, so why I have to load it?

Additionally, a background thread could be used to load the lexers. Moreover, the use of SAX could also be helpful to speed up the process.

Best wishes,
Michael

mandrav · « **Reply #2 on:** April 04, 2006, 03:51:51 pm »

This is the speed penalty we have to pay for using TinyXML. It makes everything easier for us, but it's a little slow, especially when parsing large nodes (like the lexers' keywords)...

Michael · « **Reply #3 on:** April 04, 2006, 04:06:52 pm »

Quote from: mandrav on April 04, 2006, 03:51:51 pm

This is the speed penalty we have to pay for using TinyXML. It makes everything easier for us, but it's a little slow, especially when parsing large nodes (like the lexers' keywords)...

Yes, TinyXML is a nice piece of code. Pity that it does not implement SAX or pull SAX.

Best wishes,
Michael

Ceniza · « **Reply #4 on:** April 04, 2006, 04:34:35 pm »

Wow, thinking about how slow was Code::Blocks loading because of those lexers was the last thing I did when I went to sleep. Now I wake up and I find a post about it.

Are you spying me?

I support Michael's idea of just loading the most important ones and somehow letting the user decide if more lexers should be loaded later.

That would also add an extra Tip Of The Day

thomas · « **Reply #5 on:** April 04, 2006, 06:01:29 pm »

Michael's approach is good, as it addresses the actual problem, which is not TinyXML being slow, but loading many lexers that are unneeded.
Face it, we have more lexers that are complete nonsense than lexers that are actually used by anyone. Out of the 22 lexers loaded at startup, the "average" user will use 2, maybe 3.

Peeking into the sources for 10 seconds, I found an additional issue with lexer loading that has not been noticed so far: it is not Unicode-safe. I wonder nobody has complained about lexers not working at all in Czech and Russian installations yet... :shock:

Being fed up with my other stuff at the moment, I'll have a look at what can be done about lexer loading after dinner

takeshimiya · « **Reply #6 on:** April 04, 2006, 11:59:21 pm »

I've spent various days trying to understand this issue (why the C::B loading was slow).

Limiting the loaded lexers could be a temporal quick fix until a better solution is implemented.

Here are some of my tests on my pc:
C::B loading on first time: 15 seconds. C::B loading on subsequent times: 5 seconds.
The lexers parsing take a constant time everytime, and the difference between the first and subsequent time is mostly attributed to the DLLs loading.
TinyXML parsing is slow, with a rough measure of 200ms each lexer.

A bottom note:
At my univ. which I can only load C::B from the LAN, C::B takes 50 seconds to load. SciTE takes 1 second.
SciTE loads way more lexers than C::B. The SciTE lexers also have more features.
We can do better.

There are multiple ways to improve things. It needs time.

Michael · « **Reply #7 on:** April 05, 2006, 09:51:14 am »

Quote from: Michael on April 04, 2006, 03:25:00 pm

May be it could be useful to just load only the "most important" of them at the beginning (C::B default) and then let the user decides if she/he needs additional ones or not (as with the plugins). For example, I do not use f77 lexer, so why I have to load it?

Hello,

I have had some time to thinking about an alternative

. May be lexers could be handle as file associations are. Depending on which kind of file a user open, C::B loads the relative lexer automatically. So, it would not be necessary to load at the beginning 2-3 basic lexers (to decide which of them could not be so easier...).

By default C::B has pre-defined associations (stored into the C::B config file?). Each type of file has its lexer. The user has the possibility to modify this list by either adding a new lexer and its relative file type association and/or to modify an existing one. User specific lexers could be stored separately into the C::B config file (or an alternative lexer config file).

May be the lexers used by project or workspace could be stored into .cbp or .workspace files, so that when loading a project or workspace, C::B loads directly the necessary lexers without having to parser the file types to get the corresponding lexers.

This method has the advantage to make the handling of lexers automatic with "no" or very limited additional overhaed for the user. The disadvantage is a bit more overhead for C::B as it would be necessary to e.g., check that a lexer is not loaded several times.

If you have questions and or comments, please do not hesitate

.

Best wishes,
Michael

mandrav · « **Reply #8 on:** April 05, 2006, 11:28:14 am »

Revision 2306 has fixed the delay when opening "Settings->Editor". That's a start

.

Michael · « **Reply #9 on:** April 05, 2006, 11:38:44 am »

Quote from: mandrav on April 05, 2006, 11:28:14 am

Revision 2306 has fixed the delay when opening "Settings->Editor". That's a start .

Great

.

I have remarked that this morning there were several commits. Each time I have built a C::B revision, I have discovered a new one

.

Best wishes,
Michael

thomas · « **Reply #10 on:** April 05, 2006, 11:50:19 am »

Quote from: Michael on April 05, 2006, 09:51:14 am

May be lexers could be handle as file associations are. Depending on which kind of file a user open, C::B loads the relative lexer automatically [...]

I have been trying to implement just that last evening, but it is not as easy as you think. First, you don't know what a lexer refers to without loading it. Thus, you would have to encode this information somewhere. Keeping around an extra map file for this would work best, but then you are building up a dependency which is not good. When adding a new lexer, you have to update the map, or it won't work.
One could think about putting the extension which is handled into the lexer's name, but most lexers handle several (up to 6) file types, so filenames would become quite cluttered (still possible).

Quote

By default C::B has pre-defined associations (stored into the C::B config file?).

Hardcoded at the present time. We discussed this in January when restructuring the file association code, but decided to leave it hardcoded for now to not further complicate things.

Quote

Each type of file has its lexer. The user has the possibility to modify this list by either adding a new lexer and its relative file type association and/or to modify an existing one. User specific lexers could be stored separately into the C::B config file (or an alternative lexer config file).

That's basically how it used to be in the dark ages when all lexers were copied to the configuration. Currently, only differences are stored to the config.

My current plan is to scan the lexer folder once and load all lexers once. That provides us with a mapping of extensions to lexers which can be saved in the config file. On subsequent loads, Code::Blocks will know which lexer to load when opening a specific file type, and that can indeed be done on request then. When installing a new lexer, one would have to hit the "refresh button" to force reloading the map. That way, you don't need to configure anything, which is a good thing. I am still looking for a weak spot in this approach, but I guess it might just work fine.
What do you think about this approach?

Quote

TinyXML parsing is slow, with a rough measure of 200ms each lexer. [...]
SciTE loads way more lexers than C::B. The SciTE lexers also have more features.

You're comparing apples and oranges again. SciTE lexers have a collection of single line key/value pairs, and Code::Blocks lexers are xml documents that are validated for well-formedness. Of course it takes time to validate a document, this is not surprising.
The same goes for your network load story. You're missing the point here, too.
We are making on the order of 13,000 isolated file accesses during a "normal" startup. On a local file system, much of this can be cached, but it is absolutely not surprising that this is a major performance bottleneck over a network.
wxWidgets makes on the order of 10,000 distinct file accesses alone to load the XRC files. You can easily verify this using FileMon if you have any doubts about it.
To get back to TinyXML which is so terribly slow: the configuration file loads with about 6-7 file accesses, and all lexers are loaded using about 100 distinct file accesses. The time that TinyXML takes to parse those files is just ridiculous compared to the network latency of 10k accesses...

Michael · « **Reply #11 on:** April 05, 2006, 02:49:21 pm »

Quote from: thomas on April 05, 2006, 11:50:19 am

Quote from: Michael on April 05, 2006, 09:51:14 am
May be lexers could be handle as file associations are. Depending on which kind of file a user open, C::B loads the relative lexer automatically [...]
First, you don't know what a lexer refers to without loading it.

In my idea, you have a table where for each lexer there are the file extensions supported. But instead of putting just the lexer name, you put its path (possibly relative) and its name (as alternative, just the lexer filename and the lexer folder path stored separately). In a similar way as for the include files and libraries. In this case C::B knows which lexer it has to load (without before parsing all the lexers).

Quote from: thomas on April 05, 2006, 11:50:19 am

Thus, you would have to encode this information somewhere. Keeping around an extra map file for this would work best, but then you are building up a dependency which is not good. When adding a new lexer, you have to update the map, or it won't work.
One could think about putting the extension which is handled into the lexer's name, but most lexers handle several (up to 6) file types, so filenames would become quite cluttered (still possible).

The information could be stored into an XML file. When C::B starts, it loads the XML file, parses it, gets the info and fills the table. When a user add/modify a lexer/extension, this can be easily saved into the XML file. May be a multimap can be used, where the lexer "name" would be the key and the extensions the values.

Disadvantage is that you build some dependencies which is not good.

Quote from: thomas on April 05, 2006, 11:50:19 am

My current plan is to scan the lexer folder once and load all lexers once. That provides us with a mapping of extensions to lexers which can be saved in the config file. On subsequent loads, Code::Blocks will know which lexer to load when opening a specific file type, and that can indeed be done on request then. When installing a new lexer, one would have to hit the "refresh button" to force reloading the map. That way, you don't need to configure anything, which is a good thing. I am still looking for a weak spot in this approach, but I guess it might just work fine.
What do you think about this approach?

I think it is a good alternative

. The question is how to manage the updates of the map (addition, deletion, modification of a lexer). E.g., if you add/modify a lexer would C::B re-parses all the lexer again or just to new/modified one? If you re-scan all the lexers (easiest solution), it would take time and may be the user will not appreciate. May be a thread with low priority could be used to manage this update process.

Anyway, as you say it should work fine

. May be to spot some problems, at the beginning a simple implementation could be used. If no major problems are reported, it could be extended and improved. It would be not so good if a large amount of time is invested at the beginning, just to know that the idea will not work. Better beginning with a simple solution and extend it successively.

Best wishes,
Michael

thomas · « **Reply #12 on:** April 05, 2006, 03:08:19 pm »

Quote

I think it is a good alternative Smile. The question is how to manage the updates of the map (addition, deletion, modification of a lexer). E.g., if you add/modify a lexer would C::B re-parses all the lexer again or just to new/modified one? If you re-scan all the lexers (easiest solution), it would take time and may be the user will not appreciate. May be a thread with low priority could be used to manage this update process.

Modifying a lexer should not matter at all (unless you change the file mapping). Reparsing everything from scratch is very attractive, as it is simple to implement. It may take 3-5 seconds, but so what... you don't add new lexers every day

Deletion should not be a problem, if the file is not found, you simply return the same value (called LEX_NONE or something) that is returned if a lexer is not known at all.

Putting those extension/file mappings into the config is probably the least painful. I would not want to require the user to edit a configuration file by hand just to add a lexer. Also, this would not work well with internet update/install. To modify an external file, we would need to either implement a complete parser or distribute a tool like sed or something with Code::Blocks. On the other hand, allowing the updater to fire a "reload lexers" event is trivial and 100% safe.

Michael · « **Reply #13 on:** April 05, 2006, 03:45:06 pm »

Quote from: thomas on April 05, 2006, 03:08:19 pm

Quote
I think it is a good alternative Smile. The question is how to manage the updates of the map (addition, deletion, modification of a lexer). E.g., if you add/modify a lexer would C::B re-parses all the lexer again or just to new/modified one? If you re-scan all the lexers (easiest solution), it would take time and may be the user will not appreciate. May be a thread with low priority could be used to manage this update process.
Modifying a lexer should not matter at all (unless you change the file mapping). Reparsing everything from scratch is very attractive, as it is simple to implement. It may take 3-5 seconds, but so what... you don't add new lexers every day
Deletion should not be a problem, if the file is not found, you simply return the same value (called LEX_NONE or something) that is returned if a lexer is not known at all.

If it takes around 5 second or so, I think it is not an issue. And yes, you do not add a lexer each day

.

Quote from: thomas on April 05, 2006, 03:08:19 pm

Putting those extension/file mappings into the config is probably the least painful. I would not want to require the user to edit a configuration file by hand just to add a lexer. Also, this would not work well with internet update/install. To modify an external file, we would need to either implement a complete parser or distribute a tool like sed or something with Code::Blocks. On the other hand, allowing the updater to fire a "reload lexers" event is trivial and 100% safe.

The user should not touch the XML file where the lexers and relative associations are stored, but just the table in C::B. The modifications are then stored by C::B. But if this might make problems, better a 100% safe solution as "reload lexers"

.

Best wishes,
Michael

takeshimiya · « **Reply #14 on:** April 05, 2006, 03:46:44 pm »

Quote from: thomas on April 05, 2006, 11:50:19 am

Quote
TinyXML parsing is slow, with a rough measure of 200ms each lexer. [...]
SciTE loads way more lexers than C::B. The SciTE lexers also have more features.
You're comparing apples and oranges again. SciTE lexers have a collection of single line key/value pairs, and Code::Blocks lexers are xml documents that are validated for well-formedness. Of course it takes time to validate a document, this is not surprising.

Of course it's not surprising, and of course I'm comparing apples to oranges... Because they're different formats.
But that was a point.
Notice that those rough 200ms per xml lexer is on local disk, guess what takes to parse more than 50 C::B xml lexers.

Quote from: thomas on April 05, 2006, 11:50:19 am

The same goes for your network load story. You're missing the point here, too.
We are making on the order of 13,000 isolated file accesses during a "normal" startup. On a local file system, much of this can be cached, but it is absolutely not surprising that this is a major performance bottleneck over a network.

Yes, that's another point, further improvements can be done with caching,

Quote from: thomas on April 05, 2006, 11:50:19 am

wxWidgets makes on the order of 10,000 distinct file accesses alone to load the XRC files.

I thought that the XRC's were loaded from the zips, which then were read from memory instead of disk.
(zip's from disk, xrc's from memory, uncompressed).

Quote from: thomas on April 05, 2006, 11:50:19 am

To get back to TinyXML which is so terribly slow: the configuration file loads with about 6-7 file accesses, and all lexers are loaded using about 100 distinct file accesses. The time that TinyXML takes to parse those files is just ridiculous compared to the network latency of 10k accesses...

True, but if network latency were the only issue, why the SciTE lexers takes 1 second on LAN, while it haves more lexers?
What is SciTE doing <somehow> that reduces network latency? Perhaps what Michael suggested?

Quote from: mandrav on April 05, 2006, 11:28:14 am

Revision 2306 has fixed the delay when opening "Settings->Editor". That's a start .

Great

squizzz · « **Reply #15 on:** April 07, 2006, 12:27:05 am »

Thanks for fixing Settings->Editor delay.

Regarding startup time - I have some ~~low end~~ ancient specs machine here (533MHz 256MB :lol:), so it takes ~ 20 seconds to start Code::Blocks - 10 seconds for lexers and 10 for plugins. What's interesting - matlab_lexer takes whole 4 seconds to load, while others of comparable size take no more than 400 ms... Anyway, it's nice to know what should I turn off first.

thomas · « **Reply #16 on:** April 07, 2006, 02:18:48 am »

Quote

Yes, that's another point, further improvements can be done with caching

Quote

I thought that the XRC's were loaded from the zips, which then were read from memory instead of disk. (zip's from disk, xrc's from memory, uncompressed).

I am really unable to explain this to you if you don't read, Takeshi. Run FileMon yourself and you will see. We cannot cache those reads, they are made implicitely by the XRC loader... it is not the lexers that take 50 seconds to load.

Quote

True, but if network latency were the only issue, why the SciTE lexers takes 1 second on LAN, while it haves more lexers? What is SciTE doing <somehow> that reduces network latency? Perhaps what Michael suggested?

Network latency is the only issue for that phenomenon. And once again, it is not the lexers that take 50 seconds.
Scite is not doing anything to reduce network latency (as it happens, the speed of light is the same for SciTE as for Code::Blocks, and routers work none differently for SciTE, either).
SciTE simply does not use XRC and thus does a lot less I/O operations (about 800 alltogether, 1/16 as many as Code::Blocks). That, too, can be seen from running FileMon.

Quote

Notice that those rough 200ms per xml lexer is on local disk, guess what takes to parse more than 50 C::B xml lexers.

Unless your machine is ideed 10 times slower than any machine Yiannis and me have tried, this is not correct. It is certainly true that parsing is not free. However, there are just 3 lexers which take that long (the ones with mega-long keyword lists: matlab, masm, and nsis), all others are on the order of 20-40ms. But never mind that's a different issue. Once we have found a way to implement on-demand loading (which is unluckily not trivial to implement), that should no longer be a problem.

takeshimiya · « **Reply #17 on:** April 07, 2006, 09:38:42 pm »

Quote from: thomas on April 07, 2006, 02:18:48 am

it is not the lexers that take 50 seconds to load.

Never implied that. What's implied is that C::B takes 50 seconds to load and the lexers are partly responsible for that, but not the biggest one.
As you correctly pointed out, the major (constant time) bottleneck are resources.

The startup-time bottlenecks are:

Loading of Lexers (Takes constant time between runs)
Can be solved in various ways: on-demand loading, another format, etc.
Loading of Resources (Takes constant time between runs, more than Lexers)
Not much can be done, except reducing a bit the XRC usage (ie. toolbars).
Or compressing the resources with compression level 0 (that will imply more space on disk,
but probably smaller download size for the "compressing what is compressed" thing).
Loading of DLLs (Takes a lot of time on first run, more than Resources. In the subsequent runs Windows keeps the DLLs on memory)
Can be solved with prelinking, and with GCC4 visibility flags. Certainly the two will have a big impact on faster loading.
We can wait until MinGW GCC4.1 comes out, and compile with the visibility flags.

Certainly the lexers aren't the major startup-time bottleneck, but when you're not an "fast platform", like a "not-so-fast" pc, or a loading from a network or usb-key, everything matters.
Anyways I'm more concerned about other aspects of the lexers, not only the performance.

Quote from: thomas on April 07, 2006, 02:18:48 am

Quote
Notice that those rough 200ms per xml lexer is on local disk, guess what takes to parse more than 50 C::B xml lexers.
Unless your machine is indeed 10 times slower than any machine Yiannis and me have tried, this is not correct.

You're right, my "rough" measure is what C::B printed in the log, which I certainly can't trust because it's common that two items have the same time, which I really doubt.
I'm in the search of a profiler for MinGW other than gprof (for it's known limitations). I'm using AMD CodeAnalyst which is really great but currently doesn't support MinGW.
Any suggestions?

thomas · « **Reply #18 on:** April 07, 2006, 10:07:29 pm »

Quote

Not much can be done, except reducing a bit the XRC usage (ie. toolbars).
Or compressing the resources with compression level 0 (that will imply more space on disk,
but probably smaller download size for the "compressing what is compressed" thing).

No, that makes it worse.

takeshimiya · « **Reply #19 on:** April 07, 2006, 10:24:27 pm »

Quote from: thomas on April 07, 2006, 10:07:29 pm

Quote
Not much can be done, except reducing a bit the XRC usage (ie. toolbars).
Or compressing the resources with compression level 0 (that will imply more space on disk,
but probably smaller download size for the "compressing what is compressed" thing).
No, that makes it worse.

That's true in most cases, but it really depends if the bottleneck is the CPU or file access.
When loading from a network it will be the file access, and when loading from a local hard disk with a slow CPU (or if you're doing another CPU-intensive task), it will be CPU.
So it's a good compromise as is now.

I'm looking for a MinGW profiler other than GProf, anyone knows?

thomas · « **Reply #20 on:** April 08, 2006, 01:41:06 am »

Quote from: Takeshi Miya on April 07, 2006, 10:24:27 pm

That's true in most cases, but it really depends if the bottleneck is the CPU or file access.
When loading from a network it will be the file access, and when loading from a local hard disk with a slow CPU (or if you're doing another CPU-intensive task), it will be CPU.

No, Takeshi, CPU is not a factor here, in neither case.

News:

Author Topic: lexer file loading ... (Read 18027 times)

takeshimiya

takeshimiya

takeshimiya

takeshimiya