Author Topic: UTF8 Encoding conversion speedup (Linux) (Read 22638 times)

dmoore · « **on:** February 22, 2009, 08:00:07 pm »

EDIT: I should have pointed out that this really only applies to Linux. AFAIK win32 encoding conversion works just fine.

I've been trying to understand why files are taking so long to convert. I think there is something broken in either wxCSConv or in C::B's encoding detection.

I think there are two bottlenecks:

1. Opening a standard ASCII file results in failed detection and a fallback to whatever default conversion is specified. That might be right, but it still means running several detection algorithms through the entire buffer.

2. If the fallback is UTF8, wxCSConv takes a horrendously long time to convert the file. My fileexplorer.cpp converts in ~1 sec with wxCSConv vs ~4 ms with wxMBConvUTF8. On the other hand, wxCSConv seems to be fine if the fallback is ISO 8859 (provided it's the right encoding!).

I'm submitting a patch that at least deals with #2 for UTF8 by using wxMBConvUTF8 instead of wxCSConv. I would expect that other encodings could also be slow with wxCSConv, which would require additional cases.
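
The attached patch is no longer available, so the following is only a minimal sketch of the idea, not the patch itself: special-case the encodings that have a cheap dedicated converter and leave everything else to the generic one. It uses plain standard C++ rather than wxWidgets so that it stands alone. `decodeUtf8`, `decodeLatin1` and `decodeFast` are hypothetical names; in C::B the UTF-8 branch would presumably correspond to wxMBConvUTF8, with wxCSConv as the generic fallback.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Dedicated UTF-8 decoder (hypothetical stand-in for a specialised converter).
// Checks only byte structure; overlong forms and surrogates are not rejected.
bool decodeUtf8(const std::string& bytes, std::u32string& out)
{
    out.clear();
    out.reserve(bytes.size());
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char c = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return false; // stray continuation byte or invalid lead byte

        if (i + extra >= bytes.size()) return false; // sequence cut off
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(bytes[i + k]);
            if ((cc & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (cc & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    assert(out.size() <= bytes.size()); // never more code points than bytes
    return true;
}

// ISO 8859-1 maps every byte directly to the code point of the same value.
void decodeLatin1(const std::string& bytes, std::u32string& out)
{
    out.clear();
    out.reserve(bytes.size());
    for (char ch : bytes)
        out.push_back(static_cast<unsigned char>(ch));
}

// Returns true if a dedicated fast path handled the encoding. On false,
// the caller would fall back to the generic (slower) converter.
bool decodeFast(const std::string& bytes, const std::string& encoding,
                std::u32string& out)
{
    if (encoding == "UTF-8")
        return decodeUtf8(bytes, out);
    if (encoding == "ISO-8859-1")
    {
        decodeLatin1(bytes, out);
        return true;
    }
    return false; // no special case for this encoding
}
```

Each further encoding that turns out to be slow through the generic path would get its own branch in `decodeFast`, which is the "additional cases" cost mentioned above.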

[attachment deleted by admin]

XayC · « **Reply #1 on:** February 22, 2009, 09:17:15 pm »

The problem is that there's no way to know the encoding of a source file (I'm talking about standard C/C++ files) so the "encoding detection" is used to guess a possible encoding for the file. This can work, but can fail too.

Regarding point #1 I think 2 things can be done:
1) If the file couldn't be opened using the default encoding (which is probably the encoding used by most of the user's files), then either use a fallback encoding or inform the user that the file couldn't be opened with the specified encoding. But this may be what C::B is already doing.

2) Limit the amount of data scanned to detect the encoding. This way, small files will still be scanned the old way, while very big files will be only partially scanned. Considering that the detection is a guess, guessing the encoding on a reduced amount of data should not cause big problems. Of course, if a file is very big and only the first part is considered for the detection, multi-byte characters or surrogate characters can be truncated during this process. To handle this correctly, the detection must be performed up to 4 times, increasing (or decreasing) the size of the data by one byte at a time.
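
A minimal sketch of the retry idea in point 2, assuming a simple UTF-8 structural check stands in for the real detector. `isValidUtf8` and `sampleLooksLikeUtf8` are hypothetical names, not C::B functions, and only the "decreasing" variant is shown: if the sample limit cuts a multi-byte character in half, the check is repeated with up to 3 fewer bytes.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Structural UTF-8 check over the first `len` bytes of `data`.
// Rejects stray continuation bytes, invalid lead bytes and cut-off sequences.
// Overlong forms and surrogate code points are not checked in this sketch.
bool isValidUtf8(const std::string& data, std::size_t len)
{
    assert(len <= data.size());
    std::size_t i = 0;
    while (i < len)
    {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0; // number of continuation bytes expected
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return false;

        if (i + extra >= len) return false; // sequence runs past the sample
        for (std::size_t k = 1; k <= extra; ++k)
        {
            if ((static_cast<unsigned char>(data[i + k]) & 0xC0) != 0x80)
                return false;
        }
        i += extra + 1;
    }
    return true;
}

// Checks at most `sampleLimit` bytes. When the sample is shorter than the
// file, the last character may be truncated, so the check is retried with
// 1, 2 and 3 fewer bytes (4 attempts in total, never an empty sample).
bool sampleLooksLikeUtf8(const std::string& data, std::size_t sampleLimit)
{
    const std::size_t len = std::min(sampleLimit, data.size());
    if (len == data.size())
        return isValidUtf8(data, len); // whole file: nothing was cut off

    for (std::size_t drop = 0; drop < 4 && drop < len; ++drop)
    {
        if (isValidUtf8(data, len - drop))
            return true;
    }
    return false;
}
```

One consequence of retrying with shorter samples is that an invalid byte within the last three bytes of the sample can be dropped along with the truncated tail, so the result stays a guess, as the post says.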

XayC

dmoore · « **Reply #2 on:** February 23, 2009, 05:18:57 am »

Quote from: XayC on February 22, 2009, 09:17:15 pm

The problem is that there's no way to know the encoding of a source file (I'm talking about standard C/C++ files) so the "encoding detection" is used to guess a possible encoding for the file. This can work, but can fail too.

yes. if it's ASCII, then a large number of 8-bit encodings are acceptable including UTF8.

Quote

Regarding point #1 I think 2 things can be done:
1) If the file couldn't be opened using the default encoding (which is probably the encoding used by most of the user's files), then either use a fallback encoding or inform the user that the file couldn't be opened with the specified encoding. But this may be what C::B is already doing.

yes, the user can opt to use a preferred encoding or a fallback encoding. I'm not sure the user really gets informed about encoding failures (just a blank file IIRC).

Quote

2) Limit the amount of data scanned to detect the encoding. This way, small files will still be scanned the old way, while very big files will be only partially scanned. Considering that the detection is a guess, guessing the encoding on a reduced amount of data should not cause big problems. Of course, if a file is very big and only the first part is considered for the detection, multi-byte characters or surrogate characters can be truncated during this process. To handle this correctly, the detection must be performed up to 4 times, increasing (or decreasing) the size of the data by one byte at a time.

it's a messy process.

anyway, fixing the conversion (wxCSConv) seems to eliminate most of the delay, so I probably won't spend much time thinking about how to improve the detection. Personally I don't plan on using anything but UTF8.

MortenMacFly · « **Reply #3 on:** February 23, 2009, 07:19:40 am »

Quote from: dmoore on February 23, 2009, 05:18:57 am

yes, the user can opt to use a preferred encoding or a fallback encoding. I'm not sure the user really gets informed about encoding failures (just a blank file IIRC).

There is detailed information accordingly in the debug log... ;-)

dmoore · « **Reply #4 on:** February 23, 2009, 08:27:24 pm »

Quote from: MortenMacFly on February 23, 2009, 07:19:40 am

Quote from: dmoore on February 23, 2009, 05:18:57 am
yes, the user can opt to use a preferred encoding or a fallback encoding. I'm not sure the user really gets informed about encoding failures (just a blank file IIRC).
There is detailed information accordingly in the debug log... ;-)

so how about a little error dialog alerting the user to the failed encoding? (one of those shiny info popups would be nice). some error info could appear in the regular log.

also, would it make sense to have an encoding drop-down in the file open dialog? (by default it would be set to "automatic")

Jenna · « **Reply #5 on:** February 26, 2009, 02:42:38 pm »

I'm certified sick this week, but lying in bed without doing anything is too boring, so I played a little bit with encoding-detection and code-conversion.

I have adapted Mozilla's encoding-detection for C::B.
The recognition seems to be much better.

After some other tweaking (among others using the idea behind dmoore's suggestions about not using wxCSConv if possible), I was able to speed up the loading of xmltest.cpp (blown up to 3.5 MB with multiple copies of its content) from about 31 seconds to about 2.5 seconds.

I'm currently working on a patch that can be uploaded for others to test.

Needs some (much) code-cleanup, but when it's ready, I will put it onto my server (it's too large for an attachment, I think, because of the encoding-detection code).

EDIT:
I just tested another very large file (this time UTF-8):

load time decreased from 82 seconds to less than 3!!

Biplab · « **Reply #6 on:** February 26, 2009, 03:07:12 pm »

Quote from: jens on February 26, 2009, 02:42:38 pm

I'm certified sick this week, but lying in bed without doing anything is too boring, so I played a little bit with encoding-detection and code-conversion.

I have adapted Mozilla's encoding-detection for C::B.
The recognition seems to be much better.

We (Morten and I) had previously proposed including this, but it was not accepted because encoding detection of all files in a large project may take a significant amount of time. This code is proven and is still one of the best encoding detection routines available.

dmoore · « **Reply #7 on:** February 26, 2009, 03:16:02 pm »

Quote from: Biplab on February 26, 2009, 03:07:12 pm

We (Morten and I) had previously proposed including this, but it was not accepted because encoding detection of all files in a large project may take a significant amount of time.

in my testing, it was the conversion, not the detection, that was taking a long time (maybe the balance shifts a bit on windows platforms where wxCSConv seems to do the right thing). Opening a project with ~10 large utf8 files could take a minute (albeit on a moderately specced pc). forget about loading big log files...

Quote

This code is proven and is still one of the best encoding detection routines available.

Do you mean mozilla's or the one in our trunk?

MortenMacFly · « **Reply #8 on:** February 26, 2009, 08:00:24 pm »

Quote from: dmoore on February 26, 2009, 03:16:02 pm

Do you mean mozilla's or the one in our trunk?

Mozilla's (probably in Mozilla's trunk... ;-)).

MortenMacFly · « **Reply #9 on:** February 26, 2009, 08:17:14 pm »

Quote from: jens on February 26, 2009, 02:42:38 pm

I'm certified sick this week

I overlooked this one... I hope you get well soon - I was sick last week... but the doc gave me just 2 days for recovery. I'll go to another one next time.

Jenna · « **Reply #10 on:** February 28, 2009, 12:19:45 am »

Quote from: MortenMacFly on February 26, 2009, 08:17:14 pm

Quote from: jens on February 26, 2009, 02:42:38 pm
I'm certified sick this week
I overlooked this one... I hope you get well soon - I was sick last week... but the doc gave me just 2 days for recovery. I'll go to another one next time.

Not really good, but I have to go to work on monday again, so it must get better.

Quote from: jens on February 26, 2009, 02:42:38 pm

I will put it onto my server (it's too large for an attachment, I think, because of the encoding-detection code).

I'll try to attach it; it's about 110 kB, so it should be small enough.

The patch should include all needed files (for linux and windows).

[attachment deleted by admin]

Jenna · « **Reply #11 on:** February 28, 2009, 03:08:09 pm »

Here is another patch.
This one can speed up saving of (large) Unicode-files on linux a lot.

[attachment deleted by admin]

MortenMacFly · « **Reply #12 on:** February 28, 2009, 08:11:43 pm »

Quote from: jens on February 28, 2009, 03:08:09 pm

Here is another patch.

Having this applied and trying to create a project e.g. using the wxWidgets wizard results in C::B crashing.
Had to revert this and all works fine again...?!

Jenna · « **Reply #13 on:** February 28, 2009, 08:31:26 pm »

Quote from: MortenMacFly on February 28, 2009, 08:11:43 pm

Quote from: jens on February 28, 2009, 03:08:09 pm
Here is another patch.
Having this applied and trying to create a project e.g. using the wxWidgets wizard results in C::B crashing.
Had to revert this and all works fine again...?!

No problem here, neither on linux nor on windows.

But I only tried both patches together. I will try the second alone, but I don't think this can be the problem.

Did you try a full rebuild, or a build after deleting "devel" and "output" subdirs (to force a relink) ?
Did you rebuild the contrib-plugins too ?

MortenMacFly · « **Reply #14 on:** February 28, 2009, 08:45:48 pm »

Quote from: jens on February 28, 2009, 08:31:26 pm

But I only tried both patches together,

Me, too! :-)

Quote from: jens on February 28, 2009, 08:31:26 pm

Did you try a full rebuild, or a build after deleting "devel" and "output" subdirs (to force a relink) ?
Did you rebuild the contrib-plugins too ?

No && No. Will try that... but not today anymore...
