Developer forums (C::B DEVELOPMENT STRICTLY!) > Development
UTF8 Encoding conversion speedup (Linux)
dmoore:
EDIT: I should have pointed out that this really only applies to Linux. AFAIK Win32 encoding conversion works just fine.
I've been trying to understand why files are taking so long to convert. I think there is something broken in either wxCSConv or in C::B's encoding detection.
I think there are two bottlenecks:
1. Opening a standard ASCII file results in failed detection and a fallback to whatever default conversion is specified. That might be right, but it still means running several detection algorithms over the entire buffer.
2. If the fallback is UTF8, wxCSConv takes a horrendously long time to convert the file. My fileexplorer.cpp converts in ~1 sec with wxCSConv vs. ~4 ms with wxMBConvUTF8. On the other hand, wxCSConv seems to be fine if the fallback is ISO 8859 (provided it's the right encoding!)
I'm submitting a patch that at least deals with #2 for UTF8 by using wxMBConvUTF8 instead of wxCSConv. I'd expect that other encodings could also be slow with wxCSConv, which would require additional cases.
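For context, the reason a dedicated UTF-8 converter can be so much faster is that UTF-8 decoding is just a tight loop over the bytes, with no locale or iconv machinery involved. Here is a self-contained sketch of such a decoder (illustrative only — this is not the wx implementation, and a production converter like wxMBConvUTF8 also rejects overlong forms, surrogates, and out-of-range code points):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Decode a UTF-8 byte string into Unicode code points.
// Returns false on malformed input.
bool DecodeUtf8(const std::string& in, std::vector<uint32_t>& out)
{
    size_t i = 0;
    while (i < in.size()) {
        unsigned char b = in[i];
        uint32_t cp;
        size_t len;
        if      (b < 0x80)           { cp = b;        len = 1; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; len = 2; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; len = 3; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; len = 4; }
        else return false;                     // invalid lead byte
        if (i + len > in.size()) return false; // truncated sequence
        for (size_t k = 1; k < len; ++k) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return false; // bad continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return true;
}
```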
[attachment deleted by admin]
XayC:
The problem is that there's no way to know the encoding of a source file (I'm talking about standard C/C++ files) so the "encoding detection" is used to guess a possible encoding for the file. This can work, but can fail too.
Regarding point #1 I think 2 things can be done:
1) If the file couldn't be opened using the default encoding (which is probably the encoding used for most of the user's files), then either use a fallback encoding or inform the user that the file couldn't be opened with the specified encoding. But this may be what C::B is already doing.
2) Limit the amount of data scanned to detect the encoding. That way small files will still be scanned in full, while very big files will only be partially scanned. Since the detection is a guess anyway, guessing the encoding from a reduced amount of data should not cause big problems. Of course, if a file is very big and only its first part is considered for detection, a multi-byte character or surrogate pair can be truncated at the cut-off point. To handle this correctly, the detection must be performed up to 4 times, shrinking (or growing) the scanned size by one byte at a time.
XayC
dmoore:
--- Quote from: XayC on February 22, 2009, 09:17:15 pm ---The problem is that there's no way to know the encoding of a source file (I'm talking about standard C/C++ files) so the "encoding detection" is used to guess a possible encoding for the file. This can work, but can fail too.
--- End quote ---
yes. If it's ASCII, then a large number of 8-bit encodings are acceptable, including UTF8.
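That observation suggests a cheap fast path for point #1: a single pass that checks whether every byte is below 0x80. If so, the file is plain ASCII and any ASCII-compatible default encoding can be used as-is, skipping the heavier detectors entirely. A minimal sketch (the function name is illustrative, not C::B code):

```cpp
#include <string>

// Fast path for encoding detection: a buffer whose bytes are all < 0x80
// is plain ASCII, which is a valid subset of UTF-8 and of every
// ISO 8859 variant, so no further detection is needed.
bool IsPlainAscii(const std::string& buffer)
{
    for (unsigned char b : buffer)
        if (b >= 0x80)
            return false;
    return true;
}
```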
--- Quote ---Regarding point #1 I think 2 things can be done:
1) If the file couldn't be opened using the default encoding (which is probably the encoding used for most of the user's files), then either use a fallback encoding or inform the user that the file couldn't be opened with the specified encoding. But this may be what C::B is already doing.
--- End quote ---
yes, the user can opt to use a preferred encoding or a fallback encoding. I'm not sure the user really gets informed about encoding failures (just a blank file IIRC).
--- Quote ---2) Limit the amount of data scanned to detect the encoding. That way small files will still be scanned in full, while very big files will only be partially scanned. Since the detection is a guess anyway, guessing the encoding from a reduced amount of data should not cause big problems. Of course, if a file is very big and only its first part is considered for detection, a multi-byte character or surrogate pair can be truncated at the cut-off point. To handle this correctly, the detection must be performed up to 4 times, shrinking (or growing) the scanned size by one byte at a time.
--- End quote ---
it's a messy process. :)
anyway, fixing the wxCSConv issue seems to eliminate most of the delay, so I probably won't spend much time thinking about how to improve the detection. Personally I don't plan on using anything but UTF8.
MortenMacFly:
--- Quote from: dmoore on February 23, 2009, 05:18:57 am ---yes, the user can opt to use a preferred encoding or a fallback encoding. I'm not sure the user really gets informed about encoding failures (just a blank file IIRC).
--- End quote ---
There is detailed information about that in the debug log... ;-)
dmoore:
--- Quote from: MortenMacFly on February 23, 2009, 07:19:40 am ---
--- Quote from: dmoore on February 23, 2009, 05:18:57 am ---yes, the user can opt to use a preferred encoding or a fallback encoding. I'm not sure the user really gets informed about encoding failures (just a blank file IIRC).
--- End quote ---
There is detailed information accordingly in the debug log... ;-)
--- End quote ---
so how about a little error dialog alerting the user to the failed encoding? (One of those shiny info popups would be nice.) Some error info could appear in the regular log.
Also, would it make sense to have an encoding drop-down in the file open dialog (by default it would be set to "automatic")?