OK, let's summarize all this and see whether I have understood everything:
1) Scintilla uses UTF-8 internally.
2) wxScintilla::GetCharAt returns a single byte from position pos, always exactly 1 byte, so it can return a byte from the middle of a code point. It is possible to find the beginning of the code point and extract the full code point.
3) All code uses wxChar (aka wchar_t) to represent one character. This is 2 bytes on Windows and 4 bytes on UNIX, so we have to implement platform-dependent code...
4) The plugin basically goes through the text byte by byte. This will work as long as you use UTF-8 or any other "single byte" encoding for your document and the dictionary.
5) Hunspell uses UTF-8 but should be able to handle every encoding. In our case we use UTF-8 and we should stick to it...
6) Now we hit the wxIspunct (aka iswpunct) function:
6.1) On Linux this function kind of works with UTF-8 because it treats all values above 0x80 as non-punctuation characters (as it should). But it will not work for Unicode characters that are punctuation and non-ASCII, although characters of this kind are probably rare in the programming space.
6.2) On Windows this does not work because some characters above 0x80 (at least with the English localization) are treated as punctuation (for example 0xBB ( ╗ )), and if you use a UTF-8 encoded document you will hit these characters quite fast...
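To illustrate point 2, here is a minimal sketch of how the full code point could be recovered from a byte position. It works on a plain std::string standing in for the Scintilla buffer; the names IsContinuationByte and CodePointAt are made up for this example and are not part of wxScintilla. It does not reject overlong encodings or surrogates, so treat it as a sketch rather than a validating decoder.

```cpp
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool IsContinuationByte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: decode the code point that covers byte offset pos.
// Returns U+FFFD for out-of-range positions or malformed sequences.
inline char32_t CodePointAt(const std::string& buf, std::size_t pos)
{
    const char32_t kInvalid = 0xFFFD;
    if (pos >= buf.size())
        return kInvalid;

    // Step back (at most 3 bytes) until we reach the lead byte.
    std::size_t start = pos;
    for (int i = 0; i < 3 && start > 0 &&
                    IsContinuationByte(static_cast<unsigned char>(buf[start])); ++i)
        --start;

    // The lead byte tells us the sequence length and the first payload bits.
    const unsigned char lead = static_cast<unsigned char>(buf[start]);
    std::size_t len;
    char32_t cp;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
    else return kInvalid;

    // pos must lie inside the sequence, and the sequence inside the buffer.
    if (pos >= start + len || start + len > buf.size())
        return kInvalid;

    // Each continuation byte contributes 6 payload bits.
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(buf[start + i]);
        if (!IsContinuationByte(c))
            return kInvalid;
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

With the buffer "a\xD1\x87", byte offsets 1 and 2 both decode to 0x0447, while a byte-wise read at offset 2 would only see 0x87.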
So there are two main problems:
1) We do not handle Unicode correctly: wxScintilla::GetCharAt should return the full code point. But here Unix and Windows diverge (wchar_t has a different size). As noted above, we could probably get away with using only 2 bytes and UTF-16 on Windows. Why is this needed?
2) wxIspunct (aka iswpunct) needs the full code point to work correctly. On Windows this has to be UTF-16 and on Linux probably UTF-32 or UTF-8, I don't know...
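The wchar_t size difference can be sketched like this. The helper name ToWide is hypothetical; it turns one code point into wchar_t units, using a surrogate pair only when wchar_t is 2 bytes and the code point lies outside the Basic Multilingual Plane. It assumes the input is a valid code point and does no range checking.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as wchar_t units.
// With a 4-byte wchar_t the result is always one unit (UTF-32).
// With a 2-byte wchar_t, code points >= 0x10000 need a UTF-16 surrogate pair.
inline std::wstring ToWide(char32_t cp)
{
    std::wstring out;
    if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
        out.push_back(static_cast<wchar_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<wchar_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<wchar_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Everything below 0x10000 fits into a single unit on both platforms, which covers 0xBB and the Cyrillic range, so in that range a single wchar_t can be handed to the classification function on Windows and on Linux alike.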
I really do not feel like rewriting all this as Unicode-aware code... Especially because wxWidgets does not take the load off us, because we still have to make ... (100% not wx2.8...)
How can we fix this for the next release (or even nightly)?
1) On Linux we do not need to do anything
2) On Windows:
2.1) feed only values < 0x80 to wxIspunct, or
2.2) use the code described above that worked until now...
If there are no big objections, I would like to implement 2.2.
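For comparison, option 2.1 could look roughly like the sketch below. It uses std::iswpunct directly instead of wxIspunct so that it compiles without wxWidgets, and the name IsAsciiPunct is made up for this example. It is not the code referred to in 2.2.

```cpp
#include <cwctype>

// Hypothetical guard for option 2.1: only values below 0x80 ever reach
// iswpunct, so UTF-8 lead and continuation bytes (all >= 0x80) are never
// classified as punctuation, regardless of platform or locale.
inline bool IsAsciiPunct(wchar_t ch)
{
    // The unsigned cast also rejects negative values where wchar_t is signed.
    if (static_cast<unsigned long>(ch) >= 0x80)
        return false;
    return std::iswpunct(static_cast<std::wint_t>(ch)) != 0;
}
```

The trade-off is the one already mentioned: non-ASCII punctuation is never recognized, but bytes such as 0xBB can no longer split a word.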
[edit:]
There are many punctuation characters outside ASCII:
http://www.open-std.org/JTC1/SC35/WG5/docs/30112d10.pdf
[edit2:]
I tried to convert UTF-8 to UTF-16 with the functions provided with Hunspell. They work pretty well, but I still get errors for the Russian characters:
( 0x0447 ) is alpha: true
( 0x0438 ) is alpha: false
( 0x0441 ) is alpha: true
( 0x043b ) is alpha: false
( 0x043e ) is alpha: false
( 0x0432 ) is alpha: false
( 0x044b ) is alpha: true
( 0x0445 ) is alpha: true
They should all be true... I think this is because I have an English locale set, and using iswalpha() with a locale does not work for me because I do not know which locale to set...
After this experiment I am even more convinced that 2.2 is good enough...
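One thing that might be worth ruling out besides the locale: the true/false pattern above is exactly what plain isalpha returns in the default "C" locale for the low byte of each code point (0x47 'G', 0x38 '8', 0x41 'A', 0x3B ';', ...). The sketch below only checks that coincidence against the values listed above; it does not prove that a truncation to one byte actually happens anywhere in the plugin. The names Sample, LowByteIsAsciiAlpha and ReportedMatchesLowByte are made up for this check.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>

// Code points and results copied from the output listed above.
struct Sample { std::uint32_t codePoint; bool reported; };

static const Sample kSamples[] = {
    {0x0447, true},  {0x0438, false}, {0x0441, true},  {0x043B, false},
    {0x043E, false}, {0x0432, false}, {0x044B, true},  {0x0445, true},
};

// Classify only the low byte, as would happen if the value were truncated
// to a single byte before the isalpha call.
inline bool LowByteIsAsciiAlpha(std::uint32_t cp)
{
    return std::isalpha(static_cast<unsigned char>(cp & 0xFF)) != 0;
}

// True if every reported result equals the low-byte classification.
inline bool ReportedMatchesLowByte()
{
    for (std::size_t i = 0; i < sizeof(kSamples) / sizeof(kSamples[0]); ++i)
        if (LowByteIsAsciiAlpha(kSamples[i].codePoint) != kSamples[i].reported)
            return false;
    return true;
}
```

If the pattern is not a coincidence, checking where the value gets narrowed to a char or a byte before the classification call may be more productive than experimenting with locales.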