Wrong spell checker on russian (and probably other languages)

BlueHazzard:
Hi,
I had some time to investigate the spell checker problem that Khram reported in several nightly build threads.
It only happens on Windows; on Linux it works like a charm.

You should easily be able to reproduce this by enabling the spell checker and pasting this into a new editor window:

--- Code: ---числовых

--- End code ---
You do not need any Russian dictionary installed. You can see that only the last few letters are highlighted as an error, but if this really were an error the whole word should be highlighted.
As far as I can tell the problem is in the function
bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
in src\plugins\contrib\SpellChecker\SpellCheckHelper.cpp:42

--- Code: ---bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
{
// Support words like doesn't: ch!='\''

return wxIsspace(ch) || (wxIspunct(ch) && ch!='\'') || wxIsdigit(ch);
}

--- End code ---

The function wxIspunct(ch) returns != 0 for some letters that are not punctuation, so the word gets split up incorrectly.
wxIspunct(ch) is a macro for std::ispunct (or the C equivalent), and this has some problems on Windows with non-ASCII input. The old Unicode story again...

I have not investigated further. Any ideas how we can fix this, or where this comes from?
I think the conversion from a wxWidgets string to UTF-16 for Windows does not work in this plugin, and I will try to investigate a bit more...

An update so far:
The problem is that we get a wrong character from the editor:
wxScintilla::GetCharAt() returns 0xD1 (only the first byte) but should return 0xD187 (UTF-8) or 0x0447 (UTF-16BE) (what a crap... why not UTF-8 everywhere ;( )
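
For reference, a minimal sketch of where the 0xD1 comes from, assuming the editor buffer is UTF-8 and the accessor hands back one byte at a time. encodeUtf8 is a hypothetical helper written for this post, not part of wxWidgets or Scintilla:

```cpp
#include <string>

// Encode one Unicode code point as UTF-8.
// Illustration only: surrogates and values above U+10FFFF are not rejected.
std::string encodeUtf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

encodeUtf8(0x0447) gives the two bytes 0xD1 0x87, so a byte-wise read at the start of the letter sees only 0xD1.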

[EDIT:] Notepad++ does not have this problem. I will try to reverse engineer what they are doing...

greetings

BlueHazzard:
Ok, it really boils down to Unicode problems.
Does anyone know whether Scintilla returns a UTF-32 character from GetCharAt() on Linux? This would explain why it works on Linux but not on Windows.
I do not know yet how to fix this... The easiest solution would probably be to get all the text from the editor, put it in a wxWidgets string, convert it to UTF-8 (if it is not UTF-8 already), split it into words and feed them into hunspell, because as far as I can see hunspell uses UTF-8 internally and needs UTF-8 input (but how does this work on Linux?).
This would be painfully slow, and the splitting has to be done by hand, because as seen above wxIspunct uses the system function, and on Windows this does not support UTF-8...

Another solution would be to modify GetCharAt() to check whether it is looking at a multi-byte character and try to get the real character, but then everything mentioned above has to be redone...
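
A rough sketch of what such a multi-byte aware read could look like, assuming the buffer is UTF-8 and is available as a plain byte string. utf8Length and decodeAt are hypothetical names, not existing Scintilla or plugin functions, and overlong encodings are not rejected here:

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`,
// or 0 if `lead` is a continuation byte or not a valid lead byte.
std::size_t utf8Length(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Decode the code point starting at byte `pos` (pos must be < buf.size()).
// `len` receives the number of bytes consumed (always at least 1).
// Returns U+FFFD for malformed or truncated input.
char32_t decodeAt(const std::string& buf, std::size_t pos, std::size_t& len)
{
    const unsigned char lead = static_cast<unsigned char>(buf[pos]);
    len = utf8Length(lead);
    if (len == 0 || pos + len > buf.size()) {
        len = 1;
        return 0xFFFD;
    }
    if (len == 1) return lead;

    // Keep only the payload bits of the lead byte (5, 4 or 3 bits).
    char32_t cp = lead & (0xFF >> (len + 1));
    for (std::size_t i = 1; i < len; ++i) {
        const unsigned char c = static_cast<unsigned char>(buf[pos + i]);
        if ((c & 0xC0) != 0x80) {   // not a continuation byte
            len = 1;
            return 0xFFFD;
        }
        cp = (cp << 6) | (c & 0x3F);
    }
    return cp;
}
```

The caller would then advance by `len` bytes instead of by one, which is the part that touches every loop currently built around single positions.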

If I try to fix this, I will only try to fix UTF-8 files. I do not see the need to support other encodings. (The Notepad++ spell checker supports other encodings and uses iconv on Windows, and I do not feel the need to do this for Code::Blocks on Windows.) It would be a different matter if we could use wxWidgets to do the conversion...
One thing that bothers me is that, according to Khram, there was some point at which the spell checker worked. Looking at the code, I cannot believe this...

oBFusCATed:
If I remember correctly, the nightly after a change to SpellCheckHelper::IsWhiteSpace broke Khram's use case, but I think he is not using a UTF-8 encoding.

How does Notepad++ feed hunspell? Does it use the equivalent of GetCharAt?

raynebc:
He claimed the problem occurs in various encodings as well as UTF-8:
http://forums.codeblocks.org/index.php/topic,23102.msg157343.html#msg157343

BlueHazzard:
Ok, we have known about this issue for some time:
http://forums.codeblocks.org/index.php/topic,20195.15.html

I do not follow the conclusion white tiger gave about using wxIspunct, because it will not work on Windows with UTF-8 (or any other encoding).

As far as I can tell hunspell needs UTF-8 strings (at least Notepad++ gives UTF-8 strings to hunspell).
Notepad++ uses

--- Code: ---const wchar_t *default_delimiters() {
return L",.!?\":;{}()[]\\/"
L"=+-^$*<>|#$@%&~"
L"\u2026\u2116\u2014\u00AB\u00BB\u2013\u2022\u00A9\u203A\u201C\u201D"
L"\u00B7"
L"\u00A0\u0060\u2192\u00d7";
}
--- End code ---
for delimiters, as far as I can tell.
Notepad++ gets a full range of text from Scintilla (4096 characters) and stores and encodes it as needed, so it does not use the GetCharAt function.

For me this is some kind of nightmare:
1) wxWidgets uses a 16-bit wxChar on Windows. This does not guarantee that every code point can be represented with one wxChar.
2) Because of this, wxIspunct cannot be used: it takes a wxChar, which cannot hold all code points (it should take an int of at least 32 bits, or a wchar_t*).
3) The spell checker code uses IsWhiteSpace( stc->GetCharAt() ) in a lot of places... so a lot of rework...
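
To illustrate point 1, a small sketch of standard UTF-16 encoding. encodeUtf16 is a hypothetical helper for this post and does not come from wxWidgets:

```cpp
#include <string>

// Encode one code point as UTF-16 code units.
// Code points above U+FFFF need a surrogate pair, i.e. two 16-bit units.
// Illustration only: invalid input (lone surrogates, values above U+10FFFF)
// is not rejected.
std::u16string encodeUtf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));   // high surrogate
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF)); // low surrogate
    }
    return out;
}
```

U+0447 fits in one unit, but something like U+1F600 becomes the pair 0xD83D 0xDE00, so a function that classifies one 16-bit unit at a time never sees the whole character.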

The easiest fix would probably be to hardcode the delimiters (as escape sequences) like Notepad++ does and not use wxIspunct. In theory this should then keep all the bytes of a UTF-8 character together automatically. And ditch all other encodings...
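
A minimal sketch of that idea, assuming the text is already UTF-8 and only ASCII delimiters are checked. isAsciiDelimiter and splitWords are hypothetical names, and the delimiter list is an illustrative subset, not the Notepad++ one. Every byte of a multi-byte UTF-8 sequence is >= 0x80, so it can never match an ASCII delimiter and the sequence stays in one piece:

```cpp
#include <cstring>
#include <string>
#include <vector>

// True for ASCII whitespace, punctuation and digits that end a word.
// The apostrophe is deliberately missing so that words like doesn't survive.
bool isAsciiDelimiter(char ch)
{
    static const char delims[] =
        " \t\r\n,.!?\":;{}()[]\\/=+-^$*<>|#@%&~0123456789";
    // strchr would also match the terminating NUL, so exclude it explicitly.
    return ch != '\0' && std::strchr(delims, ch) != nullptr;
}

// Split UTF-8 text into words byte by byte.
std::vector<std::string> splitWords(const std::string& utf8)
{
    std::vector<std::string> words;
    std::string current;
    for (char ch : utf8) {
        if (isAsciiDelimiter(ch)) {
            if (!current.empty()) {
                words.push_back(current);
                current.clear();
            }
        } else {
            current += ch;
        }
    }
    if (!current.empty()) words.push_back(current);
    return words;
}
```

Non-ASCII delimiters such as U+2026 or U+00AB would need a comparison against their full UTF-8 byte sequences, which this sketch does not do.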
