Author Topic: Wrong spell checker on russian (and probably other languages) (Read 101167 times)

BlueHazzard · « **on:** March 30, 2019, 09:30:20 pm »

Hi,
i had some time to investigate the spellchecker problem Khram has with the spell checker he reported in many nightly build forums.
This only happens on windows. Linux works like a charm

You should easily be able to reproduce this by enabling the spell checker and paste this in a new editor window

Code

числовых

You do not need any russian dictionary installed. You can see that the last view letters are highlighted as error but if this really is an error the whole word should be highlighted.
As far as i can tell the problem is in the function
bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
in src\plugins\contrib\SpellChecker\SpellCheckHelper.cpp:42

Code

bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
{
    // Support words like doesn't: ch!='\''

    return wxIsspace(ch) || (wxIspunct(ch) && ch!='\'') || wxIsdigit(ch);
}

the function wxIspunct(ch) returns =! 0 for some letters that are not punctuation, and so the word gets split up incorrectly.
wxIspunct(ch) is a macro to std::ispunct (or the c equivalent) and this has some problems on windows with non ascii things. The old unicode story again...

I have not investigated future. Any ideas how we can fix this, or where this comes from?
I think the transfer from wxWidgets string to utf-16 for windows does not work in this plugin and i will try to investigate a bit more...

Some update so far:
The problem is that we get a wrong charachter from the editor:
wxScintilla::GetCharAt() returns 0xD1 but should return 0xD187 (UTF-8) or 0x0447 (UTF-16BE) (what a crap... why not utf-8 everywhere ;( )

[EDIT:] Notepad++ does not have this problem. I try to reverse engineer what they are doing...

greetings

BlueHazzard · « **Reply #1 on:** March 30, 2019, 10:03:56 pm »

Ok, it really boils down to unicode problems.
Does someone know if on linux sci returns an UTF32 character with getCharAt ? This would explain why it works on linux but not on windows.
I do not know yet how to fix this.... The easiest solution would probably be to get all text from editor put it in a wxWidgets string, convert it to utf-8 (if not utf-8), split it into words and feed it into hunspell, because as far as i see hunspell uses utf-8 internally and needs utf8 input (bug how does this work on linux?).
This would be painful slow and the splitting has to be done by hand, because as seen above the wxIspunct uses the system function and on windows this does not support utf-8...

A other solution would be to try to modify getCharAt() to check if it is a multi byte character and try to get the real character, but then the whole thing from above has to be redone...

If i try to fix this, i will only try to fix UTF-8 files. I do not see the need to support other encoding. (The notepad++ spell checker supports other encoding, and implements iconv on windows, and i do not feel the need to do this for codeblocks on windows). One thing would be if we can use wxWidgets to do the conversation...
One thing that bothers me is that according Khram there was some point where the spell checker worked. If i look at the code i can not believe this...

oBFusCATed · « **Reply #2 on:** March 30, 2019, 11:03:33 pm »

If I remember correctly the nightly after a change to SpellCheckHelper::IsWhiteSpace broke the use case of Khram, but I think he is not using an utf8 encoding.

How does notebpad++ feed hunspell? Does it use the equivalent of getchatat?

raynebc · « **Reply #3 on:** March 31, 2019, 08:32:47 am »

He claimed the problem occurs in various encodings as well as UTF-8:
http://forums.codeblocks.org/index.php/topic,23102.msg157343.html#msg157343

BlueHazzard · « **Reply #4 on:** March 31, 2019, 03:10:09 pm »

Ok, we know about this issue since some time:
http://forums.codeblocks.org/index.php/topic,20195.15.html

I do not follow the conclusion white tiger gave about using wxIspunct, because it will not work on windows with utf-8 (or any other encoding)

As far as i can tell hunspell needs utf-8 strings (at least notepad++ gives utf-8 strings to hunspell)
Notepad++ uses

Code

const wchar_t *default_delimiters() {
  return L",.!?\":;{}()[]\\/"
         L"=+-^$*<>|#$@%&~"
         L"\u2026\u2116\u2014\u00AB\u00BB\u2013\u2022\u00A9\u203A\u201C\u201D"
         L"\u00B7"
         L"\u00A0\u0060\u2192\u00d7";
}

for delimiters as far as i can tell
Notepad++ gets a full range of text from the scintilla (4096 charachters) and stores and encodes them as needed. So it does not use the getCharAt function

For me this is some kind of nightmare:
1) wxWidgets uses 16 bit char on windows for its code points . This does not ensure that all code points can be represented with one wxChar
2) For this wxIspunct can not be used, because it needs a wxChar that will not fit all code points (it should take int (at least 32bit) or wchar_t*)
3) The spellchecker code uses IsWhiteSpace( stc->GetCharAt() ) in a lot places... so a lot rework...

The easiest fix would probably be to hardcode the escape sequences like notepad++ and not use wxIspunct. In theory this should then extract all bytes needed for UTF8 automatically. And ditch all other encoding...

New Pagodi · « **Reply #5 on:** March 31, 2019, 07:13:17 pm »

I'm not sure how codeblocks works, but I know with wxSTC the buffer internally contains UTF8 data. wxSTC offers a number of "Raw" methods that skip the conversion to/from wxString and access the UTF8 data directly. If they're not available in codeblocks, you can use Scintilla messages instead.

For example, to get the 4th through 8th characters, you could do something like this:

Code

editor->SendMsg(SCI_SETTARGETSTART,4);
editor->SendMsg(SCI_SETTARGETEND,8);
int len = editor->SendMsg(SCI_GETTARGETTEXT);
// allocate i bytes in a buffer c here
editor->SendMsg(SCI_GETTARGETTEXT,0,reinterpret_cast<wxIntPtr>(c));
// do something with c and then free the buffer

oBFusCATed · « **Reply #6 on:** March 31, 2019, 07:16:56 pm »

Quote from: BlueHazzard on March 31, 2019, 03:10:09 pm

The easiest fix would probably be to hardcode the escape sequences like notepad++ and not use wxIspunct. In theory this should then extract all bytes needed for UTF8 automatically. And ditch all other encoding...

Have you blamed the file where wxispuntc is used? What happens if you revert this commit?

sodev · « **Reply #7 on:** March 31, 2019, 07:47:24 pm »

Quote from: BlueHazzard on March 31, 2019, 03:10:09 pm

1) wxWidgets uses 16 bit char on windows for its code points . This does not ensure that all code points can be represented with one wxChar
2) For this wxIspunct can not be used, because it needs a wxChar that will not fit all code points (it should take int (at least 32bit) or wchar_t*)

I doubt that these surrogate pairs cause the issues, they are only used for languages outside the BMP, unless you use emojis or whatever you won't find these in source files.

However the assumption one codepoint == one character might cause problems, e. g. the letter ä exists as a single codepoint, but you can also create it with two codepoints, the letter a and this 2 dots "decorator" i currently don't know how to type.

Oh, and on Windows wchar_t IS 16 bit because Windows does use UTF-16. For the record, UTF-16 is NOT fixed length, only UTF-32 is.

raynebc · « **Reply #8 on:** April 01, 2019, 07:04:36 pm »

https://www.compart.com/en/unicode/U+00A8
https://www.compart.com/en/unicode/U+0308

Typically known as "diaeresis" or "umlaut".

BlueHazzard · « **Reply #9 on:** April 01, 2019, 10:14:26 pm »

Quote

I doubt that these surrogate pairs cause the issues, they are only used for languages outside the BMP, unless you use emojis or whatever you won't find these in source files.

They do not cause the issue, but they are part of the problem. And the assumption 16bit are enough lead to the mess with windows we have now, because they used UTF16...

Quote

Oh, and on Windows wchar_t IS 16 bit because Windows does use UTF-16. For the record, UTF-16 is NOT fixed length, only UTF-32 is.

And that is why utf16 is ******

Quote

Have you blamed the file where wxispuntc is used? What happens if you revert this commit?

Code

bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
{
#ifdef __WXMSW__
     wxString str( _T(" \t\r\n.,'`?!@#$%^&*()-=_+[]{}\\|;:\"<>/~0123456789") );
    return str.Find(ch) != wxNOT_FOUND; //signed-unsigned comparison; switched from "find()" to "Find()"
#else
    // Support words like doesn't: ch!='\''
    return wxIsspace(ch) || (wxIspunct(ch) && ch!='\'') || wxIsdigit(ch);
#endif // __WXMSW__
}

This code reverts the changes and works fine on windows (at least with no crazy utf things. Tested with the russian example and some other languages), and does not touch the fine working linux code.
I really do not feel to make this more complicated. I have tried to extract the UTF-8 code from the scintilla control and merge them to int, but i think it is complicated, a mess and does not work properly, because the spell checker iterates in all directions trough the control. (see my question at the bottom)

Quote

I'm not sure how codeblocks works, but I know with wxSTC the buffer internally contains UTF8 data. wxSTC offers a number of "Raw" methods that skip the conversion to/from wxString and access the UTF8 data directly. If they're not available in codeblocks, you can use Scintilla messages instead.

For example, to get the 4th through 8th characters, you could do something like this:

Thank you for your information! Can you tell me if the getCharAt(x) always hits the beginning of a character (0XXX XXXX or 110X XXX) or is it possible to hit in the middle of a utf character (that starts with 10XX XXXX)? This would simplify things a lot...

thank you all for the comments....

oBFusCATed · « **Reply #10 on:** April 02, 2019, 12:56:08 am »

See this https://www.scintilla.org/ScintillaDoc.html#SCI_POSITIONBEFORE and probably read more about text retrieval in the docs of scintilla.

BlueHazzard · « **Reply #11 on:** April 14, 2019, 03:09:07 pm »

To bring this up again. Does something speaks against the solution to introduce the old code with an #ifdef on windows?
An ifdef is probably always needed to solve this issue...

oBFusCATed · « **Reply #12 on:** April 14, 2019, 03:32:38 pm »

I have no opinion.

I'm not even sure what is the problem solved by the change.
I don't really have time nor desire to dig into it.
It is up to you to decide what to do

BlueHazzard · « **Reply #13 on:** May 12, 2019, 12:34:07 am »

Ok, lets summarize all this and look if i have understood all:
1) scintilla uses utf-8 internally
2) wxScintilla::GetCharAt returns a single byte from position pos (so it can return them middle of a code point) but always 1 byte. There is the possibility to find the beginning of the codepoint and extract the full codepoint.
3) All code uses wxChar (aka wchar_t ) to represent one character. This is 2 Bytes on windows and 4 Bytes on UNIX. So we have to implement platform depended code....
4) The plugin basically goes trough every character byte by byte. This will work as long as you use utf-8 or any other "single byte" encoding for your document and the dictionary
5) Hunspell uses UTF-8 but should be able to handle every encoding. In our case we use UTF8 and we should stick to it...
5) Now we hit the wxIspunct (aka iswpunct) function:
5.1) On linux this function kind of works with utf8 because it treads all characters above 0x80 as non punctuation characters (how it should ). But it will not work for Unicode character that are punctuations and non ascii, but this kind of characters are probably rare in the programming space
5.2) On windows this does not work because some characters above 0x80 (at least with the english localization) are treated as punctuation (for example: 0xBB ( ╗ ) ) and if you use a utf8 encoded document you will hit this characters quite fast...

So there are two main problems:
1) We do not handle unicode correctly: wxScintilla::GetCharAt should return the full code point. But here we diverge between unix and windows (wchar_t has different length) As noted top we could probably come away by only using 2 bytes and use utf16 on windows. Why is this needed?
2) wxIspunct (aka iswpunct) needs the full code point to work correctly. On windows this has to be UTF 16 and on linux probably UTF32 or UTF8 idk...

I really do not feel to rewrite all this in unicode aware code... Specially because wxWidgets does not take the load from us, because we still have to make ... (100% not wx2.8...)
How can we fix this for the next release (or even nightly)?
1) On Linux we do not need to do anything
2) On windows:
2.1) feed only valuse < 0x80 to wxIspunct or
2.2) Use the code described top that worked until now....

If no big objections i would like do implement 2.2.

[edit:]
There are many punctuation characters outside ascii

: http://www.open-std.org/JTC1/SC35/WG5/docs/30112d10.pdf

[edit2:]
i tried to convert utf8 to utf16 with the functions provided with hunspell. they work pretty well, but i still get errors for the russian characters:

Code

is alpha:  ( 0x447 ) is alpha: true
( 0x0438 ) is alpha: false
( 0x0441 ) is alpha: true
( 0x043b ) is alpha: false
( 0x043e ) is alpha: false
( 0x0432 ) is alpha: false
( 0x044b ) is alpha: true
( 0x0445 ) is alpha: true

They should all be true... I think this is because i have a english locale set and using isalphaw() with locale does not work for me, because i do not know what locale to set...
After this experiment i even stronger think 2.2 is good enough...

oBFusCATed · « **Reply #14 on:** May 12, 2019, 10:56:56 am »

1. Concentrate on wx 3.1.x, no need to bother deeply with wx2.8
2. I don't think you should bother with utf16 or utf32
3. Have you considered switching to working on lines and not characters? There is this call:

Code

wxString GetCurLine(int* linePos=NULL);

It gives you the whole line to work with.

4. Also there are these:

Code

    // Compact the document buffer and return a read-only pointer to the
    // characters in the document.
    const char* GetCharacterPointer() const;

    // Return a read-only pointer to a range of characters in the document.
    // May move the gap so that the range is contiguous, but will only move up
    // to rangeLength bytes.
    const char* GetRangePointer(int position, int rangeLength) const;

Aren't they useful to give you the whole utf8 character's byte sequence?

Code::Blocks Forums

News:

Author Topic: Wrong spell checker on russian (and probably other languages) (Read 101167 times)

BlueHazzard

Wrong spell checker on russian (and probably other languages)

BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)

oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)

raynebc

Re: Wrong spell checker on russian (and probably other languages)

BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)

New Pagodi

Re: Wrong spell checker on russian (and probably other languages)

oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)

sodev

Re: Wrong spell checker on russian (and probably other languages)

raynebc

Re: Wrong spell checker on russian (and probably other languages)

BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)

oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)

BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)

oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)

BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)

oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)