Wrong spell checker on russian (and probably other languages)

New Pagodi:
I'm not sure how codeblocks works, but I know with wxSTC the buffer internally contains UTF8 data.  wxSTC offers a number of "Raw" methods that skip the conversion to/from wxString and access the UTF8 data directly.  If they're not available in codeblocks, you can use Scintilla messages instead.

For example, to get the 4th through 8th characters, you could do something like this:


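--- Code: ---// target positions are byte offsets into the UTF-8 buffer
editor->SendMsg(SCI_SETTARGETSTART,4);
editor->SendMsg(SCI_SETTARGETEND,8);
// with a null buffer argument the call only reports the length of the target text
int len = editor->SendMsg(SCI_GETTARGETTEXT);
// allocate len+1 bytes in a buffer c here (one extra byte in case a terminating NUL is written)
editor->SendMsg(SCI_GETTARGETTEXT,0,reinterpret_cast<wxIntPtr>(c));
// do something with c and then free the buffer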

--- End code ---

oBFusCATed:

--- Quote from: BlueHazzard on March 31, 2019, 03:10:09 pm ---The easiest fix would probably be to hardcode the escape sequences like notepad++ and not use wxIspunct. In theory this should then extract all bytes needed for UTF8 automatically. And ditch all other encoding...

--- End quote ---
Have you blamed the file where wxIspunct is used? What happens if you revert this commit?

sodev:

--- Quote from: BlueHazzard on March 31, 2019, 03:10:09 pm ---1) wxWidgets uses a 16-bit char on Windows for its code points. This does not ensure that every code point can be represented with one wxChar
2) Because of this, wxIspunct cannot be used: it takes a wxChar, which cannot hold every code point (it should take an int (at least 32 bit) or a wchar_t*)

--- End quote ---

I doubt that these surrogate pairs cause the issue. They are only needed for characters outside the BMP, so unless you use emojis or the like, you won't find them in source files.

However, the assumption that one code point == one character might cause problems. E.g. the letter ä exists as a single code point, but you can also create it with two code points: the letter a and this two-dots "decorator" that I currently don't know how to type.

Oh, and on Windows wchar_t IS 16 bit because Windows does use UTF-16. For the record, UTF-16 is NOT fixed length, only UTF-32 is.
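
To make the three cases above concrete, here is a minimal sketch in standard C++. It uses char16_t instead of wchar_t so that it behaves the same on every platform, and the names CountCodePoints and DemoUtf16Lengths are made up for this illustration. It assumes well-formed UTF-16 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Count code points in a UTF-16 string by not counting the low (trailing)
// half of a surrogate pair. Assumes the input is well-formed UTF-16.
std::size_t CountCodePoints(const std::u16string& s)
{
    std::size_t count = 0;
    for (char16_t unit : s)
    {
        const bool isLowSurrogate = (unit >= 0xDC00 && unit <= 0xDFFF);
        if (!isLowSurrogate)
            ++count;
    }
    return count;
}

// Illustrates the cases discussed in this thread.
void DemoUtf16Lengths()
{
    const std::u16string precomposed = u"\u00E4";     // "ä" as a single code point
    const std::u16string decomposed  = u"a\u0308";    // 'a' + COMBINING DIAERESIS
    const std::u16string emoji       = u"\U0001F600"; // outside the BMP: surrogate pair

    // One code unit, one code point, one visible character.
    assert(precomposed.size() == 1 && CountCodePoints(precomposed) == 1);
    // Two code units, two code points, still one visible character.
    assert(decomposed.size() == 2 && CountCodePoints(decomposed) == 2);
    // Two code units, one code point.
    assert(emoji.size() == 2 && CountCodePoints(emoji) == 1);
}
```

So a per-wxChar test such as wxIspunct sees the decomposed ä as two separate units even though no surrogate pair is involved.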

raynebc:
https://www.compart.com/en/unicode/U+00A8
https://www.compart.com/en/unicode/U+0308

Typically known as "diaeresis" or "umlaut".

BlueHazzard:

--- Quote ---I doubt that these surrogate pairs cause the issues, they are only used for languages outside the BMP, unless you use emojis or whatever you won't find these in source files.
--- End quote ---
They do not cause the issue, but they are part of the problem. And the assumption that 16 bits are enough led to the mess we have now on Windows, because they used UTF-16...


--- Quote ---Oh, and on Windows wchar_t IS 16 bit because Windows does use UTF-16. For the record, UTF-16 is NOT fixed length, only UTF-32 is.
--- End quote ---
And that is why utf16 is ******


--- Quote ---Have you blamed the file where wxIspunct is used? What happens if you revert this commit?
--- End quote ---

--- Code: ---bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
{
#ifdef __WXMSW__
    wxString str( _T(" \t\r\n.,'`?!@#$%^&*()-=_+[]{}\\|;:\"<>/~0123456789") );
    return str.Find(ch) != wxNOT_FOUND; //signed-unsigned comparison; switched from "find()" to "Find()"
#else
    // Support words like doesn't: ch!='\''
    return wxIsspace(ch) || (wxIspunct(ch) && ch!='\'') || wxIsdigit(ch);
#endif // __WXMSW__
}
--- End code ---
This code reverts the changes and works fine on Windows (at least without crazy UTF things; tested with the Russian example and some other languages), and it does not touch the Linux code, which works fine.
I really do not feel like making this more complicated. I have tried to extract the UTF-8 bytes from the Scintilla control and merge them into an int, but I think it is complicated and messy and does not work properly, because the spell checker iterates in all directions through the control (see my question at the bottom).
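
For reference, merging UTF-8 bytes into a 32-bit value could look roughly like the sketch below. It is plain C++ working on a std::string rather than on the Scintilla control, and the function name DecodeUtf8 is made up for this illustration. It only works when pos already points at the first byte of a character, which is exactly the hard part when iterating backwards. It does not reject overlong encodings or encoded surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Decode the UTF-8 sequence starting at s[pos] into a code point and advance
// pos past it. On malformed or truncated input, return U+FFFD and advance by
// one byte. Simplified: overlong forms and surrogates are not rejected.
std::uint32_t DecodeUtf8(const std::string& s, std::size_t& pos)
{
    assert(pos < s.size());
    const unsigned char lead = static_cast<unsigned char>(s[pos]);

    std::size_t extra = 0; // number of continuation bytes expected
    std::uint32_t cp = 0;
    if (lead < 0x80)                { cp = lead;        extra = 0; } // 0xxxxxxx
    else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; } // 110xxxxx
    else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; } // 1110xxxx
    else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; } // 11110xxx
    else { ++pos; return 0xFFFD; }  // stray continuation byte or invalid lead

    if (pos + extra >= s.size()) { ++pos; return 0xFFFD; } // truncated sequence

    for (std::size_t i = 1; i <= extra; ++i)
    {
        const unsigned char b = static_cast<unsigned char>(s[pos + i]);
        if ((b & 0xC0) != 0x80) { ++pos; return 0xFFFD; } // not 10xxxxxx
        cp = (cp << 6) | (b & 0x3F);
    }
    pos += extra + 1;
    return cp;
}
```

Even with this, the result is a 32-bit code point, so it still cannot be passed to a function that takes a 16-bit wxChar.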


--- Quote ---I'm not sure how codeblocks works, but I know with wxSTC the buffer internally contains UTF8 data.  wxSTC offers a number of "Raw" methods that skip the conversion to/from wxString and access the UTF8 data directly.  If they're not available in codeblocks, you can use Scintilla messages instead.

For example, to get the 4th through 8th characters, you could do something like this:
--- End quote ---
Thank you for your information! Can you tell me if getCharAt(x) always hits the beginning of a character (a byte like 0XXX XXXX or 110X XXXX), or is it possible to hit the middle of a UTF-8 character (a continuation byte, 10XX XXXX)? This would simplify things a lot...
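
If positions turn out to be plain byte offsets into the UTF-8 buffer (my assumption here, not something confirmed in this thread), then landing on a continuation byte is possible and the caller has to step back to the start of the character. A minimal sketch on a std::string, with the made-up helper names IsContinuationByte and AlignToCharStart:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// A UTF-8 continuation byte always has the bit pattern 10xxxxxx.
bool IsContinuationByte(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Move pos backwards until it no longer points at a continuation byte,
// i.e. until it points at the first byte of a character (or at 0).
// pos == utf8.size() (one past the end) is returned unchanged.
std::size_t AlignToCharStart(const std::string& utf8, std::size_t pos)
{
    assert(pos <= utf8.size());
    while (pos > 0 && pos < utf8.size() &&
           IsContinuationByte(static_cast<unsigned char>(utf8[pos])))
    {
        --pos;
    }
    return pos;
}
```

Because continuation bytes are unambiguous, this works no matter in which direction the spell checker walks through the text. Scintilla also has the messages SCI_POSITIONBEFORE and SCI_POSITIONAFTER, which move by whole characters; I have not checked how they are exposed in codeblocks.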

Thank you all for the comments.
