Wrong spell checker on russian (and probably other languages)

oBFusCATed:
See https://www.scintilla.org/ScintillaDoc.html#SCI_POSITIONBEFORE and probably read more about text retrieval in the Scintilla docs.

BlueHazzard:
To bring this up again: does anything speak against reintroducing the old code behind an #ifdef on Windows?
An #ifdef will probably always be needed to solve this issue...

oBFusCATed:
I have no opinion. :)
I'm not even sure what problem the change solves.
I have neither the time nor the desire to dig into it.
It is up to you to decide what to do  8)

BlueHazzard:
OK, let's summarize all this and see if I have understood everything:
1) Scintilla uses UTF-8 internally.
2) wxScintilla::GetCharAt returns a single byte from position pos (so it can return the middle of a code point), but always exactly 1 byte. It is possible to find the beginning of the code point and extract the full code point.
3) All code uses wxChar (aka wchar_t) to represent one character. This is 2 bytes on Windows and 4 bytes on Unix, so we would have to implement platform-dependent code...
4) The plugin basically goes through the text byte by byte. This works as long as you use UTF-8 or any other "single byte" encoding for your document and the dictionary.
5) Hunspell uses UTF-8 but should be able to handle every encoding. In our case we use UTF-8 and we should stick to it...
6) Now we hit the wxIspunct (aka iswpunct) function:
6.1) On Linux this function kind of works with UTF-8 because it treats all characters above 0x80 as non-punctuation characters (as it should). But it will not work for Unicode characters that are punctuation and non-ASCII; such characters are probably rare in the programming space.
6.2) On Windows this does not work because some characters above 0x80 (at least with the English localization) are treated as punctuation (for example: 0xBB ( ╗ ) ), and if you use a UTF-8 encoded document you will hit these characters quite fast...
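
To illustrate the point about finding the beginning of a code point and extracting it in full, here is a minimal sketch in plain C++ that works on a std::string holding UTF-8 bytes. It does not use the wxScintilla API; the helper names CodePointStart and DecodeCodePoint are hypothetical, and the decoder is deliberately simple (it does not reject overlong encodings or surrogates).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes (bit pattern 10xxxxxx).
inline bool IsContinuationByte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Hypothetical helper: step back from any byte offset to the lead byte
// of the code point that contains it.
std::size_t CodePointStart(const std::string& utf8, std::size_t pos)
{
    assert(pos < utf8.size());
    while (pos > 0 && IsContinuationByte(static_cast<unsigned char>(utf8[pos])))
        --pos;
    return pos;
}

// Hypothetical helper: decode the code point whose lead byte is at 'start'.
// Returns U+FFFD and a length of 1 for malformed or truncated input.
char32_t DecodeCodePoint(const std::string& utf8, std::size_t start,
                         std::size_t* length = nullptr)
{
    assert(start < utf8.size());
    const unsigned char lead = static_cast<unsigned char>(utf8[start]);
    std::size_t len = 0;
    char32_t cp = 0;
    if (lead < 0x80)                { len = 1; cp = lead; }
    else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
    else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
    else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }

    bool ok = (len != 0) && (start + len <= utf8.size());
    for (std::size_t i = 1; ok && i < len; ++i)
    {
        const unsigned char b = static_cast<unsigned char>(utf8[start + i]);
        if (IsContinuationByte(b))
            cp = (cp << 6) | (b & 0x3F);   // append the 6 payload bits
        else
            ok = false;
    }
    if (!ok) { len = 1; cp = 0xFFFD; }
    if (length)
        *length = len;
    return cp;
}
```

With something like this, a byte offset that lands in the middle of a Cyrillic letter can be mapped back to its lead byte first, and the resulting code point (rather than a single byte) is what a character classification function would need to see.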

So there are two main problems:
1) We do not handle Unicode correctly: wxScintilla::GetCharAt should return the full code point. But here Unix and Windows diverge (wchar_t has a different size). As noted above, we could probably get away with using only 2 bytes and UTF-16 on Windows. Why is this needed?
2) wxIspunct (aka iswpunct) needs the full code point to work correctly. On Windows this has to be UTF-16, and on Linux probably UTF-32 or UTF-8, I don't know...

I really do not feel like rewriting all of this as Unicode-aware code... Especially because wxWidgets does not take the load from us, because we still have to make ... (100% not wx2.8...)
How can we fix this for the next release (or even the nightly)?
1) On Linux we do not need to do anything
2) On Windows:
2.1) feed only values < 0x80 to wxIspunct, or
2.2) use the code described above that worked until now...

If there are no big objections, I would like to implement 2.2.
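
For comparison, option 2.1 could be sketched roughly as below. This is only an illustration using the standard std::iswpunct rather than wxIspunct, and the wrapper name IsAsciiPunct is hypothetical; it assumes the plugin keeps passing single UTF-8 bytes, so every value of 0x80 or above belongs to a multi-byte sequence and is simply never treated as punctuation.

```cpp
#include <cwctype>

// Hypothetical wrapper: only ASCII values are forwarded to the punctuation
// test; everything else is reported as "not punctuation".
bool IsAsciiPunct(wchar_t ch)
{
    // The cast makes negative values (wchar_t is signed on some platforms)
    // compare as large numbers, so they are rejected as well.
    if (static_cast<unsigned long>(ch) >= 0x80UL)
        return false;
    return std::iswpunct(static_cast<std::wint_t>(ch)) != 0;
}
```

This would behave the same on Windows and Linux for UTF-8 documents, at the cost of ignoring the non-ASCII punctuation characters altogether.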

[edit:]
There are many punctuation characters outside ascii :( : http://www.open-std.org/JTC1/SC35/WG5/docs/30112d10.pdf

[edit2:]
I tried to convert UTF-8 to UTF-16 with the functions provided with Hunspell. They work pretty well, but I still get errors for the Russian characters:

--- Code: ---is alpha:  ( 0x447 ) is alpha: true
( 0x0438 ) is alpha: false
( 0x0441 ) is alpha: true
( 0x043b ) is alpha: false
( 0x043e ) is alpha: false
( 0x0432 ) is alpha: false
( 0x044b ) is alpha: true
( 0x0445 ) is alpha: true
--- End code ---
They should all be true... I think this is because I have an English locale set, and using isalphaw() with a locale does not work for me because I do not know which locale to set...
After this experiment I think even more strongly that 2.2 is good enough...

oBFusCATed:
1. Concentrate on wx 3.1.x, no need to bother deeply with wx2.8
2. I don't think you should bother with utf16 or utf32
3. Have you considered switching to working on lines and not characters? There is this call:

--- Code: ---wxString GetCurLine(int* linePos=NULL);
--- End code ---
It gives you the whole line to work with.

4. Also there are these:

--- Code: ---    // Compact the document buffer and return a read-only pointer to the
    // characters in the document.
    const char* GetCharacterPointer() const;

    // Return a read-only pointer to a range of characters in the document.
    // May move the gap so that the range is contiguous, but will only move up
    // to rangeLength bytes.
    const char* GetRangePointer(int position, int rangeLength) const;

--- End code ---
Wouldn't they be useful for getting the whole byte sequence of a UTF-8 character?
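
As a rough sketch of that idea (working on a whole buffer instead of single bytes), the following plain C++ splits a range of UTF-8 bytes into one byte sequence per character. It assumes the buffer comes from something like the pointer calls above but does not call them; the names Utf8SequenceLength and SplitUtf8 are hypothetical, and invalid lead bytes are simply treated as one-byte characters.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence introduced by 'lead'.
// Returns 1 for ASCII and for invalid lead bytes, so callers always advance.
std::size_t Utf8SequenceLength(unsigned char lead)
{
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Split 'size' bytes starting at 'data' into per-character byte sequences.
std::vector<std::string> SplitUtf8(const char* data, std::size_t size)
{
    std::vector<std::string> chars;
    std::size_t i = 0;
    while (i < size)
    {
        std::size_t len = Utf8SequenceLength(static_cast<unsigned char>(data[i]));
        if (i + len > size)
            len = size - i;            // truncated sequence at the end of the range
        chars.emplace_back(data + i, len);
        i += len;
    }
    return chars;
}
```

Each element then holds a complete character that can be classified or handed on as UTF-8 without platform-dependent wchar_t handling.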
