Author Topic: Wrong spell checker on russian (and probably other languages)  (Read 34156 times)

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3352
Re: Wrong spell checker on russian (and probably other languages)
« Reply #15 on: May 12, 2019, 12:44:10 pm »
The problem is that UTF-8 does not work on Windows, and we use UTF-8 everywhere (which is the only right thing to do). wxIspunct uses the system function std::iswpunct, and this function is not well specified by the standard... So the word splitting won't work on Windows unless we convert to UTF-16 whenever we want to use some Unicode-aware functionality. As far as I can tell, the locale is also crucial for wxIspunct (aka std::iswpunct), because, as I noted above, with my locale Russian characters are not detected correctly...

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13406
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #16 on: May 12, 2019, 12:57:33 pm »
Then why don't you just write a cbIspuncUtf8 and be done with it?
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]
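
A minimal sketch of what such a locale-independent helper could look like, assuming the text is available as a UTF-8 std::string. The names DecodeUtf8 and cbIsPunctUtf8 are hypothetical and not part of Code::Blocks or wxWidgets, and the punctuation table is deliberately incomplete (ASCII plus part of the General Punctuation block); overlong encodings and surrogates are not rejected.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Decode one UTF-8 sequence starting at s[i] and advance i past it.
// Precondition: i < s.size(). Returns U+FFFD for malformed input.
// Hypothetical helper for illustration only.
std::uint32_t DecodeUtf8(const std::string& s, std::size_t& i)
{
    const unsigned char b0 = static_cast<unsigned char>(s[i++]);
    if (b0 < 0x80)
        return b0; // plain ASCII

    int extra = 0;
    std::uint32_t cp = 0;
    if ((b0 & 0xE0) == 0xC0)      { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else
        return 0xFFFD; // stray continuation byte or invalid lead byte

    for (int k = 0; k < extra; ++k)
    {
        if (i >= s.size())
            return 0xFFFD; // sequence is truncated
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80)
            return 0xFFFD; // not a continuation byte; leave it unconsumed
        cp = (cp << 6) | (b & 0x3F);
        ++i;
    }
    return cp;
}

// Classify a code point without consulting any locale.
// Covers ASCII punctuation and U+2010..U+2027 (dashes, quotes, bullets,
// ellipsis); a real implementation would need a fuller table.
bool IsPunctCodePoint(std::uint32_t cp)
{
    if (cp < 0x80)
        return (cp >= 0x21 && cp <= 0x2F) || (cp >= 0x3A && cp <= 0x40) ||
               (cp >= 0x5B && cp <= 0x60) || (cp >= 0x7B && cp <= 0x7E);
    return cp >= 0x2010 && cp <= 0x2027;
}

// Decode the character at s[i], advance i, and report whether it is
// punctuation according to the small table above.
bool cbIsPunctUtf8(const std::string& s, std::size_t& i)
{
    return IsPunctCodePoint(DecodeUtf8(s, i));
}
```

Because nothing here calls into the C library's classification functions, the result is the same on Windows and Linux and does not depend on which locales are installed.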

Offline BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)
« Reply #17 on: October 18, 2019, 11:34:06 pm »
OK, here is a patch that should work on Windows for all code points that fit in a single UTF-16 code unit.
On Linux it works, but I do not know if there is a better way to make the iswspace() function work. On my test system (English Linux Mint, default locale tmp="C") it does not work without switching the locale. Right now I have to set and reset the locale...

Example to test:
Code
// Hänsel  und Gretel <- German dic
//числовых числовых <- Russian dic
In both examples the spell checker should not underline the two words separated by the dash. The dash is a Unicode character, used to test the isSpace function. (There is a dash; in my Firefox it is barely visible.)
« Last Edit: October 18, 2019, 11:35:43 pm by BlueHazzard »

Offline BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)
« Reply #18 on: October 19, 2019, 11:28:24 am »
I probably should use
Code
wxIsspace_l(wxChar, wxXLocale)
and so on, on Linux...
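
The same idea, classifying against an explicit locale object instead of the global one, can be sketched in standard C++ without wxWidgets. This is an illustration under assumptions, not the patch: MakeCtypeLocale and IsSpaceIn are hypothetical names, and "en_US.UTF-8" in the usage note is only an example locale name that may not be installed.

```cpp
#include <locale>
#include <stdexcept>

// Build a locale object once; fall back to the classic "C" locale when the
// requested one is not installed on this machine.
std::locale MakeCtypeLocale(const char* name)
{
    try
    {
        return std::locale(name);
    }
    catch (const std::runtime_error&)
    {
        return std::locale::classic();
    }
}

// Classify a wide character against an explicit locale object.
// The global locale set by setlocale() is neither read nor changed.
bool IsSpaceIn(wchar_t ch, const std::locale& loc)
{
    return std::isspace(ch, loc);
}
```

A caller would create the locale once, for example `const std::locale loc = MakeCtypeLocale("en_US.UTF-8");`, and pass it to every `IsSpaceIn` call, which avoids setting and resetting the locale around each check.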

Offline oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)
« Reply #19 on: October 19, 2019, 11:39:58 am »
I don't know what the exact problem is, but patches with calls to setlocale(LC_ALL, "en_US.utf8"); are really unacceptable.
You have no guarantee that the user has the files for this locale. It is highly unlikely that they would be missing, but still.
Also, setlocale modifies the locale of the whole process, and it is slow...
Putting bad words in the comments is also unacceptable.

Why is UTF8toUTF32 returning int32_t and not uint32_t? Why is UTF32toUTF16 using plain types and not sized types?
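
To illustrate the point about sized, unsigned types, here is a hedged sketch of a UTF-32 to UTF-16 conversion. It is not the code from the patch; Utf32ToUtf16 below is a hypothetical stand-alone function. Using std::uint32_t keeps the range checks and shifts free of sign issues, and code points above U+FFFF come out as a surrogate pair rather than being limited to a single code unit.

```cpp
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-16 code units.
// Returns an empty string for values that are not valid scalar values.
std::u16string Utf32ToUtf16(std::uint32_t cp)
{
    std::u16string out;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return out; // surrogate range cannot be encoded
    if (cp > 0x10FFFF)
        return out; // beyond the Unicode range

    if (cp < 0x10000)
    {
        // Basic Multilingual Plane: one code unit.
        out.push_back(static_cast<char16_t>(cp));
    }
    else
    {
        // Supplementary planes: split 20 bits into a surrogate pair.
        const std::uint32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}
```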

Have you considered using these things: https://www.scintilla.org/ScintillaDoc.html#SCI_WORDENDPOSITION ?

Offline BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)
« Reply #20 on: October 19, 2019, 01:13:13 pm »
Quote
You have no guarantees that the user has the files for this locale. It is highly unlikely that this would happen but, still.
It would be great if someone who knows the UTF stuff on Linux would help... I cannot find any information on how to get iswspace working on Linux. If the default locale is "C" it does not work. "C.utf8" is not present on my fresh default Linux install, so I avoid it. I think en_US.utf8 is present on all systems, and if not, hey, at the current stage the spell checker does not work anyway, so it is better than nothing, and if someone has a problem we can try to fix it. This will work 99% of the time and I think we should stick to this... Again, if someone knows how to handle this I am open to it, but after 2 days of googling I cannot find the right approach...

Quote
Also, setlocale modifies the locale of the whole process, and it is slow...
Yes, I agree here; see my previous post about using iswspace_l

Quote
Putting bad words in the comments is also unacceptable.
Why is UTF8toUTF32 returning int32_t and not uint32_t? Why is UTF32toUTF16 using plain types and not sized types?
I was frustrated... Thank you for pointing out the types, I missed them; it was too late...

Quote
Have you considered using these things: https://www.scintilla.org/ScintillaDoc.html#SCI_WORDENDPOSITION ?
This would probably mean rewriting the whole plugin. No time and motivation for that... Also, I cannot tell from the documentation whether this is fully Unicode aware... You probably have to set the word characters yourself (https://www.scintilla.org/ScintillaDoc.html#SCI_SETWORDCHARS)

Offline oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)
« Reply #21 on: October 19, 2019, 01:20:32 pm »
Piling up workarounds doesn't improve things and sooner or later leads to a re-write.
So why don't you do the rewrite at this very moment and save yourself some time and extra work? :)

The set of word characters is probably correct, because ctrl-left, ctrl-right seem to work correctly or at least they are adequate.
And I think these use the current locale set on the editor or it might be using UTF8 internally. I'm not sure.
But anyway scintilla should handle encodings well, because this is one of its main tasks.

P.S. Also, have you measured performance? All these calls you're making aren't cheap at all!

Offline osdt

  • Multiple posting newcomer
  • *
  • Posts: 63
Re: Wrong spell checker on russian (and probably other languages)
« Reply #22 on: October 19, 2019, 09:59:37 pm »
Quote
.. patches with calls to setlocale(LC_ALL, "en_US.utf8"); are really unacceptable.
It would be great if someone who knows the UTF stuff on linux would help... I can not find any information how to get iswspace working on linux. If the default locale is "C" it does not work ...

A single call to ::setlocale(LC_ALL, "") (or at least LC_CTYPE) early at CB startup is the way to go. Many character conversion functions need it; even wxString::mb_str(...) and friends need it to be set correctly.

Quote from: http://man7.org/linux/man-pages/man3/iswspace.3.html
       The behavior of iswspace() depends on the LC_CTYPE category of the current locale.

Offline BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)
« Reply #23 on: October 22, 2019, 09:28:11 am »
After the initial anger and laziness I tried the Scintilla functions and they seem to work quite well... It would be cool if someone with an "exotic" (non-ASCII) dictionary could test this more. I tested it with the Russian dictionary, but I really can only copy and paste words from Google and do not know if they are correct...

Here is the patch. About performance... I have not really tested it, but this should be considerably faster than the old approach (if Scintilla is decently fast at word finding, and I think it is)... Also, we are not a word processor but a programming IDE...

Offline oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)
« Reply #24 on: October 22, 2019, 07:26:56 pm »
Also, we are not a word processor, but a programming ide....
What do you mean by this? I regularly have to edit 10-20kloc files. They feel rather sluggish.
If people want to use a slow IDE there are plenty of options (like vscode).

Offline BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)
« Reply #25 on: October 23, 2019, 09:51:05 am »
Quote
What do you mean by this?
That we do not have to spell check 10000 words, because we spell check only comments and strings (not the majority of the file). Anyway, this patch should speed things up to quite some extent (if the Scintilla message queue is the part that slows things down).

Quote
They feel rather sluggish.
The scrolling or the loading? The scrolling should not be influenced by the spell checker on the Code::Blocks side (on the Scintilla side, i.e. the coloring, I do not know), because we parse only on loading and then only the modified part of the file.

If I find time I will try to make some measurements. Besides this, any other comments on the code? Can I commit it?

Offline oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)
« Reply #26 on: October 23, 2019, 08:01:30 pm »
That we do not have to spell check 10000 words, because we spell check only comments and strings (not the majority of the file).
Your metric is off. A simple guess: 10 words a line, a 10 kloc file, 10% comments, and you have 10k words.

The scrolling or the loading?
Loading is really bad. I've not profiled it to see what is going on. There is a known bug that loading files with many functions is rather slow, because the creation of the CC combobox in the toolbar is rather slow. It is probably this one, but only profiling will tell.

My comment is a general comment about performance and C::B.

Offline oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)
« Reply #27 on: October 23, 2019, 08:02:28 pm »
Beside this, any other comments on the code? Can i commit it?
I guess, I'll have to test it to have an opinion.

Offline oBFusCATed

Re: Wrong spell checker on russian (and probably other languages)
« Reply #28 on: October 23, 2019, 08:07:09 pm »
1. Why is this needed?
Code
if ( !stc->IsRangeWord(wordstart, wordend) )
Remove or add a comment!

2. Also why do you call WordEndPosition with a pos argument and not with a wordstart argument?
3. The first change in the patch looks strange. Add a comment explaining why it is needed. Why is there a start-- operation before it?

Offline BlueHazzard

Re: Wrong spell checker on russian (and probably other languages)
« Reply #29 on: October 25, 2019, 08:51:45 am »
I am always astonished at how your reviews improve the code... Thank you for that.
Here is a second, revised patch.

Quote
Also why do you call WordEndPosition with a pos argument and not with a wordstart argument?
The idea was that the function searching for the word end does not have to start from the beginning of the word (performance), but on second thought this can probably also backfire...
Using wordstart is probably safer...
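
The pairing discussed above can be sketched with simplified stand-ins that work on a std::u32string. These are not Scintilla's WordStartPosition and WordEndPosition; WordStart, WordEnd and IsWordChar are hypothetical, and the word-character rule is an assumption made only for this example.

```cpp
#include <cstddef>
#include <string>

// Stand-in for an editor's word-character test: ASCII letters, digits,
// underscore, and every non-ASCII code point count as word characters.
bool IsWordChar(char32_t c)
{
    if (c > 0x7F)
        return true;
    return (c >= U'0' && c <= U'9') || (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z') || c == U'_';
}

// Walk left from pos to the first character of the word touching pos.
std::size_t WordStart(const std::u32string& text, std::size_t pos)
{
    while (pos > 0 && IsWordChar(text[pos - 1]))
        --pos;
    return pos;
}

// Walk right from pos to one past the last word character.
std::size_t WordEnd(const std::u32string& text, std::size_t pos)
{
    while (pos < text.size() && IsWordChar(text[pos]))
        ++pos;
    return pos;
}
```

In this simplified model `WordEnd(text, pos)` and `WordEnd(text, WordStart(text, pos))` return the same index, so starting from pos only skips a few characters. Deriving the end from wordstart makes the range depend on one value only, which is the safer choice when the real editor functions may treat the two starting points differently.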