Author Topic: Wrong spell checker on russian (and probably other languages)  (Read 25554 times)

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #30 on: October 26, 2019, 10:28:27 am »
1. m_pSpellHelper->HasStyleToBeChecked(lang, stc->GetStyleAt(pos)) in OnlineSpellChecker::DoSetIndications is wrong. It should be using wordstart
2. pos++ in the same loop is suspicious. Why don't you move with a whole word if HasStyleToBeChecked returns false?
3. do you need to WordEndPosition if WordStartPosition fails?
4. wordstart and wordend variables could have smaller scope, they aren't used outside of the loop
5. m_invalidatedRangesStart and m_invalidatedRangesEnd are (almost) already at word boundaries, why don't you take advantage of this, or even don't do the word boundaries adjustment for them
6. What happens if some of the new function calls return wrong result in SpellCheckerPlugin::OnThesaurus? Is no the selection case handled correctly there?
7. Same in SpellCheckerPlugin::OnCamelCase
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3353
Re: Wrong spell checker on russian (and probably other languages)
« Reply #31 on: October 28, 2019, 11:20:58 pm »
Quote
1. m_pSpellHelper->HasStyleToBeChecked(lang, stc->GetStyleAt(pos)) in OnlineSpellChecker::DoSetIndications is wrong. It should be using wordstart
fixed
Quote
2. pos++ in the same loop is suspicious. Why don't you move with a whole word if HasStyleToBeChecked returns false?
because this can end in an infinite loop.
if i use this code:
Code
pos = wordend + 1; 
I tested it and found this case: End of line
Code
*\n\r
1) Word start is 0 ('*') word end is  0 ('*')
1.1) pos = wordend (0) + 1 == 1
2) Word start is 1 ('\n') word end is 1 ('\n')
2.1) pos = wordend (1) + 1 == 2
3) Word start is 1 ('\n') word end is 1 ('\n') because '\r' is not used as word start
3.1) pos = wordend (1) + 1 == 2
4 ...... (endless loop)

pos++ guarantees that strange word combinations are skipped (for example faulty utf code points scintilla can not parse, purely speculative)....

Quote
3. do you need to WordEndPosition if WordStartPosition fails?
This really sounds like premature optimization (and what did Knuth tell us about that?) Anyway, i added a check for it

Quote
4. wordstart and wordend variables could have smaller scope, they aren't used outside of the loop
fixed

Quote
5. m_invalidatedRangesStart and m_invalidatedRangesEnd are (almost) already at word boundaries, why don't you take advantage of this, or even don't do the word boundaries adjustment for them
This two variables contain the whole invalidated/modified text. This range is also used to remove the marker. So the whole block that has to be rechecked is invalidated at one call. If we split this range into single words, we get rid of two messages calls and loose this advantages, because we call a lot more times IndicatorSetStyle (for every word, that can also be a single charachter)... No gain for speed here...

Quote
6. What happens if some of the new function calls return wrong result in SpellCheckerPlugin::OnThesaurus? Is no the selection case handled correctly there?
Yes and i added additional checks

Quote
7. Same in SpellCheckerPlugin::OnCamelCase
see above

I have committed the code now (revision: 11888)... I think we are at a good point and i would like to have it tested in the next nightly....
Thank you again!


Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #32 on: October 29, 2019, 12:21:40 am »
Quote
2. pos++ in the same loop is suspicious. Why don't you move with a whole word if HasStyleToBeChecked returns false?
because this can end in an infinite loop.
if i use this code:
Code
pos = wordend + 1; 
I tested it and found this case: End of line
Code
*\n\r
1) Word start is 0 ('*') word end is  0 ('*')
1.1) pos = wordend (0) + 1 == 1
2) Word start is 1 ('\n') word end is 1 ('\n')
2.1) pos = wordend (1) + 1 == 2
3) Word start is 1 ('\n') word end is 1 ('\n') because '\r' is not used as word start
3.1) pos = wordend (1) + 1 == 2
4 ...... (endless loop)
You can always do a pos++ if you detect you're not moving. Keep in mind that checks are inexpensive relatively to calls to Scintilla! My guess would 10:1 the difference in execution time. Scintilla calls are non inlined, there 1 or even 2 big switches and then there is more code for the actual operation.

Quote
3. do you need to WordEndPosition if WordStartPosition fails?
This really sounds like premature optimization (and what did Knuth tell us about that?) Anyway, i added a check for it
Are you sure you're quoting Knuth correctly? Do you actually know the context where he formulated it and where it was valid? This quote is probably from different time and it is not valid any more.

Computers are fast, really fast, software is just painfully inefficient...
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3353
Re: Wrong spell checker on russian (and probably other languages)
« Reply #33 on: November 07, 2019, 10:30:32 pm »
ok, your last comment made me wonder, and indeed;
Code
diff --git a/src/plugins/contrib/SpellChecker/OnlineSpellChecker.cpp b/src/plugins/contrib/SpellChecker/OnlineSpellChecker.cpp
index b08b91f51..2753582e6 100644
--- a/src/plugins/contrib/SpellChecker/OnlineSpellChecker.cpp
+++ b/src/plugins/contrib/SpellChecker/OnlineSpellChecker.cpp
@@ -198,6 +198,8 @@ void OnlineSpellChecker::DoSetIndications(cbEditor* ctrl) const
             // remove styling:
             stc->IndicatorClearRange(start, end - start);

+            int oldWordStart = 0;
+            int posToAdd = 0;
             for ( int pos = start ;  pos < end ; pos++)
             {
                 EditorColourSet* colour_set = Manager::Get()->GetEditorManager()->GetColourSet();
@@ -215,7 +217,17 @@ void OnlineSpellChecker::DoSetIndications(cbEditor* ctrl) const
                      m_pSpellHelper->HasStyleToBeChecked(lang, stc->GetStyleAt(wordstart)) )
                 {
                     DissectWordAndCheck(stc, wordstart, wordend);
-                    pos = wordend;
+                }
+                pos = wordend;
+                if ( oldWordStart == wordstart )
+                {
+                    posToAdd++;
+                    pos += posToAdd;
+                }
+                else
+                {
+                    posToAdd = 0;
+                    oldWordStart = wordstart;
                 }

             }
this patch makes loading the main.cpp file nearly 33% faster...
« Last Edit: November 07, 2019, 10:32:17 pm by BlueHazzard »

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #34 on: November 08, 2019, 08:41:51 am »
I don't really understand what it does...  :o
It would be good if you either add a comment explaining it or even better make it simpler and more clear.
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #35 on: November 08, 2019, 08:50:15 am »
Code
                EditorColourSet* colour_set = Manager::Get()->GetEditorManager()->GetColourSet();
                if (!colour_set)
                    break;
                wxString lang = colour_set->GetLanguageName(ctrl->GetLanguage() );
Why is this code executed in a loop?

HasStyleToBeChecked also looks expensive. I'd rather change it to use unordered_set<struct{wxstring lang, int style}> instead of a map containing a set...
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3353
Re: Wrong spell checker on russian (and probably other languages)
« Reply #36 on: November 08, 2019, 11:15:10 am »
Quote
I don't really understand what it does...  :o
It would be good if you either add a comment explaining it or even better make it simpler and more clear.
It does what you suggested :) looking if we are moving, and if not increase the position future and future with increasing steps, so we get not stuck in a position, where word end is before the next valid word beginning. It is only a draft. I will make a commented patch...

Quote
Why is this code executed in a loop?
Have not looked into this part of code...

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #37 on: November 08, 2019, 07:54:17 pm »
It does what you suggested :)
I've not suggested writing it in such a way. :)
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #38 on: November 12, 2019, 01:44:30 am »
I see you've committed it. I've read the comment multiple times and I've read the code multiple times and I still don't know why you've written it in such a convoluted way.

So, as far as I understand it the case which breaks the previous code is if pos is pointing between the to eol characters and wordstart/wordend point to an empty word. But then why don't you detect this case and just move pos by one and repeat the loop? Why are two new variables needed? Is there a case where posToAdd is bigger than 1 really? Another test which might work is if pos==wordend then you move pos by 1.
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3353
Re: Wrong spell checker on russian (and probably other languages)
« Reply #39 on: November 12, 2019, 08:07:48 am »
Quote
Is there a case where posToAdd is bigger than 1 really?
One case i can imagine: "\r\n\n" in this jumping one position is not enough, you have to jump at least 3 positions: wordstart and wordend will always point to \r for this word, independent where the input points to (0, 1 or 2)

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #40 on: November 12, 2019, 09:21:20 am »
You'll do 3 iterations of the loop each time moving by 1.

The same would happen with the current version, I think. In fact I think your version is probably wrong if posToAdd is larger than one. On the first iteration it will move by one character, on the second it will move by two, possibly skipping one characters. But I guess WordStartPosition is hiding it, because it can go back. Can it? Or are you saying that WordStartPosition would return the position at the start of the whole sequence of EOL chars no matter where in the sequence pos is?
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3353
Re: Wrong spell checker on russian (and probably other languages)
« Reply #41 on: November 12, 2019, 09:34:08 am »
Quote
WordStartPosition would return the position at the start of the whole sequence of EOL chars no matter where in the sequence pos is?
Wordstart AND wordend can go before pos. Exactly That is the problem... And my code can jump past this problem.
[Edit:] This probably can be mitigated (i have not checked id, and can probably lead to other problems)  , by using pos for the findWordEnd function. (but some where more up this topic you suggested to use wordStart to find wordEnd and not pos)
« Last Edit: November 12, 2019, 09:38:15 am by BlueHazzard »

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #42 on: November 12, 2019, 08:15:01 pm »
So you'll get a situation where wordend<pos isn't it? Then why don't you just increment pos with one in this case and do another iteration of the loop?
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3353
Re: Wrong spell checker on russian (and probably other languages)
« Reply #43 on: November 12, 2019, 11:07:37 pm »
Ok, let me try to explain:

we have the string "\r\n\n"

Code
1) 
// first loop
pos==0; // str[pos] == '\r'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = wordend; //== 0
2) // Second loop: pos++
pos==1; // str[pos] == '\n'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = wordend;//  == 0
3) // Third loop: pos++
pos==1; // str[pos] == '\n'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = wordend;//  == 0
this does not work...

Code
1) 
// first loop
pos==0; // str[pos] == '\r'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
if(pos == wordend ) pos++;// == 1

2) // Second loop: pos++
pos==2; // str[pos] == '\n'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
if(pos == wordend ) pos++;// == 1
3) // Third loop: pos++
pos==2; // str[pos] == '\n'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
if(pos == wordend ) pos++;// == 1
this also does not work

my code:

Code
0)
oldpos=0;
incstep=0;
1)
// first loop
pos==0; // str[pos] == '\r'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = worend;
if(pos == oldpos)
{
   incstep++;// == 1
   pos+=incstep;// == 1
}
2) // Second loop: pos++
pos==2; // str[pos] == '\n'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = worend;
if(pos == oldpos)
{
   incstep++;// == 2
   pos+=incstep;// == 2
}
3) // Third loop: pos++
pos==3; // str[pos] == 'X' // <------ YAYYYYY PAST THE 2
....

My code works also for "\r\n\n\n\n\n\n\n\n\n\n\n\n\n" without any special ifs...

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13413
    • Travis build status
Re: Wrong spell checker on russian (and probably other languages)
« Reply #44 on: November 13, 2019, 01:20:44 am »
Ok, let me try to explain:

we have the string "\r\n\n"

Code
1) 
// first loop
pos==0; // str[pos] == '\r'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = wordend; //== 0
2) // Second loop: pos++
pos==1; // str[pos] == '\n'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = wordend;//  == 0
3) // Third loop: pos++
pos==1; // str[pos] == '\n'
wordstart  = findwordstart(pos);// == 0;
wordend = findwordend(wordstart);// == 0;
pos = wordend;//  == 0

Why is pos==1 at iteration 3? If you add a condition if (wordend<pos) pos++; then at iteration 2 pos would be incremented to 2 and everything would be good and working. Am I still missing something?

You should of course make line pos = wordend; to be conditional...Something like:
Code
if (wordend > pos)
  pos = wordend;
else
  pos++;
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]