Developer forums (C::B DEVELOPMENT STRICTLY!) => Plugins development => Topic started by: Khram on March 10, 2015, 01:22:29 am

Title: Spellchecker Issues
Post by: Khram on March 10, 2015, 01:22:29 am

Title: Re: Spellchecker Issues
Post by: oBFusCATed on March 10, 2015, 10:02:21 am

Khram: Are you going to cooperate and describe your problem in detail this time? I'm asking this for the third time! If you don't describe the problem in detail, post example files, etc., no one will fix your problem.

Quote from: oBFusCATed on February 17, 2015, 09:38:37 am

Quote from: Khram on February 17, 2015, 05:00:25 am
Until now, no archive repository. Interestingly, there has been corrected checking Russian spelling encoded Win-1251.
>:( - Again misery and refund on 9958 version. Here, too, Russian orthography is faulty.
I couldn't understand what your problem is.
Can you paraphrase this in a more understandable way?

Quote from: Khram on February 18, 2015, 05:42:02 am

Please return option "-funsigned-char" when compiling and assembling the Spell-plugin. Maybe it back to life for a simple eight-bit encoding win-1251 of comments in a source codes

++ This is last year's problem, but I'm not ready to deploy the entire system programming to self solving it. I just computers Win-8.1 (64) and Win-7 (32).

Quote from: oBFusCATed on February 18, 2015, 08:55:46 am

Khram:
Looking at the svn history I don't think we've ever used this option to compile the spell checker.
If you have a particular problem and it is not reported to the sf.net project page, please do so and post a link here. If it is, post a link so we can look at it.
But you're posting a message that has almost no meaning in a second nightly build topic! We cannot help if you are not cooperative!

Title: Re: Spellchecker Issues
Post by: oBFusCATed on March 10, 2015, 10:38:12 am

Khram: Of course it hasn't changed; you haven't described what the problem really is and how to reproduce it. The status will be the same until you do, or until someone else is able to give us a way to reproduce it.

Title: Re: Spellchecker Issues
Post by: oBFusCATed on March 10, 2015, 01:43:59 pm

Khram:
Keep in mind that most people here use English-like encodings (only).
So it is best to post the encoding of the file you're seeing the problem with.
And probably even better post an example file and the appropriate dictionary files that should match it.

Luckily for you I think I'm able to reproduce it, so I'll see what is going on.

Title: Re: Spellchecker Issues
Post by: janissl on March 10, 2015, 04:54:34 pm

Quote from: Khram on March 10, 2015, 12:23:20 pm

I showed the picture that SpellChekk confused in a letter, and should understand the words. I think that the problem in sign chars instead of unsigned bytes. What may not be understand ???

I guess the spellcheck is designed for checking strings in your code, i.e. text strings displayed for users of your application. What you write in comments is up to the developer.

Title: Re: Spellchecker Issues
Post by: oBFusCATed on March 10, 2015, 04:59:15 pm

Quote from: janissl on March 10, 2015, 04:54:34 pm

I guess the spellcheck is designed for checking strings in your code, i.e. text strings displayed for users of your application. What you write in comments is up to the developer.

Wrong. The spellchecker does check both comments and strings. At least this is how it works for English + UTF-8.

Title: Re: Spellchecker Issues
Post by: janissl on March 10, 2015, 06:05:32 pm

Yes, I was wrong. Unfortunately, the same issue also applies to Latvian (lv_LV, encoding: UTF-8). It does not matter whether the text is code or a comment.

Title: Re: Spellchecker Issues
Post by: raynebc on March 11, 2015, 06:11:21 pm

While we're talking about the spellchecker, does anybody else run into constant problems with it incorrectly flagging spelling errors? It seems that it happens to me often, especially if I'm copying/pasting text/comments. When this happens, it will claim the words are misspelled until I make any change to the word, even if it's something as simple as adding and removing a space character to the end of the word.

Title: Re: Spellchecker Issues
Post by: oBFusCATed on March 11, 2015, 08:23:48 pm

I think, I've never seen this. :(

Title: Re: Spellchecker Issues
Post by: raynebc on March 27, 2015, 08:23:14 pm

Quote from: oBFusCATed on March 11, 2015, 08:23:48 pm

I think, I've never seen this. :(

Thinking about this some more since, does Code::Blocks use different dictionaries based on the detected language of the environment (ie. Windows' locale)?

Title: Re: Spellchecker Issues
Post by: stahta01 on March 27, 2015, 11:54:04 pm

Quote from: raynebc on March 27, 2015, 08:23:14 pm

Quote from: oBFusCATed on March 11, 2015, 08:23:48 pm
I think, I've never seen this. :(
Thinking about this some more since, does Code::Blocks use different dictionaries based on the detected language of the environment (ie. Windows' locale)?

IIRC, CB has a setting that picks the dictionary used by the spell checker.
But, the character encoding used by the OS/file might be causing the problem based on what I read on this board somewhere.
No idea if the default OS character encoding has any possibility of causing the issue.

Tim S.

Title: Re: Spellchecker Issues
Post by: oBFusCATed on March 28, 2015, 01:38:09 am

Quote from: raynebc on March 27, 2015, 08:23:14 pm

Thinking about this some more since, does Code::Blocks use different dictionaries based on the detected language of the environment (ie. Windows' locale)?

As far as I could understand the code:
1. Every dictionary is stored in its own encoding
2. The encoding is not utf8, but some language specific (probably to save some space)
3. CB tries to convert the words from the encoding it has detected for the file to the encoding used by the dictionary, or vice versa. This is done in order to find matching words.
4. I doubt that the system encoding has anything to do with this process.
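
Point 3 can be pictured with a small, self-contained sketch. It is not the plugin's code; the function names `AppendUtf8` and `Cp1251ToUtf8` are made up for illustration, and only ASCII plus the Russian letters of Windows-1251 are mapped. Note the cast to `unsigned char`: bytes at or above 0x80 are negative wherever plain `char` is signed.

```cpp
#include <cstdint>
#include <string>

// Appends one code point (below U+10000 is enough here) to a UTF-8 string.
void AppendUtf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80)
        out += static_cast<char>(cp);
    else if (cp < 0x800)
    {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: converts Windows-1251 bytes to UTF-8.
// Covers ASCII, the letters 0xC0..0xFF (U+0410..U+044F) and the two "yo" letters;
// every other byte becomes U+FFFD because this sketch does not map it.
std::string Cp1251ToUtf8(const std::string& in)
{
    std::string out;
    for (char c : in)
    {
        const unsigned char b = static_cast<unsigned char>(c); // never compare the signed value
        if (b < 0x80)
            AppendUtf8(out, b);
        else if (b >= 0xC0)
            AppendUtf8(out, 0x0410u + (b - 0xC0u));
        else if (b == 0xA8)
            AppendUtf8(out, 0x0401); // capital yo
        else if (b == 0xB8)
            AppendUtf8(out, 0x0451); // small yo
        else
            AppendUtf8(out, 0xFFFD);
    }
    return out;
}
```

A word from the file would be passed through such a conversion (or its inverse) before it is compared with the dictionary entries.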

Title: Re: Spellchecker Issues
Post by: janissl on March 28, 2015, 04:31:50 pm

Quote from: oBFusCATed on March 28, 2015, 01:38:09 am

1. Every dictionary is stored in its own encoding
2. The encoding is not utf8, but some language specific (probably to save some space)
3. CB tries to convert the words from the encoding it has detected for the file to the encoding used by the dictionary, or vice versa. This is done in order to find matching words.
4. I doubt that the system encoding has anything to do with this process.

No, there must be another cause. For example, the Latvian dictionary is stored in UTF-8 with no BOM (see hunspell_lv.png in the attachment). The source file also uses UTF-8. Even changing all words to the nominative case as they appear in the .dic file does not help to get rid of the curly red underlines.

I guess, some specific characters are causing this behaviour for some reason (see lv_strings_false_misspelled.png in the attachment). In addition, some words are underlined partly.

My default system encoding is Windows-1257 and the codepage for the Windows Command Prompt is 775 (the default Windows settings for the Baltic languages) but I think the Code::Blocks should not use those two encodings in any way if I have set explicitly the UTF-8 for the "Use encoding when opening files" option.

Playing with options in Editor Settings (checking-unchecking checkboxes and radio buttons under Encoding) did not change anything in the false error detection.

Title: Re: Spellchecker Issues
Post by: janissl on March 28, 2015, 06:25:56 pm

However, the correct place to discuss the SpellChecker issues is the forum for plugins development: http://forums.codeblocks.org/index.php/topic,11307.0.html (http://forums.codeblocks.org/index.php/topic,11307.0.html)...

Title: Re: Spellchecker Issues
Post by: oBFusCATed on March 29, 2015, 04:44:29 am

The encoding for a particular dictionary is specified in the .aff file.
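
In hunspell affix files the encoding is declared on a `SET` line, for example `SET UTF-8`. A minimal sketch for reading it follows; `ReadAffEncoding` is a hypothetical helper name, not something from the plugin or from hunspell, and it simply returns an empty string when no `SET` line is present.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns the encoding named on the first "SET <encoding>" line of an .aff
// stream, or an empty string if the stream has no such line.
std::string ReadAffEncoding(std::istream& aff)
{
    std::string line;
    bool firstLine = true;
    while (std::getline(aff, line))
    {
        // A UTF-8 file may start with a byte order mark; skip it.
        if (firstLine && line.compare(0, 3, "\xEF\xBB\xBF") == 0)
            line.erase(0, 3);
        firstLine = false;

        std::istringstream fields(line);
        std::string keyword, encoding;
        // operator>> skips whitespace, so a trailing '\r' is not part of the name.
        if (fields >> keyword >> encoding && keyword == "SET")
            return encoding;
    }
    return std::string();
}
```

Opening the .aff file with `std::ifstream` and passing it to this function is enough to see which encoding a given dictionary expects.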

Title: Re: Spellchecker Issues
Post by: oBFusCATed on April 14, 2015, 02:16:52 am

Can someone try the ru-ru dictionary that ships with LibreOffice to spellcheck some of the files in the attached project on Windows?

@Khram: It will be easier if you post the dictionary files yourself, so others can use them to debug the issue.

Title: Re: Spellchecker Issues
Post by: MortenMacFly on April 14, 2015, 07:59:27 am

Quote from: oBFusCATed on April 14, 2015, 02:16:52 am

Can someone try the ru-ru dictionary that ships with LibreOffice to spellcheck some of the files in the attached project on Windows?

Well I picked just one ru_RU dictionary I found and they are not correctly spell-checked. Maybe I picked the wrong one?

@Khram: What dictionary do you use exactly?

Title: Re: Spellchecker Issues
Post by: Alpha on May 14, 2015, 04:22:37 am

If making a guess, the issue might be here:

Code: cpp

bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
{
    return wxIsspace(ch) || wxIspunct(ch) || wxIsdigit(ch);
}

Title: Re: Spellchecker Issues
Post by: oBFusCATed on May 14, 2015, 09:29:00 am

Quote from: Khram on May 13, 2015, 10:47:20 pm

In new nightly build (10253) - spellSheck no working

Of course it is not working - no one has fixed it, because they can't reproduce it.

Please post a source file and a dictionary file that should be used to reproduce the problem.
Also (probably) post a screenshot with your regional settings.

Title: Re: Spellchecker Issues
Post by: Alatar on May 26, 2015, 11:35:10 am

Please find the dictionary on yandex.disk - https://yadi.sk/d/glyHPzRKgsZkQ
Testfile and screenshot of wrong behaviour are attached

Here is C::B version string and localisation settings:

Code

Code::Blocks svn build  rev 10309 May 25 2015, 10:02:04 - wx2.8.12 (Linux, unicode) - 64 bit

alatar@al_work:~% locale
LANG=en_GB.UTF-8
LC_CTYPE="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_PAPER="en_GB.UTF-8"
LC_NAME="en_GB.UTF-8"
LC_ADDRESS="en_GB.UTF-8"
LC_TELEPHONE="en_GB.UTF-8"
LC_MEASUREMENT="en_GB.UTF-8"
LC_IDENTIFICATION="en_GB.UTF-8"
LC_ALL=

alatar@al_work:~% uname -a
Linux al_work 3.17.7-gentoo #1 SMP PREEMPT Mon Mar 30 18:24:07 MSK 2015 x86_64 Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz GenuineIntel GNU/Linux

Title: Re: Spellchecker Issues
Post by: oBFusCATed on May 27, 2015, 12:58:46 am

@Alatar:
I've tried the hunspell binary and it cannot spell-check correctly the Russian part of the file using your dictionary.
I've tried something like

Code

# first copy the dictionary files to /usr/share/hunspell
$ hunspell -d Russian-English  -i utf-8 /tmp/spellcheck_check.txt

I guess this is the problem:

Code

error: unknown encoding Windows-1251: using iso88591 as fallback

Please keep in mind that hunspell uses iconv to do the conversions.
If you can reproduce the problem with hunspell in a console, then you should talk to either the hunspell devs or the vendor of your dictionary.

I'm running this test on gentoo linux.

Title: Re: Spellchecker Issues
Post by: White-Tiger on July 23, 2015, 04:53:58 pm

I also have problems with SpellChecker and CB r10341
Basically it's not working at all.. the only thing that works is the "user dictionary".

I'm not even using any kind of weird language.. I only need the English spell checking to work as that's the main language used by developers.
Not sure what's wrong here though.. I don't see any errors in the CB consoles (only when I delete the th_* files as they can't be found, or if I switch to the GB dictionary because it's then loading the US one)

It's not only highlighting everything that is not in the custom dictionary, but also Edit->Spelling... doesn't provide suggestions or that like..
The source files I've checked with aren't even UTF-8 yet, they are still plain ASCII without special chars in them

Here are the dicts I've tried to use on my Windows machine with dictionary path set to %AppData%\codeblocks\SpellChecker : https://db.tt/pSVUEisr

Title: Re: Spellchecker Issues
Post by: oBFusCATed on July 23, 2015, 06:38:15 pm

@White-Tiger:
Just tried them and they work as expected in both r10333 and r10358 on linux.
Do you have any other hunspell based apps that you can try if they work correctly?

Also is there a nightly that just works with this dictionary?

Title: Re: Spellchecker Issues
Post by: White-Tiger on July 29, 2015, 06:52:29 pm

well... yes those dictionaries work in Miranda NG (IM)
I've also just tried the last stable of Code::Blocks, that one didn't work as well... going to boot up my XP VM now and try it there

edit:
tried my XP VM with r10341 nightly, SpellChecker seemed to work at first.. yet I've found out the reasons. The problem lies in the path... "%AppData%\codeblocks\SpellChecker" by itself is fully functional, but my user name includes a special character: "é"
So as soon as there's any non-ASCII char in the path, it fails to work.
Normally I wouldn't choose such a Windows user name.. but Windows simply used my real name the moment I signed in with my Microsoft account... And so far, I haven't had a program that couldn't handle it. (and Code::Blocks works in most cases)

Title: Re: Spellchecker Issues
Post by: oBFusCATed on July 29, 2015, 08:48:01 pm

Interesting. I guess someone running Windows will have to debug this.

Title: Re: Spellchecker Issues
Post by: raynebc on July 30, 2015, 07:45:33 pm

One insight I can offer is that the traditional file I/O C functions like fopen (at least in Windows with MinGW) tend to not support file paths containing Unicode or extended ASCII characters. It's been a huge thorn in my side for some time now. Third party I/O functions (like the ones in the Allegro game library) can open such files with absolutely no problem. Non cross-platform implementations like the ones in Visual Studio also probably support such file paths because I've never run into any Windows-specific application with that limitation.

Title: Re: Spellchecker Issues
Post by: White-Tiger on August 01, 2015, 01:21:00 pm

well.. the way Code::Blocks opens files seems to be fine... not sure if Code::Blocks uses Unicode / wchar_t / TCHAR on Windows, but the thing is that SpellChecker finds and successfully opens the dictionaries... otherwise this shouldn't work:

Code

SpellChecker: Thesaurus files 'C:\Users\René\AppData\Roaming\codeblocks\SpellChecker\th_en_GB.idx' not found!
SpellChecker: Loading 'C:\Users\René\AppData\Roaming\codeblocks\SpellChecker\th_en_US.idx' instead...

So parts of SpellChecker seem to work, while others don't

edit: My bet:
HunspellInterface.cpp:61-62: should both prefix the path with "\\?\" to let Hunspell handle UTF-8 paths on Windows.. (Windows only)
see:

Quote from: hunspell/hunspell.hxx

/* Hunspell(aff, dic) - constructor of Hunspell class
* input: path of affix file and dictionary file
*
* In WIN32 environment, use UTF-8 encoded paths started with the long path
* prefix \\\\?\\ to handle system-independent character encoding and very
* long path names (without the long path prefix Hunspell will use fopen()
* with system-dependent character encoding instead of _wfopen()).
*/

Title: Re: Spellchecker Issues
Post by: White-Tiger on August 06, 2015, 03:48:57 pm

Quote from: Alpha on May 14, 2015, 04:22:37 am

If making a guess, the issue might be here:

Code: cpp

bool SpellCheckHelper::IsWhiteSpace(const wxChar &ch)
{
    return wxIsspace(ch) || wxIspunct(ch) || wxIsdigit(ch);
}

This was actually the root of a problem I've encountered after fixing my file path issue above.
I had to change the "wxIspunct(ch)" part into "(wxIspunct(ch) && ch!='\'')" because words such as "doesn't" also showed up to be misspelled..
I suggest to unify the source code and use something like seen in HunspellInterface.cpp:130 (uses a list of known "non-word" chars)

Code: cpp

  wxString strDelimiters = _T(" \t\r\n.,?!@#$%^&*()-=_+[]{}\\|;:\"<>/~0123456789");
  wxStringTokenizer tkz(strText, strDelimiters);

I've further noticed that SpellChecker doesn't seem to handle UTF-8 at all.. at least when I try to correct the word "doesn¾" and use the suggested "doesn't", I'll end up with "doesn't\xBE"
The menu item also only showed "doesn" without any visible char thereafter. (so only the first half of the UTF-8 char)

Title: Re: Spellchecker Issues
Post by: oBFusCATed on August 06, 2015, 08:56:05 pm

Hm, I've wondered why "doesn't" is detected as misspelled.
Can you post a patch with your second suggestion?

Title: Re: Spellchecker Issues
Post by: stahta01 on August 06, 2015, 10:53:00 pm

Quote from: oBFusCATed on August 06, 2015, 08:56:05 pm

Hm, I've wondered why "doesn't" is detected as misspelled.
Can you post a patch with your second suggestion?

Maybe the wrong single quote is used?

Tim S.

Title: Re: Spellchecker Issues
Post by: oBFusCATed on August 06, 2015, 11:15:30 pm

Quote from: stahta01 on August 06, 2015, 10:53:00 pm

Maybe the wrong single quote is used?

Re-read White-Tiger's post. He seems to have found the reason.

Title: Re: Spellchecker Issues
Post by: White-Tiger on August 10, 2015, 01:44:13 pm

Quote from: oBFusCATed on August 06, 2015, 08:56:05 pm

Hm, I've wondered why "doesn't" is detected as misspelled.
Can you post a patch with your second suggestion?

Which second suggestion exactly?
If you're talking about the wxIspunct and unification as mentioned, I've come to the conclusion that wxIspunct should be ok here. It got introduced in r10014 (spellchecker: replace hardcoded character set with unicode compatible calls, improves checking accuracy in utf8 comments) by alpha0010.
"wxIspunct" seems to handle everything that isn't a word character.. this includes ' and other characters that might be "part" of a word in some languages. So just filter those few characters out and it'll be fine

HunspellInterface.cpp:130 might be a bit troublesome to make use of wxIspunct... as it requires to rewrite the code so that we manually loop over the string and search/parse words..

I've also taken a peek at Firefox's spell checker and it's also using ispunct, but the apostrophe is a special case... it is a punct if set alone or not between 2 "words",
that is " Windows' " is seen as " Windows " as there's no letter after the apostrophe (and thus is successfully checked for spelling). "Windows'" is not part of the dictionary because it's not required if those rules are to be followed.

So something like "IsWhiteSpace()" returning 0 for non-space / word characters, 1 for space and 2 for "special". When it returns "2" we'll check if another !IsWhiteSpace follows which means it isn't a space. Otherwise it was.
Though Firefox uses ispunct() together with IsConditionalPunctuation() which returns true for ', 0x2019 /*RIGHT SINGLE QUOTATION MARK*/ and 0x00B7 /*MIDDLE DOT*/
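
The rule described above can be sketched in standard C++ roughly as follows. This is not the plugin's or Firefox's code: `IsSeparator` only mirrors the shape of `SpellCheckHelper::IsWhiteSpace` with the `<cwctype>` functions instead of the wx ones, `SplitWords` is a hypothetical helper, and the set of conditional characters is the one quoted from Firefox.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>

// Characters that count as part of a word only when letters surround them.
bool IsConditionalPunctuation(wchar_t ch)
{
    return ch == L'\'' || ch == 0x2019 /* right single quotation mark */
        || ch == 0x00B7 /* middle dot */;
}

// Same shape as the IsWhiteSpace() helper quoted earlier in the thread.
bool IsSeparator(wchar_t ch)
{
    return std::iswspace(ch) || std::iswpunct(ch) || std::iswdigit(ch);
}

// Splits text into words; an apostrophe stays inside a word ("doesn't")
// but is dropped at the start or end of one ("Windows'").
std::vector<std::wstring> SplitWords(const std::wstring& text)
{
    std::vector<std::wstring> words;
    std::wstring current;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const wchar_t ch = text[i];
        bool partOfWord;
        if (IsConditionalPunctuation(ch))
        {
            const bool prevIsWord = !current.empty();
            const bool nextIsWord = i + 1 < text.size()
                && !IsConditionalPunctuation(text[i + 1])
                && !IsSeparator(text[i + 1]);
            partOfWord = prevIsWord && nextIsWord;
        }
        else
            partOfWord = !IsSeparator(ch);

        if (partOfWord)
            current += ch;
        else if (!current.empty())
        {
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        words.push_back(current);
    return words;
}
```

The conditional check runs before `IsSeparator` on purpose, because the plain apostrophe is itself classified as punctuation.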

Title: Re: Spellchecker Issues
Post by: MortenMacFly on August 18, 2015, 01:09:09 pm

Quote from: White-Tiger on August 10, 2015, 01:44:13 pm

...

Well I am a bit lost now. Could you please state briefly once again what changes will fix the original bug reported? Maybe you can even provide a patch? It's easy to do: check out from SVN, make the changes in the working copy, and run this command at the root of your working copy:
svn diff > diff.patch
(...assuming you have the SVN executable in the path.)

Title: Re: The new 16.01 spellchecker not work
Post by: MortenMacFly on January 29, 2016, 07:18:04 am

Quote from: Khram on January 29, 2016, 01:35:43 am

The new version 16.01 spell checker also confuses words, threatening the collapse of the whole program, of course.

Well I figured out meanwhile that this is not a C::B but a hunspell issue (that's the lib we use for spellchecking). I told the Hunspell maintainers but I got nothing in return so far...

Title: Re: Spellchecker Issues
Post by: White-Tiger on January 29, 2016, 04:06:20 pm

why do you think it's a hunspell issue?

Title: Re: Spellchecker Issues
Post by: MortenMacFly on January 29, 2016, 10:45:18 pm

Quote from: White-Tiger on January 29, 2016, 04:06:20 pm

why do you think it's a hunspell issue?

Well to be more precise: We actually have two issues here:
1.) (hunspell): If the dictionaries are in a path with non-ASCII characters hunspell is unable to pick up any dictionary.
2.) The Russian words are broken due to the way we find word boundaries in OnlineSpellChecker.cpp. Here (search for the comment "//find recheck range end:") we check for whitespace in a way that does not work for e.g. Russian (see SpellCheckHelper::IsWhiteSpace(ch)).

The latter we can do something about... the first one we can't. Both lead to Russian SpellChecking being broken, unfortunately.

Title: Re: Spellchecker Issues
Post by: White-Tiger on January 30, 2016, 01:33:37 am

1) is also a C::B issue.. see: http://forums.codeblocks.org/index.php/topic,20195.msg139323.html#msg139323
Hunspell has means to support such paths.. C::B is simply not using them.
One could argue that they could have supported wchar_t* directly.. though their way is a bit less platform dependent

So something like this in HunspellInterface.cpp:61-63

Code: cpp

    wxCharBuffer affixFileCharBuffer = ConvertToUnicode(_T("\\\\?\\") + strAffixFile);
    wxCharBuffer dictionaryFileCharBuffer = ConvertToUnicode(_T("\\\\?\\") + strDictionaryFile);
    m_pHunspell = new Hunspell(affixFileCharBuffer, dictionaryFileCharBuffer);

would work for Windows (this is what I'm using locally, and as far as I can tell, it seems to work)

Title: Re: Spellchecker Issues
Post by: MortenMacFly on January 30, 2016, 07:48:28 am

Quote from: White-Tiger on January 30, 2016, 01:33:37 am

So something like this in HunspellInterface.cpp:61-63
Code: cpp
    wxCharBuffer affixFileCharBuffer = ConvertToUnicode(_T("\\\\?\\") + strAffixFile);
    wxCharBuffer dictionaryFileCharBuffer = ConvertToUnicode(_T("\\\\?\\") + strDictionaryFile);
    m_pHunspell = new Hunspell(affixFileCharBuffer, dictionaryFileCharBuffer);
would work for Windows (this is what I'm using locally, and as far as I can tell, it seems to work)

I've applied a cross-platform compatible version of this - for me that really seems to work. Nice catch!

So now what's missing is umlauts and Unicode... at least we are getting closer...
