Code::Blocks Forums

Developer forums (C::B DEVELOPMENT STRICTLY!) => Development => Topic started by: Jenna on March 01, 2009, 01:16:30 pm

Title: Looking for non english sources to test encoding detection
Post by: Jenna on March 01, 2009, 01:16:30 pm
I'm currently experimenting with Mozilla's charset detection for C::B (see this thread: http://forums.codeblocks.org/index.php/topic,10159.msg70493.html#msg70493).

I'm looking for files that use encodings that are not correctly recognized by C::B's encoding detection.

I mean any files that can only be opened after conversion to UTF-8, by forcing a special fallback encoding, or by bypassing C::B's autodetection.

Especially files that contain Chinese, Japanese, Cyrillic, Eastern European or Hebrew characters.

It would be nice to have both a native-encoding version and a UTF-8 version of each file, to see whether the characters are detected and displayed correctly.

Please don't attach such files to your posts, but send them via mail to "chardet at jenslody dot de".

That way we avoid unnecessary server load.

I will put them on my server so that others can test them if they want.
They will be available at http://chardet.jenslody.de/ (empty at the moment).

If you don't want the files to be published, please put a short note inside the mail.

I'm interested in single files and, of course, also in complete (short example) projects/workspaces.
Title: Re: Looking for non english sources to test encoding detection
Post by: ollydbg on March 01, 2009, 01:34:37 pm
OK, I can report some files located in the Code::Blocks source folder:

src/plugins/codecompletion/parser/tokenizer.cpp

src/sdk/wxscintilla/src/scintilla/src/LexMatlab.cxx

src/sdk/wxscintilla/src/scintilla/src/LexErlang.cxx

src/sdk/wxscintilla/src/scintilla/src/Editor.cxx

src/sdk/resources/lexers/lexer_css.xml

src/plugins/compilergcc/compilergcc.cpp


Thank you!


Title: Re: Looking for non english sources to test encoding detection
Post by: Jenna on March 01, 2009, 02:09:49 pm
The last two files are identified correctly both in pure trunk and with the Mozilla detection (one as UTF-8 with BOM and the other as UTF-8 without BOM).
The others only work with the system fallback on trunk and are detected as CP1252 (Windows-1252) by the Mozilla detector.
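
For readers unfamiliar with the three outcomes mentioned here (UTF-8 with BOM, UTF-8 without BOM, fallback to a legacy code page), the following is a minimal sketch of how such a decision can be made. It is not the C::B or Mozilla implementation; the names `GuessedEncoding`, `HasUtf8Bom`, `IsValidUtf8` and `GuessEncoding` are hypothetical and exist only for this illustration. The UTF-8 check is deliberately simplified.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch; not a Code::Blocks type.
enum class GuessedEncoding { Utf8WithBom, Utf8NoBom, Fallback };

// Returns true if the buffer starts with the UTF-8 byte order mark EF BB BF.
bool HasUtf8Bom(const std::string& data)
{
    return data.size() >= 3 &&
           static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}

// Simplified structural check: every lead byte must be followed by the
// right number of continuation bytes (10xxxxxx). It does not reject every
// overlong form or surrogate code point, so it is looser than a full validator.
bool IsValidUtf8(const std::string& data)
{
    const std::size_t n = data.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char c = static_cast<unsigned char>(data[i]);
        std::size_t extra = 0;
        if (c < 0x80)                     extra = 0; // ASCII
        else if (c >= 0xC2 && c <= 0xDF)  extra = 1; // 2-byte sequence
        else if (c >= 0xE0 && c <= 0xEF)  extra = 2; // 3-byte sequence
        else if (c >= 0xF0 && c <= 0xF4)  extra = 3; // 4-byte sequence
        else return false;                           // invalid lead byte

        if (extra > n - i - 1)
            return false;                            // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char cc = static_cast<unsigned char>(data[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;                        // not a continuation byte
        }
        i += extra + 1;
    }
    return true;
}

// BOM first, then the structural UTF-8 check, otherwise fall back
// (for example to the system encoding or a statistical detector).
GuessedEncoding GuessEncoding(const std::string& data)
{
    if (HasUtf8Bom(data))  return GuessedEncoding::Utf8WithBom;
    if (IsValidUtf8(data)) return GuessedEncoding::Utf8NoBom;
    return GuessedEncoding::Fallback;
}
```

Note that a pure ASCII file is classified as UTF-8 without BOM by this sketch, and that a CP1252 byte such as 0xE9 ("é") on its own fails the check, which is why such files end up in the fallback path.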
Title: Re: Looking for non english sources to test encoding detection
Post by: ollydbg on March 01, 2009, 02:26:08 pm
OK :D
These files come from this bug report, posted a week ago:
http://forums.codeblocks.org/index.php/topic,10130.msg70316.html#msg70316
Title: Re: Looking for non english sources to test encoding detection
Post by: nanyu on March 02, 2009, 03:22:16 am
I've sent one.
Title: Re: Looking for non english sources to test encoding detection
Post by: Jenna on March 02, 2009, 07:13:52 am
Thanks nanyu.

With the Mozilla detection, the non-UTF-8 file is detected as Chinese Simplified (cp936) by C::B. More precisely, the encoding detector reports gb18030, but I change it internally to cp936 (windows-936), because that is the only one of the two that wxWidgets knows.
The trunk version only opens the UTF-8 file correctly on my system (detected as UTF-8 with BOM).
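
The remapping described above (detector reports gb18030, editor uses cp936 instead) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual C::B patch: the function names `ToLowerAscii` and `MapToSupportedCharset` are hypothetical, and the only mapping included is the one mentioned in this post.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-cases an ASCII charset name so that "GB18030" and "gb18030" compare equal.
std::string ToLowerAscii(std::string name)
{
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Hypothetical helper: translates a charset name reported by a detector into
// a name the GUI toolkit is assumed to understand. Only the gb18030 -> cp936
// case from the post is handled; everything else is passed through unchanged.
std::string MapToSupportedCharset(const std::string& detected)
{
    if (ToLowerAscii(detected) == "gb18030")
        return "cp936";
    return detected;
}
```

Since GB18030 is a superset of cp936, a substitution like this can only be exact for files that stay within the cp936 repertoire; characters that need GB18030's four-byte sequences would not survive it.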

In my test version all characters are identical in both files, but some seem to be missing: lines 18 to 22 show a square as the first character.
That's most likely a limitation of the fonts on my system, because Iceweasel (the Debian name for Firefox) shows the same.
Title: Re: Looking for non english sources to test encoding detection
Post by: nanyu on March 02, 2009, 09:36:51 am

......
..... but some seem to miss: line 18 to 22 show a square as first character....


 :D  Don't worry about it! Those four square characters ARE meant to be four square characters.
Title: Re: Looking for non english sources to test encoding detection
Post by: ollydbg on March 02, 2009, 09:51:12 am

:D Yes, maybe Jens' system can't display Chinese characters.
Title: Re: Looking for non english sources to test encoding detection
Post by: Jenna on March 02, 2009, 10:04:54 am


My Linux system at home can display them, but my Windows system cannot (even after installing support for Chinese characters in XP).
Maybe I'm missing something.
<EDIT>
After installing support for East Asian languages it works in C::B. Windows seems to need more files than just new fonts to display them correctly.
</EDIT>

But I cannot read Chinese, so I did not know whether the squares are intended or just replacement characters.

(My father was able to read and speak a little Chinese, but he died 15 months ago, so he cannot help me.)
Title: Re: Looking for non english sources to test encoding detection
Post by: nanyu on March 02, 2009, 10:47:24 am
Those squares are intended; they are not replacement characters. Do you see now?
Title: Re: Looking for non english sources to test encoding detection
Post by: vix on August 04, 2009, 08:23:06 am
I've just sent a file with characters used in Italian (à, è, é, ì, ò and ù).
It does not work in SVN revisions 5696 and 5716.
It works in 5678 and older.
Title: Re: Looking for non english sources to test encoding detection
Post by: Jenna on August 04, 2009, 09:15:28 am
Thanks, I found the cause of your problem; the answer is here: http://forums.codeblocks.org/index.php/topic,10912.msg74883/topicseen.html#msg74883
Title: Re: Looking for non english sources to test encoding detection
Post by: christina2009 on August 12, 2009, 03:33:30 am
I think this is enough.
I do agree with you. That is the most effective way.
