Author Topic: Looking for non english sources to test encoding detection  (Read 13145 times)

Offline Jenna

  • Administrator
  • Lives here!
  • *****
  • Posts: 7255
Looking for non english sources to test encoding detection
« on: March 01, 2009, 01:16:30 pm »
I'm currently experimenting with Mozilla's charset detection for C::B (see this thread: http://forums.codeblocks.org/index.php/topic,10159.msg70493.html#msg70493).

I'm looking for files that use encodings that are not correctly recognized by C::B's encoding detection.

I mean any files that can only be opened after conversion to UTF-8, by forcing a special fallback encoding, or by bypassing C::B's autodetection.

I'm especially interested in files that contain Chinese, Japanese, Cyrillic, Eastern European or Hebrew characters.

It would be nice to have both a native-encoding and a UTF-8 version of each file, to see whether the characters are detected and displayed correctly.
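
For anyone preparing such a pair, here is a minimal Python sketch of one way to do it. It is not part of C::B; the sample text, the choice of cp1251 as the "native" encoding and the file names are only assumptions for illustration.

```python
from pathlib import Path
import tempfile

# Hypothetical sample: a C++ source with a Cyrillic comment.
SAMPLE = "// Привет, мир\nint main() { return 0; }\n"


def write_pair(directory, text, native_encoding):
    """Write the same text once in a native encoding and once as UTF-8."""
    directory = Path(directory)
    native_path = directory / f"sample_{native_encoding}.cpp"
    utf8_path = directory / "sample_utf8.cpp"
    # write_bytes avoids any newline translation by the platform.
    native_path.write_bytes(text.encode(native_encoding))
    utf8_path.write_bytes(text.encode("utf-8"))
    return native_path, utf8_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        native, utf8 = write_pair(tmp, SAMPLE, "cp1251")
        print(native.name, native.stat().st_size, "bytes")
        print(utf8.name, utf8.stat().st_size, "bytes")
```

Both files decode to the same text, but their bytes differ, which is what makes them useful as a detection test.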

Please don't attach such files to your posts; send them by mail to "chardet at jenslody dot de" instead.

That way we avoid unnecessary server load.

I will put them on my server so that others can test with them if they want.
They will be available at http://chardet.jenslody.de/ (empty at the moment).

If you don't want the files to be published, please put a short note in the mail.

I'm interested in single files and, of course, also in complete (short example) projects/workspaces.
« Last Edit: March 01, 2009, 01:37:34 pm by jens »

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5916
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: Looking for non english sources to test encoding detection
« Reply #1 on: March 01, 2009, 01:34:37 pm »
OK, I can report some files located in the Code::Blocks source folder:

src/plugins/codecompletion/parser/tokenizer.cpp

src/sdk/wxscintilla/src/scintilla/src/LexMatlab.cxx

src/sdk/wxscintilla/src/scintilla/src/LexErlang.cxx

src/sdk/wxscintilla/src/scintilla/src/Editor.cxx

src/sdk/resources/lexers/lexer_css.xml

src/plugins/compilergcc/compilergcc.cpp


Thank you!


If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline Jenna

  • Administrator
  • Lives here!
  • *****
  • Posts: 7255
Re: Looking for non english sources to test encoding detection
« Reply #2 on: March 01, 2009, 02:09:49 pm »
The last two files are identified correctly both in plain trunk and with the Mozilla detection (one as UTF-8 with BOM and the other as UTF-8 without BOM).
The others open on trunk only via the system fallback encoding, and are detected as CP1252 (Windows-1252) by the Mozilla detector.
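
To illustrate the three outcomes mentioned here (UTF-8 with BOM, UTF-8 without BOM, and a single-byte fallback), a rough Python sketch follows. It is neither C::B's nor Mozilla's actual algorithm; the function name `guess_encoding` and the fixed cp1252 fallback are assumptions made for illustration only.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Very rough guess: BOM first, then strict UTF-8, then a fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # UTF-8 with BOM
    try:
        data.decode("utf-8")    # strict decoding raises on invalid bytes
        return "utf-8"          # UTF-8 without BOM (or plain ASCII)
    except UnicodeDecodeError:
        return fallback         # assumed fallback, not a real detection


if __name__ == "__main__":
    text = "perché"
    print(guess_encoding(codecs.BOM_UTF8 + text.encode("utf-8")))
    print(guess_encoding(text.encode("utf-8")))
    print(guess_encoding(text.encode("cp1252")))
```

A real detector such as Mozilla's uses statistics over byte sequences instead of a fixed fallback; this sketch only shows why a file with a BOM or with valid UTF-8 is the easy case.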

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5916
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: Looking for non english sources to test encoding detection
« Reply #3 on: March 01, 2009, 02:26:08 pm »
ok :D
These files come from this bug report message, posted one week ago:
http://forums.codeblocks.org/index.php/topic,10130.msg70316.html#msg70316

Offline nanyu

  • Almost regular
  • **
  • Posts: 188
  • nanyu
Re: Looking for non english sources to test encoding detection
« Reply #4 on: March 02, 2009, 03:22:16 am »
I sent one.

Offline Jenna

  • Administrator
  • Lives here!
  • *****
  • Posts: 7255
Re: Looking for non english sources to test encoding detection
« Reply #5 on: March 02, 2009, 07:13:52 am »
Quote from nanyu: "I sent one."
Thanks nanyu.

With the Mozilla detection, the non-UTF-8 file is detected by C::B as Simplified Chinese (cp936). More precisely, the encoding detector reports gb18030, but I map it internally to cp936 (windows-936), because that is the only one of the two that wxWidgets knows.
On my system, the trunk version opens only the UTF-8 file correctly (detected as UTF-8 with BOM).
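
The internal mapping described above can be pictured with a small Python sketch. This is not the C::B code; the table `DETECTOR_TO_TOOLKIT` and the function `toolkit_encoding` are hypothetical names, and the table holds only the one mapping mentioned in this post.

```python
# Hypothetical alias table: name reported by the detector -> name the
# GUI toolkit accepts. Only the mapping mentioned in the post is listed.
DETECTOR_TO_TOOLKIT = {
    "gb18030": "cp936",
}


def toolkit_encoding(detected_name: str) -> str:
    """Return the toolkit-side name for a detected encoding name."""
    key = detected_name.strip().lower()
    # Names without an entry are passed through unchanged.
    return DETECTOR_TO_TOOLKIT.get(key, key)


if __name__ == "__main__":
    raw = "中文".encode("gb18030")
    name = toolkit_encoding("GB18030")
    print(name, raw.decode(name))
```

The substitution works for text like this because GB18030 keeps the two-byte codes that cp936 uses; characters that GB18030 encodes in four bytes would not survive it.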

In my test version all characters are identical in both files, but some seem to be missing: lines 18 to 22 show a square as the first character.
That's most likely a limitation of the character set on my system, because Iceweasel (the Debian name for Firefox) shows the same.

Offline nanyu

  • Almost regular
  • **
  • Posts: 188
  • nanyu
Re: Looking for non english sources to test encoding detection
« Reply #6 on: March 02, 2009, 09:36:51 am »

Quote from Jenna: "... but some seem to be missing: lines 18 to 22 show a square as the first character ..."

:D Don't worry about it! Those four square characters ARE meant to be four square characters.

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5916
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: Looking for non english sources to test encoding detection
« Reply #7 on: March 02, 2009, 09:51:12 am »

Quote from nanyu: "Don't worry about it! Those four square characters ARE meant to be four square characters."

:D Yes, maybe Jens' system can't display Chinese characters.

Offline Jenna

  • Administrator
  • Lives here!
  • *****
  • Posts: 7255
Re: Looking for non english sources to test encoding detection
« Reply #8 on: March 02, 2009, 10:04:54 am »

Quote from ollydbg: "Yes, maybe Jens' system can't display Chinese characters."

My Linux system at home can display them, but my Windows system cannot (even after installing support for Chinese characters in XP).
Maybe I'm missing something.
<EDIT>
After installing support for East Asian languages, it works in C::B. Windows seems to need more than just new fonts to display the characters correctly.
</EDIT>

But I cannot read Chinese, so I did not know whether the squares were intended or just replacement characters.

(My father was able to read and speak a little Chinese, but he died 15 months ago, so he cannot help me.)
« Last Edit: March 02, 2009, 10:23:29 am by jens »

Offline nanyu

  • Almost regular
  • **
  • Posts: 188
  • nanyu
Re: Looking for non english sources to test encoding detection
« Reply #9 on: March 02, 2009, 10:47:24 am »
Those squares are intended; they are not replacement characters. Do you see now?

Offline vix

  • Multiple posting newcomer
  • *
  • Posts: 60
Re: Looking for non english sources to test encoding detection
« Reply #10 on: August 04, 2009, 08:23:06 am »
I've just sent a file with characters used in Italian (à, è, é, ì, ò and ù).
It does not work in SVN revisions 5696 and 5716.
It works in 5678 and older.

Offline Jenna

  • Administrator
  • Lives here!
  • *****
  • Posts: 7255
Re: Looking for non english sources to test encoding detection
« Reply #11 on: August 04, 2009, 09:15:28 am »
Quote from vix: "I've just sent a file with characters used in Italian (à, è, é, ì, ò and ù). It does not work in SVN revisions 5696 and 5716. It works in 5678 and older."

Thanks, I found the cause of your problem; the answer is here.

christina2009

  • Guest
Re: Looking for non english sources to test encoding detection
« Reply #12 on: August 12, 2009, 03:33:30 am »

I think this is enough.
I do agree with you. Those are the most effective ways.
