Author Topic: ISO-8859-1 detection problems - HELP! (Read 8389 times)

rickg22 · « **on:** August 19, 2011, 06:33:43 pm »

Hi there. I'm having a problem with encoding detection on certain files. The files in question are written in ISO-8859-1, but when I open them, C::B claims they were written in ISO-8859-7.

The lines in question are:

Code

				if(this.fundamentos_legales[j].descripcion.indexOf(this.nombre_legislaciones[k].legislacion) != -1
				|| this.fundamentos_legales[j].descripcion.indexOf(this.nombre_legislaciones[k].legislacion.replace(/[áéíóúñ]/gi,'')) != -1
				|| this.fundamentos_legales[j].descripcion.indexOf(this.nombre_legislaciones[k].legislacioncorto) != -1
				|| this.fundamentos_legales[j].descripcion.indexOf(this.nombre_legislaciones[k].legislacioncorto.replace(/[áéíóúñ]/gi,'')) != -1)
				{

See the accented characters there? They throw off the auto-detection (C::B changes them to "αινσϊρ"). I can't tell C::B to use ISO-8859-1 exclusively because I have UTF-8 files elsewhere. How can I tell C::B to use either ISO-8859-1, Windows-1252 *OR* UTF-8?
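
For readers wondering where the Greek letters come from, here is a minimal Python sketch (illustrative only, not C::B code). It takes the six accented characters as ISO-8859-1 bytes and decodes the same bytes as ISO-8859-7, which reproduces the garbled text quoted above.

```python
# The six accented characters from the regex, as ISO-8859-1 bytes.
raw = "\u00e1\u00e9\u00ed\u00f3\u00fa\u00f1".encode("iso-8859-1")  # "áéíóúñ"

# Each character is a single byte in the 0xE1..0xFA range.
byte_dump = raw.hex(" ")  # "e1 e9 ed f3 fa f1"

# Same bytes, two readings.
as_latin1 = raw.decode("iso-8859-1")  # the intended text: "áéíóúñ"
as_greek = raw.decode("iso-8859-7")   # the wrong guess:   "αινσϊρ"
```

The bytes on disk never change; only the table used to interpret them does, which is why a short run of high bytes gives a detector so little to go on.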

Please help!

EDIT: Bug reported in https://developer.berlios.de/bugs/?func=detailbug&bug_id=18316&group_id=5358

I have an idea of what C::B should do. You could specify in the project settings (or the global settings, maybe both) what encodings can be autodetected. If an opened file is detected to be in another encoding, a confirmation dialog should open.

"This file was detected as ISO-8859-7, but we could be mistaken. Do you wish to use ISO-8859-7 as the encoding, or open it with another encoding?"

Then you choose the other encoding, with the option to [ ] Always open as _______ (encoding goes here).
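
A rough sketch of the "allowed encodings" idea, in Python rather than C::B's own code. The function name `decode_with_allowed` and the default candidate list are assumptions made for illustration. It tries strict UTF-8 first and only then falls back to the single-byte encodings, so ISO-8859-7 is never considered.

```python
def decode_with_allowed(raw, allowed=("utf-8", "windows-1252", "iso-8859-1")):
    """Return (text, encoding) for the first allowed encoding that decodes cleanly.

    Order matters: UTF-8 is strict and rejects most Latin-1 text, while
    ISO-8859-1 accepts any byte sequence, so it belongs last.
    """
    for enc in allowed:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Only reachable if the list has no catch-all such as ISO-8859-1;
    # this is where a confirmation dialog could be shown instead.
    raise ValueError("no allowed encoding fits this file")


latin1_line = b"replace(/[\xe1\xe9\xed\xf3\xfa\xf1]/gi,'')"
utf8_line = "replace(/[\u00e1\u00e9]/gi,'')".encode("utf-8")

text_a, enc_a = decode_with_allowed(latin1_line)  # falls back to windows-1252
text_b, enc_b = decode_with_allowed(utf8_line)    # accepted as utf-8
```

This is a fallback chain, not statistical detection. It cannot tell ISO-8859-1 from ISO-8859-7, it just never offers the latter.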

Jenna · « **Reply #1 on:** August 19, 2011, 09:25:33 pm »

The problem with encoding detection is that it works better the more text it has to test; isolated single characters are the hardest case.
It should work better if you add some Spanish (or other ISO-8859-1) comments.

We could also give the Latin-1 detection precedence over the other detections, but this will most likely break the detection of other encodings.
The Mozilla developers have lowered the confidence of the Latin-1 prober to make detection more accurate:

Code

  // lower the confidence of latin1 so that other more accurate detector
  // can take priority.
  confidence *= 0.50f;
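
To illustrate the effect of that scaling, here is a small Python sketch. The prober names, the confidence values, and the function `pick_encoding` are hypothetical; real probers compute their confidences from character statistics. It only shows how halving one score can flip the winner.

```python
def pick_encoding(confidences, latin1_scale=0.5):
    """Pick the highest-confidence candidate after scaling the Latin-1 score.

    `confidences` maps a prober name to a raw confidence in [0, 1].
    The key "latin1" is a made-up label for the Latin-1 prober.
    """
    scaled = {
        name: conf * (latin1_scale if name == "latin1" else 1.0)
        for name, conf in confidences.items()
    }
    return max(scaled, key=scaled.get)


# Invented numbers: Latin-1 looks better before scaling, worse after it.
raw_scores = {"latin1": 0.9, "iso-8859-7": 0.5}

with_penalty = pick_encoding(raw_scores)                       # "iso-8859-7"
without_penalty = pick_encoding(raw_scores, latin1_scale=1.0)  # "latin1"
```

With only a handful of non-ASCII bytes, the other probers have little evidence against them, so the penalized Latin-1 score can lose even when it is the right answer.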

With the following comment (Italian, as far as I know), which I took from a file used to test encoding detection, your sample works here:

Code

//L'albero è sul comò e perciò chissà perché non sarà più lì!!

rickg22 · « **Reply #2 on:** August 19, 2011, 10:55:27 pm »

Thanks, but wouldn't that be a workaround more than a solution?

I just want to be able to specify which encoding this file uses, without having to modify it (it's part of a team project, and I shouldn't modify the file unless it's absolutely necessary).

Jenna · « **Reply #3 on:** August 20, 2011, 09:07:52 am »

Quote from: rickg22 on August 19, 2011, 10:55:27 pm

Thanks, but wouldn't that be a workaround more than a solution?

I just want to be able to specify which encoding this file uses, without having to modify it (it's part of a team project, and I shouldn't modify the file unless it's absolutely necessary).

Yes, it's just a workaround.

Another solution would be to force the encoding at file level, but we would have to store this information somewhere.
The correct place would be the file's properties in the project file.
I did not look into it, but it should probably not be too hard to implement.

But before working on it, I would like to see the opinion of other devs and users.
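
As a sketch of how such a per-file override could interact with auto-detection (Python, with every name here hypothetical; this is not how C::B stores or reads project files): a forced encoding wins when present, and detection runs only otherwise.

```python
def choose_encoding(path, overrides, detect):
    """Return the forced encoding for `path` if one is stored, else detect.

    `overrides` stands in for per-file properties loaded from a project file.
    `detect` stands in for whatever auto-detection the editor already has.
    """
    forced = overrides.get(path)
    return forced if forced is not None else detect(path)


# Hypothetical project data: one file pinned, everything else auto-detected.
project_overrides = {"js/fundamentos.js": "ISO-8859-1"}


def fake_detector(path):
    # Placeholder that always makes the wrong guess from this thread.
    return "ISO-8859-7"


pinned = choose_encoding("js/fundamentos.js", project_overrides, fake_detector)
unpinned = choose_encoding("js/other.js", project_overrides, fake_detector)
```

The file on disk is never touched, which matches the requirement of not modifying shared team files.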

MortenMacFly · « **Reply #4 on:** August 20, 2011, 10:04:17 am »

Quote from: jens on August 20, 2011, 09:07:52 am

But before working on it, I would like to see the opinion of other devs and users.

I wonder how this is handled in other IDEs (CodeLite/VS, for example). Does anyone know?
I don't recall ever having seen that kind of flag in project files, so there might be a "smarter" way.
Rick: How would you do this in VS?

thomas · « **Reply #5 on:** August 20, 2011, 11:11:43 am »

I've been wondering for quite some time now whether we should not just say "everyone uses UTF-8 and fuck the rest". There is so much pain involved in encodings, it never seems to work right for anyone (including me), and seeing how it is truly a harsh task for the machine to guess right in many cases, it probably won't ever work either.
So basically, one could still import whatever encodings there may be, but only ever create/edit/save in UTF-8.

Yes, I'm aware that this will not work for people who are on projects with people who don't get their editors right, in the same way as exclusively using tabs (which I still favour as an idea) will cause trouble for people in projects with others who don't know the difference between the tab key and the space bar or people who use text editors from 1976.
One could at least consider making such a behaviour configurable (and, if it went after me, enabled by default).

UTF-8 admittedly does not "truly work" for anyone but native English (insofar as it needs escape characters), but on the other hand it works surprisingly well with very little overhead for 90% of the world, and it works in a still acceptable manner for the remaining 10%. It's well-supported from the compiler side too^[1].

Sure enough, if you write your thesis in traditional Chinese, then UTF-8 will not truly be the best possible pick, but if you use Code::Blocks for that, you're using the wrong tool, too. On the other hand, source code is still "mostly ANSI" even if you're Chinese, so the impact is not truly that bad.

^[1]In fact, speaking of encoding, did anyone ever wonder if it's necessary to use -finput-charset in accordance with each file's encoding, since gcc assumes UTF-8 otherwise? I've never noticed anything since I rarely ever have a non-English character in a source and use UTF-8 anyway, but technically, we're compiling all sources wrong by default...
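
The "import anything, save only UTF-8" policy boils down to one transcoding step. A minimal Python sketch follows; the helper name `to_utf8` is made up, and the source encoding is assumed to be known already, which is exactly the hard part discussed in this thread.

```python
def to_utf8(raw, source_encoding):
    """Re-encode bytes from a known source encoding into UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# 'ñ' is one byte in ISO-8859-1 and two bytes in UTF-8.
latin1_bytes = b"legislaci\xf3n, a\xf1o"
utf8_bytes = to_utf8(latin1_bytes, "iso-8859-1")

growth = len(utf8_bytes) - len(latin1_bytes)  # one extra byte per accented char
```

For mostly-ASCII source code the size overhead is limited to the few non-ASCII characters.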

Jenna · « **Reply #6 on:** August 20, 2011, 02:55:45 pm »

Quote from: MortenMacFly on August 20, 2011, 10:04:17 am

Quote from: jens on August 20, 2011, 09:07:52 am

But before working on it, I would like to see the opinion of other devs and users.

I wonder how this is handled in other IDEs (CodeLite/VS, for example). Does anyone know?
I don't recall ever having seen that kind of flag in project files, so there might be a "smarter" way.
Rick: How would you do this in VS?

CodeLite opens Rick's example correctly, but a Chinese (non-UTF-8) text is not recognized.
Such a flag would be "hidden" in the file's properties, so most users would normally not be bothered by it, but in some special cases (such as Rick's example) the encoding could be forced without breaking everything else.

rickg22 · « **Reply #7 on:** August 22, 2011, 05:49:48 pm »

Quote from: thomas on August 20, 2011, 11:11:43 am

I've been wondering for quite some time now whether we should not just say "everyone uses UTF-8 and fuck the rest". There is so much pain involved in encodings, it never seems to work right for anyone (including me), and seeing how it is truly a harsh task for the machine to guess right in many cases, it probably won't ever work either.
So basically, one could still import whatever encodings there may be, but only ever create/edit/save in UTF-8.

The problem is that this is a non-portable solution. The project I'm working on is an ASP project (IIS) on a Windows machine. As much as I wish every platform supported UTF-8 natively, I think Microsoft IIS will remain a viable platform for at least a decade. And, as we know, IIS defecates on UTF-8.

eranif · « **Reply #8 on:** August 23, 2011, 07:53:52 pm »

Quote

I wonder how this is handled in other IDEs (CodeLite/VS, for example). Does anyone know?

Well, CodeLite does not do anything special. By default, all files are opened as ISO-8859-1. The user can set the encoding at the IDE level (NOT per file).

The reason I chose ISO-8859-1 as the default and NOT UTF-8 is that saving a file with UTF-8 encoding is something like 10x slower under Linux.

The only "smart" thing CodeLite does is handle the BOM correctly.
Eran
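
For reference, BOM handling of this kind can be sketched in a few lines of Python. This is an illustration only, not CodeLite's implementation, and the function name `strip_bom` is an assumption. It checks the leading bytes against the known BOMs and returns the implied encoding together with the remaining payload.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def strip_bom(raw):
    """Return (encoding, payload); encoding is None when no BOM is present."""
    for bom, enc in _BOMS:
        if raw.startswith(bom):
            return enc, raw[len(bom):]
    return None, raw


with_bom = strip_bom(b"\xef\xbb\xbfhola")  # ("utf-8", b"hola")
without_bom = strip_bom(b"hola")           # (None, b"hola")
```

A BOM is the one case where no guessing is needed; files without one still need a default or a detector.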

Code::Blocks Forums