Author Topic: Linux won't edit Jonsson umlauted name  (Read 24940 times)

Offline Pecan

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 2783
Linux won't edit Jonsson umlauted name
« on: May 17, 2006, 01:38:46 am »
I cannot apply a patch to, nor edit with gedit or Code::Blocks, a file containing
Jonsson's umlauted name, using Ubuntu Breezy.

why?

gedit says it cannot determine the encoding of as_config.h et al.
Code::Blocks shows an empty file, but reports UTF-8 in the status bar.
vim doesn't give a damn.

I had to change (using vim) the umlauted 'o' to a plain English 'o' to edit the file.
Then the patches worked and everyone was happy.

what gives here?

thanks
pecan

EDIT: Mac couldn't apply the patch either, but when I edited
the files by hand, it changed the umlaut to a '^' and showed ISO-8859-1 in the status bar.
« Last Edit: May 17, 2006, 01:42:12 am by Pecan »

Offline Pecan

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 2783
Re: Linux won't edit Jonsson umlauted name
« Reply #1 on: May 17, 2006, 02:15:04 pm »
What encoding should I set for the Linux Codeblocks editor in order to edit the AngelScript files containing an umlauted 'o'?

It defaults to UTF-8, but the editor shows only blanks after the umlauted 'o'.

thanks
pecan

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #2 on: May 17, 2006, 02:32:54 pm »
We don't have codepage/Unicode detection, unfortunately. It is quite a complicated matter.

The current version of Code::Blocks uses a hack to make things work for most people, and that is simply setting the encoding to "system default". Most of the time, that happens to be the correct encoding, and it works.
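
For illustration only (this is not the actual C::B code, and the helper name is made up), that fallback amounts to roughly this in wxWidgets terms:
Code
#include <wx/intl.h>
#include <wx/strconv.h>
#include <wx/string.h>

// Hypothetical helper: decode a raw byte buffer using the system
// default encoding, because we cannot reliably detect the real one.
wxString DecodeWithSystemDefault(const char* bytes, size_t len)
{
    wxFontEncoding enc = wxLocale::GetSystemEncoding(); // e.g. ISO-8859-1, CP1252, UTF-8
    wxCSConv conv(enc);
    return wxString(bytes, conv, len);
}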

If you have an idea how to determine the document encoding efficiently (or maybe even own code to do that / know a free library), please step forward :)

After 1.0, we plan to implement something similar to what web browsers do for automatic document encoding detection, i.e. build a histogram over the input file and do statistical matching.
I don't know whether this is terribly efficient, but it should be quite failsafe.
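
Just to sketch that idea (the profile numbers would come from real per-encoding statistics; everything here is a placeholder, not working detection code):
Code
#include <cstddef>

// Toy sketch: build a byte histogram over the input and score it
// against a reference profile for one candidate encoding/language.
// A real detector would keep one profile per encoding and pick the
// best-scoring one.
double ScoreAgainstProfile(const unsigned char* data, size_t len,
                           const double profile[256])
{
    size_t counts[256] = { 0 };
    for (size_t i = 0; i < len; ++i)
        ++counts[data[i]];

    double score = 0.0;
    for (int b = 0; b < 256; ++b)
        score += profile[b] * (len ? (double)counts[b] / len : 0.0);
    return score; // higher means closer to this profile
}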

Any better idea?
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline mandrav

  • Project Leader
  • Administrator
  • Lives here!
  • *****
  • Posts: 4315
    • Code::Blocks IDE
Re: Linux won't edit Jonsson umlauted name
« Reply #3 on: May 17, 2006, 03:06:47 pm »
What encoding should I set for the Linux Codeblocks editor in order to edit the AngelScript files containing an umlauted 'o'?

It defaults to UTF-8, but the editor shows only blanks after the umlauted 'o'.

thanks
pecan


I've set it to ISO-8859-1 and I can edit the AngelScript files just fine...
Be patient!
This bug will be fixed soon...

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #4 on: May 17, 2006, 03:12:34 pm »
If you have an idea how to determine the document encoding efficiently (or maybe even own code to do that / know a free library), please step forward :)

Hope this helps :)
http://www.mozilla.org/projects/intl/
http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
http://www.mozilla.org/projects/intl/chardet.html
http://www.mozilla.org/projects/intl/ChardetInterface.htm

and lastly, "How to build standalone universal charset detector from Mozilla source":
http://www.mozilla.org/projects/intl/detectorsrc.html

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #5 on: May 17, 2006, 03:25:23 pm »
Actually, searching Google for "mozilla charset detector" was what we did 4 months ago, thank you :)
This is what "similar to how web browsers do [...] histogram" was supposed to mean.

We're actually looking for something a bit smaller and more manageable than Mozilla. Think of TinyXML (TinyCharsetDetect :)).
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline Pecan

  • Plugin developer
  • Lives here!
  • ****
  • Posts: 2783
Re: Linux won't edit Jonsson umlauted name
« Reply #6 on: May 17, 2006, 03:36:44 pm »

I've set it to ISO-8859-1 and I can edit the AngelScript files just fine...



Thanks. I did not know what would be appropriate.

What happens to a submitted patch whose hunk gets rejected because patch cannot understand the encoding?

Is there something I can do to avoid the hunk rejection?

I had to edit and re-save as_config.h etc. before patch would work.
pecan

« Last Edit: May 17, 2006, 03:40:17 pm by Pecan »

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #7 on: May 17, 2006, 09:44:10 pm »
We're actually looking for something a bit smaller and more manageable than Mozilla.
Just something to note: at first I got the impression that one would need to compile the whole (really big) Mozilla suite, or at least XPCOM, but actually it is only a few files without any dependencies (neither external nor Mozilla-internal). It seems really lightweight; just check the repository.

Here are the instructions for using the "pretty easy to use" universal detector in your own project: http://www.mozilla.org/projects/intl/detectorsrc.html

You'll be writing a C::B wrapper for it, so the actual implementation can be decided later anyway.

BTW, Mozilla's composite approach using the coding scheme, character distribution, and 2-char sequence methods is really clever. If we can reuse and wrap it, the time will be better spent, but TinyCharsetDetect doesn't seem bad either :D

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #8 on: May 18, 2006, 11:01:37 am »
Here are the instructions for using the "pretty easy to use" universal detector in your own project
Yes, we've been aware of that for many months. It is just that it doesn't help at all.  :)

Quote
You'll be writing a C::B wrapper for it, so the actual implementation can be decided later anyway.
No, we would not be writing a wrapper; we would have to rewrite a good bit of it to make any sensible use of it. Think about it.
If it were as easy as writing a small C++ wrapper, we'd have done that months ago.
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #9 on: May 18, 2006, 11:05:34 am »
Here are the instructions for using the "pretty easy to use" universal detector in your own project
Yes, we've been aware of that for many months. It is just that it doesn't help at all.  :)

Quote
You'll be writing a C::B wrapper for it, so the actual implementation can be decided later anyway.
No, we would not be writing a wrapper; we would have to rewrite a good bit of it to make any sensible use of it. Think about it.
If it were as easy as writing a small C++ wrapper, we'd have done that months ago.

No problem, if you want to / have the time to do so. I'll be glad to use it in my own projects too, if it doesn't have more dependencies than wx.

Offline MortenMacFly

  • Administrator
  • Lives here!
  • *****
  • Posts: 9694
Re: Linux won't edit Jonsson umlauted name
« Reply #10 on: May 18, 2006, 01:26:22 pm »
I stumbled across a D forum (looking for something completely different ;-)) and found how the UTF detection for source code files seems to be done for the D compiler. It looks quite similar to:
Code
enum cbEncoding
{
  UTF_8,
  UTF_16LE,
  UTF_16BE,
  UTF_32LE,
  UTF_32BE
};

// Input:  s:   the first four bytes of the file
//         len: the number of valid bytes in s (may be less than four
//              if the file itself is shorter)
// Output: file encoding
cbEncoding DetectEncoding(const unsigned char* s, size_t len)
{
  unsigned char a = len >= 1 ? s[0] : 1;
  unsigned char b = len >= 2 ? s[1] : 1;
  unsigned char c = len >= 3 ? s[2] : 1;
  unsigned char d = len >= 4 ? s[3] : 1;

  if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return UTF_32LE;
  if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return UTF_32BE;
  if (a==0xFF && b==0xFE)                       return UTF_16LE;
  if (a==0xFE && b==0xFF)                       return UTF_16BE;
  if (a==0xEF && b==0xBB && c==0xBF)            return UTF_8;
  if            (b==0x00 && c==0x00 && d==0x00) return UTF_32LE;
  if (a==0x00 && b==0x00 && c==0x00)            return UTF_32BE;
  if            (b==0x00)                       return UTF_16LE;
  if (a==0x00)                                  return UTF_16BE;

  return UTF_8;
}
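
Calling it would be something like this (untested sketch):
Code
#include <cstdio>

// Read up to four bytes from a file and guess its encoding.
cbEncoding GuessFileEncoding(const char* filename)
{
  unsigned char buf[4] = { 0 };
  size_t len = 0;
  if (FILE* f = fopen(filename, "rb"))
  {
    len = fread(buf, 1, 4, f);
    fclose(f);
  }
  return DetectEncoding(buf, len);
}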
Now, I definitely don't understand enough about Unicode to really contribute knowledge here, but this may (!) be a starting point...?!
With regards, Morten.
Compiler logging: Settings->Compiler & Debugger->tab "Other"->Compiler logging="Full command line"
C::B Manual: https://www.codeblocks.org/docs/main_codeblocks_en.html
C::B FAQ: https://wiki.codeblocks.org/index.php?title=FAQ

Offline mandrav

  • Project Leader
  • Administrator
  • Lives here!
  • *****
  • Posts: 4315
    • Code::Blocks IDE
Re: Linux won't edit Jonsson umlauted name
« Reply #11 on: May 18, 2006, 01:30:29 pm »
Quote
Now, I definitely don't understand enough about Unicode to really contribute knowledge here, but this may (!) be a starting point...?!
With regards, Morten.

This kind of detection, i.e. checking for an existing BOM (byte order mark), is something we already do. Unfortunately it's not enough...
Be patient!
This bug will be fixed soon...

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #12 on: May 18, 2006, 02:07:04 pm »
The problem lies in the files that do not have a BOM. These are 90% of all files, as it happens, since this is Microsoft proprietary stuff :(

On documents that have no BOM, it boils down to either knowing the correct encoding or doing a wild guess based on the actual content. Something that is relatively easy to do is to spot UTF-16 encoded files, since they have unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is not ANSI and not UTF-8 encoded (if you try to decode it as UTF-8, and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most every other browser) does as the first step, too.
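
Roughly, the UTF-8 legality check boils down to a loop like this (simplified sketch; a real check would also reject overlong forms and other corner cases):
Code
#include <cstddef>

// Return false if the buffer contains byte sequences that are illegal
// in UTF-8 - then the document is some 8-bit codepage (or UTF-16 etc.),
// but certainly not UTF-8.
bool LooksLikeUtf8(const unsigned char* s, size_t len)
{
    size_t i = 0;
    while (i < len)
    {
        unsigned char c = s[i++];
        size_t extra;
        if      (c < 0x80)           extra = 0; // plain ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else                         return false; // illegal lead byte
        while (extra--)
        {
            if (i >= len || (s[i] & 0xC0) != 0x80)
                return false; // missing or invalid continuation byte
            ++i;
        }
    }
    return true;
}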

The true problem lies in documents that are not obviously illegal UTF-8. For these, browsers typically build a histogram and compare the peak frequencies to a known "profile" for a given language/encoding. In languages based on the latin alphabet like most European and Balkan languages, the top 30 characters make up 95%. In Japanese, the top 60 characters make up about half of a "normal" text, and in Chinese, you have a couple of hundred "top" characters. Thus it becomes obvious that the less Western, the harder. You need a lot more input text to make a reasonable guess.

However, the main problem is the point which Takeshi missed completely and which is why I said that using the Mozilla encoding detector is useless.
Source files consist of 80-90% ANSI characters (keywords, operators, white space). Only comments and character constants contain "unusual" characters at all (sometimes only a single character in a source file!). This means that building a histogram over the flat input file and comparing to known language statistics is entirely useless - the characters in question only make up a tiny fraction of the document.

What we will probably have to do will be either a parser that isolates character constants and comments and only parses those, or we will have to use a greatly different statistical model. Or we may have to choose a completely different strategy.
In any case, it is not as easy as saying "hey, we can use Mozilla's engine" - that won't work.
I wish it would... :(
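
Just to sketch the "isolate comments and character constants" idea (very naive; escapes, character literals and plenty of corner cases are ignored):
Code
#include <cstddef>
#include <string>

// Collect only the bytes inside //, /* */ comments and "..." string
// literals of a C/C++-like source buffer; only these bytes would be
// fed to the statistical detector.
std::string ExtractCommentsAndStrings(const std::string& src)
{
    std::string out;
    for (size_t i = 0; i < src.size(); ++i)
    {
        if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '/')
        {
            i += 2;
            while (i < src.size() && src[i] != '\n') out += src[i++];
        }
        else if (src[i] == '/' && i + 1 < src.size() && src[i + 1] == '*')
        {
            i += 2;
            while (i + 1 < src.size() && !(src[i] == '*' && src[i + 1] == '/'))
                out += src[i++];
            ++i; // step to the '/' of the closing "*/"; the loop's ++i skips it
        }
        else if (src[i] == '"')
        {
            ++i;
            while (i < src.size() && src[i] != '"') out += src[i++];
        }
    }
    return out;
}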
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline MortenMacFly

  • Administrator
  • Lives here!
  • *****
  • Posts: 9694
Re: Linux won't edit Jonsson umlauted name
« Reply #13 on: May 18, 2006, 02:57:12 pm »
Meanwhile I read the document provided with the Mozilla (Firefox) sources in mozilla\extensions\universalchardet\doc. I have to say: wow - statistical analysis for the detection... it seems there is really something wrong with this whole business in general. Anyway:
What we will probably have to do will be either a parser that isolates character constants and comments and only parses those, or we will have to use a greatly different statistical model. Or we may have to choose a completely different strategy.
I guess it's even worse. As far as I understand, variable names can be in Unicode for some compilers, too. And (not to forget) strings/chars within the sources, as you said. This means one would need to know what counts as a valid comment/string marker -> is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult. I wonder whether it makes sense to try this at all, because every solution will depend on a lot of variables and will never work 100% correctly (as has been said already). Isn't it better to support a limited number of file types, detect when a file is of unknown type, and ask the user to convert it into something "general"? So rather than trying to be very smart, be very strict?
With regards, Morten.
« Last Edit: May 18, 2006, 02:59:30 pm by MortenMacFly »
Compiler logging: Settings->Compiler & Debugger->tab "Other"->Compiler logging="Full command line"
C::B Manual: https://www.codeblocks.org/docs/main_codeblocks_en.html
C::B FAQ: https://wiki.codeblocks.org/index.php?title=FAQ

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #14 on: May 18, 2006, 04:41:10 pm »
The problem lies in the files that do not have a BOM. These are 90% of all files, as it happens, since this is Microsoft proprietary stuff :(

On documents that have no BOM, it boils down to either knowing the correct encoding or doing a wild guess based on the actual content. Something that is relatively easy to do is to spot UTF-16 encoded files, since they have unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is not ANSI and not UTF-8 encoded (if you try to decode it as UTF-8, and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most every other browser) does as the first step, too.
Yes, there are a few more tricks that (positively, if found) identify an encoding: a substantial amount of valid multibyte characters suggests UTF-8, "~{" and ESC sequences make ISO-2022-JP very likely, and a few more.


The true problem lies in documents that are not obviously illegal UTF-8. For these, browsers typically build a histogram and compare the peak frequencies to a known "profile" for a given language/encoding. In languages based on the latin alphabet like most European and Balkan languages, the top 30 characters make up 95%. In Japanese, the top 60 characters make up about half of a "normal" text, and in Chinese, you have a couple of hundred "top" characters. Thus it becomes obvious that the less Western, the harder. You need a lot more input text to make a reasonable guess.
Yes, though actually this isn't so much "language dependent" as "encoding dependent": some encodings (e.g. Japanese EUC, ISO, etc.) are easier to detect than others.

The worst situation is the East Asian encodings, for sure; for example the different encodings of Chinese, which share a lot of code points with each other and with other encodings (such as the Japanese ones), which makes detection more difficult.

However, the main problem is the point which Takeshi missed completely and which is why I said that using the Mozilla encoding detector is useless.

But you said "We're actually looking for something a bit smaller and more manageable than Mozilla.", so I guessed that Mozilla was ruled out only because of that. Anyway, I don't think it is useless, given this:

Source files consist of 80-90% ANSI characters (keywords, operators, white space). Only comments and character constants contain "unusual" characters at all (sometimes only a single character in a source file!).
Yes, very true. However, one of Mozilla's algorithms is the 2-char sequence analysis, which marks all ASCII (keywords, spaces, etc.) as noise.

This means that building a histogram over the flat input file and comparing to known language statistics is entirely useless - the characters in question only make up a tiny fraction of the document.

What we will probably have to do will be either a parser that isolates character constants and comments and only parses those
Yes, but you're calling it "entirely useless" and I don't think it is; we can still use it:
The idea of a parser that isolates comments is great: we can parse them out and, once collected, feed them to the Mozilla detector. :D

I guess it's even worse. As far as I understand, variable names can be in Unicode for some compilers, too.
Yes, that is right, but we're talking about "probabilities", so with the comments from the source alone we have a good chance, as it's not a very common scenario to have Unicode variable names but no Unicode comments :)


And (not to forget) strings/chars within the sources, as you said. This means one would need to know what counts as a valid comment/string marker -> is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult.
Thanks for pointing that out; in fact it is very easy, and the comment characters can live in the C::B lexers (in fact the .properties lexer files already have them :P).
Also, exposing them through the SDK will increase reuse - it's exactly what the Code Statistics plugin uses right now.


Isn't it better to support a limited number of file types, detect when a file is of unknown type, and ask the user to convert it into something "general"?

That is a good point; all of this is complex, but it's a feature for the users, and there is a lot of source code around the world that probably doesn't have the choice of using something more "general" (because of the tools being used, an employer, etc.).

This is an interesting topic; however, I don't think it's a top priority, and it would take a lot of human resources/time (it could have been interesting as a Google SoC project 8)).

BTW, the new compiler framework and the blocks are starting to rock, so better to keep this for the future :P