Linux won't edit Jonsson umlauted name


MortenMacFly:
I stumbled across a D forum (looking for something completely different ;-)) and found how the UTF encoding detection for source code files seems to be done for the D compiler. It looks quite similar to this:

--- Code: ---enum cbEncoding
{
  UTF_8,
  UTF_16LE,
  UTF_16BE,
  UTF_32LE,
  UTF_32BE
};

// Input:  s:   the first four bytes of the file
//              (or all of them, if the file is shorter than four bytes),
//         len: how many bytes are actually in s.
// Output: the detected file encoding.
cbEncoding DetectEncoding(const unsigned char* s, size_t len)
{
  // Note: sizeof(s) would only yield the size of the pointer, so the
  // number of valid bytes has to be passed in explicitly.
  unsigned char a = len >= 1 ? s[0] : 1;
  unsigned char b = len >= 2 ? s[1] : 1;
  unsigned char c = len >= 3 ? s[2] : 1;
  unsigned char d = len >= 4 ? s[3] : 1;

  // Files starting with a BOM (byte order mark):
  if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return UTF_32LE;
  if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return UTF_32BE;
  if (a==0xFF && b==0xFE)                       return UTF_16LE;
  if (a==0xFE && b==0xFF)                       return UTF_16BE;
  if (a==0xEF && b==0xBB && c==0xBF)            return UTF_8;

  // No BOM: guess from the position of zero bytes, assuming the file
  // starts with an ASCII character.
  if            (b==0x00 && c==0x00 && d==0x00) return UTF_32LE;
  if (a==0x00 && b==0x00 && c==0x00)            return UTF_32BE;
  if            (b==0x00)                       return UTF_16LE;
  if (a==0x00)                                  return UTF_16BE;

  // Default: plain ASCII / UTF-8.
  return UTF_8;
}

--- End code ---
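For completeness, calling it would look roughly like this (just a sketch; the DetectFileEncoding wrapper name is made up):

--- Code: ---#include <cstdio>

// Hypothetical helper: read the first (up to) four bytes of a file and
// hand them to DetectEncoding() above.
cbEncoding DetectFileEncoding(const char* filename)
{
  unsigned char buf[4] = {0, 0, 0, 0};
  size_t n = 0;

  if (FILE* f = fopen(filename, "rb"))
  {
    n = fread(buf, 1, sizeof(buf), f);
    fclose(f);
  }
  return DetectEncoding(buf, n);
}
--- End code ---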
Now, I definitely don't understand enough about Unicode to really contribute knowledge here, but this may (!) be a starting point...?!
With regards, Morten.

mandrav:

--- Quote ---Now, I definitely don't understand enough about Unicode to really contribute knowledge here, but this may (!) be a starting point...?!
With regards, Morten.
--- End quote ---

This kind of detection, i.e. checking for an existing BOM (byte order mark), is something we already do. Unfortunately it's not enough...

thomas:
The problem lies in the files that do not have a BOM. These are 90% of all files, as it happens, since the BOM is Microsoft proprietary stuff :(

On documents that have no BOM, it boils down to either knowing the correct encoding or making a wild guess based on the actual content. Something that is relatively easy to do is to spot UTF-16 encoded files, since they contain unhealthy amounts of zero bytes, and it is similarly easy to tell whether a document is neither ANSI nor UTF-8 encoded (if you try to decode it as UTF-8 and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably almost every other browser) does as a first step, too.
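
For illustration, such an "is this valid UTF-8?" check could look roughly like this (a simplified sketch, not what C::B actually uses):

--- Code: ---#include <cstddef>

// Returns true if the buffer decodes as well-formed UTF-8.
// If it does not, the document is neither plain ASCII nor UTF-8.
// (Simplified: overlong 3/4-byte forms are not fully rejected.)
bool LooksLikeValidUtf8(const unsigned char* s, size_t len)
{
  size_t i = 0;
  while (i < len)
  {
    unsigned char c = s[i];
    size_t follow;

    if      (c <= 0x7F)              { ++i; continue; }   // plain ASCII
    else if (c >= 0xC2 && c <= 0xDF) follow = 1;          // 2-byte sequence
    else if (c >= 0xE0 && c <= 0xEF) follow = 2;          // 3-byte sequence
    else if (c >= 0xF0 && c <= 0xF4) follow = 3;          // 4-byte sequence
    else return false;                                     // illegal lead byte

    if (i + follow >= len) return false;                   // truncated sequence

    for (size_t k = 1; k <= follow; ++k)
      if ((s[i + k] & 0xC0) != 0x80) return false;         // bad continuation byte

    i += follow + 1;
  }
  return true;
}
--- End code ---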

The true problem lies in documents that are not obviously illegal UTF-8. For these, browsers typically build a histogram and compare the peak frequencies to a known "profile" for a given language/encoding. In languages based on the Latin alphabet, like most European and Balkan languages, the top 30 characters make up 95% of a text. In Japanese, the top 60 characters make up about half of a "normal" text, and in Chinese you have a couple of hundred "top" characters. Thus it becomes obvious that the less Western the language, the harder the detection. You need a lot more input text to make a reasonable guess.
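
To make the histogram idea a little more concrete, here is a toy sketch (the expected[] table would have to come from statistics over a large corpus per language/encoding; this is not Mozilla's actual algorithm):

--- Code: ---#include <cstddef>

// Toy version of the "profile" comparison: count byte frequencies in the
// input and measure how far they are from a known profile for one
// language/encoding. The smaller the score, the better the match.
double ProfileDistance(const unsigned char* s, size_t len,
                       const double expected[256]) // expected frequencies, sum == 1
{
  double observed[256] = {0.0};
  for (size_t i = 0; i < len; ++i)
    observed[s[i]] += 1.0;

  double score = 0.0;
  for (int b = 0; b < 256; ++b)
  {
    double freq = (len > 0) ? observed[b] / len : 0.0;
    double diff = freq - expected[b];
    score += diff * diff;              // simple squared error
  }
  return score;
}

// A real detector would run this against one profile per candidate encoding
// and pick the one with the lowest score (given enough non-ASCII input).
--- End code ---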

However, the main problem is the point which Takeshi missed completely and which is why I said that using the Mozilla encoding detector is useless.
Source files consist of 80-90% ANSI characters (keywords, operators, white space). Only comments and character constants contain "unusual" characters at all (sometimes only a single character in a source file!). This means that building a histogram over the flat input file and comparing to known language statistics is entirely useless - the characters in question only make up a tiny fraction of the document.

What we will probably have to do is either write a parser that isolates character constants and comments and analyses only those, or use a greatly different statistical model. Or we may have to choose a completely different strategy.
In any case, it is not as easy as saying "hey, we can use Mozilla's engine" - that won't work.
I wish it would... :(
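
For the record, such a comment/string extractor for C/C++ could start out as simple as this (a very rough sketch; raw strings, trigraphs and friends are ignored, and a real version should be driven by the lexer definitions):

--- Code: ---#include <string>

// Collect only the bytes inside comments and string/char literals, so that
// a statistical detector sees mostly "interesting" characters instead of
// keywords and operators.
std::string ExtractCommentsAndLiterals(const std::string& src)
{
  std::string out;
  size_t i = 0;
  const size_t n = src.size();

  while (i < n)
  {
    if (src.compare(i, 2, "//") == 0)                      // line comment
    {
      i += 2;
      while (i < n && src[i] != '\n') out += src[i++];
    }
    else if (src.compare(i, 2, "/*") == 0)                 // block comment
    {
      i += 2;
      while (i + 1 < n && src.compare(i, 2, "*/") != 0) out += src[i++];
      i += 2;
    }
    else if (src[i] == '"' || src[i] == '\'')              // string/char literal
    {
      char quote = src[i++];
      while (i < n && src[i] != quote)
      {
        if (src[i] == '\\' && i + 1 < n) out += src[i++];  // keep escaped char
        out += src[i++];
      }
      ++i;                                                 // skip closing quote
    }
    else
      ++i;                                                 // ordinary code: ignore
  }
  return out;
}
--- End code ---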

MortenMacFly:
Meanwhile I have read the document provided with the Mozilla (Firefox) sources in mozilla\extensions\universalchardet\doc. I have to say: wow - statistical analysis for the detection... it seems there really is something wrong with this whole business in general. Anyway:

--- Quote from: thomas on May 18, 2006, 02:07:04 pm ---What we will probably have to do is either write a parser that isolates character constants and comments and analyses only those, or use a greatly different statistical model. Or we may have to choose a completely different strategy.

--- End quote ---
I guess it's even worse. As far as I understand, variable names can be in Unicode for some compilers, too. And (not to forget) strings/chars within the sources, as you said. This means it would be required to know what constitutes a valid comment / string -> is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult. I wonder if it makes sense to try it at all, because every solution will depend on a lot of variables and will never work 100% correctly (as has been said already). Isn't it better to support a limited number of file types, try to determine when a file is of an unknown type, and ask the user to convert it into something "general"? So rather than trying to be very smart, be very strict?
With regards, Morten.

takeshimiya:

--- Quote from: thomas on May 18, 2006, 02:07:04 pm ---The problem lies in the files that do not have a BOM. These are 90% of all files, as it happens, since the BOM is Microsoft proprietary stuff :(

On documents that have no BOM, it boils down to either knowing the correct encoding or making a wild guess based on the actual content. Something that is relatively easy to do is to spot UTF-16 encoded files, since they contain unhealthy amounts of zero bytes, and it is similarly easy to tell whether a document is neither ANSI nor UTF-8 encoded (if you try to decode it as UTF-8 and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably almost every other browser) does as a first step, too.

--- End quote ---
Yes, there are a few more tricks to (positively, if found) identify an encoding, like a substantial amount of multibyte characters for UTF-8, or "~{" and ESC sequences, which make ISO-2022-JP very likely, and a few more.
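
For example, a naive positive check for the ESC sequences could be as small as this (only the common ISO-2022-JP escapes are listed, just to show the idea):

--- Code: ---#include <cstddef>

// Naive positive check: ISO-2022-JP switches character sets with escape
// sequences such as ESC $ B / ESC $ @ (into JIS X 0208) and ESC ( B / ESC ( J
// (back to ASCII/Roman). A 7-bit file containing these is very likely
// ISO-2022-JP.
bool LooksLikeIso2022Jp(const unsigned char* s, size_t len)
{
  for (size_t i = 0; i + 2 < len; ++i)
  {
    if (s[i] != 0x1B) continue;                 // not ESC
    char c1 = (char)s[i + 1];
    char c2 = (char)s[i + 2];
    if ((c1 == '$' && (c2 == 'B' || c2 == '@')) ||
        (c1 == '(' && (c2 == 'B' || c2 == 'J')))
      return true;
  }
  return false;
}
--- End code ---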



--- Quote from: thomas on May 18, 2006, 02:07:04 pm ---The true problem lies in documents that are not obviously illegal UTF-8. For these, browsers typically build a histogram and compare the peak frequencies to a known "profile" for a given language/encoding. In languages based on the Latin alphabet, like most European and Balkan languages, the top 30 characters make up 95% of a text. In Japanese, the top 60 characters make up about half of a "normal" text, and in Chinese you have a couple of hundred "top" characters. Thus it becomes obvious that the less Western the language, the harder the detection. You need a lot more input text to make a reasonable guess.

--- End quote ---
Yes, though actually this isn't so much "language-dependent" as "encoding-dependent": some encodings (e.g. Japanese EUC, ISO, etc.) are easier to detect than others.

The worst situation is the East Asian encodings, for sure; for example the different encodings of Chinese, as they share a lot of code points with each other and with other encodings (such as the Japanese ones), which makes it even more difficult.


--- Quote from: thomas on May 18, 2006, 02:07:04 pm ---However, the main problem is the point which Takeshi missed completely and which is why I said that using the Mozilla encoding detector is useless.

--- End quote ---

But you said "We're actually looking for something a bit smaller and more manageable than Mozilla.", so I guessed that Mozilla was ruled out only because of that. Anyway, I don't think it is useless, given this:


--- Quote from: thomas on May 18, 2006, 02:07:04 pm ---Source files consist of 80-90% ANSI characters (keywords, operators, white space). Only comments and character constants contain "unusual" characters at all (sometimes only a single character in a source file!).

--- End quote ---
Yes, quite right. However, one of Mozilla's algorithms, the 2-char sequence one, marks all ASCII (keywords, spaces, etc.) as noise.


--- Quote from: thomas on May 18, 2006, 02:07:04 pm ---This means that building a histogram over the flat input file and comparing to known language statistics is entirely useless - the characters in question only make up a tiny fraction of the document.

What we will probably have to do is either write a parser that isolates character constants and comments and analyses only those

--- End quote ---
Yes, but you're calling it "entirely useless" and I don't think it is yet; we can still use it:
The idea of a parser that isolates the comments is great, so we can extract them and, once collected, feed them to the Mozilla detector. :D


--- Quote from: MortenMacFly on May 18, 2006, 02:57:12 pm ---I guess it's even worse. As far as I understand, variable names can be in Unicode for some compilers, too.

--- End quote ---
Yes, that is right, but we're talking about "probabilities", so with the comments from the source alone we have a good chance: it's not a very common scenario to have Unicode variable names but no Unicode comments :)



--- Quote from: MortenMacFly on May 18, 2006, 02:57:12 pm ---And (not to forget) strings/chars within the sources, as you said. This means it would be required to know what constitutes a valid comment / string -> is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult.

--- End quote ---
Thanks for bringing that up; in fact it is quite easy, as the comment characters can live in the C::B lexers (in fact the .properties lexer files already have them :P).
Exposing them in the SDK would also increase reuse; it's exactly what the Code Statistics plugin is using right now.



--- Quote from: MortenMacFly on May 18, 2006, 02:57:12 pm --- Isn't it better to support a limited number of file types, try to determine when a file is of an unknown type, and ask the user to convert it into something "general"?

--- End quote ---

That is a good point. All of this is complex, but it's a feature for the users, and there is a lot of source code around the world whose authors probably don't have the choice of using something more "general" (whether because of the tools being used, some employer, etc.).

This is an interesting topic; however, I don't think it's a top priority, as it will take a lot of human resources/time (it could have been interesting as a Google SoC project 8)).

BTW, the new compiler framework and the blocks are starting to rock, so better to keep this for the future :P
