The problem lies in the files that do not have a BOM. These are 90% of all files, as it happens, since the BOM is Microsoft proprietary stuff.
On documents that have no BOM, it boils down to either knowing the correct encoding or making a wild guess based on the actual content. Something that is relatively easy to do is to identify UTF-16 encoded files, since they contain unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is neither ANSI nor UTF-8 encoded (if you try to decode it as UTF-8 and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most every other browser) does as the first step, too.
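To illustrate, a minimal sketch of those two first-step checks could look like this (not the actual Mozilla code; the zero-byte threshold is just an assumption):

[code]
#include <cstddef>
#include <string>

// Heuristic: UTF-16 text with mostly Latin content contains a zero byte for
// nearly every character, so a large share of 0x00 bytes is a strong hint.
bool LooksLikeUtf16(const std::string& data)
{
    if (data.empty())
        return false;
    std::size_t zeros = 0;
    for (unsigned char c : data)
        if (c == 0x00)
            ++zeros;
    // The threshold is arbitrary; a real detector would also look at whether
    // the zeros sit on even or odd positions (endianness).
    return zeros > data.size() / 4;
}

// Strict UTF-8 check: if an illegal sequence shows up, the file is neither
// plain ASCII nor UTF-8. (Overlong sequences and surrogates are not rejected
// here; a full validator would do that as well.)
bool IsValidUtf8(const std::string& data)
{
    std::size_t i = 0;
    while (i < data.size())
    {
        unsigned char c = data[i];
        std::size_t extra;
        if      (c < 0x80)           extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                       // illegal lead byte
        if (i + extra >= data.size())
            return false;                        // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((static_cast<unsigned char>(data[i + j]) & 0xC0) != 0x80)
                return false;                    // bad continuation byte
        i += extra + 1;
    }
    return true;
}
[/code]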
Yes, there are a few more tricks to (positively, if found) identify an encoding, like a substantial amount of multibyte characters in UTF-8, or "~{" and ESC sequences, which make ISO-2022-JP very likely, and a few more.
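For example, a positive check for ISO-2022-JP could simply look for its characteristic escape sequences (a sketch covering only the two most common introducers):

[code]
#include <string>

// ISO-2022-JP announces its character sets with escape sequences; "ESC $ B"
// and "ESC $ @" switch to JIS X 0208. Finding one of them in an otherwise
// 7-bit file is a strong positive hint.
bool HasIso2022JpEscapes(const std::string& data)
{
    return data.find("\x1B$B") != std::string::npos ||
           data.find("\x1B$@") != std::string::npos;
}
[/code]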
The true problem lies in documents that are not obviously illegal UTF-8. For these, browsers typically build a histogram and compare the peak frequencies to a known "profile" for a given language/encoding. In languages based on the Latin alphabet, like most European and Balkan languages, the top 30 characters make up 95% of the text. In Japanese, the top 60 characters make up about half of a "normal" text, and in Chinese you have a couple of hundred "top" characters. Thus it becomes obvious: the less Western the language, the harder the guess. You need a lot more input text to make a reasonable guess.
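A rough sketch of that histogram idea (the profile here is a placeholder, not a real language profile, and a real detector compares only the top peaks rather than the whole table):

[code]
#include <array>
#include <cmath>
#include <cstddef>
#include <string>

// Normalized byte-frequency histogram of the input.
std::array<double, 256> BuildHistogram(const std::string& data)
{
    std::array<double, 256> freq{};
    for (unsigned char c : data)
        freq[c] += 1.0;
    if (!data.empty())
        for (double& f : freq)
            f /= data.size();
    return freq;
}

// Compare against a known language/encoding profile; a smaller distance means
// a better match. Real profiles are trained on large corpora.
double ProfileDistance(const std::array<double, 256>& sample,
                       const std::array<double, 256>& profile)
{
    double d = 0.0;
    for (std::size_t i = 0; i < 256; ++i)
        d += std::fabs(sample[i] - profile[i]);
    return d;
}
[/code]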
Yes, though actually this isn't so much "language dependent" as "encoding dependent", since there are several encodings (e.g. Japanese EUC, ISO, etc.), some of which are easier to detect than others.
The worst situation is the East Asian encodings for sure; for example, the different encodings of Chinese share a lot of code points with each other and with other encodings (such as the Japanese ones), which makes detection more difficult.
However, the main problem is the point which Takeshi missed completely and which is why I said that using the Mozilla encoding detector is useless.
But you said "We're actually looking for something a bit smaller and more manageable than Mozilla.", so I guessed that Mozilla was ruled out only because of that. Anyway, I don't think it is useless, given this:
Source files consist of 80-90% ANSI characters (keywords, operators, white space). Only comments and character constants contain "unusual" characters at all (sometimes only a single character in a source file!).
Yes, very right. However, one of Mozilla's algorithms is the 2-char one, which marks all ASCII (keywords, spaces, etc.) as noise.
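If I understand that correctly, the effect on a source file is roughly this (a simplification, not the actual Mozilla code):

[code]
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified illustration of the "treat ASCII as noise" idea: collect byte
// pairs for the statistics, but skip any pair that is purely ASCII. In a
// source file this throws away keywords, operators and whitespace - which is
// good - but it also means there is very little text left to analyse.
std::vector<std::pair<unsigned char, unsigned char>>
CollectNonAsciiPairs(const std::string& data)
{
    std::vector<std::pair<unsigned char, unsigned char>> pairs;
    for (std::size_t i = 0; i + 1 < data.size(); ++i)
    {
        unsigned char a = data[i];
        unsigned char b = data[i + 1];
        if (a < 0x80 && b < 0x80)
            continue; // pure ASCII pair -> noise
        pairs.emplace_back(a, b);
    }
    return pairs;
}
[/code]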
This means that building a histogram over the flat input file and comparing to known language statistics is entirely useless - the characters in question only make up a tiny fraction of the document.
What we will probably have to do will be either a parser that isolates character constants and comments and only parses those
Yes, but you're calling it "entirely useless" and I don't think it is quite that; we can still use it:
The idea of a parser that extracts the comments is great: we can parse them and, once collected, feed them to the Mozilla detector.
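Something along these lines, as a minimal sketch (the function name is made up, and a real version would take the comment/string delimiters from the lexers rather than hard-coding the C/C++ ones):

[code]
#include <cstddef>
#include <string>

// Keep only line comments and string literals, so that only the text that can
// actually contain non-ASCII characters is handed to the charset detector.
// Block comments and escape sequences are ignored here for brevity.
std::string ExtractCommentsAndStrings(const std::string& source)
{
    std::string out;
    std::size_t i = 0;
    while (i < source.size())
    {
        if (source.compare(i, 2, "//") == 0)        // line comment
        {
            while (i < source.size() && source[i] != '\n')
                out += source[i++];
            out += '\n';
        }
        else if (source[i] == '"')                  // string literal
        {
            ++i;
            while (i < source.size() && source[i] != '"')
                out += source[i++];
            ++i;
        }
        else
            ++i;
    }
    return out;
}
[/code]

The collected text can then be handed to the detector in one go, which gives it a much better signal-to-noise ratio than the raw file.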
I guess it's even worse. As far as I understand, variable names could be in Unicode for some compilers, too.
Yes, that is right, but we're talking about "probabilities", so with the comments from the source alone we have a good chance, as it's not a very common scenario to have Unicode variable names but no Unicode in the comments.
And (not to forget) strings/chars within the sources, as you said. This means that it would be required to have knowledge about what a valid comment / string is -> so is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult.
Thanks for remembering that; in fact it is very easy, and the comment characters can exist in the C::B lexers (in fact the .properties lexer files already have them).
Also, making them available in the SDK will increase reuse; it's exactly what the Code Statistics plugin is using right now.
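For illustration, such lexer-provided information could be exposed as something like this (the struct and the entries are hypothetical, just to show the shape of the data):

[code]
#include <map>
#include <string>

// Illustrative only: a table of comment markers per lexer, along the lines of
// what the C::B lexer configuration files already provide.
struct CommentMarkers
{
    std::string lineComment;
    std::string streamCommentStart;
    std::string streamCommentEnd;
};

std::map<std::string, CommentMarkers> BuildCommentTable()
{
    return {
        { "cpp",        { "//", "/*", "*/" } },
        { "fortran",    { "!",  "",   ""   } },
        { "properties", { "#",  "",   ""   } },
    };
}
[/code]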
Isn't it better to support a limited number of file types and try to determine when a file is of unknown type and request the user to convert it into something "general"?
That is a good point; all of this is complex, but it's a feature for the users, and there is a lot of source code all over the world whose authors probably don't have the choice to use something more "general" (either because of the tools that are being used, because of some employer, etc.).
This is an interesting topic; however, I don't think it's a top priority, as it'll take a lot of human resources/time (it could have been interesting as a Google SoC project).
BTW, the new compiler framework and the blocks are starting to rock, so better keep this for the future.