Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
Yep, just remember that East-Asian languages usually will not have the "insane amount of 0's" we're looking for.
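To make the zero-byte idea concrete, here is a minimal sketch of that heuristic; the function name and the thresholds are made up for illustration, and, as you say, it will stay silent on East-Asian UTF-16 text, which rarely contains 0x00 bytes.

#include <cstddef>
#include <string>

enum class Utf16Guess { None, LittleEndian, BigEndian };

// Count NUL bytes at even and odd offsets; mostly-ASCII UTF-16 text puts a
// NUL in every other byte, so one of the two counts should dominate.
Utf16Guess guessUtf16(const std::string& buf)
{
    if (buf.size() < 4)
        return Utf16Guess::None;

    std::size_t zerosEven = 0, zerosOdd = 0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] != '\0')
            continue;
        if (i % 2 == 0) ++zerosEven; else ++zerosOdd;
    }

    // Arbitrary thresholds: "lots of zeros" on one side, almost none on the other.
    const std::size_t pairs = buf.size() / 2;
    if (zerosEven > pairs / 2 && zerosOdd < pairs / 10)
        return Utf16Guess::BigEndian;     // high byte first: 00 41 00 42 ...
    if (zerosOdd > pairs / 2 && zerosEven < pairs / 10)
        return Utf16Guess::LittleEndian;  // low byte first:  41 00 42 00 ...
    return Utf16Guess::None;
}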
Also, only a tiny fraction of the population uses them.
I don't know how accurate that percentage is, given that a quarter of the world's population speaks East-Asian languages, but...
We should concentrate on making it work well for 98%, not waste time on making it work badly for 100%.
...I don't know how many of them aren't using Unicode these days for their source code, so that guess is probably OK.
Those few actually using anything other than 8-bit codepages, UTF-8, and UTF-16 can reasonably be asked to convert to UTF-8 or UTF-16. Since both UTF-8 and UTF-16 are established standards that work excellently, there should not be much to object to.
Yes, UTF-8 rocks for programming, thanks to its backwards compatibility with ASCII. It should be the default nowadays.
What if we use word boundaries to feed complete words to the statistical model?
For example, we could first skip over all ANSI chars to identify candidate "strange chars", and then use Scintilla's word retrieval functions.
If the input text were "Ångström was a physicist", then we would use "Ångström" for the statistical model, rather than "Åö".
In this case, we do not need to bother about what might be a comment or a character constant, and we don't need to parse programming-language-specific stuff, but we still get somewhat of a "complete" fragment (nice oxymoron). It is a compromise between a full parser and using only a fraction of the statistical data available (the frequencies of "ngstrm" may give a clue, too).
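To make that concrete, here is a rough standalone sketch of the scan-and-expand step; inside the editor one would presumably use Scintilla's word start/end position calls instead of the hand-rolled expansion below, and the word-character test is just a guess for illustration.

#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Letters, digits and '_' count as word characters; any byte >= 0x80 is one
// of the "strange chars" we are after and also belongs to the word.
static bool isWordByte(unsigned char c)
{
    return c >= 0x80 || std::isalnum(c) != 0 || c == '_';
}

// Collect every word that contains at least one non-ASCII byte.
std::vector<std::string> wordsWithStrangeChars(const std::string& text)
{
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < text.size()) {
        if (static_cast<unsigned char>(text[i]) < 0x80) {
            ++i;
            continue;
        }
        // Expand the hit to the enclosing word, then continue after its end.
        std::size_t start = i, end = i + 1;
        while (start > 0 && isWordByte(static_cast<unsigned char>(text[start - 1])))
            --start;
        while (end < text.size() && isWordByte(static_cast<unsigned char>(text[end])))
            ++end;
        words.push_back(text.substr(start, end - start));
        i = end;
    }
    return words;
}

int main()
{
    // Latin-1 bytes for "Ångström was a physicist": the model gets the whole
    // word "Ångström" instead of just the two strange bytes.
    for (const std::string& w : wordsWithStrangeChars("\xC5ngstr\xF6m was a physicist"))
        std::cout << w << '\n';
}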
That is a good approach too, but it will consume a lot of CPU resources and require a lot of statistical analysis. One thing to point out: those statistics can guess wrong.
For example, I tend to write English code and English comments, but for some words I use my mother tongue (when I don't know the English word), and I don't think that's an uncommon situation.
Ok, I think we have two good (and not extremely complex) solutions:
1) Try to detect whether an "illegal Unicode sequence" exists (sketched below). If one exists, show a dialog to the user asking for an encoding, defaulting to the current system encoding.
2) Use a comment parser, like the one the CodeStatistics Plugin uses, and feed that block of comments to the Mozilla encoding detector class (also sketched below).
Either of the two seems to work reasonably well for most people, without any statistical analysis on our part, and in a relatively easy way.
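For option 1, here is a minimal sketch of what the check could look like, assuming that "illegal Unicode sequence" simply means "not well-formed UTF-8"; the dialog part is left out.

#include <cstddef>
#include <string>

// Returns true if buf is well-formed UTF-8 (pure ASCII is a trivial subset).
bool isValidUtf8(const std::string& buf)
{
    std::size_t i = 0;
    while (i < buf.size()) {
        unsigned char c = static_cast<unsigned char>(buf[i]);
        std::size_t len;
        unsigned long cp;

        if (c < 0x80)                { len = 1; cp = c; }
        else if ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; }
        else return false;                       // stray continuation byte or 0xF8+

        if (i + len > buf.size())
            return false;                        // truncated sequence at end of buffer

        for (std::size_t k = 1; k < len; ++k) {
            unsigned char cc = static_cast<unsigned char>(buf[i + k]);
            if ((cc & 0xC0) != 0x80)
                return false;                    // missing continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        // Reject overlong forms, surrogates and out-of-range code points.
        if ((len == 2 && cp < 0x80) ||
            (len == 3 && cp < 0x800) ||
            (len == 4 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return false;

        i += len;
    }
    return true;
}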
What do you think?
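And to illustrate option 2, a rough sketch of handing the collected comment blocks to a detector. I'm using the standalone uchardet library (which descends from Mozilla's universal charset detector) as a stand-in for the Mozilla class; the comment parsing itself is left out. Link with -luchardet.

#include <uchardet/uchardet.h>
#include <iostream>
#include <string>
#include <vector>

// Feed the collected comment blocks to the detector and return its guess
// (an empty string means the detector could not decide).
std::string guessEncodingFromComments(const std::vector<std::string>& comments)
{
    uchardet_t detector = uchardet_new();
    for (const std::string& block : comments)
        uchardet_handle_data(detector, block.data(), block.size());
    uchardet_data_end(detector);

    std::string charset = uchardet_get_charset(detector);  // e.g. "WINDOWS-1252"
    uchardet_delete(detector);
    return charset;
}

int main()
{
    // Pretend these blocks came from the comment parser (Latin-1 bytes here).
    std::vector<std::string> comments = { "// r\xE9sum\xE9 of the fix\n" };
    std::cout << guessEncodingFromComments(comments) << '\n';
}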