I think thomas meant, that the striping should only take place, after we checked for easy-to-detect encodings.
Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
[There are just too many of them, wxScite does not even support some, and manually converting them is a pain. Also, only a tiny fraction of the population uses them. We should concentrate on making it work well for 98%, not waste time on making it work badly for 100%.
Those few actually using anything else than 8-bit codepages, UTF-8, and UTF-16 can really be asked to convert to UTF-8 or UTF-16. Since both UTF-8 and UTF-16 are established standards that work excellently, there should not be much to object.]Regarding UTF-8, if there are no valid UTF-8 multibyte sequences to be found (which you can check easily) then either there are none because the text contains no characters that need encoding, or you will find illegal sequences.
In the former case, you don't need to care, as you'll simply use ANSI encoding (or whatever codepage, it shouldn't matter, you can even use UTF-8, just to be sure for the future).
Only in the latter case, it gets
really complicated. Then, and only then, you may have characters in some unknown encoding which might mean anything, for example 8859-1 Ø could as well be 8859-2 Ř, and you have no idea what it may be.
Now, the problem lies in finding out whether it is a Ø or a Ř, this has to be done from a statistical model.
Martin:
What if we use word boundaries to feed complete words to the statistical model?
For example, if we first skip over all ANSI chars to identify candidate "strange chars", and then use Scintilla's word retrieval functions.
If the input text were "Ångström was a phycisist", then we would use "Ångström" for the statistical model, rather than "Åö".
In this case, we do not need to bother what might be a comment or character constant, we don't need to parse programming-language specific stuff, but we still get somewhat of a "complete" fragment (nice oxymoron). It is a compromise between a full parser and only using a fraction of the statistical data available (the frequencies of "ngstrm" may give a clue, too).
I don't remember exactly who came up with counting letters in the English language first, you'd probably praise the late Mr. Shannon, but Arthur Conan Doyle wrote his tale "The Dancing Men" 13 years before Shannon was even born...
Anyway, this story teaches us that seemingly unimportant letters may be interesting too