Developer forums (C::B DEVELOPMENT STRICTLY!) > Development
Linux won't edit Jonsson umlauted name
takeshimiya:
Seems like SMF made all of you miss my previous post on Page #1 (?)
--- Quote from: thomas on May 19, 2006, 11:37:49 am ---
--- Quote from: MortenMacFly on May 19, 2006, 11:06:39 am ---Statistically speaking: Yes!
[...] amplify parts that have a higher probability for unicode characters
--- End quote ---
Exactly :)
But do we want to know whether they're strings or comments or whatever? Do we care what identifies a string constant?
--- End quote ---
Yes and no, depending on the encoding we're trying to detect and the algorithm used.
--- Quote from: thomas on May 19, 2006, 11:37:49 am ---If we dead-strip everything with a value less than 127, we eliminate all ANSI characters (all keywords, operators, and all English comment/constant text). All that remains are the (amplified) non-ANSI characters, if there are any. It greatly biases the distribution towards them, but is that really a bad thing?
--- End quote ---
It would be great if it were as easy as dead-stripping everything with a value below 127. :D
It could only help when trying to detect a single-byte encoding, but not much more.
Think about it for a bit: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all of the byte values are lower than 127.
--- Quote from: thomas on May 19, 2006, 11:37:49 am ---The question is just, can you judge a charset/language only by looking at the special characters? Could you tell that "Fürwahr, Du bist der Größte" is "German / ISO-8859-1" only by looking at "üöß"? I think you can.
--- End quote ---
I think you can't guess an encoding by looking at individual special characters, but you can by looking at combinations of them (sequences of 2-3 characters).
You can't tell by looking only at "ü", for example, because that byte value can exist in almost any encoding, and even in many languages (in fact my name has an "ü", and that letter is legal in Spanish :P).
A method that works great for detecting single-byte encodings is the so-called "2-char sequence method". In the example above, you'd look for sequences that are common in German, like "öß", "Fü", etc.
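Just to make the idea concrete, here is a rough sketch (the bigram weights are completely made up; a real detector would use frequency tables built from large text corpora, as Mozilla did):

--- Code: ---
#include <cstddef>
#include <iostream>
#include <string>

struct Bigram { unsigned char a; unsigned char b; int weight; };

// Toy model: 2-byte sequences that are frequent in German text encoded as
// ISO-8859-1, with made-up weights. A real detector would use tables built
// from large corpora (as the Mozilla project did).
static const Bigram germanLatin1[] = {
    { 0xFC, 'r',  5 },  // "ür" as in "für"
    { 'F',  0xFC, 4 },  // "Fü"
    { 0xF6, 0xDF, 3 },  // "öß" as in "Größte"
};

// Score a raw byte buffer against one (language, encoding) model.
// The same buffer would be scored against every candidate model;
// the highest-scoring model wins the guess.
int ScoreGermanLatin1(const std::string& text)
{
    int score = 0;
    const std::size_t n = sizeof(germanLatin1) / sizeof(germanLatin1[0]);
    for (std::size_t i = 0; i + 1 < text.size(); ++i)
        for (std::size_t j = 0; j < n; ++j)
            if ((unsigned char)text[i]     == germanLatin1[j].a &&
                (unsigned char)text[i + 1] == germanLatin1[j].b)
                score += germanLatin1[j].weight;
    return score;
}

int main()
{
    const std::string sample = "F\xFCrwahr, Du bist der Gr\xF6\xDFte"; // ISO-8859-1 bytes
    std::cout << "German/ISO-8859-1 score: " << ScoreGermanLatin1(sample) << std::endl;
}
--- End code ---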
Another thing we must account for is that we're trying to detect the encoding, not specifically the language. So we must remember that some encodings share a lot of codepoints with other encodings; this is the common case among the East Asian encodings.
Thus, for example, if you find the sequence "連続" you could guess the text is Japanese (it could be Chinese too), but you can't tell which encoding it is among the ones that share so many codepoints.
I'm guessing you've read it already; it explains very well the three approaches used in Mozilla. Some work great for detecting single-byte encodings and others for detecting multi-byte encodings, which is why the composite approach was used: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html
Defender:
--- Quote from: Takeshi Miya on May 19, 2006, 12:34:53 pm ---
Think about it for a bit: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all of the byte values are lower than 127.
--- End quote ---
I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings.
--- Quote from: thomas ---On documents that have no BOM, it boils down to either knowing the correct encoding or doing a wild guess based on the actual content. Something that is relatively easy to do is to find out UTF-16 encoded files, since they have unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is not ANSI and not UTF-8 encoded (if you try to decode it as UTF-8, and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most every other browser) does as the first step, too.
--- End quote ---
Excuse me if I am wrong.
Defender
MortenMacFly:
--- Quote from: thomas on May 19, 2006, 11:37:49 am ---Exactly :)
--- End quote ---
Mmmmh.... I was just reading over at Wikipedia, from http://en.wikipedia.org/wiki/ISO/IEC_8859-1 to http://en.wikipedia.org/wiki/ISO/IEC_8859-16, about what such a bias would mean. Just to make sure I understand you correctly: if everything up to 127 (maybe even everything up to 160) is skipped, we truly skip all keywords, variables and symbols/signs. There is a very nice table at http://en.wikipedia.org/wiki/ISO_8859#Table that compares the different ISO variants.
What remains is what you would like to analyse statistically, right?
This sounds logical to me, yes, but it's difficult to judge whether this is a good approach. I think we really need an expert on the topic to answer (ideally someone who knows enough about all of these languages).
Still: I think this is a very minor addition (e.g. a simple stream manipulator) to the main part that would be required anyway. Unfortunately I know next to nothing about languages and the theory of how often specific characters or character combinations occur, which is what we'd need to set up a statistical model for what remains. The more I think about it... it seems there is a reason why the Mozilla effort is so complex...?! :roll:
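Such a filter would indeed be trivial; roughly something like this (just a sketch, the 0x9F cutoff corresponds to the "up to 160" alternative mentioned above, and the sample bytes are ISO-8859-1; as Takeshi pointed out, this only helps for single-byte encodings):

--- Code: ---
#include <cstddef>
#include <iostream>
#include <string>

// Minimal "dead-strip" filter: keep only bytes above a cutoff so that
// keywords, operators and plain-English text disappear, and only the
// "interesting" bytes remain for the statistical model.
// cutoff = 0x7F drops all ASCII; cutoff = 0x9F additionally drops the
// C1 range, which the ISO-8859-* codepages do not use for letters.
std::string StripBelow(const std::string& text, unsigned char cutoff)
{
    std::string result;
    for (std::size_t i = 0; i < text.size(); ++i)
        if ((unsigned char)text[i] > cutoff)
            result += text[i];
    return result;
}

int main()
{
    // "if (größer) return; // geprüft" as ISO-8859-1 bytes
    const std::string line = "if (gr\xF6\xDF" "er) return; // gepr\xFC" "ft";
    std::cout << StripBelow(line, 0x7F).size() << " byte(s) remain" << std::endl; // prints 3
}
--- End code ---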
With regards, Morten.
Edit: Linked directly to the table mentioned, to avoid confusion with other tables on that page.
takeshimiya:
--- Quote from: Defender on May 19, 2006, 12:43:19 pm ---
--- Quote from: Takeshi Miya on May 19, 2006, 12:34:53 pm ---
Think about it for a bit: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all of the byte values are lower than 127.
--- End quote ---
I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings.
--- End quote ---
Yes, of course the very first thing would be checking for a BOM and trying to detect the easy encodings. But almost all of the multibyte encodings are precisely the ones that are not easy to detect.
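For reference, the BOM check itself is trivial; roughly something like this (only the common BOMs, and the enum names are just for illustration, not an existing Code::Blocks API):

--- Code: ---
#include <iostream>
#include <string>

enum BomEncoding { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE, BOM_UTF32LE, BOM_UTF32BE };

// Check the first bytes of a buffer for a byte-order mark.
// Note: the UTF-32LE mark must be tested before UTF-16LE, because
// FF FE 00 00 starts with the UTF-16LE mark FF FE.
BomEncoding DetectBom(const std::string& buf)
{
    if (buf.size() >= 4 && buf.compare(0, 4, "\xFF\xFE\x00\x00", 4) == 0) return BOM_UTF32LE;
    if (buf.size() >= 4 && buf.compare(0, 4, "\x00\x00\xFE\xFF", 4) == 0) return BOM_UTF32BE;
    if (buf.size() >= 3 && buf.compare(0, 3, "\xEF\xBB\xBF", 3) == 0)     return BOM_UTF8;
    if (buf.size() >= 2 && buf.compare(0, 2, "\xFF\xFE", 2) == 0)         return BOM_UTF16LE;
    if (buf.size() >= 2 && buf.compare(0, 2, "\xFE\xFF", 2) == 0)         return BOM_UTF16BE;
    return BOM_NONE; // no BOM: fall through to the statistical guessing
}

int main()
{
    const std::string file("\xEF\xBB\xBFint main() {}", 16);
    std::cout << (DetectBom(file) == BOM_UTF8 ? "UTF-8 BOM" : "no/other BOM") << std::endl;
}
--- End code ---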
--- Quote from: MortenMacFly on May 19, 2006, 12:44:49 pm ---So if everything up to 127 (maybe even everything up to 160) is skipped, we truly skip all keywords, variables and symbols/signs. There is a very nice table at http://en.wikipedia.org/wiki/ISO_8859 that compares the different ISO variants.
--- End quote ---
Again, that will serve only for positively detecting single-byte encodings, with what you could call a "1-char detection method", and I explained above why it will not work except in very few cases. The 2-char (or more) method would give us a better guess (for single-byte encodings).
--- Quote from: MortenMacFly on May 19, 2006, 12:44:49 pm ---Still: I think this is a very minor addition (e.g. a simple stream manipulator) to the main part that would be required anyway.
--- End quote ---
I think so too, but it would only help some detection methods; other methods will need the raw stream without any "comment parsing". It is a great idea nonetheless.
--- Quote from: MortenMacFly on May 19, 2006, 12:44:49 pm ---What remains is what you would like to analyse statistically, right?
The more I think about that... it seems there is a reason for why the Mozilla effort is so complex...?! :roll:
With regards, Morten.
--- End quote ---
Yes, they've already built a lot of tools, automation programs and research for language statistics. The "Mozilla language/encoding detection module" was a separate project that was later merged into Mozilla and is now maintained there.
Even if we don't use the Mozilla detection source, we can still use a lot of the research they have done: the statistical analysis for each language, the tools for building those statistics, etc.
thomas:
--- Quote from: Defender on May 19, 2006, 12:43:19 pm ---I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings.
--- End quote ---
Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
[There are just too many of them, wxScite does not even support some, and manually converting them is a pain. Also, only a tiny fraction of the population uses them. We should concentrate on making it work well for 98%, not waste time on making it work badly for 100%.
Those few actually using anything other than 8-bit codepages, UTF-8, and UTF-16 can reasonably be asked to convert to UTF-8 or UTF-16. Since both UTF-8 and UTF-16 are established standards that work excellently, there should not be much to object to.]
Regarding UTF-8, if there are no valid UTF-8 multibyte sequences to be found (which you can check easily) then either there are none because the text contains no characters that need encoding, or you will find illegal sequences.
In the former case, you don't need to care, as you'll simply use ANSI encoding (or whatever codepage, it shouldn't matter, you can even use UTF-8, just to be sure for the future).
Only in the latter case does it get really complicated. Then, and only then, you may have characters in some unknown encoding which might mean anything; for example, 8859-1 Ø could just as well be 8859-2 Ř, and you have no idea which it is.
Now the problem lies in finding out whether it is a Ø or a Ř, and that has to be done with a statistical model.
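For what it's worth, the two first-pass checks are simple enough; a rough sketch (the zero-byte threshold is arbitrary, and the UTF-8 check is simplified, e.g. it does not reject overlong sequences):

--- Code: ---
#include <cstddef>
#include <iostream>
#include <string>

// Heuristic from above: ANSI/UTF-8 text contains virtually no zero bytes,
// while UTF-16 text is full of them.
bool LooksLikeUtf16(const std::string& buf)
{
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < buf.size(); ++i)
        if (buf[i] == '\0')
            ++zeros;
    return !buf.empty() && zeros * 10 > buf.size(); // arbitrary threshold: >10% zero bytes
}

// Returns 1 if the buffer contains valid UTF-8 multibyte sequences,
// 0 if it is plain 7-bit ASCII, and -1 if it contains illegal sequences
// (i.e. some unknown 8-bit codepage: statistical guessing is needed).
int CheckUtf8(const std::string& buf)
{
    bool sawMultibyte = false;
    std::size_t i = 0;
    while (i < buf.size())
    {
        unsigned char c = (unsigned char)buf[i];
        std::size_t extra;
        if      (c < 0x80)           { ++i; continue; }
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else return -1;              // stray continuation or invalid lead byte
        if (i + extra >= buf.size())
            return -1;               // sequence truncated
        for (std::size_t j = 1; j <= extra; ++j)
            if (((unsigned char)buf[i + j] & 0xC0) != 0x80)
                return -1;           // continuation byte expected
        sawMultibyte = true;
        i += extra + 1;
    }
    return sawMultibyte ? 1 : 0;
}

int main()
{
    const std::string utf16le("h\0i\0", 4);
    std::cout << LooksLikeUtf16(utf16le)         << std::endl; // 1
    std::cout << CheckUtf8("plain ASCII")        << std::endl; // 0
    std::cout << CheckUtf8("gr\xC3\xB6\xC3\x9F") << std::endl; // 1 (valid UTF-8 "größ")
    std::cout << CheckUtf8("gr\xF6\xDF")         << std::endl; // -1 (ISO-8859-1 bytes)
}
--- End code ---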
Martin:
What if we use word boundaries to feed complete words to the statistical model?
For example, we could first skip over all ANSI chars to identify candidate "strange chars", and then use Scintilla's word retrieval functions.
If the input text were "Ångström was a physicist", then we would feed "Ångström" to the statistical model, rather than just "Åö".
In this case, we do not need to bother with what might be a comment or character constant, and we don't need to parse programming-language-specific stuff, but we still get somewhat of a "complete" fragment (nice oxymoron). It is a compromise between a full parser and using only a fraction of the statistical data available (the frequencies of "ngstrm" may give a clue, too).
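A quick sketch of that idea in plain C++, without the Scintilla calls (the word-character test is deliberately simplistic):

--- Code: ---
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Treat a byte as part of a word if it is an ASCII letter or any byte >= 0x80
// (which, in a single-byte codepage, is where the national letters live).
static bool IsWordByte(unsigned char c)
{
    return std::isalpha(c) || c >= 0x80;
}

// Collect only those words that contain at least one non-ASCII byte,
// e.g. "Ångström" from "Ångström was a physicist". These whole words are
// then fed to the statistical model instead of just the isolated
// "strange" characters.
std::vector<std::string> CollectCandidateWords(const std::string& text)
{
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < text.size())
    {
        if (!IsWordByte((unsigned char)text[i])) { ++i; continue; }
        const std::size_t start = i;
        bool hasNonAscii = false;
        while (i < text.size() && IsWordByte((unsigned char)text[i]))
        {
            if ((unsigned char)text[i] >= 0x80)
                hasNonAscii = true;
            ++i;
        }
        if (hasNonAscii)
            words.push_back(text.substr(start, i - start));
    }
    return words;
}

int main()
{
    const std::string text = "\xC5ngstr\xF6m was a physicist"; // ISO-8859-1 bytes
    std::vector<std::string> words = CollectCandidateWords(text);
    for (std::size_t i = 0; i < words.size(); ++i)
        std::cout << words[i] << std::endl; // prints the single candidate word
}
--- End code ---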
I don't remember exactly who came up with counting letters in the English language first; you'd probably credit the late Mr. Shannon, but Arthur Conan Doyle wrote his tale "The Dancing Men" 13 years before Shannon was even born...
Anyway, this story teaches us that seemingly unimportant letters may be interesting too :)