
Linux won't edit Jonsson umlauted name


takeshimiya:

--- Quote from: thomas on May 19, 2006, 01:35:08 pm ---Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.

--- End quote ---
Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.
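
For example (a standalone sketch, nothing C::B-specific; the byte values are plain UTF-16BE and easy to verify against any Unicode chart):

--- Code: ---
// UTF-16BE byte patterns: Latin characters get a null high byte each,
// CJK characters (code points above U+00FF) get none.
#include <cstdio>

int main()
{
    const unsigned char latin[] = { 0x00, 0x41, 0x00, 0x42 }; // "AB"
    const unsigned char cjk[]   = { 0x6F, 0x22, 0x5B, 0x57 }; // U+6F22 U+5B57
    for (unsigned int i = 0; i < sizeof(latin); ++i)
        std::printf("%02X ", latin[i]);
    std::printf("<- one null byte per Latin character\n");
    for (unsigned int i = 0; i < sizeof(cjk); ++i)
        std::printf("%02X ", cjk[i]);
    std::printf("<- no null bytes at all\n");
    return 0;
}
--- End code ---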


--- Quote from: thomas on May 19, 2006, 01:35:08 pm ---Also, only a tiny fraction of the population uses them.

--- End quote ---
I don't know how accurate that percentage is, given that about 1/4 of the world's population speaks East Asian languages, but...


--- Quote from: thomas on May 19, 2006, 01:35:08 pm ---We should concentrate on making it work well for 98%, not waste time on making it work badly for 100%.

--- End quote ---
...I don't know how many of them aren't using Unicode for their source code these days, so that guess is probably OK.


--- Quote from: thomas on May 19, 2006, 01:35:08 pm ---Those few actually using anything other than 8-bit codepages, UTF-8, and UTF-16 can really be asked to convert to UTF-8 or UTF-16. Since both UTF-8 and UTF-16 are established standards that work excellently, there should not be much to object to.

--- End quote ---
Yes, UTF-8 rocks for programming, thanks to its backwards compatibility with ASCII. It should be the default nowadays.


--- Quote from: thomas on May 19, 2006, 01:35:08 pm ---What if we use word boundaries to feed complete words to the statistical model?
For example, we could first skip over all ANSI chars to identify candidate "strange chars", and then use Scintilla's word retrieval functions.
If the input text were "Ångström was a physicist", then we would use "Ångström" for the statistical model, rather than "Åö".
In this case, we do not need to bother about what might be a comment or character constant, and we don't need to parse programming-language-specific stuff, but we still get somewhat of a "complete" fragment (nice oxymoron). It is a compromise between a full parser and only using a fraction of the statistical data available (the frequencies of "ngstrm" may give a clue, too).

--- End quote ---
That is a good approach too, but it will consume a lot of CPU resources and require a lot of statistical analysis. Something to point out: those statistics can guess wrong. For example, I tend to write English code and English comments, but for some words I use my mother tongue (when I don't know the word in English), and I don't think that's an uncommon situation.
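
A rough sketch of what that word extraction could look like (standalone code; the word-boundary logic is simplified, and the real thing would presumably use Scintilla's word retrieval (SCI_WORDSTARTPOSITION / SCI_WORDENDPOSITION) instead of the IsWordChar() helper below):

--- Code: ---
// Sketch only: collect whole "words" that contain at least one non-ASCII byte,
// so the statistical model sees "Ångström" instead of just "Åö".
#include <cstddef>
#include <string>
#include <vector>

static bool IsWordChar(unsigned char c)
{
    // ASCII letters/digits/underscore plus anything >= 0x80 count as word characters.
    return (c >= 0x80) || c == '_' ||
           (c >= '0' && c <= '9') ||
           (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z');
}

std::vector<std::string> ExtractSuspectWords(const std::string& text)
{
    std::vector<std::string> words;
    std::size_t i = 0;
    while (i < text.size())
    {
        if (!IsWordChar((unsigned char)text[i])) { ++i; continue; }
        std::size_t start = i;
        bool hasNonAscii = false;
        while (i < text.size() && IsWordChar((unsigned char)text[i]))
        {
            if ((unsigned char)text[i] >= 0x80)
                hasNonAscii = true;
            ++i;
        }
        if (hasNonAscii)                      // only keep words with "strange chars"
            words.push_back(text.substr(start, i - start));
    }
    return words;
}
--- End code ---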

OK, I think we have two good (and not extremely complex) solutions:
1) Try to detect whether an "illegal Unicode sequence" exists; if it does, show a dialog asking the user for an encoding, defaulting to the current system encoding (rough sketch below).
2) Use a comment parser, like the one the CodeStatistics plugin uses, and feed the resulting block of comments to the Mozilla encoding detector class.
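
For 1), a minimal sketch of what the "illegal sequence" check could look like (plain C++, not actual C::B code; it deliberately ignores finer points such as overlong forms and surrogates). If it returns false, we show the encoding dialog:

--- Code: ---
#include <cstddef>

// Sketch: returns true if the buffer is valid UTF-8 (plain ASCII is a subset).
// A false result means the file is some 8-bit codepage (or worse), so we
// would ask the user, defaulting to the current system encoding.
bool LooksLikeValidUtf8(const unsigned char* buf, std::size_t len)
{
    std::size_t i = 0;
    while (i < len)
    {
        unsigned char c = buf[i++];
        std::size_t extra;
        if      (c < 0x80)           extra = 0;    // plain ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;    // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;    // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;    // 4-byte sequence
        else                         return false; // illegal lead byte
        while (extra--)
        {
            if (i >= len || (buf[i++] & 0xC0) != 0x80)
                return false;                      // truncated or bad continuation byte
        }
    }
    return true;
}
--- End code ---
An all-ASCII file passes trivially, so the dialog would only ever appear for files that contain "strange" bytes which don't form valid UTF-8.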

Either of the two should work reasonably well for most people, without requiring statistical analysis on our part, and in a relatively easy way.
What do you think?

thomas:

--- Quote from: Takeshi Miya on May 19, 2006, 02:03:30 pm ---Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.
[...]
What do you think?

--- End quote ---
I think that a "typical" project made from "typical" sources with 10% comments and character strings (Code::Blocks has about 3%) encoded in UTF-16 has 45% null bytes...
ANSI source files do not have a single null byte in the normal case, and neither should UTF-8. If they do, something is weird about that source.
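
A back-of-the-envelope version of that check (just a sketch, nothing C::B-specific, and the threshold is picked arbitrarily):

--- Code: ---
#include <cstddef>

// Sketch: crude UTF-16 sniffing by null-byte ratio over a sample of the file.
// Mostly-ASCII source encoded as UTF-16 lands near 50% null bytes (about 45%
// with 10% comments/strings in some other script); ANSI and UTF-8 land at 0.
bool LooksLikeUtf16(const unsigned char* buf, std::size_t len)
{
    if (len == 0)
        return false;
    std::size_t nulls = 0;
    for (std::size_t i = 0; i < len; ++i)
        if (buf[i] == 0)
            ++nulls;
    return nulls * 10 >= len * 3;   // threshold: at least 30% null bytes
}
--- End code ---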

Also, I trust that Martin will work out a usable and computationally feasible statistical model before version 1.5 :)

Defender:

--- Quote from: Takeshi Miya on May 19, 2006, 02:03:30 pm ---
--- Quote from: thomas on May 19, 2006, 01:35:08 pm ---Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.

--- End quote ---
Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.
...

--- End quote ---
That's perfectly true, but the majority of a source file is made up of English characters, and encoded as UTF-16 those contain a lot of NULL bytes ;)

thomas:
Yep, around 45% :)

takeshimiya:

--- Quote from: Defender on May 19, 2006, 04:49:56 pm ---
--- Quote from: Takeshi Miya on May 19, 2006, 02:03:30 pm ---
--- Quote from: thomas on May 19, 2006, 01:35:08 pm ---Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.

--- End quote ---
Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.
...

--- End quote ---
That's perfectly true, but the majority of a source file is made up of English characters, and encoded as UTF-16 those contain a lot of NULL bytes ;)

--- End quote ---

Yep, thomas' guess is right; I was just trying to be informative for those who might think that UTF-16 will always have 0's in any case. =)
If anyone makes that assumption, he/she could end up calling the "comments parser" or "ASCII stripping" and the UTF-16 detector in the wrong order (i.e. run the comments parser first and the UTF-16 detector afterwards, which won't work; no 0's will be left to find). In the reverse order it works. So UTF-16 detection would be an exception here, as the other detection methods will/could work better after the comment parsing.
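
Roughly, the order would be something like this (sketch only; ExtractComments() and RunStatisticalDetector() are hypothetical placeholders for a CodeStatistics-style comment parser and the Mozilla detector, and the first two helpers are the checks sketched earlier in the thread):

--- Code: ---
#include <cstddef>
#include <string>

enum DetectedEncoding { ENC_UTF16, ENC_UTF8, ENC_ASK_USER };

// Sketched earlier in the thread:
bool LooksLikeUtf16(const unsigned char* buf, std::size_t len);
bool LooksLikeValidUtf8(const unsigned char* buf, std::size_t len);
// Hypothetical placeholders:
std::string ExtractComments(const std::string& raw);              // comment parser a la CodeStatistics
DetectedEncoding RunStatisticalDetector(const std::string& text); // e.g. the Mozilla detector

DetectedEncoding DetectEncoding(const unsigned char* buf, std::size_t len)
{
    // Step 1: UTF-16 sniffing must see the raw bytes, *before* any
    // comment parsing or ASCII stripping removes the null bytes.
    if (LooksLikeUtf16(buf, len))
        return ENC_UTF16;

    // Valid UTF-8 (which includes pure ASCII) needs no further guessing.
    if (LooksLikeValidUtf8(buf, len))
        return ENC_UTF8;

    // Step 2: only now reduce the file to its comments and let a
    // statistical detector (or, failing that, the user) decide.
    std::string comments = ExtractComments(std::string(reinterpret_cast<const char*>(buf), len));
    return RunStatisticalDetector(comments);
}
--- End code ---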

Just something to keep in mind. :)

Solution 1) seems to be the fastest/easiest to do in the meantime; what do you think?
