Author Topic: Linux won't edit Jonsson umlauted name  (Read 28074 times)

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #30 on: May 19, 2006, 02:03:30 pm »
Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.

Also, only a tiny fraction of the population uses them.
I don't know how accurate that percentage is, given that 1/4 of the world population speaks East Asian languages, but...

We should concentrate on making it work well for 98%, not waste time on making it work badly for 100%.
...I don't know how many of them aren't using Unicode for their source code these days, so that guess is probably OK.

Those few actually using anything other than 8-bit codepages, UTF-8, and UTF-16 can really be asked to convert to UTF-8 or UTF-16. Since both UTF-8 and UTF-16 are established standards that work excellently, there should not be much to object to.
Yes, UTF-8 rocks for programming, thanks to the ASCII part being backwards-compatible. It should be the default nowadays.

What if we use word boundaries to feed complete words to the statistical model?
For example, if we first skip over all ANSI chars to identify candidate "strange chars", and then use Scintilla's word retrieval functions.
If the input text were "Ångström was a physicist", then we would use "Ångström" for the statistical model, rather than "Åö".
In this case, we do not need to care about what might be a comment or character constant, and we don't need to parse programming-language specific stuff, but we still get somewhat of a "complete" fragment (nice oxymoron). It is a compromise between a full parser and only using a fraction of the statistical data available (the frequencies of "ngstrm" may give a clue, too).
That is a good approach too, but it will consume a lot of CPU resources on statistical analysis. One thing to point out: those statistics can guess wrong. For example, I tend to write English code and English comments, but for some words I use my mother tongue (when I don't know the word in English), and I don't think that's an uncommon situation.
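
A rough Python sketch of the word-boundary idea quoted above, for illustration only. It assumes a "word" is any run of ASCII letters and bytes >= 0x80, which merely stands in for Scintilla's word retrieval functions (not used here); `candidate_words` is a made-up name.

```python
import re

# Assumption: a "word" is a run of ASCII letters and high (>= 0x80) bytes.
WORD = re.compile(rb"[A-Za-z\x80-\xff]+")

def candidate_words(raw: bytes) -> list:
    """Return whole words that contain at least one non-ASCII byte."""
    return [w for w in WORD.findall(raw) if any(b >= 0x80 for b in w)]

# The example sentence from the thread, stored in an 8-bit codepage.
sample = "Ångström was a physicist".encode("latin-1")
print(candidate_words(sample))
```

Only "Ångström" survives as a candidate, so the model would see the ASCII letters around the two strange bytes as well, while "was a physicist" is skipped.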

Ok, I think we have two good (and not extremely complex) solutions:
1) Try to detect whether an "Illegal Unicode sequence" exists. If it does, show a dialog asking the user for an encoding, defaulting to the current system encoding.
2) Use a comment parser, like the one the CodeStatistics Plugin uses. We then feed that block of comments to the Mozilla encoding detector class.

Either of the two seems to work reasonably well for most people, without any statistical analysis on our part, and in a relatively easy way.
What do you think?
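
To make option 2 a bit more concrete, here is a deliberately naive Python sketch of the comment-collecting step. It is not the CodeStatistics Plugin's parser; the regex and the name `comment_bytes` are assumptions for illustration, and it will be fooled by `//` or `/*` inside string literals.

```python
import re

# Naive matcher for C/C++ comments in raw bytes:
# "// ..." up to the end of the line, or "/* ... */" across lines.
COMMENT = re.compile(rb"//[^\n]*|/\*.*?\*/", re.DOTALL)

def comment_bytes(raw: bytes) -> bytes:
    """Join all comments found in raw, one per line, without decoding them."""
    return b"\n".join(COMMENT.findall(raw))

# Two comments containing Latin-1 encoded German words.
source = b"int n = 0; // z\xe4hler\n/* gr\xf6\xdfe */ int size;\n"
print(comment_bytes(source))
```

The returned bytes are what would be handed to whatever encoding detector is chosen; the detector call itself is left out because its interface is not described in this thread.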
« Last Edit: May 19, 2006, 02:17:20 pm by Takeshi Miya »

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #31 on: May 19, 2006, 02:59:04 pm »
Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.
[...]
What do you think?
I think that a "typical" project made from "typical" sources with 10% comments and character strings (Code::Blocks has about 3%) encoded in UTF-16 has 45% null bytes...
ANSI source files do not have a single null byte in the normal case, and neither should UTF-8. If they do, something is weird about that source.
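
A small Python sketch of the null-byte heuristic, as an illustration rather than a finished detector. The 0.2 threshold and the function names are assumptions, not values from this thread.

```python
import codecs

def null_fraction(raw: bytes) -> float:
    """Share of bytes in raw that are 0x00 (0.0 for empty input)."""
    return raw.count(b"\x00") / len(raw) if raw else 0.0

def looks_like_utf16(raw: bytes, threshold: float = 0.2) -> bool:
    """Guess UTF-16 from a BOM or from a high share of null bytes.

    The default threshold is an arbitrary assumption.
    """
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return True
    return null_fraction(raw) >= threshold

code = "int main() { return 0; }"
print(null_fraction(code.encode("utf-16-le")))     # pure ASCII text in UTF-16
print(null_fraction(code.encode("utf-8")))         # same text in UTF-8
print(null_fraction("日本語".encode("utf-16-le")))  # East Asian text in UTF-16
```

Pure ASCII text stored as UTF-16 is half null bytes, the UTF-8 version has none, and the short East Asian sample has none either, which is the caveat raised earlier in the thread.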

Also, I trust that Martin will work out a usable and computationally feasible statistical model before version 1.5 :)
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline Defender

  • Multiple posting newcomer
  • *
  • Posts: 49
Re: Linux won't edit Jonsson umlauted name
« Reply #32 on: May 19, 2006, 04:49:56 pm »
Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.
...
That's perfectly true, but the majority of a source file is made up of English characters, and encoded as UTF-16 they contain a lot of NULL bytes ;)

Offline thomas

Re: Linux won't edit Jonsson umlauted name
« Reply #33 on: May 19, 2006, 05:15:40 pm »
Yep, around 45% :)
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #34 on: May 19, 2006, 10:39:40 pm »
Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
Yep, just remember that East Asian languages usually will not have the "insane amount of 0's" we're looking for.
...
That's perfectly true, but the majority of a source file is made up of English characters, and encoded as UTF-16 they contain a lot of NULL bytes ;)

Yep, thomas' guess is right. I was just trying to be informative for anyone who might think that UTF-16 will always have 0's in every case. =)
Anyone making that assumption could run the "comments parser" or "ASCII stripping" in the wrong order (i.e. call the comments parser first and the UTF-16 detector afterwards, and it will not work: no 0's will be found). In the reverse order it will work. So UTF-16 detection is the exception here, as other detection methods will/could work better after the comments parsing.

Just something to keep in mind. :)

Solution 1) seems to be the fastest/easiest to do in the meantime; what do you think?
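
For reference, solution 1) could look roughly like this in Python. It is only a sketch: the function names are made up, the dialog is reduced to a `print`, and using `locale.getpreferredencoding` as "the current system encoding" is an assumption.

```python
import locale

def first_illegal_utf8(raw: bytes):
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None

def suggested_encoding() -> str:
    """Default to offer in the dialog: the system's preferred encoding."""
    return locale.getpreferredencoding(False)

# "Jönsson" saved in an 8-bit codepage is not legal UTF-8.
raw = "Jönsson".encode("latin-1")
offset = first_illegal_utf8(raw)
if offset is not None:
    print(f"illegal sequence at byte {offset}; ask the user, default: {suggested_encoding()}")
```

Returning the offset rather than a plain yes/no would also let the dialog show the user where the file stops being valid UTF-8.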

Offline Defender

Re: Linux won't edit Jonsson umlauted name
« Reply #35 on: May 20, 2006, 08:08:55 am »
Yep, 1) seems to be a good solution.

UTF-16?
 - Y: UTF-16.
 - N: only ASCII(<128) chars?
  - Y: UTF-8.
  - N: legal UTF-8?
   - Y: (assume UTF-8 or ask the user?)
   - N: prompt for an encoding.
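
The decision tree above, written out as a Python sketch. Two points are assumptions on top of the tree: the UTF-16 test is reduced to "has a BOM or contains any null byte", and the open question in the "legal UTF-8" branch is answered with "assume UTF-8". `Verdict` and `classify` are illustrative names.

```python
import codecs
from enum import Enum

class Verdict(Enum):
    UTF16 = "utf-16"
    UTF8 = "utf-8"
    ASK_USER = "prompt for an encoding"

def classify(raw: bytes) -> Verdict:
    """Walk the decision tree on the raw, undecoded file contents."""
    # UTF-16? (assumption: a BOM or any null byte is taken as the sign)
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)) or b"\x00" in raw:
        return Verdict.UTF16
    # Only ASCII (< 128) chars? Then treating it as UTF-8 is safe.
    if raw.isascii():
        return Verdict.UTF8
    # Legal UTF-8? (assumption: accept it without asking the user)
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return Verdict.ASK_USER
    return Verdict.UTF8

for data in (b"int x;",
             "Jönsson".encode("utf-8"),
             "Jönsson".encode("latin-1"),
             "int x;".encode("utf-16")):
    print(data, "->", classify(data).name)
```

The UTF-16 check runs first and on the raw bytes, so it sees the null bytes before any comment parsing or ASCII stripping could remove them.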

I tried to sum up what we know so far  :lol:

regards, Defender
« Last Edit: May 20, 2006, 08:21:58 am by Defender »

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #36 on: May 20, 2006, 07:49:41 pm »
   - N: prompt for an encoding.
And put a "Don't annoy me again!" option in that dialog :), because very often the user will always be using that encoding from then on.

Offline Defender

Re: Linux won't edit Jonsson umlauted name
« Reply #37 on: May 20, 2006, 07:55:09 pm »
   - N: prompt for an encoding.
And put a "Don't annoy me again!" option in that dialog :), because very often the user will always be using that encoding from then on.

That's quite true  :lol: