The problem lies in the files that do not have a BOM. These are 90% of all files, as it happens, since the BOM is Microsoft proprietary stuff.
On documents that have no BOM, it boils down to either knowing the correct encoding or making a wild guess based on the actual content. Something that is relatively easy to do is to identify UTF-16 encoded files, since they contain unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is neither ANSI nor UTF-8 encoded (if you try to decode it as UTF-8 and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most every other browser) does as the first step, too.
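To illustrate, a minimal sketch of those two first-step checks could look like this (not the actual Mozilla code; the zero-byte threshold is just an assumption):

[code]
#include <cstddef>
#include <string>

// Heuristic: UTF-16 text with mostly Latin content contains a zero byte for
// nearly every character, so a large share of 0x00 bytes is a strong hint.
bool LooksLikeUtf16(const std::string& data)
{
    if (data.empty())
        return false;
    std::size_t zeros = 0;
    for (unsigned char c : data)
        if (c == 0x00)
            ++zeros;
    // The threshold is arbitrary; a real detector would also look at whether
    // the zeros sit on even or odd positions (endianness).
    return zeros > data.size() / 4;
}

// Strict UTF-8 check: if an illegal sequence shows up, the file is neither
// plain ASCII nor UTF-8. (Overlong sequences and surrogates are not rejected
// here; a full validator would do that as well.)
bool IsValidUtf8(const std::string& data)
{
    std::size_t i = 0;
    while (i < data.size())
    {
        unsigned char c = data[i];
        std::size_t extra;
        if      (c < 0x80)           extra = 0;  // ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1;  // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2;  // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3;  // 4-byte sequence
        else return false;                       // illegal lead byte
        if (i + extra >= data.size())
            return false;                        // truncated sequence
        for (std::size_t j = 1; j <= extra; ++j)
            if ((static_cast<unsigned char>(data[i + j]) & 0xC0) != 0x80)
                return false;                    // bad continuation byte
        i += extra + 1;
    }
    return true;
}
[/code]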
Yes, there are a few more tricks to (positively, if found) identify an encoding, like a substantial amount of multibyte characters in UTF-8, or "~{" and ESC sequences, which make ISO-2022-JP very likely, and a few more.
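For example, a positive check for ISO-2022-JP could simply look for its characteristic escape sequences (a sketch covering only the two most common introducers):

[code]
#include <string>

// ISO-2022-JP announces its character sets with escape sequences; "ESC $ B"
// and "ESC $ @" switch to JIS X 0208. Finding one of them in an otherwise
// 7-bit file is a strong positive hint.
bool HasIso2022JpEscapes(const std::string& data)
{
    return data.find("\x1B$B") != std::string::npos ||
           data.find("\x1B$@") != std::string::npos;
}
[/code]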
The true problem lies in documents that are not obviously illegal UTF-8. For these, browsers typically build a histogram and compare the peak frequencies to a known "profile" for a given language/encoding. In languages based on the Latin alphabet, like most European and Balkan languages, the top 30 characters make up 95% of the text. In Japanese, the top 60 characters make up about half of a "normal" text, and in Chinese you have a couple of hundred "top" characters. Thus it becomes obvious: the less Western the language, the harder the guess. You need a lot more input text to make a reasonable guess.
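A rough sketch of that histogram idea (the profile here is a placeholder, not a real language profile, and a real detector compares only the top peaks rather than the whole table):

[code]
#include <array>
#include <cmath>
#include <cstddef>
#include <string>

// Normalized byte-frequency histogram of the input.
std::array<double, 256> BuildHistogram(const std::string& data)
{
    std::array<double, 256> freq{};
    for (unsigned char c : data)
        freq[c] += 1.0;
    if (!data.empty())
        for (double& f : freq)
            f /= data.size();
    return freq;
}

// Compare against a known language/encoding profile; a smaller distance means
// a better match. Real profiles are trained on large corpora.
double ProfileDistance(const std::array<double, 256>& sample,
                       const std::array<double, 256>& profile)
{
    double d = 0.0;
    for (std::size_t i = 0; i < 256; ++i)
        d += std::fabs(sample[i] - profile[i]);
    return d;
}
[/code]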
Yes, though actually this isn't so much "language dependent" as "encoding dependent", since there are several encodings (e.g. Japanese EUC, ISO, etc.), some of which are easier to detect than others.
The worst situation is the East Asian encodings for sure; for example, the different encodings of Chinese share a lot of code points with each other and with other encodings (such as the Japanese ones), which makes detection more difficult.
However, the main problem is the point which Takeshi missed completely and which is why I said that using the Mozilla encoding detector is useless.
But you said "We're actually looking for something a bit smaller and more manageable than Mozilla.", so I guessed that Mozilla was ruled out only because of that. Anyway, I don't think it is useless, given this:
Source files consist of 80-90% ANSI characters (keywords, operators, white space). Only comments and character constants contain "unusual" characters at all (sometimes only a single character in a source file!).
Yes, very right. However, one of Mozilla's algorithms is the 2-char one, which marks all ASCII (keywords, spaces, etc.) as noise.
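If I understand that correctly, the effect on a source file is roughly this (a simplification, not the actual Mozilla code):

[code]
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified illustration of the "treat ASCII as noise" idea: collect byte
// pairs for the statistics, but skip any pair that is purely ASCII. In a
// source file this throws away keywords, operators and whitespace - which is
// good - but it also means there is very little text left to analyse.
std::vector<std::pair<unsigned char, unsigned char>>
CollectNonAsciiPairs(const std::string& data)
{
    std::vector<std::pair<unsigned char, unsigned char>> pairs;
    for (std::size_t i = 0; i + 1 < data.size(); ++i)
    {
        unsigned char a = data[i];
        unsigned char b = data[i + 1];
        if (a < 0x80 && b < 0x80)
            continue; // pure ASCII pair -> noise
        pairs.emplace_back(a, b);
    }
    return pairs;
}
[/code]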
This means that building a histogram over the flat input file and comparing to known language statistics is entirely useless - the characters in question only make up a tiny fraction of the document.
What we will probably have to do will be either a parser that isolates character constants and comments and only parses those
Yes, but you're calling it "entirely useless" and I don't think it is quite that; we can still use it:
The idea of a parser that extracts the comments is great: we can parse them and, once collected, feed them to the Mozilla detector.
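Something along these lines, as a minimal sketch (the function name is made up, and a real version would take the comment/string delimiters from the lexers rather than hard-coding the C/C++ ones):

[code]
#include <cstddef>
#include <string>

// Keep only line comments and string literals, so that only the text that can
// actually contain non-ASCII characters is handed to the charset detector.
// Block comments and escape sequences are ignored here for brevity.
std::string ExtractCommentsAndStrings(const std::string& source)
{
    std::string out;
    std::size_t i = 0;
    while (i < source.size())
    {
        if (source.compare(i, 2, "//") == 0)        // line comment
        {
            while (i < source.size() && source[i] != '\n')
                out += source[i++];
            out += '\n';
        }
        else if (source[i] == '"')                  // string literal
        {
            ++i;
            while (i < source.size() && source[i] != '"')
                out += source[i++];
            ++i;
        }
        else
            ++i;
    }
    return out;
}
[/code]

The collected text can then be handed to the detector in one go, which gives it a much better signal-to-noise ratio than the raw file.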
I guess it's even worse. As far as I understand, variable names could be in Unicode for some compilers, too.
Yes, that is right, but we're talking about "probabilities", so with the comments from the source alone we have a good chance, as it's not a very common scenario to have Unicode variable names but no Unicode in the comments.
And (not to forget) strings/chars within the sources, as you said. This means that it would be required to have knowledge about what a valid comment / string is -> so is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult.
Thanks for remembering that; in fact it is very easy, and the comment characters can exist in the C::B lexers (in fact the .properties lexer files already have them).
Also, making them available in the SDK will increase reuse; it's exactly what the Code Statistics plugin is using right now.
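For illustration, such lexer-provided information could be exposed as something like this (the struct and the entries are hypothetical, just to show the shape of the data):

[code]
#include <map>
#include <string>

// Illustrative only: a table of comment markers per lexer, along the lines of
// what the C::B lexer configuration files already provide.
struct CommentMarkers
{
    std::string lineComment;
    std::string streamCommentStart;
    std::string streamCommentEnd;
};

std::map<std::string, CommentMarkers> BuildCommentTable()
{
    return {
        { "cpp",        { "//", "/*", "*/" } },
        { "fortran",    { "!",  "",   ""   } },
        { "properties", { "#",  "",   ""   } },
    };
}
[/code]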
Isn't it better to support a limited number of file types and try to determine when a file is of unknown type and request the user to convert it into something "general"?
That is a good point; all of this is complex, but it's a feature for the users, and there is a lot of source code all over the world whose authors probably don't have the choice to use something more "general" (either because of the tools that are being used, because of some employer, etc.).
This is an interesting topic; however, I don't think it's a top priority, as it'll take a lot of human resources/time (it could have been interesting as a Google SoC project).
BTW, the new compiler framework and the blocks are starting to rock, so better keep this for the future.