What encoding should I set for the Linux Codeblocks editor in order to edit the AngelScript files containing an umlauted 'o'?
It defaults to utf-8, but the editor shows all blanks after the umlauted 'o'.
thanks
pecan
If you have an idea how to determine the document encoding efficiently (or maybe even have your own code that does it / know a free library), please step forward :)
I've set it to ISO-8859-1 and I can edit the AngelScript files just fine...
Quote: "We're actually looking for something a bit smaller and more manageable than Mozilla."
Just something to note: at first I got the impression that one would need to compile the whole (really big) Mozilla suite, or at least XPCOM, but actually it is very few files without any dependencies (neither external nor Mozilla ones). It seems really lightweight; just check the repository.
Quote: "Here are the instructions to "pretty easy to use universal detector in your own project""
Yes, we've been aware of that for many months. It is just that it doesn't help at all. :)
Quote: "You'll be writing a C::B wrapper for it, so the actual implementation can be decided later anyways."
No, we would not be writing a wrapper; we would have to rewrite a good bit of it to make any sensible use of it. Think about it.
If it were as easy as writing a small C++ wrapper, we'd have done that months ago.
enum cbEncoding
{
    UTF_8,
    UTF_16LE,
    UTF_16BE,
    UTF_32LE,
    UTF_32BE
};
// Input:  s   - the first four bytes of the file
//         len - the number of bytes actually available (may be less than
//               four if the file is shorter)
// Output: best guess of the file encoding
cbEncoding DetectEncoding(const unsigned char* s, size_t len)
{
    // Note: sizeof(s) would only yield the size of the pointer, so the
    // caller has to pass the number of valid bytes explicitly.
    unsigned char a = len >= 1 ? s[0] : 1;
    unsigned char b = len >= 2 ? s[1] : 1;
    unsigned char c = len >= 3 ? s[2] : 1;
    unsigned char d = len >= 4 ? s[3] : 1;

    // A byte-order mark (BOM) identifies the encoding unambiguously.
    if (a==0xFF && b==0xFE && c==0x00 && d==0x00) return UTF_32LE;
    if (a==0x00 && b==0x00 && c==0xFE && d==0xFF) return UTF_32BE;
    if (a==0xFF && b==0xFE) return UTF_16LE;
    if (a==0xFE && b==0xFF) return UTF_16BE;
    if (a==0xEF && b==0xBB && c==0xBF) return UTF_8;

    // No BOM: guess from the zero-byte pattern, assuming the first
    // character is a plain ASCII one.
    if (b==0x00 && c==0x00 && d==0x00) return UTF_32LE;
    if (a==0x00 && b==0x00 && c==0x00) return UTF_32BE;
    if (b==0x00) return UTF_16LE;
    if (a==0x00) return UTF_16BE;

    // Fall back to UTF-8 (which also covers plain ASCII).
    return UTF_8;
}
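For illustration, here is one way the function above could be fed: read the first four bytes of the file and pass along how many were actually available (the helper name is made up for this sketch).
#include <cstddef>
#include <cstdio>

// Reads up to four bytes from the given file and hands them to
// DetectEncoding() together with the number of bytes actually read.
cbEncoding DetectFileEncoding(const char* filename)
{
    unsigned char buf[4] = { 0, 0, 0, 0 };
    size_t len = 0;

    std::FILE* f = std::fopen(filename, "rb");
    if (f)
    {
        len = std::fread(buf, 1, sizeof(buf), f);
        std::fclose(f);
    }
    return DetectEncoding(buf, len);
}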
Now, I definitely don't understand enough about Unicode to really contribute much here, but this may (!) be a starting point...?!
With regards, Morten.
Quote: "What we will probably have to do will be either a parser that isolates character constants and comments and only parses those, or we will have to use a greatly different statistical model. Or we may have to choose a completely different strategy."
I guess it's even worse. As far as I understand, variable names could be in Unicode for some compilers, too. And (not to forget) the strings/chars within the sources, as you said. This means that it would be required to have knowledge about what is a valid comment / string -> so is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult. I wonder if it makes sense to try that at all, because every solution will depend on a lot of variables and will never work 100% correctly (as has been said already). Isn't it better to support a limited number of file types, try to determine when a file is of unknown type, and request the user to convert it into something "general"? So rather than trying to be very smart, be very strict?
Quote: "The problem lies in the files that do not have a BOM. These are 90% of all files, as it happens, since this is Microsoft proprietary stuff :("
Yes, there are a few more tricks to (positively, if found) identify an encoding, like a substantial amount of multibyte characters in UTF-8, or "~{" and ESC sequences, in which case it's very likely to be ISO-2022-JP, and a few more.
Quote: "On documents that have no BOM, it boils down to either knowing the correct encoding or doing a wild guess based on the actual content. Something that is relatively easy to do is to find out UTF-16 encoded files, since they have unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is not ANSI and not UTF-8 encoded (if you try to decode it as UTF-8, and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most every other browser) does as the first step, too."
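To make those two first-step checks a bit more concrete, here is a rough sketch; the function names and the 10% threshold are made up for illustration, and the UTF-8 check ignores overlong sequences.
#include <cstddef>

// UTF-16 text that is mostly ASCII contains a zero byte for nearly every
// character, while 8-bit and UTF-8 text normally contains none at all.
bool HasUnhealthyAmountOfZeroBytes(const unsigned char* data, size_t len)
{
    size_t zeros = 0;
    for (size_t i = 0; i < len; ++i)
        if (data[i] == 0x00)
            ++zeros;
    return len > 0 && zeros * 10 > len; // more than ~10% zero bytes
}

// Returns false as soon as an illegal UTF-8 sequence is found; in that
// case the document is neither plain ASCII nor UTF-8.
bool IsValidUTF8(const unsigned char* data, size_t len)
{
    size_t i = 0;
    while (i < len)
    {
        unsigned char c = data[i++];
        size_t trail;
        if      (c < 0x80)           trail = 0; // plain ASCII byte
        else if ((c & 0xE0) == 0xC0) trail = 1; // lead byte of a 2-byte sequence
        else if ((c & 0xF0) == 0xE0) trail = 2; // lead byte of a 3-byte sequence
        else if ((c & 0xF8) == 0xF0) trail = 3; // lead byte of a 4-byte sequence
        else                         return false; // illegal lead byte

        // every continuation byte must have the form 10xxxxxx
        for (size_t t = 0; t < trail; ++t, ++i)
            if (i >= len || (data[i] & 0xC0) != 0x80)
                return false;
    }
    return true;
}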
Quote: "The true problem lies in documents that are not obviously illegal UTF-8. For these, browsers typically build a histogram and compare the peak frequencies to a known "profile" for a given language/encoding. In languages based on the latin alphabet, like most European and Balkan languages, the top 30 characters make up 95%. In Japanese, the top 60 characters make up about half of a "normal" text, and in Chinese you have a couple of hundred "top" characters. Thus it becomes obvious: the less Western, the harder. You need a lot more input text to make a reasonable guess."
Yes, though actually this isn't so much "language dependent" as "encoding dependent", as there are various encodings (i.e. Japanese EUC, ISO, etc.), some of which are easier to detect than others.
Quote: "However, the main problem is the point which Takeshi missed completely and which is why I said that using the Mozilla encoding detector is useless."
Quote: "Source files consist of 80-90% ANSI characters (keywords, operators, white space). Only comments and character constants contain "unusual" characters at all (sometimes only a single character in a source file!)."
Yes, very right. However, one of Mozilla's algorithms is the 2-char one, which marks all ASCII (keywords, spaces, etc.) as noise.
Quote: "This means that building a histogram over the flat input file and comparing to known language statistics is entirely useless - the characters in question only make up a tiny fraction of the document."
Yes, but you're calling it "entirely useless" and I don't think it is yet; we can still use it:
Quote: "What we will probably have to do will be either a parser that isolates character constants and comments and only parses those"
Quote: "I guess it's even worse. As far as I understand, variable names could be in Unicode for some compilers, too."
Yes, that is right, but we're talking about "probabilities", so with the comments from the source alone we have a good chance, as it's not a very common scenario to have Unicode variable names but no Unicode comments :)
Quote: "And (not to forget) the strings/chars within the sources, as you said. This means that it would be required to have knowledge about what is a valid comment / string -> so is it //, /*, !, C, ' or '...', "..." (...)? This really seems to be very difficult."
Thanks for remembering that; in fact it is very easy, and the comment characters can live in the C::B lexers (in fact the .properties lexer files already have them :P).
Quote: "Isn't it better to support a limited number of file types, try to determine when a file is of unknown type, and request the user to convert it into something "general"?"
I have a general question related to source encodings. Is it legal to have non-English (non-ASCII) single-byte characters in a C source file as a string literal? Does the compiler recognize it if there are multibyte characters in the source (for example, as UTF-8 string literals)?
Thanks: Defender
Yes and no. If you specify the input character encoding, it is legal. Not all compilers support that, but gcc, for example, does (gcc fully supports UTF-8-encoded sources, and on the majority of systems that's even the default).
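For what it's worth, a small illustration of the gcc part of the answer; the file name is made up, and -finput-charset is the gcc option for stating the input encoding explicitly (UTF-8 is usually the default anyway).
// hello_utf8.cpp -- a source file saved as UTF-8 with a non-ASCII literal.
// A possible build line, stating the input encoding explicitly:
//   g++ -finput-charset=UTF-8 hello_utf8.cpp -o hello_utf8
#include <cstdio>

int main()
{
    // The bytes of "Ångström" end up in the executable as UTF-8
    // (unless a different execution character set is requested).
    std::printf("%s\n", "Ångström");
    return 0;
}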
Quote: "I guess it's even worse. As far as I understand, variable names could be in Unicode for some compilers, too."
Luckily, it says "A valid identifier is a sequence of one or more letters, digits or underscore characters", and "letter" is defined as [A-Za-z]. Phew... :)
Quote: "Isn't it better to support a limited number of file types, try to determine when a file is of unknown type, and request the user to convert it into something "general"? So rather than trying to be very smart, be very strict?"
That was the secret "Plan B" :) Though not the best solution, it would probably work well enough.
Quote: "Luckily, it says "A valid identifier is a sequence of one or more letters, digits or underscore characters", and "letter" is defined as [A-Za-z]. Phew... :)"
Yes, I got that part wrong. I was referring to an article about the C# language (compiler), but what was meant was the content of a variable, not the name... sorry. :oops:
Quote: "This means that it would be required to have knowledge about what is a valid comment / string -> so is it //, /*, !, C, ' or '...', "..." (...)?"
Actually, do we really want to know? Do we need to know?
Quote: "Lol, actually using a regex was quite a stupid idea :)"
lol :lol:
Quote: "Actually, do we really want to know? Do we need to know?"
Statistically speaking: yes!
Quote: "Statistically speaking: yes!"
Exactly :)
Quote: "[...] amplify parts that have a higher probability for unicode characters"
Yes and no, depending on the encoding we're trying to detect and the algorithm used.
Quote: "Statistically speaking: yes!"
Quote: "Exactly :)"
Quote: "[...] amplify parts that have a higher probability for unicode characters"
But do we want to know whether they're strings or comments or whatever? Do we care what identifies a string constant?
Quote: "If we dead-strip everything with a value less than 127, we eliminate all ANSI characters (all keywords, operators, and all English comment/constant text). All that remains are the (amplified) non-ANSI characters, if there are any. It greatly biases the distribution towards them, but is that really a bad thing?"
It would be great if it were as easy as dead-stripping everything below a value of 127. :D
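A minimal sketch of the dead-strip idea quoted above; the function name is made up, and, as the following posts point out, this only makes sense once multibyte encodings like UTF-16 have already been ruled out.
#include <cstddef>
#include <string>

// Keeps only the bytes above 0x7F, i.e. drops all plain ASCII characters
// (keywords, operators, English text) so that the remaining "special"
// bytes dominate the statistics.
std::string DeadStripAnsi(const unsigned char* data, size_t len)
{
    std::string result;
    for (size_t i = 0; i < len; ++i)
        if (data[i] > 0x7F)
            result += static_cast<char>(data[i]);
    return result;
}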
Quote: "The question is just, can you judge a charset/language only by looking at the special characters? Could you tell that "Fürwahr, Du bist der Größte" is "German / ISO-8859-1" only by looking at "üöß"? I think you can."
I think you can't guess an encoding by looking at individual special characters, but you can by looking at combinations of them (sequences of 2-3 characters).
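To sketch what such a 2-3 character check could look like for single-byte encodings: the profile table and its scoring below are entirely hypothetical, while a real detector like Mozilla's relies on precomputed language/encoding statistics.
#include <cstddef>
#include <map>
#include <utility>

// Maps a pair of byte values to how common that pair is in a given
// language/encoding; one such table per candidate encoding is assumed.
typedef std::map<std::pair<unsigned char, unsigned char>, double> PairProfile;

// Scores the text against one candidate profile; pairs that involve at
// least one non-ASCII byte contribute, everything else is treated as noise.
// The candidate encoding with the highest score would be the best guess.
double ScorePairs(const unsigned char* data, size_t len, const PairProfile& profile)
{
    double score = 0.0;
    for (size_t i = 0; i + 1 < len; ++i)
    {
        if (data[i] <= 0x7F && data[i + 1] <= 0x7F)
            continue; // pure ASCII pair: noise

        PairProfile::const_iterator it =
            profile.find(std::make_pair(data[i], data[i + 1]));
        if (it != profile.end())
            score += it->second;  // pair is common in this encoding
        else
            score -= 1.0;         // pair never seen in this encoding
    }
    return score;
}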
I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings.
Think a bit about it: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all byte values are lower than 127.
Quote: "On documents that have no BOM, it boils down to either knowing the correct encoding or doing a wild guess based on the actual content. Something that is relatively easy to do is to find out UTF-16 encoded files, since they have unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is not ANSI and not UTF-8 encoded (if you try to decode it as UTF-8, and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most every other browser) does as the first step, too."
Quote: "Exactly :)"
Mmmmh.... I was just reading on Wikipedia, from http://en.wikipedia.org/wiki/ISO/IEC_8859-1 through http://en.wikipedia.org/wiki/ISO/IEC_8859-16, about what such a bias would mean. Just to get you right: so if everything up to 127 (maybe even everything up to 160) is skipped, we truly skip all keywords, variables and symbols/signs. There is a very nice table at http://en.wikipedia.org/wiki/ISO_8859#Table that compares the different ISOs.
Quote: "I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings."
Quote: "Think a bit about it: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all byte values are lower than 127."
Quote: "So if everything up to 127 (maybe even everything up to 160) is skipped, we truly skip all keywords, variables and symbols/signs. There is a very nice table at http://en.wikipedia.org/wiki/ISO_8859 that compares the different ISOs."
Again, that'll serve only for positively detecting single-byte encodings, with what you could call a "1-char detection method", and I explained above why it will not work except for very few cases. The 2-char (or more) method would give us a best guess (for single-byte encodings).
Quote: "Still: I think this is a very minor addition (e.g. a simple stream manipulator) to the main part that would be required anyway."
I think so too, but some detection methods will need the raw stream without the "comment parsing". It is a great idea nonetheless.
Quote: "What remains is what you would like to analyse statistically, right?"
Yes, they've already built a lot of tools, automation programs and research for language statistics. The "Mozilla language/encoding detection module" was a separate project that was later merged into Mozilla and is now being maintained there.
The more I think about it... it seems there is a reason why the Mozilla effort is so complex...?! :roll:
With regards, Morten.
Quote: "I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings."
Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
Quote: "Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us."
Yep, just remember that east-asian languages usually will not have the "insane amount of 0's" we're looking for.
Quote: "Also, only a tiny fraction of the population uses them."
I don't know how accurate that percentage is, given that 1/4 of the world population speaks east-asian languages, but...
Quote: "We should concentrate on making it work well for 98%, not waste time on making it work badly for 100%...."
...I don't know how many of them aren't using Unicode these days for their source code, so that guess is probably OK.
Quote: "Those few actually using anything other than 8-bit codepages, UTF-8, and UTF-16 can really be asked to convert to UTF-8 or UTF-16. Since both UTF-8 and UTF-16 are established standards that work excellently, there should not be much to object to."
Yes, UTF-8 rocks for programming, thanks to it being backwards-compatible with ASCII. It should be the default nowadays.
Quote: "What if we use word boundaries to feed complete words to the statistical model?"
That is a good approach too, but it will consume a lot of CPU resources and require a lot of statistical analysis. Something to point out: those statistics can guess wrong. For example, I tend to write English code and English comments, but for some of the words I use my mother tongue (when I don't know the word in English), and I don't think that's an uncommon situation.
For example, if we first skip over all ANSI chars to identify candidate "strange chars", and then use Scintilla's word retrieval functions.
If the input text were "Ångström was a physicist", then we would use "Ångström" for the statistical model, rather than "Åö".
In this case, we do not need to bother what might be a comment or character constant, we don't need to parse programming-language specific stuff, but we still get somewhat of a "complete" fragment (nice oxymoron). It is a compromise between a full parser and only using a fraction of the statistical data available (the frequencies of "ngstrm" may give a clue, too).
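A small sketch of that word-boundary idea, without Scintilla and with a deliberately simplified boundary rule (only ASCII whitespace and punctuation end a word); the function names are made up.
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A byte counts as a word boundary only if it is plain ASCII whitespace
// or punctuation; all other bytes are considered part of a word.
static bool IsWordBoundary(unsigned char c)
{
    return c < 0x80 && (std::isspace(c) || std::ispunct(c));
}

// Whenever a non-ASCII byte is found, expand to the surrounding word and
// collect the whole word ("Ångström" rather than just "Åö") for the
// statistical model.
std::vector<std::string> ExtractCandidateWords(const std::string& text)
{
    std::vector<std::string> words;
    for (size_t i = 0; i < text.size(); )
    {
        if (static_cast<unsigned char>(text[i]) > 0x7F)
        {
            size_t begin = i;
            while (begin > 0 && !IsWordBoundary(static_cast<unsigned char>(text[begin - 1])))
                --begin;
            size_t end = i;
            while (end < text.size() && !IsWordBoundary(static_cast<unsigned char>(text[end])))
                ++end;
            words.push_back(text.substr(begin, end - begin));
            i = end; // continue after this word
        }
        else
            ++i;
    }
    return words;
}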
Quote: "Yep, just remember that east-asian languages usually will not have the "insane amount of 0's" we're looking for."
I think that a "typical" project made from "typical" sources with 10% comments and character strings (Code::Blocks has about 3%) encoded in UTF-16 has about 45% null bytes... (roughly 90% of the characters are plain ASCII, and each of those becomes two bytes in UTF-16, one of which is zero).
[...]
What do you think?
That's perfectly true, but the majority of a source file is made up of English characters, and encoded in UTF-16 they contain a lot of NULL bytes ;)
Quote: "Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us."
Quote: "Yep, just remember that east-asian languages usually will not have the "insane amount of 0's" we're looking for."
...
Quote: "- N: prompt for an encoding."
And put a "Don't annoy me again!" checkbox in that dialog :), because it is very common that the user will always be using that same encoding from that moment on.
Quote: "And put a "Don't annoy me again!" checkbox in that dialog :)"
That's quite true :lol: