Author Topic: Linux won't edit Jonsson umlauted name  (Read 28076 times)

Offline Defender

  • Multiple posting newcomer
  • *
  • Posts: 49
Re: Linux won't edit Jonsson umlauted name
« Reply #15 on: May 18, 2006, 04:53:40 pm »
I have a general question related to source encodings.
Is it legal to have non-English (non-ASCII) single-byte characters in a C source file as a string literal? Does the compiler recognize if there are multibyte characters in the source (for example as UTF-8 string literals)?

Thanks: Defender

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #16 on: May 18, 2006, 05:11:00 pm »
I have a general question related to source encodings.
Is it legal to have non-English (non-ASCII) single-byte characters in a C source file as a string literal? Does the compiler recognize if there are multibyte characters in the source (for example as UTF-8 string literals)?

Thanks: Defender
Yes and no. If you specify the input character encoding, it is legal. Not all compilers support that, but gcc, for example, does (gcc fully supports UTF-encoded sources, and on the majority of systems that's even the default).
Don't worry about that kind of stuff; you will know when you need to specify the encoding, because your source will immediately fail to compile with the message "illegal byte sequence" :)
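For illustration, specifying the source encoding explicitly could look something like this (the file name is made up; -finput-charset/-fexec-charset are the relevant gcc switches):

// jonsson.cpp, saved as ISO-8859-1; compiled with e.g.:
//   g++ -finput-charset=ISO-8859-1 -fexec-charset=UTF-8 jonsson.cpp
#include <cstdio>

int main()
{
    std::printf("Jönsson\n"); // "ö" is the non-ASCII Latin-1 byte 0xF6
    return 0;
}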
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline Defender

  • Multiple posting newcomer
  • *
  • Posts: 49
Re: Linux won't edit Jonsson umlauted name
« Reply #17 on: May 18, 2006, 05:14:41 pm »
Thanks for the info, thomas  :)

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #18 on: May 18, 2006, 06:22:49 pm »
I guess it's even worse. As far as I understand, variable names could be in Unicode for some compilers, too.
Luckily, it says "A valid identifier is a sequence of one or more letters, digits or underscore characters", and "letter" is defined as [A-Za-z]. Phew... :)
You're right though, it is still difficult enough.

Isn't it better to support a limited number of file types, try to determine when a file is of an unknown type, and ask the user to convert it into something "general"? So rather than trying to be very smart, be very strict?
That was the secret "Plan B" :) Though not the best solution, it would probably work well enough.
Possibly we'll have to settle for something like that (or a hybrid solution) in the end, as it is really not trivial, and it is quite possible that we won't find a better solution.
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Offline MortenMacFly

  • Administrator
  • Lives here!
  • *****
  • Posts: 9702
Re: Linux won't edit Jonsson umlauted name
« Reply #19 on: May 18, 2006, 07:36:06 pm »
Luckily, it says "A valid identifier is a sequence of one or more letters, digits or underscore characters", and "letter" is defined as [A-Za-z]. Phew... :)
Yes, I got that part wrong. I was referring to an article about the C# language (compiler), but what was meant was the content of a variable, not the name... sorry. :oops:
Compiler logging: Settings->Compiler & Debugger->tab "Other"->Compiler logging="Full command line"
C::B Manual: https://www.codeblocks.org/docs/main_codeblocks_en.html
C::B FAQ: https://wiki.codeblocks.org/index.php?title=FAQ

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #20 on: May 19, 2006, 02:11:14 am »
I wonder why SMF, when you click on a "Recent Unread Topic", sometimes doesn't take you to the last post read but to the last page.

See my previous post on Page #1.

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #21 on: May 19, 2006, 10:07:58 am »
This means that it would require knowledge about what a valid comment / string is -> so is it //, /*, !, C, ' or '...', "..." (...)?
Actually, do we really want to know? Do we need to know?

Suppose we use a regex like this to hop over everything that we're not interested in:
[^A-Za-z0-9 \+\-\*\/\.\,\:\;\!\"\$\%\&\(\)\=\[\]\{\}\'\#\<\>\\]

Even easier, we could use something like this:

void MoveToNextInterestingLocation(const wxChar*& ptr, const wxChar* endOfFile)
{
    // Skip everything in the plain ASCII range and stop at the first
    // "interesting" (non-ASCII) character or at the end of the buffer.
    while (ptr < endOfFile && (unsigned int)*ptr < 127)
        ++ptr;
}


Lol, actually using a regex was quite a stupid idea :)
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #22 on: May 19, 2006, 10:24:26 am »
Lol, actually using a regex was quite a stupid idea :)
lol  :lol:

What about my suggestion in my last post above of moving the comment tokens into the SDK (for the Code Stats plugin, the wxSmith plugin, the ToDo plugin, etc.)?

Offline MortenMacFly

  • Administrator
  • Lives here!
  • *****
  • Posts: 9702
Re: Linux won't edit Jonsson umlauted name
« Reply #23 on: May 19, 2006, 11:06:39 am »
Actually, do we really want to know? Do we need to know?
Statistically speaking: Yes!
If you want to improve the detection rate, it is always better to put as much knowledge into your model as possible. The probability of detecting the right Unicode scheme is higher if you skip parts which you know won't deliver relevant information. Or (to put it the other way around) amplify parts that have a higher probability of containing Unicode characters. And these are in fact strings and comments. If this is true (which you - Thomas - supposed in an earlier message, and I agree with you on that), it should really be considered, because it will significantly increase the detection rate.
With regards, Morten.
BTW: Modelling is my daily business... in case you wonder... ;-)
Compiler logging: Settings->Compiler & Debugger->tab "Other"->Compiler logging="Full command line"
C::B Manual: https://www.codeblocks.org/docs/main_codeblocks_en.html
C::B FAQ: https://wiki.codeblocks.org/index.php?title=FAQ

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #24 on: May 19, 2006, 11:37:49 am »
Statistically speaking: Yes!
[...] amplify parts that have a higher probability of containing Unicode characters
Exactly :)
But do we want to know whether they're strings or comments or whatever? Do we care what identifies a string constant?
If we dead-strip everything with a value less than 127, we eliminate all ANSI characters (all keywords, operators, and all English comment/constant text). All that remains are the (amplified) non-ANSI characters, if there are any. It greatly biases the distribution towards them, but is that really a bad thing?
The question is just: can you judge a charset/language only by looking at the special characters? Could you tell that "Fürwahr, Du bist der Größte" is "German / ISO-8859-1" only by looking at "üöß"? I think you can.
Even if you cannot tell for sure what language it is, that does not matter - we aren't interested in that. If the text were "Fürwahr, ich liebe Smørrebrød", the stripped "üøø" would still point only to ISO-8859-1. I don't know if we are as lucky with the Cyrillic alphabet, though...
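As a rough sketch of that dead-stripping pass (only an illustration on a raw byte buffer; the function name is made up):

#include <string>

// Keep only the bytes outside the 7-bit ASCII range; what remains is
// what the charset statistics would look at.
std::string StripAscii(const std::string& raw)
{
    std::string result;
    for (std::string::size_type i = 0; i < raw.size(); ++i)
    {
        unsigned char c = raw[i]; // unsigned, so bytes >= 128 don't go negative
        if (c >= 127)
            result += raw[i];
    }
    return result;
}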
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #25 on: May 19, 2006, 12:34:53 pm »
It seems SMF made all of you miss my previous post on Page #1 (?)

Statistically speaking: Yes!
[...] amplify parts that have a higher probability of containing Unicode characters
Exactly :)
But do we want to know whether they're strings or comments or whatever? Do we care what identifies a string constant?
Yes and no, depending on the encoding we're trying to detect and the algorithm used.

If we dead-strip everything with a value less than 127, we eliminate all ANSI characters (all keywords, operators, and all English comment/constant text). All that remains are the (amplified) non-ANSI characters, if there are any. It greatly biases the distribution towards them, but is that really a bad thing?
It would be great if it were as easy as dead-stripping everything with a value smaller than 127. :D
It could only help when trying to detect a single-byte encoding, but not much more.

Think about it a bit: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all of their byte values are lower than 127.

The question is just: can you judge a charset/language only by looking at the special characters? Could you tell that "Fürwahr, Du bist der Größte" is "German / ISO-8859-1" only by looking at "üöß"? I think you can.
I think you can't guess an encoding by looking at individual special characters, only at combinations of them (sequences of 2~3 characters).
You can't tell by looking only at "ü", for example, because that byte value can exist in almost any encoding, and even in many languages (in fact my name has an "ü", and that letter is legal in Spanish :P).

A method that works great for detecting single-byte encodings is what is called the "2-char sequence method". In the example, you'd look for sequences common in German, like "öß", "Fü", etc.
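A sketch of what such a scorer might look like (the bigram table is made up purely for illustration; real tables are trained on large corpora):

#include <string>

// "2-char sequence method": count how often bigrams that are frequent
// in German / ISO-8859-1 occur in the raw text. A detector would keep
// one such table per language/encoding pair and pick the best score.
int ScoreGermanLatin1(const std::string& text)
{
    static const char* const bigrams[] =
        { "\xFC\x72" /* "ür" */, "\xF6\xDF" /* "öß" */, "ch", "en" };
    int score = 0;
    for (size_t i = 0; i < sizeof(bigrams) / sizeof(bigrams[0]); ++i)
    {
        const std::string b(bigrams[i]);
        for (std::string::size_type pos = text.find(b);
             pos != std::string::npos; pos = text.find(b, pos + 1))
            ++score;
    }
    return score;
}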

Another thing we must account for is that we're trying to detect the encoding, not specifically the language. We must remember that some encodings share a lot of codepoints with other encodings; this is commonly the case among East Asian encodings.
So, for example, you could guess that a text containing the sequence "連続" is Japanese (it could be Chinese too), but you can't tell which encoding it is among the ones that share so many codepoints.

I'm guessing you've read it already; it explains very well the three approaches used in Mozilla. Some work great for detecting single-byte encodings and others for detecting multi-byte encodings, which is why the composite approach was used: http://www.mozilla.org/projects/intl/UniversalCharsetDetection.html

« Last Edit: May 19, 2006, 12:45:50 pm by Takeshi Miya »

Offline Defender

  • Multiple posting newcomer
  • *
  • Posts: 49
Re: Linux won't edit Jonsson umlauted name
« Reply #26 on: May 19, 2006, 12:43:19 pm »

Think about it a bit: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all of their byte values are lower than 127.

I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings.
Quote from: thomas
On documents that have no BOM, it boils down to either knowing the correct encoding or making a wild guess based on the actual content. Something that is relatively easy to do is to identify UTF-16 encoded files, since they have unhealthy amounts of zero bytes, and similarly it is relatively easy to tell whether a document is neither ANSI nor UTF-8 encoded (if you try to decode it as UTF-8 and it contains illegal sequences, then it is neither). That's what the Mozilla engine (and probably most other browsers) does as the first step, too.

Excuse me if I am wrong.

Defender

Offline MortenMacFly

  • Administrator
  • Lives here!
  • *****
  • Posts: 9702
Re: Linux won't edit Jonsson umlauted name
« Reply #27 on: May 19, 2006, 12:44:49 pm »
Exactly :)
Mmmmh.... I was just reading over at Wikipedia, from http://en.wikipedia.org/wiki/ISO/IEC_8859-1 to http://en.wikipedia.org/wiki/ISO/IEC_8859-16, about what such a bias would mean. Just to make sure I get you right: so if everything up to 127 (maybe even everything up to 160) is skipped, we truly skip all keywords, variables and symbols/signs. There is a very nice table at http://en.wikipedia.org/wiki/ISO_8859#Table that compares the different ISOs.
What remains is what you would like to analyse statistically, right?
This sounds logical to me - yes - but it's difficult to judge whether this is a good approach. I think we really require an expert on that topic to answer (ideally one who knows enough about all of these languages).
Still: I think this is a very minor addition (e.g. a simple stream manipulator) to the main part that would be required anyway. Unfortunately, I know nearly nothing about languages or the theory of how often specific characters or combinations of characters occur, which would be needed to set up a statistical model for what remains. The more I think about it... it seems there is a reason why the Mozilla effort is so complex...?! :roll:
With regards, Morten.

Edit: Linked directly to table mentioned to avoid confusion with other tables on that page.
« Last Edit: May 19, 2006, 01:06:58 pm by MortenMacFly »
Compiler logging: Settings->Compiler & Debugger->tab "Other"->Compiler logging="Full command line"
C::B Manual: https://www.codeblocks.org/docs/main_codeblocks_en.html
C::B FAQ: https://wiki.codeblocks.org/index.php?title=FAQ

takeshimiya

  • Guest
Re: Linux won't edit Jonsson umlauted name
« Reply #28 on: May 19, 2006, 12:59:20 pm »

Think about it a bit: if you dead-strip all bytes with a value < 127, you'll be removing a lot of information in multibyte encodings such as Shift-JIS, UTF-16, etc., because almost all of their byte values are lower than 127.

I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings.

Yes, of course the very first thing would be checking for a BOM and trying to detect the easy encodings. But almost all multibyte encodings are precisely the ones that are not easy to detect.


So if everything up to 127 (maybe even everything up to 160) is skipped, we truly skip all keywords, variables and symbols/signs. There is a very nice table at http://en.wikipedia.org/wiki/ISO_8859 that compares the different ISOs.
Again, that'll only serve for positively detecting single-byte encodings, with what you could call the "1-char detection method", and I explained above why it will not work except in very few cases. The 2-char (or longer) sequence method would give us a better guess (for single-byte encodings).

Still: I think this is a very minor addition (e.g. a simple stream manipulator) to the main part that would be required anyway.
I think so too, but while it helps some detection methods, other methods will need the raw stream without any "comment parsing". It is a great idea nonetheless.


What remains is what you would like to analyse statistically, right?
The more I think about it... it seems there is a reason why the Mozilla effort is so complex...?! :roll:
With regards, Morten.
Yes, they've already built a lot of tools, automation programs and research for language statistics. The "Mozilla language/encoding detection module" was a separate project that was later merged into Mozilla and is now being maintained there.

Even if we don't use the Mozilla detection source, we can still use a lot of the research they made: the statistical analysis for each language, the tools for building those statistics, etc.

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Re: Linux won't edit Jonsson umlauted name
« Reply #29 on: May 19, 2006, 01:35:08 pm »
I think thomas meant that the stripping should only take place after we have checked for the easy-to-detect encodings.
Right. UTF-16 can be ruled out rather easily to start with, and the remaining "strange" encodings should not bother us.
[There are just too many of them, wxScite does not even support some, and manually converting them is a pain. Also, only a tiny fraction of the population uses them. We should concentrate on making it work well for 98%, not waste time on making it work badly for 100%.
Those few actually using anything other than 8-bit codepages, UTF-8, and UTF-16 can reasonably be asked to convert to UTF-8 or UTF-16. Since both UTF-8 and UTF-16 are established standards that work excellently, there should not be much to object to.]
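Ruling out UTF-16 can be as crude as counting zero bytes (a sketch; the 20% threshold is an arbitrary guess):

#include <string>

// Mostly-Latin text encoded as UTF-16 has a zero byte in almost every
// second position - something that never happens in ANSI or valid UTF-8.
bool LooksLikeUtf16(const std::string& raw)
{
    if (raw.empty())
        return false;
    std::string::size_type zeros = 0;
    for (std::string::size_type i = 0; i < raw.size(); ++i)
        if (raw[i] == '\0')
            ++zeros;
    return zeros * 5 >= raw.size(); // arbitrary: at least 20% zero bytes
}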


Regarding UTF-8: if there are no valid UTF-8 multibyte sequences to be found (which you can check easily), then either there are none because the text contains no characters that need encoding, or you will find illegal sequences.

In the former case, you don't need to care, as you'll simply use ANSI encoding (or whatever codepage - it shouldn't matter; you can even use UTF-8, just to be safe for the future).

Only in the latter case does it get really complicated. Then, and only then, you may have characters in some unknown encoding which might mean anything: for example, the 8859-1 Ø could just as well be the 8859-2 Ř, and you have no idea which it is.

Now the problem lies in finding out whether it is a Ø or a Ř, and this has to be done with a statistical model.
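For reference, the UTF-8 validity check described above could look roughly like this (a sketch only; it does not reject overlong encodings or surrogates):

#include <string>

// Returns true if every multibyte sequence in 'raw' is well-formed UTF-8.
bool IsValidUtf8(const std::string& raw)
{
    std::string::size_type i = 0;
    while (i < raw.size())
    {
        const unsigned char c = raw[i];
        std::string::size_type extra;
        if      (c < 0x80)           extra = 0; // plain ASCII
        else if ((c & 0xE0) == 0xC0) extra = 1; // 2-byte sequence
        else if ((c & 0xF0) == 0xE0) extra = 2; // 3-byte sequence
        else if ((c & 0xF8) == 0xF0) extra = 3; // 4-byte sequence
        else                         return false; // illegal lead byte
        if (i + extra >= raw.size())
            return false; // sequence truncated at end of buffer
        for (std::string::size_type k = 1; k <= extra; ++k)
            if ((static_cast<unsigned char>(raw[i + k]) & 0xC0) != 0x80)
                return false; // not a continuation byte
        i += extra + 1;
    }
    return true;
}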

Martin:
What if we use word boundaries to feed complete words to the statistical model?
For example, we first skip over all ANSI chars to identify candidate "strange chars", and then use Scintilla's word retrieval functions.
If the input text were "Ångström was a physicist", then we would use "Ångström" for the statistical model, rather than "Åö".
In this case, we do not need to bother with what might be a comment or character constant, and we don't need to parse programming-language-specific stuff, but we still get somewhat of a "complete" fragment (nice oxymoron). It is a compromise between a full parser and using only a fraction of the statistical data available (the frequencies of "ngstrm" may give a clue, too).
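Roughly like this sketch (plain whitespace scanning stands in for Scintilla's word retrieval functions here, just to show the idea):

#include <cctype>
#include <string>
#include <vector>

// Collect every whitespace-delimited word containing at least one
// non-ASCII byte, so that complete words reach the statistical model.
std::vector<std::string> CollectStrangeWords(const std::string& raw)
{
    std::vector<std::string> words;
    std::string::size_type i = 0;
    while (i < raw.size())
    {
        while (i < raw.size() && std::isspace((unsigned char)raw[i]))
            ++i; // skip separators
        const std::string::size_type start = i;
        bool strange = false;
        while (i < raw.size() && !std::isspace((unsigned char)raw[i]))
        {
            if ((unsigned char)raw[i] >= 127)
                strange = true; // candidate "strange" character
            ++i;
        }
        if (strange)
            words.push_back(raw.substr(start, i - start)); // e.g. "Ångström"
    }
    return words;
}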

I don't remember exactly who came up with counting letters in the English language first, you'd probably praise the late Mr. Shannon, but Arthur Conan Doyle wrote his tale "The Dancing Men" 13 years before Shannon was even born...
Anyway, this story teaches us that seemingly unimportant letters may be interesting too :)
« Last Edit: May 19, 2006, 01:37:24 pm by thomas »
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."