Developer forums (C::B DEVELOPMENT STRICTLY!) > Development
Linux won't edit Jonsson umlauted name
takeshimiya:
I wonder why SMF, when you click on a "Recent Unread Topic", sometimes takes you to the last page instead of to the last post read.
See my previous post on Page #1.
thomas:
--- Quote from: MortenMacFly on May 18, 2006, 02:57:12 pm ---This means that it would be required to have knowledge about what is a valid comment / string -> so is it //, /*, !, C, ' or '...', "..." (...).
--- End quote ---
Actually, do we really want to know? Do we need to know?
Suppose we use a regex like this to hop over everything that we're not interested in:
[^A-Za-z0-9 \+\-\*\/\.\,\:\;\!\"\$\%\&\(\)\=\[\]\{\}\'\#\<\>\\]
Even easier, we could use something like this:
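// Assumes endOfFile (one past the last character) is in scope.
void MoveToNextInterestingLocation(wxChar*& ptr)
{
    // Hop over every character with a value below 127.
    while (ptr < endOfFile && *ptr < 127)
        ++ptr;
}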
Lol, actually using a regex was quite a stupid idea :)
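For anyone who wants to try the loop in isolation, here is a minimal self-contained sketch. It rests on two assumptions that are not in the snippet above: the end of the buffer is passed in as a parameter instead of coming from an endOfFile variable, and wchar_t stands in for wxChar so that it builds without wxWidgets.

```cpp
// Advances ptr past every character whose value is below 127 and stops
// at the first character that is 127 or above, or at end.
// Assumptions: 'end' points one past the last character, and wchar_t
// stands in for wxChar (no wxWidgets needed for this sketch).
void MoveToNextInterestingLocation(const wchar_t*& ptr, const wchar_t* end)
{
    while (ptr < end && *ptr < 127)
        ++ptr;
}
```

After the call, ptr either points at the next "interesting" character or equals end, so the caller can inspect that character, increment ptr, and call again.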
takeshimiya:
--- Quote from: thomas on May 19, 2006, 10:07:58 am ---Lol, actually using a regex was quite a stupid idea :)
--- End quote ---
lol :lol:
What about the suggestion in my last post above: moving the comment tokens into the SDK (for the Code Stats plugin, the wxSmith plugin, the ToDo plugin, etc.)?
MortenMacFly:
--- Quote from: thomas on May 19, 2006, 10:07:58 am ---Actually, do we really want to know? Do we need to know?
--- End quote ---
Statistically speaking: Yes!
If you want to improve the detection rate, it is always better to put as much knowledge into your model as possible. The probability of detecting the right Unicode scheme is higher if you skip the parts that you know won't deliver relevant information. Or, to put it the other way round: amplify the parts that have a higher probability of containing Unicode characters. And those are in fact strings and comments. If this is true (which you, Thomas, supposed in an earlier message, and I agree with you on that), it should really be considered, because it will significantly increase the detection rate.
With regards, Morten.
BTW: Modelling is my daily business... in case you wonder... ;-)
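To make the "amplify strings and comments" idea concrete, here is a rough sketch for C/C++ sources only. The function name ExtractCommentsAndStrings is hypothetical and nothing like it is claimed to exist in the SDK; it also ignores the other comment styles mentioned earlier (!, C, ').

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: keeps only the text inside C/C++ comments and
// string literals, the regions most likely to hold non-ASCII text.
// Simplified on purpose: no raw strings, no line continuations, and
// character literals count as ordinary code (so '"' would be misread
// as the start of a string).
std::string ExtractCommentsAndStrings(const std::string& src)
{
    enum class State { Code, LineComment, BlockComment, String };
    State state = State::Code;
    std::string out;

    for (std::size_t i = 0; i < src.size(); ++i)
    {
        const char c = src[i];
        const char next = (i + 1 < src.size()) ? src[i + 1] : '\0';

        switch (state)
        {
        case State::Code:
            if (c == '/' && next == '/') { state = State::LineComment; ++i; }
            else if (c == '/' && next == '*') { state = State::BlockComment; ++i; }
            else if (c == '"') { state = State::String; }
            break;
        case State::LineComment:
            if (c == '\n') state = State::Code;   // comment ends at end of line
            else out += c;
            break;
        case State::BlockComment:
            if (c == '*' && next == '/') { state = State::Code; ++i; }
            else out += c;
            break;
        case State::String:
            if (c == '\\' && i + 1 < src.size()) { out += c; out += next; ++i; } // keep escapes whole
            else if (c == '"') state = State::Code;
            else out += c;
            break;
        }
    }
    return out;
}
```

The returned text could then be handed to whatever detector is used, in place of the whole file. Whether the extra per-language knowledge pays off compared with a plain byte filter is exactly the open question in this thread.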
thomas:
--- Quote from: MortenMacFly on May 19, 2006, 11:06:39 am ---Statistically speaking: Yes!
[...] amplify the parts that have a higher probability of containing Unicode characters
--- End quote ---
Exactly :)
But do we want to know whether they're strings or comments or whatever? Do we care what identifies a string constant?
If we dead-strip everything with a value less than 127, we eliminate all ASCII characters (all keywords, operators, and all English comment/constant text). All that remains are the (amplified) non-ASCII characters, if there are any. It greatly biases the distribution towards them, but is that really a bad thing?
The question is just, can you judge a charset/language only by looking at the special characters? Could you tell that "Fürwahr, Du bist der Größte" is "German / ISO-8859-1" only by looking at "üöß"? I think you can.
Even if you cannot tell for sure what language it is, that does not matter - we aren't interested in that. If the text were "Fürwahr, ich liebe Smørrebrød" (stripped: "üøø"), it would still only be ISO-8859-1. I don't know if we are as lucky with the Cyrillic alphabet, though...