Author Topic: Question on c == 178 || c == 179 || c == 185 in tokenizer (Read 33849 times)

ollydbg · « **on:** October 24, 2009, 07:03:08 am »

In tokenizer.cpp, around line 664, there is code like the following:

Code

#ifdef __WXMSW__ // This is a Windows only bug!
    else if (c == 178 || c == 179 || c == 185) // fetch ² and ³
    {
        str = c;
        MoveToNextChar();
    }
#endif

I set a breakpoint there and tested several projects, but the breakpoint was never hit.
So, my question is:

Can someone explain this? Why is it a Windows-only bug?

Thanks.

ollydbg · « **Reply #1 on:** October 24, 2009, 07:15:34 am »

By the way, look at the screenshot:

The "2" and "3" are superscript numbers. I don't think they should appear anywhere in C/C++ source code for any reason.

Edit

I just traced the change log of tokenizer.cpp. It seems this code was added by Morten in rev 5117.

Code

    else if (c == 178 || c == 179 || c == 185) // fetch ² and ³
    {
        str = c;
        MoveToNextChar();
    }

In rev 5150, a preprocessor guard was added around it by biplab.

Code

#ifdef __WXMSW__ // This is a Windows only bug!
    else if (c == 178 || c == 179 || c == 185) // fetch ² and ³
    {
        str = c;
        MoveToNextChar();
    }
#endif

So, @morten: can you explain a little? Thanks.

Jenna · « **Reply #2 on:** October 24, 2009, 09:40:25 am »

Hi ollydbg, these characters cause a lockup on parsing, see this thread for more information: http://forums.codeblocks.org/index.php/topic,8700.msg63405.html#msg63405 .

ollydbg · « **Reply #3 on:** October 24, 2009, 09:51:10 am »

Thanks, jens. I'll read that topic carefully.

ollydbg · « **Reply #4 on:** October 30, 2009, 03:58:08 pm »

Hi, jens:

I still think that this code can be removed, because people won't write the superscript "2" and "3" as a variable name or an identifier.

So, the only place they can exist is in comments, and in that case these special characters can be safely skipped. I can't think of a case where people would misuse them.

Jenna · « **Reply #5 on:** October 30, 2009, 04:21:14 pm »

Quote from: ollydbg on October 30, 2009, 03:58:08 pm

Hi, jens:

I still think that this code can be removed, because people won't write the superscript "2" and "3" as a variable name or an identifier.

So, the only place they can exist is in comments, and in that case these special characters can be safely skipped. I can't think of a case where people would misuse them.

We added this code because people did in fact type these superscripts and C::B locked up. That should of course not happen, even if users type something in the wrong, or rather unexpected, place.

ollydbg · « **Reply #6 on:** October 30, 2009, 04:26:49 pm »

Quote from: jens on October 30, 2009, 04:21:14 pm

Quote from: ollydbg on October 30, 2009, 03:58:08 pm
Hi, jens:

I still think that this code can be removed, because people won't write the superscript "2" and "3" as a variable name or an identifier.

So, the only place they can exist is in comments, and in that case these special characters can be safely skipped. I can't think of a case where people would misuse them.

We added this code because people did in fact type these superscripts and C::B locked up. That should of course not happen, even if users type something in the wrong, or rather unexpected, place.

OK, this is a good reason!
Thanks for your explanation.

ollydbg · « **Reply #7 on:** March 29, 2012, 04:14:55 am »

I'm thinking about the approach we currently use:

Code

#ifdef __WXMSW__ // This is a Windows only bug!
    // fetch non-English characters, see more details in: http://forums.codeblocks.org/index.php/topic,11387.0.html
    else if (c == 178 || c == 179 || c == 185)
    {
        str = c;
        MoveToNextChar();
    }
#endif
    else if (wxIsdigit(c))
    {
        // numbers
        while (NotEOF() && CharInString(CurrentChar(), _T("0123456789.abcdefABCDEFXxLl")))
            MoveToNextChar();

        if (IsEOF())
            return wxEmptyString;

        str = m_Buffer.Mid(start, m_TokenIndex - start);
    }

This way, we filter out the bogus digits by using:

Code

if (c == 178 || c == 179 || c == 185)

My idea is that we can still let control flow go into this branch:

Code

 else if (wxIsdigit(c))
    {
        // numbers
        while (NotEOF() && CharInString(CurrentChar(), _T("0123456789.abcdefABCDEFXxLl")))
            MoveToNextChar();

        if (IsEOF())
            return wxEmptyString;

        str = m_Buffer.Mid(start, m_TokenIndex - start);
    }

Here, once the branch is entered, we first call MoveToNextChar() and only then check CurrentChar(). This way we always advance by at least one character, which avoids the endless loop. (The loop occurs because this branch does not advance at all, so we are still at the same index the next time Tokenizer::DoGetToken() runs.)

Code

 else if (wxIsdigit(c))
    {
        // numbers
        while (NotEOF() && CharInString(CurrentChar(), _T("0123456789.abcdefABCDEFXxLl")))
            MoveToNextChar();

Currently, we check the current character twice: once in wxIsdigit(c) and once in CharInString(CurrentChar(), _T("0123456789.abcdefABCDEFXxLl")).

Any ideas?
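
To illustrate the idea, here is a minimal standalone sketch, not the real Tokenizer::DoGetToken(). The names LibraryIsDigit, IsNumberChar and Tokenize are hypothetical, and the digit predicate simply hard-codes the behaviour discussed in this thread (178, 179 and 185 count as digits). The point is only that consuming the first character unconditionally guarantees progress:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stand-in for a library digit test that also accepts the superscripts
// 178, 179 and 185 (assumption: hard-coded here for the demonstration).
bool LibraryIsDigit(wchar_t c)
{
    return (c >= L'0' && c <= L'9') || c == 178 || c == 179 || c == 185;
}

// Characters accepted inside a numeric token, as in the tokenizer snippet.
bool IsNumberChar(wchar_t c)
{
    static const std::wstring allowed = L"0123456789.abcdefABCDEFXxLl";
    return allowed.find(c) != std::wstring::npos;
}

// Hypothetical simplified tokenizer: yields number tokens and
// single-character tokens for everything else.
std::vector<std::wstring> Tokenize(const std::wstring& buf)
{
    std::vector<std::wstring> tokens;
    std::size_t pos = 0;
    while (pos < buf.size())
    {
        const std::size_t start = pos;
        if (LibraryIsDigit(buf[pos]))
        {
            // Consume the first character unconditionally, then continue
            // with the stricter character set.
            ++pos;
            while (pos < buf.size() && IsNumberChar(buf[pos]))
                ++pos;
        }
        else
        {
            ++pos; // any other character is a one-character token here
        }
        assert(pos > start); // every iteration must make progress
        tokens.push_back(buf.substr(start, pos - start));
    }
    return tokens;
}
```

With this shape, a lone superscript simply becomes a one-character token, so no special #ifdef branch is needed to avoid the stall.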

MortenMacFly · « **Reply #8 on:** March 29, 2012, 06:45:34 am »

Quote from: ollydbg on March 29, 2012, 04:14:55 am

Code

 else if (wxIsdigit(c))
    {
        // numbers
        while (NotEOF() && CharInString(CurrentChar(), _T("0123456789.abcdefABCDEFXxLl")))
            MoveToNextChar();

        if (IsEOF())
            return wxEmptyString;

        str = m_Buffer.Mid(start, m_TokenIndex - start);
    }

Here, once the branch is entered, we first call MoveToNextChar() and only then check CurrentChar().

I believe we don't. As far as I can remember, wxIsdigit() unfortunately treats "²" and "³" as digits, too (seems to be Windows only?!). So if we encounter such a character, the while loop will not move to the next char at all, and we end up in an endless loop.

But why don't you simply try?

BTW: 178 is ², 179 is ³, but what was 185 ?
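
A small sketch of why the current branch can stall, under the assumption that the digit test accepts 178, 179 and 185 (hard-coded below; LibraryIsDigit, IsNumberChar and ConsumeNumber are hypothetical names, not Code::Blocks or wxWidgets functions). The outer test uses the broad predicate, the inner loop uses the strict character list, so a superscript enters the branch but consumes nothing:

```cpp
#include <cstddef>
#include <string>

// Assumption: mimics a digit test that also accepts the superscripts.
bool LibraryIsDigit(wchar_t c)
{
    return (c >= L'0' && c <= L'9') || c == 178 || c == 179 || c == 185;
}

// The strict character list from the tokenizer snippet.
bool IsNumberChar(wchar_t c)
{
    static const std::wstring allowed = L"0123456789.abcdefABCDEFXxLl";
    return allowed.find(c) != std::wstring::npos;
}

// Mirrors the shape of the number branch and returns how many
// characters it consumes starting at pos.
std::size_t ConsumeNumber(const std::wstring& buf, std::size_t pos)
{
    const std::size_t start = pos;
    if (pos < buf.size() && LibraryIsDigit(buf[pos]))
    {
        while (pos < buf.size() && IsNumberChar(buf[pos]))
            ++pos;
    }
    return pos - start;
}
```

For an ordinary number the branch consumes the whole literal, but for a superscript it consumes zero characters. A caller that keeps asking for the next token at the same index then never terminates.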

ollydbg · « **Reply #9 on:** March 29, 2012, 08:26:05 am »

I wrote some sample test code:

Code

void IsAlphaFrame::OnAbout(wxCommandEvent& event)
{

    wxChar a = _T('中');
    if(wxIsalpha(a))
    {
        TextCtrl->AppendText(a);
        TextCtrl->AppendText(_T(" is an alpha"));
        TextCtrl->AppendText(_T("\r\n"));
    }

    int i;
    for (i = 0; i < 500; ++i)
    {
        if (wxIsdigit(i))
        {
            wxString str;
            str << i;
            TextCtrl->AppendText(str);
            TextCtrl->AppendText(_T(" "));
            wxChar ch = i;
            TextCtrl->AppendText(ch);
            TextCtrl->AppendText(_T(" is a digit"));
            TextCtrl->AppendText(_T("\r\n"));
        }
    }

    for (i = 0; i < 500; ++i)
    {
        if (wxIsalpha(i))
        {
            wxString str;
            str << i;
            TextCtrl->AppendText(str);
            TextCtrl->AppendText(_T(" "));
            wxChar ch = i;
            TextCtrl->AppendText(ch);
            TextCtrl->AppendText(_T(" is an alpha"));
            TextCtrl->AppendText(_T("\r\n"));
        }
    }
}

And the result is:

Code

中 is an alpha
48 0 is a digit
49 1 is a digit
50 2 is a digit
51 3 is a digit
52 4 is a digit
53 5 is a digit
54 6 is a digit
55 7 is a digit
56 8 is a digit
57 9 is a digit
178 ² is a digit
179 ³ is a digit
185 ¹ is a digit
65 A is an alpha
66 B is an alpha
67 C is an alpha
68 D is an alpha
69 E is an alpha
70 F is an alpha
71 G is an alpha
72 H is an alpha
73 I is an alpha
74 J is an alpha
75 K is an alpha
76 L is an alpha
77 M is an alpha
78 N is an alpha
79 O is an alpha
80 P is an alpha
81 Q is an alpha
82 R is an alpha
83 S is an alpha
84 T is an alpha
85 U is an alpha
86 V is an alpha
87 W is an alpha
88 X is an alpha
89 Y is an alpha
90 Z is an alpha
97 a is an alpha
98 b is an alpha
99 c is an alpha
100 d is an alpha
101 e is an alpha
102 f is an alpha
103 g is an alpha
104 h is an alpha
105 i is an alpha
106 j is an alpha
107 k is an alpha
108 l is an alpha
109 m is an alpha
110 n is an alpha
111 o is an alpha
112 p is an alpha
113 q is an alpha
114 r is an alpha
115 s is an alpha
116 t is an alpha
117 u is an alpha
118 v is an alpha
119 w is an alpha
120 x is an alpha
121 y is an alpha
122 z is an alpha
192 À is an alpha
193 Á is an alpha
194 Â is an alpha
195 Ã is an alpha
196 Ä is an alpha
197 Å is an alpha
198 Æ is an alpha
199 Ç is an alpha
200 È is an alpha
201 É is an alpha
202 Ê is an alpha
203 Ë is an alpha
204 Ì is an alpha
205 Í is an alpha
206 Î is an alpha
207 Ï is an alpha
208 Ð is an alpha
209 Ñ is an alpha
210 Ò is an alpha
211 Ó is an alpha
212 Ô is an alpha
213 Õ is an alpha
214 Ö is an alpha
216 Ø is an alpha
217 Ù is an alpha
218 Ú is an alpha
219 Û is an alpha
220 Ü is an alpha
221 Ý is an alpha
222 Þ is an alpha
223 ß is an alpha
224 à is an alpha
225 á is an alpha
226 â is an alpha
227 ã is an alpha
228 ä is an alpha
229 å is an alpha
230 æ is an alpha
231 ç is an alpha
232 è is an alpha
233 é is an alpha
234 ê is an alpha
235 ë is an alpha
236 ì is an alpha
237 í is an alpha
238 î is an alpha
239 ï is an alpha
240 ð is an alpha
241 ñ is an alpha
242 ò is an alpha
243 ó is an alpha
244 ô is an alpha
245 õ is an alpha
246 ö is an alpha
248 ø is an alpha
249 ù is an alpha
250 ú is an alpha
251 û is an alpha
252 ü is an alpha
253 ý is an alpha
254 þ is an alpha
255 ÿ is an alpha
256 Ā is an alpha
257 ā is an alpha
258 Ă is an alpha
259 ă is an alpha
260 Ą is an alpha
261 ą is an alpha
262 Ć is an alpha
263 ć is an alpha
264 Ĉ is an alpha
265 ĉ is an alpha
266 Ċ is an alpha
267 ċ is an alpha
268 Č is an alpha
269 č is an alpha
270 Ď is an alpha
271 ď is an alpha
272 Đ is an alpha
273 đ is an alpha
274 Ē is an alpha
275 ē is an alpha
276 Ĕ is an alpha
277 ĕ is an alpha
278 Ė is an alpha
279 ė is an alpha
280 Ę is an alpha
281 ę is an alpha
282 Ě is an alpha
283 ě is an alpha
284 Ĝ is an alpha
285 ĝ is an alpha
286 Ğ is an alpha
287 ğ is an alpha
288 Ġ is an alpha
289 ġ is an alpha
290 Ģ is an alpha
291 ģ is an alpha
292 Ĥ is an alpha
293 ĥ is an alpha
294 Ħ is an alpha
295 ħ is an alpha
296 Ĩ is an alpha
297 ĩ is an alpha
298 Ī is an alpha
299 ī is an alpha
300 Ĭ is an alpha
301 ĭ is an alpha
302 Į is an alpha
303 į is an alpha
304 İ is an alpha
305 ı is an alpha
306 Ĳ is an alpha
307 ĳ is an alpha
308 Ĵ is an alpha
309 ĵ is an alpha
310 Ķ is an alpha
311 ķ is an alpha
312 ĸ is an alpha
313 Ĺ is an alpha
314 ĺ is an alpha
315 Ļ is an alpha
316 ļ is an alpha
317 Ľ is an alpha
318 ľ is an alpha
319 Ŀ is an alpha
320 ŀ is an alpha
321 Ł is an alpha
322 ł is an alpha
323 Ń is an alpha
324 ń is an alpha
325 Ņ is an alpha
326 ņ is an alpha
327 Ň is an alpha
328 ň is an alpha
329 ŉ is an alpha
330 Ŋ is an alpha
331 ŋ is an alpha
332 Ō is an alpha
333 ō is an alpha
334 Ŏ is an alpha
335 ŏ is an alpha
336 Ő is an alpha
337 ő is an alpha
338 Œ is an alpha
339 œ is an alpha
340 Ŕ is an alpha
341 ŕ is an alpha
342 Ŗ is an alpha
343 ŗ is an alpha
344 Ř is an alpha
345 ř is an alpha
346 Ś is an alpha
347 ś is an alpha
348 Ŝ is an alpha
349 ŝ is an alpha
350 Ş is an alpha
351 ş is an alpha
352 Š is an alpha
353 š is an alpha
354 Ţ is an alpha
355 ţ is an alpha
356 Ť is an alpha
357 ť is an alpha
358 Ŧ is an alpha
359 ŧ is an alpha
360 Ũ is an alpha
361 ũ is an alpha
362 Ū is an alpha
363 ū is an alpha
364 Ŭ is an alpha
365 ŭ is an alpha
366 Ů is an alpha
367 ů is an alpha
368 Ű is an alpha
369 ű is an alpha
370 Ų is an alpha
371 ų is an alpha
372 Ŵ is an alpha
373 ŵ is an alpha
374 Ŷ is an alpha
375 ŷ is an alpha
376 Ÿ is an alpha
377 Ź is an alpha
378 ź is an alpha
379 Ż is an alpha
380 ż is an alpha
381 Ž is an alpha
382 ž is an alpha
383 ſ is an alpha
384 ƀ is an alpha
385 Ɓ is an alpha
386 Ƃ is an alpha
387 ƃ is an alpha
388 Ƅ is an alpha
389 ƅ is an alpha
390 Ɔ is an alpha
391 Ƈ is an alpha
392 ƈ is an alpha
393 Ɖ is an alpha
394 Ɗ is an alpha
395 Ƌ is an alpha
396 ƌ is an alpha
397 ƍ is an alpha
398 Ǝ is an alpha
399 Ə is an alpha
400 Ɛ is an alpha
401 Ƒ is an alpha
402 ƒ is an alpha
403 Ɠ is an alpha
404 Ɣ is an alpha
405 ƕ is an alpha
406 Ɩ is an alpha
407 Ɨ is an alpha
408 Ƙ is an alpha
409 ƙ is an alpha
410 ƚ is an alpha
411 ƛ is an alpha
412 Ɯ is an alpha
413 Ɲ is an alpha
414 ƞ is an alpha
415 Ɵ is an alpha
416 Ơ is an alpha
417 ơ is an alpha
418 Ƣ is an alpha
419 ƣ is an alpha
420 Ƥ is an alpha
421 ƥ is an alpha
422 Ʀ is an alpha
423 Ƨ is an alpha
424 ƨ is an alpha
425 Ʃ is an alpha
426 ƪ is an alpha
427 ƫ is an alpha
428 Ƭ is an alpha
429 ƭ is an alpha
430 Ʈ is an alpha
431 Ư is an alpha
432 ư is an alpha
433 Ʊ is an alpha
434 Ʋ is an alpha
435 Ƴ is an alpha
436 ƴ is an alpha
437 Ƶ is an alpha
438 ƶ is an alpha
439 Ʒ is an alpha
440 Ƹ is an alpha
441 ƹ is an alpha
442 ƺ is an alpha
443 ƻ is an alpha
444 Ƽ is an alpha
445 ƽ is an alpha
446 ƾ is an alpha
447 ƿ is an alpha
448 ǀ is an alpha
449 ǁ is an alpha
450 ǂ is an alpha
451 ǃ is an alpha
452 Ǆ is an alpha
453 ǅ is an alpha
454 ǆ is an alpha
455 Ǉ is an alpha
456 ǈ is an alpha
457 ǉ is an alpha
458 Ǌ is an alpha
459 ǋ is an alpha
460 ǌ is an alpha
461 Ǎ is an alpha
462 ǎ is an alpha
463 Ǐ is an alpha
464 ǐ is an alpha
465 Ǒ is an alpha
466 ǒ is an alpha
467 Ǔ is an alpha
468 ǔ is an alpha
469 Ǖ is an alpha
470 ǖ is an alpha
471 Ǘ is an alpha
472 ǘ is an alpha
473 Ǚ is an alpha
474 ǚ is an alpha
475 Ǜ is an alpha
476 ǜ is an alpha
477 ǝ is an alpha
478 Ǟ is an alpha
479 ǟ is an alpha
480 Ǡ is an alpha
481 ǡ is an alpha
482 Ǣ is an alpha
483 ǣ is an alpha
484 Ǥ is an alpha
485 ǥ is an alpha
486 Ǧ is an alpha
487 ǧ is an alpha
488 Ǩ is an alpha
489 ǩ is an alpha
490 Ǫ is an alpha
491 ǫ is an alpha
492 Ǭ is an alpha
493 ǭ is an alpha
494 Ǯ is an alpha
495 ǯ is an alpha
496 ǰ is an alpha
497 Ǳ is an alpha
498 ǲ is an alpha
499 ǳ is an alpha

Look: in this test, all values >= 255 are regarded as alpha.

Can we just use some code like:

Code

if (_T('0') <= ch && ch <= _T('9'))
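
As a standalone sketch of that range check (plain C++ without wxWidgets; IsAsciiDigit is a hypothetical helper name), relying on the fact that the C++ standard guarantees '0' through '9' are contiguous:

```cpp
// Returns true only for the ten ASCII decimal digits, regardless of
// locale or of what the library's digit classification says.
constexpr bool IsAsciiDigit(wchar_t ch)
{
    return ch >= L'0' && ch <= L'9';
}

// Compile-time checks: an ordinary digit passes, superscript two does not.
static_assert(IsAsciiDigit(L'7'), "'7' is an ASCII digit");
static_assert(!IsAsciiDigit(L'\u00B2'), "superscript two is not an ASCII digit");
```

Note that the comparison must use character literals; comparing against string literals would compare pointers instead.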

MortenMacFly · « **Reply #10 on:** March 29, 2012, 09:32:14 am »

Probably I am missing something, but I don't see what the problem is.

The C++ standard in theory allows Unicode characters in variable names (if you encode the file properly and pass some voodoo command-line switches to GCC, for example). So what wxIsalpha does is correct. With wxIsdigit it's different, because "²" and the like are really not digits, hence the workaround.

If a user tries to compile a file with strange variable names and doesn't set up everything properly, the compiler will complain anyway.

So what bug are you trying to fix? Is there a combination / source code that does not work properly? Can you provide a test case then?

ollydbg · « **Reply #11 on:** March 29, 2012, 10:45:09 am »

Quote from: MortenMacFly on March 29, 2012, 09:32:14 am

So what bug are you trying to fix?

I'm reviewing the code, and I think we can remove this #ifdef snippet to make the source code easier to read and understand.

Quote

Is there a combination / source code that does not work properly? Can you provide a test case then?

The change I suggest does not fix any errors; it is just a kind of refactoring.

thomas · « **Reply #12 on:** March 29, 2012, 12:04:43 pm »

Quote from: MortenMacFly on March 29, 2012, 09:32:14 am

With wxIsdigit it's different, because "²" and the like are really not digits, hence the workaround.

Unfortunately, this is not a bug; wxWidgets is correct for once.

Unicode is admittedly absurd in many places, and this is one of them. But it is pointless to discuss whether it makes sense or not, or whether it's "correct". Unicode, which is the standard, defines it that way, so it is correct by definition. It's total bull, and it doesn't even make sense, but it is correct.

For example, ³ is SUPERSCRIPT THREE, categorized under Number, other, and assigned the numeric value 3. See here for a nice tabular breakdown.
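
To make the categorization concrete, here is a tiny hand-written lookup covering only the ASCII digits and the three code points from this thread (178 = U+00B2, 179 = U+00B3, 185 = U+00B9). It is an illustrative sketch, not a real Unicode database query; NumberClass, NumberInfo and Classify are hypothetical names:

```cpp
// Two Unicode general categories relevant here, plus a catch-all for
// everything this sketch does not cover.
enum class NumberClass
{
    DecimalDigit, // "Number, decimal digit" (Nd), e.g. '0'..'9'
    OtherNumber,  // "Number, other" (No), e.g. the superscripts
    NotListed     // anything outside this tiny table
};

struct NumberInfo
{
    NumberClass cls;
    int value; // numeric value, or -1 if not listed
};

NumberInfo Classify(char32_t cp)
{
    if (cp >= U'0' && cp <= U'9')
        return {NumberClass::DecimalDigit, static_cast<int>(cp - U'0')};

    switch (cp)
    {
        case 0x00B2: return {NumberClass::OtherNumber, 2}; // SUPERSCRIPT TWO
        case 0x00B3: return {NumberClass::OtherNumber, 3}; // SUPERSCRIPT THREE
        case 0x00B9: return {NumberClass::OtherNumber, 1}; // SUPERSCRIPT ONE
        default:     return {NumberClass::NotListed, -1};
    }
}
```

So the superscripts do carry a numeric value according to Unicode; they are just not the kind of digit a C/C++ tokenizer wants to accept in a literal.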

MortenMacFly · « **Reply #13 on:** March 29, 2012, 12:54:39 pm »

Quote from: thomas on March 29, 2012, 12:04:43 pm

Quote from: MortenMacFly on March 29, 2012, 09:32:14 am
With wxIsdigit it's different, because "²" and the like are really not digits, hence the workaround.
Unfortunately, this is not a bug; wxWidgets is correct for once.

I didn't say it's a bug in wxWidgets (it's not!). I said it's wrong in our case, as an assignment like:

Code

int i = ³;

...and a variable like:

Code

int i³ = 5;

is not going to work.

thomas · « **Reply #14 on:** March 29, 2012, 04:46:57 pm »

Yes, but according to Unicode, both are perfectly legitimate. And, it is correct for wxIsdigit to say that it's a digit, because it is. Unluckily, that's not what we're interested in.

In C++, even your second snippet is strictly legitimate (believe it or not!), as universal-character-name (without further explanation!) is allowed in identifiers as well as "other implementation-defined characters" (whatever that may be).

Funnily, the standard defines exactly what digits (0-9) and nondigits are (a-z, A-Z, and _), but the specification text later talks of letters and digits, without specifying what letter refers to, or what the difference is between "letter and nondigit" or "digits, nondigits, and pretty much every character" and "just every character". And, there is no mention of universal-character-name in the text, either.

On the other hand, for integer literals, C++ very clearly defines what can go into the literal; ² and ³ are not in the list (although they are digits).

Which... I agree, is all in all totally absurd. Here we have, again, a proof of concept for "internationalization is shit".

We might actually be better off using find_first_of("xX0123456789ABCDEFabcdef"), because that much more closely matches what C/C++ understands as a number (the same with A-Z, a-z, and underscore added for identifiers).

Actually, why hasn't anyone reported problems with Ogham and Klingon numbers yet?
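
A rough sketch of the explicit-character-list approach, using std::wstring rather than wxString. Note the assumption: to find where a number ends, the sketch uses find_first_not_of (the complement of the find_first_of call mentioned above), and EndOfNumber is a hypothetical helper name:

```cpp
#include <cstddef>
#include <string>

// Returns the index one past the last character of the numeric token
// that starts at `start`. If the character at `start` is not in the
// list, the result equals `start`, so the caller can tell that nothing
// was consumed.
std::size_t EndOfNumber(const std::wstring& buf, std::size_t start)
{
    static const wchar_t* const kNumberChars = L"xX0123456789ABCDEFabcdef";
    const std::size_t end = buf.find_first_not_of(kNumberChars, start);
    return end == std::wstring::npos ? buf.size() : end;
}
```

Because the list contains only ASCII characters, a superscript such as ² is never treated as part of a number, whatever the library's digit classification says.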
