Author Topic: Re: How to set/get information of encoding of compiled source files in project ?  (Read 6641 times)

Luke Matuszewski

  • Guest
I have been wondering how to find out the encoding of the source files in my project. As far as I know, C and C++ source files are expected to be written in the operating system's default encoding so that the compiler can decode them properly, and thus:
- on Linux in Poland the default encoding is ISO 8859-2 (or UTF-8 or another encoding, if so configured);
- on Windows in Poland the default encoding is Windows-1250 (in the USA it is probably Windows-1252).
The first 128 characters (codes 0 to 127) are the same in all of these encodings (Windows-125x, ISO 8859 and UTF-8), so the keywords of the standard languages are always read properly. But I want to ask:

1. What must I do to write the source files in my project in a Unicode encoding (and which Unicode encoding should I use with wxWidgets: UTF-16?)? I ask because I would like to put some constant strings in my source code that will be encoded in Unicode, especially with wxWidgets' wxString (the _T(), wxT() or even _() macros). Here is an example:

if ( string should be translated )
   use _("string")
else if ( string should be in Unicode in Unicode build )
   use wxT("string")
else
   just use "string" normally
// wxT()/_T() only adds the L prefix to the string literal, so it is treated as a wide-character string.
So in wxWidgets I have these macros; _() translates my strings where they should be translated (as I understand it, they are translated in a non-Unicode build and NOT translated in a Unicode build, i.e. when I configure wxWidgets for a Unicode build).
My question is which Unicode encoding I should use when writing a wxWidgets project (if any), and how this plays together with Code::Blocks (does Code::Blocks support editing files in Unicode encodings, and in which of them?).

2. What about C++ compilers and their support for Unicode-encoded source files? I ask because only in Visual Studio can I choose the encoding of a file so that the compiler decodes the source file contents properly...

I assume here that Code::Blocks writes source files in the operating system's default encoding, so writing in that encoding is supported... but what about other encodings, e.g. for a project written for multiple platforms? With Unicode this problem would be handled, since Unicode (e.g. UTF-16) covers a wide range of languages (even dead ones).

(Also UTF-8 is totally compatible with ASCII).

I have also read that the spec says C/C++ code should be written in a basic character set similar to (but more restricted than) ASCII. But what about wide characters and the L prefix put before string literals in C/C++ code, like this:
... = L"someCode";

One way to use Unicode strings is to write \uXXXX escapes in the string, but that is completely unreadable... So how do I write strings (or, to be more correct, char arrays) that are human-readable (e.g. by Japanese people, in Japanese) and still get compiled properly?
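For reference, a minimal sketch of the \uXXXX approach using only standard C++ (no wxWidgets). The sample word and the function names are illustrative assumptions, not something taken from a real project.

```cpp
#include <cassert>
#include <string>

// The Polish word for "turtle", spelled with universal character names
// so that the source file itself stays pure ASCII.
inline std::wstring escaped_word() {
    return L"\u017C\u00F3\u0142w";
}

// Small self-check (hypothetical helper). All four characters are in the
// Basic Multilingual Plane, so each one takes exactly one wchar_t whether
// wchar_t is 16 or 32 bits wide.
inline void check_escaped_word() {
    const std::wstring word = escaped_word();
    assert(word.size() == 4);
    assert(word[0] == 0x017C);  // LATIN SMALL LETTER Z WITH DOT ABOVE
    assert(word[3] == L'w');
}
```

This compiles regardless of the file's encoding, which is exactly why it is portable and also why it is hard to read.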

Help and answers are appreciated.
Luke Matuszewski from Poland.

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Quote
I was thinking how to get information of encoding of source files from my project
You don't. Unless you have a well-formed XML file (or similar), you have no way to find out (unless you have a filesystem that provides some kind of metadata describing the encoding).
Sources are binary streams of arbitrary characters, without any information about encoding. Mainstream source control systems do not pay attention to encoding either (I cannot claim that this is true for every SCM, but for cvs and svn it certainly is).

Quote
1. What must I do to ... wxWidgets
No idea, me nah do Unicode! :P  Someone else? ;)

Quote
What about C++ compilers and their support for Unicode-encoded source files
gcc: -finput-charset=charset. Try UTF-8, for example; it works fine.
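A rough sketch of what that looks like in practice, assuming the file is saved as UTF-8. The file name charset_demo.cpp and the function names are hypothetical; for a file saved in another encoding you would pass that encoding's name instead, as far as the iconv library on your system knows it.

```cpp
// Save this file as UTF-8 and build with, for example:
//   g++ -finput-charset=UTF-8 charset_demo.cpp
#include <cassert>
#include <string>

// Non-ASCII characters are written directly in the literal; the compiler
// decodes them according to the input charset it was told to use.
inline std::wstring readable_word() {
    return L"żółw";
}

// Small self-check (hypothetical helper): if the source was decoded
// correctly, the wide string holds four characters, not raw bytes.
inline void check_readable_word() {
    const std::wstring word = readable_word();
    assert(word.size() == 4);
    assert(word[0] == 0x017C);  // the letter "ż"
}
```

If the declared charset does not match the file's real encoding, the literal ends up with the wrong characters or the compiler reports a conversion error, so the flag and the editor's save encoding have to agree.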
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."

Luke Matuszewski

  • Guest
OK... I read my post again and it looks too messy. Let me clarify :). What I intended was:
1. To let the Code::Blocks user choose between encodings for project files (source files, rc files, all others): native (OS default)/UTF-8/US-ASCII/UTF-16/UTF-16LE/UTF-16BE/ISO-8859-1. It would be good if I could specify this the way other IDEs allow, e.g. Eclipse (eclipse.org) in the General section of its preferences, and maybe even map file name extensions to encodings.
I know that not all compilers support a switch like gcc's -finput-charset, but it would be up to the project's developer to choose between the native OS default encoding, which is supported by all C/C++/other compilers, and another one.
This functionality would allow writing string literals in their visual form, so the programmer would see e.g. hiragana alongside Chinese characters, and above all see comments in e.g. UTF-8, so that documentation generated from them could also be in UTF-8.

In wxWidgets we have the _() macro, which translates string literals from the native OS default encoding to wxWidgets' Unicode representation (possibly UTF-16, but I don't remember right now) if a specific compilation flag is set.
(The wxT() / _T() macros only add the L prefix in a Unicode build and do nothing in an ANSI build, ANSI meaning the native OS default encoding.)

If anyone could write some helpful info I would appreciate it, especially about the gettext() function used for i18n.

Thanks in advance; I would be grateful for any corrections or info.

Best regards.
Luke from Poland.
   

Offline thomas

  • Administrator
  • Lives here!
  • *****
  • Posts: 3979
Quote
OK... I read my post again and it looks too messy. Let me clarify :). What I intended was:
1. To let the Code::Blocks user choose between encodings
This functionality would allow writing string literals in their visual form, so the programmer would see e.g. hiragana alongside Chinese characters...
If you could provide information on how to get wxScintilla to handle different encodings, you would be very welcome. I tried setting properties and using the SetEncoding() function while trying to add this exact functionality last week, but it had no apparent effect. In fact, the properties recommended at the SciTE site do not even seem to exist in wxScintilla.

Quote
In wxWidgets we have the _() macro, which translates string literals from the native OS default encoding to wxWidgets' Unicode representation
That's not precise. _() does an implicit _T(), but it also actually translates the text to another language using gettext(). So not only are the characters converted to a different charset, but the whole string is replaced by something completely different.
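To illustrate the difference, here is a toy sketch in plain C++. It is not wxWidgets' actual implementation: DEMO_T, DEMO_TR, demo_catalog and demo_translate are made-up names, and the one-entry in-memory map merely stands in for a real message catalog loaded at run time.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for a message catalog (English -> Polish).
inline const std::map<std::wstring, std::wstring>& demo_catalog() {
    static const std::map<std::wstring, std::wstring> catalog = {
        {L"Open file", L"Otw\u00F3rz plik"},
    };
    return catalog;
}

// Look the whole string up; fall back to the original text when the
// catalog has no translation for it.
inline std::wstring demo_translate(const std::wstring& msgid) {
    const auto it = demo_catalog().find(msgid);
    return it != demo_catalog().end() ? it->second : msgid;
}

// Like wxT()/_T() in a Unicode build: only prepends the L prefix.
#define DEMO_T(s) L##s
// Like _(): widens the literal AND replaces it with its translation.
#define DEMO_TR(s) demo_translate(DEMO_T(s))

// Small self-check (hypothetical helper).
inline void check_demo_macros() {
    assert(std::wstring(DEMO_T("Open file")) == L"Open file");
    assert(DEMO_TR("Open file") == L"Otw\u00F3rz plik");
}
```

The point of the sketch is that DEMO_T changes only the character type of the literal, while DEMO_TR may return entirely different text.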
_T() and wxT() are (as you said) identical macros which resolve to identity in ANSI builds and to L## in Unicode builds (wchar_t strings). To make things more confusing, though, wchar_t is 16 bits wide (like unsigned short) on my Windows machine but 32 bits wide (like int) on my Linux box, so sizeof(wchar_t) differs between platforms with identical wxWidgets versions.
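A small sketch that makes the platform difference visible, using only standard C++. The function names are hypothetical, and the code assumes wchar_t is either 16 or 32 bits wide, which covers the usual Windows and Linux toolchains.

```cpp
#include <cassert>
#include <climits>
#include <cstddef>

// Width of wchar_t in bits on the platform this is compiled for.
inline std::size_t wchar_bits() {
    return sizeof(wchar_t) * CHAR_BIT;
}

// Number of wchar_t units needed for U+1F600, a character outside the
// Basic Multilingual Plane: a surrogate pair with 16-bit wchar_t, a
// single unit with 32-bit wchar_t.
inline std::size_t units_for_non_bmp() {
    const wchar_t text[] = L"\U0001F600";
    return sizeof(text) / sizeof(text[0]) - 1;  // minus the terminating L'\0'
}

// Small self-check (hypothetical helper).
inline void check_wchar_width() {
    assert(wchar_bits() == 16 || wchar_bits() == 32);
}
```

So the same wide literal can have a different length depending on the platform, which is worth keeping in mind when string lengths or buffer sizes are hard-coded.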
"We should forget about small efficiencies, say about 97% of the time: Premature quotation is the root of public humiliation."