Author Topic: When will support UTF-8 editor? (Read 18890 times)

dbtsai · « **on:** December 12, 2005, 12:45:52 pm »

In both the official release and the SVN builds, I find that Code::Blocks is compiled in ANSI mode rather than UTF-8.

I am from Taiwan, and some Chinese characters cannot be displayed in Code::Blocks, even in simple comments.

So I really, really hope that it can support reading UTF-8 source code.

By the way, will the next official release ship with the wx lib? Or do we need to compile the wx lib ourselves? That is not easy for a beginner.

Thanks.

Michael · « **Reply #1 on:** December 12, 2005, 01:04:26 pm »

Hello,

AFAIK C::B RC2 supports UNICODE (http://forums.codeblocks.org/index.php?topic=1162.0).

You should compile wxWidgets with UNICODE and then C::B sources.

You can also download Therion's wxWindows 2.6.2 build (see http://paginas.terra.com.br/informatica/mauricio/codeblocks/). This package includes dll and static libraries for GCC 3.4.4 (both Unicode and NonUnicode).

Michael

takeshimiya · « **Reply #2 on:** December 12, 2005, 01:06:31 pm »

Are there any disadvantages to having C::B compiled in Unicode mode for the official releases (i.e. RC3)?

dbtsai · « **Reply #3 on:** December 12, 2005, 01:14:25 pm »

Hi, Michael

With Therion's wxWindows 2.6.2 build,

Help -> About still says wx2.6.2(Windows, ANSI),

and I cannot use Code::Blocks to open source code that is encoded in UTF-8.

I know that the libraries he provides include a Unicode version, but what I mean is that

the Code::Blocks editor still cannot open UTF-8 source.

Thanks~~~ ^_^

Quote from: Michael on December 12, 2005, 01:04:26 pm

Hello,

AFAIK C::B RC2 supports UNICODE (http://forums.codeblocks.org/index.php?topic=1162.0).

You should compile wxWidgets with UNICODE and then C::B sources.

You can also download Therion's wxWindows 2.6.2 build (see http://paginas.terra.com.br/informatica/mauricio/codeblocks/). This package includes dll and static libraries for GCC 3.4.4 (both Unicode and NonUnicode).

Michael

takeshimiya · « **Reply #4 on:** December 12, 2005, 01:21:34 pm »

I'm afraid no one is making Unicode builds of Code::Blocks.

Michael · « **Reply #5 on:** December 12, 2005, 01:29:36 pm »

Quote from: Takeshi Miya on December 12, 2005, 01:21:34 pm

I'm afraid no one is making Unicode builds of Code::Blocks.

But you can make a UNICODE build of C::B, can't you? From what I understood from the post "Version 1.0rc2 released!", C::B supports UNICODE.

Michael

takeshimiya · « **Reply #6 on:** December 12, 2005, 01:43:30 pm »

Yes, anyone can, but no one is distributing Unicode builds of C::B for win32.

"C::B supports Unicode" means that it can be compiled in Unicode, not that it is compiled in Unicode.

Michael · « **Reply #7 on:** December 12, 2005, 01:56:48 pm »

Quote from: Takeshi Miya on December 12, 2005, 01:43:30 pm

"C::B supports Unicode" means that it can be compiled in Unicode, not that it is compiled in Unicode.

OK, so I understood correctly. Thank you.

I think, dbtsai, that you will have to make a UNICODE build of C::B yourself, using Therion's UNICODE wxWidgets build (or a UNICODE wxWidgets that you compile yourself, if you prefer).

Michael

thomas · « **Reply #8 on:** December 12, 2005, 02:53:08 pm »

Quote from: Takeshi Miya on December 12, 2005, 01:06:31 pm

Are there any disadvantages to having C::B compiled in Unicode mode for the official releases (i.e. RC3)?

Yes, there are disadvantages. Unicode support is not 100% finished and tested. Also, at least one third-party library used in Code::Blocks does not support wide character strings (even though it apparently still works, somehow).
ANSI, on the other hand, is known to work and is officially supported.

No doubt, some day Code::Blocks will switch to Unicode altogether (as that will work universally), but I dare not say when that will be.

dbtsai · « **Reply #9 on:** December 12, 2005, 06:33:10 pm »

hi,

OK, I will try to compile it myself. If there is any good news, I will post it. ^_^

Also, in my case, a Chinese character takes two bytes in ANSI mode,
but in C::B the Delete key removes only one byte, which is half of a Chinese character.
That is not correct. Most Chinese or Japanese programs have to take this problem into account, and
programmers need to solve it themselves. That is why I very, very much hope C::B will support UTF-8.
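
To illustrate the half-character problem described above, here is a minimal sketch of deleting one whole character from a UTF-8 buffer instead of one byte. The function name `erase_last_utf8_char` is hypothetical and is not taken from Code::Blocks or wxScintilla; it assumes the buffer holds well-formed UTF-8. (A double-byte ANSI code page such as Big5 cannot be scanned backwards this simply, which is part of why UTF-8 is attractive here.)

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: remove the last whole character from a UTF-8 string.
// Assumes s holds well-formed UTF-8.
void erase_last_utf8_char(std::string& s) {
    if (s.empty()) return;
    std::size_t i = s.size() - 1;
    // Continuation bytes have the form 10xxxxxx; step back to the lead byte.
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80) --i;
    assert(i < s.size());
    s.erase(i);  // drops the lead byte and all of its continuation bytes
}
```

Calling it on the bytes `"a\xE4\xB8\xAD"` (the letter a followed by one three-byte Chinese character) leaves `"a"`, rather than a string with a dangling partial character.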

Thanks

takeshimiya · « **Reply #10 on:** December 12, 2005, 06:47:41 pm »

Quote from: thomas on December 12, 2005, 02:53:08 pm

Also, at least one third party library used in Code::Blocks does not support wide character strings (even though it apparently still works, somehow).
ANSI, on the other hand, is known to work and is officially supported.

Which specific libraries don't support wide chars, and what can we do to make them support them, apart from submitting a feature request?

thomas · « **Reply #11 on:** December 12, 2005, 10:13:00 pm »

This is the one I know about, and it is also the most important:

Quote from: http://www.grinninglizard.com/tinyxmldocs/index.html

TinyXml supports UTF-8 allowing to manipulate XML files in any language.
[...]
TinyXml does not use or directly support wchar, TCHAR, or Microsofts _UNICODE at this time.

Apparently, it still works ... somehow. Although I do not understand how it works, it actually seems to do o.k. in Unicode builds. But it still does not feel good.

thomas · « **Reply #12 on:** December 12, 2005, 10:51:05 pm »

And here might just be the first case where it doesn't....

http://forums.codeblocks.org/index.php?topic=1618.0

280Z28 · « **Reply #13 on:** December 13, 2005, 07:36:44 am »

I tried but didn't have time to fight with it. It's running stable in ANSI so I left it there. :?

Once I take care of my "level 1 problems (most important bugs to fix IMO)," I might work on this again.

kagerato · « **Reply #14 on:** December 14, 2005, 02:03:02 am »

Quote from: http://www.grinninglizard.com/tinyxmldocs/index.html

TinyXml supports UTF-8 allowing to manipulate XML files in any language.
[...]
TinyXml does not use or directly support wchar, TCHAR, or Microsofts _UNICODE at this time.

This makes little sense to me. UTF-8 is a particular representation of Unicode text requiring at least 8 bits per character, widely used because it's 1:1 with ASCII. Supporting UTF-8 should be enough for Unicode operability in any language.

WCHAR and TCHAR are just Windows-specific typedefs, as far as I know. (Reference: MSDN)

_UNICODE is a preprocessor definition used by Microsoft's compiler. (Reference: Microsoft)

What, then, do WCHAR, TCHAR, and _UNICODE have to do with a proper/complete implementation of Unicode support?

takeshimiya · « **Reply #15 on:** December 14, 2005, 02:21:06 am »

Let's see: TinyXml loads files in UTF-8; it can't load any other Unicode encoding (neither from a file nor from memory).

In memory it stores the UTF-8 encoded strings as an array of chars. A byte IS NOT necessarily a character (only for ASCII text, such as plain English, does one byte happen to equal one character).

However, wxWidgets, or Windows for that matter, handles Unicode in memory in another encoding: UTF-16 (also a multi-byte encoding).

So if we want the Unicode data from TinyXml, we must convert UTF-8->UTF-16. And if we want to talk from wxWidgets to TinyXml, UTF-16->UTF-8.

The reason it appears to work "somehow", as thomas mentioned above, may be that wxWidgets (I think) uses its wxMBConv classes to do this conversion by default when compiled in Unicode mode (it assumes UTF-8 if you don't specify another encoding).
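
The UTF-8->UTF-16 direction can be sketched in plain C++ as below. This is only an illustration of the conversion itself; `utf8_to_utf16` is a hypothetical helper, not a TinyXml or wxWidgets function, and it says nothing about how wxMBConv is actually implemented. It assumes mostly well-formed input and replaces bad sequences with U+FFFD.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not part of TinyXml or wxWidgets.
// Decodes UTF-8 bytes into UTF-16 code units. Malformed or truncated
// sequences become U+FFFD; overlong forms and encoded surrogates are
// not rejected in this sketch.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // stray byte

        bool ok = (i + extra < in.size());  // enough bytes left?
        for (std::size_t k = 1; ok && k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) ok = false;   // not a continuation byte
            else cp = (cp << 6) | (c & 0x3F);     // append 6 payload bits
        }
        if (!ok || cp > 0x10FFFF) { out.push_back(0xFFFD); ++i; continue; }
        i += extra + 1;

        assert(cp <= 0x10FFFF);
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;  // split into a surrogate pair
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Note that three UTF-8 bytes can collapse into a single 16-bit unit, while four UTF-8 bytes expand into two units, so neither encoding has a fixed size per character.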

thomas · « **Reply #16 on:** December 14, 2005, 02:37:39 am »

UTF-8 is a way to write Unicode in a backwards-compatible way on media that support 8-bit character tokens. It is a variable-length format which uses between 1 octet (for ASCII characters) and 4 octets (the original design allowed up to 6). Most European languages can be represented with sequences of 1-2 octets per character; Chinese, Japanese and Korean text usually needs 3.

Unicode is a family of standards (I know at least two different standards) which represent characters in words of 16 bits or 32 bits. Maybe there are even more standards which I do not know about, but that does not matter. The characters that UTF-8 encodes are really words of 16 or 32 bits.

If you are to represent Unicode text in a wxString, this is done by using wchar_t characters. On Windows, these are 16 bits; on my Linux box, they are 32 bits. Whatever the size, sizeof(wchar_t) != sizeof(char), so if you pass "ABC" you do not really pass 0x41, 0x42, 0x43. In reality, you pass two (or four) times as much data, for example 0x41, 0x00, 0x42, 0x00, 0x43, 0x00. (In fact I have no idea about the actual encoding; what matters is that these are 16/32 bit values.)

So obviously it cannot work reliably if you hand this data to some library which expects characters to be octets. It may work for a while, and then fail randomly due to a thing as simple as calling strlen() on a character string that happens to have 0x00 as the upper byte somewhere.
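
The strlen() failure is easy to reproduce. The sketch below hard-codes the byte sequence from the post (UTF-16 little-endian "ABC") so that it does not depend on the platform's wchar_t size or byte order; the names `kAbcUtf16Le`, `length_seen_by_char_api` and `length_in_code_units` are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "ABC" as UTF-16 little-endian bytes (the example from the post above),
// followed by a two-byte terminator.
static const char kAbcUtf16Le[] = {0x41, 0x00, 0x42, 0x00, 0x43, 0x00, 0x00, 0x00};

// What a char-based library sees: strlen() stops at the first 0x00 byte.
std::size_t length_seen_by_char_api() {
    return std::strlen(kAbcUtf16Le);
}

// The actual length, counted in 16-bit code units.
std::size_t length_in_code_units() {
    assert(sizeof(kAbcUtf16Le) % 2 == 0);
    std::size_t n = 0;
    while (kAbcUtf16Le[2 * n] != 0 || kAbcUtf16Le[2 * n + 1] != 0) ++n;
    return n;
}
```

Here the char-based view reports a length of 1 while the string really holds 3 code units, so a library working on octets would silently truncate the text after "A".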

kagerato · « **Reply #17 on:** December 16, 2005, 06:56:19 am »

Quote from: Takeshi Miya

Let's see: TinyXml loads files in UTF-8; it can't load any other Unicode encoding (neither from a file nor from memory).

There's no reason why anyone should require support of multiple Unicode encodings. People may prefer UTF-16 or some other structure, but it is perfectly possible to convert losslessly. In any case, this is a tangential discussion, irrelevant to my original point.

Quote from: thomas

So obviously it cannot work reliably if you hand this data to some library which expects characters to be octets.

A library that properly supports UTF-8 does not expect each character to be eight bits. UTF-8 is a variable-width representation. Characters "wider" than 8 bits come into play when the MSB is set. (Of course, this only occurs when the encoding is not strict ASCII.)

If the code assumes a particular width or byte alignment when it does not exist (as is clearly the case with UTF-8), then it is an improper implementation -- to say the least. The claim that TinyXML supports UTF-8 would therefore be false.
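
As a small illustration of the variable-width point, a UTF-8-aware routine can count characters (code points) by looking at the top bits of each byte rather than assuming one byte per character. `utf8_code_points` is a hypothetical name for this sketch, and it assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a well-formed UTF-8 string.
// Bytes of the form 10xxxxxx are continuation bytes and are skipped, so
// only ASCII bytes (MSB clear) and multi-byte lead bytes are counted.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if ((b & 0xC0) != 0x80) ++n;
    }
    assert(n <= s.size());  // never more characters than bytes
    return n;
}
```

For pure ASCII such as `"abc"` the count equals the byte length, while the six bytes `"\xE4\xB8\xAD\xE6\x96\x87"` (two Chinese characters) count as 2.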

Understand what I meant now?

takeshimiya · « **Reply #18 on:** December 16, 2005, 06:13:42 pm »

TinyXml supports UTF-8.
How are you supposed to store UTF-8 encoded text in memory then...?

thomas · « **Reply #19 on:** December 16, 2005, 06:49:00 pm »

We can probably work around it. When first writing the ConfigManager, I used mb_str() a few times when passing data to tinyXML, and c_str() in other places. Actually I don't remember the reason I did that in the first place any more. I think it was because certain things were ANSI anyway.... Either way, Yiannis was nice enough to change most of them to mb_str() while I wasn't looking, and that was really a good idea. It means that now we are feeding tinyXML octet streams (with very few exceptions), so it should really work.

The reason it still does not work 100% is that the CRC calculation for the layout is not good and that we may have missed one or two spots. But I expect it to work reliably once that has been addressed.

So.... hopefully no reason to worry about tinyXML any more.

kagerato · « **Reply #20 on:** December 19, 2005, 12:38:23 am »

Quote from: Takeshi Miya on December 16, 2005, 06:13:42 pm

TinyXml supports UTF-8.
How are you supposed to store UTF-8 encoded text in memory then...?

What encoding do you use to store your text in RAM, you mean? I see two optimal ways:

1.) As UTF-8 (which, once again, is variable-width)
2.) As UTF-16 (Windows and other systems seem to accept Unicode data most often using this encoding)

#1 makes it a simple matter to read and write data between disk and RAM, since you'll very likely be using UTF-8 for both. The latter option is better if you're commonly calling functions from system or third-party libraries that require UTF-16. The alternative to #2 in the same situation is multiple copies of the text in different encodings, which is not only messy, tedious, and a potential source of bugs, but also a misuse of RAM and processing.

In any case, thomas sounds like he knows how to manage whatever the problem is/was. I still do not completely understand the nature of the problem; hence why I asked.

thomas · « **Reply #21 on:** December 19, 2005, 12:52:10 am »

The problem is that we store all text in UTF-16 using wchar[], and we do not have a choice to do otherwise. tinyXML does not support wchar. Therefore, we convert to UTF-8 just before passing the data to tinyXML.
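
The "convert to UTF-8 just before passing the data" step might look roughly like this in plain C++. It is a sketch only: `utf16_to_utf8` is a hypothetical stand-in for whatever conversion Code::Blocks really performs through wxWidgets, and it uses std::u16string instead of wxString so that it stays self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper, not the actual Code::Blocks or wxWidgets code.
// Encodes UTF-16 code units as UTF-8 octets; unpaired surrogates are
// replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a high/low surrogate pair into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // lone surrogate
        }
        assert(cp <= 0x10FFFF);
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

The result contains no embedded 0x00 bytes for ordinary text, which is why an octet-based library can process it with its usual char routines.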

Also, wxScintilla might not be completely Unicode-safe. This is only a suspicion, not necessarily true. While browsing the sources, I have spotted several places where they use chars as indices or compare against const char values. Unless these are only applied on text fragments which have been converted to UTF-8 (which I don't know, maybe they are?), this may be an issue. In that case, we will have another problem which is not easily solved.

takeshimiya · « **Reply #22 on:** December 19, 2005, 01:11:12 am »

Regarding Scintilla, I once asked the SciTE developers if support for Unicode filenames was feasible, and they answered that someone once started working on that, but it wasn't an easy task and required a rather major rewrite.
Anyway, Unicode text in Scintilla seems to work OK, but we can always expect bugs because they have their own string class, and I noticed some const char* around the code too, so I'm not sure it fully supports Unicode.

dbtsai · « **Reply #23 on:** December 20, 2005, 12:57:21 pm »

Well, I would like to try a UTF-8 version of C::B,

but I cannot get it to compile....

Could anyone release a UTF-8 compiled version?

Then people could try it and see what goes wrong!

^_^

thomas · « **Reply #24 on:** December 20, 2005, 03:11:20 pm »

Quote from: dbtsai on December 20, 2005, 12:57:21 pm

Well, I would like to try a UTF-8 version of C::B,

but I cannot get it to compile....

http://forums.codeblocks.org/index.php?topic=1701.0
