When will support UTF-8 editor?

takeshimiya:
Let's see: TinyXML loads files in UTF-8; it can't load any other Unicode encoding, either from a file or from memory.

In memory it stores the UTF-8 encoded strings as arrays of chars. Each byte IS NOT a character (a byte equals a character only for ASCII text, which coincidentally covers plain English).
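
To make the byte-versus-character distinction concrete, here is a minimal sketch. It is not TinyXML code: the helper name utf8_code_points is made up for illustration, and it assumes the input is well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 encoded string by skipping
// continuation bytes (bit pattern 10xxxxxx). Assumes well-formed UTF-8;
// no validation is performed.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // every byte that is not a continuation starts a character
        }
    }
    return count;
}
```

For the five bytes "\xC3\xA9t\xC3\xA9" (the word "été"), size() reports 5 while utf8_code_points() reports 3. For plain ASCII such as "abc" the two numbers agree.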

However, wxWidgets (or Windows, for that matter) handles Unicode in memory in a different encoding: UTF-16, which is also a multi-byte encoding.

So if we want the Unicode data from TinyXML, we must convert UTF-8 -> UTF-16. And if we want to pass data from wxWidgets to TinyXML, UTF-16 -> UTF-8.
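
The two conversions can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: the function names utf8_to_utf16 and utf16_to_utf8 are hypothetical, the input is assumed to be well-formed, and no error handling is done. In a real wxWidgets program the conversion would more likely go through the library's own wxMBConv classes (for example wxConvUTF8) than through hand-written code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// UTF-8 -> UTF-16. Assumes well-formed input; performs no validation.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;
        std::size_t len;
        if (lead < 0x80)      { cp = lead;        len = 1; }  // 0xxxxxxx
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }  // 110xxxxx
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }  // 1110xxxx
        else                  { cp = lead & 0x07; len = 4; }  // 11110xxx
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above U+FFFF become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// UTF-16 -> UTF-8. Assumes well-formed input (paired surrogates).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            // Combine a high and a low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Because both encodings cover the same code points, a round trip such as utf16_to_utf8(utf8_to_utf16(s)) returns the original bytes for any well-formed s.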

The reason it appears to work "somehow", as thomas mentioned above, may be that wxWidgets, when compiled in Unicode mode, uses (I think) the wxMBConv classes to do this conversion by default (assuming UTF-8 if you don't specify another encoding).

thomas:
UTF-8 is a way to write Unicode in a backwards-compatible way on media that support 8-bit character tokens. It is a variable-length format that uses between 1 octet (for ASCII characters) and 4 octets per character (the original design allowed up to 6). Most European languages can usually be represented with sequences of 1-2 octets per character; many Asian scripts need 3.
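
As a rough illustration of the variable length, the octet count per code point under the current UTF-8 definition (RFC 3629, which caps sequences at 4 octets) can be sketched like this. The function name utf8_octets is hypothetical.

```cpp
#include <cstdint>

// Number of octets UTF-8 needs to encode one Unicode code point,
// following the ranges in RFC 3629.
int utf8_octets(std::uint32_t cp) {
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;    // e.g. accented Latin, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes, up to U+10FFFF
}
```

So 'A' (U+0041) takes 1 octet, 'é' (U+00E9) takes 2, '日' (U+65E5) takes 3, and U+1F600 takes 4.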

Unicode assigns each character a numeric code point, and there is a family of encodings for those code points (I know of at least two, UTF-16 and UTF-32) that represent them in words of 16 or 32 bits. Maybe there are more that I do not know about, but that does not matter. What UTF-8 encodes are really the same code points that those 16-bit or 32-bit words carry.

If you are to represent Unicode text in a wxString, this is done by using wchar_t characters. On Windows these are 16 bits; on my Linux box they are 32 bits. Whatever the size, sizeof(wchar_t) != sizeof(char), so if you pass "ABC" you do not really pass 0x41, 0x42, 0x43. In reality you pass two (or four) times as much data, for example 0x41, 0x00, 0x42, 0x00, 0x43, 0x00. (In fact I have no idea about the actual encoding; what matters is that these are 16/32-bit values.)

So obviously it cannot work reliably if you hand this data to some library which expects characters to be octets. It may work for a while, and then fail randomly due to a thing as simple as calling strlen() on a character string that happens to have 0x00 as the upper byte somewhere.
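
The strlen() failure can be reproduced with a small sketch. The byte layout below (16-bit little-endian units) is an assumption chosen for illustration, and the names kWideAbc and length_seen_by_octet_library are hypothetical.

```cpp
#include <cstddef>
#include <cstring>

// "ABC" as three 16-bit little-endian units plus a 16-bit terminator.
// The byte order is an assumption made for this illustration.
const char kWideAbc[] = {0x41, 0x00, 0x42, 0x00, 0x43, 0x00, 0x00, 0x00};

// What a library that treats its input as a char string would measure.
std::size_t length_seen_by_octet_library(const char* data) {
    return std::strlen(data);  // stops at the first 0x00 byte
}
```

With this layout strlen() stops right after 0x41 and reports a length of 1, although the buffer holds three characters in 8 bytes.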

kagerato:

--- Quote from: Takeshi Miya ---Let's see: TinyXML loads files in UTF-8; it can't load any other Unicode encoding, either from a file or from memory.
--- End quote ---

There's no reason why anyone should require support for multiple Unicode encodings.  People may prefer UTF-16 or some other structure, but it is perfectly possible to convert losslessly.  In any case, this is a tangential discussion, irrelevant to my original point.


--- Quote from: thomas ---So obviously it cannot work reliably if you hand this data to some library which expects characters to be octets.
--- End quote ---

A library that properly supports UTF-8 does not expect each character to be eight bits.  UTF-8 is a variable-width representation.  Characters "wider" than 8 bits come into play when the MSB is set.  (Of course, this only occurs when the encoding is not strict ASCII.)
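
The role of the MSB can be sketched as a lead-byte check. This is a simplified illustration, not code from TinyXML: the name utf8_sequence_length is hypothetical, and overlong or otherwise invalid sequences are not rejected.

```cpp
// Returns how many octets the UTF-8 sequence starting with `lead` occupies,
// or 0 if `lead` is a continuation byte or matches no lead-byte pattern.
// Simplified: overlong encodings are not detected.
int utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  // 0xxxxxxx: MSB clear, plain ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // 10xxxxxx (continuation) or invalid
}
```

A parser that steps through its input by this length, rather than one byte at a time, never splits a multi-byte character.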

If the code assumes a particular width or byte alignment when it does not exist (as is clearly the case with UTF-8), then it is an improper implementation -- to say the least.  The claim that TinyXML supports UTF-8 would therefore be false.

Understand what I meant now?

takeshimiya:
TinyXML supports UTF-8.
How else are you supposed to store UTF-8 encoded text in memory, then...?

thomas:
We can probably work around it. When first writing the ConfigManager, I used mb_str() a few times when passing data to TinyXML, and c_str() in other places. Actually, I don't remember any more why I did that in the first place. I think it was because certain things were ANSI anyway.... Either way, Yiannis was nice enough to change most of them to mb_str() while I wasn't looking, and that was really a good idea. It means that now we are feeding TinyXML octet streams (with very few exceptions), so it should really work.

The reason it still does not work 100% is that the CRC calculation for the layout is not good and that we may have missed one or two spots. But I expect it to work reliably once that has been addressed.

So.... hopefully no reason to worry about TinyXML any more.
