Author Topic: wxString support in wxWidgets 3.0 problem? (Read 32586 times)

ollydbg · « **on:** March 27, 2011, 10:44:16 am »

I have read this:
wxWidgets: wxString Class Reference
Then I found that in wxWidgets 3.0 the wxString class internally uses a different "code unit" depending on the platform: UTF-8 on Linux-like systems and UTF-16 on Windows.

Both are variable-length Unicode encodings (one code point may need more than one code unit), so indexed access like:

Code

wxString s;
s[20]= something;

will be much slower than walking the string with an iterator.

I'm not sure what the current implementation does, but could this cause problems in the future?

EDIT 2013-10-18:

Quote

I found that in wxWidgets 3.0 the wxString class internally uses a different "code unit" depending on the platform: UTF-8 on Linux-like systems and UTF-16 on Windows.

This is not correct: all platforms now use a fixed-width unit (wchar_t) by default. See this post http://forums.codeblocks.org/index.php/topic,14421.msg126174.html#msg126174 for an explanation.

Ceniza · « **Reply #1 on:** March 27, 2011, 11:23:34 am »

All I can find is that they use a bigger type (4 bytes) in order to be able to store all code points, but they also say the implementation will be around basic_string.

Calling operator [] will call method at() which will call begin() and PosToImpl() and return it through a call to DecodeChar which will return a small object wxUniCharRef which will handle the assignment by creating a wxUniChar which will be assigned to an iterator, or something along those lines.

Most, if not all, of those methods were one-liners in the header file, so there is a good chance the produced code will be decently optimized. It would be nice to actually see what the compiler can do with it before we worry too much. It also seems like the current implementation is about the same anyway.

ollydbg · « **Reply #2 on:** March 27, 2011, 11:31:52 am »

Quote from: Ceniza on March 27, 2011, 11:23:34 am

All I can find is that they use a bigger type (4 bytes) in order to be able to store all code points, but they also say the implementation will be around basic_string.

Calling operator [] will call method at() which will call begin() and PosToImpl() and return it through a call to DecodeChar which will return a small object wxUniCharRef which will handle the assignment by creating a wxUniChar which will be assigned to an iterator, or something along those lines.

Most, if not all, of those methods were one-liners in the header file, so there is a good chance the produced code will be decently optimized. It would be nice to actually see what the compiler can do with it before we worry too much. It also seems like the current implementation is about the same anyway.

The current implementation uses the same method (4 bytes per code point)??

PS: I received an email notification that you replied a few minutes ago in "Code completion doesnt follow #include in struct", but I do not see that post in the thread. Strange, did you delete it?

Ceniza · « **Reply #3 on:** March 27, 2011, 11:46:52 am »

Quote from: ollydbg

The current implementation uses the same method (4 bytes per code point)??

It looks like the current implementation uses either wchar_t or char depending on how you configure wxWidgets. It is kind of difficult to be completely sure of all the differences by jumping everywhere in the svn repository.

Quote from: ollydbg

PS: I received an email notification that you replied a few minutes ago in "Code completion doesnt follow #include in struct", but I do not see that post in the thread. Strange, did you delete it?

The forum thought it was spam, but it seems to be fixed now. I do not have the powers to un-spam myself.

P.S.: This post was considered spam too. Probably due to all links in the quotes.

JGM · « **Reply #4 on:** March 28, 2011, 01:42:09 am »

Hmm, I'm getting the same result: some of my messages are being marked as spam.

ollydbg · « **Reply #5 on:** April 06, 2011, 10:07:17 am »

FYI:
I found this message on the wx forum:

Quote

DL> It's not really thread-safe since it uses reference counting - I think,

This was true for 2.8 but this question is explicitly about 2.9 and by
default in wx 2.9 (i.e. unless you set wxUSE_STD_STRING to 0) wxString uses
std::basic_string for implementation and so doesn't use reference counting
if the standard class doesn't -- and most, if not all, of them don't use it
any more. So the thread safety of wxString is the same as the thread-safety
of the underlying standard library string class.

Regards,
VZ

So it seems that in the future wxString (2.9.x/3.x), like most standard library string implementations, will mostly NOT use reference counting.
The current CodeCompletion plugin source has a lot of functions like:

Code

wxString GetToken();
wxString PeekToken();

This code will do a deep copy of the string data, so I'm concerned about performance.

PS: In the wxWidgets 2.8.x implementation, wxString uses reference counting, so returning a wxString object is much faster (it does not deep-copy the string data).

So, what do you think?

oBFusCATed · « **Reply #6 on:** April 06, 2011, 10:50:18 am »

RValue references to the rescue

And as always, performance optimizations should only be done once there is evidence that something is slow!
So profile first, then optimize, then profile again to confirm it is faster.
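
A minimal sketch of the rvalue-reference point, assuming a C++11 compiler. `Token`, `GetTokenSketch` and `CheckNoCopyOnReturn` are hypothetical names made up for this illustration, not code from the CodeCompletion plugin; the counters exist only to show that returning a local object by value does not go through the copy constructor.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical wrapper that counts how often it is copied or moved.
struct Token
{
    std::string text;
    static int copies;
    static int moves;

    explicit Token(std::string t) : text(std::move(t)) {}
    Token(const Token& other) : text(other.text) { ++copies; }
    Token(Token&& other) noexcept : text(std::move(other.text)) { ++moves; }
};

int Token::copies = 0;
int Token::moves = 0;

// Returns a local by value: the compiler either elides the copy (NRVO)
// or uses the move constructor, so the string data is not deep-copied.
Token GetTokenSketch()
{
    Token t("a_long_identifier_name_that_does_not_fit_in_a_small_string_buffer");
    return t;
}

// Usage: receiving the returned token does not bump the copy counter.
void CheckNoCopyOnReturn()
{
    const int before = Token::copies;
    Token tok = GetTokenSketch();
    assert(Token::copies == before);
    (void)tok;
}
```

Whether this matters for the plugin is still something only a profiler can tell; the sketch only shows that return-by-value is not automatically a deep copy.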

ollydbg · « **Reply #7 on:** April 06, 2011, 10:53:09 am »

I just searched Google for a while and found that
std::string in GCC's libstdc++ is COW (copy-on-write), see
http://stackoverflow.com/questions/1594803/is-stdstring-thead-safe-with-gcc-4-3

This code can show the COW behaviour:

Code

#include <string>
#include <cstdio>

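int main()
{
    std::string orig = "I'm the original!";
    std::string copy_cow = orig;          // copy construction: may share the buffer under COW
    std::string copy_mem = orig.c_str();  // built from a char pointer: always a new buffer
    // With a COW implementation the first two pointers printed are equal.
    std::printf("%p %p %p\n", static_cast<const void*>(orig.data()),
                              static_cast<const void*>(copy_cow.data()),
                              static_cast<const void*>(copy_mem.data()));
    return 0;
}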

So although wx itself does not use reference counting, I think std::string does.

Am I right? Can someone confirm this?

ollydbg · « **Reply #8 on:** April 06, 2011, 02:50:01 pm »

Oh, it seems COW will be dropped in the upcoming C++0x,
see:
http://stackoverflow.com/questions/4067395/gnu-stl-string-is-copy-on-write-involved-here

Quote

Just wanted to note that copy on write is probably going to fade away in C++0x with the introduction of move semantics (makes COW obsolete for many typical use cases) and concurrency (makes COW potentially very inefficient due to synchronization issues).

and
just how bad CoW can be in a multithreaded environment, even if there's only one thread

N2668: "Concurrency Modifications to Basic String"

ollydbg · « **Reply #9 on:** October 17, 2013, 05:08:33 pm »

FYI:

I see that under Linux, wxString in wxWidgets 2.9.x now uses std::basic_string<wchar_t>. The change happened around 2012-05-13; see this commit to the wxWidgets svn repo:
SVN:(VZ)[71424] "Disable the use of UTF-8 by default in Unix builds." (Google Groups). UTF-8 was the default before this commit.

I think this is good news: it should give wxWidgets better performance when parsing. Also, using the wchar_t pointer directly is safe on both Windows and Linux, because every wchar_t element has the same width (fixed-width code units).

So never mind the issue reported in: unsafe memory copy in CC's macro replacement

EDIT: this is the current documentation about wxString performance: http://docs.wxwidgets.org/trunk/classwx_string.html

Quote

Performance characteristics

wxString uses std::basic_string internally to store its content (unless this is not supported by the compiler or disabled specifically when building wxWidgets) and it therefore inherits many features from std::basic_string. In particular, most modern implementations of std::basic_string are thread-safe and don't use reference counting (making copying large strings potentially expensive) and so wxString has the same characteristics.

By default, wxString uses std::basic_string specialized for the platform-dependent wchar_t type, meaning that it is not memory-efficient for ASCII strings, especially under Unix platforms where every ASCII character, normally fitting in a byte, is represented by a 4 byte wchar_t.

It is possible to build wxWidgets with wxUSE_UNICODE_UTF8 set to 1 in which case an UTF-8-encoded string representation is stored in std::basic_string specialized for char, i.e. the usual std::string. In this case the memory efficiency problem mentioned above doesn't arise but run-time performance of many wxString methods changes dramatically, in particular accessing the N-th character of the string becomes an operation taking O(N) time instead of O(1), i.e. constant, time by default. Thus, if you do use this so called UTF-8 build, you should avoid using indices to access the strings whenever possible and use the iterators instead. As an example, traversing the string using iterators is an O(N), where N is the string length, operation in both the normal ("wchar_t") and UTF-8 builds but doing it using indices becomes O(N^2) in UTF-8 case meaning that simply checking every character of a reasonably long (e.g. a couple of millions elements) string can take an unreasonably long time.

However, if you do use iterators, UTF-8 build can be a better choice than the default build, especially for the memory-constrained embedded systems. Notice also that GTK+ and DirectFB use UTF-8 internally, so using this build not only saves memory for ASCII strings but also avoids conversions between wxWidgets and the underlying toolkit.
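
To make the O(N) versus O(1) point in the quote concrete, here is a rough sketch that uses a plain std::string holding UTF-8 as a stand-in for the UTF-8 build of wxString. The function names are hypothetical (they are not wxWidgets API) and the code assumes the input is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for a UTF-8 continuation byte (bit pattern 10xxxxxx).
inline bool IsContinuationByte(unsigned char b)
{
    return (b & 0xC0) == 0x80;
}

// Byte offset of the n-th code point (0-based). The scan restarts from the
// beginning on every call, so one lookup is O(N) and a loop of lookups is O(N^2).
// Returns utf8.size() if the string has fewer than n + 1 code points.
std::size_t OffsetOfCodePoint(const std::string& utf8, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < utf8.size(); ++i)
    {
        if (IsContinuationByte(static_cast<unsigned char>(utf8[i])))
            continue;              // still inside the previous code point
        if (seen == n)
            return i;
        ++seen;
    }
    return utf8.size();
}

// One forward pass over the bytes: O(N) for the whole string.
std::size_t CountCodePoints(const std::string& utf8)
{
    std::size_t count = 0;
    for (unsigned char b : utf8)
        if (!IsContinuationByte(b))
            ++count;
    return count;
}

// Usage: "h\xC3\xA9llo" is "héllo", 5 code points stored in 6 bytes.
void DemoUtf8Indexing()
{
    const std::string s = "h\xC3\xA9llo";
    assert(CountCodePoints(s) == 5);
    assert(OffsetOfCodePoint(s, 2) == 3);   // the first 'l' comes after the 2-byte 'é'
}
```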

BlueHazzard · « **Reply #10 on:** October 17, 2013, 05:23:22 pm »

Is wchar_t 16 bits on Windows? If yes, can you use character access anyway? I mean, there isn't enough space in 16 bits for the whole Unicode table.
(UTF-16 is the worst decision you can make when supporting Unicode:
* you have to watch out for endianness
* if you open a corrupt file, there is no way to repair it...)

ollydbg · « **Reply #11 on:** October 17, 2013, 05:35:44 pm »

Quote from: BlueHazzard on October 17, 2013, 05:23:22 pm

Is wchar_t 16 bits on Windows?

As far as I know, yes. Windows does in fact use UTF-16 for encoding strings, and wchar_t is 16 bits.

Quote

If yes, can you use character access anyway? I mean, there isn't enough space in 16 bits for the whole Unicode table.
(UTF-16 is the worst decision you can make when supporting Unicode:
* you have to watch out for endianness
* if you open a corrupt file, there is no way to repair it...)

In some cases a character needs four bytes, which means two UTF-16 code units. Under Windows, user code has to handle this special case (called a surrogate pair) itself.

See the documentation at: http://docs.wxwidgets.org/trunk/overview_string.html

Quote

Internal wxString Encoding

Since wxWidgets 3.0 wxString internally uses UTF-16 (with Unicode code units stored in wchar_t) under Windows and UTF-8 (with Unicode code units stored in char) under Unix, Linux and Mac OS X to store its content.

For definitions of code units and code points terms, please see the Unicode Representations and Terminology paragraph.

For simplicity of implementation, wxString when wxUSE_UNICODE_WCHAR==1 (e.g. on Windows) uses per code unit indexing instead of per code point indexing and doesn't know anything about surrogate pairs; in other words it always considers code points to be composed by 1 code unit, while this is really true only for characters in the BMP (Basic Multilingual Plane). Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user code has to take care of surrogate pairs himself. (Note however that Windows itself has built-in support for surrogate pairs in UTF-16, such as for drawing strings on screen.)

Remarks
Note that while the behaviour of wxString when wxUSE_UNICODE_WCHAR==1 resembles UCS-2 encoding, it's not completely correct to refer to wxString as UCS-2 encoded since you can encode code points outside the BMP in a wxString as two code units (i.e. as a surrogate pair; as already mentioned however wxString will "see" them as two different code points)

When instead wxUSE_UNICODE_UTF8==1 (e.g. on Linux and Mac OS X) wxString handles UTF8 multi-bytes sequences just fine also for characters outside the BMP (it implements per code point indexing), so that you can use UTF8 in a completely transparent way:

Note that this document appears to be incorrect, since Unix builds now use std::basic_string<wchar_t> by default.
Its description is correct for Windows, though.

Note that we don't currently handle surrogate pairs in C::B; wxWidgets 2.8.12 uses std::basic_string<wchar_t> too. Question: is there any chance we will meet a surrogate pair in C++ source code? Maybe in comments? Which characters need surrogate pairs under Windows? I don't have an example.
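
As a partial answer to the "which characters" question: any code point above U+FFFF needs a surrogate pair in UTF-16, for example U+1F600 (an emoji) or U+20000 (a CJK Extension B ideograph), and such characters could plausibly show up in comments or string literals. Below is a small sketch of the standard UTF-16 arithmetic; the function names are hypothetical and nothing here is wxWidgets or C::B code.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Splits a code point above U+FFFF into its UTF-16 high and low surrogates.
std::pair<std::uint16_t, std::uint16_t> ToSurrogatePair(std::uint32_t codePoint)
{
    assert(codePoint > 0xFFFF && codePoint <= 0x10FFFF);
    const std::uint32_t v = codePoint - 0x10000;   // 20 significant bits remain
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return std::make_pair(high, low);
}

// True if a 16-bit code unit is one half of a surrogate pair, i.e. it is
// not a complete character on its own.
bool IsSurrogate(std::uint16_t unit)
{
    return unit >= 0xD800 && unit <= 0xDFFF;
}
```

For U+1F600 this gives the pair 0xD83D 0xDE00, so per-code-unit indexing on Windows would see two "characters" there.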

oBFusCATed · « **Reply #12 on:** October 17, 2013, 06:24:01 pm »

ollydbg: Are you sure you've disabled building in STL mode?

ollydbg · « **Reply #13 on:** October 18, 2013, 03:09:05 am »

Quote from: oBFusCATed on October 17, 2013, 06:24:01 pm

ollydbg: Are you sure you've disabled building in STL mode?

Hi, Obf, what does this question mean? I'm sorry, I don't understand it. Do you mean building wxString without the internal std::basic_string support? I don't think that is an option for wx2.9.x+.

Quote from: ollydbg on October 17, 2013, 05:35:44 pm

...
Note that this document appears to be incorrect, since Unix builds now use std::basic_string<wchar_t> by default.

I have reported this issue to the wxWidgets mailing list, and it is now fixed in wx trunk; see this commit: https://groups.google.com/d/msg/wx-commits-diffs/QZDKnpiL3lM/eEX0cFOKS3cJ. The web page http://docs.wxwidgets.org/trunk/overview_string.html will need some days to synchronize with the trunk change.

Another issue I see: wxString is not NUL-terminated, right? So is the while-condition check below OK, in the function bool ParserThread::GetBaseArgs(const wxString& args, wxString& baseArgs)?

Code

    while (*ptr != ParserConsts::null)
    {
    ...
    }

Basically, I think we should use the length of the wxString to limit the pointer range.
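
A rough sketch of the difference, using std::wstring as a stand-in for wxString and assuming ParserConsts::null is the NUL character (the real GetBaseArgs body is not shown here, and the function names are hypothetical). Note that std::basic_string::c_str() is always NUL-terminated, so a sentinel loop over it does terminate; the length-bounded loop is still the safer pattern because it does not stop early on an embedded NUL.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Walks until a NUL sentinel, like the while loop in the post.
std::size_t CountUntilNul(const std::wstring& s)
{
    const wchar_t* ptr = s.c_str();   // c_str() guarantees a terminating NUL
    std::size_t n = 0;
    while (*ptr != L'\0')
    {
        ++ptr;
        ++n;
    }
    return n;
}

// Walks exactly size() characters, bounded by the string length.
std::size_t CountByLength(const std::wstring& s)
{
    const wchar_t* ptr = s.data();
    const wchar_t* const end = ptr + s.size();
    std::size_t n = 0;
    for (; ptr != end; ++ptr)
        ++n;
    return n;
}

// Usage: a 7-character string with a NUL in the middle.
void DemoEmbeddedNul()
{
    const std::wstring s(L"abc\0def", 7);
    assert(CountUntilNul(s) == 3);   // stops at the embedded NUL
    assert(CountByLength(s) == 7);   // sees the whole string
}
```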

BlueHazzard · « **Reply #14 on:** October 18, 2013, 04:34:42 am »

In C++ times the pointer way is the bad way

Better would be to use iterators...
but I think wx2.8 has no support for string iterators -.-

ollydbg · « **Reply #15 on:** October 18, 2013, 04:49:21 am »

Quote from: BlueHazzard on October 18, 2013, 04:34:42 am

In C++ times the pointer way is the bad way. Better would be to use iterators...
but I think wx2.8 has no support for string iterators -.-

Yes, I agree.

What I think would be better:
Use std::string, not wxString, to hold the source file buffers internally in the CC plugin. Even the wxWidgets documentation suggests this:
http://docs.wxwidgets.org/trunk/classwx_string.html

Quote

String class for passing textual data to or receiving it from wxWidgets.

Note
While the use of wxString is unavoidable in wxWidgets program, you are encouraged to use the standard string classes std::string or std::wstring in your applications and convert them to and from wxString only when interacting with wxWidgets.

Then all the internal source code would be encoded in UTF-8 (stored in std::string). I have already created a simple and faster lexer with Quex; see the post "Quex lexer grammar, probably can make our tokenizer much faster" for details. The generated lexer is all C/C++ code, and it is about three times faster than a Flex-generated lexer.

Note: ctags internally uses the char type too.

ollydbg · « **Reply #16 on:** October 18, 2013, 05:13:36 am »

Another issue is string construction. As you know, all token strings are in fact substrings of the source file. (In some special cases the token is replaced by a macro expansion, but we can create an auxiliary source string to hold all the expanded strings.)

What a lexer does is locate the start point and the end point of the lexeme, for example in this source code:

Code

int main ( ) { int a; .....
    ^   $

Note that when a lexeme is found, the lexer (Quex lexer) knows the start position "^" and the end position "$", and it also has the Type enum information; in this case it is an "identifier". It is up to the user to handle this information, so if you have a Token class like below:

Code

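// Token that owns a copy of its lexeme text.
class CCToken
{
public:
    std::string name;  // copied out of the source buffer
    TokenType   type;  // e.g. identifier, keyword
};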

The user has to construct the CCToken instance by copying memory from the source code into the name member variable, and then set the type member variable.

I think a better way is:

Code

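// Token that only refers to its lexeme inside a source buffer.
class CCToken
{
public:
    int        source_index;   // index of the source buffer
    int        lexeme_start;   // start offset of the lexeme in that buffer
    int        lexeme_length;  // length of the lexeme
    TokenType  type;
};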

Here, the first member is the index of the source buffer; the other two record the start position and length of the lexeme.

Maybe we can supply a member function like "std::string CCToken::ToStdString()", which returns a genuinely new std::string. In most cases I think we don't need lexeme_start and lexeme_length, because we only need to know the TokenType. For example, there are TokenTypes like "keyword_class", "keyword_public", ...
