Author Topic: wxString support in wxWidgets 3.0 problem?  (Read 28411 times)

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
wxString support in wxWidgets 3.0 problem?
« on: March 27, 2011, 10:44:16 am »
I have read this:
wxWidgets: wxString Class Reference
Then I found that in wxWidgets 3.0 the wxString class internally uses a different "code unit" depending on the platform: UTF-8 on Linux-like systems and UTF-16 on Windows.

Both are variable-length Unicode encodings, so indexed access like:
Code
wxString s;
s[20]= something;
will be much slower than walking the string sequentially with an iterator.

I'm not sure what the current implementation does, but will this cause potential issues in the future?
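
A minimal sketch of why indexing is slow on a variable-length encoding. It uses a plain std::string holding UTF-8 instead of wxString itself, and the function names are made up for illustration. Finding the N-th code point has to inspect every earlier byte, while a sequential pass visits each byte only once.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns the byte offset of the n-th code point (0-based) in a valid UTF-8
// string, or s.size() if the string has fewer code points.
// Cost is O(n): every earlier byte must be inspected.
std::size_t Utf8OffsetOfCodePoint(const std::string& s, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        // Bytes of the form 10xxxxxx are continuation bytes, not starts.
        const bool isStart = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (isStart)
        {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return s.size();
}

// Counts code points in one sequential pass over the bytes.
std::size_t Utf8Length(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s)
        if ((c & 0xC0) != 0x80)
            ++count;
    return count;
}
```

Calling Utf8OffsetOfCodePoint for every index in a loop is what turns a simple traversal into quadratic work.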

EDIT 2013-10-18:
Quote
I found that in wxWidgets 3.0 the wxString class internally uses a different "code unit" depending on the platform: UTF-8 on Linux-like systems and UTF-16 on Windows.
This is not correct: all platforms now use a fixed-width code unit (wchar_t). See this post for an explanation: http://forums.codeblocks.org/index.php/topic,14421.msg126174.html#msg126174


« Last Edit: October 18, 2013, 03:14:28 am by ollydbg »
If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: wxString support in wxWidgets 3.0 problem?
« Reply #1 on: March 27, 2011, 11:23:34 am »
All I can find is that they use a bigger type (4 bytes) in order to be able to store all code points, but they also say the implementation will be around basic_string.

Calling operator [] will call method at() which will call begin() and PosToImpl() and return it through a call to DecodeChar which will return a small object wxUniCharRef which will handle the assignment by creating a wxUniChar which will be assigned to an iterator, or something along those lines.

Most, if not all, of those methods were one-liners in the header file, so there is a good chance the produced code will be decently optimized. It would be nice to actually see what the compiler can do with it before we worry too much. It also seems like the current implementation is about the same anyway.

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #2 on: March 27, 2011, 11:31:52 am »
Quote from: Ceniza
All I can find is that they use a bigger type (4 bytes) in order to be able to store all code points, but they also say the implementation will be around basic_string.

Calling operator [] will call method at() which will call begin() and PosToImpl() and return it through a call to DecodeChar which will return a small object wxUniCharRef which will handle the assignment by creating a wxUniChar which will be assigned to an iterator, or something along those lines.

Most, if not all, of those methods were one-liners in the header file, so there is a good chance the produced code will be decently optimized. It would be nice to actually see what the compiler can do with it before we worry too much. It also seems like the current implementation is about the same anyway.
Does the current implementation use the same method (4 bytes per code point)?

PS: I received an email notification that you replied a few minutes ago on "Code completion doesnt follow #include in struct", but I do not see that post in the thread. Strange, did you delete it?

Offline Ceniza

  • Developer
  • Lives here!
  • *****
  • Posts: 1441
    • CenizaSOFT
Re: wxString support in wxWidgets 3.0 problem?
« Reply #3 on: March 27, 2011, 11:46:52 am »
Quote from: ollydbg
Does the current implementation use the same method (4 bytes per code point)?

It looks like the current implementation uses either wchar_t or char depending on how you configure wxWidgets. It is kind of difficult to be completely sure of all the differences by jumping everywhere in the svn repository.

Quote from: ollydbg
PS: I received an email notification that you replied a few minutes ago on "Code completion doesnt follow #include in struct", but I do not see that post in the thread. Strange, did you delete it?

The forum thought it was spam, but it seems to be fixed now. I do not have the powers to un-spam myself :P

P.S.: This post was considered spam too. Probably due to all links in the quotes.

Offline JGM

  • Lives here!
  • ****
  • Posts: 518
  • Got to practice :)
Re: wxString support in wxWidgets 3.0 problem?
« Reply #4 on: March 28, 2011, 01:42:09 am »
Hmm, I'm getting the same result: some of my messages are being marked as spam.

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #5 on: April 06, 2011, 10:07:17 am »
FYI:
I found one message in wx forum:
Quote
DL> It's not really thread-safe since it uses reference counting - I think,

 This was true for 2.8 but this question is explicitly about 2.9 and by
default in wx 2.9 (i.e. unless you set wxUSE_STD_STRING to 0) wxString uses
std::basic_string for implementation and so doesn't use reference counting
if the standard class doesn't -- and most, if not all, of them don't use it
any more. So the thread safety of wxString is the same as the thread-safety
of the underlying standard library string class.

 Regards,
VZ

So it seems that in the future wxString (2.9.x/3.x) will mostly NOT use reference counting, just like the STL strings.
The current CodeCompletion plugin source has a lot of functions like:

Code
wxString GetToken();
wxString PeekToken();
This code will do a deep copy of the string data, so I'm concerned about performance.

PS: In the wxWidgets 2.8.x implementation, wxString uses reference counting, so returning a wxString object is much faster (it does not do a deep copy of the string data).

So, what do you think?

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13438
    • Travis build status
Re: wxString support in wxWidgets 3.0 problem?
« Reply #6 on: April 06, 2011, 10:50:18 am »
RValue references to the rescue :)

And as always performance optimizations should be done when there is info that something is slow!
So profile it first then optimize, then profile again to see it is faster.
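
A minimal sketch of what rvalue references buy here, with std::string standing in for wxString (an assumption, since wxString's own move support depends on the wx version and build). MakeToken and MoveReusesBuffer are hypothetical names. A string returned by value is moved or elided rather than deep-copied, and on common standard library implementations a move of a long string just hands over the heap buffer.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical stand-in for a tokenizer function returning a string by value.
std::string MakeToken()
{
    std::string token(1000, 'x');  // long enough to be stored on the heap
    return token;                  // moved (or elided), not deep-copied
}

// Returns true if move-constructing from `source` reused its heap buffer.
bool MoveReusesBuffer(std::string source)
{
    const char* before = source.data();
    std::string target(std::move(source));
    return target.data() == before;
}
```

This only shows that the copy can be avoided; whether GetToken() and PeekToken() are actually a bottleneck still has to come from a profiler.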
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #7 on: April 06, 2011, 10:53:09 am »
I just searched Google for a while and found that
the std::string in GCC's libstdc++ is COW (copy-on-write), see
http://stackoverflow.com/questions/1594803/is-stdstring-thead-safe-with-gcc-4-3

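This code demonstrates the COW behaviour:
Code
#include <cstdio>
#include <string>

int main()
{
    std::string orig = "I'm the original!";
    std::string copy_cow = orig;          // shares orig's buffer if the library uses COW
    std::string copy_mem = orig.c_str();  // built from a char pointer, so always a fresh copy
    // %p expects a void pointer, hence the casts.
    std::printf("%p %p %p\n", static_cast<const void*>(orig.data()),
                              static_cast<const void*>(copy_cow.data()),
                              static_cast<const void*>(copy_mem.data()));
    return 0;
}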

So although wx itself does not use reference counting, I think std::string does.

Am I right? Can someone confirm this?
« Last Edit: April 06, 2011, 02:19:22 pm by ollydbg »

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #8 on: April 06, 2011, 02:50:01 pm »
Oh, it seems COW will go away in the upcoming C++0x;
see:
http://stackoverflow.com/questions/4067395/gnu-stl-string-is-copy-on-write-involved-here
Quote
Just wanted to note that copy on write is probably going to fade away in C++0x with the introduction of move semantics (makes COW obsolete for many typical use cases) and concurrency (makes COW potentially very inefficient due to synchronization issues).

and
just how bad CoW can be in a multithreaded environment, even if there's only one thread

N2668: "Concurrency Modifications to Basic String"
« Last Edit: April 06, 2011, 03:24:47 pm by ollydbg »

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #9 on: October 17, 2013, 05:08:33 pm »
FYI:

I see that under Linux, wxString in wxWidgets 2.9.x now uses std::basic_string<wchar_t>. The change happened around 2012-05-13; see this commit to the wxWidgets svn repo:
SVN:(VZ)[71424] Disable the use of UTF-8 by default in Unix builds. - Google Groups. UTF-8 was the default before this commit.

I think this is good news, because it should give wxWidgets better performance when parsing. Also, using the wchar_t pointer directly is safe on both Windows and Linux, because all elements occupy the same number of bytes (fixed-width encoding).

So never mind the issue reported in: unsafe memory copy in CC's macro replacement

EDIT: this is the current documentation on wxString performance, from the web page http://docs.wxwidgets.org/trunk/classwx_string.html
Quote
Performance characteristics

wxString uses std::basic_string internally to store its content (unless this is not supported by the compiler or disabled specifically when building wxWidgets) and it therefore inherits many features from std::basic_string. In particular, most modern implementations of std::basic_string are thread-safe and don't use reference counting (making copying large strings potentially expensive) and so wxString has the same characteristics.

By default, wxString uses std::basic_string specialized for the platform-dependent wchar_t type, meaning that it is not memory-efficient for ASCII strings, especially under Unix platforms where every ASCII character, normally fitting in a byte, is represented by a 4 byte wchar_t.

It is possible to build wxWidgets with wxUSE_UNICODE_UTF8 set to 1 in which case an UTF-8-encoded string representation is stored in std::basic_string specialized for char, i.e. the usual std::string. In this case the memory efficiency problem mentioned above doesn't arise but run-time performance of many wxString methods changes dramatically, in particular accessing the N-th character of the string becomes an operation taking O(N) time instead of O(1), i.e. constant, time by default. Thus, if you do use this so called UTF-8 build, you should avoid using indices to access the strings whenever possible and use the iterators instead. As an example, traversing the string using iterators is an O(N), where N is the string length, operation in both the normal ("wchar_t") and UTF-8 builds but doing it using indices becomes O(N^2) in UTF-8 case meaning that simply checking every character of a reasonably long (e.g. a couple of millions elements) string can take an unreasonably long time.

However, if you do use iterators, UTF-8 build can be a better choice than the default build, especially for the memory-constrained embedded systems. Notice also that GTK+ and DirectFB use UTF-8 internally, so using this build not only saves memory for ASCII strings but also avoids conversions between wxWidgets and the underlying toolkit.
« Last Edit: October 17, 2013, 05:19:30 pm by ollydbg »

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3289
Re: wxString support in wxWidgets 3.0 problem?
« Reply #10 on: October 17, 2013, 05:23:22 pm »
Is wchar_t on Windows 16 bits? If yes, can you use per-character access anyway? I mean, 16 bits isn't enough space for the whole Unicode table.
(UTF-16 is the worst decision you can make when supporting Unicode:
* you have to watch out for endianness
* if you open a corrupt file, there is no way to repair it...)

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #11 on: October 17, 2013, 05:35:44 pm »
Quote
Is wchar_t on Windows 16 bits?
If I understand correctly, yes. Windows in fact uses UTF-16 to encode strings, and wchar_t is 16 bits.

Quote
If yes, can you use per-character access anyway? I mean, 16 bits isn't enough space for the whole Unicode table.
(UTF-16 is the worst decision you can make when supporting Unicode:
* you have to watch out for endianness
* if you open a corrupt file, there is no way to repair it...)
In some cases a character needs four bytes, which means two UTF-16 code units. Under Windows, the user needs to handle this special case (called a surrogate pair) themselves.

See the documentation at: http://docs.wxwidgets.org/trunk/overview_string.html
Quote
Internal wxString Encoding

Since wxWidgets 3.0 wxString internally uses UTF-16 (with Unicode code units stored in wchar_t) under Windows and UTF-8 (with Unicode code units stored in char) under Unix, Linux and Mac OS X to store its content.

For definitions of code units and code points terms, please see the Unicode Representations and Terminology paragraph.

For simplicity of implementation, wxString when wxUSE_UNICODE_WCHAR==1 (e.g. on Windows) uses per code unit indexing instead of per code point indexing and doesn't know anything about surrogate pairs; in other words it always considers code points to be composed by 1 code unit, while this is really true only for characters in the BMP (Basic Multilingual Plane). Thus when iterating over a UTF-16 string stored in a wxString under Windows, the user code has to take care of surrogate pairs himself. (Note however that Windows itself has built-in support for surrogate pairs in UTF-16, such as for drawing strings on screen.)

Remarks
    Note that while the behaviour of wxString when wxUSE_UNICODE_WCHAR==1 resembles UCS-2 encoding, it's not completely correct to refer to wxString as UCS-2 encoded since you can encode code points outside the BMP in a wxString as two code units (i.e. as a surrogate pair; as already mentioned however wxString will "see" them as two different code points)

When instead wxUSE_UNICODE_UTF8==1 (e.g. on Linux and Mac OS X) wxString handles UTF8 multi-bytes sequences just fine also for characters outside the BMP (it implements per code point indexing), so that you can use UTF8 in a completely transparent way:

Note that this document looks incorrect, since Unix systems now use std::basic_string<wchar_t> by default.
But its description is correct for Windows.

Note that we currently don't handle surrogate pairs in C::B; wxWidgets 2.8.12 uses std::basic_string<wchar_t> too. Question: is there any chance we will meet a surrogate pair in C++ source code? Maybe in comments? Which characters need a surrogate pair under Windows? I don't have an example.
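
For reference, any code point above U+FFFF needs a surrogate pair in UTF-16, for example U+1D11E (MUSICAL SYMBOL G CLEF) or U+1F600 (GRINNING FACE). The sketch below shows the standard UTF-16 arithmetic; the function names are made up for illustration and nothing here is wxWidgets API.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Splits a code point above U+FFFF into its UTF-16 surrogate pair
// (high surrogate first). Precondition: 0x10000 <= cp <= 0x10FFFF.
std::pair<std::uint16_t, std::uint16_t> ToSurrogatePair(std::uint32_t cp)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const std::uint32_t v = cp - 0x10000;  // 20-bit value to split in two
    const std::uint16_t high = static_cast<std::uint16_t>(0xD800 + (v >> 10));
    const std::uint16_t low  = static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF));
    return std::make_pair(high, low);
}

// Range checks a parser could use to detect half of a pair while iterating.
bool IsHighSurrogate(std::uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool IsLowSurrogate(std::uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }
```

U+1D11E becomes the two code units 0xD834 0xDD1E, so a 16-bit wchar_t string sees it as two elements. In C++ sources such characters would most plausibly show up in comments or string literals.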

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13438
    • Travis build status
Re: wxString support in wxWidgets 3.0 problem?
« Reply #12 on: October 17, 2013, 06:24:01 pm »
ollydbg: Are you sure you've disabled building in STL mode?

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #13 on: October 18, 2013, 03:09:05 am »
Quote from: oBFusCATed
ollydbg: Are you sure you've disabled building in STL mode?
Hi Obf, what does this question mean? I'm sorry, I don't understand it. Do you mean building wxString without the internal std::basic_string support? I think that is not an option for wx2.9.x+.

...
Note that this document looks incorrect, since Unix systems now use std::basic_string<wchar_t> by default.
I have reported this issue to the wxWidgets mailing list, and it is now fixed in wx trunk; see this commit: https://groups.google.com/d/msg/wx-commits-diffs/QZDKnpiL3lM/eEX0cFOKS3cJ. The web page http://docs.wxwidgets.org/trunk/overview_string.html needs a few days to synchronize with the trunk change.

Another issue I see: wxString is not NULL-terminated, right? So is the while condition below OK? It is in the function bool ParserThread::GetBaseArgs(const wxString& args, wxString& baseArgs):
Code
    while (*ptr != ParserConsts::null)
    {
    ...
    }
Basically, I think we should use the length of the wxString to bound the pointer range.
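
A minimal sketch of the difference, using std::wstring as a stand-in for wxString (an assumption; the function names are hypothetical). For std::basic_string, c_str() is always NUL-terminated, so the sentinel loop does stop, but it stops at the first embedded NUL, while a length-bounded loop sees the whole buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts characters by walking until a NUL sentinel, like the loop above.
std::size_t CountUntilNul(const std::wstring& s)
{
    const wchar_t* ptr = s.c_str();  // c_str() is always NUL-terminated
    std::size_t n = 0;
    while (*ptr != L'\0') { ++ptr; ++n; }
    return n;
}

// Counts characters by bounding the pointer with the string length.
std::size_t CountByLength(const std::wstring& s)
{
    const wchar_t* ptr = s.data();
    const wchar_t* end = ptr + s.size();
    std::size_t n = 0;
    while (ptr != end) { ++ptr; ++n; }
    return n;
}
```

For ordinary argument strings both loops agree; they only differ when the buffer contains a NUL character in the middle.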
« Last Edit: October 18, 2013, 03:11:34 am by ollydbg »

Offline BlueHazzard

  • Developer
  • Lives here!
  • *****
  • Posts: 3289
Re: wxString support in wxWidgets 3.0 problem?
« Reply #14 on: October 18, 2013, 04:34:42 am »
In C++ times the pointer way is the bad way ;) It would be better to use iterators...
but I think wx2.8 has no support for string iterators -.-

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #15 on: October 18, 2013, 04:49:21 am »
Quote from: BlueHazzard
In C++ times the pointer way is the bad way ;) It would be better to use iterators...
but I think wx2.8 has no support for string iterators -.-
Yes, I agree.

I think a better way is to use std::string, not wxString, to hold the source file buffers internally in the CC plugin. Even the wxWidgets documentation suggests this:
http://docs.wxwidgets.org/trunk/classwx_string.html
Quote
String class for passing textual data to or receiving it from wxWidgets.

Note
    While the use of wxString is unavoidable in wxWidgets program, you are encouraged to use the standard string classes std::string or std::wstring in your applications and convert them to and from wxString only when interacting with wxWidgets.
Then all the internal source code would be encoded in UTF-8 (stored in std::string). I have already created a simple, fast lexer with Quex; see the post "Quex lexer grammar, probably can make our tokenizer much faster" for details. The generated lexer is plain C/C++ code, and it is about three times faster than a Flex-generated lexer.

Note: ctags internally uses the char type too.

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 5737
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Re: wxString support in wxWidgets 3.0 problem?
« Reply #16 on: October 18, 2013, 05:13:36 am »
Another issue is string construction. As you know, all token strings are in fact substrings of the source file. (In some special cases a token is replaced by a macro expansion, but we can create an auxiliary source string to hold all the expanded strings.)

What a lexer does is locate the start and end points of each lexeme, for example in this source code:
Code
int main ( ) { int a; .....
    ^   $
Note that when a lexeme is found, the lexer (the Quex lexer) knows the start position "^" and the end position "$", and it also has the token type as an enum value; in this case, "identifier". It is up to the user to handle this information, so if you have a Token class like the one below:
Code
class CCToken
{
    std::string name;   // owned copy of the lexeme text
    TokenType   type;   // kind of token, e.g. identifier
};
The user has to construct the CCToken instance by copying memory from the source code into the name member, then set the type member.

I think a better way is:
Code
class CCToken
{
    int  source_index;   // which source buffer the lexeme lives in
    int  lexeme_start;   // offset of the first character in that buffer
    int  lexeme_length;  // number of characters in the lexeme
    TokenType  type;     // kind of token, e.g. identifier
};
Here, the first member is the index of the source buffer, and the other two record the start position and length of the lexeme.

Maybe we can supply a member function like "std::string CCToken::ToStdString()", which returns a newly constructed std::string. In most cases I think we don't need lexeme_start and lexeme_length at all, because we only need to know the TokenType. For example, there are TokenTypes like "keyword_class", "keyword_public", ...


« Last Edit: October 18, 2013, 05:15:18 am by ollydbg »