Author Topic: Discussion about the Tokenizer::ReadFile  (Read 12552 times)

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 6107
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
Discussion about the Tokenizer::ReadFile
« on: September 03, 2011, 04:13:09 pm »
Code
bool Tokenizer::ReadFile()
{
    bool success = false;
    wxString fileName = wxEmptyString;
    if (m_Loader)
    {
        fileName = m_Loader->FileName();
        char* data  = m_Loader->GetData();
        m_BufferLen = m_Loader->GetLength();

        // the following code is faster than DetectEncodingAndConvert()
//        DetectEncodingAndConvert(data, m_Buffer);

        // same code as in cbC2U() but with the addition of the string length (3rd param in unicode version)
        // and the fallback encoding conversion
#if wxUSE_UNICODE
        m_Buffer = wxString(data, wxConvUTF8, m_BufferLen + 1); // + 1 => sentinel
        if (m_Buffer.Length() == 0)
        {
            // could not read as utf-8 encoding, try iso8859-1
            m_Buffer = wxString(data, wxConvISO8859_1, m_BufferLen + 1); // + 1 => sentinel
        }
#else
        m_Buffer = wxString(data, m_BufferLen + 1); // + 1 => sentinel
#endif

        success = (data != 0);
    }
    // ...
    return success;
}

Look at this function:
data is a pointer (char*) to the raw bytes of the source file.
We simply convert those bytes to a Unicode wxString.

We have tried two methods:
Code
m_Buffer = wxString(data, wxConvUTF8, m_BufferLen + 1); 
or
m_Buffer = wxString(data, wxConvISO8859_1, m_BufferLen + 1);
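
The "UTF-8 first, ISO-8859-1 second" order can be sketched without wxWidgets. This is only an illustration under stated assumptions, not the Code::Blocks implementation: the names Utf8TrailCount, DecodeUtf8 and DecodeWithFallback are hypothetical, the UTF-8 check is structural only, and the fallback simply widens each byte, because ISO-8859-1 bytes map one-to-one onto the first 256 Unicode code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of continuation bytes expected after lead
// byte `b`, or -1 if `b` cannot start a UTF-8 sequence.
int Utf8TrailCount(unsigned char b)
{
    if (b < 0x80) return 0;            // 0xxxxxxx: ASCII
    if ((b & 0xE0) == 0xC0) return 1;  // 110xxxxx
    if ((b & 0xF0) == 0xE0) return 2;  // 1110xxxx
    if ((b & 0xF8) == 0xF0) return 3;  // 11110xxx
    return -1;                         // stray continuation or invalid byte
}

// Tries to decode `bytes` as UTF-8 into `out`. Structural check only:
// overlong forms and surrogates are not rejected in this sketch.
bool DecodeUtf8(const std::string& bytes, std::u32string& out)
{
    out.clear();
    for (std::size_t i = 0; i < bytes.size(); )
    {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        const int trail = Utf8TrailCount(lead);
        if (trail < 0 || i + static_cast<std::size_t>(trail) >= bytes.size())
            return false;              // invalid lead byte or truncated sequence

        // Keep only the payload bits of the lead byte.
        char32_t cp = static_cast<char32_t>(trail == 0 ? lead : (lead & (0x3F >> trail)));
        for (int k = 1; k <= trail; ++k)
        {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                return false;          // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += static_cast<std::size_t>(trail) + 1;
    }
    return true;
}

// Decode as UTF-8 if possible, otherwise treat every byte as ISO-8859-1.
std::u32string DecodeWithFallback(const std::string& bytes)
{
    std::u32string out;
    if (DecodeUtf8(bytes, out))
        return out;

    out.clear();
    for (unsigned char b : bytes)
        out.push_back(static_cast<char32_t>(b));
    return out;
}
```

Unlike the snippet above, this sketch reports a failed UTF-8 decode explicitly instead of testing for an empty result, so an empty input file is not confused with a conversion failure.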

As far as I know, the default source file encoding that gcc expects is UTF-8.

case A: the source file is in UTF8 format:
If a source file contains characters whose code point is greater than 0x7F, then every byte of their encoding has the high bit set (binary 1??? ????), and such characters should only appear in C strings or C/C++ comments. Variables and other identifiers consist only of ASCII characters, whose values are at most 0x7F.

case B: the source file is in ISO8859_1 format. I believe the same reasoning applies.
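
The property both cases rely on can be checked directly on the raw bytes: a byte with the high bit clear is an ASCII character on its own, and a byte with the high bit set belongs to a non-ASCII character. A small sketch, where IsAsciiByte and CountNonAsciiBytes are illustrative names only:

```cpp
#include <cstddef>
#include <string>

// True if the byte is an ASCII character (0x00..0x7F) on its own.
bool IsAsciiByte(char c)
{
    return (static_cast<unsigned char>(c) & 0x80) == 0;
}

// Counts bytes with the high bit set, i.e. bytes that belong to
// non-ASCII characters in either UTF-8 or ISO-8859-1 input.
std::size_t CountNonAsciiBytes(const std::string& bytes)
{
    std::size_t n = 0;
    for (char c : bytes)
    {
        if (!IsAsciiByte(c))
            ++n;
    }
    return n;
}
```

For example, a comment containing one CJK character encoded in UTF-8 as the three bytes E4 B8 AD contributes three such bytes, and none of them can be mistaken for a letter, digit or underscore.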

So I'm wondering why we need to convert to a Unicode wxString at all. On Windows, I believe wxString (as of wx2.8.12) is essentially basic_string<wchar_t>, which means each element takes two bytes.
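
The per-character cost can be made concrete with the standard string types. This sketch uses std::string and std::wstring as stand-ins for a char buffer and a wide wxString; PayloadBytes is a hypothetical helper, and it counts only the character data, not allocator or object overhead.

```cpp
#include <cstddef>
#include <string>

// Bytes occupied by the character data of a string
// (excluding the terminator and any allocator overhead).
template <typename StringT>
std::size_t PayloadBytes(const StringT& s)
{
    return s.size() * sizeof(typename StringT::value_type);
}
```

For a 1000-character ASCII buffer, PayloadBytes gives 1000 for std::string and 1000 * sizeof(wchar_t) for std::wstring. wchar_t is 2 bytes on Windows and typically 4 bytes on Linux, so the wide buffer is at least twice as large there.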

I looked at the source code of CodeLite's lexer (flex based) and even at the ctags source code. Both use char-based buffer handling, which I think is enough to hold all the information we need.
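
A char-based scanner in that spirit might look like the sketch below. It is not taken from CodeLite or ctags: ExtractIdentifiers is a hypothetical function that works on the raw bytes and collects ASCII identifier runs, and it deliberately ignores comments, string literals and preprocessor handling.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only classification; bytes >= 0x80 never match.
bool IsIdentStart(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

bool IsIdentChar(unsigned char c)
{
    return IsIdentStart(c) || (c >= '0' && c <= '9');
}

// Hypothetical scanner: collects identifier-like runs from a byte buffer.
std::vector<std::string> ExtractIdentifiers(const std::string& buffer)
{
    std::vector<std::string> result;
    std::size_t i = 0;
    while (i < buffer.size())
    {
        const unsigned char c = static_cast<unsigned char>(buffer[i]);
        if (IsIdentStart(c))
        {
            // Consume the whole identifier and store it.
            const std::size_t start = i;
            while (i < buffer.size() && IsIdentChar(static_cast<unsigned char>(buffer[i])))
                ++i;
            result.push_back(buffer.substr(start, i - start));
        }
        else if (c >= '0' && c <= '9')
        {
            // Skip numeric literals (including suffixes) without storing them.
            while (i < buffer.size() && IsIdentChar(static_cast<unsigned char>(buffer[i])))
                ++i;
        }
        else
        {
            ++i; // punctuation, whitespace, or a non-ASCII byte
        }
    }
    return result;
}
```

Because every byte of a non-ASCII character falls into the last branch, the scanner never needs to know whether the file is UTF-8 or ISO-8859-1.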

So why not use std::string? It would save at least half of the memory. As we know, the TokensTree now has a huge memory footprint.
Any ideas?

If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline oBFusCATed

  • Developer
  • Lives here!
  • *****
  • Posts: 13406
    • Travis build status
Re: Discussion about the Tokenizer::ReadFile
« Reply #1 on: September 03, 2011, 10:49:59 pm »
Quote
As we know, the TokensTree now has a huge memory footprint.

What does "huge memory footprint" mean?
On my systems, with my projects, C::B takes 100-200 MB of RAM, which is quite good.
In recent months I've gotten used to some of the programs I use taking 3-4 GB of RAM, so C::B is using too little memory :)
(most of the time I ignore long posts)
[strangers don't send me private messages, I'll ignore them; post a topic in the forum, but first read the rules!]

Offline ollydbg

Re: Discussion about the Tokenizer::ReadFile
« Reply #2 on: September 04, 2011, 01:29:57 am »
Quote
As we know, the TokensTree now has a huge memory footprint.
What does "huge memory footprint" mean?

It means it eats a lot of memory. For example, if I open codeblocks.cbp, it takes about 200 MB of memory.

The other issue is that doing this conversion on every source file takes time. I'm not sure how fast the conversion is, but I don't think it is negligible.

Offline oBFusCATed

Re: Discussion about the Tokenizer::ReadFile
« Reply #3 on: September 05, 2011, 09:11:33 pm »
Quote
The other issue is that doing this conversion on every source file takes time. I'm not sure how fast the conversion is, but I don't think it is negligible.

Hm, aren't you once again solving performance problems without having profiled the code?
A developer's judgement about code performance is wrong 99% of the time.
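
A first measurement does not need much code. The sketch below times a stand-in conversion with std::chrono::steady_clock. WidenBytes, TimingResult and TimeConversion are hypothetical names, and the byte-widening loop is only an assumption standing in for the real wxString constructor, which would have to be measured inside Code::Blocks (ideally with a profiler) to get meaningful numbers.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Stand-in for the real conversion: widens each byte to wchar_t.
std::wstring WidenBytes(const std::string& bytes)
{
    std::wstring out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes)
        out.push_back(static_cast<wchar_t>(b));
    return out;
}

struct TimingResult
{
    long long micros;       // elapsed wall-clock time in microseconds
    std::size_t totalChars; // characters produced over all repeats
};

// Converts `bytes` `repeats` times and reports the elapsed time.
TimingResult TimeConversion(const std::string& bytes, int repeats)
{
    using Clock = std::chrono::steady_clock;
    TimingResult result = {0, 0};

    const Clock::time_point start = Clock::now();
    for (int i = 0; i < repeats; ++i)
        result.totalChars += WidenBytes(bytes).size();
    const Clock::time_point stop = Clock::now();

    result.micros =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    return result;
}
```

Running this on a buffer the size of a typical source file, with enough repeats to get a stable figure, shows whether the conversion is worth optimizing before any redesign.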