Code::Blocks Forums


Title: Discussion about the Tokenizer::ReadFile
Post by: ollydbg on September 03, 2011, 04:13:09 pm
Code
bool Tokenizer::ReadFile()
{
    bool success = false;
    wxString fileName = wxEmptyString;
    if (m_Loader)
    {
        fileName = m_Loader->FileName();
        char* data  = m_Loader->GetData();
        m_BufferLen = m_Loader->GetLength();

        // the following code is faster than DetectEncodingAndConvert()
//        DetectEncodingAndConvert(data, m_Buffer);

        // same code as in cbC2U() but with the addition of the string length (3rd param in unicode version)
        // and the fallback encoding conversion
#if wxUSE_UNICODE
        m_Buffer = wxString(data, wxConvUTF8, m_BufferLen + 1); // + 1 => sentinel
        if (m_Buffer.Length() == 0)
        {
            // could not read as utf-8 encoding, try iso8859-1
            m_Buffer = wxString(data, wxConvISO8859_1, m_BufferLen + 1); // + 1 => sentinel
        }
#else
        m_Buffer = wxString(data, m_BufferLen + 1); // + 1 => sentinel
#endif

        success = (data != 0);
    }

    // ... (rest of the function is omitted here)

    return success;
}

Look at this function:
data is a pointer (char*) to the raw contents of the source file.
We just convert that data to a Unicode wxString representation.

We have tried two methods:
Code
m_Buffer = wxString(data, wxConvUTF8, m_BufferLen + 1); 
or
m_Buffer = wxString(data, wxConvISO8859_1, m_BufferLen + 1);

As far as I know, the default source file encoding that GCC expects is UTF-8.

Case A: the source file is in UTF-8 format.
If a source file contains characters whose code point is greater than 0x7F, then every byte of their UTF-8 encoding has the binary form 1??? ???? (high bit set), and such characters should only appear in C strings or C/C++ comments. All variables and identifiers consist entirely of ASCII characters, whose values are <= 0x7F.
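For illustration, here is a small sketch (not from the original code) showing that every byte of a multi-byte UTF-8 sequence has its high bit set, so it can never be confused with an ASCII identifier character:
Code
#include <cstdio>

int main()
{
    // "é" (U+00E9) is encoded in UTF-8 as the two bytes 0xC3 0xA9;
    // both have the form 1??? ???? (high bit set), so neither can
    // be mistaken for an ASCII character (value <= 0x7F).
    const unsigned char utf8[] = { 0xC3, 0xA9, 0 };
    for (const unsigned char* p = utf8; *p; ++p)
        printf("byte 0x%02X: high bit %s\n", *p, (*p & 0x80) ? "set" : "clear");
    return 0;
}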

Case B: the source file is in ISO8859-1 format; I believe the same reasoning applies.

So I'm wondering: why do we need such a conversion to a Unicode wxString representation at all? On Windows, I believe wxString (as of wx2.8.12) is essentially a basic_string<wchar_t>, which means each element takes two bytes.

I have looked at the source code of CodeLite's lexer (flex based) and even ctags' source code; all of them use char-based buffer handling, which I think is enough to hold all the information we need.

So, why not use std::string? It would save at least half the memory. As we know, the TokensTree currently has a huge memory footprint.
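To illustrate the size argument, here is a minimal sketch (not from the original post; it assumes a Windows build where wchar_t is 2 bytes):
Code
#include <cstdio>
#include <string>

int main()
{
    std::string  narrow(1024 * 1024, 'x'); // 1 MiB of ASCII source text
    std::wstring wide(narrow.begin(), narrow.end());

    // On Windows wchar_t is 2 bytes, so the wide copy needs twice the
    // storage for the same ASCII payload (4 bytes per char on typical
    // Linux builds, where wchar_t is 4 bytes).
    printf("narrow payload: %zu bytes\n", narrow.size() * sizeof(char));
    printf("wide payload:   %zu bytes\n", wide.size()   * sizeof(wchar_t));
    return 0;
}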
Any ideas?

Title: Re: Discussion about the Tokenizer::ReadFile
Post by: oBFusCATed on September 03, 2011, 10:49:59 pm
Quote from: ollydbg
As we know, the TokensTree currently has a huge memory footprint.
What does "huge memory footprint" mean?
On my systems, with my projects, C::B takes 100-200 MB of RAM, which is quite good.
In recent months I have become quite used to 3-4 GB of RAM usage by some of the programs I use => C::B uses too little memory :)
Title: Re: Discussion about the Tokenizer::ReadFile
Post by: ollydbg on September 04, 2011, 01:29:57 am
Quote from: oBFusCATed
Quote from: ollydbg
As we know, the TokensTree currently has a huge memory footprint.
What does "huge memory footprint" mean?
It means it eats a lot of memory. E.g. if I open codeblocks.cbp, it takes about 200 MB of memory.

The other issue is that if we do such a conversion on every source file, it takes time. I'm not sure how fast the conversion can be done, but I suspect it is not negligible.
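For example, the conversion could be timed directly instead of guessed at. A rough sketch (TimeConversion is just a hypothetical helper, not existing code):
Code
#include <wx/string.h>
#include <wx/strconv.h>
#include <wx/stopwatch.h>
#include <wx/log.h>

// Hypothetical helper: measure how long the UTF-8 -> wxString
// conversion of a raw file buffer actually takes.
void TimeConversion(const char* data, size_t len)
{
    wxStopWatch sw;
    wxString buffer(data, wxConvUTF8, len);
    wxLogMessage(wxT("converted %lu bytes in %ld ms"),
                 (unsigned long)len, sw.Time());
}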
Title: Re: Discussion about the Tokenizer::ReadFile
Post by: oBFusCATed on September 05, 2011, 09:11:33 pm
Quote from: ollydbg
The other issue is that if we do such a conversion on every source file, it takes time. I'm not sure how fast the conversion can be done, but I suspect it is not negligible.
Hm, aren't you trying to solve performance problems without having profiled the code again?
Developers' judgement about code performance is wrong 99% of the time.