bool Tokenizer::ReadFile()
{
    bool success = false;
    wxString fileName = wxEmptyString;
    if (m_Loader)
    {
        fileName = m_Loader->FileName();
        char* data = m_Loader->GetData();
        m_BufferLen = m_Loader->GetLength();
        // the following code is faster than DetectEncodingAndConvert()
        // DetectEncodingAndConvert(data, m_Buffer);
        // same code as in cbC2U() but with the addition of the string length (3rd param in unicode version)
        // and the fallback encoding conversion
#if wxUSE_UNICODE
        m_Buffer = wxString(data, wxConvUTF8, m_BufferLen + 1); // + 1 => sentinel
        if (m_Buffer.Length() == 0)
        {
            // could not read as utf-8 encoding, try iso8859-1
            m_Buffer = wxString(data, wxConvISO8859_1, m_BufferLen + 1); // + 1 => sentinel
        }
#else
        m_Buffer = wxString(data, m_BufferLen + 1); // + 1 => sentinel
#endif
        success = (data != 0);
    }
Look at this function:
data is a char* pointer to the raw bytes of the source file.
All we do here is convert those bytes to a Unicode wxString.
Two conversions are tried in turn:
m_Buffer = wxString(data, wxConvUTF8, m_BufferLen + 1);
and, if that yields an empty string,
m_Buffer = wxString(data, wxConvISO8859_1, m_BufferLen + 1);
As far as I know, the default source file encoding that gcc expects is UTF-8.
Case A: the source file is in UTF-8:
If the source file contains characters whose code point is above 0x7F, then every byte of their encoding has the high bit set (binary 1??? ????), and such characters should appear only in C strings or C/C++ comments. All variable names and other identifiers consist of ASCII characters, whose values are at most 0x7F.
Case B: the source file is in ISO 8859-1: I believe the same reasoning applies, since any non-ASCII character there is a single byte with the high bit set.
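To make the point of cases A and B concrete, here is a minimal sketch of the high-bit test on a plain byte buffer. The helper names IsNonAsciiByte and CountNonAsciiBytes are made up for illustration; they are not part of Code::Blocks or wxWidgets.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: true when the byte has its high bit set
// (binary 1??? ????). In UTF-8 such a byte is always part of a
// multi-byte (non-ASCII) character; in ISO 8859-1 it is a single
// non-ASCII character.
inline bool IsNonAsciiByte(unsigned char b)
{
    return (b & 0x80) != 0;
}

// Hypothetical helper: counts the bytes in a raw buffer that have
// the high bit set.
inline std::size_t CountNonAsciiBytes(const std::string& text)
{
    std::size_t count = 0;
    for (unsigned char b : text)
    {
        if (IsNonAsciiByte(b))
            ++count;
    }
    return count;
}
```

For example, "é" is the two bytes 0xC3 0xA9 in UTF-8 and the single byte 0xE9 in ISO 8859-1. In both encodings a byte below 0x80 can only be a real ASCII character, so a byte-wise scan for identifiers cannot be fooled by the non-ASCII text.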
So I'm wondering why we need this conversion to a Unicode wxString at all. On Windows, I believe wxString (as of wx 2.8.12) is essentially a basic_string<wchar_t>, which means each element takes two bytes.
I looked at the source code of CodeLite's lexer (flex based), and also at the ctags source code. Both use char-based buffer handling, and I think that is enough to hold all the information we need.
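As a rough illustration of what char-based handling could look like, here is a small sketch that pulls identifiers out of a std::string buffer. It is not how CodeLite, ctags, or the Code::Blocks Tokenizer actually work; ExtractIdentifiers and its helpers are hypothetical names, and the sketch deliberately ignores comments, string literals, and preprocessor lines.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// ASCII-only character classes, so bytes >= 0x80 are never part of an identifier.
inline bool IsIdentStart(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

inline bool IsIdentChar(unsigned char c)
{
    return IsIdentStart(c) || (c >= '0' && c <= '9');
}

// Hypothetical sketch: collect identifier-like runs from a raw byte buffer.
inline std::vector<std::string> ExtractIdentifiers(const std::string& src)
{
    std::vector<std::string> result;
    std::size_t i = 0;
    while (i < src.size())
    {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (IsIdentStart(c))
        {
            const std::size_t start = i;
            while (i < src.size() && IsIdentChar(static_cast<unsigned char>(src[i])))
                ++i;
            result.push_back(src.substr(start, i - start));
        }
        else if (c >= '0' && c <= '9')
        {
            // Skip a numeric literal (and any suffix) so it is not split into an identifier.
            while (i < src.size() && IsIdentChar(static_cast<unsigned char>(src[i])))
                ++i;
        }
        else
        {
            ++i; // punctuation, whitespace, or a non-ASCII byte: skip it
        }
    }
    return result;
}
```

Calling ExtractIdentifiers("int n = foo_1 + 42;") gives the three strings int, n and foo_1. Non-ASCII bytes simply fall into the last branch and are skipped, which is all the parser needs if such bytes only occur in strings and comments.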
So why not use std::string? It would cut the memory used for text at least in half. As we know, the TokensTree currently has a huge memory footprint.
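The memory argument can be checked with a quick sketch. Here std::wstring stands in for a Unicode-build wxString, which is my assumption based on the basic_string<wchar_t> remark above; NarrowBytes and WideBytes are hypothetical helpers that count only the character payload, not allocator overhead or per-string bookkeeping.

```cpp
#include <cstddef>
#include <string>

// Payload size in bytes of a char-based string.
inline std::size_t NarrowBytes(const std::string& s)
{
    return s.size() * sizeof(char);
}

// Payload size in bytes of a wchar_t-based string.
// sizeof(wchar_t) is 2 on Windows and commonly 4 on Linux and macOS.
inline std::size_t WideBytes(const std::wstring& s)
{
    return s.size() * sizeof(wchar_t);
}
```

For the name "Tokenizer", NarrowBytes reports 9 bytes while WideBytes reports 9 * sizeof(wchar_t), that is 18 bytes on Windows. On platforms where wchar_t is 4 bytes the gap is larger still.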
Any ideas?