The encoding of text from the GCC compiler should be UTF-8 by default?

ollydbg:

--- Quote from: Miguel Gimenez on October 29, 2022, 02:04:58 pm ---IIRC the message also has the file path, I hope this change will not modify it if there are non-ASCII characters in it.

--- End quote ---

Hi, thanks. That's another issue.

I just checked a file path that contains CJK chars:


--- Code: ---D:\code\test-crash-中文\main.cpp
--- End code ---

Now, without your suggested empty() check, I see that some build log lines are missing (empty).

By using the empty() check, it works OK.

See the screenshot below:


EDIT:

I also checked another file path which contains characters like LATIN SMALL LETTER E WITH GRAVE
(sorry, our forum does not allow posting those non-ASCII chars, so I added another screenshot).

It also works OK.


ollydbg:
Things are more complex than I thought.

First, it looks like in C::B the compiler plugin sends the compile command to GCC; the file path in that command is in the system's native encoding, which under my Win7 is GB2312. In GCC's returned text the file path is still in GB2312 encoding, while when GCC prints a diagnostic message, for example reporting an error position, it uses byte positions.

Meanwhile, in our C::B, when handling the stdout and stderr pipes, a converter is used:


--- Code: ---// The following class is created to override wxTextStream::ReadLine()
class cbTextInputStream : public wxTextInputStream
{
    protected:
        bool m_allowMBconversion;
    public:
#if wxUSE_UNICODE
        cbTextInputStream(wxInputStream& s, const wxString &sep=wxT(" \t"), wxMBConv& conv = wxConvLocal )
            : wxTextInputStream(s, sep, conv),
            m_allowMBconversion(true)
        {
            memset((void*)m_lastBytes, 0, 10);
        }

--- End code ---

Here, wxConvLocal is the local (system) encoding converter, which means the byte stream is expected to be in GB2312 encoding. And if we have source code like this:


--- Code: ---    5 |     int abc; ///< 串口号
      |         ^~~

--- End code ---

If the file content is in UTF-8 format, the converter just wrongly decodes the string as if it were GB2312.
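
For example, here is a minimal stand-alone sketch (my own example, not C::B code; it assumes wxWidgets and that wxCSConv can provide a GB2312/CP936 conversion, which is roughly what wxConvLocal amounts to on my Chinese Windows system) showing how the same UTF-8 bytes decode correctly via UTF-8 but get destroyed by the local converter:


--- Code: ---// Hypothetical demo, not part of C::B: decode the same bytes two ways.
#include <wx/string.h>
#include <wx/strconv.h>
#include <wx/init.h>
#include <cstdio>

int main()
{
    wxInitializer initializer;

    // UTF-8 bytes of "串口号", as GCC would print them from a UTF-8 source file
    const char utf8Bytes[] = "\xE4\xB8\xB2\xE5\x8F\xA3\xE5\x8F\xB7";

    // Correct: treat the bytes as UTF-8
    const wxString asUtf8 = wxString::FromUTF8(utf8Bytes);

    // Wrong: treat the same bytes as the local GB2312/CP936 encoding,
    // which is what wxConvLocal effectively does on a Simplified Chinese Windows.
    // Depending on how strict the platform converter is, this yields either
    // mojibake or an empty string; in both cases the original text is lost.
    wxCSConv gbConv(wxT("GB2312"));
    const wxString asLocal(utf8Bytes, gbConv);

    printf("as UTF-8 : %s\n", (const char*)asUtf8.utf8_str());
    printf("as GB2312: %s\n", (const char*)asLocal.utf8_str());
    return 0;
}
--- End code ---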

This is the code that fetches each byte from the input pipe stream and converts it with the wxConvLocal converter:


--- Code: ---        // The following function was copied verbatim from wxTextStream::NextChar()
        // The only change, is the removal of the MB2WC function
        // With PipedProcess we work with compilers/debuggers which (usually) don't
        // send us unicode (at least GDB).
        wxChar NextChar()
        {
        #if wxUSE_UNICODE
            wxChar wbuf[2];
            memset((void*)m_lastBytes, 0, 10);
            for (size_t inlen = 0; inlen < 9; inlen++)
            {
                // actually read the next character byte
                m_lastBytes[inlen] = m_input.GetC();

                if (m_input.LastRead() <= 0)
                    return wxEOT;
                // inlen is the index of the byte just copied from the input byte stream
                if (m_allowMBconversion)
                {
                    int retlen = (int) m_conv->MB2WC(wbuf, m_lastBytes, 2); // returns -1 for failure
                    if (retlen >= 0) // res == 0 could happen for '\0' char
                        return wbuf[0];
                }
                else
                    return m_lastBytes[inlen]; // C::B fix (?)
            }
            // there should be no encoding which requires more than nine bytes for one character...
            return wxEOT;
        #else
            m_lastBytes[0] = m_input.GetC();

            if (m_input.LastRead() <= 0)
                return wxEOT;

            return m_lastBytes[0];
        #endif
        }

--- End code ---

The trick here is: if the m_conv->MB2WC(wbuf, m_lastBytes, 2) call partially "succeeds" on the UTF-8 file content, we get a wrongly decoded diagnostic wxString.
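
To see why such a partial success is possible at all, here is a small sketch (my own example, assuming a GB2312 wxCSConv behaves like wxConvLocal does on my system) that feeds the bytes of 串 (UTF-8: E4 B8 B2) to MB2WC() one at a time, the way NextChar() does; after two bytes the GB2312 conversion already reports success and returns a wrong wide character:


--- Code: ---// Hypothetical demo: mimic how NextChar() feeds bytes to MB2WC() one at a time.
#include <wx/string.h>
#include <wx/strconv.h>
#include <wx/init.h>
#include <cstring>
#include <cstdio>

int main()
{
    wxInitializer initializer;

    // The first UTF-8 character of "串口号": three bytes E4 B8 B2
    const unsigned char utf8[] = { 0xE4, 0xB8, 0xB2 };

    wxCSConv gbConv(wxT("GB2312")); // stand-in for wxConvLocal on a Chinese Windows
    char lastBytes[10];
    wchar_t wbuf[2];

    memset(lastBytes, 0, sizeof(lastBytes));
    for (size_t inlen = 0; inlen < sizeof(utf8); inlen++)
    {
        lastBytes[inlen] = (char)utf8[inlen];
        const size_t retlen = gbConv.MB2WC(wbuf, lastBytes, 2);
        if (retlen != wxCONV_FAILED)
        {
            // With E4 alone the conversion fails, but with E4 B8 it "succeeds":
            // that byte pair is a valid GB2312 character, just not the one we want.
            printf("converted after %u byte(s): U+%04X\n",
                   (unsigned)(inlen + 1), (unsigned)wbuf[0]);
            return 0;
        }
        printf("conversion failed with %u byte(s), reading one more...\n",
               (unsigned)(inlen + 1));
    }
    return 0;
}
--- End code ---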


Only if the MB2WC() call fails is the raw byte returned, and then we still get a later chance to convert it with this code:


--- Code: ---    {
        wxString msg1 = wxString::FromUTF8(msg.c_str());
        AddOutputLine(msg1.empty() ? msg : msg1);
    }

--- End code ---

ollydbg:
I have new code to solve this issue in the file sdk\pipedprocess.cpp.

I think we don't need to call the wxChar NextChar() function at all.


--- Code: ---        wxString ReadLine()
        {
            wxString line;

            std::string lineBytes;

            while ( m_input.CanRead() && !m_input.Eof() )
            {
                char c = m_input.GetC();
                if (m_input.LastRead() <= 0)
                    break;

                if ( !m_input )
                    break;

                if (EatEOL(c))
                    break;

                lineBytes += c;
            }
            // The compiler output can contain either file content or file paths.
            // The file content could be in any encoding, mostly UTF-8,
            // while file paths usually use the legacy MBCS (ANSI) encoding.
            // So we first try to convert from UTF-8; if that fails, fall back to wxConvLocal.
            line = wxString::FromUTF8(lineBytes.c_str());
            if (line.empty())
            {
                line = wxString(lineBytes.c_str()); // use the wxConvLocal
            }
            return line;
        }

--- End code ---

ollydbg:
I'm not sure what the other devs' opinions are.

The old way pipedprocess.cpp works is:

fetch one, two, or more bytes from the input pipe, and try to convert them to a wxChar using the local encoding.

My changed way:

fetch a whole line from the pipe and convert the whole line to a wxString. First try the UTF-8 conversion; if it fails, try the local encoding. On my system a Chinese character needs two bytes in the local encoding (the GB2312 ANSI encoding), but three bytes in UTF-8.
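
A quick way to double-check those byte counts (my own sketch, with the same wxCSConv("GB2312") assumption as above):


--- Code: ---// Hypothetical demo: byte length of one Chinese character in the two encodings.
#include <wx/string.h>
#include <wx/strconv.h>
#include <wx/init.h>
#include <cstring>
#include <cstdio>

int main()
{
    wxInitializer initializer;

    const wxString ch = wxString::FromUTF8("\xE4\xB8\xB2"); // the single character 串
    wxCSConv gbConv(wxT("GB2312"));                         // the local ANSI encoding on my system

    printf("UTF-8 : %u bytes\n", (unsigned)strlen(ch.mb_str(wxConvUTF8))); // 3 bytes
    printf("GB2312: %u bytes\n", (unsigned)strlen(ch.mb_str(gbConv)));     // 2 bytes
    return 0;
}
--- End code ---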

There may be some issues, though. To test whether we are at an EOL, the current byte is compared with the LF character `\n`. But I'm not sure whether the LF byte (hex 0x0A) can appear in a Chinese-encoded byte stream as the second or third byte of a character (whether in UTF-8 or GB2312 encoding). If it can, it would completely break the EOL detection.


EDIT:

It looks like my changed way is safe. I looked at this document:
GB 2312 - Wikipedia


--- Quote ---EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB/T 2312, thus maintaining compatibility with ASCII. Two bytes are used to represent every character not found in ASCII. The value of the first byte is from 0xA1-0xF7 (161-247), while the value of the second byte is from 0xA1-0xFE (161-254). Since all of these ranges are beyond ASCII, like UTF-8, it is possible to check if a byte is part of a multi-byte construct when using EUC-CN, but not if a byte is first or last.

--- End quote ---

So, the encoding will not conflict with the LF character (hex 0x0A).
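
To make that concrete: in UTF-8 every byte of a multi-byte character is at or above 0x80, and in EUC-CN/GB2312 both bytes of a double-byte character are in 0xA1-0xFE, so a raw 0x0A byte in the stream can only ever be a real LF. A small self-check (my own sketch, not C::B code, with hard-coded byte sequences for 串口号 followed by a newline):


--- Code: ---// Hypothetical self-check: 0x0A never appears inside a multi-byte character,
// neither in UTF-8 (all bytes of a multi-byte sequence are >= 0x80) nor in
// EUC-CN/GB2312 (both bytes are 0xA1-0xFE), so scanning raw bytes for '\n' is safe.
#include <cstdio>

int main()
{
    // "串口号" plus LF, encoded as UTF-8 and as GB2312 (EUC-CN);
    // the GB2312 values are taken from the GBK/GB2312 table, but the exact
    // values do not matter for the point being made.
    const unsigned char utf8[]   = { 0xE4,0xB8,0xB2, 0xE5,0x8F,0xA3, 0xE5,0x8F,0xB7, 0x0A };
    const unsigned char gb2312[] = { 0xB4,0xAE, 0xBF,0xDA, 0xBA,0xC5, 0x0A };

    // In both streams the only 0x0A byte is the trailing LF itself.
    for (unsigned i = 0; i < sizeof(utf8); i++)
        if (utf8[i] == 0x0A)
            printf("UTF-8 : LF at byte index %u of %u\n", i, (unsigned)sizeof(utf8));

    for (unsigned i = 0; i < sizeof(gb2312); i++)
        if (gb2312[i] == 0x0A)
            printf("GB2312: LF at byte index %u of %u\n", i, (unsigned)sizeof(gb2312));

    return 0;
}
--- End code ---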

ollydbg:
I have committed this change to the SVN repo now, rev 13049.
