Author Topic: The encoding text from GCC compiler should be UTF8 by default?  (Read 18462 times)

Offline ollydbg

  • Developer
  • Lives here!
  • *****
  • Posts: 6111
  • OpenCV and Robotics
    • Chinese OpenCV forum moderator
The encoding text from GCC compiler should be UTF8 by default?
« on: October 29, 2022, 05:20:24 am »
Hi, I have a .cpp source file in UTF-8 format that contains CJK characters in its comments. When I build it, the comments are shown garbled in the Build log messages.


See the screenshot named: 2022-10-29-utf8.png

As a test, I changed the source file's encoding to GB2312. Now the Build log shows the CJK characters correctly; see the screenshot named: 2022-10-29-GB2312.png

So, my question is: should we assume that source files are UTF-8 by default?

If some piece of memory should be reused, turn them to variables (or const variables).
If some piece of operations should be reused, turn them to functions.
If they happened together, then turn them to classes.

Offline ollydbg

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #1 on: October 29, 2022, 05:26:01 am »
I have a patch to fix this issue:

Code
From 90a6a42d30abea13cf4b23cc47868e2acc569aeb Mon Sep 17 00:00:00 2001
From: asmwarrior <a@b.com>
Date: Sat, 29 Oct 2022 11:21:56 +0800
Subject: correctly encoding convert from the GCC's message to wxString


diff --git a/src/plugins/compilergcc/compilergcc.cpp b/src/plugins/compilergcc/compilergcc.cpp
index d47dcf5a..edcec90d 100644
--- a/src/plugins/compilergcc/compilergcc.cpp
+++ b/src/plugins/compilergcc/compilergcc.cpp
@@ -3615,7 +3615,11 @@ void CompilerGCC::OnGCCError(CodeBlocksEvent& event)
 {
     wxString msg = event.GetString();
     if (!msg.IsEmpty())
-        AddOutputLine(msg);
+    {
+        wxString msg1 = wxString::FromUTF8(msg.c_str());
+        AddOutputLine(msg1);
+    }
+
 }
 
 void CompilerGCC::OnGCCTerminated(CodeBlocksEvent& event)


The result looks good; see the screenshot named: 2022-10-29-utf8-fix.png

My question is: can we expect the text returned by GCC (which is in fact in the source code's encoding) to be UTF-8?


Offline Miguel Gimenez

  • Developer
  • Lives here!
  • *****
  • Posts: 1781
Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #2 on: October 29, 2022, 01:00:53 pm »
The encoding of GCC's output will be the same as that of the source file, not necessarily UTF-8. I would check whether msg1 is empty, which indicates invalid UTF-8, and use the original string in that case:

Code
    {
        wxString msg1 = wxString::FromUTF8(msg.c_str());
        AddOutputLine(msg1.empty() ? msg : msg1);
    }
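
The "try UTF-8 first, otherwise keep the original bytes" idea can also be tried outside of wxWidgets. Below is a minimal sketch in standard C++, not the C::B code: `isValidUtf8` and `guessLineEncoding` are hypothetical names introduced only for illustration. Note that this is a heuristic. Pure ASCII is always valid UTF-8, and a few GB2312 byte pairs also happen to form valid UTF-8, so a misdetection is possible in rare cases.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `bytes` is well-formed UTF-8. Overlong forms, UTF-16
// surrogates and code points above U+10FFFF are rejected.
// (Hypothetical helper, not part of wxWidgets or Code::Blocks.)
bool isValidUtf8(const std::string& bytes)
{
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n)
    {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;    // continuation bytes expected after b0
        unsigned long cp = 0;     // code point decoded so far
        unsigned long minCp = 0;  // smallest code point allowed for this length

        if (b0 < 0x80)                { extra = 0; cp = b0;        minCp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; minCp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; minCp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; minCp = 0x10000; }
        else return false;            // stray continuation or invalid lead byte

        if (extra > n - i - 1)
            return false;             // sequence is cut off at the end

        for (std::size_t k = 1; k <= extra; ++k)
        {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80)
                return false;         // not a continuation byte (10xxxxxx)
            cp = (cp << 6) | (b & 0x3F);
        }

        if (cp < minCp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of Unicode range

        i += extra + 1;
    }
    return true;
}

enum class LineEncoding { Utf8, Local };

// Mirrors the fallback suggested above: treat the line as UTF-8 when it
// validates, otherwise leave it to the local (ANSI) converter.
LineEncoding guessLineEncoding(const std::string& bytes)
{
    return isValidUtf8(bytes) ? LineEncoding::Utf8 : LineEncoding::Local;
}
```

For example, the UTF-8 bytes E4 B8 AD E6 96 87 validate, while the GB2312 bytes D6 D0 CE C4 do not, because D0 is not a continuation byte.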

Offline ollydbg

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #3 on: October 29, 2022, 01:27:40 pm »
Quote
The encoding of GCC's output will be the same as that of the source file, not necessarily UTF-8. I would check whether msg1 is empty, which indicates invalid UTF-8, and use the original string in that case:

Code
    {
        wxString msg1 = wxString::FromUTF8(msg.c_str());
        AddOutputLine(msg1.empty() ? msg : msg1);
    }

Hi, Miguel Gimenez, thanks for the reply.

Indeed, the output line has the same encoding as the input source code.
Your suggested method is more robust, and better.

So, shall we commit this fix to our code repository? At least it will fix the garbage characters in the build log in my case.

My original idea was to detect the encoding of the input source file, but that is far more complex than what you suggested.

Offline Miguel Gimenez

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #4 on: October 29, 2022, 02:04:58 pm »
IIRC the message also contains the file path. I hope this change will not alter the path if it contains non-ASCII characters.

Offline ollydbg

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #5 on: October 30, 2022, 02:13:20 am »
Quote
IIRC the message also contains the file path. I hope this change will not alter the path if it contains non-ASCII characters.

Hi, thanks. That's another issue.

I just checked a file path that contains CJK chars:

Code
D:\code\test-crash-中文\main.cpp

Now, without the empty() check you suggested, I see that some build log lines are missing (empty).

With the empty() check, it works OK.

See the screenshot below:


EDIT:

I also checked another file path that contains a character such as the Latin small letter e with grave
(sorry, our forum does not allow posting such non-ASCII characters, so I added another screenshot.)

It also works OK.


« Last Edit: October 30, 2022, 02:24:43 am by ollydbg »

Offline ollydbg

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #6 on: October 30, 2022, 11:48:36 am »
Things are more complex than I thought.

First, it looks like when the compiler plugin in C::B sends the compile command to GCC, the file path (Unicode inside C::B) is passed in the local encoding, which on my Win7 is GB2312. In the text GCC returns, the file path is still in GB2312 encoding. But when GCC prints a diagnostic, for example to report an error position, it uses byte positions.

Meanwhile, when C::B handles the stdout and stderr pipes, it uses a converter:

Code
// The following class is created to override wxTextStream::ReadLine()
class cbTextInputStream : public wxTextInputStream
{
    protected:
        bool m_allowMBconversion;
    public:
#if wxUSE_UNICODE
        cbTextInputStream(wxInputStream& s, const wxString &sep=wxT(" \t"), wxMBConv& conv = wxConvLocal )
            : wxTextInputStream(s, sep, conv),
            m_allowMBconversion(true)
        {
            memset((void*)m_lastBytes, 0, 10);
        }

Here, wxConvLocal is the local converter, which means the byte stream is expected to be in GB2312 encoding. Suppose GCC prints code like this:

Code
    5 |     int abc; ///< 串口号
      |         ^~~

If the source file is in UTF-8, the converter wrongly interprets its bytes as GB2312.

This is the code that fetches each byte from the input pipe stream and converts it with the wxConvLocal converter.

Code
        // The following function was copied verbatim from wxTextStream::NextChar()
        // The only change, is the removal of the MB2WC function
        // With PipedProcess we work with compilers/debuggers which (usually) don't
        // send us unicode (at least GDB).
        wxChar NextChar()
        {
        #if wxUSE_UNICODE
            wxChar wbuf[2];
            memset((void*)m_lastBytes, 0, 10);
            for (size_t inlen = 0; inlen < 9; inlen++)
            {
                // actually read the next character byte
                m_lastBytes[inlen] = m_input.GetC();

                if (m_input.LastRead() <= 0)
                    return wxEOT;
                // inlen is the byte index we get copied from the input byte stream
                if (m_allowMBconversion)
                {
                    int retlen = (int) m_conv->MB2WC(wbuf, m_lastBytes, 2); // returns -1 for failure
                    if (retlen >= 0) // res == 0 could happen for '\0' char
                        return wbuf[0];
                }
                else
                    return m_lastBytes[inlen]; // C::B fix (?)
            }
            // there should be no encoding which requires more than nine bytes for one character...
            return wxEOT;
        #else
            m_lastBytes[0] = m_input.GetC();

            if (m_input.LastRead() <= 0)
                return wxEOT;

            return m_lastBytes[0];
        #endif
        }

The catch here is: if the call m_conv->MB2WC(wbuf, m_lastBytes, 2) happens to succeed on part of the file content, we get a wrong diagnostic wxString.
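
To see why such a partial "success" is possible, here is a small sketch in standard C++. It only assumes the EUC-CN byte ranges for GB2312 (first byte 0xA1-0xF7, second byte 0xA1-0xFE); a real converter such as wxConvLocal may accept a different set of byte pairs. The function name `eucCnPrefixLength` is hypothetical and used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts how many leading bytes of `bytes` can be consumed as ASCII or as
// EUC-CN double-byte pairs (first byte 0xA1-0xF7, second byte 0xA1-0xFE).
// (Hypothetical helper; it only checks byte ranges and decodes nothing.)
std::size_t eucCnPrefixLength(const std::string& bytes)
{
    std::size_t i = 0;
    while (i < bytes.size())
    {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80)
        {
            ++i;                    // plain ASCII byte
            continue;
        }
        if (b0 < 0xA1 || b0 > 0xF7)
            break;                  // not a valid first byte
        if (i + 1 >= bytes.size())
            break;                  // pair is cut off
        const unsigned char b1 = static_cast<unsigned char>(bytes[i + 1]);
        if (b1 < 0xA1 || b1 > 0xFE)
            break;                  // not a valid second byte
        i += 2;                     // consumed one double-byte character
    }
    return i;
}
```

With the six UTF-8 bytes E4 B8 AD E6 96 87 (two CJK characters), the first four bytes pass as two EUC-CN pairs (E4 B8 and AD E6) before the check stops at 0x96. So a byte-by-byte local conversion can produce wrong characters instead of failing cleanly.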


Only if the MB2WC() call fails is the raw byte returned, and then we get a later chance to convert it with this code:

Code
    {
        wxString msg1 = wxString::FromUTF8(msg.c_str());
        AddOutputLine(msg1.empty() ? msg : msg1);
    }


Offline ollydbg

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #7 on: October 30, 2022, 04:02:18 pm »
I have new code to solve this issue in the file sdk\pipedprocess.cpp.

I think we no longer need to call the function wxChar NextChar().

Code
        wxString ReadLine()
        {
            wxString line;

            std::string lineBytes;

            while ( m_input.CanRead() && !m_input.Eof() )
            {
                char c = m_input.GetC();
                if (m_input.LastRead() <= 0)
                    break;

                if ( !m_input )
                    break;

                if (EatEOL(c))
                    break;

                lineBytes += c;
            }
            // The compiler output can contain both file content and file paths.
            // The file content can be in any encoding, most often UTF-8.
            // The file path usually uses the legacy MBCS (ANSI) encoding.
            // So we first try to convert from UTF-8 and, if that fails, try wxConvLocal.
            line = wxString::FromUTF8(lineBytes.c_str());
            if (line.empty())
            {
                line = wxString(lineBytes.c_str()); // use the wxConvLocal
            }
            return line;
        }

Offline ollydbg

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #8 on: November 05, 2022, 02:49:43 am »
I'm not sure what the other devs' opinion is.

The old way pipedprocess.cpp works is:

fetch one, two, or more bytes from the input pipe and try to convert them to a wxChar using the local encoding.

My changed way:

fetch a whole line from the pipe and convert the whole line to a wxString. First try the UTF-8 conversion; if it fails, try the local encoding. On my system, a Chinese character needs two bytes in the local encoding (the GB2312 ANSI encoding), but three bytes in UTF-8.

There may be some issues. To test for an EOL, the current byte is compared with the LF character `\n`. But I'm not sure whether the LF byte (hex 0x0a) can occur as the second or third byte of a Chinese character (in either UTF-8 or GB2312 encoding). If it can, it will totally break the EOL-finding mechanism.


EDIT:

It looks like my changed way is safe. I looked at this document:
GB_2312-Wikipedia

Quote
EUC-CN is often used as the character encoding (i.e. for external storage) in programs that deal with GB/T 2312, thus maintaining compatibility with ASCII. Two bytes are used to represent every character not found in ASCII. The value of the first byte is from 0xA1-0xF7 (161-247), while the value of the second byte is from 0xA1-0xFE (161-254). Since all of these ranges are beyond ASCII, like UTF-8, it is possible to check if a byte is part of a multi-byte construct when using EUC-CN, but not if a byte is first or last.

So, the encoding will not conflict with the LF character (hex 0x0a).
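
This can be checked with a small experiment in standard C++. The sketch below is not the C::B code, and `splitRawLines` is a hypothetical name. It splits raw bytes at LF before any decoding, and it relies on the fact that every byte of a multi-byte character is 0x80 or above in UTF-8, and 0xA1 or above in EUC-CN.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits raw, still undecoded bytes into lines at LF (0x0A) and drops a CR
// that directly precedes the LF. Each returned line can then be decoded as
// UTF-8 or with the local encoding.
// (Hypothetical helper, written for illustration only.)
std::vector<std::string> splitRawLines(const std::string& bytes)
{
    std::vector<std::string> lines;
    std::string current;
    for (char c : bytes)
    {
        if (c == '\n')
        {
            if (!current.empty() && current.back() == '\r')
                current.pop_back();     // handle Windows CRLF line ends
            lines.push_back(current);
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    if (!current.empty())
        lines.push_back(current);       // last line without a trailing LF
    return lines;
}
```

Feeding the same two-line text in both encodings, for example E4 B8 AD CR LF E6 96 87 LF in UTF-8 and D6 D0 CR LF CE C4 LF in GB2312, gives two lines in each case, with the multi-byte characters left intact.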
« Last Edit: November 05, 2022, 03:27:26 am by ollydbg »

Offline ollydbg

Re: The encoding text from GCC compiler should be UTF8 by default?
« Reply #9 on: November 19, 2022, 03:48:15 am »
I have now committed this change to the svn repo, rev 13049.