C++ string encoding conversion

Time:2021-10-15

There are many kinds of strings in C++; for details, refer to "String types in C++". This article uses std::string as the example for discussing string encoding. std::string is chosen mainly because:

  • A byte is the smallest unit of a string's binary encoding; a string is essentially an array of bytes
  • C++ had no dedicated byte type before C++17 (std::byte), and third-party byte types are usually implemented with char
  • char can be converted directly to std::string, that is, bytes can be turned directly into a string
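To make the "array of bytes" view concrete, here is a minimal sketch. The helper name bytes_to_hex and the sample character are illustrative assumptions, not part of the original code. It dumps the raw bytes held by a std::string, which shows that size() counts bytes rather than characters.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Returns the bytes of a std::string as space-separated uppercase hex.
// (bytes_to_hex is a hypothetical helper used only for this illustration.)
std::string bytes_to_hex(const std::string& s)
{
    std::string out;
    char tmp[4];
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Go through unsigned char so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(s[i]);
        std::snprintf(tmp, sizeof(tmp), "%02X", value);
        if (i != 0) out += ' ';
        out += tmp;
    }
    return out;
}
```

For example, the UTF-8 encoding of the single character 中 (U+4E2D), written as explicit bytes "\xE4\xB8\xAD" so that the source file encoding does not matter, has size() == 3, and bytes_to_hex returns "E4 B8 AD".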

The code is adapted from "Utf8 and std::string character encoding conversion". Conversion between other encodings works in a similar way: first convert to double-byte Unicode (wide characters), then convert to the target multibyte encoding. The code is as follows:

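#include <windows.h>
#include <string>
#include <vector>

// Converts a UTF-8 encoded string to the system ANSI code page (CP_ACP).
std::string UTF8_To_string(const std::string& str)
{
    // First call: query the required size in wide characters.
    // Because -1 is passed as the input length, the size includes the terminating null.
    int nwLen = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
    if (nwLen <= 0) return std::string();

    // Second call: convert UTF-8 to wide characters (UTF-16).
    std::vector<wchar_t> wBuf(nwLen);
    MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, wBuf.data(), nwLen);

    // Same two-call pattern for wide characters to ANSI.
    int nLen = WideCharToMultiByte(CP_ACP, 0, wBuf.data(), -1, NULL, 0, NULL, NULL);
    if (nLen <= 0) return std::string();

    std::vector<char> buf(nLen);
    WideCharToMultiByte(CP_ACP, 0, wBuf.data(), -1, buf.data(), nLen, NULL, NULL);

    // buf is null-terminated; the vectors release their memory automatically.
    return std::string(buf.data());
}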


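// Converts a string in the system ANSI code page (CP_ACP) to UTF-8.
// Uses std::vector from <vector> and the Windows API from <windows.h>.
std::string string_To_UTF8(const std::string& str)
{
    // First call: query the required size in wide characters.
    // Because -1 is passed as the input length, the size includes the terminating null.
    int nwLen = ::MultiByteToWideChar(CP_ACP, 0, str.c_str(), -1, NULL, 0);
    if (nwLen <= 0) return std::string();

    // Second call: convert ANSI to wide characters (UTF-16).
    std::vector<wchar_t> wBuf(nwLen);
    ::MultiByteToWideChar(CP_ACP, 0, str.c_str(), -1, wBuf.data(), nwLen);

    // Same two-call pattern for wide characters to UTF-8.
    int nLen = ::WideCharToMultiByte(CP_UTF8, 0, wBuf.data(), -1, NULL, 0, NULL, NULL);
    if (nLen <= 0) return std::string();

    std::vector<char> buf(nLen);
    ::WideCharToMultiByte(CP_UTF8, 0, wBuf.data(), -1, buf.data(), nLen, NULL, NULL);

    // buf is null-terminated; the vectors release their memory automatically.
    return std::string(buf.data());
}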

Note: "ANSI" here means the system's active code page (CP_ACP). On a Simplified Chinese Windows system this is code page 936 (GBK, a superset of GB2312).
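The two-step idea (source encoding to Unicode, then Unicode to the target encoding) does not depend on the Windows API. As a rough, portable sketch of it, the code below decodes UTF-8 into Unicode code points and then re-encodes them as ISO-8859-1 (Latin-1). Latin-1 is used only because it maps directly to code points 0 to 255; it is a stand-in for an ANSI code page, not what CP_ACP resolves to on a real system. The function names, the '?' replacement character, and the simplified error handling are all assumptions of this sketch.

```cpp
#include <cstddef>
#include <string>

// Step 1: decode UTF-8 bytes into Unicode code points (the pivot representation).
// Simplified: invalid or truncated sequences yield U+FFFD; overlong forms are not rejected.
std::u32string utf8_to_code_points(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;              // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else { out.push_back(0xFFFD); ++i; continue; }  // invalid lead byte

        ++i;
        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= in.size() ||
                (static_cast<unsigned char>(in[i]) & 0xC0) != 0x80) {
                ok = false;             // truncated or invalid continuation byte
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
            ++i;
        }
        out.push_back(ok ? cp : char32_t(0xFFFD));
    }
    return out;
}

// Step 2: encode the code points in the target encoding (here Latin-1).
// Code points above 0xFF cannot be represented and become '?'.
std::string code_points_to_latin1(const std::u32string& cps)
{
    std::string out;
    for (char32_t cp : cps)
        out.push_back(cp <= 0xFF ? static_cast<char>(static_cast<unsigned char>(cp)) : '?');
    return out;
}

// Both steps chained: UTF-8 -> Unicode -> Latin-1.
std::string utf8_to_latin1(const std::string& utf8)
{
    return code_points_to_latin1(utf8_to_code_points(utf8));
}
```

With this sketch, the UTF-8 bytes C3 A9 (é, U+00E9) become the single Latin-1 byte E9, while 中 (U+4E2D) has no Latin-1 representation and is replaced by '?'. Supporting another target encoding only requires replacing the second step.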

For the usage of MultiByteToWideChar and WideCharToMultiByte, refer to "MultiByteToWideChar and WideCharToMultiByte usage details". The first parameter of both functions specifies the code page of the multibyte (char) string, which is the source string for MultiByteToWideChar and the target string for WideCharToMultiByte. The possible values are as follows:

Value           Description
CP_ACP          ANSI code page
CP_MACCP        Not supported
CP_OEMCP        OEM code page
CP_SYMBOL       Not supported
CP_THREAD_ACP   Not supported
CP_UTF7         UTF-7 code page
CP_UTF8         UTF-8 code page

Both functions are called twice. In the first call, the output buffer size parameter is 0, and the function returns the required size of the target buffer (when the input length is -1, this size includes the terminating null character). In the second call, the allocated buffer and its size are passed in, and the converted string is written directly to the buffer.

Note: Linux has two similar functions, mbstowcs() and wcstombs(), which convert according to the current locale. Refer to https://blog.csdn.net/yiyaaixuexi/article/details/6174971
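As a minimal sketch of the Linux counterpart, the helpers below wrap mbstowcs() and wcstombs() in the same query-then-convert pattern. They rely on the POSIX behaviour that a null destination makes the function return the required length without writing anything. The helper names multibyte_to_wide and wide_to_multibyte are hypothetical, and the input is assumed to contain no embedded null characters.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Converts a multibyte string (encoded per the current C locale) to a wide string.
// Returns an empty string if the input contains an invalid multibyte sequence.
std::wstring multibyte_to_wide(const std::string& src)
{
    // First call: null destination, so only the required length is returned
    // (in wide characters, not counting the terminating null).
    std::size_t len = std::mbstowcs(nullptr, src.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        return std::wstring();

    // Second call: convert into a buffer with room for the terminator.
    std::wstring dst(len + 1, L'\0');
    std::mbstowcs(&dst[0], src.c_str(), len + 1);
    dst.resize(len);    // drop the terminator slot
    return dst;
}

// Converts a wide string to a multibyte string in the current C locale.
// Returns an empty string if a character cannot be represented.
std::string wide_to_multibyte(const std::wstring& src)
{
    // First call: query the required length in bytes (terminator not counted).
    std::size_t len = std::wcstombs(nullptr, src.c_str(), 0);
    if (len == static_cast<std::size_t>(-1))
        return std::string();

    // Second call: convert into a buffer with room for the terminator.
    std::string dst(len + 1, '\0');
    std::wcstombs(&dst[0], src.c_str(), len + 1);
    dst.resize(len);    // drop the terminator slot
    return dst;
}
```

Because the conversion follows the current locale, a program that handles UTF-8 text would normally call std::setlocale(LC_ALL, "") (from <clocale>) once at startup, assuming the environment selects a UTF-8 locale. In the default "C" locale only plain ASCII input can be relied on.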