How to convert UTF-16 to UTF-8 in JavaScript: utfx.js source code analysis

Time:2019-10-18

Overview

When the front end exchanges binary data with a server, string encoding becomes a problem. Most servers encode strings as UTF-8, while JavaScript strings are UTF-16, so you need a way to convert strings between the two encodings.

By analyzing the source code of utfx.js, this article explains how conversion between UTF-8 and UTF-16 works in JavaScript, and deepens your understanding of how these two Unicode encodings are defined.

The main contents of this article are as follows:

  • Brief introduction to the utfx.js API
  • UTF-16 encoding to UTF-8 encoding
  • UTF-8 encoded string length calculation
  • Experimental feature: window.TextEncoder

If you are not familiar with how UTF-8 and UTF-16 work in Unicode, you can read my previous blog post, "UTF-8 and UTF-16 in Unicode".

If you want to understand the scenarios in which this library is used, you can read my earlier post in the WebSocket series on how JavaScript strings are converted to binary data.

Introduction to utfx.js API

Before going into the code, let's get to know the library itself, utfx.js. Knowing how the library is used makes its source code much easier to follow.

utfx.js is a small library with only a handful of API functions. The nine listed here are:

  • encodeUTF8: encodes Unicode code points as UTF-8 bytes.
  • decodeUTF8: decodes UTF-8 bytes into Unicode code points.
  • UTF16toUTF8: converts UTF-16 characters (code units) into Unicode code points.
  • UTF8toUTF16: converts Unicode code points into UTF-16 characters.
  • encodeUTF16toUTF8: converts UTF-16 characters into UTF-8 encoded bytes.
  • decodeUTF8toUTF16: converts UTF-8 encoded bytes into UTF-16 characters.
  • calculateCodePoint: calculates the number of bytes a single code point takes in UTF-8.
  • calculateUTF8: calculates the number of bytes needed to store a sequence of code points as UTF-8.
  • calculateUTF16asUTF8: calculates the storage length required for UTF-16 characters after conversion to UTF-8.

Next, we will pick several representative APIs and walk through their implementation to help you quickly understand the two encodings.
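
The functions analyzed below share one calling convention: input comes from a source function that returns the next value on each call and null at the end, and output goes to a destination function that is called once per result. As a minimal sketch of that convention, the helpers makeStringSource and makeArrayDestination below are hypothetical names made up for this illustration; they are not taken from the utfx.js code shown in this article.

```javascript
// Hypothetical helpers for illustration only.

// Returns a source function: each call yields the next UTF-16 code unit
// of `str`, and null once the string is exhausted.
function makeStringSource(str) {
  let i = 0;
  return function () {
    return i < str.length ? str.charCodeAt(i++) : null;
  };
}

// Returns a destination function that collects every value it receives.
// The collected values are exposed on the function's `values` property.
function makeArrayDestination() {
  const values = [];
  const dst = function (value) {
    values.push(value);
  };
  dst.values = values;
  return dst;
}

// Drain a source into a destination, one value per call.
const src = makeStringSource('hi');
const dst = makeArrayDestination();
let unit;
while ((unit = src()) !== null) {
  dst(unit);
}
console.log(dst.values); // [104, 105]
```

Because both ends are plain functions, the same conversion code can read from a string, an array or a stream without knowing which one it is.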

UTF-16 encoding to UTF-8 encoding

Let's take a look at how to convert UTF-16 encoded data to UTF-8 encoded data.

The cleanest way to do this is in two steps: first decode the UTF-16 data into Unicode code points, which are independent of any encoding, and then encode those code points as UTF-8. We will analyze the UTF16toUTF8 and encodeUTF8 methods, which implement these two steps.

UTF16toUTF8

The name of this function suggests that it converts UTF-16 encoded data directly into UTF-8 encoded bytes. In fact, it only converts UTF-16 code units into the corresponding Unicode code points.

/**
 * Converts UTF-16 code units to Unicode code points.
 * @param src Source function: each call returns the next UTF-16 code unit, or null at the end of the string
 * @param dst Destination function: called once with each resulting code point
 */
utfx.UTF16toUTF8 = function (src, dst) {
    var c1, c2 = null;
    while (true) {
        //Use the code unit held over from the previous iteration if there is one, otherwise read the next one; null means the input is exhausted
        if ((c1 = c2 !== null ? c2 : src()) === null)
            break;
        
        //Code units in U+D800 ~ U+DFFF are surrogates: they map to no character themselves and only signal a possible surrogate pair.
        if (c1 >= 0xD800 && c1 <= 0xDFFF) {
            if ((c2 = src()) !== null) {
                //A following low surrogate (U+DC00 ~ U+DFFF) completes a pair, which encodes a code point above U+FFFF.
                if (c2 >= 0xDC00 && c2 <= 0xDFFF) {
                    //Step 1: recover the upper 10 bits from c1; step 2: recover the lower 10 bits from c2; step 3: add back the 0x10000 subtracted during encoding
                    dst((c1 - 0xD800) * 0x400 + c2 - 0xDC00 + 0x10000);
                    c2 = null; continue;
                }
            }
        }
        dst(c1);
    }
    if (c2 !== null) dst(c2);
};
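
To make the surrogate arithmetic concrete, here is a small worked example for U+1F600, which UTF-16 stores as the pair 0xD83D 0xDE00. The function name combineSurrogates is made up for this sketch; it only repeats the expression passed to dst in the code above.

```javascript
// Combine a high and a low surrogate into one code point,
// using the same arithmetic as the library code above.
function combineSurrogates(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const emoji = '\u{1F600}';         // one character, two UTF-16 code units
const high = emoji.charCodeAt(0);  // 0xD83D
const low = emoji.charCodeAt(1);   // 0xDE00

// (0xD83D - 0xD800) * 0x400 = 0xF400   (upper 10 bits)
// (0xDE00 - 0xDC00)         = 0x0200   (lower 10 bits)
// 0xF400 + 0x0200 + 0x10000 = 0x1F600
const codePoint = combineSurrogates(high, low);
console.log(codePoint.toString(16));              // "1f600"
console.log(codePoint === emoji.codePointAt(0));  // true
```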

With the comments above, the code should be easy to follow, so we won't go through it line by line. Let's move on to the method that encodes Unicode code points as UTF-8.

encodeUTF8

This method encodes Unicode code points as UTF-8 and passes the resulting bytes to the destination function.

/**
 * Encodes Unicode code points as UTF-8 bytes.
 * @param src Source function: each call returns the next code point, or null when none are left; a single code point may also be passed as a number
 * @param dst Destination function: called once with each resulting byte
 */
utfx.encodeUTF8 = function (src, dst) {
    var cp = null;
    if (typeof src === 'number')
        cp = src,
            src = function () {return null;};
    while (cp !== null || (cp = src()) !== null) {
        if (cp < 0x80)
        //1 byte storage
            dst(cp & 0x7F);
        else if (cp < 0x800)
        //2-byte storage
            dst(((cp >> 6) & 0x1F) | 0xC0),
            dst((cp & 0x3F) | 0x80);
        else if (cp < 0x10000)
        //3-byte storage
            dst(((cp >> 12) & 0x0F) | 0xE0),
            dst(((cp >> 6) & 0x3F) | 0x80),
            dst((cp & 0x3F) | 0x80);
        else
        //4-byte storage
            dst(((cp >> 18) & 0x07) | 0xF0),
            dst(((cp >> 12) & 0x3F) | 0x80),
            dst(((cp >> 6) & 0x3F) | 0x80),
            dst((cp & 0x3F) | 0x80);
        cp = null;
    }
};

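The code above follows the UTF-8 encoding rules almost literally: the range a code point falls in decides how many bytes it takes, and its bits are then distributed over a lead byte and the continuation bytes. If you are not familiar with these rules, you can read the previous blog post mentioned in the overview of this article.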
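
As a worked example of these rules, the sketch below encodes one code point per byte-length class. encodeCodePoint is a hypothetical helper written for this illustration; it uses the same shifts and masks as the code above but returns an array instead of calling a destination function.

```javascript
// Encode a single Unicode code point as an array of UTF-8 bytes.
function encodeCodePoint(cp) {
  if (cp < 0x80) {
    return [cp & 0x7F];
  }
  if (cp < 0x800) {
    return [((cp >> 6) & 0x1F) | 0xC0, (cp & 0x3F) | 0x80];
  }
  if (cp < 0x10000) {
    return [
      ((cp >> 12) & 0x0F) | 0xE0,
      ((cp >> 6) & 0x3F) | 0x80,
      (cp & 0x3F) | 0x80
    ];
  }
  return [
    ((cp >> 18) & 0x07) | 0xF0,
    ((cp >> 12) & 0x3F) | 0x80,
    ((cp >> 6) & 0x3F) | 0x80,
    (cp & 0x3F) | 0x80
  ];
}

// Format a byte array as space-separated hex for display.
const hex = (bytes) => bytes.map((b) => b.toString(16)).join(' ');

console.log(hex(encodeCodePoint(0x41)));    // "41"          (A, 1 byte)
console.log(hex(encodeCodePoint(0xE9)));    // "c3 a9"       (U+00E9, 2 bytes)
console.log(hex(encodeCodePoint(0x20AC)));  // "e2 82 ac"    (U+20AC, 3 bytes)
console.log(hex(encodeCodePoint(0x1F600))); // "f0 9f 98 80" (U+1F600, 4 bytes)
```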

UTF-8 encoded string length calculation

Before converting a sequence of Unicode code points, we need to know how large an ArrayBuffer to allocate for the converted data. For this purpose, the library can calculate the UTF-8 storage length from either Unicode code points or UTF-16 encoded data.

Let's look at the two methods that do this: calculateUTF8 and calculateUTF16asUTF8.

calculateUTF8

This method takes Unicode code points and calculates the number of bytes they will occupy once encoded as UTF-8.

/**
 * Calculates the number of bytes required to store Unicode code points as UTF-8.
 * @param src Source function: each call returns the next code point, or null when none are left
 */
utfx.calculateUTF8 = function (src) {
    var cp, l = 0;
    while ((cp = src()) !== null)
        //1 byte covers 0 ~ 0x7F, 2 bytes cover 0x80 ~ 0x7FF, 3 bytes cover 0x800 ~ 0xFFFF, and 4 bytes cover 0x10000 ~ 0x10FFFF.
        l += (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;
    return l;
};

Given the code above and the UTF-8 encoding rules, the length calculation is easy to follow.
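
The sketch below shows how such a length could be used to size a buffer before encoding. It is an illustration under assumptions, not the library's code: utf8Length is a hypothetical helper that iterates a JavaScript string by code point with for...of instead of reading from a source function.

```javascript
// Count the UTF-8 bytes needed for a JavaScript string.
function utf8Length(str) {
  let length = 0;
  for (const ch of str) {          // for...of walks the string by code point
    const cp = ch.codePointAt(0);
    length += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return length;
}

const text = 'a\u20AC\u{1F600}';   // 'a', U+20AC, U+1F600
const byteLength = utf8Length(text);

// Allocate exactly as much memory as the encoded data will need.
const buffer = new ArrayBuffer(byteLength);

console.log(text.length);        // 4 (UTF-16 code units)
console.log(byteLength);         // 8 (1 + 3 + 4 UTF-8 bytes)
console.log(buffer.byteLength);  // 8
```

Note that the string's own length property counts UTF-16 code units, so it cannot be used to size a UTF-8 buffer.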

calculateUTF16asUTF8

This method takes UTF-16 data and calculates two lengths: the number of Unicode code points it contains, and the number of bytes needed once it is encoded as UTF-8.

/**
 * For UTF-16 input, calculates the number of Unicode code points and the number of bytes required after conversion to UTF-8.
 * @param src Source function: each call returns the next UTF-16 code unit, or null at the end of the string
 */
utfx.calculateUTF16asUTF8 = function (src) {
    var n = 0, l = 0;
    utfx.UTF16toUTF8(src, function (cp) {
        ++n; l += (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;
    });
    return [n, l];
};

This method obtains the Unicode code points by calling UTF16toUTF8, the UTF-16 to Unicode conversion introduced earlier, counts them as they arrive, and returns both the number of code points and the UTF-8 byte length as a two-element array.
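
To see what the two returned numbers mean, here is a self-contained sketch that computes the same pair directly from a JavaScript string. countUtf16AsUtf8 is a hypothetical name, and unlike the library code shown earlier it only treats U+D800 ~ U+DBFF as the first half of a pair; that stricter check is a choice made for this sketch.

```javascript
// Returns [number of code points, number of UTF-8 bytes] for a string.
function countUtf16AsUtf8(str) {
  let codePoints = 0;
  let bytes = 0;
  for (let i = 0; i < str.length; i++) {
    let cp = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        cp = (cp - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000;
        i++; // the low surrogate has been consumed
      }
    }
    codePoints++;
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return [codePoints, bytes];
}

// 'a', U+20AC and U+1F600: 4 code units, 3 code points, 8 bytes
const sample = 'a\u20AC\u{1F600}';
console.log(sample.length);            // 4
console.log(countUtf16AsUtf8(sample)); // [3, 8]
```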

window.TextEncoder and window.TextDecoder

These are two experimental new constructors. Creating an encoder (a TextEncoder object) and a decoder (a TextDecoder object) lets you convert between JavaScript strings and UTF-8 encoded data.

Both use UTF-8 by default and are used as follows:

let encoder = new TextEncoder();
let decoder = new TextDecoder();

let uint8array = encoder.encode('a'); // returns a Uint8Array: [97]
let str = decoder.decode(uint8array); // returns the string 'a'

At the time of writing, browser compatibility was still a concern. Support starts with Chrome 38, Firefox 19 and Opera 25, while IE does not support it at all and other browsers such as Safari only added it later, so check the current compatibility tables before relying on it in production.
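
Given the compatibility concerns, a cautious pattern is to check that both constructors exist before using them. The sketch below does that and then round-trips a string containing 1-byte, 3-byte and 4-byte characters. Throwing an error when they are missing is just one possible reaction, chosen for brevity; falling back to a library such as utfx.js would be another.

```javascript
// Feature-detect before use, since older environments lack these globals.
if (typeof TextEncoder === 'undefined' || typeof TextDecoder === 'undefined') {
  throw new Error('TextEncoder/TextDecoder are not available in this environment');
}

const encoder = new TextEncoder(); // always encodes to UTF-8
const decoder = new TextDecoder(); // decodes UTF-8 by default

const original = 'a\u20AC\u{1F600}';
const bytes = encoder.encode(original); // Uint8Array of UTF-8 bytes
console.log(Array.from(bytes));
// [97, 226, 130, 172, 240, 159, 152, 128]

const roundTrip = decoder.decode(bytes);
console.log(roundTrip === original); // true
```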

Summary

This article analyzed part of utfx.js, a library that implements conversion between UTF-8 and UTF-16. Having looked at the concrete implementation, you should now have a better understanding of how these two encodings are specified, how they are used, and in which scenarios conversion between them is needed.