Recently, while reading a book, I got tangled up in Unicode character encoding. I looked up some material, took notes, and wrote this article in the hope that it helps others. If anything here is wrong or inaccurate, please point it out.
Unicode is a character set that arranges and encodes most of the world's writing systems so that computers can present and process text in a uniform, simpler way. It overcomes the limitations of traditional character encoding schemes.
Historically, two independent organizations tried to create a single universal character set: the International Organization for Standardization (ISO) and the non-profit Unicode Consortium. The former developed the ISO/IEC 10646 project, while the latter developed the Unicode project, so two different standards emerged at first. The two groups soon discovered each other's existence and, since they were working toward the same goal, eventually merged their results. Unicode's encoding corresponds to the concept of the Universal Character Set (UCS) in ISO 10646.
Unicode originally used a 16-bit code space: each character occupied 2 bytes, allowing at most 2^16 = 65,536 characters, which basically covered the needs of the major languages. In fact the 16-bit space was never fully used, and a large region was reserved for the future. This 16-bit code space is plane 0 of Unicode, also known as the Basic Multilingual Plane (BMP). Current versions of Unicode additionally define 16 supplementary planes, which requires a 21-bit code space (16 + 5 bits): 17 planes in total, each holding 2^16 code points. This is shown in the following table (from Wikipedia).
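A minimal Python sketch of the plane layout described above: since every plane holds exactly 2^16 code points, the plane index of a character is simply its code point shifted right by 16 bits (`plane_of` is a name I made up for illustration).

```python
# Sketch: computing which Unicode plane a code point belongs to.
# Each plane holds 2**16 code points, so the plane index is just
# the bits above the low 16 (code_point >> 16).
def plane_of(ch: str) -> int:
    return ord(ch) >> 16

print(plane_of("A"))   # "A" (U+0041) lives in plane 0, the BMP
print(plane_of("听"))  # U+542C is also in the BMP
print(plane_of("😀"))  # U+1F600 lies in plane 1, a supplementary plane
```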
ASCII is one of the most widely used character encodings in the world. It uses 8-bit units with the highest bit always 0, so it can define 128 characters, including the 10 decimal digits and the 52 uppercase and lowercase English letters (A–Z, a–z), among others.
UTF (Unicode Transformation Format) encodings such as UTF-7, UTF-8, UTF-16, and UTF-32 (and mappings like GB18030) are just ways of realizing Unicode, i.e. of converting the numbers defined by Unicode into data a program can store and transmit.
UTF-8 encoding

UTF-8 is a superset of ASCII; in other words, UTF-8 is backward compatible with ASCII. The first 128 characters, U+0000 to U+007F, have exactly the same encoding in UTF-8 as in ASCII. UTF-8 is in extremely wide use: almost every Internet protocol supports it, and it is one of the preferred encodings today.
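The ASCII compatibility claim is easy to check in Python: for pure-ASCII text, the UTF-8 and ASCII codecs produce byte-identical output, and any valid ASCII byte stream decodes unchanged as UTF-8.

```python
# Sketch: for the ASCII range (U+0000–U+007F), UTF-8 bytes are
# identical to ASCII bytes, so pure-ASCII data round-trips unchanged.
text = "Hello, Unicode"
assert text.encode("utf-8") == text.encode("ascii")
# A byte stream that is valid ASCII is therefore also valid UTF-8:
assert b"Hello, Unicode".decode("utf-8") == text
print("UTF-8 is ASCII-compatible")
```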
The conversion between Unicode code points and UTF-8 bytes is shown in the following table:

| Code point range | UTF-8 byte pattern |
| --- | --- |
| U+0000 – U+007F | 0xxxxxxx |
| U+0080 – U+07FF | 110xxxxx 10xxxxxx |
| U+0800 – U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 – U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
Within the Basic Multilingual Plane, the range U+D800 to U+DFFF is reserved for the UTF-16 surrogates that address the supplementary planes; this is described in detail in the UTF-16 section below.
For example, the Unicode code point of the Chinese character “听” (“listen”) is U+542C. Converting it to UTF-8 takes the following steps:
From the table above, U+542C falls in the range U+0800 to U+FFFF, which means the character occupies 3 bytes, filled according to the pattern 1110xxxx 10xxxxxx 10xxxxxx.
Convert U+542C to binary: 0101-0100-0010-1100.
Fill these 16 bits into the x positions, from low to high: 11100101 10010000 10101100.
The UTF-8 encoding of “听” is therefore 0xE590AC.
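The three steps above can be sketched directly in Python with bit operations, and checked against the built-in codec (the variable names are mine, purely for illustration):

```python
# Sketch of the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx,
# filled from the 16 bits of U+542C as worked through above.
cp = ord("听")                         # 0x542C
b1 = 0b11100000 | (cp >> 12)           # top 4 bits   -> 0xE5
b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # middle 6 bits -> 0x90
b3 = 0b10000000 | (cp & 0x3F)          # low 6 bits    -> 0xAC
assert bytes([b1, b2, b3]) == "听".encode("utf-8") == b"\xe5\x90\xac"
print(hex(b1), hex(b2), hex(b3))       # 0xe5 0x90 0xac
```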
Starting with Unicode 2.0, Unicode uses the same repertoire and code points as ISO 10646-1, and ISO promised that ISO 10646 would never assign UCS-4 code points beyond U+10FFFF, keeping the two standards consistent. In November 2003, RFC 3629 restricted UTF-8 to exactly this range, U+0000 to U+10FFFF. With all of the above understood, the following table is easy to follow (excerpted from Wikipedia).
Some UTF-8 files begin with the bytes EF BB BF to signal that the text file is encoded in UTF-8.
The bytes 0xFE and 0xFF never appear in UTF-8. Moreover, UTF-8 uses the byte as its code unit, so the byte sequence is identical on every system and there is no byte-order problem. UTF-8 therefore does not need a BOM (byte order mark); in UTF-16, however, the BOM is used to mark the storage order (big-endian or little-endian).
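Python's `codecs` module exposes these marker byte sequences as constants, and the `utf-8-sig` codec handles the UTF-8 signature automatically, which makes the point above easy to verify:

```python
import codecs

# Sketch: the BOM byte sequences the codecs module exposes.
print(codecs.BOM_UTF8)      # b'\xef\xbb\xbf' — a signature only, not a byte-order mark
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'
# "utf-8-sig" prepends the UTF-8 signature when encoding:
data = "听".encode("utf-8-sig")
assert data == codecs.BOM_UTF8 + "听".encode("utf-8")
```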
Within the ASCII range, UTF-8 and ASCII produce identical bytes, which is why UTF-8 is a superset of ASCII.
Now that we know what UTF-8 is and how it encodes, let's explore UTF-16.
UTF-16 encodes text in units of 16-bit unsigned integers. As mentioned above, within the Basic Multilingual Plane a region from U+D800 to U+DFFF is reserved and maps no Unicode characters. UTF-16 uses this reserved space, 0xD800–0xDFFF, to encode the characters from U+10000 to U+10FFFF (i.e. the supplementary planes).
In UTF-16, the code spaces U+0000 to U+D7FF and U+E000 to U+FFFF map directly to the Unicode code points, corresponding to UCS-2 in the ISO Universal Character Set. Characters from U+10000 to U+10FFFF are encoded with a pair of 16-bit code units (i.e. 32 bits, 4 bytes), known as a surrogate pair.
The reserved range 0xD800–0xDFFF is divided into two halves (forming the surrogate pairs just mentioned):

- UTF-16 high surrogates: U+D800 to U+DBFF, also called lead surrogates.
- UTF-16 low surrogates: U+DC00 to U+DFFF, also called trail surrogates.
UTF-16's encoding of the supplementary planes is quite ingenious. From U+10000 to U+10FFFF there are 0x100000, i.e. 2^20, code points, which need at least 20 bits to represent. Now look at the surrogates: the high half, U+D800 to U+DBFF, contains 0x400 = 2^10 values, and the low half likewise contains 2^10 values, giving exactly 2^10 × 2^10 = 2^20 surrogate pairs. This is also why those 2048 code points in the Basic Multilingual Plane correspond to no Unicode characters. Let's look at a table:
For example, the Unicode code point of the Old Italic letter 𐌀 is U+10300. Converting it to UTF-16 takes the following steps:
Subtract 0x10000: 0x10300 − 0x10000 = 0x00300; in binary (20 bits): 0000-0000-0011-0000-0000.
Split this into the high 10 bits (0000-0000-00) and the low 10 bits (11-0000-0000).
Add the high 10 bits (padded with zeros as needed) to 0xD800 to get the UTF-16 high surrogate: 0xD800 + 0x0000 = 0xD800.
Add the low 10 bits to 0xDC00 to get the UTF-16 low surrogate: 0xDC00 + 0x0300 = 0xDF00.
So the UTF-16BE encoding of the Old Italic letter 𐌀 is 0xD800 0xDF00.
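The surrogate-pair steps above can be sketched in a few lines of Python and checked against the built-in `utf-16-be` codec:

```python
# Sketch of the surrogate-pair computation worked through above.
cp = 0x10300                      # OLD ITALIC LETTER A
v = cp - 0x10000                  # 0x00300, a 20-bit value
hi = 0xD800 + (v >> 10)           # high (lead) surrogate  -> 0xD800
lo = 0xDC00 + (v & 0x3FF)         # low (trail) surrogate  -> 0xDF00
assert (hi, lo) == (0xD800, 0xDF00)
encoded = hi.to_bytes(2, "big") + lo.to_bytes(2, "big")
assert encoded == chr(cp).encode("utf-16-be") == b"\xd8\x00\xdf\x00"
```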
The conversion between Unicode and UTF-16 is shown in the following table:

| Code point range | UTF-16 code units |
| --- | --- |
| U+0000 – U+D7FF, U+E000 – U+FFFF | one 16-bit unit, equal to the code point |
| U+10000 – U+10FFFF | surrogate pair: high 0xD800–0xDBFF, low 0xDC00–0xDFFF |
As can be seen from the above table, UTF-16 is not byte-compatible with ASCII: even ASCII characters occupy two bytes.
UTF-16 storage form
Readers may now have a doubt: UTF-16 encodes in units of 16-bit unsigned integers, so each code unit occupies two bytes, but different systems may disagree on byte order, and the same byte stream can then be interpreted as different content. Take the character “心” (U+5FC3) as an example; its two bytes are 5F and C3. A system that reads the low byte first will interpret the stream as U+C35F and display “썟”, while a system that reads the high byte first will interpret it as U+5FC3 and display “心”.

To solve this problem, the byte order mark (BOM) was born. If the character U+FEFF appears at the start of a byte stream, the order of its bytes reveals the byte order of the stream: FE FF indicates high byte first, FF FE indicates low byte first. These two byte orders are usually called big-endian and little-endian in computing. Let's continue to explore.
Big-endian and little-endian storage
Big-endian (BE): the most significant byte of a word is stored at the lowest address of the word's memory region. Little-endian (LE): the least significant byte of a word is stored at the lowest address.
Take the Old Italic letter 𐌀 again: we just computed its UTF-16 encoding as the code units 0xD800 0xDF00. With big-endian storage the byte sequence is D8 00 DF 00; with little-endian storage it is 00 D8 00 DF. The two modes differ only in the byte order within each 16-bit unit; the order of the units themselves is the same. Let's look at a few more examples (from Wikipedia):
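Python's explicit `utf-16-be` and `utf-16-le` codecs make the two storage orders directly visible:

```python
# Sketch: the same surrogate pair (0xD800, 0xDF00) stored in both byte orders.
ch = "\U00010300"                                     # OLD ITALIC LETTER A
assert ch.encode("utf-16-be") == b"\xd8\x00\xdf\x00"  # big-endian: D8 00 DF 00
assert ch.encode("utf-16-le") == b"\x00\xd8\x00\xdf"  # little-endian: 00 D8 00 DF
# Only the bytes *within* each 16-bit unit swap; the two units
# themselves stay in order (lead surrogate first, then trail).
```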
A UTF-16 file begins with FE FF or FF FE to show whether the text is encoded as BE or LE.
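Python's plain `utf-16` codec illustrates this: it writes a BOM when encoding and uses a leading BOM to pick the byte order when decoding:

```python
import codecs

# Sketch: the "utf-16" codec emits a BOM and honors it on decode.
data = "心".encode("utf-16")          # BOM + code unit, in native byte order
assert data[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)
# Streams with a leading BOM decode to the same text either way:
assert (codecs.BOM_UTF16_BE + b"\x5f\xc3").decode("utf-16") == "心"
assert (codecs.BOM_UTF16_LE + b"\xc3\x5f").decode("utf-16") == "心"
```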
UTF-16 can be regarded as a superset of UCS-2: for Unicode code points below 0x10000, the UTF-16 encoding equals the UCS-2 encoding and equals the Unicode scalar value.
UTF-16 vs UTF-8: I don't think the two are directly comparable; which is more compact depends mainly on which planes your characters are concentrated in. Both are variable-length encodings.
UTF-16 vs UCS-2¹: if a character lies beyond U+FFFF (i.e. U+10000 to U+10FFFF), it cannot be encoded in UCS-2 at all, so UTF-16 can be regarded as a superset of UCS-2.
Related articles and links
- UCS is the Universal Character Set of ISO 10646. UCS-2 can be loosely understood as UTF-16 restricted to the BMP, likewise using a 16-bit code space.↩