Introduction to the Differences and Relations between UTF-8, GBK and GB2312

Time:2019-9-24

UTF-8 can represent the characters used in every country of the world. It is an international encoding with strong universality: UTF-8 encoded text can be displayed in any browser that supports the UTF-8 character set. For example, a page encoded in UTF-8 can display Chinese on a foreigner’s English-language IE without downloading IE’s Chinese language support package.

GBK is an extension of the national standard GB2312 and remains compatible with it. GBK encodes each Chinese character (and each full-width symbol) in two bytes, while ASCII characters still take a single byte. To distinguish Chinese characters, the highest bit of the first byte is set to 1. GBK contains all Chinese characters and is a national encoding. It is less universal than UTF-8, but Chinese text stored in UTF-8 takes more space in a database than the same text in GBK.

GBK, GB2312 and UTF-8 can only be converted to each other by going through Unicode:

GBK, GB2312 -- Unicode -- UTF-8

UTF-8 -- Unicode -- GBK, GB2312
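The two conversion paths above can be sketched in Python, where a `str` plays the role of the Unicode step in the middle. This is only a minimal illustration; the sample bytes BA BA (the GBK code of “汉”) and the variable names are my own choice.

```python
# GBK bytes of the character U+6C49 ("Han"); sample value chosen for illustration.
gbk_bytes = bytes.fromhex("BABA")

# Step 1: GBK -> Unicode (a Python str is a sequence of Unicode code points).
text = gbk_bytes.decode("gbk")

# Step 2: Unicode -> UTF-8.
utf8_bytes = text.encode("utf-8")

print(hex(ord(text)))    # the Unicode code of the character
print(utf8_bytes.hex())  # the UTF-8 bytes

# The reverse path: UTF-8 -> Unicode -> GBK.
round_trip = utf8_bytes.decode("utf-8").encode("gbk")
print(round_trip == gbk_bytes)
```

There is no direct GBK-to-UTF-8 step here: both directions pass through the decoded string.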

For a website or forum whose content is mostly English, UTF-8 is recommended to save space. However, many forum plug-ins still support only GBK.

Detailed Explanation of the Differences between Encodings
Simply put, Unicode, GBK and Big5 codes are the coded values themselves, while UTF-8, UTF-16 and the like are ways of representing such a value. The first three are not compatible with each other: for the same Chinese character, the three code values are completely different. For example, the Unicode value of “汉” (Han) differs from its GBK value; suppose, purely for illustration, that the Unicode value is a040 and the GBK value is b030. UTF-8 is a form of expression of that value, and it is defined only for Unicode. To convert GBK to UTF-8, you must first convert to the Unicode code and then convert that to UTF-8.

For more details, see the article below.

Talking about Unicode Coding: A Brief Explanation of UCS, UTF, BMP, BOM and Other Terms
This is a light read written by a programmer for programmers. “Light” means that readers can easily pick up some previously unclear concepts and level up their knowledge, much like leveling up in an RPG. Two questions motivated this article:

Question 1:
Using “Save As” in Windows Notepad, a file can be converted between the GBK, Unicode, Unicode big endian and UTF-8 encodings. They are all TXT files, so how does Windows recognize the encoding?

I noticed long ago that txt files encoded as Unicode, Unicode big endian and UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian) and EF BB BF (UTF-8). But what standard are these markers based on?

Question 2:
Recently I came across a ConvertUTF.c on the Internet that implements conversion among UTF-32, UTF-16 and UTF-8. I already knew about encodings such as Unicode (UCS2), GBK and UTF-8, but this program confused me: I could not recall what UTF-16 has to do with UCS2.
After checking the relevant material I finally cleared up these questions and learned some details of Unicode along the way. I wrote this article for friends with similar questions. It tries to be easy to understand, but readers need to know what a byte is and what hexadecimal is.

0. Big endian and little endian
Big endian and little endian are two different ways for a CPU to handle multi-byte numbers. For example, the Unicode code of the character “汉” (Han) is 6C49. When writing it to a file, do you write 6C first or 49 first? Writing 6C first is big endian; writing 49 first is little endian.
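A quick way to see the two byte orders is to pack the 16-bit value 6C49 both ways with Python’s `struct` module. The variable names are only for illustration.

```python
import struct

code = 0x6C49  # Unicode code of "Han", as given in the text

big = struct.pack(">H", code)     # big endian: 6C is written first
little = struct.pack("<H", code)  # little endian: 49 is written first

print(big.hex(), little.hex())
```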

The word “endian” comes from Gulliver’s Travels. The civil war in Lilliput broke out over whether eggs should be cracked open at the big end (Big-Endian) or the little end (Little-Endian). It led to six rebellions, in which one emperor lost his life and another lost his throne.

We usually translate endian into “byte order” and refer to big endian and little endian as “big tail” and “small tail”.

1. Character encoding and internal code, with a brief introduction to Chinese character encodings
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for simplified Chinese and Big5 for traditional Chinese.

GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area is B0-F7 for the high byte and A1-FE for the low byte, which gives 72*94=6768 code positions. Five of them, D7FA-D7FE, are vacant.
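The arithmetic behind these numbers can be checked in a few lines. The ranges are taken from the paragraph above; the variable names are mine.

```python
# High bytes B0..F7 and low bytes A1..FE of the Chinese character area.
high_bytes = range(0xB0, 0xF7 + 1)
low_bytes = range(0xA1, 0xFE + 1)

positions = len(high_bytes) * len(low_bytes)  # 72 * 94 code positions
vacancies = 0xD7FE - 0xD7FA + 1               # D7FA..D7FE are vacant
hanzi = positions - vacancies                 # Chinese characters actually assigned

print(len(high_bytes), len(low_bytes), positions, vacancies, hanzi)
```

The result matches the 6763 Chinese characters quoted for GB2312.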

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area includes 21003 characters.

From ASCII and GB2312 to GBK, these encodings are backward compatible: the same character always has the same code in all of them, and each later standard supports more characters. In these encodings English and Chinese can be processed uniformly. Chinese codes are distinguished by the highest bit of the high byte being non-zero. In programmers’ terms, GB2312 and GBK are double-byte character sets (DBCS).

In 2000, GB18030 became the official national standard replacing GBK 1.0. The standard includes 27484 Chinese characters, as well as Tibetan, Mongolian, Uyghur and other major minority scripts. In terms of Chinese characters, GB18030 adds the 6582 characters of CJK Extension A (Unicode codes 0x3400-0x4db5) to the 20902 characters of GB13000.1, for a total of 27484.

CJK stands for China, Japan and Korea. To save code positions, Unicode unifies the characters used in the Chinese, Japanese and Korean languages into a single encoding. GB13000.1 is the Chinese version of ISO/IEC 10646-1 and is equivalent to Unicode 1.1.

GB18030 uses single-byte, double-byte and four-byte codes. The single-byte and double-byte codes are fully compatible with GBK. The four-byte codes cover the CJK Extension A characters that have no double-byte code. For example, UCS 0x3400 is encoded in GB18030 as 8139 EE39, and UCS 0x3401 as 8139 EF30.
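The three code lengths can be observed with Python’s built-in `gb18030` codec. This is just a small sketch; the three sample characters are my own choice, and it prints the encoded bytes rather than assuming them.

```python
samples = {
    "ASCII letter": "A",
    "GBK Chinese character U+6C49": "\u6c49",
    "CJK Extension A character U+3400": "\u3400",
}

lengths = {}
for label, ch in samples.items():
    encoded = ch.encode("gb18030")
    lengths[label] = len(encoded)
    print(label, encoded.hex(), len(encoded))

# The double-byte part of GB18030 agrees with GBK for this character.
same_as_gbk = "\u6c49".encode("gb18030") == "\u6c49".encode("gbk")
print(same_as_gbk)
```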

Microsoft has provided a GB18030 upgrade package, but it only supplies a new font that supports the 6582 Chinese characters of CJK Extension A, NSimSun-18030, and does not change the internal code. The internal code of Windows is still GBK.

Here are some details:

The original text of GB2312 is still a location code. From the location code to the internal code, we need to add A0 on the high byte and the low byte respectively.

For any character encoding, the sequence of encoding units is specified by the encoding scheme, independent of endian. For example, the encoding unit of GBK is byte, which represents a Chinese character with two bytes. The order of these two bytes is fixed and not affected by CPU byte order. The encoding unit of UTF-16 is word (double bytes). The order between words is specified by the encoding scheme. The byte arrangement inside word is affected by endian. UTF-16 will be introduced later.

The highest bit of both bytes in GB2312 is 1. But only 128*128=16384 code positions meet this requirement, so in GBK and GB18030 the highest bit of the low byte may not be 1. This does not affect the parsing of a DBCS character stream: when reading the stream, whenever a byte with the high bit set to 1 is encountered, that byte and the one after it can be taken together as one double-byte code, regardless of the high bit of the low byte.
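The parsing rule just described can be sketched as follows. The function name `split_dbcs` is hypothetical, and the sketch deliberately skips validation (it does not check that a lead byte or trail byte is actually legal in GBK).

```python
def split_dbcs(stream):
    """Split a GBK-style byte stream into one-byte and two-byte characters."""
    chars = []
    i = 0
    while i < len(stream):
        if stream[i] & 0x80:
            # High bit set: this byte and the next form one double-byte code,
            # whatever the high bit of the second byte is.
            chars.append(stream[i:i + 2])
            i += 2
        else:
            # High bit clear: a single-byte ASCII character.
            chars.append(stream[i:i + 1])
            i += 1
    return chars


# "A", a GB2312 character (BA BA), a GBK character whose low byte is 40, then "B".
sample = b"A\xba\xba\x81\x40B"
pieces = split_dbcs(sample)
print([p.hex() for p in pieces])
```

Note that the byte 40 in the sample is not read as the ASCII character “@”, because it is consumed as the low byte of the preceding double-byte code.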

2. Unicode, UCS and UTF
As mentioned earlier, the encodings from ASCII, GB2312 and GBK to GB18030 are backward compatible. Unicode, however, is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode code of “汉” (Han) is 6C49, while its GB code is BABA.
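A small check of what “compatible” means here, using Python’s built-in codecs. The variable names are for illustration only.

```python
han = "\u6c49"

unicode_code = ord(han)       # the Unicode code of the character
gb_code = han.encode("gbk")   # its GB code, as bytes

print(hex(unicode_code), gb_code.hex())

# Every ISO-8859-1 byte value maps to the Unicode code with the same number.
latin1_matches = all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

# ASCII bytes keep their values in GBK as well.
ascii_matches = all(bytes([b]).decode("gbk") == chr(b) for b in range(128))

print(latin1_matches, ascii_matches)
```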

Unicode is also a character encoding method, but it was designed by international organizations and can accommodate all the languages and scripts of the world. The formal name of Unicode is “Universal Multiple-Octet Coded Character Set”, UCS for short. UCS can also be read as an abbreviation of “Unicode Character Set”.

According to Wikipedia (http://zh.wikipedia.org/wiki/), two organizations historically tried to design Unicode independently: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991, both sides realized that the world did not need two incompatible character sets. So they began to merge their work and cooperate on a single code table. Starting with Unicode 2.0, the Unicode project adopted the same character repertoire and codes as ISO 10646-1.

At present, both projects still exist and publish their own standards independently. The latest version from the Unicode Consortium is Unicode 4.1.0, released in 2005. The latest ISO standard is ISO 10646-3:2003.

UCS only specifies how characters are coded, not how the codes are transmitted or stored. For example, the UCS code of “汉” (Han) is 6C49. I could transmit and store it as four ASCII digits, or in UTF-8 as the three consecutive bytes E6 B1 89. What matters is that both communicating parties agree. UTF-8, UTF-7 and UTF-16 are all widely accepted schemes. A particular advantage of UTF-8 is that it is fully compatible with ASCII. UTF stands for “UCS Transformation Format”.

The IETF’s RFC2781 and RFC3629 describe the UTF-16 and UTF-8 encodings clearly, vividly and rigorously, in the usual RFC style. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs maintained by the IETF are the basis of all the specifications on the Internet.

2.1. Internal code and code page
The Windows kernel now supports the Unicode character set, so at the kernel level it can handle all the languages and scripts of the world. However, because a large number of existing programs and documents use a language-specific encoding such as GBK, Windows cannot drop support for the existing encodings and switch entirely to Unicode.

Windows uses code pages to adapt to different countries and regions. A code page can be understood as the internal code mentioned earlier. The code page corresponding to GBK is CP936.

Microsoft also defines a code page for GB18030: CP54936. But because part of GB18030 is encoded in 4 bytes, and Windows code pages only support single-byte and double-byte encodings, this code page is not really usable.

3. UCS-2, UCS-4, BMP
UCS has two formats: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with four bytes (actually only 31 bits; the highest bit must be 0). Let’s play some simple math games:

UCS-2 has 2^16 = 65536 code positions and UCS-4 has 2^31 = 2147483648 code positions.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes by the second-highest byte. Each plane is divided into 256 rows by the third byte, and each row contains 256 cells. Of course, cells in the same row differ only in the last byte.

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the code positions in UCS-4 whose two high bytes are both 0 make up the BMP.

UCS-2 is obtained by removing the first two zero bytes from the BMP of UCS-4. The BMP of UCS-4 is obtained by adding two zero bytes to the two bytes of UCS-2. At present, no character in UCS-4 specification has been assigned outside BMP.
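The group/plane/row/cell layout and the relation between UCS-2 and the BMP of UCS-4 can be sketched like this. The function names are hypothetical, and the byte strings are written in big-endian order purely for readability.

```python
def ucs4_parts(code):
    """Return (group, plane, row, cell) for a 31-bit UCS-4 code."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("UCS-4 codes use at most 31 bits")
    group = (code >> 24) & 0x7F  # highest byte, top bit always 0
    plane = (code >> 16) & 0xFF  # second-highest byte
    row = (code >> 8) & 0xFF     # third byte
    cell = code & 0xFF           # last byte
    return group, plane, row, cell


def in_bmp(code):
    """True if the code lies in plane 0 of group 0."""
    group, plane, _, _ = ucs4_parts(code)
    return group == 0 and plane == 0


def ucs2_to_ucs4(ucs2):
    """Add two zero bytes in front of a 2-byte UCS-2 code."""
    return b"\x00\x00" + ucs2


def ucs4_bmp_to_ucs2(ucs4):
    """Drop the two leading zero bytes of a UCS-4 code in the BMP."""
    if ucs4[:2] != b"\x00\x00":
        raise ValueError("not a BMP code")
    return ucs4[2:]


print(ucs4_parts(0x6C49), in_bmp(0x6C49))
print(ucs2_to_ucs4(b"\x6c\x49").hex())
```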

4. UTF encodings

UTF-8 encodes UCS in units of 8 bits. The encoding from UCS-2 to UTF-8 is as follows:

UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F                 0xxxxxxx
0080 - 07FF                 110xxxxx 10xxxxxx
0800 - FFFF                 1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of “汉” (Han) is 6C49. 6C49 lies between 0800 and FFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting this bit stream for the x’s in the template in order gives 11100110 10110001 10001001, that is, E6 B1 89.
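The three templates can be turned into code almost literally. The sketch below only covers the UCS-2 range from the table, and the function name `utf8_encode_bmp` is hypothetical; in practice you would simply call `str.encode("utf-8")`.

```python
def utf8_encode_bmp(code):
    """Encode a UCS-2 code (0000-FFFF) using the three templates above."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("this sketch only handles UCS-2 codes")
    if code <= 0x7F:
        # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code >> 6),
                      0b10000000 | (code & 0x3F)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0b11100000 | (code >> 12),
                  0b10000000 | ((code >> 6) & 0x3F),
                  0b10000000 | (code & 0x3F)])


print(utf8_encode_bmp(0x6C49).hex())  # → e6b189
```

For ordinary characters the result agrees with Python’s built-in encoder, for example `chr(0x6C49).encode("utf-8")`.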

Readers can use Notepad to check whether our encoding is correct. Note that UltraEdit automatically converts a UTF-8 encoded text file to UTF-16 when opening it, which may cause confusion. You can turn this option off in the settings. A better tool is Hex Workshop.

UTF-16 encodes UCS in units of 16 bits. For UCS codes less than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined. However, since the codes actually in use, those of UCS-2 or the BMP of UCS-4, are all less than 0x10000, for now UTF-16 and UCS-2 can be regarded as basically the same. But UCS-2 is only a coding scheme, whereas UTF-16 is used for actual transmission, so byte order has to be considered.
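The text does not spell out the algorithm for codes of 0x10000 and above, so the sketch below fills it in from RFC 2781 (the surrogate pair scheme): subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function names are hypothetical, and the big-endian output is just one of the two possible byte orders.

```python
import struct


def utf16_units(code):
    """Return the list of 16-bit UTF-16 units for one UCS code."""
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("code is outside the range UTF-16 can represent")
    if code < 0x10000:
        return [code]                       # the unit equals the UCS code
    offset = code - 0x10000                 # a 20-bit value
    high = 0xD800 | (offset >> 10)          # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)         # lower 10 bits
    return [high, low]


def utf16_be_bytes(code):
    """Serialize the units in big-endian byte order."""
    return b"".join(struct.pack(">H", unit) for unit in utf16_units(code))


print([hex(u) for u in utf16_units(0x6C49)])
print([hex(u) for u in utf16_units(0x10000)])
```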

5. Byte Order and BOM of UTF
UTF-8 uses the byte as its encoding unit and has no byte order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting a UTF-16 text we must first determine the byte order of each unit. For example, the Unicode code of “奎” (Kui) is 594E, and the Unicode code of “乙” (Yi) is 4E59. If we receive the UTF-16 byte stream “594E”, is it “奎” or “乙”?
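The ambiguity is easy to reproduce: the same two bytes decode to different characters depending on the byte order we assume. A minimal illustration with Python’s built-in codecs:

```python
raw = bytes.fromhex("594E")

as_big = raw.decode("utf-16-be")     # read as 59 4E -> code 594E
as_little = raw.decode("utf-16-le")  # read as 4E 59 -> code 4E59

print(hex(ord(as_big)), hex(ord(as_little)))
```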

The Unicode specification recommends marking byte order with a BOM. BOM here is not the “Bill Of Material”, but the Byte Order Mark. The BOM is a rather clever idea:

UCS has a character called “ZERO WIDTH NO-BREAK SPACE”, whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character “ZERO WIDTH NO-BREAK SPACE” before the byte stream itself.

So if the receiver receives FEFF, it indicates that the byte stream is Big-Endian; if it receives FFFE, it indicates that the byte stream is Little-Endian. Therefore, the character “ZERO WIDTH NO-BREAK SPACE” is also called BOM.

UTF-8 does not need a BOM to indicate byte order, but it can use the BOM to indicate the encoding. The UTF-8 encoding of the character “ZERO WIDTH NO-BREAK SPACE” is EF BB BF (readers can verify this with the encoding method introduced earlier). So if the receiver gets a byte stream starting with EF BB BF, it knows that the stream is UTF-8 encoded.

Windows uses BOM to mark the encoding of text files.
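Here is a rough sketch of how a program might use the BOM to pick an encoding, in the spirit of what is described above. It is not Notepad’s actual logic: the function names are hypothetical, the fallback to GBK stands in for “the default code page”, and UTF-32 BOMs are ignored.

```python
import codecs


def sniff_bom(data):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    if data.startswith(codecs.BOM_UTF8):       # EF BB BF
        return "utf-8", 3
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "utf-16-le", 2
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "utf-16-be", 2
    return None


def decode_text(data, default="gbk"):
    """Decode data, using the BOM if present and a default encoding otherwise."""
    found = sniff_bom(data)
    if found is None:
        return data.decode(default)            # assumption: GBK as the default
    encoding, bom_length = found
    return data[bom_length:].decode(encoding)  # strip the BOM before decoding


print(sniff_bom("\ufeff".encode("utf-8") + b"abc"))
print(hex(ord(decode_text(b"\xfe\xff\x6c\x49"))))
```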

6. Further References
This article mainly draws on “Short overview of ISO-IEC 10646 and Unicode” (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

I also found two articles that look good, but since I had already found the answers to my initial questions, I did not read them:

“Understanding Unicode A general introduction to the Unicode Standard” (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
“Character set encoding basics Understanding character set encodings and legacy encodings” (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)

I have written UTF-8, UCS-2 and GBK conversion packages, in versions both with and without the Windows API. When I have time, I will tidy them up and put them on my personal home page (http://fmddlmyy.home4u.China.com).

I started writing this article after I had thought about all the questions. I thought I could finish it in a moment. Unexpectedly, it took a long time to consider the wording and verification details, and it was written from 1:30 p.m. to 9:00 p.m. I hope readers can benefit from it.

Appendix 1: More on location codes, GB2312, internal codes and code pages
Some friends have doubts about this sentence in the article:
“The original text of GB2312 is still a location code. From the location code to the internal code, we need to add A0 on the high byte and the low byte respectively.”

Let me explain in detail:

“The original text of GB2312” refers to a national standard of 1980, “Basic Set of Chinese Characters Coded for National Standard Information Exchange of the People’s Republic of China GB2312-80”. This standard uses two numbers to encode Chinese characters and Chinese symbols. The first number is called the “zone” and the second the “position”, so the code is also called a location code (zone-position code). Zones 1-9 hold Chinese symbols, zones 16-55 hold level-1 Chinese characters, and zones 56-87 hold level-2 Chinese characters. Windows still has a location-code input method today; for example, entering 1601 gives “啊” (ah). (This input method automatically recognizes both hexadecimal GB2312 codes and decimal location codes, so entering B0A1 also gives “啊”.)

The internal code refers to the character encoding within the operating system. The internal codes of early operating systems were language-related. Now Windows supports Unicode within the system, and then adapts code pages to various languages. The concept of “internal code” is blurred. Microsoft generally refers to the code specified in the default code page as internal code.

There is no official definition of the term “internal code”, and “code page” is just Microsoft’s name for it. As programmers, we only need to know what they are; there is no need to scrutinize these terms too closely.

The so-called code page is a character encoding for a language. For example, the code page of GBK is CP936, the code page of BIG5 is CP950, and the code page of GB2312 is CP20936.

In Windows there is the concept of a default code page, that is, the encoding used to interpret characters by default. For example, suppose Windows Notepad opens a text file that contains the byte stream BA, BA, D7, D6. How should Windows interpret it?

Should it be interpreted as Unicode, GBK, Big5 or ISO8859-1? Interpreted as GBK, it yields the word “汉字” (“Chinese character”). Under the other encodings, the corresponding characters may not be found, or the wrong characters may be found. “Wrong” here means inconsistent with the intention of the text’s author, in which case garbled text results.

The answer is that Windows interprets the byte stream in a text file according to the current default code page. The default code page can be set through Regional Options in the Control Panel. The “ANSI” option in Notepad’s “Save As” actually saves the file in the encoding of the default code page.
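To see the effect, the same four bytes can be decoded under several encodings in Python. The list of codecs is my own selection for illustration; a failed decode is recorded as `None`.

```python
raw = bytes.fromhex("BABAD7D6")

results = {}
for codec in ("gbk", "big5", "latin-1", "utf-8"):
    try:
        results[codec] = raw.decode(codec)
    except UnicodeDecodeError:
        results[codec] = None  # these bytes are not valid in this encoding
    print(codec, repr(results[codec]))
```

Only the GBK interpretation gives the two characters the author intended; the others either fail or produce unrelated characters.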

Windows’s internal code is Unicode, so technically it can support multiple code pages at the same time. As long as a file declares which encoding it uses and the user has the corresponding code page installed, Windows can display it correctly. For example, the charset can be specified in an HTML file.

Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the document. If such an author uses characters in the range 0x80-0xff and Chinese Windows interprets them according to the default GBK, the text will be garbled. In that case, just add a statement specifying the charset to the HTML file, for example:
<meta http-equiv="Content-Type" content="text/html; charset=ISO8859-1">
If the code page used by the original author is compatible with ISO8859-1, there will be no scrambling.

Now back to the location code. The location code of “啊” (ah) is 1601, which written in hexadecimal is 0x10, 0x01. This conflicts with the ASCII encoding widely used by computers. To be compatible with the ASCII codes 00-7f, we add A0 to the high byte and the low byte of the location code respectively, so the code of “啊” becomes B0A1. We also call the code with the two A0s added the GB2312 code, although the original text of GB2312 does not mention this at all.
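The conversion from location code to internal code is short enough to write out. The function name `location_to_internal` is hypothetical; the range check assumes the 94 x 94 layout of zones and positions.

```python
def location_to_internal(location):
    """Convert a 4-digit decimal location code (zone, position) to internal code bytes."""
    zone, position = divmod(location, 100)
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must both be in 1..94")
    # Add A0 to the high byte and the low byte respectively.
    return bytes([zone + 0xA0, position + 0xA0])


internal = location_to_internal(1601)
print(internal.hex())  # → b0a1
```

Decoding the result with Python’s `gb2312` codec gives back the character “啊”.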