Escape and Unicode encoding in JSON serialization

Time:2020-8-20

This article sorts out how escaping works in JSON encoding and how Unicode is handled in JSON.

In fact, this is a companion to my last article. While studying Unicode characters, because our data is transmitted as JSON strings, we ran into a problem when transcoding emoji. This summary was written after that problem was solved.

Common escape characters in JSON

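In my opinion, JSON is one of the most readable data transmission formats for programmers, and JSON fully considers escaping during data transmission, which avoids various injection risks. When JSON is serialized (referred to as marshal), the following characters in a string are commonly escaped. Strictly speaking, the JSON standard only requires escaping the double quotation mark, the backslash, and control characters; escaping the slash is optional, and the escapes for <, > and & are a safety measure that some encoders (such as Go's encoding/json) apply by default so the output can be embedded in HTML: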

Symbol    Name                     Escaped string
"         Double quotation mark    \"
/         Slash                    \/
\         Backslash                \\
\b        Backspace                \b
\f        Form feed                \f
\t        Horizontal tab           \t
\r        Carriage return          \r
\n        Line feed                \n
<         Left angle bracket       \u003C
>         Right angle bracket      \u003E
&         Ampersand                \u0026
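
As a rough illustration of the table, the following sketch shows how Go's standard encoding/json treats these characters. The helper name encodeString is made up for this example. Note that Go writes the hexadecimal digits in lowercase (\u003c), which is equivalent, and that it does not escape the slash.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// encodeString is a hypothetical helper for this example. It serializes s
// as a JSON string; escapeHTML toggles the escaping of <, > and &.
func encodeString(s string, escapeHTML bool) (string, error) {
	var buf bytes.Buffer
	enc := json.NewEncoder(&buf)
	enc.SetEscapeHTML(escapeHTML)
	if err := enc.Encode(s); err != nil {
		return "", err
	}
	// Encode appends a trailing newline; drop it for display.
	return strings.TrimSuffix(buf.String(), "\n"), nil
}

func main() {
	in := "<a href=\"/x\">Tom & Jerry</a>\n"
	for _, escapeHTML := range []bool{true, false} {
		out, err := encodeString(in, escapeHTML)
		if err != nil {
			panic(err)
		}
		fmt.Printf("escapeHTML=%v: %s\n", escapeHTML, out)
	}
}
```

With escapeHTML set to false, only the double quotation marks and the line feed are escaped in this input.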

In addition, for the Go language, I suggest escaping the percent sign % as \u0025. The reason is that the percent sign is a key character in Go's various string formatting operations, so escaping it avoids errors in logging or other formatting operations.
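
One possible way to apply this suggestion is sketched below. The helper marshalEscapingPercent is hypothetical and not part of encoding/json; it relies on the fact that in valid JSON a percent sign can only appear inside a string, so a plain byte replacement on the marshaled output is enough.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// marshalEscapingPercent is a hypothetical helper: it marshals v and then
// rewrites every '%' as the JSON escape \u0025.
func marshalEscapingPercent(v any) ([]byte, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return nil, err
	}
	// '%' is not a structural JSON character, so it only occurs in strings.
	return bytes.ReplaceAll(b, []byte("%"), []byte(`\u0025`)), nil
}

func main() {
	b, err := marshalEscapingPercent(map[string]string{"rate": "100%d"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b)) // {"rate":"100\u0025d"}

	// Any JSON decoder turns the escape back into a plain percent sign.
	var back map[string]string
	if err := json.Unmarshal(b, &back); err != nil {
		panic(err)
	}
	fmt.Println(back["rate"]) // 100%d
}
```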

The processing of Unicode characters in JSON

"Unicode characters" here means characters outside the ASCII range, that is, characters whose code point is greater than 0x7F.

In most cases, UTF-8 has become the standard in modern programming languages. Therefore, during JSON serialization it is usually enough to encode such characters as UTF-8 bytes and write them into the string as they are. UTF-8 is a byte-oriented encoding, so byte order is not a concern.

However, in some cases the peer does not use UTF-8, or does not handle multi-byte characters reliably (for example, the other side is a technologically backward customer / partner / integrator with a strong say). Using ASCII throughout avoids these problems.

So how does JSON use ASCII to transmit Unicode? The escape table earlier already gives a hint: JSON uses \uXXXX to represent a Unicode character. In each such escape, XXXX must be exactly 4 hexadecimal digits; even if the high digits are 0, they must be padded. Unicode characters can be encoded and transmitted this way. For ASCII-based transmission this encoding is more robust, and it does not add much data when non-ASCII characters are rare. Of course, when there are many Unicode characters (such as a large amount of Chinese text), the programmer needs to consider the extra network cost: each escape takes 6 bytes, while the same character takes at most 3 bytes in UTF-8.
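
The padding rule can be sketched in a few lines of Go. The function name escapeBMP is made up for this example, and it only covers code points up to 0xFFFF; larger values need the treatment described in the next section.

```go
package main

import "fmt"

// escapeBMP is a hypothetical helper that formats a code point in the range
// 0x0000..0xFFFF as \uXXXX. The %04X verb pads with leading zeros so that
// the escape always has exactly 4 hexadecimal digits.
func escapeBMP(r rune) string {
	return fmt.Sprintf(`\u%04X`, r)
}

func main() {
	fmt.Println(escapeBMP('é'))  // \u00E9
	fmt.Println(escapeBMP('我')) // \u6211
}
```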

Encoding of Unicode characters greater than 0xFFFF

Readers may notice that the \uXXXX format can only represent values up to 0xFFFF, but Unicode has long exceeded this range. How are characters larger than 65535 represented? First of all, absolutely do not simply write \uXXXXX with five digits, which leads to decoding errors.

For characters larger than 65535, JSON uses UTF-16 surrogate pairs. UTF-16 relies on a property of Unicode: code points never exceed 0x10FFFF, so after subtracting 0x10000 the value fits in 20 bits.
For example, if u denotes such a character, UTF-16 processes it as follows:

  • First, u = u - 0x10000
  • Let hi be the high 10 bits of u after the subtraction: hi = (u & 0xFFC00) >> 10
  • Let lo be the low 10 bits of u after the subtraction: lo = u & 0x003FF
  • Add 0xD800 to hi and write the result as a \u escape
  • Add 0xDC00 to lo and write the result as a \u escape
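
The steps above translate almost directly into Go. This is a minimal sketch with made-up function names (surrogatePair, escapeAstral), and it assumes the input is already known to be in the range 0x10000..0x10FFFF.

```go
package main

import "fmt"

// surrogatePair follows the steps above by hand. It assumes
// 0x10000 <= u <= 0x10FFFF and does not validate its input.
func surrogatePair(u rune) (hi, lo rune) {
	u -= 0x10000
	hi = ((u & 0xFFC00) >> 10) + 0xD800 // high 10 bits
	lo = (u & 0x003FF) + 0xDC00         // low 10 bits
	return hi, lo
}

// escapeAstral writes both halves as \u escapes.
func escapeAstral(u rune) string {
	hi, lo := surrogatePair(u)
	return fmt.Sprintf(`\u%04X\u%04X`, hi, lo)
}

func main() {
	// U+1F600 is the "grinning face" emoji.
	fmt.Println(escapeAstral(0x1F600)) // \uD83D\uDE00
}
```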

For example, take the emoji "🌍" representing the earth. Its code point is 0x1F30D, and the UTF-16 encoding process is as follows:

  • u = 0x1F30D - 0x10000 = 0xF30D, which in binary is 1111 0011 0000 1101 (padded to 20 bits: 0000 1111 0011 0000 1101)
  • The high 10 bits are 0000111100 and the low 10 bits are 1100001101
  • The high value 0x03C plus 0xD800 equals 0xD83C
  • The low value 0x30D plus 0xDC00 equals 0xDF0D
  • The final encoding is \uD83C\uDF0D
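
This result can be cross-checked against Go's standard library. The sketch below uses utf16.EncodeRune for the encoding direction and json.Unmarshal for the decoding direction; the wrapper names escapeWithStdlib and decodeJSONString are invented for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"unicode/utf16"
)

// escapeWithStdlib lets unicode/utf16 compute the surrogate pair.
func escapeWithStdlib(r rune) string {
	hi, lo := utf16.EncodeRune(r)
	return fmt.Sprintf(`\u%04X\u%04X`, hi, lo)
}

// decodeJSONString parses a JSON string literal, quotes included.
func decodeJSONString(lit string) (string, error) {
	var s string
	err := json.Unmarshal([]byte(lit), &s)
	return s, err
}

func main() {
	fmt.Println(escapeWithStdlib(0x1F30D)) // \uD83C\uDF0D

	s, err := decodeJSONString(`"\uD83C\uDF0D"`)
	if err != nil {
		panic(err)
	}
	for _, r := range s {
		fmt.Printf("%U\n", r) // U+1F30D
	}
}
```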

For example, the following JSON (the string value means "I am the earth", followed by the earth emoji):

{
    "string": "我是地球🌍"
}

After serializing in ASCII, the result is:

{"string":"\u6211\u662F\u5730\u7403\uD83C\uDF0D"}
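
Go's encoding/json has no built-in switch for ASCII-only output, so one way to produce a result like the one above is to post-process the marshaled bytes. The following is only a sketch: marshalASCII and the message type are made up for this example, and it relies on json.Marshal emitting valid UTF-8 in which non-ASCII characters appear only inside strings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"unicode/utf16"
)

// message is a hypothetical type mirroring the example JSON.
type message struct {
	String string `json:"string"`
}

// marshalASCII is a hypothetical helper: it marshals v and then replaces
// every non-ASCII character with \uXXXX escapes.
func marshalASCII(v any) (string, error) {
	b, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	for _, r := range string(b) {
		switch {
		case r < 0x80: // ASCII passes through unchanged
			sb.WriteRune(r)
		case r < 0x10000: // one 4-digit escape is enough
			fmt.Fprintf(&sb, `\u%04X`, r)
		default: // needs a UTF-16 surrogate pair
			hi, lo := utf16.EncodeRune(r)
			fmt.Fprintf(&sb, `\u%04X\u%04X`, hi, lo)
		}
	}
	return sb.String(), nil
}

func main() {
	out, err := marshalASCII(message{String: "我是地球🌍"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"string":"\u6211\u662F\u5730\u7403\uD83C\uDF0D"}
}
```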

This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license.

Original author: AMC, welcome to reprint, but please indicate the source.

Original title: Escape and Unicode encoding in JSON serialization

Release date: May 31, 2020

Link to this article: https://segmentfault.com/a/1190000022797773

This article was first published on: https://cloud.tencent.com/developer/article/1336510 , which is also my blog