Why not recommend using UTF-8 in MySQL?

Time:2021-7-31

I remember last year when I was saving Emoji expressions into mysql, I kept making mistakes and couldn’t import them. Then we found a way — by puttingutf8Change toutf8mb4It’s OK. I didn’t delve into it.

A year later, I saw an article saying that Emoji text takes up 4 bytes and usually needs to be received with UTF-8. Other codes may make mistakes. It suddenly occurred to me that I changed utf8 to utf8mb4 in MySQL last year.

Huh? Isn’t he a utf8 code! So I changed a hammer?

Isn’t MySQL utf8 the real UTF-8 encoding??! This is the sleeping trough.. MySQL has a bug!

After querying many relevant materials with questions, I found that this is actually a historical problem of MySQL ~ ~ I smiled. I didn’t expect that such a cow B’s MySQL would have this past.

1、 Error Reporting Review

Write Emoji text directly into SQL, execute insert statement and report error;

INSERT INTO `csjdemo`.`student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
Values ('20 ',' Chen ha ha ',' male ',' 20 ',' class 181 ',' grade 9 ',' watching movies');

[Err] 1366 – Incorrect string value: ‘xF0x9Fx98x93’ for column ‘NAME’ at row 1

Changed database code, system code and table field coding format →utf8mb4After that, you can:

INSERT INTO `student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
Values (null, 'Chen ha ha', 'male', '20', 'class 181', 'grade 9', 'watching movies');

Why not recommend using UTF-8 in MySQL?

2、 Interesting stories in utf8 MySQL

MySQL’s “utf8” is not really UTF-8.

In mysql, “utf8” encoding only supports up to three bytes per character, while the real UTF-8 is up to four bytes per character.

In utf8 coding, Chinese occupies 3 bytes and other numbers, English and symbols occupy 1 byte.

However, Emoji symbols account for 4 bytes, and some more complex characters and traditional characters also account for 4 bytes. Therefore, if writing fails, it should be changed toutf8mb4。

Why not recommend using UTF-8 in MySQL?

As shown in the figure above, this is the data stored after the code is changed to utf8mb4. You can clearly compare the number of characters and bytes. Because of this, inserting 4 bytes of content into utf8 coding is definitely not possible. It can’t be inserted, can it (Pan stalls).

Why not recommend using UTF-8 in MySQL?

MySQL has not fixed this bug. They released a character set called “utf8mb4” in 2010, which cleverly bypassed this problem.

Of course, they did not advertise the new character set (perhaps because the bug made them feel embarrassed), so that developers are still advised to use “utf8” on the Internet, but these suggestions are wrong.

1. Utf8mb4 is the real UTF-8

Yes, MySQL’s “utf8mb4” is the real “UTF-8”.

MySQL’s “utf8” is an “exclusive encoding”, which can encode few Unicode characters.

Here, mark:All MySQL and MariaDB users using “utf8” should use “utf8mb4” instead and never use “utf8” again.

So what is coding? What is UTF-8?

As we all know, computers use 0 and 1 to store text. For example, if the character “C” is saved as “01000011”, the computer needs to go through two steps when displaying this character:

  1. The computer reads “01000011” and gets the number 67 because 67 is encoded as “01000011”.
  2. The computer looked for 67 in the Unicode character set and found “C”.

alike:

  1. My computer maps “C” to 67 in the Unicode character set.
  2. My computer encodes 67 as “01000011” and sends it to the web server.

Almost all network applications use Unicode character sets because there is no reason to use other character sets.

The Unicode character set contains millions of characters. The simplest encoding is UTF-32, which uses 32 bits per character. This is the easiest thing to do, because computers have always regarded 32 bits as numbers, and what computers are best at is processing numbers. But the problem is, it’s a waste of space.

UTF-8 can save space. In UTF-8, the character “C” only needs 8 bits, and some uncommon characters, such as “” need 32 bits. Other characters may use 16 or 24 bits. For an article like this article, if UTF-8 coding is used, the space occupied is only about a quarter of that of UTF-32.

2. Brief history of utf8

Why do MySQL developers disable “utf8”?We may be able to find the answer from the MySQL version submission log.

MySQL has supported UTF-8 since version 4.1, that is, in 2003, and the UTF-8 standard (RFC 3629) used today appeared later.

The old UTF-8 standard (RFC 2279) supports up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview.

In September of the same year, they adjusted the MySQL source code: “utf8 now only supports 3-byte sequences at most”.

Who submitted the code? Why did he do that? The question is unknown. After migrating to git (MySQL first used bitkeeper), the names of many submitters in the MySQL code base were lost. There were no clues to explain this change in the September 2003 mailing list.

But we can try to guess:

In 2002, MySQL made a decision: if users can ensure that each row of the data table uses the same number of bytes, MySQL can greatly improve its performance. To do this, you need to define the text column as “char”, and each “char” column always has the same number of characters. If the inserted characters are less than the defined number, MySQL will fill in spaces after them. If the inserted characters exceed the defined number, the later excess will be truncated.

When MySQL developers first tried UTF-8, they used 6 bytes for each character, 6 bytes for char (1), 12 bytes for char (2), and so on.

It should be said that their initial behavior was correct. Unfortunately, this version has not been released. But the document says so, and it is widely circulated. All people who know UTF-8 agree with what is written in the document.

However, it is obvious that MySQL developers or manufacturers are worried that users will do these two things:

  • Using char to define columns (at present, char is an antique, but at that time, using char in MySQL would be faster, but it has not been so since 2005).
  • Set the encoding of the char column to “utf8”.

My guess is that MySQL developers wanted to help users who want to win-win in space and speed, but they screwed up “utf8” coding.

So the result is no winner. Users who want to win-win in space and speed actually use more space and slower speed than expected when they are using the char column of “utf8”. Users who want correctness cannot save characters like “” when they use “utf8” encoding, because “” is 4 bytes.

After this illegal character set is published, MySQL cannot repair it because it requires all users to rebuild their databases. final,MySQL re released “utf8mb4” in 2010 to support real UTF-8.

3、 Summary

At present, almost all articles on the network regard “utf8” as a real UTF-8, including the articles I wrote and projects I did before (covering my face); Therefore, I hope more friends can see this article.

I believe there are many people on the same boat as me, which is inevitable.

So,When you build MySQL and MariaDB databases in the future, remember to change the corresponding database codes to utf8mb4.One day, after the programmer who takes over your class sends or your leader finds out this problem, he will feel your technical cow B silently in his heart.
reference resources:The latest video tutorial and learning route on Java basics in 2020

Source:Utf8 in MySQL is not the real UTF-8 encoding!