Why not use UTF-8 in MySQL?

Time:2021-4-17

I remember last year when I was saving Emoji expression to MySQL, I kept making mistakes and couldn’t import it. And then we found a way — by puttingutf8Change toutf8mb4It’s OK. I didn’t go deep into it.

A year later, I read an article that Emoji text takes up four bytes, which usually needs to be received by UTF-8. Other coding may make mistakes. It suddenly occurred to me that I changed utf8 to utf8mb4 in MySQL last year.

Huh? He himself is not utf8 code! So I changed a hammer?

Isn’t MySQL’s utf8 the real UTF-8 code?? I’m not here.. MySQL has a bug!

After querying a lot of relevant materials with doubts, I found that this was actually a problem left over by MySQL history ~ ~ I laughed, but I didn’t expect that such an amazing MySQL would have this past.

1、 Review of error reporting

Write the Emoji text directly into SQL and execute the insert statement to report an error;

INSERT INTO `csjdemo`.`student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
Values ('20 ','chen ha ha','male ','20','class 181 ','grade 9','watching movies');

[Err] 1366 – Incorrect string value: ‘xF0x9Fx98x93’ for column ‘NAME’ at row 1

Changed database code, system code and table field code format →utf8mb4After that, you can:

INSERT INTO `student` (`ID`, `NAME`, `SEX`, `AGE`, `CLASS`, `GRADE`, `HOBBY`)
Values (null, 'Chen ha ha', 'male', '20', 'class 181', 'grade 9', 'watching movies');

Why not use UTF-8 in MySQL?

2、 Interesting stories of utf8 in MySQL

MySQL’s “utf8” is not really UTF-8.

In mysql, “utf8” encoding only supports up to three bytes per character, while the real UTF-8 encoding only supports up to four bytes per character.

In utf8 coding, Chinese takes up three bytes, while other numbers, English and symbols take up one byte.

However, Emoji symbols take up four bytes, and some more complex characters and traditional Chinese characters also take up four bytes. So the write failure should be changed toutf8mb4。

Why not use UTF-8 in MySQL?

As shown in the figure above, this is the input data after the encoding is changed to utf8mb4. You can clearly compare the number of characters and bytes. That’s why it’s impossible to insert 4-byte content into utf8 encoding. You can’t insert it, right.

Why not use UTF-8 in MySQL?

MySQL has not fixed this bug, they released a character set called “utf8mb4” in 2010, which skilfully bypassed this problem.

Of course, they didn’t advertise the new character set (probably because the bug made them feel embarrassed), so that they are still advising developers to use “utf8” on the Internet, but these suggestions are all wrong.

1. Utf8mb4 is the real UTF-8

Yes, MySQL’s “utf8mb4” is the real “UTF-8”.

MySQL’s “utf8” is a kind of “exclusive encoding”, which can encode not many Unicode characters.

Mark here:All MySQL and MariaDB users using “utf8” should switch to “utf8mb4” and never use “utf8”.

So what is coding? What is UTF-8?

As we all know, computers use 0 and 1 to store text. For example, if the character “C” is saved as “01000011”, the computer needs to go through two steps when displaying this character:

  1. The computer reads “01000011” and gets the number 67 because 67 is encoded as “01000011”.
  2. The computer looked up 67 in the Unicode character set and found “C”.

alike:

  1. My computer maps “C” to 67 in the Unicode character set.
  2. My computer encodes 67 as “01000011” and sends it to the web server.

Almost all network applications use the Unicode character set, because there is no reason to use other character sets.

The Unicode character set contains millions of characters. The simplest encoding is UTF-32, which uses 32 bits per character. It’s the easiest thing to do, because computers have always looked at 32 bits as numbers, and computers are best at dealing with numbers. But the problem is, it’s a waste of space.

UTF-8 can save space. In UTF-8, the character “C” only needs 8 bits. Some uncommon characters, such as “” need 32 bits. Other characters may use 16 or 24 bits. For an article like this one, if UTF-8 coding is used, the space occupied is only about a quarter of that of UTF-32.

2. A brief history of utf8

Why do MySQL developers disable “utf8”?We may be able to find out from the MySQL version submission log.

MySQL started to support UTF-8 from version 4.1, that is, in 2003, and today’s UTF-8 standard (RFC 3629) came out later.

The old UTF-8 standard (RFC 2279) supports up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview.

In September of the same year, they adjusted the MySQL source code: “utf8 now only supports 3-byte sequences at most.”.

Who submitted the code? Why did he do that? The question is not known. After migrating to git (where MySQL started with bitkeeper), many of the submitted names in the MySQL code base were lost. No clue to explain the change was found in the September 2003 mailing list.

But we can try to guess:

In 2002, MySQL made a decision: if users can ensure that each row of the data table uses the same number of bytes, then MySQL can greatly improve its performance. For this purpose, the user needs to define the text column as “char”, and each “char” column always has the same number of characters. If the number of characters inserted is less than the number defined, MySQL will fill in spaces after it. If the number of characters inserted exceeds the number defined, the excess part will be truncated.

When MySQL developers first tried UTF-8, they used 6 bytes per character, char (1) used 6 bytes, char (2) used 12 bytes, and so on.

It should be said that their initial behavior was correct, but this version has not been released. But that’s what the document says, and it’s widely circulated. Everyone who knows about UTF-8 agrees with what it says.

However, it is obvious that MySQL developers or vendors are worried that users will do these two things:

  • Use char to define columns (char seems old-fashioned now, but it would be faster to use char in MySQL at that time, but not since 2005).
  • Set the code of char column to “utf8”.

My guess is that MySQL developers wanted to help users who wanted to win both space and speed, but they messed up the “utf8” coding.

So the result is no winner. Users who want to win-win in space and speed actually use more space and slower speed than expected when they use the char column of “utf8”. However, users who want correctness can’t save characters like “” when they use “utf8” encoding, because “” is 4 bytes.

After the illegal character set is released, MySQL cannot fix it because it requires all users to rebuild their database. final,MySQL re released “utf8mb4” in 2010 to support real UTF-8.

3、 Summary

The main reason is that almost all the articles on the Internet regard “utf8” as the real UTF-8, including the articles I wrote and the projects I did (covering my face); therefore, I hope more friends can see this article.

I believe there are many people on the same boat with me, which is inevitable.

So,When you build MySQL and MariaDB databases in the future, remember to change the corresponding database codes to utf8mb4.One day, the programmer who takes over your class or your leader will feel that your technology is excellent.
reference resources:“2020 latest Java foundation intensive lecture video tutorial and learning route! 》

Source:Utf8 in MySQL is not a real UTF-8 code!

Recommended Today

Review of SQL Sever basic command

catalogue preface Installation of virtual machine Commands and operations Basic command syntax Case sensitive SQL keyword and function name Column and Index Names alias Too long to see? Space Database connection Connection of SSMS Connection of command line Database operation establish delete constraint integrity constraint Common constraints NOT NULL UNIQUE PRIMARY KEY FOREIGN KEY DEFAULT […]