Summary: by MySQL utf8 coding pit of the bitter lesson

Time:2020-10-27

Summary: by MySQL utf8 coding pit of the bitter lesson
Source: https://blog.liexing.me/2018/…

Recently, I encountered several projects that were coded by utf8 of MySQL. I remember the painful lesson of coding problems before. I will record it and warn myself.

Once upon a time, I chose utf8 every time I built a library, and I felt that I didn’t know how cool I was than those people who used messy coding. Many years ago, when I was doing a project in a course design, I happily created a user table:

CREATE TABLE `test_user` ( 
`id` int(11) unsigned NOT NULL AUTO_INCREMENT, 
`name` varchar(32) DEFAULT NULL,  
 PRIMARY KEY (`id`)
 ) 
ENGINE=InnoDB DEFAULT CHARSET=utf8;

Then happy new users: insert into test_ User (name) values (“I am”), and then happily reflect on life:

Incorrect string value: '\xF0\x9F\x98\x81' for column 'name' at row 1

Who am I? Where do I come from? What am I doing? Is it the wrong character set in my code? No, I use utf8 everywhere

#What is the utf8 encoding of MySQL?

Let’s start with the official documents:

The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. The utf8mb4 character set uses a maximum of four bytes per character supports supplementary characters:

For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.

For a supplementary character, utf8 cannot store the character at all, whereas utf8mb4 requires four bytes to store it. Because utf8 cannot store the character at all, you have no supplementary characters in utf8 columns and need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.

Let’s take a look at Wikipedia’s interpretation of utf8 encoding:

UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

As you can see, utf8 in MySQL is not the standard utf8 in essence. In mysql, utf8 uses up to three bytes per character, so some Emoji and even some rare Chinese characters can’t be saved, such as “.”.

MySQL has always denied that this is a bug. In 2010, they released the “utf8mb4” character set to circumvent this problem. In mysql, utf8mb4 should be the standard utf8 encoding, and the official thieves secretly added it to the latest document. It can be regarded as a mistake

utf8 is an alias for the utf8mb3 character set.

The utf8mb3 character set will be replaced by utf8mb4 in some future MySQL version. Although utf8 is currently an alias for utf8mb3, at that point utf8 will become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

#A brief history of mysqlutf8

MySQL has been supporting utf8 since version 4.1 in 2003, but the current utf8 standard (RFC 3629) was released later. MySQL used the old version of the utf8 standard (RFC) in the 4.1 preview on March 28, 2002 The standard supports up to 6 bytes per character. In September of the same year, MySQL adjusted its utf8 character set to support up to 3 bytes. This adjustment may only be used to optimize space (char class fields were recommended before 2005, but a utf8 char would occupy 6 bytes in length) and time performance (using char fields in MySQL before 2005 would have better speed). Well, you can see in GitHub that you make complaints about this pit.

Summary: by MySQL utf8 coding pit of the bitter lesson

However, if the character code is released, it can not be easily modified, because if there are already users using it, they need to rebuild their database.

How to remedy it? As can be seen from the latest document above, they will use the current utf8 as the alias of utf8mb3, and will re use utf8 as the alias of utf8mb4 one day in the future, so as to solve this huge hole of many years.

#What is utf8

slightly

# utf8mb4_ unicode_ Ci and utf8mb4_ general_ Ci

In addition to storage, characters also need to be sorted or compared. This operation is related to the coded character set, called collation. Utf8mb4 corresponds to utf8mb4_ unicode_ Ci and utf8mb4_ general_ Ci these two collations.

accuracy

utf8mb4_ unicode_ Ci is based on the standard Unicode to sort and compare, which can keep the precise sorting among different languages;

utf8mb4_ general_ Ci is not based on Unicode collation, so sorting results on some special languages or characters may not be expected.

performance

utf8mb4_ general_ Ci is faster in comparison and sorting because it implements some better performance operations, but on modern servers, this performance improvement is almost negligible.

utf8mb4_ unicode_ Ci uses the rules of Unicode to sort and compare, and its collation is more complex to handle some special characters.

There is little reason to continue to use utf8mb4_ general_ Ci, because the performance difference is very small, it is not as good as better data design, such as using index and so on.

#How to save MySQL from miscoding

  1. Backup, otherwise collapse, only delete database run away;
  2. Upgrade MySQL server to version 5.3.3 and above to support utf8mb4;
  3. Change the character encoding and collation of database, table and column to utf8mb4
# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
# For each column:
ALTER TABLE table_name CHANGE column_name column_name VARCHAR(length) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

4. Check the maximum length of column and index key;

5. Modify the character set of connection, client and server;

6. Repair and optimize all tables to avoid some inexplicable errors. You can use the following methods:

# For each table
REPAIR TABLE table_name;
OPTIMIZE TABLE table_name;

Or use the ‘mysqlcheck’ tool:

$ mysqlcheck -u root -p --auto-repair --optimize --all-databases

#Other pits

Index invalidation caused by different character sets of MySQL table fields

#Reference

  • https://medium.com/@adamhoope…
  • https://dev.mysql.com/doc/ref…
  • https://www.joelonsoftware.co…
  • https://stackoverflow.com/que…
  • https://mathiasbynens.be/note…

If there are any errors or other problems, you are welcome to comment and correct. If you have any help, please click like + forward to share.

Welcome to the official account of the brother of migrant workers:Migrant workers’ technology road
Summary: by MySQL utf8 coding pit of the bitter lesson