A detailed look at collations you may have overlooked in MySQL

Time:2021-7-27

Preface

The string types of the MySQL database include char, varchar, binary, blob, text, enum and set. The choice of type affects both business design and database performance, and char and varchar are the most commonly used. This article takes a closer look at how to use the string types char and varchar.

Definition of char and varchar

char(n) stores fixed-length strings. n ranges from 0 to 255 and counts characters, not bytes. varchar(n) stores variable-length strings. n ranges from 0 to 65535 and also counts characters, although the effective maximum is lower because a row is limited to 65,535 bytes.

When a value exceeds 65,535 bytes, consider the larger text or blob types. The largest of these, longtext and longblob, store up to 4 GB. The difference between the two is that blob has no character set attribute and is pure binary storage.

Unlike traditional relational databases such as Oracle and SQL Server, MySQL's varchar type can store up to 65,535 bytes. Therefore, in MySQL, varchar is sufficient for most scenarios.
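
Because n counts characters while the row limit counts bytes, the largest usable n depends on the character set. The following is a minimal Python sketch of that arithmetic, assuming a table with a single varchar column, a 2-byte length prefix, and 1 byte of NULL flags when the column is nullable. The function name max_varchar_chars is made up for illustration.

```python
# Rough upper bound on n for a single VARCHAR(n) column.
# Assumptions: 65,535-byte row limit, 2-byte length prefix,
# plus 1 byte of NULL flags if the column is nullable.
ROW_LIMIT_BYTES = 65_535
LENGTH_PREFIX_BYTES = 2

# Maximum bytes per character (the Maxlen column of SHOW CHARSET).
MAXLEN = {"latin1": 1, "gbk": 2, "utf8": 3, "utf8mb4": 4}


def max_varchar_chars(charset: str, nullable: bool = False) -> int:
    """Largest n such that VARCHAR(n) still fits in one row."""
    usable = ROW_LIMIT_BYTES - LENGTH_PREFIX_BYTES - (1 if nullable else 0)
    return usable // MAXLEN[charset]


for charset in MAXLEN:
    print(charset, max_varchar_chars(charset))
```

Under these assumptions a utf8mb4 column tops out at 16383 characters, far below 65535.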

Character set

When designing a table, besides defining columns as char or varchar, you also need to define the character set of the column, because the same character maps to different binary values under different character set encodings. Common character sets include GBK and UTF8, and utf8 has long been the usual recommended default.

However, with the rapid growth of the mobile Internet, it is recommended to set MySQL's default character set to utf8mb4 instead. MySQL's utf8 character set stores at most 3 bytes per character, so it cannot store some emoji, such as the smiling face with sunglasses (U+1F60E), whose UTF-8 encoding is the 4-byte sequence 0xF09F988E.

If such an emoji is forcibly inserted into a column that uses the utf8 character set, MySQL throws the following error:


mysql> SHOW CREATE TABLE emoji_test\G
*************************** 1. row ***************************
       Table: emoji_test
Create Table: CREATE TABLE `emoji_test` (
  `a` varchar(100) CHARACTER SET utf8,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.01 sec)

mysql> INSERT INTO emoji_test VALUES (0xF09F988E);
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x8E' for column 'a' at row 1
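
To catch this kind of value before it reaches the database, an application can check whether a string contains any character that needs more than 3 bytes in UTF-8. The sketch below is one way to do that in Python. The helper name fits_utf8mb3 is made up for illustration, and the check only models the 3-byte limit, not everything MySQL validates.

```python
def fits_utf8mb3(text: str) -> bool:
    """True if every character encodes to at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


# The byte sequence from the INSERT statement above, decoded as UTF-8.
emoji = bytes.fromhex("F09F988E").decode("utf-8")

print(hex(ord(emoji)))             # code point of the decoded character
print(len(emoji.encode("utf-8")))  # bytes needed to store it
print(fits_utf8mb3("abc"))
print(fits_utf8mb3(emoji))
```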

As of MySQL 8.0 the default character set is utf8mb4; before 8.0 it was latin1. Because the default differs between versions, configure it explicitly in the configuration file:

[mysqld]
character-set-server = utf8mb4
...

In addition, the maximum number of bytes occupied by char(n) and varchar(n) depends on the character set. In GBK a character takes up to 2 bytes, and in utf8mb4 a character takes up to 4 bytes. So from the perspective of the storage engine, under a multi-byte character set char and varchar are implemented the same way underneath: both are variable-length storage.

For example, under utf8mb4 a char(1) column can hold either the 1-byte character 'a' or a 4-byte emoji, so char is in effect variable-length too.
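
The per-character byte counts can be checked outside the database. This is a small Python sketch that uses Python's own gbk and utf-8 codecs as stand-ins for MySQL's gbk and utf8mb4 character sets. The helper name encoded_size is made up for illustration.

```python
def encoded_size(ch: str, codec: str):
    """Bytes needed to store ch in codec, or None if it cannot be represented."""
    try:
        return len(ch.encode(codec))
    except UnicodeEncodeError:
        return None


samples = ["a", "\u4e2d", "\U0001F60E"]  # Latin letter, CJK character, emoji

for ch in samples:
    print(repr(ch), encoded_size(ch, "gbk"), encoded_size(ch, "utf-8"))
```

The same one-character value therefore occupies anywhere from 1 to 4 bytes, which is why a fixed-width char(1) cannot be fixed-width on disk.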

Since the recommended default character set for MySQL is utf8mb4, you can use varchar in place of char when designing table structures; the underlying storage is the same.

Collation

A collation is a set of rules for comparing and sorting strings. Every character set has a default collation, which you can view with the command SHOW CHARSET:


mysql> SHOW CHARSET LIKE 'utf8%';
+---------+---------------+--------------------+--------+
| Charset | Description   | Default collation  | Maxlen |
+---------+---------------+--------------------+--------+
| utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
| utf8mb4 | UTF-8 Unicode | utf8mb4_0900_ai_ci |      4 |
+---------+---------------+--------------------+--------+
2 rows in set (0.01 sec)

mysql> SHOW COLLATION LIKE 'utf8mb4%';
+----------------------------+---------+-----+---------+----------+---------+---------------+
| Collation                  | Charset | Id  | Default | Compiled | Sortlen | Pad_attribute |
+----------------------------+---------+-----+---------+----------+---------+---------------+
| utf8mb4_0900_ai_ci         | utf8mb4 | 255 | Yes     | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_ci         | utf8mb4 | 305 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_as_cs         | utf8mb4 | 278 |         | Yes      |       0 | NO PAD        |
| utf8mb4_0900_bin           | utf8mb4 | 309 |         | Yes      |       1 | NO PAD        |
| utf8mb4_bin                | utf8mb4 |  46 |         | Yes      |       1 | PAD SPACE     |
......

A collation name ending in _ci is case-insensitive, one ending in _cs is case-sensitive, and one ending in _bin compares strings by the binary values of their characters. Note that MySQL string comparisons use a case-insensitive collation by default:


mysql> SELECT 'a' = 'A';
+-----------+
| 'a' = 'A' |
+-----------+
|         1 |
+-----------+
1 row in set (0.00 sec)

mysql> SELECT CAST('a' as char) COLLATE utf8mb4_0900_as_cs = CAST('A' as CHAR) COLLATE utf8mb4_0900_as_cs as result;
+--------+
| result |
+--------+
|      0 |
+--------+
1 row in set (0.00 sec)
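
As a rough analogy only, the difference between a _ci and a _bin comparison can be imitated in Python. The sketch below uses str.casefold as a stand-in for a case-insensitive collation and a byte comparison as a stand-in for a binary one. MySQL's actual collations follow their own rules, so this does not reproduce them exactly, and the function names are made up for illustration.

```python
def equal_ci(a: str, b: str) -> bool:
    """Approximate a case-insensitive (_ci) comparison."""
    return a.casefold() == b.casefold()


def equal_bin(a: str, b: str) -> bool:
    """Approximate a binary (_bin) comparison on the encoded bytes."""
    return a.encode("utf-8") == b.encode("utf-8")


print(equal_ci("a", "A"))
print(equal_bin("a", "A"))
```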

Keep in mind that most business table designs do not need a case-sensitive collation. Set one only if you are sure your business really requires it.

Correct modification of character set

Many businesses do not consider how the character set affects data storage during design, so the character set has to be converted later. However, many developers find that utf8mb4 characters such as emoji still cannot be inserted after running the following statement:


ALTER TABLE emoji_test CHARSET utf8mb4;

In fact, this statement only changes the table's default character set to utf8mb4. A column added later without an explicit character set will use utf8mb4, but the character set of existing columns is not changed. You can confirm this with the command SHOW CREATE TABLE:


mysql> SHOW CREATE TABLE emoji_test\G
*************************** 1. row ***************************
       Table: emoji_test
Create Table: CREATE TABLE `emoji_test` (
  `a` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

You can see that column a still uses utf8, not utf8mb4. To change the character set of existing columns, use ALTER TABLE ... CONVERT TO, which here converts column a from utf8 to utf8mb4:


mysql> ALTER TABLE emoji_test CONVERT TO CHARSET utf8mb4;
Query OK, 0 rows affected (0.94 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> SHOW CREATE TABLE emoji_test\G
*************************** 1. row ***************************
       Table: emoji_test
Create Table: CREATE TABLE `emoji_test` (
  `a` varchar(100) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci NOT NULL,
  PRIMARY KEY (`a`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
1 row in set (0.00 sec)

Business table structure design practice

User gender design

When designing table structures, you will encounter fields with a fixed set of values. For example, the gender field (sex) is only male or female, and a state field may be limited to values such as run, stop and restart.

In my observation, most developers prefer to store the gender field as an integer type, such as:


CREATE TABLE `User` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `sex` tinyint DEFAULT NULL,
  ......
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Here the tinyint column sex represents the user's gender, but the problems with this design are obvious:

  • Unclear meaning: does 0 or 1 mean female? Each business may have its own unwritten rule;
  • Dirty data: because the column is tinyint, values other than 0 and 1, such as 2, 3 and 4, can also be inserted. The table may end up holding invalid data that is very expensive to clean up later.

Before MySQL version 8.0, you can use the enum string enumeration type, which only allows a limited set of defined values to be inserted. If the parameter sql_mode is set to strict mode, inserting an undefined value raises an error:


mysql> SHOW CREATE TABLE User\G
*************************** 1. row ***************************
       Table: User
Create Table: CREATE TABLE `User` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `sex` enum('M','F') COLLATE utf8mb4_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB
1 row in set (0.00 sec)

mysql> SET sql_mode = 'STRICT_TRANS_TABLES';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> INSERT INTO User VALUES (NULL,'F');
Query OK, 1 row affected (0.08 sec)

mysql> INSERT INTO User VALUES (NULL,'A');
ERROR 1265 (01000): Data truncated for column 'sex' at row 1

However, enum is not a standard SQL data type but a string type specific to MySQL, and the error it raises is not intuitive. This implementation is a compromise, mainly because MySQL before 8.0.16 did not enforce check constraints. Since version 8.0.16, MySQL natively supports check constraints, which make it easy to design columns with a finite set of values:


mysql> SHOW CREATE TABLE User\G
*************************** 1. row ***************************
       Table: User
Create Table: CREATE TABLE `User` (
  `id` bigint NOT NULL AUTO_INCREMENT,
  `sex` char(1) COLLATE utf8mb4_general_ci DEFAULT NULL,
  PRIMARY KEY (`id`),
  CONSTRAINT `user_chk_1` CHECK (((`sex` = _utf8mb4'M') or (`sex` = _utf8mb4'F')))
) ENGINE=InnoDB
1 row in set (0.00 sec)

mysql> INSERT INTO User VALUES (NULL,'M');
Query OK, 1 row affected (0.07 sec)

mysql> INSERT INTO User VALUES (NULL,'Z');
ERROR 3819 (HY000): Check constraint 'user_chk_1' is violated.

In this output, the constraint user_chk_1 restricts the value of column sex to M or F. When the illegal value Z is inserted, MySQL explicitly reports that the check constraint was violated.
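
The same idea can be tried without a MySQL server. The sketch below uses Python's built-in sqlite3 module as a stand-in, so the DDL is adapted to SQLite and the error text differs from MySQL's. SQLite also compares strings case-sensitively by default, unlike the _ci collation shown above. The helper name try_insert is made up for illustration.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL 8.0.16+.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE User (
        id  INTEGER PRIMARY KEY AUTOINCREMENT,
        sex CHAR(1),
        CONSTRAINT user_chk_1 CHECK (sex = 'M' OR sex = 'F')
    )
    """
)


def try_insert(sex):
    """Return True if the row was accepted, False if the CHECK rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO User (sex) VALUES (?)", (sex,))
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("M"))
print(try_insert("Z"))
```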

Account password storage design

Remember: when designing table structures, never store passwords in plain text. Once an attacker gets into the system, user data is at great risk of disclosure. In the financial industry, for example, compliance requires that all private user fields be encrypted, so that even the business itself cannot read what the user stored (private data such as login passwords, mobile phone numbers and credit card information).

Many developers store private data using the MD5 function. This is not wrong, because MD5 is a one-way hash that cannot be reversed. However, the MD5 value of a given input is fixed. For example, the MD5 value of the password 12345678 is always 25d55ad283aa400af464c76d713c07ad.

MD5 can therefore be cracked by brute force, by computing the MD5 value of every possible string. If enumerating every combination is not feasible, an attacker can precompute the values of common passwords such as 111111 and 12345678. The website linked in the original manuscript can look up MD5 values online in this way.
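
A precomputed lookup of this kind takes only a few lines. The following Python sketch builds a tiny table from a made-up list of common passwords and uses it to recover a password from its MD5 value. The list and the function names are assumptions for illustration.

```python
import hashlib


def md5_hex(text: str) -> str:
    """MD5 digest of text as a lowercase hex string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# A made-up dictionary of common passwords, hashed once in advance.
COMMON_PASSWORDS = ["111111", "12345678", "password", "qwerty"]
LOOKUP = {md5_hex(p): p for p in COMMON_PASSWORDS}


def crack(digest: str):
    """Return the password for a known digest, or None if it is not in the table."""
    return LOOKUP.get(digest)


print(crack("25d55ad283aa400af464c76d713c07ad"))
```

Because unsalted MD5 maps each password to one fixed value, the table only has to be built once and works against every database that stores passwords this way.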

Therefore, a salt needs to be added when designing password storage. Each company uses a different salt, so the computed values differ as well. If the salt is psalt, the value stored for the password 12345678 is:


password = MD5('psalt12345678')

This fixed-salt design has three main problems:

  • If the salt is leaked, for example by an employee who has left, external attackers can still brute-force the passwords;
  • The same password always produces the same stored value, so once one user's password is leaked, every other user with the same password is exposed too;
  • MD5 is hard-wired as the algorithm, so if MD5 is broken the impact is severe.

Therefore, a really good password storage design should be: dynamic salt + non-fixed encryption algorithm.

I prefer to design the password in this way. The format of password storage is as follows:


$salt$cryption_algorithm$value

Where:

  • $salt: the dynamic salt. Each time a user registers, the business generates a different salt and stores it in the database. For finer granularity, the salt can be combined with the user's registration date to form an even more dynamic salt.
  • $cryption_algorithm: the encryption algorithm, for example v1 for MD5, v2 for AES256, v3 for AES512, and so on.
  • $value: the encrypted string.

With this format, table User is designed as follows:


CREATE TABLE User (
    id BIGINT NOT NULL AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    sex CHAR(1) NOT NULL,
    password VARCHAR(1024) NOT NULL,
    regDate DATETIME NOT NULL,
    CHECK (sex = 'M' OR sex = 'F'),
    PRIMARY KEY(id)
);

SELECT * FROM User\G
*************************** 1. row ***************************
      id: 1
    name: David
     sex: M
password: $fgfaef$v1$2198687f6db06c9d1b31a030ba1ef074
 regDate: 2020-09-07 15:30:00
*************************** 2. row ***************************
      id: 2
    name: Amy
     sex: F
password: $zpelf$v2$0x860E4E3B2AA4005D8EE9B7653409C4B133AF77AEF53B815D31426EC6EF78D882
 regDate: 2020-09-07 17:28:00

In the example above, the passwords of David and Amy are both 12345678. However, because of the dynamic salt and the dynamic encryption algorithm, the stored values are completely different.

Even if an attacker learns the current encryption algorithm, the business can use the $cryption_algorithm version field to upgrade the stored passwords to a newer algorithm, further guarding against malicious attacks.
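
The $salt$cryption_algorithm$value format can be sketched in application code as follows. This is a minimal Python illustration, not the article's implementation. The mapping of v1 to MD5 follows the text, but v2 is mapped to SHA-256 here purely as an assumption so that the example runs with the standard library, and all function names are made up.

```python
import hashlib
import hmac
import secrets

# Version tag -> hash constructor. The v2 mapping is an assumption.
ALGORITHMS = {"v1": hashlib.md5, "v2": hashlib.sha256}
CURRENT_VERSION = "v2"


def _digest(version: str, salt: str, password: str) -> str:
    return ALGORITHMS[version]((salt + password).encode("utf-8")).hexdigest()


def make_record(password: str, version: str = CURRENT_VERSION) -> str:
    """Build a '$salt$version$value' string with a fresh random salt."""
    salt = secrets.token_hex(8)  # hex only, so it never contains '$'
    return f"${salt}${version}${_digest(version, salt, password)}"


def verify(password: str, record: str) -> bool:
    """Recompute the value with the stored salt and version, then compare."""
    _, salt, version, value = record.split("$")
    return hmac.compare_digest(_digest(version, salt, password), value)


def upgrade_on_login(password: str, record: str) -> str:
    """After a successful login, re-store the password under the current version."""
    if not verify(password, record):
        raise ValueError("wrong password")
    version = record.split("$")[2]
    return record if version == CURRENT_VERSION else make_record(password)


david = make_record("12345678", "v1")
amy = make_record("12345678", "v2")
print(david)
print(amy)
print(upgrade_on_login("12345678", david))
```

Because the version travels with each stored value, old and new algorithms can coexist in the same column, and each row can be upgraded the next time its user logs in.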
