High performance SQL — data type selection and optimization

Time:2021-3-8

Data type selection and optimization

1. Why do we need to optimize the database type

2. Database type optimization principle

3. How to select database field type

Text:

1. Why do we need to optimize the database type?

In our previous database design experience, we often focus on the business fields of the database rather than the specific data types. In fact, the two are complementary. We should pay attention to both the overall situation and the details. This paper introduces how to pay attention to the details of the field types and the optimization principles of the database types.

2. Database type optimization principle

(1) Smaller is usually better.
Try to use the smallest data type to store data.
Smaller data types are usually faster, consume less CPU and disk.

Suppose we now have an order table order, we need to design an order status field, ordstatus. Generally, we can use 0 ~ 9 to replace the order status with numbers, such as 0 – ordered, 1 – paid, 2 – transaction completed. In this way, the status bit of a piece of data can be compressed to 1 bit, and a lot of space can be saved under tens of millions of data. You ask me what to do if 0 ~ 9 is used up? Then go to A-Z!)

Note:
Make sure that the range of stored values is not underestimated. For example, if I want to store a user name, I usually use varchar (50). If I can’t determine which data type is the best, set it to the minimum value that you don’t think will exceed the range. If we need to make the data field longer in the later stage, such as changing varchar (50) to varchar (150), it is a very time-consuming and painful operation for the database.

(2) Just simple.
Simple data type operations usually require fewer CPU cycles. For example, integer is cheaper than character operation. Because string proofreading rules are more complex than integer.

(3) Try to avoid null
It’s better to have the column specified as not null.
If the query contains null, it is more difficult for MySQL to optimize, because the nullable column makes the index, index statistics and value comparison more complex. Nullable columns will use more storage space and need special processing in MySQL. When nullable columns are indexed, each index record requires an extra byte.
If you build an index on a plan column, you should try to avoid designing it as a nullable column

be careful:
Usually, after changing null to not null, the performance improvement is less, so it is not necessary to put this in the first place.

3. How to select database field type

The next step is to choose the appropriate data type, because many different fields of MySQL can store the same data, but the length, range, precision of storage are different, or the physical space required is different.

(1) Integer type
If you want to save integers, you can choose an integer type without decimal point.
There are many kinds of integer ranges, such as tinyint, int, bigint.
The integer type has an optional unsigned attribute, which can double the upper limit of a positive number. For example, tinyint can be stored in the range of 0-255, or – 127-127 if it is not selected.
Signed and unsigned types use the same storage space and have the same performance, so they can be selected according to the situation.

be careful:
MySQL can specify the width for integer types, such as int (11). For most applications, this is meaningless. It does not limit the legal range of values, but only specifies the number of characters to be displayed by some interactive tools of MySQL. For storage and computation, int (1) and int (20) are the same.

(2) Real number type
Real numbers are numbers with decimal points. But it can store not only decimals, but also larger numbers than bignints.

Decimal type is used to store precise decimal, decimal type supports precise calculation, early database versions used floating-point type for precise calculation, but this would lose some precision.

Because the CPU does not support the direct calculation of decision, MySQL implements the accurate calculation of decision itself. Relatively speaking, the CPU directly supports the native floating-point calculation, so the floating-point operation is obviously faster.

Decimal (18,9) will store 9 numbers on both sides of the decimal point, a total of 9 bytes, 4 bytes for the number before the decimal point, 4 bytes for the number after the decimal point, and 1 byte for the decimal point itself.

Floating point types usually use less space than decimal to store the same range of values. Four floats and eight double. Compared with float, double has higher precision and wider range.

(3) String type
Varchar and char types

varchar
Varchar is used to store variable length strings, which is a common string data type. It is more space efficient than the fixed length type because it only uses the necessary space. (when MySQL uses row)_ When format = fixed is created, fixed length storage is used for each row.
Varchar needs to use 1 or 2 extra bytes to record the length of the string. If the maximum length of the column is less than or equal to 255 bytes, only 1 byte is used, otherwise 2 bytes are used.
Varchar saves storage space and is helpful for performance. However, because it is longer, it may make the row longer than the original when updating, which leads to the need to do extra work. If the space occupied by a row grows and there is no more space to store in the page, MySQL will split the row into different fragments for storage.

Varchar is appropriate in the following case:
The maximum length of a string column is much larger than the average length
Columns are rarely updated, so fragmentation is not an issue.

char
The char type is fixed length, and MySQL always allocates space according to the defined string length. When defining char, MySQL will delete all the trailing spaces.
Char is suitable for storing very short strings, such as MD5. Char is more efficient in storage space than varchar, and it is not easy to generate fragments.

Char is also related to binary and varbinary, which store binary strings. Binary strings are very similar to regular strings, but binary strings store bytecode. The comparative advantage of binary is not only reflected in case, but also in speed

Question: the space cost of ‘hello’ stored in varchar (200) and varchar (5) is the same, so what’s the advantage of using short columns?
A: obviously short columns have advantages. Longer columns consume more storage space. When querying again, especially when using memory temporary table for sorting operation, it will be particularly bad. MySQL usually allocates fixed size memory blocks to save memory values.

So the best strategy is to allocate only the space needed.

(4) Date type and time type
datetime
This type can hold a wide range of values, 1001-9999, using 8 bytes of storage space.
timestamp
Time stamp can only be used to represent the period from 1970 to 2038.
Saved the number of milliseconds since 1970
Use only 4 bytes

(5) Select primary key
It is very important to select the appropriate data type for the primary key. The primary key may act as a foreign key in other tables. Therefore, when selecting the data type for the primary key, you should choose the same data type (character code should also be the same) as in the associated table.

be careful:
(1) We should not only consider the storage type, but also consider how MySQL performs calculation and comparison for this type.
(2) On the premise of meeting the range requirements of values and reserving space for future growth, the minimum value type should be selected.
(3) Integer types are usually the best choice for primary keys because they are fast and can use auto increment.
(4) String types should be avoided as much as possible as they are space consuming and usually slower than numeric types. MySQL uses compressed index for strings by default, which leads to a much slower query. In the test, we noticed a performance degradation of up to 6 times.
(5) For completely random strings, such as the strings generated by MD5 and UUID, the new values generated by these functions will be arbitrarily distributed in a large space, resulting in slow insert and some select queries.

The above is the common data type selection and optimization summarized by the author.