How to generate primary keys in a distributed system

Time:2021-7-3

The evolution of the database

When the database becomes the performance bottleneck, we can spread its load by splitting the data across multiple databases and tables (sharding). Once the data is sharded, how should primary keys be generated?

Database auto-increment primary keys

If every database simply uses its own auto-increment counter, IDs will be duplicated across databases. For example, every database will produce primary keys 1, 2, 3.
To avoid this, we can give each database its own initial value and a shared increment. For example, with three databases, the first one starts at 1 with an increment of 3, so its IDs are 1, 4, 7.
This avoids duplicate IDs, but it cannot guarantee that IDs increase in the order the rows are created. For example, database 1 may hold IDs 1, 4, 7, 10 while database 2 holds 2, 5 and database 3 holds only 3. Expanding capacity later is also very troublesome, because the increment is tied to the number of databases.
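The initial-value-plus-increment scheme can be sketched in a few lines of Python. This is only an illustration: the function name auto_increment_ids is made up for the example, and a real deployment would configure the offset and step in the database itself rather than in application code.

```python
from itertools import islice


def auto_increment_ids(start, step):
    """Yield the IDs one database produces for a given initial value and increment."""
    current = start
    while True:
        yield current
        current += step


# Three databases with initial values 1, 2, 3 and a shared increment of 3.
db_count = 3
generators = [auto_increment_ids(start, db_count) for start in range(1, db_count + 1)]

first_ids = [list(islice(gen, 4)) for gen in generators]
print(first_ids)  # [[1, 4, 7, 10], [2, 5, 8, 11], [3, 6, 9, 12]]
```

The sequences never overlap, but adding a fourth database would require changing the increment on every existing database.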
Since the individual databases cannot generate suitable IDs on their own, we can dedicate a separate database to generating auto-increment IDs. The IDs then increase without repeating, but all the load falls on the one database that generates them.
To relieve that load, the ID database can hand out N primary keys at a time, say 100, to a caching service that holds them. When a database needs an ID, it asks the caching service.
The disadvantage is that there is one more layer of service: obtaining an auto-increment ID now involves the caching service plus the database.
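A minimal sketch of this batch approach follows, assuming an in-memory stand-in for the ID database. The class names CounterDatabase and SegmentIdCache and the method reserve are hypothetical, chosen only for this example.

```python
import threading


class CounterDatabase:
    """Hypothetical stand-in for the ID database: it only tracks the highest ID handed out."""

    def __init__(self):
        self._max_id = 0
        self._lock = threading.Lock()
        self.round_trips = 0  # how many times the "database" was contacted

    def reserve(self, count):
        """Reserve `count` consecutive IDs and return the inclusive range (first, last)."""
        with self._lock:
            first = self._max_id + 1
            self._max_id += count
            self.round_trips += 1
            return first, self._max_id


class SegmentIdCache:
    """Caching service: fetches a block of IDs at once and serves them from memory."""

    def __init__(self, database, segment_size=100):
        self._database = database
        self._segment_size = segment_size
        self._next = 1
        self._last = 0  # empty segment, so the first call triggers a reservation
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            if self._next > self._last:
                # Current block is used up: reserve a new one from the database.
                self._next, self._last = self._database.reserve(self._segment_size)
            value = self._next
            self._next += 1
            return value


database = CounterDatabase()
cache = SegmentIdCache(database, segment_size=100)
ids = [cache.next_id() for _ in range(250)]
print(ids[0], ids[-1], database.round_trips)  # 1 250 3
```

Serving 250 IDs costs only three round trips to the database. If the caching service restarts, the unused part of its current block is simply skipped, which leaves gaps but no duplicates.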
Can Redis generate the primary keys instead? Redis is fast, but it does not persist every write in real time, which can lead to duplicate primary keys. For example, suppose the counter is at 9 and an INCR makes it 10, and then Redis crashes before the new value is persisted. After the restart the counter is still 9, so the next INCR returns 10 again, and two records end up with ID 10. Replication does not fully solve this either: because replication is asynchronous, the master may fail before the write reaches the slave, and the ID will still repeat.
Besides the load on the database, auto-increment primary keys may also reveal business secrets, because it is easy for others to work out what the next primary key will be.

UUID

Generating primary keys only in the database puts heavy pressure on it, so the application can generate them instead. The simplest option is a UUID, which is fast to generate and in practice does not repeat. The disadvantages are that UUIDs are not increasing, and the UUID string is relatively long, so index performance is poor.
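As a short illustration, Python's standard uuid module can generate random (version 4) UUIDs in the application. The choice of version 4 is an assumption made for the example; the text above does not say which UUID version is meant.

```python
import uuid

# Generate a few random UUIDs entirely inside the application, with no database call.
ids = [uuid.uuid4() for _ in range(3)]

for value in ids:
    # The canonical text form is 36 characters (32 hex digits plus 4 hyphens).
    print(value, len(str(value)))

# Random UUIDs carry no ordering: sorting them says nothing about creation order.
print(sorted(ids) == ids)
```

The 36-character string, or 16 bytes in binary form, is what makes the index larger than one built on a 64-bit integer key.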

Timestamp

Using the current millisecond timestamp as the primary key has the advantage of being simple and increasing, but the disadvantage is that values may repeat. For example, if 10 concurrent requests arrive within the same millisecond, they all get the same ID. Switching to microsecond precision or appending a random string to the timestamp reduces the repetition, but the risk remains.
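The collision problem is easy to reproduce. The sketch below, in which the helper timestamp_id is a made-up name, requests many millisecond timestamps in a tight loop on a single thread and counts how many values repeat.

```python
import time


def timestamp_id():
    """Return the current Unix time in milliseconds, used directly as an ID."""
    return time.time_ns() // 1_000_000


# Ask for many IDs back to back; most of the calls land in the same millisecond.
ids = [timestamp_id() for _ in range(10_000)]
duplicates = len(ids) - len(set(ids))
print(f"{duplicates} of {len(ids)} IDs are duplicates")
```

Even without any concurrency, almost every ID collides, because the loop finishes within a handful of milliseconds.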

Snowflake

Snowflake is Twitter's open source distributed ID generation algorithm. It generates a 64-bit ID. The first bit is 0, so the ID is a positive number. The next 41 bits are the binary form of the current timestamp, the following 10 bits are the binary form of the machine code, and the last 12 bits are a sequence number that counts the IDs generated within the same millisecond.
For example, suppose the current time is 2020-01-01 00:00:00 (UTC+8). The millisecond timestamp is 1577808000000, which in binary is the 41-bit value 10110111101011100101011110111010000000000. With the leading 0, the first 42 bits are:
0 10110111101011100101011110111010000000000
Suppose the current machine code is 100, which is 1100100 in binary. Since that is fewer than 10 bits, we pad three zeros in front to get 0001100100. The first 52 bits are then:
0 10110111101011100101011110111010000000000 0001100100
If one machine receives many requests within the same millisecond, the timestamp and machine code alone would produce duplicate IDs, so Snowflake uses the last 12 bits as a counter. For example, the first request gets 1 and the second gets 2. Suppose we are the 200th to request an ID in the current millisecond: 200 is 11001000 in binary, or 000011001000 after padding with zeros to 12 bits. The full 64 bits are then:
0 10110111101011100101011110111010000000000 0001100100 000011001000
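The worked example can be checked with a few bit shifts. The sketch below assumes 1577808000000 as the millisecond timestamp for 2020-01-01 00:00:00 (UTC+8) and uses the raw Unix timestamp, as the example does. The function name compose_id is made up for the illustration.

```python
TIMESTAMP_BITS = 41
MACHINE_BITS = 10
SEQUENCE_BITS = 12


def compose_id(timestamp_ms, machine_id, sequence):
    """Pack the three fields into one 64-bit integer whose top (sign) bit stays 0."""
    if not 0 <= timestamp_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("timestamp does not fit in 41 bits")
    if not 0 <= machine_id < (1 << MACHINE_BITS):
        raise ValueError("machine code does not fit in 10 bits")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence number does not fit in 12 bits")
    return (
        (timestamp_ms << (MACHINE_BITS + SEQUENCE_BITS))
        | (machine_id << SEQUENCE_BITS)
        | sequence
    )


example = compose_id(timestamp_ms=1577808000000, machine_id=100, sequence=200)
bits = format(example, "064b")  # zero-padded to the full 64 bits

# Sign bit, timestamp, machine code, sequence number.
print(bits[0], bits[1:42], bits[42:52], bits[52:])
# 0 10110111101011100101011110111010000000000 0001100100 000011001000
```

Because the timestamp occupies the highest bits, IDs generated later compare greater than IDs generated earlier, regardless of machine code.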
If the last 12 bits are not enough, we can reduce the number of bits used for the machine code, so that the sequence number can take larger values.
The disadvantage is also obvious: because the algorithm depends on the timestamp, setting the system clock back may cause duplicate IDs.
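One common way to guard against this is to remember the last timestamp used and refuse to issue IDs when the clock reports an earlier time. The generator below is a minimal sketch under that assumption, not Twitter's implementation. The class name SnowflakeGenerator and its clock parameter are hypothetical, the sequence starts at 0 in each millisecond, and the raw Unix timestamp is used where production implementations typically subtract a custom epoch.

```python
import threading
import time


class SnowflakeGenerator:
    """Sketch of a Snowflake-style generator: 41-bit timestamp, 10-bit machine, 12-bit sequence."""

    def __init__(self, machine_id, clock=lambda: time.time_ns() // 1_000_000):
        if not 0 <= machine_id < 1024:
            raise ValueError("machine code must fit in 10 bits")
        self._machine_id = machine_id
        self._clock = clock  # returns the current time in milliseconds
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:
            now = self._clock()
            if now < self._last_ms:
                # The clock was set back: issuing an ID now could repeat an old one.
                raise RuntimeError("clock moved backwards; refusing to generate an ID")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) & 0xFFF
                if self._sequence == 0:
                    # All 4096 sequence numbers are used: wait for the next millisecond.
                    while now <= self._last_ms:
                        now = self._clock()
            else:
                self._sequence = 0
            self._last_ms = now
            return (now << 22) | (self._machine_id << 12) | self._sequence


generator = SnowflakeGenerator(machine_id=100)
ids = [generator.next_id() for _ in range(5)]
print(ids == sorted(set(ids)))  # True: the IDs are unique and increasing
```

Raising an error is the simplest policy. Other options, such as waiting until the clock catches up, trade availability for safety in different ways.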