How do I handle primary key IDs after splitting databases and tables?

Time:2021-8-24

In fact, the point is not to walk fast but to keep walking. The important thing is to find a road that suits you.

This is a problem you must face after splitting databases and tables: how do you generate IDs? If the data is spread across multiple tables and each table's auto-increment ID starts from 1, the IDs will collide. You need a globally unique ID instead. This is something you have to consider in a real production environment.

Database-based implementation schemes

Database auto-increment ID

With this approach, every time your system needs an ID, it inserts a row with no business meaning into a dedicated table in one database and reads back the auto-increment ID that the database generated. That ID is then written to the corresponding sharded database and table.

The advantage of this scheme is that it is simple and convenient, and anyone can use it. The disadvantage is that a single database generates the auto-increment IDs, so under high concurrency it becomes a bottleneck. If you want to improve on it, set up a dedicated service that reads the current maximum ID, reserves a range of several IDs, returns that batch in one call, and then advances the stored maximum past the reserved range. In any case, it still relies on a single database.
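
The batch idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name SegmentIdAllocator, the batch size, and the AtomicLong that stands in for the "current maximum ID" row in the database are all hypothetical, and a real service would replace that AtomicLong with an atomic update against the database.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical batch ID allocator; not taken from any particular library. */
class SegmentIdAllocator {
    // Assumption: stands in for the "current maximum ID" stored in a single database.
    private final AtomicLong storedMaxId = new AtomicLong(0);
    private final int batchSize;
    private long next = 1;     // next ID to hand out from the local batch
    private long batchEnd = 0; // last ID (inclusive) of the local batch; empty at start

    SegmentIdAllocator(int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
    }

    /** Returns the next ID, reserving a new batch only when the local one is used up. */
    synchronized long nextId() {
        if (next > batchEnd) {
            // One simulated "database round trip": advance the stored maximum by batchSize.
            batchEnd = storedMaxId.addAndGet(batchSize);
            next = batchEnd - batchSize + 1;
        }
        return next++;
    }

    /** Exposes the simulated stored maximum so the reservation can be observed. */
    long storedMaxId() {
        return storedMaxId.get();
    }

    public static void main(String[] args) {
        SegmentIdAllocator allocator = new SegmentIdAllocator(100);
        for (int i = 0; i < 5; i++) {
            System.out.println(allocator.nextId());
        }
        System.out.println("stored max: " + allocator.storedMaxId());
    }
}
```

With a batch size of 100, only one call in every hundred touches the shared counter, which is where the relief for the single database would come from.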

Suitable scenario: there are two reasons for splitting databases and tables: either the concurrency on a single database is too high, or the amount of data in a single database is too large. If your concurrency is not high but your data volume is too large, you can use this scheme when sharding. Peak concurrency is probably a few hundred requests per second at most, so a separate database and table can generate the auto-increment primary keys.

Set the step size of a database sequence or auto-increment column

You can scale horizontally by setting the step size of a database sequence or of a table's auto-increment column.

For example, suppose there are eight service nodes. Each node uses its own sequence to generate IDs. The starting value of each sequence is different, increasing from one node to the next (for example, 1 through 8), and every sequence has a step size of 8.
Suitable scenario: this scheme is relatively simple to implement and meets the performance goal while preventing duplicate IDs. However, the number of service nodes and the step size are both fixed, so adding service nodes later is difficult.
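
The eight-node example can be simulated in memory as a rough sketch. The class SteppedSequence is hypothetical and only mimics what a database sequence with a start value and step would do (in MySQL, the corresponding settings are the auto_increment_offset and auto_increment_increment system variables).

```java
/** Hypothetical per-node ID source: node k of n starts at k and advances by n. */
class SteppedSequence {
    private final long step;
    private long current;

    SteppedSequence(long start, long step) {
        if (step < 1 || start < 1 || start > step) {
            throw new IllegalArgumentException("need 1 <= start <= step");
        }
        this.step = step;
        this.current = start - step; // so the first call to next() returns start
    }

    /** Returns start, start + step, start + 2 * step, ... */
    synchronized long next() {
        current += step;
        return current;
    }

    public static void main(String[] args) {
        int nodes = 8; // fixed node count, which is also the step size
        for (int node = 1; node <= 2; node++) {
            SteppedSequence seq = new SteppedSequence(node, nodes);
            System.out.println("node " + node + ": " + seq.next() + ", " + seq.next() + ", " + seq.next());
        }
    }
}
```

Every ID from node k is congruent to k modulo 8, so two nodes never collide. The same property shows why a ninth node is hard to add: the step size is baked into every ID already issued.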

UUID

The advantage is that a UUID is generated locally rather than by the database. The disadvantage is that a UUID is long and takes up a lot of space, so it performs poorly as a primary key. More importantly, UUIDs are not ordered, which causes many random writes when the B+ tree index is updated (sequential IDs produce mostly sequential writes). Because a new key cannot simply be appended, it has to be inserted in the middle of the index: the whole B+ tree node is read into memory, the record is inserted, and the whole node is written back to disk. This significantly reduces performance when records are large.

Suitable scenario: if you need a randomly generated file name, serial number, or similar, a UUID is fine, but do not use a UUID as the primary key.

UUID.randomUUID().toString().replace("-", "")
-> dff7a1016f1a4a889c9e76e1bfd8ecb4

Get the current system time

This simply uses the current time as the ID. The problem is that under high concurrency, for example thousands of requests per second, there will be duplicates, so it is not suitable on its own. Basically, don't consider it.

Suitable scenario: this scheme is generally used by concatenating the current time with other business fields to form an ID. If that is acceptable for your business, it is a workable option: the business field values plus the current time form a globally unique number.
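
As a minimal sketch of such a concatenated number: the prefix, the userId field, and the class name BusinessNumber below are hypothetical examples, not fields from any real system. The result is unique only under the assumption that the same user never produces two numbers within the same millisecond.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical business number: prefix + timestamp to the millisecond + padded user ID. */
class BusinessNumber {
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyyMMddHHmmssSSS");

    // Assumption: prefix and userId are illustrative business fields.
    static String build(String prefix, LocalDateTime time, long userId) {
        // Keep the last six digits of the user ID so the number has a fixed length.
        return prefix + time.format(FORMAT) + String.format("%06d", userId % 1_000_000);
    }

    public static void main(String[] args) {
        System.out.println(build("ORD", LocalDateTime.now(), 42L));
    }
}
```

Whether the chosen business fields really rule out duplicates is a business decision, which is why this only works when the business accepts that assumption.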

Snowflake algorithm

The snowflake algorithm is Twitter's open-source distributed ID generation algorithm, implemented in Scala. It produces a 64-bit long ID: 1 bit is unused, 41 bits hold the timestamp in milliseconds, 10 bits hold the working machine ID, and 12 bits hold the sequence number.

  • 1 bit: unused. Why? In two's complement, a leading 1 bit makes the number negative, but the IDs we generate are all positive, so the first bit is always 0.
  • 41 bits: the timestamp in milliseconds. 41 bits can represent values up to 2^41 - 1, that is, 2^41 - 1 milliseconds, which is about 69 years.
  • 10 bits: the working machine ID, which means the service can be deployed on at most 2^10 = 1024 machines. Of these 10 bits, 5 represent the machine room ID and 5 represent the machine ID, so there can be at most 2^5 = 32 machine rooms, each with at most 2^5 = 32 machines.
  • 12 bits: distinguishes the IDs generated within the same millisecond. The largest integer that 12 bits can represent is 2^12 - 1 = 4095, so the sequence number can distinguish 2^12 = 4096 different IDs within the same millisecond.
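
The figures in the list above can be checked with a few lines of arithmetic. This is just a sketch; the class name SnowflakeLayout and its helper methods are made up for the illustration.

```java
/** Hypothetical helper that recomputes the capacities of the 1 + 41 + 5 + 5 + 12 bit layout. */
class SnowflakeLayout {
    static final int TIMESTAMP_BITS = 41;
    static final int DATACENTER_BITS = 5;
    static final int WORKER_BITS = 5;
    static final int SEQUENCE_BITS = 12;

    /** Largest value that fits in the given number of bits: 2^bits - 1. */
    static long maxValue(int bits) {
        return (1L << bits) - 1;
    }

    /** Number of years covered by a millisecond counter of the given width. */
    static double yearsCovered(int timestampBits) {
        double millisPerYear = 365.25 * 24 * 60 * 60 * 1000;
        return (1L << timestampBits) / millisPerYear;
    }

    public static void main(String[] args) {
        System.out.println("max machine room ID: " + maxValue(DATACENTER_BITS));
        System.out.println("max machine ID:      " + maxValue(WORKER_BITS));
        System.out.println("max sequence number: " + maxValue(SEQUENCE_BITS));
        System.out.printf("timestamp range:     %.1f years%n", yearsCovered(TIMESTAMP_BITS));
    }
}
```
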
public class IdWorker {
    private long workerId;
    private long datacenterId;
    private long sequence;

    public IdWorker(long workerId, long datacenterId, long sequence) {
        // Sanity check: the machine ID and the machine room ID passed in
        // must each be between 0 and 31 inclusive (5 bits each)
        if (workerId > maxWorkerId || workerId < 0) {
            throw new IllegalArgumentException(
                    String.format("worker Id can't be greater than %d or less than 0", maxWorkerId));
        }
        if (datacenterId > maxDatacenterId || datacenterId < 0) {
            throw new IllegalArgumentException(
                    String.format("datacenter Id can't be greater than %d or less than 0", maxDatacenterId));
        }
        System.out.printf(
                "worker starting. timestamp left shift %d, datacenter id bits %d, worker id bits %d, sequence bits %d, workerid %d%n",
                timestampLeftShift, datacenterIdBits, workerIdBits, sequenceBits, workerId);
        this.workerId = workerId;
        this.datacenterId = datacenterId;
        this.sequence = sequence;
    }
    private long twepoch = 1288834974657L;
    private long workerIdBits = 5L;
    private long datacenterIdBits = 5L;
    // Bit trick: -1L ^ (-1L << 5) = 31, so 5 bits give a maximum value of 31 and the machine ID must be in 0..31
    private long maxWorkerId = -1L ^ (-1L << workerIdBits);
    // Likewise, 5 bits give a maximum value of 31, so the machine room ID must be in 0..31
    private long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
    private long sequenceBits = 12L;
    private long workerIdShift = sequenceBits;
    private long datacenterIdShift = sequenceBits + workerIdBits;
    private long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;
    private long sequenceMask = -1L ^ (-1L << sequenceBits);
    private long lastTimestamp = -1L;

    public long getWorkerId() {
        return workerId;
    }

    public long getDatacenterId() {
        return datacenterId;
    }

    public long getTimestamp() {
        return System.currentTimeMillis();
    }

    public synchronized long nextId() {
        //Here is the current timestamp, in milliseconds
        long timestamp = timeGen();
        if (timestamp < lastTimestamp) {
            System.err.printf("clock is moving backwards.  Rejecting requests until %d.%n", lastTimestamp);
            throw new RuntimeException(String.format(
                    "Clock moved backwards.  Refusing to generate id for %d milliseconds", lastTimestamp - timestamp));
        }
        if (lastTimestamp == timestamp) {
            // The mask keeps the sequence within 12 bits (0 to 4095), so at most 4096 IDs per millisecond.
            // When the sequence wraps around to 0, this millisecond is exhausted and we wait for the next one.
            sequence = (sequence + 1) & sequenceMask;
            if (sequence == 0) {
                timestamp = tilNextMillis(lastTimestamp);
            }
        } else {
            sequence = 0;
        }
        //Here, record the timestamp of the last generated ID in milliseconds
        lastTimestamp = timestamp;
        // Shift the timestamp offset left by 22 bits so it occupies the 41 bits after the sign bit;
        // shift the machine room ID left by 17 bits and the machine ID left by 12 bits;
        // the sequence number fills the lowest 12 bits.
        // OR-ing the parts together yields a 64-bit value, which is returned as a long
        return ((timestamp - twepoch) << timestampLeftShift) | (datacenterId << datacenterIdShift)
                | (workerId << workerIdShift) | sequence;
    }

    private long tilNextMillis(long lastTimestamp) {
        long timestamp = timeGen();
        while (timestamp <= lastTimestamp) {
            timestamp = timeGen();
        }
        return timestamp;
    }

    private long timeGen() {
        return System.currentTimeMillis();
    }
    //---------------- test---------------
    public static void main(String[] args) {
        IdWorker worker = new IdWorker(1, 1, 1);
        for (int i = 0; i < 30; i++) {
            System.out.println(worker.nextId());
        }
    }
}

In other words, 41 bits hold the current timestamp in milliseconds; 5 bits hold the machine room ID you pass in (0 to 31), and another 5 bits hold the machine ID you pass in (also 0 to 31). The remaining 12 bits are the sequence number: if a request arrives in the same millisecond as the previously generated ID, the sequence is incremented, giving up to 4096 sequence numbers per millisecond.
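
To make the layout concrete, here is a rough sketch that packs the four parts into an ID and then takes the ID apart again. The class SnowflakeDecoder and its method names are hypothetical. The epoch constant and the 12/5/5 bit widths are copied from the IdWorker class above.

```java
/** Hypothetical helper that composes and decomposes IDs using the IdWorker bit layout. */
class SnowflakeDecoder {
    static final long TWEPOCH = 1288834974657L; // same custom epoch as IdWorker
    static final int SEQUENCE_BITS = 12;
    static final int WORKER_BITS = 5;
    static final int DATACENTER_BITS = 5;
    static final int WORKER_SHIFT = SEQUENCE_BITS;                     // 12
    static final int DATACENTER_SHIFT = SEQUENCE_BITS + WORKER_BITS;   // 17
    static final int TIMESTAMP_SHIFT = DATACENTER_SHIFT + DATACENTER_BITS; // 22

    /** Packs the parts the same way IdWorker.nextId() does. */
    static long compose(long timestampMillis, long datacenterId, long workerId, long sequence) {
        return ((timestampMillis - TWEPOCH) << TIMESTAMP_SHIFT)
                | (datacenterId << DATACENTER_SHIFT)
                | (workerId << WORKER_SHIFT)
                | sequence;
    }

    static long timestampMillis(long id) {
        return (id >> TIMESTAMP_SHIFT) + TWEPOCH;
    }

    static long datacenterId(long id) {
        return (id >> DATACENTER_SHIFT) & ((1L << DATACENTER_BITS) - 1);
    }

    static long workerId(long id) {
        return (id >> WORKER_SHIFT) & ((1L << WORKER_BITS) - 1);
    }

    static long sequence(long id) {
        return id & ((1L << SEQUENCE_BITS) - 1);
    }

    public static void main(String[] args) {
        long id = compose(System.currentTimeMillis(), 3, 7, 42);
        System.out.println("id = " + id);
        System.out.println("machine room = " + datacenterId(id)
                + ", machine = " + workerId(id)
                + ", sequence = " + sequence(id));
    }
}
```

Because the timestamp sits in the highest bits, IDs generated later compare as larger numbers, which is what keeps them roughly ordered for index writes.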

So you wrap this utility class in a service and initialize one instance for each machine in each machine room. Initially, the sequence number for that machine is 0. Then, each time a request arrives asking a given machine in a given machine room to generate an ID, the service finds the corresponding worker and generates it.

You can build your own company's service on top of the snowflake algorithm and even repurpose the machine room ID and machine ID. The 5 bits + 5 bits are simply reserved, and you can use them for something else with business meaning.

The snowflake algorithm is fairly reliable, so if you really need distributed ID generation under high concurrency, it should perform well. It is generally sufficient for scenarios with tens of thousands of concurrent requests per second.