Explain the principle of varint coding in detail

Time:2020-4-29

What is varint encoding

Varint is a method of serializing an integer using one or more bytes, so the encoded length varies with the value. A 32-bit integer takes 1 to 5 bytes after varint encoding: 1 byte for small values and 5 bytes for the largest ones. A 64-bit integer takes 1 to 10 bytes. In practice, small numbers occur far more often than large ones, so varint encoding compresses well in most scenarios.
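
As a rough check on those sizes, the sketch below measures the encoded length of a few values. It relies on Go's standard encoding/binary package, which uses the same varint format as Protocol Buffers; the helper name varintLen is made up for this illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// varintLen reports how many bytes the varint encoding of x occupies.
// Hypothetical helper written for this illustration.
func varintLen(x uint64) int {
	var buf [binary.MaxVarintLen64]byte // 10 bytes is enough for any uint64
	return binary.PutUvarint(buf[:], x)
}

func main() {
	// Boundary values: each extra byte buys 7 more bits of payload.
	for _, x := range []uint64{1, 127, 128, 16383, 16384, 1<<32 - 1, 1<<64 - 1} {
		fmt.Printf("%d -> %d byte(s)\n", x, varintLen(x))
	}
}
```

The length grows by one byte each time the value needs another 7 bits, which is why the largest 32-bit value needs 5 bytes and the largest 64-bit value needs 10.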

Encoding principle

In a varint, every byte except the last has its most significant bit (MSB) set. An MSB of 1 means that the next byte still belongs to the current number; an MSB of 0 marks the last byte of the number. The lower 7 bits of each byte store the number's two's complement binary representation in groups of 7 bits, least significant group first. In other words, the 7-bit groups of a varint are arranged in little-endian order.
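
To make the byte layout concrete, here is a minimal sketch that splits each byte of an encoded value into its MSB flag and its 7-bit payload. The helper describe is hypothetical, and the input bytes 0xAC 0x02 are the varint encoding of 300.

```go
package main

import "fmt"

// describe splits one varint byte into its continuation flag (the MSB)
// and its 7-bit payload. Hypothetical helper for illustration.
func describe(b byte) (more bool, payload byte) {
	return b&0x80 != 0, b & 0x7F
}

func main() {
	// 0xAC 0x02 is the varint encoding of 300.
	for _, b := range []byte{0xAC, 0x02} {
		more, payload := describe(b)
		fmt.Printf("%08b  more=%-5v payload=%07b\n", b, more, payload)
	}
}
```

The first byte reports more=true with payload 0101100 (44), the second reports more=false with payload 0000010 (2), and 44 + 2×128 gives back 300.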

How to arrange the bytes

There are two common conventions for arranging the bytes of a multi-byte integer. If, going from low to high storage addresses, the least significant byte (analogous to the least significant bit) comes before the most significant byte, the order is called little-endian; otherwise it is called big-endian. Byte order must be considered in network applications, because different machine types may use different native byte orders, so data is converted to the standard network byte order (big-endian) before transmission.

Generally speaking, big-endian arranges the bytes in the order in which numbers are written, while little-endian arranges them in reverse.
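
The two byte orders can be compared directly with Go's standard encoding/binary package. This is only an illustrative sketch; the helper layouts and the sample value 0x0A0B0C0D are chosen for the example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// layouts returns the big-endian and little-endian byte layouts of v.
// Hypothetical helper for illustration.
func layouts(v uint32) (big, little [4]byte) {
	binary.BigEndian.PutUint32(big[:], v)
	binary.LittleEndian.PutUint32(little[:], v)
	return big, little
}

func main() {
	big, little := layouts(0x0A0B0C0D)
	fmt.Printf("big-endian:    % X\n", big[:])    // bytes in writing order
	fmt.Printf("little-endian: % X\n", little[:]) // bytes reversed
}
```

Note that varint applies the little-endian idea to 7-bit groups rather than to whole 8-bit bytes.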

The diagram below makes this easier to follow.

[Figure: varint encoding of the number 123456]

In the figure, the number 123456 is varint-encoded. In binary, 123456 is 1 11100010 01000000. Taking 7 bits at a time from low to high and prefixing each group with the most significant bit gives 11000000 11000100 00000111. So after varint encoding, 123456 occupies three bytes: 192 196 7.
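
The same walkthrough can be reproduced in a few lines of Go. This is a sketch for following the steps, not the library implementation; the helpers groups and mark are hypothetical names.

```go
package main

import "fmt"

// groups splits x into 7-bit groups, least significant group first.
// Hypothetical helper for this walkthrough.
func groups(x uint64) []byte {
	var out []byte
	for {
		out = append(out, byte(x&0x7F))
		x >>= 7
		if x == 0 {
			return out
		}
	}
}

// mark sets the continuation bit (MSB) on every group except the last.
func mark(g []byte) []byte {
	out := make([]byte, len(g))
	for i, p := range g {
		out[i] = p
		if i < len(g)-1 {
			out[i] |= 0x80
		}
	}
	return out
}

func main() {
	g := groups(123456)
	for i, b := range mark(g) {
		fmt.Printf("group %d: %07b -> byte %08b (%d)\n", i, g[i], b, b)
	}
}
```

The three groups are 1000000, 1000100 and 0000111, and after marking they become the bytes 192, 196 and 7 from the figure.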

Decoding takes the bytes in order and strips the most significant bit from each. Because the groups are little-endian, the first byte decoded supplies the lowest 7 bits, and the 7 bits of each later byte are placed above the bits already decoded. Once the last byte (the one whose MSB is 0) has been merged in, the result is the original number.
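
To trace the decoding of 192 196 7 byte by byte, the following sketch records the running value after each byte is merged in. The helper decodeSteps is hypothetical and assumes the input holds exactly one complete, well-formed varint.

```go
package main

import "fmt"

// decodeSteps decodes a single varint from buf and returns the running
// value after each byte has been merged in. Hypothetical helper; it assumes
// buf holds one complete, well-formed varint and does no error checking.
func decodeSteps(buf []byte) []uint64 {
	var x uint64
	var steps []uint64
	for i, b := range buf {
		// Strip the MSB, then place the 7 bits above those already decoded.
		x |= uint64(b&0x7F) << (7 * uint(i))
		steps = append(steps, x)
		if b&0x80 == 0 {
			break // MSB is 0: this was the last byte
		}
	}
	return steps
}

func main() {
	for i, v := range decodeSteps([]byte{192, 196, 7}) {
		fmt.Printf("after byte %d: %d\n", i, v)
	}
}
```

The running value goes 64, then 8768 (64 + 68×128), then 123456 (8768 + 7×16384).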

Encoding implementation

Protocol Buffers makes heavy use of varint encoding. The github.com/golang/protobuf/proto library contains a Go implementation of varint encoding and decoding, in which the process described above is carried out with bit operations.

const maxVarintBytes = 10 // maximum length of a varint

// EncodeVarint returns the varint encoding of x.
func EncodeVarint(x uint64) []byte {
    var buf [maxVarintBytes]byte
    var n int
    // Encoding rules:
    // 1. The highest bit of each byte is reserved as a continuation flag: 1 means the
    //    next byte also belongs to the current number, 0 means this is its last byte.
    //    That leaves only the lower 7 bits of each byte for data.
    //    So if x > 127, x does not fit in one byte and the highest bit of the current
    //    byte must be 1; see buf[n] = 0x80 | uint8(x&0x7F).
    //    0x80 sets the highest bit, and x&0x7F extracts the lower 7 bits of x.
    //    x >>= 7 must then run before the next byte is processed.
    // 2. If x <= 127, x fits in 7 bits, so the highest bit stays 0 and the last
    //    byte is simply buf[n] = uint8(x).
    //
    // While x needs more than 7 bits (127 is the largest 7-bit value), emit a byte with the highest bit set.
    for n = 0; x > 127; n++ {
        // x&0x7F takes the lower 7 bits; 0x80 sets the highest bit.
        buf[n] = 0x80 | uint8(x&0x7F)
        // Shift right by 7 bits to process the remaining data.
        x >>= 7
    }
    // Last byte: its highest bit is 0.
    buf[n] = uint8(x)
    n++
    return buf[0:n]
}
  • The binary representation of 0x7F is 0111 1111, so x & 0x7F yields the lowest seven bits of x (the higher bits are discarded by the AND with 0)
  • The binary representation of 0x80 is 1000 0000, so 0x80 | uint8(x&0x7F) puts a 1 (the MSB) in front of the 7 bits extracted from x

Decoding implementation

Decoding is the reverse of encoding and can likewise be done quickly with bit operations. It is not hard to follow if you read the comments in the code below and work through an example on paper.

func DecodeVarint(buf []byte) (x uint64, n int) {
    for shift := uint(0); shift < 64; shift += 7 {
        if n >= len(buf) {
            return 0, 0
        }
        b := uint64(buf[n])
        n++
        // Three steps:
        // 1: b & 0x7F extracts the lower 7 bits of valid data
        // 2: (b & 0x7F) << shift moves them into place; because the order is little-endian, each successive byte sits 7 bits higher
        // 3: OR the shifted bits into the accumulated value x
        x |= (b & 0x7F) << shift
        if (b & 0x80) == 0 {
            return x, n
        }
    }

    // The number is too large to represent in a 64-bit value.
    return 0, 0
}
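
For comparison, Go's standard encoding/binary package offers PutUvarint and Uvarint for the same format. The sketch below uses them for a round trip and for a truncated input; it is not the protobuf library code shown above, and the helpers roundTrip and decode are hypothetical wrappers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// roundTrip encodes x with the standard library and decodes it again,
// returning the decoded value and the number of bytes consumed.
func roundTrip(x uint64) (uint64, int) {
	var buf [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(buf[:], x)
	return binary.Uvarint(buf[:n])
}

// decode is a thin wrapper around binary.Uvarint.
func decode(buf []byte) (uint64, int) {
	return binary.Uvarint(buf)
}

func main() {
	v, n := roundTrip(123456)
	fmt.Println(v, n)

	// Truncated input: the last byte still has its MSB set, so the
	// varint is incomplete and Uvarint reports n == 0.
	_, n = decode([]byte{0xC0, 0xC4})
	fmt.Println(n)
}
```

One difference in error reporting: binary.Uvarint returns n == 0 when the buffer is too small and n < 0 on overflow, whereas DecodeVarint above returns 0, 0 in both cases.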

That covers the encoding and decoding process of varint. With the varint encoding principle understood, the encoding used by Protocol Buffers becomes much easier to follow.