A deep understanding of Redis's simple dynamic string

Time:2021-4-11

Redis does not directly use the traditional C string representation (a character array ending with a null character, hereinafter referred to as a C string). Instead, it builds its own string type named simple dynamic string (SDS) and uses SDS as its default string representation. The SDS built by Redis is more efficient and safer than a plain C string.

SDS

What is the structure of SDS? How does it differ from a C string?

Here is the definition of SDS

struct sdshdr {
    // Number of bytes used in the buf array;
    // equal to the length of the string stored in the SDS
    int len;

    // Number of unused bytes in the buf array
    int free;

    // Byte array used to store the string
    char buf[];
};

On a 64-bit system, the len and free attributes occupy 4 bytes each, and the byte array is stored immediately after them.

The buf above is a flexible array member. A flexible array member can only be placed at the end of a structure. A structure that contains a flexible array member is normally allocated dynamically, for example with malloc, and the allocation requests extra space for the array beyond the size of the structure itself.

For more on flexible arrays, see the article "C language flexible array explanation".
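The allocation pattern described above can be sketched as follows. This is a minimal illustration that uses the struct from this article; the helper name sds_sketch_new is hypothetical and is not a Redis function.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sdshdr {
    int len;     /* bytes used in buf (the string length) */
    int free;    /* unused bytes in buf */
    char buf[];  /* flexible array member: must be the last member */
};

/* Hypothetical helper: allocate the header and the buffer in one malloc call. */
struct sdshdr *sds_sketch_new(const char *init, int initlen) {
    assert(init != NULL && initlen >= 0);
    /* sizeof(struct sdshdr) does not count buf, so add the payload size
     * plus one byte for the terminating null character. */
    struct sdshdr *sh = malloc(sizeof(struct sdshdr) + (size_t)initlen + 1);
    if (sh == NULL) return NULL;
    sh->len = initlen;
    sh->free = 0;
    if (initlen > 0) memcpy(sh->buf, init, (size_t)initlen);
    sh->buf[initlen] = '\0';
    return sh;
}
```

Because the header and the bytes live in one block, a single free() releases both.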

Here is an example of SDS

set name "Redis"

  • The value of the free property is 0, which means that the SDS does not allocate any unused space.
  • The value of the len attribute is 5, which means that the SDS stores a five-byte-long string.
  • The buf attribute is an array of char type. The first five bytes of the array hold the five characters 'R', 'e', 'd', 'i', 's', respectively, while the last byte holds the null character '\0'.

SDS follows the C string convention of ending with a null character. The 1 byte used to store the null character is not counted in the len attribute, and both allocating that extra byte and appending the null character to the end of the string are done automatically by the SDS functions. The null character is therefore completely transparent to SDS users. The advantage of keeping the null terminator is that SDS can directly reuse some functions from the C string library.

The difference between SDS and C string

C uses a character array of length N + 1 to represent a string of length N, and the last element of the array is always the null character '\0'. However, this simple representation cannot meet the requirements of Redis in terms of safety, efficiency and functionality. Let's look at why SDS suits Redis better than the C string.

Getting the string length is O(1) with SDS and O(n) with a C string

Because a C string does not record its own length, the program must traverse the whole string and count each character until it reaches the null character that marks the end of the string. The complexity of this operation is O(n).

Unlike the C string, because SDS records the length of SDS itself in the len attribute, the complexity of getting an SDS length is O (1).

By using SDS instead of C strings, Redis reduces the complexity of getting the string length from O(n) to O(1), which ensures that getting the string length will not become a performance bottleneck. Even if we repeatedly execute the STRLEN command on a very long string, the cost of each call does not grow with the length of the string, because the complexity of the STRLEN command is only O(1).
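The difference between the two lookups can be sketched like this. The helper names (sds_sketch_new, sds_sketch_len, c_string_length) are hypothetical and only illustrate the idea with the struct from this article.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sdshdr { int len; int free; char buf[]; };

/* Hypothetical constructor: copies a C string into a freshly allocated SDS. */
struct sdshdr *sds_sketch_new(const char *init) {
    assert(init != NULL);
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(struct sdshdr) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);  /* n bytes plus the '\0' */
    return sh;
}

/* O(n): strlen walks the bytes until it finds '\0'. */
size_t c_string_length(const char *s) { return strlen(s); }

/* O(1): just read the length stored in the header. */
size_t sds_sketch_len(const struct sdshdr *sh) { return (size_t)sh->len; }
```

Both functions return the same number for ordinary text; only the amount of work differs.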

SDS eliminates buffer overflow

C string does not record its own length, which not only leads to high complexity in obtaining the length of string, but also causes buffer overflow. For example, suppose there are two C strings S1 and S2 in the program that are next to each other in memory, where S1 stores the string “redis” and S2 stores the string “mongodb”, as shown in the figure below.

[Figure: Two C strings next to each other in memory]

If a programmer decides to change the content of S1 to “redis cluster” through strcat(S1, " cluster"), but carelessly forgets to allocate enough space for S1 before calling strcat, then after strcat runs, the data of S1 will overflow into the space of S2, unexpectedly modifying the content saved in S2, as shown in the figure below.

[Figure: The content of S1 overflows into the location of S2]

This is a problem with using C strings. Unlike the C string, the space allocation strategy of SDS eliminates this kind of buffer overflow. When an SDS API needs to modify an SDS, it first checks whether the space of the SDS is large enough for the modification. If not, the API automatically expands the SDS to the required size and only then performs the actual modification. Therefore, users of SDS do not need to resize the buffer manually, and the buffer overflow described above cannot occur.
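The check-then-grow behavior can be sketched as follows. This is a simplified illustration, not the Redis source: sds_sketch_new and sds_sketch_cat are hypothetical names, and this version grows by exactly the missing amount without any preallocation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sdshdr { int len; int free; char buf[]; };

/* Hypothetical constructor: copies a C string into a freshly allocated SDS. */
struct sdshdr *sds_sketch_new(const char *init) {
    assert(init != NULL);
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(struct sdshdr) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);
    return sh;
}

/* Hypothetical append: make room first, then copy.
 * Returns the (possibly moved) header, or NULL if allocation fails. */
struct sdshdr *sds_sketch_cat(struct sdshdr *sh, const char *t) {
    assert(sh != NULL && t != NULL);
    int addlen = (int)strlen(t);
    if (sh->free < addlen) {
        /* Not enough unused space: expand before touching buf. */
        int newlen = sh->len + addlen;
        struct sdshdr *grown =
            realloc(sh, sizeof(struct sdshdr) + (size_t)newlen + 1);
        if (grown == NULL) return NULL;
        sh = grown;
        sh->free = addlen;  /* capacity is now newlen; len is unchanged */
    }
    memcpy(sh->buf + sh->len, t, (size_t)addlen + 1);  /* includes '\0' */
    sh->len += addlen;
    sh->free -= addlen;
    return sh;
}
```

Since the caller never writes into buf directly, the write can never run past the allocated block, which is the property strcat cannot offer.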

Reduce the number of memory reallocations when modifying strings

Because a C string does not record its own length, the underlying implementation of a C string containing N characters is always an array of N + 1 characters (the additional character stores the null character). Because the length of the C string is tied to the length of the underlying array in this way, every time a C string grows or shrinks, the program has to perform a memory reallocation on the array that holds it:

  • If the program performs string growing operations, such as append, the program needs to expand the space size of the underlying array through memory reallocation before performing this operation. If you forget this step, a buffer overflow will occur.
  • If the program shortens the string, for example with a trim operation, then after the operation the program needs to reallocate the memory to release the space that the string no longer uses. If you forget this step, a memory leak will occur.

In order to avoid this defect of C string, SDS breaks the association between string length and underlying array length through unused space: in SDS, the length of buf array is not necessarily the number of characters plus one, and the array can contain unused bytes, which are recorded by the free attribute of SDS.

Through unused space, SDS implements two optimization strategies: space preallocation and lazy space release.

1. Space preallocation

Space preallocation optimizes the string growth operations of SDS: when an SDS API modifies an SDS and needs to expand its space, the program allocates not only the space required for the modification but also additional unused space. The amount of additional unused space is determined by the following rules:

  • If the length of the SDS (that is, the value of the len attribute) is less than 1MB after the modification, the program allocates unused space of the same size as len, so the len and free attributes of the SDS will have the same value. For example, if len becomes 13 bytes after the modification, the program also allocates 13 bytes of unused space, and the actual length of the buf array will be 13 + 13 + 1 = 27 bytes (the extra byte stores the null character).
  • If the length of the SDS is greater than or equal to 1MB after the modification, the program allocates 1MB of unused space. For example, if len becomes 30MB after the modification, the program allocates 1MB of unused space, and the actual length of the buf array will be 30MB + 1MB + 1 byte.

Through the space preallocation strategy, Redis reduces the number of memory reallocations required for a series of consecutive string growth operations.
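The two rules above can be written as a small sizing function. This is a sketch of the arithmetic only; the names SKETCH_MAX_PREALLOC, sketch_prealloc_free and sketch_buf_size are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The 1MB threshold from the rules above (hypothetical constant name). */
#define SKETCH_MAX_PREALLOC ((size_t)1024 * 1024)

/* Given the string length after modification, return how many unused
 * bytes to reserve: len itself below 1MB, otherwise a flat 1MB. */
size_t sketch_prealloc_free(size_t newlen) {
    return newlen < SKETCH_MAX_PREALLOC ? newlen : SKETCH_MAX_PREALLOC;
}

/* Total size of buf: used bytes + unused bytes + 1 byte for '\0'. */
size_t sketch_buf_size(size_t newlen) {
    /* Guard against size_t overflow in the sum below. */
    assert(newlen <= SIZE_MAX - SKETCH_MAX_PREALLOC - 1);
    return newlen + sketch_prealloc_free(newlen) + 1;
}
```

For the two examples in the list, sketch_buf_size(13) gives 27 bytes, and a 30MB string gives 31MB plus 1 byte.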

2. Lazy space release

Lazy space release optimizes the string shortening operations of SDS: when an SDS API needs to shorten the string saved in an SDS, the program does not immediately reallocate memory to reclaim the bytes freed by the shortening. Instead, it records the number of these bytes in the free attribute and keeps them for future use.

Through the lazy space release strategy, SDS avoids the memory reallocation that shortening a string would otherwise require, and it also optimizes possible future growth operations. At the same time, SDS provides a corresponding API that really releases the unused space of an SDS when necessary, so we do not have to worry about memory being wasted by the lazy space release strategy.
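A minimal sketch of both halves of this strategy follows: shortening only adjusts len and free, while a separate call returns the unused bytes to the allocator. All function names here are hypothetical and the code is an illustration built on the struct from this article, not the Redis implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sdshdr { int len; int free; char buf[]; };

/* Hypothetical constructor: copies a C string into a freshly allocated SDS. */
struct sdshdr *sds_sketch_new(const char *init) {
    assert(init != NULL);
    size_t n = strlen(init);
    struct sdshdr *sh = malloc(sizeof(struct sdshdr) + n + 1);
    if (sh == NULL) return NULL;
    sh->len = (int)n;
    sh->free = 0;
    memcpy(sh->buf, init, n + 1);
    return sh;
}

/* Keep only the first `keep` bytes. No reallocation happens:
 * the dropped bytes are simply recorded as unused. */
void sds_sketch_truncate(struct sdshdr *sh, int keep) {
    assert(sh != NULL);
    if (keep < 0 || keep >= sh->len) return;
    sh->free += sh->len - keep;
    sh->len = keep;
    sh->buf[keep] = '\0';
}

/* Explicitly give the unused bytes back to the allocator. */
struct sdshdr *sds_sketch_shrink_to_fit(struct sdshdr *sh) {
    assert(sh != NULL);
    struct sdshdr *fit =
        realloc(sh, sizeof(struct sdshdr) + (size_t)sh->len + 1);
    if (fit == NULL) return sh;  /* original block is still valid */
    fit->free = 0;
    return fit;
}
```

After truncating "redis cluster" to 5 bytes, a later append of up to 8 bytes would fit in the recorded free space without any reallocation.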

Binary safety

What is binary safety?

Generally speaking, in C the end of a string is marked by '\0'. If the string itself contains a '\0' character, the string will be truncated there, which is not binary safe. If some mechanism ensures that the content of the string is not damaged when it is read and written, the string is binary safe.

The characters in a C string must conform to some encoding (such as ASCII), and apart from the end of the string, the string cannot contain null characters; otherwise the first null character read by the program will be mistaken for the end of the string. These restrictions mean that a C string can only store text data, not binary data such as pictures, audio, video and compressed files.

To ensure that Redis can be used in different scenarios (storing text, images, audio, video and so on), the SDS API is binary safe: every SDS API processes the data stored in the buf array as raw binary. The program does not restrict, filter, or make assumptions about the data in the buf array. The data is read back exactly as it was written.

This is also the reason why the buf attribute of SDS is called a byte array: Redis does not use this array to store characters, but to store a series of binary data.
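The effect of relying on len instead of the null terminator can be sketched as follows. The helper names sds_sketch_newlen and sds_sketch_equal are hypothetical; the point is only that length-based operations see bytes that strlen and strcmp cannot.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct sdshdr { int len; int free; char buf[]; };

/* Hypothetical constructor that takes an explicit length, so the
 * payload may contain '\0' bytes anywhere. */
struct sdshdr *sds_sketch_newlen(const void *data, int len) {
    assert(data != NULL && len >= 0);
    struct sdshdr *sh = malloc(sizeof(struct sdshdr) + (size_t)len + 1);
    if (sh == NULL) return NULL;
    sh->len = len;
    sh->free = 0;
    if (len > 0) memcpy(sh->buf, data, (size_t)len);
    sh->buf[len] = '\0';
    return sh;
}

/* Compare using the stored lengths, not the null terminator. */
int sds_sketch_equal(const struct sdshdr *a, const struct sdshdr *b) {
    return a->len == b->len &&
           memcmp(a->buf, b->buf, (size_t)a->len) == 0;
}
```

For two 13-byte payloads such as "Redis\0Cluster" and "Redis\0Server!", strcmp on the buffers stops at the embedded null and reports them equal, while the length-based comparison correctly reports that they differ.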

From:

Redis design and Implementation (2nd Edition)

Redis5 design and source code analysis
