About the author: Chen Chuang, known as “Soldier Leo”, is a Baishan Super Engineer. A senior developer working on the Linux kernel, Nginx modules and storage architecture, he has more than 7 years of experience in storage architecture, design and development. He has worked at DongSoft, Zhongke Dawning, Sina and Metro, and specializes in independently developing projects such as Haystack and erasure codes. His hobby is constantly reducing IO and challenging the redundancy baseline. A level 10 professional Baishan skateboarder who can drift, he is actively preparing for the 6th Zhanfangzhuang Street Motion Skateboarding Games; his family dream is to win a silicone-free shampoo for his wife.
Nowadays, with the explosive growth of data on the Internet, applications such as social networks, mobile communications, online video and e-commerce often produce billions of small files, and sometimes far more. Because such files pose huge challenges for metadata management, access performance and storage efficiency, handling massive numbers of small files has become a recognized hard problem in the industry.
Well-known Internet companies have proposed their own solutions to the massive small file problem. Facebook, the social networking site, stores more than 60 billion pictures and built the Haystack system specifically to optimize the storage of large numbers of small pictures.
Baishan Cloud Storage CWN-X also offers a unique solution for small files, which we call Haystack_plus. The system provides high-performance data reading and writing, fast data recovery, periodic reorganization and merging functions.
Facebook’s Haystack handles small files by merging them. Small files are appended one after another to a data file, and an index file is generated alongside it. To read a file, the index is searched for the file's offset and size in the data file.
Haystack’s data file:
Haystack’s data file wraps each small file in a needle, which contains the file's key, size, data and other information. All small files are appended to the data file in the order in which they are written.
Haystack’s index file:
Haystack’s index file stores each needle's key together with the needle's offset, size and other information in the data file. When the program starts, it loads the index into memory and finds a file's offset and size in the data file by looking up the in-memory index.
A defining feature of Facebook’s Haystack is that it loads the complete key of every file into memory to locate files. Facebook uses 8-byte keys, all of which fit in memory when the machine has enough of it.
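The append-and-index scheme described above can be sketched in a few lines. This is a minimal illustration, not Facebook's actual on-disk format: the class name `TinyHaystack`, the needle header layout (a 16-byte key plus a 4-byte size) and the in-memory buffer standing in for the data file are all assumptions made for the example.

```python
import io
import struct

# Hypothetical needle header: 16-byte key followed by a 4-byte data size.
HEADER = struct.Struct("<16sI")


class TinyHaystack:
    """Toy append-only store: one data file plus an in-memory index."""

    def __init__(self):
        self.data = io.BytesIO()   # stands in for the data file on disk
        self.index = {}            # full key -> (offset, size)

    def put(self, key: bytes, payload: bytes) -> None:
        offset = self.data.seek(0, io.SEEK_END)   # always append at the end
        self.data.write(HEADER.pack(key, len(payload)) + payload)
        self.index[key] = (offset, len(payload))

    def get(self, key: bytes) -> bytes:
        offset, size = self.index[key]            # one in-memory lookup
        self.data.seek(offset + HEADER.size)      # skip the needle header
        return self.data.read(size)


store = TinyHaystack()
store.put(b"k" * 16, b"avatar bytes")
store.put(b"j" * 16, b"thumbnail bytes")
print(store.get(b"k" * 16))
```

Every read costs one dictionary lookup and one seek, which is why the whole index has to stay in memory for this design to work.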
In a real environment, however, there are two main problems:
Storage servers do not have very large memory, generally 32 GB to 64 GB;
The key size of a small file is hard to control, because the MD5 or SHA1 of the file content is usually chosen as the key.
Consider an example. A storage server has twelve 4 TB disks and about 32 GB of memory.
The server needs to store about 1 billion files of roughly 4 KB each, such as avatars and thumbnails.
Each file's key is its MD5; together with the offset and size fields, the index entry for one small file takes 28 bytes on average.
In this case the index occupies nearly 30 GB of memory while the files occupy only about 4 TB of disk. Memory is almost exhausted, yet only about 8% of the disk is used.
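The arithmetic behind this example can be checked directly. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 GB = 10^9 bytes) and 4,000-byte files; the variable names are illustrative, and the 16 + 8 + 4 split of the 28 bytes follows the comparison table later in the article.

```python
FILES = 1_000_000_000          # about 1 billion small files
FILE_SIZE = 4_000              # about 4 KB each (decimal units assumed)
ENTRY_BYTES = 16 + 8 + 4       # MD5 key + offset + size = 28 bytes

DISK_BYTES = 12 * 4 * 10**12   # twelve 4 TB disks
MEM_BYTES = 32 * 10**9         # 32 GB of memory

index_bytes = FILES * ENTRY_BYTES
data_bytes = FILES * FILE_SIZE

print(f"index: {index_bytes / 10**9:.0f} GB, {index_bytes / MEM_BYTES:.1%} of memory")
print(f"data:  {data_bytes / 10**12:.0f} TB, {data_bytes / DISK_BYTES:.1%} of disk")
```

Under these assumptions the index alone takes most of the server's memory while the data fills less than a tenth of the disks, which is the imbalance the index optimization targets.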
So index optimization is a problem that must be solved.
The core of Haystack_plus is also composed of data files and index files.
1. Haystack_plus data file:
As in Facebook’s Haystack, Haystack_plus writes multiple small files into one data file, and each needle holds the key, size, data and other information.
2. The index file of Haystack_plus:
The index is our main optimization target:
The index file stores only the first four bytes of each key, not the complete key.
The offset and size fields in the index file are aligned to 512 bytes, which saves 1 byte, and the number of bytes used for offset and size is calculated from the actual size of the entire Haystack_plus data file.
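To see why 512-byte alignment shrinks the fields, note that an aligned value can be stored as a count of 512-byte blocks, which drops its low 9 bits. The sketch below is only an illustration: the helper names `bytes_needed` and `field_width`, the 2 TB data file and the 100 KiB largest file are hypothetical choices, not values from the system.

```python
ALIGN = 512  # alignment unit from the text


def bytes_needed(max_value: int) -> int:
    """Smallest number of whole bytes that can represent max_value."""
    return max(1, (max_value.bit_length() + 7) // 8)


def field_width(max_raw: int, aligned: bool) -> int:
    """Width of an index field; aligned fields store 512-byte units."""
    if aligned:
        max_raw = -(-max_raw // ALIGN)   # ceiling division to a block count
    return bytes_needed(max_raw)


data_file_size = 2 * 10**12   # hypothetical 2 TB Haystack_plus data file
largest_file = 100 * 1024     # hypothetical largest small file, 100 KiB

print("offset:", field_width(data_file_size, False), "->", field_width(data_file_size, True))
print("size:  ", field_width(largest_file, False), "->", field_width(largest_file, True))
```

With these example numbers the offset field shrinks from 6 bytes to 4 and the size field from 3 bytes to 1, matching the 4-byte offset and 1-byte size quoted for Haystack_plus.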
3. How Haystack_plus differs:
Needles in the data file are stored sorted by key.
Because the index file stores only the first four bytes of each key, if several small files share the same first four bytes and their needles were stored in arbitrary order, the exact location of a key could not be found. The following may occur:
For example, the user reads the file with key
0x ab cd ef ac ee
But because the index file keeps only the first four bytes of the key, the lookup can only match the prefix
0x ab cd ef ac
and this prefix alone does not locate the offset to be read.
We solve this problem by storing needles in key order:
For example, the user reads the file with key
0x ab cd ef ac bb
The lookup matches the prefix
0x ab cd ef ac
whose offset points to the needle
0x ab cd ef ac aa
so the first comparison is a miss.
Using the size stored in the needle header, we can skip forward to the position of
0x ab cd ef ac bb
match the correct needle, and return its data to the user.
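The lookup just described (binary search on the 4-byte prefix, then a walk over sorted needles using the size in each header) can be sketched as follows. This is an illustration under assumptions, not the production format: the 16-byte-key header layout and the helper names `k`, `build` and `lookup` are hypothetical.

```python
import bisect
import struct

PREFIX_LEN = 4
HEADER = struct.Struct("<16sI")   # hypothetical needle header: full key, data size


def k(hex_key: str) -> bytes:
    """Pad a short hex key to 16 bytes so it fits the example header."""
    return bytes.fromhex(hex_key).ljust(16, b"\0")


def build(files: dict):
    """Write needles in key order; the index keeps only 4 bytes of each key."""
    data, prefixes, offsets = bytearray(), [], []
    for key in sorted(files):
        prefixes.append(key[:PREFIX_LEN])
        offsets.append(len(data))
        data += HEADER.pack(key, len(files[key])) + files[key]
    return bytes(data), prefixes, offsets


def lookup(data, prefixes, offsets, key):
    """Binary-search the prefix, then step through needles via their headers."""
    i = bisect.bisect_left(prefixes, key[:PREFIX_LEN])
    if i == len(prefixes) or prefixes[i] != key[:PREFIX_LEN]:
        return None
    pos = offsets[i]                   # first needle that has this prefix
    while pos < len(data):
        needle_key, size = HEADER.unpack_from(data, pos)
        if needle_key == key:
            start = pos + HEADER.size
            return data[start:start + size]
        if needle_key > key:           # sorted order: the key cannot appear later
            return None
        pos += HEADER.size + size      # skip this needle using its stored size
    return None


files = {k("abcdefacaa"): b"first", k("abcdefacbb"): b"second", k("abcdefadcc"): b"third"}
data, prefixes, offsets = build(files)
print(lookup(data, prefixes, offsets, k("abcdefacbb")))
```

Because needles are sorted, the walk can stop as soon as it sees a key larger than the one requested, so a miss costs a bounded scan instead of a search of the whole file.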
4. The index search process is as follows:
5. Requests for files that do not exist:
Problem: We use binary search to find a key in memory, with time complexity O(log n), where n is the number of needles. When several keys share the same index prefix, the search must continue in the data file. If the requested file does not exist, this can easily cause multiple IO lookups.
Solution: In memory, the existing files are mapped into a Bloom filter, so a single quick check is enough to rule out files that do not exist.
The check takes O(k) time, where k is the number of hash functions. With 9.6 bits per element the false positive rate is 1%; adding another 4.8 bits per element reduces the false positive rate to 0.1%.
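The numbers above follow from the standard Bloom filter sizing formula, bits per element = -ln(p) / (ln 2)^2 for a false positive rate p. The sketch below is a minimal illustration, not the filter used in Haystack_plus: the class, the double-hashing scheme over SHA-256 and the names `BloomFilter` and `bits_per_element` are assumptions for the example.

```python
import hashlib
import math


def bits_per_element(error_rate: float) -> float:
    """Bits needed per stored element for a target false positive rate."""
    return -math.log(error_rate) / math.log(2) ** 2


class BloomFilter:
    def __init__(self, capacity: int, error_rate: float):
        self.m = math.ceil(capacity * bits_per_element(error_rate))   # total bits
        self.k = max(1, round(self.m / capacity * math.log(2)))      # hash count
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: bytes):
        # Derive k bit positions from one SHA-256 digest (double hashing).
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))


bf = BloomFilter(capacity=1000, error_rate=0.01)
for i in range(1000):
    bf.add(f"file-{i}".encode())

print(round(bits_per_element(0.01), 1), round(bits_per_element(0.001), 1))
print(b"file-42" in bf)
```

A Bloom filter never reports an existing file as missing, so a negative answer lets the server skip the data file entirely; only a positive answer has to be confirmed through the index.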
6. How effective is prefix compression?
Memory consumption of Haystack_plus vs. Facebook Haystack. Example scenario: 4K files (e.g. avatars, thumbnails), MD5 keys:
|Memory consumption comparison|Key|offset|size|
|---|---|---|---|
|Haystack|Full key, 16 bytes|8 bytes|4 bytes|
|Haystack_plus|4 bytes|4 bytes|1 byte|
Note: Haystack appends needles as they are written, so its offset and size fields have fixed widths. Haystack_plus keeps only the first four bytes of the key; the width of offset is calculated from the address space of the Haystack_plus data file, aligned to 512 bytes; the width of size is calculated from the actual file sizes, also aligned to 512 bytes.
As the comparison above shows, with 1 billion files Facebook’s Haystack consumes more than 26G of memory, while Haystack_plus consumes only a little more than 9G, reducing memory usage by about 2/3.
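The per-entry sizes in this comparison can be reproduced with Python's `struct` module. The field layouts below are taken from the comparison above; the struct names are hypothetical, and the totals are printed in both GiB and GB because the article does not say which unit "G" denotes.

```python
import struct

# "<" disables padding, so .size is the exact packed size of one entry.
HAYSTACK_ENTRY = struct.Struct("<16sQI")   # 16-byte MD5, 8-byte offset, 4-byte size
PLUS_ENTRY = struct.Struct("<4sIB")        # 4-byte prefix, 4-byte offset, 1-byte size

FILES = 1_000_000_000

for name, entry in (("Haystack", HAYSTACK_ENTRY), ("Haystack_plus", PLUS_ENTRY)):
    total = entry.size * FILES
    print(f"{name}: {entry.size} bytes/entry, {total / 2**30:.1f} GiB, {total / 10**9:.0f} GB")

saving = 1 - PLUS_ENTRY.size / HAYSTACK_ENTRY.size
print(f"saving: {saving:.0%}")
```

Going from 28 bytes to 9 bytes per entry is a reduction of roughly two thirds, independent of the number of files.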
7. Index optimization does not stop there.
One billion 4K small files still consume more than 9G of memory. The key takes 4 bytes and the offset takes 4 bytes; both need to be smaller.
Needles are grouped into layers by key prefix: all needles that share the same prefix form one layer.
Benefits of layering:
Fewer bytes for the key:
With layering, a shared prefix is stored only once per layer, which reduces the number of key bytes per needle.
Fewer bytes for the offset:
Before optimization, an offset addresses the entire Haystack_plus data file.
After optimization, an offset only needs to address a position within its layer of the data file, so the number of bytes required can be calculated from the address space of the largest layer.
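A small sketch makes the two savings concrete. It assumes a two-level layout in which a layer index holds the shared prefix and the layer's base offset, and each needle entry holds a short key suffix plus an offset relative to that base. The constants `LAYER_PREFIX` and `NEEDLE_SUFFIX` and the function names are hypothetical; the article does not give the real layout.

```python
LAYER_PREFIX = 2    # hypothetical: key bytes shared by every needle in a layer
NEEDLE_SUFFIX = 2   # hypothetical: further key bytes kept per needle


def build_layers(needles):
    """needles: list of (key, absolute_offset), sorted by key.

    Returns {layer_prefix: (base_offset, [(suffix, relative_offset), ...])}.
    """
    layers = {}
    for key, offset in needles:
        prefix = key[:LAYER_PREFIX]
        if prefix not in layers:
            layers[prefix] = (offset, [])         # first needle fixes the layer base
        base, entries = layers[prefix]
        suffix = key[LAYER_PREFIX:LAYER_PREFIX + NEEDLE_SUFFIX]
        entries.append((suffix, offset - base))   # offset relative to the layer
    return layers


def offset_bytes(layers) -> int:
    """Bytes needed to hold the largest relative offset in any layer."""
    largest = max(rel for _, entries in layers.values() for _, rel in entries)
    return max(1, (largest.bit_length() + 7) // 8)


needles = [
    (bytes.fromhex("ab01aaaa"), 0),
    (bytes.fromhex("ab01bbbb"), 4096),
    (bytes.fromhex("ab02cccc"), 5_000_000_000),
    (bytes.fromhex("ab02dddd"), 5_000_004_096),
]
layers = build_layers(needles)
print(offset_bytes(layers))
```

In this toy data set the largest absolute offset needs 5 bytes, while the largest offset within a layer fits in 2, and each needle entry stores 2 key bytes instead of 4.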
The effect of layering:
As can be seen from the figure above, memory consumption drops from more than 9G before optimization to more than 4G after layering, roughly halving it.
Haystack_plus Overall Architecture
1. Haystack_plus Organization:
On each server, all files are split into groups, and a Haystack_plus is created for each group. The system manages all Haystack_plus instances uniformly.
Read, write, delete and other operations are first routed to one Haystack_plus, and then use its index to locate the specific needle to operate on.
2. Index Organization
As mentioned earlier, all needles are stored in key order, and the index is prefix-compressed and layered.
3. File composition:
Chunk file: the actual data of the small files is split across a fixed number of chunk data files, 12 by default;
Needle list file: save the information of each needle (such as file name, offset, etc.);
Needle index and layer index files: save index information of needle list in memory;
Global version file: save version information and automatically add new version information to the file when creating a new version;
Attribute file: save the attribute information of the system (such as SHA1 of chunk, etc.);
Original filenames: Save the original filenames of all files.
A. Haystack_plus data files are split into chunks: chunk1, chunk2, chunk3...
B. Benefits of dividing into chunks:
1. When data is damaged, the data of other chunks is not affected.
2. During data recovery, only the damaged chunks need to be recovered.
C. The SHA1 value of each chunk is stored in the attribute file.
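Recording a SHA1 per chunk makes it cheap to find out which chunks need recovery. The sketch below shows the idea with in-memory byte strings standing in for chunk files; the function names and sample data are hypothetical.

```python
import hashlib


def chunk_digests(chunks):
    """SHA1 hex digest of each chunk, as would be recorded in the attribute file."""
    return [hashlib.sha1(c).hexdigest() for c in chunks]


def damaged_chunks(chunks, recorded):
    """Indexes of chunks whose current SHA1 no longer matches the recorded one."""
    return [i for i, c in enumerate(chunks)
            if hashlib.sha1(c).hexdigest() != recorded[i]]


chunks = [b"chunk-one-data", b"chunk-two-data", b"chunk-three-data"]
recorded = chunk_digests(chunks)

chunks[1] = b"chunk-two-dat\x00"   # simulate corruption in the second chunk
print(damaged_chunks(chunks, recorded))
```

Only the second chunk is reported, so only that chunk has to be fetched again; the other chunks stay untouched.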
4. Version control:
Because needles are stored in key order in the data files, newly uploaded files cannot be appended to Haystack_plus directly without breaking that order. Instead, they are first saved to a hash directory and later added to Haystack_plus by a periodic automatic merge.
During a merge, all needle information is read from the needle_list file, deleted needles are dropped, newly uploaded files are added, and everything is re-sorted to generate new chunk data files, index files and so on.
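In outline, a merge is a filter, a union and a sort. The sketch below uses plain dictionaries in place of the needle_list file and the hash directory; the function name `merge` and its arguments are hypothetical.

```python
def merge(needle_list: dict, deleted: set, uploads: dict) -> list:
    """Combine the current needles with pending uploads, in key order.

    needle_list: {key: data} for the current version
    deleted:     keys marked as deleted
    uploads:     {key: data} waiting in the hash directory
    """
    merged = {k: v for k, v in needle_list.items() if k not in deleted}
    merged.update(uploads)           # newly uploaded files join the set
    return sorted(merged.items())    # needles are rewritten sorted by key


current = {b"b": b"2", b"a": b"1", b"c": b"3"}
result = merge(current, deleted={b"b"}, uploads={b"ab": b"9"})
print([key for key, _ in result])
```

The sorted result is what would then be written out as new chunk data files and index files.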
Each merge generates a new version of Haystack_plus. The version name is the first four bytes of the SHA1 of the sorted file names of all user files.
Every half month, the system automatically checks the hash directory for new files and calculates the SHA1 of the full set of file names to see whether it matches the current version number. If it differs, new files have been uploaded, and the system merges them and generates new data files.
The system also allows a new version to be created once the number of files in the hash directory exceeds a specified number, which reduces the number of merges.
Version control is recorded in the global_version file. Each time a new version is created, the version number and its CRC32 are appended to the global_version file (the CRC32 is used to check whether the version number has been damaged).
Each time a new version is generated, the program is automatically notified to reload the index file, attribute file and so on.
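The version naming and the CRC32-protected record can be sketched as below. The newline separator between file names, the hex encoding and the "version crc" line format are assumptions; the article only states that the name is the first four bytes of a SHA1 over the sorted file names and that a CRC32 accompanies each version number.

```python
import hashlib
import zlib


def version_name(file_names) -> str:
    """First four bytes of the SHA1 over the sorted file names, as hex."""
    h = hashlib.sha1()
    for name in sorted(file_names):
        h.update(name.encode("utf-8") + b"\n")   # separator is an assumption
    return h.digest()[:4].hex()


def version_record(version: str) -> str:
    """One line for the global_version file: the version plus its CRC32."""
    return f"{version} {zlib.crc32(version.encode('ascii')):08x}"


def record_is_valid(line: str) -> bool:
    """Recompute the CRC32 to detect a damaged version number."""
    version, crc = line.split()
    return zlib.crc32(version.encode("ascii")) == int(crc, 16)


v1 = version_name(["b.jpg", "a.jpg"])
record = version_record(v1)
print(record, record_is_valid(record))
```

Because the names are sorted before hashing, the same set of files always yields the same version name regardless of upload order, which is what makes the "has anything changed?" check a simple string comparison.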
5. Data recovery:
The user’s files will be saved in three copies, so Haystack_plus will also be stored on three different machines.
Recovery scenario 1:
When a Haystack_plus file is damaged, the system checks whether a replica machine holds the same version of Haystack_plus. If the versions are the same, the file contents are identical, so only the damaged files need to be downloaded from the replica machine and replaced.
Recovery scenario 2:
If no replica machine has the same version of Haystack_plus but one has a higher version, the entire higher version of Haystack_plus is copied from that replica machine to replace the local one.
Recovery scenario 3:
If neither of the first two cases applies, all files are read from the other two replica machines into the local hash directory, the files stored in undamaged chunks are also extracted into the hash directory, and a new version of Haystack_plus is regenerated from all of these files.
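The three scenarios amount to a small decision procedure. The sketch below is one possible reading: it assumes that "higher version" means "appears later in the global_version history" and that all replicas share that history. The function name, the strategy labels and the host names are hypothetical.

```python
def choose_recovery(local_version: str, history: list, replicas: dict):
    """Pick a recovery strategy for a damaged Haystack_plus.

    history:  version names from global_version, oldest first (assumed shared)
    replicas: {host: version_name} for the other two copies
    Returns (strategy, host or None).
    """
    rank = {version: i for i, version in enumerate(history)}

    # Scenario 1: a replica holds the same version, so files are identical.
    for host, version in replicas.items():
        if version == local_version:
            return ("copy_damaged_files", host)

    # Scenario 2: a replica holds a newer version; take the newest one whole.
    newer = [(rank[v], host) for host, v in replicas.items()
             if v in rank and rank[v] > rank[local_version]]
    if newer:
        return ("copy_whole_haystack", max(newer)[1])

    # Scenario 3: rebuild from individual files via the hash directory.
    return ("rebuild_from_files", None)


history = ["a1b2c3d4", "0f0e0d0c", "9a8b7c6d"]
print(choose_recovery("0f0e0d0c", history, {"r1": "0f0e0d0c", "r2": "a1b2c3d4"}))
print(choose_recovery("0f0e0d0c", history, {"r1": "9a8b7c6d", "r2": "a1b2c3d4"}))
print(choose_recovery("0f0e0d0c", history, {"r1": "a1b2c3d4", "r2": "a1b2c3d4"}))
```

The cheapest option is tried first: copying a few damaged files, then copying a whole Haystack_plus, and only as a last resort a full rebuild.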
How effective is Haystack_plus?
After running Haystack_plus for some time, we found that overall small-file performance improved significantly: RPS more than doubled, and the machines' IO utilization nearly doubled. In addition, the smallest storage unit was optimized, and fragmentation was reduced by 80%.
With this system we can offer users faster read and write service while reducing the cluster's resource consumption.