OPPO is a smart device manufacturer with hundreds of millions of end users, and it generates large volumes of unstructured data such as text, images, audio, and video every day. Under the requirements of data connectivity, real-time access, and data security governance, fully mining the value of that data at low cost and high efficiency has become a major challenge for companies with massive data. The popular industry solution today is the data lake. This article introduces CBFS, OPPO's self-developed data lake storage system, which solves these pain points to a great extent.
▌ A brief introduction to the data lake
Data lake definition: a centralized repository that stores data in its original format, usually as binary blobs or files. A data lake is typically a single store containing raw data as well as transformed data used for reporting, visualization, advanced analytics, machine learning, and so on.
1. The value of data lake storage
Compared with the traditional Hadoop architecture, a data lake offers the following advantages:
- Highly flexible: reading, writing, and processing data are all convenient, and all raw data can be retained
- Multiple analytics workloads: supports batch processing, stream computing, interactive query, machine learning, and more
- Low cost: storage and compute resources scale independently; object storage with hot/cold data separation further lowers cost
- Easy to manage: complete user management, authentication, compliance, and auditing, so the entire "store and use" data lifecycle is traceable
2. OPPO's overall data lake solution
OPPO builds its data lake along three dimensions. The bottom layer is the lake storage: we use CBFS, a low-cost storage system that simultaneously supports three access protocols: S3, HDFS, and POSIX. The middle layer is the real-time table format, for which we use Apache Iceberg. The top layer supports a variety of compute engines.
3. Features of the OPPO data lake architecture
Early big data storage kept data for stream computing and batch computing in separate systems. The upgraded architecture unifies metadata management and integrates batch and stream computing. It also provides unified interactive query with a friendlier interface, second-level response times, and high concurrency, and it supports upsert operations on changing data sources. The bottom layer uses large-scale, low-cost object storage as a unified data foundation, enabling data sharing across multiple engines and improving data reuse.
4. Architecture of the CBFS data lake storage
Our goal is to build a data lake storage system that can host EB-scale data and address the cost, performance, and usability challenges of data analytics. The whole system is divided into six subsystems:
- Protocol access layer: supports multiple protocols (S3, HDFS, and POSIX), so data written through one protocol can be read directly through the other two
- Metadata layer: presents both the hierarchical namespace of a file system and the flat namespace of object storage; the metadata is fully distributed, supports sharding, and scales linearly
- Metadata cache layer: manages the metadata cache and accelerates metadata access
- Resource management layer: the Master node in the figure manages physical resources (data nodes and metadata nodes) and logical resources (volumes/buckets, data shards, and metadata shards)
- Multi-replica layer: supports both append and random writes, and is friendly to large and small objects alike. One role of this subsystem is persistent multi-replica storage; another is to serve as a data cache layer with elastic replica counts that accelerates data access, to be rolled out later
- Erasure-coded storage layer: significantly reduces storage cost, supports multi-availability-zone deployment and multiple erasure-coding modes, and easily scales to EB-level capacity
Next, we focus on the key technologies used in CBFS, including high-performance metadata management, erasure-coded storage, and lake acceleration.
▌ Key technologies in CBFS
1. Metadata management
The file system presents a hierarchical namespace view. The logical directory tree is partitioned across multiple metadata nodes; as shown in the figure on the right, each metadata node holds hundreds of metadata partitions. Each partition consists of an InodeTree (a BTree) and a DentryTree (a BTree). Each dentry represents one directory entry and is composed of parentId and name: the DentryTree is indexed by parentId plus name for storage and retrieval, while the InodeTree is indexed by inode ID. A Multi-Raft protocol ensures high availability and consistent replication: each node set contains many partition groups, each partition group corresponds to one Raft group, each partition group belongs to one volume, and each partition group covers one metadata range (an inode ID range) of that volume. The metadata subsystem scales out dynamically by splitting: when a partition group's resources (performance or capacity) approach their limit, the resource manager service estimates an end point and tells that group of nodes to serve only data before the point, while a new group of nodes is selected and dynamically added to the cluster.
A single directory supports millions of entries. Metadata is kept entirely in memory to guarantee excellent read and write performance, and the in-memory metadata is persisted to disk via snapshots for backup and recovery.
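A minimal sketch of the two indexes described above, with plain dicts standing in for the BTrees. The class and method names (`MetaPartition`, `create`, `lookup`) are invented for illustration; only the (parentId, name) keying scheme comes from the text, not the real CBFS implementation:

```python
class MetaPartition:
    """Toy model of one metadata partition: a DentryTree keyed by
    (parentId, name) and an InodeTree keyed by inode ID."""

    def __init__(self):
        self.dentry_tree = {}   # (parent inode id, name) -> child inode id
        self.inode_tree = {1: {"type": "dir"}}  # inode 1 is the root dir
        self.next_inode = 2

    def create(self, parent_id, name, typ):
        # Allocate an inode and link it under the parent via a dentry.
        ino = self.next_inode
        self.next_inode += 1
        self.inode_tree[ino] = {"type": typ}
        self.dentry_tree[(parent_id, name)] = ino
        return ino

    def lookup(self, parent_id, name):
        # One-hop resolution of a single path component.
        return self.dentry_tree.get((parent_id, name))

mp = MetaPartition()
a = mp.create(1, "a", "dir")
f = mp.create(a, "f.txt", "file")
assert mp.lookup(1, "a") == a
assert mp.lookup(a, "f.txt") == f
```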
Object storage presents a flat namespace. Take access to an object with objectKey /bucket/a/b/c as an example: starting from the root directory, the path is resolved layer by layer on the "/" separator until the dentry of the last-level directory (/bucket/a/b) is found, and then the inode of /bucket/a/b/c. This process involves multiple interactions between nodes, and the deeper the path, the worse the performance. We therefore introduced a PathCache module to accelerate objectKey resolution. The simple approach is to cache the dentry of the objectKey's parent directory (/bucket/a/b) in the PathCache. Analysis of our production clusters shows that directories hold about 100 entries on average, so a storage cluster with 100 billion objects has only about 1 billion directory entries, making single-machine caching very efficient; read performance can also be scaled across different nodes. By supporting both "flat" and "hierarchical" namespace management in one design, CBFS is simpler and more efficient than comparable systems in the industry: a single copy of the data can be accessed through multiple protocols without any conversion, with no data-consistency issues.
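The PathCache idea can be sketched as follows. Here `lookup(parent_id, name)` stands in for one metadata-node round trip, and the class and method names are illustrative assumptions, not the actual CBFS API:

```python
class PathCache:
    """Toy objectKey resolver with a parent-directory dentry cache."""

    def __init__(self, lookup, root_id=1):
        self.lookup = lookup        # one metadata hop: (parent, name) -> inode
        self.root_id = root_id
        self.cache = {}             # parent-directory path -> inode id

    def resolve(self, object_key):
        *dirs, leaf = object_key.strip("/").split("/")
        parent_path = "/".join(dirs)
        parent_id = self.cache.get(parent_path)
        if parent_id is None:
            parent_id = self.root_id        # cache miss: walk from the root,
            for name in dirs:               # one metadata hop per level
                parent_id = self.lookup(parent_id, name)
            self.cache[parent_path] = parent_id
        return self.lookup(parent_id, leaf)  # one final hop for the object

# A toy metadata tree for /bucket/a/b/c, keyed by (parent inode, name):
tree = {(1, "bucket"): 2, (2, "a"): 3, (3, "b"): 4, (4, "c"): 5}
pc = PathCache(lambda p, n: tree[(p, n)])
assert pc.resolve("/bucket/a/b/c") == 5   # first call walks every level
assert pc.resolve("/bucket/a/b/c") == 5   # second call uses the cached dentry
```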
2. Erasure-coded storage
One of the key technologies for reducing storage cost is erasure coding (EC). Briefly, the principle is: K original data blocks are encoded to produce M new blocks; when any M of the K + M blocks are lost, the original data can be recovered by decoding (conceptually similar to disk RAID). Compared with traditional multi-replica storage, EC has lower data redundancy but higher data durability. There are many different implementations, most based on XOR operations or Reed-Solomon (RS) coding; CBFS also adopts RS coding.
1. Encoding matrix: the upper K rows form the identity matrix I and the lower M rows form the coding matrix. Multiplying it by the vector of K data blocks yields a vector of K + M blocks, consisting of the original data blocks and M parity blocks.
2. When a block is lost: delete the row corresponding to that block from matrix B to obtain a new matrix B′, then left-multiply by the inverse of B′ to recover the lost block. Interested readers can consult the relevant materials for the detailed derivation.
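To make the matrix formulation concrete, here is a toy sketch using exact rational arithmetic via Python's `fractions` (production RS codes, including CBFS's, operate over the Galois field GF(2^8); rationals are used here purely for readability). It encodes K = 3 data blocks with M = 2 Vandermonde parity rows, loses two blocks, and recovers the data from the surviving rows; left-multiplying by the inverse of B′ is done by solving the linear system directly:

```python
from fractions import Fraction

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact Fractions."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bv)] for row, bv in zip(A, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        inv = 1 / M[col][col]
        M[col] = [v * inv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [v - f * p for v, p in zip(M[r], M[col])]
    return [row[n] for row in M]

K = 3
# Encoding matrix B: K identity rows on top, M = 2 Vandermonde rows below.
B = [[Fraction(int(i == j)) for j in range(K)] for i in range(K)]
B += [[Fraction(x) ** j for j in range(K)] for x in (1, 2)]

data = [Fraction(v) for v in (5, 7, 9)]
blocks = [sum(c * d for c, d in zip(row, data)) for row in B]  # K + M blocks

# Lose block 0 (a data block) and block 3 (a parity block):
survivors = [1, 2, 4]
B_prime = [B[i] for i in survivors]
recovered = solve(B_prime, [blocks[i] for i in survivors])
assert recovered == data  # original data restored from K surviving blocks
```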
Plain RS coding has a drawback. Taking the figure above as an example, suppose X1–X6 and Y1–Y6 are data blocks and P1 and P2 are parity blocks. If any one block is lost, 12 of the remaining blocks must be read to repair it, which costs substantial disk IO and repair bandwidth; the problem is especially pronounced in multi-AZ deployments.
The LRC code proposed by Microsoft solves this problem by introducing local parity blocks. As shown in the figure, two local parity blocks PX and PY are added on top of the original global parity blocks P1 and P2. If X1 is damaged, only the six blocks associated with X1–X6 need to be read to repair it. Statistics from data centers show that about 98% of stripe repairs within a given period involve a single disk failure, and only about 1% involve two disks failing simultaneously, so LRC greatly improves repair efficiency in most scenarios. Its drawback is that it is not a maximum-distance-separable (MDS) code: unlike global RS coding, it cannot tolerate the loss of an arbitrary M blocks and still recover all the data.
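The local-repair benefit is easy to see with a toy XOR sketch (real LRC global parities use RS coding; only the local parity path is shown here, with made-up block contents):

```python
import functools
import operator

def xor_blocks(blocks):
    """Byte-wise XOR of equal-length blocks; XOR is its own inverse."""
    return bytes(functools.reduce(operator.xor, col) for col in zip(*blocks))

x_group = [bytes([i] * 4) for i in range(1, 7)]  # data blocks X1..X6
px = xor_blocks(x_group)                         # local parity PX

# X1 is lost: repair it from the 5 surviving group members plus PX alone,
# instead of reading 12 blocks as plain RS would require.
repaired = xor_blocks(x_group[1:] + [px])
assert repaired == x_group[0]
```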
1. Offline EC: after all K data units of a stripe have been written, the M parity blocks are computed over the whole stripe in one pass
2. Online EC: data is split as it is received and the parity blocks are computed synchronously in real time; the K data blocks and M parity blocks are then written simultaneously
Cross-AZ multi-mode online EC in CBFS
CBFS supports online EC storage with cross-AZ, multi-mode stripes. The system can flexibly configure different coding modes for different machine-room layouts (1/2/3 AZ), different object sizes, and different service availability and data durability requirements.
Taking the "1AZ-RS" mode in the figure as an example, 6 data blocks plus 3 parity blocks are deployed within a single AZ. The 2AZ-RS mode uses 6 data blocks and 10 parity blocks, for a data redundancy of 16/6 ≈ 2.67. The 3AZ-LRC mode uses 6 data blocks, 6 global parity blocks, and 3 local parity blocks. Different coding modes are supported within the same cluster.
Online EC storage architecture
It contains several modules:
- Access: the data access layer, which also provides EC encoding and decoding
- CM: the cluster management layer, which manages resources such as nodes, disks, and volumes, and is also responsible for migration, repair, balancing, and inspection; a single cluster supports multiple coexisting EC coding modes
- Allocator: responsible for volume space allocation
- EC-Node: the single-machine storage engine, responsible for actually persisting the data
Erasure-coded write path
1. Stream in the data
2. Slice the data into multiple data blocks and compute the parity blocks at the same time
3. Request a storage volume
4. Distribute the data blocks and parity blocks to the storage nodes concurrently
Writes use a simple NRW quorum protocol that guarantees a minimum number of written replicas: as long as that quorum is met, the request is not blocked even under node or network failures, which preserves availability. Data reception, splitting, and parity encoding run in an asynchronous pipeline, which also delivers high throughput and low latency.
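A minimal sketch of the quorum-write idea, modeling each replica node as a callable that either acknowledges or raises. The names and shape are illustrative assumptions, not the CBFS protocol implementation:

```python
from concurrent.futures import ThreadPoolExecutor

def quorum_write(nodes, block, w):
    """Write `block` to all N replica nodes concurrently; the write succeeds
    as long as at least `w` nodes acknowledge, so a failed node or link
    does not fail the whole request."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = [pool.submit(node, block) for node in nodes]
    acks = sum(1 for r in results if r.exception() is None)
    return acks >= w

store = []
def healthy(block): store.append(block)             # acknowledges
def down(block): raise IOError("node unreachable")  # fails

assert quorum_write([healthy, healthy, down], b"shard-0", w=2)   # 2 acks: ok
assert not quorum_write([healthy, down, down], b"shard-1", w=2)  # 1 ack: fail
```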
Erasure-coded read path
Reads also follow the NRW model. Taking the K = M = 2 coding mode as an example, as long as any two blocks (whether data blocks or parity blocks) are read correctly, the original data can be quickly recovered by RS decoding. In addition, to improve availability and reduce latency, the Access layer preferentially reads from nearby or lightly loaded EC-Nodes.
As can be seen, online EC combined with the NRW protocol guarantees strong data consistency while delivering high throughput and low latency, making it well suited to big data workloads.
3. Data access acceleration
One of the significant benefits of the data lake architecture is cost savings, but the storage-compute separation also runs into bandwidth bottlenecks and performance challenges. We therefore provide a series of access acceleration technologies.
The first is multi-level caching:
1. First-level cache: a local cache deployed on the same machine as the compute node. It caches both metadata and data and supports media such as memory, PMem, NVMe, and HDD. It offers low access latency but limited capacity.
2. Second-level cache: a distributed cache with a flexible number of replicas. It provides location awareness, supports active preheating and passive caching at the user/bucket/object level, and allows configurable data-eviction policies.
The multi-level caching strategy delivers a good speedup in our machine learning training scenarios.
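The two cache levels can be sketched as a simple read-through hierarchy. This is a toy model with dict-backed stores; the real caches add media tiering, eviction policies, and preheating:

```python
class TwoLevelCache:
    """Toy read-through cache: L1 local, L2 distributed, then lake storage."""

    def __init__(self, l1, l2, backend):
        self.l1, self.l2, self.backend = l1, l2, backend

    def get(self, key):
        if key in self.l1:              # L1: local cache, lowest latency
            return self.l1[key]
        if key in self.l2:              # L2: distributed cache
            value = self.l2[key]
            self.l1[key] = value        # promote to the local cache
            return value
        value = self.backend[key]       # miss: read from lake storage
        self.l2[key] = value            # passively populate both levels
        self.l1[key] = value
        return value

cache = TwoLevelCache(l1={}, l2={}, backend={"obj-1": b"bytes"})
assert cache.get("obj-1") == b"bytes"   # backend read, caches populated
assert "obj-1" in cache.l1 and "obj-1" in cache.l2
```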
In addition, the storage layer supports predicate pushdown, which significantly reduces the amount of data moving between storage and compute nodes, lowering resource overhead and improving compute performance.
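The effect of predicate pushdown can be illustrated with a toy scan; the predicate and row format here are invented for illustration:

```python
def scan(rows, predicate=None):
    """Storage-side scan: applying the predicate at the storage layer means
    only matching rows are shipped to the compute node."""
    for row in rows:
        if predicate is None or predicate(row):
            yield row

table = [{"id": i, "temp": i * 10} for i in range(1000)]
hot = list(scan(table, lambda r: r["temp"] > 9900))
assert len(hot) == 9   # 9 rows cross the network instead of 1000
```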
There is still a great deal of detailed work to do in data lake acceleration, and we are continuously improving it.
▌ Future outlook
The current CBFS 2.x release is already open source, and version 3.0, which supports key features such as online EC, lake acceleration, and multi-protocol access, is expected to be open-sourced in October 2021.
CBFS will subsequently add features such as direct mounting of existing HDFS clusters (with no data migration) and intelligent hot/cold data tiering, so that existing big data and AI workloads can move smoothly into the lake under their original architectures.
About the author:
Xiaochun, OPPO Storage Architect
For more content, follow the [OPPO Digital Intelligence Technology] official account.