One billion is the order of magnitude of WeChat's user base. Behind this huge number are the billion-scale read and write demands that businesses such as "take a look", WeChat advertising, WeChat payment, and mini programs place on the database. So how was FeatureKV, a powerful storage system, born in this scenario?
Background: two billion-scale challenges
PaxosStore is a strongly consistent distributed storage system widely used in WeChat. It supports WeChat's online applications across the board, with a peak of over 100 million TPS, runs on thousands of servers, and performs strongly in online service scenarios. But there is no silver bullet in software development, and PaxosStore faces two new billion-scale challenges in scenarios where data is produced offline and read online:
The 1 billion/second challenge:
The "take a look" team needs a storage system to store the models required by the CTR process, separating storage from computation so that the size of the recommendation model is not limited by single-machine memory.
Each time articles are ranked and scored, CTRSVR pulls thousands of features from this storage system. These features must be of the same version, and PaxosStore's BatchGet does not guarantee version consistency.
The business side estimates that the storage system needs to support a QPS of 1 billion features/second, while PaxosStore has a fixed number of replicas and cannot add read-only replicas.
This storage system also needs version management and model management functions, supporting rollback to historical versions.
The 1 billion/hour challenge:
Many teams in WeChat reported that they need to regularly write information for 1 billion users (the order of magnitude of WeChat's user base) into PaxosStore every day, but PaxosStore's write speed cannot meet the requirement: sometimes a full write cannot even finish within a day, and writing too fast affects the network of other businesses.
PaxosStore is a storage system that guarantees strong consistency, designed for online business, and its performance meets online needs. However, for scenarios that are written offline in batches, read online, and need no strong-consistency guarantee, meeting the business requirements with PaxosStore is costly.
With more and more data-driven applications, this kind of storage need keeps growing. We needed to solve this problem and bring the write time for a billion-key dataset down to about one hour.
These scenarios share the characteristics of periodic batch writes and online read-only access. To address their pain points, we designed and implemented FeatureKV on top of the powerful WFS (WeChat's self-developed distributed file system) and the rock-solid Chubby (WeChat's self-developed metadata store). FeatureKV is a high-performance key-value storage system with the following features:
High performance and easy to scale
Excellent read performance: on the B70 model, a fully in-memory table can serve tens of millions of QPS; on the TS80A model, data stored on SSD can serve millions of QPS.
Good write performance: with sufficient remote file system performance, writing 1 billion keys with an average value size of 400 bytes can be completed within an hour.
Easy to scale: horizontal (read performance) and vertical (capacity) scaling can be completed within hours; write performance scaling is just scaling a stateless module (DataSvr), which takes minutes.
Batch write friendly support
Task-based write interface: supports WFS/HDFS files as input; the business side does not need to write or run data-loading tools; supports retry on failure and alerting.
Supports incremental and full updates: an incremental update overwrites a batch of newly input key-values on top of the previous version, leaving keys not present in the input unchanged; a full update discards all data of the previous version and inserts a fresh batch of key-values.
Supports TTL: expired data is deleted automatically.
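To make the two update modes concrete, here is a minimal Python sketch, with plain dicts standing in for a version's key-value data (not FeatureKV's actual implementation):

```python
def full_update(prev: dict, batch: dict) -> dict:
    """A full update discards the previous version and keeps only the new batch."""
    return dict(batch)

def incremental_update(prev: dict, batch: dict) -> dict:
    """An incremental update overlays the batch on the previous version;
    keys absent from the input stay unchanged."""
    merged = dict(prev)
    merged.update(batch)
    return merged

v1 = {"a": 1, "b": 2}
v2 = incremental_update(v1, {"b": 20, "c": 30})  # {'a': 1, 'b': 20, 'c': 30}
v3 = full_update(v2, {"d": 4})                   # {'d': 4}
```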
Version management capability
Transactional BatchGet interface: guarantees that all data returned by one BatchGet comes from the same version.
Support for historical version rollback: each update produces a new version, and rollback to historical versions, including incrementally updated ones, is supported.
Of course, there are no silver bullets in software development, and FeatureKV makes design trade-offs:
Online writes are not supported; however, when the amount of data is small (GB level), FeatureKV can achieve an update frequency of around ten minutes.
Strong consistency is not guaranteed; eventual consistency is guaranteed, and sequential consistency most of the time.
FeatureKV is now widely used in WeChat, including "take a look", WeChat advertising, WeChat payment, mini programs, and other businesses. The following sections describe the design of FeatureKV and how it solves the two billion-scale challenges above.
The overall design
- System architecture
FeatureKV involves three external dependencies:
Chubby: stores the system's metadata. Many places in FeatureKV achieve distributed collaboration and communication by polling the metadata in Chubby.
USER_FS: the business side's distributed file system, which can be WFS or HDFS. FeatureKV's write interface is task-based, and the input is a path on a distributed file system.
FKV_WFS: the distributed file system FeatureKV uses to store the DataSvr-generated data files consumed by KVSvr. Multiple historical versions can be kept to support rollback.
All three external dependencies can be shared with other businesses: FKV_WFS and USER_FS can be the same module, FKV_WFS can be replaced by HDFS, and Chubby can be replaced by etcd.
DataSvr: mainly responsible for writing data. It takes input from USER_FS and, through data format reorganization, routing and sharding, index building, and other steps, generates data files usable by KVSvr and writes them to FKV_WFS.
It is a stateless service; the state of write tasks is stored in Chubby, so scaling out DataSvr increases the system's write performance.
Two instances are generally enough; scenarios with many write tasks can be scaled out appropriately.
KVSvr: provides the external read service. It polls Chubby to detect data updates, pulls data from WFS to the local machine, loads it, and serves read-only requests.
It is a stateful service; a KVSvr module consists of K Sects and N Roles, K × N machines in total.
Each Sect holds a full copy of the data, and each BatchGet only needs to be sent to one Sect; adding Sects scales read performance without increasing the number of RPCs per BatchGet.
Machines with the same Role are responsible for the same data shards; when a single machine fails, the batch request can simply be retried against a peer machine with the same Role.
K is at least 2, to guarantee the system's disaster tolerance, including availability during changes.
N cannot be an arbitrary number; see the data routing section below.
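A rough sketch of how a BatchGet might fan out under this K-Sect × N-Role layout. The helper names are hypothetical, and the real system maps shards to Roles via a consistent hash over 2400 parts, simplified to a plain modulo here:

```python
import random

N_PARTS = 2400  # number of data shards, as described in the routing section

def batch_get(keys, sect_num, role_num, rpc):
    """Pick one Sect, group keys by the Role that owns their shard, and issue
    one RPC per Role. On a per-machine failure, retry the same Role in the
    next Sect. `rpc(sect, role, keys)` is a hypothetical transport call."""
    sect = random.randrange(sect_num)
    by_role = {}
    for k in keys:
        part = hash(k) % N_PARTS                           # HashFun(key) % N
        by_role.setdefault(part % role_num, []).append(k)  # simplified part->role map
    result = {}
    for role, ks in by_role.items():
        try:
            result.update(rpc(sect, role, ks))
        except IOError:
            result.update(rpc((sect + 1) % sect_num, role, ks))  # retry in another Sect
    return result
```

Note how the RPC count per BatchGet is bounded by the number of Roles, never by the total machine count: adding Sects adds replicas without adding fan-out.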
FeatureKV only supports batch data writes; each write task can be an incremental or full update, with no limit on the amount of data per write. In designing the offline bulk write interface, we stepped into a few pitfalls:
Initially, we planned to encapsulate some classes/tools, have the business side package key-value data with them, and write it directly into the FKV_WFS directory. This scheme saved the most bandwidth, but it made our subsequent data format upgrades cumbersome, requiring the cooperation of every business party, so it was scrapped.
Then we built a new module, DataSvr, which exposed a TCP server: the business side outputs key-values, and a writing tool sends the key-value data to the TCP server, which completes the packaging. However, the following problems remained:
Write speed depended on the business side's code quality and machine resources. We once ran into a case where the business side's code parsed floating-point input with std::stringstream, which took 90%+ of the CPU (std::strtof would be much faster), and cases where the machine running the business side's tool had 90%+ of its CPU taken by other processes; the eventual feedback was simply that FeatureKV wrote slowly.
Routine releases of DataSvr or machine failures would cause tasks to fail. The front-end tool-driven approach could not retry a task, because the key-value input stream cannot be replayed.
Finally, we designed a task-style interface that takes a path on USER_FS as input:
The business side puts data in the agreed format on USER_FS and submits a write task to DataSvr.
DataSvr streams the data from USER_FS; reformats, routes, and shards it; builds indexes; then writes the data to FKV_WFS and updates the metadata in Chubby. The distributed execution of write tasks, retries on failure, and so on are also coordinated through the task state in Chubby.
KVSvr polls Chubby to detect data updates, pulls the data locally, loads it, and serves it.
- Data routing
To allow for scaling, FeatureKV splits one version of data into N shards, currently N = 2400, and determines which file a key belongs to by hashing: HashFun(key) % N.
Which files a KVSvr loads is determined by consistent hashing. KVSvrs with the same Role load the same batch of files, and during scaling, the unit of data movement is a file.
Since this consistent hash has only 2400 nodes, an obvious load imbalance occurs when 2400 is not divisible by the number of machines in a Sect, so the number of Roles in a FeatureKV Sect must be a divisor of 2400. Fortunately 2400 is a lucky number: its divisors within 30 include 1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 16, 20, 24, 25, and 30, which already satisfies most scenarios.
The figure above is an example with N = 6, where Part_00[0-5] represents the 6 data files. When expanding from RoleNum=2 to RoleNum=3, only two files need to move: Part_003 from Role_0 to Role_2, and Part_005 from Role_1 to Role_2.
Since N = 2400 is used in production and the number of nodes is small, to reduce the cost of each routing computation, we enumerate all cases of RoleNum < 100 && 2400 % RoleNum == 0 and precompute a consistent hash table for each.
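The precomputed routing tables can be sketched as follows. This is a simplification: contiguous range assignment stands in for the real consistent hash (it gives the same perfect balance when RoleNum divides 2400), and CRC32 stands in for the unspecified HashFun:

```python
import zlib

N_PARTS = 2400

# All legal Role counts: RoleNum < 100 and 2400 % RoleNum == 0.
LEGAL_ROLE_NUMS = [n for n in range(1, 100) if N_PARTS % n == 0]

def build_route_table(role_num):
    """Precompute part -> role for one legal RoleNum."""
    assert N_PARTS % role_num == 0, "RoleNum must be a divisor of 2400"
    per_role = N_PARTS // role_num
    return [part // per_role for part in range(N_PARTS)]

ROUTE_TABLES = {n: build_route_table(n) for n in LEGAL_ROLE_NUMS}

def role_of(key: bytes, role_num: int) -> int:
    part = zlib.crc32(key) % N_PARTS  # stand-in for HashFun(key) % N
    return ROUTE_TABLES[role_num][part]
```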
- System scalability
FeatureKV's FKV_WFS holds all the data of the currently available versions, so scaling out only requires the machines of the new Role to pull their numbered files from FKV_WFS, and the machines of the old Roles to discard theirs.
When the BatchSize is large enough, the number of RPCs per request equals the number of Roles, and these RPCs all run in parallel. When the number of Roles is large, these RPCs have a higher probability of containing at least one long-tail request, and the latency of a BatchGet depends on the slowest RPC. The figure above shows the BatchGet long-tail probability for different numbers of Roles, assuming the probability of a single RPC being a long-tail request is 0.01%, calculated by the formula 1 − (0.9999^N).
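The curve in the figure comes from this simple calculation:

```python
def longtail_prob(p_single: float, roles: int) -> float:
    """Probability that at least one of the `roles` parallel RPCs in a
    BatchGet is a long-tail request, each independently long-tail with
    probability p_single: 1 - (1 - p)^N."""
    return 1.0 - (1.0 - p_single) ** roles

# With p = 0.01% per RPC, 10 Roles give roughly 0.1% and 100 Roles roughly 1%.
p10 = longtail_prob(0.0001, 10)
p100 = longtail_prob(0.0001, 100)
```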
Adding Sects (read performance scaling):
Each Sect holds a full copy of the data, so adding a Sect means adding a read-only replica, which scales read performance.
Since a BatchGet only needs to be sent to one Sect, the number of RPCs stays bounded: it will not fan out into 200 RPCs just because there are 200 KVSvrs underneath. This design reduces the average BatchGet latency and the probability of long-tail requests.
Adding Roles (storage capacity + read performance scaling):
Assuming each machine has equal storage capacity, increasing the number of Roles increases storage capacity.
Since the whole module then has more machines, read performance also increases; the effect on the module's overall read throughput is equivalent to adding a Sect.
However, when the number of Roles is large, a BatchGet involves more machines and the probability of long-tail requests rises, so it is generally recommended to keep the number of Roles at no more than 30.
Adding DataSvr (write performance scaling):
DataSvr is a stateless service and can be scaled within minutes.
A write task runs in a distributed fashion: one write is divided into multiple parallel jobs, so increasing the number of DataSvr instances increases the write performance of the whole module.
Data migration is always at file granularity, with no complex migration logic. Ignoring the gray-release process, it can be completed within hours; with gray release it is generally done within a day.
- Disaster tolerance
Machines of the same Sect are deployed in the same campus, so deploying just two Sects is enough to tolerate the failure of an entire campus.
A concrete case: on March 23, 2019, an optical cable in Shanghai's Nanhui campus was cut. One third of the machines of one FeatureKV module were there, and the service stayed stable throughout the failure.
Some RPC timeouts occurred during the failure, increasing long-tail requests, but most requests succeeded after a retry and the number of final failures was very low. After the Nanhui campus machines were globally shielded, the long-tail requests and final failures disappeared completely.
Even if both external dependencies (FKV_WFS and Chubby) fail, FeatureKV's KVSvr can still provide read-only service, which is sufficient for most scenarios of periodic batch writes with online read-only access.
Another case: on June 3, 2019, a distributed file system cluster failed and was unavailable for 9 hours. The USER_FS and FKV_WFS of one FeatureKV module were both on this cluster. The business side's data-production process also stopped during the failure, so no write tasks were generated, and FeatureKV's read service was stable throughout.
The billion-per-second challenge – detailed design of the online read service
- KVSvr read performance optimization
In order to improve the performance of KVSvr, we adopted the following optimization methods:
High-performance hash table: for data with small volume and high read rates, FeatureKV serves it with MemTable, a fully in-memory table structure. The underlying implementation of MemTable is a read-only hash table we implemented ourselves, which reaches 28 million QPS with 16 threads accessing it concurrently. This already exceeds the performance of the RPC framework and will not become the system's bottleneck.
Libco aio: for data with larger volume and lower read rates, FeatureKV serves it with BlkTable or IdxTable, which store data on SSD. The read performance of SSDs must be fully exploited through multiple concurrent accesses, but an online service cannot open too many threads, and operating system scheduling is costly. So we used libco's encapsulation of Linux aio to implement multi-way concurrent disk reads at the coroutine level. In stress tests with 100-byte values, the 4 SSDs on a TS80A reached 1.5 million+ QPS.
Packet serialization: during perf tuning we found that with large batch sizes (the average batch_size of the CTR feature module is 4k+), the serialization of RPC packets took considerable time, so we implemented our own layer of serialization/deserialization, with the RPC layer's parameter being a binary buffer.
Data compression: different businesses have different compression needs. In model storage scenarios, a value is a floating-point number or an array of floating-point numbers representing non-zero features. Plaintext compression algorithms like Snappy work poorly here: the compression ratio is low and CPU is wasted. For such scenarios we introduced half-precision floating-point numbers (provided by kimmyzhang's sage library) to compress data in the transmission phase and reduce bandwidth costs.
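The bandwidth saving from half precision can be illustrated with a small sketch. The real system uses kimmyzhang's sage library; Python's struct 'e' format (IEEE 754 half precision) is a stand-in here:

```python
import struct

def compress_to_f16(values):
    """Pack floats as half precision (2 bytes each) instead of single
    precision (4 bytes), halving bandwidth at the cost of precision."""
    return struct.pack(f"<{len(values)}e", *values)

def decompress_f16(buf):
    return list(struct.unpack(f"<{len(buf) // 2}e", buf))

feats = [0.125, 1.5, -2.75]       # exactly representable in half precision
packed = compress_to_f16(feats)   # 6 bytes instead of 12
```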
- Implementation of distributed transaction BatchGet
Background of the requirement: updates are divided into full updates and incremental updates. One update covers multiple pieces of data, each update increments the version number, and a BatchGet returns multiple pieces of data. The business side wants these updates to be transactional: if a BatchGet arrives while an update is only partially applied, it should return the previous version's data rather than a half-old, half-new mixture.
First consider the single-machine case, where the data is not sharded and all resides on the same machine. After some research we found two approaches:
MVCC (multi-version concurrency control): the concrete implementation is a storage engine like LevelDB, which keeps multiple versions of data, controls data lifetimes through snapshots, and can access a specified version. In this scheme, the data structures must support both reads and writes, background threads are needed to clean up expired data, and supporting full updates is also complicated.
COW (copy-on-write): the concrete implementation is a double-buffer switch. In FeatureKV's scenario, an incremental update additionally needs to copy the previous version's data and apply the increment on top. The advantage is that the data structures can be designed to be read-only, and read-only structures achieve higher performance; the disadvantage is double the space overhead.
To guarantee online service performance, we took the COW approach and designed the read-only hash table mentioned in the first part to implement single-machine transactional BatchGet.
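A minimal sketch of the double-buffer COW scheme described above (dicts stand in for the read-only hash table; the real structure is custom and lock-free on the read path):

```python
import threading

class DoubleBufferTable:
    """Reads always see one immutable buffer; an update builds the other
    buffer and atomically switches the index, so all keys fetched through
    one buffer belong to a single consistent version."""

    def __init__(self):
        self._buffers = [{}, {}]       # two versions of read-only data
        self._cur = 0                  # index of the buffer serving reads
        self._lock = threading.Lock()  # serializes writers only

    def batch_get(self, keys):
        buf = self._buffers[self._cur]  # snapshot: one version for all keys
        return {k: buf.get(k) for k in keys}

    def update(self, batch, incremental=True):
        with self._lock:
            nxt = 1 - self._cur
            base = dict(self._buffers[self._cur]) if incremental else {}
            base.update(batch)
            self._buffers[nxt] = base
            self._cur = nxt  # pointer switch; the old buffer is kept around

t = DoubleBufferTable()
t.update({"a": 1, "b": 2}, incremental=False)
t.update({"b": 20})
```

Keeping the previous buffer instead of freeing it is what later makes version rollback and the "last two versions" guarantee essentially free.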
Now consider the distributed case: the data is spread across machines, and different machines finish loading data at different times, so from a distributed perspective there may be no unified version.
An intuitive idea is to keep the most recent N versions and then select the newest version that every Role already holds.
The value of N affects the cost of storage resources (memory, disk), and its minimum is 2. To reach N = 2, we added the following two constraints on the DataSvr side:
Updates to a single table are serial.
Before a write task is declared finished, an extra version-alignment step waits until all KVSvrs have loaded the latest version.
In this way, we can guarantee a unified version across the cluster while retaining only the last two versions. In the COW scenario, as long as the deletion of the data in the other buffer is postponed (it is not deleted until the next update), the last two versions are retained with no extra memory overhead.
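The version-alignment step can be pictured as DataSvr polling the version each KVSvr reports until everyone has the newest one. A hypothetical sketch (in reality the loaded versions are read from Chubby):

```python
import time

def wait_version_aligned(target_version, poll_loaded_versions,
                         interval=1.0, timeout=600.0):
    """Block until every KVSvr reports having loaded `target_version`.
    `poll_loaded_versions()` is a hypothetical callback returning the list
    of versions currently loaded by each KVSvr."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        loaded = poll_loaded_versions()
        if loaded and min(loaded) >= target_version:
            return True  # all KVSvrs aligned; the write task may finish
        time.sleep(interval)
    return False
```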
With a globally consistent version, how should transactional BatchGet be implemented?
Should a first round of RPCs ask each Role for its current version? That would double the QPS, and a machine might update between the two rounds anyway.
In fact, data updates and version changes are infrequent; most of the time every Role simply returns the latest version. Each reply can also carry the version of the other buffer, so that when the client does see inconsistent versions, it can choose a globally available version, SyncVersion, and retry only the data that is not at SyncVersion.
The window of inconsistency during a data update can last minutes, which could trigger waves of retries and affect system stability. So we added one more optimization: caching SyncVersion. On each BatchGet, if a SyncVersion cache exists, the client directly requests data at that SyncVersion.
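A client-side sketch of this protocol. `rpc(keys, version)` is a hypothetical call returning, per key, a (version, value, back_buffer_version) triple; the SyncVersion-selection rule is simplified here and the real protocol handles more corner cases:

```python
def transactional_batch_get(keys, rpc):
    """Fetch keys; if the returned versions are mixed, pick a SyncVersion
    that every reply can still serve (its current or back-buffer version)
    and retry only the keys that are not at SyncVersion."""
    results = rpc(keys, None)  # {key: (version, value, back_version)}
    versions = {v for v, _, _ in results.values()}
    if len(versions) <= 1:
        return {k: val for k, (_, val, _) in results.items()}
    # A version available everywhere: present in each reply's {cur, back} pair.
    candidates = set.intersection(*({v, b} for v, _, b in results.values()))
    sync_version = max(candidates)
    out = {k: val for k, (v, val, _) in results.items() if v == sync_version}
    stale = [k for k in results if k not in out]
    retried = rpc(stale, sync_version)  # second round only for stale keys
    out.update({k: val for k, (_, val, _) in retried.items()})
    return out
```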
- Version rollback
The metadata of each table contains a rollback-version field. The default value 0 means the table is not in the rollback state; a non-zero value means the table is rolled back to that version.
First consider how to implement version rollback:
Consider the simple case first, where a table is fully updated every time. A rollback then simply asks KVSvr to pull the specified version's data from FKV_WFS to the local machine and go through the normal full-update loading process.
Then consider increments. If a table is updated incrementally every time, rolling back to a version Vi requires pulling V1 through Vi to the KVSvr and replaying them, much like replaying a database binlog, which is impossible once tens of thousands of incremental versions have accumulated.
We need an asynchronous worker that merges a run of increments with the preceding full version into a new full version, like a checkpoint, so that a rollback never involves too many incremental versions. This asynchronous worker is implemented in DataSvr.
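The checkpoint worker's core operation is just folding increments onto the last full version, sketched here with dicts standing in for version data files:

```python
def checkpoint(full_version: dict, increments: list) -> dict:
    """Fold a run of incremental versions onto the last full version to
    produce a new full version, so a rollback never has to replay a long
    binlog-like chain of increments."""
    merged = dict(full_version)
    for inc in increments:       # apply increments oldest-first
        merged.update(inc)
    return merged

# Full version V7 plus increments V8..V10 collapse into a new full V10.
v10 = checkpoint({"a": 1}, [{"b": 2}, {"a": 3}, {"c": 4}])
```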
Going further, there is an optimization: if the rollback target version is still in the local double buffer, simply switching the double buffer's pointer achieves a rollback within seconds. In practice, many rollback operations go back to the last normal version, usually the previous one, which is still in the local double buffer.
A table in the rollback state rejects data writes, preventing the erroneous data from being written again.
Then consider how to cancel a rollback:
Cancelling a rollback means letting the table continue to serve the rolled-back version's data, and performing subsequent incremental updates on top of that version.
If we simply cleared the rollback state, production would update back to the pre-rollback version, and any traffic in the meantime would read the abnormal pre-rollback data: there is a time window.
The data's version number must increase monotonically, since the data-update process depends on it, so we also cannot simply delete the last chunk of data.
To avoid this problem we borrowed COW's idea and copy first: the implementation writes out the currently rolled-back version as a brand-new full version, making it the latest version of the data.
This step takes some time, but in rollback scenarios we do not demand that cancelling a rollback be fast; as long as the rollback itself is fast enough, it is safe.
The billion-per-hour challenge – detailed design of the offline write process
DataSvr's main job is to write data from USER_FS into FKV_WFS. During writing, it needs to do routing/sharding and data format reconstruction; this is a streaming process.
There are currently three table structures in FeatureKV. Different table structures have different processing logic in the write process:
MemTable: data fully in memory; the index is an unordered hash structure; capacity is limited by memory; the offline write logic is simple.
IdxTable: index fully in memory; the index is an ordered array; the number of keys is limited by memory; the offline write logic is relatively simple, only needing to additionally write out the index.
BlkTable: block index fully in memory; the index is ordered data recording the begin_key and end_key of each 4KB data block on disk; capacity is unlimited; the offline write process is complex, since the data files need to be sorted.
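BlkTable's block index can be illustrated with a hedged sketch, where a dict stands in for reading one sorted 4KB block from disk. The real index also records end_key; begin_key alone suffices here because the blocks are sorted and contiguous:

```python
import bisect

class BlkIndex:
    """In-memory index over sorted on-disk blocks: keep each block's
    begin_key, binary-search the begin_keys to locate the one block that
    could contain a key, then do a single disk read of that block."""

    def __init__(self, blocks):
        # blocks: list of (begin_key, block_contents) sorted by begin_key
        self._begins = [b for b, _ in blocks]
        self._blocks = [p for _, p in blocks]

    def lookup(self, key):
        i = bisect.bisect_right(self._begins, key) - 1  # last begin_key <= key
        if i < 0:
            return None
        return self._blocks[i].get(key)  # stands in for a 4KB disk read

idx = BlkIndex([("a", {"a": 1, "c": 3}), ("d", {"d": 4, "f": 6})])
```

Because only one begin_key per 4KB block is kept in memory, capacity is bounded by disk, not by the number of keys.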
- Single-machine DataSvr
In the beginning we only had MemTable, with all data in memory. A MemTable holds at most 200+ GB, which is not a huge volume; single-machine processing saves the cost of distributed coordination, result merging, and other steps, so we used the structure above:
A write task is performed by only one DataSvr.
A Parser processes one input file at a time: it parses out key-value data, computes the route, and delivers the data to the corresponding Que.
A Sender handles the data of one Que, which corresponds to multiple FKV_WFS files underneath; one file on FKV_WFS can only be written by one Sender.
The general idea is to run the parallelizable steps in parallel and drain the hardware resources.
In the concrete implementation, many batching optimizations were added, such as buffered file-system IO and batched queue operations, to push the throughput of the whole system as high as possible.
Eventually, on a 24-core machine, the write speed reached about 100 MB/s; writing 100 GB of data takes about 20 minutes.
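The Parser/Que/Sender pipeline above is essentially a producer-consumer setup, sketched below with hypothetical `parse` and `write` callbacks (the real DataSvr adds the buffering and batching optimizations just mentioned):

```python
import queue
import threading

N_QUE = 4  # one queue, and one Sender, per group of output files

def run_pipeline(input_files, parse, write):
    """Parser threads consume input files and route each key-value to a
    queue by shard; one Sender per queue drains it into that queue's own
    output files, so a given output file has exactly one writer."""
    ques = [queue.Queue(maxsize=1024) for _ in range(N_QUE)]

    def parser(path):
        for key, value in parse(path):
            ques[hash(key) % N_QUE].put((key, value))  # route by shard

    def sender(i):
        while True:
            item = ques[i].get()
            if item is None:   # sentinel: all parsers are done
                return
            write(i, item)

    senders = [threading.Thread(target=sender, args=(i,)) for i in range(N_QUE)]
    for s in senders:
        s.start()
    parsers = [threading.Thread(target=parser, args=(p,)) for p in input_files]
    for p in parsers:
        p.start()
    for p in parsers:
        p.join()
    for q in ques:             # one sentinel per queue shuts its Sender down
        q.put(None)
    for s in senders:
        s.join()
```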
- Distributed DataSvr
Later, FeatureKV needed to handle billion-key, terabyte-scale data writes, so we added IdxTable and BlkTable, which posed two challenges for the write process:
The generated data needs to be ordered; only ordered data can be range-indexed, so that a single machine's key count is not limited by memory.
Terabyte-scale write speed: 100 MB/s is not enough, since writing 1 TB would take almost 3 hours, and it is not scalable: even with many machines it would still take 3 hours, yet scalability is required.
Consider the data sorting problem first:
The data sharding must run first before all the data of one Part can be collected and sorted. The earlier data sharding is like the Map of MapReduce, and the subsequent sorting is the Reduce; the Reduce carries a large computational cost and needs to be distributed.
In the Map stage, the single-machine DataSvr logic above is reused. After data sharding, a temporary full result is obtained; then distributed Reduce logic runs, where the input of each Reduce is one unordered data shard and the output is an ordered data file plus its index.
This approach adds the overhead of one extra full write and one extra full read.
The specific process is shown in the figure below. The DATASVR SORTING stage is carried out by multiple DataSvrs, with each light-blue box representing one DataSvr instance.
Then consider the scalability in the case of large amount of data:
As shown in the figure above, DataSvr's sorting stage is already distributed; the only single point that cannot scale is the data-sharding stage.
There are two ways to make data sharding distributed:
First, each DataSvr processes part of the input User_Part files, and each DataSvr outputs 2400 sharded files. With K DataSvr instances participating, 2400 × K sharded files are generated; files with the same number must later be merged, or be used directly as input to the sorting stage.
Second, each DataSvr is responsible for generating part of the numbered FKV files, reading the full user input on each pass and producing one batch of numbered FKV files per pass.
With the first way, for MemTable or IdxTable we would need to merge TMP_i_0, TMP_i_1, TMP_i_2, … into a single FKV_i. For BlkTable, since a sorting stage follows anyway, the sorting logic only needs to be changed to accept multi-file input. The drawback is that when the data volume is small, building a MemTable or IdxTable via distributed sharding may be slower, because the merging stage costs more time than distributed sharding saves.
The second way generates the 2400 files directly, with no subsequent merging, but it brings read amplification: if the input is processed in T batches, there are T − 1 extra passes of full-read overhead. The larger the data volume, the more batches are needed, because the data to be sorted must fit entirely in memory and so can only be cut smaller.
Since single-machine sharding is already sufficient for small data volumes, we chose the first scheme.
Distributed sharding is thus an optional step: when the data volume is small, we skip it and fall back to the single-machine DataSvr process.
Finally, we have an offline process that scales linearly. Facing 1 billion keys and 1 TB of data:
Before BlkTable was implemented, this was an impossible task.
Before distributed data sharding was implemented, writing this dataset took 120 minutes.
Now, writing this dataset takes only 71 minutes.
The whole pipeline above is in fact much like MapReduce: the result of stitching multiple Map and Reduce stages together. We implemented it ourselves mainly for performance reasons, so the system could be optimized to the extreme.
Production status
FeatureKV is now deployed in 10+ modules with 270+ machines in total, serving businesses such as "take a look", search, WeChat advertising, mini programs, WeChat payment, data-center user profiling, nearby life, good-things circle, and more. It solves the problem of applying offline-generated data to online services and supports the development of all kinds of data-driven businesses.
The largest model storage module has 210 machines:
1.1 billion features/s: the average daily peak is 290k BatchGet requests/s with an average BatchSize of 3900; in stress tests the module reached 3 billion features/s.
15ms: 96.3% of BatchGet requests complete within 15ms, and 99.6% within 30ms.
99.999999%: 99.999999% of transactional BatchGets execute successfully.
WeChat advertising implements personalized pulling and personalized ad placement on top of FeatureKV, so recommendation strategies can be updated promptly. Compared with the old scheme, both pull volume and revenue grew significantly: pulls +21.8% and revenue +14.3%.
WeChat payment uses FeatureKV for face-to-face coupon issuance and payment risk control, storing billions of features; data that previously could not finish updating within a day can now be updated within hours.
At first, this need for periodic batch writes with online read-only access was not very common, and businesses generally solved it with PaxosStore or file distribution.
But as more and more applications and needs revolve around data, such data must be fed to online services in bulk on a regular schedule, with strong version management capability: user profiles, machine-learning models (DNN, LR, FM), rules, dictionaries, even forward/inverted indexes, and so on. So we developed FeatureKV to solve this class of pain points, and it has achieved good results.
This article is reproduced from the public account of “yunjia community” : QcloudCommunity