Once a business reaches a certain scale, a real-time data warehouse becomes a necessary basic service. From a data-driven perspective, the importance of a multidimensional real-time data analysis system is self-evident. At very large data volumes, however, such a system is hard to build: taking Tencent as an example, the volume of data reported in a single day reaches the trillion level. Achieving real-time computation with extremely low latency, together with sub-second multidimensional real-time queries, is a technical challenge.
This article introduces the technical architecture of Tencent's real-time data warehouse and multidimensional real-time data analysis system in the information flow scenario.
1. Solvable pain points
Let's first look at the pain points that a multidimensional real-time data analysis system can solve. For example:
- Colleagues on the recommendation team launched a recommendation strategy 10 minutes ago and want to know how it is performing across different user groups.
- Colleagues in operations want to know which regional content about Guangdong is most popular among users in Guangdong Province, so that they can push regional content.
- Colleagues in content review want to know which content and accounts in the game category have been reported the most in the past five minutes.
- The boss may want to know how many users consumed content in the past 10 minutes, to get a macro view of the consuming population.
2. Survey
Before starting development, we surveyed whether the existing platforms could meet these requirements.
1. Can the offline data analysis platform meet these requirements? It cannot, for the following reasons.
- After C-side data is reported, it goes through multiple layers of offline Spark computation, and the final results are exported to MySQL or ES for the offline analysis platform to query. This process has a delay of at least 3-6 hours, and in practice queries are commonly available only the next day, so many business scenarios with strict real-time requirements cannot be served.
- The data volume of Tencent's watch points (Kandian) business is also very large, which makes the pipeline unstable and often causes unexpected delays. The offline analysis platform therefore cannot meet many requirements.
2. Can the real-time data analysis platform meet them? The business group has a platform that provides quasi-real-time data queries, built on Kudu + Impala. Impala is a big data computing engine with an MPP architecture, and it reads from Kudu, which stores data in columns. Even so, for real-time analysis scenarios, query response time and data delay are still relatively high. A single query for real-time DAU takes at least a few minutes to return, which does not give a good interactive experience. So although the general-purpose (Kudu + Impala) framework is faster than the (Spark + HDFS) offline analysis framework, it still cannot meet our stricter real-time requirements.
3. Project background
With that introduction, let's look at the background of our project. Content published by authors is ingested through the content center and, after the content review stage, is either enabled or taken down. Enabled content is passed to the recommendation system and the operation system, which distribute it on the C side. Once content reaches C-side users, they generate various behaviors, such as exposure, click, and report, which are collected through event tracking and written to the message queue in real time. On top of this we did two pieces of work, shown as the two colored parts in the figure.
- The first part builds Tencent's real-time data warehouse.
- The second part develops a multidimensional real-time data analysis system based on an OLAP storage engine.
Why build a real-time data warehouse? The raw reported data volume is very large, with more than one trillion reports per day, and the reporting format is chaotic. It lacks content dimension information and user profile information, so downstream consumers cannot use it directly. Our real-time data warehouse is built around the business scenario of Tencent's information flow: it joins in content dimensions and user profiles and aggregates at several granularities, so that downstream consumers can use real-time data conveniently.
4. Scheme selection
Let's look at the scheme selection for our multidimensional real-time data analysis system. We compared the leading solutions in the industry and selected those that best fit our business scenario.
- The first choice is the real-time data warehouse architecture. We chose the Lambda architecture, which is mature in the industry. Its advantages are high flexibility, high fault tolerance, high maturity, and low migration cost. Its disadvantage is that real-time and offline data use two sets of code, so a metric definition may be modified in one and left unchanged in the other. To guard against this, we reconcile the data every day and raise an alarm if there is a discrepancy.
- The second is the real-time computing engine. Flink was designed for stream processing from the start, Spark Streaming is strictly speaking micro-batch processing, and Storm is no longer in common use. We chose Flink because of its exactly-once accuracy, lightweight checkpoint fault tolerance mechanism, low latency, high throughput, and ease of use.
- The third is the real-time storage engine. Our requirements are dimensional indexes, support for high concurrency, pre-aggregation, and high-performance real-time multidimensional OLAP queries. HBase, TDSQL, and ES cannot meet these requirements. Druid has a drawback: it divides segments by time, so the same content cannot be kept in the same segment, and a global top-N can only be computed approximately. We therefore chose ClickHouse, an MPP database engine that has become very popular in the last two years.
5. Design objectives and difficulties
Our multidimensional real-time data analysis system is divided into three modules:
- Real-time computing engine
- Real-time storage engine
- App layer
The difficulty lies in the first two modules, the real-time computing engine and the real-time storage engine.
- For the real-time computing engine, the difficulty is ingesting a massive stream of tens of millions of records per second in real time and joining it with dimension tables at extremely low latency.
- For the real-time storage engine, the difficulty is supporting highly concurrent writes, distributed high availability, and high-performance indexed queries.
To see how these modules are implemented, let's look at our system architecture design.
6. Architecture design
The front end uses the open source component Ant Design. An Nginx server serves the static pages and reverse-proxies browser requests to the backend server.
The backend service is built on an RPC backend service framework developed in-house at Tencent, and performs some second-level caching.
The real-time data warehouse part is divided into an access layer, a real-time computing layer, and a real-time data warehouse storage layer.
- The access layer mainly splits micro-queues for different behavior data out of the raw message queue, which carries on the order of ten million records per second. Taking watch points video as an example, after splitting the data rate is only on the order of a million records per second.
- The real-time computing layer is mainly responsible for the row-to-column conversion of multi-row behavior stream data, and for joining user profile data and content dimension data in real time.
- The real-time data warehouse storage layer is designed to fit the watch points business and to give downstream consumers easy-to-use real-time message queues. For now we provide two message queues as the two layers of the real-time data warehouse. The DWM layer is aggregated at content ID + user ID granularity: each record contains a content ID and a user ID, along with B-side content data, C-side user data, and user profile data. The DWS layer is aggregated at content ID granularity: each record contains a content ID, B-side data, and C-side data. The traffic of the content ID + user ID queue is further reduced to the order of 100,000 records per second, and that of the content ID queue to the order of 10,000 records per second, with a clearer format and richer dimension information.
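The relationship between the two layers can be illustrated with a minimal sketch. The field names (`exposure`, `click`, `user_count`) and the function `rollup_dwm_to_dws` are hypothetical and chosen only for illustration; the article does not give the actual record schema. The sketch rolls records at content ID + user ID granularity (DWM) up to content ID granularity (DWS):

```python
from collections import defaultdict

def rollup_dwm_to_dws(dwm_records):
    """Aggregate (content_id, user_id) rows into one row per content_id."""
    acc = defaultdict(lambda: {"exposure": 0, "click": 0, "users": set()})
    for rec in dwm_records:
        row = acc[rec["content_id"]]
        row["exposure"] += rec["exposure"]
        row["click"] += rec["click"]
        row["users"].add(rec["user_id"])
    # Replace the user set with a count so the DWS row no longer carries user IDs.
    return {
        cid: {"exposure": r["exposure"], "click": r["click"], "user_count": len(r["users"])}
        for cid, r in acc.items()
    }

# Hypothetical DWM records: one row per (content ID, user ID).
dwm = [
    {"content_id": "c1", "user_id": "u1", "exposure": 3, "click": 1},
    {"content_id": "c1", "user_id": "u2", "exposure": 2, "click": 0},
    {"content_id": "c2", "user_id": "u1", "exposure": 1, "click": 1},
]
dws = rollup_dwm_to_dws(dwm)
```

Three DWM rows collapse into two DWS rows here, which is the same effect that reduces queue traffic between the two layers.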
Real-time storage is divided into a real-time write layer, an OLAP storage layer, and a backend interface layer.
- The real-time write layer is mainly responsible for hash-routing the data and writing it.
- The OLAP storage layer uses an MPP storage engine, with indexes and materialized views designed to fit the business, so as to store massive data efficiently.
- The backend interface layer provides efficient multidimensional real-time query interfaces.
7. Real-time computing
The two most complex parts of this system are real-time computing and real-time storage.
We introduce the real-time computing part first. It is divided into real-time dimension table association and the real-time data warehouse.
7.1 Real-time high-performance dimension table association
The difficulty of real-time dimension table association is this: for a real-time stream at the million-records-per-second level, associating directly with HBase means that one minute of data takes hours to finish, which causes serious data delay.
We propose several solutions:
- First, in the Flink real-time computing stage, we aggregate in 1-minute windows and convert the multiple rows of behavior data in each window into a single row with multiple columns. After this step, the association time drops from hours to a little over ten minutes, which is still not enough.
- Second, we put a Redis cache layer in front of the HBase content table. Fetching 1,000 records from HBase takes on the order of seconds, while fetching them from Redis takes milliseconds, so Redis is roughly 1,000 times faster. To keep stale data from wasting cache space, the cache expiration time is set to 24 hours, and cache consistency is maintained by listening on the HBase write proxy. This brings the access time down from a little over ten minutes to seconds.
- Third, many non-standard content IDs are reported. These IDs are not stored in the content HBase table, so they cause cache penetration. In real-time computing we filter these content IDs out directly, which prevents cache penetration and saves some more time.
- Fourth, because cache entries expire on a fixed timer, a cache avalanche can occur. To prevent it, we smooth out the load peaks in the real-time computation (peak shaving and valley filling).
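The first step above, window aggregation with row-to-column conversion, can be sketched in plain Python. This is not Flink code; the event fields (`ts`, `content_id`, `behavior`) and the choice of content ID as the grouping key are assumptions made for illustration:

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # 1-minute windows, as in the text

def pivot_windows(events):
    """Group events into 1-minute windows per content ID and
    turn behavior rows into columns (one count per behavior type)."""
    windows = defaultdict(lambda: defaultdict(int))
    for ev in events:
        window_start = ev["ts"] - ev["ts"] % WINDOW_SECONDS
        windows[(window_start, ev["content_id"])][ev["behavior"]] += 1
    return [
        {"window_start": start, "content_id": cid, **dict(counts)}
        for (start, cid), counts in sorted(windows.items())
    ]

# Hypothetical raw events: one row per behavior.
events = [
    {"ts": 5, "content_id": "c1", "behavior": "exposure"},
    {"ts": 20, "content_id": "c1", "behavior": "exposure"},
    {"ts": 42, "content_id": "c1", "behavior": "click"},
    {"ts": 65, "content_id": "c1", "behavior": "exposure"},
]
rows = pivot_windows(events)
```

Four raw rows become two wide rows, so the dimension table needs to be consulted twice instead of four times. At production volumes the same idea is what cuts the number of lookups.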
Comparing before and after optimization, the data volume dropped from the 10 billion level to the 1 billion level, and the time taken dropped from hours to tens of seconds, a reduction of 99%.
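The caching steps in the list above can be sketched together as a cache-aside lookup. Everything here is a stand-in: a plain dict plays the role of Redis, another dict plays the role of the HBase content table, and `is_valid_id` is a hypothetical filter for non-standard content IDs. The article does not say how expirations are spread out; randomizing the TTL (`TTL_JITTER`) is one common technique and is used here purely as an assumption:

```python
import random

BASE_TTL = 24 * 3600   # 24-hour expiration, as in the text
TTL_JITTER = 3600      # hypothetical spread so entries do not all expire together

class DimensionLookup:
    def __init__(self, store, is_valid_id, now, rng=None):
        self.store = store              # stand-in for the HBase dimension table
        self.is_valid_id = is_valid_id  # hypothetical filter for malformed IDs
        self.now = now                  # clock function, injectable for testing
        self.rng = rng or random.Random()
        self.cache = {}                 # stand-in for Redis: id -> (value, expires_at)
        self.store_reads = 0

    def get(self, content_id):
        if not self.is_valid_id(content_id):
            return None                 # filtered before touching cache or store
        hit = self.cache.get(content_id)
        if hit is not None and hit[1] > self.now():
            return hit[0]               # fresh cache hit
        self.store_reads += 1           # miss or expired: fall through to the store
        value = self.store.get(content_id)
        if value is not None:
            ttl = BASE_TTL + self.rng.uniform(0, TTL_JITTER)
            self.cache[content_id] = (value, self.now() + ttl)
        return value

clock = [0.0]
lookup = DimensionLookup(
    store={"c1": {"category": "game"}},
    is_valid_id=lambda cid: cid.startswith("c"),
    now=lambda: clock[0],
    rng=random.Random(0),
)
first = lookup.get("c1")       # miss: reads the store and fills the cache
second = lookup.get("c1")      # hit: served from the cache
rejected = lookup.get("bad-id")  # filtered: never reaches cache or store
reads_after_warmup = lookup.store_reads
clock[0] = BASE_TTL + TTL_JITTER + 1  # move past the latest possible expiry
lookup.get("c1")
reads_after_expiry = lookup.store_reads
```

A production version would also need the consistency mechanism the text mentions (updating the cache when the dimension table is written), which this sketch leaves out.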
7.2 Downstream services
The difficulty of the real-time data warehouse is that it is a relatively new field and businesses differ greatly from company to company. Designing a real-time data warehouse that is convenient, easy to use, and fits the business scenario is hard.
Let's look at what the real-time data warehouse does. Externally, it is a set of message queues. Different message queues store real-time data at different aggregation granularities, including content ID, user ID, C-side behavior data, B-side content dimension data, and user profile data.
How do we build it? The output of the real-time computing engine described above is saved to message queues, where multiple downstream users can reuse it.
Compare developing a real-time application before and after the real-time data warehouse existed. Without the data warehouse, we had to consume the raw queue at tens of millions of records per second, do complex data cleaning, and then join user profiles and content dimensions to get real-time data in the required format. Development and scaling costs were relatively high, and every new application had to go through the whole process again. With the data warehouse, if you want to develop a real-time application at content ID granularity, you can simply apply for the DWS-layer message queue, whose TPS is on the order of 10,000 records per second. The development cost is much lower, resource consumption is much smaller, and scalability is much better.
As a practical example, take the real-time data dashboard of our system. Previously we had to carry out all of the operations above to get the data. Now we only need to consume the DWS-layer message queue and write a single Flink SQL job, which uses only 2 CPU cores and 1 GB of memory.
Taking 50 consumers as an example, developing a downstream real-time application after the real-time data warehouse was established uses 98% fewer resources than before, counting computing resources, storage resources, labor costs, and the cost for developers to learn and onboard. The more consumers there are, the greater the savings. Taking Redis storage alone as an example, it can save millions of RMB a month.
8. Real-time storage
Having introduced real-time computing, we now turn to real-time storage.
This part is divided into three topics:
- Distributed high availability
- Massive data writes
- High-performance queries
8.1 Distributed high availability
Here we follow the official ClickHouse recommendation and use ZooKeeper (ZK) to implement high availability. When data is written to a shard, it is written to only one replica, and the write is then recorded in ZK. Through ZK, the other replicas of the same shard are notified and pull the data themselves, which keeps the data consistent.
We did not use a message queue for data synchronization because ZK is more lightweight. Writes can go to any replica, and the other replicas obtain consistent data through ZK. Even if another node fails to fetch the data the first time, it will retry as soon as it finds that its data is inconsistent with what is recorded in ZK, so consistency is still guaranteed.
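The write-one-replica, record-in-ZK, pull-on-the-others flow can be modeled with a toy simulation. This is not how ClickHouse or ZooKeeper is implemented; `ReplicationLog` and `Replica` are hypothetical classes, and an in-memory list stands in for the log kept in ZK:

```python
class ReplicationLog:
    """Stand-in for the shared log in ZK: an ordered list of (block_id, source replica)."""
    def __init__(self):
        self.entries = []

class Replica:
    def __init__(self, name, log):
        self.name = name
        self.log = log
        self.blocks = {}  # block_id -> rows held locally

    def write(self, block_id, rows):
        # Write locally first, then record the write in the shared log.
        self.blocks[block_id] = rows
        self.log.entries.append((block_id, self))

    def sync(self):
        # Compare local state with the log and pull anything missing.
        # Safe to call repeatedly, so a failed fetch can simply be retried.
        for block_id, source in self.log.entries:
            if block_id not in self.blocks and block_id in source.blocks:
                self.blocks[block_id] = list(source.blocks[block_id])

log = ReplicationLog()
r1, r2 = Replica("r1", log), Replica("r2", log)
r1.write("b1", [1, 2])  # writes may go to any replica
r2.write("b2", [3])
r1.sync()
r2.sync()
r2.sync()               # a repeated sync changes nothing
```

After both replicas sync, each holds both blocks, regardless of which replica received the original write.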
8.2 Massive data writes
The first problem in writing data is that if massive data is written to ClickHouse directly, without batching, the QPS on ZK becomes too high. The solution is to write in batches. How large should a batch be? If the batch is too small, the pressure on ZK is not relieved; if it is too large, upstream memory pressure becomes too high. Through experiments, we settled on a batch size of several hundred thousand rows.
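A minimal sketch of batched writing is shown below. `BatchWriter` and its `sink` callable are hypothetical names; in a real system the sink would issue one bulk insert per batch, and the batch size would be in the hundreds of thousands rather than 3:

```python
class BatchWriter:
    def __init__(self, sink, batch_size):
        self.sink = sink              # callable that receives one list of rows per insert
        self.batch_size = batch_size
        self.buffer = []

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Hand the whole buffer to the sink and start a fresh one.
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []

batches = []
writer = BatchWriter(batches.append, batch_size=3)
for i in range(7):
    writer.write(i)
writer.flush()  # flush the final partial batch
```

Seven rows result in three inserts instead of seven. A production writer would usually also flush on a timer so that a slow stream does not hold rows in memory indefinitely; that is omitted here.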
The second problem is a disk write bottleneck on a single machine, which is aggravated by the MergeTree engine that ClickHouse uses at the bottom layer. During merging there is write amplification, which increases disk pressure. At peak there are tens of millions of rows per minute, and a write takes tens of seconds to complete. If a merge is in progress, write requests are blocked and queries become very slow. We applied two optimizations: one is to put the disks in RAID to improve disk I/O; the other is to split the table before writing and write to the different shards separately, so that the disk pressure on each machine drops to 1/N.
The third problem is that splitting writes by shard introduces a common problem in distributed systems: a local top-N is not the global top-N. Suppose the data for one content ID lands on different shards. When computing the global top 100 content IDs by reads, a content ID may be in the top 100 on shard 1 but not on the other shards, so part of its data is lost when the results are merged, which affects the final result. Our optimization is to add a routing layer before writing, so that all records for the same content ID are routed to the same shard, which solves the problem.
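The local-versus-global top-N problem, and why routing fixes it, can be shown with a small sketch. The hash function (CRC32 modulo the shard count) and the names `shard_for` and `global_top_n` are assumptions for illustration; the article does not say which hash is used:

```python
import zlib
from collections import Counter

def shard_for(content_id, num_shards):
    """Stable hash routing: the same content ID always maps to the same shard."""
    return zlib.crc32(content_id.encode("utf-8")) % num_shards

def global_top_n(shards, n):
    """Merge each shard's local top-n. Exact only if each ID lives on one shard."""
    merged = Counter()
    for shard in shards:
        for cid, count in shard.most_common(n):
            merged[cid] += count
    return [cid for cid, _ in merged.most_common(n)]

# Without routing: "a" has 6 reads in total but is split 3 + 3,
# so it is not the local top-1 on either shard.
split = [Counter({"a": 3, "b": 4}), Counter({"a": 3, "c": 4})]
top_split = global_top_n(split, 1)

# With routing: every read of an ID lands on one shard.
reads = {"a": 6, "b": 4, "c": 4}
routed = [Counter() for _ in range(2)]
for cid, count in reads.items():
    routed[shard_for(cid, 2)][cid] += count
top_routed = global_top_n(routed, 1)
```

In the split case the merged result misses "a" even though it has the most reads overall. In the routed case "a" is necessarily the local top-1 on its own shard, so the merged result is correct.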
Having covered writes, we next introduce high-performance storage and query in ClickHouse.
8.3 High-performance storage and query
A key to ClickHouse's query performance is the sparse index. Its design matters a great deal: a good design speeds queries up, and a poor one hurts query efficiency. In our business scenario, most queries are related to time and content ID, for example: how did a certain piece of content perform across different user groups in the past N minutes? We therefore built the sparse index on date, minute-granularity time, and content ID. For queries on a specific content ID, the sparse index reduces the files scanned by 99%.
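The idea behind a sparse index can be sketched as follows. This is a simplified model, not ClickHouse's implementation: rows are sorted by key, only the first key of each fixed-size granule is kept, and a binary search over those marks decides which granules might contain a key. The key layout `(minute, content_id)` and the granule size of 100 are illustrative assumptions:

```python
from bisect import bisect_left, bisect_right

def build_sparse_index(sorted_keys, granule_size):
    """Keep only the first key of every granule of `granule_size` rows."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]

def granules_to_scan(index, key):
    """Return the granule numbers that may contain `key`.

    Granule i can hold the key only if its first key is <= key and the
    next granule's first key is >= key (the last granule has no upper bound).
    """
    first = max(bisect_left(index, key) - 1, 0)
    last = bisect_right(index, key)  # exclusive
    return range(first, last)

# 10 minutes x 100 content IDs, sorted by (minute, content_id).
keys = [(minute, cid) for minute in range(10) for cid in range(100)]
index = build_sparse_index(keys, granule_size=100)
hit = list(granules_to_scan(index, (3, 42)))
```

Here the index holds 10 marks for 1,000 rows, and a lookup for content 42 in minute 3 only needs to read one of the 10 granules. This is also why the order of the key columns matters: a query that does not constrain the leading columns cannot prune granules this way.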
Another problem is that we have too much data and too many dimensions. Taking QQ watch points video content as an example, there are more than 10 billion stream records a day, and some dimensions have hundreds of categories. If all dimensions were pre-aggregated together, the amount of data would expand exponentially, and queries would become slower and take up a lot of memory. Our optimization is to build corresponding pre-aggregated materialized views for different dimensions, trading space for time, which shortens query time.
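A rough calculation shows why pre-aggregating all dimensions together blows up. The dimension cardinalities below are hypothetical, and the functions give upper bounds on the number of aggregated groups per time bucket, assuming one view per single dimension as the simplest case:

```python
from math import prod

def full_cube_groups(cardinalities):
    """Upper bound on groups when every dimension is combined in one aggregation."""
    return prod(cardinalities)

def per_dimension_groups(cardinalities):
    """Upper bound on groups when each dimension gets its own pre-aggregated view."""
    return sum(cardinalities)

# Hypothetical cardinalities for three dimensions.
cardinalities = [100, 50, 20]
combined = full_cube_groups(cardinalities)
separate = per_dimension_groups(cardinalities)
```

With these numbers, the combined aggregation can have up to 100,000 groups per time bucket, while three separate views hold at most 170 in total. Each added dimension multiplies the first figure but only adds to the second.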
There is also a problem with queries on the distributed table. When querying information for a single content ID, the distributed table sends the query to every shard and then gathers the results. But because of routing, a content ID exists on only one shard, and the other shards do the work for nothing. For this kind of query, our optimization is for the backend to apply the same routing rule first and query the target shard directly, which removes (N-1)/N of the load and greatly shortens query time. In addition, because we serve OLAP queries, eventual consistency is sufficient, so separating reads and writes across master and slave replicas improves performance further.
We also keep a one-minute data cache in the backend. For identical query conditions, the backend returns the cached result directly.
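The two backend optimizations above, querying only the target shard and caching results for one minute, can be sketched together. `QueryBackend` is a hypothetical class, dicts stand in for shard-local tables, and CRC32 modulo the shard count is an assumed routing rule; the only requirement is that reads use the same rule as writes:

```python
import zlib

CACHE_TTL_SECONDS = 60  # one-minute cache, as in the text

class QueryBackend:
    def __init__(self, shards, now):
        self.shards = shards      # list of dicts standing in for shard-local tables
        self.now = now            # clock function, injectable for testing
        self.cache = {}           # content_id -> (result, cached_at)
        self.shard_queries = 0

    def shard_for(self, content_id):
        return zlib.crc32(content_id.encode("utf-8")) % len(self.shards)

    def write(self, content_id, stats):
        self.shards[self.shard_for(content_id)][content_id] = stats

    def query(self, content_id):
        hit = self.cache.get(content_id)
        if hit is not None and self.now() - hit[1] < CACHE_TTL_SECONDS:
            return hit[0]
        # Apply the same routing rule as the writer and query one shard only.
        self.shard_queries += 1
        result = self.shards[self.shard_for(content_id)].get(content_id)
        self.cache[content_id] = (result, self.now())
        return result

clock = [0.0]
backend = QueryBackend([{} for _ in range(4)], now=lambda: clock[0])
backend.write("c1", {"clicks": 10})
result = backend.query("c1")   # one shard queried instead of four
backend.query("c1")            # served from the one-minute cache
queries_within_ttl = backend.shard_queries
clock[0] = 61.0
backend.query("c1")            # cache entry expired, target shard queried again
queries_after_ttl = backend.shard_queries
```

With four shards, each uncached single-content query touches one shard rather than four, which is the (N-1)/N saving described above.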
Next we introduce our capacity expansion plan. We first investigated some common solutions in the industry.
Take HBase: the raw data is stored on HDFS, so expansion only involves adding region servers and does not involve migrating the raw data. In ClickHouse, however, the data of each shard is stored locally. It is a relatively low-level storage engine and cannot be expanded as easily as HBase.
Redis uses hash slots, an approach similar to consistent hashing and a classic design for distributed caches. Although reads on a Redis slot can be briefly unavailable (ASK) during rehash, migrating from the original h to the new h and finally deleting the original h is, on the whole, relatively convenient. Most ClickHouse queries, however, are OLAP batch queries rather than point queries, and because of columnar storage it does not support deletion well, so a consistent hashing scheme is not a good fit.
Our current expansion solution is to consume a second copy of the data and write it to a new ClickHouse cluster. The two clusters run side by side for a period of time. Because real-time data is retained for three days, after three days the backend service switches directly to the new cluster.
Tencent real-time data warehouse: DWM layer and DWS layer, with a data delay of 1 minute.
Foresight multidimensional real-time data analysis system: sub-second response to multidimensional conditional queries. On a cache miss, 99% of queries covering the past 30 minutes take less than 1 second; for queries covering the past 24 hours, 90% take within 5 seconds and 99% within 10 seconds.