This article is compiled from "Real-Time Data Warehouse Scenario Practice Based on Flink," a talk shared by Kwai data technology expert litianshuo at the Flink Meetup on May 22. It covers:
- Kwai real-time computing scenario
- Kwai real-time data warehouse structure and safeguard measures
- Kwai scenario problems and solutions
- Future planning
1、 Kwai real-time computing scenarios
The real-time computing scenarios in the Kwai business are mainly divided into four parts:
- Company-level core data: this includes the company's overall business dashboard, real-time core daily reports, and mobile metrics. The team maintains company-wide indicators, and each business line, such as video and live streaming, has its own core real-time Kanban;
- Real-time indicators for large-scale activities: the core deliverable is the real-time large screen. For the Kwai Spring Festival Gala, for example, we built an overall large screen showing the live status of the whole event. A large event is divided into N modules, and each module's play modes get their own real-time data Kanban;
- Operation data: operation data covers two aspects, creators and content. For creators and content, the operation side may, for example, launch a big-V campaign and want to see real-time information such as the status of the live studio and its pull on the overall market. For this scenario we build various real-time large-screen multi-dimensional views as well as overall market data.
It also includes support for operational strategy. For example, we may detect trending content, trending creators, and current hot spots in real time, and output strategies based on them; this is another capability we need to provide.
Finally, it includes C-end data display. Kwai now has a creator center and an anchor center with anchor-related pages, and part of the real-time data on those pages is also produced by us;
- Real-time features: including search and recommendation features and real-time advertising features.
2、 Kwai real-time data warehouse structure and safeguard measures
1. Objectives and difficulties
- First, since we build the data warehouse, we want every real-time indicator to match its offline counterpart. The overall difference between real-time and offline indicators must stay within 1%; this is the minimum bar.
- Second is data delay. Our SLA requires that, during an activity, the data delay of every core report scenario stays under 5 minutes, including the time a job is down and the time it takes to recover; exceeding that means the SLA is missed.
- Finally, stability. In scenarios such as a job restart, the output curve should stay normal, with no visible anomaly in the indicators caused by the restart.
- The first difficulty is data volume. Daily ingested traffic is on the order of trillions of records, and during events such as the Spring Festival Gala, peak QPS can reach 100 million per second.
- The second difficulty is complex component dependencies. Parts of the link depend on Kafka, parts on Flink, and parts on KV storage, RPC interfaces, OLAP engines, and so on. We have to think about how to place these components along the link so that they all work properly.
- The third difficulty is link complexity. We currently have 200+ core business jobs, 50+ core data sources, and more than 1,000 jobs overall.
2. Real time warehouse layered model
Based on the above three difficulties, let’s take a look at the data warehouse architecture:
As shown above:
- The lowest layer has three data sources: client logs, server logs, and Binlogs;
- The public foundation layer is split into two sub-layers: the DWD layer for detail data and the DWS layer for common aggregate data, with DIM as the usual dimension layer. On top of the offline data warehouse we apply a topic-oriented pre-layering covering traffic, users, devices, video production and consumption, risk control, social, and so on.
- The core work of the DWD layer is standardized cleaning;
- The DWS layer associates dimension data with the DWD layer and generates aggregation layers at common granularities.
- Above that is the application layer, which includes overall market data, multi-dimensional analysis models, and business-specific thematic data;
- At the top are the usage scenarios.
The overall process can be divided into three steps:
- The first step is ingesting business data, i.e., connecting to the business's data;
- The second step is data assetization, which means heavily cleaning the data into regular, well-ordered assets;
- The third step is putting data to business use: at the real-time level, data feeds back into the business and empowers the construction of business data value.
3. Real time warehouse – Safeguard Measures
Based on the above layered model, let’s take a look at the overall safeguard measures:
The guarantee level is divided into three different parts: quality guarantee, timeliness guarantee and stability guarantee.
Let's first look at quality assurance, the blue part. At the data source stage, we monitor data sources for out-of-order events, based on our own SDK's collection, and check data sources for consistency against offline data. The computing pipeline then goes through three stages: development, launch, and service.
- In the development stage, a standardized model is provided, with benchmarks based on it, and offline comparison and verification ensure consistent quality;
- The launch stage is mainly about service monitoring and indicator monitoring;
- In the service stage, if an exception occurs, we first restore the Flink job from its state; if some scenario still does not meet expectations, we repair the data via the offline pipeline.
The second is timeliness guarantee. For data sources, source delay is also part of our monitoring. The development stage adds two more things:
- The first is stress testing: a routine task is replayed against the peak traffic of the last 7 or 14 days to see whether it lags;
- After the stress test passes, we evaluate launch and restart performance, i.e., what restart performance looks like when recovering from a checkpoint (CP).
The last one is stability guarantee, which matters most in large-scale activities, for example failover drills and tiered guarantees. We apply rate limiting based on prior stress-test results, so that when traffic exceeds the limit the job still runs stably, without major instability or checkpoint failures. Beyond that, we have two different standards: cold-standby dual data centers and hot-standby dual data centers.
- Cold-standby dual data centers: when one data center goes down, the job is brought up in the other;
- Hot-standby dual data centers: the same logic is deployed once in each of the two data centers.
The above is our overall safeguard measures.
3、 Kwai scenario problems and solutions
1. PV / UV standardization
The first problem is PV / UV standardization. Here are three screenshots:
The first picture is the Spring Festival Gala warm-up scene, one of the activity's play modes. The second and third are screenshots of the red envelope activity and a live room on the day of the Gala.
During the activity, we found that 60 to 70% of the requirements boil down to computing page-level information, such as:
- How many people come to this page, or how many people click to enter this page;
- How many people came to the event;
- How many clicks and exposures are generated for a pendant on the page.
Abstracted, this scenario is a simple SQL pattern: apply filter conditions to a table, aggregate at the dimension level, and produce count or sum results.
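That pattern can be sketched in Python as follows. The event schema (`event_type`, `page`, `did` fields) is a hypothetical stand-in for illustration; it shows the filter / group-by-dimension / count shape, not the production SQL:

```python
# Sketch of the PV/UV pattern: filter events, group by a dimension,
# count events (PV) and distinct device ids (UV).
from collections import defaultdict

def pv_uv(events, event_type):
    stats = defaultdict(lambda: {"pv": 0, "uv": set()})
    for ev in events:
        if ev["event_type"] != event_type:   # filter condition
            continue
        s = stats[ev["page"]]                # aggregate per dimension
        s["pv"] += 1
        s["uv"].add(ev["did"])               # UV = distinct device ids
    return {page: (s["pv"], len(s["uv"])) for page, s in stats.items()}

events = [
    {"event_type": "view", "page": "warmup", "did": "d1"},
    {"event_type": "view", "page": "warmup", "did": "d1"},
    {"event_type": "view", "page": "warmup", "did": "d2"},
    {"event_type": "click", "page": "warmup", "did": "d3"},
]
print(pv_uv(events, "view"))  # {'warmup': (3, 2)}
```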
Based on this scenario, our initial solution is shown on the right of the figure above.
We used Flink SQL's early-fire mechanism: data is read from the source and first bucketed by did (device id), the purple part of the figure, to prevent hot spots on any single did. Each bucket then runs a local window aggregation, which pre-accumulates records of the same type within the bucket. A global window aggregation then merges the buckets by dimension to compute the final result. The early-fire mechanism opens a day-level window inside the local window aggregation and emits output once per minute.
We encountered some problems in this process, as shown in the lower left corner of the figure above.
The code runs fine in normal operation, but problems appear when the data is delayed overall or when we backtrack historical data. With early fire set to once per minute, the data volume during backtracking is large, so the backfill may jump straight from 14:00 to the data at 14:02, losing the point at 14:01. What happens after it is lost?
In this scenario, the curve at the top of the figure is the result of backtracking historical data with early fire. The x-axis is minutes and the y-axis is the page UV up to that time. Some points are flat, meaning there is no data result, followed by a sharp jump, then flat again, then another jump. The expected result is the smooth curve at the bottom of the figure.
To solve this problem we adopted the cumulate window, which was also introduced in Flink 1.13 and works on the same principle.
A large day-level window is opened over the data, with small minute-level windows inside it, and each record falls into a minute-level window according to its own row time.
- When the watermark advances past the window's event time, the window fires once. This solves the backtracking problem: each record falls into its true window, and emission is triggered only after the watermark passes the window end.
- This method also handles out-of-order data to some extent: late records are not discarded, and the latest accumulated value is recorded.
- Finally, semantic consistency: because it is based on event time, the results agree closely with offline computation when disorder is mild.
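A minimal sketch of the cumulate-window behavior, assuming events carry their own event-time minute. Each emission covers the day window from its start up to the current minute slice, so the UV curve rises smoothly even when a slice has no events:

```python
# Sketch of the cumulate-window idea (Flink 1.13 CUMULATE): events fall into
# minute slices by their own event time; each watermark advance emits the UV
# accumulated from the start of the day window up to that minute.
def cumulate_uv(events, minutes):
    """events: list of (event_minute, did); minutes: slice ends to emit at."""
    emitted = []
    for end in minutes:
        seen = {did for t, did in events if t <= end}  # cumulative distinct
        emitted.append((end, len(seen)))
    return emitted

events = [(0, "d1"), (1, "d2"), (1, "d1"), (3, "d3")]  # (minute, device)
print(cumulate_uv(events, [0, 1, 2, 3]))
# [(0, 1), (1, 2), (2, 2), (3, 3)] -- smooth even though minute 2 is empty
```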
The above is a standardized solution for PV / UV.
2. DAU calculation
2.1 background introduction
The DAU calculation is described below:
We monitor active devices, new devices, and returning devices across the overall market.
- Active devices are devices that appeared on the current day;
- New devices are devices that appeared on the current day and never before;
- Returning devices are devices that appeared on the current day after being absent for N days.
However, computing these indicators may require 5 to 8 different topics.
Let’s take a look at how logic should be calculated in the offline process.
First, we compute the active devices, merge them, deduplicate at the day level under a dimension, and then join a dimension table. This dimension table holds each device's first-seen and last-seen times, i.e., the times of its first and most recent visits as of yesterday.
With this information we can run the logic, and we find that new and returning devices are really just sub-tags of active devices: a new device is one piece of logic, and a returning device is another piece of logic over a 30-day window. Given this, can we solve the whole problem with a single SQL job?
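The offline classification logic above can be sketched as follows, assuming a hypothetical dimension table mapping each did to its last active date as of yesterday. Every device seen today counts as active; "new" and "returning" are the sub-tags derived from the dimension table:

```python
# Sketch of the offline DAU sub-tagging: a device seen today is tagged by
# comparing today's date with its last active date from the dimension table.
from datetime import date

def classify(did, today, last_seen, reflux_days=30):
    """Return 'new', 'reflux', or plain 'active' for a device seen today."""
    last = last_seen.get(did)             # None -> never seen before
    if last is None:
        return "new"                      # first-ever visit
    if (today - last).days > reflux_days:
        return "reflux"                   # returned after N absent days
    return "active"                       # regular active device

today = date(2021, 5, 22)
last_seen = {"d1": date(2021, 5, 21), "d2": date(2021, 3, 1)}
print(classify("d1", today, last_seen))  # active
print(classify("d2", today, last_seen))  # reflux (gap > 30 days)
print(classify("d3", today, last_seen))  # new
```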
In fact, we did this at first, but we encountered some problems:
- First, there are 6 to 8 data sources, and the metric definitions of our overall market are fine-tuned frequently. With a single job, every fine-tune means changing the job, so the stability of that single job would be very poor;
- Second, the data volume is in the trillions, which causes two issues: the stability of a single job at this scale is very poor, and the KV storage used for real-time dimension-table joins, or any such RPC service interface, cannot guarantee service stability at trillion-record scale;
- Third, we have a strict delay requirement of under one minute, batch processing must be avoided across the whole link, and if any task becomes a single-point performance bottleneck, we must still ensure high performance and scalability.
2.2 technical proposal
To solve the above problems, let’s introduce how we do it:
As shown in the example above, the first step is to deduplicate the three data sources A, B, and C at the minute level by dimension and did. This yields three minute-level deduplicated sources, which are then unioned and put through the same logic.
The entry volume of our data sources thus drops from trillions to tens of billions. After minute-level dedup, a day-level dedup follows, shrinking the data further from tens of billions to billions.
At the scale of billions of records, associating dimension data via a data service becomes feasible: we call the user-portrait RPC interface, and the result is finally written to the target topic. This target topic is imported into the OLAP engine to serve many different use cases, including the mobile dashboard, the large screen, and the indicator Kanban.
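The staged shrinking can be sketched as below. The `(minute, dimension, did)` record shape is assumed for illustration; the point is how per-source minute-level dedup, a union, and then a day-level dedup progressively reduce volume before any RPC lookup:

```python
# Sketch of the two-stage dedup: each source is deduplicated per
# (minute, dimension, did), the streams are unioned, then a day-level
# dedup shrinks the volume before dimension association via RPC.
def minute_dedup(stream):
    seen, out = set(), []
    for minute, dim, did in stream:
        key = (minute, dim, did)
        if key not in seen:               # first record this minute wins
            seen.add(key)
            out.append((minute, dim, did))
    return out

def day_dedup(stream):
    seen, out = set(), []
    for minute, dim, did in stream:
        if (dim, did) not in seen:        # first record today wins
            seen.add((dim, did))
            out.append((minute, dim, did))
    return out

a = [(0, "page", "d1"), (0, "page", "d1"), (1, "page", "d1")]
b = [(0, "page", "d2")]
unioned = minute_dedup(a) + minute_dedup(b)
print(day_dedup(unioned))  # [(0, 'page', 'd1'), (0, 'page', 'd2')]
```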
This scheme has three advantages: stability, timeliness and accuracy.
- The first is stability. Loose coupling means that when the logic of data source A or B needs to change, each can be modified separately. The tasks are also scalable: because all logic is split at a very fine granularity, a traffic problem in one place does not affect the later parts, so scaling out is relatively simple. In addition, services are moved downstream and state stays controllable.
- The second is timeliness: we achieve millisecond-level delay with rich dimensions, performing multi-dimensional aggregation over 20+ dimensions overall.
- Finally, accuracy. We support data verification, real-time monitoring, model export unification, etc.
At this point we hit another problem: disorder. Each restart of any of the three jobs above delays it by at least two minutes, and that delay causes disorder in the unioned downstream data source.
2.3 delay calculation scheme
What should we do in case of disorder?
We have three solutions:
The first scheme deduplicates on "did + dimension + minute," with the value meaning "has this key been seen." For example, if a did arrives at 04:01, a result is output; likewise at 04:02 and 04:04. If it arrives again at 04:01 it is discarded, but if it arrives at 04:00, a result is still output.
This scheme has a problem: because we keep state per minute, the state for 20 minutes is twice the size of the state for 10 minutes, and the state size becomes hard to control, so we moved to scheme 2.
The second scheme assumes the data source has no disorder. The key stores "did + dimension" and the value is a timestamp; the update rule is shown in the figure above.
A record arrives at 04:01 and a result is output. A record arrives at 04:02; if it is the same did, the timestamp is updated and a result is still output. 04:04 follows the same logic, updating the timestamp to 04:04. If a 04:01 record arrives afterwards, it finds the timestamp already advanced to 04:04 and is discarded.
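Scheme 2's update rule can be sketched as below; this is illustrative only, with the key reduced to the did (in the real job it is "did + dimension" in KV state):

```python
# Sketch of scheme 2: state holds only the newest timestamp per device;
# in-order records advance it and emit, older records are dropped.
def scheme2(records):
    state, emitted = {}, []
    for ts, did in records:
        last = state.get(did)
        if last is not None and ts < last:
            continue                      # late record: discarded, since
                                          # scheme 2 tolerates zero disorder
        state[did] = ts                   # advance the stored timestamp
        emitted.append((ts, did))
    return emitted

records = [("04:01", "d1"), ("04:02", "d1"), ("04:04", "d1"), ("04:01", "d1")]
print(scheme2(records))  # the trailing late 04:01 record is dropped
```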
This approach greatly reduces the required state, but it has zero tolerance for disorder: none is allowed. Since we could not solve that, we came up with scheme 3.
Scheme 3 adds a ring-like buffer on top of the timestamp of scheme 2, and disorder is allowed inside the buffer.
For example, a record arrives at 04:01 and a result is output. A record arrives at 04:02: the timestamp is updated to 04:02, and the buffer records that the same device also came at 04:01. When another record arrives at 04:04, the buffer is shifted by the corresponding time difference. This logic lets us tolerate a certain amount of disorder.
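A simplified sketch of scheme 3's idea, not the production implementation: besides the latest timestamp, keep a small buffer of the recent minutes in which a device was already seen, so a late duplicate is dropped while a late but genuinely new minute is still emitted:

```python
# Sketch of scheme 3: a bounded "ring" of recently seen minutes per device
# lets late records inside the tolerance window be handled correctly.
TOLERANCE = 16  # minutes of disorder tolerated, as in the text

def scheme3(records):
    state, emitted = {}, []               # did -> set of recent minutes seen
    for minute, did in records:
        seen = state.setdefault(did, set())
        if minute in seen:
            continue                      # late duplicate inside the buffer
        seen.add(minute)
        # evict minutes older than the tolerance window (the "ring")
        newest = max(seen)
        state[did] = {m for m in seen if newest - m <= TOLERANCE}
        emitted.append((minute, did))
    return emitted

records = [(1, "d1"), (2, "d1"), (4, "d1"), (1, "d1"), (3, "d1")]
print(scheme3(records))
# [(1, 'd1'), (2, 'd1'), (4, 'd1'), (3, 'd1')]
# the late duplicate at minute 1 is dropped; the late new minute 3 is kept
```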
Taken together, these three schemes:
- Scheme 1, tolerating 16 minutes of disorder, needs roughly 480 GB of state per job. Accuracy is guaranteed, but job recovery and stability become completely uncontrollable, so we abandoned it;
- Scheme 2 needs only about 30 GB of state but tolerates zero disorder, so the data is inaccurate. Since our accuracy requirement is very high, we abandoned it too;
- Scheme 3's state grows compared with scheme 2, but not by much, and its overall effect matches scheme 1: it tolerates 16 minutes of disorder, and 10 minutes is enough to restart a job during a normal update. So scheme 3 was our final choice.
3. Operation scenario
3.1 background introduction
The operation scenario can be divided into four parts:
- The first is support for the big data screen, including per-live-room analysis and overall market analysis, requiring minute-level delay and fast updates;
- The second is support for the live Kanban, whose data involves analysis along specific dimensions and support for specific people, with high demands on dimensional richness;
- The third is data strategy lists, mainly used to predict popular works and popular models; it needs hour-level data and has low update requirements;
- The fourth is C-end real-time indicator display, where query volume is large but the query pattern is fairly fixed.
The following is an analysis of some different scenarios generated by these four different states.
The first two are basically the same, except that the query patterns split into specific business scenarios and general business scenarios.
The third and fourth have low update requirements but high throughput requirements, and the curves in between need not be consistent. The fourth's query pattern is mostly single-entity lookups, such as querying a piece of content and its indicators, with high QPS requirements.
3.2 technical proposal
For the above four different scenarios, how do we do it?
First, look at the basic detail layer (on the left of the figure). There are two data-source links: one is the consumption stream, such as live-streaming consumption info and watch/like/comment events. After a round of basic cleaning, dimension association follows. The upstream dimension information comes from Kafka: content dimensions are written to Kafka and stored in KV storage, along with some user dimensions.
After these dimensions are associated, the result is finally written to Kafka as the DWD fact layer. For performance, we added an L2 cache here.
- As shown at the top of the figure, we read the DWD-layer data and do a basic rollup, the core being window-based dimensional aggregation, producing four kinds of data at different granularities: an overall-market multi-dimensional summary topic, a live-room multi-dimensional summary topic, an author multi-dimensional summary topic, and a user multi-dimensional summary topic. These are all general-dimension data.
- As shown at the bottom of the figure, based on this general-dimension data we process personalized-dimension data, i.e., the ADS layer. After dimension expansion, including content expansion and operational dimension expansion, aggregation produces topics such as a real-time e-commerce topic, a real-time institutional-service topic, and a real-time big-V live-streaming topic.
Splitting into two links has one advantage: one link handles general dimensions and the other handles personalized dimensions. General dimensions carry stricter guarantee requirements, while personalized dimensions involve a lot of case-specific logic. If the two were coupled, tasks would break often and responsibilities would be unclear, making it impossible to build a stable layer.
- As shown on the right, we ultimately use three different engines. In short, Redis serves the C-end query scenario, and the OLAP engine serves the large-screen and business Kanban scenarios.
4、 Future planning
The three scenarios above were: first, standardized PV/UV calculation; second, the overall DAU solution; and third, how we solve problems on the operation side. Based on this, we have future plans in four parts.
The first part is the improvement of real-time guarantee system:
- One is running large-scale activities, including the Spring Festival Gala and subsequent regular activities; for guaranteeing such activities, we have a set of norms around which to build a platform;
- The second is formulating tiered-guarantee standards, with a standardized description of which jobs get which guarantee level and standard;
- The third is platformizing engine capability: pushing solutions for Flink task engines onto a platform and driving standardization and normalization on top of it.
The second part is the construction of real-time data warehouse content:
- One is outputting scenario-based solutions, for example general solutions for activities, instead of developing a new solution for every activity;
- The other is accumulating hierarchical content data. In current data content construction, some scenes are missing at various granularities, including how the content can better serve upstream scenarios.
- The third part is building out Flink SQL scenarios, covering continued SQL adoption, SQL task stability, and SQL task resource utilization. When estimating resources, we consider, for example, what QPS a scenario has, which solution the SQL uses, and what it can support. Flink SQL significantly improves development efficiency, and in the process we want to make operating the business even easier.
- The fourth part is exploring stream-batch unification. Real-time data warehouse scenarios essentially accelerate offline ETL, and we have many hour-level tasks; for these, moving part of each batch run's logic into stream processing will greatly improve the SLA of the offline data warehouse.