Introduction: Supply chain logistics involves high business complexity, long business links, and many nodes and entities, which makes building a real-time data warehouse difficult. This is especially true for Cainiao's cross-border import business: more complex scenarios lead to more complex entity data models, the large number of upstream business systems makes the ETL process particularly complicated, and the daily data volume is huge. Together these posed many challenges for the team building the import real-time data warehouse.
How do we ensure data accuracy across complex entity relationships? How do we reduce processing complexity when there are many data sources? How do we improve the efficiency of real-time multi-stream joins? How do we compute timeout statistics in real time? How do we recover state after a failure? This article shares our experience upgrading the Cainiao import real-time data warehouse, and how we used Flink's features to solve the problems we met in practice.
The main contents include:
- Related background
- Evolution of import real time data warehouse
- Challenge and Practice
- Summary and Prospect
01 Related background
1. Introduction of import business
The import process is relatively clear. After a domestic buyer places an order, the overseas seller ships the goods, which pass through customs clearance, trunk transportation, domestic customs clearance, and distribution before being delivered to the consumer. Throughout the process, Cainiao coordinates the resources along the link to fulfill the logistics contract. After Kaola was integrated into the Alibaba ecosystem last year, the combined import business accounted for a very high share of domestic import volume. In addition, order volume grows rapidly every year, the fulfillment cycle is particularly long, and many links are involved. It is therefore very difficult both to integrate all the data and to guarantee its validity.
2. Real time data warehouse processing flow
① General process
Here is a brief introduction to how a real-time data warehouse processes data. Generally, it connects to a business database or log source and uses a data synchronization tool, such as Sqoop or DataX, to write messages to message middleware for temporary storage. Downstream, a real-time computing engine consumes the messages, processes them, and outputs detail tables or summary indicators to a query service that data applications use.
② Rookie internal process
The process inside Cainiao is the same. We use DRC (data backup center) to incrementally collect binlog logs and synchronize them to TT (message middleware similar to Kafka) for temporary storage, and a Flink real-time computing engine then consumes them. After computation, the results are written to two kinds of query services: ADB and HBase (Lindorm). ADB is an OLAP engine that Alibaba Cloud also offers externally. It supports rich multidimensional analysis queries, so we write lightly summarized data or detail data with many dimensions to it. Real-time dashboards (large screens) have fewer dimensions and relatively fixed indicators, so we write highly summarized indicators to HBase for them.
02 Evolution of the import real-time data warehouse
Next, let's talk about how the import real-time data warehouse evolved.
2014: the import business line built an offline data warehouse to provide daily reports.
2015: hourly reports became available, and the update frequency went from daily to hourly.
2016: based on JStorm, we explored computing some indicators in real time. Because real-time indicators were only just being tried in 2016, they were not particularly rich.
2017: Cainiao introduced Blink, Alibaba's internal version of Flink, as our stream computing engine. In the same year the import business line built real-time detail data and provided data services through a real-time wide detail table.
2018: completed the construction of the Cainiao import real-time data warehouse 1.0.
2020: construction of real-time data warehouse 2.0 started. Why 2.0? The design of 1.0 had many problems: the model architecture was not flexible enough and did not scale well, and in some places we misused Blink because we did not understand its characteristics, which raised operation and maintenance costs. So we upgraded.
1. Real time data warehouse 1.0
Next, let's look at real-time data warehouse 1.0. In the early stage of development the business model was not yet stable, so the initial strategy was to follow the business and move fast. For example, one real-time detail layer was developed for business 1 and another set of real-time tasks for business 2. The advantage is that each could iterate quickly with its business without affecting the others, which made things more flexible early on.
As shown on the right side of the figure above, the bottom layer is the message source of each business system. There are two main layers of real-time tasks. One is the real-time detail layer, where different detail tables are developed per business line, each extracting the data that business line needs. Above it is the ADM layer, the real-time application layer, which is customized for specific scenarios. The whole process is vertical, chimney-style development: the model is chaotic, hard to extend, and full of repeated computation.
Later, to address the repeated computation, we added a layer of abstraction: a front intermediate layer that extracts the common parts. But this treated the symptoms rather than the root cause. The overall model was still chaotic, data construction was not unified, and the model still scaled poorly.
2. Real time data warehouse 2.0
After the upgrade of 2.0, there is a clear picture:
- Front layer: the underlying data sources are connected to the front intermediate layer, which shields the very complex logic at the bottom.
- Detail layer: the front layer hands clean data to the detail tables. The detail layer connects all business lines and unifies the model.
- Summary layer: above the detail layer sit light summaries and high summaries. Light summary tables have many dimensions and are mainly written to the OLAP engine for multidimensional query and analysis. High summary indicators are mainly used for real-time dashboards.
- Interface service: data from the summary layer is served through a unified interface service.
- Data application: the application layer mainly includes real-time dashboards, data applications, real-time reports, and message push.
This is the model after the upgrade of real-time data warehouse 2.0. Although the whole model looks relatively simple, in fact, there are many difficulties and great efforts from model design to development. Let’s share our challenges and practices in the process of upgrading.
03 Challenge and Practice
In the process of upgrading real-time data warehouse, we are faced with the following challenges:
1. Multiple business lines and business models
The first challenge is that there are many business lines to connect, and different business lines follow different models. Under the early move-fast approach this led to fragmented models: nothing could be reused between them, development and operations costs were high, and resource consumption was severe.
Solution: logical middle layer upgrade
The simplest idea we came up with was to build a unified data middle layer. For example, business A has several business nodes, such as outbound, collection, and delivery, while business B may have other nodes, so the overall model is fragmented. In practice, however, once a business reaches its middle and later stages, its model becomes fairly stable, and at that point the data can be abstracted. For example, nodes 1 and 5 of business A may be the same as nodes in other business models. By aligning them we can find out which nodes are common and which are not, extract the common ones, and settle them into a logical middle layer, which shields the differences between businesses and gives us unified data construction. There is another important reason for unifying the logical middle layer. Although businesses A, B, and C are served by different business systems, such as the fulfillment system and the customs system, these are essentially one set of systems and the underlying data sources are already abstracted. The data warehouse should therefore be modeled with the same unified approach.
2. Many business systems and large data sources
The second challenge is that there are many upstream systems, each with a large data volume. More than ten data sources each produce data on the order of hundreds of millions of records per day, which is very hard to manage. The first resulting problem is large state, which has to be maintained in Flink. The other is how to control cost after connecting so many data sources.
Solution: make good use of state
State is a major feature of Flink: it is what makes stateful computation possible, and it needs to be used sensibly. We need to understand what state does, when it is needed, and how to optimize it. There are two kinds of state. The first is keyed state, which is tied to the key of the data. For example, for a GROUP BY in SQL, Flink stores the related data by key value, for example in a binary array. The second is operator state, which is tied to a specific operator, for example recording the offset read by a source connector, or restoring state across operators after a task failure.
① “De duplication” in data access
Let's take an example of how keyed state is used. Suppose a logistics order stream and a fulfillment log stream are joined to produce a wide table. How does the join store its data? Messages keep arriving, and their arrival order may be inconsistent, so the data has to be stored in the operator. For the join's state node, a simple and crude approach is to store both the left stream and the right stream. That way, no matter which message arrives first, the data in the operator is complete, and even if one stream arrives late it can still be matched against the earlier data.
Note that what the state stores depends on the upstream. If the upstream defines a primary key (rowkey) and the join key contains that primary key, multiple orders can never map to the same join key, so the state only needs to store one unique row per join key. If the upstream has a primary key but the join key does not contain it, the state must keep the rows of every rowkey that shares the join key, for example the orders of two rowkeys at once. The worst case is an upstream with no primary key. For example, the same order has 10 messages and only the last one is valid, but the system does not know which one is valid, and without a declared primary key it cannot easily deduplicate. It stores all of them, which wastes resources and hurts performance. This is a very poor approach.
Therefore, we deduplicate at data access. When a source is ingested, we sort with row_number and tell the system to update the data by primary key, which settles how many of the 10 messages should be kept. In the case above, the data is updated by primary key and the latest message is taken each time.
Deduplicating with row_number does not reduce the amount of data processed, but it greatly reduces the amount of state stored: each key keeps only its one valid row instead of its entire history.
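The effect of deduplicating at access can be illustrated with a small simulation. This is a minimal sketch in plain Python, not Flink code: the message layout (order_id, sequence, status) and both function names are assumptions made for illustration. It contrasts a state that keeps every message with a state that keeps only the latest row per primary key.

```python
from typing import Dict, Iterable, List, Tuple

# (order_id, sequence, status): hypothetical message layout.
Row = Tuple[str, int, str]


def append_only_state(rows: Iterable[Row]) -> List[Row]:
    """No primary key declared: every message has to be retained."""
    return list(rows)


def keyed_last_row_state(rows: Iterable[Row]) -> Dict[str, Row]:
    """Primary key declared: keep only the newest row per order_id."""
    state: Dict[str, Row] = {}
    for row in rows:
        order_id, sequence, _ = row
        current = state.get(order_id)
        if current is None or sequence >= current[1]:
            state[order_id] = row
    return state


# Ten messages for one order plus one message for another order.
messages = [("LP001", i, f"status_{i}") for i in range(1, 11)]
messages.append(("LP002", 1, "status_1"))

full = append_only_state(messages)
dedup = keyed_last_row_state(messages)
print(len(full), len(dedup))  # 11 2
print(dedup["LP001"])         # ('LP001', 10, 'status_10')
```

All eleven messages are still processed in both cases; only the amount retained differs, which matches the point that deduplication reduces state size rather than processing volume.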
② Multi stream join optimization
The second is the optimization of multi-stream joins. In the pseudo code on the left of the figure above, a main table is joined with many data sources to produce a wide detail table. This is the way we would naturally prefer to write it, but it does not work well. Why? In real-time computing, such SQL is executed as a series of two-stream joins, and each join handles only one pair of inputs. If the code has 10 joins, the execution graph has 10 join nodes, and each join node stores all the data of both its left and right streams (the red box in the figure on the right). Suppose the order source has 100 million rows: across 10 join nodes roughly 1 billion rows end up in state, which is an alarming amount of storage.
The other problem is that the link becomes very long, so network transfer, computation, and task latency are all high. Joining more than a dozen data sources is a real case in our scenario, and our actual join relationships are even more complex than this.
How do we optimize it? We use UNION ALL to splice the data together with staggered columns, where each source fills only its own columns, followed by a GROUP BY. This effectively converts the join into a GROUP BY. In the execution graph on the right of the figure above, yellow marks the storage needed during data access and red marks the node that takes the place of the join, so very little state has to be stored overall. The main table is stored once in the yellow box and once in the red box. However many data sources there are, only one row per key is stored. For example, if we have 10 million logistics orders and the other data sources also have 10 million rows each, the final result is still 10 million effective rows, so the storage volume is not high. Connecting a new data source may bring another 10 million log records, but the number of effective rows stays at 10 million: the new source only adds an update to existing rows. The cost of adding a data source is therefore close to zero, and replacing join with UNION ALL is a major optimization in state.
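The UNION ALL plus GROUP BY idea can be sketched outside of SQL. The following plain-Python simulation is only an illustration under assumed names: the column list, the pad helper, and the merge rule (the latest non-null value wins per column) are hypothetical and stand in for whatever aggregate the real job uses. Each source contributes rows that fill only its own columns, and a single keyed state merges them into one wide row per order.

```python
from typing import Dict, List, Optional

# Hypothetical columns of the wide detail table.
COLUMNS = ["order_status", "outbound_time", "customs_status"]
Row = Dict[str, Optional[str]]


def pad(order_id: str, **known: str) -> Row:
    """Build a row in the common schema; columns a source lacks stay None."""
    row: Row = {"order_id": order_id}
    for column in COLUMNS:
        row[column] = known.get(column)
    return row


def union_all_group_by(rows: List[Row]) -> Dict[str, Row]:
    """Merge padded rows per order_id, keeping the latest non-null value."""
    state: Dict[str, Row] = {}
    for row in rows:
        key = row["order_id"]
        merged = state.setdefault(key, pad(key))
        for column in COLUMNS:
            if row[column] is not None:
                merged[column] = row[column]
    return state


# Three sources interleaved in arbitrary arrival order.
stream = [
    pad("LP001", customs_status="cleared"),
    pad("LP001", order_status="created"),
    pad("LP002", order_status="created"),
    pad("LP001", outbound_time="02:00"),
]
wide = union_all_group_by(stream)
print(len(wide))  # 2
print(wide["LP001"])
```

The state holds one row per order regardless of how many sources feed it, so adding a source adds columns and updates but no additional keyed copies of the data.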
3. Many foreign keys, prone to out-of-order data
The third problem is that there are many foreign keys, which easily leads to out-of-order data. There are many kinds of disorder: data can be collected out of order by the acquisition system, or reordered in transit. What we discuss here is disorder caused by carelessness in our own development, because the platform already handles the other kinds for us and provides a good end-to-end consistency guarantee.
For example, a logistics order has two messages, and we use the order number to look up messages from the warehouse. Message 1 and message 2 enter stream processing one after the other. When they are joined, they are shuffled by the join key, and the two messages may flow to different parallel operator instances. If those instances process at different speeds, the message that entered first may finish later. For example, message 1 arrives first but is processed slowly, so message 2 is output first and the final result is wrong. In essence, under parallelism the route a message takes is not fixed, and when multiple messages of the same order are computed in different places, disorder can result.
So how do we ensure that messages of the same order stay in order after processing?
The figure above shows a simplified flow in which the business database flows into Kafka. Binlog entries are written in order, and a suitable strategy is needed so that they are also collected in order. They can be hash-partitioned by primary key when written to Kafka, so that all data for a given key lands in the same partition. This guarantees order at that level. Then, when Flink consumes from Kafka, parallelism must be set sensibly so that each partition is handled by a single operator instance. If one partition were handled by two instances, the situation described above would recur and messages would get out of order. In addition, downstream applications must support updates or deletes by primary key so that consistency holds end to end.
Flink already works with the upstream and downstream systems to provide end-to-end consistency. We only need to ensure that our own processing tasks do not introduce disorder. Our solution is to avoid changes in the join key, for example by converting the join key to the business primary key in advance through a dedicated mapping, so that task processing stays in order.
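Why a changing join key causes disorder can be shown with a toy routing model. This is a hedged sketch, not Flink's actual shuffle: the message layout, the parallelism of 4, and routing by key modulo parallelism are simplifying assumptions. Two messages of the same order carry different join key values, so shuffling by join key may send them to different subtasks, while shuffling by the business primary key keeps them on one subtask where they are processed in arrival order.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (order_id, join_key, sequence): hypothetical message layout.
Message = Tuple[int, int, int]

PARALLELISM = 4  # assumed number of parallel subtasks


def route(
    messages: List[Message], key_of: Callable[[Message], int]
) -> Dict[int, List[Message]]:
    """Assign each message to a subtask by key modulo parallelism."""
    subtasks: Dict[int, List[Message]] = defaultdict(list)
    for message in messages:
        subtasks[key_of(message) % PARALLELISM].append(message)
    return dict(subtasks)


def subtasks_per_order(routing: Dict[int, List[Message]]) -> Dict[int, int]:
    """Count how many distinct subtasks handle each order."""
    seen = defaultdict(set)
    for subtask, routed in routing.items():
        for order_id, _, _ in routed:
            seen[order_id].add(subtask)
    return {order_id: len(ids) for order_id, ids in seen.items()}


# Order 7 sends two messages whose join key changed from 100 to 101.
messages = [(7, 100, 1), (7, 101, 2)]

by_join_key = route(messages, key_of=lambda m: m[1])
by_primary_key = route(messages, key_of=lambda m: m[0])

print(subtasks_per_order(by_join_key))     # {7: 2}
print(subtasks_per_order(by_primary_key))  # {7: 1}
```

When an order is spread over two subtasks, nothing constrains which one emits first; when it stays on one subtask, its messages are handled sequentially.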
4. The statistical indicators depend on details, and the service pressure is high
Another difficulty is that many of our statistical indicators, mainly real-time statistics, are computed directly from detail data. The risk is obvious: the pressure on the serving side is very heavy, and during big promotions it can easily drag the system down.
Real-time timeout statistics are a typical scenario. For example, a logistics order is created at 1:00 and leaves the warehouse at 2:00. How do we count orders that have left the warehouse but have not been collected for more than 6 hours? Flink computation is triggered by messages. If an order leaves the warehouse at 2:00 and has still not been collected at 8:00, it has timed out, but because no message arrives at 8:00, nothing triggers the calculation downstream. This is hard to handle, and at first we had no particularly good solution, so we computed it directly from detail data: the detail row with the 2:00 outbound time was written to the OLAP engine, where it was compared against the current time.
We also explored other solutions, such as sending scheduled timeout messages through message middleware, or using Flink CEP. The first requires introducing third-party middleware, so the maintenance cost is higher. CEP assumes that time advances steadily forward within its window. In our logistics scenario corrections are common: for example, a 2:00 outbound time is reported, turns out to be wrong, and is later corrected to 1:30, so the calculation has to be triggered again. Flink CEP does not support this well.
Later we explored an approach based on the Flink timer service, producing a message stream from the timer callback. The data stream enters our function, and according to rules we define, for example an outbound time of 2:00 with a 6-hour timeout, a timer is registered with the timer service. At 8:00 the timer fires and a check is performed: if the order has still not been collected, a timeout message is emitted. The whole scheme does not depend on third-party components, so the development cost is relatively low.
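The timer-based approach can be sketched as a small simulation. This is plain Python that mimics the register-and-callback pattern rather than Flink's timer service API; the class name, method names, and the use of fractional hours as timestamps are all assumptions for illustration. It also shows one way to handle a corrected outbound time: the newest deadline is remembered per order, and a timer that no longer matches it is ignored when it fires.

```python
import heapq
from typing import Dict, List, Set, Tuple

TIMEOUT_HOURS = 6.0  # assumed rule: uncollected 6 hours after outbound


class TimeoutDetector:
    """Emits a timeout message when an order is not collected in time."""

    def __init__(self) -> None:
        self.deadline: Dict[str, float] = {}         # order_id -> current deadline
        self.collected: Set[str] = set()             # orders already collected
        self.timers: List[Tuple[float, str]] = []    # min-heap of (deadline, order_id)
        self.emitted: List[Tuple[str, float]] = []   # timeout messages produced

    def on_outbound(self, order_id: str, outbound_hour: float) -> None:
        """Register (or re-register after a correction) the timeout timer."""
        deadline = outbound_hour + TIMEOUT_HOURS
        self.deadline[order_id] = deadline
        heapq.heappush(self.timers, (deadline, order_id))

    def on_collected(self, order_id: str) -> None:
        self.collected.add(order_id)

    def advance_to(self, now: float) -> None:
        """Fire every timer whose deadline has been reached."""
        while self.timers and self.timers[0][0] <= now:
            fired_at, order_id = heapq.heappop(self.timers)
            if self.deadline.get(order_id) != fired_at:
                continue  # stale timer left over from a corrected time
            if order_id not in self.collected:
                self.emitted.append((order_id, fired_at))


detector = TimeoutDetector()
detector.on_outbound("A", 2.0)   # first report: outbound at 2:00
detector.on_outbound("A", 1.5)   # correction: actually 1:30
detector.on_outbound("B", 2.0)
detector.on_collected("B")       # B is collected in time
detector.advance_to(8.0)
print(detector.emitted)  # [('A', 7.5)]
```

Order A times out at 7:30 based on the corrected time, its stale 8:00 timer is skipped, and order B produces nothing because it was collected before its deadline.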
5. Many fulfillment links and long data links
Another difficulty is that we have many fulfillment links and long data links, which makes abnormal situations hard to handle. For example, a message may need to remain valid for more than 20 days, so its state must also be kept for more than 20 days, living in Flink the whole time. If a data error or logic error occurs on some day, tracing back is a big problem, because the upstream message system generally retains data for only three days.
Here are some real cases.
We found a bug during Double 11, several days after Double 11 had begun. Because our fulfillment link is very long, taking 10 to 20 days, the error could not be fixed on the spot: after the fix, the DAG execution graph would change and the state could not be restored. Moreover, the upstream could only be replayed for 3 days, so after the fix the earlier upstream data would effectively be gone, which was unacceptable.
For some extremely long-tail orders during the epidemic, we set the state TTL to 60 days, assuming orders would complete within about 60 days. We later found that the data started to be distorted after a little more than 24 days, even though the TTL was clearly set to 60 days. It turned out that the underlying state storage used an int type for this value, so it could hold a validity period of only 20-odd days at most. We had effectively hit a boundary case in Flink, which also shows how complex our scenario is: many states need a very long life cycle.
Every time the job was stopped for a code upgrade, the state was lost and the data had to be pulled again and recomputed. But the upstream data is generally retained for only 3 days, so the business could see only 3 days of data, and the user experience was very bad.
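The TTL case above is consistent with simple arithmetic, if we assume that the validity period was held in milliseconds in a signed 32-bit integer. The source only says that an int type was used, so the unit and width here are assumptions. Under them, the largest representable period is just under 25 days, which lines up with the distortion appearing after a little more than 24 days:

```python
INT32_MAX = 2**31 - 1                 # largest signed 32-bit value
MS_PER_DAY = 24 * 60 * 60 * 1000      # milliseconds in one day

# Longest period a signed 32-bit millisecond counter can represent.
max_days = INT32_MAX / MS_PER_DAY

# A 60-day TTL expressed in milliseconds, and whether it fits.
ttl_60_days_ms = 60 * MS_PER_DAY
fits_in_int32 = ttl_60_days_ms <= INT32_MAX

print(round(max_days, 2))   # 24.86
print(ttl_60_days_ms)       # 5184000000
print(fits_in_int32)        # False
```

A 60-day value therefore cannot be stored without overflow under these assumptions, whatever the configured setting says.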
Solution: mixing batch and stream
What do we do?
We reuse state through a hybrid of batch and stream: the real-time message stream is handled by Blink stream processing, offline computation is handled by Blink batch processing, and by combining the two, all historical data is computed in the same task. For example, to join the order message stream with the fulfillment message stream, we add an offline order source to the task, UNION ALL it with the real-time order source, and add a GROUP BY node below to deduplicate by primary key. This achieves state reuse.
There are a few points to note. The first is that a custom source connector has to be developed to combine the offline and real-time messages. The other is whether the offline message or the real-time message should win after the GROUP BY. Consumption of the real-time messages may be relatively slow, so we have to judge which message is the true, valid one. We customized some functions, such as a last-value function, to decide whether the offline or the real-time message takes precedence. The whole process is based on Blink and MaxCompute.
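The precedence question can be made concrete with a sketch. The article does not describe the rule inside its custom last-value function, so the rule below is purely an assumption for illustration: the record with the newer modification time wins, and on a tie the real-time record is preferred. The field names gmt_modified and source, and both function names, are hypothetical.

```python
from typing import Dict, List, Optional

Record = Dict[str, object]


def prefer(current: Optional[Record], incoming: Record) -> Record:
    """Pick the record to keep: newer modification time, real-time on ties."""
    if current is None:
        return incoming
    current_rank = (current["gmt_modified"], current["source"] == "realtime")
    incoming_rank = (incoming["gmt_modified"], incoming["source"] == "realtime")
    return incoming if incoming_rank >= current_rank else current


def rebuild_state(records: List[Record]) -> Dict[str, Record]:
    """Union offline and real-time records, keeping one row per order."""
    state: Dict[str, Record] = {}
    for record in records:
        key = str(record["order_id"])
        state[key] = prefer(state.get(key), record)
    return state


# The real-time update for LP001 arrives before the older offline snapshot.
records = [
    {"order_id": "LP001", "gmt_modified": 200, "source": "realtime", "status": "outbound"},
    {"order_id": "LP001", "gmt_modified": 100, "source": "offline", "status": "created"},
    {"order_id": "LP002", "gmt_modified": 50, "source": "offline", "status": "created"},
]
state = rebuild_state(records)
print(state["LP001"]["status"])  # outbound
print(state["LP002"]["status"])  # created
```

LP001 keeps its newer real-time value even though the offline row arrived later, and LP002, which has no real-time message within the retention window, is restored from the offline source.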
6. Some small tips
① Messages cannot be withdrawn once sent
The first is that once a message is sent downstream, it cannot be withdrawn. Some orders are valid at first and become invalid later. Such orders should not be filtered out in the task; they should be flagged and still sent downstream so that the statistics can account for them.
② Add data version, data processing time and data processing version
- The data version is the version of the message structure. It prevents a task from reading dirty data when it restarts after a model upgrade.
- The processing time is the time at which the message was processed. For example, when messages are written back to offline storage, we sort by time within each primary key and take the latest record, which restores a near-real-time view of the data.
- The data processing version is added because even millisecond precision is not enough to distinguish the order of messages.
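The three fields above can be sketched together. This is an illustrative plain-Python example; the field names, the counter used as the processing version, and the CURRENT_DATA_VERSION constant are all assumptions. Records with an outdated data version are skipped, and the latest record per primary key is chosen by processing time with the processing version as a tie-breaker.

```python
from itertools import count
from typing import Dict, List

CURRENT_DATA_VERSION = "v2"  # hypothetical version of the message structure
_version = count(1)          # monotonically increasing processing version


def stamp(order_id: str, status: str, process_time_ms: int, data_version: str) -> Dict[str, object]:
    """Attach data version, processing time, and processing version."""
    return {
        "order_id": order_id,
        "status": status,
        "data_version": data_version,
        "process_time_ms": process_time_ms,
        "process_version": next(_version),
    }


def latest_per_key(records: List[Dict[str, object]]) -> Dict[str, Dict[str, object]]:
    """Keep the newest current-version record per order_id."""
    latest: Dict[str, Dict[str, object]] = {}
    for record in records:
        if record["data_version"] != CURRENT_DATA_VERSION:
            continue  # written by an older model; treat as dirty data
        key = str(record["order_id"])
        rank = (record["process_time_ms"], record["process_version"])
        best = latest.get(key)
        if best is None or rank > (best["process_time_ms"], best["process_version"]):
            latest[key] = record
    return latest


log = [
    stamp("LP001", "created", 1000, "v2"),
    stamp("LP001", "cancelled", 1000, "v2"),  # same millisecond as above
    stamp("LP001", "legacy", 2000, "v1"),     # old structure, ignored
]
result = latest_per_key(log)
print(result["LP001"]["status"])  # cancelled
```

The two current-version records share a timestamp, so only the processing version decides that the cancellation is the later one.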
③ Real-time data reconciliation scheme
Real-time data reconciliation works at two levels: real-time detail against offline detail, and real-time detail against real-time summary. As mentioned above, real-time data is written back to offline storage. The offline T+1 data only covers up to 23:59:59 of the previous day, while the real-time data keeps changing, so we cut the real-time data at that moment, keeping only the messages produced before 24:00, restore the state as of that time, and compare it with the offline data. This gives a sound comparison. In addition, real-time detail can be compared with real-time summary, and because they are in the same database this comparison is also very convenient.
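The cut-and-compare step can be sketched as follows. This is a minimal illustration in plain Python under assumed names and data: the record fields, the millisecond cutoff, and both helper functions are hypothetical. The real-time log is truncated at the cutoff, reduced to the latest row per order, and then diffed against the offline table.

```python
from typing import Dict, List

CUTOFF_MS = 86_400_000  # hypothetical midnight boundary, in milliseconds


def snapshot_at(log: List[Dict[str, object]], cutoff_ms: int) -> Dict[str, str]:
    """Latest status per order among messages processed before the cutoff."""
    latest_time: Dict[str, int] = {}
    snapshot: Dict[str, str] = {}
    for record in log:
        order_id = str(record["order_id"])
        processed = int(record["process_time_ms"])
        if processed >= cutoff_ms:
            continue  # produced after midnight; not visible to offline T+1
        if processed >= latest_time.get(order_id, -1):
            latest_time[order_id] = processed
            snapshot[order_id] = str(record["status"])
    return snapshot


def diff(realtime: Dict[str, str], offline: Dict[str, str]) -> List[str]:
    """Order ids whose status differs or that exist on only one side."""
    keys = set(realtime) | set(offline)
    return sorted(k for k in keys if realtime.get(k) != offline.get(k))


realtime_log = [
    {"order_id": "LP001", "status": "created", "process_time_ms": 1_000},
    {"order_id": "LP001", "status": "outbound", "process_time_ms": 50_000_000},
    {"order_id": "LP001", "status": "signed", "process_time_ms": 90_000_000},
    {"order_id": "LP002", "status": "created", "process_time_ms": 2_000},
]
offline_table = {"LP001": "outbound", "LP002": "cancelled"}

realtime_snapshot = snapshot_at(realtime_log, CUTOFF_MS)
print(realtime_snapshot)                       # {'LP001': 'outbound', 'LP002': 'created'}
print(diff(realtime_snapshot, offline_table))  # ['LP002']
```

The post-midnight update to LP001 is excluded so the two sides are compared as of the same moment, and LP002 is reported as the one mismatch to investigate.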
04 Summary and Prospect
Here is a brief summary of our experience, followed by the directions we plan to explore.
- Model and architecture: a good model and architecture are 80% of success.
- Accuracy requirement evaluation: evaluate how accurate the data really needs to be, and whether aligned checkpoints or strict consistency semantics are truly required. In some cases approximate accuracy is enough, and the extra resource-consuming design is unnecessary.
- Rational use of Flink features: use Flink's features, such as state and checkpoints, sensibly to avoid the pain caused by misuse.
- Code self-check: make sure the data flows through processing normally and the result meets the goal.
- SQL understanding: writing the SQL is not the hard part; what is really tested is how well you reason about the data flow behind it.
① Real time data quality monitoring
Real-time processing is not like batch processing. After a batch job finishes, you can run a small script to check whether the primary key is unique and whether the record count fluctuates. Monitoring real-time data is much more troublesome.
② Flow batch unification
Stream-batch unification has several levels. The first is unification at the storage level: real-time and offline data are written to the same place, which makes them more convenient to use. The second is unification of the computing engine: for example, Flink supports both batch and stream processing and can also write to Hive. A higher level is unification of processing results. The same piece of code may have different semantics in batch and in stream, so the question is how to make the same code produce exactly the same results in both.
③ Automatic tuning
There are two kinds of automatic tuning. The first is tuning within a given resource budget: for example, if we have applied for 1,000 cores, how should they be allocated, where are the likely performance bottlenecks, and which parts should get more? The second is automatic scaling: for example, in the early morning there is little order volume or data flow, so resources can be scaled down to a very small size and adjusted automatically as the data flow changes.
The above is our overall outlook and research direction for the future.
Author: Zhang Ting (Cainiao Data Engineer)
This article is the original content of Alibaba cloud and cannot be reproduced without permission