Alibaba Big Data Practice: Real-Time Technology

Time:2020-9-30

Source: digital intelligence transformation Club

The value of data is time-sensitive. If a piece of data cannot be processed promptly and put to use in the business system once it is generated, it loses its “freshness” and its value cannot be maximized.

Compared with offline batch processing, streaming real-time processing is an important complementary technology and is widely used within Alibaba Group.

In the big data industry, stream computing has been a very active research topic in recent years.

The business requirement is to obtain processed data as early as possible, so that the current state of the business can be monitored, operational decisions can be made, and the business can be steered in a good direction. For example, for a high-traffic advertising slot on a website, the slot's effectiveness at driving traffic needs to be monitored in real time. If the conversion rate is very low, operators need to replace the advertisement promptly to avoid wasting traffic. In this example, exposure and click metrics for the advertising slot must be computed in real time as a reference for operational decisions.

By data latency, data timeliness falls into three types: offline, quasi-real-time, and real-time.

  • Offline: today (T), process data from N days ago (T-N, N ≥ 1); the latency granularity is days.
  • Quasi-real-time: in the current hour (H), process data from N hours ago (H-N, N > 0, such as 0.5 hours or 1 hour); the latency granularity is hours.
  • Real-time: process the current data at the current moment; the latency granularity is seconds.

Offline and quasi-real-time processing can both be implemented on batch processing systems (such as Hadoop, MaxCompute, and Spark), differing only in scheduling cycle, whereas real-time processing must be done in a streaming system. In short, streaming data processing means that each record generated by the business system is immediately collected and sent to a streaming task for processing; no scheduler is needed to trigger the task.
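
As a minimal sketch of this per-record model, the following Python snippet updates exposure and click counters for an advertising slot each time an event arrives, so the current click-through rate can be read at any moment without a scheduled job. The event fields (`slot_id`, `kind`) and the class name `AdSlotStats` are hypothetical and chosen only for illustration; a production streaming engine would add distribution, state backup, and fault tolerance.

```python
from collections import defaultdict


class AdSlotStats:
    """Keeps running exposure/click counts per ad slot (illustrative only)."""

    def __init__(self):
        self.exposures = defaultdict(int)
        self.clicks = defaultdict(int)

    def on_event(self, event):
        # Called once for every record, as soon as it has been collected.
        if event["kind"] == "exposure":
            self.exposures[event["slot_id"]] += 1
        elif event["kind"] == "click":
            self.clicks[event["slot_id"]] += 1

    def click_through_rate(self, slot_id):
        # Current ratio of clicks to exposures; 0.0 before any exposure.
        shown = self.exposures[slot_id]
        return self.clicks[slot_id] / shown if shown else 0.0


stats = AdSlotStats()
events = [
    {"slot_id": "home_banner", "kind": "exposure"},
    {"slot_id": "home_banner", "kind": "exposure"},
    {"slot_id": "home_banner", "kind": "click"},
    {"slot_id": "home_banner", "kind": "exposure"},
    {"slot_id": "home_banner", "kind": "exposure"},
]
for event in events:
    stats.on_event(event)  # no scheduler: each record is handled on arrival
```

After these five events, `stats.click_through_rate("home_banner")` reflects one click out of four exposures.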

Generally speaking, streaming data processing has the following characteristics.

1. High timeliness

Data is collected and processed in real time, with latency measured in seconds or even milliseconds, so the business side obtains processed data as early as possible.

2. Resident tasks

Unlike offline tasks, which are scheduled periodically, streaming tasks are resident processes. Once started, they run until they are manually terminated, so the computing cost is relatively high. This also means that the data source of a streaming task is unbounded, whereas that of an offline task is bounded. This is the main difference between real-time and offline processing, and it is the source of the limitations of real-time data processing.

3. High performance requirements

Real-time computing places strict demands on processing performance. If processing throughput cannot keep up with collection throughput, the computed data loses its real-time character. For example, if a real-time task can process only 30 seconds' worth of collected data per minute, the delay of its output grows steadily, the output no longer reflects the current state of the business, and the business side may make wrong operational decisions. In the Internet industry the volume of data to process is huge, and maintaining high throughput and low latency as data volume grows rapidly is a major challenge. Performance optimization therefore takes up a large share of real-time task development.
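
The arithmetic behind the 30-seconds-per-minute example can be sketched as follows. The function name and its parameters are hypothetical, and the sketch assumes constant collection and processing rates and no backlog at start.

```python
def output_delay_seconds(elapsed_seconds, processed_per_second=0.5):
    """Delay of the output after the task has run for elapsed_seconds.

    processed_per_second is how many seconds' worth of collected data the
    task processes per second of wall-clock time (30 s per minute = 0.5).
    """
    if processed_per_second >= 1.0:
        return 0.0  # the task keeps up, so no backlog accumulates
    # Each wall-clock second leaves (1 - rate) seconds of data unprocessed.
    return elapsed_seconds * (1.0 - processed_per_second)


delay_after_one_minute = output_delay_seconds(60)
delay_after_one_hour = output_delay_seconds(3600)
```

Under these assumptions the output is 30 seconds behind after one minute and 30 minutes behind after one hour, which is why a task whose throughput falls below the collection rate cannot be considered real-time for long.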

4. Application limitations

Real-time data processing cannot replace offline processing. Besides its high computing cost, its inherent limitations mean that it supports some scenarios with complex business logic poorly (such as two-stream joins or data backtracking). In addition, because the data source is a stream, the uncertainty of data arrival times leads to some differences between real-time and offline results.

Streaming technology architecture

In stream computing, the subsystems must be chained into a data processing pipeline that produces results and ultimately provides real-time data services. In practice there are many open source options; their overall architectures are similar, but the implementation principles of the subsystems differ. In addition, the systems in a streaming architecture overlap with those of offline processing. The two technical approaches are not fully independent, and the industry trend is toward merging them.

The subsystems can be divided by function into the following parts:

1. Data acquisition

Data generally comes from the log servers of each business (for example, website browsing logs and order modification logs) and is collected in real time into data middleware for downstream consumers to subscribe to.

2. Data processing

Once the data is in the middleware, downstream consumers subscribe to it in real time and pull it into the tasks of the stream computing system for processing. A stream computing engine is needed to run these streaming tasks.

3. Data storage

After real-time processing (aggregation, cleaning, and so on), the data is written to an online storage system for downstream callers to use. Writes here are incremental and continuous.

4. Data service

A unified data service layer (providing, for example, HSF interfaces or HTTP services) is built on top of the storage system, and real-time results are obtained through it.

The overall technical architecture is shown in the figure

As the figure shows, the data collection and data service parts are shared by real-time and offline processing, because neither layer needs to care about the timeliness of the data. Sharing them also helps avoid inconsistencies between streaming and offline processing.
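
The four subsystems can be sketched end to end in a few lines of Python. This is only a toy, single-process illustration: an in-memory `queue.Queue` stands in for the data middleware, a dictionary stands in for the online storage system, and all function names and the `page` field are hypothetical. A real streaming task would run continuously rather than being called by hand.

```python
import queue

middleware = queue.Queue()  # stands in for the data middleware
store = {}                  # stands in for the online storage system


def collect(log_record):
    """Data acquisition: push a log record into the middleware."""
    middleware.put(log_record)


def process_available():
    """Data processing: consume what has arrived and write results."""
    while not middleware.empty():
        record = middleware.get()
        page = record["page"]
        # Data storage: writes are incremental and continuous.
        store[page] = store.get(page, 0) + 1


def query_page_views(page):
    """Data service: one access point on top of the storage system."""
    return store.get(page, 0)


collect({"page": "/home"})
collect({"page": "/cart"})
collect({"page": "/home"})
process_available()
```

Callers only ever use `query_page_views`, so the storage system behind the service layer can change without affecting them.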

Streaming data model

Data model design runs through the whole data processing pipeline, and streaming data processing is no exception: the data flow needs to be modeled in layers. Real-time modeling is very similar to offline modeling, and the data model is divided into five layers (ODS, DWD, DWS, ADS, and DIM).

Because of the limitations of real-time computing, the tables in each layer are not as wide as their offline counterparts and carry fewer dimensions and metrics. In particular, metrics that depend on backtracking state are rarely used in the real-time data model.

Overall, the real-time data model is a subset of the offline data model, and many real-time models are designed with reference to the offline ones.

1. Data layering

In the streaming data model, the data model is divided into five layers.

ODS layer: as in the offline system, the ODS layer is the operational data layer. It holds the rawest data collected directly from the business systems, contains the full change history of every business process, and has the finest data granularity. At this layer, real-time and offline data share the same source. The advantage is that metrics derived from the same data have the same definition, which makes real-time and offline results easy to compare. Examples: raw order change records, server engine access logs.

DWD layer: the DWD layer is the real-time fact detail layer, modeled by business process on top of the ODS layer. Data such as access logs (which has no contextual relationship and does not need to wait for process records) flows back to the offline system for downstream use, which keeps real-time and offline data as consistent as possible at the ODS and DWD layers. Examples: order payment details, refund details, user access log details.

DWS layer: after subscribing to the detail-layer data, real-time tasks compute summary metrics for each dimension. If a dimension is common to all vertical business lines, it is placed in the real-time general summary layer as a shared data model. For example, on an e-commerce website, every business line that involves the transaction process uses the seller dimension, so seller is a general dimension across vertical businesses and its summary metrics are shared by all business lines. Examples: summary tables of e-commerce data by several dimensions (seller, commodity, buyer).

ADS layer: the summary layer for business-specific dimensions. Statistical dimensions that are not broadly shared are placed here, and the dimensions and metrics computed are those that only one business cares about. This layer generally has no overlap with other business lines and is often used by vertical, innovative businesses. Examples: vertical businesses under Mobile Taobao, such as "love shopping" and Micro Taobao.

DIM layer: the real-time dimension table layer. Its data is essentially derived from the offline dimension table layer and exported to the online system for real-time applications to call. For real-time applications this layer is static, and all ETL processing is done in the offline system. Dimension tables are used slightly differently in real-time applications than in offline ones, as explained in detail later. Examples: commodity, seller, buyer, and category dimension tables.
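
To make the relationship between the first three layers concrete, here is a small sketch that derives DWD payment details and a DWS seller summary from ODS order change records. The field names, sample values, and function names are all hypothetical. For brevity the steps are applied to a list in one pass, whereas a real-time task would apply the same logic record by record.

```python
# ODS: raw change records exactly as collected from the business system.
ods_order_changes = [
    {"order_id": 1, "seller_id": "s1", "status": "created", "amount": 30},
    {"order_id": 1, "seller_id": "s1", "status": "paid", "amount": 30},
    {"order_id": 2, "seller_id": "s2", "status": "created", "amount": 50},
    {"order_id": 3, "seller_id": "s1", "status": "paid", "amount": 20},
]


def to_dwd_payment_details(changes):
    """DWD: keep only the records of one business process (payment)."""
    return [change for change in changes if change["status"] == "paid"]


def to_dws_seller_summary(payment_details):
    """DWS: summarize by a general dimension (seller) shared across lines."""
    summary = {}
    for detail in payment_details:
        seller = detail["seller_id"]
        summary[seller] = summary.get(seller, 0) + detail["amount"]
    return summary


dwd_payment_details = to_dwd_payment_details(ods_order_changes)
dws_seller_summary = to_dws_seller_summary(dwd_payment_details)
```

An ADS table would be built the same way, but over a dimension that only a single business line cares about.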

2. Multi-stream association

In stream computing, it is often necessary to join two real-time streams on a primary key to obtain the corresponding real-time detail records. In an offline system, joining two tables is simple: the full data of both tables is available when the task starts, and the data only needs to be bucketed by join key and joined. Stream computing is different: data arrives incrementally, and arrival times are uncertain and unordered. As a result, the processing involves details such as mechanisms for saving and restoring intermediate state.

For example, suppose table A and table B are joined in real time on ID. Because the arrival order of the two tables is unknown, every new record from either stream must be looked up in the other table. When a record from table A arrives, it is looked up in all the data received so far from table B. If a match is found, the two are joined into a single record and output downstream immediately; if not, the record must wait in memory or external storage until the matching record from table B arrives. The key point of multi-stream association is that the two sides wait for each other: the join succeeds only when both have arrived.

The following example (joining an order information table with a payment information table) illustrates this, as shown in the figure.

In the example above, the data of both tables is collected in real time. Each new record is looked up in the in-memory data received so far from the other table. If a match is found, the join succeeds and the result is output immediately; if not, the record is added to its own table's in-memory data set to wait. In addition, whether or not the join succeeds, the in-memory data must be backed up to an external storage system, so that when the task restarts it can restore its in-memory state from external storage and no data is lost. This is necessary because a restarted task continues from where it stopped and does not reprocess earlier data.

Furthermore, an order record may change multiple times (for example, several fields of an order may each be updated). In that case, records must be deduplicated by order ID so that table A and table B do not join successfully more than once; otherwise multiple records are output downstream and the resulting data contains duplicates.

That is the overall two-stream join process. In practice, for lookup performance, a real-time join generally partitions the data into buckets by join key, and failure recovery also proceeds bucket by bucket, which reduces the amount of data to search and improves throughput.
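
The process described in this section can be sketched as a small in-memory class. This is an illustration under simplifying assumptions, not the implementation used in any particular engine: the class and method names are hypothetical, deduplication is reduced to "output each key at most once", and `snapshot`/`restore` merely copy the state where a real task would write it to and read it from an external storage system.

```python
import copy
import zlib


class TwoStreamJoin:
    """Joins two keyed streams; each side waits in state for the other."""

    def __init__(self, num_buckets=4):
        self.num_buckets = num_buckets
        # One state dict per side and per bucket, so a lookup only searches
        # the bucket that the join key maps to.
        self.state = {
            "a": [dict() for _ in range(num_buckets)],
            "b": [dict() for _ in range(num_buckets)],
        }
        self.joined = [set() for _ in range(num_buckets)]

    def _bucket(self, key):
        # Stable hash, so a key maps to the same bucket after a restart.
        return zlib.crc32(str(key).encode("utf-8")) % self.num_buckets

    def on_record(self, side, key, record):
        """Handle one record from stream "a" or "b".

        Returns the joined (a_record, b_record) pair, or None if the record
        has to wait or the key has already been output.
        """
        other = "b" if side == "a" else "a"
        bucket = self._bucket(key)
        self.state[side][bucket][key] = record  # keep the latest version
        if key in self.joined[bucket]:
            return None  # repeated change of a joined key: no second output
        match = self.state[other][bucket].get(key)
        if match is None:
            return None  # the other side has not arrived yet: wait
        self.joined[bucket].add(key)
        return (record, match) if side == "a" else (match, record)

    def snapshot(self):
        """Copy of the state, as it might be backed up externally."""
        return copy.deepcopy((self.state, self.joined))

    def restore(self, snapshot):
        """Reload state after a restart so waiting records are not lost."""
        self.state, self.joined = copy.deepcopy(snapshot)


join = TwoStreamJoin()
first = join.on_record("a", "order-1", {"order_id": "order-1", "amount": 30})
second = join.on_record("b", "order-1", {"order_id": "order-1", "paid": True})
third = join.on_record("a", "order-1", {"order_id": "order-1", "amount": 35})
```

Here `first` is `None` because the order record has to wait, `second` is the joined pair, and `third` is `None` because the order ID has already been output once. A real task would also need to expire old state, which this sketch omits.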

3. Use of dimension tables

In an offline system, fact tables and dimension tables are generally joined by business partition, because the dimension table data is ready before the join. In real-time computing, the current real-time data (T) is usually joined with the T-2 dimension table data. In other words, the dimension table data must be prepared before the data for T arrives, and it is generally a static copy.

Why is it done this way in real-time computing? Mainly for the following reasons.

Data cannot be prepared in time: at midnight, the real-time stream must be joined with the dimension table immediately (it cannot wait, or it would lose its real-time character). At that moment the T-1 dimension table data usually cannot be ready, because a full day of T-1 data can only be processed after day T-1 has ended. Joining the T-2 dimension table therefore means that the T-2 dimension table data is processed during day T-1, with a full day to get it ready.

The latest full data cannot be obtained accurately: a dimension table generally holds full data. To obtain the latest dimension table data for the current day in real time, the full table can only be assembled as T-1 data plus the current day's changes. That is, the dimension table itself becomes a real-time stream input, which requires a multi-stream real-time join. However, because real-time data is unordered and arrival times are uncertain, the dimension table join becomes ambiguous.

Data is unordered: if the dimension table is treated as a real-time stream input, it is hard to know which version of the dimension data was obtained. For example, suppose business data at 10:00 joins successfully with the dimension table and obtains the relevant dimension fields. Is this the latest dimension data? In fact, it only represents the latest state as of 10:00 (a real-time application can never know which state is final, because it does not know whether the dimension table will change later).

In general, therefore, real-time computing joins against the T-2 dimension table. The dimension data is somewhat delayed, but it is at least deterministic, and for many businesses a dimension table changes little over two days.

In some business scenarios, T-1 data can be joined, but the T-1 data is incomplete. For example, if processing of the dimension table starts at 22:00 on day T-1, the data can be ready within the two hours before midnight. The T-1 data can then be joined on day T, but the last two hours of dimension table changes are missing.
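
The choice of dimension table partition comes down to simple date arithmetic, sketched below. The function name `dim_partition_for` and the `lag_days` parameter are hypothetical; `lag_days=2` is the usual T-2 case, and `lag_days=1` is the T-1 exception, which accepts that the last hours of changes are missing.

```python
from datetime import date, timedelta


def dim_partition_for(event_date, lag_days=2):
    """Return the date of the dimension table partition to join against."""
    return event_date - timedelta(days=lag_days)


usual_partition = dim_partition_for(date(2020, 9, 30))
early_partition = dim_partition_for(date(2020, 9, 30), lag_days=1)
```

Because the result depends only on the event date, every record of day T is joined against the same static snapshot, which keeps the join deterministic.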

In addition, because real-time tasks are resident processes, dimension tables can be used in two ways.

Full load: when the dimension table is small, it can be loaded into memory in one go and joined with the real-time stream directly in memory, which is very efficient. The disadvantage is that it occupies memory permanently and must be refreshed periodically. Example: the category dimension table has only tens of thousands of records and is loaded into memory in full at 0:00 every day.

Incremental load: when the dimension table is large, it cannot all be loaded into memory. Instead, rows are looked up incrementally and expired with an LRU policy, so that the hottest data stays in memory. The advantage is that memory usage is bounded; the disadvantage is that lookups must go to an external storage system, which lowers runtime efficiency. Example: the member dimension table has hundreds of millions of records. When real-time data arrives, the external database is queried and the result is cached in memory; the least recently used data is then cleaned up at regular intervals to avoid exhausting memory.
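
The incremental form can be sketched with a small LRU cache in front of an external lookup. The class name `LruDimCache`, the `lookup` callback, and the sample member rows are hypothetical, and a plain dictionary stands in for the external database. Unlike the periodic cleanup described above, this sketch evicts immediately whenever the capacity is exceeded, which is a simplification.

```python
from collections import OrderedDict


class LruDimCache:
    """Keeps the most recently used dimension rows in memory."""

    def __init__(self, lookup, capacity):
        self.lookup = lookup      # function that queries external storage
        self.capacity = capacity  # upper bound on rows held in memory
        self.rows = OrderedDict()

    def get(self, key):
        if key in self.rows:
            self.rows.move_to_end(key)  # mark as most recently used
            return self.rows[key]
        row = self.lookup(key)          # cache miss: go to external storage
        self.rows[key] = row
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)  # evict the least recently used
        return row


# A dictionary stands in for the external member database.
member_db = {"m1": {"level": "gold"}, "m2": {"level": "silver"}, "m3": {"level": "basic"}}
external_queries = []


def query_member(key):
    external_queries.append(key)  # record every trip to external storage
    return member_db.get(key)


cache = LruDimCache(query_member, capacity=2)
for member_id in ["m1", "m2", "m1", "m3", "m2"]:
    cache.get(member_id)
```

In this run the second request for `m1` is served from memory, while `m2` has to be fetched again after being evicted to make room for `m3`. The full-load form corresponds to reading the whole table into a dictionary once and replacing it on a schedule.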

In practice, the choice between the two forms depends on the size of the dimension table and the real-time performance requirements.

Note: some proper nouns, technical terms, product names, software project names, tool names, and the like that appear in this book are terms commonly used in internal projects of Taobao (China) Software Co., Ltd. Any resemblance to third-party names is coincidental.

This article is original content from Alibaba Cloud and may not be reproduced without permission.