Abstract: As more and more enterprises accelerate their digital transformation, the value of data is becoming increasingly significant. Li Shanshan, product manager of the interactive analytics team on Alibaba Cloud's computing platform, interprets the value of Hologres, a cloud-native HSAP system. The talk covers the mainstream real-time data warehouse architecture, the pain points encountered in practice, and the innovative value of the cloud-native HSAP approach.
Speaker: Li Shanshan, product manager of the interactive analytics team on Alibaba Cloud's computing platform
The following content is based on the speech video and slides. Click to view the video playback.
This talk covers the following three topics:
1、 The mainstream real-time data warehouse architecture: Lambda
2、 Alibaba's practice with the Lambda architecture
3、 Hologres, a cloud-native HSAP system
1、 The mainstream real-time data warehouse architecture: Lambda
1. Timeliness is a multiplier of data value
Embracing digital transformation has become an industry consensus. As is well known, the value of data decreases rapidly with the passage of time; timeliness is therefore a multiplier of data value.
Timeliness is a broad concept. First, it covers end-to-end real-time data acquisition, processing, and analysis. Second, it includes how quickly the results of real-time analysis can be turned into real-time services that feed online production systems. Finally, it includes enabling business users to run fast, self-service analysis on the existing data architecture and respond quickly to business changes. Only by doing all three well can the value of data be fully realized and the data platform and architecture truly serve the business.
2. The mainstream real-time data warehouse architecture: Lambda
In the process of digital transformation, many enterprises are crossing the river by feeling for the stones, continuously upgrading their data architecture to solve business problems. At present, the mainstream real-time data warehouse architecture is the Lambda architecture.
Meituan, Zhihu, Cainiao, and other companies have successfully implemented the Lambda architecture. As shown in the figure below, after data collection, a Lambda real-time data warehouse is divided into a real-time layer and an offline layer according to business requirements, used for real-time processing and offline processing respectively. Before the data service layer connects to data applications, the two results are merged, so that real-time data and offline data can serve online data applications and data products at the same time.
Offline layer: After the collected data lands in the offline data warehouse, it is first unified into the ODS layer. The ODS data is then cleaned to build the DWD (detail) layer. During enterprise data modeling, in order to improve analysis efficiency and support business layering, the DWD data is processed further, which is essentially pre-computation. The pre-computed results are dumped to offline storage systems before connecting to data services.
Real-time layer: The logic is similar to the offline layer, but more time-sensitive. The real-time layer subscribes to and consumes real-time data from upstream databases or systems, then writes it to the real-time DWD or DWS layer. A real-time computing engine provides the computing power, and the final results are written into real-time storage, such as a KV store.
Considering cost and development efficiency, the real-time layer and offline layer are not fully synchronized. The real-time layer generally keeps two to three days of data, or seven days for demanding requirements. Longer-range data, such as monthly and annual data, is stored in the offline data warehouse.
The above describes the layering of the data warehouse. In actual business analysis, the business side usually does not care how the data is processed; it needs analysis results over the full dataset, both real-time and offline. Since the real-time layer and the offline layer write their results to two different stores, a merge step is required before connecting to data services: the stream-processing and batch-processing results are combined and then served online.
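As a rough, self-contained sketch of this merge step (the function and field names are hypothetical, not part of any Lambda implementation described here), the serving layer can treat the real-time results as authoritative for the most recent retention window and fall back to the offline results for older history:

```python
from datetime import date, timedelta

def merge_for_serving(offline_daily, realtime_daily, realtime_window_days=2):
    """Merge offline and real-time aggregates before serving.

    offline_daily / realtime_daily: {date: metric_value}.
    The real-time layer only retains a few days of data, so its
    results cover the most recent window; offline results cover
    everything older than that window.
    """
    cutoff = date.today() - timedelta(days=realtime_window_days)
    merged = {d: v for d, v in offline_daily.items() if d < cutoff}
    merged.update({d: v for d, v in realtime_daily.items() if d >= cutoff})
    return merged
```

In a real Lambda deployment this logic lives in a dedicated merge job or in the service layer itself, and keeping it correct as schemas evolve is exactly the consistency burden discussed below.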
This architecture appears to solve many business problems, such as the offline data warehouse, data analysis, and large-screen real-time dashboards. However, the Lambda architecture is not perfect, and several difficulties remain.
3. Pain points of the Lambda architecture
1) Consistency challenges: mainly reflected in two sets of semantics, two sets of logic, and two copies of data.
In the Lambda architecture, one piece of data is processed separately in the offline layer and the real-time layer, which use different computing and storage engines. In other words, stream and batch have different semantics and require two sets of code, i.e. two sets of logic. Because the processing logic differs, the same source data can produce inconsistent results.
The results of the offline layer and the real-time layer are written to two different stores, so batch and stream processing produce at least two copies of the data, which must be merged before connecting to data services. In this process, data structures are constantly redefined, and data is dumped, changed, and merged, which introduces consistency problems.
These consistency problems are caused by the complexity of the architecture itself. At present, the industry resolves them at the business level, i.e. through negotiation with the business side. For example, the architecture may be considered acceptable as long as the difference rate between real-time and offline data stays below 3%.
2) Many interlocking systems, a complex architecture, and high operation and maintenance costs: Batch processing usually introduces an offline computing engine such as MaxCompute or a self-built Hadoop cluster. The stream processing part may introduce products such as Flink and Spark. Data is written to storage after processing, and the products introduced at the data service layer can be even more varied: HBase for efficient point queries; Presto and Impala for interactive analysis over the offline data warehouse; MySQL for imported data; and, to achieve end-to-end real-time in the real-time warehouse, open-source products such as Druid and ClickHouse. All these systems make the architecture complex and expensive to operate. Data engineers need to master multiple systems, so the learning cost is very high. At the same time, as one piece of data is processed and cleaned layer by layer, the whole pipeline accumulates a great deal of redundancy: the real-time layer and the offline layer each hold a copy, plus another copy before the merge. This data expansion consumes enormous storage resources.
3) Long development cycles and poor business agility: Any data or business solution requires data proofreading and verification before going online. Once a problem appears during proofreading, locating and diagnosing it is very complex, because the problem may occur in any link of the pipeline yet only surface at the data application layer. After a problem is found, it is necessary to check the data merge, the real-time computation, the offline layer, and even data collection. This complex process leads to long cycles for data revision and backfilling. In addition, one piece of data must be processed in both the offline layer and the real-time layer, and the pipeline is long; adding a new field, especially upstream, requires correcting the whole pipeline together. The process is long, and backfilling historical data consumes huge amounts of resources and time.
4) Inflexibility in the face of new business demands: Once data development is completed, the business recognizes its value, and the data-driven online business achieves good results, more business parties will see the value of the data architecture and raise new requirements. Product operations or decision-makers may ask for new data analysis reports, or ask whether self-service real-time analysis is possible. In the Lambda architecture, all computation and analysis are completed in the computing layer. For example, adding a layer of business data to the offline layer means developing a new DWS-layer job, writing the data to the DWS store, synchronizing it to the data service system, and only then providing the online report service. This process requires data engineers to review and evaluate the requirements, and the development cycle is at least the offline processing time, T+1. Many scenarios are urgent and cannot wait for T+1, so the business opportunity may be missed. The same applies to new real-time pipelines: real-time job development, proofreading, and rollout all take a long time. Therefore, the flexibility of the Lambda architecture cannot meet the demands of online business.
2、 Alibaba's practice with the Lambda architecture
1. The old architecture for refined operations in search and recommendation
A real-world architecture is more complex than the ideal Lambda architecture. Alibaba's biggest data scenario is search and recommendation. The following figure shows the old architecture for refined operations in search and recommendation, which is very similar to the Lambda architecture described above.
On the left side of the figure below is Alibaba's data, including transaction data, user behavior data, product attribute data, search and recommendation data, and so on. The data can be imported into MaxCompute in batch through data integration, or collected through the real-time message queue DataHub and cleaned through Flink.
Online architecture evolution: Flink is developing rapidly. Alibaba's signature use case, the Double 11 real-time dashboard, cleans online data through Flink and writes it into HBase, which backs the real-time dashboard with efficient point queries.
MaxCompute is an offline data warehouse product that Alibaba has developed for more than ten years, and it carries Alibaba's very large offline data analysis workloads. After real-time processing, a set of offline data is formed, most of which is stored in MaxCompute; the data for online marketing and competitive analysis all comes from MaxCompute.
As Flink's real-time computing capabilities strengthened and its use in real-time data warehouses and real-time reports grew, product operators and decision-makers saw the value of real-time data for the business. On top of MaxCompute's long-established offline analysis capability and Flink's powerful real-time computing, the business side asked whether the same data could also provide real-time online analysis services. Therefore, open-source products such as Druid were introduced: online logs are written to Druid in real time through Flink, and Druid provides real-time analysis capability, for example for real-time reports, real-time decision-making, and real-time inventory systems.
To sum up, two pipelines were formed: real-time data is analyzed in Druid and offline data in MaxCompute. But because of Druid's storage capacity and other performance constraints, Druid only stores data within about two to seven days. Marketing activities such as big promotions often need historical year-over-year or period-over-period comparisons. For example, Double 11 results need to be compared with those of the previous one or two years; marketing strategy analysis needs last week's data; and the official Double 11 period needs to be compared with the warm-up period. In these cases, offline data and real-time data are used together, which means introducing yet more products: the MaxCompute offline warehouse data and the real-time data in Druid are merged in MySQL, which then provides the online service.
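The comparison itself is simple arithmetic once both pipelines have been merged; the cost lies in the plumbing that gets the two figures into one place. A minimal sketch with made-up GMV numbers (not actual Double 11 figures):

```python
def growth_rate(current, baseline):
    """Year-over-year or period-over-period growth: the current figure
    (from the real-time store, e.g. Druid) against a historical baseline
    (from the offline warehouse, e.g. MaxCompute, after merging)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline

# Hypothetical numbers: this year's real-time GMV vs. last year's offline GMV.
yoy = growth_rate(current=1500.0, baseline=1200.0)   # +25% year-over-year
# Hypothetical warm-up vs. official-period comparison (period-over-period).
pop = growth_rate(current=1500.0, baseline=1000.0)   # +50% vs. warm-up
```

The point of the architecture discussion is that this one-line calculation requires three systems (Druid, MaxCompute, MySQL) to cooperate in the old design.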
2. Multiple systems and scenarios: integrating analytics and serving
As can be seen, the upstream data and data-cleaning links did not change, but the business application and analysis layers changed a great deal, new demands appeared, and more products were introduced to support the business. The old architecture is still a typical Lambda architecture, so with the rapid growth and expansion of Alibaba's business, its problems of inconsistency, complex operations, high cost, and poor business agility gradually became prominent.
This section briefly analyzes why so many systems are introduced as business scenarios grow more complex, and what capability each system provides.
KV store: Redis / MySQL / HBase / Cassandra provide efficient point query capability for the high-QPS query scenarios of data products.
Interactive computing: Presto / Drill.
Real-time data warehouse: ClickHouse / Druid, i.e. real-time storage plus online computing.
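The first and second capabilities above are driven by two very different access patterns, which is why separate systems were introduced for each. A toy in-memory illustration (hypothetical rows and fields), showing the KV-store pattern next to the analytical-scan pattern that an HSAP system aims to serve from a single copy of data:

```python
# Toy dataset standing in for a serving-layer table.
rows = [
    {"user_id": 1, "city": "hz", "gmv": 120.0},
    {"user_id": 2, "city": "sh", "gmv": 80.0},
    {"user_id": 3, "city": "hz", "gmv": 50.0},
]

# Point query: O(1) lookup by primary key -- the high-QPS KV-store
# pattern (Redis / HBase) used by online data products.
by_user = {r["user_id"]: r for r in rows}
point_result = by_user[2]["gmv"]

# Interactive analysis: full scan with grouping and aggregation --
# the OLAP pattern (Presto / ClickHouse / Druid).
gmv_by_city = {}
for r in rows:
    gmv_by_city[r["city"]] = gmv_by_city.get(r["city"], 0.0) + r["gmv"]
```

Serving both patterns well from one engine and one copy of data is precisely the HSAP goal described in the next subsection.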
3. The new architecture for refined operations in search and recommendation
Multiple products were introduced to support the three capabilities above. Can we still solve the same business problems across multiple business scenarios, but integrate the capabilities of multiple big data products into one engine, so that the data can be stored once and served to the upper layers in a unified way? This leads to the architecture shown in the figure below.
The upstream data processing and cleaning do not change, but richer capabilities are provided when connecting to upper-layer business applications. For example, the system can simultaneously provide point queries, result caching, offline acceleration, federated analysis, and interactive analysis. Such a system is defined as an HSAP (Hybrid Serving and Analytical Processing) system: it unifies analytics and serving, using one copy of data for both real-time analysis and online services.
Hologres is a cloud-native HSAP database system launched in this context. First, Hologres unifies the storage of real-time and offline data: it supports real-time writes from Flink and other real-time computing, as well as batch imports of offline data. Second, the data service layer is designed around real-time analysis, meeting the needs of real-time business analysis and online services at the same time. Third, without changing the original real-time data warehouse architecture, MaxCompute data can be accelerated directly, using Hologres's computing power to serve online business.
3、 Hologres, a cloud-native HSAP system
1. Core advantages of Hologres
A cloud-native HSAP database: one copy of data for both real-time analysis and online services.
Fast response: millisecond-level response, easily meeting customers' needs for complex multidimensional analysis of massive data; tens of millions of QPS for point queries, and thousands of QPS for simple real-time analysis queries.
Real-time storage: supports writes at the level of 100 million TPS with strong timeliness; data is queryable immediately after writing.
MaxCompute acceleration: direct analysis of MaxCompute data, with no data relocation and no redundant storage.
PG ecosystem: a PostgreSQL-compatible, developer-friendly ecosystem, compatible with PG tools (psql, Navicat, DataGrip) and seamlessly connecting to BI tools.
Hologres online service data for Double 11 2019: under the extremely large data volume on the day of Double 11, Hologres supported a peak of 130 million real-time write TPS, with data queryable as soon as it was written, and provided 145 million QPS of highly concurrent online queries.
2. Typical application scenarios of Hologres interactive analysis
Accelerated queries on offline data: second-level interactive queries on MaxCompute offline data are currently supported. Without extra ETL work, cold data can easily be turned into understandable analysis results, improving enterprise decision-making efficiency and reducing time cost.
Real-time data warehouse: Flink + Hologres can build a user insight system that monitors platform users in real time and diagnoses them from different perspectives, so that targeted user operation strategies can be applied, achieving refined user operations in real time.
Real-time and offline federated computing: based on joint computation between the MaxCompute offline warehouse and the real-time interactive-analysis warehouse, starting from the business logic, real-time and offline data analysis and federated queries are realized, building refined operations across the full real-time pipeline.
The architecture and application of each of the three scenarios above are introduced next.
3. MaxCompute accelerated analysis
Traditional solution: data redundancy, high cost, and long development cycles. As shown in the figure below, the data on the left is synchronized to the offline data warehouse through data integration, then processed in the DWD and DWS layers; the processed data serves online business.
One approach is to use MaxCompute's MapReduce computing power directly to support online marketing strategies and online real-time reports. Although this scheme can meet the business requirements, a submitted MapReduce task must wait for queuing and resource allocation. In many cases the waiting time is longer than the analysis time itself, and the end-to-end latency of analysis is tens of minutes or even hours, so efficiency is low.
The alternative is to transfer data from the MaxCompute offline warehouse to online products such as Redis and MySQL, and use their interactive analysis or point query capabilities to provide the service. However, integrating data from MaxCompute into Redis, MySQL, and similar products is difficult. The first problem is data capacity: for example, when the ADS layer of the offline warehouse is too large for MySQL to carry, an additional ADS job must be added to the offline warehouse to process and pre-compute the data again, reducing its volume before loading it into MySQL. In other words, the analysis requires further data preprocessing, and a data synchronization job must be maintained; after synchronization the data must still be stored in MySQL. This process leads to data redundancy, high cost, and long development cycles.
Hologres: no data relocation, efficient data analysis: Hologres has a storage-compute-separated architecture and connects seamlessly with MaxCompute. In this scenario Hologres provides the computing power, while MaxCompute effectively acts as Hologres's storage cluster: Hologres can directly read and accelerate the data stored in MaxCompute. As long as the data has been processed in MaxCompute and is queryable, Hologres can analyze it directly. For MaxCompute accelerated analysis, Hologres is designed around interactive analysis, returning results immediately after a query is issued, meeting the need for efficient, low-cost, self-service analysis.
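The key idea is querying external storage in place instead of copying it. As a loose, self-contained analogy (this is SQLite's ATTACH mechanism, not Hologres or MaxCompute syntax; all names are made up), one engine can analyze a table that physically lives in a separate database file without any ETL or second copy:

```python
import os
import sqlite3
import tempfile

# Build an "offline warehouse" database file (a stand-in for MaxCompute).
warehouse_path = os.path.join(tempfile.mkdtemp(), "warehouse.db")
wh = sqlite3.connect(warehouse_path)
wh.execute("CREATE TABLE dws_sales (city TEXT, gmv REAL)")
wh.executemany("INSERT INTO dws_sales VALUES (?, ?)",
               [("hz", 120.0), ("sh", 80.0)])
wh.commit()
wh.close()

# The "query engine" (a stand-in for Hologres) attaches the external
# storage and aggregates it in place -- no data relocation, no copy.
engine = sqlite3.connect(":memory:")
engine.execute("ATTACH DATABASE ? AS offline", (warehouse_path,))
total = engine.execute("SELECT SUM(gmv) FROM offline.dws_sales").fetchone()[0]
```

In Hologres the same effect is achieved by mapping MaxCompute tables into the engine (see the DataWorks integration note below), but the storage-compute separation principle is the same: compute goes to where the data already is.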
Demo: refer to the document Real-time Analysis of Massive MaxCompute Data for a demonstration; a variety of queries achieve millisecond-level latency and return in real time.
Deep integration with DataWorks: Hologres is deeply integrated with DataWorks. To analyze and accelerate MaxCompute data directly, an external (foreign) table needs to be created. One-click synchronization of MaxCompute table structures, one-click synchronization of MaxCompute data, and one-click upload of local files are supported. For details, refer to the HoloStudio documentation.
4. Real-time data warehouse: high cost of real-time, long development cycles, and inflexible business support
The real-time data warehouse architecture is that after data collection, the data in the ODS layer Kafka is cleaned through Flink to generate DWD layer data. If there is further processing demand, subscribe to DWD layer data again and write it to DWS layer. Different products such as HBase, MySQL and OLAP are introduced according to different business scenarios.
This architecture solved the existing problems well. However, all the computing logic is processed in Flink and then written to the DWS layer. Once a new business scenario must be added or an existing one adjusted, with new fields or new calculation logic, the pipeline must be re-evaluated and the Flink job redeveloped, modified, or added, with the results written to the DWS layer again.
Therefore, in this scenario, most of the computation happens in the Flink layer, and business flexibility is not high enough. In other words, the architecture computes everything in advance, which cannot meet the needs of self-service analysis or analysis of DWD-layer data.
5. Integration of real-time, offline, analytics, and serving
To solve the above problems, Hologres and the Feitian big data platform products (such as Flink and MaxCompute) jointly launched a new generation of solutions integrating real-time, offline, analytics, and serving. Data is still cleaned offline in MaxCompute and in real time in Flink. However, after cleaning in the Flink layer, the detail-layer data can be written directly into Hologres, which then serves the real-time dashboard. Hologres provides powerful real-time storage and real-time computing capabilities, which means the detail-layer data can also back reports directly.
For joint analysis of real-time and offline data, the data in MaxCompute can be associated directly with Hologres to realize federated queries.
In real Flink computing scenarios, if you want to materialize the data to provide online services, you can subscribe to Hologres again, process the Hologres DWD-layer data into DWS-layer data, and write it back to Hologres. At the same time, Hologres can serve as a very large dimension table for Flink, and other Flink jobs can analyze the data in Hologres.
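The subscribe-and-reaggregate step above is an ordinary rollup. A pure-Python sketch of the transformation a Flink job would perform (the event schema and field names are made up for illustration; a real job would read from and write back to Hologres):

```python
def dwd_to_dws(dwd_events):
    """Roll detail-layer (DWD) events up into service-layer (DWS)
    aggregates keyed by (date, item), the shape a Flink job might
    write back to Hologres for online serving.

    dwd_events: iterable of {"dt": str, "item_id": int, "price": float}.
    """
    dws = {}
    for e in dwd_events:
        key = (e["dt"], e["item_id"])
        agg = dws.setdefault(key, {"pv": 0, "gmv": 0.0})
        agg["pv"] += 1            # count of detail events
        agg["gmv"] += e["price"]  # summed transaction value
    return dws
```

The difference from the old architecture is that the DWD detail data remains queryable in Hologres alongside the DWS result, so self-service analysis is not locked out by the pre-aggregation.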
The above is a brief description of the integrated real-time, offline, analytics, and serving solution. In actual business scenarios the solution is far more complex than described here.
Actual business scenario: The following figure shows the solution architecture of the Feitian big data product family. The data pipeline is more complex, but essentially the same as the architecture above: upstream data sources feed the real-time data warehouse through data collection, and after correlation analysis, real-time and offline federated query and analysis capabilities are finally provided to upper-layer data applications.
6. Customer case: Internet content and information
In addition to its wide use within Alibaba Group, Hologres has also been widely adopted on the cloud, in the Internet industry, and by traditional enterprises.
The figure below shows a typical Internet customer case. Xiaoying is a popular short-video app in Southeast Asia. In addition to real-time dashboards and real-time reports, the Internet industry also performs user analysis, user profiling, user tagging, real-time video recommendation, and so on. This scenario is similar to Alibaba's refined search and recommendation operations, so the customer's architecture is built on Alibaba Cloud's Feitian big data product family: MaxCompute for offline data, Flink for real-time computing, and Hologres for interactive analysis.
7. Focus on data construction and connect the whole pipeline
Hologres builds its data ecosystem around the big data ecosystem, the PostgreSQL ecosystem, and the Alibaba Cloud ecosystem, covering the whole data-construction pipeline: from data source connection, data synchronization, and data processing, to data operations and maintenance, and on to data analysis and application.
Hologres has been officially commercialized on the cloud. It provides different billing plans, such as monthly subscription and pay-as-you-go, and supports purchasing computing and storage resources in different proportions, so users can buy according to their own business needs. At the same time, self-built and open-source products can still be used for data products, data processing, data synchronization, and data development tools.
Hologres, a new generation HSAP system, has been released, with a first-month discount on specified plans. Click to buy now.
If you are interested in Hologres and want to follow product updates, please visit the link below for learning and exchange.