History of development
Times change, life comes and goes, and across the long sweep of history nothing is permanent; the only constant is change, and technology is no exception. Choosing the Internet means boarding a fast-moving train of the era, headed in an unknown direction. No technical architecture is meaningful outside the context of its time, and the same is true of our careers.
Time is a ruler that measures the progress of those who strive; a balance that weighs their achievements; a shuttle that can carry us back along the long river of history. Today, let's walk through the development of data warehouse architecture, feel the historical changes, and revisit the relics along the way. Ready? Let's go! Before that, let's look at where the data warehouse sits in the overall data platform.
Before we start, let's take in the big picture to build a general understanding, moving from the whole to the parts and from the abstract to the concrete: what drove each architectural change, what each change meant in its era, and what a data warehouse actually is.
So what is a data warehouse? A data warehouse is a subject-oriented, integrated, relatively stable collection of data that reflects historical changes, used to support management decisions. Building a data warehouse within a data platform involves two parts: constructing the warehouse itself, and applying it.
The data warehouse has evolved alongside enterprise informatization. As information tools were upgraded and new tools adopted, data volumes grew, data formats multiplied, and decision-making requirements became more demanding, so data warehouse technology kept developing. This is the real driver of every architecture upgrade: the external environment changed, and the existing system could no longer meet current needs. Now that we have found the cause, let's enjoy the shining stars along this long history.
“We are moving from the IT age to the DT (data technology) age. The difference between IT and DT is not only a change of technology but also a change of mindset. IT is mainly about self-service, better self-control and management; DT is about activating productivity and helping others live better than yourself.”
——Jack Ma, chairman of Alibaba’s board of directors.
Classic data warehouse
Before we start, one note: data warehouses existed long before big data. Before the offline data warehouse (built on a big data architecture), there were many traditional data warehouse technologies, such as Teradata-based warehouses. What changed in the big data era is the technology used to build the warehouse: we began to abandon the traditional stack in favor of big data technologies that better fit current needs. Of course, big data technology has not completely replaced the traditional technologies; we can still see them in many places.
A classic data warehouse can place different warehouse layers in different databases, different database instances, or even different machine rooms.
Big data technology changed how the data warehouse stores and computes data, and with it the concepts of warehouse modeling. For example, a classic data warehouse is stored in a relational database such as MySQL, while a big data warehouse is stored in Hive (actually HDFS) on a Hadoop platform; there are also other warehouse products such as Teradata and Greenplum.
Offline data warehouse (offline big data architecture)
With the advent of the Internet era, data volumes increased dramatically, and big data tools came to replace the traditional tools of the classic data warehouse. At first this was just a replacement of tools, with no fundamental difference in architecture; we can call this the offline big data architecture.
As data volumes kept growing and fact tables reached tens of millions of rows, traditional ETL tools such as Kettle became unstable, and database storage came under pressure, fighting a daily battle with disk space; the execution time of a single table's zipper (slowly-changing-dimension) task grew exponentially. At this point we started using HDFS instead of a database for storage, and Hive (MapReduce) instead of the database for computing, rather than the ETL tools such as Kettle and Informatica used in the traditional warehouse architecture.
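As a rough illustration of the zipper-table logic mentioned above, here is a minimal pure-Python sketch of one day's SCD-type-2 merge. The row layout (`id`, `value`, `start`, `end`), the `OPEN_END` sentinel, and the function name are illustrative assumptions, not a production schema.

```python
from datetime import date

OPEN_END = date(9999, 12, 31)  # sentinel: this row version is still current

def zipper_merge(history, today_snapshot, biz_date):
    """One day's zipper-table (SCD type 2) merge: close open rows whose
    value changed or whose key disappeared, keep unchanged rows open,
    and open new rows for new or changed keys starting from biz_date."""
    snap = {r["id"]: r["value"] for r in today_snapshot}
    merged, unchanged = [], set()
    for row in history:
        if row["end"] != OPEN_END:
            merged.append(row)                      # closed rows are immutable
            continue
        if snap.get(row["id"]) == row["value"]:
            merged.append(row)                      # value unchanged: stays open
            unchanged.add(row["id"])
        else:
            merged.append(dict(row, end=biz_date))  # changed or deleted: close it
    for rid, value in snap.items():
        if rid not in unchanged:                    # new key, or value just changed
            merged.append({"id": rid, "value": value,
                           "start": biz_date, "end": OPEN_END})
    return merged
```

The point of the sketch is the cost profile: every daily run rewrites the whole history, which is why the task's runtime grows with table size and why this workload was moved onto HDFS and Hive.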
The company then redesigned the data warehouse architecture: Hive on the Hadoop platform as the warehouse, report-layer data saved in MySQL, and Tableau as the reporting system. Storage was no longer a worry and computation sped up greatly. On this basis, the company opened Hue to all departments so that simple counting work could be done by the operations staff themselves. Presto can be used for cross-database queries spanning MySQL and Hive; when using Presto, pay attention to its strict data types.
Later, with advances in network and communication technology, real-time reporting and transmission of terminal data became possible. Real business systems changed accordingly, which kept raising our demands on timeliness. Before going further, consider how network and communication technology have reshaped our daily lives.
To cope with this change, we added an acceleration layer on top of the offline big data architecture, using stream processing to compute the indicators with high real-time requirements directly, and then merging them with the offline results to give users a more complete, real-time view. This is the Lambda architecture.
To compute some real-time indicators, a real-time computing link is added to the original offline warehouse: the data source is turned into a stream (that is, the data is sent to a message queue), the real-time job subscribes to the queue, computes the indicator increments directly, and pushes them to the downstream data service, where the offline and real-time results are combined.
Note that the indicators computed by stream processing are still recomputed in batch, and the batch result is authoritative: each batch run overwrites the stream-processed result (a compromise forced by the immaturity of stream processing engines at the time). The Lambda architecture is a real-time big data processing framework proposed by Nathan Marz, the author of Storm, the well-known real-time processing framework he developed while working at Twitter. Drawing on years of experience with distributed big data systems, Marz designed Lambda to meet the key requirements of a real-time big data system: high fault tolerance, low latency, and scalability. It integrates offline and real-time computation, embodies architectural principles such as immutability, read/write separation, and complexity isolation, and can combine big data components such as Hadoop, Kafka, Storm, Spark, and HBase.
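The batch-overwrites-stream compromise can be sketched in a few lines of Python. The per-day dictionaries and the `serving_view` helper are hypothetical names for illustration only.

```python
def serving_view(batch_by_day, stream_by_day):
    """Lambda-style serving merge: start from the streaming estimates,
    then let the batch result overwrite any day it has already
    recomputed; only the tail the batch job has not reached keeps
    the (approximate) stream value."""
    merged = dict(stream_by_day)
    merged.update(batch_by_day)  # batch is authoritative where present
    return merged

# The stream estimated 98 for Jan 1; the nightly batch recomputed it as 100.
stream = {"2021-01-01": 98, "2021-01-02": 55}
batch = {"2021-01-01": 100}
```

After the merge, Jan 1 shows the batch value 100 while Jan 2, not yet covered by batch, keeps the streaming estimate 55.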
If you set aside the merge step above, the Lambda architecture is two completely independent pipelines, as shown below.
The same requirement must be developed as two sets of code, which is the biggest problem of the Lambda architecture. Two codebases mean not only harder development (the same requirement implemented once on the batch engine and once on the stream engine, plus constructed test data to verify that the two results are consistent) but also harder maintenance: when a requirement changes, both codebases must be modified separately, tested independently, and released together.
Increased resource usage: the same logic is computed twice, so overall resource consumption rises, with the real-time computation adding the extra cost.
The results of the real-time link and the offline link are easily inconsistent, which misleads users: the number seen yesterday may not match the number seen today after the batch recomputation.
Downstream processing is complex: the real-time and offline results must be merged, and this step usually happens before the data is presented to users.
Later, real-time businesses multiplied, more and more data sources were event-based, and real-time processing went from a supporting role to the main one, so the architecture adjusted accordingly: the Kappa architecture, centered on real-time event processing, emerged. Of course, this shift also required innovation in the technology itself, namely Flink. Flink made exactly-once semantics and stateful computation possible, so the results of real-time computation could finally guarantee the accuracy of the final result.
Although the Lambda architecture met real-time needs, it brought extra development and operations work. Its historical background was the immaturity of stream processing engines, whose results served only as temporary, approximate values for reference. Later, with the emergence of stream processing engines such as Flink, stream processing matured, and to solve the two-codebase problem, LinkedIn's Jay Kreps proposed the Kappa architecture.
The Kappa architecture can be considered a simplified Lambda architecture with the batch part removed. In Kappa, requirement changes and historical data reprocessing are both accomplished by replaying data from upstream.
Reprocessing in the Kappa architecture
Choose a message queue with replay capability, one that retains historical data and supports multiple consumers, and set the retention period of historical data according to your needs. Kafka, for example, can retain all historical data; storage systems built specifically for real-time streams, such as Pulsar and Pravega, also qualify.
When one or more indicators need to be reprocessed, write a new job with the new logic, consume the upstream message queue again from the beginning, and write the results to a new downstream table.
When the new job catches up, switch the application's data source to the new result table, then stop the old job and delete the old result table.
The biggest problem of the Kappa architecture is that the throughput of replaying history through the stream is lower than that of batch processing, but this can be compensated for by adding computing resources.
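The reprocessing steps above can be sketched in Python, with an in-memory `ReplayableLog` standing in for a replayable queue in Kafka's role; all class and function names here are illustrative assumptions.

```python
class ReplayableLog:
    """Toy stand-in for a replayable message queue (Kafka-like): it keeps
    history so a new consumer can start again from offset 0."""
    def __init__(self):
        self.events = []
    def append(self, event):
        self.events.append(event)
    def read_from(self, offset=0):
        return iter(self.events[offset:])

def run_job(log, logic):
    """A streaming job reduced to its essence: consume from the beginning
    and fold every event into a fresh downstream result table."""
    table = {}
    for event in log.read_from(0):
        key, delta = logic(event)
        table[key] = table.get(key, 0) + delta
    return table

log = ReplayableLog()
for amount in (10, 20, 30):
    log.append({"user": "u1", "amount": amount})

old_table = run_job(log, lambda e: (e["user"], 1))            # v1 logic: order count
new_table = run_job(log, lambda e: (e["user"], e["amount"]))  # v2 logic: order amount
serving_table = new_table  # final step: switch the app, then retire the old job/table
```

The switch at the end is the whole trick: because the queue retains history, "fixing" an indicator is just running new logic over the same events into a new table.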
Pravega (streaming storage)
If you want a unified stream-batch big data processing architecture, you actually have mixed requirements for storage:
For historical data in the old part of the sequence, it must provide high-throughput read performance (catch-up read); for real-time data in the new part of the sequence, it must provide low-latency, append-only tailing writes and tailing reads.
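A toy Python model can make the two access patterns concrete. The class name and the single-reader simplification are assumptions for illustration; a real stream store tracks a position per reader.

```python
class StreamStorage:
    """Toy model of the mixed access pattern: one append-only sequence
    serving both a high-throughput catch-up read over history and a
    low-latency tailing read of records appended afterwards."""
    def __init__(self):
        self._log = []
        self._tail = 0  # next offset for the (single, simplified) tailing reader
    def append(self, record):
        """Tailing write: strictly append-only."""
        self._log.append(record)
    def catch_up_read(self):
        """Read the whole history in one batch-style pass."""
        self._tail = len(self._log)
        return list(self._log)
    def tailing_read(self):
        """Read only what arrived since the reader was last caught up."""
        new = self._log[self._tail:]
        self._tail = len(self._log)
        return new
```

The design point: both reads go against the same append-only sequence, so history and fresh data never live in two separate systems.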
The bottom layer of the storage architecture is based on scalable distributed cloud storage; the middle layer stores log data as a stream, serving as the shared storage primitive. On top of the stream, different functions can be built: message queuing, NoSQL, full-text search over streaming data, and combined real-time and batch analysis with Flink. In other words, Pravega's stream primitive avoids the data redundancy caused by moving raw data between multiple open-source storage and search products in existing big data architectures, achieving a unified data lake at the storage layer.
The proposed big data architecture uses Apache Flink as the computing engine, unifying batch and stream processing through a single model and API, and Pravega as the storage engine, providing a unified abstraction for streaming data storage so that historical and real-time data are accessed consistently. Together they form a closed loop from storage to computation that can handle high-throughput historical data and low-latency real-time data at the same time. The Pravega team also developed the Flink-Pravega connector, which provides exactly-once semantics for the whole compute-and-storage pipeline.
We have introduced the meaning, advantages, and disadvantages of the Lambda and Kappa architectures. In real scenarios, most of the time the architecture is not a strictly standard Lambda or Kappa but a mixture of the two: for example, most real-time indicators are computed with the Kappa approach, while a few key indicators (such as anything money-related) use the Lambda approach, recomputed in batch with an added reconciliation step.
The Kappa architecture does not mean that intermediate results never land in storage. Many big data systems now need to support machine learning (offline training), so real-time intermediate results must be written to a suitable storage engine for that purpose. Sometimes detail-level data must also be queryable, in which case the real-time details are written to an appropriate engine as well.
There is also the pure real-time Kappa design, which increases the difficulty of computation, the demands on resources, and the difficulty of development, hence the hybrid architecture below. As you can see, this architecture arose entirely from weighing requirements against the current state of the technology.
Real-time data warehouse
A real-time data warehouse is not really an architecture of its own; it is better described as an industry implementation of the Kappa architecture. With Kappa as its theoretical support, the real-time warehouse mainly addresses the demand for data timeliness: real-time ingestion, real-time processing, real-time computation, and so on.
In fact, a real-time data warehouse mainly solves three problems: 1. making data real-time; 2. relieving cluster pressure; 3. relieving pressure on the business databases.
The first layer, the DWD public real-time detail layer, subscribes to the message queues of business data and computes in real time. Through data cleansing, multi-source joins, and the combination of streaming data with offline dimension information, it associates the dimension attributes of the business systems with dimension tables of the same granularity, increasing data usability and reusability, to produce the final real-time detail data. This data takes two branches: one lands directly in the ADS layer for real-time detail queries, and the other is sent to a message queue for the next layer's computation.
The second layer, the DWS public real-time summary layer, is built around the concept of data domain plus business domain. Unlike the offline warehouse, this summary layer is split into a light summary layer and a high summary layer, produced at the same time: the light summary layer is written to the ADS layer for complex OLAP query scenarios in front-end products, supporting self-service analysis; the high summary layer is written to HBase for relatively simple KV query scenarios, improving query performance, for example for report output.
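Purely as an illustration, here is how the same detail stream might feed both summary layers; the dimension fields (`domain`, `city`, `hour`), the metric, and the function name are invented for the example.

```python
def summarize(detail_rows):
    """Build both summary layers from the same real-time detail stream:
    a light summary keyed by several dimensions (for OLAP-style slicing
    in the ADS layer) and a high summary collapsed to a single key per
    data domain (suitable as an HBase row key for simple KV lookups)."""
    light, high = {}, {}
    for row in detail_rows:
        dims = (row["domain"], row["city"], row["hour"])
        light[dims] = light.get(dims, 0) + row["amount"]
        high[row["domain"]] = high.get(row["domain"], 0) + row["amount"]
    return light, high

details = [
    {"domain": "order", "city": "bj", "hour": 10, "amount": 5},
    {"domain": "order", "city": "sh", "hour": 10, "amount": 7},
]
```

The light summary keeps dimensions so analysts can still slice by city or hour; the high summary trades that flexibility for a single-key lookup that a KV store answers fast.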
Key points of real-time data warehouse implementation
- End-to-end monitoring of data delay and data flow
- Fast failure recovery, and retrospective reprocessing of historical data: the system should support re-consuming data from a specified point in time
- Real-time data is served from the real-time warehouse, and the T+1 data is corrected through the offline channel
- Data map and data lineage tracing
- Real-time monitoring of business data quality; in the early stage, quality status can be identified with rule-based checks
- Every year the group runs the Double 11 promotion, during which traffic and data volume surge; compared with the offline system, the real-time system must be more sensitive to data volume and more stable
- To cope with this scenario, two preparations are needed: 1. load testing before the big promotion; 2. active/standby link support during the promotion
In the beginning, each application generated and stored a large amount of data that could not be used by other applications, creating data islands. Then the data mart was born: data generated by applications is stored in a centralized data warehouse, from which relevant data can be exported to the departments or individuals who need it. However, data marts solved only part of the problem; the remaining issues, including data management, data ownership, and access control, urgently needed solving as enterprises sought greater ability to exploit their data.
To solve these problems, enterprises developed a strong need to build their own data lakes. A data lake can store not only traditional types of data but any other type (text, image, video, audio), and can process and analyze them further to produce output for all kinds of programs to consume. Moreover, as data diversity grows, the data warehouse, with its schema defined in advance, finds it increasingly hard to support flexible exploration and analysis. This is where data lake technology comes in: cache all the raw data in big data storage, then parse it as required by later analysis. In a nutshell, the data warehouse is schema-on-write, and the data lake is schema-on-read.
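The schema-on-write versus schema-on-read contrast can be made concrete with a small Python sketch; the two-column schema and the helper names are invented for the example.

```python
import json

WAREHOUSE_SCHEMA = {"user_id", "amount"}  # columns fixed in advance

def write_to_warehouse(table, record):
    """Schema-on-write: the record must match the predefined schema
    before it is stored; anything else is rejected up front."""
    if set(record) != WAREHOUSE_SCHEMA:
        raise ValueError("record does not match the warehouse schema")
    table.append(record)

def write_to_lake(lake, raw_payload):
    """Schema-on-read, write side: the lake keeps the raw payload as-is."""
    lake.append(raw_payload)

def read_from_lake(lake, parser):
    """Schema-on-read, read side: each analysis applies its own parser."""
    return [parser(payload) for payload in lake]
```

A record with an unexpected field is rejected by the warehouse but flows into the lake untouched, where any later analysis can still parse and use it.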
Kappa vs. Lambda architecture
In real scenarios, the architecture is often not a strictly standard Lambda or Kappa: for example, most real-time indicators are computed with the Kappa approach, while a few key indicators (such as anything money-related) use the Lambda approach, recomputed in batch with an added reconciliation step.
These two architectures are both real-time architectures and extensions of offline architectures
Comparison between the real-time and offline data warehouses
The offline data warehouse is mainly built with technologies such as Sqoop and Hive to produce T+1 offline data. Scheduled tasks pull incremental data into Hive tables every day, topic-dimension data is then built for each business on top, and a T+1 data query interface is provided.
At present, the real-time data warehouse is mainly based on real-time data collection tools such as canal, which write the raw data into a data channel like Kafka and finally into a storage system like HBase, providing minute-level or even second-level query capability.
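A minimal simulation of that pipeline, with plain Python objects standing in for canal (change capture), Kafka (the channel), and HBase (the KV store); all helper names are assumptions for illustration.

```python
import queue

def capture_changes(channel, binlog_events):
    """Stand-in for a CDC tool in canal's role: forward each row-change
    event from the business database's binlog into the data channel
    (the role Kafka plays above)."""
    for event in binlog_events:
        channel.put(event)

def sink_to_kv(channel, kv_store):
    """Stand-in for the channel-to-HBase sink: drain the channel and
    upsert the latest row image under its row key, so point queries
    see fresh data."""
    while not channel.empty():
        event = channel.get()
        kv_store[event["row_key"]] = event["row"]
```

Because the sink upserts by row key, replaying two updates to the same row leaves only the latest image, which is exactly what a second-level point query should see.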