Author: Chen Jianxin, data warehouse development engineer at Call Technology, currently focused on integrating the offline and real-time architectures of the Call Technology big data platform.
Shenzhen Call Technology Co., Ltd. (hereinafter "Call Technology") is a pioneer in the shared power bank industry. Its main business covers self-service power bank rental, development of customized shopping mall navigation kiosks, advertising display equipment, and advertising services. Call Technology offers a three-dimensional product line in the industry: large cabinets, medium cabinets, and desktop units. Its services now cover more than 90% of the cities in China, with more than 200 million registered users, meeting users' needs across all scenarios.
I. Introduction to the big data platform
1. Development history
The development of the Call Technology big data platform falls into three main stages:
1) Discrete 0.x: Greenplum
Why "discrete"? At that time there was no unified big data platform to support data services. Each business line pulled data or ran computations on its own, relying on a low-spec Greenplum offline service to maintain daily data requirements.
2) Offline 1.0: EMR
The architecture was then upgraded to offline 1.0 on EMR, Alibaba Cloud's elastic, distributed hybrid cluster service for big data, which includes Hadoop, Hive, Spark offline computing, and other common components.
Alibaba Cloud EMR mainly solved three pain points for us:
- First, storage and computing resources can be scaled horizontally;
- Second, it solved the development and maintenance problems caused by heterogeneous data across business lines, with the platform cleaning and storing data uniformly;
- Third, it let us establish our own layered data warehouse system and divide subject areas, laying a solid foundation for our metrics system.
3) Real-time, unified 2.0: Flink + Hologres
The "Flink + Hologres" real-time data warehouse we run today is the core of this article. It brings two qualitative changes to our big data platform: real-time computing and unified data services. Building on these two points, we have accelerated data exploration and promoted rapid business development.
2. Platform capabilities
In general, the 2.0 big data platform provides the following capabilities:
The platform supports real-time and offline ingestion of business databases and business logs.
The platform supports offline computing based on Spark and real-time computing based on Flink.
Data services consist of two parts:
- The first is the analysis service and ad hoc analysis capability provided by Impala;
- The other is the interactive analysis capability for business data provided by Hologres.
At the same time, the platform connects directly to common BI tools, and business systems can integrate with it quickly.
The capabilities of the big data platform have brought us many gains, which can be summarized in five points:
The core of the big data platform is its distributed architecture, which lets us expand storage or computing resources at low cost.
Server resources can be consolidated. In the previous architecture, each business department maintained its own cluster, which wasted resources, made reliability hard to guarantee, and kept operation and maintenance costs high. Now the platform handles unified scheduling.
It integrates all business data from the business departments with other heterogeneous data sources such as business logs, which the platform cleans and links together.
With data shared, the platform provides unified external services, and each business line can quickly get data support from the platform without building its own.
The platform provides unified security authentication and authorization mechanisms, enabling fine-grained authorization for different people at different levels and ensuring data security.
II. Data requirements of the enterprise business
With the rapid development of the business, building a unified real-time data warehouse became urgent. Based on the 0.x and 1.0 platform architectures and on the current state and future trends of the business, the requirements for the 2.x data platform focus on the following aspects:
Real-time large screen
The real-time large screen needs to replace the old quasi-real-time large screen with a more reliable, lower-latency technical solution.
Unified data service
High-performance, highly concurrent, and highly available data services are key to the enterprise's digital transformation; a unified data portal with unified external output must be built.
Real-time data warehouse
Data timeliness is increasingly important to enterprise operations, which must respond faster and more promptly.
III. Technical solution for the real-time data warehouse and unified data service
1. Overall technical framework
The technical architecture is divided into four parts: data ETL, the real-time data warehouse, the offline data warehouse, and data applications.
- Data ETL handles real-time processing of business databases and business logs, with Flink performing the real-time computing;
- After real-time processing, data in the real-time warehouse is stored and analyzed in Hologres;
- Cold business data is stored in the Hive offline data warehouse and synchronized to Hologres for further analysis and processing;
- Hologres connects to common BI tools, such as Tableau, Quick BI, and DataV, as well as business systems.
2. Real-time data warehouse model
As shown above, the real-time data warehouse resembles the offline one, except that it has fewer layers.
- The first layer is the raw data layer. There are two data sources: the binlogs of the business databases and the business logs of the servers, with Kafka as the storage medium.
- The second layer is the data detail layer. ETL jobs extract information from the raw data layer in Kafka and store the real-time details back in Kafka. This allows different downstream consumers to subscribe at the same time and simplifies use by the subsequent application layer. Dimension table data is stored in Hologres to support later joins and conditional filtering.
- The third layer is the data application layer. Beyond the data linked within Hologres, Hologres also connects to Hive, and Hologres provides unified services to upper-layer applications.
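As a hedged sketch of the detail layer described above, a Flink SQL job might read raw events from Kafka, clean them, and write them back to a detail topic for multiple downstream subscribers. All table names, fields, topics, and connector options below are illustrative assumptions, not taken from the original article:

```sql
-- Hypothetical raw-data-layer source: order binlog events landed in Kafka.
CREATE TABLE ods_order_binlog (
    order_id    BIGINT,
    user_id     BIGINT,
    city_code   STRING,
    amount      DECIMAL(10, 2),
    order_time  TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'ods_order_binlog',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Hypothetical detail-layer sink: cleaned records written back to Kafka so
-- that multiple downstream consumers can subscribe independently.
CREATE TABLE dwd_order (
    order_id    BIGINT,
    user_id     BIGINT,
    city_code   STRING,
    amount      DECIMAL(10, 2),
    order_time  TIMESTAMP(3)
) WITH (
    'connector' = 'kafka',
    'topic' = 'dwd_order',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
);

-- Basic cleaning: drop malformed rows before they reach the detail layer.
INSERT INTO dwd_order
SELECT order_id, user_id, city_code, amount, order_time
FROM ods_order_binlog
WHERE order_id IS NOT NULL
  AND amount >= 0;
```

Dimension tables would be declared similarly with the Hologres connector, so the detail stream can be joined against them or filtered by their attributes.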
3. Overall technical architecture and data flow
The following data flow diagram clarifies the planning of the overall architecture and the flow of data through the warehouse model.
As can be seen from the figure, it is mainly divided into three modules:
- The first is integrated processing;
- The second is real-time data warehouse;
- The third is data application.
Looking at the inflow and outflow of data, there are two cores:
- The first core is Flink's real-time computing: data can be read from Kafka, or MySQL binlog data can be read directly via Flink CDC, and results can be written back to the Kafka cluster.
- The second core is the unified data service: it is now handled by Hologres, which avoids data islands and hard-to-maintain consistency, and also speeds up the analysis of offline data.
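The Flink CDC path mentioned above can be sketched roughly as follows. This is a minimal illustration: the database, table, and connection options are assumptions, and it presumes the `mysql-cdc` connector is available in the Flink deployment:

```sql
-- Hypothetical CDC source table: Flink reads the MySQL binlog directly,
-- with no Kafka hop in between.
CREATE TABLE orders_cdc (
    order_id   BIGINT,
    user_id    BIGINT,
    status     STRING,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'mysql-cdc',
    'hostname' = 'mysql-host',
    'port' = '3306',
    'username' = 'flink_user',
    'password' = '******',
    'database-name' = 'order_db',
    'table-name' = 'orders'
);
```

Downstream jobs can then consume `orders_cdc` like any other table, or write it back to a Kafka cluster for other subscribers.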
IV. Specific practice details
1. Big data technology selection
The implementation is divided into two parts: real-time computing and service/analysis. For real time, we chose fully managed Flink on Alibaba Cloud, which has the following advantages:
- State management and fault tolerance mechanism;
- Table API and Flink SQL support;
- High throughput and low latency;
- Exactly-once semantics;
- Stream-batch unification;
- Fully managed operation and other value-added services.
For service and analysis, we chose Alibaba Cloud Hologres interactive analytics, which brings several benefits:
- Fast analytical response;
- High-concurrency reads and writes;
- Separation of storage and computing;
- Ease of use.
2. Implementation of the real-time large screen
The figure above compares the new and old solutions for the real-time large screen business.
Take orders as an example. In the old solution, orders were synchronized from the order database to another database through DTS. Although that link was real time, the computation relied on scheduled tasks, e.g., with an interval of 1 or 5 minutes, to refresh the data. Since the sales and management teams need to track business dynamics in real time, this was not truly real time. Slow and unstable responses were another big problem.
The new solution adopts a Flink real-time computing + Hologres architecture.
Development can fully use Flink's SQL support; compared with our previous MySQL-based development, the migration was essentially seamless and landed quickly. Hologres handles data analysis and serving. Take orders as an example: metrics such as today's order revenue and today's ordering users may need an additional city dimension as the business diversifies. Hologres's analytical capability fully supports the quick display of revenue, order volume, ordering users, and city-level breakdowns.
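A hedged sketch of the order metrics above, assuming a cleaned order-detail stream `dwd_order` (hypothetical, with order ID, user ID, city, amount, and time fields) is already registered, and with all Hologres connector options shown as placeholders:

```sql
-- Hypothetical Hologres sink for the large-screen metrics; the primary key
-- lets Flink upsert the running totals as new orders arrive.
CREATE TABLE ads_order_city_stats (
    stat_date    DATE,
    city_code    STRING,
    revenue      DECIMAL(18, 2),
    order_cnt    BIGINT,
    order_users  BIGINT,
    PRIMARY KEY (stat_date, city_code) NOT ENFORCED
) WITH (
    'connector' = 'hologres',
    'dbname' = 'realtime_dw',
    'tablename' = 'ads_order_city_stats',
    'username' = 'holo_user',
    'password' = '******',
    'endpoint' = 'holo-endpoint:80'
);

-- Today's revenue, order volume, and ordering users per city, continuously
-- updated by Flink and served by Hologres to the large screen.
INSERT INTO ads_order_city_stats
SELECT
    CAST(order_time AS DATE)  AS stat_date,
    city_code,
    SUM(amount)               AS revenue,
    COUNT(*)                  AS order_cnt,
    COUNT(DISTINCT user_id)   AS order_users
FROM dwd_order
GROUP BY CAST(order_time AS DATE), city_code;
```

Adding a new dimension then amounts to extending the key and the `GROUP BY`, which is what makes iterating on diversified metrics fast.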
3. Implementation of the real-time data warehouse and unified data service
Take one business scenario as an example: a high-volume business log whose average daily data volume is at the TB level. First, the pain points of the old solution:
- Poor data timeliness: because of the large data volume, the old solution computed data through hourly offline scheduling, which could not meet the real-time needs of many business products. For example, the hardware system needs to know the current status of each device in real time, such as alarms, errors, or an empty cabinet, in order to take timely action.
- Data islands: in the old solution, Tableau connected a large number of business reports, which analyzed how many devices had reported in the past hour or day and which devices had reported anomalies. For different scenarios, the data computed offline by Spark was copied to MySQL or Redis. Multiple systems thus emerged, forming data islands that were a huge challenge for platform maintenance.
The business log pipeline has now been transformed with the 2.0 Flink + Hologres architecture.
- Flink's low-latency computing framework handles the TB-level log volume without pressure. For example, the previous Flume → HDFS → Spark link was abandoned and replaced by Flink, so we only need to maintain one computing framework.
- The collected device status data is unstructured; it must be cleaned and written back to Kafka, because consumers are diverse and this allows multiple downstream consumers to subscribe at the same time.
- In the scenario above, the hardware system needs highly concurrent, real-time queries over the status of tens of millions of devices (power banks), which demands strong serving capability. Hologres provides high-concurrency reads and writes: a primary-key table keyed on the device is created and its status updated in real time, satisfying the CRM system's real-time queries for devices (power banks).
- At the same time, Hologres stores the latest hot detail data and serves it externally directly.
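A hedged sketch of the device-status serving table described above. All names and fields are illustrative; since Hologres is compatible with the PostgreSQL dialect, this is ordinary DDL plus an upsert:

```sql
-- Hypothetical primary-key table holding the latest status of each device.
CREATE TABLE device_status (
    device_id   BIGINT PRIMARY KEY,
    status      TEXT,         -- e.g. 'alarm', 'error', 'empty_cabinet', 'ok'
    city_code   TEXT,
    updated_at  TIMESTAMPTZ
);

-- Real-time upsert: the Flink job writes the newest state for each device,
-- overwriting the previous row for that key.
INSERT INTO device_status (device_id, status, city_code, updated_at)
VALUES (10001, 'empty_cabinet', 'SZ', now())
ON CONFLICT (device_id)
DO UPDATE SET status     = EXCLUDED.status,
              city_code  = EXCLUDED.city_code,
              updated_at = EXCLUDED.updated_at;

-- High-concurrency point query from the CRM system: the latest state of
-- one device by primary key.
SELECT status, updated_at
FROM device_status
WHERE device_id = 10001;
```

Keying the table on the device ID is what turns tens of millions of status updates into cheap point lookups rather than scans.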
4. Business support effect
Through the new Flink + Hologres solution, we support three scenarios:
Real-time large screen
At the business level, it iterates on diverse requirements more efficiently and reduces development, operation, and maintenance costs.
Unified data service
A single HSAP system achieves integrated serving and analysis, avoiding data islands and problems with consistency and security.
Real-time data warehouse
It meets the ever-higher timeliness requirements of enterprise operations and provides second-level responses.
V. Future plans
As the business iterates, our future plans for the big data platform have two focuses: stream-batch unification and improvement of the real-time data warehouse.
- Overall, the current big data platform is still a mix of offline and real-time architectures. In the future, the redundant offline code and architecture will be retired in favor of Flink's unified stream-batch computing engine.
- In addition, we have migrated only part of the business so far. We will draw on our previous, well-developed offline warehouse metrics system to guide the current real-time warehouse construction and fully migrate to the 2.0 Flink + Hologres architecture.
Through these plans, we hope to build a more complete real-time data warehouse with fully managed Flink and Hologres, but we also have further requirements for them:
1. Requirements for fully managed Flink
The SQL editor in Flink makes writing Flink SQL jobs efficient and convenient, and many common upstream and downstream SQL connectors are provided to meet development needs. However, there are still some features we hope fully managed Flink will support in later iterations:
- Version control and compatibility checks for SQL jobs;
- Hive 3.x integration for SQL jobs;
- More convenient packaging of DataStream jobs and faster resource package uploads;
- Automatic tuning for tasks deployed in session cluster mode.
2. Requirements for Hologres interactive analytics
Hologres supports highly concurrent real-time writes and queries and is compatible with the PostgreSQL ecosystem, which makes unified data services easy to access and use. However, there are still some features we hope Hologres will support in later iterations:
- Hot upgrades, to reduce the impact on the business;
- Table backup and read/write splitting;
- Accelerated queries over Alibaba Cloud EMR Hive data warehouses;
- Computing resource management by user group.
This article is original content of Alibaba Cloud and may not be reproduced without permission.