This article was first published on the vivo Internet Technology WeChat official account.
Author: Liu Yanjiang
In recent years, with the continuous development of IT, big data, and machine learning, more and more enterprises have recognized the value of data, managing it as a valuable asset and using big data and machine learning capabilities to mine, identify, and exploit it. Without an effective data architecture design, or with only partial capabilities, the business layer finds it hard to use big data directly, and a wide gap opens between big data and the business. That gap leads to a series of problems in using big data: data cannot be found or understood, requirements are hard to fulfill, and data is hard to share. This article introduces some data platform design ideas that help businesses reduce the pain points and difficulties of data development.
This article is organized as follows:
- Part 1 introduces the basic components and background knowledge of big data.
- Part 2 introduces the Lambda and Kappa architectures.
- Part 3 presents the general big data architecture under the Lambda and Kappa patterns.
- Part 4 describes the end-to-end data pain points exposed by such a bare architecture.
- Part 5 presents the overall design of a well-built big data architecture.
- From Part 5 onward, the article shows how to combine these big data components, through various data platforms and tools, into an efficient and easy-to-use data platform that improves business systems, so that business developers need not fear complex data components or care about the underlying implementation: SQL alone is enough for one-stop development and data delivery, and big data is no longer a skill reserved for data engineers.
1、 Big data technology stack
The overall big data pipeline involves many modules, each of which is fairly complex. The figure below lists these modules and components together with their functional characteristics. Later articles will cover the domain knowledge of the related modules in detail, such as data collection, data transmission, real-time computing, offline computing, and big data storage.
2、 Lambda architecture and Kappa architecture
At present, almost all big data architectures are based on the Lambda or Kappa architecture, and different companies design their data architectures around these two patterns. The Lambda architecture lets developers build large-scale distributed data processing systems; it is flexible and scalable, and tolerates hardware failures and human error well. Many articles introducing the Lambda architecture can be found online. The Kappa architecture eliminates the two separate processing systems that the Lambda architecture maintains, and with them the various costs they bring; this is also the current research direction of stream-batch unification, and many enterprises have begun to adopt this more advanced architecture.
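The serving-side merge at the heart of the Lambda pattern can be sketched in a few lines of Python. This is a toy illustration with hypothetical page-view counts, not a real implementation: the batch layer periodically recomputes complete views from the master dataset, the speed layer keeps incremental views for recent events, and the serving layer merges both at query time.

```python
# Toy Lambda-architecture sketch (hypothetical data).
batch_view = {"page_a": 1000, "page_b": 500}   # precomputed from historical data
speed_view = {"page_a": 12, "page_c": 3}       # incremental counts since last batch run

def serve(page: str) -> int:
    """Serving layer: merge the batch and real-time views at query time."""
    return batch_view.get(page, 0) + speed_view.get(page, 0)

print(serve("page_a"))  # 1012

# In the Kappa pattern only the streaming path exists: reprocessing means
# replaying the log through the same stream job, so there is no second codebase.
```

The cost the Kappa architecture removes is visible here: under Lambda the same counting logic must be written twice, once for the batch view and once for the speed view.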
3、 Big data architecture under the Kappa and Lambda architectures
At present, major companies basically use the Kappa or Lambda architecture pattern. Under these two patterns, the overall big data architecture in the early stage of development may look as follows:
4、 End-to-end data pain points
Although the architecture above appears to connect a variety of big data components into an integrated whole, anyone who has done data development will feel the gaps keenly. With such a bare architecture, business data development must deal directly with many low-level tools, and actual data development runs into many pain points and difficulties, including the following:
- Without a data development IDE to manage the whole development workflow, long-running processes cannot be managed.
- Without a standard data modeling system, different data engineers interpret the calculation caliber of the same indicator differently.
- Big data components demand a lot of expertise, so having ordinary business teams use HBase, ES, and other components directly causes all kinds of problems.
- Most companies' big data pipelines are very complex and involve many links, so it is hard to locate the responsible owner when problems occur.
- Data silos are hard to break: sharing data across teams and departments is difficult, and teams do not know what data the others have.
- Maintaining two computing models, batch and streaming, makes development difficult; a unified stream-batch SQL layer is needed.
- Without company-level metadata planning, the same data is hard to reuse across real-time and offline computing, and every development task has to rediscover it from scratch.
Most companies face the above problems and pain points in data platform governance and in offering data capabilities. Under a complex data architecture, any unclear link or unfriendly feature makes an already complicated pipeline even harder for data users. To resolve these pain points, every link must be polished carefully and the components above connected seamlessly, so that using data end to end becomes as simple for the business as writing a SQL query against a database.
5、 Overall design of an excellent big data architecture
The data platform should provide a variety of platforms and tools: a data collection platform for multiple data sources, a one-click data synchronization platform, a data quality and modeling platform, a metadata system, a unified data access platform, real-time and offline computing platforms, a resource scheduling platform, and a one-stop development IDE.
6、 Metadata – the cornerstone of the big data system
Metadata connects data sources, the data warehouse, and data applications, recording the complete lineage of data from production to consumption. Metadata includes static table, column, and partition information (i.e., the Metastore); dynamic task and table dependency mappings; data warehouse model definitions and data life cycles; and ETL task scheduling information with inputs and outputs. Metadata is the foundation of data management, data content, and data applications. For example, metadata can be used to build a data map across tasks, tables, columns, and users; to build the DAG of task dependencies and schedule execution order; to build task profiles for managing task quality; and to provide asset management and compute resource consumption overviews for individuals or business units.
The whole big data pipeline can be considered to be governed by metadata. Without a complete metadata design, problems follow: data is hard to trace, permissions are hard to control, resources are hard to manage, and data is hard to share.
Many companies rely on Hive to manage metadata, but I believe that at a certain stage of development it is still necessary to build one's own metadata platform to match the surrounding architecture.
For a practical metadata case study, see: https://www.jianshu.com/p/f60b2111e414
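As a toy illustration of why metadata matters, the following Python sketch registers tables together with their upstream dependencies and walks the lineage of a derived table. The table names and the in-memory store are hypothetical; a real metadata platform persists this information and tracks far more (partitions, owners, schedules, quality rules).

```python
from dataclasses import dataclass, field

@dataclass
class TableMeta:
    name: str
    columns: dict                                   # column name -> type
    owner: str
    upstream: list = field(default_factory=list)    # tables this one derives from

class MetadataStore:
    """Minimal in-memory metadata registry (illustrative only)."""
    def __init__(self):
        self.tables = {}

    def register(self, meta: TableMeta):
        self.tables[meta.name] = meta

    def lineage(self, table: str):
        """Walk upstream dependencies to trace where a table's data comes from."""
        seen, stack = [], [table]
        while stack:
            t = stack.pop()
            if t in seen:
                continue
            seen.append(t)
            stack.extend(self.tables[t].upstream if t in self.tables else [])
        return seen

store = MetadataStore()
store.register(TableMeta("ods_orders", {"order_id": "bigint"}, "ingest-team"))
store.register(TableMeta("dwd_orders", {"order_id": "bigint"}, "dw-team",
                         upstream=["ods_orders"]))
store.register(TableMeta("ads_gmv_daily", {"dt": "string", "gmv": "double"}, "bi-team",
                         upstream=["dwd_orders"]))

print(store.lineage("ads_gmv_daily"))  # ['ads_gmv_daily', 'dwd_orders', 'ods_orders']
```

With such a registry, "where did this number come from" and "who owns the upstream table" become queries rather than archaeology, which is exactly the traceability problem described above.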
7、 Unified stream-batch computing
Maintaining two computing engines, such as Spark for offline computing and Flink for real-time computing, creates great trouble for users, who must learn both stream computing and batch computing domain knowledge. Whether Spark or Hadoop is used offline and Flink in real time, a customized DSL can be developed to bridge the syntax of the different computing engines: upper-layer users need not pay attention to the execution details of the underlying layer, and mastering one DSL is enough to access Spark, Hadoop, Flink, and other computing engines.
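A minimal sketch of such a DSL in Python, under stated assumptions: the query shape and translation rules are hypothetical toys (real unified layers such as Flink SQL or Apache Beam are far richer). The same logical query is rendered once for a batch engine and once for a streaming engine that needs an explicit event-time window.

```python
def render(query: dict, engine: str) -> str:
    """Translate one logical query description into engine-specific SQL (toy rule)."""
    sql = f"SELECT {', '.join(query['select'])} FROM {query['from']}"
    if engine == "flink" and "window_minutes" in query:
        # Streaming execution needs an explicit event-time window; the batch
        # engine runs the same logical query over the full table instead.
        sql += (f" GROUP BY TUMBLE(event_time, "
                f"INTERVAL '{query['window_minutes']}' MINUTE), {query['group_by']}")
    elif "group_by" in query:
        sql += f" GROUP BY {query['group_by']}"
    return sql

q = {"select": ["user_id", "COUNT(*) AS pv"], "from": "clicks",
     "group_by": "user_id", "window_minutes": 5}
print(render(q, "spark"))  # SELECT user_id, COUNT(*) AS pv FROM clicks GROUP BY user_id
print(render(q, "flink"))
```

The point is the division of labor: the user writes `q` once, and only the translator layer knows that the streaming dialect needs `TUMBLE` windows while the batch dialect does not.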
8、 Real-time and offline ETL platform
ETL (extract, transform, load) describes the process of extracting data from a source, transforming it, and loading it into a destination. ETL is most commonly associated with data warehouses, but its targets are not limited to them. In general, an ETL platform plays an important role in data cleaning, data format conversion, data completion, and data quality management. As an important data-cleaning middle layer, an ETL platform should generally provide at least the following capabilities:
- Support for multiple data sources, such as message systems and file systems.
- Support for a variety of operators: filtering, splitting, transformation, output, completion by querying a data source, and so on.
- Support for dynamically changing logic; for example, the operators above can be submitted as dynamically loaded jars so that changes can be published continuously.
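The operator model above can be sketched as composable functions over a record stream. This is a toy Python illustration with hypothetical operators; a real ETL layer would process records as a stream and load operators from dynamically deployed jars or plugins.

```python
def op_filter(predicate):
    """Operator: keep only records matching the predicate."""
    def run(records):
        return [r for r in records if predicate(r)]
    return run

def op_map(fn):
    """Operator: transform each record."""
    def run(records):
        return [fn(r) for r in records]
    return run

def pipeline(records, operators):
    """Chain operators in order, feeding each one's output to the next."""
    for op in operators:
        records = op(records)
    return records

raw = [{"uid": 1, "event": "click"}, {"uid": None, "event": "view"}]
ops = [
    op_filter(lambda r: r["uid"] is not None),   # cleaning: drop incomplete records
    op_map(lambda r: {**r, "source": "app"}),    # completion: add a missing field
]
print(pipeline(raw, ops))  # [{'uid': 1, 'event': 'click', 'source': 'app'}]
```

Because each operator has the same signature, publishing a logic change means swapping one operator in the list rather than redeploying the whole job, which is the "dynamic change" capability the last bullet asks for.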
9、 Intelligent unified query platform
Most data queries are driven by requirements: one requirement leads to one or several interfaces being developed, documented, and opened to the business side. In a big data system, this mode has many problems:
- The architecture is simple, but the interface granularity is very coarse, flexibility is low, scalability is poor, and the reuse rate is low.
- Development efficiency is low; for a massive data system this causes a great deal of duplicated development, makes data and logic hard to reuse, and seriously degrades the experience of business consumers.
- Without a unified query platform, databases such as HBase are exposed directly to the business, which makes subsequent operation and maintenance of data permissions difficult. Accessing big data components directly is also painful for business consumers, and the slightest carelessness leads to all kinds of problems.
A unified intelligent query platform solves these big data query pain points.
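The idea behind such a platform can be sketched as a query facade. Everything here is hypothetical (the routing rule, the ACL shape, the backend callables); the point is that the business issues one logical query while routing and permission checks live in one place, so HBase, ES, and similar stores are never exposed directly.

```python
class UnifiedQuery:
    """Toy unified-query facade: central ACL check plus backend routing."""
    def __init__(self, backends, acl):
        self.backends = backends   # name -> callable that executes the query
        self.acl = acl             # user -> set of tables the user may read

    def query(self, user, table, key):
        if table not in self.acl.get(user, set()):
            raise PermissionError(f"{user} has no grant on {table}")
        # Toy routing rule: point lookups (kv_ prefix) go to the KV store,
        # everything else goes to the search engine.
        backend = "kv" if table.startswith("kv_") else "search"
        return self.backends[backend](table, key)

backends = {
    "kv": lambda t, k: {"table": t, "row": k, "engine": "hbase-like"},
    "search": lambda t, k: {"table": t, "hit": k, "engine": "es-like"},
}
uq = UnifiedQuery(backends, acl={"alice": {"kv_orders"}})
print(uq.query("alice", "kv_orders", "row-1"))
```

With this shape, adding a new storage engine or tightening permissions touches only the facade, not every business caller, which is exactly the operation-and-maintenance problem the bullets above describe.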
10、 Data warehouse modeling standards
As business complexity and data volume grow, chaotic data access and copying, resource waste from duplicated construction, ambiguity from inconsistently defined indicators, and an ever-rising barrier to data use all follow. Taking the business event tracking and warehouse usage the author has witnessed as an example: for the same product identifier, some table fields are named good_id, some spu_id, with many other variants, causing great trouble for anyone who wants to use the data. Without a complete big data modeling system, data governance suffers in the following ways:
- Data standards are inconsistent: even identically named indicators may have inconsistent calculation calibers. UV alone, for example, has more than a dozen definitions. The questions follow: they are all UV, so which one should I use? They are all UV, so why do the numbers differ?
- It incurs huge R&D costs: every engineer has to understand every detail of the development process from start to finish, and everyone steps into the same pits again, wasting engineers' time and energy. This is exactly the problem the author has encountered; extracting data in actual development was far too difficult.
- There is no unified standard management, leading to wasted resources such as duplicated computation, while unclear table layering and granularity lead to serious duplicated storage.
Therefore, big data development and data warehouse table design must follow clear design principles, and the data platform can be built to reject unreasonable designs, as in Alibaba's OneData system.
Readers who are interested can refer to Alibaba's OneData design system.
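One concrete way a platform can reject unreasonable designs is a naming-standard check at table-registration time. A minimal sketch, assuming a hypothetical synonym table that maps non-standard field names (like the good_id/spu_id example above) to their canonical forms:

```python
# Hypothetical synonym table: non-standard name -> canonical name.
CANONICAL = {"good_id": "spu_id", "goods_id": "spu_id", "uv_cnt": "uv"}

def check_columns(columns):
    """Return (bad_name, suggested_name) pairs for every non-standard column."""
    return [(c, CANONICAL[c]) for c in columns if c in CANONICAL]

print(check_columns(["spu_id", "good_id", "price"]))  # [('good_id', 'spu_id')]
```

Run as a gate in the table-creation workflow, such a check turns a naming convention from documentation into something the platform actually enforces.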
11、 One-click integration platform
Collecting all kinds of data into the data platform should take a single click, with the data transmission platform connecting the data seamlessly to the ETL platform. The ETL platform standardizes the schema definition by integrating with the metadata platform, then transforms and routes the data to the real-time and offline computing platforms. Any subsequent offline or real-time processing of the data then only requires applying for permission on the metadata table. Data collection supports a variety of sources, such as binlog, log collection, front-end event tracking, and Kafka message queues.
12、 Data development IDE – an efficient end-to-end tool
An efficient one-stop data development tool completes the development of real-time and offline computing tasks through an IDE and provides a single entry point to all the platforms above. The data development IDE offers full product services such as data integration, data development, data management, data quality, and data services; it is a one-stop development and management interface through which data transfer, transformation, and integration are completed. Data is ingested from different data stores, transformed and developed, and finally the processed data is synchronized to other data systems. With an efficient big data development IDE, big data engineers can shield users from all kinds of pain points, combine the platform capabilities above, and make big data development as simple as writing SQL.
For data development tools, refer to Alibaba Cloud's DataWorks.
Solving all the end-to-end difficulties also requires some other capabilities that are not covered here; interested readers can study them on their own.
A complete data system additionally includes an alarm and monitoring center, a resource scheduling center, compute resource isolation, data quality detection, and a one-stop data processing system, which are likewise beyond the scope of this article.
For more content, follow the vivo Internet Technology WeChat official account.
Note: for reprints, please contact WeChat: labs2020.