Alibaba big data practice: implementing the OneData model

Time:2021-7-13

How to turn specific requirements or projects into implementable solutions, and how to carry out requirements analysis, architecture design, and detailed model design, are the questions addressed during model implementation. This section first briefly introduces the model implementation processes commonly used in the industry, and then focuses on Alibaba's OneData model design theory and implementation process.

1. Model implementation processes commonly used in the industry

**Implementation process of the Kimball model**
Kimball dimensional modeling covers the whole process of requirements analysis, high-level model design, detailed model design, and model review.

The first stage is high-level design, which defines the scope of the business process dimensional model and provides a technical and functional description of each star schema. The second stage is detailed model design, which adds attributes and measures to each star schema. The third stage reviews, redesigns, and validates the model. The fourth stage produces the detailed design documents and hands them over for ETL design and development.

High-level model: the direct output of the high-level design stage is a high-level dimensional model diagram, a graphical description of the dimension tables and fact tables in the business process. At this stage, the dimension tables are identified, an initial attribute list is created, and proposed metrics are created for each fact table.

Detailed model: detailed dimensional modeling fills in the information missing from the high-level model, resolves design issues, and repeatedly tests whether the model meets the business requirements, to ensure the model is complete. It determines the attributes of each dimension table and the metrics of each fact table, the location and definition of their source data, and the preliminary business rules for how attributes and metrics are populated.
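
The two design stages above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class names, table names, and the source column `src_orders.amount` are hypothetical and do not come from Kimball's text. The high-level stage names the tables, the detailed stage fills in attributes and metric sources, and a simple check reports dimensions that a fact table references but that have not been defined yet.

```python
from dataclasses import dataclass, field


@dataclass
class DimensionTable:
    name: str
    # Attributes are left empty at the high-level stage and filled in later.
    attributes: list[str] = field(default_factory=list)


@dataclass
class FactTable:
    name: str
    dimensions: list[str]  # names of the dimension tables it references
    # Metric name -> source column, filled in during detailed design.
    metrics: dict[str, str] = field(default_factory=dict)


def unresolved_dimensions(fact: FactTable, dims: list[DimensionTable]) -> list[str]:
    """Return dimension names the fact table references but no table defines."""
    defined = {d.name for d in dims}
    return [name for name in fact.dimensions if name not in defined]


# High-level model: only table names and relationships (illustrative names).
dims = [DimensionTable("date"), DimensionTable("product")]
orders = FactTable("order_fact", dimensions=["date", "product", "buyer"])

# Detailed model: add attributes, metrics, and where they come from.
dims[1].attributes += ["product_id", "category_name"]
orders.metrics["order_amount"] = "src_orders.amount"

# A review step could flag gaps such as the missing "buyer" dimension.
print(unresolved_dimensions(orders, dims))
```

A check like this is one possible way to support the "constantly test whether the model is complete" step; real projects would typically rely on a modeling tool rather than hand-written code.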

Model review, redesign, and validation: in this stage the relevant people are brought together to review and validate the model, and the detailed dimensional model is redesigned according to the review results.

Submit for ETL design and development: finally, the detailed model design documents are completed and handed to the ETL developers, who design and develop the physical model during the ETL design and development phase.

The above is mainly drawn from The Data Warehouse Lifecycle Toolkit by Ralph Kimball et al. Please refer to the original work for details.

Implementation process of the Inmon model

The data model serves as an intelligent roadmap to the other parts of the data warehouse. Because building a data warehouse is not easy, a roadmap data model that describes how the parts of the data warehouse fit together is necessary to coordinate the work of different people and to serve different types of users.

Inmon divides the model into three levels: ERD (entity relationship diagram), DIS (data item set), and the physical model.

The ERD layer is the highest level of the data model and describes the entities or subject areas in the company's business and the relationships between them. The DIS layer is the middle level and describes the relationships among keys, attributes, and detailed data in the data model. The physical layer is the lowest level of data modeling and describes the physical characteristics of the data model.

For building the data warehouse model, a spiral development approach is recommended, in which multiple requirements are completed iteratively. However, a unified ERD model is needed to integrate the results of each iteration. The ERD model is a highly abstract data model that describes the complete data of an enterprise. Each iteration covers a subset of the ERD model and is implemented through the DIS and the physical data model.

The above is mainly drawn from Building the Data Warehouse by Inmon. Please refer to the original work for details.

Other model implementation processes

In practice, we often use the following division of data warehouse model levels. It has some similarities with the Kimball and Inmon implementation theories, but it does not prescribe how the models are expressed.

Business modeling produces the business model and mainly addresses decomposition and proceduralization at the business level.

Domain modeling produces the domain model and mainly abstracts the business model into a domain conceptual model.

Logical modeling produces the logical model and mainly expresses the conceptual entities of the domain model, and the relationships between them, as logical structures at the database level.

Physical modeling produces the physical model and mainly addresses how to physically implement the logical model on different relational databases, along with performance and other specific technical issues.

2. OneData implementation process

This section focuses on how to use the OneData system and its supporting tools to build the models of a big data system. The explanation is illustrated with Alibaba's actual business.

Guidelines

First, building a big data warehouse requires thorough business research and requirements analysis. This is the cornerstone of data warehouse construction, and whether it is done thoroughly directly determines whether the construction succeeds. Second, the overall data architecture is designed: the data is divided by data domain, and, following dimensional modeling theory, the bus matrix is built and the business processes and dimensions are abstracted. Third, the report requirements are abstracted into an index system, and the OneData tools are used to complete the index specification definition and the model design. Finally come code development and operations and maintenance. This article focuses on the steps up to and including physical model design.

Implementation workflow

Data research (business research): Alibaba Group covers e-commerce, digital entertainment, navigation (Gaode), mobile Internet services, and other fields, and each field contains a number of business lines. For example, the e-commerce field covers consumer-facing businesses (Taobao, Tmall, Tmall Global) and business-facing businesses (the Alibaba Chinese site, the international site, AliExpress). Should one data warehouse cover all business areas, or should each business area build its own? The business lines within a business area face the same question. Therefore, to build a big data warehouse, we need to understand what the business areas and business lines have in common and where they differ, which business modules each business line can be subdivided into, and what the specific business process of each module is. Whether the business research is thorough directly determines whether the data warehouse construction succeeds.

At Alibaba, data warehouses are generally built independently for each business area, while the business lines within a business area are built in a unified, centralized way because their businesses are similar and related.

Data research (requirements research): a data warehouse built from business research alone, without considering the data needs of analysts and business operators, is built in isolation from its users. Understanding the business of the business systems does not mean the warehouse can be implemented. The next task is to collect the needs of data users by asking analysts and business operators what data they need. At this point, most of the needs take the form of report requirements.

There are two ways to research requirements: one is to learn the requirements by communicating with analysts and business operators (by e-mail or IM); the other is to study and analyze the existing reports in the reporting system. Analyzing the collected requirements makes clear what data to produce. In many cases, the data warehouse team is driven by specific data requirements to learn about the business data in the business systems, so there is no strict order between business research and requirements research.

For example, analysts need to know the transaction amount by first-level category across the Taobao businesses (Taobao, Tmall, Tmall Global). Given this requirement, we need to work out what to summarize by (the dimension) and what to summarize (the measure). Here, category is the dimension and amount is the measure. We also need to decide how to design the detail data and the summary data, whether this is a common report, and whether the result should be materialized in a summary table or aggregated in the reporting tool.
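
The dimension and measure in this example can be made concrete with a minimal Python sketch. The rows, category names, and amounts are invented for illustration; the point is only that the summary groups detail rows by the dimension (category) and sums the measure (amount).

```python
from collections import defaultdict

# Hypothetical detail rows: (business unit, first-level category, transaction amount).
detail_rows = [
    ("Taobao", "Apparel", 120.0),
    ("Tmall", "Apparel", 80.0),
    ("Tmall Global", "Beauty", 50.0),
    ("Taobao", "Beauty", 30.0),
]


def summarize_by_category(rows):
    """Sum the measure (amount) for each value of the dimension (category)."""
    totals = defaultdict(float)
    for _unit, category, amount in rows:
        totals[category] += amount
    return dict(totals)


summary = summarize_by_category(detail_rows)
print(summary)  # {'Apparel': 200.0, 'Beauty': 80.0}
```

Whether such a summary is stored in a summary table or computed in the reporting tool is exactly the design question raised above; the sketch takes no position on it.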

Architecture design (data domain division): a data domain is a collection of business processes or dimensions abstracted for business analysis. A business process can be summarized as an indivisible behavioral event, such as placing an order, paying, or refunding. To keep the whole system viable, data domains need to be abstracted and then maintained and updated over the long term, but they should not be changed lightly. The division should cover all current business requirements, and when new business arrives it should either fit into an existing data domain or be accommodated by adding a new one.

Architecture design (building the bus matrix): after thorough business research and requirements research, the bus matrix needs to be built. Two things need to be done: identify the business processes under each data domain, and identify which dimensions relate to each business process, so that the business processes and dimensions under every data domain are defined.
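
A bus matrix can be sketched as a nested mapping from data domain to business process to the set of related dimensions. The sketch below is only an illustration: the domain names (`trade`, `logistics`), the `shipment` process, and all dimension names are assumptions, while order, payment, and refund are the business process examples mentioned earlier.

```python
# Hypothetical bus matrix: data domain -> business process -> related dimensions.
bus_matrix = {
    "trade": {
        "order": {"buyer", "seller", "product", "date"},
        "payment": {"buyer", "seller", "date"},
        "refund": {"buyer", "seller", "product", "date"},
    },
    "logistics": {
        "shipment": {"seller", "product", "date", "carrier"},
    },
}


def processes_using(dimension, matrix):
    """List (domain, process) pairs whose matrix row marks the given dimension."""
    return sorted(
        (domain, process)
        for domain, processes in matrix.items()
        for process, dims in processes.items()
        if dimension in dims
    )


def shared_dimensions(matrix):
    """Return the dimensions that appear in every business process."""
    rows = [dims for processes in matrix.values() for dims in processes.values()]
    return set.intersection(*rows) if rows else set()


print(processes_using("product", bus_matrix))
print(shared_dimensions(bus_matrix))
```

Reading the matrix by column, as `processes_using` does, shows which business processes share a dimension, which is useful when deciding which dimensions must be kept consistent across domains.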

Specification definition: the specification definition mainly defines the index system, including atomic indexes, modifiers, time periods, and derived indexes.
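
The relationship between these four elements can be sketched in Python. This section does not spell out the composition rule, so the sketch assumes that a derived index combines one atomic index, one time period, and zero or more modifiers. The class names, the example values (`pay_amt`, `1d`, `tmall`), and the underscore naming scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicIndex:
    name: str         # short identifier, e.g. "pay_amt"
    description: str  # business meaning of the measure


@dataclass(frozen=True)
class DerivedIndex:
    atomic: AtomicIndex               # the measure being qualified
    time_period: str                  # e.g. "1d" for the most recent day
    modifiers: tuple[str, ...] = ()   # optional business qualifiers

    @property
    def name(self) -> str:
        # Hypothetical naming scheme: modifiers, then atomic index, then time period.
        return "_".join([*self.modifiers, self.atomic.name, self.time_period])


pay_amt = AtomicIndex("pay_amt", "payment amount")
tmall_pay_amt_1d = DerivedIndex(pay_amt, time_period="1d", modifiers=("tmall",))

print(tmall_pay_amt_1d.name)  # tmall_pay_amt_1d
```

Deriving the name from the parts, instead of typing it by hand, is one way to keep index names consistent across teams; the actual OneData tooling may enforce this differently.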

Model design: model design mainly includes the specification definition of dimensions and attributes, and the design of dimension tables, detail fact tables, and summary fact tables. Please refer to the following chapters for a detailed explanation of the relevant practice.

Conclusion: implementing OneData is a highly iterative and dynamic process that generally follows a spiral approach. Once the overall architecture design is complete, model design and review proceed iteratively, data domain by data domain. Throughout implementation, in architecture design, specification definition, and model design alike, review mechanisms are introduced to ensure that the process is carried out correctly.
Note: some proper nouns, technical terms, product names, software project names, tool names, and so on in this book are terms used for internal projects of Taobao (China) Software Co., Ltd. Any similarity to third-party names is coincidental.

This article is the original content of Alibaba cloud and cannot be reproduced without permission.
