Nanny level tutorial on data warehouse construction, offline and real-time (Theory + Practice)



1、 Basic concepts of data warehouse

1. Data warehouse architecture

Before we talk about data warehouses, let's first build some intuition: what is a data warehouse "architecture"? There has never been a single precise answer to this question. Here we quote one passage: in the software industry, a generally accepted definition holds that an architecture is one or more structures of a system, comprising the software components (their design and implementation), the externally visible properties of those components, and the relationships among them.

Following this definition, we understand the data warehouse architecture as the components that make up a data warehouse and the relationships between them, and draw the following architecture diagram:


Data warehouse architecture

The data warehouse environment shown above consists of two parts: the operational systems and the data warehouse system. The data in the operational systems comes from various forms of business data, which enters the data warehouse system through the process of extraction, transformation, and loading (ETL).

Everything grows more complete with the passage of time, and of course more complex, and the data warehouse is no exception. During the evolution of data warehouse technology, several main architectural approaches have emerged: independent data marts, the Inmon enterprise information factory architecture, the Kimball data warehouse architecture, and the hybrid data warehouse architecture. We will discuss these architectures later. First, let's look at the basic concept of a data warehouse.

2. Data warehouse concept

The English name is data warehouse, often abbreviated DW or DWH. The purpose of a data warehouse is to build an analysis-oriented, integrated data environment that provides decision support for the enterprise. It exists for analytical reporting and decision support.

The data warehouse itself does not "produce" any data, nor does it "consume" any data: the data comes from outside and is opened up to external applications, which is why it is called a "warehouse" rather than a "factory".

1) Basic features

A data warehouse is a subject-oriented, integrated, non-volatile, and time-variant collection of data in support of management's decision-making.

Subject-oriented:

The biggest feature of a traditional database is application-oriented data organization, and each business system may be isolated from the others. A data warehouse is subject-oriented. A subject is an abstract concept: it is a higher-level abstraction used to synthesize, classify, analyze, and exploit the data in an enterprise information system. Logically, it corresponds to the analysis objects involved in some macro-level field of analysis within the enterprise.


Integrated:

The data in the data warehouse is obtained by extracting, cleaning, transforming, and summarizing the scattered, independent, heterogeneous data of the source databases, which guarantees that the data in the warehouse is consistent across the whole enterprise.

The integrated data in the data warehouse cannot be obtained directly from the original database systems. Therefore, before entering the warehouse, the data must be unified and integrated. This is the most critical and complex step in building a data warehouse. The work to be done includes:

Resolving all contradictions in the source data, for example fields with the same name but different meanings, fields with different names but the same meaning, inconsistent units, inconsistent field lengths, and so on.

Data synthesis and computation. This aggregation work can be done while extracting data from the original databases, but much of it is done inside the data warehouse, that is, after the data has entered the warehouse.
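The unification step above can be sketched in a few lines of Python. The source records, field names, and unit conversions here are hypothetical examples, not part of any real system:

```python
# Two hypothetical source systems describe the same customers differently:
# system A uses "cust_name" and amounts in cents; system B uses "customer"
# and amounts in dollars. Unification renames fields and normalizes units.
SOURCE_A = [{"cust_name": "alice", "amount_cents": 1250}]
SOURCE_B = [{"customer": "Bob", "amount_usd": 99.5}]

def unify(record, mapping, to_dollars):
    """Rename fields to the warehouse standard and normalize units."""
    out = {std: record[src] for src, std in mapping.items()}
    out["amount_usd"] = round(to_dollars(out["amount_usd"]), 2)
    out["customer_name"] = out["customer_name"].title()  # unify casing
    return out

unified = (
    [unify(r, {"cust_name": "customer_name", "amount_cents": "amount_usd"},
           lambda cents: cents / 100) for r in SOURCE_A]
    + [unify(r, {"customer": "customer_name", "amount_usd": "amount_usd"},
             lambda d: d) for r in SOURCE_B]
)
# Both records now share one schema: customer_name plus amount_usd.
```

After this step, both sources conform to a single warehouse schema and can be loaded together.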

The following figure illustrates a simple integration process for an insurance company, in which the data related to the "insurance" subject in the warehouse comes from multiple different operational systems. These systems may name their internal data differently and use different data formats; such inconsistencies must be removed before data from the different sources is stored into the warehouse.


Data warehouse subject

Non-volatile (non-updatable):

The data in a data warehouse reflects the content of historical data over a fairly long period of time. It is a collection of database snapshots taken at different points in time, together with the derived data exported from those snapshots for statistics, synthesis, and reorganization.

Non-volatility is mainly about how the data is used. Data warehouse users mostly query the data or perform more complex mining on it; once data enters the warehouse, it is generally retained for a long time. A data warehouse sees a great many query operations but very few modifications or deletions. Therefore, after being processed and integrated into the warehouse, the data is rarely updated; usually it only needs to be loaded and refreshed on a regular schedule.
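The "collection of snapshots" idea can be sketched as follows. The account names and balances are invented for illustration; the point is that the warehouse appends dated snapshot rows instead of updating a value in place:

```python
# A warehouse-style periodic snapshot: each load appends rows tagged
# with an as-of date; nothing is overwritten, so history is preserved.
snapshots = []

def load_snapshot(as_of, balances):
    """Append one dated snapshot row per account."""
    snapshots.extend({"as_of": as_of, "account": a, "balance": b}
                     for a, b in balances.items())

load_snapshot("2023-01-31", {"acct1": 100})
load_snapshot("2023-02-28", {"acct1": 140})

# Both point-in-time balances remain queryable afterwards.
history = [s["balance"] for s in snapshots if s["account"] == "acct1"]
```

An operational system would instead update acct1's single balance row to 140, losing the January figure.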

Time-variant:

A data warehouse contains historical data at various granularities. The data may be associated with a particular day, week, month, quarter, or year. The purpose of the warehouse is to uncover hidden patterns by analyzing how the business has operated over some past period. Although users cannot modify the data, that does not mean the data never changes: analysis results can only reflect the past, and when the business changes, the mined patterns lose their timeliness. The warehouse data therefore needs to be updated to keep meeting the needs of decision-making. From this perspective, building a data warehouse is not just a project but an ongoing process. The data changes with time in the following ways:

(1) The time horizon of data warehouse data is generally much longer than that of operational data.

(2) Operational systems store current data, while the data warehouse stores historical data.

(3) Data is appended to the warehouse in chronological order, and every record carries a time attribute.

3. Why is there a data warehouse

Let's look at where the data in a data warehouse comes from and where it eventually goes.

Usually the data comes from various business application systems. Business-system data takes many forms: structured data in relational databases such as Oracle, MySQL, and SQL Server; flat files such as text and CSV; Word and Excel documents; or semi-structured data such as HTML and XML. After a series of extraction, transformation, and cleaning steps, this business data is finally loaded into the data warehouse in a unified format. The warehouse data then serves as the data source for analysis, supplied to downstream ad hoc queries, analysis systems, data marts, reporting systems, data mining systems, and so on.

At this point you may wonder: why can't the business systems' data be used directly for ad hoc queries, analysis systems, and reports? Why go through a data warehouse at all? In fact, before data warehouses appeared, people did exactly that, but many data analysis pioneers found that this simple "direct access" approach rarely worked well, with countless failed cases. Here are some of the reasons direct access to business systems does not work:

Some business data cannot be accessed directly due to security or other factors.

The version of the business system changes frequently. Each change requires rewriting the analysis system and retesting.

It is difficult to establish and maintain reports with summary data from multiple business system versions.

The column names of business systems are usually hard coded, sometimes just meaningless strings, which makes it more difficult to write analysis systems.

The data format of business system, such as date and number, is not unified.

The table structure of business system is optimized for transaction processing performance, and sometimes it is not suitable for query and analysis.

There is no appropriate way to merge valuable data into the database of a specific application.

There is no appropriate location to store metadata.

The display data fields that users need to see sometimes do not exist in the database.

Generally, the priority of transaction processing is higher than that of the analysis system, so if the analysis system and transaction processing run on the same hardware, the performance of the analysis system is often very poor.

There is a risk of misuse of business data.

It is very likely to affect the performance of the business system.

Although establishing an independent data warehouse requires additional investment in software and hardware, compared with accessing business data directly it is worthwhile in terms of cost and benefit. And as processor and storage costs fall year by year, the data warehouse approach becomes ever more clearly advantageous and economically feasible.

4. Difference between data warehouse and database

The difference between database and data warehouse is actually the difference between OLTP and OLAP.

Operational processing, known as on-line transaction processing (OLTP), can also be called a transaction-oriented processing system. It handles the day-to-day online operations of a specific business against the database, usually querying or modifying a small number of records. Users care most about response time, data security, integrity, and the number of concurrent users. As the main means of data management, traditional database systems serve primarily operational processing; relational databases such as MySQL and Oracle generally belong to the OLTP category.

Analytical processing, called on-line analytical processing (OLAP), generally analyzes the historical data of some topics to support management decisions.

First of all, understand that the data warehouse did not appear in order to replace the database. The database is designed for transactions, the data warehouse for subjects; databases generally store business data, while data warehouses generally store historical data.

Database design tries to avoid redundancy as much as possible and is generally built for one business application; for example, a simple user table records the user name, password, and other simple data, which suits the business application but not analysis. Data warehouse design deliberately introduces redundancy and is driven by analysis requirements, analysis dimensions, and analysis indicators.

The database is designed to capture data; the data warehouse is designed to analyze data.

Take banking as an example. The database is the data platform of the transaction system: every transaction a customer makes at the bank is written to the database and recorded, so you can simply think of the database as doing the bookkeeping. The data warehouse is the data platform of the analysis system: it obtains data from the transaction system, summarizes and processes it, and provides a basis for decision-makers, for example how many transactions a given branch handled in a month and what that branch's current deposit balance is. If deposits are high and consumer transactions frequent, it may be worth installing ATMs in that area.

Obviously the bank's transaction volume is huge, usually measured in the millions or even tens of millions. The transaction system is real-time and demands timeliness; customers cannot tolerate waiting tens of seconds to deposit money, which in turn requires the database to keep only a short period of data. The analysis system works after the fact and must supply all the valid data for the period of interest. That data is massive and summarizing it is slower, but as long as it can provide effective analysis data, the goal is achieved.
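The banking contrast above can be illustrated with a toy SQLite example. The `transactions` table, branch names, and amounts are invented for illustration: the OLTP side writes one small row per transaction, while the OLAP side asks an aggregate question over a month of history:

```python
# OLTP vs OLAP on the same (toy) data, using an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (branch TEXT, month TEXT, amount REAL)")

# OLTP: each customer transaction is a single small write.
rows = [("North", "2023-01", 100.0), ("North", "2023-01", 250.0),
        ("South", "2023-01", 80.0)]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

# OLAP: summarize a month of history per branch for decision-makers.
summary = dict(conn.execute(
    "SELECT branch, SUM(amount) FROM transactions "
    "WHERE month = '2023-01' GROUP BY branch").fetchall())
```

In practice the OLTP writes and the OLAP query would run against separate systems precisely for the performance reasons described above.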

The data warehouse arose from the need to further mine data resources and support decision-making once databases already existed in large numbers. It is by no means a so-called "big database".

5. Data warehouse layered architecture

By the flow of data into and out of the warehouse, the architecture can be divided into three layers: source data, data warehouse, and application data.


The warehouse's data comes from different sources and supports a variety of data applications. Data flows into the warehouse from the bottom up and is opened to the upper-layer applications; the warehouse itself is simply a platform for managing the integrated data in the middle.

Source data layer: the data in this layer is unchanged; it mirrors the data structures and data of the peripheral systems directly and is not opened to the outside. It is the temporary staging layer, a staging area for interface data that prepares for the next step of processing.

Data warehouse layer: also called the detail layer. The data in the DW layer should be consistent, accurate, and clean, that is, the source-system data after cleaning (removing impurities).

Application data layer: the data source read directly by front-end applications; data computed and generated according to the needs of reports and ad hoc analysis.

The process by which the warehouse obtains data from the various sources, and by which data is transformed and flows within the warehouse, can be regarded as ETL (Extract, Transform, Load). ETL is the assembly line of the data warehouse, and can also be seen as its bloodstream: it maintains the metabolism of the data, and most of the effort of day-to-day warehouse management and maintenance goes into keeping the ETL running normally and stably.
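The Extract-Transform-Load flow can be sketched as three tiny functions. The records and the cleaning rules are hypothetical, and the list `warehouse_dw` stands in for the detail (DW) layer:

```python
# A bare-bones ETL skeleton: extract raw records, clean them as the DW
# layer requires, then load them into the target.
def extract():
    # In practice this would read from source databases or files.
    return [{"id": 1, "name": " alice "}, {"id": 2, "name": "BOB"}]

def transform(records):
    # Cleaning: trim whitespace and unify casing.
    return [{**r, "name": r["name"].strip().title()} for r in records]

def load(records, target):
    target.extend(records)

warehouse_dw = []  # stands in for the detail layer
load(transform(extract()), warehouse_dw)
```

Real ETL adds scheduling, incremental loads, and error handling around exactly this three-stage shape.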

So why should a data warehouse be layered?

Trading space for time: extensive preprocessing improves the user experience (efficiency) of the application systems, which is why a data warehouse holds a lot of redundant data. Without layering, a change in the business rules of a source system would affect the entire data cleaning process, and the rework would be enormous.

Layered data management simplifies the data cleaning process, because splitting the original work into multiple steps turns one complex task into many simple ones, turning a big black box into a white box. The processing logic of each layer is relatively simple and easy to understand, making it easy to guarantee the correctness of each step; when the data is wrong, we often only need to adjust one step locally.

6. Main data warehouse architecture

With the content above we have a general understanding of the data warehouse concept. Next, let's look at several architectures that emerged as data warehouses evolved.

1. Data mart architecture

A data mart is a collection of data organized by subject area, used to support department-level decision-making. There are two types of data marts: independent data marts and dependent data marts.

1) Independent data mart

The independent data mart focuses on a single subject area of concern to one department; the data is deployed department by department, without considering enterprise-level information sharing and integration. For example, manufacturing, human resources, and other departments each have their own data marts.

Advantages: because a single department's business is simpler and its data volume far smaller than the whole enterprise's, an independent departmental data mart has a short cycle and delivers results quickly.


Disadvantages:

From a business perspective, when a department's analysis needs expand, or data must be analyzed across departments or subject areas, the independent data mart becomes inadequate.

When the data is ambiguous, for example when departments A and B define the same product differently, information cannot be compared across departments.

Each department uses different technologies, builds different ETL processes, and deals with different transaction systems; multiple independent data marts will have overlapping and intersecting data, and even inconsistent data.

2) Dependent data mart

The data of a dependent data mart comes from the data warehouse: data in the warehouse is integrated, restructured, and summarized, then delivered to the dependent data mart.

The main benefits of establishing a dependent data mart are:

Performance: when the data warehouse develops query performance problems, you can consider building several dependent data marts and moving queries from the warehouse to the marts.

Security: each department has complete control over its own data.

Data consistency: because every data mart is sourced from the same data warehouse, data inconsistency is effectively eliminated.

2. Inmon enterprise information factory architecture

The first two components in the figure above need no special introduction, so we start directly with the third.

Enterprise data warehouse: the core component of the architecture. As Inmon defines it, the enterprise data warehouse is an integrated repository of detail data. Data is captured at the lowest level of granularity and stored in a relational database designed to third normal form.

Departmental data mart: a department-level, subject-oriented view of the data, obtained from the enterprise data warehouse; data may be aggregated as it enters the departmental mart. The marts are designed with multidimensional models for data analysis. Importantly, all reporting tools, BI tools, and other analysis applications query the data marts rather than the enterprise data warehouse directly.

3. Kimball data warehouse architecture

Compared with the previous figure, the main difference between Kimball and Inmon lies in the design and construction of the core data warehouse.

Kimball's data warehouse contains the enterprise's fine-grained (atomic) data and is designed with multidimensional models, which means the warehouse is composed of star-schema fact tables and dimension tables. Analysis systems and reporting tools can access the data in the multidimensional warehouse directly.

The data mart in this architecture also differs from Inmon's: here the data mart is a logical concept, merely a partitioning of the multidimensional warehouse by subject area. It has no physical storage of its own; you could call it a virtual data mart.

4. Hybrid data warehouse architecture

The so-called hybrid architecture refers to a data warehouse environment in which the Inmon and Kimball architectures are used together.

As the architecture diagram shows, this architecture replaces the data mart portion of the Inmon approach with a multidimensional data warehouse, and the data marts become logical views over that multidimensional warehouse.

The benefits of this architecture: the normalized design eliminates data redundancy and guarantees sufficiently fine data granularity, while the multidimensional structure lets reporting and analysis be done more flexibly at the enterprise level.

7. Metadata management of data warehouse

Metadata mainly records the definitions of the models in the warehouse, the mappings between the layers, the state of the data being monitored in the warehouse, and the running state of the ETL tasks. Metadata is generally stored and managed centrally in a metadata repository. Its main purpose is to keep the design, deployment, operation, and management of the data warehouse coordinated and consistent.

Metadata is an important part of data warehouse management system. Metadata management is the key component of enterprise data warehouse. It runs through the whole process of data warehouse construction and directly affects the construction, use and maintenance of data warehouse.

One of the main steps of building a data warehouse is ETL. At this time, metadata will play an important role. It defines the mapping from the source data system to the data warehouse, the rules of data conversion, the logical structure of the data warehouse, the rules of data update, data import history and loading cycle. Experts in data extraction and transformation and data warehouse administrators build data warehouses efficiently through metadata.

When using data warehouse, users access data through metadata to clarify the meaning of data items and customize reports.

As the data warehouse grows in scale and complexity, it is inseparable from correct metadata management, which includes adding and removing external data sources, changing data cleaning methods, controlling erroneous queries, and scheduling backups.

Metadata can be divided into technical metadata and business metadata. Technical metadata serves the IT staff who develop and manage the warehouse; it describes the data involved in developing, managing, and maintaining the warehouse, including data source information, data transformation descriptions, the warehouse model, data cleaning and refresh rules, data mappings, access permissions, and so on. Business metadata serves managers and business analysts; it describes the data from a business perspective, including business terminology, what data the warehouse contains, where the data is located, and the availability of the data, helping business staff better understand what data is available in the warehouse and how to use it.

The metadata repository is not only the basis of metadata extraction for the whole system, but also the point through which the data warehouse stays loosely connected to its metadata sources.

8. Analysis of common terms of data warehouse

The structure of this section is shown in the figure below:

1. Explanation of warehouse terms

1. Entities

An entity is the subject that attributes attach to, an object of our analysis. For example, when analyzing commodity sales, such as Huawei phone sales over the last six months, the Huawei phone is an entity; when analyzing user activity, the user is an entity. Things that do not physically exist, such as virtual business objects, promotions, and memberships, can also be treated as entities.

Entities exist for the sake of business analysis: as a filtering dimension of analysis, an entity has its own attributes and has analytical value.

2. Dimensions

A dimension is a perspective from which to analyze business data: whatever angle you look from, you can build a corresponding dimension. A dimension is thus a quantity used to analyze data. For example, to analyze product sales you might analyze by commodity category; that constitutes a dimension, and gathering all commodity categories together forms a dimension table.

3. Measures

A measure is a numeric value at a business process node, such as sales quantity, price, or cost.

The measures in a fact table fall into three categories: fully additive, semi-additive, and non-additive.

Fully additive measures are the most flexible and useful, such as sales amount and sales quantity, which can be summed over any dimension;

Semi-additive measures can be summed over some dimensions but not all. A balance is a common semi-additive measure: it can be added across every dimension except time;

The rest are non-additive, such as ratios. For a non-additive measure, a good approach is to store its fully additive components wherever possible, summarize those components into the final result set, and only then compute the final non-additive fact.
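The rule for non-additive measures can be shown with a small example. The regions, order counts, and payer counts are invented; the point is that a conversion rate must be computed from summed components, not by summing or averaging per-region rates:

```python
# Store the additive components (orders, payers); compute the
# non-additive ratio only after summing them.
facts = [
    {"region": "East", "orders": 200, "payers": 40},
    {"region": "West", "orders": 100, "payers": 30},
]

# 'orders' and 'payers' are fully additive: summing across regions is valid.
total_orders = sum(f["orders"] for f in facts)
total_payers = sum(f["payers"] for f in facts)

# The conversion rate is non-additive: averaging the per-region rates
# (0.20 and 0.30 -> 0.25) would be wrong; computing it from the summed
# components gives the true 70/300.
conversion_rate = total_payers / total_orders
```

This is exactly why fact tables carry the components rather than pre-computed ratios.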

4. Granularity

Granularity is the unit of measurement of a business process, for example whether goods are recorded per piece or per batch.

In data warehouse construction, when we say a fact table is at user granularity, each row of the table is one user, with no duplicate users; if a table is at sales-record granularity, each row is one sales record.

Choosing the appropriate granularity level is a key part of data warehouse construction. When designing data granularity, we usually need to weigh the following factors:

The types of analysis to be supported, the minimum acceptable data granularity, and the volume of data that can be stored;

The higher (coarser) the granularity level, the less detailed the analysis the warehouse can support;

If storage resources are constrained, only a coarser data granularity can be used;

The granularity strategy must ensure that the chosen granularity truly meets users' decision-analysis needs; this is the most important criterion in any granularity strategy.

5. Caliber

Caliber is the data-fetching logic (how the number is obtained). For example, if the number to fetch is the average height of boys among children under the age of 10, that specification is the statistical caliber.

6. Indicators

An indicator is the measured value under a caliber, that is, the final result. For example, order volume over the last seven days, or the purchase conversion rate of a promotion.

In terms of calculation and implementation, an indicator is mainly composed of the following parts:

Processing logic, such as COUNT, SUM, AVG;

Dimensions, such as computing the indicator by department or by region, corresponding to GROUP BY in SQL;

Business qualifiers/modifiers, such as computing the same indicator per payment channel: the refund rate of orders paid via WeChat versus orders paid via Alipay. This corresponds to WHERE in SQL.
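The three parts map directly onto a query. In this sketch the `orders` table and its columns are hypothetical: COUNT is the processing logic, GROUP BY department is the dimension, and the WHERE clause on payment channel is the business qualifier:

```python
# Computing an indicator (WeChat-paid order count per department)
# from its three parts, using an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (department TEXT, pay_channel TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    ("toys", "wechat"), ("toys", "alipay"),
    ("books", "wechat"), ("toys", "wechat"),
])

wechat_orders = dict(conn.execute(
    "SELECT department, COUNT(*) FROM orders "
    "WHERE pay_channel = 'wechat' "      # business qualifier / modifier
    "GROUP BY department").fetchall())   # dimension
```

Swapping the qualifier to `pay_channel = 'alipay'` would yield the sibling indicator for the other channel.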

Moreover, indicators can themselves derive further indicators. Based on these characteristics, indicators can be classified:

Atomic indicators: basic business facts with no business qualifiers and no dimensions attached. For example, the order count and total order amount in an orders table are atomic indicators;

The indicators business stakeholders care about most are those with real business meaning whose data can be obtained directly, for example a store's order payment amount over the last day. Such a derived indicator is shown to merchants directly in the product.

However, this indicator cannot be fetched directly from the warehouse's unified middle layer (there is no ready-made fact field; the warehouse generally provides large wide tables). A bridge is needed between the warehouse middle layer and the business side's indicator requirements, hence derived indicators.

Derived indicator = dimension + modifier + atomic indicator. In "the store's order payment amount over the last 1 day", the store is a dimension, "last 1 day" is a time-type modifier, and the payment amount is an atomic indicator;

Dimension: the angle from which the indicators are observed;

Modifier: one or more values of a dimension. For example, under the gender dimension, male and female are two modifiers.

Derivative indicators: for example, a promotion's conversion rate is a derivative indicator, because it must be computed from the number of promotion exposures and the number of promotion orders.

7. Label

Tags are highly refined feature marks for a target object, defined artificially according to the needs of a business scenario and obtained by applying certain algorithms. A tag is thus the result of artificial reprocessing, such as "internet celebrity", "rich and beautiful", or "loli". Ambiguous tags can be disambiguated internally: for "apple", we can define apple-the-fruit as a fruit and Apple-the-phone as a mobile phone.

8. Natural key

A key composed of attributes that already exist in the real world. It is unique within the business and carries business meaning, such as a commodity ID or employee ID.

From the warehouse's perspective, identifiers coming from the business systems are natural keys, such as the employee number in the business database.

9. Persistent key

A key that remains permanent and never changes, sometimes called a durable supernatural key. For example, an ID-card number is a persistent key.

The difference between a natural key and a persistent key, by example: if a company employee resigns and later rejoins, the natural key, the employee number, changes, but the persistent key, the ID-card number, does not.

10. Surrogate key

A key with no business meaning. Surrogate keys go by many other names: meaningless keys, integer keys, non-natural keys, artificial keys, synthetic keys, and so on.

Surrogate keys are simply integers generated in sequence: if the surrogate key of the first row of the product dimension is 1, the next row's is 2, and so on. The surrogate key's only purpose is to connect the dimension table and the fact table.
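Surrogate-key assignment can be sketched in a few lines. The SKU codes and quantities are invented: the dimension loader hands out sequential integers with no business meaning, and fact rows store the surrogate key instead of the natural key:

```python
# Assign sequential surrogate keys to a product dimension and use them
# in fact rows; re-seen natural keys reuse the existing surrogate key.
product_dim = {}  # natural key -> surrogate key
next_key = 1

def surrogate(natural_key):
    """Return the existing surrogate key or assign the next integer."""
    global next_key
    if natural_key not in product_dim:
        product_dim[natural_key] = next_key
        next_key += 1
    return product_dim[natural_key]

fact_rows = [{"product_sk": surrogate(nk), "qty": q}
             for nk, q in [("SKU-9", 3), ("SKU-2", 1), ("SKU-9", 5)]]
```

Because the fact table never stores "SKU-9" itself, the business system can later rename or reuse its codes without breaking the warehouse's joins.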

11. Degenerate dimension

A degenerate dimension looks like a dimension key of the fact table but has no corresponding dimension table; the dimension attribute is stored in the fact table itself. Dimension columns stored in the fact table are called degenerate dimensions. Like dimensions stored in dimension tables, degenerate dimensions can be used to filter and query the fact table and to perform aggregations.

So what qualifies as a degenerate dimension? Take the order ID: it is a very large dimension that does not need its own dimension table, yet we need it constantly when querying or filtering data, so it is kept redundantly in the fact table; that is a degenerate dimension. A city code may also be stored redundantly in the fact table, but since it has a corresponding dimension table, it is not a degenerate dimension.

12. Drill down

This is a common concept in data analysis. Drilling down can be understood as adding levels of a dimension so as to observe the data from coarse granularity to fine granularity. For example, when analyzing product sales, you can move along the time dimension from year to month to day to observe ever finer-grained data: drilling down from the year level to the month level and then to the day level.

13. Roll up

Once you understand drill-down, roll-up is easy: they are opposite operations. Rolling up can be understood as removing some levels of a dimension, moving from fine granularity to coarse granularity, or aggregating and summarizing data upward along the dimension hierarchy.
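The two opposite operations can be sketched along the time dimension. The sales figures are invented; the same facts are summarized at month grain (the drilled-down view) and rolled up to year grain by dropping the month level:

```python
# Roll-up and drill-down along the time hierarchy: month is the finer
# grain, year the coarser one obtained by dropping the month level.
from collections import defaultdict

sales = [("2023-01", 10), ("2023-02", 5), ("2024-01", 7)]

by_month = defaultdict(int)  # finer grain: drilled down to month
by_year = defaultdict(int)   # coarser grain: rolled up to year
for month, qty in sales:
    by_month[month] += qty
    by_year[month[:4]] += qty  # keeping only the year = roll-up
```

OLAP tools do exactly this, except the hierarchy (year > month > day) is declared in the cube definition rather than in code.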

14. Data mart

A data mart, also called a data market, satisfies the needs of a specific department or user group. Its data is stored multidimensionally, which includes defining the dimensions, the indicators to be computed, and the dimension hierarchies, generating data cubes for decision analysis. It is essentially a small subset extracted from the data warehouse.

2. Relationship between warehouse nouns

1. Relationship among entity table, fact table and dimension table

Kimball's dimensional modeling has dimensions and facts; Inmon's normalized modeling has entities and relationships. Taken separately, the concepts of the two modeling approaches are easy enough to understand, but many hybrid approaches combine them, and then the concepts become easy to confuse, especially fact tables versus entity tables. How do they differ and relate? Let's first look at their respective definitions:

Dimension table: the dimension table can be regarded as a window for users to analyze a fact. The data in it should describe all aspects of the fact, such as time dimension table and regional dimension table. The dimension table is an analysis angle of the fact table.

Fact table: fact table is to determine a fact through the combination of various dimensions and some indicator values. For example, through the time dimension and regional organization dimension, the indicator value can determine the fact of some indicator values at a certain time and place. Each piece of data in the fact table is obtained by the intersection of the data and indicator values of several dimension tables.

Entity table: an entity table is a table of actual objects. The data placed in an entity table must describe concrete objects one by one. For example, various commodities exist objectively, so they can be designed as an entity table. The entity table only describes the things themselves and contains no concrete facts, so some people call the entity table a fact table without facts.

For example, a mobile phone mall sells Apple phones, Huawei phones and other brands and models. These data can form a mobile phone entity table, but there is no measurable data in that table. One day Apple sold 15 phones and Huawei sold 20; these sales figures are facts and form a fact table. You can then use the date dimension table and the regional dimension table to analyze the fact table from various angles.
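The phone-mall example can be sketched with Python's built-in sqlite3 module (table and column names are illustrative): the entity/dimension table holds no measures, the fact table holds the measurable sales, and the two join for analysis.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Entity/dimension table: the phones exist objectively, no measures inside.
cur.execute("CREATE TABLE dim_phone (phone_id INTEGER PRIMARY KEY, brand TEXT)")
cur.executemany("INSERT INTO dim_phone VALUES (?, ?)", [(1, "Apple"), (2, "Huawei")])
# Fact table: the day's sales are measurable events.
cur.execute("CREATE TABLE fact_sales (phone_id INTEGER, sale_date TEXT, units INTEGER)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                [(1, "2021-03-18", 15), (2, "2021-03-18", 20)])
# Analyze the facts through the dimension: units sold per brand.
result = dict(cur.execute(
    "SELECT d.brand, SUM(f.units) FROM fact_sales f "
    "JOIN dim_phone d ON f.phone_id = d.phone_id GROUP BY d.brand"
).fetchall())
print(result)  # {'Apple': 15, 'Huawei': 20}
```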

2. Difference between indicators and labels

Different concepts

An indicator is a standard or method used to define, evaluate and describe a specific thing. For example, the number of new users, the number of cumulative users and the user activity rate are all indicators measuring user growth;

A label is a highly refined feature identifier set artificially: according to the requirements of a business scenario, a certain algorithm is applied to the target object to obtain it. A label is thus the result of human reprocessing, such as "internet celebrity" (wanghong), "rich and beautiful" (baifumei) or "loli".

Different composition

An indicator name names a thing's characteristics in terms of quality and quantity; an indicator value is the quantitative expression of the indicator at a specific time and place and under specific conditions. Take body weight: the indicator name is "weight" and the indicator value is 120 kg;

A label name usually has an adjective or adjective + noun structure. Labels are generally non-quantifiable and usually isolated. Apart from basic labels, labels processed by some algorithm generally have no units or dimensions. For example, people who weigh more than 200 kilograms may be labelled "fat".

Different classification

Classification of indicators

According to their calculation logic, indicators can be divided into three types: atomic indicators, derived indicators and composite indicators;

According to the different contents of event description, it is divided into process indicators and result indicators;

Classification of labels

According to the variability of labels, they are divided into static labels and dynamic labels;

It can be divided into qualitative label and quantitative label according to the different reference of label and evaluation index;

Indicators are best applied to monitoring, analysis, evaluation and modeling.

Labels are best applied to annotation, characterization, classification and feature extraction.

In particular, since labeling a result also produces a label, labels are of great value for supervised learning in natural language processing and machine learning scenarios, which is difficult to achieve with simple indicators. Conversely, the role indicators play in task allocation, performance management and other fields cannot be replaced by labels.

3. Differences and relations between dimensions and indicators

Dimension is the observation angle of data, that is, from which angle to analyze and look at problems.

The indicator is to calculate the value of this result on the basis of dimension.

Dimension is generally a discrete value, such as each independent date or region in the time dimension. Therefore, during statistics, the records of the same dimension can be aggregated together, and the aggregation function can be used for aggregation calculations such as accumulation, mean value, maximum value and minimum value.

An indicator is the aggregated general calculation, that is, the result of aggregation operation, which is generally a continuous value.

4. The difference between the use of natural keys and proxy keys in the data warehouse

According to The Data Warehouse Toolkit, the unique primary key of a dimension table should be a surrogate key rather than a natural key. Sometimes modelers are reluctant to give up natural keys because they want to query the fact table with operational codes instead of joining with the dimension table. However, dimension keys with business meaning should be avoided, because whatever assumptions we make about them may eventually become invalid: we cannot control changes in the source business systems.

Therefore, every join between a dimension table and a fact table in the data warehouse should be based on a meaningless integer surrogate key. Avoid using natural keys as the primary keys of dimension tables.
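A minimal sketch of surrogate-key assignment (class and key names are illustrative): each natural business key from the source system is mapped once to a meaningless, monotonically increasing integer, so fact tables never depend on business codes that may change.

```python
class SurrogateKeyMap:
    """Assigns a stable integer surrogate key to each natural key."""

    def __init__(self):
        self._keys = {}
        self._next = 1

    def key_for(self, natural_key):
        # First sighting of a natural key mints a new surrogate;
        # repeat sightings return the same one.
        if natural_key not in self._keys:
            self._keys[natural_key] = self._next
            self._next += 1
        return self._keys[natural_key]

dim_customer = SurrogateKeyMap()
sk1 = dim_customer.key_for("CUST-00042")  # new key
sk2 = dim_customer.key_for("CUST-00099")
sk3 = dim_customer.key_for("CUST-00042")  # same natural key -> same surrogate
print(sk1, sk2, sk3)  # 1 2 1
```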

5. Relationship between data mart and data warehouse

Data mart is a subset of enterprise data warehouse, it is mainly for department level business and only for a specific topic. In order to solve the contradiction between flexibility and performance, data mart is a small department or working group level data warehouse added to the data warehouse architecture. The Data Mart stores pre calculated data for specific users, so as to meet the performance needs of users. Data mart can alleviate the bottleneck of accessing data warehouse to a certain extent.

Main differences between data mart and data warehouse: data warehouse is enterprise level, which can provide decision support means for the operation of all departments of the whole enterprise; Data mart is a kind of micro data warehouse, which usually has less data, less subject areas and less historical data. Therefore, it is department level and can only serve managers in a local scope. Therefore, it is also called department level data warehouse.

2、 Offline warehouse construction core

The core of data warehouse is the presentation layer and providing high-quality services. ETL and its specifications, layering and so on are all done for a clearer and easy-to-use presentation layer.

1. Data warehouse layering

Principle of data warehouse stratification

In order to facilitate data analysis, it is necessary to shield the underlying complex business and expose the data to the analysis layer in a simple, complete and integrated way.

The impact of bottom-level business changes and top-level demand changes on the model is minimized, and the impact of business system changes is weakened in the basic data layer. Combined with the top-down construction method, the impact of demand changes on the model is weakened.

High cohesion and loose coupling refers to the high cohesion of data within topics or systems with complete meanings, and the loose coupling of data between topics or systems with complete meanings.

Build the basic data layer of the warehouse, isolate the bottom business data integration from the upper application development, lay the foundation for the large-scale development of the warehouse, make the warehouse level clearer and the exposed data more unified.

Generally, the following layered structure is adopted:

1. Data source layer: ODS (operational data store)

The ODS layer is the layer closest to the data source. To allow tracing data back later if needed, it is not recommended to do much cleaning in this layer; just land the raw data intact. Denoising, deduplication and outlier handling can be done later in the DWD layer.

2. Data warehouse layer: DW (data warehouse)

The data warehouse layer is the core design layer when we do the data warehouse. Here, the data obtained from the ODS layer establishes various data models according to the theme.

The DW layer is subdivided into the DWD (data warehouse detail) layer, the DWM (data warehouse middle) layer and the DWS (data warehouse service) layer.

1) Data detail layer: DWD (data warehouse detail)

This layer generally keeps the same data granularity as the ODS layer and provides a certain level of data quality assurance. What the DWD layer needs to do is clean, integrate and standardize the data: dirty data, garbage data, inconsistent specifications, inconsistent state definitions and irregularly named fields are all cleaned up here.

At the same time, to improve the usability of the detail layer, this layer applies some dimension degeneration techniques, degenerating dimensions into the fact table to reduce the joins between fact tables and dimension tables.

In addition, some data aggregation will be done in this layer to collect the data of the same subject into one table to improve the availability of data.
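A hypothetical DWD-style cleaning pass over raw ODS rows (field names and rules are illustrative): drop garbage rows, deduplicate, and unify inconsistent state definitions and types.

```python
# Raw rows as they might land in the ODS layer.
raw_ods_rows = [
    {"order_id": "O-1", "status": "PAID", "amount": "99.0"},
    {"order_id": "O-1", "status": "PAID", "amount": "99.0"},   # duplicate
    {"order_id": "O-2", "status": "paid", "amount": "45.5"},   # inconsistent case
    {"order_id": None,  "status": "PAID", "amount": "10.0"},   # dirty: no key
]

def clean(rows):
    seen, out = set(), []
    for r in rows:
        if not r["order_id"]:      # drop rows missing the business key
            continue
        if r["order_id"] in seen:  # deduplicate on the business key
            continue
        seen.add(r["order_id"])
        out.append({"order_id": r["order_id"],
                    "status": r["status"].upper(),  # standardize states
                    "amount": float(r["amount"])})  # standardize types
    return out

dwd_rows = clean(raw_ods_rows)
print(len(dwd_rows))  # 2
```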

2) DWM: warehouse data middle layer

Based on the data of DWD layer, this layer will do light aggregation operation to generate a series of intermediate tables, improve the reusability of public indicators and reduce repeated processing.

Intuitively, it aggregates on the common core dimensions and calculates the corresponding statistical indicators.

In the actual calculation, if the statistical indicators of the wide table are calculated directly from DWD or ODS, there will be too much calculation and too few dimensions. Therefore, the general practice is to calculate multiple small intermediate tables in the DWM layer, and then splice them into a wide table of DWS. Since the boundary between wide and narrow is not easy to define, the DWM layer can also be removed and only the DWS layer can be left. All data can also be placed in DWS.
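The DWM-to-DWS flow can be sketched with Python's built-in sqlite3 module (table and column names are illustrative): two lightly aggregated intermediate tables, one per process, are then spliced into one wide table per user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dwd_order (user_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE dwd_visit (user_id INTEGER, page TEXT)")
cur.executemany("INSERT INTO dwd_order VALUES (?, ?)", [(1, 10.0), (1, 20.0), (2, 5.0)])
cur.executemany("INSERT INTO dwd_visit VALUES (?, ?)", [(1, "home"), (2, "home"), (2, "cart")])

# DWM: lightly aggregated intermediate tables, one per process.
cur.execute("CREATE TABLE dwm_order AS "
            "SELECT user_id, SUM(amount) AS order_amt FROM dwd_order GROUP BY user_id")
cur.execute("CREATE TABLE dwm_visit AS "
            "SELECT user_id, COUNT(*) AS visit_cnt FROM dwd_visit GROUP BY user_id")

# DWS: splice the intermediates into one wide table per user.
wide = cur.execute(
    "SELECT o.user_id, o.order_amt, v.visit_cnt "
    "FROM dwm_order o JOIN dwm_visit v ON o.user_id = v.user_id ORDER BY o.user_id"
).fetchall()
print(wide)  # [(1, 30.0, 1), (2, 5.0, 2)]
```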

3) Data service layer: DWS (data warehouse service)

The DWS layer is the public summary layer. It performs light aggregation, at a granularity slightly coarser than the detail data. Based on the base data of the DWD layer, it integrates and summarizes the data of one subject domain into wide tables for analysis. The DWS layer should cover 80% of application scenarios; it is also called the data mart layer or wide-table layer.

According to the business division, such as subject domain traffic, order, user, etc., a wide table with many fields is generated to provide subsequent business query, OLAP analysis, data distribution, etc.

Generally speaking, there are relatively few data tables in this layer, and a table will cover more business contents. Because of its many fields, the table in this layer is generally called wide table.

3. Data application layer: APP (application)

The data here mainly serves data products and data analysis. It is generally stored in ES, PostgreSQL, Redis and other systems for use by online systems, or in Hive or Druid for data analysis and data mining. For example, the report data we often talk about is usually placed here.

4. Dimension layer: DIM (dimension)

If there are many dimension tables, you can also design a separate layer for them. The dimension layer mainly contains two kinds of data:

High cardinality dimension data: it is generally a data table similar to user data table and commodity data table. The amount of data may be tens of millions or hundreds of millions.

Low cardinality dimension data: it is generally a configuration table, such as the Chinese meaning of enumeration values or a date dimension table. The amount of data may be single digits or tens of thousands.

2. Warehouse modeling method

On which layer is data warehouse modeling built? Taking dimensional modeling as an example, modeling is built on the layer just above the data source layer. In the layered architecture of the previous section, data warehouse modeling is carried out in the DW layer, so the DW layer is the core layer of data warehouse construction.

How is the modeling done? There are many data warehouse modeling methods; each represents a philosophical viewpoint, a way of abstracting and summarizing the world. The common ones are paradigm (normal-form) modeling, dimensional modeling and entity modeling. In essence, each method looks at the problems in the business from a different perspective.

1. Third normal form (3NF)

In fact, paradigm modeling is the common way we build data models for relational databases. This method, mainly advocated by Inmon, is a technique for organizing data storage in relational databases; at present most modeling in relational databases uses third-normal-form modeling.

A normal form is a collection of relation schemas conforming to a certain level. Database construction must follow certain rules, and in relational databases this rule is the normal form, a process also called normalization. At present relational database theory has six normal forms: first normal form (1NF), second normal form (2NF), third normal form (3NF), Boyce-Codd normal form (BCNF), fourth normal form (4NF) and fifth normal form (5NF).

In the model design of data warehouse, the third paradigm is generally adopted. A relationship conforming to the third paradigm must have the following three conditions:

Each attribute value is unique and has no ambiguity;

Each non primary attribute must be completely dependent on the whole primary key, not a part of the primary key;

Each non primary attribute cannot depend on the attributes in other relationships, because in this case, this attribute should belong to other relationships.

Paradigm modeling

According to Inmon, the data warehouse model is built in a way similar to the enterprise data model of the business system. In the business system, the enterprise data model determines the source of data, and it is likewise divided into two levels: the subject domain model and the logical model. The subject domain model can be regarded as the conceptual model of the business model, while the logical model is the instantiation of the subject domain model in a relational database.

2. Dimensional modeling

The dimensional model is advocated by Ralph Kimball, another master in the data warehouse field. His book The Data Warehouse Toolkit is the most popular classic on data warehouse modeling in data warehouse engineering. Dimensional modeling starts from the needs of analysis and decision-making; the data model it builds serves analysis needs, so it focuses on letting users complete analysis faster and offers better response performance for large-scale complex queries.

Dimensional modeling

Typical representatives are the well-known star schema, and the snowflake schema that applies in some special scenarios.

The more important concepts in dimension modeling are fact table and dimension table. The simplest description is to build data warehouse and data mart according to fact table and dimension table.

3. Entity modeling

Entity modeling method is not a common method in data warehouse modeling. It comes from a school of philosophy. In the sense of philosophy, the objective world should be subdivided. The objective world should be divided into entities and the relationship between entities. Then we can introduce this abstract method in the modeling process of data warehouse, and divide the whole business into entities, and the relationship between each entity and the description of these relationships are the work we need to do in data modeling.

Although entity modeling looks abstract, it is easy to understand: we can divide any business process into three parts, entity, event and description, as shown in the figure below:

Solid modeling

The above figure represents an abstract meaning. If we describe a simple fact: “Xiao Ming drives to school”. Taking this business fact as an example, we can regard “Xiao Ming” and “school” as an entity, and “going to school” describes a business process. Here, we can abstract it into a specific “event”, while “driving to school” can be regarded as an explanation of the event “going to school”.

3. Detailed explanation of dimension modeling

At present, the most commonly used modeling method in Internet companies is dimensional modeling, which we will focus on!

Dimensional modeling is a method specially applied to the modeling of analytical database, data warehouse and data mart. Data mart can be understood as a “small data warehouse”.

Let's not rush into dimensional modeling. First understand the table types in dimensional modeling and the models of dimensional modeling, then start modeling; this gives us a deeper understanding!

1. Table type in dimension modeling

Dimension modeling is divided into two tables: fact table and dimension table:

Fact table: some data that must exist, such as collected log files and order tables, can be used as fact tables.

Characteristics: it is a collection of primary keys, each of which corresponds to a record in a dimension table; it exists objectively, and the data to use is determined by the subject.

Dimension table: a dimension is an angle from which the data is analyzed. A dimension table is a table created from an appropriate angle; it is a perspective for analyzing a problem: time, region, terminal, user, etc.

1. Fact table

Operational events occurring in the real world produce measurable values, which are stored in the fact table. At the lowest granularity, each fact table row corresponds to one measurement event, and vice versa.

The fact table represents the measurement of the analysis topic。 For example, a purchase can be understood as a fact.

Facts and dimensions

The order table in the figure is a fact table. You can understand it as an operational event in reality: every time an order is completed, a record is added to the table. Characteristics of a fact table: the table stores no actual descriptive content; it is a collection of primary keys, and these IDs each correspond to a record in a dimension table. The fact table contains foreign keys to each dimension table, through which it can be joined to them. The measures in a fact table are usually numeric, the number of records keeps increasing, and the table's data volume grows rapidly.

Detail table (wide table):

In the fact table's data, several attributes may be packed together into one field. For example, a timestamp is composed of year, month, day, hour, minute and second. When we need grouped statistics by one of these attributes, string truncation and splicing are required, which is extremely inefficient. For example:


2021-03-18 06:31:42

For the convenience of analysis, you can cut a field of the fact table apart and extract multiple attributes from it to form new fields. Because there are more fields, the result is called a wide table, and the original one becomes a narrow table.

Suppose the local_time field above is expanded into the following six fields:



Because the information in the wide table is clearer, it can also be called a detail table.
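The widening step can be sketched in Python using the standard datetime module (the function name and column names are illustrative): one local_time string is split into six grouping-friendly columns.

```python
from datetime import datetime

def widen_local_time(ts):
    """Split one local_time string into six grouping-friendly columns."""
    t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
    return {"year": t.year, "month": t.month, "day": t.day,
            "hour": t.hour, "minute": t.minute, "second": t.second}

row = widen_local_time("2021-03-18 06:31:42")
print(row)  # {'year': 2021, 'month': 3, 'day': 18, 'hour': 6, 'minute': 31, 'second': 42}
```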

Type of fact table

Fact tables are divided into the following six categories:

Transaction fact table

periodic snapshot fact table

Cumulative snapshot fact table

Fact table without facts

Aggregate fact table

Consolidated fact table

Briefly explain the concept of each table:

Transaction fact table

Each row in the table corresponds to one measurement event at a point in space and time. For example, if the sales amount field holds a measure, then one row must be added for every sale. Also, the transaction fact table contains foreign keys to the dimension tables, and the measures must be consistent with the transaction granularity.

periodic snapshot fact table

As the name suggests, in a periodic snapshot fact table each row has a time field representing a period. Usually the period is a standard one, such as a day, a week or a month. The granularity is the period, not the individual transaction; that is, a row in a periodic snapshot fact table can cover multiple facts, but they all belong to one period.

Cumulative snapshot fact table

The periodic snapshot fact table holds data of a single period, while the cumulative snapshot fact table is composed of data from multiple periods: each row summarizes the measurements from the beginning to the end of a process. Each row of data is equivalent to a pipeline or workflow, with the start point, intermediate steps and end point of the event, and each key step holds a date field. For example, for order data, a row in the cumulative snapshot fact table is one order: when the order is created a row is inserted, and when the order changes that row is modified.
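A sketch of the cumulative snapshot idea (milestone and field names are illustrative): one row per order, updated in place as the order moves through its milestones.

```python
# One row per order; each key step gets its own date field.
cumulative_fact = {}

def record_milestone(order_id, milestone, date):
    # Insert the order's row on first sight, then modify it in place.
    row = cumulative_fact.setdefault(
        order_id, {"created": None, "paid": None, "shipped": None})
    row[milestone] = date

record_milestone("O-1001", "created", "2021-03-18")  # row inserted at creation
record_milestone("O-1001", "paid", "2021-03-19")     # same row, modified
record_milestone("O-1001", "shipped", "2021-03-21")

print(cumulative_fact["O-1001"])
# {'created': '2021-03-18', 'paid': '2021-03-19', 'shipped': '2021-03-21'}
```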

Fact table without facts

The fact table measures discussed above are all numeric, and in practice most measures are. But a small number of valuable fields may carry no numeric values; the factless fact table is prepared for such data, and with it you can analyze what happened.

Aggregate fact table

Aggregation is a simple aggregation operation of atomic granularity data in order to improve query performance. For example, our demand is to query the total sales of all stores in the country. Each row in our atomic granularity fact table is the sales of each store and each commodity. The aggregate fact table can aggregate the total sales of each store first, so that the amount of data calculated when summarizing the sales of all stores will be much smaller.
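The store-sales example can be sketched with Python's built-in sqlite3 module (table and column names are illustrative): the aggregate fact table pre-sums the atomic rows per store, so the nationwide total scans far fewer rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Atomic grain: one row per store per product.
cur.execute("CREATE TABLE fact_sales (store TEXT, product TEXT, amount REAL)")
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", [
    ("S1", "apple", 10.0), ("S1", "pear", 5.0), ("S2", "apple", 7.0)])

# Aggregate fact table: pre-summed per store, far fewer rows to scan later.
cur.execute("CREATE TABLE agg_store_sales AS "
            "SELECT store, SUM(amount) AS amount FROM fact_sales GROUP BY store")

# Nationwide total now reads the small aggregate, not the atomic table.
total = cur.execute("SELECT SUM(amount) FROM agg_store_sales").fetchone()[0]
print(total)  # 22.0
```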

Consolidated fact table

This fact table follows one principle: the data can come from multiple processes, but as long as they share the same granularity, they can be combined into one fact table. This kind of fact table is especially suitable for metrics from multiple processes that often need to be analyzed together.

2. Dimension table

Each dimension table contains a single primary key column. The primary key of the dimension table can be used as the foreign key of any fact table associated with it. Of course, the description environment of the dimension table row should fully correspond to the fact table row. Dimension tables are usually wide, flat and nonstandard, and contain a large number of low granularity text attributes.

A dimension is an angle used when analyzing data. For example, to analyze product sales you can choose to analyze by category or by region; each such angle constitutes a dimension. The user table, merchant table and date table in the figure above are dimension tables. Each of these tables has a unique primary key, and the detailed descriptive data is stored in the table.

Generally speaking, there is no need to strictly abide by the standardized design principles in the data warehouse. Because the leading function of data warehouse is analysis oriented, mainly query, and does not involve data update operation.The design of fact table is based on the ability to correctly record historical information, and the design of dimension table is based on the ability to aggregate subject content from an appropriate perspective

Dimension table structure

Keep one principle in mind: a dimension table contains a single primary key column. Sometimes complex business may produce composite primary keys; try to avoid them, and if that is unavoidable, make sure the key values are unique. If the primary key of a dimension table is not unique, joining it to the fact table will fan out the data and produce wrong final results.
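The fan-out failure mode can be demonstrated with Python's built-in sqlite3 module (table and column names are illustrative): a duplicate dimension key turns one fact row into two and doubles the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact (city_code TEXT, amount REAL)")
cur.execute("INSERT INTO fact VALUES ('BJ', 100.0)")
# A dimension table whose key is NOT unique: 'BJ' appears twice.
cur.execute("CREATE TABLE dim_city (city_code TEXT, city_name TEXT)")
cur.executemany("INSERT INTO dim_city VALUES (?, ?)",
                [("BJ", "Beijing"), ("BJ", "Peking")])
# The join fans out: one fact row matches two dimension rows.
bad = cur.execute("SELECT SUM(f.amount) FROM fact f "
                  "JOIN dim_city d ON f.city_code = d.city_code").fetchone()[0]
print(bad)  # 200.0 -- wrong: the single 100.0 fact was counted twice
```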

Dimension tables are usually wide and contain a large number of low granularity text attributes.

Cross-table drilling

Cross-table drilling means that when the row headers of each query contain the same conformed attributes, different queries can be issued against two or more fact tables.

Drilling can change the level of dimension and the granularity of analysis. It includes up / down drilling:

Roll up: roll up is to gather and summarize data upward along the dimension hierarchy. For example, if the product sales data is rolled up along the time dimension, the monthly (or quarterly or annual or all) sales of all products in all regions can be calculated.

Drill down: drill down is the reverse operation of drilling up. It goes down the dimension level to view more detailed data.

Degenerate dimension

A degenerate dimension is a dimension placed back into the fact table, because sometimes the dimension contains nothing but its primary key. Although it is still a legal dimension key, it is generally put back into the fact table to reduce the number of joins and improve query performance.

Multilevel dimension

Most dimensions contain more than one natural level. For example, the date dimension can range from the level of day to the level of week, month to year. Therefore, in some cases, there are different levels in the same dimension.

Dimension table null attribute

When a given dimension line is not fully populated, or when the existing attribute is not applied to all dimension lines, a null value dimension attribute will be generated. In the above two cases, it is recommended to use descriptive strings instead of null values, such as using unknown or not applicable to replace null values.
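A minimal sketch of the recommendation above (function and attribute names are illustrative): missing dimension attributes are replaced by a descriptive string rather than left as NULL.

```python
def describe(attr):
    """Replace a missing dimension attribute with a descriptive string."""
    return attr if attr is not None else "Unknown"

rows = [{"channel": "web"}, {"channel": None}]
labels = [describe(r["channel"]) for r in rows]
print(labels)  # ['web', 'Unknown']
```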

Calendar date dimension

In the date dimension table, the setting of the primary key should not be represented by the ID generated in sequence, but can be represented by more meaningful data, such as combining the year, month and day, that is, yyyymmdd, or more detailed precision.
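Generating such a date dimension can be sketched with Python's standard datetime module (the function and column names are illustrative): each row is keyed by a meaningful yyyymmdd integer rather than a sequence-generated ID.

```python
from datetime import date, timedelta

def build_date_dim(start, days):
    """Generate date-dimension rows keyed by a meaningful yyyymmdd integer."""
    rows = []
    for i in range(days):
        d = start + timedelta(days=i)
        rows.append({"date_key": int(d.strftime("%Y%m%d")),
                     "year": d.year, "month": d.month, "day": d.day})
    return rows

dim = build_date_dim(date(2021, 3, 18), 3)
print([r["date_key"] for r in dim])  # [20210318, 20210319, 20210320]
```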

2. Three models of dimension modeling

1. Star schema

The star schema is the most commonly used dimensional modeling method. The star schema is centered on the fact table, with all dimension tables connected directly to it, like stars around it. A star-schema dimensional model consists of one fact table and a group of dimension tables, with the following characteristics: a. dimension tables are associated only with the fact table, and there is no association between dimension tables; b. the primary key of each dimension table is a single column, placed in the fact table as a foreign key linking the two sides; c. the fact table is the core, with the dimension tables distributed around it in a star.

2. Snowflake schema

The snowflake schema is an extension of the star schema: dimension tables in the snowflake schema may have further dimension tables attached. Although this model is more normalized than the star schema, it is harder to understand, its maintenance cost is relatively high, and queries must join multi-level dimension tables, so its performance is lower than the star schema's. It is therefore not usually used.

snowflake schema

3. Constellation schema

The constellation schema is an extension of the star schema. While the star schema is based on one fact table, the constellation schema is based on multiple fact tables that share dimension tables. The two modeling methods described above pair multiple dimension tables with a single fact table, but in many cases there is more than one fact table in the dimension space, and a dimension table may be used by several fact tables. In the later stages of business development, most dimensional modeling adopts the constellation schema.

constellation schema

3. Dimension modeling process

We know that the table types of dimensional modeling are fact tables and dimension tables, and that the models include the star, snowflake and constellation schemas. But in real business we are handed a pile of data: how do we use it to build the warehouse? The authors of The Data Warehouse Toolkit summarize, from decades of practical business experience, the following four steps. Please remember them!

Dimension modeling in data warehouse Toolbox: four steps

Four steps of dimensional modeling

Please keep the above four steps firmly in mind. Whatever the business, follow these steps and do not disturb their order, because they are linked step by step. Here is how each step is carried out in detail.

1. Select business process

Dimensional modeling is close to the business, so it must be based on the business. Choosing the business process, as the name suggests, is to select the business we need to model in the whole business process, and select the business according to the needs provided by the operation and the scalability in the future. For example, the whole mall process is divided into merchant end, user end and platform end. The operation demand is the total order quantity, the number of orders, and the purchase situation of users. When we choose the business process, we choose the data of the user end, and the merchant and platform end will not be considered temporarily. Business selection is very important because all subsequent steps are based on this business data.

2. Declaration granularity

For example, a user has one ID number, one registered residence address, multiple mobile phone numbers and multiple bank cards. Then the attributes at the same granularity as the user are the ID-card granularity and the registered-address granularity, while the granularities finer than the user are the mobile-number granularity and the bank-card granularity: a one-to-one relationship means the same granularity. Why mention the same granularity? Because dimensional modeling requires that, within the same fact table, all facts must have the same granularity; do not mix data of different granularities in one fact table, but build separate fact tables for different granularities. When obtaining data from a given business process, it is strongly recommended to start from atomic granularity, that is, the finest granularity, because atomic granularity can withstand unpredictable user queries. Roll-up summary granularity, however, is very important for query performance, so for data with clear requirements we build roll-up summary granularity, and for data whose requirements are uncertain we build atomic granularity.

3. Confirm dimension

A dimension table is the entry point and descriptive identity for business analysis, so it is also known as the "soul" of the data warehouse. In a pile of data, how do we identify which columns are dimension attributes? If a column is a description of a specific value, a text or a constant, or a participant in constraints and row identification, it is usually a dimension attribute. The Data Warehouse Toolkit tells us that by firmly grasping the granularity of the fact table we can distinguish all possible dimensions, and we must ensure there is no duplicate data in the dimension table and that the dimension primary key is unique.

4. Confirm the facts

The fact table is used for measurement and is basically expressed in quantitative values. Each row in the fact table corresponds to one measurement, and all rows' data are at one level of detail, which is called the granularity. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same granularity. This ensures there is no problem of double-counting measures. Sometimes it is uncertain whether a column is a fact attribute or a dimension attribute. Remember: the most useful facts are numeric and additive. You can therefore check whether the column is a measure that takes many values and participates in calculations; in that case the column is usually a fact.

3、 Offline warehouse construction practice

Technology serves the business, and the business creates value for the company; technology detached from the business is meaningless.

1. Business introduction

Different products are developed for users with different needs, so there are many business lines within the company; for the data department, every business line is a data source. Data is divided not only by business line but also according to the attributes of the data itself.

2. Early planning

In earlier development, each business line had its own data team, and the teams did not interfere with each other. This model is relatively simple: each team only builds the data warehouse and develops reports for its own business line.

However, as the business grows, there are more and more vertical business units with frequent iterations that cut across departments, and the businesses become coupled. At this point, this chimney-style development runs into problems:

For example, the company manages data strictly: different data development groups have no permission to share data, and requesting data permissions from another business line requires an approval process that takes a long time.

There is also the problem of repeated development: different business lines have the same report requirements, and if each business develops its own reports, resources are wasted.

Therefore, for data development, the data of all business lines needs to be managed in a unified way, and hence the data center emerged.

3. Data center

I think a data center is built according to each company's specific business needs; different businesses understand the data center differently.

The agile data center our company developed mainly evolves from reusing data technology and computing power toward reusing data assets and data services, giving data a greater value bandwidth for directly empowering the business. It provides unified management, breaks down data silos, traces data lineage, and enables self-service and high reuse.

As follows:

Data center

The above explanation is rather abstract, so let's look at the convenience a data center brings from the perspective of actual project development.

For example, in the previous report development process, we first needed to collect data: different data sources were ingested into the big data platform through tools such as Sqoop, then the data warehouse was built, and finally the report data was produced and displayed in a visualization system. The whole process was then written into scripts and placed on a scheduling platform for automatic execution.

With a data center, this is no longer so cumbersome: you can directly build the warehouse and generate reports without paying much attention to data sources, visualization, or scheduling. You can also intuitively view data lineage and the lineage between tables. As shown in the figure below, the dependencies between tables are clear:

Data center

On the other hand, heterogeneous data systems in the data center can be joined very simply, for example associating a Hive table with a MySQL table. The data center transparently shields the different interaction modes of heterogeneous data systems and easily enables transparent federated computation across them.

The principle behind this is that the data center provides a mapping from virtual tables to physical tables. End users do not need to care about the physical storage location of the data or the characteristics of the underlying data source; they operate on the data directly, with an experience similar to operating a single virtual database.

The data center additionally integrates visualization: it provides a one-stop data visualization solution, supports JDBC data sources and CSV file upload, supports drag-and-drop generation of visualization components from data models, and its large-screen display adapts to screens of different sizes.

The scheduling system is written in-house and integrated into the data center; after writing SQL statements, tasks can be scheduled directly.

4. Warehouse construction

Only now do we really get to building the data warehouse. Why spend so much space introducing the company's business and the data center first? Because the warehouse construction that follows is carried out around the company's business development and the existing data center, and warehouse construction is inseparable from the company's business.

Intelligent warehouse planning

The core idea of warehouse construction: in design, development, deployment, and use, avoid duplicated and redundant construction of indicators, ensure standardized and unified data calibers, and ultimately achieve full-link association of data assets, provide standard data output, and establish a unified common data layer.

With this core idea, how do we start? There is a saying that warehouse builders are not only technical experts but, for the most part, business experts as well, so the approach adopted is to let the data drive the construction. Because of the data center, the business knowledge system is relatively centralized and the business data is no longer scattered, which speeds up warehouse construction.

Warehouse construction proceeds mainly along two lines, models and specifications, unified across all businesses.


All businesses adopt a unified model system, reducing R&D cost, enhancing indicator reuse, and ensuring consistent data calibers.

Model stratification

Given the company's business and the many new requirements expected later, there should not be too many layers, and the responsibilities of each layer must be clearly defined. To keep the data layers stable and shield downstream consumers from changes, the following layered structure is adopted:

Data hierarchy

Data flow direction

During model development, follow the layered structure and the forward flow of data ODS -> DW -> DM -> APP. This prevents chaotic data links and the difficulty of guaranteeing SLA timeliness caused by non-standard data references; it also keeps lineage concise, making data flow easy to trace. The following situations should be avoided during development:

Incorrect data reference links: for example, ODS -> DM -> APP indicates that the detail layer does not fully cover the data; ODS -> DW -> APP indicates that the subject division of the light summary layer is not fully covered. Reducing cross-layer references improves the reusability of intermediate tables. An ideal warehouse model design should be reproducible, complete, and standardized.

Avoid generating a table from another table in the same layer, such as generating a DW table from another DW table, which hurts ETL efficiency.

Reverse dependencies are forbidden, for example a DW table depending on a DM table.
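The reference rules above can be sketched as a small checker. The layer prefixes and the classification labels are assumptions for illustration, not part of any real tool:

```python
# Allowed forward flow between warehouse layers: ODS -> DW -> DM -> APP.
LAYER_ORDER = {"ods": 0, "dw": 1, "dm": 2, "app": 3}

def check_reference(source_table: str, target_table: str) -> str:
    """Classify the dependency of target_table on source_table by layer prefix."""
    src = LAYER_ORDER[source_table.split("_")[0]]
    tgt = LAYER_ORDER[target_table.split("_")[0]]
    if tgt - src == 1:
        return "ok"           # adjacent forward reference
    if tgt - src > 1:
        return "cross-layer"  # e.g. ODS -> DM, a layer was skipped
    if tgt == src:
        return "same-layer"   # e.g. DW -> DW, hurts ETL efficiency
    return "reverse"          # e.g. DM feeding DW, forbidden
```

A lineage system could run such a check over every task's source/target pair before the task is allowed onto the scheduler.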


Table naming conventions

For ods, dm, and app layer table names: type_theme_meaning, for example: dm_xxsh_user

For dw layer table names: type_theme_dimension_meaning, for example: dw_xxsh_fact_users (fact table), dw_xxsh_dim_city (dimension table)
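As an illustrative sketch, the convention above can be enforced with a simple validator. The regular expressions are an assumption derived from the stated rules, not an official tool:

```python
import re

# Naming convention from the text:
#   ods/dm/app layers: <type>_<theme>_<meaning>,         e.g. dm_xxsh_user
#   dw layer:          <type>_<theme>_<fact|dim>_<meaning>, e.g. dw_xxsh_fact_users
DW_PATTERN = re.compile(r"^dw_[a-z0-9]+_(fact|dim)_[a-z0-9_]+$")
OTHER_PATTERN = re.compile(r"^(ods|dm|app)_[a-z0-9]+_[a-z0-9_]+$")

def is_valid_table_name(name: str) -> bool:
    """Return True if the table name follows the layer naming convention."""
    return bool(DW_PATTERN.match(name) or OTHER_PATTERN.match(name))
```

Such a check could run when a table is registered in the metadata system, rejecting names that break the convention.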

Field naming conventions

Build word roots, which are the basis of dimension and indicator management; they are divided into common roots and proprietary roots.

Common root: the smallest unit describing a thing, for example sex (gender).

Proprietary root: a descriptive term specific to the industry or internal to the company, for example xxsh, the name of a product within the company.

Script naming conventions

Script name: script type.script function.[database name].script name, for example: hive.dm.dm_xxsh_users

Script types are mainly divided into the following three categories:

Plain SQL script: hive

Custom shell script: sh

Custom Python script: python

Script content specification

# Variable definitions must conform to Python syntax

# Designated task owner
owner = "[email protected]"

# Script storage directory: /opt/xxx
# Script name: hive.dm.dm_xxsh_users

# source identifies the upstream tables the task depends on; if a task has
# multiple upstream tables, all of them must be listed here
# (xxx_name needs to be changed, the rest stays unchanged)
source = {
    "table_name": {
    }
}

# Same as source, but each task has only one target table
target = {
    "db_table": {
    }
}

# Variable list
# $now.date, common format example: 2020-12-11
task = '''
Write the SQL code here
'''

5. Specific implementation of data layer

Four diagrams illustrate the concrete implementation of each layer.

Data source layer ODS


Data source layer

The data source layer mainly imports the various business data into the big data platform, serving as a snapshot store of the business data.

Data detail layer DW


Data detail layer

Each row in the fact table corresponds to one measurement, and the data in each row sits at a specific level of detail, called the granularity. One of the core principles of dimensional modeling is that all measures in the same fact table must have the same granularity. This ensures that metrics are not double counted.

Dimension tables generally have a single primary key, occasionally a composite primary key. Take care to avoid duplicate data in dimension tables; otherwise, joining them with fact tables causes data divergence (row fan-out).

Sometimes it is unclear whether a column is a fact attribute or a dimension attribute. Remember: the most useful facts are numeric and additive. So analyze whether the column is a measure that takes many values and participates in calculations; if so, it is usually a fact. If the column is a description of a specific value, a piece of text or a constant, or participates in constraints and row identification, it is usually a dimension attribute. In the end, though, judge by the business whether it is a dimension or a fact.

Data summary layer DM

Data summary layer

This layer is called the light summary layer, meaning it starts summarizing the data, but not completely: it only performs associated summarization over data of the same granularity. Data of different but related granularities can also be summarized here; in that case the granularities must first be unified through aggregation or similar operations.
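A minimal sketch of unifying granularity before association, using hypothetical order records (order grain) and page-view records (event grain), both aggregated to the same (user, date) grain:

```python
from collections import defaultdict

# Two detail sets at different grains (hypothetical sample data).
orders = [
    {"user": "u1", "date": "2020-12-11", "amount": 30},
    {"user": "u1", "date": "2020-12-11", "amount": 20},
    {"user": "u2", "date": "2020-12-11", "amount": 50},
]
page_views = [
    {"user": "u1", "date": "2020-12-11"},
    {"user": "u1", "date": "2020-12-11"},
    {"user": "u1", "date": "2020-12-11"},
    {"user": "u2", "date": "2020-12-11"},
]

# Aggregate both sides to the same (user, date) grain before associating them.
amount_by_key = defaultdict(int)
for o in orders:
    amount_by_key[(o["user"], o["date"])] += o["amount"]

views_by_key = defaultdict(int)
for v in page_views:
    views_by_key[(v["user"], v["date"])] += 1

# Light-summary row: one row per (user, date), mixing no other grains.
summary = {k: {"amount": amount_by_key[k], "views": views_by_key[k]}
           for k in amount_by_key}
```

In SQL terms this corresponds to aggregating each source to the shared grain in subqueries and only then joining them.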

Data application layer app

Data application layer

The tables in the data application layer are provided to users, and warehouse construction is nearly finished at this point. Next, data is served according to different needs: directly displayed in reports, provided to colleagues doing data analysis, or used for other business support.

6. Summary

One figure summarizes the overall process of warehouse construction:


Data center

7. Precautions in actual production

Operations in the production environment cannot be as casual as in our own tests; carelessness can cause production accidents, so be careful with every step.

The following precautions are listed, but the list is not exhaustive:

Do not operate on library tables other than those you manage and are authorized for;

Do not operate on other people's scripts and files in the production environment without authorization;

Before modifying a production script, be sure to make a local backup;

Confirm that your modification can be rolled back quickly;

All table and field names in the production environment should follow the naming rules.

4、 Real-time computing

Real-time computing generally targets massive data and requires second-level latency. At the beginning of the big data era, Hadoop did not provide a real-time computing solution; then real-time computing frameworks such as Storm, Spark Streaming, and Flink emerged, and the rise of Kafka and ES made the real-time computing technology stack more and more complete. With the spread of IoT, machine learning, and other technologies, real-time stream computing will be widely applied in these fields.

Three features of real-time computing:

Unbounded data: a growing, essentially infinite data set, often referred to as "streaming data", as opposed to a bounded data set.

Unbounded data processing: a continuous processing mode that repeatedly processes unbounded data through the processing engine and can break through the bottlenecks of engines built for bounded data.

Low latency: there is no strict definition of the latency bound, but we all know that the value of data decays as time passes, so timeliness is a problem that must be continuously addressed.

In today's popular big data application domains, for example recommendation systems, early practice was limited by technology: it might take a minute, an hour, or even longer before a recommendation reached the user, which falls far short of the need. We must process data faster than offline batch processing allows.

1. Real-time computing application scenarios

As real-time technology matures, real-time computing is applied more and more widely. The following lists just a few common application scenarios:

1. Real-time intelligent recommendation

A user's purchase behavior can be predicted through recommended content or trained algorithms. For individuals, the recommendation system filters information; for web/app services, it meets users' personalized needs and improves their satisfaction. Recommendation systems themselves are also evolving rapidly: besides ever-better algorithms, latency requirements are becoming more demanding and real-time. With Flink stream computing, users can build a more real-time intelligent recommendation system: compute user behavior indicators in real time, update the model in real time, make predictions in real time, and push the predicted information to the web/app side so users see the product information they want. This in turn helps enterprises increase sales and create greater business value.

2. Real-time fraud detection

In the financial field there are many types of fraud, such as credit card fraud and loan application fraud. Protecting the funds of users and the company is a common challenge that many financial companies and banks have faced in recent years. As criminals' fraud techniques keep evolving, traditional anti-fraud measures are no longer sufficient. In the past, it might take hours to compute user behavior indicators from transaction data, identify suspected fraudsters through rules, and then investigate and handle the cases; by then the funds might already have been transferred, causing heavy losses to enterprises and users. With Flink stream computing, fraud-detection indicators can be computed in milliseconds and suspicious transactions intercepted in real time, avoiding the losses caused by slow handling.

3. Public opinion analysis

Some customers need public opinion analysis and require all data to be stored for several years. The daily volume of public opinion data may exceed a million records, and the annual volume can reach billions. The data crawled is raw public opinion; after word segmentation with big data technology, a large number of netizen comments are obtained. Customers often ask to query public opinion with full-text search and a response time on the order of seconds. The crawler writes data into Kafka on the big data platform; Flink stream processing performs deduplication and denoising, sentiment analysis is applied, and the results are written into Elasticsearch. One characteristic of big data is having multiple data sources; the platform can choose different sources for different scenarios.

4. Complex event processing

Complex event processing is common in industrial fields, for example real-time fault detection from vehicle sensors and mechanical equipment. These workloads usually have large data volumes and high timeliness requirements. Flink's CEP is used to extract event patterns, and Flink SQL transforms the event data, building a rule engine inside the streaming system. Once an event triggers an alarm rule, the alarm is immediately pushed to the downstream notification system, achieving rapid early warning and detection of equipment failures and vehicle status monitoring.

5. Real-time machine learning

Real-time machine learning is a broader concept. Traditional static machine learning focuses on training and prediction with static models over historical data. In many cases, a user's short-term behavior can update the model or change a business decision. The system needs to collect the user's recent behavior, perform feature engineering, and feed it to a real-time machine learning system. If new rules or new advertisements can then be rolled out dynamically, this has great reference value.

2. Overview of real-time computing

Let's first look at a real-time architecture diagram of a big data platform:


Data synchronization:

In the architecture diagram above, data is generated on the web platform and imported into the big data platform through a data synchronization system. Because the data sources differ, the synchronization system is actually a combination of several related systems: Sqoop is usually used for database synchronization, while Flume can be chosen for log synchronization. The quality of data from different sources may vary greatly: formatted data in databases can be imported directly into the big data system, while data from logs and crawlers needs a great deal of cleaning and transformation before it can be used effectively.

Data storage:

This layer stores the raw data and the detail data after cleaning and association. Based on the layered concept of a unified real-time data model, data for different application scenarios is stored in Kafka, HDFS, Kudu, ClickHouse, HBase, and so on.

Data calculation:

The computing layer mainly uses four engines: Flink, Spark, Presto, and the computing power of ClickHouse. Flink is mainly used for real-time data synchronization, streaming ETL, and second-level real-time indicator computation for key systems. Spark SQL is mainly used for quasi-real-time indicator computation in complex multidimensional analysis. Presto and ClickHouse mainly serve multidimensional self-service analysis scenarios with modest query-response-time requirements.

Real-time applications:

Support the data scenarios of each business line with a unified query service. The applications mainly include real-time dashboards, real-time data products, real-time OLAP, real-time features, and so on.

Of course, a good big data platform cannot lack metadata management and data governance:

1. Metadata and indicator management: this mainly manages real-time Kafka tables, Kudu tables, ClickHouse tables, Hive tables, etc., names tables according to the warehouse model's conventions, and defines the meaning and owner of each table's fields. Indicator management means managing all real-time indicators through an indicator management system as far as possible, clarifying their calculation calibers and providing them to different business parties;

2. Data quality and lineage analysis: data quality covers platform monitoring and data monitoring; lineage analysis mainly analyzes the dependencies among real-time data and real-time tasks.

The above architecture is only a general data model for a big data platform. To build a concrete platform, you need to consider: does the business need real-time or quasi-real-time processing, and is the required data timeliness second-level or minute-level?

In terms of scheduling overhead: quasi-real-time data is processed in batches and therefore still needs the support of a scheduling system, with a high scheduling frequency, whereas real-time data has no scheduling overhead;

In terms of business flexibility: because quasi-real-time data is implemented on ETL or OLAP engines, it is more flexible than an implementation based on stream computing;

In terms of tolerance for late-arriving data: quasi-real-time data can be fully recomputed from the data of one cycle, so its tolerance for late data is relatively high, whereas real-time data uses incremental computation and its tolerance for late data is lower;

In terms of applicable scenarios: quasi-real-time data is mainly used in scenarios where the real-time requirement is present but not too strict, with multi-table joins and frequent business changes, such as real-time analysis of transactions; real-time data is better suited to scenarios with high real-time requirements and large data volumes, such as real-time analysis of features and traffic.

3. Real-time architectures

In some scenarios, the value of data decays over time. Therefore, on top of the traditional offline big data warehouse, higher requirements for data freshness have gradually been put forward.

Hence the real-time big data warehouse was born, and two technical architectures derived from it: Lambda and Kappa.

1. Lambda architecture

Let's look at the Lambda architecture first:


Lambda architecture diagram

Data is collected from the underlying data sources through components such as Kafka and Flume, and then split into two lines of computation:

One line enters a stream computing platform (such as Storm, Flink, or Spark Streaming) to compute some real-time indicators;

The other line enters an offline batch computing platform (such as MapReduce, Hive, or Spark SQL) to compute the T+1 business indicators, which only become visible the next day.

Why does the Lambda architecture split computation into two lines?

If the whole system had only a batch layer, users would have to wait a long time for results, usually with a delay of several hours: an e-commerce data analysis department could only view the previous day's statistics and could not obtain current results. That leaves a huge time gap for real-time decision making and can easily cause managers to miss the best moment to decide.

Lambda is a relatively early architecture. Early stream processing was not as mature as today: in accuracy, scalability, and fault tolerance, the stream layer could not directly replace the batch layer; it could only give users an approximate result, not a consistent and accurate one. Hence batch and stream processing coexist in the Lambda architecture.

In the Lambda architecture, each layer has its own task.

1. The batch layer stores and manages the master data set (an immutable data set) and the pre-computed batch views:

The batch layer uses a distributed processing system capable of handling very large amounts of data to pre-compute results. It achieves accuracy by processing all available historical data: it recomputes over the complete data set, so it can fix any error, and then updates the existing data views. The output is usually stored in a read-only database, and each update completely replaces the existing pre-computed views.

2. The stream processing (speed) layer processes newly arriving data in real time:

The speed layer minimizes latency by providing real-time views of the latest data. The views it produces may not be as accurate or complete as the ones the batch layer eventually produces, but they are available almost immediately after the data arrives. Once the batch layer has processed the same data, the corresponding data in the speed layer can be replaced.
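The division of labor between the two layers can be sketched as a toy serving-layer merge; the metric name and the numbers are hypothetical:

```python
# Lambda serving sketch: the batch view covers data up to a cutoff time;
# the speed layer covers everything after it. A query merges both.
batch_view = {"clicks": 1000}   # precomputed by the batch layer
speed_view = {"clicks": 42}     # incremental counts since the last batch run

def query(metric: str) -> int:
    """Merge the accurate batch view with the low-latency speed view."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

def batch_catch_up():
    """Once the next batch run re-covers the recent window, the batch view
    absorbs those counts and the speed layer's data for the window is
    replaced (reset)."""
    batch_view["clicks"] += speed_view["clicks"]
    speed_view["clicks"] = 0
```

The key property is that the query result stays stable across a batch catch-up: only where each count lives changes, not the total.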

What are the disadvantages of the Lambda architecture?

After years of development, Lambda's advantage is stability: the cost of real-time computation is controllable, and batch processing can use the night hours for bulk computation, separating the peaks of real-time and offline computing. This architecture supported the early development of the data industry, but it also has some fatal shortcomings and is increasingly ill-suited to the data analysis needs of the big data 3.0 era. The disadvantages are as follows:

Two big data processing engines: maintaining two complex distributed systems is very costly.

Batch jobs cannot finish within the computation window: in the IoT era, data volumes keep growing, and teams often find that the 4-5 hour window at night is not enough to process the 20+ hours of data accumulated during the day. Guaranteeing that the data is ready before work starts in the morning has become a headache for every big data team.

Data source changes require redevelopment, and the development cycle is long: every time the data source format or the business logic changes, both the ETL and the streaming jobs must be modified; the overall development cycle is long and the business response is not fast enough.

The root cause of these shortcomings is maintaining two system stacks at once, the batch layer and the speed layer: the batch layer is there because its results are highly accurate, and the speed layer is there because of its low latency when processing large-scale data.

Can we improve one layer so that it gains the strengths of the other?

For example, improve the batch layer so that it has lower latency, or improve the speed layer so that its data views are more accurate and closer to the historical data?

Another architecture commonly used in large-scale data processing, the Kappa architecture, was born out of this thinking.

2. Kappa architecture

Jay Kreps, a co-creator of Kafka, argued that in many scenarios maintaining a Lambda-style big data platform is time-consuming and laborious, and proposed that in some scenarios there is no need to maintain a batch layer: a stream processing layer alone can meet the requirements. That is the Kappa architecture shown in the figure below:


Kappa architecture

This architecture focuses only on stream computing: data is collected as streams, and the real-time computing engine writes the results into the data service layer for queries. The Kappa architecture can be considered a simplified version of Lambda, obtained simply by removing Lambda's offline batch part.

There are two main reasons for the rise of the Kappa architecture:

Kafka not only acts as a message queue but can also retain historical data for longer, replacing the batch-layer data warehouse of the Lambda architecture: the stream processing engine simply starts consuming from an earlier point in time, playing the role of batch processing.

The Flink stream processing engine solved the problem of result accuracy under out-of-order events.
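The replay idea can be sketched in a few lines of plain Python (a list stands in for a retained Kafka topic; the transform functions are made up): fixing logic in Kappa means rerunning the new job over the retained log, not maintaining a separate batch system.

```python
# Kappa reprocessing sketch: the log retains all history, so a corrected
# version of the stream job simply replays it from the beginning and
# rebuilds its state, then takes over serving.
log = [("u1", 1), ("u2", 2), ("u1", 3)]   # (key, value) events

def run_job(events, transform):
    """Consume the log in order, keeping a running per-key aggregate."""
    state = {}
    for key, value in events:
        state[key] = state.get(key, 0) + transform(value)
    return state

v1 = run_job(log, lambda v: v)        # original job logic: plain sum
v2 = run_job(log, lambda v: v * 10)   # corrected logic, full replay
```

In production the new job would consume from offset 0 of the real topic in parallel with the old one, and traffic would switch over once it catches up.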

The Kappa architecture is simpler and more real-time, and it requires far fewer computing resources than Lambda. As the demand for real-time processing grows, more and more enterprises are adopting Kappa. But this does not mean Kappa can replace Lambda.

Lambda and Kappa each have their own application areas. For example, when the stream processing and batch analysis pipelines are fairly unified and some error is tolerable, Kappa is more appropriate; for a small number of key indicators (such as transaction amounts or performance statistics), the Lambda architecture adds a batch computation as a verification step.

There are also complex scenarios in which batch and stream processing produce different results (different machine learning models, expert systems, or complex computations that are hard to do in real time); these may be better suited to the Lambda architecture.