The most commonly used model in data warehouse construction — detailed explanation of Kimball dimensional modeling

Time:2021-3-5

Data warehouse Toolbox: an authoritative guide to dimensional modeling,This article is based on this book

The first official account is:Learn big data in five minutesIn the official account, the “third dimensional e-book” can be obtained by sending “dimensional modeling”.

This book is based on the author’s 60 years of practical business environment and summarizes the experience and lessons, to provide readers with formal dimension design and development technology. For data warehouse and Bi designers, the book covers a wide range of contents, organized around a series of business scenarios or case studies. It is strongly recommended to buy a physical book for research and read the whole book at least three times. Your technology will have a qualitative leap.

The most commonly used model in data warehouse construction -- detailed explanation of Kimball dimensional modeling

Because this article is pure theoretical knowledge, dense words, many people may not be able to read, so I try to use the least words to express, as far as possible will be difficult to understand the words into easy to understand words, the emphasis of the article is bold display, the content is as simple as possible, in order to ensure that in the case of no expression error is more conducive to the reader’s learning! I hope we can learn together, make progress together, and strive to reach the top of our own pyramid

What is dimensional modeling

Dimension model is advocated by Ralph Kimball, the master of data warehouse,Based on the demand of analysis and decision-making, the model is constructed, and the data model is built to serve the analysis demand. Therefore, it focuses on how to complete the analysis demand more quickly, and has better response performance for large-scale complex queries

Dimension modeling is the key to the success of a data warehouse / business intelligence project. Why do we say that? No matter whether our data volume is from GB to Tg or Pb, although the data volume is increasing, the success of data presentation must be based on simplicity,Dimension modeling is always considering how to provide simplicity, business driven, user understanding and query performance as the goal

Dimension modeling: dimension modeling is a method specially applied to the modeling of analytical database, data warehouse and data market. Data market can be understood as a “small data warehouse”
Dimension modeling guides us how to build tables in data warehouse

Dimension modeling is divided into two kinds of tables: fact table and dimension table

  1. Fact sheet: there must be some data, such as the collected log file, order table, can be used as fact table

Feature: it is a collection of primary keys. Each primary key corresponds to a record in the dimension table. It exists objectively and determines the data to be used according to the subject

  1. Dimension tableDimension is a quantity of data to be analyzed, and dimension table is a table created from an appropriate angle to analyze problems from time, region, terminal, user, etc

Three models of dimension modeling

  1. star schema : Centered on the fact table, all dimension tables are directly connected to the fact table, which is the simplest and most commonly used one

The most commonly used model in data warehouse construction -- detailed explanation of Kimball dimensional modeling

  1. snowflake schema : snowflake mode dimension table can have other dimension tables, which are not easy to maintain and are generally not recommended

The most commonly used model in data warehouse construction -- detailed explanation of Kimball dimensional modeling

  1. constellation model : Based on multiple fact tables and sharing dimension information, that is, some dimension tables can be shared between fact tables

The most commonly used model in data warehouse construction -- detailed explanation of Kimball dimensional modeling

How to build dimension modeling

We know the concepts of fact table, dimension table, star model and constellation model, but in the actual business, we are given a lot of data. How can we use these data for data warehouse construction? The author of data warehouse toolbox summarizes the following four steps according to his own 60 years of actual business experience, please remember!

There are four steps for dimension modeling in data warehouse toolbox

The most commonly used model in data warehouse construction -- detailed explanation of Kimball dimensional modeling

pleaseremember wellThe above four steps, no matter what business, follow this step. Don’t mess up the order, because these four steps are closely linked. How to do each step in detail

1. Select business process
Dimension modeling is close to business, so it must be based on business. Choosing business process, as the name suggests, is to select the business we need to model in the whole business process, and select the business according to the demand provided by operation and the future scalability. For example, the whole mall process is divided into business end, user end and platform end. The operation demand is the total order quantity, the number of orders, and the purchase situation of users. We choose the data of user end when we choose the business process, and the business and platform end are not considered. Business selection is very important because all the following steps are based on this business data.

2. Declaration granularity
Mobile phone ID number, registered residence address mobile phone number and multiple bank cards are the first example. For users, the granularity attribute of the same size as the user size has the size of ID card, the size of registered residence address, the granularity of the mobile phone than the granularity of the user, and the size of the bank card is one size to one. Why should we mention the same granularity? Because dimension modeling requires us toSame fact tableMust haveSame granularityIn the same fact table, do not mix multiple different granularity data, different granularity data to establish different fact tables. When obtaining data from a given business process, it is strongly recommended to focus on atomic granularity, that is, the most fine-grained, because atomic granularity can withstand unexpected user queries. However, the granularity of roll up summary is very important to improve the query performance, so for the data with clear requirements, we establish the granularity of roll up summary for the requirements, and for the data with uncertain requirements, we establish the atomic granularity.

3. Confirmation dimension
Dimension table is the entry and descriptive identification of business analysis, so it is also known as the “soul” of data warehouse. How to confirm which dimension attributes are in a pile of data? If the column is a description of a specific value, a text or constant, a participant of a constraint or row ID, then the attribute is often a dimension attribute, which is told in the data warehouse toolboxWith a firm grasp of the granularity of the fact table, all possible dimensions can be distinguished, and toEnsure that duplicate data cannot appear in the dimension table, and make the dimension primary key unique

4. Confirmation of facts
Fact table is used to measure, which is basically expressed by quantity value. Each row in fact table corresponds to a measure, and the data in each row is a specific level of detail data, which is called granularity. One of the core principles of dimensional modelingIs that all measures in the same fact table must have the same granularity. This ensures that there is no problem of double counting metrics. Sometimes it is not sure whether the column data is a fact attribute or a dimension attribute. rememberThe most practical facts are numerical type and additive type facts. Therefore, we can analyze whether the column is a measure that contains multiple values and serves as a participant in the calculation. In this case, the column is often a fact.

Type of fact table

The fact table is divided into the following six categories:

  1. Fact sheet
  2. periodic snapshot fact table
  3. Cumulative snapshot fact table
  4. Fact table without facts
  5. Aggregate fact table
  6. Consolidated fact sheet

Briefly explain the concept of each table:

  • Fact sheet

A row in the table corresponds to a metric event at a point in space or time. There must be a measurement field in a row of data. What is a measurement is an indicator, such as sales amount, sales quantity, etc. These can be added or semi added are the measurement values. Another point is that the transaction fact table contains a foreign key associated with the dimension table. And the measurement must be consistent with the transaction granularity.

  • periodic snapshot fact table

As the name suggests, the period fact table is that each row has a time value field, which represents the period. Generally, the time value is a standard period, such as a day, a week, a month, etc. Granularity is a cycle, not an individual transaction. That is to say, the data in a cycle snapshot fact table can be multiple facts, but they all belong to a certain cycle.

  • Cumulative snapshot fact table

The cycle snapshot fact table is the data in a single cycle, while the cumulative snapshot fact table is composed of multiple cycle data. Each row summarizes the measurement from the beginning to the end of the process. Each row of data is equivalent to a pipeline or workflow, with the start, process and end of the event, and each key step contains a date field. For example, in order data, a row in the cumulative snapshot fact table is an order. When an order is generated, a row is inserted. When an order changes, the row is modified.

  • Fact table without facts

The fact table metrics we discussed above are all digital. Of course, most of them are digital metrics in practical applications, but there may be a small number of valuable fields without digital values. The fact table without facts is prepared for this kind of data, and we can analyze what happened by using this fact table.

  • Aggregate fact table

Aggregation is a simple aggregation operation of atomic granularity data, and the purpose is to improve the query performance. For example, if we need to query the total sales of all stores in the country, each row in our atomic granularity fact table is the sales of each store and each commodity. If we aggregate the fact table, we can first aggregate the total sales of each store. In this way, the amount of data calculated when summarizing the sales of all stores will be much smaller.

  • Consolidated fact sheet

This kind of fact table follows a principle, that is, the same granularity, data can come from multiple processes, but as long as they belong to the same granularity, they can be combined into a fact table. This kind of fact table is especially suitable for multi process metrics that often need to be analyzed together.

Dimension table technology

  1. Dimension table structure

Dimension tables should keep in mind a principle that they contain a single primary key column. However, sometimes due to complex business, joint primary keys may occur. Please try to avoid it. If it cannot be avoided, it is important to ensure that the primary key of dimension tables is single. If the primary key of dimension tables is not single, data divergence may occur when it is associated with fact tables, which may lead to errors in the final result.

Dimension tables are usually wide and contain a large number of low granularity text attributes.

  1. Cross table drilling

Cross table drilling means that when the row header of each query contains the same consistency attribute, different queries can query two or more fact tables

Drilling can change the level of dimension and the granularity of analysis. It includes up / down drilling:

Roll up: roll up is the aggregation of data up the dimension level. For example, for product sales data, the monthly (or quarterly or annual) sales of all products in all regions can be calculated by rolling up along the time dimension.

Drill down: drill down is the reverse operation of drill up. It is down the dimension level to view more detailed data.

  1. Degenerative dimension

Degenerate dimension is to return dimension to fact table. Because sometimes a dimension has no content other than the primary key. Although it is also a legal dimension key, it is usually returned to the fact table to reduce the number of association and improve the query performance

  1. Multi level dimension

Most dimensions contain more than one natural level. For example, the date dimension can range from day level to weekday month to year level. So in some cases, there are different levels in the same dimension.

  1. Dimension table null attribute

When the given dimension row is not filled in completely, or when the existing attribute is not applied to all dimension rows, the null value dimension attribute will be generated. In the above two cases, it is recommended to use descriptive string instead of null value, such as using unknown or not applicable instead of null value.

  1. Calendar date dimension

In the date dimension table, the setting of the primary key should not be represented by the ID generated in sequence, but can be represented by more meaningful data, such as yyyymmdd or more detailed precision.

last

The latest article can be taken as public official account.Learn big data in five minutes

Reply in official accountDimensional modelingYou can get the electronic version of digital books