With the rapid growth of the Internet, more and more people are online, generating large amounts of data through browsers, smart devices, and other equipment. Along the way, a number of data concepts have emerged, such as the database, data warehouse, data lake, data mart, and data middle platform. These concepts are intertwined and interrelated. What are they, how are they used, and how do they relate to one another? This article gives the whole picture.
A database is, in essence, a two-dimensional relational storage system that stores structured data, such as a school's student information table or a grade's transcript table. Because it is simple to use and highly structured, it has greatly advanced the development of the Internet. Databases fall into two kinds: operational and analytical.
An operational database mainly serves transactional operations that support day-to-day business, such as purchasing goods, ordering takeout, or hailing a ride on Didi.
An analytical database mainly analyzes historical data, such as the sales volume of a product, the order volume of a store, or the number of rides a driver has completed.
An operational database is write-heavy and read-light, its data changes constantly, and it does not need to retain data for long, so it cannot be the same database as an analytical one. An analytical database is read-heavy and write-light, its data is largely stable, and it retains data for a long time. As demands on analytical data grow, we want to analyze along more dimensions, and a traditional analytical database struggles to keep up. For example, to find out under what circumstances a Taobao store's pizza sells best, we need to combine the pizza information table, the order sales table, the consumer information table, a China weather table, and others, in order to analyze which weather, location, flavor, and price sell best. This is how the data warehouse came into being.
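The pizza question can be sketched as a multi-table join. The following is a minimal illustration using Python's built-in sqlite3 module; the table names, columns, and sample rows are all hypothetical and invented for this example, not taken from any real store's data.

```python
import sqlite3


def build_sample_db():
    """Create an in-memory database with four small, made-up tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE pizzas (pizza_id INTEGER PRIMARY KEY, flavor TEXT, price REAL);
        CREATE TABLE consumers (consumer_id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE orders (
            order_id INTEGER PRIMARY KEY,
            pizza_id INTEGER,
            consumer_id INTEGER,
            order_date TEXT,
            quantity INTEGER
        );
        CREATE TABLE weather (city TEXT, obs_date TEXT, weather_type TEXT);

        INSERT INTO pizzas VALUES (1, 'durian', 59.0), (2, 'pepperoni', 49.0);
        INSERT INTO consumers VALUES (1, 'Hangzhou'), (2, 'Beijing');
        INSERT INTO orders VALUES
            (1, 1, 1, '2021-06-01', 3),
            (2, 2, 1, '2021-06-01', 1),
            (3, 1, 2, '2021-06-01', 2),
            (4, 1, 1, '2021-06-02', 4);
        INSERT INTO weather VALUES
            ('Hangzhou', '2021-06-01', 'rainy'),
            ('Beijing',  '2021-06-01', 'sunny'),
            ('Hangzhou', '2021-06-02', 'rainy');
    """)
    return conn


def best_selling_conditions(conn):
    """Return (weather, city, flavor, total_sold) rows, best-selling first."""
    query = """
        SELECT w.weather_type, c.city, p.flavor, SUM(o.quantity) AS total_sold
        FROM orders AS o
        JOIN pizzas    AS p ON p.pizza_id = o.pizza_id
        JOIN consumers AS c ON c.consumer_id = o.consumer_id
        JOIN weather   AS w ON w.city = c.city AND w.obs_date = o.order_date
        GROUP BY w.weather_type, c.city, p.flavor
        ORDER BY total_sold DESC, w.weather_type, c.city, p.flavor
    """
    return conn.execute(query).fetchall()
```

With the sample rows, `best_selling_conditions(build_sample_db())` ranks durian pizza in rainy Hangzhou first, with 7 units sold. Every extra dimension adds another join, which is part of why this kind of question gets awkward to answer from operational tables alone.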
A data warehouse is, in essence, a subject-oriented, integrated, relatively stable collection of data that reflects historical change. Its scope is larger than that of a database. Subject-oriented means that the information in the warehouse is aggregated around particular subjects, such as region, cost, product, revenue, or profit. Integrated means that data from different databases is brought together. Relatively stable means that data in the warehouse does not change as often as data in an operational database. Reflecting historical change means that the warehouse not only captures the current state of the enterprise but also records the changes from some point in the past up to the present, so that they can be analyzed.
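The contrast between an operational table and a warehouse table that reflects historical change can be sketched as follows. This is only an illustration under assumed names: `op_product`, `dw_price_history`, and the helper functions are hypothetical, and a real warehouse would normally be loaded by a separate batch or streaming process rather than in the same call.

```python
import sqlite3


def make_db():
    """Create one operational table and one warehouse-style history table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE op_product (product TEXT PRIMARY KEY, price REAL);
        CREATE TABLE dw_price_history (product TEXT, price REAL, as_of TEXT);
    """)
    return conn


def record_price(conn, product, price, as_of):
    """Overwrite the operational row and append a dated warehouse row."""
    # Operational side: only the latest value survives.
    conn.execute(
        "INSERT OR REPLACE INTO op_product (product, price) VALUES (?, ?)",
        (product, price),
    )
    # Warehouse side: rows are appended, never updated, so history is kept.
    conn.execute(
        "INSERT INTO dw_price_history (product, price, as_of) VALUES (?, ?, ?)",
        (product, price, as_of),
    )


def current_price(conn, product):
    """Read the current price from the operational table."""
    row = conn.execute(
        "SELECT price FROM op_product WHERE product = ?", (product,)
    ).fetchone()
    return row[0] if row else None


def price_history(conn, product):
    """Read every recorded price from the warehouse table, oldest first."""
    return conn.execute(
        "SELECT as_of, price FROM dw_price_history "
        "WHERE product = ? ORDER BY as_of",
        (product,),
    ).fetchall()
```

After recording a price of 49.0 on 2021-01-01 and 59.0 on 2021-06-01 for the same product, `current_price` returns only 59.0, while `price_history` returns both dated rows.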
As data warehouses developed, the concepts of the data mart and business intelligence (BI) also emerged. A data mart is a small data warehouse that focuses on a single subject. For example, a mart that focuses on cost contains only cost-related data. Its data can come from its own source database, or it can pull the data for its subject from the data warehouse. Business intelligence is a more advanced form of operational analysis: after obtaining analytical data from the data warehouse, BI staff assess the current business and, combining the business situation, market conditions, and the analytical data, provide decision support to management.
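One simple way to picture a data mart that draws a single subject from the warehouse is as a filtered view. The sketch below assumes a hypothetical warehouse table `dw_facts` with a `subject` column; real marts are often separate physical stores, so treat the view only as an illustration of the idea.

```python
import sqlite3


def build_warehouse():
    """Create a toy warehouse table plus a cost-only data mart view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE dw_facts (subject TEXT, region TEXT, period TEXT, amount REAL);
        INSERT INTO dw_facts VALUES
            ('cost',    'east', '2021-Q1', 120.0),
            ('cost',    'west', '2021-Q1',  80.0),
            ('revenue', 'east', '2021-Q1', 300.0),
            ('profit',  'east', '2021-Q1', 180.0);

        -- The mart exposes only the rows for one subject.
        CREATE VIEW cost_mart AS
            SELECT region, period, amount
            FROM dw_facts
            WHERE subject = 'cost';
    """)
    return conn


def total_cost_by_region(conn):
    """Aggregate the cost mart by region and return {region: total}."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM cost_mart GROUP BY region"
    )
    return dict(cursor.fetchall())
```

An analyst who only cares about cost queries `cost_mart` and never sees the revenue or profit rows.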
A data lake is a repository that is larger than a data warehouse and places no restrictions on the data it holds. Data can flow in it naturally, like water in a lake, and can be stored, processed, and analyzed there. Data in the lake is imported directly from source systems without any processing, and it includes structured, semi-structured, and unstructured data. The lake also serves as a data source for the data warehouse. In addition, it is used in machine learning, predictive analytics, information tracking, and similar scenarios, supplying large volumes of data for data scientists to train models and build recommendation engines in a given domain. The difference between data warehouse and data lake can be seen in the following table.
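The idea of storing data unprocessed and interpreting it only when it is read can be sketched with a toy in-memory lake. The `TinyLake` class, the object paths, and the `read_orders` helper are hypothetical names made up for this illustration; a real lake would sit on a distributed file system or object storage.

```python
import csv
import io
import json


class TinyLake:
    """A toy in-memory lake: stores raw payloads untouched, keyed by path."""

    def __init__(self):
        self._objects = {}

    def put(self, path, raw_bytes):
        # No parsing and no validation: data lands exactly as the source sent it.
        self._objects[path] = raw_bytes

    def get(self, path):
        return self._objects[path]

    def paths(self):
        return sorted(self._objects)


def read_orders(lake, jsonl_path, csv_path):
    """Interpret two differently shaped raw objects as (item, qty) rows at read time."""
    rows = []
    # Semi-structured source: one JSON document per line.
    for line in lake.get(jsonl_path).decode("utf-8").splitlines():
        record = json.loads(line)
        rows.append((record["item"], int(record["qty"])))
    # Structured source: CSV with a header row.
    reader = csv.DictReader(io.StringIO(lake.get(csv_path).decode("utf-8")))
    for record in reader:
        rows.append((record["item"], int(record["qty"])))
    return rows


lake = TinyLake()
lake.put("orders/new.jsonl", b'{"item": "pizza", "qty": 2}\n{"item": "cola", "qty": "1"}\n')
lake.put("orders/legacy.csv", b"item,qty\npizza,3\n")
lake.put("reviews/1.txt", b"Great pizza, slow delivery.")  # unstructured text
```

Here the lake accepts all three objects as they are, and `read_orders` decides how to interpret the two order files only when a consumer asks for them. A warehouse load would typically start from a step like this and then write the cleaned rows into subject-oriented tables.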
The data middle platform is, in essence, a data analysis system that serves the business, and it was built for the business from the start. A data warehouse offers statistical, single-domain, passive, non-real-time analysis, which cannot meet an enterprise's need for multi-dimensional, proactive, predictive, real-time, and diversified analysis. This is why the data middle platform came into being. The data middle platform product as a whole is a closed-loop solution rather than just one step in the business process. It comprises five modules: event tracking (data instrumentation), standardized data ingestion, data warehouse abstraction, data governance, and data services. Together they connect the dimensions of people, goods, and venues and better serve the front-end business. In addition, organizational culture matters a great deal when building a data middle platform: the various business lines must be brought into the system for standardized governance and management, something that building a data warehouse does not require. The data middle platform is therefore another qualitative leap beyond the data warehouse.
Databases, data lakes, data warehouses, data marts, and data middle platforms are all data processing solutions that arose from different needs at different stages. None of them is obsolete; each still has its own use cases today, and we can build whichever fits our own requirements.