Abstract: What is a data lake, and what does it do? Today, Huawei's cloud technology experts start from the theory and explain the topic from a technical perspective.
What is a data lake
A data lake can be defined as follows: it is a large repository that stores all kinds of an enterprise's raw data, where that data can be accessed, processed, analyzed, and transmitted.
A data lake obtains raw data from multiple data sources across the enterprise. For different purposes, it may hold multiple copies of the same raw data, each conforming to a specific internal model format. The data handled in a data lake can therefore be any type of information, from structured data to completely unstructured data.
Enterprises place high hopes on data lakes. They expect a data lake to help users obtain useful information quickly and to feed that information into data analysis and machine learning algorithms, yielding insight into how the enterprise operates.
The relationship between data lakes and enterprises
A data lake can bring a variety of capabilities to an enterprise. For example, it enables centralized management of data, and on that basis the enterprise can develop many capabilities it did not have before.
In addition, a data lake combines advanced data science and machine learning techniques, which can help enterprises build better-optimized operating models. It can also provide other capabilities, such as predictive analysis and recommendation models, which in turn drive further growth in the enterprise's capabilities.
Enterprise data hides many kinds of potential capability. However, until the important data reaches people with business insight, it cannot be used to improve the enterprise's business performance.
How data lakes help enterprises
For a long time, enterprises have been trying to find a unified model to represent all entities in the enterprise. This task is extremely challenging for many reasons, some of which are listed below:
1. An entity may have multiple representations in an enterprise, so there may not be a complete model to represent entities uniformly.
2. Different enterprise applications may handle an entity according to their own business objectives, which means that each application includes or excludes certain enterprise processes when dealing with it.
3. Different applications may adopt different access modes and storage structures for each entity.
These problems have plagued enterprises for many years, and hindered the standardization of business processing, service definition and terminology.
A data lake approaches this problem differently. By using a data lake, a better unified data model is implemented implicitly, without a substantial impact on the existing business applications. Those applications remain the “experts” at solving their specific business problems. The data lake, for its part, represents each entity as fully as possible, based on all the data captured from every system related to the entity's owner.
Because entities are represented better and more completely, a data lake brings real help to enterprise data processing and management, giving enterprises more insight into their growth and helping them achieve their business goals.
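As a minimal sketch of what representing an entity "as fully as possible" might look like, the Python example below merges the records that several source systems hold about the same customer into one combined view. The system names (`crm`, `billing`, `support`), the field names, and the `build_entity_views` helper are all hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical records for the same customer held by three source systems.
crm_records = [{"customer_id": "C1", "name": "Ada Li", "email": "ada@example.com"}]
billing_records = [
    {"customer_id": "C1", "billing_address": "1 Main St", "email": "ada@example.com"}
]
support_records = [{"customer_id": "C1", "open_tickets": 2}]


def build_entity_views(sources, key="customer_id"):
    """Merge per-system records into one combined view per entity.

    Every attribute remembers which system supplied it, so the source
    applications stay the owners of their own data.
    """
    entities = defaultdict(dict)
    for system_name, records in sources.items():
        for record in records:
            entity = entities[record[key]]
            for field_name, value in record.items():
                if field_name == key:
                    continue
                # Keep every system's value instead of picking a winner.
                entity.setdefault(field_name, {})[system_name] = value
    return dict(entities)


views = build_entity_views(
    {"crm": crm_records, "billing": billing_records, "support": support_records}
)
print(views["C1"])
```

Keeping one value per source system, rather than overwriting, is a deliberate choice in this sketch: the lake accumulates everything known about the entity and leaves conflict resolution to whoever consumes the view.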
Advantages of data lakes
Enterprises generate massive amounts of data in their many business systems. As an enterprise grows, it also needs to process this data across multiple systems more intelligently.
One of the most basic strategies is to adopt a single domain model that accurately describes the data and represents the part of it that is most valuable to the business as a whole. This data is the enterprise data mentioned above.
Enterprises with well-defined enterprise data certainly have some way to manage it, so that changes to the definition of enterprise data stay consistent and it is clear within the enterprise how systems share this information.
In this case, systems are divided into data owners and data consumers. A consumer's role is to define the data it needs from other systems.
Once an enterprise has clear definitions of its data and systems, it can make use of a great deal of enterprise information through this mechanism. A common implementation strategy is to build an enterprise data lake that provides a unified enterprise data model. In this arrangement, the data lake is responsible for capturing, processing, and analyzing data, and for providing data services to consumer systems.
A data lake can help enterprises in the following ways:
1. Realize data governance and data lineage.
2. Achieve business intelligence through the application of machine learning and artificial intelligence technology.
3. Support predictive analysis, such as domain-specific recommendation engines.
4. Provide information tracking and consistency assurance.
5. Generate new data dimensions based on analysis of historical data.
6. Provide a centralized store for all enterprise data, which makes it easier to build data services optimized for data delivery.
7. Help organizations make more flexible decisions about enterprise growth.
This section discussed what capabilities a data lake should have. Next, we look at how a data lake works and how to understand its working mechanism.
How a data lake works
To understand accurately what benefits a data lake can bring to an enterprise, it is important to understand how the data lake works and which components are needed to build a fully functional one. Before diving into the details of the data lake architecture, it helps to understand the data life cycle in a data lake.
At a high level, the data life cycle in the data lake is shown in the figure.
The life cycle above can also be seen as the different stages that data passes through in the data lake. The data and the analysis methods required at each stage differ. Data can be processed and analyzed in batch mode or in near real time.
A data lake implementation needs to support both processing modes, because each serves different scenarios. The choice between batch and near-real-time processing also depends on how much computation the processing or analysis task requires: many complex computations cannot be completed in near real time, while in other cases a longer processing cycle is unacceptable.
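The difference between the two processing modes can be illustrated with a small Python sketch. It computes the same page-view counts twice: once in a single pass over data that has already landed (batch), and once by updating a running result as each event arrives (near real time). The event data and the names `batch_page_counts` and `StreamingPageCounts` are made up for illustration and do not come from any particular data lake product.

```python
from collections import Counter

# Hypothetical click events: (user, page).
events = [("u1", "home"), ("u2", "home"), ("u1", "cart"), ("u3", "home")]


def batch_page_counts(all_events):
    """Batch mode: process the full data set in one pass after it has landed."""
    return Counter(page for _, page in all_events)


class StreamingPageCounts:
    """Near-real-time mode: update a running result as each event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        _, page = event
        self.counts[page] += 1
        # The up-to-date count is available immediately after each event.
        return self.counts[page]


batch_result = batch_page_counts(events)

stream = StreamingPageCounts()
for event in events:
    stream.on_event(event)

# Both modes agree once every event has been seen.
print(batch_result == stream.counts)
```

A simple count can be maintained incrementally, so both modes work here. Heavier computations that need the whole data set at once are the ones that tend to be left to batch processing.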
Similarly, the choice of storage system also depends on the requirements of data access. For example, if you want to store data so that you can easily access it through SQL queries, the storage system you choose must support the SQL interface.
If data access requires data views, the data must be stored in a corresponding form, so that it can be exposed as a view that is easy to manage and access.
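As a small illustration of both access requirements, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a storage system with an SQL interface. It stores a few rows and then exposes an aggregate through a view. The table name `orders`, the view name `region_totals`, and the sample rows are assumptions made for the example; a real data lake would use its own storage engine.

```python
import sqlite3

# An in-memory database stands in for an SQL-capable storage system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (region, amount) VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 30.0)],
)

# A view gives consumers a stable, named shape without copying the data.
conn.execute(
    "CREATE VIEW region_totals AS "
    "SELECT region, SUM(amount) AS total FROM orders GROUP BY region"
)

totals = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY region"
).fetchall()
conn.close()

print(totals)
```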
A trend of growing importance is to provide data through services, which means exposing data through a lightweight service layer. Each exposed service must accurately describe its function and the data it provides externally. This pattern also supports service-based data integration, so that other systems can consume the data provided by data services.
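One possible shape for such a self-describing data service is sketched below in Python. The `DataService` class, the `registry` dictionary, the `load_customers` function, and the field names are hypothetical; the sketch leaves out transport (HTTP, messaging) entirely and only shows a service that publishes a description and returns just the fields it advertises.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DataService:
    """A lightweight, self-describing wrapper around a data set (hypothetical)."""

    name: str
    description: str
    fields: List[str]
    fetch: Callable[[], List[dict]]

    def describe(self) -> dict:
        """Return the published description of this service."""
        return {
            "name": self.name,
            "description": self.description,
            "fields": self.fields,
        }

    def get(self) -> List[dict]:
        """Return rows restricted to the published fields."""
        return [{f: row[f] for f in self.fields} for row in self.fetch()]


def load_customers() -> List[dict]:
    # Stand-in for a query against the lake's storage layer.
    return [
        {"customer_id": "C1", "segment": "retail", "internal_score": 0.91},
        {"customer_id": "C2", "segment": "enterprise", "internal_score": 0.47},
    ]


# Consumers look services up by name instead of reading storage directly.
registry: Dict[str, DataService] = {}
service = DataService(
    name="customer-segments",
    description="Customer identifiers with their market segment.",
    fields=["customer_id", "segment"],
    fetch=load_customers,
)
registry[service.name] = service

rows = registry["customer-segments"].get()
print(service.describe())
print(rows)
```

Restricting the output to the advertised fields keeps the published description and the delivered data in step, so a consuming system can rely on the description alone.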
As data flows into the data lake from its collection point, its metadata is captured and managed throughout its life cycle for traceability, lineage, and security, according to the sensitivity of the data.
Data lineage is defined as the life cycle of data, including the origin of data and how data moves over time. It describes the changes of data in various processes, helps to provide the visibility of data analysis pipeline, and simplifies error tracing. Traceability is the ability to verify the history, location or application of data items by identifying records. (Source: Wikipedia)
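To make the idea of lineage concrete, here is a minimal Python sketch in which every derived data set remembers the data set it came from and the operation that produced it, so the path back to the origin can be traced. The `Dataset` class, the `transform` and `trace` helpers, and the operation names are hypothetical and not taken from any specific lineage tool.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A data set plus the minimal metadata needed to trace its lineage."""

    name: str
    rows: list
    operation: str = "ingest"
    parents: list = field(default_factory=list)


def transform(source, name, operation, func):
    """Derive a new data set and record where it came from."""
    return Dataset(
        name=name, rows=func(source.rows), operation=operation, parents=[source]
    )


def trace(dataset):
    """Walk back to the origin and return the steps, oldest first."""
    steps = []
    stack = [dataset]
    while stack:
        node = stack.pop()
        steps.append((node.name, node.operation))
        stack.extend(node.parents)
    return list(reversed(steps))


raw = Dataset("raw_orders", [{"amount": 10}, {"amount": None}, {"amount": 5}])
clean = transform(
    raw,
    "clean_orders",
    "drop_null_amounts",
    lambda rows: [r for r in rows if r["amount"] is not None],
)
total = transform(
    clean,
    "order_total",
    "sum_amount",
    lambda rows: [{"total": sum(r["amount"] for r in rows)}],
)

for step in trace(total):
    print(step)
```

If the final total looks wrong, the trace shows which intermediate data sets and operations to inspect, which is the error-tracing benefit described above.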
The difference between a data lake and a data warehouse
A data lake is often assumed to be the same thing as a data warehouse. In fact, the two represent different goals that an enterprise wants to achieve.
The key differences are shown in the table below.
From the table, the difference between a data lake and a data warehouse is clear. Within an enterprise, however, the two are complementary. We should not assume that the data lake has emerged to replace the data warehouse. After all, the two play quite different roles.
Approaches to building a data lake
Different organizations have different preferences and therefore build data lakes in different ways. The approach depends on the business, its processes, and its existing systems.
A simple data lake implementation amounts to defining a central data source that all systems use for all of their data needs. Although this approach may be simple and cost-effective, it is often impractical, for the following reasons:
1. It is feasible only when an organization is building its information systems from scratch.
2. It does not solve the problems related to existing systems.
3. Even if an organization decides to build a data lake this way, the result lacks clear responsibilities and separation of concerns.
4. Such a system usually tries to do everything at once, and it eventually falls apart as data transaction, analysis, and processing demands grow.
A better strategy is to treat the enterprise and its information systems as a whole, classify data ownership, and define a unified enterprise model.
Although this approach may bring process-related challenges and require more effort to define the system's elements, it provides the required flexibility, control, and clear data definitions, as well as separation of concerns between the different system entities in the enterprise.
Such a data lake can also have independent mechanisms to capture, process, and analyze data, and to provide data services to consumer applications.