One-stop data lake ingestion from multiple data sources

Time: 2020-10-31

Introduction: One-stop ingestion unifies data from different data sources into centralized data lake storage built on OSS object storage, solving the data silo problem faced by enterprises and laying a solid foundation for unified data analysis.

Background

As a centralized data repository, a data lake supports a variety of data types, including structured, semi-structured, and unstructured data, from sources such as databases, binlog incremental streams, logs, and existing data warehouses. A data lake keeps this data in cost-effective storage such as OSS object storage and provides a unified way to analyze it, which effectively solves the data silo problem faced by enterprises and greatly reduces the cost of storing and using data.

One-stop ingestion from multiple data sources

Given the diversity of data sources feeding a data lake, migrating data from these heterogeneous sources into centralized data lake storage simply and efficiently is a core problem in building one. We therefore need a complete one-stop ingestion capability that solves the following problems:

  • Unified ingestion for heterogeneous data sources

Provide a simple, unified ingestion method that lets users ingest heterogeneous data sources with just a few page configurations.

  • Timely data ingestion

For log, binlog, and similar data sources, data must reach the lake with minute-level latency to meet the timeliness requirements of real-time interactive analysis scenarios.

  • Support for real-time changes in source data

For data sources such as databases and Tablestore tunnels, the source data changes continuously, with update and delete operations at the row level and even field changes at the schema level; a suitable file format is needed to support such changes.

To this end, Alibaba Cloud launched the new Data Lake Formation (DLF) service, which provides a complete one-stop ingestion solution.

Overall solution

The technical architecture of data lake ingestion is shown in the following figure:

(Figure: overall architecture of one-stop data lake ingestion)

Ingestion is divided into four parts: ingestion templates, the ingestion engine, file formats, and data lake storage.

Ingestion templates

Ingestion templates define common ways for data sources to enter the lake. There are currently five templates: RDS full load, DTS incremental, Tablestore, SLS, and file format conversion.

(Figure: the available ingestion templates)

The user selects the template matching the data source, fills in the source-related parameters to create an ingestion task, and submits it to the ingestion engine to run.

Ingestion engine

The ingestion engine uses Spark Streaming SQL and the EMR Spark engine developed by the Alibaba Cloud EMR team. Built on Spark Structured Streaming, Streaming SQL provides fairly complete streaming SQL syntax, which greatly reduces the development cost of real-time computing. For real-time incremental templates, the upper-layer ingestion template is translated into Streaming SQL and submitted to a Spark cluster to run; we extended Streaming SQL with MERGE INTO syntax to support update and delete operations. Full-load templates such as RDS are translated directly into Spark SQL and run.
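As an illustration, a full-load template could translate into plain Spark SQL along the following lines. This is only a sketch: the table names, JDBC options, and the assumption that lake_orders is a Delta table on OSS (see the storage section below) are all illustrative, not the SQL that DLF actually generates.

-- Hypothetical translation of an RDS full-load template.
-- All names and connection options are illustrative only.
CREATE TABLE IF NOT EXISTS rds_orders_src
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:mysql://rds-host:3306/shop',
  dbtable 'orders',
  user '...',        -- credentials omitted
  password '...'
);

-- Snapshot the source table into the lake table.
INSERT OVERWRITE TABLE lake_orders
SELECT * FROM rds_orders_src;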

File formats

The file formats supported by DLF include Delta Lake, Parquet, JSON, and others, and more formats such as Hudi are being added. Formats such as Delta Lake and Hudi support operations like update and delete as well as schema merging, which solves the problem of real-time changes in source data.
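For example, once a table is stored as Delta Lake, row-level changes can be expressed directly in SQL (supported by Delta Lake on Spark 3.0 and later; the table and column names here are illustrative):

-- Row-level changes on a Delta table (names illustrative).
UPDATE lake_orders SET status = 'closed' WHERE order_id = 1001;
DELETE FROM lake_orders WHERE status = 'cancelled';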

Data lake storage

Data in the lake is stored in OSS object storage. OSS can store massive amounts of data and has advantages in reliability, price, and other respects.
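Concretely, a lake table is simply a table whose files live on OSS. A minimal sketch (the bucket, path, and schema are illustrative):

-- A Delta table backed by OSS object storage (bucket and path illustrative).
CREATE TABLE IF NOT EXISTS lake_orders (
  order_id BIGINT,
  status   STRING
)
USING delta
LOCATION 'oss://my-bucket/lake/orders';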

One-stop ingestion solves the problems raised above:

  • Unified ingestion for heterogeneous data sources

Template-based configuration provides a unified, simple way to ingest data.

  • Timely data ingestion

The self-developed Streaming SQL enables real-time ingestion with minute-level latency.

  • Support for real-time changes in source data

Introducing file formats such as Delta Lake satisfies real-time change requirements such as update and delete.

Real-time ingestion

As big data keeps developing, users have ever higher requirements for data timeliness, so real-time ingestion is a focus of our work. We currently support real-time ingestion from DTS, Tablestore, and SLS.

Real-time ingestion of DTS incremental data

DTS is a highly reliable data transmission service provided by Alibaba Cloud that supports subscribing to and consuming incremental data from different types of databases. We ingest DTS real-time subscription data into the lake, and support both ingesting through an existing subscription channel and automatically creating a subscription channel, reducing the user's configuration cost.

(Figure: real-time ingestion of DTS incremental data)

Technically, update and delete operations in the incremental data are applied to the historical data, making source changes visible in the lake with minute-level latency. In the implementation, MERGE INTO syntax was extended in Streaming SQL to interface with Delta Lake:

-- Merge parsed binlog records into the Delta table.
-- {{binlog_parser_subquery}} stands for the subquery that parses the
-- binlog stream into (recordType, pk, ...) rows.
MERGE INTO delta_tbl AS target
USING (
  select recordType, pk, ...
  from {{binlog_parser_subquery}}
) AS source
ON target.pk = source.pk
WHEN MATCHED AND source.recordType='UPDATE' THEN
UPDATE SET *    -- apply upstream updates to matching rows
WHEN MATCHED AND source.recordType='DELETE' THEN
DELETE          -- remove rows deleted upstream
WHEN NOT MATCHED THEN
INSERT *        -- insert newly added rows

Compared with traditional binlog warehousing, the data-lake-based scheme has clear advantages. In a traditional data warehouse, capturing changes from a database usually requires maintaining two tables: an incremental table that stores the newly arrived change records, and a full table that stores all merged historical data. Every day the full table is merged with the incremental table by primary key to produce a new full table. The data-lake-based scheme is clearly better in both simplicity and timeliness.
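For contrast, the traditional daily merge is typically a Hive-style batch job along these lines, rebuilding the full table on every run (table, column, and partition names are illustrative):

-- Traditional T+1 merge: build today's full partition from yesterday's
-- full partition plus today's incremental changes, keeping only the
-- latest record per primary key and dropping deleted rows.
INSERT OVERWRITE TABLE full_tbl PARTITION (ds = '2020-10-31')
SELECT pk, col1, ts
FROM (
  SELECT *,
         row_number() OVER (PARTITION BY pk ORDER BY ts DESC) AS rn
  FROM (
    SELECT pk, col1, ts, 'INSERT' AS recordType
    FROM full_tbl WHERE ds = '2020-10-30'
    UNION ALL
    SELECT pk, col1, ts, recordType
    FROM incr_tbl WHERE ds = '2020-10-31'
  ) unioned
) ranked
WHERE rn = 1 AND recordType <> 'DELETE';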

Real-time ingestion from Tablestore

Tablestore is a NoSQL multi-model database developed by Alibaba Cloud that provides storage for massive structured data along with fast query and analysis. It also offers a tunnel feature for consuming changed data in real time. We support ingestion from Tablestore full tunnels, incremental tunnels, and full-plus-incremental tunnels: a full tunnel contains all historical data, an incremental tunnel contains the incremental change data, and a full-plus-incremental tunnel contains both.

(Figure: real-time ingestion from Tablestore tunnels)

Real-time ingestion of SLS logs

SLS is Alibaba Cloud's one-stop log data service and mainly stores user log data. Archiving SLS log data to the data lake in real time for analysis and processing makes it possible to fully mine the value of the data. Currently, by choosing the SLS ingestion template and filling in a small amount of information such as the project and logstore, logs can be ingested into the lake in real time.
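Conceptually, the SLS template resolves to a streaming read from the logstore followed by a continuous write into a lake table. The sketch below conveys the shape only; the loghub source name and option keys are assumptions about the EMR connector, not its confirmed API.

-- Rough sketch only: the 'loghub' source name and option keys below are
-- assumptions, not the confirmed connector API.
CREATE TABLE IF NOT EXISTS sls_logs_src
USING loghub
OPTIONS (
  sls.project 'my-project',    -- SLS project (illustrative)
  sls.store 'my-logstore',     -- SLS logstore (illustrative)
  endpoint 'cn-hangzhou.log.aliyuncs.com'
);

-- Continuously append log records to the lake table.
INSERT INTO lake_logs
SELECT * FROM sls_logs_src;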

Summary and outlook

The one-stop ingestion function greatly reduces the cost of ingesting heterogeneous data sources, meets the timeliness requirements of data sources such as SLS and DTS, and supports real-time changes in source data. Ingesting data from different data sources into centralized data lake storage built on OSS solves the data silo problem faced by enterprises and lays a solid foundation for unified data analysis.

Going forward, one-stop ingestion will improve on two fronts. On the one hand, it will support more types of data sources, open more ingestion-template capabilities to users, and support custom ETL for greater flexibility. On the other hand, it will continue to invest in performance optimization to provide better timeliness and stability.

