How can a data warehouse realize lake-warehouse integrated data analysis?

Time: 2021-04-15

Introduction: With the popularization of cloud computing and the growth of data analysis demand, integrated analysis across the data lake and the data warehouse has become a core capability of next-generation data analysis systems. Compared with a data warehouse, a data lake has clear advantages in cost, flexibility, and multi-source data analysis. Among IDC's ten predictions for China's cloud computing market in 2021, three relate to data lake analysis. It is foreseeable that cross-system integration, data governance, and more comprehensive data-driven capabilities will be key competitive areas for data analysis systems.

1. Background


AnalyticDB PostgreSQL (ADB PG for short) is a cloud-native data warehouse built by the Alibaba Cloud database team on the PostgreSQL kernel (PG for short). ADB PG has distinctive technical strengths in business scenarios such as PB-scale real-time interactive analysis, HTAP, ETL, and BI reporting. As a data warehouse product, how does ADB PG provide integrated lake-warehouse analysis? This article describes how ADB PG builds that capability on top of PG foreign tables.


ADB PG inherits PG's foreign table functionality, and its lake-warehouse integration is currently built mainly on foreign tables. Through PG foreign tables, ADB PG can query and write data held in other data analysis systems; while staying compatible with a variety of data sources, it reuses the strengths of ADB PG's existing optimizer and execution engine. ADB PG's lake-warehouse analysis currently supports reading from and writing to multiple data sources, including OSS, MaxCompute, Hadoop, RDS for PostgreSQL, Oracle, and RDS for MySQL. Users can flexibly apply ADB PG to data storage, interactive analysis, ETL, and other fields, and implement several data analysis functions within a single instance. In other words, ADB PG can run the core data analysis workflow end to end, or serve as one stage in a larger data pipeline.

However, foreign table analysis relies on external SDKs and network IO for reading and writing data. Because the characteristics of a network differ enormously from those of a local disk, it requires performance optimizations different from those used for local storage. Taking OSS foreign table reads and writes as an example, this article introduces several important problems ADB PG encountered while building its lake-warehouse analysis capability, and how they were solved.

2. Problem analysis

The ADB PG kernel can be divided into an optimizer, an execution engine, and a storage engine. Foreign table analysis reuses the core of the existing optimizer and execution engine with only minor modifications; the main extension is at the storage engine layer, which reads and writes external table data through the foreign table interface. Foreign table data lives in another distributed system and must be reached over the network, and this is the core difference from reading local files. On one hand, different external systems expose different remote access interfaces, all of which must be supported in engineering; for example, the data reading interfaces of OSS and MaxCompute differ. On the other hand, accessing data on remote machines over a network raises common issues such as network latency, network amplification, bandwidth limits, and network stability.


This article focuses on these core challenges and introduces several key technical points from the ADB PG foreign table analysis project in supporting OSS data analysis. OSS is a low-cost distributed storage service launched by Alibaba Cloud; it stores large volumes of hot and cold data and sees heavy demand for data analysis. To make it easy for developers to build on, OSS provides SDKs for mainstream languages such as Java, Go, C/C++, and Python; ADB PG is developed against the OSS C SDK. ADB PG now fully supports OSS foreign table analysis: apart from different table creation statements, users can access an OSS foreign table just like a local table, with concurrent reads and writes and common data formats such as CSV, ORC, and Parquet.


3. Foreign table analysis optimizations

Next, we introduce several core technical problems that ADB PG solved while developing OSS foreign table analysis on top of the OSS C SDK.

3.1 Fragmented network requests

In analytical database scenarios, columnar storage is generally considered to outperform row storage in IO: a columnar scan reads only the referenced columns, while a row scan reads the full rows, so columnar formats can save IO. During development, however, the team found that in some scenarios, such as scanning a large, wide table with many fields, the supposedly faster columnar formats performed worse than CSV row-oriented text. Investigation revealed two causes: when scanning ORC/Parquet, the client interacted with the OSS server too frequently, and each request ADB PG issued to OSS fetched a relatively small amount of data. Together these caused serious performance problems.

Compared with local disk IO, the round-trip latency of network IO can be several orders of magnitude higher. If network requests are issued as though they were local disk reads when parsing columnar formats such as ORC/Parquet, the bandwidth saved by high compression ratios is not enough to offset the round-trip latency amplified by fragmented requests, so performance falls short of expectations. The solution is to reduce fragmented network requests through caching. Each time ADB PG scans OSS data, it "preloads" a sufficiently large chunk and caches it; each subsequent request first checks the cache and is served directly on a hit, otherwise the next round of "preloading" begins. This reduces the number of network requests and makes each individual request more efficient. The "preload" cache size is configurable and defaults to 1MB.
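To make the mechanism concrete, here is a minimal sketch of such a "preload" cache in C. It is an illustration only, not the actual ADB PG code: `oss_read_range` is a hypothetical network primitive, and the cache assumes single-threaded use, a pre-allocated buffer, and reads no larger than the preload size.

```c
/*
 * Minimal sketch of the "preload" read cache described above.
 * oss_read_range() is a hypothetical network primitive; this is not
 * the actual ADB PG implementation. Assumes single-threaded use,
 * c->buf pre-allocated to PRELOAD_SIZE bytes, and len <= PRELOAD_SIZE.
 */
#include <stdint.h>
#include <string.h>

#define PRELOAD_SIZE (1 << 20)          /* default cache size: 1MB */

typedef struct ReadCache {
    char    *buf;                       /* cached bytes               */
    int64_t  start;                     /* file offset of buf[0]      */
    int64_t  len;                       /* valid bytes currently held */
} ReadCache;

/* Hypothetical: read file bytes [off, off+len) from OSS over the
 * network, returning the number of bytes actually read. */
extern int64_t oss_read_range(const char *file, int64_t off,
                              int64_t len, char *dst);

/* Serve a small read from the cache; on a miss, prefetch one large
 * PRELOAD_SIZE chunk so subsequent small reads skip the round trip. */
static int64_t cached_read(ReadCache *c, const char *file,
                           int64_t off, int64_t len, char *dst)
{
    if (off < c->start || off + len > c->start + c->len) {
        /* miss: start the next round of "preloading" at this offset */
        c->len   = oss_read_range(file, off, PRELOAD_SIZE, c->buf);
        c->start = off;
        if (len > c->len)
            len = c->len;               /* short read near end of file */
    }
    memcpy(dst, c->buf + (off - c->start), len);
    return len;
}
```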


3.2 Column filtering and predicate pushdown

Because network IO is usually slower than local storage, IO bandwidth consumption should be minimized when scanning external data. For ORC and Parquet files, ADB PG uses column filtering and predicate pushdown to achieve this.

Column filtering means requesting only the data columns the SQL query needs and ignoring the rest. Because ORC and Parquet are columnar formats, a network request only needs to cover the byte ranges of the required columns, greatly reducing network IO. ORC and Parquet also compress column data, reducing IO further.
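As an illustration, the sketch below expresses column filtering as ranged reads: only the byte ranges of referenced columns are requested (OSS supports HTTP Range requests). `ColumnChunk` and `plan_column_reads` are hypothetical names, not OSS SDK APIs.

```c
/*
 * Sketch of column filtering as ranged network reads: request only the
 * byte ranges of the columns the query references. ColumnChunk and
 * plan_column_reads are illustrative names, not OSS SDK APIs.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct ColumnChunk {
    int64_t offset;                 /* start of this column's data */
    int64_t length;                 /* compressed length in bytes  */
} ColumnChunk;

/* Emit one ranged request per needed column; unneeded columns cost
 * no network IO at all. */
static void plan_column_reads(const ColumnChunk *chunks, int ncols,
                              const bool *needed)
{
    for (int i = 0; i < ncols; i++) {
        if (!needed[i])
            continue;               /* column filtered out */
        printf("Range: bytes=%lld-%lld\n",
               (long long) chunks[i].offset,
               (long long) (chunks[i].offset + chunks[i].length - 1));
    }
}
```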

Predicate pushdown moves upper-level filter conditions in the execution plan (such as WHERE-clause conditions) down to the foreign table scan node, so that data blocks that cannot satisfy the query are filtered out when network requests are made, again reducing network IO. ORC/Parquet files store statistics such as min/max/sum for each column of a block in the block header. During a foreign table scan, the header statistics are read first and compared with the pushed-down conditions; if a column's statistics cannot satisfy the conditions, that block's data is skipped entirely.

Here is a brief look at how predicate pushdown is implemented for ORC foreign tables. An ORC file is divided into stripes by row, with data inside each stripe stored by column. Each stripe is further divided into row groups, one per 10,000 rows across all columns, as shown in the figure below.

[Figure: ORC file layout: a file divided into stripes, each stripe divided into row groups]

An ORC file stores statistics at three levels: file-level and stripe-level statistics are kept at the end of the file, and row-group-level statistics at the head of each stripe. With these three levels, ORC can filter at the file level, the stripe level, and the row group level. Concretely: whenever a new ORC file is scanned, the file-level statistics at the end of the file are read first, and if they cannot satisfy the query conditions, the whole file is skipped; next, the stripe-level statistics at the end of the file are read and non-matching stripes are filtered out; finally, for each remaining stripe, the row-group-level statistics at the head of the stripe are read and unneeded row groups are filtered out.
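The three-level filtering can be sketched as a cascade of min/max checks. The structs below are simplified stand-ins for ORC's file/stripe/row-group statistics, and the predicate is hard-coded to a "column value > bound" comparison for brevity.

```c
/*
 * Sketch of the three-level ORC filtering: file, stripe, then row
 * group, each gated by a min/max check against the pushed-down
 * predicate. Structs are simplified stand-ins for ORC statistics.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct MinMax { int64_t min, max; } MinMax;

/* The block may contain rows matching "x > bound" only if max > bound. */
static bool may_match(const MinMax *s, int64_t bound) { return s->max > bound; }

typedef struct RowGroup { MinMax stats; } RowGroup;
typedef struct Stripe   { MinMax stats; RowGroup *rgs; int nrgs; } Stripe;
typedef struct OrcFile  { MinMax stats; Stripe *stripes; int nstripes; } OrcFile;

static void scan_orc(const OrcFile *f, int64_t bound)
{
    if (!may_match(&f->stats, bound))
        return;                              /* skip the whole file   */
    for (int i = 0; i < f->nstripes; i++) {
        const Stripe *st = &f->stripes[i];
        if (!may_match(&st->stats, bound))
            continue;                        /* skip the whole stripe */
        for (int j = 0; j < st->nrgs; j++) {
            if (!may_match(&st->rgs[j].stats, bound))
                continue;                    /* skip this row group   */
            /* ...only now fetch and decode the row group's data... */
        }
    }
}
```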

3.3 The "996" problem

The OSS C SDK defines a set of error codes to indicate abnormal conditions; "996" here refers to error code -996 defined in the OSS C SDK, alongside similar codes such as -998, -995, and -992. These errors usually mean an OSS foreign table import or export failed because of a network exception, and -996 is the most common.

The OSS C SDK uses curl to talk to the OSS server; the corresponding curl error codes are curl 56 ("connection reset by peer"), 52, and so on. These network exceptions typically occur because, under high load, the OSS server actively closes client connections it considers "inactive". When importing or exporting OSS data at scale, the client passes through different stages of the execution plan and cannot keep the connection busy with continuous communication, so the OSS server closes it as an "inactive" client connection.

Usually the client should simply retry in this situation. During development, however, the team found that even after adding an automatic retry mechanism to the client interface, the failures did not improve. The root cause: to improve connection efficiency, the OSS C SDK keeps a pool of curl handles, and handles that hit these network exceptions were also returned to the pool. A retry would therefore pick up the same broken curl handle and fail again, so the -996 errors persisted.

With the root cause known, the fix is intuitive: in the callback interface that releases the curl handle, check the handle's status, and destroy abnormal handles instead of returning them to the connection pool. This keeps invalid curl handles out of the pool; when the client interface retries, it selects a valid handle or creates a new curl connection to communicate again. Of course, automatic retry only helps for failures that a retry can actually fix.
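The gist of the fix can be sketched as follows. The curl calls and error codes are real libcurl API; `pool_get` and `pool_put` are hypothetical stand-ins for the SDK's internal handle pool, not actual OSS C SDK functions.

```c
/*
 * Sketch of the fix: decide per request whether a curl handle returns
 * to the pool or is destroyed. pool_get()/pool_put() are hypothetical
 * stand-ins for the OSS C SDK's internal handle pool.
 */
#include <curl/curl.h>

extern CURL *pool_get(void);        /* fetch a pooled handle, or create one */
extern void  pool_put(CURL *h);     /* return a healthy handle to the pool  */

static void release_handle(CURL *h, CURLcode rc)
{
    if (rc == CURLE_OK) {
        pool_put(h);                /* connection still good: reuse it */
    } else {
        /*
         * e.g. CURLE_RECV_ERROR (curl 56, "connection reset by peer"):
         * the server closed this connection, so a retry through the
         * same handle would fail again. Destroy it; the retry path
         * will then pool_get() a valid or freshly created handle.
         */
        curl_easy_cleanup(h);
    }
}
```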

[Figure: curl connection pool workflow, steps ① to ⑤ below]

① When accessing an OSS foreign table, ADB PG first takes a connection from the curl connection pool, creating a new one if none exists.

② ADB PG uses the curl handle to communicate with the OSS server.

③ The OSS server returns the result of the communication through the curl handle.

④ A curl handle that returned normally is put back into the connection pool for the next use.

⑤ A curl handle that hit an exception is destroyed.

3.4 Memory management compatibility

ADB PG is built on the PostgreSQL kernel and inherits PostgreSQL's memory management, the process-safe MemoryContext. The OSS C SDK instead uses the thread-safe APR pool. Under MemoryContext, every allocation can be explicitly freed, and the context can also clean up memory fragments wholesale. An APR pool, by contrast, only offers interfaces to create the pool, allocate from it, and destroy the pool; there is no interface to explicitly free an individual allocation.

This means we must have a clear understanding of the lifecycle of the memory held by OSS C SDK interfaces, otherwise memory leaks and use-after-free bugs are very easy to introduce. In general, we allocate from an APR pool in one of the two ways below.

· Mode 1 suits interfaces re-entered at low frequency, such as fetching the OSS file list.

· Mode 2 suits interfaces re-entered many times, such as repeatedly requesting a specified range of a specified file from OSS (see the sketch after this list).
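A minimal sketch of the two modes follows. `apr_pool_create`, `apr_pool_clear`, and `apr_pool_destroy` are the real APR API; the commented-out `list_oss_files`/`read_oss_range` calls are hypothetical placeholders for SDK work done inside each pool.

```c
/*
 * Sketch of the two APR pool usage modes described above.
 */
#include <apr_pools.h>

/* Mode 1: low-frequency call, e.g. fetching the OSS file list once.
 * Create a pool per call; destroying it frees every allocation made
 * through it in one shot. */
void list_files_once(void)
{
    apr_pool_t *p;
    apr_pool_create(&p, NULL);
    /* list_oss_files(p, ...);  -- all SDK allocations land in p */
    apr_pool_destroy(p);
}

/* Mode 2: high-frequency reentry, e.g. repeatedly reading ranges of a
 * file during a scan. Keep one long-lived pool and clear it between
 * iterations so memory does not grow with the number of requests. */
void scan_ranges(int nchunks)
{
    apr_pool_t *p;
    apr_pool_create(&p, NULL);
    for (int i = 0; i < nchunks; i++) {
        /* read_oss_range(p, ...); */
        apr_pool_clear(p);          /* recycle allocations, keep the pool */
    }
    apr_pool_destroy(p);
}
```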

This approach resolves the memory management incompatibility between ADB PG and the OSS C SDK.


3.5 Data format compatibility and optimization

Most data on OSS is in CSV, ORC, Parquet, or similar formats. Because the underlying storage encoding of ORC/Parquet differs from ADB PG's data encoding, type conversion is an unavoidable step in foreign table scans. Type conversion, in essence, re-encodes data from one representation to another. For example, the decimal type is represented differently in ORC and in ADB PG: in ORC, a decimal64 stores the numeric value in a single int64, with precision and scale giving the total digits and decimal places; in ADB PG, a decimal64 value is stored as an array of int16 digits. The conversion algorithm must perform repeated division and modulo on every value, which is very CPU-intensive.
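The sketch below illustrates why this conversion is expensive: turning ORC's int64-based decimal64 into an array of int16 digits (PostgreSQL-style numerics use base-10000 int16 digits) costs one division and one modulo per output digit, for every value scanned. It is an illustration, not ADB PG code; sign and overflow handling are simplified.

```c
/*
 * Illustration of the conversion cost: ORC's decimal64 keeps the value
 * in one int64, while a PostgreSQL-style numeric keeps base-10000
 * digits in an int16 array, so each output digit costs one division
 * and one modulo. Sign and INT64_MIN handling are omitted for brevity.
 */
#include <stdint.h>

#define NBASE 10000                 /* base of each int16 digit */

/* Convert an int64 unscaled decimal value into int16 digits, most
 * significant first; returns the number of digits written. */
static int decimal64_to_digits(int64_t value, int16_t *digits)
{
    int16_t tmp[5];                 /* int64 fits in 5 base-10000 digits */
    int n = 0;
    if (value < 0)
        value = -value;             /* sketch: ignores INT64_MIN overflow */
    do {
        tmp[n++] = (int16_t)(value % NBASE);    /* one modulo ...             */
        value /= NBASE;                         /* ... one division per digit */
    } while (value != 0);
    for (int i = 0; i < n; i++)     /* reverse into output order */
        digits[i] = tmp[n - 1 - i];
    return n;
}
```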

To cut the CPU cost of type conversion and further speed up foreign table queries, ADB PG skips the conversion step when exporting data through a foreign table and writes ADB PG's data into the foreign table file directly in binary form, so no type conversion is needed when the foreign table is later queried. For example, when exporting to an ORC foreign table, values of any data type can be written directly into ORC's binary type; the binary data stored in ORC is encoded in the corresponding ADB PG type's own format, so querying the ORC foreign table can omit the conversion step and save CPU. In TPC-H query tests, overall query performance improved by roughly 15%-20%.
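A rough sketch of the idea, with the hypothetical `orc_append_binary` standing in for the real writer API: the export path stores the value's native ADB PG bytes in an ORC binary column, so the scan path reduces to a plain byte copy.

```c
/*
 * Sketch of the binary passthrough: export writes native ADB PG bytes
 * into an ORC binary column; scan copies them back with no per-value
 * conversion. orc_append_binary() is hypothetical, not a real API.
 */
#include <stddef.h>
#include <string.h>

extern void orc_append_binary(void *writer, const void *bytes, size_t len);

/* Export: store the native encoding verbatim, skipping type conversion. */
static void export_value(void *writer, const void *native_bytes, size_t len)
{
    orc_append_binary(writer, native_bytes, len);
}

/* Scan: bytes are already in ADB PG's encoding; copy, no division/modulo. */
static void scan_value(void *dst, const void *binary_bytes, size_t len)
{
    memcpy(dst, binary_bytes, len);
}
```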

4. Performance test

For how to use the foreign table analysis features of ADB PG, see the Alibaba Cloud product manual (https://help.aliyun.com/document_detail/164815.html?spm=a2c4g.11186623.6.602.78db2394eaa9rq). Apart from the table creation statements, operating on a foreign table is almost identical to operating on a local table, so the learning curve is gentle. Below we compare OSS foreign table analysis against local table analysis.

Environment configuration: the test machines are Alibaba Cloud ECS d1ne.4xlarge instances, each with 16 Intel Xeon E5-2682 v4 cores and 64GB of memory, plus 4 local HDDs with read/write throughput of about 200MB/s per disk. ECS instances served as two master nodes and four segment nodes, with 16 segments deployed in total. The test ran the TPC-H queries against a 1TB dataset generated with the official tools.

We tested local heap and AOCS tables together with OSS foreign tables in four formats: CSV, ORC, Parquet, and JSON. The total execution time of the 22 TPC-H queries is shown in the table below. The results show that AOCS tables query slightly faster than heap tables. Among foreign tables, the CSV, ORC, and Parquet formats are somewhat slower than local tables, by around 50%; JSON foreign tables are clearly slower than the other formats, mainly because JSON itself is slow to parse, which has nothing to do with foreign tables.

[Table: total execution time of the 22 TPC-H queries for each table type]

The figure below shows the detailed times of the 22 TPC-H queries; the gap between local tables and foreign tables varies from query to query. Given the advantages of foreign tables in storage cost, flexibility, and scalability, ADB PG foreign table analysis has great potential in real application scenarios.

[Figure: per-query execution times for the 22 TPC-H queries]

5. Summary

Lake-warehouse integration is an important capability of next-generation data warehouse products. As a feature-rich and highly extensible data warehouse, ADB PG has built the ability to analyze and write multiple data sources on top of PG, and has accumulated many performance optimization techniques along the way. Going forward, ADB PG will keep investing in product features, cost-effectiveness, cloud-native capabilities, and lake-warehouse integration, providing users with more functionality, better performance, and lower costs.
