How to optimize OSS for data lake analysis?

Time: 2021-06-23

Introduction: Best practices, taking DLA as an example. DLA is committed to helping customers build a low-cost, easy-to-use, and elastic data platform that costs at least 50% less than a traditional Hadoop deployment. DLA Meta supports a unified view over 15+ data sources on the cloud (OSS, HDFS, DB, DW), introduces multi-tenancy and metadata discovery, pursues zero marginal cost, and is free to use. DLA Lakehouse is based on Apache Hudi; its main goal is to provide an efficient lake warehouse and to support incremental writes from CDC and message streams, and it is currently moving toward production. DLA Serverless Presto is built on Apache PrestoDB and mainly targets federated interactive queries and lightweight ETL.

Background

Data lakes are currently a hot solution both in China and abroad. Market research (https://www.marketsandmarkets.com/Market-Reports/data-lakes-market-213787749.html) shows that the data lake market is expected to grow from USD 7.9 billion in 2019 to USD 20.1 billion in 2024. Some enterprises have already built their own cloud-native data lake solutions to effectively solve business pain points, while many others are building or planning to build one. According to a report released by Gartner in 2020 (https://www.gartner.com/smarterwithgartner/the-best-ways-to-organize-your-data-structures/), 39% of users are already using a data lake and 34% plan to adopt one within a year. With the maturity of cloud-native storage technologies such as object storage, structured, semi-structured, image, video, and other data are first stored in object storage; when the data needs to be analyzed, Hadoop or Alibaba Cloud's cloud-native Data Lake Analytics service (DLA) is used for processing. Compared with a deployed HDFS cluster, object storage has some performance disadvantages for analysis, and the industry has already explored and implemented many optimizations for this.

1、 Challenges of analysis on object storage

1. What is a data lake?

According to Wikipedia, a data lake is a system or repository that stores data in its natural/raw format, usually as object blobs or files. It typically contains raw copies of data produced by source systems as well as data transformed for various tasks, and covers structured data (rows and columns), semi-structured data (such as CSV, logs, XML, JSON), and unstructured data (such as emails, documents, PDFs, images, audio, and video).

From the above, a data lake has the following characteristics:

  • Data sources: raw data and transformed data
  • Data types: structured data, semi-structured data, unstructured data, binary data
  • Data lake storage: a scalable service for massive data storage

2. Data lake analysis solution architecture

[Figure: data lake analysis solution architecture]

It mainly includes five modules:

  • Data sources: raw data storage, including structured data (databases, etc.), semi-structured data (files, logs, etc.), and unstructured data (audio, video, etc.);
  • Data integration: to unify storage and management of data in the data lake, data integration mainly takes three forms: data association, ETL, and asynchronous metadata construction;
  • Data lake storage: the industry mainly uses object storage or self-built HDFS as data lake storage. With the evolution of cloud-native technology, object storage has been heavily optimized for scalability, cost, and maintenance-free operation, so customers now prefer cloud-native object storage over self-built HDFS as the storage base of the data lake;
  • Metadata management: metadata management acts as the bus connecting data integration, storage, and the analysis engines;
  • Data analysis engines: there is a rich set of analysis engines, such as Spark, Hadoop, and Presto.

3. Challenges of analysis on object storage

Compared with HDFS, object storage uses a flat namespace for metadata management to guarantee high scalability. Because no directory structure is maintained, the metadata service can scale out horizontally and does not have the single-point bottleneck of the HDFS NameNode. At the same time, object storage is maintenance-free and is stored and read on demand, enabling a complete storage-compute separation architecture. However, it also brings some problems for analysis-oriented workloads:

  • Slow listing: why is listing a directory on object storage so much slower than on HDFS?
  • Too many requests: why can the request cost of object storage during analysis even exceed the compute cost?
  • Slow rename: why do Spark and Hadoop jobs that write data always get stuck in the commit phase?
  • Slow reads: why is analyzing 1 TB of data so much slower than on a self-built HDFS cluster?
  • ……

4. Industry status of optimizations for analysis on object storage

These are typical problems encountered when building a data lake analysis solution on object storage. To solve them, we need to understand the architectural differences between object storage and traditional HDFS and optimize accordingly. The industry has done a lot of exploration and practice:

  • JuiceFS: maintains an independent metadata service and uses object storage as the storage medium. The independent metadata service provides efficient file management semantics such as list and rename. However, an extra service must be deployed, and all analysis reads of object storage depend on it;
  • Hadoop: Hadoop and Spark write data through the two-phase commit protocol of the OutputCommitter. In the V1 FileOutputCommitter algorithm, renames are performed in both commitTask and commitJob; on object storage a rename is a copy of the object, which is expensive. The V2 algorithm was therefore proposed, which renames only once, but an interruption during commitJob can leave dirty data behind. The algorithm version is selected via configuration, as sketched after this list;
  • Alluxio: deploys an independent cache service that caches remote object storage files locally, so analysis and computation read data at local speed;
  • Hudi: Hudi, Delta Lake, and Iceberg maintain the metadata of dataset files independently to avoid list operations, and provide ACID and read/write isolation similar to traditional databases;
  • Alibaba Cloud's cloud-native Data Lake Analytics service DLA: DLA has made many optimizations for OSS, including rename optimization, InputStream optimization, data caching, and so on.
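For reference, the open-source FileOutputCommitter algorithm version mentioned in the Hadoop item above is selected through a standard configuration key. A minimal sketch of that community Hadoop/Spark configuration (this is not a DLA-specific setting):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;

// Minimal sketch: choosing the FileOutputCommitter algorithm in open-source
// Hadoop/Spark. Version 2 avoids the second rename in commitJob, at the cost
// of possible dirty data if the job is interrupted during commit.
public class CommitterConfigExample {
    public static void main(String[] args) {
        // Plain Hadoop MapReduce / FileSystem jobs
        Configuration hadoopConf = new Configuration();
        // 1 = rename in both commitTask and commitJob (safer, slower on object storage)
        // 2 = rename only in commitTask (faster, but an interrupted commitJob
        //     can leave partial output behind)
        hadoopConf.setInt("mapreduce.fileoutputcommitter.algorithm.version", 2);

        // The same key can be passed to Spark via the "spark.hadoop." prefix
        SparkConf sparkConf = new SparkConf()
                .set("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2");
    }
}
```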

2、 Architecture of DLA's optimizations for object storage OSS

Because of the above problems in analysis scenarios on object storage, DLA builds a unified DLA FS layer to solve slow metadata access, slow rename, slow reads, and so on. DLA FS supports ETL reads and writes from DLA Serverless Spark, interactive queries from DLA Serverless Presto, and efficient reads of data ingested by Lakehouse. The architecture of the OSS-oriented optimizations is divided into four layers:

  • Data lake storage OSS: stores structured, semi-structured, and unstructured data, as well as Hudi-format data ingested into the lake warehouse through DLA Lakehouse;
  • DLA FS: solves the analysis problems of OSS, including optimizations for rename, read buffering, data caching, file listing, and so on;
  • Analysis workloads: DLA Serverless Spark mainly reads data from OSS, runs ETL, and writes the results back to OSS; DLA Serverless Presto mainly runs interactive queries on the data warehoused on OSS;
  • Business scenarios: the dual engines Spark and Presto on DLA can support a variety of business scenarios.

[Figure: four-layer architecture of DLA's OSS optimizations (OSS storage, DLA FS, analysis engines, business scenarios)]

3、 Analysis of DLA FS optimization techniques for object storage OSS

The following introduces the main optimization techniques of DLA FS for object storage OSS.

1. Rename optimization

In the Hadoop ecosystem, the OutputCommitter interface is used to guarantee data consistency during writes. Its principle is similar to a two-phase commit protocol.

Open-source Hadoop provides a Hadoop FileSystem implementation to read and write OSS files, and the default OutputCommitter implementation is FileOutputCommitter. To guarantee data consistency and prevent users from seeing intermediate results, tasks first write their output to a temporary working directory; only when all tasks have confirmed that their output is complete does the driver rename the temporary working directory to the production data path. As shown in the figure below:

[Figure: rename-based FileOutputCommitter write flow]
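To make the baseline concrete, here is a minimal sketch of what this rename-based flow amounts to at the Hadoop FileSystem level. The bucket and paths are placeholders; this is an illustration of the pattern, not the actual FileOutputCommitter code:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Simplified illustration of the rename-based commit: each task writes under a
// hidden temporary directory, and the commit "publishes" the data by renaming it
// into the final path. On HDFS the rename is a cheap metadata operation; on OSS
// it is implemented as copy + delete of every object.
public class RenameBasedCommitSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output  = new Path("oss://my-bucket/warehouse/orders/");            // hypothetical path
        Path taskTmp = new Path(output, "_temporary/0/task_0000/part-00000");
        FileSystem fs = output.getFileSystem(conf);

        // 1. The task attempt writes its result under the temporary directory.
        try (FSDataOutputStream out = fs.create(taskTmp)) {
            out.writeBytes("...task output...");
        }

        // 2. commitTask / commitJob publish the data by renaming it into place.
        fs.rename(taskTmp, new Path(output, "part-00000"));
    }
}
```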

Compared with HDFS, OSS's rename operation is very expensive: it is a copy & delete operation, whereas on HDFS it is only a metadata operation on the NameNode. DLA's analysis engines initially reused open-source Hadoop's FileOutputCommitter, whose performance on OSS is poor. To solve this problem, we decided to introduce OSS's multipart upload feature into DLA FS to optimize write performance.

3.1 DLA FS supports writing OSS objects in multipart upload mode

Alibaba Cloud OSS supports multipart upload. The principle is to split a file into multiple parts and upload them concurrently; after all parts are uploaded, the client calls the multipart upload completion interface at a time of its own choosing to merge the parts into the original file, which improves the throughput of writing files to OSS. Because multipart upload lets us control when a file becomes visible to users, we can use it instead of rename to optimize the performance of DLA FS when writing to OSS in OutputCommitter scenarios.
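A minimal sketch of the OSS multipart upload flow using the Aliyun OSS Java SDK (endpoint, credentials, bucket, and key are placeholders; error handling and part-size policy are omitted). The key point is that the object only becomes visible when CompleteMultipartUpload is called:

```java
import com.aliyun.oss.OSS;
import com.aliyun.oss.OSSClientBuilder;
import com.aliyun.oss.model.*;

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of OSS multipart upload: parts are uploaded first, and the
// object only appears when completeMultipartUpload is called. This deferred
// visibility is what replaces rename in the committer.
public class MultipartUploadSketch {
    public static void main(String[] args) {
        OSS oss = new OSSClientBuilder().build("<endpoint>", "<accessKeyId>", "<accessKeySecret>");
        String bucket = "my-bucket", key = "warehouse/orders/part-00000";   // placeholders
        try {
            String uploadId = oss.initiateMultipartUpload(
                    new InitiateMultipartUploadRequest(bucket, key)).getUploadId();

            List<PartETag> partETags = new ArrayList<>();
            byte[] part = "...data...".getBytes();       // real parts (except the last) should be larger
            UploadPartRequest req = new UploadPartRequest();
            req.setBucketName(bucket);
            req.setKey(key);
            req.setUploadId(uploadId);
            req.setPartNumber(1);
            req.setInputStream(new ByteArrayInputStream(part));
            req.setPartSize(part.length);
            partETags.add(oss.uploadPart(req).getPartETag());

            // Until this call, the object is invisible to readers;
            // abortMultipartUpload would discard the parts instead.
            oss.completeMultipartUpload(
                    new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
        } finally {
            oss.shutdown();
        }
    }
}
```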

The overall algorithm flow of the multipart-upload-based OutputCommitter is as follows:

[Figure: algorithm flow of the multipart-upload-based OutputCommitter]

Using OSS multipart upload has the following advantages:

  • Writing a file no longer requires extra copies. The expensive rename operation is no longer needed to publish files, so no copy & delete is performed. Moreover, compared with rename, OSS's CompleteMultipartUpload is a very lightweight operation.
  • The window for data inconsistency is smaller. Committing multiple files at once is still not an atomic operation, but compared with the original rename, which copies data, the time window is much shorter and the probability of inconsistency is much smaller, which is sufficient for most scenarios.
  • The file metadata operations involved in rename are no longer needed. According to our statistics, the metadata operations per file drop from 13 to 6 in algorithm 1 and from 8 to 4 in algorithm 2.

The interfaces that control user visibility in OSS multipart upload are CompleteMultipartUpload and AbortMultipartUpload, whose semantics are similar to commit/abort. However, the standard Hadoop FileSystem interface does not provide commit/abort semantics.

To solve this problem, we introduce a semi-transaction layer in DLA FS.

3.2 DLA FS introduces a semi-transaction layer

As mentioned earlier, the OutputCommitter is similar to a two-phase commit protocol, so we can abstract the process as a distributed transaction: the driver starts a global transaction, each executor starts its own local transaction, and when the driver receives completion information from all local transactions it commits the global transaction.

Based on this abstraction, we introduce a semi-transaction layer (it does not implement full transaction semantics), which defines interfaces such as Transaction. On top of this abstraction we encapsulate a consistency-guarantee mechanism adapted to the OSS multipart upload feature, and we implement an OSS Transactional OutputCommitter that implements the OutputCommitter interface. Upper-layer compute engines such as Spark interact with the DLA FS semi-transaction layer through it. The structure is as follows:

[Figure: structure of the DLA FS semi-transaction layer]

The following uses DLA Serverless Spark to describe the general flow of DLA FS's OSS Transactional OutputCommitter (a simplified code skeleton follows the list):

  1. setupJob. The driver starts a GlobalTransaction. During initialization, the GlobalTransaction creates a hidden working directory on OSS that belongs to this GlobalTransaction and stores the file metadata of the job.
  2. setupTask. The executor deserializes the GlobalTransaction serialized by the driver and uses it to generate a LocalTransaction, which then monitors file write completion status.
  3. The executor writes files. The metadata of each file is tracked by the LocalTransaction and stored in a local RocksDB. Since OSS remote calls are time-consuming, keeping the metadata in local RocksDB until the next commit reduces the cost of remote calls.
  4. commitTask. When the executor calls the LocalTransaction's commit, the LocalTransaction uploads the task's metadata to the corresponding OSS working directory and stops monitoring file completion status.
  5. commitJob. The driver calls the GlobalTransaction's commit; the GlobalTransaction reads the list of files to be committed from all metadata in the working directory and calls OSS's CompleteMultipartUpload interface to make all the files visible to users.
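A heavily simplified skeleton of these five steps, expressed against Hadoop's OutputCommitter interface. The GlobalTransaction/LocalTransaction types below are illustrative stand-ins for the concepts described above, not DLA's actual classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative skeleton only; the two nested interfaces are hypothetical stand-ins.
public class OssTransactionalCommitterSketch extends OutputCommitter {

    /** Driver-side transaction: owns the hidden OSS working directory. */
    interface GlobalTransaction {
        LocalTransaction newLocalTransaction(String taskAttemptId);
        void commit();   // step 5: CompleteMultipartUpload for every pending file
    }

    /** Executor-side transaction: tracks files the task writes (metadata kept in local RocksDB). */
    interface LocalTransaction {
        boolean hasPendingFiles();
        void commit();   // step 4: upload this task's metadata to the working directory
        void abort();    // abort path: AbortMultipartUpload for this task's parts
    }

    private final java.util.function.Function<Configuration, GlobalTransaction> begin;
    private GlobalTransaction global;
    private LocalTransaction local;

    public OssTransactionalCommitterSketch(
            java.util.function.Function<Configuration, GlobalTransaction> begin) {
        this.begin = begin;
    }

    @Override public void setupJob(JobContext job) {
        global = begin.apply(job.getConfiguration());                            // step 1
    }

    @Override public void setupTask(TaskAttemptContext task) {
        local = global.newLocalTransaction(task.getTaskAttemptID().toString());  // step 2
    }

    // Step 3 happens while the task writes: the LocalTransaction records each
    // file's multipart-upload metadata instead of relying on a rename later.

    @Override public boolean needsTaskCommit(TaskAttemptContext task) {
        return local != null && local.hasPendingFiles();
    }

    @Override public void commitTask(TaskAttemptContext task) { local.commit(); }   // step 4
    @Override public void abortTask(TaskAttemptContext task)  { local.abort(); }
    @Override public void commitJob(JobContext job)           { global.commit(); }  // step 5
}
```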

The introduction of DLA FS's semi-transaction layer has two advantages:

  • It does not depend on any compute engine's interfaces, so it can easily be ported to other compute engines; through adaptation, its implementation can be reused by Presto or other engines.
  • More implementations can be added under the transaction semantics. For example, for partition-merge scenarios, MVCC can be added so that data is merged without affecting online use of the data.

2. InputStream optimization

Users reported that OSS request costs were high, sometimes even higher than the DLA cost (OSS request cost = number of requests × unit price per 10,000 requests ÷ 10,000). Investigation found that the open-source OSS FileSystem pre-reads data in 512 KB units. For example, if a user sequentially reads a 1 MB file, it issues two calls to OSS: the first request reads the first 512 KB and the second reads the next 512 KB. This implementation produces many more requests when reading large files. In addition, because pre-read data is cached in memory, reading many files at the same time also puts pressure on memory.

[Figure: 512 KB read-ahead in the open-source OSS FileSystem]

Therefore, in the DLA FS implementation we remove the pre-read. When the user calls Hadoop's read, the lower layer issues a single OSS request for the whole range from the current position to the end of the file, and then reads only the data the user asked for from the stream returned by OSS. If the user continues to read sequentially, the next read call naturally continues from the same stream and no new request is needed; even sequentially reading a very large file requires only one call to OSS.

In addition, for small seeks, DLA FS reads and discards the data to be skipped from the current stream, so no new request is generated. Only a large jump closes the current stream and issues a new request (reading and discarding across a large jump would increase seek latency). This keeps the optimization effective and reduces the number of calls for ORC, Parquet, and other file formats.
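A minimal sketch of these two ideas (one lazy ranged request to end-of-file plus read-and-discard for small seeks). This is an illustration of the technique, not DLA FS source code, and the 1 MB skip threshold is an assumed value:

```java
import java.io.IOException;
import java.io.InputStream;

// Sketch: 1) keep a single ranged stream opened from the current position to EOF
// and keep reading from it for sequential access; 2) service small forward seeks
// by reading and discarding bytes; only a large jump reopens the stream.
public class LazyRangeInputStream extends InputStream {

    /** Opens the remote object starting at the given offset (e.g. an OSS ranged GET). */
    public interface RangeOpener { InputStream open(long startOffset) throws IOException; }

    private static final long SEEK_SKIP_THRESHOLD = 1L << 20;  // 1 MB "small seek" limit (assumed)

    private final RangeOpener opener;
    private InputStream in;        // current underlying ranged stream, opened lazily
    private long pos;              // logical position requested by the caller
    private long streamPos = -1;   // position the underlying stream currently points at

    public LazyRangeInputStream(RangeOpener opener) { this.opener = opener; }

    /** Lazy seek: just remember the target; no remote call happens here. */
    public void seek(long target) { pos = target; }

    private void ensureStreamAt(long target) throws IOException {
        if (in != null && target >= streamPos
                && target - streamPos <= SEEK_SKIP_THRESHOLD) {
            // Small forward gap: read and discard from the open stream, no new request.
            long toSkip = target - streamPos;
            while (toSkip > 0) {
                long skipped = in.skip(toSkip);
                if (skipped <= 0) break;
                toSkip -= skipped;
                streamPos += skipped;
            }
        }
        if (in == null || streamPos != target) {
            // First read, backward seek, or large jump: one ranged request from target to EOF.
            close();
            in = opener.open(target);
            streamPos = target;
        }
    }

    @Override public int read() throws IOException {
        ensureStreamAt(pos);
        int b = in.read();
        if (b >= 0) { pos++; streamPos++; }
        return b;
    }

    @Override public void close() throws IOException {
        if (in != null) { in.close(); in = null; streamPos = -1; }
    }
}
```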

[Figure: DLA FS InputStream requesting from the current position to the end of the file]

3. Data cache acceleration

In an architecture based on object storage OSS, reading data from remote storage over the network is still an expensive operation and often hurts performance. Cloud-native Data Lake Analytics DLA FS introduces a local caching mechanism that caches hot data on local disks, shortening the distance between data and compute, reducing the latency and IO limits caused by remote reads, and achieving lower query latency and higher throughput.

3.1 Local cache architecture

We encapsulate the cache handling logic in DLA FS. If the data to be read is already in the cache, it is returned directly from the local cache without pulling data from OSS; if it is not, it is read directly from OSS and asynchronously cached to local disk.
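The read path can be sketched as follows. This is illustrative only, not DLA FS code; the cache-key derivation and the single-threaded cache writer are simplifications:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Sketch of a read-through local disk cache: hit -> serve from local disk;
// miss -> read from OSS and populate the cache asynchronously, so the query
// never waits for cache population.
public class LocalDiskCacheSketch {

    private final Path cacheDir;
    private final Function<String, byte[]> remoteReader;       // e.g. an OSS GET (assumed)
    private final ExecutorService cacheWriter = Executors.newSingleThreadExecutor();

    public LocalDiskCacheSketch(Path cacheDir, Function<String, byte[]> remoteReader) {
        this.cacheDir = cacheDir;
        this.remoteReader = remoteReader;
    }

    public byte[] read(String objectKey) throws Exception {
        Path cached = cacheDir.resolve(Integer.toHexString(objectKey.hashCode()));
        if (Files.exists(cached)) {
            return Files.readAllBytes(cached);                  // cache hit: local disk only
        }
        byte[] data = remoteReader.apply(objectKey);            // cache miss: read from OSS
        cacheWriter.submit(() -> {                              // populate cache off the read path
            try {
                Path tmp = Files.createTempFile(cacheDir, "part", ".tmp");
                Files.write(tmp, data);
                Files.move(tmp, cached);                        // publish the cache entry
            } catch (Exception ignored) { /* cache population is best effort */ }
        });
        return data;
    }
}
```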

[Figure: DLA FS local cache architecture]

3.2 Strategy for improving the data cache hit rate

Here we use DLA Serverless Presto to show how DLA FS improves the local cache hit rate. Presto's default split scheduling policy is NO_PREFERENCE; under this policy the main factor considered is the worker's load, so which worker a split is assigned to is largely random. In DLA Presto we instead use the SOFT_AFFINITY scheduling policy: when submitting Hive splits, the hash of the split is calculated so that the same split is submitted to the same worker as much as possible, which improves the cache hit rate.

[Figure: SOFT_AFFINITY split scheduling in DLA Presto]

With the SOFT_AFFINITY policy, splits are submitted as follows (a selection sketch follows the list):

  1. The hash of the split determines its preferred worker and an alternate worker.
  2. If the preferred worker is idle, submit the split to the preferred worker.
  3. If the preferred worker is busy, submit the split to the alternate worker.
  4. If the alternate worker is also busy, submit the split to the least busy worker.
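A minimal sketch of this selection logic (illustrative, not DLA Presto's actual scheduler; the busy threshold and the choice of the next worker in the ring as the alternate are assumptions):

```java
import java.util.Comparator;
import java.util.List;

// Sketch of hash-based soft affinity: the same split hashes to the same preferred
// worker across queries, so its cached data keeps being reused on that worker.
public class SoftAffinitySelectorSketch {

    /** Any handle that can report its current queued-split load. */
    public interface Worker { int queuedSplits(); }

    private final List<Worker> workers;
    private final int busyThreshold;   // queued splits above which a worker counts as busy (assumed)

    public SoftAffinitySelectorSketch(List<Worker> workers, int busyThreshold) {
        this.workers = workers;
        this.busyThreshold = busyThreshold;
    }

    public Worker select(String splitIdentifier) {
        // 1. The split's hash deterministically picks a preferred and an alternate worker.
        int h = Math.floorMod(splitIdentifier.hashCode(), workers.size());
        Worker preferred = workers.get(h);
        Worker alternate = workers.get((h + 1) % workers.size());

        // 2-3. Fall back from preferred to alternate when the preferred worker is busy.
        if (preferred.queuedSplits() < busyThreshold) return preferred;
        if (alternate.queuedSplits() < busyThreshold) return alternate;

        // 4. Both are busy: degrade to load-based scheduling (least busy worker).
        return workers.stream()
                .min(Comparator.comparingInt(Worker::queuedSplits))
                .orElse(preferred);
    }
}
```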

4、 The value of DLA FS

1. Effect of rename optimization in ETL write scenarios

When using DLA, customers usually run large-scale ETL with DLA Serverless Spark. We used the orders table of the TPC-H 100 GB dataset for a write test and created a new orders_test table partitioned by the o_ordermonth field. In Spark we executed the SQL: "insert overwrite table tpc_h_test.orders_test select * from tpc_h_test.orders". With the same resource configuration, one run used open-source Spark and the other used DLA Serverless Spark, and we compared their results.

[Figure: write performance comparison between open-source Spark and DLA Serverless Spark]

It can be concluded from the figure that:

  • This optimization brings a large improvement to both algorithm 1 and algorithm 2.
  • Both algorithms benefit when the feature is turned on, but algorithm 1 benefits more. This is because algorithm 1 renames twice, and one of the renames is performed at a single point on the driver, whereas in algorithm 2 each executor performs its single rename in a distributed way.
  • At the current data volume, the gap between algorithm 1 and algorithm 2 is no longer obvious after the feature is turned on: neither needs rename anymore, and the remaining difference is only whether CompleteMultipartUpload is executed at a single point on the driver (in our modification of algorithm 2, CompleteMultipartUpload is executed during commitTask). With much larger data volumes the difference may still be significant.

2. Effect of InputStream optimization in interactive scenarios

DLA customers use DLA Serverless Presto to analyze data in various formats, such as text, ORC, and Parquet. The following compares the number of access requests for 1 GB of data in text and ORC formats with DLA FS versus the community OSS FileSystem.

[Figure: OSS request count comparison for analyzing a 1 GB text file]

[Figure: OSS request count comparison for analyzing a 1 GB ORC file]

  • For text files, the number of calls drops to about 1/10 of the open-source implementation;
  • For ORC format, the number of calls drops to about 1/3 of the open-source implementation;
  • On average, 60% to 90% of the OSS call cost can be saved.

3. Effect of the data cache in interactive scenarios

We compared the performance of community PrestoDB and DLA. For the community version we chose PrestoDB 0.228 and added OSS data source support by copying jar packages and modifying configuration. The comparison was between a DLA Presto CU edition cluster with 512 cores and 2048 GB of memory and a community-version cluster of the same specification.

We chose the TPC-H 1 TB dataset for the test queries. Since most TPC-H queries are not IO-intensive, we only selected queries that meet both of the following criteria for comparison:

  1. The query scans the largest table, lineitem, so the amount of data scanned is large enough that IO may become the bottleneck.
  2. The query does not join many tables, so the amount of data involved in computation is not huge and computation does not become the bottleneck ahead of IO.

By these criteria we selected Q1 and Q6, which query the lineitem table alone, and Q4, Q12, Q14, Q15, Q17, Q19, and Q20, which join lineitem with one other table.

It can be seen that cache acceleration has an obvious effect on all of these queries.

[Figure: TPC-H query performance comparison between community PrestoDB and DLA Presto with data cache]

5、 Cloud-native data lake best practices

Best practices, taking DLA as an example. DLA is committed to helping customers build a low-cost, easy-to-use, and elastic data platform that costs at least 50% less than a traditional Hadoop deployment. DLA Meta supports a unified view over 15+ data sources on the cloud (OSS, HDFS, DB, DW), introduces multi-tenancy and metadata discovery, pursues zero marginal cost, and is free to use. DLA Lakehouse is based on Apache Hudi; its main goal is to provide an efficient lake warehouse and to support incremental writes from CDC and message streams, and it is currently moving toward production. DLA Serverless Presto is built on Apache PrestoDB and mainly targets federated interactive queries and lightweight ETL. DLA also supports Spark, mainly for large-scale ETL on the lake as well as stream computing and machine learning; compared with traditional self-built Spark it offers a 300% improvement in cost performance, and migrating self-built Spark on ECS or Hive batch jobs to DLA Spark can cut costs by 50%. The integrated data processing solution based on DLA can support BI reports, big-data dashboards, data mining, machine learning, IoT analysis, data science, and other business scenarios.

[Figure: DLA-based cloud-native data lake solution architecture]

Copyright notice: The content of this article is contributed spontaneously by real-name registered Alibaba Cloud users, and the copyright belongs to the original author. The Alibaba Cloud Developer Community does not own the copyright and does not bear the corresponding legal responsibility. For specific rules, please refer to the Alibaba Cloud Developer Community User Service Agreement and the Alibaba Cloud Developer Community Intellectual Property Protection Guidelines. If you find suspected plagiarized content in the community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the suspected infringing content.