Introduction: In a storage-compute separated architecture, reading data from remote storage over the network is expensive and often hurts performance. Taking OSS as an example, its read latency is usually much higher than that of a local disk, and OSS also caps the bandwidth of a single user, which further increases data-analysis latency. In the cloud-native Data Lake Analytics (DLA) SQL engine, we introduce a local caching mechanism that keeps hot data on local disk, shortening the distance between data and compute. This reduces the latency and I/O throttling caused by remote reads and delivers lower query latency and higher throughput.
As data moves to the cloud and network and storage hardware keep improving, separating storage from compute has become a major trend in big data processing. Compared with a coupled storage-compute architecture, separation brings many benefits: compute and storage can scale independently, resource utilization improves, and the business gains flexibility. In particular, on cloud infrastructure, storage can use inexpensive OSS object storage while compute resources are paid for on demand and scaled elastically, letting the storage-compute separated architecture fully exploit the cost advantage and elasticity of cloud computing.
However, in this architecture, reading data from remote storage over the network remains a costly operation that often causes performance loss: OSS read latency is usually much higher than that of a local disk, and OSS throttles per-user bandwidth, which slows data analysis. The DLA SQL engine therefore introduces a local caching mechanism that caches hot data on local disk, shortening the distance between data and compute, reducing the latency and I/O throttling of remote reads, and achieving lower query latency and higher throughput.
Built on elastic Presto, the DLA SQL engine adopts a fully storage-compute separated architecture and supports ad hoc queries, BI analysis, lightweight ETL, and other data analysis over various file formats stored on OSS, HDFS, and other media. With the launch of data lake analysis acceleration, DLA partners with Alluxio, the company behind the open-source data orchestration system, to use Alluxio's cache acceleration capability to offset the performance loss of remote data reads in the storage-compute separated scenario. Going forward, the two parties will continue to cooperate broadly in data lake technology to provide customers with one-stop, efficient data lake analysis and computing services.
DLA SQL data analysis acceleration scheme
Cache acceleration based on Alluxio
In the DLA SQL engine, worker nodes are responsible for reading data from remote data sources, so a natural idea is to cache hot data on the workers to accelerate queries. As shown in the figure below:
The main challenge is making the cache efficient in a big data scenario: quickly locating and reading cached data, raising the cache hit rate, loading data quickly from the remote end, and so on. To address these problems, at the single-node level we use Alluxio to manage the cache and rely on its capabilities to improve cache efficiency; at the system level, the SOFT_AFFINITY submission policy establishes a mapping between workers and data, so that the same piece of data is (with high probability) always read on the same worker, improving the cache hit rate.
SOFT_AFFINITY submission policy
Presto's default split submission policy is NO_PREFERENCE. Under this policy, the main factor considered is the worker's load, so which worker a split is assigned to is largely random. In a caching scenario, however, "data locality" matters: if a split is always submitted to the same worker, caching becomes far more effective.
Therefore, in DLA SQL we use the SOFT_AFFINITY submission policy: when submitting a Hive split, its hash value is computed so that the same split is submitted to the same worker whenever possible. As shown in the figure below.
The SOFT_AFFINITY split submission policy works as follows:
- The split's hash value determines its preferred worker and alternate worker.
- If the preferred worker is idle, submit the split to it.
- If the preferred worker is busy, submit the split to the alternate worker.
- If the alternate worker is also busy, submit the split to the least busy worker.
As shown in the figure below:
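The four steps above can be sketched as follows. This is a hypothetical illustration, not Presto's actual scheduler code: the class and method names are invented, the hash-to-worker mapping is a simple modulo, and the `isBusy` predicate stands in for the threshold checks described next.

```java
import java.util.List;
import java.util.function.Predicate;

// Illustrative sketch of SOFT_AFFINITY worker selection (hypothetical names).
public class SoftAffinityScheduler {
    private final List<String> workers;

    public SoftAffinityScheduler(List<String> workers) {
        this.workers = workers;
    }

    // Pick a worker for a split whose hash is splitHash.
    public String pickWorker(int splitHash, Predicate<String> isBusy) {
        int n = workers.size();
        // Step 1: the hash determines a preferred and an alternate worker.
        int preferred = Math.floorMod(splitHash, n);
        int alternate = Math.floorMod(splitHash + 1, n);

        if (!isBusy.test(workers.get(preferred))) {
            return workers.get(preferred);   // Step 2: preferred worker is idle.
        }
        if (!isBusy.test(workers.get(alternate))) {
            return workers.get(alternate);   // Step 3: fall back to the alternate.
        }
        // Step 4: all candidates busy -> least busy worker wins. Here the first
        // worker stands in for a real load comparison.
        return workers.get(0);
    }

    public static void main(String[] args) {
        SoftAffinityScheduler s =
                new SoftAffinityScheduler(List.of("w1", "w2", "w3"));
        // hash 7 -> preferred index 1 ("w2"), alternate index 2 ("w3").
        System.out.println(s.pickWorker(7, w -> false));          // w2
        System.out.println(s.pickWorker(7, w -> w.equals("w2"))); // w3
    }
}
```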
Whether a worker is "busy" is determined by two parameters:
- `node-scheduler.max-splits-per-node` controls the maximum number of splits that can run on each worker (default 100). Above this value, the worker is considered busy.
- `node-scheduler.max-pending-splits-per-task` controls how many splits on each worker may be in the pending state at most. Above this value, the worker is considered busy.
These checks balance data locality against worker load: they avoid uneven load across workers caused by skewed split hashing, and they prevent one particularly slow worker from dragging down the whole query.
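For reference, the two thresholds are ordinary Presto coordinator settings in `config.properties`. The values below are purely illustrative, not DLA's production configuration:

```properties
# "Busy" thresholds for the split scheduler (example values).
node-scheduler.max-splits-per-node=256
node-scheduler.max-pending-splits-per-task=16
```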
Alluxio cache management
On the worker, we manage the cache with Alluxio local cache. Local cache is a library embedded in the Presto process that communicates with Presto through direct interface calls, so invoking Alluxio this way costs much less than going through a separate Alluxio cluster. Local cache also provides complete cache management: cache loading, eviction, metadata management, and monitoring. In addition, Alluxio supports concurrent asynchronous cache writes and concurrent cache reads, both of which help cache efficiency.
Alluxio exposes a standard HDFS interface, so cache management is transparent to Presto. Through this interface, when a query accesses the OSS data source, data found in the local cache is read directly from the cache to accelerate the query; on a cache miss, data is read directly from OSS (and asynchronously written to the local disk).
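As a rough illustration of how this looks in open-source PrestoDB, the Alluxio local cache is enabled through catalog properties along the following lines (the directory and size here are example values; DLA manages this configuration internally):

```properties
# Hive catalog properties enabling the embedded Alluxio local cache (example).
cache.enabled=true
cache.type=ALLUXIO
cache.base-directory=file:///mnt/disk1/cache
cache.alluxio.max-cache-size=500GB
```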
Further optimization in DLA
Improve cache hit rate
To achieve a higher cache hit rate, we do two things:
- Increase the disk space for cache acceleration as much as the cost allows.
- Increase the proportion of data “localization”.
The former is self-explanatory, so here we focus on the latter.
Analyzing the SOFT_AFFINITY submission policy, we find that once workers become "busy", splits fall back to the same random assignment as NO_PREFERENCE, and the proportion of "localized" data inevitably drops. The key is therefore to avoid "busy" as much as possible. But simply raising the "busy" thresholds can leave worker load unbalanced, and the long-tail effect would eat up the performance gains from caching.
In DLA, we do this:
- Increase the value of `node-scheduler.max-splits-per-node` so that more splits can hit the cache.
- Modify the hash algorithm of HiveSplit: the hash is computed not only from the file name but also from the split's offset within the file. This way, the splits of a large file no longer all hash to a single worker, and split hashes naturally spread out evenly.
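The idea behind the second change can be sketched as follows. This is a minimal illustration of the principle, not DLA's actual HiveSplit code; the method names and hash combination are assumptions.

```java
import java.util.Objects;

// Illustrative contrast between hashing by file only and hashing by
// file plus split offset (hypothetical helper, not DLA's real code).
public class SplitHash {
    // Old approach: identical for every split of the same file, so one
    // large file always maps to one worker.
    static int hashByFile(String path) {
        return Objects.hash(path);
    }

    // New approach: different splits of the same file get different hashes,
    // spreading a large file's splits across workers.
    static int hashByFileAndOffset(String path, long start) {
        return Objects.hash(path, start);
    }

    public static void main(String[] args) {
        String f = "oss://bucket/lineitem/part-0.orc";
        // Two splits of the same file: same hash under the old scheme...
        System.out.println(hashByFile(f) == hashByFile(f));
        // ...different hashes once the 64 MB offset is mixed in.
        System.out.println(
                hashByFileAndOffset(f, 0) == hashByFileAndOffset(f, 64L << 20));
    }
}
```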
Improve disk throughput
Besides the cache hit rate, the other key to cache efficiency is cache read/write speed. In a disk-based caching scheme, that largely means improving disk throughput.
In DLA, we use efficient cloud disks as the cache data disks. The reasoning: cache acceleration is a built-in capability of the CU edition at no extra charge, so the cost introduced by the cache must stay small relative to the total cost of a CU, which rules out expensive SSD disks. From a cost perspective, efficient cloud disks are the inevitable choice, but their low throughput must be compensated for.
We achieve higher throughput by using multiple disks and striping cache writes across them, making up for the limited throughput of a single cloud disk. With DLA's current configuration, measured single-node read/write throughput reaches nearly 600 MB/s, keeping costs down while still delivering good read/write performance.
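Striping cache data over several disks can be sketched as below. The mount points, block-based layout, and round-robin placement are illustrative assumptions; the real implementation may distribute data differently.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical sketch: route cache blocks across multiple data disks so
// their throughput adds up. Not DLA's actual cache layout.
public class CacheDiskRouter {
    private final List<Path> disks;

    public CacheDiskRouter(List<Path> disks) {
        this.disks = disks;
    }

    // Place block blockId of a cached file on a disk chosen round-robin,
    // so consecutive blocks land on different disks and can be read or
    // written in parallel. Block ids are assumed non-negative.
    public Path pathFor(String fileId, long blockId) {
        Path disk = disks.get((int) (blockId % disks.size()));
        return disk.resolve(fileId).resolve("block-" + blockId);
    }

    public static void main(String[] args) {
        CacheDiskRouter r = new CacheDiskRouter(List.of(
                Paths.get("/mnt/disk1"), Paths.get("/mnt/disk2"),
                Paths.get("/mnt/disk3"), Paths.get("/mnt/disk4")));
        // Consecutive blocks of one file spread over the four disks.
        System.out.println(r.pathFor("lineitem-part0", 0));
        System.out.println(r.pathFor("lineitem-part0", 1));
    }
}
```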
Performance comparison
We compared the performance of community PrestoDB with DLA. For the community version, we chose PrestoDB 0.228 and added OSS data source support by copying the jar packages and adjusting the configuration. We compared three DLA SQL CU specifications (256 cores / 1024 GB, 512 cores / 2048 GB, 768 cores / 3072 GB) against community-version clusters of the same compute capacity.
We used the TPC-H 1 TB dataset for the test queries. Since most TPC-H queries are not I/O-intensive, we selected only queries meeting both of the following criteria for comparison:
- The query scans the largest table, lineitem, so the volume of data scanned is large enough for I/O to potentially become the bottleneck.
- The query does not join many tables, so the amount of data entering the computation is limited and CPU does not become the bottleneck before I/O does.
Based on these criteria, we selected Q1 and Q6, which query the lineitem table alone, and Q4, Q12, Q14, Q15, Q17, Q19, and Q20, which join lineitem with one other table.
The results are as follows:
How to use
At present, the caching feature is available only in the CU edition; newly purchased clusters automatically enable caching for OSS and HDFS data sources. If you already have a cluster, please contact us to upgrade to the latest version. For enabling and using CUs, please refer to our help documentation.
We also offer a special 1000 CU resource package. You are welcome to try it: click to buy the package.
Summary and Prospect
The cache acceleration feature delivers lower query latency and higher throughput by caching hot data on local disk, which works well for I/O-intensive queries. In cloud scenarios where storage and compute are separated, caching is bound to find broader application. In the future, we will further explore using caching for MaxCompute and other data sources and scenarios, accelerating more kinds of data reading and computation to provide better query performance.
During the current promotion, users can purchase the DLA 1000 CU-hour resource package at the original price of 315 yuan: click to buy the package.
Welcome to join our DingTalk group for the latest updates:
Copyright notice: The content of this article is voluntarily contributed by real-name registered users of Alibaba Cloud, and the copyright belongs to the original author. The Alibaba Cloud developer community does not own the copyright and does not bear the corresponding legal responsibility. For specific rules, please refer to the Alibaba Cloud Developer Community User Service Agreement and the Alibaba Cloud Developer Community Intellectual Property Protection Guidelines. If you find suspected plagiarized content in the community, fill in the infringement complaint form to report it; once verified, the community will immediately delete the content in question.