Engineering practice of Apache Druid at Shopee

Time: 2022-05-09

Author

Yuanli, from the Shopee Data Infra OLAP team.

Abstract

Apache Druid is a high-performance, open-source real-time analytics database suited to interactive, low-latency query and analysis scenarios. This article shares our engineering practice with Apache Druid in supporting real-time OLAP analysis for Shopee's core businesses.

As Shopee's business grows, more and more core businesses rely on the real-time OLAP analysis service built on our Druid cluster. Increasingly demanding application scenarios have exposed various performance bottlenecks in the open-source Apache Druid project. By analyzing and studying the core source code, we optimized the performance of the metadata management module and the cache module where those bottlenecks appeared.

At the same time, to meet the customization needs of the company's internal core businesses, we developed several new features, including an exact deduplication operator for integers and a flexible sliding window function.

1. Application of the Druid cluster at Shopee

Our current deployment scheme is a single very large cluster on physical machines, with more than 100 nodes. The Druid cluster sits downstream of the core business data pipelines: data is written in through batch tasks and streaming tasks, and the relevant business teams then run real-time OLAP queries and analysis on it.


2. Technical optimization schemes

2.1 Coordinator load balancing algorithm efficiency optimization

2.1.1 Problem background

Through real-time task monitoring and alerting, we found that many real-time tasks were failing because their final step timed out, and users then reported to us that their real-time data queries were experiencing jitter.

Our investigation found that as more services were onboarded to the Druid cluster, the number of datasources kept growing. Combined with the accumulation of historical data, the total number of segments in the cluster kept increasing. This put more and more pressure on the Coordinator's metadata management, gradually exposing performance bottlenecks and affecting the stability of the overall service.

2.1.2 Problem analysis

Analysis of the Coordinator's serial subtasks

First, we analyzed whether these serial tasks could be parallelized; however, the analysis showed that the subtasks have logical dependencies on each other and therefore must run serially. From the Coordinator logs, we found that one subtask, responsible for balancing segment loading across historical nodes, was extremely slow, taking more than 10 minutes per run. This single subtask dominated the total runtime of the whole serial chain, stretching the execution interval of another subtask responsible for scheduling segment loading, which in turn caused the real-time task failures mentioned above, as their publish phase timed out.

Using the JProfiler tool, we found performance problems in the reservoir sampling implementation used by the load balancing algorithm. Reading the source code, we discovered that the existing reservoir sampling implementation samples only one element per call from the full list of about 5 million segments, while 2,000 segments need to be balanced in each cycle. In other words, the 5-million-entry list is traversed 2,000 times per cycle, which is clearly unreasonable.


2.1.3 Optimization scheme

We reimplemented the reservoir sampling algorithm to sample in batch: a single traversal of the 5-million-entry segment metadata list is enough to sample all 2,000 elements. After this optimization, the subtask responsible for segment load balancing finishes in only about 300 milliseconds, and the total runtime of the Coordinator's serial subtasks drops significantly.
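For illustration, here is a minimal sketch of batch reservoir sampling (a standard variant of Algorithm R) over a generic list of segment metadata. It is not the exact Druid implementation, but it shows how a single pass over the list can produce k uniform samples:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

public final class BatchReservoirSampler
{
  // Samples up to k elements uniformly at random from the input in a single pass.
  public static <T> List<T> sample(List<T> input, int k)
  {
    List<T> reservoir = new ArrayList<>(k);
    int seen = 0;
    for (T element : input) {
      seen++;
      if (reservoir.size() < k) {
        reservoir.add(element);                            // fill the reservoir first
      } else {
        int j = ThreadLocalRandom.current().nextInt(seen); // uniform index in [0, seen)
        if (j < k) {
          reservoir.set(j, element);                       // replace with probability k / seen
        }
      }
    }
    return reservoir;
  }
}

Sampling 2,000 balancing candidates this way costs one traversal of the 5-million-entry list instead of 2,000.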

Benchmark results


The benchmark comparison shows that the batch-sampling reservoir algorithm significantly outperforms the other options.

Community cooperation

We have contributed this optimization back to the Apache Druid community; see the corresponding PR.

2.2 Incremental metadata management optimization

2.2.1 Problem background

In the current Coordinator metadata management, a scheduled thread pulls the full set of segment records from the metadata MySQL DB every 2 minutes by default and refreshes an in-memory snapshot of the segment set in the Coordinator process. When the cluster holds a very large amount of segment metadata, each full-pull SQL query becomes very slow, and deserializing such a large number of metadata records also incurs significant overhead. The Coordinator's segment management subtasks all depend on this snapshot refresh, so slow full-pull SQL directly hurts the timeliness of data visibility across the cluster.

2.2.2 Problem analysis

First, we analyze how segment metadata changes in three different scenarios: addition, deletion, and modification.

Metadata addition


Writing data into a datasource generates new segment metadata; writes come mainly from batch tasks and Kafka real-time tasks. It is critical that the Coordinator's segment management subtask notices and manages this newly added segment metadata promptly, because that directly determines when newly written data becomes visible in the Druid cluster. Druid's internal metrics show that the number of segments added per unit time is far smaller than the roughly 5 million total records.

Metadata deletion

Segments of a datasource within a specified time interval can be cleaned up by submitting a kill task. The kill task first removes the segment records from the metadata DB and then deletes the segment files from HDFS. For segments that have already been downloaded to local historical nodes, the Coordinator's segment management subtask is responsible for notifying those nodes to drop them.

Metadata modification

One of the Coordinator's segment management subtasks marks and clears segments with older version numbers based on each segment's version. This process flips the flag in the corresponding metadata record that indicates whether the segment is still valid. For old-version segments that have already been downloaded to local historical nodes, the Coordinator's segment management subtask is likewise responsible for notifying those nodes to drop them.

2.2.3 Optimization scheme

From the analysis of segment metadata addition, deletion, and modification, we conclude that promptly detecting and managing newly added metadata is what matters most, since it directly affects how quickly newly written data becomes visible. Deletions and modifications mainly affect data cleanup, where the timeliness requirements are comparatively low.

To sum up, our optimization is an incremental metadata management approach: pull only the newly added segment metadata from the metadata DB and merge it into the current in-memory snapshot to produce the new snapshot used for metadata management. To guarantee eventual consistency and handle the lower-priority cleanup work, a full pull of the metadata is still performed at a much longer interval.

Original full-pull SQL statement:

SELECT payload FROM druid_segments WHERE used=true;

Incremental-pull SQL statement:

-- To keep this query efficient, create an index on the new filter column (created_date) in the metadata DB in advance
SELECT payload FROM druid_segments WHERE used=true and created_date > :created_date;

Configuration properties for the incremental feature:
# Incrementally pull metadata added in the last 5 minutes
druid.manager.segments.pollLatestPeriod=PT5M
# Pull the metadata in full every 15 minutes
druid.manager.segments.fullyPollDuration=PT15M
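To make the polling strategy concrete, here is a hypothetical sketch of the scheduling logic implied by the two properties above; the class and method names are illustrative and do not correspond to Druid's actual metadata manager code:

import java.time.Duration;
import java.time.Instant;

final class MetadataPoller
{
  private final Duration fullyPollDuration = Duration.ofMinutes(15); // druid.manager.segments.fullyPollDuration
  private Instant lastFullPoll = Instant.EPOCH;
  private Instant lastPollTime = Instant.EPOCH;

  // Invoked by a scheduled thread once every pollLatestPeriod (5 minutes in the config above).
  void pollOnce()
  {
    Instant now = Instant.now();
    if (Duration.between(lastFullPoll, now).compareTo(fullyPollDuration) >= 0) {
      pollFully();                      // periodic full pull guarantees eventual consistency
      lastFullPoll = now;
    } else {
      pollIncrementally(lastPollTime);  // cheap pull of records with created_date > lastPollTime
    }
    lastPollTime = now;
  }

  private void pollFully() { /* SELECT payload FROM druid_segments WHERE used=true */ }

  private void pollIncrementally(Instant since) { /* ... AND created_date > :created_date, then merge into the snapshot */ }
}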
Online performance

Monitoring metrics show that after enabling incremental management, the time spent pulling and deserializing metadata drops significantly. The load on the metadata DB is also reduced, and the user-reported problem of newly written data being slow to become readable is resolved.


2.3 Broker result cache optimization

2.3.1 Problem background

While tuning query performance, we found that many query scenarios cannot make good use of Druid's caching. Druid currently offers two caches: a result cache and a segment-level intermediate result cache. The result cache applies only to Broker processes, while the segment-level intermediate result cache can be enabled on Brokers as well as on data nodes. Both caches, however, have obvious limitations, as shown below.

The two existing schemes, the segment-level cache and the result cache, were evaluated against four usage scenarios (their availability in each scenario is analyzed below):

  • Scenario 1: queries using the groupBy v2 engine
  • Scenario 2: queries scanning only historical segments
  • Scenario 3: queries scanning historical and real-time segments at the same time
  • Scenario 4: efficiently caching the results of a large number of segments

2.3.2 Problem analysis

Caching is unavailable when using the groupBy v2 engine

The groupBy v2 engine has been the default engine for groupBy queries across many stable releases and will remain so for the foreseeable future. Moreover, groupBy is one of the most common query types, alongside topN and timeseries. The problem that the groupBy v2 engine does not support caching still exists as of version 0.22.0; see the Druid documentation on scenarios where caching is not supported.

Tracing the community's change history, we found that the reason the groupBy v2 engine does not support caching is that the segment-level intermediate results are unsorted, which could lead to incorrect merged query results. See the related community issue for details.

Below is a brief summary of why the Druid community chose to fix this bug by disabling the feature:

  • If the segment-level intermediate results are sorted before being cached, the load on historical nodes increases when the number of segments is large;
  • If the segment-level intermediate results are cached unsorted, the Broker has to re-sort the intermediate results of every segment, which increases the burden on the Broker;
  • Simply disabling the feature leaves the historical nodes unaffected and also fixes the bug of incorrect merge results on the Broker. :)

At the same time, the community's fix also inadvertently broke the result cache, so that in the fixed versions the result cache on the Broker is also unavailable when using the groupBy v2 engine; see the Druid documentation on scenarios where caching is not supported.

Limitations of result caching

The result cache requires that the set of segments scanned by a query stays identical between runs and that all of them are historical segments. In other words, as soon as a query needs to touch the latest real-time data, the result cache cannot be used.

For Druid, whose strength is real-time query analysis, this limitation is especially painful. Many business dashboards query time-series aggregations over the latest day, week, or month, which include the newest real-time data, and none of these queries can use the result cache.

Limitations of segment-level intermediate result caching

The segment-level intermediate result cache can be enabled on the Broker and on data nodes at the same time; it is mainly useful on historical nodes.

Enabling segment-level intermediate result caching on the Broker has the following limitations when the number of scanned segments is large:

  • Deserializing the cached results adds extra overhead on the Broker;
  • The Broker has to merge all the intermediate results itself; historical nodes can no longer be used to merge part of the intermediate results.

Enabling segment-level intermediate result caching on a historical node works as follows:

(Figure: workflow of segment-level intermediate result caching on a historical node)

In real application scenarios, we find that when the segment-level intermediate results are large, the overhead of serializing and deserializing the cached results cannot be ignored.

2.3.3 Optimization scheme

The analysis above shows that both existing caches have obvious limitations. To improve cache efficiency, we designed and implemented a new caching feature on the Broker: it caches the merged intermediate results of the historical segments, which makes up for the shortcomings of the two existing caches.

New cache configuration properties
druid.broker.cache.useSegmentMergedResultCache=true
druid.broker.cache.populateSegmentMergedResultCache=true
Applicable scenario comparison
The comparison covers the same four usage scenarios as above: the groupBy v2 engine, scanning only historical segments, scanning historical and real-time segments at the same time, and efficiently caching the results of a large number of segments. The schemes compared are the segment-level cache, the result cache, and the new segment merged intermediate result cache.
Working principle

(Figure: working principle of the segment merged result cache)
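As a sketch of the idea (hypothetical code, not our production implementation): the Broker caches the merged result of the historical segments under a key derived from the query and the exact set of historical segments it covers, and always recomputes and merges in the real-time part:

import java.util.List;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.BinaryOperator;
import java.util.function.Function;

final class SegmentMergedResultCache<R>
{
  private final ConcurrentMap<String, R> cache = new ConcurrentHashMap<>();

  R query(
      String queryFingerprint,                 // canonicalized query (intervals, filters, aggregators, ...)
      List<String> historicalSegmentIds,       // immutable segments eligible for caching
      List<String> realtimeSegmentIds,         // segments that must always be recomputed
      Function<List<String>, R> runOnSegments, // fans the query out and merges the per-segment results
      BinaryOperator<R> mergeResults)          // merges two partial results
  {
    // The cache key covers the query and the exact set of historical segments it touches.
    String key = queryFingerprint + "/" + Objects.hash(historicalSegmentIds);

    // The merged result over the historical segments is reusable across repeated queries.
    R historicalPart = cache.computeIfAbsent(key, k -> runOnSegments.apply(historicalSegmentIds));

    if (realtimeSegmentIds.isEmpty()) {
      return historicalPart;
    }
    // Real-time segments change constantly, so they are recomputed and merged in every time.
    R realtimePart = runOnSegments.apply(realtimeSegmentIds);
    return mergeResults.apply(historicalPart, realtimePart);
  }
}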

Benchmark results


The benchmark results show that the segment merged result cache adds no noticeable overhead to the first (cold) query, and its caching efficiency is significantly better than the other caching options.

Online performance

After enabling the new caching feature, the overall query latency of the cluster dropped by about 50%.


Community cooperation

We are preparing to contribute this new caching feature to the community; the PR is currently awaiting more community feedback.

3. Custom feature development

3.1 Bitmap-based exact deduplication operator

3.1.1 Problem background

Many key businesses need exact counts of orders and UV (unique visitors), but Druid's existing deduplication operators are based on approximate algorithms and carry errors in practice. The relevant businesses therefore asked us to provide an exact deduplication implementation.

3.1.2 Requirement analysis

Deduplication field type analysis

Analyzing the collected requirements, we found that the order IDs and user IDs in the most urgent use cases are integers or long integers, which let us consider skipping the dictionary-encoding step.

3.1.3 Implementation scheme

Since the Druid community lacks such an implementation, we built a custom aggregator based on the widely used Roaring Bitmap. We developed operators for integers and long integers respectively, both supporting serialization and deserialization so that they work with the rollup ingestion model. This let us quickly release a first stable version of the feature, which handles small data volumes well.

Operator API
// native JSON API
{
    "type": "Bitmap32ExactCountBuild or Bitmap32ExactCountMerge",
    "name": "exactCountMetric",
    "fieldName": "userId"
}
-- SQL support
SELECT "dim", Bitmap32_EXACT_COUNT("exactCountMetric") FROM "ds_name" WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY GROUP BY key
Limitation analysis and optimization direction

However, when facing large data volumes, this simple implementation exposes its performance bottlenecks.

Performance bottleneck caused by overly large intermediate result sets

The new operator's in-memory footprint is large, so writing it to and reading it from the cache incurs obvious overhead. Moreover, this operator is mainly used in groupBy queries, where the existing caches cannot play their intended role. This further motivated us to design and develop the new caching option for segment merged intermediate results described above.

By effectively caching the merged intermediate results of segments, the serialization and deserialization overhead caused by overly large segment-level intermediate results is greatly reduced. In addition, we will consider re-encoding the IDs in the future to reduce how scattered the value distribution is, and so improve the bitmap's compression ratio for the integer sequences.
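A small standalone example (again plain org.roaringbitmap code, not Druid) illustrates why such re-encoding helps: the same number of IDs compresses far better when they fall in a dense, contiguous range than when they are widely dispersed:

import org.roaringbitmap.RoaringBitmap;

public final class RecodingEffectExample
{
  public static void main(String[] args)
  {
    RoaringBitmap scattered = new RoaringBitmap();
    RoaringBitmap recoded = new RoaringBitmap();
    for (int i = 0; i < 1_000_000; i++) {
      scattered.add(i * 1_000); // widely dispersed original IDs
      recoded.add(i);           // the same IDs re-encoded into a contiguous range
    }
    scattered.runOptimize();    // let the library pick the best container encoding
    recoded.runOptimize();
    System.out.println("scattered: " + scattered.serializedSizeInBytes() + " bytes");
    System.out.println("recoded:   " + recoded.serializedSizeInBytes() + " bytes");
  }
}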

The difficulty of memory estimation

The Druid query engine mainly processes intermediate aggregation results in off-heap memory buffers to reduce GC impact, which requires an operator's internal data structures to support fairly accurate memory estimation. Operators based on Roaring Bitmap, however, are hard to estimate precisely and can only build their object instances in heap memory at runtime. This makes the memory overhead of such operators hard to control within a query, and extreme queries may even cause OOM.

In the short term, we mainly mitigate these problems through upstream data processing, such as re-encoding the IDs and reasonable partitioning.

3.2 Flexible sliding window function

3.2.1 Problem background

Druid's core query engine only supports aggregation over fixed windows and lacks flexible sliding window functions. Some key business teams want to compute, for each day, the UV over the trailing 7 days, which requires Druid to support sliding window aggregation.

3.2.2 Requirement analysis

Limitations of the community Moving Average Query extension

Our investigation found an existing community extension, Moving Average Query. It supports sliding window computation for some basic types, but lacks support for Druid's native operators over other, complex (object) types, such as the widely used HLL-based approximation operators. The extension also lacks SQL support and adaptation.

3.2.3 Implementation scheme

By studying the source code, we found that this extension could be made more general and concise. We added a "default" type averager implementation that performs sliding window aggregation on the underlying field according to that field's type. In other words, all Druid native operators (aggregators) can be combined with sliding window aggregation through this default averager.

We also added SQL function support for this general operator.

Operator API
// native JSON API
{
    "aggregations": [
        {
            "type": "hyperUnique",
            "name": "deltaDayUniqueUsers",
            "fieldName": "uniq_user"
        }
    ],
    "averagers": [
        {
            "name": "trailing7DayUniqueUsers",
            "fieldName": "deltaDayUniqueUsers",
            "type": "default",
            "buckets": 7
        }
    ]
}
-- SQL support
SELECT TIME_FLOOR(__time, 'PT1H'), dim, MA_TRAILING_AGGREGATE_DEFAULT(DS_HLL(user), 7) FROM ds_name WHERE __time >= '2021-06-27T00:00:00.000Z' AND __time < '2021-06-28T00:00:00.000Z' GROUP BY 1, 2
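Conceptually, the default averager re-merges the per-bucket aggregates inside each trailing window using the underlying operator's own combine logic. The sketch below is a hypothetical, simplified illustration of that idea, not the extension's actual class structure:

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.BinaryOperator;

final class TrailingWindowAggregator
{
  // For each time bucket, combines the last `buckets` per-bucket aggregates into one value.
  static <T> List<T> trailing(List<T> bucketValues, int buckets, BinaryOperator<T> combine)
  {
    List<T> output = new ArrayList<>(bucketValues.size());
    Deque<T> window = new ArrayDeque<>(buckets);
    for (T value : bucketValues) {
      if (window.size() == buckets) {
        window.removeFirst();            // slide the window forward by one bucket
      }
      window.addLast(value);
      T merged = null;
      for (T v : window) {
        merged = (merged == null) ? v : combine.apply(merged, v); // re-merge the whole window
      }
      output.add(merged);
    }
    return output;
  }
}

For example, with buckets = 7 and a combine function that unions daily HLL sketches, the output is the trailing 7-day unique-user estimate for each day.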
Community cooperation

We are preparing to contribute this new feature to the community; the PR is currently awaiting more community feedback.

4. Future architecture evolution

To address stability at the architecture level and to reduce cost while improving efficiency, we have started exploring and implementing a cloud-native deployment scheme for Druid. We will share our practical experience in this area in the future; stay tuned!
