In the world of software infrastructure, performance optimization is drawing more and more attention. Once code is functionally complete, optimizing it step by step lets it do more work on the same hardware and further improves business productivity. This matters especially in big data: an OLAP database typically serves ad hoc queries and analysis over massive datasets, so even a seemingly insignificant low-level optimization is amplified across the volume of data and improves the query and analysis performance of the whole system. As the saying goes, a single spark can start a prairie fire.
Here I will mainly share how we approach performance tuning in Databend. The material is split into three parts:
1. Fundamentals: prerequisites for code tuning
2. Practice 1: performance tuning of the Databend source code
3. Practice 2: why do Databend's GROUP BY aggregate queries run so fast?
As Databend continues to be polished, we will also introduce more performance-related designs, such as Databend's new pipeline execution model and how Databend combines vectorization with compiled execution.
Video recordings of this series of sessions are available at https://www.bilibili.com/vide…
1. Memory hierarchy
Memory should be familiar to every programmer by now. The figure below shows the classic pyramid model of the memory hierarchy.
From top to bottom, the hierarchy in the figure is: registers, L1 cache, L2 cache, L3 cache, main memory, and SSD/HDD disks. Each level has a different capacity and access time. Going down the pyramid, capacity grows but speed drops and cost per byte falls; going up, the opposite holds. The general principle of writing high-performance code is therefore to keep data access in the upper levels, close to the CPU, reducing the cost of each access.
2. Access latency
The figure above shows the general shape of the memory hierarchy. Next, let's look at the approximate access latency at each level. Latency is the time required to complete a computation or an I/O request; to quantify performance, we also measure metrics such as IOPS and throughput.
The following figure lists the latency numbers every programmer should know. They represent the approximate cost of various computer operations:
Jeff Dean once suggested that every programmer should keep these latencies in mind; they give us an objective yardstick when tuning code.
3. Locality principle
Because of the way modern computer memory is designed, program execution exhibits strong locality. Locality comes in two forms: temporal locality and spatial locality.
Temporal locality means that a piece of data accessed at some point in time is likely to be accessed again in the near future; this is why so many programs use a cache to speed up data access.
Spatial locality, also known as data locality, benefits from the cache design of memory: once data at some address has been accessed, accessing data at adjacent addresses is cheaper. A common case is looping over the elements of an array.
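To make spatial locality concrete, here is a small self-contained Rust sketch (illustrative only, not from the Databend codebase): we sum the same large array once sequentially and once with a large stride. Both loops compute the same total, but the sequential walk uses every byte of each cache line it fetches, while the strided walk wastes most of every line.

```rust
use std::time::Instant;

/// Sum by walking the slice front to back: adjacent elements share
/// cache lines, so every 64-byte line fetched from memory is fully used.
fn sum_sequential(data: &[u64]) -> u64 {
    data.iter().sum()
}

/// Sum the same slice in large strides: each access lands on a new
/// cache line, wasting most of every line that is fetched.
fn sum_strided(data: &[u64], stride: usize) -> u64 {
    let mut total = 0;
    for offset in 0..stride {
        let mut i = offset;
        while i < data.len() {
            total += data[i];
            i += stride;
        }
    }
    total
}

fn main() {
    let data = vec![1u64; 1 << 24]; // ~128 MiB, far larger than L3 cache

    let t = Instant::now();
    let a = sum_sequential(&data);
    println!("sequential: sum={} in {:?}", a, t.elapsed());

    let t = Instant::now();
    let b = sum_strided(&data, 512); // jump 4 KiB per access
    println!("strided:    sum={} in {:?}", b, t.elapsed());

    assert_eq!(a, b); // identical result, very different cache behavior
}
```

On most machines the sequential version is several times faster, even though the two loops execute the same number of additions.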
When we exploit these locality principles and write more "local" code, we improve both performance and CPU utilization. The figure below shows the memory mountain of an Intel Core i7 Haswell processor: the x-axis is the working-set size, the y-axis is the stride of the accesses, and the z-axis is the read throughput, rendered as a 3D surface.
As the working-set size grows, throughput generally declines. If the stride stays within a short range, however, the program can take full advantage of data-locality optimizations such as CPU prefetching and sustain high throughput.
4. Data locality example
Here is the classic example of row-first versus column-first traversal of a matrix: we sum all the elements either row by row or column by column. As the figure shows, with row-major storage the row traversal sums far faster than the column traversal, because the row traversal has better locality: adjacent elements sit together in the CPU cache and can be accessed very efficiently. The compiler can also optimize such local access patterns and generate faster code, for example through auto-vectorization.
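The experiment above can be sketched in Rust roughly as follows (the `Matrix` type and function names here are hypothetical, chosen just for this illustration; the matrix is stored as a flat row-major vector):

```rust
/// A matrix stored in row-major order: element (r, c) lives at r * cols + c.
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<i64>,
}

/// Row-first traversal: consecutive iterations touch consecutive
/// addresses, so the CPU prefetcher streams cache lines ahead of us
/// and the compiler can auto-vectorize the inner loop.
fn sum_row_major(m: &Matrix) -> i64 {
    let mut sum = 0;
    for r in 0..m.rows {
        for c in 0..m.cols {
            sum += m.data[r * m.cols + c];
        }
    }
    sum
}

/// Column-first traversal: consecutive iterations jump `cols * 8` bytes,
/// so on a large matrix almost every access misses the cache.
fn sum_col_major(m: &Matrix) -> i64 {
    let mut sum = 0;
    for c in 0..m.cols {
        for r in 0..m.rows {
            sum += m.data[r * m.cols + c];
        }
    }
    sum
}

fn main() {
    let m = Matrix { rows: 4096, cols: 4096, data: vec![1; 4096 * 4096] };
    // Same answer either way; timing the two calls shows the gap.
    assert_eq!(sum_row_major(&m), sum_col_major(&m));
}
```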
5. Expressing program performance
Here we borrow the prefix-sum example from CS:APP. The psum1 function computes the prefix sum with a plain for loop; psum2 unrolls the loop and walks the array with a step of 2. We run both functions with different element counts, and each run yields a CPU cycle count. Fitting the results with least squares gives two lines whose slopes correspond to the per-element cost of each function, and the fit shows that psum2 clearly outperforms psum1.
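The two functions can be sketched roughly as follows (CS:APP writes them in C; this is a Rust translation that keeps the names psum1 and psum2 from the text above):

```rust
/// psum1: straightforward prefix sum, one element per iteration.
fn psum1(a: &[f64]) -> Vec<f64> {
    let mut p = vec![0.0; a.len()];
    if a.is_empty() {
        return p;
    }
    p[0] = a[0];
    for i in 1..a.len() {
        p[i] = p[i - 1] + a[i];
    }
    p
}

/// psum2: the same computation with the loop unrolled by a factor of
/// two, so there are fewer loop-control instructions per element.
fn psum2(a: &[f64]) -> Vec<f64> {
    let n = a.len();
    let mut p = vec![0.0; n];
    if n == 0 {
        return p;
    }
    p[0] = a[0];
    let mut i = 1;
    while i + 1 < n {
        let mid = p[i - 1] + a[i];
        p[i] = mid;
        p[i + 1] = mid + a[i + 1];
        i += 2;
    }
    if i < n {
        p[i] = p[i - 1] + a[i]; // handle the odd leftover element
    }
    p
}

fn main() {
    let a: Vec<f64> = (1..=7).map(|x| x as f64).collect();
    assert_eq!(psum1(&a), psum2(&a)); // same result, fewer iterations in psum2
    assert_eq!(psum1(&a), vec![1.0, 3.0, 6.0, 10.0, 15.0, 21.0, 28.0]);
}
```

Timing each version over a range of input sizes and fitting cycles against element count, as described above, yields the per-element cost of each loop.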
6. Commonly used tools and scripts
In the remainder of the fundamentals section, we introduce the tuning tools commonly used with Databend.
Print & log & tracing & metrics
(1) The most common approach is to log how long an operation takes, or to export the timing data to a monitoring system
(2) Prometheus, Jaeger

Profiling: perf, pprof, flamegraph
(1) How to profile Databend
(2) The cargo-flamegraph crate

Benchmarking
- cargo bench & criterion
- databend-benchmark
- ignore function && progress
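As a concrete illustration of item (1) above, printing time-consuming logs, here is a minimal, hypothetical Rust helper built on `std::time::Instant` (the `timed` name is ours, not a Databend API); a real system would send the duration to a logger or a metrics exporter such as Prometheus rather than stdout.

```rust
use std::time::{Duration, Instant};

/// Run `f` and return its result together with the wall-clock time it
/// took. A minimal stand-in for the "log the time spent" approach.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> (T, Duration) {
    let start = Instant::now();
    let result = f();
    let elapsed = start.elapsed();
    println!("{}: {:?}", label, elapsed);
    (result, elapsed)
}

fn main() {
    let (sum, elapsed) = timed("sum 0..1_000_000", || (0..1_000_000u64).sum::<u64>());
    println!("sum = {}", sum);
    assert!(elapsed >= Duration::ZERO);
}
```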
7. References
- CS:APP3e, Bryant and O'Hallaron (cmu.edu)
- ISSUE-3627: adding distributed tracing instrumentation to query by dantengsky · Pull Request #3628 · datafuselabs/databend (github.com)
- How to profile Databend | Databend
- flamegraph-rs/flamegraph: Easy flamegraphs for Rust projects and everything else, without Perl or pipes <3 (github.com)
- bheisler/criterion.rs: Statistics-driven benchmarking library for Rust (github.com)
- databend/databend-benchmark.rs at main · datafuselabs/databend (github.com)
- databend/perfs.yaml at 9393f5cf7da5827132acd479185878726caf58bf · datafuselabs/databend (github.com)