This article was first published on the vivo Internet Technology WeChat official account.
Author: Yang Yijun
This article describes the background of our Linux page cache optimization, introduces the basic concepts of the page cache, lists some solutions to the IO performance bottleneck of Kafka, explains how to tune the relevant page cache parameters, and compares performance before and after the optimization.
1、 Optimization background
As the business grows rapidly, we need to process trillions of records every day. Under this read and write load, the pressure on the Kafka cluster becomes huge, and disk IO becomes the biggest performance bottleneck of the cluster.
When there is a sudden increase in incoming or outgoing traffic, disk IO stays at full utilization, new read and write requests cannot be processed, and some broker nodes may even go down in an avalanche, which affects the stability of the cluster.
As shown in the figure below, the disk IO is continuously full:
This seriously affects the stability of the cluster and, in turn, the stable operation of the business. To address it, we developed several targeted optimization schemes:
- Optimize the page cache parameters of the Linux operating system;
- Limit the access traffic of Kafka cluster users, to avoid disk IO pressure caused by a sudden increase in incoming/outgoing traffic;
- Perform resource-group isolation (physical isolation of cluster brokers) by business, to avoid different services interfering with each other through shared disk IO;
- Optimize the service parameters of the Kafka cluster broker nodes; [not covered in this article]
- Modify the Kafka replica-migration source code to implement incremental, concurrent replica migration, reducing the pressure that migration puts on the disk IO of cluster broker nodes; [not covered in this article]
- Develop a Kafka cluster automatic load-balancing service to rebalance the cluster load periodically;
- Replace ordinary mechanical hard disks with SSDs, which have much better IO performance, and use disk RAID to balance the IO load across the disks within a broker; [not covered in this article]
- Modify the Kafka source code to limit the access traffic of a single broker and a single topic in the cluster, achieving the finest-grained traffic control; when the traffic of a single broker surges, it is capped at the upper limit, so nodes are not brought down by abnormal traffic; [not covered in this article]
- Modify the Kafka source code to fix the defect that a replica-migration task cannot be terminated manually once started, so that a migration can be stopped when it drives the load too high;
- Competition for network bandwidth in the data center also indirectly affects follower-to-leader synchronization: followers fall behind and then pull historical data to catch up, which increases the IO load. Therefore, network bandwidth priorities should be marked, and the Kafka cluster's priority should be raised when there is contention; Kafka brokers should not share a data-center switch with other services that consume large amounts of network bandwidth. [not covered in this article]
The above lists only the main optimization schemes; there are some others, which will not be repeated here. This article mainly explains the optimization of page cache parameters in the Linux operating system.
2、 Basic concepts
1. What is page cache?
The page cache is a cache for the file system. By caching file data from disk in memory, it reduces disk I/O operations and improves performance.
Two main factors make caching disk data in memory effective for performance:
- Disk access is several orders of magnitude slower than memory access (milliseconds versus nanoseconds);
- There is a high probability that the accessed data will be accessed again.
The file reading and writing process is as follows:
2. Read cache
When the kernel initiates a read request (for example, a process initiates a read() request), it first checks whether the requested data is cached in the page cache.
If so, the data can be read directly from memory without accessing the disk; this is called a cache hit.
If the requested data is not in the cache (a cache miss), it must be read from disk. The kernel then adds the read data to the page cache so that subsequent read requests can hit the cache.
The page cache can hold just part of a file's contents; it does not need to cache the whole file.
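A quick way to see the read cache at work is to time two consecutive reads of the same file. This is only a rough sketch (the file name /tmp/pc_test and the size are arbitrary), and actual timings depend on hardware:

```shell
# Create a 64 MB test file, then read it twice; the second read is normally
# served from the page cache and completes much faster than the first.
dd if=/dev/zero of=/tmp/pc_test bs=1M count=64 2>/dev/null
time dd if=/tmp/pc_test of=/dev/null bs=1M 2>/dev/null   # may have to hit disk
time dd if=/tmp/pc_test of=/dev/null bs=1M 2>/dev/null   # likely a cache hit
rm -f /tmp/pc_test
```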
3. Write cache
When the kernel handles a write request (for example, a process calls write()), it also writes directly to the page cache; the backing storage is not updated immediately (so if the server loses power, there is a risk of data loss).
The kernel marks the page to be written as dirty and adds it to the dirty list. The kernel periodically writes the pages in the dirty list back to the disk, so that the data on the disk is consistent with the data cached in memory.
When either of the following two conditions is met, dirty data is flushed to disk:
- The dirty data has existed for longer than dirty_expire_centisecs (default 3000 centisecs, i.e. 30 seconds);
- The memory occupied by dirty data exceeds vm.dirty_background_ratio (default 10%) of total memory; the kernel flusher threads (pdflush in older kernels) are then woken up to write the dirty data back.
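On a running Linux machine, the current amount of dirty data and the two flush triggers described above can be inspected directly (read-only, no root needed); a small sketch:

```shell
# Amount of dirty / writeback data currently in the page cache (in kB)
grep -E '^(Dirty|Writeback):' /proc/meminfo
# The two flush triggers
cat /proc/sys/vm/dirty_expire_centisecs    # time trigger, default 3000 (30 s)
cat /proc/sys/vm/dirty_background_ratio    # size trigger, default 10 (%)
```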
4. Page cache cache view tool
How do we check the cache hit ratio? We can use cachestat, a cache hit-ratio viewing tool from Brendan Gregg's perf-tools.
(1) Download and install
mkdir /opt/bigdata/app/cachestat
cd /opt/bigdata/app/cachestat
git clone --depth 1 https://github.com/brendangregg/perf-tools
(2) Start execution
(3) Description of output content
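cachestat prints per-interval hit and miss counters; the hit ratio it reports is simply hits / (hits + misses). As an illustration with made-up sample numbers (6352 hits, 115 misses):

```shell
# Compute the cache hit ratio from sample HITS and MISSES counts
HITS=6352
MISSES=115
awk -v h="$HITS" -v m="$MISSES" \
    'BEGIN { printf "hit ratio = %.1f%%\n", h * 100 / (h + m) }'
# prints: hit ratio = 98.2%
```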
5. How to recycle page cache
Execute the script: echo 1 > /proc/sys/vm/drop_caches. You may need to wait a while here, because an application is still writing data.
After the cache is reclaimed, buff/cache should normally drop to 0. The reason it is not 0 here is that data is being written continuously.
3、 Parameter tuning
Note: servers with different hardware configurations may have different effects. Therefore, you need to consider your own cluster hardware configuration when setting specific parameter values.
The main factors considered include: CPU core number, memory size, hard disk type, network bandwidth, etc.
1. How to view page cache parameters
Execute the command: sysctl -a | grep dirty
2. Default values of operating system page cache related parameters
vm.dirty_background_bytes = 0        # Implements the same function as vm.dirty_background_ratio; only one of the two takes effect. Flushing is triggered once dirty pages reach this many bytes.
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0                   # Implements the same function as vm.dirty_ratio; only one of the two takes effect. Once dirty pages reach this many bytes, new write requests are blocked and flushing is triggered.
vm.dirty_ratio = 20
vm.dirty_expire_centisecs = 3000     # 30 seconds (unit: centisecs, 1/100 s)
vm.dirty_writeback_centisecs = 500   # 5 seconds (unit: centisecs)
3. Problems that can arise when a large amount of data is cached in the system
- The more data is cached, the greater the risk of losing data.
- There will be periodic IO peaks, and the peaks will last longer; during these periods, the performance of all new write IO will be very poor (in extreme cases, writes hang outright).
The latter problem has a great impact on applications with high write load.
4. How to adjust kernel parameters to optimize IO performance?
（1）vm.dirty_background_ratio parameter optimization
When the proportion of dirty data in total memory reaches the value set by this parameter, background flushing is triggered.
Lowering this parameter appropriately turns what would be one large IO flush into many small IO flushes, flattening the IO write peaks.
For servers with large memory and poor disk performance, this value should be set lower.
# Setting method 1 (takes effect immediately, lost after a server restart):
sysctl -w vm.dirty_background_ratio=1

# Setting method 2 (permanent):
echo vm.dirty_background_ratio=1 >> /etc/sysctl.conf
sysctl -p /etc/sysctl.conf

# Setting method 3 (permanent): you can also create your own parameter-optimization
# file under /etc/sysctl.d/ to keep system tuning parameters organized by category:
touch /etc/sysctl.d/kafka-optimization.conf
echo vm.dirty_background_ratio=1 >> /etc/sysctl.d/kafka-optimization.conf
sysctl --system
（2）vm.dirty_ratio parameter optimization
When the proportion of dirty data in total memory exceeds this setting, the system stops all application-layer IO write operations and waits until the dirty data has been flushed before resuming IO. Triggering this condition therefore has a very large impact on users. If the write pressure is high, it is advisable to increase this parameter appropriately; if the write pressure is small, it can be lowered appropriately.
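To get a feel for the absolute numbers involved, multiply the ratio by the machine's memory size. For example (a hypothetical 128 GB broker with the default vm.dirty_ratio of 20; both values are illustrative):

```shell
# How much dirty data can accumulate before application writes block,
# for a given memory size and vm.dirty_ratio (both values are illustrative)
MEM_GB=128
DIRTY_RATIO=20
awk -v m="$MEM_GB" -v r="$DIRTY_RATIO" \
    'BEGIN { printf "writes block after about %.1f GB of dirty data\n", m * r / 100 }'
# prints: writes block after about 25.6 GB of dirty data
```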
（3）vm.dirty_expire_centisecs parameter optimization
This parameter works together with vm.dirty_background_ratio: one expresses a size ratio and the other a time. Flushing is triggered as soon as either condition is met.
Why is it so designed? Let’s imagine the following scenario:
- If there were only vm.dirty_background_ratio, dirty data in the cache would have to exceed that threshold before the flushing condition was met;
- If the data never reached the threshold, it would never be persisted to disk; in that case, once the server restarted, the data in the cache would be lost.
To handle this situation, a data expiration-time parameter is added: even when the data volume has not reached the threshold, the data is still flushed to disk once it reaches the configured expiration time.
This effectively solves the problem above; in fact, this dual-trigger design appears in most frameworks.
（4）vm.dirty_writeback_centisecs parameter optimization
In theory, lowering this parameter increases the flushing frequency so that dirty data is written to disk sooner; however, you must make sure that each flush can complete within the interval.
（5） vm.swappiness parameter optimization
Disable swap space, set vm.swappiness=0
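A quick sketch for checking the current value and persisting the change (the file name kafka-optimization.conf mirrors the sysctl.d example earlier and is only a convention):

```shell
# Read the current value (no root needed)
cat /proc/sys/vm/swappiness
# Persist vm.swappiness=0 (run as root):
#   echo 'vm.swappiness=0' >> /etc/sysctl.d/kafka-optimization.conf
#   sysctl --system
```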
5. Comparison of effect before and after parameter tuning
(1) Write traffic comparison
As can be seen from the figure below, there are a lot of spikes in the write traffic before optimization, and the fluctuation is very large. After optimization, the write traffic is smoother.
(2) Disk IO util comparison
As can be seen from the figure below, there are a lot of IO spikes and large fluctuations before optimization; after optimization, IO is much smoother.
(3) Comparison of network traffic
As can be seen from the figure below, there is no impact on the network traffic before and after optimization.
The final optimization effect may be different for different models and different hardware configurations, but the trend of parameter change should be consistent.
1. When vm.dirty_background_ratio and vm.dirty_expire_centisecs are set larger
- Incoming and outgoing traffic jitter increases, with a large number of spikes;
- The IO jitter becomes larger, a large number of spikes appear, and the disk is continuously full;
- The average size of inflow and outflow is not affected;
2. When vm.dirty_background_ratio and vm.dirty_expire_centisecs are set smaller
- Incoming and outgoing traffic jitter becomes smaller; the traffic tends to be smooth and stable, without spikes;
- The disk IO jitter becomes smaller, without spikes, and the disk IO is not full;
- The average size of inflow and outflow is not affected;
3. When vm.dirty_ratio is decreased (below 10)
- Incoming and outgoing traffic shows an obvious trough at intervals, because once the amount of dirty data exceeds the value set by vm.dirty_ratio, write requests are blocked while flushing completes.
4. When vm.dirty_ratio is increased (above 40), there is no obvious trough in incoming and outgoing traffic, and the flow is smooth;
5. When the following three parameters are the corresponding values, the inflow and outflow flow is very smooth and tends to be a straight line;
6. As shown in the figure below, the average traffic over the whole tuning process is not affected