As China's Internet industry has grown, very large clusters are no longer as rare as they were a few years ago, but they are still uncommon, and the chance to troubleshoot performance on one is rarer still.
This time I troubleshot the performance of a Hadoop NameNode on a cluster holding trillions of records. In the several years since I started my company, this was the largest cluster I had encountered, the longest investigation, the heaviest troubleshooting workload, and the one that cost me the most hair. In the end I had to draw on the experience of an expert.
So now that the problem is solved, I am writing up the whole investigation right away as a summary, in the hope that it helps you. Enjoy:
It started with a customer reporting that our database had suddenly become much slower. Queries and retrievals that used to return in seconds now just kept spinning and hanging. Because the problem appeared suddenly, my colleagues on site first checked for business changes but found nothing. As this was the first very large data project I had taken on after founding the company, I took it very seriously and went to the site immediately.
First, an introduction to the platform architecture. The bottom layer uses Hadoop for distributed storage, the database in the middle layer is lsql, and real-time data import uses Kafka. The daily data volume is 50 billion records, and data is retained for 90 days. There are more than 4000 tables, the largest of which has nearly 2 trillion records; the total data volume is nearly 5 trillion records and occupies 8 PB of storage.
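As a quick sanity check on these figures, here is a small back-of-the-envelope sketch. It assumes a constant daily volume, decimal petabytes, and that the 8 PB figure covers the roughly 5 trillion records; none of these assumptions is stated in the article.

```python
# Figures quoted in the article
records_per_day = 50_000_000_000      # 50 billion records imported per day
retention_days = 90                   # data is kept for 90 days

# Steady-state record count if the daily volume were constant (assumption)
steady_state_records = records_per_day * retention_days

# Average bytes per record implied by 8 PB for ~5 trillion records.
# Assumes decimal petabytes; whether 8 PB includes replication is unknown.
total_bytes = 8 * 10**15
total_records = 5_000_000_000_000
bytes_per_record = total_bytes / total_records

print(f"{steady_state_records:,} records at steady state")
print(f"{bytes_per_record:.0f} bytes per record on average")
```

The 4.5 trillion steady-state figure is consistent with the "nearly 5 trillion" total quoted above.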
The platform mainly supports full-text retrieval, multidimensional queries, geolocation retrieval, data collision, and similar operations. Some business workloads also involve statistics and analysis, and there is a very small amount of data export and multi-table join operations.
① Before setting out
Before I started my company, I worked on Tencent's Hermes system, where real-time ingestion had reached 360 billion records per day and later nearly one trillion. Not to humblebrag, but I felt the way Liang Qichao did when he spoke at Peking University: I don't know much about super-large clusters, but I do know a little! Facing a system handling 50 to 100 billion records per day, I was confident enough that when I booked my ticket I also booked a return ticket for the same day.
To locate the problem quickly, I asked the site for some logs and jstack output before I left. The initial diagnosis was a bottleneck in the Hadoop NameNode (NN). We had done NN optimization many times before.
The figure below shows the stack analysis from that time. Looking at it, you would probably have felt just as confident: this was obviously Hadoop stalling.
② First day: try adjusting log4j
On the first day on site, the weather was sunny and my mood was still good.
The first thing I did on site was to capture jstack dumps of the Hadoop NameNode over and over. The conclusion: the stall really was in the NN. The NN has a global lock here, and all read and write operations were queued waiting for it. The details are as follows:
1. Where it was stuck
More than 1000 threads were waiting on this lock, so it is no wonder things stalled. Let's take a closer look at what the thread holding the lock was doing.
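Counting how many threads are parked on the same lock can be automated. The sketch below is a minimal Python example that parses jstack text and tallies waiters per lock; the dump excerpt, thread names, and lock addresses are made up for illustration, and only the "parking to wait for" line format is taken from what jstack prints.

```python
import re
from collections import Counter

# Illustrative, made-up excerpt in the format jstack prints for parked threads.
SAMPLE_DUMP = '''
"IPC Server handler 1 on 8020" daemon prio=5 tid=0x01 nid=0x11 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000005c0a1b2c8> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)

"IPC Server handler 2 on 8020" daemon prio=5 tid=0x02 nid=0x12 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000005c0a1b2c8> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)

"IPC Server handler 3 on 8020" daemon prio=5 tid=0x03 nid=0x13 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000005c0a1b2c8> (a java.util.concurrent.locks.ReentrantReadWriteLock$FairSync)

"pool-1-thread-1" prio=5 tid=0x04 nid=0x14 waiting on condition
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000005d0000010> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
'''

# Matches the lock address and lock class on a "parking to wait for" line.
PARKED = re.compile(r"- parking to wait for\s+<(0x[0-9a-f]+)> \(a ([^)]+)\)")


def count_waiters(dump):
    """Return a Counter mapping (lock address, lock class) to parked-thread count."""
    return Counter(PARKED.findall(dump))


waiters = count_waiters(SAMPLE_DUMP)
for (address, lock_class), count in waiters.most_common():
    print(f"{count:5d} threads waiting on {address} ({lock_class})")
```

Running this over several dumps taken a few seconds apart shows whether the same lock stays at the top of the list.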
2. Problem analysis
Clearly, logging is a bottleneck and it blocks for too long.
1) The log4j pattern should not include [%l]. It creates a Throwable object, which is a heavyweight object in Java.
2) Logging is too frequent and the disk cannot flush fast enough.
3) log4j has a global lock, which limits throughput.
3. Adjustment plan
1) The customer's Hadoop version is 2.6.0. Log handling in this version has many problems, so we applied the patches for issues that the Hadoop project has explicitly acknowledged:
https://issues.apache.org/jir… (NN slow due to logging)
https://issues.apache.org/jir… (log outside the lock to avoid holding it while logging)
https://issues.apache.org/jir… (logging in processIncrementalBlockReport seriously affects NN performance)
2) Disable all INFO-level logs in the NameNode
We observed that when a large amount of log output is produced, the global lock blocks the NN.
For now, the change is to suppress log output to log4j and disable all INFO-level logs of the NameNode.
3) Remove the [%l] parameter from the log4j output pattern
This parameter creates a new Throwable object to obtain the line number. That object is expensive, and creating many of them reduces throughput.
4) Enable the asynchronous audit log
Set dfs.namenode.audit.log.async to true to make the audit log asynchronous.
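To check a log4j pattern for costly location specifiers before deploying it, something like the following sketch could be used. It relies on the log4j 1.x PatternLayout documentation, which warns that %C, %F, %l, %L and %M require caller location information; the function name and the example patterns are hypothetical.

```python
import re

# Conversion characters that log4j 1.x documents as needing caller location info
LOCATION_SPECIFIERS = {"C", "F", "l", "L", "M"}

# Either a literal "%%" or a conversion specifier with optional format modifiers,
# for example "%-5p" or "%20.30c"
TOKEN = re.compile(r"%%|%-?\d*(?:\.\d+)?([a-zA-Z])")


def costly_specifiers(pattern):
    """Return the location-related conversion characters used in a pattern."""
    found = []
    for match in TOKEN.finditer(pattern):
        char = match.group(1)          # None for a literal "%%"
        if char in LOCATION_SPECIFIERS:
            found.append(char)
    return found


# Hypothetical patterns for illustration
slow_pattern = "%d{ISO8601} %-5p %c{2} [%l] - %m%n"
fast_pattern = "%d{ISO8601} %-5p %c{2} - %m%n"

print(costly_specifiers(slow_pattern))
print(costly_specifiers(fast_pattern))
```

An empty result means the pattern does not force log4j to capture the caller's stack for every log line.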
4. Optimization effect
After this optimization, log4j indeed no longer caused stalls, but Hadoop's throughput was still stuck on the lock.
③ Second day: optimize du, then investigate and resolve every stall point
Continuing yesterday's work:
1. With the log4j problem solved, I kept capturing jstack dumps and caught the following location:
2. Code analysis showed that a lock is held here, and confirmed that it blocks all other access.
3. Studying the code further, I found that it is controlled by the following parameter:
(In version 2.6.5 the default value is 5000, and the problem no longer exists there.)
The core logic of this parameter is that, if the configured value is greater than zero, the lock is released after every that many files so that other operations can proceed. The problem exists only in Hadoop 2.6.0 and has been fixed in later versions.
1) Apply the official patch:
2) Remove all use of Hadoop du in lsql
5. Why patch
In version 2.6.5 the sleep time is configurable and defaults to 500ms, whereas in version 2.6.0 the sleep time is 1ms. I was worried that such a short sleep would cause problems.
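The lock-yielding behavior described above (release the lock after a configured number of files, pause briefly, then take it again) can be sketched as a toy model. This is not Hadoop's code: `summarize`, `yield_every`, `pause_seconds`, and the file sizes are hypothetical, and a plain `threading.Lock` stands in for the NameNode lock.

```python
import threading
import time


def summarize(file_sizes, lock, yield_every=0, pause_seconds=0.0):
    """Sum file sizes while holding `lock`.

    If `yield_every` is greater than zero, the lock is released after every
    `yield_every` files so that waiting callers get a chance to run.
    Returns (total size, number of times the lock was yielded).
    """
    total = 0
    yields = 0
    lock.acquire()
    try:
        for index, size in enumerate(file_sizes, start=1):
            total += size
            if yield_every > 0 and index % yield_every == 0:
                lock.release()             # let other readers and writers in
                yields += 1
                time.sleep(pause_seconds)  # the "sleep time" discussed above
                lock.acquire()
    finally:
        lock.release()
    return total, yields


sizes = [1] * 12_000                       # 12,000 hypothetical files
print(summarize(sizes, threading.Lock(), yield_every=5000))
print(summarize(sizes, threading.Lock(), yield_every=0))
```

With yielding disabled, a single large directory scan holds the lock from start to finish, which matches the blocking seen in the stack dumps.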
Following the same approach, I kept checking all the jstack dumps. By this point jstack could no longer catch any active thread in Hadoop, yet it was still stuck switching the read-write lock, which shows that:
1. Every function in the NameNode had been optimized to the point that jstack could no longer catch it;
2. The stack dumps showed only close to 1000 threads constantly trading the read-write lock, which indicates that request concurrency on the NN was very high and that lock context switching between threads had become the main bottleneck.
Therefore, the main idea from here on was to reduce how often the NN is called.
④ Third day: reduce NN request frequency as much as possible
To reduce the NN request frequency, I tried several methods:
1. Let different lsql tables use different partitioning functions
There are more than 4000 tables on site, and each table has more than 1000 partitions being written concurrently, so it was possible that too many files were being written at once, driving the NN request frequency too high. We therefore considered splitting and merging the small tables so that fewer files are written and the request frequency drops accordingly.
2. Work with on-site staff to clean up unneeded data and reduce pressure on the Hadoop cluster. After the cleanup, the number of file blocks in the cluster dropped from nearly 200 million to 130 million, a substantial reduction.
3. Adjust the heartbeat frequency of a series of NN-related interactions, such as those of the BlockManager.
4. Change the type of the NN's internal lock from a fair lock to a non-fair lock.
The parameters involved in this adjustment are as follows:
- dfs.blockreport.intervalMsec: adjusted from 21600000L to 25920000L (intended as 3 days; note that 25920000 ms is 7.2 hours, while 3 days would be 259200000 ms)
- dfs.blockreport.incremental.intervalMsec: the incremental block report interval, changed from 0 to 300 so that reports are sent in batches where possible (older versions do not have this parameter)
- dfs.namenode.replication.interval: adjusted from 3 seconds to 60 seconds to reduce the frequency
- dfs.heartbeat.interval: the heartbeat interval, adjusted from 3 seconds to 60 seconds to reduce the heartbeat frequency
- dfs.namenode.invalidate.work.pct.per.iteration: adjusted from 0.32 to 0.15 (15% of nodes) to reduce the number of nodes scanned
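Interval values given in milliseconds are easy to misread, so a quick conversion helps when reviewing settings like these. The sketch below converts the two block report values quoted above and estimates how the heartbeat interval changes the request rate seen by the NN; the DataNode count of 300 is a made-up number, since the article does not give the cluster size.

```python
MS_PER_HOUR = 60 * 60 * 1000


def ms_to_hours(milliseconds):
    """Convert a millisecond interval to hours."""
    return milliseconds / MS_PER_HOUR


def heartbeats_per_second(datanodes, interval_seconds):
    """Average heartbeat rate arriving at the NameNode."""
    return datanodes / interval_seconds


print(ms_to_hours(21_600_000))     # old block report interval, in hours
print(ms_to_hours(25_920_000))     # new block report interval, in hours
print(3 * 24 * MS_PER_HOUR)        # three days expressed in milliseconds

HYPOTHETICAL_DATANODES = 300       # assumption, not from the article
print(heartbeats_per_second(HYPOTHETICAL_DATANODES, 3))
print(heartbeats_per_second(HYPOTHETICAL_DATANODES, 60))
```

Under this assumption, moving the heartbeat interval from 3 to 60 seconds cuts the heartbeat rate by a factor of 20, whatever the actual node count is.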
Stack involved in this adjustment:
In the end, the stalling persisted. I had run out of tricks and was at a complete loss as to what to do next.
⑤ Fourth day: out of options, consider setting up a traffic-splitting mechanism
Dragging along a body worn out by three all-nighters in a row, on the morning of the fourth day I reported the investigation in detail to the company and the customer, and said outright that I was out of ideas. I proposed falling back to plan B:
1. Enable Hadoop Federation and solve the problem with multiple NameNodes;
2. Modify the lsql database right away so that a single lsql database can work with multiple Hadoop clusters, that is, build two identical clusters. The lsql database starts 600 processes: 300 send requests to the old cluster and 300 to the new cluster, which reduces the pressure on each.
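The process split in plan B amounts to a static assignment of each process to one of the clusters. A minimal sketch of such an assignment is shown below; the cluster names and the modulo rule are assumptions for illustration and do not describe how lsql would actually implement it.

```python
from collections import Counter

# Hypothetical cluster identifiers
CLUSTERS = ("old-cluster", "new-cluster")


def assign_cluster(process_index, clusters=CLUSTERS):
    """Pick a cluster for a process by round-robin on its index."""
    return clusters[process_index % len(clusters)]


# 600 processes, as in the plan described above
assignment = Counter(assign_cluster(i) for i in range(600))
print(assignment)
```

With two clusters the rule sends exactly 300 processes to each, so each NameNode would see roughly half of the request load.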
The company's view was that I should go back and sleep first, and decide once my head was clear.
The customer suggested that we keep investigating: the system had been running stably for more than a year, so the sudden problem made no sense, and they hoped for a deeper study.
Just as most system failures can be fixed with a restart, I decided to sleep first and hoped the problem would be solved when I woke up.
It was not, so after waking up I had no choice but to turn to my former colleague Gao Gao, the expert in charge of HDFS when I was at Tencent. He is as proficient in Hadoop as I am in hair-loss prevention tips, and hands-on tuning experience with clusters of tens of thousands of nodes is something you come across by luck rather than by searching. I figured that if he could not give me a pointer or two, probably nobody could, and I need not waste any more effort.
Gao Gao first asked about the basic state of the cluster and then gave me a number of useful suggestions. What encouraged me most was his assessment that our cluster was nowhere near its performance ceiling.
⑥ The last day: analyze every function that takes the NN lock
This time I did not rely on JMX information, for fear that it would be inaccurate. Instead I used BTrace to check which threads were locking the NN so frequently and driving its load so high.
After three hours of analysis, I was surprised to find that processIncrementalBlockReport was being called at a very high rate, far higher than any other caller. Isn't that the incremental block report logic of the DataNodes (DN)? Why was its frequency so high? Hadn't I already lowered the heartbeat frequency? Had it not taken effect?
Reading the Hadoop code carefully, I found the problem in this logic: it is called immediately every time data is written or deleted. The heartbeat parameters I had set are not honored on this code path in the customer's version of Hadoop, so setting them was useless. I urgently searched online for a patch and finally found one that solves not only the heartbeat frequency problem but also the lock frequency problem. By taking the lock less often, it reduces context switching and improves NN throughput.
With this patch applied, NN throughput increased noticeably. Access to the NN was no longer blocked, real-time Kafka consumption rose from 4 billion to 10 billion records per hour, and ingestion performance more than doubled. The patch solved the problem at its root.
The root cause is the single-lock design inside the HDFS NameNode, which makes that lock extremely heavy and expensive to hold. Every request must acquire the lock before the NN can handle it, so lock contention is fierce. Once large-scale imports or deletions hit the NN, the NameNode has to handle a large number of requests at once, and other users' tasks are affected immediately. The main effect of this patch is to make the locking for incremental block reports asynchronous, so that deletions, reports, and similar operations do not affect queries.
For a detailed description and the changes, see:
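The general idea of taking the lock less often can be illustrated with a toy comparison: handling each report under its own lock acquisition versus queuing reports and handling a batch per acquisition. This is only a sketch of the principle, not the patch itself; `CountingLock`, the function names, and the batch size of 500 are all hypothetical.

```python
import threading
from collections import deque


class CountingLock:
    """A lock wrapper that counts how many times it has been acquired."""

    def __init__(self):
        self._lock = threading.Lock()
        self.acquisitions = 0

    def __enter__(self):
        self._lock.acquire()
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self._lock.release()
        return False


def process_one_by_one(reports, lock):
    """Take the lock once per report."""
    processed = 0
    for _ in reports:
        with lock:
            processed += 1
    return processed


def process_batched(reports, lock, batch_size):
    """Queue the reports, then take the lock once per batch."""
    queue = deque(reports)
    processed = 0
    while queue:
        batch = [queue.popleft() for _ in range(min(batch_size, len(queue)))]
        with lock:
            processed += len(batch)
    return processed


single, batched = CountingLock(), CountingLock()
process_one_by_one(range(10_000), single)
process_batched(range(10_000), batched, batch_size=500)
print(single.acquisitions, batched.acquisitions)
```

Fewer acquisitions means fewer hand-offs of the lock between threads, which is the context-switching cost identified earlier as the main bottleneck.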
Finally, I summarize this troubleshooting effort from two angles: the causes of the problem and my suggestions.
① Causes of the problem
The system had been running smoothly before. The main reasons for the sudden problem were:
1. Users deleted a large number of files, which increased the pressure on Hadoop
- The disks had recently been almost full, so a batch of data was cleaned up
- Hadoop had recently been unstable, and a large number of files were released
2. The daily data volume had recently grown sharply. After tuning Hadoop we re-ingested the data, and the record counts from the logs showed that the recent data volume had increased considerably
3. Backlog of data to consume
Because many days of data had piled up during the tuning, Kafka was consuming at full speed, and full-speed consumption puts more load on the NN.
4. The impact of snapshots and the Mover on Hadoop
- Cleaning up snapshots released a large number of data blocks, which then had to be deleted
- The Mover added a large number of data blocks, which caused the system to delete a large number of file blocks on SSD. In addition, with more nodes and frequent heartbeats, processing incremental block reports put great pressure on the NN
② My suggestions
1. Never give up easily!
On the fourth day of troubleshooting, after trying a variety of solutions, I too thought about giving up and believed this performance failure had no solution. At a moment like that, it is worth talking it over with colleagues, or even former leaders, who may bring different ideas and inspiration. Trust the wisdom of the group!
2. We must understand how Hadoop works, which was also the key to this Hadoop tuning
(1) When we delete files in HDFS, the NameNode only deletes the directory entry and records the data blocks to be deleted in the pending deletion blocks list. The next time a DataNode sends a heartbeat to the NameNode, the NameNode sends the deletion command and the list to that DataNode. Therefore, when a large number of files are deleted, the pending deletion blocks list becomes very long, which results in timeouts.
(2) When we import data, the client writes the data to a DataNode, and the DataNode calls processIncrementalBlockReport on the NN immediately after receiving the data block. The more data is written, the more machines there are, and the more processes there are, the more frequently the NN is called. That is why the asynchronous lock patch has an effect here.
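The deletion flow in point (1) can be modeled in a few lines: deleting a file only queues its blocks, and the queue drains as DataNodes send heartbeats. The sketch below is a toy model, not HDFS code; `ToyNameNode`, the per-heartbeat cap, and the block counts are assumptions made for illustration.

```python
from collections import defaultdict, deque


class ToyNameNode:
    """Toy model of the pending deletion list described above."""

    def __init__(self, blocks_per_heartbeat):
        # Assumed cap on how many deletions are handed out per heartbeat
        self.blocks_per_heartbeat = blocks_per_heartbeat
        # DataNode name -> queue of block ids waiting to be deleted
        self.pending = defaultdict(deque)

    def delete_file(self, block_locations):
        """Record blocks for later deletion; nothing is deleted yet."""
        for datanode, block_id in block_locations:
            self.pending[datanode].append(block_id)

    def heartbeat(self, datanode):
        """Return the next batch of blocks this DataNode should delete."""
        queue = self.pending[datanode]
        count = min(self.blocks_per_heartbeat, len(queue))
        return [queue.popleft() for _ in range(count)]


namenode = ToyNameNode(blocks_per_heartbeat=1000)
namenode.delete_file(("dn1", block_id) for block_id in range(2500))

heartbeats = 0
while namenode.pending["dn1"]:
    namenode.heartbeat("dn1")
    heartbeats += 1
print(heartbeats)
```

In this model a longer heartbeat interval directly stretches the time needed to drain a long pending list, which is one way a mass deletion and a slower heartbeat can interact.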
3. The key point: never use Hadoop 2.6.0!
In the Hadoop project's own words, other versions have a few bugs, but this version has a lot of bugs. So the first thing to do once back is to urge the customer to upgrade as soon as possible.
PS: if you want to improve your Hadoop performance further, I recommend upgrading to Hadoop 3.2.1 and enabling federation. We have compiled the problems you may run into and the precautions to take.
And if you want to push past the next performance bottleneck, we are ready to show you how to deal with the Router bottleneck.