Database ‖ Sharing the upgrade process of a Hadoop cluster with thousands of nodes


Last time, while troubleshooting a Hadoop NameNode on a system holding over a trillion records, we finally solved the bottleneck of Hadoop 2.6.0 after four days of hard work. But life often follows Murphy's law: the worst case you try to avoid may happen anyway.

To avoid hitting a second NameNode bottleneck, I decided to upgrade the cluster from Hadoop 2.6.0 to Hadoop 3.2.1 and enable federation. After the upgrade we still ran into many unexpected problems. They were all eventually solved through a series of troubleshooting steps, and I think this upgrade experience can help some readers avoid detours.

Next, enjoy:

1、 Preparation before upgrading

When upgrading a Hadoop cluster, especially a large one with thousands of nodes as in this case, two questions need to be answered clearly:

First, is it really necessary to upgrade the cluster, and will data growth actually hit the bottleneck of the current version?

Let us be clear up front: upgrading a Hadoop cluster as large as this one is very difficult in itself. Adapting the underlying source code of the database takes a great deal of manpower and resources; modifying the source code of our product lsql alone took our team two weeks. On top of that, investigating and handling the series of post-upgrade problems listed below took a great deal of time and energy.

The main reason we chose to upgrade this time is that version 2.6.0 is too old and may have other problems that have not yet surfaced. At the same time, the customer's data grows by nearly 100 billion records every day; without an upgrade, they would soon hit a second NameNode bottleneck. If the customer's system could run stably on the current version and the data volume were not going to grow significantly, we would not have chosen to upgrade the cluster.

Second, have we prepared enough before upgrading, and do we have contingency plans for the problems the upgrade may cause?

Before this upgrade we did a lot of preparation. We ran many upgrade drills to make sure we could handle emergencies during the process. Because this is the customer's trillion-record production system, data loss or a prolonged service outage during the upgrade was unacceptable. For this reason we prepared a rollback plan, a recovery plan for data loss, and functional verification steps to run after the upgrade.

2、 Not easy: problems after upgrading and their solutions

1. After enabling federation, Hadoop became slower instead of faster

In theory, once Hadoop is upgraded to federation, read and write requests are balanced across three different NameNodes, which reduces the load on each NameNode and improves read and write performance. In practice, the NameNode load did not drop; on the contrary, the NameNode became very slow.

A cluster this sluggish could not meet the needs of the production system. By this point the upgrade had been going on for three days. If the sluggishness could not be resolved, then however unwilling we were, we would have to start the rollback plan immediately. But we had been preparing for this upgrade for a long time, and the customer was looking forward to the performance improvement. I gently stroked the hair follicles on my head that had been dormant for years and decided to fight through the night.

The first task was to locate the cause of the slowdown. Starting from the stack traces, I found that the main problem was the NameNode stalling. We could not keep investigating on the production system, and if we did not want to roll back, we had to make the system usable. So we started from the place we knew best: we modified lsql and added a cache everywhere it requests file information from the NameNode. Once a file has been requested, the result is cached so that the NameNode is not asked for it a second time.
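The caching idea can be sketched roughly as follows. This is only an illustration in Python, not the lsql implementation: the class name NameNodeLookupCache, the fetch callback, and the time-to-live value are all hypothetical, and a real cache would also have to decide how to invalidate entries when files change.

```python
import time
from typing import Any, Callable, Dict, Tuple


class NameNodeLookupCache:
    """Remembers the result of a metadata lookup so the NameNode is asked only once.

    `fetch` stands in for whatever call actually reaches the NameNode
    (hypothetical; the original article does not show lsql's code).
    """

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # path -> (time the entry was stored, cached value)
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, path: str) -> Any:
        now = self._clock()
        hit = self._entries.get(path)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # served locally, no NameNode request
        value = self._fetch(path)  # the only place that talks to the NameNode
        self._entries[path] = (now, value)
        return value

    def invalidate(self, path: str) -> None:
        """Drop a cached entry, for example after the file is rewritten."""
        self._entries.pop(path, None)
```

The expiry time is a trade-off assumed here for illustration: a longer value spares the NameNode more requests but serves stale metadata for longer.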

After the lsql change, the state of the NameNode improved greatly and the system was basically usable. However, we could not give a reasonable explanation for why NameNode throughput dropped rather than rose after moving to federation. Our preliminary conclusion was that the historical data still sat in a single namespace of the federation, so the data was unevenly distributed. We hoped that as new data gradually arrived, the load would become balanced again.

2. After upgrading to federation, the database became less stable and went down easily

After the upgrade, we found that our database system lsql went down almost every few days.

By examining the lsql outage logs, we found a large number of entries indicating that the Hadoop NameNode was frequently in standby state. This made the Hadoop service unavailable as a whole, which in turn brought down the lsql database.

Further analysis showed that the Hadoop NameNodes had not crashed: both the active and the standby were alive. Combined with the ZooKeeper logs, we found that an active/standby failover of the NameNode had occurred during this period.

Two questions arise here: why did the active/standby failover occur, and does the service become unavailable during a failover?

As for the cause of the failover, comparing the failover times showed that the NameNode was handling relatively heavy requests at those moments, such as deleting a snapshot or serving heavy business queries. The most likely cause is that the high NameNode load made its ZooKeeper connection time out. We explain why the NameNode load was high later.

As for whether the service becomes unavailable during a failover, we ran two tests. In the first, we switched over using the failover command provided by Hadoop, and the failover had no impact on the service. In the second, we killed the active NameNode directly, and the problem reappeared. So in Hadoop 3.2.1 the service can become unavailable during an active/standby failover. We then analyzed where the error came from.

Looking at the design and implementation of the RouterRpcClient class, it appears intended to handle failover, and it reads several related configuration parameters.

In the end we found a flaw in the design logic of this class. Although the parameters above are reserved, they do not take effect. During an active/standby failover there is a stage in which both NameNodes are in standby state, and at that stage the router reports an error directly, which makes the service unavailable.

We read the implementation of this class in detail and found that, although it reserves retry logic, that retry logic never takes effect because of a bug, so we fixed it. After the fix, the lsql outages no longer occurred.
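The behaviour we wanted from the router can be sketched as follows. This is a simplified Python illustration of retrying while every NameNode reports standby, not the actual RouterRpcClient code or our patch; StandbyError, call_with_failover_retry, and the attempt and delay values are hypothetical names and numbers chosen for the example.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class StandbyError(Exception):
    """Raised by a (simulated) NameNode that is currently in standby state."""


def call_with_failover_retry(namenodes: Sequence[Callable[[], T]],
                             attempts: int = 5,
                             delay_seconds: float = 0.5,
                             sleep: Callable[[float], None] = time.sleep) -> T:
    """Try each NameNode in turn; if all are standby, wait and try again.

    Without the outer loop, a window in which both NameNodes are standby
    surfaces immediately as an error to the caller.
    """
    if not namenodes or attempts < 1:
        raise ValueError("need at least one NameNode and one attempt")
    last_error: Optional[StandbyError] = None
    for attempt in range(attempts):
        for call in namenodes:
            try:
                return call()
            except StandbyError as err:
                last_error = err  # this one is standby, try the next
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give the failover time to complete
    assert last_error is not None
    raise last_error
```

The point of the sketch is only the shape of the logic: an error is returned to the caller after the retries are exhausted, not on the first round in which no NameNode is active.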

With the fixed version, the router's log recorded its retries during a failover.

3. Database queries hang intermittently

After Hadoop was upgraded to 3.2.1, lsql stalled intermittently, leaving the business query page stuck on a loading spinner.

I ran into this phenomenon the day after arriving on site. After a series of investigation and tracing, I finally traced it to stalls in the Hadoop NameNode (a familiar feeling; yes, here it comes again...). Concretely, a single ls operation would hang for 20 to 30 seconds. Logging in to the NameNode machine and checking the load, we found CPU usage at about 3000%.

After capturing jstack dumps many times, we found the stack location that was causing the NameNode to stall:

Yes, it is addBlock: the write lock taken when writing data blocks all queries. Based on earlier experience, we believed this lock was the cause, so we removed the lock.

Since the NameNode active/standby failover problem had already been solved, we could afford to verify this boldly on the production system. After removing the lock, the result was very bad. Blocking did not decrease; it became more serious. The load on the NameNode did not drop; it soared to full capacity.

We had no choice but to keep investigating and carefully analyze the related implementation of this class, and we finally found the problem.

With these findings in mind, we made a small change to the code.

Analysis showed that every addBlock request triggers a loop over all the machines in a rack. In other words, every time addBlock is called, the NameNode iterates over the devices in a rack. If the cluster has thousands of devices, that means thousands of iterations, and CPU usage goes through the roof. While the lock was in place, the lock limited the damage: writes were slow, but queries still worked. With the lock removed, the CPU was saturated and queries could not run at all. jstack then showed that the stall had moved to a different place.

We temporarily set the NameNode log level to DEBUG and saw output that further confirmed our idea.

We also compared against the source code of Hadoop 2.8.5 and found that this logic changed significantly in Hadoop 3.2.1. The design is flawed, and this loop burns a lot of CPU. The intent is to pick nodes at random, but the implementation traverses all nodes under a rack and locks them one by one. A single addBlock thus takes hundreds of locks for no good reason. How could it be designed this way? We followed some of the 2.8.5 code and revised this part again so that it picks nodes at random.

After the change we kept capturing jstack dumps. Frustratingly, this part of the logic no longer showed up, but similar loops appeared elsewhere. Reading the code in detail, we found too many such loops in this class in Hadoop 3.2.1, and we could not afford to change them all on a production system. We had to approach the problem from a different angle. If this loop is what stalls the NameNode, why does it stall only intermittently rather than all the time? Answering that question might open a window.

(1) A mystery caused by the xceiver count on SSD machines

We analyzed the list of IPs in the loop (DataNode IPs) and, cross-checking them against the devices on site, found a pattern: the IPs in these loops all belonged to SSD devices, not SATA devices. The on-site devices are divided into six groups, one rack per group, half SATA and half SSD. We then analyzed when these DataNodes get added to the loop. From the logs and the source code, we finally traced it to the exclude list: Hadoop considered these devices bad and decided they should be excluded.

Were all our SSD devices down, or was something wrong with their network? After investigation we ruled both out. The problem went back to the source code. After about a day of tracing, we found the following logic:

When Hadoop chooses a DataNode to write to, it takes the DataNode's load into account. If the load is high, it adds the node to the exclude list so that the node is skipped and its load can come down. What determines a DataNode's load is its number of xceiver connections. By default, as soon as a node's xceiver count exceeds twice the cluster-wide average, the node is excluded. Because of lsql's heterogeneous storage design, we read data from SSD first, and the SSD devices are fast and relatively few in number, so these machines carry far more connections than the SATA machines. The SATA disks hold mostly cold data that almost nobody queries. As a result, all or most of the SSD devices are excluded almost instantly because of their connection counts.
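The load check described above can be illustrated with a small Python sketch. It is a simplification for explanation only, not Hadoop's code: the function name too_busy_nodes and the sample connection counts are made up. In Hadoop 3.x the check is, as far as I know, controlled by dfs.namenode.redundancy.considerLoad and dfs.namenode.redundancy.considerLoad.factor; verify the names against the hdfs-default.xml of your version.

```python
from typing import Dict, Set


def too_busy_nodes(xceivers: Dict[str, int], factor: float = 2.0) -> Set[str]:
    """Return the nodes whose xceiver count exceeds factor * cluster average."""
    if not xceivers:
        return set()
    average = sum(xceivers.values()) / len(xceivers)
    return {node for node, count in xceivers.items() if count > factor * average}


# Made-up numbers: a few hot SSD nodes among many nearly idle SATA nodes.
sample_load = {"ssd1": 300, "ssd2": 300}
sample_load.update({f"sata{i}": 10 for i in range(8)})

excluded_by_default = too_busy_nodes(sample_load)             # factor 2.0
excluded_with_bigger_factor = too_busy_nodes(sample_load, 5.0)
```

With these sample numbers the average is 68 connections, so the default threshold is 136 and both SSD nodes are excluded, while a factor of 5 raises the threshold to 340 and excludes nothing. The check only bites when the hot nodes are a minority, which matches our situation of few, heavily read SSD machines.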

Our Hadoop cluster writes some data with the One_SSD storage policy: one replica is stored on a SATA device and the other on an SSD device. Hadoop 3.2.1 picks a node at random from the SSD devices. If the node is not in the exclude list, the replica is written to it. If the node is excluded, Hadoop tries again with another randomly chosen node, and the cycle repeats. But because of the xceiver load, most or all of our SSD devices are excluded, which leads directly to continuous looping, sometimes hundreds of iterations. In the end Hadoop may find that no SSD device is available. The log then reports that, of the two replicas, only the one on SATA was written and one replica is missing.
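To see why the exclusions turn into hundreds of iterations, here is a rough Python model of the selection loop. It is an assumption-laden simplification (pick a random candidate, discard it if excluded, repeat), not the real NameNode code; choose_random_target and the node names are hypothetical.

```python
import random
from typing import Iterable, Optional, Set, Tuple


def choose_random_target(candidates: Iterable[str], excluded: Set[str],
                         rng: random.Random) -> Tuple[Optional[str], int]:
    """Pick random candidates until one is not excluded.

    Returns (chosen node or None, number of picks made). When every
    candidate is excluded, the loop visits all of them before giving up.
    """
    remaining = list(candidates)
    picks = 0
    while remaining:
        picks += 1
        node = remaining.pop(rng.randrange(len(remaining)))
        if node not in excluded:
            return node, picks
    return None, picks


ssd_nodes = [f"ssd{i}" for i in range(200)]
# All SSD nodes excluded: 200 wasted picks for a single replica.
nothing_found = choose_random_target(ssd_nodes, set(ssd_nodes), random.Random(0))
```

In this model, one addBlock call with every SSD node excluded costs as many picks as there are SSD nodes and still ends without a target, which is the pattern we saw: heavy CPU use on the NameNode and a missing SSD replica.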

These problems have two consequences that are fatal for a big data cluster. First, NameNode CPU usage soars, causing severe intermittent stalls. Second, SSD devices are intermittently excluded so that only the SATA replica is written; if that SATA device suddenly loses a disk, the data is lost.

The fix for the xceiver problem on SSD nodes is simple: disable the load check or increase the multiplier.

After the adjustment, the NameNode load dropped noticeably, and the response time of all business queries returned to normal.

(2) The same problem caused by an unreasonable rack allocation strategy

While verifying the fix above, we found that although the NameNode load had dropped significantly, the loop described earlier still occurred. Looking at the debug log, we were surprised to find that the devices in the loop were now SATA devices, not SSD devices. What was going on? It did not make sense.

Fortunately, after several days of studying the Hadoop source code, we had a basic understanding of this logic. We turned on Hadoop's debug log and examined the output.

We found that the loop occurs when the number of replicas written to one rack reaches the per-rack limit and no other rack has storage that satisfies the storage policy. So how many replicas may be written to one rack, and how is that decided?

We have 6 racks, 3 SATA and 3 SSD. The log shows that we needed to write 24 replicas of a data block to SATA devices. Each rack can take at most 5 replicas, so the 3 SATA racks can take 15, and the other 9 cannot be written. The rack placement logic in Hadoop 3.2.1 does not take this situation into account: the SSD devices cannot accept data written under the SATA storage policy, so Hadoop loops over the SATA devices hundreds of times and the CPU soars.

Because the remaining nine replicas have nowhere to go, Hadoop loops over all the SATA devices hundreds of times, just as before, and this too drives CPU usage up.
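The numbers above can be reproduced with a short calculation. The per-rack formula below is how I recall the default block placement policy computing its limit; treat it as an assumption and check the source of your Hadoop version. The function names are mine.

```python
def max_nodes_per_rack(total_replicas: int, num_racks: int) -> int:
    """Per-rack replica limit, as recalled from the default placement policy."""
    limit = (total_replicas - 1) // num_racks + 2
    if limit == total_replicas:
        limit -= 1  # never allow every replica to land in one rack
    return limit


def unplaceable_replicas(total_replicas: int, num_racks: int,
                         eligible_racks: int) -> int:
    """Replicas left over when only some racks match the storage policy."""
    per_rack = max_nodes_per_rack(total_replicas, num_racks)
    placed = min(total_replicas, per_rack * eligible_racks)
    return total_replicas - placed


per_rack_limit = max_nodes_per_rack(24, 6)      # 24 replicas, 6 racks
left_over = unplaceable_replicas(24, 6, 3)      # only the 3 SATA racks qualify
```

For our cluster this gives a limit of (24 - 1) // 6 + 2 = 5 replicas per rack, so the 3 SATA racks hold 15 replicas and 9 are left over. The limit is computed from all 6 racks, while only 3 of them can actually satisfy the SATA storage policy, and that mismatch is what sends the NameNode into the loop.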

In our production environment, writing 24 replicas is rare and happens only in some update scenarios. Data involved in an update must be read by every node (somewhat like Hadoop's DistributedCache); with only 2 replicas, this would put too much pressure on individual DataNodes, so we raise the replica count.

This also matches earlier feedback from the business side: whenever an update runs, business queries stall noticeably.

To address this, we think the rack allocation strategy should be adjusted: SSD and SATA devices should not be separated into different racks, and every rack should contain both. That would avoid the problem. However, we were worried that changing the rack policy at this point would cause Hadoop to migrate a large number of replicas to satisfy the new policy, and our cluster is too large for that. So we finally chose to modify the NameNode source code and solve the problem at the source level.

3、 Summary

To sum up, this upgrade yielded two main lessons:

First, the Hadoop router does not retry when a NameNode fails over. During an active/standby failover, requests fail with errors, making the whole system unavailable.

Second, in the addBlock design of Hadoop 3.2.1, the rack placement strategy can cause repeated looping, which leads to high CPU consumption and frequent locking.

Finally, the Hadoop cluster upgrade was not smooth overall, and we hit some thorny problems after the upgrade. We hope readers will prepare thoroughly before upgrading a Hadoop cluster, especially a very large one, and that when the upgraded system cannot meet production demands, they will have the courage to cut their losses and roll back to the previous version in time.
