Author: Wu Yelei
I used to think of graceful pod shutdown very simply: doesn't a preStop hook take care of graceful exit? But recently I found that a preStop hook alone cannot meet the requirements in many scenarios. In this article, I will briefly analyze the problem of stopping a pod gracefully.
What is a graceful shutdown?
The term "graceful shutdown" originally comes from operating systems: before powering off, the OS completes some cleanup work. The opposite is a hard shutdown, such as pulling the power plug.
In a distributed system, a graceful shutdown is not just the business of a single process on a single machine; it also involves the other components in the system. For example, suppose we start a microservice and the gateway routes part of the traffic to us. Then:
- If we kill the process outright without a word, that portion of traffic cannot be handled correctly and some users are affected. Generally, though, the gateway or service registry keeps a heartbeat with our service, and after the heartbeat times out the system removes the service automatically and the problem resolves itself. This is a hard shutdown: even if the overall system is well written and self-healing, some jitter or even errors will occur.
- If we first tell the gateway or service registry that we are going offline, wait for them to finish removing the service, and only then stop the process, no traffic is affected. This is a graceful shutdown, which minimizes the impact of a single component's start and stop on the whole system.
By convention, SIGKILL is the hard-termination signal, while SIGTERM is the signal that notifies a process to exit gracefully. Therefore, many microservice frameworks listen for SIGTERM and, upon receiving it, deregister from the registry and perform other cleanup to achieve a graceful exit.
Back to Kubernetes (hereinafter k8s): when we want to kill a pod, the ideal flow is of course that k8s removes the pod from the corresponding service (if any) and sends SIGTERM to the pod so that each container can exit gracefully. But in practice, a pod can misbehave in all sorts of ways:
- It is stuck and cannot run the graceful-exit logic, or the logic takes too long to finish.
- The graceful-exit logic has a bug, say an infinite loop.
- The code is sloppy and ignores SIGTERM entirely.
Therefore, k8s builds a "maximum tolerable time" into the pod termination process: the grace period, configured in the pod's `.spec.terminationGracePeriodSeconds` field and defaulting to 30 seconds. When running `kubectl delete`, you can also pass the `--grace-period` flag to explicitly specify a graceful-exit time, overriding the value configured in the pod. Once the grace period is exceeded, k8s has no choice but to send SIGKILL and force the pod out.
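For example, a pod that needs more than the default 30 seconds to drain could raise the grace period in its spec (the name and value here are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  # Allow up to 120 seconds for graceful exit before SIGKILL.
  terminationGracePeriodSeconds: 120
  containers:
  - name: my-awesome-container
    image: my-app:latest
```

Running `kubectl delete pod my-app --grace-period=60` would then override this to 60 seconds for that particular deletion.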
In many scenarios, besides taking the pod off its k8s service and exiting the process gracefully, we have to do extra work, such as deregistering from a service registry outside of k8s. This is where the preStop hook comes in. K8s currently provides two kinds of preStop hooks, `Exec` and `HTTP`. In actual use, they are configured separately for each container through the pod's `.spec.containers.lifecycle.preStop` field, for example:
```yaml
spec:
  containers:
  - name: my-awesome-container
    lifecycle:
      preStop:
        exec:
          command: ["/bin/sh", "-c", "/pre-stop.sh"]
```
We can then write our own cleanup logic in the `/pre-stop.sh` script.
Finally, let's walk through the whole pod termination flow (the official documentation is more rigorous):
- 1. The user deletes the pod.
- 2.1. The pod enters the Terminating state.
- 2.2. At the same time, k8s removes the pod from the endpoints of the corresponding service.
- 2.3. At the same time, for containers with a preStop hook, the kubelet runs each container's hook; if a hook is still running when the grace period expires, the kubelet sends SIGTERM and waits another 2 seconds.
- 2.4. At the same time, for containers without a preStop hook, the kubelet sends SIGTERM immediately.
- 3. Once the grace period is exceeded, the kubelet sends SIGKILL to kill any containers that have not yet exited.
This is a fine process, but it has one problem: we can neither predict how long a pod's graceful exit will take, nor handle a failed graceful exit gracefully. In our product, TiDB Operator, this is unacceptable.
The challenge of stateful distributed applications
Why can't we accept this process? For stateless applications it is usually fine, but the following scenario is more complicated:
TiDB has a core distributed key-value storage layer called TiKV. Internally, TiKV achieves consistent storage based on Multi-Raft, a fairly complex architecture; here we can simply describe it as a leader/follower architecture, where leaders handle writes and followers replicate. Our scenario is performing planned maintenance on TiKV, such as rolling upgrades and node migration.
In this scenario, although the system can tolerate a minority of nodes going down, for planned downtime we should still shut down as gracefully as we can. Database workloads are inherently demanding and usually sit at the core of the whole architecture, so we want to keep jitter to a minimum. That requires a fair amount of cleanup work, for example migrating all raft leaders off the current node to other nodes before shutting it down.
Thanks to the system's good design, this kind of operation is usually fast. But in a distributed system, exceptions are routine, and a graceful exit may take too long or even fail. When that happens, for the sake of business stability and data safety, we cannot force-kill the pod; instead, we should halt the operation and ask an engineer to intervene. At this point, the pod termination flow described above no longer fits.
The careful way: controlling every step manually
In fact, k8s offers no out-of-the-box solution here, so in our controller (the TiDB cluster object itself is a CRD) we control the service start and shutdown logic in each maintenance scenario in great detail.
Setting the details aside, the general logic is: before stopping each service instance, the controller tells the cluster to perform the various pre-offline migration operations; only after they complete does it take the node offline and move on to the next one.
If the cluster fails to complete the migration and related operations, or takes too long, we can also "hold the line" and refuse to force-kill the node, which keeps operations such as rolling upgrades and node migration safe.
The problem with this approach is implementation complexity: we must write a controller ourselves, implement fine-grained control logic inside it, and keep checking in the controller's control loop whether each pod can be stopped safely.
Another way: decoupling the control flow of pod deletion
Complex logic is never easy to maintain, and writing a CRD plus a controller is no small amount of work. Is there a simpler, more general way to implement "guarantee a graceful shutdown, or fail the operation"?
Yes: the ValidatingAdmissionWebhook.
Some background first. Kubernetes's apiserver has had the AdmissionController design from the beginning. It is similar to the filters or middleware found in web frameworks: a pluggable chain of responsibility, where each plugin in the chain performs some operation on, or validation of, the requests received by the apiserver. Two examples of such plugins:
- DefaultStorageClass, which automatically sets a storageClass for PVCs that do not declare one.
- ResourceQuota, which verifies that pods' resource usage does not exceed the quota of their namespace.
Although these are plugins, before 1.7 all of them had to be compiled into the apiserver, which was inflexible. In 1.7, k8s introduced dynamic admission control, which allows users to register webhooks with the apiserver; the apiserver then calls an external server through the webhook to run the filtering logic. In 1.9, this feature was refined further, splitting webhooks into two kinds: MutatingAdmissionWebhook and ValidatingAdmissionWebhook. As the names imply, the former mutates API objects, like DefaultStorageClass above, while the latter validates API objects, like ResourceQuota. After the split, the apiserver can guarantee that all mutations are completed before validation starts.
Our approach is to use a ValidatingAdmissionWebhook. When an important pod receives a delete request, the webhook server first asks the cluster to perform cleanup and pre-offline preparation, and immediately returns a rejection. The key point is that, in order to reach the desired state (say, upgrading to a new version), the control loop keeps reconciling and retrying the pod deletion, while our webhook keeps rejecting it until the cluster has finished all the cleanup and preparation work.
Here is a step-by-step description of the process:
- The user updates a resource object.
- The controller-manager watches the object change.
- The controller-manager starts syncing the object's state and tries to delete the first pod.
- The apiserver calls the external webhook.
- The webhook server asks the cluster to prepare for taking the tikv-1 node offline (this request is idempotent) and checks whether the preparation has completed: if so, it allows the deletion; if not, it rejects it, and the whole flow returns to step 2 driven by the controller-manager's control loop.
Suddenly everything is clear. The webhook's logic is very simple: ensure that every deletion of a relevant pod first completes the pre-exit preparation, without caring at all how the external control loop runs. This makes it easy to write and test, and it meets our need to "guarantee a graceful shutdown, or fail the operation" very elegantly. We are currently considering replacing our old production solution with this approach.
In fact, dynamic admission control is widely used; for example, Istio uses a MutatingAdmissionWebhook to inject the envoy sidecar container. The example above also shows how extensible it is: it often lets you solve a problem cleanly from an orthogonal angle, well decoupled from the rest of the logic.
Of course, Kubernetes has many extension points: from kubectl to the apiserver, the scheduler, the kubelet (device plugins, FlexVolume), custom controllers, all the way down to cluster-level networking (CNI) and storage (CSI), there is room to hook in everywhere. Conventional microservice deployments rarely needed these, but for a distributed system as complex as TiDB, especially while Kubernetes's support for stateful applications and local storage is still maturing, every extension point has to be weighed carefully, which makes the work very interesting. We may share more of TiDB Operator's solutions and thinking in the future.