Introduction: Kubernetes is the cornerstone of the cloud native ecosystem and, as the infrastructure for container orchestration, is widely used in serverless platforms. Elasticity is the core competitive strength of a serverless offering. This article focuses on how to speed up pod creation, and thereby improve elasticity, in a Kubernetes-based serverless service.
Introduction to serverless computing
Before getting into the topic, let's briefly review the definition of serverless computing.
According to Wikipedia, serverless computing is a form of cloud computing in which the cloud vendor manages the servers, dynamically allocates machine resources to users, and charges based on the amount of resources actually used.
When users build and run services, they do not need to think about servers, which removes the burden of server management. During business peaks the platform's elasticity automatically scales instances out, and during off-peak periods it automatically scales them in to reduce resource costs.
Serverless computing platform
The following is the architecture commonly used by serverless computing products today.
The overall architecture usually has two layers: a control plane and a data plane. The control plane serves developers: it manages the application life cycle and meets developers' needs for application management. The data plane serves the application's visitors, such as the end users of a developer's business, and handles application traffic management and access.
Kubernetes is usually used for resource management and scheduling in the control plane. The master usually consists of 3 nodes for high availability, and nodes reach the k8s master through an intranet SLB.
At the node level, there are usually two types of nodes:
- The first is a node running kubelet, such as a bare metal server or a virtual machine. This kind of node runs pods as secure containers: each pod has its own kernel at runtime, which reduces the security risk of sharing the host kernel. At the same time, tenants' network access is isolated at the data link layer through the cloud provider's VPC network or other network technologies. Secure containers plus layer-2 network isolation provide a reliable multi-tenant runtime environment on a single node.
- The second is a virtual node, which connects k8s to elastic instances through Virtual Kubelet. An elastic instance is a lightweight resource type in cloud products, similar to a virtual machine, that offers a container group service backed by a practically unlimited resource pool. A container group corresponds to the pod concept in k8s. AWS provides Fargate elastic instances and Alibaba Cloud provides ECI elastic instances.
Serverless products provide a k8s-based PaaS layer, which offers deployment, development, and related services to developers, hides k8s concepts, and lowers the cost of developing, operating, and maintaining applications.
In the data plane, users access application instances through an SLB. The PaaS layer usually also provides traffic management services in this plane, such as gray (canary) releases and A/B testing, to meet developers' traffic management needs.
Elasticity is the core competitive strength of a serverless computing platform. The platform must satisfy developers' demands for pod scale by offering something close to an unlimited resource pool, and it must also create pods efficiently enough to respond to requests in time.
Pod scale can be addressed by adding IaaS-layer resources. The rest of this article focuses on techniques for improving pod creation efficiency.
Scenarios involving pod creation
Let's first look at the scenarios that involve pod creation, so that the technology can be aimed at real business needs.
Two business scenarios involve pod creation:
- The first is creating an application. The pod is first scheduled to determine the most suitable node, and is then created on that node.
- The second is upgrading an application. This usually means continuously creating new pods and destroying old ones.
In a serverless service, developers focus on the application life cycle, especially the creation and upgrade stages. Pod creation efficiency affects the total time of both stages and therefore the developer's experience. When traffic bursts, creation efficiency largely determines how quickly the developer's service can respond; in severe cases, slow creation damages the developer's business.
With these scenarios in mind, the following sections focus on how to improve pod creation efficiency.
Pod creation process
We first analyze the stages of pod creation as a whole, and then address them in order of their impact on pod creation efficiency.
This is a simplified pod creation process:
When a pod creation request arrives, the pod is first scheduled to select the most suitable node. On that node, the image is pulled first, and the container group is created once the image is available locally. Pulling the image itself has two steps: downloading the image and decompressing it.
We tested two types of images, and the results are as follows:
The test results show that decompression takes a share of the image pull time that cannot be ignored. For the golang:1.10 image, about 248 MB before decompression, decompression accounts for 77.02% of the pull time. For the Hadoop namenode image, about 506 MB before decompression, decompression and download account for about 40% and 60% of the pull time respectively, so decompression is again a significant part of the total pull time.
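To make percentages like these concrete, here is a minimal Python sketch of how a decompression share is derived from the two stage timings. The timings in the usage lines are hypothetical placeholders, not the measurements behind the test results.

```python
def decompress_share(download_s: float, decompress_s: float) -> float:
    """Return the fraction of image pull time spent on decompression."""
    total = download_s + decompress_s
    if total <= 0:
        raise ValueError("total pull time must be positive")
    return decompress_s / total


# Hypothetical timings in seconds, chosen only for illustration.
share = decompress_share(download_s=6.0, decompress_s=4.0)
print(f"decompression share: {share:.2%}")
```

Measuring the two stages separately like this is what tells you whether download or decompression deserves attention first.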
Next, we optimize the individual stages of this process in turn: the overall flow, image decompression, image download, and so on.
Improve the efficiency of pulling images
An obvious approach is to warm up the image: prepare the image on the node before the pod is scheduled there, so that image pulling is removed from the critical path of pod creation, as shown in the following figure:
Warm-up can be global, done before scheduling by pulling the image in advance on all nodes. It can also be done during scheduling: once the target node has been determined, the image is pulled on that node.
Neither approach is inherently better; choose according to the actual situation of the cluster.
The OpenKruise project in the community is about to launch an image warm-up service that is worth following. The service is used as follows:
An image warm-up task is issued through the ImagePullJob CRD, which specifies the target image and nodes and configures the pull concurrency, the job timeout, and the time after which the finished job object is automatically cleaned up. For private images, you can specify the secret to use when pulling. The events of the ImagePullJob report the status of the task, so consider lengthening the automatic cleanup time to leave room for inspecting the task's progress through those events.
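As an illustration of such a task, the following Python sketch builds an ImagePullJob manifest as a plain dict and prints it as JSON, which kubectl accepts in place of YAML. The field names follow the OpenKruise documentation as I understand it and should be checked against the CRD version installed in your cluster; the job, node, and secret names are hypothetical.

```python
import json


def image_pull_job(name, image, node_names, parallelism=10,
                   timeout_s=1200, ttl_s=300, pull_secrets=None):
    """Build an ImagePullJob manifest as a plain dict."""
    spec = {
        "image": image,
        "parallelism": parallelism,                # nodes pulling concurrently
        "selector": {"names": list(node_names)},   # target nodes
        "completionPolicy": {
            "type": "Always",
            "activeDeadlineSeconds": timeout_s,    # job timeout
            "ttlSecondsAfterFinished": ttl_s,      # automatic cleanup delay
        },
    }
    if pull_secrets:
        spec["pullSecrets"] = list(pull_secrets)   # secrets for private images
    return {
        "apiVersion": "apps.kruise.io/v1alpha1",
        "kind": "ImagePullJob",
        "metadata": {"name": name},
        "spec": spec,
    }


# Hypothetical names, for illustration only.
manifest = image_pull_job("warm-golang", "golang:1.10", ["node-1", "node-2"],
                          pull_secrets=["registry-secret"])
print(json.dumps(manifest, indent=2))
```

A larger ttl_s keeps the finished job object around longer, which is what makes its events available for later inspection.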
Improve decompression efficiency
As the pull data above showed, decompression accounts for a large share of the total pull time, up to 77% in the test examples, so we need to consider how to improve decompression efficiency.
First, let's review the technical details of docker pull.
A docker pull has two overall stages:
- Download the image layers in parallel
- Decompress the image layers
When decompressing the image layer, gunzip is used by default.
Let's also briefly look at the docker push process:
- First, the image layers are packaged; during this step they are compressed with gzip.
- Then the layers are uploaded in parallel.
gzip / gunzip are single-threaded compression / decompression tools. Consider using pigz / unpigz instead to take advantage of multiple cores. Note that pigz compresses in parallel, while unpigz still inflates on a single thread and uses additional threads for reading, writing, and checksum calculation, which is where its decompression speedup comes from.
containerd supports pigz as of version 1.2: once the unpigz tool is installed on the node, containerd uses it for decompression in preference to the built-in implementation. In this way, image decompression can benefit from the node's multiple cores.
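The "use unpigz when it is installed, otherwise fall back" behavior can be sketched in a few lines of Python. This only mirrors the idea; it is not containerd's actual Go implementation, and the sample layer is made up.

```python
import gzip
import shutil
import subprocess


def decompress_layer(blob: bytes) -> bytes:
    """Decompress a gzip-compressed layer, preferring unpigz when installed."""
    unpigz = shutil.which("unpigz")
    if unpigz is not None:
        # -d: decompress, -c: write the result to stdout
        result = subprocess.run([unpigz, "-d", "-c"], input=blob,
                                stdout=subprocess.PIPE, check=True)
        return result.stdout
    # Fallback: Python's built-in gzip implementation
    return gzip.decompress(blob)


# Made-up layer contents, for illustration only.
layer = gzip.compress(b"layer contents" * 1000)
print(len(decompress_layer(layer)))
```

Because the lookup happens at call time, installing unpigz on a node is enough to switch paths; no configuration change is needed in this sketch.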
This process also needs attention to download / upload concurrency. The Docker daemon provides two parameters that control how many image layers are processed in parallel: --max-concurrent-downloads and --max-concurrent-uploads. By default, the download concurrency is 3 and the upload concurrency is 5; both can be tuned to suitable values based on test results.
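The same two settings can be placed in the daemon configuration file (daemon.json) under the keys max-concurrent-downloads and max-concurrent-uploads. The Python sketch below merges them into an existing configuration dict; the existing setting and the value 10 are hypothetical placeholders, not recommendations.

```python
import json


def with_layer_concurrency(config: dict, downloads: int, uploads: int) -> dict:
    """Return a copy of a daemon.json config with layer concurrency set."""
    if downloads < 1 or uploads < 1:
        raise ValueError("concurrency must be at least 1")
    updated = dict(config)  # leave the caller's dict untouched
    updated["max-concurrent-downloads"] = downloads
    updated["max-concurrent-uploads"] = uploads
    return updated


# Hypothetical existing settings and target values.
existing = {"log-driver": "json-file"}
tuned = with_layer_concurrency(existing, downloads=10, uploads=10)
print(json.dumps(tuned, indent=2))
```

The daemon has to be restarted or reloaded for a changed configuration file to take effect, so it is worth settling on values through testing first.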
Image decompression efficiency after using unpigz:
In the same environment, the decompression efficiency of golang:1.10 improved by 35.88%, and that of the Hadoop namenode image improved by 16.41%.
Intranet bandwidth is generally large enough. Is it possible to drop the compression / decompression logic altogether and concentrate the image pull time on downloading? That is, accept a somewhat longer download in exchange for a shorter decompression time.
Looking back at the docker pull / push process, consider removing the gunzip and gzip logic from the unpack / pack phases:
For Docker images, if the image is not compressed during docker push, it does not need to be decompressed during docker pull. To reach this goal, the compression logic therefore has to be removed from docker push.
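The pull-side consequence can be sketched in Python: a layer that arrives as a plain tar archive can be unpacked directly, while a gzip-compressed one needs the extra gunzip step first. This is an illustration only, not the Docker modification described in this article; the file names are made up.

```python
import gzip
import io
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def list_layer_files(blob: bytes) -> list:
    """List file names in a layer, gunzipping only if it is compressed."""
    if blob[:2] == GZIP_MAGIC:
        blob = gzip.decompress(blob)  # the step an uncompressed push avoids
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:") as tar:
        return tar.getnames()


def make_tar(files: dict) -> bytes:
    """Build an uncompressed tar archive in memory (helper for the demo)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


plain = make_tar({"bin/app": b"binary", "etc/app.conf": b"key=value"})
print(list_layer_files(plain))                 # no decompression needed
print(list_layer_files(gzip.compress(plain)))  # gunzip first, then untar
```

The trade-off is the one stated above: the uncompressed layer is larger on the wire, so the approach pays off when bandwidth is plentiful and CPU time for decompression is the bottleneck.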
The Docker daemon does not currently support this. We modified Docker so that it does not compress images when uploading. The test results are as follows:
Here we focus on the time spent on image decompression. The decompression efficiency of golang:1.10 improved by about 50%, and that of the Hadoop namenode image improved by about 28%. In terms of the total time spent pulling images, this scheme also helps.
In small clusters, improving pull efficiency is mainly a matter of improving decompression efficiency; downloading is usually not the bottleneck. In large clusters, because of the large number of nodes, the bandwidth and stability of the central image registry also affect pull efficiency, as shown in the following figure:
The pressure of downloading images is concentrated on the central image registry.
A P2P-based image distribution system can solve this problem. Take the CNCF Dragonfly project as an example:
Here are several core components:
- ClusterManager: essentially a central supernode. It acts as the tracker and scheduler that coordinates the download tasks of nodes in the P2P network. It is also a caching service that caches images downloaded from the image registry, which reduces the extra pressure on the registry as the number of nodes grows.
- dfget: the client that downloads images on a node. It can also serve data to other nodes, providing locally available image data to them on demand.
- dfdaemon: a component that runs on every node and is essentially a proxy. It transparently proxies the Docker daemon's image pull requests and uses dfget to download the image.
Through the P2P network, data from the central image registry is cached in the ClusterManager. The ClusterManager coordinates the nodes' image download requests and spreads the download pressure across the cluster nodes. Each cluster node is both a puller and a provider of image data, so the intranet bandwidth is fully used to distribute images.
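The coordination idea can be shown with a toy scheduler in Python: for each piece of an image layer, prefer a peer that already holds it, and fall back to the supernode cache only when no peer does. This is a deliberately simplified model with hypothetical names; it is not Dragonfly's scheduling algorithm or API.

```python
def plan_downloads(piece_count, peers_with_pieces):
    """Toy scheduler: pick a source for each piece of an image layer.

    peers_with_pieces maps a peer name to the set of piece indexes it
    already holds. Pieces nobody holds fall back to the supernode cache.
    """
    load = {peer: 0 for peer in peers_with_pieces}
    plan = {}
    for piece in range(piece_count):
        holders = [p for p, pieces in peers_with_pieces.items() if piece in pieces]
        if holders:
            # Choose the least-loaded holder; ties are broken by name.
            source = min(holders, key=lambda p: (load[p], p))
            load[source] += 1
        else:
            source = "supernode"
        plan[piece] = source
    return plan


# Hypothetical peers and the pieces they already hold.
peers = {"node-a": {0, 1, 2}, "node-b": {1, 2, 3}}
plan = plan_downloads(5, peers)
print(plan)
```

In this example only one of the five pieces has to come from the central cache; the rest are served by peers, which is the effect that relieves the registry in a large cluster.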
Load images on demand
In addition to the methods described above, are there other optimization methods?
Today, when a container is created on a node, all of the image data must be pulled locally before the container can start. Compare this with starting a virtual machine: even if the virtual machine image is hundreds of GB, the virtual machine usually starts within seconds, and the size of the image is hardly felt.
Can similar technologies be used in the container field?
Consider the following passage from the paper "Slacker: Fast Distribution with Lazy Docker Containers", published at USENIX:
Our analysis shows that pulling packages accounts for 76% of container start time, but only 6.4% of that data is read.
According to the paper's analysis, pulling the image accounts for 76% of container startup time, but only 6.4% of the data is read during startup; in other words, very little image data is actually needed to start a container. This suggests loading images on demand at container startup and changing the way images are used.
To remove the constraint that a container can start only after all image layers have been downloaded, the image has to be loaded on demand when the container starts, as when starting a virtual machine: only the data needed during the startup phase is transferred over the network.
However, the current image layer format is usually tar.gz or tar. A tar file has no index, and a gzip file cannot be read from an arbitrary position, so the format cannot serve on-demand pulls of a specific file. The image format needs to change to an indexable file format.
Google has proposed a new image format, stargz, short for seekable tar.gz. It is compatible with the current image format but provides a file index, so data can be read from a specified location.
A traditional .tar.gz file is generated as: Gzip(TarF(file1) + TarF(file2) + TarF(file3) + TarFooter). Each file is packaged individually, and the whole group is then compressed.
The stargz file changes this to: Gzip(TarF(file1)) + Gzip(TarF(file2)) + Gzip(TarF(file3_chunk1)) + Gzip(F(file3_chunk2)) + Gzip(F(index of earlier files in magic file), TarFooter). Each file is packaged and compressed on its own, and an index file is generated and compressed together with the tar footer.
In this way, the index can be used to quickly locate the file to be pulled, and the file can then be fetched from the specified position.
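The layout can be demonstrated with a small Python sketch: each file is written as its own gzip member, an index records where each member starts, and a single file is then read by decompressing only its member. This is a simplified model under stated assumptions: the index is kept in a separate dict rather than stored inside the archive, and large files are not split into chunks as in the real format.

```python
import gzip
import io
import tarfile


def tar_entry(name: str, data: bytes) -> bytes:
    """Raw tar bytes for one file: 512-byte header plus zero-padded content."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    header = info.tobuf(format=tarfile.USTAR_FORMAT)
    padding = b"\0" * (-len(data) % 512)
    return header + data + padding


def build_seekable_targz(files: dict):
    """Compress each file as its own gzip member and record where it starts."""
    blob = bytearray()
    index = {}
    for name, data in files.items():
        member = gzip.compress(tar_entry(name, data))
        index[name] = (len(blob), len(member))  # (offset, compressed length)
        blob += member
    blob += gzip.compress(b"\0" * 1024)  # tar end-of-archive footer
    return bytes(blob), index


def read_file(blob: bytes, index: dict, name: str) -> bytes:
    """Read one file by decompressing only its own gzip member."""
    offset, length = index[name]
    raw = gzip.decompress(blob[offset:offset + length])
    info = tarfile.TarInfo.frombuf(raw[:512], tarfile.ENCODING, "surrogateescape")
    return raw[512:512 + info.size]


# Made-up file contents, for illustration only.
files = {"a.txt": b"alpha", "b.txt": b"bravo" * 300, "c.txt": b"charlie"}
blob, index = build_seekable_targz(files)
print(read_file(blob, index, "b.txt")[:10])

# The concatenated members still form an ordinary .tar.gz archive.
with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
    names = tar.getnames()
print(names)
```

The last lines show the compatibility property: a standard tar.gz reader still sees a normal archive, while an index-aware reader can jump straight to one file.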
Then, in containerd's image pulling phase, a remote snapshotter is provided to containerd. When the container rootfs layers are created, the remote storage layer is mounted directly instead of first downloading the image layer and then building it, as shown in the following figure:
To realize this capability, on the one hand, containerd's current logic has to be modified to identify remote image layers in the filter stage and skip downloading them. On the other hand, a remote snapshotter has to be implemented to manage the remote layers.
When containerd creates a container through the remote snapshotter, the image pulling phase is skipped. For files needed during startup, an HTTP Range GET request can be issued against the stargz-format image data to fetch just the target data.
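As a sketch of such a request, the Python snippet below builds (but does not send) a GET request whose Range header covers exactly one indexed entry. The registry URL, offset, and length are hypothetical values standing in for what a file index would supply.

```python
import urllib.request


def range_request(blob_url: str, offset: int, length: int) -> urllib.request.Request:
    """Build an HTTP GET request for bytes [offset, offset + length) of a blob."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    last_byte = offset + length - 1  # Range end positions are inclusive
    return urllib.request.Request(
        blob_url, headers={"Range": f"bytes={offset}-{last_byte}"})


# Hypothetical registry URL and index entry; nothing is sent here.
req = range_request("https://registry.example.com/v2/app/blobs/layer-digest",
                    offset=4096, length=1024)
print(req.get_method(), req.get_header("Range"))
```

A server that honors the header answers with status 206 and only the requested bytes, which is what keeps the startup-time transfer small.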
Alibaba Cloud has implemented an accelerator called DADI, based on similar ideas. It is currently used in Alibaba Cloud Container Service and can start 10,000 containers in 3.01 s, eliminating the long wait of a cold start. Interested readers can also refer to this article: https://developer.aliyun.com/…
In-place upgrade
The solutions above target pod creation. Can the upgrade scenario also be made more efficient with existing technology? Specifically, is it possible to skip pod creation altogether and upgrade the pod in place?
In upgrade scenarios where only the image changes, k8s's own patch capability can be used. Patching the image does not rebuild the pod, only the target container, so there is no need to go through the full scheduling + new pod flow. Only the containers that need upgrading are upgraded in place.
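A patch of this kind can be sketched in Python as a strategic merge patch body that changes a single container's image. The pod, container, and image names are hypothetical.

```python
import json


def image_patch(container_name: str, new_image: str) -> str:
    """Build a strategic merge patch that changes one container's image."""
    patch = {
        "spec": {
            "containers": [
                # Containers are merged by name, so only this one is touched.
                {"name": container_name, "image": new_image}
            ]
        }
    }
    return json.dumps(patch)


# Hypothetical container and image names.
body = image_patch("app", "registry.example.com/app:v2")
print(body)
# One way to apply it: kubectl patch pod my-pod -p '<body>'
```

Because the pod object itself is kept, its node assignment, IP, and volumes stay as they are; only the patched container is restarted with the new image.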
During an in-place upgrade, the k8s readinessGates capability can be used to take the pod offline gracefully: the k8s endpoint controller removes the pod that is about to be upgraded from the endpoints, and adds it back once the in-place upgrade has finished, so that no traffic is lost during the upgrade.
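The readiness rule behind this can be modeled in a few lines of Python: a pod counts as ready only when its containers are ready and every condition listed in its readiness gates is "True". The gate name below is the condition type I understand OpenKruise to use and is an assumption here; the function is a model of the rule, not Kubernetes code.

```python
def pod_is_ready(containers_ready: bool, readiness_gates: list, conditions: dict) -> bool:
    """A pod is ready only if its containers are ready and every gate is "True"."""
    gates_ok = all(conditions.get(gate) == "True" for gate in readiness_gates)
    return containers_ready and gates_ok


GATE = "InPlaceUpdateReady"  # assumed condition type, see the note above

# 1. Before the upgrade the gate is "True", so the pod receives traffic.
before = pod_is_ready(True, [GATE], {GATE: "True"})
# 2. The controller sets the gate to "False" first; the pod leaves the
#    endpoints while its containers are still running.
draining = pod_is_ready(True, [GATE], {GATE: "False"})
# 3. After the container restarts with the new image, the gate is set
#    back to "True" and the pod rejoins the endpoints.
after = pod_is_ready(True, [GATE], {GATE: "True"})
print(before, draining, after)
```

Step 2 is what makes the upgrade lossless: traffic is drained before the container is touched rather than after it has already stopped.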
The CloneSet controller in the OpenKruise project provides the above capabilities:
Developers declare applications with CloneSet, which is used much like a Deployment. When the image is upgraded, the CloneSet controller performs the patch operation and ensures that business traffic is not lost during the upgrade.
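For orientation, here is a Python sketch that builds a minimal CloneSet manifest with an in-place update strategy. The field names follow the OpenKruise documentation as I understand it and should be verified against the installed CRD version; the application name and image are hypothetical.

```python
import json


def clone_set(name: str, image: str, replicas: int = 3) -> dict:
    """Build a CloneSet manifest that prefers in-place updates."""
    labels = {"app": name}
    return {
        "apiVersion": "apps.kruise.io/v1alpha1",
        "kind": "CloneSet",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "app", "image": image}]},
            },
            # Upgrade in place when only the image changed, else recreate.
            "updateStrategy": {"type": "InPlaceIfPossible"},
        },
    }


# Hypothetical application name and image.
cs = clone_set("demo", "registry.example.com/app:v1")
print(json.dumps(cs, indent=2))
```

Apart from the kind and the update strategy, the shape is the familiar Deployment one, which is why moving an application to CloneSet is mostly a declarative change.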
Starting from the business scenarios, we identified where improving pod creation efficiency pays off. Then, by analyzing the pod creation process, we applied optimizations to its different stages.
With this kind of analysis and step-by-step treatment, business requirements can be met effectively through technology.
About the author
Zhang Yifei works on the Alibaba Cloud Container Service team, focusing mainly on R&D of serverless products.
This article is original content of Alibaba Cloud and may not be reproduced without permission.