Introduction: Alibaba recently open-sourced its cloud native container image acceleration technology. Compared with the traditional layered tar file format, its overlaybd image format supports network-based on-demand reading, so containers can start quickly.
Author | Chen Bo
Source | Alibaba Cloud official account
Alibaba recently open-sourced its cloud native container image acceleration technology. Compared with the traditional layered tar file format, its overlaybd image format supports network-based on-demand reading, so containers can start quickly.
The technology originated in DADI, an internal Alibaba Cloud project. DADI stands for Data Accelerator for Disaggregated Infrastructure, and it aims to provide data access acceleration for architectures that separate compute from storage. Image acceleration is the DADI architecture's first breakthrough in the container and cloud native field. Since it went into production in 2019, it has been deployed on a large number of machines and has started containers more than 1 billion times. It supports multiple business lines of Alibaba Group and Alibaba Cloud and has greatly improved the efficiency of application release and scale-out. In 2020, the team published the paper “DADI: Block-Level Image Service for Agile and Elastic Application Deployment” at USENIX ATC '20, a top international conference, and then launched the open source project. The team plans to contribute the technology to the community and, by establishing standards and building an ecosystem, attract more developers to performance optimization for containers and cloud native.
With the rapid rise of Kubernetes and cloud native, containers are used more and more widely in large enterprises. Fast deployment and startup is one of the core advantages of containers. Here, fast startup means that instantiating a container from a local image takes very little time, that is, the “hot start” time is short. A “cold start”, where the image is not present locally, must first download the image from the registry before the container can be created. After long-term maintenance and updates, both the number of layers and the total size of a business image grow large, reaching hundreds of megabytes or several gigabytes. As a result, a cold start in production often takes several minutes, and as the cluster scales out, network congestion prevents images from being downloaded from the registry quickly.
For example, during a Double 11 event in a previous year, an Alibaba application ran short of capacity and triggered an emergency scale-out, but the high concurrency made the overall scale-out slow, which affected the experience of some users. By 2019, with DADI deployed in production, the total “image pull + container start” time for containers using the new image format was 5 times shorter than for ordinary containers, and the p99 long-tail time was 17 times shorter.
How to handle image data stored at the remote end is the key to solving slow container cold starts. The industry has tried several approaches: storing container images on block storage or NAS to allow on-demand reading, and using network distribution technologies (such as P2P) to download images from multiple sources or preheat them onto hosts in advance, which avoids a single network bottleneck. In recent years, new image formats have also come under discussion. Research by Harter et al. shows that pulling the image accounts for 76% of container startup time, while only 6.4% of the pulled data is actually read. Image formats that support on-demand reading have therefore become the prevailing direction. Google proposed the stargz format, whose full name is seekable tar.gz. As the name suggests, specific files can be located and extracted from the archive without scanning or decompressing the entire image. Stargz aims to improve image pull performance: its lazy pull technology reads data on demand instead of pulling the whole image. To further improve runtime efficiency, stargz also provides a containerd snapshotter plug-in that optimizes I/O at the storage layer.
In the container life cycle, the image must be mounted once it is ready. The core technology for mounting layered images is overlayfs, which stacks multiple lower layers and exposes a unified read-only file system above them. The block storage and NAS approaches mentioned above can generally be stacked in layers in the form of snapshots, and CRFS, which is tied to stargz, can be regarded as another implementation of overlayfs.
New image format
DADI does not use overlayfs directly. Instead, it borrows ideas from overlayfs and earlier union file systems and proposes a new layered stacking technology based on block devices, called overlaybd, which gives container images a merged, block-based view of their layers. The implementation of overlaybd is very simple, so many things that were desirable but impractical before become feasible. By contrast, implementing a fully POSIX-compatible file system interface is challenging and prone to bugs, as the development history of mainstream file systems shows.
In addition to simplicity, overlaybd has other advantages over overlayfs:
- It avoids the performance degradation caused by multi-layer images. For example, in overlayfs, updating a large file triggers a cross-layer copy-up, so the system must first copy the file to the writable layer; creating hard links is also very slow.
- Block-level I/O patterns are easy to collect, record, and replay, so data can be prefetched to further accelerate startup.
- The user's file system and the host OS can be chosen flexibly; for example, Windows NTFS can be supported.
- An efficient codec can be used for online decompression.
- It can be pushed down into distributed storage in the cloud (such as EBS), so the image system disk can use the same storage scheme as the data disk.
- Overlaybd supports a writable (RW) layer natively, so read-only mounts may even become a thing of the past.
How overlaybd works
To understand how overlaybd works, we first need to understand the layering mechanism of container images. A container image consists of multiple incremental layer files that are stacked when used, so only the layer files need to be distributed when the image is distributed. Each layer is essentially a compressed archive of the differences from the previous layer (files added, modified, or deleted). The container engine stacks these differences through its storage driver according to an agreed convention and mounts them in read-only mode at a specified directory, called lower_dir. The writable layer is mounted in read/write mode, and its directory is generally called upper_dir.
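The file-level stacking described above can be illustrated with a minimal sketch. It models each layer as a dictionary of paths, which is an assumption made for illustration only: real layers are tar archives, real engines mark deletions with whiteout entries, and the `DELETED` marker and `merge_file_layers` function are hypothetical names.

```python
# Illustrative model only: real layers are tar archives, and real engines
# mark deletions with whiteout entries rather than a Python sentinel.
DELETED = object()  # hypothetical marker for "file removed in this layer"


def merge_file_layers(layers):
    """Stack file-level layers from bottom to top and return the merged view."""
    view = {}
    for layer in layers:  # later (upper) layers override earlier ones
        for path, content in layer.items():
            if content is DELETED:
                view.pop(path, None)  # a deletion hides the file from lower layers
            else:
                view[path] = content  # an addition or modification replaces it
    return view


base = {"/etc/hosts": b"127.0.0.1 localhost\n", "/bin/app": b"v1"}
update = {"/bin/app": b"v2", "/etc/motd": b"hello\n"}
cleanup = {"/etc/hosts": DELETED}

rootfs = merge_file_layers([base, update, cleanup])
print(sorted(rootfs))
```

The merged view keeps the newest version of each file and drops deleted ones, which is the behaviour a file-level storage driver has to reproduce with full POSIX semantics.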
Note that overlaybd itself has no concept of files. It simply presents the image as a virtual block device, on which a conventional file system is mounted. When an application reads data, the conventional file system first converts the request into one or more reads of the virtual block device. These reads are forwarded to a user-space receiver, the runtime carrier of overlaybd, and are finally converted into random reads of one or more layers.
Like a traditional image, overlaybd keeps the layer structure internally, but the content of each layer is a set of data blocks corresponding to the file system changes. Overlaybd exposes a merged view of the layers. The stacking rule is very simple: for any data block, the most recent change wins, and blocks that were never written in the layers are treated as all-zero blocks. Each layer can also export its data blocks into a layer file that is dense, non-sparse, and indexable. A read of a contiguous LBA range on the block device may therefore be made up of small pieces of data from several layers, which are called segments. The layer number stored in a segment's attributes maps the read to the layer file of that layer. Traditional container images keep their layer files in a registry or object storage, and overlaybd images can do the same.
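The stacking rule and the splitting of a read into segments can be sketched as follows. This is a simplified model and not the overlaybd index itself: it tracks ownership per block in a dictionary, and the names `Segment`, `build_index`, and `lookup` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    lba: int    # first block of the segment on the virtual device
    count: int  # number of consecutive blocks in the segment
    layer: int  # index of the layer holding the data, or -1 for zero blocks


def build_index(layers):
    """Merge per-layer sets of written blocks; the topmost writer wins."""
    owner = {}
    for layer_no, blocks in enumerate(layers):  # bottom to top
        for lba in blocks:
            owner[lba] = layer_no  # a later layer overrides an earlier one
    return owner


def lookup(owner, lba, count):
    """Split the range [lba, lba + count) into runs served by a single layer."""
    segments = []
    for block in range(lba, lba + count):
        layer = owner.get(block, -1)  # never written: reads as zeros
        last = segments[-1] if segments else None
        if last is not None and last.layer == layer:
            segments[-1] = Segment(last.lba, last.count + 1, layer)
        else:
            segments.append(Segment(block, 1, layer))
    return segments


# Three layers, each described by the set of blocks it changed.
index = build_index([{0, 1, 2, 3}, {2, 3, 6}, {3}])
for segment in lookup(index, 0, 8):
    print(segment)
```

Each resulting segment names exactly one layer, so a read of the block device becomes a small number of reads against individual layer files.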
For better compatibility, overlaybd wraps each layer file in the header and trailer of a tar file, so that it looks like a tar archive. Because the archive contains only one file, on-demand reading is not affected. At present, Docker, containerd, and buildkit all run an untar and tar step by default when downloading or uploading images, and this step cannot be bypassed without intrusive code changes. The tar disguise therefore keeps overlaybd compatible with the existing workflow. For example, image conversion, image building, and full download need no code changes, only plug-ins.
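A small sketch with Python's standard `tarfile` module shows why a single-member tar wrapper does not get in the way of random access: the payload sits contiguously after the header, so a logical offset only needs a fixed shift. The member name `layer.blob` and the helper names are assumptions for illustration, not the names used by overlaybd.

```python
import io
import tarfile


def wrap_in_tar(payload: bytes, name: str = "layer.blob") -> bytes:
    """Return an uncompressed tar archive whose only member is the payload."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def data_offset(archive: bytes, name: str = "layer.blob") -> int:
    """Return the byte offset at which the member's data starts."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:") as tar:
        return tar.getmember(name).offset_data


payload = bytes(range(256)) * 16  # stand-in for a layer file
archive = wrap_in_tar(payload)
base = data_offset(archive)


def read_at(offset: int, length: int) -> bytes:
    """Random read of the payload straight from the archive bytes."""
    return archive[base + offset:base + offset + length]


print(read_at(1000, 16) == payload[1000:1016])
```

Tools that expect a tar stream can still unpack the archive in full, while an on-demand reader only adds the data offset to every request.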
The overall architecture of the DADI components is shown in the following figure:
Since version 1.4, containerd has had initial support for starting containers from remote images, and Kubernetes has explicitly deprecated Docker as a runtime. The open source version of DADI therefore gives priority to the containerd ecosystem, with Docker support to follow.
The core function of a snapshotter is to implement an abstract service interface for mounting and unmounting the container rootfs. Its design replaces the graphdriver module of early Docker versions, which simplifies storage drivers and accommodates both block device snapshots and overlayfs.
The overlaybd snapshotter provided by DADI lets the container engine support images in the new overlaybd format, that is, mount the virtual block device at the corresponding directory. It is also compatible with traditional OCI tar format images, so users can continue to run ordinary containers on overlayfs.
iSCSI is a widely supported remote block device protocol that is stable, mature, high-performance, and able to recover from failures. Because the overlaybd module serves as the back-end storage of the iSCSI protocol, it can recover even if the program crashes unexpectedly. File system based image acceleration schemes, such as stargz, cannot recover in this way.
The iSCSI target is the runtime carrier of overlaybd. The project implements two target modules. The first is based on the open source project tgt; because tgt has a backing store mechanism, the code can be compiled into a dynamic link library and loaded at runtime. The second is based on the Linux kernel's LIO SCSI target (also known as TCMU); the target as a whole runs in kernel space and can easily expose virtual block devices.
ZFile is a data compression format that supports online decompression. It splits the source file into fixed-size blocks, compresses each block separately, and maintains a jump table that records the physical offset of each compressed block within the ZFile. To read data from a ZFile, look up the index, find the corresponding position, and decompress only the relevant blocks.
ZFile supports a variety of efficient compression algorithms, including lz4 and zstd. Decompression is extremely fast and cheap, and the format saves both storage space and data transfer. Experimental data show that decompressing remote ZFile data on demand performs better than loading uncompressed data, because the transfer time saved exceeds the extra cost of decompression.
Overlaybd supports exporting layer files in ZFile format.
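The block-wise compression and jump table described above can be sketched in a few lines. This is not the ZFile on-disk format: it uses `zlib` from the standard library as a stand-in for lz4 or zstd, keeps the jump table in memory, and the 4096-byte block size and function names are assumptions.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, chosen for illustration


def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress fixed-size blocks independently and build a jump table."""
    body = bytearray()
    jump_table = []  # jump_table[i] is the physical offset of compressed block i
    for start in range(0, len(data), block_size):
        jump_table.append(len(body))
        body += zlib.compress(data[start:start + block_size])
    jump_table.append(len(body))  # sentinel marking the end of the last block
    return bytes(body), jump_table


def read_range(body, jump_table, offset, length, block_size=BLOCK_SIZE):
    """Read logical bytes [offset, offset + length), decompressing only the
    blocks that cover the range."""
    if length <= 0:
        return b""
    first = offset // block_size
    last = min((offset + length - 1) // block_size, len(jump_table) - 2)
    out = bytearray()
    for i in range(first, last + 1):
        out += zlib.decompress(body[jump_table[i]:jump_table[i + 1]])
    skip = offset - first * block_size
    return bytes(out[skip:skip + length])


data = b"".join(i.to_bytes(4, "little") for i in range(5000))  # 20000 bytes
body, table = compress_blocks(data)
print(read_range(body, table, 5000, 9000) == data[5000:14000])
```

Because every block is compressed on its own, a read touches only the compressed bytes between two neighbouring jump table entries, which is what makes on-demand reading of a remote compressed layer practical.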
As mentioned above, layer files are stored in the registry, and the container's read I/O on the block device is mapped to requests to the registry (using the registry's support for HTTP Partial Content). Thanks to the cache mechanism, this does not go on indefinitely. Some time after the container starts, the cache automatically begins downloading the layer files and persists them to the local file system. On a cache hit, read I/O is served locally instead of being sent to the registry.
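The interaction between ranged registry reads and a local cache can be sketched as below. It is a simplified read-through cache, not the DADI cache: the background download of whole layer files is left out, the registry is replaced by an in-memory function, and `range_header`, `CachedLayer`, and the 1024-byte chunk size are hypothetical. Only the `Range: bytes=first-last` header syntax is standard HTTP.

```python
def range_header(offset, length):
    """Build the HTTP Range header for bytes [offset, offset + length)."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


class CachedLayer:
    """Serve reads from local chunks, fetching missing chunks remotely."""

    def __init__(self, fetch, size, chunk_size=1024):
        self.fetch = fetch            # callable(offset, length) -> bytes
        self.size = size              # total size of the layer file
        self.chunk_size = chunk_size
        self.chunks = {}              # chunk index -> bytes (stand-in for local files)
        self.remote_requests = 0

    def _chunk(self, index):
        if index not in self.chunks:  # cache miss: one ranged request
            start = index * self.chunk_size
            length = min(self.chunk_size, self.size - start)
            self.chunks[index] = self.fetch(start, length)
            self.remote_requests += 1
        return self.chunks[index]

    def read(self, offset, length):
        end = min(offset + length, self.size)
        if offset >= end:
            return b""
        first = offset // self.chunk_size
        last = (end - 1) // self.chunk_size
        data = b"".join(self._chunk(i) for i in range(first, last + 1))
        skip = offset - first * self.chunk_size
        return data[skip:skip + (end - offset)]


blob = bytes(i % 251 for i in range(5000))  # stand-in for a remote layer file
layer = CachedLayer(lambda off, n: blob[off:off + n], len(blob))

layer.read(1500, 100)   # miss: fetches chunk 1
layer.read(1500, 100)   # hit: served locally
layer.read(1000, 2000)  # fetches chunks 0 and 2, reuses chunk 1
print(layer.remote_requests, range_header(0, 512))
```

In a real deployment the `fetch` callable would issue an HTTP GET with the header built by `range_header`, and the chunks would be persisted to the local file system.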
On March 25, the authoritative consulting firm Forrester released its evaluation report on function-as-a-service platforms for the first quarter of 2021. Alibaba Cloud stood out with product capabilities ranked first in the world, received the highest score in eight evaluation dimensions, and became a global FaaS leader alongside Amazon AWS. This is also the first time that a Chinese technology company has entered the FaaS leader quadrant.
The container is the foundation of a FaaS platform, and container startup speed determines the performance and response latency of the whole platform. DADI helps Alibaba Cloud Function Compute shorten container startup time by 50% to 80%, bringing a new serverless experience.
Summary and outlook
Alibaba's open source DADI container acceleration project and its overlaybd image format help meet the demand for fast container startup. Going forward, the project team will work with the community to speed up integration with mainstream tool chains and take an active part in defining new image format standards. The goal is to make overlaybd one of the standards for OCI remote image formats.
You are welcome to join the open source project and contribute!
The OCI image v1 manifest format has limited descriptive power and cannot meet the needs of remote images. Discussion of v2 has made no substantive progress so far, and discarding v1 is unrealistic. However, with the OCI artifacts manifest, additional descriptors can be used to describe the raw data, which preserves compatibility and is easier for users to accept. Artifacts is also a project promoted by OCI/CNCF. DADI plans to adopt artifacts and implement a proof of concept in the future.
Open support for multiple file systems
DADI itself lets users choose a suitable file system when building an image, but the corresponding interface has not been opened yet, and ext4 is used by default. We will improve the relevant interfaces and release this feature so that users can choose the file system that fits their needs.
Buildkit tool chain
Users can currently build images through a snapshotter plug-in for buildkit. This will be improved further to form a complete tool chain.
After a container starts, its I/O pattern is recorded. When the same image is started later, the record can be replayed to prefetch the data and avoid on-the-spot requests to the registry. This cuts the container's cold start time by more than half again. In theory, any stateless or idempotent container can be recorded and replayed.
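The record-and-replay idea can be sketched as follows. This is an illustration under simple assumptions, not DADI's trace format: a trace is a list of `(offset, length)` pairs, the remote image is an in-memory byte string, and `TracingDevice`, `replay`, and `prefetch` are hypothetical names.

```python
class TracingDevice:
    """Wrap a read function and record the (offset, length) of every read."""

    def __init__(self, read):
        self._read = read
        self.trace = []

    def read(self, offset, length):
        self.trace.append((offset, length))
        return self._read(offset, length)


def replay(trace, read):
    """Issue each distinct recorded read once, in its original order."""
    seen = set()
    for offset, length in trace:
        if (offset, length) not in seen:
            seen.add((offset, length))
            read(offset, length)


image = bytes(range(256)) * 64  # stand-in for remote image data


def remote_read(offset, length):
    return image[offset:offset + length]


# First start of the image: record what the container reads.
device = TracingDevice(remote_read)
device.read(0, 512)
device.read(8192, 1024)
device.read(0, 512)

# Later start of the same image: prefetch the recorded ranges into a cache.
cache = {}


def prefetch(offset, length):
    cache[(offset, length)] = remote_read(offset, length)


replay(device.trace, prefetch)
print(sorted(cache))
```

Replaying works best when startup reads are deterministic, which is why the text limits it to stateless or idempotent containers.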
This article is the original content of Alibaba cloud and cannot be reproduced without permission.