[Data Lake Acceleration] – How to use the cache acceleration service to improve the training speed of machine learning on the data lake

Time: 2021-01-14

Introduction: JindoFS provides a distributed cache on the compute side, which can make effective use of the local storage resources (disk or memory) of the compute cluster to cache hot data from OSS, reducing repeated fetches from OSS and the network bandwidth they consume.

Background

In recent years, machine learning has developed rapidly and is applied in all walks of life, bringing both opportunities and challenges for practitioners. Deep learning frameworks such as TensorFlow and PyTorch make it easy for developers to build and deploy machine learning applications. As cloud computing has matured, more and more users move their development and production workloads to cloud platforms, because the cloud has significant advantages over traditional platforms in compute cost and elastic scaling. To achieve flexibility and save cost, cloud platforms usually adopt an architecture that separates compute from storage, and building a data lake on object storage makes it cheap to store massive amounts of data. Machine learning is an especially good fit for keeping training data on the data lake.

The advantages of storing training data in a data lake are as follows:

1. There is no need to synchronize the data to the training nodes in advance. Traditionally, we had to import the data onto the local disks of the compute nodes beforehand. If the data is stored in object storage, we can read it directly for training, which cuts out the preparation work.

2. It can hold more training data, no longer limited by the size of the compute nodes' local disks. For deep learning, more data generally means a better training result.

3. Compute resources can be scaled up and down elastically to save cost. Machine learning usually needs many CPU cores or high-end GPUs, which are expensive, while the cost of object storage is relatively low. Keeping the training data in the data lake decouples it from the compute resources, so compute can be paid for on demand and released at any time.

However, this approach also brings some problems and challenges:

1. The latency and bandwidth of remote data access cannot scale linearly with compute resources. With the continuous growth of hardware compute power, GPUs train faster and faster, and large pools of compute can be scheduled quickly on the cloud with Elastic Compute Service (ECS) and Container Service. Access to object storage goes over the network, and although network technology gives us high-speed access to object storage, its latency and bandwidth still cannot grow linearly with cluster size; they can become a bottleneck that limits training speed. In a compute-storage separated architecture, accessing this data efficiently is a huge challenge.

2. A more convenient and general data access method is needed. Deep learning frameworks such as TensorFlow are friendly to GCS and HDFS, but lag behind in their support for many third-party object stores. A POSIX interface, which accesses remote data the same way as a local disk, is a more natural and friendly approach and greatly simplifies the developer's work of adapting to a storage system.

To solve these problems, JindoFS provides a cache acceleration scheme optimized for this scenario.

Training architecture based on JindoFS cache acceleration

JindoFS provides a distributed cache on the compute side, which can make effective use of the local storage resources (disk or memory) of the compute cluster to cache hot data from OSS, reducing repeated fetches from OSS and the network bandwidth they consume.


Memory cache

For deep learning, we can choose GPU models with stronger compute power to get faster training, which in turn requires high memory throughput to keep the GPUs fully fed. Here we can use JindoFS to build a memory-based distributed cache. When the total memory of the cluster is large enough to hold the whole dataset (excluding the memory needed by the tasks themselves), the memory cache plus the local high-speed network can provide high data throughput and speed up the computation.
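
As a rough check before choosing the memory cache, you can sum the allocatable memory across all nodes and compare it with your dataset size. A minimal sketch using standard kubectl, assuming the nodes report memory in the usual Ki form:

# Sum allocatable memory over all nodes (Kubernetes usually reports it in Ki)
kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.memory}{"\n"}{end}' \
  | sed 's/Ki$//' \
  | awk '{sum += $1} END {printf "total allocatable memory: %.1f GiB\n", sum/1024/1024}'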

Disk cache

In some machine learning scenarios, the training data is larger than memory and the CPU/GPU requirements are not as high, but data access still needs high throughput, so the network bandwidth becomes the computational bottleneck. In this case, we can build the JindoFS distributed cache service with local SSDs as the cache medium, using local storage resources to cache hot data and improve training speed.

FUSE interface

JindoFS includes a FUSE client, which provides a simple and familiar way to access data. By mapping a JindoFS cluster instance to the local file system through the FUSE program, you can enjoy the acceleration of JindoFS just as if you were accessing local disk files.
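
For example, once the test namespace from the deployment below is mounted under /mnt/jfs, ordinary file tools work unchanged, and a repeated read shows the cache taking effect (a minimal sketch; the file name here is illustrative only):

# First read: the data is fetched from OSS and cached by JindoFS
time cat /mnt/jfs/test/part-00000 > /dev/null
# Second read: served from the local cache, noticeably faster
time cat /mnt/jfs/test/part-00000 > /dev/null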

Hands-on: build a Kubernetes + JindoFS + TensorFlow training cluster

1. Create a Kubernetes cluster

Let’s go to Alibaba Cloud Container Service and create a Kubernetes cluster.
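
Once the cluster is up, a quick check from a terminal confirms that all nodes are ready (standard kubectl, nothing JindoFS-specific):

# All nodes should report STATUS Ready before installing JindoFS
kubectl get nodes -o wide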


2. Install the JindoFS service

2.1 Go to Container Service > App Catalog and open the “JindoFS” installation configuration page.


2.2 Configure parameters

For a complete configuration template, refer to Container Service > App Catalog > JindoFS installation instructions.
Configure the OSS bucket and AccessKey; refer to that document for the JFS scheme deployment. We need to modify the following configuration items:

jfs.namespaces: test
jfs.namespaces.test.mode: cache
jfs.namespaces.test.oss.uri: oss://xxx-sh-test.oss-cn-shanghai-internal.aliyuncs.com/xxx/k8s_c1
jfs.namespaces.test.oss.access.key: xx
jfs.namespaces.test.oss.access.secret: xx

Through these configuration items, we create a namespace named test that points to the xxx/k8s_c1 directory of the OSS bucket xxx-sh-test. From then on, operating on the test namespace through JindoFS is equivalent to operating on that OSS directory.

2.3 Install the service


1. Verify that the installation is successful

# kubectl get pods
NAME                               READY   STATUS      RESTARTS   AGE
jindofs-fuse-267vq                 1/1     Running     0          143m
jindofs-fuse-8qwdv                 1/1     Running     0          143m
jindofs-fuse-v6q7r                 1/1     Running     0          143m
jindofs-master-0                   1/1     Running     0          143m
jindofs-worker-mncqd               1/1     Running     0          143m
jindofs-worker-pk7j4               1/1     Running     0          143m
jindofs-worker-r2k99               1/1     Running     0          143m

2. Accessing the /mnt/jfs/ directory on the host is equivalent to accessing JindoFS files

ls /mnt/jfs/test/
15885689452274647042-0  17820745254765068290-0  entrypoint.sh

3. Install Kubeflow (Arena)

Kubeflow is an open-source, cloud-native AI platform based on Kubernetes, used to develop, orchestrate, deploy, and run scalable and portable machine learning workloads. Kubeflow supports two distributed training modes for the TensorFlow framework: parameter-server mode and AllReduce mode. With Arena, developed by the Alibaba Cloud Container Service team, users can submit both types of distributed training jobs.
We follow the documentation in the GitHub repo for installation.
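
After installation, you can confirm that the arena CLI works and can reach the cluster. A minimal sketch using standard arena subcommands:

# Print the arena client version
arena version
# List training jobs; an empty list still proves the CLI can reach the cluster
arena list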

4. Start the TF job

arena submit mpi \
    --name job-jindofs \
    --gpus=8 \
    --workers=4 \
    --working-dir=/perseus-demo/tensorflow-demo/ \
    --data-dir /mnt/jfs/test:/data/imagenet \
    -e DATA_DIR=/data/imagenet -e num_batch=1000 \
    -e datasets_num_private_threads=8 \
    --image=registry.cn-hangzhou.aliyuncs.com/tensorflow-samples/perseus-benchmark-dawnbench-v2:centos7-cuda10.0-1.2.2-1.14-py36 \
    ./launch-example.sh 4 8

In this article, we submit a ResNet-50 model job on a 144 GB ImageNet dataset. The data is stored in TFRecord format, with each TFRecord file about 130 MB in size; both the model job and the ImageNet dataset are easy to find online. Among the parameters, /mnt/jfs/ is the directory that jindofs-fuse mounts on the host, and test is the namespace corresponding to our OSS bucket. We use --data-dir to map this directory to /data/imagenet inside the container, so that the job can read the OSS data, and everything it reads is automatically cached in the JindoFS cluster.
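
Once submitted, the job can be followed with the arena CLI (a minimal sketch using standard arena subcommands; job-jindofs is the job name given above):

# Show the status of the job and its pods
arena get job-jindofs
# Stream the training logs to watch throughput
arena logs -f job-jindofs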

Summary

With the JindoFS cache acceleration service, the data only needs to be read once: most of the hot data is then cached in local memory or on local disk, and the training speed of deep learning improves significantly. We can also preload the data into the cache ahead of time to speed up the next training run.
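
One simple, tool-agnostic way to preload is to read every file once through the FUSE mount, since each block that passes through is cached by JindoFS. A sketch against the test namespace mounted above:

# Walk the namespace and read every file once to warm the cache
find /mnt/jfs/test/ -type f -exec cat {} + > /dev/null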

Original link
This article is original content from Alibaba Cloud and may not be reproduced without permission.
