Summary: The barrier for AI to enter industry keeps rising. If developers want to build an excellent AI model, they have to compromise between computing power and cost. What should they do?
To help enterprises further cut costs and improve efficiency when putting AI into production, Huawei Cloud has launched an AI "black technology": elastic training.
The most talked-about model in the AI community this year is GPT-3, recently released by OpenAI. The largest natural language processing (NLP) transformer released so far, it has 175 billion parameters, was trained on 45 TB of data, required about 3,640 PF-days of computing power, and cost as much as 12 million US dollars to train.
AI developers who want to train models on big data therefore need enormous computing power and must pay high training costs, which raises the barrier for AI to enter industry. To build an excellent AI model, developers have to compromise between computing power and cost.
On the one hand, with a limited budget, AI developers can only afford weak computing power, which slows the development of AI services. On the other hand, because users' usage times and scales fluctuate, cloud vendors often have idle computing resources sitting unused and going to waste. Huawei Cloud's elastic training "black technology" dynamically shrinks and expands the nodes assigned to a job, resolving the contradiction between developers' insufficient computing power and cloud vendors' idle computing power.
Flexible allocation of computing resources: elastic training cuts costs and improves efficiency for AI development
Huawei Cloud's elastic training scheme monitors the computing power of the resource pool in real time. When there are idle resources, they are allocated to elastic jobs currently in training, raising those jobs' computing power so that they converge faster. When a new task is submitted, the scheme reclaims resources for it, based on the pool's idle resources and the usage of the elastic jobs, so that the new training job takes effect quickly.
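The allocate-and-reclaim logic described above can be sketched as follows. This is a minimal, purely illustrative model; the class and method names (ResourcePool, ElasticJob, submit, expand) are hypothetical and not Huawei Cloud APIs.

```python
class ElasticJob:
    """A training job that can run on anywhere from min_nodes to max_nodes."""
    def __init__(self, name, min_nodes, max_nodes):
        self.name = name
        self.min_nodes = min_nodes
        self.max_nodes = max_nodes
        self.nodes = min_nodes  # nodes currently assigned

class ResourcePool:
    def __init__(self, total_nodes):
        self.total_nodes = total_nodes
        self.jobs = []

    def idle_nodes(self):
        return self.total_nodes - sum(j.nodes for j in self.jobs)

    def submit(self, job):
        """Admit a new job, reclaiming nodes from elastic jobs if needed."""
        needed = job.min_nodes - self.idle_nodes()
        for j in self.jobs:  # shrink running jobs toward their minimum
            if needed <= 0:
                break
            give_back = min(j.nodes - j.min_nodes, needed)
            j.nodes -= give_back
            needed -= give_back
        if needed > 0:
            raise RuntimeError("resource pool exhausted")
        self.jobs.append(job)
        self.expand()

    def expand(self):
        """Hand any idle nodes to elastic jobs that can still grow."""
        for j in self.jobs:
            grow = min(j.max_nodes - j.nodes, self.idle_nodes())
            j.nodes += grow
```

In this toy model, a single elastic job in a 16-node pool grows to fill all 16 nodes; when a second job arrives, the first shrinks just enough to admit it, mirroring the behaviour the article describes.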
The elastic training process
Elastic training adaptively matches the optimal amount of resources to the required model training speed. As a product, it offers two modes.
The first is Turbo mode, which makes full use of idle resources to accelerate existing training jobs. In most typical scenarios, the acceleration efficiency exceeds 80%, training speed increases by up to 10 times, and the model's convergence accuracy is unaffected.
The second is Economic mode, which maximizes resource utilization to give developers the best cost-performance ratio. In most typical scenarios, it improves cost-performance by more than 30%.
Multi-dimensional engineering and algorithmic optimization reduces the difficulty of model training
Huawei Cloud's elastic training solution has to solve several hard distributed-training problems: how to make the convergence process and results of dynamic elastic training equivalent to those of ordinary, non-elastic training; how to switch gracefully as nodes join and leave; how to keep stragglers from dragging down system performance on mixed hardware and in similar scenarios; how to minimize the code changes required of users; and how to choose a suitable communication framework to reduce gradient-aggregation time. The scheme is optimized along both engineering and algorithmic dimensions, solves these problems, and achieves the desired training accuracy and speedup ratio.
Specifically, Huawei Cloud's elastic training scheme has four advantages: an easy-to-use, efficient, and graceful training framework; an equivalent training process; inclusive, powerful computing power; and highly utilized cloud resources.
An easy-to-use, efficient, and graceful training framework
Huawei Cloud's elastic training is built on an easy-to-use, efficient training framework. Users can enable elastic training with only minor code changes.
The elastic training framework supports NCCL communication and both AllReduce and point-to-point networking modes, so it aggregates gradients efficiently and delivers good acceleration.
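The core of AllReduce-based gradient aggregation is that every worker contributes its local gradient and receives back the same reduced (here, averaged) result. A minimal, framework-free sketch of that arithmetic — real systems such as NCCL implement it with bandwidth-efficient ring or tree collectives on GPUs, which this list-based version does not attempt:

```python
def allreduce_average(worker_grads):
    """Average gradients elementwise across workers.

    worker_grads: a list with one flat gradient list per worker.
    Returns the averaged gradient every worker would receive.
    """
    world_size = len(worker_grads)
    # Elementwise sum across workers, then divide by world size.
    summed = [sum(vals) for vals in zip(*worker_grads)]
    return [v / world_size for v in summed]
```

For example, two workers holding gradients [1.0, 2.0] and [3.0, 4.0] each end up with [2.0, 3.0] after the averaging AllReduce.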
It also supports performance monitoring across multiple GPUs/NPUs and dynamically adjusts each device's training load based on its measured performance, so it still performs well when device performance is uneven, for example on mixed hardware.
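One common way (not confirmed as Huawei Cloud's exact method) to implement this kind of straggler mitigation is to split the global batch in proportion to each device's measured throughput, so slower devices receive less work:

```python
def partition_batch(global_batch, throughputs):
    """Split a global batch across devices proportionally to throughput.

    global_batch: total number of samples per step.
    throughputs:  measured samples/second for each device.
    Returns a per-device batch size list summing to global_batch.
    """
    total = sum(throughputs)
    shares = [int(global_batch * t / total) for t in throughputs]
    # Give the rounding remainder to the first device so sizes sum exactly.
    shares[0] += global_batch - sum(shares)
    return shares
```

With this split, a device that is three times faster than its peer handles three quarters of each batch, so both finish a step at roughly the same time and neither waits on the other.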
In addition, the elastic training framework keeps the elastic process graceful. During elastic training, the number of nodes grows and shrinks. When nodes are added, existing nodes continue training normally while the new node prepares, and the new node cuts in smoothly once it is ready, so existing nodes never wait long. When nodes are removed, the framework lets the released nodes exit smoothly.
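A toy, step-by-step simulation (purely illustrative, not Huawei Cloud code) of the scale-up behaviour described above: the existing worker keeps completing steps while a newcomer spends `prep_steps` preparing, and only then joins the group.

```python
def simulate_scale_up(prep_steps, total_steps):
    """Count training steps completed when a second worker joins gracefully.

    The original worker never blocks: it trains every step, and the
    newcomer only starts contributing once its preparation is done.
    """
    workers = 1
    steps_done = 0
    for step in range(total_steps):
        if step == prep_steps:  # newcomer is ready and cuts in smoothly
            workers += 1
        steps_done += workers   # each active worker completes one step
    return steps_done
```

With a 3-step preparation over 5 steps, the group completes 1+1+1+2+2 = 7 worker-steps — progress is never paused for the join.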
Equivalent training process
The number of nodes changes dynamically during elastic training, and adjusting the training hyperparameters so that the model still converges is a major challenge. Huawei Cloud's elastic training scheme can theoretically guarantee that, once the correct hyperparameters are set initially, the model's convergence process and results stay consistent as nodes are added or removed. Users therefore do not need to introduce complex hyperparameter-adjustment strategies to cope with elasticity, nor worry that elasticity will affect convergence. This equivalence lets users adopt elastic training with confidence.
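The article does not disclose how the scheme keeps training equivalent; one widely used recipe for this situation is the linear learning-rate scaling rule (Goyal et al., 2017): when the number of workers — and hence the effective global batch size — changes, scale the learning rate by the same factor. A sketch under that assumption:

```python
def scaled_lr(base_lr, base_world_size, current_world_size):
    """Linear learning-rate scaling as workers join or leave.

    base_lr was tuned for base_world_size workers; the effective batch
    size grows linearly with the worker count, so the learning rate is
    rescaled by the same ratio.
    """
    return base_lr * current_world_size / base_world_size
```

For example, a base learning rate of 0.1 tuned for 1 node would be scaled to 1.6 when the job elastically grows to 16 nodes, and scaled back down if nodes are reclaimed.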
Inclusive, powerful computing power
Compared with the traditional approach of purchasing a fixed amount of computing power, AI developers can obtain huge computing power with a small investment. After launching an elastic training job, a user can pick up Huawei Cloud's idle computing resources during training; the job's computing power grows rapidly, the training finishes in a short time, and the user can iterate frequently and bring services online quickly. The elastic solution makes large-scale computing genuinely affordable.
Highly utilized cloud resources
A traditional fixed-capacity scheme can neither put idle resources to work nor dynamically adjust running jobs according to real-time resource usage. As a result, the contradictory situation often arises where training tasks lack computing power and take a long time even while large amounts of the resource pool sit idle.
By contrast, Huawei Cloud's elastic training scheme is highly flexible. It monitors the resource pool in real time and dynamically adjusts the computing power of elastic training jobs: whenever the pool has idle resources, they are allocated to training jobs, keeping resources fully utilized.
Once the elastic policy is set, the scheme monitors and adjusts automatically, with no human intervention, which is convenient and efficient. It satisfies both cloud providers' need to fully utilize computing resources and AI developers' demands — a win for both sides.
Elastic training has broad application prospects
With the explosive growth of data, bringing AI into industry increasingly requires large computing power to process big data, so elastic training has broad room for application. In one experiment, Huawei Cloud's elastic training scheme was used to train a ResNet-50 model on ImageNet (a large visual database). Training began on a single node; once idle resources became available, the job was scaled to 16 nodes, achieving a linear speedup ratio of 10. After 60 epochs of training, Top-1 accuracy reached 76.1%. With accuracy unchanged, Huawei Cloud's elastic training scheme made convergence nine times faster.
Huawei Cloud has always followed the principle of "leave simplicity to developers, complexity to Huawei Cloud". Huawei Cloud AI keeps iterating and innovating, launching "black technology" features to accelerate AI's entry into industry and its deployment in real scenarios, so that thousands of industries can share the dividends of AI technology.