Summary: The era of large-scale Chinese pre-trained language models with 100 billion parameters has arrived.
This article is shared from the Huawei Cloud community post "With the support of the open-source MindSpore framework, how did we 'refine' the first Chinese pre-trained language model with 100 billion parameters and terabytes of memory?", original author: chengxiaoli.
The era of large-scale Chinese pre-trained language models with 100 billion parameters has arrived.

Recently the large-scale Chinese pre-trained language model scene has been lively: the 2.6-billion-parameter "Wudao · Wenyuan", the 27-billion-parameter PLUG, and now the 100-billion-parameter-class "PanGu" NLP model released by Huawei Cloud yesterday. Pre-trained language models have grown to the point where merely loading one requires terabytes of memory or device memory.
Intuitively, we expect PanGu to perform better, but it also demands more compute and is harder to train.

PanGu, however, is really an exploration of a larger point: the combination of the open-source MindSpore framework, the Ascend hardware and software platform, and an ultra-large-scale Chinese pre-trained model shows that the underlying infrastructure has matured.

This work was completed jointly by Huawei and technical teams at Peking University. With the help of the Ascend hardware and software platform and the automatic parallelism of the MindSpore framework, the team trained the largest Chinese pre-trained model to date.

So how was the PanGu model actually trained? Let us take a careful look at the key technologies behind it.
A model with 100 billion parameters and terabytes of memory
Take the 200-billion-parameter version of PanGu as an example. If the weights are stored in the standard FP32 format during training, they alone occupy about 750 GB, and the memory overhead grows several-fold once training begins. These 750 GB of parameters cannot simply sit on disk or in host RAM: to train the model, they must be moved into the HBM (High Bandwidth Memory) of the Ascend-based Atlas training servers.
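The 750 GB figure is easy to verify with a line of arithmetic. Here is a minimal sketch; the 4-bytes-per-parameter FP32 assumption comes from the text, and the function name is illustrative:

```python
# Rough parameter-memory estimate for a 200-billion-parameter model.
# Illustrative arithmetic only; counts weights alone, not gradients or
# optimizer state, which multiply the footprint during training.
def param_memory_gb(num_params, bytes_per_param=4):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / 1024**3

weights_fp32 = param_memory_gb(200e9, 4)   # FP32: 4 bytes per parameter
print(f"FP32 weights: {weights_fp32:.0f} GiB")   # roughly 745 GiB, i.e. ~750 GB
```

This matches the article's figure: 200 billion parameters at 4 bytes each come to about 745 GiB before any training-time overhead.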
A large model also means large amounts of high-quality data. To meet this need, the R&D team crawled 80 TB of text from the internet and cleaned it down to a 1 TB Chinese dataset.

A model and dataset of this size cannot even be loaded by a handful of servers, let alone trained on them. Fortunately, the R&D team will provide APIs, so ordinary algorithm engineers can try the model simply by calling the interface.

It is fair to say that PanGu is currently the industry's first Chinese pre-trained model at the 100-billion-parameter scale, with its largest version reaching 200 billion parameters.
Ultra-large-scale automatic parallelism: a blessing for algorithm engineers
Consider a question first: if you were given enough computing power, could you figure out how to train such a large model?

The most common distributed training method is data parallelism, but data parallelism alone is certainly not enough, because no single piece of computing hardware can hold 800 GB of parameters. What about model parallelism? New problems arise: how should a model as huge as PanGu be split, and how should gradients and data be communicated between hardware devices (NPUs, GPUs, and so on)?
Obviously, training such a huge model is far more complex than it appears. It requires a great deal of engineering work, while ensuring that this work does not affect, or only minimally affects, the final convergence of the model.

Does PanGu really have to rely on manual parallel optimization?

If you write the distributed training logic by hand, you need to weigh many complex factors at once, such as the amount and type of computation, cluster bandwidth, topology, and the number of samples, then design a well-performing parallel partitioning strategy and write a large amount of partitioning and inter-node communication code. If the system environment changes, you have to redesign and rewrite it all. Just thinking about it is enough to give you a headache.
If we used TensorFlow or a similar framework, its family of distributed strategies such as MirroredStrategy would not help at this scale, so it would seem that writing the parallel strategy ourselves is unavoidable. In reality, however, PanGu was trained through software-hardware co-design: the MindSpore computing framework, the CANN heterogeneous computing architecture, the Ascend hardware and software platform, and a complete set of supporting infrastructure. Among these, MindSpore provides the crucial capability of automatic parallelism.
Five dimensions combined: powerful automatic parallelism
MindSpore's automatic parallelism provides five dimensions of parallelism: data parallelism, operator-level model parallelism, pipeline model parallelism, optimizer model parallelism, and recomputation (rematerialization), and it combines these five dimensions organically at the graph-compilation stage. Together, these five parallel methods form PanGu's parallel strategy.
a. Data parallelism
Data parallelism is the most basic and most widely used parallel method. The training data (a mini-batch) is partitioned, and each device receives one shard; every device holds a complete copy of the model. During training, after each device computes its gradients, the gradients are synchronized across devices before the model parameters are updated.
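A minimal pure-Python sketch of this idea, using a toy scalar model with a squared-error loss (the function names and the loss are illustrative, not PanGu's actual code):

```python
# Data-parallel sketch: each "device" holds a full copy of the scalar
# weight w, computes gradients on its own shard of the mini-batch, and
# then the gradients are averaged (the all-reduce step) before the update.
def local_grad(w, shard):
    # Gradient of mean squared error 0.5*(w*x - y)^2 w.r.t. w over a shard.
    return sum((w * x - y) * x for x, y in shard) / len(shard)

def data_parallel_step(w, batch, num_devices, lr=0.1):
    shard_size = len(batch) // num_devices
    shards = [batch[i * shard_size:(i + 1) * shard_size]
              for i in range(num_devices)]
    grads = [local_grad(w, s) for s in shards]   # computed independently
    avg_grad = sum(grads) / num_devices          # all-reduce (average)
    return w - lr * avg_grad                     # identical update everywhere
```

Because all devices apply the same averaged gradient, the result after one step is identical to training on the whole mini-batch with a single device.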
b. Operator-level model parallelism
Operator-level model parallelism partitions the tensors involved in each operator of the model network. MindSpore models each operator independently, and each operator can have its own partitioning strategy.

Take the matrix multiplication operator MatMul(X, W) as an example, where X is the training data and W is a model parameter, both two-dimensional matrices. The parallel strategy ((4, 1), (1, 1)) means that X is split into 4 parts along its rows while W is not split. With 4 devices in total, each device holds one slice of X and a complete copy of W.
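The ((4, 1), (1, 1)) strategy can be checked with a small pure-Python sketch. This is illustrative only; MindSpore performs the partitioning inside the framework:

```python
# Operator-level parallelism sketch: X is split into row slices, W is
# replicated; each "device" computes its slice independently and the
# concatenated results equal the full MatMul.
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def sharded_matmul(x, w, num_devices=4):
    rows = len(x) // num_devices
    slices = [x[d * rows:(d + 1) * rows] for d in range(num_devices)]  # split X by rows
    partials = [matmul(s, w) for s in slices]   # each device: slice @ W, no communication
    return [row for p in partials for row in p]  # concatenate the output slices
```

Because W is replicated and X is only split along rows, the devices never need to exchange data during this operator; the communication cost shows up elsewhere in the network, which is why the framework chooses a strategy per operator.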
c. Pipeline model parallelism
Pipeline model parallelism divides the layers of the model into multiple stages and maps each stage to a group of devices. To improve device utilization, the mini-batch is further divided into multiple micro-batches, so that different devices can process different micro-batches at the same time.

The classic pipeline-parallel schedule (GPipe) requires that backward computation start only after the forward computation of all micro-batches has finished on every device. Because the backward pass depends on the forward outputs, the activation memory accumulated on each card during the forward pass is proportional to the number of micro-batches, which limits how many micro-batches can be used. In MindSpore's pipeline parallelism, the backward pass is moved earlier: as soon as a micro-batch's forward computation finishes, its backward computation is scheduled. This shortens the lifetime of stored activations and thus improves overall pipeline efficiency.
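A toy sketch of why the earlier-backward schedule helps. The functions below are a simplification of GPipe versus a 1F1B-style (one-forward-one-backward) schedule, not MindSpore's actual scheduler:

```python
# Peak number of in-flight micro-batch activations a stage must hold,
# as a proxy for activation memory, under two pipeline schedules.
def gpipe_peak(num_micro_batches, num_stages):
    # GPipe: all forwards finish before any backward starts, so every
    # stage keeps activations for every micro-batch at once.
    return num_micro_batches

def one_f_one_b_peak(num_micro_batches, num_stages, stage):
    # Earlier backward (1F1B-style): a stage only holds activations for
    # micro-batches whose backward has not yet reached it, which is
    # bounded by its distance from the last stage, independent of the
    # total number of micro-batches.
    return min(num_micro_batches, num_stages - stage)
```

With 16 micro-batches and 4 stages, GPipe's first stage holds 16 sets of activations at its peak, while the 1F1B-style schedule caps it at 4, so the number of micro-batches can be increased freely to fill the pipeline.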
d. Optimizer model parallelism
Optimizer model parallelism partitions the parameters and gradients handled by the optimizer across multiple devices. Take the Adam optimizer as an example: it maintains "momentum" buffers of the same size as the weights. Under plain data parallelism, every card holds a complete copy of these buffers and repeats the same update computation, wasting both memory and compute. With optimizer parallelism, each card stores only a slice of the weights and momentum buffers, which reduces static memory per card and improves computational efficiency.
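The memory saving is easy to quantify with a sketch. The numbers below are illustrative arithmetic, not measured values from PanGu; Adam is assumed to keep two moment buffers per parameter:

```python
# Per-device Adam optimizer-state memory (the two FP32 moment buffers),
# with and without sharding the state across N cards.
def adam_extra_state_gb(num_params, bytes_per=4, shards=1):
    # first moment (m) + second moment (v) = 2 extra copies of the weights
    return 2 * num_params * bytes_per / shards / 1024**3

replicated = adam_extra_state_gb(1e9)            # every card holds it all
sharded = adam_extra_state_gb(1e9, shards=8)     # each card holds 1/8
print(f"{replicated:.2f} GiB vs {sharded:.2f} GiB per card")
```

The per-card state shrinks linearly with the number of shards, which is exactly the static-memory reduction the text describes.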
e. Rematerialization (recomputation)

To address the problem that forward-operator outputs accumulate in memory and push the memory peak too high, rematerialization discards some forward outputs and recomputes them during the backward phase. This effectively reduces peak memory usage during training. As shown in the figure below, the first memory peak can be eliminated by recomputation, and the second can be eliminated by the optimizer parallelism described above.
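A toy illustration of the trade-off. The counting below is a simplification (one activation per layer, uniform checkpoint spacing), not MindSpore's actual memory planner:

```python
# Stored-activation count with and without rematerialization: keep only
# checkpoints every k layers and recompute the rest during backward, so
# storage drops from L to about L/k plus one segment being rebuilt.
def stored_activations(num_layers, checkpoint_every=None):
    if checkpoint_every is None:        # no recompute: keep every layer's output
        return num_layers
    checkpoints = num_layers // checkpoint_every
    # during backward, one k-layer segment is rematerialized at a time
    return checkpoints + checkpoint_every
```

For a 64-layer network, checkpointing every 8 layers cuts stored activations from 64 to 16, at the cost of one extra forward pass over each segment.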
Given these five parallel dimensions, how to combine them for PanGu, and how to place the partitioned model onto the devices, is still a hard problem. MindSpore's automatic parallelism combines these five dimensions organically to achieve highly efficient distributed training of large models.

Figure (b) below shows a typical tree-shaped hardware topology, whose bandwidth decreases with tree depth and where some traffic conflicts occur. To exploit this property, MindSpore aims to maximize the compute-to-communication ratio: it places the parallel modes with heavy traffic (operator-level parallelism) across the cards within a server; pipeline parallelism, with lighter traffic, across servers within the same rack; and data parallelism (including optimizer parallelism) across racks, because its communication can be overlapped with computation and its bandwidth requirements are low.
In PanGu's 200-billion-parameter model, MindSpore divides the 64 layers into 16 stages, each containing 4 layers. Within each layer, tensors are partitioned using operator-level parallelism.

As shown in the figure below, the Q, K, and V parameter matrices are each cut into 8 slices (by column), the input tensor is cut into 16 slices (by row), and the output tensor is therefore cut into 128 slices (8 × 16). Recomputation is configured per layer, so the redundant computation introduced by rematerialization never exceeds the cost of one layer. In total, MindSpore used 2,048 Ascend processors to train PanGu.
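The quoted figures fit together as a simple product. This is just the arithmetic implied by the text, spelled out:

```python
# How PanGu's parallel dimensions multiply out to the device count.
pipeline_stages = 16   # 64 layers / 4 layers per stage
model_parallel = 8     # Q/K/V weights cut into 8 column slices
data_parallel = 16     # input tensor cut into 16 row slices

output_slices = model_parallel * data_parallel        # 8 * 16 = 128
total_devices = pipeline_stages * model_parallel * data_parallel
print(output_slices, total_devices)   # 128 2048
```

The 2,048 Ascend processors are thus fully accounted for by the product of the pipeline, operator-level, and data-parallel dimensions.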
MindSpore hides the details of this complex parallel implementation, so that users can write scripts much as they would for a single-device model. Starting from a single-device script, users can enable multi-dimensional hybrid parallelism with only a small amount of configuration. The figure below shows a simplified version of the PanGu script, in which the bold red text marks MindSpore's parallel-strategy settings; remove the bold red text and it is an ordinary single-device script.
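Since the figure itself is not reproduced here, the following is a hedged sketch of what such parallel configuration looks like. API names follow the MindSpore 1.x documentation; the exact arguments are illustrative and this is not PanGu's actual training script:

```python
# Configuration-fragment sketch: in MindSpore, a few lines like these turn
# an otherwise single-device script into a multi-dimensional parallel one.
from mindspore import context, ops

# One configuration call switches on semi-automatic parallelism for the
# whole cluster (device and stage counts here mirror the article's figures).
context.set_auto_parallel_context(
    parallel_mode=context.ParallelMode.SEMI_AUTO_PARALLEL,
    device_num=2048,        # total Ascend processors
    pipeline_stages=16)     # pipeline stages across the 64 layers

# Per-operator sharding strategy: split X into 4 row slices, replicate W,
# i.e. the ((4, 1), (1, 1)) MatMul strategy described earlier.
matmul = ops.MatMul().shard(((4, 1), (1, 1)))
```

Everything else in the script stays as single-device model code, which is the point the article is making: the parallel logic lives in configuration, not in the model definition.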
Joint graph-kernel optimization: extracting the hardware's full performance
Beyond large-scale automatic parallelism across nodes, within a single card MindSpore further exploits the hardware's compute capability through joint optimization across the graph layer and the operator layer.

In a typical neural network, different operators carry very different amounts and kinds of computation. For example, LayerNorm is composed of 11 basic operators, while Add is a single basic operator. Operators defined from the user's point of view usually cannot make full use of the hardware: an operator that is too large and too complex is hard to compile into a well-partitioned, high-performance kernel, which lowers device utilization; an operator that is too small cannot hide the overhead of data movement, leaving the compute units idle and again lowering device utilization.

To improve hardware utilization, MindSpore applies graph-kernel fusion: through joint optimization of the graph layer and the operator layer, it reorganizes and fuses the easy-to-use operators defined from the user's perspective into high-performance operators defined from the hardware's execution perspective, raising hardware resource utilization and the execution performance of the whole network. The optimization process is shown in the figure below:

Taking the LayerNorm operator as an example, after operator splitting and reorganization, the original 11 small operators become one single operator plus two fused operators. These reorganized operators compile into higher-performance kernels, greatly reducing the overall network running time.
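To make the decomposition concrete, here is LayerNorm written out as the chain of elementary operators the text refers to. This is a numerically faithful sketch of the math, not MindSpore's fused kernel; the comments name the elementary ops each line corresponds to:

```python
# LayerNorm expressed as its constituent elementary operators. Fusion
# merges these into a couple of kernels, but the math is unchanged.
import math

def layernorm(x, gamma=1.0, beta=0.0, eps=1e-5):
    n = len(x)
    mean = sum(x) / n                            # ReduceMean
    centered = [v - mean for v in x]             # Sub
    var = sum(c * c for c in centered) / n       # Square + ReduceMean
    inv_std = 1.0 / math.sqrt(var + eps)         # Add(eps) + Sqrt + Div
    return [gamma * c * inv_std + beta
            for c in centered]                   # Mul + Mul + Add
```

Each intermediate result above (mean, centered, var, inv_std) would be a separate tensor written to and read from memory if the operators ran one by one; fusing them into a few kernels removes those round trips, which is where the speedup comes from.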
In the PanGu model, graph-kernel fusion reduced overall training time by more than 20%. Graph-kernel fusion also delivers good performance gains on other NLP and CV tasks.
Conclusion: a full demonstration of what it takes to train an ultra-large model
Even given enough computing power, training an ultra-large model remains extremely complex and far harder than one might expect. For ordinary algorithm engineers, a model with hundreds of millions of parameters is already large for a single task, yet training it does not feel difficult, because every deep learning framework lets us call a data-parallel interface directly.

However, as models grow to the 10-billion, 100-billion, or even trillion-parameter scale, the complexity of the parallelization and optimization strategies rises sharply, and it becomes far too difficult for algorithm engineers to write and tune the code by hand. By decoupling computation logic from parallelization logic, MindSpore's automatic parallelism turns serial single-card code into distributed parallel code automatically, freeing algorithm scientists to focus their energy on the model itself.

To absorb more knowledge from pre-training, models such as GPT-3 and PanGu will keep growing larger; after all, we have not yet seen the limit of what large-model pre-training can achieve. Such models will place even greater demands on infrastructure and require even more complex parallelization and optimization strategies. Only with sufficiently good infrastructure can large-scale pre-training deliver better results, and thus play a greater role in scenarios such as knowledge Q&A, knowledge retrieval, knowledge reasoning, and reading comprehension, realizing business value in intelligent customer service, marketing, copywriting, and more.
Large-scale compute clusters and software-hardware co-optimization were on full display in this round of PanGu training. As the development team put it: "Building a 100-billion-parameter model on MindSpore and the Ascend hardware and software platform was itself an exploration. There were too many unknowns in the distributed training, hyperparameter tuning, dataset composition, and model-structure adaptation of a model this large. Now PanGu works very well and has taken first place on the CLUE benchmark, marking the first ultra-large-scale distributed training achieved with domestically co-optimized software and hardware. The results are exciting, and they show that our infrastructure is strong enough."

Of course, as noted above, PanGu is only one exploration of ultra-large-scale distributed training and ultra-large-scale Chinese pre-trained models. Going forward, more researchers are needed to invest in research on general intelligence and large-scale distributed computing.