After Huawei open-sourced MindSpore Lite 1.0.0 in September 2020, its easy-to-use interfaces, operator performance and completeness, and broad support for third-party models won wide recognition among mobile application developers. MindSpore Lite provides a full-scenario AI inference framework for the HMS Core AI field: it supports AI modules on Huawei phones such as the camera, gallery, wallet, browser QR-code scanning, and object recognition, and it provides basic AI services for Huawei wearables, smart screens, and other devices. At the same time, as one of the key capabilities HMS Core opens to global developers, Huawei Machine Learning Service has been integrated into more than 1,000 applications worldwide, with an average of over 300 million calls per day.
Now, at the start of the new year, Huawei has released MindSpore Lite 1.1.0, a comprehensive upgrade covering operator performance optimization, model miniaturization, automatic trimming of the acceleration library, on-device training, speech-model support, a newly opened Java API, and model visualization. The upgraded version is lighter, faster, and easier to use, and the new features will also appear in the next version of HMS Core.
1. Operator library optimization and expansion
Inference performance optimization is the highlight of this release. In addition to continued performance work on the ARM CPU (fp16 / fp32 / int8), ARM GPU and x86_64 optimization are also highlights this time. On the GPU side, besides conventional operator optimization we added techniques such as online fusion and auto-tuning, which greatly improve ARM GPU inference performance. To better support PC-side inference, we also did extensive assembly-level optimization of the x86_64 operators. Measured across a large number of models, MindSpore Lite 1.1.0 is highly competitive in inference performance among the frameworks in the industry.
1.1 ARM CPU optimization
From introducing better algorithms that reduce the amount of computation, to minimizing hardware memory accesses to improve instruction throughput, the CPU operator performance of MindSpore Lite has improved substantially. We ran an inference-latency comparison using the 100+ on-device preset models from the official TensorFlow Hub site. The results show that MindSpore Lite comprehensively outperforms the official-site numbers on high-end phones such as the Mate 30 and P30, and beats them on 97% of models on mid- and low-end phones such as the P20.
1.1.1 fp16 inference performance
MindSpore Lite fully supports ARMv8.2 fp16 inference, with an inference latency roughly half that of fp32 while accuracy still meets business requirements. Our fp16 inference scheme is already widely used in Huawei HMS ML Kit and in the AI services preset on Huawei phones.
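The fp16 trade-off described above can be illustrated outside any framework: casting fp32 weights to fp16 halves their memory footprint (and hence memory traffic) while introducing only a tiny numerical error. A generic NumPy sketch, not MindSpore Lite code:

```python
import numpy as np

# Simulated fp32 convolution weights
rng = np.random.default_rng(0)
w_fp32 = rng.standard_normal((64, 3, 3, 3)).astype(np.float32)

# Cast to fp16: half the bytes, so half the memory traffic per weight load
w_fp16 = w_fp32.astype(np.float16)
assert w_fp16.nbytes * 2 == w_fp32.nbytes

# Worst-case cast error relative to the largest weight magnitude;
# bounded by fp16 machine epsilon (~9.8e-4) over the normalized range
rel_err = np.max(np.abs(w_fp16.astype(np.float32) - w_fp32)) / np.max(np.abs(w_fp32))
print(f"memory: {w_fp32.nbytes} -> {w_fp16.nbytes} bytes, max rel err {rel_err:.2e}")
```

This is why fp16 inference can roughly halve latency on bandwidth-bound operators while keeping accuracy within business tolerances.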
Because TF Lite does not support fp16 inference, we compared only against the latest MNN 1.1 in the fp16 performance test. The results show that MindSpore Lite has lower inference latency and better fp16 performance.
Comparison of overall network latency on the Huawei Mate 30
Comparison of fp16 inference latency on the Huawei Mate 30
Comparison of fp16 inference latency on the Snapdragon 865+
1.1.2 int8 quantized model inference performance
For quantized operators, this version of MindSpore Lite adds a Winograd optimization for 3×3 convolution kernels at the algorithm level (currently mainly for non-ARMv8.2 devices), uses SDOT instructions on high-end ARMv8.2 devices to optimize operators such as MatMul, fully connected, and convolution, and applies a series of optimization strategies that improve the hit rate of the underlying cache. Together these greatly improve MindSpore Lite's quantized inference performance, which is now 40% faster than fp16 inference.
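The core int8 idea can be sketched generically: quantize fp32 operands onto a signed 8-bit grid, accumulate products in int32 (the job instructions like SDOT do in hardware), and dequantize once at the end. An illustrative NumPy sketch, not the MindSpore Lite implementation:

```python
import numpy as np

def quantize_sym(x, bits=8):
    """Symmetric per-tensor linear quantization onto a signed integer grid."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(1)
a = rng.standard_normal((4, 16)).astype(np.float32)
b = rng.standard_normal((16, 8)).astype(np.float32)

qa, sa = quantize_sym(a)
qb, sb = quantize_sym(b)

# Integer multiply-accumulate in int32, then a single dequantize step
acc = qa.astype(np.int32) @ qb.astype(np.int32)
y_int8 = acc * (sa * sb)

y_fp32 = a @ b
print("max abs error vs fp32:", np.max(np.abs(y_int8 - y_fp32)))
```

The integer accumulation is what lets 8-bit kernels pack four times as many operands per vector register as fp32, which is where the speedup comes from.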
We compared against the latest TF Lite 2.4 and MNN 1.1 using the official quantized models preset on TensorFlow Hub. (During testing we found that MNN failed to convert a large number of the quantized models, and even TF Lite had conversion problems with some of its own models.) The results show that MindSpore Lite has the lowest latency and the best model coverage and inference performance.
Comparison of overall quantized-network latency on the Huawei Mate 30
ARMv8.2 device test
Quantized-model latency comparison on the Snapdragon 865+
ARMv8 device test
Quantized-model latency comparison on the Huawei P20
1.1.3 fp32 inference performance
To ensure the best inference performance in the industry when running MindSpore Lite on low-end CPUs as well, we continued to optimize fp32 inference. Taking TF Lite 2.4 and MNN 1.1 as baselines, we ran benchmarks on the Huawei P20. The results show that MindSpore Lite fp32 still has the lowest inference latency and the best performance, although its lead over the other frameworks is smaller.
Comparison of fp32 model latency on the Huawei P20
1.2 ARM GPU optimization
In MindSpore Lite 1.1 we focused on optimizing GPU inference performance. In addition to conventional operator-level optimization, we added several techniques such as online fusion, auto-tuning, and an OpenCL kernel binary cache, improving overall performance by 25% over MindSpore Lite 1.0.
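The kernel binary cache idea is generic: compile an OpenCL kernel once, key the resulting binary by a hash of its source and build options, and reuse it on later runs to skip driver compilation. A hypothetical Python sketch of the mechanism (the real cache lives inside MindSpore Lite's OpenCL runtime; `fake_compile` stands in for the driver call):

```python
import hashlib

class KernelBinaryCache:
    """Caches compiled kernel binaries keyed by source + build options."""
    def __init__(self, compiler):
        self._compile = compiler      # expensive backend compile function
        self._cache = {}
        self.compile_count = 0

    def get(self, source, options=""):
        key = hashlib.sha256((source + "\x00" + options).encode()).hexdigest()
        if key not in self._cache:
            self.compile_count += 1
            self._cache[key] = self._compile(source, options)
        return self._cache[key]

# Stand-in for an OpenCL driver compile call (assumption: returns a binary blob)
fake_compile = lambda src, opts: ("BIN:" + src[:10]).encode()

cache = KernelBinaryCache(fake_compile)
src = "__kernel void conv(...) { /* ... */ }"
cache.get(src)   # first run: compiles
cache.get(src)   # later runs: cache hit, no recompilation
print("compilations:", cache.compile_count)
```

Persisting such binaries across process launches is what removes OpenCL compile time from warm-start inference.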
We also ran a GPU inference comparison on the Huawei Mate 30 against MNN 1.1 and TF Lite 2.4, using the 100+ preset models from TensorFlow Hub. As the figure below shows, MindSpore Lite's GPU inference latency is the lowest on most models, while MNN's is relatively high.
Comparison of GPU fp32 inference latency on the Huawei Mate 30
1.3 X86_ 64 CPU optimization
In this version, we also support x86_ A lot of optimization work has been done on the reasoning performance on the 64 platform. We have conducted benchmark tests on several classic CV networks with Intel openvino and MNN on the CPU of Intel Core i7-8700. From the test results, the mindspire Lite latency is also the lowest;
Intel Core i7-8700 X86_ 64 CPU reasoning performance comparison
1.4 More fusion
The current version of MindSpore Lite covers essentially all of the convolution-related fusion patterns commonly used in machine vision. It also adds deep fusion optimizations for Transformer-based speech models and LSTM-based models: fusing small operators into large operators such as LayerNorm and LSTM, fusing multiple MatMuls into a single batched MatMul operator, and forward-fusing Slice operators with matrix partitioning. These fusions improve speech-model performance by 20%. In the future we plan to explore automatic scheduling of fusion patterns.
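The MatMul-fusion pattern mentioned above, replacing several same-shaped MatMuls with one batched MatMul, can be sketched in NumPy (illustrative only; the real fusion happens in MindSpore Lite's graph optimizer):

```python
import numpy as np

rng = np.random.default_rng(2)
# Three independent MatMuls with identical shapes, e.g. Q/K/V projections
xs = [rng.standard_normal((8, 32)) for _ in range(3)]
ws = [rng.standard_normal((32, 16)) for _ in range(3)]

# Unfused: three separate kernel launches
unfused = [x @ w for x, w in zip(xs, ws)]

# Fused: stack into a batch and run a single batched MatMul
fused = np.stack(xs) @ np.stack(ws)      # shape (3, 8, 16)

for i in range(3):
    assert np.allclose(fused[i], unfused[i])
print("fused batched MatMul matches the three separate MatMuls")
```

The fused form launches one kernel instead of three, amortizing launch and scheduling overhead, which is especially valuable for the many small MatMuls in Transformer speech models.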
2. Operator completeness extension
MindSpore Lite supports a variety of hardware platforms, including ARM CPU, ARM GPU, x86 CPU, Kirin NPU, and MTK APU.
2.1 ARM CPU
MindSpore Lite has one of the richest sets of CPU operators among on-device inference frameworks. Our model conversion tool currently parses the operator definitions of third-party frameworks including TF Lite (100 operators), TF (53), ONNX (96), and Caffe (26), achieving high compatibility. As noted in the performance tests above, MNN fails to convert many models, and even TF Lite's support for the preset models on its own site is incomplete. MindSpore Lite implements 121 fp32, 55 fp16, and 71 int8 CPU operators. In version 1.1 we also made major adjustments and improvements to the control-flow operators to better support speech models.
2.2 ARM GPU
More than 10 OpenCL operators have been added, bringing the total number of supported GPU operators to 58 and essentially covering common CV networks. New features such as online fusion and auto-tuning are now supported, and weight quantization support allows an 8-bit weight-quantized network to run entirely on the GPU.
2.3 Kirin NPU
In version 1.1 we improved support for the Huawei Kirin NPU hardware platform, added support for the Kirin 9000 chip, and added 50+ NPU operators, so that most CV workloads can be accelerated on the NPU. We benchmarked several typical networks on Huawei's latest Mate 40 phone; inference latency on the NPU is significantly lower than on the CPU.
Comparison of NPU and CPU fp32/fp16 inference latency on the Mate 40
3. On-device training support
Because a model trained on public datasets always deviates somewhat from real user scenarios, in cases such as face recognition and speech recognition we often need local data to fine-tune the pre-trained model, improving the accuracy of the on-device model and the user experience.
MindSpore Lite 1.1 open-sources our on-device training framework. The first version brings the following features:
1) Supports 30+ backward operators, common optimizers such as SGD and Adam, and loss functions such as CrossEntropy / SparseCrossEntropy / MSE. Models can be trained from scratch, or specific network layers can be fine-tuned for transfer learning;
2) Supports training networks such as LeNet / AlexNet / ResNet / MobileNetV1 / V2 / V3 and EfficientNet, with complete model loading, conversion, and training scripts for users to use and test;
3) MindSpore provides a seamless connection between cloud-side and device-side training: a cloud-side model can be loaded directly onto the device for training;
4) Supports a checkpoint mechanism, so training can quickly resume after an abnormal interruption;
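The fine-tuning mode in feature 1), freezing a backbone and training only chosen layers on local data, can be illustrated with a minimal two-layer NumPy model and plain SGD (a conceptual sketch, not the MindSpore Lite training API):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((32, 8))            # local user data
y = rng.standard_normal((32, 1))

W1 = rng.standard_normal((8, 4)) * 0.5      # "backbone" layer: frozen
W2 = rng.standard_normal((4, 1)) * 0.5      # "head" layer: fine-tuned
W1_before = W1.copy()

def mse(pred, target):
    return np.mean((pred - target) ** 2)

lr = 0.05
loss_before = mse(X @ W1 @ W2, y)
for _ in range(50):
    h = X @ W1                               # forward through frozen backbone
    pred = h @ W2
    grad_W2 = 2 * h.T @ (pred - y) / len(X)  # backprop only into the head
    W2 -= lr * grad_W2                       # SGD update on the head alone

assert np.array_equal(W1, W1_before)         # backbone untouched
print("loss:", loss_before, "->", mse(X @ W1 @ W2, y))
```

Freezing the backbone keeps memory and compute costs low enough for a phone while still adapting the model to local data.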
Our on-device training framework is already in commercial use in AI applications on some Huawei devices, such as family albums, and has delivered a good user experience.
4. Post-training quantization
As AI deployment on end-side devices becomes increasingly common, and constrained by the limited resources of those devices, the challenges of model miniaturization and inference performance keep growing. MindSpore Lite provides a simple, practical post-training quantization function that compresses model size as much as possible, reduces memory usage, improves inference speed, and lowers power consumption.
Compared with quantization-aware retraining, post-training quantization has two clear advantages: it needs no large training dataset, and it needs no retraining, only a fast offline conversion. The MindSpore Lite post-training quantization tool provides two methods, weight quantization and full quantization; it supports 1- to 16-bit quantization and supports classification, detection, NLP, and other models.
To keep the accuracy loss of the post-training quantized model small, we use a pipelined combined quantization method. In the first stage, conventional linear quantization is applied to the weights and activations; in the second stage, the quantization error is analyzed and the quantized model is corrected with statistical methods to compensate for the accuracy loss introduced by quantization.
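The two-stage scheme can be sketched in NumPy: stage one is plain linear weight quantization; stage two measures the mean output error on calibration data and folds a correction term into the model. This illustrates the idea of statistical error correction, not MindSpore Lite's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((16, 4)).astype(np.float32)   # fp32 weights
# Calibration data with non-zero mean (as after a ReLU), where bias
# introduced by weight quantization actually shows up in the outputs
X = rng.standard_normal((256, 16)).astype(np.float32) + 1.0

# Stage 1: conventional linear (symmetric int8) quantization of the weights
scale = np.max(np.abs(W)) / 127
Wq = np.clip(np.round(W / scale), -127, 127) * scale  # dequantized int8 grid

# Stage 2: measure the statistical output error and correct it
err = X @ Wq - X @ W               # per-sample quantization error on outputs
bias_corr = err.mean(axis=0)       # expected error per output channel

uncorrected = np.abs((X @ Wq - X @ W).mean(axis=0)).max()
corrected = np.abs((X @ Wq - bias_corr - X @ W).mean(axis=0)).max()
print(f"mean output error: {uncorrected:.2e} -> {corrected:.2e}")
```

The correction costs one extra bias vector per layer, which is why it recovers accuracy without retraining or a large dataset.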
Pipelined combined quantization
Taking the official TF MobileNet_V2 model as an example, we compared the accuracy of MindSpore Lite's a8w8 post-training quantization (8-bit activations, 8-bit weights) with the fp32 model: after loss correction, the accuracy loss drops from 0.82% to 0.4%. The method also works for 7-bit quantization, where the accuracy loss still stays below 1%.
Accuracy comparison of the fully post-training-quantized MobileNet_V2 model
In the HMS face scenario, we applied int8 weight quantization to models ranging from 364 KB to 2.9 MB, and the resulting end-to-end recognition accuracy fully meets the service requirements. The relative accuracy error with and without the loss-correction scheme is compared below; with loss correction, the quantization accuracy loss is significantly reduced.
Comparison of relative accuracy loss for weight quantization of face-scenario models, with and without loss correction
After extensive internal testing and feedback from commercial deliveries, the pipelined combined quantization method has proven highly effective: even models as small as 300 KB still meet commercial requirements after int8 quantization and compression.
5. Enhanced ease of use
5.1 Automatic trimming tool for the acceleration library
For scenarios that demand an extremely small release package, we provide a one-click trimming tool that, given a user-specified model list, automatically produces a minimized MindSpore Lite build just sufficient to run the listed models.
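Conceptually the trimming tool works like a reachability pass: parse the listed models, collect the set of operators they actually use, and compile only those kernels into the library. A hypothetical sketch of the selection step (the real tool operates on model files and MindSpore Lite's source tree; the operator names and `ops_used_by` helper here are illustrative):

```python
# Hypothetical operator table: all kernels shipped in the full library
ALL_KERNELS = {"Conv2D", "DepthwiseConv2D", "MatMul", "Relu", "Softmax",
               "LSTM", "LayerNorm", "Slice", "Reshape", "AveragePool"}

def ops_used_by(model):
    """Stand-in for parsing a model file and listing its operators."""
    return set(model["ops"])

def select_kernels(model_list):
    """Union of operators across all listed models = minimal kernel set."""
    needed = set()
    for model in model_list:
        needed |= ops_used_by(model)
    return needed & ALL_KERNELS

models = [
    {"name": "mobilenet_v2.ms", "ops": ["Conv2D", "DepthwiseConv2D", "Relu",
                                        "AveragePool", "Softmax"]},
    {"name": "tiny_lstm.ms",    "ops": ["LSTM", "MatMul", "Softmax"]},
]
kernels = select_kernels(models)
print(sorted(kernels))   # only the kernels these two models need get built
```

Everything outside the selected set is excluded from the build, which is what shrinks the release package to the minimum needed for the listed models.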
5.2 Fewer offline tool parameters
We simplified the parameters of the offline conversion tool to make it as easy to use as possible: when converting a third-party model, developers no longer need to know the model's quantization type, input/output node names, or corresponding data types.
5.3 Java interface support
Version 1.1 officially opens the Java API, making it easier for Android developers to build applications with MindSpore Lite.
5.4 Model visualization
To help developers debug, we contributed code to the Netron open-source community that supports visualizing MindSpore Lite models. Developers can now use Netron to inspect MindSpore Lite models, which is a great convenience when debugging, especially for models with complex structures.
6. More on-device preset models
To help developers quickly deploy AI services on the device side, MindSpore has opened more device-side models, including some original MindSpore models; all of them are easily available on MindSpore Hub.
6.1 ResNet-50 pruned with the SCOP algorithm on the Oxford-IIIT Pet dataset
SCOP (Scientific Control for Reliable Neural Network Pruning) is a scientific-control pruning mechanism jointly proposed by Huawei's Noah's Ark Lab and Peking University that minimizes the influence of pruned nodes on the network output. With this pruning method, the top-1 accuracy of ResNet-101 on the ImageNet dataset drops by only 0.01%, while the model's parameters and computation are reduced by 57.8% and 60.2% respectively, significantly better than SOTA methods. Model link: https://www.mindspore.cn/reso…
6.2 Small VGG model based on the SLB lightweighting technique
This model uses SLB (Searching for Low-Bit weights in quantized neural networks), a model-lightweighting quantization technique from Huawei's Noah's Ark Lab accepted at NeurIPS 2020; the device-side model is obtained with 2-bit weight and 2-bit activation quantization on CIFAR-10. Model link: https://www.mindspore.cn/reso…
To learn more about the lightweighting techniques used in the MindSpore models above, see: https://mp.weixin.qq.com/s/H1zg3ezZDdXQ-IQ7ki0Mtw
The test data comes from Huawei's internal lab tests. If you have questions, you can give feedback on the MindSpore forum: https://bbs.huaweicloud.com/forum/forum-1076-1.html
MindSpore open-source repository link: https://gitee.com/mindspore/mindspore
Original author: Pepper