How to build a high-performance front-end intelligent inference engine


Introduction: what is a front-end intelligent inference engine, and how do you build and apply one?

What is a front-end intelligent inference engine?

Before discussing the front-end intelligent inference engine, let's first talk about what "on-device intelligence" is.

On-device machine learning means running machine learning applications on the device side. "Device side" here is relative to cloud services: it can be a mobile phone, an IoT device, and so on.

Traditionally, machine learning has been done on the server side because of model size and the limits of client computing power. For example, Amazon AWS offers the "Amazon Rekognition" service and Google offers the "Google Cloud Vision" service. As the computing power of devices (represented by mobile phones) improves and model design itself evolves, smaller and more capable models can gradually be deployed to the device.

Compared with cloud deployment, on-device deployment is closer to the user and has the following advantages:

  • Low latency: processing on the device saves the network round-trip time of the data.
  • Resource savings: it makes full use of the device's own computing power and storage space.
  • Better privacy: data is produced and consumed on the device, avoiding the privacy risks that come with transmission.

These are the advantages of on-device intelligence, but it is not a panacea. There are still some limitations:

  • Limited device resources: on-device computing power and storage are limited, so large-scale, high-intensity continuous computation is not feasible.
  • Limited model size: because on-device computing power is small, models must be small, and a single user's data is not enough to optimize the algorithm.
  • Limited user data: the device side is not suitable for long-term data storage, so the available data is limited.

Similarly, front-end intelligence means running machine learning applications on the front end (web, H5, mini programs, etc.).

So, what is a front-end intelligent inference engine?

As shown below:

[Figure: a front-end inference engine executes models using the computing power available on the front end]

A front-end intelligent inference engine, in essence, uses the computing power available on the front end to execute models.

Existing front-end inference engines in the industry

Here are three common inference engines:

  • TensorFlow.js (hereinafter referred to as tfjs)
  • ONNX.js
  • WebDNN

What is the most important thing for an on-device inference engine? Performance, of course! The better the performance, the more application scenarios are possible on the device. Let's take a look at a performance comparison of the three inference engines:

(The following benchmarks use the MobileNetV2 classification model.)

CPU (JS Computing)

[Figure: inference time with pure JS computation on the CPU]

As you can see, computation in a pure JS environment takes more than 1500 ms for a single classification. Imagine a camera that needs to classify the photographed object in real time (for example, predicting whether it is a cat or a dog): at 1500 ms per prediction, the experience is unbearable.


WASM

[Figure: inference time in the WASM environment]

In the WASM environment, onnx.js has the best performance at 135 ms, i.e., about 7 fps, which is barely usable, while tfjs takes a terrible 1501 ms. onnx.js performs best here because it uses Web Workers for multithreaded acceleration.


[Figure: inference time in the WebGL (GPU) environment]

Finally, the GPU environment. As you can see, tfjs and onnx.js reach a fairly good performance level, while WebDNN performs relatively poorly.

In addition to the above three engines, there are also Baidu's Paddle.js and Taobao's MNN.js in China, which will not be discussed here.

Of course, when choosing an inference engine, there are other considerations besides performance, such as ecosystem and engine maintenance. All things considered, tfjs is the most suitable front-end inference engine on the market today, because it can rely on the powerful TensorFlow ecosystem and the full-time maintenance of Google's official team. In contrast, the ONNX community is relatively small and onnx.js has not been maintained for nearly a year, while WebDNN is competitive in neither performance nor ecosystem.

High performance computing solutions on the front end

As can be seen from the previous chapter, high-performance computing on the front end is generally done with WASM or with GPU computing based on WebGL. (asm.js is not discussed here.)


WASM should be familiar to everyone, so here is only a brief introduction:

WebAssembly is a new type of code that runs in modern web browsers and provides new performance characteristics and effects. It is not designed to be written by hand; rather, it provides an efficient compilation target for low-level source languages such as C, C++ and Rust.

This is of great significance for the web platform: it gives client apps a way to run code written in multiple languages at near-native speed on the web, something that was previously impossible.

Moreover, you can take advantage of it without knowing how to write WebAssembly code. WebAssembly modules can be imported into a web app (or Node.js) and their functions exposed to JavaScript. JavaScript frameworks can use WebAssembly to gain significant performance advantages and new capabilities while keeping that functionality easily accessible to web developers. (From MDN's "WebAssembly Concepts".)
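As a concrete, minimal illustration of importing a WebAssembly module from JavaScript, the sketch below hand-encodes the bytes of a tiny module exporting an `add(a, b)` function and instantiates it synchronously. In practice the bytes would come from compiling C/C++/Rust and would usually be fetched over the network with `WebAssembly.instantiateStreaming()`.

```javascript
// A minimal, hand-encoded WASM module that exports `add(i32, i32) -> i32`.
// Real projects get these bytes from a compiler (e.g. Emscripten or rustc).
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // "\0asm" magic + version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type section: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // function section: one func of type 0
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export section: export it as "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section: one body, no locals
  0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b,                   // local.get 0; local.get 1; i32.add; end
]);

const module = new WebAssembly.Module(wasmBytes);
const instance = new WebAssembly.Instance(module);
console.log(instance.exports.add(2, 3)); // prints 5
```

The exported `add` is now an ordinary JavaScript function; callers do not need to know it is backed by WASM.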


Wait, isn't WebGL for graphics rendering? Isn't it for 3D? Why can it do high-performance computing?

Some readers may have heard of the gpgpu.js library, which uses WebGL for general-purpose computing. What is the principle behind it? To follow the rest of this article, please skim this article first: "Using WebGL2 to implement GPU computing on the web front end".

Optimizing the performance of the inference engine

Now that we know the two high-performance computing approaches available on the front end, what if the performance of the existing frameworks (tfjs and onnx.js) does not meet our needs? How can we further improve engine performance and ship it to production?

The answer: dig into the source code and optimize it by hand. Yes, it's that simple and crude. Taking tfjs as an example (other frameworks follow the same principles), let's look at several techniques for optimizing engine performance.

At the beginning of last year, our team had an in-depth exchange with Google's tfjs team. Google made it clear that the future development of tfjs will focus on WASM computing, while the WebGL backend will get no new features and will be in maintenance mode. However, at this stage browser and mini-program support for WASM is incomplete (features such as SIMD and multithreading are missing), so WASM cannot yet be used in production. For now, therefore, we still have to rely on WebGL for computing power. Unfortunately, the WebGL performance of tfjs is still unsatisfactory on mobile, especially on mid- and low-end phones, and could not meet our business requirements, so we had no choice but to dive in and optimize the engine ourselves. The following content is therefore about WebGL computing.

N techniques for optimizing WebGL high-performance computing

Technique 1: compute vectorization

Compute vectorization means computing with GLSL's vec2 / vec4 / matrix data types. The GPU's biggest advantage is parallel computation, and expressing calculations as vector operations exploits that parallelism as much as possible.

For example, the inner product in a matrix multiplication:

c = a1 * b1 + a2 * b2 + a3 * b3 + a4 * b4;

can be changed to:

c = dot(vec4(a1, a2, a3, a4), vec4(b1, b2, b3, b4));

Vectorization should also cooperate with memory layout optimization (see the next technique).
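To see that the two GLSL expressions above compute the same value, here is a plain JavaScript analog. Note this is purely illustrative: on the GPU the speedup comes from `dot` executing as a single vectorized instruction, which a JS loop cannot reproduce.

```javascript
// Scalar form: four multiplies and three adds, written out one by one,
// mirroring `c = a1 * b1 + a2 * b2 + a3 * b3 + a4 * b4`.
function innerProductScalar(a, b) {
  return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

// "Vectorized" form: the same math expressed as one 4-wide dot product,
// mirroring GLSL's `dot(vec4(...), vec4(...))`.
function dot4(a, b) {
  let acc = 0;
  for (let i = 0; i < 4; i++) acc += a[i] * b[i];
  return acc;
}

const a = [1, 2, 3, 4];
const b = [5, 6, 7, 8];
console.log(innerProductScalar(a, b)); // 70
console.log(dot4(a, b));               // 70
```

On the GPU, rewriting an inner loop so that it consumes vec4s instead of single floats lets the hardware do four multiply-adds per instruction.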

Technique 2: memory layout optimization

If you have read the article mentioned above, "Using WebGL2 to implement GPU computing on the web front end", you will understand that all data in the GPU is stored in textures, and a texture is essentially an n (height) × m (width) × 4 channels (RGBA) container. So if we want to store a four-dimensional matrix of shape 3 × 224 × 224 × 150 in it, what should we do? This necessarily involves encoding the matrix, that is, laying the high-dimensional matrix out in a texture of a suitable shape in a certain format. The data layout of the texture then affects read and memory performance during computation. Take a simple example:

[Figure: a conventional memory layout, traversed by row or by column]

With a conventional memory layout, the matrix has to be traversed by row or by column during computation, while the GPU cache is tiled, i.e., an n × n block cache (where n varies by chip). This traversal pattern therefore causes frequent cache misses, which become the performance bottleneck. So we optimize performance through the memory layout, similar to the following figure:

[Figure: an optimized, tile-friendly memory layout]
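As a simplified illustration of such an encoding (not tfjs's actual layout), the sketch below packs a tensor of shape [channels, height, width] into an array of RGBA "texels", putting four consecutive channels of the same spatial position into one texel so that a shader could fetch all four with a single texture read:

```javascript
// Pack a [C, H, W] tensor (flat Float32Array, channel-major) into texels of
// 4 floats each: texel (g, y, x) holds channels 4g .. 4g+3 at pixel (y, x).
// This mimics channel-packing into an RGBA texture; the exact layout here is
// illustrative, not the one tfjs uses internally.
function packCHWToRGBA(data, C, H, W) {
  const groups = Math.ceil(C / 4);               // texel "rows" needed per pixel
  const out = new Float32Array(groups * H * W * 4);
  for (let c = 0; c < C; c++) {
    const g = Math.floor(c / 4), ch = c % 4;     // which texel group / RGBA slot
    for (let y = 0; y < H; y++) {
      for (let x = 0; x < W; x++) {
        const src = c * H * W + y * W + x;                 // source index in [C, H, W]
        const dst = ((g * H + y) * W + x) * 4 + ch;        // destination texel slot
        out[dst] = data[src];
      }
    }
  }
  return out;
}

// A 6-channel 1x1 "image": channels 0..3 land in texel 0, channels 4..5 in texel 1.
const packed = packCHWToRGBA(new Float32Array([1, 2, 3, 4, 5, 6]), 6, 1, 1);
console.log(Array.from(packed)); // [1, 2, 3, 4, 5, 6, 0, 0]
```

The point is that values consumed together end up adjacent in texture memory, so the GPU's tiled cache is hit instead of missed.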

Technique 3: graph optimization

A model is composed of individual operators, and on the GPU each operator is implemented as a WebGL program; every program switch incurs a performance cost. Therefore, if we can reduce the number of programs in a model, the performance gain is considerable. As shown below:

[Figure: fusing multiple graph nodes into a single node]

We fuse fusible nodes in the graph structure (N ops → 1 op) and implement a new operator for each fused node. This greatly reduces the number of operators, and therefore the number of programs, which improves inference performance. The effect is especially noticeable on low-end mobile phones.
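A toy version of such a fusion pass might look like the following. The op names and the conv+relu pattern are illustrative; a real engine matches many patterns over a DAG and generates a fused shader for each.

```javascript
// Fuse every conv op that is immediately followed by a relu into a single
// "conv_relu" op, so the pair runs as one WebGL program instead of two.
// Ops form a simple linear chain here; a real graph pass would also check
// that the intermediate result has no other consumers before fusing.
function fuseConvRelu(ops) {
  const fused = [];
  for (let i = 0; i < ops.length; i++) {
    if (ops[i].type === 'conv' && i + 1 < ops.length && ops[i + 1].type === 'relu') {
      fused.push({ type: 'conv_relu', attrs: ops[i].attrs }); // one program instead of two
      i++; // skip the relu we just absorbed
    } else {
      fused.push(ops[i]);
    }
  }
  return fused;
}

const graph = [
  { type: 'conv', attrs: { kernel: 3 } },
  { type: 'relu' },
  { type: 'maxpool' },
  { type: 'conv', attrs: { kernel: 1 } },
  { type: 'relu' },
];
console.log(fuseConvRelu(graph).map(op => op.type));
// [ 'conv_relu', 'maxpool', 'conv_relu' ]
```

Here five ops (and five program switches) become three, which is exactly the N ops → 1 op reduction described above.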

Technique 4: mixed-precision computation

All the calculations above use regular floating point, i.e., float32 single precision. Can we do mixed-precision computation on the GPU, for example mixing float16, float32 and uint8? The answer is yes. The value of mixed precision on the GPU is increased bandwidth: each texel of a WebGL texture has four RGBA channels of up to 32 bits each, so we want to store as much data as possible in those 32 bits. With float16 precision, two values fit where one float32 did, doubling the bandwidth; likewise, uint8 gives four times the bandwidth. The resulting performance improvement is huge:

[Figure: performance comparison with mixed-precision computation]
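To make the packing concrete, here is an illustrative JavaScript sketch that converts float32 values to float16 bit patterns and packs two of them into one 32-bit slot. Subnormals are flushed to zero and NaN/rounding are not handled, to keep the sketch short; on the GPU side a shader would unpack the pair, e.g. with GLSL's `unpackHalf2x16` or manual bit math.

```javascript
// Convert a float32 to a float16 bit pattern (sketch: subnormals flush to
// zero, no rounding, NaN not handled).
function toHalf(value) {
  const f = new Float32Array(1);
  const u = new Uint32Array(f.buffer);
  f[0] = value;
  const bits = u[0];
  const sign = (bits >>> 16) & 0x8000;
  const exp = ((bits >>> 23) & 0xff) - 127 + 15;  // rebias exponent: 8-bit -> 5-bit
  if (exp >= 0x1f) return sign | 0x7c00;          // overflow -> +/-Infinity
  if (exp <= 0) return sign;                      // underflow -> +/-0
  return sign | (exp << 10) | ((bits & 0x7fffff) >>> 13); // truncate 23-bit mantissa to 10
}

// Decode a float16 bit pattern back to a JS number.
function fromHalf(h) {
  const sign = h & 0x8000 ? -1 : 1;
  const exp = (h >>> 10) & 0x1f;
  const frac = h & 0x3ff;
  if (exp === 0) return sign * frac * 2 ** -24;          // subnormal
  if (exp === 0x1f) return frac ? NaN : sign * Infinity; // Inf / NaN
  return sign * (1 + frac / 1024) * 2 ** (exp - 15);
}

// Pack two float16 values into one 32-bit channel slot: low 16 bits = a,
// high 16 bits = b. This is how one RGBA channel can carry two numbers.
function packTwoHalves(a, b) {
  return (toHalf(b) << 16) | toHalf(a);
}

const slot = packTwoHalves(1.5, -2.25);
console.log(fromHalf(slot & 0xffff));          // 1.5
console.log(fromHalf((slot >>> 16) & 0xffff)); // -2.25
```

Each 32-bit channel now carries two numbers instead of one, which is precisely the doubled bandwidth the text describes (and uint8 packing extends the same idea to four per channel).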

Technique N:

There are many more optimization techniques, which are not listed here.

Scenarios where the engine has landed

At present, the engine we have deeply optimized is used in many application scenarios across Ant Group and the Alibaba economy. Typical ones include the pet recognition demonstrated at the beginning of the article, card recognition, the broken-screen camera, and so on.

There are also some popular virtual makeup try-on mini programs in the industry.

[Figure: a virtual makeup try-on mini program]

Readers of this article are also welcome to let their imagination run and dig out more, and more interesting, intelligent scenarios.

Future outlook

With the upgrading of models across the industry and continued deep optimization of engines, I believe tfjs will shine in more interactive scenarios, such as front-end games with AI capabilities, AR, VR and so on. All we have to do now is settle down, stand on the shoulders of giants, and keep polishing our engine, willing to wait for the flowers to bloom.

Author: Qingbi
Original link
This article is the original content of Alibaba cloud and cannot be reproduced without permission