Author / Zhou Yanjun
Organized by / LiveVideoStack
Hello, I’m Zhou Yanjun, technical director and software architect at NETINT. I’m glad to have the opportunity to share NETINT’s ASIC solution for real-time, high-density, AI-assisted video encoding. First, allow me to briefly introduce NETINT Technologies Inc. NETINT is a technology company focused on intelligent storage and video/image codec solutions, with R&D centers in Vancouver, Toronto and Shanghai. The SoC independently designed by NETINT enables ASIC-based video solutions with ultra-large scale, ultra-high density and ultra-low latency. Our T-series video transcoder products are used by many of the world’s top companies.
Next, I’d like to talk about high-density, AI-assisted video encoding solutions based on ASICs. Here is the agenda for this talk. I will first walk through some typical application scenarios for the ASIC solutions we are designing. Although these scenarios are not novel, ASIC solutions make them more economical, efficient and practical at scale. After that, I will explain how to make an ASIC solution more adaptive: how to ensure a longer product life, how to make it easy to use across different operating systems, and how to integrate it easily and efficiently into different applications. I will also talk about how to scale ASIC solutions, including expanding hardware-accelerator capacity both inside and outside the server. Finally, we will discuss several topics on reducing latency.
1 Case Analysis
1.1 AI-Assisted Region-of-Interest (ROI) Coding in the Cloud
The first typical case is AI-assisted region-of-interest coding in the cloud. The workflow usually uses a DNN engine to identify the region of interest and generate an ROI map. In this example, we simply use a face-detection model to draw a bounding box around each face; the coordinates of the bounding box define the ROI. The ROI map is fed back to the encoder, which then uses a smaller QP for the region of interest, preserving more detail there. An ASIC solution can effectively reduce the latency of such cases: because the video encoder and the DNN engine sit together, the overall workflow is greatly simplified and these operations can run in real time and efficiently.
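The bounding-box-to-ROI-map step above can be sketched as follows. This is a minimal illustration, not a NETINT API: the function name, the 64-pixel block granularity and the QP offset of -6 are all assumptions made for the example.

```python
# Sketch: derive a block-level QP-offset map from face bounding boxes.
# Names (make_roi_map, CTU_SIZE) and values are illustrative only.

CTU_SIZE = 64  # coding-block granularity assumed by this sketch

def make_roi_map(width, height, boxes, roi_qp_offset=-6):
    """Return a 2D grid of QP offsets: negative inside any ROI box."""
    cols = (width + CTU_SIZE - 1) // CTU_SIZE
    rows = (height + CTU_SIZE - 1) // CTU_SIZE
    qp_map = [[0] * cols for _ in range(rows)]
    for (x, y, w, h) in boxes:
        for r in range(y // CTU_SIZE, min(rows, (y + h - 1) // CTU_SIZE + 1)):
            for c in range(x // CTU_SIZE, min(cols, (x + w - 1) // CTU_SIZE + 1)):
                qp_map[r][c] = roi_qp_offset
    return qp_map

# A face detector reports one 256x256 box on a 1920x1080 frame.
qp_map = make_roi_map(1920, 1080, [(640, 256, 256, 256)])
```

An encoder consuming such a map would subtract the offset from its base QP per block, spending more bits on faces and fewer on the background.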
1.2 AI-Assisted Video Coding for Different Scenes and Video Categories
Another typical use case is AI-assisted video coding for different scenes or video categories. This kind of pipeline uses a DNN engine to classify the input video, generates a set of encoding parameters, and then uses the new parameters to adjust or reconfigure the encoder. The adaptive parameters may include the resolution/bitrate ladder steps, the CRF value, the GOP structure, and so on. Before real-time encoding became affordable at scale, adapting the ABR ladder required pre-encoding, so economically this technique suited only scenarios such as pre-encoding and video on demand. Now, with the real-time encoding capability unique to ASIC solutions, it can also be applied to live streaming.
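The classify-then-reconfigure loop can be sketched as a simple lookup. The scene labels and parameter values here are illustrative assumptions, not recommendations from the talk:

```python
# Sketch: map a DNN scene label to an encoder parameter set.
# Labels and parameter values are made up for illustration.

PARAMS_BY_SCENE = {
    "screen_content": {"crf": 26, "gop": 120, "bitrate_kbps": 1500},
    "sports":         {"crf": 23, "gop": 60,  "bitrate_kbps": 4500},
    "talking_head":   {"crf": 28, "gop": 250, "bitrate_kbps": 1000},
}
DEFAULT_PARAMS = {"crf": 24, "gop": 120, "bitrate_kbps": 3000}

def params_for_scene(label):
    """Pick the encoding parameter set for the classified scene."""
    return PARAMS_BY_SCENE.get(label, DEFAULT_PARAMS)

# The DNN engine classified the incoming segment as sports content,
# so the encoder would be reconfigured with a short GOP and high bitrate:
p = params_for_scene("sports")
```

In a live pipeline this lookup would run per segment or per scene cut, with the result applied to the encoder as a runtime reconfiguration rather than a restart.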
1.3 Real-Time Streaming with NSFW Content Detection
NSFW (not safe for work) detection on real-time streams means identifying pornographic, violent and other inappropriate content in the media. The input stream is decoded and the DNN engine performs NSFW detection. If such content is found, the system reports an NSFW event to the upper layer, which issues instructions to blur or directly block the content; after editing, the video is sent to the encoder for the actual encoding. ASIC solutions effectively reduce the latency of this processing, simplify the workflow, and deliver real-time performance. In short, ASIC solutions make complex AI-assisted video coding cases scalable, adaptable and economical.
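The detect-report-act flow can be sketched per frame as below. `detect_nsfw` is a stand-in for the on-chip DNN inference, and the score threshold and policy names are assumptions for the example:

```python
# Sketch of the per-frame NSFW workflow: detect, report upward, and act
# on the policy decision before the frame reaches the encoder.

def detect_nsfw(frame):
    # Stand-in detector: here a frame is a dict carrying a precomputed
    # DNN score; the 0.8 threshold is an arbitrary example value.
    return frame["nsfw_score"] > 0.8

def process_frame(frame, policy="blur"):
    """Return the action applied before the frame is encoded."""
    if not detect_nsfw(frame):
        return "encode"
    # An NSFW event is reported to the upper layer, which replies with
    # a policy: blur the offending region, or block the frame entirely.
    return "blur_then_encode" if policy == "blur" else "block"

actions = [process_frame({"nsfw_score": s}) for s in (0.1, 0.95)]
```

A clean frame passes straight to the encoder, while a flagged frame is either edited first or dropped, matching the two upper-layer responses described above.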
2 Adaptability – How to Ensure a Long Product Life Cycle
Let’s talk about the adaptability of the product, that is, how to prolong its service life. Developing a chip is a huge investment: the R&D cost alone runs to tens of millions of dollars, and the product may take several years to mature. We want the product to sell in the market for a long time, so that the investment is justified. NETINT’s new-generation chip has the following capabilities, which will make it an excellent video and AI application accelerator for a long time to come. The chip supports the most popular codecs, such as AVC, HEVC/HEIF, AV1 and LCEVC. It also supports HDR, in fact a full set of HDR technologies, including HDR10/HDR10+/HLG/Dolby Vision. It supports resolutions up to 8K, making it an ideal choice for AR/VR applications. It supports low subframe latency, quality up to the x265 slow preset, and 2D video processing including scaling, overlay and more. It also supports on-chip DNN inference: this is a chip with both an AI engine and a codec engine.
2.1 How to Achieve Maximum Interoperability
Another aspect of adaptability is how to achieve maximum interoperability, that is, how to ensure your solution can be used across different operating systems. NETINT has a patented solution: using NVMe as the host-to-device interface for the hardware accelerator. NVMe is a non-volatile memory interface protocol designed for PCIe-based storage devices such as SSDs, and it can also be extended to support computational storage. Both the Linux kernel and Windows ship with stable, performant NVMe drivers. Because the device talks to the host over NVMe, there is no need to install a custom kernel driver for the hardware accelerator: Windows supports it out of the box, with no time spent developing a driver. The same applies to Android, since Android uses the Linux kernel. Supporting NVMe also means you can scale out over the NVMe-oF protocol. Scaling out means capacity can be expanded outside the server, which we will discuss in detail later. At the same time, supporting NVMe lets us genuinely combine storage functions with hardware acceleration: NETINT will soon release a series of products that combine SSD storage with video transcoding, designed for edge-deployment scenarios such as base stations and edge servers. By contrast, with the traditional approach we would have to design our own proprietary host-device interface and write custom kernel drivers, which often leads to incompatibilities across operating systems and makes Windows particularly difficult to support.
2.2 How to Ensure Easy Integration
For ease of integration, when designing an ASIC solution we need to consider how to make it easy to integrate into existing workflows. In practice, FFmpeg is widely used for video encoding/decoding and filtering, so developing FFmpeg plug-ins is the industry’s first choice. Our video encoders/decoders (AVC/HEVC/AV1/LCEVC) all have their own FFmpeg/libavcodec plug-ins, and the various 2D video operations (such as scaling, overlay, transposing and cropping) have their own FFmpeg/libavfilter plug-ins. Traditionally, however, AI inference is separate from video processing and usually has its own independent workflow. NETINT’s ASIC solution allows AI inference and video encoding/decoding to run together on the same chip. We will also develop FFmpeg plug-ins for AI inference, including ROI map generation (for ROI coding) and NSFW detection, exposing these AI operations through both libavcodec and libavfilter plug-ins.
The NETINT software stack is built on the NETINT firmware. The firmware communicates with the host through the NVMe interface, so the operating system’s standard NVMe driver can be used on top of it. Furthermore, users can take the fully open-source, free libnetint library we provide and customize it themselves, recompiling it for the characteristics of their local system to further maximize interoperability. The libnetint library contains the libavcodec/libavfilter plug-ins. We also provide a customizable encoder-control plug-in module, which includes example code for rate control and 2-pass encoding; users can adapt and recompile this code into their own proprietary algorithms.
3.1 High Density with ASIC Solutions
Now, scalability: how to expand the capacity of ASIC solutions. NETINT and TIRIAS Research recently released a new white paper, Video & Interactive Media at Cloud Edge. It is a great paper, and I strongly recommend downloading it from the NETINT website. The figure above comes from that paper. It shows that in terms of TCO (total cost of ownership), the operating cost of an encoding server equipped with NETINT’s ASIC-based solution is 1/2 that of a GPU-based encoding server and 1/10 that of a software (CPU)-based encoding server. In terms of power consumption, or carbon emissions, NETINT’s ASIC solution is about 1/4 of a GPU-based server and 1/20 of a software- or CPU-based encoding server. In terms of density, NETINT’s ASIC solution can process 80 x 1080p30 media streams on one server, twice the density of a GPU-based encoding server and 10 times that of a software- or CPU-based one. Expanding encoding capacity inside the server is called scaling up, and NETINT’s ASIC solution can achieve very high density in a single server: in one real deployment, a customer uses 2U servers, each equipped with 24 NETINT video transcoders, achieving 192 real-time 1080p30 encoded streams per server.
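The density figure in that deployment is simple arithmetic, sketched below. The 8-streams-per-card figure comes from the T408 numbers quoted later in this talk; the helper name is ours:

```python
# Back-of-the-envelope check of the scale-up density quoted above:
# 24 transcoder cards per 2U server, 8 x 1080p30 streams per card.

STREAMS_PER_CARD = 8       # real-time 1080p30 streams per T408 card
CARDS_PER_2U_SERVER = 24   # card count in the customer deployment

def server_capacity(cards, streams_per_card=STREAMS_PER_CARD):
    """Total simultaneous 1080p30 streams for a server with N cards."""
    return cards * streams_per_card

total = server_capacity(CARDS_PER_2U_SERVER)  # 192 streams per 2U server
```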
3.2 Composable Infrastructure over NVMe-oF
The figure above shows scale-out. As mentioned earlier, NETINT’s ASIC solution uses NVMe as the host-to-device interface, which enables us to use NVMe over Fabrics (NVMe-oF) technology to expand hardware-accelerator capacity outside the server. NVMe-oF is the storage industry’s solution for scale-out (expansion outside the server); it is reliable, low-latency and mature. Applying the same technology to ASIC hardware accelerators can save a great deal of unnecessary cost. In this example, a rack holds multiple compute nodes and multiple servers full of video-transcoding and DNN accelerators. The compute nodes connect to the accelerator nodes through Ethernet or Fibre Channel, so all hardware accelerators can be shared among all compute nodes. Although the accelerators are not physically inside the compute servers, this technology lets a pool of video transcoders and DNN engines be shared across a group of servers. Users therefore no longer need to worry about which card goes into which server, or how much transcoding or AI compute power to allocate to each one: these resources can be specified and allocated by configuration tools to form purpose-built servers, which is why it is called a composable infrastructure.
4.1 How to Reduce Latency
The fourth topic is latency, which covers many aspects. Low latency is a key factor in AR/VR and other interactive video applications. The maturity of 5G and the large-scale deployment of edge computing will effectively reduce network latency; on top of that, ASIC solutions can also greatly reduce the latency of video encoding and DNN inference. There are several general techniques for reducing video encoding latency, such as removing lookahead. If you have used x265, you may know that its medium preset enables lookahead, which introduces a delay of at least 20 frames; NETINT’s T408 video encoder can achieve x265-medium quality without lookahead, which greatly reduces that delay. In some applications, such as video conferencing, you may also want to disable B-frames to reduce latency further. Other techniques are more specific to ASIC solutions, and I will explain them in detail below, including:
- On-chip DNN inference together with video decoding/encoding on the same chip
- Reducing latency by coordinating rendering and encoding times
- Reducing latency through reserved-capacity encoding
- Reducing latency in virtualization through SR-IOV
- Better latency consistency with hardware encoders
- Achieving subframe latency through subframe (slice-based) encoding
4.2 Reduce Latency with Codec and DNN on the Same Chip
AI and video encoder integration. In NETINT’s next-generation chip, the video encoder/decoder and the AI engine are integrated on the same chip, which allows complex AI-assisted encoding to run entirely on-chip. Take real-time transcoding as an example: when the source stream comes in, it is decoded and the YUV data stays in the chip; the chip then performs YUV-to-RGB conversion and DNN inference, and sends the inference results back to the host. The host generates an ROI map, or a new set of encoding parameters, which is sent to the encoding engine as frame metadata. The decoded YUV buffers are fed directly into the encoder, so the encoder can apply the on-chip ROI results during encoding, and the compressed output is then transferred to the host. Because decoding, compression and AI inference all happen on the same chip, this greatly improves efficiency and significantly reduces latency.
4.3 Reduce Latency by Coordinating Encoding Times
An ASIC solution can also run at full capacity while keeping per-stream latency very low. Take eight 1080p30 streams as an example. Suppose 8 frames from 8 different streams arrive at the encoder at the same time, and there is only one encoding engine; it must encode frame by frame, with each frame taking 4 ms. In this case the encoding latency is at least 32 ms, because the last stream can only be encoded after all the others have finished. This is the uncoordinated case. If, however, encoding is coordinated so that the 8 frames from the 8 streams arrive staggered in time, the engine still encodes frame by frame at 4 ms per frame, but each frame starts encoding as soon as it arrives, so the encoding latency is 4 ms for every stream. Allowing for other system overheads, the overall latency is about 6 to 7 ms, which is very low for practical application scenarios. Therefore, if the application can control encoding times, achieving low latency while preserving capacity, this becomes a very economical and competitive solution.
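The 32 ms versus 4 ms comparison above can be reproduced with a small FIFO model of one engine. The model is ours, with the 4 ms per-frame time and 8 streams taken from the text:

```python
# Sketch: per-stream encode latency when 8 frames share one engine
# that takes 4 ms per frame, comparing simultaneous arrival
# (uncoordinated) with arrivals staggered by the application.

ENCODE_MS = 4
STREAMS = 8

def per_stream_latency(arrival_times_ms):
    """Serve frames FIFO on one engine; return latency per frame."""
    engine_free = 0
    latencies = []
    for arrival in sorted(arrival_times_ms):
        start = max(arrival, engine_free)   # wait if the engine is busy
        finish = start + ENCODE_MS
        engine_free = finish
        latencies.append(finish - arrival)
    return latencies

uncoordinated = per_stream_latency([0] * STREAMS)                 # all at once
coordinated = per_stream_latency([i * ENCODE_MS for i in range(STREAMS)])
```

In the uncoordinated case the worst stream waits behind seven others (32 ms); in the coordinated case every stream sees only its own 4 ms encode time.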
4.4 Reduce Latency through Reserved-Capacity Encoding
In many cases, however, we cannot control encoding times. With real-time streaming, the application usually cannot control frame arrival times, so it cannot dictate when encoding starts. With an ASIC solution, delay-sensitive applications that cannot control timing can instead use reserved-capacity encoding. The figure above shows test results obtained with NETINT’s T408 transcoder: we ran 1, 2, 5 or 8 1080p streams on the same card, each at the maximum speed it could reach. As you can see, the fewer the streams, the lower the latency, which is expected. The key point is that the T408 can process eight 1080p30 streams simultaneously, so running fewer streams leaves capacity in reserve. With an ASIC solution, the “luxury” of reserved-capacity encoding is affordable, because the density, or capacity, of the ASIC solution is 10 times that of a CPU or software solution.
4.5 Reduce Virtualization Latency through SR-IOV
Often you have to run applications in a virtualized environment. When encoding runs inside a virtual machine, the SR-IOV standard is usually needed to bypass the hypervisor and further reduce latency. In the example above, the NETINT card exposes multiple PCIe virtual functions, each attached to one virtual machine, so each VM sees a virtual NVMe device, that is, a virtual encoder. Applications running on the virtual CPUs encode through that NVMe device. When an encoding command or encoded data is sent to the NVMe device, it travels over the PCIe virtual function directly to the hardware, bypassing the hypervisor. In this way a virtual machine gets the same latency as the host.
4.6 Latency Consistency – ASIC vs. Software
Here I want to talk about latency consistency. Compared with software encoders, ASIC solutions provide much better latency consistency. With an ASIC, factors such as video content, codec and bitrate have little effect on the final encoding latency: regardless of the type of video, the encoding time is almost the same. By contrast, the latency of a software encoder varies greatly with the complexity of the video frames; AVC/HEVC/AV1 consume very different amounts of CPU, so encoding times, and therefore latencies, differ widely, and software codecs usually show different latencies at different bitrates. ASIC encoders generally do not have these problems. Latency consistency is as important as the absolute latency figure, and it is one of the main reasons why delay-sensitive applications should use ASICs.
4.7 Subframe Latency
The last latency-related topic I want to cover today is subframe latency, that is, latency smaller than the frame interval. The frame interval of a 1080p30 stream is 33 ms, so subframe latency means encoding a frame in less than 33 ms. NETINT’s T408 transcoder can handle eight 1080p30 streams, with the encoding engine encoding a frame every 4 ms, so achieving subframe latency at 1080p is not difficult and can even be done with full-frame encoding. Of course, as mentioned earlier, when encoding at maximum capacity you need to control encoder timing to avoid collisions, or keep capacity in reserve to reduce them. For a comparatively low resolution such as 1080p, subframe latency can be achieved with full-frame encoding. For higher resolutions such as 4K, the encoding time is typically comparable to the frame interval: encoding a 4K frame takes about 15 ms, and sometimes the raw video data transfer time is also comparable to the frame interval. In that case subframe encoding is required. Full-frame encoding means the encoder starts only after receiving a complete frame and then outputs the whole encoded frame. Subframe encoding means the encoder starts encoding as data arrives and starts outputting slices as they are produced. Subframe encoding lets data transfer run largely in parallel with encoding, achieving subframe latency.
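The transfer/encode overlap can be shown with a toy model. The 15 ms figures come from the text; the equal-slice, zero-overhead pipeline is an idealizing assumption of ours:

```python
# Sketch of why slice (subframe) encoding cuts 4K latency: with
# full-frame encoding, transfer and encode run back-to-back; with N
# equal slices they overlap, leaving only one slice's encode time
# after the last slice arrives. No per-slice overhead is modeled.

TRANSFER_MS = 15.0  # raw 4K frame transfer, comparable to the frame interval
ENCODE_MS = 15.0    # 4K frame encode time quoted above

def full_frame_latency():
    # Encoder starts only after the whole frame has arrived.
    return TRANSFER_MS + ENCODE_MS

def sliced_latency(n_slices):
    # Slice i starts encoding as soon as slice i has arrived; the
    # pipeline finishes one encode-slice after the final transfer-slice.
    return TRANSFER_MS + ENCODE_MS / n_slices
```

With 8 slices the end-to-end latency drops from 30 ms to under 17 ms in this model, which is the parallelism the last sentence above describes.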
5 Summary of Key Points
The theme of this talk was an ASIC solution for real-time, high-density, AI-assisted encoding, covering four main aspects – application cases, adaptability, scalability and latency. Here are three main takeaways that I hope give you something to think about.
First, the ASIC solution offers low TCO, low latency and high adaptability. It supports high-density, real-time AI-assisted encoding applications, making AI-assisted encoding more economical and practical. Second, using NVMe as the host-to-device interface maximizes interoperability, and encoding and AI compute capacity can be expanded outside the server through scale-out, or composable-infrastructure, technology. Third, ASIC solutions can deliver both low latency and latency consistency: by integrating the DNN engine with the encoder, coordinating encoding timing to avoid collisions, using reserved-capacity encoding to reduce collisions, using SR-IOV to reduce latency in virtualization, and using subframe encoding to achieve subframe latency at high resolutions.