WebRTC (Web Real-Time Communication) is an open-source technology that enables web browsers to hold real-time voice and video conversations. It has lowered the technical barrier to Internet audio and video communication and is gradually becoming a global standard.
Over the past decade, thanks to the contributions of many developers, the technology's application scenarios have grown ever broader and richer. Where will WebRTC go in the era of artificial intelligence? This article discusses directions for combining WebRTC with artificial intelligence and the innovative practice of Rongyun.
WebRTC + artificial intelligence: more vivid sound, higher-definition video
Artificial intelligence is used more and more widely in audio and video. In audio, it is mainly applied to noise suppression and echo cancellation; in video, it is more often used for virtual backgrounds, video super-resolution, and the like.
AI voice noise reduction
Speech noise reduction has a history of many years. Early systems relied on analog-circuit noise reduction; with the development of digital circuits, digital noise-reduction algorithms replaced the analog approach and greatly improved speech quality. These classical algorithms estimate the noise from statistical models and can remove steady-state noise fairly cleanly. For non-stationary noise, however, such as keyboard and desk taps or cars passing on the road, the classical algorithms are powerless.
AI speech denoising arose to fill this gap. Built on large corpora, carefully designed algorithms, and continuous training, it eliminates the tedious and ambiguous parameter-tuning process. AI denoising has a natural advantage with non-stationary noise: it learns the characteristics of such noise and can suppress it.
AI echo cancellation

An echo arises when sound emitted by the loudspeaker is attenuated, delayed, and then picked up again by the microphone. Before sending audio, the unwanted echo must be removed from the voice stream. WebRTC's linear filter uses frequency-domain block-adaptive processing but does not carefully handle multi-party calls; the nonlinear echo-suppression stage uses a Wiener filter.
By combining artificial intelligence, a well-designed neural network based on deep learning and speech separation can remove both linear and nonlinear echo directly.
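The adaptive-filtering principle behind the linear stage can be sketched in a few lines. The toy below is a time-domain NLMS canceller, not WebRTC's actual AEC3 (which operates on frequency-domain blocks); all names, signals, and parameters are illustrative:

```python
# Minimal sketch of linear echo cancellation with an NLMS adaptive filter:
# adapt the filter taps so the predicted echo matches what the microphone
# picked up, and output the residual (microphone minus estimated echo).

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Return the residual signal after subtracting the estimated echo."""
    w = [0.0] * taps                     # adaptive filter coefficients
    buf = [0.0] * taps                   # most recent far-end samples
    out = []
    for x, d in zip(far_end, mic):
        buf = [x] + buf[:-1]             # shift in the newest far-end sample
        y = sum(wi * xi for wi, xi in zip(w, buf))    # predicted echo
        e = d - y                        # residual after cancellation
        norm = sum(xi * xi for xi in buf) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, buf)]
        out.append(e)
    return out

# Toy demo: the "echo" is the far-end signal delayed by 2 samples and halved.
far = [1.0, 0.0, -1.0, 0.0] * 50
echo = [0.0, 0.0] + [0.5 * s for s in far[:-2]]
residual = nlms_echo_cancel(far, echo)
# After the filter adapts, the residual energy falls well below the echo energy.
```

Because the echo path here is linear and short, the NLMS filter models it almost exactly; the nonlinear residue that motivates the Wiener or neural stages is precisely what such a linear filter cannot remove.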
Virtual background

Virtual background relies on segmentation: the foreground in the picture is segmented out and the background is replaced with another image. The main application scenarios include live streaming, real-time communication, and interactive entertainment; the core techniques are image segmentation and video segmentation. A typical example is shown in Figure 1.
(In Figure 1, the black background in the upper image is replaced by a purple background in the lower image.)
Video super-resolution
Video super-resolution turns blurry video sharp: under limited bandwidth, a low-quality, low-bit-rate video is transmitted and then restored to high definition with image super-resolution techniques. This is of great significance in WebRTC. A typical image is shown in Figure 2: even with limited bandwidth, high-resolution video can still be obtained by transmitting a low-resolution bitstream.
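For contrast, the classical baseline that super-resolution improves on is fixed interpolation. Below is a minimal 1-D bilinear upscaler, for illustration only; a real super-resolution model replaces this fixed kernel with a trained network that restores detail interpolation cannot recover:

```python
# Classical bilinear upscaling of a 1-D "scanline" of pixel values.
# Learned super-resolution replaces this fixed interpolation kernel
# with a model trained to reconstruct high-frequency detail.

def upscale_1d(samples, factor):
    """Linearly interpolate a low-resolution line to factor x its length."""
    n = len(samples)
    out = []
    for i in range(n * factor):
        pos = i / factor                 # position in low-res coordinates
        lo = int(pos)
        hi = min(lo + 1, n - 1)          # clamp at the right edge
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out

line = [0.0, 10.0, 20.0]
up = upscale_1d(line, 2)                 # 6 samples interpolated from 3
```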
(Figure 2: original low-resolution image vs. processed high-resolution image)
Innovative practice of Rongyun
WebRTC is an open-source technology stack; to get the best results in real scenarios, substantial optimization is still required. Drawing on its own business characteristics, Rongyun modified the source code of WebRTC's audio processing and video compression to implement deep-learning-based noise suppression and more efficient video compression.
On top of WebRTC's built-in AEC3, ANS, and AGC, Rongyun added an AI speech noise-reduction module for pure-speech scenarios such as conferencing and teaching, and optimized the AEC3 algorithm to greatly improve sound quality in music scenarios.
AI speech noise reduction: the industry mostly adopts time-frequency masking, which combines traditional algorithms with deep neural networks. A deep neural network estimates the signal-to-noise ratio, from which a gain is computed for each frequency band; after conversion back to the time domain, a time-domain gain is computed once more. Applying these gains suppresses the noise as much as possible while preserving the speech.
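A heavily simplified sketch of the masking idea follows, with the neural network's SNR estimate replaced by a hand-made noise-floor estimate; everything here is illustrative, not Rongyun's model:

```python
# Per-bin spectral gains from an SNR estimate: G = SNR / (SNR + 1),
# clamped to a small floor so no band is silenced completely.
import cmath
import math
import random

def dft(x):
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]

def idft(X):
    N = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * n / N) for k in range(N)).real / N
            for n in range(N)]

def spectral_gain_denoise(noisy, noise_power, floor=0.05):
    """Apply Wiener-style per-bin gains derived from an SNR estimate."""
    X = dft(noisy)
    out = []
    for k, Xk in enumerate(X):
        p = abs(Xk) ** 2
        snr = max(p / noise_power[k] - 1.0, 0.0)   # a posteriori SNR estimate
        g = max(snr / (snr + 1.0), floor)          # gain in [floor, 1]
        out.append(g * Xk)
    return idft(out)

random.seed(0)
N = 64
clean = [math.sin(2 * math.pi * 4 * n / N) for n in range(N)]
noisy = [c + random.gauss(0, 0.3) for c in clean]
# Noise power per bin, estimated from a separate noise-only frame.
ref = dft([random.gauss(0, 0.3) for _ in range(N)])
noise_power = [abs(R) ** 2 + 1e-6 for R in ref]
denoised = spectral_gain_denoise(noisy, noise_power)
```

In an AI denoiser the gains (or the SNR they derive from) come from a trained network rather than a single noise-only reference frame, which is what lets it track non-stationary noise.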
Because deep-learning denoising models lean heavily on RNNs (recurrent neural networks), the model keeps believing a human voice is present for some time after speech actually ends; this hangover is too long to mask the residual noise, leaving a short burst of noise after speech ends. On top of the existing model, Rongyun added a prediction module that anticipates the end of speech from the falling amplitude envelope and SNR, eliminating the residual noise detectable at the end of speech.
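The prediction idea can be illustrated with a toy detector that declares end-of-speech once the envelope and SNR have both fallen for a few consecutive frames. This is a hypothetical sketch, not Rongyun's actual module; the frame counts and trace values are made up:

```python
# Predict end-of-speech from per-frame amplitude envelope and SNR:
# if both fall for `fall_frames` consecutive frames, speech is ending,
# and residual-noise gating can start early instead of waiting out the
# RNN's long hangover.

def predict_speech_end(envelope, snr, fall_frames=3):
    """Return the first frame index where speech is predicted to end,
    or None if no sustained decline is seen."""
    falling = 0
    for i in range(1, len(envelope)):
        if envelope[i] < envelope[i - 1] and snr[i] < snr[i - 1]:
            falling += 1
        else:
            falling = 0
        if falling >= fall_frames:
            return i
    return None

# Toy trace: speech ramps up, then decays over the last frames.
env = [0.1, 0.5, 0.9, 1.0, 0.8, 0.5, 0.2, 0.1]
snr_db = [3, 12, 20, 22, 15, 9, 4, 1]
end = predict_speech_end(env, snr_db)    # frames 4..6 are all falling
```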
(Figure 3: noise tail before optimization)
(Figure 4: no noise tail after optimization)
In the WebRTC source code, the video-coding part mainly uses the open-source OpenH264, VP8, and VP9, repackaged behind a unified interface. Rongyun modified the OpenH264 source code to implement background modeling, region-of-interest coding, and related tasks.
Background modeling: to keep video encoding real-time, it is essential to run background modeling on the GPU. Investigation showed that OpenCV's background-modeling algorithms support GPU acceleration. In practice, we convert the original YUV image from the camera or other capture device to RGB and upload the RGB image to the GPU; the background frame is then computed on the GPU and transferred back to the CPU. Finally, the background frame is added to OpenH264's long-term reference frame list to improve compression efficiency. The flow chart is shown in Figure 5.
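With plain Python lists standing in for images, the modeling step can be sketched as follows. The exponential running average below is one of the simplest background models, chosen here for clarity; the production pipeline uses OpenCV's GPU-accelerated models instead:

```python
# Running-average background model: blend each frame into the background
# estimate, then label pixels that differ strongly from it as foreground.
# A list of floats stands in for one image; the production code operates
# on GPU image buffers.

def update_background(background, frame, alpha=0.05):
    """Blend the new frame into the background estimate (per pixel)."""
    return [(1 - alpha) * b + alpha * f for b, f in zip(background, frame)]

def foreground_mask(background, frame, threshold=20.0):
    """Pixels far from the background estimate are foreground."""
    return [abs(f - b) > threshold for b, f in zip(background, frame)]

# Static scene of four "pixels"; after convergence an object appears at pixel 2.
bg = [100.0, 100.0, 100.0, 100.0]
for _ in range(50):                      # background converges on the scene
    bg = update_background(bg, [100.0, 102.0, 98.0, 101.0])
mask = foreground_mask(bg, [100.0, 102.0, 200.0, 101.0])
```

The converged background frame is exactly what gets handed to the encoder as a long-term reference: static regions then predict almost for free, which is where the compression gain comes from.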
(Figure 5: background modeling flow chart)
Region-of-interest extraction: the ROI coding stage uses a YOLOv4-tiny model for object detection and fuses the detections with the foreground region extracted by background modeling. Part of the code is shown in Figure 6 below. After the network is loaded, CUDA is selected for acceleration and the input image size is set to 416 × 416.
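The fusion step can be sketched as the union of detector boxes and the foreground mask over a small grid of cells (a stand-in for a frame's macroblock map); the box format and grid size here are assumptions for illustration:

```python
# Fuse object-detector boxes (e.g. from YOLOv4-tiny) with the foreground
# mask from background modeling: a cell is ROI if either source marks it.

def roi_map(width, height, boxes, fg_mask):
    """Mark a cell as ROI if a detector box covers it or it is foreground."""
    roi = [[False] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:         # boxes are inclusive cell ranges
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                roi[y][x] = True
    for y in range(height):
        for x in range(width):
            roi[y][x] = roi[y][x] or fg_mask[y][x]
    return roi

boxes = [(1, 1, 2, 2)]                   # one detection box
fg = [[False] * 4 for _ in range(4)]
fg[3][0] = True                          # a moving cell outside any box
roi = roi_map(4, 4, boxes, fg)
```

The encoder can then spend more bits on ROI cells and fewer on the rest, which is what preserves quality in the regions viewers actually watch.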
(Figure 6: partial program for loading the network onto the GPU)
Experimental effect of video coding in WebRTC: to verify the effect, we used the videoloop test program in WebRTC to test the modified OpenH264. Figure 7 shows background modeling on live camera footage at 1920 × 1080 resolution; Figure 8 shows the output. To guarantee real-time performance, WebRTC discards frames that, for various reasons, are not actually encoded within the set time. Figure 8 shows that our algorithm does not add much encoding time and does not cause the encoder to drop frames.
(Figure 7: current frame and background frame)
(Figure 8: actual effect of the encoder)
In conclusion, AI-based noise reduction can significantly improve the existing voice-call experience, although model predictions are not yet accurate enough and the computational cost is relatively high. As models are refined and datasets grow, AI noise reduction will deliver an even better call experience. In video, background modeling adds the background frame to the long-term reference frame list, effectively improving coding efficiency in surveillance-style scenes; object detection, background modeling, and an efficient bit-rate allocation scheme together improve the coding quality of regions of interest and noticeably improve the viewing experience in weak network environments.
As technology continues to evolve, we have entered an era of pervasive intelligence, with artificial intelligence deeply applied across all kinds of scenarios. In the audio and video industry, combining such advanced technology with WebRTC also has broad prospects. Service optimization never ends: Rongyun will continue to follow technology trends, actively explore innovative techniques, and distill them into underlying capabilities that developers can use with ease.