Interpretation and optimization of WebRTC audio NetEQ

Date: 2021-04-17

Introduction: NetEQ is one of the core audio/video technologies in WebRTC and has a significant effect on VoIP quality. This article introduces the background concepts, framework, principles, and optimization practice of WebRTC's audio NetEQ from a more macro perspective, in plain language.

Author: Liang Yi

Proofreader: Taiyi

Why explain NetEQ in plain language?

A quick search turns up many articles about WebRTC's audio NetEQ, and the ones below are very good learning materials and references. In particular, Wu Jiangrui's 2013 master's thesis from Xi'an University of Electronic Science and Technology, "Research on NetEQ technology in the WebRTC voice engine", covers the implementation details of NetEQ very thoroughly and has been cited in many articles.

Research on NetEQ technology in the WebRTC voice engine

The NetEQ algorithm

Audio-related NetEQ in WebRTC

Most of these articles analyze the details of NetEQ very thoroughly from a more "academic" or "algorithmic" point of view, so here I would like to share my personal understanding from a more macro perspective. Plain language is easier for everyone to digest: if you can explain an idea clearly without a single mathematical formula or line of code, so much the better. If anything in my understanding is off, please point it out.

Understanding packet loss, jitter, and optimization

In real-time audio and video communication, and especially with mobile work over 4G and the home-office and online-classroom scenarios over WiFi brought on by the pandemic, the network environment has become the most critical factor affecting audio and video quality. When the network is bad, even the best audio and video algorithms feel like a drop in the bucket. Poor network quality mainly shows up as delay, reordering, packet loss, and jitter; whoever handles and balances these problems best gets the better audio and video experience. The base delay of the network is determined by link selection and has to be addressed at the link-scheduling layer, and reordering is rare under most network conditions and rarely severe, so the discussion below focuses on packet loss and jitter.

Jitter refers to variation in the speed at which data arrives over the network: sometimes fast, sometimes slow. Packet loss means a packet goes missing for one reason or another; after a few retransmissions the data may still be recovered, but if retransmission fails or the recovered packet arrives too late to be useful, it becomes a "real" loss, and a packet loss concealment (PLC) algorithm has to generate artificial data to fill the gap. Along the time axis, packet loss and jitter are really the same phenomenon: jitter means a packet arrives late, a retransmitted packet arrives very late, and a "real loss" is a packet that never arrives at all. Our goal is to reduce the probability that a packet turns into a "real loss" as far as possible.

Optimization, intuitively, is about metrics: after a round of hard work, some indicator improves from X to Y. But evaluating optimization cannot stop at that dimension. Optimization is about "knowing yourself and knowing the other": "yourself" is what your product needs, "the other" is what the existing algorithm can do, and the best optimization combines the two. Whether an algorithm is simple or complex, as long as it matches your product's needs, it is the best algorithm; any cat that catches mice is a good cat.

NetEQ and related modules

The origin of NetEQ

The original NetEQ document provided by GIPS (a Chinese translation also exists) describes what NetEQ is and briefly summarizes its performance. NetEQ is essentially a jitter buffer for audio; its name stands for "network equalizer". Just as an audio equalizer equalizes sound, NetEQ equalizes network jitter. GIPS even registered the name as a trademark, which is why NetEQ(TM) appears in many places.

The official document contains one very important statement: "minimize the delay effect caused by the jitter buffer". In other words, one of NetEQ's design goals is to pursue very low delay. This point is crucial and provides an important clue for the optimizations discussed later.


Where NetEQ sits in the QoS pipeline of audio and video communication

For ordinary users, audio and video communication looks simple: as long as the network is connected, over WiFi or 4G, you place a call, see the other person, and hear their voice. The underlying implementation, however, is nowhere near as simple as it looks. The WebRTC open source engine alone contains about 200,000 source files; I don't know whether anyone has counted the lines of code, but it should be in the tens of millions, and who knows how much hair programmers have lost over it.

The figure below is an abstraction and simplification of the rather complex audio and video communication process. On the left is the sending (upstream) side: capture, encode, packetize, send. In the middle is network transmission. On the right is the receiving (downstream) side: receive, depacketize, decode, play. The figure highlights the major QoS (quality of service) functions and how they relate to the main media pipeline. As you can see, QoS functions are scattered across every stage of the communication process, so understanding the whole pipeline leads to a more complete understanding of QoS.

It may look as if more QoS functions sit on the sending side. That is because the purpose of QoS is to solve user-experience problems during communication, and the best place to solve a problem is at its source; any solution that can be applied at the source is the better one. But some problems cannot be solved at the source. In a multi-party conference, for example, one participant's bad downlink must not degrade everyone else's experience; one bad apple cannot be allowed to spoil the barrel, and the source must not be polluted on its account. Therefore the receiving side also needs QoS functions, and the indispensable one there is the jitter buffer, for both video and audio. This article focuses on the audio jitter buffer, NetEQ.

[Figure: QoS functions across the audio and video communication pipeline]

How NetEQ works and how it relates to neighboring modules

[Figure: NetEQ workflow and its related modules]

The figure above abstracts the workflow of NetEQ and the modules around it. It covers four parts: NetEQ input, NetEQ output, the audio retransmission (NACK) request module, and the audio/video synchronization module. Why include the NACK module and the audio/video synchronization module in an analysis of NetEQ? Because both of them depend directly on NetEQ and interact with it. The dotted lines in the figure mark the information each module needs from other modules and where that information comes from. Now let's walk through the whole process.

1. The input side of NetEQ

When the underlying socket receives a UDP packet, parsing from UDP to RTP is triggered. After the SSRC and payload type are matched, the corresponding audio receive channel is found, and the data is fed into NetEQ's receiving module through the InsertPacketInternal interface.

The received audio RTP packet may carry RED redundancy. It is unpacked according to RFC 2198 (or some private packing format) to recover the original packets, and duplicate originals are discarded. Each original RTP packet is then inserted into the packet buffer according to a certain algorithm. After that, the sequence number of every received original packet is reported to the NACK retransmission request module through UpdateLastReceivedPacket. The NACK module, triggered either per received RTP packet or by a timer, calls GetNackList to generate the retransmission request, which is sent back to the sender as a NACK RTCP packet.
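As a concrete illustration of this flow, here is a minimal standalone C++ sketch (the types and names are invented for illustration and are not the WebRTC implementation) of inserting de-duplicated RTP packets into a sequence-number-ordered buffer and deriving a NACK candidate list from the gaps; sequence-number wrap-around is ignored for brevity:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, heavily simplified packet record.
struct RtpPacket {
  uint16_t seq;
  uint32_t timestamp;
  std::vector<uint8_t> payload;
};

class SimplePacketBuffer {
 public:
  // Inserts a packet; returns false if it is a duplicate and was ignored.
  bool Insert(RtpPacket packet) {
    return buffer_.emplace(packet.seq, std::move(packet)).second;
  }

  // Sequence numbers missing between the oldest and newest held packet;
  // these are the candidates for a NACK retransmission request.
  std::vector<uint16_t> MissingSequenceNumbers() const {
    std::vector<uint16_t> missing;
    if (buffer_.size() < 2) return missing;
    auto it = buffer_.begin();
    uint16_t prev = it->first;
    for (++it; it != buffer_.end(); ++it) {
      for (uint16_t s = prev + 1; s != it->first; ++s) missing.push_back(s);
      prev = it->first;
    }
    return missing;
  }

 private:
  std::map<uint16_t, RtpPacket> buffer_;  // ordered by sequence number
};
```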

At the same time, each original packet gets a unique receive time on the time axis, from which the inter-arrival gap between packets can be computed. The IAT (inter-arrival time) used for jitter estimation in NetEQ is that receive-time difference divided by the audio duration of one packet. For example, if the gap between two packets is 120 ms and each packet carries 20 ms of audio, the IAT of the current packet is 120 / 20 = 6. The IAT of each packet is then processed by the core network jitter estimation module (DelayManager) to produce the final target level, and that is where the input side of NetEQ ends.
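As a tiny standalone illustration of that arithmetic (not WebRTC code):

```cpp
#include <cstdio>

// Inter-arrival time (IAT) in units of packets: the gap between the arrival
// times of two consecutive packets divided by the audio duration per packet.
int PacketIat(int arrival_gap_ms, int packet_duration_ms) {
  return arrival_gap_ms / packet_duration_ms;
}

int main() {
  std::printf("IAT = %d\n", PacketIat(120, 20));  // prints "IAT = 6"
  return 0;
}
```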

2. The output side of NetEQ

Output is driven periodically by the playback thread of the audio hardware: every 10 ms, the playout device pulls 10 ms of data from NetEQ through the GetAudioInternal interface.

Inside GetAudioInternal, the first step is to decide how to handle the current data request. This is the job of the operation decision module, which produces the final operation type based on the previous and current data and operation state. NetEQ defines several operation types: normal, accelerate, decelerate (preemptive expand), merge, expand (packet loss concealment), and comfort noise/mute; their meanings are explained in detail later.

With the decided operation type, an RTP packet is taken from the packet buffer on the input side and handed to the abstract decoder, which calls down through DecodeLoop layer by layer to the real decoder and places the decoded PCM audio in the DecodedBuffer. Then the chosen operation is performed. NetEQ implements a different digital signal processing (DSP) algorithm for each operation; every operation except "normal" applies a second round of DSP to the data in the DecodedBuffer. The result is placed first in the algorithm buffer and then inserted into the sync buffer. The sync buffer is a cleverly designed circular buffer that holds both the decoded data that has already been played and the data that has not yet been played; data just inserted from the algorithm buffer goes at the end of the sync buffer, as shown in the figure above. Finally, the earliest unplayed decoded data is taken from the sync buffer and handed to the external mixing module; after mixing it is sent to the audio hardware for playback.
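The sync buffer described above can be pictured as a circular buffer that keeps both already-played and not-yet-played samples: processed data from the algorithm buffer is appended at the write position, and playout consumes the oldest unplayed samples. A minimal standalone sketch of this idea (illustrative only, not WebRTC's SyncBuffer):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy circular sync buffer: keeps both already-played and not-yet-played
// samples; new decoded data overwrites the oldest data when the buffer wraps.
class SyncBufferSketch {
 public:
  explicit SyncBufferSketch(size_t capacity) : data_(capacity, 0) {}

  // Append processed samples from the algorithm buffer at the write position.
  void Append(const std::vector<int16_t>& samples) {
    for (int16_t s : samples) {
      data_[write_pos_] = s;
      write_pos_ = (write_pos_ + 1) % data_.size();
      unplayed_ = std::min(unplayed_ + 1, data_.size());
    }
  }

  // Pop the oldest unplayed samples (e.g. 10 ms worth) for the mixer.
  std::vector<int16_t> ReadNext(size_t count) {
    count = std::min(count, unplayed_);
    std::vector<int16_t> out(count);
    size_t read_pos = (write_pos_ + data_.size() - unplayed_) % data_.size();
    for (size_t i = 0; i < count; ++i) {
      out[i] = data_[(read_pos + i) % data_.size()];
    }
    unplayed_ -= count;  // these samples are now "played" but stay in the buffer
    return out;
  }

  size_t UnplayedSamples() const { return unplayed_; }

 private:
  std::vector<int16_t> data_;
  size_t write_pos_ = 0;
  size_t unplayed_ = 0;
};
```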

In addition, as the figure shows, the decision module uses the buffer level filter (BufferLevelFilter) to compute a filtered current audio buffer level from the buffered duration in the packet buffer plus the unplayed duration in the sync buffer. The audio/video synchronization module takes the current audio buffer level and the current video buffer level, together with the timestamps of the latest RTP packets and of the audio and video SR packets, computes how far audio and video are out of sync, and finally sets a minimum target level in NetEQ through SetMinimumPlayoutDelay to steer the target level and achieve audio/video synchronization.
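A minimal sketch of the kind of smoothing the buffer level filter performs (illustrative; the real BufferLevelFilter works in fixed point and chooses its coefficient differently): the instantaneous level is the buffered duration in the packet buffer plus the unplayed duration in the sync buffer, passed through a first-order low-pass filter. The 0.75 coefficient below is an assumed value for illustration.

```cpp
// Illustrative first-order smoothing of the current buffered audio duration.
class BufferLevelFilterSketch {
 public:
  // packet_buffer_ms: audio duration of packets waiting to be decoded.
  // sync_buffer_ms:   decoded but not yet played audio in the sync buffer.
  // Returns the filtered buffer level in milliseconds.
  int Update(int packet_buffer_ms, int sync_buffer_ms) {
    const double level = packet_buffer_ms + sync_buffer_ms;
    filtered_ = alpha_ * filtered_ + (1.0 - alpha_) * level;
    return static_cast<int>(filtered_ + 0.5);
  }

 private:
  const double alpha_ = 0.75;  // smoothing coefficient (assumed, for illustration)
  double filtered_ = 0.0;
};
```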

NetEQ internal modules

NetEQ jitter estimation module (DelayManager)

1. Stationary jitter estimation

The IAT of each packet is accumulated into the IAT statistics histogram with a certain weight (the weight is determined by the forgetting factor calculation in the next part). The 0.95 point of the cumulative distribution, scanned from left to right, is then located, and the IAT value at that point is taken as the final jitter estimate. For example, in the figure below, suppose the resulting target level is 9; that means the target buffered duration will be 180 ms (assuming 20 ms per packet).

[Figure: IAT histogram and the 95th-percentile target level]
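A minimal standalone sketch of reading the target level out of the histogram (illustrative; the real DelayManager stores fixed-point probabilities): accumulate the bucket probabilities from left to right and take the first IAT whose cumulative mass reaches 0.95.

```cpp
#include <vector>

// Given a normalized IAT histogram (probability per IAT bucket), return the
// smallest IAT whose cumulative probability reaches `quantile`. With 20 ms
// packets, a result of 9 corresponds to a 180 ms target buffer level.
int TargetLevelFromHistogram(const std::vector<double>& iat_histogram,
                             double quantile = 0.95) {
  double cumulative = 0.0;
  for (size_t iat = 0; iat < iat_histogram.size(); ++iat) {
    cumulative += iat_histogram[iat];
    if (cumulative >= quantile) return static_cast<int>(iat);
  }
  return static_cast<int>(iat_histogram.size()) - 1;  // fall back to the last bucket
}
```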

2. Calculation of the smoothing (forgetting) factor

The forgetting factor controls how much weight the current packet's IAT gets when it is accumulated into the histogram above. The calculation uses a fairly involved formula; after analysis, its essence is the yellow curve below. At the start, when the forgetting factor is small, more weight is given to the current packet's IAT; as time goes on the forgetting factor grows and each new IAT counts for less. This process is a bit over-engineered; from an engineering point of view it could be simplified to a straight line or similar, because in testing the factor essentially converges to its target value of 0.9993 within about 5 seconds. In fact, this 0.9993 is the single most important coefficient affecting jitter estimation, and many optimizations simply modify it to tune the sensitivity of the estimate.

[Figure: forgetting factor curve, converging to about 0.9993]
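A minimal standalone sketch of the exponential-forgetting update (illustrative; the real code works in fixed point and ramps the factor toward roughly 0.9993 over the first few seconds): every bucket is scaled by the forgetting factor f and the observed IAT's bucket receives the remaining weight 1 - f, so the larger f becomes, the less a single new packet moves the estimate.

```cpp
#include <vector>

// Update a normalized IAT histogram with one observation using forgetting
// factor f in [0, 1). Larger f => slower adaptation to new jitter behavior.
void UpdateIatHistogram(std::vector<double>& histogram, int observed_iat,
                        double forgetting_factor /* ramps up to ~0.9993 */) {
  for (double& p : histogram) p *= forgetting_factor;       // decay old mass
  if (observed_iat >= 0 && observed_iat < static_cast<int>(histogram.size())) {
    histogram[observed_iat] += 1.0 - forgetting_factor;     // add new mass
  }
}
```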

3. Peak jitter estimation

The DelayManager contains a peak detector to identify jitter peaks. If peaks are detected frequently, estimation switches to peak mode and takes the largest peak as the final estimate. Once entered, this state lasts for 20 seconds, regardless of whether the jitter has already returned to normal. The figure below illustrates this.

[Figure: peak detector and peak-mode jitter estimation]
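A rough standalone sketch of this peak-mode behavior (the thresholds and window sizes below are invented for illustration and are not those of the real peak detector): when large IAT spikes recur, switch to peak mode for 20 seconds and report the largest recent peak instead of the normal estimate.

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>

// Toy peak-mode tracker. Real code would also expire old peaks by age.
class PeakModeSketch {
 public:
  // Returns the IAT estimate to use: the normal estimate, or the largest
  // recent peak while peak mode is active.
  int Update(int iat, int normal_estimate, int64_t now_ms) {
    if (iat > 2 * normal_estimate) {                   // treat this as a peak
      peaks_.push_back({now_ms, iat});
      if (peaks_.size() > 2) peaks_.pop_front();
      if (peaks_.size() == 2) peak_mode_until_ms_ = now_ms + 20000;  // hold 20 s
    }
    if (now_ms < peak_mode_until_ms_ && !peaks_.empty()) {
      int max_peak = 0;
      for (const auto& p : peaks_) max_peak = std::max(max_peak, p.iat);
      return max_peak;
    }
    return normal_estimate;
  }

 private:
  struct Peak { int64_t time_ms; int iat; };
  std::deque<Peak> peaks_;
  int64_t peak_mode_until_ms_ = 0;
};
```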

NetEQ decision logic

The simplified basic decision logic of the decision module is shown in the figure below; it is fairly straightforward and needs little explanation. Here is what the operation types mean (a simplified decision sketch follows the list):

ComfortNoise: generates comfort noise, which sounds more pleasant than a simple silence packet;
Expand (PLC): packet loss concealment, the most important algorithm module, used to generate data when a "real loss" leaves nothing to play;
Merge: if the previous output was artificial data from Expand, a merge algorithm blends it with the next normal packet so the transition sounds smoother;
Accelerate: time-compressed playback that changes the speed without changing the pitch;
PreemptiveExpand: time-stretched (slowed-down) playback that changes the speed without changing the pitch;
Normal: ordinary decode and play, with no artificial data introduced;
[Figure: simplified decision logic of the NetEQ decision module]
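The simplified decision sketch mentioned above (standalone and illustrative; the real decision logic also consults the previous operation, expand counters, timestamps, and DTX state, and the thresholds here are invented):

```cpp
enum class Operation {
  kNormal, kAccelerate, kPreemptiveExpand, kMerge, kExpand, kComfortNoise
};

// Toy decision: pick an operation from the buffer state and target level.
Operation Decide(bool next_packet_available, bool last_was_expand,
                 int buffer_level_ms, int target_level_ms) {
  if (!next_packet_available) return Operation::kExpand;          // conceal the gap
  if (last_was_expand) return Operation::kMerge;                   // smooth the transition
  if (buffer_level_ms > target_level_ms * 3 / 2) return Operation::kAccelerate;
  if (buffer_level_ms < target_level_ms / 2) return Operation::kPreemptiveExpand;
  return Operation::kNormal;
}
```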

Optimization points of NetEQ-related modules

NetEQ anti-jitter optimization

1. Because NetEQ was designed for "very low delay", it is not a perfect match for scenarios that do not need very low delay, such as video conferencing, online classrooms, and live streaming; for these, the sensitivity, mainly of the jitter estimation module, needs to be re-tuned;

2. For live-streaming scenarios the tolerable delay can exceed a second, so the stream-mode function needs to be enabled (it appears to have been removed in newer versions) and its parameters adapted;

3. In the service of the very-low-latency goal, the original packet buffer is quite small, which easily causes flushes; it needs to be enlarged according to business needs;

4. Some services actively detect the network state based on their own business scenario and then directly set a minimum target level to control NetEQ's water level in a simple, blunt way (see the sketch after this list).
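A minimal standalone illustration of point 4 (illustrative only; WebRTC exposes this externally through minimum/maximum-delay style settings whose exact names vary between versions): an application-chosen minimum delay simply acts as a floor under the jitter-estimated target level.

```cpp
#include <algorithm>

// Illustrative: clamp the jitter-estimated target level with the minimum
// delay requested by the application (and a maximum delay, if configured).
int EffectiveTargetLevelMs(int estimated_target_ms, int minimum_delay_ms,
                           int maximum_delay_ms /* <= 0 means "no cap" */) {
  int level = std::max(estimated_target_ms, minimum_delay_ms);
  if (maximum_delay_ms > 0) level = std::min(level, maximum_delay_ms);
  return level;
}
```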


NetEQ anti-packet-loss optimization

1. In the original WebRTC, NACK retransmission requests are triggered per received packet, which degrades retransmission performance on weak networks; switching to timer-based triggering solves this (see the sketch after this list);

2. When packets are lost there will be retransmissions, but if the buffer is too small the retransmitted packets arrive too late and are discarded anyway. To improve retransmission efficiency, an ARQ delay-reservation function was added, which significantly reduces the expand (stretch) rate;

3. At the algorithm level, the packet loss concealment (PLC) algorithm itself can be optimized and the existing NetEQ stretching mechanism adjusted, to improve the listening experience;

4. After Opus DTX is enabled, the audio buffer grows larger under packet loss, and the DTX-related processing logic needs to be optimized separately.
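For point 1, a minimal standalone sketch of what "trigger by timer instead of per packet" means (illustrative; the callbacks are placeholders, not WebRTC APIs): a periodic task keeps asking the NACK module for its current missing-sequence list and sends it as a NACK RTCP packet, so retransmission requests continue to go out even when incoming packets stall.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Illustrative periodic NACK sender: `get_nack_list` returns the sequence
// numbers currently considered missing; `send_nack` ships them in RTCP.
void RunNackTimer(const std::function<std::vector<uint16_t>()>& get_nack_list,
                  const std::function<void(const std::vector<uint16_t>&)>& send_nack,
                  std::chrono::milliseconds interval,
                  const std::atomic<bool>& keep_running) {
  while (keep_running.load()) {
    std::vector<uint16_t> missing = get_nack_list();
    if (!missing.empty()) send_nack(missing);
    std::this_thread::sleep_for(interval);  // e.g. every 20-50 ms
  }
}
```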

The following is a before/after comparison with the ARQ delay-reservation function enabled: the average expand (stretch) rate drops by 50%, at the cost of a corresponding increase in delay:

[Figure: expand rate before and after enabling ARQ delay reservation]

Audio and video synchronization optimization


1. The original WebRTC peer-to-peer audio/video synchronization algorithm is fine, but current architectures usually include a media forwarding server (SFU), and due to various limitations or bugs the server's SR packet generation may not be entirely correct, which breaks synchronization. To avoid depending on SR packet generation, the calculation in the audio/video synchronization module is changed to use the buffer level as the main reference, that is, to keep the buffered durations of audio and video roughly equal at the receiving end. A comparison of the optimization effect is shown below:

[Figure: audio/video sync before and after switching to buffer-level-based synchronization]
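A minimal standalone sketch of the "use the buffer level as the main reference" idea (illustrative only, not the actual optimization code): estimate the offset from the difference in buffered duration at the receiver and add extra minimum delay to whichever stream is ahead.

```cpp
#include <algorithm>

struct SyncAdjustment {
  int extra_audio_delay_ms;  // extra minimum delay for the audio jitter buffer
  int extra_video_delay_ms;  // extra minimum delay for the video jitter buffer
};

// Illustrative: if video has more data buffered (it will play out later),
// delay audio to match, and vice versa. Real code smooths and rate-limits this.
SyncAdjustment ComputeSyncFromBufferLevels(int audio_buffer_ms,
                                           int video_buffer_ms,
                                           int max_extra_delay_ms = 500) {
  SyncAdjustment adj{0, 0};
  const int diff = video_buffer_ms - audio_buffer_ms;
  if (diff > 0) {
    adj.extra_audio_delay_ms = std::min(diff, max_extra_delay_ms);
  } else {
    adj.extra_video_delay_ms = std::min(-diff, max_extra_delay_ms);
  }
  return adj;
}
```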


2. There is another audio/video synchronization problem that is not actually caused by the synchronization mechanism but by device performance: the device cannot decode and render video in time, video data piles up, and audio and video drift apart. This case can be identified by comparing the trend of the out-of-sync duration with the trend of video decode-plus-render time; when this is the cause, the two match very closely, as shown below:

[Figure: out-of-sync duration versus video decode and render time]

Summary

NetEQ is the core function of the audio receiving side and touches essentially every aspect of it, so traces of it can be found in many audio and video communication implementations. Riding the tailwind of WebRTC's nearly ten years of open source, NetEQ has become very popular. I hope this plain-language article helps you understand it better.

A closing word from the author: requirements never stop, and neither does optimization!

