Distributed real-time network construction driven by QOE: the evolution of Agora SD-RTN

Time: 2022-8-8

Editor's note: Recently, the Global Software Case Study Summit was held in Beijing. The Global Software Case Study Summit ("TOP100Summit" for short) publishes an annual list of case studies from the technology industry. Each year it selects the 100 cases most worth learning from, aiming to reveal the practices and thinking behind outstanding R&D teams and distill them for readers. The best way to learn is to work through the cases and reflect on their long-tail value.

In the "Architecture Evolution/Engineering Practice/Open Source Landing" track of the TOP100Summit, Liu Yong, Chief Architect of Agora, gave a talk titled "QOE-Driven Distributed Real-Time Network Construction: The Evolution of Agora SD-RTN". He focused on how the SD-RTN and Agora RTC systems achieved system upgrades, capacity expansion, and continuous improvement in the quality of the real-time interactive experience with no downtime and no major online failures, while supporting billions of minutes of customer communication every day.

Liu Yong received his bachelor's degree from Zhejiang University in 2001 and his doctorate from Tsinghua University in 2013. He joined Agora in 2014, where he works on the design and development of the overall system. He proposed, designed and led the development of the SD-RTN system and built the Agora RTC system on top of it. He is a loyal fan of the C++ language and a perpetual recruit in the Internet field.

Abstract

With the development of network and video technologies, interactive applications built on real-time audio and video place new demands on long-distance network transmission, namely low latency and high reliability. Unlike traditional network architectures and SDN technology, Agora SD-RTN builds a low-latency, high-reliability network around the idea of an overlay network. Together with AUT, a UDP-based multiplexed transport protocol, it provides the underlying network guarantee for real-time services such as RTC and ensures the global end-to-end experience of Agora RTE users.

• RTC services and network transmission are layered and decoupled. Unlike traditional RTC protocols such as RTP, which tightly couple audio and video media streams to the network transport protocol, this yields a scalable, flexible and specialized system architecture for guaranteeing quality of service.

• The ideas of overlay networks and SDN are used to guarantee intra-network transmission quality in a real-time network cloud deployed globally at scale.

• A layer-4 multiplexed real-time transport protocol, AUT, is proposed. Based on this protocol, the quality of experience of Agora RTC/RTE over the last mile is improved, and an abstract, flexible control mechanism is provided for upper-layer applications.

SD-RTN and Agora RTC Service Architecture

Evolution of RTC System Architecture

Real-time communication (RTC), as the name implies, is communication, so two or more parties must first initiate a connection or handshake. In the common two-party scenario, the two parties can set up a P2P connection through a signaling service and establish a data channel directly.

Evolution of RTC System Architecture (P2P/Mesh Architecture)

Advantage: The server load is light, because the server only handles signaling logic.

Disadvantages:

• Connectivity depends on the network environments of both parties. The NAT traversal success rate is low, and availability cannot be guaranteed.

• The two parties are in different network environments, and communication quality depends on the quality of the Internet path between them. Quality is unstable across regions and autonomous systems, so a stable QoE guarantee cannot be provided.

• In multi-party conferences, P2P channels must be established between every pair of participants, which significantly reduces availability. Each participant must also upload its stream once per peer, which wastes upstream bandwidth and causes performance problems.

• The system architecture scales poorly.

Conclusion: For these reasons, the P2P architecture is not suitable as the underlying architecture for a worldwide provider of basic RTC services. It can only be applied in small, limited, specific areas, or as a useful supplement that improves coverage quality in specific scenarios.
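The scaling argument against a full mesh can be made concrete with a little arithmetic. The following is a minimal sketch, not part of the talk: the function names are illustrative, and `session_availability` assumes that every pairwise connection succeeds independently with the same probability, which real networks only approximate.

```python
def mesh_links(n):
    """Number of pairwise P2P channels in a full mesh of n participants."""
    return n * (n - 1) // 2


def mesh_upstream_copies(n):
    """Copies of its own stream that each participant uploads in a full mesh."""
    return max(n - 1, 0)


def session_availability(n, pair_success):
    """Probability that every pairwise channel comes up.

    Assumption (for illustration only): channels succeed independently,
    each with probability pair_success.
    """
    return pair_success ** mesh_links(n)


if __name__ == "__main__":
    for n in (2, 4, 10):
        print(n, mesh_links(n), mesh_upstream_copies(n),
              session_availability(n, 0.9))
```

Under these assumptions, a 10-party mesh needs 45 channels, and even a 90% per-pair success rate leaves less than a 1% chance that all of them are established.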

Evolution of RTC System Architecture (MCU Architecture)

The MCU (Multipoint Control Unit) solution consists of a server and multiple terminals arranged in a star topology. Each terminal sends its audio and video stream to the server. The server mixes the streams of all terminals in the same room into a single mixed audio and video stream and sends it to each terminal.

Disadvantages:

• High resource consumption on the stream-mixing server

• Large delay

• Poor scalability

• Poor flexibility
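To see where the server cost comes from, here is a minimal sketch of only the audio half of the mixing step, assuming already-decoded 16-bit PCM frames of equal length. The name `mix_frames` is illustrative; a real MCU would also have to decode every incoming stream before this step, re-encode the result afterwards, and composite video.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def mix_frames(frames):
    """Sum equal-length 16-bit PCM frames sample by sample, clipping to int16."""
    if not frames:
        return []
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same number of samples")
    mixed = []
    for samples in zip(*frames):
        total = sum(samples)
        # Clip instead of wrapping around, so loud rooms distort rather than click.
        mixed.append(max(INT16_MIN, min(INT16_MAX, total)))
    return mixed
```

The work grows with the number of terminals in the room and must be repeated for every frame, which is consistent with the resource and delay drawbacks listed above.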

Evolution of RTC System Architecture (SFU Architecture)

The SFU (Selective Forwarding Unit) solution also consists of a server and multiple terminals, but the SFU does not mix audio and video streams. After receiving a stream published by one terminal, it forwards that stream directly to the other terminals in the room according to their subscriptions.
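The forwarding behavior can be sketched in a few lines. This is a toy in-memory model under stated assumptions, not Agora's implementation: the class `Sfu`, its method names and the string stream identifiers are all hypothetical, and "sending" is modeled as appending to a per-subscriber list.

```python
from collections import defaultdict


class Sfu:
    """Toy selective forwarding unit: forwards packets, never mixes them."""

    def __init__(self):
        self._subscribers = defaultdict(set)   # stream id -> subscriber ids
        self.delivered = defaultdict(list)     # subscriber id -> packets received

    def subscribe(self, subscriber, stream_id):
        self._subscribers[stream_id].add(subscriber)

    def unsubscribe(self, subscriber, stream_id):
        self._subscribers[stream_id].discard(subscriber)

    def on_packet(self, publisher, stream_id, packet):
        """Forward the packet unchanged to every subscriber except the publisher.

        Returns the number of copies sent.
        """
        targets = [s for s in self._subscribers[stream_id] if s != publisher]
        for subscriber in targets:
            self.delivered[subscriber].append((stream_id, packet))
        return len(targets)


if __name__ == "__main__":
    sfu = Sfu()
    sfu.subscribe("bob", "alice/video")
    sfu.subscribe("carol", "alice/video")
    print(sfu.on_packet("alice", "alice/video", b"frame-1"))
```

Because the server only copies packets, each terminal uploads its stream once and chooses what to receive.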

This publish-subscribe model enables flexible multi-party interaction. The simplified single-server model described above still has two problems: 1) in multi-party scenarios, if one server serves parties that are geographically far apart, coverage quality cannot be guaranteed; 2) the number of concurrent participants a single server can carry in a session is limited, so the system scales poorly.

Therefore, the industry naturally turned to distributed multi-server collaboration, with each user connecting to the nearest server. This approach has:

• Good scalability

• Good coverage

However, it poses a challenge for the network quality between servers.
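A small sketch shows why nearest access moves the problem onto the inter-server link. Everything here is an illustrative assumption: the function names, the server names, the idea of choosing by lowest measured RTT, and the estimate of one-way access delay as half the RTT.

```python
def pick_access_server(rtt_ms):
    """Return the name of the edge server with the lowest measured RTT (ms)."""
    if not rtt_ms:
        raise ValueError("no candidate servers")
    return min(rtt_ms, key=rtt_ms.get)


def end_to_end_ms(rtt_a, rtt_b, inter_server_ms):
    """Rough one-way delay between two users who each use nearest access.

    Each access leg is estimated as half its RTT; the inter-server leg is
    added only when the two users land on different servers.
    """
    server_a = pick_access_server(rtt_a)
    server_b = pick_access_server(rtt_b)
    if server_a == server_b:
        inter = 0.0
    else:
        inter = inter_server_ms[frozenset((server_a, server_b))]
    return rtt_a[server_a] / 2 + inter + rtt_b[server_b] / 2


if __name__ == "__main__":
    user_a = {"tokyo": 20.0, "frankfurt": 260.0}
    user_b = {"tokyo": 250.0, "frankfurt": 16.0}
    links = {frozenset(("tokyo", "frankfurt")): 120.0}
    print(end_to_end_ms(user_a, user_b, links))
```

In this made-up example both access legs are short, so nearly all of the end-to-end delay comes from the server-to-server path.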

The distributed extension of the SFU architecture has the following characteristics.

Advantages:

• Users connect to a nearby server, which avoids the last-mile degradation caused by long-distance access.

• Because RTC communication tends to be regional, sessions within a region converge on that region's edge service center, which helps reduce communication delay.

Disadvantages:

• Traffic between different edge clusters crosses networks, which increases costs.

• Data exchange between different edge clusters challenges network quality, for example across regions, countries, and operators.

With a distributed edge computing architecture, optimizing the network quality between edge computing centers minimizes the impact of public Internet instability on the user experience. Based on this idea, Agora proposed the concept of SD-RTN.

Proposition of Agora SD-RTN

Design goals:

Unlike protocols such as RTP/RTCP, and unlike WebRTC and the server architectures derived from it, our design aims to reduce the complexity caused by system coupling through horizontal layering. With a network transport protocol and service architecture that are independent of the audio and video media protocols, the RTC service can focus on its own business logic, while engineers working on network algorithms, protocol design and network hardware architecture can apply their respective expertise to meet the QoS requirements of upper-layer services. The design goals are:

• Protocol decoupling

• Service decoupling

• Ability to fully and flexibly utilize existing network infrastructure, such as public Internet, dedicated lines, etc.

• Security

Agora SD-RTN abstracts the network transmission requirements of the distributed RTC architecture (low latency, high reliability). It adopts a layered protocol design and decouples the RTC service from network transmission, so that protocols, modules and services are all layered and decoupled:

• SD-RTN presents a layer-3 overlay network interface to the upper layers.

• SD-RTN is a UDP-based distributed network system that runs over heterogeneous networks and does not depend on specific hardware or software. It can perform real-time routing and traffic scheduling for different QoS requirements.
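To illustrate what "a layer-3 overlay interface carried over UDP" can look like, here is a minimal encapsulation sketch. The header layout (version, QoS class, 32-bit source and destination overlay addresses) is entirely hypothetical and is not SD-RTN's wire format; the talk does not describe one. The resulting bytes would be sent as the payload of an ordinary UDP datagram.

```python
import struct

# Hypothetical overlay header: version, QoS class, source id, destination id.
_HEADER = struct.Struct("!BBII")


def encapsulate(src, dst, qos, payload, version=1):
    """Prefix the payload with the (assumed) overlay header."""
    return _HEADER.pack(version, qos, src, dst) + payload


def decapsulate(datagram):
    """Split a datagram into (version, qos, src, dst, payload)."""
    if len(datagram) < _HEADER.size:
        raise ValueError("datagram shorter than overlay header")
    version, qos, src, dst = _HEADER.unpack_from(datagram)
    return version, qos, src, dst, datagram[_HEADER.size:]


if __name__ == "__main__":
    wire = encapsulate(src=0x0A000001, dst=0x0A000002, qos=3, payload=b"audio")
    print(decapsulate(wire))
```

The point of the sketch is the separation of concerns: upper layers address overlay nodes and state a QoS class, while the overlay decides which underlying paths carry the datagram.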

SD-RTN and Agora Hierarchical Service Architecture

The figure below shows Agora's service architecture. SD-RTN and the layer-4 transport protocol AUT form the network foundation of the Agora real-time cloud.

Agora SD-RTN Architecture

The SD-RTN system likewise consists of a control plane and a forwarding plane:

• Control plane:

1. Link probing and capacity assessment system

2. Edge node information collection system

3. Routing and scheduling system

4. Management system

• Forwarding plane

The link probing and capacity assessment system and the routing and scheduling system are described in detail below.

1. Link probing and capacity assessment system: Following a scheduling policy, it regularly measures network quality between server clusters, analyzes the network model (especially quality under lossy conditions), and summarizes and evaluates the results.

2. Routing and scheduling system: This system plays a role similar to an SDN controller. It is a set of real-time, intelligent, parallel computing services responsible for route planning and load balancing; it computes and distributes the routes of data flows in the network.
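The interplay of the two systems can be sketched as: turn probe results into a per-link cost, then compute a route over those costs. The sketch below is a deliberately simplified stand-in. The cost formula, the `loss_penalty_ms` parameter and the cluster names are assumptions, and plain Dijkstra shortest-path is used in place of the real scheduler, which also has to handle load balancing and capacity.

```python
import heapq


def link_cost(latencies_ms, sent, loss_penalty_ms=1000.0):
    """Mean latency of the probes that arrived, plus a penalty scaled by loss rate."""
    received = len(latencies_ms)
    if sent <= 0 or received == 0:
        return float("inf")
    loss_rate = 1.0 - received / sent
    return sum(latencies_ms) / received + loss_rate * loss_penalty_ms


def best_route(graph, src, dst):
    """Dijkstra over graph[node] = {neighbor: cost}. Returns (cost, path)."""
    heap = [(0.0, src, [src])]
    settled = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in settled and weight != float("inf"):
                heapq.heappush(heap, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []


if __name__ == "__main__":
    graph = {
        "shanghai": {"frankfurt": 180.0, "singapore": 40.0},
        "singapore": {"frankfurt": 110.0},
        "frankfurt": {},
    }
    print(best_route(graph, "shanghai", "frankfurt"))
```

In the example graph the two-hop path through the intermediate cluster costs less than the direct link, which is the kind of decision an overlay can make and the plain public Internet path cannot.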

Agora SD-RTN and SDN

In the design and continuous evolution of the RTN system, some ideas were borrowed from existing network design practice, especially the SDN architecture.

Similarities

The design ideas of SD-RTN and SDN are broadly similar, mainly in the following respects:

• The complex control-plane logic of routers is separated from the forwarding-plane logic.

• The control plane's routing policy is configured or computed by a centralized control center (the SDN controller).

Differences

• The SDN forwarding plane relies on flow tables to control forwarding. As the network grows, querying, maintaining and updating flow tables becomes complicated, especially with multiple hops. SD-RTN uses technologies such as SR (segment routing) to simplify the forwarding logic.

• SD-RTN evaluates the state of the underlying network links and, depending on the required QoS level, applies technologies such as FEC or multi-path redundancy to achieve real-time, reliable delivery at the packet level.

• SD-RTN is an overlay network design that does not depend on specific hardware or software, and it can use both the public Internet and private lines for link calculation and traffic distribution.
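The contrast with flow tables can be illustrated with a toy source-routing model, assuming that "SR" here means segment-routing-style forwarding in which the packet itself carries its remaining hops. The packet layout and function names are hypothetical and say nothing about SD-RTN's actual format.

```python
def forward(node, packet):
    """Process a source-routed packet at `node`.

    The node needs no per-flow state: it checks that it is the current
    segment, pops itself, and returns the next hop (None if delivered here).
    """
    segments = packet["segments"]
    if not segments or segments[0] != node:
        raise ValueError("packet arrived at a node that is not its current segment")
    segments.pop(0)
    return segments[0] if segments else None


def deliver(packet):
    """Walk a packet through the overlay and return the nodes it visited."""
    visited = []
    node = packet["segments"][0]
    while node is not None:
        visited.append(node)
        node = forward(node, packet)
    return visited


if __name__ == "__main__":
    pkt = {"segments": ["shanghai", "singapore", "frankfurt"], "payload": b"x"}
    print(deliver(pkt))
```

Changing a route then only requires the controller to hand a different segment list to the ingress node; intermediate nodes have nothing to update.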

Evolution of Agora SD-RTN

The evolution is divided into three stages:

• Initial stage:

SD-RTN and the RTC services were tightly coupled. Apart from the link evaluation and routing algorithms, the protocol itself and the services were built into the RTC access and relay servers.

• More mature stage:

The RTC and RTN protocols were layered and modularized, and most services were decoupled, with RTN providing dedicated services for Agora RTC.

• Current and future direction:

RTN as a service, providing service interfaces for Agora cloud services (in progress).

Benefits of Agora SD-RTN

• Development efficiency

The introduction of SD-RTN and AUT (see below) means that upper-layer services no longer need to care about the quality of the underlying network transmission and can focus on their own business logic. This reduces system complexity, simplifies the business model, and shortens the iteration cycle of RTC business development.

• Transmission quality

For different QoS requirements, SD-RTN provides correspondingly different transmission strategies and works with the AUT protocol to meet the required quality.

SD-RTN focuses on and optimizes two technical indicators of network quality:

• Latency

• Packet delivery success rate

Agora SD-RTN Quality Index (Latency)

Packet delivery success rate is further subdivided. For common Agora RTC requirements, RTN tracks and continuously optimizes the following indicators:

1. The service time during which at least 99.9% of packets arrive within a 2 s delay. This indicator targets the latency requirements of viewers of general live broadcast services. When it is met, most live viewers get smooth playback, other factors aside. (This is already broadly better than CDN-based live broadcast solutions.)

2. The service time during which at least 99.9% of packets arrive within an 800 ms delay. This indicator targets the quality requirements of viewers in Agora's ultra-fast live broadcast scenario.

3. The service time during which at least 99.9% of packets arrive within a 200 ms delay. This indicator targets the needs of ordinary RTC communication. When it is met, the two parties can converse smoothly, with no perceptible delay or stuttering.
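One plausible way to compute such an indicator is sketched below. It is an interpretation rather than Agora's definition: it assumes that service time is cut into fixed measurement windows, that each window records one delay per packet (with `None` for a lost packet), and that the indicator is the share of windows whose on-time arrival rate reaches the target. The function names are illustrative.

```python
def arrival_rate(delays_ms, deadline_ms):
    """Fraction of packets in one window that arrived within the deadline.

    Lost packets are represented as None and count as late.
    """
    if not delays_ms:
        return 0.0
    on_time = sum(1 for d in delays_ms if d is not None and d <= deadline_ms)
    return on_time / len(delays_ms)


def qualified_time_share(windows, deadline_ms, target=0.999):
    """Share of measurement windows whose arrival rate meets the target."""
    if not windows:
        return 0.0
    qualified = sum(1 for w in windows if arrival_rate(w, deadline_ms) >= target)
    return qualified / len(windows)


if __name__ == "__main__":
    good = [50.0] * 1000
    degraded = [50.0] * 998 + [250.0, None]
    for deadline in (200, 800, 2000):
        print(deadline, qualified_time_share([good, degraded], deadline))
```

The same measurements can be evaluated against the 2 s, 800 ms and 200 ms deadlines, which is why the three indicators form a natural ladder from live viewing to conversational RTC.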

Agora SD-RTN quality index (jitter 200ms arrival rate)

Agora SD-RTN Challenges and Issues

• Scalability (horizontal scaling in cooperation with the RTC system)

• Link quality assessment and capacity assessment under lossy diverging networks

• Fast traffic scheduling algorithm (NP problem)

• Security (IPsec)

Agora RTC over RTN Architecture

The Agora RTC system has the following main services at the transport layer:

• AP/LBS service

• RTC SFU service

1. Native

2. WebRTC

3. RTMP

• Channel synchronization service and subscription service

• Capacity negotiation and arbitration services

Agora Universal UDP-based Transport Protocol (AUT)

An RTC scenario requires:

• A reliable network channel for sending and receiving control messages

• Multiple real-time channels, each as reliable as possible, for sending and receiving multiple data streams (audio, video, etc.)

• Priority management across these streams when bandwidth is limited

• In To B scenarios, the ability for customers to decide stream priorities and transmission degradation strategies independently and flexibly

Challenges to Network Transmission in RTC Scenarios

For example, when bandwidth is limited, control commands often need to be transmitted with high priority. Scenario-specific applications add their own rules, such as guaranteeing audio transmission first, or prioritizing the teacher's streams over the students'. Consider a conventional implementation in which control commands travel over a TCP channel and each audio or video stream uses its own RTP/RTCP channel. When these streams compete:

• If the RTP/RTCP channels adopt a TCP-friendly congestion control strategy, the audio and video streams are treated like any other data stream on the network, and high priority cannot be guaranteed.

• If an aggressive congestion control strategy is used, the RTC control command channel may be blocked.

• With multiple RTP/RTCP channels, each running its own congestion control, it is unclear how to adjust them so that high-priority flows are guaranteed.

In this scenario, we need a multiplexed transport channel that manages flow priorities and coordinates all flows under a single congestion control module. The following issues arise:

• The biggest problem with logical multiplexing over a TCP channel, as in RTMP, is that TCP's in-order delivery causes head-of-line blocking.

• For web scenarios, QUIC implements multiplexing, priority management and avoidance of head-of-line blocking at the protocol level, but it does not support real-time unreliable data streams.

• Users of real-time streams need more control over the underlying layer than users of reliable data streams. Designing and implementing the layering between media and network transport is not trivial.

• Flexibility conflicts with the customization requirements of major customers: if flexibility is not designed in well, many of their needs turn into custom requirements that have to be met with large amounts of hard-coded logic.
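The idea of one multiplexed channel that schedules all streams under a single congestion budget can be sketched as a strict-priority queue. This is an illustrative model only: the class `PriorityMux`, the numeric priorities and the byte budget (standing in for whatever a shared congestion controller currently allows) are assumptions, not AUT's scheduler.

```python
import heapq
from itertools import count


class PriorityMux:
    """Strict-priority multiplexer over one shared sending budget."""

    def __init__(self):
        self._heap = []
        self._order = count()  # keeps FIFO order within the same priority

    def enqueue(self, stream, priority, size):
        """Queue a packet; a lower priority number is sent first."""
        heapq.heappush(self._heap, (priority, next(self._order), stream, size))

    def drain(self, budget_bytes):
        """Send packets in priority order until the head no longer fits.

        Stopping at the head (rather than skipping to a smaller, lower-priority
        packet) keeps low-priority traffic from overtaking high-priority traffic.
        """
        sent = []
        while self._heap and self._heap[0][3] <= budget_bytes:
            _, _, stream, size = heapq.heappop(self._heap)
            budget_bytes -= size
            sent.append(stream)
        return sent


if __name__ == "__main__":
    mux = PriorityMux()
    mux.enqueue("video", 2, 1200)
    mux.enqueue("control", 0, 100)
    mux.enqueue("audio", 1, 200)
    mux.enqueue("video", 2, 1200)
    print(mux.drain(1600))
```

With a single budget, control and audio are always served before video when bandwidth is tight, which separate per-stream channels with independent congestion controllers cannot guarantee.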

Design goals

• Versatility: a single protocol design serves different scenarios, covering not only RTC but also reliable data channels

• Native stream support in the transport protocol:

1. Multiplexing and flexible priority management

2. Custom Stream Meta information piggybacked in the stream, which lets users make stream management decisions

• A flexible congestion control module interface that can be extended to implement different congestion control algorithms

• A low-level network interface that can run over SD-RTN, UDP sockets, any virtual network, and so on
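A pluggable congestion control interface of the kind listed above could look roughly like the following. The interface name, its three methods and the simple AIMD (additive increase, multiplicative decrease) algorithm plugged into it are all assumptions chosen for illustration; the talk does not specify AUT's actual interface or algorithms.

```python
from abc import ABC, abstractmethod


class CongestionController(ABC):
    """Hypothetical interface a transport could call into."""

    @abstractmethod
    def on_ack(self, acked_bytes):
        """Called when acked_bytes have been acknowledged."""

    @abstractmethod
    def on_loss(self):
        """Called when a loss event is detected."""

    @abstractmethod
    def window_bytes(self):
        """Bytes the sender may currently have in flight."""


class Aimd(CongestionController):
    """Textbook AIMD: grow about one MSS per window of ACKs, halve on loss."""

    def __init__(self, mss=1200, initial_packets=10):
        self._mss = mss
        self._cwnd = mss * initial_packets

    def on_ack(self, acked_bytes):
        self._cwnd += self._mss * acked_bytes // self._cwnd

    def on_loss(self):
        self._cwnd = max(self._cwnd // 2, 2 * self._mss)

    def window_bytes(self):
        return self._cwnd


if __name__ == "__main__":
    cc = Aimd()
    cc.on_ack(12000)
    cc.on_loss()
    print(cc.window_bytes())
```

Because the transport only depends on the abstract interface, a different algorithm can be swapped in per scenario without touching the stream or multiplexing code.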

The AUT protocol design draws on the QUIC protocol design but has been substantially redesigned:

• Some version management and negotiation mechanisms were removed

• Information mechanisms such as Stream Option/Meta were added for real-time streaming scenarios

• The interface and implementation of real-time streams were designed

In addition to support for real-time streaming, the AUT protocol also includes:

• Encryption

• Connection migration

• FEC support

• MultiPath (experimental)
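As a reminder of what FEC buys a real-time stream, here is the simplest possible scheme: one XOR parity packet per group, which lets the receiver rebuild a single lost packet without waiting for a retransmission. The talk does not say which FEC codes AUT uses, so XOR parity is a stand-in chosen for brevity, and the function names are illustrative.

```python
def xor_parity(packets):
    """Return the XOR parity packet over a group of equal-length packets."""
    if not packets:
        raise ValueError("need at least one packet")
    length = len(packets[0])
    if any(len(p) != length for p in packets):
        raise ValueError("packets must have equal length")
    parity = bytearray(length)
    for packet in packets:
        for i, byte in enumerate(packet):
            parity[i] ^= byte
    return bytes(parity)


def recover(received, parity):
    """Rebuild the single missing packet (marked None) from the parity packet."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one loss per group")
    survivors = [p for p in received if p is not None]
    return xor_parity(survivors + [parity])


if __name__ == "__main__":
    group = [b"abcd", b"efgh", b"ijkl"]
    parity = xor_parity(group)
    print(recover([b"abcd", None, b"ijkl"], parity))
```

The cost is one extra packet per group in bandwidth; the benefit is that a single loss is repaired with no extra round trip, which matters most at the tighter latency targets.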

Application of AUT in RTC

The AUT protocol has been technically validated as the underlying transport in the Agora RTC SDK Nasa2 (current version 3.0.0.18), where it provides high-quality transmission guarantees and flexible control mechanisms for upper-layer applications. Its responsive control and feedback mechanisms make further optimization possible in the upper-level engine or application.

Application of AUT in RTM

AUT over AUT: Point-to-point network acceleration (RTNS service)

Conclusions

• System architecture should evolve step by step through gray-scale (gradual) rollout and iteration. It should fit customer needs and production scale, adopting the most reasonable and cost-effective solution at each stage while keeping the technical evolution continuous and consistent.

• System design should thoroughly investigate existing systems and design implementations, and track the latest technical developments in academia and industry.

• In the To B industry, we must do our best to meet customer and product needs while avoiding turning technical products into one-off projects or outsourced work. Together with the product team, extract the pain points and difficulties that customers share and factor them into the overall evolution of the system.

• During system iteration, it is necessary to examine the system's:

1. Iterative linearity: control the complexity of each iteration of the online system.

2. Observability: use a data-driven approach to observe and verify that system improvements are effective.

ROI analysis

• The SD-RTN and Agora RTC systems have run for more than 6 years with no downtime and no major failures, while achieving gradual system upgrades, capacity expansion and continuous improvement in the quality of the real-time interactive experience. They support billions of minutes of customer communication every day.

• As SD-RTN and AUT have been delivered step by step, they have provided a fast, consistent foundation for building Agora's in-cloud services. AUT's protocol capabilities also let customers meet their customized needs flexibly and on a self-service basis through the RTC SDK.

The above content is based on a talk given by Mr. Liu Yong.