On October 27, 2019, UPYUN and the Apache APISIX community co-hosted the Hangzhou stop of the Open Talk series on API gateways and high-performance service best practices. Li Yuan, a technical expert at Ant Financial, gave a talk titled "The Evolution of Ant Financial's Network Proxy". Technical experts from Alibaba, Ant Financial, Apache APISIX, PolarisTech, UPYUN, and other companies were invited to share practical experience with gateways and high-performance services.
Li Yuan is a technical expert at Ant Financial, a core member of Ant Financial's open source project SOFAMosn, and a Tengine expert.
Here is the full text:
From network hardware appliances to a self-developed platform, and from traditional service governance to service mesh, this talk describes how Ant Financial's network proxy evolved step by step, at the access layer and on the road to service mesh, to support millions of payments per second and Spring Festival Gala traffic at the ten-million level.
What is a network proxy
Broadly speaking, network proxies handle two kinds of traffic: north-south and east-west. North-south traffic flows from the external Internet into the data center through the unified access layer. East-west traffic flows between VMs inside the data center; calls between microservices are one example.
If we trace north-south traffic, a request usually passes through layer 4 load balancing, layer 7 load balancing, and so on; these are usually called access layer network proxies. When the data center initiates access to the public network, the traffic usually passes through a NAT gateway for network address translation, which is also a kind of network proxy. Inside the data center, the network proxy seems much less visible. As SOA developed, mature service communication frameworks emerged, such as Ant Financial's SOFARPC, Alibaba Group's HSF, and Google's gRPC. The proxy function was folded into these frameworks, so the architecture appeared to be proxyless. With the rise of microservices and the service mesh architecture, however, the east-west network proxy has reappeared as an independent component.
- The representative traditional layer 4 load balancer is IPVS. In the early years, Baidu, Alibaba, and other companies customized IPVS heavily, which supported the rapid growth of their early business.
- Every major company has its own representative layer 7 network proxy: Google's GFE, Baidu's BFE, and Tencent's TGW. Within the Alibaba economy there are several, driven by different scenarios, such as Mobile Taobao's Aserver, the group's unified web access layer Tengine, and of course Ant Financial's Spanner.
- As the service mesh concept and its technology have matured, many network proxies have appeared in the mesh sidecar role, including Ant Financial's SOFAMosn, Envoy (used by the Istio community), and Linkerd, whose proxy is written in Rust. Of course, there is no essential difference between a network proxy in the service mesh scenario and one in the access layer. As cloud native adoption deepens, the two will eventually converge and take the same form.
Next, I will describe how Ant Financial's network proxy has evolved over the past ten years from these three perspectives.
Ten years of Ant Financial network proxies
As early as 2016, traffic already posed a major challenge to the network. For example, the "Xiu Yi Xiu" campaign generated 21 billion requests in one minute. All of this was real traffic entering from outside, and the numbers are even larger now. Traffic and business scenarios at this scale are a great challenge for the system.
Ant Financial has always focused on five areas: stability and high availability; capacity; efficient access acceleration; flexibility; and security, compliance, and attack protection. Around these we have continuously upgraded our core capabilities, architecture, and operations capabilities. The physical bandwidth of the underlying network infrastructure has grown from 1G to 10G, 25G, and 100G. The Alibaba backbone WAN has expanded beyond Hangzhou to national and global scale. Through forward-looking architecture research and development, and through greater technological self-reliance, we support the growth of the business.
Over the ten years from 2010 to 2019, the access layer network proxy went through three eras and four stages of development.
Before 2010, Ant Financial's network proxy was in the era of commercial appliances: F5 BIG-IP provided layer 4 access and hardware load balancing, and was gradually replaced by LVS; NetScaler handled SSL offload, and so on.
Independent research and development
In 2011, because commercial equipment could not meet all of our business logic needs, we set out on the road of self-development and designed an integrated hardware and software solution.
Self-developed layer 4 network proxy
After 2011, Ant Financial began developing its own layer 4 network proxy, mainly based on the kernel's Netfilter, as LVS is. In 2014, we rebuilt it entirely on DPDK, which greatly increased network throughput. From 2018 onward, more network technologies have become available, such as eBPF and programmable switching chips.
Ant's layer 7 network proxy: Spanner
Spanner is Ant Financial's layer 7 network proxy. It provides a white-box solution for SSL offloading and network access, and it carries all of Ant's business traffic, including access from the app, the web, and merchant terminals.
The figure above shows how layer 7 access has evolved from 2010 to the present. Each stage has its own characteristics. Spanner is a fork of nginx and shares a lot with Tengine, so it has many Tengine features.
The figure above shows the access architecture of Ant's layer 7 network proxy. Users enter through Ant Financial's network entry points and reach LVS and Spanner over multiple protocols. As the unified layer 7 gateway, Spanner distributes requests to the applications behind it. Spanner carries a lot of business logic and protocol support, such as TLS 1.3, QUIC, HTTP, and Ant's own protocols. All of Ant's business, including the pages of the Alipay wallet, comes in through this entry. Spanner currently supports millions of simultaneously online users, tens of millions of QPS, and push to millions of users.
Financial-grade traffic scheduling: three sites, five data centers
In 2013, Ant Financial launched its own logical data center (LDC) architecture, which has evolved to support Ant Financial's current financial-grade disaster recovery architecture of three sites and five data centers.
Spanner plays an important role in traffic scheduling. At first, when traffic was low, one data center with one LDC could carry all of it. As the user base grew, we moved to dual-active LDCs in the same city and then to multi-active LDCs. Today, elastic hybrid LDCs can scale up and down quickly and expand data center capacity rapidly. This elastic architecture places high demands on Spanner's traffic scheduling.
To meet the traffic scheduling requirements of the financial-grade three-site, five-data-center architecture, Spanner provides different functions for different business scenarios. For example, when a user's first request arrives, Spanner routes it to a random zone, and the zone assigns the user to the unit the user belongs to, for example the Hangzhou unit, i.e. the Hangzhou data center. On the user's next visit, the request goes directly to the Hangzhou data center. This is similar to Tengine's session sticky feature, but session sticky works at the level of a single machine, whereas Spanner's scheduling works at the level of the data center.
At present, Spanner supports the following scenarios:
- Random zone routing within a data center
- Cookie zone forwarding
- Blue-green release
- Disaster recovery
- Flexible scheduling
- Stress-test traffic control
- Grayscale (canary) traffic scheduling
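The first-request and repeat-visit routing described above can be sketched roughly as follows. This is a minimal illustration, not Spanner's implementation: the zone names, the user-to-unit table, and the cookie name `zone` are all hypothetical.

```python
import random

# Hypothetical zones and user-to-unit mapping; names are illustrative only.
ZONES = ["hangzhou", "shanghai", "shenzhen"]
USER_HOME_ZONE = {"alice": "hangzhou", "bob": "shanghai"}
ZONE_COOKIE = "zone"  # hypothetical cookie name


def route(user, cookies, rng=random):
    """Return (zone that receives this request, cookies to send back)."""
    if cookies.get(ZONE_COOKIE) in ZONES:
        # Repeat visit: the cookie pins the user to the home data center.
        return cookies[ZONE_COOKIE], dict(cookies)
    # First request: land on a random zone, which looks up the user's unit.
    entry_zone = rng.choice(ZONES)
    home_zone = USER_HOME_ZONE.get(user, entry_zone)
    new_cookies = dict(cookies)
    new_cookies[ZONE_COOKIE] = home_zone  # later requests go straight home
    return entry_zone, new_cookies
```

The key point is that the sticky decision is made per data center (zone), not per back-end machine.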
SSL / TLS practice
As the first BU in the group to adopt HTTPS, Ant Financial has been building a site-wide encryption system around security, compliance, and performance.
Integrated software and hardware solution
In 2013, Ant introduced two kinds of acceleration cards, Cavium Nitrox and Intel QAT. In 2014, we rolled out HTTPS site-wide together with hardware acceleration.
The main change needed for hardware acceleration was asynchronous support, because QAT is used synchronously out of the box. In synchronous mode, nginx submits the handshake cryptographic operation to the QAT card and then blocks waiting for the result; during that time the nginx worker cannot do anything else. In asynchronous mode, nginx submits the operation to the card and goes straight back to processing other business logic; when the card finishes and calls back, nginx resumes the handshake. In Spanner, we made the nginx SSL handshake asynchronous and modified OpenSSL to work with Cavium's SSL acceleration card. At that time, the overall solution tripled the performance of RSA-2048 SSL handshakes compared with doing them on the CPU.
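The difference between blocking on the card and resuming on a callback can be illustrated with a small event-loop sketch. It uses Python's `asyncio` purely as a stand-in for the nginx event loop; `accelerator_sign` is a hypothetical placeholder for the acceleration card and performs no real cryptography.

```python
import asyncio

log = []


async def accelerator_sign(data: bytes) -> bytes:
    """Hypothetical stand-in for the card: completes later, then calls back."""
    await asyncio.sleep(0.01)  # simulated time spent on the card
    return b"signed:" + data


async def handshake(conn_id: str) -> bytes:
    log.append(f"{conn_id}: submit to card")
    # Awaiting yields control to the event loop instead of blocking the worker.
    signature = await accelerator_sign(conn_id.encode())
    log.append(f"{conn_id}: handshake resumed")
    return signature


async def main() -> bytes:
    task = asyncio.create_task(handshake("conn-1"))
    await asyncio.sleep(0)  # let the handshake start and submit its job
    log.append("other request handled")  # worker is free while the card works
    return await task


result = asyncio.run(main())
```

In the synchronous model the "other request handled" step could only run after the handshake had finished; here it runs while the simulated card is still busy.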
Protocol implementation changes: MTLS
At the protocol layer, we developed the MTLS protocol. In 2015, TLS 1.3 was still a draft, not an official release. So, based on the TLS 1.3 draft, we implemented it as an extension of TLS 1.2.
This let us enjoy the benefits of TLS 1.3 ahead of time. We made further optimizations on this basis and distilled them into Ant Financial's lightweight MTLS encryption library.
Continuous upgrade of security and compliance capabilities
As a financial company, Ant Financial is required to support the Chinese national cryptographic algorithms. For example, MYbank has already deployed support for them. Because the national cryptographic standard is based on TLS 1.1, it has significant performance problems compared with TLS 1.3. As the only team in Asia with an OpenSSL committer, we have been working with the relevant national bodies to get national cryptography supported in TLS 1.3, and we believe this support will arrive in the near future.
AntTLS is a library developed in-house at Ant on top of OpenSSL. It adds features such as a trust mechanism for multiple hardware cards and assembly-level optimizations, including for the national cryptographic algorithms. In addition, because national cryptography in hardware must use an encryption machine, our hardware acceleration cards have passed the relevant compliance verification.
Mobile wireless campaign
As the group's "ALL IN wireless" strategy advanced and the Alipay app grew in both users and scenario complexity, we worked with the Alipay network team in 2015 to launch a mobile technology overhaul project called "all in one". Before introducing the specific technical changes, let's look at the problems of the mobile Internet.
- End to end wireless network complexity;
- Operator network black box;
- The length of wireless terminal is linear;
For the Alipay app specifically, common scenarios include online payment, offline payment, promotions, and payment while traveling overseas. Slow response, slow payment, and delayed push messages are all pain points of the mobile experience. We therefore ran a mobile wireless campaign built around being fast, stable, and efficient. Here we focus on the technical changes made to Spanner.
At that time, our two biggest campaigns were "Xiu Yi Xiu" and "Ji Wufu".
Xiu Yi Xiu, for example, placed very high demands on the network. Hundreds of millions of users were waiting on the screen shown in the figure above, tapping continuously, while numbers were displayed in real time. These taps generated QPS at the hundred-million level for our back end. Every real tap came in through the layer 7 network proxy, which was a huge test of system stability.
On wireless mobile networks we made many optimizations. With optimal scheduling, network detection, dynamic timeouts, and so on, we get better results when the network is healthy. We also added short-connection compensation for error cases: on a bad network, when a request on the long connection fails, the client sends it again over a short connection. Together with flexible connection establishment and automatic retry, this lets us complete requests more reliably in weak network environments.
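The compensation and retry behaviour can be sketched as a small client-side helper. This is only an illustration under assumptions: the function name, the two transport callables, and the retry count are hypothetical, and real clients would also apply the dynamic timeouts mentioned above.

```python
def send_with_compensation(payload, long_conn_send, short_conn_send, retries=2):
    """Try the long connection first; compensate with short connections."""
    try:
        return ("long", long_conn_send(payload))
    except ConnectionError:
        pass  # long connection failed: fall through to compensation
    last_error = None
    for _ in range(retries):
        try:
            return ("short", short_conn_send(payload))
        except ConnectionError as exc:
            last_error = exc  # automatic retry on a fresh short connection
    raise ConnectionError("all attempts failed") from last_error
```

The caller passes in whatever transport functions it has; the helper only encodes the order of attempts.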
Everything connected, cloud native
Since 2018, we have been expanding protocol support, adding MQTT, which has become popular recently, and QUIC, the basis of the next-generation HTTP/3.
On the architecture side, we have built many access nodes overseas, so that users abroad can reach the Alipay wallet faster through a nearby node.
We are also integrating with the cloud native ecosystem, for example with a data plane foundation similar to UDPA, and with containerization and co-location of the access layer.
Introduction to the QUIC solution
QUIC is only a protocol; it does not come with a ready-made way to deploy it in production. We introduced QUIC LB to solve QUIC connection migration, for example keeping the connection alive while the client switches from 4G to WiFi. The app initiates a QUIC request. The first hop is actually LVS + AliGuard, a security component that scrubs traffic to protect against DDoS attacks. After scrubbing, the traffic is passed on to QUIC LB, whose job is to mark the connection so that multiple requests from the same user reach the same back-end machine.
In other words, QUIC LB itself is stateless: on the user's first request it embeds a mark and returns it, and on later requests it uses that mark to forward to the right real server. This is implemented with the nginx stream module.
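One way to picture stateless marking is to embed the chosen server in the connection ID, so that any later packet can be routed without a lookup table and without relying on the client's address. The sketch below is an assumption made for illustration: the one-byte server ID layout and the back-end names are hypothetical and are not the actual QUIC LB format.

```python
import os

# Hypothetical real servers behind the load balancer.
BACKENDS = {1: "spanner-1", 2: "spanner-2", 3: "spanner-3"}


def issue_connection_id(server_id: int) -> bytes:
    """First request: mark the connection by embedding the server ID."""
    return bytes([server_id]) + os.urandom(7)


def pick_backend(connection_id: bytes) -> str:
    """Later packets: read the mark back; no per-connection state is kept."""
    return BACKENDS[connection_id[0]]


def forward(packet: dict) -> str:
    # packet["src"] changes when the client moves from 4G to WiFi,
    # but packet["cid"] stays the same, so the back end stays the same.
    return pick_backend(packet["cid"])
```

Because routing depends only on the connection ID, a change of source address during migration does not move the connection to another machine.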
We have made many optimizations for QUIC, as shown in the figure above. Some of this work resulted in three major patents.
QUIC is mainly used on overseas links, for example to carry back-to-origin traffic from abroad, because QUIC performs better in weak network conditions.
In 2013, Ant operated with many domains and many VIPs, so a single nginx had thousands of listening ports, which caused a serious performance problem. With accept_mutex off, a thundering herd is inevitable and burns a lot of CPU sys time. With accept_mutex on, there is a different bottleneck: every time a worker acquires the lock, it has to add all the listening sockets to epoll, and with thousands of listening sockets that is a large cost.
Several techniques now address this problem, such as reuseport. Tengine supported reuseport early on, but in the first Tengine version that supported it, connections were reset on reload; the reset problem was not solved until the final official version. There is also the EPOLLEXCLUSIVE flag, which solves the thundering herd inside the kernel, but it places high requirements on the kernel version.
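As a small illustration of what reuseport means at the socket level, the sketch below opens two listening sockets on the same port, each of which could belong to a different worker. It assumes a platform that defines `SO_REUSEPORT` (for example Linux) and uses the loopback address only; it does not show how nginx or Tengine wire this into their workers.

```python
import socket


def listen_reuseport(port: int = 0) -> socket.socket:
    """Open a listening socket with SO_REUSEPORT set before bind."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(128)
    return sock


first = listen_reuseport()            # kernel picks a free port
port = first.getsockname()[1]
second = listen_reuseport(port)       # a second listener on the SAME port
same_port = second.getsockname()[1] == port

first.close()
second.close()
```

With one listening socket per worker, the kernel distributes incoming connections among them, so workers no longer contend on a shared accept lock.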
In 2013, our solution, shown in the figure above, was to nest the epoll of the listening fds. We add the fds of all listening sockets to a dedicated epoll instance. An epoll instance has an fd of its own, and we add that fd to nginx's main epoll. This way, each time a worker acquires the lock it only has to add one fd, and whenever that fd is readable there must be events inside it. This greatly reduced system load under high concurrency. The same trick of nesting epoll fds is useful more generally: instead of putting everything into nginx's own epoll, you can add a separate epoll that holds one class of events, in this case listen events.
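The nesting idea can be demonstrated with Python's `select.epoll`, which is Linux-only. This is a sketch of the mechanism, not of the nginx patch: three loopback listeners stand in for the thousands of VIP ports, and the variable names are made up for the example.

```python
import select
import socket

# A few listening sockets standing in for thousands of VIP ports.
listeners = []
for _ in range(3):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    s.listen(16)
    s.setblocking(False)
    listeners.append(s)

# Inner epoll: every listening fd is registered here once, up front.
listen_epoll = select.epoll()
for s in listeners:
    listen_epoll.register(s.fileno(), select.EPOLLIN)

# Main epoll: after taking the accept lock, only ONE fd has to be added.
main_epoll = select.epoll()
main_epoll.register(listen_epoll.fileno(), select.EPOLLIN)

# A client connects to the second listener.
client = socket.create_connection(listeners[1].getsockname())

# The main loop sees the inner epoll fd become readable...
outer_events = main_epoll.poll(1.0)
# ...and only then asks the inner epoll which listener is ready to accept.
ready_fds = [fd for fd, _ in listen_epoll.poll(0)]
expected_fd = listeners[1].fileno()
inner_fd = listen_epoll.fileno()

client.close()
main_epoll.close()
listen_epoll.close()
for s in listeners:
    s.close()
```

The cost of taking the lock no longer grows with the number of listening sockets, because the per-lock work is a single registration.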
Network proxy based on mesh architecture
Service discovery and routing
As mentioned at the beginning, network proxies handle both north-south and east-west traffic, and for east-west traffic, service discovery is tied to registration and microservices. So far we have covered the layer 7 proxy for ingress traffic; what follows is closer to east-west traffic.
As shown in the figure above, the earliest architecture for east-west traffic was F5: one application reached another through an internal VIP. This is the oldest model and is hard to manage. It later evolved into the proxy model, also at layer 7, in which an nginx proxy sits in front of the back-end applications. The model on the right, the registry model, is now popular in the industry and used by many companies: an application registers its service, and callers look up and call the corresponding service provider. Ant Financial currently uses this model too, but we are making some technical changes.
First, a brief introduction to the service mesh concept, using Ant Financial as an example. Ant's applications are all written in Java, so there is a .jar that handles publishing and registration. This .jar contains all the business-independent logic: publishing, registration, traffic balancing, and rate limiting. What service mesh does, at the data plane, is pull this business-independent logic out and run it as a separate process. Once publishing, registration, and the other application-independent logic have been stripped out, the Java process contains only business logic and a simple protocol sender. The two processes are deployed side by side: Java sends directly to a local proxy, and the proxy handles delivery, publishing, and registration. This is the evolution we are carrying out internally, and it is also a popular microservice approach in the industry.
The main reasons Ant Financial is adopting service mesh internally are:
- Embracing microservices and cloud native
- Integration of heterogeneous language systems
- Unified service governance
- Better support for the operations and maintenance system
- Global traffic management, connecting north-south and east-west
- Financial network security
Today, encryption and decryption happen only on external traffic. Financial-grade network security means that our east-west traffic, that is, internal traffic, also needs to be encrypted. Without the sidecar architecture this is hard to achieve, because the .jar inside each Java application cannot support offloading of encryption and decryption, and many optimizations are hard to apply there.
SOFAMesh for financial business
The figure above shows the SOFAMesh architecture. Each pod contains the application and SOFAMosn. SOFAMosn is the independent sidecar mentioned earlier; it is the data plane. The application's communication, including TLS, national cryptography, and service authentication, goes through SOFAMosn. There is also a traffic mirroring layer, which is more security related; internally we use it for auditing. A lot of business logic can be implemented at this layer. The top layer is the control plane.
Take SOFAMosn's support for the API gateway as an example. The centralized gateway is the early architecture: a gateway cluster. At Ant, a gateway cluster becomes a critical performance bottleneck, because during the Double 11 promotion thousands of businesses sit behind it. You do not know the load level of each one, so capacity is hard to evaluate.
This led to the decentralized gateway architecture. The API gateway logic is pushed down into the application and integrated as a .jar. The gateway's actual load level is then the same as the application's, and there is no centralized single point. But this makes upgrades very difficult: with hundreds of applications online, an architecture upgrade means pushing hundreds of business teams to upgrade, which sometimes takes three months or even half a year. Another problem with this architecture is that heterogeneous systems cannot be fully supported.
So we are now rolling out the mesh gateway. It runs independently of the application process, with the two processes deployed side by side, so the gateway can be changed at any time and changes to its logic are transparent to the user. This also solves the problems mentioned earlier, such as capacity being hard to evaluate and the performance bottleneck: because the gateway is deployed alongside the application, it scales out horizontally with it, and stress tests give a good capacity estimate in any situation. In MOSN, we embed Lua scripting for dynamic configuration; this will be open sourced in the near future.
Ant Financial started exploring service mesh in 2017, started developing SOFAMesh in 2018, and began supporting the 618 promotion in the first half of 2019. It currently covers 100+ applications and 100,000+ containers on the core transaction links, and by sinking some business processing into the mesh, RT has been reduced by 7%. The figure above shows some of the benefits of SOFAMesh.
Cloud native secure network proxy SOFAMosn
SOFAMosn's address is: https://github.com/sofastack /… Because it is a cross-team project, we chose Go as a compromise that keeps the cost of adoption low. We did thorough research and testing of Go's performance early on. In the service mesh scenario, the raw performance of the sidecar does not need to be the highest priority, and businesses with extreme RT requirements are not currently a good fit for the mesh architecture.
SOFAMosn capabilities and module division
The figure above shows SOFAMosn's modules and capabilities. The design borrows many concepts and models from nginx and Envoy, and capabilities can be extended at the stream layer, the net layer, and so on.
SOFAMosn coroutine model
In Go, we build on lightweight coroutines (goroutines). Each TCP connection has one read goroutine, which receives packets and parses the protocol. Each request has one worker goroutine, which performs business processing, proxying, and the write logic.
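The one-reader-per-connection, one-worker-per-request split can be sketched as below. To keep a single example language in this article the sketch uses Python `asyncio` tasks as a stand-in for goroutines; it is not MOSN code, and the frame format and function names are hypothetical.

```python
import asyncio


async def worker(request: bytes, out: asyncio.Queue) -> None:
    """One worker per request: business processing, proxying, write."""
    await asyncio.sleep(0)              # placeholder for proxying upstream
    await out.put(b"resp:" + request)   # placeholder for the write logic


async def read_loop(frames, out: asyncio.Queue) -> None:
    """One reader per connection: receive packets, parse, spawn workers."""
    workers = []
    for frame in frames:                # frames stand in for bytes off TCP
        request = frame.strip()         # placeholder for protocol parsing
        workers.append(asyncio.create_task(worker(request, out)))
    await asyncio.gather(*workers)


async def serve_connection(frames):
    out = asyncio.Queue()
    await read_loop(frames, out)
    return sorted(out.get_nowait() for _ in range(out.qsize()))


responses = asyncio.run(serve_connection([b"a\n", b"b\n"]))
```

The reader never waits for a request to finish before parsing the next one, which is what lets many requests on one connection be in flight at once.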
SOFAMosn capability extension
With a unified codec engine and core codec interfaces, SOFAMosn provides a plugin mechanism for protocols, which currently supports:
With a network filter registration mechanism and a unified packet read/write filter interface, SOFAMosn implements a network filter extension mechanism, which currently supports:
- TCP proxy
- Fault Injection
With a stream filter registration mechanism and a unified stream send/receive filter interface, SOFAMosn implements a stream filter extension mechanism, which currently supports:
- Traffic Mirroring
- RBAC authentication
The figure above is a simple heartbeat example that illustrates this extension mechanism.
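A registration mechanism with a unified filter interface can be sketched in a few lines. The sketch below is a generic illustration, not MOSN's API: the registry, the decorator, the request dictionaries, and the allow-list are all hypothetical, and the heartbeat and RBAC filters only mimic the kinds of extensions named above.

```python
STREAM_FILTERS = {}  # hypothetical registry: filter name -> filter function


def register_stream_filter(name):
    """Decorator that registers a stream filter under a name."""
    def decorator(func):
        STREAM_FILTERS[name] = func
        return func
    return decorator


@register_stream_filter("heartbeat")
def heartbeat_filter(request):
    # Answer heartbeats locally instead of proxying them upstream.
    if request.get("type") == "heartbeat":
        return {"status": "ok", "handled_by": "heartbeat"}
    return None  # None means: continue with the next filter


@register_stream_filter("rbac")
def rbac_filter(request):
    if request.get("user") not in {"alice"}:  # hypothetical allow-list
        return {"status": "denied", "handled_by": "rbac"}
    return None


def run_filters(request, chain):
    """Run the configured filters in order; proxy if none short-circuits."""
    for name in chain:
        response = STREAM_FILTERS[name](request)
        if response is not None:
            return response
    return {"status": "proxied"}
```

New behaviour is added by registering another function under a new name and listing it in the chain, without touching the proxy core.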
Dynamic configuration based on xDS
The figure above shows the fully dynamic xDS configuration that we support. Its biggest feature is that a cluster can be added by calling a single interface. One of the biggest problems with nginx is updating clusters. With xDS, everything is updated dynamically: listening sockets, clusters, and routes. It is a standard scheme that is compatible with the community.
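The "one call adds a cluster" idea can be pictured with a small in-memory cluster manager. This is a hedged sketch only: the class and method names are hypothetical, and it shows neither the xDS wire protocol nor MOSN's actual interfaces.

```python
import threading


class ClusterManager:
    """Holds upstream clusters; updates take effect without a reload."""

    def __init__(self):
        self._lock = threading.Lock()
        self._clusters = {}

    def add_or_update_cluster(self, name, hosts):
        with self._lock:
            self._clusters[name] = tuple(hosts)  # swap in the new host list

    def remove_cluster(self, name):
        with self._lock:
            self._clusters.pop(name, None)

    def hosts(self, name):
        with self._lock:
            return self._clusters.get(name, ())
```

A control plane push would translate into calls such as `add_or_update_cluster("pay", ["10.0.0.1:80"])` while traffic keeps flowing, in contrast to rewriting a configuration file and reloading.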
The challenges of network proxies in the mesh scenario
A network proxy in the service mesh scenario differs from one in the access layer in many ways. An access layer proxy can be a centralized, standalone product controlled by a single team. In the service mesh scenario, much more has to be considered. For example, SOFAMosn has to be deployed into hundreds of thousands of containers online, and each container belongs to a different user. Smooth upgrades and rollback compatibility are therefore essential, and we also need to provide some common frameworks for extension.
Software often performs well under stress testing and then collapses once it is put into a large-scale production environment. This is a compatibility problem: some applications are meshed and others are not, and even within one application only some instances may be meshed. Then there are TLS-encrypted links, and the question of which links are encrypted. These are the practical problems.
As for how to inject the sidecar into the user's app, the industry currently uses transparent proxying, most commonly with iptables, which redirects ports directly. We found significant performance problems with it, so for now we use local mode: the app changes the port it accesses, so that all access goes to the local machine and therefore to the sidecar. We will make major changes in this area next year.
At large scale, the control plane is as big a challenge as the data plane. For example, a single instance may have tens of thousands of routing nodes, one node may have a back-end machine list of 200,000 entries, and there may be thousands of routing rules. This has a large impact on performance throughout the matching process, so we have done a lot of optimization here.
Dynamic service discovery
High-frequency publishing and registration is a serious test of software stability. For example, tens of megabytes of machine list data may be pushed in one second, and we have many machine lists. When lists of tens of thousands of machines are pushed down, this leads to heavy protobuf (PB) deserialization and injection work.
Operation and maintenance challenges
SOFAMosn releases are imperceptible to the business, and upgrades are smooth.
Because the sidecar is infrastructure, upgrading it must be imperceptible to the business. Traditional network proxies such as nginx upgrade by having the old process close its listening ports while a new process takes over new connections and requests. This works well for short-lived, ping-pong style protocols such as HTTP, but it does not handle long-lived connections or bidirectional streaming protocols well. We therefore implemented connection migration in SOFAMosn, so that connections are migrated smoothly during a proxy upgrade and service continuity is preserved. Socket migration can be done with sendmsg and TCP_REPAIR. In this scenario the socket migration itself works well; restoring the session state of the whole connection is the troublesome part.
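The sendmsg half of socket migration relies on passing a file descriptor over a UNIX domain socket with `SCM_RIGHTS`. The sketch below shows only that mechanism, inside a single process for simplicity: two socket pairs stand in for the old-to-new process channel and for a live client connection. It does not cover TCP_REPAIR or session recovery, and the helper names are hypothetical.

```python
import array
import socket


def send_fd(channel: socket.socket, fd: int) -> None:
    """Pass an open file descriptor over a UNIX socket via SCM_RIGHTS."""
    ancillary = [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))]
    channel.sendmsg([b"F"], ancillary)


def recv_fd(channel: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd."""
    fds = array.array("i")
    _, ancdata, _, _ = channel.recvmsg(1, socket.CMSG_LEN(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[: len(data) - (len(data) % fds.itemsize)])
    return fds[0]


# Channel between the "old" and "new" proxy (one process here for simplicity).
old_side, new_side = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
# A socket pair standing in for a live client connection held by the old proxy.
client, held_by_old = socket.socketpair()

send_fd(old_side, held_by_old.fileno())
migrated = socket.socket(fileno=recv_fd(new_side))  # new proxy adopts the fd

client.sendall(b"ping")       # the client keeps using the same connection
data = migrated.recv(4)       # ...and the new owner receives the bytes
```

The client never reconnects; only the owner of the server-side descriptor changes, which is why the approach suits long-lived connections.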
Deploying SOFAMosn as a sidecar brings new challenges. Unlike Spanner, it no longer has a physical machine to itself, nor is it an independently containerized application that only cares about its own capabilities and resource consumption. We have to manage CPU, memory, and other resources at a fine grain to cooperate well with the application.
The performance problems are mostly related to Go:
- GOMAXPROCS: the tradeoff between CPU consumption and RT
- GC tuning: upgrading to Go 1.12, MADV_FREE vs. MADV_DONTNEED
- Channel throughput is limited, so less of the main business data is passed through channels
- cgo has an 83% performance degradation for TLS signature calculation and AES symmetric encryption
- UNIX domain sockets perform 8% better than TCP sockets
- Using tmpfs or mmap with MAP_LOCKED to reduce the impact of high I/O load on the shared-memory page cache
UNIX domain sockets perform 8% better than TCP sockets because they skip much of the TCP protocol processing, so we plan to use UNIX domain sockets for traffic delivered within the same machine.
Go's TLS is its own implementation and does not use OpenSSL. We tested it, as shown in the figure above; blue is the nginx-based result. Go has done a lot of assembly optimization, so the performance difference is small. We plan to add assembly optimizations to the runtime as well, for example for national cryptography, and expect the results to keep improving.
We are not yet satisfied with our HTTP numbers and will continue to optimize them. RPC is currently the most widely used protocol in our company. HTTP is on the roadmap for next year, when the HTTP stack will be a main focus.
About the future:
- In the era of cloud native, multi-cloud, and hybrid cloud, the boundary between north-south and east-west traffic is gradually blurring;
- Some capabilities of the application network proxy layer will solidify and sink into the system network stack or into smart hardware devices;
- Sidecar -> Proxyless -> Networkless;
- Upgrades to the physical communication infrastructure will inevitably bring changes to the application network layer.