User Case | Farewell to Traditional Financial Messaging Architecture: Apache Pulsar in Practice at Ping An Securities

Time: 2022-1-15

This article was first published on InfoQ under the title "Farewell to Traditional Financial Messaging Architecture: Apache Pulsar in Practice at Ping An Securities".
Authors: Wang Dongsong, Chen Xiang

In financial scenarios, business growth keeps adding new scenarios to application systems. These scenarios place more diverse requirements on the messaging system and pose a series of challenges to the original architecture. After evaluating Apache Pulsar, Ping An Securities decided to adopt it in production. This article explains why Ping An Securities chose Apache Pulsar, the scenarios in which it is used, the problems encountered in practice, and the plans for its future use.

Background

Traditional financial and securities companies generally handle external business through a unified access service or component. After receiving a user request, the access layer forwards it to the corresponding business system or module according to business rules. Some requests are forwarded to a message queue: once a request is written, the downstream business system reads it from the queue, processes it, and returns the result to the customer along the same path through the message queue. The whole request flow is closed, and its functionality is limited.

Challenges of the traditional message queue architecture

Ping An Securities adopted the traditional architecture described above, which currently supports only message queuing. Although we have some development capability, it is difficult to obtain details about the message queue's internals. Because it is a custom-developed system, the languages it supports are also relatively limited. For business development and innovation, the existing message queue has the following shortcomings:

  • Black-box system, difficult to observe: the message queue is a black box, so it is difficult for us to observe the details of the architecture;
  • Direct exchange, no routing: the architecture currently supports only message queues and cannot support scenarios that require routing;
  • Weak access verification, high security risk: password authentication and verification in the existing system are weak, so the security risk is high;
  • Customized system, limited language support: the customized system supports few client languages, which makes it difficult for us to build on the original system.

As the business expands and the architecture evolves, the company's existing message queuing system and components face a series of challenges. Many of the problems, security in particular, are urgent in financial scenarios.

Business requirements of financial scenarios

Our business requirements fall into three main categories: identification and security control, routing and distribution, and audit.

Identification and security control

Identification determines the identity of the clients and accessors connecting to the message queue, so that corresponding security rules can be specified and illegal accessors rejected, which in turn meets the expected security requirements. At the most basic level, we need to identify the system and IP address requesting access and restrict permissions according to the business scenario and specific requirements.
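
As a rough illustration of the "identify the system and IP, then restrict permission" idea, the sketch below checks an accessor against an allow-list. It is not the authors' implementation and does not use any Pulsar API; the system names, network ranges, action names and the `is_allowed` helper are all hypothetical.

```python
import ipaddress

# Hypothetical access rules: system name -> (allowed networks, allowed actions).
ACCESS_RULES = {
    "trading-gateway": (["10.1.0.0/16"], {"produce"}),
    "risk-control": (["10.2.3.0/24"], {"consume"}),
}

def is_allowed(system: str, client_ip: str, action: str) -> bool:
    """Return True only if the system is known, the IP lies inside one of its
    allowed networks, and the action is permitted for that system."""
    rule = ACCESS_RULES.get(system)
    if rule is None:
        return False  # unknown accessor: reject
    networks, actions = rule
    ip = ipaddress.ip_address(client_ip)
    in_network = any(ip in ipaddress.ip_network(net) for net in networks)
    return in_network and action in actions
```

A call such as `is_allowed("trading-gateway", "10.1.5.9", "produce")` passes all three checks, while the same system asking to consume, or connecting from an address outside its range, is rejected.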

Routing and distribution

Routing and distribution means that messages are routed from the queue they were written to into the corresponding queues according to rules. The existing message queue supports a limited set of scenarios. Supporting more would require a lot of development time and effort (including changes to upstream and downstream systems) and would introduce other problems. A better solution is a message queue system that supports more modes and features, such as topic mode and streaming message processing. If the message queue system supports routing, access complexity drops significantly and the access layer becomes easier to operate: each system only needs to connect to a group of topics, and the routing layer handles distribution. Performance can also be optimized in a more targeted way, since routing, forwarding and protocol conversion are all performance-intensive operations.

The communication mechanism of the original architecture is point-to-point and closed. Request messages cannot be shared; they can only be distributed indirectly through adapters or log collection, which makes it difficult to meet real-time requirements.

Audit

Message publishers and receivers are participants in the overall system and deserve the most attention, because the participants are the main factor affecting system security. From a security perspective, the audit requirements for messages are therefore relatively high. Another urgent requirement is controlling the flow of messages. With identification and security control in place, security information can be refined during audits so that invalid and illegal requests are rejected at the business entrance, which keeps the internal systems robust. Auditing can also be used to monitor and audit information about the receivers and auditors of messages.

System requirements of new business

New services place higher demands on the messaging system, mainly in availability, message delivery latency, scaling out and in, and message backtracking.

Requirement 1: high availability and low latency

In the Internet industry, high availability and low latency are basic system requirements. Many companies' business systems are evolving, or will evolve, from a single point to disaster recovery, then across data centers in the same city, then across multiple centers in different cities; or from cross-city disaster recovery to cross-city multi-center deployment (two locations, three centers). Such systems place high demands on availability and latency, so we must consider how to minimize latency as system complexity grows (for example in disaster recovery and cross-city scenarios).

Requirement 2: rapid scaling and recovery

A main characteristic of business in the financial industry is that requests may surge during a certain period or cycle and then gradually return to normal once that window passes. This requires the system to scale out and in horizontally and quickly. For cost reasons, deploying the whole architecture for peak traffic is clearly unreasonable. The best solution is to plan the architecture and deployment around normal traffic and, when traffic spikes, expand quickly to support the business. Ideally, every component of the system can scale out, scale in and recover quickly.

Requirement 3: message ordering and deduplication

Some special business scenarios require messages to be ordered or free of duplicates. We often make certain interfaces idempotent. If we can guarantee that upstream messages are not repeated, the pressure on downstream systems is reduced.
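
One common way to keep duplicates from reaching downstream systems is to track the highest sequence ID accepted from each producer and drop anything at or below it. The sketch below models that idea in memory only; the `Deduplicator` class, producer names and sequence numbers are hypothetical and do not represent the authors' system or the Pulsar client API.

```python
class Deduplicator:
    """Drops messages whose sequence ID is not higher than the last one
    accepted from the same producer."""

    def __init__(self):
        self.last_seq = {}  # producer name -> highest accepted sequence ID

    def accept(self, producer: str, seq_id: int) -> bool:
        if seq_id <= self.last_seq.get(producer, -1):
            return False  # duplicate or stale resend: drop it
        self.last_seq[producer] = seq_id
        return True

dedup = Deduplicator()
incoming = [("order-svc", 0), ("order-svc", 1), ("order-svc", 1),
            ("quote-svc", 0), ("order-svc", 2)]
# Only the first copy of each (producer, sequence ID) pair is delivered.
delivered = [m for m in incoming if dedup.accept(*m)]
```

Because sequence IDs are tracked per producer, the resend of `("order-svc", 1)` is dropped while `("quote-svc", 0)` is still delivered, and per-producer ordering is preserved.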

Requirement 4: traceability and serialization

If a problem occurs in a business system but is difficult to reproduce in the test environment, message backtracking is needed. Message backtracking means replaying all requests within the time window in which the problem occurred, to check whether the problem can be reproduced and to troubleshoot it, which greatly reduces troubleshooting effort. We can also use this capability for gray-release (canary) verification and parallel verification.
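
The replay-by-time-window idea can be sketched as follows. This is an in-memory model under stated assumptions, not the authors' tooling: `StoredMessage`, `replay`, the timestamps and the payloads are hypothetical, and the "log" stands in for whatever retained message store the broker provides.

```python
from dataclasses import dataclass

@dataclass
class StoredMessage:
    publish_time: float  # seconds since epoch
    payload: str

def replay(log, start, end, handler):
    """Re-deliver, in original order, every retained message whose publish
    time falls inside [start, end]. Returns the number of messages replayed."""
    count = 0
    for msg in log:
        if start <= msg.publish_time <= end:
            handler(msg)
            count += 1
    return count

log = [StoredMessage(100.0, "order-1"), StoredMessage(105.0, "order-2"),
       StoredMessage(110.0, "order-3"), StoredMessage(120.0, "order-4")]
replayed = []
# Replay only the window in which the problem was observed.
n = replay(log, 104.0, 112.0, lambda m: replayed.append(m.payload))
```

Pointing `handler` at a test or canary instance instead of a list is what would make the same mechanism usable for gray-release or parallel verification.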

Choosing Apache Pulsar

Given the business and system requirements above, we found that many features of Apache Pulsar meet our needs well.

  1. Cluster mode supports cross-cluster synchronization. We can build an active-active system with cross-cluster geo-replication, and messages are synchronized transparently to the client.
  2. Separation of compute and storage. Storage and compute scale horizontally according to usage, transparently to the client. Tiered (secondary) storage extends the usage scenarios of messages and makes data analysis and message auditing possible.
  3. Client access authentication is pluggable and supports custom development. Our business requires authentication and authorization during client access to ensure that message sources are reliable and controllable.
  4. A complete REST API for viewing queue status. The messaging system we used previously performs well but lacks observability, which makes troubleshooting difficult. Its management model is also relatively primitive and cannot meet the needs of large-scale system management. Apache Pulsar's REST API not only exposes system operating metrics but also helps manage the cluster efficiently.
  5. Message routing, filtering and statistics can be implemented with Pulsar Functions.
  6. The persistence mode and expiration time of messages are configurable, which allows message replay.
  7. Multi-language support makes access fast and convenient.

Business scenarios of Apache Pulsar at Ping An Securities

Ping An Securities uses Apache Pulsar to build a unified messaging platform. We expect it to integrate four data streams (customers, transactions, quotes and funds) for market data distribution, real-time risk control and similar uses. This article describes how we apply Apache Pulsar to three business scenarios: request routing, data broadcasting and message notification. It also covers the advantages and disadvantages of the new architecture and its impact on the development and operations teams.

Scenario 1: request routing – simplifying the system

Our message routing process is shown in the following figure. A request sent from component A is written to topic A; the routing module then routes the messages in that topic and distributes them to the corresponding topics. Downstream components that subscribe to these topics process the relevant messages. Component A only needs to write messages to a fixed queue and does not need to know about topics B, C and D. Downstream systems only need to know the queue they receive messages from and do not need to know about topic A, which simplifies the structure of the whole network.

[Figure: message routing process]
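
The routing flow described above (component A writes to topic A, a routing module fans messages out to topics B, C and D) can be sketched in memory as follows. This is only an illustration of the pattern, not the authors' routing module or the Pulsar Functions API; the routing table, the `type` field, the topic names and the `topic-unrouted` fallback are hypothetical.

```python
from collections import defaultdict

# Hypothetical routing table: business type -> destination topic.
ROUTES = {"trade": "topic-B", "quote": "topic-C", "fund": "topic-D"}

topics = defaultdict(list)  # in-memory stand-in for broker topics

def publish(topic, message):
    topics[topic].append(message)

def route_pending(source="topic-A", fallback="topic-unrouted"):
    """Drain the source topic and forward each message by its 'type' field.
    Messages with no matching rule go to the fallback topic."""
    pending, topics[source] = topics[source], []
    for message in pending:
        publish(ROUTES.get(message.get("type"), fallback), message)

# Component A only ever writes to topic A.
publish("topic-A", {"type": "trade", "id": 1})
publish("topic-A", {"type": "quote", "id": 2})
publish("topic-A", {"type": "other", "id": 3})
route_pending()
```

Keeping unroutable messages in a separate topic rather than dropping them is a design choice of this sketch; it gives troubleshooting one more place to look when a downstream component reports a missing message.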

This message routing mode simplifies the overall system architecture. However, our routing system still needs optimization:

  1. Although the routing and distribution workload has decreased, troubleshooting takes more steps. For example, if component A sends a message and component B does not receive it, we need to check whether component A wrote the message to topic A, whether the routing module routed it successfully, and whether component B subscribed to the topic correctly.
  2. According to current test results, latency increases because the message path is longer.
  3. Because the messages in every queue are persisted, data is stored redundantly across storage and queues.
  4. The routing module is new, so the learning cost for operations staff is high.

Scenario 2: data broadcasting – reducing latency

Data broadcasting is another business scenario in which we use Apache Pulsar. It uses publish/subscribe mode and is mainly used to synchronize messages. In the past, we either did not need to synchronize market data to the business systems or did so by other means (such as synchronizing a database). As the business grew, however, the tension between synchronization timeliness and user experience became increasingly acute: how can users see information sooner? Take market data synchronization as an example. Synchronizing the database first and then querying it introduces relatively long latency. In broadcast mode, the business system only needs to subscribe to the topics it requires and can read the data directly at lookup time, which effectively reduces latency.
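
The latency benefit comes from each business system holding the latest data locally instead of querying a synchronized database. A minimal in-memory sketch of that pattern follows; `QuoteBroadcaster`, `BusinessSystem`, the symbol and the prices are hypothetical placeholders, and no Pulsar client is involved.

```python
class QuoteBroadcaster:
    """In-memory stand-in for a topic with many independent subscribers."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, symbol, price):
        # Every subscriber receives every message (broadcast semantics).
        for callback in self.subscribers:
            callback(symbol, price)

class BusinessSystem:
    """Keeps the latest quote per symbol so lookups are local reads."""

    def __init__(self, broadcaster):
        self.latest = {}
        broadcaster.subscribe(self.on_quote)

    def on_quote(self, symbol, price):
        self.latest[symbol] = price

    def lookup(self, symbol):
        return self.latest.get(symbol)

market = QuoteBroadcaster()
app_a, app_b = BusinessSystem(market), BusinessSystem(market)
market.publish("DEMO", 10.0)   # made-up symbol and prices
market.publish("DEMO", 10.5)
```

After the two publishes, both `app_a.lookup("DEMO")` and `app_b.lookup("DEMO")` return the most recent value without any round trip to a database.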


Scenario 3: message notification – security control

The third scenario in which we use Apache Pulsar is message notification. Although it involves relatively few businesses, this scenario is very important. The overall business flow chart is shown below. Because there is more than one signal source, after a message is published to the computing engine, the engine must evaluate logic, security and other aspects based on the signal source's information. When the computation completes, a task is activated, and the activated task sends the business request to the relevant business system. After execution, the result is returned to the service of the originating signal source, which triggers the next signal according to the returned result.

The businesses in this scenario have very strict security and control requirements. We need not only to restrict the messages or signals sent by signal sources and truncate or filter some of them, but also to process the returned results, deciding which can be returned and which must be filtered or converted into other content. Without a message queue, the signal source sends the message directly to the computing engine; the engine applies security or control policies and then sends the message to the task; and after the task executes, the result needs another round of security control. These repeated operations significantly hurt performance, and policy updates and signal status views are not truly real-time.

After introducing Apache Pulsar, we separated out the control and audit module. It filters, audits and collects statistics on the signal queue and the result queue, and outputs the results to the management console in real time. Operations staff or auditors can then control and update the corresponding policies based on this information. This mode not only streamlines the data flow but also adds channels for supplementing data and defines the boundaries of service modules more clearly.

[Figure: message notification business flow chart]

Problems and solutions

So far we have explored the use of Apache Pulsar mainly in the three scenarios above and are gradually putting it into production. Along the way we found several problems; we share our solutions here for reference.

1. Implementing req-rep mode

The first problem we encountered was how to implement the request-response (req-rep) mode. Our solution is to achieve compatibility through bus mode.

The common calling pattern today is that the client initiates a request and the server returns a response after processing it. After introducing the bus (turning synchronous calls into asynchronous ones), consider a multi-node deployment: node 1 sends a request, and the server returns the result after processing it. All nodes must listen for results. What should node 2 do when it receives a response intended for node 1? Node 2 must first subscribe to and receive the response, then determine whether it answers a request that node 2 itself initiated; if not, it discards the message. Under this pattern, each node must cache the message ID of every request it sends, and the server must, per the protocol, include the request's message ID in the response data. Each node subscribes to all responses and checks whether the message ID exists in its cache; if not, it discards the message.
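
The cache-and-filter pattern described above can be sketched as follows. It is an in-memory illustration rather than the authors' code: the `Node` class, the `serve` function, the dictionary message format and the use of `uuid` for request IDs are all assumptions made for the example.

```python
import uuid

class Node:
    def __init__(self, name):
        self.name = name
        self.pending = set()   # IDs of requests this node has sent
        self.results = {}      # request ID -> response body

    def send_request(self, bus, body):
        request_id = uuid.uuid4().hex
        self.pending.add(request_id)  # cache the ID before sending
        bus.append({"request_id": request_id, "body": body})
        return request_id

    def on_response(self, response):
        """Called for every response on the shared bus."""
        request_id = response["request_id"]
        if request_id not in self.pending:
            return False  # another node's response: discard
        self.pending.remove(request_id)
        self.results[request_id] = response["body"]
        return True

def serve(request):
    # The server echoes the request ID back, as the protocol requires.
    return {"request_id": request["request_id"],
            "body": request["body"].upper()}

request_bus = []
node1, node2 = Node("node-1"), Node("node-2")
rid = node1.send_request(request_bus, "query positions")
response = serve(request_bus.pop(0))
# Every node sees the response; only the sender keeps it.
kept = [node.on_response(response) for node in (node1, node2)]
```

Note that node 2 still has to receive and inspect the response before discarding it, which is exactly the wasted traffic the next paragraph quantifies.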


This implementation has a serious problem. Suppose nodes issue requests that query large amounts of data, the maximum message size in Apache Pulsar is set to 8 MB, and the TPS is 1000. Must every node receive the response traffic for all of those requests? With five nodes, each node should only need to receive the responses for 200 requests, but the current mode forces each node to carry the response traffic for all 1000 requests, just so it can filter them. When node load reaches its limit, nodes must be added, which multiplies network bandwidth usage again. Because Apache Pulsar supports a large number of topics, this could be solved by configuring a response queue for each node, but we would like to try solving it through filtering on the broker.
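
The figures in this paragraph can be worked through directly. The sketch below assumes requests are spread evenly across nodes, reads "8m" as 8 MB, and takes the worst case in which every response is at that size limit; the function and variable names are illustrative only.

```python
def reply_load_per_node(total_tps, nodes, broadcast):
    """Responses per second that each node must receive."""
    return total_tps if broadcast else total_tps / nodes

TOTAL_TPS = 1000   # from the text
NODES = 5          # from the text
MESSAGE_MB = 8     # assumption: every response is at the 8 MB limit

# Shared response topic: every node receives every response.
broadcast_tps = reply_load_per_node(TOTAL_TPS, NODES, broadcast=True)
# One response queue per node: each node receives only its own share.
dedicated_tps = reply_load_per_node(TOTAL_TPS, NODES, broadcast=False)

broadcast_mb_s = broadcast_tps * MESSAGE_MB
dedicated_mb_s = dedicated_tps * MESSAGE_MB
```

Under these assumptions each node receives 1000 responses per second (up to 8000 MB/s) with a shared response topic, versus 200 per second (up to 1600 MB/s) with a per-node response queue or equivalent broker-side filtering, a fivefold difference that grows with every node added.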

2. Read-write separation

The message broadcasting scenario involves read-write separation. When a large number of subscriber nodes are added, it is best to avoid concentrating all of their connections on the broker that owns the topic. A feasible approach is to allocate and use topics and partitions sensibly. Apache Pulsar 2.7.2, which we currently use, does not support read-write separation. We plan to upgrade to Apache Pulsar 2.8 to implement read-write separation more easily and meet the needs of the broadcasting scenario.

3. Handling multiple network cards

For network security reasons, the company's internal network is divided into multiple zones and segments. Different zones and segments use different IP addresses, and servers have multiple network cards for communication across zones. At present, if a broker registers by IP address, it can register only the IP of one network segment; if it registers by domain name, DNS resolution must be configured differently in each network zone. If the broker supported communication over multiple network cards, these problems would not exist. Our current solution is to use a proxy for client requests, with external systems connecting only to the proxy. We will also add high-availability configuration for the proxy.


Future plans

At present, we run Apache Pulsar in production on a small scale, in a single data center with a single cluster; we did not consider an active-active setup in the early stage of the launch. As infrastructure for business systems, Apache Pulsar's own availability is extremely important. We therefore plan an active-active deployment based on two data centers in the same city and a single cluster, as shown in the figure:

[Figure: planned active-active deployment with two data centers in the same city and a single cluster]

While testing and using Apache Pulsar, we encountered some problems, and we thank the Apache Pulsar community for its active response. We look forward to participating more in the development of Apache Pulsar and contributing to Apache Pulsar and its community.

About the authors:

  • Wang Dongsong, R&D engineer in the brokerage business division of Ping An Securities.
  • Chen Xiang, architect in the brokerage business division of Ping An Securities.
