Nearly ten thousand service instances running stably with zero failures: how is Ctrip’s microservice architecture implemented?

Time: 2021-2-19

Ctrip has been exploring microservices since the era of its .NET technology stack. After moving to the Java technology stack, it went through a self-developed microservice framework and now runs on the high-performance Dubbo. We are currently exploring service mesh, hoping to fully standardize the microservice framework and make it cloud native.

Past (self-developed service framework)

Ctrip started out on the .NET technology stack, with an architecture based on an ESB bus. Although the ESB solved the governance problem for intranet service calls, a centralized architecture often let a single service drag down the whole bus and, in turn, paralyze the entire network. A registry-based SOA service architecture removes this single point of failure by making service invocation distributed. Today Ctrip mainly uses the Java technology stack. Because compatibility with the legacy .NET stack had to be preserved, the existing framework is mainly self-developed. Compared with open source high-performance service frameworks, however, a self-developed framework may fall short in several respects.

Now (CDubbo service framework)

The C in CDubbo stands for Ctrip’s governance, and Dubbo stands for Alibaba’s open source Dubbo SDK. Looking back over the past two years of practice and exploration, from the first version in April 2018 to nearly 10,000 service instances today, we can roughly summarize the following major milestones.

1. Registration and discovery

Registration and discovery is the core element of a distributed service framework. To interoperate with existing services, CDubbo has to integrate with Ctrip’s registry.

Service registration supports a health check extension mechanism, so a business team can customize health checks for its own scenario. For example, a service can stop serving traffic when a database it depends on is unavailable. The server keeps its service marked as available by sending a heartbeat every 5s. When no heartbeat has been sent N consecutive times, clients are notified automatically.

The client subscribes to the service, and a combined push-pull mode ensures that the client’s view of the nodes is eventually consistent. Custom routing policies are implemented through Dubbo’s extension mechanism, for example choosing a routing policy by method name or by request parameters. Nearby access is also supported, giving priority to services in the caller’s own data center.
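
The heartbeat rule can be sketched as follows. This is a minimal, language-neutral illustration in Python, not Ctrip's registry code: the class name HeartbeatRegistry, the value N = 3, and the injectable clock are all assumptions made for the example. Only the 5s interval comes from the text above.

```python
import time

HEARTBEAT_INTERVAL_S = 5.0  # from the article: one heartbeat every 5s
MAX_MISSED = 3              # the article's "N"; 3 is an assumed value


class HeartbeatRegistry:
    """Hypothetical registry: tracks heartbeats and notifies subscribers on expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_seen = {}
        self._listeners = []

    def subscribe(self, listener):
        # listener is called with the address of each instance that is removed
        self._listeners.append(listener)

    def heartbeat(self, instance, healthy=True):
        # A custom health check (for example "is my database reachable?") can
        # veto the heartbeat, so an unhealthy instance ages out of the registry.
        if healthy:
            self._last_seen[instance] = self._clock()

    def sweep(self):
        """Remove instances that missed MAX_MISSED heartbeats and notify clients."""
        deadline = self._clock() - HEARTBEAT_INTERVAL_S * MAX_MISSED
        expired = [i for i, seen in self._last_seen.items() if seen < deadline]
        for instance in expired:
            del self._last_seen[instance]
            for listener in self._listeners:
                listener(instance)
        return expired

    def alive(self):
        return sorted(self._last_seen)
```

A real registry would run sweep() on a timer; it is left as an explicit call here so the behavior is easy to follow.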
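
To make the routing idea concrete, here is a small sketch of a router that first applies a rule based on the method name or a request parameter and then prefers instances in the caller's own data center. It is written in Python for brevity and does not use Dubbo's Router interface; the Instance type, the "batch" and "vip" groups, and the rules themselves are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    address: str
    zone: str               # data center the instance runs in
    group: str = "default"  # hypothetical routing group


def route(instances, method, params, local_zone):
    """Return the instances a call may be sent to, most specific rule first."""
    # Rule by method name (assumed): export* methods go to the "batch" group.
    # Rule by request parameter (assumed): vip requests go to the "vip" group.
    wanted_group = "default"
    if method.startswith("export"):
        wanted_group = "batch"
    elif params.get("vip"):
        wanted_group = "vip"

    candidates = list(instances)
    grouped = [i for i in candidates if i.group == wanted_group]
    candidates = grouped or candidates  # fall back rather than fail the call

    # Nearby access: prefer the caller's own data center when possible.
    local = [i for i in candidates if i.zone == local_zone]
    return local or candidates
```

Falling back to the wider candidate list when a rule matches nothing is a design choice of this sketch; a stricter router could fail the call instead.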

2. Monitoring: CAT

Running microservices without monitoring is like flying blind. CAT provides distributed tracing, together with useful reports and scenario analysis.

Sometimes we need to know the total number of requests to a service, or the request distribution and QPS of a single machine, or how long the service takes to execute. CAT’s aggregate reports help us understand the health of services.

For a timeout, you may need to know which phase is slow: the client or the server, the serialization phase or the service execution itself. For an error, you can see in which phase the exception occurred, along with the printed stack trace, which helps locate the problem.
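
The idea of attributing a slow call to one phase can be sketched as below. This is not CAT's API; the Trace class and the phase names are invented for illustration, and the clock is injectable only so the example is deterministic.

```python
import time
from contextlib import contextmanager


class Trace:
    """Hypothetical per-request trace that accumulates time spent in each phase."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised an exception.
            elapsed = self._clock() - start
            self.phases[name] = self.phases.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the phase that consumed the most time."""
        return max(self.phases, key=self.phases.get)
```

Wrapping serialization, network transfer, and service execution in separate phase() blocks is enough to tell whether a timeout came from the client side or the server side.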

3. Monitoring: metrics

Framework engineers need a macro view of the company’s services: which services run in each data center, which services use the protobuf serialization format, which services use the SOA protocol, and what the average execution time is. Business colleagues may also want specifics about their own services, such as who their callers are and whether the thread pool is full.

By integrating with Ctrip’s dashboard, the framework provides global request totals, error counts, and thread pool statistics, aggregated by data center, protocol, serialization format, and so on. Custom alerting rules can also be defined so that people can step in as soon as a problem occurs.
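
A rough sketch of this kind of aggregation is shown below. It is not Ctrip's dashboard or metrics client; the Metrics class, the three dimensions, and the 5% alert threshold are assumptions chosen to mirror the text.

```python
from collections import Counter

DIMENSIONS = ("zone", "protocol", "serialization")


class Metrics:
    """Hypothetical in-memory counters keyed by (zone, protocol, serialization)."""

    def __init__(self):
        self.total = Counter()
        self.errors = Counter()

    def record(self, zone, protocol, serialization, ok=True):
        key = (zone, protocol, serialization)
        self.total[key] += 1
        if not ok:
            self.errors[key] += 1

    def by(self, dimension):
        """Request totals aggregated along one dimension, e.g. by("protocol")."""
        idx = DIMENSIONS.index(dimension)
        out = Counter()
        for key, count in self.total.items():
            out[key[idx]] += count
        return dict(out)

    def error_rate(self):
        total = sum(self.total.values())
        return sum(self.errors.values()) / total if total else 0.0


def should_alert(metrics, threshold=0.05):
    # Assumed alert rule: fire when the global error rate exceeds the threshold.
    return metrics.error_rate() > threshold
```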

4. Dynamic configuration

Business colleagues may find that a service they depend on suddenly slows down, causing client timeouts. Framework engineers may need to set check to false when a data center fails, to break the deadlock in which services A and B depend on each other and neither can start. Dynamic configuration lets a setting take effect at runtime; without it, every configuration change would require a new release, which is very costly.

Different methods on the server side may take different amounts of time to execute. With multi-level parameter configuration, you can set the default service timeout to 1s and give a slow method its own timeout of 5s.

The service owner usually knows best how long the service takes. Because the server can push its parameter settings down to clients, each caller no longer has to set its own timeout, and the values in effect are more reasonable. To avoid losses caused by configuration mistakes, we provide a friendly visual interface.
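
The multi-level lookup can be sketched as a simple precedence chain. The order used here follows the override rule described in Dubbo's configuration documentation (method level beats interface level, and at the same level the consumer beats the provider); the dictionary layout and the function name resolve_timeout_ms are assumptions for the example, not CDubbo's actual configuration format.

```python
def resolve_timeout_ms(method, client_cfg, server_cfg, fallback_ms=1000):
    """Pick the timeout for one call from client- and server-provided settings."""
    lookup_order = [
        client_cfg.get("methods", {}).get(method),  # client, method level
        server_cfg.get("methods", {}).get(method),  # server, method level
        client_cfg.get("timeout"),                  # client, service level
        server_cfg.get("timeout"),                  # server, service level
    ]
    for value in lookup_order:
        if value is not None:
            return value
    return fallback_ms


# Settings published by the service owner: 1s by default, 5s for one slow method.
server_cfg = {"timeout": 1000, "methods": {"exportReport": 5000}}
```

With these server settings a caller that configures nothing gets 1000 ms for ordinary methods and 5000 ms for exportReport, which is the behavior the paragraph above describes.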

5. SOA protocol and interoperability

To let existing clients migrate to CDubbo, we need to support the existing SOA protocol. Besides staying compatible with HTTP 1.1, the server must use a serializer consistent with the client’s.

CDubbo receives SOA protocol requests on the Tomcat port and converts the request object with the existing serializer. The request then goes through the same internal Dubbo invocation and filter chain, which keeps the business logic unified: the same business code can be exposed over both protocols without any change.
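
The "one filter chain, two protocols" design can be sketched as follows. Everything here is illustrative: the Invocation type, OrderService, the JSON body, and the URL layout are assumptions, and the real CDubbo adapter runs inside Tomcat and Dubbo rather than in plain functions.

```python
import json
from dataclasses import dataclass


@dataclass
class Invocation:
    method: str
    args: dict
    protocol: str


class OrderService:
    """Hypothetical business service, written once."""

    def get_order(self, order_id):
        return {"order_id": order_id, "status": "PAID"}


def build_chain(service, filters):
    """Wrap the service call in filters, outermost filter first."""
    def terminal(inv):
        return getattr(service, inv.method)(**inv.args)

    def wrap(flt, nxt):
        return lambda inv: flt(inv, nxt)

    handler = terminal
    for flt in reversed(filters):
        handler = wrap(flt, handler)
    return handler


def handle_soa_http(chain, path, body):
    # SOA entry point: HTTP 1.1 request, method name taken from the path
    # (assumed layout), body decoded with the legacy serializer (JSON here).
    method = path.rsplit("/", 1)[-1]
    result = chain(Invocation(method, json.loads(body), "soa"))
    return json.dumps(result).encode()


def handle_dubbo(chain, method, args):
    # Dubbo entry point: the request arrives already decoded from the TCP frame.
    return chain(Invocation(method, args, "dubbo"))
```

Both entry points end in the same chain call, so filters such as tracing or authorization run identically whichever protocol the client speaks.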

6. Test platform

For a private binary protocol there is no ready-made tool such as Postman. Developers sometimes need to call a service in the test environment from their local machine, and sometimes need to verify a server they have started locally. Having every developer build a client for this is relatively expensive.

Using VI (open sourced on GitHub as cornerstone) together with the metadata center and generic invocation capability of Dubbo 2.7.3, we implemented a Postman-like calling tool. It supports direct connections, local testing, and the protobuf serialization format. The protobuf serialization test scheme has been contributed to the Dubbo community, where interested readers can study it.

7. Upgrading to Dubbo 2.7.3

For the detailed upgrade process to Dubbo 2.7.3, please refer to: https://www.infoq.cn/article/…

Let’s review the final result of the upgrade. At present, 99% of Ctrip’s services run on Dubbo 2.7.3. So far there have been no failures, only some incompatibilities. We made sure those incompatibilities surface early, at compile time, so that no problems appear at runtime.

After the release we did find some small problems, such as warm-up not taking effect, onError not being called back when an exception occurs, and missing trace instrumentation for asynchronous calls on the server side. All of these have been fixed in the open source version.

8. Threadless

Business colleagues told us they need to keep the number of threads within a reasonable range, yet Dubbo creates too many threads. One cause is the dedicated per-service thread pool. If a caller depends on 10 services, each at 1 QPS with a latency of perhaps only 10ms, at least 10 threads are needed because each service has its own thread pool. If the services shared one thread pool, a single thread might be enough in this scenario, since the client defaults to the cached thread pool mode. The other cause applies to synchronous calls: the Threadless model in Dubbo 2.7.5 eliminates the DubboClientHandler thread, because the Netty IO thread hands the response directly to the business thread, saving one thread switch.

In practice, the number of business threads dropped substantially. With high QPS and a large number of dependent services, the reduction can even reach 60-70%.
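
The arithmetic behind the "10 threads versus 1 thread" example can be checked with Little's law (average concurrency = arrival rate × latency). The sketch below is an estimate of average demand only, under the assumption that a dedicated pool keeps at least one thread per service; bursts can need more threads than the average suggests.

```python
import math


def avg_in_flight(qps, latency_s):
    # Little's law: average number of concurrent requests for one service.
    return qps * latency_s


# Scenario from the text: 10 dependent services, 1 QPS each, 10 ms latency.
services = [(1.0, 0.010)] * 10

# Dedicated pools: every service keeps at least one thread of its own.
dedicated_threads = sum(
    max(1, math.ceil(avg_in_flight(qps, lat))) for qps, lat in services
)

# Shared pool: threads are sized for the combined average concurrency.
shared_threads = max(
    1, math.ceil(sum(avg_in_flight(qps, lat) for qps, lat in services))
)

print(dedicated_threads, shared_threads)
```

The combined average concurrency is only about 0.1 in-flight requests, which is why a single shared thread can be enough while dedicated pools still hold 10.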

9. CDubbo service architecture

The current CDubbo service architecture supports both the Dubbo and SOA protocols. Dubbo clients communicate over TCP, while existing SOA clients remain compatible through the existing SOA protocol.

It also supports requests from the intranet and extranet gateways, keeps configuration unified across protocols, and stays compatible with the SOA serialization format.

10. Performance

At the protocol level, the Dubbo protocol responds faster than the SOA protocol: the average time per call dropped from 1ms to about 0.3ms. Of course, the actual improvement depends on the message size and request volume of the service.

Some may feel that an improvement of less than a millisecond is not much, but stable performance matters a great deal to a service. We observed that when service traffic increased 3-4 times, clients still saw zero exceptions. Multiplexing over long-lived connections absorbs traffic bursts well.

11. Extensibility

A microservice framework is tightly coupled with business code. Framework engineers typically spend 20% of their time covering 80% of business requirements; the remaining 20% of requirements would take 80% of their time and are better solved by the business teams themselves. The only way to give them that ability is extensibility, and Dubbo offers good horizontal and vertical extension points.

In practice, we found that business teams do write their own extensions at every layer. For example, they extend the Router layer to support their own routing rules, extend the load balancing strategy, and even extend the transport layer to swap in their own transport protocol.
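
The extension idea can be illustrated with a small named-extension registry, loosely modeled on the way Dubbo selects SPI implementations by name. This is a Python sketch rather than Dubbo's ExtensionLoader, and the "sticky-user" load balancer is a hypothetical business extension.

```python
import random
import zlib

_EXTENSIONS = {}


def extension(kind, name):
    """Class decorator that registers an implementation under (kind, name)."""
    def register(cls):
        _EXTENSIONS[(kind, name)] = cls
        return cls
    return register


def load_extension(kind, name):
    # The framework looks implementations up by name, typically from configuration.
    return _EXTENSIONS[(kind, name)]()


@extension("loadbalance", "random")
class RandomLoadBalance:
    """Stand-in for a strategy shipped with the framework."""

    def select(self, instances, invocation):
        return random.choice(instances)


@extension("loadbalance", "sticky-user")
class StickyUserLoadBalance:
    """Business-provided strategy: the same user always reaches the same instance."""

    def select(self, instances, invocation):
        digest = zlib.crc32(str(invocation["user_id"]).encode())
        return instances[digest % len(instances)]
```

A business team adds a strategy by registering one more class; the framework code that calls load_extension does not change.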

12. Ecosystem

A good ecosystem reduces development cost, for example by letting us use existing open source components such as Dubbo Admin and dubbo-go. It also lowers the learning cost for business teams: what you learned about Dubbo at another company still applies at Ctrip, with no private service framework to relearn. The technical support burden is relatively small as well, and many business colleagues know the Dubbo framework even better than we do.

13. Existing problems of the Dubbo protocol & Dubbo 3.0 planning

Besides the widely recognized strengths of the Dubbo framework described above, our practice also revealed some shortcomings of the existing Dubbo 2.x protocol. In a cloud native context, for example, the protocol is not gateway-friendly, and there is no lightweight SDK for mobile clients. From our in-depth discussions with the official Dubbo maintainers, these are also key focus areas of Dubbo 3.0, such as a next-generation protocol, application-level service discovery, and cloud native infrastructure support. As a heavy user of Dubbo, Ctrip will continue to take part in building and rolling out Dubbo 3.0.

Service mesh

There are many opinions online about what service mesh means. I think the main points are as follows.

  • Standardization means lower costs, such as lower R&D and learning costs. A microservice framework learned at another company can still be used at Ctrip, saving the cost of relearning and of running into the same pitfalls;
  • Framework engineers care most about process decoupling. Middleware that cannot be upgraded independently has long troubled them, whereas Envoy can be upgraded on its own;
  • Sinking capabilities into the infrastructure lets us reuse what the cloud already provides. On the one hand, this gives better multi-language support, so businesses can choose the language that fits their scenario. On the other hand, it makes the SDK simpler and reduces jar dependency compatibility problems;
  • Being more standard and sinking more into the infrastructure also brings better cloud deployment capability. When the business expands overseas, it can deploy just the components it needs and no longer depends on the framework.

1. Service Mesh SDK

Consider the service mesh architecture published on the official Istio website. If Istio standardizes the control plane, and Envoy or SOFAMosn standardizes the data plane, does the SDK also need a standardized component, and is there an SDK that meets our standard?

Small and medium-sized companies without their own middleware team may choose a commercial SDK. But for a company of Ctrip’s scale, with strong extensibility requirements and thousands of Dubbo services, we are looking forward to the standard protocol of Dubbo 3.0.

2. Existing protocols are not suitable for sinking

The existing SOA protocol may not be suitable as a standard protocol. As a text protocol based on HTTP 1.1, it suffers from long-tail latency compared with a TCP protocol, which affects service stability.

The Dubbo protocol is not very gateway-friendly, and it also has cross-language and protocol pass-through problems. Since Envoy itself can be understood as a single-machine gateway proxy, the Dubbo protocol is not suitable as a standard protocol either.

Liu Jun of Alibaba has discussed the cross-language and protocol pass-through problems. Interested readers can refer to: https://www.infoq.cn/article/…

3. New protocol

Since the existing protocols are not suitable, can we consider gRPC, the cloud native standard protocol? From a protocol point of view there is nothing wrong with this choice. However, gRPC is tightly bound to proto definitions, so thousands of existing Ctrip services would have to rewrite their business logic code, and that cost is unacceptable.

What we expect from the new protocol is that it can work with POJO objects while conforming to the gRPC protocol specification. On the one hand, this makes good use of the cloud’s basic capabilities. On the other hand, less common languages can use their existing gRPC frameworks to interoperate with the mainstream SDKs.

The new SDK needs more than a standard transport protocol. Because the service framework is tightly coupled with business code, extensibility is a key feature that must be retained. We also need to consider standardizing the API and using unified monitoring components.

4. Summary

So far, we have partially standardized the SDK. In the future, we will move faster, more steadily, and in a more standard way along the cloud native road.

Author | Gu Haiyang
Source | Alibaba cloud native (ID: alicloud native)