On July 6, 2019, the OpenResty community and UPYUN jointly held the OpenResty × Open Talk national tour salon, Shanghai station. Zhang Zhitong, a technical expert from the Infrastructure Department of Meituan, shared "Meituan's HTTP Service Governance Practice" at the event.
The OpenResty × Open Talk national tour salon is jointly initiated by the OpenResty community and UPYUN. Senior OpenResty experts from the industry are invited to share hands-on OpenResty experience, foster communication and learning among OpenResty users, and promote the development of the OpenResty open source project. Events are held in Shenzhen, Beijing, Wuhan, Shanghai, Chengdu, Guangzhou, Hangzhou, and other cities.
First, let me introduce myself. My name is Zhang Zhitong. I graduated from Harbin Institute of Technology and joined Meituan in 2015. At Meituan I am mainly responsible for the Oceanus layer-7 load balancing gateway, the MTrace distributed tracing system, the KMS key management service, and more.
Meituan is a long-time Nginx user. It ran Nginx from the very beginning of the business, moved to Alibaba's Tengine in 2013, and then migrated the entire site to OpenResty around March and April of this year. The most fundamental reason for moving from Tengine to OpenResty is the difficulty of upgrading: as official Nginx releases iterate faster and faster, it is hard for Tengine to merge the latest official Nginx versions, whereas OpenResty tracks the community version of Nginx and can be upgraded smoothly.
Oceanus: Meituan's layer-7 load balancing gateway
Oceanus means "god of the sea". It is the layer-7 load balancing gateway for Meituan's entire access layer, handling on the order of 100 billion calls per day, with thousands of service sites deployed and nearly 10,000 registered application services. The core function of Oceanus is HTTP service governance, which mainly includes service registration and discovery, health checks, and fully visualized management. Oceanus also provides gateway functions such as session reuse, dynamic HTTPS, monitoring, logging, WAF, anti-crawler, and rate limiting.
A small note on rate limiting. At present Meituan implements it through a global Redis cluster, with some simple optimizations. First, the Redis cluster support is implemented entirely on top of OpenResty, because the official OpenResty Redis library only supports single-instance calls. Second, we do not issue a Redis INCR for every request. Instead we set a step threshold: each node accumulates counts locally and only synchronizes the step to the Redis cluster periodically. The larger the step, the lower the cost, since local accumulation needs no remote call, but the counting error grows correspondingly. The basic idea is to count within a local step and periodically synchronize that step to the Redis cluster, implementing cluster-wide rate limiting.
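The step-based idea can be sketched as follows. This is a minimal single-process sketch, not Meituan's actual implementation: the `StepLimiter` class and the in-memory dictionary standing in for the Redis cluster are illustrative.

```python
class StepLimiter:
    """Accumulate counts locally; push to the shared store once per step."""

    def __init__(self, shared, key, limit, step):
        self.shared = shared      # stands in for the global Redis cluster
        self.key = key
        self.limit = limit
        self.step = step          # larger step = fewer remote calls, more error
        self.local = 0            # requests counted since the last flush

    def allow(self):
        # Reject based on the last known global count plus the local backlog.
        if self.shared.get(self.key, 0) + self.local >= self.limit:
            return False
        self.local += 1
        if self.local >= self.step:
            self.flush()
        return True

    def flush(self):
        # One "remote" INCRBY per step instead of one per request.
        self.shared[self.key] = self.shared.get(self.key, 0) + self.local
        self.local = 0


shared = {}                      # in-memory stand-in for Redis
limiter = StepLimiter(shared, "api:/search", limit=10, step=4)
allowed = sum(limiter.allow() for _ in range(20))
print(allowed)                   # 10: exactly `limit` requests pass
```

In a real multi-node deployment each node sees only its own backlog, so up to `step × nodes` extra requests can slip through between flushes; that is exactly the error-versus-cost trade-off described above.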
The figure above shows the current Oceanus system architecture; the underlying engine is based on OpenResty. An agent process is deployed on every OpenResty node, mainly for logic decoupling: we do not want Nginx or OpenResty to carry too much logic unrelated to request processing, so a lot of logic is pushed down into the agent. For example, the agent pulls the service list from MNS and fills it into OpenResty. Site management and configuration are handled by the front-end management platform, Tethys, which writes configuration to MySQL in real time; the agent then synchronizes from MySQL, writes the local server block files, and reloads the site. On the right are the modules outside the Oceanus system: MNS, the company's unified naming service, and Scanner, which is mainly responsible for health checks.
Nginx reverse proxy configuration
As shown in the figure above, there are several problems with configuring an Nginx reverse proxy:
- Server addresses are hard-coded; an IP address cannot change without editing the file every time.
- A reload is required for every change.
- File-based configuration is error-prone.
How do we solve these three problems? First, dynamic service registration. Second, dynamic configuration that takes effect without a reload. Third, turning file-based configuration into structured management.
Service registration is currently based on MNS, Meituan's internal unified naming service. The figure above is the front-end interface of service registration; its back end is still built on basic components such as etcd and ZooKeeper. It mainly caches service information and supports pulling and registering services in batches. All site information related to a cluster can be pulled by selecting the Nginx cluster, and a combination of push and pull guarantees real-time, accurate data: all data is periodically pulled to the local node, while ZooKeeper's watcher mechanism ensures changes arrive in real time.
There are open source modules for active health checks in Nginx, but they run into problems. Suppose a site http://xxx.meituan.com has health checks configured on its upstream. Every worker of every proxy server periodically probes the back-end service. If it checks once per second, the Nginx cluster has 100 nodes, and each instance runs 32 workers, the health-check QPS is 100 × 32, while the actual QPS of the server is less than 10; the health-check mechanism alone turns that into more than 3,000 QPS. So we abandoned active health checks inside Nginx and chose Scanner to do periodic health checks. In addition, Scanner supports custom heartbeats, can check whether a port is reachable and whether an HTTP URL responds correctly, and supports isolating fast and slow check threads.
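The amplification is simple arithmetic, using the numbers from the example above:

```python
# Health-check traffic generated by per-worker active checks.
nginx_nodes = 100          # proxy instances in the cluster
workers_per_node = 32      # worker processes per instance
checks_per_second = 1      # one probe per worker per second

health_check_qps = nginx_nodes * workers_per_node * checks_per_second
print(health_check_qps)    # 3200 probes/s against a service doing < 10 QPS
```

A central checker like Scanner probes each backend a constant number of times per interval, independent of how many proxy workers exist.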
Meituan implements dynamic upstreams with a mature approach: the dyups module provided by Tengine. It exposes a dyups API through which service nodes are added, deleted, and created. One worker processes the modification request and puts it into a shared-memory queue; each worker pulls the change from the queue, applies it, and writes it into local memory, completing the whole flow. The first call takes a lock, then synchronizes any queued data not yet consumed, and only performs the update after synchronization, which keeps the data consistent.
Some problems with dyups:
1. Updates take effect only in memory
The biggest problem is that updates take effect only in memory: they live inside the local worker process, so on the next reload the entire service list is lost. Our solution is to delegate node updates and file persistence to this node's local agent. When the agent periodically detects a change in the service list, it first updates the locally generated upstream file, then calls the dyups API to synchronize the changed nodes into memory in real time. The service nodes thus both land in the local file for persistent storage and are loaded into the memory of the Nginx workers, keeping the service list intact.
Note the concurrency problem between reload and the dyups API. Consider a special scenario: the agent has detected a service-node change but has not yet written the upstream file, and at that moment Nginx reloads, loading the stale upstream file. The dyups API is then called to announce that service nodes need updating. The update is put into shared memory, which works like a mailbox: each worker deletes the message after consuming the update. During a reload this can go wrong: with, say, six worker processes, the update may be consumed by an old worker process, so the new workers never see it, and some of the new workers end up updated while others do not.
At present, all Nginx reload, start, and stop operations, as well as node injection, are handed over to the agent, which guarantees that reloads and dyups API calls are serialized.
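That serialization can be sketched as a single agent thread draining one operation queue. The `Agent` class and operation names below are illustrative, not Meituan's actual agent:

```python
import queue
import threading

class Agent:
    """Serialize reloads and dyups updates through one queue, one worker."""

    def __init__(self):
        self.ops = queue.Queue()
        self.log = []                        # order in which ops were applied
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()

    def submit(self, op):
        self.ops.put(op)

    def _run(self):
        while True:
            op = self.ops.get()
            if op is None:                   # shutdown sentinel
                break
            # Exactly one op runs at a time: a reload can never
            # interleave with a dyups update.
            self.log.append(op)

    def close(self):
        self.ops.put(None)
        self.worker.join()


agent = Agent()
for op in ["dyups_update", "reload", "dyups_update"]:
    agent.submit(op)
agent.close()
print(agent.log)   # ops applied strictly in submission order
```

Because every operation funnels through the same queue, the race described above (a reload observing a half-applied update) cannot occur.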
2. Traffic skew
Every machine updates nodes at the same time, and the initial ordering is identical, which causes traffic skew. For example, there are 100 service nodes online, 25 per machine room. When the nodes are loaded, the order is the same everywhere, so the first node selected is the same on every machine; every request list starts from the same first node, and all traffic at a given moment lands on the same back-end machine.
Our solution is to randomize node initialization inside Nginx's weighted round-robin: an internal random offset ensures that the first node each worker selects is randomized, rather than following the stable order guaranteed by the original dynamic-upstream weighted round-robin.
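A minimal sketch of a round-robin picker with a randomized starting offset (plain round-robin rather than Nginx's smooth weighted algorithm, for brevity; the class is illustrative):

```python
import random

class RoundRobin:
    """Round-robin over nodes, starting from a per-worker random offset."""

    def __init__(self, nodes, seed=None):
        self.nodes = list(nodes)
        rng = random.Random(seed)
        # Each worker starts at a random position instead of node 0,
        # so freshly loaded workers do not all hit the same backend first.
        self.i = rng.randrange(len(self.nodes))

    def pick(self):
        node = self.nodes[self.i]
        self.i = (self.i + 1) % len(self.nodes)
        return node


nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
workers = [RoundRobin(nodes, seed=s) for s in range(8)]
first_picks = {w.pick() for w in workers}
print(first_picks)   # first choices spread across backends, not one node
```

Each worker still cycles through every node with equal frequency; only the starting point differs, which is enough to break the synchronized first pick.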
Structured Nginx configuration management
As shown in the figure above, a site can be created directly on the Oceanus platform; after submission, this is equivalent to creating an Nginx server configuration. An import function is also supported: existing Nginx server configuration files can be imported in real time and landed on the cluster machines.
After the site is created, you can configure mapping rules directly. On the left is the location; on the right is the appkey, the name each service has. Validation rules then check whether the configured location-to-appkey mappings are legal or exceed expectations.

When location rules are complex, with rules inserted in the middle of the list, a business RD configuring rules on the platform can easily make mistakes: you do not know whether the configured rules are correct, whether the traffic you want to divert really reaches the appkey, or whether requests that should not be routed to this service are wrongly directed to it. So a great deal of pre-validation is needed. Meituan's current internal validation generates URLs matching the regular expressions under the existing paths and uses them to test which traffic would land on the newly deployed appkey.

This validation has drawbacks. With many regex-based rules configured, the simulated URLs cannot cover all regex matches, making validation inaccurate. Our current plan is to obtain all back-end services, for example Java services whose controllers specify the business's URLs, filter offline logs for those business URLs to find the real URLs that historically matched each path, and replay the real URLs to see whether they match the service they should.
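The replay idea can be sketched as follows, assuming an ordered table of (location regex, appkey) rules and a log of historical URLs with the appkey that actually served them; all names here are illustrative:

```python
import re

# Ordered mapping rules: the first matching location wins.
rules = [
    (re.compile(r"^/api/pay/.*"), "pay-service"),
    (re.compile(r"^/api/.*"), "gateway-service"),
]

# Historical (url, appkey-that-served-it) pairs mined from offline logs.
history = [
    ("/api/pay/refund", "pay-service"),
    ("/api/user/profile", "gateway-service"),
    ("/api/pay/query", "gateway-service"),   # would be re-routed!
]

def route(url):
    """Return the appkey of the first rule whose location matches url."""
    for pattern, appkey in rules:
        if pattern.match(url):
            return appkey
    return None

# Replay real URLs and flag any whose routing would change.
mismatches = [(url, served, route(url))
              for url, served in history
              if route(url) != served]
print(mismatches)   # [('/api/pay/query', 'gateway-service', 'pay-service')]
```

Replaying real historical URLs covers exactly the traffic the rules will face, which synthetic regex-generated URLs cannot guarantee.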
Directive configuration and traffic statistics
We also support configuring all Nginx directives, including setting headers, setting timeouts, rewrite, custom directives, and some directives we encapsulate ourselves. Per-service performance statistics are supported as well, such as QPS, HTTPS QPS, and 4xx/5xx rates within the service.
Iteration of the load-balancing scheme
The background of the fine-grained traffic-splitting project is a set of online requirements at Meituan. For example, we may want to gray-release a new feature to users in a certain region, or split traffic by percentage or by fixed traffic characteristics and route it to fixed back-end servers, guaranteeing physical isolation between those users and everyone else.
For example, on the right side of the figure above, three servers all belong to service A; two of them are taken as a group, group-g. After the agent obtains the service information, it writes the upstream files in real time: nodes in group-g land in the upstream file for upstream a_gr_g, while upstream a lands as a normal service containing all three servers. When a request carrying a user ID arrives at the front end, we choose a splitting strategy. For example, if the user ID mod 100 equals 1, we want to route the request to the grayscale group group-g. Evaluating this strategy routes user 1001's request to the upstream a_gr_g service, while all other users are filtered out by the strategy and routed to service A.
First, a timer is embedded in one worker process, which periodically pulls the policy configuration. The policies configured in the DB are written structurally into a double-buffered shared memory. When a worker handles a request, it reads policies from shared memory for matching. The granularity of policy matching is host + location + appkey. Policies are divided into public and private: a public policy applies network-wide, while a private policy can be customized for an individual service.
When a request arrives, we take its context and use host + location to find the policy set it needs. A matching public policy takes effect directly; a private policy is looked up further by appkey. For example, after the request arrives, we take its context, find the corresponding policy set via the host + location in the context, and may end up with the policy set in the lower-left corner.
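The host + location + appkey lookup can be sketched with nested dictionaries; the field names and policy names here are illustrative:

```python
# Policies indexed by (host, location); private entries add an appkey level.
policies = {
    ("xxx.meituan.com", "/api"): {
        "public": ["limit-qps"],                 # applies to every appkey
        "private": {"pay-service": ["gray-group-g"]},
    },
}

def lookup(host, location, appkey):
    """Return a request's policies: public ones plus appkey-private ones."""
    entry = policies.get((host, location))
    if entry is None:
        return []
    return entry["public"] + entry["private"].get(appkey, [])

print(lookup("xxx.meituan.com", "/api", "pay-service"))    # public + private
print(lookup("xxx.meituan.com", "/api", "other-service"))  # public only
```

In the real gateway this table lives in shared memory and is double-buffered, so workers read one buffer while the timer writes the other.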
The splitting-and-forwarding process is triggered in the rewrite phase. After a request enters the rewrite phase, the policy data is parsed and the parameters in the request source are extracted in real time; the parameters and the expression are rendered into an expression string, for example:
if ngx.var.xxx % 1000 == 1 then ups = ups .. target_group end
This expression is executed to see whether the splitting policy is hit. If it is, the routing variable ups is overwritten with the specified upstream group; otherwise the upstream is left unmodified.
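The render-and-evaluate step can be sketched in Python (the real gateway renders and runs a Lua snippet inside the rewrite phase; the template syntax and names below are illustrative):

```python
# Policy: route users whose id mod 100 == 1 to the gray group.
policy = {
    "expr": "{user_id} % 100 == 1",   # template rendered per request
    "target_group": "_gr_g",          # suffix appended to the upstream name
}

def pick_upstream(policy, upstream, params):
    """Render the policy expression with request params; a hit picks the gray upstream."""
    rendered = policy["expr"].format(**params)
    # eval() stands in for executing the rendered Lua expression.
    if eval(rendered, {"__builtins__": {}}):
        return upstream + policy["target_group"]
    return upstream

print(pick_upstream(policy, "a", {"user_id": 1001}))  # 'a_gr_g'
print(pick_upstream(policy, "a", {"user_id": 1002}))  # 'a'
```

User 1001 satisfies 1001 % 100 == 1 and is rewritten to the gray upstream; everyone else falls through to the normal upstream.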
In a microservice architecture there are many services and long call chains; if one service has a problem, the whole chain is affected. For example, QA testing often requires several services on a chain to be tested together, sometimes even multiple versions of one service at the same time, and the testing methodology was not rigorous. To let offline QA run stable, concurrent tests, we proposed the concept of the swimlane.
As shown above, there are two QAs. The first QA can create its own swimlane 1, and the second QA its own swimlane 2. The feature QA 1 is testing touches services B, C, and D, so it only needs to set up B, C, and D instances carrying the feature under test and can reuse the rest of the trunk chain. A request entering through the swimlane's domain name is first routed to service A on the trunk chain, then forwarded directly to the B, C, and D services of swimlane 1; since the swimlane deploys no unrelated services, service D then returns to the E and F services on the trunk chain.
The feature QA 2 is testing mainly involves services A and B, so it only needs to deploy instances of A and B. When a request comes in, after flowing through A and B in swimlane 2, it returns to the trunk chain's C, D, E, and F services. This achieves concurrent testing while keeping the trunk chain stable: throughout the process the trunk chain never changes, and the only thing that changes is the content under test.
Multiple coexisting swimlanes allow multiple services and multiple versions to be tested in parallel with faults isolated, greatly improving the service rollout process.
Implementing swimlanes on top of fine-grained splitting is very simple. For example, give service A a label saying it belongs to swimlane S; by the same principle as before, the swimlane IPs go into upstream a_sl_s, while service A itself contains no swimlane machines. Meituan generally tests services using service images: the swimlane chain is created directly with Docker, a swimlane domain name is generated automatically, and requests reach it by visiting the test domain name. The implementation is a Lua swimlane module that inspects the host naming convention and whether a swimlane label is present in the header, to decide whether the request needs to be forwarded to the swimlane's back-end upstream node.
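That decision can be sketched as below. The host suffix convention and the `X-Swimlane` header name are assumptions for illustration, not Meituan's actual conventions:

```python
def pick_upstream(upstream, host, headers, lanes):
    """Route to the swimlane upstream when the host or a header names a lane."""
    lane = headers.get("X-Swimlane")
    # Assumed convention: lane requests use a host like b.sl-s.test.meituan.com
    if lane is None and ".sl-" in host:
        lane = host.split(".sl-")[1].split(".")[0]
    if lane is not None and lanes.get(lane):     # lane exists and has nodes
        return "%s_sl_%s" % (upstream, lane)
    return upstream                              # fall back to the trunk chain


lanes = {"s": ["192.168.1.10"]}                  # swimlane s has one instance
print(pick_upstream("b", "b.sl-s.test.meituan.com", {}, lanes))         # 'b_sl_s'
print(pick_upstream("b", "b.meituan.com", {"X-Swimlane": "s"}, lanes))  # 'b_sl_s'
print(pick_upstream("e", "e.meituan.com", {}, lanes))                   # 'e'
```

A service with no instances in the named lane falls back to the trunk upstream, which is exactly how a lane request re-enters the trunk chain at E and F.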
As the company continues to grow, we built the third load-balancing scheme: unitization. First, some questions. Is your service really horizontally scalable? Is your service physically isolated?
For example, as shown in the figure above, a business line has two clusters, service A and service B, with databases underneath, sharded by database and table; the services are distributed as well. Is this a horizontally scalable service?
Service clusters A and B each have n service nodes. When a node is added to cluster B, every node in cluster A establishes a connection to it, building a long-connection pool. Long-connection resources therefore cannot scale horizontally: every added machine brings n more long connections. The most serious case is the DB: the primary database is usually a single point, and even with sharding all write requests go to the primary, whose long connections are limited. How do you keep its connection count within a controllable range?
Another problem is that any abnormal node can affect all users. If node N of service cluster B has a problem, any request from cluster A may be forwarded to node N, which means any user's request may be affected. So the distributed system looks horizontally scalable, but it is not.
To solve these problems, we proposed unitization. Based on the traffic characteristics of users, all requests are confined within a service unit, usually divided by region. Services within a unit still call each other in a distributed fashion, but there are no calls across units. Where a node in service cluster A used to connect to every node in service cluster B, it now holds long connections only to services within its own unit, so the number of connections drops to 1/N of the original. Meanwhile, a user's traffic forms a closed loop within one unit, achieving complete isolation. Of course, in practice unitization has preconditions, such as the data distribution of the DB: if the DB cannot be partitioned by unit, unitization cannot be done.
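The connection-count saving is simple arithmetic; the cluster sizes below are made-up illustrations:

```python
# Long connections from cluster A (n nodes) to cluster B (m nodes).
n, m, units = 100, 100, 4

full_mesh = n * m                          # every A node connects to every B node
per_unit = (n // units) * (m // units)     # mesh inside one unit only
unitized = per_unit * units                # total across all units

print(full_mesh, unitized)                 # 10000 vs 2500: 1/units of the original
```

The same factor applies to the DB primary within each unit: it only accepts connections from its own unit's service nodes.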
The Oceanus gateway layer implements unit routing by reusing the message-transformation module, which supports modifying, deleting, or adding a header or GET parameter based on some header or GET parameter.
In the example above, a request from the app carries regional characteristics; users in Beijing may carry location IDs such as 01001, 01002, and 01003. When the request arrives, we consult a mapping table. Unlike the earlier fine-grained splitting, which was expression-based, filtering here goes through a routing table: if location 01001 maps to set ID set1, a header named set1 is added directly to the 01001 user's request. This is the message transformation: Beijing users get a set1 identifier added at the gateway layer. After that, we can reuse the earlier fine-grained splitting scheme and forward set1 requests to the set1 group, implementing unit routing at the front end.
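The two steps, location-to-set mapping followed by the reused splitting on the injected identifier, can be sketched as follows; the table contents and the `X-Set` header name are illustrative:

```python
# Routing table: location ID -> set ID (unit).
location_to_set = {"01001": "set1", "01002": "set1", "01003": "set1",
                   "02001": "set2"}

def transform(request):
    """Gateway step 1: inject a set header based on the location ID."""
    set_id = location_to_set.get(request["params"]["location_id"])
    if set_id is not None:
        request["headers"]["X-Set"] = set_id
    return request

def route(request, upstream):
    """Gateway step 2: reuse fine-grained splitting on the injected header."""
    set_id = request["headers"].get("X-Set")
    return "%s_%s" % (upstream, set_id) if set_id else upstream


req = {"params": {"location_id": "01001"}, "headers": {}}
print(route(transform(req), "a"))   # 'a_set1': Beijing traffic stays in set1
```

A request with an unmapped location ID gets no header and falls through to the default upstream, so unrecognized traffic is not blocked.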
In the future, Oceanus will further optimize configuration dynamism, especially dynamic locations, because with file-based configuration every location change requires a reload, which is harmful to the online cluster. We also want dynamic plug-in management with hot deployment and upgrades, along with automated operations: for the nearly 1,000 machines Meituan runs online, automated operations free up a great deal of effort. How to quickly build a cluster and migrate sites between clusters is a key task.
Talk video and slides:
Meituan's HTTP Service Governance Practice