Game case study | Evolution and practice of service mesh at Happy Games

Time:2022-1-3

Author

Chen Zhiwei is a level-12 expert backend engineer at Tencent. He is currently responsible for common backend technology R&D and team management at Happy Game Studio, and has extensive experience in distributed microservice architecture and in backend operations and development for games.

Preface

The backend of Happy Game Studio is a distributed microservice architecture. It currently carries a variety of games stably, with tens of millions of DAU and millions of concurrent online users. The original off-cloud architecture grew out of the QQGame backend, and its core has a history of more than 10 years. It uses several self-developed frameworks built for different purposes, together with self-developed basic components. Different service models were derived to fit complex business scenarios, and hundreds of microservices have accumulated over time. The simplified overall architecture is as follows:

Continuing to create more business value on top of such a large backend platform and such a complex, varied business architecture put great pressure on the team. A few of the problems:

  • Machine resource utilization is very low: the average peak CPU utilization in the cluster is below 20%;
  • Service governance capability is insufficient: because several development frameworks and service management methods coexist, maintaining the business and building basic service governance capabilities is expensive;
  • Service deployment is cumbersome, lacks automation, takes a lot of time and effort, and is prone to production incidents;
  • A large number of legacy business services are unmaintained, their visibility is poor, and their quality is hard to guarantee;
  • The overall architecture is complex, onboarding new team members is costly, and maintainability is poor;
  • The annual data-center decommissioning costs a lot of manpower.

In the cloud-native era, taking advantage of the company-wide push to “embrace cloud native”, we combined the capabilities of k8s and Istio in depth and sorted out and split the system module by module. Along the way we migrated both stateful and stateless services to the cloud, adapted protocols and frameworks, made the service models cloud native, migrated data, completed the surrounding cloud service components, and built a DevOps process for cloud services, among many other systematic engineering changes. In the end, with no service interruption and a smooth, compatible transition, we moved the services of the whole architecture onto the cloud and into the mesh.

When choosing the overall architecture and cloud migration approach, we weighed the completeness, scalability, and adaptation and maintenance costs of the various options, and finally chose the Istio service mesh as the technical solution for the cloud migration as a whole.

Next, following the evolution of the original architecture, I will briefly introduce how some of the modules were moved to the cloud.

Upgrading the development framework and architecture for a low-cost, transparent and smooth evolution to the service mesh

To adopt Istio and let services transition smoothly, we made many adaptations in the basic framework and architecture. The result:

  1. Existing business code needs no changes; recompiling is enough to support the gRPC protocol;
  2. Calls between mesh services use gRPC;
  3. Off-cloud services calling mesh services can use either the private protocol or gRPC;
  4. Mesh services calling off-cloud services use gRPC;
  5. Old services can be moved into the mesh smoothly;
  6. Private-protocol requests from the client side remain supported.
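
The calling rules in the list above can be summarized as a small lookup table. The following is only an illustrative sketch: the names Side, ALLOWED_PROTOCOLS and pick_protocol are hypothetical and are not part of the studio's framework, item 3 is read here as a call from an off-cloud service into the mesh, and preferring gRPC when both protocols are allowed is an assumption.

```python
from enum import Enum


class Side(Enum):
    MESH = "mesh"            # service running inside the Istio mesh
    OFF_CLOUD = "off_cloud"  # legacy service still running off the cloud


# (caller, callee) -> protocols permitted by the rules in the list above.
# Calls between two off-cloud services are outside the scope of that list.
ALLOWED_PROTOCOLS = {
    (Side.MESH, Side.MESH): ("grpc",),
    (Side.OFF_CLOUD, Side.MESH): ("grpc", "private"),
    (Side.MESH, Side.OFF_CLOUD): ("grpc",),
}


def pick_protocol(caller: Side, callee: Side, prefer: str = "grpc") -> str:
    """Return the protocol for a call, honouring the preference when allowed."""
    try:
        allowed = ALLOWED_PROTOCOLS[(caller, callee)]
    except KeyError:
        raise ValueError(
            f"call path {caller.value} -> {callee.value} is not covered"
        ) from None
    return prefer if prefer in allowed else allowed[0]
```

For example, pick_protocol(Side.OFF_CLOUD, Side.MESH, prefer="private") keeps the private protocol, while the same preference on a mesh-to-off-cloud call falls back to gRPC.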

Some of these points are described briefly below.

Introducing gRPC into the original architecture

To make fuller use of Istio's service governance capabilities, we introduced the gRPC protocol stack into the existing development framework. To stay compatible with the original private protocol, gRPC is used to wrap the private protocol, and compatibility is handled in the development framework layer and the architecture layer. The structure of the development framework is shown below:

Using meshgate to bridge the mesh and off-cloud services

So that services inside Istio on the cloud can communicate with off-cloud services, we developed the meshgate service to bridge the on-cloud mesh and the off-cloud services.

The main job of meshgate is to register services inside and outside the mesh with each other by proxy, and to convert between gRPC and the private protocol in both directions. The architecture is shown in the following figure:
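
To make the idea of wrapping a private protocol and converting in both directions more concrete, here is a minimal sketch. It does not use the real gRPC library and does not reflect the actual meshgate implementation: the frame layout (command id, sequence number, body length), the Envelope type and the method path format are all hypothetical stand-ins.

```python
import struct
from dataclasses import dataclass
from typing import Tuple

# Hypothetical private frame layout: 4-byte command id, 4-byte sequence number,
# 4-byte body length (all big-endian), followed by the body.
_HEADER = struct.Struct(">III")


@dataclass(frozen=True)
class Envelope:
    """Stand-in for a gRPC message that carries a private-protocol frame."""
    method: str     # gRPC-style method path derived from the command id
    payload: bytes  # the untouched private-protocol frame


def encode_private(cmd: int, seq: int, body: bytes) -> bytes:
    return _HEADER.pack(cmd, seq, len(body)) + body


def decode_private(frame: bytes) -> Tuple[int, int, bytes]:
    cmd, seq, length = _HEADER.unpack_from(frame)
    body = frame[_HEADER.size:_HEADER.size + length]
    if len(body) != length:
        raise ValueError("truncated private frame")
    return cmd, seq, body


def private_to_envelope(frame: bytes) -> Envelope:
    """Off-cloud to mesh: wrap the frame without re-encoding its body."""
    cmd, _, _ = decode_private(frame)
    return Envelope(method=f"/legacy.Bridge/Cmd{cmd}", payload=frame)


def envelope_to_private(env: Envelope) -> bytes:
    """Mesh to off-cloud: validate, then hand back the original frame."""
    decode_private(env.payload)
    return env.payload
```

Carrying the original frame as an opaque payload is one way to keep existing business code unchanged, since only the outer layer differs between the two sides.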

Architecture evolution

Because business services can be recompiled to support gRPC, and services inside and outside the mesh are compatible with each other, both old and new businesses can migrate to the cloud smoothly.

Of course, migration is not just a matter of mindlessly putting things into containers. For each kind of service we do targeted cloud-native adaptation and service quality hardening, improve observability, and ultimately improve maintainability and resource utilization.

Once a service is on the cloud, resources are configured at pod granularity and the service can scale automatically, so there is no need to reserve large amounts of resources for a specific service, and most services can share node resources. Machine utilization therefore improves greatly: overall resource usage can drop by about 60-70%.

Besides saving machine resources, services are deployed declaratively with Helm in one click, which lets k8s maintain service availability better. The architecture also gains the strong service governance capabilities of Istio. Together these improve the DevOps efficiency of the business.

The evolution of the overall architecture is shown in the figure below:

However, attentive readers may notice that once a service is in the mesh, traffic between it and the client has to be forwarded from lotus, the self-developed access cluster, to meshgate, with several protocol conversions and forwarding hops along the way. This increases the performance overhead and latency of the communication path, which is unacceptable for latency-sensitive game scenarios. We therefore urgently needed a gateway access service inside the mesh. The next section introduces how the gateway access layer was rebuilt.

Private-protocol access service in the mesh

lotus, the original self-developed off-cloud access cluster, is a client-facing access service based on long-lived TCP connections carrying the private protocol. It provides service registration, large-scale user connection management, communication authentication, encryption and decryption, forwarding and more.

Besides the communication overhead introduced when services migrate into the mesh, as described above, lotus has other problems:

  1. Operating the lotus cluster is very cumbersome. To avoid a bad experience caused by dropping a connection mid-game, a lotus process that is being stopped has to wait for users to disconnect on their own, while new connections are no longer routed to it. In short, stopping lotus requires draining its existing long-lived connections, which makes lotus updates very slow. By our statistics, each network-wide release of a new lotus version takes several days. When a node fails, is decommissioned or is added, the network-wide configuration policy has to be adjusted by hand in more than ten steps, so overall efficiency is low.

  2. Resource utilization of the lotus cluster is low. Because lotus is the most basic service and is inconvenient to deploy, enough machine resources have to be reserved to absorb changes in business traffic. As a result, the daily peak CPU utilization of lotus is only about 25%.
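
The draining behaviour described in item 1 can be modelled in a few lines. This is a toy sketch rather than lotus code: DrainableNode and its methods are hypothetical names, and real connections are replaced by plain client ids.

```python
class DrainableNode:
    """Toy model of an access node that holds long-lived client connections."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.draining = False
        self.connections = set()

    def accept(self, client_id: str) -> bool:
        # A draining node refuses new connections; the caller picks another node.
        if self.draining:
            return False
        self.connections.add(client_id)
        return True

    def client_disconnected(self, client_id: str) -> None:
        # Clients leave on their own, for example when a match ends.
        self.connections.discard(client_id)

    def start_drain(self) -> None:
        self.draining = True

    def can_stop(self) -> bool:
        # Safe to stop only after draining has started and every client has left.
        return self.draining and not self.connections
```

The time until can_stop() becomes true is bounded by the slowest client to disconnect, which is why a network-wide update of such nodes can take days when it is driven by hand.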

To address this, we built on Envoy, the open-source CNCF project: we added private-protocol forwarding, connected it to the Istio control plane, adapted it to our original business model, and implemented private communication authentication, encryption and decryption, client connection management and other capabilities. This completed the migration of the access service to the cloud. The overall technical framework is shown in the figure below:

After the change, the on-cloud access cluster improved in every respect.

  1. In core business scenarios, the forwarding performance and latency overhead of the private protocol are close to those of the off-cloud environment.
    We ran stress tests for the core business scenarios. With private-protocol support in Envoy, the performance overhead and latency of access forwarding are in the same order of magnitude as a direct off-cloud connection. The measured latencies are shown in the table below:

    Scenario                                    Average latency   P95 latency
    Direct connection, off-cloud                0.38 ms           0.67 ms
    Forwarding between k8s pods                 0.52 ms           0.90 ms
    Istio + TCP forwarding (private protocol)   0.62 ms           1.26 ms
    Istio + gRPC forwarding                     6.23 ms           14.62 ms
  2. It natively supports Istio's service governance capabilities and is closer to the cloud-native way of using Istio;

  3. It is deployed with Helm and managed by a custom controller, which gives one-click rollout and rolling updates. The whole upgrade is automatic and drains connections as part of the update; because draining takes load into account, it is also more efficient.

  4. Because it scales automatically, the access service no longer needs to reserve large amounts of resources, so resource cost drops sharply. After the access cluster moved fully to the cloud, CPU usage fell by 50%-60% and memory usage by about 70%.
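
To put the latency table in item 1 in perspective, the short calculation below expresses each scenario as a multiple of the direct off-cloud connection. The numbers are copied from that table; the dictionary keys and the function name relative_to are just labels chosen for this sketch.

```python
# (average ms, P95 ms), copied from the stress-test table in item 1 above.
LATENCY_MS = {
    "direct off-cloud": (0.38, 0.67),
    "k8s pod to pod": (0.52, 0.90),
    "istio + tcp (private protocol)": (0.62, 1.26),
    "istio + grpc": (6.23, 14.62),
}


def relative_to(baseline: str) -> dict:
    """Return (average, P95) latency of every scenario as a multiple of the baseline."""
    base_avg, base_p95 = LATENCY_MS[baseline]
    return {
        name: (round(avg / base_avg, 2), round(p95 / base_p95, 2))
        for name, (avg, p95) in LATENCY_MS.items()
    }


ratios = relative_to("direct off-cloud")
for name, (avg_x, p95_x) in ratios.items():
    print(f"{name:32s} avg x{avg_x:<6} p95 x{p95_x}")
```

With these figures, the private-protocol path through Istio averages roughly 1.6 times the direct connection, while the gRPC path averages roughly 16 times, which is the gap that motivated forwarding the private protocol natively.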

Architecture evolution

With the on-cloud access cluster in place, the overall architecture evolves as shown in the figure above. Next, we take gamesvr as a representative of strongly stateful game services and briefly describe how it moved to the cloud.

Moving gamesvr to the cloud

Happy Game Studio's games used to be mainly single-match, room-based games (note: the portfolio is now much broader and also includes MMO, open-world, SLG and other genres). The off-cloud gamesvr architecture is shown in the following figure:

However, the off-cloud architecture above has some problems:

  1. Operations are cumbersome. Bringing a single gamesvr online or taking it offline takes more than ten manual steps. Machine decommissioning costs several person-weeks every year and is prone to incidents;
  2. Resource utilization is low. Because scaling out and in is difficult, enough resources have to be reserved for redundant deployment, so CPU utilization is only about 20% even at peak hours;
  3. Disaster recovery is weak: manual intervention is required after a machine goes down;
  4. Match scheduling is inflexible and relies entirely on manually configured static policies.

Therefore, using cloud-native capabilities, we built a single-match gamesvr architecture that is easy to scale, easy to maintain and highly available, as shown in the figure below:

Throughout the migration, users were moved smoothly to the on-cloud mesh gamesvr cluster without stopping the service and without changing the front end. The results:

  1. Resource utilization improved greatly: overall CPU and memory usage fell by nearly two thirds.

  2. Operations efficiency improved greatly. With custom CRDs and controller management, Helm deploys the whole cluster in one click, and bringing instances online or offline is very convenient. A single business project team alone saves nearly 10 person-days every month on gamesvr releases;

  3. gamesvr scales automatically and reliably based on the current load of the cluster and the time series of historical load;

  4. Flexible and reliable single-match scheduling is in place. With simple configuration, a match can be scheduled to different sets according to different attributes. Scheduling also takes load and service quality into account, so that the overall placement is better.
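
Item 3 combines the current load with the historical load series when scaling. The article does not describe the actual algorithm, so the following is only one plausible sketch: it sizes the cluster for the larger of the current load and the historical peak for the same time slot. The function name, the parameters and the 60% utilization target are all assumptions made for illustration.

```python
def desired_replicas(
    current_load: int,
    history_same_slot: list,
    capacity_per_pod: int,
    target_utilization_pct: int = 60,
    min_replicas: int = 2,
    max_replicas: int = 200,
) -> int:
    """Size the cluster for the larger of current load and the historical peak."""
    forecast = max(history_same_slot, default=0)
    expected = max(current_load, forecast)
    # Integer ceiling division avoids floating-point rounding surprises.
    denominator = capacity_per_pod * target_utilization_pct
    needed = -(-expected * 100 // denominator)
    return max(min_replicas, min(max_replicas, needed))
```

For example, with a current load of 3000 matches, a historical peak of 4200 for the same slot and 100 matches per pod, the function asks for 70 pods, so capacity is in place before the expected peak arrives rather than after it.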

Architecture evolution

After gamesvr moves to the cloud, the overall architecture changes as shown in the figure above. Next, let's look at how the CGIs moved to the cloud.

Moving a large number of CGIs to the cloud

We have used CGI under Apache on a large scale as the development framework for operations campaigns. The state of the original CGI business was as follows:

  1. There are many kinds of services: about 350 CGI services are currently deployed in production, and traffic is huge;

  2. The synchronous, blocking process model of CGI gives each process very low throughput. Most CGI services reach only single-digit QPS, and Apache's service scheduling and message dispatch add further overhead;

  3. Resource isolation between CGIs is poor. Because CGIs run as multiple processes on the same machine, a sudden increase in the resource usage of one service can easily affect the CGIs of other services.

Faced with moving a large number of low-performance CGIs to the cloud, we needed a solution with low development cost and low resource overhead. At first we tried simply packaging Apache and the CGIs together into a container, but the resource overhead and deployment model were not ideal, so a more elegant approach was needed.

We then analyzed the traffic distribution of the CGIs and found that 90% of business traffic is concentrated in 5% of the CGIs, as shown in the figure below.

We therefore treated CGIs differently depending on their traffic when moving them to the cloud.
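
This kind of analysis amounts to ranking services by traffic and finding the smallest head set that covers a given share. A minimal sketch follows; the function name head_services and the sample counts used in the usage note are hypothetical and are not the studio's data.

```python
def head_services(traffic: dict, share_pct: int = 90) -> list:
    """Return the smallest set of top services covering share_pct of all traffic."""
    total = sum(traffic.values())
    ranked = sorted(traffic.items(), key=lambda item: item[1], reverse=True)
    head, covered = [], 0
    for name, count in ranked:
        # Stop as soon as the services collected so far reach the target share.
        if covered * 100 >= share_pct * total:
            break
        head.append(name)
        covered += count
    return head
```

With made-up counts such as {"cgi_a": 900, "cgi_b": 50, "cgi_c": 30, "cgi_d": 20}, the head set for a 90% share is just ["cgi_a"], which mirrors the pattern of a few services carrying most of the traffic.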

  1. Head-traffic CGIs are converted to asynchronous coroutines and decoupled from Apache, which improves framework performance by dozens of times.

    • HTTP listening and asynchronous processing are implemented in the framework layer:

      • The framework is adapted with http-parser so that it can listen for and process HTTP itself;
      • The framework is adapted with libco so that its lower layer supports coroutines and thus asynchronous processing;
    • The business layer also needs several adaptations:

      • Global variables are made private or attached to the coroutine object;
      • Resources such as back-end network connections, IO, configuration loading and memory are reused and optimized to improve efficiency;

      In the end, the business side only needs minor adjustments to complete the asynchronous coroutine conversion. Even so, there are too many CGIs, and converting all of them would not be cost-effective.

  2. The remaining long-tail CGIs are packaged together with Apache and moved to the cloud in a single batch using scripts. To improve observability, metrics collection and export were specially adapted to handle the very large number of processes inside a single container.
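
The effect of the coroutine conversion in item 1 can be illustrated with Python's asyncio, used here only as a stand-in for libco, which is a C++ library. In this sketch a single process keeps many requests in flight while each one waits on a simulated backend call, and per-request state is held in a context variable instead of a process-wide global. The names handle, serve and Stats are hypothetical.

```python
import asyncio
import contextvars

# Per-request state lives in a context variable rather than a global, so
# interleaved coroutines cannot overwrite each other's values.
request_id = contextvars.ContextVar("request_id")


class Stats:
    """Records how many requests were in flight at the same time."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak = 0


async def handle(req: int, stats: Stats) -> str:
    request_id.set(req)
    stats.in_flight += 1
    stats.peak = max(stats.peak, stats.in_flight)
    await asyncio.sleep(0.01)  # simulated backend call; yields to other requests
    stats.in_flight -= 1
    return f"done:{request_id.get()}"


async def serve(n: int, stats: Stats) -> list:
    return await asyncio.gather(*(handle(i, stats) for i in range(n)))


stats = Stats()
results = asyncio.run(serve(50, stats))
```

A synchronous, blocking process would handle these 50 requests one after another. Here all of them overlap inside one process, and each still reads back its own request id after the wait.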

Finally, during the migration we made full use of Apache's forwarding mechanism to support gray release and rollback.
After the migration, the overall resource utilization and maintainability of the CGIs improved greatly. Once fully on the cloud, CPU cores are reduced by nearly 85% and memory by about 70%.
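
The article does not detail how traffic was split during the gray release. One common approach, sketched here purely as an assumption, is to hash a stable request key into 100 buckets and send the lowest buckets to the new on-cloud deployment. The function name route and the labels "cloud" and "legacy" are hypothetical.

```python
import zlib


def route(request_key: str, cloud_percent: int) -> str:
    """Send a stable share of keys to the on-cloud deployment."""
    if not 0 <= cloud_percent <= 100:
        raise ValueError("cloud_percent must be between 0 and 100")
    # crc32 is deterministic, so the same key always lands in the same bucket.
    bucket = zlib.crc32(request_key.encode("utf-8")) % 100
    return "cloud" if bucket < cloud_percent else "legacy"
```

Because the bucket of a key never changes, raising the percentage only moves additional keys to the cloud, and setting it back to 0 acts as a rollback that returns all traffic to the legacy deployment.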

Architecture evolution

After the CGI migration, the overall architecture is as shown in the figure above. Next, we introduce how cubedb, the self-developed storage service, was migrated.

Migrating the self-developed storage service

Off the cloud we have tens of terabytes of data in self-developed storage and hundreds of self-built MySQL tables. Maintenance is costly and moving them to the cloud is difficult. Our solution is to “leave professional work to professionals”: migrate the storage to TcaplusDB (a public storage service developed by Tencent IEG) and have it hosted there. The migration steps are, in brief:

  1. Develop an adaptation proxy service, cube2tcaplusproxy in the figure above, which converts the private protocol of cubedb to TcaplusDB, so that new businesses can use TcaplusDB directly for storage;

  2. Synchronize the hot data of the business from the cubedb standby machine. Once synchronization is enabled, TcaplusDB holds the latest data of the business;

  3. Import the cold data into TcaplusDB. If a record already exists in TcaplusDB, that record is the latest and is not overwritten;

  4. Compare the full data sets in MySQL and TcaplusDB, and switch the proxy route after the full comparison has passed several times.

With this approach we achieved a lossless and smooth migration of DB storage.
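
The ordering of steps 2 to 4 matters: hot data is synchronized first, and the cold import never overwrites a record that already exists. The sketch below models both stores as plain dictionaries to show why that ordering is safe. The function names are hypothetical and no real cubedb or TcaplusDB API is used.

```python
_MISSING = object()


def sync_hot(target: dict, key: str, value: str) -> None:
    """Step 2: live writes are replicated and always win."""
    target[key] = value


def import_cold(target: dict, cold: dict) -> int:
    """Step 3: copy cold records, skipping keys the target already holds."""
    imported = 0
    for key, value in cold.items():
        if key not in target:
            target[key] = value
            imported += 1
    return imported


def diff(source: dict, target: dict) -> list:
    """Step 4: list every key whose value differs between the two stores."""
    keys = set(source) | set(target)
    return sorted(
        k for k in keys
        if source.get(k, _MISSING) != target.get(k, _MISSING)
    )
```

If a record is updated while the cold snapshot is being imported, the newer value is already in the target through hot synchronization, so the stale snapshot value is skipped. The route is switched only once diff returns an empty list on repeated runs.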

Architecture evolution

With the self-developed storage service migrated, most services can be moved to the cloud. We have also built and adopted many supporting capabilities on the cloud, such as a unified cloud configuration center, Grafana as code, Prometheus, a log center, and call-chain tracing with traffic coloring.

The final architecture evolved into:

Multi cluster deployment mode

Off the cloud, we run an all-region, all-server architecture in which every game business lives in one cluster. Given our organizational structure and business model, however, we want different business teams to work in their own business k8s clusters after moving to the cloud, with shared services managed in a public cluster. This means more adaptation work during the migration.

At the Istio level, our Istio service is hosted by the TCM (Tencent Cloud Mesh) team. With strong support from the TCM team, and given our organizational structure and business model, control-plane information is exchanged across multiple clusters under Istio, so calls between clusters are very cheap. The following is the TCM-related backend architecture:

Summary

In summary, for a complex game business architecture, careful analysis, continuous refactoring and evolution based on cloud-native technology, and deep use of k8s and Istio capabilities allowed us to move the architecture onto the cloud and into the mesh stably, smoothly and with high quality. We now have a microservice framework that supports multiple frameworks and languages, together with automation, service discovery, elastic scaling, service control, traffic scheduling and governance, and multi-dimensional metrics and monitoring, and we have accumulated cloud migration experience across the various business scenarios of our games. The reliability, observability and maintainability of business modules have improved greatly, and overall development and operations efficiency has improved significantly.

Happy Game Studio runs several nationally popular chess and card games, such as Happy Landlords, Happy Mahjong and Happy Upgrade, and is developing games in other genres such as open-world, MMO and SLG. It is currently hiring for a large number of R&D, design and art positions. You are welcome to recommend candidates or submit a resume through the recruitment link.
