In 2017, New Oriental began exploring containerization to deliver middleware as a service, running Elasticsearch (ES) on Rancher 1.6; in 2019 it expanded the effort, running Kafka, ES, and Redis on Kubernetes. What problems did New Oriental encounter along the way, and what comes next?
Jiang Jiang, Head of Platform Architecture, New Oriental Information Management Department
Chen Boxuan, Senior Operations Engineer, New Oriental Information Management Department
Delivering middleware as a service with containers has markedly improved the efficiency of our operations team and greatly shortened the software delivery process. This article shares New Oriental's experience with middleware as a service.
From kindergarten through primary school, middle school, university, and study abroad, New Oriental is involved in almost every field of education. Our product line is long and complex. So what IT capability supports such a long education line? New Oriental Cloud.
At present we have 16 cloud data centers, including self-built and rented IDCs, and we connect directly to Alibaba Cloud and Tencent Cloud via cloud networking, forming a hybrid architecture spanning multiple cloud providers. New Oriental's cloud estate is unusual: it contains relatively traditional pieces such as SQL Server, Windows, and counter-service programs, alongside newer things such as TiDB, containers, and microservices, and even Internet-style applications such as dual-teacher video classrooms and interactive live streaming. An enterprise's IT architecture closely tracks its business stage, and New Oriental, like thousands of enterprises moving from traditional business to "Internet Plus", is at a critical stage of digital transformation.
Next, let's talk about containerization at New Oriental. We have been working with containers for years. In 2016 we tried some commercial solutions based on Docker Swarm, with unsatisfactory results. 2017 was a year of great change in container orchestration, and we chose Rancher's Cattle engine to begin building our container platform independently while keeping an eye on industry trends. In 2018 our container platform evolved again, finally settling on Kubernetes (k8s).
So how do we view k8s at New Oriental? We see k8s as the middle layer between PaaS and IaaS: it defines interfaces and specifications for the IaaS layer below and the PaaS layer above, without dictating how functionality is implemented. k8s alone is far from a complete container cloud, so we introduce other open-source components to round it out.
As the figure above shows, we supplement the k8s ecosystem with various open-source components and combine them into New Oriental's current container cloud platform.
Our runtime is Docker on Ubuntu hosts; the k8s network component is Canal, accelerated with Mellanox NICs. Rancher 2.0 serves as our k8s management platform, providing multi-tenant management, visualization, and AD-domain integration for permissions. It spares us a great deal of back-end integration work and gives us a stable graphical management platform while saving manpower.
Now for our k8s practice. As the figure above shows, we run the unmodified community version of k8s, deployed as a three-node HA control plane using the kubeadm tool with NGINX stream load balancing in front.
Key cluster components, such as the Ingress controller, run in host-network mode, which cuts network overhead and yields better performance; the upper-layer applications run on an overlay container network built with Flannel.
Using containers necessarily means managing images. New Oriental is an early Harbor user; we have run it since version 1.2, with Ceph object storage as the back end. We are currently trying out image distribution with Alibaba's open-source Dragonfly, which turns north-south download traffic into east-west traffic so that nodes can replicate images among themselves. At large cluster scale, this relieves the load that image pulls place on the Harbor server.
Our k8s clusters run entirely on physical machines. As the clusters grow, so does the number of machines, and we must use bare-metal management software to keep operations costs down.
Here we use Ubuntu's MAAS (Metal as a Service), a bare-metal management platform that installs a standard operating system onto blank physical machines from a template. We then initialize each node with Ansible playbooks, turning it into the kind of machine we need and adding it to a cluster.
As the figure above shows, a standard physical machine becomes a TiDB node by applying the TiDB role, or a k8s node by applying the k8s role. Every role also pushes osquery and Filebeat onto the machine; these collect machine information and feed it to our CMDB for asset management.
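The role flow above can be sketched as a tiny inventory generator. This is a minimal illustration, not our actual tooling: group names and host names are hypothetical, and the shared `[agents]` group stands in for the osquery/Filebeat push applied to every node.

```python
# Minimal sketch (hypothetical names): render an Ansible-style INI inventory
# that assigns each standard physical machine a role (tidb or k8s) and lists
# every host under a shared group for the asset-collection agents
# (osquery + filebeat) that all roles receive.

def render_inventory(hosts_by_role):
    """hosts_by_role: {'tidb': [...], 'k8s': [...]} -> INI inventory text."""
    lines = []
    for role, hosts in sorted(hosts_by_role.items()):
        lines.append(f"[{role}]")
        lines.extend(hosts)
        lines.append("")
    # every machine, whatever its role, also gets the asset-collection agents
    all_hosts = sorted(h for hosts in hosts_by_role.values() for h in hosts)
    lines.append("[agents]")
    lines.extend(all_hosts)
    return "\n".join(lines)

print(render_inventory({"tidb": ["pm-01"], "k8s": ["pm-02", "pm-03"]}))
```

In the real flow, playbooks bound to each group would then install the role-specific software and join the machine to its cluster.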
Our CI/CD is split by business line: part of it integrates directly with New Oriental's Jenkins, and the rest uses Rancher Pipeline.
For cluster monitoring we now use the open-source Prometheus Operator. We started with vanilla Prometheus, but configuring target discovery and alerting rules by hand was particularly troublesome.
The Operator simplifies that configuration considerably and is easy to use; we recommend it.
It is worth mentioning that cluster monitoring in Rancher 2.2 and later is also based on the Prometheus Operator; if you are interested, try a recent Rancher release.
Logging is handled at two levels. Business logs are collected by Filebeat running as a sidecar, shipped into a Kafka cluster, and then consumed into ES by Logstash, which smooths the load on ES.
At the cluster level, logs are collected into the ES cluster with Fluentd via Rancher 2.2.
We run five clusters in total: a business cluster split into production and test; a "platform1" cluster running middleware such as ES, Redis, and Kafka, likewise split into production and test; and one sandbox cluster where k8s upgrade iterations, new components, and new features are all tried out.
Attentive readers may notice that our clusters run version 1.14.1, which is very new. Why? Because Kubernetes 1.14 brought a very important feature, Local Persistent Volumes (local PV), to GA. We are very interested in this feature, so we upgraded the clusters all the way to 1.14.
At present, business applications fall into two areas:
- The back-end services of the Palm Bubble app and the New Oriental app run on the container cloud architecture.
- Middleware as a service: cluster-level middleware such as Kafka, Redis, and ES all runs on the container cloud architecture.
Why middleware as a service?
So, why deliver middleware as a service?
In our view, middleware such as ES, message queues, and Redis caches share several traits. Like the monster in the figure, they are big.
Let me make a comparison. A typical virtual machine is 4C/8G; ten of them add up to 40C/80G. Can a 40C/80G machine run a single Elasticsearch node? Only just: in actual production, a high-throughput ES node usually needs more than 100 GB of memory. The example shows that middleware workloads consume a very large amount of resources per instance.
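The comparison above is simple arithmetic; a quick sketch makes the mismatch explicit (the 100 GB figure is the rule of thumb from the text, not a measured requirement):

```python
# Back-of-the-envelope version of the comparison above: how many typical
# 4C/8G VMs fit in a 40C/80G host, and whether that same host comfortably
# runs one high-throughput Elasticsearch node (which, per the text, often
# wants more than 100 GB of memory).

HOST_CORES, HOST_MEM_GB = 40, 80

vm_cores, vm_mem_gb = 4, 8
vms_per_host = min(HOST_CORES // vm_cores, HOST_MEM_GB // vm_mem_gb)

es_node_mem_gb = 100  # rough memory appetite of one high-throughput ES node
fits_comfortably = HOST_MEM_GB >= es_node_mem_gb

print(vms_per_host, fits_comfortably)  # 10 False
```

Ten ordinary application VMs fit where a single ES node does not: that is what "large single-resource consumption" means in practice.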
Middleware is also used everywhere: every application uses Redis, an MQ, and similar components, and each standalone deployment occupies several virtual machines. On top of that, every project wants its own "private kitchen", a dedicated environment of its own, which consumes extra resources; versions and configurations inevitably diverge, so we need many people just to maintain middleware. That is a big problem.
Of course, with around ten projects company-wide, plain virtual machines would be fine. But New Oriental now has three to four hundred projects, and middleware consumes considerable resources; on virtual machines alone, the cost would still be very high.
How do we solve this? We fire three arrows: containerization, automation, and service-oriented delivery.
Containerization is the easiest to understand: the divergent configurations just mentioned are unified by containers, and everyone must follow our standard. We deploy containers directly on physical machines for better performance and flexibility.
The step after containerization is automation, which really means infrastructure as code: manage the infrastructure as code and iterate on it online. We use Helm and Ansible for this.
With those two steps done, we can take the third. Constraining everyone with our management norms and best practices may not work very well; the simplest approach is to deliver a service and have everyone consume it.
Gradually, the private kitchens merge into one communal pot, smoothing peaks and filling troughs and avoiding wasted resources. Every company has a few super-VIP projects; these become private kitchens within the communal pot, still on the shared mechanism, but with their own resource isolation and permission isolation.
Before the service model, operations staff were understood mainly as labor supplied to projects: busy every day, with little visible achievement. With the service model, that labor turns into building the service platform, empowering front-line staff and freeing second-line staff to do more meaningful work.
Our practice: ELK / ES
Next, let's walk through how New Oriental orchestrates each piece of middleware.
Elastic has a product called ECE, the industry's first container management platform for ES. Built on k8s 1.7 (later 1.8), it serves users ES instances of various versions, in containers, on physical machines. But it has a limitation: it can only manage ES, not other middleware.
This inspired us a great deal: could we imitate it and build our own service platform with Rancher + Docker? That is how the first version of our platform was born, managing ELK on Rancher 1.6.
The figure above shows our ELK clusters, which currently span Rancher 1.6 and k8s; we are midway through migrating to k8s.
We run two ELK environments, UAT and production. In UAT we use Ceph RBD volumes, and ES nodes start as stateful workloads. The advantage of this scheme is that storage and compute are fully separated: whichever node dies, the pod can drift to wherever it likes.
Our production environment is quite different. We make every ES node its own Deployment, and we do not let it drift: taints and labels pin each Deployment to a specific host. Pod storage no longer uses RBD but writes straight to local disk via hostPath, and we use the host network to get the best performance.
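The pinned-Deployment pattern can be sketched as a manifest builder. This is an illustrative sketch, not our real manifests: the taint key `dedicated=es`, the image tag, and the `/data/es` path are all hypothetical stand-ins for whatever a real cluster uses.

```python
# Sketch (a plain dict, not a client call) of the production ES pod spec
# described above: a single-replica Deployment pinned to one host via
# nodeSelector plus a toleration for the node's taint, hostPath storage
# on local disk, and host networking.

def es_deployment(node_name):
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"es-{node_name}"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": f"es-{node_name}"}},
            "template": {
                "metadata": {"labels": {"app": f"es-{node_name}"}},
                "spec": {
                    "hostNetwork": True,  # best network performance
                    # pin to one machine: no drifting allowed
                    "nodeSelector": {"kubernetes.io/hostname": node_name},
                    "tolerations": [{  # the node is tainted for middleware only
                        "key": "dedicated", "operator": "Equal",
                        "value": "es", "effect": "NoSchedule",
                    }],
                    "containers": [{
                        "name": "elasticsearch",
                        "image": "elasticsearch:6.8.0",  # illustrative tag
                        "volumeMounts": [{
                            "name": "data",
                            "mountPath": "/usr/share/elasticsearch/data",
                        }],
                    }],
                    "volumes": [{  # local disk instead of RBD
                        "name": "data",
                        "hostPath": {"path": "/data/es",
                                     "type": "DirectoryOrCreate"},
                    }],
                },
            },
        },
    }
```

One Deployment per ES node, each generated for a named host, is what replaces the StatefulSet-plus-RBD scheme used in UAT.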
What if a node dies? It waits for local resurrection: if the machine hangs, we restart it in place, or replace the disk or other hardware. And if it cannot be revived? Our machine management pulls a fresh machine from the pool, brings it online, and ES replication copies the data back.
You may wonder: why two schemes, and why is the production layout so ugly?
We believe the simplest architecture is the most beautiful. The fewer components in the path, the fewer failure points, and the more reliable the system. Local disk outperforms RBD, and the host network outperforms the k8s network stack. Most importantly, every middleware application we orchestrate is distributed (or has built-in HA) with its own replication mechanism, so we need no extra protection at the k8s layer at all.
We also tested and compared both schemes. On node failure, restarting locally takes far less time than drifting, and sometimes an RBD volume fails to drift at all, while the probability of a physical node dying outright is still small. So for the online environment we ultimately chose the slightly more conservative scheme.
Our practice: Redis
Our Redis currently runs as a Sentinel scheme, again with Deployments pinned to specific nodes. Our Redis does no persistence at all; it is used purely as a cache. That creates a problem: if the master dies, k8s restarts it immediately, faster than Sentinel can detect the failure. The pod comes back up still a master, but an empty one, and the data on all the remaining replicas is then lost, which is unacceptable.
After much research, and following Ctrip's practice, we start Redis under supervisord when the container boots. Even if the Redis process in the pod fails, the pod is not restarted immediately, giving Sentinel enough time to fail over from master to replica; the cluster is then restored by manual intervention.
To optimize Redis, we bind each instance to a CPU. Redis processes suffer from CPU context switches and NIC soft interrupts, so we restrict the nodes Redis runs on and taint them, pin all the processes the operating system needs to the first n CPUs, and leave the remaining CPUs for Redis. When a Redis instance starts, its process is mapped one-to-one onto a CPU for better performance.
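The core-partitioning idea above can be written as a small planning function. This is a sketch only: on Linux the resulting mapping could be applied with `os.sched_setaffinity` or `taskset`, but here we just compute it, and the instance names and core counts are illustrative.

```python
# Sketch of the CPU-binding scheme: reserve the first n cores for the
# operating system's processes and map each Redis instance one-to-one
# onto the remaining cores.

def plan_affinity(total_cpus, reserved_for_os, redis_instances):
    """Return {instance_name: cpu_index} for the spare (non-OS) cores."""
    spare = list(range(reserved_for_os, total_cpus))  # cores left for Redis
    if len(redis_instances) > len(spare):
        raise ValueError("more Redis instances than spare cores")
    return {inst: cpu for inst, cpu in zip(redis_instances, spare)}

plan = plan_affinity(total_cpus=8, reserved_for_os=2,
                     redis_instances=["redis-a", "redis-b"])
print(plan)  # {'redis-a': 2, 'redis-b': 3}
```

Keeping the OS (and its soft-interrupt handling) on its own cores is what shields each Redis process from context-switch jitter.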
Our practice: Kafka
As everyone knows, Kafka is a high-throughput distributed publish-subscribe messaging system. Compared with other middleware, Kafka offers high throughput, data persistence, and a distributed architecture.
So how does New Oriental use Kafka, and what special requirements do we place on Kafka clusters?
By business scenario, our usage divides into three categories: the first uses Kafka as the message queue of a transaction system; the second uses Kafka as middleware for business logs; the third again uses Kafka as the message queue of a transaction system.
To satisfy these three scenarios, our Kafka must meet security requirements: transaction data, for example, cannot travel in plaintext, so it must be encrypted.
Next, let's talk about Kafka's native security and encryption. How do we do it, and how do we choose?
Outside the financial industry, most Kafka users do not enable its security protocols. Without them, cluster performance is excellent, but that clearly fails New Oriental's requirements, so we turned on data encryption.
We use Kafka's native support: SSL encrypts the channel, turning plaintext transmission into ciphertext; SASL authenticates users; and ACLs control user permissions.
Let's briefly compare the two SASL authentication mechanisms. With SASL/PLAIN, usernames and passwords are written in cleartext in a JAAS file, which is loaded into the Kafka process via a startup parameter. When a Kafka client connects to the broker, it is authenticated against those credentials.
SASL/GSSAPI is based on Kerberos and a KDC. Anyone familiar with AD domains knows Kerberos; AD also uses the Kerberos network security protocol. Here the client talks directly to the KDC server to authenticate.
Each method has its pros and cons. New Oriental chose SASL/PLAIN for a simple reason: it spares us running and maintaining a separate KDC, reducing deployment and operations cost. The catch is that because usernames and passwords are loaded at startup, any change to the file, such as adding a user or changing a password, requires restarting the Kafka cluster.
Restarting a Kafka cluster inevitably disrupts the business, which is unacceptable. So we took a flexible approach: group users by permission and pre-provision a total of 150 users in the JAAS file, and have administrators assign users from the pool to projects. This way, onboarding a new project never forces a cluster restart.
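The pre-provisioning trick can be sketched as a JAAS file generator. This is an illustrative sketch: the admin credentials, usernames, and password scheme are all placeholders, not our real ones, and a real deployment would of course not derive passwords from the username.

```python
# Sketch of pre-provisioning: generate a kafka_server_jaas.conf with a
# fixed pool of users up front, so assigning a user to a new project is
# just bookkeeping and never requires a broker restart.

def render_jaas(users):
    """users: {username: password} -> JAAS file body for SASL/PLAIN."""
    entries = "\n".join(f'  user_{name}="{pw}"'
                        for name, pw in sorted(users.items()))
    return (
        "KafkaServer {\n"
        "  org.apache.kafka.common.security.plain.PlainLoginModule required\n"
        '  username="admin"\n'
        '  password="admin-secret"\n'
        f"{entries};\n"
        "};\n"
    )

# the pre-set pool of 150 users described above (placeholder passwords)
pool = {f"user{i:03d}": f"pw-{i:03d}" for i in range(150)}
conf = render_jaas(pool)
```

The file is generated once, loaded at broker startup, and administrators hand out unused entries from the pool as projects arrive.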
As shown in the figure above, we open two ports on the Kafka cluster: one with user authentication plus SSL encryption, and one with only SASL/PLAIN user authentication and no SSL. Clients connecting to Kafka choose whichever port suits their needs.
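From the client side, choosing a listener is just choosing a config. The sketch below builds such a config as a plain dict; the port numbers and certificate path are illustrative, and the keys happen to match the parameters accepted by common clients such as kafka-python's `KafkaProducer`, though no connection is made here.

```python
# Sketch of listener selection: the SASL_SSL port for encrypted
# transaction traffic, or the SASL_PLAINTEXT port for bulk log traffic
# that only needs user authentication.

def client_config(host, user, password, encrypted=True):
    cfg = {
        "sasl_mechanism": "PLAIN",   # user/password auth in both cases
        "sasl_plain_username": user,
        "sasl_plain_password": password,
    }
    if encrypted:
        cfg["bootstrap_servers"] = f"{host}:9093"  # SSL listener (illustrative port)
        cfg["security_protocol"] = "SASL_SSL"
        cfg["ssl_cafile"] = "/etc/kafka/ca.pem"    # cert downloaded from the platform
    else:
        cfg["bootstrap_servers"] = f"{host}:9092"  # plaintext listener
        cfg["security_protocol"] = "SASL_PLAINTEXT"
    return cfg

# e.g. KafkaProducer(**client_config("kafka.internal", "user001", "pw-001"))
```

Transaction systems take the encrypted branch; high-volume log producers take the plaintext branch and keep full throughput.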
With the architecture settled, on to orchestration. Our Kafka and ZK (ZooKeeper) clusters use the host network, and data volumes are mapped to the local physical machine via hostPath, for the best performance.
Both Kafka and ZK run as single-replica Deployments pinned to nodes; even when something goes wrong, we restart them on the original machine and never let the containers migrate at will.
For monitoring, we use the exporter + Prometheus scheme, running on the overlay container network.
Our practice: Service Platform
When building this service platform, our idea was simple: don't reinvent the wheel; use the existing technology stack and combine Helm, Ansible, and k8s.
Taking Kafka as an example: Ansible generates a Helm chart for the environment. The SSL certificates and the embedded user configuration, for instance, are produced by Ansible from user input; the results are inserted into the Helm chart, and Helm then creates the corresponding instance from that chart.
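The Ansible-to-Helm handoff can be sketched as a values renderer. The field names below (`project`, `kafka_version`, `assigned_users`, and so on) are hypothetical, not our real chart schema; the point is only that user input from the platform form becomes the values injected into the chart.

```python
# Sketch of the handoff: take the form fields a user fills in when
# applying for a cluster and render the values that get injected into
# the Helm chart before `helm install`.

def render_values(request):
    """request: the user's form input -> values dict for the chart."""
    return {
        "cluster": {
            "name": request["project"] + "-kafka",
            "version": request["kafka_version"],
            "replicas": request.get("brokers", 3),  # sensible default
        },
        "auth": {
            "users": request["assigned_users"],  # drawn from the pre-set pool
            "ssl": request.get("ssl", True),     # triggers cert generation
        },
    }

values = render_values({
    "project": "crm",
    "kafka_version": "2.2.0",
    "assigned_users": ["user001", "user002"],
})
```

In the real pipeline, Ansible performs this rendering (including certificate generation) and Helm turns the resulting chart into a running cluster.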
Here are some screenshots from the demo of our platform 1.0.
This is cluster management: deployments to different clusters get separate entries where their state is maintained.
Above are the steps for requesting a service. The whole flow is very simple: just pick a cluster and the desired version.
In the management interface you can see your IPs, access endpoints, and the ports your instance uses (assigned automatically by the platform). For SSL connections, the certificate can be downloaded directly from the page. We will later wire the cluster logs into the platform as well.
The back end is fairly involved: it runs on the Ansible AWX platform. Creating a cluster actually requires many input fields, but the front-end interface generates them for the user.
This is a fully deployed Kafka cluster, including ZooKeeper, Kafka, and the exporters for monitoring. We also configure a Kafka Manager for each cluster, a graphical management console from which Kafka can be administered directly.
Monitoring and alerting are essential. We preconfigured alerting rules on the Prometheus Operator, for example on topic lag. Once a cluster is created, the Operator automatically discovers its endpoints (the exporters we just saw) and enrolls them in alerting, with no manual onboarding needed.
We also generate a visualization panel for each project; when monitoring is needed, just log in to Grafana and look.
The figure above shows a simple load-test result: 512-byte messages with SSL + ACL enabled, five partitions and three replicas, sustaining roughly one million messages per second on five containers of 16 cores and 140 GB of memory each, backed by SSDs. We found that throughput declines as message size grows.
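The headline number is easy to sanity-check, under the assumption (our reading of the slide) that the benchmark used 512-byte messages:

```python
# Rough arithmetic behind the benchmark above, assuming roughly one
# million 512-byte messages per second across the cluster.

msgs_per_sec = 1_000_000
msg_bytes = 512  # assumed message size
throughput_mb_s = msgs_per_sec * msg_bytes / 1024 / 1024
print(round(throughput_mb_s))  # ≈ 488 MB/s across the cluster
```

Spread over five brokers, that is on the order of 100 MB/s per broker, which is why SSD-backed storage matters for this workload.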
The service platform roadmap
That covers some of this year's work. What do we want to do next year?
From fiscal year 2020, New Oriental plans to deliver Redis and ES as services as well, and ultimately to integrate the exposed APIs into a cloud portal for users within the group and for third-party systems to call.
Another thing worth mentioning is the Operator. Last week, Elastic released a new project called ECK, the official Operator for ES.
With an Operator, you only need to submit a CRD (custom resource), and the Operator automatically generates the cluster you need.
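The Operator workflow can be sketched with the shape of such a custom resource. This is an illustrative dict only: the `apiVersion` reflects ECK's early alpha releases and the `nodes`/`nodeCount` schema is our reading of it, so both may differ in later versions.

```python
# Sketch of the Operator workflow: with ECK installed, creating an ES
# cluster is just submitting a custom resource like this one; the
# Operator reconciles the cluster into existence.

elasticsearch_cr = {
    "apiVersion": "elasticsearch.k8s.elastic.co/v1alpha1",  # early ECK
    "kind": "Elasticsearch",
    "metadata": {"name": "demo"},
    "spec": {
        "version": "7.1.0",
        "nodes": [{"nodeCount": 3}],  # three ES nodes; schema illustrative
    },
}
```

Compared with our Helm-plus-Ansible pipeline, all the provisioning logic moves into the Operator's reconciliation loop, which is why we see Operators as the end point.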
We think that although the Helm-based approach greatly simplifies the YAML work, it is not the end point. We believe the end point is the Operator.
This article is based on the talks given by Jiang Jiang and Chen Boxuan of New Oriental at the third Enterprise Container Innovation Conference (ECIC), held by Rancher in Beijing on June 20, 2019. This year's ECIC was large in scale, with 17 keynote speeches over the day, attracting nearly 1,000 container technology enthusiasts on site and more than 10,000 viewers online. More articles are available on the Rancher official account.