Building a cloud native machine learning platform based on WeDataSphere Prophecis and KubeSphere


Hello, friends of the KubeSphere open source community. I’m Zhou Ke, an engineer on WeBank’s big data platform team. Today I’ll share how we built Prophecis, a cloud native machine learning platform, on top of products from two open source communities, WeDataSphere and KubeSphere.

What is Prophecis?

First, what is Prophecis in WeDataSphere? Its name means “prophecy”.

Prophecis is a one-stop machine learning platform developed by WeBank’s big data platform team. On top of a multi-tenant, high-performance container computing platform managed by KubeSphere, we built a machine learning platform for data scientists, algorithm engineers, and our IT operations staff.
At the top, in the interactive interface layer, there is a machine learning application development interface for ordinary users and a management interface for our operations administrators. The administrator interface is essentially a customization of KubeSphere. The middle service layer holds the key services of the platform, mainly:

  • Prophecis Machine Learning Flow: a distributed machine learning modeling tool that supports single-machine and distributed model training, works with TensorFlow, PyTorch, XGBoost and other machine learning frameworks, and covers the complete pipeline from modeling to deployment;
  • Prophecis MLLabis: a machine learning development and exploration tool. It is an online IDE based on JupyterLab that supports machine learning modeling tasks on GPU and Hadoop clusters, supports Python, R and Julia, and integrates debugging and TensorBoard plug-ins;
  • Prophecis Model Factory: a machine learning model factory that provides model storage, model deployment and testing, model management and other services;
  • Prophecis Data Factory: a machine learning data factory that provides feature engineering tools, data annotation tools, material management and other services;
  • Prophecis Application Factory: a machine learning application factory, jointly built by WeBank’s big data platform team and AI department as a customization of QingCloud’s open source KubeSphere, providing CI/CD and DevOps tools as well as monitoring and alerting for GPU clusters.

The bottom layer is the high-performance container computing platform managed by KubeSphere.

When building a machine learning platform like this for financial or Internet scenarios, we had two considerations:

The first is one-stop: the tooling should be complete, giving users a full toolchain that covers the whole pipeline of machine learning application development.

The second is full connectivity. Machine learning application development has a major pain point. You may have seen the well-known diagram from Google: perhaps 90% of the work lies outside machine learning itself, and actual model tuning may account for only 10%.

That is because the data processing that comes first involves a great deal of work. So one of our tasks is to connect the Prophecis service components, through plug-ins, with the toolchain that WeDataSphere currently provides: the scheduling system Schedulis, the data middleware DataMap, the computing middleware Linkis, and DataSphere Studio, the data application development portal. The result is a fully connected machine learning platform.

Introduction to Prophecis functional components

Next, I’ll briefly introduce the functions of each component of Prophecis.

The first is MLLabis, the component we have already released to the open source community. It is similar to the SageMaker Studio that AWS provides for machine learning developers.

We did some custom development on Jupyter Notebook. The overall architecture is shown in the figure in the upper left corner. There are two core components. One is the notebook server (a RESTful server), which provides the APIs for notebook lifecycle management. The other is the notebook controller (the Jupyter notebook CRD), which manages notebook state.

When creating a notebook, a user only needs to select a Kubernetes namespace they have permission for and set the parameters the notebook needs at runtime, such as CPU, memory, GPU, and the storage to mount. If everything is normal, the notebook pod starts in that namespace and begins serving.
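As a rough illustration of the parameters just listed, the sketch below assembles a notebook custom resource as a plain Python dict. The talk does not give the actual schema of the notebook CRD, so the layout here is only an assumption modeled loosely on the Kubeflow Notebook resource; the function name, image name, mount path and volume name are all hypothetical.

```python
def build_notebook_manifest(name, namespace, cpu, memory, gpu=0, pvc_name=None):
    """Build a hypothetical Notebook custom resource as a dict.

    The apiVersion/kind pair is an assumption borrowed from Kubeflow;
    the real notebook CRD used by the platform may differ.
    """
    limits = {"cpu": str(cpu), "memory": memory}
    if gpu > 0:
        # Standard extended-resource name exposed by the NVIDIA device plugin.
        limits["nvidia.com/gpu"] = str(gpu)

    container = {
        "name": name,
        "image": "example.registry/notebook:latest",  # placeholder image
        "resources": {"requests": dict(limits), "limits": limits},
    }
    pod_spec = {"containers": [container]}

    if pvc_name:
        # Mount the user's storage into the notebook container.
        container["volumeMounts"] = [{"name": "workspace", "mountPath": "/workspace"}]
        pod_spec["volumes"] = [
            {"name": "workspace", "persistentVolumeClaim": {"claimName": pvc_name}}
        ]

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Notebook",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"template": {"spec": pod_spec}},
    }


manifest = build_notebook_manifest("nb-demo", "team-a", cpu=2, memory="4Gi", gpu=1, pvc_name="nb-data")
```

In a setup like this, the notebook server would submit such an object to the cluster and the notebook controller would reconcile it into a running pod in the chosen namespace.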

We added an enhancement here: a component called LinkisMagic. If you know our WeDataSphere open source products, you know Linkis, which provides computing governance for the big data platform. It connects all the underlying computing and storage components so that data applications can be built on top.

By calling the Linkis interface, LinkisMagic can submit data processing code written in Jupyter Notebook to the big data platform for execution. The processed feature data can then be pulled into the notebook’s mounted storage through the Linkis data download interface, so that training can be accelerated with GPUs on our container platform.
For storage, MLLabis currently offers two options: Ceph, and HDFS on our big data platform. For HDFS, we mount the HDFS client and the HDFS configuration files into the container and control permissions, so that code in the container can interact with HDFS.
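One way to picture the HDFS mounting described above is the following sketch, which patches a pod spec so that every container sees the Hadoop configuration. It is only an assumption about how such a mount could be wired: the ConfigMap name, mount path and user name are hypothetical, and the HDFS client itself is assumed to be available in the container image. `HADOOP_CONF_DIR` and `HADOOP_USER_NAME` are standard Hadoop environment variables.

```python
import copy


def add_hdfs_access(pod_spec, config_map="hadoop-conf", hdfs_user="hadoop"):
    """Return a copy of pod_spec with Hadoop config mounted into each container.

    config_map and hdfs_user are hypothetical defaults for illustration.
    """
    spec = copy.deepcopy(pod_spec)  # leave the caller's spec untouched
    conf_dir = "/etc/hadoop/conf"

    # Expose core-site.xml / hdfs-site.xml from a ConfigMap as a volume.
    spec.setdefault("volumes", []).append(
        {"name": "hadoop-conf", "configMap": {"name": config_map}}
    )
    for container in spec["containers"]:
        container.setdefault("volumeMounts", []).append(
            {"name": "hadoop-conf", "mountPath": conf_dir, "readOnly": True}
        )
        env = container.setdefault("env", [])
        env.append({"name": "HADOOP_CONF_DIR", "value": conf_dir})
        # Identity used by the HDFS client when simple authentication is enabled.
        env.append({"name": "HADOOP_USER_NAME", "value": hdfs_user})
    return spec


base = {"containers": [{"name": "notebook", "image": "example.registry/notebook:latest"}]}
patched = add_hdfs_access(base, hdfs_user="alice")
```

Setting the user per notebook is one simple way to tie HDFS permissions to the tenant; a secured cluster would more likely rely on Kerberos credentials instead.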

This is the MLLabis notebook list page.

This is the notebook interface reached from the list page.

Next, another component: MLFlow.

MLFlow is our distributed machine learning experiment management service. It can manage a single modeling task, and by connecting to our one-stop data development portal DataSphere Studio it can also build a complete machine learning experiment. Experiment tasks are managed and run either on the container platform through job controllers (TF operator, PyTorch operator, XGBoost operator, etc.) or on the data platform through Linkis.
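To make the job controller path more concrete, here is a minimal sketch of the kind of object a TF operator consumes: a Kubeflow `TFJob` with a set of worker replicas. How MLFlow actually generates its jobs is not described in the talk, so the helper name, image and replica counts are assumptions for illustration only.

```python
def build_tfjob(name, namespace, image, workers=2, gpus_per_worker=1):
    """Build a Kubeflow TFJob manifest (as a dict) for distributed training."""
    resources = {}
    if gpus_per_worker > 0:
        resources = {"limits": {"nvidia.com/gpu": str(gpus_per_worker)}}

    worker = {
        "replicas": workers,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {
                        # The TF operator expects the training container
                        # to be named "tensorflow".
                        "name": "tensorflow",
                        "image": image,
                        "resources": resources,
                    }
                ]
            }
        },
    }
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"tfReplicaSpecs": {"Worker": worker}},
    }


# Hypothetical experiment node: two GPU workers in a tenant namespace.
job = build_tfjob("mnist-train", "team-a", "example.registry/mnist:latest", workers=2)
```

A CPU task type would correspond to calling the helper with `gpus_per_worker=0`, and a single-machine run to `workers=1`.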

MLFlow interacts with DataSphere Studio (DSS) through AppJoint. This lets it reuse the workflow management that DSS provides, and also lets an MLFlow experiment be attached as a sub-workflow to a DSS big data workflow, forming a pipeline from data preprocessing through machine learning application development.

This is a data science workflow made up of data processing and machine learning experiments.

This is the DAG interface of an MLFlow machine learning experiment. It currently offers two task types, GPU and CPU, and supports single-machine and distributed execution of TensorFlow, PyTorch, XGBoost and other framework tasks.

Next is our machine learning model factory, Model Factory. Once a model is built, how do we manage it, version it, manage its deployment, and validate it? That is what Model Factory is for.

This service is mainly a secondary development on Seldon Core and provides model explanation, model storage and model deployment. Note that its service interface can also be attached to MLFlow as a node in a machine learning experiment, so that a trained model can be deployed quickly through interface configuration and then validated.
Also note that if we only need to validate a single model, we mainly use the Helm-based deployment capability provided by Model Factory (MF). If we are building a complex, production-ready inference engine, we still use the CI/CD and microservice governance capabilities of KubeSphere to build and manage the model inference services.
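For the single-model validation case, a deployment on plain Seldon Core can be as small as the sketch below, which builds a `SeldonDeployment` with one predictor. The talk does not show what Model Factory layers on top of Seldon Core, so this is only an assumption of the underlying object; the model URI, names and the choice of the prepackaged scikit-learn server are hypothetical.

```python
def build_seldon_deployment(name, namespace, model_uri, replicas=1):
    """Build a minimal SeldonDeployment manifest (as a dict) for one model."""
    return {
        "apiVersion": "machinelearning.seldon.io/v1",
        "kind": "SeldonDeployment",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "predictors": [
                {
                    "name": "default",
                    "replicas": replicas,
                    # A single-node inference graph served by Seldon's
                    # prepackaged scikit-learn server.
                    "graph": {
                        "name": "classifier",
                        "implementation": "SKLEARN_SERVER",
                        "modelUri": model_uri,
                    },
                }
            ]
        },
    }


# Hypothetical model location; any storage Seldon can read would do.
deployment = build_seldon_deployment("credit-model", "team-a", "s3://models/credit/v1")
```

A production inference engine with several steps would replace the single `graph` node with a tree of children, which is where the heavier CI/CD and governance tooling mentioned above becomes useful.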

The next component is Data Factory.

In Data Factory, data discovery services collect basic metadata from Hive, MySQL, HBase, Kafka and other data components, and provide data preview and data lineage analysis, so that data scientists and modelers know what data is available and how to use it. In the future we will also provide data annotation or data crowdsourcing tools so that our data developers can complete annotation work.

The last component is the machine learning Application Factory.

As just mentioned, when we build a complex inference service for complex models, a simple single-container service is not enough. We need a complete, DAG-like inference process, and that calls for the ability to manage more complex container applications.

Application Factory is based on KubeSphere. Once the models are ready, we use the CI/CD workflow provided by KubeSphere to carry out the overall model application release process. After a model service goes online, we use the operations tools KubeSphere provides to operate, maintain and manage the services of the various business parties.

KubeSphere application practice

Next, let’s look at how KubeSphere is applied in practice at WeBank.

Before we introduced KubeSphere, the main problems we faced were operational. We managed our Kubernetes clusters with home-grown scripts and Ansible playbooks, including development and test clusters on the public cloud and several production clusters on the bank’s private cloud. With limited operations staff, managing all of this was very complicated. The models we build serve banking business, some of them risk control, so the availability requirements for the overall service are high. We had to work out tenant management, resource usage control, and a complete monitoring system for all business parties. In addition, the Kubernetes Dashboard offers very little management capability, so we wanted an easy-to-use management interface to make our operations staff more efficient.

So we built our machine learning container platform on KubeSphere.

The overall service architecture is broadly similar to KubeSphere’s current API architecture. A user request first reaches the API gateway, which locates the target service and dispatches the request to the corresponding microservice; these services are the components just introduced. Management of the container platform that the services depend on comes from KubeSphere: CI/CD, monitoring, log management, and code scanning tools. We made some modifications on top of this solution, but not many overall, because the capabilities of open source KubeSphere basically meet our needs.

Internally we use KubeSphere v2.1.1. Our main modifications are:

  • Monitoring and alerting: we connected KubeSphere notifications to the bank’s internal monitoring and alerting system, and associated container instance configuration with the business information managed in our CMDB, so that when a container becomes abnormal, the alert message sent through our alerting system tells us which business system is affected;

  • Resource management: we made a small extension to KubeSphere namespace resource quota management to support GPU quotas per namespace, which limits the base and maximum GPU resources available to each tenant;

  • Persistent storage: we mount key service storage in containers onto the bank’s highly available distributed storage (Ceph) and database (MySQL) to keep data storage secure and stable.
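The GPU quota item in the list above can be approximated with a standard Kubernetes `ResourceQuota`, which accepts extended resources such as GPUs under the `requests.` prefix. The sketch below covers only the maximum cap; how the platform expresses the base allocation is not described in the talk, and the quota name and helper functions are hypothetical.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # resource name exposed by the NVIDIA device plugin


def build_gpu_quota(namespace, max_gpus):
    """Build a ResourceQuota manifest (as a dict) capping GPUs in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": "gpu-quota", "namespace": namespace},
        # Extended resources can only be limited through "requests.<name>".
        "spec": {"hard": {f"requests.{GPU_RESOURCE}": str(max_gpus)}},
    }


def fits_quota(quota, used_gpus, requested_gpus):
    """Check whether a new request would stay within the namespace cap."""
    hard = int(quota["spec"]["hard"][f"requests.{GPU_RESOURCE}"])
    return used_gpus + requested_gpus <= hard


quota = build_gpu_quota("team-a", max_gpus=4)
can_start = fits_quota(quota, used_gpus=3, requested_gpus=1)  # exactly at the cap
```

Kubernetes enforces the cap itself at admission time; a check like `fits_quota` is only useful for showing a tenant, in the console, whether a job would be accepted before it is submitted.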

This is a management interface for our test environment.

This is what I just described. We do two things. First, we tie the monitored objects to the bank’s CMDB, so that when an alert fires we can look up the association and know which business systems the alerting instance affects. Second, once an exception occurs, we call our alerting system. Shown here is an alert delivered through WeCom (Enterprise WeChat); alerts can also be sent by WeChat, phone call, or e-mail.

The part above shows our GPU resource quota customization.

This is our log query interface based on KubeSphere.
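The CMDB association amounts to enriching each alert with business information before it is sent out. The sketch below shows the idea with an in-memory lookup; the CMDB records, field names and message format are all hypothetical, since the talk does not describe the internal systems in detail.

```python
# Hypothetical CMDB extract: container instance name -> business information.
CMDB = {
    "model-serving-7d9f": {"system": "risk-control-scoring", "owner": "team-risk"},
}


def enrich_alert(alert, cmdb):
    """Return a copy of the alert annotated with the affected business system."""
    record = cmdb.get(alert["instance"])
    enriched = dict(alert)
    enriched["business_system"] = record["system"] if record else "unknown"
    enriched["owner"] = record["owner"] if record else "unknown"
    return enriched


def format_message(alert):
    """Render the enriched alert as a one-line notification."""
    return (
        f"[{alert['severity'].upper()}] {alert['instance']}: {alert['summary']} "
        f"(affects: {alert['business_system']}, owner: {alert['owner']})"
    )


raw = {"instance": "model-serving-7d9f", "severity": "critical", "summary": "container restarting"}
message = format_message(enrich_alert(raw, CMDB))
```

The same formatted message could then be handed to whichever channel the alerting system supports, such as a chat message, a phone call or e-mail.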

Next, the outlook. Because our manpower is very limited, development of all these components puts us under considerable pressure. We are still working on KubeSphere v2.1.1; next, we will look at adapting our existing development work to KubeSphere 3.0.

Second, KubeSphere does not yet provide GPU monitoring and metrics management. We will consider migrating some of what we have built, including some interface capabilities, into the KubeSphere console.

Last is the containerization of all WeDataSphere components on KubeSphere. We ultimately hope every component will be containerized, further reducing operations and management costs and improving resource utilization.

About WeDataSphere

Having said that, let me briefly introduce WeDataSphere, WeBank’s big data platform.

WeDataSphere is a complete, financial-grade, one-stop machine learning and big data platform suite built by our big data platform team. It covers the functional components from data application development through middleware to the underlying layer, together with the platform’s operations management portal, security controls, and a full set of operations support capabilities.

The components that are not grayed out have already been open sourced. If you are interested, please follow them.

Looking ahead for WeDataSphere and KubeSphere: our two communities have officially announced an open source partnership.

We plan to containerize all components of the WeDataSphere big data platform and contribute them to the KubeSphere App Store, helping users complete lifecycle management and release of our components and applications quickly and efficiently.

You are welcome to follow our open source project Prophecis and the WeDataSphere open source community assistant account. If you have any questions about this open source cloud native machine learning platform, feel free to get in touch with us. Thank you.

About KubeSphere

KubeSphere is a container hybrid cloud built on Kubernetes. It provides full-stack automated IT operations and simplifies enterprise DevOps workflows.

KubeSphere has been adopted by thousands of enterprises at home and abroad, including Aqara Smart Home, Benlai Life, Sina, PICC Life Insurance, Huaxia Bank, SPD Silicon Valley Bank, Sichuan Airlines, Sinopharm Group, WeBank, Zijin Insurance, Radore and ZaloPay. KubeSphere provides an operations-friendly, wizard-style interface and rich enterprise-grade features, including multi-cloud and multi-cluster management, Kubernetes resource management, DevOps (CI/CD), application lifecycle management, microservice governance (service mesh), multi-tenant management, monitoring and logging, alerting and notification, storage and network management, and GPU support, helping enterprises quickly build a powerful, feature-rich container cloud platform.

