Challenges and Responses of AI Technology in Enterprise Intelligent Transformation: The Answer Is MLOps
On December 20, at the 2021 New Generation AI Academician Summit Forum jointly held by the LF AI & Data Foundation and the OpenI Qizhi community, the author shared his views on enterprise intelligent transformation and the challenges and responses of AI technology.
First, what is enterprise intelligent transformation?
The author believes that enterprise intelligent transformation is a higher stage of enterprise digital transformation: the large-scale application of AI within the enterprise. It is not the landing of a single scenario such as face recognition, OCR, or speech recognition, but the use of AI to fundamentally reshape the enterprise's core business flow. The effect is to either completely change the enterprise's business model or greatly improve its core efficiency, thereby achieving intelligent transformation. For example, the core business flow of a chain catering enterprise includes selecting store locations, selecting product categories, distribution and replenishment from headquarters to branches, product promotion, and online recommendation. When the multiple scenarios in this flow are optimized and reorganized with AI, even a modest efficiency gain in each scenario adds up to a huge overall improvement, establishing a significant competitive advantage over rivals and achieving the goal of intelligent transformation.
4Paradigm has formed its own methodology for enterprise intelligent transformation: from quantitative change to qualitative change.
It divides the intelligent transformation of an enterprise into the following four steps:
- Identify the current core pain points of the enterprise
- Analyze core business processes related to pain points (including multiple scenarios)
- Land AI in multiple scenarios and continuously improve the results
- Let the accumulated improvements across scenarios produce a qualitative change
Then conduct a new analysis (the core pain point may have shifted to another one) and repeat the four steps above.
So, what does enterprise intelligent transformation demand of technology?
Simply put, it demands that machine learning land in the scenarios related to the enterprise's core business flow in a way that is "more, faster, better, and cheaper":
More: land multiple scenarios around the key business flow
Fast: each scenario lands quickly and iterates quickly
Good: each scenario's results meet expectations
Cheap: each scenario's landing cost is relatively low and in line with expectations
However, the reality of AI landing in enterprises looks like this:
- Slow landing: putting a model into production often takes several times longer than developing it in the lab. An AI scientist sighed at a sharing session: "It took me 3 weeks to develop the model; more than 11 months later, it is still not deployed." In fact, according to a 2018 report by an analyst firm, more than 89% of models are never deployed online, that is, they never produce practical results.
- Results below expectations: some models work well in offline training, with all metrics meeting expectations; but once deployed online and connected to real online data to serve predictions, their performance drops sharply.
- Results also decay: a model performs acceptably for a while after going online, but degrades over time until it becomes completely unusable. For example, with the outbreak of COVID-19, almost all risk-control models in one country's financial system failed. The pandemic drastically changed people's shopping habits: many people who disliked online shopping were forced into it, so risk-control models built on pre-pandemic consumption habits no longer reflected current patterns and became completely invalid.
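The decay described above can be caught mechanically by comparing the distribution of live feature data against the training-time distribution. Below is a minimal sketch, assuming the Population Stability Index (PSI) as the drift metric; the talk does not prescribe a specific metric, and the sample data are invented:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric feature.

    Rule of thumb: PSI < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(bins + 1)]
    edges[0], edges[-1] = float("-inf"), float("inf")

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            for i in range(bins):
                if edges[i] <= x < edges[i + 1]:
                    counts[i] += 1
                    break
        # Laplace smoothing avoids log(0) for empty buckets
        total = len(sample) + bins
        return [(c + 1) / total for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Pre-pandemic spending amounts vs. shifted live data (toy numbers)
train = [100 + (i % 50) for i in range(1000)]
live = [180 + (i % 50) for i in range(1000)]
print(psi(train, train) < 0.1)   # True: same distribution, stable
print(psi(train, live) > 0.25)   # True: shifted distribution, retrain
```

A monitoring job that runs such a check on each feature and alerts when the index crosses a threshold is one concrete form of the "continuous monitoring" discussed later.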
Why? Because in a machine learning system running in a production environment, the code of the machine learning model itself accounts for only a small part, about 5%.
This figure comes from a famous NIPS 2015 paper, "Hidden Technical Debt in Machine Learning Systems", in which several machine learning experts from Google described the various technical problems of machine learning systems; the ML code is only a small part of the whole system.
And we all know that an AI system = code + data. Data is very important in machine learning, but it is also the most difficult part: "Data is the hardest part of ML and the most important piece to get right… Broken data is the most common cause of problems in production ML systems."
This is what Uber's machine learning engineers wrote in a well-known blog post.
Data brings the following problems and challenges; the author briefly lists a few:
Scale: massive data volumes for training
Low latency: serving must sustain high QPS at low latency
Data change causes model decay: the world changes
Time travel: processing time-series feature data is error-prone (information from the future can leak into training)
Training/serving skew: the data used for training and for prediction are inconsistent
Moreover, real-time data makes all of these challenges harder.
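The "time travel" problem above is worth a concrete illustration: when building a training row for an event at time t, only feature values observed before t may be used; naively joining against the latest value leaks future information and creates skew. A minimal sketch, with an invented per-user observation log:

```python
from bisect import bisect_left

# Hypothetical feature history: (timestamp, account_balance) observations,
# kept sorted by timestamp for each user.
history = {
    "u1": [(1, 100), (5, 300), (9, 50)],
}

def feature_as_of(user, t):
    """Latest observation strictly before time t (point-in-time correct)."""
    obs = history[user]
    i = bisect_left(obs, (t,))  # first index with timestamp >= t
    return obs[i - 1][1] if i > 0 else None

# A training row for an event at t=6 must see the balance as of t=6 ...
print(feature_as_of("u1", 6))  # 300
# ... not the "latest" value, which lies in the event's future:
print(history["u1"][-1][1])    # 50 -- joining on this would be time travel
```

A model trained with the leaked value would look deceptively good offline and then degrade online, which is exactly the training/serving skew listed above.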
This chart is an AI value analysis chart: the horizontal axis is the technical capability of the machine learning system, and the vertical axis is its commercial value. As the chart shows, Real-Time ML (machine learning with streaming real-time data access and online prediction capability) has the greatest value. Intuitively, the most commercially valuable machine learning in the real world is the CTR prediction model of advertising systems and the recommendation model of e-commerce systems; if a model can be trained on users' most recent behavior and then make recommendations for them, it maximizes prediction accuracy, which ultimately shows up as improved commercial results.
So how do we solve the problems of slow landing, poor results, and decay in machine learning? We can recall how we solved the quality and efficiency problems of software and online systems: we used a method called DevOps to improve our R&D model and tool system, so that, with quality assured, we could release versions faster and deploy more often. To this end we automated the pipeline heavily: starting from code submission, the pipeline is triggered automatically and completes static code checks, compilation, dynamic checks, unit tests, automated interface tests, automated functional tests, small-traffic deployment, blue-green deployment, full-traffic deployment, and so on. After containers became mainstream, the steps of building a Docker image and pushing it to a container registry were added as well.
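The fail-fast pipeline idea above can be sketched in a few lines. This is a toy runner, not a real CI system; the stage commands are placeholders standing in for linters, test suites, image builds, and deploy scripts:

```python
import subprocess
import sys

# Illustrative stages only; real pipelines run these via a CI system.
STAGES = [
    ("static checks", [sys.executable, "-c", "print('lint ok')"]),
    ("unit tests",    [sys.executable, "-c", "print('tests ok')"]),
    ("build image",   [sys.executable, "-c", "print('image built')"]),
    ("deploy canary", [sys.executable, "-c", "print('canary up')"]),
]

def run_pipeline(stages):
    """Run stages in order; stop at the first failure (fail fast)."""
    for name, cmd in stages:
        print(f"[pipeline] {name}")
        if subprocess.run(cmd).returncode != 0:
            print(f"[pipeline] {name} failed, aborting")
            return False
    return True

run_pipeline(STAGES)
```

The point is not the runner itself but the shape: every commit triggers the same ordered, automated gate, and nothing reaches production without passing all of it.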
Drawing on the mature experience of DevOps, the industry has developed MLOps, which combines machine learning development with modern software development into a complete set of tools, platforms, and R&D processes.
So what is MLOps?
It can be represented by the following figure: the machine learning system also runs continuously as a pipeline, from defining the project in step one, to feature data processing in step two, to model training and iteration in step three, to model deployment and monitoring in step four. There are several small loops inside: if training results are not ideal, data processing may need to be redone; if monitoring finds that the online model's performance has decayed, the model needs to be retrained. In short, MLOps implements CI (continuous integration) + CD (continuous delivery/deployment) + CT (continuous training) + CM (continuous monitoring) for code, models, and data.
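The CM + CT loop, the part that distinguishes MLOps from plain DevOps, can be sketched with a deliberately tiny stand-in model (a one-threshold classifier); the names and the 0.9 target are invented for illustration:

```python
def evaluate(model, data):
    """Monitored online metric; here, accuracy of a threshold classifier."""
    return sum((x > model["threshold"]) == y for x, y in data) / len(data)

def train(data):
    """Toy trainer: pick the threshold maximizing accuracy on recent data."""
    candidates = sorted({x for x, _ in data})
    return {"threshold": max(
        candidates, key=lambda t: sum((x > t) == y for x, y in data))}

def monitor_and_retrain(model, recent_data, min_metric=0.9):
    """CM + CT: if the monitored metric falls below target, retrain on fresh data."""
    if evaluate(model, recent_data) < min_metric:
        model = train(recent_data)  # continuous training kicks in
    return model, evaluate(model, recent_data)

old_model = {"threshold": 5}               # learned when the world looked different
drifted = [(x, x > 8) for x in range(12)]  # the decision boundary has moved
model, metric = monitor_and_retrain(old_model, drifted)
print(metric)  # 1.0 after retraining; the old model only scored 0.75
```

In a real system `evaluate` would read production metrics and `train` would launch a training job, but the control flow is the same: monitoring closes the loop back to training.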
Of course, MLOps is not only pipelines and automation; it also includes many tools and platforms. Here are some examples:
- Storage platform: storage and retrieval of features and models
- Computing platform: streaming and batch processing for features and models
- Message queue: used to receive real-time data
- Scheduling tool: scheduling of various resources (computing/storage)
- Feature store: registering/discovering/sharing various features
- Model store: model registration/storage/versioning, etc.
- Evaluation store: model monitoring/A-B testing, etc.
Next, the author will focus on an open source project from 4Paradigm: OpenMLDB.
OpenMLDB is an open source machine learning database that provides enterprises with a full-stack FeatureOps solution. Some readers may be puzzled: why yet another term, FeatureOps? In fact, FeatureOps is the part of MLOps that focuses on features, that is, feature-related operations, including extraction, transformation, storage, and computation. The following figure clearly shows the relationship between the two.
This is the complete life cycle of MLOps. The offline part includes DataOps (mainly data collection and data storage), FeatureOps (offline feature computation, storage, and sharing), and ModelOps (model training and tuning); the online part includes DataOps (ingestion of real-time data streams and responses to real-time requests), FeatureOps (real-time feature computation and feature serving), and ModelOps (online inference, result data backflow, etc.).
For FeatureOps (i.e., feature operations), a considerable challenge is ensuring offline/online consistency and avoiding training/serving skew. Consider the typical workflow of AI scientists and AI engineers.

Offline, an AI scientist pulls the raw feature data from an offline data warehouse, extracts and transforms it, feeds it to the model, and trains. If the results are unsatisfactory, new data may be added as features, existing features may be transformed further, the network architecture adjusted, and various hyperparameters tuned, retraining until better results are obtained. In this process, scientists usually work in Python in notebooks.

Once a model is trained, the team wants to deploy it online to handle real traffic, that is, real data. At this point AI engineers must deploy the model and build the prediction service: they fetch the raw data the model needs from the data warehouse and reimplement the ETL (extract, transform, load) feature process the scientists used during training as online prediction-service code. Because scientists do not consider the performance requirements of online services (concurrency, low latency, etc.) during training, while engineers must, this translation is very time-consuming; the slightest inconsistency makes training and prediction results diverge, so a model that looked good in training performs far below expectations once online. This forces repeated, costly rounds of communication and debugging between AI engineers and AI scientists.
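One structural way to avoid this skew is to make the training job and the prediction service call the literal same transformation code instead of reimplementing it twice. A minimal sketch with hypothetical field and feature names:

```python
import math

def make_features(raw):
    """Single source of truth for feature extraction: the offline training job
    and the online prediction service both call this function, so the
    transformation can never diverge between the two."""
    return {
        "amount_log": math.log1p(raw["amount"]),
        "is_weekend": raw["weekday"] >= 5,  # Saturday=5, Sunday=6
    }

# Offline: build training rows from historical records
train_rows = [make_features(r) for r in [
    {"amount": 120.0, "weekday": 2},
    {"amount": 8.5, "weekday": 6},
]]

# Online: the prediction service applies the identical function per request
request = {"amount": 120.0, "weekday": 2}
online_row = make_features(request)

print(online_row == train_rows[0])  # True: no skew, by construction
```

This works when one team owns both paths; the harder industrial problem, which OpenMLDB addresses, is making the shared artifact something both scientists and engineers can naturally write and deploy.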
OpenMLDB takes an innovative approach: it lets AI scientists and AI engineers do their respective work in the same, very common language, SQL. An AI scientist uses a set of SQL scripts to build the feature data required for training; the same SQL scripts can then be deployed online, unchanged, by AI engineers for the prediction service. Having both roles use one set of SQL scripts for both training and prediction creatively solves the biggest problem in FeatureOps: the inconsistency between training and prediction. OpenMLDB also has other excellent features, such as a built-in online feature storage engine with high performance, low latency, and specific optimizations for time-series features. OpenMLDB was open-sourced on GitHub in June this year under the business-friendly Apache 2.0 license. It has run in the real production scenarios of many of 4Paradigm's commercial customers, and its performance and quality have been widely verified. The open source address is https://github.com/4paradigm/openmldb; you are welcome to follow the project or join the OpenMLDB technical exchange group.
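To make the "one SQL script for both training and serving" idea concrete, here is a toy illustration using Python's built-in sqlite3 as a stand-in engine. The schema and query are invented for this sketch; this is not OpenMLDB's actual API, only the shape of the workflow:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tx (user_id TEXT, ts INTEGER, amount REAL);
    INSERT INTO tx VALUES
        ('u1', 1, 10.0), ('u1', 2, 20.0), ('u1', 3, 30.0),
        ('u2', 1, 5.0);
""")

# One feature script, written once by the AI scientist ...
FEATURE_SQL = """
    SELECT user_id,
           COUNT(*)    AS tx_count,
           SUM(amount) AS tx_sum
    FROM tx
    GROUP BY user_id
    ORDER BY user_id
"""

# ... used offline to materialize the training set:
training_set = conn.execute(FEATURE_SQL).fetchall()

# ... and reused verbatim online, scoped to the requesting user:
def serve(user_id):
    return conn.execute(
        f"SELECT * FROM ({FEATURE_SQL}) WHERE user_id = ?", (user_id,)
    ).fetchone()

print(training_set)  # [('u1', 3, 60.0), ('u2', 1, 5.0)]
print(serve("u1"))   # ('u1', 3, 60.0)
```

Because both paths execute the same SQL text, the feature definitions cannot drift apart; OpenMLDB additionally provides the low-latency online execution and time-series optimizations that a serving path needs, which sqlite3 here does not attempt to model.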
Finally, a summary:
Enterprise intelligent transformation requires landing AI in multiple scenarios
The current state of AI landing: it is difficult and slow, results fall short of expectations and decay over time
Learning from DevOps and implementing MLOps is the solution
OpenMLDB is a good tool within MLOps
Finally, the QR code of the MLOps enthusiasts discussion group is attached. Anyone interested in MLOps is welcome to join and discuss relevant technologies and projects together.
About the author: Tan Zhongyi, architect at 4Paradigm and Vice Chairman of the TOC of the OpenAtom Foundation.