During last year's Spring Festival, Alipay ran its popular "Collect Five Blessings" campaign. Behind every scratch card were hundreds of offers from Ant Financial, Alibaba, and their partners. The campaign is concentrated in the few days before the Spring Festival, so it is highly time-sensitive. How to automatically match offers to users, solve the system's cold-start problem, optimize conversion rates, and improve the user experience therefore became an online-learning optimization problem.
Before we built such a system, the modules it required were very complex. We needed stream-processing tasks such as log collection, data aggregation, and sample joining and sampling; machine-learning modules for model training and validation; a model service that loads models in real time; and other supporting infrastructure. Wiring so many modules together greatly increased the complexity of the system.
Because so many systems were involved, our previous setup ran into many problems. For example, to protect the stability of high-priority links, some upstream data-processing links would be degraded, but the downstream engineers would not know. Another common problem was inconsistency between stream and batch logic: offline features were used to train the baseline model, while online features were computed to update the model in real time. These two modules, one offline and one online, developed subtle differences in processing logic that had a large impact on business results.
In summary, we ran into three kinds of pitfalls:
- SLA: the SLA of the whole link is bounded by the SLA of every module, and degrades as modules are added. Stability became an important factor constraining business growth.
- System efficiency: modules were mostly connected by writing data to storage, and coordination between them relied on external job scheduling, incurring unnecessary I/O, computation, and network overhead.
- Development and operations cost: each module had its own style, with different development models, computing frameworks, and even coding conventions. Getting familiar with each system during development and operations handoffs consumed a great deal of time and slowed down business delivery.
What capabilities should an ideal system provide? We looked at three aspects: stability, speed, and simplicity. First, from the data perspective, the system must guarantee consistency of data and computation and provide an end-to-end SLA for the whole link; data consistency and link stability are the foundation of business stability. Second, we needed to optimize system efficiency: turn the connections between a dozen systems into connections within one system, and turn job scheduling into task scheduling. This lets the scheduler coordinate adjacent computations, improving efficiency and reducing network bandwidth usage. Finally, an integrated system greatly simplifies development and operations: instead of integrating more than ten systems, we only integrate one, and instead of tracing back through several systems to find the root cause during an incident, we debug a single integrated system.
At the outer layer, online machine learning needs three capabilities: data processing, model training, and model serving. These translate into requirements on the computing framework: an agile scheduling mechanism, more flexible resource management, and a more complete fault-tolerance mechanism. The upper-layer systems are often implemented in different programming languages, so a multi-language interface is also required. After weighing these underlying requirements against the characteristics of current frameworks, we finally chose Ray as the foundation of fusion computing.
Ray is an open-source distributed computing framework initiated by the RISELab at UC Berkeley, with Ant Financial as a contributor. Its goal is to make developing distributed applications simpler. As a computing framework, Ray can help us achieve the three goals of stability, speed, and simplicity: it has an agile scheduling mechanism that can schedule millions of tasks per second, and it supports heterogeneous scheduling according to each computation's resource requirements.
Popular distributed frameworks share three basic primitives: distributed tasks, distributed objects, and distributed services. Common procedural programming languages have three corresponding basic concepts: functions, variables, and classes. These concepts map directly onto the distributed primitives, and in Ray a program can be converted from one form to the other with very small changes.
On the left is a simple example. Adding the "@ray.remote" decorator in front of a function turns it into a distributed task. The task is executed through a ".remote()" call, and the return value is an object reference that can participate in further computation.

On the right is another example: adding the "@ray.remote" decorator to a class turns it into a service (an actor). The class's methods become distributed tasks invoked through ".remote()" calls, very much like functions. In this way a single-machine program becomes a distributed one, and local work can be scheduled onto remote machines for execution.
What kind of scheduling should Ray perform? The yardstick is system efficiency, which often depends on how computation and data are organized. Suppose we need to compute add(a, b). The function is first registered locally and handed to the local scheduler. The global scheduler then cooperates with the local scheduler of a second node to copy a to that node and execute the add there. Scheduling can be further optimized according to the sizes of a and b, which may be simple values or larger variables such as matrices.
Ray provides a multi-language API. For historical reasons, Java is the most common language for stream computing at Ant Financial, while Python is the most common language for machine-learning modeling. We wanted to reuse the stream-processing operators already implemented in Java while keeping Python's convenience for modeling. Ray's multi-language support makes this straightforward: at the upper layer, users can develop stream processing in Java and machine-learning models in Python.
For online machine learning, the core problem is bridging stream computing and model training, which requires a medium that connects the two conveniently. The Ray features introduced above, the multi-language interface and the flexible scheduling mechanism, make Ray well suited to play this bridging role. The last node of data processing is the output of the stream computation, and the training worker that consumes that data is the input of model training. Ray's scheduler can place these two computations on the same node and let them share data, connecting the two modes. In this way not only do stream computing and machine learning become compatible; other computing modes can be connected as well.
The concept of a DAG was originally proposed to improve the efficiency of multi-stage distributed computing, chiefly by using scheduling to reduce I/O. In earlier computing DAGs, however, the graph was fixed when the job was submitted. In machine-learning workloads we often need to try new models or tune hyperparameters, and we want to load these models onto the link and observe their business effect without interrupting the training and serving of the models already online. In Ray, new tasks can be generated dynamically during execution, and we use this feature to add and replace nodes, locally modifying the DAG at runtime.
A big difference between online and offline systems is failure handling. If a task in an offline system dies, restarting the job is usually enough; in an online system, timeliness means we cannot simply restart the cluster and backfill data. So a more complete fault-tolerance mechanism is needed. During model training we use Ray actors to bring up the worker and server nodes. If a worker or server becomes unhealthy, the actor fault-tolerance mechanism recovers its data and computation through lineage, achieving fault-tolerant training.
We pursue the timeliness of the link, wanting the model to fit real-time data as quickly as possible. But in pursuing timeliness we must also guarantee the stability of the whole link, striking a balance between the two. We guarantee link stability from three aspects: system stability, model stability, and mechanism stability.
- System stability: real-time data delivery and strong consistency guarantees.
- Model stability: the model should fit the real-time data stream, but we must also prevent the degradation that an online-learning link can suffer under uncertainties such as noisy data. We therefore combine online and offline features, and in model design we weigh the data sensitivity and noise tolerance of deep versus shallow models.
- Mechanism stability: a champion/challenger ("horse racing") comparison mechanism and a fast rollback strategy.
Beyond using Ray to achieve integration, we also built many supporting modules, including TensorFlow integration, stability assurance, sample backflow, delayed-sample correction, data sharing, stream-batch unification, end-to-end strong consistency, and incremental model export. We deployed the platform in several Alipay scenarios, and the following numbers show the effect:
- 99.9% full-link SLA
- 2% to 40% improvement in business metrics
- Model update delay reduced from tens of minutes to 4 to 5 minutes, with further reduction possible as business requires
- 60% reduction in machine usage
We started building in August last year and launched the first scenario in February this year. We have achieved good results in Alipay's payments and wealth-management business lines, and will next promote the platform to other business lines at Ant Financial.
Machine learning on fusion computing is an organic combination of the two that achieves optimal resource sharing. Through this exploration we have preliminarily validated the fusion-computing framework, which aims at data sharing to make computing modes compatible. The essence of fusion is openness, and the foundation of openness is data interoperability. As long as data can move easily between modes, with real-time delivery and end-to-end consistency, complex scenarios can be served by combining multiple modes. Connecting modules is like building with Lego bricks: there may be only a few basic blocks, yet they assemble into a complex and flexible system.
This is original content from the Yunqi Community and may not be reproduced without permission.