Editor's note: Most AI applications today follow the supervised learning paradigm: a model is trained offline and then deployed to a server for online prediction, which severely limits real-time responsiveness. As computing and AI systems mature, we want machine learning applications to run in dynamic environments and respond to changes in real time, which is pushing machine learning to evolve from traditional offline learning toward online learning. Compared with traditional offline machine learning, online learning enables faster model iteration, predictions that track the real situation more closely, and greater sensitivity to online fluctuations.
Over the past two years, leading Chinese Internet companies have launched their own online learning systems and related architectures. Starting in July 2018, Ant Financial built a financial-grade online learning system on top of Ray, the emerging distributed computing engine. Compared with traditional online learning frameworks, Ant Financial's system improves on end-to-end latency, stability, and R&D efficiency.
Ray is a high-performance distributed computing engine open-sourced by UC Berkeley's AMPLab in December 2017. Having been public for less than two years, Ray is still a newcomer among computing frameworks: although it has attracted considerable industry attention, few companies have actually put it into production. Ant Financial is perhaps the first company in China to take the plunge. Why did Ray win Ant Financial's favor? What are its unique advantages over open-source computing engines such as Spark and Flink? What problems might one encounter when using Ray, and what lessons can be learned? With these questions, InfoQ interviewed Zhou Jiaying (alias Tuli), a senior technical expert at Ant Financial, at the recent QCon Shanghai 2019 conference. The following is a transcript of the interview.
InfoQ: Can you give us an overview of the evolution of Ant Financial's big data technology architecture, including the stages you have gone through and the key work done in each stage?
Zhou Jiaying: Ant Financial's big data architecture also started from offline computing, roughly from 2011 to 2013. At that time the industry's traditional offline computing, namely Hadoop, was dominant. After 2013, with the launch of the distributed real-time computing system Storm, we began gradually shifting business to real-time computing. Since 2016 the team has undergone a major transformation, aiming to build a technical system for the next generation of big data computing. At the start, we tried to decouple from any single computing engine, so that the business would interface with a computing platform or middle-platform system rather than a specific engine. During this stage we went through concepts such as a feature center, an event center, and a decision center.
Since then, big data engines and the ecosystem as a whole have developed very fast. We did not want to keep chasing trends by building an ecosystem around one or two engines or one or two popular computing models. We believed there should be a stable set of big data architecture design principles that could cover problems at any data scale, and we hoped to gradually accumulate our own technology system, one compatible with and able to support all of the industry's more active computing engines at the same time. So in 2017 we put forward the concept of an "open architecture": building a single open computing architecture that accommodates different computing engines.
First of all, it is an overall architecture dedicated to solving big data computing problems. Within it there are different computing engines, but they exist as plug-ins, which means that when an engine changes, the business layer above is unaffected. On top of this architecture we have done a lot of in-house development on key capabilities, such as the fused computing engine we are building now. Traditionally, a computing mode and a computing engine are bound together: Flink is stream-first and Spark is batch-first. Although each can emulate the other, the conversion is not smooth and some of each engine's advantages are lost along the way. Moreover, computing patterns such as graph computing cannot be covered by either engine, because each engine was designed around a single bound pattern. We therefore proposed the concept of fused computing: using Ray as the distributed computing framework to support multiple computing modes at once and integrate them coherently. By "fused" we mean combining the various computing modes into a closed loop across development, disaster recovery, and operations, through different fusion mechanisms, to achieve the best performance and efficiency. Beyond that, we are also investing heavily in graph computing, AI, and hardware-software co-design. That is the development trajectory of Ant Financial's big data computing.
InfoQ: Throughout this process, did Ant Financial learn from other companies abroad?
Zhou Jiaying: Of course. We do not innovate out of thin air or blindly; we first look at the most advanced technology and experience in the industry. We benchmark against large Internet companies such as Google, Facebook, and Amazon; we also study the products and concepts of more research-oriented companies such as Microsoft and IBM; and we factor in our own business characteristics and the pitfalls we have hit before. In other words, we first look at the industry's leading technologies in both engineering and research, then at the pitfalls and problems we have encountered ourselves, and combine that with our own business scenarios and scale to determine the work focus and future plans I just mentioned.
InfoQ: What are the key differences between the real-time data phase and the online data phase mentioned earlier?
Zhou Jiaying: The real-time data stage evolved from the offline data stage. Although it is faster than before, the problems it faces are quite intuitive. For example, data computation moves from T+1 to T+minutes or T+seconds, that is, from offline to real-time; but whether the latency is seconds or minutes can vary over a wide range without greatly affecting online scenarios. For a monitoring or synchronization task, the timeliness requirement can move freely between real-time and offline. Online computing, however, must align with the consistency requirements of the online business. For example, when a business process depends on a database computation, it can proceed to the next step only after the database returns a result. So online data computing is less a simple shift from offline to real-time than a big data computing scenario that supports online decision-making.
InfoQ: So what challenges does the online data phase pose to the technical architecture?
Zhou Jiaying: There are many. First, online computing implies a completely different computing model. From the perspective of preparing the data, it is a stream computing model; but to query that data and serve online business, another concept comes in, namely distributed services. Making queries faster and more accurate also depends on matching the written data with the computed results and with what the query finally returns, so the computing modes become more diverse. A second key point is that our earlier offline and real-time computing were in fact separated from online applications: online applications had their own SLAs and their own data center deployment, big data had another, and the two sides were loosely coupled, so when a data warehouse or a data computation had a problem, the online business was generally unaffected. Once online computing enters the picture, data computing and the online business must be placed together, so the entire deployment architecture, disaster recovery system, and SLA standards have to be comprehensively upgraded.
InfoQ: Compared with traditional online learning frameworks, how is Ant Financial's online learning system optimized?
Zhou Jiaying: Traditional machine learning is offline machine learning, characterized by very long iteration cycles: data computation happens on a daily or hourly cadence. Traditional online learning mainly means turning batch computing into stream computing, wiring a stream computing engine to a machine learning training engine, and iterating quickly on both sides to produce a model. On top of that industry baseline, Ant's online learning system merges the computing modes of different engines into one integrated architecture, that is, one engine supporting multiple computing modes. We treat stream computing as one mode, model training as another, and distributed services as a third, and we run all three on a single computing engine, which is Ray. In short, we use one computing engine to cover every link of online learning, whereas a traditional online learning framework uses different engines for different problems and stitches them together. That is the biggest difference.
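The closed loop just described, stream ingestion feeding incremental training feeding an online serving layer, can be sketched with a toy model. This is a hypothetical, dependency-free illustration of the three roles and their wiring; in the real system each role would run as a long-lived Ray actor, not as plain Python objects in one process, and the learner would be a real model rather than one-dimensional SGD.

```python
from collections import deque

class StreamSource:
    """Stream-computing role: yields (feature, label) pairs as events arrive."""
    def __init__(self, events):
        self.buffer = deque(events)

    def next_batch(self, size=2):
        batch = []
        while self.buffer and len(batch) < size:
            batch.append(self.buffer.popleft())
        return batch

class OnlineTrainer:
    """Training role: updates a 1-D linear model incrementally (SGD on squared loss)."""
    def __init__(self, lr=0.1):
        self.w = 0.0
        self.lr = lr

    def update(self, batch):
        for x, y in batch:
            grad = 2 * (self.w * x - y) * x
            self.w -= self.lr * grad
        return self.w  # latest weights, pushed to serving

class ModelServer:
    """Distributed-service role: answers predictions with the latest weights."""
    def __init__(self):
        self.w = 0.0

    def load(self, w):
        self.w = w

    def predict(self, x):
        return self.w * x

# One pass of the closed loop: ingest -> train -> push -> serve.
# The data follows y = 2x, so the served model should drift toward w = 2.
source = StreamSource([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
trainer = OnlineTrainer()
server = ModelServer()
while True:
    batch = source.next_batch()
    if not batch:
        break
    server.load(trainer.update(batch))
```

The point of the sketch is the wiring, not the learner: each freshly trained snapshot is pushed to the serving role inside the same loop, which is what lets one engine own the whole iteration instead of handing artifacts between a stream system and a separate training system.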
InfoQ: Why did Ant Financial choose to build its online learning system on Ray? What technical research did you do beforehand? What are Ray's advantages and disadvantages compared with other distributed engines?
Zhou Jiaying: We chose Ray because almost every other computing engine is already bound to a particular computing mode. When Spark launched, its goal was batch computing, to replace Hadoop; it can also run stream computing, but it simulates streams with micro-batches. When Flink launched, it aimed to replace Storm with better stream computing; it can also run batch jobs, but it simulates batch with streams, and such simulation carries inherent defects. Because these engines were designed around one specific computing mode, they cannot be fused naturally. So around 2016-2017 we found Berkeley's AMPLab, whose proposal matched our earlier thinking about computing very well: an abstract, general distributed scheduling capability at the bottom, on top of which different computing modes can be abstracted, with common capabilities sinking into the lower layer. This yields two levels: the upper level is the computing modes, including stream, batch, graph computing, and machine learning; the lower level is distributed services, which we see as the core layer and which must solve scheduling, fault tolerance, resource reclamation, and so on. Through this early research, and through continuous experimentation and communication with Berkeley's AMPLab and the community, we reached the conclusion that Ray is the best-practice solution for fused computing.
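The two-layer idea, a generic scheduling layer below and pluggable computing modes above, can be caricatured in a few lines. This is a hypothetical sketch, not Ray's implementation: `MiniScheduler` stands in for the lower layer (roughly the role played by Ray's real `@ray.remote`, `.remote()`, and `ray.get()` primitives), and the usage at the bottom shows a batch-style computing mode expressed purely in terms of that one primitive.

```python
import queue
import threading

class MiniScheduler:
    """Lower layer: schedules stateless tasks on workers.

    It knows nothing about stream, batch, or training semantics;
    upper-layer computing modes are built out of submit/get alone.
    """
    def __init__(self, workers=2):
        self.tasks = queue.Queue()
        for _ in range(workers):
            threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, args, fut = self.tasks.get()
            fut["result"] = fn(*args)
            fut["done"].set()

    def submit(self, fn, *args):
        # Analogous to fn.remote(*args) in Ray: returns a future immediately.
        fut = {"done": threading.Event(), "result": None}
        self.tasks.put((fn, args, fut))
        return fut

    def get(self, fut):
        # Analogous to ray.get(ref): blocks until the result is ready.
        fut["done"].wait()
        return fut["result"]

def square(x):
    return x * x

# Upper layer, "batch mode": fan out all tasks, then collect in order.
sched = MiniScheduler()
batch_refs = [sched.submit(square, x) for x in range(4)]
batch_out = [sched.get(r) for r in batch_refs]
```

A stream mode would use the same `submit`/`get` pair but interleave submission with arriving events; the design point is that both modes sit above one scheduler rather than each bringing its own.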
Ray's advantage is that, from the start, it was not designed around any one scenario or computing mode. It is a truly native distributed framework with strong extensibility; it carries no heavy built-in abstractions, so it is very flexible to modify. Its disadvantage is that Ray is a very new framework: a computing engine in its first three years is still in a fairly primitive state, and it may change a great deal in the future.
But Ray's advantage and disadvantage are really two sides of the same coin. Precisely because it is new, it is more primitive, simpler, easier to adapt, and easier to bend toward fusion. At present we believe Ray is the most suitable cloud-native computing architecture.
InfoQ: Are other enterprises using Ray?
Zhou Jiaying: Judging from official community events and partners, Alibaba, Facebook, and Amazon are all following Ray or collaborating on it, but still at a relatively early stage and a relatively small scale. Many enterprises may only use Ray's native APIs or native features to solve a small slice of reinforcement learning problems, or for experimental purposes. Ant Financial is probably the only company in the world with both deep involvement and large-scale production deployment.
InfoQ: What pitfalls did Ant Financial hit while using Ray? What should one pay attention to when building an online learning system on Ray? Could you share your experience?
Zhou Jiaying: There were many. Ray is a very new engine, and there is a big gap between something fresh out of the lab and something that can truly go into online production. A lab may test performance and stability with relatively small test sets, but an enterprise production environment needs much larger scale and much stricter reliability guarantees. Many capabilities it lacked, in availability, performance, and so on, we had to develop ourselves and contribute back to the community, along with the supporting ecosystem: tuning, DevOps tools, deployment, scheduling, and integration with upstream and downstream systems.
Beyond those engineering pitfalls, there were other problems. For example, Ray needs to be compatible with TensorFlow; achieving multi-language scheduling and multi-language fault tolerance requires a lot of extra work; and there are machine-learning-specific concerns in the online training process, such as how to train one model without affecting others, how to minimize noise, how to roll back model versions, and how to handle online communication.
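Version rollback, one of the concerns listed above, can be illustrated with a minimal versioned model store. This is a hypothetical sketch under the assumption that every published model snapshot is retained; it is not Mobius's actual mechanism, where rollback would also have to coordinate with online serving and training state.

```python
class ModelStore:
    """Keeps every published model version so serving can roll back instantly."""
    def __init__(self):
        self.versions = []   # append-only history of model snapshots
        self.current = -1    # index of the version currently being served

    def publish(self, weights):
        """Record a new snapshot and make it the served version."""
        self.versions.append(weights)
        self.current = len(self.versions) - 1
        return self.current

    def rollback(self, version):
        """Point serving back at an earlier, known-good snapshot."""
        if not 0 <= version < len(self.versions):
            raise ValueError("unknown version")
        self.current = version

    def serving_weights(self):
        return self.versions[self.current]

store = ModelStore()
store.publish({"w": 1.0})   # version 0
store.publish({"w": 1.5})   # version 1; suppose online metrics degrade
store.rollback(0)           # serving reverts to version 0 without retraining
```

Because rollback is just a pointer move over retained snapshots, a bad online update can be undone in constant time, which matters when the model is retrained continuously.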
These are the features we consider relatively significant, and also the points that traditional machine learning systems struggle to guarantee. Ant Financial has invested a great deal of effort around them; we currently have dozens of people on this project. We also hope to give this work back to the community: we plan to open-source the online learning framework and all of our changes to Ray in March next year, so that other companies and users do not have to hit the same pitfalls we did.
InfoQ: Do you think Ray is mature enough now? What kinds of enterprises or scenarios is it suited for, and why?
Zhou Jiaying: The official open-source version of Ray is still a relatively primitive computing engine without many built-in features. From that perspective, it is most practical for enterprises that want to use Ray for reinforcement learning or deep learning computation. Our current internal version, Mobius, by contrast, includes a complete R&D platform for online learning, with support for stream mode, online learning mode, machine learning mode, and distributed services. Any enterprise that wants to develop online learning jobs quickly can use that version, because it is already a complete platform: we have packaged the business computing domain well, and it is better suited to production environments.
InfoQ: When did you start planning to open-source your internal version of Ray?
Zhou Jiaying: The idea of open-sourcing was there from the moment we started the project. We did not want to build something behind closed doors; we wanted to turn it into a project and product the public can share once it reached a certain stage of development. We hope it can serve different customers and users, and that users in turn can contribute better features.
We hope that building an online learning system on Ray will advance the online learning capability of the whole industry, through lower end-to-end latency, higher availability, and a more cohesive overall computing system. Open-sourcing it will certainly have a positive impact on the technology. As for the project itself, we also hope open source brings in more contributors and committers, making the project's features and capabilities ever stronger and making Ray known to more and more people.
InfoQ: What are Ant Financial's next technical plans? What new technologies will you focus on?
Zhou Jiaying: Our future plans for big data computing include the open computing architecture and the fused computing engine Ray; integrated whole-graph computing, that is, large scenarios that include graph computing; and the combination of hardware and computing. We have a dedicated hardware team working on hardware optimization to make it better suited to computing. That is our overall plan.
As for other new technologies, we have been following data lakes and graph computing, including very large-scale graph computing, fast dynamic graph computing, and a unified graph query language.
About the interviewee
Zhou Jiaying (alias Tuli), senior technical expert at Ant Financial, currently leads the online computing team in the data technology department. Since joining Alipay in 2011, he has worked on Alipay's data systems through Ant's offline data and real-time data stages, and has participated in building Ant's real-time data platform, serverless streaming, online job scheduling, computing metadata, and the new generation of computing engines. He is familiar with the evolution of Ant's data technology architecture and has first-hand experience with online computing scenarios and high-availability schemes in distributed environments.
This article is original content from the Yunqi community and may not be reproduced without permission.