This article is based on a talk by Zhang Jun, R&D director of the OPPO big data platform. If you are considering or building a real-time computing platform, we hope it offers some useful reference. You are also welcome to follow the OPPO Internet technology team's official account: OPPO_tech
Introduction: To push the data warehouse toward real-time processing across the board, OPPO's Flink-based real-time computing platform, OStream, has been widely adopted in scenarios such as real-time ETL, real-time reporting, and real-time labeling. This article focuses on OStream, covering the platform's research and development (design principles, overall architecture, and improvements and optimizations to Flink), how business scenarios are onboarded and applied in practice, and our exploration of making the platform more intelligent.
Introduction to the OPPO big data platform
First, a word on OPPO's business and data scale. OPPO is a rather low-key company, so what does it have to do with the Internet and big data? A brief introduction: OPPO ships its own customized Android-based system, ColorOS, with many Internet applications built in, including popular ones such as the app store, the browser, and the information feed. After several years of growth, daily active users of this system have exceeded 200 million. Driven by the business, OPPO's data volume has grown two to three times every year since 2013. The total volume is now quite large, exceeding 100 PB, with more than 200 TB added every day.
Figure 1: OPPO's business and data scale
At such a scale, building a data pipeline is unavoidable. Our main data source is event-tracking data from mobile phones, plus some log and database data. We built the ingestion layer on the open-source project NiFi, with storage and computation based on HDFS and Hive. The computation layer has two main parts: hourly ETL jobs for data cleaning and processing, and daily Hive jobs for day-level aggregation. The whole pipeline is driven by a scheduling system, a customized version of the open-source Airflow, which we call OFlow.
Figure 2: Data processing pipeline in offline mode
Next is the application layer. Data applications fall into three main parts: report analysis, user profiles, and interface services, along with some self-developed products. Every day we import data from Hive into MySQL, Kylin, ES, Redis, and HBase to support these products, and we also provide interactive query services based on Presto. After two to three years of accumulation, this offline, batch-oriented pipeline has supported the development of almost all of our businesses. However, the Internet's demographic dividend has been shrinking in recent years, which has pushed our business toward refined operations.
A key element of refined operations is timeliness: we need to capture users' immediate behavior and short-term interests. The most typical scenario is real-time recommendation, which I believe most domestic companies are working on. This kind of timeliness imposes real-time requirements on the whole data pipeline: latency must move from the hour and day level of the past to the minute and second level (there is no demand for the microsecond level yet).
The three main data applications just mentioned (reports, labels, and interfaces) each have real-time business scenarios. Beyond the business, there is another area that is easy to overlook: the platform side also has a demand for real-time processing. For example, most of OPPO's offline scheduling tasks start at 0:00 a.m. That is understandable: with T+1 processing, the previous day's data becomes available at midnight, and everyone wants it processed as early as possible. But it creates a big problem for the platform: the whole cluster is under heavy pressure in the early morning, and we are often called in the middle of the night to deal with cluster problems. If batch processing could be turned into stream processing, the concentrated cluster load would be spread across 24 hours, and the pressure on the platform would drop.
Real-time stream processing would also help greatly with label imports and quality monitoring. With such demands, each business started building its own real-time stream pipeline. The two most important technology choices when building a pipeline are the compute engine and the storage engine. As we all know, these two areas are fiercely contested in the open-source ecosystem, with many options: for compute engines there is the old champion Storm as well as newcomers like Flink and Spark Streaming; for storage engines there are Redis, HBase, Elasticsearch, and so on. Every pipeline eventually connects a compute engine to a storage engine, and it is easy to end up with the situation shown in the figure below: a very tangled spider web.
Figure 3: The chaos of real-time stream processing
As the data platform team, we have the responsibility to build a platform that converges these systems, and that raises very practical questions. How do we convince a business team that they need our platform? How do we show them it is valuable? From the platform's perspective, the value seems obvious: platformization produces economies of scale, which lowers the marginal cost of R&D resources, hardware resources, and operations. But a business team does not necessarily agree. They care about their own case, not the big picture, and will say: "My business is growing fast and my needs are urgent. Your economies of scale sound great, but for me, the platform brings migration and learning costs. What value does it bring to me?"
Therefore, when building a platform, it is essential to look at the problem from the perspective of empowering the business from the very start. Can we hide low-level details to improve usability? Can we provide a better abstraction for more users? Can we sign a service level agreement (SLA) with the business to provide a stronger service guarantee? Only by moving from point to area, first solving the concrete needs of a particular business and then gradually bringing other businesses on board, can we achieve economies of scale and maximize the value of the platform. This was our thinking as we promoted platformization.
Technical practice of building a real-time computing platform
Against this background, how did we go about building the real-time computing platform? What was the design approach? We believe that whether you are building a system or a platform, top-level design comes first. What is the top-level design here? We see two layers: the API on top and the runtime underneath. The "iceberg theory" says that what you see is probably only a small part of the whole; far more is hidden under the water. That theory applies in many places, including here.
For the platform design, we want the platform API to be as simple and abstract as possible, leaving most of the complexity in the runtime layer that users never see. The API should therefore be easy to use, expressive, and flexible, while the runtime layer is judged on performance, robustness, and scalability, which are genuinely complex distributed-systems concerns.
With this in mind, how do we choose the platform API? Since the API faces users, we should first consider the skill distribution and usage habits of the company's engineers.
Figure 4: Distribution of our staff's usage habits
In the offline era almost everyone used Hive, so most people are used to writing SQL. Some are used to writing Scala or Java and would prefer to submit data processing jobs as JAR packages. Others can write neither SQL nor programs and would like a simple, easy-to-use interface for doing data analysis directly.
As you can see, SQL should clearly be a first-class citizen of the platform. Looking at the SQL language itself: we are all familiar with it, it is declarative, and after thirty or forty years of development it is easy to use, flexible, and expressive, satisfying the design principles just mentioned.
At the runtime layer, we had many choices.
Figure 5: Platform runtime selection
Only a few candidates are listed here, with our core requirements laid out as a comparison matrix. We found that only Flink satisfies every requirement. Among the other engines, Spark Streaming's inherent micro-batch model cannot achieve this level of low latency; Kafka (Streams) is a relatively lightweight framework; and Storm is past its prime, with many features that simply were not considered in its era. So Flink was the better choice.
Many people may point out that Spark keeps evolving: Structured Streaming now supports a continuous processing mode and can also achieve low latency. We believe that as these frameworks develop, the technologies will eventually converge. So why Flink? Another important reason is Flink's popularity in China over the past two years, including the strong promotion and investment by the Alibaba team; the two speakers Dasha and Yunxie at today's QCon conference are also senior contributors from the Alibaba community.
Looking back, what are Flink's core strengths, and which highlights matter to us? Start with the engine layer: its biggest advantage is the runtime, which we want to be high-performance, with both low latency and high throughput. Many people ask: aren't low latency and high throughput contradictory goals that require a trade-off? In fact, Flink does not break that trade-off; it exposes it through a buffer timeout mechanism. You can go to one extreme and choose very low latency, or to the other and choose very high throughput. Our view of this "low latency and high throughput" claim is: first, at either extreme, Flink's performance still has an edge over other frameworks; second, if you do not need either extreme, you can sit in the middle and still get good overall performance. The official website has some performance comparison data.
Next come end-to-end exactly-once semantics and highly fault-tolerant state management, which satisfy the robustness requirement mentioned earlier. Also very important to us are the event-time mechanism and the handling of delayed data. In the mobile phone industry, the biggest difference between event tracking on phones and on web pages is that data reporting must consider the user's power and network consumption. There is therefore often a quite random delay between a user's action and the moment the data is actually reported; real-time reporting cannot be guaranteed. Analyzing user behavior correctly based on event time, including handling late data, is thus essential for us. Finally, Flink can run on YARN. Since our development over the past few years has been based on YARN cluster management, we wanted to carry that experience forward.
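Event-time windowing with an allowance for late data can be illustrated with a minimal sketch. This is not Flink's implementation; the function name and the max-timestamp watermark policy are simplifications of our own. The point is that an out-of-order record can still be assigned to the correct window as long as it arrives within the allowed lateness.

```python
def tumbling_windows(events, size, allowed_lateness):
    """Assign events to tumbling event-time windows.

    events: iterable of (event_time, value) pairs in *arrival* order.
    A simplistic watermark tracks the max event time seen so far; an
    event is dropped only when its event time falls behind
    watermark - allowed_lateness."""
    windows = {}                 # window_start -> list of values
    watermark = float("-inf")
    dropped = []
    for ts, value in events:
        if ts < watermark - allowed_lateness:
            dropped.append((ts, value))      # too late even with the allowance
            continue
        start = (ts // size) * size          # tumbling window this event belongs to
        windows.setdefault(start, []).append(value)
        watermark = max(watermark, ts)
    return windows, dropped
```

For example, with 10-second windows and 5 seconds of allowed lateness, an event with timestamp 3 arriving after the watermark has advanced to 12 is still within the allowance only if 3 >= 12 - 5, so it would be dropped, while a slightly late event would still land in its original window.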
Let’s look at the capabilities offered by Flink SQL.
Figure 6: Capabilities provided by Flink SQL
The first two items need little explanation: support for ANSI SQL and UDFs, including data types and built-in functions, all described in detail on Flink's official website. Next, custom sources and sinks, which satisfy the scalability requirement mentioned earlier: we can keep extending the ecosystem, and if Flink does not include an upstream or downstream system we need, we can build the extension ourselves. Then there are the capabilities other speakers have covered, such as windows, joins, and the unification of batch and streaming, which I will not repeat here.
Having chosen the platform's API and runtime, we asked: can the whole platform migrate smoothly from offline batch processing to real-time streaming? What does a smooth migration mean? As mentioned earlier, we want the API layer to be as abstract and simple as possible, so can we keep users' learning and migration costs low while the layers underneath change? We found that we could. In the offline era, the data warehouse abstraction at the API layer was the table, and the programming interface was SQL plus UDFs. In the real-time era we can keep these exactly the same, push the extra complexity down into the runtime layer, and swap the compute framework and storage platform for engines like Flink and Kafka. To users, the basic abstractions of the API layer remain consistent, so the learning cost is low.
Figure 7: Smooth migration from offline to real-time processing
Based on this idea, we derived the real-time mode of the entire data pipeline. It is almost identical to the offline era described above; we only replaced some key components: Kafka in place of HDFS, Flink in place of Hive, and the OStream platform in place of offline task scheduling. The other products, including those for reports, profiles, and interfaces, stay the same.
Figure 8: The real-time mode is consistent with the offline mode
With this construction approach in place, how did we develop the platform itself? The first step in building a platform is to sketch the overall architecture, which divides roughly into the following layers. At the bottom is the basic engine layer, familiar to everyone: Kafka, Flink, ES, and other systems, with cluster management handled by YARN and Flink checkpoints stored on HDFS. The platform feature layer provides a web-based IDE with metadata management, stream job management, log retrieval, monitoring, and alerting. Above that is the API layer: the programming interface supports SQL, our first-class citizen, as well as JAR package submission. Finally, there is what we call the "integration tools" layer. As mentioned earlier, some users should be able to generate SQL automatically through a simple UI configuration without writing it themselves; we also integrate with the internal CI/CD tools, so that after a JAR package is built it is automatically delivered to our platform. That part is under development. This is the overall architecture.
Figure 9: Overall architecture of the OStream platform
Since SQL is the platform's first-class citizen, the first step was to build a SQL-based development framework. As mentioned, the API layer follows the table / SQL + UDF model. Applied to our product, we need an interface similar to this (those familiar with it will recognize the open-source Hue interface): a list of tables on the left and a SQL editor on the right, from which SQL can be submitted. To support this development interface at the runtime layer, there are two core pieces: metadata management, i.e. how to create databases and tables and upload UDFs; and stream job management, i.e. how to edit SQL, submit jobs, and ultimately hand them to the Flink framework for execution.
With this problem in mind, let's look at the current state of the Flink SQL API.
Figure 10: Flink SQL API programming example
This is the simplest Flink SQL programming example, about 20 lines of code. It covers what we just discussed: stream job management (compiling and submitting SQL) and metadata management (creating and registering tables) can all be done this way. But we did not want to expose this to users, because users do not want to implement it programmatically. So we made a simple extension on top of Flink.
The whole flow works like this. In our development IDE, a user writes a SQL statement and submits it. On submission we create something like a job concept: the SQL, together with the resources and configuration it needs, is encapsulated into a job and saved to MySQL. A job store periodically scans MySQL for new jobs, calls Flink's TableEnvironment to actually compile the SQL, generates a JobGraph (the executable unit Flink understands), and finally submits the task to YARN.
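The job lifecycle described above (save to MySQL, scan, compile, submit) can be sketched roughly as follows. All names here (`StreamJob`, `JobStore`, `scan_once`) are hypothetical stand-ins, and the compile and submit steps are passed in as callbacks in place of Flink's TableEnvironment and the YARN client.

```python
import dataclasses
import itertools

@dataclasses.dataclass
class StreamJob:
    job_id: int
    sql: str
    resources: dict          # e.g. {"slots": 2, "memory_mb": 2048}
    status: str = "NEW"      # NEW -> SUBMITTED

class JobStore:
    """In-memory stand-in for the MySQL-backed job store."""
    def __init__(self):
        self._jobs = {}
        self._ids = itertools.count(1)

    def submit(self, sql, resources):
        job = StreamJob(next(self._ids), sql, resources)
        self._jobs[job.job_id] = job
        return job.job_id

    def pending(self):
        return [j for j in self._jobs.values() if j.status == "NEW"]

def scan_once(store, compile_sql, submit_to_yarn):
    """One scan cycle: compile each new job and hand it to the cluster."""
    for job in store.pending():
        job_graph = compile_sql(job.sql)          # TableEnvironment in reality
        submit_to_yarn(job_graph, job.resources)  # YARN client in reality
        job.status = "SUBMITTED"
```

In production the scan would run on a timer and record the YARN application ID back into the job row; this sketch only shows the state transition.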
On the metadata management side: how does a table become recognizable by Flink after it is created? In other words, how does the compilation process identify a table created through the metadata center? We implemented a framework around Flink's TableDescriptor concept: when a table is created we build a TableDescriptor, the descriptor of the table, and save it in MySQL as well. Through Flink's external catalog (mentioned by Yunxie just now), the table can then be identified, and finally registered via the TableEnvironment. In this way Flink can recognize external tables. That is the implementation of the entire development framework.
Figure 11: Interface of the metadata center
On the left is the table creation interface, where we can create tables backed by Kafka, MySQL, Druid, or HDFS. We deliberately do not expose DDL to users: our users do not want to write CREATE TABLE statements, they want to create tables through the UI. On the right is UDF upload: you can write a UDF yourself, submit it as a JAR package, and specify its main class.
Figure 12: Development IDE interface
And this is our development IDE, a very simple interface that essentially imitates Hue's development interface.
With such an IDE, what problems remain? To really push it to users, two basic problems had to be solved.
The first is UDFs. In the Hive era we accumulated a large number of internal UDFs, covering things like encryption and decryption, format conversion, and location and distance calculation. We hoped to inherit these Hive UDFs and use them directly in Flink processing, but Flink did not support that at the time. Yunxie just mentioned that Flink 1.9 may support calling Hive UDFs directly, which we look forward to; but since we did not have it, as a platform we had to re-implement all the UDFs on the Flink framework.
The second is dimension table joins. Some business dimension tables live in MySQL, HBase, and Hive, and we needed to join against them. This too is planned for Flink 1.9, but was not in the version we used. So what could we do?
Figure 13: Dimension table creation page
First, this is our dimension table creation page, where you can create a MySQL dimension table. Since dimension tables are not very large in most scenarios, they can be cached directly in Flink's TaskManager. Here you can select the ALL mode, which imports all the data into the container; the LRU (least recently used) mode, which caches recently used entries and refreshes them as needed; or the NONE mode, which does no caching at all and queries the external data source on every lookup.
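The three cache modes can be sketched as follows. This is an illustrative model, not the platform's actual code; `fetch_row` stands in for a MySQL point query and `load_all` for a full-table scan at startup.

```python
from collections import OrderedDict

class DimTableCache:
    """Sketch of the three dimension-table cache modes:
    ALL  - load the whole table into memory up front,
    LRU  - cache recently used keys, evicting the oldest on overflow,
    NONE - query the external store on every lookup."""
    def __init__(self, fetch_row, mode="LRU", capacity=2, load_all=None):
        self.fetch_row = fetch_row        # point query against the source
        self.mode = mode
        self.capacity = capacity
        self.cache = OrderedDict()
        if mode == "ALL":
            self.cache.update(load_all()) # full-table scan at startup

    def lookup(self, key):
        if self.mode == "NONE":
            return self.fetch_row(key)
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as recently used
            return self.cache[key]
        value = self.fetch_row(key)       # cache miss: go to the source
        if self.mode == "LRU":
            self.cache[key] = value
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
        return value
```

A real implementation would also need a refresh interval for the ALL mode and a TTL for LRU entries, which this sketch omits.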
That is the dimension table itself. Now, how do we actually join against it?
Figure 14: Implementation of table association
As shown in the figure above, we rewrite the SQL automatically at the platform level. Suppose the user writes a SQL statement containing a JOIN. Before submission, we add a SQL parsing layer that recognizes this as a join against a dimension table (we know which tables are dimension tables) and encapsulates the parsed context into a JoinContext. With the JoinContext, we rewrite the SQL into a simpler form, replacing the JOIN syntax with a reference to what looks like just another table.
What happens behind the scenes? We use Flink's ability to convert seamlessly between tables and streams: the original table is converted into a stream, and at the stream layer we apply a flatMap. Each flatMap invocation calls a flatMap function we define, which carries the JoinContext we just parsed. What does the JoinContext contain? The dimension table's name, address, and join keys. With that information we can query MySQL; in the initialization (open) phase we load the dimension table into the cache, and each flatMap call then looks the key up in the cache. Finally, we convert the stream back into a table, which is what the rewritten SQL refers to. This is our current implementation. It may look odd, which is why we look forward to Flink's native support for dimension table joins.
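The runtime half of this rewrite, the enriching flatMap, can be sketched like this. The `JoinContext` fields and the inner-join semantics (dropping records with no dimension match) are illustrative assumptions, not the platform's exact behavior.

```python
def make_dim_join_flatmap(join_context, dim_lookup):
    """Build a flatMap function that enriches each stream record with
    columns from a dimension table, mimicking the rewritten plan:
    table -> stream -> flatMap(enrich) -> table.

    join_context: parsed from the user's SQL, e.g.
                  {"table": "dim_city", "key": "city_id"}
    dim_lookup:   key -> dimension row (dict) or None, typically backed
                  by the cache loaded in the open phase."""
    key_field = join_context["key"]

    def flat_map(record):
        dim_row = dim_lookup(record[key_field])
        if dim_row is None:
            return []                     # inner-join semantics: drop misses
        return [{**record, **dim_row}]    # merge stream and dimension columns

    return flat_map
```

In Flink the same shape appears as a `FlatMapFunction` whose `open()` populates the cache; here the cache is simply whatever backs `dim_lookup`.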
The next capability the platform cannot do without is log retrieval, a very common function a platform needs to provide. We now offer full-text search over logs by job name, YARN application ID, and container ID. How is it done?
Figure 15: Log collection pipeline
It is actually fairly simple. When a Flink job (SQL or JAR package) is submitted through OStream, we automatically generate a log4j.properties configuration file and hand it to YARN. When the JobManager and TaskManager run, our custom log4j appender ships all logs to Kafka, and from there they are imported into ES for indexing, which gives us full-text search. One more problem had to be solved: how do we associate every log line with its OStream job? In other words, how do we know which job each line belongs to? There is a neat solution: Flink allows environment variables to be passed when submitting a job, and these are injected into the TaskManager and JobManager processes. Our log4j appender automatically checks whether these environment variables are present in the process; if so, it writes them to ES along with each line, so every log line can be linked to its job name and application ID.
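The job-identity enrichment done by the custom appender might look like this in sketch form. The environment variable names are hypothetical; only the mechanism (read job identity from the process environment and attach it to every record before shipping to Kafka/ES) reflects the description above.

```python
import os

# Hypothetical variable names injected at job-submission time.
JOB_ENV_VARS = ("OSTREAM_JOB_NAME", "YARN_APP_ID")

def enrich_log_line(line, env=os.environ):
    """Turn a raw log line into a record tagged with the job identity,
    if the identity variables are present in the process environment."""
    record = {"message": line}
    for var in JOB_ENV_VARS:
        if var in env:
            record[var.lower()] = env[var]
    return record
```

A real log4j appender would do this once per process rather than per line, but the result indexed in ES is the same: every line carries the job name and application ID it belongs to.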
Next is metrics monitoring, another basic capability a platform must provide. There are many metrics, but which two matter most to users? One is the throughput of the Flink job, and the other is the lag of its Kafka consumption. How do we collect them?
Figure 16: Metric collection pipeline
First, Flink's Kafka consumer is a source, and for each Flink job the KafkaConsumer automatically inherits many metrics from Kafka's consumer, including the consumer's current position and its consumption lag. Every Flink operator also exposes a metric for how many records it emits per second. Based on these two, we use Flink's native metric subsystem, with its internal abstractions such as MetricGroup and MetricRegistry, plus a customized MetricReporter of our own that exports the two metrics to Kafka and finally writes them to ES. Monitoring and alerting are then built on top of these metrics.
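Consumer lag itself is a simple computation: the distance between the latest offset in each partition and the consumer's committed position. A minimal sketch:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition and total consumer lag.

    log_end_offsets:   partition -> latest offset in the Kafka partition
    committed_offsets: partition -> offset the consumer has committed
                       (missing partitions are treated as position 0)"""
    per_partition = {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }
    return per_partition, sum(per_partition.values())
```

The total is what a user typically alerts on; the per-partition breakdown helps spot a single stuck partition.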
Figure 17: Alarm rules
This is the alerting page, also relatively simple. Based on a metric, you can define year-over-year, period-over-period, or absolute-value alert rules.
That covers the platform work based on Flink. Here is a brief summary of the small lessons we accumulated during platform development.
First, for a team like ours, the best strategy is to keep our branch clean. This is closely tied to R&D investment: keep up with the pace of the community and change the Flink kernel as little as possible, otherwise you will never catch up, because the Flink community moves very fast. How? Two strategies. First, any open-source framework that wants to build an ecosystem has to expose many extension points, and we can develop plugins against them. The metadata management just mentioned (connecting external tables to Flink), log collection, and metric collection are all built on the catalog, log4j appender, and metric reporter extension points, respectively. Second, where no extension point exists, we can build a second layer of abstraction on top of the public APIs without touching the native code: SQL job management and the dimension table association are both developed against the native TableEnvironment and TableDescriptor APIs.
Second, for a small team with limited R&D investment, how can you participate in the open-source community? A few small suggestions. First, follow the community's activity; there are two good ways. One is to follow the FLIPs: every major change is introduced in a FLIP, a complete story from motivation to design to examples, from which you can understand a module's whole development path. The other, perhaps the best way, is to follow pull requests: anyone who wants to contribute code must open one, so by watching the modules you care about (table, connectors, runtime) you can see the authors' discussions and the community's progress. Second, build basic familiarity: if you have never contributed to an open-source project, you may not know the Git branching workflow, how to fork, how to merge code, or how to submit a PR. One small way to start is to fix typos; you will find plenty of spelling and grammar mistakes in Flink's documentation and comments. Such commits do not mean much on their own, but they are a way to build that basic knowledge. Third, fast debugging: if you want to understand a Flink module or quickly verify an idea, it is far too slow to connect to an external YARN cluster, stand up a Kafka, and insert a record into it. The best way is to localize end-to-end integration testing: run everything on your own machine, with no external data sources, and get a complete run-through directly.
Moreover, automate it: the whole end-to-end flow, such as inserting a record into Kafka and submitting a job to YARN, can be automated to help you understand Flink better; you can even set breakpoints to trace the code and understand Flink's internal mechanisms.
Here is a small project of my own; take a look if you are interested. It uses mechanisms such as the Kafka mini cluster and the YARN mini cluster to run the whole end-to-end process in a local IDE, fully automated and without depending on any external system.
Operating experience of the OStream platform
Having covered R&D, let me finish with our platform operation practices. The first: we believe the best approach is to split the offline and real-time clusters. At the start, all Flink jobs ran in the offline cluster, but we saw a problem. Why? The jobs in the offline cluster are short-lived, fast-turnaround jobs, and resource allocation there is highly uncertain.
Figure 18: Splitting the offline and real-time computing clusters
If you are familiar with YARN, this picture will look familiar. YARN's fair scheduler makes resource allocation between queues uncertain. Where does the uncertainty come from? Look at the green and orange parts: although each queue is allocated resources, the actual amounts change dynamically with the state of every queue. A queue has a steady fair share, a fixed allocation of resources; the instantaneous fair share, however, means that if a queue has no jobs in it, its share is redistributed for others to use. Queues can also overdraw resources: if a queue has jobs running but has not consumed its full share, the remainder is lent to other queues, which may overdraw it. That was a problem for us, and it bit us several times in the online cluster. Suppose batch and real-time are two different queues, each initially allocated 50% of resources. If a real-time job hits a problem and restarts, its resources are immediately preempted by the other queue during the restart, because that queue can overdraw them. For online queues we therefore turn preemption off; that is a topic of its own, since enabling preemption introduces many uncertainties.
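The difference between steady and instantaneous fair share can be illustrated with a small sketch. This is a simplified weight-based model (real YARN fair scheduling also accounts for minimum shares and current usage):

```python
def fair_shares(total, weights, active):
    """Steady vs. instantaneous fair share for a set of queues.

    total:   total cluster resources
    weights: queue -> scheduling weight
    active:  set of queues that currently have jobs running

    Steady share divides resources among *all* queues by weight;
    instantaneous share redistributes idle queues' shares to the
    active ones, which is what lets an active queue overdraw."""
    total_weight = sum(weights.values())
    steady = {q: total * w / total_weight for q, w in weights.items()}
    active_weight = sum(weights[q] for q in active)
    instant = {
        q: (total * weights[q] / active_weight if q in active else 0.0)
        for q in weights
    }
    return steady, instant
```

With batch and real-time queues at equal weight, a restarting (hence momentarily idle) real-time queue sees its instantaneous share drop to zero while batch can claim the whole cluster, which is exactly the failure mode described above.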
For us, having the real-time queue's resources preempted meant a restarted job had no resources to run on. So although everyone talks about co-locating offline and online workloads, we did not have the bandwidth to study how to do it; to provide a service guarantee, the safest approach was to physically split the two clusters.
Then there is testing, inherited from our original real-time processing system. As you know, testing in data development differs from testing ordinary programs: to verify data logic, what you usually need is not a few random records but full production data. We used to have test jobs read and write a test database whose data was sampled from production, but the business teams saw that as a problem. So the platform does this instead: since tests need production data anyway, the platform rewrites the SQL at submission time, for example automatically changing the INSERT INTO target to point to a test database, while reads still come directly from the production database. This gives timely, full data without any impact on production. That is our practical experience.
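The sink-only rewrite can be sketched with a naive regular expression. A real implementation would rewrite the parsed SQL AST rather than the text, and the `db.table` naming pattern here is an assumption:

```python
import re

def rewrite_sink_for_test(sql, test_db="test"):
    """Redirect the INSERT target to the test database while leaving
    every read (FROM clause) pointing at production, so a test job sees
    full production data without writing to production tables."""
    return re.sub(
        r"INSERT\s+INTO\s+(\w+)\.(\w+)",                 # match db.table sink
        lambda m: f"INSERT INTO {test_db}.{m.group(2)}", # swap only the db
        sql,
        flags=re.IGNORECASE,
    )
```

Note that only the INSERT target is touched; the FROM clause is left untouched on purpose.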
Figure 19: Full-link delay monitoring
The last practice is full-link delay monitoring. The link shown above is the entire Flink-based real-time processing path: data starts at NiFi, passes through Kafka three times, and finally lands in Druid, from which reports are produced. As you can see, there is lag at every hop: from NiFi to Kafka, from Kafka to the aggregation job, and from Kafka to Druid. Previously we monitored the delay of individual jobs, but what users care about is how long after their data enters the system it shows up in a report: the delay of the whole link. The delay of a single job is meaningless to them. So we needed to build lineage.
By analyzing the ETL and aggregation jobs, we derive the full lineage, forming a path from the ingestion channel, through the intermediate tables and jobs, to Druid. We can then add up these four lags into a unified lag and monitor it as one. You may ask: there are two more hops, from ETL to Kafka and from aggregation to Kafka, which also have lag; why not monitor those? We believe Flink's backpressure mechanism takes care of them: if writing from ETL or aggregation to Kafka is delayed, backpressure surfaces it as consumption lag on the upstream Kafka. So we can ignore those two and sum only the four lags.
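Summing the per-stage lags along a lineage path is then straightforward; the stage names below are illustrative, not the platform's actual identifiers:

```python
def full_link_lag(stage_lags, path):
    """End-to-end delay as the sum of per-stage lags along a lineage path.

    stage_lags: stage name -> measured lag in seconds
    path:       ordered list of stage names from ingestion to the sink"""
    return sum(stage_lags[stage] for stage in path)
```

This single number is what we alert the user on, instead of four separate per-job lags.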
The above covers our R&D and operations work. Next, we will share some cases.
Figure 20 interactive query interface
This is the interactive query interface we provided in the offline era. In the real-time era, users still find such an interface valuable: they want to analyze real-time data directly through the UI with drag and drop. Of course, we impose some restrictions: for example, drag-and-drop queries cannot scan the full real-time data set; they can only query the most recent data, such as the last hour or the last few minutes.
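The restriction can be enforced at the platform layer. A toy sketch (the time column name and the rewrite strategy are assumptions, not the platform's actual mechanism) that clamps every drag-and-drop query to a recent time window:

```python
from datetime import datetime, timedelta

def clamp_to_recent(sql: str, hours: int = 1,
                    time_col: str = "event_time", now=None) -> str:
    """Append a time predicate so interactive queries never scan full history.

    Assumes a simple query shape with no GROUP BY/LIMIT tail;
    a real rewriter would operate on the parsed SQL AST.
    """
    now = now or datetime.utcnow()
    cutoff = (now - timedelta(hours=hours)).strftime("%Y-%m-%d %H:%M:%S")
    glue = " AND " if " where " in sql.lower() else " WHERE "
    return f"{sql}{glue}{time_col} >= '{cutoff}'"

q = clamp_to_recent("SELECT app, cnt FROM events",
                    now=datetime(2019, 1, 1, 12, 0, 0))
print(q)  # SELECT app, cnt FROM events WHERE event_time >= '2019-01-01 11:00:00'
```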
Next is real-time ETL, the most widely used real-time stream processing application in our production.
Figure 21 real time ETL
All data from mobile phones is reported to the same channel. Before it enters the data warehouse, we need to split it. We use Flink SQL: for example, we write these four SQL statements to split data from the same source table into different downstream tables by different conditions. With the same set of processing logic, we can insert the data into Kafka and, at the same time, into HDFS for later offline processing.
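A minimal sketch of the split (the conditions and table names are illustrative; on the platform this is expressed as several Flink SQL `INSERT INTO ... SELECT ... WHERE ...` statements over the same source):

```python
# Illustrative split rules: each downstream table takes the subset of the
# shared source stream matching its condition — the Python analogue of one
# Flink SQL "INSERT INTO <table> SELECT ... WHERE <cond>" per table.
SPLIT_RULES = {
    "app_store_events": lambda e: e["app"] == "app_store",
    "browser_events":   lambda e: e["app"] == "browser",
    "feed_events":      lambda e: e["app"] == "feed",
    "other_events":     lambda e: e["app"] not in ("app_store", "browser", "feed"),
}

def split(stream):
    """Route each event of the shared channel to its downstream table(s)."""
    out = {table: [] for table in SPLIT_RULES}
    for event in stream:
        for table, cond in SPLIT_RULES.items():
            if cond(event):
                out[table].append(event)
    return out

routed = split([{"app": "browser", "uid": 1}, {"app": "game_center", "uid": 2}])
print({t: len(rows) for t, rows in routed.items()})
```

In the real pipeline the same routed output is written to both Kafka (for downstream streaming jobs) and HDFS (for later offline processing).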
Then there are real-time tags. Tags are among our most important data assets, so real-time tagging is a very important application.
Figure 22 real time label
This is the pipeline of our real-time tags. We write SQL like this, reading from Kafka and inserting into a table, with certain restrictions on the format, because the output ultimately feeds our tag system. As you can see, the implementation uses nested SQL and UDFs (including windows).
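A sketch of the idea behind the nested SQL: an inner windowed aggregation feeding an outer tag rule. The window size, field names, and the "active user" rule are all made up for illustration; the real job expresses this in Flink SQL.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # assumed tumbling-window size

def window_counts(events):
    """Inner layer: per-user event counts per tumbling window
    (the windowed aggregation in the inner SELECT)."""
    counts = defaultdict(int)
    for ts, uid in events:
        counts[(ts // WINDOW_SECONDS, uid)] += 1
    return counts

def tag_active_users(counts, threshold=3):
    """Outer layer: turn window aggregates into a tag, like the outer SELECT."""
    return {uid for (_, uid), c in counts.items() if c >= threshold}

events = [(0, "u1"), (10, "u1"), (20, "u1"), (30, "u2"), (70, "u1")]
print(tag_active_users(window_counts(events)))  # {'u1'}: 3 events in window 0
```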
Future outlook and planning
Finally, some outlook: where should the platform go from here?
This may be a big topic. We believe the inevitable path of platform development is from automation to intelligence; if we are bolder, we add a third word, "wisdom". People define these three terms differently, especially "intelligence", a word that now appears everywhere.
Here are my definitions. What is automation? Handling mechanical, repetitive work and freeing our hands. The most typical automated application is a task scheduling system: at a fixed time every day, tasks run repeatedly without human intervention. What is intelligence? It must be self-adaptive and self-learning, and because it keeps learning, its behavior is not fully predictable. The most typical example is an AI system like AlphaGo that learns continuously. The last term is "wisdom", which may not exist in reality yet; the TV series "Westworld" depicted hosts that were self-aware and could think for themselves. Let's set wisdom aside. "From automation to intelligence" still sounds rather abstract; what does it mean when applied to our platform?
For automation, we can do end-to-end automatic connection (more on this later), automatically generate SQL from rules, and automatically generate alarm rules whenever a job is submitted. What about intelligence? We can experiment as well. For example, can job resources scale automatically? Online load is a curve, not a straight line; if resources are provisioned as a straight line, waste or shortage is inevitable, so can we scale automatically? In addition, jobs have many parameters, and we cannot expect a perfect configuration at submission time; can we tune a job automatically and dynamically while it runs?
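A toy sketch of the scaling decision (the 20% headroom, the per-task capacity figure, and the cluster cap are assumptions, not the platform's actual policy): size parallelism to the observed input rate instead of provisioning a straight line.

```python
import math

def target_parallelism(input_rate, per_task_rate, max_p=128):
    """Assumed rule: enough parallel tasks to absorb the current input rate
    with 20% headroom, clamped to a cluster-wide cap."""
    need = math.ceil(input_rate * 1.2 / per_task_rate)
    return min(max(need, 1), max_p)

# Load is a curve: the same job needs different parallelism at peak vs. trough.
print(target_parallelism(120_000, 10_000))  # 15 at peak
print(target_parallelism(20_000, 10_000))   # 3 at trough
```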
First, let's look at end-to-end connection. This is a real case.
Figure 23 previous SQL pipeline
This is our previous SQL pipeline. Data is processed by reading from Kafka and writing back to Kafka; it is then imported into a storage engine, possibly by a different team that maintains that engine; finally it becomes an online asset, possibly handled by data product staff. These three links are fragmented and handled by different people. Can we automate the process? Instead of exposing Kafka tables to users, we can present scenario-oriented tables: presentation tables, tag tables, and interface tables. A presentation table, for example, contains nothing but dimension fields, indicator fields, filter fields, and time fields. With these in place, once the processing SQL is written, the data can be loaded automatically into an engine like Druid and eventually becomes a report automatically, with no further effort from the user. This is being built on our platform now.

We also mentioned table creation earlier. In the future it will work like this: after writing the field names, descriptions, and types, for a presentation table you specify which fields are dimensions and which are indicators. Once this is specified, when you write SQL such as an INSERT INTO, the platform automatically coordinates with downstream systems like Druid and the report system, and may even try to create the report automatically. For users, this is an end-to-end process.
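A sketch of what a scenario-oriented table declaration might look like (the field names and roles are invented for illustration): marking each field as dimension, indicator, filter, or time at creation gives the platform enough metadata to derive a Druid ingestion spec, and ultimately a report, automatically.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    type: str
    role: str  # "dimension" | "indicator" | "filter" | "time"

# A hypothetical presentation table: every field carries a role, so downstream
# wiring (Druid ingestion, default report) can be generated from the schema.
presentation_table = [
    Field("event_time", "timestamp", "time"),
    Field("app",        "string",    "dimension"),
    Field("channel",    "string",    "filter"),
    Field("pv",         "long",      "indicator"),
]

def druid_dimensions(fields):
    """Dimension list for the auto-generated Druid ingestion spec."""
    return [f.name for f in fields if f.role == "dimension"]

def druid_metrics(fields):
    """Metric list for the auto-generated Druid ingestion spec."""
    return [f.name for f in fields if f.role == "indicator"]

print(druid_dimensions(presentation_table), druid_metrics(presentation_table))
```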
Finally, I'd like to share a paper about intelligence, published by Microsoft and Twitter in 2017 (they built a system called Dhalion). It is very interesting; its keyword is "self-regulating", which matches the intelligent direction we just discussed.
Figure 24 intelligent automatic scaling of online jobs
This diagram, taken from the paper, illustrates how to achieve intelligent automatic scaling of online jobs. It works like this: first, Detectors, like a doctor examining a patient, collect metrics and identify symptoms from them, such as back pressure or processing skew; given the symptoms, a diagnosis is made to judge the cause, for example whether the data is skewed or some instances have slowed down due to disk degradation; after the diagnosis, the right remedy can be prescribed, that is, how to resolve the problem. This can also be automatic: there are rules for resolving data skew and for handling a slowed instance.
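The loop can be sketched as three pluggable stages. This is a toy rendering of the Dhalion idea, not its actual API; the metric names, thresholds, and actions are all assumptions.

```python
def detect(metrics):
    """Detector: turn raw metrics into symptoms."""
    symptoms = []
    if metrics["backpressure_ratio"] > 0.5:
        symptoms.append("back_pressure")
    if max(metrics["instance_lat_ms"]) > 3 * min(metrics["instance_lat_ms"]):
        symptoms.append("skewed_instance")
    return symptoms

def diagnose(symptoms):
    """Diagnoser: map symptoms to a probable cause."""
    if "skewed_instance" in symptoms:
        return "slow_instance"          # e.g. disk degradation on one node
    if "back_pressure" in symptoms:
        return "under_provisioned"
    return None

def resolve(diagnosis):
    """Resolver: prescribe an action for the diagnosis."""
    return {"slow_instance": "restart_instance",
            "under_provisioned": "scale_up"}.get(diagnosis, "no_op")

metrics = {"backpressure_ratio": 0.7, "instance_lat_ms": [20, 22, 95]}
print(resolve(diagnose(detect(metrics))))  # restart_instance
```

Each stage can be swapped independently, which is what makes the framework amenable to rule-based or learned policies.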
The whole process, including the intermediate diagnosis, can be made rule-based or machine-learning-based and continuously adaptive. The framework was built on Heron, Twitter's earlier system. Could we do the same with Flink? We found that everything it needs, metric collection, the back-pressure mechanism, and automatic job scaling through the API, is achievable in Flink. It would be interesting to build such a framework to automatically and intelligently scale job resources online.