The following is the current architecture of the platform:
Data from multiple sources is written to Kafka. The computing engines are Storm, Spark and Flink, and the results from the computing engines are written to various storage systems.
At present there are more than 100 Storm tasks and about 50 Spark tasks. The number of Flink tasks is still relatively small.
At present our cluster handles 60 TB of data per day, 1 billion calculations, and 400 nodes. It is worth mentioning that Spark and Flink both run on YARN, where Flink on YARN is mainly used to isolate the JobManager of each task, while Storm runs in standalone mode.
1. Consistency semantics
Before describing our application scenario, we first emphasize an important concept of real-time computing, consistency semantics:
1) At most once: fire and forget. If we write a Java application without considering offset management at the source or idempotence downstream, we get plain at most once. Whatever the intermediate state and whatever the outcome of the write, there is no ack mechanism.
2) At least once: a resend mechanism ensures that each record is processed at least once.
3) Exactly once: implemented with coarse-grained checkpoints. This mostly refers to exactly once inside the computing engine, that is, whether the internal state of each operator can be replayed and whether a failed job can recover smoothly from its last state. It does not involve idempotent output to the sink.
4) At least once + idempotent = exactly once: if the downstream operation is idempotent, for example INSERT ... ON DUPLICATE KEY UPDATE in MySQL, or upsert semantics keyed on the primary key in ES, Cassandra and similar stores, then at least once plus idempotence amounts to exactly once.
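The fourth point can be illustrated with a minimal sketch. It assumes a hypothetical stream of order records in which one message is redelivered, and uses a plain dictionary as a stand-in for a table with a primary key; none of these names come from the platform itself.

```python
def upsert(table, record):
    """Idempotent write: the primary key selects the row, so a replay overwrites it."""
    table[record["order_id"]] = record


def append(log, record):
    """Non-idempotent write: every delivery adds another row."""
    log.append(record)


# At-least-once delivery: the second message is redelivered after a retry.
deliveries = [
    {"order_id": 1, "amount": 30},
    {"order_id": 2, "amount": 45},
    {"order_id": 2, "amount": 45},  # duplicate caused by the resend
]

table, log = {}, []
for msg in deliveries:
    upsert(table, msg)
    append(log, msg)

upsert_total = sum(r["amount"] for r in table.values())
append_total = sum(r["amount"] for r in log)
print(upsert_total, append_total)
```

With the upsert, the duplicate leaves the result unchanged, while the append-only sink counts order 2 twice.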
Ele.me used Storm in the early days. Before 2016 everything ran on Storm; Spark Streaming and Structured Streaming were only adopted from 2017. Storm, which we used earlier, mainly involves the following concepts:
1) Data is tuple based
2) Millisecond-level latency
3) It mainly supports Java; with Apache Beam, Python and Go are now also supported.
4) SQL support is incomplete. Internally we wrapped Storm in a layer called Typhon, so users only need to extend a few of our interfaces to use most of the main functions. Flux is a good tool for Storm: a Storm task can be described with just a YAML file. This meets some requirements, but users still have to be engineers who can write Java; data analysts cannot use it.
In summary, Storm has the following limitations:
1) Ease of use: the high barrier to entry limits its adoption.
2) State backend: it needs external storage, such as Redis or other KV stores.
3) Resource allocation: workers and slots have to be configured in advance. In addition, engine throughput is relatively low because there are few optimization points.
One day a business team asked whether they could write some SQL and publish a real-time computing task within a few minutes. So we started working on Spark Streaming. Its main concepts are as follows:
1) Micro batch: you need to set a window in advance, and then process the data in the window.
2) Latency is at the second level; in the best case it is about 500 ms.
3) The development languages are Java and scala.
4) Streaming SQL: this is mainly our own work. We hope to provide a platform for streaming SQL.
Spark has the following characteristics:
1) Spark ecosystem and Spark SQL: this is a strength of Spark. The technology stack is unified, and the SQL, graph computing and machine learning packages interoperate. Because Spark started with batch processing, unlike Flink, its real-time and offline APIs are naturally unified.
2) Checkpoints on HDFS.
3) On YARN: Spark belongs to the Hadoop ecosystem and integrates closely with YARN.
4) High throughput: because of the micro-batch model, throughput is relatively high.
Now let us look at the page on which platform users can quickly publish a real-time task. Users do not write DDL and DML statements here; everything is presented through the UI.
The page lets users select the necessary parameters: which Kafka cluster to consume from and how much to consume from each partition. Backpressure is enabled by default. The starting consumption position has to be specified by the user each time, so that the next time a real-time task is rewritten the user can choose the offset to start from according to business needs.
In the middle, the user describes the pipeline. The SQL input is one or more Kafka topics, and an output table is chosen for the result. The Kafka DStream consumed above is registered as a table, and the user then writes a series of SQL statements as the pipeline. Finally, we wrap a number of external sinks for users (all the storage systems mentioned above are supported; any storage that can implement upsert semantics is supported).
Although this covers general stateless per-batch computation, some users ask what to do about stream joins. For early Spark 1.5, you can refer to Spark Streaming SQL, an open source project that registers a DStream as a table and then joins tables, but it only supports versions up to 1.5. After Spark 2.0 introduced Structured Streaming, the project was abandoned. We use a workaround:
Spark Streaming consumes multiple topics, and each batch RDD in the consumed DStream is converted into a DataFrame so that it can be registered as a table. Based on specific conditions it is split into two tables, which can then simply be joined. The join depends entirely on the data consumed in the current batch, and the join conditions are not controllable, so this is a rather tricky approach. For example, in the following example we consume two topics, split them into two tables with a filter condition, and then join the two tables, although in essence it is still a single stream.
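The idea can be sketched without Spark at all. The following is a minimal illustration in plain Python, not the platform's code: each micro-batch is a list of records from two hypothetical topics ("orders" and "payments"), split by a filter and joined on a hypothetical order_id field.

```python
def split_and_join(batch):
    """Split one micro-batch into two 'tables' by a filter, then inner-join them."""
    orders = [r for r in batch if r["topic"] == "orders"]
    payments = {r["order_id"]: r for r in batch if r["topic"] == "payments"}
    return [
        {
            "order_id": o["order_id"],
            "item": o["item"],
            "paid": payments[o["order_id"]]["paid"],
        }
        for o in orders
        if o["order_id"] in payments  # only rows present in the same batch match
    ]


batch_1 = [
    {"topic": "orders", "order_id": 1, "item": "noodles"},
    {"topic": "payments", "order_id": 1, "paid": 30},
    {"topic": "orders", "order_id": 2, "item": "rice"},
]
# The payment for order 2 arrives one batch later than the order itself.
batch_2 = [{"topic": "payments", "order_id": 2, "paid": 45}]

joined_1 = split_and_join(batch_1)
joined_2 = split_and_join(batch_2)
print(joined_1, joined_2)
```

Order 2 is never joined because its two halves land in different batches, which is exactly why the result depends on what each batch happens to consume.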
One point to pay special attention to is exactly once:
The data must be written to external storage before the offset is committed. Whether the offset is kept in ZooKeeper or MySQL, it is best to commit it in the same transaction as the output to external storage (where it is best to implement upsert semantics on a unique key). The driver then generates the Kafka RDD from the stored offsets, and each executor consumes data according to the offset of each Kafka partition. If these conditions are met, end-to-end exactly once can be achieved. This is a major prerequisite.
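A minimal sketch of this pattern, assuming the results and the offsets live in the same relational store. SQLite from the Python standard library stands in for MySQL here, and the table names, column names and the process_batch helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE totals (user_id TEXT PRIMARY KEY, total INTEGER)")
conn.execute("CREATE TABLE offsets (part INTEGER PRIMARY KEY, next_offset INTEGER)")


def process_batch(conn, part, start_offset, records):
    """Write results and the next offset in one transaction; skip replayed batches."""
    row = conn.execute(
        "SELECT next_offset FROM offsets WHERE part = ?", (part,)
    ).fetchone()
    committed = row[0] if row else 0
    if start_offset < committed:
        return False  # this batch was already stored together with its offset
    with conn:  # commits on success, rolls back if any statement raises
        for user_id, total in records:
            # Upsert on the unique key, so a retried write cannot duplicate a row.
            conn.execute("INSERT OR REPLACE INTO totals VALUES (?, ?)", (user_id, total))
        conn.execute(
            "INSERT OR REPLACE INTO offsets VALUES (?, ?)",
            (part, start_offset + len(records)),
        )
    return True


batch = [("u1", 10), ("u2", 5)]
first = process_batch(conn, 0, 0, batch)
replay = process_batch(conn, 0, 0, batch)  # the same batch delivered again
rows = conn.execute("SELECT user_id, total FROM totals ORDER BY user_id").fetchall()
next_offset = conn.execute("SELECT next_offset FROM offsets WHERE part = 0").fetchone()[0]
print(first, replay, rows, next_offset)
```

Because the offset only advances in the transaction that stores the results, a restart resumes from a position that is consistent with what the sink already contains.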
In summary, Spark Streaming has the following limitations:
1) Stateful processing SQL (< 2.x: mapWithState, updateStateByKey): to compute state across batches in 1.x, we use these two interfaces, but the state still has to be saved to HDFS or external storage, which is somewhat troublesome to implement.
2) True multi-stream join: there is no way to implement true multi-stream join semantics.
3) End-to-end exactly-once semantics: quite cumbersome to implement. After sinking to external storage, the offset has to be committed manually within the transaction.
We investigated and then adopted the stateful incremental computation available from Spark 2.x. The following picture is from the official website:
All stream computing draws on Google's Dataflow model, which has an important concept: processing time versus event time, that is, there is a gap between when data is processed and when it actually occurred. This is why stream computing has the notion of a watermark, which tracks the current event-time progress. A watermark specifies how much delay is tolerated; data that arrives outside the delay window can be discarded, since data that is too late is meaningless to the business.
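The watermark rule can be sketched as follows. This is a simplified per-record illustration, not Spark's implementation (Spark advances its watermark between micro-batches); the WatermarkFilter class and the allowed_lateness parameter are hypothetical names, and event times are plain integers.

```python
class WatermarkFilter:
    """Tracks the largest event time seen and drops records behind the watermark."""

    def __init__(self, allowed_lateness):
        self.allowed_lateness = allowed_lateness
        self.max_event_time = None

    def watermark(self):
        # Watermark = latest event time seen so far minus the tolerated delay.
        if self.max_event_time is None:
            return float("-inf")
        return self.max_event_time - self.allowed_lateness

    def accept(self, event_time):
        """Return True for an on-time record, False for one that is too late."""
        if event_time < self.watermark():
            return False
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return True


wm = WatermarkFilter(allowed_lateness=10)
arrivals = [100, 105, 97, 120, 108, 111]  # event times in arrival order
accepted = [t for t in arrivals if wm.accept(t)]
print(accepted, wm.watermark())
```

The record with event time 97 is late but still inside the delay window, so it is kept; 108 arrives after the watermark has moved to 110 and is discarded.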
The following is the architecture diagram of Structured Streaming:
Here, steps 1, 2 and 3 implement Spark's exactly once. In essence it is still a batch model. It maintains offsets itself and uses HDFS for state storage; it does not perform idempotent operations on the external sink, nor does it commit offsets after writing. It only guarantees fault tolerance and exactly once inside the engine.
Structured Streaming has the following characteristics:
1) Stateful processing SQL & DSL: stateful stream computation is supported.
2) True multi-stream join: from Spark 2.3, multiple streams can be joined. The approach is similar to Flink: you define the join condition for the two streams (mainly based on time). For example, with two topics flowing in, you use a field in the schema (usually event time) to bound how much data needs to be buffered, and in this way a true stream join can be realized.
3) End-to-end exactly-once semantics are easier to implement: it only requires extending the sink interface to support idempotent operations.
Note that the Structured Streaming API differs slightly from the native Streaming API. When creating the DataFrame for a table, you must specify its schema in advance. In addition, its watermark is not supported in SQL, so we added an extension that allows everything to be written in SQL, converting the form on the left into the form on the right (figure below). We hope that our users include not only programmers but also data analysts and others who cannot write programs.
Its features and limitations are as follows:
1) Trigger (processing time, continuous): before 2.3, triggers were mainly based on processing time, and the next batch was triggered as soon as the previous batch finished. Version 2.3 introduced a record-by-record continuous processing trigger.
2) Continuous processing (only map-like operations): at present only map-like operations are supported, and SQL support has some limitations.
3) Low end-to-end latency with exactly-once guarantees: end-to-end exactly once needs some additional extensions. Kafka 0.11 provides transactions, which could be used as a basis for end-to-end exactly once from source through engine to sink.
4) CEP (Drools): one business team needs CEP, that is, complex event processing, which our syntax cannot support directly. We let users use Drools, a rule engine, running on each executor, and rely on the rule engine to implement CEP.
Based on these characteristics and shortcomings of Spark Structured Streaming, we considered using Flink for these tasks.
Flink positions itself against Spark, and its streaming is more advanced and ambitious. It also covers graph computing, machine learning and so on, and the underlying layer supports YARN, Tez and so on. For the storage systems commonly used in the community, official support from the Flink community is comparatively good.
The JobManager in Flink plays the role of Spark's driver, and the TaskManager that of the executor. Tasks in Flink are similar to tasks in Spark. However, Flink uses Akka for RPC, and Flink core has its own memory serialization framework. In addition, a task does not wait for a whole stage to finish as in Spark, but sends data downstream as soon as it is processed.
Flink binary data processing operator:
Spark users generally serialize with Kryo or Java's default serialization, and Project Tungsten optimizes Spark programs at the JVM level and through code generation. Flink, by contrast, implements its own memory-based serialization framework, which maintains keys and pointers. Keys are stored contiguously, which is CPU-cache friendly, so the probability of a cache miss is very low. When comparing and sorting, there is no need to compare the real data first: the keys are compared, and only when they are equal is the data deserialized from memory and compared in full. This is a good performance optimization.
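The key-and-pointer comparison can be sketched in a few lines. This is only an illustration of the idea, not Flink's code: pickle stands in for the binary serializer, a list index stands in for the memory pointer, and the 4-byte prefix length is an arbitrary assumption.

```python
import functools
import pickle

PREFIX_LEN = 4           # assumed fixed width of the key prefix, in bytes
stats = {"loads": 0}     # counts how often a full record had to be deserialized


def make_entry(pointer, name):
    """Build a (key prefix, pointer) sort entry; the full record stays serialized."""
    prefix = name.encode("utf-8")[:PREFIX_LEN].ljust(PREFIX_LEN, b"\x00")
    return (prefix, pointer)


def load(buffer, pointer):
    stats["loads"] += 1
    return pickle.loads(buffer[pointer])


def sort_entries(buffer, entries):
    def compare(a, b):
        if a[0] != b[0]:
            # The prefixes decide the order without touching the serialized data.
            return -1 if a[0] < b[0] else 1
        # Prefix tie: only now deserialize both records and compare them in full.
        x, y = load(buffer, a[1]), load(buffer, b[1])
        return (x > y) - (x < y)

    return sorted(entries, key=functools.cmp_to_key(compare))


names = ["storm", "spark", "flink", "sparkling", "kafka"]
buffer = [pickle.dumps(n) for n in names]
entries = [make_entry(i, n) for i, n in enumerate(names)]

ordered = sort_entries(buffer, entries)
loads_during_sort = stats["loads"]
sorted_names = [pickle.loads(buffer[p]) for _, p in ordered]
print(sorted_names, loads_during_sort)
```

Only "spark" and "sparkling" share a prefix, so they are the only records deserialized during the sort; every other comparison is settled on the compact keys alone.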
Flink task chain:
The operator chain within a task is a good concept. If the data distribution between upstream and downstream does not need to be reshuffled, for example when the source in the figure is a Kafka source and the following map is just a simple filter, the operators can be placed in one thread to reduce the cost of thread context switching.
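Conceptually, chaining fuses consecutive operators into a single function call per record. The sketch below illustrates that idea in plain Python; the chain helper and the three operators are hypothetical and do not use Flink's API.

```python
def chain(*operators):
    """Fuse operators so a record passes through all of them in a single call."""
    def chained(record):
        for op in operators:
            record = op(record)
            if record is None:  # a filter dropped the record
                return None
        return record
    return chained


def parse(line):
    user, amount = line.split(",")
    return {"user": user, "amount": int(amount)}


def keep_large(rec):
    return rec if rec["amount"] >= 10 else None


def to_cents(rec):
    return {**rec, "amount": rec["amount"] * 100}


# One fused task instead of three operators handing records across threads.
task = chain(parse, keep_large, to_cents)
source = ["a,5", "b,20", "c,12"]
output = [r for r in map(task, source) if r is not None]
print(output)
```

No queue or thread handover sits between the three steps, which is the saving the chain is meant to provide.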
Concept of parallelism
For example, a job may have five tasks that need several concurrent threads to run. Once operators are chained, they run on one thread, which improves data transfer performance. Spark is a black box in which the concurrency of each operator cannot be set, while Flink allows the parallelism of each operator to be set, which is more flexible and gives higher resource utilization when the job runs.
Spark generally adjusts parallelism through spark.default.parallelism; when there is a shuffle, parallelism is generally adjusted through the spark.sql.shuffle.partitions parameter. For real-time computation this value should be reduced: in production we set it close to the number of Kafka partitions, while for batch jobs we increase it and set it to 1000. In the figure on the left, the parallelism is set to 2 and the maximum parallelism to 10, so the job runs as two parallel instances. In addition, keys are grouped into at most 10 key groups, so that data is spread out as much as possible.
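The relation between parallelism 2 and maximum parallelism 10 can be sketched as a two-step mapping: a key is hashed into one of 10 key groups, and each key group is assigned to one of the 2 parallel instances. The sketch assumes CRC32 as a stand-in for the engine's real hash function; the range assignment follows the key-group-times-parallelism-over-max-parallelism rule.

```python
import zlib

PARALLELISM = 2
MAX_PARALLELISM = 10


def key_group(key, max_parallelism):
    # Assumption: CRC32 stands in for the engine's real hash function.
    return zlib.crc32(key.encode("utf-8")) % max_parallelism


def operator_index(group, parallelism, max_parallelism):
    # Contiguous ranges of key groups are assigned to each parallel instance.
    return group * parallelism // max_parallelism


groups_per_instance = {}
for g in range(MAX_PARALLELISM):
    idx = operator_index(g, PARALLELISM, MAX_PARALLELISM)
    groups_per_instance.setdefault(idx, []).append(g)

print(groups_per_instance)
print(operator_index(key_group("user-42", MAX_PARALLELISM), PARALLELISM, MAX_PARALLELISM))
```

Because state is organized by key group rather than by instance, the job can later be rescaled up to the maximum parallelism by moving whole key groups between instances.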
State & Checkpoint
Because Flink processes data record by record, each record is sent downstream immediately after it is processed. Unlike Spark, it does not need to wait until all tasks of the stage containing the operator are complete.
Flink has a coarse-grained checkpoint mechanism that assigns each element to a snapshot at very small cost. Only when all the data belonging to a snapshot has arrived is the computation triggered, and after the computation the buffered data is sent downstream. Currently Flink SQL does not expose an interface for the buffer timeout, that is, how long data may be buffered before it is sent downstream. When building the Flink context, you can set the buffer timeout to 0 so that processed data is sent immediately, without waiting for a threshold to be reached.
By default the state backend is kept in JobManager memory. What we use more often is writing to HDFS: the state of each operator is written to RocksDB and then synchronized to external storage asynchronously, periodically and incrementally.
In the left half of the figure, the red node has failed. With at least once, it is enough to resend the data from upstream; with exactly once, every computing node has to be replayed from the last checkpoint before the failure.
Exactly Once Two-Phase Commit
Since Flink 1.4, a two-phase commit protocol supports exactly once. The idea is that after data is consumed from the upstream Kafka, each step casts a vote to record its state, with the checkpoint barrier marking the boundary. Only when the last step has written to Kafka (version 0.11 or later) is the state of each step reported to the coordinator in the JobManager, which is then notified that the transaction can be committed. In this way exactly once is achieved.
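The protocol can be sketched with a toy sink and coordinator. This is a simplified illustration of two-phase commit, not Flink's implementation; the TransactionalSink class, the run_checkpoint function and the operator_acks parameter are hypothetical names.

```python
class TransactionalSink:
    """Sink that buffers writes per checkpoint and exposes them only on commit."""

    def __init__(self):
        self.open_txn = []   # records written since the last barrier
        self.pending = {}    # checkpoint id -> pre-committed records
        self.committed = []  # records visible to downstream readers

    def write(self, record):
        self.open_txn.append(record)

    def pre_commit(self, checkpoint_id):
        # Phase 1: the barrier arrives, so seal the open transaction and vote yes.
        self.pending[checkpoint_id] = self.open_txn
        self.open_txn = []
        return True

    def commit(self, checkpoint_id):
        # Phase 2: the coordinator confirmed the checkpoint, so publish the data.
        self.committed.extend(self.pending.pop(checkpoint_id))

    def abort(self, checkpoint_id):
        # The checkpoint failed: discard the data; the source will replay it.
        self.pending.pop(checkpoint_id, None)


def run_checkpoint(checkpoint_id, sink, operator_acks):
    """Coordinator logic: commit only if every operator and the sink voted yes."""
    votes = list(operator_acks) + [sink.pre_commit(checkpoint_id)]
    if all(votes):
        sink.commit(checkpoint_id)
        return True
    sink.abort(checkpoint_id)
    return False


sink = TransactionalSink()
sink.write("a")
sink.write("b")
ok_1 = run_checkpoint(1, sink, operator_acks=[True, True])

sink.write("c")
ok_2 = run_checkpoint(2, sink, operator_acks=[True, False])  # one operator failed
print(ok_1, ok_2, sink.committed)
```

Record "c" never becomes visible because its checkpoint was aborted; after recovery it would be read again from the source and written in a new transaction.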
Another good feature of Flink is the savepoint, which is built on its checkpoint mechanism. Each application on the business side may need a different recovery point, and the version to recover can be specified. A savepoint restores not only the data but also the computation state.
In summary, Flink does the following well:
1) Trigger (processing time, event time, ingestion time): by comparison, Flink supports richer streaming semantics: not only processing time, but also event time and ingestion time.
2) Continuous processing & window: it supports true record-by-record continuous processing, and its windowing is better than Spark's.
3) Low end-to-end latency with exactly-once guarantees: thanks to two-phase commit, users can choose, according to business requirements, to trade some throughput for end-to-end exactly once.
4) CEP: good support.
5) Savepoints: allow some version control according to business requirements.
There are also some areas where it falls short:
1) SQL (syntax, functions, parallelism): SQL support is not very complete. Most of our users migrated from Hive, and Spark covers more than 99% of Hive, whereas in Flink some SQL functions are not supported and the parallelism of a single operator cannot currently be set.
2) ML, graph, etc.: machine learning, graph computing and similar areas are weaker than in Spark, but the community is continuously working on improving this.
Author: Yi Weiping
This article is from alitech, a partner of yunqi community. If you need to reprint it, please contact the original author.