This article looks at how to build a scalable data processing platform with the SMACK stack (Spark, Mesos, Akka, Cassandra, and Kafka). Although the stack consists of only a few simple parts, it can implement a large number of different system designs. Besides pure batch or stream processing, it can also be used to implement complex Lambda and Kappa architectures.
Digital Cloud, which is built on Mesos, can quickly deploy and run Spark, Akka, Cassandra, and Kafka, and readers are welcome to try these tools there. The designs and examples in this article are based on experience from existing production projects. First, a brief look at each component:
· Spark – a fast, general-purpose engine for large-scale distributed data processing.
· Mesos – a cluster resource management system that provides efficient resource isolation and sharing across distributed applications.
· Akka – a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.
· Cassandra – a distributed, highly available database designed to handle large amounts of data across multiple data centers.
· Kafka – a high-throughput, low-latency distributed messaging system / commit log designed to handle real-time data feeds.
Storage layer: Cassandra
Cassandra is well known for its high availability and high throughput. It can handle a considerable write load and tolerates node failures. In terms of the CAP theorem, Cassandra offers tunable consistency / availability levels for each operation.
Cassandra also scales linearly (load capacity can be increased by adding nodes to the cluster) and provides cross-data-center replication (XDCR). Beyond plain data replication, XDCR enables further use cases such as:
· geographically distributed data centers that handle data specific to a region or located closer to customers
· data migration between data centers, either to recover after a failure or to move data to a new data center
· separating operational and analytics workloads
These features come at a cost, and for Cassandra that cost lies in the data model. A table can be thought of as a nested sorted map: rows are grouped by partition key, and the entries within a partition are sorted by clustering columns. Here is a simple example:
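As a rough illustration of the nested sorted map idea, the following Python sketch models a single table in memory. It is not Cassandra code: the class name `SortedPartitions`, the `activity-1` style partition keys, and the date-string clustering keys are all hypothetical and chosen only for this illustration.

```python
from bisect import bisect_left, bisect_right


class SortedPartitions:
    """Toy model of one table: partition key -> entries sorted by clustering key."""

    def __init__(self):
        # partition key -> (sorted clustering keys, values in the same order)
        self._rows = {}

    def insert(self, partition_key, clustering_key, value):
        keys, values = self._rows.setdefault(partition_key, ([], []))
        i = bisect_left(keys, clustering_key)
        if i < len(keys) and keys[i] == clustering_key:
            values[i] = value  # same full key: overwrite the existing entry
        else:
            keys.insert(i, clustering_key)
            values.insert(i, value)

    def slice(self, partition_key, start, end):
        """Range read inside ONE partition; the partition key must be given in full."""
        keys, values = self._rows.get(partition_key, ([], []))
        lo, hi = bisect_left(keys, start), bisect_right(keys, end)
        return list(zip(keys[lo:hi], values[lo:hi]))


# Hypothetical data: events partitioned by activity ID, ordered by day.
events = SortedPartitions()
events.insert("activity-1", "2015-09-03", 40)
events.insert("activity-1", "2015-09-01", 10)
events.insert("activity-1", "2015-10-02", 7)
events.insert("activity-2", "2015-09-02", 5)

september = events.slice("activity-1", "2015-09-01", "2015-09-30")
print(september)
```

Because the entries of a partition are kept in sorted order, a range read is a contiguous slice of one partition. The model offers no way to ask for a range across partitions without visiting each of them.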
To read a range of data, we must specify the full partition key, and a range restriction is allowed only on the last column in the list. This constraint prevents multiple scans over different ranges, which would otherwise cause random disk access and degrade overall performance. As a result, the data model must be designed carefully around the read queries in order to limit the amount of reading / scanning, and this in turn leaves less flexibility for supporting new queries.
So what should we do if some tables need to be joined with other tables? Consider the following scenario: calculating the total traffic of every activity for a given month.
With such a model, the only way to achieve this is to read all activities, read all events, sum the attribute values of the events with a matching activity ID, and assign the totals to the activities. This is clearly challenging for an application, because the amount of data stored in Cassandra is often too large to fit in memory. The data therefore has to be processed in a distributed way, and this is where Spark comes in.
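The read, match, and sum steps can be sketched in plain Python over a small in-memory data set. This is only a single-process illustration of the computation and not a distributed job; the field names (`id`, `activity_id`, `month`, `traffic`) and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def total_traffic_by_activity(activities, events, month):
    """Sum event traffic per activity for one month (single-process sketch)."""
    known_ids = {activity["id"] for activity in activities}
    totals = defaultdict(int)
    for event in events:
        # Keep only events of the requested month that match a known activity.
        if event["month"] == month and event["activity_id"] in known_ids:
            totals[event["activity_id"]] += event["traffic"]
    # Activities without matching events get a total of 0.
    return {activity["id"]: totals[activity["id"]] for activity in activities}


# Hypothetical sample data.
activities = [{"id": "a1"}, {"id": "a2"}]
events = [
    {"activity_id": "a1", "month": "2015-09", "traffic": 3},
    {"activity_id": "a1", "month": "2015-09", "traffic": 5},
    {"activity_id": "a2", "month": "2015-10", "traffic": 4},
    {"activity_id": "a3", "month": "2015-09", "traffic": 9},  # unknown activity
]

totals = total_traffic_by_activity(activities, events, "2015-09")
print(totals)
```

The loop touches every event once, which is exactly why the same computation over a full Cassandra data set needs to be spread across many machines.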
Processing layer: Spark
The core abstraction in Spark is the RDD (resilient distributed dataset, a distributed collection of elements), and a Spark workflow consists of four main phases:
· RDD operations (transformations and actions) form a DAG (directed acyclic graph)
· the DAG is split into stages of tasks, which are then submitted to the cluster manager
· each stage groups together tasks that require no shuffling / repartitioning
· the tasks run on the workers, and the results are returned to the client
Here is how Spark and Cassandra can be used to solve the problem above:
The interaction with Cassandra goes through the Spark Cassandra Connector, which makes the whole process simpler and more intuitive. Another interesting option is Spark SQL, which translates SQL statements into a series of RDD operations.
With a few lines of code, we can implement a naive Lambda design. A real implementation would obviously be more complex, but the example shows how simply the required functionality can be achieved.
MapReduce-like solution: bringing processing closer to the data
The Spark Cassandra Connector is aware of data locality and reads data from the closest node in the cluster, which minimizes the amount of data transferred over the network. To take full advantage of this data locality awareness, Spark workers should be collocated with Cassandra nodes.
Besides collocating Spark with Cassandra, it also makes sense to separate the operational (write-heavy) cluster from the analytics cluster, so that:
· the clusters can be scaled independently
· data is replicated by Cassandra, with no other mechanism required
· the analytics cluster can have its own, different read / write load pattern
· the analytics cluster can hold additional data (such as dictionaries) and processing results
· the impact of Spark on resources is limited to a single cluster
Let us look again at the deployment options for Spark applications.
There are currently three main cluster resource managers to choose from:
· Spark Standalone – Spark acts as the master, and the workers are installed and run as standalone applications (this obviously adds overhead and supports only static resource allocation per worker)
· YARN – a very good option if you already have a Hadoop ecosystem
· Mesos – designed from the outset for dynamic allocation of cluster resources; besides Hadoop applications, it is suited to all kinds of heterogeneous workloads
A Mesos cluster consists of master nodes, which are responsible for resource offers and scheduling, and slave nodes, which do the actual work of executing tasks. In HA mode with multiple master nodes, ZooKeeper is used for leader election and service discovery. Applications executed on Mesos are called "frameworks"; they use the API to handle resource offers and to submit tasks to Mesos. In general, task execution consists of the following steps:
· the slave nodes report their available resources to the master node
· the master node sends resource offers to the frameworks
· the framework's scheduler replies with tasks and the resources required for each task
· the master node sends the tasks to the slave nodes
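The offer cycle above can be modelled with a few lines of Python. This is a toy simulation and does not use the Mesos API: the `Offer` and `Task` classes, the first-fit matching policy, and all node and task names are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    slave: str      # node that reported the free resources
    cpus: float
    mem_mb: int


@dataclass
class Task:
    name: str
    cpus: float
    mem_mb: int


def schedule(offers, pending):
    """Framework-side scheduler: first-fit matching of pending tasks to offers."""
    launches = []              # (slave, task name) pairs the master would dispatch
    remaining = list(pending)  # tasks that did not fit into any offer
    for offer in offers:
        cpus, mem_mb = offer.cpus, offer.mem_mb
        for task in list(remaining):
            if task.cpus <= cpus and task.mem_mb <= mem_mb:
                launches.append((offer.slave, task.name))
                cpus -= task.cpus
                mem_mb -= task.mem_mb
                remaining.remove(task)
    return launches, remaining


# Steps 1-2: the slaves report resources and the master forwards them as offers.
offers = [Offer("slave-1", 2.0, 4096), Offer("slave-2", 1.0, 1024)]
# Step 3: the scheduler answers with tasks and their resource requirements.
pending = [Task("aggregate", 1.5, 2048), Task("ingest", 1.0, 1024), Task("report", 4.0, 8192)]
launches, remaining = schedule(offers, pending)
# Step 4: the master would now send each launched task to its slave.
print(launches, [task.name for task in remaining])
```

A task that fits no current offer simply stays pending until a later offer is large enough, which is how the offer model lets frameworks with very different resource needs share one cluster.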
Combining Spark, Mesos and Cassandra
As mentioned earlier, Spark workers should be collocated with Cassandra nodes in order to exploit data locality awareness and so reduce network traffic and the load on the Cassandra cluster. The following figure shows an example of a feasible deployment scenario that uses Mesos to achieve this:
· Mesos master nodes are collocated with ZooKeeper nodes
· Mesos slave nodes are collocated with Cassandra nodes to give Spark better data locality
· Spark binaries are deployed to all worker nodes, and spark-env.sh is configured with the appropriate master endpoint and executor location
· the Spark executor JAR is uploaded to S3 / HDFS
With this setup, Spark jobs can be submitted to the cluster with a simple spark-submit call from any worker node that has the Spark binaries installed and the JAR containing the actual job logic uploaded.
Since options for running Dockerized Spark already exist, it is not strictly necessary to distribute the binaries to every single cluster node.
Executing scheduled and long-running tasks
Sooner or later, every data processing system has to deal with two essential categories of jobs: scheduled / periodic jobs, such as periodic batch aggregations, and long-running jobs, such as stream processing. A key requirement for both is fault tolerance: jobs must keep running even when cluster nodes fail. The Mesos ecosystem offers two excellent frameworks, one for each category.
Marathon is a framework for the fault-tolerant execution of long-running tasks. It supports HA mode with ZooKeeper, can run Docker containers, and provides a convenient REST API. The following shell command example configures a simple job that runs spark-submit:
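As a hedged sketch of such a job configuration, the Python snippet below builds the JSON body of a Marathon app definition, which would be sent with an HTTP POST to Marathon's `/v2/apps` endpoint. The fields `id`, `cmd`, `cpus`, `mem`, and `instances` are standard app definition fields; the app ID, class name, ZooKeeper address, and JAR path are hypothetical placeholders.

```python
import json


def marathon_app(app_id, cmd, cpus, mem_mb, instances=1):
    """Build a minimal Marathon app definition as a plain dictionary."""
    return {
        "id": app_id,            # unique application ID inside Marathon
        "cmd": cmd,              # shell command that Marathon keeps running
        "cpus": cpus,            # CPU shares per instance
        "mem": mem_mb,           # memory per instance, in MB
        "instances": instances,  # number of copies to keep alive
    }


app = marathon_app(
    app_id="/spark-events-job",  # hypothetical app ID
    cmd=(
        "spark-submit "
        "--class com.example.EventsJob "          # hypothetical job class
        "--master mesos://zk://zk-1:2181/mesos "  # hypothetical ZooKeeper address
        "/opt/jobs/events-job.jar"                # hypothetical JAR location
    ),
    cpus=1.0,
    mem_mb=1024,
)

payload = json.dumps(app, indent=2)
print(payload)  # body for POST /v2/apps on the Marathon host
```

If the process exits or its node fails, Marathon starts the command again elsewhere, which is what makes it suitable for long-running jobs.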
Chronos has the same characteristics as Marathon but is designed for running scheduled jobs; in essence it is a distributed, highly available cron that also supports dependencies between jobs. The following example configures an S3 compression job with a simple bash script:
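A scheduled job of this kind can be sketched as the JSON body of a Chronos job definition. Chronos expresses schedules as ISO 8601 repeating intervals (`R/<start>/<period>`), and `name`, `command`, `schedule`, and `owner` are standard job fields. The job name, script path, and owner address below are hypothetical, and the exact REST path for submitting the job depends on the Chronos version in use.

```python
import json
from datetime import datetime, timezone


def chronos_job(name, command, start, period, owner):
    """Build a minimal Chronos job definition with an ISO 8601 repeating schedule."""
    # "R/<start>/<period>" repeats forever, starting at <start>, every <period>.
    schedule = f"R/{start.strftime('%Y-%m-%dT%H:%M:%SZ')}/{period}"
    return {"name": name, "command": command, "schedule": schedule, "owner": owner}


job = chronos_job(
    name="s3-compression",                    # hypothetical job name
    command="bash /opt/jobs/compress-s3.sh",  # hypothetical script path
    start=datetime(2016, 1, 1, 2, 0, tzinfo=timezone.utc),
    period="P1D",                             # run once per day
    owner="data-team@example.com",            # hypothetical owner
)

print(json.dumps(job, indent=2))
```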
A variety of other frameworks are already available or under active development that integrate widely used systems with Mesos resource management. A typical example:
· Myriad: YARN on Mesos
So far, so good: the storage layer has been designed, resource management has been set up, and the jobs have been configured. What remains is getting the data into the system for processing.
Assuming that the input data arrives at a very high rate, the endpoints that receive it need to meet the following requirements:
· high throughput / low latency
· easy to scale
· support for back pressure
Back pressure is not strictly required, but it is a good option for coping with load spikes.
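One simple way to picture back pressure is a bounded buffer that refuses new work once it is full, so that the producer is told to slow down instead of the endpoint running out of memory. The sketch below uses only the Python standard library; the `IngestEndpoint` class and the use of HTTP-style status codes 202 and 503 are assumptions made for this illustration, not part of any framework mentioned here.

```python
import queue


class IngestEndpoint:
    """Accepts events into a bounded buffer and pushes back when it is full."""

    def __init__(self, capacity):
        self._buffer = queue.Queue(maxsize=capacity)

    def receive(self, event):
        try:
            self._buffer.put_nowait(event)
            return 202  # accepted for later processing
        except queue.Full:
            return 503  # buffer full: signal the producer to back off and retry

    def drain(self, max_items):
        """Called by the downstream consumer; frees capacity for new events."""
        drained = []
        while len(drained) < max_items:
            try:
                drained.append(self._buffer.get_nowait())
            except queue.Empty:
                break
        return drained


endpoint = IngestEndpoint(capacity=2)
statuses = [endpoint.receive(event) for event in ("e1", "e2", "e3")]
print(statuses)                # the third event is rejected
print(endpoint.drain(1))       # the consumer catches up a little
print(endpoint.receive("e4"))  # capacity is available again
```

Rejecting requests is the bluntest form of back pressure; streaming libraries typically propagate demand upstream instead, but the effect on the producer is similar.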
Akka meets these requirements very well; it was essentially designed to provide this set of features. Its main characteristics are:
· an implementation of the actor model for the JVM
· a message-based, asynchronous architecture
· no shared mutable state, by design
· easy scaling from a single process to a cluster of machines
· actor hierarchies with top-down supervision
· more than a concurrency framework: Akka HTTP, Akka Streams, and Akka Persistence
The following simplified example shows three actors that handle a JSON HTTP request, parse it into a domain model case class, and save it to Cassandra:
It takes only a few lines of code to achieve this, but using Akka to write raw data (i.e. events) straight into Cassandra can cause the following problems:
· Cassandra is designed for fast serving rather than batch processing, so incoming data needs to be pre-aggregated
· the computation time of aggregations / rollups grows with the total amount of data
· because of their stateless design, actors are not well suited to performing aggregation
· micro-batching can solve this problem to some extent
· some kind of reliable buffer for the raw data is still needed
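The effect of micro-batching with pre-aggregation can be sketched in plain Python: events are grouped into small batches, and each batch is rolled up per key before it is written, so the store receives a few aggregated rows instead of every raw event. The event shape `(activity_id, traffic)` and the batch size are assumptions for the example.

```python
from collections import Counter


def micro_batches(events, batch_size):
    """Split a list of events into consecutive batches of at most batch_size."""
    for start in range(0, len(events), batch_size):
        yield events[start:start + batch_size]


def pre_aggregate(batch):
    """Roll up one batch: total traffic per activity ID."""
    totals = Counter()
    for activity_id, traffic in batch:
        totals[activity_id] += traffic
    return dict(totals)


# Hypothetical raw events: (activity_id, traffic).
raw_events = [("a1", 1), ("a2", 2), ("a1", 3), ("a1", 4), ("a2", 5), ("a1", 6)]

rollups = [pre_aggregate(batch) for batch in micro_batches(raw_events, 3)]
print(rollups)  # each dictionary is what would be written for one batch
```

The batches here are cut from a list that is already complete; in a live system the raw events would have to be held somewhere reliable while a batch fills up, which is the buffering requirement named in the last bullet.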
Kafka as a buffer for incoming data
To retain the incoming data and to pre-aggregate / process it, we can use a distributed commit log. In the following use case, consumers read the data in batches, process it, and store it in Cassandra in pre-aggregated form. This example shows how to use Akka HTTP to publish JSON data received over HTTP to Kafka:
Data consumption: Spark Streaming
Although Akka can also be used to consume stream data from Kafka, bringing Spark into the ecosystem makes Spark Streaming available, which addresses the following points:
· it supports multiple data sources
· it provides "at-least-once" semantics
· "exactly-once" semantics can be achieved by combining Kafka Direct with idempotent storage
The following code example illustrates how to use Spark Streaming to consume an event stream from Kinesis:
Designing for failure: backups and patching
Designing for failure is usually the most boring part of any system, but its importance is beyond doubt: when a data center becomes unavailable or a crash has to be analyzed, protecting the data from loss as far as possible is essential.
So why not simply keep the data in Kafka / Kinesis? Kinesis retains data for a limited period only (24 hours by default), so without backups all processing results could be lost after a failure. Kafka supports much longer retention, but the cost of owning the hardware has to be taken seriously: S3 storage is far cheaper than the large number of instances needed to run Kafka, and S3 also comes with a very good service level agreement.
Besides backups, the recovery / patching strategy should be designed in advance and tested, so that any data-related problem can be fixed quickly. A programming mistake can easily corrupt the results of an aggregation job or a data deduplication step, so the ability to repair such errors is critical. A simple way to make this kind of operation easier is to build idempotency into the data model, so that repeating the same operation several times yields the same result (for example, a SQL update is idempotent, while a counter increment is not).
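The difference between the two kinds of operation shows up as soon as a batch is replayed, for example when a job is rerun after a failure or a patch. The following Python sketch uses a plain dictionary as a stand-in for the data store; the function names and the `a1:2015-09` style keys are hypothetical.

```python
def apply_update(store, key, value):
    """Idempotent: sets the value, so replays leave the same state."""
    store[key] = value


def apply_increment(store, key, delta):
    """Not idempotent: every replay adds the delta again."""
    store[key] = store.get(key, 0) + delta


def replay(operation, store, batch, times):
    """Apply the same batch of (key, value) pairs several times."""
    for _ in range(times):
        for key, value in batch:
            operation(store, key, value)
    return store


# Hypothetical pre-aggregated results for one month.
batch = [("a1:2015-09", 8), ("a2:2015-09", 3)]

updated = replay(apply_update, {}, batch, times=2)
incremented = replay(apply_increment, {}, batch, times=2)
print(updated)      # same as applying the batch once
print(incremented)  # every total has been counted twice
```

With update-style writes, a failed or patched job can simply be rerun over the same input, whereas increment-style writes would first require the previous results to be removed.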
The following example is a Spark job that reads a backup from S3 and loads it into Cassandra:
Top-level design of a data platform based on SMACK
To sum up, the main strengths of the SMACK stack are:
· a concise toolbox that covers a wide range of data processing scenarios
· software that has been battle-tested over a long time, is widely used, and is backed by strong communities
· easy scaling and data replication with low latency
· unified cluster management for heterogeneous workloads
· a single platform for any type of application
· an implementation platform for different architecture designs (batch, streaming, Lambda, Kappa)
· very fast time to market (e.g. for MVP validation)