Nathan Marz, the author of Storm, proposed the Lambda architecture as a way to build stream processing applications on top of MapReduce and Storm. The Lambda architecture captures an immutable sequence of records and feeds it into a batch processing system and a stream processing system in parallel. The catch is that you have to implement the same processing logic twice, once in the batch system and once in the stream system. To answer a query, the results computed by the two systems are merged and returned to the requester.
The two systems are interchangeable: for example, you can use Kafka plus Storm for stream processing and Hadoop for batch processing. The output is usually kept in two separate databases, one optimized for streaming updates and the other optimized for batch updates.
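As a rough illustration of this flow, here is a minimal Python sketch that counts page views. It does not use any real framework API: the event log, the `cutoff` value, and the names `batch_count`, `SpeedLayer`, and `query` are all hypothetical stand-ins for the batch system, the stream system, and the query-time merge.

```python
from collections import Counter

# Hypothetical immutable event log: (timestamp, page) pairs.
EVENTS = [(1, "home"), (2, "home"), (3, "about"), (4, "home"), (5, "about")]


def batch_count(events, cutoff):
    """Batch layer: recompute page counts from all events before the cutoff."""
    return Counter(page for ts, page in events if ts < cutoff)


class SpeedLayer:
    """Stream layer: the same counting logic, written a second time as
    incremental updates over events the batch run has not covered yet."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, ts, page):
        self.counts[page] += 1


def query(page, batch_view, realtime_view):
    """Serving step: merge the two views to answer one request."""
    return batch_view[page] + realtime_view[page]


cutoff = 4  # assumption: the last batch run covered events with ts < 4
batch_view = batch_count(EVENTS, cutoff)

speed = SpeedLayer()
for ts, page in EVENTS:
    if ts >= cutoff:
        speed.on_event(ts, page)

home_total = query("home", batch_view, speed.counts)
about_total = query("about", batch_view, speed.counts)
print(home_total, about_total)
```

Even in this toy version the counting rule lives in two places (`batch_count` and `SpeedLayer.on_event`), and any change to it has to be made in both.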
Jay Kreps has long worked on real-time data pipelines, some of which were built in the Lambda style, but he prefers a newer alternative.
Lambda Architecture Diagram
Why did the Lambda architecture come about?
People who set out to build stream processing systems tended not to think much about recalculating data, so their systems ended up with no convenient way to do it.
The Lambda architecture emphasizes keeping the raw input data immutable and makes the problem of recalculation explicit. As a result, it handles computation over both streaming data and historical data reasonably well.
Why would data need to be recalculated?
Because code changes over time. You may want to add a new field to the output, or there may be a bug that needs fixing. Whatever the reason, getting the new expected results out of historical data requires recalculating it.
Jay Kreps’s Objection to the Lambda Architecture
One of the claims made for the Lambda architecture is that stream processing is inherently approximate and less accurate than batch processing.
Jay Kreps disagrees. He concedes that existing stream processing frameworks are less mature than MapReduce, but argues that this does not mean stream processing systems cannot offer semantic guarantees as strong as those of batch systems. The Lambda architecture has also been billed as something that “beats the CAP theorem”. In fact, although stream processing does trade off latency against availability, it is an asynchronous processing architecture, so its results cannot be immediately consistent with the incoming data. The CAP theorem remains intact.
What are the problems with the Lambda architecture?
With the Lambda architecture you have to maintain code that produces the same results in two complex distributed systems. That is exactly as painful as it sounds, and Jay Kreps does not think the problem can be fixed.
Distributed frameworks such as Storm and Hadoop are very complex, so code inevitably ends up being engineered for the framework it runs on.
Why is Lambda so exciting?
Jay Kreps’s advice is to use a batch processing system such as MapReduce if you are not latency-sensitive and a stream processing framework if you are, and not to run both at the same time unless you absolutely must.
But requirements are rarely that tidy: people do need to build complex, low-latency processing systems (and product managers tend to want systems that are large and full-featured, which makes the demand even greater).
Neither of the two tools they have solves the problem on its own: a scalable, high-latency batch processing system that can handle historical data, and a low-latency stream processing system that cannot reprocess results. Wiring the two together does yield a workable solution, and that is the Lambda architecture. Painful as it is, the Lambda architecture does address the often-overlooked problem of recalculation.
Still, Jay Kreps regards the Lambda architecture as a temporary solution only: it is neither a new programming paradigm nor the future direction of big data.
Experience of Jay Kreps
This view comes out of many discussions and attempts at LinkedIn. Keeping code written for two different systems fully in sync turned out to be very difficult. APIs meant to hide the underlying frameworks proved to be the poorest kind of abstraction: such a design still requires deep Hadoop knowledge and a deep understanding of the real-time layer, and on top of that, whenever you debug or troubleshoot a performance problem, you also need to understand how the abstraction layer translates into the underlying processing frameworks. Simplicity may well be the most effective approach.
Jay Kreps’s Soul-Searching Questions
- Why can’t stream processing systems be improved to handle the full set of problems in their target domain?
- Why do you need to glue on another system?
- Why can’t one system both process data in real time and reprocess it when the code changes?
- Why not handle recalculation by increasing parallelism and replaying history very, very fast?
What is Jay Kreps’s point of view?
Jay Kreps asks why stream processing systems cannot be improved to handle the full set of problems in their target domain.
There are two lines of thought.
- Use a language or framework that abstracts over both the real-time and the batch framework. You write code against the higher-level API, which then compiles it down to either real-time or batch processing.
It certainly makes things better, but it doesn’t solve the problem.
Even if this avoids writing the code twice, the burden of running and debugging two systems remains very high, and the new abstraction can only offer the intersection of the features that the two systems support. (But isn’t Beam doing exactly this right now?) The approach is as notoriously difficult as making a cross-database ORM truly transparent.
Building a unified abstraction layer over completely different programming paradigms that sit on barely stable distributed systems is much harder than building one over systems that offer similar capabilities and a nearly standardized interface language, as databases do.
- Strengthen the stream processing system so that it can handle the full set of problems in its target domain. The basic abstraction in stream processing is the dataflow DAG, which is exactly the underlying abstraction of a traditional data warehouse and also the basic abstraction of MapReduce and Tez. Stream processing is simply a generalization of this dataflow model that exposes checkpointing of intermediate results and continuous output to the end user.
Jay Kreps’s Approach to Recalculation in a Stream Processing System
- Use Kafka or a similar system to retain the full log of the data you want to be able to recalculate, for example the last 30 days.
- When you want to recalculate, start a second instance of the stream processing job that reads from the beginning of the retained data and writes its results to a new table.
- When the second instance has caught up with the first, switch the application to read from the new table.
- Stop the old version of the job and delete the old table.
With this approach, recalculation happens only when the code changes. The recalculation job is simply an improved version of the same code, running on the same framework and consuming the same input data. You can also raise the job’s parallelism so that it finishes quickly.
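The four steps above can be sketched in a few lines of Python. This is only an in-memory illustration under stated assumptions: the list `LOG` stands in for a retained Kafka topic, the dictionary `tables` stands in for the output database, and `job_v1`, `job_v2`, `run_job`, and the table names are hypothetical.

```python
from collections import Counter

# Hypothetical stand-in for a retained log (e.g. 30 days of records).
LOG = ["home", "Home", "about", "HOME", "about"]


def job_v1(record):
    """Old logic: counts page names case-sensitively (treated here as the bug)."""
    return record


def job_v2(record):
    """New logic: normalizes case before counting."""
    return record.lower()


def run_job(log, transform):
    """Replay the log from offset 0 and build an output table.

    Returns the table and the offset the job has reached."""
    table = Counter()
    for record in log:
        table[transform(record)] += 1
    return table, len(log)


# The old job has been running and the application reads its table.
tables = {"counts_v1": run_job(LOG, job_v1)[0]}
serving = "counts_v1"

# Code changed: start a second instance from the beginning of the log,
# writing to a new table while the old one keeps serving reads.
new_table, new_offset = run_job(LOG, job_v2)
tables["counts_v2"] = new_table

# Once the new instance has caught up with the end of the log, switch reads,
# then stop the old job and drop its table.
if new_offset == len(LOG):
    serving = "counts_v2"
    del tables["counts_v1"]

print(serving, dict(tables[serving]))
```

Both versions of the job run on the same code path (`run_job`) and read the same input; only the transformation differs, which is the point of the approach.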
This architecture is called the Kappa architecture.
The old and new tables can also coexist for a while, so that you can revert to the old logic simply by switching the application back to the old table. In especially important cases, A/B tests or bandit algorithms can be used to make sure that a bug fix or code improvement does not accidentally make things worse.
Likewise, the data can still be stored in HDFS, but recalculation is no longer done there.
For its Kappa-style systems, LinkedIn uses Samza internally.
The two approaches make somewhat different trade-offs in efficiency and resources.
- The Lambda architecture has to run both reprocessing and real-time processing all the time.
- The Kappa architecture runs the second job only when reprocessing is needed. However, it temporarily requires twice the storage space in the output database, and it needs a database that can sustain high-volume writes for the reload.
In both cases the extra load of reprocessing tends to average out. If you have many such jobs, they will not all reprocess at once, so on a shared cluster running dozens of them you might budget a few extra percentage points of capacity for the handful of jobs that are reprocessing at any given time.
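To make the capacity argument concrete, here is a back-of-the-envelope calculation in Python. Every number in it is a hypothetical assumption chosen for illustration, not a figure from the source, and it assumes each job normally consumes the same share of the cluster.

```python
# All numbers below are hypothetical, chosen only to illustrate the arithmetic.
total_jobs = 40            # "dozens" of stream jobs sharing one cluster
reprocessing_at_once = 1   # jobs running a second (reprocessing) instance right now
cost_multiplier = 2        # the replay instance runs with extra parallelism

# Extra load as a percentage of the cluster's steady-state load,
# assuming every job normally uses the same amount of capacity.
extra_capacity_pct = 100 * reprocessing_at_once * cost_multiplier / total_jobs
print(f"extra capacity budget: {extra_capacity_pct}%")
```

Under these assumptions the extra budget comes to 5% of steady-state capacity; with more jobs sharing the cluster, or fewer reprocessing at once, the share shrinks further.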
| |Lambda architecture|Kappa architecture|
|---|---|---|
|Data processing capability|Can handle very large volumes of historical data.|Limited capacity for processing historical data.|
|Machine overhead|Batch processing and real-time computation both run all the time, so machine overhead is high.|Full recalculation runs only when necessary, so machine overhead is relatively low.|
|Storage overhead|Only one set of query results needs to be stored, so storage overhead is low.|Results from both the old and the new instance need to be stored, so storage overhead is relatively high. On a cluster shared by many jobs, however, only a small share of extra storage needs to be reserved.|
|Development and testing difficulty|Two sets of code have to be developed and tested, which is difficult.|Only one framework is involved, so development and testing are relatively easier.|
|Operation and maintenance cost|Two systems have to be maintained, which is costly.|Only one framework has to be maintained, which reduces operation and maintenance cost.|
The comparison table is referenced from: http://bigdata.51cto.com/art/…
The real advantage of Kappa is not about efficiency, but about allowing people to develop, test, debug and operate their systems on a single processing framework. Therefore, when simplicity is important, consider the Kappa architecture as an alternative to the Lambda architecture.