I. Big Data Processing Flow
The figure above is a simplified flowchart of big data processing. The main stages are data collection, data storage, data analysis, and data application. Next, we will walk through the technology stack for each stage in turn.
1.1 Data Collection
The first step in big data processing is data collection. Nowadays, medium and large projects are usually deployed as distributed microservices, so data must be collected from many servers, and collection must not interfere with normal business operations. Out of this need, a variety of log collection tools have emerged, such as Flume and Logstash (often paired with Kibana for visualizing the collected logs). With simple configuration, they can handle complex data collection and aggregation.
1.2 Data Storage
After collecting the data, the next question is: how should it be stored? Traditional relational databases such as MySQL and Oracle are well known; their strength is fast storage of structured data with support for random access. However, big data is usually semi-structured (such as log data) or even unstructured (such as video and audio data). To store massive semi-structured and unstructured data, distributed file systems such as Hadoop HDFS, KFS, and GFS were developed. They can store structured, semi-structured, and unstructured data alike, and they scale horizontally simply by adding machines.
Distributed file systems neatly solve the problem of massive data storage, but a good storage system has to consider access as well as storage. Random access to individual records, for example, is something traditional relational databases excel at and distributed file systems do not. Is there a storage scheme that combines the strengths of both? HBase and MongoDB came into being to answer exactly this need.
1.3 Data Analysis
Data analysis is the most important part of big data processing. It usually comes in two flavors: batch processing and stream processing.
- Batch processing: processing data at rest, i.e., a bounded dataset that has already been collected. The corresponding frameworks include Hadoop MapReduce, Spark, Flink, and so on.
- Stream processing: processing data in motion, that is, processing records as they arrive. The corresponding frameworks include Storm, Spark Streaming, Flink Streaming, and so on.
Batch and stream processing each have their applicable scenarios. When results are not time-sensitive, or hardware resources are limited, batch processing works well; when timeliness matters, stream processing is the better fit. As server hardware gets cheaper and timeliness requirements grow, stream processing is becoming more and more popular, for example in stock price prediction and real-time business analytics.
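The difference between the two models can be sketched in plain Java without any framework (the class and method names below are made up for illustration): batch processing sees the whole dataset before it starts, while stream processing updates its state as each record arrives.

```java
import java.util.*;
import java.util.stream.*;

public class BatchVsStream {
    // Batch: the whole (bounded) dataset is available before processing starts.
    static Map<String, Long> batchWordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    // Stream: state is updated incrementally as each record arrives;
    // results are available at any moment, not only at the end.
    static class StreamingWordCount {
        final Map<String, Long> counts = new HashMap<>();
        void onRecord(String line) {
            for (String w : line.split("\\s+")) counts.merge(w, 1L, Long::sum);
        }
    }

    public static void main(String[] args) {
        List<String> data = List.of("to be or not", "to be");

        Map<String, Long> batch = batchWordCount(data);

        StreamingWordCount s = new StreamingWordCount();
        for (String line : data) s.onRecord(line); // simulate one-by-one arrival

        System.out.println(batch.get("to"));    // 2
        System.out.println(s.counts.get("be")); // 2
    }
}
```

Both produce the same word counts here; the difference is *when* results become available, which is exactly what separates the two families of frameworks above.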
The frameworks above all require programming to analyze data, so if you are not a backend engineer, does that mean you cannot do data analysis? Of course not. Big data has a very complete ecosystem: where there is a demand, there is a solution. To let people familiar with SQL analyze data, query and analysis frameworks came into being; commonly used ones are Hive, Spark SQL, Flink SQL, Pig, and Phoenix. These frameworks let you query and analyze data flexibly with standard SQL or SQL-like syntax. The SQL is parsed, optimized, and translated into the corresponding job program: Hive essentially converts SQL into MapReduce jobs, Spark SQL converts SQL into a series of RDDs and transformations, and Phoenix converts SQL queries into one or more HBase scans.
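To make the Hive example concrete, here is a rough sketch in plain Java of what a query like `SELECT word, COUNT(*) FROM docs GROUP BY word` boils down to under the MapReduce model. The `map`/`shuffle`/`reduce` helpers are simplified stand-ins for illustration, not Hadoop's real API (in real Hadoop, the shuffle is done by the framework):

```java
import java.util.*;

public class SqlToMapReduce {
    // Map phase: each input row becomes (key, 1) pairs keyed by the GROUP BY column.
    static List<Map.Entry<String, Integer>> map(String row) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : row.split("\\s+"))
            out.add(Map.entry(word, 1));
        return out;
    }

    // Shuffle phase: group all values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (var p : pairs)
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return groups;
    }

    // Reduce phase: COUNT(*) becomes a sum over each key's values.
    static Map<String, Integer> reduce(Map<String, List<Integer>> groups) {
        Map<String, Integer> result = new TreeMap<>();
        groups.forEach((k, vs) -> result.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return result;
    }

    public static void main(String[] args) {
        List<String> docs = List.of("hive turns sql", "sql into jobs");
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String row : docs) pairs.addAll(map(row));
        System.out.println(reduce(shuffle(pairs))); // {hive=1, into=1, jobs=1, sql=2, turns=1}
    }
}
```

A query engine like Hive performs this SQL-to-phases translation automatically, which is why users only need to know SQL.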
1.4 Data Application
Once data analysis is complete, we enter the realm of data application, which depends entirely on your actual business needs. For example, you can visualize the data, or use it to optimize a recommendation algorithm, which is extremely common nowadays: personalized short-video recommendation, e-commerce product recommendation, news-feed recommendation, and so on. Of course, you can also use the data to train machine learning models; those belong to other fields, each with its own frameworks and technology stacks, and will not be covered here.
1.5 Other frameworks
The above is the technical framework for a standard big data processing pipeline. Real-world big data processing is much more complex, however, and a variety of frameworks have been developed to address the complications that arise.
- A single machine's processing capacity has hard limits, so big data frameworks are deployed as clusters. To deploy, monitor, and manage clusters more conveniently, cluster management tools such as Ambari and Cloudera Manager were developed.
- To keep clusters highly available, ZooKeeper is needed. ZooKeeper is the most widely used distributed coordination service and can solve most cluster coordination problems, including leader election, failure recovery, metadata storage, and consistency guarantees. Similarly, Hadoop YARN grew out of the need for cluster resource management.
- Another significant problem in complex big data processing is how to schedule multiple jobs that depend on one another. Workflow schedulers such as Azkaban and Oozie were created for exactly this purpose.
- Kafka is another framework widely used in big data stream processing. It can shave traffic peaks and shield stream processors from the impact of sudden bursts of concurrent data, for example during flash-sale events.
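Kafka itself needs a running broker, but the peak-shaving idea can be sketched with a plain Java bounded queue: a burst of writes lands in the buffer at full speed, while the consumer drains it at its own pace, so downstream logic never faces the raw burst. This is only an in-process analogy; real Kafka additionally provides durability, partitioning, and replication.

```java
import java.util.concurrent.*;

public class PeakShavingSketch {
    // Drives a burst of records through a bounded buffer and
    // returns how many the consumer processed.
    static int run(int burst) throws InterruptedException {
        // The bounded buffer plays the role Kafka plays between services.
        BlockingQueue<Integer> buffer = new ArrayBlockingQueue<>(100);
        final int[] processed = {0};

        // Producer: a traffic spike, writing as fast as it can;
        // put() blocks when the buffer is full, applying backpressure.
        Thread producer = new Thread(() -> {
            for (int i = 0; i < burst; i++) {
                try { buffer.put(i); } catch (InterruptedException e) { return; }
            }
        });

        // Consumer: drains the buffer at its own steady pace, so the
        // downstream work never sees the raw burst.
        Thread consumer = new Thread(() -> {
            for (int i = 0; i < burst; i++) {
                try { buffer.take(); } catch (InterruptedException e) { return; }
                processed[0]++;
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
        return processed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("processed = " + run(50)); // processed = 50
    }
}
```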
- Another commonly used framework is Sqoop, which addresses data migration. With simple commands it can import data from a relational database into HDFS, Hive, or HBase, or export data from HDFS or Hive back into a relational database.
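For a sense of what those "simple commands" look like, here is a typical invocation based on Sqoop's documented command-line interface. The connection string, credentials, table names, and paths are placeholders for illustration, and the commands of course require a Sqoop installation and a reachable database:

```shell
# Import a MySQL table into HDFS (-P prompts for the password).
sqoop import \
  --connect jdbc:mysql://db-host:3306/shop \
  --username reader -P \
  --table orders \
  --target-dir /data/orders

# Export processed results from HDFS back into a relational table.
sqoop export \
  --connect jdbc:mysql://db-host:3306/shop \
  --username writer -P \
  --table order_stats \
  --export-dir /data/order_stats
```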
II. Learning Route
Having introduced the big data frameworks, we can now lay out the corresponding learning route, divided into the following aspects:
2.1 Language Foundation
Big data frameworks are mostly developed in Java, and almost all of them provide Java APIs. Java is currently the mainstream backend development language, so free learning resources abound online. If you prefer learning from books, the following introductory titles are recommended:
- Logic of Java Programming: a systematic introduction to Java by a Chinese author; straightforward and comprehensive.
- Core Java: the latest is the 10th edition, in two volumes. Volume two can be read selectively, since many of its chapters are rarely needed in practical development.
Most frameworks currently require at least Java 1.8, because Java 8 introduced functional programming features that let you express the same logic with far less code; when calling the Spark API, for example, Java 7 code can easily be several times longer than its Java 8 equivalent. An additional recommendation here is the book Java 8 in Action.
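A small illustration of the difference, using only the JDK (no Spark): the same "keep the even numbers and double them" logic written in Java 7 style and Java 8 style. Spark's transformation API follows the same declarative pipeline shape as the second version.

```java
import java.util.*;
import java.util.stream.*;

public class Java7VsJava8 {
    // Java 7 style: explicit loop, condition, and temporary collection.
    static List<Integer> doubledEvensOldStyle(List<Integer> nums) {
        List<Integer> result = new ArrayList<>();
        for (Integer n : nums) {
            if (n % 2 == 0) {
                result.add(n * 2);
            }
        }
        return result;
    }

    // Java 8 style: the same logic as a declarative filter/map pipeline.
    static List<Integer> doubledEvensNewStyle(List<Integer> nums) {
        return nums.stream()
                .filter(n -> n % 2 == 0)
                .map(n -> n * 2)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> nums = Arrays.asList(1, 2, 3, 4);
        System.out.println(doubledEvensOldStyle(nums)); // [4, 8]
        System.out.println(doubledEvensNewStyle(nums)); // [4, 8]
    }
}
```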
Scala is a statically typed programming language that blends object-oriented and functional programming. It runs on the Java virtual machine and interoperates seamlessly with all Java class libraries; the famous Kafka is written in Scala.
Why learn Scala? Because Flink and Spark, the most popular computing frameworks today, both provide Scala APIs, which generally need even less code than Java 8. Moreover, Spark itself is written in Scala, so learning Scala helps you understand Spark more deeply. As before, two introductory books are recommended for readers who prefer learning from books:
- Fast Learning Scala (2nd Edition)
- Scala Programming (3rd Edition)
That said, if your time is limited, you do not need to finish Scala before tackling the big data frameworks. Scala is indeed compact and flexible, but it is somewhat more complex than Java as a language; concepts such as implicit conversions and implicit parameters are hard to grasp at first. You can come back to Scala after you understand Spark, since implicit conversions and the like appear throughout the Spark source code.
2.2 Linux Foundation
Big data frameworks are usually deployed on Linux servers, so you need some working knowledge of Linux. The most famous Linux books are the "Bird Brother's Linux Private Kitchen" series, comprehensive and classic. If you want to get started quickly, try "Linux Should Be Learned This Way", which has a free e-book edition on its website.
2.3 Building Tools
Maven is the main build tool you need to master here. It is widely used in big data scenarios, mainly in the following three ways:
- Managing project JAR dependencies, helping you set up big data applications quickly;
- Whether your project is developed in Java or Scala, it must be compiled and packaged with Maven before being submitted to the cluster;
- Most big data frameworks manage their source code with Maven, so when you need to build installation packages from source, Maven is the tool you will use.
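As a concrete example of the second point, a project is commonly packaged for the cluster with the maven-shade-plugin, which bundles the project and its dependencies into a single runnable "fat" JAR. The snippet below is a minimal sketch of such a configuration (the plugin version shown is one of many published releases; pin whatever your project needs):

```xml
<!-- Fragment of a pom.xml: bundle dependencies into one JAR at the
     package phase, then build with: mvn clean package -->
<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>3.4.1</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
```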
2.4 Framework Learning
1. Framework classification
We have introduced many big data frameworks above; here is a summary by category.
Log collection frameworks: Flume, Logstash (with Kibana for visualization)
Distributed file storage system: Hadoop HDFS
Distributed computing frameworks:
- Batch processing framework: Hadoop MapReduce
- Stream processing framework: Storm
- Hybrid processing frameworks: Spark, Flink
Query and analysis frameworks: Hive, Spark SQL, Flink SQL, Pig, Phoenix
Cluster resource manager: Hadoop YARN
Distributed coordination service: ZooKeeper
Data migration tool: Sqoop
Task scheduling frameworks: Azkaban, Oozie
Cluster deployment and monitoring: Ambari, Cloudera Manager
All of the frameworks listed above are fairly mainstream, with active communities and rich learning resources. It is recommended to start with Hadoop, since it is the cornerstone of the whole big data ecosystem and the other frameworks depend on it directly or indirectly. Then move on to a computing framework. Spark and Flink are the mainstream hybrid (batch plus stream) processing frameworks: Spark appeared earlier and is therefore more widely deployed, while Flink is the hottest new-generation hybrid framework, whose many excellent features have won over numerous companies. Either is worth learning, depending on your personal preference or actual work needs.
Image source: https://www.edureka.co/blog/h…
As for the other frameworks, there is no particular learning order. If your time is limited, it is recommended to master just one framework of each type at first. For example, among the many log collection frameworks, learning one is enough to get log collection done; the others can be picked up later as work demands.
2. Learning materials
The most authoritative and comprehensive learning materials for big data are the official documents. The popular frameworks have active communities and iterate quickly, so printed books lag noticeably behind the actual versions; for that reason, books are not always the best option. Fortunately, the official documentation of the big data frameworks is well written: complete, focused, and full of diagrams that aid the explanation. That said, some excellent books have stood the test of time and remain classics. Here are a few I have read personally:
- Hadoop: The Definitive Guide (4th Edition), 2017
- Kafka: The Definitive Guide, 2017
- From Paxos to ZooKeeper: Principles and Practice of Distributed Consistency, 2015
- Spark Technology Insider: In-depth Analysis of the Design and Implementation of the Spark Kernel, 2015
- Spark: The Definitive Guide, 2018
- HBase: The Definitive Guide, 2012
- Programming Hive, 2013
3. Video Learning Materials
The recommendations above are almost all books rather than videos. The reason: books have stood the test of time, and the fact that a book keeps getting reprinted, or is rated highly on platforms such as Douban, signals public recognition, so statistically it is more likely to be excellent and less likely to waste your time and energy. I therefore personally prefer official documentation and books over video, whose quality varies widely for lack of a public rating platform and a sound evaluation mechanism. Still, video has irreplaceable advantages: it is more intuitive and leaves a deeper impression. So for readers who prefer learning from video, here are one free and one paid resource; choose as you need:
- Free Learning Resources: Silicon Valley Big Data Learning Route – Download Links and Watch Links Online
- Pay-for-Learning Resources: Michael PK’s Series of Courses
Last but not least, let me plug my blog and the GitHub project "Introduction Guide to Big Data", both of which also contain a series of articles on big data.
Here are some common development tools for big data:
Java IDE: both IDEA and Eclipse work; in terms of personal habit, I prefer IDEA.
VirtualBox: during study you will often need to build services and clusters inside virtual machines. VirtualBox is free, open-source virtual machine management software; although lightweight, it is rich in features and meets everyday needs.
MobaXterm: big data frameworks are usually deployed on servers, and MobaXterm is recommended for connecting to them. It is also free and open source, supports multiple connection protocols, drag-and-drop file upload, and plug-in extensions.
Translate Man: a free translation plug-in for browsers (both Chrome and Firefox are supported). It uses Google's translation interface, is highly accurate, supports word-by-word translation, and is a great aid when reading official documentation.
The above is my personal learning experience and recommended route for big data. This article deliberately defines the big data technology stack narrowly; as your learning deepens, you can gradually add Python, recommendation systems, and machine learning to your own stack.