Introduction to big data and classification of the technology stack

Time: 2021-10-16

Introduction: Big data refers to data sets that cannot be captured, managed, or processed with conventional software tools within an acceptable time frame; handling them calls for new processing models. Big data technology mainly addresses the storage and analysis of such massive data.

1、 Introduction to big data

1. Basic concepts

Big data refers to data sets that cannot be captured, managed, or processed with conventional software tools within an acceptable time frame. It is a high-volume, fast-growing, and diverse information asset, and extracting its stronger decision-making power, insight and discovery power, and process-optimization ability requires new processing models. Big data technology mainly addresses the storage and analysis of massive data.

2. Characteristics

The 5V characteristics of big data (proposed by IBM): Volume (large scale), Velocity (high speed), Variety (diverse types), Value (low value density), and Veracity (authenticity).

3. Development process

Between 2003 and 2006, Google published three landmark papers: the GFS file system, the MapReduce computing framework, and the BigTable NoSQL database system. Together they describe how massive data files are stored, analyzed, and computed, and they established the basic principles and ideas of big data.

Doug Cutting, a gifted programmer and the founder of the Lucene and Nutch projects, implemented functionality similar to GFS and MapReduce by following the principles in Google's papers; that work later grew into the famous Hadoop.

After years of rapid development, Hadoop has formed an ecosystem: on top of it sit real-time computing, offline computing, NoSQL storage, data analysis, machine learning, and more.

This history illustrates a law of technology: Google creatively distilled its business practice into foundational papers, and the growth and demands of the business forced the technology to keep evolving. Business, therefore, is the key to the continuous development of technology.

2、 Hadoop framework

1. Introduction to Hadoop

Note: this article is based on Hadoop 2.x; unless otherwise stated, version 2.7 is assumed.

Hadoop is a distributed system infrastructure developed by the Apache Foundation;

It provides storage capacity plus analysis and computing capacity for massive data;

As an Apache top-level project, it contains many subprojects and has grown into an ecosystem;

2. Framework features

Reliability: Hadoop maintains multiple copies of each piece of data, so it can provide reliable service even when individual nodes fail;

Scalability: Hadoop distributes data and computing tasks across clusters of machines and can easily scale to thousands of nodes;

Efficiency: based on the MapReduce idea, Hadoop processes massive data with efficient parallel computation;

Fault tolerance: it automatically keeps multiple copies of data and automatically reassigns failed tasks;

3. Composition structure

HDFS storage

  • NameNode

Stores file metadata such as file name, directory structure, creation time, permissions, and the number of replicas.

  • DataNode

Stores the actual file block data, along with the mapping between data block IDs and the blocks themselves (see the client sketch after this list).
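
To make the division of labor concrete, here is a minimal Java client sketch; the NameNode address hdfs://namenode-host:9000 and all paths are placeholders. The FileSystem API talks to the NameNode for metadata operations, while block data flows to and from DataNodes behind the scenes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; adjust to your cluster.
        conf.set("fs.defaultFS", "hdfs://namenode-host:9000");
        try (FileSystem fs = FileSystem.get(conf)) {
            // Metadata operations are served by the NameNode.
            fs.mkdirs(new Path("/demo"));
            // Block data is written to and read from DataNodes.
            fs.copyFromLocalFile(new Path("data.txt"), new Path("/demo/data.txt"));
            for (FileStatus status : fs.listStatus(new Path("/demo"))) {
                System.out.println(status.getPath() + " replicas=" + status.getReplication());
            }
        }
    }
}
```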

YARN scheduling

Responsible for resource management and job scheduling: YARN allocates system resources to the various applications running in a Hadoop cluster and schedules their tasks for execution on different cluster nodes.
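
As a small sketch of the resource-management side, the following uses the YarnClient API to list the applications the ResourceManager is tracking; it assumes a reachable cluster whose yarn-site.xml is on the classpath.

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnAppList {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration()); // reads yarn-site.xml from the classpath
        yarnClient.start();
        // Ask the ResourceManager for all applications it is tracking.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId()
                    + " " + report.getName()
                    + " " + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}
```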

MapReduce computation

MapReduce divides the computation into two stages: the Map stage processes the input data in parallel, and the Reduce stage aggregates the Map results.
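
The canonical word-count job shows the two stages concretely. This is a minimal sketch against the Hadoop 2.7 mapreduce API; the input and output paths are supplied on the command line.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: split each input line into words, emit (word, 1) in parallel.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: sum the counts collected for each word.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```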

3、 Big data technology stack

1. Kafka Middleware

Open source organization: Apache Software Foundation

Application scenario:

Kafka is a high-throughput distributed publish-subscribe messaging system. It persists messages in an on-disk data structure that maintains stable performance over long periods, even with terabytes of stored messages. High throughput: even on very ordinary hardware, Kafka can support millions of messages per second. It supports partitioning messages across Kafka servers and consuming them with consumer clusters, and it supports parallel data loading into Hadoop.
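
A minimal producer sketch in Java, assuming a broker at localhost:9092 and a topic named demo-topic (both placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KafkaProducerDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; adjust to your cluster.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // wait for full acknowledgement, favoring durability

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            for (int i = 0; i < 10; i++) {
                // Records with the same key are routed to the same partition.
                producer.send(new ProducerRecord<>("demo-topic", "key-" + i, "value-" + i));
            }
        }
    }
}
```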

2. Flume log system

Open source organization: Cloudera

Application scenario:

Flume is a highly available, highly reliable, distributed system provided by Cloudera for collecting, aggregating, and transporting massive logs. Flume supports customizing the various data senders in a log system to collect data; at the same time, it can perform simple processing on the data and write it to various (customizable) data receivers.

3. Sqoop synchronization tool

Open source organization: Apache Software Foundation

Application scenario:

Sqoop is an open source tool used mainly to transfer data between Hadoop (including Hive) and traditional databases such as MySQL. It can import data from a relational database (e.g. MySQL or Oracle) into HDFS, and export data from HDFS back into a relational database.

4. HBase database

Open source organization: Apache Software Foundation

Application scenario:

HBase is a distributed, column-oriented open source database that provides BigTable-like capabilities on top of Hadoop; it is a subproject of the Apache Hadoop project. Unlike a typical relational database, HBase is suited to storing unstructured data, and its storage model is column-based rather than row-based.
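
A minimal client sketch, assuming a table named user with a column family info already exists and that ZooKeeper runs at zk-host (all placeholders); note how reads and writes address a column family and qualifier rather than fixed relational columns:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host"); // placeholder ZooKeeper quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("user"))) {
            // Write one cell: row "row1", column family "info", qualifier "name".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
            table.put(put);

            // Read it back by row key.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```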

5. Storm real-time computing

Open source organization: Apache Software Foundation

Application scenario:

Storm is used for real-time computation: it continuously queries data streams and outputs results to users as a stream while the computation runs. Storm is relatively simple to use and can be used with any programming language.
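
A minimal local-mode topology sketch: the TestWordSpout bundled with Storm emits a stream of random words, and a simple bolt prints each tuple as it arrives. Package names follow Storm 1.x (older releases used backtype.storm).

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.testing.TestWordSpout;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class PrintTopology {

    // A bolt runs continuously, handling one tuple of the stream at a time.
    public static class PrintBolt extends BaseBasicBolt {
        @Override
        public void execute(Tuple input, BasicOutputCollector collector) {
            System.out.println("word: " + input.getStringByField("word"));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // This bolt is a sink; it emits no further stream.
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new TestWordSpout());
        builder.setBolt("print", new PrintBolt()).shuffleGrouping("words");

        // Local mode for experimentation; a real deployment uses StormSubmitter.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("print-demo", new Config(), builder.createTopology());
    }
}
```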

6. Spark computing engine

Open source organization: Apache Software Foundation

Application scenario:

Spark is a fast, general-purpose computing engine designed for large-scale data processing. It retains the advantages of Hadoop MapReduce, but unlike MapReduce, intermediate job results can be kept in memory, so there is no need to read and write HDFS between steps. Spark is therefore better suited to iterative MapReduce-style algorithms such as those used in data mining and machine learning. Spark is implemented in Scala, which it uses as its application framework.
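
A minimal sketch with Spark's Java API illustrating the in-memory advantage; the input path is a placeholder, and cache() is what keeps the intermediate result in memory between the two actions.

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("cache-demo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            // Placeholder input path; this could also be an hdfs:// URI.
            JavaRDD<String> lines = sc.textFile("input.txt");

            // cache() keeps the filtered RDD in memory, so the second action
            // below does not re-read the file; MapReduce would instead write
            // intermediate results out to HDFS between jobs.
            JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

            System.out.println("error lines: " + errors.count());
            System.out.println("first error: " + (errors.isEmpty() ? "none" : errors.first()));
        }
    }
}
```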

7. R language

Open source organization: R Foundation

Application scenario:

R is a language and environment for statistical computing and graphics. It is free, open source software belonging to the GNU project, and an excellent tool for statistical computation and statistical plotting.

8. Hive data warehouse tool

Open source organization: Facebook

Application scenario:

Hive is a data warehouse tool built on Hadoop for data extraction, transformation, and loading (ETL). It provides a mechanism for storing, querying, and analyzing large-scale data held in Hadoop: it can map structured data files onto database tables, offers SQL query capability, and converts SQL statements into MapReduce tasks for execution.
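
A minimal sketch of querying Hive through HiveServer2 over JDBC, assuming the default port 10000 and an existing employee table (both placeholders); the SQL below is compiled into MapReduce tasks behind the scenes.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Placeholder HiveServer2 address, database, and credentials.
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Hive translates this SQL into one or more MapReduce jobs.
             ResultSet rs = stmt.executeQuery(
                     "SELECT dept, COUNT(*) AS cnt FROM employee GROUP BY dept")) {
            while (rs.next()) {
                System.out.println(rs.getString("dept") + " -> " + rs.getLong("cnt"));
            }
        }
    }
}
```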

9. Oozie component

Open source organization: Apache Software Foundation

Application scenario:

Oozie is a workflow scheduling and management system for Hadoop jobs.
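
A hedged sketch using the Oozie Java client (org.apache.oozie.client.OozieClient) to submit a workflow; the server URL, HDFS paths, and host names are placeholders, and the workflow application is assumed to be already deployed to HDFS.

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieSubmitDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder Oozie server URL.
        OozieClient client = new OozieClient("http://localhost:11000/oozie");

        Properties conf = client.createConfiguration();
        // Placeholder HDFS path of the deployed workflow application.
        conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode-host:9000/user/demo/workflow");
        conf.setProperty("nameNode", "hdfs://namenode-host:9000");
        conf.setProperty("jobTracker", "resourcemanager-host:8032");

        // Submit and start the workflow job, then report its status.
        String jobId = client.run(conf);
        System.out.println("submitted: " + jobId
                + " status=" + client.getJobInfo(jobId).getStatus());
    }
}
```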

10. Azkaban component

Open source organization: LinkedIn

Application scenario:

Azkaban is a batch workflow task scheduler, used to run a set of jobs and processes in a specific order within a workflow. It defines a key-value file format for declaring dependencies between tasks, and provides an easy-to-use web user interface for maintaining and tracking workflows.

11. Mahout components

Open source organization: Apache Software Foundation

Application scenario:

Mahout provides scalable implementations of classic machine-learning algorithms, designed to help developers create intelligent applications more easily and quickly. It includes many implementations covering clustering, classification, recommendation (collaborative filtering), and frequent itemset mining.
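
As one concrete example, a user-based recommender built from Mahout's Taste library; this minimal sketch assumes a ratings.csv file with one userID,itemID,preference triple per line (a placeholder).

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutRecommenderDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder data file: one "userID,itemID,preference" row per line.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 item recommendations for user 1.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " score=" + item.getValue());
        }
    }
}
```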

12. ZooKeeper component

Open source organization: Apache Software Foundation

Application scenario:

ZooKeeper is a distributed, open source coordination service for distributed applications. It is an open source implementation of Google's Chubby and a key component of Hadoop and HBase. It provides consistency services for distributed applications, covering configuration maintenance, naming services, distributed synchronization, group services, and more.
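
A minimal configuration-maintenance sketch with the ZooKeeper Java client, assuming a server at localhost:2181 (a placeholder): the znode stores a small piece of shared state that every connected client reads consistently.

```java
import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkConfigDemo {
    public static void main(String[] args) throws Exception {
        // Placeholder server address; the lambda acts as the default watcher.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000,
                event -> System.out.println("event: " + event));

        // Publish a piece of configuration as a persistent znode.
        zk.create("/app-config", "v1".getBytes(StandardCharsets.UTF_8),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any client in the cluster reads the same consistent value.
        byte[] data = zk.getData("/app-config", false, null);
        System.out.println("config = " + new String(data, StandardCharsets.UTF_8));

        zk.close();
    }
}
```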

4、 Technology stack classification

Storage systems: Hadoop-HDFS, HBase, MongoDB, Cassandra

Computing systems: Hadoop-MapReduce, Spark, Storm, Flink

Data synchronization: Sqoop, DataX

Resource scheduling: YARN, Oozie, ZooKeeper

Log collection: Flume, Logstash, Kibana

Analysis engines: Hive, Impala, Presto, Phoenix, SparkSQL

Cluster monitoring: Ambari, Ganglia, Zabbix

5、 Source code address

GitHub · address
https://github.com/cicadasmile/big-data-parent
Gitee · address
https://gitee.com/cicadasmile/big-data-parent