Basic knowledge of big data


To prepare for an exam on short notice, I started catching up on big data concepts that I had never studied seriously or systematically.

The exam involves Hadoop, HBase, Spark, Flink, Flume, Kafka, Sqoop, HDFS, Hive, MapReduce, Impala, Spark SQL, Elasticsearch, YARN, Hue and Cloudera Manager. The purpose of this article is to sort out these concepts and the likely exam points from a beginner's point of view.

Big data – concept

What is big data

Big data (also called mega data or massive data) refers to data assets so large that they cannot be captured, managed, processed and organized into a form humans can interpret within a reasonable time using conventional software tools.

☆ five characteristics of big data (5V)

Volume, Velocity, Variety, Value, Veracity

  • Volume: a large amount of data to collect, store and compute over. The starting unit of measurement for big data is at least PB (about 1,000 TB), EB (about 1 million TB) or ZB (about 1 billion TB).
  • Variety: many types and sources of data. It includes structured, semi-structured and unstructured data, such as web logs, audio, video, pictures and geographic location information. Handling many data types places higher demands on data processing capability.
  • Value: the value density of the data is relatively low, so finding value is like panning for gold in sand. With the wide adoption of the Internet and the Internet of Things, sensing is everywhere and information is massive, but its value density is low. How to combine business logic with powerful machine algorithms to mine the value of data is the most important problem to solve in the big data era.
  • Velocity: data grows fast, must be processed fast, and has strict timeliness requirements. For example, search engines need news from a few minutes ago to be searchable, and personalized recommendation algorithms need to recommend in as close to real time as possible. This is a key feature that distinguishes big data from traditional data mining.
  • Veracity: the accuracy and trustworthiness of the data, that is, data quality.
Data unit of big data

The units, in ascending order, are: bit, byte, KB, MB, GB, TB, PB, EB, ZB, YB, NB, DB, CB. From the byte upward, each unit is 2^10 = 1,024 times the previous one. The names beyond YB (NB, DB, CB) are informal and not part of any official standard.
1 Byte = 8 bits
1 KB (Kilobyte) = 1,024 Bytes
1 MB (Megabyte) = 1,024 KB = 1,048,576 Bytes
1 GB (Gigabyte) = 1,024 MB = 1,048,576 KB = 1,073,741,824 Bytes
1 TB (Terabyte) = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 Bytes
1 PB (Petabyte) = 1,024 TB = 1,048,576 GB = 1,125,899,906,842,624 Bytes
1 EB (Exabyte) = 1,024 PB = 1,048,576 TB = 1,152,921,504,606,846,976 Bytes
1 ZB (Zettabyte) = 1,024 EB = 1,180,591,620,717,411,303,424 Bytes
1 YB (Yottabyte) = 1,024 ZB = 1,208,925,819,614,629,174,706,176 Bytes
1 NB (NonaByte) = 1,024 YB
1 DB (DoggaByte) = 1,024 NB
1 CB (Corydonbyte) = 1,024 DB
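
The conversions in the table can be checked with a few lines of Python. This is only a small sketch using the binary convention (factor 1,024) from the table; the function names `bytes_in` and `human_readable` are made up for illustration, and the informal units beyond YB are left out.

```python
# Binary units from the table above; each step is a factor of 2**10 = 1024.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit` (binary convention)."""
    return 1024 ** UNITS.index(unit)

def human_readable(n_bytes: int) -> str:
    """Format a byte count using the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.2f} {unit}"
        value /= 1024

print(bytes_in("PB"))                       # bytes in one petabyte
print(human_readable(5 * bytes_in("TB")))   # a 5 TB data set
```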

Computing mode of big data

MapReduce (Spark): MapReduce is the computing model best suited to batch processing of big data. First, it applies the "divide and conquer" idea of parallel processing to large-scale data whose records are loosely related and easy to partition. Then it abstracts the large number of repetitive record-processing steps into two operations, map and reduce. Finally, it provides a unified parallel computing framework that takes over many system-level details of parallel computing, which greatly reduces the burden of parallel programming on developers.
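
To make the map and reduce abstraction concrete, here is a minimal single-process sketch of the classic word count. It is not Hadoop code: the `shuffle` function stands in for the grouping step that the framework performs between the two phases, and all function names are chosen for illustration only.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line could be mapped on a different machine; here we loop locally.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

Because every record is mapped independently and every key is reduced independently, both phases can be spread across many machines, which is the point of the "divide and conquer" idea.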

Stream computing (Scribe, Flume, Storm, S4, Spark Streaming): stream computing is a highly real-time computing model. New data generated by the application system must be computed and processed within a certain time window to avoid data pile-up and loss.
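
The "time window" idea can be sketched as a tumbling (fixed-size, non-overlapping) window that counts events as they arrive. This is a simplified illustration, not the API of any of the systems listed: it assumes events arrive in timestamp order, skips windows with no events, and uses made-up click data.

```python
from collections import Counter

def tumbling_window_counts(events, window_seconds):
    """Count events per fixed-size, non-overlapping time window.

    `events` is an iterable of (timestamp_seconds, key) pairs, assumed to
    arrive in timestamp order. Yields (window_start, Counter) per window.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        window_start = (timestamp // window_seconds) * window_seconds
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts  # flush the last, possibly partial, window

# Hypothetical click stream: (seconds since start, page).
clicks = [(0, "home"), (3, "cart"), (9, "home"), (10, "home"), (14, "pay"), (25, "home")]
results = list(tumbling_window_counts(clicks, window_seconds=10))
for start, counts in results:
    print(start, dict(counts))
```

Each window's result is emitted as soon as the first event of the next window arrives, so data is processed continuously instead of piling up for a later batch job.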

Iterative computing (HaLoop, iMapReduce, Twister, Spark): to overcome Hadoop MapReduce's poor support for iterative computing, industry and academia have done a great deal of research on improving it. HaLoop, for example, moves iteration control into the MapReduce job execution framework and uses a loop-aware scheduler to ensure that the reduce output of the previous iteration and the map input of the current iteration are on the same physical machine, which reduces the data transfer overhead between iterations.

Interactive computing

Graph computing (Pregel, PowerGraph, GraphX)

In-memory computing (Dremel, HANA, Redis)

Big data technology system

The big data processing pipeline is divided into collection, storage, processing and visualization, supported throughout by security and operations (O&M) technology.

The core of big data is the Hadoop ecosystem. Hadoop is currently the most widely used distributed big data processing framework. It comprises a large number of components covering data collection, data storage, data processing and data analysis.

1、 Data source description

  • Structured data: relational database records
  • Semi-structured data: logs, email, etc.
  • Unstructured data: files, video, audio, network data streams, etc.

2、 Data warehouse

1. What is a data warehouse?

In computing, a data warehouse (DW or DWH), also known as an enterprise data warehouse (EDW), is a system for reporting and data analysis and is regarded as a core component of business intelligence. It stores current and historical data in one place and is used to create analytical reports for staff throughout the enterprise.

2. Characteristics of two operation modes of data warehouse

① Online analytical processing (OLAP) is characterized by a relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems, response time is the measure of effectiveness. OLAP applications are widely used by data mining techniques. OLAP databases store aggregated, historical data in multidimensional schemas (usually star schemas). OLAP systems typically have a data latency of a few hours, whereas data marts are expected to have a latency closer to one day. The OLAP approach is used to analyze multidimensional data from multiple sources and perspectives. The three basic operations in OLAP are roll-up (consolidation), drill-down, and slicing and dicing.
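
The three basic OLAP operations can be illustrated on a tiny in-memory fact table. This is only a sketch: the sales figures, dimension names and helper functions are hypothetical, and a real OLAP engine would work on a cube or star schema rather than a Python list.

```python
from collections import defaultdict

# Hypothetical fact table: (year, region, product, sales_amount).
FACTS = [
    (2021, "North", "phone", 100),
    (2021, "North", "laptop", 200),
    (2021, "South", "phone", 150),
    (2022, "North", "phone", 120),
    (2022, "South", "laptop", 300),
]
DIMS = {"year": 0, "region": 1, "product": 2}

def roll_up(facts, by):
    """Roll-up: aggregate sales to the given dimensions (consolidation)."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[DIMS[d]] for d in by)] += row[3]
    return dict(totals)

def slice_(facts, dim, value):
    """Slice: fix one dimension to a single value."""
    return [row for row in facts if row[DIMS[dim]] == value]

def dice(facts, **allowed):
    """Dice: keep rows whose dimensions fall within the allowed value sets."""
    return [row for row in facts
            if all(row[DIMS[d]] in vals for d, vals in allowed.items())]

by_year = roll_up(FACTS, ["year"])                   # coarse view
by_year_region = roll_up(FACTS, ["year", "region"])  # drill-down: one more dimension
north = slice_(FACTS, "region", "North")
sub_cube = dice(FACTS, year={2021}, product={"phone"})
print(by_year, by_year_region, north, sub_cube, sep="\n")
```

Going from `by_year` to `by_year_region` is a drill-down; going the other way is a roll-up.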

② Online transaction processing (OLTP) is characterized by a large number of short online transactions (INSERT, UPDATE, DELETE). OLTP systems emphasize very fast query processing and maintaining data integrity in multi-access environments. For OLTP systems, effectiveness is measured in transactions per second. OLTP databases contain detailed and current data. The schema used to store transactional databases is the entity model (usually 3NF). Normalization is the norm for data modeling in such systems.

3、 Differences between ETL and DM

ETL (extract, transform, load): used to move data from a database (DB) into a data warehouse (DW). It extracts the state of the DB at a certain point in time, transforms the data format to match the DW's storage model, and then loads it into the DW. Note that a DB uses an ER model that follows normal-form design principles, whereas a DW uses a snowflake or star schema built around a subject-oriented, question-oriented design. Because the model structures of the DB and the DW differ, the data has to be transformed.
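
The extract, transform and load steps can be sketched end to end with Python's built-in `sqlite3` module. Everything here is an assumption made for illustration: two in-memory databases stand in for the source DB and the DW, and the table names, columns and sample rows are invented.

```python
import sqlite3

# Source "DB": a normalized (ER-style) schema with customers and orders.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER,
                         order_date TEXT, amount_cents INTEGER);
    INSERT INTO customer VALUES (1, 'Alice', 'Beijing'), (2, 'Bob', 'Shanghai');
    INSERT INTO orders VALUES (1, 1, '2022-01-05', 1250),
                              (2, 1, '2022-01-20', 750),
                              (3, 2, '2022-02-03', 3000);
""")

# Target "DW": a tiny star schema (one dimension table, one fact table).
dw = sqlite3.connect(":memory:")
dw.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
    CREATE TABLE fact_sales (customer_key INTEGER, month TEXT, amount REAL);
""")

# Extract: read a snapshot of the source tables.
customers = db.execute("SELECT id, name, city FROM customer").fetchall()
orders = db.execute("SELECT customer_id, order_date, amount_cents FROM orders").fetchall()

# Transform: reshape rows to match the warehouse model
# (truncate dates to months, convert cents to currency units).
facts = [(cid, date[:7], cents / 100) for cid, date, cents in orders]

# Load: write the transformed rows into the warehouse.
dw.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", customers)
dw.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", facts)
dw.commit()

# A typical subject-oriented query against the star schema.
monthly = dw.execute("""
    SELECT d.city, f.month, SUM(f.amount)
    FROM fact_sales f JOIN dim_customer d USING (customer_key)
    GROUP BY d.city, f.month ORDER BY d.city, f.month
""").fetchall()
print(monthly)
```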

DM (data mining): this kind of mining is more than simple statistics. It analyzes the large volume of data in the DW using probability theory or other statistical principles to uncover patterns that cannot be found intuitively.


1、 Hadoop

1. What is Hadoop?

Hadoop is a software framework written in Java for the distributed storage and computation of large data sets. It is a distributed system infrastructure developed by the Apache Foundation. Users can develop distributed programs without knowing the low-level details of distribution, making full use of the power of a cluster for high-speed computing and storage. Hadoop implements a distributed file system, HDFS (Hadoop Distributed File System). HDFS is highly fault tolerant and designed to be deployed on low-cost hardware; it also provides high-throughput access to application data, which suits applications with large data sets. The core of the Hadoop framework is HDFS and MapReduce: HDFS provides storage for massive data, while MapReduce provides computation over it.

2. What are the characteristics of Hadoop?

① Efficiency: distributed computing is implemented on large clusters of standard x86 servers. Each module is a discrete processing unit, and parallel computing is combined with load balancing across the compute nodes in the cluster. When the load on a node is too high, it can be shifted to other nodes, and the cluster supports linear, smooth scale-out. Distributed storage is built from the local disks of the x86 servers plus a distributed file system. By default, each piece of data is stored on 3 nodes to meet the performance and reliability goals of the storage design.

② Reliability: it maintains multiple copies of its data and automatically redeploys computing tasks when a task fails.

③ Scalability: it can reliably store and process PB-level data.

④ Low cost: data can be distributed and processed on a server cluster built from ordinary machines. Such clusters can total thousands of nodes.

Data acquisition tools

Offline data collection: Sqoop
Real-time data collection: OGG (Oracle GoldenGate)
Log data collection: Logstash / Flume


Sqoop is a tool for transferring data between Hadoop and relational databases. It can import data from a relational database (such as MySQL, Oracle or PostgreSQL) into HDFS, and export data from HDFS back into a relational database.
Flume is used to collect log data, perform simple processing on it, and write it to a data receiver.
Flume can collect data from sources such as console, RPC (Thrift RPC), text (file), tail (UNIX tail), syslog (the syslog logging system, over both TCP and UDP) and exec (command execution).


Flume is a highly available, highly reliable, distributed system for collecting, aggregating and transporting massive volumes of log data, originally provided by Cloudera. Flume supports customizing various data senders in the logging system to collect data; it can also perform simple processing on the data and write it to various (customizable) data receivers.
It scales linearly and guarantees data consistency.
A Flume agent is mainly composed of three components: source, channel and sink.
Source: receives data from the data generator and passes it, in Flume event format, to one or more channels. Flume provides a variety of ways to receive data, such as Avro, Thrift and the Twitter 1% firehose.
Channel: a temporary storage container. It buffers the events received from the source until they are consumed by sinks, acting as a bridge between source and sink. Channels are transactional, which ensures consistency when events are sent and received, and a channel can be connected to any number of sources and sinks. Supported types include the JDBC channel, the file channel and the memory channel.
Sink: consumes events from the channel and delivers them to the destination, which may be centralized storage such as HDFS or HBase, or the source of another Flume agent.
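
The source, channel and sink roles can be modelled in a few lines of Python. This is a toy sketch and not Flume's API: `MemoryChannel`, `source` and `sink` are hypothetical names, a plain list stands in for HDFS or HBase, and the `rollback` method only hints at the transactional behavior of a real channel.

```python
from collections import deque

class MemoryChannel:
    """Toy stand-in for a channel: buffers events until a sink takes them."""
    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_batch(self, max_events):
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch

    def rollback(self, batch):
        # Return an undelivered batch to the front, preserving order.
        self._events.extendleft(reversed(batch))

def source(lines, channel):
    """Source: wrap each raw line in an event and hand it to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})

def sink(channel, store, batch_size=2):
    """Sink: drain the channel in batches and write them to the target store."""
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            break
        try:
            store.extend(event["body"] for event in batch)
        except Exception:
            channel.rollback(batch)  # keep the events for a later retry
            raise

channel = MemoryChannel()
storage = []  # stands in for HDFS or HBase
source(["GET /a", "GET /b", "POST /c"], channel)
sink(channel, storage)
print(storage)
```

Because the channel sits between the two sides, the source can keep accepting data even when the sink is temporarily slower, which is the buffering role described above.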

Data storage tools

HDFS: distributed file storage system, suited to write-once, read-many scenarios
Kudu: distributed columnar storage system that supports fast updates as well as fast reads and writes
HBase: distributed database
Kafka: message bus
Hive: data warehouse


HDFS: the Hadoop Distributed File System, written in Java, is suited to the distributed storage of large files that are written once and read many times. For example, a 1 TB file is stored across multiple machines rather than on a single one.

  • Distributed storage system that is easy to scale
  • Low hardware requirements: it runs on a large number of ordinary, inexpensive machines
  • Data is kept in 3 copies; if a copy is lost, it is restored automatically
  • Highly scalable: nodes can be added or removed freely

HDFS is divided into a master node and slave nodes.
1. Master node (NameNode)
  • Receives user operation requests
  • Maintains the directory structure of the file system
  • Manages the mapping between files and data blocks, and between data blocks and DataNodes
2. Slave node (DataNode)
  • Stores data blocks
  • Files are split into blocks for storage
  • Each file has multiple replicas

Block size: large files are split into blocks, usually 64 MB or 128 MB each.
Each block is stored in several different places, usually three.
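
The effect of block size and replication can be worked out with a small calculation. The sketch below assumes 128 MB blocks and 3 replicas as mentioned above; the round-robin placement and the DataNode names are deliberately simplified assumptions, since real HDFS placement also takes racks and free space into account.

```python
def split_into_blocks(file_size_bytes, block_size_bytes=128 * 1024**2):
    """Return the number of blocks needed to store a file (last one may be partial)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Simplified on purpose: this is not the real HDFS placement policy.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

one_tb = 1024**4
blocks = split_into_blocks(one_tb)   # number of 128 MB blocks in a 1 TB file
raw_storage = one_tb * 3             # bytes of disk used with 3 replicas
layout = place_replicas(4, ["dn1", "dn2", "dn3", "dn4", "dn5"])
print(blocks, raw_storage, layout)
```

The NameNode keeps exactly this kind of mapping (file to blocks, block to DataNodes), while the DataNodes hold the block contents.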

HDFS commands:
1. List files and directories

# List the root directory
hadoop fs -ls /
# With no path, list the current user's HDFS home directory
hadoop fs -ls
# List a specific user's directory
hadoop fs -ls /user/foo

2. HDFS directory operations

# Create a directory
hadoop fs -mkdir /user/foo/newdir
# Delete an empty directory
hadoop fs -rmdir /user/foo/newdir

3. Upload files or directories

# Upload a file
hadoop fs -put localfile /user/foo/newfile
# Upload a directory
hadoop fs -put localdir /user/foo/newdir
# Append a local file to an existing HDFS file
hadoop fs -appendToFile localfile /user/foo/oldfile

4. View files

# Print the file contents
hadoop fs -cat /user/foo/file
# Print the end of the file
hadoop fs -tail /user/foo/file

5. Download files or directories

# Download a file
hadoop fs -get /user/foo/remotefile localfile

6. Delete files or directories

# Delete a file
hadoop fs -rm /user/foo/remotefile
# Delete a directory and its contents recursively
hadoop fs -rm -r /user/foo/remotedir


HBase is built on HDFS and provides a database system with high reliability, high performance, column-oriented storage, scalability and real-time reads and writes. Unlike a typical relational database, HBase is suited to storing unstructured data, and it is column-based rather than row-based.
HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable distributed storage system. With HBase, large-scale structured storage clusters can be built on inexpensive PC servers.
HBase is an open source implementation of Google Bigtable. Just as Bigtable uses GFS as its file storage system, HBase uses Hadoop HDFS; Google runs MapReduce to process the massive data in Bigtable, and HBase likewise uses Hadoop MapReduce; Bigtable uses Chubby as its coordination service, and HBase uses ZooKeeper as the counterpart.

  • High reliability: data is stored with 3 redundant copies
  • High performance: real-time reads and writes over massive data, with high concurrency
  • Column-oriented: columns are indexed independently
  • Scalable: the cluster can be expanded quickly
  • Strong consistency with row-level transactions: reads and writes within a single row are atomic


Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables, provides SQL query functionality, and converts SQL statements into MapReduce jobs for execution. Hive mainly consists of the user interface, metadata storage, and the interpreter, compiler, optimizer and executor.

  • User interface: there are three, the CLI, the client and the WUI. The CLI is the command line interface; the client is a Hive client through which users connect to Hive Server; the WUI accesses Hive through a browser.
  • Metadata storage: Hive stores its metadata in a database, including table names, columns, partitions, properties, directories, etc. There are three modes for connecting to that database: single-user mode, multi-user mode and remote server mode.
  • Driver (interpreter, compiler, optimizer, executor): generates a query plan, stores it in HDFS, and then has it invoked and executed by MapReduce.

The learning curve is low: simple MapReduce statistics can be produced quickly with SQL statements, without developing dedicated MapReduce applications. Hive is well suited to statistical analysis in a data warehouse.
It works best for batch jobs over large amounts of immutable data.
As the SQL engine for data warehouse processing, it converts SQL into multiple jobs.
Built on Hadoop's HDFS and MapReduce, it is used to manage and query structured and unstructured data in the warehouse.
Its purpose is to let engineers who know SQL process data.
Hive commands:
1. Database operations

-- Create a database
create database db1;
-- Delete a database
drop database db1;
-- Switch to a database
use db1;

2. Table operations

-- Show all tables in the current database
show tables;
-- Create a table
create table table1(aaa string);
-- Drop a table
drop table table1;


Spark SQL: a tool for data warehouse processing and a subsystem of the Spark ecosystem. Like Hive, it turns SQL into jobs. Because it computes in memory, it is faster than MapReduce. It is used for batch processing and interactive analysis.


Impala: focused on OLAP over the data warehouse, it is generally used for front-end interactive analysis and querying of data; its performance on very large processing jobs is poor.


Elasticsearch: document data query. It supports multi-field queries and is suited to scenarios such as customer tag lookup and customer data lookup.


Kafka is a high-throughput distributed publish-subscribe messaging system that can handle all the action stream data of users on a website. It is an open source stream processing platform written in Scala and Java.
It is a distributed queue system. It achieves persistence through sequential disk reads and writes, has a fully distributed structure, and balances load between message producers and consumers using ZooKeeper. Multiple consumers can consume messages together as one group, and publishing and subscribing across multiple topics is supported.

  • High throughput and low latency: it can process hundreds of thousands of messages per second with latency as low as a few milliseconds
  • Scalability: a Kafka cluster can be expanded while running
  • Persistence and reliability: messages are persisted to local disk and replicated to prevent data loss
  • Fault tolerance: nodes in the cluster are allowed to fail (with n replicas, up to n-1 node failures can be tolerated)
  • High concurrency: thousands of clients can read and write at the same time

Common terms:

  • Broker: a Kafka cluster contains one or more servers, which are called brokers
  • Topic: every message published to a Kafka cluster has a category called its topic. (Physically, messages of different topics are stored separately. Logically, although a topic's messages are stored on one or more brokers, users only need to specify the topic to produce or consume data, without caring where the data is stored.)
  • Partition: a partition is a physical concept; each topic contains one or more partitions
  • Producer: responsible for publishing messages to Kafka brokers
  • Consumer: the message consumer, a client that reads messages from Kafka brokers
  • Consumer group: each consumer belongs to a specific consumer group (a group name can be specified for each consumer; if none is specified, the consumer belongs to the default group)
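
How topics, partitions and consumer groups fit together can be sketched without a running cluster. The code below is a conceptual model only, not the Kafka client API: the CRC32-based `choose_partition` is an illustrative stand-in for a key partitioner, and the round-robin `assign_partitions` is a simplified assumption about how partitions are shared within one group.

```python
import zlib

def choose_partition(key: str, num_partitions: int) -> int:
    """Map a message key to a partition with a stable hash.

    CRC32 is used purely for illustration; it is not Kafka's own partitioner.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def assign_partitions(num_partitions, consumers):
    """Spread one topic's partitions over the consumers of a single group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        assignment[consumers[partition % len(consumers)]].append(partition)
    return assignment

# Messages with the same key always land in the same partition,
# so their relative order is preserved for whoever reads that partition.
p1 = choose_partition("user-42", 6)
p2 = choose_partition("user-42", 6)

# Within a group, each partition is read by exactly one consumer.
two_consumers = assign_partitions(6, ["c1", "c2"])
three_consumers = assign_partitions(2, ["c1", "c2", "c3"])
print(p1 == p2, two_consumers, three_consumers)
```

The second assignment shows why the partition count caps parallelism: with 2 partitions and 3 consumers in one group, one consumer stays idle.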

Applicable scenarios:

  • Log collection: Kafka can collect logs from various services and expose them to all kinds of consumers through a unified interface
  • Messaging system: decoupling producers and consumers, buffering messages, etc.
  • User activity tracking: Kafka is often used to record the activities of web or app users, such as browsing pages, searching and clicking
  • Operational metrics: Kafka is also often used to record operational monitoring data
  • Stream processing: for example with Spark Streaming and Storm

Data processing tools

Offline computing: MapReduce
DAG computing: Tez
In-memory computing: Spark
Real-time computing: Spark Streaming, Flink


MapReduce is a distributed computing model, used mainly in the search field, for solving computation problems over massive data. It consists of two phases, map and reduce; the user only needs to implement the map and reduce functions to get distributed computing.

  • High reliability: its data processing can be trusted.
  • High scalability: it distributes data and computing tasks across the available cluster and can easily scale to thousands of nodes.
  • Efficiency: it can move data dynamically between nodes and keep the nodes balanced, so processing is very fast.
  • High fault tolerance: it automatically keeps multiple copies of data and automatically reassigns failed tasks.

The MapReduce computing framework uses a master/slave architecture: a Hadoop cluster consists of one JobTracker and a number of TaskTrackers.
The MapReduce computing model is suited to batch tasks.
MapReduce scales linearly: the more servers, the shorter the processing time.


Spark is an open source distributed cluster computing system based on in-memory computing, developed in Scala.
Because it computes in memory, it is more efficient than Hadoop MapReduce. A job's intermediate output and results can be kept in memory, so there is no need to read and write HDFS, which saves disk I/O time. Its performance is reported to be up to 100 times that of Hadoop.
It has the advantages of Hadoop MapReduce; the difference is that intermediate job output can be kept in memory instead of being written to and read from HDFS. Spark is therefore better suited to algorithms that need iteration, such as those used in data mining and machine learning.
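
Why keeping data in memory helps iterative algorithms can be shown with a toy experiment. This is not Spark code: `CountingStore` is a hypothetical stand-in for HDFS that merely counts reads, and the "algorithm" is a trivial gradient-style loop that converges to the mean of the data.

```python
class CountingStore:
    """Stand-in for HDFS that counts how many times the dataset is read."""
    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)

def gradient_step(estimate, data, rate=0.5):
    """One iteration: move the estimate toward the mean of the data."""
    gradient = sum(estimate - x for x in data) / len(data)
    return estimate - rate * gradient

def run_reloading(store, steps):
    """MapReduce-style: every iteration re-reads its input from storage."""
    estimate = 0.0
    for _ in range(steps):
        estimate = gradient_step(estimate, store.read())
    return estimate

def run_cached(store, steps):
    """Spark-style: read once, keep the dataset in memory across iterations."""
    data = store.read()
    estimate = 0.0
    for _ in range(steps):
        estimate = gradient_step(estimate, data)
    return estimate

disk_store, cached_store = CountingStore([1, 2, 3, 4]), CountingStore([1, 2, 3, 4])
reloaded = run_reloading(disk_store, steps=20)
cached = run_cached(cached_store, steps=20)
print(reloaded, disk_store.reads)   # one storage read per iteration
print(cached, cached_store.reads)   # a single storage read in total
```

Both runs compute the same estimate; the only difference is the number of trips to storage, which is where the disk I/O savings described above come from.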
Spark is compatible with the Hadoop ecosystem: it can run on YARN and can read HDFS, HBase, Cassandra and any Hadoop data source.
Spark can be used in the following scenarios:
√ Batch processing with spark-shell / spark-submit
√ Interactive queries with Spark SQL
√ Real-time processing applications with Spark Streaming
√ Machine learning with MLlib / MLbase
√ Graph processing with GraphX and data mining with SparkR

Usage scenario:

  • Complex batch processing, where the focus is on handling massive data
  • Interactive queries over historical data, where the focus is on interactive response times of tens of seconds to tens of minutes; Spark SQL is used here
  • Data processing over real-time data streams, where the focus is on low-latency real-time processing


Flink is an open source, distributed, high-performance, highly available and accurate stream processing framework for stateful computation over unbounded and bounded data streams. It supports both real-time stream processing and batch processing.

Open source real-time processing tool that can handle batch and stream processing tasks at the same time
Fast and reliable, usable for general-purpose data processing
Easy to use, with APIs in Java and Scala
Flink is positioned as a data processing engine that unifies batch and stream processing
The biggest advantage of Flink is continuous queries.

Cluster resource management


YARN (Yet Another Resource Negotiator) is the resource management and job scheduling technology in the open source Hadoop distributed processing framework. As one of the core components of Hadoop, YARN is responsible for allocating system resources to the applications running in a Hadoop cluster and for scheduling tasks to run on different cluster nodes.

  • ResourceManager: has the final say over the allocation of all resources in the system, is responsible for resource allocation for all applications in the cluster, and holds the primary, global view of cluster resources. It therefore provides users with fair, capacity-based and locality-aware resource scheduling.

  • NodeManager: mainly responsible for communicating with the ResourceManager, launching application containers and managing their life cycle, monitoring their resource usage (CPU and memory), tracking node health, managing logs, and reporting to the RM.

  • ApplicationsManager: mainly responsible for accepting job submissions, allocating the first container in which the application's ApplicationMaster runs, monitoring the ApplicationMaster, and restarting its container on failure.

Data visualization tool

Hue: the visualization tool bundled with CDH; it queries and visualizes Hive and Impala data through a web interface. Task execution is slow but stable, and it handles large data volumes well, for example offline analysis of user DPI logs and network signaling.
Zeppelin: visualization tool
Kibana: queries Elasticsearch (ES) data


Hue is an open source UI system for Apache Hadoop, contributed to the open source community by Cloudera and built on the Python web framework Django. With Hue, you can interact with a Hadoop cluster from a web console in the browser to analyze and process data, for example by running Hive and Impala queries or MapReduce jobs.

Data security operation and maintenance

Cloudera Manager: the tool bundled with CDH, covering cluster installation, deployment, configuration, etc.

Cloudera Manager

Cloudera manager covers the unified configuration, management, monitoring and diagnosis of all resources and services in the cluster.

  • Zero-downtime rolling installation and upgrades
  • High availability, with manual and automatic failover for configured components
  • Configuration logging and rollback
  • Dynamic resource provisioning between services
  • Disaster recovery, backup and restore
  • LDAP and Kerberos integration
  • Direct connection to Cloudera support services
