Big data fundamentals: Hadoop principles + installation + hands-on practice (HDFS + YARN + MapReduce)

Time:2022-5-20

1、 Hadoop overview

Hadoop is an open-source distributed computing platform under the Apache Software Foundation. Built around HDFS (Hadoop Distributed File System) and MapReduce (Hadoop 2.0 added YARN, a resource scheduling framework that manages and schedules tasks at fine granularity and can also host other computing frameworks such as Spark), Hadoop gives users a distributed infrastructure whose low-level details are transparent to them. HDFS is highly fault tolerant, scalable and efficient, so users can deploy Hadoop on low-cost hardware to form a distributed system. The current major release is Hadoop 3 [official documentation].


2、 HDFS details

1) HDFS overview

HDFS (Hadoop Distributed File System) is the core sub-project of the Hadoop project and the foundation of data storage management in distributed computing. It was developed for accessing and processing large files in a streaming fashion and runs on cheap commodity servers. Its high fault tolerance, high reliability, high scalability, high availability and high throughput provide robust storage for massive data and bring great convenience to applications that process very large data sets. HDFS originated from the GFS (Google File System) paper published by Google in October 2003; it is essentially a clone of GFS.

Design features of HDFS

HDFS is chosen to store data because it has the following advantages:

  • High fault tolerance: data is automatically saved as multiple replicas. Fault tolerance is improved by adding replicas, and a lost replica is recovered automatically by HDFS's internal mechanisms, without any user intervention.
  • Suitable for batch processing: it moves computation rather than data, and it exposes data locations to the computing framework.
  • Suitable for big data: it handles data at GB, TB and even PB scale, file counts beyond the millions, and clusters on the order of 10K nodes.
  • Streaming file access: write once, read many times. Once a file is written it cannot be modified, only appended, which guarantees data consistency.
  • Can be built on cheap machines: reliability comes from the multi-replica mechanism, together with fault-tolerance and recovery mechanisms; for example, a lost replica can be rebuilt from the remaining replicas.

Of course, HDFS also has its disadvantages and is not suitable for all occasions:

  • Low-latency data access: HDFS targets high-throughput workloads that write large amounts of data at a time; it cannot serve low-latency access, such as reads that must return within milliseconds.

  • Small-file storage: storing a large number of small files (files smaller than the HDFS block size, 128 MB by default in Hadoop 3.x) consumes a large amount of NameNode memory for file, directory and block metadata, which is undesirable because NameNode memory is always limited. In addition, the seek time for small files exceeds the read time, which violates HDFS's design goals.

  • Concurrent writes and random file modification: a file can have only one writer at a time, concurrent writes by multiple threads are not allowed, and only appends are supported, not random modification.

2) HDFS composition

HDFS uses the master / slave architecture to store data. This architecture is mainly composed of four parts: HDFS client, namenode, datanode and secondary namenode. Let’s introduce these four components respectively:

1、Client

The client is the HDFS client. Its responsibilities:

  • File splitting: when a file is uploaded to HDFS, the client splits it into blocks and then stores them.
  • Interact with namenode to obtain the location information of the file.
  • Interact with datanode to read or write data.
  • The client provides some commands to manage HDFS, such as starting or shutting down HDFS.
  • The client can access HDFS through some commands.

2、NameNode(NN)

The NameNode is the master, the supervisor and manager of HDFS: it manages the HDFS namespace, maintains the mapping from files to data blocks, configures replica policies, and handles client read and write requests.

3、DataNode(DN)

The DataNode is the slave: the NameNode issues commands and the DataNode performs the actual operations.

  • Store the actual data block.
  • Perform read / write operations on data blocks.

4、Secondary NameNode(2NN)

The secondary namenode is not a hot standby of the namenode. When the namenode hangs, it cannot immediately replace the namenode and provide services.

  • Secondary namenode is just a tool of namenode, which helps namenode manage metadata information.
  • Regularly merge fsimage and fsedits and push them to namenode.
  • In case of emergency, namenode can be recovered.

3) Specific working principle of HDFS

1. Two core data structures: FsImage and EditLog

  • Fsimage is responsible for maintaining the metadata of the file system tree and all files and folders in the tree.
    ——- maintain the image of file structure and file meta information
  • The editlog operation log file records all file creation, deletion and renaming operations.
    ——- record operations on files

PS:
1. The NN keeps its metadata in memory for read/write speed; fsimage is just an on-disk image file of that metadata.
2. Every create, delete or update operation is appended to the edit log; the log is rolled over time, so the edit log ends up as multiple files.
3. The 2NN is not a backup of the NN (although it can help restore one); its main job is to help the NN merge the edit logs into the fsimage and thereby shorten NN startup time.
4. Topological distance: calculate the shortest path according to the tree structure composed of node network
5. Rack perception: the placement position of nodes according to the topological distance

2. Workflow

  • Step 1: when a client issues a metadata change (create, delete, rename), Hadoop's safety requirements mean the operation is first written to the edit log and persisted to disk.
  • Step 2: the change is then applied to the in-memory metadata (built from the fsimage plus the edit logs). Because the log is written first, the data can be recovered even after a crash, whereas in-memory data alone would be lost. When the 2NN finds that a checkpoint condition is met (the checkpoint interval has elapsed, the edit log is full, or the NN has just started), it requests an auxiliary merge; on receiving the request the NN immediately rolls the edit log, and data arriving from clients continues to be written to the new edit log.
  • Step 3: the rolled edit log and the fsimage are copied to the 2NN (Secondary NameNode), merged in the 2NN's memory, and the result is returned to the NN as the new fsimage. Therefore, if the NN goes down, the 2NN is at most one edit log behind the NN and cannot completely restore the original state; it can only assist recovery.

3. HDFS file reading process

[Step 1] The client calls the open() method of the FileSystem

  • The FileSystem communicates with the NN via RPC, and the NN returns part or all of the file's block list (including the DN addresses holding each block's replicas).
  • The DN closest to the client is chosen to establish a connection and read the block, and an FSDataInputStream is returned.

[step 2] the client calls the read() method of the input stream

  • When the end of a block is reached, the FSDataInputStream closes the connection to the current DN and finds the closest DN that holds the next block.
  • After a block has been read, a checksum verification is performed; if reading from a DN fails, the client notifies the NN and then continues reading from the next DN that holds a replica of that block.
  • If the block list is exhausted but the file has not been fully read, the FileSystem requests the next batch of block locations from the NN.

[Step 3] The client closes the FSDataInputStream
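
For illustration, here is a minimal sketch of the same read path through the Java FileSystem API. It is not code from this article; the fs.defaultFS address and the file path are assumptions that happen to match the cluster and test file used later in this tutorial.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumption: adjust to your own NameNode address (this article later uses hdfs://hadoop-node1:8082)
        conf.set("fs.defaultFS", "hdfs://hadoop-node1:8082");

        // open() asks the NameNode (via RPC) for the block locations of the file (step 1)
        try (FileSystem fs = FileSystem.get(conf);
             FSDataInputStream in = fs.open(new Path("/test001.txt"));
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            // read() pulls bytes block by block from the closest DataNodes,
            // transparently switching DataNodes between blocks (step 2)
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } // closing the stream releases the DataNode connections (step 3)
    }
}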

4. HDFS file writing process

[Step 1] The client calls the create() method of the FileSystem

  • The FileSystem sends a request to the NN to create a new file in the NN's namespace, without associating any blocks with it yet.
  • The NN checks whether the file already exists and whether the client has permission. If the checks pass, the NN records the new file information, and data blocks can then be allocated on DNs.
  • An FSDataOutputStream is returned, and the client writes its data to it.

[Step 2] The client calls the write() method of the output stream

  • HDFS keeps 3 replicas of each data block by default. The FSDataOutputStream writes data packets to the first DN, the first DN forwards the packets to the second DN, and the second DN forwards them to the third.

[Step 3] The client calls the close() method of the stream

  • The remaining packets in the buffer are flushed, and once each block has reached the required number of replicas, the NN returns a success message.
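
Correspondingly, a minimal write sketch through the Java API; again the address and path are placeholder assumptions, not code from the article.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://hadoop-node1:8082"); // assumption: your NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            // create() asks the NameNode to add the file to the namespace (step 1)
            try (FSDataOutputStream out = fs.create(new Path("/test20211214/hello-hdfs.txt"))) {
                // write() streams packets through the DataNode pipeline (step 2);
                // by default each block ends up with 3 replicas
                out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
                out.hflush(); // make the written data visible to readers
            } // close() flushes the remaining packets and waits for the pipeline acks (step 3)
        }
    }
}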

3、 Detailed explanation of yarn

1) Yarn overview

Apache YARN (Yet Another Resource Negotiator) is Hadoop's cluster resource management system. YARN was introduced in Hadoop 2 originally to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.

Yarn features:

  • Support the requirements of non MapReduce applications
  • Scalability
  • Improve resource utilization
  • User agility
  • It can be built to be highly available

2) Yarn architecture components

Overall, YARN is a master/slave model that relies on three components. The first is the ResourceManager, the arbiter of cluster resources; it consists of a pluggable Scheduler and an ApplicationsManager that manages user jobs in the cluster. The second is the NodeManager on each node, which manages the user jobs and workflow on that node and continuously reports its container usage to the ResourceManager. The third is the ApplicationMaster, the manager of a user job's life cycle; its main job is to request computing resources (containers) from the (global) ResourceManager and to interact with NodeManagers to execute and monitor specific tasks. The architecture diagram is as follows:

1、ResourceManager(RM)

The RM is the global resource manager: it manages the computing resources of the entire cluster and allocates them to applications. Its responsibilities include:

  • Interact with the client and process requests from the client
  • Start and manage the applicationmaster and restart it if it fails to run
  • Manage nodemanager, receive resource reporting information from nodemanager, and issue management instructions to nodemanager
  • Resource management and scheduling, receive resource application requests from applicationmaster and allocate resources for them

RM key configuration parameters:

  • Minimum container memory: yarn.scheduler.minimum-allocation-mb
  • Container memory increment: yarn.scheduler.increment-allocation-mb
  • Maximum container memory: yarn.scheduler.maximum-allocation-mb
  • Minimum container virtual CPU cores: yarn.scheduler.minimum-allocation-vcores
  • Container virtual CPU core increment: yarn.scheduler.increment-allocation-vcores
  • Maximum container virtual CPU cores: yarn.scheduler.maximum-allocation-vcores
  • ResourceManager web application HTTP port: yarn.resourcemanager.webapp.address

2、ApplicationMaster(AM)

At the application level, the ApplicationMaster manages the application running on YARN. Key points include:

  • Each application submitted by the user contains an AM, which can run on a machine other than RM.
  • Responsible for negotiating with RM scheduler to obtain resources (represented by container)
  • Further allocate the obtained resources to internal tasks (secondary allocation of resources)
  • Communicate with nm to start / stop the task.
  • Monitor the running status of all tasks, and re apply for resources for the task to restart the task when the task fails to run

Am key configuration parameters:

  • Maximum ApplicationMaster attempts: yarn.resourcemanager.am.max-attempts
  • ApplicationMaster liveness-monitor expiry: yarn.am.liveness-monitor.expiry-interval-ms

3、NodeManager(NM)

The NodeManager is the agent on each node in YARN; it manages a single compute node in the Hadoop cluster. Its responsibilities include:

  • Start and monitor the compute container on the node
  • Report the resource usage on this node and the operation status of each container (CPU, memory and other resources) to RM in the form of heartbeat
  • Receive and process container start / stop and other requests from am

Nm key configuration parameters:

  • Node memory: yarn.nodemanager.resource.memory-mb
  • Node virtual CPU cores: yarn.nodemanager.resource.cpu-vcores
  • NodeManager web application HTTP port: yarn.nodemanager.webapp.address

4、Container

A container is YARN's abstraction of resources: it encapsulates multi-dimensional resources on a node, such as memory, CPU, disk and network. Containers are requested by the AM from the RM and allocated to the AM asynchronously by the RM's resource scheduler; the container is then launched by the AM contacting the NM that owns the resources.

There are two main categories of containers in an application:

  • The container that runs the AM: this is requested and started by the RM (through its internal resource scheduler); when submitting an application, the user can specify the resources required by its AM;
  • Containers that run the various tasks: these are requested by the AM from the RM, and the AM communicates with the NM to start them.

Both kinds of containers may be placed on any node; their locations are generally random, i.e. the AM may run on the same node as the tasks it manages.

3) Yarn operation process

The execution process of application in yarn is shown in the following figure:

  1. The client submits an application to the ResourceManager and requests an ApplicationMaster instance. In the response, the ResourceManager returns an application ID together with information about the cluster's resource capacity that helps the client request resources.

  2. The ResourceManager finds a NodeManager that can run a container and starts the ApplicationMaster instance in that container.

    • The application submission context contains the application ID, user name, queue and other information needed to start the ApplicationMaster. A Container Launch Context (CLC) is also sent to the ResourceManager; the CLC provides the resource requirements, job files, security tokens and other information needed to launch the ApplicationMaster on the node.
    • When the ResourceManager receives the context submitted by the client, it schedules an available container (usually called container0) for the ApplicationMaster, then contacts the NodeManager to start it, and establishes the ApplicationMaster's RPC port and tracking URL so that the application's status can be monitored.
  3. The ApplicationMaster registers with the ResourceManager; after registration, the client can query the ResourceManager for the ApplicationMaster's details. In the registration response, the ResourceManager also sends the cluster's maximum and minimum resource capacities.

  4. During normal operation, the ApplicationMaster sends resource requests to the ResourceManager according to the resource-request protocol. The ResourceManager allocates container resources to the ApplicationMaster as well as it can, following its scheduling policy, and returns the allocations as the response to the resource request.

  5. Once containers have been allocated, the ApplicationMaster starts them by sending a container launch specification to the NodeManager; the specification contains the information the containers need to communicate with the ApplicationMaster. After the containers start, the ApplicationMaster can check their status. The ResourceManager no longer takes part in program execution; it only handles scheduling and monitoring of other resources. The ResourceManager can also order a NodeManager to kill a container.

  6. The application code runs in the launched containers and reports progress, status and other information to the ApplicationMaster through an application-specific protocol. As the job runs, the ApplicationMaster sends heartbeats and progress information to the ResourceManager; within these heartbeats it can also request and release containers.

  7. While the application is running, the client that submitted it communicates directly with the ApplicationMaster to obtain status, progress updates and other information, again through an application-specific protocol.

  8. Once the application has finished and all related work is done, the ApplicationMaster deregisters from the ResourceManager and shuts down, and all the containers it used are returned to the system. When a container is killed or reclaimed, the ResourceManager notifies the NodeManager to aggregate its logs and clean up the container-specific files.
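
To make steps 1 and 2 more concrete, here is a heavily simplified sketch using the YarnClient API. It is for illustration only, not a complete application: the launch command, application name, queue and resource sizes below are placeholder assumptions.

import java.util.Collections;

import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Step 1: ask the ResourceManager for a new application; the response carries an application ID
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        ApplicationId appId = appContext.getApplicationId();

        // Step 2: describe how to launch the ApplicationMaster container (the CLC)
        ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
                null, null,
                Collections.singletonList("/bin/sleep 60"), // placeholder AM command
                null, null, null);
        appContext.setApplicationName("demo-app");          // placeholder name
        appContext.setQueue("default");
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(1024, 1)); // 1 GB, 1 vcore for the AM (example values)

        // The RM schedules a container (container0) on some NodeManager and starts the AM in it
        yarnClient.submitApplication(appContext);
        System.out.println("Submitted application " + appId);

        yarnClient.stop();
    }
}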

4) Yarn three resource schedulers

1. FIFO scheduler

The advantage of FIFO scheduler is that it is easy to understand and does not need any configuration, but it is not suitable for shared clusters. Large applications will occupy all resources in the cluster, so each application must wait until its turn to run. In a shared cluster, it is more suitable to use capacity scheduler or fair scheduler. Both schedulers allow long-running jobs to be completed in time, and also allow users who are making small temporary queries to get the returned results in a reasonable time.

2. Capacity scheduler

The capacity scheduler allows multiple organizations to share a Hadoop cluster, and each organization can allocate part of all cluster resources. Each organization is configured with a special queue, and each queue is configured to use certain cluster resources. Queues can be further divided into layers, so that different users in each organization can share the resources allocated by the organization queue. In a queue, FIFO scheduling strategy is used to schedule applications.

  • A single job does not use more resources than its queue capacity. However, what if there are multiple jobs in the queue and the queue resources are insufficient? At this time, if there are still free resources available, the capacity scheduler may allocate the free resources to the jobs in the queue, even if it will exceed the queue capacity. This is called queue elasticity.

3. Fair scheduler

Fair scheduling allocates the cluster's resources evenly across all running applications. By default the Fair Scheduler schedules on memory only; it can also be configured to schedule on both memory and CPU. Within a queue, the FIFO, FAIR and DRF policies can be used to schedule applications. The Fair Scheduler can also guarantee a minimum resource allocation to a queue.

  • [Note] With the fair scheduler in the figure below, there is some delay between the submission of the second job and its acquisition of resources, because it has to wait for the first job to release the containers it occupies. After the small job finishes, it releases its resources and the large job again obtains all of the cluster's resources. The net effect is that the fair scheduler achieves high resource utilization while also ensuring that small jobs finish promptly.


4、 MapReduce details

1) MapReduce overview

MapReduce is a programming model (it has no notion of a cluster itself; jobs are submitted to a YARN cluster to run) for parallel computation over large data sets (larger than 1 TB). Its core ideas, "map" and "reduce", are borrowed from functional and vector programming languages, and they let programmers run their programs on a distributed system without writing distributed parallel code themselves. In the current implementation, a Map function is specified that transforms a set of key/value pairs into a new set of key/value pairs, and a concurrent Reduce function is specified that merges all intermediate values associated with the same key. (Hand-written MapReduce is rarely used in enterprises any more; a basic understanding is enough.)
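
To make the model concrete, here is a minimal WordCount sketch against the org.apache.hadoop.mapreduce API, essentially what the example JAR run in the hands-on section later does; it is a standard illustration rather than code from this article.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // map: (line offset, line text) -> (word, 1)
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, count); all values sharing a key arrive at the same reducer
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
}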

2) MapReduce running process

The operation process of the job mainly includes the following steps:

1. Job submission
2. Initialization of job
3. Assignment of job tasks
4. Execution of job tasks
5. Job execution status update
6. Job complete

The flow chart of specific job execution process is shown in the figure below:

1. Job submission

The MR code calls the waitForCompletion() method, which wraps job.submit(); inside submit() a JobSubmitter object is created. When waitForCompletion(true) is called, it polls the job's progress once per second and, whenever the status differs from the previous query, prints the details to the console. If the job succeeds, the job counters are displayed; otherwise the record of the job failure is printed to the console.
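
As an illustration of the submission call described above, a minimal driver might look like the following sketch. It reuses the TokenizerMapper/IntSumReducer from the earlier WordCount sketch, and the input and output paths are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCount.TokenizerMapper.class);
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path("/wordcount/input"));    // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/wordcount/output"));

        // waitForCompletion(true) wraps submit() and polls/prints progress until the job ends
        boolean ok = job.waitForCompletion(true);
        System.exit(ok ? 0 : 1);
    }
}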

The general flow implemented by JobSubmitter is as follows:

  1. Ask the ResourceManager for a new MapReduce job ID, as shown in step 2.
  2. Check the job's output configuration and whether the output directory already exists.
  3. Compute the input splits for the job.
  4. Copy the job JAR, configuration files and computed input splits to a temporary HDFS directory named after the job ID; the job JAR is stored with a high replication factor, 10 by default (controlled by mapreduce.client.submit.file.replication).
  5. Submit the job through the ResourceManager's submitApplication() method.

2. Initialization of job

  1. When submitApplication() is called on the ResourceManager, the request is handed to YARN's scheduler, which allocates a container (container0) on a NodeManager and starts the ApplicationMaster process (main class MRAppMaster) in it. Once started, the ApplicationMaster registers with the ResourceManager and reports its information, and it can then monitor the state of the map and reduce tasks. It initializes the job by creating a number of bookkeeping objects to track job progress.

  2. The ApplicationMaster fetches the resources placed in the temporary shared HDFS directory at submission time (job JAR, split information, configuration, etc.). It creates a map task object for each split, and determines the number of reduce tasks from the mapreduce.job.reduces parameter (set via the setNumReduceTasks() method on the job).

  3. The ApplicationMaster decides whether to run the job in uber mode (the job runs in the same JVM as the ApplicationMaster, i.e. the map and reduce tasks run on the same node). The conditions for uber mode are: fewer than 10 map tasks, a single reduce task, and input data smaller than one HDFS block.

You can use the following parameters:

mapreduce.job.ubertask.enable      # whether uber mode is enabled
mapreduce.job.ubertask.maxmaps     # maximum number of maps for an uber task
mapreduce.job.ubertask.maxreduces  # maximum number of reduces for an uber task
mapreduce.job.ubertask.maxbytes    # maximum input size for an uber task
  4. The ApplicationMaster calls the setupJob() method of the OutputCommitter. FileOutputCommitter is the default; it creates the final output directory and the temporary working directories for task output.

3. Assignment of job tasks

  1. When the ApplicationMaster determines that the job does not qualify for uber mode, it requests resource containers for the map and reduce tasks from the ResourceManager.

  2. Resource requests are issued for the map tasks first; requests for the resources needed by the reduce tasks are not issued until 5% of the map tasks have completed.

  3. During task assignment, reduce tasks can run on any DataNode, but data locality must be considered when placing map tasks. By default each map and reduce task gets 1 GB of memory, configurable through the following parameters:

mapreduce.map.memory.mb
mapreduce.map.cpu.vcores
mapreduce.reduce.memory.mb
mapreduce.reduce.cpu.vcores
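
These per-task settings (and the uber-mode switch mentioned earlier) are job-level properties, so they can also be set from the driver; a minimal sketch with purely illustrative values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ResourceTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Per-job overrides of the container sizes listed above (example values)
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.map.cpu.vcores", 1);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        conf.setInt("mapreduce.reduce.cpu.vcores", 2);
        // Optional: allow small jobs to run in uber mode (same JVM as the ApplicationMaster)
        conf.setBoolean("mapreduce.job.ubertask.enable", true);

        Job job = Job.getInstance(conf, "resource-tuning-demo");
        // ... set mapper/reducer/input/output as usual, then submit the job
    }
}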

4. Execution of job tasks

After the ApplicationMaster's resource requests are granted by the ResourceManager, the ApplicationMaster communicates with the NodeManagers to start the containers. Each task is executed by a Java application whose main class is YarnChild. Before running the task, the required resources (job configuration, JAR files, etc.) are localized; then the map or reduce task is run. YarnChild runs in a dedicated JVM.

5. Status update of job tasks

Every job and each of its tasks has a status: the job or task state (running, succeeded, failed, etc.), the progress of its maps and reduces, the values of the job counters, and a status message or description. While the job runs, the client can communicate directly with the ApplicationMaster, polling the job's status, progress and other information every second (configurable via mapreduce.client.progressmonitor.pollinterval).

6. Job completion

  • When the ApplicationMaster is notified that the last task has completed, it sets the job status to successful.
  • When the client polls the job status and learns that the job has completed, it prints a message to inform the user and returns from the waitForCompletion() method.
  • When the job completes, the ApplicationMaster and its containers clean up temporary state such as intermediate results, the commitJob() method of the OutputCommitter is called, and the job information is archived by the job history server for later queries.

3) Shuffle process in MapReduce

MapReduce guarantees that the input to every reducer is sorted by key. The process by which the system performs this sort and transfers the map outputs to the reducers as their input is called shuffle. Shuffle is also the main target of tuning. The shuffle flow chart is shown below:

1. Map end

  • Before the map tasks are created, the input file splits are computed.

  • The number of map tasks is then derived from the split size: a map task is created for each split, or for each file smaller than split size × 1.1. Each map task runs the user-defined map() logic, and its output is written to the local disk.

    1. The output is not written straight to disk. For IO efficiency it first goes into an in-memory ring buffer, where a pre-sort (quicksort) is performed. The buffer is 100 MB by default (configurable via mapreduce.task.io.sort.mb). When the buffer fills to a certain proportion, 80% by default (configurable via mapreduce.map.sort.spill.percent), a spill thread starts writing the buffer contents to disk. The spill thread is independent and does not block the thread writing map output into the buffer; the map keeps feeding the buffer during the spill and only blocks if the buffer fills up before the spill finishes. Spill files are written round-robin to the directories listed in mapreduce.cluster.local.dir. Before spilling, the number of reducers is already known, so the data is partitioned accordingly; by default records are assigned to partitions by HashPartitioner, and within each partition a background thread sorts by key, so the files spilled to disk are partitioned and sorted (a minimal Partitioner sketch follows this list). If a combiner function is defined, it runs on the sorted output, making the map output more compact and reducing the data written to disk and transferred to the reducers.

    2. Every time the ring buffer reaches the threshold it spills to a new file, so by the time a map finishes there may be several partitioned, sorted spill files locally. Before the map task completes, these files are merged (merge sort) into a single partitioned and sorted file; the parameter mapreduce.task.io.sort.factor controls how many files are merged at a time.

    3. While map output is being spilled to disk, compressing the data can speed up transfer and reduce disk IO and storage. Compression is off by default; it is enabled with mapreduce.map.output.compress, and the codec is chosen with mapreduce.map.output.compress.codec.
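
As mentioned above, spilled records are assigned to reduce partitions by a Partitioner (HashPartitioner by default). The sketch below only illustrates the contract; the routing rule (by first character) is an arbitrary example, not something this article prescribes.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same contract as the default HashPartitioner: every record whose key maps to the
// same partition number ends up at the same reduce task.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        int h = s.isEmpty() ? 0 : s.charAt(0);
        // & Integer.MAX_VALUE keeps the result non-negative, as HashPartitioner does with hashCode()
        return (h & Integer.MAX_VALUE) % numPartitions;
    }
}
// Enabled in the driver with: job.setPartitionerClass(FirstLetterPartitioner.class);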

2. Reduce end

  • When map tasks complete, the ApplicationMaster monitoring the job learns of it and starts the reduce tasks. The ApplicationMaster knows the mapping between map outputs and hosts, and the reduce tasks poll the ApplicationMaster to learn which hosts hold the data they need to copy.
  • The output of one map task may be fetched by several reduce tasks, and each reduce task usually needs the output of many map tasks as its input; map tasks also finish at different times. As soon as one map task completes, the reduce tasks can start fetching: each reduce task pulls the data for its own partition (by partition number) from the map outputs. This is the copy phase of shuffle. A reduce task uses a small pool of copier threads so it can fetch map outputs in parallel, 5 threads by default, configurable via mapreduce.reduce.shuffle.parallelcopies.
  • The copy process is similar to the map-side spill: there are thresholds and a memory limit, the thresholds are configurable, and the memory involved is part of the reduce task's heap. While copying, the reducer also sorts and merges files.
  • If a map output is small enough, it is copied into the memory buffer of the node running the reducer; the buffer size is set by mapreduce.reduce.shuffle.input.buffer.percent in mapred-site.xml. Once that buffer reaches its memory threshold, or the number of buffered map outputs reaches its threshold, the data is merged and spilled to disk.
  • If a map output is large, it is copied directly to the disk of the node running the reducer. As spill files accumulate there, a background thread merges them into larger, sorted files. After all map outputs have been copied, the sort phase begins: the many small map-output files are merged, by merge sort, into progressively larger files, and the last few large merged files become the input to reduce.

5、 Install Hadoop (HDFS + yarn)

1) Environmental preparation

Three VM virtual machines are prepared here

OS           hostname       IP               Roles
CentOS 8.x   hadoop-node1   192.168.0.113    namenode, datanode, resourcemanager, nodemanager
CentOS 8.x   hadoop-node2   192.168.0.114    secondarynamenode, datanode, nodemanager
CentOS 8.x   hadoop-node3   192.168.0.115    datanode, nodemanager

2) Download the latest Hadoop installation package

Download address:https://dlcdn.apache.org/hadoop/common/

We download the source package here: the pre-built release does not support snappy compression by default, so we need to compile it ourselves.

$ mkdir -p /opt/bigdata/hadoop && cd /opt/bigdata/hadoop
$ wget https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.1-src.tar.gz
#Decompress
$ tar -zvxf hadoop-3.3.1-src.tar.gz

Why recompile Hadoop source code?

To match the native library environment of different operating systems: some Hadoop operations, such as compression and IO, need to call the system's native libraries (.so|.dll).

Recompiling the source code

There is a BUILDING.txt file in the source package directory. Since my operating system here is CentOS 8, I follow the CentOS 8 steps; you can find the steps for your own system and execute those.

$ grep -n -A40 'Building on CentOS 8' BUILDING.txt

Building on CentOS 8

----------------------------------------------------------------------------------


* Install development tools such as GCC, autotools, OpenJDK and Maven.
  $ sudo dnf group install --with-optional 'Development Tools'
  $ sudo dnf install java-1.8.0-openjdk-devel maven

* Install Protocol Buffers v3.7.1.
  $ git clone https://github.com/protocolbuffers/protobuf
  $ cd protobuf
  $ git checkout v3.7.1
  $ autoreconf -i
  $ ./configure --prefix=/usr/local
  $ make
  $ sudo make install
  $ cd ..

* Install libraries provided by CentOS 8.
  $ sudo dnf install libtirpc-devel zlib-devel lz4-devel bzip2-devel openssl-devel cyrus-sasl-devel libpmem-devel

* Install optional dependencies (snappy-devel).
  $ sudo dnf --enablerepo=PowerTools install snappy-devel

* Install optional dependencies (libzstd-devel).
  $ sudo dnf install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
  $ sudo dnf --enablerepo=epel install libzstd-devel

* Install optional dependencies (isa-l).
  $ sudo dnf --enablerepo=PowerTools install nasm
  $ git clone https://github.com/intel/isa-l
  $ cd isa-l/
  $ ./autogen.sh
  $ ./configure
  $ make
  $ sudo make install

----------------------------------------------------------------------------------

Enter the Hadoop source code path and execute the Maven command to compile Hadoop

$ cd /opt/bigdata/hadoop/hadoop-3.3.1-src
#Compile
$ mvn package -Pdist,native,docs -DskipTests -Dtar

[Question] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1:enforce

[INFO] BUILD FAILURE
[INFO] ————————————————————————
[INFO] Total time: 19:49 min
[INFO] Finished at: 2021-12-14T09:36:29+08:00
[INFO] ————————————————————————
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M1:enforce (enforce-banned-dependencies) on project hadoop-client-check-test-invariants: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR] mvn -rf :hadoop-client-check-test-invariants

[solution]

  • Option 1: skip the enforcer checks by adding the skip flag to the build command: -Denforcer.skip=true
  • Option 2: let rule-check failures not fail the build by adding: -Denforcer.fail=false

The exact cause is unclear; here we simply use Option 1 to skip the check. If you are interested, you can enable debug mode (-X) to see the specific errors.

$ mvn package -Pdist,native,docs,src -DskipTests -Dtar -Denforcer.skip=true

For reference, the build command options listed in BUILDING.txt:

#Of course, there are other options
$ grep -n -A1 '$ mvn package' BUILDING.txt

$ mvn package -Pdist -DskipTests -Dtar -Dmaven.javadoc.skip=true
$ mvn package -Pdist,native,docs -DskipTests -Dtar
$ mvn package -Psrc -DskipTests
$ mvn package -Pdist,native,docs,src -DskipTests -Dtar
$ mvn package -Pdist,native -DskipTests -Dmaven.javadoc.skip \
  -Dopenssl.prefix=/usr/local/opt/openssl


At this point the Hadoop source build is complete.
The compiled artifacts are under hadoop-dist/target/ in the source directory.

Copy the compiled binary package

$ cp hadoop-dist/target/hadoop-3.3.1.tar.gz /opt/bigdata/hadoop/
$ cd /opt/bigdata/hadoop/
$ ll

The compiled package has also been uploaded to Baidu Cloud; if you don't want to compile it yourself, you can use mine directly:

Link:https://pan.baidu.com/s/1hmdHY20zSLGyKw1OAVCg7Q
Extraction code: 8888

3) Initialize and configure the server and Hadoop

1. Modify host name

#Execute on the 192.168.0.113 machine
$ hostnamectl set-hostname hadoop-node1
#Execute on the 192.168.0.114 machine
$ hostnamectl set-hostname hadoop-node2
#Execute on the 192.168.0.115 machine
$ hostnamectl set-hostname hadoop-node3

2. Modify the mapping relationship between host name and IP (implemented by all nodes)

$ echo "192.168.0.113 hadoop-node1" >> /etc/hosts
$ echo "192.168.0.114 hadoop-node2" >> /etc/hosts
$ echo "192.168.0.115 hadoop-node3" >> /etc/hosts

3. Turn off firewall and SELinux (all nodes execute)

$ systemctl stop firewalld
$ systemctl disable firewalld

#Temporary shutdown (without restarting the machine):
$ setenforce 0    # set SELinux to permissive mode

#To disable permanently, edit the /etc/selinux/config file
#and change SELINUX=enforcing to SELINUX=disabled

4. Time synchronization (all nodes execute)

$ dnf install chrony -y
$ systemctl start chronyd
$ systemctl enable chronyd

/etc/chrony.conf configuration file content

# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
#pool 2.centos.pool.ntp.org iburst    (comment out this line and add the following two lines)
server ntp.aliyun.com iburst
server cn.ntp.org.cn iburst

Reload configuration and test

$ systemctl restart chronyd.service
$ chronyc sources -v

5. Configure passwordless SSH (executed on hadoop-node1)

#1. Execute the following command on hadoop-node1 to generate a key pair:
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_dsa
#2. Then copy the public key id_dsa.pub to hadoop-node1 | hadoop-node2 | hadoop-node3 for passwordless authentication.
$ ssh-copy-id -i /root/.ssh/id_dsa.pub hadoop-node1
$ ssh-copy-id -i /root/.ssh/id_dsa.pub hadoop-node2
$ ssh-copy-id -i /root/.ssh/id_dsa.pub hadoop-node3
$ ssh hadoop-node1
$ exit
$ ssh hadoop-node2
$ exit
$ ssh hadoop-node3
$ exit


6. Create a unified working directory (all nodes execute)

#Software installation path
$ mkdir -p /opt/bigdata/hadoop/server
#Data storage path
$ mkdir -p /opt/bigdata/hadoop/data
#Installation package storage path
$ mkdir -p /opt/bigdata/hadoop/software

7. Install JDK (all nodes execute)
Official download: https://www.oracle.com/java/technologies/downloads/
Baidu Cloud download:

Link:https://pan.baidu.com/s/1-rgW-Z-syv24vU15bmMg1w
Extraction code: 8888

$ cd /opt/bigdata/hadoop/software
$ tar -zxvf jdk-8u212-linux-x64.tar.gz -C /opt/bigdata/hadoop/server/
#Add the following environment variables to /etc/profile
export JAVA_HOME=/opt/bigdata/hadoop/server/jdk1.8.0_212
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
#Source loading
$ source /etc/profile
#View JDK version
$ java -version

4) Start installing Hadoop

1. Unzip the installation package I compiled above

$ cd /opt/bigdata/hadoop/software
$ tar -zxvf hadoop-3.3.1.tar.gz -C /opt/bigdata/hadoop/server/
$ cd /opt/bigdata/hadoop/server/
$ cd hadoop-3.3.1/
$ ls -lh

2. Installation package directory description

Directory   Description
bin         Hadoop's most basic management and usage scripts; they are the underlying implementation of the management scripts under sbin, and can be used directly to manage and use Hadoop.
etc         Directory of Hadoop configuration files.
include     Externally provided programming header files (the corresponding dynamic and static libraries are in the lib directory). They are defined in C++ and are typically used by C++ programs to access HDFS or to write MapReduce programs.
lib         The programming dynamic and static libraries provided by Hadoop, used together with the header files in the include directory.
libexec     Shell configuration files used by each service, for setting basic information such as log output and startup parameters (e.g. JVM options).
sbin        Directory of the Hadoop management scripts, mainly the start and stop scripts for the HDFS and YARN services.
share       Directory of the compiled JAR packages of each Hadoop module; the official examples are also here.

3. Modify the configuration files

Configuration file directory: /opt/bigdata/hadoop/server/hadoop-3.3.1/etc/hadoop
Official documentation: https://hadoop.apache.org/docs/r3.3.1/

  • Modify hadoop-env.sh
#Append the following at the end of the hadoop-env.sh file
export JAVA_HOME=/opt/bigdata/hadoop/server/jdk1.8.0_212
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root
  • Modify core-site.xml (core module configuration)

Add the following between <configuration> and </configuration>:

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://hadoop-node1:8082</value>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/opt/bigdata/hadoop/data/hadoop-3.3.1</value>
</property>

<property>
  <name>hadoop.http.staticuser.user</name>
  <value>root</value>
</property>

<property>
  <name>hadoop.proxyuser.root.hosts</name>
  <value>*</value>
</property>

<property>
  <name>hadoop.proxyuser.root.groups</name>
  <value>*</value>
</property>

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
  • Modify hdfs-site.xml (HDFS module configuration)

Add the following between <configuration> and </configuration>:

<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>hadoop-node2:9868</value>
</property>

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
  • Modify mapred-site.xml (MapReduce module configuration)

Add the following between <configuration> and </configuration>:

<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

<property>
  <name>mapreduce.jobhistory.address</name>
  <value>hadoop-node1:10020</value>
</property>

<property>
  <name>mapreduce.jobhistory.webapp.address</name>
  <value>hadoop-node1:19888</value>
</property>

<property>
  <name>yarn.app.mapreduce.am.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>

<property>
  <name>mapreduce.map.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>

<property>
  <name>mapreduce.reduce.env</name>
  <value>HADOOP_MAPRED_HOME=${HADOOP_HOME}</value>
</property>
  • Modify yarn-site.xml (YARN module configuration)

Add the following between <configuration> and </configuration>:

<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>hadoop-node1</value>
</property>

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>

<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>

<property>
  <name>yarn.log-aggregation-enable</name>
  <value>true</value>
</property>

<property>
  <name>yarn.log.server.url</name>
  <value>http://hadoop-node1:19888/jobhistory/logs</value>
</property>

<property>
  <name>yarn.log-aggregation.retain-seconds</name>
  <value>604880</value>
</property>
  • Modify workers
    Overwrite the file with the following content (by default it contains only localhost):
hadoop-node1
hadoop-node2
hadoop-node3

4. Distribute the Hadoop installation package to the other machines

$ cd /opt/bigdata/hadoop/server/
$ scp -r hadoop-3.3.1 hadoop-node2:/opt/bigdata/hadoop/server/
$ scp -r hadoop-3.3.1 hadoop-node3:/opt/bigdata/hadoop/server/

5. Add Hadoop to the environment variable (all nodes)

$ vi /etc/profile

export HADOOP_HOME=/opt/bigdata/hadoop/server/hadoop-3.3.1
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH

#Loading
$ source /etc/profile

6. Hadoop cluster startup (executed on hadoop-node1)

1) (First start only) Format the NameNode (this can only be executed once)
  • The first time you start HDFS, it must be formatted
  • Formatting is essentially initialization: cleaning and preparing HDFS
$ hdfs namenode -format

2) Manually start and stop process by process

Each machine starts or stops one role process at a time, which gives precise control over every process and avoids starting or stopping everything at once

1. HDFS cluster startup

$ hdfs --daemon start|stop namenode|datanode|secondarynamenode

2. Yarn cluster startup

$ yarn --daemon start|stop resourcemanager|nodemanager
3) One click start through shell script

On hadoop-node1, use the shell scripts that ship with Hadoop to start everything with one command. Prerequisites: passwordless SSH between the machines and the workers file must be configured.

  • HDFS cluster startup and shutdown
$ start-dfs.sh
$ stop-dfs.sh    # not executed here

Check java process

$ jps

  • Yarn cluster startup and shutdown
$ start-yarn.sh
$ stop-yarn.sh    # not executed here
#View java process
$ jps

Check the logs; the log path is /opt/bigdata/hadoop/server/hadoop-3.3.1/logs

$ cd /opt/bigdata/hadoop/server/hadoop-3.3.1/logs
$ ll

  • Hadoop cluster startup and shutdown (HDFS + yarn)
$ start-all.sh
$ stop-all.sh
4) Access through web pages

[Note] Configure the domain-name mappings on Windows in the C:\Windows\System32\drivers\etc\hosts file by adding the following lines:

192.168.0.113 hadoop-node1
192.168.0.114 hadoop-node2
192.168.0.115 hadoop-node3

1. HDFS cluster

Address: http://namenode_host:9870

The address here is:http://192.168.0.113:9870

2. Yarn cluster

Address: http://resourcemanager_host:8088

The address here is:http://192.168.0.113:8088

So far, Hadoop and yarn clusters have been deployed~


6、 Hadoop actual operation

1) HDFS actual operation

  • Command introduction
#Access local file system
$ hadoop fs -ls file:///
#The default without protocol is to access HDFS file system
$ hadoop fs -ls /
  • View configuration
$ cd /opt/bigdata/hadoop/server/hadoop-3.3.1/etc/hadoop
$ grep -C5 'fs.defaultFS' core-site.xml

#HDFS protocol is equivalent to no protocol
$ hadoop fs -ls hdfs://hadoop-node1:8082/

[tip] so the default is to access HDFS file system without protocol

  • How to use the old version
$ hdfs dfs -ls /
$ hdfs dfs -ls hdfs://hadoop-node1:8082/


1. Create and delete files

#Check
$ hadoop fs -ls /
#Create directory
$ hadoop fs -mkdir /test20211214
$ hadoop fs -ls /
#Create file
$ hadoop fs -touchz /test20211214/001.txt
$ hadoop fs -ls /test20211214

2. View on the web page, then delete the test files

#Delete file
$ hadoop fs -rm /test20211214/001.txt
#Delete directory
$ hadoop fs -rm -r /test20211214

3. Push files to HDFS

$ touch test001.txt
$ hadoop fs -put test001.txt /
$ hadoop fs -ls /

4. Pull files from HDFS

#Pull test001.txt down and rename it a.txt
$ hadoop fs -get /test001.txt a.txt

2) MapReduce + yarn actual operation

1. Run the MapReduce example shipped with Hadoop to estimate the value of π

$ cd /opt/bigdata/hadoop/server/hadoop-3.3.1/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.3.1.jar pi 2 4
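
For reference, this example estimates π with a (quasi-)Monte Carlo method: sample points are scattered over a unit square and the fraction that falls inside the inscribed quarter circle is counted, so π ≈ 4 × (points inside / total points). The two arguments are the number of map tasks and the number of samples per map, so "pi 2 4" uses only 8 points in total and the printed estimate will be rough; larger values give better precision at the cost of a longer job.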

2. Word count
Create hello.txt with the following contents:

hello hadoop yarn world
hello yarn hadoop
hello world

Create a directory for storing files in HDFS

$ hadoop fs -mkdir -p /wordcount/input
#Upload files to HDFS
$ hadoop fs -put hello.txt /wordcount/input/

Run:

$ cd /opt/bigdata/hadoop/server/hadoop-3.3.1/share/hadoop/mapreduce
$ hadoop jar hadoop-mapreduce-examples-3.3.1.jar wordcount /wordcount/input /wordcount/output
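
When the job finishes, the result is written to /wordcount/output, one part-r-NNNNN file per reducer (part-r-00000 here); it can be viewed with hadoop fs -cat /wordcount/output/part-r-*. For the hello.txt above, the expected counts are: hadoop 2, hello 3, world 2, yarn 2.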

3) Common commands of yarn

Syntax: yarn application [options] # print reports, apply for and kill tasks

-appStates <states>      # Used with -list to filter applications by a comma-separated list of application states. Valid states are: ALL, NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED
-appTypes <types>        # Used with -list to filter applications by a comma-separated list of application types.
-list                    # Lists applications known to the RM. Supports -appTypes to filter by application type and -appStates to filter by application state.
-kill <ApplicationId>    # Kills the application.
-status <ApplicationId>  # Prints the status of the application.

Simple example

#Lists the applications that are running
$ yarn application --list
#List finished applications
$ yarn application -appStates FINISHED --list

For more operation commands, you can view the help by yourself

$ yarn -help

Usage: yarn [OPTIONS] SUBCOMMAND [SUBCOMMAND OPTIONS]
 or    yarn [OPTIONS] CLASSNAME [CLASSNAME OPTIONS]
  where CLASSNAME is a user-provided Java class

  OPTIONS is none or any of:

--buildpaths                       attempt to add class files from build tree
--config dir                       Hadoop config directory
--daemon (start|status|stop)       operate on a daemon
--debug                            turn on shell script debug mode
--help                             usage information
--hostnames list[,of,host,names]   hosts to use in worker mode
--hosts filename                   list of hosts to use in worker mode
--loglevel level                   set the log4j level for this command
--workers                          turn on worker mode

  SUBCOMMAND is one of:


    Admin Commands:

daemonlog            get/set the log level for each daemon
node                 prints node report(s)
rmadmin              admin tools
scmadmin             SharedCacheManager admin tools

    Client Commands:

app|application      prints application(s) report/kill application/manage long running application
applicationattempt   prints applicationattempt(s) report
classpath            prints the class path needed to get the hadoop jar and the required libraries
cluster              prints cluster information
container            prints container(s) report
envvars              display computed Hadoop environment variables
fs2cs                converts Fair Scheduler configuration to Capacity Scheduler (EXPERIMENTAL)
jar             run a jar file
logs                 dump container logs
nodeattributes       node attributes cli client
queue                prints queue information
schedulerconf        Updates scheduler configuration
timelinereader       run the timeline reader server
top                  view cluster information
version              print the version

    Daemon Commands:

nodemanager          run a nodemanager on each worker
proxyserver          run the web app proxy server
registrydns          run the registry DNS server
resourcemanager      run the ResourceManager
router               run the Router daemon
sharedcachemanager   run the SharedCacheManager daemon
timelineserver       run the timeline server

SUBCOMMAND may print help when invoked w/o parameters or with -h.

This is just a simple demo. Enterprise-level cases and hands-on walkthroughs will be shared later, so stay tuned.