1、 HDFS basic overview
1. HDFS description
Big data work revolves around two core modules: data storage and data computing. HDFS (Hadoop Distributed File System) is the most important big data storage technology, offering high fault tolerance, stability and reliability. It is a distributed file system that stores files and locates them through a directory tree. It was designed to manage hundreds of servers and disks so that applications can store large-scale file data as if they were using an ordinary file system. HDFS suits write-once, read-many scenarios such as data analysis; it does not support modifying existing file content in place, although content can be appended.
2. Basic architecture
HDFS has a master/slave architecture with two core components, the NameNode and the DataNode; a client and a Secondary NameNode work alongside them.
NameNode: manages the file system metadata, i.e. file path names, data block IDs, storage locations and similar information; it also configures the replica policy and handles client read and write requests.
DataNode: performs the actual storage, reading and writing of file data. Each DataNode stores part of the file data blocks, so files are distributed across the whole HDFS server cluster.
Client: when uploading a file to HDFS, the client splits the file into blocks and then uploads them. It obtains file location information from the NameNode, communicates with DataNodes to read or write data, and accesses or manages HDFS through commands.
Secondary NameNode: not a hot standby of the NameNode. It takes over part of the NameNode's workload, such as periodically merging fsimage and edits and pushing the result to the NameNode; in an emergency it can help recover the NameNode.
3. High fault tolerance
The schematic diagram of multi-replica block storage illustrates this with two files. The file /users/sameerp/data/part-0 has a replication factor of 2 and is stored as blocks 1 and 3; the file /users/sameerp/data/part-1 has a replication factor of 3 and is stored as blocks 2, 4 and 5. If any single server goes down, at least one replica of every data block still exists, so file access is unaffected and overall fault tolerance improves.
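The single-failure claim can be checked with a small sketch. Only the block IDs and replication factors come from the example; the placement of replicas on the DataNodes dn1 to dn4 is hypothetical, as is the function name. The function reports the smallest number of replicas that any block keeps after one node fails.

```shell
# Hypothetical placement table, one line per block: "block_id:node,node,..."
# Blocks 1 and 3 (part-0) have 2 replicas; blocks 2, 4 and 5 (part-1) have 3.
PLACEMENT="1:dn1,dn3
3:dn2,dn4
2:dn1,dn2,dn3
4:dn2,dn3,dn4
5:dn1,dn3,dn4"

# min_replicas_after_failure NODE
# Prints the lowest surviving replica count over all blocks when NODE is down.
min_replicas_after_failure() {
  failed="$1"
  printf '%s\n' "$PLACEMENT" | awk -F: -v failed="$failed" '
    {
      n = split($2, nodes, ",")          # replicas of this block
      alive = 0
      for (i = 1; i <= n; i++)
        if (nodes[i] != failed) alive++  # count replicas on healthy nodes
      if (NR == 1 || alive < min) min = alive
    }
    END { print min }'
}
```

Under this placement, calling min_replicas_after_failure with dn1, dn2, dn3 or dn4 prints 1 in every case, so each block stays readable after any single node failure.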
Files in HDFS are physically stored as blocks. The block size is configured through the parameter dfs.blocksize (128 MB by default in Hadoop 2.x). If blocks are too small, a file is split into many blocks and the time spent locating them (addressing) increases; if blocks are too large, transferring one block from disk takes far longer than locating it, so processing each block becomes slow. The choice of HDFS block size therefore depends mainly on the disk transfer rate.
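As a worked example of how the block size determines a file's layout, the sketch below computes how many blocks a file occupies and how large its final block is. The function names are illustrative, and the fallback of 134217728 bytes (128 MB) assumes the Hadoop 2.x default for dfs.blocksize; pass a second argument if the cluster uses another value.

```shell
# block_count FILE_BYTES [BLOCK_BYTES]
# Number of blocks needed for a file: ceil(FILE_BYTES / BLOCK_BYTES).
block_count() {
  file_bytes="$1"
  block_bytes="${2:-134217728}"   # assumed default: 128 MB
  echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

# last_block_bytes FILE_BYTES [BLOCK_BYTES]
# Size of the final block; equals BLOCK_BYTES when the file fills it exactly.
last_block_bytes() {
  file_bytes="$1"
  block_bytes="${2:-134217728}"
  rem=$(( file_bytes % block_bytes ))
  if [ "$rem" -eq 0 ] && [ "$file_bytes" -gt 0 ]; then
    echo "$block_bytes"
  else
    echo "$rem"
  fi
}
```

For a 300 MB file (314572800 bytes) with 128 MB blocks, block_count returns 3 and last_block_bytes returns 46137344 (44 MB): two full blocks plus one partial block.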
2、 Basic shell commands
1. Basic commands
View the shell operation commands available in Hadoop.
##Run from the Hadoop installation directory (hadoop2.7)
bin/hadoop fs
bin/hdfs dfs
dfs is the HDFS-specific implementation of fs: hadoop fs works with any file system that Hadoop supports, while hdfs dfs operates on HDFS only.
2. View command description
hadoop fs -help ls
3. Create directory recursively
hadoop fs -mkdir -p /hopdir/myfile
4. View directory
hadoop fs -ls /
hadoop fs -ls /hopdir
5. Cut (move) a local file to HDFS
hadoop fs -moveFromLocal /opt/hopfile/java.txt /hopdir/myfile
##View file
hadoop fs -ls /hopdir/myfile
6. View file contents
##View all
hadoop fs -cat /hopdir/myfile/java.txt
##View end
hadoop fs -tail /hopdir/myfile/java.txt
7. Append file content
hadoop fs -appendToFile /opt/hopfile/c++.txt /hopdir/myfile/java.txt
8. Copy a local file to HDFS
The -copyFromLocal command behaves the same as the -put command.
hadoop fs -copyFromLocal /opt/hopfile/c++.txt /hopdir
9. Copy HDFS files to local
hadoop fs -copyToLocal /hopdir/myfile/java.txt /opt/hopfile/
10. Copy files in HDFS
hadoop fs -cp /hopdir/myfile/java.txt /hopdir
11. Move files within HDFS
hadoop fs -mv /hopdir/c++.txt /hopdir/myfile
12. Merge and download multiple files
The basic command -get has the same effect as -copyToLocal; -getmerge additionally concatenates multiple HDFS files into a single local file.
hadoop fs -getmerge /hopdir/myfile/* /opt/merge.txt
13. Delete file
hadoop fs -rm /hopdir/myfile/java.txt
14. View folder size information
hadoop fs -du -s -h /hopdir/myfile
15. Delete folder
bin/hdfs dfs -rm -r /hopdir/file0703
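The individual commands above can be combined into a small script. The sketch below is illustrative only: hdfs_upload is a hypothetical helper name, and it assumes that hadoop is on the PATH. It uses -mkdir -p and -put as described above, plus -test -e, which exits with status 0 when the path exists.

```shell
# hdfs_upload LOCAL_FILE HDFS_DIR
# Creates HDFS_DIR if needed, uploads LOCAL_FILE into it, then checks
# that the uploaded file exists. Returns non-zero if any step fails.
hdfs_upload() {
  local_file="$1"   # e.g. /opt/hopfile/java.txt
  hdfs_dir="$2"     # e.g. /hopdir/myfile
  hadoop fs -mkdir -p "$hdfs_dir" || return 1
  # -put fails if the target file already exists in HDFS
  hadoop fs -put "$local_file" "$hdfs_dir" || return 1
  hadoop fs -test -e "$hdfs_dir/$(basename "$local_file")"
}
```

A call such as hdfs_upload /opt/hopfile/java.txt /hopdir/myfile would run the three steps in order and stop at the first one that fails.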
3、 Source code address
GitHub address: https://github.com/cicadasmile/big-data-parent
Gitee address: https://gitee.com/cicadasmile/big-data-parent