Hadoop HDFS (3): HDFS API operation examples and the read / write process

Time: 2021-3-3

1. HDFS API example

1.1 Client API

1) Preparation
I work on Ubuntu with IDEA as the IDE; I have not tried this on Windows or with Eclipse.

The usual routine: create a Maven project and import the dependencies.

    <dependencies>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <dependency>
            <groupId>org.apache.logging.log4j</groupId>
            <artifactId>log4j-slf4j-impl</artifactId>
            <version>2.12.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client-api</artifactId>
            <version>3.1.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client-runtime</artifactId>
            <version>3.1.3</version>
        </dependency>
    </dependencies>

Add the log configuration file log4j2.xml under src/main/resources:

<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="error" strict="true" name="XMLConfig">
    <Appenders>
        <Appender type="Console" name="STDOUT">

            <Layout type="PatternLayout"
                    pattern="[%p] [%d{yyyy-MM-dd HH:mm:ss}][%c{10}]%m%n" />
        </Appender>

    </Appenders>

    <Loggers>
        <Logger name="test" level="info" additivity="false">
            <AppenderRef ref="STDOUT" />
        </Logger>

        <Root level="info">
            <AppenderRef ref="STDOUT" />
        </Root>
    </Loggers>
</Configuration>

2) Create a package cn.leaf under src/main/java, then create an HdfsClient class (the name is up to you). Basic usage of the client is as follows:

public void hdClient() throws IOException, InterruptedException {
        //Get a client object
        URI uri = URI.create("hdfs://hadoop10:9820");
        Configuration conf = new Configuration();
        String user = "v2admin";
        FileSystem fs = FileSystem.get(uri,conf,user);
        //TODO: operations on HDFS go here

        //Close resources
        fs.close();
    }

1.2 Upload file sample code

public static void main(String[] args) throws IOException, InterruptedException {
        //Get a client object
        URI uri = URI.create("hdfs://hadoop10:9820");
        Configuration conf = new Configuration();
        String user = "v2admin";
        FileSystem fs = FileSystem.get(uri,conf,user);
        //Upload a local file to HDFS
        Path upSrc = new Path("/home/zhaow/desktop/shell/passwd.txt");
        Path upDst = new Path("/home");
        upFile(fs, upSrc, upDst);

        //Close resources
        fs.close();
    }

    /**
     * Upload a file
     * @param fs client object
     * @param src local file to be uploaded
     * @param dst destination path on HDFS
     */
    public static void upFile(FileSystem fs, Path src, Path dst){
        try {
            fs.copyFromLocalFile(src, dst);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

1.3 Download file sample code

/**
     * Download a file
     * @param fs client object
     * @param src file to be downloaded, i.e. a path on HDFS
     * @param dst local destination path
     */

    public static void downFile(FileSystem fs, Path src, Path dst){
        try {
            fs.copyToLocalFile(false,src,dst);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
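
A minimal usage sketch of downFile follows; it assumes the same client setup as in section 1.1, and the HDFS source path /home/passwd.txt and local target /tmp are only illustrative.

    URI uri = URI.create("hdfs://hadoop10:9820");
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(uri, conf, "v2admin");
    //Hypothetical paths: copy an HDFS file into the local /tmp directory
    downFile(fs, new Path("/home/passwd.txt"), new Path("/tmp"));
    fs.close();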

2. Read / write process

2.1 File write process

The code above completes a file upload, but how does HDFS actually write the file?
The process is as follows:

(Diagram: HDFS file write process)

The following roles are involved: the client, DistributedFileSystem, FSDataOutputStream, NameNode and DataNode. NameNode and DataNode have been covered before; the two new roles are DistributedFileSystem and FSDataOutputStream.

DistributedFileSystem is the client-side object representing the distributed file system.
FSDataOutputStream is the output stream object used to write the data.

1) When the client writes data to HDFS, the file is stored in blocks across the DataNode nodes, and the storage locations are assigned by the NameNode, so the client must interact with the NameNode before uploading a file.

The client first calls the create() method of DistributedFileSystem, which in turn makes a remote call to the NameNode's create() method.
The NameNode checks whether the file already exists and whether the client has permission. If the checks pass, the NameNode records the operation in the edit log and returns an FSDataOutputStream to the client.

2) The client splits the file into blocks and requests the upload, then obtains from the NameNode the DataNode nodes that will store the data, say D1, D2 and D3.

3) The client writes data by calling write() on the FSDataOutputStream object. The specific process is as follows:
the client requests D1 to upload the data; on receiving the request D1 calls D2, and D2 in turn calls D3, which completes the establishment of the transmission pipeline.

4) The client uploads the first block to D1 (reading the data from disk into a local memory cache first) in units of packets; D1 forwards each packet to D2, and D2 forwards it to D3. Every packet D1 sends is also placed into an acknowledgement queue to wait for the reply.
5) After the whole file has been written, the client calls the close() method of FSDataOutputStream and notifies the NameNode that the file write is complete.
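
To tie these steps back to the client API from section 1, here is a minimal client-side sketch of a write. It assumes the same cluster address hdfs://hadoop10:9820 and user v2admin; the target path /home/demo.txt and the content written are only illustrative. The DataNode pipeline itself is handled internally by the FSDataOutputStream returned from create().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.nio.charset.StandardCharsets;

public class WriteDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop10:9820"),
                new Configuration(), "v2admin");
        //create() triggers the NameNode checks (existence, permission) from step 1)
        FSDataOutputStream out = fs.create(new Path("/home/demo.txt"));
        //write() hands data to the stream, which packages it into packets and
        //pushes them down the DataNode pipeline (steps 3) and 4))
        out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
        //close() flushes the remaining packets and tells the NameNode the write is complete (step 5))
        out.close();
        fs.close();
    }
}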

2.2 File read process

(Diagram: HDFS file read process)

Similarly, the client requests the NameNode to download the file through DistributedFileSystem, and the NameNode looks up its metadata to find the DataNodes that hold the file's blocks.
A block may be stored on several nodes; the client picks the nearest one and sends the read request to it.
That DataNode then starts transferring the data to the client.
The client receives the data packet by packet, caches it locally, and then writes it out to the target file.
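
A matching client-side sketch of a read, under the same assumptions (cluster address and user from section 1; /home/demo.txt and the local /tmp/demo.txt are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.io.FileOutputStream;
import java.net.URI;

public class ReadDemo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://hadoop10:9820"),
                new Configuration(), "v2admin");
        //open() asks the NameNode for the block locations of the file
        FSDataInputStream in = fs.open(new Path("/home/demo.txt"));
        //the stream reads each block from the nearest DataNode holding it;
        //here the data is simply copied into a local file
        FileOutputStream out = new FileOutputStream("/tmp/demo.txt");
        IOUtils.copyBytes(in, out, 4096, true); //the final "true" closes both streams
        fs.close();
    }
}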