A First Look at Hadoop Architecture

Time: 2022-5-14

Get to know Hadoop

Google’s “troika”

  • GFS
  • MapReduce
  • BigTable

HDFS

  • What is HDFS?

    • HDFS is a distributed file system built around streaming data access. It stores massive data sets and lets users build storage clusters out of hundreds of computers.
    • Advantages: it handles very large files, supports streaming data access (write once, read many times), and runs at low cost.
    • Disadvantages: it is not suited to low-latency data access, because it is designed for applications that need high data throughput; it is not suited to large numbers of small files, which waste NameNode memory; and it does not support multiple concurrent writers or arbitrary modification of files.
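The small-files drawback can be made concrete with a rough calculation. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (one object per file plus one per block); that figure and the class name `SmallFilesEstimate` are illustrative assumptions, not measured values.

```java
final class SmallFilesEstimate {
    // Assumption: rule-of-thumb heap cost per file or block object on the NameNode.
    static final long BYTES_PER_OBJECT = 150L;
    // HDFS default block size of 128 MB.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    /** Namespace objects for one file: one file entry plus one entry per block. */
    static long objectsForFile(long fileSizeBytes) {
        long blocks = (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE; // ceiling division
        return 1 + blocks;
    }

    /** Estimated NameNode heap for fileCount files of the same size. */
    static long estimateHeapBytes(long fileCount, long fileSizeBytes) {
        return fileCount * objectsForFile(fileSizeBytes) * BYTES_PER_OBJECT;
    }

    public static void main(String[] args) {
        long oneGiB = 1024L * 1024 * 1024;
        // The same 1 GiB of data, stored as one file or as 1,048,576 files of 1 KiB.
        System.out.println("one large file:   " + estimateHeapBytes(1, oneGiB) + " bytes");
        System.out.println("many small files: " + estimateHeapBytes(1024L * 1024, 1024) + " bytes");
    }
}
```

Under these assumptions the single large file costs on the order of a kilobyte of NameNode memory, while the same data as small files costs hundreds of megabytes.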

  • Composition of HDFS

    • NameNode

      • The NameNode (name node) is the manager of HDFS.
      • Main functions (three):
      • Manage and maintain the HDFS namespace, which is kept in two files: the namespace image file (fsimage) and the operation log file (edits).

        • fsimage: stores the serialized information of all directories and files in the Hadoop file system.
        • edits: records the most recent changes to HDFS. Every write operation performed by an HDFS client is recorded in the edit log.
      • Manage the data blocks on the DataNodes: in HDFS, a file is divided into one or more data blocks, which are stored on DataNodes. The NameNode locates a file's data through two mappings: “file name -> data blocks” and “data block -> DataNodes”.
      • Receive and handle requests from clients.
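The two mappings described above can be pictured as two in-memory maps. This is a toy sketch, not the NameNode's actual data structures; the class `ToyNameNode` and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class ToyNameNode {
    // "file name -> data blocks": ordered list of block IDs per file.
    private final Map<String, List<Long>> fileToBlocks = new HashMap<>();
    // "data block -> DataNodes": which DataNodes hold a replica of each block.
    private final Map<Long, Set<String>> blockToDataNodes = new HashMap<>();

    /** Records that a file gained a new block (a namespace change). */
    void addBlock(String file, long blockId) {
        fileToBlocks.computeIfAbsent(file, f -> new ArrayList<>()).add(blockId);
    }

    /** Records that a DataNode reported holding a replica of a block. */
    void reportBlock(String dataNode, long blockId) {
        blockToDataNodes.computeIfAbsent(blockId, b -> new TreeSet<>()).add(dataNode);
    }

    /** Answers a client read request: each block of the file, in order, with its locations. */
    Map<Long, Set<String>> locate(String file) {
        Map<Long, Set<String>> result = new LinkedHashMap<>();
        for (long blockId : fileToBlocks.getOrDefault(file, Collections.emptyList())) {
            result.put(blockId, blockToDataNodes.getOrDefault(blockId, Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyNameNode nn = new ToyNameNode();
        nn.addBlock("/logs/a.log", 1L);
        nn.addBlock("/logs/a.log", 2L);
        nn.reportBlock("dn1", 1L);
        nn.reportBlock("dn2", 1L);
        nn.reportBlock("dn2", 2L);
        System.out.println(nn.locate("/logs/a.log"));
    }
}
```

A client would then contact the listed DataNodes directly to read each block; the toy NameNode only hands out locations, it never serves file data itself.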
    • DataNode

      • Every disk has a default block size, which is the smallest unit it reads and writes. HDFS uses the same idea at a much larger scale: its default data block size is 128 MB. Blocks are made this large to reduce addressing overhead, so that the time spent seeking to a block is small compared with the time spent transferring its data.
      • Functions:
      • Store data blocks: each data block is saved together with a metadata file that describes the block (in HDFS, chiefly the block's checksums).
      • Run the DataNode process and periodically report to the NameNode which data blocks it holds.
      • Periodically send heartbeats to the NameNode to show that it is still alive.
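The block-size reasoning can be checked with a little arithmetic. The sketch below splits a file into 128 MB blocks and compares seek overhead for large and small blocks. The disk figures (10 ms per seek, 100 MB/s transfer) and the class name `BlockMath` are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

final class BlockMath {
    static final long MIB = 1024L * 1024;

    /** Splits a file into {offset, length} ranges; only the last block may be shorter. */
    static List<long[]> splitIntoBlocks(long fileSize, long blockSize) {
        List<long[]> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            blocks.add(new long[] {offset, Math.min(blockSize, fileSize - offset)});
        }
        return blocks;
    }

    /** Fraction of the time to read one block that is spent seeking instead of transferring. */
    static double seekOverhead(long blockSize, double seekSeconds, double bytesPerSecond) {
        double transferSeconds = blockSize / bytesPerSecond;
        return seekSeconds / (seekSeconds + transferSeconds);
    }

    public static void main(String[] args) {
        for (long[] b : splitIntoBlocks(300 * MIB, 128 * MIB)) {
            System.out.println("offset=" + b[0] / MIB + " MiB, length=" + b[1] / MIB + " MiB");
        }
        // Assumed disk: 10 ms per seek, 100 MB/s sustained transfer.
        System.out.printf("128 MiB block overhead: %.4f%n", seekOverhead(128 * MIB, 0.010, 100e6));
        System.out.printf("4 KiB block overhead:   %.4f%n", seekOverhead(4096, 0.010, 100e6));
    }
}
```

With these assumed numbers, seeking is under one percent of the read time for a 128 MiB block but dominates almost entirely for a 4 KiB block.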
    • SecondaryNameNode

      • The SecondaryNameNode is the second name node. Its main responsibility is to periodically download the NameNode's fsimage and edits files, load them into memory and merge them, and then upload the merged fsimage back to the NameNode. This process is called a checkpoint.
      • Regular checkpoints keep the edits file within a size limit and shorten the time the NameNode spends merging fsimage and edits when it restarts.
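The checkpoint described above amounts to replaying the edit log on top of the last image. A toy model, assuming the namespace is just a set of paths and the log holds only create and delete operations (the class `ToyCheckpoint` and its types are hypothetical, far simpler than the real file formats):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

final class ToyCheckpoint {
    enum Op { CREATE, DELETE }

    /** One logged namespace change, standing in for an entry in the edits file. */
    static final class Edit {
        final Op op;
        final String path;

        Edit(Op op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    /** Replays the edits, in order, on a copy of the old image and returns the new image. */
    static SortedSet<String> checkpoint(SortedSet<String> fsimage, List<Edit> edits) {
        SortedSet<String> merged = new TreeSet<>(fsimage);
        for (Edit e : edits) {
            if (e.op == Op.CREATE) {
                merged.add(e.path);
            } else {
                merged.remove(e.path);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        SortedSet<String> fsimage = new TreeSet<>();
        fsimage.add("/a");
        fsimage.add("/a/old");
        List<Edit> edits = new ArrayList<>();
        edits.add(new Edit(Op.CREATE, "/a/new"));
        edits.add(new Edit(Op.DELETE, "/a/old"));
        // After the merge, the edits can be discarded and the new image uploaded.
        System.out.println(checkpoint(fsimage, edits));
    }
}
```

The longer the edit list grows between checkpoints, the more work this replay takes, which is why merging regularly keeps NameNode restarts short.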

  • HDFS Shell

    • HDFS shell commands operate on the file system and closely resemble Linux shell commands.
    • For example: hdfs dfs -ls lists files or directories.
  • HDFS API

    • Hadoop also provides several HDFS access interfaces, including a Java API, so you can operate on the file system from code. The usual steps are:
    1. Instantiate the Configuration class
    2. Obtain an instance of the FileSystem class
    3. Set the path of the target object
    4. Perform the file or directory operations
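The four steps above might look like the following minimal sketch. It needs the Hadoop client libraries on the classpath. The directory `/tmp/hdfs-api-demo`, the commented-out NameNode address, and the class name `HdfsApiSketch` are placeholders, not values from this article. With no cluster configuration present, `FileSystem.get` falls back to the local file system.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final class HdfsApiSketch {
    /** Creates the directory if needed and reports whether it exists afterwards. */
    static boolean ensureDir(FileSystem fs, String dir) throws IOException {
        Path path = new Path(dir);   // step 3: set the path of the target object
        fs.mkdirs(path);             // step 4: perform the directory operation
        return fs.exists(path);
    }

    public static void main(String[] args) throws IOException {
        // Step 1: Configuration loads core-site.xml and friends from the classpath, if present.
        Configuration conf = new Configuration();
        // Hypothetical cluster address; uncomment and adjust to point at a real NameNode.
        // conf.set("fs.defaultFS", "hdfs://namenode-host:8020");

        // Step 2: obtain a FileSystem for the configured default file system.
        try (FileSystem fs = FileSystem.get(conf)) {
            String dir = "/tmp/hdfs-api-demo"; // placeholder path
            System.out.println("exists: " + ensureDir(fs, dir));
            for (FileStatus status : fs.listStatus(new Path(dir).getParent())) {
                System.out.println(status.getPath());
            }
        }
    }
}
```

The listing loop does from code roughly what hdfs dfs -ls does from the shell.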
  • High availability (HA)
  • Federation