Distributed file storage HDFS

Storage model: bytes

  • A file is divided into blocks by byte count. The default block size is 128 MB and the minimum is 1 MB (configurable). Each block has an offset. Blocks are stored across cluster nodes (horizontal scaling).
  • All blocks of the same file are the same size; different files may use different block sizes.
  • Each block has 3 replicas by default (configurable) (vertical scaling).

    • Replica placement policy:

      • First replica: placed on the DataNode of the node uploading the file
      • Second replica: placed on a node in a different rack from the first replica
      • Third replica: placed on another node in the same rack as the second replica
      • Further replicas: random nodes
  • Write once, read many: block contents cannot be modified
  • Data can be appended; there is only one writer at a time
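The byte-based splitting above can be sketched as a toy illustration (not Hadoop code; `BLOCK_SIZE` mirrors the 128 MB default):

```python
BLOCK_SIZE = 128 * 1024 * 1024  # default HDFS block size, 128 MB

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs for each block of the file."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((offset, length))
        offset += length
    return blocks

# A 300 MB file yields two full 128 MB blocks plus a 44 MB tail block.
print(split_into_blocks(300 * 1024 * 1024))
```

Note that only the last block can be smaller than the block size, which is why all blocks of one file are "the same size".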

Architecture model 1.x: master-slave



NameNode:

  • Manages the file metadata: block sizes, offsets, etc.
  • Receives client read / write requests
  • Memory-based storage (metadata is kept in memory, never swapped to disk)
  • Collects the block lists reported by DataNodes


DataNode:

  • Maintains the metadata of the blocks stored on this node (checksums such as MD5 ensure data integrity)
  • Keeps a heartbeat with the NameNode and reports its block information to it

Namenode persistence:

  • Fsimage (point-in-time snapshot): the file on disk where the metadata is persisted
  • EditsLog: records operations performed on the metadata
  • When certain conditions (checkpoints) are met, the fsimage and edits log are merged into a new fsimage
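The checkpoint can be pictured as replaying the edits log on top of the last fsimage snapshot. This is a toy illustration; the dict/tuple shapes are invented for the sketch, and real fsimage and edits files are binary:

```python
def checkpoint(fsimage, edits_log):
    """fsimage: path -> metadata dict; edits_log: (op, path, meta) tuples."""
    new_image = dict(fsimage)
    for op, path, meta in edits_log:
        if op == "create":
            new_image[path] = meta       # replay a file creation
        elif op == "delete":
            new_image.pop(path, None)    # replay a file deletion
    return new_image  # the new fsimage; the old edits log can now be cleared

image = {"/a.txt": {"blocks": 2}}
edits = [("create", "/b.txt", {"blocks": 1}), ("delete", "/a.txt", None)]
print(checkpoint(image, edits))  # {'/b.txt': {'blocks': 1}}
```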


SecondaryNameNode (SNN):

  • It is not a backup of the NameNode (though it can act as one); its main job is to help the NN merge the edits log and fsimage
  • SNN merge process


HDFS benefits:

  • High fault tolerance

    • Automatically keeps multiple replicas of the data
    • Automatically recovers after a replica is lost (copied from a healthy node)
  • Suitable for batch processing

    • Computation moves to the data
    • Data locations (offsets) are exposed to the computing framework
  • Suitable for big data processing

    • GB, TB, even PB of data; millions of files
  • Can be built on cheap machines

HDFS write process:

  • The client asks the NameNode to create the file; the NameNode returns a pipeline of DataNodes for each block
  • The client streams each block to the first DataNode, which forwards it along the pipeline to the remaining replicas; acks flow back up the pipeline
  • When all blocks are written, the client asks the NameNode to close the file

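The replication pipeline of the write process can be sketched as a toy (not HDFS code): each DataNode stores the block, forwards it downstream, and acks flow back upstream.

```python
from collections import defaultdict

storage = defaultdict(list)  # DataNode name -> blocks it holds (toy model)

def pipeline_write(block, pipeline):
    """Store `block` on every DataNode in `pipeline`; return the ack order."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    storage[head].append(block)         # store the block locally
    acks = pipeline_write(block, rest)  # forward to the next DataNode
    return acks + [head]                # ack travels back up the pipeline

acks = pipeline_write(b"block-0", ["dn1", "dn2", "dn3"])
print(acks)  # ['dn3', 'dn2', 'dn1'] -- the last node in the chain acks first
```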
HDFS read process:

  • The client asks the NameNode for the file's block locations (the DataNodes holding each block)
  • The client reads each block from the nearest replica, in offset order, and concatenates the blocks


Background of Hadoop 2.x

  • HDFS and MapReduce in Hadoop 1.x have availability and scalability problems
  • HDFS problems

    • NameNode single point of failure: solved with active / standby NameNodes (HA)
    • NameNode under too much pressure: solved with Federation
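The active/standby HA setup is configured in hdfs-site.xml roughly as below; the nameservice and host names (`mycluster`, `nn1`, `nn2`, `master1`, `master2`) are placeholders, and a full setup also needs journal nodes and failover settings not shown here:

```xml
<!-- Logical nameservice that clients address instead of a single NameNode -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<!-- The two NameNodes (2.x HA supports exactly one active + one standby) -->
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>master1:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>master2:8020</value>
</property>
```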


Architecture model 2.x

  • Hadoop 2.x consists of three branches: HDFS, MapReduce and YARN

    • HDFS 2.x: supports only 2-node HA; 3.x supports one active and multiple standbys


Distributed computing MapReduce

MR primitives: map + reduce


  • What determines the number of maps?

    • Maps correspond one-to-one with splits
    • A block is the physical unit into which the file is actually cut
    • A split is a logical slice defined on top of blocks; it is derived from the block
    • The number of maps in a job equals the number of splits
  • After reading its data, the map emits an intermediate set of (K, V) pairs, generating a partition number for each pair at the same time
  • What determines the number of reduces?

    • In general, one key corresponds to one reduce
    • Several keys may map to one reduce, but one key can never be split across several reduces (all records with the same key must go to the same reduce)
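The rule above follows from the partition being a deterministic function of the key, so all records with the same key reach the same reduce. Hadoop's default HashPartitioner uses `key.hashCode()` mod the number of reduces; the byte-sum hash below is a stable stand-in for the toy:

```python
def partition(key, num_reduces):
    return sum(key.encode()) % num_reduces  # toy stand-in for hashCode() % n

records = [("cat", 1), ("dog", 1), ("cat", 2), ("emu", 1)]
by_reduce = {}
for k, v in records:
    by_reduce.setdefault(partition(k, 2), []).append((k, v))

# Both ("cat", ...) records land in the same partition; several keys may
# share one reduce, but "cat" is never split across two reduces.
print(by_reduce)
```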


MR shuffle process:

  • The split is read and handed to a map (by default, split size == block size, 128 MB)
  • After the map emits its (K, V) intermediate pairs, the data is first written to an in-memory buffer

    • First sort: records in the buffer are sorted by partition (there are as many partitions as reduces), so data of the same partition sits together
    • Second sort: within each partition, records are sorted by key, so records with the same key sit together (one reduce may handle multiple keys)
    • Combiner (optional): a partial aggregation performed on the map side before sending to reduce, reducing the data volume
  • When the buffer fills, its contents are spilled to a small file on disk; by the time the map has processed all its data, a pile of small spill files has formed

    • Third sort: the many small spill files are merge-sorted into one large file
    • Fourth sort: the large files produced by the many maps are merge-sorted into even larger files
    • Finally, the merged larger files are handed to reduce for processing
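The buffer / spill / merge steps above can be sketched as a toy (the tiny buffer size and byte-sum partitioner are invented for the demo; Hadoop's buffer defaults to 100 MB):

```python
import heapq

BUFFER_LIMIT = 3  # tiny so the example actually spills

def toy_partition(key, num_reduces):
    return sum(key.encode()) % num_reduces  # toy stand-in partitioner

def map_side_shuffle(records, num_reduces):
    spills, buffer = [], []
    for key, value in records:
        buffer.append((toy_partition(key, num_reduces), key, value))
        if len(buffer) == BUFFER_LIMIT:
            spills.append(sorted(buffer))  # sorts 1 and 2: partition, then key
            buffer = []
    if buffer:
        spills.append(sorted(buffer))      # spill whatever remains at the end
    return list(heapq.merge(*spills))      # sort 3: merge-sort the spill files

out = map_side_shuffle([("b", 1), ("a", 1), ("c", 1), ("a", 1)], 2)
print(out)  # [(0, 'b', 1), (1, 'a', 1), (1, 'a', 1), (1, 'c', 1)]
```

The fourth sort is the same `heapq.merge` idea applied across the large files of many maps on the reduce side.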


Distributed resource management YARN

MRv1 roles

  • JobTracker

    • The core: master, a single point
    • Schedules all jobs
    • Monitors the resource load of the entire cluster
  • TaskTracker

    • Manages the resources of its own node
    • Keeps a heartbeat with the JobTracker, reporting its resources and actively fetching tasks
  • Client

    • The unit of job submission
    • Plans the job's computation distribution
    • Submits the job's resources to HDFS
    • Finally submits the job to the JobTracker


MRv1 problems:

  • JobTracker: overloaded, single point of failure
  • Resource management and compute scheduling are strongly coupled, so other computing frameworks must re-implement resource management
  • Different frameworks cannot manage resources globally



YARN workflow:

  • The client submits the task to the ResourceManager (the full controller of resources; a long-lived service)
  • The ResourceManager finds a server with free resources and starts a process there: the ApplicationMaster (the task scheduler; a short-lived service)
  • The ApplicationMaster does not know the cluster's resources, so it applies to the ResourceManager for task resources
  • The ApplicationMaster then contacts the corresponding nodes to create containers
  • NodeManagers (long-lived services) report container information and node resources to the ResourceManager


  • YARN: decouples resource management from computing

    • ResourceManager

      • Master, the core
      • Manages resources across cluster nodes
    • NodeManager

      • Reports its resources to the ResourceManager
      • Manages container lifecycles
  • MR

    • MR-ApplicationMaster-Container

      • Each job's ApplicationMaster runs on some node, so jobs are spread across nodes and there is no single point of failure
      • To create tasks, the ApplicationMaster must apply to the ResourceManager for resources
    • Task-Container

YARN: Yet Another Resource Negotiator

  • Core idea: split the JobTracker's resource management and task scheduling apart, implemented by the ResourceManager and the ApplicationMaster processes respectively.
  • ResourceManager: responsible for resource management of the whole cluster
  • ApplicationMaster: responsible for the application's task scheduling and task monitoring
  • The introduction of YARN lets multiple computing frameworks run in one cluster

    • MapReduce, Spark, Storm
    • Each application gets its own ApplicationMaster