YARN > ResourceManager



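The ResourceManager (RM) provides unified management and scheduling of cluster resources. It interacts with three roles:

  1. NodeManager (management): receives the resource reports sent by each node
  2. ApplicationMaster (allocation): allocates resources to applications
  3. Client (response): processes client requests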

Communication

The RM communicates with the three roles over three RPC protocols.
1. Communication with NodeManager (ResourceTracker)

  • The NM registers, sends heartbeats (reporting node health status) and reports container running status
  • The NM receives execution instructions in return (start / clean / delete container)

2. Communication with ApplicationMaster (ApplicationMasterProtocol)

  • Registration, heartbeat
  • Requesting / releasing resources

3. Communication with client (ApplicationClientProtocol)

  • Submit / query / control applications

Module introduction

Seven modules

1. User interaction module

  • ClientRMService: handles requests from ordinary users (submit or terminate an application, query application status, etc.)
  • AdminService: handles requests from administrators (update the node / ACL lists, update queue information, etc.)
    It is kept separate so that a large number of ordinary user requests cannot starve administrative commands.
  • WebApp: displays the usage of cluster resources and the status of applications through web pages

2. NM management module

  • NMLivelinessMonitor: monitors NM status
    If an NM does not report a heartbeat within the expiry interval (10 minutes by default), it is removed from the cluster.
  • NodesListManager: maintains the lists of normal and abnormal nodes
    Both lists are set in configuration files and can be reloaded dynamically.
  • ResourceTrackerService: processes NM requests

3. AM management module

  • AMLivelinessMonitor: monitors AM status
    If an AM does not report a heartbeat within the expiry interval (10 minutes by default), it is considered dead and the containers it was running are set to the failed state. The AM is then restarted on another node (the number of attempts is user-configurable, 2 by default).
  • ApplicationMasterLauncher: communicates with the NM and issues the command to start the ApplicationMaster
  • ApplicationMasterService (AMS): processes requests from the AM
    Registration: reports the node the ApplicationMaster started on, its external RPC port and its tracking URL
    Heartbeat: reports the description of required resources, the list of containers to release, the blacklist, etc.
    The response carries the newly allocated containers, the failed containers and the list of containers to be preempted.

4. Application management module

  • ApplicationACLsManager: manages application access permissions
    View permission: view the basic information of an application
    Modify permission: change application priority, kill the application, etc.
  • RMAppManager: manages the startup and shutdown of applications
  • ContainerAllocationExpirer: decides whether an allocated container must be reclaimed
    After an AM receives a newly allocated container from the RM, it must start the container on the corresponding NM within a certain period (10 minutes by default); otherwise the RM forcibly reclaims the container.

5. State machine module
State machines make the design clearer.

  • RMApp: maintains the life cycle of an application
  • RMAppAttempt: maintains the life cycle of one run attempt of an application
  • RMContainer: maintains the life cycle of a container
    Containers currently cannot be reused; this may be supported later.
  • RMNode: maintains the life cycle of a NodeManager
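
To make the state machine idea concrete, here is a minimal sketch of a table-driven state machine in Python. It is an illustration only, not the Hadoop implementation: the state set is a simplified subset of the application states, and the event names and the class `AppStateMachine` are hypothetical.

```python
from enum import Enum, auto


class AppState(Enum):
    # Simplified subset of application states, for illustration only.
    NEW = auto()
    SUBMITTED = auto()
    ACCEPTED = auto()
    RUNNING = auto()
    FINISHED = auto()
    FAILED = auto()


# (current state, event name) -> next state. Event names are hypothetical.
TRANSITIONS = {
    (AppState.NEW, "start"): AppState.SUBMITTED,
    (AppState.SUBMITTED, "accept"): AppState.ACCEPTED,
    (AppState.ACCEPTED, "attempt_registered"): AppState.RUNNING,
    (AppState.RUNNING, "attempt_finished"): AppState.FINISHED,
    (AppState.RUNNING, "attempt_failed"): AppState.FAILED,
}


class AppStateMachine:
    """Tracks one application's life cycle and rejects invalid transitions."""

    def __init__(self):
        self.state = AppState.NEW

    def handle(self, event):
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"invalid event {event!r} in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state
```

Keeping every legal transition in one table is what makes the design easier to read: anything not listed is rejected instead of being handled by scattered conditionals.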

6. Security module
It consists of the following submodules

  • ClientToAMSecretManager
  • ContainerTokenSecretManager
  • ApplicationTokenSecretManager

7. Resource allocation module
ResourceScheduler: responsible for allocating resources to applications

  • Batch scheduler: FIFO
  • Multi-user schedulers: Fair Scheduler and Capacity Scheduler

Module details

Seven modules

1. User interaction module

ClientRMService and AdminService handle the requests of ordinary users and administrators respectively.

ClientRMService is essentially an RPC server that serves clients.

  • ClientRMService holds references to the RM context object (RMContext) and to the central asynchronous dispatcher.
    RMContext is used to obtain the node list, queue organization and application list needed to respond to client requests.

AdminService is also essentially an RPC server, but it serves administrators.
The administrator list is set by yarn.admin.acl. The default value is *, which means all users are administrators.

2. Nm management module

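It consists of the following three components.

NMLivelinessMonitor periodically traverses all NMs. If an NM has not reported a heartbeat within the expiry interval, it is considered dead and all containers on it are considered failed.
Expiry interval (default 10 minutes): yarn.nm.liveness-monitor.expiry-interval-ms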
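
The expiry check can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the Hadoop code: the class `LivelinessMonitor` and its method names are hypothetical, and timestamps are passed in explicitly so the logic is easy to follow and test.

```python
class LivelinessMonitor:
    """Tracks the last heartbeat of each node and reports expired ones."""

    def __init__(self, expiry_interval_ms=10 * 60 * 1000):
        # 10 minutes, matching the default mentioned in the text.
        self.expiry_interval_ms = expiry_interval_ms
        self.last_heartbeat_ms = {}

    def received_ping(self, node_id, now_ms):
        # Called on registration and on every heartbeat.
        self.last_heartbeat_ms[node_id] = now_ms

    def check(self, now_ms):
        # Collect nodes whose last heartbeat is older than the interval,
        # stop tracking them, and return them so the caller can mark
        # their containers as failed.
        expired = [
            node_id
            for node_id, seen_ms in self.last_heartbeat_ms.items()
            if now_ms - seen_ms > self.expiry_interval_ms
        ]
        for node_id in expired:
            del self.last_heartbeat_ms[node_id]
        return sorted(expired)
```

In a real service `check` would run periodically on a background thread; here the caller decides when to run it.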

NodesListManager manages the nodes known to the RM.

Whitelist file: yarn.resourcemanager.nodes.include-path
Blacklist file: yarn.resourcemanager.nodes.exclude-path
Run the following command to make the configuration take effect: bin/yarn rmadmin -refreshNodes
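
How the two lists might be combined is sketched below. The rule used here is an assumption for illustration: an empty whitelist admits every host, and the blacklist always wins. The helper names `read_hosts` and `is_valid_node` and the file format (whitespace-separated hosts, `#` comments) are also assumptions.

```python
def read_hosts(text):
    """Parse the body of a hosts file: hosts separated by whitespace,
    with '#' starting a comment (assumed format)."""
    hosts = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            hosts.update(line.split())
    return hosts


def is_valid_node(host, include, exclude):
    # Assumed rule: an empty include list means "allow all";
    # a host on the exclude list is always rejected.
    return (not include or host in include) and host not in exclude
```

For example, with `include = {"nm1", "nm2"}` and `exclude = {"nm2"}`, only `nm1` is accepted.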

ResourceTrackerService is essentially an RPC server that processes NM requests (through the ResourceTracker protocol).

  • Registration (once)
    The NM sends this request when it starts, carrying the node ID, the upper limit of available resources and the open HTTP port.
  • Heartbeat (periodic)
    The request contains the list of running applications, the node health status and the container running status.
    The response returns the lists of containers and applications to be released.

3. Am management module

It consists of the following three components.
ApplicationMasterLauncher: responsible for starting the AM
ApplicationMasterService: responsible for communication with the AM
AMLivelinessMonitor: responsible for monitoring the life cycle of the AM

ApplicationMasterLauncher is both a service and an event handler; it responds to AMLauncherEvent events (launch / clean up AM).

  • Launch AM
    It communicates with the NM through ContainerManagementProtocol: the information needed to start the AM (start command, JAR packages, environment variables, etc.) is wrapped in a StartContainerRequest object and sent to the NM.
  • Clean up AM
    It communicates with the NM through ContainerManagementProtocol and asks it to kill the AM.

ApplicationMasterService processes AM requests (via ApplicationMasterProtocol).

  • Registration (once)
    The AM sends this request when it starts, carrying its node, RPC port, tracking URL and other information.
  • Heartbeat (periodic)
    The request contains the description of the requested resources, the list of containers to release, etc.
    AMS returns information such as the newly allocated containers and the failed containers.
  • Cleanup (once)
    The AM sends a cleanup request to the RM to reclaim / clean up its resources.
    The RM reclaims the containers occupied by the AM and removes the AM from AMLivelinessMonitor.
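
The register / heartbeat / cleanup exchange can be sketched with a toy in-memory service. Everything here is a simplification for illustration: `ToyAMService`, `AllocateRequest` and `AllocateResponse` are hypothetical names, a resource request is reduced to a container count, and failed containers are left out of the response.

```python
from dataclasses import dataclass, field


@dataclass
class AllocateRequest:
    asks: int = 0                                  # containers wanted (simplified)
    releases: list = field(default_factory=list)   # container ids to give back


@dataclass
class AllocateResponse:
    allocated: list = field(default_factory=list)  # newly granted container ids


class ToyAMService:
    def __init__(self, free_containers):
        self.free = list(free_containers)
        self.registered = {}  # app_id -> (host, rpc_port, tracking_url)
        self.held = {}        # app_id -> set of container ids in use

    def register(self, app_id, host, rpc_port, tracking_url):
        # Sent once when the AM starts.
        self.registered[app_id] = (host, rpc_port, tracking_url)
        self.held[app_id] = set()

    def allocate(self, app_id, request):
        # Periodic heartbeat: process releases first, then grant new containers.
        if app_id not in self.registered:
            raise KeyError(f"{app_id} is not registered")
        held = self.held[app_id]
        for cid in request.releases:
            if cid in held:
                held.remove(cid)
                self.free.append(cid)
        granted = []
        while self.free and len(granted) < request.asks:
            granted.append(self.free.pop(0))
        held.update(granted)
        return AllocateResponse(allocated=granted)

    def finish(self, app_id):
        # Sent once at the end: reclaim everything the AM still holds.
        self.free.extend(sorted(self.held.pop(app_id)))
        del self.registered[app_id]
```

A typical sequence is one `register`, many `allocate` calls, then one `finish`, after which all containers are back in the free pool.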

AMLivelinessMonitor periodically traverses all AMs. If an AM does not send heartbeats within the expiry interval, it is considered dead and all of its containers are set to failed.

The RM then reallocates resources for the AM and starts it on another node.

Expiry interval (default 10 minutes): yarn.am.liveness-monitor.expiry-interval-ms
Maximum number of AM attempts (2 by default): yarn.resourcemanager.am.max-attempts
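
The attempt limit can be illustrated with a short sketch. `run_with_attempts` and the `launch_attempt` callback are hypothetical; the callback stands in for "start the AM on some node and wait for the outcome" and returns True on success.

```python
def run_with_attempts(launch_attempt, max_attempts=2):
    """Try to run the AM up to max_attempts times.

    Returns the number of the attempt that succeeded, or None if every
    attempt failed and the application should be marked as failed.
    """
    for attempt in range(1, max_attempts + 1):
        if launch_attempt(attempt):
            return attempt
    return None
```

With the default of 2, an AM whose first attempt dies gets exactly one more chance before the application fails.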

4. Application management module

This module manages application life cycle, permissions, etc.

ApplicationACLsManager manages the view / modify permissions of applications.
Use this parameter to configure permissions: yarn.admin.acl

RMAppManager is responsible for application startup and shutdown.

  • Puts the application into the application list
  • Removes the application from RMStateStore

The maximum number of completed applications kept is set through this parameter: yarn.resourcemanager.max-completed-applications
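
A bounded list of completed applications could behave like the sketch below. The eviction order (oldest first) is an assumption made for illustration, and `CompletedApps` is a hypothetical class, not part of Hadoop.

```python
from collections import OrderedDict


class CompletedApps:
    """Keeps at most max_completed finished applications in memory."""

    def __init__(self, max_completed):
        self.max_completed = max_completed
        self.apps = OrderedDict()  # app_id -> final state, oldest first

    def add(self, app_id, final_state):
        self.apps[app_id] = final_state
        evicted = []
        # Assumed policy: drop the oldest entries once the limit is exceeded.
        # The caller would also remove the evicted ids from the state store.
        while len(self.apps) > self.max_completed:
            old_id, _ = self.apps.popitem(last=False)
            evicted.append(old_id)
        return evicted
```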

ContainerAllocationExpirer manages container usage.
If an AM does not use a container within a period of time after receiving it, the container is forcibly reclaimed (to improve utilization).

Waiting time: yarn.resourcemanager.rm.container-allocation.expiry-interval-ms


1. Event driven
A central asynchronous dispatcher ties the components / services together. Each component / service emits events as its output and interacts with the others only through events, which makes the system asynchronous, parallel and efficient.
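
The event-driven pattern can be sketched with a queue and one worker thread. This is a minimal illustration of the idea, not the Hadoop dispatcher: `ToyDispatcher`, its methods and the event type strings are all hypothetical.

```python
import queue
import threading


class ToyDispatcher:
    """Components post events to a queue; one thread delivers them to handlers."""

    def __init__(self):
        self._queue = queue.Queue()
        self._handlers = {}  # event type -> list of handler callables
        self._thread = threading.Thread(target=self._run, daemon=True)

    def register(self, event_type, handler):
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event_type, payload):
        # Returns immediately; the handler runs later on the dispatcher thread.
        self._queue.put((event_type, payload))

    def start(self):
        self._thread.start()

    def stop(self):
        # A None sentinel tells the worker to exit after draining earlier events.
        self._queue.put(None)
        self._thread.join()

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:
                break
            event_type, payload = item
            for handler in self._handlers.get(event_type, []):
                handler(payload)
```

Because `dispatch` only enqueues, the sender never blocks on the receiver, which is the decoupling the paragraph above describes.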