Slurm is an open-source distributed resource management system similar to Sun Grid Engine (SGE). It is used on supercomputers and large-scale clusters of compute nodes, and it is highly scalable and fault-tolerant. After Sun was sold to Oracle, the easy-to-use SGE became Oracle Grid Engine, which turned into commercial software as of release 6.2u6 (it can be used free of charge for 90 days), so we have to look for other open-source alternatives. Slurm was recommended to me by a stranger at a high-performance computing conference in Durban, and it sounded promising.
Slurm manages the cluster's compute nodes through a pair of redundant control nodes (the redundancy is optional). The control side is implemented by a management daemon named slurmctld, which monitors, allocates, and manages computing resources, and maps and distributes incoming job sequences to the compute nodes. Each compute node runs its own daemon, slurmd, which manages the node it runs on, monitors the tasks running there, accepts requests and work from the control node, maps the work onto the node, and so on.
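Once the cluster described below is running, this division of labor can be checked with the scontrol utility that ships with Slurm, which reports whether the primary (and, if configured, backup) slurmctld is responding. A minimal check, run from any node that has the cluster's slurm.conf:
$ scontrol ping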
Monitoring bandwidth
There is also a command-line tool named slurm that monitors network load; it uses characters to display text graphs in the terminal.
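On Debian/Ubuntu this monitor is presumably installed from the package named slurm, which, despite the name, is a separate program from the Slurm resource manager (the package name and apt usage are assumptions here):
# apt-get install slurm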
For example:
The code is as follows:
$ slurm -i eth1
Options:
Press L to toggle the RX/TX LED indicators
Press c to switch to classic mode
Press r to refresh the screen
Press q to quit
Control node
Install the Slurm package on both the control node and the compute nodes; the same package contains the slurmctld daemon required by the control node as well as the slurmd daemon required by the compute nodes.
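On Ubuntu/Debian the package is presumably slurm-llnl (the name is inferred from the /etc/slurm-llnl configuration path used below); run the install on the control node and on every compute node:
# apt-get install slurm-llnl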
The communication between the control node and the compute nodes requires authentication. Slurm supports two authentication methods: Brent Chun's authd and LLNL's munge. Munge is designed specifically for high-performance cluster computing, so we choose munge here; generate a key and then start the munge authentication service.
The code is as follows:
# /usr/sbin/create-munge-key
Generating a pseudo-random key using /dev/urandom completed.
# /etc/init.d/munge start
Use the online Slurm Version 2.3 Configuration Tool to generate the configuration file, then copy it to /etc/slurm-llnl/slurm.conf on the control node and on each compute node (yes, the control node and the compute nodes use the same configuration file).
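For reference, here is a minimal sketch of what such a slurm.conf might contain for the two hosts used below, slurm00 (control node) and slurm01 (compute node); the state and spool paths are assumptions, and a real file should come from the configuration tool:
# slurm.conf (minimal sketch; paths and values are assumed)
ControlMachine=slurm00
AuthType=auth/munge
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
NodeName=slurm01 State=UNKNOWN
PartitionName=debug Nodes=slurm01 Default=YES MaxTime=INFINITE State=UP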
With the configuration file in place and the munge service running, the slurmctld service can be started on the control node, presumably via the package's init script /etc/init.d/slurm-llnl.
The code is as follows:
# /etc/init.d/slurm-llnl start
* Starting slurm central management daemon slurmctld [ OK ]
Copy the munge.key generated on the control node to each compute node.
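A minimal sketch using scp, assuming the standard key location /etc/munge/munge.key and the compute node host name slurm01 used later in this article:
# scp /etc/munge/munge.key slurm01:/etc/munge/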
After logging in to the compute node, start the munge service and then slurmd (note that the owner and group of munge.key must be munge, otherwise munge will fail to start):
The code is as follows:
# chown munge:munge munge.key
# /etc/init.d/munge start
* Starting MUNGE munged [ OK ]
# slurmd
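If both sides start cleanly, cross-node authentication can be verified from the control node by encoding a munge credential and decoding it remotely (munge -n produces a credential with an empty payload; ssh access to slurm01 is assumed):
# munge -n | ssh slurm01 unmunge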
On the control node (slurm00), test whether it can reach the compute node (slurm01), and simply run a program, /bin/hostname, to see the effect.
The code is as follows:
# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle slurm01
# srun -N1 /bin/hostname
slurm01