Installing Slurm on Linux to monitor network bandwidth and manage cluster nodes

Time:2021-2-21

Slurm is an open source distributed resource manager similar to Sun Grid Engine (SGE). It is used on supercomputers and large clusters of compute nodes, and it is highly scalable and fault-tolerant. After Sun was sold to Oracle, the easy-to-use SGE became Oracle Grid Engine, and since version 6.2u6 it has been commercial software (free to use for 90 days), so we had to look for an open source alternative. Someone I met at the last high-performance computing conference in Durban introduced me to Slurm, and it sounded good.
Slurm manages the compute nodes of a cluster through a pair of redundant control nodes (the redundancy is optional). The control side is implemented by a management daemon named slurmctld, which monitors, allocates, and manages computing resources, and maps and distributes the incoming job queue to the compute nodes. Each compute node runs a daemon named slurmd, which manages the node it runs on, monitors the tasks running there, accepts requests and work from the control node, and maps that work onto the node. The architecture is shown in the following figure:
[Figure: Slurm architecture diagram]

Monitoring bandwidth

$ apt-get install slurm

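This slurm package is a small network load monitor, separate from the Slurm resource manager described above. It draws its traffic graphs in the terminal using text characters.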
For example:

$ slurm -i <interface>
$ slurm -i eth1

Options
Press L to toggle the TX/RX indicator
Press c to switch to classic mode
Press r to redraw the screen
Press q to quit
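
If you are not sure which name to pass to -i, the kernel lists every interface in /proc/net/dev. The following is a minimal sketch, assuming a Linux host with awk available; the function name list_interfaces is made up for this example and is not part of slurm.

```shell
#!/bin/sh
# List interface names from text in /proc/net/dev format read on stdin.
# The first two lines of that file are column headers, so they are skipped.
list_interfaces() {
    awk -F: 'NR > 2 { gsub(/[ \t]/, "", $1); print $1 }'
}

# On a Linux host, print the candidates for "slurm -i <interface>".
if [ -r /proc/net/dev ]; then
    list_interfaces < /proc/net/dev
fi
```

Any name it prints can be passed to slurm -i.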

Control node
Install the Slurm package on both the control node and the compute nodes. The package contains both slurmctld, which the control node needs, and slurmd, which the compute nodes need:

# apt-get install slurm-llnl

Communication between the control node and the compute nodes must be authenticated. Slurm supports two authentication methods: Brent Chun's authd and LLNL's munge. Munge was designed specifically for high-performance cluster computing, so we choose it here. Generate the key first, then start the munge authentication service:

# /usr/sbin/create-munge-key
Generating a pseudo-random key using /dev/urandom completed.
# /etc/init.d/munge start

Use the online Slurm Version 2.3 Configuration Tool to generate the configuration file, then copy it to /etc/slurm-llnl/slurm.conf on the control node and on every compute node (yes, the control node and the compute nodes use the same configuration file).
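
For orientation, here is a rough sketch of what a very small slurm.conf for this two-machine setup might contain, written to a scratch file by a shell function. The hostnames slurm00 and slurm01 and the partition name debug come from the test run later in this article; every other value is an assumption for illustration, and the file produced by the configuration tool will contain many more settings. The function name write_minimal_slurm_conf is made up for this example.

```shell
#!/bin/sh
# Sketch only: write a minimal slurm.conf to the file given as $1.
# slurm00, slurm01 and the "debug" partition mirror this article's test run;
# the remaining values are illustrative assumptions, not tool output.
write_minimal_slurm_conf() {
    cat > "$1" <<'EOF'
ControlMachine=slurm00
AuthType=auth/munge
SlurmUser=slurm
SlurmctldPort=6817
SlurmdPort=6818
NodeName=slurm01 State=UNKNOWN
PartitionName=debug Nodes=slurm01 Default=YES MaxTime=INFINITE State=UP
EOF
}

# Write to a scratch file so nothing under /etc is touched by accident.
conf_file=$(mktemp)
write_minimal_slurm_conf "$conf_file"
cat "$conf_file"
```

Review the scratch file before copying anything from it into /etc/slurm-llnl/slurm.conf.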
With the configuration file in place and the munge service running, start the slurmctld service on the control node:

# /etc/init.d/slurm-llnl start
* Starting slurm central management daemon slurmctld [ OK ]

Copy the munge.key generated on the control node to each compute node:

# scp /etc/munge/munge.key [email protected]:/etc/munge/

Log in to the compute node and start the munge service (the owner and group of munge.key must both be munge, otherwise the service will fail to start), then start slurmd:

# ssh [email protected]
# chown munge:munge /etc/munge/munge.key
# /etc/init.d/munge start
* Starting MUNGE munged [ OK ]
# slurmd
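
Because a wrong owner on munge.key stops munge from starting, it can help to check the file before starting the service. The following is a small sketch, assuming GNU stat as shipped with Debian and Ubuntu. The helper names key_owner_mode and key_looks_ok are made up for this example, and treating modes 400 and 600 as acceptable is my assumption rather than something taken from the munge documentation.

```shell
#!/bin/sh
# Print "owner:group mode" for a file, for example "munge:munge 400".
# Relies on GNU stat (coreutils).
key_owner_mode() {
    stat -c '%U:%G %a' "$1"
}

# Return 0 only if the file is owned by munge:munge with mode 400 or 600.
# The accepted modes are an assumption made for this sketch.
key_looks_ok() {
    case "$(key_owner_mode "$1")" in
        "munge:munge 400"|"munge:munge 600") return 0 ;;
        *) return 1 ;;
    esac
}

# Run as root: /etc/munge is usually not readable by ordinary users.
key=/etc/munge/munge.key
if [ -e "$key" ]; then
    if key_looks_ok "$key"; then
        echo "$key: ownership and mode look fine"
    else
        echo "$key: unexpected owner or mode: $(key_owner_mode "$key")"
    fi
fi
```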

On the control node (slurm00), check that it can reach the compute node (slurm01), then run a simple program, /bin/hostname, to see the result:

# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
debug* up infinite 1 idle slurm01
# srun -N1 /bin/hostname
slurm01
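
When the cluster grows beyond one node, the same sinfo table can be filtered in a script. The following is a minimal sketch that prints the NODELIST column for every row whose STATE is idle, assuming the default six-column layout shown above; the function name idle_nodelists is made up for this example.

```shell
#!/bin/sh
# Read sinfo's default table on stdin and print the NODELIST of idle rows.
# Columns: PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
idle_nodelists() {
    awk 'NR > 1 && $5 == "idle" { print $6 }'
}

# Only query the cluster when sinfo is actually installed.
if command -v sinfo >/dev/null 2>&1; then
    sinfo | idle_nodelists
fi
```

With the output above as input, this prints slurm01.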
