IT and Library Services

Submitting a job to HPC

Instructions to write scripts and automate your jobs on the HPC, and details of how to test your job prior to running it in full.


The HPC uses a package called SLURM to control the primary workflow.

The SLURM (Simple Linux Utility for Resource Management) workload manager is a free and open-source job scheduler for Linux. It is used by the HPC and many of the world's supercomputers and clusters. It provides three key functions.

  • First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work.
  • Second, it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes.
  • Third, it arbitrates contention for resources by managing a queue of pending jobs.

SLURM takes your batch job submission and executes it across the compute nodes of the HPC. How a job is processed depends on a number of factors, including the queue it is submitted to and the jobs already waiting in that queue.


The command for submitting jobs is 'sbatch'

sbatch <job_script_name>

This command accepts many options that control how the job is run. Typing 'man sbatch' at the terminal will display them. Options can be given on the command line (as below).

sbatch --ntasks 28 <job_script_name>

For ease of repetition it is much easier to build these options into the batch script itself (using an editor such as vi or emacs).
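A minimal sketch of such a batch script is shown below. The job name, time limit and output file pattern are illustrative assumptions, not values required by this HPC; because #SBATCH lines are ordinary shell comments, sbatch reads them as options while the shell ignores them.

```shell
#!/bin/bash
# Illustrative batch script: the job name, time limit and output
# pattern below are example values, not site-specific requirements.
#SBATCH --job-name=example_job     # name shown in squeue
#SBATCH --ntasks=28                # number of tasks (cores) requested
#SBATCH --time=00:10:00            # wall-clock limit (hh:mm:ss)
#SBATCH --output=example_%j.out    # %j expands to the job ID

# The commands below run on the allocated node(s).
msg="Running on $(hostname)"
echo "$msg"
```

Submit the script with 'sbatch <job_script_name>'; the options embedded in it are then applied automatically on every submission.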

Job Commands

srun
    Run a parallel job on a cluster managed by SLURM. If necessary, srun will first create a resource allocation in which to run the parallel job.
sbatch
    Submit a batch script to SLURM. The batch script may be given to sbatch as a file name on the command line; if no file name is specified, sbatch reads the script from standard input.
squeue
    View job and job step information for jobs managed by SLURM.
scancel
    Signal or cancel jobs, job arrays or job steps.
scontrol
    View or modify the SLURM configuration, including job, job step, node, partition, reservation, and overall system configuration. Most of these operations can only be performed by the root user.
salloc
    Obtain a SLURM job allocation, which is a set of resources (nodes), possibly with some set of constraints (e.g. number of processors per node). When salloc successfully obtains the requested allocation, it runs the command specified by the user; when that command completes, salloc relinquishes the allocation.
sacct
    Display accounting information for jobs run under SLURM, taken from the job accounting log file or the SLURM database.
sinfo
    View partition and node information for a system running SLURM.
sattach
    Attach to a running SLURM job step, making the I/O streams of all of its tasks available. It is also suitable for use with a parallel debugger such as TotalView.
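To illustrate how these commands fit together, the sketch below shows a typical monitor-and-cancel sequence. The job ID 12345 is a placeholder, and the commands only do real work on a machine where SLURM is installed; the script checks for that first.

```shell
#!/bin/bash
# Hypothetical monitoring sequence; job ID 12345 is a placeholder.
if command -v squeue >/dev/null 2>&1; then
    slurm_present=yes
    squeue -u "$USER"     # list your queued and running jobs
    sacct -j 12345        # accounting record for job 12345
    scancel 12345         # cancel job 12345 if still queued or running
else
    # On a machine without SLURM the commands above are unavailable.
    slurm_present=no
    echo "SLURM commands not found on this machine"
fi
```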

At the terminal you can also type 'man <command>', e.g. man sbatch.

Additional documentation for SLURM can be found on the SLURM project website and in the man pages for each of the commands above.