
Running Jobs

How to use Slurm to submit a job

note

Please note that the job submission procedure is evolving and the directions below may change over time. Please reach out if you run into any issues.

Slurm Configuration

Version: v2 - enabled 2026-04-21

Guiding principles

  • Iterative: This will be versioned and improved over time.
  • Maximized Utilization: We aim for maximum cluster utilization and high-impact outcomes, guided by the governance committee.
  • Minimal Friction: We want to minimize wait times and avoid unnecessary barriers.
  • Feedback-Driven: We know we won't get it right the first time; we need your input.
  • User Experience: We want everyone to have a good experience, though we understand not everyone will agree with every configuration choice.

Current configuration

FairShare Scheduling

Scheduling priority is governed by your FairShare score, which Slurm calculates based on the following factors (commands to check your standing are shown after this list):

  • QoS levels
  • Resources requested
  • Time requested
  • Past utilization
  • Decay factors
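
A quick way to inspect your current FairShare standing is sketched below, assuming the standard Slurm accounting tools (sshare and sprio) are available on the login node:

# Long listing includes raw usage and the FairShare factor for your account
sshare -u $USER -l

# For a pending job, sprio breaks down the factors behind its priority
sprio -j <jobid>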

QoS Recommendations

| QoS       | Description                                                       | Default |
|-----------|-------------------------------------------------------------------|---------|
| high      | Higher initial priority, faster FairShare decay                   | No      |
| standard  | Balanced (recommended)                                            | Yes     |
| low       | Boosts FairShare by slowing decay                                 | No      |
| scavenger | Lowest initial priority, greatest FairShare boost (slowest decay) | No      |
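
To confirm how these QoS levels are configured on the cluster (assuming your account can query the Slurm accounting database), you can list them with sacctmgr:

# Show each QoS with its priority weight and wall-time limit
sacctmgr show qos format=Name,Priority,MaxWall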

Key Limits

  • Max wall time: 48 hours (all QoS levels)

  • Max resources: 90 nodes / 720 GPUs (all QoS levels)

  • Time field: Required on all submissions

  • Project/Account: Specify if you belong to multiple projects

Slurm Submission Flags

| Specification | Option | Example | Example Purpose | Required |
|---------------|--------|---------|-----------------|----------|
| Wall clock limit | --time=[hh:mm:ss] | --time=05:00:00 | Set the wall clock limit to 5 hours 0 minutes | Yes |
| Job name | --job-name=[SomeText] | --job-name=myJob | Set the job name to "myJob" | No |
| Quality of Service | --qos=[QoS name] | --qos=standard | Choose the "standard" QoS (select from the values in the table above) | No |
| Total nodes | --nodes=[#] | --nodes=1 | Request 1 node | No |
| Total task count | --ntasks=[#] | --ntasks=2 | Request 2 tasks total | No |
| CPUs per task | --cpus-per-task=[#] | --cpus-per-task=4 | Request 4 CPUs per task | No |
| GPUs per node | --gres=gpu:[#] | --gres=gpu:4 | Request 4 GPUs per node | No |
| GPUs per task | --gpus-per-task=[#] | --gpus-per-task=2 | Request 2 GPUs per task | No |
| Total GPUs for job | --gpus=[#] | --gpus=10 | Request 10 GPUs across the job | No |
| Tasks per node | --ntasks-per-node=[#] | --ntasks-per-node=48 | Request 48 tasks per node (an exact count, or a maximum when combined with --ntasks) | No |
| Memory per node | --mem=<size>[K,M,G,T] | --mem=360G | Request 360 GB of memory per node | No |
| Combined stdout/stderr | --output=[OutputName].%j | --output=myJobOut.%j | Collect stdout/stderr in myJobOut.[JobID] | No |
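
Any of these flags can also be passed on the sbatch command line, where they override the matching #SBATCH directives inside the script. For example, to resubmit an existing script with a different QoS and time limit:

sbatch --qos=low --time=12:00:00 ./job_hello_world.job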

Examples

  • Job request for 1 node, 1 CPU, 1 GPU, and 10 minutes of runtime.

Create a script called job_hello_world.job:

#!/bin/bash  
#SBATCH --job-name=hello_world
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err
#SBATCH --time=00:10:00
#SBATCH --qos=standard
#SBATCH --gres=gpu:1

srun sh -c 'echo "hello world ($(hostname)) ($XDG_RUNTIME_DIR) ($XDG_SESSION_ID) ($XDG_SESSION_TYPE) ($XDG_SESSION_CLASS)" | tee /scratch/user/$USER/hello_world_$(hostname)'

Run the job with:

sbatch ./job_hello_world.job

  • Job request for 100 GPUs, 150 GB of RAM per node, 500 CPUs, and 30 hours of runtime. srun will launch 500 instances of the command, each with 1 CPU.

Create a script called job_hello_world.job:

#!/bin/bash  
#SBATCH --job-name=hello_world
#SBATCH --gpus=100
#SBATCH --mem=150G
#SBATCH --ntasks=500
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err
#SBATCH --time=30:00:00
#SBATCH --qos=standard

srun sh -c 'echo "hello world ($(hostname)) ($XDG_RUNTIME_DIR) ($XDG_SESSION_ID) ($XDG_SESSION_TYPE) ($XDG_SESSION_CLASS)" | tee /scratch/user/$USER/hello_world_$(hostname)'

Run the job with:

sbatch ./job_hello_world.job

  • Job request for 100 GPUs, 150 GB of RAM per node, 2,000 CPUs (500 tasks × 4 CPUs per task), and 30 hours of runtime. srun will launch 500 instances of the command, each with 4 CPUs.

Create a script called job_hello_world.job:

#!/bin/bash  
#SBATCH --job-name=hello_world
#SBATCH --gpus=100
#SBATCH --mem=150G
#SBATCH --ntasks=500
#SBATCH --cpus-per-task=4
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err
#SBATCH --time=30:00:00
#SBATCH --qos=standard

srun sh -c 'echo "hello world ($(hostname)) ($XDG_RUNTIME_DIR) ($XDG_SESSION_ID) ($XDG_SESSION_TYPE) ($XDG_SESSION_CLASS)" | tee /scratch/user/$USER/hello_world_$(hostname)'

Run the job with:

sbatch ./job_hello_world.job
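
After submitting any of the jobs above, the standard Slurm tools can be used to track them (the job ID below is a placeholder for the ID that sbatch prints):

# Show your queued and running jobs
squeue -u $USER

# Inspect a pending or running job in detail
scontrol show job <jobid>

# Review accounting data after the job finishes
sacct -j <jobid> --format=JobID,JobName,State,Elapsed

# Cancel a job that is no longer needed
scancel <jobid>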

Glossary

FairShare Scheduling:
Scheduling priority is governed by a user’s FairShare score. The FairShare score is calculated by Slurm based on QoS levels, resources requested, time requested, past utilization, decay and other factors.
Using the “high” QoS will initially give you a higher priority but will cause a faster decay of your FairShare score. This will have the effect of delaying the start of future jobs. It is generally recommended to use the “standard” QoS. However, if your job can wait a bit, use of the “low” and “scavenger” QoS will boost your FairShare score by slowing its decay. This will enable future jobs to be scheduled more quickly.

  • All levels of QoS can potentially use all 90 nodes and all associated GPUs (720 GPUs)
  • Time is a required field on all submissions
  • Your default project is the first project to which you were assigned. However, if you are on multiple projects, you will need to specify which project (account) to use when scheduling a job (see the example after this list).
  • All QoS levels have a max of 48 hours of wall clock time
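
For example, to charge a job to a specific project (the project name here is a placeholder):

#SBATCH --account=<project_name>

or equivalently on the command line:

sbatch --account=<project_name> ./job_hello_world.job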

Project:
The basic unit of system allocation

Slurm Account:
There is a one-to-one correspondence between a VISION project and a Slurm account

Slurm Association:
The assignment of a user to a project

  • All members of the project will be associated with the relevant Slurm account
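
To see which Slurm accounts (projects) you are associated with (again assuming you can query the Slurm accounting database), you can run:

# List your associations, including the account and any QoS granted
sacctmgr show associations user=$USER format=Account,User,QOS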

Slurm Quality of Service (QoS):
The requested priority of a job submission