Running Jobs
How to use Slurm to submit a job
Please note that the job submission procedure is evolving and the directions below may change over time. Please reach out if you run into any issues.
Slurm Configuration
Version: v2 - enabled 2026-04-21
Guiding principles
- Iterative: This will be versioned and improved over time
- Maximized Utilization: We aim for maximum cluster utilization and high-impact outcomes, guided by the governance committee
- Minimal Friction: We want to minimize wait times and avoid unnecessary barriers
- Feedback-Driven: We know we won't get it right the first time; we need your input
- User Experience: We want everyone to have a good experience, though we understand not everyone will agree with every configuration choice
Current configuration
FairShare Scheduling
Scheduling priority is governed by your FairShare score, calculated by Slurm based on:
- QoS levels
- Resources requested
- Time requested
- Past utilization
- Decay factors
QoS Recommendations
| QoS | Description | Default |
|---|---|---|
| high | Higher initial priority, faster FairShare decay | No |
| standard | Balanced (recommended) | Yes |
| low | Boosts FairShare by slowing decay | No |
| scavenger | Lowest initial priority, greatest FairShare boost (slowest decay) | No |
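A QoS can be set in the job script (see the examples below) or chosen at submission time. A minimal sketch, shown by assembling the command as a string since `sbatch` itself only runs on the cluster; the script name `train.job` is a placeholder:

```shell
# Pick a QoS from the table above; "low" trades initial priority
# for a slower FairShare decay.
QOS=low
# "train.job" is a placeholder script name, not a real file here.
CMD="sbatch --qos=$QOS train.job"
echo "$CMD"   # prints "sbatch --qos=low train.job"
```

On the cluster you would run the printed command directly.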
Key Limits
- Max wall time: 48 hours (all QoS levels)
- Max resources: 90 nodes / 720 GPUs (all QoS levels)
- Time field: required on all submissions
- Project/Account: specify if you belong to multiple projects
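A minimal script that satisfies the limits above: `--time` is present and under 48 hours, and an account is named. This is a sketch; `myproject` is a placeholder account name, not a real one:

```shell
# Minimal job script: --time is required and capped at 48:00:00.
# "myproject" is a placeholder account name.
cat > minimal.job <<'EOF'
#!/bin/bash
#SBATCH --time=01:00:00
#SBATCH --qos=standard
#SBATCH --account=myproject
srun hostname
EOF
# Count the directives we just wrote.
grep -c '^#SBATCH' minimal.job   # prints 3
```

On the cluster, submit it with `sbatch minimal.job`.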
Slurm Submission Flags
| Specification | Option | Example | Example-Purpose | Required |
|---|---|---|---|---|
| Wall Clock Limit | --time=[hh:mm:ss] | --time=05:00:00 | Set wall clock limit to 5 hours 00 min | Yes |
| Job Name | --job-name=[SomeText] | --job-name=myJob | Set the job name to "myJob" | No |
| Quality of Service | --qos=[QoS name] | --qos=standard | Choose the "standard" QoS (select from the values in the table above) | No |
| Total nodes | --nodes=[#] | --nodes=1 | Request 1 node | No |
| Total Task Count | --ntasks=[#] | --ntasks=2 | Request 2 tasks total | No |
| CPUs per task | --cpus-per-task=[#] | --cpus-per-task=4 | Request 4 CPUs per task | No |
| GPUs per node | --gres=gpu:[#] | --gres=gpu:4 | Request 4 GPUs per node | No |
| GPUs per task | --gpus-per-task=[#] | --gpus-per-task=2 | Request 2 GPUs per task | No |
| Total GPUs for job | --gpus=[#] | --gpus=10 | Request 10 GPUs across the job | No |
| Tasks per Node | --ntasks-per-node=[#] | --ntasks-per-node=48 | Request exactly (or at most) 48 tasks per node | No |
| Memory Per Node | --mem=value[K|M|G|T] | --mem=360G | Request 360 GB of memory per node | No |
| Combined stdout/stderr | --output=[OutputName].%j | --output=myJobOut.%j | Collect stdout/stderr in myJobOut.[JobID] | No |
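The same flags can also be passed directly to `sbatch` on the command line, where they override any matching `#SBATCH` directives in the script. A sketch that assembles (but does not run) such a command; the values and the `train.job` name are illustrative:

```shell
# Command-line flags override #SBATCH directives in the script.
# "train.job" is a placeholder script name.
CMD="sbatch --time=05:00:00 --job-name=myJob --qos=standard --nodes=1 --ntasks=2 --cpus-per-task=4 --gres=gpu:4 --mem=360G --output=myJobOut.%j train.job"
echo "$CMD"
```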
Examples
- Job request for 1 node, 1 CPU, 1 GPU, and 10 minutes of runtime.
Create a script called job_hello_world.job:
#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err
#SBATCH --time=00:10:00
#SBATCH --qos=standard
#SBATCH --gres=gpu:1
srun sh -c 'echo "hello world ($(hostname)) ($XDG_RUNTIME_DIR) ($XDG_SESSION_ID) ($XDG_SESSION_TYPE) ($XDG_SESSION_CLASS)" | tee /scratch/user/$USER/hello_world_$(hostname)'
Run the job with:
sbatch ./job_hello_world.job
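Since the time field is required on all submissions, a quick pre-submission check (a convenience sketch, not a site-provided tool) can catch a missing `--time` directive. The file `example.job` below is a trimmed-down stand-in for your real script:

```shell
# Write a trimmed-down example script, then verify it declares
# the required --time directive before submitting.
cat > example.job <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --time=00:10:00
#SBATCH --qos=standard
srun hostname
EOF
if grep -q '^#SBATCH --time=' example.job; then
  echo "time directive present"
else
  echo "missing required --time" >&2
fi
```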
- Job request for 100 GPUs, 150 GB of RAM per node, 500 CPUs, and 30 hours of runtime. srun will launch 500 tasks, each with 1 CPU.
Create a script called job_hello_world.job:
#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --gpus=100
#SBATCH --mem=150G
#SBATCH --ntasks=500
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err
#SBATCH --time=30:00:00
#SBATCH --qos=standard
srun sh -c 'echo "hello world ($(hostname)) ($XDG_RUNTIME_DIR) ($XDG_SESSION_ID) ($XDG_SESSION_TYPE) ($XDG_SESSION_CLASS)" | tee /scratch/user/$USER/hello_world_$(hostname)'
Run the job with:
sbatch ./job_hello_world.job
- Job request for 100 GPUs, 150 GB of RAM per node, 2,000 CPUs (500 tasks with 4 CPUs each), and 30 hours of runtime.
Create a script called job_hello_world.job:
#!/bin/bash
#SBATCH --job-name=hello_world
#SBATCH --gpus=100
#SBATCH --mem=150G
#SBATCH --ntasks=500
#SBATCH --cpus-per-task=4
#SBATCH --output=hello_%j.out
#SBATCH --error=hello_%j.err
#SBATCH --time=30:00:00
#SBATCH --qos=standard
srun sh -c 'echo "hello world ($(hostname)) ($XDG_RUNTIME_DIR) ($XDG_SESSION_ID) ($XDG_SESSION_TYPE) ($XDG_SESSION_CLASS)" | tee /scratch/user/$USER/hello_world_$(hostname)'
Run the job with:
sbatch ./job_hello_world.job
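The CPU totals in the last two examples follow directly from the directives: total CPUs = ntasks × cpus-per-task. A quick check of the third example's numbers (500 tasks, 4 CPUs per task):

```shell
# Total CPUs requested = ntasks * cpus-per-task.
ntasks=500
cpus_per_task=4
total_cpus=$((ntasks * cpus_per_task))
echo "total CPUs: $total_cpus"   # prints "total CPUs: 2000"
```

With the default of 1 CPU per task, as in the second example, the total is simply ntasks (500).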
Glossary
FairShare Scheduling:
Scheduling priority is governed by a user’s FairShare score. The FairShare score is calculated by Slurm based on QoS levels, resources requested, time requested, past utilization, decay and other factors.
Using the “high” QoS will initially give you a higher priority but will cause a faster decay of your FairShare score. This will have the effect of delaying the start of future jobs. It is generally recommended to use the “standard” QoS. However, if your job can wait a bit, use of the “low” and “scavenger” QoS will boost your FairShare score by slowing its decay. This will enable future jobs to be scheduled more quickly.
- All QoS levels can potentially use all 90 nodes and all associated GPUs (720 GPUs)
- Time is a required field on all submissions
- Your default project is the first project to which you were assigned. However, if you are on multiple projects, you will need to specify which project (account) when scheduling a job.
- All QoS levels have a max of 48 hours of wall clock time
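If you belong to multiple projects, name the account explicitly in the script (or with `--account=` on the `sbatch` command line). A sketch; `projA` is a placeholder account name, not a real one:

```shell
# "projA" is a placeholder Slurm account; substitute one of yours.
cat > multi_project.job <<'EOF'
#!/bin/bash
#SBATCH --account=projA
#SBATCH --time=01:00:00
#SBATCH --qos=standard
srun hostname
EOF
grep '^#SBATCH --account' multi_project.job   # prints "#SBATCH --account=projA"
```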
Project:
The basic unit of system allocation
Slurm Account:
There is a one-to-one correspondence between a VISION project and a Slurm account
Slurm Association:
Assignment to project
- All members of the project will be associated with the relevant Slurm account
Slurm Quality of Service (QoS):
The requested priority of a job submission
Links
- https://slurm.schedmd.com/quickstart.html - Great quick start guide for Slurm
- https://slurm.schedmd.com/fair_tree.html - FairShare algorithm implementation
- https://slurm.schedmd.com/priority_multifactor.html - Multifactor Priority