Slurm Guide for HPC3

1. Overview

HPC3 uses the Slurm scheduler. Slurm is widely used at supercomputing centers and is actively maintained. Many SGE concepts have equivalents in Slurm; Stanford has a guide of equivalent commands, and the Slurm developers provide a nice quick reference guide.

We provide numerous EXAMPLES that show in depth how to run array jobs and how to request GPUs, CPUs, and memory for a variety of job types and common applications.

1.1. Simple code of conduct

  1. All jobs, batch or interactive, must be submitted to the scheduler.

  2. Do not run computational jobs on login nodes; this adversely affects many users. Login nodes are meant for light editing, compilation, and job submission. Any job that runs for more than an hour, or that uses significant memory and CPU within an hour, should be submitted to Slurm as either an interactive or a batch job.

    • Long-running jobs will be removed.

    • We reserve the right to limit access for the users who abuse the system.

  3. SSH access to the compute nodes is turned off to prevent users from starting jobs that bypass Slurm. See Attach to a job below.

  4. Do not run Slurm jobs in your $HOME.

  5. Make sure you stay within your disk quota. File system limits are generally the first ones that will negatively affect your job. See the storage guides.

1.2. HPC3 Queue Structure

Slurm uses the term partition to signify a batch queue of resources. HPC3 has different kinds of hardware, different memory footprints, and nodes with GPUs. Jobs running in some partitions charge core-hours (or GPU-hours) to the account.

Please do not override the memory defaults unless your particular job really requires it. Analysis of more than 3 million jobs on HPC3 indicated that more than 98% of jobs fit within the defaults. With slightly smaller memory footprints, the scheduler has MORE choices as to where to place jobs on the cluster.

Memory is allocated per-CPU core. When you request more cores, your job is allocated more memory.
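For example, a job in the standard partition (3 GB/core default, per Table 1 below) that needs roughly 12 GB of memory can simply request 4 cores rather than overriding the memory defaults; the directives below are a minimal sketch:

#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4   ## 4 cores x 3 GB/core = 12 GB available to the job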

Table 1. HPC3 Available Queues and associated memory.
Partition        Default memory/core   Max memory/core   Default / Max runtime   Cost                       Jobs preemption

CPU Partitions

standard         3 GB                  6 GB              2 day / 14 day          1 / core-hr                No
free             3 GB                  18 GB             1 day / 3 day           0                          Yes
debug            3 GB                  18 GB             15 min / 30 min         1 / core-hr                No
highmem (1)      6 GB                  10 GB             2 day / 14 day          1 / core-hr                No
hugemem (1)      18 GB                 18 GB             2 day / 14 day          1 / core-hr                No
maxmem (1,2)     1.5 TB/node           1.5 TB/node       1 day / 7 day           40 / node-hr               No

GPU Partitions

gpu (3)          3 GB                  9 GB              2 day / 14 day          1 / core-hr, 32 / GPU-hr   No
free-gpu (4)     3 GB                  9 GB              1 day / 3 day           0                          Yes
gpu-debug (4)    3 GB                  9 GB              15 min / 30 min         1 / core-hr, 32 / GPU-hr   No
(1) You must be added to a specific group to access the highmem/hugemem/maxmem partitions. If you are not a member of these groups, you cannot submit jobs to these partitions and sinfo will not show them.

(2) The maxmem partition is a single 1.5 TB node reserved for the rare applications that truly require that much memory. You can only be allocated the entire node. No free jobs run in this partition.

(3) You must have a GPU account and you must specify it in order to submit to the gpu/gpu-debug partitions. This is because of differential charging. GPU accounts are not automatically given to everyone; your faculty adviser can request a GPU lab account.

(4) Anyone can run jobs in the free-gpu partition without a special account.

1.3. How accounts are charged

Every HPC3 user is granted 1,000 free CPU hours as a startup allowance. This allocation is there for you to become familiar with HPC3, the Slurm scheduler, and accounting.

Each CPU core-hour (allocation unit) is charged to the account you specify (or your default account, which is your personal user account). Similarly, each GPU-hour is charged to a GPU-enabled account at the rate of 32/GPU-hour plus the core-hour charge for the CPU cores the job uses (at least one is required). A GPU-enabled account can only be used on GPU nodes.

Table 2. Example of charges
Using                            Will be charged

1 core X 1 hour                  1 unit
1 core X 6 minutes               0.1 units
10 cores X 1 hour                10 units
(1 GPU + 1 CPU core) X 1 hour    33 units

Most jobs run on HPC3 are charged to a lab account because most HPC3 users are part of at least one research lab. If a user or a research lab runs out of CPU hours, more can be purchased via recharge.

Any UCI professor can request an HPC3 lab account on behalf of their research group and add any number of researchers/students to this account. The granted allocation varies based on the number of requests and the number of nodes purchased by RCIC; the aspirational goal is that faculty who request an account will be granted 200,000 CPU hours per fiscal year.

You may request your Slurm lab account by emailing hpc-request@uci.edu. In the email, please specify

Information needed in a PI Request for Granted Hours
  1. PI user name

  2. User names (if any) of the researchers, graduate students or other collaborators that should be able to charge CPU hours to the lab account.

  3. Define account Coordinators - one or two lab members (typically Postdocs and/or Project Specialists) for the Slurm lab account. Account Coordinators are able to manage the group members' jobs, modify their queue priority, update limits for the total CPU hours for individual members, etc.

1.3.1. Free Jobs

The free queues are designed to allow the cluster to run at 100% utilization while still giving allocated jobs very quick access to cores. This is accomplished by allowing allocated jobs to displace (kill) running free jobs. The design of HPC3 is that, on average, about 20% of the cluster is available for free jobs.

  • Jobs submitted to free partitions are not charged to an account

  • Free jobs can be killed at any time to make room for allocated jobs

When the standard partition, where paid-for jobs run, becomes full, jobs in the free partition are killed in order to allow the allocated jobs to run with priority. In an attempt to push as much goodput through the system as possible, the most recently started free jobs are the ones killed first.

1.3.2. Allocated Jobs

  • Jobs submitted to the standard partition are allocated jobs

  • Once a job has started running in the standard queue, it will run to completion

Standard jobs have the following properties:

  1. Standard jobs cannot be killed preemptively by any other job.

  2. Standard jobs preempt free jobs.

  3. Standard jobs with QOS set to normal are charged for the CPU time consumed.

  4. Standard jobs with QOS set to high are charged double the CPU time consumed.

  5. Standard jobs with QOS set to high are placed at the front of the jobs queue. They are meant to be used when a user needs to jump to the front of the queue because the time from submission to running is of the essence (e.g., grant proposal and paper deadlines). See the example after this list.
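For example, a minimal sketch of submit directives that request the high QOS (the account name panteater_lab is illustrative):

#SBATCH -A panteater_lab   ## account to charge (billed at double the normal rate with high QOS)
#SBATCH -p standard
#SBATCH --qos=high         ## place the job at the front of the queue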

1.3.3. Recommendations

Charging jobs to an account is new for the UCI community. Like any policy, it can be double-edged. A large fraction of users should be able to run allocated jobs and never hit the limits of their accounts. However, users who access a very large number of free cores are likely to have some of their free jobs preempted (killed).

Get the most from your allocation:

  1. Look at your past jobs and see how much CPU resource was used. Don’t request more than needed.

  2. Prioritize your own work. Test and low-priority jobs can go to free. Others should be allocated.

  3. Understand that free "comes with no guarantees". Your free job can be killed at any time.

1.3.4. Quota Enforcement

When HPC3 users exceed their disk space quota or CPU hour allocation, the following will happen:

  • users will not be able to submit new jobs

  • running jobs will fail

Please check the available disk space and CPU hours in your Slurm account regularly. Delete or archive data as needed.

2. Quick Start

2.1. Example Scripts

We provide numerous EXAMPLES that show in depth how to run array jobs and how to request GPUs, CPUs, and memory for a variety of job types and common applications. The scripts can be downloaded from this directory.

There are a few ways to submit your jobs to Slurm: as batch jobs, as interactive jobs, or for immediate execution. The sections below show a few common submission details.

2.2. Batch Job

A batch job is run by the scheduler at some time in the future. Submitting batch jobs to Slurm is accomplished through the sbatch command, and the job description is provided by the submit script.

An example job description is shown below:

#!/bin/bash

#SBATCH --job-name=test      ## Name of the job.
#SBATCH -A panteater_lab     ## account to charge (1)
#SBATCH -p standard          ## partition/queue name
#SBATCH --nodes=1            ## (-N) number of nodes to use
#SBATCH --ntasks=1           ## (-n) number of tasks to launch
#SBATCH --cpus-per-task=1    ## number of cores the job needs
#SBATCH --error=slurm-%J.err ## error log file

# Run command hostname and save output to the file out.txt
hostname > out.txt

To submit a job on HPC3, log in and, using your favorite editor, create a simplejob.sub file with the contents shown above.

1 Edit the Slurm account to charge for the job: either your personal account or a lab account. Your personal account is the same as your UCINetID.

To submit the job do:

[user@login-x:~]$ sbatch simplejob.sub
Submitted batch job 362

When the job has been submitted, Slurm returns a job ID that will be used to reference the job in Slurm user log files and Slurm job reports. After the job finishes, look at the file out.txt to see the name of the compute node that ran the job.
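For example (the node name in the output is illustrative; yours will differ):

[user@login-x:~]$ cat out.txt
hpc3-17-11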

2.3. Job Status

To check the status of your job in the queue:

[user@login-x:~]$ squeue -u panteater
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               362  standard     test panteater R       0:03      1 hpc3-17-11

To get detailed info about the job:

[user@login-x:~]$ scontrol show job 362

The output will contain a list of key=value pairs that provide job information.

AVOID using the watch command to query the Slurm queue in a continuous loop, as in:
[user@login-x:~]$ watch -d squeue <...other arguments...>

Frequent querying of the Slurm queue creates unnecessary overhead and affects many users. Instead, check your job output, enable mail notification for job end, or run the squeue command when you want to see an update.

2.4. Immediate job

The command srun is used to run a job immediately and uses your console for stdin / stdout / stderr (standard input/output/error) instead of redirecting them to a file.

srun submits jobs for immediate execution, but it does not bypass scheduler priority. If your job cannot run immediately, you will wait until Slurm can schedule your request.
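For example, a minimal immediate job that runs hostname on one core in the free partition, with the output sent straight to your terminal:

[user@login-x:~]$ srun -p free -n 1 hostname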

The main difference between srun and sbatch:

  • srun is interactive and blocking. srun is quite useful in the debug or free queues, and is often used to create job steps in sbatch scripts or to run interactive jobs (see the sketch after this list).

  • sbatch is batch processing and non-blocking. Sbatch can do everything srun can and more.
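As an illustration of job steps, here is a minimal sketch of an sbatch script that uses srun to launch its steps; the program name ./my_mpi_app and the account name are placeholders:

#!/bin/bash
#SBATCH --job-name=steps
#SBATCH -A panteater_lab     ## account to charge (placeholder)
#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks=4

# each srun below is recorded as a separate job step in sacct/sstat output
srun -n 1 hostname           ## step 0: report the compute node
srun -n 4 ./my_mpi_app       ## step 1: run the placeholder program with 4 tasks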

2.5. Interactive Job

To request an interactive shell, use the salloc command. For example, a user panteater can use one of the following to submit a job to the standard partition:

[user@login-x:~]$ salloc srun --pty /bin/bash -i                       (1)
[user@login-x:~]$ salloc -A PI_LAB --ntasks=4 srun --pty /bin/bash -i  (2)
1 get an interactive node, reserving 1 CPU (default) and charging the panteater account (default).
2 get an interactive node reserving 4 CPUs and charging a PI_LAB account. Note, salloc creates the resource allocation, which is why options for the account, partition, CPU, GPU, etc. must be entered before srun in the command. srun is then used to run a command, and options needed for that command must be specified after srun. In this case, srun requests a pseudo terminal to execute the /bin/bash command and redirects all stdin to the terminal.

A simpler way to run an interactive job is to use srun. For example:

[user@login-x:~]$ srun -A PI_LAB --pty /bin/bash -i                    (1)
[user@login-x:~]$ srun -p free --pty /bin/bash -i                      (2)
[user@login-x:~]$ srun --mem=8G -p free --pty /bin/bash -i             (3)
[user@login-x:~]$ srun -c 4 --time=10:00:00 -N 1 --pty /bin/bash -i    (4)
1 start an interactive session in the standard partition and charge it to the PI_LAB account
2 start an interactive session in the free partition (where it may be killed at any time)
3 ask for 8 GB of memory per job (only when you truly need it)
4 ask for 4 CPUs on one node for 10 hrs

Once the salloc or srun command is executed, the scheduler allocates available resources and starts an interactive shell on the chosen node. Your shell prompt will indicate the new hostname.

Once done with your work simply type at the prompt:

[user@login-x:~]$ exit

2.6. Interactive GUI job

To run an interactive session for GUI jobs, a user must log in with X forwarding enabled in ssh (see the Reference guide) and then use the --x11 option to enable X forwarding in the srun command.

[user@login-x:~]$ srun -p free --x11  --pty /bin/bash -i
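A minimal sketch of the full sequence is shown below; the login hostname is an assumption, use the one given in the Reference guide:

# on your local machine: connect with X forwarding enabled (hostname is illustrative)
ssh -X panteater@hpc3.rcic.uci.edu

# on the login node: start an interactive session with X forwarding
[user@login-x:~]$ srun -p free --x11 --pty /bin/bash -i

Once the interactive shell starts on a compute node, GUI applications launched there (for example, xclock) will display on your local screen.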

2.7. Attach to a job

SSH access to compute nodes is turned off.

Users need to use Slurm commands to attach to a running job if they want to run simple verification commands on the node where the job is running.

  1. For example, find your running job and use its jobid to attach to it:

    [user@login-x:~]$ squeue -u panteater
          JOBID PARTITION     NAME      USER ST       TIME  NODES NODELIST(REASON)
        3559123      free    Tst41 panteater  R   17:12:33      1 hpc3-14-02
        3559124      free    Tst42 panteater  R   17:13:33      5 hpc3-14-17,hpc3-15-[05-08]
    [user@login-x:~]$ srun --pty --jobid 3559123 --overlap /bin/bash

    This will put a user on the same node hpc3-14-02 where the job is running and will run inside the cgroup (CPU, RAM, etc.) of the running job. This means the user will be able to execute simple commands such as ls, top, ps, etc., but will not be able to start new processes that use resources beyond what is allocated to the job. Any command will use computing resources, and therefore will add to the usage of the job. After executing your desired verification commands, simply type exit. The original job will still be running.

    For jobs with multiple nodes, use the -w switch to specify a specific node:

    [user@login-x:~]$ srun --pty --jobid 3559124 --overlap -w hpc3-14-17 /bin/bash
  2. Most often, users just need to see the processes of the job. Such commands can be run directly, for example:

    [user@login-x:~]$ srun --pty --overlap --jobid $JOBID top

2.8. Email notification

To receive email notification on the status of jobs, include the following lines in your submit scripts and make the appropriate modifications to the second line:

#SBATCH --mail-type=fail,end
#SBATCH --mail-user=user@domain.com

The first line specifies the event types for which a user requests an email (here, job failure and job end); the second specifies a valid email address. We suggest using only a few event types, especially if you submit hundreds of jobs. See the output of the man sbatch command for more info.

2.9. Node Selection Using Constraints

HPC3 has heterogeneous hardware with several different CPU types available. In Slurm you can request that a job only run on nodes with certain "features". We add features to assist users who have discussed their specific needs. Using a constraint is straightforward in Slurm. For example, to run only on nodes that support the AVX512 instruction set, one would add to the submission:

#SBATCH --constraint=avx512

We have defined the following features for node selection:

Table 3. HPC3 Defined Features
Feature Name         Description (processor/storage)                    Node Count               Cores (Min, Mode, Max)

intel                Select Intel node (including HPC legacy)           compute: 162, GPU: 14    compute: 24, 40, 48; GPU: 40, 40, 40
avx512               Select Intel node supporting AVX512 instructions   compute: 157, GPU: 14    compute: 28, 40, 48; GPU: 40, 40, 40
epyc or amd          Select AMD EPYC node                               22                       40, 64, 64
epyc7551             Select AMD EPYC 7551 node only                     18                       40, 64, 64
epyc7601             Select AMD EPYC 7601 node only                     4                        64, 64, 64
nvme or fastscratch  Select Intel AVX512 node only with /tmp on NVMe    40                       48, 48, 48
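Constraints can also be combined. For example, the directives below are a sketch of requesting any AMD EPYC node, or alternatively a node with both AVX512 support and NVMe /tmp (the & syntax requires all listed features on the same node):

#SBATCH --constraint=epyc              ## any AMD EPYC node

# or, to require both AVX512 support and NVMe /tmp on the same node:
#SBATCH --constraint="avx512&nvme"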

2.10. Default Settings

2.10.1. Node Information

The sinfo command provides information about nodes and partitions. A few useful examples:

  1. The following command will give a short table where nodes are grouped by their features (output trimmed):

    [user@login-x:~]$ sinfo -o "%60N %10c %10m  %30f  %10G" -e
    NODELIST                            CPUS MEMORY  AVAIL_FEATURES                 GRES
    hpc3-14-[00-31],hpc3-15-[00-19],... 40   180000  intel,avx512                   (null)
    hpc3-19-16                          44   500000  intel                          (null)
    hpc3-20-[23,25-32]                  48   180000  intel,avx512                   (null)
    hpc3-l18-[04-05]                    28   245000  intel,avx512                   (null)
    hpc3-18-02                          64   244000  amd,epyc,epyc7601              (null)
    hpc3-19-[07,17],hpc3-l18-03         64   500000  amd,epyc,epyc7551              (null)
    hpc3-21-[00-32],hpc3-22-[00-04]     48   180000  intel,avx512,fastscratch,nvme  (null)
    hpc3-20-[16-20,24]                  48   372000  intel,avx512                   (null)
    hpc3-gpu-16-00                      40   180000  intel,avx512                   gpu:V100:4
    hpc3-l18-02                         40   1523544 amd,epyc,epyc7551              (null)
    hpc3-gpu-16-[01-07],hpc3-gpu-17-... 40   180000  intel,avx512                   gpu:V100:4
    hpc3-gpu-18-00                      40   372000  intel,avx512                   gpu:V100:4
  2. The following command will produce the same table but for each node without grouping:

    [user@login-x:~]$ sinfo -o "%20N %10c %10m  %20f  %10G" -N
    NODELIST             CPUS       MEMORY      AVAIL_FEATURES        GRES
    hpc3-14-00           40         192000      avx512                (null)
    hpc3-14-00           40         192000      avx512                (null)
    hpc3-14-01           40         192000      avx512                (null)
    hpc3-14-01           40         192000      avx512                (null)
    ... output cut ...
  3. This command will give an output just for one specified node:

    [user@login-x:~]$ sinfo -o "%20N %10c %10m  %20f  %10G"  -n hpc3-14-00
    NODELIST             CPUS       MEMORY      AVAIL_FEATURES        GRES
    hpc3-14-00           40         192000      avx512                (null)

Run the man sinfo command for detailed information about the options.

2.10.2. Node Memory

There are nodes with three different memory footprints. Slurm uses Linux cgroups to enforce that applications do not use more memory/cores than they have been allocated.

Some nodes have Graphics Processing Units (GPUs), and these are defined in separate queues. Please see the default settings for Slurm partitions.

Users cannot submit jobs to the highmem/hugemem partitions without first being added to special groups. Users must either (a) be members of a group that purchased these node types or (b) demonstrate that their applications require more than the standard memory. There is no difference in cost/core-hour on any of the CPU partitions.
If you want more memory on a standard memory node, you should request more cores. You will be charged more for this, but you use a larger fraction of the node.

2.10.3. Queue configuration

The scontrol command can be used to view the Slurm configuration, including job, node, partition, reservation, and overall system configuration. For example, to display information about the standard queue:

[user@login-x:~]$ scontrol show partition=standard
PartitionName=standard
   AllowGroups=ALL AllowAccounts=ALL AllowQos=ALL
   AllocNodes=ALL Default=YES QoS=normal
   DefaultTime=2-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=65 MaxTime=14-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=40
   Nodes=hpc3-14-[00-31],hpc3-15-[00-19,21,24-31],hpc3-17-[08-11]
   PriorityJobFactor=100 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=NONE PreemptMode=OFF
   State=UP TotalCPUs=2600 TotalNodes=65 SelectTypeParameters=NONE
   JobDefaults=(null)
   DefMemPerCPU=4500 MaxMemPerCPU=4500

The output contains information about default configuration settings. The key Nodes lists all the nodes for the queue. We can get more information about a specific node:

[user@login-x:~]$ scontrol show node=hpc3-14-00

Please run the man scontrol command for available options.

3. Monitor jobs

3.1. Job History

We have a cluster-specific tool to print a ledger of jobs based on specified arguments. The default is to print the current user's jobs for the last 30 days:

[user@login-x:~]$ /pub/hpc3/zotledger -u panteater
      DATE       USER    ACCOUNT PARTITION    JOBID  JOBNAME ARRAYLEN CPUS WALLHOURS    SUs
2021-07-21  panteater  panteater  standard  1740043     srun        -    1      0.00   0.00
2021-07-21  panteater  panteater  standard  1740054     bash        -    1      0.00   0.00
2021-08-03  panteater     lab021  standard  1406123     srun        -    1      0.05   0.05
2021-08-03  panteater     lab021  standard  1406130     srun        -    4      0.01   0.02
2021-08-03  panteater     lab021  standard  1406131     srun        -    4      0.01   0.02
    TOTALS          -          -         -        -        -        -    -      0.07   0.09

To find all available arguments use:

[user@login-x:~]$ /pub/hpc3/zotledger -h

3.2. Job Info

sacct can be used to see accounting data for all jobs and job steps. The example below shows how to query a specific job ID:

[user@login-x:~]$ sacct -j 36811_374
       JobID  JobName  Partition      Account  AllocCPUS      State ExitCode
------------ -------- ---------- ------------ ---------- ---------- --------
   36811_374    array   standard panteater_l+          1  COMPLETED      0:0

The above command uses the default output format. A more useful example sets a specific output format for sacct that provides extra information:

[user@login-x:~]$ export SACCT_FORMAT="JobID%20,JobName,Partition,Elapsed,State,MaxRSS,AllocTRES%32"
[user@login-x:~]$ sacct -j 600
     JobID JobName  Partition  Elapsed     State  MaxRSS AllocTRES
---------- -------  --------  -------- --------- ------- --------------------------------
       600    all1  free-gpu  03:14:42 COMPLETED         billing=2,cpu=2,gres/gpu=1,mem=+
 600.batch   batch            03:14:42 COMPLETED 553856K           cpu=2,mem=6000M,node=1
600.extern  extern            03:14:42 COMPLETED       0 billing=2,cpu=2,gres/gpu=1,mem=+
The MaxRSS value shows your job's memory usage.
Other useful options in SACCT_FORMAT are User, NodeList, and ExitCode. To see all available options for the format, see the man page: man sacct.

The Slurm efficiency script seff can be used after the job completes to find useful information about the job, including memory and CPU usage and efficiency.

[user@login-x:~]$ seff 4322385
seff doesn’t produce accurate results for multi-node jobs. Use this command for single-node jobs.

3.3. Job Statistics

sstat displays resource utilization information for running jobs and job steps. For example, to print a job’s average CPU time (AveCPU), the average number of bytes written by all tasks (AveDiskWrite), the average number of bytes read by all tasks (AveDiskRead), and the total number of tasks (NTasks), execute:

[user@login-x:~]$ sstat -j 125610 --format=jobid,avecpu,aveDiskWrite,AveDiskRead,ntasks
       JobID     AveCPU AveDiskWrite  AveDiskRead   NTasks
------------ ---------- ------------ ------------ --------
125610.batch 10-18:11:+ 139983973691 153840335902        1

3.4. Allocations

sbank is short for "Slurm Bank" and is used to display HPC3 user account information. In order to run jobs on HPC3, a user must have available CPU hours. To check how many CPU hours are available in your personal account, run the command with your account name:

[user@login-x:~]$ sbank balance statement -a panteater
User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
---------- --------- + -------------- --------- + ------------- ---------
panteater*        58 |      PANTEATER        58 |         1,000       942

To check how many CPU hours are available in all accounts that you have access to and how much you used:

[user@login-x:~]$ sbank balance statement -u panteater
User           Usage |        Account     Usage | Account Limit Available (CPU hrs)
---------- --------- + -------------- --------- + ------------- ---------
panteater*        58 |      PANTEATER        58 |         1,000       942
panteater*     6,898 |         PI_LAB     6,898 |       100,000    93,102
The panteater* in the output means the command was run by user panteater. The user name is not reflected in the generic prompt [user@login-x:~]$ used for these examples.

3.5. Pending Job

Once you submit your job, it should start running, depending on node availability, job priority, and the requested resources. However, sometimes a job stays in PD (pending) status for a long time. Here is how to determine why.

  1. Check the queue status for your jobs

    [user@login-x:~]$ squeue -u panteater
    JOBID   PARTITION     NAME      USER  ST       TIME  NODES NODELIST(REASON)
    1666961  standard     tst1  panteater PD       0:00      1 (AssocGrpCPUMinutesLimit)
    1666962  standard     tst2  panteater PD       0:00      1 (AssocGrpCPUMinutesLimit)

    Note, in this case the reason is AssocGrpCPUMinutesLimit, which means there is not enough balance left in the account.

  2. Check your account balance

    [user@login-x:~]$ sbank balance statement -u panteater
    User            Usage |        Account     Usage | Account Limit Available (CPU hrs)
    ---------- ---------- + -------------- --------- + ------------- ---------
    panteater *         0 |      PANTEATER         0 |   1,000     1,000

    The account has 1,000 CPU hours available.

  3. Check the jobs' requests

    [user@login-x:~]$ squeue -o "%i %u %j %C %T %L %R" -p standard -t PD -u panteater
    JOBID        USER NAME CPUS   STATE        TIME_LEFT  NODELIST(REASON)
    1666961 panteater tst1  16  PENDING 3-00:00:00 (AssocGrpCPUMinutesLimit)
    1666962 panteater tst2  16  PENDING 3-00:00:00 (AssocGrpCPUMinutesLimit)

    Each job asks for 16 CPUs for 3 days, which is 16 * 24 * 3 = 1152 core-hours, more than the 1,000 hours available in the account.

    These jobs will never be scheduled to run and need to be removed from the queue.

3.5.1. Pending Job Reasons

Jobs submitted to Slurm will start as soon as the scheduler can find an appropriate resource match. While lack of resources or an insufficient account balance are common reasons that prevent a job from starting (technically, the job is in the "pending" state), there are other possibilities. From a policy point of view, RCIC does not generally put limits in place unless we see unreasonable impact on shared resources (often, file systems) or other fairness issues. To see the state reasons of your pending jobs, issue the squeue command as in the following example. The $(whoami) in the command below expands to your user account name; you can simply use your account name instead.

[user@login-x:~]$ squeue -t PD -u $(whoami)
JOBID PARTITION NAME USER ACCOUNT ST TIME CPUS NODE NODELIST(REASON)
92005 standard  watA peat   p_lab PD 0:00    1    1 (ReqNodeNotAvail,Reserved for maintenance)
92008 standard  watA peat   p_lab PD 0:00    1    1 (ReqNodeNotAvail,Reserved for maintenance)
92011 standard  watA peat   p_lab PD 0:00    1    1 (ReqNodeNotAvail,Reserved for maintenance)
95475 free-gpu  7sMD peat   p_lab PD 0:00    2    1 (QOSMaxJobsPerUserLimit)
95476 free-gpu  7sMD peat   p_lab PD 0:00    2    1 (QOSMaxJobsPerUserLimit)

Reasons often seen on HPC3 for jobs in the pending state are listed below.

Table 4. Job Pending Reasons.
NODELIST(REASON)                              Explanation

(AssocGrpCPUMinutesLimit)                     Insufficient funds are available to run the job to completion.

(Dependency)                                  Job has a user-defined dependency on a running job and cannot start
                                              until the previous job has completed.

(Priority)                                    Slurm’s scheduler is temporarily holding the job in pending state
                                              because other queued jobs have a higher priority.

(QOSMaxJobsPerUserLimit)                      The user is already running the maximum number of jobs allowed by
                                              the particular partition.

(ReqNodeNotAvail, Reserved for maintenance)   If the job were to run for the requested maximum time, it would run
                                              into a defined maintenance window. The job will not start until
                                              maintenance has been completed.

(Resources)                                   The requested resource configuration is not currently available. If
                                              a job requests a resource combination that physically does not
                                              exist, the job will remain in this state forever.

A job may have multiple reasons for not running; squeue will only show one of them.

3.5.2. Fix pending job

You will need to resubmit the job so that the requested execution hours can be covered by your Slurm account balance. Verify the following settings in your Slurm script for batch jobs:

  • #SBATCH -A: use a different Slurm account (lab) where you have enough balance

  • #SBATCH -p free: use the free partition if you don’t have another account

  • #SBATCH --ntasks or #SBATCH --cpus-per-task: are you requesting the correct number of CPUs?

  • #SBATCH --mem or #SBATCH --mem-per-cpu: are you requesting the correct amount of memory?

  • #SBATCH --time: set a time limit that is shorter than the default runtime (see the default settings)

Similar fixes apply when using srun for interactive jobs. A corrected submit script is sketched below; see EXAMPLES for more info.
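For instance, for the pending jobs shown above, a minimal sketch of a corrected request that fits within the 1,000 hour balance (all values are illustrative):

#SBATCH -A panteater         ## or a lab account with sufficient balance
#SBATCH -p standard
#SBATCH --nodes=1
#SBATCH --ntasks=8           ## reduced from 16 CPUs
#SBATCH --time=2-00:00:00    ## 2 days: 8 cores * 48 hrs = 384 core-hours, within the 1,000 hour balance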

4. Modify jobs prior to execution

It is possible to make some changes to jobs that are still waiting to run in the queue by using the scontrol command. If changes need to be made for a running job, it may be better to kill the job and restart it after making the necessary changes.

[user@login-x:~]$ scontrol update jobid=<jobid> timelimit=<new timelimit> (1)
[user@login-x:~]$ scontrol update jobid=<jobid> qos=[low|normal|high]     (2)
1 Change the time limit. Accepted formats are minutes, minutes:seconds, hours:minutes:seconds, days-hours, days-hours:minutes, or days-hours:minutes:seconds. For example, 2-12:30 means 2 days, 12 hrs, and 30 min.
2 Change the QOS. By default, jobs are set to run with qos=normal. Users rarely need to change QOS.
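For example, to shorten a pending job's time limit to 12 hours (the job ID is illustrative):

[user@login-x:~]$ scontrol update jobid=1666961 timelimit=12:00:00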

5. Hold/Release/Cancel jobs

[user@login-x:~]$ scontrol hold <jobid>    (1)
[user@login-x:~]$ scontrol release <jobid> (2)
[user@login-x:~]$ scancel <jobid>          (3)
[user@login-x:~]$ scancel -u <username>    (4)
1 To prevent a pending job from starting.
2 To release held jobs to run (after they have accrued priority).
3 To cancel a specific job.
4 To cancel all jobs owned by a user. This only applies to jobs that are associated with your accounts.

6. Account Coordinators

In Slurm, account coordinators are users who can directly manage an account. Please see the Account Coordinators Guide.