1. Slurm workload manager

HPC3 uses Slurm as its workload manager and job scheduler. Slurm is widely used at supercomputer centers and is actively maintained.

1.1. Dos and Don’ts

The cluster is a shared resource. Please follow the Acceptable Use policy, which describes how to properly use HPC3. These rules apply to using Slurm for running jobs.

Failure to follow conduct rules may adversely impact others working on the cluster.

1.2. Slurm Accounting

Each personal and lab account has a balance that is drawn down when you run Slurm jobs.

Slurm Personal account

is created automatically when your HPC3 account is created. Every user is granted a one-time startup allowance of 1000 free CPU hours. This base allocation is there for you to become familiar with HPC3, the Slurm scheduler, and accounting.

Slurm Lab accounts

Any UCI Professor can request an HPC3 Slurm Lab account and add researchers/students to this account. The goal is that faculty who request an account are granted 200,000 CPU hours per fiscal year at no cost. This number will vary based on the number of requests and the number of nodes that RCIC has purchased.

Most jobs run on HPC3 are charged to Slurm Lab accounts because most HPC3 users are part of at least one research lab. If a lab account runs out of CPU hours, more CPU hours can be purchased via recharge.

1.2.1. Getting Slurm Lab Account

A PI may request a Slurm Lab account by sending a request to hpc-support@uci.edu and specifying the following information:

  • PI name and UCInetID

  • Names and UCInetIDs of the researchers, graduate students or other collaborators to add to the account. They will be able to charge CPU hours to the lab account.

Attention

Students and group members who wish to use Slurm Lab account

1.2.2. Accounts balances

Please learn

Allocation Units

When a job is allocated resources, the resources include CPUs and memory. Memory in each partition is allocated per-CPU core. When you request more cores, your job is allocated more memory and vice versa.

Jobs will charge core-hours or GPU-hours to the account. The costs are calculated as follows.

1 core-hour:
  is 1 allocation unit, charged for 1 CPU used for 1 hour.
  Each CPU core-hour is charged to the specified account. The default is your Slurm Personal account.
1 GPU-hour:
  is 34 allocation units, charged for:
  • 1 GPU used for 1 hour as 32 allocation units, plus
  • 2 CPUs used for 1 hour (required to run the job) as 2 units.
  Each GPU-hour is charged to a GPU-enabled account, which can only be used on GPU nodes.
1 GPU-hour (RTX6000 Pro):
  is 68 allocation units, charged for:
  • 1 GPU used for 1 hour as 64 allocation units, plus
  • 4 CPUs used for 1 hour (required to run the job) as 4 units.
  RTX6000 Pro Blackwell GPUs are in the gpu32 queue and require a gpu32 Slurm account.

A job is using               Units charged
1 CPU x 1 hr                 1
1 CPU x 6 min                0.1
10 CPU x 1 hr                10
(1 GPU + 2 CPU) x 1 hr       34
(1 RTX6000 + 4 CPU) x 1 hr   68
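The charging rules above can be sketched as a small shell function. This is only an illustration: estimate_units is a hypothetical helper, not an HPC3 or Slurm command, and the rates (1 unit per CPU-hour, 32 per GPU-hour, 64 per RTX6000 Pro GPU-hour) are copied from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an HPC3 or Slurm command): estimate allocation units.
# Usage: estimate_units <cpus> <gpus> <units_per_gpu_hour> <minutes>
# Each CPU costs 1 unit per hour; the GPU rate is passed in (32, or 64 for RTX6000 Pro).
estimate_units() {
    local cpus=$1 gpus=$2 gpu_rate=$3 minutes=$4
    awk -v c="$cpus" -v g="$gpus" -v r="$gpu_rate" -v m="$minutes" \
        'BEGIN { printf "%g\n", (c + g * r) * m / 60 }'
}

estimate_units 1 0 0 60     # 1 CPU x 1 hr               -> 1
estimate_units 1 0 0 6      # 1 CPU x 6 min              -> 0.1
estimate_units 10 0 0 60    # 10 CPU x 1 hr              -> 10
estimate_units 2 1 32 60    # (1 GPU + 2 CPU) x 1 hr     -> 34
estimate_units 4 1 64 60    # (1 RTX6000 + 4 CPU) x 1 hr -> 68
```

The five calls reproduce the rows of the table above. Actual charges are determined by the Slurm accounting on the cluster, so treat the result as an estimate.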

1.2.3. Free and Allocated Jobs

All computational processes on the cluster must be submitted as Slurm jobs.
Charging jobs to an account is new for the UCI community.
There are two types of jobs:
allocated

jobs are charged to an account. A large fraction of users will run allocated jobs and never see the limits of their accounts.

free

jobs are not charged to an account. Users who are running a very large number of free jobs are likely to have some of their jobs preempted (killed).

Slurm jobs properties

Free jobs

  • Are submitted to free, free-gpu* partitions

  • Are not charged to any account [1]. Default is a user's personal Slurm account

  • Can be killed at any time to make room for allocated jobs [2]

  • Submitted with sbatch for batch jobs

  • Submitted with srun for interactive jobs

Allocated jobs

  • Are submitted to all other partitions

  • Are charged to a specified account. Default is a user's personal Slurm account

  • Cannot be killed by any other job. Once started, will run to completion

  • Can preempt free jobs

  • Jobs with QOS normal are charged for the CPU time used

  • Jobs with QOS high are charged double the CPU time used and are placed at the front of the jobs queue [3]

  • Submitted with sbatch for batch jobs

  • Submitted with srun for interactive jobs
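As an illustration of the properties above, a minimal batch script for an allocated job might look like the following sketch. The job name and the account name panteater_lab are hypothetical placeholders; the options used (--partition, --account, --qos, --ntasks, --cpus-per-task, --time) are standard sbatch options.

```shell
#!/bin/bash
#SBATCH --job-name=alloc-demo      # hypothetical job name
#SBATCH --partition=standard       # an allocated (charged) partition; "free" would make this a free job
#SBATCH --account=panteater_lab    # hypothetical lab account; omit to charge your personal account
#SBATCH --qos=normal               # QOS high is charged double the CPU time
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=00:10:00

# SLURM_JOB_ID is set by Slurm inside a job; the fallback text is used elsewhere.
msg="Job ${SLURM_JOB_ID:-(not under Slurm)} running on $(uname -n)"
echo "$msg"
```

Saved under a file name of your choice (for example myjob.sub, a hypothetical name), it would be submitted with `sbatch myjob.sub`. An interactive session is typically started with something like `srun -p free --pty /bin/bash -i`.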

1.3. Partitions Structure

Slurm uses the term partition to signify a batch queue of resources. HPC3 has heterogeneous hardware, memory footprints, and nodes with GPUs.

The tables below show available partitions, their memory, runtime and job preemption configuration, and cost per hour in Allocation Units.

GPUs in the gpu partition can natively accelerate 64-bit floating point. GPUs in the gpu32 partition can natively accelerate 32-bit floating point, but cannot accelerate 64-bit floating point.

Table 1.4 Available CPU partitions

Partition   Default / Max               Default / Max    Cost         Job
name        memory per core             runtime          (units/hr)   preemption
free        3 GB / 18 GB                1 day / 3 day    None         Yes
standard    3 GB / 6 GB                 2 day / 14 day   1            No
highmem     6 GB / 10 GB                2 day / 14 day   1            No
hugemem     18 GB / 18 GB               2 day / 14 day   1            No
maxmem      1.5 TB/node / 1.5 TB/node   1 day / 7 day    40 / node    No

Note

You cannot submit to the standard, highmem, hugemem, or maxmem partitions with a Slurm account that ends with gpu or gpu32.

Table 1.5 Available GPU partitions

Partition    Default / Max     Default / Max    Cost                            Job
name         memory per core   runtime          (units/hr)                      preemption
gpu          3 GB / 9 GB       2 day / 14 day   34                              No
free-gpu     3 GB / 9 GB       1 day / 3 day    0                               Yes
gpu32        3 GB / 9 GB       2 day / 14 day   34 for L40S, 68 for RTX6000     No
free-gpu32   3 GB / 9 GB       1 day / 3 day    0                               Yes

Note

To submit to the gpu partition, you must have a Slurm account that ends with gpu.
To submit to the gpu32 partition, you must have a Slurm account that ends with gpu32.

Note, there is no difference in cost/core-hour for default and max memory per core.
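Because memory is allocated per core, a job's memory need puts a lower bound on the number of cores to request in a given partition. The sketch below assumes the job requests the maximum memory per core of the partition. cores_for_memory is a hypothetical helper, not a Slurm command.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a Slurm command): smallest number of cores whose
# combined per-core memory limit covers the memory a job needs.
# Usage: cores_for_memory <needed_GB> <max_GB_per_core>
cores_for_memory() {
    local needed_gb=$1 max_gb_per_core=$2
    # Integer ceiling division: ceil(needed / per_core)
    echo $(( (needed_gb + max_gb_per_core - 1) / max_gb_per_core ))
}

# A job needing 40 GB, using the max memory per core from the CPU partitions table:
echo "standard: $(cores_for_memory 40 6) cores"    # 6 GB per core  -> 7
echo "highmem:  $(cores_for_memory 40 10) cores"   # 10 GB per core -> 4
echo "hugemem:  $(cores_for_memory 40 18) cores"   # 18 GB per core -> 3
```

Since these partitions cost 1 unit per core-hour, the same 40 GB job would be charged for 7, 4 or 3 cores per hour respectively. In a submit script the per-core memory would typically be requested with the standard sbatch option --mem-per-cpu.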

1.3.1. Higher Memory

There are a few applications that need more memory than a node in the standard partition can offer. Users must be added to a specific group to access the higher memory highmem / hugemem / maxmem partitions.

If you are not a member of these groups, you will not be able to submit jobs to these partitions and the sinfo command will not show them.

Users must either:
(a) be a member of a group that purchased these node types, or
(b) clearly demonstrate that their applications require more than standard memory.

Attention

To demonstrate that your job requires more memory, submit a ticket with the following information:
- your job ID and error message
- your submit script
- the memory (in GB) that your job needs
- the output of the seff and sacct commands for your job

highmem / hugemem

There is no difference in cost/core-hour on any of the CPU partitions.

maxmem

The partition is a single 1.5 TB node that is reserved for those rare applications that really require that much memory. You can only be allocated the entire node. No free jobs run in this partition.

1.3.2. GPU-enabled

There are NO personal GPU accounts.

GPU lab accounts are not automatically given to everyone. Your faculty adviser can request a GPU Lab account. See how to request a Slurm Lab account and add a note that the request is for a GPU account.

free-gpu

Anyone can run jobs in this partition without a special account.

gpu and gpu32

You must have a GPU Lab account and you must specify it in order to submit jobs to this partition. This is because of differential charging.

gpu32

You must have a GPU32 Lab account and you must specify it in order to submit jobs to this partition. This is because of differential charging.

There are two types of GPUs in the gpu32 partition: L40S and RTX6000. Be aware that RTX6000 GPUs cost twice as many allocation units as L40S GPUs (68 vs. 34 units per GPU-hour). This is due to the 2.5X price difference between the existing L40S GPU nodes and the Blackwell GPU nodes.

Attention

RTX6000 Pro Blackwell GPUs must be explicitly requested in the gpu32 queue.
A job submission with gres request in gpu32 partition:
--gres=gpu:1 will only schedule your job on L40S GPUs.
--gres=gpu:L40S will specifically schedule your job on L40S GPUs.
--gres=gpu:RTX6000 will request the newer (and more expensive) RTX6000 GPU.
Use RTX6000 only when your job can truly benefit from it.
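A batch script that explicitly requests an RTX6000 could look like the sketch below. The job name and the account name panteater_lab_gpu32 are hypothetical placeholders. The gres value uses Slurm's type:count form, which is assumed here to be equivalent to the --gres=gpu:RTX6000 request described above with a count of one.

```shell
#!/bin/bash
#SBATCH --job-name=rtx6000-demo          # hypothetical job name
#SBATCH --partition=gpu32                # partition that holds both L40S and RTX6000 GPUs
#SBATCH --account=panteater_lab_gpu32    # hypothetical account name ending in gpu32
#SBATCH --gres=gpu:RTX6000:1             # explicitly request one RTX6000 (68 units per GPU-hour)
#SBATCH --time=01:00:00

# Slurm usually exports CUDA_VISIBLE_DEVICES for GPU jobs; outside a job it is unset.
gpu_info="Visible GPU indices: ${CUDA_VISIBLE_DEVICES:-none}"
echo "$gpu_info"
```

Replacing the gres line with --gres=gpu:L40S:1 would keep the job on the cheaper L40S GPUs.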

1.4. Node/partition Information

sinfo shows information about nodes and partitions
scontrol shows details of configuration

Use the above commands to get information about nodes and partitions. There are many command line options available; please run man sinfo and man scontrol for detailed information about options.

A few useful examples show information for:

Nodes grouped by features:
[user@login-x:~]$ sinfo -o "%33N %5c %8m %30f %10G" -e
NODELIST                          CPUS  MEMORY   AVAIL_FEATURES                 GRES
hpc3-14-[00-31],hpc3-15-[00-19,21 40    192000   intel,avx512,mlx5_ib           (null)
hpc3-19-12                        24    515000   intel,mlx4_ib                  (null)
hpc3-19-17                        64    515000   amd,epyc,epyc7551,mlx4_ib      (null)
hpc3-20-[16-20],hpc3-22-05        48    384000   intel,avx512,mlx5_ib           (null)
hpc3-21-[00-15,18-32],hpc3-22-[00 48    191000   intel,avx512,mlx5_ib,nvme,fast (null)
hpc3-20-[00-15,23,25-32],hpc3-21- 48    191000   intel,avx512,mlx5_ib           (null)
.. output cut ..
hpc3-20-24                        48    385000   intel,avx512,mlx5_ib           (null)
hpc3-24-[00-02]                   80    127000   intel,avx512,mlx5_ib,nvme,fast (null)
hpc3-l18-01                       64    515000   amd,epyc,epyc7601,mlx4_ib      (null)
hpc3-gpu-18-00                    40    386000   intel,avx512,mlx5_ib,gpugeneri gpu:V100:4
hpc3-gpu-k54-00                   64    3095000  intel,avx512,mlx5_ib,nvme,fast gpu:A30:4
hpc3-gpu-l54-[03-06]              32    256000   intel,avx512,mlx5_ib,nvme,fast gpu:A100:2
hpc3-gpu-l54-08                   32    257000   intel,avx512,mlx5_ib,nvme,fast gpu:A30:4
hpc3-gpu-16-[00-07],hpc3-gpu-17-[ 40    192000   intel,avx512,mlx5_ib,gpugeneri gpu:V100:4
hpc3-gpu-18-[03-04],hpc3-gpu-24-[ 32    256000   intel,avx512,mlx5_ib,nvme,fast gpu:A30:4
hpc3-gpu-l54-09                   32    224000   intel,avx512,mlx5_ib,nvme,fast gpu:A30:4
hpc3-gpu-k54-[06-07]              48    256000   intel,avx512,mlx5_ib,nvme,fast gpu:L40S:4
hpc3-gpu-k54-08                   48    1031000  intel,avx512,mlx5_ib,nvme,fast gpu:L40S:4
hpc3-gpu-m54-[00-02]              64    257000   amd,epyc,epyc9115,mlx5_ib,nvme gpu:RTX600
Each node by features without grouping:
[user@login-x:~]$ sinfo -o "%20N %5c %8m %20f %10G" -N
NODELIST             CPUS  MEMORY   AVAIL_FEATURES       GRES
hpc3-14-00           40    192000   intel,avx512,mlx5_ib (null)
hpc3-14-00           40    192000   intel,avx512,mlx5_ib (null)
hpc3-14-01           40    192000   intel,avx512,mlx5_ib (null)
hpc3-14-01           40    192000   intel,avx512,mlx5_ib (null)
hpc3-14-02           40    192000   intel,avx512,mlx5_ib (null)
hpc3-14-02           40    192000   intel,avx512,mlx5_ib (null)
... output cut ...
Specific single node:
[user@login-x:~]$ sinfo -o "%20N %5c %8m %20f %10G" -n hpc3-gpu-16-00
NODELIST             CPUS  MEMORY   AVAIL_FEATURES       GRES
hpc3-gpu-16-00       40    192000   intel,avx512,mlx5_ib gpu:V100:4

More detailed information is obtained with:

[user@login-x:~]$ scontrol show node hpc3-gpu-16-00
NodeName=hpc3-gpu-16-00 Arch=x86_64 CoresPerSocket=20
CPUAlloc=26 CPUEfctv=40 CPUTot=40 CPULoad=6.80
AvailableFeatures=intel,avx512,mlx5_ib
ActiveFeatures=intel,avx512,mlx5_ib
Gres=gpu:V100:4
NodeAddr=hpc3-gpu-16-00 NodeHostName=hpc3-gpu-16-00 Version=24.05.3
OS=Linux 4.18.0-477.15.1.el8_8.x86_64 #1 SMP Wed Jun 28 15:04:18 UTC 2023
RealMemory=192000 AllocMem=150720 FreeMem=39430 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=1 TmpDisk=228000 Weight=3 Owner=N/A MCS_label=N/A
Partitions=free-gpu,gpu
BootTime=2024-09-17T15:48:44 SlurmdStartTime=2024-10-22T16:04:19
LastBusyTime=2024-10-21T16:19:36 ResumeAfterTime=None
CfgTRES=cpu=40,mem=187.50G,billing=168,gres/gpu=4
AllocTRES=cpu=26,mem=150720M,gres/gpu=4
CurrentWatts=0 AveWatts=0
How many CPUs and GPUs are available in the gpu partition:
[user@login-x:~]$ sinfo -NO "CPUsState:14,Memory:9,AllocMem:10,Gres:14,GresUsed:22,NodeList:20" -p gpu
CPUS(A/I/O/T) MEMORY  ALLOCMEM GRES        GRES_USED              NODELIST
40/0/0/40     180000  122880   gpu:V100:4  gpu:V100:4(IDX:0-3)    hpc3-gpu-16-00
20/20/0/40    180000  174080   gpu:V100:4  gpu:V100:3(IDX:0-1,3)  hpc3-gpu-16-02
4/36/0/40     180000  22528    gpu:V100:4  gpu:V100:3(IDX:0,2-3)  hpc3-gpu-17-04
0/40/0/40     372000  0        gpu:V100:4  gpu:V100:0(IDX:N/A)    hpc3-gpu-18-00
4/36/0/40     180000  32768    gpu:V100:4  gpu:V100:4(IDX:0-3)    hpc3-gpu-18-01
4/36/0/40     180000  32768    gpu:V100:4  gpu:V100:4(IDX:0-3)    hpc3-gpu-18-02
4/28/0/32     245000  12288    gpu:A30:4   gpu:A30:2(IDX:0,2)     hpc3-gpu-18-03
2/30/0/32     245000  6144     gpu:A30:4   gpu:A30:1(IDX:3)       hpc3-gpu-18-04
0/32/0/32     245000  0        gpu:A30:4   gpu:A30:0(IDX:N/A)     hpc3-gpu-24-05
4/28/0/32     245000  32768    gpu:A30:4   gpu:A30:1(IDX:0)       hpc3-gpu-24-08
0/32/0/32     245000  0        gpu:A30:4   gpu:A30:0(IDX:N/A)     hpc3-gpu-k54-01
15/17/0/32    245000  46080    gpu:A100:2  gpu:A100:2(IDX:0-1)    hpc3-gpu-l54-03
0/32/0/32     245000  0        gpu:A30:4   gpu:A30:0(IDX:N/A)     hpc3-gpu-l54-07
... output cut ...

The above output shows in the columns:

CPUS(A/I/O/T): number of cores by state as “Allocated/Idle/Other/Total”
ALLOCMEM: memory already allocated to jobs
GRES: type and number of GPUs
GRES_USED: which GPUs are in use, the part after GPU type means:
* 4(IDX:0-3) all four are in use (0,1,2,3)
* 3(IDX:0,2-3) three are in use (0,2,3) and one (1) is free
* 0(IDX:N/A) all are free
NODELIST: nodes with this configuration
Detailed configuration of the standard partition:
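The column meanings above can be turned into a quick per-node summary of idle cores and free GPUs. The following is a sketch: free_resources is a hypothetical helper that assumes the six-column layout produced by the sinfo format string shown earlier, and it is demonstrated here on three lines copied from that sample output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read sinfo lines in the layout
#   CPUS(A/I/O/T) MEMORY ALLOCMEM GRES GRES_USED NODELIST
# and print idle cores and free GPUs for each node.
free_resources() {
    awk '{
        split($1, cpu, "/")     # Allocated/Idle/Other/Total
        split($4, gres, ":")    # gpu:TYPE:COUNT
        split($5, used, ":")    # gpu:TYPE:USED(IDX:...), numeric prefix is the used count
        printf "%s idle_cpus=%d free_gpus=%d\n", $6, cpu[2], gres[3] - used[3]
    }'
}

# Demo on sample lines; on the cluster one would pipe in the live output, e.g.
#   sinfo -h -NO "CPUsState:14,Memory:9,AllocMem:10,Gres:14,GresUsed:22,NodeList:20" -p gpu | free_resources
free_resources <<'EOF'
40/0/0/40     180000  122880   gpu:V100:4  gpu:V100:4(IDX:0-3)    hpc3-gpu-16-00
20/20/0/40    180000  174080   gpu:V100:4  gpu:V100:3(IDX:0-1,3)  hpc3-gpu-16-02
0/32/0/32     245000  0        gpu:A30:4   gpu:A30:0(IDX:N/A)     hpc3-gpu-24-05
EOF
```

For the three sample lines this reports 0 idle cores and 0 free GPUs on hpc3-gpu-16-00, 20 idle cores and 1 free GPU on hpc3-gpu-16-02, and 32 idle cores and 4 free GPUs on hpc3-gpu-24-05. The -h option of sinfo suppresses the header line so that awk only sees data rows.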
[user@login-x:~]$ scontrol show partition standard
PartitionName=standard
   AllowGroups=ALL AllowAccounts=ALL AllowQos=normal,high
   AllocNodes=ALL Default=YES QoS=normal
   DefaultTime=2-00:00:00 DisableRootJobs=NO ExclusiveUser=NO GraceTime=0 Hidden=NO
   MaxNodes=159 MaxTime=14-00:00:00 MinNodes=1 LLN=NO MaxCPUsPerNode=64
   Nodes=hpc3-14-[00-31],hpc3-15-[00-19,21,24-31],hpc3-17-[08-11],...
   PriorityJobFactor=100 PriorityTier=100 RootOnly=NO ReqResv=NO OverSubscribe=NO
   OverTimeLimit=0 PreemptMode=OFF
   State=UP TotalCPUs=7136 TotalNodes=159 SelectTypeParameters=CR_CORE_MEMORY
   JobDefaults=(null)
   DefMemPerCPU=3072 MaxMemPerCPU=6144
   TRES=cpu=7136,mem=35665000M,node=159,billing=7136
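The per-core memory limits of a partition can be pulled out of this output and compared with the partitions table. The sketch below uses a hypothetical helper, partition_mem_limits, and runs it on the relevant line copied from the output above. Slurm reports DefMemPerCPU and MaxMemPerCPU in megabytes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the default and maximum memory per core (in MB)
# from "scontrol show partition" output and also show them in GB.
partition_mem_limits() {
    grep -oE '(Def|Max)MemPerCPU=[0-9]+' |
        awk -F= '{ printf "%s = %d MB (%g GB)\n", $1, $2, $2 / 1024 }'
}

# Demo on the sample line; on the cluster one would run
#   scontrol show partition standard | partition_mem_limits
partition_mem_limits <<'EOF'
   DefMemPerCPU=3072 MaxMemPerCPU=6144
EOF
```

For the standard partition this gives 3072 MB (3 GB) and 6144 MB (6 GB), matching the 3 GB / 6 GB default and maximum memory per core listed for that partition.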