HPC3 - How to Expand Your Allocation through Core-hour or Hardware Purchase

1. Baseline Allocation

UCI has purchased both CPU and GPU nodes as part of HPC3. Annual funds are used to expand this resource, enabling RCIC to provide no-cost allocations to a larger fraction of the UCI research community.

  1. Faculty may request a no-cost allocation. These accounts are refilled (re-allocated) on a semi-annual basis (every 6 months). Re-allocation is based upon a lab’s previous 6 month utilization and the available hours on UCI-purchased hardware. Faculty allocations may be shared with anyone but may not be combined into larger group or "center" accounts.

  2. Every user is given a one-time allocation of 1000 core hours. A Slurm allocation is required for access to the "free" queues, so this one-time allocation gives users who are not affiliated with any research program meaningful access to HPC3. Most users should use lab accounts instead of this one-time allocation.

2. HPC3: Expanding your allocation through hardware or core-hour purchase

HPC3 is made up of hardware purchased by UCI and purchased by researchers (usually from grant funds). Since the no-cost allocation may not be enough computing time for some labs, there are two options to acquire more hours for accounted jobs.

  1. Purchasing Core hours from RCIC

  2. Purchasing hardware that is placed into the cluster

2.1. Purchasing Core hours

Core-hour purchases are done through an MOU on a prepaid card basis. RCIC does not post-bill for core hours, so it is not possible to be surprised with a large bill at the end of a month. Prepaid core hours are intended to be used within one calendar year. Unused prepaid hours are forfeited after 18 months. The current rates[1] are:

  1. $0.01/Core Hour (CPU)

  2. $0.32/GPU Hour (NVidia V100 GPU)

These rates are computed to recover the cost of hardware over a 5 year period at 60% use.
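As a rough budgeting aid, the rates above can be turned into a small cost estimate. This is a minimal sketch, not RCIC's billing logic: the function name `prepaid_cost` is hypothetical, the footnoted discount of "approximately 20%" is treated as exactly 20%, and it is assumed that the discount applies to the whole purchase once the list price exceeds $5K.

```python
# Rates quoted in this section.
CPU_RATE = 0.01   # dollars per core hour (CPU)
GPU_RATE = 0.32   # dollars per GPU hour (NVidia V100)

# Assumptions taken from the footnote: "approximately 20%" is modeled as
# exactly 20%, applied to the entire purchase above the threshold.
DISCOUNT_THRESHOLD = 5000.0   # dollars
DISCOUNT = 0.20


def prepaid_cost(core_hours, gpu_hours=0.0):
    """Estimate the prepaid cost in dollars (hypothetical helper)."""
    list_price = core_hours * CPU_RATE + gpu_hours * GPU_RATE
    if list_price > DISCOUNT_THRESHOLD:
        return list_price * (1.0 - DISCOUNT)
    return list_price


if __name__ == "__main__":
    for hours in (300_000, 600_000):
        print(f"{hours:,} core hours: ${prepaid_cost(hours):,.2f}")
```

Under these assumptions, 300,000 core hours stay below the discount threshold, while 600,000 core hours cross it. Confirm the actual discount terms with RCIC before budgeting.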

2.2. Hardware Purchase - Conversion to Core-hours

Purchasing nodes in HPC3 does NOT give you a "private" queue. This means that some of your jobs may have to wait for resources, but it also gives you non-preemptable access to a larger number of cores than you purchased. In the first 1.5 years of HPC3 operation, owners rarely waited for long periods of time.

Hardware that you purchase is converted to allocable core hours. NOTE: The conversion rate is 95% of the theoretical core hours your hardware could deliver in a year. For example, a 40-core node can deliver 8760 hours/year * 40 cores = 350,400 core hours/year. At 95% this becomes a 332,880 core-hour credit.

The 95% factor accounts for usual annual downtime through scheduled and unscheduled maintenance. 50% of this credit is applied at each 6 month reallocation, for every year the node is in the cluster (warranty period + 1 year).
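The conversion described above can be checked with a few lines of arithmetic. The sketch below uses hypothetical function names (`annual_credit`, `reallocation_credit`) and integer math so that the 40-core example reproduces exactly.

```python
HOURS_PER_YEAR = 8760      # 365 days * 24 hours
CONVERSION_PERCENT = 95    # share of theoretical core hours that is credited


def annual_credit(cores):
    """Core hours credited per year for a node with the given core count."""
    return cores * HOURS_PER_YEAR * CONVERSION_PERCENT // 100


def reallocation_credit(cores):
    """Core hours applied at each 6 month reallocation (half the annual credit)."""
    return annual_credit(cores) // 2


if __name__ == "__main__":
    print(annual_credit(40))        # 332880
    print(reallocation_credit(40))  # 166440
    print(annual_credit(48))        # 399456
```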

2.2.1. Hardware purchase process

Hardware is purchased through RCIC and can be requested at any time. When a large enough number of nodes have been requested (at least 4 CPU nodes and/or 1 GPU node), RCIC will obtain quote(s) from vendors for acceptable hardware. Your source of funds (grant or other) is used to fund your share of the purchase. You can only purchase whole nodes, but you may use multiple sources of funds. In this model, we easily support two different faculty splitting the cost of a single node.

Hardware is commodity-based and subject to market variability. CPU nodes (48 cores, 2022) are approximately $10K. GPU nodes (4 x NVidia A30, 2022) are approximately $35K.
Outline of Purchase Process
  1. Send a message to hpc-support indicating your interest in purchasing nodes and time frame

  2. RCIC obtains quotes once enough requests have been aggregated.

  3. Upon your approval, the purchase is made. Your funds for the hardware are used at purchase time.

  4. A one-time integration fee of $1000/node is recharged AFTER the hardware has arrived.

You may not purchase hardware, send it to the machine room, and then expect RCIC to integrate it into HPC3. Any hardware purchased outside of the above process will not be integrated into HPC3, nor will it be managed by RCIC.

2.2.2. What does hardware purchase get me?

  • A larger bank of core hours. RCIC does not allocate more core hours than the cluster can physically deliver. When you add hardware, you expand the capacity of the system.

  • Hardware hours are added to your no-cost allocation.

  • Consuming your core hours does not have to be 7x24. Idle private hardware represents never-to-be-recovered lost computing capacity. Your bank of hours can instead be spent in bursts.

  • The least-expensive way to add a large number of core hours. For a 48-core node with a 5 year warranty, about 2.4 million core hours would be credited over 6 years. At $11.5K (node, tax, integration), this is roughly $0.005/core-hour.
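The cost-per-core-hour figure in the last bullet follows from the 95% conversion rate and a lifetime of warranty plus one year. A minimal sketch, with hypothetical helper names and the $11.5K total taken from the bullet:

```python
HOURS_PER_YEAR = 8760
CONVERSION_PERCENT = 95   # share of theoretical core hours that is credited


def lifetime_credit(cores, warranty_years):
    """Total core hours credited while the node is in the cluster."""
    years_in_cluster = warranty_years + 1   # warranty period + 1 year
    per_year = (cores * HOURS_PER_YEAR * CONVERSION_PERCENT) // 100
    return per_year * years_in_cluster


def cost_per_core_hour(total_cost, cores, warranty_years):
    """Effective dollars per credited core hour for a purchased node."""
    return total_cost / lifetime_credit(cores, warranty_years)


if __name__ == "__main__":
    print(lifetime_credit(48, 5))                              # 2396736
    print(f"{cost_per_core_hour(11_500, 48, 5):.4f}")          # 0.0048
```

This is about half the $0.01/core-hour prepaid rate, which is why hardware purchase is described as the least-expensive option for large needs.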

With UCI-purchased hardware acting as an effective buffer and a policy of no over-allocation, about 20% of the cores would still be unused even if all node owners utilized 100% of their allocation.

3. Should I purchase hardware? Buy cores? Neither?

Our goal is to enable you to make efficient use of your grant or other funds. RCIC recovers only the cost of hardware through its recharges. We are fortunate at UCI that the people cost of administration is supported centrally by campus. Here is advice that we believe is universally applicable:

  1. Prioritize your jobs into ones that can be killed (and hence start over from the beginning) and those that are more important. Use the free queues for your lower priority work and your allocation for higher-priority work. This will allow you to make effective use of the free queues.

  2. Analyze your own usage to get a good feel for how long a typical job takes and how much memory it requires. For single-node jobs, use the command seff -j <jobid> to easily find this information after a job has run. Note: seff doesn’t produce accurate results for multi-node jobs.

  3. Estimate how much computing time you need in the allocated (non-killable) category. A typical 40-core node will credit your account with about 330,000 core hours/year. Determine how many whole nodes you need to purchase to meet your computing requirements.

  4. If your estimate is less than 200K core hours/year, you can mostly rely upon the no-cost cycles (reminder, this allocation is per research lab, not per user).

  5. If your estimate is more than 200K core hours/year but less than 400K core hours/year, core-hour purchase will likely be the most cost-effective way to expand usage.

  6. If your estimate is more than 400K core hours/year, then purchasing hardware is the least expensive way to obtain additional hours.

  7. You should only purchase what your lab needs.
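The thresholds in the list above can be summarized as a simple decision rule. This is only a sketch of the guidance, not an official tool: the function names are hypothetical, and the handling of estimates that land exactly on 200K or 400K is an assumption, since the text does not specify it.

```python
import math

HOURS_PER_YEAR = 8760
CONVERSION_PERCENT = 95   # share of theoretical core hours that is credited


def recommend(core_hours_per_year):
    """Map a yearly estimate of allocated core hours to the advice above."""
    if core_hours_per_year < 200_000:
        return "no-cost allocation"
    if core_hours_per_year <= 400_000:   # boundary handling is an assumption
        return "core-hour purchase"
    return "hardware purchase"


def whole_nodes_needed(core_hours_per_year, cores_per_node=40):
    """Number of whole nodes whose yearly credit covers the estimate."""
    per_node = (cores_per_node * HOURS_PER_YEAR * CONVERSION_PERCENT) // 100
    return math.ceil(core_hours_per_year / per_node)


if __name__ == "__main__":
    for estimate in (150_000, 300_000, 1_000_000):
        print(estimate, recommend(estimate), whole_nodes_needed(estimate))
```

For example, an estimate of 1,000,000 core hours/year falls in the hardware category and needs 4 whole 40-core nodes, since 3 nodes would credit slightly less than 1,000,000 hours.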

4. Saving or Spending core hours?

You should spend your allocation. Core-hour accounting has many positives, but to work effectively, the UCI community as a whole needs to spend their allocations at a regular rate.

5. Node Types, Approximate Costs, Lifetime of nodes

There are two node types to consider. We give "ballpark" cost estimates that should be sufficient for rough budgeting. Actual costs are commodity market-driven and require firm quotes from vendors. As time progresses, the RCIC executive/advisory committees will evaluate other hardware configurations.

CPU-only nodes
  • Dual-Socket, Intel Ice Lake 6336Y 24-core CPU@2.4GHz. 48 Cores total. 256GB Memory, EDR Infiniband, 10GbE Ethernet. Dell or HPE. ~$10000.00

Table 1. Standard Compute Node Sample Configuration

Component               Description
Chassis                 Dell R650 1RU with Dual Power Supplies
Processor x 2           Intel Xeon Gold 6336Y 24-core CPU@2.4GHz. 48 Cores total
Memory                  16 x 16GB ECC 3200MT/s (DDR4-3200) RDIMMs Dual Rank
Interconnect            100Gb/s Mellanox ConnectX-6 HDR Infiniband
Scratch Disk            1.92 TB NVMe Solid State Drive
Operating System Disk   480 GB mixed-use SATA Solid-State Drive
Ethernet                10Gb/s SFP+
Warranty                5-year Next-Business Day

GPU-Enabled Nodes
  • Dell (or similar) chassis. 2RU. Dell R750xa with up to four A30 GPUs/chassis. ~$35000

Table 2. Standard GPU Node Sample Configuration

Component               Description
Chassis                 Dell R750xa 2RU with Dual 2KW Power Supplies
Processor x 2           Intel Xeon Gold 6326 16-core CPU@2.9GHz. 32 Cores total
GPUs x 4                Nvidia A30. 24GB HBM, 933GB/s, 9216 CUDA Cores
Memory                  16 x 16GB ECC 3200MT/s (DDR4-3200) RDIMMs Dual Rank
Interconnect            100Gb/s Mellanox ConnectX-6 HDR Infiniband
Scratch Disk            1.92 TB NVMe Solid State Drive
Operating System Disk   480 GB mixed-use SATA Solid-State Drive
Ethernet                10Gb/s SFP+
Warranty                5-year Next-Business Day

Options beyond baseline Configs

Technology is always changing. Users may opt for additional memory per node (512GB, 1024GB) at additional cost. Please note that you are not guaranteed access to your node, but higher core count nodes give you more core credits. We also recognize that different grant budgets sometimes come with special constraints. RCIC will work with you during purchase.

Integration Fee

$1000/node. This is a one-time cost that covers connection to three different networks.

Ongoing Administrative Costs

None.

Lifetime in Cluster

Period of Warranty + 1 year. Most CPU nodes are purchased with 5 year warranties (a six year lifetime in the cluster). GPU nodes are purchased with 4 or 5 year warranties. If a node breaks in the extension year and isn’t easily repairable, it will be removed.

Disposition after Lifetime

If the hardware is still viable and space/power are not a concern, the node may run longer but will not generate core-hour credits for the original purchaser. In essence, it would add capacity to the "free queues".


1. For purchases above $5K, these rates are discounted approximately 20%