Hardware FAQ

Purchasing

  1. What does hardware purchase get me?

    1. A larger bank of core hours. RCIC does not allocate more core hours than the cluster can physically deliver. When you add hardware, you expand the capacity of the system.

    2. Hardware hours are added to your no-cost allocation.

    3. Consuming your core hours does not have to be 7x24. Idle private hardware represents never-to-be-recovered lost computing capacity. Your bank of hours can instead be spent in bursts.

    4. The least-expensive way to add a large number of core hours. Using a 48-core node with a 5-year warranty, about 2.4 million core hours would be credited over 6 years. At $11.5K (node, tax, integration), this is roughly $0.005/core-hour.

    Because UCI-purchased hardware acts as a buffer and RCIC does not over-allocate, about 20% of the cores would still be unused even if all node owners utilized 100% of their allocation.
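The cost figure in item 4 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the "credited fraction" is simply the FAQ's 2.4 million figure divided by the physical ceiling of the node, not an official RCIC formula.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 wall-clock hours in a non-leap year


def max_core_hours(cores: int, years: float) -> float:
    """Upper bound: every core busy for every hour of the node's lifetime."""
    return cores * HOURS_PER_YEAR * years


def cost_per_core_hour(total_cost: float, credited_core_hours: float) -> float:
    """Dollars paid per credited core hour."""
    return total_cost / credited_core_hours


if __name__ == "__main__":
    cores, years = 48, 6      # 5-year warranty plus 1 extension year
    cost = 11_500             # node, tax, integration (figure from the FAQ)
    credited = 2_400_000      # approximate credit quoted in the FAQ

    ceiling = max_core_hours(cores, years)
    print(f"Physical ceiling:   {ceiling:,.0f} core hours")
    print(f"Credited fraction:  {credited / ceiling:.0%}")
    print(f"Cost per core hour: ${cost_per_core_hour(cost, credited):.4f}")
```

With the FAQ's numbers the physical ceiling is 2,522,880 core hours, so the quoted 2.4 million is about 95% of it, and the cost works out to just under half a cent per core hour.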

  2. Should I purchase hardware? Buy cores? Neither?

    Our goal is to enable you to make efficient use of your grant or other funds. RCIC recovers only the cost of hardware through its recharges. We are fortunate at UCI that the people cost of administration is supported centrally by campus. Here is advice that we believe is universally applicable:

    1. Prioritize your jobs into ones that can be killed (and hence start over from the beginning) and those that are more important. Use the free queues for your lower priority work and your allocation for higher-priority work. This will allow you to make effective use of the free queues.

    2. Analyze your own usage to get a good feel for how long a typical job takes and how much memory it requires. For single-node jobs, use the command seff -j JOBID to find this information after a job has run. Note that seff doesn’t produce accurate results for multi-node jobs.

    3. Estimate how much computing time you need in the allocated (non-killable) category. A typical 40-core node will credit your account with about 330,000 core hours/year. Determine how many whole nodes you need to purchase to meet your computing requirements.

    4. If your estimate is less than 200K core hours/year, you can mostly rely upon the no-cost cycles (reminder, this allocation is per research lab, not per user).

    5. If your estimate is more than 200K core hours/year but less than 400K core hours/year, core-hour purchase will likely be the most cost-effective way to expand usage.

    6. If your estimate is more than 400K core hours/year, then purchasing hardware is the least expensive way to obtain additional hours.

    7. You should only purchase what your lab needs.
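The thresholds in items 4 to 6 and the per-node credit in item 3 can be sketched as a small helper. This is a rough illustration, not an RCIC tool: the function names are hypothetical, and the handling of estimates that land exactly on 200K or 400K is an assumption, since the FAQ does not specify it.

```python
import math

CREDIT_PER_40_CORE_NODE = 330_000   # core hours/year, from the FAQ
NO_COST_LIMIT = 200_000             # below this, rely on no-cost cycles
CORE_HOUR_PURCHASE_LIMIT = 400_000  # above this, hardware is cheapest


def recommend(core_hours_per_year: float) -> str:
    """Map a yearly estimate of allocated (non-killable) work to the FAQ's advice.

    Assumption: an estimate of exactly 200K or 400K falls into the higher tier.
    """
    if core_hours_per_year < NO_COST_LIMIT:
        return "no-cost allocation"
    if core_hours_per_year < CORE_HOUR_PURCHASE_LIMIT:
        return "core-hour purchase"
    return "hardware purchase"


def nodes_needed(core_hours_per_year: float,
                 credit_per_node: float = CREDIT_PER_40_CORE_NODE) -> int:
    """Whole nodes whose yearly credit covers the estimate."""
    return math.ceil(core_hours_per_year / credit_per_node)


if __name__ == "__main__":
    for estimate in (150_000, 300_000, 900_000):
        print(f"{estimate:>9,} core hours/year -> {recommend(estimate)}")
    print("Nodes for 900,000 core hours/year:", nodes_needed(900_000))
```

For a lab estimating 900,000 core hours/year, the helper suggests hardware purchase and three 40-core nodes, since two nodes would credit only about 660,000 core hours/year.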

  3. Saving or Spending core hours?

    You should spend your allocation! Core-hour accounting has many positives, but to work effectively, the UCI community as a whole needs to spend their allocations at a regular rate.

Node type

There are two node types to consider. We give “ballpark” cost estimates that should be sufficient for rough budgeting. Actual costs are driven by the commodity market and require firm quotes from vendors. As time progresses, the RCIC executive/advisory committees will evaluate other hardware configurations. These estimates are current as of January 2025.

CPU-only nodes

Dual-socket, 5th Gen Intel Xeon Scalable (Xeon Gold 6542Y) processors, 256GB memory, HDR Infiniband, 10GbE Ethernet, local solid-state storage. Dell, Lenovo, or HPE. Price: ~$15,000.

Table 1 Standard Compute Node Sample Configuration

Component               Description
Chassis                 Dell R660 1RU with Dual Power Supplies
Processor x 2           Intel Xeon Gold 6542Y 24-core CPU @ 2.9GHz (48 cores total)
Memory                  16 x 16GB ECC 5600MT/s (DDR5-5600) RDIMMs, Single Rank
Interconnect            100Gb/s Mellanox ConnectX-6 HDR Infiniband
Scratch Disk            1.92 TB NVMe Solid-State Drive
Operating System Disk   960 GB NVMe Solid-State Drive
Ethernet                10Gb/s SFP+
Warranty                5-year Next-Business Day

GPU-Enabled Nodes

Dell (or similar) 2RU chassis, such as the Dell R760xa, with up to four L40S GPUs per chassis. Price: ~$47,000.

Table 2 Standard GPU Node Sample Configuration

Component               Description
Chassis                 Dell R760xa 2RU with Dual 2KW Power Supplies
Processor x 2           Intel Xeon Gold 6526Y 16-core CPU @ 2.8GHz (32 cores total)
GPUs x 4                Nvidia L40S, 48GB GDDR6, 864GB/s, 18176 CUDA Cores
Memory                  16 x 16GB ECC 5600MT/s (DDR5-5600) RDIMMs, Single Rank
Interconnect            100Gb/s Mellanox ConnectX-6 HDR Infiniband
Scratch Disk            1.92 TB NVMe Solid-State Drive
Operating System Disk   960 GB NVMe Solid-State Drive
Ethernet                10Gb/s SFP+
Warranty                5-year Next-Business Day

Options beyond baseline Configs

Technology is always changing. Users may opt for additional memory per node (512GB, 1024GB) at additional cost. Please note that you are not guaranteed access to your node, but higher core count nodes give you more core credits. We also recognize that different grant budgets sometimes come with special constraints. RCIC will work with you during purchase.

Table 3 Additional Configuration

Option                         Description
Integration Fee                $1000/node. This is a one-time cost that covers connection to three different networks.
Ongoing Administrative Costs   None
Lifetime in Cluster            Period of warranty + 1 year. Most CPU nodes are purchased with 5-year warranties (a six-year lifetime in the cluster). GPU nodes are purchased with 4- or 5-year warranties. If a node breaks in the extension year and isn’t easily repairable, it will be removed.
Disposition after Lifetime     If the hardware is still viable and space/power are not a concern, the node may run longer but will not generate core-hour credits for the original purchaser. In essence, it would add capacity to the “free queues”.

Network type

  • The 10Gbit/s Ethernet network is the provisioning and control network and is also used to access Ethernet-only resources.

  • The 100Gbit/s ConnectX-6 HDR Infiniband network is a 2-level Clos topology with a maximum 8:1 oversubscription. Nodes in the same rack (max 32) are connected to a full-bisection, 36-port Infiniband switch. Each lower-level switch is connected to two root-level switches with two links per root switch, for four uplinks in total. The subnet manager is opensm with LMC (LID Mask Control) set to 2 for multi-path diversity.
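The 8:1 figure follows directly from the port counts given above, and the LMC setting determines how many LIDs (and therefore distinct routes) each port can be assigned. The sketch below works through that arithmetic; the function names are hypothetical and it assumes every node link and uplink runs at the same speed.

```python
def oversubscription(node_ports: int, root_switches: int,
                     links_per_root: int) -> float:
    """Ratio of node-facing ports to uplinks on one lower-level switch.

    Assumes all links run at the same speed.
    """
    uplinks = root_switches * links_per_root
    return node_ports / uplinks


def lids_per_port(lmc: int) -> int:
    """With LID Mask Control set to lmc, each port is assigned 2**lmc LIDs."""
    return 2 ** lmc


if __name__ == "__main__":
    nodes, roots, links = 32, 2, 2
    uplinks = roots * links
    print(f"Switch ports used: {nodes + uplinks} of 36")
    print(f"Oversubscription:  {oversubscription(nodes, roots, links):.0f}:1")
    print(f"LIDs per port:     {lids_per_port(2)}")
```

A full rack uses 32 node ports plus 4 uplinks, filling the 36-port switch, and LMC = 2 yields 4 LIDs per port, which matches the 4 uplinks available from each lower-level switch.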