HPC3: High Performance Community Computing Cluster

Overview

HPC3 is the next generation of shared computing at UCI and builds upon the very successful HPC and GreenPlanet clusters. Today, HPC is strictly a "condo-style" cluster, with expansion of the physical system accomplished by researchers purchasing nodes to add capacity. HPC3 extends the condo model with the ability for UCI to grant cycles to researchers and for researchers to pre-purchase cycles by the core-hour.

Granted cycles and by-the-core-hour purchases are being added so that UCI can reach the Research Cyberinfrastructure Vision articulated by a faculty-led committee in 2016. These changes require some modification in how resources are shared. The HPC3 planning committee began its work in early 2018 and in August 2018 started to craft policy guidelines to meet the following goals:

  • Enables users to have access to a larger compute/analysis system than they could reasonably afford “on their own”

  • Enables access to specialized nodes (e.g. large memory, GPU (64-bit), deep-learning (32-bit), etc.)

  • Fosters a growing community across UCI to utilize scalable computing (HPC and HTC)* for their scientific research program and teaching

  • Provides a well-managed software environment that forms the basis of a “reproducible” scientific instrument

  • Fits “seamlessly” into the progression of desktop, lab cluster, campus, national (e.g. XSEDE), and commercial cloud

  • Enables construction of more-secure research environments

Executive Summary of HPC3 and Changes

An executive summary of HPC3 and a longer draft policy document can be consulted for a more in-depth treatment. At the heart of HPC3 sharing is core-hour accounting, where jobs are classified as either

  • accounted (time utilized by a job is tracked against a "bank" of hours)

  • free

The fundamental policy difference between accounted and free is that an accounted job may NOT be suspended or killed in favor of another job. Free jobs may be suspended or killed at any time to make room for accounted jobs.
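The rule above can be sketched in a few lines of Python. This is only an illustration of the stated policy, not the HPC3 scheduler: the `Job` class, its fields, and `pick_jobs_to_preempt` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    accounted: bool  # True: charged against a bank; False: free


def pick_jobs_to_preempt(running, cores_needed):
    """Choose free jobs to suspend or kill so an accounted job can start.

    Accounted jobs are never candidates. Returns None when free jobs
    alone cannot release enough cores (the new job must then wait).
    """
    victims = []
    freed = 0
    for job in running:
        if freed >= cores_needed:
            break
        if not job.accounted:
            victims.append(job)
            freed += job.cores
    return victims if freed >= cores_needed else None


# Hypothetical cluster state: one accounted job and two free jobs.
running = [Job("sim", 8, True), Job("scan", 4, False), Job("sweep", 4, False)]
to_preempt = pick_jobs_to_preempt(running, 6)
print([j.name for j in to_preempt])  # only the free jobs are listed
```

Note that the accounted job is skipped even when preempting it would free enough cores on its own; that asymmetry is the whole policy.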

Critical to this model is that a "bank" can be filled in a variety of ways

  • Converted. Condo node owners have the core-hours that their equipment could deliver in a year deposited into their bank at the beginning of each year the node is in the cluster.

  • Granted. UCI core funds will purchase nodes (and will likely add more over time) that create the capacity for granted hours.

  • Purchased. UCI researchers will be able to purchase pre-paid time to fill or augment their banks. While a final rate has not been set, it is expected to be about 1.25 cents per core-hour.
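To make the bank arithmetic concrete, here is a minimal sketch of the conversion and purchase calculations. The 40-cores-per-node figure is an assumption chosen for illustration (node sizes are not given here), the function names are hypothetical, and the 1.25 cents rate is the expected rate quoted above, not a final one.

```python
HOURS_PER_YEAR = 24 * 365  # 8760; ignores leap years and maintenance downtime


def converted_core_hours(nodes, cores_per_node):
    """Core-hours that condo nodes could deliver in one year."""
    return nodes * cores_per_node * HOURS_PER_YEAR


def purchase_cost_dollars(core_hours, cents_per_core_hour=1.25):
    """Cost of pre-paid core-hours at the expected (not final) rate."""
    return core_hours * cents_per_core_hour / 100.0


# Assumed example: one 40-core condo node.
print(converted_core_hours(1, 40))     # 350400 core-hours per year
print(purchase_cost_dollars(100_000))  # 1250.0 dollars for 100,000 core-hours
```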

Fair Sharing, free jobs, and other principles

Accounted jobs simply mean that HPC3 can track usage of the cluster, and that overall usage is in proportion to how much a particular research program has contributed to the physical infrastructure. For example, over each year, a research group that has purchased 10 nodes will have about 10 nodes' worth of accounted jobs to run on the cluster. Nothing limits that group to just their 10 nodes, nor does it guarantee them instantaneous access to their particular purchased nodes in the cluster. The "not limited" statement means that a group can access a larger number of nodes in bursts. The "no guarantee of instant access" statement means that there may be times of contention when a node owner has to wait. It also means that if you are running a job on a larger number of nodes than you purchased, your accounted job cannot be killed or suspended.
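The trade-off behind bursting can be worked through numerically: a bank sized for 10 nodes drains twice as fast when a group runs on 20. The sketch below assumes 40-core nodes and continuous use, both of which are illustrative assumptions; `burst_duration_hours` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365  # 8760; ignores leap years and downtime
CORES_PER_NODE = 40        # assumed node size, for illustration only


def burst_duration_hours(bank_core_hours, cores_in_use):
    """Hours of continuous accounted use before the bank is empty."""
    if cores_in_use <= 0:
        raise ValueError("cores_in_use must be positive")
    return bank_core_hours / cores_in_use


# Yearly bank for a group that purchased 10 nodes.
bank = 10 * CORES_PER_NODE * HOURS_PER_YEAR

print(burst_duration_hours(bank, 10 * CORES_PER_NODE))  # 8760.0 (a full year)
print(burst_duration_hours(bank, 20 * CORES_PER_NODE))  # 4380.0 (half a year)
```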

Human-understandable principles can be coded into "queuing policy" to achieve reasonable balance. Some of the principles include

  • Small core-count, short-duration (e.g. debugging) jobs should have little to no waiting time

  • Users who submit very large numbers of jobs at one time should not cause others to wait for all of those jobs to complete (HPC3 is not first-in, first-out)

  • Once a job has started, predictability of run time is highly desirable

  • Free (non-accounted) jobs should still be possible (and encouraged) if their impact to accounted jobs is minimal
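As one illustration of how such principles might be turned into an ordering rule, the sketch below interleaves pending jobs across users and favors smaller requests, so a single user's large backlog does not hold up everyone else. It is a toy example under stated assumptions, not the HPC3 queuing policy; the function name and the tuple layout are hypothetical.

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_user(pending):
    """Order jobs round-robin across users, smallest requests first.

    pending: list of (user, job_id, core_hours_requested) in submit order.
    Returns job ids in the order they would be considered for start.
    """
    per_user = defaultdict(list)
    for user, job_id, size in pending:
        per_user[user].append((size, job_id))

    # Within each user, smaller requests come first.
    queues = [sorted(jobs) for jobs in per_user.values()]

    order = []
    # Each round takes at most one job from every user (not first-in, first-out).
    for round_jobs in zip_longest(*queues):
        present = sorted(j for j in round_jobs if j is not None)
        order.extend(job_id for _, job_id in present)
    return order


pending = [
    ("alice", "a1", 100),
    ("alice", "a2", 100),
    ("alice", "a3", 100),
    ("bob", "b1", 1),  # small job submitted last
]
print(interleave_by_user(pending))  # ['b1', 'a1', 'a2', 'a3']
```

Here the small job submitted last is considered first, which reflects the "little to no waiting time" goal for small, short jobs.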