HPC3 - High Performance Community Computing Cluster

1. Overview

HPC3 is the next generation of shared computing at UCI and builds upon the very successful HPC and GreenPlanet clusters. HPC was implemented strictly as a "condo-style" cluster: expansion of the physical system was limited to researchers purchasing nodes to add capacity. HPC3 builds on top of this condo model by adding:

  • Granting of cycles to researchers from UCI-purchased hardware

  • Pre-purchase of cycles by the core-hour

Faculty can still purchase nodes, for example with grant funds. Granted cycles and by-the-core-hour purchases enable UCI to realize the Research Cyberinfrastructure Vision articulated by a faculty-led committee in 2016. These changes require some modifications to how resources are shared.

The HPC3 planning committee has crafted initial policy guidelines to meet the following goals:

  1. Enables users to have access to a larger compute/analysis system than they could reasonably afford on their own

  2. Enables access to specialized nodes (e.g. large memory, GPU)

  3. Fosters a growing community across UCI to use scalable computing (HPC and HTC) for their scientific research and teaching

  4. Provides a well-managed software environment that forms the basis of a reproducible scientific instrument

  5. Fits seamlessly into the progression of: desktop → lab cluster → campus → national (e.g. XSEDE) and commercial cloud

  6. Enables construction of more-secure research environments

2. Getting Started

Please see the HPC3 reference guide for information on getting an account, logging in, allocations, submitting jobs, using environment modules, submitting support tickets, hardware specs, purchasing and more.

3. HPC3 Executive Summary

An executive summary of HPC3 and a longer draft policy document can be consulted for a more in-depth treatment. At the heart of HPC3 sharing is core-hour accounting, in which jobs are classified as either:

  • Accounted - time used by a job is tracked against a "bank" of hours

  • Free - jobs may be killed at any time to make room for accounted jobs.

The fundamental policy difference between accounted and free is that an accounted job may NOT be suspended or killed in favor of another job.
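
From a user's point of view, the difference shows up at submission time. The following is a minimal sketch only; the partition names (standard, free) and the account name (panteater_lab) are placeholders for illustration, not necessarily the cluster's actual names, so consult the HPC3 reference guide for the real ones.

    # Accounted job: charged to a bank via a Slurm account; will not be preempted
    sbatch --partition=standard --account=panteater_lab --ntasks=40 --time=12:00:00 job.sh

    # Free job: uses idle cycles at no charge, but may be killed to make room
    sbatch --partition=free --ntasks=40 --time=12:00:00 job.sh

In either case, squeue -u $USER shows the job's state while it is queued or running.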

The bank represents computing units (e.g. core-hours or GPU-hours) and logically functions as a prepaid account. Accounts have owners, who can designate others to charge against their account. A "bank" can be filled in a variety of ways:

Granted

Nodes purchased with UCI core funds (with more likely to be added) create the capacity for granted hours.

Converted

Each node has a maximum delivery of core-hours = cores * 8760 hours/year. For node owners, 95% of this theoretical maximum is deposited annually into an account for their use on any resource in the cluster.

Purchased

UCI researchers will be able to purchase prepaid time to fill/augment their banks. While a final rate has not been set, it is expected to be about 1.1 cents/core-hour.
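
As a back-of-the-envelope illustration of how a bank fills, the shell arithmetic below assumes a hypothetical 40-core node and the anticipated (not final) 1.1 cents/core-hour rate; the core count is an assumption for the example, not a hardware specification.

    # Hypothetical 40-core node; 8760 hours in a year
    cores=40
    max_core_hours=$(( cores * 8760 ))          # 350,400 core-hours theoretical maximum
    deposit=$(( max_core_hours * 95 / 100 ))    # 332,880 core-hours deposited annually
    echo "Annual deposit for one owned node: ${deposit} core-hours"

    # Cost to buy the same capacity outright at ~1.1 cents/core-hour (rate not final)
    awk -v ch="$deposit" 'BEGIN { printf "Equivalent purchase: $%.2f\n", ch * 0.011 }'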

4. HPC3 Policies

HPC3 policies are primarily needed to address issues such as:

  • How is contention for acquiring and using resources addressed?

  • How does one balance high utilization against wait times for jobs to start?

  • What principles enable and support long-running jobs?

  • Are there ways to support priority boosting for jobs with specific deadlines (e.g. grant and publication deadlines)?

  • How can groups that contributed resources be ensured their fair share?

The questions above have no single "right" answer, which means that any policy employed on HPC3 must be tuned to balance the wide range of needs of the UCI research community. It also means that any implemented policy must remain fluid and flexible.

The RCI Vision Document provides the rationale for what research cyberinfrastructure should be and describes new features that need to be implemented. It was the output of a faculty-led committee that completed its work in 2016.

In 2018, the RCIC began the process of crafting a policy/usage document that could provide the framework for creating HPC3 and the principles by which it would run. The HPC3 subcommittee of the RCIC advisory committee edited and refined the initial version. Going forward, this document will continually be updated to reflect adjustments and refinements.

5. Fair Sharing, free jobs, and other principles

Accounted jobs simply mean that HPC3 tracks usage of the cluster and that overall usage is in proportion to how much a particular research program has contributed to the physical infrastructure. For example, over each year, a research group that has purchased 10 nodes will have about 10 nodes' worth of accounted jobs to run on the cluster.

  • Nothing limits that group to just their 10 nodes. This means that a group can access a larger number of nodes in bursts.

  • No guarantee is given that they can have instantaneous access to their particular purchased nodes in the cluster. This means that there may be times of contention when a node owner has to wait.

This combination supports the goal of access to a larger resource:

When an accounted job is running on a larger number of nodes than purchased, that job cannot be killed/suspended
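
To make the 10-node example above concrete, assume (purely for illustration) 40-core nodes: 10 nodes × 40 cores × 8760 hours/year × 0.95 ≈ 3.3 million accounted core-hours per year, which the group may spend on any mix of nodes in the cluster, including bursts larger than 10 nodes.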

Human-understandable principles can be coded into "queuing policy" to achieve a reasonable balance of access, stability, and other qualities.

Some of the sharing principles include:

  • Small core-count, short-duration (e.g. debugging) jobs should have little to no waiting time

  • Very large numbers of jobs submitted at one time by some users should not result in other users waiting for all those jobs to complete (HPC3 is not first-in, first-out)

  • Once a job has started, predictability of run time is highly desirable

  • Free (non-accounted) jobs should still be possible (and encouraged) if their impact to accounted jobs is minimal
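
One way to express principles like these is through Slurm's quality-of-service (QOS) layer. The sketch below is illustrative only, not HPC3's actual configuration; the QOS names and limits are assumptions chosen to mirror the list above.

    # Short debugging jobs: high priority, but tight wall-clock and per-user size limits
    sacctmgr add qos debug
    sacctmgr modify qos debug set Priority=1000 MaxWall=01:00:00 MaxTRESPerUser=cpu=8

    # Free jobs: low priority and preemptable, so accounted work is never displaced
    sacctmgr add qos free
    sacctmgr modify qos free set Priority=10 PreemptMode=cancel

    # Cap how many cores any single user's jobs can hold at once,
    # so a flood of submissions cannot starve everyone else
    sacctmgr modify qos normal set MaxTRESPerUser=cpu=400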

6. Transition to HPC3

The transition of users from HPC to HPC3 is expected to be straightforward.

  • Every user who has an HPC account already has an HPC3 account

  • Scalable Storage Systems: DFS and CRSP are mounted on HPC3

  • Users will need to familiarize themselves with the Slurm scheduler (see the example batch script after this list).

  • The applications managed by RCIC have been updated/recompiled.

  • User-installed code will need to be recompiled because of a fundamental OS revision
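
A minimal Slurm batch script might look like the sketch below. The partition, account, and module names are placeholders chosen for illustration; check the HPC3 reference guide for the actual values.

    #!/bin/bash
    # Request resources on an accounted partition and charge the group's bank.
    # "standard" and "panteater_lab" are placeholder names, not HPC3-specific values.
    #SBATCH --job-name=my_analysis
    #SBATCH --partition=standard
    #SBATCH --account=panteater_lab
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=4
    #SBATCH --time=02:00:00

    # Load tools from the managed software stack (module name assumed)
    module load gcc

    # Rebuild user-installed code against the new OS before running it
    make clean && make
    srun ./my_program

Submit the script with sbatch and monitor it with squeue -u $USER.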