The REmotely-managed Power Aware Computer Systems and Services (REPACSS) resource is a high-performance computing (HPC) cluster powered by multiple energy sources. It was developed to support research into advanced data center control for running scalable scientific workflows and data-intensive research in remotely managed settings. The project focuses on improving data center and infrastructure control so that the facility can adapt to emergent conditions and adjust workloads to match data center load conditions, including the availability and cost of electrical power. The CPU infrastructure comprises 110 AMD EPYC 9754 compute nodes with access to high-speed cluster-wide storage. Each CPU compute node offers 256 cores and 1.5 TB of DDR5 memory, along with 1.92 TB of local NVMe for swap and temporary storage, which enables high-speed checkpoint and restore and local ephemeral usage. The CPU nodes are interconnected with the rest of the cluster and with storage by NVIDIA ConnectX-7 NDR InfiniBand adapters running at 200 Gbps per card, with two InfiniBand cards per node. The Hammerspace storage provides nearly 3 PB of combined NVMe and HDD storage, supporting large-scale data throughput. All nodes are controlled and provisioned through high-bandwidth Dell PowerSwitch S5248-ON and S5232-ON Ethernet switches at 25 Gbps per node. The cluster supports intelligent workload placement and adaptive scheduling tools that aim to match as much of the workload as possible to periods of low-cost energy availability. REPACSS also features advanced remote management capabilities and automation tools for managing scientific workflows, designed to be adopted at scale by other resource facilities and industry.
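For sizing purposes, the per-node figures in the description can be rolled up into nominal cluster-wide totals. The following is a minimal sketch that only multiplies the numbers stated above; the constant and function names are illustrative, the totals are nominal rather than measured, and usable capacity (for example after reserving swap space or system memory) may be lower.

```python
# Per-node figures copied from the resource description.
NODES = 110
CORES_PER_NODE = 256
MEM_TB_PER_NODE = 1.5
NVME_TB_PER_NODE = 1.92
IB_CARDS_PER_NODE = 2
IB_GBPS_PER_CARD = 200


def aggregate_totals():
    """Return nominal cluster-wide totals derived from the per-node figures."""
    return {
        "total_cores": NODES * CORES_PER_NODE,
        "total_memory_tb": NODES * MEM_TB_PER_NODE,
        # Rounded to avoid floating-point noise in the product.
        "total_local_nvme_tb": round(NODES * NVME_TB_PER_NODE, 2),
        # Sum of nominal link rates across both cards, not measured throughput.
        "ib_gbps_per_node": IB_CARDS_PER_NODE * IB_GBPS_PER_CARD,
    }


if __name__ == "__main__":
    for name, value in aggregate_totals().items():
        print(f"{name}: {value}")
```

Under these assumptions the CPU partition totals 28,160 cores and 165 TB of DDR5 memory, with a nominal 400 Gbps of InfiniBand bandwidth per node.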
Texas Tech REPACSS CPU
Resource Type
Compute
Latest Status
pre-production
Description
User Guide URL
Features
Is an ACCESS Allocated Production Compute resource
General compute use
Unique, innovative or non-traditional compute resource
Resource supports community software areas for users to share software with other users
Resource offers discounted job queues where running jobs can be preempted
Resource is allocated by ACCESS
An intuitive, innovative, and interactive interface to remote computing resources
Provides Globus data transfer and data sharing services for local storage
preemption
NSF ACSS Category 2 Resources
AI tools and support
Organization Name
Texas Tech University
Global Resource ID
repacss-cpu.ttu.access-ci.org