USRC Focus Areas

USRC focuses on a variety of topics related to systems software in high-performance computing.  Each focus area is described below, along with the staff and interns working in it.

Machine Learning

The USRC’s machine learning efforts focus on using statistical machine learning techniques to better understand HPC systems and facilities.  This includes applying existing machine learning techniques as well as developing novel statistical models for particular HPC problems.  We are a highly interdisciplinary team, with backgrounds in machine learning, statistics, HPC systems, HPC resilience, and mathematics.  Our team works on a wide range of problems, from highly applied and dataset-specific issues (such as predicting specific hardware errors) to more theoretical machine learning problems (such as general frameworks for explainability).  Our staff and students are actively involved in developing our research results into production tools, as well as in academic publications, service on organizing and program committees, and community outreach.  Current project highlights include: interpretable and interactive analysis of computer-generated text logs, characterization of HPC behavioral modes through telemetry data, undirected graphical models for hardware error modeling, and deep neural network models of time series data from job schedulers.

Expertise

  • Statistical modeling of HPC monitoring data: memory errors, telemetry data, textual logs, etc.
  • Statistical relational learning
  • Probabilistic graphical models
  • Interpretable/explainable machine learning
  • Human language technology / natural language processing

Staff working in this area

Students

  • Alexandra DeLuica (Post-baccalaureate)
  • Randall Woodall (Summer 2018)
  • Emily Porter (Summer 2018)
  • Michael Kuchnik (Summer 2018)
  • David Huff (Summer 2018)
  • Megan Hickman (Summer 2018)

Networking

The Networking Focus Area within USRC covers research into the design, optimization, monitoring, and deployment of both intra-cluster high-speed networks and external Ethernet campus connectivity.  The benefactor and recipient of this research is the High Performance Computing Division at LANL, as well as organizations that collaborate with us.  At this time, the primary interconnects used, and therefore studied, within our compute and visualization clusters are InfiniBand and Omni-Path.  Integrating those interconnects with MPI and Lustre access is critical to the success of our HPC systems.  We also investigate the functionality and optimization of InfiniBand Storage Area Networks and the Ethernet backbone that connects other necessary services to our clusters.

Expertise

  • High Speed Fabric (InfiniBand & Omni-Path)
  • Monitoring
  • Optimization
  • Design
  • Baseline Testing and Trending of Networks
  • IO Subsystem Monitoring
  • Ethernet Backbone Support and Optimization
  • InfiniBand Storage Area Network Design

Staff working in this area

  • Jesse Martinez
  • Howard Pritchard

Students

  • Colette Caskie

Resilience

The USRC Resilience Team focuses primarily on characterization.  These efforts take several forms, including:

  • Supercomputer reliability characterization through data analytics and monitoring (system logs, application job logs, telemetry, etc.), in collaboration with the USRC Machine Learning team
  • Supercomputer hardware characterization through neutron beam experiments at the Los Alamos Neutron Science Center, for accelerated testing of hardware reliability
  • Characterization of the thermal and fast neutron environment with detectors
  • Characterization of application sensitivity to soft errors using the Parallel Fine-grained Soft Error Fault Injector (P-FSEFI), an open source tool developed by the team

Expertise

  • Software fault injection
  • Neutron beam studies / hardware characterization
  • Supercomputer memory fault, error, and failure analysis

Staff working in this area

Students

  • Megan Hickman
  • Dakota Fulp
  • Alexandra Poulos
  • Dylan Wallace
  • Spencer Ortega

Storage

The USRC Storage Team is building an advanced capability for designing, deploying, and managing storage systems for HPC platforms and data centers. Our efforts include a unified file indexing service, scale-out campaign storage systems, advanced monitoring and logging capabilities, software-defined networking to support long-distance data flows, metadata performance supporting trillions of files, and much more. All of these efforts contribute to and support LANL's goal to define and deploy a next-generation software stack for extreme-scale HPC platforms.

Expertise

  • High-performance Storage Systems
  • Campaign Storage Systems
  • Metadata Management
  • Data Center Monitoring
  • Software-defined Storage and Networking
  • Data Management for Wide-area Network Systems
  • Storage Systems Modeling

Staff working in this area

  • Mark Allen
  • Lei Cao
  • HB Chen
  • Hugh Greenberg
  • Jeff Inman
  • Dominic Manno
  • Wendy Poole
  • Brad Settlemyer (POC)
  • Scott White
  • Brian Atkinson
  • Chris DeJager
  • Jason Lee

Software

The current HPC system hardware and software design is no longer viable for large-scale HPC institutions. The system state is binary: the system is either up or down, and making changes to it requires a full system reboot. This is due to a number of factors, including tightly coupled interconnects, undefined APIs, and a lack of integration between the workload manager and system state. Overcoming these problems requires a redesign of the current HPC architecture and HPC management software.

Expertise

  • Configuration Management
  • Managing Large Scale Systems
  • Job Schedulers
  • Boot and Provisioning Systems

Staff working in this area

  • Paul Peltz
  • Lowell Wofford
  • Sean Blanchard