
USRC Focus Areas

USRC focuses on a variety of topics related to systems software in high-performance computing (HPC). Each focus area is described below, along with the staff and interns working in it.

Machine Learning

The USRC’s machine learning efforts focus on using statistical machine learning techniques to better understand HPC systems and facilities. This includes applying existing machine learning techniques as well as developing novel statistical models for particular HPC problems. We are a highly interdisciplinary team, with backgrounds in machine learning, statistics, HPC systems, HPC resilience, and mathematics. Our team works on a wide range of problems, from highly applied and dataset-specific issues (such as predicting specific hardware errors) to more theoretical machine learning problems (such as general frameworks for explainability). Our staff and students are actively involved in developing our research results into production tools, as well as in academic publications, service on organizing and program committees, and community outreach. Current project highlights include interpretable and interactive analysis of computer-generated text logs, characterization of HPC behavioral modes through telemetry data, undirected graphical models for hardware error modeling, and deep neural network models of time series data from job schedulers.

Expertise

Statistical modeling of HPC monitoring data: memory errors, telemetry data, textual logs, etc.
Statistical relational learning
Probabilistic graphical models
Interpretable/explainable machine learning
Human language technology / natural language processing

Staff working in this area

Elisabeth (Lissa) Baseman
Sean Blanchard
Nathan DeBardeleben
Laura Monroe
Hugh Greenberg

Students

Alexandra DeLuica (Post-baccalaureate)
Randall Woodall (Summer 2018)
Emily Porter (Summer 2018)
Michael Kuchnik (Summer 2018)
David Huff (Summer 2018)
Megan Hickman (Summer 2018)

Networking

The Networking Focus Area within USRC covers all aspects of networking. This includes research into the design, optimization, monitoring, and deployment of both intra-cluster high-speed networks and external Ethernet campus connectivity. The primary beneficiary of this research is the High Performance Computing Division at LANL, along with the organizations that collaborate with us. At this time, the primary interconnects used, and therefore studied, within our compute and visualization clusters are InfiniBand and Omni-Path. Integrating those interconnects with MPI and Lustre access is critical to the success of our HPC systems. We also investigate the functionality and optimization of InfiniBand Storage Area Networks and of the Ethernet backbone that connects other necessary services to our clusters.

Expertise

High Speed Fabric (InfiniBand & Omni-Path)
Monitoring
Optimization
Design
Baseline Testing and Trending of Networks
IO Subsystem Monitoring
Ethernet Backbone Support and Optimization
InfiniBand Storage Area Network Design

Staff working in this area

Jesse Martinez
Howard Pritchard

Students

Colette Caskie

Resilience

The USRC Resilience Team focuses primarily on characterization. These efforts take several forms, including:

Supercomputer reliability characterization through data analytics and monitoring (system logs, application job logs, telemetry, etc.), in collaboration with the USRC Machine Learning team.
Supercomputer hardware characterization through neutron beam experiments at the Los Alamos Neutron Science Center, for accelerated testing of hardware reliability.
Characterization of the thermal and fast neutron environment with detectors.
Characterization of application sensitivity to soft errors using the Parallel Fine-grained Soft Error Fault Injector (P-FSEFI), an open-source tool the team developed.

Expertise

Software fault injection
Neutron beam studies / hardware characterization
Supercomputer memory fault, error, and failure analysis

Staff working in this area

Nathan DeBardeleben
Sean Blanchard
Lissa Baseman
Laura Monroe
Terry Grové
Claude "Rusty" Davis

Students

Megan Hickman
Dakota Fulp
Alexandra Poulos
Dylan Wallace
Spencer Ortega

Storage

The USRC Storage Team is building an advanced capability for designing, deploying, and managing storage systems for HPC platforms and data centers. Our efforts include a unified file indexing service, scale-out campaign storage systems, advanced monitoring and logging capabilities, software-defined networking to support long-distance data flows, metadata performance supporting trillions of files, and much more. All of these efforts contribute to and support LANL's goal to define and deploy a next-generation software stack for extreme-scale HPC platforms.

Expertise

High-performance Storage Systems
Campaign Storage Systems
Metadata Management
Data Center Monitoring
Software-defined Storage and Networking
Data Management for Wide-area Network Systems
Storage Systems Modeling

Staff working in this area

Mark Allen
Lei Cao
HB Chen
Hugh Greenberg
Jeff Inman
Dominic Manno
Wendy Poole
Brad Settlemyer (POC)
Scott White
Brian Atkinson
Chris DeJager
Jason Lee

Software

The current design of HPC system hardware and software is no longer viable for large-scale HPC institutions. System state is effectively binary: the system is either up or down, and making changes to it requires a full system reboot. This is due to a number of factors, including tightly coupled interconnects, undefined APIs, and a lack of integration between the workload manager and system state. Overcoming these problems requires a redesign of the current HPC architecture and HPC management software.

Expertise

Configuration Management
Managing Large Scale Systems
Job Schedulers
Boot and Provisioning Systems

Staff working in this area

Paul Peltz
Lowell Wofford
Sean Blanchard
