USRC Focus Areas
Machine Learning
The USRC’s machine learning efforts focus on using statistical machine learning techniques to better understand HPC systems and facilities. This includes applying existing machine learning techniques as well as developing novel statistical models for particular HPC problems. We are a highly interdisciplinary team, with backgrounds in machine learning, statistics, HPC systems, HPC resilience, and mathematics. Our team works on a wide range of problems, from highly applied, dataset-specific issues (such as predicting specific hardware errors) to more theoretical machine learning problems (such as general frameworks for explainability). Our staff and students are actively involved in developing our research results into production tools, as well as in academic publications, service on organizing and program committees, and community outreach.
Expertise
- Statistical modeling of HPC monitoring data: memory errors, telemetry data, textual logs, etc.
- Statistical relational learning
- Probabilistic graphical models
- Interpretable/explainable machine learning
- Human language technology / natural language processing
Staff working in this area
- Elisabeth (Lissa) Baseman
- Sean Blanchard
- Nathan DeBardeleben
- Laura Monroe
- Hugh Greenberg
Students
- Alexandra DeLuica (Post-baccalaureate)
- Randall Woodall (Summer 2018)
- Emily Porter (Summer 2018)
- Michael Kuchnik (Summer 2018)
- David Huff (Summer 2018)
- Megan Hickman (Summer 2018)
Networking
The Networking Focus Area within USRC concerns itself with all things networking. This includes research into the design, optimization, monitoring
Expertise
- High Speed Fabric (InfiniBand & Omni-Path)
  - Monitoring
  - Optimization
  - Design
- Baseline Testing and Trending of Networks
- IO Subsystem Monitoring
- Ethernet Backbone Support and Optimization
- InfiniBand Storage Area Network Design
Staff working in this area
- Jesse Martinez
- Howard Pritchard
Students
- Colette Caskie
Resilience
The USRC Resilience Team focuses primarily on characterization. These characterization efforts take several forms, including:
- Supercomputer reliability characterization through data analytics and monitoring in collaboration with the USRC Machine Learning team (system logs, application job logs, telemetry, etc.)
- Supercomputer hardware characterization through neutron beam experiments at the Los Alamos Neutron Science Center for accelerated testing of hardware reliability.
- Characterization of the thermal and fast neutron environment with detectors.
- Characterization of application sensitivity to soft errors using the Parallel Fine-grained Soft Error Fault Injector (P-FSEFI), an open-source tool developed by the team.
Expertise
- Software fault injection
- Neutron beam studies / hardware characterization
- Supercomputer memory fault, error, and failure analysis
Staff working in this area
- Nathan DeBardeleben
- Sean Blanchard
- Lissa Baseman
- Laura Monroe
- Terry Grové
- Claude "Rusty" Davis
Students
- Megan Hickman
- Dakota Fulp
- Alexandra Poulos
- Dylan Wallace
- Spencer Ortega
Storage
The USRC Storage Team is building an advanced capability for designing, deploying, and managing storage systems for HPC platforms and data centers. Our efforts include a unified file indexing service, scale-out campaign storage systems, advanced monitoring and logging capabilities, software-defined networking to support long-distance data flows, metadata performance supporting trillions of files, and much more. All of these efforts contribute to and support LANL's goal to define and deploy a next-generation software stack for extreme-scale HPC platforms.
Expertise
- High-performance Storage Systems
- Campaign Storage Systems
- Metadata Management
- Data Center Monitoring
- Software-defined Storage and Networking
- Data Management for Wide-area Network Systems
- Storage Systems Modeling
Staff working in this area
- Mark Allen
- Lei Cao
- HB Chen
- Hugh Greenberg
- Jeff Inman
- Dominic Manno
- Wendy Poole
- Brad Settlemyer (POC)
- Scott White
- Brian Atkinson
- Chris DeJager
- Jason Lee
Software
The current HPC system hardware and software design
Expertise
- Configuration Management
- Managing Large Scale Systems
- Job Schedulers
- Boot and Provisioning Systems
Staff working in this area
- Paul Peltz
- Lowell Wofford
- Sean Blanchard