USRC Data Sources Failure Data
Description
The following are collections of data related to failures on high performance computing (HPC) systems. Many of the data dumps include READMEs explaining the datasets as well as descriptions of the machines and/or systems that the statistics were taken from. Failure types include memory, processor, and network failures, among others.
Citation
Please use the following citation unless the data source below offers a more specific citation:
- Los Alamos National Laboratory, “Ultrascale Systems Research Center (USRC) Data Sources,” https://usrc.lanl.gov/data-sources.php.
If you'd like to cite our data using BibTeX, please use the following snippet:
@MISC{usrc:datasources-general,
author = {{Los Alamos National Laboratory}},
title = {{Ultrascale Systems Research Center (USRC) Data Sources}},
howpublished = {\url{https://usrc.lanl.gov/data-sources.php}}
}
NOTE: These data are historical in nature and were originally released in 2005.
Open computer science research depends on access to computer operational data. Data in the areas of failure, availability, usage, environment, performance, and workload characterization are among the most needed by computer science researchers. The following sets of data are provided under universal release for any computer science researcher to use in their work.
All we ask is that if you use these data in your research, please recognize Los Alamos National Laboratory for providing these data.
All files and content available for download are covered by LA-URs listed in the filename. Each file is a tar.gz archive with a data file (either CSV or text file) and a README with some explanations.
Description | Size Gz (unpacked) | # Records | Link |
All systems failure/interrupt data 1996-2005 | 336K (2.8M) | 23,741 | Data |
System 20 usage with domain info | 9.9M (50M) | 489,376 | Data |
System 20 usage with node info - nodes number from zero | 10M (42M) | 489,376 | Data |
System 20 usage event info - nodes number from zero | 3.1M (32M) | 433,490 | Data |
System 20 node internal disk failure info - nodes number from zero | 8K (16K) | 14 | Data |
System 15 usage with node info - nodes number from zero | 560K (2.3M) | 17,823 | Data |
System 16 usage with node info - nodes number from one | 52M (308M) | 1,630,479 | Data |
System 23 usage with node info - nodes number from one | 15M (58M) | 654,927 | Data |
System 8 usage with node info - nodes number from one | 14M (64M) | 763,293 | Data |
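Since each download is a tar.gz archive holding a data file and a README, a small loader can separate the two. The following is a minimal sketch, not an official tool: the function name `read_archive` is hypothetical, and it assumes the README member's name begins with "README" and that the data file is comma-separated. Check the README in each archive, since some data files are plain text rather than CSV.

```python
import csv
import io
import tarfile


def read_archive(source):
    """Return (readme_text, {member_name: rows}) for one tar.gz archive.

    `source` is a filesystem path or a binary file-like object.
    Assumption: the README member's base name starts with "README";
    every other regular file is parsed as comma-separated text.
    """
    readme = None
    tables = {}
    # tarfile.open accepts either a path (name=) or an open stream (fileobj=).
    kwargs = {"fileobj": source} if hasattr(source, "read") else {"name": source}
    with tarfile.open(mode="r:gz", **kwargs) as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories and links
            raw = archive.extractfile(member).read()
            text = raw.decode("utf-8", errors="replace")
            base_name = member.name.rsplit("/", 1)[-1]
            if base_name.upper().startswith("README"):
                readme = text
            else:
                tables[member.name] = list(csv.reader(io.StringIO(text)))
    return readme, tables
```

Calling `read_archive("path/to/archive.tar.gz")` on a downloaded file returns the README text and the parsed rows, so the row count can be compared against the "# Records" column above.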
Bianca Schroeder (at Carnegie Mellon University at the time) was kind enough to provide a frequently asked questions (FAQ) document about the 1996–2005 failure data set. NOTE: These responses should be read in the context of this dataset and of the time at which they were written.
All datasets reference anonymized "system numbers." To interpret the failure, usage, and event info, you also need to understand the machine and/or system layout, which we have provided in this Excel file. NOTE: Take care when looking at the above data, as some of the systems may appear to grow (or shrink) in size for short time periods. This occasionally occurred when two systems were put together to perform a larger calculation than could be done on any one part. Sadly, these time periods are not well documented. The machine and/or system layout data are a single snapshot in time, not a record over time, so identifying these periods may require additional effort. Luckily, this likely only impacts certain types of studies, and the time periods were relatively short.
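One way to look for candidate periods is to flag records whose node number falls outside the layout snapshot for that system. The sketch below is an illustration only. It assumes that a temporarily enlarged system shows up as out-of-range node numbers, which the data may or may not bear out. The record keys `"system"` and `"node"` and the `node_counts` mapping are hypothetical names; the real column names and node counts come from the READMEs and the layout file. The numbering bases are taken from the dataset descriptions in the table above.

```python
from collections import defaultdict

# First node number per system, as stated in the dataset descriptions
# (systems 20 and 15 number nodes from zero; 16, 23, and 8 from one).
NODE_BASE = {20: 0, 15: 0, 16: 1, 23: 1, 8: 1}


def out_of_layout(records, node_counts, node_base=NODE_BASE):
    """Group records whose node number is outside the layout snapshot.

    records     -- iterable of dicts with hypothetical keys "system" and "node"
    node_counts -- {system number: node count}, filled in from the layout file
    node_base   -- {system number: first node number}; unlisted systems use 0
    """
    flagged = defaultdict(list)
    for record in records:
        system, node = record["system"], record["node"]
        first = node_base.get(system, 0)
        last = first + node_counts[system] - 1
        if not first <= node <= last:
            flagged[system].append(record)
    return dict(flagged)
```

Timestamps on the flagged records would then bound the periods worth a closer look. A record flagged this way could also be a data-entry error, so treat the result as a starting point rather than a finding.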