USRC Data Sources Failure Data

Failure data from components on high performance computing systems

Description

The following are collections of data related to failures on high performance computing (HPC) systems. Many of the data dumps include READMEs that explain the datasets and describe the machines and/or systems from which the statistics were taken. Failure types include memory, processor, network, and others.

Citation

Unless a data source below offers a more specific citation, please cite these data using the following BibTeX entry:

@MISC{usrc:datasources-general,
    author = {{Los Alamos National Laboratory}},
    title = {{Ultrascale Systems Research Center (USRC) Data Sources}},
    howpublished = {\url{https://usrc.lanl.gov/data-sources.php}}
}

1995–2005 Reliability/Interrupt/Failure/Usage Data Sets

NOTE: These data are historical in nature and were originally released in 2005.

Open computer science research depends on access to operational data from real computers. Data on failure, availability, usage, environment, performance, and workload characterization are among the most badly needed by computer science researchers. The following data sets are provided under universal release for any computer science researcher to use.

All we ask is that, if you use these data in your research, you acknowledge Los Alamos National Laboratory for providing them.

All files and content available for download are covered by LA-URs listed in the filename. Each file is a tar.gz archive with a data file (either CSV or text file) and a README with some explanations.
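As a minimal sketch of how such an archive might be read, the Python below opens a tar.gz file object, keeps the README text, and parses any CSV member into rows. The function name `load_release_archive` and the rule for recognizing the README (a member whose name contains "readme") are assumptions made for illustration; check the actual member names in each download.

```python
import csv
import io
import tarfile


def load_release_archive(fileobj):
    """Return (readme_text, tables) from a tar.gz file object.

    tables maps each data member name to parsed CSV rows, or to a list of
    lines for plain-text data files. The README is recognized by name only,
    which is an assumption about how the archives are laid out.
    """
    readme_text = None
    tables = {}
    with tarfile.open(fileobj=fileobj, mode="r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories and links
            raw = archive.extractfile(member).read()
            text = raw.decode("utf-8", errors="replace")
            name = member.name.lower()
            if "readme" in name:
                readme_text = text
            elif name.endswith(".csv"):
                tables[member.name] = list(csv.reader(io.StringIO(text)))
            else:
                tables[member.name] = text.splitlines()
    return readme_text, tables
```

For a downloaded file, one could call it as `load_release_archive(open(path, "rb"))`, where `path` is wherever the archive was saved. Reading the README first is worthwhile, since column meanings differ between data sets.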

Failure data

| Description | Size gzipped (unpacked) | # Records | Link |
|---|---|---|---|
| All systems failure/interrupt data 1996-2005 | 336K (2.8M) | 23,741 | Data |
| System 20 usage with domain info | 9.9M (50M) | 489,376 | Data |
| System 20 usage with node info - nodes numbered from zero | 10M (42M) | 489,376 | Data |
| System 20 usage event info - nodes numbered from zero | 3.1M (32M) | 433,490 | Data |
| System 20 node internal disk failure info - nodes numbered from zero | 8K (16K) | 14 | Data |
| System 15 usage with node info - nodes numbered from zero | 560K (2.3M) | 17,823 | Data |
| System 16 usage with node info - nodes numbered from one | 52M (308M) | 1,630,479 | Data |
| System 23 usage with node info - nodes numbered from one | 15M (58M) | 654,927 | Data |
| System 8 usage with node info - nodes numbered from one | 14M (64M) | 763,293 | Data |
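Because the usage files above differ in whether node numbers start at zero or at one, comparing nodes across systems is easier after shifting everything to one convention. The sketch below is one way to do that; the mapping reflects only what the table states for the usage files, while the names `NODE_NUMBERING_BASE` and `to_zero_based` are hypothetical. It does not cover the all-systems failure file, whose numbering is not stated here.

```python
# First node number used by each system's usage files, as listed in the table.
NODE_NUMBERING_BASE = {
    20: 0,
    15: 0,
    16: 1,
    23: 1,
    8: 1,
}


def to_zero_based(system, node):
    """Shift a node number from a usage file to zero-based indexing."""
    if system not in NODE_NUMBERING_BASE:
        raise KeyError(f"no numbering base recorded for system {system}")
    return node - NODE_NUMBERING_BASE[system]
```

With this, node 1 of system 16 and node 0 of system 20 both map to index 0.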

Bianca Schroeder (at Carnegie Mellon University at the time) kindly provided a frequently asked questions (FAQ) document about the 1995–2005 failure data set. NOTE: These responses should be read in the context of this dataset and of the time at which they were written.

All datasets reference anonymized "system numbers." To understand the failure, usage, and event info, you also need to understand the machine and/or system layout, which we have provided in this Excel file. NOTE: Take care when interpreting the above data, as some of the systems may appear to grow (or shrink) in size for short time periods. These periods are not well documented; they arose when two systems were occasionally put together to perform a larger calculation than could be done on any one part. The machine and/or system layout data are a single snapshot in time, not a history over time, so identifying these periods may require additional effort. This likely affects only certain types of studies, and the periods were relatively short.
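One rough way to look for the undocumented growth periods is to compare the node indices seen in the usage data against the node count in the layout snapshot, and flag the days on which a node index falls outside it. The sketch below assumes records have already been reduced to (timestamp, zero-based node) pairs; that record shape, the function name `days_exceeding_layout`, and the sample values are all invented for illustration. It can only suggest periods of apparent growth, not shrinkage.

```python
from datetime import datetime


def days_exceeding_layout(records, layout_node_count):
    """Return sorted days on which a node index lies outside the layout.

    records: iterable of (timestamp, zero_based_node) pairs (assumed shape).
    layout_node_count: node count for the system in the layout snapshot.
    """
    highest_node_by_day = {}
    for timestamp, node in records:
        day = timestamp.date()
        previous = highest_node_by_day.get(day, -1)
        highest_node_by_day[day] = max(previous, node)
    return sorted(
        day
        for day, highest in highest_node_by_day.items()
        if highest >= layout_node_count
    )


# Made-up sample: a system whose layout snapshot lists 256 nodes.
sample = [
    (datetime(2003, 5, 1, 8, 0), 12),
    (datetime(2003, 5, 2, 9, 30), 300),  # index beyond the snapshot
    (datetime(2003, 5, 2, 11, 0), 17),
    (datetime(2003, 5, 3, 14, 0), 255),
]
flagged = days_exceeding_layout(sample, layout_node_count=256)
```

Days flagged this way are candidates only; they still need to be checked against the READMEs and the layout file before being treated as combined-system periods.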