Reliability/Interrupt/Failure/Usage Data sets available
These data have been used in a paper from the Carnegie Mellon University (CMU) Parallel Data Lab, which is an excellent example of how the data can be used in
computer science research. The paper numbers the systems differently from the data itself; the table below gives both the
original system number in the data and the CMU paper number. Link to the CMU paper: CMU Failure Project Site

Additionally, other Los Alamos related supercomputer failure papers:
Bianca Schroeder at CMU has been kind enough to provide a FAQ about the failure data set.
Click here to see that FAQ


description                                                        size       records  name
all systems failure/interrupt data 1996-2005                       2963538    23741    LA-UR-05-7318-failure-data-1996-2005.csv
system 20 usage with domain info                                   51675641   489376   LA-UR-06-0803-MX20_NODES_0_TO_255_DOM.TXT
system 20 usage with node info; nodes number from zero             43926669   489376   LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT
system 20 event info; nodes number from zero                       33120015   433490   LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv
system 20 node internal disk failure info; nodes number from zero  209        14       LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT
system 15 usage with node info; nodes number from zero             2416139    17823    LA-UR-06-0999-MX15-NODE-Z.TXT
system 16 usage with node info; nodes number from one              321293488  1630479  LA-UR-06-1446-MX16-NODE-NOZ.TXT
system 23 usage with node info; nodes number from one              60674531   654927   LA-UR-06-1447-MX23-NODE-NOZ.TXT
system 8 usage with node info; nodes number from one               67291020   763293   LA-UR-06-3194-MX8-NODE-NOZ.TXT
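
The descriptions above note that some files number nodes from zero and others from one, which matters when node numbers from different files are compared. The following is a minimal Python sketch of one way to put them on a common zero-based footing. The mapping is copied from the descriptions in the listing; the names NODE_BASE and to_zero_based are hypothetical helpers introduced for illustration and are not part of the data release. The file formats themselves are not described here, so the sketch works on a node number that has already been read from a file.

```python
# Node-numbering base for each file, copied from the listing's descriptions.
# Files whose description does not state a base are left out on purpose.
NODE_BASE = {
    "LA-UR-06-0803-MX20_NODES_0_TO_255_NODE-Z.TXT": 0,
    "LA-UR-06-0803-MX20_NODES_0_TO_255_EVENTS.csv": 0,
    "LA-UR-06-6079-MX20_NODES_0_TO_255_NODEDISK-Z.TXT": 0,
    "LA-UR-06-0999-MX15-NODE-Z.TXT": 0,
    "LA-UR-06-1446-MX16-NODE-NOZ.TXT": 1,
    "LA-UR-06-1447-MX23-NODE-NOZ.TXT": 1,
    "LA-UR-06-3194-MX8-NODE-NOZ.TXT": 1,
}


def to_zero_based(filename: str, node: int) -> int:
    """Convert a node number read from `filename` to zero-based numbering.

    Hypothetical helper: raises KeyError for files whose numbering base is
    not stated in the listing, and ValueError for a node number below the
    file's base.
    """
    base = NODE_BASE[filename]
    if node < base:
        raise ValueError(f"node {node} is below the numbering base {base}")
    return node - base
```

For example, node 1 in the system 16 file and node 0 in the system 15 file both map to zero-based node 0 under this convention.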