LANL has recently built a prototype Parallel Log-Structured File System (PLFS) to address small strided writes from N processes to one file. This software may significantly increase HPC checkpoint speeds for small strided N-to-1 I/O patterns. Click here
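The N-to-1 strided pattern described above can be sketched as follows. This is a minimal illustration of the offset arithmetic only, not LANL code; the record size and function name are assumptions made for the example.

```python
# Sketch of an N-to-1 strided write pattern (illustration only, not PLFS code):
# N processes each write small records into one shared file, interleaved so
# that process r owns records r, r + N, r + 2N, ... of the file.

RECORD_SIZE = 4096  # bytes per small write; an assumed value for illustration


def strided_offsets(rank, nprocs, nrecords, record_size=RECORD_SIZE):
    """Byte offsets written by process `rank` in an N-to-1 strided pattern."""
    return [(i * nprocs + rank) * record_size for i in range(nrecords)]


# Example: with 4 processes writing 3 records each, process 1 writes at
# byte offsets 4096, 20480, and 36864 -- small, widely spaced writes that
# shared-file systems handle poorly and that PLFS is designed to reorganize.
print(strided_offsets(1, 4, 3))  # [4096, 20480, 36864]
```

Because every process's writes are small and interleaved with every other process's, the file system sees many tiny non-contiguous writes to one file, which is the pattern this software targets.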
LANL has recently released file system statistics data for its scratch file systems. This data outlines the number of files in, and the overall shape of, supercomputer scratch file systems. LANL has released similar data for thousands of workstations across the lab. Click here
LANL has recently released parallel I/O traces to assist researchers. These are traces of parallel I/O benchmarks that reproduce I/O patterns representative of some supercomputer applications. Click here
LANL started a recent trend in computer failure data release.
LANL released nearly 10 years of failure, availability, and usage data for over 20 supercomputers.
In some cases, the data released comprises the complete life of a machine.
Before the LANL release, the only failure data releases of any size dated from the Digital VAX era
and covered small numbers of VAX machines for a few months.
The LANL release dwarfs all past releases, with over 23,000 interrupts over 9 years on thousands
of machines. Additionally, the inclusion of usage data alongside the failure data adds considerable value.
The LANL failure data has already been used by researchers at Carnegie Mellon University, Garth Gibson and
Bianca Schroeder, for two papers, "A Large Scale Study of Failures in High-performance Computing Systems"
and "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?".
The paper analyzing the LANL disk failures won the best paper award at the recent File and Storage
Technologies conference (FAST) in San Jose, CA, and has already made a splash with disk vendors and
the non-HPC community through two articles based on the research in the electronic version of Computer World:
"Disk Drive Failures 15 Times What Vendors Say, Study Says" and "Hard Data". Some interesting findings from
the CMU paper include: the traditionally accepted belief that failure rates follow a bathtub curve is
false, with annual replacement rates instead rising steadily; the manufacturers' MTBF numbers are overstated by a
factor of 2-10 even for new drives, while the MTBF of 5-8 year old drives is some 30 times lower than advertised;
and failure rates across disk technologies (SATA/SCSI/FC) are very similar, contrary to vendor assertions
and popular belief.
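To see why overstated MTBF numbers matter, the datasheet arithmetic can be sketched as follows. This is an illustrative back-of-the-envelope calculation, not a result from the CMU paper; the function name and the example MTTF figure are assumptions made for the example.

```python
# What a datasheet MTTF implies about yearly drive replacements
# (illustrative arithmetic only; not data or code from the CMU study).

HOURS_PER_YEAR = 8760


def nominal_afr(mttf_hours):
    """Annualized failure rate implied by a datasheet MTTF, as a fraction.

    A drive with MTTF of M hours is nominally expected to fail once every
    M hours, i.e. 8760 / M times per year.
    """
    return HOURS_PER_YEAR / mttf_hours


# A datasheet MTTF of 1,000,000 hours implies that under 1% of drives
# should need replacement each year; field replacement rates several
# times higher than this are what "overstated by 2-10x" refers to.
print(f"{nominal_afr(1_000_000):.2%}")  # 0.88%
```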
Additionally, other papers based on the data have been written by researchers at UCSC and the Colorado School of Mines.
This huge release and the subsequent papers have started an outpouring of data releases
from other sites.
So many sites are now releasing data that USENIX has started an index site to showcase it.