Petascale computing infrastructures for scientific discovery make petascale demands on information storage capacity, performance, concurrency, reliability, availability, and manageability. The last decade has shown that parallel file systems can barely keep pace with high performance computing along these dimensions; this poses a critical challenge when petascale requirements are considered. This proposal describes a Petascale Data Storage Institute that focuses on the data storage problems found in petascale scientific computing environments, with special attention to community issues such as interoperability, community buy-in, and shared tools. Leveraging experience in applications and diverse file and storage systems expertise of its members, the institute allows a group of researchers to collaborate extensively on developing requirements, standards, algorithms, and development and performance tools. Mechanisms for petascale storage and results are made available to the petascale computing community. The institute holds periodic workshops and develops educational materials on petascale data storage for science.
The Petascale Data Storage Institute is a collaboration between researchers at Carnegie Mellon University, National Energy Research Scientific Computing Center, Pacific Northwest National Laboratory, Oak Ridge National Laboratory, Sandia National Laboratory, Los Alamos National Laboratory, University of Michigan, and the University of California at Santa Cruz.
The Institute’s work will be organized into six projects:
- Petascale Data Storage Outreach: (Type: Dissemination) Development and deployment of training materials, both tutorials for scientists and course materials for graduate students; support and advise other SciDAC projects and institutes; and development of frequent workshops drawing together experts in the field and petascale science users.
- Protocol/API Extensions for Petascale Science Requirements: (Type: Dissemination) Drive deployment of best practices for petascale data storage systems through development and standardization of application programmer interfaces and protocols, with specific emphasis on Linux APIs. Validate and demonstrate these APIs in large scale scientific computing systems.
- Petascale Storage Application Performance Characterization: (Type: Data Collection) Capture, characterize, model and distribute workload, access trace, benchmark and usage data on terascale and projected petascale scientific applications, and develop and distribute related tools.
- Petascale Storage System Dependability Characterization: (Type: Data Collection) Capture, characterize, model and distribute failure, error log and usage data on terascale and projected petascale scientific systems, and develop and distribute related tools.
- Exploration of Novel Mechanisms for Emerging Petascale Science Requirements: (Type: Exploration) In anticipation of petascale challenges for data storage, explore novel mechanisms such as global/WAN high performance file systems based on NFS; security aspects for federated systems, collective operations, and ever higher performance systems; predictable sharing of high performance storage by heavy storage load applications; new namespace/search and attribute definition mechanisms for ever larger namespaces; and integration and specialization of storage systems for server virtualization systems.
- Exploration of Automation for Petascale Storage System Administration: (Type: Exploration) In anticipation of petascale challenges for data storage, explore and develop more powerful instrumentation, visualization and diagnosis methodologies; data layout planning and access scheduling algorithms; and automation for tuning and healing configurations.
Managing scientific data has been identified as one of the most important emerging needs by the scientific community because of the sheer volume and increasing complexity of data being collected. Effectively generating, managing, and analyzing this information requires a comprehensive, end-to-end approach to data management that encompasses all of the stages from the initial data acquisition to the final analysis of the data.
Based on the community input, we have identified three significant requirements. First, more efficient interactions with disks and the resulting files are needed. In particular, parallel file system improvements are needed to write and read large volumes of data without slowing a simulation, analysis, or visualization engine. Second, scientists require improved access to their data, in particular the ability to effectively perform complex data analysis and searches over large data sets. Specialized feature discovery and statistical analysis techniques are needed before the data can be understood or visualized. Finally, generating the data, collecting and storing the results, data post-processing, and analysis of results is a tedious, fragmented process. Tools for automating this process in a robust, tractable, and recoverable fashion are required to enhance scientific exploration.
We have organized our activities in three layers abstracting the end-to-end data flow described above: the Storage Efficient Access (SEA), Data Mining and Analysis (DMA), and Scientific Process Automation (SPA) layers. The SEA layer is immediately on top of hardware, operating systems, file systems, and mass storage systems, and provides parallel data access technology. On top of the SEA layer exists the DMA layer, consisting of indexing, feature selection, and parallel statistical analysis. The SPA layer, which is on top of the DMA layer, provides the ability to compose scientific workflows from the components in the DMA layer as well as application specific modules. Together these layers provide an integrated system for data management in computational science.
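The layering described above can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the project's software: the class names, method names, and the in-memory "storage" are all hypothetical, chosen to show how a workflow layer might compose analysis components that in turn rely on a data access layer.

```python
from statistics import mean
from typing import Any, Callable, Dict, List


class StorageEfficientAccess:
    """Hypothetical SEA stand-in: serves a dataset in chunks,
    loosely imitating what a parallel reader would hand to each worker."""

    def __init__(self, datasets: Dict[str, List[float]]):
        self._datasets = datasets  # in-memory substitute for real storage

    def read_chunks(self, name: str, chunk_size: int) -> List[List[float]]:
        data = self._datasets[name]
        return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class DataMiningAnalysis:
    """Hypothetical DMA stand-in: analysis components built on SEA reads."""

    def __init__(self, sea: StorageEfficientAccess):
        self._sea = sea

    def chunk_means(self, name: str, chunk_size: int) -> List[float]:
        # A simple per-chunk statistic, computed chunk by chunk.
        return [mean(c) for c in self._sea.read_chunks(name, chunk_size)]

    def select_above(self, values: List[float], threshold: float) -> List[int]:
        # A toy "feature selection": indices of values above a threshold.
        return [i for i, v in enumerate(values) if v > threshold]


class ScientificProcessAutomation:
    """Hypothetical SPA stand-in: composes steps into a workflow
    and runs them in order, feeding each result to the next step."""

    def __init__(self) -> None:
        self._steps: List[Callable[[Any], Any]] = []

    def add_step(self, step: Callable[[Any], Any]) -> "ScientificProcessAutomation":
        self._steps.append(step)
        return self

    def run(self, initial: Any = None) -> Any:
        result = initial
        for step in self._steps:
            result = step(result)
        return result


def run_example() -> List[int]:
    # Made-up sample values, used only to exercise the three layers.
    sea = StorageEfficientAccess({"temperature": [1.0, 3.0, 10.0, 12.0, 2.0, 4.0]})
    dma = DataMiningAnalysis(sea)
    workflow = (
        ScientificProcessAutomation()
        .add_step(lambda _: dma.chunk_means("temperature", 2))
        .add_step(lambda means: dma.select_above(means, 5.0))
    )
    return workflow.run()
```

Calling `run_example()` builds a two-step workflow in which the SPA layer never touches storage directly: it only invokes DMA components, and those only call the SEA layer. That separation is the point of the sketch; a real system would replace each class with far more capable components.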