AA-2010: Abstracts for Advanced Architecture and Critical Technologies 2010 FSIO projects
- Blackcomb: Hardware-Software Co-design for Non-Volatile Memory in Exascale Systems
Jeffrey Vetter, Oak Ridge National Lab
Robert Schreiber, HP Labs
Trevor Mudge, University of Michigan
Yuan Xie, Penn State University
- HECFSIO Topic: Next Generation I/O Architectures
Keywords: Non-Volatile Memory
Memory, not processing, is the crux of the exascale co-design problem. Exascale machines will push the limits of memory capacity, power, and performance. DRAM, the universal memory technology of today, may not scale to meet the needs of exascale applications. Disk storage, critical for checkpointing and for archiving computational inputs and results, may also fail to provide adequate performance, reliability, and power efficiency by the end of this decade. We confront a memory/storage crisis.
The Blackcomb effort seeks to create and understand new memory technologies, develop their roles in exascale systems, adapt applications to them, and assess their relative merits. We focus on emerging nonvolatile memory (NVM) technologies, including spin-torque-transfer RAM (STT-RAM), phase-change RAM (PC-RAM), and memristor (resistive RAM, or R-RAM).
-
- NoLoSS: Investigating the Roles of Node Local Storage in Exascale Systems
Kamil Iskra, Argonne National Lab
Maya Gokhale, Lawrence Livermore National Lab
- HECFSIO Topic: Next Generation I/O Architectures
Keywords: In-system storage solutions, SSD
The international computational science community is on a path to build exaFLOP-capable systems by the year 2018. These exascale systems will enable transformative science discoveries in a number of areas, including climate, combustion, nuclear energy, and national security. A key exascale barrier is the need for scalable storage of persistent state: one that provides the necessary I/O bandwidth and capacity without overwhelming the power, cooling, and cost budgets of an exascale system. Traditional global storage system approaches simply cannot scale to meet these requirements.
With the development of inexpensive, nonvolatile memory technologies such as flash memory and phase change memory, it is feasible to include solid state persistent memory on every node in a future exascale system – enabling in-system storage (also referred to as node local storage). In-system storage augments the memory hierarchy, potentially reducing DRAM requirements and thus the node's power requirements. It streamlines and simplifies checkpointing, increasing system reliability. In-system storage reduces the peak bandwidth requirements of a global exascale storage system, offering a scalable checkpoint/restart solution. However, there remain considerable research challenges to realizing these potential benefits, especially if one wants to hide the complexity introduced by another layer in the storage hierarchy from the user.
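The claim that in-system storage reduces peak bandwidth demand on global storage can be illustrated with a back-of-the-envelope calculation. The sketch below is only an illustration: the function name `checkpoint_seconds` and every number in the usage example (checkpoint size, node count, bandwidths) are hypothetical assumptions, not figures from the project.

```python
def checkpoint_seconds(total_bytes, num_nodes, global_bw, local_bw_per_node):
    """Estimate the time to write one full-system checkpoint two ways.

    All parameters are hypothetical inputs chosen by the caller:
      total_bytes        -- size of the whole checkpoint, in bytes
      num_nodes          -- number of compute nodes, each holding an equal share
      global_bw          -- aggregate bandwidth of the global store, bytes/s
      local_bw_per_node  -- write bandwidth of one node's local device, bytes/s
    Returns (global_time, local_time) in seconds.
    """
    # Every node funnels its data through the shared global storage system.
    global_time = total_bytes / global_bw
    # Every node writes only its own share to its own local device, in parallel.
    local_time = (total_bytes / num_nodes) / local_bw_per_node
    return global_time, local_time


if __name__ == "__main__":
    # Assumed values: 1 PB checkpoint, 100,000 nodes,
    # 1 TB/s global storage, 1 GB/s per-node local device.
    g, l = checkpoint_seconds(1e15, 100_000, 1e12, 1e9)
    print(f"global: {g:.0f} s, node-local: {l:.0f} s")
```

The model ignores contention, metadata costs, and the later drain of local checkpoints to global storage; it is meant only to show why aggregate node-local bandwidth scales with node count while global bandwidth does not.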
The goal of the Node Local Storage Systems (NoLoSS) project is to conduct a detailed assessment of the potential roles and benefits of in-system storage in exascale computational science. We are exploring existing hardware options for node local storage (NLS) and assessing the software mechanisms that best exploit them, based on a detailed analysis of existing Office of Science applications. We are implementing important examples of those mechanisms and determining how modifications to the existing hardware mechanisms could better support them. We will continue this three-pronged, iterative process throughout the project's lifetime, including anticipating how our successes will alter I/O usage patterns of emerging exascale applications.
-
- CODES: Enabling Co-Design of Multi-Layer Exascale Storage Architectures
Sam Lang, Argonne National Lab
Chris Carothers, Rensselaer Polytechnic Institute
- HECFSIO Topic: Measurement and Understanding
Keywords: Resilience
The data demands of science and the limited rates of data access pose a daunting challenge for the designers of exascale storage architectures. Co-design of these systems will be necessary to find the best possible design points for exascale systems. Designers must consider performance, reliability, and power consumption in the context of the I/O patterns and requirements of applications and analysis tools at exascale. Meeting these constraints will require the development of a multi-layer hardware and software architecture incorporating devices that do not yet exist. The most promising approach for co-design of such systems is simulation.
The goal of this project is to enable the exploration and co-design of exascale storage systems by providing a detailed, accurate, and highly parallel simulation toolkit for exascale storage. We will develop models to realistically represent application checkpoint and analysis workloads. These models will be joined together using the Rensselaer Optimistic Simulation System (ROSS), a discrete event simulation framework that allows simulations to be run in parallel, decreasing the simulation run time of massive simulations to hours. Building on our prior work in highly parallel simulation and using our new high-resolution models, our system will capture the complexity, scale, and multi-layer nature of exascale storage hardware and software, and it will execute in a time frame that enables “what if” exploration of design concepts.
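To make the idea of discrete event simulation concrete, here is a minimal sequential sketch in Python. It is not ROSS and does not use ROSS's API or its optimistic parallel execution; the function `simulate_checkpoint`, its parameters, and the single first-come, first-served server model are all assumptions made for illustration.

```python
import heapq


def simulate_checkpoint(num_clients, bytes_per_client, server_bandwidth):
    """Sequential discrete-event simulation of a checkpoint burst.

    Hypothetical model: every client issues one write at time 0 to a single
    storage server that serves requests one at a time, in arrival order.
    Returns a dict mapping client id -> completion time in seconds.
    """
    events = []  # priority queue of (time, sequence, kind, client)
    seq = 0      # unique tie-breaker so equal-time events pop in insertion order
    for client in range(num_clients):
        heapq.heappush(events, (0.0, seq, "arrive", client))
        seq += 1

    server_free_at = 0.0
    finish_times = {}
    while events:
        # Always process the earliest pending event next.
        now, _, kind, client = heapq.heappop(events)
        if kind == "arrive":
            # The write starts when the server is free, then occupies it.
            start = max(now, server_free_at)
            server_free_at = start + bytes_per_client / server_bandwidth
            heapq.heappush(events, (server_free_at, seq, "done", client))
            seq += 1
        else:  # "done"
            finish_times[client] = now
    return finish_times


if __name__ == "__main__":
    # Assumed workload: 4 clients, 100 bytes each, server moves 50 bytes/s.
    print(simulate_checkpoint(4, 100.0, 50.0))
```

A parallel simulator distributes such events across many processors and must keep their timestamps causally consistent; this sketch sidesteps that by processing a single global event queue in timestamp order.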
With this new toolkit we will investigate design options and trade-offs related to improving the reliability, performance at scale, and power consumption of potential exascale storage architectures. We will work with industry, DOE computing facilities, and the computer and computational science communities to refine our models and to encourage the use of this powerful tool in the design of future extreme-scale storage systems.
-
- Data Movement Dominates: Adding Data Management Services to Parallel File Systems
Arun Rodrigues, Sandia National Lab
John Shalf, Lawrence Berkeley National Lab
- HECFSIO Topic: Measurement and Understanding
Keywords: 3D memory stacking, Optical chip-to-chip communication
Energy is the fundamental barrier to exascale computing, and it is dominated by the cost of moving data, not computation. Further, data movement, not computation, dominates the performance of real applications in HPC environments. This project will address the problems of data movement by examining three critical technologies: 3D integration, optical chip-to-chip communication, and hardware support for logic operations in the memory system.
Simulation of the proposed systems will be accomplished by merging and improving several existing simulation models: the PhoenixSim optical interconnect simulator; the DRAMsim advanced memory simulator; and the Structural Simulation Toolkit (SST), which will provide processor and I/O models as well as a parallel simulation and power analysis infrastructure. This unified simulation infrastructure will provide accurate physical layer device models as well as more abstract designs for architectural exploration.