

The Wisconsin Hierarchically-Redundant, Decoupled storage project (HaRD) investigates the next generation of storage software for hybrid Flash/disk storage clusters. The main objective of the project is to improve the performance of storage in a variety of diverse scenarios, including new application environments such as photo storage as found in Facebook and Flickr, high-end scientific processing as found in government labs, and large-scale data processing such as that found in Google and Microsoft. The HaRD project focuses on three key issues in order to improve performance of these important applications: client-side Flash-based RAID and file-system integration, server-side memory reduction and multicore scheduling of file-system tasks, and scheduled network transfers. HaRD pulls together these technologies into a synthesized whole through three targeted storage systems: a scalable photo server, a high-performance checkpoint subsystem, and an improved file system for MapReduce workloads. The impact of this project is significant, as HaRD helps to shape the storage software architecture of the next generation of cloud computing services, which are of increasing relevance to both industry and society at large.
This project will develop a job scheduling and resource allocation system for data-intensive high-performance computing (HPC) based on the congestion pricing of a system's heterogeneous resources. This extends the concept of resource management beyond processing: it allocates memory, disk I/O, and the network among jobs. The research will overcome the critical shortcomings of processor-centric resource management, which wastes huge portions of cluster and supercomputer resources for data-intensive workloads; for example, I/O bandwidth governs the performance of many modern HPC applications but, at present, it is neither allocated nor managed. The research will develop techniques that (1) reconfigure the degree of parallelism of HPC jobs to avoid congestion and wastage, (2) support lower-priority, allocation-elastic jobs that can be scheduled on arbitrary numbers of nodes to consume unallocated resource fragments, and (3) co-schedule batch-processing workloads that use system resources that are unoccupied due to asymmetric utilization and temporal shifts in the foreground jobs. These techniques will be implemented and supported for free public use as extensions to an open-source resource-management framework. If used broadly, the software has the potential to provide much better utilization of the national investment in HPC facilities.
The increasing performance and decreasing cost of processors has enabled increased system intelligence at peripherals such as disk drives. This computational capability at the disk has led to the development of object-based storage whereby some of the file system functionality is moved to the disk. The computation capability can also enable computation at the storage node in what has been called active disks or active storage. This active storage computation serves as a mechanism to enable parallel computation using distributed storage nodes.
This research focuses on the use of these active disks for parallel file system and storage management. A functional active storage system architecture built on the standardized object-storage device specification is being developed. The architecture supports a variety of execution engines allowing multiple programming languages and models. Using this active object storage architecture, mechanisms to improve overall scalability and large-scale system reliability are being investigated. In addition, active and object storage are used to enable customizable and extensible file systems including autonomic (self-configuring and self-managing) storage as well as application aware storage such that the storage can be optimized for application and user needs.
This research focuses on developing scalable parallel file access methods for multi-scale problem domain decompositions, such as those that arise in Adaptive Mesh Refinement (AMR) based algorithms. Existing parallel I/O methods concentrate on optimizing process collaboration under a fairly evenly distributed request pattern. However, they are not suitable for data structures in AMR, because the underlying data distribution is highly irregular and dynamic. Process synchronization in the existing parallel I/O methods can penalize I/O parallelism if the process collaboration is not carefully coordinated. This research addresses such synchronization issues by developing scalable solutions in the Parallel netCDF library (PnetCDF), particularly to address AMR structured data and its I/O patterns. PnetCDF is a popular I/O library used by many computer simulation communities. Storing and accessing AMR data in parallel in a scalable way is considered a challenging task. This research will design a process-group based parallel I/O approach to eliminate unrelated processes and thus avoid possible I/O serialization. In addition, a new metadata representation will be developed in PnetCDF for preserving tree-structured AMR data relationships in a portable form.
Query response time and system throughput are the most important metrics when it comes to database and file access performance. Because of data proliferation, efficient access methods and data storage techniques have become increasingly critical to maintain an acceptable query response time and system throughput. One of the common ways to reduce disk I/Os and therefore improve query response time is database clustering, which is a process that partitions the database/file vertically (attribute clustering) and/or horizontally (record clustering). To take advantage of parallelism to improve system throughput, clusters can be placed on different nodes in a cluster machine.
This project develops a novel algorithm, AutoClust, for database/file clustering that dynamically and automatically generates attribute and record clusters based on closed item sets mined from the attribute and record sets found in the queries running against the database/files. The algorithm is capable of re-clustering the database/file in order to continue achieving good system performance despite changes in the data and/or query sets. The project then develops innovative ways to implement AutoClust using the cluster computing paradigm to further reduce query response time and increase system throughput through parallelism and data redundancy. The algorithms are prototyped on a Dell Linux Cluster computer with 486 compute nodes available at the University of Oklahoma. For broader impacts, performance studies are conducted using not only the decision support system database benchmark (TPC-H) but also real data recorded in database and file formats collected from science and healthcare applications in collaboration with domain experts, including scientists at the Center for Analysis and Prediction of Storms (CAPS) at the University of Oklahoma. The project also makes important impacts on education as it provides training for graduate and undergraduate students working on this project in areas of critical national need: database and file management systems, and high-end computing and applications. The developed algorithm and prototype, real datasets, and performance evaluation results are made available to the public at the Website: http://www.cs.ou.edu/~database/AutoClust.html .
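As an illustration only, the notion of closed item sets mined from query attribute sets can be sketched in Python. This is not the AutoClust algorithm itself: the function name `closed_itemsets`, the brute-force enumeration, and the sample query log are all assumptions made for the example. An item set is closed when no strict superset occurs in the same number of queries, so each closed set is a candidate group of attributes that are always read together.

```python
from itertools import combinations


def closed_itemsets(transactions, min_support=1):
    """Return {itemset: support} for every closed itemset.

    An itemset is closed when no strict superset has the same support.
    Brute-force enumeration: fine for a handful of attributes only.
    """
    items = sorted(set().union(*transactions))
    support = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            # Support = number of queries that read every attribute in the set.
            count = sum(1 for t in transactions if candidate <= t)
            if count >= min_support:
                support[candidate] = count
    # Keep only sets with no strict superset of equal support.
    return {
        s: c
        for s, c in support.items()
        if not any(s < other and c == oc for other, oc in support.items())
    }


# Hypothetical query log: each entry is the set of attributes one query reads.
queries = [
    frozenset({"name", "city"}),
    frozenset({"name", "city"}),
    frozenset({"name", "city", "salary"}),
    frozenset({"salary"}),
]
closed = closed_itemsets(queries, min_support=2)
```

With this made-up log, `name` and `city` always appear together, so they form one closed set, which suggests storing them in the same vertical partition, while `salary` forms its own.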
Extending virtualization technology into high-performance, cluster platforms generates exciting new possibilities. However, I/O efficiency in virtualized environments, specifically with respect to disk I/O, remains little understood and hardly tested.
The objective of this research is to investigate fundamental techniques for virtual clusters that not only facilitate rigorous performance studies, but also identify places where performance is suffering and then optimize the system to lessen the impact of such bottlenecks. To accomplish this objective, the following research tasks will be conducted: 1) An in-depth analysis of I/O efficiency in virtualized environments and investigation of intelligent and automated I/O bottleneck identification schemes; 2) Design and development of techniques to optimize I/O to address the detected I/O bottlenecks; 3) Development of an extensible framework for characterizing I/O workloads across virtualized clusters.
This research will greatly contribute to understanding virtualized I/O, identifying I/O bottlenecks, and optimizing I/O, and thus help cluster systems use virtualization technology most effectively. This project will also contribute to society by promoting research and engaging under-represented groups, helping students advance their careers in science and engineering.
Emerging high-end computing platforms, such as leadership-class machines at the petascale, provide new horizons for complex modeling and large-scale simulations. These machines are used to execute data-intensive applications of national interest such as climate modeling, cosmic microwave background radiation, and astrophysical thermonuclear flashes. While these systems have unprecedented levels of peak computational power and storage capacity, a critical challenge concerns the design and implementation of scalable I/O (input-output) system software (also called the I/O stack) that makes it possible to harness the power of these systems for scientific discovery and engineering design. Unfortunately, no mechanisms are currently available that accommodate I/O stack-wide, application-level QoS (quality-of-service) specification, monitoring, and management.
This project investigates a revolutionary approach to the QoS-aware management of the I/O stack using feedback control theory, machine learning, and optimization. The goal is to maximize I/O performance and thus improve overall performance of large scale applications of national interest. The project uses (1) machine learning and optimization to determine the best decomposition of application-level QoS to sub-QoSs targeting individual resources, and (2) feedback control theory to allocate shared resources managed by the I/O stack such that the specified QoSs are satisfied throughout the execution. The project tests the developed I/O stack enhancements using the workloads at NCAR, LBNL and ANL systems. It also involves two efforts in broadening participation: CISE Visit in Engineering Weekends (VIEW) and NASA-Aerospace Education Services Project (NASA-AESP) at the Center for Science and the Schools (CSATS).
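To make the feedback-control idea concrete, the following is a minimal Python sketch, not the project's controller. It assumes a single application, a hypothetical linear plant in which delivered throughput equals the application's share times the total capacity, and a simple integral-style update rule. The names `control_step` and `simulate`, the gain, and all numbers are invented for the example.

```python
def control_step(share, target, measured, gain=0.5, lo=0.05, hi=1.0):
    """One integral-style step: nudge the resource share toward the QoS target."""
    # Relative error: positive when the application is short of its target.
    error = (target - measured) / target
    return min(hi, max(lo, share + gain * share * error))


def simulate(target_mbps, capacity_mbps, steps=20):
    """Toy plant (an assumption): throughput is simply share * capacity."""
    share = 0.1
    history = []
    for _ in range(steps):
        measured = share * capacity_mbps
        history.append(measured)
        share = control_step(share, target_mbps, measured)
    return history


# Hypothetical QoS target of 400 MB/s on a 1000 MB/s I/O path.
trace = simulate(target_mbps=400.0, capacity_mbps=1000.0)
```

In this toy setting the measured throughput rises from 100 MB/s and settles near the 400 MB/s target. A real I/O stack would have noisy measurements, several competing applications, and a nonlinear response, which is where the learned QoS decomposition described above would come in.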
This project investigates optimization problems that arise while performing thermal management in very large data storage centers. To satisfy the growing data management needs, such storage centers contain possibly hundreds of thousands of hard disks and other components, and typically are consistently active. These generate a lot of heat, and hence the storage system must be cooled to maintain reliability, resulting in significant cooling costs. The cooling mechanism and the workload assignments in a storage center are intricately tied together.
This project is developing a general science of thermal management for large scale storage systems, by focusing on thermal modeling and management at different levels of the system hierarchy. Thermal-aware techniques are being developed for allocating data access tasks to the specific disks on which data is located, for controlling the schedules and speeds of thousands of tasks and disks to optimize quality of service, and for reorganizing data layouts on disks. This project will enable better thermal management in data storage centers, which can potentially result in significant reductions in their carbon footprint. The project will train several Ph.D. students in conducting research both at the University and through internships at industrial research labs.
This research project aims to understand and develop systems for maintaining superlinear indexes for streaming data. A superlinear index is defined over an abstract space that cannot easily be linearized. In contrast, a linear index, typified by a B-tree, supports point and range queries on totally ordered data.
Examples of superlinear indexes include (1) multidimensional indexes, which can be over a geometric domain, such as geographic data, or which can be over multiple linear indexes; and (2) full text queries, which can include searching for a particular word or substring.
The superlinear indexes found in today's databases cannot support high rates of insertion. On traditional mechanical disk drives, the existing superlinear indexes can only support about one hundred insertions per second in the worst case. For many important applications, that is too slow, and so database users often avoid superlinear indexing. Even traditional linear indexes based on B-trees cannot support the high insertion rates demanded by many databases.
This research investigates streaming superlinear indexes, that is, indexes that efficiently support full text or multidimensional queries, and can be updated at speeds that are related to disk bandwidth rather than seeks per second.
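The difference between seek-bound and bandwidth-bound updates can be illustrated with a toy Python sketch. It is a generic write-buffering scheme over sorted runs, swapped in purely for illustration; it is not the project's data structure and it indexes plain ordered keys rather than text or multidimensional data. The class name `BufferedIndex`, the buffer size, and the keys are all hypothetical.

```python
import bisect


class BufferedIndex:
    """Toy write-optimized index: buffer inserts, flush them as sorted runs."""

    def __init__(self, buffer_limit=4):
        self.buffer_limit = buffer_limit
        self.buffer = {}            # in memory; absorbs inserts with no disk I/O
        self.runs = []              # each run stands in for one sequentially written file
        self.sequential_writes = 0  # one count per flushed run

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        if self.buffer:
            # One large sequential write instead of one seek per inserted key.
            self.runs.append(sorted(self.buffer.items()))
            self.sequential_writes += 1
            self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for run in reversed(self.runs):  # the newest run wins
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None


index = BufferedIndex(buffer_limit=4)
for i in range(10):
    index.insert(f"doc{i}", i * i)
```

Here ten inserts cost two sequential run writes rather than ten in-place updates, at the price of lookups that may have to consult several runs. Practical designs bound that lookup cost by merging runs, which this sketch omits.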
Among the significant research issues are the following: (1) design efficient file structures for streaming superlinear indexes; (2) investigate how streaming superlinear indexes might pave the way to improved file systems; (3) determine whether cache-oblivious algorithms technology can enhance streaming superlinear indexes; and (4) program complex data structures for transactions and recovery.
If successful, this research will show how to build file systems that achieve dramatically better performance than today's B-tree-based file systems, how to maintain rich geometrical data and multidimensional nongeographical databases in real time, and how to maintain full-text searchable databases in real time. For example, some of today's file systems try to maintain a full-text index to find strings in files quickly, but these systems often fall behind at high data write rates. A streaming superlinear index would allow such a file system to keep up, and would improve the usability of both high-end storage systems and relatively small consumer storage systems that are nonetheless too large to index with today's indexes.
The researchers are developing course materials on streaming indexing technology which will be made freely available under the MIT OpenCourseWare initiative (http://ocw.mit.edu).
The objective of this research is to develop techniques that utilize solid-state memory technologies from device, circuit, architecture, and system perspectives across I/O hierarchy in order to exploit their true potential for improving I/O stack performance in high-performance computing systems. I/O friendly memory system architectures will be developed to enable hybrid processor-memory 3D integrations with largely reduced off-chip I/O traffic.
Adaptive cache management and hotspot prediction methods will be developed to address the low random write performance of solid-state drives, and data processing techniques will be developed to enable run-time configurable trade-offs among solid-state drive performance characteristics. A comprehensive full-system simulation infrastructure will be developed to evaluate and demonstrate the research under diverse high-performance computing workloads.
The research will help high-performance computing systems make the most effective use of existing and emerging memory and processing technologies to tackle the grand I/O stack design challenge. It can greatly contribute to enabling high-performance computing systems to stay on track with their historic scaling, and hence benefit numerous real-life applications in areas such as biology, chemistry, earth science, and health care. This project will also contribute to society through engaging under-represented groups, research infrastructure dissemination for education and training, and outreach to high school students.
Modern supercomputers are complex, hierarchical systems consisting of a huge number of cores, systems for disk storage, and nodes for I/O forwarding. These numbers will continue to grow, and the need for tools to understand the behavior of the system software becomes paramount: without these tools it will be impossible to effectively tune system software, and high degrees of efficiency will be unattainable by applications. This project addresses the challenge of understanding the behavior of complex system software on very large-scale compute platforms like the current petascale computers. In particular, this project will develop software infrastructure to provide end-to-end analysis and visualization of I/O system software. Specifically, the objectives of this project are to develop, improve, and deploy (1) end-to-end, scalable tracing integrated into the I/O system (MPI-IO, I/O forwarding, and file system); (2) information visualization tools for inspecting traces and extracting knowledge; (3) testing components that drive this system to generate example patterns, including an "I/O system failure" component to generate anomalies; and (4) tutorials and tools for helping other system software developers incorporate this analysis and visualization system into their production software. It is clear that the software and techniques developed in this project will be directly applicable to and useful in other system software libraries, such as communication libraries, which perform complex interactions on large systems.
I/O performance is often an issue for high-end computing (HEC) codes, due to their increasingly data-intensive nature and the ever-growing CPU-I/O performance gap. Portable parallel I/O benchmarks can help
(1) application owners to improve their codes' performance
(2) HEC storage systems architects to improve their designs
(3) future and current owners of HEC platforms to reduce hardware cost and improve application performance through better system provisioning and configuration.
This research will produce benchmarks and tools that benefit the computational science community at large. Our benchmark prototypes will be used for parallel computing course projects and student research contests.
Data provenance documents the inputs, entities, systems, and processes that influence data of interest, in effect providing a historical record of the data and its origins. The generated evidence supports essential forensic activities such as data-dependency analysis, error/compromise detection and recovery, and auditing and compliance analysis.
This collaborative project is focused on theory and systems supporting practical end-to-end provenance in high-end computing systems. Here, systems are investigated where provenance authorities accept host-level provenance data from validated provenance monitors, to assemble a trustworthy provenance record. Provenance monitors externally observe systems or applications and securely record the evolution of data they manipulate. The provenance record is shared across the distributed environment.
In support of this vision, tools and systems are explored that identify policy (what provenance data to record), trusted authorities (which entities may assert provenance information), and infrastructure (where to record provenance data). Moreover, the provenance has the potential to hurt system performance: collecting too much provenance information or doing so in an inefficient or invasive way can introduce unacceptable overheads. In response, the project is further focused on ways to understand and reduce the costs of provenance collection.
This project is developing new techniques for identifying and managing files, replacing tree-structured file names with content- and metadata- based search access. By leveraging existing work in search and recognizing the explosion in the volume of data stored, this project enables users to find and access their data in natural and intuitive ways, based on the files' contents, tags the user has assigned, system metadata, and provenance (information about the file's origins). This research targets high-end computing (HEC) users, who manage billions of files generated by measurement devices, experimentation, or scientific workflows. The techniques and system developed are also applicable to general-purpose computing.
Realizing this goal requires advances in several areas. First, the project is designing and developing fast, scalable mechanisms to gather, maintain and index the large volume of metadata and provenance that HEC applications and users generate. This project is also exploring search algorithms that operate on graph structures, enabling users to find files "near" their current workspace. To enable users to access this functionality, the project is developing a new "language" that facilitates the kind of searches that users need.
Current high-end computing (HEC) applications explicitly manage persistent data, including both application state and application output. This practice not only increases development time and cost, but also requires an application developer to be intimately aware of the underlying platform-dependent storage mechanisms to achieve good application I/O performance. Such vertical development also makes the application software less portable.
The Software Persistent Memory (SoftPM) project builds a lightweight abstraction and practical infrastructure for streamlining data management in next generation HEC applications. SoftPM eliminates the duality of data management in HEC applications by allowing applications to allocate persistent memory in much the same way volatile memory is allocated and easily restore, browse, and interact with past versions of persistent memory state. This simplifies the implementation of three broad capabilities required in HEC applications: recoverability (e.g., checkpoint-restart), record-replay (e.g., data-visualization), and execution branching (e.g., simulation model-space exploration).
The SoftPM project is organized in three modules. The first module builds an evolvable SoftPM API and addresses memory management issues. The second module addresses high-performance I/O and the atomicity of persistence points for local storage and parallel file systems. The final module builds several HEC application case-studies to illustrate the different capabilities supported by SoftPM in HEC environments.
This research focuses on the design and implementation of a lightweight yet versatile middleware framework that provides effective and scalable solutions to the problem of interleaving storage workloads with a wide spectrum of demands. The framework uses simple and non-intrusive collection of workload statistics such as workload histograms and measures of temporal dependence to provide accurate forecasting of system workload characteristics and their impact on system metrics.
The framework accurately and swiftly maps the complex processes that exist and interact in storage clusters into robust allocation decisions. Central to the framework is its ability to estimate beforehand the effect of resource allocation policies on system metrics, which enables navigating through multiple possible allocations of system resources and selecting the one that best meets system targets. This research has the potential to revolutionize autonomic resource management in storage systems and provide methodologies to meet conflicting targets, such as discovering trade-offs and dependencies between performance and other metrics including cost, energy consumption, reliability, and availability. This project will enable enhancement of graduate courses on parallel and distributed systems with aspects of emerging paradigms such as data-intensive, cloud, and green computing, and will advance the education of the multiple students directly involved.
Disk I/O on high-end computing machines continues to be a significant performance bottleneck. Parallel file systems have been developed to improve parallel I/O performance. However, most of these methods are application dependent and their performance varies largely from application to application. The performance of parallel I/O can be improved with better understanding of I/O access characteristics at both the client and file-server side. There is a great need for research into next-generation intelligent and application-specific I/O architectures to meet the demand of high-end computing.
We propose a dynamic application-specific I/O architecture that tailors various parallel I/O optimizations based on I/O characteristics of applications. This architecture is dynamic in the sense that its underlying optimization strategies are able to adapt to the variations in different applications for best performance. The proposed research is twofold: 1) understanding I/O behavior, 2) developing application-specific optimizations for data layout, prefetching, and caching to form an integrated application-specific I/O architecture. Several technical hurdles have been identified, which include I/O access signature, compiler analysis, global-aware coordinated caching, collective prefetching, data layout optimization and distribution strategies. Solutions are proposed and detailed plans are provided to test these newly proposed solutions and techniques under the PVFS2 parallel file system.
The nature of scientific computing is changing: it is becoming increasingly data-centric. We use Amdahl's Laws to quantify (i) what is a data-intensive computational problem and (ii) what is a data-intensive computational architecture. Based on these objective metrics we propose several different architectural approaches, including some next-generation, low-power processors and storage devices, e.g., Solid State Disks (SSDs), and consider how these architectures might offer substantial benefits over the existing ones. As data volumes grow, transferring data to where our computational resources are is becoming increasingly difficult. As a result we need to bring our computing as close to the data storage as possible. In this research, we plan to explore approaches where the first steps of the scientific data processing are performed on the backplane of the database servers, the closest we can get to low-level scientific data. There are several challenges in using databases for large scale scientific computations; not all data structures map equally well to relational database tables. We will also explore how we can use trees and arrays, representing very large scientific data sets, in both relational databases and a MapReduce/Hadoop-like environment. We are in the fortunate situation that the necessary hardware components of our research have been provided by funds from the Gordon and Betty Moore Foundation and Microsoft Research. We seek moderate support for graduate students, beyond fractional salaries of the senior personnel involved.
Intellectual merits: Our premise is that in the near future the only feasible scalable data-intensive environment will consist of a massively scaled-down and -out system, where data partitioning, fault tolerance, and massive parallelism will play a much larger role than today. While SSDs can provide excellent performance for both sequential and random I/Os, we will still need traditional hard disks for the bulk volume. How this additional tier in our storage system can be used in the most effective way, both with databases and without, is yet to be explored. We propose to use real workloads from our existing experiments to evaluate these compromises, both on our 1.1PB GrayWulf system and on small-scale hybrid systems built out of low-power motherboards, SSDs, and hard disks. The proposed research will focus on the challenge of designing and developing massively distributed query environments, that is, providing data storage and access support for scientific applications such as SkyQuery, Turbulence, and Viz, to name a few. This research will rethink and redesign our existing database clusters and data management systems in the presence of solid-state drives and low-power processors, as the emergence of these new devices has undoubtedly introduced exciting opportunities for improving not only performance but also energy efficiency.
Broader impacts: We feel that the research here is transformative and has a very broad impact on much of the next generation of scientific computing architectures, in all scientific disciplines. It is clear that current systems (BeoWulf) will not scale well beyond the next two years, mostly due to power requirements and the difference in the scaling laws of multi-core CPUs and hard-disk based IO subsystems. Our approach, using Amdahl's Laws as an objective criterion, enables us to design systems from first principles and match architectures to applications. The development of a balanced data-intensive supercomputing system that can effectively utilize multi-core processors and SSDs will lead not only to a reduced cost of ownership, but also to performance improvement for many scientific applications. The education and outreach plan in this proposal consists of three tasks: developing educational materials, mentoring underrepresented students, and developing collaborations in the industry. The PI plans to interleave them to maximize the broader impact on multiple fronts.
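As a worked illustration of using Amdahl's Laws as a balance criterion, the short Python sketch below computes two commonly quoted ratios. It assumes the usual restatement of the laws: a balanced system provides roughly one bit of sequential I/O per second, and roughly one byte of memory, for each instruction per second. The function names and both node configurations are made-up numbers for the example, not measurements from this project.

```python
def amdahl_number(io_bytes_per_sec, instructions_per_sec):
    """Bits of sequential I/O per second, per instruction per second."""
    return 8.0 * io_bytes_per_sec / instructions_per_sec


def amdahl_memory_ratio(memory_bytes, instructions_per_sec):
    """Bytes of memory per instruction per second."""
    return memory_bytes / instructions_per_sec


# Hypothetical nodes; every figure below is an assumption for illustration.
nodes = {
    "compute_heavy": {"io": 0.5e9, "ips": 20e9, "mem": 16e9},
    "low_power": {"io": 0.25e9, "ips": 2e9, "mem": 2e9},
}

balance = {
    name: (
        amdahl_number(n["io"], n["ips"]),
        amdahl_memory_ratio(n["mem"], n["ips"]),
    )
    for name, n in nodes.items()
}
```

Under these assumed figures the compute-heavy node has an Amdahl number of 0.2, meaning its processors outrun its I/O by a factor of five, while the low-power node reaches 1.0 and is balanced for data-intensive work despite being far slower in absolute terms.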
High-End Computing (HEC) systems are designed for performance, not energy efficiency. In recent years, HEC users have found that as energy costs increase, network and disks have become significant bottlenecks; worse, scientific workloads vary wildly, exercising different parts of HEC clusters, making it impossible to understand where the bottlenecks are and where energy is being wasted.
This project explores the impact of storage-stack configurations on power and performance, using actual cluster configurations and realistic scientific workloads. The research follows three thrusts: tracing and analysis, adaptive cluster reconfiguration, and new storage software stacks.
(1) Traces are collected and analyzed which combine both performance and energy data on a large set of scientific workloads: I/O-, network-, memory-, and CPU-intensive. Three popular scientific cluster configurations are investigated, varying many configuration parameters. (2) Tools are being developed to dynamically adapt a cluster's configurations to a given workload, so as to optimize power and performance prior to running long-term scientific experiments or simulations. (3) New operating systems software is developed specifically to optimize power and performance for scientific workloads: a new lightweight file system and disk I/O scheduler.
The long-term results of this project help society save energy in computing without unduly hurting performance.
In today's high-end computing (HEC) systems, the parallel file system (PFS) is at the core of the storage infrastructure. PFS deployments are shared by many users and applications, but currently there are no provisions for differentiation of service: data access is provided in a best-effort manner. As systems scale, this limitation can prevent applications from efficiently utilizing the HEC resources while achieving their desired performance, and it presents a hurdle to supporting a large number of data-intensive applications concurrently. This NSF HECURA project tackles the challenges in quality of service (QoS) driven HEC storage management, aiming to support I/O bandwidth guarantees in PFSs by addressing the following four research aspects:
1. Per-application I/O bandwidth allocation based on PFS virtualization, where each application gets its specific I/O bandwidth share through its dynamically created virtual PFS.
2. PFS management services that control the lifecycle and configuration of per-application virtual PFSs as well as support application I/O monitoring and storage resource reservation.
3. Efficient I/O bandwidth allocation through autonomic, fine-grained resource scheduling across applications that incorporates coordinated scheduling and optimizations based on profiling and prediction.
4. Scalable application checkpointing based on performance isolation and optimization on virtual PFSs customized for checkpointing I/Os.
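One simple way to picture per-application bandwidth shares is a weighted max-min fair split, sketched below in Python. This is a standard textbook policy chosen for illustration; the abstract does not state which allocation policy the project uses. The function name `allocate_bandwidth`, the weights, and the demand figures are assumptions.

```python
def allocate_bandwidth(capacity, demands, weights):
    """Weighted max-min fair split of `capacity` among applications.

    demands: {app: requested bandwidth}; weights: {app: relative priority}.
    No application receives more than it requested; leftover bandwidth is
    redistributed among the still-unsatisfied applications by weight.
    """
    alloc = {app: 0.0 for app in demands}
    active = {app for app, d in demands.items() if d > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_weight = sum(weights[a] for a in active)
        satisfied = set()
        spent = 0.0
        for app in active:
            offer = remaining * weights[app] / total_weight
            need = demands[app] - alloc[app]
            grant = min(offer, need)
            alloc[app] += grant
            spent += grant
            if grant >= need - 1e-9:
                satisfied.add(app)
        remaining -= spent
        if not satisfied:
            break  # every application took its full offer; nothing is left
        active -= satisfied
    return alloc


# Hypothetical applications sharing a 100 MB/s parallel file system.
shares = allocate_bandwidth(
    capacity=100,
    demands={"viz": 10, "checkpoint": 80, "analysis": 80},
    weights={"viz": 1, "checkpoint": 1, "analysis": 2},
)
```

In this example `viz` receives its full 10 MB/s request and the remaining 90 MB/s is split 1:2, giving `checkpoint` 30 MB/s and `analysis` 60 MB/s. Enforcing such shares at run time, for instance through a per-application virtual PFS as described in item 1, is a separate problem that the sketch does not address.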
Existing data storage systems based on the hierarchical directory-tree organization do not meet the scalability and functionality requirements for exponentially growing datasets and increasingly complex metadata queries in large-scale Exabyte-level file systems with billions of files. This project focuses on a new decentralized semantic-aware metadata organization that exploits semantics of file metadata to improve system scalability, reduce query latency for complex data queries, and enhance file system functionality.
The research has four major components: 1) exploit metadata semantic-correlation to organize metadata in a scalable way, 2) exploit the semantic and scalable nature of the new metadata organization to significantly speed up complex queries and improve file system functionality, 3) fully leverage the semantic-awareness of the new metadata organization to optimize storage system designs, such as caching, prefetching, and data de-duplication, and 4) implement the new metadata organization, complex query functions, and system design optimizations in large-scale storage systems. This project has a broader impact on data-intensive scientific and engineering applications, graduate and undergraduate education, and K-12 education through its contributions to storage system research and its integration with an existing NSF-REU site award and an NSF-ITEST award.