Nathan DeBardeleben (firstname.lastname@example.org) regularly hosts resilience seminars at LANL. Topics of interest include discussion of recent conferences, publications, and current events (such as a recent problem at hand), as well as important and influential works in the field.
Meetings are held at LANL in the Center for Nonlinear Studies (CNLS) conference room at TA-3-1690. The seminar takes place biweekly on Wednesdays from 9-10am.
Abstract: The unfortunate reality of next generation HPC environments is the high probability of process loss due to hardware failure for large scale and/or long running scientific applications. Applications have a long history of using checkpoint/restart fault tolerance techniques. The transparent implementation of such techniques in system software support environments, such as MPI, is an appealing option for HPC application developers reluctant or unable to restructure their code. Checkpoint/restart researchers often struggle while experimenting with new techniques due to the implementation overhead involved in extending or re-implementing support environments. In Open MPI we have designed a novel checkpoint/restart fault tolerance infrastructure with a particular focus on extensibility, modularity, scalability, and performance. The primary goal of this infrastructure is to empower researchers to explore new techniques while enabling applications to transparently take advantage of fault tolerance services provided by Open MPI. This talk will discuss the checkpoint/restart fault tolerance infrastructure in Open MPI and how it is being used to support application fault tolerance, proactive process migration, parallel debugging, and HPC system administration.
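As a rough illustration of what "transparent" means here, consider a minimal MPI program written with no checkpoint logic of its own. With a checkpoint/restart-enabled Open MPI build, a checkpoint of such an unmodified job is requested from outside the application (for example with the ompi-checkpoint and ompi-restart command-line tools) rather than from the application source. The sketch below is illustrative only; the iteration count and sleep call are stand-ins for real work.

#include <mpi.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, step;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Long-running iterative loop; note there is no checkpoint code here.
       Checkpoints would be taken transparently by the MPI runtime. */
    for (step = 0; step < 100000; step++) {
        /* ... compute and exchange data with neighboring ranks ... */
        sleep(1);                           /* stand-in for a unit of work */
        if (rank == 0 && step % 60 == 0)
            printf("completed step %d\n", step);
    }

    MPI_Finalize();
    return 0;
}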
Bio: Joshua Hursey is a Ph.D. candidate in Computer Science at Indiana University. He is currently working in the Open Systems Laboratory on the Open MPI project under the direction of Dr. Andrew Lumsdaine. Joshua received a B.A. in Computer Science from Earlham College in 2003, and an M.S. in Computer Science from Indiana University in 2006. His current work focuses on developing a transparent checkpoint/restart infrastructure for scalable fault tolerance in Open MPI. His primary research interests include parallel and distributed systems, scalable fault tolerance, software engineering, and scientific computing.
Further information: http://www.cs.indiana.edu/~jjhursey/
Abstract: Developing resilient systems, whether computing or otherwise, has been a widespread objective for quite some time. Over the past two decades, the DoD (via its various entities) and the NSF have sponsored a considerable amount of effort in this regard. Of particular note was DARPA's major "built-in-test" program in the 1990s that precipitated the development and wide-scale fielding of technologies for in-the-field fault detection, diagnosis, and prediction for both mechanical and electrical systems. Ralph will briefly recap some of those efforts, and describe how Cisco's efforts to produce a 100% uptime router are impacting the development of a truly resilient MPI.
Abstract: Under certain conditions, an application's optimal checkpoint interval can be determined as a function of the dump time and application mean time to interrupt (AMTTI). In practice, an estimate of AMTTI for each application is therefore necessary to assign an optimal checkpoint interval. This estimate is based on a number of job and system parameters that can be difficult to determine and may even change over time. Errors in estimating AMTTI lead to errors in assigning optimal checkpoint intervals. This in turn impacts average application efficiency. By making use of BeoSim, a discrete-event driven multi-cluster simulator, we study the impact of non-optimal checkpoint intervals on overall application efficiency. Using LANL's Pink cluster and workload to parameterize the simulator, we find that dramatically overestimating the AMTTI has a fairly minor impact on application efficiency. The first two-thirds of the talk will introduce BeoSim and this recent study of non-optimal checkpoint intervals, while the final third will detail some previous work regarding the use of a checkpoint-migration scheme to mitigate network over-subscription in a grid environment.
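As a back-of-the-envelope companion to the abstract (not the models or parameters used in the talk), the sketch below applies a first-order model in the spirit of Young's and Daly's approximations: the optimal interval is roughly sqrt(2 x dump time x AMTTI), and the lost fraction of wall-clock time is roughly dump/interval for writing checkpoints plus interval/(2 x AMTTI) for rework after interrupts. The dump time and AMTTI values are purely illustrative. Running it shows that even a several-fold error in the AMTTI estimate costs only a few percentage points of efficiency under this model, consistent with the study's finding.

#include <math.h>
#include <stdio.h>

/* Optimal checkpoint interval from a first-order model (Young/Daly style):
   tau_opt ~ sqrt(2 * dump_time * AMTTI). */
static double optimal_interval(double dump, double amtti)
{
    return sqrt(2.0 * dump * amtti);
}

/* Approximate fraction of wall-clock time spent on useful work when
   checkpointing every tau seconds: lose dump/tau to writing checkpoints
   and roughly tau/(2*amtti) to rework after interrupts (tau << amtti). */
static double efficiency(double tau, double dump, double amtti)
{
    return 1.0 - dump / tau - tau / (2.0 * amtti);
}

int main(void)
{
    const double dump     = 600.0;          /* dump time: 10 min (illustrative) */
    const double true_m   = 8.0 * 3600.0;   /* true AMTTI: 8 h   (illustrative) */
    const double factor[] = { 0.25, 0.5, 1.0, 2.0, 4.0 };

    for (int i = 0; i < 5; i++) {
        double est_m = factor[i] * true_m;            /* mis-estimated AMTTI    */
        double tau   = optimal_interval(dump, est_m); /* interval actually used */
        printf("AMTTI estimate %4.2fx -> interval %6.0f s, efficiency %.3f\n",
               factor[i], tau, efficiency(tau, dump, true_m));
    }
    return 0;
}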
Bio: Dr. Will Jones is an assistant professor of Computer Science at Coastal Carolina University in Myrtle Beach, South Carolina. He held the position of assistant professor of Electrical Engineering at the United States Naval Academy for two years before accepting a position at CCU. His research interests include parallel job scheduling and resilience in computational clusters. He earned a Ph.D. in Computer Engineering from Clemson University in 2005.
Abstract: The Reconfigurable Computing Cluster project centers around an experimental parallel computing platform that consists exclusively of Platform FPGA nodes. Each of these highly configurable devices is capable of hosting Linux/OpenMPI, application-specific hardware accelerators, and an integrated on-chip/off-chip network on a single, power-efficient chip. Current work is investigating the feasibility of scaling this model to tens-of-thousands of nodes (in terms of power, size, and speed) and the benefits of various MPI compute- and communication-assists implemented in hardware. Recently, it was realized that this also is an excellent testbed for experiments in resiliency. Specifically, hardware cores can be readily introduced that (1) perturb the system in various, reproducible ways that mirror the undesirable behavior found in very large HPC machines today, and (2) observe system behavior without disturbing the application running at wall clock speed (i.e., not in simulation). The first half of this talk will introduce Spirit, a 64-node FPGA cluster that has been constructed at the University of North Carolina at Charlotte, and the HPC applications that run on it. The second half of the talk will focus on our nascent experiments in resiliency.
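Purely as a software analogue of point (1) above (the real testbed implements this with hardware cores in the FPGA fabric, and point (2), non-intrusive observation, is not something a software sketch can capture), a reproducible perturbation schedule can be driven by a fixed seed so that an experiment can be replayed exactly. All of the numbers below are hypothetical.

#include <stdio.h>
#include <stdlib.h>

#define NODES 64   /* hypothetical cluster size, matching Spirit's 64 nodes */
#define STEPS 10   /* number of perturbation events in one experiment       */

int main(void)
{
    srand(12345);                        /* fixed seed => reproducible schedule */

    for (int step = 0; step < STEPS; step++) {
        int victim = rand() % NODES;     /* which node to perturb this event    */
        int mode   = rand() % 2;         /* 0: throttle the node, 1: fault it   */
        printf("event %2d: %s node %d\n",
               step, mode ? "fault" : "throttle", victim);
    }
    return 0;
}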
Bio: Dr. Ron Sass is an Associate Professor in the Electrical and Computer Engineering Department at The University of North Carolina at Charlotte. He previously held positions at the University of Kansas and Clemson University. He has been the Principal Investigator on several FPGA-based research projects over the last decade, including the Adaptable Computing Cluster, which integrated FPGAs in the network interface card of a commodity cluster. He received his Ph.D. from Michigan State University in 1999.
Abstract: Resilience is a new approach to thinking about the growing failure rates of HPC systems. While fault-tolerance addresses the problem of keeping the platform (or application) running in spite of failures in individual platform (or application) components, resilience focuses on the problem of keeping the application running to a correct solution in a timely and resource efficient manner in the presence of degradations and failures in individual platform components. While fault-tolerance is resigned to the notion that platform failures lead to application interrupts, resilience attempts to avoid application interrupts by anticipating and circumventing platform failures. We will discuss the challenges faced by traditional fault tolerance and suggest that resilience can be a more effective approach for keeping the application running to a correct solution.
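To make the contrast concrete, here is a deliberately simplified sketch of the "anticipate and circumvent" idea, with all function names, risk scores, and thresholds hypothetical: instead of reacting after a failure interrupts the application, a resilient system watches failure predictors and moves work away from suspect nodes before the interrupt occurs.

#include <stdio.h>

#define NODES 8

/* Hypothetical normalized failure-risk score for a node, e.g. derived from
   temperature, correctable-error rates, and log analysis (placeholder data). */
static double predicted_failure_risk(int node)
{
    return (node == 3) ? 0.9 : 0.1;
}

/* Placeholder for a proactive action such as checkpointing the processes on
   the node and migrating them elsewhere before the predicted failure. */
static void migrate_work_off(int node)
{
    printf("proactively migrating work off node %d\n", node);
}

int main(void)
{
    const double threshold = 0.8;

    /* Fault tolerance reacts after a failure interrupts the application;
       in this caricature of resilience, the system acts beforehand. */
    for (int node = 0; node < NODES; node++)
        if (predicted_failure_risk(node) > threshold)
            migrate_work_off(node);

    return 0;
}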
Bio: John T. Daly is a computer systems researcher and resilience thrust lead for the Advanced Computing Systems (ACS) Program at the Center for Exceptional Computing (CEC). He is responsible for stimulating and directing collaborative research efforts in industry, academia, and government that are focused on the problem of keeping supercomputer applications running toward a correct solution in a timely and efficient manner in the presence of system degradations and failures. His research interests include mathematical modeling and analysis of failure, reliability, fault tolerance, calculational correctness, and throughput for applications at extreme scale. Prior to working at the CEC, John was a scientist and resilience researcher in the High Performance Computing (HPC) division at Los Alamos National Laboratory and a software engineer and application analyst for Raytheon Intelligence and Information Systems. He is a nationally recognized expert in resilience with more than 20 years of experience developing, porting, and running applications as an early adopter of many of the world's fastest supercomputers.