FTXS 2012 Attendee List and Schedule in PDF format (2 pages). We have 73 attendees pre-registered.


Workshop Agenda

FTXS is designed around highly interactive audience participation. We have therefore allocated blocks of time to discuss the papers and some seed questions.

The schedule is designed to stimulate interaction across the fault-tolerance topics of interest to the extreme-scale HPC community. To that end, each session mixes papers on different topics, and papers on the same topic are spread over several sessions.

Monday - June 25th, 2012


Start | End | Title | Authors / Speaker
8:30 | 8:35 | Welcome and Logistics | FTXS Organizers
8:35 | 9:00 | Invited Keynote | John Daly - Department of Defense / Center for Exceptional Computing
9:00 | 9:25 | CHAOTIC-IDENTITY MAPS FOR ROBUSTNESS ESTIMATION OF EXASCALE COMPUTATIONS | Nageswara Rao
9:25 | 9:50 | ASYNCHRONOUS CHECKPOINT MIGRATION WITH MRNET IN THE SCALABLE CHECKPOINT / RESTART LIBRARY | Kathryn Mohror, Adam Moody and Bronis de Supinski
9:50 | 10:00 | Facilitated Discussion |
10:00 | 10:30 | BREAK |
10:30 | 10:55 | DOES PARTIAL REPLICATION PAY OFF? | Jon Stearley, Kurt Ferreira, David Robinson, Dorian Arnold, Patrick Bridges, Jim Laros, Kevin Pedretti and Rolf Riesen
10:55 | 11:20 | ENERGY CONSIDERATIONS IN CHECKPOINTING AND FAULT TOLERANCE PROTOCOLS | Mohammed el Mehdi DIOURI, Olivier GLÜCK, Laurent LEFEVRE and Franck CAPPELLO
11:20 | 11:45 | A PROGRAMMING MODEL FOR RESILIENCE IN EXTREME SCALE COMPUTING | Saurabh Hukerikar, Pedro C. Diniz and Robert F. Lucas
11:45 | 12:00 | Facilitated Discussion |
12:00 | 1:30 | Lunch - PROVIDED |
1:30 | 1:55 | ROSE::FTTRANSFORM - A SOURCE-TO-SOURCE TRANSFORMATION FRAMEWORK FOR EXASCALE FAULT-TOLERANCE RESEARCH | Jacob Lidman, Daniel Quinlan, Chunhua Liao and Sally McKee
1:55 | 2:20 | A MESSAGE-LOGGING PROTOCOL FOR MULTICORE SYSTEMS | Esteban Meneses, Xiang Ni and Laxmikant V. Kalé
2:20 | 2:45 | AN EVALUATION OF DIFFERENCE AND THRESHOLD TECHNIQUES FOR EFFICIENT CHECKPOINTS | Sean Hogan, Andrew Chien and Jeff Hammond
2:45 | 3:00 | Facilitated Discussion |
3:00 | 3:30 | BREAK |
3:30 | 3:55 | ON THE COMPLEXITY OF SCHEDULING CHECKPOINTS FOR COMPUTATIONAL WORKFLOWS | Yves Robert, Frédéric Vivien and Dounia Zaidouni
3:55 | 4:20 | DESIGN AND IMPLEMENTATION OF A HARDWARE CHECKPOINT/RESTART CORE | Ashwin Mendon, Ron Sass, Zachary Baker and Justin Tripp
4:20 | 4:45 | A SCALABLE DOUBLE IN-MEMORY CHECKPOINT AND RESTART SCHEME TOWARDS EXASCALE | Gengbin Zheng, Xiang Ni and Laxmikant Kale
4:45 | 5:00 | Wrap-up discussion, conclusions, takeaways, action items, next steps |

2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012)

Welcome!

The 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012) will be held in conjunction with the 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012) in Boston, Massachusetts (USA) from June 25 to June 28, 2012.

FTXS is a workshop aimed at identifying looming problems and discussing promising research solutions for fault tolerance in High Performance Computing (HPC), with particular attention to extreme-scale "leadership class" supercomputers.

Motivation

For the HPC community, a new scaling in numbers of processing elements has superseded the historical trend of Moore’s Law scaling in processor frequencies. This progression from single core to multi-core and many-core will be further complicated by the community’s imminent migration from traditional homogeneous architectures to heterogeneous ones. As a consequence of these trends, the HPC community is facing rapid increases in the number, variety, and complexity of components, and must therefore overcome increases in aggregate fault rates, fault diversity, and the complexity of isolating root causes.

Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored, because the growth in component counts, memory size, and data paths (including networks) makes the probability of silent data corruption (SDC) non-negligible. The HPC community has serious concerns regarding this issue, and application users are less confident that they can rely on a “correct” answer to their computations. Other studies have indicated a growing divergence between the failure rates experienced by applications and the rates seen by the system hardware and software. At exascale, some scenarios project failure rates reaching one failure per hour. This conflicts with the current checkpointing approach to fault tolerance, which requires up to 30 minutes to restart a parallel execution on the largest systems. Lastly, stabilization periods for the largest systems are already significant, and the possibility that these could increase in length is of great concern. During the “Approaching Exascale” report at SC11, DOE program managers identified resilience as a black swan - the most difficult under-addressed issue facing HPC.
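
The mismatch between a one-hour failure interval and a 30-minute restart can be made concrete with a rough calculation. The following is a minimal sketch, not taken from the workshop material: it assumes Young's first-order approximation for the checkpoint interval, sqrt(2 * C * M), and a simple waste model that is only meaningful when the checkpoint cost C is small relative to the mean time between failures M. The function names and the 5-minute checkpoint cost are hypothetical; the one-hour failure interval and the 30-minute restart are the figures quoted above.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the compute time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def estimated_efficiency(checkpoint_cost: float, restart_cost: float, mtbf: float) -> float:
    """Rough fraction of wall-clock time spent on useful work.

    Assumed first-order waste model (illustrative only):
      - checkpoint overhead: C / (tau + C)
      - per-failure loss: the restart cost plus, on average, half a
        checkpoint period of lost work, incurred once every `mtbf`
    """
    tau = young_interval(checkpoint_cost, mtbf)
    period = tau + checkpoint_cost
    waste = checkpoint_cost / period + (restart_cost + period / 2.0) / mtbf
    # Clamp at zero: the approximation breaks down once waste reaches 1.
    return max(0.0, 1.0 - waste)


if __name__ == "__main__":
    # All times in minutes. MTBF and restart cost follow the scenario above;
    # the 5-minute checkpoint cost is a hypothetical value.
    eff = estimated_efficiency(checkpoint_cost=5.0, restart_cost=30.0, mtbf=60.0)
    print(f"estimated efficiency: {eff:.0%}")
```

Under these assumptions the estimate falls well below 50%, and it is dominated by the restart term rather than by the checkpoint overhead. More careful models would change the exact figure, so the result should be read as an order-of-magnitude illustration only.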

Open Questions

What does the fault-tolerance community need to do in order to be prepared to face the challenges of extreme scale computing? What is needed to keep applications with billions of threads of parallelism up and running on systems that fail tens of times per day? As models predict less than 50% efficiency of traditional checkpoint/restart methods on future systems, are we ready to pay the cost of full redundancy, effectively performing redundant multi-threading (RMT) across entire systems? Do we even have the infrastructure necessary to implement an RMT strategy?

How is the supercomputing community going to efficiently isolate failures on enormously complex systems? Is there any chance of understanding these systems well enough that some failures could be predicted with sufficient accuracy and lead time to trigger useful failure-avoidance actions? What can the community do to protect applications from SDC in memory and logic? How far should the user and the programmer be involved in managing faults? What are the most promising self-healing numerical methods?

Goals

The goals of FTXS 2012 are to consider these complex questions, to discuss the unique limitations that extreme scale and complexity impose on traditional methods of fault-tolerance, and to explore new strategies for dealing with those challenges.

General Information

Call for Papers

Available here in TEXT format.
Available here in PDF format.

Workshop Organizers

    Nathan DeBardeleben - Los Alamos National Laboratory
    Jon Stearley - Sandia National Laboratories
    Franck Cappello - INRIA and University of Illinois at Urbana-Champaign

Program Committee

    George Bosilca - University of Tennessee, Knoxville
    Greg Bronevetsky - Lawrence Livermore National Laboratory
    John Daly - Department of Defense
    Christian Engelmann - Oak Ridge National Laboratory
    Kurt Ferreira - Sandia National Laboratories
    Ana Gainaru - University of Illinois, Urbana-Champaign
    Hideyuki Jitsumoto - University of Tokyo
    Zbigniew Kalbarczyk - University of Illinois, Urbana-Champaign
    Rakesh Kumar - University of Illinois, Urbana-Champaign
    Zhiling Lan - Illinois Institute of Technology
    Bogdan Nicolae - INRIA
    Yves Robert - ENS Lyon
    Roel Wuyts - Intel ExaScience Lab and KU Leuven (Leuven, Belgium)
    Felix Salfner - SAP Innovation Center Potsdam
    Mitsuhisa Sato - University of Tsukuba
    Stephen Scott - Oak Ridge National Laboratory and Tennessee Tech University

Topics

Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance particular to extreme scale that include, but are not limited to:
    • Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerant techniques, such as checkpoint/restart, that are redundant in space, time, or information
    • Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications
    • Studies of hardware / software tradeoffs in error detection, failure prediction, error preemption, and recovery
    • Advances in monitoring, analysis, and control of highly complex systems
    • Highly scalable fault-tolerant programming models
    • Metrics and standards for assessing the need for fault tolerance and for measuring, improving, and enforcing its effectiveness
    • Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems
    • Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations
    • Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress

Participation and Paper Submission

Submissions are expected in the following categories:
    • Regular papers presenting innovative ideas improving the state of the art
    • Experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation
    • Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results
Submissions must be sent electronically, must conform to the IEEE conference proceedings style, and should not exceed six pages including all text, appendices, and figures. Use US Letter format, not A4.
All papers will be published, as workshop papers, in the DSN 2012 proceedings and on IEEE Xplore.
Click here to SUBMIT A PAPER.

Important Dates

    Submission of papers: March 24, 2012 - 11:59 PM EST (deadline extended from March 16, 2012)
    Author notification: April 13, 2012
    Camera ready papers: April 27, 2012
    Workshop: June 25, 2012

Further Information

Workshop location, registration and accommodation: http://www.dsn.org.

Questions

For questions contact Nathan DeBardeleben (ndebard@lanl.gov).

Resilience website designed and hosted by Los Alamos National Laboratory.