1st Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2010)

Thank you!

The FTXS 2010 program committee and chairs want to thank everyone for a great workshop. Presentations are posted below in the agenda section. Hopefully we will see you next year in Hong Kong at DSN 2011!

WORKSHOP AGENDA

Monday - June 28th, 2010


Start | End | Topic | Presenter
8:00 | 8:30 | Workshop Registration |
8:30 | 9:15 | Introduction / Welcome / Level-Setting (PDF) | Nathan DeBardeleben, CEC / DoD
9:15 | 10:00 | Quantifying Effectiveness of Failure Prediction and Response in HPC Systems: Methodology and Example (PPT) | Jackson Mayo, Sandia National Laboratories
10:00 | 10:30 | Coffee Break |
10:30 | 11:15 | Accurate Fault Prediction of BlueGene/P RAS Logs Via Geometric Reduction (PDF) | Josh Thompson, Colorado State University
11:15 | 12:00 | A Practical Failure Prediction with Location and Lead Time for Blue Gene/P (PDF) | Ziming Zheng, Illinois Institute of Technology
12:00 | 1:30 | Lunch Break |
1:30 | 2:15 | Distributed Object Storage Rebuild Analysis via Simulation with GOBS (PDF) | Justin Wozniak, Argonne National Laboratory
2:15 | 3:00 | See Applications Run and Throughput Jump: The Case for Redundant Computing in HPC (PDF) | Rolf Riesen, Sandia National Laboratories
3:00 | 3:30 | Coffee Break |
3:30 | 4:15 | Cross-Layer Reliability Status Report | Nick Carter, Intel
4:15 | 5:00 | Open Floor | All Attendees
6:00 | 7:30 | Registration / Welcome Reception | @ International Foyer - 2nd Floor

General Information

Call for Papers

The call for papers is available here. Submissions should not exceed six pages including all text, appendices, and figures. The formatting information for submissions is the same as the basic DSN guidelines (which can be found here - including style files).
SUBMISSION DEADLINE EXTENDED to March 23rd, close of business Pacific time. HARD DEADLINE.

Submissions are closed.

Venue

Held in conjunction with the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010) in Chicago, Illinois, USA, June 28 - July 1, 2010. For a full list of the DSN workshops, see here.
FTXS will take place on Monday, June 28th.

Objectives and Challenges

With the emergence of many-core processors, accelerators, and alternative/heterogeneous architectures, the HPC community faces a new challenge: scaling in the number of processing elements that supersedes the historical trend of scaling in processor frequencies. The attendant increase in system complexity has first-order implications for fault tolerance. Mounting evidence invalidates traditional assumptions of HPC fault tolerance: faults are increasingly multiple-point instead of single-point and interdependent instead of independent; silent failures and silent data corruption are no longer rare enough to discount; stabilization time consumes a larger fraction of useful system lifetime, with failure rates projected to exceed one per hour on the largest systems; and application interrupt rates are apparently diverging from system failure rates.

The workshop will convene a diverse group of experts in HPC and fault-tolerance to inaugurate a fault-tolerance research agenda for responding to the unique challenges that extreme scale and complexity present. Innovation is encouraged and discussion of non-traditional approaches is welcome.

Program Committee

    John Daly, Center for Exceptional Computing / Department of Defense, USA (Co-Chair)
    Nathan DeBardeleben, Center for Exceptional Computing / Department of Defense, USA (Co-Chair)
    Greg Bronevetsky, Lawrence Livermore National Laboratory, USA
    Franck Cappello, INRIA, France
    Daniel Katz, University of Chicago, USA
    Armando Fox, University of California, USA
    Zbigniew Kalbarczyk, University of Illinois, USA
    Yasunori Kimura, Fujitsu Laboratories, Japan
    Sébastien Monnet, University of Pierre and Marie Curie, France
    Takashi Nanya, University of Tokyo, Japan
    Nuno Neves, University of Lisbon, Portugal
    Stephen Scott, Oak Ridge National Laboratory, USA
    Marc Snir, University of Illinois, USA
    Jon Stearley, Sandia National Laboratories, USA
    Kishor Trivedi, Duke University, USA

Topics

Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance peculiar to extreme scale that include, but are not limited to:
    • Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerant techniques, such as checkpoint restart, that are redundant in space, time or information
    • Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications
    • Studies of hardware / software tradeoffs in error detection, failure prediction, error preemption, and recovery
    • Advances in monitoring, analysis, and control of highly complex systems
    • Highly scalable fault-tolerant programming models
    • Metrics and standards for measuring, improving and enforcing the need for and effectiveness of fault-tolerance
    • Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems
    • Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations
    • Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress

Participation and Paper Submission

Submissions are expected in the following categories:
    • Extended abstracts that propose original ideas in the field
    • Work-in-progress reports that present considerable progress in the challenging areas
    • Position papers that identify open issues or discuss existing solutions
Papers should be submitted using this link.

Important Dates

    Submission of papers: March 23, 2010 (EXTENDED!) - close of business, hard deadline
    Author notification: April 9, 2010
    Camera ready papers: April 30, 2010

Further Information

Workshop location, registration and accommodation: http://www.dsn.org.
QUESTIONS

For questions contact Nathan DeBardeleben (ndebard@lanl.gov) or John Daly (john.t.daly@ugov.gov).

Resilience website designed and hosted by Los Alamos National Laboratory.