SUBMISSION DEADLINE EXTENDED: February 18, 2013 - 11:59 PM EST (FINAL DEADLINE EXTENSION!)
WHEN: June 18th, 2013
VENUE: The New Yorker Hotel
IN ASSOCIATION WITH: The 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'13)
REGISTER: Register for HPDC and FTXS 2013
PAST FTXSs: FTXS 2010 and FTXS 2012 with DSN
FTXS is designed around interactive audience participation. As such, we have allocated blocks of time to discuss the papers and a set of seed questions.
The schedule is designed to stimulate interaction across the fault-tolerance topics of interest to the extreme-scale / HPC community. To that end, each session mixes papers on different topics, and papers on the same topic are distributed over several sessions.
Tuesday - June 18th, 2013
SESSION: Algorithms and Applications
- INVITED TALK: Toward Resilient Algorithms and Applications
  Speaker: Mike Heroux - Sandia National Laboratories
- Fault Tolerance Using Lower Fidelity Data in Adaptive Mesh Applications
  Authors: Anshu Dubey, Prateeti Mohapatra and Klaus Weide
SESSION: Hardware Issues
- INVITED TALK: Circuits for Resilient Systems
  Speaker: Vivek De - Intel
- Neutron Sensitivity and Software Hardening Strategies for Matrix Multiplication and FFT on Graphics Processing Units
  Authors: Paolo Rech, Laercio Pilla, Francesco Silvestri, Philippe Navaux and Luigi Carro
SESSION: Injection, Detection, and Replication
- Using Unreliable Virtual Hardware to Inject Errors in Extreme-Scale Systems
  Authors: Scott Levy, Matthew G. F. Dosanjh, Patrick G. Bridges and Kurt B. Ferreira
- Fault Detection in Multi-Core Processors Using Chaotic Maps
  Authors: Nageswara Rao
- Replication for Send-Deterministic MPI HPC Applications
  Authors: Arnaud Lefray, Thomas Ropars and André Schiper
SESSION: Energy and Checkpointing
- Energy-aware I/O Optimization for Checkpoint and Restart on a NAND Flash Memory System
  Authors: Takafumi Saito, Kento Sato, Hitoshi Sato and Satoshi Matsuoka
- When is Multi-version Checkpointing Needed
  Authors: Guoming Lu, Ziming Zheng and Andrew A. Chien
SESSION: Wrap-up discussion, conclusions, takeaways, action items, next steps
3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2013)
Welcome!
The 3rd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2013) will be held in conjunction with the 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'13) in New York City, New York (USA) on June 18th, 2013.
FTXS is a workshop aimed at identifying looming problems and discussing promising research solutions in the area of High Performance Computing (HPC). In particular, extreme-scale "leadership class" supercomputers fall into this broad category.
Motivation
For the HPC community, a new scaling in numbers of processing elements has superseded the historical trend of Moore’s Law scaling in processor frequencies. This progression from single core to multi-core and many-core will be further complicated by the community’s imminent migration from traditional homogeneous architectures to ones that are heterogeneous in nature. As a consequence of these trends, the HPC community is facing rapid increases in the number, variety, and complexity of components, and must thus overcome increases in aggregate fault rates, fault diversity, and the complexity of isolating root causes.
Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored, because the growth in the number of components, memory size, and data paths (including networks) makes the probability of silent data corruption (SDC) non-negligible. The HPC community has serious concerns regarding this issue, and application users are less confident that they can rely on a “correct” answer to their computations. Other studies have indicated a growing divergence between failure rates experienced by applications and rates seen by the system hardware and software. At Exascale, some scenarios project failure rates reaching one failure per hour. This conflicts with the current checkpointing approach to fault tolerance, which requires up to 30 minutes to restart a parallel execution on the largest systems. Lastly, stabilization periods for the largest systems are already significant, and the possibility that these could increase in length is of great concern. During the “Approaching Exascale” report at SC11, DOE program managers identified resilience as a black swan - the most difficult under-addressed issue facing HPC.
Open Questions
What does the fault-tolerance community need to do in order to be prepared to face the challenges of extreme-scale computing? What is needed to keep applications with billions of threads of parallelism up and running on systems that fail tens of times per day? As models predict less than 50% efficiency of traditional checkpoint/restart methods on future systems, are we ready to pay the cost of full redundancy, effectively performing redundant multi-threading (RMT) across entire systems? Do we even have the infrastructure necessary to implement an RMT strategy?
How is the supercomputing community going to efficiently isolate failures on enormously complex systems? Is there any chance of understanding these systems well enough that some failures could be predicted with enough accuracy and lead time to trigger useful failure-avoidance actions? What can the community do to protect applications from SDC in memory and logic? How far should the user and the programmer be involved in managing faults? What are the most promising self-healing numerical methods?
Goals
The goals of FTXS 2013 are to consider these complex questions, to discuss the unique limitations that extreme scale and complexity impose on traditional methods of fault-tolerance, and to explore new strategies for dealing with those challenges.
Call for Papers
Available here in TEXT format.
Available here in PDF format.
Nathan DeBardeleben - Los Alamos National Laboratory
Jon Stearley - Sandia National Laboratories
Franck Cappello - INRIA and University of Illinois at Urbana-Champaign
Rob Aulwes – Los Alamos National Laboratory
Aurélien Bouteiller – University of Tennessee, Knoxville
Greg Bronevetsky - Lawrence Livermore National Laboratory
Clayton Chandler – Department of Defense
Robert Clay – Sandia National Laboratories
John Daly - Department of Defense
Christian Engelmann – Oak Ridge National Laboratory
Felix Salfner - SAP Innovation Center Potsdam
Kurt Ferreira – Sandia National Laboratories
Ana Gainaru – University of Illinois at Urbana-Champaign
Leonardo Bautista Gomez – Tokyo Institute of Technology
Hideyuki Jitsumoto – The University of Tokyo
Rakesh Kumar - University of Illinois, Urbana-Champaign
Zhiling Lan – Illinois Institute of Technology
Naoya Maruyama – Tokyo Institute of Technology
Kathryn Mohror – Lawrence Livermore National Laboratory
Bogdan Nicolae – IBM Research – Ireland
Rolf Riesen – IBM Research – Ireland
Yves Robert - ENS Lyon
Thomas Ropars - EPFL
Mitsuhisa Sato – University of Tsukuba
Stephen Scott – Tennessee Tech University and Oak Ridge National Laboratory
Vilas Sridharan – AMD, Inc.
Roel Wuyts - Intel ExaScience Lab
Topics
Assuming hardware and software errors will be inescapable at extreme scale, this workshop will consider aspects of fault tolerance particular to extreme scale that include, but are not limited to:
- Quantitative assessments of cost in terms of power, performance, and resource impacts of fault-tolerant techniques, such as checkpoint restart, that are redundant in space, time or information
- Novel fault-tolerance techniques and implementations of emerging hardware and software technologies that guard against silent data corruption (SDC) in memory, logic, and storage and provide end-to-end data integrity for running applications
- Studies of hardware / software tradeoffs in error detection, failure prediction, error preemption, and recovery
- Advances in monitoring, analysis, and control of highly complex systems
- Highly scalable fault-tolerant programming models
- Metrics and standards for measuring, improving and enforcing the need for and effectiveness of fault-tolerance
- Failure modeling and scalable methods of reliability, availability, performability and failure prediction for fault-tolerant HPC systems
- Scalable Byzantine fault tolerance and security from single-fault and fail-silent violations
- Benchmarks and experimental environments, including fault-injection and accelerated lifetime testing, for evaluating performance of resilience techniques under stress
Participation and Paper Submission
Authors are invited to submit papers with unpublished, original work of not more than 8 pages of double column text using single spaced 10 point size on 8.5 x 11 inch pages (including all text, figures, and references), as per ACM 8.5 x 11 manuscript guidelines (document templates can be found at http://www.acm.org/sigs/publications/proceedings-templates).
Submissions are expected in the following categories:
- Regular papers presenting innovative ideas improving the state of the art
- Experience papers discussing the issues seen on existing extreme-scale systems, including some form of analysis and evaluation
- Extended abstracts proposing disruptive ideas in the field, including some form of preliminary results
Papers will be peer-reviewed, and accepted papers will be published in the workshop proceedings as part of the ACM digital library. Submission implies the willingness of at least one of the authors to register and present the paper.
Submit a paper using this link.
Submission of papers: February 18, 2013 - 11:59 PM EST (DEADLINE EXTENDED from February 11, 2013!)
Author notification: March 18, 2013
Camera ready papers: April 15, 2013
Workshop: June 18, 2013
Further Information
Workshop location, registration and accommodation: http://www.hpdc.org/2013/.
For questions contact Nathan DeBardeleben (email@example.com).