Traditional fault-tolerance faces three significant challenges in the future of HPC. The first, and perhaps most intuitive, is that growing counts of platform hardware and software components, greater process densities, and increasing platform complexity result in higher rates of permanent, intermittent, and transient faults. The second challenge is the cost of recovery-based fault-tolerance, which relies on redundancy in space, time, or information. These approaches can increase power, time, or component count by a factor of two or more. The final challenge, and perhaps the least well understood, is the impact of the growing number of violations of the single-fault and fail-silent assumptions on extreme-scale HPC platforms. Some of the consequences of this trend are higher rates of data corruption, new impediments to failure prediction and detection, and a widening gap between the platform's and the application's views of the reliability and availability of the system.
Resilience is a new approach to coping with failure in HPC. Whereas fault-tolerance focuses on keeping a platform (or application) running in spite of failures in individual platform (or application) components, resilience is concerned with keeping the application running to a correct solution in a timely and efficient manner in the presence of degradations or failures of individual platform components. Resilience focuses on failure preemption rather than failure recovery, and on keeping the application mean time to interrupt high relative to the application mean time to failure. However, since keeping the application running at an undue cost in power or performance is unacceptable, resilience must produce total-system solutions that are balanced in their impact on power, performability, and cost.