For the HPC community, scaling in the number of processing elements has superseded the historical trend of Moore's Law scaling in processor frequencies. This progression from single-core to multi-core and many-core will be further complicated by the community's imminent migration from traditional homogeneous architectures to heterogeneous ones. As a consequence of these trends, the HPC community faces rapid increases in the number, variety, and complexity of components, and must therefore overcome increases in aggregate fault rates, fault diversity, and the difficulty of isolating root causes.
Recent analyses demonstrate that HPC systems experience simultaneous (often correlated) failures. In addition, statistical analyses suggest that silent soft errors can no longer be ignored, because growth in component counts, memory size, and data paths (including networks) makes the probability of silent data corruption (SDC) non-negligible. The HPC community has serious concerns regarding this issue, and application users are less confident that they can rely on a correct answer to their computations. Other studies have indicated a growing divergence between the failure rates experienced by applications and the rates seen by the system hardware and software. At Exascale, some scenarios project failure rates reaching one failure per hour. This conflicts with the current checkpointing approach to fault tolerance, which requires up to 30 minutes to restart a parallel execution on the largest systems. Lastly, stabilization periods for the largest systems are already significant, and the possibility that these could increase in length is of great concern. During the Approaching Exascale report at SC11, DOE program managers identified resilience as a black swan: the most difficult under-addressed issue facing HPC.
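To make the tension between a one-failure-per-hour rate and a 30-minute restart concrete, the following is a rough back-of-the-envelope sketch, not a model taken from the studies mentioned above. It assumes a simple first-order cost model, and the function names and the 10-minute checkpoint cost are hypothetical values chosen only for illustration. The checkpoint interval uses Young's well-known first-order approximation, which is only accurate when overheads are small relative to the mean time between failures (MTBF).

```python
import math


def restart_only_bound(mtbf_min: float, restart_min: float) -> float:
    """Upper bound on the fraction of wall-clock time left for useful work
    when every failure costs one restart. Checkpoint cost and lost work
    are ignored, so the real figure can only be lower."""
    return max(0.0, 1.0 - restart_min / mtbf_min)


def young_interval(checkpoint_cost_min: float, mtbf_min: float) -> float:
    """Young's first-order approximation of the checkpoint interval:
    sqrt(2 * checkpoint_cost * MTBF)."""
    return math.sqrt(2.0 * checkpoint_cost_min * mtbf_min)


def useful_fraction(mtbf_min: float, restart_min: float,
                    checkpoint_cost_min: float) -> float:
    """First-order estimate of the useful-work fraction, clamped at zero.

    Assumes (for illustration only) that each failure costs one restart
    plus, on average, half a checkpoint interval of lost work."""
    interval = young_interval(checkpoint_cost_min, mtbf_min)
    checkpoint_overhead = checkpoint_cost_min / interval
    failure_overhead = (restart_min + interval / 2.0) / mtbf_min
    return max(0.0, 1.0 - checkpoint_overhead - failure_overhead)


if __name__ == "__main__":
    # Figures from the text: one failure per hour, 30-minute restart.
    print(restart_only_bound(mtbf_min=60.0, restart_min=30.0))  # 0.5
    # Hypothetical 10-minute checkpoint cost (an assumption, not from the text).
    print(useful_fraction(mtbf_min=60.0, restart_min=30.0,
                          checkpoint_cost_min=10.0))
```

Under these assumptions, restarts alone would consume half of the machine's time before any checkpoint cost or lost work is counted. With the hypothetical 10-minute checkpoint cost, the first-order estimate clamps to zero, which signals that the model has left its range of validity and that the system would make little forward progress.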
Past and Upcoming FTXS Workshops
- FTXS 2010 - in association with The 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2010) - Chicago, Illinois
- FTXS 2012 - in association with The 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012) - Boston, Massachusetts
- FTXS 2013 - in association with The 22nd International ACM Symposium on High Performance Parallel and Distributed Computing (HPDC'13) - New York City, New York
For questions, contact Nathan DeBardeleben (email@example.com).