June 16-20, 2013

Leipzig, Germany

Session Details

Name: System Challenges in HPC
Time: Thursday, June 20, 2013
11:00 AM - 12:00 PM
Room:   Hall 3
CCL - Congress Center Leipzig
Chair(s):   Franck Cappello, Argonne National Laboratory
Abstract:   Five years ago, reports and analyses on fault tolerance and resilience projected a catastrophic situation at Exascale, in which applications essentially could not make progress because of the overhead of fault tolerance/resilience mechanisms. Nowadays, with the experience of 10-Petaflops systems, experts consider that progress in building systems with hundreds of thousands of cores and advances in fault tolerance mechanisms make the Exascale objective more realistic. The three talks of the session will show this evolution: the first talk will present FT/resilience lessons learned from one of the largest HPC systems on earth. The second talk will show how good engineering can reduce the complexity of the FT/resilience problem, and the last talk will present how advanced research can help address the remaining difficult issues.
Presentations: What Petascale System Resiliency is Telling Us About Exascale Resiliency
11:00 AM - 11:20 AM
    Bill Kramer, NCSA
Resiliency in Exascale Systems – Rocket Science or Engineering?
11:20 AM - 11:40 AM
    Satoshi Matsuoka, Tokyo Institute of Technology
Sophisticated Fault Resilience
11:40 AM - 12:00 PM
    Rolf Riesen, IBM Research Ireland
