Thanks for a great workshop! Presentations will be posted in the agenda at the bottom once they are all available. The white paper produced by the working group will be available in a month or two, when it is ready. The presentations below are in a variety of formats, including PDF, Keynote (Mac), PowerPoint, and 'new' PowerPoint.

WALK UP REGISTRATION: We can accept walk-up registrations. We have more than 80 registrants, but if you are in the area, please join us.

The motivation for a National Resilience workshop is the pressing need for an integrated multi-disciplinary approach to maintaining productivity within HPC systems of today and tomorrow.

This workshop will span multiple disciplines from machine hardware and the traditional reporting on the hardware environment to requirements of "self-aware" software that monitors itself and the machine environment to ensure application progress in the face of failure. Resilience is a new approach to coping with failure in HPC.

While fault-tolerance is focused on keeping a platform (or application) running in spite of failures, resilience is concerned with keeping the application running to a correct solution in a timely and efficient manner in the presence of degradations or failures of individual platform components. Resilience is focused on failure preemption, as opposed to failure recovery, and keeping the application mean time to interrupt high relative to the application mean time to failure. However, since keeping the application running at an undue cost in power or performance is unacceptable, resilience must produce total system solutions that are balanced in their impact on power, performance, and cost.

This workshop will bring together a broad spectrum of research needs from government, academia, and industry to address resilience requirements.

Workshop Hosts and Organizational Committee

    Nathan DeBardeleben (Los Alamos National Laboratory)
    John Daly (Department of Defense / Center for Exceptional Computing)
    Stephen Scott (Oak Ridge National Laboratory)
    William Harrod (DARPA)

Workshop Sponsors

The workshop is being sponsored by Los Alamos National Laboratory and the Petascale Data Storage Institute (DOE SciDAC). The workshop immediately follows the 5th Annual HECIWG-sponsored HEC FSIO 2009 Conference.

August 12-14, 2009

    Registration: August 12, 2009 (4:00pm - 6:00pm)
    Poster Session: August 12, 2009 (5:00pm - 8:00pm)
    Workshop (Keynote, Invited Talks, Panels): August 13, 2009 (all day)
    Resilience Committee and Panel Organizational Session: August 14, 2009 (8:30am - noon)

(extended) DEADLINES!

The original deadlines were July 18th for the hotel government-rate block and July 20th for workshop registration. The registration deadline has been extended to July 24th. The hotel block will be held until July 24th or until the rooms are gone.


The 2009 National HPC Workshop on Resilience will NOT have a registration fee. Refreshments, continental breakfasts, and lunch during the workshop will also be provided. See the workshop agenda below for times.

Attendees are responsible for their transportation and hotel costs.


    The Westin Arlington Gateway Hotel
    801 North Glebe Road
    Arlington, VA 22203, USA
    (703) 717-6200

The hotel is located 1 block from the Ballston DC Metro (subway) station. These directions show how to walk between the Ballston stop and the hotel.

The National Science Foundation (NSF) headquarters is 2 blocks from the hotel.

The Ballston Commons Mall is about 2 blocks from the hotel. These walking directions show the Mall but might go a bit out of the way getting there.


Click here for the 2009 Information Sheet and Registration Form (pdf)

Click here for the 2009 Information Sheet and Registration Form (docx)


A block of rooms is available at the government rate of $165 per night. When reserving rooms, PLEASE use the conference name HECI R&D 2009. Our block of rooms was originally to be held until Saturday, July 18th. The hotel has extended the block until July 24th or until the rooms are gone. After that, rooms may not be available.

Click here to go to the hotel reservation page.

Be sure to check out the agenda below. Non-speakers and non-organizational-committee members will not need to attend Friday's session.


The DC Metro (subway) stop Ballston is 1 block from the hotel. This link has the walking directions.

Public transportation in DC is not free, so be prepared to buy tickets when using the routes suggested below. Some of these public transit routes may take quite a long time. For instance, the BWI-to-hotel route using Amtrak takes several hours.

The Washington, DC area is served by three main airports:

  • Ronald Reagan Washington National Airport (DCA)

    Located in Arlington County, Virginia about 7 miles from the conference venue.
    • BY CAR: try these directions.
    • BY PUBLIC TRANSIT: (Metro subway)
      • Follow these directions to get to the Ronald Reagan Washington National Airport Metro station which is connected to the airport. It is on the blue/yellow subway line in the southern portion of the Metro map (PDF).
      • Using the Metro System Map (PDF) go from Ronald Reagan Washington National Airport Metro station (blue/yellow line, southern) to Ballston (orange line, western).
      • Walk from the Ballston Metro stop to the hotel using these directions.
  • Washington Dulles International Airport (IAD)

    Located in Dulles, Virginia about 21 miles from the conference venue.
    • BY CAR: try these directions.
    • BY PUBLIC TRANSIT: (Bus and Metro subway):
      • Follow these directions to go from the Dulles airport to the West Falls Church (orange line, western) Metro station.
      • Using the Metro System Map (PDF) go from West Falls Church (orange line, western) to Ballston (orange line, western). It is only 2 stops eastward.
      • Walk from the Ballston Metro stop to the hotel using these directions.
  • Baltimore / Washington International Airport (BWI)

    Located south of Baltimore, Maryland between 40 and 50 miles (depending on path) from the conference venue.
    • BY CAR: try these directions.
    • BY PUBLIC TRANSIT (Amtrak rail and Metro subway):
      • Follow these instructions to leave BWI using Amtrak.
      • Take the Amtrak free shuttle to the BWI Marshall Rail Station.
      • Buy a ticket to Union Station. Try this link to find trains that will take you there.
      • Get off at Union Station where you will transfer to the Metro (subway) line.
      • Using the Metro System Map (PDF) go from Union Station (red line, central) to Ballston (orange line, western).
      • Walk from the Ballston Metro stop to the hotel using these directions.
    • BY PUBLIC TRANSIT (Transit Authority bus and Metro subway):

For all logistical workshop questions, contact Paul Iwanchuk at pniwanc@lanl.gov.

For technical questions contact Nathan DeBardeleben (ndebard@lanl.gov) or John Daly (john.t.daly@ugov.gov).


All events are at the Westin Arlington Gateway.

Start Time | End Time | Session / Presentation | Presenter / Chair

August 12, 2009

4:00 PM | 6:00 PM | Registration / Packet Pickup |
5:00 PM | 8:00 PM | Poster Session | Light Food and Drink

August 13, 2009

7:30 AM | 8:30 AM | Continental Breakfast | All Attendees Welcome
8:15 AM | 9:30 AM | Registration / Packet Pickup (for late arrivals) |
8:30 AM | 8:40 AM | Welcome | Nathan DeBardeleben (LANL)
8:40 AM | 9:00 AM | Kickoff and Overview (Keynote) or Quicktime Movie | John Daly (DoD)
9:00 AM | 9:10 AM | THRUST 1: Introduction to Data Integrity | Bill Harrod (DARPA)
9:10 AM | 9:30 AM | Silent Data Corruption: A Threat to Data Integrity in High-End Computing Systems | Sarah Michalak (Los Alamos National Laboratory)
9:30 AM | 9:50 AM | Challenges with Data Integrity | Henry Newman (Instrumental)
9:50 AM | 10:10 AM | Architectural Vulnerability Factor? (Or, Does a Soft Error Matter?) | Arijit Biswas (Intel)
10:10 AM | 10:30 AM | THRUST 1, Discussion / Panel | THRUST 1 Presenters
10:30 AM | 11:00 AM | Break |
11:00 AM | 11:10 AM | THRUST 2: Introduction to Collection, Monitoring, and Analysis of Data | Stephen Scott (ORNL)
11:10 AM | 11:30 AM | The Role and Challenges of Monitoring in System Resiliency | Jim O'Connor (IBM)
11:30 AM | 11:50 AM | Data Analysis for HPC Resilience | George Ostrouchov (Oak Ridge National Laboratory)
11:50 AM | 12:10 PM | RAS Subsystems: How Will They Support Next Generation Platforms? | Jim Laros (Sandia National Laboratory)
12:10 PM | 12:30 PM | THRUST 2, Discussion / Panel | THRUST 2 Presenters
12:30 PM | 1:30 PM | Lunch (PROVIDED) |
1:30 PM | 1:40 PM | THRUST 3: Introduction to Metrics and Modeling | John Daly (DoD)
1:40 PM | 2:00 PM | HPC Kaizen - Metrics and Root Cause Analysis | Jon Stearley (Sandia National Laboratory)
2:00 PM | 2:20 PM | Modeling Techniques Towards Resilience | Christian Engelmann (ORNL) and Chokchai (Box) Leangsuksun (LaTech)
2:20 PM | 2:40 PM | Resilience at Scale: The Importance of Real World Data | Bianca Schroeder (U. of Toronto)
2:40 PM | 3:10 PM | THRUST 3, Discussion / Panel | THRUST 3 Presenters
3:10 PM | 3:30 PM | Break |
3:30 PM | 3:40 PM | THRUST 4: Introduction to Resilient Middleware | Nathan DeBardeleben (LANL)
3:40 PM | 4:00 PM | Problem Diagnosis, Debugging and Visualization Tools | Priya Narasimhan (Carnegie Mellon U.)
4:00 PM | 4:20 PM | Fault Tolerance Techniques Based on an Adaptive Runtime System | Sanjay Kale (U. of Illinois at Urbana-Champaign)
4:20 PM | 4:40 PM | Fault Tolerance and MPI - Can They Coexist? | Rich Graham (Oak Ridge National Laboratory)
4:40 PM | 5:00 PM | THRUST 4, Discussion / Panel | THRUST 4 Presenters
5:00 PM | 5:30 PM | Wrap Up and Next Steps | Nathan DeBardeleben (LANL)

August 14, 2009

INTENDED FOR Organizational Committee and Speakers Only
7:30 AM | 8:30 AM | Continental Breakfast | Organizational Committee and Speakers
8:30 AM | 12:00 PM | Resilience Committee and Speaker Organization Session |

Resilience website designed and hosted by Los Alamos National Laboratory.