Frequently Asked Questions for the LANL failure data for all machines

Q: What is the data format?
A: A failure record contains the time when the failure started (start time), the time when it was resolved (end time), the system and node affected, the type of workload running on the node, and the root cause.

Q: How does failure reporting work at LANL?
A: Failure reporting at LANL follows this protocol. Failures are detected by an automated monitoring system that pages operations staff whenever a node is down. The operations staff then create a failure record in the database specifying the start time of the failure and the system and node affected, then turn the node over to a system administrator for repair. Upon repair, the system administrator notifies the operations staff, who then put the node back into the job mix and fill in the end time of the failure record. If the system administrator was able to identify the root cause of the problem, he provides operations staff with the appropriate information for the ``root cause'' field of the failure record. Otherwise the root cause is specified as ``Unknown''.

Q: Are there common guidelines/procedures that your staff follow to classify the root cause of a problem (e.g. what is a hardware vs. a software problem), or is this left to the individual's discretion?
A: LANL has developed a scheme for classifying failures and assigning failures to root cause categories. The scheme was developed jointly by hardware engineers, administrators and operations staff at LANL, and is used by all staff.

Q: How is the timestamp for the start time of a problem generated? Is it the time when the monitoring software first recorded an error, or is it the time when a sys admin first started looking at the problem?
A: It is the time that an operator enters into a system shortly after being notified by a monitor that a node is unavailable.

Q: What is the reason for the cause of a failure to be classified as "Unknown"?
Is it that the sys admin taking care of the problem didn't make an entry, or is it that the cause of the problem was never figured out?
A: It is that the cause was not figured out through a weekly follow-up meeting. It could be, of course, that the operator got frustrated and just booted a node and didn't follow up as much as they should, but it is not a non-entry. There is a weekly meeting where all of this stuff is brought up and the operators, sysadmins and hardware engineers look at their logs and try to classify everything they can.

Q: Is there a way to determine from the data whether a fault was transient or permanent? E.g. a hardware failure classified as "Memory" could either be a permanent failure requiring replacement of the faulty DIMM, or a transient failure, such as a parity check problem, that only requires rebooting.
A: The data does not contain this information. What was done to fix the node is not really recorded, just the cause if it can be determined.

Q: Almost all systems are clusters of multiple nodes. Only four systems are non-clusters (three non-cluster SMPs and one non-cluster NUMA). Are those non-cluster machines also used for scientific computing or do they serve a different purpose?
A: Yes, but the non-cluster SMPs really pre-date clusters for the most part, so this would be highly vectorized codes, small numbers of processors, but I/O every few hours and very CPU-intensive.

Q: Does the data span only the production time of a system or also the testing period before the system went into production?
A: We don't keep the failures before production for a long period, so the data is basically from production time. The install date, production date, and decommission date for each set of nodes (not cluster) is given in the table on the web page. One thing to note is that some clusters grew during their lives, so the data has these dates for each node. Again, there are really no records previous to production.
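As an illustration of the record layout described in the first answer and the production dates mentioned above, here is a minimal Python sketch. The class name, field names, helper function and example values are all assumptions made for illustration; the released data files may label and encode these fields differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class FailureRecord:
    # Hypothetical field names; the FAQ only lists what a record contains.
    system: int
    node: int
    start_time: datetime           # entered by the operator after the monitor's page
    end_time: Optional[datetime]   # filled in once the node is back in the job mix
    workload: str
    root_cause: str = "Unknown"    # default when no cause was identified

    def repair_hours(self) -> Optional[float]:
        """Hours between start and end of the failure, or None if still open."""
        if self.end_time is None:
            return None
        return (self.end_time - self.start_time).total_seconds() / 3600.0


def in_production(rec, production_date, decommission_date=None):
    """True if the failure started within the node's production window."""
    if rec.start_time < production_date:
        return False
    return decommission_date is None or rec.start_time <= decommission_date


# Made-up example values, not taken from the LANL data.
example = FailureRecord(
    system=2,
    node=17,
    start_time=datetime(2003, 5, 1, 8, 0),
    end_time=datetime(2003, 5, 1, 14, 30),
    workload="compute",
    root_cause="Memory",
)
print(example.repair_hours())
```

Because the start time is entered by an operator rather than by the monitor itself, durations computed this way are approximate.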
Q: Does the new field for the type of CPU/memory distinguish only different architectures (e.g. Alpha vs. MIPS) or also different series of the same architecture (e.g. MIPS R4000 vs. MIPS R12000)?
A: In this data, number 3 might be an Alpha EV67, and an EV68 would be a different number.

Q: Suppose a field-replaceable node is put to use. Is there a way to identify it in the data as a new node, or would it just take over the "identity" (node number etc.) of the old node that it replaces?
A: Correct, the number would just be assumed by the new hardware, sorry. There is no way to track old nodes to new nodes.

Q: Possible problems with failure data are underreporting of failures and misreporting of root cause. Do you think the accuracy of this data might be significantly affected by these problems?
A: We don't consider underreporting (i.e. a failure does not get reported at all) a serious concern, since failure detection is initiated by automatic monitoring and failure reporting involves several people from different administrative domains (operations staff and system administrators). While misdiagnosis can never be ruled out completely, its frequency depends on the skills of the system administrator. LANL employs highly trained staff backed by a well-funded, cutting-edge technology integration team, often pulling new technology into existence in collaboration with vendors.

Q: Does root cause information get amended if more information about a failure becomes available later on?
A: Yes. Operations staff and system administrators often have follow-up meetings for failures with ``Unknown'' root cause. If through those meetings or other ways the root cause becomes clear later on, the corresponding failure record gets amended accordingly.

Q: What is the reason for the cause of a failure to be classified as "Unknown"?
A: Unknown root causes are most common during the early life of a system, mostly because at this time the staff's experience in running the system and in diagnosing its problems is limited. As time goes on, the fraction of failures with root cause classified as unknown usually drops.

Q: The single most common root cause for hardware problems is memory. Do you know what the most common cause for memory-related failures is?
A: Dual non-corrected errors, or dual corrected errors above a certain threshold.

Q: Do systems auto-reboot if certain types of failures are automatically detected, and if so would there be a corresponding entry in the failure log?
A: Not really. Systems go down, monitors find that and tell operators, and operators make logs and decide what to do. They run diags or something like that before the nodes enter the job scheduler mix.

Q: Do all nodes have some locally attached disks, or do some of the systems rely entirely on remote storage servers? (I assume the data does not include failures of remotely accessed storage?)
A: All nodes on machines you have data for have disks, but the 2-processor machines do not use the disk for the system; they don't use it for much. All access is via in-memory fs or remote fs for those machines. This does not include remote storage failures.

Q: For the non-cluster (single-node) SMP systems (systems 7, 22, 24) more than half of all failures have a root cause categorized as "Unknown". Is the reason that those systems are less important and hence not as much effort is spent diagnosing the root cause? Or is there another reason?
A: Well, those systems were of course sort of hand-made systems, cooled by immersion techniques, lots of boards, extremely complex. That company sold maybe 50 of those systems worldwide. There might have even been on-site soldering involved. That was a while ago as well.
Additionally, those systems were of an era where we had 10 of them on site (unlike today, where we have tens of thousands of commodity machines). So I guess if you have 10 of those machines over a decade, you don't care about root causes as much as you do if you have 3 orders of magnitude more of a particular part. Additionally, we had on-site hardware service for those types of machines, where parts were actually fixed on site. Current machines use the entire node as the FRU (field-replaceable unit). Those machines may be less important to your study unless your study is somewhat historical in nature. It is interesting that those machines were the machines that had no commodity parts in them.

Q: Do you know why system 21 was decommissioned only a few months after its introduction?
A: It was installed in one network, and then the parts were picked up and moved into a different network and became parts of other clusters under different names.

Q: Is it possible to determine from the data whether a hardware-related failure was permanent (i.e. required replacement of hardware) or not?
A: No. The data contains information on which part of the hardware (e.g. memory, CPU, etc.) was involved, but not whether the failure actually required hardware replacement. For example, a hardware failure classified as "Memory" could either be due to a faulty DIMM, requiring replacement of that DIMM, or a transient failure, such as a parity check problem, that only requires rebooting.

Q: Does the data contain information on disk failures?
A: The data does contain entries for failures of disks located on the nodes. However, many applications load their data from remote storage and after that work only from main memory, without using on-node disks. Some nodes do not even have local disks. Therefore, the number of reported disk failures is low.

Q: How would you describe the workloads run on the systems? And are the workloads typically more IO-intensive or more CPU-intensive?
A: The majority of the workloads are large-scale scientific simulations, such as simulations of nuclear stockpile stability. These applications perform long periods (often months) of CPU computation, interrupted every few hours by a few minutes of I/O for checkpointing. Simulation workloads are often accompanied by scientific visualization of large-scale data. Visualization workloads are also CPU-intensive, but exhibit more reading of data from storage than compute workloads. Finally, some nodes are used purely as front-end nodes, and others run more than one type of workload; for instance, graphics nodes often run compute workloads as well.

Q: What is the mechanism commonly used by applications to minimize lost work in the case of a failure?
A: Applications typically create checkpoints at regular intervals that are written back to stable storage. In the case of a failure, an application restarts from its most recent checkpoint. Most commonly, coordinated checkpointing is used, i.e. all nodes that an application is running on write a checkpoint at the same time. If only one node fails, all nodes roll back to their most recent checkpoint.

Q: What is roughly the number of system administrators employed per system (or, if that information is not available, what is the total number of system administrators)?
A: For the systems (not for operations, storage, network, archive, etc.), roughly 8-10 system admins for everything.

Q: Does each of the 8-10 admins work on all systems, or does each admin specialize in a subset of systems?
A: That has changed from time to time; right now they all work on all of them.

Q: Do any of the newer systems have multi-core CPUs?
A: No systems in this list have multi-core CPUs. We have some of those now, but they were not in production when we did this data dump.

Q: Why did system E experience such a high percentage (more than 50% of all failures) of CPU-related failures?
A: This was due to a major flaw in the design of the type of CPU used in systems of type E.

Q: The percentage of human error is much lower than in other published data. What would be your explanation?

Q: Do you know why system 21 was decommissioned only a few months after its introduction?
A: The system wasn't actually decommissioned. Its nodes were added to another system, for which no failure data is available.

Q: Can several jobs timeshare a node, or are jobs always run in isolation?
A: On the large 64-128 processor nodes, jobs did share nodes; on the smaller machines, jobs never shared nodes.

Q: Do any of the newer systems have multi-core CPUs?
A: No.

Q: Can you give us any information on the hardware vendors?
A: No, we are not able to release any vendor-specific information.

Q: The failure rate during the first months of a system's lifetime is often much higher than during the rest of the system's lifetime. Why?
A: The failure rate drops during the early age of a system, as initial hardware and software bugs are detected and fixed and administrators gain experience in running the system. One might wonder why the initial problems were not solved during the 1-2 months of testing before production time. The reason is that many problems in hardware, software and configuration are only exposed by real user code in the production workloads.

Q: Sometimes it takes the first 1-2 *years* of a system's lifetime before failure rates drop. Why?
A: Some systems were unprecedented in scale or type of configuration when they were introduced. As a result, the first years still involved a lot of development work among the administrators of the system, the vendors, and the users. Administrators had to develop new software for managing the system and providing the infrastructure to run large parallel applications. Users developed new large-scale applications that wouldn't have been feasible to run on any of the previous systems.
With the slower development process it took longer until the systems were running the full variety of production workloads and the majority of the initial bugs were exposed and fixed.
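
To look at the early-life effect described in the last two answers, one could count failures by system age. The sketch below is only an illustration: the function name is hypothetical, a "month" is approximated as 30 days, and the timestamps are made up rather than taken from the LANL data.

```python
from collections import Counter
from datetime import datetime


def failures_by_month_of_age(start_times, production_date):
    """Count failures per 30-day month of age since production_date.

    Failures that started before production_date are ignored.
    """
    counts = Counter()
    for t in start_times:
        age_days = (t - production_date).days
        if age_days >= 0:
            counts[age_days // 30] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps for illustration only.
production = datetime(2002, 1, 1)
starts = [
    datetime(2001, 12, 25),  # before production, skipped
    datetime(2002, 1, 5),
    datetime(2002, 1, 20),
    datetime(2002, 2, 15),
    datetime(2002, 6, 1),
]
print(failures_by_month_of_age(starts, production))  # {0: 2, 1: 1, 5: 1}
```

For clusters that grew over time, the raw counts would need to be divided by the number of nodes in production in each month before comparing months.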