Supercomputing's monster in the closet
Year: 2016
High Performance Computing, Hardware Security, Exascale Computing, ASCI Q, Engineering, Supercomputer Architecture, Open Source Supercomputing, Unconventional Computing, Computer Engineering, Computer Architecture, Steel Cabinets, Parallel Programming, Computer Science, New Mexico Lab, Parallel Computing, Technology, Next Generation Computing
As a child, were you ever afraid that a monster lurking in your bedroom would leap out of the dark and get you? My job at Oak Ridge National Laboratory is to worry about a similar monster, one that hides in the steel cabinets of supercomputers and threatens to crash the largest computing machines on the planet. The monster is something supercomputer specialists call resilience, or rather the lack of resilience. It has bitten several supercomputers in the past. A high-profile example affected what was, in 2002, the second-fastest supercomputer in the world, a machine called ASCI Q at Los Alamos National Laboratory. When it was first installed at the New Mexico lab, this computer couldn't run more than an hour or so without crashing.