Publication | Closed Access
Increasing relevance of memory hardware errors
Citations: 34
References: 19
Year: 2000
Unknown Venue
Memory Hardware Errors, Software Errors, Memory Scaling, Engineering, Mem Testing, Computer Architecture, Fault Tolerance, Hardware Security, Reliability Engineering, Shared Memory, Memory, Systems Engineering, Memory Errors, Parallel Computing, Memory Management, Memory Analysis, Computer Engineering, Computer Science, Virtual Memory, Memory Architecture, Operating Systems, Program Analysis, System Software
It is commonly believed that most computer system failures today stem from programming errors. Computer systems are becoming more complex and harder to maintain and administer, which makes software errors even more common, while contemporary computer architectures are optimized for price and performance rather than availability. In this paper, we make the case for the increasing relevance of memory hardware soft errors. In particular, with the introduction of 64-bit processors, memory capacity grows significantly, resulting in a higher probability of memory errors. At the same time, computers are now used ubiquitously, including at higher altitudes, where environmental conditions such as terrestrial cosmic rays affect error rates. Finally, in shared memory systems, the failure of one node's memory can take the whole machine down. Current commodity systems do not tolerate memory errors, in either hardware (processors, memories, interconnects) or software (operating systems, applications, application environments). At the same time, users expect increased reliability. We present the problems caused by such errors and some solutions for memory error recovery at the processor, operating system, and programming model levels.