Publication | Closed Access

AR-SMT: a microarchitectural approach to fault tolerance in microprocessors

Citations: 490

References: 22

Year: 2003

Eric Rotenberg

Unknown Venue

TLDR

Advances in microprocessor technology, such as gigahertz clock rates, reduce design tolerances and may make transient faults frequent, while existing fault-tolerant techniques are too costly, too intrusive, or insufficient for general-purpose computing. The paper argues that the microarchitecture itself must provide fault tolerance. It proposes a time-redundancy scheme that duplicates a program and runs both copies simultaneously on one processor, leveraging simultaneous multithreading, control flow and data flow prediction, and hierarchical processors to give broad coverage of transient faults and restricted coverage of permanent faults. Simulations of five SPEC95 benchmarks show that running the two redundant programs takes only 10% to 30% longer than running a single copy, so the overhead is low in both performance and changes to the existing microarchitecture.

Abstract

This paper speculates that technology trends pose new challenges for fault tolerance in microprocessors. Specifically, severely reduced design tolerances implied by gigahertz clock rates may result in frequent and arbitrary transient faults. We suggest that existing fault-tolerant techniques (system-level, gate-level, or component-specific approaches) are either too costly for general-purpose computing, overly intrusive to the design, or insufficient for covering arbitrary logic faults. An approach in which the microarchitecture itself provides fault tolerance is required. We propose a new time redundancy fault-tolerant approach in which a program is duplicated and the two redundant programs simultaneously run on the processor. The technique exploits several significant microarchitectural trends to provide broad coverage of transient faults and restricted coverage of permanent faults. These trends are simultaneous multithreading, control flow and data flow prediction, and hierarchical processors, all of which are intended for higher performance, but which can be easily leveraged for the specified fault tolerance goals. The overhead for achieving fault tolerance is low, both in terms of performance and changes to the existing microarchitecture. Detailed simulations of five of the SPEC95 benchmarks show that executing two redundant programs on the fault-tolerant microarchitecture takes only 10% to 30% longer than running a single version of the program.
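The time-redundancy idea in the abstract can be illustrated with a small software analogy. The sketch below is not the paper's hardware mechanism: the names `run_redundant`, `fault_at`, and `delay` are hypothetical, the bounded queue between the two copies is an assumption made for illustration, and the fault is modeled as a single bit flip in the leading copy only. It shows why recomputing each result at a later time exposes a transient fault: the fault corrupts one execution but not the repeat.

```python
from collections import deque
from typing import Callable, Iterable, Optional, Tuple


def _check(program: Callable[[int], int], entry: Tuple[int, int, int]) -> Optional[int]:
    """Re-execute one step in the trailing copy and compare with the leading result."""
    index, value, leading_result = entry
    trailing_result = program(value)
    return index if trailing_result != leading_result else None


def run_redundant(
    program: Callable[[int], int],
    inputs: Iterable[int],
    fault_at: Optional[int] = None,  # hypothetical: step at which to inject a fault
    delay: int = 4,                  # hypothetical: how far the leading copy runs ahead
) -> Optional[int]:
    """Run two copies of `program` and return the first mismatching step, or None.

    The leading copy's results wait in a bounded FIFO until the trailing copy
    recomputes them, so the two executions of a step are separated in time.
    """
    pending: deque = deque()
    for index, value in enumerate(inputs):
        result = program(value)
        if index == fault_at:
            result ^= 1  # assumed fault model: flip the low bit of one result
        pending.append((index, value, result))
        if len(pending) > delay:
            mismatch = _check(program, pending.popleft())
            if mismatch is not None:
                return mismatch
    # Drain the results the trailing copy has not yet verified.
    while pending:
        mismatch = _check(program, pending.popleft())
        if mismatch is not None:
            return mismatch
    return None


if __name__ == "__main__":
    square_plus_one = lambda x: x * x + 1
    print(run_redundant(square_plus_one, range(10)))              # fault-free run
    print(run_redundant(square_plus_one, range(10), fault_at=3))  # injected fault
```

A fault that affected both executions identically, as a permanent fault might, would pass this comparison unnoticed, which is consistent with the abstract claiming only restricted coverage of permanent faults.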
