
TLDR

Transient faults from neutron and alpha particle strikes hinder transistor scaling: adding more transistors makes a device more likely to encounter a fault even if individual transistor fault rates remain low. The study aims to lower processor error rates by proposing two straightforward techniques and evaluating them on a microprocessor instruction queue. The first selectively squashes instructions during long delays to reduce the time they spend in vulnerable storage, and introduces the MITF (Mean Instructions To Failure) metric to capture the resulting trade-off between performance and reliability. The second modifies error detection so that an error is signaled only when a fault could have affected the program's output.

Abstract

Transient faults due to neutron and alpha particle strikes pose a significant obstacle to increasing processor transistor counts in future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a device makes that device more likely to encounter a fault. Hence, maintaining processor error rates at acceptable levels will require increasing design effort.

This paper proposes two simple approaches to reduce error rates and evaluates their application to a microprocessor instruction queue. The first technique reduces the time instructions sit in vulnerable storage structures by selectively squashing instructions when long delays are encountered. A fault is less likely to cause an error if the structure it affects does not contain valid instructions. We introduce a new metric, MITF (Mean Instructions To Failure), to capture the trade-off between performance and reliability introduced by this approach.

The second technique addresses false detected errors. In the absence of a fault detection mechanism, such errors would not have affected the final outcome of a program. For example, a fault affecting the result of a dynamically dead instruction would not change the final program output, but could still be flagged by the hardware as an error. To avoid signalling such false errors, we modify a pipeline's error detection logic to mark affected instructions and data as possibly incorrect rather than immediately signaling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program's output.
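
The trade-off that MITF is meant to capture can be illustrated with a small calculation. The abstract only names the metric, so the sketch below assumes one natural reading: MITF is the instruction throughput (IPC times clock frequency) multiplied by the mean time to failure. The function names, the raw fault rate, and the IPC and vulnerability figures are all hypothetical and chosen only to show the arithmetic; none of them are results from the paper.

```python
# Sketch of the performance/reliability trade-off behind MITF.
# ASSUMPTION: MITF = IPC * frequency * MTTF (instructions per second
# times seconds between failures). All numbers below are made up.

SECONDS_PER_HOUR = 3600.0
FIT_HOURS = 1e9  # 1 FIT = one failure per 10^9 device-hours


def mttf_seconds(raw_fit: float, vulnerable_fraction: float) -> float:
    """Mean time to failure in seconds.

    raw_fit: raw fault rate of the structure, in FIT (hypothetical input).
    vulnerable_fraction: fraction of faults that become errors, e.g.
        because the struck entry held a valid instruction.
    """
    effective_fit = raw_fit * vulnerable_fraction
    return (FIT_HOURS / effective_fit) * SECONDS_PER_HOUR


def mitf(ipc: float, frequency_hz: float, mttf_s: float) -> float:
    """Mean Instructions To Failure under the assumed definition."""
    return ipc * frequency_hz * mttf_s


frequency_hz = 2.0e9   # hypothetical 2 GHz clock
raw_fit = 100.0        # hypothetical raw fault rate

# Baseline: instructions wait in the queue during long delays.
baseline = mitf(1.20, frequency_hz, mttf_seconds(raw_fit, 0.30))

# Squashing: slightly lower IPC, but the queue holds valid
# instructions for less time, so fewer faults become errors.
squashing = mitf(1.15, frequency_hz, mttf_seconds(raw_fit, 0.20))

improvement = squashing / baseline
print(f"MITF improvement from squashing: {improvement:.4f}x")
```

With these illustrative inputs, squashing costs about 4% of IPC but cuts the vulnerable fraction by a third, so MITF rises by a factor of about 1.44. Under this reading, a technique is worthwhile whenever the reduction in error rate outweighs the loss in throughput, and a technique that lowered IPC more than it lowered the error rate would reduce MITF.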
