Concepedia

TLDR

Merrimac uses a stream architecture and advanced interconnection networks to deliver an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. The authors sketch the design of Merrimac, a streaming scientific computer that scales from a $20K 2 TFLOPS workstation to a $20M 2 PFLOPS supercomputer, and report results from initial application experiments. Merrimac organizes computation into streams and exploits the resulting locality with a register hierarchy, which reduces the memory bandwidth required by representative applications by an order of magnitude or more. Because bandwidth is expensive and arithmetic units are cheap, a node with fixed bandwidth can then support roughly ten times more arithmetic units. A given performance level therefore needs fewer nodes (for example, a 1-PFLOPS machine with only 8,192 nodes), which improves reliability and simplifies system management.

Abstract

Merrimac uses a stream architecture and advanced interconnection networks to give an order of magnitude more performance per unit cost than cluster-based scientific computers built from the same technology. Organizing the computation into streams and exploiting the resulting locality using a register hierarchy enables a stream architecture to reduce the memory bandwidth required by representative applications by an order of magnitude or more. Hence, a processing node with a fixed bandwidth (expensive) can support an order of magnitude more arithmetic units (inexpensive). This in turn allows a given level of performance to be achieved with fewer nodes (a 1-PFLOPS machine, for example, with just 8,192 nodes), resulting in greater reliability and simpler system management. We sketch the design of Merrimac, a streaming scientific computer that can be scaled from a $20K 2 TFLOPS workstation to a $20M 2 PFLOPS supercomputer, and present the results of some initial application experiments on this architecture.
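
The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes decimal prefixes (1 TFLOPS = 10^12 FLOPS, 1 PFLOPS = 10^15 FLOPS) and that peak performance divides evenly across nodes. The helper names are hypothetical, and the per-node and per-GFLOPS values are derived here rather than stated in the abstract.

```python
# Back-of-the-envelope check of the abstract's headline numbers.
# Assumptions (not from the abstract): decimal prefixes and an even
# split of peak performance across nodes.

TFLOPS = 1e12
PFLOPS = 1e15


def gflops_per_node(total_flops: float, nodes: int) -> float:
    """Peak GFLOPS each node must supply to reach total_flops."""
    return total_flops / nodes / 1e9


def dollars_per_gflops(cost_dollars: float, total_flops: float) -> float:
    """System cost divided by peak performance in GFLOPS."""
    return cost_dollars / (total_flops / 1e9)


# 1 PFLOPS spread over 8,192 nodes
per_node = gflops_per_node(1 * PFLOPS, 8192)

# $20K for 2 TFLOPS versus $20M for 2 PFLOPS
workstation = dollars_per_gflops(20_000, 2 * TFLOPS)
supercomputer = dollars_per_gflops(20_000_000, 2 * PFLOPS)

print(f"Peak per node:  {per_node:.1f} GFLOPS")
print(f"Workstation:    ${workstation:.2f} per GFLOPS")
print(f"Supercomputer:  ${supercomputer:.2f} per GFLOPS")
```

Under these assumptions, each node supplies roughly 122 GFLOPS of peak performance, and both endpoints of the quoted range work out to the same cost per GFLOPS, which is consistent with the claim that the design scales from workstation to supercomputer.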
