Fault Tolerant Computing on Maestro

Stephen P. Crago, John Paul Walters, Robert Kost, Karandeep Singh, Dong-In Kang, and Jinwoo Suh

The microprocessor trend toward multi-core and many-core designs means that redundant resources are becoming inexpensive and can be used both for computation and for added capabilities, including fault tolerance. Because multi-core architectures have inherent, programmable redundancy, they can support flexible, software-implemented fault tolerance. The Maestro processor, which has been implemented with rad-hard-by-design technology, has 49 cores as well as redundant inter-core networks and chip interfaces. In this talk, we explore a range of fault tolerance techniques that can be used on Maestro and on other multi-core and many-core architectures, and we identify challenges that have not yet been addressed.

Document date: May 13, 2010.