
The 19th Asia and South Pacific Design Automation Conference

Session 5S  Special Session: Billion Chips of Trillion Transistors
Time: 13:50 - 15:30 Wednesday, January 22, 2014
Location: Room 302
Organizer: Chen-Yong Cher (IBM TJ Watson Research Center, U.S.A.)

5S-1 (Time: 13:50 - 14:20)
Title: (Invited Paper) Soft Error Resiliency Characterization on IBM BlueGene/Q Processor
Author: *Chen-Yong Cher, K. Paul Muller, Ruud A. Haring, David L. Satterfield, Thomas E. Musta, Thomas M. Gooding, Kristan D. Davis, Marc B. Dombrowa, Gerard V. Kopcsay, Robert M. Senger, Yutaka Sugawara, Krishnan Sugavanam (IBM T. J. Watson Research Center, U.S.A.)
Page: pp. 385 - 387
Keyword: soft error rate, fault injection, high-performance computing, chip irradiation
Abstract: Fault injection through accelerated irradiation is an effective way to evaluate the overall soft error resiliency of microprocessors. In this work, we report on irradiation experiments on a Blue Gene/Q (BG/Q) compute processor chip running selected applications. Blue Gene/Q is the third generation of IBM’s massively parallel, energy efficient Blue Gene series of supercomputers. In the experiments, we found 26 code fails that are relevant for the calculation of the mean-time-between-failures (MTBF) for a 20 PetaFLOP, 96 rack system running a comparable workload mix. The expected MTBF for check-stops due to cosmic radiation and alpha particles from chip packaging materials is calculated to be 51 days for sea-level at New York City running the application mix studied. If the most vulnerable application is run exclusively, the projected MTBF is 35 days. These are outstanding results for a machine of this magnitude. The beaming experiment and projected MTBF validate the necessity to include autonomous hardware detection and recovery at the cost of design effort, silicon area and power.

5S-2 (Time: 14:20 - 14:50)
Title: (Invited Paper) Resiliency for Many-Core System on a Chip
Author: *Tanay Karnik, James Tschanz, Nitin Borkar, Jason Howard, Sriram Vangal, Vivek De, Shekhar Borkar (Intel Corporation, U.S.A.)
Page: pp. 388 - 389
Keyword: Resiliency, SOC
Abstract: Resilient techniques are commonly employed for dynamic and static variation tolerance. In this paper, we present an adaptive clocking technique that achieves 31% throughput increase with 15% energy reduction, and an adaptive interconnect fabric technique that increases bandwidth by 63% with 14.6% energy reduction. We also discuss variations in many-core microprocessors and some techniques to enable a resilient many-core system on a chip.

5S-3 (Time: 14:50 - 15:20)
Title: (Invited Paper) Rethinking Error Injection for Effective Resilience
Author: Shahrzad Mirkhani (University of Texas, U.S.A.), Hyungmin Cho, Subhasish Mitra (Stanford University, U.S.A.), *Jacob Abraham (University of Texas, U.S.A.)
Page: pp. 390 - 393
Keyword: transient fault, soft error, error injection
Abstract: Soft errors, caused by radiation, have become a major challenge in today’s computer systems and networking equipment, making it imperative that systems be designed to be resilient to errors. Error injection is a powerful approach to evaluate system resilience, and current practice is to inject errors in architectural registers of processors, program variables of applications, or storage elements in the hardware model. This paper, using answers to frequently asked questions, discusses the need for rethinking conventional approaches to error injection, showing data from recent research and our simulation results. Approaches to improving current error injections are also suggested.