ASP-DAC 2014 Technical Program

The 19th Asia and South Pacific Design Automation Conference

Session 6S Special Session: Overcoming Major Silicon Bottlenecks: Variability, Reliability, Validation and Debug
Time: 15:50 - 17:30 Wednesday, January 22, 2014
Location: Room 302
Organizer: Subhasish Mitra (Stanford University, U.S.A.)

6S-1 (Time: 15:50 - 16:20)

Title	(Invited Paper) Accurate and Inexpensive Performance Monitoring for Variability-Aware Systems
Author	Liangzhen Lai, *Puneet Gupta (UCLA, U.S.A.)
Page	pp. 467 - 473
Keyword	variability, performance monitoring, reliability, adaptive system
Abstract	Designing reliable integrated systems has become a major challenge with shrinking geometries, increasing fault rates and devices which age substantially in their usage life. The proposed research is motivated by the observation that many of the in-field failures are delay failures and several variability signatures are also delay-related. The origins of temporal delay fluctuations include manufacturing variability, voltage/temperature changes, negative or positive bias temperature instability-related Vth degradation, etc. Since the actual delay changes depend on process variations as well as workload, on-chip monitoring may be the best way of predicting them. There is a need to monitor circuit performance during manufacturing as well as at runtime to predict achievable performance and warn against impending failures. Adaptive mechanisms in hardware and/or software can optimize the trade-off between errors, energy and performance based on the feedback from runtime circuit performance monitors. This paper presents approaches for automated synthesis of design-dependent performance monitors. These monitors can be used to predict impending delay failures relatively inexpensively. For low-overhead monitoring, we propose multiple design-dependent ring oscillators (DDROs) as smart canary structures which can reliably predict achievable chip frequency but with margins for local variations. Early silicon results indicate that DDROs can reduce delay monitoring error by 35% compared to conventional ring oscillators. To further improve the prediction (albeit at a higher overhead), we propose in-situ slack monitors (SlackProbe) which can match local variations as well at overheads much smaller than monitoring all sequential elements. SlackProbe reduces the number of monitors required by over 15X with 5% additional delay margin in several commercial processor benchmarks. Finally, we show an example of a software testbed that demonstrates a variability-aware system that utilizes the hardware monitors and operates with both hardware and software adaptation.

6S-2 (Time: 16:20 - 16:50)

Title	(Invited Paper) Quantifying Workload Dependent Reliability in Embedded Processors
Author	*Vikas Chandra (ARM, U.S.A.)
Page	pp. 474 - 477
Keyword	Reliability, BTI, TDDB, Soft Error
Abstract	With nearly three decades of continued CMOS scaling, devices have now been pushed to their physical and reliability limits. Scaling to sub-20nm technology nodes changes the nature of reliability effects from abrupt functional problems to progressive degradation of the performance characteristics of devices and system components. The impact of unreliability results in time-dependent variability, directly translating into design uncertainty in manufactured chips. Further, application workloads can significantly affect the overall system reliability. In this work, we have analyzed aging effects on various design hierarchies of an embedded processor in 28nm running real-world applications. We have also quantified the dependencies of aging effects on the switching activity and power state of workloads. Implementation results show that the processor timing degradation can vary from 2% to 11%, depending on the workload.

6S-3 (Time: 16:50 - 17:20)

Title	(Invited Paper) QED Post-Silicon Validation and Debug: Frequently Asked Questions
Author	David Lin, *Subhasish Mitra (Stanford University, U.S.A.)
Page	pp. 478 - 482
Keyword	Debug, Post-Silicon Validation, Quick Error Detection, Verification
Abstract	During post-silicon validation and debug, one or more manufactured integrated circuits (ICs) are tested in actual system environments to detect and fix design flaws (bugs). According to several industrial reports, the costs of post-silicon validation and debug are rising faster than design costs. Hence, new systematic techniques are essential to overcome the rising costs of existing post-silicon validation and debug techniques. QED, an acronym for Quick Error Detection, is such a technique that effectively overcomes several post-silicon validation and debug challenges. QED systematically creates a wide variety of validation tests to quickly detect bugs, not only inside processor cores, but also in uncore components (i.e., components in an SoC that are neither processor cores nor co-processors) of multi-core system-on-chips. In this paper, we present a brief overview of QED through a series of frequently asked questions.