
The 19th Asia and South Pacific Design Automation Conference

Session 6S  Special Session: Overcoming Major Silicon Bottlenecks: Variability, Reliability, Validation and Debug
Time: 15:50 - 17:30 Wednesday, January 22, 2014
Location: Room 302
Organizer: Subhasish Mitra (Stanford University, U.S.A.)

6S-1 (Time: 15:50 - 16:20)
Title: (Invited Paper) Accurate and Inexpensive Performance Monitoring for Variability-Aware Systems
Author: Liangzhen Lai, *Puneet Gupta (UCLA, U.S.A.)
Page: pp. 467 - 473
Keyword: variability, performance monitoring, reliability, adaptive system
Abstract: Designing reliable integrated systems has become a major challenge with shrinking geometries, increasing fault rates and devices which age substantially in their usage life. The proposed research is motivated by the observation that many of the in-field failures are delay failures and several variability signatures are also delay-related. The origins of temporal delay fluctuations include manufacturing variability, voltage/temperature changes, negative or positive bias temperature instability-related Vth degradation, etc. Since the actual delay changes depend on process variations as well as workload, on-chip monitoring may be the best way of predicting them. There is a need to monitor circuit performance during manufacturing as well as at runtime to predict achievable performance and warn against impending failures. Adaptive mechanisms in hardware and/or software can optimize the trade-off between errors, energy and performance based on the feedback from runtime circuit performance monitors. This paper presents approaches for automated synthesis of design-dependent performance monitors. These monitors can be used to predict impending delay failures relatively inexpensively. For low-overhead monitoring, we propose multiple design-dependent ring oscillators (DDROs) as smart canary structures which can reliably predict achievable chip frequency but with margins for local variations. Early silicon results indicate that DDROs can reduce delay monitoring error by 35% compared to conventional ring oscillators. To further improve the prediction (albeit at a higher overhead), we propose in-situ slack monitors (SlackProbe) which can match local variations as well, at overheads much smaller than monitoring all sequential elements. SlackProbe reduces the number of monitors required by over 15X with 5% additional delay margin in several commercial processor benchmarks. Finally, we show an example of a software testbed that demonstrates a variability-aware system that utilizes the hardware monitors and operates with both hardware and software adaptation.

6S-2 (Time: 16:20 - 16:50)
Title: (Invited Paper) Quantifying Workload Dependent Reliability in Embedded Processors
Author: *Vikas Chandra (ARM, U.S.A.)
Page: pp. 474 - 477
Keyword: Reliability, BTI, TDDB, Soft Error
Abstract: With nearly three decades of continued CMOS scaling, the devices have now been pushed to their physical and reliability limits. Scaling to sub-20nm technology nodes changes the nature of reliability effects from abrupt functional problems to progressive degradation of the performance characteristics of devices and system components. The impact of unreliability results in time-dependent variability, directly translating into design uncertainty in manufactured chips. Further, application workloads can significantly affect the overall system reliability. In this work, we have analyzed aging effects on various design hierarchies of an embedded processor in 28nm running real-world applications. We have also quantified the dependencies of aging effects on switching-activity and power-state of workloads. Implementation results show that the processor timing degradation can vary from 2% to 11%, depending on the workload.

6S-3 (Time: 16:50 - 17:20)
Title: (Invited Paper) QED Post-Silicon Validation and Debug: Frequently Asked Questions
Author: David Lin, *Subhasish Mitra (Stanford University, U.S.A.)
Page: pp. 478 - 482
Keyword: Debug, Post-Silicon Validation, Quick Error Detection, Verification
Abstract: During post-silicon validation and debug, one or more manufactured integrated circuits (ICs) are tested in actual system environments to detect and fix design flaws (bugs). According to several industrial reports, the costs of post-silicon validation and debug are rising faster than design costs. Hence, new systematic techniques are essential to overcome the rising costs of existing post-silicon validation and debug techniques. QED, an acronym for Quick Error Detection, is such a technique that effectively overcomes several post-silicon validation and debug challenges. QED systematically creates a wide variety of validation tests to quickly detect bugs, not only inside processor cores, but also in uncore components (i.e., components in an SoC that are neither processor cores nor co-processors) of multi-core system-on-chips. In this paper, we present a brief overview of QED through a series of frequently asked questions.