
The 16th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule


Wednesday, January 26, 2011

Room 411+412   Room 413   Room 414+415   Room 416+417
1K  (Room 503)
Opening and Keynote Session I

8:30 - 10:00
1A  Analog, Mixed-Signal & RF Verification, Abstraction and Analysis
10:20 - 12:20
1B  Emerging Memories and System Applications
10:20 - 12:20
1C  Advances in Model Order Reduction and Extraction Techniques
10:20 - 12:20
1D  University LSI Design Contest
10:20 - 12:20
2A  Scheduling Techniques for Embedded Systems
13:40 - 15:40
2B  Memory Architecture and Buffer Optimization
13:40 - 15:40
2C  Modeling for Signal and Power Integrity
13:40 - 15:40
2D  Special Session: Emerging Memory Technologies and Its Implication on Circuit Design and Architectures
13:40 - 15:40
3A  High-Level Embedded Systems Design Techniques
16:00 - 18:00
3B  Timing, Power, and Thermal Issues
16:00 - 18:00
3C  Special Session: Post-Silicon Techniques to Counter Process and Electrical Parameter Variability
16:00 - 18:00
3D  Special Session: Recent Advances in Verification and Debug
16:00 - 18:00



Thursday, January 27, 2011

Room 411+412   Room 413   Room 414+415   Room 416+417
2K  (Room 503)
Keynote Session II

9:00 - 10:00
4A  Design Automation for Emerging Technologies
10:20 - 12:20
4B  Novel Network-on-Chip Architecture Design
10:20 - 12:20
4C  Architecture Design and Reliability
10:20 - 12:20
4D  Special Session: Advanced Patterning and DFM for Nanolithography beyond 22nm
10:20 - 12:20
5A  System-Level Simulation
13:40 - 15:40
5B  Resilient and Thermal-Aware NoC Design
13:40 - 15:40
5C  High-Level and Logic Synthesis
13:40 - 15:40
5D  Designers' Forum: C-P-B Co-design/Co-verification Technology for DDR3 1.6G in Consumer Products
13:40 - 15:40
6A  Design Validation Techniques
16:00 - 18:00
6B  Clock Network Design
16:00 - 18:00
6C  Advances in Routing
16:00 - 18:00
6D  Designers' Forum: Emerging Technologies for Wellness Applications
16:00 - 18:00



Friday, January 28, 2011

Room 411+412   Room 413   Room 414+415   Room 416+417
3K  (Room 503)
Keynote Session III

9:00 - 10:00
7A  System Level Analysis and Optimization
10:20 - 12:20
7B  NBTI and Power Gating
10:20 - 12:20
7C  Physical Design for Yield
10:20 - 12:20
7D  Special Session: Virtualization, Programming, and Energy-Efficiency Design Issues of Embedded Systems
10:20 - 12:20
8A  Modeling and Design for Variability
13:40 - 15:40
8B  Test for Reliability and Yield
13:40 - 15:40
8C  System-Level Power Optimization
13:40 - 15:40
8D  Designers' Forum: State-of-The-Art SoCs and Design Methodologies
13:40 - 15:40
9A  Printability and Mask Optimization
16:00 - 18:00
9B  Emerging Solutions in Scan Testing
16:00 - 18:00
9C  Clock and Package
16:00 - 18:00
9D  Designers' Forum: Advanced Packaging and 3D Technologies
16:00 - 18:00



List of Papers

Wednesday, January 26, 2011

Session 1K  Opening and Keynote Session I
Time: 8:30 - 10:00 Wednesday, January 26, 2011
Location: Room 503
Chair: Kunihiro Asada (University of Tokyo, Japan)

1K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) Non-Volatile Memory and Normally-Off Computing
AuthorTakayuki Kawahara (Hitachi, Japan)
Slides


Session 1A  Analog, Mixed-Signal & RF Verification, Abstraction and Analysis
Time: 10:20 - 12:20 Wednesday, January 26, 2011
Location: Room 411+412
Chairs: Eric Keiter (Sandia National Labs, U.S.A.), Chin-Fong Chiu (National Chip Implementation Center, Taiwan)

1A-1 (Time: 10:20 - 10:50)
TitleAnalog Circuit Verification by Statistical Model Checking
Author*Ying-Chih Wang, Anvesh Komuravelli, Paolo Zuliani, Edmund M. Clarke (Carnegie Mellon University, U.S.A.)
Pagepp. 1 - 6
KeywordAnalog Circuits, Verification, Statistical Model Checking
AbstractWe show how statistical model checking can be used to verify properties of analog circuits. As integrated circuit technologies continue to scale down, manufacturing variations in devices make analog designs behave like stochastic systems. In this paper, we use statistical model checking, an efficient verification technique for stochastic systems, to verify properties of analog circuits in both the temporal and the frequency domain. In particular, randomly sampled traces are sequentially generated by SPICE and model checked to determine whether they satisfy a given specification, until the desired statistical strength is achieved.
Slides
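
Illustrative note on 1A-1: the sequential sampling loop described in the abstract can be sketched as follows. This is only a minimal sketch, assuming Wald's sequential probability ratio test (SPRT) as the stopping rule; the abstract does not say which statistical test the authors use. The function name `sprt`, the parameters `p0`, `p1`, `alpha`, `beta`, and the random stand-in for a SPICE run are all hypothetical.

```python
import math
import random


def sprt(sample_trace, p0, p1, alpha=0.01, beta=0.01, max_samples=100_000):
    """Wald's SPRT for H0: p >= p0 against H1: p <= p1, with p1 < p0.

    sample_trace() must return True when one freshly simulated trace
    satisfies the specification (a stand-in for "run SPICE, then check").
    Returns (decision, number_of_traces_used).
    """
    accept_h1 = math.log((1 - beta) / alpha)  # upper stopping threshold
    accept_h0 = math.log(beta / (1 - alpha))  # lower stopping threshold
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    for n in range(1, max_samples + 1):
        if sample_trace():
            llr += math.log(p1 / p0)  # a passing trace favours H0
        else:
            llr += math.log((1 - p1) / (1 - p0))  # a failing trace favours H1
        if llr >= accept_h1:
            return "H1", n
        if llr <= accept_h0:
            return "H0", n
    return "undecided", max_samples


# Hypothetical stand-in for a SPICE run: each trace passes with probability 0.95.
rng = random.Random(0)
decision, traces_used = sprt(lambda: rng.random() < 0.95, p0=0.9, p1=0.8)
print(decision, traces_used)
```

The number of traces is not fixed in advance: sampling stops as soon as the accumulated evidence crosses one of the two thresholds, which is what "until the desired statistical strength is achieved" suggests.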

1A-2 (Time: 10:50 - 11:20)
TitleFSM Model Abstraction for Analog/Mixed-Signal Circuits by Learning from I/O Trajectories
Author*Chenjie Gu, Jaijeet Roychowdhury (University of California, Berkeley, U.S.A.)
Pagepp. 7 - 12
Keywordmodel abstraction, finite state machine, analog/mixed-signal
AbstractAbstraction of circuits is desirable for faster simulation and high-level system verification. In this paper, we present an algorithm that derives a Mealy machine from differential equations of a circuit by learning input-output trajectories. The key idea is adapted from Angluin's DFA (deterministic finite automata) learning algorithm that learns a DFA from another DFA. Several key components of Angluin's algorithm are modified so that it fits in our problem setting, and the modified algorithm also provides a reasonable partitioning of the continuous state space as a by-product. We validate our algorithm on a latch circuit and an integrator circuit, and demonstrate that the resulting FSMs inherit important behaviors of original circuits.
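
Illustrative note on 1A-2: the target of the abstraction, a Mealy machine, is a finite state machine whose output depends on both the current state and the current input. The sketch below shows only that representation, using a hand-written and entirely hypothetical two-state latch model; the state names, input symbols, and transition table are assumptions for illustration, and the learning algorithm from the paper is not reproduced here.

```python
def run_mealy(transitions, initial_state, inputs):
    """Run a Mealy machine and return the list of output symbols.

    transitions maps (state, input_symbol) -> (next_state, output_symbol).
    """
    state = initial_state
    outputs = []
    for symbol in inputs:
        state, out = transitions[(state, symbol)]
        outputs.append(out)
    return outputs


# Hypothetical abstraction of a latch: state Q0 stores 0, state Q1 stores 1.
LATCH = {
    ("Q0", "set"): ("Q1", 1),
    ("Q0", "reset"): ("Q0", 0),
    ("Q0", "hold"): ("Q0", 0),
    ("Q1", "set"): ("Q1", 1),
    ("Q1", "reset"): ("Q0", 0),
    ("Q1", "hold"): ("Q1", 1),
}

print(run_mealy(LATCH, "Q0", ["hold", "set", "hold", "reset", "hold"]))
```

In the setting of the abstract, the discrete states would correspond to regions of the circuit's continuous state space, which is why the learned machine also yields a partitioning of that space.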

1A-3 (Time: 11:20 - 11:50)
TitleA Structured Parallel Periodic Arnoldi Shooting Algorithm for RF-PSS Analysis based on GPU Platforms
AuthorXue-Xin Liu (University of California, Riverside, U.S.A.), Hao Yu (Nanyang Technological University, Singapore), Jacob Relles, *Sheldon X.-D. Tan (University of California, Riverside, U.S.A.)
Pagepp. 13 - 18
Keywordshooting-Newton, periodic steady state analysis, GPU, RF, Krylov
AbstractRecent multi/many-core CPUs and GPUs provide an ideal parallel computing platform for accelerating the time-consuming analysis of radio-frequency/millimeter-wave (RF/MM) integrated circuits (ICs). This paper develops a structured shooting algorithm that can take full advantage of parallelism in periodic steady state (PSS) analysis. Utilizing the periodic structure of the state matrix in RF/MM-IC simulation, a cyclic-block-structured shooting-Newton method has been parallelized and mapped onto recent GPU platforms. We first present the formulation of the parallel cyclic-block-structured shooting-Newton algorithm, called the periodic Arnoldi shooting method. We then present the details of its parallel implementation on GPU. Results from several industrial examples show that the structured parallel shooting-Newton method on a Tesla GPU can lead to speedups of more than 20× compared to the state-of-the-art implicit GMRES methods under the same accuracy on the CPU.

1A-4 (Time: 11:50 - 12:20)
TitleHierarchical Exact Symbolic Analysis of Large Analog Integrated Circuits By Symbolic Stamps
Author*Hui Xu, Guoyong Shi, Xiaopeng Li (Shanghai Jiao Tong University, China)
Pagepp. 19 - 24
Keywordanalog integrated circuits, binary decision diagram, determinant expansion, graph reduction, symbolic analysis
AbstractLinearized small-signal transistor models share a common circuit structure but may take different parameter values in the ac analysis of an analog circuit simulator. This property can be utilized in symbolic circuit analysis. This paper proposes to use a symbolic stamp for all device models in the same circuit for hierarchical symbolic analysis. Two levels of binary decision diagrams (BDDs) are used for maximum data sharing, one for the symbolic device stamp and the other for modified nodal analysis. The symbolic transadmittances of the device stamp share one BDD to save storage. The modified nodal analysis (MNA) matrix formulated using the symbolic stamp has a much lower dimension, hence the complexity of solving it with a determinant decision diagram (DDD) is greatly reduced. A circuit simulator is implemented based on the proposed scheme. It is able to analyze an op-amp circuit containing 44 MOS transistors exactly for the first time.
Slides


Session 1B  Emerging Memories and System Applications
Time: 10:20 - 12:20 Wednesday, January 26, 2011
Location: Room 413
Chairs: Mehdi Tahoori (Karlsruhe Institute of Technology, Germany), Chun-Ming Huang (National Chip Implementation Center, Taiwan)

1B-1 (Time: 10:20 - 10:50)
TitleGeometry Variations Analysis of TiO2 Thin-Film and Spintronic Memristors
Author*Miao Hu, Hai Li (Polytechnic Institute of New York University, U.S.A.), Yiran Chen (University of Pittsburgh, U.S.A.), Xiaobin Wang (Seagate Technology, U.S.A.), Robinson Pino (AFRL/RITC, U.S.A.)
Pagepp. 25 - 30
Keywordmemristor, process variation, TiO2 thin-film, spintronic
AbstractThe fourth passive circuit element, the memristor, has attracted increased attention since the first real device was discovered by HP Lab in 2008. Its distinctive ability to record the historic profile of the voltage/current through itself creates great potential for future system design. However, as a nano-scale device, the memristor faces a great challenge in process variation control during manufacturing. The impact of process variations on a memristive system that relies on the continuous (analog) states of the memristor could be significant due to the deviation of the memristor state from the designed value. In this work, we analyze the impact of geometry variations, including line edge roughness (LER) and thickness fluctuation, on the electrical properties of both TiO2 thin-film and spintronic memristors. A simple algorithm is proposed to generate a large volume of geometry variation-aware three-dimensional device structures for Monte-Carlo simulations. Our simulation results show that, due to their different physical mechanisms, the TiO2 thin-film memristor and the spintronic memristor demonstrate very different electrical characteristics even when the two types of devices are exposed to the same excitations under the same process variation conditions.
Slides

1B-2 (Time: 10:50 - 11:20)
TitleAdaMS: Adaptive MLC/SLC Phase-Change Memory Design for File Storage
Author*Xiangyu Dong, Yuan Xie (Pennsylvania State University, U.S.A.)
Pagepp. 31 - 36
KeywordPCM, MLC, Adaptive, Storage
AbstractPhase-change memory (PCM) is an emerging memory technology that has made rapid progress in recent years and surpasses other technologies such as FeRAM and MRAM in terms of scalability. Recently, the feasibility of multi-level cell (MLC) operation for PCM has also been shown, which enables a PCM cell to store more than one bit of digital data. This new property makes PCM more competitive and a candidate successor to NAND Flash technology, which also has MLC capability but does not have an easy scaling path to reach the higher densities provided by future technology nodes. However, the MLC capability of PCM comes with the penalty of longer programming time and shortened cell lifetime compared to its single-level cell (SLC) mode. This suggests an adaptive MLC/SLC reconfigurable PCM design that can exploit the fast SLC access speed and the large MLC capacity with awareness of workload characteristics and lifetime requirements. In this paper, a circuit-level adaptive MLC/SLC PCM array is designed first, then a management policy for the MLC/SLC mode is proposed, and finally the performance and lifetime of a novel PCM-based SSD with run-time MLC/SLC reconfiguration ability are evaluated.
Slides

1B-3 (Time: 11:20 - 11:50)
TitleSystem Accuracy Estimation of SRAM-based Device Authentication
AuthorJoonsoo Kim, Joonsoo Lee, *Jacob A. Abraham (The University of Texas at Austin, U.S.A.)
Pagepp. 37 - 42
KeywordSRAM Fingerprint, Device Authentication, HW Security, System Error-Rate Estimation and Its Confidence Interval, Bootstrapping
AbstractIt is known that power-up values of embedded SRAM memory are unique for each individual chip. The uniqueness enables the power-up values to be considered as SRAM fingerprints used to verify device identities, which is a fundamental task in security applications. However, as the SRAM fingerprints are sensitive to environmental changes, there always exists a chance of error during the authentication process. Hence, the accuracy of a device authentication system with the SRAM fingerprints should be carefully estimated and verified in order to be implemented in practice. Consequently, a proper system evaluation method for the SRAM-based device authentication system should be provided. In this paper, we introduce tractable and computationally efficient system evaluation methods, which include novel parametric models for the distributions of matching distances among genuine and imposter devices. In addition, novel algorithms to calculate the confidence intervals of the estimates, which are crucial in system evaluation, are presented. Also, empirical results follow to validate the models and methods.

1B-4 (Time: 11:50 - 12:20)
TitleOn-Chip Hybrid Power Supply System for Wireless Sensor Nodes
Author*Wulong Liu, Yu Wang, Wei Liu, Yuchun Ma (Tsinghua University, China), Yuan Xie (Pennsylvania State University, U.S.A.), Huazhong Yang (Tsinghua University, China)
Pagepp. 43 - 48
Keywordhybrid power, wireless sensor node, fuel cell
AbstractWith the miniaturization of electronic devices, small but high-capacity power supply systems are becoming more and more important. A hybrid power source, which consists of a fuel cell (FC) and a rechargeable battery, has the advantages of long lifetime and good load-following capabilities. In this paper, we propose the schematic of a hybrid power supply system that can be integrated on a chip and is compatible with the present CMOS process. In addition, to maximize the lifetime of the on-chip fuel cell, we propose a modified dynamic power management (DPM) algorithm for on-chip fuel-cell-based hybrid power systems in wireless sensor node design. Taking a wireless sensor node powered by this hybrid power system as an example, we analyze the improvement offered by the FC-Bat hybrid power system. The simulation results demonstrate that the on-chip FC-Bat hybrid power system can be used for wireless sensor nodes under different usage scenarios. Moreover, for an on-chip power system occupying 1 cm² of area, a wafer-level battery can power a typical sensor node for only about 5 months, while our on-chip hybrid power system can steadily supply the same sensor node for 2 years.
Slides


Session 1C  Advances in Model Order Reduction and Extraction Techniques
Time: 10:20 - 12:20 Wednesday, January 26, 2011
Location: Room 414+415
Chairs: Sheldon X.-D. Tan (University of California, Riverside, U.S.A.), Genichi Tanaka (Renesas, Japan)

1C-1 (Time: 10:20 - 10:50)
TitleA Moment-Matching Scheme for the Passivity-Preserving Model Order Reduction of Indefinite Descriptor Systems with Possible Polynomial Parts
Author*Zheng Zhang (Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, U.S.A.), Qing Wang, Ngai Wong (Department of Electrical and Electronic Engineering, the University of Hong Kong, Hong Kong), Luca Daniel (Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, U.S.A.)
Pagepp. 49 - 54
Keywordmodel order reduction, indefinite descriptor system, system passivity, polynomial part
AbstractPassivity-preserving model order reduction (MOR) of descriptor systems (DSs) is highly desired in the simulation of VLSI interconnects and on-chip passives. One popular method is PRIMA, a Krylov-subspace projection approach which preserves the passivity of positive semidefinite (PSD) structured DSs. However, system passivity is not guaranteed by PRIMA when the system is indefinite. Furthermore, the possible polynomial parts of singular systems are normally not captured. For indefinite DSs, positive-real balanced truncation (PRBT) can generate passive reduced-order models (ROMs), whose main bottleneck lies in solving the dual expensive generalized algebraic Riccati equations (GAREs). This paper presents a novel moment-matching MOR for indefinite DSs, which preserves both the system passivity and, if present, also the improper polynomial part. This method only requires solving one GARE, therefore it is cheaper than existing PRBT schemes. On the other hand, the proposed algorithm is capable of preserving the passivity of indefinite DSs, which is not guaranteed by traditional moment-matching MORs. Examples are finally presented showing that our method is superior to PRIMA in terms of accuracy.
Slides

1C-2 (Time: 10:50 - 11:20)
TitleBalanced Truncation for Time-Delay Systems Via Approximate Gramians
Author*Xiang Wang, Qing Wang, Zheng Zhang, Quan Chen, Ngai Wong (The University of Hong Kong, Hong Kong)
Pagepp. 55 - 60
Keywordmodel reduction, balanced truncation, time-delay systems
AbstractIn circuit simulation, when a large RLC network is connected with delay elements, such as transmission lines, the resulting system is a time-delay system (TDS). This paper presents a new model order reduction (MOR) scheme for TDSs with state time delays. This is the first time that a TDS has been reduced using balanced truncation. The Lyapunov-type equations for TDSs are derived, and an analysis of their computational complexity is presented. To reduce the computational cost, we approximate the controllability and observability Gramians in the frequency domain. The reduced-order models (ROMs) are then obtained by balancing and truncating the approximate Gramians. Numerical examples are presented to verify the accuracy and efficiency of the proposed algorithm.
Slides

1C-3 (Time: 11:20 - 11:50)
TitleEfficient Sensitivity-Based Capacitance Modeling for Systematic and Random Geometric Variations
AuthorYu Bi (Delft University of Technology, Netherlands), Pieter Harpe (Holst Centre/IMEC, Netherlands), *Nick van der Meijs (Delft University of Technology, Netherlands)
Pagepp. 61 - 66
Keywordprocess variations, capacitance, sensitivity, Design-for-Manufacturability
AbstractThis paper presents a highly efficient sensitivity-based method for capacitance extraction, which models both systematic and random geometric variations. With only one system solve, the nominal capacitances as well as their relative standard deviations caused by both kinds of variation can be obtained, at a very modest additional computational time that is negligible compared to that of standard capacitance extraction without considering any variation. Specifically, experiments and a case study have been analyzed to show the impact of the random variation on the capacitance for a real design.
Slides

1C-4 (Time: 11:50 - 12:20)
TitleParallel Statistical Capacitance Extraction of On-Chip Interconnects with an Improved Geometric Variation Model
Author*Wenjian Yu, Chao Hu (Tsinghua University, China), Wangyang Zhang (Carnegie Mellon University, U.S.A.)
Pagepp. 67 - 72
Keywordcapacitance extraction, random variation, parallel computing, geometric modeling
AbstractIn this paper, a new geometric variation model, referred to as the improved continuous surface variation (ICSV) model, is proposed to accurately imitate the random variation of on-chip interconnects. In addition, we implemented a new statistical capacitance solver which incorporates the ICSV model, the weighted PFA [6] and HPC [5] techniques. The solver also employs a parallel computing technique to greatly improve its efficiency. Experiments show that on a typical 65nm technology structure, the ICSV model has a significant advantage over other existing models, and the new solver is at least 10X faster than the MC simulation with 10000 samples. The parallel solver achieves a further 7X speedup on an 8-core machine. We conclude this paper with several criteria for discussing the trade-off between different geometric models and statistical methods in different scenarios.
Slides


Session 1D  University LSI Design Contest
Time: 10:20 - 12:20 Wednesday, January 26, 2011
Location: Room 416+417
Organizers: Masanori Hariyama (Tohoku University, Japan), Hiroshi Kawaguchi (Kobe University, Japan)

1D-1 (Time: 10:20 - 10:24)
TitleA H.264/MPEG-2 Dual Mode Video Decoder Chip Supporting Temporal/Spatial Scalable Video
Author*Cheng-An Chien, Yao-Chang Yang, Hsiu-Cheng Chang, Jia-Wei Chen, Cheng-Yen Chang, Jiun-In Guo, Jinn-Shyan Wang (National Chung Cheng University, Taiwan), Ching-Hwa Cheng (Feng Chia University, Taiwan)
Pagepp. 73 - 74
KeywordH.264, MPEG2, SVC
AbstractThis paper proposes the first dual mode video decoder with 4-level temporal/spatial scalability and 32/64-bit adjustable memory bus width. A design automation environment for simulation and verification is established to automatically verify the correctness and completeness of the proposed design. Using a 0.13 µm CMOS technology, it comprises 439Kgates/10.9KB SRAM and consumes 2~328mW in decoding CIF~HD1080 videos at 3.75~30fps when operating at 1~150MHz, respectively.
Slides

1D-2 (Time: 10:24 - 10:28)
TitleA Gate-level Pipelined 2.97GHz Self Synchronous FPGA in 65nm CMOS
Author*Benjamin Devlin, Makoto Ikeda, Kunihiro Asada (University of Tokyo, Japan)
Pagepp. 75 - 76
Keywordself-synchronous, fpga, high-throughput, reliable, pipeline
AbstractWe have designed a Self Synchronous FPGA (SSFPGA) in 65nm CMOS, which achieves 2.97GHz throughput at 1.2V, and measured its performance against power supply bounce and aging. The proposed SSFPGA employs a 38x38 array of 4-input, 3-stage Self Synchronous Configurable Logic Blocks (SSCLB), with the introduction of a new dual tree-divider 4-input LUT to achieve a 4.5x throughput improvement over our previous model. Energy was measured at 3.23 pJ/block/cycle using a custom built board. We measured the SSFPGA for aging with accelerated degradation, and the results show that the SSFPGA has an 8% longer time margin before the chip malfunctions compared to a Synchronous FPGA.
Slides

1D-3 (Time: 10:28 - 10:32)
TitleA 4.32 mm² 170mW LDPC Decoder in 0.13µm CMOS for WiMax/Wi-Fi Applications
Author*Dan Bao, Chuan Wu, Yan Ying, Yun Chen, Xiao Yang Zeng (Fudan University, China)
Pagepp. 77 - 78
KeywordLDPC Decoder
AbstractAn energy-efficient programmable LDPC decoder is proposed for WiMax and Wi-Fi applications. The proposed decoder is designed with scalable processing units, flexible message passing network and medium-grain partitioned memories to harvest programmability, area reduction, and energy efficiency. The decoder can be programmed by host processor with several special-purpose micro-instructions. Thus, various operation modes can be reconfigured. Fabricated in SMIC 0.13µm 1P8M CMOS process, the chip occupies 4.32 mm² with core area 2.97 mm², and consumes 170mW with a throughput of 302Mb/s when operating at 145MHz and 1.2V.

1D-4 (Time: 10:32 - 10:36)
TitleAll-Digital PMOS and NMOS Process Variability Monitor Utilizing Buffer Ring with Pulse Counter
Author*Jaehyun Jeong, Tetsuya Iizuka, Toru Nakura, Makoto Ikeda, Kunihiro Asada (University of Tokyo, Japan)
Pagepp. 79 - 80
Keywordprocess variability, process monitor, buffer ring, all digital
AbstractThis paper presents an all-digital PMOS and NMOS process variability monitor which utilizes a simple buffer ring with a pulse counter. The proposed circuit monitors the process variability according to a count number of a single pulse which propagates on the buffer ring and a fixed logic level after the pulse vanishes. The proposed circuit has been fabricated in 65nm CMOS process and the measurement results demonstrate that we can monitor the PMOS and NMOS variabilities independently using the proposed monitoring circuit.

1D-5 (Time: 10:36 - 10:40)
TitleJitter Amplifier for Oscillator-Based True Random Number Generator
Author*Takehiko Amaki, Masanori Hashimoto, Takao Onoye (Osaka University, Japan)
Pagepp. 81 - 82
Keywordtrue random number generator, jitter
AbstractThis paper presents a jitter amplifier for an oscillator-based TRNG (true random number generator). The proposed jitter amplifier, fabricated in a 65nm CMOS process and occupying an area of 3,300 µm², achieves 8.4x gain at 25 degrees Celsius and improves the entropy enough to pass the randomness test.
Slides

1D-6 (Time: 10:40 - 10:44)
TitleA 65nm Flip-Flop Array to Measure Soft Error Resiliency against High-Energy Neutron and Alpha Particles
Author*Jun Furuta (Kyoto University, Japan), Chikara Hamanaka, Kazutoshi Kobayashi (Kyoto Institute of Technology, Japan), Hidetoshi Onodera (Kyoto University, Japan)
Pagepp. 83 - 84
KeywordSoft Error
AbstractWe fabricated a 65nm LSI including a flip-flop array to measure soft error resiliency against high-energy neutron and alpha particles. It consists of two FF arrays. One is an array composed of redundant FFs to confirm the radiation hardness of the proposed and conventional redundant FFs. The other is an array composed of ordinary D-FFs to measure SEU (Single Event Upset) and MCU (Multiple Cell Upset) by the distance from tap cells.
Slides

1D-7 (Time: 10:44 - 10:48)
TitleDual-Phase Pipeline Circuit Design Automation with a Built-in Performance Adjusting Mechanism
AuthorYu-Tzu Tsai, Cheng-Chih Tsai (Dept. of Electronic Engineering Feng Chia University, Taiwan), *Cheng-An Chien (Dept. of CSIE, National Chung Cheng University, Taiwan), Ching-Hwa Cheng (Dept. of Electronic Engineering Feng Chia University, Taiwan), Jiun-In Guo (Dept. of CSIE, National Chung Cheng University, Taiwan)
Pagepp. 85 - 86
Keywordpipeline, domino circuit
AbstractA high-speed dual-phase operation domino circuit, which offers high performance and reliability, is proposed, and the circuit design technique with a practical implementation is presented. The cell-based automatic synthesis flow supports the quick design of high performance chips. The test chip of a dual-phase 64 bit high-speed multiplier with a built-in performance adjustment mechanism is successfully validated using TSMC 0.18 technology. The test chip shows a 2.7x performance improvement compared to a conventional static CMOS logic design.

1D-8 (Time: 10:48 - 10:52)
TitleGeyser-2: The Second Prototype CPU with Fine-grained Run-time Power Gating
Author*Lei Zhao, Daisuke Ikebuchi, Yoshiki Saito, Masahiro Kamata, Naomi Seki, Yu Kojima, Hideharu Amano (Keio University, Japan), Satoshi Koyama, Tatsunori Hashida, Yusuke Umahashi, Daiki Masuda, Kimiyoshi Usami (Shibaura Institute of Technology, Japan), Kazuki Kimura, Mitaro Namiki (Tokyo University of Agriculture and Technology, Japan), Seidai Takeda, Hiroshi Nakamura (University of Tokyo, Japan), Masaaki Kondo (The University of Electro-Communications, Japan)
Pagepp. 87 - 88
KeywordPower Gating, MIPS CPU, Low Power Design
AbstractGeyser-2 is the second prototype MIPS CPU that provides fine-grained run-time power gating controlled by instructions. It operates at a 210MHz clock and reduces leakage power by 60% at normal temperature.
Slides

1D-9 (Time: 10:52 - 10:56)
TitleAn Implementation of an Asynchronous FPGA Based on LEDR/Four-Phase-Dual-Rail Hybrid Architecture
Author*Yoshiya Komatsu, Shota Ishihara, Masanori Hariyama, Michitaka Kameyama (Tohoku University, Japan)
Pagepp. 89 - 90
KeywordFPGA, Asynchronous architecture, LEDR, 4-phase dual-rail
AbstractThis paper presents an asynchronous FPGA that combines four-phase dual-rail encoding and LEDR (Level-Encoded Dual-Rail) encoding. Four-phase dual-rail encoding is used for small area and low power of function units, while LEDR encoding for high throughput and low power of data transfer. The proposed FPGA is fabricated in the e-Shuttle 65nm CMOS process and operates at 870 MHz. Compared to the synchronous FPGA, the power consumption is reduced by 38% for the workload of 15%.
Slides

1D-10 (Time: 10:56 - 11:00)
TitleDesign and Chip Implementation of a Heterogeneous Multi-core DSP
Author*Shuming Chen, Xiaowen Chen, Yi Xu, Jianghua Wan, Jianzhuang Lu, Xiangyuan Liu, Shenggang Chen (National University of Defense Technology, China)
Pagepp. 91 - 92
Keywordmulti-core processor, Digital Signal Processor, heterogeneous
AbstractThis paper presents a novel heterogeneous multi-core Digital Signal Processor, named YHFT-QDSP, hosting one RISC CPU core and four VLIW DSP cores. The CPU core is responsible for task scheduling and management, while the DSP cores take charge of speeding up data processing. The YHFT-QDSP provides three kinds of interconnection communication. One is for inner-chip communication between the CPU core and the four DSP cores; the other two are for both inner-chip and inter-chip communication amongst DSP cores. The YHFT-QDSP is implemented in SMIC 130nm LVT CMOS technology and runs at 350MHz at 1.2V with a die area of 114.49 mm².
Slides

1D-11 (Time: 11:00 - 11:04)
TitleA Low-Power Management Technique for High-Performance Domino Circuits
AuthorYu-Tzu Tsai, Cheng-Chih Tsai (Dept. of Electronic Engineering Feng Chia University, Taiwan), *Cheng-An Chien (Dept. of CSIE, National Chung Cheng University, Taiwan), Ching-Hwa Cheng (Dept. of Electronic Engineering Feng Chia University, Taiwan), Jiun-In Guo (Dept. of CSIE, National Chung Cheng University, Taiwan)
Pagepp. 93 - 94
Keywordpower management, domino circuit
AbstractExploiting a charge sharing method enables a performance power management design for domino circuits. The domino circuits have both high performance and low power consumption. A test chip has been successfully validated using TSMC 0.13 µm CMOS technology. Reductions in dynamic power consumption of 68% and static power consumption of 15% are achieved.

1D-12 (Time: 11:04 - 11:08)
TitleDesign and Evaluation of Variable Stages Pipeline Processor Chip
Author*Tomoyuki Nakabayashi, Takahiro Sasaki, Kazuhiko Ohno, Toshio Kondo (Mie University, Japan)
Pagepp. 95 - 96
KeywordVLSI, Low energy processor, Variable stages pipeline, Glitch
AbstractIn order to reduce the energy consumption in high performance computing, a variable stages pipeline processor (VSP) is proposed, which improves execution time by dynamically unifying the pipeline stages. The VSP adopts a special pipeline register called an LDS-cell that unifies the pipeline stages and prevents glitch propagation. We fabricate the VSP chip on a Rohm 0.18 µm CMOS process and evaluate the energy consumption. The result indicates that the VSP can achieve 13% less energy consumption than the conventional approach.
Slides

1D-13 (Time: 11:08 - 11:12)
TitleTurboVG: A HW/SW Co-Designed Multi-Core OpenVG Accelerator for Vector Graphics Applications with Embedded Power Profiler
Author*Shuo-Hung Chen, Hsiao-Mei Lin, Ching-Chou Hsieh, Chih-Tsun Huang, Jing-Jia Liou, Yeh-Ching Chung (National Tsing Hua University, Taiwan)
Pagepp. 97 - 98
KeywordHW/SW Co-Design, Embedded System, Vector Graphics, OpenVG
AbstractTurboVG is a hardware accelerator for the OpenVG 1.1 library that operates sixteen times faster than an optimized software implementation. This improved efficiency stems from a well-designed hardware-software interaction capable of handling massive data transfers across hierarchical layers without performance loss. By combining multiple TurboVG cores, the library can support screen resolutions of up to Full-HD 1080p.
Slides

1D-14 (Time: 11:12 - 11:16)
TitleDesign and Implementation of a High Performance Closed-Loop MIMO Communications with Ultra Low Complexity Handset
Author*Yu-Han Yuan, Wei-Ming Chen, Hsi-Pin Ma (National Tsing Hua University, Taiwan)
Pagepp. 99 - 100
KeywordMIMO, GMD, THP
AbstractA MIMO transceiver is implemented in which transmitter antenna selection is applied to geometric mean decomposition (GMD) combined with a Tomlinson-Harashima Precoder (THP) in a TDD system. We take the decoder quantization into consideration, and the handset design can be kept simple. The proposed work can save more than 60% of the computational complexity at the handset compared with the GMD scheme. From the simulation results, the proposed transceiver can achieve about 7 dB SNR improvement over its open-loop VBLAST counterparts, and is even about 2 dB better in SNR than ML at BER = 10⁻² under an i.i.d. channel. Finally, the proposed HW/SW co-verification strategy provides an efficient way to carry out the verification.
Slides

1D-15 (Time: 11:16 - 11:20)
TitleA 58-63.6GHz Quadrature PLL Frequency Synthesizer Using Dual-Injection Technique
Author*Ahmed Musa, Rui Murakami, Takahiro Sato, Win Chiavipas, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Pagepp. 101 - 102
Keyword60GHz, ILO, Injection locking
AbstractThis paper proposes a 60GHz quadrature PLL frequency synthesizer that has a tuning range capable of covering the whole band specified by the IEEE802.15.3c with exceptional phase noise. The synthesizer is constructed using a 20GHz PLL that is coupled with a frequency tripler to generate the 60GHz signal. Both the 20GHz PLL and the ILO were fabricated using a 65nm CMOS process and measurement results show a phase noise of -96dBc/Hz at 60GHz while consuming 77.5mW from a 1.2V supply.
Slides

1D-16 (Time: 11:20 - 11:24)
TitleAn Ultra-low-voltage LC-VCO with a Frequency Extension Circuit for Future 0.5-V Clock Generation
Author*Wei Deng, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Pagepp. 103 - 104
Keywordclock generator, 0.5-V, LC-VCO
AbstractThis paper proposes a 0.5-V LC-VCO with a frequency extension circuit to replace ring oscillators for ultra-low-voltage sub-1ps-jitter clock generation. The achieved performance, including 0.6-ps jitter, a 50MHz-to-6.4GHz frequency tuning range with 2 bands, and sub-1mW PDC, indicates that the proposed circuit can successfully replace ring VCOs in future 0.5-V LSIs and power-aware LSIs.
Slides

1D-17 (Time: 11:24 - 11:28)
TitleA 32Gbps Low Propagation Delay 4x4 Switch IC for Feedback-Based System in 0.13µm CMOS Technology
AuthorYu-Hao Hsu, Yang-Syu Lin, Ching-Te Chiu, Jen-Ming Wu, Shuo-Hung Hsu, Fan-Ta Chen, Min-Sheng Kao, *Wei-Chih Lai, YarSun Hsu (National Tsing Hua University, Taiwan)
Pagepp. 105 - 106
Keywordlow propagation delay, load-balanced switch
AbstractIn this paper, a low propagation delay, low power, and area-efficient 4x4 load-balanced switch circuit for feedback-based systems is presented. In this periodic and deterministic switch, only two DFFs are used to implement the pattern generator, whereas a traditional matching-algorithm-based NxN switch has O(N^3) hardware complexity. For packet reordering, a feedback path is established in a series of symmetric patterns. In contrast to commercial switch systems, we implement the 4x4 switch IC directly in the high-speed domain without the use of SERDES interfaces to achieve low propagation delay and high scalability. In the CML output buffer, a PMOS active load and active back-end termination are introduced. A stacked current source and symmetric topology are adopted in the CML-DFF. From our results, this work removes the 28ns propagation delay, 80% area and 80% power introduced by the SERDES interface. The throughput rate is up to 32Gbps (8Gbps/Ch).
Slides
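
As a rough illustration of the periodic, deterministic pattern generation mentioned in the 1D-17 abstract, the sketch below models a 2-bit counter (the state that two flip-flops can hold) stepping through four connection patterns of a 4x4 switch. The mapping "input i connects to output (t - i) mod N" is an assumed textbook-style symmetric pattern chosen for illustration, not the circuit described in the paper, and the names `connection_pattern` and `is_symmetric` are hypothetical.

```python
N = 4  # switch size (4x4); a 2-bit counter has exactly N states


def connection_pattern(t, n=N):
    """Return the input-to-output mapping for time slot t.

    Assumed pattern (illustrative only): input i -> output (t - i) mod n.
    """
    return [(t - i) % n for i in range(n)]


def is_symmetric(pattern):
    """True if 'input i -> output j' implies 'input j -> output i'."""
    return all(pattern[pattern[i]] == i for i in range(len(pattern)))


# One full period: the counter value t cycles through 0..N-1.
patterns = [connection_pattern(t) for t in range(N)]

for t, p in enumerate(patterns):
    print(f"slot {t}: {p} symmetric={is_symmetric(p)}")
```

Under this assumed mapping, every slot is a permutation of the outputs, and over one period each input is connected to each output exactly once, with no matching computation needed.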

1D-18 (Time: 11:28 - 11:32)
TitleA Fully Integrated Shock Wave Transmitter with an On-Chip Dipole Antenna for Pulse Beam-Formability in 0.18-μm CMOS
Author*Nguyen Ngoc Mai Khanh (The University of Tokyo, Japan), Masahiro Sasaki, Kunihiro Asada (VLSI Design and Education Center (VDEC), the University of Tokyo, Japan)
Pagepp. 107 - 108
Keywordshock wave, CMOS, on-chip antenna, beam-forming, transmitter
AbstractThis paper presents a fully integrated 9-11-GHz shock wave transmitter with an on-chip antenna and a digitally programmable delay circuit (DPDC) for pulse beam-formability in short-range microwave active imaging applications. The resistorless shock wave generator (SWG) produces a 0.4-V peak-to-peak (p-p) shock wave output in HSPICE simulation. The DPDC is designed to adjust delays of shock-wave outputs for the beam-forming purpose. The SWG's output is sent to an integrated meandering dipole antenna through an on-chip transformer. The measured return loss, S11, of a stand-alone integrated meandering dipole is from -26 dB to -10 dB over a frequency range of 7.5-12 GHz. A 1.1-mV (p-p) shock wave output is received by a 20-dB standard gain horn antenna located at a 38-mm distance from the chip. The frequency response and delay resolution of the measured shock wave output are 9-11 GHz and 3 ps, respectively. These characteristics are suitable for fully integrated pulse beam-forming array antenna systems.
Slides

1D-19 (Time: 11:32 - 11:36)
TitleAn On-Chip Characterizing System for Within-Die Delay Variation Measurement of Individual Standard Cells in 65-nm CMOS
Author*Xin Zhang, Koichi Ishida, Makoto Takamiya, Takayasu Sakurai (University of Tokyo, Japan)
Pagepp. 109 - 110
Keywordwithin-die delay variation, design for manufacturing, on-chip oscilloscope
AbstractA new characterizing system for within-die delay variations of individual standard cells is presented. The proposed characterizing system is able to measure rising and falling delay variations separately by directly measuring the input and output waveforms of an individual gate using an on-chip sampling oscilloscope in a 65nm CMOS process. Seven types of standard cells are measured with 60 DUTs for each type. Thanks to the proposed system, a relationship between the rising and falling delay variations and the active area of the standard cells is experimentally shown for the first time.
Slides

1D-21 (Time: 11:36 - 11:40)
TitleRobust and Efficient Baseband Receiver Design for MB-OFDM UWB System
AuthorWen Fan, *Chiu-Sing Choy (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 111 - 112
KeywordMB-OFDM UWB, baseband, receiver
AbstractRobust, efficient and low complexity design methodologies for a high speed multi-band orthogonal frequency division multiplexing ultra-wideband (MB-OFDM UWB) baseband receiver are presented. The proposed design is implemented in 0.13μm CMOS technology with a core area of 2.66mm×0.94mm. Operating at a 132MHz clock frequency, the estimated power consumption is 170mW.

1D-22 (Time: 11:40 - 11:44)
TitleA 95-nA, 523ppm/°C, 0.6-µW CMOS Current Reference Circuit with Subthreshold MOS Resistor Ladder
Author*Yuji Osaki, Tetsuya Hirose, Nobutaka Kuroki, Masahiro Numa (Kobe University, Japan)
Pagepp. 113 - 114
Keywordcurrent reference, low power, temperature stability
AbstractA low-power current reference circuit was developed in a 0.35-um standard CMOS process. The proposed circuit utilizes an offset-voltage generation subcircuit consisting of a subthreshold MOS resistor ladder and generates a temperature-compensated reference current. Experimental results demonstrated that the proposed circuit generates a 95-nA reference current, and that the total power dissipation is 586 nW. The temperature coefficient of the reference current can be kept within 523ppm/°C in a temperature range from -20 to 100°C.
Slides

1D-23 (Time: 11:44 - 11:48)
TitleA 80-400 MHz 74 dB-DR Gm-C Low-Pass Filter With a Unique Auto-Tuning System
Author*Ting Gao, Wei Li, Ning Li, Junyan Ren (Fudan University, China)
Pagepp. 115 - 116
KeywordGm-C, filter, auto tuning
AbstractAn 80-400 MHz 5th-order Chebyshev Gm-C low-pass filter with a unique auto tuning system is presented. The filter was fabricated with TSMC 0.13-µm CMOS process. Experimental results show that the cut-off frequency of the filter can be tuned between 80-400MHz, with an average tuning error of 3.6%. The filter also realizes gain of 0-30 dB, IIP3 of 16.5 dBm, NF of 14-18 dB, and DR of 74 dB. Power dissipation is only 9 mW with 1.2 V supply voltage.
Slides

1D-24 (Time: 11:48 - 11:52)
TitleAn Adaptively Biased Low-Dropout Regulator with Transient Enhancement
Author*Chenchang Zhan, Wing-Hung Ki (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 117 - 118
Keywordlow-dropout regulator, adaptive biasing, output-capacitor-free, transient enhancement
AbstractAn output-capacitor-free adaptively biased low-dropout regulator with transient enhancement (ABTE LDR) is proposed. Techniques of Q-reduction compensation, adaptive biasing, and transient enhancement achieve low-voltage high-precision regulation with low quiescent current consumption while significantly improving the line and load transient responses and power supply rejections. The features of the ABTE LDR are experimentally verified by a 0.35-um CMOS prototype.
Slides

1D-25 (Time: 11:52 - 11:56)
TitleA Low-Power Triple-Mode Sigma-Delta DAC for Reconfigurable (WCDMA/TD-SCDMA/GSM) Transmitters
Author*Dong Qiu, Ting Yi, Zhiliang Hong (Fudan University, China)
Pagepp. 119 - 120
KeywordDigital-to-Analog Converter, Reconfigurable, Low-power
AbstractThis paper presents a sigma-delta DAC with channel filtering for multi-standard wireless transmitters. It can be digitally programmed to satisfy specifications of WCDMA, TD-SCDMA and GSM standards. The measured SFDR is 62.8/60.1/75.5 dB in WCDMA/TD-SCDMA/GSM mode, respectively. The sigma-delta DAC, manufactured in an SMIC 0.13-µm CMOS process, occupies a 0.72 mm^2 die area, while drawing 5.52/4.82/3.04 mW in WCDMA/TD-SCDMA/GSM mode from a single 1.2-V supply voltage.

1D-26 (Time: 11:56 - 12:00)
TitleA Simple Non-coherent Solution to the UWB-IR Communication
Author*Mohiuddin Hafiz, Nobuo Sasaki, Kentaro Kimoto, Takamaro Kikkawa (Hiroshima University, Japan)
Pagepp. 121 - 122
Keywordnon-coherent, BPSK, CMOS, Transceiver
AbstractA simple non-coherent solution to UWB-IR communication is presented. An all-digital differential transmitter, developed in a 65 nm CMOS technology, and a simple receiver, developed in a 180 nm CMOS technology, for detecting the received differential signal are demonstrated in this work. Though the transmitter and the receiver have been developed in two different technologies, the main objective of this paper is to show the effectiveness of such a non-coherent solution for BPSK modulated UWB-IR communication.
Slides


Session 2A  Scheduling Techniques for Embedded Systems
Time: 13:40 - 15:40 Wednesday, January 26, 2011
Location: Room 411+412
Chairs: Dip Goswami (Technical University of Munich, Germany), Naehyuck Chang (Seoul National University, Republic of Korea)

2A-1 (Time: 13:40 - 14:10)
TitleThermally Optimal Stop-Go Scheduling of Task Graphs with Real-Time Constraints
Author*Pratyush Kumar, Lothar Thiele (ETH Zürich, Switzerland)
Pagepp. 123 - 128
KeywordStop-go scheduling, Task-graphs, Real-time
AbstractDynamic thermal management (DTM) techniques, which manage the load on a system to avoid thermal hazards, are becoming mainstream in today’s systems. With the increasing percentage of leakage power, switching off the processors is becoming a viable alternative technique to speed scaling. For real-time applications, it is crucial that under such techniques the system still meets the performance constraints. In this paper we study stop-go scheduling to minimize peak temperature when scheduling an application, modeled as a task-graph, within a given makespan constraint. For a given static-ordering of execution of the tasks, we derive the optimal schedule referred to as the JUST schedule. We prove that for periodic task-graphs, the optimal temperature is independent of the chosen static-ordering when following the proposed JUST schedule. Simulation experiments validate the theoretical results.
Slides

2A-2 (Time: 14:10 - 14:40)
TitleRegister Allocation for Write Activity Minimization on Non-volatile Main Memory
AuthorYazhi Huang, Tiantian Liu, *Jason Xue (City University of Hong Kong, Hong Kong)
Pagepp. 129 - 134
KeywordNon-volatile memory, register allocation, graph coloring
AbstractNon-volatile memories are good candidates for DRAM replacement as main memory in embedded systems. This paper focuses on embedded systems that use non-volatile memory as main memory. We propose a register allocation technique with re-computation to reduce the number of store instructions. When non-volatile memory is used as the main memory, reducing store instructions reduces write activities on the non-volatile memory. The proposed techniques can efficiently reduce the number of store instructions on systems with non-volatile memory by 25% on average.

2A-3 (Time: 14:40 - 15:10)
TitleLeakage Conscious DVS Scheduling for Peak Temperature Minimization
AuthorVivek Chaturvedi, *Gang Quan (Florida International University, U.S.A.)
Pagepp. 135 - 140
KeywordDVS, real-time scheduling, leakage-aware, thermal management, peak temperature
AbstractIn this paper, we incorporate the dependencies among the leakage, the temperature and the supply voltage into the theoretical analysis and explore the fundamental characteristics of how to employ dynamic voltage scaling (DVS) to reduce the peak operating temperature. We find that, for a specific interval, a real-time schedule using the lowest constant speed is no longer necessarily the optimal choice for minimizing the peak temperature. We identify the scenarios in which a schedule using two different speeds can outperform the one using the constant speed. In addition, we find that the constant speed schedule is still the optimal one for minimizing the peak temperature at the temperature stable status when scheduling a periodic task set. We formulate our conclusions into several theorems with formal proofs.
Slides

2A-4 (Time: 15:10 - 15:40)
TitleReconfiguration-aware Real-Time Scheduling under QoS Constraint
AuthorHessam Kooti, *Deepak Mishra, Eli Bozorgzadeh (University of California, Irvine, U.S.A.)
Pagepp. 141 - 146
KeywordQuality of Service, Real-Time, Reconfiguration overhead
AbstractDue to the increase in demand for reconfigurability in embedded systems, schedulability in real-time task scheduling is challenged by non-negligible reconfiguration overheads. Reconfiguration of the system during task execution affects both deadline miss rate and deadline miss distribution. On the other hand, Quality of Service (QoS) in several embedded applications is not only determined by deadline miss rate but also the distribution of the tasks missing their deadlines (known as weakly-hard real-time systems). As a result, we propose to model QoS constraints as a set of constraints on dropout patterns (due to reconfiguration overhead) and present a novel online solution for the problem of reconfiguration-aware real-time scheduling. According to QoS constraints, we divide the ready instances of the tasks into two groups: critical and non-critical, then model each group as a network flow problem and provide an online scheduler for each group. We deployed our method on synthetic benchmarks as well as software defined radio implementation of VoIP on reconfigurable systems. Results show that our solution reduces the number of QoS violations by 19.01 times and 2.33 times (57.02%) in comparison with Bi-Modal Scheduler (BMS) [1] for synthetic benchmarks with low and high QoS constraint, respectively.
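
The 2A-4 abstract notes that QoS can depend on how deadline misses are distributed, not only on the miss rate. The sketch below illustrates that point with one common way of stating a weakly-hard constraint, "at most m misses in any k consecutive instances". This form, the function name `violates_mk`, and the sample histories are assumptions for illustration; the paper's own constraint model and scheduler are not reproduced here.

```python
def violates_mk(history, m, k):
    """Return True if any window of k consecutive instances has more than m misses.

    history: list of booleans, True = deadline met, False = deadline missed.
    """
    for start in range(len(history) - k + 1):
        window = history[start:start + k]
        misses = sum(1 for met in window if not met)
        if misses > m:
            return True
    return False


# Two hypothetical histories with the same miss rate (2 misses out of 8).
spread_misses = [True, False, True, True, False, True, True, True]
burst_misses = [True, True, False, False, True, True, True, True]

print(violates_mk(spread_misses, m=1, k=3))  # misses are spread out
print(violates_mk(burst_misses, m=1, k=3))   # misses are back to back
```

Both histories have the same miss rate, yet only the one with consecutive misses breaks the assumed (m = 1, k = 3) constraint, which is why a pattern-based constraint carries more information than a rate alone.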


Session 2B  Memory Architecture and Buffer Optimization
Time: 13:40 - 15:40 Wednesday, January 26, 2011
Location: Room 413
Chairs: Yu Wang (Tsinghua University, China), Yinhe Han (Chinese Academy of Sciences, China)

2B-1 (Time: 13:40 - 14:10)
TitleTemplate-based Memory Access Engine for Accelerators in SoCs
Author*Bin Li, Zhen Fang, Ravi Iyer (Intel Corporation, U.S.A.)
Pagepp. 147 - 153
KeywordSoC, Accelerators, Memory systems
AbstractWith the rapid progress in semiconductor technologies, more and more accelerators can be integrated onto a single SoC chip. In SoCs, accelerators often require deterministic data access. However, as more and more applications run simultaneously, latency can vary significantly due to contention. To address this problem, we propose a template-based memory access engine (MAE) for accelerators in SoCs. The proposed MAE can handle several common memory access patterns observed for near-future accelerators. Our evaluation results show that the proposed MAE can significantly reduce memory access latency and jitter, and is thus very effective for accelerators in SoCs.
Slides

2B-2 (Time: 14:10 - 14:40)
TitleRealization and Performance Comparison of Sequential and Weak Memory Consistency Models in Network-on-Chip based Multi-core Systems
Author*Abdul Naeem, Xiaowen Chen, Zhonghai Lu, Axel Jantsch (Royal Institute of Technology, Sweden)
Pagepp. 154 - 159
KeywordMemory consistency, Distributed shared memory
AbstractThis paper studies the realization and performance comparison of the sequential and weak consistency models in network-on-chip (NoC) based distributed shared memory (DSM) multi-core systems. Memory consistency constrains the order of shared memory operations to ensure the expected behavior of the multi-core systems. Both consistency models are realized in the NoC based multi-core systems. The performance of the two consistency models is compared for networks of various sizes using regular mesh topologies and a deflection routing algorithm. The results show that the weak consistency model improves the performance by 46.17% and 33.76% on average in the code and consistency latencies over the sequential consistency model, due to the program order relaxation, as the system grows from a single core to 64 cores.
Slides

2B-3 (Time: 14:40 - 15:10)
TitleNetwork-on-Chip Router Design with Buffer-Stealing
AuthorWan-Ting Su, *Jih-Sheng Shen, Pao-Ann Hsiung (National Chung Cheng University, Taiwan)
Pagepp. 160 - 164
KeywordNoC, Buffer Design
AbstractA Buffer-Stealing (BS) mechanism is proposed, which enables the input channels in NoC routers that have insufficient buffer space to utilize at runtime the unused input buffers from other input channels. Implementation results of the proposed BS design for a 64-bit 5-input-buffer router show a reduction of the average packet transmission latency by up to 10.17% and an increase of the average throughput by up to 23.47%, at an overhead of 22% more hardware resources.
Slides

2B-4 (Time: 15:10 - 15:40)
TitleMinimizing Buffer Requirements for Throughput Constrained Parallel Execution of Synchronous Dataflow Graph
AuthorTae-ho Shin (Seoul National University, Republic of Korea), Hyunok Oh (Hanyang University, Republic of Korea), *Soonhoi Ha (Seoul National University, Republic of Korea)
Pagepp. 165 - 170
KeywordSynchronous Dataflow Graph, Static mapping, Dynamic scheduling, Buffer size minimize
AbstractThis paper concerns throughput-constrained parallel execution of a synchronous dataflow (SDF) graph. We assume static mapping and dynamic scheduling of the nodes, which has several benefits over static scheduling approaches. We determine the buffer size of all arcs to minimize the total buffer size while satisfying a throughput constraint. By unfolding the given SDF graph, dynamic scheduling is able to achieve throughput similar to that of static scheduling. A key issue of dynamic scheduling is how to assign the priority to each node invocation, which is also discussed in this paper. Since the problem is NP-hard, we present a heuristic based on a genetic algorithm. The experimental results confirm the viability of the proposed technique.
Slides


Session 2C  Modeling for Signal and Power Integrity
Time: 13:40 - 15:40 Wednesday, January 26, 2011
Location: Room 414+415
Chairs: Hideki Asai (Shizuoka University, Japan), Kimihiro Ogawa (STARC, Japan)

2C-1 (Time: 13:40 - 14:10)
TitleA Fast Approximation Technique for Power Grid Analysis
Author*Mysore Sriram (Intel Corporation, India)
Pagepp. 171 - 175
KeywordIR drop, power grid, approximation, algorithm
AbstractWe present a fast approximation algorithm for computing IR drops in a VLSI power grid. Assuming that the grid does not have pathological defects, the algorithm can estimate IR drops to within 5% average error, with a run time of less than one second per million nodes. Incremental recomputations with new current source values are even faster. The IR drop profiles have excellent correlation with simulated values, making this approach a viable platform for building automatic grid optimization algorithms.
Slides

2C-2 (Time: 14:10 - 14:40)
TitleEquivalent Lumped Element Models for Various n-Port Through Silicon Vias Networks
Author*Khaled Salah Mohamed (Mentor Graphics, Egypt), Hani Ragai (Ain-Shams University, Egypt), Yehea Ismail (Nile University, Egypt), Alaa El Rouby (Mentor Graphics, Egypt)
Pagepp. 176 - 183
KeywordThree-Dimensional ICs, Through Silicon Via, Dimensional Analysis, TSV, Modeling
AbstractThis paper proposes an equivalent lumped element model for various multi-TSV arrangements and introduces closed form expressions for the capacitive, resistive, and inductive coupling between those arrangements. The closed form expressions are in terms of physical dimensions and material properties and are derived using the dimensional analysis method. The model’s compactness and compatibility with SPICE simulators allow the electrical modeling of various TSV arrangements without the need for computationally expensive field-solvers, as well as fast investigation of the impact of TSVs on 3-D circuit performance. The accuracy of the proposed model is tested against a detailed electromagnetic simulation and shows less than 6% difference. Finally, the proposed model can be a possible solution to the industrial need for broadband electrical modeling of TSV interconnections arising in 3-D integration. Our work also provides valuable insight into creating guidelines for TSV macro-modeling.
Slides

2C-3 (Time: 14:40 - 15:10)
TitleClock Tree Optimization for Electromagnetic Compatibility (EMC)
Author*Xuchu Hu, Matthew R. Guthaus (University of California, Santa Cruz, U.S.A.)
Pagepp. 184 - 189
KeywordEMC, Clock, Dynamic programming
AbstractElectromagnetic Interference (EMI) generated by electronic systems is increasing with operating frequency and shrinking process technologies. The clock distribution network is one of the major causes of on-chip EMI. In this paper, we discuss the EMI problem in clock tree design. Spectrum analysis shows that the slew rate of the clock signal is the main parameter determining the high-frequency spectral content distribution. This is the first work to consider maximum and minimum buffer slew rates in clock tree synthesis to reduce EMI. We propose a dynamic programming algorithm to optimize the clock tree considering both traditional metrics and Electromagnetic Compatibility (EMC). Our experimental results show that slew can be controlled within a feasible range and high-frequency spectral content can be reduced without sacrificing traditional metrics such as power and skew. With the efficient optimization and pruning method, the largest benchmark, which has 1728 sinks, completes in four minutes.
Slides

2C-4 (Time: 15:10 - 15:40)
TitlePulser Gating: A Clock Gating of Pulsed-Latch Circuits
Author*Sangmin Kim, Inhak Han, Seungwhun Paik, Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea)
Pagepp. 190 - 195
KeywordPulsed-latch, sequential circuit, clock gating, low power
AbstractA pulsed-latch is an ideal sequencing element for low-power ASIC designs due to its smaller capacitance and simple timing model. Clock gating of pulsed-latch circuits can be realized by gating a pulse generator (or pulser), which we call pulser gating. The problem of pulser gating synthesis is formulated for the first time. A heuristic algorithm that considers all three factors (similarity of gating functions, literal count to implement gating functions, and proximity of latches) is proposed.
Slides


Session 2D  Special Session: Emerging Memory Technologies and Its Implication on Circuit Design and Architectures
Time: 13:40 - 15:40 Wednesday, January 26, 2011
Location: Room 416+417
Organizer: Yuan Xie (Pennsylvania State University, U.S.A.)

2D-1 (Time: 13:40 - 14:10)
Title(Invited Paper) Circuit Design Challenges in Embedded Memory and Resistive RAM (RRAM) for Mobile SoC and 3D-IC
AuthorMeng-Fan Chang (National Tsing Hua University, Taiwan), Pi-Feng Chiu, Shyh-Shyuan Sheu (Industrial Technology Research Institute, Taiwan)
Pagepp. 197 - 203
AbstractMobile systems require high-performance and low-power SoC or 3D-IC chips to perform complex operations, maintain a small form factor, and ensure a long battery lifetime. A low supply voltage (VDD) is frequently utilized to suppress dynamic power consumption, standby current, and thermal effects in SoC and 3D-IC. Furthermore, lowering the VDD reduces the voltage stress of the devices and slows the aging of chips. However, a low VDD for embedded memories can cause functional failure and low yield. This paper reviews various challenges in the design of low-voltage circuits for embedded memory (SRAM and ROM). It also discusses emerging embedded memory solutions. Alternative memory interfaces and architectures for mobile SoC and 3D-IC are also explored.

2D-2 (Time: 14:10 - 14:40)
Title(Invited Paper) Emerging Sensing Techniques for Emerging Memories
AuthorYiran Chen (University of Pittsburgh, U.S.A.), Hai Li (Polytechnic Institute of New York University, U.S.A.)
Pagepp. 204 - 210
AbstractAmong all emerging memories, Spin-Transfer Torque Random Access Memory (STT-RAM) has shown many promising features such as fast access speed, nonvolatility, compatibility with the CMOS process and excellent scalability. However, large process variations of both the magnetic tunneling junction (MTJ) and the MOS transistor severely limit the yield of STT-RAM chips. In this work, we present a recently proposed sensing technique called the nondestructive self-reference read scheme (NSRS) to overcome the bit-to-bit variations in STT-RAM by leveraging the different dependencies of the high-resistance state of MTJs on the sensing current biases. Additionally, a few enhancement techniques including R-I curve skewing, yield-driven sensing current selection, and ratio matching are introduced to further improve the robustness of NSRS. The measurements of a 16Kb STT-RAM test chip show that NSRS can significantly improve the chip yield by reducing sensing failures with high sense margin and low power consumption.

2D-3 (Time: 14:40 - 15:10)
Title(Invited Paper) A Frequent-Value Based PRAM Memory Architecture
AuthorGuangyu Sun, Dimin Liu, Jin Ouyang, Yuan Xie (Pennsylvania State University, U.S.A.)
Pagepp. 211 - 216
AbstractPhase Change Random Access Memory (PRAM) has great potential as the replacement of DRAM as main memory, due to its advantages of high density, non-volatility, fast read speed, and excellent scalability. However, poor endurance and high write energy appear to be the challenges to be tackled before PRAM can be adopted as main memory. In order to mitigate these limitations, prior research focuses on reducing write intensity at the bit level. In this work, we study the data pattern of memory write operations, and explore the frequent-value locality in data written back to main memory. Based on the fact that many data are written to memory repeatedly, an architecture of frequent-value storage is proposed for PRAM memory. It can significantly reduce the write intensity to PRAM memory so that the lifetime is improved and the write energy is reduced. The trade-off between endurance and capacity of PRAM memory is explored for different configurations. After using the frequent-value storage, the endurance of PRAM is improved to about 1.6X on average, and the write energy is reduced by 20%.

2D-4 (Time: 15:10 - 15:40)
Title(Invited Paper) Two-Terminal Resistive Switches (Memristors) for Memory and Logic Applications
AuthorWei Lu, Kuk-Hwan Kim, Ting Chang, Siddharth Gaba (University of Michigan, U.S.A.)
Pagepp. 217 - 223
AbstractWe review the recent progress on the development of two-terminal resistive devices (memristors). Devices based on solid-state electrolytes (e.g. a-Si) have been shown to possess a number of promising performance metrics such as yield, on/off ratio, switching speed, endurance and retention suitable for memory or reconfigurable circuit applications. In addition, devices with incremental resistance changes have been demonstrated and can be used to emulate synaptic functions in hardware based neuromorphic circuits. Device and SPICE modeling based on a properly chosen internal state variable have been carried out and will be useful for large-scale circuit simulations.


Session 3A  High-Level Embedded Systems Design Techniques
Time: 16:00 - 18:00 Wednesday, January 26, 2011
Location: Room 411+412
Chairs: Yuko Hara-Azumi (Ritsumeikan University, Japan), Yiran Chen (University of Pittsburgh, U.S.A.)

3A-1 (Time: 16:00 - 16:30)
TitleCo-design of Cyber-Physical Systems via Controllers with Flexible Delay Constraints
Author*Dip Goswami, Reinhard Schneider, Samarjit Chakraborty (Technical University of Munich, Germany)
Pagepp. 225 - 230
KeywordCyber-Physical Systems, Controller, Flexible Delay-constraints, FlexRay Protocol
AbstractIn this paper, we consider a cyber-physical architecture where control applications are divided into multiple tasks, spatially distributed over various processing units that communicate via a shared bus. While control signals are exchanged over the communication bus, they have to wait for bus access and therefore experience a delay. We propose certain (co-)design guidelines for (i) the communication schedule, and (ii) the controller, such that stability of the control applications is guaranteed for more flexible communication delay constraints than what has been studied before. We illustrate the applicability of our design approach using the FlexRay dynamic segment as the communication medium for the processing units.
Slides

3A-2 (Time: 16:30 - 17:00)
TitleEnhanced Heterogeneous Code Cache Management Scheme for Dynamic Binary Translation
Author*Ang-Chih Hsieh, Chun-Cheng Liu, TingTing Hwang (National Tsing Hua University, Taiwan)
Pagepp. 231 - 236
Keywordcode cache, binary translation
AbstractRecently, Dynamic Binary Translation (DBT) technology has gained much attention in embedded systems due to its various capabilities. However, the memory resource in embedded systems is often limited. This leads to the overhead of code re-translation and causes significant performance degradation. To reduce this overhead, the Heterogeneous Code Cache (HCC) has been proposed, which splits the code cache among SPM and main memory to avoid code re-translation. Although HCC is effective in handling applications with large working sets, it ignores the execution frequencies of program segments. Frequently executed program segments can be stored in main memory and suffer from large access latency. This causes significant performance loss. To address this problem, an enhanced Heterogeneous Code Cache management scheme which considers program behaviors is proposed in this paper. Experimental results show that the proposed management scheme can effectively improve the access ratio of SPM from 49.48% to 95.06%. This leads to a 42.68% improvement in performance as compared with the management scheme proposed in the previous work.

3A-3 (Time: 17:00 - 17:30)
TitleFast Hybrid Simulation for Accurate Decoded Video Quality Assessment on MPSoC Platforms with Resource Constraints
Author*Deepak Gangadharan (School of Computing, National University of Singapore, Singapore), Samarjit Chakraborty (Institute for Real-Time Computer Systems, Technical University of Munich, Germany), Roger Zimmermann (School of Computing, National University of Singapore, Singapore)
Pagepp. 237 - 242
KeywordMPSoC, MPEG-2 decoder, PSNR, resource constraints
AbstractMultimedia decoders mapped onto MPSoC platforms exhibit degraded video quality when critical system resources such as buffer size and processor frequency are constrained. Hence, it is essential for system designers to find the appropriate mix of resources, within the constraints, for a desired output video quality. A naive approach to do this would be to run expensive system simulations of the decoder tasks mapped onto a model of the underlying MPSoC architecture. This turns out to be inefficient when the input video library has a large number of video clips. We propose a fast hybrid simulation framework to quantitatively estimate decoded video quality in the context of an MPEG-2 decoder. Here, the workloads of simulation-heavy tasks are estimated using accurate analytical models. The workloads of other light (but difficult to analytically model) tasks are obtained from system simulations. This framework enables the system designer to perform a fast trade-off analysis of the system resources in order to choose the optimal combination of resources for the desired video quality. When compared to a naive system simulation approach, the hybrid simulation-based framework shows speed-up factors of about 5x for motion and 8x for still videos. The results obtained using this framework highlight important trade-offs such as the decoded video quality (measured in terms of the peak signal to noise ratio (PSNR)) vs buffer size and PSNR vs processor frequency.
Slides
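
For readers unfamiliar with the quality metric named in the abstract of 3A-3, the following is a minimal sketch of the standard PSNR definition, 10·log10(MAX² / MSE), applied to two frames given as flat lists of pixel values. It is not taken from the paper; the function name `psnr` and the default peak value of 255 (8-bit samples) are assumptions made for illustration.

```python
import math


def psnr(reference, decoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `max_value` is the largest possible sample value (255 assumes 8-bit video).
    """
    if len(reference) != len(decoded) or not reference:
        raise ValueError("frames must be non-empty and the same size")
    # Mean squared error over all pixel positions.
    mse = sum((r - d) ** 2 for r, d in zip(reference, decoded)) / len(reference)
    if mse == 0:
        return math.inf  # identical frames: PSNR is unbounded
    return 10 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    ref = [10, 20, 30, 40]
    dec = [11, 19, 30, 42]
    print(f"PSNR = {psnr(ref, dec):.2f} dB")
```

A lower PSNR indicates more distortion in the decoded frame, which is why the abstract plots it against buffer size and processor frequency.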

3A-4 (Time: 17:30 - 18:00)
TitleOn the Interplay of Loop Caching, Code Compression, and Cache Configuration
AuthorMarisha Rawlins, *Ann Gordon-Ross (University of Florida, U.S.A.)
Pagepp. 243 - 248
Keywordmemory, cache, optimization
AbstractEven though much previous work explores varying instruction cache optimization techniques individually, little work explores the combined effects of these techniques (i.e., do they complement or obviate each other). In this paper we explore the interaction of three optimizations: loop caching, cache tuning, and code compression. Results show that loop caching increases energy savings by as much as 26% compared to cache tuning alone and reduces decompression energy by as much as 73%.
Slides


Session 3B  Timing, Power, and Thermal Issues
Time: 16:00 - 18:00 Wednesday, January 26, 2011
Location: Room 413
Chair: Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.)

3B-1 (Time: 16:00 - 16:30)
TitlePath Criticality Computation in Parameterized Statistical Timing Analysis
Author*Jaeyong Chung (University of Texas at Austin, U.S.A.), Jinjun Xiong, Vladimir Zolotov (IBM Thomas J. Watson Research Center, U.S.A.), Jacob A. Abraham (University of Texas at Austin, U.S.A.)
Pagepp. 249 - 254
KeywordCriticality probability, Statistical timing, Parametric variation
AbstractThis paper presents a method to compute criticality probabilities of paths in parameterized statistical static timing analysis (SSTA). We partition the set of all the paths into several groups and formulate the path criticality as a joint probability of inequalities. Before evaluating the joint probability directly, we simplify the inequalities through algebraic elimination, handling topological correlation. Our proposed method uses conditional probabilities to obtain the joint probability, and the statistics of the random variables representing process parameters change under the given conditions. To calculate the conditional statistics of the random variables, we derive analytic formulas by extending Clark's work. This allows us to obtain the conditional probability density function of a path delay, given that the path is critical, as well as to compute criticality probabilities of paths. Our experimental results show that the proposed method provides 4.2X better accuracy on average in comparison to the state-of-the-art method.
Slides
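
The abstract of 3B-1 states that its formulas extend Clark's work. As background only, the sketch below shows Clark's classical moment formulas for the maximum of two jointly Gaussian delays, together with the probability that the first delay is the larger one (often called the tightness probability). It does not reproduce the paper's conditional statistics; the function name `clark_max` and its return layout are assumptions for illustration.

```python
import math


def _pdf(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clark_max(mu1, sigma1, mu2, sigma2, rho=0.0):
    """Mean and variance of max(X1, X2) for jointly Gaussian X1, X2,
    plus P(X1 > X2). `rho` is the correlation coefficient."""
    a2 = sigma1 ** 2 + sigma2 ** 2 - 2.0 * rho * sigma1 * sigma2
    if a2 <= 1e-18:
        # X1 - X2 is (numerically) deterministic: one input always wins.
        if mu1 >= mu2:
            return mu1, sigma1 ** 2, 1.0
        return mu2, sigma2 ** 2, 0.0
    a = math.sqrt(a2)
    alpha = (mu1 - mu2) / a
    t = _cdf(alpha)  # probability that X1 is the larger delay
    mean = mu1 * t + mu2 * (1.0 - t) + a * _pdf(alpha)
    second = ((mu1 ** 2 + sigma1 ** 2) * t
              + (mu2 ** 2 + sigma2 ** 2) * (1.0 - t)
              + (mu1 + mu2) * a * _pdf(alpha))
    return mean, second - mean ** 2, t


if __name__ == "__main__":
    mean, var, tight = clark_max(10.0, 1.0, 9.5, 2.0, rho=0.3)
    print(f"mean={mean:.3f} var={var:.3f} P(X1>X2)={tight:.3f}")
```

In a two-path circuit, the returned probability is the chance that the first path is the critical one; handling many correlated paths is the harder problem the paper addresses.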

3B-2 (Time: 16:30 - 17:00)
TitleRun-Time Adaptable On-Chip Thermal Triggers
Author*Pratyush Kumar, David Atienza (EPFL, Switzerland)
Pagepp. 255 - 260
KeywordDTM, Neural network, Predictive
AbstractWith ever-increasing power densities, Dynamic Thermal Management (DTM) techniques have become mainstream in today’s systems. An important component of such techniques is the thermal trigger. It has been shown that predictive thermal triggers can outperform reactive ones [4]. In this paper, we present a novel trade-off space of predictive thermal triggers, and compare different approaches proposed in the literature. We argue that run-time adaptability is a crucial parameter of interest. We present a run-time adaptable thermal simulator, compatible with arbitrary sensor configurations, based on the Neural Network (NN) simulator presented in [14]. We present experimental results on the Niagara UltraSPARC T1 chip with real-life benchmark applications. Our results quantitatively establish the effectiveness of the proposed simulator in reducing (by up to 90%) the otherwise unacceptably high errors that can arise due to expected leakage current variation and design-time thermal modeling errors.
Slides

3B-3 (Time: 17:00 - 17:30)
TitleRethinking Thermal Via Planning with Timing-Power-Temperature Dependence for 3D ICs
AuthorKan Wang, *Yuchun Ma, Sheqin Dong, Yu Wang, Xianlong Hong (Tsinghua University, China), Jason Cong (University of California, Los Angeles, U.S.A.)
Pagepp. 261 - 266
KeywordThermal Via, Leakage Power, Delay, Timing-Power-Temperature Dependence
AbstractDue to increased power density and lower thermal conductivity, 3D ICs face serious heat dissipation and temperature problems, which have become a bottleneck in 3D circuit design. Previous research showed that leakage power and delay both depend on temperature, and increase as the temperature increases. This timing-power-temperature dependence can potentially negate the performance improvement of 3D designs. TSVs (through-silicon vias) have been shown to be an effective way to help heat removal, but they create routing congestion. Therefore, a trade-off among temperature, via number and delay must be found. Unlike previous works on TSV planning, which ignored the effects of leakage power, in this paper we integrate temperature-leakage-timing dependence into thermal via planning of 3D ICs. A weighted via insertion approach, considering both performance and heat dissipation under resource constraints, is proposed to achieve the best balance among delay, via number and temperature.
Slides

3B-4 (Time: 17:30 - 18:00)
TitleThe Impact of Inverse Narrow Width Effect on Sub-threshold Device Sizing
AuthorJun Zhou, *Senthil Jayapal, Jan Stuyt, Jos Huisken, Harmke de Groot (Holst Centre/IMEC, Netherlands)
Pagepp. 267 - 272
Keywordsub-threshold, sizing, inverse narrow width effect
AbstractWe have investigated the impact of the inverse narrow width effect on sub-threshold device sizing and proposed a new sizing method that balances the rise and fall delays by taking into account the influence of the inverse narrow width effect while minimizing the transistor size. Compared with the previous sub-threshold sizing method, the delay and power-delay product are reduced by up to 35.4% and 73.4%, respectively, with up to 57% savings in area. Further, due to the symmetric rise and fall delays, the minimum operating voltage can be lowered by 8%, which leads to another 16% of energy reduction.
Slides


Session 3C  Special Session: Post-Silicon Techniques to Counter Process and Electrical Parameter Variability
Time: 16:00 - 18:00 Wednesday, January 26, 2011
Location: Room 414+415
Chair: Jing-Jia Liou (National Tsing Hua University, Taiwan)

3C-1 (Time: 16:00 - 16:30)
Title(Invited Paper) Post-silicon Bug Detection for Variation Induced Electrical Bugs
AuthorMing Gao, Peter Lisherness, Kwang-Ting (Tim) Cheng (University of California, Santa Barbara, U.S.A.)
Pagepp. 273 - 278
AbstractElectrical bugs, such as those caused by crosstalk or power droop, are a growing concern due to shrinking noise margins and increasing variability. This paper introduces COBE, an electrical bug modeling technique which can be used to evaluate the effectiveness of validation tests and DfD (design-for-debug) structures for detecting these errors in post-silicon validation. COBE first uses gate-level timing details to identify critical flip-flops in which the error effects of electrical bugs are more likely to be captured. Based on RTL simulation traces, the functional tests and corresponding cycles in which these critical flip-flops incur transitions are then recorded as the potential times and locations of bug activation. These selected “bit-flips” are then analyzed through functional simulation to determine if they are propagated to an observation point for detection. Compared to the commonly employed random bit-flip injection technique, COBE provides a significantly more accurate electrical bug model by taking into account the likelihood of bug activation, in terms of both location and time, for bit-flip injection. COBE is experimentally evaluated on an Alpha 21264 processor RTL model. In our simulation-based experiments, the results show that the relative effectiveness of the tests predicted by COBE correlates very well with the tests' electrical bug detection capability, with a correlation factor of 0.921. This method is much more accurate than the random bit-flip injection technique, which has a correlation factor of 0.482.

3C-2 (Time: 16:30 - 17:00)
Title(Invited Paper) Diagnosis-assisted Supply Voltage Configuration to Increase Performance Yield of Cell-Based Designs
AuthorJing-Jia Liou, Ying-Yen Chen, Chun-Chia Chen, Chung-Yen Chien, Kuo-Li Wu (National Tsing Hua University, Taiwan)
Pagepp. 279 - 284
AbstractA diagnosis technique based on delay testing has been developed to map the severity of process variation onto each cell/interconnect delay. Given this information, we demonstrate a post-silicon tuning method on row voltage supplies (inside a chip) to restore the performance of failed chips. The method uses the performance map to set voltages by either raising the voltage on cells with worse delays or lowering it on fast cells to save power. On our test cases, we can correct up to 75% of failed chips to pass performance tests, while maintaining less than a 10% increase over nominal power consumption.

3C-3 (Time: 17:00 - 17:30)
Title(Invited Paper) Run-Time Adaptive Performance Compensation using On-chip Sensors
AuthorMasanori Hashimoto (Osaka University & JST, CREST, Japan)
Pagepp. 285 - 290
AbstractThis paper discusses run-time adaptive performance control with on-chip sensors that predict timing errors. The sensors embedded into functional circuits capture delay variations due to not only die-to-die process variation but also random process variation, environmental fluctuation and aging. By compensating circuit performance according to the sensor outputs, we can overcome PVT worst-case design and reduce power dissipation while satisfying circuit performance. We applied the adaptive speed control to subthreshold circuits that are very sensitive to random variation and environmental fluctuation. Measurement results of a 65nm test chip show that the adaptive speed control can compensate PVT variations and improve energy efficiency by up to 46% compared to the worst-case design and operation with guardbanding.

3C-4 (Time: 17:30 - 18:00)
Title(Invited Paper) The Alarms Project: A Hardware/Software Approach to Addressing Parameter Variations
AuthorDavid Brooks (Harvard University, U.S.A.)
Pagep. 291
AbstractParameter variations (process, voltage, and temperature) threaten continued performance scaling of power-constrained computer systems. As designers seek to contain the power consumption of microprocessors through reductions in supply voltage and power-saving techniques such as clock-gating, these systems suffer increasingly large power supply fluctuations due to the finite impedance of the power supply network. These supply fluctuations, referred to as voltage emergencies, must be managed to guarantee correctness. Traditional approaches to address this problem incur high cost or compromise power/performance efficiency. Our research seeks ways to handle these alarm conditions through a combined hardware/software approach, motivated by root cause analysis of voltage emergencies revealing that many of these events are heavily linked to both program control flow and microarchitectural events (cache misses and pipeline flushes). This talk will discuss three aspects of the project: (1) a fail-safe mechanism that provides hardware guaranteed correctness; (2) a voltage emergency predictor that leverages control flow and microarchitectural event information to predict voltage emergencies up to 16 cycles in advance; and (3) a proof-of-concept dynamic compiler implementation that demonstrates that dynamic code transformations can be used to eliminate voltage emergencies from the instruction stream with minimal impact on performance.


Session 3D  Special Session: Recent Advances in Verification and Debug
Time: 16:00 - 18:00 Wednesday, January 26, 2011
Location: Room 416+417
Chair: Chung-Yang (Ric) Huang (National Taiwan University, Taiwan)

3D-1 (Time: 16:00 - 16:24)
Title(Invited Paper) Automatic Formal Verification of Reconfigurable DSPs
AuthorMiroslav N. Velev, Ping Gao (Aries Design Automation, U.S.A.)
Pagepp. 293 - 296
AbstractWe present a method for automatic formal verification of Digital Signal Processors (DSPs) that have VLIW architecture and reconfigurable functional units optimized for accelerating Software Defined Radio (SDR) applications to be used for future space communications by NASA. The formal verification was done with the highly automatic method of Correspondence Checking by exploiting the property of Positive Equality that allows a dramatic simplification of the solution space and many orders of magnitude speedup. The formal verification of a complex reconfigurable DSP took approximately 10 minutes of CPU time on a single workstation, when using our industrial-strength tool flow.

3D-2 (Time: 16:24 - 16:48)
Title(Invited Paper) SoC HW/SW Verification and Validation
AuthorChung-Yang Huang, Yu-Fan Yin, Chih-Jen Hsu (National Taiwan University, Taiwan), Thomas B. Huang, Ting-Mao Chang (InPA Systems, Inc., U.S.A.)
Pagepp. 297 - 300
AbstractIn modern SoC design flows, verification and validation are key components for reducing time-to-market and enhancing product quality. To avoid trade-offs between timing accuracy and simulation speed in RTL simulation and C++/SystemC virtual prototyping, FPGA prototyping has become a better choice in the design flow. However, the time-consuming bring-up procedure and insufficient debugging visibility have impaired its potential strengths in verification and validation. In this paper, we present the technology from InPA Systems, in which four different modes of operation are supported: RTL-FPGA co-simulation, SystemC-FPGA co-emulation, vector prototyping, and in-circuit prototyping. With these different modes of FPGA operation, users can develop and verify their SoCs in different stages of the design flow at different abstraction levels. This methodology efficiently and robustly completes the SoC HW/SW verification and validation flow.

3D-3 (Time: 16:48 - 17:12)
Title(Invited Paper) Utilizing High Level Design Information to Speed up Post-silicon Debugging
AuthorMasahiro Fujita (The University of Tokyo and CREST, Japan)
Pagepp. 301 - 305
AbstractDue to the highly complicated control structures of modern processors as well as ASICs, some logical bugs may easily escape the pre-silicon verification processes and remain in the silicon. Those bugs can only be found after the chip has been fabricated and used in systems. Post-silicon debugging is therefore becoming an essential part of the design flow for complicated and large system designs. This paper summarizes our research activities targeting post-silicon debugging for highly complicated pipelined processors as well as large ASICs. We have been working on the following three topics: 1) translating chip-level error traces to a higher, more abstract level so that more efficient simulation as well as formal analysis become possible, 2) utilizing experience with formal verification and debugging processes for pipelined processors for debugging and in-field rectification of chips, and 3) applying incremental high-level synthesis for efficient in-field rectification of ASIC designs. Our approaches utilize high-level or abstracted design information as much as possible to make things more efficient and effective. In this paper we briefly present the techniques for the first two topics.

3D-4 (Time: 17:12 - 17:36)
Title(Invited Paper) From RTL to Silicon: The Case for Automated Debug
AuthorAndreas Veneris, Brian Keng (University of Toronto, Canada), Sean Safarpour (Vennsa Technologies, Inc., Canada)
Pagepp. 306 - 310
AbstractComputer-aided design tools are continuously improving their scalability and efficiency to mitigate the high cost associated with designing and fabricating modern VLSI systems. A key step in the design process is the root-cause analysis of detected errors. Debugging may take months to close and may introduce high cost and uncertainty, ultimately jeopardizing the chip release date. This study makes the case for debug automation in each part of the design flow (RTL to silicon) to bridge the gap. Contemporary research, challenges and future directions motivate the urgent need for automation to relieve the pain of this highly manual task.

3D-5 (Time: 17:36 - 18:00)
Title(Invited Paper) Multi-Core Parallel Simulation of System-Level Description Languages
AuthorRainer Dömer, Weiwei Chen, Xu Han (University of California, Irvine, U.S.A.), Andreas Gerstlauer (University of Texas at Austin, U.S.A.)
Pagepp. 311 - 316
AbstractThe validation of transaction-level models described in System-level Description Languages (SLDLs) often relies on extensive simulation. However, traditional Discrete Event (DE) simulation of SLDLs is cooperative and cannot utilize the available parallelism in modern multi-core CPU hosts. In this work, we study the SLDL execution semantics of concurrent threads and present a multi-core parallel simulation approach which automatically protects communication between concurrent threads so that parallel simulation on multi-core hosts becomes possible. We demonstrate significant speed-up in simulation time of several system models, including an H.264 video decoder and a JPEG encoder.



Thursday, January 27, 2011

Session 2K  Keynote Session II
Time: 9:00 - 10:00 Thursday, January 27, 2011
Location: Room 503
Chair: Kunihiro Asada (University of Tokyo, Japan)

2K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) Managing Increasing Complexity through Higher-Level of Abstraction: What the Past Has Taught Us about the Future
AuthorAjoy Bose (Atrenta Inc., U.S.A.)
AbstractTime to market and design complexity challenges are well-known; we've all seen the statistics and predictions. A well-defined strategy to address these challenges seems less clear. Design for manufacturability approaches that optimize transistor geometries, "variability aware" physical implementation tools and design reuse strategies abound. While each of these techniques contributes to the solution, they all miss the primary force of design evolution. Over the past 30 years or so, it has been proven time and again that moving design abstraction to the next higher level is required if design technology is to advance. In this keynote presentation, a new EDA model will be presented, examples of past trends will be identified, and an assessment will be made of what these trends mean in the context of the current challenges before us. A snapshot of the future will be presented which will contain some non-intuitive predictions.


Session 4A  Design Automation for Emerging Technologies
Time: 10:20 - 12:20 Thursday, January 27, 2011
Location: Room 411+412
Chairs: Hai Li (Polytechnic Institute of New York University, U.S.A.), Yu Wang (Tsinghua University, China)

4A-1 (Time: 10:20 - 10:50)
TitleVariation-aware Logic Mapping for Crossbar Nano-architectures
AuthorMasoud Zamani (Northeastern University, U.S.A.), *Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 317 - 322
KeywordNano-architectures, Emerging technology, Variation Tolerant, Crossbar Array
AbstractIn this paper, we analyze the effect of variations on mapped designs and propose an efficient mapping method to reduce the effect of variation and increase the reliability of a circuit implemented on crossbar nano-architectures. This method takes advantage of the reconfigurability and abundance of resources in these nano-architectures to tolerate variation and improve reliability. The basic idea is to duplicate crossbar input lines and to swap rows (columns) of a crossbar in order to reduce output dependency and thereby reduce delay variation.

4A-2 (Time: 10:50 - 11:20)
TitleRouting with Graphene Nanoribbons
AuthorTan Yan (Synopsys Inc. & University of Illinois at Urbana-Champaign, U.S.A.), Qiang Ma, Scott Chilstedt, *Martin Wong, Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 323 - 329
KeywordRouting, Graphene Nanoribbon (GNR), nano-technology
AbstractConventional CMOS devices are facing an increasing number of challenges as their feature sizes scale down. Graphene nanoribbon (GNR) based devices are shown to be a promising replacement of traditional CMOS at future technology nodes. However, all previous works on GNRs focus on the device level. In order to integrate these devices into electronic systems, routing becomes a key issue. In this paper, the GNR routing problem is studied for the first time. We formulate the GNR routing problem as a minimum hybrid-cost shortest path problem on a triangular mesh (“hybrid” means that we need to consider both the length and the bending of the routing path). In order to model this hybrid-cost problem, we apply graph expansion and introduce a shortest red-black path problem on the expanded graph. We then propose an algorithm that solves the shortest red-black path problem optimally. This algorithm is then used in a negotiated congestion based routing scheme. Experimental results show that our GNR routing algorithm effectively handles the hybrid cost.
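
To illustrate what a hybrid cost combining path length and bending can look like, the sketch below runs Dijkstra's algorithm on an expanded state space in which each state is a grid cell together with the direction of arrival, so that a change of direction can be charged an extra penalty. This is a generic illustration only: it uses a rectangular grid rather than the triangular mesh of the paper, it is not the red-black path algorithm the abstract describes, and the names `hybrid_cost_path`, `bend_cost` and `blocked` are hypothetical.

```python
import heapq

# Four axis-parallel moves on a rectangular grid (an assumption for this sketch).
DIRS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def hybrid_cost_path(width, height, start, goal, bend_cost, blocked=frozenset()):
    """Minimum of (path length + bend_cost * number of bends) from start to goal.

    Returns None when the goal cannot be reached.
    """
    # State = (cell, index of the direction used to enter it); -1 means "no direction yet".
    heap = [(0, start, -1)]
    best = {(start, -1): 0}
    while heap:
        cost, node, d = heapq.heappop(heap)
        if node == goal:
            return cost  # first time the goal is popped, its cost is minimal
        if cost > best.get((node, d), float("inf")):
            continue  # stale heap entry
        x, y = node
        for nd, (dx, dy) in enumerate(DIRS):
            nxt = (x + dx, y + dy)
            if not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if nxt in blocked:
                continue
            # Unit length per step, plus a penalty whenever the direction changes.
            step = 1 + (bend_cost if d not in (-1, nd) else 0)
            ncost = cost + step
            if ncost < best.get((nxt, nd), float("inf")):
                best[(nxt, nd)] = ncost
                heapq.heappush(heap, (ncost, nxt, nd))
    return None


if __name__ == "__main__":
    # Corner to corner on a 3x3 grid: 4 unit steps and one bend.
    print(hybrid_cost_path(3, 3, (0, 0), (2, 2), bend_cost=10))
```

Tracking the arrival direction in the state is what lets a plain shortest-path search account for bends; without it, two paths reaching the same cell with different headings would be indistinguishable.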

4A-3 (Time: 11:20 - 11:50)
TitleILP-Based Inter-Die Routing for 3D ICs
AuthorChia-Jen Chang, Pao-Jen Huang, *Tai-Chen Chen, Chien-Nan Jimmy Liu (Department of Electrical Engineering, National Central University, Taiwan)
Pagepp. 330 - 335
Keyword3D-IC, Routing, RDL, Micro-Bump, TSV
AbstractThe 3D IC is an emerging technology. The primary concern in 3D-IC routing is the interface between dies. To handle the connections across this interface, inter-die routing is adopted, which uses micro-bumps and two single-layer RDLs (Re-Distribution Layers) to connect adjacent dies. In this paper, we present an inter-die routing algorithm for 3D ICs with a pre-defined netlist. Our algorithm is based on integer linear programming (ILP) and adopts a two-stage technique of micro-bump assignment followed by non-regular RDL routing. First, the micro-bump assignment selects suitable micro-bumps for the pre-defined netlist such that no crossing problem exists inside the bounding boxes of each net. After the micro-bump assignment, the netlist is divided into two sub-netlists, one for the upper RDL and the other for the lower RDL. Second, the non-regular RDL routing determines minimum and non-crossing global paths for the sub-netlists in the upper and lower RDLs individually. Experimental results show that our approach can obtain optimal wirelength and achieve 100% routability within reasonable CPU times.

4A-4 (Time: 11:50 - 12:20)
TitleCELONCEL: Effective Design Technique for 3-D Monolithic Integration targeting High Performance Integrated Circuits
Author*Shashikanth Bobba (Swiss Institute of Technology Lausanne (EPFL), Switzerland), Ashutosh Chakraborty (University of Texas at Austin, U.S.A.), Olivier Thomas, Perrine Batude, Thomas Ernst, Olivier Faynot (LETI, France), David Z. Pan (University of Texas at Austin, U.S.A.), Giovanni De Micheli (Swiss Institute of Technology Lausanne (EPFL), Switzerland)
Pagepp. 336 - 343
Keyword3D monolithic integration, CAD tool, Physical design, Placement, Cell design
Abstract3-D monolithic integration (3DMI), also termed sequential integration, is a potential technology for future gigascale circuits. Since the device layers are processed in sequential order, the size of the vertical contacts is similar to that of traditional contacts, unlike in the case of parallel 3-D integration with through-silicon vias (TSVs). Given the advantage of such small contacts, 3DMI enables manufacturing multiple active layers very close to each other. In this work we propose two different strategies for stacking standard cells in 3-D without breaking the regularity of the conventional design flow: a) vertical stacking of diffusion areas (Intra-Cell stacking), which supports complete reuse of 2-D physical design tools, and b) vertical stacking of cells over others (Cell-on-Cell stacking). A placement tool (CELONCEL-placer) targeting the Cell-on-Cell placement problem is proposed to allow high quality 3-D layout generation. Our experiments demonstrate the effectiveness of the CELONCEL technique, yielding an area gain of 37.5%, a 15.51% reduction in wirelength, and a 13.49% improvement in overall delay compared with a 2-D case when benchmarked on an interconnect-dominated low-density parity-check (LDPC) decoder at the 45nm technology node.


Session 4B  Novel Network-on-Chip Architecture Design
Time: 10:20 - 12:20 Thursday, January 27, 2011
Location: Room 413
Chairs: Yoshinori Takeuchi (Osaka University, Japan), Hao Yu (Nanyang Technological University, Singapore)

4B-1 (Time: 10:20 - 10:50)
TitleOPAL: A Multi-Layer Hybrid Photonic NoC for 3D ICs
Author*Sudeep Pasricha, Shirish Bahirat (Colorado State University, U.S.A.)
Pagepp. 345 - 350
Keywordphotonic interconnects, networks on chip, chip multiprocessors, 3D ICs
AbstractThree-dimensional integrated circuits (3D ICs) offer a significant opportunity to enhance the performance of emerging chip multiprocessors (CMPs) using high density stacked device integration and shorter through silicon via (TSV) interconnects that can alleviate some of the problems associated with interconnect scaling. In this paper we propose and explore a novel multi-layer hybrid photonic NoC fabric (OPAL) for 3D ICs. Our proposed hybrid photonic 3D NoC combines low cost photonic rings on multiple photonic layers with a 3D mesh NoC in active layers to significantly reduce on-chip communication power dissipation and packet latency. OPAL also supports dynamic reconfiguration to adapt to changing runtime traffic requirements, and uncover further opportunities for reduction in power dissipation. Our experimental results and comparisons with traditional 2D NoCs, 3D NoCs, and previously proposed hybrid photonic NoCs (photonic Torus, Corona, Firefly) indicate a strong motivation for considering OPAL for future 3D ICs as it can provide orders of magnitude reduction in power dissipation and packet latencies.

4B-2 (Time: 10:50 - 11:20)
TitleEnabling Quality-of-Service in Nanophotonic Network-on-Chip
Author*Jin Ouyang, Yuan Xie (The Pennsylvania State University, U.S.A.)
Pagepp. 351 - 356
Keywordoptical interconnect, network-on-chip, quality-of-service
AbstractWith the recent development in silicon photonics, researchers have developed optical network-on-chip (NoC) architectures that achieve both low latency and low power, which are beneficial for future large scale chip-multiprocessors (CMP). However, none of the existing optical NoC architectures has quality-of-service (QoS) support, which is a desired feature of an efficient interconnection network. QoS support provides contending flows with differentiated bandwidths according to their priorities (or weights), which is crucial to account for application-specific communication patterns and provides bandwidth guarantees for real-time applications. In this paper, we propose a quality-of-service framework for optical network-on-chip based on frame-based arbitration. We show that the proposed approach achieves excellent differentiated bandwidth allocation with only simple hardware additions and low performance overheads. To the best of our knowledge, this is the first work that provides QoS support for optical network-on-chip.
Slides

4B-3 (Time: 11:20 - 11:50)
TitleVertical Interconnects Squeezing in Symmetric 3D Mesh Network-on-Chip
Author*Cheng Liu, Lei Zhang, Yinhe Han, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 357 - 362
Keyword3D, Network-on-Chip, Mesh, Through silicon via
AbstractThree-dimensional (3D) integration and Network-on-Chip (NoC) are both proposed to tackle the on-chip interconnect scaling problems, and extensive research efforts have been devoted to the design challenges of combining both. Through-silicon via (TSV) is considered to be the most promising technology for 3D integration; however, TSV pads distributed across planar layers occupy significant chip area and result in routing congestion. In addition, the yield of 3D integrated circuits decreases dramatically as the number of TSVs increases. For symmetric 3D mesh NoC, we observe that the TSVs’ utilization is quite low and adjacent routers rarely transmit packets via their vertical channels (i.e. TSVs) at the same time. Based on this observation, we propose a novel TSV squeezing scheme to share TSVs among neighboring routers in a time-division multiplexing mode, which greatly improves the utilization of TSVs. Experimental results show that the proposed method can save significant TSV footprint with negligible performance overhead.
Slides

4B-4 (Time: 11:50 - 12:20)
TitlePower-efficient Tree-based Multicast Support for Networks-on-Chip
Author*Wenmin Hu (School of Computer, National University of Defense Technology, China), Zhonghai Lu, Axel Jantsch (Royal Institute of Technology, Sweden), Hengzhu Liu (School of Computer, National University of Defense Technology, China)
Pagepp. 363 - 368
KeywordNoC, multicast, power-efficient
AbstractIn this paper, novel hardware support for multicast on mesh Networks-on-Chip (NoC) is proposed. It supports multicast routing on any shape of tree-based path. Two power-efficient tree-based multicast routing algorithms, Optimized tree (OPT) and Left-XY-Right-Optimized tree (LXYROPT), are also proposed. Compared with the baseline, OPT and LXYROPT achieve a remarkable improvement in both latency and throughput while reducing power consumption.
Slides


Session 4C  Architecture Design and Reliability
Time: 10:20 - 12:20 Thursday, January 27, 2011
Location: Room 414+415
Chairs: Shigeru Yamashita (Ritsumeikan University, Japan), Rolf Drechsler (University of Bremen, Germany)

4C-1 (Time: 10:20 - 10:50)
TitleArea-Efficient FPGA Logic Elements: Architecture and Synthesis
Author*Jason Anderson (University of Toronto, Canada), Qiang Wang (Xilinx, Inc., U.S.A.)
Pagepp. 369 - 375
KeywordFPGAs, synthesis, architecture, logic density, technology mapping
AbstractWe consider architecture and synthesis techniques for FPGA logic elements (function generators) and show that the LUT-based logic elements in modern commercial FPGAs are over-engineered. Circuits mapped into traditional LUT-based logic elements have speeds that can be achieved by alternative logic elements that consume considerably less silicon area. We introduce the concept of a trimming input to a logic function, which is an input to a K-variable function about which Shannon decomposition produces a cofactor having fewer than K−1 variables. We show that trimming inputs occur frequently in circuits and we propose low-cost asymmetric FPGA logic element architectures that leverage the trimming input concept, as well as some other properties of a circuit’s AND-inverter graph (AIG) functional representation. We describe synthesis techniques for the proposed architectures that combine a standard cut-based FPGA technology mapping algorithm with two straightforward procedures: 1) Shannon decomposition, and 2) finding non-inverting paths in the circuit’s AIG. The proposed architectures exhibit improved logic density versus traditional LUT-based architectures with minimal impact on circuit speed.
Slides
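
The trimming-input definition in the abstract of 4C-1 can be checked by brute force on small functions. The sketch below enumerates truth-table rows to count how many variables each Shannon cofactor actually depends on, and reports an input as trimming when at least one of its two cofactors depends on fewer than K−1 variables. Reading "produces a cofactor" as "at least one cofactor" is an assumption, as are the helper names `support_size` and `trimming_inputs`; this is not the synthesis procedure from the paper.

```python
from itertools import product


def support_size(f, k):
    """Number of inputs that f (a function of a k-tuple of 0/1 values) depends on."""
    count = 0
    for i in range(k):
        for bits in product((0, 1), repeat=k):
            if bits[i] == 0:
                flipped = bits[:i] + (1,) + bits[i + 1:]
                if f(bits) != f(flipped):
                    count += 1  # toggling input i changes the output somewhere
                    break
    return count


def trimming_inputs(f, k):
    """Indices i for which a Shannon cofactor of f about input i
    depends on fewer than k - 1 variables."""
    result = []
    for i in range(k):
        for value in (0, 1):
            # Cofactor: f with input i tied to `value`.
            def cofactor(bits, i=i, value=value):
                return f(bits[:i] + (value,) + bits[i + 1:])
            if support_size(cofactor, k) < k - 1:
                result.append(i)
                break
    return result


if __name__ == "__main__":
    # f(a, b, c) = a AND (b XOR c): tying a to 0 leaves a constant,
    # so a is a trimming input while b and c are not.
    f = lambda bits: bits[0] & (bits[1] ^ bits[2])
    print(trimming_inputs(f, 3))
```

The exhaustive enumeration is exponential in K and is only meant for the small K of a single logic element.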

4C-2 (Time: 10:50 - 11:20)
TitleSelectively Patterned Masks: Structured ASIC with Asymptotically ASIC Performance
Author*Donkyu Baek, Insup Shin, Seungwhun Paik, Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea)
Pagepp. 376 - 381
Keywordstructured ASIC, selectively patterned masks
AbstractWe propose a new lithography method called selectively patterned masks (SPM). It exploits special masks called masking masks and a double exposure technique to allow more than one type of tile to be patterned on the same wafer, thereby relaxing the regularity of structured ASIC. We propose a new structured ASIC based on SPM and assess it using 45-nm technology; experimental results show substantial improvement over conventional structured ASIC, achieving 1.2x delay and 2.0x area relative to an ASIC design.
Slides

4C-3 (Time: 11:20 - 11:50)
TitleA Robust ECO Engine by Resource-Constraint-Aware Technology Mapping and Incremental Routing Optimization
Author*Shao-Lun Huang, Chi-An Wu, Kai-Fu Tang, Chang-Hong Hsu, Chung-Yang (Ric) Huang (National Taiwan University, Taiwan)
Pagepp. 382 - 387
KeywordECO, spare cell, technology mapping
AbstractECO re-mapping is a key step in functional ECO tools. It implements a given patch function on a layout database with a limited spare cell resource. Previous ECO re-mapping algorithms are based on existing technology mappers. However, these mappers are not designed to consider the resource limitation and thus the corresponding ECO results are generally not good enough, or even become much worse when the spare cells are sparse. In this paper, we propose a new solution for ECO re-mapping. It includes a robust resource-constraint-aware technology mapper and a fast incremental router for wire-length optimization. Moreover, we adopt a Pseudo-Boolean solver to search for feasible solutions when the spare cells are sparse. Our experimental results show that our ECO engine can outperform the previous tool in both runtime and routing costs. We also demonstrate the robustness of our tool by performing ECOs under various spare cell limitations.
Slides

4C-4 (Time: 11:50 - 12:20)
TitleSETmap: A Soft Error Tolerant Mapping Algorithm for FPGA Designs with Low Power
AuthorChi-Chen Peng, Chen Dong, *Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 388 - 393
KeywordTechnology Mapping, FPGA, SEU, Low Power, Soft Error
AbstractField programmable gate arrays (FPGAs) are widely used in VLSI applications due to their flexibility to implement logical functions, fast total turn-around time and low non-recurring engineering cost. SRAM-based FPGAs are the most popular FPGAs in the market. However, as process technologies advance to the nanometer-scale regime, the issue of reliability of devices becomes critical. Soft errors are increasingly becoming a reliability concern because of the shrinking process dimensions. In this paper we study the technology mapping problem for FPGA circuits to reduce the occurrence of soft errors under the chip performance constraint and power reduction. Compared to two power-optimization mapping algorithms, SVmap [17] and Emap [15] respectively, we reduce the soft error rate by 40.6% with a 2.22% power overhead and 48.0% with a 2.18% power overhead using 6-LUTs.


Session 4D  Special Session: Advanced Patterning and DFM for Nanolithography beyond 22nm
Time: 10:20 - 12:20 Thursday, January 27, 2011
Location: Room 416+417
Organizer: David Z. Pan (University of Texas at Austin, U.S.A.)

4D-1 (Time: 10:20 - 10:50)
Title(Invited Paper) All-out Fight against Yield Losses by Design-manufacturing Collaboration in Nano-lithography Era
AuthorSoichi Inoue, Sachiko Kobayashi (Toshiba, Japan)
Pagepp. 395 - 401
AbstractThe concept of design-manufacturing collaboration for the nano-lithography era has been clarified. A novel design-manufacturing system has been proposed in which a manufacturing tolerance that properly reflects design intention can be allocated to the layout. With this system, one can explicitly assign the “weak portion” on the layout and control the process to reduce the manufacturing burden and obtain higher yield. More specifically, the extraction of electrically critical portions and their conversion to manufacturing tolerances has been demonstrated. The tolerance has been applied to reduce the computational burden of mask data preparation. In addition, a yield model-based layout scoring system has also been suggested to be remarkably significant. One can check the layout and modify it so as not to lose yield. Creation of a yield function, layout scoring, and layout modification based upon the yield model have been demonstrated.

4D-2 (Time: 10:50 - 11:20)
Title(Invited Paper) EUV Lithography: Prospects and Challenges
AuthorSam Sivakumar (Intel Corporation, U.S.A.)
Pagep. 402
AbstractIntegrated circuit scaling as codified in Moore’s Law has been enabled through the tremendous advances in lithographic patterning technology over multiple process generations. Optical lithography has been the mainstay of patterning technology to date. Its imminent demise has been oft proclaimed over the years but clever engineering has consistently been able to extend it through many lens size and wavelength changes. NA has increased steadily from about 0.3 to 1.35 today with improvements in lens design and the use of immersion lithography. Simultaneously, the illumination wavelength has been reduced from 436nm about 20 years ago to 193nm for state-of-the-art scanners today. However, this approach has reached its limits. The 22nm technology node, targeted for HVM in 2011, represents the last instance of using standard 1.35NA immersion lithography-based patterning for the critical layers with a k1 hovering right around the 0.3 value that is considered acceptable for manufacturability. For the 14nm node with a HVM date of 2013, one has to resort to double patterning to achieve a manufacturable k1. For the 10nm technology node with a 2015 HVM date, double patterning will also be insufficient. While further ArF extension schemes are being considered, the industry is working towards lowering the wavelength from 193nm to Extreme Ultraviolet Lithography with a λ of 13.5nm. EUV offers the prospect of operating at significantly higher k1 and as a consequence, much simpler design rules and potentially simpler OPC. However, the technical challenges are formidable. EUV lithography requires the re-engineering of every subsystem in the optical path - source, collector and projection optics, reticles and photoresists. A huge industry-wide effort is under way to solve these technical issues and bring 13.5nm EUV lithography to production. Two main approaches are being considered for EUV sources - Laser Produced Plasma (LPP) and Discharge Produced Plasma (DPP). 
Both approaches appear to be heading towards production and it remains to be seen if one approach is more scalable to higher power levels. Currently however, neither approach is close to the power levels required to deliver runrates that will have a reasonable Cost of Ownership (COO). Clearly, a lot of development is ahead to make this happen. Photoresists have also seen a significant amount of technical development, primarily using small field Micro Exposure Tools (MET). Photoresist companies are working on developing the chemical platforms needed for EUV photoresists. While much progress has been made on photospeed, resolution and linewidth roughness, further improvements are required to meet the needs of the 14nm and 10nm process nodes. Since EUV employs reflective optics, EUV reticles are reflective as well and this poses several challenges. Apart from the obvious complexities of EUV reticle manufacturing, defectivity is a major concern, both from the standpoint of making defect-free masks as well as from the requirement of detecting the defects and repairing them. A significant industry-wide effort is being driven both among individual companies and through consortia like SEMATECH to develop both the manufacturing techniques required to make high-quality masks and the inspection and repair capabilities needed. Probably the most complex technical challenge and one largely untested in an HVM sense is the scanner itself. The current state of the art is the ASML Alpha Demo Tool (ADT) currently in use at SEMATECH and IMEC. This 0.25NA tool has a low runrate and limited technical capabilities but can print full fields and has been a valuable tool in the early demonstration of integrated device and circuit fabrication using EUV lithography. Working SRAM cells and other circuits have been demonstrated with very promising results. 
The first development-quality EUV scanners are targeted to ship to end users in 2011, while the HVM versions with high targeted runrates and low targeted COO are slated for delivery beginning in mid-2012. The delivery of these tools at the end users’ fab and their subsequent integration into the process flow will pose the greatest challenge and is expected to require a significant outlay of engineering effort and resources in the next 2 years. EUV presents its own challenges in terms of non-idealities that would need to be quantified and corrected for. While conventional OPC may be minimal, EUV has other sources of variability including flare and mask shadowing that would need to be compensated for. Moreover, the likelihood of defects on EUV masks brings up the possibility of pattern shifting to place the defects in benign areas of the reticle. All of these new challenges require OPC, synthesis or other data manipulation methodologies to be developed for EUV. This paper will attempt to highlight the key technical challenges of EUV lithography and where the industry will need to focus its efforts over the next 2 years to make EUV manufacturing successful and cost-effective.
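
The k1 argument in this abstract can be illustrated with the standard Rayleigh scaling relation, half-pitch = k1 · λ / NA. The abstract does not state this formula, so its use here is an assumption. The sketch below plugs in the wavelengths and numerical apertures quoted above (193 nm at 1.35 NA, and 13.5 nm at the 0.25 NA of the Alpha Demo Tool). The function names are hypothetical, and the resulting half-pitch is an illustrative number, not a node dimension taken from the abstract.

```python
def min_half_pitch(k1, wavelength_nm, na):
    """Rayleigh scaling: printable half-pitch in nm for a given k1, wavelength and NA."""
    return k1 * wavelength_nm / na

def k1_factor(half_pitch_nm, wavelength_nm, na):
    """Inverse relation: the k1 needed to print a given half-pitch."""
    return half_pitch_nm * na / wavelength_nm

# ArF immersion at the k1 of about 0.3 that the abstract calls acceptable.
arf_half_pitch = min_half_pitch(0.3, 193.0, 1.35)

# The same half-pitch printed with EUV (13.5 nm) on a 0.25 NA tool needs a much
# larger, i.e. more relaxed, k1, which is the "significantly higher k1" point.
euv_k1 = k1_factor(arf_half_pitch, 13.5, 0.25)
```

Under these assumptions the EUV k1 comes out more than twice the ArF value, which is consistent with the abstract's expectation of simpler design rules and OPC.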

4D-3 (Time: 11:20 - 11:50)
Title(Invited Paper) Future Electron-Beam Lithography and Implications on Design and CAD Tools
AuthorJack J.H. Chen, Faruk Krecinic, Jen-Hom Chen, Raymond P.S. Chen, Burn J. Lin (Taiwan Semiconductor Manufacturing Company, Taiwan)
Pagepp. 403 - 404
AbstractThe steeply increasing price and difficulty of masks make the mask-based optical lithography, such as ArF immersion lithography and extreme ultra-violet lithography (EUVL), unaffordable when going beyond the 32-nm half-pitch (HP) node[1]. Electron beam direct writing (EBDW), so called maskless lithography (ML2), provides an ultimate resolution without jeopardy from masks, but the extremely low productivity of the traditional single beam systems made it very laborious for mass manufacturing after over 3 decades of development. Although electron beam lithography has been long used for mask writing, it is yet very slow and typically takes from hours to days to write a complete 6-inch high-end mask. Direct writing a 300-mm wafer definitely would take much longer. Considering production efficiency in the cleanroom, the throughput of lithography tools should be in the order of 10 wafers per hour (WPH) per square meter as compared to that of an ArF scanner. To achieve such a throughput per e-beam column requires an improvement of more than 3 orders of magnitude. Increasing the beam current in the conventional single beam system would induce the space charge effect and thus is not a solution. Several groups [2][3][4][5] have proposed different multiple electron beam maskless lithography (MEBML2) approaches, by multiplying either Gaussian beams, variable shape beams or by using cell projections, to increase the throughput. The maturing MEMS technology and electronic control technology enable precise control of more than tens of thousands or even millions of electron beamlets, writing in parallel. Without the mask constraint, the exposure can be made by continuously scanning across the entire wafer diameter as long as the ultra-high speed data rate can be supported. Hence a much slower scan speed is required and therefore a small tool footprint is achievable. 
A MAPPER Pre-Alpha Tool, composed of a 110-beam 5-keV column and a 300-mm wafer stage within a vacuum chamber of 1.3 m x 1.3 m footprint, has been installed and is operational for process development in the advanced Giga-Fab cleanroom environment. By sending the pre-treated optical data to the corresponding photodiode of each blanker, each beam writes its own features independently in raster scan mode. Resolution beyond 30-nm HP for both C/H and L/S by using chemically amplified resist (CAR) has been demonstrated. Applying proper E-beam proximity corrections (EPC), a 20-nm node test circuit layout has been successfully patterned. The tool will be upgraded with a new Electron-Optics column containing 13,000 beamlets and each beamlet projecting 7x7 sub-beams to achieve 10 WPH of 32-nm HP node wafers by a single chamber. The achievement of high productivity MEBML2 needs not only the beams, but also the data preparation. For a 10-WPH MEBML2 tool, one wafer exposure is done in 6 minutes. However, the pre-treatments, for example logic operation and EPC, of the huge reticle field data file typically take a few days in present-day mask writing and therefore drastic speed enhancement is required to really gain the benefits of ML2 in cycle time and flexibility. In MAPPER’s writing approach, the circuit layout in GDSII or OASIS format at sub-nm addressing grid, whose file size can be up to hundreds of Giga-Bytes after EPC, has to be pre-rasterized to a bitmap writing format of 3.5-nm grid; the file size of the simple 0 and 1 bitmap for a full 26 mm x 33 mm reticle field becomes more than 10 Tera-Bytes (TB). Real-time data decompression in the data path of the tool is designed to avoid storage and transportation of the extremely huge files. In this presentation, several suggestions regarding design, EPC and CAD tools to best fit the nature and operation of MEBML2 in high volume manufacturing are made. 
Because of high resolution by the e-beam, the restricted design rules due to resolution limit of the optical lithography, especially those related to the double patterning techniques, can be removed. By considering the speed of data treatment, required storage, and computing resources inside the data path, some minor rules like on-pixel design may be recommended. Although contour-based EPC has been demonstrated to meet CD requirements [6][7], hybrid EPC accompanied with dose modulation has been proposed to further enhance the imaging contrast. Even though we will optimize the tool precision to eliminate most of the beam-to-beam CD and overlay errors, it is nevertheless safer to propose some methods to avoid the BtB stitching on critical devices.

4D-4 (Time: 11:50 - 12:20)
Title(Invited Paper) Exploration of VLSI CAD Researches for Early Design Rule Evaluation
AuthorChul-Hong Park (Samsung Electronics, Republic of Korea), David Z. Pan (University of Texas at Austin, U.S.A.), Kevin Lucas (Synopsys, U.S.A.)
Pagepp. 405 - 406
AbstractDesign rule has been a primary metric to link design and technology, and is likely to be considered as IC manufacturer’s role for the generation due to its empirical and unsystematic nature. Disruptive and radical changes in terms of layout style, lithography and device in the next decade require design rule evaluation at an early development stage. In this paper, we explore VLSI CAD research for early and systematic evaluation of design rules, which will be a key technique for enhancing competitiveness in the IC market.


Session 5A  System-Level Simulation
Time: 13:40 - 15:40 Thursday, January 27, 2011
Location: Room 411+412
Chairs: Nagisa Ishiura (Kwansei Gakuin University, Japan), Bo-Cheng Charles Lai (National Chiao Tung University, Taiwan)

5A-1 (Time: 13:40 - 14:10)
TitleHandling Dynamic Frequency Changes in Statically Scheduled Cycle-Accurate Simulation
Author*Marius Gligor, Frédéric Pétrot (TIMA Laboratory, CNRS/INP Grenoble/UJF, France)
Pagepp. 407 - 412
Keywordcycle accurate simulation, simulation acceleration, static scheduling, dynamic frequencies change
AbstractAlthough high level simulation models are being increasingly used for digital electronic system validation, cycle accuracy is still required in some cases, such as hardware protocol validation or accurate power/energy estimation. Cycle-accurate simulation is however slow and acceleration approaches make the assumption of a single constant clock, which is not true anymore with the generalization of dynamic voltage and frequency scaling techniques. Fast cycle-accurate simulators supporting several clocks whose frequencies can change at run time are thus needed. This paper presents two algorithms we designed for this purpose and details their properties and implementations.
Slides

5A-2 (Time: 14:10 - 14:40)
TitleCoarse-grained Simulation Method for Performance Evaluation of a Shared Memory System
Author*Ryo Kawahara, Kenta Nakamura, Kouichi Ono, Takeo Nakada (IBM Research - Tokyo, Japan), Yoshifumi Sakamoto (Global Business Services, IBM Japan, Japan)
Pagepp. 413 - 418
KeywordSimulation, performance, UML, embedded system, multi-processors
AbstractWe propose a coarse-grained simulation method which takes the effect of memory access contention into account. The method can be used for the evaluation of the execution time of an application program during the system architecture design in an early phase of development. In this phase, information about memory access timings is usually not available. Our method uses a statistical approximation of the memory access timings to estimate their influences on the execution time. We report a preliminary verification of our simulation method by comparing it with an experimental result from an image processing application on a dual-core PC. We find an error of the order of 3 percent on the execution time.
Slides

5A-3 (Time: 14:40 - 15:10)
TitleT-SPaCS EA Two-Level Single-Pass Cache Simulation Methodology
AuthorWei Zang, *Ann Gordon-Ross (University of Florida, U.S.A.)
Pagepp. 419 - 424
KeywordConfigurable cache, cache hierarchy, cache optimization, low energy, embedded systems
AbstractThe cache hierarchy's large contribution to total microprocessor system power makes caches a good optimization candidate. We propose a single-pass trace-driven cache simulation methodology - T-SPaCS - for a two-level exclusive instruction cache hierarchy. Instead of storing and simulating numerous stacks repeatedly as in direct adaptation of a conventional trace-driven cache simulation to two level caches, T-SPaCS simulates both the level one and level two caches simultaneously using one stack. Experimental results show T-SPaCS efficiently and accurately determines the optimal cache configuration (lowest energy).
Slides
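
For readers unfamiliar with stack-based single-pass cache simulation, the sketch below shows the classical one-level idea that this kind of work builds on: one pass over the trace records each access's LRU stack depth, and the miss count of every fully associative LRU cache size is then read off from those depths. It is a simplified illustration under those assumptions (one level, fully associative, LRU). It is not T-SPaCS itself, whose two-level exclusive-hierarchy handling is not described in the abstract, and the function names and trace are hypothetical.

```python
def lru_stack_distances(trace):
    """One pass over the trace; returns the LRU stack depth of each access.

    Depth 0 means the block was the most recently used one.
    None marks a first-time (cold) access.
    """
    stack = []      # index 0 holds the most recently used block
    distances = []
    for block in trace:
        if block in stack:
            depth = stack.index(block)
            distances.append(depth)
            stack.pop(depth)
        else:
            distances.append(None)
        stack.insert(0, block)
    return distances

def misses_for_sizes(trace, sizes):
    """Miss counts for fully associative LRU caches of several sizes (in blocks).

    An access hits in a cache of `size` blocks exactly when its depth < size,
    so a single set of distances answers every size at once.
    """
    distances = lru_stack_distances(trace)
    return {
        size: sum(1 for d in distances if d is None or d >= size)
        for size in sizes
    }

# Hypothetical block-address trace.
trace = ["A", "B", "C", "A", "B", "D", "A"]
miss_table = misses_for_sizes(trace, [1, 2, 3, 4])
```

The list-based stack keeps the sketch short; a real simulator would use a faster structure and also model set associativity and line size.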

5A-4 (Time: 15:10 - 15:40)
TitleFast Data-Cache Modeling for Native Co-Simulation
Author*Héctor Posadas, Luis Diaz, Eugenio Villar (University of Cantabria, Spain)
Pagepp. 425 - 430
KeywordSystem-Level, Cache modeling, Native co-simulation, Embedded SW
AbstractEfficient design of large multiprocessor embedded systems requires fast, early performance modeling techniques. Native co-simulation has been proposed as a fast solution for evaluating systems in early design steps. Annotated SW execution can be performed in conjunction with a virtual model of the HW platform to generate a complete system simulation. To obtain sufficiently accurate performance estimations, the effect of all the system components, such as processor caches, must be considered. ISS-based cache models slow down the simulation speed, greatly reducing the efficiency of native-based co-simulations. To solve the problem, cache modeling techniques for fast native co-simulation have been proposed, but only considering instruction-caches. In this paper, a fast technique for data-cache modeling is presented, together with the instrumentation required for its application in native execution. The model allows the designer to obtain cache hit/miss rate estimations with a speed-up of two orders of magnitude with respect to ISS. Miss rate estimation error remains below 5% for representative examples.
Slides


Session 5B  Resilient and Thermal-Aware NoC Design
Time: 13:40 - 15:40 Thursday, January 27, 2011
Location: Room 413
Chairs: Michihiro Koibuchi (National Institute of Informatics, Japan), Pao-Ann Hsiung (National Chung Cheng University, Taiwan)

5B-1 (Time: 13:40 - 14:10)
TitleOn the Design and Analysis of Fault Tolerant NoC Architecture Using Spare Routers
Author*Yung-Chang Chang (Industrial Technology Research Institute, Taiwan), Ching-Te Chiu (National Tsing Hua University, Taiwan), Shih-Yin Lin, Chung-Kai Liu (Industrial Technology Research Institute, Taiwan)
Pagepp. 431 - 436
Keywordfault tolerance, network-on-chip, router-level redundancy
AbstractThe aggressive advent in VLSI manufacturing technology has made dramatic impacts on the dependability of devices and interconnects. In the modern manycore system, mesh based Networks-on-Chip (NoC) is widely adopted as on chip communication infrastructure. It is critical to provide an effective fault tolerance scheme on mesh based NoC. A faulty router or broken link isolates a well functional processing element (PE). Also, a set of faulty routers form faulty regions which may break down the whole design. To address these issues, we propose an innovative router-level fault tolerance scheme with spare routers which is different from the traditional microarchitecture-level approach. The spare routers not only provide redundancies but also diversify connection paths between adjacent routers. To exploit these valuable resources on fault tolerant capabilities, two configuration algorithms are demonstrated. One is shift-and-replace-allocation (SARA) and the other is defect-awareness-path-allocation (DAPA) that takes advantage of path diversity in our architecture. The proposed design is transparent to any routing algorithm since the output topology is consistent to the original mesh. Experimental results show that our scheme has remarkable improvements on fault tolerant metrics including reliability, mean time to failure (MTTF), and yield. In addition, the performance of spare router increases with the growth of NoC size but the relative connection cost decreases at the same time. This rare and valuable characteristic makes our solution suitable for large scale NoC design.

5B-2 (Time: 14:10 - 14:40)
TitleA Resilient On-chip Router Design Through Data Path Salvaging
Author*Cheng Liu, Lei Zhang, Yinhe Han, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 437 - 442
KeywordNetwork-on-chip, fault tolerance, data path, salvaging, slicing
AbstractVery large scale integrated circuits typically employ Network-on-Chip (NoC) as the backbone for on-chip communication. As technology advances into the nanometer regime, NoCs become more and more susceptible to permanent faults such as manufacturing defects and device wear-out, which hinder the correct operation of the entire system. Therefore, effective fault-tolerant techniques are essential to improve the reliability of NoCs. Prior work mainly focuses on introducing redundancies, which cannot achieve satisfactory reliability and also involve large hardware overhead, especially for data path components. In this paper, we propose fine-grained data path salvaging techniques by splitting data path components, i.e., links, input buffers and crossbar, into slices, instead of introducing redundancies. As long as there is one fault-free slice for each component, the router can be functional. Experimental results show that the proposed solution achieves quite high reliability with graceful performance degradation even under high fault rate.
Slides

5B-3 (Time: 14:40 - 15:10)
TitleNS-FTR: A Fault Tolerant Routing Scheme for Networks on Chip with Permanent and Runtime Intermittent Faults
Author*Sudeep Pasricha, Yong Zou (Colorado State University, U.S.A.)
Pagepp. 443 - 448
Keywordfault tolerant routing, NoC, turn model, permanent faults, intermittent faults
AbstractIn sub-65nm CMOS technologies, interconnection networks-on-chip (NoC) will increasingly be susceptible to design time permanent faults and runtime intermittent faults, which can cause system failure. To overcome these faults, NoC routing schemes can be enhanced by adding fault tolerance capabilities, so that they can adapt communication flows to follow fault-free paths. A majority of existing fault tolerant routing algorithms are based on the turn model approach due to its simplicity and inherent freedom from deadlock. However, these turn model based algorithms are either too restrictive in the choice of paths that flits can traverse, or are tailored to work efficiently only on very specific fault distribution patterns. In this paper, we propose a novel fault tolerant routing scheme (NS-FTR) for NoC architectures that combines the North-last and South-last turn models to create a robust hybrid NoC routing scheme. The proposed scheme is shown to have a low implementation overhead and adapt to design time and runtime faults better than existing turn model, stochastic random walk, and dual virtual channel based routing schemes.
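
As background for the two turn models named in this abstract, the sketch below encodes the usual definitions on a 2D mesh: North-last forbids turning away once a packet is heading north, and South-last forbids turning away once it is heading south. The abstract does not say how NS-FTR combines the two or how it reacts to faults, so that logic is deliberately left out. The constant names, `path_allowed`, and the direction encoding are hypothetical.

```python
# Forbidden (current heading, next heading) pairs for each base turn model.
NORTH_LAST_FORBIDDEN = {("N", "E"), ("N", "W")}
SOUTH_LAST_FORBIDDEN = {("S", "E"), ("S", "W")}

def path_allowed(hops, forbidden):
    """True if no consecutive pair of hops makes a forbidden turn.

    hops: sequence of headings, each one of "N", "S", "E", "W".
    """
    return all((a, b) not in forbidden for a, b in zip(hops, hops[1:]))

# Going east first and north afterwards respects North-last ...
east_then_north = ["E", "E", "N", "N"]
# ... while going north first and then east violates it, yet is fine under South-last.
north_then_east = ["N", "E"]

ok_north_last = path_allowed(east_then_north, NORTH_LAST_FORBIDDEN)
bad_north_last = path_allowed(north_then_east, NORTH_LAST_FORBIDDEN)
ok_south_last = path_allowed(north_then_east, SOUTH_LAST_FORBIDDEN)
```

The last two lines show why the models complement each other: a path rejected by one can be legal under the other, which gives a hybrid scheme more candidate routes around a fault.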

5B-4 (Time: 15:10 - 15:40)
TitleA Thermal-aware Application Specific Routing Algorithm for Network-on-Chip Design
Author*Zhiliang Qian, Chi-Ying Tsui (The Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 449 - 454
KeywordNetwork-on-Chip, Thermal-aware, Application specific, Routing algorithm
AbstractIn this work, we propose an application specific routing algorithm to reduce the hot-spot temperature for Network-on-chip (NoC). Using the traffic information of the application, we develop a routing scheme which can achieve a higher adaptivity than the generic ones and at the same time distribute the traffic more uniformly. To reduce the hot-spot temperature, we find the optimal distribution ratio of the communication traffic among the set of candidate paths. The problem of finding this optimal distribution ratio is formulated as a linear programming (LP) problem and is solved offline. A router microarchitecture which supports our ratio-based selection policy is also proposed. From the simulation results, the peak energy reduction can be as high as 16.6% for synthetic traffic and real benchmarks.
Slides


Session 5C  High-Level and Logic Synthesis
Time: 13:40 - 15:40 Thursday, January 27, 2011
Location: Room 414+415
Chairs: Kiyoung Choi (Seoul National University, Republic of Korea), Shigeru Yamashita (Ritsumeikan University, Japan)

5C-1 (Time: 13:40 - 14:10)
TitleAn Efficient Hybrid Engine to Perform Range Analysis and Allocate Integer Bit-widths for Arithmetic Circuits
Author*Yu Pang (Chongqing University of Posts and Telecommunications, China), Katarzyna Radecka, Zeljko Zilic (McGill University, Canada)
Pagepp. 455 - 460
Keywordarithmetic circuits, range analysis, SMT, arithmetic transform, fixed-point synthesis
AbstractRange analysis is an important task in obtaining the correct, yet fast and inexpensive arithmetic circuits. The traditional methods, either simulation-based or static, have the disadvantage of low efficiency and coarse bounds, which may lead to unnecessary bits. In this paper, we propose a new method that combines several techniques to perform fixed-point range analysis in a datapath towards obtaining the much tighter ranges efficiently. We show that the range and the bit-width allocation can be obtained with better results relative to the past methods, and in significantly shorter time.
Slides
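
The link between range analysis and integer bit-width allocation can be sketched with plain interval arithmetic, one of the static baselines whose coarse bounds the abstract criticizes. The example below is such a baseline under stated assumptions (signed two's-complement integers, hypothetical helper names and input ranges). It is not the hybrid engine proposed in the paper.

```python
def add(a, b):
    """Interval sum; intervals are (low, high) tuples of integers."""
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    """Interval difference."""
    return (a[0] - b[1], a[1] - b[0])

def mul(a, b):
    """Interval product: the extremes lie among the four corner products."""
    corners = [x * y for x in a for y in b]
    return (min(corners), max(corners))

def signed_bits(low, high):
    """Smallest two's-complement width whose range covers [low, high]."""
    n = 1
    while not (-(1 << (n - 1)) <= low and high <= (1 << (n - 1)) - 1):
        n += 1
    return n

# Hypothetical datapath: out = x * y + x, with x in [-8, 7] and y in [0, 15].
x = (-8, 7)
y = (0, 15)
out_range = add(mul(x, y), x)
out_width = signed_bits(*out_range)

# Coarse-bound example: x - x is always 0, but interval arithmetic ignores the
# correlation between the two operands and reports a much wider range.
loose_range = sub(x, x)
loose_width = signed_bits(*loose_range)
```

The `x - x` case shows the overestimation that tighter methods aim to remove: the true result needs a single bit, while the interval bound asks for five.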

5C-2 (Time: 14:10 - 14:40)
TitleRegister Pressure Aware Scheduling for High Level Synthesis
Author*Rami Beidas, Wai Sum Mong, Jianwen Zhu (University of Toronto, Canada)
Pagepp. 461 - 466
KeywordPhase Coupling, Scheduling, Register Pressure, Area Optimization
AbstractVariations of list scheduling became the de-facto standard of scheduling straight line code in software compilers, a trend faithfully inherited by high-level synthesis solutions. Due to its nature, list scheduling is oblivious of the tightly coupled register pressure; a dangling fundamental problem that has been attacked by the compiler community for decades, and which results, in case of high-level synthesis, in excessive instantiations of registers and accompanying steering logic. To alleviate this problem, we propose a synthesis framework called "soft scheduling", which acts as a resource unconstrained pre-scheduling stage that restricts subsequent scheduling to minimize register pressure. This optimization objective is formulated as a live range minimization problem, a measure shown to be proportional to register pressure, and optimally solved in polynomial time using minimum cost network flow formulation. Unlike past solutions in the compiler community, which try to reduce register pressure by local serialization of subject instructions, the proposed solution operates on the entire basic block or hyperblock and systematically handles instruction chaining subject to the same objective. The application of the proposed solution to a set of real-life benchmarks results in a register pressure reduction ranging, on average, between 11% and 41% depending on the compilation and synthesis configurations with minor 2% to 4% increase in schedule latency.
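
To show how the schedule affects register demand, the sketch below computes value live ranges for a given schedule and reports both the peak number of simultaneously live values (register pressure) and the total live-range length that the abstract uses as its optimization objective. The liveness convention (a value occupies a register from its producer's cycle up to, but not including, its last consumer's cycle), the function name, and the tiny dependence graph are assumptions. The network-flow formulation of the paper is not reproduced.

```python
def register_pressure(schedule, uses):
    """Return (peak live values, total live-range length) for a schedule.

    schedule : dict {operation: cycle}
    uses     : dict {producer: [consumer operations]}
    """
    last_cycle = max(schedule.values())
    live = [0] * (last_cycle + 1)
    for producer, consumers in uses.items():
        if not consumers:
            continue
        start = schedule[producer]
        end = max(schedule[c] for c in consumers)
        for cycle in range(start, end):  # assumed liveness convention
            live[cycle] += 1
    return max(live), sum(live)

# Hypothetical dependences: d consumes a and c, c consumes b.
uses = {"a": ["d"], "b": ["c"], "c": ["d"], "d": []}

# Scheduling a as early as possible keeps its value alive for two cycles.
early = {"a": 0, "b": 0, "c": 1, "d": 2}
# Delaying a until just before its use shortens its live range.
late = {"a": 1, "b": 0, "c": 1, "d": 2}

early_peak, early_total = register_pressure(early, uses)
late_peak, late_total = register_pressure(late, uses)
```

Both schedules have the same latency, but the second one holds fewer values per cycle on average, which is the kind of slack a pre-scheduling stage can exploit.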

5C-3 (Time: 14:40 - 15:10)
TitleParallel Cross-Layer Optimization of High-Level Synthesis and Physical Design
Author*James Williamson (ECEE Dept., University of Colorado at Boulder, U.S.A.), Yinghai Lu (EECS Dept., Northwestern University, U.S.A.), Li Shang (ECEE Dept., University of Colorado at Boulder, U.S.A.), Hai Zhou (EECS Dept., Northwestern University, U.S.A.), Xuan Zeng (State Key Lab of ASIC & System, Microelectronics Dept., Fudan University, China)
Pagepp. 467 - 472
Keywordcross-layer optimization, parallel CAD, GPGPU, heterogeneous architectures, parallelization
AbstractIntegrated circuit (IC) design automation has traditionally followed a hierarchical approach. Modern IC design flow is divided into sequentially-addressed design and optimization layers; each successively finer in design detail and data granularity while increasing in computational complexity. Eventual agreement across the design layers signals design closure. Obtaining design closure is a continual problem, as lack of awareness and interaction between layers often results in multiple design flow iterations. In this work, we propose parallel cross-layer optimization, in which the boundaries between design layers are broken, allowing for a more informed and efficient exploration of the design space. We leverage the heterogeneous parallel computational power in current and upcoming multi-core/many-core computation platforms to suit the heterogeneous characteristics of multiple design layers. Specifically, we unify the high-level and physical synthesis design layers for parallel cross-layer IC design optimization. In addition, we introduce a massively-parallel GPU floorplanner with local and global convergence test as the proposed physical synthesis design layer. Our results show average performance gains of 11X speed-up over state-of-the-art.
Slides

5C-4 (Time: 15:10 - 15:40)
TitleNetwork Flow-based Simultaneous Retiming and Slack Budgeting for Low Power Design
AuthorBei Yu, Sheqin Dong, *Yuchun Ma, Tao Lin, Yu Wang (Tsinghua University, China), Song Chen, Satoshi GOTO (Waseda University, Japan)
Pagepp. 473 - 478
KeywordRetiming, Slack Budgeting, Network Flow, Low Power
AbstractLow power design has become one of the most significant requirements as CMOS technology has entered the nanometer era. Therefore, timing budgeting is often performed to slow down as many components as possible so that timing slacks can be applied to reduce the power consumption while maintaining the performance of the whole design. Retiming is a procedure that involves the relocation of flip-flops (FFs) across logic gates to achieve faster clocking speed. In this paper we show that the retiming and slack budgeting problem can be formulated as a convex cost dual network flow problem. Both the theoretical analysis and experimental results show the efficiency of our approach, which can not only reduce power consumption but also speed up previous work.
Slides
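
The relocation of flip-flops described in this abstract is commonly modeled with the textbook retiming rule: each gate v gets an integer lag r(v), and an edge u → v carrying w flip-flops ends up with w + r(v) − r(u) flip-flops. The sketch below applies that rule to a small hypothetical circuit graph and checks legality (no negative flip-flop counts). The slack budgeting and the convex cost dual network flow formulation of the paper are not shown, and all names are illustrative.

```python
def retime(edges, lag):
    """Apply a retiming to a circuit graph.

    edges : dict {(u, v): number of flip-flops on the connection u -> v}
    lag   : dict {gate: integer retiming lag r(gate)}
    """
    return {(u, v): w + lag[v] - lag[u] for (u, v), w in edges.items()}

def is_legal(edges):
    """A retiming is legal only if no connection ends up with a negative count."""
    return all(w >= 0 for w in edges.values())

# Hypothetical three-gate loop with both flip-flops bunched on one connection.
circuit = {("a", "b"): 0, ("b", "c"): 0, ("c", "a"): 2}

# Moving one flip-flop backward across gates b and c spreads them around the loop.
lags = {"a": 0, "b": 1, "c": 1}
retimed = retime(circuit, lags)
legal = is_legal(retimed)
```

The number of flip-flops around any loop is unchanged by retiming (2 before and after here), which is why only their positions, and hence the combinational path lengths, can be optimized.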


Session 5D  Designers' Forum: C-P-B Co-design/Co-verification Technology for DDR3 1.6G in Consumer Products
Time: 13:40 - 15:40 Thursday, January 27, 2011
Location: Room 416+417
Organizer: Koji Kato (Sony, Japan)

5D-1 (Time: 13:40 - 15:10)
Title(Panel Discussion) C-P-B Co-design/Co-verification Technology for DDR3 1.6G in Consumer Products
AuthorOrganizer: Koji Kato (Sony, Japan), Moderator: Makoto Nagata (Kobe University, Japan), Panelists: Keisuke Matsunami (Sony, Japan), Yoshinori Fukuba (Toshiba, Japan), Ji Zheng (Apache Design Solutions, U.S.A.), Jen-Tai Hsu (Global Unichip Corporation, U.S.A.), CT Chiu (ASE, Taiwan)
AbstractThis panel discusses chip-package-board co-design/co-verification techniques for the coming DDR3 1.6-Gbps interface for consumer applications. Such a high data rate needs to be realized with low-cost assembly like wire bonding on an FR-4 board with a small number of layers. The panelists will be solicited from a set maker, a semiconductor manufacturer, an IP provider, a tool vendor, and an assembly foundry.
Slides


Session 6A  Design Validation Techniques
Time: 16:00 - 18:00 Thursday, January 27, 2011
Location: Room 411+412
Chairs: Miroslav Velev (Aries Design Automation, U.S.A.), Kiyoharu Hamaguchi (Osaka University, Japan)

6A-1 (Time: 16:00 - 16:30)
TitleManaging Complexity in Design Debugging with Sequential Abstraction and Refinement
Author*Brian Keng, Andreas Veneris (University of Toronto, Canada)
Pagepp. 479 - 484
Keyworddebugging, diagnosis, abstraction, refinement, verification
AbstractIn this work, a novel abstraction and refinement technique for design debugging is presented that addresses two key components of the debugging complexity: the design size and the error trace length. Experimental results show that the proposed algorithm is able to return solutions for all instances, compared to only 41% without the technique, demonstrating the viability of this approach in tackling real-world debugging problems.
Slides

6A-2 (Time: 16:30 - 17:00)
TitleFacilitating Unreachable Code Diagnosis and Debugging
AuthorHong-Zu Chou (National Taiwan University, Taiwan), *Kai-Hui Chang (Avery Design Systems, Inc., U.S.A.), Sy-Yen Kuo (National Taiwan University, Taiwan)
Pagepp. 485 - 490
KeywordRTL symbolic simulation, Unreachability analysis, Error diagnosis, Reachability analysis
AbstractCode coverage is a popular method to find design bugs and verification loopholes. However, once a piece of code is determined to be unreachable, diagnosing the cause of the problem can be challenging: since the code is unreachable, no counterexample can be returned for debugging. Therefore, engineers need to analyze the legality of nonexistent execution paths, which can be difficult. To address such a problem, we analyzed the cause of unreachability in several industrial designs and proposed a diagnosis technique that can explain the cause of unreachability. In addition, our method provides suggestions on how to solve the unreachability problem, which can further facilitate debugging. Our experimental results show that this technique can greatly reduce an engineer's effort in analyzing unreachable code.
Slides

6A-3 (Time: 17:00 - 17:30)
TitleDeterministic Test for the Reproduction and Detection of Board-Level Functional Failures
AuthorHongxia Fang (Duke University, U.S.A.), Zhiyuan Wang, Xinli Gu (Cisco Systems Inc., U.S.A.), *Krishnendu Chakrabarty (Duke University, U.S.A.)
Pagepp. 491 - 496
KeywordEcoverage, board-level diagnosis, functional failure, functional scan, functional state space
AbstractA common scenario in industry today is "No Trouble Found" (NTF) due to functional failures. A component on a board fails during board-level functional test, but it passes the Automatic Test Equipment (ATE) test when it is returned to the supplier for warranty replacement or service repair. To find the root cause of NTF, we propose an innovative functional test approach and DFT methods for the detection of boardlevel functional failures. These DFT and test methods allow us to reproduce and detect functional failures in a controlled deterministic environment, which can provide ATE tests to the supplier for early screening of defective parts. Experiments on an industry design show that functional scan test with appropriate functional constraints can adequately mimic the functional state space well (measured by appropriate coverage metrics). Experiments also show that most functional failures due to stuck-at, dominant bridging, and crosstalk faults can be reproduced and detected by functional scan test.

6A-4 (Time: 17:30 - 18:00)
TitleEquivalence Checking of Scheduling with Speculative Code Transformations in High-Level Synthesis
Author*Chi-Hui Lee, Che-Hua Shih, Juinn-Dar Huang, Jing-Yang Jou (National Chiao Tung University, Taiwan)
Pagepp. 497 - 502
Keywordequivalence checking, formal verification, finite state machine with datapath (FSMD), high-level synthesis (HLS), scheduling
AbstractThis paper presents a formal method for equivalence checking between the descriptions before and after scheduling in high-level synthesis (HLS). Both descriptions are represented by finite state machines with datapath (FSMDs) and are then characterized through finite sets of paths. The main target of our proposed method is to verify scheduling that employs code transformations across basic block (BB) boundaries, such as speculation and common subexpression extraction (CSE), which have not been properly addressed in the past. Nevertheless, our method can verify typical BB-based and path-based scheduling as well. The experimental results demonstrate that the proposed method can indeed outperform an existing state-of-the-art equivalence checking algorithm.


Session 6B  Clock Network Design
Time: 16:00 - 18:00 Thursday, January 27, 2011
Location: Room 413
Chairs: Yuchun Ma (Tsinghua University, China), Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea)

6B-1 (Time: 16:00 - 16:30)
TitleAn Optimal Algorithm for Allocation, Placement, and Delay Assignment of Adjustable Delay Buffers for Clock Skew Minimization in Multi-Voltage Mode Designs
Author*Kyoung-Hwan Lim, Taewhan Kim (Seoul National University, Republic of Korea)
Pagepp. 503 - 508
KeywordClock skew, Timing
AbstractRecently, it has been shown that an adjustable delay buffer (ADB), whose delay can be tuned dynamically, can be used to solve the clock skew problem effectively under multiple power (voltage) modes. We propose a linear time optimal algorithm that simultaneously solves the problems of computing (1) the minimum number of ADBs to be used, (2) the location at which each ADB is to be placed, and (3) the delay value of each ADB to be assigned to each power mode.
Slides

6B-2 (Time: 16:30 - 17:00)
TitleOn Applying Erroneous Clock Gating Conditions to Further Cut Down Power
Author*Tak-Kei Lam, Xiaoqing Yang, Wai-Chung Tang, Yu-Liang Wu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 509 - 514
Keywordclock gating, logic synthesis, low power, error cancellation
AbstractAll of today's known clock gating techniques only disable clocks on valid ("correct") clock gating conditions, like idle states or observability don’t cares (ODC), whose application does not change the circuit functionality. In this paper, we explore a technique that allows shutting down certain clocks during invalid cycles, which if applied alone will certainly cause erroneous results. However, the erroneous results will be corrected either during the current or later stages by injecting other clock gating conditions to cancel out each other’s error effects before they reach the primary outputs. Under this model, conditions across multiple flip-flop stages can also be analyzed to locate easily correctable erroneous clock gating conditions. Experimental results show that by using this error cancellation technique, a total power (including dynamic and leakage power) cut of up to 23%, and around 6% on average, could be stably achieved, with or without also applying Power Compiler (which brought a power cut of 4% on average). The results indicate that the power saving conditions found by this new technique were nearly orthogonal (independent) to what can be done by the popular commercial power optimization tool. The idea of these new multi-stage logic error cancellation operations can potentially be applied to other sequential logic synthesis problems as well.

6B-3 (Time: 17:00 - 17:30)
TitleLow Power Discrete Voltage Assignment Under Clock Skew Scheduling
AuthorLi Li (Electrical Engineering and Computer Science Department, Northwestern University, U.S.A.), Jian Sun (State Key Lab of ASIC & System, Microelectronics Department, Fudan University, China), Yinghai Lu, *Hai Zhou (Electrical Engineering and Computer Science Department, Northwestern University, U.S.A.), Xuan Zeng (State Key Lab of ASIC & System, Microelectronics Department, Fudan University, China)
Pagepp. 515 - 520
Keywordlow power, voltage assignment, clock skew scheduling
AbstractMultiple Supply Voltage (MSV) assignment has emerged as an appealing technique in low power IC design, due to its flexibility in balancing power and performance. However, clock skew scheduling, which has great impact on criticality of combinational paths in sequential circuits, has not been explored in the merit of MSV assignment. In this paper, we propose a discrete voltage assignment algorithm for sequential circuits under clock scheduling. The sequential MSV assignment problem is first formulated as a convex cost dual network flow problem, which can be optimally solved in polynomial time assuming the delay of each gate can be chosen in a continuous domain. Then a mincut-based heuristic is designed to convert the infeasible continuous solution into a feasible discrete solution while largely preserving the global optimality. Besides, we revisit the hardness of the general discrete voltage assignment problem and point out some misunderstandings on the approximability of this problem in previous related work. Benchmark tests for our algorithm show a 9.2% reduction in power consumption on average, compared with combinational MSV assignment. Referring to the continuous solution obtained from network flow as the lower bound, the gap between our solution and the lower bound is only 1.77%.
Slides

6B-4 (Time: 17:30 - 18:00)
TitleA Practical Method for Multi-domain Clock Skew Optimization
Author*Yanling Zhi (State Key Lab. of ASIC & System, Microelectronics Department, Fudan University, China), Hai Zhou (Department of Electrical Engineering and Computer Science, Northwestern University, U.S.A.), Xuan Zeng (State Key Lab. of ASIC & System, Microelectronics Department, Fudan University, China)
Pagepp. 521 - 526
KeywordClock Skew Optimization, Multi-Domain
AbstractClock skew scheduling is an effective technique in performance optimization of sequential circuits. However, with process variations, it becomes more difficult to reliably implement a wide spectrum of clock delays at the registers. Multi-domain clock skew scheduling is a good option to overcome this limitation. In this paper, we propose a practical method to efficiently and optimally solve this problem. A framework based on branch-and-bound is carefully designed to search for the optimal clocking domain assignment, and a greedy clustering algorithm is developed to quickly estimate the upper bound of cycle period for a given branch. Experimental results on ISCAS89 sequential benchmarks show both the optimality and efficiency of our method compared with the previous work.
Slides


Session 6C  Advances in Routing
Time: 16:00 - 18:00 Thursday, January 27, 2011
Location: Room 414+415
Chair: David Z. Pan (University of Texas at Austin, U.S.A.)

6C-1 (Time: 16:00 - 16:30)
TitleEfficient Multi-Layer Obstacle-Avoiding Preferred Direction Rectilinear Steiner Tree Construction
Author*Jia-Ru Chuang, Jai-Ming Lin (National Cheng Kung University, Taiwan)
Pagepp. 527 - 532
KeywordRouting, Multi-Layer, Obstacle-Avoiding, Rectilinear Steiner Tree, Preferred Direction
AbstractConstructing rectilinear Steiner trees for signal nets is a very important procedure for placement and routing because we can use it to find topologies of nets and measure the design quality. However, in modern VLSI designs, pins are located in multiple routing layers, each routing layer has its own preferred direction, and there exist numerous routing obstacles incurred from IP blocks, power networks, pre-routed nets, etc., which requires us to consider the multi-layer obstacle-avoiding preferred direction rectilinear Steiner minimal tree (ML-OAPDRSMT) problem. This significantly increases the complexity of the problem, and an efficient and effective algorithm to deal with the problem is desired. In this paper, we propose a very simple and effective approach to deal with the ML-OAPDRSMT problem. Unlike previous works, which usually build a spanning graph and find a spanning tree to deal with this problem and therefore take a lot of time, we first determine a connection ordering for all pins, and then iteratively connect every two neighboring pins by a greedy heuristic algorithm. The experimental results show that our method achieves an average improvement of 5.78% over [7] and at least a fivefold speedup compared with their approach.
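
Illustrative sketch (not the authors' algorithm): the "order the pins, then connect neighboring pins" idea in the abstract of 6C-1 can be pictured in a heavily simplified setting. The Python below assumes a single layer, no obstacles, and no preferred directions, and it uses a hypothetical ordering rule (sort by x, then y); the function name chain_wirelength is also an assumption made for illustration.

```python
from typing import List, Tuple

Pin = Tuple[int, int]  # (x, y) grid coordinates on one layer (assumed)


def chain_wirelength(pins: List[Pin]) -> int:
    """Total rectilinear length when consecutive pins in the order are joined.

    Assumptions (not from the paper): pins are ordered by x, then y, and each
    neighboring pair is joined by an L-shaped route, whose length equals the
    Manhattan distance between the two pins.
    """
    ordered = sorted(pins)
    return sum(
        abs(x2 - x1) + abs(y2 - y1)
        for (x1, y1), (x2, y2) in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    print(chain_wirelength([(0, 0), (3, 4), (1, 1)]))
```

A chain like this avoids building a spanning graph, but it generally yields a longer tree than a true Steiner construction; handling layers and obstacles as the paper does would need considerably more machinery.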

6C-2 (Time: 16:30 - 17:00)
TitleCut-Demand Based Routing Resource Allocation and Consolidation for Routability Enhancement
Author*Fong-Yuan Chang (National Tsing Hua University, Taiwan), Sheng-Hsiung Chen (SpringSoft, Taiwan), Ren-Song Tsay, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Pagepp. 533 - 538
KeywordCut-Demand, Resource, Routability, placement
AbstractTo successfully route a design, one essential requirement is to allocate sufficient routing resources. In this paper, we show that allocating routing resources based on horizontal and vertical (H/V) cut-demands can greatly improve routability, especially for designs with thin areas. We then derive methods to predict the maximum H/V cut-demands and propose two cut-demand based approaches: one allocates routing resources considering the maximum H/V cut-demands, and the other consolidates fragmented metal-1 routing resources for effective resource utilization. Experimental results demonstrate that the resource allocation method can determine design areas more precisely and the resource consolidation method can significantly improve routability. With better routability, the routing time is about 5 times faster on average and the design area can be further reduced by 2-15%.
Slides

6C-3 (Time: 17:00 - 17:30)
TitleNegotiation-Based Layer Assignment for Via Count and Via Overflow Minimization
Author*Wen-Hao Liu, Yih-Lang Li (National Chiao Tung University, Taiwan)
Pagepp. 539 - 544
KeywordPhysical design, Global routing, Layer assignment, Via count, Via overflow
AbstractLayer assignment determines on which layer the wires or vias should be placed, and the assignment results influence the circuit’s delay, crosstalk, and via counts. How to minimize via count and via overflow during layer assignment has received considerable attention in recent years. Traditional layer assignment to minimize via count tends to produce varying qualities of assignment results using different net orderings. This work develops a negotiation-based via count minimization algorithm (NVM) that can achieve lower via counts than in previous works. As for via overflow minimization, we observe that via overflow can be well minimized if via overflow minimization is performed following stacked via minimization. The stacked via minimization adopts the proposed NVM, while via overflow minimization adopts a modified NVM by replacing via cost with via overflow cost.

6C-4 (Time: 17:30 - 18:00)
TitleWire Synthesizable Global Routing for Timing Closure
AuthorMichael Moffitt (IBM Corporation, U.S.A.), *C. N. Sze (IBM Research, U.S.A.)
Pagepp. 545 - 550
KeywordRouting, Placement, Timing, Algorithm
AbstractDespite remarkable progress in the area of global routing, the burdens imposed by modern physical synthesis flows are far greater than those expected or anticipated by available (academic) routing engines. As interconnects dominate the path delay, physical synthesis such as buffer insertion and gate sizing has to integrate with layer assignment. Layer directives, commonly generated during wire synthesis to meet tight frequency targets, play a critical role in reducing interconnect delay of smaller technology nodes. Unfortunately, they are not presently understood or honored by leading global routers, nor do existing techniques trivially extend toward their resolution. The shortcomings contribute to a dangerous blindspot in optimization and timing closure, leading to unroutable and/or underperforming designs. In this paper, we aim to resolve the layer compliance problem in routing congestion evaluation and global routing, which is very critical for timing closure with physical synthesis. We propose a method of progressive projection to account for wire tags and layer directives, in which classes of nets are successively applied and locked while performing partial aggregation. The method effectively models the resource contention of layer constraints by faithfully accumulating capacity of bounded layer ranges, enabling three-dimensional assignment to subsequently achieve complete directive compliance. The approach is general, and can piggyback on existing interfaces used to communicate with popular academic engines. Empirical results on the ICCAD 2009 benchmarks demonstrate that our approach successfully routes many designs that are otherwise unroutable with existing techniques and naïve approaches.


Session 6D  Designers' Forum: Emerging Technologies for Wellness Applications
Time: 16:00 - 18:00 Thursday, January 27, 2011
Location: Room 416+417
Organizer: Hideki Yoshizawa (Fujitsu Labs., Japan)

6D-1 (Time: 16:00 - 16:30)
Title(Invited Paper) Biological Information Sensing Technologies for Medical, Health Care, and Wellness Applications
AuthorMasaharu Imai, Yoshinori Takeuchi, Keishi Sakanushi, Hirofumi Iwato (Osaka University, Japan)
Pagepp. 551 - 555
AbstractIn recent years, our society has become an increasingly aging society, and health care has become one of the most important concerns. Since small-size wearable and implantable healthcare systems are required in an aging society, LSI technology becomes more important for biological information sensing. In this paper, we briefly summarize the fundamental biological sensing system, and introduce our current work, Medical Domain Specific SoC, Type I (MeSOC-I), for a capsular inner bladder pressure sensing system. At the end of this paper, we sketch the future challenges for biological information sensing.
Slides

6D-2 (Time: 16:30 - 17:00)
Title(Invited Paper) Ultra-Low Power Microcontrollers for Portable, Wearable, and Implantable Medical Electronics
AuthorSrinivasa R. Sridhara (Texas Instruments, Inc., U.S.A.)
Pagepp. 556 - 560
AbstractAn aging population, coupled with choices on diet and lifestyle, is causing an increased demand for portable, wearable, and implantable medical devices that enable chronic disease management and wellness assessment. Battery life specifications drive the power consumption requirements of integrated circuits in these devices. Microcontrollers provide the right combination of programmability, cost, performance, and power consumption needed to realize such devices. In this paper, we describe microcontrollers that are enabling today’s medical applications and discuss innovations necessary for enabling future applications with sophisticated signal processing needs. As an example, we present the design of an embedded microcontroller system-on-chip that achieves the first sub-microwatt per channel electroencephalograph (EEG) seizure detection.
Slides

6D-3 (Time: 17:00 - 17:30)
Title(Invited Paper) Human++: Wireless Autonomous Sensor Technology for Body Area Networks
AuthorValer Pop, Ruben de Francisco, Hans Pflug, Juan Santana, Huib Visser, Ruud Vullers, Harmke de Groot, Bert Gyselinckx (IMEC, Netherlands)
Pagepp. 561 - 566
AbstractRecent advances in ultra-low-power circuits and energy harvesters are making self-powered body wireless autonomous transducer solutions (WATS) a reality. Power optimization at the system and application level is crucial in achieving ultra-low-power consumption for the entire system. This paper deals with innovative WATS modeling techniques, and illustrates their impact on the case of autonomous wireless ElectroCardioGram monitoring. The results show the effectiveness of our power optimization approach for improving the WATS autonomy.
Slides

6D-4 (Time: 17:30 - 18:00)
Title(Invited Paper) Healthcare of an Organization: Using Wearable Sensors and Feedback System for Energizing Workers
AuthorKoji Ara, Tomoaki Akitomi, Nobuo Sato, Satomi Tsuji, Miki Hayakawa, Yoshihiro Wakisaka, Norio Ohkubo, Rieko Otsuka, Fumiko Beniyama, Norihiko Moriwaki, Kazuo Yano (Advanced Research Laboratory, Hitachi, Ltd., Japan)
Pagepp. 567 - 572
AbstractA badge-shaped sensor and feedback system were developed. This system makes it possible to study human and organizational behavior in an office. By prompting changes in workers' behavior, it improves productivity and helps to solve individuals' problems. Aiming to improve the productivity and quality of project management, the system was applied at a software development company. The results show that the system can improve the human communication process, as well as motivate workers by giving them the chance to reflect on their work styles and to help their colleagues.
Slides



Friday, January 28, 2011

Session 3K  Keynote Session III
Time: 9:00 - 10:00 Friday, January 28, 2011
Location: Room 503
Chair: Kunihiro Asada (University of Tokyo, Japan)

3K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) Robust Systems: From Clouds to Nanotubes
AuthorSubhasish Mitra (Stanford University, U.S.A.)


Session 7A  System Level Analysis and Optimization
Time: 10:20 - 12:20 Friday, January 28, 2011
Location: Room 411+412
Chairs: Hiroshi Saito (Aizu University, Japan), Lovic Gauthier (Kyushu University, Japan)

7A-1 (Time: 10:20 - 10:50)
TitleA Polynomial-Time Custom Instruction Identification Algorithm Based on Dynamic Programming
Author*Junwhan Ahn, Imyong Lee, Kiyoung Choi (Seoul National University, Republic of Korea)
Pagepp. 573 - 578
KeywordASIPs, Configurable processors, Instruction-set extension, Dynamic programming
AbstractThis paper introduces an innovative algorithm for automatic instruction-set extension, which gives a pseudo-optimal solution within time polynomial in the size of a graph. The algorithm uses a top-down dynamic programming strategy with the branch-and-bound algorithm in order to exploit overlapping of subproblems. Correctness of the algorithm is formally proved, and its time complexity is analyzed. Also, it is verified that the algorithm gives an optimal solution for some types of merit functions, and has a very small probability of obtaining a non-optimal solution in general. Furthermore, several experimental results are presented as evidence that the proposed algorithm achieves a notable performance improvement.
Slides

7A-2 (Time: 10:50 - 11:20)
TitleExploring the Fidelity-Efficiency Design Space using Imprecise Arithmetic
Author*Jiawei Huang, John Lach (University of Virginia, U.S.A.)
Pagepp. 579 - 584
Keywordfidelity-efficiency tradeoffs, imprecise adders, reduced precision, Pareto frontier, CORDIC
AbstractRecently, many imprecise circuit design techniques have been proposed for implementation of error-tolerant applications, such as multimedia and communications. These algorithms do not mandate absolute correctness of their results, and imprecise circuit components can therefore leverage this relaxed fidelity requirement to provide performance and energy benefits. In this paper, several imprecise adder design techniques are classified and compared in terms of their error characteristics and power-delay efficiency. A general methodology for fidelity-efficiency design space exploration is presented and is applied to a case study implementing the CORDIC algorithm in 130nm technology. The case study reveals that simple precision scaling often provides better power-delay efficiency for a given fidelity than more complex imprecise adders, but a different choice of algorithm and fidelity can influence the outcome.
Slides
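
Illustrative sketch (not from the paper): the precision scaling mentioned in the abstract of 7A-2 can be pictured as zeroing low-order operand bits before an exact addition. The Python below is a minimal sketch under that assumption; the names truncated_add and max_abs_error are hypothetical, operands are assumed to be non-negative integers, and only arithmetic error is modeled, not power or delay.

```python
def truncated_add(a: int, b: int, dropped_bits: int) -> int:
    """Add a and b after zeroing the lowest `dropped_bits` bits of each operand."""
    mask = ~((1 << dropped_bits) - 1)  # keeps only the retained high-order bits
    return (a & mask) + (b & mask)


def max_abs_error(width: int, dropped_bits: int) -> int:
    """Worst-case error versus exact addition, by exhaustive search.

    Enumerates every pair of `width`-bit operands, so keep `width` small.
    """
    limit = 1 << width
    return max(
        (a + b) - truncated_add(a, b, dropped_bits)
        for a in range(limit)
        for b in range(limit)
    )


if __name__ == "__main__":
    print(truncated_add(13, 7, 2))  # exact sum is 20
    print(max_abs_error(6, 2))
```

Each operand can lose at most 2^k - 1 when k bits are dropped, so the worst-case error of this scheme is 2 * (2^k - 1); this is one simple way to quantify the fidelity side of a fidelity-efficiency tradeoff.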

7A-3 (Time: 11:20 - 11:50)
TitleThroughput Optimization for Latency-Insensitive System with Minimal Queue Insertion
AuthorJuinn-Dar Huang, *Yi-Hang Chen, Ya-Chien Ho (National Chiao Tung University, Taiwan)
Pagepp. 585 - 590
KeywordLatency-Insensitive System, Latency-Insensitive-design, Throughput Optimization, Queue Size Minimization, Integer Linear Programming
AbstractAs fabrication processes exploit ever deeper submicron technology, global interconnect delay is becoming one of the most critical performance obstacles in system-on-chip (SoC) designs. In recent years, the latency-insensitive system (LIS), which enables multicycle communication to tolerate variant interconnect delay without substantially modifying pre-designed IP cores, has been proposed to overcome this issue. However, imbalanced interconnect latency and communication back-pressure residing in an LIS still degrade system throughput. In this paper, we present a throughput optimization technique with minimal queue insertion. We first model a given LIS as a quantitative graph (QG), which can be further compacted using the proposed techniques, so that much bigger problems can be handled. On top of QG, the optimal solution with minimal queue size can be achieved through integer linear programming based on the proposed constraint formulation in an acceptable runtime. The experimental results show that our approach can deal with moderately large systems in a reasonable runtime and save about 28% of queues compared to the prior art.
Slides

7A-4 (Time: 11:50 - 12:20)
TitleA Fast and Effective Dynamic Trace-based Method for Analyzing Architectural Performance
Author*Yi-Siou Chen, Lih-Yih Chiou, Hsun-Hsiang Chang (Dept. of Electrical Engineering, National Cheng Kung University, Taiwan)
Pagepp. 591 - 596
KeywordSystem analysis and design, Computer aided design, Space Exploration, Digital Systems
AbstractPerformance estimation at the system level involves quantitative analysis to allow designers to evaluate alternative architectures before implementation. However, designers must spend a tremendous amount of time in system remodeling for performance estimation for each alternative solution in a huge design space. The effort required for system remodeling prolongs the exploration step. Furthermore, the accuracy and speed of performance analysis affect the effectiveness of architectural exploration. This work presents an architectural performance analysis using a dynamic trace-based method (APDT) to reduce the effort required for system remodeling and the time required to estimate performance during architecture exploration, thereby improving the effectiveness of that exploration. Experimental results demonstrate that the APDT approach is faster than the bus functional-level simulation on CoWare with a minor average deviation.
Slides


Session 7B  NBTI and Power Gating
Time: 10:20 - 12:20 Friday, January 28, 2011
Location: Room 413
Chairs: Kimiyoshi Usami (Shibaura Institute of Technology, Japan), Toshio Sudo (Shibaura Institute of Technology, Japan)

7B-1 (Time: 10:20 - 10:50)
TitleControlling NBTI Degradation during Static Burn-in Testing
Author*Ashutosh Chakraborty, David Z. Pan (University of Texas at Austin, U.S.A.)
Pagepp. 597 - 602
KeywordNBTI, Burn-in, Performance, ILP, Static
AbstractNegative Bias Temperature Instability (NBTI) has emerged as the dominant PMOS device failure mechanism in the nanometer VLSI era. The extent of NBTI degradation of a PMOS device increases dramatically at elevated operating temperatures and supply voltages. Unfortunately, both these conditions are concurrently experienced by a VLSI chip during the process of burn-in testing. Therefore, burn-in testing can potentially cause significant NBTI degradation of the chip, which can require designers to leave larger timing guard-bands during the design phase. Our analysis shows that even during a short burn-in duration of 10 hours, the degradation accumulated can be as much as 60% of the NBTI degradation experienced over 10 years of use at nominal conditions. Static burn-in testing in particular is observed to cause the most NBTI degradation due to the presence of a fixed vector, which does not allow relaxation of the NBTI effect as in the case of dynamic burn-in testing. The delay of benchmark circuits is observed to increase by over 10% due to static burn-in testing. We propose the first technique to reduce the NBTI degradation during static burn-in test by finding the minimum NBTI induced delay degradation vector (MDDV) based on timing criticality and threshold voltage change (Delta V_TH) sensitivity of the cells. Further, only a subset of the input pins needs to be controlled for NBTI reduction, thus our technique allows other objectives (such as leakage reduction) to be considered simultaneously. Experimental results show that NBTI induced V_TH and delay degradation can be reduced by as much as 25% and 20% respectively over several benchmark circuits using our proposed technique.

7B-2 (Time: 10:50 - 11:20)
TitleA Fine-Grained Technique of NBTI-Aware Voltage Scaling and Body Biasing for Standard Cell Based Designs
Author*Yongho Lee (Samsung Electronics, Republic of Korea), Taewhan Kim (Seoul National University, Republic of Korea)
Pagepp. 603 - 608
KeywordNBTI, Voltage scaling, Body biasing, Power
AbstractThis work addresses the important problem of minimizing the power consumption of standard cell based circuits while controlling the NBTI induced delay increase to meet the circuit timing constraint, by simultaneously utilizing the effects of voltage scaling and (fine-grained) body biasing on both NBTI and power consumption. Through a comprehensive analysis of the relations between the values of supply and body biasing voltages and the values of the resulting power consumption and NBTI induced delay, we precisely formulate the problem, and transform it into a convex optimization problem to solve it efficiently.

7B-3 (Time: 11:20 - 11:50)
TitleNBTI-Aware Power Gating Design
AuthorMing-Chao Lee, *Yu-Guang Chen, Ding-Kai Huang, Shih-Chieh Chang (Dept. of CS, National Tsing Hua University, Taiwan)
Pagepp. 609 - 614
Keywordlow power, power gating, NBTI, reliability, leakage
AbstractA header-based power gating structure inserts PMOS as sleep transistors between the power rail and the circuit. Since PMOS sleep transistors in the functional mode are turned on continuously, Negative Bias Temperature Instability (NBTI) seriously influences the lifetime reliability of PMOS sleep transistors. To tolerate the NBTI effect, PMOS sleep transistors are normally over-sized. In this paper, we propose a novel NBTI-aware power gating architecture to extend the lifetime of PMOS sleep transistors. In our structure, sleep transistors are switched on/off periodically so that the overall turned-on times of sleep transistors are reduced and sleep transistors are less influenced by the NBTI effect. The experimental results show that our approach can achieve better lifetime extensions of PMOS sleep transistors than previous works, with little area overhead.
Slides

7B-4 (Time: 11:50 - 12:20)
TitleRobust Power Gating Reactivation By Dynamic Wakeup Sequence Throttling
AuthorTung-Yeh Wu, Shih-Hsin Hu, *Jacob A. Abraham (The University of Texas at Austin, U.S.A.)
Pagepp. 615 - 620
Keywordpower gating, power supply noise, voltage drop, reactivation, wakeup
AbstractThe wakeup sequence for power gating techniques has become an important issue as the rush current typically causes a high voltage drop. This paper proposes a new wakeup scheme utilizing an on-chip detector which continuously monitors the power supply noise in real time. Therefore, this scheme is able to dynamically throttle the wakeup sequence according to the ambient voltage level. As a result, even if the adjacent active circuit blocks induce an unexpectedly high voltage drop, the possibility of the occurrence of excessive voltage drop is reduced significantly.


Session 7C  Physical Design for Yield
Time: 10:20 - 12:20 Friday, January 28, 2011
Location: Room 414+415
Chair: Cliff Sze (IBM, U.S.A.)

7C-1 (Time: 10:20 - 10:50)
TitleRobust Clock Tree Synthesis with Timing Yield Optimization for 3D-ICs
Author*Jae-Seok Yang, Jiwoo Pak (University of Texas at Austin, U.S.A.), Xin Zhao, Sung Kyu Lim (Georgia Institute of Technology, U.S.A.), David Z. Pan (University of Texas at Austin, U.S.A.)
Pagepp. 621 - 626
KeywordTSV, CTS, stress, timing optimization, 3D IC
Abstract3D integration has new manufacturing and design challenges such as timing corner mismatch between tiers and device variation due to Through Silicon Via (TSV) induced stress. Timing corner mismatch between tiers is caused because each tier is manufactured in an independent process. Therefore, inter-die variation should be considered to analyze and optimize for paths spreading over several tiers. TSV induced stress is another challenge in 3D Clock Tree Synthesis (CTS). Mobility variation of a clock buffer due to stress from TSV can cause unexpected skew which degrades overall chip performance. In this paper, we propose a clock tree design methodology with the following objectives: (a) to minimize clock period variation by assigning the optimal z-location of clock buffers with an Integer Linear Program (ILP) formulation, (b) to prevent unwanted skew induced by the stress. In the results, we show that our clock buffer tier assignment reduces clock period variation by up to 34.2%, and most of the stress-induced skew can be removed by our stress-aware CTS. Overall, we show that performance gain can be up to 5.7% with our robust 3D CTS.

7C-2 (Time: 10:50 - 11:20)
TitleTrack Routing Optimizing Timing and Yield
AuthorXin Gao, *Luca Macchiarulo (Department of Electrical Engineering, University of Hawaii at Manoa, U.S.A.)
Pagepp. 627 - 632
KeywordTrack routing, timing, yield, Geometric Programming
AbstractIn this paper, we propose a track routing algorithm for timing and yield optimization. The algorithm solves the problem in two stages: wire ordering, and wire spacing and sizing. The wire ordering problem is solved by an algorithm based on wire merging. For the wire spacing and sizing problem, we show that it can be represented as a Mixed Linear Geometric Programming (MLGP) problem which can be transformed into a convex optimization problem. Since general nonlinear convex optimization may take a long running time, we propose a heuristic that solves the problem much faster. Experimental results show that, compared to the algorithm that only optimizes yield, our algorithm is able to improve the minimum timing slack by 20%.

7C-3 (Time: 11:20 - 11:50)
TitleSimultaneous Redundant Via Insertion and Line End Extension for Yield Optimization
AuthorShing-Tung Lin (National Tsing Hua University, Taiwan), Kuang-Yao Lee (Taiwan Semiconductor Manufacturing Company, Taiwan), *Ting-Chi Wang (National Tsing Hua University, Taiwan), Cheng-Kok Koh (Purdue University, U.S.A.), Kai-Yuan Chao (Intel Corporation, U.S.A.)
Pagepp. 633 - 638
Keywordinteger linear program (ILP), redundant via insertion, line end extension, yield
AbstractIn this paper, we formulate a problem of simultaneous redundant via insertion and line end extension for via yield optimization. Our problem is more general than those in previous works in the sense that more than one type of line end extension is considered and the objective function to be optimized directly accounts for via yield. We present a zero-one integer linear program based approach, equipped with two speedup techniques, to solve the addressed problem optimally. In addition, we describe how to modify our approach to exactly solve the problem addressed in a previous work. Extensive experimental results are shown to demonstrate the effectiveness and efficiency of our approaches.
Slides

7C-4 (Time: 11:50 - 12:20)
TitlePruning-based Trace Signal Selection Algorithm
Author*Kang Zhao, Jinian Bian (Tsinghua University, China)
Pagepp. 639 - 644
KeywordAlgorithm, silicon debug, state restoration, trace signal selection
AbstractTo improve observability in post-silicon validation, the key question is how to effectively select the limited number of trace signals used for data acquisition. This paper proposes an automated trace signal selection algorithm, which uses a pruning-based strategy to reduce the exploration space. The experiments indicate that the proposed algorithm achieves higher restoration ratios and is more effective than existing methods.
Slides


Session 7D  Special Session: Virtualization, Programming, and Energy-Efficiency Design Issues of Embedded Systems
Time: 10:20 - 12:20 Friday, January 28, 2011
Location: Room 416+417
Organizer: Tei-Wei Kuo (National Taiwan University, Taiwan)

7D-1 (Time: 10:20 - 10:50)
Title(Invited Paper) Temporal and Spatial Isolation in a Virtualization Layer for Multi-core Processor based Information Appliances
AuthorTatsuo Nakajima, Yuki Kinebuchi, Hiromasa Shimada, Alexandre Courbot, Tsung-Han Lin (Waseda University, Japan)
Pagepp. 645 - 652
AbstractA virtualization layer makes it possible to compose multiple functionalities on a multi-core processor with minimum modifications of OS kernels and applications. A multi-core processor is a good candidate for composing various software independently developed for dedicated processors into one multi-core processor to reduce both the hardware and development cost. In this paper, we present SPUMONE, which is a virtualization layer suitable for developing multi-core processor based information appliances.

7D-2 (Time: 10:50 - 11:20)
Title(Invited Paper) Mathematical Limits of Parallel Computation for Embedded Systems
AuthorJason Loew, Jesse Elwell, Dmitry Ponomarev, Patrick H. Madden (SUNY Binghamton Computer Science Department, U.S.A.)
Pagepp. 653 - 660
AbstractEmbedded systems are designed to perform a specific set of tasks, and are frequently found in mobile, power constrained environments. There is growing interest in the use of parallel computation as a means to increase performance while reducing power consumption. In this paper, we highlight fundamental limits to what can and cannot be improved by parallel resources. Many of these limitations are easily overlooked, resulting in the design of systems that, rather than improving over prior work, are in fact orders of magnitude worse.

7D-3 (Time: 11:20 - 11:50)
Title(Invited Paper) An Enhanced Leakage-Aware Scheduler for Dynamically Reconfigurable FPGAs
AuthorJen-Wei Hsieh (National Taiwan University of Science and Technology, Taiwan), Yuan-Hao Chang (National Taipei University of Technology, Taiwan), Wei-Li Lee (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 661 - 667
AbstractFPGAs (Field-Programmable Gate Arrays) are popular in hardware designs and even hardware/software co-designs. Due to the advance of manufacturing technologies, leakage power has become an important issue in the design of modern FPGAs. In particular, partially dynamically reconfigurable FPGAs allow latency between FPGA reconfiguration and task execution for performance reasons. However, this latency introduces unnecessary leakage power called leakage waste. In this work, we propose a leakage-aware scheduling algorithm to minimize the leakage waste without increasing the schedule length of tasks. In this algorithm, a priority dispatcher with a split-aware placement is proposed to reduce the scheduling complexity while considering the hardware constraints of FPGAs. A series of experiments based on synthetic designs demonstrates that the proposed algorithm can effectively reduce leakage waste with limited sacrifices in task schedulability.

7D-4 (Time: 11:50 - 12:20)
Title(Invited Paper) Power Management Strategies in Data Transmission
AuthorTiefei Zhang (Zhejiang University, China), Ying-Jheng Chen, Che-Wei Chang, Chuan-Yue Yang, Tei-Wei Kuo (National Taiwan University, Taiwan), Tianzhou Chen (Zhejiang University, China)
Pagepp. 668 - 675
AbstractWith the growing popularity of 3G-powered devices and their serious energy consumption problem, there are growing demands for energy-efficient data transmission strategies for various embedded systems. Different from past work on energy-efficient real-time task scheduling, we explore strategies to maximize the amount of data transmitted by a 3G module under a given battery capacity. In particular, we present algorithms under different workload configurations with and without timing constraint considerations. Experiments were then conducted to verify the validity of the strategies and develop insights into energy-efficient data transmission.


Session 8A  Modeling and Design for Variability
Time: 13:40 - 15:40 Friday, January 28, 2011
Location: Room 411+412
Chairs: Fedor G. Pikus (Mentor Graphics, U.S.A.), Hidetoshi Matsuoka (Fujitsu Laboratories, Japan)

8A-1 (Time: 13:40 - 14:10)
TitleRobust Spatial Correlation Extraction with Limited Sample via L1-Norm Penalty
AuthorMingzhi Gao, *Zuochang Ye, Dajie Zeng, Yan Wang, Zhiping Yu (Institute of Microelectronics, Tsinghua University, China)
Pagepp. 677 - 682
Keywordprocess variation, spatial correlation, kriging model, L1 regularization, least angle regression
AbstractRandom process variations are often composed of a location dependent part and a distance dependent correlated part. While an accurate extraction of process variation is a prerequisite of both process improvement and circuit performance prediction, it is not an easy task to characterize such a complicated spatial random process from a limited amount of silicon data. For this purpose, the kriging model was introduced to the silicon community. This work forms a modified kriging model with an L1-norm penalty which offers improved robustness. With the help of Least Angle Regression (LAR) in solving a core optimization sub-problem, this model can be characterized efficiently. Some promising results are presented with numerical experiments where a 3X improvement in model accuracy is shown.
Slides

8A-2 (Time: 14:10 - 14:40)
TitleDevice-Parameter Estimation with On-chip Variation Sensors Considering Random Variability
Author*Ken-ichi Shinkai, Masanori Hashimoto (Osaka University, Japan)
Pagepp. 683 - 688
Keywordvariation sensor, device-parameter extraction, process variability, die-to-die variation, within-die variation
AbstractDevice-parameter monitoring sensors inside a chip are gaining importance as post-fabrication tuning becomes practical. In the estimation of variational parameters using on-chip sensors, it is often assumed that the outputs of variation sensors are not affected by random variations. However, random variations can deteriorate the accuracy of the estimation result. In this paper, we propose a device-parameter estimation method with on-chip variation sensors that explicitly considers random variability. The proposed method derives the global variation parameters and the standard deviation of the random variability using maximum likelihood estimation. We experimentally verified that the proposed method can accurately estimate variations, whereas the estimation result deteriorates when random variations are neglected. We also demonstrate an application of the proposed method to test chips fabricated in a 65-nm process technology.
Slides

8A-3 (Time: 14:40 - 15:10)
TitleAccounting for Inherent Circuit Resilience and Process Variations in Analyzing Gate Oxide Reliability
AuthorJianxin Fang, *Sachin S. Sapatnekar (Department of ECE, University of Minnesota, U.S.A.)
Pagepp. 689 - 694
KeywordOxide Breakdown, Reliability Analysis, Process Variation
AbstractGate oxide breakdown is a major cause of reliability failures in future nanometer-scale CMOS designs. This paper develops an analysis technique that can predict the probability of a functional failure in a large digital circuit due to this phenomenon. Novel features of the method include its ability to account for the inherent resilience in a circuit to a breakdown event, while simultaneously considering the impact of process variations. Based on standard process variation models, at a specified time instant, this procedure determines the circuit failure probability as a lognormal distribution. Experimental results demonstrate this approach is accurate compared with Monte Carlo simulation, and gives 4.7-5.9x better lifetime prediction over existing methods that are based on pessimistic area-scaling models.

8A-4 (Time: 15:10 - 15:40)
TitleVariation-Tolerant and Self-Repair Design Methodology for Low Temperature Polycrystalline Silicon Liquid Crystal and Organic Light Emitting Diode Displays
Author*Chih-Hsiang Ho, Chao Lu, Debabrata Mohapatra, Kaushik Roy (ECE, Purdue University, U.S.A.)
Pagepp. 695 - 700
KeywordLTPS, LCD, OLED, Two-cycle, Variation
AbstractIn low temperature polycrystalline silicon (LTPS) based display technologies, the electrical parameter variations in thin film transistors (TFTs) caused by random grain boundaries (GBs) result in significant yield loss, thereby impeding their wide deployment. In this paper, from a system and circuit design perspective, we propose a new self-repair design methodology to compensate for the GB-induced variations in LTPS liquid crystal displays (LCDs) and active-matrix organic light emitting diode (AMOLED) displays. The key idea is to extend the charging time for detected low drivability pixel switches, thereby suppressing the brightness non-uniformity and eliminating the need for large voltage margins. The proposed circuit was implemented in VGA LCD panels which were used for prediction of power consumption and yield. Based on the simulation results, the proposed circuit decreases the required supply voltage by 20% without performance and yield degradation. A 7% yield enhancement is observed for high resolution, large sized LCDs while incurring a negligible power penalty. This technique enables LTPS-based displays either to further scale down the device size for higher integration and lower power consumption or to have superior yield in large sized panels with small power overhead.
Slides


Session 8B  Test for Reliability and Yield
Time: 13:40 - 15:40 Friday, January 28, 2011
Location: Room 413
Chairs: Yu Huang (Mentor Graphics, U.S.A.), Yoshinobu Higami (Ehime University, Japan)

8B-1 (Time: 13:40 - 14:10)
TitleA Physical-Location-Aware Fault Redistribution for Maximum IR-Drop Reduction
Author*Fu-Wei Chen, Shih-Liang Chen, Yung-Sheng Lin, TingTing Hwang (National Tsing Hua University, Taiwan)
Pagepp. 701 - 706
Keyworddelay fault test, at-speed testing, IR-drop
AbstractTo guarantee that an application-specific integrated circuit (ASIC) meets its timing requirements, at-speed scan testing has become an indispensable procedure for verifying the performance of the ASIC. However, at-speed scan testing suffers from test-induced yield loss. Because the switching activity in test mode is much higher than that in normal mode, the large switching-induced current causes severe IR-drop and increases gate delay. X-filling is the most commonly used technique to reduce the IR-drop effect during at-speed test. However, the effectiveness of X-filling depends on the number and the characteristics of the X-bit distribution. In this paper, we propose a physical-location-aware X-identification which re-distributes faults so that the maximum switching activity is guaranteed to be reduced after X-filling. The experimental results on ITC'99 benchmarks show that our method achieves, on average, 8.54% more reduction of the maximum IR-drop than a previous work which re-distributes X-bits evenly across all test vectors.
Slides

8B-2 (Time: 14:10 - 14:40)
TitleOn the Impact of Gate Oxide Degradation on SRAM Dynamic and Static Write-ability
Author*Vikas Chandra, Robert Aitken (ARM, U.S.A.)
Pagepp. 707 - 712
KeywordReliability, Gate oxide degradation, SRAM, Vmin
AbstractLow voltage operation of SRAM arrays is critical in reducing the power consumption of embedded microprocessors. The minimum voltage of operation, Vmin, can be limited by any combination of write failure, read disturb failure, access failure and/or retention failure. Of these, the write failure is often observed as the major Vmin limiter in sub-50nm processes. In addition, current generation transistors have high-k metal gates (HKMG), which are prone to degradation due to higher levels of electric field stress. The degradation increases Vmin due to an increase in dynamic write failures and, eventually, static write failures as the supply voltage decreases. We show that there exists a critical breakdown resistance (Rcrit) for a given supply voltage at which the SRAM write failure transitions from being dynamically limited to statically limited. For a 32nm low-power SRAM, the value of Rcrit increases by ∼9X as the supply voltage reduces from 1V to 0.7V. Further, we show that the commonly used SRAM write assist (WA) techniques do not lower Rcrit and can only improve the write-ability when the breakdown resistance, Rsbd, is larger than Rcrit.

8B-3 (Time: 14:40 - 15:10)
TitleA Self-Testing and Calibration Method for Embedded Successive Approximation Register ADC
AuthorXuan-Lun Huang, Ping-Ying Kang (National Taiwan University, Taiwan), Hsiu-Ming Chang (University of California, Santa Barbara, U.S.A.), *Jiun-Lang Huang (National Taiwan University, Taiwan), Yung-Fa Chou, Yung-Pin Lee, Ding-Ming Kwai, Cheng-Wen Wu (Industrial Technology Research Institute, Taiwan)
Pagepp. 713 - 718
Keywordmixed-signal testing, ADC testing, ADC calibration, SoC testing, successive approximation register (SAR) ADC
AbstractThis paper presents a self-testing and calibration method for the embedded successive approximation register (SAR) analog-to-digital converter (ADC). We first propose a low cost design-for-test (DfT) technique which tests a SAR ADC by characterizing its digital-to-analog converter (DAC) capacitor array. Utilizing DAC major carrier transition testing, the required analog measurement range is just 4 LSBs; this significantly lowers the test circuitry complexity. We also develop a fully-digital missing code calibration technique that utilizes the proposed testing scheme to collect the required calibration information. Simulation results are presented to validate the proposed technique.
Slides
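
The idea of testing a SAR ADC through its DAC capacitor array can be illustrated with a small sketch. It is not the authors' DfT circuit: it assumes an idealized binary-weighted array whose capacitor values are given in LSB units, and it only computes the step at each major-carry transition (code 2^k - 1 to code 2^k). The function names `major_carry_steps` and `missing_code_bits` are hypothetical. Under this simplified model, a negative step means the decision levels overlap, so the code just below the transition can never be produced.

```python
def major_carry_steps(weights):
    """Step size (in LSB) at each major-carry transition 2^k - 1 -> 2^k.

    weights[k] is the (possibly mismatched) capacitance of bit k in LSB
    units; the ideal binary-weighted value is 2**k and the ideal step is 1.
    """
    steps = []
    lower_sum = 0.0  # sum of all capacitors below bit k
    for w in weights:
        steps.append(w - lower_sum)
        lower_sum += w
    return steps


def missing_code_bits(weights):
    """Bits whose major-carry step is negative (assumed missing-code condition)."""
    return [k for k, step in enumerate(major_carry_steps(weights)) if step < 0]
```

For an ideal 4-bit array `[1, 2, 4, 8]` every step is 1 LSB; shrinking the MSB capacitor to 6.5 makes the last step -0.5 LSB, which this sketch flags at bit 3.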

8B-4 (Time: 15:10 - 15:40)
TitleOn-chip Dynamic Signal Sequence Slicing for Efficient Post-Silicon Debugging
Author*Yeonbok Lee, Takeshi Matsumoto, Masahiro Fujita (The University of Tokyo, Japan)
Pagepp. 719 - 724
KeywordPost-Silicon debugging, Functional Debugging, Dependency Analysis
AbstractIn post-silicon debugging, the low observability of internal signal values and the large volume of traces are considered the most critical problems. To address these problems, we propose an on-chip circuit named DSC (Dynamic Slicing Circuit) which outputs the input signal values that actually influence an erroneous output value in a particular execution of a chip by analyzing dependencies among signals. Since such input signal values are usually a small subset of the entire input sequence, we can reproduce the error by simulation using them. To realize DSC, we propose a variable named d-tag (Dependency Tag) representing the dependency of a signal value with respect to another signal value. To demonstrate our method, we prepared three design examples and implemented DSC circuits on them. As a result, we could successfully extract the input signal values that influenced the target output value from a number of random input sequences, in every case. We observed that the number of extracted input values was significantly smaller than that of the original sequence. The area overhead of the DSC circuits was also practical, 4% on average.
Slides


Session 8C  System-Level Power Optimization
Time: 13:40 - 15:40 Friday, January 28, 2011
Location: Room 414+415
Chairs: Masanori Muroyama (Tohoku University, Japan), Lih-Yih Chiou (National Cheng Kung University, Taiwan)

8C-1 (Time: 13:40 - 14:10)
TitleAVS-Aware Power-Gate Sizing for Maximum Performance and Power Efficiency of Power-Constrained Processors
AuthorAbhishek Sinkar, *Nam Sung Kim (University of Wisconsin-Madison, U.S.A.)
Pagepp. 725 - 730
KeywordMulti-Core Processors, Power-Gating, Adaptive-Voltage-Scaling
AbstractPower-gating devices incur a small amount of voltage drop across them when they are on in active mode, degrading the maximum frequency of processors. Thus, large power-gating devices are often implemented to minimize the drop (and thus the frequency degradation), requiring considerable die area. Meanwhile, adaptive voltage scaling has been used to improve the yield of power-constrained processors exhibiting a large spread of maximum frequency and total power due to process variations. In this paper, first, we analyze the impact of power-gating device size on both maximum frequency and total power of processors in the presence of process variation. Second, we propose a methodology that optimizes both the size of power-gating devices and the degree of adaptive voltage scaling jointly such that we minimize the device size while maximizing performance and power efficiency of power-constrained processors. Finally, we extend our analysis and optimization to multi-core processors adopting a frequency-island clocking scheme. Our experimental results using a 32nm technology model demonstrate that the joint optimization considering both die-to-die and within-die variations reduces the size of power-gating devices by more than 50% with 3% frequency improvement for power-constrained multi-core processors. Further, the optimal size of power-gating devices for multi-core processors using the frequency-island clocking scheme increases gradually while the optimal supply voltage decreases as the number of cores per die increases.
Slides

8C-2 (Time: 14:10 - 14:40)
TitleEnergy/Reliability Trade-offs in Fault-Tolerant Event-Triggered Distributed Embedded Systems
Author*Junhe Gan (Technical University of Denmark, Denmark), Flavius Gruian (Lund University, Sweden), Paul Pop, Jan Madsen (Technical University of Denmark, Denmark)
Pagepp. 731 - 736
KeywordReliability, Mapping, Energy Minimization, System-level Optimization
AbstractThis paper presents an approach to the synthesis of low-power fault-tolerant hard real-time applications mapped on distributed heterogeneous embedded systems. Our synthesis approach decides the mapping of tasks to processing elements, as well as the voltage and frequency levels for executing each task, such that transient faults are tolerated, the timing constraints of the application are satisfied, and the energy consumed is minimized. Tasks are scheduled using fixed-priority preemptive scheduling, while replication is used for recovering from multiple transient faults. Addressing energy and reliability simultaneously is especially challenging, since lowering the voltage to reduce the energy consumption has been shown to increase the transient fault rate. We present a Tabu Search-based approach which uses an energy/reliability trade-off model to find reliable and schedulable implementations with limited energy and hardware resources. We evaluate the proposed algorithm using several synthetic and real-life benchmarks.
Slides

8C-3 (Time: 14:40 - 15:10)
TitleProfile Assisted Online System-Level Performance and Power Estimation for Dynamic Reconfigurable Embedded Systems
AuthorJingqing Mu, *Roman Lysecky (University of Arizona, U.S.A.)
Pagepp. 737 - 742
Keywordperformance and power estimation, online estimation, dynamic reconfigurable systems
AbstractSignificant research has demonstrated the performance and power benefits of runtime dynamic reconfiguration of FPGAs and microprocessor/FPGA devices. For dynamically reconfigurable systems, in which the selection of hardware coprocessors to implement within the FPGA is determined at runtime, online estimation methods are needed to evaluate the performance and power consumption impact of the hardware coprocessor selection. In this paper, we present a profile assisted online system-level performance and power estimation framework for estimating the speedup and power consumption of dynamically reconfigurable embedded systems. We evaluate the accuracy and fidelity of our online estimation framework for dynamic hardware kernel selection to maximize performance or minimize system power consumption.
Slides

8C-4 (Time: 15:10 - 15:40)
TitleBattery-Aware Task Scheduling in Distributed Mobile Systems with Lifetime Constraint
AuthorJiayin Li, *Meikang Qiu (University of Kentucky, U.S.A.), Jian-wei Niu (Beihang University, China), Tianzhou Chen (Zhejiang University, China)
Pagepp. 743 - 748
KeywordBattery-aware, task scheduling, lifetime constraint
AbstractA distributed mobile system consists of a group of mobile devices with different computing powers. These devices are connected by a wireless network. Parallel processing in the distributed mobile system can provide high computing performance. Because most mobile devices are battery powered, the lifetime of the mobile system depends on both the battery behavior and the energy consumption characteristics of tasks. In this paper, we present a systematic system model for task scheduling in mobile systems equipped with Dynamic Voltage Scaling (DVS) processors and energy harvesting techniques. We propose battery-aware algorithms to obtain task schedules that give shorter total execution time while satisfying the battery lifetime constraints. Simulations with randomly generated Directed Acyclic Graphs (DAGs) show that our proposed algorithms generate better schedules which can satisfy the battery lifetime constraints.


Session 8D  Designers' Forum: State-of-The-Art SoCs and Design Methodologies
Time: 13:40 - 15:40 Friday, January 28, 2011
Location: Room 416+417
Organizer: Masaitsu Nakajima (Panasonic, Japan)

8D-1 (Time: 13:40 - 14:04)
Title(Invited Paper) Advanced System LSIs for Home 3D System
AuthorTakao Suzuki (Panasonic Corp., Japan)
Pagepp. 749 - 754
AbstractThe progress of digital video processing technology and LSI technology has been the driving force behind the creation of 3D systems, and various 3D products for the home have been released. 2010 became a historic year for in-home 3D. We developed a suite of system LSIs that was the key to realizing home 3D systems by applying UniPhier (Universal Platform for High-quality Image Enhancing Revolution), the integrated platform for digital CE. The system LSIs for 3D TV deliver high display speeds, and the main system LSI for 3D Blu-ray provides MPEG-4 MVC decoding. This paper describes the 3D technologies, home 3D systems and advanced system LSIs for the consumer market.
Slides

8D-2 (Time: 14:04 - 14:28)
Title(Invited Paper) Development of Low Power and High Performance Application Processor (T6G) for Multimedia Mobile Applications
AuthorYoshiyuki Kitasho, Yu Kikuchi, Takayoshi Shimazawa, Yasuo Ohara, Masafumi Takahashi, Yoshio Masubuchi, Yukihito Oowaki (Toshiba Corporation Semiconductor Company, Japan)
Pagepp. 755 - 759
AbstractTOSHIBA has developed a mobile application processor for multimedia mobile applications in 40 nm with an H.264 full high-definition (full-HD) video engine and a video/audio multiprocessor for various CODECs and image processing. The application processor has 25 power domains to achieve coarse-grain power gating that adjusts to the required performance of a wide range of multimedia applications. Furthermore, the application processor has a Stacked Chip SoC (SCS) DRAM I/F to achieve high memory bandwidth with low power consumption.
Slides

8D-3 (Time: 14:28 - 14:52)
Title(Invited Paper) Design Constraint of Fine Grain Supply Voltage Control LSI
AuthorAtsuki Inoue (Fujitsu Labs., Japan)
Pagepp. 760 - 765
AbstractA supply voltage control technique for realizing low power LSI is utilized not only for general purpose processors, but also for custom ASICs thanks to advanced LSI design environments. Fine grain supply voltage control in the time domain, in power gating and DVFS schemes, is seen as a promising technique to reduce power consumption. However, such control itself requires additional energy consumption. In this paper, we discuss energy consumption including this overhead using a simple circuit model and make it clear that the charging energy of the power supply line limits the minimum sleep duration or number of sleep cycles as a design constraint.
Slides
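
The constraint described in this abstract, that recharging the power supply line sets a minimum worthwhile sleep duration, can be made concrete with a back-of-the-envelope sketch. This is not the circuit model from the paper: it assumes the gated supply line discharges fully during sleep, that recharging a capacitance C to Vdd draws C * Vdd^2 from the source, and that sleeping saves a constant leakage power. The function name `min_sleep_duration` and its parameters are hypothetical.

```python
def min_sleep_duration(c_rail_farads, vdd_volts, leakage_watts):
    """Break-even sleep time in seconds under a simple lumped model.

    Sleeping pays off only if the leakage energy saved exceeds the energy
    needed to recharge the gated supply line when the block wakes up.
    """
    if leakage_watts <= 0:
        raise ValueError("leakage power must be positive")
    # Energy drawn from the supply to recharge the line from 0 V to Vdd.
    recharge_energy = c_rail_farads * vdd_volts ** 2
    return recharge_energy / leakage_watts
```

For example, a 1 nF supply line at 1.0 V with 1 mW of leakage gives a break-even time of about 1 microsecond; shorter idle periods would cost more energy than they save under these assumptions.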

8D-4 (Time: 14:52 - 15:16)
Title(Invited Paper) FPGA Prototyping using Behavioral Synthesis for Improving Video Processing Algorithm and FHD TV SoC Design
AuthorMasaru Takahashi (SoC Software Platform Division, Renesas Electronics Corporation, Japan)
Pagepp. 766 - 769
AbstractA System on Chip (SoC) can include Full High Definition (FHD) video processing; however, the turnaround time of algorithm improvement has been long. We provide a new method utilizing behavioral synthesis. With this method, the turnaround time of algorithm improvement and hardware implementation can be shortened.
Slides

8D-5 (Time: 15:16 - 15:40)
Title(Invited Paper) An RTL-to-GDS2 Design Methodology for Advanced System LSI
AuthorNobuyuki Nishiguchi (Semiconductor Technology Academic Research Center, Japan)
Pagepp. 770 - 774
AbstractSTARC is developing an RTL-to-GDS2 design methodology for 32nm (and 28nm) system LSIs called STARCAD-CEL. The design methodology focuses on four key areas: low power design, variation aware design, and design for manufacturability, as well as design productivity. This paper examines several techniques we used to solve issues in the design of challenging, leading edge devices. It also describes the effectiveness of the STARCAD-CEL design methodology when applied to the four key areas.
Slides


Session 9A  Printability and Mask Optimization
Time: 16:00 - 18:00 Friday, January 28, 2011
Location: Room 411+412
Chairs: Murakata Masami (STARC, Japan), Zheng Shi (Zhejiang University, China)

9A-1 (Time: 16:00 - 16:30)
TitleHigh Performance Lithographic Hotspot Detection using Hierarchically Refined Machine Learning
AuthorDuo Ding (University of Texas at Austin, U.S.A.), Andres Torres, Fedor Pikus (Mentor Graphics Corp., U.S.A.), *David Pan (University of Texas at Austin, U.S.A.)
Pagepp. 775 - 780
Keywordlithography hotspot detection, high performance, hierarchical machine learning, real manufacturing conditions, lithography friendly design
AbstractUnder real and continuously improving manufacturing conditions, lithography hotspot detection faces several key challenges. First, real hotspots become fewer but harder to fix at post-layout stages; second, the false alarm rate must be kept low to avoid excessive and expensive post-processing hotspot removal; third, full chip physical verification and optimization require fast turn-around time. To address these issues, we propose a high performance lithographic hotspot detection flow with ultra-fast speed and high fidelity. It consists of a novel set of hotspot signature definitions and a hierarchically refined detection flow with powerful machine learning kernels, ANN (artificial neural network) and SVM (support vector machine). We have implemented our algorithm with an industry-strength engine under real manufacturing conditions in a 45nm process, and showed that it significantly outperforms previous state-of-the-art algorithms in hotspot detection false alarm rate (2.4X to 2300X reduction) and simulation run-time (5X to 237X reduction), while achieving similar or slightly better hotspot detection accuracies. Such high performance lithographic hotspot detection under real manufacturing conditions is especially suitable for guiding lithography friendly physical design.

9A-2 (Time: 16:30 - 17:00)
TitleRapid Layout Pattern Classification
Author*Jen-Yi Wuu (University of California, Santa Barbara, U.S.A.), Fedor G. Pikus, Andres Torres (Mentor Graphics Corporation, U.S.A.), Malgorzata Marek-Sadowska (University of California, Santa Barbara, U.S.A.)
Pagepp. 781 - 786
KeywordHotspot detection, Machine learning, Design for manufacturability
AbstractPrintability of layout objects becomes increasingly dependent on neighboring shapes within a larger and larger context window. In this paper, we propose a two-level hotspot pattern classification methodology that examines both central and peripheral patterns. Accuracy and runtime enhancement techniques are proposed, making our detection methodology robust and efficient as a fast physical verification tool that can be applied during early design stages to large-scale designs. We position our method as an approximate detection solution, similar to pattern matching-based tools widely adopted by the industry. In addition, our analyses of classification results reveal that the majority of non-hotspots falsely predicted as hotspots have printed CD barely over the minimum allowable CD threshold. Our method is verified on several 45nm and 32nm industrial designs.
Slides

9A-3 (Time: 17:00 - 17:30)
TitleMask Cost Reduction with Circuit Performance Consideration for Self-Aligned Double Patterning
AuthorHongbo Zhang, Yuelin Du, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.), Kai-Yuan Chao (Intel Corporation, U.S.A.)
Pagepp. 787 - 792
KeywordDense Line Cut, Cost reduction, Polygon simplification, Self-Aligned Double Patterning
AbstractDouble patterning lithography (DPL) is the enabling technology for printing in sub-32nm nodes. In the EDA literature, researchers have been focusing on double-exposure double-patterning (DEDP) DPL for printing arbitrary 2D features, where the layout decomposition problem for double exposure is an interesting graph coloring problem. But due to overlay errors, it is very difficult for DEDP to print even 1D features. A more promising DPL technology is self-aligned double patterning (SADP) for 1D design. SADP first prints dense lines and then trims away the portions not in the design using a cut mask. The complexity of the cut mask is very high, adding to the sky-rocketing manufacturing cost. In this paper we present a mask cost reduction method with circuit performance consideration for SADP. This is the first paper to focus on the mask cost reduction issue for SADP from a design perspective. We simplify the polygons on the cut mask by formulating the problem as a constrained shortest path problem. Experimental results show that with a set of layouts in 28nm technology, we can largely reduce the complexity of cut polygons, with little impact on performance.

9A-4 (Time: 17:30 - 18:00)
TitlePost-Routing Layer Assignment for Double Patterning
Author*Jian Sun (State Key Lab. of ASIC & System, Microelectronics Department, Fudan University, China), Yinghai Lu, Hai Zhou (EECS Dept., Northwestern University, U.S.A.), Xuan Zeng (State Key Lab. of ASIC & System, Microelectronics Department, Fudan University, China)
Pagepp. 793 - 798
KeywordDouble Patterning, Layer Assignment, NP-hard, Algorithm
AbstractDouble patterning lithography, where a one-layer layout is decomposed into two masks, is believed to be inevitable for the 32nm technology node of the ITRS roadmap. However, post-routing layer assignment, which decides the layout pattern on each layer and thus has great impact on double patterning related parameters, has not been explored in the context of double patterning. In this paper, we propose a post-routing layer assignment algorithm for double patterning optimization. Our solution consists of three major phases: multi-layer assignment, single-layer double patterning, and via reduction. For phases one and three, a multi-layer graph is constructed and dynamic programming is employed to solve the optimization problem on this graph. In the second phase, single-layer double patterning is proved NP-hard and an existing algorithm is implemented to optimize the single-layer double patterning problem. The proposed method is tested on CBL (Collaborative Benchmarking Laboratory) benchmarks and shows great performance. In comparison with single-layer double patterning, our method achieves 73% and 27% average reduction in unresolvable conflicts and stitches respectively, with only a 9% increase in via number. When double patterning is constrained to only the bottom two metal layers as in current technology, these numbers become 62%, 8% and 0.42%.
Slides
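
As a rough illustration of the mask decomposition idea in the abstract above, the following sketch assigns layout features to two masks by breadth-first two-coloring of a conflict graph and reports the conflicts left unresolved. It is not the authors' algorithm: the function name `assign_masks`, the integer feature indices, and the conflict-edge input are assumptions made for illustration, and stitches and layer assignment are ignored.

```python
from collections import deque

def assign_masks(num_features, conflicts):
    """Greedy two-mask assignment (hypothetical helper, illustration only).

    conflicts: list of (a, b) pairs of feature indices that are assumed to be
    too close to print on the same mask.
    Returns (mask, unresolved): mask[i] is 0 or 1, and unresolved lists the
    conflict pairs whose endpoints ended up on the same mask.
    """
    adjacency = [[] for _ in range(num_features)]
    for a, b in conflicts:
        adjacency[a].append(b)
        adjacency[b].append(a)

    mask = [None] * num_features
    for start in range(num_features):
        if mask[start] is not None:
            continue
        mask[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if mask[v] is None:
                    # Put each uncolored neighbor on the opposite mask.
                    mask[v] = 1 - mask[u]
                    queue.append(v)

    # An odd cycle in the conflict graph leaves at least one conflict here.
    unresolved = [(a, b) for a, b in conflicts if mask[a] == mask[b]]
    return mask, unresolved
```

A chain of conflicts such as `[(0, 1), (1, 2)]` decomposes cleanly, while a triangle `[(0, 1), (1, 2), (0, 2)]` always leaves one conflict unresolved, which is the kind of case that stitches or reassignment to another layer would have to address.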


Session 9B  Emerging Solutions in Scan Testing
Time: 16:00 - 18:00 Friday, January 28, 2011
Location: Room 413
Chairs: Seiji Kajihara (Kyusyu Institute of Technology, Japan), Ting Ting Hwang (National Tsing Hua University, Taiwan)

9B-1 (Time: 16:00 - 16:30)
TitleFault Simulation and Test Generation for Clock Delay Faults
Author*Yoshinobu Higami, Hiroshi Takahashi, Shin-ya Kobayashi (Ehime University, Japan), Kewal K. Saluja (University of Wisconsin-Madison, U.S.A.)
Pagepp. 799 - 805
KeywordTest generation, Fault simulation, Delay faults, LSI testing
AbstractIn this paper, we investigate the effects of delay faults on clock lines under the launch-on-capture test strategy. We first show simulation results relating the duration of the delay to the difficulty of detecting such faults in the launch-on-capture test. Next, we propose test generation methods to detect such clock delay faults, and show some experimental results to establish the effectiveness of our methods.
Slides

9B-2 (Time: 16:30 - 17:00)
TitleCompression-Aware Capture Power Reduction for At-Speed Testing
Author*Jia Li (Tsinghua University, China), Qiang Xu (The Chinese University of Hong Kong, Hong Kong), Dong Xiang (Tsinghua University, China)
Pagepp. 806 - 811
Keywordlow power testing, test compression, co-optimization, at-speed testing
AbstractTest compression has become a de facto technique in VLSI testing. Meanwhile, excessive capture power of at-speed testing has also become a serious concern. Therefore, it is important to co-optimize test power and compression ratio in at-speed testing. In this paper, a novel X-filling framework is proposed to reduce capture power of at-speed testing for different test compression schemes. The proposed technology has been validated by the experimental results on larger ITC'99 benchmark circuits.
Slides

9B-3 (Time: 17:00 - 17:30)
TitleFault Diagnosis Aware ATE Assisted Test Response Compaction
AuthorJoseph Howard, *Sudhakar M Reddy (University of Iowa, U.S.A.), Irith Pomeranz (Purdue University, U.S.A.), Bernd Becker (University of Freiburg, Germany)
Pagepp. 812 - 817
KeywordDiagnosis, Test response compaction, ATE assisted, Direct diagnosis, Multiple faults
AbstractRecently a new method called ATE assisted compaction for achieving test response compaction has been proposed. The method relies on testers to achieve additional compaction, without compromising fault coverage, beyond what may already be achieved using on-chip response compactors. The method does not add logic, modify the circuit under test, or require additional tests, and thus can be used with any design, including legacy designs. In this work, we enhance this method so that the level of diagnostic resolution achieved without it can be maintained. Experimental results on larger ISCAS-89 benchmark circuits show that additional test response compaction can be achieved while diagnostic resolution for single and double stuck-at faults is not adversely impacted by the procedure.
Slides

9B-4 (Time: 17:30 - 18:00)
TitleSecure Scan Design Using Shift Register Equivalents against Differential Behavior Attack
Author*Hideo Fujiwara (Nara Institute of Science and Technology, Japan), Katsuya Fujiwara, Hideo Tamamoto (Akita University, Japan)
Pagepp. 818 - 823
KeywordDesign for testability, Scan design, Security, Testability, Scan-based side-channel attack
AbstractIn this paper, we consider a scan-based side-channel attack called differential-behavior attack and propose several classes of SR-equivalent scan circuits using dummy flip-flops in order to protect against the scan-based differential-behavior attack. To show the security level of those extended scan circuits, we introduce a differential-behavior equivalence relation, and clarify the number of SR-equivalent extended scan circuits, the number of differential-behavior equivalence classes and the cardinality of those equivalence classes.
Slides


Session 9C  Clock and Package
Time: 16:00 - 18:00 Friday, January 28, 2011
Location: Room 414+415
Chair: Yasuhiro Takashima (University of Kitakyushu, Japan)

9C-1 (Time: 16:00 - 16:30)
TitleAn Efficient Algorithm of Adjustable Delay Buffer Insertion for Clock Skew Minimization in Multiple Dynamic Supply Voltage Designs
AuthorKuan-Yu Lin, *Hong-Ting Lin, Tsung-Yi Ho (National Cheng Kung University, Taiwan)
Pagepp. 825 - 830
KeywordClock Skew Minimization, Power Mode, Multiple Dynamic Supply Voltage, Adjustable Delay Buffer, Post-Silicon Tuning
AbstractPower consumption is known to be a crucial issue in current IC designs. To tackle this problem, multiple dynamic supply voltage (MDSV) designs are proposed as an efficient solution in modern IC designs. However, the increasing variability of clock skew during the switching of power modes complicates clock skew reduction in MDSV designs. In this paper, we propose a tunable clock tree structure by adopting adjustable delay buffers (ADBs). The ADBs can be used to produce additional delays, hence the clock latencies and skew become tunable in a clock tree. Given a buffered clock tree as input, ADBs with delay value assignments are inserted to reduce clock skew in MDSV designs. An efficient algorithm of ADB insertion for the minimization of clock skew, area, and runtime in MDSV designs is presented. Compared with the state-of-the-art algorithm, experimental results show a maximum 42.40% area overhead improvement and 117.84x runtime speedup.
Slides
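
To make the notion of tunable skew in the abstract above concrete, the sketch below computes the clock skew in each power mode and the extra delay that would equalize the sink latencies in that mode. It is a simplified illustration, not the paper's insertion algorithm: it assumes one ideal ADB per clock sink with unlimited delay range, and the names `skew` and `adb_delays_per_mode` and the latency values in the usage note are hypothetical.

```python
def skew(latencies):
    """Clock skew: difference between the largest and smallest sink latency."""
    return max(latencies) - min(latencies)

def adb_delays_per_mode(latencies_by_mode):
    """Extra delay per sink and per power mode that equalizes latencies.

    latencies_by_mode: dict mapping a power-mode name to a list of sink
    latencies (assumed input format, same sink order in every mode).
    Assumes an ideal adjustable delay buffer at every sink.
    """
    delays = {}
    for mode, latencies in latencies_by_mode.items():
        target = max(latencies)  # slow every sink down to the latest arrival
        delays[mode] = [target - t for t in latencies]
    return delays
```

For example, with latencies `{"high": [10, 12, 11], "low": [15, 14, 18]}` the skews are 2 and 4, and the per-mode delay assignments `[2, 0, 1]` and `[3, 4, 0]` bring both to zero. A practical method would place far fewer ADBs, at internal tree nodes, which is where the area savings reported in the abstract would come from.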

9C-2 (Time: 16:30 - 17:00)
TitleAn Integer Programming Placement Approach to FPGA Clock Power Reduction
Author*Alireza Rakhshanfar, Jason Anderson (University of Toronto, Canada)
Pagepp. 831 - 836
KeywordFPGAs, power, placement, ILP, clock signals
AbstractClock signals are responsible for a significant portion of dynamic power in FPGAs owing to their high toggle frequency and capacitance. Clock signals are distributed to loads through a programmable routing tree network, designed to provide low delay and low skew. The placement step of the FPGA CAD flow plays a key role in influencing clock power, as clock tree branches are connected based solely on the placement of the clock loads. In this paper, we present a placement-based approach to clock power reduction based on an integer linear programming (ILP) formulation. Our technique is intended to be used as an optimization post-pass executed after traditional placement, and it offers fine-grained control of the amount by which clock power is optimized versus other placement criteria. Results show that the proposed technique reduces clock network capacitance by over 50% with minimal deleterious impact on post-routed wirelength and circuit speed.
Slides

9C-3 (Time: 17:00 - 17:30)
TitleRow-Based Area-Array I/O Design Planning in Concurrent Chip-Package Design Flow
AuthorRen-Jie Lee, *Hung-Ming Chen (National Chiao Tung University, Taiwan)
Pagepp. 837 - 842
KeywordArea-Array IC Design, Preliminary I/O-Bump Planning, Chip-Package Feasibility Study
AbstractIC-centric design flow has been a common paradigm when designing and optimizing a system. Package and board/system designs usually follow almost-ready chip designs, which causes long turn-around time when communicating with package and system houses. In this paper, the realizations of area-array I/O design methodologies are studied. Different from the IC-centric flow, we propose a chip-package concurrent design flow to shorten the design time. Along with the flow, we design the I/O-bump (and P/G-bump) tile, which combines I/O (and P/G) and bump into a hard macro with the considerations of I/O power connection and electrostatic discharge (ESD) protection. We then employ an I/O-row based scheme to place I/O-bump tiles with existing metal layers. Such a scheme reduces the effort in I/O placement legalization and redistribution layer (RDL) routing. With the emphasis on package design awareness, the proposed methods map package balls onto chip I/Os, thus providing an opportunity to design chip and package in parallel. Due to this early study of I/O and bump planning, faster convergence can be expected with the concurrent design flow. The results are encouraging and the merits of this flow are reassuring.
Slides

9C-4 (Time: 17:30 - 18:00)
TitleA Provably Good Approximation Algorithm for Rectangle Escape Problem with Application to PCB Routing
AuthorQiang Ma, Hui Kong, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.), Evangeline F. Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 843 - 848
KeywordPCB Routing, NP Complete, Approximation, Algorithm
AbstractIn this paper, we introduce and study the Rectangle Escape Problem (REP) which is motivated by PCB bus escape routing. Given a rectangular region R and a set S of rectangles within R, REP is to choose a direction for each rectangle to escape to the boundary of R, such that the resultant maximum density over R is minimized. We prove that REP is NP-Complete, and show that REP can be formulated as an Integer Linear Program (ILP). A provably good approximation algorithm for REP is developed by applying Linear Programming (LP) relaxation and a special rounding technique to the ILP. This approximation algorithm is also shown to work for a more general version of REP with weights (Weighted REP). In addition, an iterative refinement procedure is proposed as a postprocessing step to further improve the results. Our approach is tested on a set of industrial PCB bus escape routing problems. Experimental results show that the optimal solution can be obtained within 3 seconds for each of the test cases.
Slides
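
The problem statement in the abstract above can be illustrated with a small exhaustive search. The sketch below is not the LP-relaxation-and-rounding algorithm of the paper: it simply tries every direction assignment, so it is only usable for a handful of rectangles. The unit-cell grid model of R, the half-open `(x1, y1, x2, y2)` rectangle format, the choice to count a rectangle's own area in its escape region, and all function names are assumptions made for illustration.

```python
from itertools import product

def escape_region(rect, direction, width, height):
    """Area swept when rect escapes to the boundary of R in one direction."""
    x1, y1, x2, y2 = rect
    if direction == "L":
        return (0, y1, x2, y2)
    if direction == "R":
        return (x1, y1, width, y2)
    if direction == "D":
        return (x1, 0, x2, y2)
    if direction == "U":
        return (x1, y1, x2, height)
    raise ValueError(f"unknown direction: {direction}")

def max_density(regions, width, height):
    """Largest number of escape regions covering any unit cell of R."""
    grid = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in regions:
        for y in range(y1, y2):
            for x in range(x1, x2):
                grid[y][x] += 1
    return max(max(row) for row in grid)

def solve_rep_brute_force(rects, width, height):
    """Try all 4^n direction assignments; return (best density, directions)."""
    best = None
    for directions in product("LRDU", repeat=len(rects)):
        regions = [escape_region(r, d, width, height)
                   for r, d in zip(rects, directions)]
        density = max_density(regions, width, height)
        if best is None or density < best[0]:
            best = (density, directions)
    return best
```

With two rectangles side by side in a 6 x 6 region, `solve_rep_brute_force([(1, 2, 2, 3), (3, 2, 4, 3)], 6, 6)` finds an assignment of density 1, whereas sending both to the right would give density 2. The exponential search space is consistent with the NP-completeness result stated in the abstract and motivates an approximation approach.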


Session 9D  Designers' Forum: Advanced Packaging and 3D Technologies
Time: 16:00 - 18:00 Friday, January 28, 2011
Location: Room 416+417
Organizer: Yoshio Masubuchi (Toshiba, Japan)

9D-1 (Time: 16:00 - 17:30)
Title(Panel Discussion) Advanced Packaging and 3D Technologies
AuthorOrganizer: Yoshio Masubuchi (Toshiba, Japan), Moderator: Kenichi Osada (Hitachi, Japan), Panelists: Geert Van der Plas (IMEC, Belgium), Hirokazu Ezawa (Toshiba, Japan), Yasumitsu Orii (IBM, Japan), Yoichi Hiruta (J-Devices, Japan), Chris Cheung (Cadence Design Systems, U.S.A.)
Abstract3D packaging is a key technology for meeting the growing demand for highly integrated systems and memory. The panel session explores the technologies of three-dimensional stacked chips and discusses the challenges in designing and testing such integrated chips.
Slides