
The 13th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule


Tuesday, January 22, 2008

08:30 - 09:00  Op (Room 409): Opening Ceremony
09:00 - 10:00  1K (Room 409): Keynote Session I
10:00 - 10:15  Coffee Break
10:15 - 12:20  1A (Room 310A): New Challenges in High Level Synthesis
10:15 - 12:20  1B (Room 310BC): Power and Thermal Modeling and Optimization
10:15 - 12:20  1C (Room 311A): Emerging Technologies
10:15 - 12:20  1D (Room 311BC): University LSI Design Contest
12:20 - 13:30  Lunch
13:30 - 15:35  2A (Room 310A): Advanced Topic in Logic Synthesis
13:30 - 15:35  2B (Room 310BC): Interconnect Modeling and Simulation Techniques
13:30 - 15:35  2C (Room 311A): Floorplanning
13:30 - 15:35  2D (Room 311BC): Special Session - Tackling Manufacturability/Variability for 32nm and Below
15:35 - 15:50  Coffee Break
15:50 - 17:55  3A (Room 310A): Routing
15:50 - 17:30  3B (Room 310BC): Interconnect, NoCs, and MPSoCs
15:50 - 17:55  3D (Room 311A+311BC): Special Session (Panel) The Tears and Joy of Sowing and Reaping Complex SoC's



Wednesday, January 23, 2008

09:00 - 10:00  2K (Room 409): Keynote Session II
10:00 - 10:15  Coffee Break
10:15 - 12:20  4A (Room 310A): Variability Issues in Timing
10:15 - 12:20  4B (Room 310BC): Memory and Processor Optimization
10:15 - 12:20  4C (Room 311A): New Techniques for Physical Design Optimization
10:15 - 12:20  4D (Room 311BC): Designers' Forum - New Emerging Application Areas for Future SoC
12:20 - 13:30  Lunch
13:30 - 15:35  5A (Room 310A): Techniques for Formal and Simulation-Based Verification
13:30 - 15:35  5B (Room 310BC): Power and Performance Optimization for Embedded Systems
13:30 - 15:35  5C (Room 311A): Thermal Analysis and DFM
13:30 - 15:35  5D (Room 311BC): Designers' Forum (Panel) Are System Level EDA Tools/Methodologies Coming?
15:35 - 15:50  Coffee Break
15:50 - 17:55  6A (Room 310A): Trends in Timing
15:50 - 17:55  6B (Room 310BC): Statistical Modeling and Yield Prediction
15:50 - 17:55  6D (Room 311A+311BC): Special Session - How to Design Cool Chips for Hot Products



Thursday, January 24, 2008

09:00 - 10:00  3K (Room 409): Keynote Session III
10:00 - 10:15  Coffee Break
10:15 - 12:20  7A (Room 310A): Reliable/Testable Design Techniques
10:15 - 12:20  7B (Room 310BC): Communication and Interfaces
10:15 - 12:20  7C (Room 311A): Power: Delivery and Reduction
10:15 - 12:20  7D (Room 311BC): Special Session (Panel) Concurrent SoC and SiP Designs
12:20 - 13:30  Lunch
13:30 - 15:35  8A (Room 310A): Test Generation and Test Power
13:30 - 15:35  8B (Room 310BC): Design Space Exploration
13:30 - 15:35  8C (Room 311A): Reliability and Power Management
13:30 - 15:35  8D (Room 311BC): Designers' Forum - Low Power Chips
15:35 - 15:50  Coffee Break
15:50 - 17:55  9A (Room 310A): Analog/RF/Mixed Signal CAD
15:50 - 17:55  9B (Room 310BC): Architecture Exploration
15:50 - 17:55  9D (Room 311BC): Designers' Forum (Panel) Best Ways to Use Billions of Devices on a Chip



List of Papers


Tuesday, January 22, 2008

Session 1K  Keynote Session I
Time: 09:00 - 10:00 Tuesday, January 22, 2008
Location: Room 409
Chair: Chong-Min Kyung (KAIST, Republic of Korea)

1K-1 (Time: 09:00 - 10:00)
Title: (Keynote Address) A Brand New Wireless Day
Author: *Jan M. Rabaey (Univ. of California, Berkeley, United States)
Page: p. 1


Session 1A  New Challenges in High Level Synthesis
Time: 10:15 - 12:20 Tuesday, January 22, 2008
Location: Room 310A
Chairs: Taewhan Kim (Seoul National University, Republic of Korea), Kazutoshi Wakabayashi (NEC Corp., Japan)

1A-1 (Time: 10:15 - 10:40)
Title: Variability-Driven Module Selection with Joint Design Time Optimization and Post-Silicon Tuning
Author: Feng Wang, *Xiaoxia Wu, Yuan Xie (Pennsylvania State University, United States)
Page: pp. 2 - 9
Keyword: module selection, design optimization, high level synthesis, delay variations
Abstract: Increasing delay and power variations are significant challenges for designers as technology scales into the deep sub-micron (DSM) regime. Traditional module selection techniques in high level synthesis use worst-case delay/power information to perform the optimization and may therefore be too pessimistic. In this paper, we propose a module selection algorithm that combines design-time optimization with post-silicon tuning (using adaptive body biasing) to maximize design yield. We develop a fast and efficient method for computing performance and power yield gradients. The post-silicon optimization is formulated as an efficient sequential conic program that determines the optimal body bias distribution, which in turn affects design-time module selection. To the best of our knowledge, this is the first variability-driven high level synthesis technique that considers post-silicon tuning during design-time optimization.

1A-2 (Time: 10:40 - 11:05)
Title: Behavioral Synthesis with Activating Unused Flip-Flops for Reducing Glitch Power in FPGA
Author: *Cheng-Tao Hsieh (Nat'l Tsing Hua Univ., Taiwan), Jason Cong, Zhiru Zhang (Univ. of California, Los Angeles, United States), Shih-Chieh Chang (Nat'l Tsing Hua Univ., Taiwan)
Page: pp. 10 - 15
Keyword: low power, behavioral synthesis, FPGA
Abstract: In this paper we discuss optimizing the interconnect power of designs implemented on FPGA platforms. In particular, we reduce the glitch power on interconnects associated with the outputs of functional units in a design. The idea is to activate unused flip-flops to block the propagation of glitches, taking advantage of the abundant flip-flops in modern FPGA structures. Since activating additional flip-flops may cause data hazards, we develop several effective behavioral synthesis techniques to prevent them. We also study the optimality of our techniques. Experimental results show that, on average, our methods reduce dynamic power by 28% on the Xilinx Virtex-II platform.

1A-3 (Time: 11:05 - 11:30)
Title: A Multicycle Communication Architecture and Synthesis Flow for Global Interconnect Resource Sharing
Author: Wei-Sheng Huang, Yu-Ru Hong, Juinn-Dar Huang, *Ya-Shih Huang (National Chiao Tung University, Taiwan)
Page: pp. 16 - 21
Keyword: multicycle communication architecture, distributed register file, interconnect, high-level synthesis, resource sharing
Abstract: In deep submicron technology, wire delay is no longer negligible and increasingly dominates system latency. Some state-of-the-art architectural synthesis flows adopt the distributed register (DR) architecture to cope with this increasing latency. The DR architecture, although it allows multicycle communication, introduces extra interconnect resource overhead. In this paper, we propose the Regular Distributed Register - Global Resource Sharing (RDR-GRS) architecture to enable global sharing of interconnects and registers. Based on the RDR-GRS architecture, we further define the channel and register allocation problem as a path scheduling problem of data transfers. A formal and flexible formulation of this problem is then presented and solved optimally by Integer Linear Programming (ILP). Experimental results show that, compared to previous work, RDR-GRS/ILP reduces wires by 58% and registers by 35% on average.

1A-4 (Time: 11:30 - 11:55)
Title: Scheduling with Integer Time Budgeting for Low-Power Optimization
Author: Wei Jiang, Zhiru Zhang, Miodrag Potkonjak, *Jason Cong (Univ. of California, Los Angeles, United States)
Page: pp. 22 - 27
Keyword: Behavioral Synthesis, Scheduling, Delay Budgeting
Abstract: In this paper we present a mathematical programming formulation of the integer time budgeting problem for directed acyclic graphs. In particular, we formally prove that our constraint matrix has a special property that enables a polynomial-time algorithm to solve the problem optimally with a guaranteed integral solution. Our theory can be directly applied to a scheduling problem in behavioral synthesis whose objective is to minimize system power consumption. Given a set of scheduling constraints and a collection of convex power-delay tradeoff curves for each type of operation, our scheduler assigns the operations to appropriate clock cycles and simultaneously selects the module implementations that lead to low-power solutions. Experiments demonstrate that our proposed technique produces near-optimal results (within 6% of the optimum obtained by the ILP formulation) with a speedup of more than 40x.

1A-5 (Time: 11:55 - 12:08)
Title: REWIRED - Register Write Inhibition by Resource Dedication
Author: *Pushkar Tripathi, Rohan Jain (Indian Inst. of Tech. Delhi, India), Srikanth Kurra (Oracle, India), Preeti Ranjan Panda (Indian Inst. of Tech. Delhi, India)
Page: pp. 28 - 31
Keyword: behavioural synthesis, register allocation, low power
Abstract: We propose REWIRED (REgister Write Inhibition by REsource Dedication), a technique for reducing power during high level synthesis (HLS) by selectively inhibiting the storage of function unit (FU) output data into registers. Registers are generally inferred in HLS when data produced in one clock cycle is used in a later cycle. However, when it can be established that the input registers of an FU do not change value during a certain period, the outputs during this period can be read directly off the FU output pins without storing them in registers. When the lifetimes of such data are short, it may be possible to eliminate the register storage operation entirely, thereby reducing power. We present a genetic algorithm formulation and a heuristic for maximizing the number of register stores that can be inhibited in a scheduled data flow graph (DFG) during behavioral synthesis.

1A-6 (Time: 12:08 - 12:21)
Title: An Efficient Performance Improvement Method Utilizing Specialized Functional Units in Behavioral Synthesis
Author: *Tsuyoshi Sadakata, Yusuke Matsunaga (Kyushu University, Japan)
Page: pp. 32 - 35
Keyword: Behavioral Synthesis, Specialized Functional Unit, Module Selection, Scheduling, Functional Unit Allocation
Abstract: This paper proposes a novel behavioral synthesis method that improves the performance of synthesized circuits by utilizing specialized functional units efficiently. Almost all conventional methods cannot utilize specialized functional units efficiently under a total area constraint because of their limited flexibility in resource sharing. With the proposed method, the module selection, scheduling, and allocation problems under a total area constraint with specialized functional units can be solved in practical time. Experimental results show that the proposed method reduces the number of cycles by up to 35% and by 14% on average, in practical time.


Session 1B  Power and Thermal Modeling and Optimization
Time: 10:15 - 12:20 Tuesday, January 22, 2008
Location: Room 310BC
Chairs: Joerg Henkel (Karlsruhe University, Germany), Eui-Young Chung (Yonsei University, Republic of Korea)

1B-1 (Time: 10:15 - 10:40)
Title: Predictive Power Aware Management for Embedded Mobile Devices
Author: *Young-Si Hwang, Sung-Kwan Ku, Chan-Min Jung, Ki-Seok Chung (Hanyang University, Republic of Korea)
Page: pp. 36 - 41
Keyword: Low Power, Embedded System, Dynamic Power Management
Abstract: Intelligent power management of mobile devices is becoming more important as ubiquitous computing becomes part of daily life. Power-aware system management relies on techniques for collecting and analyzing information on the status of I/O devices or processors while an application is running. However, the overhead of collecting this information in software while the system is running is so large that system performance may be severely degraded. It is therefore crucial to design a PMU (power management unit) that collects the information in hardware so that system performance is not degraded. In this paper, we propose a novel PMU design that collects I/O device information while an application is running; power-aware management is then carried out based on the collected information. Experiments with various applications demonstrate the effectiveness of our design.

1B-2 (Time: 10:40 - 11:05)
Title: A Dynamic-Programming Algorithm for Reducing the Energy Consumption of Pipelined System-Level Streaming Applications
Author: N. Liveris, *H. Zhou (Northwestern University, United States), P. Banerjee (HP Labs, United States)
Page: pp. 42 - 48
Keyword: energy, power gating, streaming, pipeline
Abstract: In this paper we present a system-level technique for reducing energy consumption. The technique is applicable to pipelined applications represented as chain-structured graphs and targets the energy overhead of switching between active and sleep modes. The overhead is reduced by increasing the number of consecutive executions of the pipeline stages. The technique has no impact on the average throughput. We derive upper bounds on the number of consecutive executions and present a dynamic-programming algorithm that finds the optimal solution using these bounds. For specific cases we derive a quality metric that can be used to trade result quality for running time.

1B-3 (Time: 11:05 - 11:30)
Title: Temperature-Aware MPSoC Scheduling for Reducing Hot Spots and Gradients
Author: *Ayse Kivilcim Coskun, Tajana Simunic Rosing (Univ. of California, San Diego, United States), Keith A. Whisnant, Kenny C. Gross (Sun Microsystems, United States)
Page: pp. 49 - 54
Keyword: scheduling, thermal management, reliability
Abstract: Thermal hot spots and temperature gradients on the die need to be minimized to manufacture reliable systems while meeting energy and performance constraints. In this work, we solve the task scheduling problem for multiprocessor systems-on-chip (MPSoCs) using Integer Linear Programming (ILP). The goal of our optimization is to minimize hot spots and balance the temperature distribution on the die for a known set of tasks. Under the given assumptions about task characteristics, the solution is optimal. We compare our technique against optimal scheduling methods for energy minimization, energy balancing, and hot spot minimization, and show that our technique achieves significantly better thermal profiles. We also extend our technique to handle workload variations at runtime.

1B-4 (Time: 11:30 - 11:55)
Title: Run-Time Power Gating of On-Chip Routers Using Look-Ahead Routing
Author: *Hiroki Matsutani (Keio University, Japan), Michihiro Koibuchi (National Institute of Informatics, Japan), Daihan Wang, Hideharu Amano (Keio University, Japan)
Page: pp. 55 - 60
Keyword: Networks-on-Chip, look-ahead routing, power gating, leakage power, low power
Abstract: Since on-chip routers in Networks-on-Chip play a key role in enabling on-chip communication between cores, they must always be ready for packet injections even when some of the cores are in standby mode, so routers consume more standby power than cores. Run-time power gating of individual channels in a router is an attractive way to reduce the standby power of the chip without affecting on-chip communication. However, a transition between sleep and active modes incurs a performance penalty, and turning a power switch on or off dissipates overhead energy, which means that a short-term sleep can actually increase power consumption. In this paper, we propose a sleep control method based on look-ahead routing that detects the arrival of packets two hops ahead, so as to hide the wake-up delay and reduce short-term sleeps of channels. Simulation results using real application traces show that the proposed method hides wake-up delays of less than five cycles and saves more leakage power than the original naive method.

1B-5 (Time: 11:55 - 12:08)
Title: Automated Techniques for Energy Efficient Scheduling on Homogeneous and Heterogeneous Chip Multi-Processor Architectures
Author: *Sushu Zhang, Karam S. Chatha (Arizona State Univ., United States)
Page: pp. 61 - 66
Keyword: low power design, chip multi-processor, scheduling, approximation algorithm
Abstract: We address performance maximization of independent task sets under an energy constraint on chip multi-processor (CMP) architectures that support multiple voltage/frequency operating states for each core. We prove that the problem is strongly NP-hard. We propose polynomial-time 2-approximation algorithms for homogeneous and heterogeneous CMPs. To the best of our knowledge, our techniques offer the tightest bounds for energy-constrained design on CMP architectures. Experimental results demonstrate that our techniques are effective and efficient under various workloads on several CMP architectures.

1B-6 (Time: 12:08 - 12:21)
Title: Statistical Power Profile Correlation for Realistic Thermal Estimation
Author: *Love Singhal (University of California, Irvine, United States), Sejong Oh (KAIST, Republic of Korea), Eli Bozorgzadeh (University of California, Irvine, United States)
Page: pp. 67 - 70
Keyword: Power profile, Thermal-aware, temperature estimation, Clustering
Abstract: At the system level, on-chip temperature depends both on power density and on thermal coupling with neighboring regions. The problem of finding the right set of input power profile(s) for accurate temperature estimation has not been studied. Considering only the average or only the peak power density may lead to underestimation or overestimation of the thermal crisis, respectively. To provide more realistic temperature estimation, we propose a representation with multiple power profiles, referred to as leader power profiles. Using the proposed statistical methods to determine the closeness between power profiles, we apply a clustering algorithm to identify the leader power profiles. We incorporate them into a thermal-aware floorplanner; empirical results show that using a single leader power profile (average or peak) leads to 37% degradation in critical wire delay and 20% degradation in wire length compared to using multiple leader power profiles.


Session 1C  Emerging Technologies
Time: 10:15 - 12:20 Tuesday, January 22, 2008
Location: Room 311A
Chairs: Li Shang (Queen's University, Canada), Chao Huang (Virginia Tech, United States)

1C-1 (Time: 10:15 - 10:40)
Title: Reconfigurable RTD-Based Circuit Elements of Complete Logic Functionality
Author: *Yexin Zheng, Chao Huang (Virginia Tech., United States)
Page: pp. 71 - 76
Keyword: reconfigurable, RTD circuit
Abstract: Resonant tunneling diodes (RTDs) have demonstrated promising circuit characteristics, including high-speed switching and versatile functionality with negative differential resistance (NDR). In this paper, we propose novel programmable logic elements (PLEs) that can be configured to realize all three- or four-input logic functions. These simple RTD-based circuit elements are implemented with threshold gates (TGs) and multi-threshold threshold gates (MTTGs) by employing programmable monostable-bistable logic element (MOBILE) principles. We also develop a dynamically reconfigurable scheme based on our PLE structures that facilitates nanopipelining without incurring delay overhead.

1C-2 (Time: 10:40 - 11:05)
Title: MBARC: A Scalable Memory Based Reconfigurable Computing Framework for Nanoscale Devices
Author: Somnath Paul, *Swarup Bhunia (Case Western Reserve University, United States)
Page: pp. 77 - 82
Keyword: Reconfigurable, Nano-Computing, Memory-based computing
Abstract: We propose MBARC, a reconfigurable framework that uses memory as the primary computing element. The proposed framework leverages the reported advantages of memory array design with nanodevices, which are amenable to fabrication in dense and regular structures. Simulation results for a set of ISCAS benchmarks show average improvements of 32% in area, 21% in delay, and 34% in energy per vector compared to a nanoscale FPGA implementation.

1C-3 (Time: 11:05 - 11:30)
Title: Moving Forward: A Non-Search Based Synthesis Method Toward Efficient CNOT-Based Quantum Circuit Synthesis Algorithms
Author: *Mehdi Saeedi, Morteza Saheb Zamani, Mehdi Sedighi (Amirkabir Univ. of Tech., Iran)
Page: pp. 83 - 88
Keyword: Quantum Computing, Reversible Logic, CNOT-based Circuit, Matrix representation
Abstract: Quantum information processing is in its early stages, and there is no mature method for quantum circuit synthesis. Among open research problems, quantum circuit synthesis has recently received significant attention. In this paper, we propose a new non-search-based moving forward synthesis algorithm (MOSAIC) for CNOT-based quantum circuits. In contrast with widely used search-based methods, MOSAIC is guaranteed to produce a result and can reach a solution in far fewer steps. To evaluate the proposed algorithm, we use several circuits taken from the literature. Experimental results show the efficiency of the proposed algorithm.
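A circuit built only from CNOT gates computes a linear reversible map x -> Ax over GF(2), where A is an invertible 0/1 matrix, which is why a matrix representation is a natural starting point. As a rough illustration of a non-search synthesis route, the Python sketch below uses plain Gaussian elimination over GF(2); it is not MOSAIC, and the function names are hypothetical. Each row operation row_t ^= row_c corresponds to a CNOT with control c and target t, and because every CNOT is its own inverse, the circuit is the elimination sequence in reverse order.

```python
from typing import List, Tuple

Matrix = List[List[int]]
Gate = Tuple[int, int]  # (control, target)


def synthesize_cnot(a: Matrix) -> List[Gate]:
    """Return CNOT gates (in circuit order) implementing x -> a x over GF(2)."""
    n = len(a)
    m = [row[:] for row in a]  # reduce a copy of the matrix to the identity
    ops: List[Gate] = []
    for col in range(n):
        if m[col][col] == 0:
            # Take the pivot from a later row so already-reduced columns stay intact.
            pivot = next((r for r in range(col + 1, n) if m[r][col]), None)
            if pivot is None:
                raise ValueError("matrix is not invertible over GF(2)")
            m[col] = [x ^ y for x, y in zip(m[col], m[pivot])]
            ops.append((pivot, col))
        for r in range(n):
            if r != col and m[r][col]:
                m[r] = [x ^ y for x, y in zip(m[r], m[col])]
                ops.append((col, r))
    # The row operations map a to I; since each CNOT is self-inverse,
    # the circuit for a applies the same operations in reverse order.
    return ops[::-1]


def simulate(gates: List[Gate], bits: List[int]) -> List[int]:
    """Apply CNOT gates to a classical basis state given as a list of 0/1 values."""
    x = list(bits)
    for control, target in gates:
        x[target] ^= x[control]
    return x


def apply_matrix(a: Matrix, bits: List[int]) -> List[int]:
    """Compute a x over GF(2)."""
    return [sum(aij & xj for aij, xj in zip(row, bits)) % 2 for row in a]
```

For example, the matrix [[1, 0, 0], [1, 1, 0], [0, 1, 1]] yields [(1, 2), (0, 1)]: first x2 ^= x1, then x1 ^= x0. Elimination of this kind always terminates with a circuit for an invertible matrix, but it makes no attempt to minimize the gate count.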

1C-4 (Time: 11:30 - 11:55)
Title: A CAD Tool for RF MEMS Devices
Author: *Rajesh Pande, Rajendra Patrikar (Visvesvaraya Nat'l Inst. of Tech., India)
Page: pp. 89 - 94
Keyword: MEMS, CAD, FEM, RF
Abstract: A stable, multi-energy-domain, multi-scale simulation tool for microsystems has been developed. A structured design methodology is adopted for the design and optimization of an RF MEMS shunt switch and a MEMS inductor. The CAD tool is device-specific and incorporates physical parameters such as surface roughness. The tool analyzes the impact of surface roughness and also performs thermal analysis. These analyses are useful for understanding the reliability and failure mechanisms of RF MEMS components.


Session 1D  University LSI Design Contest
Time: 10:15 - 12:20 Tuesday, January 22, 2008
Location: Room 311BC
Chairs: Kenichi Okada (Tokyo Institute of Technology, Japan), Hiroshi Kawaguchi (Kobe Univ., Japan)

1D-1 (Time: 10:15 - 10:22)
Title: A 1.2GHz Delayed Clock Generator for High-speed Microprocessors
Author: *Inhwa Jung, Moo-Young Kim, Chulwoo Kim (Korea University, Republic of Korea)
Page: pp. 95 - 96
Keyword: clock generator, clock-on-demand, lock time, low-power
Abstract: A 1.2GHz delayed clock generator capable of adjusting its clock phase according to the input clock frequency has been developed. It consists of a fully digital CMOS circuit, which makes it a simple, robust, and portable IP. Its one-cycle lock time enables clock-on-demand circuit structures. The implemented delayed clock generator tile in 0.13um CMOS technology occupies only 0.004mm^2 and operates at input frequencies ranging from 625MHz to 1.2GHz.

1D-2 (Time: 10:22 - 10:29)
Title: LVDS-Type On-Chip Transmission Line Interconnect with Passive Equalizers in 90 nm CMOS Process
Author: *Akiko Mineyama, Hiroyuki Ito, Takahiro Ishii, Kenichi Okada, Kazuya Masu (Tokyo Institute of Technology, Japan)
Page: pp. 97 - 98
Keyword: transmission line, on-chip interconnect, global interconnect, LVDS-type, passive equalizer
Abstract: This paper demonstrates a low voltage differential signaling (LVDS)-type on-chip transmission line (TL) interconnect with passive equalizers to solve delay issues on global interconnects. The proposed on-chip TL interconnect achieves 10.5 Gbps signaling and has smaller delay, smaller delay variation, and better power efficiency than conventional on-chip interconnects at high frequencies.

1D-3 (Time: 10:29 - 10:36)
Title: A Slew-Rate Controlled Output Driver with One-Cycle Tuning Time
Author: *Young-Ho Kwak, Inhwa Jung, Chulwoo Kim (Korea University, Republic of Korea)
Page: pp. 99 - 100
Keyword: slew-rate, one-cycle, low power, 0.18um
Abstract: A low-power slew-rate controlled output driver with an open-loop digital scheme and one-cycle lock time is presented. The proposed output driver maintains the slew rate in the range of 2.1V/ns to 3.6V/ns within one cycle after the enable clock is applied. It is implemented in a 0.18um CMOS process, and the control block consumes 13.7mW at 1Gbps.

1D-4 (Time: 10:36 - 10:43)
Title: A Low-Leakage Current Power 180-nm CMOS SRAM
Author: *Tadayoshi Enomoto, Yuki Higuchi (Chuo University, Japan)
Page: pp. 101 - 102
Keyword: SRAM, leakage, power, CMOS
Abstract: A low-leakage-power 180-nm 1-Kbit SRAM was fabricated. The standby leakage power of a 1-Kbit memory cell array incorporating a newly developed leakage current reduction circuit, called a “self-controllable voltage level (SVL)” circuit, was only 3.7 nW, 5.4% of that of an equivalent conventional memory cell array at a VDD of 1.8 V. At the same time, the speed remained almost unchanged, with minimal memory cell array area overhead.

1D-5 (Time: 10:43 - 10:50)
Title: A CMOS Direct Sampling Mixer Using Switched Capacitor Filter Technique for Software-Defined Radio
Author: *Hong Phuc Ninh, Takashi Moue, Takashi Kurashina, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page: pp. 103 - 104
Keyword: Sampling Mixer, Switched Capacitor Filter, Software-Defined Radio
Abstract: This paper proposes a novel direct sampling mixer (DSM) using a switched capacitor filter (SCF) for multi-band receivers. The proposed DSM has higher gain, more flexibility, and lower flicker noise than conventional circuits. The mixer, targeting 1-segment Digital Terrestrial Television (ISDB-T), was fabricated in a 0.18um CMOS process, and measured results are presented for a sampling frequency of 800MHz. The experimental results exhibit a 430kHz signal bandwidth with 27.3dB attenuation of an adjacent interferer assumed at a 3MHz offset.

1D-6 (Time: 10:50 - 10:57)
Title: Small-Area CMOS RF Distributed Mixer Using Multi-Port Inductors
Author: *Susumu Sadoshima, Satoshi Fukuda, Tackya Yammouch, Hiroyuki Ito, Kenichi Okada, Kazuya Masu (Tokyo Institute of Technology, Japan)
Page: pp. 105 - 106
Keyword: mixer, CMOS, UWB
Abstract: This paper presents a novel small-area distributed mixer for ultra-wideband (UWB) receivers. The proposed mixer uses five 4-port inductors instead of fifteen 2-port inductors to shrink the circuit area. The proposed mixer achieves a conversion gain of -10dB, a noise figure of 15dB, a return loss of less than -10dB from 2.3 to 6.0GHz, an IIP3 of 13.6dBm, and a circuit area of 0.51mm^2.

1D-7 (Time: 10:57 - 11:04)
Title: Dynamic Supply Noise Measurement Circuit Composed of Standard Cells Suitable for In-Site SoC Power Integrity Verification
Author: *Yasuhiro Ogasahara, Masanori Hashimoto, Takao Onoye (Osaka University, Japan)
Page: pp. 107 - 108
Keyword: measurement, power supply noise, ring oscillator
Abstract: This paper presents an all-digital measurement circuit called a "gated oscillator" for capturing waveforms of dynamic power supply noise. The gated oscillator is constructed from standard cells and thus can be easily embedded in SoCs for design verification. The performance of the gated oscillator is verified with test chips fabricated in a 90nm process.

1D-8 (Time: 11:04 - 11:11)
Title: Duo-Binary Circular Turbo Decoder Based on Border Metric Encoding for WiMAX
Author: *Ji-Hoon Kim, In-Cheol Park (KAIST, Republic of Korea)
Page: pp. 109 - 110
Keyword: turbo code, turbo decoder, duo-binary SISO decoding, WiMAX, interleaver
Abstract: This paper presents a duo-binary circular turbo decoder based on border metric encoding. With the proposed method, the size of the branch memory is halved and the dummy calculation is removed, at the cost of a small memory that holds the encoded border metrics. Based on the proposed SISO decoder and a dedicated hardware interleaver, a duo-binary circular turbo decoder is designed for the WiMAX standard in a 0.13um CMOS process; it supports 24.26Mbps at 200MHz.

1D-9 (Time: 11:11 - 11:18)
Title: Area and Power Efficient Design of Coarse Time Synchronizer and Frequency Offset Estimator for Fixed WiMAX System
Author: *Tae-Hwan Kim, In-Cheol Park (KAIST, Republic of Korea)
Page: pp. 111 - 112
Keyword: OFDM, WiMAX, IEEE 802.16d, coarse time synchronization, carrier frequency offset estimation
Abstract: Targeting fixed WiMAX systems, this paper presents a new architecture for coarse time synchronization and carrier frequency offset (CFO) estimation. The proposed architecture is based on a two-step approach in which the data paths are decoupled so that performance and area can be optimized individually. Implemented in 0.13um CMOS technology, the proposed architecture requires less silicon area and power and achieves better performance than the previous joint approach.

1D-10 (Time: 11:18 - 11:25)
Title: A Low-Cost Cryptographic Processor for Security Embedded System
Author: *Ronghua Lu, Jun Han, Xiaoyang Zeng, Qing Li, Lang Mai, Jia Zhao (Fudan Univ., China)
Page: pp. 113 - 114
Keyword: Security, Processor, Cryptographic, RSA, AES
Abstract: A low-cost cryptographic processor for security embedded systems is presented in this paper. The processor, without the assistance of any dedicated cryptographic coprocessor, is scalable and very efficient for popular cryptographic functions such as RSA/ECC, AES, and hashing. Based on SMIC 0.18um standard CMOS technology, the core circuit of the test chip has only about 32k gates and a maximum frequency of 200MHz, at which the 1024-bit RSA algorithm takes only 150ms and the throughput of AES reaches 256Mbits/s.

1D-11 (Time: 11:25 - 11:32)
Title: Multithreaded Coprocessor Interface for Multi-Core Multimedia SoC
Author: *Shih Hao Ou, Tay-Jyi Lin, Xiang Sheng Deng, Zhi Hong Zhuo, Chih Wei Liu (Nat'l Chiao Tung Univ., Taiwan)
Page: pp. 115 - 116
Keyword: Dual-core, Multithreaded
Abstract: Modern architectures exploit task-level parallelism to improve their performance in a cost-effective manner. However, task synchronization and management are time consuming and waste computing resources, especially on application-specific architectures such as DSPs. In this paper, we propose a smart coprocessor interface that offloads the task management job from the MPU or DSP. In our simulations, our approach improves the overall performance of a dual-core platform by 57%. The hardware overhead of the interface is only 1.56% of the DSP core.

1D-12 (Time: 11:32 - 11:39)
Title: Parameterized Embedded In-circuit Emulator and Its Retargetable Debugging Software for Microprocessor/Microcontroller/DSP Processor
Author: *Liang-Bi Chen, Yung-Chih Liu, Chien-Hung Chen, Chung-Fu Kao, Ing-Jer Huang (Department of Computer Science and Engineering, National Sun Yat-Sen University, Taiwan)
Page: pp. 117 - 118
Keyword: In-circuit Emulation, In-circuit Emulator, Testing, Debugging, Microprocessor
Abstract: The in-circuit emulator (ICE) is a commonly adopted microprocessor debugging technique. In this paper, a parameterized embedded in-circuit emulator and its retargetable debugging software are proposed. The parameterized embedded in-circuit emulator can be integrated into processors of different styles, such as microcontrollers, microprocessors, and DSP processors. The GUI of the debugging software helps users debug easily. As a result, the time required for the microprocessor debugging procedure is reduced.


Session 2A  Advanced Topic in Logic Synthesis
Time: 13:30 - 15:35 Tuesday, January 22, 2008
Location: Room 310A
Chairs: Shih-Chieh Chang (National Tsing Hua University, Taiwan), In-Cheol Park (Korea Advanced Institute of Science and Technology, Republic of Korea)

2A-1 (Time: 13:30 - 13:55)
Title: Global Optimization of Common Subexpressions for Multiplierless Synthesis of Multiple Constant Multiplications
Author: Yuen-Hong Alvin Ho, Chi-Un Lei, *Hing-Kit Kwan, Ngai Wong (The University of Hong Kong, Hong Kong)
Page: pp. 119 - 124
Keyword: common subexpression sharing, multiple constant multiplications, mixed-integer linear programming
Abstract: In the context of multiple constant multiplication (MCM) design, we propose a novel common subexpression elimination (CSE) algorithm that models the optimal synthesis of coefficients as a 0-1 mixed-integer linear programming (MILP) problem. A time delay constraint is included in the synthesis. We also propose coefficient decompositions that combine all minimal signed digit (MSD) representations and the shifted sums (differences) of coefficients. In some cases, the proposed solution space further reduces the number of adders/subtractors in the MCM synthesis.
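Multiplierless MCM designs form each product c*x as a sum of shifted copies of x, so the number of nonzero signed digits in c sets the adder/subtractor cost before any subexpression sharing. The Python sketch below computes the canonical signed digit (CSD) form, which is one minimal signed digit representation; it only illustrates this representation step, not the MILP-based CSE formulation of the paper, and the function names are hypothetical.

```python
from typing import List


def csd_digits(c: int) -> List[int]:
    """Canonical signed digit (non-adjacent) form of c >= 0, least significant digit first.

    Each digit is -1, 0 or +1, and no two adjacent digits are both nonzero.
    """
    if c < 0:
        raise ValueError("expects a non-negative constant")
    digits: List[int] = []
    while c:
        if c & 1:
            d = 2 - (c & 3)  # c mod 4 == 1 -> +1, c mod 4 == 3 -> -1
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits


def adder_cost(c: int) -> int:
    """Adders/subtractors needed to form c*x from shifted copies of x, without sharing."""
    nonzero = sum(1 for d in csd_digits(c) if d != 0)
    return max(nonzero - 1, 0)


def shift_add_multiply(x: int, c: int) -> int:
    """Evaluate c*x using only shifts and additions/subtractions of x."""
    return sum(d * (x << i) for i, d in enumerate(csd_digits(c)) if d)
```

For instance, csd_digits(7) gives [-1, 0, 0, 1], i.e. 7x = (x << 3) - x, which needs one subtractor instead of the two adders implied by the binary form 111. CSE then looks for digit patterns shared across several coefficients so that partial sums are computed once.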

2A-2 (Time: 13:55 - 14:20)
Title: Decomposition Based Approach for Synthesis of Multi-Level Threshold Logic Circuits
Author: Tejaswi Gowda, *Sarma Vrudhula (Arizona State University, United States)
Page: pp. 125 - 130
Keyword: Threshold Logic, Logic Synthesis, Logic Decomposition, Nano Circuits, EDA
Abstract: Scaling is currently the most popular technique used to improve performance metrics of CMOS circuits. It cannot go on forever, because the properties responsible for the functioning of MOSFETs no longer hold at nanometer dimensions. Recent research has shown that nano devices can be an alternative to CMOS when CMOS scaling becomes infeasible in the near future. This motivates the need for stable and mature design automation techniques for threshold logic, since threshold logic is the design abstraction used for most nano devices. This paper presents a new decomposition theory based on the properties of threshold functions. The main contributions of this paper are: (1) a new method of algebraic factorization called min-max factorization; (2) a decomposition theory that uses this new factorization to identify and characterize threshold functions; and (3) a new threshold logic synthesis methodology that uses the decomposition theory. This synthesis methodology produces circuits that are better than the previous state of the art (a 27% improvement in gate count with comparable circuit depth).
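For readers unfamiliar with threshold logic, the sketch below encodes the standard definition of a threshold gate: the output is 1 exactly when the weighted sum of the inputs reaches the threshold. With weights (1, 1, 1) and threshold 2 the gate computes 3-input majority, and with weights (1, 1, 2) and threshold 2 it computes x1·x2 + x3, a function that would otherwise need two ordinary gates. This illustrates only the gate model; it does not implement min-max factorization or the authors' decomposition, and the helper names are hypothetical.

```python
from itertools import product


def threshold_gate(weights, threshold, inputs) -> int:
    """Return 1 iff sum(w_i * x_i) >= threshold, else 0."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)


def truth_table(weights, threshold) -> list[int]:
    """Enumerate the Boolean function realized by a threshold gate.

    Rows follow itertools.product order: the first input varies slowest.
    """
    return [threshold_gate(weights, threshold, bits)
            for bits in product((0, 1), repeat=len(weights))]
```

For example, `truth_table([1, 1], 2)` is the AND table `[0, 0, 0, 1]`, and lowering the threshold to 1 gives OR.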

2A-3 (Time: 14:20 - 14:45)
Title: Timing-Power Optimization for Mixed-Radix Ling Adders by Integer Linear Programming
Author: *Yi Zhu, Jianhua Liu, Haikun Zhu, Chung-Kuan Cheng (University of California, San Diego, United States)
Page: pp. 131 - 137
Keyword: prefix adder, power optimization, integer linear programming
Abstract: This paper optimizes the timing and power consumption of mixed-radix Ling adders under physical area constraints using an integer linear programming formulation. Each cell in the prefix network can have a different radix and size, and Ling carries are incorporated. Optimal solutions are obtained by solving the proposed formulation. The experiments show that the resulting optimal structures achieve large power savings compared with traditional designs. The ASIC implementation results are superior to those produced by Synopsys Module Compiler.
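The Ling-carry identity that such adders rely on can be checked with a small bit-level sketch. With generate g_i = a_i AND b_i, transmit t_i = a_i OR b_i and half-sum p_i = a_i XOR b_i, the Ling pseudo-carry obeys h_i = g_i + t_{i-1}·h_{i-1}, the ordinary carry is recovered as c_i = t_i·h_i, and the sum bit is s_i = p_i XOR c_{i-1}. The recurrence is evaluated serially here for clarity; this is not a prefix network, and it does not model radix, cell sizing, or the ILP formulation. The function name is hypothetical.

```python
def ling_add(a: int, b: int, n: int) -> int:
    """Add two n-bit unsigned integers via Ling pseudo-carries (ripple form)."""
    total = 0
    h_prev, t_prev = 0, 0  # chosen so that c_{-1} = t_prev & h_prev = 0
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        g, t, p = ai & bi, ai | bi, ai ^ bi
        c_prev = t_prev & h_prev    # ordinary carry into bit i: c_{i-1} = t_{i-1} h_{i-1}
        total |= (p ^ c_prev) << i  # sum bit s_i = p_i XOR c_{i-1}
        h = g | c_prev              # Ling carry h_i = g_i + t_{i-1} h_{i-1}
        h_prev, t_prev = h, t
    return total | ((t_prev & h_prev) << n)  # carry-out c_{n-1} = t_{n-1} h_{n-1}
```

Exhaustively checking all pairs of 4-bit operands confirms that the result equals ordinary addition.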

2A-4 (Time: 14:45 - 15:10)
Title: Efficient Synthesis of Compressor Trees on FPGAs
Author: Hadi Parandeh-Afshar (Univ. of Tehran, Iran), *Philip Brisk, Paolo Ienne (EPFL, Switzerland)
Page: pp. 138 - 143
Keyword: Generalized Parallel Counters, Compressor Tree, FPGA
Abstract: FPGA performance on arithmetic circuits is currently lacking. In many applications, such as digital signal and video processing, summing k > 2 integer values is the most computationally intensive part. To improve the quality of addition circuits on FPGAs, both Xilinx and Altera have augmented their basic LUT structure with dedicated circuitry for addition, including a fast carry chain that does not suffer from routing delays. On such FPGAs, a tree of binary or ternary adders has been considered the most efficient way to sum k > 2 values. In the world of ASICs, it is well known that compressor trees outperform adder trees when summing k > 2 values; however, due to the peculiarities of FPGAs, all previous literature has reported that adder trees are faster than compressor trees. This paper shows that this conventional wisdom is false. We present a heuristic that synthesizes a compressor tree onto an FPGA and reduces the combinational delay through the tree by 27.5% on average, with an area increase of approximately 5.7%.
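To make the contrast between adder trees and compressor trees concrete, the sketch below sums k values with layers of 3:2 counters (full adders applied bitwise to whole words), so that carries are saved rather than propagated, and performs a single carry-propagate addition only when two operands remain. This is the generic ASIC-style carry-save scheme; it does not model generalized parallel counters, LUT mapping, or the authors' FPGA heuristic. The function names are hypothetical.

```python
def full_adder_compress(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 compression: returns (sum, carry) with x + y + z == sum + carry."""
    s = x ^ y ^ z
    carry = ((x & y) | (x & z) | (y & z)) << 1
    return s, carry


def compressor_tree_sum(values: list[int]) -> int:
    """Sum k integers with a carry-save (3:2) compressor tree."""
    operands = list(values)
    while len(operands) > 2:
        next_layer = []
        # Compress full groups of three; 0-2 leftover operands pass through.
        for i in range(0, len(operands) - 2, 3):
            next_layer.extend(full_adder_compress(*operands[i:i + 3]))
        next_layer.extend(operands[len(operands) - len(operands) % 3:])
        operands = next_layer
    return sum(operands)  # the single final carry-propagate addition
```

Each layer turns every three operands into two, so no carry chain appears until the last step; on an ASIC this is why compressor trees beat adder trees, and the paper's point is that a suitable mapping recovers this advantage on FPGAs.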

2A-5 (Time: 15:10 - 15:23)
Title: Area Recovery under Depth Constraint by Cut Substitution for Technology Mapping for LUT-Based FPGAs
Author: *Taiga Takata, Yusuke Matsunaga (Kyushu University, Japan)
Page: pp. 144 - 147
Keyword: Technology Mapping, FPGA, Logic Synthesis
Abstract: This paper presents Cut Substitution, a post-processing algorithm for technology mapping of LUT-based FPGAs that minimizes area under a minimum-depth constraint. Minimizing area under a minimum-depth constraint during technology mapping appears to be as difficult as an NP-hard problem. Cut Substitution generates a locally optimal solution by eliminating redundant LUTs while maintaining the depth. Experiments show that the proposed method derives solutions whose area is 9% smaller than that of DAOmap on average.
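Cut Substitution operates on cuts: sets of nodes from which a single k-input LUT can compute a given node. The sketch below enumerates all k-feasible cuts of a small DAG by merging one cut from each fanin and discarding merges with more than k leaves. It is a textbook cut enumeration shown as background, not Cut Substitution itself; the dict-based netlist format and the node names are hypothetical.

```python
from itertools import product


def enumerate_cuts(fanins: dict[str, list[str]], k: int) -> dict[str, set[frozenset[str]]]:
    """Enumerate all k-feasible cuts of every node in a DAG.

    `fanins` maps each node to its fanin list and must be given in
    topological order (primary inputs map to []).
    """
    cuts: dict[str, set[frozenset[str]]] = {}
    for node, ins in fanins.items():
        node_cuts = {frozenset([node])}  # the trivial cut {node}
        if ins:
            # Combine one cut per fanin; keep unions with at most k leaves.
            for combo in product(*(cuts[f] for f in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts
```

For a netlist with x = f(a, b) and y = g(x, c), the cuts of y are {y}, {x, c} and {a, b, c} when k = 3; with k = 2 the last one is dropped, so y must then be mapped through x.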

2A-6 (Time: 15:23 - 15:36)
Title: An Optimal Algorithm for Sizing Sequential Circuits for Industrial Library Based Designs
Author: Sanghamitra Roy, Yu Hen Hu (Univ. of Wisconsin, Madison, United States), *Charlie Chung-Ping Chen, Shih-Pin Hung, Tse-Yu Chiang, Jiuan-Guei Tseng (Nat'l Taiwan Univ., Taiwan)
Page: pp. 148 - 151
Keyword: Gate sizing, Sequential circuit, Clock skew, Feedback, Optimization
Abstract: In this paper, we propose an optimal gate sizing and clock skew optimization algorithm for globally sizing synchronous sequential circuits. The number of constraints and variables in our formulation is linear with respect to the number of circuit components, and hence our algorithm can efficiently find the optimal solution for industrial-scale designs. To the best of our knowledge, our method is the first exact gate sizing algorithm that can handle cyclic sequential circuits. Experimental results on industrial cell libraries demonstrate that our algorithm can yield an average of 12.6% improvement in the optimal clock period by combining clock skew optimization with gate sizing. For an identical clock period, our algorithm can achieve an average of 11.3% area savings over a popular commercial synthesis tool.


Session 2B  Interconnect Modeling and Simulation Techniques
Time: 13:30 - 15:35 Tuesday, January 22, 2008
Location: Room 310BC
Chairs: Yungseon Eo (Hanyang University, Republic of Korea), Yokomizo Goichi (STARC, Japan)

2B-1 (Time: 13:30 - 13:55)
Title: Efficient Numerical Modeling of Random Rough Surface Effects for Interconnect Internal Impedance Extraction
Author: *Quan Chen, Ngai Wong (The University of Hong Kong, Hong Kong)
Page: pp. 152 - 157
Keyword: Rough Surface, Impedance, SIE Method, Interconnects
Abstract: This paper proposes an efficient model for numerically evaluating the impact of random surface roughness on the internal impedance of large-scale interconnect structures. The effective resistivity (ER) and effective permeability (EP) are formulated numerically to avoid computationally prohibitive global discretization while maintaining model accuracy and flexibility. A modified stochastic integral equation (SIE) method is proposed to significantly speed up the computation of the mean values of ER and EP under the assumption of random surface roughness. Numerical experiments verify the efficacy of our approach.

2B-2 (Time: 13:55 - 14:20)
Title: Efficient Techniques for 3-D Impedance Extraction Using Mixed Boundary Element Method
Author: *Fang Gong, Wenjian Yu, Zeyi Wang (Dept. of Computer Science, Tsinghua University, China), Zhiping Yu (Institute of Microelectronics, Tsinghua University, China), Changhao Yan (Fudan University, China)
Page: pp. 158 - 163
Keyword: parasitic extraction, preconditioner, surface integral formulation, wide-band analysis, mixed boundary element method
Abstract: In this paper, we describe the algorithms implemented in MBEM, a program for wideband impedance extraction of complicated 3-D structures. MBEM is based on a mixed boundary element method (BEM), which reduces the number of unknowns from about 7N in FastImp to 4N for MQS analysis. Efficient techniques are proposed to handle the extra matrix multiplication, form the post-processing matrices, and solve the final linear equation system. We also analyze the inaccuracy of FastImp at low frequency and show that the mixed BEM eliminates it completely. Experiments on several typical 3-D structures validate the advantage of MBEM over FastImp in both accuracy and efficiency.

2B-3 (Time: 14:20 - 14:45)
Title: Generating Stable and Sparse Reluctance/Inductance Matrix under Insufficient Conditions
Author: *Yuichi Tanji (Kagawa University, Japan), Takayuki Watanabe (The University of Shizuoka, Japan), Hideki Asai (Shizuoka University, Japan)
Page: pp. 164 - 169
Keyword: Sparse, Inductance, Reluctance, Extraction
Abstract: This paper presents a method for generating stable and sparse reluctance/inductance matrices from an inductance matrix extracted under insufficient discretization. Until now, generating a sparse reluctance matrix with guaranteed stability has required that this matrix be a diagonally dominant M-matrix. Hence, inductance extraction must be repeated with a smaller grid size to obtain a well-defined matrix. This paper instead provides techniques for generating a sparse reluctance matrix even when the extracted reluctance matrix is not a diagonally dominant M-matrix, that is, even when it contains positive off-diagonal elements. This greatly eases the extraction task. Furthermore, a sparse inductance matrix is also generated using practical and sophisticated double-inverse methods; this is useful for SPICE simulation, since reluctance components are still not supported in SPICE-like simulators.
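The stability condition mentioned above can be checked mechanically. The sketch below tests a matrix for positive diagonal entries, non-positive off-diagonal entries, and strict row diagonal dominance; together these are sufficient for a nonsingular M-matrix. The abstract does not say whether strict or weak dominance is meant, so the strict variant here is an assumption, and the function name is hypothetical.

```python
def is_diag_dominant_m_matrix(a: list[list[float]]) -> bool:
    """Check for a strictly row diagonally dominant M-matrix (assumed variant)."""
    n = len(a)
    for i in range(n):
        if a[i][i] <= 0.0:            # diagonal must be positive
            return False
        off_sum = 0.0
        for j in range(n):
            if j == i:
                continue
            if a[i][j] > 0.0:         # positive off-diagonal: not a Z-matrix
                return False
            off_sum += -a[i][j]
        if a[i][i] <= off_sum:        # require strict dominance in every row
            return False
    return True
```

A matrix such as [[2, 0.5], [0.5, 2]] fails only because of its positive off-diagonal entries, which is exactly the situation under insufficient discretization that the paper's techniques are meant to tolerate.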

2B-4 (Time: 14:45 - 15:10)
Title: Hierarchical Krylov Subspace Reduced Order Modeling of Large RLC Circuits
Author: Duo Li, *Sheldon X.-D. Tan (University of California, Riverside, United States)
Page: pp. 170 - 175
Keyword: Model order reduction, interconnect
Abstract: In this paper, we propose a new model order reduction approach for large interconnect circuits using hierarchical decomposition and Krylov subspace projection-based model order reduction. The new approach, called hiePrimor, first partitions a large interconnect circuit into a number of smaller subcircuits, then performs projection-based model order reduction on each subcircuit in isolation, and finally on the top-level circuit. The new approach can exploit parallel computing to speed up the reduction process. We show theoretically that hiePrimor can achieve the same accuracy as the flat reduction method for the same reduction order, and that it also preserves the passivity of the reduced models. We also show that partitioning is important for hierarchical projection-based reduction and that a minimum-span objective is required to achieve the best performance for hierarchical reduction. The proposed method is suitable for reducing large global interconnects, such as coupled buses, transmission lines, and large clock nets, in the post-layout stage. Experimental results demonstrate that hiePrimor can be significantly faster than flat projection methods such as PRIMA and, with parallel computing, an order of magnitude faster than PRIMA, without loss of accuracy.

2B-5 (Time: 15:10 - 15:35)
Title: Statistical Noise Margin Estimation for Sub-Threshold Combinational Circuits
Author: *Yu Pu (Technische Universiteit Eindhoven, Netherlands), Jose Pineda de Gyvez (NXP Research Eindhoven, Netherlands), Henk Corporaal (Technische Universiteit Eindhoven, Netherlands), Yajun Ha (National University of Singapore, Singapore)
Page: pp. 176 - 179
Keyword: subthreshold, reliability, noise margin
Abstract: The increasingly popular sub-threshold design style strongly calls for EDA support to estimate noise margins, the minimum functional supply voltage, and the functional yield. In this paper, we propose a fast, accurate, and statistical approach to accomplish these goals. First, we derive closed-form functions based on a new equivalent resistance model, which enables fast estimation of the noise margins of individual cells at the gate level. Second, we propose to calculate and propagate the noise margin information with an affine arithmetic model that takes into account process variations and the corresponding inter-cell correlations. Experiments with ISCAS benchmarks have shown that the new approach has an accuracy of 98.5% with respect to transistor-level Monte Carlo simulations. The new approach needs only a few seconds per input vector, in contrast to the many hours required by transistor-level DC Monte Carlo simulations. To the best of our knowledge, we are the first to provide a fast, accurate, and statistical methodology other than Monte Carlo simulation for the noise margin estimation of sub-threshold combinational circuits.
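The reason affine arithmetic can capture inter-cell correlations is visible in a toy example: each quantity is written as x0 + Σ x_i·ε_i with noise symbols ε_i in [-1, 1], and quantities that depend on the same variation source share a noise symbol, so correlated terms cancel instead of widening the range as they would in interval arithmetic. The sketch supports only addition, subtraction, and scaling; the class and the noise-symbol names (`vth`, `l`) are hypothetical, and the paper's noise-margin model is not reproduced.

```python
from dataclasses import dataclass, field


@dataclass
class Affine:
    """Affine form center + sum(coef * eps_name), each eps_name in [-1, 1]."""
    center: float
    terms: dict[str, float] = field(default_factory=dict)

    def __add__(self, other: "Affine") -> "Affine":
        terms = dict(self.terms)
        for name, coef in other.terms.items():
            terms[name] = terms.get(name, 0.0) + coef  # shared symbols combine
        return Affine(self.center + other.center, terms)

    def __neg__(self) -> "Affine":
        return Affine(-self.center, {n: -c for n, c in self.terms.items()})

    def __sub__(self, other: "Affine") -> "Affine":
        return self + (-other)

    def scale(self, k: float) -> "Affine":
        return Affine(k * self.center, {n: k * c for n, c in self.terms.items()})

    def interval(self) -> tuple[float, float]:
        """Guaranteed enclosure: center plus or minus the sum of |coefficients|."""
        radius = sum(abs(c) for c in self.terms.values())
        return self.center - radius, self.center + radius
```

With `a = Affine(1.0, {"vth": 0.1})`, the expression `(a - a).interval()` gives exactly `(0.0, 0.0)`, whereas interval arithmetic on [0.9, 1.1] would give [-0.2, 0.2].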


Session 2C  Floorplanning
Time: 13:30 - 15:35 Tuesday, January 22, 2008
Location: Room 311A
Chairs: Shin'ichi Wakabayashi (Hiroshima City University, Japan), Ting-Chi Wang (National Tsing Hua University, Taiwan)

2C-1 (Time: 13:30 - 13:55)
Title: Symmetry-Aware Placement with Transitive Closure Graphs for Analog Layout Design
Author: *Lihong Zhang (Memorial University of Newfoundland, Canada), C.-J. Richard Shi (University of Washington, United States), Yingtao Jiang (University of Nevada, United States)
Page: pp. 180 - 185
Keyword: Analog integrated circuits, layout, symmetry, placement, transitive closure graph
Abstract: A new scheme is proposed that uses the transitive closure graph (TCG) to explore the full symmetry solution space in analog layout design. We define a set of TCG symmetric-feasible conditions and show that they are extremely useful in reducing the solution space. A method is presented for generating random symmetric-feasible TCGs in O(n) time while preserving the TCG closure property. Experimental results have confirmed the effectiveness of the proposed symmetry-aware TCG placement algorithm.

2C-2 (Time: 13:55 - 14:20)
Title: Constraint-Free Analog Placement with Topological Symmetry Structure
Author: *Qing Dong, Shigetoshi Nakatake (Univ. of Kitakyushu, Japan)
Page: pp. 186 - 191
Keyword: placement, analog layout, symmetry, regularity, sequence-pair
Abstract: In analog circuits, blocks need to be placed symmetrically to satisfy device matching. Unlike existing constraint-driven approaches, the proposed topological symmetry structure enables us to generate a symmetrical placement without any constraint. Simulated annealing is used as the optimization framework, and we propose new move operations that preserve the placement's topological symmetry. By inserting dummy blocks, we present a physically skewed symmetry structure that partly allows asymmetry, in order to improve the placement's area and wire length. In addition, we incorporate regularity into the evaluation of placements. Experiments showed that our approach generated placements with complete topological symmetry without much compromise in chip area and wire length, compared with placements without symmetry.
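Sequence-pair, one of this paper's keywords, is a common topological placement representation. The sketch decodes a sequence-pair (Γ+, Γ-) into block coordinates with the standard rules: block a is left of b if a precedes b in both sequences, and a is below b if a follows b in Γ+ but precedes it in Γ-; each coordinate is then a longest-path distance. It omits symmetry groups, dummy blocks, and annealing moves, and the block names and sizes in the example are made up.

```python
def sequence_pair_to_placement(gamma_plus, gamma_minus, widths, heights):
    """Decode a sequence-pair into lower-left (x, y) block coordinates."""
    pos_plus = {blk: i for i, blk in enumerate(gamma_plus)}
    x: dict = {}
    y: dict = {}
    # Every left-of or below predecessor of b precedes b in gamma_minus,
    # so visiting blocks in gamma_minus order is a valid topological order.
    for j, b in enumerate(gamma_minus):
        x[b] = y[b] = 0
        for a in gamma_minus[:j]:
            if pos_plus[a] < pos_plus[b]:   # a precedes b in both: a left of b
                x[b] = max(x[b], x[a] + widths[a])
            else:                           # a after b in gamma_plus: a below b
                y[b] = max(y[b], y[a] + heights[a])
    return {blk: (x[blk], y[blk]) for blk in gamma_plus}
```

For example, with Γ+ = (A, C, B) and Γ- = (A, B, C), block A is left of both B and C, and B is below C, so C is stacked on top of B to the right of A.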

2C-3 (Time: 14:20 - 14:45)
Title: TCG-Based Multi-Bend Bus Driven Floorplanning
Author: Tilen Ma, *Evangeline F. Y. Young (The Chinese Univ. of Hong Kong, Hong Kong)
Page: pp. 192 - 197
Keyword: Floorplanning, Algorithm, Bus Planning
Abstract: In this paper, the problem of bus-driven floorplanning is addressed. Given a set of modules and bus specifications, a floorplan solution including the bus routes is generated with the floorplan area and total bus area minimized. Some previous works have addressed this problem with restricted bus shapes of 0-bend, 1-bend or 2-bend [1]. In this paper, however, we address bus-driven floorplanning without any limitations on the shapes of the buses. We solve this problem with a simulated annealing based floorplanner using the Transitive Closure Graph (TCG) representation. Experimental results show that we can improve over [1] significantly in terms of both run time and quality, since there is more flexibility in routing the buses and complex shape validation steps are not needed. For data sets with buses connecting a large number of blocks, our approach can still generate high-quality solutions effectively, while the approach in [1], which is restricted to 2-bend buses, often cannot give any feasible solutions.

2C-4 (Time: 14:45 - 15:10)
Title: Large-Scale Fixed-Outline Floorplanning Design Using Convex Optimization Techniques
Author: *Chaomin Luo, Miguel F. Anjos (Univ. of Waterloo, Canada), Anthony Vannelli (Univ. of Guelph, Canada)
Page: pp. 198 - 203
Keyword: fixed-outline floorplanning, convex optimization, second-order cone programming, relative position matrix, wirelength minimization
Abstract: A two-stage optimization methodology is proposed to solve the fixed-outline floorplanning problem, which is a global optimization problem for wirelength minimization. In the first stage, an attractor-repeller convex optimization model provides the relative positions of the modules on the floorplan. The second stage places and sizes the modules using second-order cone optimization. A Voronoi diagram is employed to obtain a planar graph, and thus a relative position matrix, to connect the two stages. Overlap-free and deadspace-free floorplans are achieved in a fixed outline, and floorplans with any specified percentage of whitespace can be produced. Experimental results on GSRC benchmarks demonstrate that we obtain significant improvements over the best results known in the literature for these benchmarks. Most importantly, our methodology provides greater improvement over other floorplanners as the number of modules increases.

2C-5 (Time: 15:10 - 15:23)
Title: Bus-Aware Microarchitectural Floorplanning
Author: Dae Hyun Kim, *Sung Kyu Lim (Georgia Institute of Technology, United States)
Page: pp. 204 - 208
Keyword: floorplanning, bus
Abstract: In this paper we present the first bus-aware microarchitectural floorplanning approach. Our goal is to study the impact of bus routability on other important floorplanning objectives, including area, performance, power, and thermal. We developed a fast performance-aware bus routing algorithm, which is integrated into the floorplanning engine to ensure routability while optimizing other conflicting objectives. Our experiments on high-performance processors show that we obtain 100% routability at the cost of a minimal increase in the area, performance, and power objectives under a thermal constraint.

2C-6 (Time: 15:23 - 15:36)
Title: LP Based White Space Redistribution for Thermal Via Planning and Performance Optimization in 3D ICs
Author: *Xin Li, Yuchun Ma, Xianlong Hong, Sheqin Dong (Tsinghua Univ., China), Jason Cong (Univ. of California, Los Angeles, United States)
Page: pp. 209 - 212
Keyword: 3D ICs, performance, thermal via, floorplanning
Abstract: Thermal issues are a critical challenge in 3D IC design. Incorporating thermal vias into 3D ICs is a promising way to mitigate thermal issues by lowering the thermal resistance between device layers. However, it is usually difficult to find enough space in target regions to insert thermal vias. In this paper, we propose a novel analytical algorithm that reallocates white space in 3D ICs to facilitate via insertion. Experimental results show that after reallocating white space, the number of thermal vias and the total wirelength can be reduced by 14% and 2%, respectively. The results also show that white space redistribution driven by via planning alone degrades performance by 9%, while the performance-aware via planning method reduces the number of thermal vias by 60% and keeps performance nearly unchanged.


Session 2D  Special Session - Tackling Manufacturability/Variability for 32nm and Below
Time: 13:30 - 15:35 Tuesday, January 22, 2008
Location: Room 311BC
Chair: Dale Edwards (Semiconductor Research, United States)

2D-2
Title: (Invited Paper) Predictive Models and CAD Methodology for Pattern Dependent Variability
Author: *Nishath Verghese, Richard Rouse, Philippe Hurat (Cadence Design Systems, United States)
Page: pp. 213 - 218
Abstract: Lithography, etch and stress are dominant effects impacting the functionality and performance of designs at 65nm and below. This paper discusses pattern dependent variability caused by these effects and describes a model-based approach to extracting this variability. A methodology to gauge the extent of this pattern dependent variability for standard cells is presented, which looks at the difference in transistor parameters when the cell is analyzed in different contexts. A full-chip methodology that addresses the delay change due to systematic variation is introduced and used to analyze and repair a 65nm digital design.

2D-3
Title: (Invited Paper) Technology Modeling and Characterization Beyond the 45nm Node
Author: *Sani R. Nassif (IBM, United States)
Page: p. 219

2D-4
Title: (Invited Paper) Synergistic Physical Synthesis for Manufacturability and Variability in 45nm Designs and Beyond
Author: *David Z. Pan, Minsik Cho (University of Texas, Austin, United States)
Page: pp. 220 - 225
Abstract: Nanometer IC designs are increasingly challenged by manufacturing closure, i.e., being fabricated with high product yield, mainly due to aggressive technology scaling and increasing process/environmental variations. Realizing the criticality of addressing manufacturability for higher yield and tolerance to variations during design, there has recently been a surge of research activity in both academia and industry. In this paper, we survey the key activities in synergistic physical synthesis and shed light on some future research directions.


Session 3A  Routing
Time: 15:50 - 17:55 Tuesday, January 22, 2008
Location: Room 310A
Chairs: Atsushi Takahashi (Tokyo Institute of Technology, Japan), Jung Dong Cho (Sungkyunkwan Univ., Republic of Korea)

3A-1 (Time: 15:50 - 16:15)
Title: MaizeRouter: Engineering an Effective Global Router
Author: *Michael D. Moffitt (IBM Austin Research Lab, United States)
Page: pp. 226 - 231
Keyword: global routing
Abstract: In this paper, we present MaizeRouter, winner of the inaugural 2007 Global Routing Contest. MaizeRouter reflects a significant leap in progress over existing publicly available tools and draws upon simple yet powerful edge-based operations (including extreme edge shifting, a technique aimed at congestion reduction, and edge retraction, a counterpart to extreme edge shifting that reduces unnecessary wirelength). These algorithmic contributions are built upon a framework of interdependent net decomposition and permit the exploration of a broad search space that previous algorithms have been unable to reach.

3A-2 (Time: 16:15 - 16:40)
Title: A New Global Router for Modern Designs
Author: *Jhih-Rong Gao, Pei-Ci Wu (Synopsys, Taiwan), Ting-Chi Wang (Nat'l Tsing Hua Univ., Taiwan)
Page: pp. 232 - 237
Keyword: Global Routing
Abstract: In this paper, we present a new global router, NTHU-Route, for modern designs. NTHU-Route is based on iterative rip-ups and reroutes, and several techniques are proposed to enhance our global router. These techniques include (1) a history based cost function which helps to distribute overflow during iterative rip-ups and reroutes, (2) an adaptive multi-source multi-sink maze routing method to improve the wirelength of maze routing, (3) a congested region identification method to specify the order in which nets are ripped up and rerouted, and (4) a refinement process to further reduce overflow when iterative history based rip-ups and reroutes reach a bottleneck. Compared with two state-of-the-art works on ISPD98 benchmarks, NTHU-Route outperforms them in both overflow and wirelength. For the much larger designs from the ISPD07 benchmark suite, our solution quality is better than or comparable to the best results reported in the ISPD07 routing contest.

3A-3 (Time: 16:40 - 17:05)
Title: Routability Driven Modification Method of Monotonic Via Assignment for 2-Layer Ball Grid Array Packages
Author: *Yoichi Tomioka, Atsushi Takahashi (Tokyo Inst. of Tech., Japan)
Page: pp. 238 - 243
Keyword: ball grid array, monotonic, package, routing
Abstract: Ball Grid Array packages, in which I/O pins are arranged in a grid array pattern, realize a large number of connections between chips and a printed circuit board, but routing them manually takes much time. We propose a fast routing method for 2-layer Ball Grid Array packages to support designers. Our method distributes wires evenly on the top layer and increases the completion ratio of nets by iteratively improving the via assignment.

3A-4 (Time: 17:05 - 17:30)
Title: Ordered Escape Routing Based on Boolean Satisfiability
Author: Lijuan Luo, *Martin D.F. Wong (University of Illinois at Urbana-Champaign, United States)
Page: pp. 244 - 249
Keyword: escape routing, Boolean satisfiability
Abstract: Routing for high-speed boards is largely a time-consuming manual task today. In this paper we consider the ordered escape routing problem, which is a key problem in board-level routing. None of the existing approaches to this problem can guarantee to find a routing solution even if one exists. We present an algorithm that solves this problem exactly based on Boolean satisfiability. Experimental results on escape routing problems from industry show that our algorithm performs well.

3A-5 (Time: 17:30 - 17:55)
Title: MeshWorks: An Efficient Framework for Planning, Synthesis and Optimization of Clock Mesh Networks
Author: *Anand Rajaram, David Z. Pan (University of Texas at Austin, United States)
Page: pp. 250 - 257
Keyword: Clock, Mesh, CTS
Abstract: A leaf-level clock mesh is known to be very tolerant to variations [1]. However, its use is limited to a few high-end designs because of the high power/resource requirements and the lack of automatic mesh synthesis tools [2]. Most existing works on clock meshes [1], [3]–[7] either deal with semi-custom design or perform optimizations on a given clock mesh. However, the problem of obtaining a good initial clock mesh has not been addressed. Similarly, the problem of achieving a smooth tradeoff between skew and power/resources has not been addressed adequately. In this work, we present MeshWorks, the first comprehensive automated framework for planning, synthesis and optimization of clock mesh networks, with the objective of addressing the above issues. Experimental results suggest that our algorithms can achieve an additional reduction of 26% in buffer area, 19% in wirelength and 18% in power, compared to the recent work of [7], with similar worst-case maximum frequency under variation.


Session 3B  Interconnect, NoCs, and MPSoCs
Time: 15:50 - 17:30 Tuesday, January 22, 2008
Location: Room 310BC
Chairs: Sungjoo Yoo (Samsung Electronics, Republic of Korea), Sungchan Kim (Seoul Nat'l Univ., Republic of Korea)

3B-1 (Time: 15:50 - 16:15)
Title: Interconnect Modeling for Improved System-Level Design Optimization
Author: Luca Carloni (Columbia University, United States), Andrew B. Kahng, Swamy Muddu (University of California, San Diego, United States), Alessandro Pinto (University of California, Berkeley, United States), *Kambiz Samadi, Puneet Sharma (University of California, San Diego, United States)
Page: pp. 258 - 264
Keyword: System-Level, Network-on-Chip, Interconnect Delay, Modeling
Abstract: Accurate modeling of delay, power, and area of interconnections early in the design phase is crucial for efficient system-level optimization. Models presently used in system-level optimizations, such as network-on-chip (NoC) synthesis, are inaccurate in the presence of deep-submicron effects. In this paper, we propose new, highly accurate models for delay and power in buffered interconnects; these models are usable by system-level designers for existing and future technologies. We present a general and transferable methodology to construct our models from a wide variety of reliable sources (Liberty, LEF/ITF, ITRS, PTM, etc.). The modeling infrastructure and a number of characterized technologies are available as open source. Our models comprehend key interconnect circuit and layout design styles, and a power-efficient buffering technique that overcomes the unrealistic aspects of previous delay-driven buffering techniques. We show that our models are significantly more accurate than previous models for global and intermediate buffered interconnects in 90nm and 65nm foundry processes, essentially matching signoff analyses. We also integrate our models in an automatic NoC topology synthesis tool and show that the more accurate modeling significantly affects the optimal/achievable architectures synthesized by the tool. The increased accuracy afforded by our models enables system-level designers to obtain better assessments of the achievable performance/power/area tradeoffs for (communication-centric aspects of) system design, with negligible setup and overhead burdens.

3B-2 (Time: 16:15 - 16:40)
Title: NoCOUT: NoC Topology Generation with Mixed Packet-Switched and Point-to-Point Networks
Author: Jeremy Chan, *Sri Parameswaran (Univ. of New South Wales, Australia)
Page: pp. 265 - 270
Keyword: NoC, Topology, Generation
Abstract: Networks-on-Chip (NoC) have been widely proposed as the future communication paradigm for next-generation Systems-on-Chip. In this paper, we present NoCOUT, a methodology for generating an energy-optimized, application-specific NoC topology that supports both point-to-point and packet-switched networks. The algorithm uses a prohibitive greedy iterative improvement strategy to explore the design space efficiently. A system-level floorplanner is used to evaluate the iterative design improvements and to provide feedback on the effects of the topology on wire length. The algorithm is integrated within a NoC synthesis framework with characterized NoC power and area models to allow accurate exploration of a NoC router library. We apply the topology generation algorithm to several test cases, including real-world and synthetic communication graphs with both regular and irregular traffic patterns and varying core sizes. Since the method is iterative, it is possible to start with a known design and search for improvements. Experimental results show that many different applications benefit from a mix of "on-chip networks" and "point-to-point networks". With such a hybrid network, we achieve approximately 25% lower energy consumption (with a maximum of 37%) than a state-of-the-art min-cut partition based topology generator for a variety of benchmarks. In addition, the average hop count is reduced by 0.75 hops, which would significantly reduce the network latency.

3B-3 (Time: 16:40 - 17:05)
Title: Automatic Generation of Hardware dependent Software for MPSoCs from Abstract System Specifications
Author: *Gunar Schirner, Andreas Gerstlauer, Rainer Dömer (University of California, Irvine, United States)
Page: pp. 271 - 276
Keyword: software synthesis, Hardware dependent Software, TLM, system level design
Abstract: Increasing software content in embedded systems and SoCs drives the demand to automatically synthesize software binaries from abstract models. This is especially critical for Hardware dependent Software (HdS) due to its tight coupling to the hardware. In this paper, we present our approach to automatically synthesizing HdS from an abstract system model. We synthesize driver code, interrupt handlers, and startup code. Furthermore, we automatically adjust the application to use RTOS services. We target traditional RTOS-based multi-tasking solutions as well as a pure interrupt-based implementation (without any RTOS). Our experimental results show the automatic generation of final binary images for six real-life target applications and demonstrate significant productivity gains due to automation. Our HdS synthesis is an enabler for efficient MPSoC development and rapid design space exploration.

3B-4 (Time: 17:05 - 17:30)
Title: Application-Specific Network-on-Chip Architecture Synthesis Based on Set Partitions and Steiner Trees
Author: *Shan Yan, Bill Lin (University of California, San Diego, United States)
Page: pp. 277 - 282
Keyword: Network-on-Chip, communication architecture synthesis, custom topology synthesis, Rectilinear Steiner Tree
Abstract: This paper considers the problem of synthesizing application-specific Network-on-Chip (NoC) architectures. We propose two heuristic algorithms called CLUSTER and DECOMPOSE that can systematically examine different set partitions of communication flows, and we propose Rectilinear-Steiner-Tree (RST) based algorithms for generating an efficient network topology for each group in the partition. Different evaluation functions, suited to the implementation back end and the corresponding implementation technology, can be incorporated into our solution framework to evaluate the implementation cost of the generated set partitions and RST topologies. In particular, we experimented with an implementation cost model based on the power consumption parameters of a 70nm process technology where leakage power is a major source of energy consumption. Experimental results on a variety of NoC benchmarks showed that our synthesis results can on average achieve a 6.92× reduction in power consumption over the best standard mesh implementation. To further gauge the effectiveness of our heuristic algorithms, we also implemented an exact algorithm that enumerates all distinct set partitions. For the benchmarks where exact results could be obtained, our CLUSTER and DECOMPOSE algorithms achieve results within 1% and 2% of the exact results on average, respectively, with execution times all under 1 second, whereas the exact algorithm took as long as 4.5 hours.
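The exact algorithm mentioned above enumerates all distinct set partitions of the communication flows. A recursive generator for set partitions is sketched below: each partition of the remaining items is extended by placing the first item into every existing group in turn, or into a new group of its own, which produces each partition exactly once. The flow names are hypothetical, and no topology generation or cost evaluation is included.

```python
from collections.abc import Iterator


def set_partitions(items: list) -> Iterator[list[list]]:
    """Yield every partition of `items` into non-empty, unordered groups."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Place `first` into each existing group in turn ...
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # ... or into a new singleton group.
        yield [[first]] + partition
```

For two flows `["f1", "f2"]` the generator yields `[["f1", "f2"]]` and `[["f1"], ["f2"]]`. The number of partitions follows the Bell numbers (1, 2, 5, 15, 52, 203 for 1 to 6 flows), which illustrates why exhaustive enumeration quickly becomes expensive.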


Session 3D  Special Session (Panel) The Tears and Joy of Sowing and Reaping Complex SoC's
Time: 15:50 - 17:55 Tuesday, January 22, 2008
Location: Room 311A+311BC
Chair: Ing-Jer Huang (National Sun Yat-Sen University, Taiwan)

3D-2
Title(Invited Paper) Floating-Point Reconfiguration Array Processor for 3D Graphics Physics Engine
Author*Hoonmo Yang (Core Logic, Republic of Korea)
Pagep. 283

3D-5
Title(Invited Paper) Super-K: A SoC for Single-chip Ultra Mobile Computer
Author*Xu Cheng (Peking University, China)
Pagep. 284

3D-6
Title(Panel Discussion) The Tears and Joy of Sowing and Reaping Complex SoC's
AuthorModerator: Ing-Jer Huang (Nat'l Sun Yat-Sen Univ., Taiwan), Panelists: Youn-Long Lin (Nat'l Tsing Hua Univ./Global UniChip, Taiwan), Hoonmo Yang (Core Logic, Republic of Korea), Toshihiro Hattori (Renesas Technology, Japan), Ahmed Jerraya (CEA-LETI, MINATEC, France), Xu Cheng (Peking Univ., China)



Wednesday, January 23, 2008

Session 2K  Keynote Session II
Time: 9:00 - 10:00 Wednesday, January 23, 2008
Location: Room 409
Chair: Kiyoung Choi (Seoul National Univ., Republic of Korea)

2K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) The Evolution of SoC Platform According to the New Mobile Paradigm
AuthorKi-Soo Hwang (Core Logic, Republic of Korea)
Pagep. 285


Session 4A  Variability Issues in Timing
Time: 10:15 - 12:20 Wednesday, January 23, 2008
Location: Room 310A
Chairs: Masanori Hashimoto (Osaka University, Japan), Janet Wang (University of Arizona, United States)

4A-1 (Time: 10:15 - 10:40)
TitleStatistical Gate Delay Model for Multiple Input Switching
Author*Takayuki Fukuoka, Akira Tsuchiya, Hidetoshi Onodera (Kyoto University, Japan)
Pagepp. 286 - 291
KeywordStatistical timing, Multiple input switching, Process variation
AbstractIn this paper, we propose a method for calculating gate delay in SSTA (Statistical Static Timing Analysis) that considers MIS (Multiple Input Switching). Most SSTA approaches assume a single-input switching model and ignore the effect of MIS on gate delay. MIS occurs when multiple inputs of a gate switch nearly simultaneously, so ignoring it causes error in the MAX operation of SSTA. We propose a statistical gate delay model that considers MIS. We verify the proposed method with SPICE-based Monte Carlo simulations, and experimental results show that the proposed method reduces the error caused by ignoring MIS.

4A-2 (Time: 10:40 - 11:05)
TitleNon-Gaussian Statistical Timing Models of Die-to-Die and Within-Die Parameter Variations for Full Chip Analysis
Author*Katsumi Homma, Izumi Nitta, Toshiyuki Shibuya (Fujitsu Labs., Japan)
Pagepp. 292 - 297
KeywordStatistical Timing Analysis, die-to-die variations, within-die variations
AbstractStatistical Static Timing Analysis (SSTA) is a method that statistically calculates circuit delay under process parameter variations, including die-to-die (D2D) and within-die (WID) variations. In this paper, we model WID parameter variations separately for each cell and interconnect line on a chip, while D2D variations are governed by a single variation common to the whole chip. We propose a new method of computing a full-chip delay distribution considering both D2D and WID parameter variations. Experimental results on actual chip designs show that the proposed method is more accurate than previous methods.

4A-3 (Time: 11:05 - 11:30)
TitleNon-Gaussian Statistical Timing Analysis Using Second-Order Polynomial Fitting
AuthorLerong Cheng (Univ. of California, Los Angeles, United States), *Jinjun Xiong (IBM, United States), Lei He (Univ. of California, Los Angeles, United States)
Pagepp. 298 - 303
KeywordTiming, Statistical
AbstractIn the nanometer manufacturing regime, process variation causes significant uncertainty in circuit performance verification. Statistical static timing analysis (SSTA) has thus been developed to estimate the timing distribution under process variation. However, most existing SSTA techniques have difficulty handling non-Gaussian variation distributions and the non-linear dependency of delay on variation sources. To solve this problem, in this paper we first propose a new method to approximate the max operation of two non-Gaussian random variables through second-order polynomial fitting. We then present new non-Gaussian SSTA algorithms under two types of variational delay models: a quadratic model and a semi-quadratic model (i.e., a quadratic model without cross terms). All atomic operations (such as max and sum) of our algorithms are performed by closed-form formulas, hence they scale well for large designs. Experimental results show that, compared to Monte-Carlo simulation, our approach predicts the mean, standard deviation, and skewness within 1%, 1%, and 5% error, respectively. Our approach is more accurate and also 20x faster than the most recent method for non-Gaussian and nonlinear SSTA.
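As a point of reference for the max operation discussed above, the classical Gaussian case has a closed form: for jointly Gaussian arrival times A and B, Clark's moment-matching formulas give the mean and variance of max(A, B). The sketch below implements those formulas in Python; it is the Gaussian baseline that block-based SSTA traditionally uses, not the second-order polynomial fitting proposed in this paper for non-Gaussian variables. Function and parameter names are illustrative.

```python
import math
from typing import Tuple


def _pdf(x: float) -> float:
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x: float) -> float:
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def clark_max(mu1: float, sigma1: float, mu2: float, sigma2: float,
              rho: float = 0.0) -> Tuple[float, float]:
    """Mean and variance of max(A, B) for jointly Gaussian A ~ N(mu1, sigma1^2),
    B ~ N(mu2, sigma2^2) with correlation rho, using Clark's moment formulas."""
    theta = math.sqrt(max(sigma1**2 + sigma2**2 - 2.0 * rho * sigma1 * sigma2, 0.0))
    if theta == 0.0:
        # A - B is deterministic, so the max is whichever variable has the larger mean.
        return (mu1, sigma1**2) if mu1 >= mu2 else (mu2, sigma2**2)
    alpha = (mu1 - mu2) / theta
    t = _cdf(alpha)  # Pr[A > B], often called the tightness probability
    mean = mu1 * t + mu2 * (1.0 - t) + theta * _pdf(alpha)
    second = ((mu1**2 + sigma1**2) * t + (mu2**2 + sigma2**2) * (1.0 - t)
              + (mu1 + mu2) * theta * _pdf(alpha))
    return mean, second - mean**2
```

For two independent standard normal inputs, `clark_max(0.0, 1.0, 0.0, 1.0)` returns a mean of 1/sqrt(pi) ≈ 0.5642 and a variance of 1 - 1/pi ≈ 0.6817. The result is only moment-matched: the true max is skewed, which is the kind of non-Gaussian behavior the approach above targets.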

4A-4 (Time: 11:30 - 11:55)
TitleA Capacitive Boosted Buffer Technique for High-Speed Process-Variation-Tolerant Interconnect in UDVS Application
AuthorSaihua Lin, *Yu Wang, Rong Luo, Huazhong Yang (Tsinghua Univ., China)
Pagepp. 304 - 309
Keywordinterconnect, buffer, process variation
AbstractIn this paper, we propose a new capacitive boosted buffer technique that can be used in high-speed interconnect for ultra-dynamic voltage scaling (UDVS) applications while mitigating the effect of process variation. The circuit is simple and fully compatible with digital CMOS technology. Implemented in a standard 0.18 µm CMOS technology, the circuit is shown to be applicable to both sub-threshold and above-threshold circuits without the short-circuit current problem. Simulation results demonstrate that the proposed buffer is more robust to load, process, voltage, and temperature (PVT) variations. When applied to a simple H-tree clock network, the proposed buffer reduces the skew by 5.5× compared with the traditional buffer.

4A-5 (Time: 11:55 - 12:20)
TitleStatic Timing: Back to Our Roots
AuthorRuiming Chen, Lizheng Zhang, Vladimir Zolotov, Chandu Visweswariah, *Jinjun Xiong (IBM, United States)
Pagepp. 310 - 315
KeywordStatistical Timing Methodology, pessimism reduction, spatial correlation modeling, Incremental Timing
AbstractExisting static timing methodologies apply various techniques to address increasingly large process variations. The techniques include multi-corner timing, on-chip variation (OCV) derating coefficients, and path-based common path pessimism removal (CPPR) procedures. These techniques, however, destroy the linear run-time and incrementality of classical static timing. The major contribution of this work is an efficient statistical timing methodology with comprehensive modeling of process variations that at the same time retains those key benefits. Our methodology is compatible with existing characterization methods and scales well to large chip designs. To achieve this goal, three techniques are developed: (1) building the statistical delay model based on existing multi-corner library characterization; (2) modeling spatial correlation in a scalable manner; and (3) avoiding the time-consuming CPPR procedure by removing common path pessimism in the clock network with an incremental block-based technique. Experimental results on industrial 90 nm ASIC designs show that the proposed timing methodology correctly handles all types of process variation, achieves high correlation with traditional multi-corner timing with a more than 4x speedup, and serves as a vehicle for pessimism reduction.


Session 4B  Memory and Processor Optimization
Time: 10:15 - 12:20 Wednesday, January 23, 2008
Location: Room 310BC
Chairs: Jeonghun Cho (Kyungpook Nat'l Univ., Republic of Korea), Hiroyuki Tomiyama (Nagoya University, Japan)

4B-1 (Time: 10:15 - 10:40)
TitleSynthesis and Design of Parameter Extractors for Low-Power Pre-computation-Based Content-Addressable Memory Using Gate-Block Selection Algorithm
Author*Jui-Yuan Hsieh, Shanq-Jang Ruan (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 316 - 321
KeywordCAM, low-power, pre-computation, gate-block selection algorithm, synthesis
AbstractContent-addressable memory (CAM) is frequently used in applications that require high-speed searches, such as lookup tables, databases, associative computing, and networking, because its parallel comparison reduces search time and improves application performance. Although parallel comparison yields fast searches, it also significantly increases power consumption. In this paper, we propose a gate-block selection algorithm that synthesizes a suitable parameter extractor for the pre-computation-based CAM (PB-CAM) to improve its efficiency for specific applications such as embedded systems. Experimental results show that our approach effectively reduces the number of comparison operations for specific data types (by 19.24% to 27.42%) compared with the 1's count approach. We used Synopsys Nanosim to estimate the power consumption in a TSMC 0.35 µm CMOS process. Compared with the 1's count PB-CAM, our proposed PB-CAM achieves a 17.72% to 21.09% power reduction for specific data types.
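The 1's count baseline that this paper compares against can be illustrated with a small software model: each stored word carries a pre-computed parameter (its number of 1 bits), and a search performs full word comparisons only on entries whose parameter matches that of the key. This Python sketch assumes that simplified behavior and uses a hypothetical class name; it does not model the synthesized parameter extractor or the gate-block selection algorithm, which replace the 1's count with an application-specific parameter.

```python
from typing import List, Tuple


def ones_count(word: int) -> int:
    """The 1's count parameter: number of set bits in a word."""
    return bin(word).count("1")


class OnesCountPBCAM:
    """Toy software model of a pre-computation-based CAM (hypothetical name)."""

    def __init__(self, words: List[int]):
        self.words = list(words)
        # Parameter pre-computed once at write time and stored with each word.
        self.params = [ones_count(w) for w in self.words]

    def search(self, key: int) -> Tuple[List[int], int]:
        """Return (indices of matching words, number of full word comparisons)."""
        key_param = ones_count(key)
        matches: List[int] = []
        comparisons = 0
        for i, (word, param) in enumerate(zip(self.words, self.params)):
            if param != key_param:
                continue  # parameter mismatch: the full comparison is skipped
            comparisons += 1
            if word == key:
                matches.append(i)
        return matches, comparisons


if __name__ == "__main__":
    cam = OnesCountPBCAM([0b0000, 0b0011, 0b0101, 0b1111, 0b0110])
    print(cam.search(0b0101))  # ([2], 3): three words share the 1's count of 2
```

A parameter whose values are spread more evenly over the stored data produces fewer collisions and therefore fewer full comparisons per search.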

4B-2 (Time: 10:40 - 11:05)
TitleBlock Cache for Embedded Systems
Author*Dominic Hillenbrand, Jörg Henkel (University of Karlsruhe (TH), Germany)
Pagepp. 322 - 327
Keywordcache, on chip memory, embedded systems, system on chip, memory bandwidth
AbstractWe present a new method, which we call block caches, that automatically uses on-chip memory for blocks of instructions that are dynamically scheduled at runtime, in order to increase performance and reduce power consumption. Block caches can already outperform instruction caches of the same size. We provide initial data and insights into the automated use of block caches and their respective online and offline phases.

4B-3 (Time: 11:05 - 11:30)
TitleA Compiler-in-the-Loop Framework to Explore Horizontally Partitioned Cache Architectures
Author*Aviral Shrivastava (Arizona State University, United States), Ilya Issenin, Nikil Dutt (University of California, Irvine, United States)
Pagepp. 328 - 333
Keywordembedded, compiler, processor, cache, energy
AbstractHorizontally Partitioned Caches (HPCs) are a promising architectural feature for reducing the energy consumption of the memory subsystem. However, the energy reduction obtained using HPC architectures is very sensitive to the HPC parameters. It is therefore very important to explore the HPC design space and carefully choose the HPC parameters that result in minimum energy consumption for the application. Moreover, since the compiler has a significant impact on the energy consumption of the memory subsystem in HPC architectures, it is extremely important to include the compiler when deciding the HPC design parameters. There have been no previous approaches to HPC design exploration, and existing cache design space exploration (DSE) methodologies do not include compiler effects during DSE. In this paper, we present a Compiler-in-the-Loop (CIL) DSE methodology to explore and decide the HPC design parameters. Our experimental results on an HP iPAQ h4300-like memory subsystem running benchmarks from the MiBench suite demonstrate that CIL DSE can discover HPC configurations with up to 80% less energy consumption than the HPC configuration in the iPAQ. In contrast, traditional simulation-only exploration can discover HPC design parameters that result in only a 57% memory subsystem energy reduction. Finally, our hybrid CIL DSE heuristic saves 67% of the exploration time compared with exhaustive exploration, while providing the maximum possible energy savings on our set of benchmarks.

4B-4 (Time: 11:30 - 11:55)
TitleFast, Quasi-Optimal, and Pipelined Instruction-Set Extensions
Author*Ajay K. Verma, Philip Brisk, Paolo Ienne (EPFL, Switzerland)
Pagepp. 334 - 339
KeywordInstruction Set Extension, Integer Linear Programming
AbstractMany customised embedded processors now offer the possibility of speeding up an application by implementing it using Application-Specific Functional Units (AFUs). However, the AFUs must satisfy certain constraints on the number of read and write ports between the AFU and the processor register file. Due to these restrictions, the size and complexity of AFUs remain small. Recently, however, some work has relaxed the register file port constraints by serialising register file access (i.e., by allowing multi-cycle reads and writes). This makes the problem of selecting the best AFUs significantly more complex. Most previous approaches solve this problem in two stages, first selecting AFUs under some higher I/O constraints and then serialising them under the actual register file port constraints. These methods are not only complex but also lead to suboptimal solutions. In this paper we formulate the AFU selection problem as an Integer Linear Program and solve it optimally. We show experimentally that our methodology produces significantly better results than state-of-the-art techniques.

4B-5 (Time: 11:55 - 12:20)
TitleLoad Scheduling: Reducing Pressure on Distributed Register Files for Free
Author*Mei Wen, Nan Wu, Maolin Guan, Chunyuan Zhang (National University of Defense Technology, China)
Pagepp. 340 - 345
KeywordVLIW, distributed register files
AbstractIn this paper we describe load scheduling, a novel method that balances load among register files by exploiting residual resources. Load scheduling can reduce register pressure for clustered VLIW processors with distributed register files without increasing the VLIW schedule length. We have implemented load scheduling in the compiler for the Imagine and FT64 stream processors. The results show that the proposed technique effectively reduces the number of variables spilled to memory and can even eliminate spilling entirely. The algorithm presented in this paper is particularly well suited to embedded processors with limited register resources because it improves register utilization rather than increasing the number of registers required.


Session 4C  New Techniques for Physical Design Optimization
Time: 10:15 - 12:20 Wednesday, January 23, 2008
Location: Room 311A
Chairs: Evangeline F.Y. Young (The Chinese Univ. of Hong Kong, Hong Kong), Sherief Reda (Brown University, United States)

4C-1 (Time: 10:15 - 10:40)
TitleDPlace2.0: A Stable and Efficient Analytical Placement Based on Diffusion
AuthorTao Luo, *David Z. Pan (University of Texas at Austin, United States)
Pagepp. 346 - 351
KeywordPlacement
AbstractModern placement problems often involve millions of objects and numerous fixed blockages. We present a new global placement algorithm that scales well to modern large-scale circuit placement problems. We simulate the natural diffusion process to spread cells smoothly over the placement region, and use both analytical and discrete techniques to improve the wire length. Although any analytical wire length technique can be used in our framework, with the quadratic wire length model the Hessian of our formulation is extremely sparse compared with conventional formulations, which yields a 24x speedup of the quadratic solver. We also propose a wire linearization technique that transforms the quadratic star model into HPWL exactly. The overall runtime of our tool is close to that of the fastest placement tool in the existing literature and significantly better than the others, while our wire length results are competitive with the best known ones. The average total wire length is 2.2% higher than mPL6, and 0.2%, 3.1%, and 9.1% better than FastPlace3.0, APlace2.0, and Capo10.2, respectively.
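Wire length results in this abstract are expressed in HPWL (half-perimeter wirelength), the standard placement metric: for each net, the half-perimeter of the bounding box of its pins, summed over all nets. A minimal Python sketch, assuming each net is given as a non-empty list of (x, y) pin coordinates; it does not implement DPlace's diffusion-based spreading or its star-model linearization.

```python
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]


def hpwl(pins: Iterable[Point]) -> float:
    """Half-perimeter of the bounding box enclosing one net's pins (net must have >= 1 pin)."""
    xs, ys = zip(*pins)
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets: Iterable[Sequence[Point]]) -> float:
    """Sum of per-net HPWL over a placement."""
    return sum(hpwl(net) for net in nets)


if __name__ == "__main__":
    nets = [[(0, 0), (3, 1), (1, 4)], [(2, 2), (5, 6)]]
    print(total_hpwl(nets))  # 7 + 7 = 14
```

HPWL is exact for two- and three-pin nets under rectilinear Steiner routing and a lower bound otherwise, which is why it is the common yardstick for comparing placers.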

4C-2 (Time: 10:40 - 11:05)
TitleTotal Power Optimization Combining Placement, Sizing and Multi-Vt Through Slack Distribution Management
AuthorTao Luo (Univ. of Texas, Austin, United States), David Newmark (Advanced Micro Devices, United States), *David Z. Pan (Univ. of Texas, Austin, United States)
Pagepp. 352 - 357
Keywordpower, leakage, gate sizing, threshold voltage
AbstractPower dissipation is quickly becoming one of the most important limiters in nanometer IC design because leakage increases exponentially as technology scales down. However, power and timing are often conflicting objectives during optimization. In this paper, we propose a novel total power optimization flow under performance constraints. Instead of applying placement, gate sizing, and multiple-Vt assignment techniques independently, we combine them through the concept of slack distribution management to maximize the potential for power reduction. We propose to use linear programming (LP) based placement and geometric programming (GP) based gate sizing formulations to improve the slack distribution, which helps maximize the total power reduction during the Vt-assignment stage. Our formulations include important practical design constraints, such as slew, noise, and short-circuit power, which were often ignored previously. We tested our algorithm on a set of industrial-strength, manually optimized circuits from a multi-GHz 65nm microprocessor and obtained very promising results. To the best of our knowledge, this is the first work that systematically combines placement, gate sizing, and Vt swapping for total power (and in particular leakage) management.

4C-3 (Time: 11:05 - 11:30)
TitleAn Innovative Steiner Tree Based Approach for Polygon Partitioning
AuthorYongqiang Lu, *Qing Su, Jamil Kawa (Synopsys, United States)
Pagepp. 358 - 363
KeywordMinimal Steiner tree, Polygon partition, Minimal Partition tree
AbstractAs device technology continues to scale past 65nm, the number of geometries added by the heavy application of resolution enhancement techniques (RET) continues to grow. This is a direct consequence of 193nm lithography having to suffice for tighter geometries at every new node. As a result, issues associated with mask data preparation (MDP), such as complexity, run time, and quality, are growing in severity. As a major core step in MDP, polygon partitioning converts complex layout shapes into trapezoids suitable for mask writing. The partitioning run time and the quality of the resulting polygon partitions directly impact the cost, integrity, and quality of the written mask. In this work, we introduce an innovative approach to the polygon partition quality problem that constructs a variant of the Steiner minimal tree, the minimal partition tree (MPT). We prove the equivalence between the MPT and the optimal polygon partition. We also reduce the search space for the MPT to improve the efficiency of the MPT algorithms. Finally, we propose a generic MPT algorithm flow and a linear-time heuristic algorithm based on it. Experimental results show that this new approach and the associated algorithm solve polygon partitioning problems with high-quality results.

4C-4 (Time: 11:30 - 11:55)
TitleAn MILP-Based Wire Spreading Algorithm for PSM-Aware Layout Modification
Author*Ming-Chao Tsai, Yung-Chia Lin, Ting-Chi Wang (the Department of Computer Science, National Tsing Hua University, Taiwan)
Pagepp. 364 - 369
KeywordPSM, MILP, wire spreading, RET
AbstractPhase shifting mask (PSM) is a promising resolution enhancement technique used in deep sub-wavelength lithography for VLSI fabrication. However, applying the PSM technique requires the layout to be free of phase conflicts. In this paper, we present an MILP-based layout modification algorithm that solves the phase conflict problem by wire spreading. Unlike existing layout modification methods, which first solve the phase conflict problem by removing edges from the layout-associated conflict graphs and then try to revise the layout to match the resultant conflict graphs, our algorithm simultaneously considers the phase conflict problem and the feasibility of modifying the layout. The experimental results indicate that, without increasing the chip size, the phase conflict problem can be effectively resolved with minimal perturbation to the layout.

4C-5 (Time: 11:55 - 12:08)
TitleLow Power Clock Buffer Planning Methodology in F-D Placement for Large Scale Circuit Design
Author*Yanfeng Wang, Qiang Zhou, Yici Cai (Tsinghua University, China), Jiang Hu (Texas A&M University, United States), Xianlong Hong, Jinian Bian (Tsinghua University, China)
Pagepp. 370 - 375
Keywordlow power, buffer planning, F-D placement
AbstractTraditionally, clock network layout is performed after cell placement. This methodology faces a serious problem in nanometer IC designs, where designers tend to use huge clock buffers for robustness against variations: clock buffers are often placed far from their ideal locations to avoid overlapping logic cells. As a result, both power dissipation and timing are degraded. To solve this problem, we propose a low power clock buffer planning methodology that is integrated with cell placement. A Bin-Divided Grouping algorithm is developed to construct a virtual buffer tree, which explicitly models the clock buffers in placement. The virtual buffer tree is dynamically updated during placement to reflect changes in latch locations. To reduce power dissipation, latch clamping is incorporated into the clock buffer planning. The experimental results show that our method reduces clock power by 21% on average.

4C-6 (Time: 12:08 - 12:21)
TitlePower Grid Analysis Benchmarks
Author*Sani R. Nassif (IBM, United States)
Pagepp. 376 - 381
KeywordPower Grid Analysis, Benchmarks
AbstractBenchmarks are an immensely useful tool in performing research since they allow for rapid and clear comparison between different approaches to solving CAD problems. Recent experience from the placement and routing areas suggests that the ready availability of realistic industrial-size benchmarks can energize research in a given area and can even lead to significant breakthroughs. To this end, we are making a number of power grid analysis benchmarks available to the public. These are all drawn from real designs and vary over a reasonable range of size and difficulty, thereby making studies of algorithm complexity possible. This paper documents the format of the various benchmarks and gives details on how to access them.


Session 4D  Designers' Forum - New Emerging Application Areas for Future SoC
Time: 10:15 - 12:20 Wednesday, January 23, 2008
Location: Room 311BC
Chair: Sungjoo Yoo (Samsung Electronics, Republic of Korea)

4D-1
Title(Invited Paper) In-band Mobile Digital TV Transmission Technology for Advanced Television Systems Committee
Author*Junehee Lee (Samsung Electronics, Republic of Korea)
Pagep. 382

4D-2
Title(Invited Paper) In-Vehicle Vision Processors for Driver Assistance Systems
Author*Shorin Kyo, Shin’ichiro Okazaki (NEC Corp., Japan)
Pagepp. 383 - 388

4D-3
Title(Invited Paper) Multi-Core DSP for Base Stations: Large and Small
Author*Doug Pulley (picoChip, Great Britain)
Pagepp. 389 - 391

4D-4
Title(Invited Paper) 1-cc Computer Using UWB-IR for Wireless Sensor Network
Author*Tatsuo Nakagawa, Masayuki Miyazaki, Goichi Ono, Ryosuke Fujiwara, Takayasu Norimatsu, Takahide Terada (Hitachi, Japan), Akira Maeki, Yuji Ogata, Shinsuke Kobayashi, Noboru Koshizuka, Ken Sakamura (YRP Ubiquitous Networking Lab., Japan)
Pagepp. 392 - 397
AbstractAn ultra-small, high-data-rate, low-power 1-cc computer (OCCC) with a UWB-IR (ultra-wideband impulse-radio) transceiver was developed for a wireless sensor network. Thanks to a bare-chip implementation and a flexible printed circuit board, the size of the computer is only 1 cm³. To achieve a 10-Mbps data rate, a mid-range 32-bit microcontroller with both a bus interface and a USB 2.0 controller was selected. Low-power techniques are applied, such as switching the microcontroller to standby mode during wait times by using an external real-time clock, shutting down power to halted circuits, and finely controlling the status of the UWB-IR transceiver. The effect of these low-power techniques is verified by measuring the time history of the OCCC's current consumption. It was confirmed that the OCCC can provide wireless communication at a transmission rate of 258 kbps over a distance of 30 m.


Session 5A  Techniques for Formal and Simulation-Based Verification
Time: 13:30 - 15:35 Wednesday, January 23, 2008
Location: Room 310A
Chairs: Sherief Reda (Brown University, United States), Jin-Young Choi (Korea University, Republic of Korea)

5A-1 (Time: 13:30 - 13:55)
TitleVerifying Full-Custom Multipliers by Boolean Equivalence Checking and an Arithmetic Bit Level Proof
Author*Udo Krautz, Markus Wedler, Wolfgang Kunz (University Kaiserslautern, Germany), Kai Weber, Christian Jacobi, Matthias Pflanz (IBM, Germany)
Pagepp. 398 - 403
Keywordformal verification
AbstractIn this paper we describe a methodology to formally verify highly optimized multipliers. We define a multiplier description language which abstracts from low-level optimizations and which can model a wide range of common implementations at a structural and arithmetic level. The correctness of the created model is established by bit level transformations matching the model against a standard multiplication specification. The model is also translated into a gate netlist to be compared with the full-custom implementation of the multiplier by standard equivalence checking.

5A-2 (Time: 13:55 - 14:20)
TitleA Symbolic Approach for Mixed-Signal Model Checking
Author*Alexander Jesser, Lars Hedrich (University of Frankfurt a.M., Germany)
Pagepp. 404 - 409
KeywordFormal Verification, Model Checking, Mixed-Signal, multi terminal binary decision diagram
AbstractIn this paper we introduce MScheck, a novel symbolic model checker for mixed-signal circuits. MScheck is capable of combining the continuous behavior typical of analog designs with the discrete behavior of the digital domain for formal verification. Timing information of both domains is symbolically stored in multi-terminal binary decision diagrams (MTBDDs) throughout the verification procedure. The effectiveness of our approach is demonstrated on a phase-locked loop (PLL) by formally verifying the locking property.

5A-3 (Time: 14:20 - 14:45)
TitleFaster Projection Based Methods for Circuit Level Verification
Author*Chao Yan, Mark Greenstreet (University of British Columbia, Canada)
Pagepp. 410 - 415
KeywordFormal Verification, Reachability Analysis, ODE
AbstractAs VLSI fabrication technology progresses to 65nm feature sizes and smaller, transistors no longer operate as ideal switches. This motivates the verification of digital circuits using continuous models. Recently, we showed how such verification can be performed using projection-based methods. However, the verification was slow, requiring nearly four CPU days to verify a nine-transistor toggle flip-flop. Here, we describe improvements to the reachability algorithms and optimizations of the software architecture. These produce a 15x reduction in computation time and significant reductions in the over-approximation errors. With these changes, the same toggle flip-flop can be verified in a few hours, making formal verification a viable alternative to circuit simulation.

5A-4 (Time: 14:45 - 15:10)
TitleA Debug Probe for Concurrently Debugging Multiple Embedded Cores and Inter-Core Transactions in NoC-Based Systems
AuthorShan Tang, *Qiang Xu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 416 - 421
KeywordPost-silicon validation, Debug probe, Transaction, NoC
AbstractExisting SoC debug techniques mainly target bus-based systems and are not readily applicable to emerging systems that use a Network-on-Chip (NoC) as the on-chip communication scheme. In this paper, we present the detailed design of a novel debug probe (DP) inserted between the core under debug (CUD) and the NoC. With embedded configurable triggers, delay control, and a timestamping mechanism, the proposed DP is very effective for inter-core transaction analysis as well as for controlling the debug processes of embedded cores. Experimental results demonstrate the functionality of the proposed DP and report its area overhead.

5A-5 (Time: 15:10 - 15:23)
TitleA Fast Two-Pass HDL Simulation with On-Demand Dump
Author*Kyuho Shim (Pusan Nat'l Univ., Republic of Korea), Youngrae Cho, Namdo Kim (Samsung Electronics, Republic of Korea), Hyuncheol Baik, Kyungkuk Kim, Dusung Kim (Pusan Nat'l Univ., Republic of Korea), Jaebum Kim, Byeongun Min, Kyumyung Choi (Samsung Electronics, Republic of Korea), Maciej Ciesielski (Logic-Mill Technology LLC, United States), Seiyang Yang (Pusan Nat'l Univ., Republic of Korea)
Pagepp. 422 - 427
KeywordSimulation
AbstractSimulation-based functional verification is characterized by two inherently conflicting targets: signal visibility and simulation performance. Achieving a proper trade-off between these two targets is of paramount importance. Even though HDL simulators are the most widely used verification platform at the RTL and gate level, their major drawback is low performance in verifying complex SoCs, especially when high visibility into the design under verification is required. This paper presents a new, fast simulation method that achieves both high simulation speed and full signal visibility. It is based on an original two-pass simulation approach. During the first pass, with the simulation running at full speed, a set of design states is saved periodically at predetermined checkpoints. During the second pass, another simulation is performed starting from any of the saved checkpoints, providing 100% signal visibility for debugging. Our method differs from the traditional simulation snapshot approach in the amount of design state saved and the way it is saved. Experimental results show a significant speedup compared with existing traditional simulation methods while maintaining 100% visibility.
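The two-pass idea can be illustrated with a toy model. The sketch assumes that the entire design state fits in one integer and that a hypothetical 16-bit LFSR stands in for the design under verification; a real simulator would save register and memory contents, and the abstract does not specify exactly what state the authors save. Pass 1 runs without any waveform dump and keeps only periodic checkpoints; pass 2 restores the nearest checkpoint at or before a debug window and re-simulates just that window with full visibility. All function names are hypothetical.

```python
from typing import Callable, Dict, List

State = int
Step = Callable[[State], State]


def toy_step(state: State) -> State:
    """Hypothetical DUT: one clock cycle of a 16-bit Fibonacci LFSR."""
    bit = ((state >> 0) ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)


def pass1(step: Step, init: State, n_cycles: int, interval: int) -> Dict[int, State]:
    """Full-speed run: no waveform dump, only a state checkpoint every `interval` cycles."""
    checkpoints = {0: init}
    state = init
    for cycle in range(1, n_cycles + 1):
        state = step(state)
        if cycle % interval == 0:
            checkpoints[cycle] = state
    return checkpoints


def pass2(step: Step, checkpoints: Dict[int, State], start: int, end: int) -> List[State]:
    """Restore the latest checkpoint at or before `start`, then dump every state in [start, end)."""
    origin = max(c for c in checkpoints if c <= start)
    state = checkpoints[origin]
    for _ in range(origin, start):  # fast-forward without dumping
        state = step(state)
    dump = []
    for _ in range(start, end):     # full-visibility window
        dump.append(state)
        state = step(state)
    return dump


if __name__ == "__main__":
    ckpts = pass1(toy_step, 0xACE1, n_cycles=1000, interval=100)
    window = pass2(toy_step, ckpts, start=437, end=452)
    print([hex(s) for s in window])
```

Because re-simulation starts from a checkpoint, the cost of pass 2 is bounded by the checkpoint interval plus the length of the debug window rather than by the full simulation length.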


Session 5B  Power and Performance Optimization for Embedded Systems
Time: 13:30 - 15:35 Wednesday, January 23, 2008
Location: Room 310BC
Chairs: Naehyuck Chang (Seoul National University, Republic of Korea), Tohru Ishihara (Kyushu Univ., Japan)

5B-1 (Time: 13:30 - 13:55)
TitleHybrid Solid-State Disks: Combining Heterogeneous NAND Flash in Large SSDs
Author*Li-Pin Chang (National Chiao Tung University, Taiwan)
Pagepp. 428 - 433
Keywordflash memory, storage system, file system, embedded system
AbstractThis paper presents a hybrid approach to large SSDs. The idea is to let SLC flash and MLC flash compensate for each other's drawbacks with their respective advantages. The technical issues in designing a hybrid SSD pertain to data placement and wear leveling over heterogeneous NAND flash. Our experimental results show that a hybrid SSD improves over a conventional SSD by 4.85 times in terms of average response time. The average throughput and energy consumption are improved by 17% and 14%, respectively.

5B-2 (Time: 13:55 - 14:20)
TitleEnabling Run-Time Memory Data Transfer Optimizations at the System Level with Automated Extraction of Embedded Software Metadata Information
Author*Alexandros Bartzas (Democritus University of Thrace, Greece), Miguel Peon-Quiros (Universidad Complutense de Madrid, Spain), Stylianos Mamagkakis, Francky Catthoor (IMEC vzw, Belgium), Dimitrios Soudris (Democritus University of Thrace, Greece), Jose Manuel Mendias (Universidad Complutense de Madrid, Spain)
Pagepp. 434 - 439
Keywordmetadata, embedded systems, DMA, profiling, dynamic data
AbstractInformation about the run-time behavior of software applications is crucial for enabling system-level optimizations for embedded systems. This embedded software Metadata information is especially important today because several complex multi-threaded applications are mapped onto the memory of a single embedded system. Each thread is triggered at run-time by different input events that cannot be predicted at design-time. New methods and tools are needed to automatically profile and analyze the dynamic data access behavior of simultaneously executing threads in order to enable memory data transfer optimizations. In this paper, we propose such a method and tool, which extract the necessary software Metadata information to enable these data transfer optimizations at the system level. We assess the effectiveness of our approach with the results for 5 real-life software applications using 7 real-life run-time input traces.

5B-3 (Time: 14:20 - 14:45)
TitleAutomatic Re-Coding of Reference Code into Structured and Analyzable SoC Models
AuthorPramod Chandraiah, *Rainer Dömer (University of California, Irvine, United States)
Pagepp. 440 - 445
KeywordSpecification Modeling, Structural hierarchy, System Level Design Languages, Code transformations, Architectural Exploration
AbstractThe quality of the input system model has a direct bearing on the effectiveness of system exploration and synthesis tools. Given a well-structured system model, today's tools are effective in generating efficient implementations. However, readily available reference C code is not conducive to system synthesis because it lacks the structure and analyzability required by the design flow. Reference C code is usually converted manually into a SoC model by applying the necessary transformations, and the type of transformations depends on the underlying design flow and tools. Proper structural hierarchy is one essential feature needed for architectural exploration. In this paper, we provide automatic C code transformations that encapsulate functions and insert structural hierarchy to create well-structured and analyzable SoC models. Our automatic transformations, combined with interactive application of the designer's knowledge and experience, enable faster creation of structural hierarchy in C models and hence significantly reduce the overall design time.

5B-4 (Time: 14:45 - 15:10)
TitleAction Coverage Formulation for Power Optimization in Body Sensor Networks
AuthorHassan Ghasemzadeh, *Eric Guenterberg, Katherine Gilani, Roozbeh Jafari (University of Texas, Dallas, United States)
Pagepp. 446 - 451
KeywordBody Sensor Networks, Wearable Embedded Systems, Physical Movement Monitoring, Power Optimization, Classification
AbstractAdvances in technology have led to the development of various lightweight sensory devices that can be woven into the physical environment of our daily lives. Such systems enable on-body and mobile health-care monitoring. Our particular interest lies in movement monitoring platforms that operate with inertial sensors. In this paper, we propose a power optimization technique that considers the sensing coverage problem from a collaborative signal processing perspective. We introduce compatibility graphs and describe how they can be used for power optimization. Because the problem we outline can be transformed into an NP-hard problem, we propose both an ILP formulation that attains a lower bound on the solution and a fast greedy technique. Alongside this, we introduce a system for dynamically activating and deactivating sensor nodes in real time. Finally, we demonstrate the effectiveness of our techniques on data collected from several subjects.

5B-5 (Time: 15:10 - 15:23)
TitleDynamic Scheduling of Imprecise-Computation Tasks in Maximizing QoS under Energy Constraints for Embedded Systems
Author*Heng Yu, Bharadwaj Veeravalli, Yajun Ha (National University of Singapore, Singapore)
Pagepp. 452 - 455
KeywordRT Embedded Systems, Scheduling, Imprecise-Computation, DVS
AbstractIn designing energy-aware CPU scheduling algorithms for real-time embedded systems, dynamic slack reclamation techniques significantly improve system Quality-of-Service (QoS) and energy efficiency. However, the few existing schemes in this domain either demand high complexity or achieve only limited QoS. In this paper, we present a novel low-complexity runtime scheduling algorithm for tasks that follow the Imprecise Computation (IC) model. The goal is to maximize system QoS under energy constraints. Our proposed algorithm, named Gradient Curve Shifting (GCS), decides the best allocation of slack cycles arising at runtime with very low complexity. We study both linear and concave QoS functions associated with IC-modeled tasks, on both non-DVS and DVS processors. Furthermore, we apply the intra-task DVS technique to tasks and achieve up to 18% higher system QoS than the conventional “optimal” solution, which is based on inter-task DVS.


Session 5C  Thermal Analysis and DFM
Time: 13:30 - 15:35 Wednesday, January 23, 2008
Location: Room 311A
Chairs: Takashi Sato (Tokyo Institute of Technology, Japan), Hideki Asai (Shizuoka University, Japan)

5C-1 (Time: 13:30 - 13:55)
TitleArchitecture-level Thermal Behavioral Characterization For Multi-Core Microprocessors
AuthorDuo Li, *Sheldon X.-D. Tan (Univ. of California, Riverside, United States), Murli Tirumala (Intel, United States)
Pagepp. 456 - 461
KeywordThermal, Behavioral modeling, Multi-core
AbstractIn this paper, we investigate a new architecture-level thermal characterization problem from a behavioral modeling perspective to address emerging thermal analysis and optimization problems in high-performance multi-core microprocessor design. We propose a new approach, called ThermPOF, to build thermal behavioral models from measured architecture-level thermal and power information. ThermPOF first builds the behavioral thermal model using the generalized pencil-of-function (GPOF) method. To model transient temperature changes effectively, we then propose two new schemes to improve GPOF. First, we apply logarithmic-scale sampling instead of traditional linear sampling to better capture how the temperature changes. Second, we modify the extracted thermal impulse response so that the poles extracted by GPOF are guaranteed to be stable without loss of accuracy. To further reduce the model size, Krylov-subspace-based model order reduction is applied to the models in state-space form. Experimental results on a practical quad-core microprocessor show that the generated thermal behavioral models match the measured data very well.

5C-2 (Time: 13:55 - 14:20)
TitleFull-Chip Thermal Analysis for the Early Design Stage via Generalized Integral Transforms
Author*Pei-Yu Huang, Chih-Kang Lin, Yu-Min Lee (National Chiao Tung University, Taiwan)
Pagepp. 462 - 467
KeywordThermal analysis, generalized integral transforms
AbstractThe capability of predicting the temperature profile is critically important for circuit timing estimation, leakage reduction, power estimation, hotspot avoidance, and reliability in modern IC design. This paper presents an accurate and fast analytical full-chip thermal simulator for early-stage temperature-aware chip design. By using the technique of generalized integral transforms (GIT), our proposed method can accurately estimate the full-chip temperature distribution with only a small number of basis truncation points in the spatial domain. We also develop a fast Fourier transform (FFT)-like algorithm to evaluate the temperature distribution efficiently. Experimental results confirm that our GIT-based analyzer achieves an order of magnitude speedup over a highly efficient Green's function-based method.

5C-3 (Time: 14:20 - 14:45)
TitleA Stochastic Local Hot Spot Alerting Technique
AuthorHwisung Jung, *Massoud Pedram (University of Southern California, United States)
Pagepp. 468 - 473
Keywordhot spot, Markov decision process, Kalman filter, thermal alert
AbstractWith increasing levels of variability in the behavior of manufactured nanoscale devices and dramatic changes in on-chip power density, timely identification of hot spots on a chip has become a challenging task. This paper addresses the questions of how and when to identify hot spots and issue a hot spot alert. These are important questions since temperature reports from thermal sensors may be erroneous, noisy, or arrive too late to enable effective application of thermal management mechanisms to avoid chip failure. This paper thus presents a stochastic technique for identifying and reporting local hot spots under probabilistic conditions induced by uncertainty in the chip junction temperature and the system power state. More specifically, it introduces a stochastic framework for estimating the chip temperature and the power state of the system based on a combination of Kalman Filtering (KF) and a Markov Decision Process (MDP) model. Experimental results demonstrate the effectiveness of the framework and show that the proposed technique alerts about thermal threats accurately and in a timely fashion in spite of noisy or sometimes erroneous temperature sensor readings.
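The abstract above combines Kalman filtering with an MDP model. As a much-simplified illustration of the filtering half only, the sketch below swaps in a plain scalar Kalman filter with a random-walk temperature model. The noise variances `q` and `r`, the alert threshold, the trace, and the helper names `kalman_track` and `hot_spot_alerts` are all hypothetical and are not taken from the paper; the point is only why alerting on a filtered estimate, rather than on raw sensor readings, suppresses isolated erroneous readings.

```python
def kalman_track(readings, q=0.01, r=4.0, p0=1.0):
    """Filtered temperature estimates from a scalar Kalman filter (hypothetical helper).

    Model assumed by this sketch: the true temperature follows a random walk with
    per-step variance q, and each sensor reading adds noise of variance r.
    """
    x = readings[0]  # start from the first reading
    p = p0           # variance of the current estimate
    estimates = []
    for z in readings:
        p += q               # predict: random walk keeps x, inflates the variance
        k = p / (p + r)      # Kalman gain
        x += k * (z - x)     # update with the measurement residual
        p *= (1.0 - k)
        estimates.append(x)
    return estimates


def hot_spot_alerts(readings, threshold, q=0.01, r=4.0):
    """Indices where the filtered estimate, not the raw reading, exceeds the threshold."""
    return [i for i, t in enumerate(kalman_track(readings, q, r)) if t > threshold]


# Hypothetical trace: 80 C with +/-2 C sensor noise and one erroneous 100 C reading.
noisy = [80 + (2 if i % 2 else -2) for i in range(200)]
noisy[100] = 100
print(hot_spot_alerts(noisy, threshold=85))  # → []
```

With these settings the gain settles near 0.05, so the single bad reading moves the estimate by only about 1 C and raises no alert, while a sustained rise such as `[80] * 50 + [95] * 100` does trigger alerts after a short delay. Choosing `q` trades that delay against noise rejection.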

5C-4 (Time: 14:45 - 15:10)
TitleDesign Rule Optimization of Regular layout for Leakage Reduction in Nanoscale Design
AuthorAnupama R. Subramaniam (Arizona State University, United States), Ritu Singhal, *Chi-Chao Wang, Yu Cao (Arizona State University, United States)
Pagepp. 474 - 479
KeywordDesign Rule, Optimization, RDR, NRG leakage, manufacturability
AbstractNon-rectilinear gates (NRG) caused by sub-wavelength lithography dramatically increase leakage current, by more than 15X. To mitigate this penalty, we have developed a systematic procedure to optimize key layout parameters in regular layout with minimal area and speed overhead. As demonstrated in 65nm technology, optimizing the regular layout achieves more than 70% reduction in leakage under NRG, with an area penalty of ~10% and marginal impact on circuit speed and active power.

5C-5 (Time: 15:10 - 15:35)
TitleInvestigation of Diffusion Rounding for Post-Lithography Analysis
AuthorPuneet Gupta (Univ. of California, Los Angeles, United States), Andrew B. Kahng (Univ. of California, San Diego, United States), *Youngmin Kim (Univ. of Michigan, Ann Arbor, United States), Saumil Shah (Blaze-DFM, United States), Dennis Sylvester (Univ. of Michigan, Ann Arbor, United States)
Pagepp. 480 - 485
Keyworddiffusion rounding, gate variability, DFM, lithography simulation, non-rectilinear
AbstractDue to aggressive scaling of device feature size to improve circuit performance in the sub-wavelength lithography regime, both diffusion and poly gate shapes are no longer rectilinear. Diffusion rounding occurs most notably where the diffusion shapes are not perfectly rectangular, including common L and T-shaped diffusion layouts to connect to power rails. This paper investigates the impact of the non-rectilinear shape of diffusion (i.e., sloped diffusion or diffusion rounding) on circuit performance (delay and leakage). Simple weighting function models for Ion and Ioff to account for the diffusion rounding effects are proposed, and compared with TCAD simulation. Our experiments show that diffusion rounding has an asymmetric characteristic for Ioff due to the differing significance of source/drain junctions on device threshold voltage. Therefore, we can model Ion and Ioff as a function of slope angle and direction. The proposed models match well with TCAD simulation results, with less than 2% and 6% error in Ion and Ioff, respectively.


Session 5D  Designers' Forum (Panel) Are System Level EDA Tools/Methodologies Coming?
Time: 13:30 - 15:35 Wednesday, January 23, 2008
Location: Room 311BC
Chair: Ren-Song Tsay (National Tsing Hua Univ.)

5D-1
Title(Panel Discussion) Are System Level EDA Tools/Methodologies Coming?
AuthorModerator: Ren-Song Tsay (Nat'l Tsing Hua Univ., Taiwan), Panelists: Raul Camposano (Xoomsys, United States), Toshihiro Hattori (Renesas Technology, Japan), Austin Kim (Samsung Electronics, Republic of Korea), Howard Mao (Springsoft, Taiwan), Sri Parameswaran (Univ. of New South Wales, Australia)


Session 6A  Trends in Timing
Time: 15:50 - 17:55 Wednesday, January 23, 2008
Location: Room 310A
Chairs: Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea), Jung Yun Choi (Samsung Electronics, Republic of Korea)

6A-1 (Time: 15:50 - 16:15)
TitlePessimism Reduction in Coupling-Aware Static Timing Analysis Using Timing and Logic Filtering
Author*Debasish Das (Northwestern Univ., United States), Kip Killpack, Chandramouli Kashyap, Abhijit Jas (Intel, United States), Hai Zhou (Northwestern Univ., United States)
Pagepp. 486 - 491
Keywordstatic timing analysis, crosstalk, algorithm
AbstractWith continued scaling of technology into nanometer regimes, the impact of coupling-induced delay variations is significant. While several coupling-aware static timers have been proposed, their results are often pessimistic, with many false failures. We present an integrated approach based on iterative timing filtering and logic filtering to reduce pessimism. We use a realistic coupling model based on arrival times and slews and show that non-iterative pessimism reduction algorithms proposed in previous research can give potentially non-conservative timing results. On a functional block from an industrial 65nm microprocessor, our algorithm achieved a maximum pessimism reduction of 11.18% of the cycle time over converged timing filtering analysis that does not consider logic constraints.

6A-2 (Time: 16:15 - 16:40)
TitleA Fast Incremental Clock Skew Scheduling Algorithm for Slack Optimization
Author*Kui Wang, Hao Fang, Hu Xu, Xu Cheng (Microprocessor Research Center of Peking University, China)
Pagepp. 492 - 497
Keywordclock schedule, semi-synchronous circuits, useful skew, timing analysis
AbstractWe propose a fast clock skew scheduling algorithm that minimizes the clock period and enlarges the slacks of timing-critical paths. To reduce the runtime of the timing analysis engine, our algorithm allows the sequential graph to be partially extracted, and the runtime of the algorithm itself is nearly linear in the size of the extracted sequential graph. Experimental results show that its runtime is less than a minute for a design with more than ten thousand flip-flops.

6A-3 (Time: 16:40 - 17:05)
TitleClock Tree Synthesis with Data-Path Sensitivity Matching
Author*Matthew R. Guthaus (University of California Santa Cruz, United States), Dennis Sylvester (University of Michigan, United States), Richard B. Brown (University of Utah, United States)
Pagepp. 498 - 503
Keywordclock tree synthesis, variability, skew
AbstractThis paper investigates methods for minimizing the impact of process variation on clock skew using buffer and wire sizing. While most papers on clock trees ignore data-path circuit variations and most papers on data-path circuit optimization disregard clock tree variation, we consider both. Using both clock and data-path variations together, we present a novel sensitivity-matching algorithm that allows clock tree skews to be intentionally correlated with data-path sensitivities to ameliorate timing violations due to variation. Our statistical tuning shows an improvement in terms of expected clock skew and clock skew variation over previously published robust algorithms.

6A-4 (Time: 17:05 - 17:30)
TitleBuffered Clock Tree Synthesis for 3D ICs Under Thermal Variations
AuthorJacob Minz (Synopsys, United States), Xin Zhao, *Sung Kyu Lim (Georgia Inst. of Tech., United States)
Pagepp. 504 - 509
Keywordthermal-aware optimization, clock, 3D IC
AbstractIn this paper, we study the buffered clock tree synthesis problem under thermal variations for 3D IC technology. Our major contribution is the Balanced Skew Theorem, which provides a theoretical background to efficiently construct a buffered 3D clock tree that minimizes and balances the skew values under two distinct non-uniform thermal profiles. Our clock tree synthesis algorithm named BURITO (Buffered Clock Tree With Thermal Optimization) first constructs a 3D abstract tree under the wirelength vs via-congestion tradeoff. This abstract tree is then embedded, buffered, and refined under the given non-uniform thermal profiles so that the temperature-dependent skews are minimized and balanced. Experimental results show that our algorithms significantly reduce and perfectly balance clock skew values with minimal wirelength overhead.

6A-5 (Time: 17:30 - 17:43)
TitleA Delay Model for Interconnect Trees Based on ABCD Matrix
Author*Guofei Zhou, Li Su, Depeng Jin, Lieguang Zeng (Tsinghua University, China)
Pagepp. 510 - 513
Keyworddelay estimation, interconnect, VLSI
AbstractThe accuracy of interconnect delay estimates can be improved by the method presented in this paper, in which the first two moments are obtained from the ABCD matrix and a stable model is developed to incorporate the effects of transport delay into the delay estimate. Simulation results show that the method is as accurate as traditional methods when the rise time delay is much longer than the transport delay, and more accurate when the two are of the same order.
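To illustrate how moments fall out of an ABCD-matrix formulation, the sketch below chains 2x2 ABCD matrices whose entries are polynomials in s truncated at s², for an unbranched RC ladder with an open far end. This is a deliberate simplification under stated assumptions: the paper treats interconnect trees and adds a transport-delay model, neither of which is reproduced here, and every function name is hypothetical.

```python
def poly_mul(a, b, order=2):
    """Product of two polynomials in s (lowest order first), truncated at s**order."""
    out = [0.0] * (order + 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j <= order:
                out[i + j] += ai * bj
    return out


def poly_add(a, b):
    """Coefficient-wise sum of two equal-length polynomials."""
    return [x + y for x, y in zip(a, b)]


def mat_mul(m, n):
    """2x2 ABCD-matrix product whose entries are truncated polynomials in s."""
    return [[poly_add(poly_mul(m[r][0], n[0][c]), poly_mul(m[r][1], n[1][c]))
             for c in range(2)]
            for r in range(2)]


def rc_ladder_moments(segments):
    """First two moments of V_far/V_near for an open-ended RC ladder (hypothetical helper).

    segments: (R, C) pairs, each a series resistor followed by a shunt capacitor.
    """
    total = [[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
             [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]  # identity matrix
    for r, c in segments:
        series_r = [[[1.0, 0.0, 0.0], [r, 0.0, 0.0]],
                    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]]  # [[1, R], [0, 1]]
        shunt_c = [[[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
                   [[0.0, c, 0.0], [1.0, 0.0, 0.0]]]     # [[1, 0], [sC, 1]]
        total = mat_mul(mat_mul(total, series_r), shunt_c)
    # With no load at the far end, H(s) = 1 / A(s), A(s) = 1 + a1*s + a2*s**2 + ...
    _, a1, a2 = total[0][0]
    m1 = -a1            # first moment (the Elmore delay is -m1)
    m2 = a1 * a1 - a2   # from 1/A(s) = 1 - a1*s + (a1**2 - a2)*s**2 - ...
    return m1, m2


print(rc_ladder_moments([(1.0, 1.0), (1.0, 1.0)]))  # → (-3.0, 8.0)
```

For two segments the first moment reproduces the familiar Elmore delay R1*(C1 + C2) + R2*C2, which is 3RC for identical segments; the second moment is what a two-moment delay metric would add on top of it.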

6A-6 (Time: 17:43 - 17:56)
TitleAnalytical Model for the Impact of Multiple Input Switching Noise on Timing
AuthorRajeshwary Tayade (University of Texas at Austin, United States), *Sani Nassif (IBM, United States), Jacob Abraham (University of Texas at Austin, United States)
Pagepp. 514 - 517
Keywordmultiple input switching, dynamic variability, path delay estimation
AbstractThe timing models used in current Static Timing Analysis tools characterize gate delays only for single-input switching events. It is well known that the temporal proximity of signals arriving at different inputs causes significant variation in gate delay. This variation must be accounted for when selecting the critical paths of a circuit. In this paper, a detailed analysis of Multiple Input Switching (MIS) behavior is presented, leading to a simple analytical model that can be used to estimate gate delay under MIS noise. The model requires minimal additional characterization effort and can be employed in a statistical timing engine. The dynamic delay variability of a path caused by MIS noise can be accurately estimated using the proposed model.


Session 6B  Statistical Modeling and Yield Prediction
Time: 15:50 - 17:55 Wednesday, January 23, 2008
Location: Room 310BC
Chairs: Sheldon Tan (University of California, Riverside, United States), Woojin Jin (Samsung Electronics, Republic of Korea)

6B-1 (Time: 15:50 - 16:15)
TitleDetermination of Optimal Polynomial Regression Function to Decompose On-Die Systematic and Random Variations
Author*Takashi Sato, Hiroyuki Ueyama, Noriaki Nakayama, Kazuya Masu (Tokyo Institute of Technology, Japan)
Pagepp. 518 - 523
Keywordprocess variation, log-likelihood estimate, AIC, model selection
AbstractA procedure that decomposes measured parametric device variation into systematic and random components is studied by treating the decomposition as the selection of the most suitable model for describing the on-die spatial variation trend. To maximize model predictability, a log-likelihood-based estimate, the corrected Akaike information criterion, is adopted. Depending on the on-die contours of the underlying systematic variation, the necessary and sufficient complexity of the systematic regression model is determined objectively and adaptively. The proposed procedure is applied to 90-nm threshold voltage data, and low-order polynomials are found to describe the systematic variation very well. This facilitates the design of cost-effective variation monitoring circuits as well as appropriate model determination of on-die variation.
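The model-selection step described in 6B-1 can be sketched as follows, assuming a full 2-D polynomial surface in die coordinates (x, y), ordinary least-squares fitting, and Gaussian residuals; the paper's actual regression basis may differ, and the data, maximum order, and helper names are hypothetical. With k estimated parameters (coefficients plus the residual variance) and n measurements, the sketch uses AICc = n ln(RSS/n) + 2k + 2k(k+1)/(n - k - 1), dropping additive constants.

```python
import numpy as np


def design_matrix(x, y, degree):
    """Columns x**i * y**j for all i + j <= degree (a full 2-D polynomial surface)."""
    cols = [x ** i * y ** j for i in range(degree + 1) for j in range(degree + 1 - i)]
    return np.column_stack(cols)


def aicc(x, y, v, degree):
    """Corrected AIC of a least-squares polynomial fit, assuming Gaussian residuals."""
    a = design_matrix(x, y, degree)
    coef, *_ = np.linalg.lstsq(a, v, rcond=None)
    rss = float(np.sum((v - a @ coef) ** 2))
    n = v.size
    k = a.shape[1] + 1  # polynomial coefficients plus the residual variance
    # -2 ln L = n ln(RSS/n) + constant for Gaussian residuals at the ML variance.
    return n * np.log(rss / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)


def select_degree(x, y, v, max_degree=4):
    """Polynomial order with the smallest AICc (hypothetical helper)."""
    return min(range(max_degree + 1), key=lambda d: aicc(x, y, v, d))


# Hypothetical 10 x 10 map of Vth: linear systematic tilt plus random variation.
rng = np.random.default_rng(0)
gx, gy = np.meshgrid(np.linspace(-1, 1, 10), np.linspace(-1, 1, 10))
x, y = gx.ravel(), gy.ravel()
v = 0.30 + 0.02 * x - 0.01 * y + rng.normal(0.0, 0.005, x.size)
print(select_degree(x, y, v))
```

The correction term 2k(k+1)/(n - k - 1) grows quickly as k approaches n, which is what keeps the selected order low when only a modest number of on-die sites are measured.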

6B-2 (Time: 16:15 - 16:40)
TitleWithin-Die Process Variations: How Accurately Can They Be Statistically Modeled?
AuthorBrendan Hargreaves, Henrik Hult, *Sherief Reda (Brown Univ., United States)
Pagepp. 524 - 530
Keywordprocess variations, statistical modeling
AbstractWithin-die process variations arise during integrated circuit (IC) fabrication in the sub-100nm regime. These variations are of paramount concern because they cause the performance of ICs to deviate from their designers’ original intent. These deviations reduce the parametric yield and the revenue from integrated circuit fabrication. In this paper we provide a complete treatment of the subject of within-die variations. We propose a scan-chain-based system, vMeter, to extract within-die variations in an automated fashion. We implement our system in a sample of 90nm chips and collect within-die variation data. We then propose a number of novel statistical analysis techniques that accurately model the within-die variation trends and capture the spatial correlations, and we use maximum-likelihood techniques to find the parameters required to fit the model to the data. The accuracy of our models is statistically verified through residual analysis and variograms. Using this modeling technique, we propose a procedure to generate synthetic within-die variation patterns that mimic real silicon data.

6B-3 (Time: 16:40 - 17:05)
TitleChebyshev Affine Arithmetic Based Parametric Yield Prediction Under Limited Descriptions of Uncertainty
AuthorJin Sun, Yue Huang (The University of Arizona, United States), Jun Li (Anova Solutions, United States), *Janet M. Wang (The University of Arizona, United States)
Pagepp. 531 - 536
KeywordChebyshev Affine Arithmetic, Process Variations, Limited Description of Uncertainty, Dependency Bounds
AbstractIn modern circuit design, it is difficult to provide reliable parametric yield prediction because the real distribution of process data is hard to measure. Most existing approaches cannot handle the uncertain distribution properties of the process data, and other approaches do not adequately consider correlations among the parameters. This paper proposes a new approach that not only accounts for the correlations among distributions but also provides a low-cost, efficient computation scheme. The proposed method approximates the parameter variations with Chebyshev Affine Arithmetic (CAA) to capture both the uncertainty and the nonlinearity in Cumulative Distribution Functions (CDFs). The CAA-based probabilistic representation describes both fully and partially specified process and environmental parameters. We are thus able to predict probability bounds for leakage consumption under unknown dependency assumptions among variations. The end result is a chip-level parametric yield estimate based on leakage prediction. The experimental results demonstrate that the new approach provides reliable bound estimates while achieving a 20% yield improvement compared with interval analysis.

6B-4 (Time: 17:05 - 17:30)
TitleDistribution Arithmetic for Stochastical Analysis
Author*Markus Olbrich, Erich Barke (Leibniz University of Hannover, Germany)
Pagepp. 537 - 542
Keywordstochastic, robustness, arithmetic
AbstractThis paper presents a novel arithmetic which allows calculations with fluctuating values. Given the distributions of initial random variables, the moments (such as expected value, variance and higher moments) of any calculated variable can be determined. Our approach is not limited to normal distributions and works with linear and nonlinear functions. Correlations between variables are taken into account automatically by the arithmetic. Examples show the accuracy and runtimes compared to Monte Carlo simulation.
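A minimal sketch of the basic idea in 6B-4, propagating the mean and variance of a computed value through `+` and `*`, is shown below. Unlike the arithmetic described in the abstract, this sketch ASSUMES the operands are statistically independent (so `x * x` would be handled incorrectly) and tracks only the first two moments; the class `Moments`, the helper `monte_carlo_product`, and the example distributions are hypothetical. A small Monte Carlo run is included for comparison, mirroring the comparison the abstract mentions.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Moments:
    """Mean and variance of a random quantity (hypothetical class).

    ASSUMPTION: operands of + and * are statistically independent; no correlation
    tracking is done, so e.g. x * x is NOT handled correctly.
    """
    mean: float
    var: float

    def __add__(self, other):
        # E[X+Y] = E[X] + E[Y];  Var[X+Y] = Var[X] + Var[Y]  (independent X, Y)
        return Moments(self.mean + other.mean, self.var + other.var)

    def __mul__(self, other):
        # E[XY] = E[X]E[Y];  Var[XY] = E[X^2]E[Y^2] - (E[X]E[Y])^2  (independent X, Y)
        ex2 = self.var + self.mean ** 2
        ey2 = other.var + other.mean ** 2
        mean = self.mean * other.mean
        return Moments(mean, ex2 * ey2 - mean ** 2)


def monte_carlo_product(n=50_000, seed=1):
    """Sample mean and variance of X*Y with X ~ N(1, 0.2^2), Y ~ N(2, 0.3^2)."""
    rng = random.Random(seed)
    samples = [rng.gauss(1.0, 0.2) * rng.gauss(2.0, 0.3) for _ in range(n)]
    mean = sum(samples) / n
    return mean, sum((s - mean) ** 2 for s in samples) / n


x = Moments(1.0, 0.04)
y = Moments(2.0, 0.09)
print(x * y)                   # analytic: mean 2.0, variance about 0.2536
print(monte_carlo_product())   # sampled estimate for comparison
```

The analytic result costs a handful of arithmetic operations, whereas the sampled estimate needs tens of thousands of draws to agree to about two decimal places; handling correlated operands, which the abstract claims its arithmetic does automatically, would require tracking dependencies that this sketch omits.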

6B-5 (Time: 17:30 - 17:55)
TitleHandling Partial Correlations in Yield Prediction
AuthorSridhar Varadan (Texas A&M University, United States), *Janet Wang (University of Arizona, United States), Jiang Hu (Texas A&M University, United States)
Pagepp. 543 - 548
Keywordyield
AbstractIn the nanometer regime, IC designs have to consider the impact of process variations, which is often indicated by manufacturing/parametric yield. This paper investigates a yield model: the probability that the values of multiple manufacturing/circuit parameters meet certain targets. This model can be applied to predict CMP (Chemical-Mechanical Planarization) yield. We focus on the difficult cases that have a large number of partially correlated variations. To predict the yield for these difficult cases efficiently, we propose two techniques: (1) application of Orthogonal Principal Component Analysis (OPCA); (2) hierarchical adaptive quadrisection (HAQ). Systematic variations are also included in our model. Compared to previous work, the OPCA-based method reduces the yield estimation error from 17.1%-21.1% to 1.3%-2.8% with a 4.6X speedup. The HAQ technique reduces the error to 4.1%-5.6% with a 6X-9.4X speedup.


Session 6D  Special Session - How to Design Cool Chips for Hot Products
Time: 15:50 - 17:55 Wednesday, January 23, 2008
Location: Room 311A+311BC
Chair: Massoud Pedram (Univ. of Southern California, United States)

6D-1
Title(Invited Paper) Reliability-Aware Design for Nanometer-Scale Devices
Author*David Atienza (EPFL, Switzerland), Giovanni De Micheli (LSI/EPFL, Switzerland), Luca Benini (DEIS/UNIBO, Italy), José L. Ayala, Pablo G. Del Valle (DACYA/UCM, Spain), Michael DeBole, Vijay Narayanan (CSE/PSU, United States)
Pagepp. 549 - 554
AbstractContinuous transistor scaling due to improvements in CMOS devices and manufacturing technologies is increasing processor power densities and temperatures, thus creating challenges in maintaining manufacturing yield rates and device reliability over expected lifetimes at the latest nanometer-scale dimensions. In fact, new system and processor microarchitectures require new reliability-aware design methods and exploration tools that can meet these challenges without significantly increasing manufacturing cost, reducing system performance, or imposing large area overheads due to redundancy.

6D-3
Title(Invited Paper) An Industrial Perspective of Power-aware Reliable SoC Design
Author*Soo-Kwan Eo, Sungjoo Yoo, Kyu-Myung Choi (Samsung Electronics, Republic of Korea)
Pagepp. 555 - 557

6D-4
Title(Panel Discussion) How to Design Cool Chips for Hot Products
AuthorModerator: Massoud Pedram (Univ. of Southern California, United States), Panelists: Giovanni De Micheli (EPFL, Switzerland), Jan Rabaey (Univ. of California, Berkeley, United States), Sookwan Eo (Samsung Electronics, Republic of Korea)



Thursday, January 24, 2008

Session 3K  Keynote Session III
Time: 9:00 - 10:00 Thursday, January 24, 2008
Location: Room 409
Chair: Soonhoi Ha (Seoul National Univ., Republic of Korea)

3K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) The Future of Semiconductor Industry - A Foundry's Perspective
Author*F. C. Tseng (TSMC, Taiwan)
Pagep. 558


Session 7A  Reliable/Testable Design Techniques
Time: 10:15 - 12:20 Thursday, January 24, 2008
Location: Room 310A
Chairs: Huawei Li (Chinese Academy of Sciences, China), Sungho Kang (Yonsei University, Republic of Korea)

7A-1 (Time: 10:15 - 10:40)
TitleSoft Error Rate Reduction Using Redundancy Addition and Removal
Author*Kai-Chiang Wu, Diana Marculescu (Carnegie Mellon University, United States)
Pagepp. 559 - 564
KeywordSER Reduction, Soft Error, RAR, Redundancy Addition and Removal, Reliability
AbstractDue to current technology scaling trends such as shrinking feature sizes and reduced supply voltages, circuits have become more susceptible to radiation-induced transient faults (soft errors). Soft errors, long a great concern in memories, are now a main factor in the reliability degradation of logic circuits. In this paper, we propose a novel framework based on redundancy addition and removal (RAR) for soft error rate (SER) reduction. Several metrics and constraints are introduced to guide our proposed framework towards SER reduction in an efficient manner. Experimental results show that up to 70% reduction in output failure probability can be achieved with relatively low area overhead.

7A-2 (Time: 10:40 - 11:05)
TitleLocalized Random Access Scan: Towards Low Area and Routing Overhead
Author*Yu Hu, Xiang Fu, Xiaoxin Fan (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China), Hideo Fujiwara (Graduate School of Information Science, Nara Institute of Science and Technology, Japan)
Pagepp. 565 - 570
KeywordRandom Access Scan, Design-for-Testability
AbstractConventional random access scan (RAS) designs incur high hardware overhead. In this paper, we present a localized RAS architecture (LRAS) to address this issue. A novel scan cell structure, which has fewer transistors than the multiplexer-type scan cell, is proposed to eliminate the global test enable signal and to localize the row and column enable signals. Experimental results demonstrate that LRAS has less area overhead than scan-chain-based designs while outperforming the state-of-the-art RAS scheme in routing overhead.

7A-3 (Time: 11:05 - 11:30)
TitleA Design-for-Diagnosis Technique for Diagnosing Both Scan Chain Faults and Combinational Circuit Faults
AuthorFei Wang, *Yu Hu, Huawei Li, Xiaowei Li (Chinese Academy of Sciences, China)
Pagepp. 571 - 576
Keywordscan chain diagnosis, design for diagnosis, logic diagnosis, scan chain, design for testability
AbstractThe die area consumed by scan chains and scan control circuitry can range from 15% to 30%, and scan chain failures account for almost 50% of chip failures. Because the conventional diagnosis process usually assumes fault-free scan chains, scan chain faults may disable the diagnostic process, leaving a large failure area to time-consuming failure analysis. In this paper, a design-for-diagnosis (DFD) technique is proposed to diagnose faulty scan chains precisely and efficiently; moreover, with the assistance of the proposed technique, the conventional logic diagnostic process can still be carried out with faulty scan chains. The proposed approach is fully compatible with conventional scan-based design, and previously proposed software-based diagnostic methods for conventional scan designs can still be applied to it. Experiments on ISCAS'89 benchmark circuits demonstrate the efficiency of the proposed DFD technique.

7A-4 (Time: 11:30 - 11:55)
TitleGECOM: Test Data Compression Combined with All Unknown Response Masking
Author*Youhua Shi, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki (Waseda Univ., Japan)
Pagepp. 577 - 582
Keywordscan test, test data compression, X-masking, ATPG
AbstractThis paper introduces GECOM technology, a novel test compression method that seamlessly integrates test GEneration, test COmpression (i.e., integrated compression of scan stimuli and masking bits), and Masking of all unknown scan responses to reduce manufacturing test cost. Unlike most prior methods, the proposed method considers the unknown responses during the ATPG procedure and selectively encodes the specified bits (either 1s or 0s) in scan slices for compression, while at the same time masking the unknown responses before they are sent to the response compactor. The proposed GECOM technology consists of the GECOM architecture and the GECOM ATPG technique. In the GECOM architecture, a circuit with N internal scan chains requires only c tester channels, where c = ⌈log₂ N⌉ + 2. GECOM ATPG generates test patterns for the GECOM architecture so that not only are the scan inputs efficiently compressed but all the unknown responses are also masked. Experimental results on both benchmark circuits and real industrial designs indicate the effectiveness of the proposed GECOM technique.


Session 7B  Communication and Interfaces
Time: 10:15 - 12:20 Thursday, January 24, 2008
Location: Room 310BC
Chairs: Maziar Goudarzi (Kyushu University, Japan), Hiroyuki Yagi (STARC, Japan)

7B-1 (Time: 10:15 - 10:40)
TitleMixed Integer Linear Programming-Based Optimal Topology Synthesis of Cascaded Crossbar Switches
Author*Minje Jun (Yonsei Univ., Republic of Korea), Sungjoo Yoo (Samsung Electronics, Republic of Korea), Eui-Young Chung (Yonsei Univ., Republic of Korea)
Pagepp. 583 - 588
Keywordcascaded crossbar, topology synthesis, MILP, bus
AbstractWe present a topology synthesis method for high-performance System-on-Chip (SoC) design. Our method provides an optimal on-chip communication network topology for the given bandwidth, latency, frequency and/or area constraints. The optimal topology consists of multiple crossbar switches, some of which can be connected in a cascaded fashion for higher clock frequency and/or area efficiency. Compared to previous work, the major contribution of our work is the exactness of the solution in two respects. First, our solution method is exact because it employs mixed integer linear programming (MILP). Second, we generalize the crossbar switch representation in MILP so that the optimal topology can combine crossbar switches of arbitrary sizes. The experimental results show that topologies optimized for clock frequency (area) give up to 37.3% (12.7%) improvement compared to conventional single large crossbar switch networks for two industrial-strength SoC designs.

7B-2 (Time: 10:40 - 11:05)
TitleAutomatic Interface Synthesis Based on the Classification of Interface Protocols of IPs
Author*ChangRyul Yun (Agency for Defense Development, Republic of Korea), DongSoo Kang (Chungnam National University, Republic of Korea), YoungHwan Bae, HanJin Cho (ETRI, Republic of Korea), KyoungSon Jhang (Chungnam National University, Republic of Korea)
Pagepp. 589 - 594
Keywordinterface synthesis, protocol classification, IP reuse
AbstractIn System-on-a-Chip (SoC) design, an IP-based design methodology is used to reduce design time, and interface circuit design is one of its most essential factors. However, interface circuits are not easy to generate because IPs have various characteristics. For example, one IP may send only one address for a whole burst, while another IP may need one address for each transfer in a burst. IPs may also use different clock frequencies or different data widths. The interface protocols of each IP must therefore be analyzed so that these differences can be resolved during synthesis. In this paper, we categorize the various interface protocols, and our synthesis algorithm selects the appropriate structure based on these categories and on the clock frequency and data width differences of the IPs. The experiments show that we can automatically generate interface circuits for IPs with different clocks, different data widths, and no address concepts. By comparing synthesis results for several IP pairs under two alternative structures, namely a product FSM-based structure and an FSMD-like structure, the experiments also show the pros and cons of each.

7B-3 (Time: 11:05 - 11:30)
TitleThe Shining Embedded System Design Methodology Based on Self Dynamic Reconfigurable Architectures
AuthorC. A. Curino, *L. Fossati, V. Rana, F. Redaelli, M. D. Santambrogio, D. Sciuto (Politecnico di Milano, Italy)
Pagepp. 595 - 600
KeywordReconfiguration, Embedded System, Design flow
AbstractComplex System-on-Chip design based on reconfigurable architectures still lacks a generalized methodology that allows both the automatic derivation of a complete system solution able to fit into the final device and mixed hardware-software solutions that exploit partial reconfiguration capabilities. The Shining methodology organizes the input specification of a complex System-on-Chip design into three different components (hardware, reconfigurable hardware, and software), each handled by a dedicated sub-flow. A communication model guarantees reliable and seamless interfacing of the various components. The developed system, stand-alone or OS-based, is architecture-independent. The Shining flow reduces system development time, easing the design of complex hardware/software reconfigurable applications.

7B-4 (Time: 11:30 - 11:43)
TitleRobust On-Chip Bus Architecture Synthesis for MPSoC Under Random Tasks Arrival
Author*Sujan Pandey (NXP Semiconductors Research, Netherlands), Rolf Drechsler (University of Bremen, Germany)
Pagepp. 601 - 606
KeywordOn-chip bus synthesis, Synthesis for robustness
AbstractA major trend in modern system-on-chip design is growing system complexity, which results in a sharp increase of communication traffic on on-chip bus architectures. In a real-time embedded system, the task arrival rate, inter-task arrival time, and size of the data to be transferred are not uniform over time, because the embedded system is partially reconfigured to cope with a dynamic workload. In this context, traditional application-specific bus architectures may fail to meet the real-time constraints. Thus, to incorporate the random behavior of on-chip communication, this work proposes an approach to synthesize an on-chip bus architecture that is robust for a given distribution of random tasks. The randomness of communication tasks is characterized by three main parameters: average task arrival rate, average inter-task arrival time, and data size. During synthesis, the on-chip bus requirement is guided by the worst-case performance need, while dynamic voltage scaling is used to save energy when the workload is low or timing slack is high. This results in effective utilization of communication resources under a variable workload.

7B-5 (Time: 11:43 - 11:56)
TitleA Multi-Processor NoC Platform Applied on the 802.11i TKIP Cryptosystem
Author*Jung-Ho Lee, Sung-Rok Yoon, Kwang-Eui Pyun, Sin-Chong Park (ICU, Republic of Korea)
Pagepp. 607 - 610
KeywordNoC, MPSoC, TKIP
AbstractSince 2001, there has been a myriad of papers on the systematic analysis of Multi-Processor System-on-Chip (MPSoC) and Network-on-Chip (NoC), yet there are only a few practical applications of them. Until now, researchers have mainly focused on adapting NoC to communication-intensive multimedia systems such as H.263. This paper instead attempts to extend the domain of the NoC platform to a wireless security algorithm (TKIP), because its inter-component transaction pattern has characteristics that suit a NoC. The paper explains the operational sequence of the algorithm on the chosen architecture and briefly illustrates the important NoC building blocks (Network Interface, Router).


Session 7C  Power: Delivery and Reduction
Time: 10:15 - 12:20 Thursday, January 24, 2008
Location: Room 311A
Chairs: Ki-seok Chung (Hanyang University, Republic of Korea), Junhyung Um (Samsung Electronics, Republic of Korea)

7C-1 (Time: 10:15 - 10:40)
TitleA Unified Methodology for Power Supply Noise Reduction in Modern Microarchitecture Design
AuthorMichael Healy, Fayez Mohamood, Hsien-Hsin S. Lee, *Sung Kyu Lim (Georgia Institute of Technology, United States)
Pagepp. 611 - 616
Keywordpower noise, dynamic control, floorplanning
AbstractIn this paper, we present a novel design methodology to combat the ever-worsening high-frequency power supply noise (di/dt) in modern microprocessors. Our methodology integrates microarchitectural profiling for noise-aware floorplanning, dynamic runtime noise control to prevent unsustainable noise emergencies, and decap allocation, all to produce a design for the average-case current consumption scenario. The dynamic controller contributes a microarchitectural technique that eliminates occurrences of the worst-case noise scenario, which allows our method to focus on average-case noise behavior.

7C-2 (Time: 10:40 - 11:05)
TitleHeuristic Power/Ground Network and Floorplan Co-Design Method
Author*Xiaoyi Wang, Jin Shi, Yici Cai, Xianlong Hong (Tsinghua University, China)
Pagepp. 617 - 622
KeywordFloorplan, IR drop, P/G network optimization
AbstractThere is a trend toward considering power supply integrity at an early design stage to improve design quality. In this paper, we propose a novel algorithm that optimizes the floorplan together with the P/G network. Compared with previous methods, our algorithm searches the floorplan space more efficiently and therefore leads to better results. We also propose a heuristic method to build a P/G mesh grid with optimized topology. Experimental results show that our method speeds up the floorplanning process by about 10 times and reduces the routing area of the P/G network while maintaining floorplan quality and P/G integrity.

7C-3 (Time: 11:05 - 11:30)
TitleVertical Via Design Techniques for Multi-Layered P/G Networks
Author*Shuai Li, Jin Shi, Yici Cai, Xianlong Hong (Tsinghua University, China)
Pagepp. 623 - 628
KeywordP/G, multi-layered, via
AbstractIn multi-layered power/ground (P/G) networks, vertical vias are usually placed at intersections between metal wires of adjoining layers to connect the whole network together. This paper presents an in-depth study of vertical via design. First, we present an efficient heuristic algorithm based on sensitivity analysis to optimize via allocation at an early design stage. Compared with equal allocation, our algorithm reduces the worst voltage drop by 8.43% on average while using the same number of vias or fewer. The adjoint network method is also used and significantly improves the efficiency of our algorithm. Next, we demonstrate that cross-layer vias, which link metal wires of nonadjacent layers, are powerful in eliminating “hot” areas that suffer from large voltage drop on the bottom layer. A similar heuristic algorithm is also developed for the addition of cross-layer vias.

7C-4 (Time: 11:30 - 11:55)
TitleStatistical Mixed Vt Allocation of Body-Biased Circuits for Reduced Leakage Variation
AuthorJinseob Jeong, *Seungwhun Paik, Youngsoo Shin (KAIST, Republic of Korea)
Pagepp. 629 - 634
KeywordStatistical, Mixed Vt, Body Biasing
AbstractLeakage current is susceptible to variation of transistor parameters and of the environment, such as temperature, which results in a wide spread in the leakage distribution. The spread can be reduced by body biasing: reverse body bias for dies that are too leaky and forward body bias for dies that are too slow. We investigate body biasing of mixed Vt circuits. We show that conventional body biasing is limited in reducing the leakage variation of mixed Vt circuits, because low- and high-Vt devices do not track each other and their body biasing sensitivities differ. We present an alternative body biasing scheme that targets compensating the die-to-die variation of low Vt. Under this scheme, the within-die profiles of low- and high-Vt, which are needed for statistical allocation of mixed Vt, become wider and thus differ from the original ones. We present an analytical procedure to derive the new within-die profiles. Experiments with a 45-nm predictive model show that the spread in leakage can be reduced to 4.5 on average, as opposed to 9.4 with conventional body biasing on mixed Vt circuits.

7C-5 (Time: 11:55 - 12:20)
TitleExploring High-Speed Low-Power Hybrid Arithmetic Units at Scaled Supply and Adaptive Clock-Stretching
AuthorSwaroop Ghosh, *Kaushik Roy (Purdue Univ., United States)
Pagepp. 635 - 640
KeywordLow Power, adder, high speed, hybrid, robust
AbstractIn this paper, we explore various arithmetic units for possible use in high-speed, high-yield ALU design at scaled supply voltage with variable latency operation. We demonstrate that careful modification of existing arithmetic units makes them more suitable for supply voltage scaling with tolerable area overhead. Simulation results on different adder and multiplier topologies show an 18-60% improvement in power with a 2-8% increase in die area at iso-yield. We also extend our studies to the design of low power, high yield multipliers. These optimized low power datapath units can be used to construct a low power and robust ALU.


Session 7D  Special Session (Panel) Concurrent SoC and SiP Designs
Time: 10:15 - 12:20 Thursday, January 24, 2008
Location: Room 311BC
Chair: Wei-Chung Lo (ITRI, Taiwan)

7D-1
Title(Panel Discussion) Concurrent SoC and SiP Designs
AuthorModerator: Wei-Chung Lo (ITRI, Taiwan), Panelists: C. P. Hung (ASE, Taiwan), Lung Chu (Cadence Design Systems, United States), Joungho Kim (KAIST, Republic of Korea), Epan Wu (VIA Technologies, Taiwan)


Session 8A  Test Generation and Test Power
Time: 13:30 - 15:35 Thursday, January 24, 2008
Location: Room 310A
Chairs: Hideo Fujiwara (NAIST, Japan), Yu Hu (Chinese Academy of Sciences, China)

8A-1 (Time: 13:30 - 13:55)
TitleCircuit Lines for Guiding the Generation of Random Test Sequences for Synchronous Sequential Circuits
AuthorIrith Pomeranz (Purdue University, United States), *Sudhakar M. Reddy (University of Iowa, United States)
Pagepp. 641 - 646
Keywordrandom test sequences, synchronous sequential circuits
AbstractA procedure proposed earlier for improving the fault coverage of a random primary input sequence modifies the input sequence so as to avoid repeated synchronization of state variables. We show that in addition to the values of state variables, it is also important to consider repeated setting of other lines to the same values. A procedure and experimental results are presented to demonstrate the improvements in fault coverage of random primary input sequences when the values of selected lines are considered.

8A-2 (Time: 13:55 - 14:20)
TitleA New Low Energy BIST Using A Statistical Code
Author*Sunghoon Chun, Taejin Kim, Sungho Kang (Yonsei University, Republic of Korea)
Pagepp. 647 - 652
KeywordBIST, low energy test, test data compression
AbstractTo tackle the increased switching activity during test operation, this paper proposes a new built-in self-test (BIST) scheme for low energy testing that uses a statistical code and a new technique to skip unnecessary test sequences. The goal of this technique is to minimize the total power consumption during a test while allowing at-speed testing to achieve high fault coverage. The effectiveness of the proposed low energy BIST scheme was validated on a set of ISCAS ’89 benchmark circuits with respect to test data volume and energy saving.

8A-3 (Time: 14:20 - 14:33)
TitleOn Reducing Both Shift and Capture Power for Scan-Based Testing
AuthorJia Li (Chinese Academy of Sciences, China), *Qiang Xu (The Chinese Univ. of Hong Kong, Hong Kong), Yu Hu, Xiaowei Li (Chinese Academy of Sciences, China)
Pagepp. 653 - 658
Keywordtest power, shift power, capture power, scan-based testing
AbstractPower consumption in scan-based testing is a major concern nowadays. In this paper, we present a new X-filling technique, called LSC-filling, that reduces both shift power and capture power during scan tests. The basic idea is to use as few X-bits as possible to keep the capture power under the peak power limit of the circuit under test (CUT), while using the remaining X-bits to reduce the shift power and thus cut down the CUT’s average power consumption during scan tests as much as possible. In addition, by carefully selecting the X-filling order, our technique achieves lower capture power than existing methods. Experimental results on ISCAS’89 benchmark circuits show the effectiveness of the proposed methodology.
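LSC-filling decides which X-bits serve capture power, which requires the circuit under test, so it is not reproduced here. The shift-power half of the idea can be illustrated with the common adjacent-fill heuristic and a weighted transition metric. This is a sketch under stated assumptions: the function names are hypothetical, index 0 is taken to be the bit shifted in first, and a transition between positions j and j+1 is weighted by l - 1 - j, roughly the number of scan cells it ripples through.

```python
def adjacent_fill(vec: str) -> str:
    """Fill each 'X' with the most recent specified bit.
    Leading X's take the first specified bit; an all-X vector becomes all 0s."""
    first = next((b for b in vec if b in "01"), "0")
    out, last = [], first
    for b in vec:
        if b in "01":
            last = b
        out.append(last)
    return "".join(out)


def weighted_transitions(vec: str) -> int:
    """Weighted transition count, a common proxy for scan shift power (assumed
    convention: index 0 enters the chain first, so earlier transitions weigh more)."""
    l = len(vec)
    return sum(l - 1 - j for j in range(l - 1) if vec[j] != vec[j + 1])


cube = "1XX0XX1X"
for filled in (adjacent_fill(cube), cube.replace("X", "0"), cube.replace("X", "1")):
    print(filled, weighted_transitions(filled))
```

For this cube, adjacent fill gives "11100011" with weight 7, while 0-fill and 1-fill give 10 and 9, which is why filling X-bits to avoid transitions lowers shift power.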

8A-4 (Time: 14:33 - 14:46)
TitleRobust Test Generation for Power Supply Noise Induced Path Delay Faults
Author*Xiang Fu, Huawei Li, Yu Hu, Xiaowei Li (Chinese Academy of Sciences, China)
Pagepp. 659 - 662
KeywordPower Supply Noise, Robust Test generation
AbstractIn deep sub-micron designs, the delay caused by power supply noise (PSN) can no longer be ignored. This paper proposes a PSN-induced path delay fault (PSNPDF) model; such faults should be tested to enhance chip quality. Based on precise timing analysis, we also propose a robust test generation technique for PSNPDFs. The concept of a timing window is introduced into the PSNPDF model: if two devices in the same feed region switch simultaneously in the same direction, their current waveforms overlap and excessive PSN is produced. Experimental results on ISCAS’89 circuits show that test generation can be completed in a few seconds.
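The timing-window condition described above reduces to an interval-overlap test between devices that share a feed region and switch in the same direction. A minimal sketch, assuming each device's switching window is already known from timing analysis; the `Switching` record, its fields, and the units are hypothetical, and the paper's test generation itself is not shown.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Switching:
    device: str
    region: int    # power-feed region id (hypothetical encoding)
    rising: bool   # switching direction
    start: float   # earliest switching time in the window (ps, assumed)
    end: float     # latest switching time in the window (ps, assumed)


def windows_overlap(a: Switching, b: Switching) -> bool:
    """Closed intervals [start, end] intersect."""
    return a.start <= b.end and b.start <= a.end


def noisy_pairs(events: list[Switching]) -> list[tuple[str, str]]:
    """Device pairs that may draw overlapping current: same feed region,
    same switching direction, and overlapping timing windows."""
    return [(a.device, b.device) for a, b in combinations(events, 2)
            if a.region == b.region and a.rising == b.rising
            and windows_overlap(a, b)]


events = [
    Switching("g1", 0, True, 10, 30),
    Switching("g2", 0, True, 25, 40),   # overlaps g1, same region and direction
    Switching("g3", 0, False, 20, 35),  # opposite direction
    Switching("g4", 1, True, 10, 30),   # different feed region
]
print(noisy_pairs(events))  # [('g1', 'g2')]
```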

8A-5 (Time: 14:46 - 14:59)
TitleTest Vector Chains for Increased Targeted and Untargeted Fault Coverage
AuthorIrith Pomeranz (Purdue University, United States), *Sudhakar M. Reddy (University of Iowa, United States)
Pagepp. 663 - 666
Keywordn-detections, test generation
AbstractWe introduce the concept of test vector chains, which allows us to obtain new test vectors from existing ones through single-bit changes without any test generation effort. We demonstrate that a test set T0 has a significant number of test vector chains that are effective in increasing the numbers of detections of target faults, i.e., faults targeted during the generation of T0, as well as untargeted faults, i.e., faults that were not targeted during the generation of T0.
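One plausible reading of a test vector chain, assumed here rather than taken from the paper, is a walk from one existing vector of T0 to another that complements one differing bit at a time, so that every intermediate vector is a new test obtained without test generation. The sketch below builds such a chain; the function name and the left-to-right flip order are hypothetical, and the fault simulation that would count detections is not shown.

```python
from __future__ import annotations


def vector_chain(src: str, dst: str) -> list[str]:
    """Walk from src to dst by complementing one differing bit at a time
    (left to right). Consecutive vectors differ in exactly one bit; the
    intermediate vectors chain[1:-1] are the new candidate tests."""
    if len(src) != len(dst):
        raise ValueError("vectors must have equal length")
    chain, cur = [src], list(src)
    for i, (a, b) in enumerate(zip(src, dst)):
        if a != b:
            cur[i] = b
            chain.append("".join(cur))
    return chain


chain = vector_chain("0010", "1011")
print(chain)        # ['0010', '1010', '1011']
print(chain[1:-1])  # ['1010'] is the only new vector here
```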

8A-6 (Time: 14:59 - 15:12)
TitleParallel Fault Backtracing for Calculation of Fault Coverage
Author*Raimund Ubar, Sergei Devadze, Jaan Raik, Artur Jutman (Tallinn University of Technology, Estonia)
Pagepp. 667 - 672
Keywordfault simulation, combinational circuits, stuck-at faults, critical path, Boolean differentials
AbstractAn improved method for calculating fault coverage with parallel fault backtracing in digital circuits with a scan path is proposed. The method is based on structurally synthesized BDDs (SSBDDs), which represent gate-level circuits at a higher, macro level where macros represent subnetworks of gates. A topological analysis is carried out to generate an efficient model for backtracing faults that minimizes the repeated calculations caused by reconvergent fanouts. The algorithm is equivalent to exact critical path tracing. Owing to parallelism and modeling at a higher abstraction level, the speed of analysis is considerably increased. Experimental data show a considerable speed-up of the new method compared to the previous similar approach, and the fault analysis is several times faster than current state-of-the-art commercial fault simulators.


Session 8B  Design Space Exploration
Time: 13:30 - 15:35 Thursday, January 24, 2008
Location: Room 310BC
Chairs: Sri Parameswaran (University of New South Wales, Australia), Rainer Dömer (University of California, Irvine, United States)

8B-1 (Time: 13:30 - 13:55)
TitleReSP: A Non-Intrusive Transaction-Level Reflective MPSoC Simulation Platform for Design Space Exploration
AuthorGiovanni Beltrame (European Space Agency, Netherlands), Cristiana Bolchini, *Luca Fossati, Antonio Miele, Donatella Sciuto (Politecnico di Milano, Italy)
Pagepp. 673 - 678
KeywordMPSoC, SystemC, Python, Simulation, reliability
AbstractThis paper presents ReSP, a multi-processor simulation platform based on SystemC and Python (which provides the platform with reflective capabilities). The designer can easily specify the architecture of a system, simulate it, and perform automatic analysis on it. The overhead associated with the Python intermediate layer is around 1%. The advantages of our approach are: (a) easy integration of external IPs, (b) fine-grained simulation control, and (c) effortless integration of tools for system analysis and design space exploration.

8B-2 (Time: 13:55 - 14:20)
TitleCollaborative Hardware/Software Partition of Coarse-Grained Reconfigurable System Using Evolutionary Ant Colony Optimization
Author*Dawei Wang, Sikun Li, Yong Dou (College of Computer Science, National University of Defense Technology, China)
Pagepp. 679 - 684
KeywordCollaborative Design, Reconfigurable Computing, System-on-Chips, Hardware/Software Partitioning, Ant Colony Optimization
AbstractThe flexibility, performance, and cost effectiveness of reconfigurable architectures have led to their widespread use in embedded applications. Reconfigurable system design is very complex, requiring experts from multiple fields to collaborate on application algorithm design, hardware/software co-design, and system decisions. However, existing reconfigurable system design methods and environments only support hardware/software co-design, ignoring collaboration among experts from different fields. This paper presents a collaborative partitioning approach for coarse-grained reconfigurable system design using evolutionary ant colony optimization. We create a distributed collaborative design environment for system decision engineers, software designers, hardware designers, and application algorithm developers. The method not only exploits the ability of ant colony optimization to search for globally optimal solutions, but also provides a framework in which experts from multiple fields can work collaboratively. Experimental results show that the method improves the quality and speed of hardware/software partitioning for coarse-grained reconfigurable system design.

8B-3 (Time: 14:20 - 14:45)
TitleDesign Space Exploration for a Coarse Grain Accelerator
Author*Farhad Mehdipour, Hamid Noori (Kyushu University, Japan), Morteza Saheb Zamani (Amirkabir University of Technology, Iran), Koji Inoue, Kazuaki Murakami (Kyushu University, Japan)
Pagepp. 685 - 690
Keywordextensible processor, design space exploration, reconfigurable accelerator
AbstractIn the design of a reconfigurable accelerator employed in an embedded system, the multitude of parameters may result in considerable complexity and a large design space. Design space exploration, as an alternative to the quantitative approach, can be employed to find the right balance among the different design parameters. In this paper, a hybrid approach is introduced that analytically explores the design space of a coarse grain accelerator and quantitatively determines a suitable design point using data extracted from applications. It also provides the flexibility to take into account new design constraints as well as new characteristics of applications. Furthermore, the approach is methodological: it reduces design time and results in a design point that satisfies the design goals.

8B-4 (Time: 14:45 - 15:10)
TitleEfficient Symbolic Multi-Objective Design Space Exploration
Author*Martin Lukasiewycz, Michael Glaß, Christian Haubelt, Jürgen Teich (University of Erlangen-Nuremberg, Germany)
Pagepp. 691 - 696
Keyworddesign space exploration, pseudo-boolean solver, multi-objective, symbolic
AbstractNowadays, many design space exploration tools are based on Multi-Objective Evolutionary Algorithms (MOEAs). Besides their advantages, MOEAs have an important drawback: they might fail in design spaces containing only a few feasible solutions, and they are often afflicted with premature convergence, i.e., the same design points are revisited again and again. Exact methods, especially Pseudo-Boolean solvers (PB solvers), seem to be a solution. However, as typical design spaces are multi-objective, there is a need for multi-objective PB solvers. In this paper, we formalize the problem of design space exploration as a multi-objective 0–1 ILP. We propose (1) a heuristic approach based on PB solvers and (2) a complete multi-objective PB solver based on a backtracking algorithm that incorporates the non-dominance relation from multi-objective optimization and is restricted to linear objective functions. First results from applying our novel multi-objective PB solver to synthetic problems show its effectiveness in small design spaces as well as in large design spaces containing only a few feasible solutions. For non-linear and large problems, the proposed heuristic approach outperforms common MOEA approaches. Finally, a real-world example from the automotive area demonstrates the efficiency of the proposed algorithms.
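The non-dominance relation at the heart of the multi-objective 0–1 ILP formulation can be shown on a toy instance. The sketch below is a brute-force reference that enumerates every 0-1 vector and keeps the non-dominated feasible points; it is not the authors' backtracking PB solver, and the instance, objective names, and function names are hypothetical.

```python
from itertools import product


def dominates(a, b):
    """a dominates b (minimization): no worse in every objective, better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front_01(objectives, is_feasible, n):
    """Enumerate x in {0,1}^n and keep the non-dominated feasible points.

    objectives:  one coefficient vector per linear objective (all minimized)
    is_feasible: predicate over the 0-1 vector x, standing in for the ILP constraints
    """
    points = []
    for x in product((0, 1), repeat=n):
        if is_feasible(x):
            value = tuple(sum(c * xi for c, xi in zip(coeffs, x)) for coeffs in objectives)
            points.append((x, value))
    return [(x, v) for x, v in points if not any(dominates(w, v) for _, w in points)]


# Toy instance: select at least one of three resources, minimizing cost and power.
cost = [3, 1, 2]
power = [1, 4, 2]
front = pareto_front_01([cost, power], lambda x: sum(x) >= 1, 3)
for x, (c, p) in front:
    print(x, "cost", c, "power", p)
```

Exhaustive enumeration grows as 2^n, which is exactly why the paper turns to PB-solver-based search for realistic design spaces.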

8B-5 (Time: 15:10 - 15:23)
TitleScalable Unified Dual-Radix Architecture for Montgomery Multiplication in GF(P) and GF(2^n)
Author*Kazuyuki Tanimura, Ryuta Nara, Shunitsu Kohara, Kazunori Shimizu, Youhua Shi, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki (Waseda University, Japan)
Pagepp. 697 - 702
KeywordElliptic curve cryptography, dual-radix, Montgomery multiplication, scalability, unified
AbstractModular multiplication is the most dominant arithmetic operation in elliptic curve cryptography (ECC), which is a type of public-key cryptography. Montgomery multiplication is a commonly used technique for modular multiplication, and it requires scalability since the bit length of operands varies with the security level. ECC is also performed in either $GF(P)$ or $GF(2^n)$, so unified multiplier architectures for $GF(P)$ and $GF(2^n)$ are needed. However, in previous works, changing the frequency or a dual-radix architecture is necessary to deal with the delay-time difference between the $GF(P)$ and $GF(2^n)$ circuits of the multiplier, because the critical path of the $GF(P)$ circuit is longer. This paper proposes a scalable unified dual-radix architecture for Montgomery multiplication in $GF(P)$ and $GF(2^n)$. The proposed architecture unifies $4$ parallel radix-$2^{16}$ multipliers in $GF(P)$ and a radix-$2^{64}$ multiplier in $GF(2^n)$ into a single unit. Applying a lower radix to the $GF(P)$ multiplier shortens its critical path and makes it possible to compute operands in both fields using the same multiplier at the same frequency, so clock dividers to deal with the delay-time difference are not required. Moreover, the parallel architecture in $GF(P)$ reduces the clock cycles added by the dual-radix approach. Consequently, the proposed architecture computes a $256$-bit $GF(P)$ Montgomery multiplication in $0.23\,\mu s$.
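Why one datapath can serve both fields is easiest to see at radix 2, a deliberate simplification of the paper's radix-$2^{16}$ and radix-$2^{64}$ word-level units. In the sketch below the same loop computes $ab \cdot 2^{-n} \bmod p$ in $GF(P)$ and $a(x)b(x) \cdot x^{-n} \bmod m(x)$ in $GF(2^n)$; the only difference is carry-propagate addition versus carry-free XOR, plus a final subtraction that only $GF(P)$ needs. The function and helper names are hypothetical, and the checks use $p = 2^{13} - 1$ and the AES polynomial purely as convenient test values.

```python
def montgomery_mul(a: int, b: int, m: int, n: int, binary_field: bool = False) -> int:
    """Radix-2 Montgomery product.

    GF(p):   returns a*b*2^(-n) mod m, for odd m < 2^n and a, b < m.
    GF(2^n): returns a(x)*b(x)*x^(-n) mod m(x), where m is a degree-n
             polynomial encoded as bits with constant term 1.
    """
    t = 0
    for i in range(n):
        if (a >> i) & 1:                             # add b when bit i of a is set
            t = t ^ b if binary_field else t + b     # XOR is carry-free addition
        if t & 1:                                    # make t divisible by 2 (or by x)
            t = t ^ m if binary_field else t + m
        t >>= 1                                      # exact division by 2 (or by x)
    if not binary_field and t >= m:                  # t < 2m here, one correction suffices
        t -= m
    return t


def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a: int, f: int) -> int:
    """Remainder of polynomial a modulo f over GF(2)."""
    deg_f = f.bit_length() - 1
    while a.bit_length() - 1 >= deg_f:
        a ^= f << (a.bit_length() - 1 - deg_f)
    return a


# GF(p) check: multiplying the result back by 2^n undoes the 2^(-n) factor.
p, n = 8191, 13
r = montgomery_mul(1234, 5678, p, n)
assert (r << n) % p == (1234 * 5678) % p

# GF(2^8) check with m(x) = x^8 + x^4 + x^3 + x + 1 (0x11B).
f = 0x11B
r2 = montgomery_mul(0x57, 0x83, f, 8, binary_field=True)
assert polymod(clmul(r2, 1 << 8), f) == polymod(clmul(0x57, 0x83), f) == 0xC1
```

In hardware the GF(P) path's carry chain is what lengthens the critical path, which is the delay-time difference the abstract's lower radix for $GF(P)$ is meant to absorb.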


Session 8C  Reliability and Power Management
Time: 13:30 - 15:35 Thursday, January 24, 2008
Location: Room 311A
Chairs: Koji Inoue (Kyushu University, Japan), Masaaki Kondo (The University of Tokyo, Japan)

8C-1 (Time: 13:30 - 13:55)
TitleOptimal Allocation and Placement of Thermal Sensors for Reconfigurable Systems and Its Practical Extension
Author*ByungHyun Lee, Taewhan Kim (Seoul National University, Republic of Korea)
Pagepp. 703 - 707
Keywordthermal sensor, allocation, placement, optimization
AbstractDynamic monitoring of the thermal behavior of hardware resources using thermal sensors is very important for keeping system operation safe and reliable. This work proposes an effective solution to the problem of thermal sensor allocation and placement for reconfigurable systems at the post-manufacturing stage. Specifically, we define the sensor allocation and placement problem (SAPP) and propose a solution that formulates SAPP as the unate-covering problem (UCP) and solves it optimally. We then provide an extended solution to handle a practical design issue in which the hardware resources needed to implement sensors at specific array locations have already been used up by the application logic. Experimental results using MCNC benchmarks show that our proposed technique uses 19.7% fewer sensors on average to monitor hotspots than bisection-based approaches.
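The unate-covering view of SAPP can be illustrated on a small instance: each candidate sensor site covers some set of hotspots, and an optimal solution is a smallest set of sites covering them all. The sketch below finds an exact minimum cover by brute force over increasing subset sizes. It assumes the coverage sets are already given and does not reproduce the paper's formulation, solver, or resource-conflict extension; the site and hotspot names are hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def min_sensor_cover(candidates: dict[str, set[str]], hotspots: set[str]) -> tuple[str, ...]:
    """Exact unate covering by exhaustive search: the smallest set of candidate
    sites whose coverage sets include every hotspot. Exponential; small inputs only."""
    names = sorted(candidates)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            covered = set().union(*(candidates[c] for c in combo))
            if hotspots <= covered:
                return combo
    raise ValueError("hotspots cannot be covered by the candidate sites")


# Hypothetical coverage: which hotspots a sensor at each site could monitor.
candidates = {
    "s1": {"h1", "h2"},
    "s2": {"h2", "h3"},
    "s3": {"h3", "h4"},
    "s4": {"h1", "h4"},
}
print(min_sensor_cover(candidates, {"h1", "h2", "h3", "h4"}))  # ('s1', 's3')
```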

8C-2 (Time: 13:55 - 14:20)
TitleExploring Power Management in Multi-Core Systems
AuthorReinaldo Bergamaschi (IBM T.J. Watson Research Center, United States), Guoling Han (University of California, Los Angeles, United States), Alper Buyuktosunoglu (IBM T.J. Watson Research Center, United States), Hiren Patel (Virginia Tech, United States), Indira Nair, *Gero Dittmann, Geert Janssen (IBM T.J. Watson Research Center, United States), Nagu Dhanwada (IBM EDA Laboratory, United States), Zhigang Hu, Pradip Bose, John Darringer (IBM T.J. Watson Research Center, United States)
Pagepp. 708 - 713
Keyworddynamic voltage and frequency scaling (DVFS), power management, multi-core systems modeling, performance and power simulation
AbstractPower dissipation has become a critical design metric in microprocessor-based system design. In a multi-core system running multiple applications, power and performance can be dynamically traded off using an integrated power management (PM) unit. This PM unit monitors the performance and power of each core and dynamically adjusts the individual voltages and frequencies in order to maximize system performance under a given power budget (usually set by the operating system). This paper presents a performance and power analysis methodology featuring a simulation model for multi-core systems, which can be easily reconfigured for different scenarios, and a PM infrastructure for the exploration and analysis of PM algorithms. Two algorithms have been implemented, one for discrete and one for continuous power modes, based on non-linear programming. Extensive experiments are reported, illustrating the effect of power management at both the core and the chip level.
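The PM problem stated above, choosing per-core voltage/frequency modes to maximize performance under a power budget, can be made concrete for discrete modes with an exhaustive search. This is only an illustration of the search space, not either of the paper's algorithms; the mode table is hypothetical, and total throughput is assumed to be the sum of per-core relative throughputs.

```python
from itertools import product

# Hypothetical per-core operating points: (relative throughput, power in W).
MODES = [(1.0, 2.0), (1.5, 3.5), (2.0, 6.0)]


def best_mode_assignment(num_cores, budget_w, modes=MODES):
    """Exhaustively pick one mode per core to maximize total throughput subject
    to total power <= budget_w. Returns (throughput, power, mode indices), or
    None if no assignment fits the budget."""
    best = None
    for choice in product(range(len(modes)), repeat=num_cores):
        power = sum(modes[k][1] for k in choice)
        if power > budget_w:
            continue
        perf = sum(modes[k][0] for k in choice)
        if best is None or perf > best[0]:
            best = (perf, power, choice)
    return best


print(best_mode_assignment(4, 16.0))
```

The search grows as (number of modes)^(number of cores), which is one reason a runtime PM unit needs a cheaper algorithm than enumeration.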

8C-3 (Time: 14:20 - 14:45)
TitleDependability, Power, and Performance Trade-Off on a Multicore Processor
Author*Toshinori Sato (Kyushu University, Japan), Toshimasa Funaki (Kyushu Institute of Technology, Japan)
Pagepp. 714 - 719
Keywordpower consumption, dependability, multicore processors, trade-off design, soft errors
AbstractAs deep submicron technologies advance, we face new challenges such as power consumption and soft errors. A naïve technique that uses emerging multicore processors and relies on thread-level redundancy to detect soft errors is power hungry: it consumes at least twice as much power as a conventional single-threaded processor. This paper investigates the trade-off between dependability and power on a multicore processor named the multiple clustered core processor (MCCP). We propose adapting processor resources according to the requested performance. We also propose a new metric for evaluating the trade-off among dependability, power, and performance: the product of the soft error rate and the popular energy-delay product, which we name the energy, delay, and upset rate product (EDUP). Detailed simulations show that an MCCP exploiting the adaptable technique improves the EDUP by up to 21% compared with one exploiting the naïve technique.
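EDUP as defined above is the energy-delay product multiplied by the soft error (upset) rate, so lower is better and an "improvement" is a fractional reduction. The sketch shows the arithmetic only; the function names are hypothetical and the input numbers are made up for illustration, not results from the paper.

```python
def edup(energy_j: float, delay_s: float, upset_rate: float) -> float:
    """Energy x delay x soft-error (upset) rate, i.e. the EDP multiplied by the SER."""
    return energy_j * delay_s * upset_rate


def relative_improvement(candidate: float, baseline: float) -> float:
    """Fractional reduction of a lower-is-better metric relative to a baseline."""
    return 1.0 - candidate / baseline


# Made-up numbers for illustration only (not results from the paper).
naive = edup(energy_j=2.0, delay_s=1.0, upset_rate=0.5)
adaptive = edup(energy_j=1.5, delay_s=1.2, upset_rate=0.5)
print(f"EDUP improvement: {relative_improvement(adaptive, naive):.1%}")
```

Because the three factors multiply, a design can trade a longer delay for lower energy (as in this example) and still come out ahead on EDUP.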

8C-4 (Time: 14:45 - 15:10)
TitleHigh Performance Current-Mode Differential Logic
AuthorLing Zhang (Univ. of California, San Diego, United States), Jianhua Liu (Altera, United States), Haikun Zhu (Qualcomm, United States), *Chung-Kuan Cheng (Univ. of California, San Diego, United States), Masanori Hashimoto (Osaka Univ., Japan)
Pagepp. 720 - 725
KeywordVLSI circuit design, differential logic, current-mode logic
AbstractThis paper presents a new logic style, named Current-Mode Differential Logic (CMDL), that achieves both high operating speed and low power consumption. Inspired by low-voltage swing (LVS) logic, CMDL uses a shunt resistor at the differential output to obtain a constant low-swing signal without the need to reset low. Furthermore, conditional shunt transistors are used at the internal nodes to prevent high-voltage swing, thus entirely eliminating the power-hungry clocked reset network of LVS circuits. We show that CMDL is suitable for a high-end microprocessor integer core by presenting three datapath modules implemented in CMDL. Our simulation results indicate that, operating at a speed comparable to LVS logic, CMDL circuits achieve up to 50% reduction of the delay-power product compared to CMOS logic and LVS logic. In addition, CMDL reduces the power consumption of LVS by up to 40%.

8C-5 (Time: 15:10 - 15:35)
TitleNBTI Induced Performance Degradation in Logic and Memory Circuits: How Effectively Can We Approach a Reliability Solution?
AuthorKunhyuk Kang, Saakshi Gangwal, Sang Phill Park, *Kaushik Roy (Purdue Univ., United States)
Pagepp. 726 - 731
KeywordReliability, NBTI, Temporal degradation
AbstractThis paper evaluates the severity of negative bias temperature instability (NBTI) degradation in two major circuit applications: random logic and memory arrays. For improved lifetime stability, we propose or select efficient reliability-aware circuit design methodologies. Simulation results obtained with the 65nm PTM node show that NBTI-induced degradation in random logic is considerably lower than that of a single transistor. As a result, simple delay guard-banding can efficiently mitigate the impact of NBTI in random logic. NBTI degradation in memory, on the other hand, is much more severe: especially when combined with the impact of random process variation, NBTI can dramatically reduce the READ stability of memory cells. Hence, aggressive design techniques such as stand-by VDD scaling or adaptive body biasing (ABB) are required in memory applications to minimize the impact of NBTI.


Session 8D  Designers' Forum - Low Power Chips
Time: 13:30 - 15:35 Thursday, January 24, 2008
Location: Room 311BC
Chair: Kang Yi (Handong Global Univ., Republic of Korea)

8D-1
Title(Invited Paper) Reaching the Limits of Low Power Design
AuthorJ. S. Hobbs, *T. W. Williams (Synopsys, United States)
Pagepp. 732 - 735
AbstractAs process technologies continue to shrink, and feature demands continue to increase, more and more capabilities are being pushed into smaller and smaller packages. But are we finally reaching the point where power density limitations make this trend no longer sustainable? What advanced techniques are in use today, and on the horizon, to address this? Are we limited only to hardware techniques, or can these power limitation issues be addressed with smarter software development? And how do we handle verification of these complex implementations? This paper explores possible methods for improving the "power capacity" of power sensitive designs.

8D-2
Title: (Invited Paper) Software-Cooperative Power-Efficient Heterogeneous Multi-Core for Media Processing
Author: *Hiroaki Shikano, Masaki Ito, Kunio Uchiyama, Toshihiko Odaka (Hitachi, Japan), Akihiro Hayashi, Takeshi Masuura, Masayoshi Mase, Jun Shirako, Yasutaka Wada, Keiji Kimura, Hironori Kasahara (Waseda Univ., Japan)
Page: pp. 736 - 741
Abstract: We developed a heterogeneous multi-core processor (HMCP) architecture that integrates general-purpose processors (CPUs) and accelerators (ACCs) to achieve high performance as well as low power consumption with the support of a parallelizing compiler. The evaluation was performed using an MP3 audio encoder on a simulator that accurately models the HMCP. It showed that encoding 16 frames on the HMCP with four CPUs and four ACCs yielded a 24.5-fold speed-up over sequential execution on one CPU. Furthermore, compiler-controlled power saving reduced the energy consumption of the encoding by 28.4%, to 0.17 J.

8D-3
Title: (Invited Paper) Experiences of Low Power Design Implementation and Verification
Author: *Shi-Hao Chen, Jiing-Yuan Lin (Global Unichip, Taiwan)
Page: pp. 742 - 747
Abstract: In this paper, we present our experience with several low power solutions that have been successfully implemented in 90nm/65nm production tape-outs. We focus in particular on power gating, an effective low-leakage solution, and present our experience with power switch planning, optimization, and verification. Dynamic IR drop is an important issue in low power design because it may reduce logic gate noise margins and result in functional or timing failures. We present a low-cost but effective methodology for preventing and fixing dynamic IR drop.

8D-4
Title: (Invited Paper) Low Power Architecture and Design Techniques for Mobile Handset LSI Medity™ M2
Author: *Shuichi Kunie, Takefumi Hiraga, Tatsuya Tokue, Sunao Torii, Taku Ohsawa (NEC, Japan)
Page: pp. 748 - 753
Abstract: This paper presents the low power architecture and design techniques for the mobile handset LSI Medity™ M2. M2 is a second-generation mobile handset LSI that integrates a digital baseband and an application processor on a single chip. M2 supports 3.2 Mbps HSDPA, WCDMA communications, and rich, high-resolution multimedia applications, while keeping power consumption almost the same as that of its predecessor, M1. To reduce power consumption, M2 adopts hardware-managed clock control schemes, multiple-Vt transistors, an on-chip power switch, and back-bias control. Preliminary measurement results show that the design works very well.


Session 9A  Analog/RF/Mixed Signal CAD
Time: 15:50 - 17:55 Thursday, January 24, 2008
Location: Room 310A
Chairs: Seonghwan Cho (Korea Advanced Institute of Science and Technology, Republic of Korea), Zhiping Yu (Tsinghua University, China)

9A-1 (Time: 15:50 - 16:15)
Title: An Efficient, Fully Nonlinear, Variability-Aware Non-Monte-Carlo Yield Estimation Procedure with Applications to SRAM Cells and Ring Oscillators
Author: *Chenjie Gu, Jaijeet Roychowdhury (University of Minnesota, United States)
Page: pp. 754 - 761
Keyword: Yield estimation, non-Monte-Carlo, SRAM, Ring oscillator
Abstract: Failures and yield problems due to parameter variations have become a significant issue for sub-90-nm technologies. As a result, CAD algorithms and tools that let designers estimate the effects of variability quickly and accurately are being urgently sought. The need for such tools is particularly acute for static RAM (SRAM) cells and integrated oscillators, because such circuits require expensive, high-accuracy simulation during design. We present a novel technique for fast computation of parametric yield. The technique is based on efficient, adaptive geometric calculation of probabilistic hypervolumes subtended by the boundary separating pass/fail regions in parameter space. A key feature of the method is that it is far more efficient than Monte-Carlo while also achieving better accuracy in typical applications. The method works equally well with parameters specified as corners or with full statistical distributions; importantly, it scales well when many parameters are varied. We apply the method to an SRAM cell and a ring oscillator and provide extensive comparisons against full Monte-Carlo, demonstrating speedups of 100-1000X.

9A-2 (Time: 16:15 - 16:40)
Title: Analog Circuit Simulation Using Range Arithmetics
Author: *Darius Grabowski, Markus Olbrich, Erich Barke (Leibniz University of Hannover, Germany)
Page: pp. 762 - 767
Keyword: Simulation, affine arithmetic
Abstract: The impact of parameter variations in integrated analog circuits is usually analyzed by Monte Carlo methods with a large number of simulation runs. The few approaches based on interval arithmetic have not been successful because of severe overapproximation. In this paper, we describe an innovative approach that computes transient and DC simulations of nonlinear analog circuits using symbolic range representations that preserve correlation information and hence have very limited overapproximation. The methods are based on affine and quadratic arithmetic. Ranges are represented by unique symbols so that linear correlation information is preserved. We demonstrate the feasibility of the methods with simulation results for complex analog circuits.
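The abstract's central point, that symbolic ranges with unique symbols preserve correlation where plain intervals overapproximate, can be illustrated with a minimal sketch of first-order affine arithmetic. This is not the authors' simulator: the classes `Interval` and `Affine` are hypothetical, only addition and subtraction are shown, and nonlinear operations (which need extra noise symbols for their approximation error) as well as the paper's quadratic forms are omitted.

```python
import itertools
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __sub__(self, other: "Interval") -> "Interval":
        # Interval subtraction treats both operands as independent quantities.
        return Interval(self.lo - other.hi, self.hi - other.lo)


class Affine:
    """First-order affine form x0 + sum_i x_i * eps_i, with each eps_i in [-1, 1]."""

    _ids = itertools.count()

    def __init__(self, center: float, terms: Optional[Dict[int, float]] = None):
        self.center = center
        self.terms: Dict[int, float] = dict(terms or {})

    @classmethod
    def from_range(cls, lo: float, hi: float) -> "Affine":
        # Each independent uncertain parameter gets its own unique noise symbol.
        return cls((lo + hi) / 2, {next(cls._ids): (hi - lo) / 2})

    def __add__(self, other: "Affine") -> "Affine":
        terms = dict(self.terms)
        for sym, coeff in other.terms.items():
            # Coefficients of a shared symbol combine; this preserves correlation.
            terms[sym] = terms.get(sym, 0.0) + coeff
        return Affine(self.center + other.center, terms)

    def __neg__(self) -> "Affine":
        return Affine(-self.center, {s: -c for s, c in self.terms.items()})

    def __sub__(self, other: "Affine") -> "Affine":
        return self + (-other)

    def to_interval(self) -> Interval:
        # Worst case: every noise symbol at +1 or -1.
        radius = sum(abs(c) for c in self.terms.values())
        return Interval(self.center - radius, self.center + radius)


x_iv = Interval(1.0, 3.0)
print(x_iv - x_iv)            # Interval(lo=-2.0, hi=2.0)

x = Affine.from_range(1.0, 3.0)
print((x - x).to_interval())  # Interval(lo=0.0, hi=0.0)
```

Subtracting an interval from itself yields [-2, 2] because the dependency is lost, whereas the affine form cancels the shared noise symbol and returns exactly [0, 0].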

9A-3 (Time: 16:40 - 16:53)
Title: LTCC Spiral Inductor Modeling, Synthesis, and Optimization
Author: *Tuck-Boon Chan, Hsin-Chia Lu, Jun-Kuei Zeng, Charlie Chung-Ping Chen (National Taiwan University, Taiwan)
Page: pp. 768 - 771
Keyword: LTCC, inductor, synthesis, optimization
Abstract: In RF/microwave circuit design, inductor design is one of the most difficult and time-consuming tasks because of the tedious trial-and-error optimization process. This paper presents a fast and accurate spiral inductor synthesis method that automatically generates the physical layout of an inductor from its electrical specification. Combining a substrate-aware PEEC model with a nonlinear optimization engine, our modeling and synthesis strategies have been extensively verified against 3D solvers and show less than 6% error relative to measurement results.

9A-4 (Time: 16:53 - 17:06)
Title: Symmetry Constraint based on Mismatch Analysis for Analog Layout in SOI Technology
Author: *Jiayi Liu, Sheqin Dong, Xianlong Hong, Yibo Wang, Ou He (Tsinghua University, China), Satoshi Goto (Waseda University, Japan)
Page: pp. 772 - 775
Keyword: mismatch, analog, symmetry, SOI
Abstract: Conventional mismatch-elimination techniques such as geometric symmetry and common-centroid layout can eliminate only systematic mismatch and do little to reduce random and thermally induced mismatch. As VLSI technology advances, random mismatch is becoming increasingly serious. Moreover, in silicon-on-insulator (SOI) technology, the self-heating effect leads to intolerable thermally induced mismatch. In this paper, we therefore first propose a new model that estimates the combined effect of random and thermally induced mismatch using mismatch analysis and SPICE simulation. To accommodate the different sensitivities of different symmetry pairs, we also introduce an automatic classification tool and a configurable optimization process. All of these are embedded in the floorplanning process. Experimental results demonstrate the effectiveness of our method.


Session 9B  Architecture Exploration
Time: 15:50 - 17:55 Thursday, January 24, 2008
Location: Room 310BC
Chairs: Takao Onoye (Osaka University, Japan), Yun-Nan Chang (National Sun Yat-sen University, Taiwan)

9B-1 (Time: 15:50 - 16:15)
Title: SPKM: A Novel Graph Drawing Based Algorithm for Application Mapping onto Coarse-Grained Reconfigurable Architectures
Author: *Jonghee Yoon (Seoul National University, Republic of Korea), Aviral Shrivastava (Arizona State University, United States), Sanghyun Park, Minwook Ahn (Seoul National University, Republic of Korea), Reiley Jeyapaul (Arizona State University, United States), Yunheung Paek (Seoul National University, Republic of Korea)
Page: pp. 776 - 782
Keyword: Reconfigurable, Mapping, CGRA, Compiler
Abstract: Recently, coarse-grained reconfigurable architectures (CGRAs) have drawn increasing attention due to their efficiency and flexibility. While many CGRAs have demonstrated impressive performance improvements, the effectiveness of CGRA platforms ultimately hinges on the compiler. Because existing CGRA compilers do not model the details of the CGRA architecture, they (i) fail to map some applications even though a mapping exists, and (ii) use too many processing elements (PEs) to map an application. In this paper, we model several CGRA details in our compiler and develop a graph-mapping-based approach (SPKM) for mapping applications onto CGRAs. On randomly generated graphs, our technique maps on average 4.5X more applications than previous approaches, while using fewer CGRA rows 62% of the time, without any penalty in mapping time. We observe similar results on a suite of benchmarks collected from the Livermore Loops, Multimedia, and DSPStone benchmarks.

9B-2 (Time: 16:15 - 16:40)
Title: Block Remap with Turnoff: A Variation-Tolerant Cache Design Technique
Author: *Mohammed Abid Hussain (Int'l Inst. of Information Tech., Hyderabad, India), Madhu Mutyam (Indian Inst. of Tech. Madras, India)
Page: pp. 783 - 788
Keyword: process variations, data caches, performance, leakage energy
Abstract: As feature sizes shrink, the effects of process variations are becoming increasingly dominant. Memory components such as on-chip caches are more susceptible to such variations because of their high density and small transistors. In this paper, we propose a variation-tolerant design technique for on-chip data caches affected by process variations. Our technique selectively turns off a few blocks after rearranging them so that all sets receive an almost equal number of variation-affected blocks. We show that our technique significantly reduces the performance loss and leakage energy consumption due to process variations.
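The balancing idea stated in this abstract can be illustrated with a minimal sketch. It assumes that every physical block has already been flagged as variation-affected or not, and that any physical block can be remapped to any set; the helpers `remap_blocks` and `enabled_ways` are hypothetical and do not describe the paper's actual remapping hardware or its criteria for turning blocks off.

```python
from typing import List


def remap_blocks(faulty: List[bool], num_sets: int) -> List[List[int]]:
    """Hypothetical helper: group physical cache blocks into sets so that
    variation-affected (faulty) blocks are spread as evenly as possible."""
    if num_sets <= 0 or len(faulty) % num_sets != 0:
        raise ValueError("block count must be a positive multiple of num_sets")
    ways = len(faulty) // num_sets
    bad = [i for i, f in enumerate(faulty) if f]
    good = [i for i, f in enumerate(faulty) if not f]
    sets: List[List[int]] = [[] for _ in range(num_sets)]
    # Deal faulty blocks round-robin, so each set receives either
    # floor(len(bad) / num_sets) or ceil(len(bad) / num_sets) of them.
    for k, blk in enumerate(bad):
        sets[k % num_sets].append(blk)
    # Fill the remaining ways of each set with healthy blocks.
    healthy = iter(good)
    for s in sets:
        while len(s) < ways:
            s.append(next(healthy))
    return sets


def enabled_ways(sets: List[List[int]], faulty: List[bool]) -> List[int]:
    """Count the ways left on in each set once faulty blocks are turned off."""
    return [sum(1 for blk in s if not faulty[blk]) for s in sets]


faulty = [True, True, True, True, False, False, False, False]
naive = [[0, 1], [2, 3], [4, 5], [6, 7]]  # blocks kept in their original sets
print(enabled_ways(naive, faulty))        # [0, 0, 2, 2]

balanced = remap_blocks(faulty, num_sets=4)
print(balanced)                           # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(enabled_ways(balanced, faulty))     # [1, 1, 1, 1]
```

Without rearrangement, turning off the four faulty blocks disables two sets entirely; after the round-robin remap, every set keeps one enabled way.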

9B-3 (Time: 16:40 - 17:05)
Title: ORB: An On-Chip Optical Ring Bus Communication Architecture for Multi-Processor Systems-on-Chip
Author: *Sudeep Pasricha, Nikil Dutt (University of California, Irvine, United States)
Page: pp. 789 - 794
Keyword: on-chip communication architectures, optical interconnects, MPSoC, performance, power analysis
Abstract: As application complexity continues to increase, multi-processor systems-on-chip (MPSoCs) with tens to hundreds of processing cores are becoming the norm. While computational cores have become faster with each successive technology generation, communication between them has become a bottleneck that limits overall chip performance. On-chip optical interconnects can overcome this bottleneck by replacing electrical wires with optical waveguides. In this paper, we propose an optical ring bus (ORB) based on-chip communication architecture for next-generation MPSoCs. ORB uses an optical ring waveguide to replace global pipelined electrical interconnects while preserving the interface with today’s bus protocol standards such as AMBA AXI. Our experiments show that ORB has the potential to provide superior performance (more than 2×) and significantly lower power consumption (a reduction of more than 10×) compared to traditional pipelined, all-electrical bus-based communication architectures, for the 65 nm to 22 nm technology nodes.

9B-4 (Time: 17:05 - 17:30)
Title: Webpage-Based Benchmarks for Mobile Device Design
Author: *Marc Somers, JoAnn M. Paul (Virginia Tech., United States)
Page: pp. 795 - 800
Keyword: webpage modeling, utilization, benchmarks, mobile computing
Abstract: By investigating the content, structure, and usage of webpages, we observe that webpages represent a fundamentally different standard for performance evaluation of computer designs. We found that specialized architectures, customized to webpage content, can improve performance by up to 70% over a homogeneous multiprocessor, with an additional 25% improvement when individual user preferences are also considered. Thus, a new form of benchmark suite is required, based on the rapidly evolving and divergent content of information exchanged via webpages on mobile devices.


Session 9D  Designers' Forum (Panel) Best Ways to Use Billions of Devices on a Chip
Time: 15:50 - 17:55 Thursday, January 24, 2008
Location: Room 311BC
Chair: Grant Martin (Tensilica, United States)

9D-1
Title: (Panel Discussion) Best Ways to Use Billions of Devices on a Chip
Author: Moderator: Grant Martin (Tensilica, United States); Panelists: Deming Chen (Univ. of Illinois, Urbana-Champaign, United States), Nikil Dutt (Univ. of California, Irvine, United States), Joerg Henkel (Karlsruhe Univ., Germany), Kyungho Kim (Samsung Electronics, Republic of Korea), Kazutoshi Kobayashi (Kyoto Univ., Japan)
Page: pp. 801 - 802
Abstract: We all know that Moore's law is good for at least a few more generations of silicon process, and this will give rise to many integrated circuits with billions of transistors on them. As of 2007, the leading 45 nm processors being announced are approaching a billion transistors. But how can we best use these devices in the future? Integrating more and more features and functions onto SoCs may not be the optimal use for all of these billions of resources. Indeed, even producing a working device at 45, 32, 22, and 16 nm may require new architectures and new structures to be incorporated.

9D-2
Title: (Invited Paper) VEBoC: Variation and Error-Aware Design for Billions of Devices on a Chip
Author: Shoaib Akram, Scott Cromar, Gregory Lucas, Alexandros Papakonstantinou, *Deming Chen (University of Illinois, Urbana-Champaign, United States)
Page: pp. 803 - 808
Abstract: Chips with billions of devices are around the corner, and deep submicron (DSM) technology scaling will continue for at least another decade. Meanwhile, designers also face severe on-chip parameter variations, soft/hard errors, and high leakage power. How to use these billions of devices to deliver power-efficient, high-performance, and yet error-resilient computation is a challenging problem. In this paper, we present some of our perspectives on addressing these critical issues.

9D-3
Title: (Invited Paper) Quo Vadis, BTSoC (Billion Transistor SoC)?
Author: *Nikil Dutt (University of California, Irvine, United States)
Page: p. 809
Abstract: Billion Transistor Systems-on-Chip (BTSoCs) present designers with a classic case of the “embarrassment-of-riches” syndrome: with so many devices at one’s disposal, designers may be tempted to integrate functionality willy-nilly, with no strategic rethinking of what this level of integration can both afford and achieve. While many advocate “business-as-usual” (including ad-hoc integration of functionality to achieve application-specific or domain-dependent designs), I believe BTSoCs present us with opportunities for a paradigm shift in the architectural strategies and design processes for such complex chips.

9D-5
Title: (Invited Paper) Best Ways to use Billions of Devices on a Wireless Mobile SoC
Author: *KyungHo Kim (Samsung Electronics, Republic of Korea)
Page: p. 810
Abstract: Rapid growth in information technology (IT) over the last decade has given us unimagined possibilities and great convenience in our lives, e.g., high-speed Internet, mobile TV, high-definition digital TV, 3D gaming, mobile multimedia players, Ultra-Mobile PCs, and so on. Nowadays, the technology is even exceeding market demands. Apple produced a 160 GB iPod that stores 40,000 songs. Samsung showed the world's first mobile phone with a 10-megapixel camera. HSDPA-based video telephony service was commercialized in Korea.

9D-6
Title: (Invited Paper) Best Ways to Use Billions of Devices on a Chip - Error Predictive, Defect Tolerant and Error Recovery Designs
Author: *Kazutoshi Kobayashi, Hidetoshi Onodera (Kyoto University, Japan)
Page: pp. 811 - 812
Abstract: Error rates in LSIs are increasing in step with Moore's law. Now is the time to start incorporating error-tolerant design methodologies. This paper introduces sources of failures in semiconductor devices, the levels of dependability required by different device applications, and some circuit-level techniques to detect or recover from faults after shipping.