The 14th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule

Tuesday, January 20, 2009

A	B	C	D
1K (Small Auditorium, 5F) Opening and Keynote Session I 8:30 - 10:00
1A (Room 411+412) On-Chip Communication Architectures 10:15 - 12:20	1B (Room 413) Dealing with Thermal Issues 10:15 - 12:20	1C (Room 414+415) Advances in Behavioral Synthesis 10:15 - 12:20	1D (Room 416+417) University LSI Design Contest 10:15 - 12:20
2A (Room 411+412) MPSoC and IP Integration 13:30 - 15:35	2B (Room 413) Power Analysis and Optimization 13:30 - 15:35	2C (Room 414+415) Logic and Arithmetic Optimization 13:30 - 15:35	2D (Room 416+417) Special Session: EDA Acceleration Using New Architectures 13:30 - 15:35
3A (Room 411+412) System-Level Design of 3D Chips and Configurable Systems 15:55 - 18:00	3B (Room 413) Advances in Timing Analysis and Modeling 15:55 - 18:00		3D (Room 416+417) Special Session: Hardware Dependent Software for Multi- and Many-Core Embedded Systems 15:55 - 18:00

Wednesday, January 21, 2009

A	B	C	D
2K (Small Auditorium, 5F) Keynote Session II 9:00 - 10:00
4A (Room 411+412) System Level Architectures 10:15 - 12:20	4B (Room 413) Beyond Traditional Floorplanning and Placement 10:15 - 12:20	4C (Room 414+415) Signal/Power Integrity and Simulation 10:15 - 12:20	4D (Room 416+417) Special Session: Challenges in 3D Integrated Circuit Design 10:15 - 12:20
5A (Room 411+412) Energy-Aware System Level Design Methodology 13:30 - 15:35	5B (Room 413) Design for Manufacturing and Reliability 13:30 - 15:35	5C (Room 414+415) Analog, RF and Mixed-Signal CAD 13:30 - 15:35	5D (Room 416+417) Designers' Forum: Consumer SoC 13:30 - 15:35
6A (Room 411+412) System Level Simulation and Modeling 15:55 - 18:00	6B (Room 413) Chip and Package Routing Techniques 15:55 - 18:00		6D (Room 416+417) Designers' Forum: ESL Design Methods 15:55 - 18:00

Thursday, January 22, 2009

A	B	C	D
3K (Small Auditorium, 5F) Keynote Session III 9:00 - 10:00
7A (Room 411+412) Compilation Techniques for Embedded Systems 10:15 - 12:20	7B (Room 413) Sequential Design Verification 10:15 - 12:20	7C (Room 414+415) Scan Test Generation 10:15 - 12:20	7D (Room 416+417) Designers' Forum: Analog/RF Circuit Designs 10:15 - 12:20
8A (Room 411+412) High-Level Design and Scheduling 13:30 - 15:35	8B (Room 413) Emerging Design Methodologies and Applications 13:30 - 15:35	8C (Room 414+415) Verification, Test, and Yield 13:30 - 15:35	8D (Room 416+417) Designers' Forum: Near-Future SoC Architectures -- Can Dynamically Reconfigurable Processors be a Key Technology? 13:30 - 15:35
9A (Room 411+412) Memory Systems Simulation and Optimization 15:55 - 18:00	9B (Room 413) Emerging Technologies 15:55 - 18:00		9D (Room 416+417) Special Session: Dependable VLSI: Device, Design and Architecture -- How should they cooperate ? -- 15:55 - 18:00

List of Papers

Tuesday, January 20, 2009

Session 1K Opening and Keynote Session I
Time: 8:30 - 10:00 Tuesday, January 20, 2009
Location: Small Auditorium, 5F
Chair: Kazutoshi Wakabayashi (NEC Corp., Japan)

1K-1 (Time: 9:00 - 10:00)

Title	(Keynote Address) Challenges to EDA System from the View Point of Processor Design and Technology Drivers
Author	Mitsuo Saito (Toshiba Corporation Semiconductor Company, Japan)
Abstract	Many microprocessors have been developed since the microprocessor was invented in the early 1970s. Microprocessor design has always faced the fiercest competition, so until recently microprocessors were the technology driver for both semiconductor technology and design methodology. This talk discusses the relationship between revolutions in design methodology (EDA) and transitions in technology-driver products, based on Makimoto's well-known wave hypothesis, and highlights what happened in the microprocessor world through typical examples. As a recent example, it mainly discusses the positioning of the Cell Broadband Engine as both a high-performance computing processor and flexible hardware, along with its performance results and the future trend of microprocessors toward multi-core. It then explains why SpursEngine, derived from the Cell Broadband Engine, had to be developed. SoCs (combinations of a microprocessor and hardware functional units) for custom applications should be the technology driver for the next decade, a first since the microprocessor was born. The special requirements on EDA systems for realizing the next wave are predicted. Finally, what may happen to the world when the following wave, a software-centric era, arrives, perhaps after 2017, is briefly mentioned.

Session 1A On-Chip Communication Architectures
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 411+412
Chair: Sri Parameswaran (University of New South Wales, Australia)

1A-1 (Time: 10:15 - 10:40)

Title	Adaptive Inter-router Links for Low-Power, Area-Efficient and Reliable Network-on-Chip (NoC) Architectures
Author	Avinash Karanth Kodi (Ohio University, United States), Ashwini Sarathy, Ahmed Louri, *Janet Wang (University of Arizona, United States)
Page	pp. 1 - 6
Keyword	network-on-chip, low-power architecture
Abstract	The increasing wire delay constraints in deep sub-micron VLSI designs have led to the emergence of scalable and modular Network-on-Chip (NoC) architectures. As the power consumption, area overhead and performance of the entire NoC is influenced by the router buffers, research efforts have targeted optimized router buffer design. In this paper, we propose iDEAL - inter-router, dual-function energy and area-efficient links capable of data transmission as well as data storage when required. iDEAL enables a reduction in the router buffer size by controlling the repeaters along the links to adaptively function as link buffers during congestion, thereby achieving nearly 30% savings in overall network power and 35% reduction in area with only a marginal 1-3% drop in performance. In addition, aggressive speculative flow control further improves the performance of iDEAL. Moreover, the significant reduction in power consumption and area provides sufficient headroom for monitoring Negative Bias Temperature Instability (NBTI) effects in order to improve circuit reliability at reduced feature sizes.

1A-2 (Time: 10:40 - 11:05)

Title	Analysis of Communication Delay Bounds for Network on Chips
Author	*Yue Qian (National University of Defense Technology, China), Zhonghai Lu (Royal Institute of Technology, Sweden), Wenhua Dou (National University of Defense Technology, China)
Page	pp. 7 - 12
Keyword	Network-on-chip, network calculus, delay bound
Abstract	In network-on-chip, computing the worst-case delay bound for packet delivery is crucial for designing predictable systems, yet it remains an intractable problem due to complicated resource contention scenarios. In this paper, we present an analysis technique to derive the communication delay bound for individual flows. Based on a network contention model, this technique, which is topology independent, employs the network calculus theory to first compute the equivalent service curve for individual flows and then calculate their packet delay bound. To exemplify our method, we also present the derivation of a closed-form formula to calculate the delay bound for all-to-one gather communication. Our experimental results demonstrate that the theoretical bounds are correct and tight.

1A-3 (Time: 11:05 - 11:30)

Title	Frequent Value Compression in Packet-based NoC Architectures
Author	Ping Zhou, Bo Zhao, Yu Du, Yi Xu, Youtao Zhang, *Jun Yang (University of Pittsburgh, United States), Li Zhao (Intel, United States)
Page	pp. 13 - 18
Keyword	compression, NoC, performance, power
Abstract	The proliferation of Chip Multiprocessors (CMPs) has led to the integration of large on-chip caches. For scalability reasons, a large on-chip cache is often divided into smaller banks that are interconnected through a packet-based Network-on-Chip (NoC). With the increasing number of cores and cache banks integrated on a single die, the on-chip network introduces significant communication latency and power consumption. In this paper, we propose a novel scheme that exploits Frequent Value compression to optimize the power and performance of NoC. Our experimental results show that the proposed scheme reduces the router power by up to 16.7%, with CPI reduction as much as 23.5% in our setting. Compared to the recent zero pattern compression scheme, the frequent value scheme saves up to 11.0% more router power and has up to 14.5% more CPI reduction. Hardware design of the FV table and its overhead are also presented.

1A-4 (Time: 11:30 - 11:55)

Title	Simultaneous Data Transfer Routing and Scheduling for Interconnect Minimization in Multicycle Communication Architecture
Author	Yu-Ju Hong (Purdue University, United States), Ya-Shih Huang, *Juinn-Dar Huang (National Chiao Tung University, Taiwan)
Page	pp. 19 - 24
Keyword	multicycle communication, architectural synthesis, interconnect minimization, resource allocation and sharing, scheduling
Abstract	In deep submicron technology, wire delay is no longer negligible and is gradually becoming a dominant factor of system performance. Several state-of-the-art architectural synthesis flows have already adopted the distributed register architecture to cope with the increasing wire delay by allowing multicycle communication. In this paper, we formulate channel and register allocation within a refined regular distributed register architecture, named RDR-GRS, as a problem of simultaneous data transfer routing and scheduling for minimizing global interconnect resources. We also present an innovative algorithm with both spatial and temporal considerations. It features both a concentration-oriented path router gathering wire-sharable data transfers and a channel-based time scheduler resolving contentions for wires in a channel, which are in spatial and temporal domain, respectively. The experimental results show that the proposed algorithm can significantly outperform existing related works.

1A-5 (Time: 11:55 - 12:20)

Title	Dynamically Reconfigurable On-Chip Communication Architectures for Multi Use-Case Chip Multiprocessor Applications
Author	Sudeep Pasricha, *Nikil Dutt, Fadi Kurdahi (University of California, Irvine, United States)
Page	pp. 25 - 30
Keyword	crossbar, on-chip communication, synthesis, low power
Abstract	The phenomenon of digital convergence and increasing application complexity today is motivating the design of chip multiprocessor (CMP) applications with multiple use cases. Most traditional on-chip communication architecture design techniques perform synthesis and optimization only for a single use-case, which may lead to sub-optimal design decisions for multi-use case applications. In this paper we present a framework to generate a dynamically reconfigurable crossbar-based on-chip communication architecture that can support multiple use-case bandwidth and latency constraints. Our framework generates on-chip communication architectures with a low cost, low power dissipation, and with minimal reconfiguration overhead. Results of applying our framework on several networking CMP applications show that our approach is able to generate a crossbar solution with significantly lower cost (2.4× to 3.8×), and lower power dissipation (1.5× to 3.1×), compared to the best previously proposed approach.

Session 1B Dealing with Thermal Issues
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 413
Chairs: Youngsoo Shin (KAIST, Republic of Korea), Li Shang (University of Colorado at Boulder, United States)

1B-1 (Time: 10:15 - 10:40)

Title	Stochastic Thermal Simulation Considering Spatial Correlated Within-Die Process Variations
Author	*Pei-Yu Huang, Jia-Hong Wu, Yu-Min Lee (National Chiao Tung University, Taiwan)
Page	pp. 31 - 36
Keyword	Statistical IC thermal simulator, Karhunen-Loeve expansion, Leakage power, stochastic Galerkin method
Abstract	In this work, a statistical thermal simulator including the effect of spatial correlation under within-die process variations is developed. This method utilizes the Karhunen-Loeve (KL) expansion to model the physical parameters, and applies the Polynomial Chaoses (PCs) and the stochastic Galerkin method to tackle the stochastic heat transfer equations. The experimental results not only demonstrate the accuracy and efficiency of the proposed method, but also point out that the stochastic thermal analysis is essential to provide a robust estimation of temperature distribution for the thermal-aware design flow.

1B-2 (Time: 10:40 - 11:05)

Title	A Control Theory Approach for Thermal Balancing of MPSoC
Author	*Francesco Zanini, David Atienza, Giovanni De Micheli (Ecole Polytechnique Federale de Lausanne, Switzerland)
Page	pp. 37 - 42
Keyword	thermal balancing, MPSoC, control theory, linear quadratic regulator
Abstract	Thermal balancing and reducing hot-spots are two important challenges facing MPSoC designers. In this work, we model the thermal behavior of an MPSoC as a control theory problem, which enables the design of an optimum frequency controller without depending on the thermal profile of the chip. The optimization performed by the controller is targeted to achieve thermal balancing on the MPSoC thermal profile to avoid hotspots and improve its reliability. The proposed system is able to perform an on-line minimization of chip thermal gradients based on both scheduler requirements and the chip thermal profile. We compare this with state-of-the-art thermal management approaches; the comparison shows that the proposed system offers both a better thermal profile (temperature differences higher than 4°C have been reduced from 27.9% to 0.45%) and better performance (up to 32% task waiting time reduction).

1B-3 (Time: 11:05 - 11:30)

Title	Thermal Optimization in Multi-Granularity Multi-Core Floorplanning
Author	*Michael B. Healy, Hsien-Hsin S. Lee, Gabriel H. Loh, Sung Kyu Lim (Georgia Institute of Technology, United States)
Page	pp. 43 - 48
Keyword	multicore, thermal, floorplanning
Abstract	Multi-core microarchitectures require a careful balance between many competing objectives to achieve the highest possible performance. Integrated Early Analysis is the consideration of all of these factors at an early stage. Toward this goal, this work presents the first adaptive multi-granularity multi-core microarchitecture-level floorplanner that simultaneously optimizes temperature and performance, and considers memory bus length. We include simultaneous optimization at both the module-level and the core/cache-bank level. Related experiments show that our methodology is effective for optimizing multi-core architectures.

1B-4 (Time: 11:30 - 11:55)

Title	Temperature-Aware Dynamic Frequency and Voltage Scaling for Reliability and Yield Enhancement
Author	*Yu-Wei Yang, Katherine Shu-Min Li (Department of Computer Science and Engineering, National Sun Yat-Sen University, Taiwan)
Page	pp. 49 - 54
Keyword	DVFS, DVS, oscillation ring, on-chip thermal sensors, on-chip DVFS monitor
Abstract	A novel oscillation-based on-chip thermal sensing architecture for dynamically adjusting supply voltage and clock frequency in System-on-Chip (SoC) is proposed. It is shown that the oscillation frequency of a ring oscillator reduces linearly as the temperature rises, and thus provides a good on-chip temperature sensing mechanism. An efficient Dynamic Frequency-to-Voltage Scaling (DF2VS) algorithm is proposed to dynamically adjust supply voltage according to the oscillation frequencies of the ring oscillators distributed in the SoC so that thermal sensing can be carried out at all potential hot spots. An on-chip Dynamic Voltage Scaling or Dynamic Voltage and Frequency Scaling (DVS or DVFS) monitor selects the supply voltage level and clock frequency according to the outputs of all thermal sensors. Experimental results on SoC benchmark circuits show the effectiveness of the algorithm: a 10% reduction in supply voltage alone can achieve about 20% power reduction (DVS scheme), and nearly 50% reduction in power is achievable if the clock frequency is also scaled down (DVFS scheme). The chip temperature is reduced accordingly.

1B-5 (Time: 11:55 - 12:20)

Title	A Multiple Supply Voltage Based Power Reduction Method in 3-D ICs Considering Process Variations and Thermal Effects
Author	Shih-An Yu, *Pei-Yu Huang, Yu-Min Lee (National Chiao Tung University, Taiwan)
Page	pp. 55 - 60
Keyword	Power Optimization, 3D ICs, Thermal analysis, Multiple Supply Voltage
Abstract	In this paper, a grid-based multiple supply voltage (MSV) assignment method is presented to statistically minimize the total power consumption of 3-D ICs. This method consists of a statistical electro-thermal simulator to get the mean and variance of on-chip temperature, a thermal-aware statistical static timing analysis (SSTA) to take into account the thermal effect on circuit timing, the statistical power delay sensitivity–slack product as the optimization criterion, and an incremental update of statistical timing to save runtime. The experimental results demonstrate the effectiveness of the developed methodology and indicate that the consideration of the thermal effect in the circuit simulation is imperative.

Session 1C Advances in Behavioral Synthesis
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 414+415
Chairs: Shigeru Yamashita (Nara Institute of Science and Technology, Japan), Kiyoung Choi (Seoul National University, Republic of Korea)

1C-1 (Time: 10:15 - 10:40)

Title	FastYield: Variation-Aware, Layout-Driven Simultaneous Binding and Module Selection for Performance Yield Optimization
Author	*Gregory Lucas, Scott Cromar, Deming Chen (University of Illinois, Urbana-Champaign, United States)
Page	pp. 61 - 66
Keyword	high level synthesis, process variation, ssta
Abstract	We propose a new variation-aware high-level synthesis binding/module selection algorithm, named FastYield, that takes into consideration multiplexers, functional units, registers, and interconnects. Additionally, FastYield connects with the lower levels of the design hierarchy through its inclusion of a timing driven floorplanner guided by a statistical static timing analysis (SSTA) engine which is used to modify/enhance the synthesis solution. On average, FastYield achieves an 85% performance yield clock period that is 14.5% smaller, and a performance yield gain of 78.9%, when compared to a variation-unaware algorithm.

1C-2 (Time: 10:40 - 11:05)

Title	CriAS: A Performance-Driven Criticality-Aware Synthesis Flow for On-Chip Multicycle Communication Architecture
Author	*Chia-I Chen, Juinn-Dar Huang (National Chiao Tung University, Taiwan)
Page	pp. 67 - 72
Keyword	Architectural synthesis, multicycle communication architecture, distributed register architecture, criticality-aware, performance-driven
Abstract	In the deep submicron era, wire delay is no longer negligible and is dominating the system performance. Several state-of-the-art architectural synthesis flows have been proposed for the distributed register architectures to cope with the increasing wire delay by allowing on-chip multicycle communication. In this paper, we present a new performance-driven criticality-aware synthesis flow, CriAS, targeting regular distributed register architectures. CriAS features a hierarchical binding strategy and a coarse-grained placer for minimizing the number of critical global data transfers. The key ideas are to take time criticality as the major concern at earlier binding stages before the detailed physical placement information is available, and to preserve the locality of closely related critical components in the later placement phase. The experimental results show that a 19% overall performance improvement can be achieved on average as compared to the previous work.

1C-3 (Time: 11:05 - 11:30)

Title	Tolerating Process Variations in High-Level Synthesis Using Transparent Latches
Author	*Yibo Chen, Yuan Xie (the Pennsylvania State University, United States)
Page	pp. 73 - 78
Keyword	high-level synthesis, process variation, latch
Abstract	Considering process variability at the behavior synthesis level is necessary, because it makes some instances of function units slower and others faster, resulting in unbalanced control steps and reducing the attainable frequency of the circuit. To tackle this problem, this paper proposes a methodology to replace the edge-triggered flip-flops by transparent latches, to exploit latches' extra ability of passing time slacks and tolerating delay variations. In the paper we first define the timing yield in high-level synthesis, and then present how to replace flip-flops with latches to improve timing yield and mitigate the impact of process variations. We then discuss the benefits and overheads for the replacement, and propose an optimization framework for latch replacement in high-level synthesis design flow. Experimental results show that the latch-based design can achieve an average of 27% improvement of timing yield compared with traditional flip-flop based design.

1C-4 (Time: 11:30 - 11:55)

Title	Variation-Aware Resource Sharing and Binding in Behavioral Synthesis
Author	Feng Wang (Qualcomm Inc., United States), Yuan Xie (Pennsylvania State University, United States), *Andres Takach (Mentor Graphics Corporation, United States)
Page	pp. 79 - 84
Keyword	High level synthesis, resource sharing, resource binding, process variation
Abstract	As technology scales, the delay uncertainty caused by process variations has become increasingly pronounced in deep submicron designs. In the presence of process variations, worst-case timing analysis may lead to overly conservative synthesis, and may end up using excess resources to guarantee design constraints. In this paper, we propose an efficient variation-aware resource sharing and binding algorithm in behavioral synthesis, which takes into account the performance variations for functional units. The performance yield, which is defined as the probability that the synthesized hardware meets the target performance constraints, is used to evaluate the synthesis result. An efficient metric called statistical performance improvement is used to guide resource sharing and binding. The proposed algorithm is integrated into a commercial synthesis framework that transfers design specifications from behavioral description to RTL netlists. The effectiveness of the proposed algorithm is demonstrated with a set of industrial benchmark designs, which consist of blocks that are commonly used in wireless and image processing applications. The experimental results show that our method achieves an average 33% area reduction over traditional methods, which are based on the worst-case delay analysis, with an average 10% run time overhead.

1C-5 (Time: 11:55 - 12:20)

Title	Peak Temperature Control in Thermal-aware Behavioral Synthesis through Allocating the Number of Resources
Author	*Junbo Yu, Qiang Zhou, Jinian Bian (Tsinghua University, China)
Page	pp. 85 - 90
Keyword	resource usage allocation, behavioral synthesis, peak temperature
Abstract	High temperature adversely impacts reliability, performance, and leakage power of ICs. In behavioral synthesis, both resource usage allocation and resource binding influence the final thermal profile. Previous thermal-aware behavioral syntheses only focused on binding, ignoring allocation. This paper proposes thermal-aware behavioral synthesis with resource usage allocation. According to power density and feedbacks from thermal simulation, we allocate the number of resources under area constraint. Our flow effectively controls peak temperature and creates even power densities among resources of "different" and "same" types. Compared to classic behavioral synthesis of peak temperature control, our technique reduces peak temperature by 11.1℃ on average with no area overhead and only 1.2 more steps latency overhead.

Session 1D University LSI Design Contest
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 416+417
Chairs: Jiun-In Guo (National Chung Cheng University, Taiwan), Hiroki Ishikuro (Keio University, Japan)

1D-1 (Time: 10:15 - 10:20)

Title	A Wireless Real-Time On-Chip Bus Trace System
Author	*Shusuke Kawai, Takayuki Ikari (Keio University, Japan), Yutaka Takikawa (Renesas Design Corp, Japan), Hiroki Ishikuro, Tadahiro Kuroda (Keio University, Japan)
Page	pp. 91 - 92
Keyword	Inductive coupling, Wireless interface
Abstract	A 480Mb/s wireless real-time bus trace system with a pulse-based inductive coupling channel array was developed using a 0.25µm CMOS digital process. The size and pitch of the inductor array are determined by numerical calculation to optimize the tradeoff between the channel coupling, crosstalk, and alignment tolerance. A low-power quasi-synchronous system is proposed to obtain a sufficient timing margin for RX pulse detection in the presence of clock skew.

1D-2 (Time: 10:20 - 10:25)

Title	CKVdd: A Self-Stabilization Ramp-Vdd Technique for Dynamic Power Reduction
Author	Chin-Hsien Wang, *Ching-Hwa Cheng (Feng Chia University, Taiwan), Jiun-In Guo (National Chung Cheng University, Taiwan)
Page	pp. 93 - 94
Keyword	Low power
Abstract	We propose a self-stabilized ramp voltage technique, CKVdd, to reduce power dissipation in conventional CMOS circuits. Normal CMOS circuits show a power increase proportional to clock frequency. CKVdd results in a lower-than-usual power increase. This technique is easily implemented in CMOS circuits. The CKVdd technique possesses several characteristics that differ from those of current circuits using a Vdd power source. First, CKVdd circuits have lower average and peak current consumption, so the technique can serve as a low-power design technique for generic digital circuits. Second, the CKVdd technique combines the power source and clock signal, and can easily implement a power management mechanism. Compared to constant Vdd for multimedia decoders, the proposed technique has 45% of the usual power dissipation and 88% of the usual peak current reduction at the cost of a small delay penalty.

1D-3 (Time: 10:25 - 10:30)

Title	A 300 nW, 7 ppm/℃ CMOS Voltage Reference Circuit based on Subthreshold MOSFETs
Author	*Ken Ueno (Hokkaido University, Japan), Tetsuya Hirose (Kobe University, Japan), Tetsuya Asai, Yoshihito Amemiya (Hokkaido University, Japan)
Page	pp. 95 - 96
Keyword	Voltage reference, subthreshold, Ultra-low power, process variation
Abstract	An ultra-low power CMOS voltage reference circuit has been fabricated in a 0.35-um standard CMOS process. The circuit generates a reference voltage based on threshold voltage of a MOSFET at absolute zero temperature. Theoretical analyses and experimental results showed that the circuit generates a quite stable reference voltage of 745 mV on average. The temperature coefficient and line sensitivity of the circuit were 7 ppm/degC and 20 ppm/V, respectively. The power supply rejection ratio (PSRR) was -45 dB at 100 Hz. The circuit consists of subthreshold MOSFETs with a low-power dissipation of 0.3 uW or less and a 1.5-V power supply. Because the circuit generates a reference voltage based on threshold voltage of a MOSFET in an LSI chip, it can be used as an on-chip process monitoring circuit and as a part of the on-chip process compensation circuit systems.

1D-4 (Time: 10:30 - 10:35)

Title	A 100Mbps, 0.19mW Asynchronous Threshold Detector with DC Power-Free Pulse Discrimination for Impulse UWB Receiver
Author	*Lechang Liu, Yoshio Miyamoto, Zhiwei Zhou, Kosuke Sakaida, Jisun Ryu, Koichi Ishida, Makoto Takamiya, Takayasu Sakurai (The University of Tokyo, Japan)
Page	pp. 97 - 98
Keyword	Ultra-wideband (UWB), UWB receiver, Threshold detector, Pulse discriminator
Abstract	An asynchronous threshold detector for DC-960MHz band impulse ultra-wideband (UWB) receiver is proposed in this paper. It features a DC power-free pulse discriminator. The proposed architecture in 90nm CMOS achieves the lowest power consumption of 0.19mW and energy consumption of 1.9pJ/bit at 100Mbps in the UWB receiver.

1D-5 (Time: 10:35 - 10:40)

Title	Low-Power CMOS Transceiver Circuits for 60GHz Band Millimeter-wave Impulse Radio
Author	*Ahmet Oncu, Minoru Fujishima (The University of Tokyo, Japan)
Page	pp. 99 - 100
Keyword	Low-power, CMOS, 60GHz, impulse, radio
Abstract	In this paper we present an 8Gbps CMOS amplitude-shift-keying (ASK) modulator for the transmitter and a 19.2mW 2Gbps CMOS pulse receiver circuit for high-speed and low-power 60GHz millimeter-wave impulse radio. High-speed ASK modulation is obtained without using DC power by turning the shunt-connected short-channel NMOSFET switches on and off. The isolation is maximized using quarter-wavelength on-chip transmission lines. The isolation data-rate product of this work is 3.7 times higher than that of recently reported millimeter-wave ASK modulators. The proposed 60GHz pulse receiver circuit requires low power for high-speed data since it detects the envelope of the received pulses using a nonlinear detecting amplifier, and only the limiting amplifier processes the high-speed data. This receiver requires the lowest DC power among recently reported millimeter-wave receivers.

1D-6 (Time: 10:40 - 10:45)

Title	An Inductor-less MPPT Design for Light Energy Harvesting Systems
Author	Hui Shao, *Chi-Ying Tsui, Wing-Hung Ki (The Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 101 - 102
Keyword	solar cell, power management, MPPT, energy harvesting
Abstract	An inductor-less maximum power point tracker was designed for light energy harvesting systems. We target systems that operate under different lighting environments, in which the solar cell voltage may sometimes be low. A charge pump is used to convert the voltage to a higher value. At the same time, the control circuit tunes the charge pump switching frequency to track the system maximum output power point. The design was fabricated and measured to verify the system operation.

1D-7 (Time: 10:45 - 10:50)

Title	A 1 GHz CMOS Comparator with Dynamic Offset Control Technique
Author	*Xiaolei Zhu (Keio University, Japan), Sanroku Tsukamoto (Fujitsu Laboratories Limited, Japan), Tadahiro Kuroda (Keio University, Japan)
Page	pp. 103 - 104
Keyword	Offset cancel, Comparator, A/D converter
Abstract	A dynamic offset control technique that employs charge compensation by timing control is proposed for comparator design in scaled CMOS technology. The analysis has been verified by fabricating a 65 nm CMOS 1.2 V 1 GHz comparator that occupies 25 × 65 µm² and consumes 380 µW. The offset control circuits account for 21% of the area and 12% of the power consumption of the whole comparator chip.

1D-8 (Time: 10:50 - 10:55)

Title	Circuit Design Using Stripe-Shaped PMELA TFTs on Glass
Author	*Keita Ikai, Jinmyoung Kim, Makoto Ikeda, Kunihiro Asada (University of Tokyo, Japan)
Page	pp. 105 - 106
Keyword	TFT, PMELA, Design environment, Glass
Abstract	A design environment for stripe-shaped PMELA TFTs on glass has been developed and successfully tested. A cell library including standard cells, a logic synthesis database, place-and-route rules, layout parasitic extraction rules, and transistor models were developed. Measurement results show that the digital circuits designed in this environment work correctly. They also show that the simulation environment is accurate enough for simulating digital circuits.

1D-9 (Time: 10:55 - 11:00)

Title	Low Energy Level Converter Design for Sub-V_th Logics
Author	Hui Shao, *Chi-Ying Tsui (The Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 107 - 108
Keyword	low energy, sub-Vth logic, level converter
Abstract	A low energy consumption level converter (LC) is presented for logic voltage conversion from sub-Vth voltage to nominal high voltage. By employing a multi-stage architecture and implementing a unique circuit inside each stage, the proposed LC can reduce its energy consumption by almost 3 orders of magnitude and at the same time ensure the robustness of its function. The LC was fabricated and measured to verify its operation and performance improvement.
Slides

1D-10 (Time: 11:00 - 11:05)

Title	A Time-to-Digital Converter with Small Circuitry
Author	Kazuya Shimizu, *Masato Kaneta, HaiJun Lin, Haruo Kobayashi, Nobukazu Takai (Gunma University, Japan), Masao Hotta (Musashi Institute of Technology, Japan)
Page	pp. 109 - 110
Keyword	Time-to-Digital Converter, Time Domain Analog Circuit, nano CMOS, Digital Assist Analog Technology, Time Measurement
Abstract	This paper describes a Time-to-Digital Converter (TDC) architecture with small CMOS circuitry as well as fine time resolution and better linearity compared to a conventional vernier delay line TDC. The TDC measures the time interval between two signals and is used in an all-digital PLL and a time-domain ADC. In the proposed TDC, the number of delay buffers is half that of the conventional TDC, which leads to small chip area and low power. The nonlinearity due to delay mismatch among buffers is also reduced, which we have demonstrated by MATLAB simulation. We have also designed and laid out its circuitry using the TSMC 0.18 um CMOS process, and chip measurements show that its principal functions work as expected.
Slides
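
The conventional vernier delay line that the 1D-10 abstract uses as its baseline can be sketched as a small behavioural model. This is an idealised illustration only, not the proposed half-buffer architecture: the function name, the 110 ps and 100 ps buffer delays, and the 64-stage length are hypothetical values chosen for the example. The START edge travels through slow buffers, the later STOP edge through fast buffers, and the count is the number of stages at which STOP still lags START, so the resolution equals the delay difference.

```python
def vernier_tdc(interval_ps, t_slow_ps=110, t_fast_ps=100, stages=64):
    """Count the stages at which the STOP edge still lags the START edge.

    Idealised model: no jitter and no buffer mismatch. All default values
    are hypothetical. Resolution is t_slow_ps - t_fast_ps.
    """
    count = 0
    for k in range(1, stages + 1):
        start_arrival = k * t_slow_ps                # START launched at t = 0
        stop_arrival = interval_ps + k * t_fast_ps   # STOP launched interval_ps later
        if stop_arrival > start_arrival:
            count += 1       # arbiter at stage k sees START first
        else:
            break            # STOP has caught up; later stages read the same
    return count


if __name__ == "__main__":
    code = vernier_tdc(35)
    print(code, "stages, i.e. an interval between", code * 10, "and", (code + 1) * 10, "ps")
```

With a 10 ps delay difference, each stage of the line resolves one 10 ps step of the input interval, which is why the buffer count grows with the measurement range.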

1D-11 (Time: 11:05 - 11:10)

Title	A V_DD Independent Temperature Sensor Circuit with Scaled CMOS Process
Author	*Hiroki Oshiyama, Toshihiro Matsuda, Kei-ichi Suzuki, Hideyuki Iwata (Toyama Prefectural University, Japan), Takashi Ohzone (Dawn Enterprise Co. Ltd., Japan)
Page	pp. 111 - 112
Keyword	CMOS, temperature sensor, voltage reference
Abstract	A supply voltage (VDD) independent temperature sensor circuit implemented in a standard 90 nm CMOS process achieves predicted errors of about -1.0 to +2.0 °C (-0.6 to +0 °C) over the temperature range of -20 to +100 °C (+20 to +80 °C) for two-point calibration lines. The sensor tolerates a change of VDD from 2.5 to 1.5 V well, which corresponds to a measurement error of 0.9 °C.
Slides
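
The two-point calibration line mentioned in the 1D-11 abstract can be illustrated with a generic linear fit through two known (reading, temperature) pairs. This is a minimal sketch: the function names and the raw readings used in the usage example are hypothetical and are not taken from the paper.

```python
def two_point_calibration(reading_lo, temp_lo, reading_hi, temp_hi):
    """Return a function mapping a raw sensor reading to temperature (deg C).

    The mapping is the straight line through the two calibration points.
    """
    if reading_hi == reading_lo:
        raise ValueError("calibration readings must differ")
    slope = (temp_hi - temp_lo) / (reading_hi - reading_lo)

    def to_temperature(reading):
        return temp_lo + slope * (reading - reading_lo)

    return to_temperature


def calibration_error(to_temperature, reading, true_temp):
    """Deviation of the calibrated estimate from the true temperature."""
    return to_temperature(reading) - true_temp


if __name__ == "__main__":
    # Hypothetical raw readings (arbitrary units) at -20 and +100 deg C.
    convert = two_point_calibration(400.0, -20.0, 700.0, 100.0)
    print(convert(550.0))
    print(calibration_error(convert, 552.0, 40.0))
```

The error figures quoted in such abstracts are deviations of the real sensor output from this line at temperatures between the two calibration points.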

1D-12 (Time: 11:10 - 11:15)

Title	A Current-mode DC-DC Converter using a Quadratic Slope Compensation Scheme
Author	*Chihiro Kawabata, Yasuhiro Sugimoto (Chuo University, Japan)
Page	pp. 113 - 114
Keyword	DC-DC, converter, quadratic, slope, compensation
Abstract	A quadratic slope compensation scheme for a current-mode DC-DC converter to obtain stable frequency characteristics without depending on the input and output voltages is proposed. A 5 MHz and 500 mA operational buck DC-DC converter with input voltages ranging from 3.3 V to 2.5 V and with output voltages ranging from 2.5 V to 0.5 V was designed and fabricated by using a 0.35 um CMOS process to verify the effectiveness of the scheme. Little variation of frequency characteristics at frequencies above 200 kHz for the various input and output voltages was observed.
Slides

1D-13 (Time: 11:15 - 11:20)

Title	Ultra Low-Power ANSI S1.11 Filter Bank for Digital Hearing Aids
Author	*Yu-Ting Kuo, Tay-Jyi Lin, Yueh-Tai Li (National Chiao Tung University, Taiwan), Chou-Kun Lin (ITRI, STC, Taiwan), Chih-Wei Liu (National Chiao Tung University, Taiwan)
Page	pp. 115 - 116
Keyword	hearing aid, filter bank, low power
Abstract	This paper presents an ANSI S1.11-compliant filter bank for digital hearing aids, of which the power consumption is minimized through algorithmic, numerical and architectural optimizations. This filter bank has been implemented and fabricated using the TSMC 0.13 µm CMOS technology. The transistor-level simulations show that the power dissipation is only 79 µW for 24 kHz, 18-band audio processing.
Slides

1D-14 (Time: 11:20 - 11:25)

Title	An 11,424 Gate-Count Dynamic Optically Reconfigurable Gate Array with a Photodiode Memory Architecture
Author	Daisaku Seto, *Minoru Watanabe (Shizuoka University, Japan)
Page	pp. 117 - 118
Keyword	ORGAs, FPGAs, optical configuration, multi-context devices
Abstract	The world's largest 11,424 gate-count dynamic optically reconfigurable gate array VLSI chip, which is based on the use of junction capacitance of photodiodes as configuration memory, has been fabricated. The size and process of the VLSI chip are, respectively, 96.04 mm² and a 0.35 µm 3-metal CMOS process technology. To clarify the availability of the VLSI, this paper shows an experimental result of
Slides

1D-15 (Time: 11:25 - 11:30)

Title	A Low-Power FPGA Based on Autonomous Fine-Grain Power-Gating
Author	*Shota Ishihara, Masanori Hariyama, Michitaka Kameyama (Tohoku University, Japan)
Page	pp. 119 - 120
Keyword	FPGA, asynchronous architecture, power-gating, LEDR encoding, bit-serial architecture
Abstract	This is the first implementation of an FPGA based on autonomous fine-grain power-gating. To cut the power consumption of the clock network and detect the activity of each cell efficiently, an asynchronous architecture is fully exploited. The proposed FPGA is fabricated in a 90 nm CMOS process with dual threshold voltages. It is more power-efficient than the synchronous FPGA at less than 30% utilization.
Slides

1D-16 (Time: 11:30 - 11:35)

Title	A 52-mW 8.29mm² 19-mode LDPC Decoder Chip for Mobile WiMAX Applications
Author	*Xin-Yu Shih, Cheng-Zhou Zhan, Cheng-Hung Lin, An-Yeu (Andy) Wu (National Taiwan University, Taiwan)
Page	pp. 121 - 122
Keyword	LDPC, Mobile WiMAX, Multi-mode
Abstract	This paper presents an LDPC decoder chip supporting all 19 modes in Mobile WiMAX applications. An efficient IC design strategy is proposed to reduce decoding latency by 31.25% and raise the hardware utilization ratio from 50% to 75%. In addition, we propose a new early termination scheme that can dynamically adjust the iteration number. The multi-mode chip, implemented in an 8.29 mm² die area, was measured at a maximum of 83.3 MHz with only 52 mW power consumption.
Slides

1D-17 (Time: 11:35 - 11:40)

Title	A Full-Synthesizable High-Precision Built-In Delay Time Measurement Circuit
Author	Ming-Chien Tsai, *Ching-Hwa Cheng (Feng Chia University, Taiwan)
Page	pp. 123 - 124
Keyword	Built-in Delay Test, delay fault diagnosis, Vernier Delay Line
Abstract	Delay testing has become a major issue for manufacturing advanced Systems on a Chip. Automatic Test Equipment and scan techniques are usually applied in delay testing. However, the circuits under test (CUT) have many circuit paths and dependent input patterns; it is hard to measure delay times accurately, especially when debugging small delay defects. We propose a Built-In Delay Measurement (BIDM) circuit that is modified from Vernier Delay Lines. All digitally designed BIDMs with small area overhead can be easily embedded within testing circuits. BIDMs can be used to record the data propagation delay times within circuit path segments, for delay testing, diagnosis, and calibration requirements internal to the chip. Our BIDM was implemented in a 32-bit error correction circuit on a chip using TSMC 0.18 µm technology. Instrument measurements show that the BIDM chip correctly reported the CUT segment path delay times. The chip measurement results were a 95.83% match to the post-layout SPICE simulation values. This BIDM makes it possible to debug small delay defects in chips.
Slides

1D-18 (Time: 11:40 - 11:45)

Title	A Dynamic Quality-Scalable H.264 Video Encoder Chip
Author	*Hsiu-Cheng Chang, Yao-Chang Yang, Jia-Wei Chen (National Chung Cheng University, Taiwan), Ching-Lung Su (National Yunlin University of Science and Technology, Taiwan), Cheng-An Chien, Jiun-In Guo, Jinn-Shyan Wang (National Chung Cheng University, Taiwan)
Page	pp. 125 - 126
Keyword	Quality-Scalable, H.264, Encoder, real-time
Abstract	This paper proposes a dynamic quality-scalable H.264 video encoder that comprises 470Kgates and 13.3Kbytes SRAM using 1P8M 0.13um CMOS technology. Exploiting parameterized algorithms for motion estimation and intra prediction, the proposed design can dynamically configure the encoding modes with the design trade-off between power consumption and video quality for various video encoding applications. It achieves real-time H.264 video encoding on CIF, D1, and HD720@30fps with 7mW-25mW, 27mW-162mW, and 122mW-183mW power dissipation in different quality modes.
Slides

1D-19 (Time: 11:45 - 11:50)

Title	A High Performance LDPC Decoder for IEEE802.11n Standard
Author	*Wen Ji, Yuta Abe, Takeshi Ikenaga, Satoshi Goto (Waseda University, Japan)
Page	pp. 127 - 128
Keyword	LDPC, message passing algorithm, partially-parallel LDPC decoder
Abstract	In this paper, we propose a partially-parallel irregular LDPC decoder for the IEEE 802.11n standard. The design is based on a novel sum-delta message passing schedule to achieve high throughput at low area cost. We further improve the design with a pipeline structure and parallel computation. The synthesis result in TSMC 0.18 µm CMOS technology demonstrates that for the (648,324) irregular LDPC code, our decoder achieves a 7.5X improvement in throughput, which reaches 402 Mbps at a frequency of 200 MHz, with 11% area reduction.
Slides

1D-20 (Time: 11:50 - 11:55)

Title	Design and Chip Implementation of the Ubiquitous Processor HCgorilla
Author	*Masa-aki Fukase, Kazunori Noda, Atsuko Yokoyama, Tomoaki Sato (Hirosaki University, Japan)
Page	pp. 129 - 130
Keyword	Processor, Wave-pipeline, Ubiquitous
Abstract	HCgorilla is a hardware cryptography-embedded multimedia mobile processor that follows the parallelism of multicore and multiple pipelines dedicated to ubiquitous computing. The multiple pipelines are composed of media and cipher pipes. Each pipe is partly wave-pipelined to achieve power-conscious high performance. Media pipes have user-friendly functions due to Java compatibility. Random number addressing by cipher pipes is suited to cryptographic streaming. This paper describes the design and implementation of HCgorilla chips by using CMOS standard cell libraries.
Slides

1D-21 (Time: 11:55 - 12:00)

Title	An 8.69 Mvertices/s 278 Mpixels/s Tile-based 3D Graphics SoC HW/SW Development for Consumer Electronics
Author	*Liang-Bi Chen, Ruei-Ting Gu, Wei-Sheng Huang, Chien-Chou Wang, Wen-Chi Shiue, Tsung-Yu Ho, Yun-Nan Chang, Shen-Fu Hsiao, Chung-Nan Lee, Ing-Jer Huang (Department of Computer Science and Engineering, National Sun Yat-Sen University, Taiwan)
Page	pp. 131 - 132
Keyword	3D Graphics, SoC, Performance Tuning, Consumer Electronics, Tile-based
Abstract	This paper presents an 8.69 Mvertices/s, 278 Mpixels/s, 15.7 mm² tile-based 3D graphics SoC HW/SW supporting OpenGL ES 1.0 running at 139 MHz. The SoC also includes embedded circuitry to monitor run-time characteristics, detect bus protocol errors and inefficiency, and capture bus traces at various abstraction levels with a compression ratio of up to 98%.
Slides

1D-22 (Time: 12:00 - 12:05)

Title	A Multi-Task-Oriented Security Processing Architecture with Powerful Extensibility
Author	*Dan Cao, Jun Han, Xiao-yang Zeng, Shi-ting Lu (Fudan University, China)
Page	pp. 133 - 134
Keyword	security processing, multi-core, SoC
Abstract	A multi-task-oriented security processing architecture is presented in this paper. This architecture contains a host microprocessor and multiple security processors (SPs). Each SP can integrate dedicated crypto-engines, which provide functional extensibility. Performance scalability and multi-task parallelism can be enhanced by increasing the number of SPs on the system bus. It is demonstrated that this architecture greatly improves system efficiency. A test chip is implemented based on SMIC 0.18 um standard CMOS technology, and its functionality is well verified.
Slides

1D-23 (Time: 12:05 - 12:10)

Title	A Delay-Optimized Universal FPGA Routing Architecture
Author	*Fang Wu, Huowen Zhang, Lei Duan, Jinmei Lai, Yuan Wang, Jiarong Tong (Fudan University, China)
Page	pp. 135 - 136
Keyword	Routing, Delay, GRB
Abstract	A universal FPGA routing architecture is presented, which ensures that every module in the FPGA, including CLBs and IOBs, has a uniform interconnect architecture and that the load of lines is equally distributed. As a result, this architecture is highly repeatable and the signal delay is predictable and regular. Furthermore, the realization of the Programmable Interconnect Point (PIP) and the BUFFER driver is also optimized, improving the signal delay by up to 5%. The test results of the example chip show the reasonableness of these ideas.
Slides

Session 2A MPSoC and IP Integration
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 411+412
Chairs: Nozomu Togawa (Waseda University, Japan), Marcello Lajolo (NEC Laboratories America, United States)

2A-1 (Time: 13:30 - 13:55)

Title	Timing Variation-Aware Task Scheduling and Binding for MPSoC
Author	*HaNeul Chon, Taewhan Kim (Seoul National University, Republic of Korea)
Page	pp. 137 - 142
Keyword	Timing variation, task scheduling, binding
Abstract	This work addresses the new problem of timing variation-aware task scheduling and binding (TSB) for multiprocessor system-on-chip (MPSoC) architecture in system-level design, where tasks have full flexibility of resource (i.e., processor) sharing to meet the design constraints. With the timing variation of processors' clock speed, it has been observed that considering the effects of resource sharing on the resulting performance yield computation is critically important for accurate design space exploration and evaluation in system-level design. Unfortunately, previous statistical static timing analysis (SSTA) at the system level has never considered resource sharing in computing the performance yield, or has oversimplified it by employing gate-level SSTAs. In this work, we overcome those limitations by proposing an effective SSTA technique called TSB-SSTA, which schedules and binds tasks to resources in the presence of resource sharing. We also propose a timing variation-aware (TV) framework, called TSB-TV, tightly integrating TSB-SSTA. We have tested the effectiveness of our approach through experimentation with benchmarks, which showed an average of 56.1% improvement in performance yield over conventional methods.

2A-2 (Time: 13:55 - 14:20)

Title	Flexible and Abstract Communication and Interconnect Modeling for MPSoC
Author	*Katalin Popovici (TIMA Laboratory, France), Ahmed Jerraya (CEA-LETI, Minatec, France)
Page	pp. 143 - 148
Keyword	communication, exploration, modeling, NoC, H.264
Abstract	Current multiprocessor systems on chip (MPSoC) architectures integrate a massive number of IPs that need to exchange data in complex and diverse synchronization ways. The key challenge when designing MPSoC is that the communication architecture needs to be decided at the beginning of the design, before all the details about mapping the application on the architecture are known. These early decisions cause two difficulties: how to select the best communication architecture and how to estimate the effect of mapping the application onto the communication resources. In this paper, we propose high level communication models that allow early accurate performance estimation of both communication architecture and communication mapping. We applied the proposed modeling methods to analyze the impact on performance in case of two network topologies and several communication mapping schemes for the H.264 Encoder application.
Slides

2A-3 (Time: 14:20 - 14:45)

Title	Partial Order Method for Timed Simulation of System-Level MPSoC Designs
Author	*Eric Cheung, Harry Hsieh (University of California, Riverside, United States), Felice Balarin (Cadence Design Systems, United States)
Page	pp. 149 - 154
Keyword	Partial Order Simulation, SystemC, MPSoC
Abstract	Current discrete event simulators require heavy simulation overhead to switch between different components in order to simulate them in strictly chronological order. Therefore, timed simulation is significantly slower than un-timed simulation. By simply adding delays in the components and communication channels, our timed MPEG-2 decoder simulates more than 14 times slower than an un-timed simulation. In this paper, we propose a partial order method to speed up timed simulation by relaxing the order in which the components are simulated. With the partial order method, a component is not required to schedule a channel access if both the behavioral and timing results of the access are known. The simulation switches less frequently, hence the simulation overhead is reduced. We show that the partial order method can be used in complex system-level simulation such as MPSoC implementations of the MPEG-2 decoder. In our experiments, the partial order method provides more than 10 times speedup over regular discrete event simulation for timed simulation.

2A-4 (Time: 14:45 - 15:10)

Title	A UML-Based Approach for Heterogeneous IP Integration
Author	*Zhenxin Sun, Weng-Fai Wong (National University of Singapore, Singapore)
Page	pp. 155 - 160
Keyword	System level design, UML
Abstract	With the increasing availability of predefined IP (Intellectual Property) blocks and inexpensive microprocessors, embedded system designers are faced with more design choices than ever. On the other hand, there is constant pressure to reduce the time to market. However, as the IP blocks are provided by different vendors, they differ in their interfaces. In order to improve design reuse, methods for combining heterogeneous IP blocks with incompatible protocols and I/Os are needed. In this paper, we propose an interface synthesis method that uses the UML notation to model the interfaces of predefined components and glue logic within the standard OCP-compliant environment. We built a code generator to produce the interface adapters from the UML models. We experimented with our approach using simple-bus and an MPEG-2 decoder as case studies.
Slides

Session 2B Power Analysis and Optimization
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 413
Chair: Masanori Hashimoto (Dept. ISE, Osaka University, Japan)

2B-1 (Time: 13:30 - 13:55)

Title	Statistical Modeling and Analysis of Chip-Level Leakage Power by Spectral Stochastic Method
Author	Ruijing Shen, Ning Mi, *Sheldon Tan (University of California at Riverside, United States), Yici Cai, Xianlong Hong (Tsinghua University, China)
Page	pp. 161 - 166
Keyword	Leakage analysis, orthogonal polynomials, variational analysis
Abstract	In this paper, we present a novel statistical full-chip leakage power analysis method. The new method can provide a general framework to derive the full-chip leakage current or power in a closed form in terms of the variational parameters, such as the channel length, the gate oxide thickness, etc. It can accommodate various spatial correlations. The new method first employs orthogonal polynomials to represent the variational gate leakages in a closed form, which is generated by a fast multi-dimensional Gaussian quadrature method. The total leakage currents are then computed by simply summing up the resulting orthogonal polynomials (their coefficients). Unlike many existing approaches, no grid-based partitioning and approximation are required. Instead, the spatial correlations are naturally handled by orthogonal decompositions. The proposed method is very efficient and it becomes linear when there exist strong spatial correlations. Experimental results show that the proposed method is about 10X faster than the recently proposed method [Chang:DAC'05] with consistently better accuracy.

2B-2 (Time: 13:55 - 14:20)

Title	On the Futility of Statistical Power Optimization
Author	Jason Cong, Puneet Gupta, *John Lee (University of California, Los Angeles, United States)
Page	pp. 167 - 172
Keyword	gate sizing, optimization, statistical power
Abstract	In response to the increasing variations in integrated-circuit manufacturing, the current trend is to create designs that take these variations into account statistically. In this paper we try to quantify the difference between the statistical and deterministic optima of leakage power while making no assumptions about the delay model. We develop a framework for deriving a theoretical upper-bound on the suboptimality that is incurred by using the deterministic optimum as an approximation for the statistical optimum. On average, the bound is 2.4% for a suite of benchmark circuits in a 45nm technology. We further give an intuitive explanation and show, by using solution rank orders, that the practical suboptimality gap is much lower. Therefore, the need for statistical power modeling for the purpose of optimization is questionable.
Slides

2B-3 (Time: 14:20 - 14:45)

Title	Timing Driven Power Gating in High-Level Synthesis
Author	Shih-Hsu Huang, *Chun-Hua Cheng (Chung Yuan Christian University, Taiwan)
Page	pp. 173 - 178
Keyword	Clock Skew Scheduling, High-Level Synthesis, Low Power Design, Resource Binding, Standby Leakage Minimization
Abstract	The power gating technique is useful in reducing standby leakage current, but it increases the gate delay. For a functional unit, its maximum allowable delay (for a target clock period) limits the smallest standby leakage current its power gating can achieve. In this paper, we point out: in the high-level synthesis of a non-zero clock skew circuit, the resource binding (including functional units and registers) has a large impact on the maximum allowable delays of functional units; as a result, different resource binding solutions have different standby leakage currents. Based on that observation, we present the first work to draw up the timing driven power gating in high-level synthesis. Given a target clock period and design constraints, our goal is to derive the minimum-standby-leakage-current resource binding solution. Benchmark data show: compared with the existing design flow, our approach can greatly reduce the standby leakage current without any overhead.
Slides

2B-4 (Time: 14:45 - 15:10)

Title	Congestion-Aware Power Grid Optimization for 3D Circuits Using MIM and CMOS Decoupling Capacitors
Author	Pingqiang Zhou, Karthikk Sridharan, *Sachin S. Sapatnekar (ECE Dept, University of Minnesota, United States)
Page	pp. 179 - 184
Keyword	3D circuit, power grid, MIM decap, leakage power, congestion
Abstract	In three-dimensional (3D) chips, the amount of supply current per package pin is significantly more than in two-dimensional (2D) designs. Therefore, the power supply noise problem, already a major issue in 2D, is even more severe in 3D. CMOS decoupling capacitors (decaps) have been used effectively for controlling power grid noise in the past, but with technology scaling, they have grown increasingly leaky. As an alternative, metal-insulator-metal (MIM) decaps, with high capacitance densities and low leakage current densities, have been proposed. In this paper, we explore the tradeoffs between using MIM decaps and traditional CMOS decaps, and propose a congestion-aware 3D power supply network optimization algorithm to optimize this tradeoff. The algorithm applies a sequence-of-linear-programs based method to find the optimum tradeoff between MIM and CMOS decaps. Experimental results show that power grid noise can be more effectively optimized after the introduction of MIM decaps, with lower leakage power and little increase in the routing congestion, as compared to a solution using CMOS decaps only.
Slides

2B-5 (Time: 15:10 - 15:35)

Title	Incremental and On-demand Random Walk for Iterative Power Distribution Network Analysis
Author	*Yiyu Shi, Wei Yao (Electrical Engineering Dept., University of California, Los Angeles, United States), Jinjun Xiong (IBM Thomas J. Watson Research Center, United States), Lei He (Electrical Engineering Dept., University of California, Los Angeles, United States)
Page	pp. 185 - 190
Keyword	random walk, power grid, simulation, incremental analysis
Abstract	Power distribution networks (PDNs) are designed and analyzed iteratively. Random walk is among the most efficient methods for PDN analysis. We develop in this paper an incremental and on-demand random walk to reduce iterative analysis time. During each iteration, we map the design changes as positive or negative random walks for observed nodes. To update PDN analysis result, we only need to apply these extra positive or negative walks, instead of doing all walks from scratch. We show that different execution orders for these walks do not affect accuracy but do affect the runtime because of the cancellation between positive and negative walks. Considering this cancellation effect, we optimize the walk order by solving a min-energy electromagnetic particles placement problem and, as a result, further reduce the runtime to about 8× compared to the worst order. Experiments show that, compared to random walk from scratch, our algorithm has similar accuracy but reduces the iterative analysis time by up to 18× for on-chip PDN sizing, and by up to 13× for package ball assignment with substrate routing. In addition, our incremental random walk has a linear time complexity with respect to the number of observed nodes and is more suitable for on-demand analysis, compared to random walk from scratch and its big warm-up cost.

Session 2C Logic and Arithmetic Optimization
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 414+415
Chairs: Dale Edwards (Semiconductor Research Corp., United States), Hiroyuki Higuchi (Fujitsu Microelectronics Limited, Japan)

2C-1 (Time: 13:30 - 13:55)

Title	SAT-Controlled Redundancy Addition and Removal --- A Novel Circuit Restructuring Technique
Author	Chi-An Wu, Ting-Hao Lin, Shao-Lun Huang, *Chung-Yang (Ric) Huang (National Taiwan University, Taiwan)
Page	pp. 191 - 196
Keyword	Redundancy Addition and Removal, SAT, Logic Restructuring
Abstract	We propose a novel Boolean Satisfiability (SAT)-controlled redundancy addition and removal (RAR) algorithm to resolve the performance and quality problems of previous RAR approaches. With the introduction of modern SAT techniques, such as efficient Boolean constraint propagation (BCP), conflict-driven learning, and a flexible decision procedure, our RAR engine can identify 10x more alternative wires/gates while achieving a 70% reduction in runtime.

2C-2 (Time: 13:55 - 14:20)

Title	On Improved Scheme for Digital Circuit Rewiring and Application on Further Improving FPGA Technology Mapping
Author	Fu Shing Chim, *Tak Kei Lam, Yu Liang Wu (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 197 - 202
Keyword	Rewiring, Graph-based, FPGA, Technology Mapping, VLSI CAD
Abstract	The digital circuit rewiring technique has been shown to be one of the most powerful logic transformation methods, able to further improve some already excellent results on many EDA problems. In this work, a new hybrid rewiring approach that combines the advantages of both ATPG-based and graph-based rewiring is proposed. Our hybrid approach utilizes structural characteristics and ATPG techniques to quickly identify alternative wires inside circuits. Experimental results suggest that our hybrid engine achieves about 50% of the alternative-wire coverage of an ATPG-based rewiring engine with only 4% of the runtime. For problems that require only a good-enough and very quick solution, this new rewiring technique may serve as a useful alternative.
Slides

2C-3 (Time: 14:20 - 14:45)

Title	Hybrid LZA: A Near Optimal Implementation of the Leading Zero Anticipator
Author	Amit Verma (National Institute of Technology, Rourkela, India), *Ajay K. Verma, Philip Brisk, Paolo Ienne (Ecole Polytechnique Federale de Lausanne, Switzerland)
Page	pp. 203 - 209
Keyword	leading zero anticipator, Error detection, Adder
Abstract	The Leading Zero Anticipator (LZA) is one of the main components used in floating point addition. It tends to be on the critical path, so it has attracted the attention of many researchers in the past. Most LZAs used today can be classified into two categories: exact and inexact. Inexact LZAs are normally preferred due to their shorter critical paths and reduced complexity; however, the inexact LZA requires an additional correction stage. In this paper we present a new LZA architecture that combines ideas taken from prior exact and inexact LZAs. Our new LZA improves the delay of floating point addition by 7-10% compared to state-of-the-art techniques and also reduces hardware area in most cases. We also establish theoretical lower bounds on the delay of an LZA and we show that our LZA is very close to these bounds.

2C-4 (Time: 14:45 - 15:10)

Title	An Optimized Design for Serial-Parallel Finite Field Multiplication over GF(2^m) Based on All-One Polynomials
Author	Pramod Kumar Meher (Nanyang Technological University, Singapore), *Yajun Ha (National University of Singapore, Singapore), Chiou-Yng Lee (Lunghwa University of Science and Technology, Taiwan)
Page	pp. 210 - 215
Keyword	finite field multiplication, VLSI, architecture optimization
Abstract	In this paper, we derive a recursive algorithm for finite field multiplication over GF(2^m) based on irreducible all-one-polynomials (AOP), where the modular reduction of degree is achieved by cyclic left-shift without any logic operations. A regular and localized bit level dependence graph (DG) is derived from the proposed algorithm and mapped into an array architecture, where the modular reduction is achieved by a serial-in parallel-out shift-register. The multiplier is optimized further to perform the accumulation of partial products by the T flip flops of the output register without XOR gates. It is interesting to note that the optimized structure consists of an array of (m+1) AND gates between an array of (m+1) D flip flops and an array of (m+1) T flip flops. The proposed structure therefore involves significantly less area and less computation time compared with the corresponding existing structures.
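
The reduction-by-cyclic-shift property mentioned in the 2C-4 abstract follows from the identity (x + 1)(x^m + ... + x + 1) = x^(m+1) + 1 over GF(2), so x^(m+1) is congruent to 1 modulo the all-one polynomial. The sketch below is a software illustration of that identity only, not the authors' array architecture: the function names are hypothetical, field elements are held as integers whose bit i is the coefficient of x^i, and m = 4 is used in the example because x^4 + x^3 + x^2 + x + 1 is irreducible over GF(2).

```python
def rotl(value, shift, width):
    """Cyclic left shift of a width-bit value (multiplication by x^shift)."""
    shift %= width
    mask = (1 << width) - 1
    return ((value << shift) | (value >> (width - shift))) & mask


def aop_multiply(a, b, m):
    """Multiply a and b in GF(2^m) defined by the all-one polynomial.

    Partial products are accumulated in an (m+1)-bit register in which
    multiplying by x is a cyclic shift, because x^(m+1) = 1 mod AOP.
    """
    width = m + 1
    acc = 0
    for i in range(width):            # bit-serial over the bits of b
        if (b >> i) & 1:
            acc ^= rotl(a, i, width)  # add a * x^i, reduced by rotation
    if (acc >> m) & 1:                # map back to degree < m:
        acc ^= (1 << width) - 1       # x^m = x^(m-1) + ... + x + 1
    return acc


if __name__ == "__main__":
    # x * x^3 = x^4, which reduces to x^3 + x^2 + x + 1 in GF(2^4).
    print(bin(aop_multiply(0b00010, 0b01000, 4)))
```

Only XOR accumulation and rotation appear in the loop, which is the software counterpart of accumulating partial products without a separate reduction step.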

Session 2D Special Session: EDA Acceleration Using New Architectures
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 416+417
Organizer: Damir A. Jamsek (IBM Corp., United States)

2D-1 (Time: 13:35 - 14:15)

Title	(Invited Paper) Aspects of GPU for General Purpose High Performance Computing
Author	*Reiji Suda (The University of Tokyo/JST CREST, Japan), Takayuki Aoki (Tokyo Institute of Technology/JST CREST, Japan), Shoichi Hirasawa (University of Electro-Communications/JST CREST, Japan), Akira Nukada (Tokyo Institute of Technology/JST CREST, Japan), Hiroki Honda (University of Electro-Communications/JST CREST, Japan), Satoshi Matsuoka (Tokyo Institute of Technology/JST CREST/NII, Japan)
Page	pp. 216 - 223
Keyword	GPU computing, performance evaluation, scheduling algorithm, task parallel paradigm
Abstract	We discuss hardware and software aspects of GPGPU, specifically focusing on NVIDIA cards and CUDA, from the viewpoint of parallel computing. The major weak points of GPUs relative to the newest supercomputers are identified and summarized as only four points: large SIMD vector length, small memory, absence of fast L2 cache, and high register spill penalty. On the software side, we derive an optimal scheduling algorithm for latency hiding of host-device data transfer, and discuss SPMD parallelism on GPUs.

2D-2 (Time: 14:15 - 14:55)

Title	(Invited Paper) Designing and Optimizing Compute Kernels on Nvidia GPUs
Author	*Damir A. Jamsek (IBM Research Division, United States)
Page	pp. 224 - 229
Keyword	GPU, NVIDIA
Abstract	The availability of high performance compute capability in NVIDIA GPUs has expanded their use in CAD environments. We will describe the basic compute models, including host/device programming models and device multi-thread programming models, as well as optimization and performance tuning techniques.

2D-3 (Time: 14:55 - 15:35)

Title	(Invited Paper) Parallelizing Fundamental Algorithms such as Sorting on Multi-core Processors for EDA Acceleration
Author	*Masato Edahiro (System IP Core Research Laboratories, NEC Corporation/Department of Computer Science, University of Tokyo, Japan)
Page	pp. 230 - 233
Keyword	multi-core, many-core, parallel algorithm, sorting
Abstract	Fundamental algorithms should be parallelized to accelerate EDA software on multi-core architectures. In this paper, we introduce algorithms that scale on multi-core processors. As an example, a sorting algorithm called Map Sort is presented. This algorithm uses a map from subsets of input data to intervals on the data range. Experimental results show that, in comparison with quick sort on a single CPU, the processing time of Map Sort is comparable on one CPU and three times faster on four CPUs.
Slides

Session 3A System-Level Design of 3D Chips and Configurable Systems
Time: 15:55 - 18:00 Tuesday, January 20, 2009
Location: Room 411+412
Chairs: Eui-Young Chung (Yonsei University, Republic of Korea), Steve Haga (National Sun Yat-Sen University, Taiwan)

3A-1 (Time: 15:55 - 16:20)

Title	System-Level Cost Analysis and Design Exploration for Three-Dimensional Integrated Circuits (3D ICs)
Author	*Xiangyu Dong, Yuan Xie (Pennsylvania State University, United States)
Page	pp. 234 - 241
Keyword	3D Integration, Cost Analysis
Abstract	Three-dimensional integrated circuit (3D IC) is emerging as an attractive option for overcoming the barriers in interconnect scaling. The majority of the existing 3D IC research is focused on how to take advantage of the performance, power, smaller form-factor, and heterogeneous integration benefits offered by 3D integration. However, all such advantages ultimately have to translate into cost savings when a design strategy has to be decided: Is 3D integration a cost-effective way for a particular IC design? Consequently, system-level cost analysis at the early design stage is imperative to help the decision making on whether 3D integration should be adopted. In this paper, we study the design estimation method for 3D ICs at the early design stage, and propose a cost analysis model to study the cost implication for 3D ICs, and address the following cost-related problems in 3D IC design: (1) Do all the benefits of 3D IC design come with a much higher cost? (2) How to do 3D integration in a cost-effective way? (3) Are there any design options to compensate the extra 3D bonding cost? A cost-driven 3D IC design flow is also proposed to guide the design space exploration for 3D ICs toward a cost-effective direction.

3A-2 (Time: 16:20 - 16:45)

Title	Synthesis of Networks on Chips for 3D Systems on Chips
Author	*Srinivasan Murali, Ciprian Seiculescu (Ecole Polytechnique Federale de Lausanne, Switzerland), Luca Benini (University of Bologna, Italy), Giovanni De Micheli (Ecole Polytechnique Federale de Lausanne, Switzerland)
Page	pp. 242 - 247
Keyword	Networks on Chips, 3D, topology, synthesis
Abstract	Three-dimensional stacking of silicon layers is emerging as a promising solution to handle the design complexity and heterogeneity of Systems on Chips (SoCs). Networks on Chips (NoCs) are necessary to efficiently handle the 3D interconnect complexity. Designing power efficient NoCs for 3D SoCs that satisfy the application performance requirements, while satisfying the 3D technology constraints is a big challenge. In this work, we address this problem and present a synthesis approach for designing power-performance efficient 3D NoCs. We present methods to determine the best topology, compute paths and perform placement of the NoC components in each 3D layer. We perform experiments on varied, realistic SoC benchmarks to validate the methods and also perform a comparative study of the resulting 3D NoC designs with 3D optimized mesh topologies. The NoCs designed by our synthesis method results in large interconnect power reduction (average of 38%) and latency reduction (average of 25%) when compared to traditional NoC designs.

3A-3 (Time: 16:45 - 17:10)

Title	An Application-centered Design Flow for Self Reconfigurable Systems Implementation
Author	*Fabio Cancare, Marco Domenico Santambrogio, Donatella Sciuto (Politecnico di Milano, Italy)
Page	pp. 248 - 253
Keyword	Dynamic Reconfiguration, Reconfigurability, FPGA
Abstract	Up to now every proposed methodology for implementing dynamic self reconfigurable systems is architecture-centered. In most cases the system development process is time consuming and requires a very specific technical background. The aim of this work is to provide a fast brain to bit design flow whose goal is to simplify the dynamic reconfigurable system development process by shifting the designer focus from the architecture point of view to the application point of view: designers will not need to possess Dynamic Reconfigurability expertise but just to be skilled with the application domain.
Slides

3A-4 (Time: 17:10 - 17:35)

Title	System-Level Process Variability Compensation on Memory Organizations. On the Scalability of Multi-Mode Memories
Author	*Concepción Sanz, Manuel Prieto, José Ignacio Gómez (Universidad Complutense de Madrid, Spain), Antonis Papanikolaou, Francky Catthoor (Inter-University Microelectronics Center, Belgium)
Page	pp. 254 - 259
Keyword	Process variation, parametric yield, variability compensation
Abstract	Process variation and the dynamism of modern applications can degrade the expected performance of a system. Execution time can be severely affected by both factors, resulting in deadline violations and energy consumption overheads. Memory organizations, which account for a large part of the system-energy and the time budgets, are especially vulnerable to process variation. Configurable – multi-mode – memories are a promising technology to deal with these problems, but they also introduce new issues that need to be solved. Essentially, adding configuration capabilities to the memories comes with a cost, both in memory area and control complexity; hence, we need to evaluate what is the minimum amount of re-configurability to satisfy system’s constraints. In this paper, we analyze the scalability of configurable memories and highlight the relationship among mode allocation, memory mapping and data allocation.
Slides

Session 3B Advances in Timing Analysis and Modeling
Time: 15:55 - 18:00 Tuesday, January 20, 2009
Location: Room 413
Chairs: Shih-Hsu Huang (Chung Yuan Christian University, Taiwan), Atsushi Takahashi (Tokyo Institute of Technology, Japan)

3B-1 (Time: 15:55 - 16:20)

Title	Accelerating Statistical Static Timing Analysis Using Graphics Processing Units
Author	Kanupriya Gulati, *Sunil P. Khatri (Texas A&M University, United States)
Page	pp. 260 - 265
Keyword	Graphics Processing Units, Monte Carlo, Statistical Static Timing Analysis
Abstract	In this paper, we explore the implementation of Monte Carlo based statistical static timing analysis (SSTA) on a Graphics Processing Unit (GPU). SSTA via Monte Carlo simulations is a computationally expensive, but important step required to achieve design timing closure. It provides an accurate estimate of delay variations and their impact on design yield. The large number of threads that can be computed in parallel on a GPU suggests a natural fit for the problem of Monte Carlo based SSTA to the GPU platform. Our implementation performs multiple delay simulations at a single gate in parallel. A parallel implementation of the Mersenne Twister pseudo-random number generator on the GPU, followed by Box-Muller transformations (also implemented on the GPU) is used for generating gate delay numbers from a normal distribution. The mean and standard deviation of the pin-to-output delay distributions for all inputs and for every gate, are obtained using a memory lookup, which benefits from the large memory bandwidth of the GPU. Threads which execute in parallel have no data/control dependencies on each other. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our approach is implemented on an NVIDIA GeForce GTX 8800 GPU card. Our results indicate that our approach can obtain an average speedup of about 260X as compared to a serial CPU implementation. With the recently announced quad 8800 GPU cards, we estimate that our approach would attain a speedup of over 785X. The correctness of the Monte Carlo based SSTA implemented on a GPU has been verified by comparing its results with a CPU based implementation.
Slides
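
A minimal Python sketch of the random-number step described in the 3B-1 abstract: uniform samples from a Mersenne Twister generator are turned into normally distributed gate delays with the Box-Muller transform. This is a serial CPU illustration, not the authors' GPU implementation; the function names, the seed, and the example mean and standard deviation are assumptions made for illustration.

```python
import math
import random


def box_muller(u1, u2):
    """Map two uniform samples (u1 in (0, 1], u2 in [0, 1)) to two
    independent standard normal samples."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


def sample_gate_delays(mean, sigma, n, seed=0):
    """Draw n gate-delay samples from N(mean, sigma^2).

    Hypothetical helper: mean and sigma stand in for the per-gate
    values that the abstract says are obtained by memory lookup.
    """
    rng = random.Random(seed)  # CPython's generator is a Mersenne Twister
    delays = []
    while len(delays) < n:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is finite
        u2 = rng.random()
        z0, z1 = box_muller(u1, u2)
        delays.append(mean + sigma * z0)
        delays.append(mean + sigma * z1)
    return delays[:n]
```

For example, sample_gate_delays(100.0, 5.0, 1000) returns 1000 delay samples whose mean and standard deviation approach 100 and 5 as the sample count grows.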

3B-2 (Time: 16:20 - 16:45)

Title	Trade-off Analysis between Timing Error Rate and Power Dissipation for Adaptive Speed Control with Timing Error Prediction
Author	*Hiroshi Fuketa, Masanori Hashimoto, Yukio Mitsuyama, Takao Onoye (Osaka University, Japan)
Page	pp. 266 - 271
Keyword	adaptive speed control, timing error prediction, canary FF, low power design, subthreshold circuit
Abstract	Timing margin of a chip varies chip by chip due to manufacturing variability, and depends on operating environment and aging. Adaptive speed control with timing error prediction is a promising approach to mitigate the timing margin variation, whereas it inherently has a critical risk of timing error occurrence when a circuit is slowed down. This paper presents how to evaluate the relation between timing error rate and power dissipation in self-adaptive circuits with timing error prediction. The discussion is experimentally validated using a 32-bit ripple carry adder in subthreshold operation in a 90nm CMOS process. We show a trade-off between timing error rate and power dissipation, and reveal the dependency of the trade-off on design parameters.
Slides

3B-3 (Time: 16:45 - 17:10)

Title	Statistical Analysis of On-Chip Power Grid Networks by Variational Extended Truncated Balanced Realization Method
Author	*Duo Li, Sheldon Tan (University of California at Riverside, United States), Gengsheng Chen, Xuan Zeng (Fudan University, China)
Page	pp. 272 - 277
Keyword	Power grid, TBR, Reduction, Interconnect, Variation
Abstract	In this paper, we present a novel statistical analysis approach for large power grid network analysis under process variations. The new algorithm is very efficient and scalable for huge networks with a large number of variational variables. This approach, called varETBR for variational extended truncated balanced realization, is based on model order reduction techniques to reduce the circuit matrices before the variational simulation. It performs the parameterized reduction on the original system using variation-bearing subspaces. varETBR calculates variational response Gramians by Monte-Carlo based numerical integration considering both system and input source variations for generating the projection subspace. varETBR is very scalable for the number of variables and is flexible for different variational distributions and ranges as demonstrated in experimental results. After the reduction, Monte-Carlo based statistical simulation is performed on the reduced system and the statistical responses of the original system are obtained thereafter. Experimental results, on a number of IBM benchmark circuits [15] up to 1.6 million nodes, show that the varETBR can be 4500X faster than the Monte-Carlo method and is much more scalable than one of the recently proposed approaches.

3B-4 (Time: 17:10 - 17:35)

Title	Bound-Based Identification of Timing-Violating Paths Under Variability
Author	*Lin Xie, Azadeh Davoodi (University of Wisconsin at Madison, United States)
Page	pp. 278 - 283
Keyword	variability, statistical timing analysis, timing-violating path, violation probability
Abstract	We introduce a bound-based technique to identify the top M timing-violating paths in a circuit under variability. These are the paths with the highest violation probability (i.e., C_p) which is the probability that a path (i.e., p) violates the timing constraint. To compute C_p, we require the violation probabilities of the nodes (i.e., C_n) and edges (i.e., C_e) on the path. First, we show computing C_n and C_e of all the nodes and edges requires only two rounds of Statistical Static Timing Analysis and then for each node/edge we need one table lookup for probability calculation using a technique known as Pearson Curve. Given C_n and C_e, our major contribution is in computing upper and lower bounds for C_p of an arbitrary path segment. We show constant-time for incremental update of the bounds when extending a path segment to a longer one. These bounds can be used to exactly construct the top violating paths. If the goal is to find the single most-violating path, we show a bound-based formulation that can prune a large portion of circuit without losing optimality. In our simulations, we verify the correctness and accuracy of our bounds for individual paths. We also verify identification of selected paths using Monte Carlo simulation. We obtain near-optimal accuracy with extremely fast runtimes.
Slides

3B-5 (Time: 17:35 - 18:00)

Title	Adaptive Techniques for Overcoming Performance Degradation due to Aging in Digital Circuits
Author	Sanjay Kumar, Chris Kim, *Sachin S. Sapatnekar (University of Minnesota, United States)
Page	pp. 284 - 289
Keyword	Reliability, Adaptive Body Bias, NBTI, Leakage, Delay
Abstract	Negative Bias Temperature Instability (NBTI) in PMOS transistors has become a major reliability concern in present-day digital circuit design. Further, with the recent usage of Hf-based high-k dielectrics for gate leakage reduction, Positive Bias Temperature Instability (PBTI), the dual effect in NMOS transistors has also reached significant levels. Consequently, designers are required to build in substantial guardbands into their designs, leading to large area and power overheads, in order to guarantee reliable operation over the lifetime of a chip. We propose a guard-banding technique based on adaptive body bias (ABB) and adaptive supply voltage (ASV), to recover the performance of an aged circuit, and compare its merits over previous approaches.
Slides

Session 3D Special Session: Hardware Dependent Software for Multi- and Many-Core Embedded Systems
Time: 15:55 - 18:00 Tuesday, January 20, 2009
Location: Room 416+417
Organizers: Rainer Dömer (University of California at Irvine, United States), Andreas Gerstlauer (University of Texas at Austin, United States), Wolfgang Müller (University of Paderborn, Germany)

3D-1 (Time: 15:55 - 16:10)

Title	(Invited Paper) Introduction to Hardware-dependent Software Design
Author	*Rainer Dömer (University of California at Irvine, United States), Andreas Gerstlauer (University of Texas at Austin, United States), Wolfgang Müller (University of Paderborn, Germany)
Page	pp. 290 - 292
Abstract	Due to the rapidly increasing software content in embedded systems, Hardware-dependent Software (HdS) has become a critical topic in system design. In this talk, we will motivate the need for special attention to HdS in research and development and provide a brief introduction to the issues involved in the design of HdS.
Slides

3D-2 (Time: 16:10 - 16:50)

Title	(Invited Paper) Using a Dataflow abstracted Virtual Prototype for HdS-Design
Author	Wolfgang Ecker, Stefan Heinen, *Michael Velten (Infineon Technologies AG, Germany)
Page	pp. 293 - 300
Keyword	Abstraction, VP, TLM, HdS
Abstract	The complexity of Hardware-dependent Software (HdS) continuously grows stronger than chip complexity since more and more tasks are moved to software. Clearly, the pressure on the development of new methodologies for early validation of HdS increases as well. Existing methods must be continuously improved and new methods must be developed. This is exemplified with a state-of-the-art Transaction Level (TL) model used for firmware development of a productive wireless communication chip. By discussing the strengths and shortcomings of TL modeling we derive a set of requirements for a future modeling paradigm, which led to the new data flow abstraction approach presented in this paper. Experiments showed that we gain up to 10x performance improvement.
Slides

3D-3 (Time: 16:50 - 17:20)

Title	(Invited Paper) Needs and Trends in Embedded Software Development for Consumer Electronics
Author	*Yasutaka Tsunakawa (Sony Corporation, Japan)
Page	pp. 301 - 303
Keyword	Embedded software, Consumer electronics, Multi-Core, Many-Core
Abstract	As in other domains, the shift to Many-Core cannot be avoided in consumer electronics. Multi-Core has already become mainstream in system LSIs, and the number of cores per chip will continue to increase. Because required functions keep advancing and the pressure to reduce power consumption keeps growing, the shift to Many-Core will continue. However, from the point of view of embedded software development, many unsolved problems lie like a huge cliff between current Multi-Core and Many-Core. Research organizations seem to focus mainly on establishing Many-Core technology, while tool vendors concentrate on offering solutions for current Multi-Core. As a result, measures for the transition period, which will arrive several years from now, are still insufficient. In this article, I discuss the major problems that block the shift from current Multi-Core to Many-Core, from the viewpoint of consumer electronics.
Slides

3D-4 (Time: 17:20 - 18:00)

Title	(Invited Paper) Hardware-dependent Software Synthesis for Many-Core Embedded Systems
Author	*Samar Abdi, Gunar Schirner, Ines Viskic, Hansu Cho, Yonghyun Hwang, Lochi Yu, Daniel Gajski (Center for Embedded Computer Systems, University of California, Irvine, United States)
Page	pp. 304 - 310
Keyword	Embedded Software, Multicore Design, Software Synthesis
Abstract	This paper presents synthesis of Hardware Dependent Software (HdS) for multicore and many-core designs using Embedded System Environment (ESE). ESE is a tool set, developed at UC Irvine, for transaction level design of multicore embedded systems. HdS synthesis is a key component of ESE backend design flow. We follow a design process that starts with an application model consisting of C processes communicating via abstract message passing channels. The application model is mapped to a platform net-list of SW and HW cores, buses and buffers. A high speed transaction level model (TLM) is generated to validate abstract communication between processes mapped to different cores. The TLM is further refined into a Pin-Cycle Accurate Model (PCAM) for board implementation. The PCAM includes C code for all the HdS layers including routing, packeting, synchronization and bus transfer. The generated HdS methods provide a library of application level services to the C processes on individual SW cores. Therefore, the application developer does not need to write low level HdS for board implementation. Synthesis results for a multi-core MP3 decoder design, using ESE, show that the HdS is generated in the order of seconds, compared to hours of manual coding. The quality of synthesized code is comparable to manually written code in terms of performance and code size.
Slides

Wednesday, January 21, 2009

Session 2K Keynote Session II
Time: 9:00 - 10:00 Wednesday, January 21, 2009
Location: Small Auditorium, 5F
Chair: Kazutoshi Wakabayashi (NEC Corp., Japan)

2K-1 (Time: 9:00 - 10:00)

Title	(Keynote Address) Automated Synthesis and Verification of Embedded Systems: Wishful Thinking or Reality?
Author	Wolfgang Rosenstiel (Wilhelm-Schickard-Institute for Informatics, University of Tuebingen, Germany)
Abstract	More complex embedded hardware/software systems have to be developed with shorter design time and reduced cost. One solution for this problem is increasing design automation starting from higher levels of abstraction. Automatic synthesis and verification has been around in research for quite a while. This talk will show examples for state-of-the-art tools for system-level synthesis and verification of embedded systems and demonstrate their possibilities and limitations by some automotive applications.

Session 4A System Level Architectures
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 411+412
Chairs: Samar Abdi (University of California, Irvine, United States), Jun Yang (University of Pittsburgh, United States)

4A-1 (Time: 10:15 - 10:40)

Title	Computation and Data Transfer Co-Scheduling for Interconnection Bus Minimization
Author	Cathy Qun Xu (University of Texas at Dallas, United States), *Chun Jason Xue, Bessie C Hu (City University of Hong Kong, Hong Kong), Edwin H.M. Sha (University of Texas at Dallas, United States)
Page	pp. 311 - 316
Keyword	Scheduling, Interconnection network, clustered processors, data path synthesis
Abstract	High Instruction-Level-Parallelism in DSP and media applications demands highly clustered architecture. It is a challenge to design an efficient, flexible yet cost saving inter-connection network to satisfy the rapidly increasing inter-cluster data transfer needs. This paper presents a computation and data transfer co-scheduling technique to minimize the number of partially connected interconnection buses required for a given embedded application while minimizing its schedule length. Previous research in this area focused on scheduling computations to minimize the number of inter-cluster data transfers. The proposed co-scheduling technique not only schedules computations to reduce the number of inter-cluster data transfers, but also schedules inter-cluster data transfers to minimize the number of required partially connected buses for the inter-cluster connection network. Experimental results indicate that 52.3% fewer buses are required compared to the current best known technique while achieving the same schedule length minimization.

4A-2 (Time: 10:40 - 11:05)

Title	Prototyping Pipelined Applications on a Heterogeneous FPGA Multiprocessor Virtual Platform
Author	*Antonino Tumeo, Marco Branca, Lorenzo Camerini, Marco Ceriani (Politecnico di Milano, Italy), Matteo Monchiero (HP Labs, United States), Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto (Politecnico di Milano, Italy)
Page	pp. 317 - 322
Keyword	FPGA, Prototyping, Pipelining, Multiprocessor, Multimedia
Abstract	Multiprocessors on a chip are the reality of these days. Semiconductor industry has recognized this approach as the most efficient in order to exploit chip resources, but the success of this paradigm heavily relies on the efficiency and widespread diffusion of parallel software. Among the many techniques to express the parallelism of applications, this paper focuses on pipelining, a technique well suited to data-intensive multimedia applications. We introduce a prototyping platform (FPGA-based) and a methodology for these applications. Our platform consists of a mix of standard and custom heterogeneous cores. We discuss several case studies, analyzing the interaction of the architecture and applications and we show that multimedia and telecommunication applications with unbalanced pipeline stages can be easily deployed. Our framework eases the development cycle and enables the developers to focus directly on the problems posed by the programming model in the direction of the implementation of a production system.
Slides

4A-3 (Time: 11:05 - 11:30)

Title	Variability-Aware Robust Design Space Exploration of Chip Multiprocessor Architectures
Author	*Gianluca Palermo, Cristina Silvano, Vittorio Zaccaria (Politecnico di Milano, DEI, Italy)
Page	pp. 323 - 328
Keyword	Design Space Exploration
Abstract	In the context of a design space exploration framework for supporting the platform-based design approach, we address the problem of robustness with respect to manufacturing process variations. First, we introduce response surface modeling techniques to enable an efficient evaluation of the statistical measures of execution time and energy consumption for each system configuration. We then introduce a robust design space exploration framework to address the problem of the impact of manufacturing process variations onto the system-level metrics and consequently onto the application-level constraints. We finally provide a comparison of our design space exploration technique with conventional approaches.
Slides

4A-4 (Time: 11:30 - 11:55)

Title	Partial Conflict-Relieving Programmable Address Shuffler for Parallel Memories in Multi-Core Processor
Author	*Young-Su Kwon, Bon-Tae Koo, Nak-Woong Eum (Electronics and Telecommunications Research Institute, Republic of Korea)
Page	pp. 329 - 334
Keyword	parallel memory, access conflict, multi-core, memory
Abstract	The advancement of process technology enables the integration of multiple cores featuring parallel processing. The requirement of extensive memory bandwidth puts a major performance bottleneck in multi-core architectures for media applications. While the parallel memory system is a viable solution to account for a large amount of memory transactions required by multiple cores, memory access conflicts caused by simultaneous accesses to an identical memory page by two or more cores limit the performance of multi-core architectures. We propose and evaluate the programmable memory address shuffler associated with the novel memory shuffling algorithm integrated in multi-core architectures with a parallel memory system. The address shuffler efficiently translates the requested memory addresses into the shuffled addresses such that access conflicts diminish by analyzing the access pattern of the application. We demonstrate that the shuffling of sub-pages is represented by a cyclic linked list, which enables partial address shuffling with the minimal number of shuffling table entries. The programmable address shuffler reduces the amount of access conflicts by 83% for pitch-shifting audio decompression.

4A-5 (Time: 11:55 - 12:20)

Title	HitME: Low Power Hit MEmory Buffer for Embedded Systems
Author	Andhi Janapsatya, *Sri Parameswaran, Aleksandar Ignjatovic (University of New South Wales, Australia)
Page	pp. 335 - 340
Keyword	memory, low power, cache, loop cache
Abstract	In this paper, we present a novel HitME (Hit-MEmory) buffer to reduce the energy consumption of memory hierarchy in embedded processors. The HitME buffer is a small direct-mapped cache memory that is added as additional memory into existing cache memory hierarchies. The HitME buffer is loaded only when there is a hit on L1 cache. Otherwise, L1 cache is updated from the memory and the processor's memory request is served directly from the L1 cache. The strategy works due to the fact that 90% of memory accesses are only accessed once, and these often pollute the cache. Energy reduction is achieved by reducing the number of accesses to the L1 cache memory. Experimental results show that the use of HitME buffer will reduce the L1 cache accesses resulting in a reduction in the energy consumption of the memory hierarchy. This decrease in L1 cache accesses reduces the cache system energy consumption by an average of 60.9% when compared to traditional L1 cache memory architecture and an energy reduction of 6.4% when compared to filter cache architecture for 70nm cache technology.
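
The fill policy described in the 4A-5 abstract can be sketched as a small trace-driven model in Python. This is an illustrative sketch, not the authors' simulator: the class and function names, the cache sizes, the direct-mapped L1, and the assumption that the HitME buffer is probed before the L1 cache are all hypothetical.

```python
class DirectMappedCache:
    """Tag store of a direct-mapped cache indexed by line address."""

    def __init__(self, num_lines):
        self.tags = [None] * num_lines

    def lookup(self, line_addr):
        return self.tags[line_addr % len(self.tags)] == line_addr

    def fill(self, line_addr):
        self.tags[line_addr % len(self.tags)] = line_addr


def simulate(trace, hitme_lines=4, l1_lines=64):
    """Count HitME hits and L1 accesses for a trace of line addresses.

    Policy taken from the abstract: the HitME buffer is loaded only
    on an L1 hit; on an L1 miss only the L1 cache is filled.
    Assumption: the HitME buffer is checked first, so a HitME hit
    avoids the L1 access entirely.
    """
    hitme = DirectMappedCache(hitme_lines)
    l1 = DirectMappedCache(l1_lines)
    counts = {"hitme_hits": 0, "l1_accesses": 0, "l1_misses": 0}
    for line_addr in trace:
        if hitme.lookup(line_addr):
            counts["hitme_hits"] += 1
            continue
        counts["l1_accesses"] += 1
        if l1.lookup(line_addr):
            hitme.fill(line_addr)  # reused line: promote into HitME
        else:
            counts["l1_misses"] += 1
            l1.fill(line_addr)     # first touch: fill L1 only
    return counts
```

In this model a line touched only once never enters the HitME buffer, while a line touched repeatedly is served from the buffer from its third access onward, which is the behavior the abstract relies on to cut L1 accesses.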

Session 4B Beyond Traditional Floorplanning and Placement
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 413
Chair: Shigetoshi Nakatake (The University of Kitakyushu, Japan)

4B-1 (Time: 10:15 - 10:40)

Title	Signal Skew Aware Floorplanning and Bumper Signal Assignment Technique for Flip-Chip
Author	*Cheng-Yu Wang, Wai-Kei Mak (Department of Computer Science, National Tsing Hua University, Taiwan)
Page	pp. 341 - 346
Keyword	Flip-chip, floorplanning, Bumper, pad, Assignment
Abstract	Flip-chip is a solution for designs requiring more I/O pins and higher speed. However, the higher speed demand also brings the issue of signal skew. In this paper, we propose a new 3-stage design layout methodology for flip-chip considering signal skew. Firstly, we produce an initial bumper signal assignment, and then solve the flip-chip floorplanning problem using a partitioning-based technique to spread the modules across the flip-chip as the distribution of its bumpers. With an anchoring and relocation strategy, we can effectively place I/O buffers at desirable locations. Finally, we further reduce signal skew and monotonic routing density by refining the bumper signal assignment. Experimental results show that the signal skew of traditional floorplanners ranges from 4% to 280% higher than ours, and the total wirelength of other floorplanners is as much as 100% higher than ours. Moreover, our signal refinement method can further decrease monotonic routing density by up to 8% and signal skew by up to 11%.

4B-2 (Time: 10:40 - 11:05)

Title	A Novel Thermal Optimization Flow Using Incremental Floorplanning for 3D ICs
Author	Xin Li, *Yuchun Ma, Xianlong Hong (Tsinghua University, China)
Page	pp. 347 - 352
Keyword	3D ICs, incremental floorplanning, thermal
Abstract	Thermal issue is a critical challenge in 3D IC design. To eliminate hotspots, physical layouts are always adjusted by shifting or duplicating hot blocks. However, these modifications may degrade the packing area as well as interconnect distribution greatly. In this paper, we propose some novel thermal-aware incremental changes to optimize these multiple objectives including thermal issue in 3D ICs. Furthermore, to avoid random incremental modification, which may be inefficient and need long runtime to converge, here potential gain is modeled for each candidate incremental change. Based on the potential gain, a novel thermal optimization flow to intelligently choose the best incremental operation is presented. We distinguish the thermal-aware incremental changes in three different categories: migrating computation, growing unit and moving hotspot. Mixed integer linear programming (MILP) models are devised according to these different incremental changes. Experimental results show that migrating computation, growing unit and moving hotspot can reduce max on-chip temperature by 7%, 13% and 15% respectively on MCNC/GSRC benchmarks. Still, experimental results also show that the thermal optimization flow can reduce max on-chip temperature by 14% compared to an existing 3D floorplan tool CBA, and achieve better area and total wirelength improvement than individual operations do.

4B-3 (Time: 11:05 - 11:30)

Title	Analog Placement with Common Centroid and 1-D Symmetry Constraints
Author	*Linfu Xiao, Evangeline Young (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 353 - 360
Keyword	analog placement, common centroid, symmetry
Abstract	In this paper, we will present a placement method for analog circuits. We consider both common centroid and 1-D symmetry constraints, which are the two most common types of placement requirements in analog designs. The approach is based on a symmetric feasible condition on the sequence pair representation that can cover completely the set of all placements satisfying the common centroid and 1-D symmetry constraints. This condition is essential for a good searching process to solve the problem effectively. Symmetric placement is an important step to achieve matchings of other electrical properties like delay and temperature variation. We have compared our results with those presented in the most updated previous works. Significant improvements can be obtained by our approach in both common centroid and 1-D symmetry placements, and we are the first who can handle both constraints simultaneously.

4B-4 (Time: 11:30 - 11:55)

Title	A Multilevel Analytical Placement for 3D ICs
Author	Jason Cong, *Guojie Luo (University of California, Los Angeles, United States)
Page	pp. 361 - 366
Keyword	3D IC, analytical placement, through-silicon via
Abstract	In this paper we propose a multilevel non-linear programming based 3D placement approach that minimizes a weighted sum of total wirelength and TS via number subject to area density constraints. This approach relaxes the discrete layer assignments so that they are continuous in the z-direction and the problem can be solved by an analytical global placer. A key idea is to do the overlap removal and device layer assignment simultaneously by adding a density penalty function for both area & TS via density constraints. Experimental results show that this analytical placer in a multilevel framework is effective to achieve trade-offs between wirelength and TS via number. Compared to the recently published transformation-based 3D placement method [1], we are able to achieve on average 12% shorter wirelength and 29% fewer TS via compared to their cases with best wirelength; we are also able to achieve on average 20% shorter wirelength and 50% fewer TS via number compared to their cases with best TS via numbers.
Slides

4B-5 (Time: 11:55 - 12:20)

Title	Exploring Adjacency in Floorplanning
Author	Jia Wang, *Hai Zhou (Northwestern University, United States)
Page	pp. 367 - 372
Keyword	floorplanning, adjacency graph
Abstract	This paper describes a new floorplanning approach called Constrained Adjacency Graph (CAG) that helps explore adjacency in floorplans. CAG extends the previous adjacency graph approaches by adding explicit adjacency constraints to the graph edges. After sufficient and necessary conditions of CAG are developed based on dissected floorplans, CAG is extended to handle general floorplans in order to improve area without changing the adjacency relations dramatically. These characteristics are currently utilized in a randomized greedy improvement heuristic for wire length optimization. The results show that better floorplans are found with much less running time for problems with 100 to 300 modules in comparison to a simulated annealing floorplanner based on sequence pairs.

Session 4C Signal/Power Integrity and Simulation
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 414+415
Chairs: Hideki Asai (Shizuoka University, Japan), Sheldon Tan (University of California, Riverside, United States)

4C-1 (Time: 10:15 - 10:40)

Title	Stochastic Current Prediction Enabled Frequency Actuator for Runtime Resonance Noise Reduction
Author	*Yiyu Shi (University of California, Los Angeles, United States), Jinjun Xiong, Howard Chen (IBM Thomas J. Watson Research Center, United States), Lei He (University of California, Los Angeles, United States)
Page	pp. 373 - 378
Keyword	Stochastic current modeling, frequency actuator, resonance noise
Abstract	A power delivery network (PDN) is a distributed RLC network with its dominant resonance frequency in the low-to-middle frequency range. Though high-performance chips’ working frequencies are much higher than this resonance frequency in general, chip runtime loading frequency is not. When a chip executes a chunk of instructions repeatedly, the induced current load may have harmonic components close to this resonance frequency, causing excessive power integrity degradation. Existing PDN design solutions are, however, mainly targeted at reducing high-frequency noise and are not effective at suppressing such resonance noise. In this work, we propose a novel approach to proactively suppress this type of noise. A method based on a high dimension generalized Markov process is developed to predict current load variation. Based on such prediction, a clock frequency actuator design is proposed to proactively select an optimal clock frequency to suppress the resonance. To the best of our knowledge, this is the first in-depth study on proactively reducing runtime instruction execution induced PDN resonance noise.

4C-2 (Time: 10:40 - 11:05)

Title	Fast Analysis of Nontree-Clock Network Considering Environmental Uncertainty by Parameterized and Incremental Macromodeling
Author	Hai Wang (University of California, Riverside, United States), Hao Yu (Berkeley Design Automation, United States), *Sheldon X.D. Tan (University of California, Riverside, United States)
Page	pp. 379 - 384
Keyword	clock network, environmental uncertainties, macromodeling
Abstract	It is challenging to verify clock skew for large-scale nontree clock networks with environmental uncertainties such as supply voltage fluctuation and thermal temperature gradient. This paper presents a fast clock-skew analysis via parameterized incremental truncated-balanced-realization, called the piTBR method. Environmental uncertainties are parametrically and structurally added into the state equation of the clock network. A compact macromodel is obtained by the subspace projection constructed from the singular value decomposition (SVD) of circuit output waveforms. To reduce the computational cost, we propose an incremental SVD method that only needs to partially update the projection matrix by analyzing the perturbed output waveform owing to environmental uncertainties. Experiments on a number of clock networks show that compared with the macromodeling by the fast TBR method, our method reduces the computational cost on the order of 100X with a similar accuracy. In addition, compared with the macromodeling by the Krylov-subspace-based method, our method reduces the waveform error by 2X with a similar runtime.

4C-3 (Time: 11:05 - 11:30)

Title	High Performance On-Chip Differential Signaling Using Passive Compensation for Global Communication
Author	Ling Zhang, Yulei Zhang (University of California, San Diego, United States), Akira Tsuchiya (Kyoto University, Japan), Masanori Hashimoto (Osaka University, Japan), Ernest Kuh (University of California, Berkeley, United States), *Chung-Kuan Cheng (University of California, San Diego, United States)
Page	pp. 385 - 390
Keyword	High performance, passive compensation, on-chip T-line
Abstract	To address the performance limitation brought by the scaling issues of on-chip global wires, a new configuration for global wiring using on-chip lossy transmission lines is proposed and optimized. We propose a signaling structure to compensate the distortion and attenuation of on-chip transmission lines, which uses passive compensation and inserts repeated transceivers comprising sense amplifiers and inverter chains. An optimization flow for designing this scheme based on eye-diagram prediction and sequential quadratic programming (SQP) is devised. This flow is used to study the latency, power dissipation and throughput performance of the new global wiring scheme as the technology scales from 90nm to 22nm. Compared to repeated RC wires, experimental results demonstrate that at the 22nm technology node, the new scheme can reduce the normalized delay by 80%-95% and the normalized energy consumption by 50%-94%. The normalized latency is 10 ps/mm, the energy per bit is 20 pJ/m, and the throughput is 15 Gbps/um. All performance metrics are scalable with technology, which makes this approach a potential candidate to break the "interconnect wall" of digital system performance.

4C-4 (Time: 11:30 - 11:55)

Title	Noise Minimization During Power-Up Stage for a Multi-Domain Power Network
Author	*Wanping Zhang (Qualcomm Inc./University of California, San Diego, United States), Yi Zhu (University of California, San Diego, United States), Wenjian Yu (Tsinghua University, China), Amirali Shayan, Renshen Wang (University of California, San Diego, United States), Zhi Zhu (Qualcomm Inc., United States), Chung-Kuan Cheng (University of California, San Diego, United States)
Page	pp. 391 - 396
Keyword	Noise, Power-up sequence, Multi-domain
Abstract	With the popularity of Multiple Power Domain (MPD) design, multi-domain power network noise analysis and minimization is becoming important. This paper describes an efficient heuristic algorithm to arrange the power-up sequence in a multi-domain power network in order to minimize the noise. We present a formulation of this problem and show it is NP-complete. Therefore, we propose a simulated annealing (SA) based algorithm with preprocessing. Experimental results show that the proposed algorithm can minimize the noise close to the minimal values. In terms of efficiency, the SA algorithm is hundreds of times faster than the enumeration method, and the running time scales well with the number of domains for these cases. In addition, we discuss the trade-off between power-up efficiency and noise.
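As an illustration of the simulated annealing idea in the 4C-4 abstract, the following is a minimal sketch of annealing over power-up orderings. It is not the authors' algorithm: the preprocessing step is omitted, and the names `anneal_order`, `toy_noise`, and the `inrush` values are hypothetical stand-ins, since the paper's actual noise model is not given in this program.

```python
import math
import random
from typing import Callable, List, Tuple


def anneal_order(n: int,
                 cost: Callable[[List[int]], float],
                 steps: int = 2000,
                 t0: float = 1.0,
                 alpha: float = 0.995,
                 seed: int = 0) -> Tuple[List[int], float]:
    """Search for a low-cost ordering of n domains (n >= 2) by simulated annealing."""
    rng = random.Random(seed)
    order = list(range(n))
    cur_cost = cost(order)
    best, best_cost = order[:], cur_cost
    t = t0
    for _ in range(steps):
        # Propose a neighbor by swapping two positions in the sequence.
        i, j = rng.sample(range(n), 2)
        order[i], order[j] = order[j], order[i]
        new_cost = cost(order)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if new_cost <= cur_cost or rng.random() < math.exp((cur_cost - new_cost) / t):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = order[:], cur_cost
        else:
            order[i], order[j] = order[j], order[i]  # reject: undo the swap
        t *= alpha  # geometric cooling schedule
    return best, best_cost


# Hypothetical per-domain inrush currents (arbitrary units), for illustration only.
inrush = [5.0, 1.0, 4.0, 2.0, 3.0, 6.0]


def toy_noise(order: List[int]) -> float:
    """Toy stand-in for a noise metric: worst combined inrush of consecutive domains."""
    return max(inrush[a] + inrush[b] for a, b in zip(order, order[1:]))


best_order, best_noise = anneal_order(len(inrush), toy_noise)
```

A real flow would replace `toy_noise` with a circuit-level noise evaluation; the swap neighborhood and cooling schedule here are generic choices, not ones taken from the paper.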

4C-5s (Time: 11:55 - 12:07)

Title	Parallel Transistor Level Circuit Simulation using Domain Decomposition Methods
Author	*He Peng, Chung-Kuan Cheng (University of California, San Diego, United States)
Page	pp. 397 - 402
Keyword	SPICE, parallel circuit simulation, domain decomposition, multi-core simulation
Abstract	This paper presents an efficient parallel transistor level full-chip circuit simulation tool with SPICE accuracy. The new approach partitions the circuit into a linear domain and several non-linear domains based on circuit non-linearity and connectivity. The linear domain is solved by a parallel fast linear solver, while the non-linear domains are distributed across different processors and solved in parallel by a direct solver. A parallel domain decomposition technique is used to iteratively solve the different partitions of the circuit and ensure convergence. Different domain decomposition techniques are discussed. Orders of magnitude speedup over SPICE is observed for sets of large-scale VLSI circuits.

4C-6s (Time: 12:07 - 12:19)

Title	Fast Circuit Simulation on Graphics Processing Units
Author	Kanupriya Gulati (Texas A&M University, United States), John F. Croix (Nascentric, Inc., United States), *Sunil P. Khatri (Texas A&M University, United States), Rahm Shastry (Nascentric, Inc., United States)
Page	pp. 403 - 408
Keyword	SPICE, device model evaluations, Graphics Processing Units
Abstract	SPICE based circuit simulation is a traditional workhorse in the VLSI design process. Given the pivotal role of SPICE in the IC design flow, there has been significant interest in accelerating SPICE. Since a large fraction (on average 75%) of the SPICE runtime is spent in evaluating transistor model equations, a significant speedup can be achieved if these evaluations are accelerated. This paper reports our early efforts to accelerate transistor model evaluations using a Graphics Processing Unit (GPU). We have integrated this accelerator with a commercial fast SPICE tool. Our experiments demonstrate that significant speedups (2.36X on average) can be obtained. The asymptotic speedup that can be obtained is about 4X. We demonstrate that with circuits consisting of as few as about 1000 transistors, speedups in the neighborhood of this asymptotic value can be obtained. By utilizing the recently announced (but not currently available) quad GPU systems, this speedup could be enhanced further, especially for larger designs.
Slides

Session 4D Special Session: Challenges in 3D Integrated Circuit Design
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 416+417
Organizer: Sachin Sapatnekar (Univ. of Minnesota, United States)

4D-1 (Time: 10:15 - 10:40)

Title	(Invited Paper) Three-Dimensional Integration Technology and Integrated Systems
Author	*Mitsumasa Koyanagi, Takafumi Fukushima, Tetsu Tanaka (Tohoku University, Japan)
Page	pp. 409 - 415
Abstract	A new three-dimensional (3-D) integration technology by a self-assembly method is described. In addition, 3-D integrated systems such as 3-D microprocessor chip, 3-D shared memory chip, 3-D image processing chip and 3-D artificial retina chip are demonstrated.
Slides

4D-2 (Time: 10:40 - 11:05)

Title	(Invited Paper) A 3D Prototyping Chip based on a Wafer-level Stacking Technology
Author	*Nobuaki Miyakawa (Honda Research Institute, Japan)
Page	pp. 416 - 420
Keyword	Stacking Technology, Wafer-to-Wafer, 8 inch wafer, Trial Manufacture of 3 layer Stacked
Abstract	A case study on 3D IC process, prototyping, and EDA design flow

4D-3 (Time: 11:05 - 11:30)

Title	(Invited Paper) CAD Challenges for 3D ICs
Author	David Kung, *Ruchir Puri (IBM Corp., United States)
Page	pp. 421 - 422
Abstract	A fundamental shift in the technology has occurred beyond 90nm CMOS where the interconnect resistance has been increasing significantly to cause a repeater explosion problem. This problem translates into not only significant area overhead but also power, as repeaters lose power to leakage. 3D technology has the potential of easing the challenge of repeater explosion. In order to exploit the full potential of 3D technology, new challenges in the area of physical design, thermal analysis, system level design and analysis need to be addressed. 3D interconnects have the potential of significantly reducing critical path delays, which are typically between memory and the interfacing logic. New tools that consider thermally aware physical design implementations, most importantly at the architecture and SoC level, are crucial to the success of 3D as thermal issues are exacerbated in 3D implementations. To justify the cost and complexity overhead of 3D technology, it is essential to study the benefit of 3D early in the design cycle. This requires strong linkage between architecture level analysis tools and 3D physical planning tools. Most of the advantages of 3D will be utilized with new system architectures and physical implementations. Therefore, the tools to aid 3D implementation must also operate at the higher level in addition to the 3D place and route algorithms that have been proposed in the literature before. In fact, the benefits from 3D place and route will be limited since current 2D designs do a fairly good job of optimizing the critical path distance. There is a very strong need for 3D architectural and physical planning tools that operate in the domain of thermal, physical, and performance analysis in order to yield an optimized system implementation in 3D technology. Most of the studies reporting huge benefits from 3D for wire length do not adequately consider the physical impact of vertical vias. It is crucial to consider the impact of vertical vias on the physical design of ICs, from area, latency, and thermal impact point of view.

4D-4 (Time: 11:30 - 11:55)

Title	(Invited Paper) Addressing Thermal and Power Delivery Bottlenecks in 3D Circuits
Author	*Sachin S. Sapatnekar (University of Minnesota, United States)
Page	pp. 423 - 428
Keyword	3D integrated circuits, temperature, power grid, analysis, optimization
Abstract	The enhanced packing densities facilitated by 3D integrated circuit technology also have an unwanted side-effect, in the form of increasing the amount of current per unit footprint of the chip, as compared to a 2D design. This has ramifications on two critical issues: firstly, it means that more heat is generated per unit footprint, potentially leading to thermal problems, and secondly, more current must be supplied per package pin, leading to possible power delivery bottlenecks. This paper presents an overview of the challenges and solutions in the domain of addressing these two issues in 3D integrated circuits.
Slides

4D-5 (Time: 11:55 - 12:20)

Title	(Invited Paper) The Road to 3D EDA Tool Readiness
Author	*Charles Chiang, Subarna Sinha (Synopsys, United States)
Page	pp. 429 - 436
Keyword	TSV
Abstract	The design, representation and optimization of 3D ICs will require changes to the current EDA tool suite. Modifications will be necessary in the data models as well as the analysis and optimization algorithms at various design stages. This talk will provide an in-depth summary of the changes needed at the various design stages to enable and support 3D IC design.

Session 5A Energy-Aware System Level Design Methodology
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 411+412
Chairs: Chia-Lin Yang (National Taiwan University, Taiwan), Juinn-Dar Huang (National Chiao Tung University, Taiwan)

5A-2 (Time: 13:55 - 14:20)

Title	System-Level Exploration Tool for Energy-Aware Memory Management in the Design of Multidimensional Signal Processing Systems
Author	*Florin Balasa (Southern Utah University, United States), Ilie I. Luican (University of Illinois at Chicago, United States), Hongwei Zhu (ARM, Inc., United States), Doru V. Nasui (American International Radio, Inc., United States)
Page	pp. 443 - 448
Keyword	multidimensional signal processing, memory management, memory allocation, signal-to-memory assignment, dynamic energy consumption
Abstract	Many signal processing systems, particularly in the multimedia and telecom domains, are synthesized to execute data-dominated applications. Their behavior is described in a high-level programming language, where the code is typically organized in sequences of loop nests and the main data structures are multidimensional arrays. Since data transfer and storage have a significant impact on both the system performance and the major cost parameters -- power consumption and chip area, the designer must spend a significant effort during the system development process on the exploration of the memory subsystem in order to achieve a cost-optimized design. This paper presents a software tool for system-level exploration, where several memory management tasks are addressed in a common theoretical framework. The tool can compute the minimum storage requirement of a given application and can produce the graph of storage variation during the code execution; it offers memory allocation and signal assignment solutions both for flat and hierarchical organizations and optimizes the dynamic energy consumption in the memory subsystem.

5A-3 (Time: 14:20 - 14:45)

Title	Systematic Architecture Exploration based on Optimistic Cycle Estimation for Low Energy Embedded Processors
Author	*Ittetsu Taniguchi (Osaka University, Japan), Murali Jayapala (IMEC vzw., Belgium), Praveen Raghavan, Francky Catthoor (IMEC vzw./K.U.Leuven, Belgium), Keishi Sakanushi, Yoshinori Takeuchi, Masaharu Imai (Osaka University, Japan)
Page	pp. 449 - 454
Keyword	architecture exploration, address generation unit (AGU), reconfigurable architecture
Abstract	Systematic architecture exploration from a vast solution space is a complex problem in embedded system design. It is very difficult to find the best architecture quickly and accurately because accurate evaluation usually consumes a significant amount of time for each point in the solution space. In this paper, we propose a fast and systematic architecture exploration method for address generation units (AGUs) based on a coarse grained reconfigurable architecture model. First we prove that a set of Pareto solutions of cycle vs energy becomes a subset of Pareto solutions of cycle vs area under some practical assumptions. In addition we propose an "Optimistic cycle (OC)" metric to find promising solutions in a vast solution space. Based on this metric we also propose a fast architecture exploration algorithm which only applies mapping to promising architectures. Using the proposed systematic architecture exploration method, we show that we can obtain almost the same trade-off points as the exhaustive search method and also that our method is about 164 times faster than exhaustive search.
Slides

5A-4 (Time: 14:45 - 15:10)

Title	A Framework for Estimating NBTI Degradation of Microarchitectural Components
Author	*Michael DeBole, Ramakrishnan Krishnan (The Pennsylvania State University, United States), Varsha Balakrishnan, Wenping Wang (Arizona State University, United States), Hong Luo, Yu Wang (Tsinghua University, China), Yuan Xie (The Pennsylvania State University, United States), Yu Cao (Arizona State University, United States), N. Vijaykrishnan (The Pennsylvania State University, United States)
Page	pp. 455 - 460
Keyword	NBTI, Reliability, CAD, Computer Architecture
Abstract	Degradation of device parameters over the lifetime of a system is emerging as a significant threat to system reliability. Among the aging mechanisms, wearout resulting from NBTI is of particular concern in deep submicron technology generations. To facilitate architectural level aging analysis, a tool capable of evaluating NBTI vulnerabilities early in the design cycle has been developed. The tool includes workload-based temperature and performance degradation analysis across a variety of technologies and operating conditions, revealing a complex interplay between factors influencing NBTI timing degradation.

Session 5B Design for Manufacturing and Reliability
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 413
Chair: Charles Chiang (Synopsys, United States)

5B-1 (Time: 13:30 - 13:55)

Title	Efficient Analytical Determination of the SEU-induced Pulse Shape
Author	Rajesh Garg, *Sunil P. Khatri (Texas A&M University, United States)
Page	pp. 461 - 467
Keyword	Radiation, Single Event Upset, Single Event Transients
Abstract	Single event upsets (SEUs) have become problematic for both combinational and sequential circuits in the deep sub-micron era due to device scaling, lowered supply voltages and higher operating frequencies. To design radiation tolerant circuits efficiently, techniques are required to analyze the effects of a radiation particle strike on a circuit early in the design flow, and hence evaluate the circuit's resilience to SEU events. For an accurate estimation of the SEU tolerance of a circuit, it is important to consider the effects of electrical masking. This is typically done by performing circuit simulations, which are slow. In this paper, we present an analytical model for the determination of the shape of radiation-induced voltage glitches in combinational circuits. The output of our approach can be propagated to the primary outputs of the circuit using existing tools, thereby modeling the effects of electrical masking. This enables an accurate and quick evaluation of the SEU robustness of a circuit. Experimental results demonstrate that our model is very accurate, with a very low root mean square percentage error (4.5%) in the estimation of the shape of the voltage glitch compared to SPICE. Our model gains its accuracy by using a non-linear model for the load current of the gate, and by considering the effect of the ion track establishment constant on the radiation induced voltage glitch. Our analytical model is very fast (275X faster than SPICE) and accurate, and can therefore be easily incorporated in a design flow to estimate the SEU tolerance of circuits early in the design process.

5B-2 (Time: 13:55 - 14:20)

Title	Post-Routing Redundant Via Insertion with Wire Spreading Capability
Author	Cheok-Kei Lei, *Po-Yi Chiang, Yu-Min Lee (National Chiao Tung University, Taiwan)
Page	pp. 468 - 473
Keyword	redundant via insertion, wire spreading, DFM, yield
Abstract	Redundant via insertion is a widely recommended technique to enhance the via yield and reliability. In this paper, the post-routing redundant via insertion problem is transformed to a mixed bipartite-conflict graph matching problem, and an efficient heuristic minimum weighted matching (HMWM) algorithm is presented to solve it. The developed method not only inserts redundant vias for alive vias but also protects the dead vias by utilizing the wire spreading capability; that is, the method shifts wires into the empty space and adds redundant vias for dead vias to further enhance the via yield. Experimental results show that the average insertion rate of alive vias is 99.54% with a short run time, and the wire spreading technique can achieve an average insertion rate of 54.41% for dead vias.

5B-3 (Time: 14:20 - 14:45)

Title	Accounting for Non-linear Dependence Using Function Driven Component Analysis
Author	Lerong Cheng, *Puneet Gupta, Lei He (University of California, Los Angeles, United States)
Page	pp. 474 - 479
Keyword	Noise margin, Statistical Analysis
Abstract	The majority of practical multivariate statistical analyses and optimizations model interdependence among random variables in terms of the linear correlation among them. Though linear correlation is simple to use and evaluate, in several cases non-linear dependence between random variables may be too strong to ignore. In this paper, we propose polynomial correlation coefficients as a simple measure of multi-variable non-linear dependence and show that the need for modeling non-linear dependence strongly depends on the end function that is to be evaluated from the random variables. Then, we calculate the errors in estimation which result from assuming independence of components generated by linear de-correlation techniques such as PCA and ICA. The experimental results show that the error predicted by our method is within 1% of the real simulation. In order to deal with non-linear dependence, we further develop a target function driven component analysis algorithm (FCA) to minimize the error caused by ignoring high order dependence and apply this technique to statistical leakage power analysis and SRAM cell noise margin variation analysis. Experimental results show that the proposed FCA method is more accurate than the traditional PCA or ICA.

5B-4 (Time: 14:45 - 15:10)

Title	Risk Aversion Min-Period Retiming under Process Variations
Author	Jia Wang, *Hai Zhou (Northwestern University, United States)
Page	pp. 480 - 485
Keyword	statistical optimization, retiming, process variations
Abstract	Recent advances in statistical static timing analysis (SSTA) achieve great success in computing arrival times under variations by extending sum and maximum operations to random variables. It remains a challenging problem to apply such results in order to address the variability in circuit optimizations. In this paper, we study the statistical retiming problem, where retiming is a powerful sequential transformation that relocates flip-flops in a circuit without changing its functionality. We formulate the risk aversion min-period retiming problem under process variations based on a conventional two-stage stochastic program with fixed recourse and a risk aversion objective of the clock period. We prove that the proposed problem is an integer convex program, show that the subgradient of the objective function can be derived from the combinational paths with the maximum path delay, and present a heuristic incremental algorithm to solve the proposed problem. Our approach can handle an arbitrary gate delay model under process variations through sampling from a black box, and its effectiveness is confirmed by the experimental results. Furthermore, we point out how the current state-of-the-art SSTA techniques could be improved for future optimization algorithms when analytical models are available.

5B-5s (Time: 15:10 - 15:22)

Title	Timing Analysis and Optimization Implications of Bimodal CD Distribution in Double Patterning Lithography
Author	Kwangok Jeong, *Andrew B. Kahng (University of California, San Diego, United States)
Page	pp. 486 - 491
Keyword	Bimodal, DPL, Double patterning, CD
Abstract	Double patterning lithography (DPL) is in current production for memory products, and is widely viewed as inevitable for logic products at the 32nm node. DPL decomposes and prints the shapes of a critical-layer layout in two exposures. In traditional single-exposure lithography, adjacent identical layout features will have identical mean critical dimension (CD), and spatially correlated CD variations. However, with DPL, adjacent features can have distinct mean CDs, and uncorrelated CD variations. This introduces a new set of 'bimodal' challenges for timing analysis and optimization. We assess the potential impact of DPL on timing analysis error and guardbanding, and find that the traditional 'unimodal' characterization and analysis framework may not be viable for DPL. For example, using 45nm models, we find that different DPL mask layout solutions can cause 50ps skew in clock distribution that is unseen by traditional analyses. Different mask layouts can also result in 20% or more change in timing path delays. Such results lead to insights into physical design optimizations for clock and data path placement and mask coloring that can help mitigate the error and guardband costs of DPL.

5B-6s (Time: 15:22 - 15:34)

Title	Scheduled Voltage Scaling for Increasing Lifetime in the Presence of NBTI
Author	*Lide Zhang, Robert Dick (Northwestern University, United States)
Page	pp. 492 - 497
Keyword	Scheduled Voltage Scaling, Negative Bias Temperature Instability (NBTI), Guard Banding
Abstract	Negative Bias Temperature Instability (NBTI) is a leading reliability concern for integrated circuits (ICs). It increases the threshold voltages of PMOS transistors, thereby increasing delay. We propose the use of scheduled voltage scaling that gradually enhances the operating voltage of the IC to compensate for NBTI-related performance degradation. Scheduled voltage scaling has the potential to increase IC lifetime by 46% relative to the conventional approach using guard banding for ICs fabricated using a 45nm process.

Session 5C Analog, RF and Mixed-Signal CAD
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 414+415
Chairs: Eric Keiter (Sandia National Laboratories, United States), Chin-Fong Chiu (National Chip Implementation Center, Taiwan)

5C-1 (Time: 13:30 - 13:55)

Title	Efficiently Finding the 'Best' Solution with Multi-Objectives from Multiple Topologies in Topology Library of Analog Circuit
Author	*Yu Liu, Masato Yoshioka, Katsumi Homma, Toshiyuki Shibuya (Fujitsu Laboratories Ltd., Japan)
Page	pp. 498 - 503
Keyword	pareto-front, multi-objective optimization, analog, topology
Abstract	This paper presents a new method using a multi-objective optimization algorithm to automatically find the best solution from a topology library of analog circuits. Compared to the traditional optimization methods using single-objective optimization algorithms, this work can efficiently find the best non-dominated solution from multiple topologies for different specifications without additional time-consuming optimizing iterations. The experiments demonstrate that this method is feasible and practical in actual analog designs, especially for uncertain or different specifications in multiple dimensions.
Slides
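The notion of a non-dominated solution in the 5C-1 abstract can be illustrated with a small sketch. This is a generic Pareto filter, not the paper's method; the function names, topology names, and objective values below are hypothetical, and all objectives are assumed to be minimized.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Tuple[float, ...]]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [
        (name, obj)
        for name, obj in candidates
        if not any(dominates(other, obj) for _, other in candidates)
    ]


# Hypothetical (power, area) results for sized circuits drawn from several topologies.
candidates = [
    ("topology_A", (1.0, 5.0)),
    ("topology_B", (2.0, 3.0)),
    ("topology_C", (2.5, 4.0)),   # worse than topology_B in both objectives
    ("topology_D", (3.0, 1.0)),
]

front = pareto_front(candidates)
```

Here `topology_C` is filtered out because `topology_B` is better in both objectives; the remaining three points form the trade-off front from which a designer would pick according to the specification.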

5C-2 (Time: 13:55 - 14:20)

Title	Automated Design and Optimization of Circuits in Emerging Technologies
Author	*Rajesh A. Thakker, Chaitanya Sathe, Angada B. Sachid, Maryam Shojaei Baghini, V. Ramgopal Rao, Mahesh B. Patil (Department of Electrical Engineering, Indian Institute of Technology, Bombay, India)
Page	pp. 504 - 509
Keyword	Look-up table, FinFET, Automatic design, Particle swarm optimization, Emerging technologies
Abstract	A novel table-based environment for automatic design and optimization of FinFET circuits is demonstrated. A new accurate look-up table (LUT) technique is implemented in a circuit simulator and integrated with a particle swarm optimization algorithm for efficient circuit designs in novel devices. Op-amp circuits are designed and optimized to demonstrate the accuracy and usefulness of the proposed platform. Further, it is shown that the proposed design methodology can take into account variations in process, supply voltage, and temperature.

5C-3 (Time: 14:20 - 14:45)

Title	An Automated Design Approach for CMOS LDO Regulators
Author	*Samiran DasGupta, Pradip Mandal (Indian Institute of Technology, Kharagpur, India)
Page	pp. 510 - 515
Keyword	Low dropout (LDO), voltage regulator, optimal sizing, design automation
Abstract	This paper presents a method for optimal sizing of CMOS low-dropout regulator circuits. The technique relies on the observation that many of the performance metrics of an LDO regulator can be approximated as posynomial functions of design variables. This allows the design problem to be cast as a geometric program. Geometric programming is particularly attractive as an optimization tool because 1) it can be solved very efficiently, 2) it always finds the global minimum, 3) infeasible specifications are readily detected, and 4) the final solution is completely independent of the initial guess. As a result, CMOS LDOs may be conveniently synthesized; moreover, the optimal trade-off curves between the competing performance metrics can be obtained very quickly.
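For readers unfamiliar with the terminology in the 5C-3 abstract, the standard textbook form of a geometric program is sketched below. The symbols (x_i, c_k, a_{ik}, f_m, g_p) are generic notation, not taken from the paper. The change of variables in the last line turns each posynomial into a convex log-sum-exp function, which is why a local optimum is also the global one.

```latex
% Posynomial in positive design variables x_1, ..., x_n (coefficients c_k > 0, real exponents a_{ik}):
f(x) = \sum_{k=1}^{K} c_k \prod_{i=1}^{n} x_i^{a_{ik}}, \qquad c_k > 0,\ x_i > 0

% Geometric program in standard form (f_0, ..., f_M posynomials; g_1, ..., g_P monomials, i.e. K = 1):
\min_{x > 0}\ f_0(x)
\quad \text{s.t.} \quad f_m(x) \le 1,\ m = 1, \dots, M,
\qquad g_p(x) = 1,\ p = 1, \dots, P

% With y_i = \log x_i and b_k = \log c_k, each posynomial becomes convex in y:
\log f(e^{y}) = \log \sum_{k=1}^{K} \exp\!\Big( \sum_{i=1}^{n} a_{ik}\, y_i + b_k \Big)
```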

5C-4 (Time: 14:45 - 15:10)

Title	A SCORE Macromodel for PLL Designs to Analyze Supply Noise Interaction Issues at Behavioral Level
Author	*Chin-Cheng Kuo, Pei-Syun Lin, Chien-Nan Jimmy Liu (National Central University, Taiwan)
Page	pp. 516 - 521
Keyword	supply noise interaction issues, analog behavioral model, PLL, macromodel
Abstract	Using behavioral models to perform fast simulation is currently a popular solution to verify SOC designs. Previous analog behavior modeling approaches often treat the noisy VDD waveform as a given input and focus on reflecting such stimuli on circuit performance. However, because the interaction of noise aggressors and victims is not considered, some errors may exist in the simulation. In this paper, a simple SCORE macromodel is proposed for PLL designs. It can be integrated with a supply-noise-aware PLL behavioral model to analyze supply noise effects at high level. In addition to numerical results, the time-varying supply noise waveform and real-time PLL responses can be obtained simultaneously. As demonstrated in the experimental results, the proposed approach can provide more realistic simulation results with noise interaction effects but still keep fast simulation time.

5C-5 (Time: 15:10 - 15:35)

Title	Gen-Adler: The Generalized Adler’s Equation for Injection Locking Analysis in Oscillators
Author	*Prateek Bhansali, Jaijeet Roychowdhury (University of Minnesota, United States)
Page	pp. 522 - 527
Keyword	Injection locking, Perturbation Projection Vector, Adler's equation, Oscillators
Abstract	Injection locking analysis based on the classical Adler’s equation is limited to LC oscillators, as it depends on the quality factor. In this paper, we present the Generalized Adler’s equation, which is applicable to injection locking analysis of oscillators independent of the circuit topology. The equation is obtained by averaging the PPV phase macromodel. The procedure is simple and convenient for determining the locking range for small AC injection signals of arbitrary shape. Analytical equations for injection locking dynamics are formulated using the Generalized Adler’s equation and validated with PPV simulations.

Session 5D Designers' Forum: Consumer SoC
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 416+417
Chair: Yoshio Masubuchi (Toshiba Corporation, Japan)

5D-1 (Time: 13:30 - 14:10)

Title	(Invited Paper) Development of Full-HD Multi-standard Video CODEC IP Based on Heterogeneous Multiprocessor Architecture
Author	*Hiroaki Nakata, Koji Hosogi, Masakazu Ehama, Takafumi Yuasa, Toru Fujihira (Hitachi, Ltd., Japan), Kenichi Iwata, Motoki Kimura, Fumitaka Izuhara, Seiji Mochizuki, Masaki Nobori (Renesas Technology Corp., Japan)
Page	pp. 528 - 534
Keyword	CODEC, video, full-HD, H.264, VC-1
Abstract	To support numerous video codec standards and full-HD videos on different consumer devices, a multi-standard CODEC IP based on a heterogeneous multiprocessor architecture was developed. Operation-specific processors were designed for two types of processing: stream processing and pixel processing. The CODEC also makes effective use of several dedicated circuits. To design the CODEC, we developed a C-language model to check our design. The CODEC can process full-HD videos formatted in H.264, MPEG-2, MPEG-4, and VC-1 at an operating frequency of 162 MHz.

5D-2 (Time: 14:10 - 14:50)

Title	(Invited Paper) A 65nm Dual-mode Baseband and Multimedia Application Processor SoC with Advanced Power and Memory Management
Author	*Tatsuya Kamei, Tetsuhiro Yamada, Takao Koike, Masayuki Ito, Takahiro Irita, Kenichi Nitta, Toshihiro Hattori, Shinichi Yoshioka (Renesas Technology Corp., Japan)
Page	pp. 535 - 539
Keyword	application processor, mobile phone, low power, multi-media, MMU
Abstract	A dual-mode baseband (W-CDMA/HSDPA and GSM/GPRS/EDGE) and multimedia application processor SoC is described. The SoC, fabricated in triple-Vth 65nm CMOS, has 3 CPU cores and 20 separate power domains to achieve both high performance and low power. The SoC adopts the Partial Clock Activation scheme, which reduces power by 42% for long-time music replay. The IP-MMU is introduced to reduce the maximum memory footprint by 43MB, sharing external memory among CPUs and HW-IPs using a virtual address space that enables reuse of physically fragmented memory.

5D-3 (Time: 14:50 - 15:30)

Title	(Invited Paper) UniPhier: Series Development and SoC Management
Author	*Yoshito Nishimichi, Nobuo Higaki, Masataka Osaka, Seiji Horii, Hisato Yoshida (Panasonic Corp., Japan)
Page	pp. 540 - 545
Keyword	platform, SoC, UniPhier
Abstract	A digital CE integrated platform “UniPhier” (Universal Platform for High-quality Image Enhancing Revolution) has been developed to accelerate the sharing of technology and design assets across product categories from mobile phones to home-use AV. On the integrated platform, software and hardware assets are easy to use and can be reused across product categories, and digital CE products have been greatly enhanced. In this paper, an overview of the integrated platform “UniPhier” and its SoC (System on a Chip) application examples are described.

Session 6A System Level Simulation and Modeling
Time: 15:55 - 18:00 Wednesday, January 21, 2009
Location: Room 411+412
Chairs: Vincent J Mooney (Georgia Institute of Technology, United States), Tsuneo Nakata (Fujitsu Laboratories Ltd., Japan)

6A-1 (Time: 15:55 - 16:20)

Title	Automatic Instrumentation of Embedded Software for High Level Hardware/Software Co-Simulation
Author	Aimen Bouchhima, *Patrice Gerin, Frédéric Pétrot (TIMA Laboratory, France)
Page	pp. 546 - 551
Keyword	cosimulation, annotation
Abstract	We propose an automatic instrumentation method for embedded software annotation to enable performance modeling in high level hardware/software co-simulation environments. The proposed "cross-annotation" technique consists of extending a retargetable compiler infrastructure to allow the automatic instrumentation of embedded software at the basic block level. Thus, target and annotated native binaries are guaranteed to have isomorphic control flow graphs (CFG). The proposed method takes into account the processor-specific optimizations at the compiler level and proves to be accurate with low simulation overhead.

6A-2 (Time: 16:20 - 16:45)

Title	Fast and Accurate Performance Simulation of Embedded Software for MPSoC
Author	*Eric Cheung, Harry Hsieh (University of California, Riverside, United States), Felice Balarin (Cadence Design Systems, United States)
Page	pp. 552 - 557
Keyword	Performance Simulation, Multiprocessor
Abstract	Performance simulation of software for Multiprocessor Systems-on-a-Chip (MPSoC) suffers from poor tool support. Cycle accurate simulation at Instruction Set Simulation level is too slow and inefficient for any design of realistic size. Behavioral simulation, though useful for functional analysis at high level, does not provide any performance information that is crucial for design and analysis of MPSoC implementations. As a consequence, designers are often reduced to manually annotating performance information onto behavioral models, which contributes further to inefficiency and inaccuracy. In this paper, we use structural performance models to provide fast and accurate simulation of software for MPSoC. We generate structural models automatically using GCC with accurate performance annotation while considering optimizations for instruction selection, branch prediction, and pipeline interlock. Our structural models are able to simulate several orders of magnitude faster than ISS and provide less than 1% error on performance estimation. These models allow realistic MPSoC design space explorations based on performance characteristics with simulation speed comparable to behavioral simulation. We validate our simulation models with several benchmarks and demonstrate our approach with a design case study of an MPEG-2 decoder.

6A-3 (Time: 16:45 - 17:10)

Title	Automatic Generation of Cycle Accurate and Cycle Count Accurate Transaction Level Bus Models from a Formal Model
Author	*Chen Kang Lo, Ren Song Tsay (National Tsing Hua University, Taiwan)
Page	pp. 558 - 563
Keyword	System Level Design, Transaction Level Modeling, bus modeling
Abstract	This paper proposes the first automatic approach to simultaneously generate Cycle Accurate and Cycle Count Accurate transaction level bus models. Since TLM (Transaction Level Modeling) is proven as an effective design methodology for managing the ever-increasing complexity of system level designs, researchers often exploit various abstraction levels to gain either simulation speed or accuracy. Consequently, designers repeatedly perform the time-consuming task of re-writing and performing consistency checks for different abstraction level models of the same design. To ease the work, we propose a correct-by-construction method that automatically and simultaneously generates both fast and accurate transaction level bus models for system simulation. The proposed approach relieves designers from the tedious and error-prone process of refining models and checking for consistency.

6A-4 (Time: 17:10 - 17:35)

Title	A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor
Author	*Farhad Mehdipour (Kyushu University, Japan), Hamid Noori (Institute of Systems, Information Technologies and Nanotechnologies, Japan), Bahman Javadi (Amirkabir University of Technology, Iran), Hiroaki Honda (Institute of Systems, Information Technologies and Nanotechnologies, Japan), Koji Inoue, Kazuaki Murakami (Kyushu University, Japan)
Page	pp. 564 - 569
Keyword	Reconfigurable instruction set processors, Analytical modeling, reconfigurable accelerator, Performance evaluation, Design space exploration
Abstract	Performance evaluation is a serious challenge in designing or optimizing reconfigurable instruction set processors. The conventional approaches based on synthesis and simulations are very time consuming and need considerable design effort. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical reconfigurable instruction set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation and provide valuable insight. Compared to cycle-accurate simulation results, CAnSO shows a variation of almost 2% in the speedup measurement.

Session 6B Chip and Package Routing Techniques
Time: 15:55 - 18:00 Wednesday, January 21, 2009
Location: Room 413
Chairs: Ting-Chi Wang (National Tsing Hua University, Taiwan), Yasuhiro Takashima (University of Kitakyushu, Japan)

6B-1 (Time: 15:55 - 16:20)

Title	Efficient Simulated Evolution Based Rerouting and Congestion-Relaxed Layer Assignment on 3-D Global Routing
Author	Ke-Ren Dai, *Wen-Hao Liu, Yih-Lang Li (National Chiao Tung University, Taiwan)
Page	pp. 570 - 575
Keyword	Global Routing, layer assignment, simulated evolution
Abstract	The increasing complexity of interconnection designs has enhanced the importance of research into global routing when seeking high-routability (low overflow) results or rapid search paths that report wire-length estimations to a placer. This work presents two routing techniques, namely adaptive pseudo-random net-ordering routing and evolution-based rip-up and reroute using a two-stage cost function, in a high-performance congestion-driven 2-D global router. We also propose two efficient via-minimization methods, namely congestion relaxation by layer shifting and rip-up and re-assignment, for a dynamic programming-based layer assignment. Experimental results demonstrate that our router achieves performance similar to the first two winning routers in the ISPD 2008 Routing Contest in terms of both routability and wire length, at 1.42X and 25.84X faster routing speeds. In addition, our layer assignment yields 3.5% to 5.6% fewer vias, 2.2% to 3.3% shorter wire length, and 13% to 27% less runtime than COLA.

6B-2 (Time: 16:20 - 16:45)

Title	FastRoute 4.0: Global Router with Efficient Via Minimization
Author	*Yue Xu, Yanheng Zhang, Chris Chu (Iowa State University, United States)
Page	pp. 576 - 581
Keyword	Global Routing, Layer Assignment, 3-Bend Routing
Abstract	The number of vias generated during the global routing stage is a critical factor for the yield of final circuits. However, most global routers only approach the problem by charging a cost for vias in the maze routing cost function. In this paper, we present a global router that addresses the via number optimization problem throughout the entire global routing flow. We introduce via-aware Steiner tree generation, 3-bend routing, and a spiral layer assignment algorithm to reduce via count. We integrate these three techniques into FastRoute 3.0 and achieve significant reduction in both via count and runtime.

6B-3s (Time: 16:45 - 16:57)

Title	High-Performance Global Routing with Fast Overflow Reduction
Author	*Huang-Yu Chen, Chin-Hsiung Hsu, Yao-Wen Chang (National Taiwan University, Taiwan)
Page	pp. 582 - 587
Keyword	Global Routing, Routing
Abstract	We develop a new global router, NTUgr, that contains three major steps: prerouting, initial routing, and iterative forbidden-region rip-up/rerouting (IFR). Prerouting employs a two-stage technique of congestion-hotspot historical cost pre-increment followed by small bounding-box area routing. Initial routing is based on efficient iterative monotonic routing. Finally, IFR features three techniques of (1) multiple forbidden regions expansion, (2) critical subnet rerouting selection, and (3) look-ahead historical cost increment. Experiments show that NTUgr achieves high-quality results for ISPD'07 and ISPD'08 benchmarks.

6B-4s (Time: 16:57 - 17:09)

Title	IO Connection Assignment and RDL Routing for Flip-Chip Designs
Author	Jin-Tai Yan, *Zhi-Wei Chen (Chung Hua University, Taiwan)
Page	pp. 588 - 593
Keyword	Flip-chip, RDL routing, IO connection
Abstract	Given a set of IO buffers and bump balls with the capacity constraints between bump balls, an O(n^2) IO assignment and RDL routing algorithm is proposed to assign all the IO connections and minimize the total wirelength while satisfying the capacity constraints, guaranteeing 100% routability if the capacity constraints permit, where n is the number of bump balls in a flip-chip design. Compared with the combination of the greedy IO assignment and our RDL routing, our IO assignment reduces the global wirelength by 7.6% after global routing and improves the routability by 8.8% after detailed routing on average. Compared with the combination of our IO assignment, the single-layer BGA global router [7], and our detailed routing phase, our RDL routing reduces the global wirelength by 15.9% after global routing and improves the routability by 10.6% after detailed routing on average for some tested circuits in reasonable CPU time.

6B-5 (Time: 17:09 - 17:34)

Title	On Using SAT to Ordered Escape Problems
Author	Lijuan Luo, *Martin D.F. Wong (University of Illinois at Urbana-Champaign, United States)
Page	pp. 594 - 599
Keyword	PCB routing, SAT
Abstract	Routing for high-speed boards is largely a time-consuming manual task today. The ordered escape routing problem is one of the key problems in board-level routing, and a Boolean Satisfiability (SAT) based approach is the only solution to this problem so far. In this paper, we first solve the major deficiency of the original SAT formulation so that the escape problem is completely resolved. Then we propose two techniques to extend the SAT approach for large-scale problems. Experimental results on industrial benchmarks show that our methods perform well in terms of both speed and routability.

6B-6 (Time: 17:34 - 17:59)

Title	A Fast Longer Path Algorithm for Routing Grid with Obstacles using Biconnectivity based Length Upper Bound
Author	*Yukihide Kohira, Suguru Suehiro, Atsushi Takahashi (Tokyo Institute of Technology, Japan)
Page	pp. 600 - 605
Keyword	longer path algorithm, upper bound of wire length, routing design of PCB
Abstract	In this paper, we propose a fast longer path algorithm that generates a path for a net in a routing grid so that its length is increased as much as possible. The proposed algorithm uses an upper bound on the length that takes the structure of the routing area into account. Experiments show that our algorithm utilizes a routing area with obstacles efficiently.

Session 6D Designers' Forum: ESL Design Methods
Time: 15:55 - 18:00 Wednesday, January 21, 2009
Location: Room 416+417

6D-1

Title	(Panel Discussion) ESL Design Methods
Author	Moderator: Takashi Hasegawa (Fujitsu Microelectronics Ltd., Japan), Panelists: Simon Bloch (Mentor Graphics Corporation, United States), Ahmed Jerraya (CEA-LETI, France), Gabriela Nicolescu (Ecole Polytechnique de Montreal, Canada), Shigeru Oho (Hitachi, Ltd., Japan), Koichiro Yamashita (Fujitsu Labs. Ltd., Japan)

Thursday, January 22, 2009

Session 3K Keynote Session III
Time: 9:00 - 10:00 Thursday, January 22, 2009
Location: Small Auditorium, 5F
Chair: Kazutoshi Wakabayashi (NEC Corp., Japan)

3K-1 (Time: 9:00 - 10:00)

Title	(Keynote Address) From Restrictive to Prescriptive Design
Author	Leon Stok (IBM Systems and Technology Group, United States)
Abstract	For many generations the hand-off between design and manufacturing has been done by a set of design rules. However, design rule manuals have grown in size from several tens of pages a few generations back to hundreds of pages now. Many more design rules have been added since the end of traditional scaling. Even with all these additional rules, corner cases are found late in the process that can become significant yield or functionality detractors. Restricted Design Rules (RDRs) have been created to simplify the design rules and come up with more manufacturable designs. IBM has practiced RDRs in the last few process generations but is this enough? For the next technology nodes no new exposure tools will be available for mass production and optical scaling is coming to a halt. Computational scaling will be required to extend Moore’s law. In this new era, can we keep on describing design rules in terms of restrictions or do we need another approach?

Session 7A Compilation Techniques for Embedded Systems
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 411+412
Chairs: Hiroyuki Tomiyama (Nagoya University, Japan), Maziar Goudarzi (Kyushu University, Japan)

7A-1 (Time: 10:15 - 10:40)

Title	Thermal-aware Post Compilation for VLIW Architectures
Author	*Wen-Wen Hsieh, TingTing Hwang (Department of Computer Science, National Tsing Hua University, Taiwan)
Page	pp. 606 - 611
Keyword	thermal management, Post Compilation, VLIW architecture
Abstract	Development of a thermal management method to reduce hotspots and to balance the temperature distribution has become an important issue. In this paper, we propose a static thermal management technique at compiler level. The target machine is a VLIW architecture where the compiler is required to schedule instructions to achieve instruction level parallelism (ILP). Two techniques are proposed. The first one is register binding to balance the temperature of the register file by taking both spatial and temporal thermal information into consideration. The second one is forwarding methods, including forwarding-aware architecture and instruction scheduling, to reduce the access count of the register file. The experimental results show that by combining the two techniques, the peak temperature reduction can reach 7.89 °C in the best case and 7.22 °C on average with only 0.9% performance penalty on average.

7A-2 (Time: 10:40 - 11:05)

Title	A Software Solution for Dynamic Stack Management on Scratch Pad Memory
Author	Arun Kannan, *Aviral Shrivastava, Amit Pabalkar, Jong-eun Lee (Arizona State University, United States)
Page	pp. 612 - 617
Keyword	scratch pad, cache, stack, power, compiler
Abstract	We propose a dynamic scratch pad memory (SPM) management scheme for program stack data for processor power reduction. As opposed to previous efforts, our solution does not mandate any hardware changes, does not need profile information or the SPM size at compile time, and seamlessly integrates support for recursive functions. Our technique manages stack frames on SPM using a scratch pad memory manager (SPMM), integrated into the application binary by the compiler. Our experiments on benchmarks from MiBench [18] show average energy savings of 37% along with a performance improvement of 18%.

7A-3 (Time: 11:05 - 11:30)

Title	Compiler-Managed Register File Protection for Energy-Efficient Soft Error Reduction
Author	Jongeun Lee, *Aviral Shrivastava (Arizona State University, United States)
Page	pp. 618 - 623
Keyword	soft error, register file, power-efficient, compiler, register allocation
Abstract	For embedded systems where neither energy nor reliability can be easily sacrificed, we present an energy efficient soft error protection scheme for register files (RF). Unlike previous approaches, our method explicitly optimizes for energy efficiency and exploits the fundamental tradeoff between reliability and energy. While even a simple compiler-managed RF protection scheme is more energy efficient than hardware schemes, this work formulates and solves further compiler optimization problems to significantly enhance the energy efficiency of RF protection schemes by an additional 24%.

7A-4 (Time: 11:30 - 11:55)

Title	Code Decomposition and Recomposition for Enhancing Embedded Software Performance
Author	*Youngchul Cho (SAIT, Samsung Electronics, Republic of Korea), Kiyoung Choi (Seoul National University, Republic of Korea)
Page	pp. 624 - 629
Keyword	code transformation, code decomposition and recomposition, control-flow analysis, multitasking, code serialization
Abstract	Multitasking of concurrent processes implements the concurrency inherited from applications, increasing the utilization of limited resources. It requires an operating system and imposes significant runtime overhead. Serializing multitasking codes removes the need for an operating system as well as the overhead. In this paper, we propose a software synthesis method to transform multitasking codes into a single process code. For this, we decompose multitasking codes into a set of code fractions and then recompose the code fractions into a single process code, preserving the functionality of the original codes. We present two different techniques for the transformation, code partitioning and code covering, and propose a hybrid technique that combines the two.

Session 7B Sequential Design Verification
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 413
Chairs: Yosinori Watanabe (Cadence, United States), Chung-Yang Huang (National Taiwan University, Taiwan)

7B-1 (Time: 10:15 - 10:40)

Title	Dependent Latch Identification in the Reachable State Space
Author	Chen-Hsuan Lin, *Chun-Yao Wang (National Tsing Hua University, Taiwan)
Page	pp. 630 - 635
Keyword	dependent latch, functional dependency, reachability analysis
Abstract	The large number of latches in current designs increases the complexity of formal verification and logic synthesis, since growth in the number of latches causes the state space to explode exponentially. One solution to this problem is to find the functional dependencies among these latches. Then, these latches can be identified as dependent latches or essential latches, where the state space can be constructed using only the essential latches. This paper proposes an approach to find the functional dependencies among latches in a sequential circuit by using SAT solvers with the Craig interpolation theorem. In addition, the proposed approach detects sequential functional dependencies existing in the reachable state space only. Experimental results show that our approach can deal with large sequential circuits with up to 1.5K latches in a reasonable time and simultaneously identify the combinational and sequential dependent latches.

7B-2 (Time: 10:40 - 11:05)

Title	Complete-k-Distinguishability for Retiming and Resynthesis Equivalence Checking without Restricting Synthesis
Author	Nikolaos Liveris, *Hai Zhou (Northwestern University, United States), Prithviraj Banerjee (HP Labs, United States)
Page	pp. 636 - 641
Keyword	sequential equivalence checking, retiming and resynthesis, verification
Abstract	Iterative retiming and resynthesis is a powerful way to optimize sequential circuits, but its massive adoption has been hampered by the hardness of verification. This paper tackles the problem of retiming and resynthesis equivalence checking on a pair of circuits. For this purpose we define the Complete-k-Distinguishability (C-k-D) property for any natural number k based on C-1-D. We show how the equivalence checking problem can be simplified if the circuits satisfy this property and prove that the method is complete for any number of retiming and resynthesis steps. We also provide a way to enforce C-k-D on the circuits without restricting the optimization power of retiming and resynthesis or increasing their complexity. Experimental results demonstrate that enforcing the C-k-D property can speed up the verification process.

7B-4 (Time: 11:30 - 11:55)

Title	Multi-Clock SVA Synthesis without Re-writing
Author	*Jiang Long, Andrew Seawright, Paparao Kavalipati (Mentor Graphics Corp., United States)
Page	pp. 648 - 653
Keyword	SVA, Formal verification, Multi-Clock
Abstract	This paper presents a compilation procedure for synthesizing multi-clock SVA properties for formal verification. The synthesis framework is built upon an existing compilation algorithm for single-clock SVA properties. While we could use the SVA rewriting rules to transform a multi-clock property into a single-clocked property and then apply existing techniques, instead we propose techniques to selectively model the multi-clock operators to produce a smaller checker logic. Through recursive construction and syntactic transformation, we are able to demonstrate the efficiency of the technique, and the generated checker logic is provably equivalent to the rewriting version.

7B-5 (Time: 11:55 - 12:20)

Title	Automatic Formal Verification of Clock Domain Crossing Signals
Author	*Bing Li, Chris Ka-Kei Kwok (Mentor Graphics Corporation, United States)
Page	pp. 654 - 659
Keyword	formal verification, clock domain crossing, assertion logic
Abstract	In this paper, we present an approach that uses formal methods to verify Clock Domain Crossing (CDC) issues in a fully automatic way. First, we discuss various CDC schemes and the corresponding checks that need to be formally verified. Then we demonstrate how to synthesize them into assertion logic. After that, a fully automatic, on-the-fly formal CDC approach is proposed. To the best of our knowledge, this is the first paper discussing fully automatic, on-the-fly formal verification of CDC signals. Experimental results show that our automatic formal CDC, compared with the conventional post-CDC formal CDC, takes much less time but still proves a significant number of CDC checks.

Session 7C Scan Test Generation
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 414+415
Chair: Satoshi Ohtake (NAIST, Japan)

7C-1 (Time: 10:15 - 10:40)

Title	Fast False Path Identification Based on Functional Unsensitizability Using RTL Information
Author	*Yuki Yoshikawa (Hiroshima City University, Japan), Satoshi Ohtake (Nara Institute of Science and Technology, Japan), Tomoo Inoue (Hiroshima City University, Japan), Hideo Fujiwara (Nara Institute of Science and Technology, Japan)
Page	pp. 660 - 665
Keyword	Delay test, False path, RTL, Over-testing reduction
Abstract	In this paper, we propose a method for identifying false paths based on functional unsensitizability of path delay faults. By using RTL structural information, a number of gate level paths are bound into an RTL path, and the bundle of them can be identified in a reasonable amount of time. The identified false paths are useful for reducing the over-testing caused by DFT techniques, such as scan design, and also for area and performance optimization of circuits during logic synthesis. Experimental results show that our proposed method can identify false paths in a few seconds for several benchmarks.

7C-2 (Time: 10:40 - 11:05)

Title	Conflict Driven Scan Chain Configuration for High Transition Fault Coverage and Low Test Power
Author	*Zhen Chen, Boxue Yin, Dong Xiang (Tsinghua University, China)
Page	pp. 666 - 671
Keyword	broadside, fault coverage, low power, conflict
Abstract	Two conflict driven methods and the architecture based on them are presented to improve the fault coverage and reduce test power. By analyzing the functional dependency of test vectors in broad-side and the shift dependency of vectors in skewed-load, some scan cells are selected to operate in the enhanced scan and skewed-load scan modes, while others operate in the traditional broad-side mode. Experimental results show that the fault coverage can reach a level very close to that of enhanced scan.

7C-3 (Time: 11:05 - 11:30)

Title	Dynamic Test Compaction for a Random Test Generation Procedure with Input Cube Avoidance
Author	Irith Pomeranz (Purdue University, United States), *Sudhakar Reddy (University of Iowa, United States)
Page	pp. 672 - 677
Keyword	dynamic test compaction, test generation, stuck-at faults, full-scan
Abstract	A recent approach to test generation avoids the assignment of certain input values in order not to prevent target faults from being detected. The test generation process based on this approach is efficient; however, it generates large test sets. We develop a dynamic test compaction procedure for this approach. Our goal is to reduce the test set size by increasing the number of faults detected by each test vector, while keeping the computational complexity as low as that of the original procedure. This is achieved by avoiding the assignment of certain input values in order not to prevent subsets of faults from being detected.

7C-4 (Time: 11:30 - 11:55)

Title	Detectability of Internal Bridging Faults in Scan Chains
Author	*Fan Yang (University of Iowa, United States), Sreejit Chakravarty, Narendra Devta-Prasanna (LSI Corp., United States), Sudhakar M. Reddy (University of Iowa, United States), Irith Pomeranz (Purdue University, United States)
Page	pp. 678 - 683
Keyword	scan chain, bridge fault, resistive bridge, internal fault, non-feedback bridge
Abstract	We investigate the detection of scan cell internal bridging faults extracted from layout. We show that detection of some zero-resistance non-feedback bridging faults requires two-pattern tests. Half-speed flush tests we proposed earlier detect additional bridging faults. Undetectable faults are classified based on the reasons for their undetectability. Both non-resistive and resistive bridging fault models are considered in this work. A low power supply voltage based test method and IDDQ testing are examined for resistive bridging fault detection.

7C-5 (Time: 11:55 - 12:20)

Title	Fault Modeling and Testing of Retention Flip-Flops in Low Power Designs
Author	*Bing-Chuan Bai (Department of Electrical Engineering, National Taiwan University, Taiwan), Augusli Kifli (Design Development Division, Faraday Technology Corporation, Taiwan), Chien-Mo Li (Department of Electrical Engineering, National Taiwan University, Taiwan), Kun-Cheng Wu (Design Development Division, Faraday Technology Corporation, Taiwan)
Page	pp. 684 - 689
Keyword	Retention, Fault Model, low power, ATPG, Testing
Abstract	The retention flip-flop is one of the most important components in low power designs. This paper presents four new fault models of retention flip-flops. The four faults model the defects that affect the retained value, wakeup time, and sleep time of retention flip-flops. Test patterns for retention flip-flops can be easily generated by ATPG tools. The proposed test methodology is validated by performing experiments on ISCAS’89 benchmark circuits and industrial designs. The experimental results show that the average fault coverage is 98%.

Session 7D Designers' Forum: Analog/RF Circuit Designs
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 416+417
Chair: Makoto Ikeda (University of Tokyo, Japan)

7D-1 (Time: 10:15 - 10:45)

Title	(Invited Paper) Design Methods for Pipeline & Delta-Sigma A-to-D Converters with Convex Optimization
Author	*Kazuo Matsukawa, Takashi Morie, Yusuke Tokunaga, Shiro Sakiyama, Yosuke Mitani, Masao Takayama, Takuji Miki, Akinori Matsumoto, Koji Obata, Shiro Dosho (Panasonic Corp., Japan)
Page	pp. 690 - 695
Keyword	optimization, ADC, pipeline, delta, sigma
Abstract	In system LSIs, the relative cost of analog circuits is increasing because of the rapid cost reduction of digital circuits. To satisfy given specifications in the analog design, including low power and small area, designers have to select an optimal solution among a large combination of alternatives: which architecture should be adopted, what type of transistors should be used, whether digitally assisting technologies should be used or not, and so on. A design based on experience and intuition cannot lead to the optimum in a short time. A comprehensive approach to the optimization, based on circuit theory, is now required. A convex optimization procedure can solve the formulae that represent circuit performance with hundreds of design variables. We have constructed optimization environments for pipelined and delta-sigma analog-to-digital converters (ADCs) in consideration of the digitally assisting techniques and layout constraints. Both 12-bit pipelined ADCs and a 5th-order delta-sigma modulator were designed with the optimizer, and achieved top-ranked power efficiency.

7D-2 (Time: 10:45 - 11:15)

Title	(Invited Paper) A Low-Jitter 1.5-GHz and Large-EMI reduction 10-dBm Spread-Spectrum Clock Generator for Serial-ATA
Author	*Takashi Kawamoto, Masaru Kokubo (Hitachi, Ltd., Japan)
Page	pp. 696 - 701
Keyword	Serial-ATA, PLL, VCO, calibration, SSCG
Abstract	A low-jitter and large-EMI-reduction spread spectrum clock generator (SSCG) for Serial-ATA (SATA) was developed. A low-jitter VCO with a high-frequency limiter was developed to prevent SSCGs from malfunctioning. An autocalibration technique suitable for this VCO was developed to prevent SSCGs from degradation because of process variations. A SATA PHY using a technique for calibrating SSCG was developed to use an inexpensive but large frequency-variation reference oscillator. The fabricated SSCG achieved a 10.0-dB EMI reduction and 1.9-3.3 ps rms jitter by the proposed autocalibration technique. The fabricated SATA PHY achieved less than 400-ppm production-frequency tolerance of reference clocks.

7D-3 (Time: 11:15 - 11:45)

Title	(Invited Paper) RF-Analog Circuit Design in Scaled SoC
Author	*Nobuyuki Itoh, Mototsugu Hamada (Toshiba Corp., Japan)
Page	pp. 702 - 707
Keyword	RFCMOS, SoC, Design
Abstract	Downscaling of process technology increases the development cost of RFCMOS SoC. Therefore, designers have to minimize the number of respins and try to obtain higher yield. RFCMOS SoC consists of RF-analog, mixed-signal, logic and memory circuits. In order to realize a small number of respins and higher yield, the key issues are a robust design methodology for RF-analog circuits and full-chip verification. This paper describes practical techniques corresponding to those issues.

7D-4 (Time: 11:45 - 12:15)

Title	(Invited Paper) An Approach to the RF-LSI Design for Ubiquitous Communication Appliances
Author	*Yuichi Kado, Mitsuru Harada (NTT, Japan)
Page	pp. 708 - 714
Keyword	ubiquitous network, RF, Low power, Digital calibration
Abstract	We propose a “wide area ubiquitous network” as a highly economical and convenient wireless system for providing a wide variety of services. Its basic feature is “wide coverage using ultra low power consumption terminals,” and its specific target is a “5-km cell radius using 10-mW transmission power terminals run on ten-year life batteries.” In this paper we explain the wireless specifications and the low power consumption performance required of wireless terminals used in these ubiquitous networks. We then introduce a design method that harmonizes RF and digital components and an ultra low power consumption LSI design that make it possible to satisfy these requirements.

Session 8A High-Level Design and Scheduling
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 411+412
Chairs: Yuichi Nakamura (NEC Corp., Japan), Keishi Sakanushi (Osaka University, Japan)

8A-1 (Time: 13:30 - 13:55)

Title	Improving Scalability of Model-Checking for Minimizing Buffer Requirements of Synchronous Dataflow Graphs
Author	Nan Guan (Northeastern University, China), *Zonghua Gu (Hong Kong University of Science and Technology, China), Wang Yi (Uppsala University, Sweden), Ge Yu (Northeastern University, China)
Page	pp. 715 - 720
Keyword	SDF, model-checking
Abstract	Synchronous Dataflow (SDF) is a well-known model of computation for dataflow-oriented applications such as embedded systems for signal processing and multimedia. It is important to minimize the buffer size requirements of applications generated from SDF graphs, since memory space is often a scarce resource in these systems due to cost or power consumption constraints. Some authors have proposed to use model-checking for finding the minimum buffer size requirements, but the scalability of model-checking is limited by state space explosion. In this paper, we present several techniques for reducing state space size and improving scalability of model-checking by exploiting problem-specific properties of SDF graphs.
Slides

8A-2 (Time: 13:55 - 14:20)

Title	A Reverse-Encoding-based on-chip AHB Bus Tracer for Efficient Circular Buffer Utilization
Author	*Fu-Ching Yang, Cheng-Lung Chiang, Ing-Jer Huang (National Sun Yat-Sen University, Taiwan)
Page	pp. 721 - 726
Keyword	tracer, reverse encoding, pre-t trace, post-t trace, compression
Abstract	The post-T/pre-T trace refers to the trace captured before/after a target point is reached, respectively. Real-time compression of the post-T trace in a circular buffer is a challenging problem since the initial state of the trace being compressed might be corrupted when wrapping around occurs, which makes it difficult to reconstruct the trace from the incomplete information stored in the circular buffer. This paper proposes an efficient compression algorithm which is capable of compressing both pre-T and post-T traces. The algorithm is based on an innovative reverse encoding scheme that reverses the order of the datum being encoded and the datum being referred to. This algorithm has been successfully implemented in a real-time on-chip AHB bus tracer and has been embedded in a 3D graphics SoC as an application example. The bus tracer costs only 44K gates and runs at 500MHz at 0.13um technology. Experiments have shown that this bus tracer achieves 100% circular buffer utilization and captures trace depths 1.2x and 4.86x those of state-of-the-art related work and conventional industrial approaches, respectively.
Slides

8A-3 (Time: 14:20 - 14:45)

Title	Analyzing and Optimizing Energy Efficiency of Algorithms on DVS Systems: a First Step towards Algorithmic Energy Minimization
Author	*Tetsuo Yokoyama, Gang Zeng, Hiroyuki Tomiyama, Hiroaki Takada (Nagoya University, Japan)
Page	pp. 727 - 732
Keyword	Intratask dynamic voltage frequency scaling, Algorithmic energy minimization, Static voltage scaling, Sorting algorithms
Abstract	The energy efficiency at the algorithmic level on DVS systems and its analysis and optimization methods are presented. Given a problem, the most energy efficient algorithm is not uniquely determined but dependent on multiple factors, including the execution time distribution, intratask dynamic voltage scaling (IntraDVS) policies, the size of intermediate data structure, and the size of inputs. We show that at the algorithmic level principles behind energy optimization and performance optimization are not identical. We propose a metric for evaluating optimal energy efficiency of static voltage scaling (SVS) and a few new effective IntraDVS policies employing data flow information. Experimental results on sorting algorithms show the existence of several tradeoffs in terms of energy consumption. Transforming algorithms by employing problem specific knowledge and data flow information successfully improves their energy efficiency.
Slides

8A-4 (Time: 14:45 - 15:10)

Title	Novel Task Migration Framework on Configurable Heterogeneous MPSoC Platforms
Author	Hao Shen, *Frédéric Pétrot (TIMA Laboratory, INP Grenoble, France)
Page	pp. 733 - 738
Keyword	ASIP, migration framework, heterogeneous, MPSoC
Abstract	Heterogeneous MPSoC architectures can provide higher performance and flexibility with less power consumption and lower cost than homogeneous ones. However, as the processor instruction sets of general heterogeneous MPSoCs are not identical, task migration between two heterogeneous processors is not possible. To enable this function, we propose to build one specific heterogeneous MPSoC platform in which all heterogeneous processors are based on the same core instruction set for the operating system realization. Different extended instructions can be added for different processors to improve the system performance. Tasks can be migrated from one processor to another only if the target processor has all the instructions needed to meet the execution requirement of the task. This paper concentrates on the infrastructure that is necessary to support the scheduling and migration of tasks between the processors. Using a Motion-JPEG case study, we confirm that our task migration framework can achieve a higher processor usage rate and more flexibility.
Slides

Session 8B Emerging Design Methodologies and Applications
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 413
Chair: Chin-Long Wey (National Central University, Taiwan)

8B-1 (Time: 13:30 - 13:55)

Title	A Novel Toffoli Network Synthesis Algorithm for Reversible Logic
Author	*Yexin Zheng, Chao Huang (Virginia Tech, United States)
Page	pp. 739 - 744
Keyword	reversible logic, quantum computing, logic synthesis
Abstract	Reversible logic studies have promising potential in energy lossless circuit design, quantum computation, nanotechnology, etc. Reversible logic features a one-to-one input output correspondence, which makes logic synthesis for reversible functions differ greatly from that for traditional Boolean functions. Exact synthesis methods can provide optimal solutions in terms of the total number of reversible gates in the synthesis results. Unfortunately, they may suffer from long computation time, due to the fact that the search space is likely to grow exponentially as the circuit size increases. Therefore, in this paper, we propose an efficient synthesis heuristic which provides high quality synthesis results for Toffoli networks in more reasonable computation time. We use a weighted, directed graph for reversible function representation and complexity measurement. The proposed algorithm maximally decreases function complexity during synthesis steps. It has the ability to climb out of local minima and guarantees algorithm convergence. The experimental results show that our algorithm can achieve optimal or very close to optimal solutions with computation time several orders of magnitude less than the exact methods. Compared with other heuristics, our method demonstrates superior performance in terms of reversible gate count as well as computation time.

8B-2 (Time: 13:55 - 14:20)

Title	A Cycle-Based Synthesis Algorithm for Reversible Logic
Author	*Zahra Sasanian, Mehdi Saeedi, Mehdi Sedighi, Morteza Saheb Zamani (Amirkabir University of Technology, Iran)
Page	pp. 745 - 750
Keyword	Reversible Logic, Cycle, NCT Library
Abstract	Several algorithms have been proposed for the synthesis of reversible circuits. In this paper, a cycle-based synthesis algorithm for reversible logic, based on the NCT library, is proposed. Direct implementations of a single 3-cycle, a pair of 3-cycles and a pair of 2-cycles are explored and used to propose an efficient Toffoli-based synthesis algorithm for reversible circuits. The synthesis algorithm decomposes a given large cycle into a set of single 3-cycles, pairs of 3-cycles and pairs of 2-cycles and synthesizes the resulting cycles directly. Our experimental results show that the proposed synthesis algorithm can outperform the available 2-cycle-based approach by about 34% on average. In addition, several discussions for the generalization of the proposed method to the 2m-cycles are given.
Slides

8B-3 (Time: 14:20 - 14:45)

Title	Array Like Runtime Reconfigurable MIMO Detectors for 802.11n WLAN: A Design Case Study
Author	Pankaj Bhagawat, Rajballav Dash, *Gwan Choi (Texas A&M University, United States)
Page	pp. 751 - 756
Keyword	MIMO systems, 802.11n, Reconfigurability
Abstract	Future high speed wireless standards such as 802.11n involve Multiple Input Multiple Output (MIMO) antenna systems as a key technology component. Efficient design of the MIMO detector is a challenging task. This is further compounded by the fact that the 802.11n standard requires support for runtime switching between different modulation schemes (or modes). While searching for an appropriate architecture, attention must be paid to application requirements such as required throughput, limits on latency, and reconfiguration between various modes of operation. Important hardware design metrics such as area/power should be optimized over all the operating modes of the detector. In this paper we carry out extensive architectural space exploration to address the issues of power consumption, area, and reconfigurability between different modes of operation while meeting the standard's throughput requirement. Ultimately, we come up with two designs that target low area and low power respectively. We also maintain close to optimum Bit Error Rate (BER), which is vital for any wireless system. The design estimates are based on a 45nm technology library.
Slides

8B-4 (Time: 14:45 - 15:10)

Title	Mapping method for Dynamically Reconfigurable Architecture
Author	*Akira Kuroda, Mayuko Koezuka, Hidenori Matsuzaki, Takashi Yoshikawa, Shigehiro Asano (Toshiba Corporation, Japan)
Page	pp. 757 - 762
Keyword	dynamically reconfigurable architecture, compiler, mapping
Abstract	In this paper, we present a mapping algorithm for our dynamically reconfigurable architecture, which is suitable for stream applications such as H.264. Because our target architecture consists of units with four different configuration formats arranged heterogeneously, it is difficult to apply conventional algorithms. We propose a heuristic mapping algorithm that automatically maps a generic data flow graph onto this complex hardware. We mapped five main functions of an H.264 decoder onto our architecture and compared the results against manual mappings done by an experienced engineer. The results show that three of the five functions are optimized as well as the manual mapping.

8B-5 (Time: 15:10 - 15:35)

Title	A Criticality-Driven Microarchitectural Three Dimensional (3D) Floorplanner
Author	Srinath Sridharan, *Michael DeBole, Guangyu Sun, Yuan Xie, Vijaykrishnan Narayanan (Pennsylvania State University, United States)
Page	pp. 763 - 768
Keyword	3D IC, 3D architecture
Abstract	As technology scales, interconnect delay starts to dominate the performance of modern microprocessors. Three dimensional (3D) chip structures have been proposed as a solution to mitigate the interconnect challenge, with the capability of reducing global wiring lengths. Previous works on 3D microprocessor floorplanning have demonstrated the benefits of such wire reductions. However, in modern microprocessors, not all the global interconnects are equally important: some are critical for the performance and hence the wire reduction via 3D stacking can result in great performance improvement, while others may not be on the critical path and therefore the wire reduction may not have an impact on the performance. In this paper, we propose a floorplanner for 3D chips that organizes functional blocks according to critical microarchitectural communication paths in order to reduce latencies that hinder processor performance. We identify potential triggers, in the form of feedback delays, that are responsible for incurring high communication costs and curb their negative effect on performance by intelligently placing the functional blocks in 3D without compromising on area, overlap power density and thermal reliability. With our criticality-driven 3D placement, there is an average IPC improvement of 22% and up to a 64% improvement over 2D placement. Over criticality-unaware 3D placement, criticality-driven 3D placement shows an average IPC improvement of 8% and up to 25%.

Session 8C Verification, Test, and Yield
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 414+415
Chairs: Yasuo Sato (Hitachi, Ltd., Japan), Sudhakar M. Reddy (University of Iowa, United States)

8C-1 (Time: 13:30 - 13:55)

Title	Self-Adjusting Constrained Random Stimulus Generation Using Splitting Evenness Evaluation and XOR Constraints
Author	Shujun Deng, Zhiqiu Kong, *Jinian Bian, Yanni Zhao (Department of Computer Science and Technology, Tsinghua University, China)
Page	pp. 769 - 774
Keyword	stimulus generation, SAT, even distribution, splitting, XOR constraint
Abstract	Constrained random stimulus generation plays a significant role in hardware verification nowadays, and the quality of the generated stimuli is key to the efficiency of the test process. In this work, we present a linear dynamic method to guide random stimulus generation by SAT solvers. A splitting simplified Min-Distance-Sum evaluation method and an XOR sampling strategy are integrated in the self-adjusting random stimulus generation framework. The evenness of the split groups is evaluated to find the uneven parts. Then, random partial solutions for the uneven parts and random XOR constraints for the other inputs are added to the constraints to get better distributed stimuli. Experimental results show that our method can evaluate the evenness as well as more complex formulae for stimulus generation, and also confirm that the self-adjusting method can improve the fault coverage ratio by more than 17% on average with the same number of stimuli.
Slides

8C-2 (Time: 13:55 - 14:20)

Title	Diagnosing Integrator Leakage of Single-Bit First-Order Delta-Sigma Modulator Using DC Input
Author	*Xuan-Lun Huang, Chen-Yuan Yang, Jiun-Lang Huang (Graduate Institute of Electronics Engineering, Department of Electrical Engineering, National Taiwan University, Taiwan)
Page	pp. 775 - 780
Keyword	analog/mixed-signal testing, diagnosis, design-for-test (DfT), delta-sigma modulation, integrator leakage
Abstract	Integrator leakage is a dominant factor in the SNR (signal-to-noise ratio) loss of delta-sigma modulators. In this paper, we propose a Design-for-Test (DfT) technique to diagnose the integrator leakage of the single-bit first-order delta-sigma modulator. The proposed technique is a low-cost solution; it only adds two multiplexers to the modulator, utilizes a single DC voltage as the test stimulus, and estimates the integrator leakage by analyzing the digitized bit stream. Furthermore, the technique can be easily extended to higher order delta-sigma modulators. Simulation results show that accurate estimations of the integrator leakage can be achieved even in the presence of noise.

8C-3 (Time: 14:20 - 14:45)

Title	Path Selection for Monitoring Unexpected Systematic Timing Effects
Author	*Nicholas Callegari, Pouria Bastani, Li-C. Wang (University of California, Santa Barbara, United States), Sreejit Chakravarty, Alexander Tetelbaum (LSI Corp., United States)
Page	pp. 781 - 786
Keyword	clustering, path delay, path selection, delay test
Abstract	This paper presents a novel path selection methodology to select paths for monitoring unexpected systematic timing effects. The methodology consists of three components: path filtering, path encoding, and path clustering. Given a large set of critical paths, in path filtering, the goal is to filter out paths that cannot be functionally sensitized. To explore the space of unexpected timing effects, a set of features are defined to encode paths into path vectors. Each feature is a source of concern that may potentially contribute to the cause of an unexpected timing effect. Finally, a kernel-based clustering algorithm is employed to group similar path vectors into clusters from which the best representative paths are selected for post-silicon monitoring. The effectiveness of our proposed methodology is demonstrated through experiments on an industrial ASIC design.
Slides

8C-4 (Time: 14:45 - 15:10)

Title	Design for Burn-In Test: A Technique for Burn-In Thermal Stability under Die-to-Die Parameter Variations
Author	Mesut Meterelliyoz, *Kaushik Roy (Purdue University, United States)
Page	pp. 787 - 792
Keyword	burn-in, leakage, thermal, stability, variations
Abstract	Strong temperature dependence of leakage has been a major problem during burn-in test where increased voltages and temperatures are applied to weed out defective parts. Moreover, process variations may result in different temperature profiles in different dies during burn-in. This paper proposes an adaptive design-for-burn-in technique that stabilizes the junction temperature by controlling the leakage power using sleep (supply-gating) transistors for a wide range of ambient temperatures, process variations, thermal resistances and supply voltages.

8C-5 (Time: 15:10 - 15:35)

Title	Test Infrastructure Design for Core-Based System-on-Chip Under Cycle-Accurate Thermal Constraints
Author	*Thomas Edison Yu, Tomokazu Yoneda (Nara Institute of Science and Technology, Japan), Krishnendu Chakrabarty (Duke University, United States), Hideo Fujiwara (Nara Institute of Science and Technology, Japan)
Page	pp. 793 - 798
Keyword	SoC test, TAM design, test scheduling, thermal-aware test, wrapper design
Abstract	We present a thermal-aware test-access mechanism (TAM) design and test scheduling method for system-on-chip integrated circuits. The proposed method uses cycle-accurate power profiles for thermal simulation; it also relies on test-set partitioning, test-interleaving, and bandwidth matching. We use a computationally tractable thermal-cost model to ensure that temperature constraints are satisfied and the test application time is minimized. Simulation results for the ITC’02 SOC Test Benchmarks show that, compared to prior thermal-aware test-scheduling techniques, the proposed method leads to shorter test times under tight temperature constraints.
Slides

Session 8D Designers' Forum: Near-Future SoC Architectures -- Can Dynamically Reconfigurable Processors be a Key Technology?
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 416+417

8D-1

Title	(Panel Discussion) Near-Future SoC Architectures -- Can Dynamically Reconfigurable Processors be a Key Technology?
Author	Moderator: Hideharu Amano (Keio University, Japan), Panelists: Toru Awashima (NEC Corp., Japan), Hisanori Fujisawa (Fujitsu Laboratories Ltd., Japan), Naohiko Irie (Hitachi, Ltd., Japan), Takashi Miyamori (Toshiba Corp., Japan), Tony Stansfield (Panasonic Europe Ltd., Great Britain)

Session 9A Memory Systems Simulation and Optimization
Time: 15:55 - 18:00 Thursday, January 22, 2009
Location: Room 411+412
Chair: Zonghua Gu (Hong Kong University of Science and Technology, Hong Kong)

9A-1 (Time: 15:55 - 16:20)

Title	Soft Lists: A Native Index Structure for NOR-Flash-Based Embedded Devices
Author	*Li-Pin Chang, Chen-Hui Hsu (National Chiao Tung University, Taiwan)
Page	pp. 799 - 804
Keyword	flash memory, embedded system, storage systems, data structure
Abstract	Efficient data indexing is important for embedded devices, because both CPU cycles and energy are very precious resources. Soft lists, a new index structure for embedded devices with NOR flash, are proposed. The challenge of data indexing over NOR flash is that data update and pointer update may recursively trigger each other. Our approach is to allow a bounded number of probes when a pointer is de-referenced. In this way, update and garbage collection are greatly simplified, because data can be moved around physical locations without invalidating any pointers. Even better, search with soft lists is very fast, because the probes provide opportunities for forward random skips. Soft lists are evaluated and compared against a tree-based index, and are shown to be simple but efficient.

9A-2 (Time: 16:20 - 16:45)

Title	Energy-aware Register File Re-Partitioning for Clustered VLIW Architectures
Author	*Chun Jason Xue, Minming Li, Yingchao Zhao, Bessie Hu (City University of Hong Kong, Hong Kong)
Page	pp. 805 - 810
Keyword	register file, partition, energy
Abstract	VLIW architectures have gained acceptance in embedded systems. A traditional monolithic register file is not suitable for VLIW architectures with a large number of functional units. A clustered VLIW architecture is often applied, where the register file is partitioned into a number of smaller register files. Register files represent a substantial portion of the energy consumption in modern processors, and this portion is growing rapidly with wider instruction width. Most of the known clustered VLIW architectures partition the register file evenly among clusters. In this paper, we study the effect on energy consumption of register file re-partitioning on clustered VLIW architectures, where register files are not necessarily partitioned evenly. We present algorithms to compute energy-efficient re-partitions of register files under different conditions. The impact of different intercluster communication models as well as the impact of program behavior on the register file re-partitioning are analyzed in this paper. Experimental results show that energy saving can be achieved using the proposed techniques.

9A-3 (Time: 16:45 - 17:10)

Title	Memory Subsystem Simulation in Software TLM/T Models
Author	*Eric Cheung, Harry Hsieh (University of California, Riverside, United States), Felice Balarin (Cadence Design Systems, United States)
Page	pp. 811 - 816
Keyword	Multiprocessor Simulation, Memory Subsystem Simulation, TLM/T
Abstract	Design of Multiprocessor System-on-a-Chips requires efficient and accurate simulation of every component. Since the memory subsystem accounts for up to 50% of the performance and energy expenditures, it has to be considered in system-level design space exploration. In this paper, we present a novel technique to simulate memory accesses in software TLM/T models. We use a compiler to automatically expose all memory accesses in software and annotate them onto efficient TLM/T models. A reverse address map provides target memory addresses for accurate cache and memory simulation. Simulating at more than 10MHz, our models allow realistic architectural design space explorations on memory subsystems. We demonstrate our approach with a design exploration case study of an industrial-strength MPEG-2 decoder.

9A-4 (Time: 17:10 - 17:35)

Title	Exact and Fast L1 Cache Simulation for Embedded Systems
Author	*Nobuaki Tojo, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki (Waseda University, Japan)
Page	pp. 817 - 822
Keyword	cache, design space exploration, cache simulation, cache optimization
Abstract	In recent years, the gap between the cycle time of processors and memory access time has been increasing. One solution to this problem is to use a cache. But just using a large cache may not reduce the total memory access time. We can obtain an optimal cache configuration which minimizes overall memory access time by varying the three cache parameters: cache set size, line size, and associativity. In this paper, we propose two exact cache simulation algorithms, CRCB1 and CRCB2, based on the Cache Inclusion Property. They realize exact cache simulation while increasing simulation speed dramatically. By using our approach, the number of cache hit/miss judgments required for simulating all the cache configurations is reduced to 31.4%–93.6% compared to conventional approaches. As a result, our proposed approach runs an average of 1.8 times faster and a maximum of 3.3 times faster overall compared to the fastest approach proposed so far. Our proposed exact cache simulation approach achieves the world's fastest L1 cache simulation.
Slides

9A-5 (Time: 17:35 - 18:00)

Title	Accuracy-Aware SRAM: A Reconfigurable Low Power SRAM Architecture for Mobile Multimedia Applications
Author	Minki Cho (Georgia Institute of Technology, United States), Jason Schlessman (Princeton University, United States), *Wayne Wolf, Saibal Mukhopadhyay (Georgia Institute of Technology, United States)
Page	pp. 823 - 828
Keyword	Memory, Power, Variation, Multimedia, SRAM
Abstract	We propose a dynamically reconfigurable SRAM architecture for low-power mobile multimedia applications. Parametric failures due to manufacturing variations limit the opportunities for power saving in SRAM. We show that, using a lower voltage for cells storing low-order bits and a nominal voltage for cells storing higher order bits, ~45% savings in memory power can be achieved with a marginal (~10%) reduction in image quality. A reconfigurable array structure is developed to dynamically reconfigure the number of bits in different voltage domains.

Session 9B Emerging Technologies
Time: 15:55 - 18:00 Thursday, January 22, 2009
Location: Room 413
Chair: Mehdi Baradaran Tahoori (Electrical & Computer Engineering, Northeastern University, United States)

9B-1 (Time: 15:55 - 16:20)

Title	High-Speed Low-Power FinFET Based Domino Logic
Author	Seid Hadi Rasouli (University of California, Santa Barbara, United States), Hanpei Koike (Electroinformatics Group, Nanoelectronics Research Institute, National Institute of Advanced Industrial Science and Technology, Japan), *Kaustav Banerjee (University of California, Santa Barbara, United States)
Page	pp. 829 - 834
Keyword	FinFET, high speed, low power, domino logic, resistive gate
Abstract	This paper introduces a novel FinFET based domino logic, which exploits the exclusive property of the FinFET device (capacitive coupling between front-gate and back-gate in a four-terminal (4T) FinFET) to simultaneously achieve higher performance and lower power consumption. Using a new implementation of the resistive gate, the keeper device is made weaker at the beginning of the evaluation phase to reduce its contention with the pull-down network, but gradually becomes stronger to provide high noise margin.

9B-2 (Time: 16:20 - 16:45)

Title	A Stochastic Perturbative Approach to Design a Defect-Aware Thresholder in the Sense Amplifier of Crossbar Memories
Author	*M. Haykel Ben Jamaa (Ecole Polytechnique Federale de Lausanne, Switzerland), David Atienza (Universidad Complutense de Madrid, Spain), Yusuf Leblebici, Giovanni De Micheli (Ecole Polytechnique Federale de Lausanne, Switzerland)
Page	pp. 835 - 840
Keyword	nanotechnology, crossbar memories, reliability, nanowires
Abstract	The use of nanowire crossbars to build devices with large storage capabilities is a very promising architectural paradigm for forthcoming nanoscale memory devices. However, this new type of memory device raises questions regarding how to test its correct operation. In particular, the variability affecting the decoder is expected to make the test of these new devices very complex. In this paper we present a method to simplify the test of these new devices by using a current thresholder to detect badly addressed nanowires. In the proposed method, the thresholder design is based on a stochastic and perturbative model of the current through the nanowires. Thus, the calculated thresholder parameters are robust against technology variation. As our experimental results indicate, the thresholder error probability is initially only 10⁻⁴, which can be reduced further (up to 60x) by trading off only 35% area overhead in the memory.
Slides

9B-3 (Time: 16:45 - 17:10)

Title	An Alternate Design Paradigm for Robust Spin-Torque Transfer Magnetic RAM (STT MRAM) from Circuit/Architecture Perspective
Author	Jing Li, Patrick Ndai, Ashish Goel, Haixin Liu, *Kaushik Roy (Purdue University, United States)
Page	pp. 841 - 846
Keyword	Spintronics, MRAM, yield
Abstract	Spin-Torque Transfer Magnetic RAM (STT MRAM) is a promising candidate for future embedded applications. It provides desirable memory attributes such as fast access time, low cost, high density and non-volatility. However, variations in process parameters can cause a large number of cells to fail, severely affecting the yield of the memory array. In this paper, we provide a thorough analysis of the impact of design parameters on parametric failures due to process variations. To achieve high memory yield without incurring expensive technology modification, we developed an alternate design paradigm (circuit/architecture co-design) to take advantage of different levels of design hierarchy (circuit and architecture) to improve the yield and memory density. The technique decouples the conflicting design requirements for read stability/writability and density. Consequently, the memory cell failure probability is reduced by 48% and cell area is reduced by 21% with negligible performance degradation (~0.4%).

9B-4 (Time: 17:10 - 17:35)

Title	A Design Methodology and Device/Circuit/Architecture Compatible Simulation Framework for Low-Power Magnetic Quantum Cellular Automata Systems
Author	Charles Augustine, Behtash Behin-Aein, Xuanyao Fong, *Kaushik Roy (Purdue University, United States)
Page	pp. 847 - 852
Keyword	MQCA, Design Methodology, Simulation Framework, low power, CMOS alternative
Abstract	CMOS device scaling is facing a daunting challenge with increased parameter variations and exponentially higher leakage current with every new technology generation. Thus, researchers have started looking at alternative technologies. Magnetic Quantum Cellular Automata (MQCA) is one such alternative, with switching energy close to thermal limits and scalability down to 5nm. In this paper, we present a circuit/architecture design methodology using MQCA. Novel clocking techniques and strategies are developed to improve computation robustness of MQCA systems. We also developed an integrated device/circuit/system compatible simulation framework to evaluate the functionality and the architecture of an MQCA based system and conducted a feasibility/comparison study to determine the effectiveness of MQCAs in digital electronics. Simulation results of an 8-bit MQCA-based Discrete Cosine Transform (DCT) with novel clocking and architecture show up to 290X and 46X improvement (at iso-delay) over 45nm CMOS in energy consumed and area, respectively.

9B-5 (Time: 17:35 - 18:00)

Title	Reconfigurable Double Gate Carbon Nanotube Field Effect Transistor Based Nanoelectronic Architecture
Author	*Bao Liu (The University of Texas at San Antonio, United States)
Page	pp. 853 - 858
Keyword	carbon nanotube, nanoelectronic architecture
Abstract	Carbon nanotubes (CNTs) and carbon nanotube field effect transistors (CNFETs) have demonstrated extraordinary properties and are widely accepted as the building blocks of next generation VLSI circuits. However, no nanoelectronic architecture has been proposed which is solely based on carbon nanotubes and carbon nanotube field effect transistors. In this paper, I propose a novel reconfigurable double gate carbon nanotube field effect transistor (RDG-CNFET), which can be configured as an open, short, FET, or via. Layers of orthogonal carbon nanotubes with electrically bistable molecules sandwiched at each crossing form a dense array of RDG-CNFETs and programmable interconnects, and constitute a nanoelectronic architecture offering manufacturability (via regularity), reliability (via reconfigurability), and performance (via device density). Simulation based on CNFET and molecular device compact models demonstrates superior logic density, reliability, performance and power consumption of the proposed RDG-CNFET based nanoelectronic circuits compared with existing nanoelectronic circuits, e.g., those based on molecular diodes/MOSFETs.

Session 9D Special Session: Dependable VLSI: Device, Design and Architecture -- How should they cooperate ? --
Time: 15:55 - 18:00 Thursday, January 22, 2009
Location: Room 416+417
Organizer: Shuichi Sakai (University of Tokyo, Japan)

9D-1

Title	(Panel Discussion) Dependable VLSI: Device, Design and Architecture -- How should they cooperate ? --
Author	Organizer: Shuichi Sakai (University of Tokyo, Japan), Panelists: Hidetoshi Onodera (Kyoto University, Japan), Hiroto Yasuura (Kyushu University, Japan), James C. Hoe (Carnegie Mellon University, United States)
Page	pp. 859 - 860
Keyword	VLSI, dependability, device, design, architecture
Abstract	VLSI dependability is one of the most significant issues in the modern world. Here the panelists will discuss the key technologies for it as well as the cost optimization among device, design and architecture.
