The 14th Asia and South Pacific Design Automation Conference Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule

 Tuesday, January 20, 2009

1K (Small Auditorium, 5F)
Opening and Keynote Session I

8:30 - 10:00
1A (Room 411+412)
On-Chip Communication Architectures

10:15 - 12:20
1B (Room 413)
Dealing with Thermal Issues

10:15 - 12:20
1C (Room 414+415)
Advances in Behavioral Synthesis
10:15 - 12:20
1D (Room 416+417)
University LSI Design Contest

10:15 - 12:20
2A (Room 411+412)
MPSoC and IP Integration

13:30 - 15:35
2B (Room 413)
Power Analysis and Optimization

13:30 - 15:35
2C (Room 414+415)
Logic and Arithmetic Optimization

13:30 - 15:35
2D (Room 416+417)
Special Session: EDA Acceleration Using New Architectures

13:30 - 15:35
3A (Room 411+412)
System-Level Design of 3D Chips and Configurable Systems

15:55 - 18:00
3B (Room 413)
Advances in Timing Analysis and Modeling

15:55 - 18:00

3D (Room 416+417)
Special Session: Hardware Dependent Software for Multi- and Many-Core Embedded Systems

15:55 - 18:00

 Wednesday, January 21, 2009

2K (Small Auditorium, 5F)
Keynote Session II

9:00 - 10:00
4A (Room 411+412)
System Level Architectures

10:15 - 12:20
4B (Room 413)

10:15 - 12:20
4C (Room 414+415)
Signal/Power Integrity and Simulation

10:15 - 12:20
4D (Room 416+417)
Special Session: Challenges in 3D Integrated Circuit Design

10:15 - 12:20
5A (Room 411+412)
Energy-Aware System Level Design Methodology

13:30 - 15:35
5B (Room 413)
Design for Manufacturing and Reliability

13:30 - 15:35
5C (Room 414+415)

13:30 - 15:35
5D (Room 416+417)
Designers' Forum: Consumer SoC

13:30 - 15:35
6A (Room 411+412)
System Level Simulation and Modeling

15:55 - 18:00
6B (Room 413)
Chip and Package Routing Techniques

15:55 - 18:00

6D (Room 416+417)
Designers' Forum: ESL Design Methods

15:55 - 18:00

 Thursday, January 22, 2009

List of Papers

 Tuesday, January 20, 2009

Session 1K  Opening and Keynote Session I
Time: 8:30 - 10:00 Tuesday, January 20, 2009
Location: Small Auditorium, 5F
Chair: Kazutoshi Wakabayashi (NEC Corp., Japan)

1K-1 (Time: 9:00 - 10:00)
 Title (Keynote Address) Challenges to EDA System from the View Point of Processor Design and Technology Drivers Author Mitsuo Saito (Toshiba Corporation Semiconductor Company, Japan) Abstract Many microprocessors have been developed since the microprocessor was invented in the early 1970s. Microprocessor design has always faced the fiercest competition, so until recently microprocessors were the technology driver for semiconductor technology and design methodology. By discussing the relationship between the design methodology (EDA) revolution and the transition of technology-driver products, based on the well-known Makimoto's Wave hypothesis, the talk highlights what happened to the microprocessor world through typical examples. As a recent example, the positioning of the Cell Broadband Engine as a high-performance computing processor and as flexible hardware is the main topic, along with its performance results and the future trend of microprocessors towards multi-core. It is then explained why SpursEngine, derived from the Cell Broadband Engine, had to be developed. SoCs (combinations of a microprocessor and hardware functional units) for custom applications should be the technology driver for the next decade, which is the first such experience since the microprocessor was born. The special requirements on EDA systems to realize the next wave are predicted. Finally, what may happen to the world when the next wave, the software-centric era, comes (perhaps after 2017) is briefly mentioned.

Session 1A  On-Chip Communication Architectures
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 411+412
Chair: Sri Parameswaran (University of New South Wales, Australia)

1A-1 (Time: 10:15 - 10:40)
 Title Adaptive Inter-router Links for Low-Power, Area-Efficient and Reliable Network-on-Chip (NoC) Architectures Author Avinash Karanth Kodi (Ohio University, United States), Ashwini Sarathy, Ahmed Louri, *Janet Wang (University of Arizona, United States) Page pp. 1 - 6 Keyword network-on-chip, low-power architecture Abstract The increasing wire delay constraints in deep sub-micron VLSI designs have led to the emergence of scalable and modular Network-on-Chip (NoC) architectures. As the power consumption, area overhead and performance of the entire NoC is influenced by the router buffers, research efforts have targeted optimized router buffer design. In this paper, we propose iDEAL - inter-router, dual-function energy and area-efficient links capable of data transmission as well as data storage when required. iDEAL enables a reduction in the router buffer size by controlling the repeaters along the links to adaptively function as link buffers during congestion, thereby achieving nearly 30% savings in overall network power and 35% reduction in area with only a marginal 1-3% drop in performance. In addition, aggressive speculative flow control further improves the performance of iDEAL. Moreover, the significant reduction in power consumption and area provides sufficient headroom for monitoring Negative Bias Temperature Instability (NBTI) effects in order to improve circuit reliability at reduced feature sizes.

1A-2 (Time: 10:40 - 11:05)
 Title Analysis of Communication Delay Bounds for Network on Chips Author *Yue Qian (National University of Defense Technology, China), Zhonghai Lu (Royal Institute of Technology, Sweden), Wenhua Dou (National University of Defense Technology, China) Page pp. 7 - 12 Keyword Network-on-chip, network calculus, delay bound Abstract In network-on-chip, computing the worst-case delay bound for packet delivery is crucial for designing predictable systems, yet it remains an intractable problem due to complicated resource contention scenarios. In this paper, we present an analysis technique to derive the communication delay bound for individual flows. Based on a network contention model, this technique, which is topology independent, employs network calculus theory to first compute the equivalent service curve for individual flows and then calculate their packet delay bound. To exemplify our method, we also present the derivation of a closed-form formula to calculate the delay bound for all-to-one gather communication. Our experimental results demonstrate that the theoretical bounds are correct and tight.

1A-3 (Time: 11:05 - 11:30)
 Title Frequent Value Compression in Packet-based NoC Architectures Author Ping Zhou, Bo Zhao, Yu Du, Yi Xu, Youtao Zhang, *Jun Yang (University of Pittsburgh, United States), Li Zhao (Intel, United States) Page pp. 13 - 18 Keyword compression, NoC, performance, power Abstract The proliferation of Chip Multiprocessors (CMPs) has led to the integration of large on-chip caches. For scalability reasons, a large on-chip cache is often divided into smaller banks that are interconnected through a packet-based Network-on-Chip (NoC). With an increasing number of cores and cache banks integrated on a single die, the on-chip network introduces significant communication latency and power consumption. In this paper, we propose a novel scheme that exploits Frequent Value compression to optimize the power and performance of the NoC. Our experimental results show that the proposed scheme reduces the router power by up to 16.7%, with CPI reduction of as much as 23.5% in our setting. Compared to the recent zero pattern compression scheme, the frequent value scheme saves up to 11.0% more router power and has up to 14.5% more CPI reduction. Hardware design of the FV table and its overhead are also presented.

1A-4 (Time: 11:30 - 11:55)
 Title Simultaneous Data Transfer Routing and Scheduling for Interconnect Minimization in Multicycle Communication Architecture Author Yu-Ju Hong (Purdue University, United States), Ya-Shih Huang, *Juinn-Dar Huang (National Chiao Tung University, Taiwan) Page pp. 19 - 24 Keyword multicycle communication, architectural synthesis, interconnect minimization, resource allocation and sharing, scheduling Abstract In deep submicron technology, wire delay is no longer negligible and is gradually becoming a dominant factor of system performance. Several state-of-the-art architectural synthesis flows have already adopted the distributed register architecture to cope with the increasing wire delay by allowing multicycle communication. In this paper, we formulate channel and register allocation within a refined regular distributed register architecture, named RDR-GRS, as a problem of simultaneous data transfer routing and scheduling for minimizing global interconnect resources. We also present an innovative algorithm with both spatial and temporal considerations. It features both a concentration-oriented path router gathering wire-sharable data transfers and a channel-based time scheduler resolving contentions for wires in a channel, which are in spatial and temporal domain, respectively. The experimental results show that the proposed algorithm can significantly outperform existing related works.

1A-5 (Time: 11:55 - 12:20)
 Title Dynamically Reconfigurable On-Chip Communication Architectures for Multi Use-Case Chip Multiprocessor Applications Author Sudeep Pasricha, *Nikil Dutt, Fadi Kurdahi (University of California, Irvine, United States) Page pp. 25 - 30 Keyword crossbar, on-chip communication, synthesis, low power Abstract The phenomenon of digital convergence and increasing application complexity today is motivating the design of chip multiprocessor (CMP) applications with multiple use cases. Most traditional on-chip communication architecture design techniques perform synthesis and optimization only for a single use-case, which may lead to sub-optimal design decisions for multi-use case applications. In this paper we present a framework to generate a dynamically reconfigurable crossbar-based on-chip communication architecture that can support multiple use-case bandwidth and latency constraints. Our framework generates on-chip communication architectures with a low cost, low power dissipation, and with minimal reconfiguration overhead. Results of applying our framework on several networking CMP applications show that our approach is able to generate a crossbar solution with significantly lower cost (2.4 to 3.8), and lower power dissipation (1.5 to 3.1), compared to the best previously proposed approach.

Session 1B  Dealing with Thermal Issues
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 413
Chairs: Youngsoo Shin (KAIST, Republic of Korea), Li Shang (University of Colorado at Boulder, United States)

1B-1 (Time: 10:15 - 10:40)
 Title Stochastic Thermal Simulation Considering Spatial Correlated Within-Die Process Variations Author *Pei-Yu Huang, Jia-Hong Wu, Yu-Min Lee (National Chiao Tung University, Taiwan) Page pp. 31 - 36 Keyword Statistical IC thermal simulator, Karhunen-Loeve expansion, Leakage power, stochastic Galerkin method Abstract In this work, a statistical thermal simulator including the effect of spatial correlation under within-die process variations is developed. This method utilizes the Karhunen-Loeve (KL) expansion to model the physical parameters, and applies the Polynomial Chaoses (PCs) and the stochastic Galerkin method to tackle the stochastic heat transfer equations. The experimental results not only demonstrate the accuracy and efficiency of the proposed method, but also point out that the stochastic thermal analysis is essential to provide a robust estimation of temperature distribution for the thermal-aware design flow.

1B-2 (Time: 10:40 - 11:05)
 Title A Control Theory Approach for Thermal Balancing of MPSoC Author *Francesco Zanini, David Atienza, Giovanni De Micheli (Ecole Polytechnique Federale de Lausanne, Switzerland) Page pp. 37 - 42 Keyword thermal balancing, MPSoC, control theory, linear quadratic regulator Abstract Thermal balancing and reducing hot-spots are two important challenges facing MPSoC designers. In this work, we model the thermal behavior of an MPSoC as a control theory problem, which enables the design of an optimum frequency controller without depending on the thermal profile of the chip. The optimization performed by the controller is targeted to achieve thermal balancing on the MPSoC thermal profile to avoid hotspots and improve its reliability. The proposed system is able to perform an on-line minimization of chip thermal gradients based on both scheduler requirements and the chip thermal profile. Our comparison with state-of-the-art thermal management approaches shows that the proposed system offers both a better thermal profile (temperature differences higher than 4°C have been reduced from 27.9% to 0.45%) and better performance (up to 32% task waiting time reduction).

1B-3 (Time: 11:05 - 11:30)
 Title Thermal Optimization in Multi-Granularity Multi-Core Floorplanning Author *Michael B. Healy, Hsien-Hsin S. Lee, Gabriel H. Loh, Sung Kyu Lim (Georgia Institute of Technology, United States) Page pp. 43 - 48 Keyword multicore, thermal, floorplanning Abstract Multi-core microarchitectures require a careful balance between many competing objectives to achieve the highest possible performance. Integrated Early Analysis is the consideration of all of these factors at an early stage. Toward this goal, this work presents the first adaptive multi-granularity multi-core microarchitecture-level floorplanner that simultaneously optimizes temperature and performance, and considers memory bus length. We include simultaneous optimization at both the module-level and the core/cache-bank level. Related experiments show that our methodology is effective for optimizing multi-core architectures.

1B-4 (Time: 11:30 - 11:55)
 Title Temperature-Aware Dynamic Frequency and Voltage Scaling for Reliability and Yield Enhancement Author *Yu-Wei Yang, Katherine Shu-Min Li (Department of Computer Science and Engineering, National Sun Yat-Sen University, Taiwan) Page pp. 49 - 54 Keyword DVFS, DVS, oscillation ring, on-chip thermal sensors, on-chip DVFS monitor Abstract A novel oscillation-based on-chip thermal sensing architecture for dynamically adjusting supply voltage and clock frequency in System-on-Chip (SoC) is proposed. It is shown that the oscillation frequency of a ring oscillator reduces linearly as the temperature rises, and thus provides a good on-chip temperature sensing mechanism. An efficient Dynamic Frequency-to-Voltage Scaling (DF2VS) algorithm is proposed to dynamically adjust supply voltage according to the oscillation frequencies of the ring oscillators distributed in the SoC so that thermal sensing can be carried out at all potential hot spots. An on-chip Dynamic Voltage Scaling or Dynamic Voltage and Frequency Scaling (DVS or DVFS) monitor selects the supply voltage level and clock frequency according to the outputs of all thermal sensors. Experimental results on SoC benchmark circuits show the effectiveness of the algorithm: a 10% reduction in supply voltage alone can achieve about 20% power reduction (DVS scheme), and nearly 50% reduction in power is achievable if the clock frequency is also scaled down (DVFS scheme). The chip temperature is reduced accordingly.

1B-5 (Time: 11:55 - 12:20)
 Title A Multiple Supply Voltage Based Power Reduction Method in 3-D ICs Considering Process Variations and Thermal Effects Author Shih-An Yu, *Pei-Yu Huang, Yu-Min Lee (National Chiao Tung University, Taiwan) Page pp. 55 - 60 Keyword Power Optimization, 3D ICs, Thermal analysis, Multiple Supply Voltage Abstract In this paper, a grid-based multiple supply voltage (MSV) assignment method is presented to statistically minimize the total power consumption of 3-D ICs. This method consists of a statistical electro-thermal simulator to get the mean and variance of on-chip, a thermal-aware statistical static timing analysis (SSTA) to take into account the thermal effect on circuit timing, the statistical power delay sensitivity/slack product to be the optimization criterion, and an incremental update of statistical timing to save the runtime. The experimental results demonstrate the effectiveness of the developed methodology and indicate that the consideration of the thermal effect in the circuit simulation is imperative.

Session 1C  Advances in Behavioral Synthesis
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 414+415
Chairs: Shigeru Yamashita (Nara Institute of Science and Technology, Japan), Kiyoung Choi (Seoul National University, Republic of Korea)

1C-1 (Time: 10:15 - 10:40)
 Title FastYield: Variation-Aware, Layout-Driven Simultaneous Binding and Module Selection for Performance Yield Optimization Author *Gregory Lucas, Scott Cromar, Deming Chen (University of Illinois, Urbana-Champaign, United States) Page pp. 61 - 66 Keyword high level synthesis, process variation, ssta Abstract We propose a new variation-aware high-level synthesis binding/module selection algorithm, named FastYield, that takes into consideration multiplexers, functional units, registers, and interconnects. Additionally, FastYield connects with the lower levels of the design hierarchy through its inclusion of a timing driven floorplanner guided by a statistical static timing analysis (SSTA) engine which is used to modify/enhance the synthesis solution. On average, FastYield achieves an 85% performance yield clock period that is 14.5% smaller, and a performance yield gain of 78.9%, when compared to a variation-unaware algorithm.

1C-2 (Time: 10:40 - 11:05)
 Title CriAS: A Performance-Driven Criticality-Aware Synthesis Flow for On-Chip Multicycle Communication Architecture Author *Chia-I Chen, Juinn-Dar Huang (National Chiao Tung University, Taiwan) Page pp. 67 - 72 Keyword Architectural synthesis, multicycle communication architecture, distributed register architecture, criticality-aware, performance-driven Abstract In the deep submicron era, wire delay is no longer negligible and is dominating system performance. Several state-of-the-art architectural synthesis flows have been proposed for distributed register architectures to cope with the increasing wire delay by allowing on-chip multicycle communication. In this paper, we present a new performance-driven criticality-aware synthesis flow, CriAS, targeting regular distributed register architectures. CriAS features a hierarchical binding strategy and a coarse-grained placer for minimizing the number of critical global data transfers. The key ideas are to take time criticality as the major concern at earlier binding stages before the detailed physical placement information is available, and to preserve the locality of closely related critical components in the later placement phase. The experimental results show that a 19% overall performance improvement can be achieved on average as compared to the previous work.

1C-3 (Time: 11:05 - 11:30)
 Title Tolerating Process Variations in High-Level Synthesis Using Transparent Latches Author *Yibo Chen, Yuan Xie (the Pennsylvania State University, United States) Page pp. 73 - 78 Keyword high-level synthesis, process variation, latch Abstract Considering process variability at the behavioral synthesis level is necessary, because it makes some instances of function units slower and others faster, resulting in unbalanced control steps and reducing the attainable frequency of the circuit. To tackle this problem, this paper proposes a methodology to replace edge-triggered flip-flops with transparent latches, to exploit the latches' extra ability to pass time slack and tolerate delay variations. In the paper we first define the timing yield in high-level synthesis, and then present how to replace flip-flops with latches to improve timing yield and mitigate the impact of process variations. We then discuss the benefits and overheads of the replacement, and propose an optimization framework for latch replacement in the high-level synthesis design flow. Experimental results show that the latch-based design can achieve an average of 27% improvement in timing yield compared with a traditional flip-flop based design.

1C-4 (Time: 11:30 - 11:55)
 Title Variation-Aware Resource Sharing and Binding in Behavioral Synthesis Author Feng Wang (Qualcomm Inc., United States), Yuan Xie (Pennsylvania State University, United States), *Andres Takach (Mentor Graphics Corporation, United States) Page pp. 79 - 84 Keyword High level synthesis, resource sharing, resource binding, process variation Abstract As technology scales, the delay uncertainty caused by process variations has become increasingly pronounced in deep submicron designs. In the presence of process variations, worst-case timing analysis may lead to overly conservative synthesis, and may end up using excess resources to guarantee design constraints. In this paper, we propose an efficient variation-aware resource sharing and binding algorithm in behavioral synthesis, which takes into account the performance variations for functional units. The performance yield, which is defined as the probability that the synthesized hardware meets the target performance constraints, is used to evaluate the synthesis result. An efficient metric, called statistical performance improvement, is used to guide resource sharing and binding. The proposed algorithm is integrated into a commercial synthesis framework that transfers design specifications from behavioral descriptions to RTL netlists. The effectiveness of the proposed algorithm is demonstrated with a set of industrial benchmark designs, which consist of blocks that are commonly used in wireless and image processing applications. The experimental results show that our method achieves an average 33% area reduction over traditional methods, which are based on worst-case delay analysis, with an average 10% run time overhead.

1C-5 (Time: 11:55 - 12:20)
 Title Peak Temperature Control in Thermal-aware Behavioral Synthesis through Allocating the Number of Resources Author *Junbo Yu, Qiang Zhou, Jinian Bian (Tsinghua University, China) Page pp. 85 - 90 Keyword resource usage allocation, behavioral synthesis, peak temperature Abstract High temperature adversely impacts the reliability, performance, and leakage power of ICs. In behavioral synthesis, both resource usage allocation and resource binding influence the final thermal profile. Previous thermal-aware behavioral synthesis methods focused only on binding, ignoring allocation. This paper proposes thermal-aware behavioral synthesis with resource usage allocation. According to power density and feedback from thermal simulation, we allocate the number of resources under an area constraint. Our flow effectively controls peak temperature and creates even power densities among resources of "different" and "same" types. Compared to classic behavioral synthesis of peak temperature control, our technique reduces peak temperature by 11.1 on average with no area overhead and only 1.2 more steps latency overhead.

Session 1D  University LSI Design Contest
Time: 10:15 - 12:20 Tuesday, January 20, 2009
Location: Room 416+417
Chairs: Jiun-In Guo (National Chung Cheng University, Taiwan), Hiroki Ishikuro (Keio University, Japan)

1D-1 (Time: 10:15 - 10:20)
 Title A Wireless Real-Time On-Chip Bus Trace System Author *Shusuke Kawai, Takayuki Ikari (Keio University, Japan), Yutaka Takikawa (Renesas Design Corp, Japan), Hiroki Ishikuro, Tadahiro Kuroda (Keio University, Japan) Page pp. 91 - 92 Keyword Inductive coupling, Wireless interface Abstract A 480Mb/s wireless real-time bus trace system with a pulse-based inductive coupling channel array was developed using a 0.25µm CMOS digital process. The size and pitch of the inductor array are determined by numerical calculation to optimize the tradeoff between the channel coupling, crosstalk, and alignment tolerance. A low-power quasi-synchronous system is proposed to obtain a sufficient timing margin for RX pulse detection in the presence of clock skew.

1D-2 (Time: 10:20 - 10:25)
 Title CKVdd: A Self-Stabilization Ramp-Vdd Technique for Dynamic Power Reduction Author Chin-Hsien Wang, *Ching-Hwa Cheng (Feng Chia University, Taiwan), Jiun-In Guo (National Chung Cheng University, Taiwan) Page pp. 93 - 94 Keyword Low power Abstract We propose a self-stabilized ramp voltage technique, CKVdd, to reduce power dissipation in conventional CMOS circuits. Normal CMOS circuits show a power increase proportional to clock frequency. CKVdd results in a lower-than-usual power increase. This technique is easily implemented in CMOS circuits. The CKVdd technique possesses several characteristics that differ from those of current circuits using a Vdd power source. First, CKVdd circuits have lower average and peak current consumption, so it can be applied as a low-power design technique to generic digital circuits. Second, the CKVdd technique combines the power source and clock signal, and can easily implement a power management mechanism. Compared to constant Vdd for multimedia decoders, the proposed technique has 45% of the usual power dissipation and 88% of the usual peak current reduction at the cost of a small delay penalty.

1D-3 (Time: 10:25 - 10:30)
 Title A 300 nW, 7 ppm/℃ CMOS Voltage Reference Circuit based on Subthreshold MOSFETs Author *Ken Ueno (Hokkaido University, Japan), Tetsuya Hirose (Kobe University, Japan), Tetsuya Asai, Yoshihito Amemiya (Hokkaido University, Japan) Page pp. 95 - 96 Keyword Voltage reference, subthreshold, Ultra-low power, process variation Abstract An ultra-low power CMOS voltage reference circuit has been fabricated in a 0.35-um standard CMOS process. The circuit generates a reference voltage based on threshold voltage of a MOSFET at absolute zero temperature. Theoretical analyses and experimental results showed that the circuit generates a quite stable reference voltage of 745 mV on average. The temperature coefficient and line sensitivity of the circuit were 7 ppm/degC and 20 ppm/V, respectively. The power supply rejection ratio (PSRR) was -45 dB at 100 Hz. The circuit consists of subthreshold MOSFETs with a low-power dissipation of 0.3 uW or less and a 1.5-V power supply. Because the circuit generates a reference voltage based on threshold voltage of a MOSFET in an LSI chip, it can be used as an on-chip process monitoring circuit and as a part of the on-chip process compensation circuit systems.

1D-4 (Time: 10:30 - 10:35)
 Title A 100Mbps, 0.19mW Asynchronous Threshold Detector with DC Power-Free Pulse Discrimination for Impulse UWB Receiver Author *Lechang Liu, Yoshio Miyamoto, Zhiwei Zhou, Kosuke Sakaida, Jisun Ryu, Koichi Ishida, Makoto Takamiya, Takayasu Sakurai (The University of Tokyo, Japan) Page pp. 97 - 98 Keyword Ultra-wideband (UWB), UWB receiver, Threshold detector, Pulse discriminator Abstract An asynchronous threshold detector for DC-960MHz band impulse ultra-wideband (UWB) receiver is proposed in this paper. It features a DC power-free pulse discriminator. The proposed architecture in 90nm CMOS achieves the lowest power consumption of 0.19mW and energy consumption of 1.9pJ/bit at 100Mbps in the UWB receiver.
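
Note on the 1D-4 figures: the quoted power and energy numbers are related by simple arithmetic, since energy per bit is average power divided by bit rate. The sketch below only checks that relationship for the values in the abstract; the function name is illustrative and does not come from the paper.

```python
def energy_per_bit_joules(power_watts: float, bit_rate_bps: float) -> float:
    """Energy spent per bit: average power divided by bit rate."""
    return power_watts / bit_rate_bps


# Figures quoted in the 1D-4 abstract: 0.19 mW at 100 Mbps.
power_w = 0.19e-3
bit_rate = 100e6

# Convert joules to picojoules for comparison with the quoted 1.9 pJ/bit.
energy_pj = energy_per_bit_joules(power_w, bit_rate) * 1e12
print(f"{energy_pj:.2f} pJ/bit")
```

Under these inputs the result agrees with the 1.9 pJ/bit stated in the abstract.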

1D-5 (Time: 10:35 - 10:40)

1D-6 (Time: 10:40 - 10:45)
 Title An Inductor-less MPPT Design for Light Energy Harvesting Systems Author Hui Shao, *Chi-Ying Tsui, Wing-Hung Ki (The Hong Kong University of Science and Technology, Hong Kong) Page pp. 101 - 102 Keyword solar cell, power management, MPPT, energy harvesting Abstract An inductor-less maximum power point tracker was designed for light energy harvesting systems. We target systems operating under different lighting environments, where the solar cell voltage may sometimes be low. A charge pump is used to convert the voltage to a higher value. At the same time, the control circuit tunes the charge pump switching frequency to track the system maximum output power point. The design was fabricated and measured to verify the system operation.

1D-7 (Time: 10:45 - 10:50)
 Title A 1 GHz CMOS Comparator with Dynamic Offset Control Technique Author *Xiaolei Zhu (Keio University, Japan), Sanroku Tsukamoto (Fujitsu Laboratories Limited, Japan), Tadahiro Kuroda (Keio University, Japan) Page pp. 103 - 104 Keyword Offset cancel, Comparator, A/D converter Abstract A dynamic offset control technique that employs charge compensation by timing control is proposed for comparator design in scaled CMOS technology. The analysis has been verified by fabricating a 65 nm CMOS 1.2 V 1 GHz comparator that occupies 25 x 65 µm² and consumes 380 µW. The offset control circuits account for 21% of the area and 12% of the power consumption of the whole comparator chip.

1D-8 (Time: 10:50 - 10:55)
 Title Circuit Design Using Stripe-Shaped PMELA TFTs on Glass Author *Keita Ikai, Jinmyoung Kim, Makoto Ikeda, Kunihiro Asada (University of Tokyo, Japan) Page pp. 105 - 106 Keyword TFT, PMELA, Design environment, Glass Abstract A design environment for stripe-shaped PMELA TFTs on glass has been developed and successfully tested. A cell library including standard cells, a logic synthesis database, place-and-route rules, layout parasitic extraction rules, and transistor models have been developed. Measurement results show that the digital circuits designed in this environment work correctly. They also show that the simulation environment is accurate enough for simulating digital circuits.

1D-9 (Time: 10:55 - 11:00)
 Title Low Energy Level Converter Design for Sub-Vth Logics Author Hui Shao, *Chi-Ying Tsui (The Hong Kong University of Science and Technology, Hong Kong) Page pp. 107 - 108 Keyword low energy, sub-Vth logic, level converter Abstract A low energy consumption level converter (LC) is presented for logic voltage conversion from sub-Vth voltage to nominal high voltage. By employing a multi-stage architecture and implementing a unique circuit inside each stage, the proposed LC can reduce its energy consumption by almost 3 orders of magnitude and at the same time ensure the robustness of its function. The LC was fabricated and measured to verify its operation and performance improvement.

1D-10 (Time: 11:00 - 11:05)
 Title A Time-to-Digital Converter with Small Circuitry Author Kazuya Shimizu, *Masato Kaneta, HaiJun Lin, Haruo Kobayashi, Nobukazu Takai (Gunma University, Japan), Masao Hotta (Musashi Institute of Technology, Japan) Page pp. 109 - 110 Keyword Time-to-Digital Converter, Time Domain Analog Circuit, nano CMOS, Digital Assist Analog Technology, Time Measurement Abstract This paper describes a Time-to-Digital Converter (TDC) architecture with small CMOS circuitry as well as fine time resolution and better linearity compared to a conventional vernier delay line TDC. The TDC measures the time interval between two signals and is used in all-digital PLLs and time-domain ADCs. In the proposed TDC, the number of delay buffers is half that of the conventional TDC, which leads to small chip area and low power. The nonlinearity due to delay mismatch among buffers is also reduced, which we have demonstrated by MATLAB simulation. We have also designed and laid out its circuitry using a TSMC 0.18um CMOS process, and the chip measurements show that its principal functions work as expected.

1D-11 (Time: 11:05 - 11:10)
 Title A VDD Independent Temperature Sensor Circuit with Scaled CMOS Process Author *Hiroki Oshiyama, Toshihiro Matsuda, Kei-ichi Suzuki, Hideyuki Iwata (Toyama Prefectural University, Japan), Takashi Ohzone (Dawn Enterprise Co. Ltd., Japan) Page pp. 111 - 112 Keyword CMOS, temperature sensor, voltage reference Abstract A supply voltage (VDD) independent temperature sensor circuit in a standard 90 nm CMOS process achieves predicted errors of about -1.0 to +2.0 °C (-0.6 to +0 °C) over the temperature range of -20 to +100 °C (+20 to +80 °C) with two-point calibration lines. This temperature sensor has good tolerance to changes of VDD from 2.5 to 1.5 V, which correspond to a measurement error of 0.9 °C.

1D-12 (Time: 11:10 - 11:15)
 Title A Current-mode DC-DC Converter using a Quadratic Slope Compensation Scheme Author *Chihiro Kawabata, Yasuhiro Sugimoto (Chuo University, Japan) Page pp. 113 - 114 Keyword DC-DC, converter, quadratic, slope, compensation Abstract A quadratic slope compensation scheme is proposed for a current-mode DC-DC converter to obtain stable frequency characteristics that do not depend on the input and output voltages. A 5 MHz, 500 mA buck DC-DC converter with input voltages ranging from 3.3 V to 2.5 V and output voltages ranging from 2.5 V to 0.5 V was designed and fabricated in a 0.35 um CMOS process to verify the effectiveness of the scheme. Little variation of the frequency characteristics at frequencies above 200 kHz was observed across the various input and output voltages. Slides

1D-13 (Time: 11:15 - 11:20)
 Title Ultra Low-Power ANSI S1.11 Filter Bank for Digital Hearing Aids Author *Yu-Ting Kuo, Tay-Jyi Lin, Yueh-Tai Li (National Chiao Tung University, Taiwan), Chou-Kun Lin (ITRI, STC, Taiwan), Chih-Wei Liu (National Chiao Tung University, Taiwan) Page pp. 115 - 116 Keyword hearing aid, filter bank, low power Abstract This paper presents an ANSI S1.11-compliant filter bank for digital hearing aids, of which the power consumption is minimized through algorithmic, numerical and architectural optimizations. This filter bank has been implemented and fabricated using the TSMC 0.13 µm CMOS technology. The transistor-level simulations show that the power dissipation is only 79 µW for 24 kHz, 18-band audio processing. Slides

1D-14 (Time: 11:20 - 11:25)
 Title An 11,424 Gate-Count Dynamic Optically Reconfigurable Gate Array with a Photodiode Memory Architecture Author Daisaku Seto, *Minoru Watanabe (Shizuoka University, Japan) Page pp. 117 - 118 Keyword ORGAs, FPGAs, optical configuration, multi-context devices Abstract The world's largest 11,424 gate-count dynamic optically reconfigurable gate array VLSI chip, which is based on the use of junction capacitance of photodiodes as configuration memory, has been fabricated. The size and process of the VLSI chip are, respectively, 96.04 mm² and a 0.35 µm 3-metal CMOS process technology. To clarify the availability of the VLSI, this paper shows an experimental result of Slides

1D-15 (Time: 11:25 - 11:30)
 Title A Low-Power FPGA Based on Autonomous Fine-Grain Power-Gating Author *Shota Ishihara, Masanori Hariyama, Michitaka Kameyama (Tohoku University, Japan) Page pp. 119 - 120 Keyword FPGA, asynchronous architecture, power-gating, LEDR encoding, bit-serial architecture Abstract This is the first implementation of an FPGA based on autonomous fine-grain power-gating. To cut the power consumption of the clock network and detect the activity of each cell efficiently, an asynchronous architecture is fully exploited. The proposed FPGA is fabricated in a 90 nm CMOS process with dual threshold voltages. It is more power-efficient than the synchronous FPGA at less than 30% utilization. Slides

1D-16 (Time: 11:30 - 11:35)
 Title A 52-mW 8.29mm2 19-mode LDPC Decoder Chip for Mobile WiMAX Applications Author *Xin-Yu Shih, Cheng-Zhou Zhan, Cheng-Hung Lin, An-Yeu (Andy) Wu (National Taiwan University, Taiwan) Page pp. 121 - 122 Keyword LDPC, Mobile WiMAX, Multi-mode Abstract This paper presents an LDPC decoder chip supporting all 19 modes in Mobile WiMAX applications. An efficient IC design strategy is proposed to reduce decoding latency by 31.25% and raise the hardware utilization ratio from 50% to 75%. In addition, we propose a new early termination scheme that can dynamically adjust the iteration number. The multi-mode chip, implemented in an 8.29 mm² die area, was measured to operate at up to 83.3 MHz with only 52 mW power consumption. Slides

1D-17 (Time: 11:35 - 11:40)
 Title A Full-Synthesizable High-Precision Built-In Delay Time Measurement Circuit Author Ming-Chien Tsai, *Ching-Hwa Cheng (Feng Chia University, Taiwan) Page pp. 123 - 124 Keyword Built-in Delay Test, delay fault diagnosis, Vernier Delay Line Abstract Delay testing has become a major issue for manufacturing advanced Systems on a Chip. Automatic Test Equipment and scan techniques are usually applied in delay testing. However, the circuits under test have many circuit paths and dependent input patterns; it is hard to measure delay times accurately, especially when debugging small delay defects. We propose a Built-In Delay Measurement (BIDM) circuit that is modified from Vernier Delay Lines. All-digital BIDMs with small area overhead can be easily embedded within circuits under test. BIDMs can be used to record the data propagation delay times within circuit path segments, for delay testing, diagnosis, and calibration requirements internal to the chip. Our BIDM was implemented in a 32-bit error correction circuit on a chip using TSMC 0.18 µm technology. Instrument measurements show that the BIDM chip correctly reported the CUT segment path delay times. The chip measurement results were a 95.83% match to the post-layout SPICE simulation values. This BIDM makes it possible to debug small delay defects in chips. Slides

1D-18 (Time: 11:40 - 11:45)
 Title A Dynamic Quality-Scalable H.264 Video Encoder Chip Author *Hsiu-Cheng Chang, Yao-Chang Yang, Jia-Wei Chen (National Chung Cheng University, Taiwan), Ching-Lung Su (National Yunlin University of Science and Technology, Taiwan), Cheng-An Chien, Jiun-In Guo, Jinn-Shyan Wang (National Chung Cheng University, Taiwan) Page pp. 125 - 126 Keyword Quality-Scalable, H.264, Encoder, real-time Abstract This paper proposes a dynamic quality-scalable H.264 video encoder that comprises 470Kgates and 13.3Kbytes SRAM using 1P8M 0.13um CMOS technology. Exploiting parameterized algorithms for motion estimation and intra prediction, the proposed design can dynamically configure the encoding modes with the design trade-off between power consumption and video quality for various video encoding applications. It achieves real-time H.264 video encoding on CIF, D1, and HD720@30fps with 7mW-25mW, 27mW-162mW, and 122mW-183mW power dissipation in different quality modes. Slides

1D-19 (Time: 11:45 - 11:50)
 Title A High Performance LDPC Decoder for IEEE802.11n Standard Author *Wen Ji, Yuta Abe, Takeshi Ikenaga, Satoshi Goto (Waseda University, Japan) Page pp. 127 - 128 Keyword LDPC, message passing algorithm, partially-parallel LDPC decoder Abstract In this paper, we propose a partially-parallel irregular LDPC decoder for the IEEE 802.11n standard. The design is based on a novel sum-delta message passing schedule to achieve high throughput and low area cost. We further improve the design with a pipeline structure and parallel computation. The synthesis result in TSMC 0.18 µm CMOS technology demonstrates that for the (648,324) irregular LDPC code, our decoder achieves a 7.5X improvement in throughput, reaching 402 Mbps at a frequency of 200 MHz, with 11% area reduction. Slides

1D-20 (Time: 11:50 - 11:55)
 Title Design and Chip Implementation of the Ubiquitous Processor HCgorilla Author *Masa-aki Fukase, Kazunori Noda, Atsuko Yokoyama, Tomoaki Sato (Hirosaki University, Japan) Page pp. 129 - 130 Keyword Processor, Wave-pipeline, Ubiquitous Abstract HCgorilla is a hardware cryptography-embedded multimedia mobile processor that follows the parallelism of multicore and multiple pipelines dedicated for ubiquitous computing. Multiple pipelines are composed of media and cipher pipes. Each pipe is partly wave-pipelined to achieve power conscious high performance. Media pipes have user friendly functions due to Java compatibility. Random number addressing by cipher pipes is suited to cryptographic streaming. This paper describes the design and implementation of HCgorilla chips by using CMOS standard cell libraries Slides

1D-21 (Time: 11:55 - 12:00)
 Title An 8.69 Mvertices/s 278 Mpixels/s Tile-based 3D Graphics SoC HW/SW Development for Consumer Electronics Author *Liang-Bi Chen, Ruei-Ting Gu, Wei-Sheng Huang, Chien-Chou Wang, Wen-Chi Shiue, Tsung-Yu Ho, Yun-Nan Chang, Shen-Fu Hsiao, Chung-Nan Lee, Ing-Jer Huang (Department of Computer Science and Engineering, National Sun Yat-Sen University, Taiwan) Page pp. 131 - 132 Keyword 3D Graphics, SoC, Performance Tuning, Consumer Electronics, Tile-based Abstract This paper presents an 8.69 Mvertices/s, 278 Mpixels/s, 15.7 mm² tile-based 3D graphics SoC HW/SW supporting OpenGL ES 1.0 running at 139 MHz. The SoC also includes embedded circuitry to monitor run-time characteristics, detect bus protocol errors and inefficiencies, and capture bus traces at various abstraction levels with a compression ratio of up to 98%. Slides

1D-22 (Time: 12:00 - 12:05)
 Title A Multi-Task-Oriented Security Processing Architecture with Powerful Extensibility Author *Dan Cao, Jun Han, Xiao-yang Zeng, Shi-ting Lu (Fudan University, China) Page pp. 133 - 134 Keyword security processing, multi-core, SoC Abstract A multi-task-oriented security processing architecture is presented in this paper. This architecture contains a host microprocessor and multiple security processors (SPs). Each SP can integrate dedicated Crypto-Engines, which provides functional extensibility. Performance scalability and multi-task parallelism can be enhanced by increasing the number of SPs on the system bus. It is demonstrated that this architecture greatly improves the system efficiency. A test chip is implemented in SMIC 0.18 um standard CMOS technology, and its functionality is well verified. Slides

1D-23 (Time: 12:05 - 12:10)
 Title A Delay-Optimized Universal FPGA Routing Architecture Author *Fang Wu, Huowen Zhang, Lei Duan, Jinmei Lai, Yuan Wang, Jiarong Tong (Fudan University, China) Page pp. 135 - 136 Keyword Routing, Delay, GRB Abstract A universal FPGA routing architecture is presented, which ensures that every module in the FPGA, including CLBs and IOBs, has a uniform interconnect architecture and that the load of lines is equally distributed. As a result, this architecture is highly repeatable and the signal delay is predictable and regular. Furthermore, the realization of the Programmable Interconnect Point (PIP) and the BUFFER driver is also optimized to improve the signal delay by up to 5%. The test results of the example chip show the reasonableness of these ideas. Slides

Session 2A  MPSoC and IP Integration
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 411+412
Chairs: Nozomu Togawa (Waseda University, Japan), Marcello Lajolo (NEC Laboratories America, United States)

2A-1 (Time: 13:30 - 13:55)
 Title Timing Variation-Aware Task Scheduling and Binding for MPSoC Author *HaNeul Chon, Taewhan Kim (Seoul National University, Republic of Korea) Page pp. 137 - 142 Keyword Timing variation, task scheduling, binding Abstract This work addresses the new problem of timing variation-aware task scheduling and binding (TSB) for multiprocessor system-on-chip (MPSoC) architecture in the system-level design, where tasks have full flexibility of resource (i.e., processor) sharing to meet the design constraints. With the timing variation of processors' clock speed, it has been observed that considering the effects of resource sharing on the resulting performance yield computation is critically important for accurate design space exploration and evaluation in the system-level design. Unfortunately, previous statistical static timing analysis (SSTA) at the system level has never considered resource sharing in computing the performance yield, or has oversimplified it by employing gate-level SSTAs. In this work, we overcome those limitations by proposing an effective SSTA technique called TSB-SSTA, which schedules and binds tasks to resources in the presence of resource sharing. We also propose a timing variation-aware (TV) framework, called TSB-TV, tightly integrating TSB-SSTA. We have tested the effectiveness of our approach through experimentation with benchmarks, which showed an average of 56.1% improvement in performance yield over conventional methods.

2A-2 (Time: 13:55 - 14:20)
 Title Flexible and Abstract Communication and Interconnect Modeling for MPSoC Author *Katalin Popovici (TIMA Laboratory, France), Ahmed Jerraya (CEA-LETI, Minatec, France) Page pp. 143 - 148 Keyword communication, exploration, modeling, NoC, H.264 Abstract Current multiprocessor systems on chip (MPSoC) architectures integrate a massive number of IPs that need to exchange data in complex and diverse synchronization ways. The key challenge when designing MPSoC is that the communication architecture needs to be decided at the beginning of the design, before all the details about mapping the application on the architecture are known. These early decisions cause two difficulties: how to select the best communication architecture and how to estimate the effect of mapping the application onto the communication resources. In this paper, we propose high level communication models that allow early accurate performance estimation of both communication architecture and communication mapping. We applied the proposed modeling methods to analyze the impact on performance in case of two network topologies and several communication mapping schemes for the H.264 Encoder application. Slides

2A-3 (Time: 14:20 - 14:45)
 Title Partial Order Method for Timed Simulation of System-Level MPSoC Designs Author *Eric Cheung, Harry Hsieh (University of California, Riverside, United States), Felice Balarin (Cadence Design Systems, United States) Page pp. 149 - 154 Keyword Partial Order Simulation, SystemC, MPSoC Abstract Current discrete event simulators incur heavy simulation overhead to switch between different components in order to simulate them in strictly chronological order. Therefore, timed simulation is significantly slower than un-timed simulation. By simply adding delays in the components and communication channels, our timed MPEG-2 decoder simulates more than 14 times slower than an un-timed simulation. In this paper, we propose a partial order method to speed up timed simulation by relaxing the order in which the components are simulated. With the partial order method, a component is not required to schedule a channel access if both the behavioral and timing results of the access are known. The simulation switches less frequently, hence the simulation overhead is reduced. We show that the partial order method can be used in complex system-level simulation such as MPSoC implementations of the MPEG-2 decoder. In our experiments, the partial order method provides more than 10 times speedup over regular discrete event simulation for timed simulation.

2A-4 (Time: 14:45 - 15:10)
 Title A UML-Based Approach for Heterogeneous IP Integration Author *Zhenxin Sun, Weng-Fai Wong (National University of Singapore, Singapore) Page pp. 155 - 160 Keyword System level design, UML Abstract With increasing availability of predefined IP (Intellectual Property) blocks and inexpensive microprocessors, embedded system designers are faced with more design choices than ever. On the other hand, there is constant pressure to reduce the time to market. However, as the IP blocks are provided by different vendors, they differ in their interfaces. In order to improve design reuse, methods for combining heterogeneous IP blocks with incompatible protocols and I/Os are needed. In this paper, we propose an interface synthesis method that uses the UML notation to model the interfaces of predefined components and glue logic within the standard OCP-compliant environment. We built a code generator to produce the interface adapters from the UML models. We experimented with our approach using simple-bus and an MPEG-2 decoder as case studies. Slides

Session 2B  Power Analysis and Optimization
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 413
Chair: Masanori Hashimoto (Dept. ISE, Osaka University, Japan)

2B-1 (Time: 13:30 - 13:55)
 Title Statistical Modeling and Analysis of Chip-Level Leakage Power by Spectral Stochastic Method Author Ruijing Shen, Ning Mi, *Sheldon Tan (University of California at Riverside, United States), Yici Cai, Xianlong Hong (Tsinghua University, China) Page pp. 161 - 166 Keyword Leakage analysis, orthogonal polynomials, variational analysis Abstract In this paper, we present a novel statistical full-chip leakage power analysis method. The new method can provide a general framework to derive the full-chip leakage current or power in a closed form in terms of the variational parameters, such as the channel length, the gate oxide thickness, etc. It can accommodate various spatial correlations. The new method employs the orthogonal polynomials to represent the variational gate leakages in a closed form first, which is generated by a fast multi-dimensional Gaussian quadrature method. The total leakage currents then are computed by simply summing up the resulting orthogonal polynomials (their coefficients). Unlike many existing approaches, no grid-based partitioning and approximation are required. Instead, the spatial correlations are naturally handled by orthogonal decompositions. The proposed method is very efficient and it becomes linear when there exist strong spatial correlations. Experimental results show that the proposed method is about 10X faster than the recently proposed method (Chang, DAC'05) with constant better accuracy.

2B-2 (Time: 13:55 - 14:20)
 Title On the Futility of Statistical Power Optimization Author Jason Cong, Puneet Gupta, *John Lee (University of California, Los Angeles, United States) Page pp. 167 - 172 Keyword gate sizing, optimization, statistical power Abstract In response to the increasing variations in integrated-circuit manufacturing, the current trend is to create designs that take these variations into account statistically. In this paper we try to quantify the difference between the statistical and deterministic optima of leakage power while making no assumptions about the delay model. We develop a framework for deriving a theoretical upper-bound on the suboptimality that is incurred by using the deterministic optimum as an approximation for the statistical optimum. On average, the bound is 2.4% for a suite of benchmark circuits in a 45nm technology. We further give an intuitive explanation and show, by using solution rank orders, that the practical suboptimality gap is much lower. Therefore, the need for statistical power modeling for the purpose of optimization is questionable. Slides

2B-3 (Time: 14:20 - 14:45)
 Title Timing Driven Power Gating in High-Level Synthesis Author Shih-Hsu Huang, *Chun-Hua Cheng (Chung Yuan Christian University, Taiwan) Page pp. 173 - 178 Keyword Clock Skew Scheduling, High-Level Synthesis, Low Power Design, Resource Binding, Standby Leakage Minimization Abstract The power gating technique is useful in reducing standby leakage current, but it increases the gate delay. For a functional unit, its maximum allowable delay (for a target clock period) limits the smallest standby leakage current its power gating can achieve. In this paper, we point out: in the high-level synthesis of a non-zero clock skew circuit, the resource binding (including functional units and registers) has a large impact on the maximum allowable delays of functional units; as a result, different resource binding solutions have different standby leakage currents. Based on that observation, we present the first work to draw up the timing driven power gating in high-level synthesis. Given a target clock period and design constraints, our goal is to derive the minimum-standby-leakage-current resource binding solution. Benchmark data show: compared with the existing design flow, our approach can greatly reduce the standby leakage current without any overhead. Slides

2B-4 (Time: 14:45 - 15:10)
 Title Congestion-Aware Power Grid Optimization for 3D Circuits Using MIM and CMOS Decoupling Capacitors Author Pingqiang Zhou, Karthikk Sridharan, *Sachin S. Sapatnekar (ECE Dept, University of Minnesota, United States) Page pp. 179 - 184 Keyword 3D circuit, power grid, MIM decap, leakage power, congestion Abstract In three-dimensional (3D) chips, the amount of supply current per package pin is significantly more than in two-dimensional (2D) designs. Therefore, the power supply noise problem, already a major issue in 2D, is even more severe in 3D. CMOS decoupling capacitors (decaps) have been used effectively for controlling power grid noise in the past, but with technology scaling, they have grown increasingly leaky. As an alternative, metal-insulator-metal (MIM) decaps, with high capacitance densities and low leakage current densities, have been proposed. In this paper, we explore the tradeoffs between using MIM decaps and traditional CMOS decaps, and propose a congestion-aware 3D power supply network optimization algorithm to optimize this tradeoff. The algorithm applies a sequence-of-linear-programs based method to find the optimum tradeoff between MIM and CMOS decaps. Experimental results show that power grid noise can be more effectively optimized after the introduction of MIM decaps, with lower leakage power and little increase in the routing congestion, as compared to a solution using CMOS decaps only. Slides

2B-5 (Time: 15:10 - 15:35)
 Title Incremental and On-demand Random Walk for Iterative Power Distribution Network Analysis Author *Yiyu Shi, Wei Yao (Electrical Engineering Dept., University of California, Los Angeles, United States), Jinjun Xiong (IBM Thomas J. Watson Research Center, United States), Lei He (Electrical Engineering Dept., University of California, Los Angeles, United States) Page pp. 185 - 190 Keyword random walk, power grid, simulation, incremental analysis Abstract Power distribution networks (PDNs) are designed and analyzed iteratively. Random walk is among the most efficient methods for PDN analysis. We develop in this paper an incremental and on-demand random walk to reduce iterative analysis time. During each iteration, we map the design changes as positive or negative random walks for observed nodes. To update PDN analysis result, we only need to apply these extra positive or negative walks, instead of doing all walks from scratch. We show that different execution orders for these walks do not affect accuracy but do affect the runtime because of the cancellation between positive and negative walks. Considering this cancellation effect, we optimize the walk order by solving a min-energy electromagnetic particles placement problem and, as a result, further reduce the runtime to about 8 compared to the worst order. Experiments show that, compared to random walk from scratch, our algorithm has similar accuracy but reduces the iterative analysis time by up to 18 for on-chip PDN sizing, and by up to 13 for package ball assignment with substrate routing. In addition, our incremental random walk has a linear time complexity with respect to the number of observed nodes and is more suitable for on-demand analysis, compared to random walk from scratch and its big warm-up cost.

Session 2C  Logic and Arithmetic Optimization
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 414+415
Chairs: Dale Edwards (Semiconductor Research Corp., United States), Hiroyuki Higuchi (Fujitsu Microelectronics Limited, Japan)

2C-1 (Time: 13:30 - 13:55)
 Title SAT-Controlled Redundancy Addition and Removal --- A Novel Circuit Restructuring Technique Author Chi-An Wu, Ting-Hao Lin, Shao-Lun Huang, *Chung-Yang (Ric) Huang (National Taiwan University, Taiwan) Page pp. 191 - 196 Keyword Redundancy Addition and Removal, SAT, Logic Restructuring Abstract We proposed a novel Boolean Satisfiability (SAT)-controlled redundancy addition and removal (RAR) algorithm to resolve the performance and quality problems of the previous RAR approaches. With the introduction of modern SAT techniques, such as efficient Boolean constraint propagation (BCP), conflict-driven learning, and flexible decision procedure, our RAR engine can identify 10x more alternative wires/gates while achieving 70% reduction in runtime.

2C-2 (Time: 13:55 - 14:20)
 Title On Improved Scheme for Digital Circuit Rewiring and Application on Further Improving FPGA Technology Mapping Author Fu Shing Chim, *Tak Kei Lam, Yu Liang Wu (The Chinese University of Hong Kong, Hong Kong) Page pp. 197 - 202 Keyword Rewiring, Graph-based, FPGA, Technology Mapping, VLSI CAD Abstract The digital circuit rewiring technique has been shown to be one of the most powerful logic transformation methods being able to further improve some already excellent results on many EDA problems. In this work a new hybrid rewiring approach that can enjoy advantages from both ATPG-based and graph-based rewiring is proposed. Our hybrid approach utilizes structural characteristics and ATPG technique to perform quick alternative wires identification inside circuits. Experimental results suggest that our hybrid engine is able to achieve about 50% of alternative wires coverage when compared with ATPG-based rewiring engine with 4% of runtime only. For some problems only requiring a good-enough and very quick solution, this new rewiring technique may serve as a useful alternative. Slides

2C-3 (Time: 14:20 - 14:45)
 Title Hybrid LZA: A Near Optimal Implementation of the Leading Zero Anticipator Author Amit Verma (National Institute of Technology, Rourkela, India), *Ajay K. Verma, Philip Brisk, Paolo Ienne (Ecole Polytechnique Federale de Lausanne, Switzerland) Page pp. 203 - 209 Keyword leading zero anticipator, Error detection, Adder Abstract The Leading Zero Anticipator (LZA) is one of the main components used in floating point addition. It tends to be on the critical path, so it has attracted the attention of many researchers in the past. Most LZAs used today can be classified in two categories: exact and inexact. Inexact LZAs are normally preferred due to their shorter critical paths and reduced complexity; however, the inexact LZA requires an additional correction stage. In this paper we present a new LZA architecture that combines ideas taken from prior exact and inexact LZAs. Our new LZA improves the delay of floating point addition by 7-10% compared to state-of-the-art techniques and also reduces hardware area in most cases. We also establish theoretical lower bounds on the delay of an LZA and we show that our LZA is very close to these bounds.

2C-4 (Time: 14:45 - 15:10)
 Title An Optimized Design for Serial-Parallel Finite Field Multiplication over GF(2^m) Based on All-One Polynomials Author Pramod Kumar Meher (Nanyang Technological University, Singapore), *Yajun Ha (National University of Singapore, Singapore), Chiou-Yng Lee (Lunghwa University of Science and Technology, Taiwan) Page pp. 210 - 215 Keyword finite field multiplication, VLSI, architecture optimization Abstract In this paper, we derive a recursive algorithm for finite field multiplication over GF(2^m) based on irreducible all-one-polynomials (AOP), where the modular reduction of degree is achieved by cyclic left-shift without any logic operations. A regular and localized bit level dependence graph (DG) is derived from the proposed algorithm and mapped into an array architecture, where the modular reduction is achieved by a serial-in parallel-out shift-register. The multiplier is optimized further to perform the accumulation of partial products by the T flip flops of the output register without XOR gates. It is interesting to note that the optimized structure consists of an array of (m+1) AND gates between an array of (m+1) D flip flops and an array of (m+1) T flip flops. The proposed structure therefore involves significantly less area and less computation time compared with the corresponding existing structures.
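
The property the abstract relies on can be illustrated in software. For an all-one polynomial p(x) = x^m + ... + x + 1, (x + 1)p(x) = x^(m+1) + 1 over GF(2), so x^(m+1) = 1 modulo p(x), and multiplying by x is a cyclic left shift of an (m+1)-bit word. The sketch below is a bit-level software illustration of that well-known identity, not the authors' array architecture; the function names are hypothetical, and a plain shift-and-reduce multiplier is included only as a cross-check.

```python
def aop_multiply(a, b, m):
    """Multiply a and b in GF(2^m) defined by the all-one polynomial.

    a and b are integers below 2**m whose bits are polynomial coefficients.
    Partial products are cyclic shifts of an (m+1)-bit word, so reduction
    of degree needs no logic; one final conditional XOR folds x^m back.
    """
    n = m + 1
    mask = (1 << n) - 1
    acc = 0
    for i in range(m):
        if (b >> i) & 1:
            rotated = ((a << i) | (a >> (n - i))) & mask  # cyclic left shift by i
            acc ^= rotated
    if (acc >> m) & 1:
        acc ^= mask  # x^m = x^(m-1) + ... + x + 1 modulo the AOP
    return acc


def reference_multiply(a, b, m):
    """Schoolbook carry-less multiply followed by reduction modulo the AOP."""
    poly = (1 << (m + 1)) - 1
    prod = 0
    for i in range(m):
        if (b >> i) & 1:
            prod ^= a << i
    for deg in range(2 * m - 2, m - 1, -1):
        if (prod >> deg) & 1:
            prod ^= poly << (deg - m)
    return prod


if __name__ == "__main__":
    m = 4  # x^4 + x^3 + x^2 + x + 1 is irreducible over GF(2)
    ok = all(aop_multiply(a, b, m) == reference_multiply(a, b, m)
             for a in range(1 << m) for b in range(1 << m))
    print(ok)
```

The loop accumulates one rotated copy of the multiplicand per set bit of the multiplier, which mirrors a serial accumulation of partial products in general terms only.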

Session 2D  Special Session: EDA Acceleration Using New Architectures
Time: 13:30 - 15:35 Tuesday, January 20, 2009
Location: Room 416+417
Organizer: Damir A. Jamsek (IBM Corp., United States)

2D-1 (Time: 13:35 - 14:15)
 Title (Invited Paper) Aspects of GPU for General Purpose High Performance Computing Author *Reiji Suda (The University of Tokyo/JST CREST, Japan), Takayuki Aoki (Tokyo Institute of Technology/JST CREST, Japan), Shoichi Hirasawa (University of Electro-Communications/JST CREST, Japan), Akira Nukada (Tokyo Institute of Technology/JST CREST, Japan), Hiroki Honda (University of Electro-Communications/JST CREST, Japan), Satoshi Matsuoka (Tokyo Institute of Technology/JST CREST/NII, Japan) Page pp. 216 - 223 Keyword GPU computing, performance evaluation, scheduling algorithm, task parallel paradigm Abstract We discuss hardware and software aspects of GPGPU, specifically focusing on NVIDIA cards and CUDA, from the viewpoint of parallel computing. The major weak points of GPUs compared with the newest supercomputers are identified and summarized as four points: large SIMD vector length, small memory, absence of a fast L2 cache, and high register spill penalty. On the software side, we derive an optimal scheduling algorithm for latency hiding of host-device data transfer, and discuss SPMD parallelism on GPUs.

2D-2 (Time: 14:15 - 14:55)
 Title (Invited Paper) Designing and Optimizing Compute Kernels on Nvidia GPUs Author *Damir A. Jamsek (IBM Research Division, United States) Page pp. 224 - 229 Keyword GPU, NVIDIA Abstract The availability of high performance compute capability in NVIDIA GPUs has expanded their use in CAD environments. We will describe the basic compute models, including host/device programming models and device multi-thread programming models, as well as optimization and performance tuning techniques.

2D-3 (Time: 14:55 - 15:35)
 Title (Invited Paper) Parallelizing Fundamental Algorithms such as Sorting on Multi-core Processors for EDA Acceleration Author *Masato Edahiro (System IP Core Research Laboratories, NEC Corporation/Department of Computer Science, University of Tokyo, Japan) Page pp. 230 - 233 Keyword multi-core, many-core, parallel algorithm, sorting Abstract Fundamental algorithms should be parallelized to accelerate EDA software on multi-core architectures. In this paper, we introduce algorithms that scale on multi-core processors. As an example, a sorting algorithm called Map Sort is presented. This algorithm uses a map from subsets of input data to intervals on the data range. Experimental results show that, in comparison with quick sort on a single CPU, the processing time of Map Sort is comparable on one CPU and three times faster on four CPUs. Slides
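
The abstract describes Map Sort only as mapping subsets of the input to intervals of the data range. As a rough illustration of that general idea, the sketch below is a generic range-partition sort: it splits the value range into equal-width intervals, sends each element to the bucket for its interval, sorts the buckets independently, and concatenates them. This is an assumption-laden stand-in, not the Map Sort algorithm from the paper; the function name, the equal-width intervals, and the use of a thread pool are all choices made here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def range_partition_sort(data, workers=4):
    """Sort numbers by mapping them to value intervals, one bucket each.

    Buckets cover disjoint, increasing intervals of the data range, so
    they can be sorted independently and simply concatenated.
    """
    if not data:
        return []
    lo, hi = min(data), max(data)
    if lo == hi:
        return list(data)
    width = (hi - lo) / workers
    buckets = [[] for _ in range(workers)]
    for x in data:
        # Clamp so that the maximum value lands in the last bucket.
        idx = min(int((x - lo) / width), workers - 1)
        buckets[idx].append(x)
    # Each bucket is an independent task. In CPython, threads illustrate
    # the structure but do not give a real multi-core speedup.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sorted_buckets = list(pool.map(sorted, buckets))
    return [x for bucket in sorted_buckets for x in bucket]


if __name__ == "__main__":
    print(range_partition_sort([5, 3, 9, 1, 7, 3]))
```

Equal-width intervals balance the buckets only for roughly uniform data; a skewed input would need intervals chosen from a sample of the data.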

Session 3A  System-Level Design of 3D Chips and Configurable Systems
Time: 15:55 - 18:00 Tuesday, January 20, 2009
Location: Room 411+412
Chairs: Eui-Young Chung (Yonsei University, Republic of Korea), Steve Haga (National Sun Yat-Sen University, Taiwan)

3A-1 (Time: 15:55 - 16:20)
 Title System-Level Cost Analysis and Design Exploration for Three-Dimensional Integrated Circuits (3D ICs) Author *Xiangyu Dong, Yuan Xie (Pennsylvania State University, United States) Page pp. 234 - 241 Keyword 3D Integration, Cost Analysis Abstract The three-dimensional integrated circuit (3D IC) is emerging as an attractive option for overcoming the barriers in interconnect scaling. The majority of the existing 3D IC research is focused on how to take advantage of the performance, power, smaller form-factor, and heterogeneous integration benefits offered by 3D integration. However, all such advantages ultimately have to translate into cost savings when a design strategy has to be decided: Is 3D integration a cost-effective way for a particular IC design? Consequently, system-level cost analysis at the early design stage is imperative to help decide whether 3D integration should be adopted. In this paper, we study the design estimation method for 3D ICs at the early design stage, propose a cost analysis model to study the cost implications for 3D ICs, and address the following cost-related problems in 3D IC design: (1) Do all the benefits of 3D IC design come with a much higher cost? (2) How to do 3D integration in a cost-effective way? (3) Are there any design options to compensate for the extra 3D bonding cost? A cost-driven 3D IC design flow is also proposed to guide the design space exploration for 3D ICs toward a cost-effective direction.

3A-2 (Time: 16:20 - 16:45)
 Title Synthesis of Networks on Chips for 3D Systems on Chips Author *Srinivasan Murali, Ciprian Seiculescu (Ecole Polytechnique Federale de Lausanne, Switzerland), Luca Benini (University of Bologna, Italy), Giovanni De Micheli (Ecole Polytechnique Federale de Lausanne, Switzerland) Page pp. 242 - 247 Keyword Networks on Chips, 3D, topology, synthesis Abstract Three-dimensional stacking of silicon layers is emerging as a promising solution to handle the design complexity and heterogeneity of Systems on Chips (SoCs). Networks on Chips (NoCs) are necessary to efficiently handle the 3D interconnect complexity. Designing power efficient NoCs for 3D SoCs that satisfy the application performance requirements, while satisfying the 3D technology constraints, is a big challenge. In this work, we address this problem and present a synthesis approach for designing power-performance efficient 3D NoCs. We present methods to determine the best topology, compute paths and perform placement of the NoC components in each 3D layer. We perform experiments on varied, realistic SoC benchmarks to validate the methods and also perform a comparative study of the resulting 3D NoC designs with 3D optimized mesh topologies. The NoCs designed by our synthesis method result in large interconnect power reduction (average of 38%) and latency reduction (average of 25%) when compared to traditional NoC designs.

3A-3 (Time: 16:45 - 17:10)
 Title An Application-centered Design Flow for Self Reconfigurable Systems Implementation Author *Fabio Cancare, Marco Domenico Santambrogio, Donatella Sciuto (Politecnico di Milano, Italy) Page pp. 248 - 253 Keyword Dynamic Reconfiguration, Reconfigurability, FPGA Abstract Up to now, every proposed methodology for implementing dynamic self reconfigurable systems has been architecture-centered. In most cases the system development process is time consuming and requires a very specific technical background. The aim of this work is to provide a fast brain-to-bit design flow whose goal is to simplify the dynamic reconfigurable system development process by shifting the designer focus from the architecture point of view to the application point of view: designers will not need Dynamic Reconfigurability expertise, only skill in the application domain. Slides

3A-4 (Time: 17:10 - 17:35)
 Title System-Level Process Variability Compensation on Memory Organizations. On the Scalability of Multi-Mode Memories Author *Concepción Sanz, Manuel Prieto, José Ignacio Gómez (Universidad Complutense de Madrid, Spain), Antonis Papanikolaou, Francky Catthoor (Inter-University Microelectronics Center, Belgium) Page pp. 254 - 259 Keyword Process variation, parametric yield, variability compensation Abstract Process variation and the dynamism of modern applications can degrade the expected performance of a system. Execution time can be severely affected by both factors, resulting in deadline violations and energy consumption overheads. Memory organizations, which account for a large part of the system-energy and the time budgets, are especially vulnerable to process variation. Configurable multi-mode memories are a promising technology to deal with these problems, but they also introduce new issues that need to be solved. Essentially, adding configuration capabilities to the memories comes with a cost, both in memory area and control complexity; hence, we need to evaluate the minimum amount of re-configurability needed to satisfy system constraints. In this paper, we analyze the scalability of configurable memories and highlight the relationship among mode allocation, memory mapping and data allocation. Slides

Session 3B  Advances in Timing Analysis and Modeling
Time: 15:55 - 18:00 Tuesday, January 20, 2009
Location: Room 413
Chairs: Shih-Hsu Huang (Chung Yuan Christian University, Taiwan), Atsushi Takahashi (Tokyo Institute of Technology, Japan)

3B-1 (Time: 15:55 - 16:20)
 Title Accelerating Statistical Static Timing Analysis Using Graphics Processing Units Author Kanupriya Gulati, *Sunil P. Khatri (Texas A&M University, United States) Page pp. 260 - 265 Keyword Graphics Processing Units, Monte Carlo, Statistical Static Timing Analysis Abstract In this paper, we explore the implementation of Monte Carlo based statistical static timing analysis (SSTA) on a Graphics Processing Unit (GPU). SSTA via Monte Carlo simulations is a computationally expensive, but important step required to achieve design timing closure. It provides an accurate estimate of delay variations and their impact on design yield. The large number of threads that can be computed in parallel on a GPU suggests a natural fit for the problem of Monte Carlo based SSTA to the GPU platform. Our implementation performs multiple delay simulations at a single gate in parallel. A parallel implementation of the Mersenne Twister pseudo-random number generator on the GPU, followed by Box-Muller transformations (also implemented on the GPU) is used for generating gate delay numbers from a normal distribution. The mean and standard deviation of the pin-to-output delay distributions for all inputs and for every gate, are obtained using a memory lookup, which benefits from the large memory bandwidth of the GPU. Threads which execute in parallel have no data/control dependencies on each other. All threads compute identical instructions, but on different data, as required by the Single Instruction Multiple Data (SIMD) programming semantics of the GPU. Our approach is implemented on an NVIDIA GeForce GTX 8800 GPU card. Our results indicate that our approach can obtain an average speedup of about 260X as compared to a serial CPU implementation. With the recently announced quad 8800 GPU cards, we estimate that our approach would attain a speedup of over 785X. The correctness of the Monte Carlo based SSTA implemented on a GPU has been verified by comparing its results with a CPU based implementation. Slides
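Editorial illustration (not from the paper): the abstract of 3B-1 describes drawing gate delays from a normal distribution by feeding Mersenne Twister output through Box-Muller transformations. A minimal serial sketch of that sampling step is given below, assuming a single gate whose pin-to-output delay has a hypothetical mean of 10.0 and standard deviation of 1.0. The function names are illustrative, and the per-thread GPU parallelization and memory lookup described in the abstract are not modeled.

```python
import math
import random


def box_muller(u1, u2):
    """Map two uniform samples (u1 in (0, 1], u2 in [0, 1)) to two
    independent standard normal samples."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


def sample_gate_delays(mean, sigma, n, rng):
    """Draw n delay samples from N(mean, sigma^2) using Box-Muller.
    The mean and sigma values are hypothetical inputs for illustration."""
    delays = []
    while len(delays) < n:
        u1 = 1.0 - rng.random()  # shift [0, 1) to (0, 1] so log(u1) is defined
        u2 = rng.random()
        z0, z1 = box_muller(u1, u2)
        delays.append(mean + sigma * z0)
        if len(delays) < n:
            delays.append(mean + sigma * z1)
    return delays


# Python's random module is itself based on the Mersenne Twister generator.
rng = random.Random(0)
samples = sample_gate_delays(mean=10.0, sigma=1.0, n=10000, rng=rng)
sample_mean = sum(samples) / len(samples)
```

In a Monte Carlo SSTA run, each such sample would stand for one simulated delay at the gate; here they are only averaged as a sanity check.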

3B-2 (Time: 16:20 - 16:45)
 Title Trade-off Analysis between Timing Error Rate and Power Dissipation for Adaptive Speed Control with Timing Error Prediction Author *Hiroshi Fuketa, Masanori Hashimoto, Yukio Mitsuyama, Takao Onoye (Osaka University, Japan) Page pp. 266 - 271 Keyword adaptive speed control, timing error prediction, canary FF, low power design, subthreshold circuit Abstract Timing margin of a chip varies chip by chip due to manufacturing variability, and depends on operating environment and aging. Adaptive speed control with timing error prediction is a promising approach to mitigate the timing margin variation, whereas it inherently has a critical risk of timing error occurrence when a circuit is slowed down. This paper presents how to evaluate the relation between timing error rate and power dissipation in self-adaptive circuits with timing error prediction. The discussion is experimentally validated using a 32-bit ripple carry adder in subthreshold operation in a 90nm CMOS process. We show a trade-off between timing error rate and power dissipation, and reveal the dependency of the trade-off on design parameters. Slides

3B-3 (Time: 16:45 - 17:10)
 Title Statistical Analysis of On-Chip Power Grid Networks by Variational Extended Truncated Balanced Realization Method Author *Duo Li, Sheldon Tan (University of California at Riverside, United States), Gengsheng Chen, Xuan Zeng (Fudan University, China) Page pp. 272 - 277 Keyword Power grid, TBR, Reduction, Interconnect, Variation Abstract In this paper, we present a novel statistical analysis approach for large power grid network analysis under process variations. The new algorithm is very efficient and scalable for huge networks with a large number of variational variables. This approach, called varETBR for variational extended truncated balanced realization, is based on model order reduction techniques to reduce the circuit matrices before the variational simulation. It performs the parameterized reduction on the original system using variation-bearing subspaces. varETBR calculates variational response Gramians by Monte-Carlo based numerical integration considering both system and input source variations for generating the projection subspace. varETBR is very scalable for the number of variables and is flexible for different variational distributions and ranges as demonstrated in experimental results. After the reduction, Monte-Carlo based statistical simulation is performed on the reduced system and the statistical responses of the original system are obtained thereafter. Experimental results, on a number of IBM benchmark circuits [15] up to 1.6 million nodes, show that the varETBR can be 4500X faster than the Monte-Carlo method and is much more scalable than one of the recently proposed approaches.

3B-4 (Time: 17:10 - 17:35)
 Title Bound-Based Identification of Timing-Violating Paths Under Variability Author *Lin Xie, Azadeh Davoodi (University of Wisconsin at Madison, United States) Page pp. 278 - 283 Keyword variability, statistical timing analysis, timing-violating path, violation probability Abstract We introduce a bound-based technique to identify the top M timing-violating paths in a circuit under variability. These are the paths with the highest violation probability (i.e., C_p) which is the probability that a path (i.e., p) violates the timing constraint. To compute C_p, we require the violation probabilities of the nodes (i.e., C_n) and edges (i.e., C_e) on the path. First, we show computing C_n and C_e of all the nodes and edges requires only two rounds of Statistical Static Timing Analysis and then for each node/edge we need one table lookup for probability calculation using a technique known as Pearson Curve. Given C_n and C_e, our major contribution is in computing upper and lower bounds for C_p of an arbitrary path segment. We show constant-time for incremental update of the bounds when extending a path segment to a longer one. These bounds can be used to exactly construct the top violating paths. If the goal is to find the single most-violating path, we show a bound-based formulation that can prune a large portion of circuit without losing optimality. In our simulations, we verify the correctness and accuracy of our bounds for individual paths. We also verify identification of selected paths using Monte Carlo simulation. We obtain near-optimal accuracy with extremely fast runtimes. Slides

3B-5 (Time: 17:35 - 18:00)
 Title Adaptive Techniques for Overcoming Performance Degradation due to Aging in Digital Circuits Author Sanjay Kumar, Chris Kim, *Sachin S. Sapatnekar (University of Minnesota, United States) Page pp. 284 - 289 Keyword Reliability, Adaptive Body Bias, NBTI, Leakage, Delay Abstract Negative Bias Temperature Instability (NBTI) in PMOS transistors has become a major reliability concern in present-day digital circuit design. Further, with the recent usage of Hf-based high-k dielectrics for gate leakage reduction, Positive Bias Temperature Instability (PBTI), the dual effect in NMOS transistors has also reached significant levels. Consequently, designers are required to build in substantial guardbands into their designs, leading to large area and power overheads, in order to guarantee reliable operation over the lifetime of a chip. We propose a guard-banding technique based on adaptive body bias (ABB) and adaptive supply voltage (ASV), to recover the performance of an aged circuit, and compare its merits over previous approaches. Slides

Session 3D  Special Session: Hardware Dependent Software for Multi- and Many-Core Embedded Systems
Time: 15:55 - 18:00 Tuesday, January 20, 2009
Location: Room 416+417
Organizers: Rainer Doemer (University of California at Irvine, United States), Andreas Gerstlauer (University of Texas at Austin, United States), Wolfgang Mueller (University of Paderborn, Germany)

3D-1 (Time: 15:55 - 16:10)
 Title (Invited Paper) Introduction to Hardware-dependent Software Design Author *Rainer Doemer (University of California at Irvine, United States), Andreas Gerstlauer (University of Texas at Austin, United States), Wolfgang Mueller (University of Paderborn, Germany) Page pp. 290 - 292 Abstract Due to the rapidly increasing software content in embedded systems, Hardware-dependent Software (HdS) has become a critical topic in system design. In this talk, we will motivate the need for special attention to HdS in research and development and provide a brief introduction to the issues involved in the design of HdS. Slides

3D-2 (Time: 16:10 - 16:50)
 Title (Invited Paper) Using a Dataflow abstracted Virtual Prototype for HdS-Design Author Wolfgang Ecker, Stefan Heinen, *Michael Velten (Infineon Technologies AG, Germany) Page pp. 293 - 300 Keyword Abstraction, VP, TLM, HdS Abstract The complexity of Hardware-dependent Software (HdS) continuously grows stronger than chip complexity since more and more tasks are moved to software. Clearly, the pressure on the development of new methodologies for early validation of HdS increases as well. Existing methods must be continuously improved and new methods must be developed. This is exemplified with a state-of-the-art Transaction Level (TL) model used for firmware development of a productive wireless communication chip. By discussing the strengths and shortcomings of TL modeling we derive a set of requirements for a future modeling paradigm, which led to the new data flow abstraction approach presented in this paper. Experiments showed that we gain up to 10x performance improvement. Slides

3D-3 (Time: 16:50 - 17:20)
 Title (Invited Paper) Needs and Trends in Embedded Software Development for Consumer Electronics Author *Yasutaka Tsunakawa (Sony Corporation, Japan) Page pp. 301 - 303 Keyword Embedded software, Consumer electronics, Multi-Core, Many-Core Abstract As in other domains, the shift to Many-Core cannot be avoided in consumer electronics. Multi-Core has already become mainstream in system LSIs, and the number of cores per chip will continue to increase. Because of the growth in required functions and the pressure to reduce power consumption, the shift to Many-Core will continue unabated. However, from the point of view of embedded software development, many unsolved problems lie like a huge cliff between current Multi-Core and Many-Core. Research organizations seem to focus their main efforts on establishing Many-Core technology, while tool vendors concentrate on offering solutions for current Multi-Core. Therefore, measures for the transition period, which will come several years from now, are still insufficient. In this article, I discuss the major problems that block the shift from current Multi-Core to Many-Core, from the viewpoint of consumer electronics. Slides

3D-4 (Time: 17:20 - 18:00)
 Title (Invited Paper) Hardware-dependent Software Synthesis for Many-Core Embedded Systems Author *Samar Abdi, Gunar Schirner, Ines Viskic, Hansu Cho, Yonghyun Hwang, Lochi Yu, Daniel Gajski (Center for Embedded Computer Systems, University of California, Irvine, United States) Page pp. 304 - 310 Keyword Embedded Software, Multicore Design, Software Synthesis Abstract This paper presents synthesis of Hardware Dependent Software (HdS) for multicore and many-core designs using Embedded System Environment (ESE). ESE is a tool set, developed at UC Irvine, for transaction level design of multicore embedded systems. HdS synthesis is a key component of the ESE backend design flow. We follow a design process that starts with an application model consisting of C processes communicating via abstract message passing channels. The application model is mapped to a platform net-list of SW and HW cores, buses and buffers. A high speed transaction level model (TLM) is generated to validate abstract communication between processes mapped to different cores. The TLM is further refined into a Pin-Cycle Accurate Model (PCAM) for board implementation. The PCAM includes C code for all the HdS layers including routing, packeting, synchronization and bus transfer. The generated HdS methods provide a library of application level services to the C processes on individual SW cores. Therefore, the application developer does not need to write low level HdS for board implementation. Synthesis results for a multi-core MP3 decoder design, using ESE, show that the HdS is generated on the order of seconds, compared to hours of manual coding. The quality of synthesized code is comparable to manually written code in terms of performance and code size. Slides

 Wednesday, January 21, 2009

Session 2K  Keynote Session II
Time: 9:00 - 10:00 Wednesday, January 21, 2009
Location: Small Auditorium, 5F
Chair: Kazutoshi Wakabayashi (NEC Corp., Japan)

2K-1 (Time: 9:00 - 10:00)
 Title (Keynote Address) Automated Synthesis and Verification of Embedded Systems: Wishful Thinking or Reality? Author Wolfgang Rosenstiel (Wilhelm-Schickard-Institute for Informatics, University of Tuebingen, Germany) Abstract More complex embedded hardware/software systems have to be developed with shorter design time and reduced cost. One solution for this problem is increasing design automation starting from higher levels of abstraction. Automatic synthesis and verification has been around in research for quite a while. This talk will show examples of state-of-the-art tools for system-level synthesis and verification of embedded systems and demonstrate their possibilities and limitations by some automotive applications.

Session 4A  System Level Architectures
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 411+412
Chairs: Samar Abdi (University of California, Irvine, United States), Jun Yang (Univ. of Pittsburgh)

4A-1 (Time: 10:15 - 10:40)
 Title Computation and Data Transfer Co-Scheduling for Interconnection Bus Minimization Author Cathy Qun Xu (University of Texas at Dallas, United States), *Chun Jason Xue, Bessie C Hu (City University of Hong Kong, Hong Kong), Edwin H.M. Sha (University of Texas at Dallas, United States) Page pp. 311 - 316 Keyword Scheduling, Interconnection network, clustered processors, data path synthesis Abstract High Instruction-Level-Parallelism in DSP and media applications demands highly clustered architecture. It is a challenge to design an efficient, flexible yet cost saving inter-connection network to satisfy the rapidly increasing inter-cluster data transfer needs. This paper presents a computation and data transfer co-scheduling technique to minimize the number of partially connected interconnection buses required for a given embedded application while minimizing its schedule length. Previous research in this area focused on scheduling computations to minimize the number of inter-cluster data transfers. The proposed co-scheduling technique not only schedules computations to reduce the number of inter-cluster data transfers, but also schedules inter-cluster data transfers to minimize the number of required partially connected buses for the inter-cluster connection network. Experimental results indicate that 52.3% fewer buses are required compared to the current best known technique while achieving the same schedule length minimization.

4A-2 (Time: 10:40 - 11:05)
 Title Prototyping Pipelined Applications on a Heterogeneous FPGA Multiprocessor Virtual Platform Author *Antonino Tumeo, Marco Branca, Lorenzo Camerini, Marco Ceriani (Politecnico di Milano, Italy), Matteo Monchiero (HP Labs, United States), Gianluca Palermo, Fabrizio Ferrandi, Donatella Sciuto (Politecnico di Milano, Italy) Page pp. 317 - 322 Keyword FPGA, Prototyping, Pipelining, Multiprocessor, Multimedia Abstract Multiprocessors on a chip are the reality of these days. Semiconductor industry has recognized this approach as the most efficient in order to exploit chip resources, but the success of this paradigm heavily relies on the efficiency and widespread diffusion of parallel software. Among the many techniques to express the parallelism of applications, this paper focuses on pipelining, a technique well suited to data-intensive multimedia applications. We introduce a prototyping platform (FPGA-based) and a methodology for these applications. Our platform consists of a mix of standard and custom heterogeneous cores. We discuss several case studies, analyzing the interaction of the architecture and applications and we show that multimedia and telecommunication applications with unbalanced pipeline stages can be easily deployed. Our framework eases the development cycle and enables the developers to focus directly on the problems posed by the programming model in the direction of the implementation of a production system. Slides

4A-3 (Time: 11:05 - 11:30)
 Title Variability-Aware Robust Design Space Exploration of Chip Multiprocessor Architectures Author *Gianluca Palermo, Cristina Silvano, Vittorio Zaccaria (Politecnico di Milano, DEI, Italy) Page pp. 323 - 328 Keyword Design Space Exploration Abstract In the context of a design space exploration framework for supporting the platform-based design approach, we address the problem of robustness with respect to manufacturing process variations. First, we introduce response surface modeling techniques to enable an efficient evaluation of the statistical measures of execution time and energy consumption for each system configuration. We then introduce a robust design space exploration framework to address the impact of manufacturing process variations on the system-level metrics and consequently on the application-level constraints. We finally provide a comparison of our design space exploration technique with conventional approaches. Slides

4A-4 (Time: 11:30 - 11:55)
 Title Partial Conflict-Relieving Programmable Address Shuffler for Parallel Memories in Multi-Core Processor Author *Young-Su Kwon, Bon-Tae Koo, Nak-Woong Eum (Electronics and Telecommunications Research Institute, Republic of Korea) Page pp. 329 - 334 Keyword parallel memory, access conflict, multi-core, memory Abstract The advancement of process technology enables the integration of multiple cores featuring parallel processing. The requirement of extensive memory bandwidth puts a major performance bottleneck in multi-core architectures for media applications. While the parallel memory system is a viable solution to account for a large amount of memory transactions required by multiple cores, memory access conflicts caused by simultaneous accesses to an identical memory page by two or more cores limit the performance of multi-core architectures. We propose and evaluate the programmable memory address shuffler associated with the novel memory shuffling algorithm integrated in multi-core architectures with parallel memory system. The address shuffler efficiently translates the requested memory addresses into the shuffled addresses such that access conflicts diminish by analyzing the access pattern of the application. We demonstrate that the shuffling of sub-pages is represented by cyclic linked list which enables partial address shuffling with the minimal number of shuffling table entries. The programmable address shuffler reduces the amount of access conflicts by 83% for pitch-shifting audio decompression.

4A-5 (Time: 11:55 - 12:20)
 Title HitME: Low Power Hit MEmory Buffer for Embedded Systems Author Andhi Janapsatya, *Sri Parameswaran, Aleksandar Ignjatovic (University of New South Wales, Australia) Page pp. 335 - 340 Keyword memory, low power, cache, loop cache Abstract In this paper, we present a novel HitME (Hit-MEmory) buffer to reduce the energy consumption of memory hierarchy in embedded processors. The HitME buffer is a small direct-mapped cache memory that is added as additional memory into existing cache memory hierarchies. The HitME buffer is loaded only when there is a hit on L1 cache. Otherwise, L1 cache is updated from the memory and the processor's memory request is served directly from the L1 cache. The strategy works due to the fact that 90% of memory accesses are only accessed once, and these often pollute the cache. Energy reduction is achieved by reducing the number of accesses to the L1 cache memory. Experimental results show that the use of HitME buffer will reduce the L1 cache accesses resulting in a reduction in the energy consumption of the memory hierarchy. This decrease in L1 cache accesses reduces the cache system energy consumption by an average of 60.9% when compared to traditional L1 cache memory architecture and an energy reduction of 6.4% when compared to filter cache architecture for 70nm cache technology.
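Editorial illustration (not from the paper): the fill policy described in the abstract of 4A-5 can be sketched as a small trace-driven simulation. The sketch assumes that the HitME buffer is probed before the L1 cache, that both structures are direct-mapped on a block address, and that the sizes (4 and 64 lines) and the access trace are hypothetical. It only counts lookups and does not model energy.

```python
class DirectMappedCache:
    """Direct-mapped cache over block addresses that counts its lookups."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.tags = [None] * num_lines
        self.accesses = 0

    def lookup(self, block):
        self.accesses += 1
        return self.tags[block % self.num_lines] == block

    def fill(self, block):
        self.tags[block % self.num_lines] = block


def run_trace(trace, hitme_lines=4, l1_lines=64):
    """Replay a block-address trace; return (HitME lookups, L1 lookups)."""
    hitme = DirectMappedCache(hitme_lines)
    l1 = DirectMappedCache(l1_lines)
    for block in trace:
        if hitme.lookup(block):
            continue            # served by the small buffer, L1 untouched
        if l1.lookup(block):
            hitme.fill(block)   # L1 hit: load the block into the HitME buffer
        else:
            l1.fill(block)      # L1 miss: fill L1 from memory, buffer unchanged
    return hitme.accesses, l1.accesses


# Four reused blocks followed by twenty blocks that are touched only once.
trace = [0, 1, 2, 3] * 10 + list(range(100, 120))
hitme_accesses, l1_accesses = run_trace(trace)
```

On this hypothetical trace the L1 cache is probed 28 times instead of 60, because the reused blocks are served from the buffer after their second access, while the single-use blocks never enter the buffer.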

Session 4B  Beyond Traditional Floorplanning and Placement
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 413
Chair: Shigetoshi Nakatake (The University of Kitakyushu, Japan)

4B-1 (Time: 10:15 - 10:40)
 Title Signal Skew Aware Floorplanning and Bumper Signal Assignment Technique for Flip-Chip Author *Cheng-Yu Wang, Wai-Kei Mak (Department of Computer Science, National Tsing Hua University, Taiwan) Page pp. 341 - 346 Keyword Flip-chip, floorplanning, Bumper, pad, Assignment Abstract Flip-chip is a solution for designs requiring more I/O pins and higher speed. However, the higher speed demand also brings the issue of signal skew. In this paper, we propose a new 3-stage design layout methodology for flip-chip considering signal skew. Firstly, we produce an initial bumper signal assignment, and then solve the flip-chip floorplanning problem using a partitioning-based technique to spread the modules across the flip-chip as the distribution of its bumpers. With an anchoring and relocation strategy, we can effectively place I/O buffers at desirable locations. Finally, we further reduce signal skew and monotonic routing density by refining the bumper signal assignment. Experimental results show that the signal skew of traditional floorplanners ranges from 4% to 280% higher than ours, and the total wirelength of other floorplanners is as much as 100% higher than ours. Moreover, our signal refinement method can further decrease monotonic routing density by up to 8% and signal skew by up to 11%.

4B-2 (Time: 10:40 - 11:05)
 Title A Novel Thermal Optimization Flow Using Incremental Floorplanning for 3D ICs Author Xin Li, *Yuchun Ma, Xianlong Hong (Tsinghua University, China) Page pp. 347 - 352 Keyword 3D ICs, incremental floorplanning, thermal Abstract Thermal issue is a critical challenge in 3D IC design. To eliminate hotspots, physical layouts are always adjusted by shifting or duplicating hot blocks. However, these modifications may degrade the packing area as well as interconnect distribution greatly. In this paper, we propose some novel thermal-aware incremental changes to optimize these multiple objectives including thermal issue in 3D ICs. Furthermore, to avoid random incremental modification, which may be inefficient and need long runtime to converge, here potential gain is modeled for each candidate incremental change. Based on the potential gain, a novel thermal optimization flow to intelligently choose the best incremental operation is presented. We distinguish the thermal-aware incremental changes in three different categories: migrating computation, growing unit and moving hotspot. Mixed integer linear programming (MILP) models are devised according to these different incremental changes. Experimental results show that migrating computation, growing unit and moving hotspot can reduce max on-chip temperature by 7%, 13% and 15% respectively on MCNC/GSRC benchmarks. Still, experimental results also show that the thermal optimization flow can reduce max on-chip temperature by 14% compared to an existing 3D floorplan tool CBA, and achieve better area and total wirelength improvement than individual operations do.

4B-3 (Time: 11:05 - 11:30)
 Title Analog Placement with Common Centroid and 1-D Symmetry Constraints Author *Linfu Xiao, Evangeline Young (The Chinese University of Hong Kong, Hong Kong) Page pp. 353 - 360 Keyword analog placement, common centroid, symmetry Abstract In this paper, we will present a placement method for analog circuits. We consider both common centroid and 1-D symmetry constraints, which are the two most common types of placement requirements in analog designs. The approach is based on a symmetric feasible condition on the sequence pair representation that can cover completely the set of all placements satisfying the common centroid and 1-D symmetry constraints. This condition is essential for a good searching process to solve the problem effectively. Symmetric placement is an important step to achieve matchings of other electrical properties like delay and temperature variation. We have compared our results with those presented in the most recent previous works. Significant improvements can be obtained by our approach in both common centroid and 1-D symmetry placements, and we are the first who can handle both constraints simultaneously.

4B-4 (Time: 11:30 - 11:55)
 Title A Multilevel Analytical Placement for 3D ICs Author Jason Cong, *Guojie Luo (University of California, Los Angeles, United States) Page pp. 361 - 366 Keyword 3D IC, analytical placement, through-silicon via Abstract In this paper we propose a multilevel non-linear programming based 3D placement approach that minimizes a weighted sum of total wirelength and TS via number subject to area density constraints. This approach relaxes the discrete layer assignments so that they are continuous in the z-direction and the problem can be solved by an analytical global placer. A key idea is to do the overlap removal and device layer assignment simultaneously by adding a density penalty function for both area & TS via density constraints. Experimental results show that this analytical placer in a multilevel framework is effective to achieve trade-offs between wirelength and TS via number. Compared to the recently published transformation-based 3D placement method [1], we are able to achieve on average 12% shorter wirelength and 29% fewer TS via compared to their cases with best wirelength; we are also able to achieve on average 20% shorter wirelength and 50% fewer TS via number compared to their cases with best TS via numbers. Slides

4B-5 (Time: 11:55 - 12:20)
 Title Exploring Adjacency in Floorplanning Author Jia Wang, *Hai Zhou (Northwestern University, United States) Page pp. 367 - 372 Keyword floorplanning, adjacency graph Abstract This paper describes a new floorplanning approach called Constrained Adjacency Graph (CAG) that helps explore adjacency in floorplans. CAG extends the previous adjacency graph approaches by adding explicit adjacency constraints to the graph edges. After sufficient and necessary conditions of CAG are developed based on dissected floorplans, CAG is extended to handle general floorplans in order to improve area without changing the adjacency relations dramatically. These characteristics are currently utilized in a randomized greedy improvement heuristic for wire length optimization. The results show that better floorplans are found with much less running time for problems with 100 to 300 modules in comparison to a simulated annealing floorplanner based on sequence pairs.

Session 4C  Signal/Power Integrity and Simulation
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 414+415
Chairs: Hideki Asai (Shizuoka University, Japan), Sheldon Tan (University of California, Riverside, United States)

4C-1 (Time: 10:15 - 10:40)
 Title Stochastic Current Prediction Enabled Frequency Actuator for Runtime Resonance Noise Reduction Author *Yiyu Shi (University of California, Los Angeles, United States), Jinjun Xiong, Howard Chen (IBM Thomas J. Watson Research Center, United States), Lei He (University of California, Los Angeles, United States) Page pp. 373 - 378 Keyword Stochastic current modeling, frequency actuator, resonance noise Abstract Power delivery network (PDN) is a distributed RLC network with its dominant resonance frequency in the low-to-middle frequency range. Though high-performance chips' working frequencies are much higher than this resonance frequency in general, chip runtime loading frequency is not. When a chip executes a chunk of instructions repeatedly, the induced current load may have harmonic components close to this resonance frequency, causing excessive power integrity degradation. Existing PDN design solutions are, however, mainly targeted at reducing high-frequency noise and are not effective at suppressing such resonance noise. In this work, we propose a novel approach to proactively suppress this type of noise. A method based on a high dimension generalized Markov process is developed to predict current load variation. Based on such prediction, a clock frequency actuator design is proposed to proactively select an optimal clock frequency to suppress the resonance. To the best of our knowledge, this is the first in-depth study on proactively reducing runtime instruction execution induced PDN resonance noise.

4C-2 (Time: 10:40 - 11:05)
 Title Fast Analysis of Nontree-Clock Network Considering Environmental Uncertainty by Parameterized and Incremental Macromodeling Author Hai Wang (University of California, Riverside, United States), Hao Yu (Berkeley Design Automation, United States), *Sheldon X.D. Tan (University of California, Riverside, United States) Page pp. 379 - 384 Keyword clock network, environmental uncertainties, macromodeling Abstract It is challenging to verify clock skew for large-scale nontree clock networks with environmental uncertainties such as supply voltage fluctuation and thermal gradients. This paper presents a fast clock-skew analysis via parameterized incremental truncated-balanced-realization, called the piTBR method. Environmental uncertainties are parametrically and structurally added into the state equation of the clock network. A compact macromodel is obtained by the subspace projection constructed from the singular value decomposition (SVD) of circuit output waveforms. To reduce the computational cost, we propose an incremental SVD method that only needs to partially update the projection matrix by analyzing the perturbed output waveform owing to environmental uncertainties. Experiments on a number of clock networks show that compared with the macromodeling by the fast TBR method, our method reduces the computational cost on the order of 100× with a similar accuracy. In addition, compared with the macromodeling by the Krylov-subspace-based method, our method reduces the waveform error by 2× with a similar runtime.

4C-3 (Time: 11:05 - 11:30)
 Title High Performance On-Chip Differential Signaling Using Passive Compensation for Global Communication Author Ling Zhang, Yulei Zhang (University of California, San Diego, United States), Akira Tsuchiya (Kyoto University, Japan), Masanori Hashimoto (Osaka University, Japan), Ernest Kuh (University of California, Berkeley, United States), *Chung-Kuan Cheng (University of California, San Diego, United States) Page pp. 385 - 390 Keyword High performance, passive compensation, on-chip T-line Abstract To address the performance limitation brought by the scaling issues of on-chip global wires, a new configuration for global wiring using on-chip lossy transmission lines is proposed and optimized. We propose a signaling structure to compensate for the distortion and attenuation of on-chip transmission lines, which uses passive compensation and inserts repeated transceivers composed of sense amplifiers and inverter chains. An optimization flow for designing this scheme based on eye-diagram prediction and sequential quadratic programming (SQP) is devised. This flow is used to study the latency, power dissipation and throughput performance of the new global wiring scheme as the technology scales from 90nm to 22nm. Compared to repeated RC wire, experimental results demonstrate that at the 22nm technology node, the new scheme can reduce the normalized delay by 80%-95% and the normalized energy consumption by 50%-94%. The normalized latency is 10 ps/mm, the energy per bit is 20 pJ/m, and the throughput is 15 Gbps/um. All performance metrics are scalable with technology, which makes this approach a potential candidate to break the "interconnect wall" of digital system performance.

4C-4 (Time: 11:30 - 11:55)
 Title Noise Minimization During Power-Up Stage for a Multi-Domain Power Network Author *Wanping Zhang (Qualcomm Inc./University of California, San Diego, United States), Yi Zhu (University of California, San Diego, United States), Wenjian Yu (Tsinghua University, China), Amirali Shayan, Renshen Wang (University of California, San Diego, United States), Zhi Zhu (Qualcomm Inc., United States), Chung-Kuan Cheng (University of California, San Diego, United States) Page pp. 391 - 396 Keyword Noise, Power-up sequence, Multi-domain Abstract With the popularity of Multiple Power Domain (MPD) design, multi-domain power network noise analysis and minimization is becoming important. This paper describes an efficient heuristic algorithm to arrange the power-up sequence in a multi-domain power network in order to minimize the noise. We present a formulation of this problem and show it is NP-complete. Therefore, we propose a simulated annealing (SA) based algorithm with preprocessing. Experimental results show that the proposed algorithm can minimize the noise close to the minimal values. In terms of efficiency, the SA algorithm is hundreds of times faster than the enumeration method, and the running time scales well with the number of domains for these cases. In addition, we discuss the trade-off between power-up efficiency and noise.
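The idea of searching power-up orders with simulated annealing and comparing against enumeration can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the noise model (`sequence_noise`, which sums a made-up coupling value between domains powered up back to back), the `COUPLING` matrix, the swap move, and the cooling schedule. The paper's preprocessing step is not modeled.

```python
import math
import random
from itertools import permutations


def sequence_noise(order, coupling):
    # Hypothetical noise model: total coupling between domains
    # that are powered up one right after the other.
    return sum(coupling[a][b] for a, b in zip(order, order[1:]))


def anneal_power_up(coupling, seed=0, t_start=10.0, t_end=1e-3,
                    alpha=0.95, moves=50):
    # Simulated annealing over permutations with pairwise-swap moves.
    rng = random.Random(seed)
    n = len(coupling)
    order = list(range(n))
    cost = sequence_noise(order, coupling)
    best_order, best_cost = order[:], cost
    t = t_start
    while t > t_end:
        for _ in range(moves):
            i, j = rng.sample(range(n), 2)
            order[i], order[j] = order[j], order[i]
            new_cost = sequence_noise(order, coupling)
            delta = new_cost - cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost = new_cost
                if cost < best_cost:
                    best_order, best_cost = order[:], cost
            else:
                order[i], order[j] = order[j], order[i]  # undo the swap
        t *= alpha
    return best_order, best_cost


def enumerate_power_up(coupling):
    # Exhaustive baseline: n! sequences, only feasible for small n.
    n = len(coupling)
    return min(sequence_noise(p, coupling) for p in permutations(range(n)))


# Made-up symmetric coupling values for six power domains.
COUPLING = [
    [0, 7, 2, 9, 4, 6],
    [7, 0, 5, 3, 8, 1],
    [2, 5, 0, 6, 7, 9],
    [9, 3, 6, 0, 2, 5],
    [4, 8, 7, 2, 0, 3],
    [6, 1, 9, 5, 3, 0],
]

sa_order, sa_cost = anneal_power_up(COUPLING)
exact_cost = enumerate_power_up(COUPLING)
```

The enumeration baseline grows factorially with the number of domains, which is why a heuristic such as annealing becomes attractive as the domain count increases.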

4C-5s (Time: 11:55 - 12:07)
 Title Parallel Transistor Level Circuit Simulation using Domain Decomposition Methods Author *He Peng, Chung-Kuan Cheng (University of California, San Diego, United States) Page pp. 397 - 402 Keyword SPICE, parallel circuit simulation, domain decomposition, multi-core simulation Abstract This paper presents an efficient parallel transistor level full-chip circuit simulation tool with SPICE accuracy. The new approach partitions the circuit into a linear domain and several non-linear domains based on circuit non-linearity and connectivity. The linear domain is solved by a parallel fast linear solver, while the non-linear domains are distributed across different processors and solved by a direct solver. A parallel domain decomposition technique is used to iteratively solve the different partitions of the circuit and ensure convergence. Different domain decomposition techniques are discussed. Orders of magnitude speedup over SPICE is observed for sets of large-scale VLSI circuits.

4C-6s (Time: 12:07 - 12:19)
 Title Fast Circuit Simulation on Graphics Processing Units Author Kanupriya Gulati (Texas A&M University, United States), John F. Croix (Nascentric, Inc., United States), *Sunil P. Khatri (Texas A&M University, United States), Rahm Shastry (Nascentric, Inc., United States) Page pp. 403 - 408 Keyword SPICE, device model evaluations, Graphics Processing Units Abstract SPICE based circuit simulation is a traditional workhorse in the VLSI design process. Given the pivotal role of SPICE in the IC design flow, there has been significant interest in accelerating SPICE. Since a large fraction (on average 75%) of the SPICE runtime is spent in evaluating transistor model equations, a significant speedup can be availed if these evaluations are accelerated. This paper reports our early efforts to accelerate transistor model evaluations using a Graphics Processing Unit (GPU). We have integrated this accelerator with a commercial fast SPICE tool. Our experiments demonstrate that significant speedups (2.36X on average) can be obtained. The asymptotic speedup that can be obtained is about 4X. We demonstrate that with circuits consisting of as few as about 1000 transistors, speedups in the neighborhood of this asymptotic value can be obtained. By utilizing the recently announced (but not currently available) quad GPU systems, this speedup could be enhanced further, especially for larger designs. Slides

Session 4D  Special Session: Challenges in 3D Integrated Circuit Design
Time: 10:15 - 12:20 Wednesday, January 21, 2009
Location: Room 416+417
Organizer: Sachin Sapatnekar (Univ. of Minnesota, United States)

4D-1 (Time: 10:15 - 10:40)
 Title (Invited Paper) Three-Dimensional Integration Technology and Integrated Systems Author *Mitsumasa Koyanagi, Takafumi Fukushima, Tetsu Tanaka (Tohoku University, Japan) Page pp. 409 - 415 Abstract A new three-dimensional (3-D) integration technology by a self-assembly method is described. In addition, 3-D integrated systems such as 3-D microprocessor chip, 3-D shared memory chip, 3-D image processing chip and 3-D artificial retina chip are demonstrated. Slides

4D-2 (Time: 10:40 - 11:05)
 Title (Invited Paper) A 3D Prototyping Chip based on a Wafer-level Stacking Technology Author *Nobuaki Miyakawa (Honda Research Institute, Japan) Page pp. 416 - 420 Keyword Stacking Technology, Wafer-to-Wafer, 8 inch wafer, Trial Manufacture of 3 layer Stacked Abstract A case study on 3D IC process, prototyping, and EDA design flow

4D-3 (Time: 11:05 - 11:30)
 Title (Invited Paper) CAD Challenges for 3D ICs Author David Kung, *Ruchir Puri (IBM Corp., United States) Page pp. 421 - 422 Abstract A fundamental shift in the technology has occurred beyond 90nm CMOS where the interconnect resistance has been increasing significantly to cause a repeater explosion problem. This problem translates into not only significant area overhead but also power, as repeaters lose power to leakage. 3D technology has the potential of easing the challenge of repeater explosion. In order to exploit the full potential of 3D technology, new challenges in the area of physical design, thermal analysis, system level design and analysis need to be addressed. 3D interconnects have the potential of significantly reducing critical path delays, which are typically between memory and the interfacing logic. New tools that consider thermally aware physical design implementations, most importantly at the architecture and SoC level, are crucial to the success of 3D as thermal issues are exacerbated in 3D implementations. To justify the cost and complexity overhead of 3D technology, it is essential to study the benefit of 3D early in the design cycle. This requires strong linkage between architecture level analysis tools and 3D physical planning tools. Most of the advantages of 3D will be utilized with new system architectures and physical implementations. Therefore, the tools to aid 3D implementation must also operate at the higher level in addition to the 3D place and route algorithms that have been proposed in the literature before. In fact, the benefits from 3D place and route will be limited since current 2D designs do a fairly good job of optimizing the critical path distance. There is a very strong need for 3D architectural and physical planning tools that operate in the domain of thermal, physical, and performance analysis in order to yield an optimized system implementation in 3D technology.
Most of the studies reporting huge benefits from 3D for wire length do not adequately consider the physical impact of vertical vias. It is crucial to consider the impact of vertical vias on the physical design of ICs, from area, latency, and thermal impact point of view.

4D-4 (Time: 11:30 - 11:55)
 Title (Invited Paper) Addressing Thermal and Power Delivery Bottlenecks in 3D Circuits Author *Sachin S. Sapatnekar (University of Minnesota, United States) Page pp. 423 - 428 Keyword 3D integrated circuits, temperature, power grid, analysis, optimization Abstract The enhanced packing densities facilitated by 3D integrated circuit technology also have an unwanted side effect, in the form of increasing the amount of current per unit footprint of the chip, as compared to a 2D design. This has ramifications on two critical issues: firstly, it means that more heat is generated per unit footprint, potentially leading to thermal problems, and secondly, more current must be supplied per package pin, leading to possible power delivery bottlenecks. This paper presents an overview of the challenges and solutions in the domain of addressing these two issues in 3D integrated circuits. Slides

4D-5 (Time: 11:55 - 12:20)
 Title (Invited Paper) The Road to 3D EDA Tool Readiness Author *Charles Chiang, Subarna Sinha (Synopsys, United States) Page pp. 429 - 436 Keyword TSV Abstract The design, representation and optimization of 3D ICs will require changes to the current EDA tool suite. Modifications will be necessary in the data models as well as the analysis and optimization algorithms at various design stages. This talk will provide an in-depth summary of the changes needed at the various design stages to enable and support 3D IC design.

Session 5A  Energy-Aware System Level Design Methodology
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 411+412
Chairs: Chia-Lin Yang (National Taiwan University, Taiwan), Juinn-Dar Huang (National Chiao Tung University, Taiwan)

5A-2 (Time: 13:55 - 14:20)
 Title System-Level Exploration Tool for Energy-Aware Memory Management in the Design of Multidimensional Signal Processing Systems Author *Florin Balasa (Southern Utah University, United States), Ilie I. Luican (University of Illinois at Chicago, United States), Hongwei Zhu (ARM, Inc., United States), Doru V. Nasui (American International Radio, Inc., United States) Page pp. 443 - 448 Keyword multidimensional signal processing, memory management, memory allocation, signal-to-memory assignment, dynamic energy consumption Abstract Many signal processing systems, particularly in the multimedia and telecom domains, are synthesized to execute data-dominated applications. Their behavior is described in a high-level programming language, where the code is typically organized in sequences of loop nests and the main data structures are multidimensional arrays. Since data transfer and storage have a significant impact on both the system performance and the major cost parameters -- power consumption and chip area, the designer must spend a significant effort during the system development process on the exploration of the memory subsystem in order to achieve a cost-optimized design. This paper presents a software tool for system-level exploration, where several memory management tasks are addressed in a common theoretical framework. The tool can compute the minimum storage requirement of a given application and can produce the graph of storage variation during the code execution; it offers memory allocation and signal assignment solutions both for flat and hierarchical organizations and optimizes the dynamic energy consumption in the memory subsystem.

5A-3 (Time: 14:20 - 14:45)
 Title Systematic Architecture Exploration based on Optimistic Cycle Estimation for Low Energy Embedded Processors Author *Ittetsu Taniguchi (Osaka University, Japan), Murali Jayapala (IMEC vzw., Belgium), Praveen Raghavan, Francky Catthoor (IMEC vzw./K.U.Leuven, Belgium), Keishi Sakanushi, Yoshinori Takeuchi, Masaharu Imai (Osaka University, Japan) Page pp. 449 - 454 Keyword architecture exploration, address generation unit (AGU), reconfigurable architecture Abstract Systematic architecture exploration from a vast solution space is a complex problem in embedded system design. It is very difficult to find the best architecture quickly and accurately because accurate evaluation usually consumes a significant amount of time for each point in the solution space. In this paper, we propose a fast and systematic architecture exploration method for the address generation unit (AGU) based on a coarse grained reconfigurable architecture model. First we prove that a set of Pareto solutions of cycle vs energy becomes a subset of Pareto solutions of cycle vs area under some practical assumptions. In addition, we propose an "Optimistic cycle (OC)" metric to find promising solutions in the vast solution space. Based on this metric we also propose a fast architecture exploration algorithm which only applies mapping to promising architectures. Using the proposed systematic architecture exploration method, we show that we can obtain almost the same trade-off points as the exhaustive search method and also that our method is about 164 times faster than exhaustive search. Slides

5A-4 (Time: 14:45 - 15:10)
 Title A Framework for Estimating NBTI Degradation of Microarchitectural Components Author *Michael DeBole, Ramakrishnan Krishnan (The Pennsylvania State University, United States), Varsha Balakrishnan, Wenping Wang (Arizona State University, United States), Hong Luo, Yu Wang (Tsinghua University, China), Yuan Xie (The Pennsylvania State University, United States), Yu Cao (Arizona State University, United States), N. Vijaykrishnan (The Pennsylvania State University, United States) Page pp. 455 - 460 Keyword NBTI, Reliability, CAD, Computer Architecture Abstract Degradation of device parameters over the lifetime of a system is emerging as a significant threat to system reliability. Among the aging mechanisms, wearout resulting from NBTI is of particular concern in deep submicron technology generations. To facilitate architectural level aging analysis, a tool capable of evaluating NBTI vulnerabilities early in the design cycle has been developed. The tool includes workload-based temperature and performance degradation analysis across a variety of technologies and operating conditions, revealing a complex interplay between factors influencing NBTI timing degradation.

Session 5B  Design for Manufacturing and Reliability
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 413
Chair: Charles Chiang (Synopsys, United States)

5B-1 (Time: 13:30 - 13:55)
 Title Efficient Analytical Determination of the SEU-induced Pulse Shape Author Rajesh Garg, *Sunil P. Khatri (Texas A&M University, United States) Page pp. 461 - 467 Keyword Radiation, Single Event Upset, Single Event Transients Abstract Single event upsets (SEUs) have become problematic for both combinational and sequential circuits in the deep sub-micron era due to device scaling, lowered supply voltages and higher operating frequencies. To design radiation tolerant circuits efficiently, techniques are required to analyze the effects of a radiation particle strike on a circuit early in the design flow, and hence evaluate the circuit's resilience to SEU events. For an accurate estimation of the SEU tolerance of a circuit, it is important to consider the effects of electrical masking. This is typically done by performing circuit simulations, which are slow. In this paper, we present an analytical model for the determination of the shape of radiation-induced voltage glitches in combinational circuits. The output of our approach can be propagated to the primary outputs of the circuit using existing tools, thereby modeling the effects of electrical masking. This enables an accurate and quick evaluation of the SEU robustness of a circuit. Experimental results demonstrate that our model is very accurate, with a very low root mean square percentage error (4.5%) in the estimation of the shape of the voltage glitch compared to SPICE. Our model gains its accuracy by using a non-linear model for the load current of the gate, and by considering the effect of the ion track establishment constant on the radiation induced voltage glitch. Our analytical model is very fast (275X faster than SPICE) and accurate, and can therefore be easily incorporated in a design flow to estimate the SEU tolerance of circuits early in the design process.

5B-2 (Time: 13:55 - 14:20)
 Title Post-Routing Redundant Via Insertion with Wire Spreading Capability Author Cheok-Kei Lei, *Po-Yi Chiang, Yu-Min Lee (National Chiao Tung University, Taiwan) Page pp. 468 - 473 Keyword redundant via insertion, wire spreading, DFM, yield Abstract Redundant via insertion is a widely recommended technique to enhance via yield and reliability. In this paper, the post-routing redundant via insertion problem is transformed to a mixed bipartite-conflict graph matching problem, and an efficient heuristic minimum weighted matching (HMWM) algorithm is presented to solve it. The developed method not only inserts redundant vias for alive vias but also protects the dead vias by utilizing the wire spreading capability; that is, the method shifts wires into the empty space and adds redundant vias for dead vias to further enhance the via yield. Experimental results show that the average insertion rate for alive vias is 99.54% with a short run time, and the wire spreading technique achieves an average insertion rate of 54.41% for dead vias.
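To make the matching view of redundant via insertion concrete, here is a deliberately simplified Python sketch. It is not the HMWM algorithm from the abstract: it solves plain unweighted maximum bipartite matching with augmenting paths, and it ignores conflicts between candidate sites, edge weights, and wire spreading. The via names, site names, and the `candidates` map are hypothetical.

```python
def match_redundant_vias(candidates):
    """Assign each via at most one free neighboring site.

    candidates maps a via name to the list of sites where a redundant
    via could be placed. Each site can host only one redundant via.
    Returns a dict mapping via -> assigned site.
    """
    site_owner = {}  # site -> via currently assigned to it

    def try_assign(via, visited):
        # Augmenting-path search: take a free site, or evict the current
        # owner if that owner can be moved to another site.
        for site in candidates[via]:
            if site in visited:
                continue
            visited.add(site)
            if site not in site_owner or try_assign(site_owner[site], visited):
                site_owner[site] = via
                return True
        return False

    for via in candidates:
        try_assign(via, set())
    return {via: site for site, via in site_owner.items()}


# Hypothetical layout: v4 has no free neighboring site (a "dead" via).
candidates = {
    "v1": ["s1", "s2"],
    "v2": ["s1"],
    "v3": ["s2", "s3"],
    "v4": [],
}
assignment = match_redundant_vias(candidates)
insertion_rate = len(assignment) / len(candidates)
```

In this toy instance, v2 can only use s1, so the search moves v1 to s2 to make room. A dead via such as v4 stays unprotected here, which is the case the abstract's wire spreading step targets.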

5B-3 (Time: 14:20 - 14:45)
 Title Accounting for Non-linear Dependence Using Function Driven Component Analysis Author Lerong Cheng, *Puneet Gupta, Lei He (University of California, Los Angeles, United States) Page pp. 474 - 479 Keyword Noise margin, Statistical Analysis Abstract The majority of practical multivariate statistical analyses and optimizations model interdependence among random variables in terms of the linear correlation among them. Though linear correlation is simple to use and evaluate, in several cases non-linear dependence between random variables may be too strong to ignore. In this paper, we propose polynomial correlation coefficients as a simple measure of multi-variable non-linear dependence and show that the need for modeling non-linear dependence strongly depends on the end function that is to be evaluated from the random variables. Then, we calculate the errors in estimation which result from assuming independence of components generated by linear de-correlation techniques such as PCA and ICA. The experimental results show that the error predicted by our method is within 1% of the real simulation. In order to deal with non-linear dependence, we further develop a target function driven component analysis algorithm (FCA) to minimize the error caused by ignoring high order dependence and apply this technique to statistical leakage power analysis and SRAM cell noise margin variation analysis. Experimental results show that the proposed FCA method is more accurate than traditional PCA or ICA.

5B-4 (Time: 14:45 - 15:10)
 Title Risk Aversion Min-Period Retiming under Process Variations Author Jia Wang, *Hai Zhou (Northwestern University, United States) Page pp. 480 - 485 Keyword statistical optimization, retiming, process variations Abstract Recent advances in statistical timing analysis (SSTA) achieve great success in computing arrival times under variations by extending sum and maximum operations to random variables. It remains a challenging problem to apply such results in order to address the variability in circuit optimizations. In this paper, we study the statistical retiming problem, where retiming is a powerful sequential transformation that relocates flip-flops in a circuit without changing its functionality. We formulate the risk aversion min-period retiming problem under process variations based on a conventional two-stage stochastic program with fixed recourse and a risk aversion objective of the clock period. We prove that the proposed problem is an integer convex program, show that the subgradient of the objective function can be derived from the combinational paths with the maximum path delay, and present a heuristic incremental algorithm to solve the proposed problem. Our approach can handle arbitrary gate delay models under process variations through sampling from a black box, and its effectiveness is confirmed by the experimental results. Furthermore, we point out how the current state-of-the-art SSTA techniques could be improved for future optimization algorithms when analytical models are available.

5B-5s (Time: 15:10 - 15:22)
 Title Timing Analysis and Optimization Implications of Bimodal CD Distribution in Double Patterning Lithography Author Kwangok Jeong, *Andrew B. Kahng (University of California, San Diego, United States) Page pp. 486 - 491 Keyword Bimodal, DPL, Double patterning, CD Abstract Double patterning lithography (DPL) is in current production for memory products, and is widely viewed as inevitable for logic products at the 32nm node. DPL decomposes and prints the shapes of a critical-layer layout in two exposures. In traditional single-exposure lithography, adjacent identical layout features will have identical mean critical dimension (CD), and spatially correlated CD variations. However, with DPL, adjacent features can have distinct mean CDs, and uncorrelated CD variations. This introduces a new set of 'bimodal' challenges for timing analysis and optimization. We assess the potential impact of DPL on timing analysis error and guardbanding, and find that the traditional 'unimodal' characterization and analysis framework may not be viable for DPL. For example, using 45nm models, we find that different DPL mask layout solutions can cause 50ps skew in clock distribution that is unseen by traditional analyses. Different mask layouts can also result in 20% or more change in timing path delays. Such results lead to insights into physical design optimizations for clock and data path placement and mask coloring that can help mitigate the error and guardband costs of DPL.

5B-6s (Time: 15:22 - 15:34)
 Title Scheduled Voltage Scaling for Increasing Lifetime in the Presence of NBTI Author *Lide Zhang, Robert Dick (Northwestern University, United States) Page pp. 492 - 497 Keyword Scheduled Voltage Scaling, Negative Bias Temperature Instability (NBTI), Guard Banding Abstract Negative Bias Temperature Instability (NBTI) is a leading reliability concern for integrated circuits (ICs). It increases the threshold voltages of PMOS transistors, thereby increasing delay. We propose the use of scheduled voltage scaling that gradually enhances the operating voltage of the IC to compensate for NBTI-related performance degradation. Scheduled voltage scaling has the potential to increase IC lifetime by 46% relative to the conventional approach using guard banding for ICs fabricated using a 45nm process.

Session 5C  Analog, RF and Mixed-Signal CAD
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 414+415
Chairs: Eric Keiter (Sandia National Laboratories, United States), Chin-Fong Chiu (National Chip Implementation Center, Taiwan)

5C-1 (Time: 13:30 - 13:55)
 Title Efficiently Finding the 'Best' Solution with Multi-Objectives from Multiple Topologies in Topology Library of Analog Circuit Author *Yu Liu, Masato Yoshioka, Katsumi Homma, Toshiyuki Shibuya (Fujitsu Laboratories Ltd., Japan) Page pp. 498 - 503 Keyword pareto-front, multi-objective optimization, analog, topology Abstract This paper presents a new method using a multi-objective optimization algorithm to automatically find the best solution from a topology library of analog circuits. Compared to traditional optimization methods using single-objective optimization algorithms, this work can efficiently find the best non-dominated solution from multiple topologies for different specifications without additional time-consuming optimization iterations. The experiments demonstrate that this method is feasible and practical in actual analog designs, especially for uncertain or different specifications in multiple dimensions. Slides
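The notion of a non-dominated (Pareto-front) solution across several topologies can be sketched in a few lines of Python. This is only the standard dominance filter, not the optimization algorithm of the paper. The topology names and the objective values (both treated as quantities to minimize) are invented for illustration.

```python
def dominates(a, b):
    # a dominates b if it is no worse in every objective and strictly
    # better in at least one (all objectives are minimized).
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(candidates):
    # Keep each (topology, objectives) pair that no other candidate dominates.
    return [
        (name, objs)
        for name, objs in candidates
        if not any(dominates(other, objs) for _, other in candidates)
    ]


# Hypothetical sized designs: (power in mW, area in arbitrary units).
designs = [
    ("two_stage", (1.2, 0.8)),
    ("folded_cascode", (0.9, 1.1)),
    ("telescopic", (1.0, 1.2)),
    ("two_stage_b", (1.5, 0.9)),
]
front = pareto_front(designs)
front_names = [name for name, _ in front]
```

Once the front over all topologies is available, a new specification can be served by picking a point from the front rather than rerunning a single-objective optimization per topology, which is the benefit the abstract describes.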

5C-2 (Time: 13:55 - 14:20)
 Title Automated Design and Optimization of Circuits in Emerging Technologies Author *Rajesh A. Thakker, Chaitanya Sathe, Angada B. Sachid, Maryam Shojaei Baghini, V. Ramgopal Rao, Mahesh B. Patil (Department of Electrical Engineering, Indian Institute of Technology, Bombay, India) Page pp. 504 - 509 Keyword Look-up table, FinFET, Automatic design, Particle swarm optimization, Emerging technologies Abstract A novel table-based environment for automatic design and optimization of FinFET circuits is demonstrated. A new accurate look-up table (LUT) technique is implemented in a circuit simulator and integrated with particle swarm optimization algorithm for efficient circuit designs in novel devices. Op-amp circuits are designed and optimized to demonstrate the accuracy and usefulness of the proposed platform. Further, it is shown that the proposed design methodology can take into account variations in process, supply voltage, and temperature.

5C-3 (Time: 14:20 - 14:45)
 Title An Automated Design Approach for CMOS LDO Regulators Author *Samiran DasGupta, Pradip Mandal (Indian Institute of Technology, Kharagpur, India) Page pp. 510 - 515 Keyword Low dropout (LDO), voltage regulator, optimal sizing, design automation Abstract This paper presents a method for optimal sizing of CMOS low-dropout regulator circuits. The technique relies on the observation that many of the performance metrics of an LDO regulator can be approximated as posynomial functions of design variables. This allows the design problem to be cast as a geometric program. Geometric programming is particularly attractive as the optimization tool because (1) it can be solved very efficiently, (2) it always finds the global minimum, (3) infeasible specifications are readily determined, and (4) the final solution is completely independent of the initial guess. As a result, CMOS LDOs may be conveniently synthesized; moreover, the optimal trade-off curves between the competing performance metrics can be obtained very fast.
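The reason a geometric program reaches the global minimum from any starting point is that a posynomial becomes convex after the change of variables x = exp(u). The Python sketch below illustrates this on a toy posynomial, f(x, y) = x*y + 1/x + 1/y, whose minimum is 3 at x = y = 1 by the AM-GM inequality. The function is not an LDO performance model, and the gradient descent with backtracking used here is only a stand-in for a real geometric programming solver.

```python
import math


def posy_log(u, v):
    # Toy posynomial f(x, y) = x*y + 1/x + 1/y written in log variables
    # u = ln x, v = ln y, where it is a convex sum of exponentials.
    return math.exp(u + v) + math.exp(-u) + math.exp(-v)


def grad_log(u, v):
    return (math.exp(u + v) - math.exp(-u),
            math.exp(u + v) - math.exp(-v))


def minimize_posy(x0, y0, iters=500):
    # Gradient descent with backtracking line search in log space.
    u, v = math.log(x0), math.log(y0)
    for _ in range(iters):
        gu, gv = grad_log(u, v)
        g2 = gu * gu + gv * gv
        if g2 < 1e-16:
            break
        f0 = posy_log(u, v)
        step = 1.0
        # Halve the step until the Armijo sufficient-decrease test passes.
        while step > 1e-12 and posy_log(u - step * gu, v - step * gv) > f0 - 0.5 * step * g2:
            step *= 0.5
        u, v = u - step * gu, v - step * gv
    return math.exp(u), math.exp(v)


# Two very different initial guesses should end at the same sizing.
sol_a = minimize_posy(7.0, 0.4)
sol_b = minimize_posy(0.2, 3.0)
```

Both runs should arrive at approximately x = y = 1, which mirrors the abstract's point that the final solution does not depend on the initial guess.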

5C-4 (Time: 14:45 - 15:10)
 Title A SCORE Macromodel for PLL Designs to Analyze Supply Noise Interaction Issues at Behavioral Level Author *Chin-Cheng Kuo, Pei-Syun Lin, Chien-Nan Jimmy Liu (National Central University, Taiwan) Page pp. 516 - 521 Keyword supply noise interaction issues, analog behavioral model, PLL, macromodel Abstract Using behavioral models to perform fast simulation is currently a popular solution to verify SOC designs. Previous analog behavior modeling approaches often treat the noisy VDD waveform as a given input and focus on reflecting such stimuli on circuit performance. However, because the interaction of noise aggressors and victims is not considered, some errors may exist in the simulation. In this paper, a simple SCORE macromodel is proposed for PLL designs. It can be integrated with a supply-noise-aware PLL behavioral model to analyze supply noise effects at high level. In addition to numerical results, the time-varying supply noise waveform and real-time PLL responses can be obtained simultaneously. As demonstrated in the experimental results, the proposed approach can provide more realistic simulation results with noise interaction effects but still keep fast simulation time.

5C-5 (Time: 15:10 - 15:35)
 Title Gen-Adler: The Generalized Adler's Equation for Injection Locking Analysis in Oscillators Author *Prateek Bhansali, Jaijeet Roychowdhury (University of Minnesota, United States) Page pp. 522 - 527 Keyword Injection locking, Perturbation Projection Vector, Adler's equation, Oscillators Abstract Injection locking analysis based on the classical Adler's equation is limited to LC oscillators as it is dependent on the quality factor. In this paper, we present the Generalized Adler's equation, applicable for injection locking analysis on oscillators independent of the circuit topology. The equation is obtained by averaging the PPV phase macromodel. The procedure is simple and convenient for determining the locking range for small AC injection signals of arbitrary shape. Analytical equations for injection locking dynamics are formulated using the Generalized Adler's equation and validated with the PPV simulations. Slides
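For context on why the classical analysis depends on the quality factor, the following Python sketch evaluates the textbook weak-injection form of Adler's equation, d(phi)/dt = delta_omega - (omega_0 / (2Q)) * (I_inj / I_osc) * sin(phi). It covers only the classical LC-oscillator case, not the generalized equation of the paper. The function names and the numeric example (a 1 GHz oscillator with Q = 10 and a 10% injection ratio) are assumptions chosen for illustration.

```python
import math


def adler_lock_range_hz(f0_hz, q, inj_ratio):
    # One-sided lock range from classical Adler: (f0 / 2Q) * (I_inj / I_osc).
    # Valid only for weak injection into an LC oscillator.
    return f0_hz / (2.0 * q) * inj_ratio


def locked_phase_rad(delta_f_hz, lock_range_hz):
    # Steady state (d(phi)/dt = 0) requires sin(phi) = delta_f / lock_range.
    # Outside the lock range no steady-state phase exists.
    if abs(delta_f_hz) > lock_range_hz:
        return None
    return math.asin(delta_f_hz / lock_range_hz)


# Illustrative numbers, not taken from the paper.
lock_range = adler_lock_range_hz(1e9, 10.0, 0.1)
phase_inside = locked_phase_rad(2.5e6, lock_range)
phase_outside = locked_phase_rad(8.0e6, lock_range)
```

With these numbers the one-sided lock range is about 5 MHz, an injection detuned by 2.5 MHz locks with a phase offset of roughly pi/6, and an 8 MHz detuning falls outside the lock range. Because Q appears explicitly, this form does not carry over to oscillators without a well-defined quality factor, which is the limitation the abstract addresses.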

Session 5D  Designers' Forum: Consumer SoC
Time: 13:30 - 15:35 Wednesday, January 21, 2009
Location: Room 416+417
Chair: Yoshio Masubuchi (Toshiba Corporation, Japan)

5D-1 (Time: 13:30 - 14:10)
 Title (Invited Paper) Development of Full-HD Multi-standard Video CODEC IP Based on Heterogeneous Multiprocessor Architecture Author *Hiroaki Nakata, Koji Hosogi, Masakazu Ehama, Takafumi Yuasa, Toru Fujihira (Hitachi, Ltd., Japan), Kenichi Iwata, Motoki Kimura, Fumitaka Izuhara, Seiji Mochizuki, Masaki Nobori (Renesas Technology Corp., Japan) Page pp. 528 - 534 Keyword CODEC, video, full-HD, H.264, VC-1 Abstract To support numerous video codec standards and full-HD videos on different consumer devices, a multi-standard CODEC IP based on a heterogeneous multiprocessor architecture was developed. Operation-specific processors were designed for two types of processing: stream processing and pixel processing. The CODEC also makes effective use of several dedicated circuits. To design the CODEC, we developed a C-language model to check our design. The CODEC can process full-HD videos formatted in H.264, MPEG-2, MPEG-4, and VC-1 at 162 MHz operating frequency. Slides

5D-2 (Time: 14:10 - 14:50)
 Title (Invited Paper) A 65nm Dual-mode Baseband and Multimedia Application Processor SoC with Advanced Power and Memory Management Author *Tatsuya Kamei, Tetsuhiro Yamada, Takao Koike, Masayuki Ito, Takahiro Irita, Kenichi Nitta, Toshihiro Hattori, Shinichi Yoshioka (Renesas Technology Corp., Japan) Page pp. 535 - 539 Keyword application processor, mobile phone, low power, multi-media, MMU Abstract A Dual-mode baseband (W-CDMA/HSDPA and GSM/GPRS/EDGE) and multimedia application processor SoC is described. The SoC fabricated in triple-Vth 65nm CMOS has 3 CPU cores and 20 separate power domains to achieve both high performance and low power. The SoC adopts the Partial Clock Activation scheme that reduces power by 42% for long-time music replay. The IP-MMU is introduced to reduce maximum memory footprint by 43MB, sharing external memory among CPUs and HW-IPs using virtual address space that enables reuse of physically fragmented memory. Slides

5D-3 (Time: 14:50 - 15:30)
 Title (Invited Paper) UniPhier: Series Development and SoC Management Author *Yoshito Nishimichi, Nobuo Higaki, Masataka Osaka, Seiji Horii, Hisato Yoshida (Panasonic Corp., Japan) Page pp. 540 - 545 Keyword platform, SoC, UniPhier Abstract A digital CE integrated platform UniPhier (Universal Platform for High-quality Image Enhancing Revolution) has been developed to accelerate the sharing of technology and design assets across product categories from mobile phones to home-use AV. On the integrated platform, software and hardware assets are easy to use and can be reused across product categories, and a great enhancement of digital CE products has been realized. In this paper, an overview of the integrated platform UniPhier and its SoC (System on a Chip) application examples are described.

Session 6A  System Level Simulation and Modeling
Time: 15:55 - 18:00 Wednesday, January 21, 2009
Location: Room 411+412
Chairs: Vincent J Mooney (Georgia Institute of Technology, United States), Tsuneo Nakata (Fujitsu Laboratories Ltd., Japan)

6A-1 (Time: 15:55 - 16:20)
 Title Automatic Instrumentation of Embedded Software for High Level Hardware/Software Co-Simulation Author Aimen Bouchhima, *Patrice Gerin, Frédéric Pétrot (TIMA Laboratory, France) Page pp. 546 - 551 Keyword cosimulation, annotation Abstract We propose an automatic instrumentation method for embedded software annotation to enable performance modeling in high level hardware/software co-simulation environments. The proposed "cross-annotation" technique consists of extending a retargetable compiler infrastructure to allow the automatic instrumentation of embedded software at the basic block level. Thus, target and annotated native binaries are guaranteed to have isomorphic control flow graphs (CFG). The proposed method takes into account the processor-specific optimizations at the compiler level and proves to be accurate with low simulation overhead. Slides

6A-2 (Time: 16:20 - 16:45)
 Title Fast and Accurate Performance Simulation of Embedded Software for MPSoC Author *Eric Cheung, Harry Hsieh (University of California, Riverside, United States), Felice Balarin (Cadence Design Systems, United States) Page pp. 552 - 557 Keyword Performance Simulation, Multiprocessor Abstract Performance simulation of software for Multiprocessor System-on-a-Chips (MPSoC) suffers from poor tool support. Cycle accurate simulation at Instruction Set Simulation level is too slow and inefficient for any design of realistic size. Behavioral simulation, though useful for functional analysis at high level, does not provide any performance information that is crucial for design and analysis of MPSoC implementations. As a consequence, designers are often reduced to manually annotating performance information onto behavioral models, which contributes further to inefficiency and inaccuracy. In this paper, we use structural performance models to provide fast and accurate simulation of software for MPSoC. We generate structural models automatically using GCC with accurate performance annotation while considering optimizations for instruction selection, branch prediction, and pipeline interlock. Our structural models are able to simulate at several orders of magnitude faster than ISS and provide less than 1% error on performance estimation. These models allow realistic MPSoC design space explorations based on performance characteristics with simulation speed comparable to behavioral simulation. We validate our simulation models with several benchmarks and demonstrate our approach with a design case study of an MPEG-2 decoder.

6A-3 (Time: 16:45 - 17:10)
 Title Automatic Generation of Cycle Accurate and Cycle Count Accurate Transaction Level Bus Models from a Formal Model Author *Chen Kang Lo, Ren Song Tsay (National Tsing Hua University, Taiwan) Page pp. 558 - 563 Keyword System Level Design, Transaction Level Modeling, bus modeling Abstract This paper proposes the first automatic approach to simultaneously generate Cycle Accurate and Cycle Count Accurate transaction level bus models. Since TLM (Transaction Level Modeling) is proven as an effective design methodology for managing the ever-increasing complexity of system level designs, researchers often exploit various abstraction levels to gain either simulation speed or accuracy. Consequently, designers repeatedly perform the time-consuming task of re-writing and performing consistency checks for different abstraction level models of the same design. To ease the work, we propose a correct-by-construction method that automatically and simultaneously generates both fast and accurate transaction level bus models for system simulation. The proposed approach relieves designers from the tedious and error-prone process of refining models and checking for consistency. Slides

6A-4 (Time: 17:10 - 17:35)
 Title A Combined Analytical and Simulation-Based Model for Performance Evaluation of a Reconfigurable Instruction Set Processor Author *Farhad Mehdipour (Kyushu University, Japan), Hamid Noori (Institute of Systems, Information Technologies and Nanotechnologies, Japan), Bahman Javadi (Amirkabir University of Technology, Iran), Hiroaki Honda (Institute of Systems, Information Technologies and Nanotechnologies, Japan), Koji Inoue, Kazuaki Murakami (Kyushu University, Japan) Page pp. 564 - 569 Keyword Reconfigurable instruction set processors, Analytical modeling, reconfigurable accelerator, Performance evaluation, Design space exploration Abstract Performance evaluation is a serious challenge in designing or optimizing reconfigurable instruction set processors. The conventional approaches based on synthesis and simulations are very time consuming and need a considerable design effort. A combined analytical and simulation-based model (CAnSO) is proposed and validated for performance evaluation of a typical reconfigurable instruction set processor. The proposed model consists of an analytical core that incorporates statistics gathered from cycle-accurate simulation to make a reasonable evaluation and provide a valuable insight. Compared to cycle-accurate simulation results, CAnSO shows a variation of almost 2% in the speedup measurement. Slides

Session 6B  Chip and Package Routing Techniques
Time: 15:55 - 18:00 Wednesday, January 21, 2009
Location: Room 413
Chairs: Ting-Chi Wang (National Tsing Hua University, Taiwan), Yasuhiro Takashima (University of Kitakyushu, Japan)

6B-1 (Time: 15:55 - 16:20)
 Title Efficient Simulated Evolution Based Rerouting and Congestion-Relaxed Layer Assignment on 3-D Global Routing Author Ke-Ren Dai, *Wen-Hao Liu, Yih-Lang Li (National Chiao Tung University, Taiwan) Page pp. 570 - 575 Keyword Global Routing, layer assignment, simulated evolution Abstract The increasing complexity of interconnection designs has enhanced the importance of research into global routing when seeking high-routability (low overflow) results or rapid search paths that report wire-length estimations to a placer. This work presents two routing techniques, namely adaptive pseudo-random net-ordering routing and evolution-based rip-up and reroute using a two-stage cost function in a high-performance congestion-driven 2-D global router. We also propose two efficient via-minimization methods, namely congestion relaxation by layer shifting and rip-up and re-assignment, for a dynamic programming-based layer assignment. Experimental results demonstrate that our router achieves performance similar to the first two winning routers in the ISPD 2008 Routing Contest in terms of both routability and wire length, at 1.42X and 25.84X faster routing speeds. In addition, our layer assignment yields 3.5% to 5.6% fewer vias, 2.2% to 3.3% shorter wire-length and 13% to 27% less runtime than COLA.

6B-2 (Time: 16:20 - 16:45)
 Title FastRoute 4.0: Global Router with Efficient Via Minimization Author *Yue Xu, Yanheng Zhang, Chris Chu (Iowa State University, United States) Page pp. 576 - 581 Keyword Global Routing, Layer Assignment, 3-Bend Routing Abstract The number of vias generated during the global routing stage is a critical factor for the yield of final circuits. However, most global routers only approach the problem by charging a cost for vias in the maze routing cost function. In this paper, we present a global router that addresses the via number optimization problem throughout the entire global routing flow. We introduce the via aware Steiner tree generation, 3-bend routing and spiral layer assignment algorithm to reduce via count. We integrate these three techniques into FastRoute 3.0 and achieve significant reduction in both via count and runtime.

6B-3s (Time: 16:45 - 16:57)
 Title High-Performance Global Routing with Fast Overflow Reduction Author *Huang-Yu Chen, Chin-Hsiung Hsu, Yao-Wen Chang (National Taiwan University, Taiwan) Page pp. 582 - 587 Keyword Global Routing, Routing Abstract We develop a new global router, NTUgr, that contains three major steps: prerouting, initial routing, and iterative forbidden-region rip-up/rerouting (IFR). Prerouting employs a two-stage technique of congestion-hotspot historical cost pre-increment followed by small bounding-box area routing. Initial routing is based on efficient iterative monotonic routing. Finally, IFR features three techniques of (1) multiple forbidden regions expansion, (2) critical subnet rerouting selection, and (3) look-ahead historical cost increment. Experiments show that NTUgr achieves high-quality results for ISPD'07 and ISPD'08 benchmarks.

6B-4s (Time: 16:57 - 17:09)
 Title IO Connection Assignment and RDL Routing for Flip-Chip Designs Author Jin-Tai Yan, *Zhi-Wei Chen (Chung Hua University, Taiwan) Page pp. 588 - 593 Keyword Flip-chip, RDL routing, IO connection Abstract Given a set of IO buffers and bump balls with the capacity constraints between bump balls, an O(n^2) IO assignment and RDL routing algorithm is proposed to assign all the IO connections and minimize the total wirelength while satisfying the capacity constraints, and to guarantee 100% routability if the capacity constraint is permitted, where n is the number of bump balls in a flip-chip design. Compared with the combination of the greedy IO assignment and our RDL routing, our IO assignment reduces the global wirelength by 7.6% after global routing and improves the routability by 8.8% after detailed routing on average. Compared with the combination of our IO assignment, the single-layer BGA global router[7] and our detailed routing phase, our RDL routing reduces the global wirelength by 15.9% after global routing and improves the routability by 10.6% after detailed routing on average for some tested circuits in reasonable CPU time.

6B-5 (Time: 17:09 - 17:34)
 Title On Using SAT to Ordered Escape Problems Author Lijuan Luo, *Martin D.F. Wong (University of Illinois at Urbana-Champaign, United States) Page pp. 594 - 599 Keyword PCB routing, SAT Abstract Routing for high-speed boards is largely a time-consuming manual task today. The ordered escape routing problem is one of the key problems in board-level routing, and the Boolean Satisfiability (SAT) based approach is the only solution to this problem so far. In this paper, we first solve the major deficiency of the original SAT formulation so that the escape problem is completely resolved. Then we propose two techniques to extend the SAT approach for large-scale problems. Experimental results on industrial benchmarks show that our methods perform well in terms of both speed and routability.

6B-6 (Time: 17:34 - 17:59)
 Title A Fast Longer Path Algorithm for Routing Grid with Obstacles using Biconnectivity based Length Upper Bound Author *Yukihide Kohira, Suguru Suehiro, Atsushi Takahashi (Tokyo Institute of Technology, Japan) Page pp. 600 - 605 Keyword longer path algorithm, upper bound of wire length, routing design of PCB Abstract In this paper, a fast longer path algorithm that generates a path of a net in routing grid so that the length is increased as much as possible is proposed. In the proposed algorithm, an upper bound for the length in which the structure of a routing area is taken into account is used. Experiments show that our algorithm utilizes a routing area with obstacles efficiently.

Session 6D  Designers' Forum: ESL Design Methods
Time: 15:55 - 18:00 Wednesday, January 21, 2009
Location: Room 416+417

6D-1
 Title (Panel Discussion) ESL Design Methods Author Moderator: Takashi Hasegawa (Fujitsu Microelectronics Ltd., Japan), Panelists: Simon Bloch (Mentor Graphics Corporation, United States), Ahmed Jerraya (CEA-LETI, France), Gabriela Nicolescu (Ecole Polytechnique de Montreal, Canada), Shigeru Oho (Hitachi, Ltd., Japan), Koichiro Yamashita (Fujitsu Labs. Ltd., Japan)

 Thursday, January 22, 2009

Session 3K  Keynote Session III
Time: 9:00 - 10:00 Thursday, January 22, 2009
Location: Small Auditorium, 5F
Chair: Kazutoshi Wakabayashi (NEC Corp., Japan)

3K-1 (Time: 9:00 - 10:00)
 Title (Keynote Address) From Restrictive to Prescriptive Design Author Leon Stok (IBM Systems and Technology Group, United States) Abstract For many generations the hand-off between design and manufacturing has been done by a set of design rules. However, design rule manuals have grown in size from several tens of pages a few generations back to hundreds of pages now. Many more design rules have been added since the end of traditional scaling. Even with all these additional rules, corner cases are found late in the process that can become significant yield or functionality detractors. Restricted Design Rules (RDRs) have been created to simplify the design rules and come up with more manufacturable designs. IBM has practiced RDRs in the last few process generations but is this enough? For the next technology nodes no new exposure tools will be available for mass production and optical scaling is coming to a halt. Computational scaling will be required to extend Moore's law. In this new era, can we keep on describing design rules in terms of restrictions or do we need another approach?

Session 7A  Compilation Techniques for Embedded Systems
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 411+412
Chairs: Hiroyuki Tomiyama (Nagoya University, Japan), Maziar Goudarzi (Kyushu University, Japan)

7A-1 (Time: 10:15 - 10:40)
 Title Thermal-aware Post Compilation for VLIW Architectures Author *Wen-Wen Hsieh, TingTing Hwang (Department of Computer Science, National Tsing Hua University, Taiwan) Page pp. 606 - 611 Keyword thermal management, Post Compilation, VLIW architecture Abstract Development of a thermal management method to reduce hotspots and to balance the temperature distribution has become an important issue. In this paper, we propose a static thermal management technique at compiler level. The target machine is a VLIW architecture where the compiler is required to schedule instructions to achieve instruction level parallelism (ILP). Two techniques are proposed. The first one is register binding to balance the temperature of the register file by taking both spatial and temporal thermal information into consideration. The second one is forwarding methods including forwarding-aware architecture and instruction scheduling to reduce the access count of the register file. The experimental results show that by combining the two techniques, the peak temperature reduction can reach 7.89 °C in the best case and 7.22 °C on average with only 0.9% performance penalty on average. Slides

7A-2 (Time: 10:40 - 11:05)
 Title A Software Solution for Dynamic Stack Management on Scratch Pad Memory Author Arun Kannan, *Aviral Shrivastava, Amit Pabalkar, Jong-eun Lee (Arizona State University, United States) Page pp. 612 - 617 Keyword scratch pad, cache, stack, power, compiler Abstract We propose a dynamic scratch pad memory (SPM) management scheme for program stack data for processor power reduction. As opposed to previous efforts, our solution does not mandate any hardware changes, does not need profile information or the SPM size at compile time, and seamlessly integrates support for recursive functions. Our technique manages stack frames on SPM using a scratch pad memory manager (SPMM), integrated into the application binary by the compiler. Our experiments on benchmarks from MiBench [18] show average energy savings of 37% along with a performance improvement of 18%.

7A-3 (Time: 11:05 - 11:30)
 Title Compiler-Managed Register File Protection for Energy-Efficient Soft Error Reduction Author Jongeun Lee, *Aviral Shrivastava (Arizona State University, United States) Page pp. 618 - 623 Keyword soft error, register file, power-efficient, compiler, register allocation Abstract For embedded systems where neither energy nor reliability can be easily sacrificed, we present an energy efficient soft error protection scheme for register files (RF). Unlike previous approaches, our method explicitly optimizes for energy efficiency and exploits the fundamental tradeoff between reliability and energy. While even a simple compiler-managed RF protection scheme is more energy efficient than hardware schemes, this work formulates and solves further compiler optimization problems to significantly enhance the energy efficiency of RF protection schemes by an additional 24%.

7A-4 (Time: 11:30 - 11:55)
 Title Code Decomposition and Recomposition for Enhancing Embedded Software Performance Author *Youngchul Cho (SAIT, Samsung Electronics, Republic of Korea), Kiyoung Choi (Seoul National University, Republic of Korea) Page pp. 624 - 629 Keyword code transformation, code decomposition and recomposition, control-flow analysis, multitasking, code serialization Abstract Multitasking of concurrent processes implements the concurrency inherited from applications, increasing the utilization of limited resources. It requires an operating system and imposes significant runtime overhead. Serializing multitasking codes removes the need for an operating system as well as the overhead. In this paper, we propose a software synthesis method to transform multitasking codes into a single process code. For this, we decompose multitasking codes into a set of code fractions and then recompose the code fractions into a single process code, preserving the functionality of the original codes. We present two different techniques for the transformation, code partitioning and code covering, and propose a hybrid technique that combines the two.

Session 7B  Sequential Design Verification
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 413
Chairs: Yosinori Watanabe (Cadence, United States), Chung-Yang Huang (National Taiwan University, Taiwan)

7B-1 (Time: 10:15 - 10:40)
 Title Dependent Latch Identification in the Reachable State Space Author Chen-Hsuan Lin, *Chun-Yao Wang (National Tsing Hua University, Taiwan) Page pp. 630 - 635 Keyword dependent latch, functional dependency, reachability analysis Abstract The large number of latches in current designs increases the complexity of formal verification and logic synthesis, since the growth in the number of latches causes the state space to explode exponentially. One solution to this problem is to find the functional dependencies among these latches. Then, these latches can be identified as dependent latches or essential latches, where the state space can be constructed using only the essential latches. This paper proposes an approach to find the functional dependencies among latches in a sequential circuit by using SAT solvers with the Craig interpolation theorem. In addition, the proposed approach detects sequential functional dependencies existing in the reachable state space only. Experimental results show that our approach could deal with large sequential circuits with up to 1.5K latches in a reasonable time and simultaneously identify the combinational and sequential dependent latches.

7B-2 (Time: 10:40 - 11:05)
 Title Complete-k-Distinguishability for Retiming and Resynthesis Equivalence Checking without Restricting Synthesis Author Nikolaos Liveris, *Hai Zhou (Northwestern University, United States), Prithviraj Banerjee (HP Labs, United States) Page pp. 636 - 641 Keyword sequential equivalence checking, retiming and resynthesis, verification Abstract Iterative retiming and resynthesis is a powerful way to optimize sequential circuits but its massive adoption has been hampered by the hardness of verification. This paper tackles the problem of retiming and resynthesis equivalence checking on a pair of circuits. For this purpose we define the Complete-k-Distinguishability (C-k-D) property for any natural number k based on C-1-D. We show how the equivalence checking problem can be simplified if the circuits satisfy this property and prove that the method is complete for any number of retiming and resynthesis steps. We also provide a way to enforce C-k-D on the circuits without restricting the optimization power of retiming and resynthesis or increasing their complexity. Experimental results demonstrate that enforcing the C-k-D property can speed up the verification process.

7B-4 (Time: 11:30 - 11:55)
 Title Multi-Clock SVA Synthesis without Re-writing Author *Jiang Long, Andrew Seawright, Paparao Kavalipati (Mentor Graphics Corp., United States) Page pp. 648 - 653 Keyword SVA, Formal verification, Multi-Clock Abstract This paper presents a compilation procedure for synthesizing multi-clock SVA properties for formal verification. The synthesis framework is built upon an existing compilation algorithm for single-clock SVA properties. While we could use the SVA rewriting rules to transform a multi-clock property into a single-clocked property and then apply existing techniques, instead we propose techniques to selectively model the multi-clock operators to produce a smaller checker logic. Through recursive construction and syntactic transformation, we are able to demonstrate the efficiency of the technique, and the generated checker logic is provably equivalent to the rewriting version.

7B-5 (Time: 11:55 - 12:20)
 Title Automatic Formal Verification of Clock Domain Crossing Signals Author *Bing Li, Chris Ka-Kei Kwok (Mentor Graphics Corporation, United States) Page pp. 654 - 659 Keyword formal verification, clock domain crossing, assertion logic Abstract In this paper, we present an approach that uses formal methods to verify Clock Domain Crossing (CDC) issues in a fully automatic way. First, we discuss various CDC schemes and the corresponding checks that need to be formally verified. Then we demonstrate how to synthesize them into assertion logic. After that a fully automatic, on-the-fly formal CDC approach is proposed. To the best of our knowledge, this is the first paper discussing fully automatic, on-the-fly formal verification of CDC signals. Experimental results show that our automatic formal CDC, when compared with the conventional post-CDC formal CDC, takes much less time but still proves a significant number of CDC checks. Slides

Session 7C  Scan Test Generation
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 414+415
Chair: Satoshi Ohtake (NAIST, Japan)

7C-1 (Time: 10:15 - 10:40)
 Title Fast False Path Identification Based on Functional Unsensitizability Using RTL Information Author *Yuki Yoshikawa (Hiroshima City University, Japan), Satoshi Ohtake (Nara Institute of Science and Technology, Japan), Tomoo Inoue (Hiroshima City University, Japan), Hideo Fujiwara (Nara Institute of Science and Technology, Japan) Page pp. 660 - 665 Keyword Delay test, False path, RTL, Over-testing reduction Abstract In this paper, we propose a method for identifying false paths based on functional unsensitizability of path delay faults. By using RTL structural information, a number of gate level paths are bound into an RTL path and the bundle of them can be identified in a reasonable amount of time. The identified false paths are useful for reducing the over-testing caused by DFT techniques, such as scan design, and also for area and performance optimization of circuits during logic synthesis. Experimental results show that our proposed method can identify false paths in a few seconds for several benchmarks. Slides

7C-2 (Time: 10:40 - 11:05)
 Title Conflict Driven Scan Chain Configuration for High Transition Fault Coverage and Low Test Power Author *Zhen Chen, Boxue Yin, Dong Xiang (Tsinghua University, China) Page pp. 666 - 671 Keyword broadside, fault coverage, low power, conflict Abstract Two conflict driven methods and the architecture based on them are presented to improve the fault coverage and reduce test power. By analyzing the functional dependency of test vectors in broad-side and the shift dependency of vectors in skewed-load, some scan cells are selected to operate in the enhanced scan and skewed-load scan mode, while others operate in traditional broad-side mode. Experimental results show that the fault coverage can reach a level very close to that of enhanced scan. Slides

7C-3 (Time: 11:05 - 11:30)
 Title Dynamic Test Compaction for a Random Test Generation Procedure with Input Cube Avoidance Author Irith Pomeranz (Purdue University, United States), *Sudhakar Reddy (University of Iowa, United States) Page pp. 672 - 677 Keyword dynamic test compaction, test generation, stuck-at faults, full-scan Abstract A recent approach to test generation avoids the assignment of certain input values in order not to prevent target faults from being detected. The test generation process based on this approach is efficient; however, it generates large test sets. We develop a dynamic test compaction procedure for this approach. Our goal is to reduce the test set size by increasing the number of faults detected by each test vector, while keeping the computational complexity as low as that of the original procedure. This is achieved by avoiding the assignment of certain input values in order not to prevent subsets of faults from being detected.

7C-4 (Time: 11:30 - 11:55)
 Title Detectability of Internal Bridging Faults in Scan Chains Author *Fan Yang (University of Iowa, United States), Sreejit Chakravarty, Narendra Devta-Prasanna (LSI Corp., United States), Sudhakar M. Reddy (University of Iowa, United States), Irith Pomeranz (Purdue University, United States) Page pp. 678 - 683 Keyword scan chain, bridge fault, resistive bridge, internal fault, non-feedback bridge Abstract We investigate the detection of scan cell internal bridging faults extracted from layout. We show that detection of some zero-resistance non-feedback bridging faults requires two-pattern tests. Half-speed flush tests we proposed earlier detect additional bridging faults. Undetectable faults are classified based on the reasons for their undetectability. Both non-resistive and resistive bridging fault models are considered in this work. A low power supply voltage based test method and IDDQ testing are examined for resistive bridging fault detection.

7C-5 (Time: 11:55 - 12:20)
 Title Fault Modeling and Testing of Retention Flip-Flops in Low Power Designs Author *Bing-Chuan Bai (Department of Electrical Engineering, National Taiwan University, Taiwan), Augusli Kifli (Design Development Division, Faraday Technology Corporation, Taiwan), Chien-Mo Li (Department of Electrical Engineering, National Taiwan University, Taiwan), Kun-Cheng Wu (Design Development Division, Faraday Technology Corporation, Taiwan) Page pp. 684 - 689 Keyword Retention, Fault Model, low power, ATPG, Testing Abstract Retention flip-flop is one of the most important components in low power designs. This paper presents four new fault models of retention flip-flop. The four faults model the defects that affect the retained value, wakeup time, and sleep time of retention flip-flops. Test patterns for retention flip-flop can be easily generated by ATPG tools. The proposed test methodology is validated by performing experiments on ISCAS89 benchmark circuits and industrial designs. The experimental results show that average fault coverage is 98%.

Session 7D  Designers' Forum: Analog/RF Circuit Designs
Time: 10:15 - 12:20 Thursday, January 22, 2009
Location: Room 416+417
Chair: Makoto Ikeda (University of Tokyo, Japan)

7D-1 (Time: 10:15 - 10:45)
 Title (Invited Paper) Design Methods for Pipeline & Delta-Sigma A-to-D Converters with Convex Optimization Author *Kazuo Matsukawa, Takashi Morie, Yusuke Tokunaga, Shiro Sakiyama, Yosuke Mitani, Masao Takayama, Takuji Miki, Akinori Matsumoto, Koji Obata, Shiro Dosho (Panasonic Corp., Japan) Page pp. 690 - 695 Keyword optimization, ADC, pipeline, delta, sigma Abstract In system LSIs, the relative cost of analog circuits is increasing because of the rapid cost reduction of digital circuits. To satisfy given specifications in the analog design, including low power and small area, designers have to select an optimal solution among a large combination of the following alternatives: which architecture should be adopted; what type of transistors should be taken; and whether digitally assisting technologies should be used or not, etc. A design based on experience and intuition cannot lead to the optimum in a short time. A comprehensive approach to the optimization, based on circuit theory, is now required. A convex optimization procedure can solve the formulae which represent circuit performance with hundreds of design variables. We have constructed optimization environments for pipelined and delta-sigma analog-to-digital converters (ADCs) in consideration of the digitally assisting techniques and layout constraints. Both 12-bit pipelined ADCs and a 5th-order delta-sigma modulator were designed with the optimizer, and achieved top-ranked power efficiency.

7D-2 (Time: 10:45 - 11:15)
 Title (Invited Paper) A Low-Jitter 1.5-GHz and Large-EMI reduction 10-dBm Spread-Spectrum Clock Generator for Serial-ATA Author *Takashi Kawamoto, Masaru Kokubo (Hitachi, Ltd., Japan) Page pp. 696 - 701 Keyword Serial-ATA, PLL, VCO, calibration, SSCG Abstract A low-jitter and large-EMI-reduction spread spectrum clock generator (SSCG) for Serial-ATA (SATA) was developed. A low-jitter VCO with a high-frequency limiter was developed to prevent SSCGs from malfunctioning. An autocalibration technique suitable for this VCO was developed to prevent SSCGs from degradation because of process variations. A SATA PHY using a technique for calibrating SSCG was developed to use an inexpensive but large frequency-variation reference oscillator. The fabricated SSCG achieved a 10.0-dB EMI reduction and 1.9-3.3 ps rms jitter by the proposed autocalibration technique. The fabricated SATA PHY achieved less than 400-ppm production-frequency tolerance of reference clocks.

7D-3 (Time: 11:15 - 11:45)
 Title (Invited Paper) RF-Analog Circuit Design in Scaled SoC Author *Nobuyuki Itoh, Mototsugu Hamada (Toshiba Corp., Japan) Page pp. 702 - 707 Keyword RFCMOS, SoC, Design Abstract Downscaling of process technology increases the development cost of RFCMOS SoC. Therefore, designers have to minimize the number of respins, and have to try to obtain higher yield. RFCMOS SoC consists of RF-analog, mixed-signal, logic and memory circuits. In order to realize a small number of respins and higher yield, key issues are robust design methodology of RF-analog circuits, and full-chip verification. This paper describes practical techniques corresponding to those issues.

7D-4 (Time: 11:45 - 12:15)
 Title (Invited Paper) An Approach to the RF-LSI Design for Ubiquitous Communication Appliances Author *Yuichi Kado, Mitsuru Harada (NTT, Japan) Page pp. 708 - 714 Keyword ubiquitous network, RF, Low power, Digital calibration Abstract We propose a wide area ubiquitous network as a highly economical and convenient wireless system for providing a wide variety of services. Its basic feature is wide coverage using ultra low power consumption terminals, and its specific target is a 5-km cell radius using 10-mW transmission power terminals run on ten-year life batteries. In this paper we explain the wireless specifications and the low power consumption performance required of wireless terminals used in these ubiquitous networks. We then introduce a design method that harmonizes RF and digital components and an ultra low power consumption LSI design that make it possible to satisfy these requirements.

Session 8A  High-Level Design and Scheduling
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 411+412
Chairs: Yuichi Nakamura (NEC Corp., Japan), Keishi Sakanushi (Osaka University, Japan)

8A-1 (Time: 13:30 - 13:55)
 Title Improving Scalability of Model-Checking for Minimizing Buffer Requirements of Synchronous Dataflow Graphs Author Nan Guan (Northeastern University, China), *Zonghua Gu (Hong Kong University of Science and Technology, China), Wang Yi (Uppsala University, Sweden), Ge Yu (Northeastern University, China) Page pp. 715 - 720 Keyword SDF, model-checking Abstract Synchronous Dataflow (SDF) is a well-known model of computation for dataflow-oriented applications such as embedded systems for signal processing and multimedia. It is important to minimize the buffer size requirements of applications generated from SDF graphs, since memory space is often a scarce resource in these systems due to cost or power consumption constraints. Some authors have proposed to use model-checking for finding the minimum buffer size requirements, but the scalability of model-checking is limited by state space explosion. In this paper, we present several techniques for reducing state space size and improving scalability of model-checking by exploiting problem-specific properties of SDF graphs. Slides
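
As a minimal illustration of the buffer-sizing problem described in this abstract (not the authors' model-checking method), the sketch below searches for the smallest buffer capacity on a single SDF edge that lets one full iteration complete without deadlock. The function names and the consume-first firing policy are assumptions made for this example only.

```python
from math import gcd


def can_complete(p, c, cap, reps_prod, reps_cons):
    """Simulate one iteration of a producer/consumer pair on a bounded buffer.

    p, c: tokens produced/consumed per firing; cap: buffer capacity.
    Policy (an assumption of this sketch): consume whenever possible,
    otherwise produce if the buffer has room; deadlock if neither can fire.
    """
    tokens = fired_p = fired_c = 0
    while fired_p < reps_prod or fired_c < reps_cons:
        if fired_c < reps_cons and tokens >= c:
            tokens -= c
            fired_c += 1
        elif fired_p < reps_prod and tokens + p <= cap:
            tokens += p
            fired_p += 1
        else:
            return False  # neither actor can fire: deadlock
    return True


def min_buffer_single_edge(p, c):
    """Smallest capacity that allows one deadlock-free iteration."""
    g = gcd(p, c)
    reps_prod, reps_cons = c // g, p // g  # repetition counts from the balance equation
    cap = max(p, c)
    while not can_complete(p, c, cap, reps_prod, reps_cons):
        cap += 1
    return cap
```

For example, `min_buffer_single_edge(2, 3)` explores capacities 3 and 4 before succeeding. Real SDF graphs have many edges and interleavings, which is where the state space explosion mentioned in the abstract arises.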

8A-2 (Time: 13:55 - 14:20)
 Title A Reverse-Encoding-based on-chip AHB Bus Tracer for Efficient Circular Buffer Utilization Author *Fu-Ching Yang, Cheng-Lung Chiang, Ing-Jer Huang (National Sun Yat-Sen University, Taiwan) Page pp. 721 - 726 Keyword tracer, reverse encoding, pre-t trace, post-t trace, compression Abstract The pre-T/post-T trace refers to the trace captured before/after a target point is reached, respectively. Real time compression of the post-T trace in a circular buffer is a challenging problem since the initial state of the trace being compressed might be corrupted when wrapping around occurs and thus makes it difficult to reconstruct the trace from the incomplete information stored in the circular buffer. This paper proposes an efficient compression algorithm which is capable of compressing both pre-T and post-T traces. The algorithm is based on an innovative reverse encoding scheme by reversing the order of the datum being encoded and the datum being referred. This algorithm has been successfully implemented in a real-time on-chip AHB bus tracer and has been embedded in a 3D graphics SoC as an application example. The bus tracer costs only 44K gates and runs at 500MHz at 0.13um technology. Experiments have shown that this bus tracer achieves 100% circular buffer utilization and captures 1.2x and 4.86x the trace depth of state-of-the-art related work and conventional industrial approaches, respectively. Slides

8A-3 (Time: 14:20 - 14:45)
 Title Analyzing and Optimizing Energy Efficiency of Algorithms on DVS Systems: a First Step towards Algorithmic Energy Minimization Author *Tetsuo Yokoyama, Gang Zeng, Hiroyuki Tomiyama, Hiroaki Takada (Nagoya University, Japan) Page pp. 727 - 732 Keyword Intratask dynamic voltage frequency scaling, Algorithmic energy minimization, Static voltage scaling, Sorting algorithms Abstract The energy efficiency at the algorithmic level on DVS systems and its analysis and optimization methods are presented. Given a problem, the most energy efficient algorithm is not uniquely determined but dependent on multiple factors, including the execution time distribution, intratask dynamic voltage scaling (IntraDVS) policies, the size of intermediate data structure, and the size of inputs. We show that at the algorithmic level principles behind energy optimization and performance optimization are not identical. We propose a metric for evaluating optimal energy efficiency of static voltage scaling (SVS) and a few new effective IntraDVS policies employing data flow information. Experimental results on sorting algorithms show the existence of several tradeoffs in terms of energy consumption. Transforming algorithms by employing problem specific knowledge and data flow information successfully improves their energy efficiency. Slides

8A-4 (Time: 14:45 - 15:10)
 Title Novel Task Migration Framework on Configurable Heterogeneous MPSoC Platforms Author Hao Shen, *Frédéric Pétrot (TIMA Laboratory, INP Grenoble, France) Page pp. 733 - 738 Keyword ASIP, migration framework, heterogeneous, MPSoC Abstract Heterogeneous MPSoC architectures can provide higher performance and flexibility with less power consumption and lower cost than homogeneous ones. However, as processor instruction sets of general heterogeneous MPSoCs are not identical, task migration between two heterogeneous processors is not possible. To enable this function, we propose to build one specific heterogeneous MPSoC platform in which all heterogeneous processors are based on the same core instruction set for the operating system realization. Different extended instructions can be added for different processors to improve the system performance. Tasks can be migrated from one processor to another only if the target processor has all instructions which can meet the execution requirement of this task. This paper concentrates on the infrastructure that is necessary to support the scheduling and migration of tasks between the processors. By using the Motion-JPEG case study, we confirm that our task migration framework can achieve higher processor usage rate and more flexibility. Slides

Session 8B  Emerging Design Methodologies and Applications
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 413
Chair: Chin-Long Wey (National Central University, Taiwan)

8B-1 (Time: 13:30 - 13:55)
 Title A Novel Toffoli Network Synthesis Algorithm for Reversible Logic Author *Yexin Zheng, Chao Huang (Virginia Tech, United States) Page pp. 739 - 744 Keyword reversible logic, quantum computing, logic synthesis Abstract Reversible logic studies have promising potential on energy lossless circuit design, quantum computation, nanotechnology, etc. Reversible logic features a one-to-one input output correspondence which makes the logic synthesis for reversible functions differ greatly from traditional Boolean functions. Exact synthesis methods can provide optimal solutions in terms of the total number of reversible gates in the synthesis results. Unfortunately, they may suffer from long computation time, due to the fact that the search space is likely to grow exponentially as the circuit size increases. Therefore, in this paper, we propose an efficient synthesis heuristic which provides high quality synthesis results of Toffoli network in more reasonable computation time. We use a weighted, directed graph for reversible function representation and complexity measurement. The proposed algorithm maximally decreases function complexity during synthesis steps. It has the ability to climb out of local minima and guarantees algorithm convergence. The experimental results show that our algorithm can achieve optimal or very close to optimal solutions with computation time several orders of magnitude less than the exact methods. Compared with other heuristics, our method demonstrates superior performance in terms of reversible gate count as well as computation time.
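
To make the one-to-one input/output correspondence mentioned in this abstract concrete, the following sketch models a Toffoli network as a cascade of multiple-control Toffoli gates and checks that it is a bijection on bit vectors. It is background illustration only, not the authors' synthesis heuristic; the gate encoding `(controls, target)` and all function names are assumptions of this example.

```python
from itertools import product


def toffoli(bits, controls, target):
    """Flip bits[target] when every control bit is 1 (no controls = NOT gate)."""
    out = list(bits)
    if all(bits[c] for c in controls):
        out[target] ^= 1
    return tuple(out)


def apply_network(bits, gates):
    """Apply a cascade of (controls, target) gates from left to right."""
    for controls, target in gates:
        bits = toffoli(bits, controls, target)
    return bits


def is_reversible(gates, n):
    """A network on n lines is reversible if all 2**n inputs map to distinct outputs."""
    outputs = {apply_network(b, gates) for b in product((0, 1), repeat=n)}
    return len(outputs) == 2 ** n


# Hypothetical 3-line network: a Toffoli, a CNOT, and a NOT gate.
network = [((0, 1), 2), ((0,), 1), ((), 0)]
```

Because each gate is its own inverse, applying the same gates in reverse order undoes the network, which is one simple way to see why any such cascade is reversible.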

8B-2 (Time: 13:55 - 14:20)
 Title A Cycle-Based Synthesis Algorithm for Reversible Logic Author *Zahra Sasanian, Mehdi Saeedi, Mehdi Sedighi, Morteza Saheb Zamani (Amirkabir University of Technology, Iran) Page pp. 745 - 750 Keyword Reversible Logic, Cycle, NCT Library Abstract Several algorithms have been proposed for the synthesis of reversible circuits. In this paper, a cycle-based synthesis algorithm for reversible logic, based on the NCT library, has been proposed. In other words, direct implementation of a single 3-cycle, a pair of 3-cycles and a pair of 2-cycles have been explored and used to propose an efficient Toffoli-based synthesis algorithm for reversible circuits. The synthesis algorithm decomposes a given large cycle into a set of single 3-cycles, pairs of 3-cycles and pairs of 2-cycles and synthesizes the resulting cycles directly. Our experimental results show that the proposed synthesis algorithm can outperform the available 2-cycle-based approach by about 34% on average. In addition, several discussions for the generalization of the proposed method to the 2m-cycles are given. Slides
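
The decomposition of a large cycle into 3-cycles mentioned in this abstract can be illustrated with a small sketch. It covers only the simplest case, an odd-length cycle split into single 3-cycles, and uses a textbook identity rather than the authors' algorithm; the function names and the left-to-right application order are assumptions of this example.

```python
def cycle_to_3cycles(cycle):
    """Split an odd-length cycle (a0 a1 ... a_{k-1}) into 3-cycles.

    The result is meant to be applied left to right:
    (a0 a1 a2), (a0 a3 a4), (a0 a5 a6), ...
    An even-length cycle is an odd permutation, so it cannot be written
    as a product of 3-cycles alone.
    """
    if len(cycle) % 2 == 0:
        raise ValueError("even-length cycle cannot be a product of 3-cycles")
    a0 = cycle[0]
    return [(a0, cycle[i], cycle[i + 1]) for i in range(1, len(cycle) - 1, 2)]


def apply_cycles(cycles, n):
    """Return perm where perm[x] is the image of x after applying the cycles in order."""
    perm = list(range(n))
    for cyc in cycles:
        step = {cyc[i]: cyc[(i + 1) % len(cyc)] for i in range(len(cyc))}
        perm = [step.get(y, y) for y in perm]
    return perm
```

For instance, `cycle_to_3cycles((0, 1, 2, 3, 4))` yields `[(0, 1, 2), (0, 3, 4)]`, and `apply_cycles` confirms that this product equals the original 5-cycle. The parity restriction is one reason a decomposition also needs pairs of 2-cycles for general permutations.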

8B-3 (Time: 14:20 - 14:45)
 Title Array Like Runtime Reconfigurable MIMO Detectors for 802.11n WLAN: A Design Case Study Author Pankaj Bhagawat, Rajballav Dash, *Gwan Choi (Texas A&M University, United States) Page pp. 751 - 756 Keyword MIMO systems, 802.11n, Reconfigurability Abstract Future high speed wireless standards such as 802.11n involve Multiple Input Multiple Output (MIMO) antenna systems as a key technology component. Efficient design of the MIMO detector is a challenging task. This is further compounded by the fact that the 802.11n standard requires support for runtime switching between different modulation schemes (or modes). While searching for an appropriate architecture, attention must be paid to application requirements such as required throughput, limits on latency, and reconfiguration between various modes of operation. Important hardware design metrics such as area/power should be optimized over all the operating modes of the detector. In this paper we carry out extensive architectural space exploration to address the issues of power consumption, area, and reconfigurability between different modes of operation while meeting the standard's throughput requirement. Ultimately, we come up with two designs that target low area and low power respectively. We also maintain close to optimum Bit Error Rate (BER), which is vital for any wireless system. The design estimates are based on 45nm technology library. Slides

8B-4 (Time: 14:45 - 15:10)
 Title Mapping method for Dynamically Reconfigurable Architecture Author *Akira Kuroda, Mayuko Koezuka, Hidenori Matsuzaki, Takashi Yoshikawa, Shigehiro Asano (Toshiba Corporation, Japan) Page pp. 757 - 762 Keyword dynamically reconfigurable architecture, compiler, mapping Abstract In this paper, we present a mapping algorithm for our dynamically reconfigurable architecture which is suitable for stream applications such as H.264. Because our target architecture consists of four heterogeneous units with different configuration formats, it is difficult to apply the conventional algorithms. We propose a heuristic mapping algorithm which makes it possible to map a generic data flow graph onto this complex hardware automatically. We mapped five main functions of an H.264 decoder onto our architecture and compared the results against manual mapping done by an experienced engineer. The result shows that three of the five functions are optimized as well as the manual mapping.

8B-5 (Time: 15:10 - 15:35)
 Title A Criticality-Driven Microarchitectural Three Dimensional (3D) Floorplanner Author Srinath Sridharan, *Michael DeBole, Guangyu Sun, Yuan Xie, Vijaykrishnan Narayanan (Pennsylvania State University, United States) Page pp. 763 - 768 Keyword 3D IC, 3D architecture Abstract As technology scales, interconnect delay starts to dominate the performance of modern microprocessors. Three dimensional (3D) chip structures have been proposed as a solution to mitigate the interconnect challenge, with the capability of reducing global wiring lengths. Previous works on 3D microprocessor floorplanning have demonstrated the benefits of such wire reductions. However, in modern microprocessors, not all the global interconnects are equally important: some are critical for the performance and hence the wire reduction via 3D stacking can result in great performance improvement, while others may not be on the critical path and therefore the wire reduction may not have impact on the performance. In this paper, we propose a floorplanner for 3D chips that will organize functional blocks according to critical microarchitectural communication paths in order to reduce latencies that hinder processor performance. We identify potential triggers, in the form of feedback delays, that are responsible for incurring high communication costs and curb their negative effect on performance by intelligently placing the functional blocks in 3D without compromising on area, overlap power density and thermal reliability. With our criticality-driven 3D placement, there is an average IPC improvement of 22%, and up to 64%, over 2D placement. Over criticality-unaware 3D placement, criticality-driven 3D placement shows an average IPC improvement of 8%, and up to 25%.

Session 8C  Verification, Test, and Yield
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 414+415
Chairs: Yasuo Sato (Hitachi, Ltd., Japan), Sudhakar M. Reddy (University of Iowa, United States)

8C-1 (Time: 13:30 - 13:55)
 Title Self-Adjusting Constrained Random Stimulus Generation Using Splitting Evenness Evaluation and XOR Constraints Author Shujun Deng, Zhiqiu Kong, *Jinian Bian, Yanni Zhao (Department of Computer Science and Technology, Tsinghua University, China) Page pp. 769 - 774 Keyword stimulus generation, SAT, even distribution, splitting, XOR constraint Abstract Constrained random stimulus generation plays significant roles in hardware verification nowadays, and the quality of the generated stimuli is key to the efficiency of the test process. In this work, we present a linear dynamic method to guide random stimulus generation by SAT solvers. A splitting simplified Min-Distance-Sum evaluation method and an XOR sampling strategy are integrated in the self-adjusting random stimulus generation framework. The evenness of the split groups is evaluated to find out some uneven parts. Then, random partial solutions for the uneven parts and random XOR constraints for the other inputs are added into constraints to get better distributed stimuli. Experimental results show that our method can evaluate the evenness as well as more complex formulae for stimulus generation, and also confirm that the self-adjusting method can improve the fault coverage ratio by more than 17% on average with the same number of stimuli. Slides
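
The idea of adding random XOR constraints to spread out generated stimuli can be sketched on a toy scale. The example below enumerates solutions by brute force instead of calling a SAT solver and omits the evenness evaluation entirely, so it is only a rough illustration of XOR-based sampling under those assumptions; all names are hypothetical.

```python
import random
from itertools import product


def solutions(constraint, n):
    """Brute-force stand-in for a SAT solver: all n-bit vectors satisfying the constraint."""
    return [b for b in product((0, 1), repeat=n) if constraint(b)]


def random_xor(n, rng):
    """Random parity constraint over a random subset of the n input bits."""
    subset = [i for i in range(n) if rng.random() < 0.5]
    parity = rng.randint(0, 1)
    return lambda b: sum(b[i] for i in subset) % 2 == parity


def sample_with_xors(constraint, n, num_xors, rng):
    """Keep only the solutions that also satisfy num_xors random XOR constraints."""
    xors = [random_xor(n, rng) for _ in range(num_xors)]
    return [b for b in solutions(constraint, n) if all(x(b) for x in xors)]
```

Each random XOR constraint tends to cut the solution set roughly in half without favoring any region of the input space, so repeated draws with fresh constraints yield stimuli scattered across the solution space. A particular draw can also leave no solutions, in which case one would simply draw again.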

8C-2 (Time: 13:55 - 14:20)
 Title Diagnosing Integrator Leakage of Single-Bit First-Order Delta-Sigma Modulator Using DC Input Author *Xuan-Lun Huang, Chen-Yuan Yang, Jiun-Lang Huang (Graduate Institute of Electronics Engineering, Department of Electrical Engineering, National Taiwan University, Taiwan) Page pp. 775 - 780 Keyword analog/mixed-signal testing, diagnosis, design-for-test (DfT), delta-sigma modulation, integrator leakage Abstract Integrator leakage is a dominant factor in the SNR (signal-to-noise ratio) loss of delta-sigma modulators. In this paper, we propose a Design-for-Test (DfT) technique to diagnose the integrator leakage of the single-bit first-order delta-sigma modulator. The proposed technique is a low-cost solution; it only adds two multiplexers to the modulator, utilizes a single DC voltage as the test stimulus, and estimates the integrator leakage by analyzing the digitized bit stream. Furthermore, the technique can be easily extended to higher order delta-sigma modulators. Simulation results show that accurate estimations of the integrator leakage can be achieved even in the presence of noise.

8C-3 (Time: 14:20 - 14:45)
 Title Path Selection for Monitoring Unexpected Systematic Timing Effects Author *Nicholas Callegari, Pouria Bastani, Li-C. Wang (University of California, Santa Barbara, United States), Sreejit Chakravarty, Alexander Tetelbaum (LSI Corp., United States) Page pp. 781 - 786 Keyword clustering, path delay, path selection, delay test Abstract This paper presents a novel path selection methodology to select paths for monitoring unexpected systematic timing effects. The methodology consists of three components: path filtering, path encoding, and path clustering. Given a large set of critical paths, in path filtering, the goal is to filter out paths that cannot be functionally sensitized. To explore the space of unexpected timing effects, a set of features are defined to encode paths into path vectors. Each feature is a source of concern that may potentially contribute to the cause of an unexpected timing effect. Finally, a kernel-based clustering algorithm is employed to group similar path vectors into clusters from which the best representative paths are selected for post-silicon monitoring. The effectiveness of our proposed methodology is demonstrated through experiments on an industrial ASIC design. Slides

8C-4 (Time: 14:45 - 15:10)
 Title Design for Burn-In Test: A Technique for Burn-In Thermal Stability under Die-to-Die Parameter Variations Author Mesut Meterelliyoz, *Kaushik Roy (Purdue University, United States) Page pp. 787 - 792 Keyword burn-in, leakage, thermal, stability, variations Abstract Strong temperature dependence of leakage has been a major problem during burn-in test where increased voltages and temperatures are applied to weed out defective parts. Moreover, process variations may result in different temperature profiles in different dies during burn-in. This paper proposes an adaptive design-for-burn-in technique that stabilizes the junction temperature by controlling the leakage power using sleep (supply-gating) transistors for a wide range of ambient temperatures, process variations, thermal resistances and supply voltages.

8C-5 (Time: 15:10 - 15:35)
 Title Test Infrastructure Design for Core-Based System-on-Chip Under Cycle-Accurate Thermal Constraints Author *Thomas Edison Yu, Tomokazu Yoneda (Nara Institute of Science and Technology, Japan), Krishnendu Chakrabarty (Duke University, United States), Hideo Fujiwara (Nara Institute of Science and Technology, Japan) Page pp. 793 - 798 Keyword SoC test, TAM design, test scheduling, thermal-aware test, wrapper design Abstract We present a thermal-aware test-access mechanism (TAM) design and test scheduling method for system-on-chip integrated circuits. The proposed method uses cycle-accurate power profiles for thermal simulation; it also relies on test-set partitioning, test-interleaving, and bandwidth matching. We use a computationally tractable thermal-cost model to ensure that temperature constraints are satisfied and the test application time is minimized. Simulation results for the ITC02 SOC Test Benchmarks show that, compared to prior thermal-aware test-scheduling techniques, the proposed method leads to shorter test times under tight temperature constraints. Slides

Session 8D  Designers' Forum: Near-Future SoC Architectures -- Can Dynamically Reconfigurable Processors be a Key Technology?
Time: 13:30 - 15:35 Thursday, January 22, 2009
Location: Room 416+417

8D-1
 Title (Panel Discussion) Near-Future SoC Architectures -- Can Dynamically Reconfigurable Processors be a Key Technology? Author Moderator: Hideharu Amano (Keio University, Japan), Panelists: Toru Awashima (NEC Corp., Japan), Hisanori Fujisawa (Fujitsu Laboratories Ltd., Japan), Naohiko Irie (Hitachi, Ltd., Japan), Takashi Miyamori (Toshiba Corp., Japan), Tony Stansfield (Panasonic Europe Ltd., Great Britain)

Session 9A  Memory Systems Simulation and Optimization
Time: 15:55 - 18:00 Thursday, January 22, 2009
Location: Room 411+412
Chair: Zonghua Gu (Hong Kong University of Science and Technology, Hong Kong)

9A-1 (Time: 15:55 - 16:20)
 Title Soft Lists: A Native Index Structure for NOR-Flash-Based Embedded Devices Author *Li-Pin Chang, Chen-Hui Hsu (National Chiao Tung University, Taiwan) Page pp. 799 - 804 Keyword flash memory, embedded system, storage systems, data structure Abstract Efficient data indexing is significant to embedded devices, because both CPU cycles and energy are very precious resources. Soft lists, a new index structure for embedded devices with NOR flash, are proposed. The challenge of data indexing over NOR flash is that data update and pointer update may recursively trigger each other. Our approach is to allow a bounded number of probes when a pointer is de-referenced. In this way, update and garbage collection are largely simplified, because data can be moved around physical locations without invalidating any pointers. Even better, search with soft lists is very fast, because the probes provide opportunities of forward random skips. Soft lists are evaluated and compared against a tree-based index, and are shown to be simple but efficient.

9A-2 (Time: 16:20 - 16:45)
 Title Energy-aware Register File Re-Partitioning for Clustered VLIW Architectures Author *Chun Jason Xue, Minming Li, Yingchao Zhao, Bessie Hu (City University of Hong Kong, Hong Kong) Page pp. 805 - 810 Keyword register file, partition, energy Abstract VLIW architectures have gained acceptance in embedded systems. A traditional monolithic register file is not suitable for VLIW architectures with a large number of functional units. A clustered VLIW architecture is often applied, where the register file is partitioned into a number of smaller register files. Register files represent a substantial portion of the energy consumption in modern processors, and it is growing rapidly with wider instruction width. Most of the known clustered VLIW architectures partition the register file evenly among clusters. In this paper, we study the effect of register file re-partitioning on the energy consumption of clustered VLIW architectures, where register files are not necessarily partitioned evenly. We present algorithms to compute energy-efficient re-partition of register files under different conditions. The impact of different intercluster communication models as well as the impact of program behavior on the register file re-partitioning are analyzed in this paper. Experimental results show that energy saving can be achieved using the proposed techniques.

9A-3 (Time: 16:45 - 17:10)
 Title Memory Subsystem Simulation in Software TLM/T Models Author *Eric Cheung, Harry Hsieh (University of California, Riverside, United States), Felice Balarin (Cadence Design Systems, United States) Page pp. 811 - 816 Keyword Multiprocessor Simulation, Memory Subsystem Simulation, TLM/T Abstract Design of Multiprocessor System-on-a-Chips requires efficient and accurate simulation of every component. Since the memory subsystem accounts for up to 50% of the performance and energy expenditures, it has to be considered in system-level design space exploration. In this paper, we present a novel technique to simulate memory accesses in software TLM/T models. We use a compiler to automatically expose all memory accesses in software and annotate them onto efficient TLM/T models. A reverse address map provides target memory addresses for accurate cache and memory simulation. Simulating at more than 10MHz, our models allow realistic architectural design space explorations on memory subsystems. We demonstrate our approach with a design exploration case study of an industrial-strength MPEG-2 decoder.

9A-4 (Time: 17:10 - 17:35)
 Title Exact and Fast L1 Cache Simulation for Embedded Systems Author *Nobuaki Tojo, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki (Waseda University, Japan) Page pp. 817 - 822 Keyword cache, design space exploration, cache simulation, cache optimization Abstract In recent years, the gap between the cycle time of processors and memory access time has been increasing. One solution to this problem is to use a cache. But just using a large cache may not reduce the total memory access time. We can have an optimal cache configuration which minimizes overall memory access time by varying the three cache parameters: a cache set size, a line size, and an associativity. In this paper, we propose two exact cache simulation algorithms: CRCB1 and CRCB2, based on Cache Inclusion Property. They realize exact cache simulation while increasing simulation speed dramatically. By using our approach, the number of cache hit/miss judgments required for simulating all the cache configurations is reduced to 31.4%-93.6% compared to conventional approaches. As a result, our proposed approach overall runs an average of 1.8 times faster and a maximum of 3.3 times faster compared to the fastest approach proposed so far. Our proposed exact cache simulation approach achieves the world's fastest L1 cache simulation. Slides
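
The cache inclusion property this abstract builds on can be illustrated with the classic LRU stack-distance technique: with the number of sets and the line size fixed, an LRU cache of associativity a holds a subset of what associativity a+1 holds, so one pass over the trace gives hit counts for every associativity. The sketch below shows that general idea under an assumed LRU replacement policy; it is not the CRCB1 or CRCB2 algorithm, and the function name and parameters are hypothetical.

```python
def hits_per_associativity(addresses, num_sets, line_size, max_assoc):
    """Count hits for associativities 1..max_assoc in a single pass (LRU assumed).

    Each set keeps an LRU stack of block numbers, most recently used first.
    An access found at stack depth d (0-based) hits in every cache whose
    associativity is greater than d.
    """
    stacks = [[] for _ in range(num_sets)]
    hits = [0] * (max_assoc + 1)  # hits[a] = hit count for associativity a
    for addr in addresses:
        block = addr // line_size
        stack = stacks[block % num_sets]
        if block in stack:
            depth = stack.index(block)
            for a in range(depth + 1, max_assoc + 1):
                hits[a] += 1
            stack.remove(block)
        stack.insert(0, block)  # the accessed block becomes most recently used
        del stack[max_assoc:]   # blocks deeper than max_assoc can never hit
    return hits[1:]
```

Calling `hits_per_associativity([0, 64, 0, 64, 128, 0], num_sets=1, line_size=64, max_assoc=3)` evaluates three cache configurations from one trace traversal, which is the kind of redundancy elimination that inclusion-based simulators exploit.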

9A-5 (Time: 17:35 - 18:00)
 Title Accuracy-Aware SRAM: A Reconfigurable Low Power SRAM Architecture for Mobile Multimedia Applications Author Minki Cho (Georgia Institute of Technology, United States), Jason Schlessman (Princeton University, United States), *Wayne Wolf, Saibal Mukhopadhyay (Georgia Institute of Technology, United States) Page pp. 823 - 828 Keyword Memory, Power, Variation, Multimedia, SRAM Abstract We propose a dynamically reconfigurable SRAM architecture for low-power mobile multimedia applications. Parametric failures due to manufacturing variations limit the opportunities for power saving in SRAM. We show that, using a lower voltage for cells storing low-order bits and a nominal voltage for cells storing higher order bits, ~45% savings in memory power can be achieved with a marginal (~10%) reduction in image quality. A reconfigurable array structure is developed to dynamically reconfigure the number of bits in different voltage domains.

Session 9B  Emerging Technologies
Time: 15:55 - 18:00 Thursday, January 22, 2009
Location: Room 413
Chair: Mehdi Baradaran Tahoori (Electrical & Computer Engineering, Northeastern University, United States)

9B-1 (Time: 15:55 - 16:20)
 Title High-Speed Low-Power FinFET Based Domino Logic Author Seid Hadi Rasouli (University of California, Santa Barbara, United States), Hanpei Koike (Electroinformatics Group, Nanoelectronics Research Institute, National Institute of Advanced Industrial Science and Technology, Japan), *Kaustav Banerjee (University of California, Santa Barbara, United States) Page pp. 829 - 834 Keyword FinFET, high speed, low power, domino logic, resistive gate Abstract This paper introduces a novel FinFET based domino logic, which exploits the exclusive property of the FinFET device (capacitive coupling between front-gate and back-gate in a four-terminal (4T) FinFET) to simultaneously achieve higher performance and lower power consumption. Using a new implementation of the resistive gate, the keeper device is made weaker at the beginning of the evaluation phase to reduce its contention with the pull-down network, but gradually becomes stronger to provide high noise margin.

9B-2 (Time: 16:20 - 16:45)
 Title A Stochastic Perturbative Approach to Design a Defect-Aware Thresholder in the Sense Amplifier of Crossbar Memories Author *M. Haykel Ben Jamaa (Ecole Polytechnique Federale de Lausanne, Switzerland), David Atienza (Universidad Complutense de Madrid, Spain), Yusuf Leblebici, Giovanni De Micheli (Ecole Polytechnique Federale de Lausanne, Switzerland) Page pp. 835 - 840 Keyword nanotechnology, crossbar memories, reliability, nanowires Abstract The use of nanowire crossbars to build devices with large storage capabilities is a very promising architectural paradigm for forthcoming nanoscale memory devices. However, this new type of memory device raises questions regarding how to test its correct operation. In particular, the variability affecting the decoder is expected to make testing these new devices very complex. In this paper we present a method to simplify the test of these new devices by using a current thresholder to detect badly addressed nanowires. In the proposed method, the thresholder design is based on a stochastic and perturbative model of the current through the nanowires. Thus, the calculated thresholder parameters are robust against technology variation. As our experimental results indicate, the thresholder error probability is initially only 10^-4, which can also be reduced further (up to 60x) by trading off only 35% area overhead in the memory. Slides

9B-3 (Time: 16:45 - 17:10)
 Title An Alternate Design Paradigm for Robust Spin-Torque Transfer Magnetic RAM (STT MRAM) from Circuit/Architecture Perspective Author Jing Li, Patrick Ndai, Ashish Goel, Haixin Liu, *Kaushik Roy (Purdue University, United States) Page pp. 841 - 846 Keyword Spintronics, MRAM, yield Abstract Spin-Torque Transfer Magnetic RAM (STT MRAM) is a promising candidate for future embedded applications. It provides desirable memory attributes such as fast access time, low cost, high density and non-volatility. However, variations in process parameters can cause a large number of cells to fail, severely affecting the yield of the memory array. In this paper, we provide a thorough analysis of the impact of design parameters on parametric failures due to process variations. To achieve high memory yield without incurring expensive technology modification, we developed an alternate design paradigm, circuit/architecture co-design, to take advantage of different levels of design hierarchy (circuit and architecture) to improve the yield and memory density. The technique decouples the conflicting design requirements for read stability/writability and density. Consequently, the memory cell failure probability reduces by 48% and cell area reduces by 21% with negligible performance degradation (~0.4%).

9B-4 (Time: 17:10 - 17:35)
 Title A Design Methodology and Device/Circuit/Architecture Compatible Simulation Framework for Low-Power Magnetic Quantum Cellular Automata Systems Author Charles Augustine, Behtash Behin-Aein, Xuanyao Fong, *Kaushik Roy (Purdue University, United States) Page pp. 847 - 852 Keyword MQCA, Design Methodology, Simulation Framework, low power, CMOS alternative Abstract CMOS device scaling is facing a daunting challenge with increased parameter variations and exponentially higher leakage current with every new technology generation. Thus, researchers have started looking at alternative technologies. Magnetic Quantum Cellular Automata (MQCA) is such an alternative with switching energy close to thermal limits and scalability down to 5nm. In this paper, we present a circuit/architecture design methodology using MQCA. Novel clocking techniques and strategies are developed to improve computation robustness of MQCA systems. We also developed an integrated device/circuit/system compatible simulation framework to evaluate the functionality and the architecture of an MQCA based system and conducted a feasibility/comparison study to determine the effectiveness of MQCAs in digital electronics. Simulation results of an 8-bit MQCA-based Discrete Cosine Transform (DCT) with novel clocking and architecture show up to 290X and 46X improvement (at iso-delay) over 45nm CMOS in energy consumed and area, respectively.

9B-5 (Time: 17:35 - 18:00)
 Title Reconfigurable Double Gate Carbon Nanotube Field Effect Transistor Based Nanoelectronic Architecture Author *Bao Liu (The University of Texas at San Antonio, United States) Page pp. 853 - 858 Keyword carbon nanotube, nanoelectronic architecture Abstract Carbon nanotubes (CNTs) and carbon nanotube field effect transistors (CNFETs) have demonstrated extraordinary properties and are widely accepted as the building blocks of next generation VLSI circuits. However, no nanoelectronic architecture has been proposed which is solely based on carbon nanotubes and carbon nanotube field effect transistors. In this paper, I propose a novel reconfigurable double gate carbon nanotube field effect transistor (RDG-CNFET), which is reconfigurable to be open, short, FET, or via. Layers of orthogonal carbon nanotubes with electrically bistable molecules sandwiched at each crossing form a dense array of RDG-CNFETs and programmable interconnects, and constitute a nanoelectronic architecture of manufacturability (via regularity), reliability (via reconfigurability), and performance (via device density). Simulation based on CNFET and molecular device compact models demonstrates superior logic density, reliability, performance and power consumption of the proposed RDG-CNFET based nanoelectronic circuits compared with the existing, e.g., molecular diode/MOSFET based nanoelectronic circuits.

Session 9D  Special Session: Dependable VLSI: Device, Design and Architecture -- How should they cooperate ? --
Time: 15:55 - 18:00 Thursday, January 22, 2009
Location: Room 416+417
Organizer: Shuichi Sakai (University of Tokyo, Japan)

9D-1
 Title (Panel Discussion) Dependable VLSI: Device, Design and Architecture -- How should they cooperate ? -- Author Organizer: Shuichi Sakai (University of Tokyo, Japan), Panelists: Hidetoshi Onodera (Kyoto University, Japan), Hiroto Yasuura (Kyushu University, Japan), James C. Hoe (Carnegie Mellon University, United States) Page pp. 859 - 860 Keyword VLSI, dependability, device, design, architecture Abstract VLSI dependability is one of the most significant issues in the modern world. Here the panelists will discuss the key technologies for it as well as the cost optimization among device, design and architecture.