The 11th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule

Wednesday January 25, 2006

A	B	C	D
Op (Small Auditorium, 5F) Opening Session 8:30 - 9:00
1K (Small Auditorium, 5F) Keynote Address I 9:00 - 10:00
Break 10:00 - 10:15
1A (Room 411+412) Formal Methods for Coverage and Scalable Verification 10:15 - 12:20	1B (Room 413) Interconnect for High-End SoC 10:15 - 12:20	1C (Room 414+415) Timing Analysis and Optimization 10:15 - 12:20	1D (Room 416+417) University Design Contest 10:15 - 12:20
Lunch Break / University Design Contest Discussion at ASP-DAC Site (Room 418) 12:20 - 13:30
2A (Room 411+412) Software Techniques for Efficient SoC Design 13:30 - 15:35	2B (Room 413) Application Examples with Leading Edge Design Methodology 13:30 - 15:35	2C (Room 414+415) Placement 13:30 - 15:35	2D (Room 416+417) Special Session: Electrothermal Design of Nanoscale Integrated Circuits 13:30 - 15:35
Coffee Break (Room 418) 15:35 - 16:00
3A (Room 411+412) Logic Synthesis 16:00 - 18:05	3B (Room 413) Future Technical Directions for Design Automation 16:00 - 18:05	3C (Room 414+415) Routing and Interconnect Optimization 16:00 - 18:05	3D (Room 416+417) Special Session: Flash Memory in Embedded Systems 16:00 - 18:05

Thursday January 26, 2006

A	B	C	D
2K (Small Auditorium, 5F) Keynote Address II 9:00 - 10:00
Break 10:00 - 10:15
4A (Room 411+412) Resolving Timing Issues: Design and Test 10:15 - 12:20	4B (Room 413) Leading Edge Design Methodology for SoCs and SiPs 10:15 - 12:20	4C (Room 414+415) Advanced Circuit Simulation 10:15 - 12:20	4D (Room 416+417) Special Session: Open Access Overview 10:15 - 12:20
Lunch Break / Ph.D. Forum (Room 418) 12:20 - 13:30
5A (Room 411+412) Advances in Simulation Technologies 13:30 - 15:35	5B (Room 413) Scheduling for Embedded Systems 13:30 - 15:35	5C (Room 414+415) High Frequency Interconnect Effects in Nanometer Technology 13:30 - 15:35	5D (Small Auditorium, 5F) Designers' Forum: Low Power Design 13:30 - 15:30
Coffee Break (Room 418) 15:35 - 16:00
6A (Room 411+412) Power Optimization of Large-Scale Circuits 16:00 - 18:05	6B (Room 413) Advanced Memory and Processor Architectures for MPSoC 16:00 - 18:05	6C (Room 414+415) New Routing Techniques 16:00 - 18:05	6D (Small Auditorium, 5F) Designers' Forum Panel: 16:30 - 18:00
Banquet (Room 501+502) 18:30 - 20:30

Friday January 27, 2006

A	B	C	D
3K (Small Auditorium, 5F) Keynote Address III 9:00 - 10:00
Break 10:00 - 10:15
7A (Room 411+412) Minimization of Test Cost and Power 10:15 - 12:20	7B (Room 413) Substrate Coupling and Analog Synthesis 10:15 - 12:20	7C (Room 414+415) Statistical and Yield Analysis 10:15 - 12:20	7D (Room 416+417) Special Session: H.264/AVC Design Challenges and Solutions 10:15 - 12:20
Lunch Break 12:20 - 13:30
8A (Room 411+412) Floorplanning 13:30 - 15:35	8B (Room 413) Memory Optimization for Embedded Systems 13:30 - 15:35	8C (Room 414+415) Inductive Issues in Power Grids and Packages 13:30 - 15:35	8D (Small Auditorium, 5F) Designers' Forum: "Cell" Processor 13:30 - 15:30
Coffee Break (Room 418) 15:35 - 16:00
9A (Room 411+412) High-Level Synthesis 16:00 - 18:05	9B (Room 413) Modeling, Compilation and Optimization of Embedded Architectures 16:00 - 18:05	9C (Room 414+415) Statistical Design 16:00 - 18:05	9D (Small Auditorium, 5F) Designers' Forum Panel: 16:30 - 18:00

List of Papers

Wednesday January 25, 2006

Session Op Opening Session (8:30 - 9:00)
Location: Small Auditorium, 5F

Session 1K Keynote Address I (9:00 - 10:00)
Location: Small Auditorium, 5F
Chair(s): Fumiyasu Hirose (Cadence, Japan)

1K-1 (Time: 9:00 - 10:00)

Title	Automotive Electronics: Steady Growth for Years to Come!
Author	Alberto Sangiovanni-Vincentelli (The Edgar L. and Harold H. Buttner Chair of Electrical Engineering and Computer Science, University of California, Berkeley, and Chief Technology Advisor, Member of the Board and Co-founder, Cadence Design Systems, United States)
Abstract	The world of electronics is witnessing a revolution in the way products are conceived, designed and implemented. The ever growing importance of the web, the advent of microprocessors of great computational power, the explosion of wireless communication, the development of new generations of integrated sensors and actuators are changing the world in which we live and work. The new key words are:
- Disappearing electronics, i.e., electronics has to be invisible to the user, it has to help unobtrusively.
- Pervasive computing, i.e., electronics is everywhere, all common use objects will have an electronic dimension.
- Ambient intelligence, i.e., the environment will react to us with the use of electronic components. They will recognize who we are and what we like.
- Wearable computing, i.e., the new devices will be worn as a watch or a hat. They will become part of our clothes. Some of these devices will be tags that will contain all important information about us.
- Know more, carry less, i.e., the environment will know more about us so that we will not need to carry all the paraphernalia of keys, credit cards, personal I.D.s, access cards, access codes.
The car as a self-contained microcosm is experiencing a similar revolution: all the key words listed above are going to have a great impact on the automotive world. We need to rethink what a car really is and the role of electronics in it. Electronics is now essential to control the movements of a car, of the chemical and electrical processes taking place in it, to entertain the passengers, to establish connectivity with the rest of the world, to ensure safety. What will an automobile manufacturer's core competence become in the next few years? Will electronics be the essential element in car manufacturing and design?
The challenges and opportunities are related to
- how to integrate the mechanical and the electronics worlds, i.e., how to make mechatronics a reality in the automotive world,
- how to integrate the different motion control and power-train control functions so that important synergies can be exploited,
- how to combine entertainment, communication and navigation subsystems,
- how to couple the world of electronics where the life time of a product is around 2 years and shrinking, with the automotive world, where the product life time is 10 years and possibly growing,
- how to develop new services based on electronics technology,
- how to exploit communication among cars and between cars and infrastructure such as Global Positioning Systems and cellular networks,
- how are the markets evolving (for example, what will be the size of the after-market sales for automotive electronics, if any?).
We will pose these questions while reviewing some of the most important technology and product developments of the past few years. We will also present new trends on how the design of electronics of the car should be carried out. We will finally analyze the dynamics of the automotive electronics industry that is bound to produce a major shake-up in the structure of the design chain with particular emphasis on the AUTOSAR consortium.

Session 1A Formal Methods for Coverage and Scalable Verification (10:15 - 12:20)
Location: Room 411+412
Chair(s): Kiyoharu Hamaguchi (Osaka University, Japan), Valeria Bertacco (University of Michigan, United States)

1A-1 (Time: 10:15 - 10:40)

Title	Transition-Based Coverage Estimation for Symbolic Model Checking
Author	*Xingwen Xu, Shinji Kimura (Graduate School of IPS, Waseda University, Japan), Kazunari Horikawa, Takehiko Tsuchiya (Toshiba Corporation Semiconductor Company, Japan)
Page	pp. 1 - 6
Keyword	model checking, properties completeness, transition coverage
Abstract	Lack of complete formal specification is one of the major obstacles for the deployment of model checking. Coverage estimation addresses this issue by revealing the unverified part of the design according to the specified properties. In this paper we propose a new transition-based coverage metric to evaluate the completeness of properties for symbolic model checking. It is more comprehensive and accurate than the existing coverage metrics for model checking. An efficient symbolic algorithm is presented for computing the transition coverage for a subset of ACTL. Our coverage estimator has been applied to the model checking of a cache coherence protocol. We uncovered several coverage holes including one that eventually led to the discovery of a design bug.

1A-2 (Time: 10:40 - 11:05)

Title	Word Level Functional Coverage Computation
Author	*Bijan Alizadeh (Microelectronic Research and Development Center of Iran, Iran)
Page	pp. 7 - 12
Keyword	Symbolic Simulation, Coverage, Formal Verification
Abstract	This paper proposes a word-level coverage metric to determine the completeness of a set of properties verified by a word-level method. An algorithm is presented to compute a functionality-based coverage metric for a sequence property as specification. Control, intermediate and output signals are represented by a multiplexer-based structure of linear integer equations, and RT level properties are directly applied to this representation. A set of integer equations is symbolically simulated based on the specified property in a predictable time. We used a canonical form of linear Taylor Expansion Diagram.

1A-3 (Time: 11:05 - 11:30)

Title	Discovering the Input Assumptions in Specification Refinement Coverage
Author	Prasenjit Basu, Sayantan Das, *Pallab Dasgupta, Partha P Chakrabarti (Indian Institute of Technology Kharagpur, India)
Page	pp. 13 - 18
Keyword	Formal Verification, Functional Coverage
Abstract	The design of a large chip is typically hierarchical -- large modules are recursively expanded into a collection of submodules. Each expansion refines the design due to the addition of level specific details. We believe that a similar approach is necessary to scale the capacity of formal property verification technology -- as the design gets refined from one level to another, the formal specification must also be refined to reflect the level specific design decisions. At the heart of this approach we propose a checker that identifies the input assumptions under which the refined specification "covers" the original specification. This enables the validation engineer to focus the verification effort on the remaining input scenarios thereby reducing the number of target coverage points for simulation.

1A-4 (Time: 11:30 - 11:55)

Title	Refinement Strategies for Verification Methods Based on Datapath Abstraction
Author	*Zaher Semon Andraus, Mark Hammond Liffiton, Karem Ahmad Sakallah (The University of Michigan, Ann Arbor, United States)
Page	pp. 19 - 24
Keyword	Formal Verification, Register Transfer Level, Minimally Unsatisfiable Subsets
Abstract	In this paper we explore the application of Counterexample Guided Abstraction Refinement (CEGAR) in the context of microprocessor correspondence checking. The approach is based on automatic datapath abstraction augmented with automatic refinement based on 1) localization, 2) generalization, and 3) minimal unsatisfiable subset (MUS) extraction. We introduce several refinement strategies and empirically evaluate their effectiveness on a set of microprocessor benchmarks. The data suggest that localization, generalization, and MUS extraction from both the abstract and concrete models are essential for effective verification. Additionally, refinement tends to converge faster when multiple MUSes are extracted in each iteration.

1A-5 (Time: 11:55 - 12:20)

Title	Generation of Shorter Sequences for High Resolution Error Diagnosis Using Sequential SAT
Author	Sung-Jui Pan, *Kwang-Ting Cheng (University of California, Santa Barbara, United States), John Moondanos, Ziyad Hanna (Intel Corporation, United States)
Page	pp. 25 - 29
Keyword	Shorter Error Sequence
Abstract	Commonly used pattern sources in simulation-based verification include random, guided random, or design verification patterns. Although these patterns may help bring the design to those hard-to-reach states for activating the errors and for propagating them to observation points, they tend to be very long, which complicates the subsequent diagnosis process. As a key step in reducing the overall diagnosis complexity, we propose a method of generating a shorter error-sequence based on a given long error-sequence. We formulate the problem as a satisfiability problem and employ a SAT solver as the underlying engine for this task. By heuristically selecting an intermediate state $S_i$ which is reachable by the given long sequence, the task of finding the transfer sequence from the initial state to the target state can be divided into two easier tasks - finding a transfer sequence from the initial state to $S_i$ and one from $S_i$ to the target state. Our preliminary experimental results on public benchmark circuits show that the proposed method can achieve significant reduction in the length of the error sequences.
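
The divide-at-an-intermediate-state idea in the 1A-5 abstract can be illustrated with a minimal sketch. It is not the authors' method: it swaps the SAT solver for a plain breadth-first search over a tiny explicit state machine, and the names `replay`, `shortest_inputs`, `shorten`, `step` and `pick_index` are all hypothetical. The toy machine (a 3-bit register with "inc", "dbl" and "hold" inputs) is an assumption made only for illustration.

```python
from collections import deque

def replay(step, state, sequence):
    """Apply an input sequence and return every state visited, start included."""
    trace = [state]
    for symbol in sequence:
        trace.append(step(trace[-1], symbol))
    return trace

def shortest_inputs(step, inputs, start, goal):
    """Breadth-first search for a shortest input sequence from start to goal."""
    parent = {start: None}          # state -> (previous state, input symbol)
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            seq = []
            while parent[state] is not None:
                state, symbol = parent[state]
                seq.append(symbol)
            return seq[::-1]
        for symbol in inputs:
            nxt = step(state, symbol)
            if nxt not in parent:
                parent[nxt] = (state, symbol)
                queue.append(nxt)
    return None

def shorten(step, inputs, init, long_seq, pick_index):
    """Split the search at an intermediate state taken from the long trace."""
    trace = replay(step, init, long_seq)
    mid, target = trace[pick_index], trace[-1]
    # Two smaller searches: init -> mid, then mid -> target.
    return (shortest_inputs(step, inputs, init, mid)
            + shortest_inputs(step, inputs, mid, target))

def step(state, symbol):
    """Toy 3-bit machine (assumed for illustration only)."""
    if symbol == "inc":
        return (state + 1) % 8
    if symbol == "dbl":
        return (state * 2) % 8
    return state                    # "hold"

long_seq = ["inc", "hold", "hold", "inc", "hold",
            "inc", "inc", "hold", "inc", "inc"]
short_seq = shorten(step, ["inc", "hold", "dbl"], 0, long_seq, pick_index=6)
```

The concatenated result reaches the same target state as the long sequence, but it is only as short as the chosen intermediate state allows, which is why the abstract describes the selection as heuristic.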

Session 1B Interconnect for High-End SoC (10:15 - 12:20)
Location: Room 413
Chair(s): Yoshinori Takeuchi (Osaka University, Japan), Juinn-Dar Huang (National Chiao-Tung University, Taiwan)

1B-1 (Time: 10:15 - 10:40)

Title	Constraint-Driven Bus Matrix Synthesis for MPSoC
Author	*Sudeep Pasricha, Nikil Dutt (University of California, Irvine, United States), Mohamed Ben-Romdhane (Conexant, United States)
Page	pp. 30 - 35
Keyword	bus matrix, communication architecture synthesis, MPSoC
Abstract	Modern multi-processor system-on-chip (MPSoC) designs have high bandwidth constraints which must be satisfied by the underlying communication architecture. Bus matrix based communication architectures consist of several parallel busses which provide a suitable backbone to support high bandwidth systems, but suffer from high cost overhead due to extensive bus wiring inside the matrix. Manual traversal of the vast exploration space to synthesize a minimal cost bus matrix that also satisfies performance constraints is practically infeasible. In this paper, we address this problem by proposing an automated approach for synthesizing a bus matrix communication architecture which satisfies all performance constraints in the design and minimizes wire congestion in the matrix. To validate our approach, we consider several industrial strength applications from the networking domain and show that our approach results in up to 9x component savings when compared to a full bus matrix and up to 3.2x savings when compared to a maximally connected reduced bus matrix.

1B-2 (Time: 10:40 - 11:05)

Title	Improving Routing Efficiency for Network-on-Chip through Contention-Aware Input Selection
Author	*Dong Wu, Bashir M. Al-Hashimi, Marcus T. Schmitz (University of Southampton, Great Britain)
Page	pp. 36 - 41
Keyword	network-on-chip, routing, input selection, switch, contention-aware
Abstract	The performance of Network-on-Chip (NoC) largely depends on the underlying routing techniques, which have two constituents: output selection and input selection. Previous research on routing techniques for NoC has focused on the improvement of output selection. This paper investigates the impact of input selection, and presents a novel contention-aware input selection (CAIS) technique for NoC that improves the routing efficiency. When multiple input channels compete for the same output channel, CAIS decides which input channel obtains access depending on the contention level of the upstream switches, which in turn removes possible network congestion. Simulation results with different synthetic and real-life traffic patterns show that, when combined with either deterministic or adaptive output selection, CAIS achieves significantly better performance than the traditional first-come-first-served (FCFS) input selection, with low hardware overhead (<3%).
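
To make the contrast between FCFS and contention-aware input selection in the 1B-2 abstract concrete, here is a minimal sketch. It assumes, without confirmation from the abstract, that the input whose upstream switch reports the highest contention wins and that ties fall back to arrival order. The `Request` record, its fields and both selector functions are hypothetical names, and how a switch would measure its contention level is left out entirely.

```python
from dataclasses import dataclass

@dataclass
class Request:
    input_port: str            # input channel asking for the shared output
    arrival_cycle: int         # cycle in which the request arrived
    upstream_contention: int   # assumed contention count from the upstream switch

def select_fcfs(requests):
    """Baseline: the earliest request wins the output channel."""
    return min(requests, key=lambda r: r.arrival_cycle).input_port

def select_contention_aware(requests):
    """Assumed policy: highest upstream contention wins, ties go to the earliest."""
    best = min(requests, key=lambda r: (-r.upstream_contention, r.arrival_cycle))
    return best.input_port

requests = [
    Request("north", arrival_cycle=3, upstream_contention=1),
    Request("west", arrival_cycle=5, upstream_contention=4),
    Request("local", arrival_cycle=4, upstream_contention=4),
]
fcfs_winner = select_fcfs(requests)              # "north"
cais_winner = select_contention_aware(requests)  # "local"
```

In this toy case FCFS serves the lightly loaded "north" input first, while the contention-aware rule serves "local", one of the two inputs fed by a congested upstream switch.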

1B-3 (Time: 11:05 - 11:30)

Title	Physical Design Implementation of Segmented Buses to Reduce Communication Energy
Author	*Jin Guo, Antonis Papanikolaou, Pol Marchal, Francky Catthoor (IMEC, Belgium)
Page	pp. 42 - 47
Keyword	segmented bus, physical design, netlist topology, activity aware floorplanning
Abstract	The amount of energy consumed for interconnecting the IP-blocks is increasing significantly due to the suboptimal scaling of (long) wires. To limit this energy penalty, segmented buses have gained interest in the architectural community. However, the netlist topology and the physical design stage significantly influence the final communication energy cost. We present in this paper an automated way to implement a netlist consisting of hard macro blocks, which are interconnected with heavily segmented buses in an energy optimal fashion for communication. We optimize the energy dissipation of the network wires in two separate but related steps: minimizing the number of segments for active communication paths in the first step (block ordering), followed by an activity aware floorplanning step to minimize the physical length of these segments. Energy gains of up to a factor of 4 are achieved compared to a standard system implementation using a shared bus. In particular, the block ordering step contributes significantly to the network energy optimization.

1B-4 (Time: 11:30 - 11:55)

Title	Co-Synthesis of a Configurable SoC Platform based on a Network on Chip Architecture
Author	*Mário Pereira Véstias, Horácio Neto (INESC-ID, Portugal)
Page	pp. 48 - 53
Keyword	Field programmable gate array, network on chip, configurable architecture, system on chip
Abstract	The constant increase of gate capacity and performance of configurable hardware chips made it possible to implement systems-on-chip (SoC) able to tackle the demanding requirements of many embedded systems. In this paper, we propose an approach to the design space exploration of a configurable SoC (CSoC) platform based on a network on chip (NoC) architecture for the execution of dataflow dominated embedded systems. The approach has been validated with the design of a color image compression algorithm in an FPGA.

1B-5 (Time: 11:55 - 12:20)

Title	Customized SIMD Unit Synthesis for System on Programmable Chip - A Foundation for HW/SW Partitioning with Vectorization
Author	Muhammad Omer Cheema, *Omar Hammami (ENSTA Paris, France)
Page	pp. 54 - 60
Keyword	SIMD Synthesis, HW/SW Codesign, AltiVec Architecture, Vectorization
Abstract	Use of Single Instruction Multiple Data (SIMD) functional units enables multimedia systems to exploit parallelism to a higher degree resulting in significant system performance improvements. While implementation of whole SIMD system functionality for an application results in wastage of area resources, we have observed that for a specific multimedia application, we only need to implement a customized SIMD unit that is a subset of whole SIMD standard implementation. Based on this study, we have proposed an extension to the traditional system design and synthesis flow by integrating a methodology of SIMD unit Synthesis. Our system synthesizes a customized SIMD unit along with an extended instruction set and generates an equivalent version of assembly code for the application using the extended instruction set. The results of area and performance obtained by experimenting over our implementation of AltiVec compatible customized SIMD units show the effectiveness of our approach.

Session 1C Timing Analysis and Optimization (10:15 - 12:20)
Location: Room 414+415
Chair(s): Ryuichi Yamaguchi (Matsushita Electric Industrial Co., Ltd., Japan), Atsushi Kurokawa (STARC, Japan)

1C-1 (Time: 10:15 - 10:40)

Title	Robust Analytical Gate Delay Modeling for Low Voltage Circuits
Author	Anand Ramalingam (The University of Texas at Austin, United States), Sreekumar V. Kodakara (The University of Minnesota, United States), Anirudh Devgan (Magma Design Automation, United States), *David Z. Pan (The University of Texas at Austin, United States)
Page	pp. 61 - 66
Keyword	delay metric
Abstract	The Sakurai-Newton (SN) delay metric [1] is the most widely used closed form delay metric for CMOS gates due to its simplicity and reasonable accuracy. However, it can be shown that the SN metric fails to provide high accuracy and fidelity when CMOS gates operate at low supply voltages. Thus it may not be applicable in many low power applications with voltage scaling. In this paper, we propose a new closed form delay metric based on the centroid of power dissipation. This new metric is inspired by our key observation and theoretic proof that the SN delay is indeed Elmore delay, which can be viewed as the centroid of current. Our proposed metric has a very high correlation coefficient (0.98) when correlated with the actual delays obtained from HSPICE simulations. Such high correlation is consistent across all major process technologies (180, 130, 100, 65, and 45nm). In comparison, the SN metric has a correlation coefficient between 0.70 and 0.90, depending upon the technology and the CMOS gate, and it is less accurate for lower supply voltages. Since our proposed metric has high fidelity across a wide range of supply voltages yet a simple closed form, it will be very useful to guide low voltage and low power designs.

1C-2 (Time: 10:40 - 11:05)

Title	CGTA: Current Gain-based Timing Analysis for Logic Cells
Author	Shahin Nazarian, *Massoud Pedram (University of Southern California, United States), Tao Lin, Emre Tuncer (Magma, United States)
Page	pp. 67 - 72
Keyword	crosstalk noise, current gain-based cell timing analysis, sensitivity, timing analyzer
Abstract	This paper introduces a new current-based cell timing analyzer, called CGTA, which has a higher performance than existing logic cell timing analysis tools. CGTA relies on a compact lookup table storing the output current gain (sensitivity) of every logic cell as a function of its input voltage and output load. The current gain values are subsequently used by the timing calculator to produce the output current value as a function of the applied input voltage. This current and the output load then uniquely determine the output voltage value. Therefore, CGTA is capable of efficiently and accurately computing the output voltage waveform of a logic cell, which has been subjected to an arbitrary noisy input voltage waveform. Experimental results are presented to assess the quality of CGTA compared to other existing approaches.

1C-3 (Time: 11:05 - 11:30)

Title	Efficient Static Timing Analysis Using a Unified Framework for False Paths and Multi-Cycle Paths
Author	Shuo Zhou, Bo Yao, Hongyu Chen, Yi Zhu, *Chung-Kuan Cheng (University of California, San Diego, United States), Mike Hutton (Altera Corp., United States)
Page	pp. 73 - 78
Keyword	static timing analysis, false sub-graphs, time shifting, biclique covering
Abstract	We propose a framework to unify the process of false paths and multi-cycle paths in static timing analysis (STA). We use subgraphs attached with timing constraints to represent false paths and multi-cycle paths. The complexity of the subgraph representation is reduced to improve efficiency. Finally, we present theorems to show that the unified framework produces correct timings. The experimental results demonstrate that the minimization is effective for both artificial and industry test cases.

1C-4 (Time: 11:30 - 11:55)

Title	Crosstalk Analysis using Reconvergence Correlation
Author	*Sachin Shrivastava, Rajendra Pratap, Harindranath Parameswaran, Manuj Verma (Cadence Design Systems, India)
Page	pp. 79 - 83
Keyword	crosstalk, timing windows, pessimism
Abstract	This paper targets the reduction of false violations during crosstalk analysis by using topological correlation in the design. Pessimism reduction helps in reducing the overall design cycle time by avoiding many noise-fix iterations. We introduce the concept of relative timing windows and suggest a method for doing crosstalk analysis using relative timing windows. We have analyzed the effectiveness of the approach statistically. Some results on real designs are also presented, which show a reduction in the number of violations using the new approach.

1C-5 (Time: 11:55 - 12:20)

Title	Process-Induced Skew Reduction in Nominal Zero-Skew Clock Trees
Author	*Matthew R. Guthaus, Dennis Sylvester (University of Michigan, United States), Richard B. Brown (University of Utah, United States)
Page	pp. 84 - 89
Keyword	clock tree synthesis, manufacturing variation, statistical
Abstract	This work develops an analytic framework for clock tree analysis considering process variations that is shown to correspond well with Monte Carlo results. The analysis framework is used in a new algorithm that constructs deterministic nominal zero-skew clock trees that have reduced sensitivity to process variation. The new algorithm uses a sampling approach to perform route embedding during a bottom-up merging phase, but does not select the best embedding until the top-down phase. This results in clock trees that exhibit a mean skew reduction of 32.4% on average and a standard deviation reduction of 40.7% as verified by Monte Carlo. The average increase in total clock tree capacitance is less than 0.02%.

Session 1D University Design Contest (10:15 - 12:20)
Location: Room 416+417
Chair(s): Kazutoshi Kobayashi (Kyoto University, Japan), Takahiko Arakawa (Renesas Technology, Japan)

1D-1 (Time: 10:15 - 10:20)

Title	A Low Dynamic Power and Low Leakage Power 90-nm CMOS Square-Root Circuit
Author	*Tadayoshi Enomoto, Nobuaki Kobayashi (Chuo University, Japan)
Page	pp. 90 - 91
Keyword	dynamic power, leakage power, square-root, CMOS, 90 nm
Abstract	To drastically reduce the dynamic power (PAT) and the leakage power (PST) while maintaining the speed of a CMOS square-root (SR) circuit, a new algorithm, new architectures and a new leakage reduction circuit were developed. Using these techniques, a 90-nm CMOS LSI was fabricated. The PAT and PST of the new SR circuit were reduced to about 1/4 and 1/33 of those of a conventional SR circuit. Measured results agreed well with simulated results.

1D-2 (Time: 10:20 - 10:25)

Title	A High-Throughput Low-Power Fully Parallel 1024-bit 1/2-Rate Low Density Parity Check Code Decoder in 3-Dimensional Integrated Circuits
Author	Lili Zhou, Cherry Wakayama, Nuttorn Jangkrajarng, Bo Hu, *Richard Shi (University of Washington, United States)
Page	pp. 92 - 93
Keyword	three-dimensional integrated circuits, LDPC decoder
Abstract	A 1024-bit, ½-rate fully parallel low-density parity-check (LDPC) code decoder has been designed and implemented using a three-dimensional (3D) 0.18µm fully depleted silicon-on-insulator (FDSOI) CMOS technology based on wafer bonding. The taped-out 3D decoder with about 8M transistors was simulated to have a high throughput of 2Gb/s and a low power consumption of only 430mW using 6.4mm by 6.3mm of die area. The 3D implementation is estimated to offer more than 10x power-delay-area product improvement over its corresponding 2D implementation. This first large-scale 3D ASIC with fine-grain (5µm) vertical interconnects is made possible by jointly developing a complete automated 3D design flow from a commercial 2D design flow combined with the needed 3D-design point tools.

1D-3 (Time: 10:25 - 10:30)

Title	A 16-Bit, Low-Power Microsystem with Monolithic MEMS-LC Clocking
Author	*Robert M. Senger, Eric D. Marsman, Michael S. McCorquodale (University of Michigan, United States), Richard B. Brown (University of Utah, United States)
Page	pp. 94 - 95
Keyword	microsystem, embedded system, low-power, microelectromechanical devices, LC oscillator
Abstract	Single-chip systems save the power dissipation that would be required for chip-to-chip communication, resulting in compact, low-power solutions for battery-powered applications. This paper describes the design and measured performance of a fully-functional digital core with a low-jitter, on-chip, MEMS-LC clock reference. This chip has been fabricated in TSMC’s 0.18um MM/RF bulk CMOS process. Maximum power consumption of the complete microsystem is 48.78mW operating at 90MHz on a 1.8V power supply.

1D-4 (Time: 10:30 - 10:35)

Title	Ultra-Low Voltage Power Management Circuit and Computation Methodology for Energy Harvesting Applications
Author	Chi-Ying Tsui, *Hui Shao, Wing-Hung Ki, Feng Su (Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 96 - 97
Keyword	energy harvesting, power management, charge-triggered computation, self-time circuit
Abstract	A power management and computation methodology is proposed for ultra-low power energy harvesting applications. An integrated exponential charge pump that accepts an input voltage of around 150mV and provides an unregulated output voltage of more than 1.5V serves as the power supply. To cater for the fluctuating energy source and unregulated power supply, a supply side charge-based computation methodology is proposed, in which the computation activity tracks the fluctuation of the available energy. The idea is demonstrated in a test chip fabricated using a 0.35um technology.

1D-5 (Time: 10:35 - 10:40)

Title	A 0.5-V Sigma-Delta Modulator Using Analog T-Switch Scheme for the Subthreshold Leakage Suppression
Author	*Koichi Ishida, Atit Tamtrakarn, Takayasu Sakurai (University of Tokyo, Japan)
Page	pp. 98 - 99
Keyword	low voltage, analog, sigma-delta, subthreshold leakage
Abstract	A 0.5-V sigma-delta modulator implemented in a 0.15-µm FD-SOI process with a low VTH of 0.1V, using an analog T-switch (AT-switch) scheme to suppress subthreshold-leakage problems, is presented. The scheme is compared with the conventional circuit, which is also fabricated on the same chip. The measurement result demonstrates that the sigma-delta modulator based on the AT-switch realizes 6-bit resolution through reducing non-linear leakage effects, while the conventional circuit can achieve 4-bit resolution.

1D-6 (Time: 10:40 - 10:45)

Title	An Implementation of a CMOS Down-Conversion Mixer for GSM1900 Receiver
Author	*Fangqing Chu, Wei Li, Junyan Ren (Fudan University, China)
Page	pp. 100 - 101
Keyword	mixer, RFIC
Abstract	A 1.9-GHz down-conversion CMOS mixer intended for GSM1900 (PCS1900) Low-IF receivers is presented. It uses a novel folded Gilbert cell and is fabricated in an RF 0.18-µm CMOS process. The prototype achieves a conversion gain of 6dB, an SSB Noise Figure of 18.5dB and an IIP3 of 11.5dBm while consuming 7mA current from a 3.3V power supply.

1D-7 (Time: 10:45 - 10:50)

Title	Integrated Direct Output Current Control Switching Converter using Symmetrically-Matched Self-Biased Current Sensors
Author	*Yat-Hei Lam (Hong Kong University of Science and Technology, Hong Kong), Suet-Chui Koon (National Semiconductor Corporation, Hong Kong), Wing-Hung Ki, Chi-Ying Tsui (Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 102 - 103
Keyword	Switching Converter, Power Electronics, Current Sensor
Abstract	A non-inverting flyback converter using an integrated symmetrically-matched self-biased current sensor was fabricated in a 0.35µm CMOS process. It operates in pseudo-continuous conduction mode and employs a direct output current control scheme to achieve excellent line transient response. The converter switches at 1MHz with an input of 1.2V to 2V to give an output of 1.5V and delivers 250mA.

1D-8 (Time: 10:50 - 10:55)

Title	Adaptively-Biased Capacitor-Less CMOS Low Dropout Regulator with Direct Current Feedback
Author	*Yat-Hei Lam, Wing-Hung Ki, Chi-Ying Tsui (Hong Kong University of Science and Technology, Hong Kong)
Page	pp. 104 - 105
Keyword	Linear Regulator, Current Sensor, low dropout regulator
Abstract	A capacitor-less low dropout regulator (LDR) with direct current feedback is proposed. A symmetrically-matched voltage mirror is employed to sense the load current, giving excellent line and load regulation. The dynamic biasing results in an LDR with pole-tracking that extends the bandwidth of the loop gain at high load currents. The LDR, with an active circuit area of 0.11mm², was fabricated in a 0.35μm CMOS process. Measurement results demonstrated the good performance of the LDR.

1D-9 (Time: 10:55 - 11:00)

Title	A Built-in Power Supply Noise Probe for Digital LSIs
Author	*Mitsuya Fukazawa, Koichiro Noguchi, Makoto Nagata, Kazuo Taki (Kobe University, Japan)
Page	pp. 106 - 107
Keyword	power supply noise, power supply integrity, on chip measurement
Abstract	A design of compact noise detector circuitry that can be embedded and arrayed within a high-density large-scale digital circuit is demonstrated, with a prototype chip using 0.18 um CMOS technology.

1D-10 (Time: 11:00 - 11:05)

Title	A 476-gate-count Dynamic Optically Reconfigurable Gate Array VLSI chip in a standard 0.35um CMOS Technology
Author	*Minoru Watanabe, Fuminori Kobayashi (Kyushu Institute of Technology, Japan)
Page	pp. 108 - 109
Keyword	FPGAs, ORGAs, Optical reconfiguration, Gate Array
Abstract	Optically Reconfigurable Gate Arrays (ORGAs) can easily enable both fast reconfiguration and numerous reconfiguration contexts by using an optical holographic memory and optical wide-band reconfiguration connections. Such devices present the possibility of large virtual gate-count VLSIs. This paper presents a new design of a 476-gate-count Dynamic Optically Reconfigurable Gate Array (DORGA) modified from a previously designed 68-gate-count DORGA using standard 0.35 um three-metal CMOS process technology.

1D-11 (Time: 11:05 - 11:10)

Title	Measurement Results of Within-Die Variations on a 90nm LUT Array for Speed and Yield Enhancement of Reconfigurable Devices
Author	*Kazuya Katsuki, Manabu Kotani, Kazutoshi Kobayashi, Hidetoshi Onodera (Graduate School of Informatics, Kyoto University, Japan)
Page	pp. 110 - 111
Keyword	Within-Die variations, reconfiguration
Abstract	It is possible to enhance speed and yield of reconfigurable devices utilizing WID variations. An LUT array LSI is fabricated on a 90nm process to measure WID and D2D variations. Performance fluctuations are measured by counting the number of LUTs through which a signal is passing within a certain time. D2D and WID variations are clearly observed by the measurement.

1D-12 (Time: 11:10 - 11:15)

Title	High-Throughput Decoder for Low-Density Parity-Check Code
Author	*Tatsuyuki Ishikawa, Kazunori Shimizu, Takeshi Ikenaga, Satoshi Goto (Graduate School of Information, Production and Systems, Waseda University, Japan)
Page	pp. 112 - 113
Keyword	low-density parity-check (LDPC) codes, min-sum algorithm, partially-parallel LDPC decoder, memory-reduction
Abstract	We have designed and implemented an LDPC decoder chip with a memory-reduction method to achieve high throughput and a practical chip size. The decoder decodes (3,6)-2304bit regular LDPC codes using a modified min-sum algorithm. The decoder achieves a throughput of 530Mb/s at an operating frequency of 147MHz. The chip is fabricated in a 0.18um, 6 metal-layer CMOS technology. The chip size is 36mm^2.
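
Note: the abstract above names the min-sum algorithm without giving details. As a rough illustration only, the following Python sketch shows the textbook min-sum check-node update for a single check node of degree 6, matching the (3,6)-regular code. It is not the authors' modified variant or their memory-reduction method, and the function name and message values are hypothetical.

```python
def min_sum_check_update(incoming):
    """Textbook min-sum update for one check node.

    incoming: log-likelihood ratios received from the connected
    variable nodes. The message sent back to node i is the product
    of the signs of all *other* inputs times the smallest magnitude
    among those other inputs.
    """
    outgoing = []
    for i in range(len(incoming)):
        others = incoming[:i] + incoming[i + 1:]
        sign = 1.0
        for value in others:
            if value < 0:
                sign = -sign
        outgoing.append(sign * min(abs(value) for value in others))
    return outgoing


# Hypothetical messages arriving at one degree-6 check node.
messages = [2.0, -1.0, 3.0, 0.5, -4.0, 1.5]
print(min_sum_check_update(messages))
```

Each outgoing message excludes the recipient's own input, which is why the node holding the smallest magnitude receives the second-smallest magnitude instead.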

1D-13 (Time: 11:15 - 11:20)

Title	Hardware Implementation of Super Minimum All Digital FM Demodulator
Author	*Nursani Rahmatullah, Arif Nugroho (Institut Teknologi Bandung, Indonesia)
Page	pp. 114 - 115
Keyword	FM demodulator, new method, PLL
Abstract	We propose an improvement to a new architecture for a digital FM demodulator. This work improves signal quality and system clock frequency, and is superior to the well-known PLL technique. It needs no multiplier and no ROM or lookup table, is compact, and is very fast in transient or state response. An implementation in an Altera® APEX20K200 EBC652-1X PLD uses 348 logic elements and runs at up to 224.42 MHz.

1D-14 (Time: 11:20 - 11:25)

Title	Designing a Custom Architecture for DCT Using NISC Technology
Author	Bita Gorjiara, Mehrdad Reshadi, *Daniel Gajski (University of California, Irvine, United States)
Page	pp. 116 - 117
Keyword	NISC, ASIP, custom processor, Discrete Cosine Transform, design exploration
Abstract	This paper presents the design of a custom architecture for the Discrete Cosine Transform (DCT) using No-Instruction-Set Computer (NISC) technology, which was developed for fast processor customization. Using several software transformations and hardware customization, we achieved up to 10 times performance improvement, 2 times power reduction, 12.8 times energy reduction, and 3 times area reduction compared to an already-optimized soft-core MIPS implementation.

1D-15 (Time: 11:25 - 11:30)

Title	A 52mW 1200MIPS Compact DSP for Multi-Core Media SoC
Author	*Shih-Hao Ou, Tay-Jyi Lin, Chao-Wei Huang, Yu-Ting Kuo, Chie-Min Chao, Chih-Wei Liu (National Chiao Tung University, Taiwan), Chein-Wei Jen (STC, ITRI, Taiwan)
Page	pp. 118 - 119
Keyword	digital signal processor, dual-core, multi-core
Abstract	This paper presents a fully-programmable DSP for multi-core media SoC, which has been optimized to execute a set of signal processing kernels very efficiently. It has a novel data-centric instruction set and the corresponding latency-insensitive micro-architecture, and is optimized concurrently with its automatic software generator. The DSP can achieve 3X performance (in cycles) of those found in commercial dual-core application processors with similar computing resources. It has been implemented in the UMC 0.18μm 1P6M CMOS technology and can operate at 314MHz while consuming only 52mW average power.

1D-16 (Time: 11:30 - 11:35)

Title	Implementation of H.264/AVC Decoder for Mobile Video Applications
Author	*Suh Ho Lee, Ji Hwan Park, Seon Wook Kim, Sung Jea Ko, Suki Kim (Korea University, Republic of Korea)
Page	pp. 120 - 121
Keyword	H.264, SoC Platform, CAVLC, IQ, IDCT, De-blocking filter
Abstract	This paper presents an H.264 baseline profile decoder based on an SoC platform design methodology. The overall decoding throughput is increased by optimized software and a dedicated hardware accelerator. We minimize the number of bus accesses and use macroblock (MB) level pipeline processing techniques to achieve real-time operation. We implemented and verified a prototype on an SoC platform with a 32-bit RISC CPU core and FPGA module. Our design can process up to 20 frames/sec with QCIF (176x144). The proposed architecture can be easily applied to many mobile video application areas such as a digital camera and a DMB (Digital Multimedia Broadcasting) phone.

1D-17 (Time: 11:35 - 11:40)

Title	A High-Performance Platform-Based SoC for Information Security
Author	Min Wu, Xiaoyang Zeng, *Jun Han, Yongyi Wu, Yibo Fan (State Key Lab of ASIC and System, Fudan University, China)
Page	pp. 122 - 123
Keyword	Platform-based, SoC, Information Security
Abstract	This paper presents a platform-based SoC named Firebird, intended for information security applications. Several design aspects, which include the embedded 32-bit RISC CPU and AMBA bus system, the reconfigurable and scalable public-key crypto-coprocessor, a high-performance TRNG, and several low-power schemes, make Firebird very efficient for client-end information security applications. Test results from the prototype chip indicate that Firebird works efficiently with all these features and has some clear advantages over other designs in the literature.

1D-18 (Time: 11:40 - 11:45)

Title	Configurable Multi-Processor Architecture and its Processor Element Design
Author	*Tsutomu Nishimura, Takuji Miki, Hiroaki Sugiura, Yuki Matsumoto, Masatsugu Kobayashi (Ritsumeikan University, Japan), Toshiyuki Kato, Tsutomu Eda (VLSI center, Ritsumeikan University, Japan), Hironori Yamauchi (Ritsumeikan University, Japan)
Page	pp. 124 - 125
Keyword	multi-processor system, automatic generation, hardware architecture
Abstract	We developed an application-specific multi-processor generation system intended for real-time applications. In this system, we adopted a distributed-memory multi-processor architecture with a hierarchical tree network as a configurable multi-processor that can be adapted flexibly to systems of various scales. We have also developed a configurable multi-processor prototype as LSI chips in a 0.18 micrometer CMOS standard cell technology.

1D-19 (Time: 11:45 - 11:50)

Title	Design and Implementation of Transducer for ARM-TMS Communication
Author	Hansu Cho, Samar Abdi, *Daniel Gajski (University of California, Irvine, United States)
Page	pp. 126 - 127
Keyword	interface design, IP reuse, communication design
Abstract	Communication between components, with different interface protocols, requires an extra component that must translate one protocol to another. This component is referred to as a transducer. In this paper we describe the design and implementation of a transducer between AMBA bus and TMS DSP bus. The transducer allows system designers to send data from AMBA compliant components to TMS compliant ones, and vice versa. The transducer was modeled in Verilog and implemented on Xilinx VirtexII FPGA board.

Session 2A Software Techniques for Efficient SoC Design (13:30 - 15:35)
Location: Room 411+412
Chair(s): Qiang Zhu (Fujitsu Lab., Japan), Ahmed Jerraya (TIMA Laboratory, France)

2A-1 (Time: 13:30 - 13:55)

Title	Energy Savings through Embedded Processing on Disk System
Author	Seung Woo Son, Guangyu Chen, Mahmut Kandemir, *Feihui Li (Pennsylvania State University, United States)
Page	pp. 128 - 133
Keyword	Low Power, Smart Disk
Abstract	Many of today's data-intensive applications manipulate disk-resident data sets. As a result, their overall behavior is tightly coupled with their disk performance. Unfortunately, most of these applications quickly become disk bound since disk I/O times, the communication latencies, and energy consumption required to transfer disk data to the host machine can be very large. A promising solution to this problem is to embed computational power into the disk storage system. This paper concentrates on such a smart disk based architecture and proposes an automated approach that partitions a given application code between the host machine and the smart disk. The main goal is to perform data filterings, identified at compile time, on the smart disk, thereby reducing the energy spent in communicating disk data to the host unit for processing. To achieve this, the proposed approach uses integer linear programming to identify the code fragments that perform significant data filtering and assigns such fragments to the smart disk for execution. In addition to the communication energy benefits of the proposed approach, we show in this paper that this approach can also help us better exploit the low-power management capabilities provided by the system. Our experiments with four data-intensive applications indicate significant energy savings.

2A-2 (Time: 13:55 - 14:20)

Title	Energy-Aware Computation Duplication for Improving Reliability in Embedded Chip Multiprocessors
Author	Guilin Chen, Mahmut Kandemir, *Feihui Li (Pennsylvania State University, United States)
Page	pp. 134 - 139
Keyword	reliability, compilers, duplication, multi-processor systems
Abstract	Compilers designed for current embedded systems must be capable of addressing multiple constraints such as low power, high performance, small memory footprint and form factor, and high reliability at the same time. In particular, optimizing for one constraint should be performed carefully, considering its impact on other constraints. Recent trends indicate that transient errors are becoming increasingly important in embedded systems. Focusing on an embedded chip multiprocessor and array-intensive applications, this paper demonstrates how reliability against transient errors can be improved without impacting execution time by utilizing idle processors for duplicating some of the computations of the active processors. It also shows how a balance between power savings and reliability improvement can be struck using a metric called the energy-delay-fallibility product. Our experimental results indicate that the "percentage of duplicated computations" is a useful high-level metric for studying the tradeoffs among performance, power, and reliability.

2A-3 (Time: 14:20 - 14:45)

Title	Object Duplication for Improving Reliability
Author	Guilin Chen, Guangyu Chen, *Mahmut Kandemir, Narayanan Vijaykrishnan, Mary Jane Irwin (Pennsylvania State University, United States)
Page	pp. 140 - 145
Keyword	Java Virtual Machine, Soft error
Abstract	Soft errors are becoming a common problem in current systems due to the scaling of technology that results in the use of smaller devices, lower voltages, and power-saving techniques. In this work, we focus on soft errors that can occur in the objects created in heap memory, and investigate techniques for enhancing the immunity to soft errors through various object duplication schemes. The idea is to access the duplicate object when the checksum associated with the primary object indicates an error. We implemented several duplication-based schemes and conducted extensive experiments. Our results clearly show that this spectrum of schemes enables us to balance the tradeoffs between error rate and heap space consumption.
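
Note: the core idea in the abstract above (read the duplicate when the primary object's checksum indicates an error) can be illustrated with a minimal Python sketch. The class name, the use of CRC-32 as the checksum, and the byte-array payload are assumptions made for illustration; the paper works inside a Java Virtual Machine heap and its actual schemes are not described here.

```python
import zlib


class DuplicatedObject:
    """Hypothetical heap object kept as a primary copy, a duplicate,
    and a checksum computed over the primary at creation time."""

    def __init__(self, payload: bytes):
        self.primary = bytearray(payload)
        self.duplicate = bytearray(payload)
        self.checksum = zlib.crc32(payload)

    def read(self) -> bytes:
        # Use the primary copy if its checksum still matches.
        if zlib.crc32(bytes(self.primary)) == self.checksum:
            return bytes(self.primary)
        # Checksum mismatch: assume a soft error hit the primary
        # and fall back to the duplicate.
        return bytes(self.duplicate)


obj = DuplicatedObject(b"heap object")
obj.primary[0] ^= 0x01          # simulate a single-bit soft error
print(obj.read())
```

Keeping a full duplicate doubles the heap space for that object, which is the error-rate versus heap-space tradeoff the abstract refers to.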

2A-4 (Time: 14:45 - 15:10)

Title	Mapping and Configuration Methods for Multi-Use-Case Networks on Chips
Author	*Srinivasan Murali (Stanford University, United States), Martijn Coenen, Andrei Radulescu, Kees Goossens (Philips, Netherlands), Giovanni De Micheli (EPFL, Switzerland)
Page	pp. 146 - 151
Keyword	Networks on Chips, Systems on Chips, Mapping, Use Cases
Abstract	To provide a scalable communication infrastructure for Systems on Chips (SoCs), Networks on Chips (NoCs), a communication-centric design paradigm, are needed. To be cost effective, SoCs are often programmable and integrate several different applications or use-cases onto the same chip. For the SoC platform to support the different use-cases, the NoC architecture should satisfy the performance constraints of each individual use-case. In this work we motivate the need to consider multiple use-cases during the NoC design process. We present a method to efficiently map the applications onto the NoC architecture, satisfying the design constraints of each individual use-case. We also present novel ways to dynamically reconfigure the network across the different use-cases and explore the possibility of integrating Dynamic Voltage and Frequency Scaling (DVS/DFS) techniques with the use-case centric NoC design methodology. We validate the performance of the design methodology on several SoC applications. The dynamic reconfiguration of the NoC integrated with DVS/DFS schemes results in large power savings for the resulting NoC systems.

2A-5 (Time: 15:10 - 15:35)

Title	Conversion of Reference C Code to Dataflow Model: H.264 Encoder Case Study
Author	*Hyeyoung Hwang, Taewook Oh, Hyunuk Jung, Soonhoi Ha (Seoul National University, Republic of Korea)
Page	pp. 152 - 157
Keyword	H.264, Model-based, dataflow, System specification, code transform
Abstract	Model-based design is widely accepted in developing complex embedded systems under intense time-to-market pressure. While it promises improved design productivity, the main bottleneck lies not in the design methodology but in constructing the initial algorithm representation in the specified model. This is particularly true if a complicated multimedia application is given in the form of sequential reference C code. In this paper we propose a systematic procedure to convert sequential C code to a dataflow specification that has been widely used in many design environments for DSP systems. The proposed technique is successfully applied to the H.264 encoder algorithm as a case study.

Session 2B Application Examples with Leading Edge Design Methodology (13:30 - 15:35)
Location: Room 413
Chair(s): In-Cheol Park (KAIST, Republic of Korea), Hideharu Amano (Keio University, Japan)

2B-1 (Time: 13:30 - 13:55)

Title	SAVS: A Self-Adaptive Variable Supply-Voltage Technique for Process-Tolerant and Power-Efficient Multi-issue Superscalar Processor Design
Author	Hai Li (Qualcomm Inc., United States), Yiran Chen (Synopsys Inc., United States), *Kaushik Roy, Cheng-Kok Koh (Purdue University, United States)
Page	pp. 158 - 163
Keyword	Variable Supply-Voltage, Power Efficient
Abstract	Technology scaling and sub-wavelength optical lithography are associated with significant process variations. We propose a self-adaptive variable supply-voltage scaling (SAVS) technique for multi-issue out-of-order pipelines to improve parametric yield with minimal power dissipation. Our error-correction circuitry and recovery mechanism allow the proposed fault-tolerant pipeline to work at a dynamically tuned supply voltage with a very low error rate. Experiments on an 8-issue, out-of-order superscalar processor show that SAVS can achieve 93.3% yield with 8.66% total power reduction under a scaled VDD, compared to the same yield achieved by a conventional microarchitecture. The increase in execution time is negligible (0.014%).

2B-2 (Time: 13:55 - 14:20)

Title	The Design and Implementation of a Low-Latency On-Chip Network
Author	*Robert Mullins, Andrew West, Simon Moore (University of Cambridge, Great Britain)
Page	pp. 164 - 169
Keyword	on-chip network
Abstract	Many of the issues that will be faced by the designers of multi-billion transistor chips may be alleviated by the presence of a flexible global communication infrastructure. In the short term, such a network will provide scalable chip-wide communication and ease the complexity of handling multi-cycle communications. In the long term, the network will become a primary tool for optimising power and data transfers and for scheduling computations. This paper details the design and implementation of a low-latency on-chip network. The network's speculative routers are in the best case able to route flits in a single clock cycle, helping to minimise on-chip communication latencies and maximise the effectiveness of buffering resources. Results from our 180nm test chip demonstrate an inter-router data transfer rate in excess of 16Gbit/s for each link. In the best case each router hop adds just 1 clock cycle to the final communication latency.

2B-3 (Time: 14:20 - 14:45)

Title	A Near Optimal Deblocking Filter for H.264 Advanced Video Coding
Author	Shen-Yu Shih, Cheng-Ru Chang, *Youn-Long Lin (National Tsing Hua University, Taiwan)
Page	pp. 170 - 175
Keyword	Deblocking Filter, H.264, MPEG-4 AVC
Abstract	We propose a near optimal hardware architecture for the deblocking filter in H.264/MPEG-4 AVC. We propose a novel filtering order and data reuse strategy that results in significant savings in filtering time, local memory usage, and memory traffic. Every 16x16 macroblock requires 192 filtering operations. After a few initialization cycles, our 5-stage pipelined architecture is able to perform one filtering operation per cycle. Compared with some state-of-the-art designs, our architecture delivers the fastest level of performance while using a much lower gate count and less memory. We have implemented and integrated the proposed deblocking filter into an H.264 main profile decoder and verified it with an FPGA prototype.
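
Note: the abstract states the figure of 192 filtering operations per macroblock without a breakdown. One common way to arrive at it, assumed here rather than taken from the paper, is to count one operation per line of samples across each filtered edge in 4:2:0 video: 4 vertical and 4 horizontal luma edges of 16 samples each, plus 2 vertical and 2 horizontal edges of 8 samples each in both chroma blocks. The Python sketch below just carries out that arithmetic; the function and parameter names are hypothetical.

```python
def filtering_ops_per_macroblock(
    luma_edges_per_direction=4,    # assumed: luma edges on a 4-sample grid
    luma_edge_length=16,
    chroma_edges_per_direction=2,  # assumed: 4:2:0, edges every 4 chroma samples
    chroma_edge_length=8,
    chroma_blocks=2,               # Cb and Cr
):
    """Count edge-line filtering operations for one 16x16 macroblock."""
    luma = 2 * luma_edges_per_direction * luma_edge_length
    chroma = chroma_blocks * 2 * chroma_edges_per_direction * chroma_edge_length
    return luma + chroma


print(filtering_ops_per_macroblock())
```

Under these assumptions luma contributes 128 operations and chroma 64, so a pipeline sustaining one operation per cycle needs roughly 192 cycles per macroblock plus the initialization cycles mentioned in the abstract.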

2B-4 (Time: 14:45 - 15:10)

Title	Image Segmentation and Pattern Matching Based FPGA/ASIC Implementation Architecture of Real-Time Object Tracking
Author	*Kousuke Yamaoka, Takashi Morimoto, Hidekazu Adachi, Tetsushi Koide, Hans Juergen Mattausch (Research Center for Nanodevices and Systems, Hiroshima University, Japan)
Page	pp. 176 - 181
Keyword	Object Tracking, Real-Time, Image Segmentation, Pattern Matching, Pipeline Processing
Abstract	A novel algorithm for object tracking in video pictures, based on image segmentation and pattern matching, as well as its FPGA/ASIC implementation architecture are presented. With image segmentation, we can detect all objects in the images no matter whether they are moving or not. Using image segmentation results of successive frames, we exploit pattern matching in a simple object feature space for tracking of objects. The proposed algorithm can be applied to multiple moving and still objects even in the case of a moving camera. The FPGA/ASIC implementation architecture is verified to enable real time tracking of up to 220 objects, when realized with modern FPGA hardware.

2B-5 (Time: 15:10 - 15:35)

Title	Prefetching-Aware Cache Line Turnoff for Saving Leakage Energy
Author	*Ismail Kadayif (Canakkale Onsekiz Mart University, Turkey), Mahmut Kandemir, Feihui Li (Pennsylvania State University, United States)
Page	pp. 182 - 187
Keyword	Leakage energy, Cache, Prefetching, Cache line turnoff, Dead block
Abstract	While numerous prior studies focused on performance and energy optimizations for caches, their interactions have received much less attention. This paper studies this interaction and demonstrates how performance and energy optimizations can affect each other. More importantly, we propose three optimization schemes that turn off cache lines in a prefetching-sensitive manner. These schemes treat prefetched cache lines differently from the lines brought to the cache in a normal way (i.e., through a load operation) in turning off the cache lines. Our experiments with applications from the SPEC2000 suite indicate that the proposed approaches save significant leakage energy with very small degradation on performance.

Session 2C Placement (13:30 - 15:35)
Location: Room 414+415
Chair(s): Evangeline F.Y. Young (Chinese University of Hong Kong, Hong Kong), Shin'ichi Wakabayashi (Hiroshima City University, Japan)

2C-1 (Time: 13:30 - 13:55)

Title	A Robust Detailed Placement for Mixed-Size IC Designs
Author	Jason Cong, *Min Xie (University of California, Los Angeles, United States)
Page	pp. 188 - 194
Keyword	placement
Abstract	The rapid increase in IC design complexity and the widespread use of intellectual property (IP) blocks have made so-called mixed-size placement a very important topic in recent years. Although several algorithms have been proposed for mixed-size placement, most of them primarily focus on the global placement aspect. In this paper we propose a three-step approach, named XDP, for mixed-size detailed placement. First, a combination of constraint graph and linear programming is used to legalize macros. Then, an enhanced greedy method is used to legalize the standard cells. Finally, a sliding-window based cell swapping is applied to further reduce wirelength. The impact of individual techniques is analyzed and quantified. Experiments show that when applied to the set of global placement results generated by APlace [1], XDP can produce wirelength comparable to the native detailed placement of APlace, and 3% shorter wirelength compared to Fengshui 5.0 [2]. When applied to the set of global placements generated by mPL6 [3], XDP is the only detailed placement that successfully produces legal placement for all the examples, while APlace and Fengshui fail for 4/9 and 1/3 of the examples. For cases where legal placements can be compared, the wirelength produced by XDP is shorter by 3% on average compared to APlace and Fengshui. Furthermore, XDP displays a higher robustness than the other tools by covering a broader spectrum of examples by different global placement tools.

2C-2 (Time: 13:55 - 14:20)

Title	FastPlace 2.0: An Efficient Analytical Placer for Mixed-Mode Designs
Author	*Natarajan Viswanathan, Min Pan, Chris Chu (Iowa State University, United States)
Page	pp. 195 - 200
Keyword	Computer-aided design, analytical placement, mixed-mode placement
Abstract	In this paper, we present FastPlace 2.0 - an extension to the efficient analytical standard-cell placer - FastPlace, to address the mixed-mode placement problem. The main contributions of our work are: (1) Extensions to the global placement framework of FastPlace to handle mixed-mode designs. (2) An efficient and optimal minimum perturbation macro legalization algorithm that is applied after global placement to resolve overlaps among the macros. (3) An efficient legalization scheme to legalize the standard cells among the placeable segments created after fixing the movable macros. On the ISPD 02 Mixed-Size placement benchmarks, our algorithm is 16.8X and 7.8X faster than state-of-the-art academic placers Capo 9.1 and Fengshui 5.0 respectively. Correspondingly, we are on average, 12% and 3% better in terms of wirelength over the respective placers.

2C-3 (Time: 14:20 - 14:45)

Title	Timing-Driven Placement Based on Monotone Cell Ordering Constraints
Author	Chanseok Hwang, *Massoud Pedram (University of Southern California, United States)
Page	pp. 201 - 206
Keyword	placement, timing, layout, conduit
Abstract	In this paper, we present a new timing-driven placement algorithm, which attempts to minimize zigzags and crisscrosses on the timing-critical paths of a circuit. We observed that most of the paths that cause timing problems in the circuit meander outside the minimum bounding box of the start and end nodes of the path. To limit this undesirable behavior, we impose a physical constraint on the placement problem, i.e., we assign a preferred signal direction to each critical path in the circuit. Starting from an initial placement solution, by using a move-based optimization strategy, these preferred directions force cells to move in a direction that maximizes the monotonic behavior of the timing-critical paths in the new placement solution. To make the direction assignment tractable, we implicitly group all circuit paths into a set of input-output conduits and assign a unique preferred direction to each such conduit. We integrated this idea into a recursive bipartitioning-based placement framework with a min-cut objective function. Experimental results on a set of standard placement benchmarks show that this approach improves the result of a state-of-the-art industrial placement tool for all the benchmark circuits while increasing the wire length by a tolerable amount.
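
Note: the abstract above observes that timing-problem paths tend to meander outside the bounding box of their start and end nodes. The Python sketch below illustrates only how such a property might be checked for a single path given as a list of placed cell coordinates. It is not the paper's placement algorithm, and the function names and coordinates are hypothetical.

```python
def stays_in_bounding_box(path):
    """True if every cell on the path lies inside the minimum
    bounding box spanned by the path's first and last cells."""
    (x0, y0), (x1, y1) = path[0], path[-1]
    x_min, x_max = min(x0, x1), max(x0, x1)
    y_min, y_max = min(y0, y1), max(y0, y1)
    return all(x_min <= x <= x_max and y_min <= y <= y_max for x, y in path)


def is_monotone(path):
    """True if the path never reverses direction in x or in y."""
    def one_direction(values):
        pairs = list(zip(values, values[1:]))
        return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)

    return one_direction([x for x, _ in path]) and one_direction([y for _, y in path])


direct = [(0, 0), (1, 1), (2, 1), (3, 2)]
zigzag = [(0, 0), (2, 3), (1, 1), (3, 2)]
print(stays_in_bounding_box(direct), is_monotone(direct))
print(stays_in_bounding_box(zigzag), is_monotone(zigzag))
```

A monotone path always stays inside the bounding box of its endpoints, so monotonicity is the stronger of the two conditions.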

2C-4 (Time: 14:45 - 15:10)

Title	Constraint Driven I/O Planning and Placement for Chip-package Co-design
Author	*Jinjun Xiong (University of California at Los Angeles, United States), Yiu-Chung Wong, Egino Sarto (Rio Design Automation, United States), Lei He (University of California at Los Angeles, United States)
Page	pp. 207 - 212
Keyword	Chip-package Co-design, I/O planning, I/O placement, Constraint driven, System on chip and System in package
Abstract	System-on-chip and system-in-package result in an increased number of I/O cells and complicated constraints for both chip designs and package designs. This renders the traditional manually tuned and chip-centered I/O designs suboptimal in terms of both turnaround time and design quality. In this paper we formally introduce a set of design constraints suitable for chip-package co-design. We formulate a constraint-driven I/O planning and placement problem, and solve it by a multi-step algorithm based upon integer linear programming. Experimental results using real industry designs show that the proposed algorithm can effectively find a large-scale I/O placement solution and satisfy all given design constraints in less than 10 minutes. In contrast, the state-of-the-art approach, which does not consider those design constraints, cannot meet all of them by relying solely upon the conventional iterative approach.

2C-5 (Time: 15:10 - 15:35)

Title	Simultaneous Block and I/O Buffer Floorplanning for Flip-Chip Design
Author	*Chih-Yang Peng, Wen-Chang Chao, Yao-Wen Chang (National Taiwan University, Taiwan), J.-H. Wang (Faraday Technology Corp., Taiwan)
Page	pp. 213 - 218
Keyword	floorplanning, placement, flip-chip
Abstract	The flip-chip package gives the highest chip density of any packaging method to support the pad-limited ASIC design. One of the most important characteristics of flip-chip designs is that the input/output buffers could be placed anywhere inside a chip. In this paper, we first introduce the floorplanning problem for the flip-chip design and formulate it as assigning the positions of input/output buffers and first-stage/last-stage blocks so that the path length between blocks and bump balls as well as the delay skew of the paths are simultaneously minimized. We then present a hierarchical method to solve the problem. We first cluster a block and its corresponding buffers to reduce the problem size. Then, we go into iterations of the alternating and interacting global optimization step and the partitioning step. The global optimization step places blocks based on simulated annealing using the B*-tree representation to minimize a given cost function. The partitioning step dissects the chip into two subregions, and the blocks are divided into two groups and are placed in respective subregions. The two steps repeat until each subregion contains at most a given number of blocks, defined by the ratio of the total block area to the chip area. At last, we refine the floorplan by perturbing blocks inside a subregion as well as in different subregions. Compared with the B*-tree based floorplanner alone, our method is more efficient and obtains significantly better results, with an average cost of only 51.8% of that obtained by using the B*-tree alone, based on a set of real industrial flip-chip designs provided by leading companies.

Session 2D Special Session: Electrothermal Design of Nanoscale Integrated Circuits (13:30 - 15:35)
Location: Room 416+417
Chair(s): Dennis Sylvester (Univ. of Michigan, United States), Mongkol Ekpanyapong (Georgia Institute of Technology, United States)

2D-1 (Time: 13:30 - 14:00)

Title	Electrothermal Analysis and Optimization Techniques for Nanoscale Integrated Circuits
Author	*Yong Zhan, Brent Goplen, Sachin S. Sapatnekar (University of Minnesota, United States)
Page	pp. 219 - 222
Keyword	Thermal analysis, Thermal optimization, Electrothermal design, Simulation, Placement
Abstract	With technology scaling, on-chip power densities are growing steadily, leading to the point where temperature has become an important consideration in the design of electrical circuits. This paper overviews several methods for the analysis and optimization of thermal effects in integrated circuits. Thermal analysis may be carried out efficiently through the use of finite difference methods, finite element methods, or Green function based methods, each of which provides different accuracy-computation tradeoffs, and the paper begins by surveying these. Next, we overview a restricted set of thermal optimization methods, specifically, placement techniques for thermal heat-spreading, and then we conclude by summarizing a set of future directions in electrothermal design.

2D-2 (Time: 14:00 - 14:30)

Title	Electrothermal Engineering in the Nanometer Era: From Devices and Interconnects to Circuits and Systems
Author	*Kaustav Banerjee, Sheng-Chih Lin, Navin Srivastava (University of California, Santa Barbara, United States)
Page	pp. 223 - 230
Keyword	Electrothermal, Temperature-Aware, Power Dissipation, Thermal Gradients, Hot-Spots
Abstract	Management of electrothermal (ET) issues arising due to power dissipation both at the micro- and macro- scale is central to the development of future generation microprocessors, integrated networks, and other highly integrated circuits and systems. This paper will provide a broad overview of various ET effects in nanoscale VLSI and highlight both technology and design choices that are thermally-aware. The paper ends with a brief discussion of electrothermal issues in emerging 3-D ICs and highlights the advantages of employing hybrid Carbon Nanotube-Cu interconnects in both 2-D and 3-D designs.

2D-3 (Time: 14:30 - 15:00)

Title	Area Optimization for Leakage Reduction and Thermal Stability in Nanometer Scale Technologies
Author	*Ja Chun Ku, Yehea Ismail (Northwestern University, United States)
Page	pp. 231 - 236
Keyword	Layout, Low-power design, Optimization, VLSI
Abstract	Traditionally, minimum possible area of a VLSI layout is considered the best for delay and power minimization due to decreased interconnect capacitance. This paper shows however that the use of minimum area does not result in the minimum power and/or delay in nanometer scale technologies due to thermal effects, and in some cases, may result in thermal runaway. A methodology using area as a design parameter to reduce the leakage power, and prevent thermal runaway is presented. A 16-bit adder example in a 70nm technology shows a total power savings of 17% with 15% increase in area, and no increase in delay. The power savings using this technique are expected to increase in future technologies.

2D-4 (Time: 15:00 - 15:30)

Title	Compact Thermal Models for Estimation of Temperature-dependent Power/Performance in FinFET Technology
Author	Aditya Bansal, Mesut Meterelliyoz (Purdue University, United States), Siddharth Singh (Osmania University, India), Jung Hwan Choi, Jayathi Murthy, *Kaushik Roy (Purdue University, United States)
Page	pp. 237 - 242
Keyword	temperature, FinFET
Abstract	With technology scaling, elevated temperatures caused by increased power density create a critical bottleneck modulating the circuit operation. With the advent of FinFET technologies, cooling of a circuit is becoming a bigger challenge because of the thick buried oxide inhibiting the heat flow to the heat sink and confined ultra-thin channel increasing the thermal resistivity. In this work, we propose compact thermal models to predict the temperature rise in FinFET structures. We develop cell-level compact thermal models for standard INV, NAND and NOR gates accounting for the heat transfer across the six faces of a cell. Temperature maps of benchmark circuits exhibit close correspondence with dynamic power maps because of confined regions of heat generation separated by low thermal conductivity material. It is illustrated that temperature-aware timing analysis is imperative, because of high inter-cell temperature gradient. Accurate prediction of temperature in the early phase of design cycle will give valuable estimation of power/performance/reliability of a circuit block and will guide in the design of more robust circuits.

Session 3A Logic Synthesis (16:00 - 18:05)
Location: Room 411+412
Chair(s): Shinji Kimura (Waseda University, Japan), Shih-Chieh Chang (National Tsing Hua University, Taiwan)

3A-1 (Time: 16:00 - 16:25)

Title	An Anytime Symmetry Detection Algorithm for ROBDDs
Author	*Neil Kettle, Andy King (University of Kent, Great Britain)
Page	pp. 243 - 248
Keyword	ROBDD, Symmetry
Abstract	Detecting symmetries is crucial to logic synthesis, technology mapping, detecting function equivalence under unknown input correspondence, and ROBDD minimization. The state of the art is represented by Mishchenko's algorithm. In this paper we present an efficient anytime algorithm for detecting symmetries in Boolean functions represented as ROBDDs, which outputs pairs of symmetric variables until a prescribed time bound is exceeded. The algorithm is complete in that given sufficient time it is guaranteed to find all symmetric pairs. The complexity of this algorithm is in $O(n^4+n|G|+|G|^3)$ where $n$ is the number of variables and $|G|$ the number of nodes in the ROBDD, and it is thus competitive with Mishchenko's $O(|G|^3)$ algorithm in the worst-case since $n\ll|G|$. However, our algorithm performs significantly better because the anytime approach only requires lightweight data structure support and it offers unique opportunities for optimization.
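As a small illustration of what a symmetric variable pair is, the sketch below checks symmetry by brute force over a truth table and yields pairs one at a time, so a caller could stop early. It is not the ROBDD-based anytime algorithm of the paper; the function names and the example functions (majority, mux) are assumptions chosen for illustration.

```python
from itertools import product

def is_symmetric_pair(f, n, i, j):
    """Return True if swapping inputs i and j never changes the value of f."""
    for bits in product((0, 1), repeat=n):
        if bits[i] == bits[j]:
            continue  # swapping equal values cannot change the output
        swapped = list(bits)
        swapped[i], swapped[j] = swapped[j], swapped[i]
        if f(*bits) != f(*swapped):
            return False
    return True

def symmetric_pairs(f, n):
    """Yield symmetric pairs one by one; the caller may stop at any time."""
    for i in range(n):
        for j in range(i + 1, n):
            if is_symmetric_pair(f, n, i, j):
                yield (i, j)

def majority(a, b, c):
    # Hypothetical example: totally symmetric 3-input function.
    return int(a + b + c >= 2)

def mux(s, a, b):
    # Hypothetical example: a multiplexer has no symmetric input pair.
    return a if s else b
```

This exhaustive check costs time exponential in the number of inputs, which is exactly why decision-diagram methods such as the one in the abstract matter for realistic functions.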

3A-2 (Time: 16:25 - 16:50)

Title	High Level Equivalence Symmetric Input Identification
Author	*Ming-Hong Su, Chun-Yao Wang (Department of Computer Science, National Tsing Hua University, Taiwan)
Page	pp. 249 - 253
Keyword	Symmetry, Simulation, BDD, Logic Synthesis
Abstract	Symmetric input identification is an important technique in logic synthesis. Previous approaches deal with this problem by building BDDs and developing algorithms to determine symmetric inputs. For designs whose corresponding BDDs cannot be built, BDD-based approaches cannot be applied to this problem. To avoid the limitations of BDD-based approaches, simulation-based methods have been proposed. They are applicable to designs described at an arbitrary level, especially to high-level and black box designs. Previous simulation-based approaches focus on determining the inputs of nonequivalence symmetry. In this paper, we propose a simulation-based approach to identify equivalence symmetric inputs. The experimental results on a set of ISCAS-85 and MCNC benchmarks are also presented.

3A-3 (Time: 16:50 - 17:15)

Title	Fast Multi-Domain Clock Skew Scheduling for Peak Current Reduction
Author	*Shih-Hsu Huang, Chia-Ming Chang, Yow-Tyng Nieh (Chung Yuan Christian University, Taiwan)
Page	pp. 254 - 259
Keyword	Clock Skew Optimization, Clock Skew Scheduling, Integer Linear Programming, High Performance, Low Power
Abstract	Given several specific clocking domains, the peak current minimization problem can be formulated as a 0-1 integer linear program. However, if the number of binary variables is large, the run time is unacceptable. In this paper, we study the reduction of this high computational expense. Our approach includes the following two aspects. First, we derive the ASAP schedule and the ALAP schedule to prune the redundancies without sacrificing the exactness (optimality) of the solution. Second, we propose a zone-based scheduling algorithm to solve a large circuit heuristically.

3A-4 (Time: 17:15 - 17:40)

Title	Low Area Pipelined Circuits by Multi-clock Cycle Paths and Clock Scheduling
Author	*Bakhtiar Affendi Rosdi, Atsushi Takahashi (Tokyo Institute of Technology, Japan)
Page	pp. 260 - 265
Keyword	multi-clock cycle, clock scheduling, pipelined circuits
Abstract	A new algorithm is proposed to reduce the number of intermediate registers of a pipelined circuit using a combination of multi-clock cycle paths and clock scheduling. The algorithm analyzes the pipelined circuit and determines the intermediate registers that can be removed. An efficient subsidiary algorithm is presented that computes the minimum feasible clock period of a circuit containing multi-clock cycle paths. Experiments with a pipelined adder and multiplier verify that the proposed algorithm can reduce the number of intermediate registers without degrading performance, even when delay variations exist.

3A-5 (Time: 17:40 - 18:05)

Title	A Transduction-based Framework to Synthesize RSFQ Circuits
Author	*Shigeru Yamashita (Nara Institute of Science and Technology, Japan), Katsunori Tanaka (NEC Corporation, Media and Information Research Laboratories, Japan), Hideyuki Takada (Kyoto University, Japan), Koji Obata, Kazuyoshi Takagi (Nagoya University, Japan)
Page	pp. 266 - 272
Keyword	RSFQ, logic design, Transduction Method
Abstract	In this paper, we propose a new framework to synthesize rapid single flux quantum (RSFQ) logic circuits. In our framework, we construct a virtual cell, which we call "2-AND/XOR," from the RSFQ logic primitives. By using 2-AND/XOR cells, we can successfully adopt the conventional logic design techniques into our framework, and thus we can successfully generate RSFQ circuits in reasonable time even for large benchmark circuits that have not been reported in existing research.

Session 3B Future Technical Directions for Design Automation (16:00 - 18:05)
Location: Room 413
Chair(s): Makoto Nagata (Kobe University, Japan), Ryuichi Fujimoto (Toshiba, Japan)

3B-1 (Time: 16:00 - 16:25)

Title	Fast Simulation of Large Networks of Nanotechnological and Biochemical Oscillators for Investigating Self-Organization Phenomena
Author	Xiaolue Lai, *Jaijeet Roychowdhury (University of Minnesota, United States)
Page	pp. 273 - 278
Keyword	Nanoelectronics, Biochemical, Oscillator, Macromodel, Simulation
Abstract	We address the problem of fast and accurate computational analysis of large networks of coupled oscillators arising in nanotechnological and biochemical systems. Such systems are computationally and analytically challenging because of their very large sizes and the complex nonlinear dynamics they exhibit. We develop and apply a nonlinear oscillator macromodel that generalizes the well-known Kuramoto model for interacting oscillators, and demonstrate that using our macromodel provides important qualitative and quantitative advantages, especially for predicting self-organization phenomena such as spontaneous pattern formation. Our approach extends and applies recently-developed computational methods for macromodelling electrical oscillators, and features both phase and amplitude components that are extracted automatically (using numerical algorithms) from more complex differential-equation oscillator models available in the literature. We apply our approach to networks of Tunneling Phase Logic (TPL) and Brusselator biochemical oscillators, predicting a variety of spontaneous pattern generation phenomena.

3B-2 (Time: 16:25 - 16:50)

Title	Newton: A Library-Based Analytical Synthesis Tool for RF-MEMS Resonators
Author	*Michael S. McCorquodale (Mobius Microsystems, Inc., United States), James L. McCann (Carnegie Mellon University, United States), Richard B. Brown (University of Utah, United States)
Page	pp. 279 - 284
Keyword	MEMS, RF, Synthesis, Physical Design, Resonators
Abstract	Newton is a library-based CAD tool with an analytical synthesis engine which has been developed to support the direct synthesis of the physical design and an electromechanically equivalent model of RF-MEMS resonators based on process parameters and performance metrics. Newton provides accuracy comparable to finite element analysis while requiring a fraction of the computation and design time. A comparison of results from synthesis with Newton, design with FEA, and test results from fabricated devices is presented.

3B-3 (Time: 16:50 - 17:15)

Title	Jitter Decomposition in Ring Oscillators
Author	*Qingqi Dou, Jacob Abraham (University of Texas at Austin, United States)
Page	pp. 285 - 290
Keyword	Jitter Test, Jitter Decomposition, Autocorrelation, Time domain
Abstract	It is important to separate random jitter from deterministic jitter to quantify their contributions to the total jitter. This paper identifies the limitations of the existing methodologies for jitter decomposition, and develops a new and efficient approach using time lag correlation functions to decompose different jitter components. The theory of the approach is developed and it is applied to a ring oscillator simulated in a 0.6-um AMI CMOS process. Results show good agreement between the theory and HSPICE simulation.

3B-4 (Time: 17:15 - 17:40)

Title	A Fast Methodology for First-Time-Correct Design of PLLs Using Nonlinear Phase-Domain VCO Macromodels
Author	*Prashant Goyal (Indian Institute of Technology, Kanpur, India), Xiaolue Lai, Jaijeet Roychowdhury (University of Minnesota, United States)
Page	pp. 291 - 296
Keyword	PLL, design methodology, behavioral simulation
Abstract	We present a novel methodology suitable for fast, correct design of modern PLLs. The central feature of the methodology is its use of accurate, nonlinear behavioral models for the VCO within the PLL, thus removing the need for many time-consuming SPICE-level simulations during the design process. We apply the new methodology to design a novel injection-aided PLL that acquires lock 3× faster than prior designs, without trading off other design metrics such as jitter. We demonstrate how existing design methodologies based on behavioral simulation are incapable of leading to our new PLL design. The nonlinear behavioral simulations employed in our methodology are more than 2 orders of magnitude faster than transistor-level ones, resulting in an overall design productivity gain of more than an order of magnitude.

3B-5 (Time: 17:40 - 18:05)

Title	Double Edge Triggered Feedback Flip-Flop in Sub 100nm Technology
Author	*Seid Hadi Rasouli, Amir Amirabadi, Azam Seyedi, Ali Afzali-Kusha (University of Tehran, Iran)
Page	pp. 297 - 302
Keyword	low power, high speed, subthreshold leakage current, flip flop
Abstract	In this paper, a new flip-flop called Double-edge triggered Feedback Flip-Flop (DFFF) is proposed. The dynamic power consumption of DFFF is reduced by avoiding unnecessary internal node transitions. The subthreshold current in the flip-flop is very low compared to other structures. Reducing the number of transistors in the stack and increasing the number of charge paths lead to higher operational speed compared to other flip-flops. The simulation results show an improvement of 44% in the speed and 45% in the static leakage power.

Session 3C Routing and Interconnect Optimization (16:00 - 18:05)
Location: Room 414+415
Chair(s): Youichi Shiraishi (Gunma University, Japan), Lei He (University of California, Los Angeles, United States)

3C-1 (Time: 16:00 - 16:25)

Title	Post-Routing Redundant Via Insertion for Yield/Reliability Improvement
Author	*Kuang-Yao Lee, Ting-Chi Wang (National Tsing Hua University, Taiwan)
Page	pp. 303 - 308
Keyword	redundant via, routing
Abstract	Reducing the yield loss due to via failure is one of the important problems in design for manufacturability. A well known and highly recommended method to improve via yield/reliability is to add redundant vias. In this paper we study the problem of post-routing redundant via insertion and formulate it as a maximum independent set (MIS) problem. We present an efficient graph construction algorithm to model the problem, and an effective MIS heuristic to solve the problem. The experimental results show that our MIS heuristic inserts more redundant vias and distributes them more uniformly among via layers than a commercial tool and an existing method. The number of inserted redundant vias can be increased by up to 21.24%. Besides, since redundant vias can be classified into on-track and off-track ones, and on-track ones have better electrical properties, we also present two methods (one is modified from the MIS heuristic, and the other is applied as a post processor) to increase the amount of on-track redundant vias. The experimental results indicate that both methods perform very well.
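The MIS formulation can be pictured with a tiny conflict graph: each vertex is a candidate redundant via, and an edge joins two candidates that cannot both be inserted. The sketch below uses a generic minimum-degree greedy heuristic, which is an assumption for illustration and not the heuristic of the paper; the vertex labels are hypothetical.

```python
def greedy_mis(adjacency):
    """Greedy independent set: repeatedly pick a minimum-degree vertex.

    `adjacency` maps each vertex to the set of its neighbours and is
    assumed to be symmetric (u in adjacency[v] iff v in adjacency[u]).
    """
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    chosen = []
    while remaining:
        # Ties are broken by vertex label so the result is deterministic.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed  # drop edges to deleted vertices
    return chosen

# Hypothetical conflict graph: five candidates in a chain a-b-c-d-e.
conflicts = {
    "a": {"b"},
    "b": {"a", "c"},
    "c": {"b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
inserted = greedy_mis(conflicts)
```

On this chain the heuristic selects three mutually compatible candidates, which is the maximum possible; on general graphs a greedy choice gives no such guarantee.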

3C-2 (Time: 16:25 - 16:50)

Title	Temperature-Aware Routing in 3D ICs
Author	Tianpei Zhang, *Yong Zhan, Sachin S. Sapatnekar (University of Minnesota, United States)
Page	pp. 309 - 314
Keyword	3D, IC, Temperature, Routing
Abstract	Three-dimensional integrated circuits (3D ICs) provide an attractive solution for improving circuit performance. Such solutions must be embedded in an electrothermally-conscious design methodology, since 3D ICs generate a significant amount of heat per unit volume. In this paper, we propose a temperature-aware 3D global routing algorithm with insertion of "thermal vias" and "thermal wires" to lower the effective thermal resistance of the material, thereby reducing chip temperature. Since thermal vias and thermal wires take up lateral routing space, our algorithm utilizes sensitivity analysis to judiciously allocate their usage, and iteratively resolve contention between routing and thermal vias and thermal wires. Experimental results show that our routing algorithm can effectively reduce the peak temperature and alleviate routing congestion.

3C-3 (Time: 16:50 - 17:15)

Title	Closed Form Solution for Optimal Buffer Sizing Using The Weierstrass Elliptic Function
Author	Sebastian Vogel (Darmstadt University of Technology, Germany), *Martin D.F. Wong (University of Illinois at Urbana-Champaign, United States)
Page	pp. 315 - 319
Keyword	buffer sizing, closed form, Weierstrass Elliptic Function
Abstract	This paper presents a fundamental result on buffer sizing. Given an interconnection wire with n buffers evenly spaced along the wire, we would like to size all buffers such that the Elmore delay is minimized. It is well known that the problem can be solved by an iterative algorithm which sizes one buffer at a time. However, no closed form solution has ever been reported. In this paper, we derive a closed form buffer sizing function f(x), where f(x) gives the optimal buffer size for the buffer at position x. We show that f(x) can be expressed in terms of the Weierstrass elliptic function p(x) and its derivative p'(x).

3C-4 (Time: 17:15 - 17:40)

Title	An O(mn) Time Algorithm for Optimal Buffer Insertion of Nets with m Sinks
Author	*Zhuo Robert Li, Weiping Shi (Texas A&M University, United States)
Page	pp. 320 - 325
Keyword	buffer insertion, routing, Elmore Delay, data structure
Abstract	Buffer insertion is an effective technique to reduce interconnect delay. In this paper, we give a simple $O(mn)$ time algorithm for optimal buffer insertion, where $m$ is the number of sinks and $n$ is the number of buffer positions. This is the first linear time buffer insertion algorithm for nets with a constant number of sinks. When $m$ is small, it is a significant improvement over our recent $O(n\log^2 n)$ time algorithm, and the $O(n^2)$ time algorithm of van Ginneken. For $b$ buffer types, the new algorithm runs in $O(b^2n+bmn)$ time, an improvement over our recent $O(bn^2)$ algorithm. The improvement is made possible by a clever bookkeeping method and an innovative linked list data structure that can perform addition of a wire, and addition of a buffer in amortized $O(1)$ time. On industrial test cases, the new algorithm is faster than previous best algorithms by an order of magnitude.
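For context, the classic dynamic program that this line of work improves on can be sketched for a single two-pin wire under the Elmore delay model. This is a van Ginneken-style baseline with candidate pruning, not the $O(mn)$ algorithm of the paper; the function names, the single buffer type, and the uniform segment length are assumptions made for illustration.

```python
def add_wire(cand, r, c, length):
    """Propagate a (slack, load) candidate upstream across a wire segment."""
    q, cap = cand
    return (q - r * length * (c * length / 2 + cap), cap + c * length)

def add_buffer(cand, rb, cb, tb):
    """Insert a buffer: pay its delay and hide the downstream load."""
    q, cap = cand
    return (q - tb - rb * cap, cb)

def prune(cands):
    """Keep only candidates that are not dominated in (slack, load)."""
    kept, best_q = [], float("-inf")
    for q, cap in sorted(cands, key=lambda qc: (qc[1], -qc[0])):
        if q > best_q:  # larger load is only worth keeping with more slack
            kept.append((q, cap))
            best_q = q
    return kept

def best_source_slack(n_positions, seg_len, r, c, rb, cb, tb,
                      rd, sink_cap, sink_rat):
    """Best slack at the driver for a wire with n candidate buffer positions."""
    cands = [(sink_rat, sink_cap)]
    for _ in range(n_positions):
        cands = [add_wire(x, r, c, seg_len) for x in cands]
        cands = prune(cands + [add_buffer(x, rb, cb, tb) for x in cands])
    cands = [add_wire(x, r, c, seg_len) for x in cands]
    return max(q - rd * cap for q, cap in cands)
```

Because wire and buffer updates are monotone in both slack and load, discarding dominated candidates never removes the optimum, which is what keeps the candidate list short.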

3C-5 (Time: 17:40 - 18:05)

Title	Spec-based Flip-Flop and Latch Repeater Planning
Author	*Man Chung Hon (Intel Corporation, United States)
Page	pp. 326 - 331
Keyword	interconnect, repeater, flip-flop, latch
Abstract	Shrinking process geometries and frequency scaling give rise to an increasing number of interconnects that require multiple clock cycles. This paper explores efficient techniques to insert flip-flops and latches to meet pre-determined latency and margin constraints at the receivers. Previous approaches push timing margins to either end of the interconnect. We present an $O(n \log n)$-time algorithm to insert flip-flops that evens out timing margins across the entire interconnect, resulting in more robust designs and faster design convergence. An $O(n \log n)$-time extension to handle symmetric, two-phase latches is also presented. Experimental results verify the correctness and practicality of our approach.

Session 3D Special Session: Flash Memory in Embedded Systems (16:00 - 18:05)
Location: Room 416+417
Chair(s): Tohru Ishihara (Kyushu Univ., Japan), Hiroyuki Tomiyama (Nagoya University, Japan)

3D-1 (Time: 16:00 - 17:00)

Title	Current Trends in Flash Memory Technology
Author	*Sang Lyul Min, Eyee Hyun Nam (Seoul National University, Republic of Korea)
Page	pp. 332 - 333
Keyword	Nand flash memory, Host interface, Storage, Hybrid HDD
Abstract	In this paper, we describe the basics of flash memory technology in general and flash memory drive in particular, and explain the current trends of major components of a flash memory drive including flash memory chips, host interface and flash memory controller.

3D-2 (Time: 17:00 - 18:00)

Title	Configurability of Performance and Overheads in Flash Management
Author	*Tei-Wei Kuo, Jen-Wei Hsieh (National Taiwan University, Taiwan), Li-Pin Chang (National Chiao-Tung University, Taiwan), Yuan-Hao Chang (National Taiwan University, Taiwan)
Page	pp. 334 - 341
Keyword	flash-memory, performance, overheads, configurability
Abstract	Flash memory has been widely considered as a good alternative for storage system implementations because it offers superior vibration tolerance and power efficiency, compared to hard-disks. Because of its unique characteristics, direct application of disk management methods to flash memory might result in performance degradation and even a reduced lifetime. The management issues have become even more challenging as the capacity of flash memory has increased significantly in the past few years. In this paper, we summarize our work on several important issues in flash memory management, where system performance and management overheads are considered. The capability of the proposed methodology was evaluated by a series of experiments to provide more insights in system designs.

Thursday January 26, 2006

Session 2K Keynote Address II (9:00 - 10:00)
Location: Small Auditorium, 5F
Chair(s): Fumiyasu Hirose (Cadence, Japan)

2K-1 (Time: 9:00 - 10:00)

Title	Challenging Device Innovation
Author	Satoru Ito (President & CEO, RENESAS Technology Corp., Japan)
Abstract	The semiconductor industry has continuously transformed our way of life, through a number of underlying technology breakthroughs and innovations over the past years. There are currently two challenges that this industry faces: a limitation of miniaturization technology and a difficulty in maintaining an economy of scale. To cope with these challenges, there is a growing need to work closely with partners and customers who have business related to semiconductors, in addition to semiconductor manufacturers. Especially in the area of semiconductor design, we see a need to create a new EDA methodology that broadens the definition of traditional EDA and re-defines the connection among system designers, SoC designers and development tool designers. As we move closer to the realm driven by the convergence of applications and advancements in miniaturization technology, I'd like to discuss the associated technological challenges as well as economical challenges, and present to you our strategy to overcome these issues.

Session 4A Resolving Timing Issues: Design and Test (10:15 - 12:20)
Location: Room 411+412
Chair(s): Masaki Hashizume (Tokushima University, Japan), Kazumi Hatayama (Renesas, Japan)

4A-1 (Time: 10:15 - 10:40)

Title	Delay Defect Screening for a 2.16GHz SPARC64 Microprocessor
Author	Noriyuki Ito, *Akira Kanuma, Daisuke Maruyama, Hitoshi Yamanaka, Tsuyoshi Mochizuki, Osamu Sugawara, Chihiro Endoh, Masahiro Yanagida, Takeshi Kono, Yutaka Isoda, Kazunobu Adachi, Takahisa Hiraide, Shigeru Nagasawa, Yaroku Sugiyama, Eizo Ninoi (Fujitsu Limited, Japan)
Page	pp. 342 - 347
Keyword	microprocessor, delay fault, screening, at-speed
Abstract	This paper presents a case-study of delay defect screening applied to the Fujitsu 2.16GHz SPARC64 microprocessor. A non-robust delay test is used while each test vector is compacted to detect multiple transition faults in a standard scan-based design. Our test technique, applied to a microprocessor designed with 6M gate logic, 4MB level 2 cache, and 239K latches, achieves 90% coverage using 3,103 test vectors. We show the correlation between the screening result and the actual number of delay defects.

4A-2 (Time: 10:40 - 11:05)

Title	A Dynamic Test Compaction Procedure for High-quality Path Delay Testing
Author	Masayasu Fukunaga (Fujitsu Ltd., Japan), Seiji Kajihara, *Xiaoqing Wen (Kyushu Institute of Technology, Japan), Toshiyuki Maeda, Shuji Hamada, Yasuo Sato (Semiconductor Technology Academic Research Center, Japan)
Page	pp. 348 - 353
Keyword	delay testing, test generation, test compaction, path delay fault
Abstract	We propose a dynamic test compaction procedure to generate high-quality test patterns for path delay faults. While the proposed procedure generates a compact two-pattern test set for selected faults, the generated test set would detect not only the selected faults but also faults on many unselected paths. Hence both high test quality by detecting untargeted faults and test cost reduction by reducing test patterns can be achieved. Experimental results show the effectiveness of the proposed procedure.

4A-3 (Time: 11:05 - 11:30)

Title	Delay Variation Tolerance for Domino Circuits
Author	Kai-Chiang Wu, *Cheng-Tao Hsieh, Shih-Chieh Chang (Department of CS, National Tsing Hua University, Taiwan)
Page	pp. 354 - 359
Keyword	Delay variation, tolerance
Abstract	Factors of delay variation may cause a manufactured chip to violate the pre-specified timing constraint. In this paper, we propose a re-synthesis technique to tolerate delay variation for domino circuits. Note that the slacks of nodes along critical paths are zero; any delay addition to those zero-slack nodes will worsen the final performance of a circuit. Our basic idea is to increase the slacks of nodes in the critical region by appending a redundant auxiliary sub-circuit to the original circuit.

4A-4 (Time: 11:30 - 11:55)

Title	Efficient Identification of Multi-Cycle False Path
Author	Kai Yang, *Tim Cheng (University of California, Santa Barbara, United States)
Page	pp. 360 - 365
Keyword	false path, timing analysis, multi-cycle, sensitization, clock period
Abstract	In this paper, we address the timing analysis problem by considering both single-cycle and multi-cycle operations. We give a precise definition of multi-cycle false paths and provide the necessary conditions for multi-cycle sensitizable paths. We then propose an efficient algorithm to identify multi-cycle false paths.

4A-5 (Time: 11:55 - 12:20)

Title	IEEE Standard 1500 Compatible Interconnect Diagnosis for Delay and Crosstalk Faults
Author	*Katherine Shu-Min Li (Dept. of Electronics Engineering, National Chiao Tung University, Taiwan), Yao-Wen Chang (Dept. of Electronics Engineering & Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan), Chauchin Su (Dept. of Electronics Control, National Chiao Tung University, Taiwan), Chung-Len Lee (Dept. of Electronics Engineering, National Chiao Tung University, Taiwan), Jwu E Chen (Dept. of Electrical Engineering, National Central University, Taiwan)
Page	pp. 366 - 371
Keyword	interconnect diagnosis, optimal diagnosability, delay fault, crosstalk fault, oscillation ring
Abstract	We propose an interconnect diagnosis scheme based on Oscillation Ring test methodology for SOC design with heterogeneous cores. The target fault models are delay faults and crosstalk glitches. We analyze the diagnosability of an interconnect structure and propose a fast diagnosability checking algorithm and an efficient diagnosis ring generation algorithm which achieves the optimal diagnosability. Two optimization techniques improve the efficiency and effectiveness of interconnect diagnosis. In all experiments, our method achieves 100% fault coverage and the optimal diagnosis resolution.

Session 4B Leading Edge Design Methodology for SoCs and SiPs (10:15 - 12:20)
Location: Room 413
Chair(s): Satoshi Matsushita (NEC, Japan), Makoto Ikeda (Univ. of Tokyo, Japan)

4B-1 (Time: 10:15 - 10:40)

Title	High-Level Architecture Exploration for MPEG4 Encoder with Custom Parameters
Author	*Marius Bonaciu, Aimen Bouchhima, Wassim Youssef, Xi Chen (TIMA Laboratory, France), Wander Cesario (MND, France), Ahmed Jerraya (TIMA Laboratory, France)
Page	pp. 372 - 377
Keyword	Multiprocessor SoC architecture, Video encoder, MPEG4 application, Architecture exploration, Customization
Abstract	This paper proposes the use of a high-level architecture exploration method for different MPEG4 video encoders using different customization parameters. The targeted architecture is a heterogeneous MP-SoC which may include up to 2 coarse grain SIMD (task level SIMD) subsystems to perform the computations. The customization parameters are related to video resolution, frame rate, communication network, level of parallelism and CPU types. These parameters are determined during the high-level architecture exploration, by estimating the architecture performance at early stages of the design flow. Experiments show that the error of these high-level performance estimates is less than 10% compared to those obtained with the final manually implemented RTL architecture. This method was used successfully for exploration of different MPEG4 architecture configurations with different customization parameters. We consider these experiments a breakthrough because they show how a complex design can be mastered through a set of pragmatic choices.

4B-2 (Time: 10:40 - 11:05)

Title	Programmable Numerical Function Generators Based on Quadratic Approximation: Architecture and Synthesis Method
Author	*Shinobu Nagayama (Hiroshima City University, Japan), Tsutomu Sasao (Kyushu Institute of Technology, Japan), Jon Butler (Naval Postgraduate School, United States)
Page	pp. 378 - 383
Keyword	LUT cascade, FPGA, 2nd-order Chebyshev approximation, Numerical function generators (NFGs), Automatic synthesis
Abstract	This paper presents an architecture and a synthesis method for programmable numerical function generators (NFGs) for trigonometric, logarithmic, square root, and reciprocal functions. Our NFG partitions a given domain of the function into non-uniform segments using an LUT cascade, and approximates the given function by a quadratic polynomial for each segment. Thus, we can implement fast and compact NFGs for a wide range of functions. Implementation results on an FPGA show that: 1) our NFGs require only 4% of the memory needed by NFGs based on the linear approximation with non-uniform segmentation; and 2) our NFGs require only 22% of the memory needed by NFGs based on the 5th-order approximation with uniform segmentation. Our automatic synthesis system generates such compact NFGs quickly.

4B-3 (Time: 11:05 - 11:30)

Title	An Automated Design Flow for 3D Microarchitecture Evaluation
Author	*Jason Cong, Ashok Jagannathan, Yuchun Ma, Glenn Reinman, Jie Wei, Yan Zhang (University of California, Los Angeles, United States)
Page	pp. 384 - 389
Keyword	3D IC, microarchitecture, thermal, floorplan
Abstract	Although the emerging three-dimensional integration technology can significantly reduce interconnect delay, chip area and power dissipation in nanometer technologies, its impact on system performance is still poorly understood due to the lack of tools and a systematic flow to evaluate 3D microarchitecture integration. The contribution of this paper is the development of an automated physical design flow for 3D architecture evaluation, named MEVA-3D, which includes 3D floorplanning, routing and automated thermal via insertion, and associated die size, performance, and thermal modeling capabilities. We apply this flow to some simple out-of-order superscalar microprocessor designs to evaluate the performance and thermal behavior in 2D and 3D designs, and demonstrate the value of MEVA-3D in providing quantitative evaluation results to guide 3D architecture designs. In particular, we show that it is feasible to manage the thermal challenge with the use of a combination of thermal vias and double-sided heat sinks, and report modest system performance gain in 3D design for these simple test examples.

4B-4 (Time: 11:30 - 11:55)

Title	Optimal Topology Exploration for Application-Specific 3D Architectures
Author	Ozcan Ozturk, Feng Wang, *Mahmut Kandemir, Yuan Xie (Pennsylvania State University, United States)
Page	pp. 390 - 395
Keyword	3D, IC, TOPOLOGY, ARCHITECTURE
Abstract	As technology scales, increasing interconnect costs make it necessary to consider alternate ways of building integrated circuits. One promising option along this direction is 3D architectures where a stack of multiple device layers, with direct vertical tunneling through them, are put together on the same chip. In this paper, we explore how processor cores and storage blocks can be placed in a 3D architecture to minimize data access costs under temperature constraints. This process is referred to as the topology exploration. Using integer linear programming, we compare the best 2D placement with the best 3D placement, and show through experiments with both single-core and multi-core systems that the 3D placement generates much better results (in terms of data access costs) under the same temperature bounds. We also discuss the tradeoffs between temperature constraint and data access costs.

4B-5 (Time: 11:55 - 12:20)

Title	Task Placement Heuristic Based on 3D-Adjacency and Look-Ahead in Reconfigurable Systems
Author	Jesus Tabero (Instituto Nacional de Tecnica Aeroespacial, Spain), Julio Septien, Hortensia Mecha, *Daniel Mozos (Universidad Complutense de Madrid, Spain)
Page	pp. 396 - 401
Keyword	reconfigurable systems, FPGA, fragmentation
Abstract	To get efficient HW management in 2D Reconfigurable Systems, heuristics are needed to select the best place to locate each arriving task. We propose a technique that locates the task next to the borders of the free area for as many cycles as possible, trying to minimize the area fragmentation. Moreover, we combine it with a look-ahead heuristic that allows delaying the scheduling of a task to the next event, increasing the solution search space.

Session 4C Advanced Circuit Simulation (10:15 - 12:20)
Location: Room 414+415
Chair(s): Hideki Asai (Shizuoka University, Japan), C.J. Richard Shi (University of Washington, United States)

4C-1 (Time: 10:15 - 10:40)

Title	A Quasi-Newton Preconditioned Newton-Krylov Method for Robust and Efficient Time-Domain Simulation of Integrated Circuits with Strong Parasitic Couplings
Author	Zhao Li (Cadence Design Systems, United States), *Richard Shi (University of Washington, United States)
Page	pp. 402 - 407
Keyword	time domain circuit simulation, Krylov-subspace methods, parasitic-coupled VLSI circuits
Abstract	In this paper, the Newton-Krylov method is explored for robust and efficient time-domain VLSI circuit simulation. Different from the LU-factorization based direct method, the Newton-Krylov method uses a preconditioned Krylov-subspace iterative method for linear system solving. Our key contribution is to introduce an effective quasi-Newton preconditioning scheme for Krylov-subspace methods to reduce the number and cost of LU factorizations during time-domain circuit simulation. Experimental results on a collection of digital, analog and RF circuits have shown that the quasi-Newton preconditioned Krylov-subspace method is as robust and accurate as SPICE3. The proposed Newton-Krylov method is especially attractive for simulating circuits with a large amount of parasitic RLC elements for post-layout verification.

4C-2 (Time: 10:40 - 11:05)

Title	An Efficient and Globally Convergent Homotopy Method for Finding DC Operating Points of Nonlinear Circuits
Author	*Kiyotaka Yamamura, Wataru Kuroki (Chuo University, Japan)
Page	pp. 408 - 415
Keyword	circuit simulation, dc operating point analysis, homotopy method, global convergence, SPICE
Abstract	Finding DC operating points of nonlinear circuits is an important problem in circuit simulation. The Newton-Raphson method employed in SPICE-like simulators often fails to converge to a solution. To overcome this convergence problem, homotopy methods have been studied from various viewpoints. There are several types of homotopy methods, one of which succeeded in solving bipolar analog circuits with more than 20000 elements with the theoretical guarantee of global convergence. In this paper, we propose an improved version of the homotopy method that can find DC operating points of practical nonlinear circuits smoothly and efficiently. It is also shown that the proposed method can be easily implemented on SPICE without programming.
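
As a rough illustration of the homotopy idea mentioned in this abstract, the sketch below applies a Newton homotopy, H(x, t) = f(x) - (1 - t) f(x0), to a single diode node and walks t from 0 to 1 in fixed steps, correcting with Newton's method at each step. It is not the method proposed in the paper: the one-node circuit, its element values, the function names and the naive fixed-step continuation are all assumptions made for the example, and fixed stepping carries no global convergence guarantee.

```python
import math

def diode_kcl(v, e=5.0, r=1e3, i_s=1e-14, v_t=0.025875):
    """KCL residual f(v) and derivative df/dv at the diode node of a
    hypothetical source-resistor-diode loop (all values are assumptions)."""
    i_d = i_s * (math.exp(v / v_t) - 1.0)
    f = (v - e) / r + i_d
    df = 1.0 / r + (i_s / v_t) * math.exp(v / v_t)
    return f, df

def newton_homotopy(residual, x0=0.0, steps=100, newton_iters=50, tol=1e-12):
    """Follow H(x, t) = f(x) - (1 - t) * f(x0) from t = 0 (trivially solved
    by x0) to t = 1 (the original problem f(x) = 0)."""
    f0, _ = residual(x0)
    x = x0
    for k in range(1, steps + 1):
        t = k / steps
        # Newton corrector, started from the solution of the previous step.
        for _ in range(newton_iters):
            f, df = residual(x)
            h = f - (1.0 - t) * f0
            dx = -h / df          # dH/dx equals df/dx for this homotopy
            x += dx
            if abs(dx) < tol:
                break
    return x

v_op = newton_homotopy(diode_kcl)
print(f"diode node voltage: {v_op:.4f} V")
```

Because each step starts close to a solution, the exponential diode term never sees a large Newton update, which is the practical appeal of continuation over a cold Newton-Raphson start.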

4C-3 (Time: 11:05 - 11:30)

Title	Optimization of Circuit Trajectories: An Auxiliary Network Approach
Author	Baohua Wang, *Pinaki Mazumder (University of Michigan, United States)
Page	pp. 416 - 421
Keyword	trajectory optimization, circuit sensitivity, circuit simulation, low power
Abstract	On optimizing circuit trajectories, i.e. continuous paths of circuit parameters, the paper presents an auxiliary network approach, which utilizes Pontryagin's Minimum Principle. Based on a set of circuit element correspondence rules, the introduced approach establishes an auxiliary network for a given circuit to be optimized, then circuit trajectories are optimized in a process of simulating the given circuit and the auxiliary network. The auxiliary network approach facilitates establishing analytic models in designing high-performance circuits that require fine tuning circuit trajectories. The paper details the theoretical framework of auxiliary network, and provides practical examples of its application in adiabatic circuit design.

4C-4 (Time: 11:30 - 11:55)

Title	SASIMI: Sparsity-Aware Simulation of Interconnect-Dominated Circuits with Non-Linear Devices
Author	*Jitesh Jain, Stephen F Cauley, Cheng-Kok Koh, Venkataramanan Balakrishnan (Purdue University, United States)
Page	pp. 422 - 427
Keyword	interconnect
Abstract	We present a technique for the fast and accurate simulation of large scale VLSI interconnects with nonlinear devices, called SASIMI. The numerical efficiency of this technique is realized through linear algebraic techniques that exploit the sparsity and structure of the matrices that are encountered in VLSI structures. Numerical results show that SASIMI is up to 1400 times as fast as commercial-grade SPICE, for moderate-size circuits, with little sacrifice in simulation accuracy.

4C-5 (Time: 11:55 - 12:20)

Title	An Unconditional Stable General Operator Splitting Method for Transistor Level Transient Analysis
Author	Zhengyong Zhu, Rui Shi, *Chung-Kuan Cheng (University of California, San Diego, United States), Ernest S. Kuh (University of California, Berkeley, United States)
Page	pp. 428 - 433
Keyword	simulation, transistor-level, operator splitting
Abstract	In this paper, we introduce a general operator splitting method for transient simulation of VLSI circuits. The proposed approach generates special partitions of the circuits and alternates the explicit and implicit integrations between the partitions. We prove that the method is unconditionally stable independent of the step size. The splitting scheme greatly reduces the nonzero fill-ins generated in direct methods like LU decomposition. Orders of magnitude speedup over Berkeley SPICE3 is observed for sets of circuits.

Session 4D Special Session: Open Access Overview (10:15 - 12:20)
Location: Room 416+417
Chair(s): John Darringer (IBM, United States)

4D-1 (Time: 10:15 - 10:45)

Title	An Introduction to OpenAccess -An Open Source Data Model and API for IC Design-
Author	*Michaela Guiney, Eric Leavitt (Cadence, United States)
Page	pp. 434 - 436
Keyword	OpenAccess, Open, Database
Abstract	The OpenAccess database provides a comprehensive open standard data model and robust implementation for IC design flows. This paper describes how it improves interoperability among applications in an EDA flow. It details how OA benefits developers of both EDA tools and flows. Finally, it outlines how OA is being used in the industry, at semiconductor design companies, EDA tool vendors, and universities.

4D-2 (Time: 10:45 - 11:15)

Title	Open Access Overview "Industrial Experience"
Author	*Yoshio Inoue (Renesas, Japan)
Page	pp. 437 - 438
Keyword	OpenAccess, Renesas, REAP
Abstract	Renesas Technology Corp. designers turned to OpenAccess to address the major design challenges with systems on chip for the automotive, wireless, digital consumer and industrial markets. OpenAccess provides Renesas with an industry standard database that has the capacity and performance needed for today's largest designs. The C++ API (C++ Application Programming Interface) facilitates fast access to a unified data model for both logical and physical design. It enables an efficient level of access to the data model to integrate tools developed in-house with commercially available tools for translation free interoperability.

4D-3 (Time: 11:15 - 11:45)

Title	EDA Vendor Adoption
Author	*Hillel Ofek (Sagantec, United States)
Page	p. 439
Keyword	EDA, Open Access, Interoperability, DFM
Abstract	Rapid IC technology advances towards deep sub-micron technologies produce ever growing pressure on EDA Vendors. EDA design systems and complex design flows require close cooperation of various analysis and optimization tools originating from multiple vendors. This paper will provide a view of Open Access from an EDA provider vantage point. It will address the positives and the challenges facing both the EDA Vendors and users. It will discuss OA adoption, its benefits, and by way of example illustrate current state and show the way to broad adoption of this essential standard.

4D-4 (Time: 11:45 - 12:15)

Title	Utility of the OpenAccess Database in Academic Research
Author	David Papa, *Igor Markov (University of Michigan, United States), Philip Chong (Cadence Design Systems, United States)
Page	pp. 440 - 441
Keyword	OpenAccess
Abstract	The proliferation of OpenAccess is opening promising new research opportunities to academic communities. The benefits of adopting an OpenAccess based approach to EDA research are growing, and we review a number of them. Among them are the ability to learn about a domain while writing software for it, increased ease of code reuse, new high-quality benchmarks, and enhanced industry adoption.

Session 5A Advances in Simulation Technologies (13:30 - 15:35)
Location: Room 411+412
Chair(s): Shin'ichi Minato (Hokkaido University, Japan), Karem Sakallah (University of Michigan, United States)

5A-1 (Time: 13:30 - 13:55)

Title	Depth-Driven Verification of Simultaneous Interfaces
Author	*Ilya Wagner, Valeria Bertacco, Todd Austin (University of Michigan, United States)
Page	pp. 442 - 447
Keyword	Interface verification, High-performance simulation
Abstract	The verification of modern computing systems has grown to dominate the cost of system design, often with limited success as designs continue to be released with latent bugs. This trend is accelerated with the advent of highly integrated system-on-a-chip (SoC) designs, which feature multiple complex subcomponents connected by simultaneously active interfaces. In this paper, we introduce a closed-loop feedback technique targeting the verification of multiple components connected by parallel interfaces. We utilize an environment with hierarchical Markov models, where top-level submodels specify overarching simulation goals of the system, while lower-level submodels specify the detailed component-level input generation. Test accuracy is improved through the use of depth-driven random test generation. The approach allows users to specify correctness properties and key activity nodes in the design to be exercised. We examine three non-trivial designs, two microprocessors and a chip-multiprocessor router switch, and we demonstrate that our technique finds many more bugs than a constrained-random test generation technique and cuts the simulation effort in half, compared to previous Markov-model based solutions.

5A-2 (Time: 13:55 - 14:20)

Title	FSM-Based Transaction-Level Functional Coverage for Interface Compliance Verification
Author	*Man-Yun Su, Che-Hua Shih, Juinn-Dar Huang, Jing-Yang Jou (Department of Electronics Engineering, National Chiao Tung University, Taiwan)
Page	pp. 448 - 453
Keyword	functional coverage, interface compliance verification
Abstract	Interface compliance verification plays a very important role in modern SoC designs. In order to perform a quantitative analysis of simulation completeness, adequate coverage metrics are mandatory. In this paper, we propose a finite state machine (FSM) based transaction-level functional coverage methodology for interface compliance verification. A language, State-Oriented Language (SOL), is developed to specify functional transactions mainly at the higher FSM level instead of lower logic or signal level. By utilizing SOL, it is simple and rigorous to specify interesting transactions from the specification FSM of the target interface protocol. Experimental results show that the proposed methodology can effectively improve the verification quality as well as increase the efficiency of regression verification.

5A-3 (Time: 14:20 - 14:45)

Title	Hardware Debugging Method Based on Signal Transitions and Transactions
Author	*Nobuyuki Ohba, Kohji Takano (IBM Japan Ltd., Japan)
Page	pp. 454 - 459
Keyword	Hardware Debugging, Logic Analyzer, Real-time verification, Hardware prototyping
Abstract	This paper proposes a hardware design debugging method, Transition and Transaction Tracer (TTT), which probes and records the signals of interest for a long time, hours, days, or even weeks, without a break. It compresses the captured data in real time and stores it in a state transition format in memory. It can be programmed to generate a trigger for a logic analyzer when it detects certain transitions. The visualizer, which shows the captured data in the matrix, timing-chart, and state-transition diagram formats, helps the engineer effectively find bugs.

5A-4 (Time: 14:45 - 15:10)

Title	Cycle Error Correction in Asynchronous Clock Modeling for Cycle-Based Simulation
Author	*Junghee Lee, Joonhwan Yi (Samsung Electronics, Republic of Korea)
Page	pp. 460 - 465
Keyword	cycle-based, correction, error
Abstract	As the complexity of SoCs is increasing, hardware/software co-verification becomes an important part of system verification. C-level cycle-based simulation could be an efficient methodology for system verification because of its fast simulation speed. The cycle-based simulation has a limitation in using asynchronous clocks that causes inherent cycle errors. In order to reuse the output of a C-level cycle-based simulation for the verification of a lower level model, the C-level model should be cycle-accurate with respect to the lower level model. In this paper a cycle error correction technique is presented for two asynchronous clock models. An example design is devised to show the effectiveness of the proposed method. Our experimental results show that the fast speed of cycle-based simulation can be fully exploited without sacrificing the cycle accuracy.

5A-5 (Time: 15:10 - 15:35)

Title	A Fast Logic Simulator Using a Look Up Table Cascade Emulator
Author	*Hiroki Nakahara, Tsutomu Sasao, Munehiro Matsuura (Kyushu Institute of Technology, Japan)
Page	pp. 466 - 472
Keyword	LUT cascade, Cycle-based simulator, Binary Decision Diagram, Functional Decomposition
Abstract	This paper shows a new type of cycle-based logic simulation method using a Look-Up Table (LUT) cascade emulator. The method first transforms a given circuit into LUT cascades through BDD (Binary Decision Diagram). Then, it stores LUT data to the memory of an LUT cascade emulator. Next, it generates the C code representing the control circuit of the LUT cascade emulator. And, finally, it converts the C code into the execution code. This method is compared with a Levelized Compiled Code (LCC) simulator with respect to the simulation time and setup time. Although we used a standard PC to simulate the circuit, experimental results show that this method is 12-64 times faster than the LCC.
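
To make the notion of a LUT cascade concrete, here is a minimal sketch of evaluating one: each cell is a memory indexed by the rail value passed from the previous cell together with a few external inputs, and its content is the next rail value. The cell tables below were written by hand for a 3-input majority function; the BDD-based decomposition and the C code generation described in the abstract are not reproduced, and the names `run_cascade` and `MAJ3_CELLS` are hypothetical.

```python
def run_cascade(cells, inputs):
    """Evaluate a LUT cascade.

    cells  : list of (n_ext, table); table is indexed by
             (rail << n_ext) | external_bits and holds the next rail value.
    inputs : list of 0/1 primary inputs, consumed left to right.
    """
    rail = 0   # the first cell has no incoming rail
    pos = 0
    for n_ext, table in cells:
        ext = 0
        for bit in inputs[pos:pos + n_ext]:
            ext = (ext << 1) | bit
        rail = table[(rail << n_ext) | ext]   # one memory read per cell
        pos += n_ext
    return rail

# Hand-built example (an assumption, not derived from a BDD):
# cell 1 counts the ones among x0, x1; cell 2 combines that count with x2.
MAJ3_CELLS = [
    (2, [0, 1, 1, 2]),        # index = x0 x1        -> count in {0, 1, 2}
    (1, [0, 0, 0, 1, 1, 1]),  # index = count, x2    -> majority bit
]

print(run_cascade(MAJ3_CELLS, [1, 0, 1]))
```

Simulation cost in this scheme is one table lookup per cell regardless of how much logic the cell replaces, which is the property a cascade-based simulator relies on.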

Session 5B Scheduling for Embedded Systems (13:30 - 15:35)
Location: Room 413
Chair(s): Sri Parameswaran (University of New South Wales, Australia), Sang Lyul Min (Seoul National Univ., Republic of Korea)

5B-1 (Time: 13:30 - 13:55)

Title	Power-Aware Scheduling and Dynamic Voltage Setting for Tasks Running on a Hard Real-Time System
Author	Peng Rong, *Massoud Pedram (Univ. of Southern California, United States)
Page	pp. 473 - 478
Keyword	power aware scheduling, dynamic voltage setting, hard real-time system
Abstract	This paper addresses the problem of minimizing energy consumption of a computer system performing periodic hard real-time tasks with precedence constraints. In the proposed approach, dynamic power management and voltage scaling techniques are combined to reduce the energy consumption of the CPU and devices. The optimization problem is first formulated as an integer programming problem. Next, a three-phase solution framework, which integrates power management scheduling and task voltage assignment, is proposed. Experimental results show that the proposed approach outperforms existing methods by an average of 18% in terms of the system-wide energy savings.

5B-2 (Time: 13:55 - 14:20)

Title	Optimal TDMA Time Slot and Cycle Length Allocation for Hard Real-Time Systems
Author	*Ernesto Wandeler, Lothar Thiele (ETH Zurich, Switzerland)
Page	pp. 479 - 484
Keyword	TDMA, Performance Analysis, Event-Triggered & Time-Triggered, Distributed Embedded Systems, Real-Time Systems
Abstract	We present an analytic method to determine the provably smallest possible slot length that must be allocated in a TDMA resource, to serve an event-triggered hard real-time load with arbitrary deterministic timing behavior. Based on this method, we then present constructive methods to find all feasible as well as the optimal cycle length in a TDMA resource, and we show how to determine the minimum required bandwidth of a TDMA resource. We demonstrate the applicability and computational efficiency of the presented methods in a case study of a large distributed embedded system with a TDMA bus, where we will find the optimal parameter set for the TDMA bus.

5B-3 (Time: 14:20 - 14:45)

Title	POSIX modeling in SystemC
Author	*Hector Posadas, Jesus Adamez, Pablo Sanchez, Eugenio Villar (University of Cantabria, Spain), Francisco Blasco (DS2, Spain)
Page	pp. 485 - 490
Keyword	POSIX, simulation, SystemC
Abstract	Early estimation of the execution time of embedded SW is an essential task in complex, HW/SW embedded system design. Application SW execution time estimation requires taking into account the impact of the underlying RTOS. As a consequence, RTOS modeling is becoming an active research area. SystemC provides a framework for multiprocessing, HW/SW co-simulation at several abstraction levels. In this paper, a SystemC library for POSIX modeling and simulation is presented. By using the library, the SystemC specification using POSIX functions is converted automatically into a timed simulation estimating the execution time of the application SW running on the POSIX platform. The library works directly on the source code. Therefore, it provides an early and fast estimation of the performance of the system as a consequence of the architectural mapping decisions. Although accuracy is lower than when using lower-level techniques, it supports high-level design-space exploration as simulation time is significantly less than RT (ISS) simulation.

5B-4 (Time: 14:45 - 15:10)

Title	PARLGRAN: Parallelism Granularity Selection for Scheduling Task Chains on Dynamically Reconfigurable Architectures
Author	*Sudarshan Banerjee, Elaheh Bozorgzadeh, Nikil Dutt (University of California, Irvine, United States)
Page	pp. 491 - 496
Keyword	dynamic reconfiguration, scheduling, placement
Abstract	Partial dynamic reconfiguration, often called RTR (run-time reconfiguration) is a key feature in modern reconfigurable platforms. While partial RTR enables additional application performance, it imposes physical constraints necessitating simultaneous scheduling and placement while mapping application task graphs onto such architectures. In this paper we present PARLGRAN, an approach that maximizes performance of application task chains by selecting a suitable granularity of data-parallelism for individual data parallel tasks. Our approach focusses on reconfiguration delay overhead and placement-related issues (such as fragmentation) while selecting individual data-parallelism granularity as an integral part of simultaneous scheduling and placement. We demonstrate that our heuristic generates high-quality schedules on an extensive set of over 1000 synthetic experiments by comparing the results with an approach that tries to statically maximize data-parallelism, i.e., does not consider the overheads and constraints associated with partial RTR. A detailed case-study on JPEG encoding additionally confirms that blindly maximizing data-parallelism can result in schedules even worse than that generated by a simple (but RTR-aware) approach oblivious to data-parallelism.

5B-5 (Time: 15:10 - 15:35)

Title	Memory Optimal Single Appearance Schedule with Dynamic Loop Count for Synchronous Dataflow Graphs
Author	*Hyunok Oh, Nikil Dutt (University of California, Irvine, United States), Soonhoi Ha (Seoul National University, Republic of Korea)
Page	pp. 497 - 502
Keyword	Synchronous Dataflow, Single Appearance Schedule, Minimum Memory Size, quasi-static, automatic code generation
Abstract	In this paper, we propose a new single appearance schedule for synchronous dataflow programs to minimize data memory and code memory size simultaneously. While a single appearance schedule promises only one appearance of each node definition in the generated code, it requires a significant amount of data memory overhead compared with a buffer optimal schedule allowing multiple appearances. The key idea of the proposed technique is to make a dynamic decision of loop count to make a schedule quasi-static. The proposed quasi-static schedule produces a single appearance schedule code with minimum data memory requirement. We prove that every buffer optimal schedule can be transformed to our single appearance schedule which requires optimal buffer size for arbitrary synchronous dataflow graphs. The only penalty for the proposed technique is a slight performance overhead of computing loop counts dynamically. In order to minimize the overhead we propose optimization techniques. Experimental results show that the proposed algorithm reduces total memory by 20% with less than 1% performance overhead compared with the previous single appearance schedule algorithms.
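
For background on the data memory overhead this abstract refers to, the sketch below computes the repetition vector of a synchronous dataflow graph from its balance equations and then the buffer size each edge needs under a flat single appearance schedule, where every node fires all of its repetitions back to back. This is only the textbook baseline, not the dynamic loop count technique of the paper; the function names and the three-node example are assumptions, and the graph is assumed to be connected with no initial tokens.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def repetition_vector(nodes, edges):
    """Smallest positive integers q with q[src] * produced == q[dst] * consumed
    for every edge (src, dst, produced, consumed). Assumes a connected graph."""
    adj = {n: [] for n in nodes}
    for u, v, p, c in edges:
        adj[u].append((v, Fraction(p, c)))
        adj[v].append((u, Fraction(c, p)))
    rate = {nodes[0]: Fraction(1)}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v, ratio in adj[u]:
            r = rate[u] * ratio
            if v not in rate:
                rate[v] = r
                stack.append(v)
            elif rate[v] != r:
                raise ValueError("inconsistent rates: no periodic schedule")
    # Scale the rational rates to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    q = {n: int(rate[n] * scale) for n in nodes}
    g = reduce(gcd, q.values())
    return {n: k // g for n, k in q.items()}

def flat_sas_buffers(edges, q):
    """Tokens each edge must hold when every node runs all its firings in a row."""
    return {(u, v): q[u] * p for u, v, p, c in edges}

edges = [("A", "B", 2, 3), ("B", "C", 1, 2)]
q = repetition_vector(["A", "B", "C"], edges)
print(q, flat_sas_buffers(edges, q))
```

In this example the flat schedule (3A)(2B)(1C) needs 6 tokens on the first edge, while interleaving firings of A and B would need fewer, which is the gap between single appearance and buffer optimal schedules that the abstract describes.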

Session 5C High Frequency Interconnect Effects in Nanometer Technology (13:30 - 15:35)
Location: Room 414+415
Chair(s): Charlie Chung-Ping Chen (National Taiwan University, Taiwan), Noel Menezes (Intel, United States)

5C-1 (Time: 13:30 - 13:55)

Title	Wire Sizing with Scattering Effect for Nanoscale Interconnection
Author	Sean X. Shi, *David Z. Pan (University of Texas at Austin, United States)
Page	pp. 503 - 508
Keyword	scattering effect, wire sizing, nanoscale, interconnection model, wire shaping
Abstract	For nanoscale interconnection, the scattering effect will soon become prominent due to scaling. It will increase the effective resistivity and thus interconnection delay significantly. Existing works on scattering effect are mostly performed using very complicated physics-based models, while the scattering impact on nanoscale VLSI interconnect and optimization has not been studied. In this paper, we first present a simple, closed-form scattering effect resistivity model based on extensive empirical studies on measurement data. Then we apply the proposed scattering model to revisit several classic wire sizing/shaping problems. Our experimental results show that if the scattering effect is ignored or characterized inaccurately beyond 65nm, the resulting interconnect optimization might be way off from the real optimal solution, e.g., up to 70% underestimation of the delay, or 20x oversizing. We also obtain the new closed-form wiresizing functions with consideration of scattering effects.

5C-2 (Time: 13:55 - 14:20)

Title	Adaptive Admittance-based Conductor Meshing for Interconnect Analysis
Author	*Ya-Chi Yang, Cheng-Kok Koh, Venkataramanan Balakrishnan (Purdue University, United States)
Page	pp. 509 - 514
Keyword	discretization, high-frequency effects, inductance extraction, simulation
Abstract	We present a new algorithm for discretizing interconnects, a step that is typically performed to account for the non-uniformity of current flow at high frequencies. The algorithm is based on an easily-computable measure that correlates well with the model accuracy. This measure is used to refine the discretization of interconnects in an adaptive scheme so as to systematically trade off computation against model accuracy. We apply the proposed discretization technique on two classes of problems in the analysis of VLSI interconnects: simulation and frequency-dependent inductance extraction. Numerical results establish that with the interconnect discretizations generated by our algorithm, a reduction in simulation and extraction times by a factor between three and seven can be realized with negligible sacrifice in model accuracy (< 1% error).

5C-3 (Time: 14:20 - 14:45)

Title	Interconnect RL Extraction at a Single Representative Frequency
Author	*Akira Tsuchiya (Kyoto University, Japan), Masanori Hashimoto (Osaka University, Japan), Hidetoshi Onodera (Kyoto University, Japan)
Page	pp. 515 - 520
Keyword	extraction, transmission-line
Abstract	This paper proposes a method to determine a single frequency for interconnect RL extraction. Resistance and inductance of interconnects depend on frequency, and hence the extraction frequency strongly affects the modeling accuracy of interconnects. The proposed method determines an extraction frequency based on the transfer characteristic of interconnects. By choosing the frequency where the transfer characteristic becomes maximum, the extracted RL values achieve the accurate modeling of the waveform. We experimentally verify that the proposed method provides accurate transition waveforms over various interconnect topologies.

5C-4 (Time: 14:45 - 15:10)

Title	An Efficient Algorithm for 3-D Reluctance Extraction Considering High Frequency Effect
Author	*Mengsheng Zhang, Wenjian Yu (EDA Lab, Department of Computer Science & Technology, Tsinghua University, China), Yu Du (Synopsys Inc., United States), Zeyi Wang (EDA Lab, Department of Computer Science & Technology, Tsinghua University, China)
Page	pp. 521 - 526
Keyword	reluctance extraction, VLSI interconnect, inductance
Abstract	As shown in the literature, partial reluctance based circuit analysis is efficient in capturing on-chip inductance effect, because the partial reluctance exhibits much better locality than partial inductance. However, most previous works on reluctance extraction did not take high frequency effect into account and were not efficient enough for complex 3-D structures. In this paper, a new reluctance extraction algorithm is proposed considering the high frequency effect. Numerical experiments demonstrate that our algorithm can handle complex 3-D interconnect structures while exhibiting high accuracy and a speed-up ratio of several tens to hundreds over FastHenry.

5C-5 (Time: 15:10 - 15:35)

Title	Macromodelling Oscillators Using Krylov-Subspace Methods
Author	Xiaolue Lai, *Jaijeet Roychowdhury (University of Minnesota, United States)
Page	pp. 527 - 532
Keyword	Oscillator, Macromodel, MOR, LTV
Abstract	We present an efficient method for automatically extracting unified amplitude/phase macromodels of arbitrary oscillators from their SPICE-level circuit descriptions. Such comprehensive oscillator macromodels are necessary for accuracy when speeding up simulation of higher-level circuits/systems, such as PLLs, in which oscillators are embedded. Standard MOR techniques for linear time invariant (LTI) and varying (LTV) systems are not applicable to oscillators on account of their fundamentally nonlinear phase behaviour. By employing a cancellation technique to deflate out the phase component, we restore the validity and efficacy of Krylov-subspace-based LTV MOR techniques for macromodelling oscillator amplitude responses. The nonlinear phase response is re-incorporated into the macromodel after the amplitude components have been reduced. The resulting unified macromodels predict oscillator waveforms, in the presence of any kind of input or interference, at far lower computational cost than full SPICE-level simulation, and with far greater accuracy compared to existing macromodels.

Session 5D Designers' Forum: Low Power Design (13:30 - 15:30)
Location: Small Auditorium, 5F
Chair(s): Haruyuki Tago (Toshiba, Japan), Makoto Ikeda (University of Tokyo, Japan)

5D-1 (Time: 13:30 - 14:00)

Title	Low-Power Design Methodology for Module-wise Dynamic Voltage and Frequency Scaling with Dynamic De-skewing Systems
Author	*Takeshi Kitahara, Hiroyuki Hara, Shinichiro Shiratake (Toshiba Corporation Semiconductor Company, Japan), Yoshiki Tsukiboshi (Toshiba Microelectronics Corporation, Japan), Tomoyuki Yoda, Tetsuaki Utsumi, Fumihiro Minami (Toshiba Corporation Semiconductor Company, Japan)
Page	pp. 533 - 540
Keyword	dynamic voltage and frequency scaling, clock design methodology, low power design
Abstract	This paper discusses design methodology for a module-wise dynamic voltage and frequency scaling (DVFS) technique. We propose a novel clock design methodology to minimize the inter-module clock skew for solving one of the major design issues in the module-wise DVFS. We also describe a method of determining the minimum supply voltage value for a module. Our experimental results show that the module-wise DVFS can reduce power by 53% compared with the chip-wise DVFS, and 17% more reduction was achieved by applying the minimum supply voltage proposed.

5D-2 (Time: 14:00 - 14:30)

Title	Single-Chip Multi-Processor Integrating Quadruple 8-Way VLIW Processors with Interface Timing Analysis Considering Power Supply Noise
Author	*Satoshi Imai, Atsuki Inoue, Motoaki Matsumura, Kenichi Kawasaki, Atsuhiro Suga (Fujitsu Lab., Japan)
Page	pp. 541 - 546
Keyword	multi-processor, embedded, MPEG2, timing analysis, power supply noise
Abstract	This paper introduces a 51.2Gops, 1.0GB/s-DMA single-chip multi-processor integrating quadruple cores and proposes a new power integrity analysis. Our multi-processor is designed to decode MP@HL streams without any dedicated circuits. To achieve such high performance, data throughput as well as processing capability is important, requiring a large number of high speed I/Os. However, this makes for a high level of power supply noise. We then applied an interface timing margin analysis tool that took power supply noise into account, and succeeded in putting reasonable restrictions on LSI design, as well as that for the printed circuit board. As a result, we succeeded in operating the processor at 533MHz with the 2ch 64bit main memory IF at 266MHz and 64bit system bus at 178MHz.

5D-3 (Time: 14:30 - 15:00)

Title	A System-level Power-estimation Methodology based on IP-level Modeling, Power-level Adjustment, and Power Accumulation
Author	*Masafumi Onouchi, Tetsuya Yamada (Hitachi Ltd., Japan), Kimihiro Morikawa, Isamu Mochizuki, Hidetoshi Sekine (Renesas Technology Corp., Japan)
Page	pp. 547 - 550
Keyword	modeling, power, estimation
Abstract	We have developed a specialized rapid power-estimation methodology for multimedia applications. For a multimedia application, we developed three methodologies: an IP-level modeling, a power-level adjustment, and a power accumulation methodology. With these methodologies, the system-level power estimation becomes so precise and easy that we can revise the SoC design to reduce its power. According to a comparison of the system-level power estimated with these methodologies to board-measured power, the error between the two powers is less than 5.6%.

5D-4 (Time: 15:00 - 15:30)

Title	PowerViP: SoC Power Estimation Framework at Transaction Level
Author	*Ikhwan Lee, Hyunsuk Kim, Peng Yang, Sungjoo Yoo (Samsung Electronics, Co. Ltd., Republic of Korea), Eui-Young Chung (Yonsei University, Republic of Korea), Kyu-Myung Choi, Jeong-Taek Kong, Soo-Kwan Eo (Samsung Electronics, Co. Ltd., Republic of Korea)
Page	pp. 551 - 558
Keyword	SoC power, power estimation, Virtual platform, ARM926, AMBA AXI
Abstract	In this work, we propose a SoC power estimation framework built on our system-level simulation environment. Our framework provides designers with the system-level power profile in a cycle-accurate manner. We target the framework to run fast and accurately, which is enabled by adopting different modeling techniques depending on the power characteristics of various IP blocks. The framework can be applied to any target SoC design.

Session 6A Power Optimization of Large-Scale Circuits (16:00 - 18:05)
Location: Room 411+412
Chair(s): Sheldon Tan (Univ. of California, Riverside, United States), David Z. Pan (Univ. of Texas, Austin, United States)

6A-1 (Time: 16:00 - 16:25)

Title	Mathematically Assisted Adaptive Body Bias (ABB) for Temperature Compensation in Gigascale LSI Systems
Author	Sanjay V Kumar, Chris H Kim, *Sachin S Sapatnekar (University of Minnesota, United States)
Page	pp. 559 - 564
Keyword	Delay, Leakage, Adaptive Body Bias, Process Variations, Temperature Variations
Abstract	Process variations and temperature variations can cause the frequency and leakage of the chip to vary significantly from their expected values, thereby decreasing the yield. Adaptive Body Bias (ABB) can be used to pull back the chip to the nominal operational region. We propose the use of this technique to counter temperature variations along with process variations. We present a CAD perspective for achieving process and temperature compensation using bidirectional ABB. Mathematical models are used to determine the exact amount of body bias required to optimize the delay and leakage, and an algorithmic flow that can be adopted for gigascale LSI systems is provided.

6A-2 (Time: 16:25 - 16:50)

Title	Analysis and Optimization of Gate Leakage Current of Power Gating Circuits
Author	*Hyung-Ock Kim, Youngsoo Shin (Dept. of Electrical Engineering, KAIST, Republic of Korea)
Page	pp. 565 - 569
Keyword	Power gating, Gate leakage, MTCMOS, leakage
Abstract	Power gating is widely accepted as an efficient way to suppress subthreshold leakage current. Yet, it suffers from gate leakage current, which grows very fast with scaling down of gate oxide. We try to understand the sources of leakage current in power gating circuits and show that input MOSFETs play a crucial role in determining total gate leakage current. It is also shown that the choice of a current switch in terms of polarity, threshold voltage, and size has a significant impact on total leakage current. From the observation of the importance of input MOSFETs, we propose the power optimization of power gating circuits through input control.

6A-3 (Time: 16:50 - 17:15)

Title	Delay Modeling and Static Timing Analysis for MTCMOS Circuits
Author	*Naoaki Ohkubo, Kimiyoshi Usami (Graduate School of Engineering, Shibaura Institute of Technology, Japan)
Page	pp. 570 - 575
Keyword	MTCMOS, Selective-MT, Delay, Static timing analysis, Leakage power
Abstract	One of the critical issues in MTCMOS design is how to estimate circuit delay quickly. In this paper, we propose a delay modeling and static timing analysis (STA) methodology targeting MTCMOS circuits. In the proposed method, we prepare a delay look-up table (LUT) indexed by the input slew, the output load capacitance, the virtual ground length, and the power-switch size. Using this LUT, we compute the delay of each logic cell by linear interpolation. Experimental results show that the proposed methodology estimates the critical path delay with good accuracy.
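
Illustration (not from the paper): the abstract describes a four-input delay LUT evaluated by linear interpolation. The sketch below shows one common way to do this, multilinear interpolation over a regular grid. The function name, the table layout, and the sample numbers are assumptions made for illustration only; the paper's actual LUT format is not given in the abstract.

```python
from bisect import bisect_right
from itertools import product

def interpolate(axes, table, point):
    """Multilinear interpolation on a regular grid.

    axes  : list of sorted breakpoint lists, one per LUT dimension
    table : dict mapping index tuples (i0, i1, ...) to delay values
    point : query coordinates, one per dimension
    """
    lows, fracs = [], []
    for axis, x in zip(axes, point):
        # Clamp to the table range, then find the enclosing interval.
        x = min(max(x, axis[0]), axis[-1])
        i = min(bisect_right(axis, x) - 1, len(axis) - 2)
        lows.append(i)
        fracs.append((x - axis[i]) / (axis[i + 1] - axis[i]))
    result = 0.0
    # Blend the 2^d corner values of the enclosing grid cell.
    for corner in product((0, 1), repeat=len(axes)):
        weight = 1.0
        for bit, f in zip(corner, fracs):
            weight *= f if bit else 1.0 - f
        idx = tuple(i + bit for i, bit in zip(lows, corner))
        result += weight * table[idx]
    return result

# Hypothetical axes: input slew, load capacitance, virtual ground length,
# power-switch size. The values and the delay model are made up.
axes = [[0.1, 0.5], [1.0, 4.0], [0.0, 100.0], [1.0, 2.0]]

def model(slew, load, length, size):
    return 20.0 + 10.0 * slew + 2.0 * load + 0.05 * length - 3.0 * size

table = {
    idx: model(*(axes[d][i] for d, i in enumerate(idx)))
    for idx in product(*(range(len(a)) for a in axes))
}
delay = interpolate(axes, table, (0.3, 2.5, 50.0, 1.5))
```

Because the made-up delay model is affine in every input, the interpolated value matches the model exactly at the query point; a real characterized table would only be approximated between breakpoints.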

6A-4 (Time: 17:15 - 17:40)

Title	Switching-Activity Driven Gate Sizing and Vth Assignment for Low Power Design
Author	Yu-Hui Huang, *Po-Yuan Chen, TingTing Hwang (National Tsing Hua University, Taiwan)
Page	pp. 576 - 581
Keyword	low power design, switching activity, leakage power, dynamic power
Abstract	Power consumption has recently become a prominent concern in circuit design. One design problem is modelled as "under a timing constraint, minimize power as much as possible". Previous research on this problem focused on either minimizing dynamic power by gate sizing or reducing leakage power by dual threshold voltage assignment on non-critical paths. However, given a timing constraint, an optimization algorithm must be able to utilize gate sizing and threshold-voltage assignment interchangeably in order to minimize total power consumption, including dynamic and leakage power in active mode and leakage power in idle mode. We find that the switching activity of a gate plays an important role in deciding whether to choose gate sizing or threshold-voltage assignment for performance improvement. For high switching-activity gates, threshold-voltage assignment should be used, while for low switching-activity gates, gate sizing should be utilized. We develop an algorithm to perform gate sizing and threshold-voltage assignment simultaneously, taking switching activity into consideration. The results show that under the same timing constraint, our circuits achieve 16.26% and 18.53% improvement in total power compared to the original circuits for the cases where the percentage of active time is 100% and 50%, respectively.

6A-5 (Time: 17:40 - 18:05)

Title	Power Driven Placement with Layout Aware Supply Voltage Assignment for Voltage Island Generation in Dual-Vdd Designs
Author	*Bin Liu, Yici Cai, Qiang Zhou, Xianlong Hong (Tsinghua University, China)
Page	pp. 582 - 587
Keyword	voltage island, placement, low power
Abstract	In this paper we propose a method for standard cell placement with support for dual supply voltages, aiming to reduce total power under timing constraints and to implement voltage islands with minimal overheads. The method begins with timing and power driven coarse placement, followed by a few iterations between voltage assignment and placement refinement to generate voltage islands. Several techniques, including timing and power driven net weighting, seed growth based voltage assignment, and soft clustering strategy for placement refinements are employed in our implementation. Experimental results on a set of MCNC benchmarks show that our approach is able to produce feasible placement for dual-Vdd designs and significantly reduce total power with a wirelength increase within 14% compared to a power and timing driven placer without voltage islands.

Session 6B Advanced Memory and Processor Architectures for MPSoC (16:00 - 18:05)
Location: Room 413
Chair(s): Soonhoi Ha (Seoul National University, Republic of Korea), Youn-Long Lin (National Tsing Hua University, Taiwan)

6B-1 (Time: 16:00 - 16:25)

Title	Reusable Component IP Design using Refinement-based Design Environment
Author	*Sanggyu Park, Sang-Yong Yoon, Soo-Ik Chae (Seoul National University, Republic of Korea)
Page	pp. 588 - 593
Keyword	Reuse, SoC, Refinement, Platform, IP
Abstract	We propose a method for enhancing the reusability of component IPs by exploiting our refinement-based design environment, SoCBase-DE. In this method, a computation designer captures the computation part of the function, and then a system designer constructs the communication part of the function to best fit the system using SoCBase-DE. This method allows the reuse-centric design environment to be more effective and attractive. We evaluated this method on the design of an H.264 Decoder System.

6B-2 (Time: 16:25 - 16:50)

Title	An Interface-Circuit Synthesis Method with Configurable Processor Core in IP-Based SoC Designs
Author	*Shunitsu Kohara, Naoki Tomono, Jumpei Uchida, Yuichiro Miyaoka, Nozomu Togawa, Masao Yanagisawa, Tatsuo Ohtsuki (Waseda University, Japan)
Page	pp. 594 - 599
Keyword	interface circuit, IP-based design, hardware / software co-synthesis, configurable processor core
Abstract	In SoC designs, efficient communication between the hardware IPs and the on-chip processor becomes very important; however, the interface is usually affected by the processor core specification. Thus, in this paper, we focus on developing an efficient interface circuit architecture for communication between the on-chip processor and embedded hardware IP cores. We also propose a method to synthesize it. Experimental results show that our method can obtain optimal interface circuits and works well in the design of an MPEG-4 encoding application.

6B-3 (Time: 16:50 - 17:15)

Title	A Real-Time and Bandwidth Guaranteed Arbitration Algorithm for SoC Bus Communication
Author	Chien-Hua Chen, *Geeng-Wei Lee, Juinn-Dar Huang, Jing-Yang Jou (Department of Electronics Engineering, National Chiao Tung University, Taiwan)
Page	pp. 600 - 605
Keyword	SoC, arbitration, bus, real-time
Abstract	In shared SoC bus systems, arbiters are usually adopted to solve bus contentions with various kinds of arbitration algorithms. We propose an arbitration algorithm, RT_lottery, which is designed to meet both hard real-time and bandwidth requirements. For fast evaluation and exploration, we use high abstract-level models in our system simulation environment to generate parameters for our configurable arbiter. The experimental results show that RT_lottery can meet all hard real-time requirements and perform very well in bandwidth allocation. The results also show that RT_lottery outperforms several commonly-used arbitration algorithms today.
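
Illustration (not from the paper): the abstract does not describe how RT_lottery works internally. The sketch below shows only the generic idea of ticket-based lottery arbitration, where a requesting master wins the bus with probability proportional to its ticket count. The deadline override, the `urgent_slack` threshold, and all names are hypothetical additions to show how a hard real-time rule could sit in front of the lottery; they are not the authors' algorithm.

```python
import random

def arbitrate(requests, tickets, slack, rng, urgent_slack=1):
    """Pick one bus master to grant.

    requests : set of master names currently requesting the bus
    tickets  : dict master -> ticket count (bandwidth share)
    slack    : dict master -> cycles left before a hard deadline
               (masters without a deadline are simply absent)
    rng      : random.Random instance used for the lottery draw
    """
    if not requests:
        return None
    urgent = [m for m in requests if slack.get(m, float("inf")) <= urgent_slack]
    if urgent:
        # Hypothetical real-time override: least slack first, name breaks ties.
        return min(urgent, key=lambda m: (slack[m], m))
    # Plain lottery: win probability is proportional to ticket count.
    masters = sorted(requests)
    weights = [tickets[m] for m in masters]
    return rng.choices(masters, weights=weights, k=1)[0]

rng = random.Random(0)
tickets = {"cpu": 1, "dma": 3}
grants = [arbitrate({"cpu", "dma"}, tickets, {}, rng) for _ in range(10000)]
dma_share = grants.count("dma") / len(grants)  # close to 3 / (1 + 3)
```

With no deadlines pending, the long-run share of grants tracks the ticket ratio, which is how a lottery arbiter allocates bandwidth; the override path is deterministic.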

6B-4 (Time: 17:15 - 17:40)

Title	Hierarchical Memory Size Estimation for Loop Fusion and Loop Shifting in Data-Dominated Applications
Author	*Qubo Hu (University of Trondheim, Norway), Arnout Vandecappelle, Martin Palkovic (IMEC, Belgium), Per Gunnar Kjeldsberg (University of Trondheim, Norway), Erik Brockmeyer, Francky Catthoor (IMEC, Belgium)
Page	pp. 606 - 611
Keyword	low power design, memory estimation, loop transformations, memory architecture, SPM
Abstract	Loop fusion and loop shifting are important transformations for improving data locality to reduce the number of costly accesses to off-chip memories. Since exploring the exact platform mapping for all the loop transformation alternatives is a time consuming process, heuristics steered by improved data locality are generally used. However, pure locality estimates do not sufficiently take into account the hierarchy of the memory platform. This paper presents a fast, incremental technique for hierarchical memory size requirement estimation for loop fusion and loop shifting at the early loop transformations design stage. As the exact memory platform is often not yet defined at this stage, we propose a platform-independent approach which reports the Pareto-optimal trade-off points for scratch-pad memory size and off-chip memory accesses. The estimation comes very close to the actual platform mapping. Experiments on realistic test-vehicles confirm that. It helps the designer or a tool to find the interesting loop transformations that should then be investigated in more depth afterward.

6B-5 (Time: 17:40 - 18:05)

Title	A Novel Instruction Scratchpad Memory Optimization Method based on Concomitance Metric
Author	Andhi Janapsatya, Aleksandar Ignjatovic, *Sri Parameswaran (The University of New South Wales, Australia)
Page	pp. 612 - 617
Keyword	Scratchpad Memory, Embedded system, Low Power Design
Abstract	Scratchpad memory has been introduced as a replacement for cache memory as it improves performance, and significantly reduces energy consumption of the memory hierarchy of certain embedded systems. This paper deals with optimization of the instruction memory scratchpad based on a novel methodology that uses a metric which we call concomitance. This metric is used to find basic blocks which are executed frequently and in close proximity in time. Once such blocks are found, they are copied into the scratchpad memory at appropriate times; this is achieved using a special instruction inserted into the code at appropriate places. For a set of benchmarks taken from Mediabench, our scratchpad system reduced energy consumption by an average of 49.4% compared to the cache system, and by 30.7% on average when compared to the state of the art scratchpad system, while improving the overall performance. Compared to the state of the art method, the number of instructions copied into the scratchpad memory from the main memory is reduced by 90.4%.

Session 6C New Routing Techniques (16:00 - 18:05)
Location: Room 414+415
Chair(s): Ting-Chi Wang (National Tsing Hua University, Taiwan), Vijay Pitchumani (Intel, United States)

6C-1 (Time: 16:00 - 16:25)

Title	DraXRouter: Global Routing in X-Architecture with Dynamic Resource Assignment
Author	*Zhen Cao, Tong Jing (Computer Science & Technology Department, Tsinghua University, China), Yu Hu, Yiyu Shi (University of California, Los Angeles, United States), Xianlong Hong (Computer Science & Technology Department, Tsinghua University, China), Xiaodong Hu, Guiying Yan (Institute of Applied Mathematics, Chinese Academy of Sciences, China)
Page	pp. 618 - 623
Keyword	Global routing, X-Architecture, Liquid routing, Steiner tree
Abstract	In recent years, the X-Architecture has been introduced to obtain better performance in integrated circuit physical design. This paper reformulates the global routing problem in X-Architecture under the liquid routing model. Then, a dynamic resource assignment (Dra) method is presented to reduce potential vias. Finally, a global router called DraXRouter is designed, in which we adopt a dynamic-tabulist-based tree construction algorithm and a stochastic optimization strategy to obtain high-quality routing solutions. Tested on ISPD'98 benchmarks, DraXRouter achieves better routing performance compared with two recent global routers.

6C-2 (Time: 16:25 - 16:50)

Title	Diagonal Routing in High Performance Microprocessor Design
Author	Noriyuki Ito, Hideaki Katagiri, Ryoichi Yamashita, Hiroshi Ikeda, Hiroyuki Sugiyama, *Hiroaki Komatsu, Yoshiyasu Tanamura, Akihiko Yoshitake, Kazuhiro Nonomura, Kinya Ishizaka, Hiroaki Adachi, Yutaka Mori, Yutaka Isoda, Yaroku Sugiyama (Fujitsu Limited, Japan)
Page	pp. 624 - 629
Keyword	microprocessor, diagonal routing
Abstract	This paper presents a diagonal routing method which is applied to an actual microprocessor prototype chip. While including the layout functions for the conventional Manhattan routing, a new diagonal routing capability is added as one of the routing functions. With this enhancement, diagonal routing becomes an additional strategy for improving delays of critical paths in the microprocessor design. The prototype chip proved that our method was effective in reducing the total net length and improving path delays.

6C-3 (Time: 16:50 - 17:15)

Title	CDCTree: Novel Obstacle-Avoiding Routing Tree Construction based on Current Driven Circuit Model
Author	Yiyu Shi (University of California, Los Angeles, United States), Tong Jing (Tsinghua University, China), *Lei He (University of California, Los Angeles, United States), Zhe Feng, Xianlong Hong (Tsinghua University, China)
Page	pp. 630 - 635
Keyword	Steiner tree, obstacle, routing, escape graph
Abstract	Routing tree construction is a fundamental problem in modern VLSI design. In this paper we propose CDCTree, a heuristic algorithm to construct an Obstacle-Avoiding Rectilinear Steiner Minimum Tree (OARSMT). CDCTree is based on the current driven circuit (CDC) model mapped from an escape graph. The circuit structure comes from the topology of the escape graph, with each edge replaced by a resistor indicating the wirelength of that edge. By performing DC analysis on the circuit and selecting the edges according to the current distribution to construct an OARSMT, the resulting tree has short wirelength. The algorithm has been implemented and tested on cases of different scales and with different shapes of obstacles. Experiments show that CDCTree can achieve shorter wirelength than the existing best algorithm, An-OARSMan, when the terminal number of a net is less than 50.
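
Illustration (not from the paper): the abstract says each escape-graph edge becomes a resistor whose value reflects its wirelength, and that DC analysis yields a current distribution used to select edges. The sketch below performs that DC analysis for a two-terminal net with standard nodal analysis, assuming resistance equal to edge length and a 1 A current injected at one terminal. The function name and the tiny example graph are assumptions; the paper's edge-selection rule and its handling of multi-terminal nets are not given in the abstract and are not reproduced here.

```python
def edge_currents(nodes, edges, source, sink):
    """Return {(u, v): current} for 1 A injected at source and drawn at sink.

    edges is a list of (u, v, length); the graph must be connected.
    """
    free = [n for n in nodes if n != sink]      # sink is the 0 V reference
    pos = {n: i for i, n in enumerate(free)}
    n = len(free)
    # Augmented nodal matrix [G | I]; G is the grounded conductance matrix.
    a = [[0.0] * (n + 1) for _ in range(n)]
    for u, v, length in edges:
        g = 1.0 / length                        # assumed: resistance = wirelength
        for p, q in ((u, v), (v, u)):
            if p in pos:
                a[pos[p]][pos[p]] += g
                if q in pos:
                    a[pos[p]][pos[q]] -= g
    a[pos[source]][n] = 1.0
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                for c in range(col, n + 1):
                    a[r][c] -= f * a[col][c]
    volt = {m: a[pos[m]][n] / a[pos[m]][pos[m]] for m in free}
    volt[sink] = 0.0
    return {(u, v): (volt[u] - volt[v]) / length for u, v, length in edges}

# Two detours around a hypothetical obstacle: total lengths 2 and 4.
nodes = ["s", "a", "b", "t"]
edges = [("s", "a", 1.0), ("a", "t", 1.0), ("s", "b", 2.0), ("b", "t", 2.0)]
currents = edge_currents(nodes, edges, "s", "t")
ranked = sorted(currents, key=lambda e: -abs(currents[e]))
```

In this example the shorter detour carries two thirds of the current, so ranking edges by current magnitude puts its edges first, which matches the intuition that high-current edges lie on short connections.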

6C-4 (Time: 17:15 - 17:40)

Title	A Novel Framework for Multilevel Full-Chip Gridless Routing
Author	*Tai-Chen Chen, Yao-Wen Chang (National Taiwan University, Taiwan), Shyh-Chang Lin (SpringSoft, Inc., Taiwan)
Page	pp. 636 - 641
Keyword	framework, gridless, multilevel, routing, full-chip
Abstract	Due to its great flexibility, gridless routing is desirable for nanometer circuit designs that use variable wire widths and spacings. Nevertheless, it is much more difficult than grid-based routing because of its larger solution space. In this paper, we present a novel "V-shaped" multilevel framework (called VMF) for full-chip gridless routing. Unlike the traditional "Lambda-shaped" multilevel framework (inaccurately called the "V-cycle" framework in the literature), our VMF works in the V-shaped manner: top-down uncoarsening followed by bottom-up coarsening. Based on the novel framework, we develop a multilevel full-chip gridless router (called VMGR) for large-scale circuit designs. The top-down uncoarsening stage of VMGR starts from the coarsest regions and then proceeds down to the finest ones level by level; at each level, it performs global pattern routing and detailed routing for local nets and then estimates the routing resource for the next level. Then, the bottom-up coarsening stage performs global maze routing and detailed routing to reroute failed connections and refine the solution level by level from the finest level to the coarsest one. We employ a dynamic congestion map to guide the global routing at all stages and propose a new cost function for congestion control. Experimental results show that VMGR achieves the best routability among all published gridless routers based on a set of commonly used MCNC benchmarks. Besides, VMGR can obtain significantly less wirelength, smaller critical path delay, and smaller average net delay than the previous works. In particular, VMF is general and thus can readily be applied to other problems.

6C-5 (Time: 17:40 - 18:05)

Title	Monotonic Parallel and Orthogonal Routing for Single-Layer Ball Grid Array Packages
Author	*Yoichi Tomioka, Atsushi Takahashi (Department of Communications and Integrated Systems, Tokyo Institute of Technology, Japan)
Page	pp. 642 - 647
Keyword	Ball grid array, Single-layer, Monotonic
Abstract	In this paper, we give the necessary and sufficient condition that all nets can be connected by monotonic routes when a net consists of a finger and a ball and fingers are on the two parallel boundaries of the Ball Grid Array package, and propose a monotonic routing method based on this condition. Moreover, we give a necessary condition and a sufficient condition when fingers are on the two orthogonal boundaries, and propose a monotonic routing method based on the necessary condition.

Session 6D Designers' Forum Panel: (16:30 - 18:00)
Location: Small Auditorium, 5F

6D-1 (Time: 16:30 - 18:00)

Title	Functional Verification -now and future-
Author	Organizer: Haruyuki Tago (Senior Manager, TOSHIBA, Japan), Moderator: Yoshio Masubuchi (Assistant to General Manager, TOSHIBA, Japan), Panelists: Sanjay Gupta (IBM, United States), Michael Stellfox (Group Director, Cadence, United States), Tetsuji Sumioka (Senior Manager, Sony, Japan), Sunao Torii (Principal Researcher, NEC, Japan)

Friday January 27, 2006

Session 3K Keynote Address III (9:00 - 10:00)
Location: Small Auditorium, 5F
Chair(s): Fumiyasu Hirose (Cadence, Japan)

3K-1 (Time: 9:00 - 10:00)

Title	Effective Platform-based Development for Large-scale Systems Design
Author	Yukichi Niwa (Senior Advisory Director, Group Executive of Platform Technology Development Headquarters, CANON INC., Japan)
Abstract	Platform-based development (PBD) aims to continuously add new value in both incremental development and product-planning-based development. By adding new technology to previously existing technology and by storing the technologies as reusable assets, PBD enables high quality, low cost, and short turnaround time development. Furthermore, PBD allows target-oriented development where we can select and concentrate technology to eliminate unnecessary development. In order to execute effective PBD, it is important to introduce a firm layer structuring of digital/analog technology so that individual professionals in each independent layer can maximize their efficiency without any restraint. The act of layer structuring is nothing but the architectural design of the development methodology. Thus, it's no exaggeration to say that success in business profitability management directly depends on the presence of a good architect. The important thing in the next stage is to optimize the design process by investing in computer resources. For example, it is necessary to thoroughly adapt simulation technology to the development of high quality imaging technology, embedded system (hardware/software) technology, or communication technology. The quantitative evaluation from the early design phase and the workflow based on the accumulated design know-how (IP, methodology) will accelerate technology innovation and strengthen the platform even further. Eventually, management can directly obtain an absolute advantage in large-scale system design effectiveness.

Session 7A Minimization of Test Cost and Power (10:15 - 12:20)
Location: Room 411+412
Chair(s): Seiji Kajihara (Kyushu Institute of Technology, Japan), Satoshi Ohtake (NAIST, Japan)

7A-1 (Time: 10:15 - 10:40)

Title	A Routability Constrained Scan Chain Ordering Technique for Test Power Reduction
Author	*Xuan-Lun Huang (Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan), Jiun-Lang Huang (Dept. of Electrical Engineering, National Taiwan University, Taiwan)
Page	pp. 648 - 652
Keyword	design-for-testability, test power reduction, scan chain
Abstract	In this paper, we propose a novel scan-chain ordering technique for test power optimization under user-specified routability constraints. Compared to previous methods, our technique improves in that (1) it allows the user to explicitly set the routing constraints, and (2) the achievable power reduction is much less sensitive to the routing constraints. The proposed method is applied to six industrial designs. The achievable power reduction is in the range of 37-48% without violating any user-specified routing constraint.

7A-2 (Time: 10:40 - 11:05)

Title	FCSCAN: An Efficient Multiscan-based Test Compression Technique for Test Cost Reduction
Author	*Youhua Shi, Nozomu Togawa, Shinji Kimura, Masao Yanagisawa, Tatsuo Ohtsuki (Waseda University, Japan)
Page	pp. 653 - 658
Keyword	DFT, multiscan, test channel, test data compression
Abstract	This paper proposes a new multiscan-based test input data compression technique by employing a Fan-out Compression Scan Architecture (FCSCAN) for test cost reduction. The basic idea of FCSCAN is to target the minority specified bits (either 1 or 0) in scan slices for compression. Due to the low specified bit density in the test cube set, FCSCAN can significantly reduce input test data volume and the number of required test channels so as to reduce test cost. The FCSCAN technique is easy to implement with small hardware overhead and does not need any special ATPG for test generation. In addition, based on the theoretical compression efficiency analysis, improved procedures are also proposed for FCSCAN to achieve further compression. Experimental results on both benchmark circuits and one real industrial design indicate that a drastic reduction in test cost can indeed be achieved.

7A-3 (Time: 11:05 - 11:30)

Title	Compaction of Pass/Fail-based Diagnostic Test Vectors for Combinational and Sequential Circuits
Author	*Yoshinobu Higami (Ehime University, Japan), Kewal K. Saluja (University of Wisconsin-Madison, United States), Hiroshi Takahashi, Shin-ya Kobayashi, Yuzo Takamatsu (Ehime University, Japan)
Page	pp. 659 - 664
Keyword	Diagnosis, Test compaction, Combinational circuit, Sequential circuit
Abstract	Substantial attention is being paid to the fault diagnosis problem in recent test literature. Yet, the compaction of test vectors for fault diagnosis is little explored. The compaction of diagnostic test vectors must take care of all fault pairs that need to be distinguished by a given test vector set. Clearly, the number of fault pairs is much larger than the number of faults thus making this problem very difficult and challenging. The key contributions of this paper are: 1) to use techniques for reducing the size of fault pairs to be considered at a time, 2) to use novel variants of the fault distinguishing table method for combinational circuits and reverse order restoration method for sequential circuits, and 3) to introduce heuristics to manage the space complexity of considering all fault pairs for large circuits. Finally, the experimental results for ISCAS benchmark circuits are presented to demonstrate the effectiveness of the proposed methods.

7A-4 (Time: 11:30 - 11:55)

Title	Low-Overhead Design of Soft-Error-Tolerant Scan Flip-Flops with Enhanced-Scan Capability
Author	Ashish Goel (Purdue University, United States), Swarup Bhunia (Case Western Reserve University, United States), Hamid Mahmoodi (San Francisco State University, United States), *Kaushik Roy (Purdue University, United States)
Page	pp. 665 - 670
Keyword	soft error, flip-flop, Enhanced Scan
Abstract	With technology scaling, soft error resilience is becoming a major concern in circuit design. This paper presents a class of low-overhead flip-flops suitable for soft error detection and correction. The proposed design reuses logic elements typically available in a standard-cell implementation of a flip-flop to reduce hardware overhead. We demonstrate that the proposed flip-flops are also suitable for enhanced scan based delay fault testing, which allows arbitrary two-pattern test application for the best combinational path testability. The proposed flip-flops show an average power reduction of 16% and area improvement of 17% compared to the best alternative techniques with no additional delay overhead.

7A-5 (Time: 11:55 - 12:20)

Title	A Memory Grouping Method for Sharing Memory BIST Logic
Author	*Masahide Miyazaki, Tomokazu Yoneda, Hideo Fujiwara (Nara Institute of Science and Technology, Japan)
Page	pp. 671 - 676
Keyword	memory, BIST, wrapper, sharing, scheduling
Abstract	With the increasing demand for rich functionality in an SoC, SoCs are designed with hundreds of small memories of different sizes and frequencies. If memory BIST logic were individually added to each of the many different types of small memory, the area overhead would be very large. To reduce the area overhead of memory BIST, memory BIST logic sharing is very important. This paper proposes a memory grouping method for memory BIST logic sharing. A memory grouping problem is formulated and an algorithm to solve the problem is proposed. Experimental results show that the proposed method reduces the area of the memory BIST wrapper by up to 40.55%. It is also shown that selecting between two types of connection methods reduces area more than using a single connection method.

Session 7B Substrate Coupling and Analog Synthesis (10:15 - 12:20)
Location: Room 413
Chair(s): Jaijeet Roychowdhury (University of Minnesota, United States), Tomohisa Kimura (Toshiba, Japan)

7B-1 (Time: 10:15 - 10:40)

Title	Equivalent Circuit Modeling of Guard Ring Structures for Evaluation of Substrate Crosstalk Isolation
Author	*Daisuke Kosaka, Makoto Nagata (Kobe University, Japan)
Page	pp. 677 - 682
Keyword	substrate crosstalk, F-matrix computation, deep n-well, transmission characteristic
Abstract	A substrate-coupling equivalent circuit can be derived for an arbitrary guard ring test structure by way of F-matrix computation. The derived netlist represents a unified impedance network among multiple sites on a chip surface and allows circuit simulation for evaluation of isolation effects provided by guard rings. The geometry dependency of guard ring effects is attributable to the layout patterns of a test structure, such as the area of a guard ring and its distance from the circuit to be isolated. In addition, structural dependency arises from vertical impurity concentrations such as p+, n+, and deep n-well, which are generally available in a deep-submicron CMOS technology. The proposed simulation-based prototyping technique for guard ring structures can include all these dependencies and is thus very helpful for establishing an isolation strategy against substrate coupling in a given technology at an early stage of SoC development.

7B-2 (Time: 10:40 - 11:05)

Title	A New Boundary Element Method for Accurate Modeling of Lossy Substrates with Arbitrary Doping Profiles
Author	*Xiren Wang, Wenjian Yu, Zeyi Wang (EDA Lab., Dept. of Computer Science & Technology, Tsinghua University, China)
Page	pp. 683 - 688
Keyword	lossy substrate modeling, accurate and efficient, versatile for arbitrary processes, direct boundary element method
Abstract	It is important to model substrate couplings for SoC/mixed-signal circuit designs. After introducing the continuation equation of full current in lossy substrates, we present a new direct boundary element method (DBEM), which can handle substrates with arbitrary doping profiles. Three techniques speed up the DBEM remarkably: reusing coefficient matrices for multiple-frequency calculation, condensing the linear system, and sparsifying the coefficient matrix. Numerical experiments illustrate that DBEM has high accuracy and high efficiency, and is versatile for arbitrary doping profiles.

7B-3 (Time: 11:05 - 11:30)

Title	Parasitics Extraction Involving 3-D Conductors based on Multi-layered Green's Function
Author	Zuochang Ye, *Zhiping Yu (Institute of Microelectronics, Tsinghua University, China)
Page	pp. 689 - 693
Keyword	Capacitance, Extraction, Multi-layered, Green Function
Abstract	An efficient algorithm for three-dimensional (3-D) capacitance extraction on multi-layered and lossy substrates is presented. The new algorithm represents a major improvement over the quasi-3D approach used in Green's function based solvers and takes into consideration the side-wall effects of the contacts.

7B-4 (Time: 11:30 - 11:55)

Title	Signal-Path Driven Partition and Placement for Analog Circuit
Author	*Di Long, Xianlong Hong, Sheqin Dong (Tsinghua University, China)
Page	pp. 694 - 699
Keyword	signal-path, analog placement, layout automation, circuit partition, symmetry constraint
Abstract	This paper advances a new methodology based on signal-path information to resolve the problem of device-level placement for analog layout. This methodology is mainly based on three observations: the hierarchical design style of analog circuits, the structural features of a circuit along its signal path, and the requirements of matching/symmetry constraints and parasitic reduction. Hierarchical design divides the whole analog circuit into a core circuit and a bias circuit, so the algorithm is designed as two independent steps: the core circuit is placed first, and then the bias circuit. The structural features of the circuit along its signal path and the matching/symmetry constraints decide the placement pattern of the core circuit. The reduction of parasitics requires the algorithm to select the optimal variants to realize the placement. Experimental results demonstrate that this algorithm can generate a compact layout with high performance and that it is universal and effective.

7B-5 (Time: 11:55 - 12:20)

Title	An Approach to Topology Synthesis of Analog Circuits Using Hierarchical Blocks and Symbolic Analysis
Author	*Xiaoying Wang, Lars Hedrich (Department of Computer Science, University of Frankfurt, Germany)
Page	pp. 700 - 705
Keyword	analog circuit design, circuit synthesis, symbolic analysis
Abstract	This paper presents a method of design automation for analog circuits, focusing on topology generation and quick performance evaluation. First we describe mechanisms to generate circuit topologies with hierarchical blocks. Those blocks are specialized by adding terminal information. The connection between blocks is in compliance with a set of synthesis rules, which are extracted from typical schematics in the literature. Symbolic analysis has been used to select an appropriate topology quickly.

Session 7C Statistical and Yield Analysis (10:15 - 12:20)
Location: Room 414+415
Chair(s): Hiroo Masuda (STARC, Japan), Seijiro Moriyama (PDF Solutions, Japan)

7C-1 (Time: 10:15 - 10:40)

Title	Statistical Corner Conditions of Interconnect Delay (Corner LPE Specifications)
Author	*Kenta Yamada, Noriaki Oda (NEC Electronics Corporation, Japan)
Page	pp. 706 - 711
Keyword	statistical, corner, delay, LPE, interconnect
Abstract	Timing closure in LSI design is becoming more and more difficult, yet the conventional interconnect RC extraction method has over-margins caused by its corner condition settings. In this paper, statistical corner conditions using the independence of variations between process parameters and between interconnect layers are proposed. As a result, the fast-to-slow guardband decreases by half on average, compared to the conventional method. The proposed method is ready for implementation in LPE tools.

7C-2 (Time: 10:40 - 11:05)

Title	Speed Binning Aware Design Methodology to Improve Profit under Parameter Variations
Author	Animesh Datta (Purdue University, United States), Swarup Bhunia (Case Western Reserve University, United States), Jung Hwan Choi, Saibal Mukhopadhyay, *Kaushik Roy (Purdue University, United States)
Page	pp. 712 - 717
Keyword	Gate sizing, Design for profit, Frequency-binning, process variation
Abstract	Designing high-performance systems with high yield under parameter variations has raised serious design challenges in nanometer technologies. In this paper, we propose a profit-aware yield model, based on which we present a statistical design methodology to improve profit of a design considering frequency binning and product price profile. A low-complexity sensitivity-based gate sizing algorithm is developed to improve the profitability of design over an initial yield-optimized design. We also propose an algorithm to determine optimal bin boundaries for maximizing profit with frequency binning. Finally, we present an integrated design methodology for simultaneous sizing and bin placement to enhance profit under an area constraint. Experiments on a set of ISCAS85 benchmarks show up to 26% (36%) improvement in profit for fixed bin (for simultaneous sizing and bin placement) with three frequency bins considering both leakage and delay bounds compared to a design optimized for 90% yield at iso-area.
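
The interaction between bin boundaries and a price profile can be sketched as follows. This is a minimal illustration and not the authors' yield model: `binned_revenue`, the delay samples, the boundaries, and the prices are all hypothetical, and revenue stands in for profit.

```python
from bisect import bisect_left

def binned_revenue(delays, boundaries, prices):
    """Sum the price earned by each chip under frequency binning.

    boundaries: ascending delay limits, fastest bin first.
    prices[i]:  price of a chip whose delay is <= boundaries[i]
                (and above the previous limit).
    Chips slower than the last boundary are discarded.
    """
    total = 0.0
    for d in delays:
        i = bisect_left(boundaries, d)  # first bin whose limit is >= d
        if i < len(boundaries):
            total += prices[i]
    return total

# Made-up Monte Carlo delay samples (ns) and a three-bin price profile.
delays = [0.9, 1.0, 1.1, 1.4, 1.6]
print(binned_revenue(delays, [1.0, 1.2, 1.5], [100.0, 70.0, 40.0]))
```

Moving a boundary changes which chips land in which bin, so a sizing or bin-placement search can use a function of this shape as its objective.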

7C-3 (Time: 11:05 - 11:30)

Title	Yield-Area Optimizations of Digital Circuits Using Non-dominated Sorting Genetic Algorithm (YOGA)
Author	Vineet Agarwal, *Janet Wang (University of Arizona, United States)
Page	pp. 718 - 723
Keyword	Yield Optimization, Genetic Algorithm, Gate Sizing
Abstract	With shrinking technology, the timing variation of a digital circuit is becoming the most important factor in designing a functionally reliable circuit. Gate sizing has emerged as one of the efficient ways to mitigate the yield deterioration due to manufacturing variations. In the past, single-objective optimization techniques have been used to optimize the timing variation, whereas multi-objective optimization techniques can provide a more promising approach to designing the circuit. We propose a new algorithm called YOGA, based on a multi-objective optimization technique called Non-dominated Sorting Genetic Algorithm (NSGA). YOGA optimizes a circuit in multiple domains and provides the user with a Pareto-optimal set of solutions which are distributed all over the optimal design spectrum, giving users the flexibility to choose the best fitting solution for their requirements. YOGA overcomes the disadvantages of traditional optimization techniques, while providing solutions even within very stringent bounds.
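
The Pareto-optimal set mentioned above is the first front produced by non-dominated sorting. A minimal sketch of that step, assuming all objectives are minimized (for example area and yield loss); the function names and sample points are hypothetical and this is not the YOGA implementation:

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Return the points that no other point dominates (the first front)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up (area, yield_loss) pairs for candidate gate sizings.
candidates = [(1, 5), (2, 3), (3, 4), (4, 1), (2, 6)]
print(pareto_front(candidates))
```

This quadratic-time filter is enough for small populations; a full NSGA-style loop would repeat it on the remaining points to rank later fronts.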

7C-4 (Time: 11:30 - 11:55)

Title	A Probabilistic Analysis of Pipelined Global Interconnect Under Process Variations
Author	*Navneeth Kankani, Vineet Agarwal, Janet M Wang (University of Arizona, United States)
Page	pp. 724 - 729
Keyword	Interconnect, Pipelining, Reliability, Scheduling, ANOVA
Abstract	The main aim of this paper is to perform a reliability-based analysis of a shared latch inserted global interconnect under uncertainty. We first put forward a novel delay metric named CPUA for estimating the interconnect delay probability density function considering process variations. Without considerable loss in accuracy, CPUA can achieve high computational efficiency even in a large space of random variables. We then propose a comprehensive probabilistic methodology for sampling transfers on a shared latch inserted global interconnect that greatly improves the reliability of the interconnect. Improvements of up to 125% in reliability are observed when compared to a deterministic sampling approach. It is also shown that a dual phase clocking scheme for pipelined global interconnect is able to meet more stringent timing constraints due to its lower latency.

7C-5 (Time: 11:55 - 12:20)

Title	Yield-Preferred Via Insertion Based on Novel Geotopological Technology
Author	Fangyi Luo (University of California, Santa Cruz, United States), *Yongbo Jia (Nannor Technologies, Inc., United States), Wayne Wei-Ming Dai (University of California, Santa Cruz, United States)
Page	pp. 730 - 735
Keyword	yield, design-for-manufacturability, interconnect
Abstract	Yield-preferred via insertion is an effective method to reduce the yield loss caused by via failures. The existing methods to apply redundant-cut vias in metal layers are neither efficient nor adequate. In this paper, we present an effective and efficient yield-preferred via insertion method based on a novel geotopological layout platform, GEOTOP. Our method chooses the most yield-favored via candidate and inserts it into the layout without causing any design rule violations. Experiments with real industry designs show that our method can achieve a very high rate of yield-preferred vias without increasing the design die size, within acceptable running time.

Session 7D Special Session: H.264/AVC Design Challenges and Solutions (10:15 - 12:20)
Location: Room 416+417
Chair(s): Wayne Wolf (Princeton University, United States)

7D-1 (Time: 10:15 - 10:35)

Title	Introduction to H.264 Advanced Video Coding
Author	Jian-Wen Chen, Chao-Yang Kao, *Youn-Long Lin (National Tsing Hua University, Taiwan)
Page	pp. 736 - 741
Keyword	Video Coding, H.264, Video Compression, MPEG-4 Part 10, Advanced Video Coding
Abstract	We give a tutorial on video coding principles and standards with emphasis on the latest technology called H.264 or MPEG-4 Part 10. We describe a basic method called block-based hybrid coding employed by most video coding standards. We use graphical illustrations to show the functionality. This paper is suitable for those who are interested in implementing a video codec in embedded software, in pure hardwired logic, or in a combination of both.

7D-2 (Time: 10:35 - 10:55)

Title	Algorithms and DSP Implementation of H.264/AVC
Author	Hung-Chih Lin, Yu-Jen Wang, Kai-Ting Cheng, Shang-Yu Yeh, Wei-Nien Chen, Chia-Yang Tsai, Tian-Sheuan Chang, *Hsueh-Ming Hang (National Chiao-Tung University, Taiwan)
Page	pp. 742 - 749
Keyword	H.264, AVC, video coding
Abstract	This survey paper intends to provide a comprehensive coverage of the techniques that are pertinent to the processor-based implementation of H.264/AVC video codec, particularly on DSP. Most of this paper is devoted to the computationally efficient algorithms, or the fast algorithms. Fast algorithms for motion estimation, intra-prediction and mode decision are described to reduce the computational complexity. In addition, in order to port the H.264/AVC codec to DSP, we also outline the basic principles of DSP code optimization.

7D-3 (Time: 10:55 - 11:15)

Title	Hardware Architecture Design of an H.264/AVC Video Codec
Author	Tung-Chien Chen, Chung-Jr Lian, *Liang-Gee Chen (National Taiwan University, Taiwan)
Page	pp. 750 - 757
Keyword	H.264/AVC, Video Codec, VLSI architecture, Video coding, Motion estimation
Abstract	H.264/AVC is the latest video coding standard. It significantly outperforms the previous video coding standards, but the extraordinarily high computational complexity and memory access requirements make a hardwired codec solution a tough job. This paper describes the design methodology for an H.264/AVC video codec. The system architecture and scheduling will be addressed. The design considerations and optimizations for its significant modules, including a bandwidth-optimized motion compensation engine, a reconfigurable intra predictor generator, and low-bandwidth parallel integer motion estimation, will be discussed. Due to the complex, sequential, and highly data-dependent characteristics of all essential algorithms in H.264/AVC, not only a pipeline structure but also an efficient memory hierarchy is required. A design case with a hybrid task pipelining scheme, a balanced schedule with block-level, MB-level, and frame-level pipelining, will be presented. By combining many bandwidth reduction techniques and data reuse schemes, a very efficient architecture and implementation for a platform-based system is demonstrated by the prototype chips.

7D-4 (Time: 11:15 - 11:35)

Title	ASIP Approach for Implementation of H.264/AVC
Author	Sung Dae Kim, Jeong Hoo Lee, Chung Jin Hyun, *Myung Hoon Sunwoo (Ajou University, Republic of Korea)
Page	pp. 758 - 764
Keyword	H.264/AVC, ASIP, ASDSP
Abstract	This paper introduces an Application-Specific Instruction Set Processor (ASIP) approach for implementation of H.264/AVC. The proposed ASIP has special instructions for intra prediction, deblocking filter, integer transform, etc. The proposed ASIP also has hardware accelerators for inter prediction and entropy coding. Performance comparisons show a significant improvement compared with existing DSPs. The proposed hardware accelerators are small and can support real-time video processing. Moreover, the proposed ASIP can support various multimedia standards. The results indicate that the ASIP approach is one of the promising solutions for H.264/AVC.

7D-5 (Time: 11:35 - 12:15)

Title	Panel Discussion
Author	Youn-Long Lin (National Tsing Hua University, Taiwan), Hsueh-Ming Hang (National Chiao-Tung University, Taiwan), Liang-Gee Chen (National Taiwan University, Taiwan), Myung Hoon Sunwoo (Ajou University, Republic of Korea)

Session 8A Floorplanning (13:30 - 15:35)
Location: Room 411+412
Chair(s): Yao-Wen Chang (National Taiwan University, Taiwan), Shigetoshi Nakatake (University of Kitakyushu, Japan)

8A-1 (Time: 13:30 - 13:55)

Title	Fast Substrate Noise-Aware Floorplanning with Preference Directed Graph for Mixed-Signal SOCs
Author	*Minsik Cho, Hongjoong Shin, David Z. Pan (University of Texas at Austin, United States)
Page	pp. 765 - 770
Keyword	Floorplanning, substrate noise, mixed-signal
Abstract	In this paper, we introduce a novel substrate noise estimation technique during early floorplanning, based on the concept of Block Preference Directed Graph (BPDG) and the classic Sequence Pair (SP) floorplan representation. Given a set of analog and digital blocks, the BPDG is constructed based on their inherent noise characteristics to capture their preferred relative orders for substrate noise minimization. For each sequence pair generated during floorplanning evaluation, we can measure its violation against the BPDG very efficiently. We observe that simply counting the number of violations obtained in this manner correlates remarkably well with accurate but computation-intensive substrate noise modeling. Thus, our BPDG-based model has high fidelity to guide substrate noise-aware floorplanning and layout optimization, which has become a growing concern for mixed-signal/RF system-on-chip (SOC) designs. Our experimental results show that the proposed approach is over 60x faster than conventional floorplanning with even very compact substrate noise models. We also obtain less area and total substrate noise than the conventional approach.
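
The idea of checking a sequence pair against preferred relative orders can be sketched as below. The sequence-pair rules are the standard ones (a block is left of another if it precedes it in both sequences, and above it if it precedes it in the first sequence only). The preference format `(a, b, wanted_relation)` is a hypothetical stand-in, since the abstract does not define how the BPDG encodes preferences.

```python
def relation(seq_pos, seq_neg, a, b):
    """Position of block a relative to block b implied by a sequence pair."""
    pa, pb = seq_pos.index(a), seq_pos.index(b)
    na, nb = seq_neg.index(a), seq_neg.index(b)
    if pa < pb and na < nb:
        return "left"
    if pa > pb and na > nb:
        return "right"
    if pa < pb and na > nb:
        return "above"
    return "below"

def count_violations(seq_pos, seq_neg, preferences):
    """Count preference edges whose wanted relation the sequence pair breaks."""
    return sum(1 for a, b, want in preferences
               if relation(seq_pos, seq_neg, a, b) != want)

# Hypothetical three-block example.
prefs = [("a", "b", "left"), ("a", "c", "left"), ("c", "b", "right")]
print(count_violations(["a", "b", "c"], ["b", "a", "c"], prefs))
```

Because each check needs only index comparisons, a count like this is cheap enough to evaluate for every candidate inside an annealing loop.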

8A-2 (Time: 13:55 - 14:20)

Title	A Fixed-die Floorplanning Algorithm Using an Analytical Approach
Author	*Yong Zhan, Yan Feng, Sachin S. Sapatnekar (University of Minnesota, United States)
Page	pp. 771 - 776
Keyword	Floorplan, Analytical, Soft module
Abstract	In this paper, we present an analytical floorplanning algorithm that can be used to efficiently pack soft modules into a fixed die. The locations and sizing of the modules are simultaneously optimized so that a minimum total wire length is achieved. Experimental results show that our algorithm can achieve above a 90% success rate with a 10% white space constraint in the fixed die, and the efficiency is much higher than that of the simulated annealing based algorithms for benchmarks containing a large number of modules.

8A-3 (Time: 14:20 - 14:45)

Title	A Multi-Technology-Process Reticle Floorplanner and Wafer Dicing Planner for Multi-Project Wafers
Author	*Chien-Chang Chen, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Page	pp. 777 - 782
Keyword	floorplan, integer linear programming, multi project wafers
Abstract	As VLSI manufacturing technology advances into the deep sub-micron (DSM) era, the mask cost can reach one or two million dollars. Multi-project wafers (MPW), which put different dies onto the same set of masks, are a good cost-sharing approach. Every design needs to be produced by its desired technology process, such as 1 poly with 4 metal layers (1P4M), or 1 poly with 5 metal layers (1P5M). Dies with different desired manufacturing processes cannot be produced from the same wafer, but they can be put onto the same set of masks in order to reduce the total cost of the masks and wafers used. In this paper, we propose a novel integer linear programming (ILP)-based floorplanner for shuttle runs consisting of projects requiring different desired processes. Two simulated annealing-based side-to-side wafer dicing planners are also presented. Experimental results show that our approach achieves 28% wafer reduction on average compared to a previous simulated annealing-based reticle floorplanner.

8A-4 (Time: 14:45 - 15:10)

Title	Design Space Exploration for Minimizing Multi-Project Wafer Production Cost
Author	Rung-Bin Lin, *Meng-Chiou Wu, Wei-Chiu Tseng, Ming-Hsine Kuo, Tsai-Ying Lin, Shr-Cheng Tsai (Yuan Ze University, Taiwan)
Page	pp. 783 - 788
Keyword	Multi-project wafer, Reticle floorplanning, Wafer dicing, Design space exploration, Mask cost
Abstract	The chip floorplan in a reticle for a Multi-Project Wafer (MPW) plays a key role in deciding chip fabrication cost. In this paper, we propose a methodology to explore the reticle floorplan design space to minimize MPW production cost, facilitated by a new cost model and an efficient reticle floorplanning method. It is shown that a good floorplan saves 47% and 42% of production cost with respect to a poor floorplan for small and medium volume production, respectively.

8A-5 (Time: 15:10 - 15:35)

Title	SAT-Based Optimal Hypergraph Partitioning with Replication
Author	*Michael G. Wrighton (Tabula, Inc., United States), Andre M. DeHon (California Institute of Technology, United States)
Page	pp. 789 - 795
Keyword	partitioning, replication, boolean SAT
Abstract	We propose a methodology for optimal k-way partitioning with replication of directed hypergraphs via Boolean satisfiability. We begin by leveraging the power of existing and emerging SAT solvers to attack traditional logic bipartitioning and show good scaling behavior. We continue to present the first optimal partitioning results that admit generation and assignment of replicated nodes concurrently. Our framework is general enough that we also give the first published optimal results for partitioning with respect to the maximum subdomain degree metric and the sum of external degrees metric. We show that for the bipartitioning case we can feasibly solve problems of up to 150 nodes with simultaneous replication in hundreds of seconds. For other partitioning metrics, we are able to solve problems up to 40 nodes in hundreds of seconds.

Session 8B Memory Optimization for Embedded Systems (13:30 - 15:35)
Location: Room 413
Chair(s): Hiroyuki Tomiyama (Nagoya University, Japan), Preeti Ranjan Panda (Indian Inst. of Tech., Delhi, India)

8B-1 (Time: 13:30 - 13:55)

Title	Finding Optimal L1 Cache Configuration for Embedded Systems
Author	Andhi Janapsatya, Aleksandar Ignjatovic, *Sri Parameswaran (The University of New South Wales, Australia)
Page	pp. 796 - 801
Keyword	Design Space Exploration, Embedded System, Cache Memory
Abstract	Modern embedded systems execute a single application or a class of applications repeatedly. An emerging methodology for designing embedded systems utilizes configurable processors where the cache size, associativity, and line size can be chosen by the designer. In this paper, a method is given to rapidly find the L1 cache miss rate of an application. An energy model and an execution time model are developed to find the best cache configuration for the given embedded application. Using benchmarks from Mediabench, we find that our method explores the design space on average 45 times faster than Dinero IV while still having 100% accuracy.
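
For context, the brute-force baseline that such a method competes with is one trace-driven simulation per cache configuration. A minimal sketch of that baseline, assuming a set-associative cache with LRU replacement; `miss_rate` and the address trace are hypothetical, and this is neither the authors' fast method nor Dinero IV itself:

```python
from collections import OrderedDict

def miss_rate(trace, cache_size, line_size, assoc):
    """Miss rate of a set-associative LRU cache for a trace of byte addresses."""
    num_sets = cache_size // (line_size * assoc)
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_size
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)       # mark as most recently used
        else:
            misses += 1
            if len(lines) >= assoc:
                lines.popitem(last=False)  # evict the least recently used line
            lines[block] = True
    return misses / len(trace)

# Made-up trace; compare two associativities at the same total size.
trace = [0, 4, 64, 0, 128, 0]
for assoc in (1, 2):
    print(assoc, miss_rate(trace, cache_size=64, line_size=16, assoc=assoc))
```

Sweeping size, line size, and associativity this way re-reads the whole trace for every configuration, which is the cost a rapid miss-rate method tries to avoid.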

8B-2 (Time: 13:55 - 14:20)

Title	Memory Size Computation for Multimedia Processing Applications
Author	Hongwei Zhu, Ilie I. Luican, *Florin Balasa (University of Illinois at Chicago, United States)
Page	pp. 802 - 807
Keyword	memory, multimedia, signal processing, multidimensional signals, polytopes
Abstract	In real-time multimedia processing systems a large part of the power consumption is due to the data storage and data transfer. Moreover, the area cost is often largely dominated by the memory modules. The computation of the memory size is an important step in the process of designing an optimized (for area and/or power) memory architecture for multimedia processing systems. This paper presents a novel non-scalar approach for computing exactly the memory size in real-time multimedia algorithms. This methodology uses both algebraic techniques specific to the data-flow analysis used in modern compilers, and also recent advances in the theory of integral polyhedra. In contrast with all the previous works which are only estimation methods, this approach performs exact memory computations even for applications with a large number of scalar signals.

8B-3 (Time: 14:20 - 14:45)

Title	Maximizing Data Reuse for Minimizing Memory Space Requirements and Execution Cycles
Author	Mahmut Kandemir, Guangyu Chen, *Feihui Li (Pennsylvania State University, United States)
Page	pp. 808 - 813
Keyword	data locality
Abstract	Embedded systems in the form of vehicles and mobile devices such as wireless phones, automatic banking machines and new multi-modal devices operate under tight memory and power constraints. Therefore, their performance demands must be balanced very well against their memory space requirements and power consumption. Automatic tools that can optimize for memory space utilization and performance are expected to be increasingly important in the future as increasingly larger portions of embedded designs are being implemented in software. In this paper, we describe a novel optimization framework that can be used in two different ways: (i) deciding a suitable on-chip memory capacity for a given code, and (ii) restructuring the application code to make better use of the available on-chip memory space. While prior proposals have addressed these two questions, the solutions proposed in this paper are very aggressive in extracting and exploiting all data reuse in the application code, restricted only by inherent data dependences.

8B-4 (Time: 14:45 - 15:10)

Title	Compiler-Guided Data Compression for Reducing Memory Consumption of Embedded Applications
Author	Ozcan Ozturk, Guangyu Chen, *Mahmut Kandemir (Pennsylvania State University, United States), Ibrahim Kolcu (University of Manchester, Great Britain)
Page	pp. 814 - 819
Keyword	Scratchpad Memory, Memory Compression, Compiler
Abstract	The memory system presents one of the critical challenges in embedded system design and optimization. This is mainly due to the ever-increasing code complexity of embedded applications and the exponential increase in the amount of data they manipulate. As a result, reducing the memory space occupancy of embedded applications is very important and will be even more important in the next decade. Motivated by this observation, this paper presents and evaluates a compiler-driven approach to data compression for reducing memory space occupancy. Our goal in this paper is to study how automated compiler support can help in deciding the set of data elements to compress/decompress and the points during execution at which these compressions/decompressions should be performed. The proposed compiler support achieves this by analyzing the source code of the application to be optimized and identifying the order in which the different data blocks are accessed. Based on this analysis, the compiler then automatically inserts compression/decompression calls in the application code. The compression calls target the data blocks that are not expected to be used in the near future, whereas the decompression calls target those data blocks with expected reuse but currently in compressed form.

8B-5 (Time: 15:10 - 15:35)

Title	Analysis of Scratch-Pad and Data-Cache Performance Using Statistical Methods
Author	*Javed Absar (IMEC, Katholieke Universiteit Leuven, Belgium), Francky Catthoor (IMEC, Belgium)
Page	pp. 820 - 825
Keyword	scratch-pad, performance, measure, probability, hit-rate
Abstract	An effectively designed and efficiently used memory hierarchy, composed of scratch-pads or caches, is seen today as the key to obtaining energy and performance gains in data-dominated embedded applications. However, an unsolved problem is how to make the right choice between the scratch-pad and the data-cache for different classes of applications. Recent studies show that applications with regular and manifest data access patterns (e.g. matrix multiplication) perform better on the scratch-pad compared to the cache. In the case of dynamic applications with irregular and non-manifest access patterns, it is however commonly and intuitively believed that the cache would perform better. In this paper, we show by theoretical analysis and experimental results that this intuition can sometimes be misleading. When access-probabilities remain fixed, we prove that the scratch-pad, with an optimal mapping, will always outperform the cache. We also demonstrate how to map dynamic applications efficiently to scratch-pad or cache and, additionally, how to accurately predict the performance.

Session 8C Inductive Issues in Power Grids and Packages (13:30 - 15:35)
Location: Room 414+415
Chair(s): Takashi Sato (Renesas, Japan)

8C-1 (Time: 13:30 - 13:55)

Title	Efficient Early Stage Resonance Estimation Techniques for C4 Package
Author	*Jin Shi, Yici Cai (Department of Computer Science and Technology, Tsinghua University, China), Shelton X-D Tan (Department of Electrical Engineering, University of California at Riverside, United States), Xianlong Hong (Department of Computer Science and Technology, Tsinghua University, China)
Page	pp. 826 - 831
Keyword	Package, Resonance, Estimation, C4, early stage
Abstract	In this paper, we study the relationship between C4 package resonance effects and logical switching timing correlations, which has been little investigated in the past. We show that improper logic designs with some special timing correlations can lead to adverse large voltage drops due to resonance effects in the widely used C4 package. We first present numerical analysis results on industry C4 package circuits to demonstrate the resonance phenomenon. Then we propose a simple algorithm to compute the worst case logical timing correlations among cells leading to resonance. Finally, we develop an efficient technique for the early logic design stage to estimate the resonance risk. Experimental results demonstrate the effectiveness of the proposed method for the accurate prediction of the resonance effect in the C4 package.

8C-2 (Time: 13:55 - 14:20)

Title	Parallel-Distributed Time-Domain Circuit Simulation of Power Distribution Networks with Frequency-Dependent Parameters
Author	*Takayuki Watanabe (University of Shizuoka, Japan), Yuichi Tanji (Kagawa University, Japan), Hidemasa Kubota, Hideki Asai (Shizuoka University, Japan)
Page	pp. 832 - 837
Keyword	power integrity, FDTD, LIM, Parallel Computing, Debye model
Abstract	In this paper, we focus on the verification of PCB/Package power integrity, which is becoming very important for the design of state-of-the-art high speed digital circuits. The simulation of power distribution networks (PDNs) of the PCB/Package, which can be modeled as a large number of RLC lumped components, is a time-consuming task for a conventional circuit simulator such as SPICE. To address this, we propose a parallel-distributed time-domain circuit simulation algorithm based on LIM. Furthermore, we propose an effective way of modeling the frequency dependencies of the PDNs, such as skin effects and dielectric losses, so that they can be solved by LIM.

8C-3 (Time: 14:20 - 14:45)

Title	Power Distribution Techniques for Dual VDD Circuits
Author	*Sarvesh Hemchandra Kulkarni, Dennis Sylvester (University of Michigan, United States)
Page	pp. 838 - 843
Keyword	Dual VDD Design, Low Power Design, Power Delivery
Abstract	Extensive research has proposed the use of multiple on-die power supplies (VDD) for reducing power consumption in CMOS circuits. We present a detailed study and design techniques for power delivery systems in dual VDD CMOS circuits. We first show that the total current to be delivered by the voltage supplies is significantly reduced (by 27%−46%) in dual VDD circuits. This current reduction prompts various design strategies that can be employed to design the power delivery system. We describe issues that arise at the system, board and package levels and propose a high-level model for the same. We then provide a new placement driven approach for designing on-die dual VDD power grids. Compared to already existing methods, the dual VDD grids generated by our approach reduce the worst case and average voltage drop by up to 12.3% and 6.8% respectively with no area overhead and sometimes improving wire congestion. We also show that dual VDD circuits can afford lower on-die decoupling capacitance budgets.

8C-4 (Time: 14:45 - 15:10)

Title	Calculating Frequency-Dependent Inductance of VLSI Interconnect by Complete Multiple Reciprocity Boundary Element Method
Author	*Changhao Yan, Wenjian Yu, Zeyi Wang (Department of Computer Science and Technology, Tsinghua University, China)
Page	pp. 844 - 849
Keyword	inductance extraction, multiple reciprocity method, MRM, boundary element method, BEM
Abstract	A complete multiple reciprocity method (CMRM), usually for the eigenvalue analysis of Helmholtz equation, is introduced to the BEM for frequency-dependent inductance extraction. Several approaches are proposed to resolve the problem of "ill-conditioned" series encountered when applying the CMRM practically. Using the BEM combined with CMRM, the major operations of calculating the numerical integrals for a frequency point become reusable, so that inductance extraction for a frequency range is greatly accelerated. Numerical results verify the accuracy and efficiency of the proposed method.

8C-5 (Time: 15:10 - 15:35)

Title	Controlling Inductive Cross-talk and Power in Off-chip Buses using CODECs
Author	Brock LaMeres (Agilent Technologies Inc., United States), Kanupriya Gulati, *Sunil Khatri (Dept of Electrical Engg., Texas A&M University, United States)
Page	pp. 850 - 855
Keyword	Inductance, Power, Off-chip IO
Abstract	The parasitic inductances within IC packaging cause supply bounce as well as glitches on the signal pins, significantly limiting the frequency of high-speed inter-chip communication. Also, off-chip communication contributes a large fraction of the total system power. Until recently, the parasitic inductance problem was addressed by aggressive package design, which is expensive. In this work we present a technique to encode the off-chip data transmission to i) limit bounce on the supplies, ii) reduce glitching caused by inductive signal coupling from neighboring signals, iii) limit the edge degradation of signals due to mutually induced voltages from neighboring switching signals, and iv) control the total power consumption of the I/O logic. All these factors are modeled in a unified mathematical framework. Our experimental results show that the proposed encoding based techniques result in reduced supply bounce and signal glitching due to inductive cross-talk, closely matching the theoretical predictions. Also, we show that the bus size overhead is reasonable even after stringent power reduction constraints are imposed. We demonstrate that the overall bandwidth of a bus actually increases by 100% over an unencoded bus, using our technique with inductive constraints only (even after accounting for the encoding overhead). When the power constraints were added (to limit the power to 20% of worst case switching power) in addition to the inductive constraints, the bandwidth was again 100% improved over the unencoded bus. The asymptotic bus size overhead depends on how stringent the user-specified power and inductive cross-talk parameters are. We have validated our approach by simulating it in an ASIC setting as well as prototyping and testing it in an FPGA environment.

Session 8D Designers' Forum: "Cell" Processor (13:30 - 15:30)
Location: Small Auditorium, 5F
Chair(s): Haruyuki Tago (Toshiba, Japan), Makoto Ikeda (University of Tokyo, Japan)

8D-1 (Time: 13:30 - 14:00)

Title	A New Test and Characterization Scheme for 10+ GHz Low Jitter Wide Band PLL
Author	*Kazuhiko Miki (Toshiba Corporation, Japan), David Boerstler, Eskinder Hailu, Jieming Qi, Sarah Pettengill (IBM Microelectronics, United States), Yuichi Goto (Toshiba Corporation, Japan)
Page	pp. 856 - 859
Keyword	PLL, VCO, tracking range, duty cycle, test
Abstract	This paper presents a new test and characterization scheme for a 10+ GHz low jitter wide band PLL in 90 nm partially depleted (PD) Silicon-On-Insulator (SOI) CMOS technology. We measure the frequency range of VCOs without adding any devices for test between the charge-pump (CP) and the voltage-controlled oscillator (VCO). This test scheme gives us the intermediate frequency of the VCO as well as the maximum and the minimum frequency. This paper also describes circuitry to observe the duty cycle of a 4.2 GHz clock directly on a wafer probe station, including a method to verify the measured duty cycle.

8D-2 (Time: 14:00 - 14:30)

Title	An SPU Reference Model for Simulation, Random Test Generation and Verification
Author	*Yukio Watanabe (Toshiba Corporation Semiconductor Company, Japan), Balazs Sallay, Brad Michael, Daniel Brokenshire, Gavin Meil, Hazim Shafi (IBM, United States), Daisuke Hiraoka (Sony Computer Entertainment Inc., Japan)
Page	pp. 860 - 866
Keyword	microprocessor, simulation, verification, modeling
Abstract	An instruction set level reference model was developed for the Synergistic Processing Unit development. This reference model was used for the simulators, the test case generator, the verification environment and the software development. Using the same reference model for multiple purposes made it easier to keep up with the architecture changes. Also, including the reference model in the simulation environment increased the robustness of the random test executions in finding bugs that are usually difficult to catch.

8D-3 (Time: 14:30 - 15:00)

Title	A Cycle Accurate Power Estimation Tool
Author	*Rajat Chaudhry, Daniel Stasiak, Stephen Posluszny, Sang Dhong (IBM Corporation, United States)
Page	pp. 867 - 870
Keyword	Power
Abstract	Power consumption is one of the major challenges in VLSI Design. Power constrained designs need tools to accurately predict the power consumption and provide feedback to designers on the efficiency of the power management logic. In this paper we present the methodology behind a cycle accurate power estimation tool. This tool was used to estimate the power of a first generation CELL Processor. The tool extracts switching and clock activity from RTL simulations and applies them to transistor level macro power models to calculate the power for every cycle of the simulation trace.

8D-4 (Time: 15:00 - 15:30)

Title	Key Features of the Design Methodology Enabling a Multi-Core SoC Implementation of a First-Generation CELL Processor
Author	*Dac Pham, Hans-Werner Anderson, Erwin Behnen, Mark Bolliger, Sanjay Gupta, Peter Hofstee, Paul Harvey, Charles Johns, Jim Kahle (IBM, United States), Atsushi Kameyama (Toshiba America Electronic Components, United States), John Keaty, Bob Le, Sang Lee, Tuyen Nguyen, John Petrovick, Mydung Pham, Juergen Pille, Stephen Posluszny, Mack Riley, Joseph Verock, James Warnock, Steve Weitzel, Dieter Wendel (IBM, United States)
Page	pp. 871 - 878
Keyword	CELL, Multi-Core, SoC, POWER, SPE
Abstract	This paper reviews the design challenges that current and future processors must face, with stringent power limits and high frequency targets, and the design methods required to overcome the above challenges and address the continuing Giga-scale system integration trend. This paper then describes the details behind the design methodology that was used to successfully implement a first-generation CELL processor - a multi-core SoC. Key features of this methodology are broad optimization with fast rule-based analysis engines using macro-level abstraction for constraints propagation up/down the design hierarchy, coupled with accurate transistor level simulation for detailed analysis. The methodology fostered the modular design concept that is inherent to the CELL architecture, enabling a high frequency design by maximizing custom circuit content through re-use, and balanced power, frequency, and die size targets through global convergence capabilities. The design has roughly 241 million transistors implemented in 90nm SOI technology with 8 levels of copper interconnects and one local interconnect layer. The chip has been tested at various temperatures, voltages, and frequencies. Correct operation has been observed in the lab on first pass silicon at frequencies well over 4GHz.

Session 9A High-Level Synthesis (16:00 - 18:05)
Location: Room 411+412
Chair(s): Shigeru Yamashita (Nara Institute of Science and Technology, Japan), Youngsoo Shin (KAIST, Republic of Korea)

9A-1 (Time: 16:00 - 16:25)

Title	TAPHS: Thermal-Aware Unified Physical-Level and High-Level Synthesis
Author	*Zhenyu (Peter) Gu (Northwestern University, United States), Yonghong Yang (Queen's University, Canada), Jia Wang, Robert P. Dick (Northwestern University, United States), Li Shang (Queen's University, Canada)
Page	pp. 879 - 885
Keyword	high-level synthesis, thermal
Abstract	Thermal effects are becoming increasingly important during integrated circuit design. Thermal characteristics influence reliability, power consumption, cooling costs, and performance. It is necessary to consider thermal effects during all levels of the design process, from the architectural level to the physical level. However, design-time temperature prediction requires access to block placement, wire models, power profile, and a chip-package thermal model. Thermal-aware design and synthesis necessarily couple architectural-level design decisions (e.g., scheduling) with physical design (e.g., floorplanning) and modeling (e.g., wire and thermal modeling). This article proposes an efficient and accurate thermal-aware floorplanning high-level synthesis system that makes use of integrated high-level and physical-level thermal optimization techniques. Voltage islands are automatically generated via novel slack distribution and voltage partitioning algorithms in order to reduce the design's power consumption and peak temperature. A new thermal-aware floorplanning technique is proposed to balance the chip thermal profile, thereby further reducing peak temperature. The proposed system was used to synthesize a number of benchmarks, yielding numerous designs that trade off peak temperature, integrated circuit area, and power consumption. The proposed techniques reduce peak temperature by 12.5 degrees C on average. When used to minimize peak temperature with a fixed area, peak temperature reductions are common. Under a constraint on peak temperature, integrated circuit area is reduced by 9.9% on average.

9A-2 (Time: 16:25 - 16:50)

Title	An Automated, Efficient and Static Bit-width Optimization Methodology Towards Maximum Bit-width-to-Error Tradeoff With Affine Arithmetic Model
Author	*Yu Pu, Yajun Ha (National University of Singapore, Singapore)
Page	pp. 886 - 891
Keyword	bit-width optimization, affine arithmetic
Abstract	Ideally, bit-width analysis methods should be able to find the most appropriate bit-widths to achieve the optimum bit-width-to-error tradeoff for variables and constants in high-level DSP algorithms when they are implemented in hardware. The tradeoff enables the fixed-point hardware implementation to be area efficient but still within the allowed error tolerance. Unfortunately, almost all the existing static bit-width analysis methods are based on Interval Arithmetic (IA), which may overestimate bit-widths and yield a fairly pessimistic bit-width-to-error tradeoff. We have developed an automated and efficient bit-width optimization methodology that is based on Affine Arithmetic (AA). Experiments have proven that, compared to the previous static analysis methods, our methodology not only dramatically reduces the fractional bit-width by more than 35% but also slightly reduces the integer bit-width. In addition, our probabilistic error analysis method further enlarges the bit-width-to-error tradeoff.
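
The difference between the two range models named in this abstract can be illustrated with a minimal sketch. The `Interval` and `Affine` classes below are hypothetical illustrations and are not the authors' tool. They only show why interval arithmetic can overestimate a range when the same variable appears more than once, while an affine form keeps track of the shared noise symbol.

```python
class Interval:
    """Closed interval [lo, hi]; correlations between operands are lost."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __sub__(self, other):
        # Worst case: smallest minus largest, largest minus smallest.
        return Interval(self.lo - other.hi, self.hi - other.lo)


class Affine:
    """Affine form x0 + sum(xi * eps_i), with each eps_i in [-1, 1]."""

    def __init__(self, center, terms):
        self.center = center
        self.terms = dict(terms)  # noise symbol name -> coefficient

    @classmethod
    def from_range(cls, lo, hi, symbol):
        # One fresh noise symbol represents the uncertainty of one input.
        return cls((lo + hi) / 2, {symbol: (hi - lo) / 2})

    def __sub__(self, other):
        terms = dict(self.terms)
        for symbol, coeff in other.terms.items():
            # Coefficients of a shared symbol cancel instead of adding up.
            terms[symbol] = terms.get(symbol, 0.0) - coeff
        return Affine(self.center - other.center, terms)

    def to_interval(self):
        radius = sum(abs(c) for c in self.terms.values())
        return Interval(self.center - radius, self.center + radius)


x_ia = Interval(1.0, 3.0)
diff_ia = x_ia - x_ia  # IA treats the two operands as unrelated

x_aa = Affine.from_range(1.0, 3.0, "x")
diff_aa = (x_aa - x_aa).to_interval()  # AA knows both operands are x
```

Here `x - x` is bounded by [-2, 2] under the interval model and by [0, 0] under the affine model. A narrower range generally needs fewer integer bits, which is the kind of saving the abstract refers to.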

9A-3 (Time: 16:50 - 17:15)

Title	Abridged Addressing: A Low Power Memory Addressing Strategy
Author	*Preeti Ranjan Panda (Indian Institute of Technology, Delhi, India)
Page	pp. 892 - 897
Keyword	memory, synthesis, addressing, low-power
Abstract	The memory subsystem is known to comprise a significant fraction of the power dissipation in embedded systems. The memory addressing strategy, which determines the sequence of addresses appearing on the memory address bus as well as the switching activity in the addressing logic, has a major impact on the memory subsystem power dissipation. We present a novel addressing strategy, Abridged Addressing, that helps reduce system power dissipation by substantially reducing both the address bus switching as well as the addressing logic power. The strategy, which relies on minimizing register accesses in the addressing logic, helps overcome some of the limitations of existing approaches: the address bus switching is low; there is very little area, performance, and power overhead; and the addressing hardware is simpler, making the technique suitable for both on-chip and off-chip memory, as well as single-port and multi-port memories.

9A-4 (Time: 17:15 - 17:40)

Title	Using Speculative Computation and Parallelizing Techniques to Improve Scheduling of Control based Designs
Author	Roberto Cordone (Università degli studi di Crema, Italy), *Fabrizio Ferrandi, Gianluca Palermo, Marco Domenico Santambrogio, Donatella Sciuto (Politecnico di Milano, Italy)
Page	pp. 898 - 904
Keyword	HLS, scheduling, ILP
Abstract	Recent research results have seen the application of parallelizing techniques to high-level synthesis. In particular, the effect of speculative code transformations on mixed control-data flow designs has demonstrated effective results on schedule lengths. In this paper we first analyze the use of the control and data dependence graph as an intermediate representation that provides the possibility of extracting the maximum parallelism. Then we analyze the scheduling problem by formulating an approach based on Integer Linear Programming (ILP) to minimize the number of control steps given the amount of resources. We improve the already proposed ILP scheduling approaches by introducing a new conditional resource sharing constraint which is then extended to the case of speculative computation. The ILP formulation has been solved by using a Branch and Cut framework which provides better results than standard branch and bound techniques.

9A-5 (Time: 17:40 - 18:05)

Title	Worst Case Execution Time Analysis for Synthesized Hardware
Author	*Jun-hee Yoo, Xingguang Feng, Kiyoung Choi (Seoul National University, Republic of Korea), Eui-Young Chung, Kyu-Myung Choi (Samsung Electronics, Republic of Korea)
Page	pp. 905 - 910
Keyword	static estimation, behavioral synthesis, design space exploration, worst case execution time
Abstract	We propose a hardware performance estimation flow for fast design space exploration, based on worst-case execution time analysis algorithms for software analysis. Test cases on some real-world applications show that our flow provides a tight upper bound of the execution time, and many useful hints to the designer.

Session 9B Modeling, Compilation and Optimization of Embedded Architectures (16:00 - 18:05)
Location: Room 413
Chair(s): Hiroyuki Tomiyama (Nagoya University, Japan), Lovic Gauthier (FLEETS, Japan)

9B-1 (Time: 16:00 - 16:25)

Title	Workload Prediction and Dynamic Voltage Scaling for MPEG Decoding
Author	Ying Tan, Parth Malani, Qinru Qiu, *Qing Wu (State University of New York at Binghamton, United States)
Page	pp. 911 - 916
Keyword	low power, dynamic voltage scheduling, MPEG decoding, workload prediction
Abstract	In this paper we present three efficient DVS techniques for an MPEG decoder. Their energy reduction is comparable to that of the optimal solution. A workload prediction model is also developed based on the block level statistics of each MPEG frame. Compared with previous works, the new model exhibits a remarkable improvement in accuracy of the prediction. The experimental results show that, with the new prediction model, the presented DVS techniques achieve more energy reduction than previous works while delivering the same Quality of Service (QoS).

9B-2 (Time: 16:25 - 16:50)

Title	Lazy BTB: Reduce BTB Energy Consumption Using Dynamic Profiling
Author	*Yen-Jen Chang (Department of Computer Science, National Chung-Hsing University, Taiwan)
Page	pp. 917 - 922
Keyword	BTB, low-power, dynamic profiling
Abstract	In this paper, we propose an alternative BTB design, called lazy BTB, to reduce the BTB energy consumption by filtering out the redundant lookups. The most distinct feature of the lazy BTB is that it dynamically profiles the taken traces during program execution. Unlike the traditional design in which the BTB has to be looked up on every instruction fetch, by introducing an additional field to record the trace information, our design can achieve the goal of one BTB lookup per taken trace. The experimental results show that with a negligible performance degradation the lazy BTB can reduce the BTB energy consumption by about 77% on average for the MediaBench applications.

9B-3 (Time: 16:50 - 17:15)

Title	Cache Size Selection for Performance, Energy and Reliability of Time-Constrained Systems
Author	Yuan Cai (University of Iowa, United States), Marcus T. Schmitz, Alireza Ejlali, Bashir M. Al-Hashimi (University of Southampton, Great Britain), *Sudhakar M. Reddy (University of Iowa, United States)
Page	pp. 923 - 928
Keyword	cache size, energy, performability, reliability, performance
Abstract	Improving performance, reducing energy consumption and enhancing reliability are three important objectives for embedded computing systems design. In this paper, we study the joint impact of cache size selection on these three objectives. For this purpose, we conduct extensive fault injection experiments on five benchmark examples using a cycle-accurate processor simulator. Performance and reliability are analyzed using the performability metric. Overall, our experiments demonstrate the importance of a careful cache size selection when designing energy-efficient and reliable systems. Furthermore, the experimental results show the existence of optimal or Pareto-optimal cache size selection to optimize the three design objectives.

9B-4 (Time: 17:15 - 17:40)

Title	Reducing Dynamic Compilation Overhead by Overlapping Compilation and Execution
Author	Priya Unnikrishnan (IBM Toronto, Canada), Mahmut Kandemir, *Feihui Li (Pennsylvania State University, United States)
Page	pp. 929 - 934
Keyword	embedded Java, dynamic compilation, performance optimization
Abstract	An important problem in executing applications in energy-sensitive embedded environments is to tune their behavior based on dynamic variations in energy constraints. One option for achieving this is dynamic compilation --- compiling code fragments on the fly to adapt to changing energy demands. While dynamic compilation can be very beneficial in many embedded environments where multiple criteria need to be satisfied during execution, it can also incur a significant performance overhead since compilation takes place at runtime. The goal in this work is to reduce this performance overhead of dynamic compilation by overlapping it with application execution. Specifically, provided that we have available hardware resources to perform dynamic compilation concurrently with application execution, our approach compiles the next code fragment to be executed while we are executing the current code fragment. The experimental results from our implementation indicate significant savings in execution times. Our experimental results also indicate that the proposed strategy performs consistently well under different parameters.
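
The overlap described in this abstract can be sketched as a two-stage pipeline. This is only an illustration of the control structure, not the authors' implementation: `run_overlapped`, `compile_fragment`, and `execute` are hypothetical names, and a single background worker thread stands in for the spare hardware resource the abstract assumes. In CPython the interpreter lock limits true parallelism for CPU-bound work, so the sketch shows the scheduling pattern rather than a measured speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def run_overlapped(fragments, compile_fragment, execute):
    """Execute fragments in order, compiling fragment i+1 while i runs."""
    results = []
    if not fragments:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # The first fragment has nothing to overlap with.
        pending = pool.submit(compile_fragment, fragments[0])
        for next_fragment in fragments[1:]:
            compiled = pending.result()  # wait for the current fragment
            # Start compiling the next fragment in the background...
            pending = pool.submit(compile_fragment, next_fragment)
            # ...while the current fragment executes on this thread.
            results.append(execute(compiled))
        results.append(execute(pending.result()))
    return results


# Usage with Python's own compiler as a stand-in for a dynamic compiler.
outputs = run_overlapped(
    ["1 + 1", "2 * 3", "10 - 4"],
    lambda src: compile(src, "<fragment>", "eval"),
    eval,
)
```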

9B-5 (Time: 17:40 - 18:05)

Title	Functional Modeling Techniques for Efficient Sw Code Generation of Video Codec Application
Author	*Sang-Il Han (TIMA Laboratory, France), Soo-Ik Chae (Seoul National University, Republic of Korea), Ahmed Amine Jerraya (TIMA Laboratory, France)
Page	pp. 935 - 940
Keyword	Functional model, video codec, software generation, clocked synchronous model, abstract clock
Abstract	Architectures with multiple programmable cores are becoming more attractive for video codec applications because they can provide highly concurrent computation and support multiple video standards and a shorter time-to-market. To find efficient SW code for a multiple-core architecture running a video codec application, it is very important to explore the design space easily by generating the SW code automatically from its functional model. We introduce the Abstract Clock Synchronous Model (ACSM) for functional modeling of video codec applications. The ACSM can easily represent both parallelism and conditionals, which are common in video codec applications. By applying ACSM to an H.264 baseline decoder on a single-core architecture, we reduced the execution time and the number of external memory accesses by 32% and 46%, respectively, compared to a traditional dataflow model.

Session 9C Statistical Design (16:00 - 18:05)
Location: Room 414+415
Chair(s): Sachin Sapatnekar (University of Minnesota, United States), Sunil Khatri (Texas A&M Univ., United States)

9C-1 (Time: 16:00 - 16:25)

Title	Convergence-Provable Statistical Timing Analysis with Level-Sensitive Latches and Feedback Loops
Author	Lizheng Zhang, Jengliang Tsai, Weijen Chen, Yuhen Hu, *Charlie Chungping Chen (University of Wisconsin-Madison, United States)
Page	pp. 941 - 946
Keyword	statistical timing analysis, level sensitive latch, feedback loop, convergence
Abstract	Statistical timing analysis has been widely applied to predict the timing yield of VLSI circuits when process variations become significant. Existing statistical latch timing methods either have exponential complexity or are unable to treat the random variable's self-dependence caused by the coexistence of level-sensitive latches and feedback loops. In this paper, an efficient iterative statistical timing algorithm with provable convergence is proposed for latch-based circuits with feedback loops. Based on a new notion of iteration mean, we prove that the algorithm converges unconditionally. Moreover, we show that the converged value of iteration mean can be used to predict the circuit yield during design time. Tested on ISCAS'89 benchmark circuits, the proposed algorithm shows an error of 1.1% and a speedup of 303x on average when compared with Monte Carlo simulation.

9C-2 (Time: 16:25 - 16:50)

Title	Parameterized Block-Based Non-Gaussian Statistical Gate Timing Analysis
Author	Soroush Abbaspour, Hanif Fatemi, *Massoud Pedram (University of Southern California, United States)
Page	pp. 947 - 952
Keyword	statistical timing analysis, process variation
Abstract	As technology scales down, timing verification of digital integrated circuits becomes an extremely difficult task due to the gate and wire variability. Therefore, statistical timing analysis (denoted by σTA) is becoming unavoidable. In this paper, we propose a new framework to handle the statistical gate timing analysis for non-Gaussian sources of variation in block-based σTA. First, we present an approach to approximate the variational RC-π load by using a canonical first-order model. Next, an accurate variation-aware gate timing analysis based on statistical input transition, statistical gate timing library, and statistical RC-π load is presented. To achieve this objective, we present a statistical effective capacitance calculation, which is the key contribution of this paper. Experimental results show an average error of 6% for gate delay and output transition time with respect to the HSPICE Monte Carlo simulation while the runtime is about 95 times faster.

9C-3 (Time: 16:50 - 17:15)

Title	Statistical Leakage Minimization through Joint Selection of Gate Sizes, Gate Lengths and Threshold Voltage
Author	*Sarvesh Bhardwaj, Yu Cao, Sarma Vrudhula (Arizona State University, United States)
Page	pp. 953 - 958
Keyword	Statistical, Leakage, Convex, Optimization
Abstract	This paper proposes a novel methodology for statistical leakage minimization of digital circuits. A function of the mean and variance of the leakage is minimized with a constraint on the alpha-percentile of the delay using statistical delay models. Since the leakage is a strong function of the threshold voltage and gate length, considering them as design variables can provide significant power savings. The leakage minimization problem is formulated as a multivariable convex optimization problem. We demonstrate that statistical optimization can lead to more than 37% savings in nominal leakage compared to worst-case techniques that perform only gate sizing. Also, gate length biasing is shown to cause significant reduction in the leakage variability due to its inverse relation with Vth.

9C-4 (Time: 17:15 - 17:40)

Title	Statistical Bellman-Ford Algorithm With An Application to Retiming
Author	*Mongkol Ekpanyapong (Georgia Institute of Technology, United States), Thaisiri Watewai (University of California, Berkeley, United States), Sung Kyu Lim (Georgia Institute of Technology, United States)
Page	pp. 959 - 964
Keyword	retiming, statistical timing analysis
Abstract	Process variations in digital circuits make sequential circuit timing validation an extremely challenging task. In this paper, a Statistical Bellman-Ford (SBF) algorithm is proposed to compute the longest path length distribution for directed graphs with cycles. Our SBF algorithm efficiently computes the statistical longest path length distribution if there exist no positive cycles or detects one if the circuit is likely to have a positive cycle. An important application of SBF is Statistical Retiming-based Timing Analysis (SRTA), where SBF is used to check for the feasibility of a given target clock period distribution for retiming. Our gate and wire delay distribution model considers several high-impact intra-die process parameters and accurately captures the spatial and reconvergent path correlations. The Monte Carlo simulation is used to validate the accuracy of our SRTA algorithm.
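
For orientation, the deterministic counterpart of the computation described in this abstract can be sketched as follows. This is the classical Bellman-Ford relaxation adapted to longest paths with fixed edge weights. It is not the SBF algorithm itself, which works on delay distributions, and the function name `longest_paths` is a hypothetical choice for this sketch.

```python
import math


def longest_paths(num_nodes, edges, source):
    """Longest path lengths from source, or None if a positive cycle is reachable.

    edges is a list of (u, v, weight) tuples for a directed graph.
    """
    dist = [-math.inf] * num_nodes
    dist[source] = 0.0
    # Without a positive cycle, num_nodes - 1 relaxation rounds are enough.
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] != -math.inf and dist[u] + w > dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break
    # One more pass: any further improvement implies a positive cycle.
    for u, v, w in edges:
        if dist[u] != -math.inf and dist[u] + w > dist[v]:
            return None
    return dist


# The cycle 1 -> 2 -> 1 has total weight -1, so longest paths are well defined.
feasible = longest_paths(3, [(0, 1, 2.0), (1, 2, 3.0), (2, 1, -4.0), (0, 2, 1.0)], 0)
# Raising the back edge to -2 makes the cycle weight +1, so no bound exists.
infeasible = longest_paths(3, [(0, 1, 2.0), (1, 2, 3.0), (2, 1, -2.0), (0, 2, 1.0)], 0)
```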

9C-5 (Time: 17:40 - 18:05)

Title	An Exact Algorithm for the Statistical Shortest Path Problem
Author	Liang Deng, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, United States)
Page	pp. 965 - 970
Keyword	Algorithm, Process Variations, Statistical Shortest Path
Abstract	Graph algorithms are widely used in VLSI CAD. Traditional graph algorithms can handle graphs with deterministic edge weights. As VLSI technology continues to scale into nanometer designs, we need to use probability distributions for edge weights in order to model uncertainty due to parameter variations. In this paper, we consider the statistical shortest path (SSP) problem. Given a graph $G$, the edge weights of $G$ are random variables. For each path $P$ in $G$, let $L_{P}$ be its length, which is the sum of all edge weights on $P$. Clearly $L_{P}$ is a random variable and we let $\mu_{P}$ and $\sigma_{P}^2$ be its mean and variance, respectively. In the SSP problem, our goal is to find a path $P$ connecting two given vertices to minimize the cost function $\mu_{P}+\Phi(\sigma_{P}^2)$ where $\Phi$ is an arbitrary function. (For example, if $\Phi(x) = 3\sqrt{x}$, the cost function is $\mu_{P} + 3\sigma_{P}$.) To minimize uncertainty in the final result, it is meaningful to look for paths with bounded variance, i.e., $\sigma_{P}^2 \le B$ for a given fixed bound $B$. In this paper, we present an exact algorithm to solve the SSP problem in $O(B(V+E))$ time where $V$ and $E$ are the numbers of vertices and edges, respectively, in $G$. Our algorithm is superior to previous algorithms for the SSP problem because we can handle: 1) general graphs (unlike previous works applicable only to directed acyclic graphs), 2) arbitrary edge-weight distributions (unlike previous algorithms designed only for specific distributions such as Gaussian), and 3) general cost functions (none of the previous algorithms can even handle the cost function $\mu_{P} + 3\sigma_{P}$). Finally, we discuss applications of the SSP problem to maze routing, buffer insertions, and timing analysis under parameter variations.
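
One way to reach a running time proportional to $B(V+E)$ is a table indexed by accumulated variance, sketched below. This is an illustration under stated assumptions and is not claimed to be the authors' algorithm. The assumptions are: edge weights are independent so that variances add along a path, every edge variance is a positive integer, edge means are non-negative, and $\Phi$ is non-decreasing so that revisiting a vertex never helps. The function name `statistical_shortest_path` and the edge tuple layout are hypothetical.

```python
import math


def statistical_shortest_path(num_nodes, edges, source, target, bound, phi):
    """Minimize mean + phi(variance) over source-target paths with variance <= bound.

    edges is a list of directed (u, v, mean, variance) tuples with integer
    variance >= 1. Returns (cost, variance) of the best path, or None.
    """
    # best[b][v] = smallest mean of a walk from source to v whose
    # variances sum to exactly b.
    best = [[math.inf] * num_nodes for _ in range(bound + 1)]
    best[0][source] = 0.0
    # Every edge strictly increases the variance, so layer b is final
    # before it is used to update layers b + 1 and above.
    for b in range(bound + 1):
        for u, v, mean, var in edges:
            if best[b][u] == math.inf or b + var > bound:
                continue
            if best[b][u] + mean < best[b + var][v]:
                best[b + var][v] = best[b][u] + mean
    candidates = [
        (best[b][target] + phi(b), b)
        for b in range(bound + 1)
        if best[b][target] != math.inf
    ]
    return min(candidates) if candidates else None


# Two routes from 0 to 3: a short but noisy one (mean 2, variance 9)
# and a longer but steadier one (mean 4, variance 2).
demo_edges = [(0, 1, 1.0, 4), (1, 3, 1.0, 5), (0, 2, 2.0, 1), (2, 3, 2.0, 1)]
risk_averse = statistical_shortest_path(4, demo_edges, 0, 3, 10, lambda x: 3 * math.sqrt(x))
mean_only = statistical_shortest_path(4, demo_edges, 0, 3, 10, lambda x: 0.0)
```

With the cost $\mu_{P} + 3\sigma_{P}$ the steadier route wins, while ignoring variance selects the route with the smaller mean. The table has $B+1$ layers and each layer scans every edge once, which is where the $B(V+E)$ bound of this sketch comes from.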

Session 9D Designers' Forum Panel: (16:30 - 18:00)
Location: Small Auditorium, 5F

9D-1 (Time: 16:30 - 18:00)

Title	Top 10 Design Issues by LSI Designers versus EDA Developers
Author	Organizer: Haruyuki Tago (Senior Manager, TOSHIBA, Japan), Moderator: Yoshiaki Hagihara (Sony Fellow, Sony, Japan), Panelists: Raul Camposano (Sr. VP&CTO, Synopsys, United States), Soo-Kwan Eo (Sr. VP, SAMSUNG, Republic of Korea), Joe Sawichi (VP, Mentor, United States), Hirofumi Taguchi (General manager, Matsushita, Japan), Yasuhiro Tani (Director, CANON, Japan), Ted Vucurevich (CTO, Cadence, Republic of Korea)
