The 12th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule


Wednesday, January 24, 2007

8:30 - 10:00
  1K (Small Auditorium, 5F): Opening Session and Keynote Address I

10:15 - 12:20
  1A (Room 411+412): DFM in Physical Design
  1B (Room 413): SoC Software Design and Performance Analysis
  1C (Room 414+415): Advances in High-Frequency and High-Speed Circuit Design and CAD
  1D (Room 416+417): University Design Contest

13:30 - 15:35
  2A (Room 411+412): New Techniques in Placement
  2B (Room 413): On Chip Communication Methodology
  2C (Room 414+415): Analog CAD Techniques: From Analysis to Verification
  2D (Room 416+417): SPECIAL SESSION: Design for Manufacturability

16:00 - 18:05
  3A (Room 411+412): Routing
  3B (Room 413): System Synthesis and Optimization Techniques
  3C (Room 414+415): Model Checking and Applications to Digital and Analog Circuits
  3D (Room 416+417): SPECIAL SESSION: Embedded Software for Multiprocessor Systems-on-Chip



Thursday, January 25, 2007

9:00 - 10:00
  2K (Small Auditorium, 5F): Keynote Address II

10:15 - 12:20
  4A (Room 411+412): Model Order Reduction and Macromodeling
  4B (Room 413): System Level Modeling
  4C (Room 414+415): Logic Synthesis
  4D (Room 416+417): SPECIAL SESSION: EDA Challenges for Analog/RF

13:30 - 15:35
  5A (Room 411+412): Statistical Interconnect Modeling and Analysis
  5B (Room 413): Optimization Issues in Embedded Systems
  5C (Room 414+415): High-Level Synthesis
  5D (Small Auditorium, 5F): Designers' Forum Panel: Presilicon SoC HW/SW Verification

16:00 - 18:05
  6A (Room 411+412): Timing Modeling and Optimization
  6B (Room 413): Application Examples with Leading Edge Design Methodology
  6C (Room 414+415): Module/Circuit Synthesis

16:00 - 17:50
  6D (Small Auditorium, 5F): Designers' Forum: Low-power SoC Technologies



Friday, January 26, 2007

9:00 - 10:00
  3K (Small Auditorium, 5F): Keynote Address III

10:15 - 12:20
  7A (Room 411+412): Advanced Methods for Leakage Reduction
  7B (Room 413): Uncertainty Aware Interconnect Design
  7C (Room 414+415): Test Cost Reduction Techniques
  7D (Room 416+417): SPECIAL SESSION: Multi-Processor Platforms for Next Generation Embedded Systems

13:30 - 15:35
  8A (Room 411+412): Advancement in Power Analysis and Optimization
  8B (Room 413): Electrical Optimization in Floorplanning/Placement
  8C (Room 414+415): Advances in Test and Diagnosis
  8D (Small Auditorium, 5F): Designers' Forum: High-speed Chip to Chip Signaling Solutions

16:00 - 18:05
  9A (Room 411+412): Power Efficient Design Techniques
  9B (Room 413): Leading Edge Design Methodology for Processors
  9C (Room 414+415): Satisfiability and Applications
  9D (Small Auditorium, 5F): Designers' Forum Panel: Top 10 Design Issues



List of Papers

Wednesday, January 24, 2007

Session 1K Opening Session and Keynote Address I
Time: 8:30 - 10:00 Wednesday, January 24, 2007
Location: Small Auditorium, 5F
Chair: Hidetoshi Onodera (Kyoto Univ., Japan)

1K-1 (Time: 9:00 - 10:00)
Title: (Keynote Address) Next-Generation Design and EDA Challenges: Small Physics, Big Systems, and Tall Tool-chains
Author: Rob A. Rutenbar (Carnegie Mellon Univ., United States)
Abstract: There is much discussion of two challenges in the design of tomorrow's electronics: the difficult "small physics" of nanoscale transistors, and the silicon/software complexity of "big systems". But those of us who want to build beautiful algorithms have an additional hurdle: "tall tool-chains". If it takes 50 tool-steps to build an industrial-strength design flow, and each tool is based on 1-2 "big algorithms", does this mean that each new algorithm idea is worth, at best, 1-2% of the success of a design? This seems to me a bad way of accounting for the tremendous value that EDA brings to the world of design. How can we have a big impact in this important technology area? In this talk, I will offer several pieces of advice for how not to get buried by the tall-tool-chain problem. I will discuss how to identify design problems that can have large impact, how to embrace the strange physics of tomorrow's silicon technologies in the service of building beautiful algorithms, and how to get fresh (and unique) insights on problems by spending time working with a real design team. I will use design examples ranging from lithography, to computational finance, to silicon-based speech recognition, to illustrate the point that this is an exciting time to be working on tomorrow's tool and design challenges.


Session 1A DFM in Physical Design
Time: 10:15 - 12:20 Wednesday, January 24, 2007
Location: Room 411+412
Chairs: Ting-Chi Wang (National Tsing Hua Univ., Taiwan), Toshiyuki Shibuya (Fujitsu Lab., Japan)

1A-1 (Time: 10:15 - 10:40)
Title: Model Based Layout Pattern Dependent Metal Filling Algorithm for Improved Chip Surface Uniformity in the Copper Process
Author: *Subarna Sinha, Jianfeng Luo, Charles Chiang (Synopsys, United States)
Page: pp. 1 - 6
Keyword: Fill, CMP
Abstract: Thickness range, i.e. the difference between the highest point and the lowest point of the chip surface, is a key indicator of chip yield. This paper presents a novel metal filling algorithm that seeks to minimize the thickness range of the chip surface during the copper damascene process. The proposed solution considers the physical mechanisms in the damascene process, namely ECP (which is the process used to deposit Cu in the trenches) and CMP (which is the process used to polish Cu after ECP), that affect thickness range. Key predictors for the final thickness range, which is the thickness range after ECP & CMP, that can be computed efficiently are identified and used to drive the metal filling process. To the best of our knowledge, this is the first metal filling algorithm that uses an ECP model among other things to guide metal filling. Experimental results are very promising and indicate that the proposed method can significantly reduce the thickness range after metal filling. This is in sharp contrast with the density-driven approaches which often increase the thickness range after metal filling, thereby potentially adversely impacting yield. In addition, the proposed method inserts a significantly smaller amount of fill when compared to the density-driven approaches. This is desirable as it limits the impact of metal filling on timing.

1A-2 (Time: 10:40 - 11:05)
Title: Fast and Accurate OPC for Standard-Cell Layouts
Author: *David M. Pawlowski, Liang Deng, Martin D. F. Wong (University of Illinois at Urbana-Champaign, United States)
Page: pp. 7 - 12
Keyword: OPC, cell-wise, boundary-based, RET
Abstract: Model based optical proximity correction (OPC) has become necessary at the 90nm technology node and beyond. Cellwise OPC is an attractive technique to reduce the mask data size as well as the prohibitive runtime of full-chip OPC. As feature dimensions have gotten smaller, the radius of influence for edge features has extended further into neighboring cells such that it is no longer sufficient at the 65nm node and below to perform cellwise OPC independent of neighboring cells, especially for the metal layers. The methodology described in this work accounts for features in neighboring cells and allows a cellwise approach to be applied to cells with gate length of 45nm with the projection that it can also be applied to future technology nodes. OPC-ready cells are generated before placement using boundary-based technology. Each cell has a small number of OPC-ready versions due to an intelligent characterization of standard cell layout features. The total number of cells with boundaries in the OPC-ready library only increases linearly with the number of cells in the original library. Results are very promising: the average edge placement error (EPE) for all metal1 features in 100 layouts is 0.731nm which is less than 1% of metal1 width (80nm), creating similar levels of lithographic accuracy while obviating any of the drawbacks inherent in layout specific full-chip model-based OPC. Even for small circuits, we obtained up to 100X runtime reduction and 35X mask data size shrinking.

1A-3 (Time: 11:05 - 11:30)
Title: Coupling-aware Dummy Metal Insertion for Lithography
Author: *Liang Deng (University of Illinois at Urbana-Champaign, United States), Kaiyuan Chao (Intel Co., United States), Hua Xiang (IBM T.J. Watson Research Center, United States), Martin D. F. Wong (University of Illinois at Urbana-Champaign, United States)
Page: pp. 13 - 18
Keyword: dummy metal, lithography, coupling capacitance, RET
Abstract: As integrated circuit manufacturing technology is advancing into 65nm and 45nm nodes, extensive resolution enhancement techniques (RETs) are needed to correctly manufacture a chip design. The widely used RET called off-axis illumination (OAI) introduces forbidden pitches which lead to very complex design rules. It has been observed that imposing uniformity on layout designs can substantially improve printability under OAI. For metal layers, uniformity can be achieved simply by inserting dummy metal wire segments at all free spaces. Simulation results indeed show significant improvement in printability with such a dummy metal insertion approach. To minimize mask cost, it is advantageous to use dummy metal segments that are of the same size as regular metal wires due to their simple geometry. But these dummy wires are printable and hence increase coupling capacitances and potentially affect yield. The alternative is to use a set of parallel sub-resolution thin wires (which will not be printed) to replace a printable dummy wire segment. These invisible dummy metal segments do not increase coupling capacitances but increase lithography cost, which includes mask cost and RET/process expense. This paper presents a strategy for dummy metal insertion that can optimally trade off lithography cost and coupling capacitance. In particular, we present an optimal algorithm that can minimize lithography cost subject to any given coupling capacitance bound. Moreover, this dummy metal insertion will achieve a highly uniform density because of the locality of coupling capacitance, which automatically ameliorates the chemical mechanical polish (CMP) problem.

1A-4 (Time: 11:30 - 11:55)
Title: Fast Buffer Insertion for Yield Optimization under Process Variations
Author: Ruiming Chen, *Hai Zhou (Northwestern University, United States)
Page: pp. 19 - 24
Keyword: buffer insertion, yield optimization
Abstract: With the emerging process variations in fabrication, the traditional corner-based timing optimization techniques become prohibitive. Buffer insertion is a very useful technique for timing optimization. In this paper, we propose a buffer insertion algorithm with the consideration of process variations. We use the solutions from the deterministic buffering that sets all the random variables at their nominal values to guide the statistical buffering algorithm. Our algorithm keeps the sizes of solution lists small, and always achieves higher yield than the deterministic buffering. The experimental results demonstrate that the existing approaches cannot handle large cases efficiently or effectively, while our algorithm handles large cases very efficiently, and improves the yield by more than 12% on average.

1A-5 (Time: 11:55 - 12:20)
Title: A Global Minimum Clock Distribution Network Augmentation Algorithm for Guaranteed Clock Skew Yield
Author: *Bao Liu, Andrew Kahng, Xu Xu (University of California, San Diego, United States), Jiang Hu, Ganesh Venkataraman (Texas A&M University, United States)
Page: pp. 25 - 31
Keyword: clock, distribution, robust, design
Abstract: Nanometer VLSI systems demand robust clock distribution network design for increased process and operating condition variabilities. In this paper, we propose minimum clock distribution network augmentation for guaranteed skew yield. We present theoretical analysis results on an inserted link in a clock network, which scales down local skew and skew variation, but may not guarantee global skew and skew variation reduction in general. We propose a global minimum clock network augmentation algorithm, which inserts links simultaneously between all nearest sink pairs, applies rule-based link removal, and performs link consolidation by Steiner minimum tree construction for wirelength reduction with guaranteed clock skew yield. Our experimental results show that our proposed algorithm achieves dominant clock network augmentation solutions, e.g., an average of 16% clock skew yield improvement, 9% maximum skew reduction, and 25% reduction of clock skew variation standard deviation with identical wirelength compared with previous best clock network link insertion methods.


Session 1B SoC Software Design and Performance Analysis
Time: 10:15 - 12:20 Wednesday, January 24, 2007
Location: Room 413
Chairs: Qiang Zhu (Fujitsu Lab., Japan), Youn-Long Steve Lin (National Tsing-Hua Univ., Taiwan)

1B-1 (Time: 10:15 - 10:40)
Title: Control-Flow Aware Communication and Conflict Analysis of Parallel Processes
Author: Axel Siebenborn, *Alexander Viehl, Oliver Bringmann (FZI Forschungszentrum Informatik, Germany), Wolfgang Rosenstiel (Universität Tübingen, Germany)
Page: pp. 32 - 37
Keyword: Architectural Exploration, Performance Analysis, Environment Modeling, Bus Allocation, SystemC
Abstract: In this paper, we present an approach for control-flow aware communication and conflict analysis of systems of parallel communicating processes. This approach makes it possible to determine the global timing behavior of such a system and to detect communication that might produce conflicts on shared communication resources. Furthermore, we show the incorporation of temporal environment models in order to analyze their influence on the system behavior. Based on the determined conflicts, an automated allocation and binding approach for shared resources to resolve potential access conflicts is proposed. All analysis steps can be performed starting with a TLM SystemC model of the entire system without any need for user interaction. Finally, a SystemC model of a Viterbi decoder is used as a case study to demonstrate the capability of our approach.

1B-2 (Time: 10:40 - 11:05)
Title: Software Performance Estimation in MPSoC Design
Author: Marcio Oyamada, *Flavio Wagner (UFRGS, Brazil), Marius Bonaciu (TIMA Lab., France), Wander Cesario (MnD, France), Ahmed Jerraya (TIMA Lab., France)
Page: pp. 38 - 43
Keyword: Performance Estimation, MPSoC
Abstract: Estimation tools are a key component of system-level methodologies, enabling a fast design space exploration. Estimation of software performance is essential in current software-dominated embedded systems. This work proposes an integrated methodology for system design and performance analysis. An analytic approach based on neural networks is used for high-level software performance estimation. At a functional level, this analytic tool enables a fast evaluation of the performance to be obtained with selected processors, which is an essential task for the definition of a “golden” architecture. From this architectural definition, a tool that refines hardware and software interfaces produces a bus-functional model. A virtual prototype is then generated from the bus-functional model, providing a global, cycle-accurate simulation model and offering several features for design validation and detailed performance analysis. Our work thus combines an analytic approach at functional level and a simulation-based approach at bus functional level. This provides an adequate trade-off between estimation time and precision. A multiprocessor platform implementing an MPEG4 encoder is used as a case study, and the analytic estimation results in errors of only up to 17% compared to the virtual platform simulation. On the other hand, the analytic estimation takes only 17 seconds, against 10 minutes using the cycle-accurate simulation model.

1B-3 (Time: 11:05 - 11:30)
Title: Effective OpenMP Implementation and Translation for Multiprocessor System-On-Chip without using OS
Author: *Woo-Chul Jeun, Soonhoi Ha (Seoul National University, Republic of Korea)
Page: pp. 44 - 49
Keyword: OpenMP, MPSoC, parallel programming, shared memory, synchronization
Abstract: It is attractive to use OpenMP as a parallel programming model on a Multiprocessor System-On-Chip (MPSoC) because it is easy to write a parallel program in OpenMP and there is no standard method for parallel programming on an MPSoC. In this paper, we propose an effective OpenMP implementation and translation for major OpenMP directives on an MPSoC with physically shared memories, hardware semaphores, and no operating system.

1B-4 (Time: 11:30 - 11:55)
Title: Creating Explicit Communication in SoC Models Using Interactive Re-Coding
Author: *Pramod Chandraiah, Junyu Peng, Rainer Doemer (University of California, Irvine, United States)
Page: pp. 50 - 55
Keyword: System Level Design, SoC Specification, Refinement, Modeling, Design Methodology
Abstract: Communication exploration has become a critical step during SoC design. Researchers in the CAD community have proposed fast and efficient techniques for comprehensive design space exploration to expedite this critical design step. Although these advances have been helpful in reducing the design time significantly, the overall design time of the system is still a bottleneck. All these techniques assume the availability of an initial SoC input model with explicit communication, whose quality significantly impacts the effectiveness of the communication exploration techniques. Today, these initial models need to be manually written by engineers, which is tedious, error-prone and time consuming. In fact, our studies on industrial-size examples have shown that about 50% of the communication exploration time is spent on coding and re-coding of the initial specification model. In this paper, we propose an efficient interactive approach to explicit communication creation by automating some of the common coding tasks in specification models for communication exploration. Our results show significant savings in designer time.

1B-5 (Time: 11:55 - 12:20)
Title: System Architecture for Software Peripherals
Author: *Siddharth Choudhuri, Tony Givargis (University of California, Irvine, United States)
Page: pp. 56 - 61
Keyword: software peripherals
Abstract: Software Peripherals have been proposed as a design alternative to traditional peripherals. We propose a software architecture, design methodology and scheduling scheme for implementing software peripherals on general purpose processors, with fast context switch and high resolution timers. Our design flow automatically generates code for scheduling software peripherals. We demonstrate the feasibility of our proposed work by experimenting with a set of five software peripherals scheduled to execute on a MIPS processor. Our performance evaluations show that the performance impact of the software peripherals on user-level tasks is minimal (i.e., 10.11% on a 100 MHz processor) -- strongly suggesting that with the right architecture, software peripherals can be efficiently accommodated in typical embedded applications.


Session 1C Advances in High-Frequency and High-Speed Circuit Design and CAD
Time: 10:15 - 12:20 Wednesday, January 24, 2007
Location: Room 414+415
Chairs: Jaijeet Roychowdhury (Univ. of Minnesota, United States), Tomohisa Kimura (Toshiba, Japan)

1C-1 (Time: 10:15 - 10:40)
Title: A New Boundary Element Method for Multiple-Frequency Parameter Extraction of Lossy Substrates
Author: Xiren Wang, *Wenjian Yu, Zeyi Wang (Tsinghua University, China)
Page: pp. 62 - 67
Keyword: substrate extraction, frequency-dependent parameter, boundary element method, multiple frequency
Abstract: The couplings via realistic lossy substrates can be modeled as frequency-dependent coupling parameters. The fast extraction at multiple frequencies can be accomplished in two sequential steps. The first is to extract the coupling resistance using a direct boundary element method (DBEM). The second is to revise the resistance into the parameter at the frequency in an exact and rapid way. The first step is time-consuming, but it runs only once; the second repeats at each frequency, but is much easier. The more frequencies are calculated, the greater the advantage of this method. Numerical experiments illustrate that this method has high accuracy, and it can be hundreds of times faster than an advanced Green's function based method. Substrates with arbitrary doping profiles can also be easily handled, which is partly verified by experiment.

1C-2 (Time: 10:40 - 11:05)
Title: Hierarchical Optimization Methodology for Wideband Low Noise Amplifiers
Author: Arthur Nieuwoudt, Tamer Ragheb, *Yehia Massoud (Rice University, United States)
Page: pp. 68 - 73
Keyword: Low Noise Amplifiers, Wideband, Optimization, Synthesis
Abstract: In this paper, we present a systematic synthesis methodology for fully integrated wideband low noise amplifiers that simultaneously optimizes impedance matching, noise figure, and other performance parameters. Leveraging an accurate analytical model, we hierarchically couple global optimization techniques with local convex optimization methods to efficiently locate optimal wideband LNA circuits. The results indicate that the methodology yields significant improvement in key LNA design constraints over existing methodologies while achieving up to one order of magnitude speedup in computational performance.

1C-3 (Time: 11:05 - 11:30)
Title: PLLSim - An Ultra Fast Bang-bang Phase Locked Loop Simulation Tool
Author: *Michael James Chan, Adam Postula (University of Queensland, Australia), Yong Ding (NanoSilicon Pty Ltd, Australia)
Page: pp. 74 - 79
Keyword: PLL, Bang-bang PLL, Behavioral Simulation, Jitter
Abstract: This paper presents a simulation tool targeted specifically at bang-bang type phase locked loop systems. The aim of this simulator is to quickly and accurately predict important PLL transient characteristics such as capture range, locking time, and jitter. We present a behavioral model for bang-bang type PLLs, and show how application of this model in a simulator can speed up simulation time by four to five orders of magnitude. With this performance, Monte-Carlo simulation techniques become not only feasible, but convenient. The simulator also models the major non-idealities typical of phase locked loop systems. The accuracy of the simulator is confirmed via detailed analysis and comparison with Matlab Simulink based models.

1C-4 (Time: 11:30 - 11:55)
Title: A Programmable Fully-Integrated GPS receiver in 0.18µm CMOS with Test Circuits
Author: Mahta Jenabi, *Noshin Riahi, Ali Fotowat-Ahmadi (Unistar Micro Technology Inc., Canada)
Page: pp. 80 - 85
Keyword: GPS, RF design, Testability
Abstract: A 0.18um single chip GPS receiver with 19.5 mA power consumption is implemented in 6.5 mm^2. A serial input digital control with additional testing structure not adding more than 4% to the Si area are used to the actual RF circuits in case of problems minimizing the number of Si runs.

1C-5 (Time: 11:55 - 12:20)
Title: Ultralow-Power Reconfigurable Computing with Complementary Nano-Electromechanical Carbon Nanotube Switches
Author: Swarup Bhunia, *Massood Tabib Azar, Daniel Saab (Case Western Reserve University, United States)
Page: pp. 86 - 91
Keyword: Reconfigurable, Low-power, Carbon nanotube
Abstract: In recent years, several alternative devices have been proposed to deal with the inherent limitations of conventional CMOS devices in terms of scalability at nanometer scale geometry. The fabrication and integration costs of these devices, however, have been prohibitive and/or the devices do not allow smooth transition from the conventional design paradigm. To address some of these limitations, we have developed a new family of devices called “Complementary Nano Electro-Mechanical Switches” (CNEMS) using carbon nanotubes as active switching/latching elements. The basic structure of these devices consists of three co-planar carbon nanotubes arranged so that the central nanotube can touch the two side carbon nanotubes upon application of a voltage pulse between them. Owing to the unique properties of carbon nanotubes, these devices have very low leakage current, low operation voltages, and have built-in energy storage to reduce computation power, resulting in very low overall power dissipation. CNEMS have stable on-off state and latching mechanism for non-volatile memory-mode operation. Besides, the devices can be readily integrated in the same substrate as CMOS transistors with high integration densities - thus, allowing easy manufacturability and hybridization with conventional CMOS devices. In this paper, we present the properties of these devices and based on our analysis, we propose a reconfigurable computation framework using these devices. For the first time, we demonstrate that these devices are promising in dynamically reconfigurable instant-on system development with about 25X lower power dissipation.


Session 1D University Design Contest
Time: 10:15 - 12:20 Wednesday, January 24, 2007
Location: Room 416+417
Chairs: Makoto Nagata (Kobe University, Japan), Fumio Arakawa (Hitachi, Japan)

1D-1 (Time: 10:15 - 10:20)
Title: A 1Tb/s 3W Inductive-Coupling Transceiver Chip
Author: *Noriyuki Miura, Tadahiro Kuroda (Keio University, Japan)
Page: pp. 92 - 93
Keyword: SiP, inductive coupling, wireless communication, high bandwidth, low power
Abstract: A 1Tb/s 3W inter-chip transceiver transmits clock and data by inductive coupling at a clock rate of 1GHz and data rate of 1Gb/s per channel. 1024 data transceivers are arranged with a pitch of 30um in a layout area of 1mm^2. Bi-Phase Modulation is employed for the data link to improve noise immunity, reducing power in the transceiver. 4-phase Time Division Multiplexing reduces crosstalk and channel pitch. The BER is lower than 10^-13 with 150ps timing margin.

1D-2 (Time: 10:20 - 10:25)
Title: 22-29GHz Ultra-Wideband CMOS Pulse Generator for Collision Avoidance Short Range Vehicular Radar Sensors
Author: *Ahmet Oncu, B.B.M. Wasanthamala Badalawa, Tong Wang, Minoru Fujishima (The University of Tokyo, Japan)
Page: pp. 94 - 95
Keyword: UWB CMOS pulse generator, 22-29GHz pseudo-millimeter-wave ultra-wideband (UWB), short-range automotive radar
Abstract: The pseudo-millimeter-wave ultra-wideband (UWB) is attractive for applications in short-range automotive radar systems using 22 to 29GHz in order to realize road safety and intelligent transportation. Although CMOS is suitable for the short-range radar since processing units can be implemented in the same chip with the UWB front-end building block, it is difficult to operate CMOS pulse generators at such a high frequency. To realize the pseudo-millimeter-wave band using CMOS, we have proposed a new pulse generator consisting of a series of delay cells and edge combiners with waveform shaping. As a result of measurement using 90nm CMOS technology, 1Gbps/bit pulses are successfully generated with a power consumption of 1.4mW at a supply voltage of 0.9V. This result will be the key technology for a one-chip short-range radar system.

1D-3 (Time: 10:25 - 10:30)
Title: A 2.8-V Multibit Complex Bandpass Delta-Sigma AD Modulator in 0.18µm CMOS
Author: *Hao San, Yoshitaka Jingu, Hiroki Wada, Hiroyuki Hagiwara, Akira Hayakawa, Haruo Kobayashi (Gunma University, Japan), Masao Hotta (Musashi Institute of Technology, Japan)
Page: pp. 96 - 97
Keyword: Complex Bandpass Delta-Sigma AD Modulator, Complex Filter, Multi-bit Modulator, DWA Algorithm
Abstract: A second-order multibit switched-capacitor (SC) complex bandpass Delta-Sigma AD modulator has been designed, fabricated and tested for application to low-IF receivers in wireless communication systems. We have employed two new algorithms there to improve the signal-to-noise-and-distortion ratio (SNDR) of the modulator. (i) A complex bandpass filter with I, Q dynamic matching to reduce the mismatch influence between I, Q paths. As its by-product, the complex modulator can be divided into two separate parts without signal line crossing between the upper and lower paths. Therefore, the layout design of the modulator can be greatly simplified; (ii) A new complex bandpass Data-Weighted Averaging (DWA) algorithm is implemented to suppress nonlinearity effects of multibit DACs in complex form to achieve high accuracy. Implemented in a 0.18-µm CMOS process and at 2.8V supply, the modulator achieves a measured peak SNDR of 64.5dB at 20MS/s with a signal bandwidth of 78kHz while dissipating 28.4mW and occupying an area of 1.82mm^2.

1D-4 (Time: 10:30 - 10:35)
Title: A Wideband CMOS LC-VCO Using Variable Inductor
Author: *Kazuma Ohashi, Yusaku Ito, Yoshiaki Yoshihara, Kenichi Okada, Kazuya Masu (Tokyo Institute of Technology, Japan)
Page: pp. 98 - 99
Keyword: Wideband, LC-VCO, Variable Inductor, MEMS, 0.18um CMOS
Abstract: This paper proposes a novel wide-range tunable CMOS voltage controlled oscillator (VCO). The VCO uses an on-chip variable inductor and switched capacitors as variable elements. The VCO was fabricated using a standard 0.18 um CMOS process with five metal layers. The oscillation frequency can be tuned from 1.28 GHz to 2.75 GHz with a tuning range of 72 %.

1D-5 (Time: 10:35 - 10:40)
Title: Design of Active Substrate Noise Canceller using Power Supply di/dt Detector
Author: *Taisuke Kazama, Toru Nakura (The University of Tokyo, Japan), Makoto Ikeda, Kunihiro Asada (VLSI Design and Education Center, The University of Tokyo, Japan)
Page: pp. 100 - 101
Keyword: substrate noise, di/dt, on-chip noise canceller
Abstract: With the growing demand for mixed-signal designs such as A/D, D/A and PLL integrated with large scale digital circuits, substrate noise becomes a serious concern. On the other hand, the remedies using guard ring and decoupling capacitor do not have enough efficiency against high frequency noise due to their parasitic component. To suppress the impact of substrate noise, on-chip active noise cancelling technique using di/dt detector has been proposed. This paper introduces an example design of feedforward active substrate noise canceling technique using multiple power supply di/dt detector and demonstrates the noise cancelling results by the measurement of a 0.35 µm CMOS test chip.

1D-6 (Time: 10:40 - 10:45)
Title: A 20 Gbps Scalable Load Balanced Birkhoff-von Neumann Symmetric TDM Switch IC with SERDES Interfaces
Author: *Yu-Hao Hsu, Min-Sheng Kao, Hou-Cheng Tzeng, Ching-Te Chiu, Jen-Ming Wu (Inst. of Communications Engineering, NTHU, Taiwan), Shuo-Hung Hsu (Inst. of Electronics Engineering, NTHU, Taiwan)
Page: pp. 102 - 103
Keyword: TDM switch IC, SERDES, 8B10B CODEC, CML, half-rate
Abstract: For the first time, we implemented a reconfigurable load-balanced TDM switch IC with SERDES interface circuits for high speed networking applications. An NxN TDM switch could be constructed recursively from the TDM switch IC to achieve switching capacity of hundred gigabits per second or higher. The TDM switch IC contained a digital 8x8 TDM switch core with 8B10B CODECs and analog SERDES I/O interfaces. In the I/O interfaces, eight 2.56/3.2Gbps dual-mode 16/20:1 SERDES with CML buffers were developed. The 16/20:1 instead of 8/10:1 serializer and deserializer were used to reduce the required operating frequency in the switch core by half. New half-rate architectures and all static CMOS gates were used in the 16/20:1 serializer and deserializer for the low power consumption. A wide-band CML I/O buffer with our patented PMOS active load scheme was developed. All implementations were based on the 0.18 µm CMOS technology. Our implementation showed a 20 Gbps switching capacity for the 8x8 TDM switch IC.

1D-7 (Time: 10:45 - 10:50)
TitleReconfigurable CMOS Low Noise Amplifier Using Variable Bias Circuit for Self Compensation
Author*Satoshi Fukuda, Daisuke Kawazoe, Kenichi Okada, Kazuya Masu (Tokyo Institute of Technology, Japan)
Pagepp. 104 - 105
Keywordself-compensation, reconfigurable, LNA, variable bias circuit
AbstractThis paper proposes a reconfigurable low noise amplifier (LNA) to realize self compensation of performance. Power consumption and intermodulation are compensated by the bias voltage of the input transistor. By tuning the bias voltage according to the input signal, the proposed LNA achieves more than 33 dBm improvement in delta-IM3, and 87 % of power reduction is realized at 1.9 GHz as compared to an LNA with a fixed bias voltage.

1D-8 (Time: 10:50 - 10:55)
TitlePseudo-Millimeter-Wave Up-Conversion Mixer with On-Chip Balun for Vehicular Radar Systems
Author*Chee Hong Ivan Lai, Minoru Fujishima (University of Tokyo, Japan)
Pagepp. 106 - 107
Keywordup-conversion mixer, Marchand balun, substrate losses
AbstractA low-power, fully integrated 20-26 GHz broadband up-conversion mixer implemented with on-chip Marchand baluns is demonstrated on 90nm CMOS technology in this paper. The baluns employ capacitive coupling between two metal layers and include slotted shields to reduce substrate losses. At 22.1 GHz, the integrated mixer achieves a conversion gain of 2 dB with a maximum power dissipation of only 11.1mW from a 1.2V dc power supply at LO power of 5 dBm.

1D-9 (Time: 10:55 - 11:00)
TitleImproving Execution Speed of FPGA using Dynamically Reconfigurable Technique
AuthorRoel Pantonial, Md. Ashfaquzzaman Khan (Graduate School of Engineering, Tohoku University, Japan), *Naoto Miyamoto (New Industry Creation Hatchery Center, Tohoku University, Japan), Koji Kotani, Shigetoshi Sugawa (Graduate School of Engineering, Tohoku University, Japan), Tadahiro Ohmi (New Industry Creation Hatchery Center, Tohoku University, Japan)
Pagepp. 108 - 109
Keyworddynamic, reconfigurable, FPGA, temporal, interconnect
AbstractThis paper reports the architecture and performance of Flexible Processor III (FP3), a novel multi-context dynamically reconfigurable FPGA (DRFPGA) designed and fabricated in 0.35um 2P3M CMOS technology. FP3 employs a newly developed shift register-type temporal communication module to reduce the critical path delay. Our experimental results show, for the first time, that there exist cases where the fastest speed is achieved when multiple contexts are in use.

1D-10 (Time: 11:00 - 11:05)
TitleSingle-Issue 1500MIPS Embedded DSP with Ultra Compact Codes
Author*Li-Chun Lin, Shih-Hao Ou (National Chiao Tung University, Taiwan), Tay-Jyi Lin (Industrial Technology Research Institute, Taiwan), Siang-Sen Deng, Chih-Wei Liu (National Chiao Tung University, Taiwan)
Pagepp. 110 - 111
KeywordDSP
AbstractThe performance of single-issue RISC cores can be improved significantly with multi-issue architectures (i.e. superscalar or VLIW) by activating the parallel functional units concurrently. However, they suffer from high complexity or huge code sizes. In this paper, we borrow some ideas from old vector machines and propose a novel DSP architecture with very compact codes. In our simulations, the DSP has comparable performance to a 5-issue VLIW core with identical computing resources. However, its code sizes are reduced by a factor of 8. The DSP core has been implemented in the TSMC 0.13um CMOS technology, where the operating frequency is 305MHz and the silicon area is 1.45×1.4 mm² including 12KB on-chip memory.

1D-11 (Time: 11:05 - 11:10)
TitleA Highly Integrated 8 mW H.264/AVC Main Profile Real-time CIF Video Decoder on a 16 MHz SoC Platform
AuthorHuan-Kai Peng, Chun-Hsin Lee, Jian-Wen Chen, Tzu-Jen Lo, Yung-Hung Chang, Sheng-Tsung Hsu, Yuan-Chun Lin, Ping Chao, *Wei-Cheng Hung, Kai-Yuan Jan (National Tsing Hua University, Taiwan)
Pagepp. 112 - 113
KeywordH.264, AVC, CABAD, SoC
AbstractWe present a hardwired decoder prototype for H.264/AVC main profile video. Our design takes as its input a compressed H.264/AVC bit-stream and produces as its output video frames ready for display. We wrap the decoder core with an AMBA-AHB bus interface and integrate it into a multimedia SoC platform. Several architectural innovations at both IP and system levels are proposed to achieve very high performance at very low operating frequency. Running on a 16 MHz FPGA, the whole demo system is able to decode CIF (352x288) video in real time at 30 frames per second. Moreover, we take system cost into consideration such that only a single external SDRAM is needed and memory traffic is minimized.

1D-12 (Time: 11:10 - 11:15)
TitleConfigurable AMBA On-Chip Real-Time Signal Tracer
Author*Chung-Fu Kao, Chi-Hung Lin, Ing-Jer Huang (Dept. of Computer Science & Engineering, National Sun Yat-Sen University, Taiwan)
Pagepp. 114 - 115
Keywordtrace, debugging, bus
AbstractThis paper proposes an embedded AMBA signal tracer for microprocessor-based SoCs. The tracer provides five trace resolution modes that can perform cycle-accurate or transaction-based trace collection for an unlimited time. The tracer is implemented in a soft-IP style and provides four parameters for tracing configuration. The experimental results show that the bus tracer can reach a good compression ratio of 96%.

1D-13 (Time: 11:15 - 11:20)
TitleImplementation of a Standby-Power-Free CAM Based on Complementary Ferroelectric-Capacitor Logic
Author*Shoun Matsunaga, Takahiro Hanyu (Tohoku University, Japan), Hiromitsu Kimura, Takashi Nakamura, Hidemi Takasu (ROHM, Japan)
Pagepp. 116 - 117
Keywordcomplementary ferroelectric-capacitor logic, content-addressable memory, standby-power-free
AbstractA complementary ferroelectric-capacitor (CFC) logic-circuit style is proposed for a compact and standby-power-free content-addressable memory (CAM). Since the use of the CFC logic circuit in designing a CAM cell makes it possible to merge both logic and non-volatile storage elements into serially connected ferroelectric capacitors, the CAM becomes compact. The standby power of the CAM is completely eliminated because the supply voltage can be cut off while maintaining the data stored in the CAM. The test chip is fabricated using 0.35-um ferroelectric CMOS, and its basic behavior is also measured.

1D-14 (Time: 11:20 - 11:25)
TitleA Multi-Drop Transmission-Line Interconnect in Si LSI
Author*Junki Seita, Hiroyuki Ito, Kenichi Okada, Takashi Sato, Kazuya Masu (Tokyo Institute of Technology, Japan)
Pagepp. 118 - 119
Keywordtransmission line, branch
AbstractThis paper proposes a branching method for on-chip transmission line interconnects, which can reduce the delay and power of global interconnects. A 6-mm-long transmission line interconnect with a branch is fabricated using a 0.18um standard Si CMOS process, and the measurement results demonstrate 4Gbps signal transmission.

1D-15 (Time: 11:25 - 11:30)
TitleA 10Gbps/channel On-Chip Signaling Circuit with an Impedance-Unmatched CML Driver in 90nm CMOS Technology
Author*Takeshi Kuboki, Akira Tsuchiya, Hidetoshi Onodera (Kyoto University, Japan)
Pagepp. 120 - 121
Keywordon-chip, signaling, current-mode-logic
AbstractAn on-chip signaling system consisting of a CML driver, a differential transmission line and a CML receiver is fabricated. We developed an impedance-unmatched driver for power reduction. The impedance-unmatched driver reduces the tail current of the CML buffer by tuning the load resistance. The designed circuit achieves 3mm, 10Gbps/channel on-chip signal transmission, and the impedance-unmatched driver reduces the energy per bit by 21% compared with the conventional impedance-matched driver.

1D-16 (Time: 11:30 - 11:35)
TitleA 90nm 8x16 FPGA Enhancing Speed and Yield Utilizing Within-Die Variations
Author*Yuuri Sugihara, Manabu Kotani, Kazuya Katsuki, Kazutoshi Kobayashi, Hidetoshi Onodera (Kyoto University, Japan)
Pagepp. 122 - 123
Keywordvariation, FPGA
AbstractWe have fabricated an FPGA device with functionality for measuring within-die variations in a 90nm process. Measured variations are used to configure each device to maximize the operating frequency by allocating critical paths to faster portions. Variations are measured using ring oscillators implemented as a configuration of the FPGA. Placement optimization using a simple model circuit reveals that the performance of the circuit is enhanced by 4% on average. The yield is enhanced by 32% relative to the worst case.

1D-17 (Time: 11:35 - 11:40)
TitleA 0.35um CMOS 1,632-gate-count Zero-Overhead Dynamic Optically Reconfigurable Gate Array VLSI
Author*Minoru Watanabe, Fuminori Kobayashi (Kyushu Institute of Technology, Japan)
Pagepp. 124 - 125
KeywordFPGAs, PLDs, optical reconfiguration
AbstractA Zero-Overhead Dynamic Optically Reconfigurable Gate Array VLSI (ZO-DORGA-VLSI) has been developed. It is based on a concept that uses the junction capacitance of photodiodes and the load capacitance of the gates constructing a gate array as configuration memory, removing the static memory function otherwise needed to store a context. In this paper, the performance of a 1,632-gate ZO-DORGA-VLSI, fabricated on a 4.9 mm square chip using a 0.35 µm CMOS process, is presented. In addition, the design of a ZO-DORGA-VLSI with over 10,000 gates is presented.

1D-18 (Time: 11:40 - 11:45)
TitleLow-Power High-Speed 180-nm CMOS Clock Drivers
Author*Tadayoshi Enomoto, Suguru Nagayama, Nobuaki Kobayashi (Chuo University, Japan)
Pagepp. 126 - 127
Keywordpower dissipation, delay time, dynamic current, short-circuit current, CMOS
AbstractThe power dissipation (PT) and delay time (tdT) of a CMOS clock driver were minimized. Eight test circuits, each of which has 2 two-stage clock drivers, and a register array were fabricated using 0.18-µm CMOS technology. The first and second stages of the driver consisted of a single inverter and m inverters, respectively, and the register array stage was constructed with N delay flip-flops (D-FFs). A single inverter in the second stage drove N/m D-FFs, where N was fixed at 40 and m varied from 1 to 40. Minimum PT and tdT were 251 µW and 0.640 ns, respectively, and were both obtained at an m of 8. These values were 48.6% and 29.4% of maximum PT and tdT, respectively. Measured results agreed well with the SPICE-simulated results.


Session 2A New Techniques in Placement
Time: 13:30 - 15:35 Wednesday, January 24, 2007
Location: Room 411+412
Chairs: Shin'ichi Wakabayashi (Hiroshima City Univ., Japan), Hung-Ming Chen (National Chiao Tung Univ., Taiwan)

2A-1 (Time: 13:30 - 13:55)
TitleFast Analytic Placement using Minimum Cost Flow
Author*Ameya R Agnihotri, Patrick H Madden (SUNY Binghamton, United States)
Pagepp. 128 - 134
KeywordPlacement, Physical Design, Analytic placement
AbstractMany current integrated circuit designs, such as those released for the ISPD2005 placement contest, are extremely large and can contain a great deal of white space. These new placement problems are challenging; analytic placers perform well, but can suffer from high run times. In this paper, we present a new placement tool called Vaastu. Our approach combines continuous and discrete optimization techniques. We utilize network flows, which incorporate the more realistic half-perimeter wire length objective, to facilitate module spreading in conjunction with a log-sum-exponential function based analytic approach. Our approach obtains wire length results that are competitive with the best known results, but with much lower run times.
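
The two wire length objectives named in this abstract are standard in analytic placement and can be sketched briefly. The code below is a generic illustration, not the Vaastu implementation: the function names, the `gamma` smoothing parameter, and the pin-list representation are assumptions made for the example.

```python
import math


def hpwl(pins):
    """Half-perimeter wire length of one net: width plus height of the
    bounding box enclosing all pin coordinates (x, y)."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def lse_wirelength(pins, gamma=1.0):
    """Log-sum-exponential smoothing of HPWL. It is differentiable, never
    smaller than HPWL, and approaches HPWL as gamma shrinks toward zero."""

    def smooth_span(vals):
        hi, lo = max(vals), min(vals)
        # Shift by the max/min before exponentiating for numerical stability.
        smooth_max = hi + gamma * math.log(
            sum(math.exp((v - hi) / gamma) for v in vals))
        smooth_min = lo - gamma * math.log(
            sum(math.exp((lo - v) / gamma) for v in vals))
        return smooth_max - smooth_min

    return (smooth_span([x for x, _ in pins])
            + smooth_span([y for _, y in pins]))


if __name__ == "__main__":
    net = [(0, 0), (3, 4), (1, 2)]
    print(hpwl(net))
    print(lse_wirelength(net, gamma=1.0))
    print(lse_wirelength(net, gamma=0.01))
```

A smaller `gamma` tracks HPWL more closely but makes the function harder to optimize with gradient methods, which is one reason an analytic placer might pair the smooth model with a discrete step that works on HPWL directly.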

2A-2 (Time: 13:55 - 14:20)
TitleFastPlace 3.0: A Fast Multilevel Quadratic Placement Algorithm with Placement Congestion Control
Author*Natarajan Viswanathan, Min Pan, Chris Chu (Iowa State University, United States)
Pagepp. 135 - 140
KeywordQuadratic Placement, Iterative Local Refinement, Multilevel Placement
AbstractIn this paper, we present FastPlace 3.0 - an efficient and scalable multilevel quadratic placement algorithm for large-scale mixed-size designs. The main contributions of our work are: (1) A multilevel global placement framework, by incorporating a two-level clustering scheme within the flat analytical placer FastPlace. (2) An efficient and improved Iterative Local Refinement technique that can handle placement blockages and placement congestion constraints. (3) A congestion aware standard-cell legalization technique in the presence of blockages. On the ISPD-2005 placement benchmarks, our algorithm is 5.12X, 11.52X and 16.92X faster than mPL6, Capo10.2 and APlace2.0 respectively. In terms of wirelength, we are on average, 2% higher as compared to mPL6 and 9% and 3% better as compared to Capo10.2 and APlace2.0 respectively. We also achieve competitive results compared to a number of academic placers on the placement congestion constrained ISPD-2006 placement benchmarks.

2A-3 (Time: 14:20 - 14:45)
TitleHippocrates: First-Do-No-Harm Detailed Placement
AuthorHaoxing Ren (IBM, United States), *David Pan (University of Texas at Austin, United States), Charles J Alpert, Gi-Joon Nam, Paul Villarrubia (IBM, United States)
Pagepp. 141 - 146
Keywordplacement, timing, detailed placement
AbstractPhysical synthesis optimizations and engineering change orders typically change the locations of cells, resize cells or add more cells to the design after global placement. Unfortunately, those changes usually lead to wirelength increases; thus another pass of optimizations to further improve wirelength, timing and routing congestion characteristics is required. Simple wirelength-driven detailed placement techniques could be useful in this scenario. While such techniques can help to reduce wirelength, ones without careful timing constraint considerations might degrade the timing characteristics (worst negative slack, total negative slack, etc.) and/or introduce more electrical violations (exceeding maximum output load constraints and maximum input slew constraints). In this paper, we propose a new detailed placement paradigm, which uses a set of pin-based timing and electrical constraints in detailed placement to prevent it from degrading timing or violating electrical constraints while reducing wirelength, thus dubbed Hippocrates: FIRST-DO-NO-HARM optimizations. Our experimental results show great promise. By honoring these constraints, our detailed placement technique not only reduces total wirelength (TWL), but also significantly improves timing, achieving 37% better total negative slack (TNS).

2A-4 (Time: 14:45 - 15:10)
TitleECO-system: Embracing the Change in Placement
Author*Jarrod Roy, Igor Markov (University of Michigan, United States)
Pagepp. 147 - 152
KeywordPlacement, ECO, Physical Synthesis
AbstractIn a realistic design flow, circuit and system optimizations must interact with physical aspects of the design. For example, improvements in timing and power may require replacing large modules with variants that have different power/delay trade-off, shape and connectivity. New logic may be added late in the design flow, subject to interconnect optimization. To support such flexibility in design flows we develop a robust system for performing Engineering Change Orders (ECOs). In contrast with existing stand-alone tools that offer poor interfaces to the design flow and cannot handle a full range of modern VLSI layouts, our ECO-system reliably handles fixed objects and movable macros in instances with widely varying amounts of whitespace. It detects geometric regions and sections of the netlist that require modification and applies an adequate amount of change in each case. Given a reasonable initial placement, it applies minimal changes, but is capable of re-placing large regions to handle pathological cases. ECO-system can be used in the range from high-level synthesis, to physical synthesis and detail placement.

2A-5 (Time: 15:10 - 15:35)
TitleBisection Based Placement for the X Architecture
Author*Satoshi Ono (SUNY Binghamton CSD, United States), Sameer Tilak (Supercomputer Center, United States), Patrick H. Madden (SUNY Binghamton CSD, United States)
Pagepp. 153 - 158
Keywordplacement, x architecture
AbstractRising interconnect delay and power consumption have motivated the investigation of alternative integrated circuit routing architectures. In particular, the X Architecture, which features preferred routing in diagonal directions, has gained a measure of industry support, and has even been validated at 65nm. While there has been extensive study of Manhattan design methods, there are markedly fewer published results for non-Manhattan design. To help fill this gap, we study a patented placement method for the X Architecture; to our knowledge, there have been no prior published results for the method. Surprisingly, we find that the patented method in fact performs worse than traditional Manhattan methods -- for both Manhattan and X routing metrics. We also present a theoretic formulation which explains why solution quality is degraded. Many groups in industry are evaluating the merits of non-Manhattan routing architectures. By providing concrete experimental results, we hope to improve the accuracy of these evaluations.


Session 2B On Chip Communication Methodology
Time: 13:30 - 15:35 Wednesday, January 24, 2007
Location: Room 413
Chairs: Soonhoi Ha (Seoul National Univ., Republic of Korea), Nikil Dutt (Univ. of California, Irvine, United States)

2B-1 (Time: 13:30 - 13:55)
TitleSlack-based Bus Arbitration Scheme for Soft Real-time Constrained Embedded Systems
Author*Minje Jun, Kwanhu Bang (Yonsei University, Republic of Korea), Hyuk-Jun Lee (Cisco Systems Incorporated, United States), Naehyuck Chang (Seoul National University, Republic of Korea), Eui-Young Chung (Yonsei University, Republic of Korea)
Pagepp. 159 - 164
Keywordlatency, arbiter, QoS, bus, slack
AbstractWe present a bus arbitration scheme for soft real-time constrained embedded systems. Some masters in such systems are required to complete their work within given timing constraints, resulting in the satisfaction of system-level timing constraints. The computation time of each master is predictable, but it is not easy to predict its data transfer time since the communication architecture is mostly shared by several masters. Previous works solved this issue by minimizing the latencies of several latency-critical masters, but the side effect of these methods is that they can increase the latencies of other masters, which may then violate their given timing constraints. Unlike previous works, our method uses the concept of “slack” in order to make the latency as close as possible to its given constraint, resulting in the reduction of the side effect. The proposed arbitration scheme consists of a bandwidth-conscious arbiter and a scheduler. The arbiter can be any existing bandwidth-conscious arbiter and the scheduler implements the latency-awareness proposed in this paper. The scheduler is involved in the arbitration only when it observes a request whose slack is not sufficient for the given timing constraint. The experimental results show that our method outperforms the conventional round-robin arbiter by more than 100% in the best case in terms of the longest violated cycles.

2B-2 (Time: 13:55 - 14:20)
TitleA Precise Bandwidth Control Arbitration Algorithm for Hard Real-Time SoC Buses
Author*Bu-Ching Lin, Geeng-Wei Lee, Juinn-Dar Huang, Jing-Yang Jou (National Chiao Tung University, Taiwan)
Pagepp. 165 - 170
Keywordbandwidth allocation, real-time systems, system buses, system-on-chip, arbitration algorithm
AbstractOn an SoC bus, contention occurs when different IP cores request bus access at the same time. Hence an arbiter is mandatory to deal with the contention issue on a shared bus system. In different applications, IPs may have real-time and/or bandwidth requirements. It is very difficult to design an arbitration algorithm to simultaneously meet these two requirements. In this paper, we propose an innovative arbitration algorithm, RB_lottery, to meet both of the requirements. It can provide not only the hard real-time guarantee but also the precise bandwidth controllability. The experimental results show that RB_lottery outperforms several well-known existing arbitration algorithms.
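
For readers unfamiliar with lottery-style arbitration, the basic idea can be sketched as follows. This is a plain ticket-based lottery arbiter, not RB_lottery: the abstract does not describe how RB_lottery adds its hard real-time guarantee, so that part is not modeled. The function name, the ticket table, and the master names are hypothetical.

```python
import random


def lottery_grant(tickets, requests, rng):
    """Grant the bus to one requesting master, chosen with probability
    proportional to the number of tickets it holds.

    tickets  -- dict mapping master name to its ticket count
    requests -- list of masters requesting the bus this cycle
    rng      -- a random.Random instance (passed in for reproducibility)
    """
    contenders = [m for m in requests if tickets.get(m, 0) > 0]
    if not contenders:
        return None  # nobody eligible is requesting the bus
    weights = [tickets[m] for m in contenders]
    return rng.choices(contenders, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    tickets = {"cpu": 1, "dma": 3}  # dma is meant to get about 3/4 of grants
    wins = [lottery_grant(tickets, ["cpu", "dma"], rng) for _ in range(10000)]
    print(wins.count("dma") / len(wins))
```

When all masters request continuously, the long-run share of grants follows the ticket ratio, which gives bandwidth control on average but no bound on how long an individual request may wait. That gap is presumably what a hard real-time extension has to close.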

2B-3 (Time: 14:20 - 14:45)
TitleCommunication Architecture Synthesis of Cascaded Bus Matrix
Author*Junhee Yoo, Dongwook Lee (Seoul National University, Republic of Korea), Sungjoo Yoo (Samsung Electronics, Republic of Korea), Kiyoung Choi (Seoul National University, Republic of Korea)
Pagepp. 171 - 177
Keywordbus matrix, AXI, communication architecture synthesis
AbstractFor high frequency on-chip communication architecture design, we propose cascaded bus matrix-based solutions. Due to the huge design space in cascaded bus matrix design, it is crucial to perform an efficient design space exploration. In our work, we present a simulated annealing-based design space exploration. For an efficient representation of bus topology, we propose an encoding method called traffic group encoding and apply it to AMBA3 AXI-based bus system design.

2B-4 (Time: 14:45 - 15:10)
TitleTopology Exploration for Energy Efficient Intra-tile Communication
Author*Jin Guo, Antonis Papanikolaou, Francky Catthoor (IMEC, Belgium)
Pagepp. 178 - 183
Keywordsegmented bus, topology, on-chip interconnect
AbstractWith the technology nodes scaling down, the energy consumed by the on-chip intra-tile interconnects is beginning to have a significant impact on the total chip energy. The segmented bus template is an energy efficient architecture style for the on-chip communication between the components. To achieve the minimum energy operation, the netlist topology of the segmented bus should however be optimized accordingly. In this paper we present a strategy for the definition of an energy optimal netlist for segmented buses. An initial floorplanning stage provides information about the eventual lengths of the interconnect wires and a subsequent exploration step defines the optimal topology for the communication architecture. We motivate that a star topology generated using the wire length prediction can be up to a factor 4 more energy efficient compared to standard linear bus topologies.

2B-5 (Time: 15:10 - 15:35)
TitleApplication Specific Network-on-Chip Design with Guaranteed Quality Approximation Algorithms
AuthorKrishnan Srinivasan, *Karam S. Chatha, Goran Konjevod (Arizona State University, United States)
Pagepp. 184 - 190
KeywordNetwork-on-Chip, Irregular topology, Approximation algorithms
AbstractNetwork-on-Chip (NoC) architectures with optimized topologies have been shown to be superior to regular architectures (such as mesh) for application specific multi-processor System-on-Chip (MPSoC) devices. The application specific NoC design problem takes as input the system-level floorplan of the computation architecture, characterized library of NoC components, and the communication performance requirements. The objective is to generate an optimized NoC topology, and routes for the communication traces on the architecture such that the performance requirements are satisfied and power consumption is minimized. The paper discusses a two stage automated approach consisting of i) core to router mapping, and ii) topology and route generation for design of custom NoC architectures. In particular it presents an optimal technique for core to router mapping (stage i), and a factor 2 approximation algorithm for custom topology generation (stage ii). The superior quality of the techniques is established by experimentation with benchmark applications, and comparisons with an optimal integer linear programming (ILP) based technique.


Session 2C Analog CAD Techniques: From Analysis to Verification
Time: 13:30 - 15:35 Wednesday, January 24, 2007
Location: Room 414+415
Chair: Yasuaki Inoue (Waseda Univ., Japan)

2C-1 (Time: 13:30 - 13:55)
TitleThermal-driven Symmetry Constraint for Analog Layout with CBL Representation
Author*Jiayi Liu, Sheqin Dong, Yunchun Ma, Di Long, Xianlong Hong (EDA lab, DCST, Tsinghua University, China)
Pagepp. 191 - 196
Keywordanalog layout, symmetry, thermal-driven, CBL representation
AbstractThermal constraint is very important for analog devices in the context of SOI. Hot-spot effect would cause error or even failure on the performance of analog devices. And the temperature gradient would lead to mismatch on symmetrical devices. In order to handle these problems, this paper introduces an accurate thermal model into the placement process. Based on the geometric symmetry which is achieved with CBL for the first time, the thermal model helps to find the thermal-optimal placement. And the experimental results show this method is promising.

2C-2 (Time: 13:55 - 14:20)
TitleA Graph Reduction Approach to Symbolic Circuit Analysis
Author*Guoyong Shi, Weiwei Chen (Shanghai Jiao Tong University, China), C.-J. Richard Shi (University of Washington, United States)
Pagepp. 197 - 202
Keywordgraph, symbolic, BDD, simulator
AbstractA new graph reduction approach to symbolic circuit analysis is developed in this paper. A Binary Decision Diagram (BDD) mechanism is formulated, together with a specially designed graph reduction process and a recursive sign determination algorithm. This combination of techniques is used to develop a core analysis engine of a symbolic analog circuit simulator that has the potential for analyzing large analog circuits in the frequency domain. Partial experimental results are reported.

2C-3 (Time: 14:20 - 14:45)
TitleRobust Analog Circuit Sizing Using Ellipsoid Method and Affine Arithmetic
AuthorXuexin Liu, *Wai-Shing Luk, Yu Song, Xuan Zeng (ASIC & System State-Key Lab, Fudan University, China)
Pagepp. 203 - 208
Keywordellipsoid method, affine arithmetic, geometric programming, robust design
AbstractAnalog circuit sizing under process/parameter variations is formulated as a mini-max geometric programming problem. To tackle such a problem, we present a new method that combines the ellipsoid method and affine arithmetic. Affine arithmetic is not only used for keeping track of variations and correlations, but also helps to determine the sub-gradient at each iteration of the ellipsoid method. An example of designing a CMOS op-amp is given to demonstrate the effectiveness of our method. Finally, numerical results are verified by SPICE simulation.

2C-4 (Time: 14:45 - 15:10)
TitleWCOMP: Waveform Comparison Tool for Mixed-signal Validation Regression in Memory Design
Author*Peng Zhang, Wai-Shing Luk, Yu Song, Jiarong Tong, Pushan Tang, Xuan Zeng (Fudan University, China)
Pagepp. 209 - 214
KeywordMixed-signal validation, Waveform comparison, Validation automation
AbstractThe increasing effort on full-chip validation constrains design cost and time-to-market. A waveform comparison tool named WCOMP is presented to automate mixed-signal validation regression in memory design. Unlike digital waveform comparison tools, WCOMP compares mixed-signal waveforms for functional match instead of graphical match, which tallies with the requirements of full-chip validation regression. Simulations with different regression runs, process parameters, voltages and temperatures can be functionally compared. The methods have proved effective in Intel Flash memory design.

2C-5 (Time: 15:10 - 15:35)
TitleStructured Placement with Topological Regularity Evaluation
Author*Shigetoshi Nakatake (University of Kitakyushu, Japan)
Pagepp. 215 - 220
Keywordplacement, floorplan, sequence-pair, regular structure, analog layout
AbstractThis paper introduces a new concept of floorplanning, called structured placement. Regularity is the key criterion so that the placement can make progress beyond constraint-driven approaches. We propose a linear-time extraction of topological regularity, such as arrays and rows, from a sequence-pair. In addition, we provide a new simulated annealing (SA) framework, called dual SA, which optimizes the regularity as an objective function balancing the size of regular structures against the area efficiency.


Session 2D SPECIAL SESSION: Design for Manufacturability
Time: 13:30 - 15:35 Wednesday, January 24, 2007
Location: Room 416+417
Chair: Keh-Jeng Chang (National Tsing Hua University, Taiwan)

2D-1 (Time: 13:30 - 13:50)
Title(Invited Paper) Modeling Sub-90nm On-chip Variation Using Monte Carlo Method for DFM
AuthorJun-Fu Huang, *Victor Chang, Sally Liu, Kelvin Doong (TSMC, Taiwan), Keh-Jeng Chang (National Tsing-Hua Univ., Taiwan)
Pagepp. 221 - 225

2D-2 (Time: 13:50 - 14:10)
Title(Invited Paper) DFM Reality in Sub-nanometer IC Design
AuthorNishath Verghese, *Philippe Hurat (Clear Shape Technologies, United States)
Pagepp. 226 - 231

2D-3 (Time: 14:10 - 14:30)
Title(Invited Paper) DFM/DFY Practices During Physical Designs for Timing, Signal Integrity, and Power
AuthorShi-Hao Chen, *Ke-Cheng Chu, Jiing-Yuan Lin, Cheng-Hong Tsai (Global Unichip, Taiwan)
Pagepp. 232 - 237

2D-4 (Time: 14:30 - 14:50)
Title(Invited Paper) Recent Research and Emerging Challenges in Physical Design for Manufacturability/Reliability
AuthorChung-Wei Lin (National Taiwan Univ., Taiwan), Ming-Chao Tsai, Kuang-Yao Lee (National Tsing Hua Univ., Taiwan), Tai-Chen Chen (National Taiwan Univ., Taiwan), *Ting-Chi Wang (National Tsing Hua Univ., Taiwan), Yao-Wen Chang (National Taiwan Univ., Taiwan)
Pagepp. 238 - 243

2D-5 (Time: 14:50 - 15:35)
Title(Panel Discussion) Design for Manufacturability
AuthorOrganizer: Keh-Jeng Chang, Moderator: Keh-Jeng Chang (National Tsing-Hua Univ., Taiwan), Panelists: Kelvin Doong (TSMC, Taiwan), Nishath Verghese (Clear Shape, United States), Ke-Cheng Chu (Global Unichip, Taiwan), Ting-Chi Wang (National Tsing-Hua Univ., Taiwan), Andrew Kahng (Univ. of California, San Diego and Blaze DFM, United States)


Session 3A Routing
Time: 16:00 - 18:05 Wednesday, January 24, 2007
Location: Room 411+412
Chairs: Martin Wong (University of Illinois at Urbana-Champaign, United States), Youichi Shiraishi (Gunma Univ., Japan)

3A-1 (Time: 16:00 - 16:25)
TitleA Novel Performance-Driven Topology Design Algorithm
Author*Min Pan, Chris Chu (Iowa State University, United States), Priyadarsan Patra (Intel Corporation, United States)
Pagepp. 244 - 249
KeywordInterconnect, Performance, Topology
AbstractThis paper presents a very efficient algorithm for performance-driven topology design for interconnects. Given a net, it first generates an A-tree topology using table lookup and net-breaking. Then a performance-driven post-processing heuristic, not restricted to A-tree topologies, improves the obtained topology by considering the sink positions, required times and load capacitances to achieve better timing. Experimental results show that our new technique can produce better topologies in terms of timing and is hundreds of times faster than the traditional approach.

3A-2 (Time: 16:25 - 16:50)
TitleFastRoute 2.0: A High-quality and Efficient Global Router
Author*Min Pan, Chris Chu (Iowa State University, United States)
Pagepp. 250 - 255
KeywordGlobal routing, Steiner trees, congestion
AbstractBecause of the increasing dominance of interconnect issues in advanced IC technology, it is desirable to incorporate global routing into early design stages to get accurate interconnect information. Hence, high-quality and fast global routers are in great demand. In this work, we propose a high-quality and efficient global router, FastRoute 2.0. It can achieve more than an order of magnitude less overflow and very fast runtime compared to three state-of-the-art academic global routers. The promising results make it possible to integrate global routing into early design stages. This could dramatically improve the design solution quality.

3A-3 (Time: 16:50 - 17:15)
TitleDpRouter: A Fast and Accurate Dynamic-Pattern-Based Global Routing Algorithm
Author*Zhen Cao, Tong Jing (Tsinghua University, China), Jinjun Xiong, Yu Hu, Lei He (University of California, Los Angeles, United States), Xianlong Hong (Tsinghua University, China)
Pagepp. 256 - 261
KeywordRouting, Routability, Physical Design, Congestion
AbstractThis paper presents a fast and accurate global routing algorithm, DpRouter, based on two efficient techniques: (1) dynamic pattern routing (Dpr), and (2) segment movement. These two techniques enable DpRouter to explore a large solution space to achieve better routability with low time complexity. Compared with state-of-the-art routers, experimental results show that we consistently obtain better routing quality in terms of both congestion and wire length, while simultaneously achieving a more than 30x runtime speedup. We envision that this algorithm can be further leveraged in other routing applications, such as FPGA routing.

3A-4 (Time: 17:15 - 17:40)
TitleA Fast and Stable Algorithm for Obstacle-Avoiding Rectilinear Steiner Minimal Tree Construction
Author*Pei-Ci Wu, Jhih-Rong Gao, Ting-Chi Wang (National Tsing Hua University, Taiwan)
Pagepp. 262 - 267
KeywordRouting, Steiner tree
AbstractIn routing, finding a rectilinear Steiner minimal tree (RSMT) is a fundamental problem. Today’s designs often contain rectilinear obstacles, like macro cells, IP blocks, and pre-routed nets. Therefore obstacle-avoiding RSMT (OARSMT) construction becomes a very practical problem. In this paper we propose a fast and stable algorithm for this problem. We use a partitioning based method and an ant colony optimization based method to construct an obstacle-avoiding Steiner minimal tree (OASMT). Besides, two heuristics are proposed to do the rectilinearization and refinement to further improve wirelength performance. The experimental results show our algorithm achieves the best wirelength in most of the test cases and the runtime is very small even for the large case in which the number of terminals and the number of obstacles are both more than 100.

3A-5 (Time: 17:40 - 18:05)
TitleA Theoretical Study on Wire Length Estimation Algorithms for Placement with Opaque Blocks
Author*Tan Yan, Shuting Li, Yasuhiro Takashima, Hiroshi Murata (The University of Kitakyushu, Japan)
Pagepp. 268 - 273
KeywordWire length estimation, Block placement, Routing obstacle, Shortest path
AbstractHow to estimate the shortest routing length when certain blocks are considered as routing obstacles is becoming an essential problem for block placement because HPWL is no longer valid in this case. Although this problem is well studied in computational geometry [6], the research results are neither well-known to the CAD community nor presented in a way that makes them easy for CAD researchers to utilize. With the help of some recent notions in block placement, this paper interprets the research result in [1,8], which gives the best algorithm for this problem as far as we know, in a way more concise and more friendly to CAD researchers. Besides, we also tailor its algorithm to VLSI CAD application. As a result, we present a method that estimates the shortest obstacle-avoiding routing length in O(M^2+N) time for a placement with M blocks and N 2-pin nets.


Session 3B System Synthesis and Optimization Techniques
Time: 16:00 - 18:05 Wednesday, January 24, 2007
Location: Room 413
Chairs: Ren-Song Tsay (National Tsing Hua Univ., Taiwan), Ahmed Jerraya (TIMA, France)

3B-1 (Time: 16:00 - 16:25)
TitleLEAF: A System Level Leakage-Aware Floorplanner for SoCs
Author*Aseem Gupta, Nikil Dutt, Fadi Kurdahi (University of California, Irvine, United States), Kamal Khouri, Magdy Abadir (Freescale Semiconductor Inc., United States)
Pagepp. 274 - 279
KeywordLeakage Power, Floorplanner, Temperature, System Level
AbstractProcess scaling and higher leakage power have resulted in increased power densities and elevated die temperatures. Due to the interdependence of temperature and leakage power, we observe that the floorplan has an impact on both the temperatures and the leakage of the IP-blocks in a system on chip (SoC). Hence, in this paper we propose a novel system level Leakage Aware Floorplanner (LEAF) which optimizes floorplans for temperature-aware leakage power along with the traditional metrics of area and wire length. Our floorplanner takes an SoC netlist and the dynamic power profile of functional blocks to determine a placement while optimizing for temperature-dependent leakage power, area, and wire length. To demonstrate the effectiveness of LEAF, we implemented our methodology on ten industrial SoC designs from Freescale Semiconductor Inc. and evaluated the trade-off between leakage power and area. We observed up to 190% difference in leakage power between leakage-unaware and leakage-aware floorplanning.

3B-2 (Time: 16:25 - 16:50)
TitleProtocol Transducer Synthesis using Divide and Conquer Approach
Author*Shota Watanabe, Kenshu Seto, Yuji Ishikawa, Satoshi Komatsu, Masahiro Fujita (University of Tokyo, Japan)
Pagepp. 280 - 285
Keywordprotocol, transducer, interface, NoC, wrapper
AbstractIn IP based design, the designers try to reuse existing IPs as much as possible. Since currently available IPs use various communication protocols, protocol conversion is one of the most important topics in IP-based design. We propose a method for automatic protocol transducer synthesis which is applicable to complex protocols. The main idea of our proposed method is protocol transducer synthesis with a divide and conquer approach. We demonstrate our method by synthesizing transducers which translate among the real and complicated protocols with advanced features such as non-blocking transactions and out-of-order transactions.

3B-3 (Time: 16:50 - 17:15)
TitleA Processor Generation Method from Instruction Behavior Description Based on Specification of Pipeline Stages and Functional Units
Author*Takeshi Shiro, Masaaki Abe, Keishi Sakanushi, Yoshinori Takeuchi, Masaharu Imai (Graduate School of Information Science and Technology, Osaka University, Japan)
Pagepp. 286 - 291
KeywordASIP (Application Specific Instruction-set Processor), Design Space Exploration, Architectural Description Language (ADL), Behavior Description, Micro Operation Description
AbstractThis paper proposes a method of generating a pipeline processor from behavior description. In the proposed method, micro operation description is generated by complementing the behavior description with specification of pipeline stages and functional units. From the micro operation description, synthesizable HDL description of a processor can be generated. The proposed method makes it possible to reduce code size of architectural description language and design time drastically without degradation of design quality, compared with the conventional method.

3B-4 (Time: 17:15 - 17:40)
TitlePower and Memory Bandwidth Reduction of an H.264/AVC HDTV Decoder LSI with Elastic Pipeline Architecture
Author*Kentaro Kawakami, Mitsuhiko Kuroda, Hiroshi Kawaguchi, Masahiko Yoshimoto (Kobe University, Japan)
Pagepp. 292 - 297
KeywordH.264, Decoder, Low power, Elastic pipeline, Dynamic Voltage Scaling
AbstractWe propose an elastic pipeline that can apply dynamic voltage scaling (DVS) to hardwired logic circuits. The proposed pipeline can also reduce required local bus bandwidth. In order to demonstrate its feasibility, a hardwired H.264/AVC HDTV decoder is designed as a real-time application. Compared to the conventional clock-gating scheme in a 90-nm process technology, the proposed architecture reduces power to 56%, or local bus bandwidth to 37.2%.

3B-5 (Time: 17:40 - 18:05)
TitleArchitectural Optimizations for Text to Speech Synthesis in Embedded Systems
Author*Soumyajit Dey, Monu Kedia, Anupam Basu (Indian Institute of Technology Kharagpur, India)
Pagepp. 298 - 303
KeywordText to Speech Synthesis (TTS), Natural Language Processing (NLP), Instruction Set Simulation (ISS), Throughput, Co-simulation
AbstractThe increasing processing power of embedded devices has created the scope for certain applications, which could previously be executed in desktop environments only, to migrate to handheld platforms. An important feature of the computing systems of modern times is their support for applications that interact with the user by synthesizing natural speech output. Such applications deliver state of the art performance in desktop environments. However, the real-time performance of such applications in handheld platforms with on-line incoming text streams has not been explored to date. In this work, the performance of a Text to Speech Synthesis application is evaluated on embedded processor architectures and modifications in the underlying hardware platform are proposed for real-time performance improvement of the concerned application.


Session 3C Model Checking and Applications to Digital and Analog Circuits
Time: 16:00 - 18:05 Wednesday, January 24, 2007
Location: Room 414+415
Chairs: Igor Markov (Univ. of Michigan, United States), Shin'ichi Minato (Hokkaido Univ., Japan)

3C-1 (Time: 16:00 - 16:25)
TitleDeeper Bound in BMC by Combining Constant Propagation and Abstraction
AuthorRoy Armoni (-, Israel), Limor Fix (Intel, United States), Ranan Fraer (Intel, Israel), *Tamir Heyman (Carnegie Mellon University, United States), Moshe Vardi (Rice University, United States), Yakir Vizel, Yael Zbar (Intel, Israel)
Pagepp. 304 - 309
Keywordproof-based, abstraction, BMC, constant propagation
AbstractThe most successful technologies for automatic verification of large industrial circuits are bounded model checking, abstraction, and iterative refinement. Previous work has demonstrated the ability to verify circuits with thousands of state elements, achieving bounds of at most a couple of hundred. In this paper we present several novel techniques for abstraction-based bounded model checking. Specifically, we introduce a constant-propagation technique to simplify the formulas submitted to the CNF SAT solver; we present a new proof-based iterative abstraction technique for bounded model checking; and we show how the two techniques can be combined. The experimental results demonstrate our ability to handle circuits with several thousand state elements, reaching bounds nearing 1,000.

3C-2 (Time: 16:25 - 16:50)
TitleEfficient BMC for Multi-Clock Systems with Clocked Specifications
Author*Malay K Ganai, Aarti Gupta (NEC LABS America, United States)
Pagepp. 310 - 315
KeywordClocked PSL LTL, Customized SAT-based BMC, Multi-clock System, Dynamic simplification, Formal Verification
AbstractCurrent industry trends in system design (multiple clocks, clocks with arbitrary frequency ratios, multi-phased clocks, gated clocks, level-sensitive latches, combined with clocked specifications) pose additional challenges to verification efforts. We propose an integrated solution that improves SAT-based Bounded Model Checking (BMC) by orders of magnitude, for verification of synchronous multi-clock systems with clocked LTL properties. Our main contributions are: a) Efficient clock modeling schemes to handle clock related challenges uniformly, b) Generation of automatic schedules and clock constraints to avoid unnecessary unrolling and loop-checks in BMC, c) Dynamic simplification of BMC problem instances with clock constraints, and d) Customized BMC translations, with incremental formulations and learning, to directly handle PSL-style clocked specifications. We demonstrate the effectiveness of our approach on some OpenCores multi-clock system benchmarks.

3C-3 (Time: 16:50 - 17:15)
TitleSymbolic Model Checking of Analog/Mixed-Signal Circuits
Author*David Walter, Scott Little, Nicholas Seegmiller, Chris Myers (University of Utah, United States), Tomohiro Yoneda (National Institute of Informatics, Japan)
Pagepp. 316 - 323
Keywordverification, analog circuits, BDDs, Petri nets, model checking
AbstractThis paper presents a Boolean based symbolic model checking algorithm for the verification of analog/mixed-signal (AMS) circuits. The systems are modeled in VHDL-AMS, a hardware description language for AMS circuits. The VHDL-AMS description is compiled into labeled hybrid Petri nets (LHPNs) in which analog values are modeled as continuous variables that can change at rates in a bounded range and digital values are modeled using Boolean signals. System properties are specified as temporal logic formulas using timed CTL (TCTL). The verification proceeds over the structure of the formula and maps separation predicates to Boolean variables. The state space is thus represented as a Boolean function using a binary decision diagram (BDD) and the verification algorithm relies on the efficient use of BDD operations.

3C-4 (Time: 17:15 - 17:40)
TitleEfficient Automata-Based Assertion-Checker Synthesis of SEREs for Hardware Emulation
Author*Marc Boule, Zeljko Zilic (McGill University, Canada)
Pagepp. 324 - 329
Keywordassertion, verification, automaton, checker, psl
AbstractIn this paper, we present a method for generating checker circuits from sequential-extended regular expressions (SEREs). Such sequences form the core of increasingly-used Assertion-Based Verification (ABV) languages. A checker generator capable of transforming assertions into efficient circuits allows the adoption of ABV in hardware emulation. Towards that goal, we introduce the algorithms for sequence fusion and length matching intersection, two SERE operators that are not typically used over regular expressions. We also develop an algorithm for generating failure detection automata, a concept critical to extending regular expressions for ABV, as well as present our efficient symbol encoding. Experiments with complex sequences show that our tool outperforms the best known checker generator.


Session 3D SPECIAL SESSION: Embedded Software for Multiprocessor Systems-on-Chip
Time: 16:00 - 18:05 Wednesday, January 24, 2007
Location: Room 416+417
Chairs: Hiroyuki Tomiyama (Nagoya Univ., Japan), Tei-Wei Kuo (National Taiwan Univ., Taiwan)

3D-1 (Time: 16:00 - 16:30)
Title(Invited Paper) Model-based Programming Environment of Embedded Software for MPSoC
Author*Soonhoi Ha (Seoul National Univ., Republic of Korea)
Pagepp. 330 - 335
KeywordEmbedded software, MPSoC, model-based design, common intermediate code
AbstractA novel model-based programming environment of embedded software for MPSoC is proposed. By defining a common intermediate code (CIC), it separates modeling of the software and implementation optimized for target architecture. It also allows us to use diverse models for initial specification. Another feature is to provide multi-phase debugging capabilities: at the modeling stage, at the code generation stage, and at the simulation stage. Preliminary experiments with a DivX player confirm the feasibility and validity of the proposed technique.

3D-2 (Time: 16:30 - 17:00)
Title(Invited Paper) RTOS and Codesign Toolkit for Multiprocessor Systems-on-Chip
Author*Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada (Nagoya Univ., Japan)
Pagepp. 336 - 341
KeywordRTOS, MultiProcessor, Systemlevel Design
AbstractMultiprocessor designs have become popular in embedded domains for achieving the power and performance requirements. In this paper, we present principles and techniques for design and implementation of RTOS for embedded multiprocessor systems. We also present a system-level design toolkit for rapid design and evaluation of embedded multiprocessor systems.

3D-3 (Time: 17:00 - 17:30)
Title(Invited Paper) Energy-efficient Real-time Task Scheduling in Multiprocessor DVS Systems
Author*Jian-Jia Chen, Chuan-Yue Yang, Tei-Wei Kuo, Chi-Sheng Shih (National Taiwan Univ., Taiwan)
Pagepp. 342 - 349
KeywordEnergy-Efficient Scheduling, Real-Time Systems, DVS, Multiprocessor Systems
AbstractDynamic voltage scaling (DVS) circuits have been widely adopted in many computing systems to provide a tradeoff between performance and power consumption. The effective use of energy could not only extend operation duration for hand-held devices but also cut down power bills of server systems. Moreover, while many chip makers are releasing multi-core chips and multiprocessor systems-on-a-chip (SoCs), multiprocessor platforms for different applications become even more popular. Multiprocessor platforms could improve the system performance and accommodate the growing demand of computing power and the variety of application functionality. This paper summarizes our work on several important issues in energy-efficient scheduling for real-time tasks in multiprocessor DVS systems. Distinct from most previous work based on heuristics, we aim at the provision of approximated solutions with worst-case guarantees. The proposed algorithms are evaluated by a series of experiments to provide insights in system designs.

3D-4 (Time: 17:30 - 18:00)
Title(Invited Paper) Towards Scalable and Secure Execution Platform for Embedded Systems
Author*Junji Sakai, Hiroaki Inoue, Masato Edahiro (NEC, Japan)
Pagepp. 350 - 354
Keywordmulticore, partitioning, reliability
AbstractReliability of embedded systems can be enhanced by a multicore and partitioning approach. Physical partitioning based on AMP multicore achieves runtime stability of multiple applications in a system and also prevents total system shutdown even when malicious code creeps in. Combined with logical partitioning by processor virtualization and SMP technologies, the multicore architecture could realize a more flexible and more scalable platform for future embedded systems.



Thursday, January 25, 2007

Session 2K Keynote Address II
Time: 9:00 - 10:00 Thursday, January 25, 2007
Location: Small Auditorium, 5F
Chair: Hidetoshi Onodera (Kyoto Univ., Japan)

2K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) Meeting with the Forthcoming IC Design - The Era of Power, Variability and NRE Explosion and a Bit of the Future -
AuthorTakayasu Sakurai (The Univ. of Tokyo, Japan)
Keyword
AbstractIn the foreseeable future, VLSI design will meet a couple of explosions: power, variability and NRE (non-recurring engineering cost). Some of the solutions for power-aware designs are covered in this talk with relation to variability. A remedy for the NRE explosion is to reduce the number of developments and manufacture and sell tens of millions of chips under a fixed design. System-in-a-Package approach may embody such possibility. Several new technologies are described to enable 3-dimensional stacking of chips to build high-performance yet low-power electronics systems. On the other extreme of the silicon VLSI's which stay as small as a centimeter square, a new domain of electronics called large-area integrated circuit as large as meters is waiting, which may open up a new continent of applications in the era of ubiquitous electronics. One of the implementations of the large-area electronics is based on organic transistors. The talk will provide perspectives of the organic circuit design taking E-skin, sheet-type scanner and Braille display as examples.


Session 4A Model Order Reduction and Macromodeling
Time: 10:15 - 12:20 Thursday, January 25, 2007
Location: Room 411+412
Chairs: Sheldon Tan (Univ. of California, Riverside, United States), Yehia Massoud (Rice Univ., United States)

4A-1 (Time: 10:15 - 10:40)
TitlePassive Interconnect Macromodeling Via Balanced Truncation of Linear Systems in Descriptor Form
AuthorBoyuan Yan, *Sheldon X.-D. Tan, Pu Liu (University of California, Riverside, United States), Bruce McGaughy (Cadence Design Systems Inc., United States)
Pagepp. 355 - 360
Keywordmodel order reduction, descriptor form, TBR, passivity
AbstractIn this paper, we present a novel passive model order reduction (MOR) method via projection-based truncated balanced realization method, PriTBR, for large RLC interconnect circuits. Different from existing passive truncated balanced realization (TBR) methods where numerically expensive Lur'e equations or algebraic Riccati equations (AREs) are solved, the new method performs balanced truncation on linear systems in descriptor form by solving generalized Lyapunov equations. Passivity preservation is achieved by congruence transformation instead of simple truncations. For the first time, passive model order reduction is achieved by combining Lyapunov equation based TBR method with congruence transformation. Compared with existing passive TBR, the new technique has the same accuracy and is numerically reliable and less expensive. In addition to passivity-preserving, it can be easily extended to preserve structure information inherent to RLC circuits, like block structure, reciprocity and sparsity. PriTBR can be applied as a second MOR stage combined with Krylov-subspace methods to generate a nearly optimal reduced model from a large scale interconnect circuit while passivity, structure, and reciprocity are preserved at the same time. Experimental results demonstrate the effectiveness of the proposed method and show PriTBR and its structure-preserving version, SP-PriTBR, are superior to existing passive TBR and Krylov-subspace based moment-matching methods.

4A-2 (Time: 10:40 - 11:05)
TitleAutomated Extraction of Accurate Delay/Timing Macromodels of Digital Gates and Latches using Trajectory Piecewise Methods
AuthorSandeep Dabas, Ning Dong, *Jaijeet Roychowdhury (University of Minnesota, Twin Cities, United States)
Pagepp. 361 - 366
KeywordModel-order-reduction, Simulation
AbstractWe present a fundamentally new approach, ADME, for extracting highly accurate delay models of a wide variety of digital gates. The technique is based on trajectory-piecewise automated nonlinear macromodelling methods adapted from the mixed-signal/RF domain. Advantages over prior current-source models include rapid automated extraction from SPICE-level netlists, transparent retargetability to different design styles and technologies, and the ability to correctly and holistically account for complex input waveform shapes, nonlinear and linear loading, multiple input switching, effects of internal state, multiple I/Os, supply droop and substrate interference. We validate ADME on a variety of digital gates, including multi-input NAND, NOR, XOR gates, a full adder, a multilevel cascade of gates and a sequential latch. Our results confirm excellent model accuracy at the detailed waveform level and testify to the promise of ADME for sustainable gate delay modelling at nanoscale technologies.

4A-3 (Time: 11:05 - 11:30)
TitlePractical Implementation of Stochastic Parameterized Model Order Reduction via Hermite Polynomial Chaos
AuthorYi Zou, Yici Cai, Qiang Zhou, Xianlong Hong (Tsinghua University, China), Sheldon X.-D. Tan (University of California, Riverside, United States), *Le Kang (Tsinghua University, China)
Pagepp. 367 - 372
Keywordstochastic interconnect analysis, Model order reduction
AbstractThis paper describes the stochastic model order reduction algorithm via stochastic Hermite Polynomials from the practical implementation perspective. Compared with existing work on stochastic interconnect analysis and parameterized model order reduction, we generalize the input variation representation using polynomial chaos (PC) to allow for accurate modeling of non-Gaussian input variations. We also explore the implicit system representation using sub-matrices and improve the efficiency of solving the linear equations by utilizing the block matrix structure of the augmented system. Experiments show that our algorithm matches Monte Carlo methods very well while keeping the algorithm effective. The PC representation of non-Gaussian variables is also more accurate than the Taylor representation used in previous work.

4A-4 (Time: 11:30 - 11:55)
TitleReduced-Order Wide-Band Interconnect Model Realization using Filter-Based Spline Interpolation
Author*Arthur Nieuwoudt, Mehboob Alam, Yehia Massoud (Rice University, United States)
Pagepp. 373 - 378
Keywordmodel order reduction, wide-band interconnect modeling
AbstractIn this paper, we develop a systematic methodology for modeling sampled interconnect frequency response data based on spline interpolation. Through piecewise polynomial interpolation, we are able to avoid the numerical problems associated with global polynomial fitting and generate higher order systems to model simulated or measured wideband frequency response data. We reduce the complexity of the generated systems using a data point pruning algorithm and by applying model order reduction based on balanced truncation. The methodology provides substantially greater accuracy than global polynomial approximation while only having O(n) growth in model complexity.

4A-5 (Time: 11:55 - 12:20)
TitleFrequency Selective Model Order Reduction via Spectral Zero Projection
AuthorMehboob Alam, *Arthur Nieuwoudt, Yehia Massoud (Rice University, United States)
Pagepp. 379 - 383
KeywordInterconnect, Model Order Reduction, Passivity
AbstractAs process technology continues to scale into the nanoscale regime, interconnect plays an ever increasing role in determining VLSI system performance. As the complexity of these systems increases, reduced order modeling becomes critical. In this paper, we develop a new method for the model order reduction of interconnect using frequency restrictive selection of interpolation points based on the spectral-zeros of the RLC interconnect model’s transfer function. The methodology uses the imaginary part of spectral zeros for frequency selective projection and provides stable as well as passive reduced order models for interconnect in VLSI systems. For large order interconnect models with realistic RLC parameters, the results indicate that our method provides more accurate approximations than techniques based on balanced truncation and moment matching with excellent agreement with the original system’s transfer function.


Session 4B System Level Modeling
Time: 10:15 - 12:20 Thursday, January 25, 2007
Location: Room 413
Chairs: Tei-Wei Kuo (National Taiwan Univ., Taiwan), Shinya Honda (Nagoya Univ., Japan)

4B-1 (Time: 10:15 - 10:40)
TitleAbstract, Multifaceted Modeling of Embedded Processors for System Level Design
Author*Gunar Schirner, Andreas Gerstlauer, Rainer Doemer (University of California, Irvine, United States)
Pagepp. 384 - 389
Keywordabstract processor modeling, abstract computation, embedded software, system level design
AbstractEmbedded software is playing an increasing role in today's SoC designs. It allows a flexible adaptation to evolving standards and to customer specific demands. As software emerges more and more as a design bottleneck, early, fast, and accurate simulation of software becomes crucial. Therefore, an efficient modeling of programmable processors at high levels of abstraction is required. In this article, we focus on abstraction of computation and describe our abstract modeling of embedded processors. We combine the computation modeling with task scheduling support and accurate interrupt handling into a versatile, multi-faceted processor model with varying levels of features. Incorporating the abstract processor model into a communication model, we achieve fast co-simulation of a complete custom target architecture for a system level design exploration. We demonstrate the effectiveness of our approach using an industrial strength telecommunication example executing on a Motorola DSP architecture. Our results indicate the tremendous value of abstract processor modeling. Different feature levels achieve a simulation speedup of up to 6600 times with an error of less than 8% over an ISS-based simulation. On the other hand, our full featured model exhibits a 3% error in simulated timing with a 1800 times speedup.

4B-2 (Time: 10:40 - 11:05)
TitleFlexible and Executable Hardware/Software Interface Modeling for Multiprocessor SoC Design Using SystemC
Author*Patrice Gerin, Hao Shen, Alexandre Chureau, Aimen Bouchhima, Ahmed Amine Jerraya (TIMA Laboratory, France)
Pagepp. 390 - 395
KeywordHW/SW Interface, Service-based model, MPSoC, Transaction Accurate
AbstractAt a high abstraction level, Multi-Processor System-On-Chip (SoC) designs are specified as assemblies of IPs, which can be hardware or software. The refinement of communication between these different IPs, known as hardware/software interfaces, is widely seen as the design bottleneck due to their complexity. In order to perform early design validation and architecture exploration, flexible executable models of these interfaces are needed at different abstraction levels. In this paper, we define a unified methodology to implement executable models of the hardware/software interface based on SystemC. The proposed formalism, based on the concept of services, gives this approach the flexibility needed for architecture exploration and the ability to be used in automatic generation tools. A case study of hardware/software interface modeling at the Transaction Accurate level is presented. Experimental results show that this method allows higher simulation speed with early performance estimation.

4B-3 (Time: 11:05 - 11:30)
TitleA Retargetable Software Timing Analyzer Using Architecture Description Language
Author*Xianfeng Li (Peking University, China), Abhik Roychoudhury, Tulika Mitra (National University of Singapore, Singapore), Prabhat Mishra (University of Florida, United States), Xu Cheng (Peking University, China)
Pagepp. 396 - 401
KeywordWorst Case Execution Time, Retargetability, Architecture Description Language
AbstractWorst Case Execution Time (WCET) is an essential input for performance and schedulability analysis of real-time systems. Static WCET analysis requires program path analysis and microarchitecture modeling. Despite almost two decades of research, WCET analysis has not enjoyed wide acceptance in industry. This is in part due to the difficulty in microarchitecture modeling of modern processors. Given the large number of embedded processors available in the market, retargetability of the WCET analysis framework is a serious issue. In this paper, we address it using Architecture Description Language (ADL). Starting with the ADL of a target processor, the proposed framework automatically generates graph-based execution models to capture timing effects of instructions in the pipeline. This pipeline model, coupled with parameterized models of cache and branch prediction, leads to a WCET framework that is safe, accurate and retargetable.


Session 4C Logic Synthesis
Time: 10:15 - 12:20 Thursday, January 25, 2007
Location: Room 414+415
Chairs: Deming Chen (University of Illinois at Urbana-Champaign, United States), Yutaka Tamiya (Fujitsu Lab., Japan)

4C-1 (Time: 10:15 - 10:40)
TitleAutomating Logic Rectification by Approximate SPFDs
Author*Yu-Shen Yang (University of Toronto, Canada), Subarna Sinha (Synopsys, United States), Andreas Veneris (University of Toronto, Canada), Robert Brayton (University of California, United States)
Pagepp. 402 - 407
Keyworddebugging, verification, EDA VLSI, SAT, correction
AbstractIn the digital VLSI cycle, a netlist is often modified to correct design errors, perform small specification changes or implement incremental rewiring-based optimization operations. Most existing automated logic rectification tools use a small set of predefined logic transformations when they perform such modifications. This paper first shows that a small set of predefined transformations may not allow rectification to exploit the full potential of the design. Then, it proposes an automated simulation-based methodology to "approximate" Sets of Pairs of Functions to be Distinguished (SPFDs) and avoid the memory/time explosion problem. This representation is used by a SAT-based algorithm that devises appropriate logic transformations to fix a design. The SAT method is later complemented by a greedy one that improves on run-time performance. An extensive suite of experiments documents the added potential of the proposed rectification methodology.

4C-2 (Time: 10:40 - 11:05)
TitleBddCut: Towards Scalable Symbolic Cut Enumeration
Author*Andrew Chaang Ling, Jianwen Zhu (University of Toronto, Canada), Stephen Dean Brown (Altera Toronto Technology Centre, Canada)
Pagepp. 408 - 413
KeywordCut Enumeration, Binary Decision Diagram, Elimination, Synthesis, Covering Problem
AbstractWhile the covering algorithm has been perfected recently by iterative approaches such as DAOmap and IMap, its application has been limited to technology mapping. The main factor preventing the covering problem's migration to other logic transformations, such as elimination and resynthesis region identification found in SIS and FBDD, is the exponential number of alternative cuts that have to be evaluated. Traditional methods of cut generation do not scale beyond a cut size of 6. In this paper, a symbolic method is proposed that can enumerate all cuts, without any pruning, up to a cut size of 10. We show that it can outperform traditional methods by an order of magnitude and, as a result, scales to 100K gate benchmarks. As a practical driver, the covering problem is applied to elimination, where it not only produces competitive area but also reduces the total runtime of FBDD by more than 6x on average. FBDD is a BDD-based logic synthesis tool with a reported order of magnitude faster runtime than SIS and commercial tools with negligible impact on area.
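
For orientation, the conventional (non-symbolic) style of cut enumeration that the abstract contrasts against can be sketched as below. This is a minimal illustration under stated assumptions, not the BddCut method: the netlist encoding, the function name `enumerate_cuts` and the tiny example circuit are all hypothetical.

```python
from itertools import product


def enumerate_cuts(fanins, k):
    """Enumerate all k-feasible cuts of every gate in a DAG.

    fanins maps each gate to the list of its fanin nodes; nodes that never
    appear as keys are primary inputs. Gates must be listed in topological
    order (fanins before the gates that use them).
    """
    cuts = {}

    def cuts_of(node):
        # A primary input has only its trivial cut {node}.
        if node not in fanins:
            return {frozenset([node])}
        return cuts[node]

    for node, ins in fanins.items():
        result = {frozenset([node])}  # trivial cut of the gate itself
        # Pick one cut per fanin, merge them, keep unions of size <= k.
        for combo in product(*(cuts_of(i) for i in ins)):
            merged = frozenset().union(*combo)
            if len(merged) <= k:
                result.add(merged)
        cuts[node] = result
    return cuts


# Hypothetical netlist: g1 = f(a, b), g2 = f(b, c), g3 = f(g1, g2)
netlist = {"g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
all_cuts = enumerate_cuts(netlist, k=3)
```

Because every gate combines the cut sets of all its fanins, the number of candidate unions grows quickly with the cut size, which is the scaling problem the abstract refers to.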

4C-3 (Time: 11:05 - 11:30)
TitleNode Mergers in the Presence of Don't Cares
Author*Stephen Plaza, Kai-hui Chang, Igor Markov, Valeria Bertacco (University of Michigan, United States)
Pagepp. 414 - 419
KeywordODCs, sat sweep, global don't cares, node mergers
AbstractSAT sweeping is the process of merging two or more functionally equivalent nodes in a circuit by selecting one of them to represent all the other equivalent nodes. This provides significant advantages in synthesis by reducing circuit size and provides additional flexibility in technology mapping, which could be crucial in post-synthesis optimizations. Furthermore, it is also critical in verification because it can reduce the complexity of the netlist to be analyzed in equivalence checking. Most algorithms available so far for this goal do not exploit observability don't cares (ODCs) for node merging since nodes equivalent up to ODCs do not form an equivalence relation. Although a few recently proposed solutions can exploit ODCs by overcoming this limitation, they constrain their analysis to just a few levels of surrounding logic to avoid prohibitive runtime. We develop an ODC-based node merging algorithm that performs efficient global ODC analysis (considering the entire netlist) through simulation and SAT. Our contributions which enable global ODC-based optimizations are: (1) a fast ODC-aware simulator and (2) an incremental verification strategy that limits computational complexity. In addition, our technique operates on arbitrarily mapped netlists, allowing for powerful post-synthesis optimizations. We show that global ODC analysis discovers on average 25% more (and up to 60%) node-merging opportunities than current state-of-the-art solutions based on local ODC analysis.
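
The idea of merging nodes that are equivalent only up to observability don't cares can be illustrated on a toy circuit. The sketch below is a brute-force check, not the simulation and SAT flow of the paper; the node functions `n1`, `n2` and the output gate are hypothetical.

```python
from itertools import product


def n1(a, c):
    return a


def n2(a, c):
    return a or (not c)


def output(node_value, c):
    # The only path from the node to the output is gated by c.
    return node_value and c


inputs = list(product([False, True], repeat=2))

# The two nodes are not functionally equivalent ...
nodes_equal = all(n1(a, c) == n2(a, c) for a, c in inputs)

# ... yet substituting one for the other never changes the output, because
# they differ only when c is False, where the node is unobservable.
outputs_equal = all(
    output(n1(a, c), c) == output(n2(a, c), c) for a, c in inputs
)
```

Here `nodes_equal` is False while `outputs_equal` is True, so a merge that is invisible to plain equivalence checking becomes legal once ODCs are taken into account.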

4C-4 (Time: 11:30 - 11:55)
TitleSynthesis of Reversible Sequential Elements
AuthorMin-Lung Chuang, *Chun-Yao Wang (National Tsing Hua University, Taiwan)
Pagepp. 420 - 425
Keywordreversible logic synthesis, reversible sequential elements
AbstractTo construct a reversible sequential circuit, reversible sequential elements are required. This work presents novel designs of reversible sequential elements such as D latch, JK latch, and T latch. Based on these reversible latches, we also construct the designs of the corresponding flip-flops. Then, we further discuss the physical implementations of our designs based on classical MOS electronics. Compared with previous work, the implementation cost of our new designs, including the number of gates and the number of garbage outputs, is considerably reduced.

4C-5 (Time: 11:55 - 12:20)
TitleRecognition of Fanout-free Functions
AuthorTsung-Lin Lee, *Chun-Yao Wang (National Tsing Hua University, Taiwan)
Pagepp. 426 - 431
Keywordfanout-free, read-once, logic synthesis, factoring
AbstractFactoring is a logic minimization technique to represent a Boolean function in an equivalent form with minimum literals. When realizing the circuit, a function represented in a more compact form has smaller area. Some Boolean functions even have equivalent forms where each variable appears exactly once, which are known as fanout-free functions. John P. Hayes devised an algorithm to determine if a function can be fanout-free and to construct the circuit if a fanout-free realization exists. In this paper, we propose a property and an efficient technique to accelerate this algorithm. With our improvements, the execution time of this algorithm is more competitive with the state-of-the-art method.
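
As a small illustration of what a fanout-free (read-once) form is, the sketch below compares a sum-of-products expression with a factored form in which each variable appears exactly once. It is an exhaustive truth-table check on a hypothetical three-variable function, not the recognition algorithm discussed in the paper.

```python
from itertools import product


def sop(a, b, c):
    # Sum-of-products form a*b + a*c: four literals, 'a' appears twice.
    return (a and b) or (a and c)


def read_once(a, b, c):
    # Factored form a*(b + c): three literals, every variable appears once.
    return a and (b or c)


same_function = all(
    sop(*v) == read_once(*v) for v in product([False, True], repeat=3)
)
```

`same_function` evaluates to True, so this function is fanout-free even though its sum-of-products form is not written that way.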


Session 4D SPECIAL SESSION: EDA Challenges for Analog/RF
Time: 10:15 - 12:20 Thursday, January 25, 2007
Location: Room 416+417
Chair: Georges Gielen (Katholieke Universiteit Leuven, Belgium)

4D-1 (Time: 10:15 - 10:40)
Title(Invited Paper) Design Tool Solutions for Mixed-signal/RF Circuit Design in CMOS Nanometer Technologies
Author*Georges Gielen (Katholieke Universiteit Leuven, Belgium)
Pagepp. 432 - 437
Keywordanalog, mixed-signal, CAD tools
AbstractThe scaling of CMOS technology into the nanometer era enables the fabrication of highly integrated systems, which increasingly contain analog and/or RF parts. However, scaling into the nanometer era also brings problems of leakage power, increasing variability and degradation, reducing supply voltages and worsening signal integrity conditions, all this in combination with tightening time-to-market constraints. Design methodologies and tools need to be developed to address these problems. This invited paper describes progress in modeling techniques for design and verification of complex integrated systems, in circuit and yield optimization tools for analog/RF circuits, as well as in signal integrity analysis methods such as EMC/EMI analysis.

4D-2 (Time: 10:40 - 11:05)
Title(Invited Paper) Challenges to Accuracy for the Design of Deep-submicron RF-CMOS Circuits
Author*Sadayuki Yoshitomi (Toshiba Corporation, Japan)
Pagepp. 438 - 441
KeywordRF-CMOS, Electro-Magnetic simulation, EKV3.0, Compact Model, NQS effect
AbstractIncreasing complexity, functionality and operating frequency make RF-CMOS circuit design a tough subject. Efficient use of recent electro-magnetic simulation, which enables the inclusion of many high-frequency effects, and the usage of "more" accurate compact models are the keys to overcoming this problem. Challenges of these two issues will be shown by the use of real implementation examples.

4D-3 (Time: 11:05 - 11:30)
Title(Invited Paper) Advanced Tools for Simulation and Design of Oscillators/PLLs
AuthorXiaolue Lai, *Jaijeet Roychowdhury (Univ. of Minnesota, United States)
Pagepp. 442 - 449
Keywordmacromodeling
AbstractThe lack of fast yet accurate oscillator and PLL simulation methods has constituted a serious bottleneck in mixed-signal, RF and digital design flows. Methods are described that, given differential equations for any oscillator (i.e., equivalent to, e.g., a SPICE-level circuit), will extract a simple nonlinear phase macromodel. It will be shown how such nonlinear phase macromodels are capable of capturing a variety of important effects, including jitter and phase noise, injection locking, PLL lock and capture phenomena, cycle slipping, etc., while being faster by several orders of magnitude than SPICE-level simulations. It will also be shown how this nonlinear phase macromodel, when applied to large systems of networked biochemical and nanoelectronic oscillators, correctly predicts spontaneous pattern formation and edge detection.


Session 5A Statistical Interconnect Modeling and Analysis
Time: 13:30 - 15:35 Thursday, January 25, 2007
Location: Room 411+412
Chairs: Hideki Asai (Shizuoka Univ., Japan), Weiping Shi (Texas A&M Univ., United States)

5A-1 (Time: 13:30 - 13:55)
TitleA New Methodology for Interconnect Parasitics Extraction Considering Photo-Lithography Effects
AuthorYing Zhou (Texas A&M University, United States), Zhuo Li (Pextra Corp., United States), Yuxin Tian, *Weiping Shi (Texas A&M University, United States), Frank Liu (IBM Austin Research Laboratory, United States)
Pagepp. 450 - 455
Keywordlithography simulation, Parasitic Extraction, DFM
AbstractEven with the wide adoption of resolution enhancement techniques in sub-wavelength lithography, the geometry of the fabricated interconnect is still quite different from the drawn one. Existing Layout Parasitic Extraction (LPE) tools assume perfect geometry, thus introducing significant error in the extracted parasitic models, which in turn causes significant error in timing verification and signal integrity analysis. Our simulation shows that the RC parasitics extracted from perfect GDS-II geometry can be as much as 20% different from those extracted from the post litho/etching simulation geometry. This paper presents a new LPE methodology and related fast algorithms for interconnect parasitic extraction under photo-lithographic effects. Our methodology is compatible with the existing design flow. Experimental results show that the proposed methods are accurate and efficient.

5A-2 (Time: 13:55 - 14:20)
TitleSimple and Accurate Models for Capacitance Increment due to Metal Fill Insertion
Author*Youngmin Kim (University of Michigan of Ann Arbor, United States), Dusan Petranovic (Mentor Graphics, United States), Dennis Sylvester (University of Michigan of Ann Arbor, United States)
Pagepp. 456 - 461
Keywordmetal fills, dummy, capacitance, interconnect, modeling
AbstractInserting metal fill to improve inter-level dielectric thickness planarity is an essential part of the modern design process. However, the inserted fill shapes impact the performance of signal interconnect by increasing capacitance. In this paper, we analyze and model the impact of the metal dummy on the signal capacitance with various parameters including their electrical characteristic, signal dimensions, and dummy shape and dimensions. Fill has differing impact on interconnects depending on whether the signal of interest is in the same layer as the fill or not. In particular intra-layer dummy has its greatest impact on coupling capacitance while inter-layer dummy has more impact on the ground capacitance component. Based on an analysis of fill impact on capacitance, we propose simple capacitance increment models (Cc for intra-layer dummy and Cg for inter-layer dummy). To consider the realistic case with both signals and metal fill in adjacent layers, we apply a weighting function approach in the ground capacitance model. We verify this model using simple test patterns and benchmark circuits and find that the models match well with field solver results (1.3% average error with much faster runtime than commercial extraction tools, the runtime overhead reduced by ~75% for all benchmark circuits).

5A-3 (Time: 14:20 - 14:45)
TitleNew Block-based Statistical Timing Analysis Approaches without Moment Matching
AuthorRuiming Chen, *Hai Zhou (Northwestern University, United States)
Pagepp. 462 - 467
Keywordstatistical timing, bound of yield
AbstractWith aggressive scaling down of feature sizes in VLSI fabrication, process variation has become a critical issue in designs. We show that two necessary conditions for the "Max" operation are actually not satisfied in the moment matching based statistical timing analysis approaches. We propose two correlation-aware block-based statistical timing analysis approaches that keep these necessary conditions, and prove that our approaches always achieve tight lower and upper bounds of the yield. In particular, our approach always gets the tight upper bound of the yield irrespective of the distributions that the random variables have.

5A-4 (Time: 14:45 - 15:10)
TitleParameter Reduction for Variability Analysis by Slice Inverse Regression (SIR) Method
AuthorAlexandar Mitev, Michael Marefact, Dongsheng Ma, *Janet Wang (University of Arizona at Tucson, United States)
Pagepp. 468 - 473
Keywordperformance oriented, parameter reduction
AbstractWith semiconductor fabrication technologies scaled below 100 nm, the design-manufacturing interface becomes more and more complicated. The resultant process variability causes a number of issues in the new generation IC design. One of the biggest challenges is the enormous number of process variation related parameters. These parameters represent numerous local and global variations, and pose a heavy burden in today's chip verification and design. This paper proposes a new way of reducing the statistical variations (which include both process parameters and design variables) according to their impacts on the overall circuit performance. The new approach creates an effective reduction subspace (ERS) and provides a transformation matrix by using the mean and variance of the response surface. With the generated transformation matrix, the proposed method maps the original statistical variations to a smaller set of variables with which we perform variability analysis. Thus, the computational cost due to the number of variations is greatly reduced. Experimental results show that by using the new method we can achieve 20% to 50% parameter reduction with less than 5% error on average.

5A-5 (Time: 15:10 - 15:35)
TitleStochastic Sparse-grid Collocation Algorithm (SSCA) for Periodic Steady-State Analysis of Nonlinear System with Process Variations
Author*Jun Tao, Xuan Zeng (Fudan University, China), Wei Cai (University of North Carolina at Charlotte, United States), Yangfeng Su (Fudan University, China), Dian Zhou (University of Texas at Dallas, United States), Charles Chiang (Synopsys Inc., United States)
Pagepp. 474 - 479
Keywordprocess variation, steady-state analysis, Stochastic Collocation Algorithm, Sparse Grid Technique
AbstractIn this paper, Stochastic Collocation Algorithm combined with Sparse Grid technique (SSCA) is proposed to deal with the periodic steady-state analysis for nonlinear systems with process variations. Compared to the existing approaches, SSCA has several considerable merits. Firstly, compared with the moment-matching parameterized model order reduction (PMOR) which equally treats the circuit response on process variables and frequency parameter by Taylor approximation, SSCA employs Homogeneous Chaos to capture the impact of process variations with exponential convergence rate and adopts Fourier series or Wavelet Bases to model the steady-state behavior in time domain. Secondly, contrary to Stochastic Galerkin Algorithm (SGA), which is efficient for stochastic linear system analysis, the complexity of SSCA is much smaller than that of SGA for the nonlinear case. Thirdly, different from Efficient Collocation Method, the heuristic approach which may result in “Rank deficient problem” and “Runge phenomenon”, Sparse Grid technique is developed to select the collocation points needed in SSCA in order to reduce the complexity while guaranteeing the approximation accuracy. Furthermore, though SSCA is proposed for the stochastic nonlinear steady-state analysis, it can be applied to any other kind of nonlinear system simulation with process variations, such as transient analysis.


Session 5B Optimization Issues in Embedded Systems
Time: 13:30 - 15:35 Thursday, January 25, 2007
Location: Room 413
Chairs: Pai Chou (Univ. of California, Irvine, United States), Maziar Goudarzi (Kyushu Univ., Japan)

5B-1 (Time: 13:30 - 13:55)
TitleRetiming for Synchronous Data Flow Graphs
AuthorNikolaos Liveris, Chuan Lin, Jia Wang, *Hai Zhou (Northwestern University, United States), Prithviraj Banerjee (University of Illinois, Chicago, United States)
Pagepp. 480 - 485
KeywordSDF, retiming, high-level synthesis
AbstractIn this paper we present a new algorithm for retiming Synchronous Dataflow (SDF) graphs. The retiming aims at minimizing the cycle length of an SDF. The algorithm is provably optimal and its execution time is improved compared to previous approaches.

5B-2 (Time: 13:55 - 14:20)
TitleSignal-to-Memory Mapping Analysis for Multimedia Signal Processing
AuthorIlie I. Luican, Hongwei Zhu, *Florin Balasa (University of Illinois at Chicago, United States)
Pagepp. 486 - 491
Keywordmemory management, signal-to-memory mapping, intra-array mapping
AbstractThe storage requirements in data-dominant signal processing systems, whose behavior is described by array-based, loop-organized algorithmic specifications, have an important impact on the overall energy consumption, data access latency, and chip area. Finding the optimal storage of the usually large arrays from these behavioral specifications is an important step during memory allocation. This paper proposes more efficient algorithms for two intra-array mapping-to-memory models (of De Greef and Troncon), resulting in an implementation several times faster than the original ones.

5B-3 (Time: 14:20 - 14:45)
TitleMODLEX: A Multi Objective Data Layout EXploration Framework for Embedded Systems-on-Chip
Author*Rajesh Kumar T. S. (Texas Instruments India, India), Ravikumar C. P. (Texas Instruments, India), Govindarajan R. (Indian Institute of Science, India)
Pagepp. 492 - 497
KeywordMemory Architecture, Data Layout, Power-performance Trade-off, Genetic Algorithm
AbstractThe memory subsystem is a major contributor to the performance, power, and area of complex SoCs used in feature rich multimedia products. Hence, the memory architecture of the embedded DSP is complex and usually custom designed with multiple banks of single-ported or dual-ported on-chip scratch pad memory and multiple banks of off-chip memory. Building software for such large complex memories with many of the software components as individually optimized software IPs is a big challenge. In order to obtain good performance and a reduction in memory stalls, the data buffers of the application need to be placed carefully in different types of memory. In this paper we present a unified framework (MODLEX) that combines different data layout optimizations to address the complex DSP memory architectures. Our method models the data layout problem as a multi-objective Genetic Algorithm (GA) with performance and power being the objectives and presents a set of solution points which is attractive from a platform design viewpoint. While most of the work in the literature assumes that performance and power are non-conflicting objectives, our work demonstrates that there is a significant trade-off (up to 70%) that is possible between power and performance.

5B-4 (Time: 14:45 - 15:10)
TitleA Run-Time Memory Protection Methodology
Author*Udaya Seshua (Philips Semiconductors, India), Nagaraju Bussa (Philips Research, India), Bart Vermeulen (Philips Research, Netherlands)
Pagepp. 498 - 503
Keywordmemory protection, software debug, Hardware/Software co-design
AbstractIn this paper we present a novel methodology, which aids in debugging memory corruption errors during application development. This methodology is based on the analysis of the memory access behavior of a set of benchmark applications. The analysis result is used to strike an optimal balance between hardware and software instrumentation to make our approach low-cost both from a performance penalty and hardware area point-of-view. Experimental results show that our innovative approach typically requires less than 2% of CPU silicon area for less than 1% run-time performance overhead, making it applicable in time-constrained embedded systems.

5B-5 (Time: 15:10 - 15:35)
TitleShort-Circuit Compiler Transformation: Optimizing Conditional Blocks
Author*Mohammad Ali Ghodrat, Tony Givargis, Alex Nicolau (University of California, Irvine, United States)
Pagepp. 504 - 510
KeywordShort circuit evaluation, lazy evaluation, compiler transformation, domain space partitioning
AbstractWe present the short-circuit code transformation technique, intended for embedded compilers. The transformation technique optimizes conditional blocks in high-level programs. Specifically, the transformation takes advantage of the fact that the Boolean value of the conditional expression, determining the true/false paths, can be statically analyzed to determine cases when one or the other of the true/false paths is guaranteed to execute. In such cases, code is generated to bypass the evaluation of the conditional expression. In instances when the bypass code is faster to evaluate than the conditional expression, a net performance gain is obtained. Our experiments with the Mediabench applications show that the short-circuit transformation yields an average of 35.1% improvement in execution time for SPARC and an average of 36.3% improvement in execution time for ARM. We also measured an average of 36.4% reduction in power consumption for ARM.
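
The bypass idea can be sketched on a hand-written example. The code below is not output of the compiler described in the paper; the condition, the function names and the assumption that `r` is non-negative are hypothetical, and no speed-up is claimed for Python itself.

```python
def in_circle_original(x, y, r):
    # Original conditional expression: evaluated in full on every call.
    return x * x + y * y < r * r


def in_circle_bypassed(x, y, r):
    # Bypass code (assumes r >= 0): if |x| >= r then x*x >= r*r, so the
    # condition is guaranteed to be false and need not be evaluated.
    if abs(x) >= r:
        return False
    return x * x + y * y < r * r


# Both versions agree on every point of a small integer test domain.
agree = all(
    in_circle_original(x, y, r) == in_circle_bypassed(x, y, r)
    for x in range(-6, 7)
    for y in range(-6, 7)
    for r in range(0, 7)
)
```

The transformation pays off only when the cheap test is faster than the full expression and fires often enough, which is the condition the abstract states for a net gain.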


Session 5C High-Level Synthesis
Time: 13:30 - 15:35 Thursday, January 25, 2007
Location: Room 414+415
Chairs: Ki-seok Chung (Hanyang Univ., Republic of Korea), Katsuharu Suzuki (NEC, Japan)

5C-1 (Time: 13:30 - 13:55)
TitleOptimization of Arithmetic Datapaths with Finite Word-Length Operands
Author*Sivaram Gopalakrishnan, Priyank Kalla (University of Utah, United States), Florian Enescu (Georgia State University, United States)
Pagepp. 511 - 516
KeywordFinite Integer Rings, Modulo Arithmetic
AbstractThis paper presents an approach to area optimization of arithmetic datapaths that perform polynomial computations over bit-vectors with finite widths. Examples of such designs abound in DSP for audio, video and multimedia computations where the input and output bit-vector sizes are dictated by the desired precision. A bit-vector of size m represents integer values reduced modulo 2^m (% 2^m). Therefore, finite word-length bit-vector arithmetic can be modeled as algebra over finite integer rings, where the bit-vector size dictates the ring cardinality. This paper demonstrates how the number-theoretic properties of finite integer rings can be exploited for optimization of bit-vector arithmetic. Along with an analytical model to estimate the implementation cost at RTL, two algorithms are presented to optimize bit-vector arithmetic. Experimental results, conducted within practical CAD settings, demonstrate significant area savings due to our approach.
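
A small worked example shows why arithmetic modulo 2^m opens up optimizations that ordinary integer arithmetic does not. The polynomials `f` and `g` below are hypothetical and chosen for illustration; they are not taken from the paper.

```python
def f(x, m):
    # Degree-2 polynomial as it might appear on an m-bit datapath.
    return (2 ** (m - 1) * x * x + 2 ** (m - 1) * x + 3 * x + 1) % 2 ** m


def g(x, m):
    # Cheaper polynomial: the first two terms of f vanish modulo 2^m,
    # because x*x + x = x*(x + 1) is always even.
    return (3 * x + 1) % 2 ** m


m = 8
equivalent = all(f(x, m) == g(x, m) for x in range(2 ** m))
```

`equivalent` is True: the two polynomials differ as integer polynomials but compute the same function on 8-bit vectors, so the multipliers needed for the quadratic term can be dropped.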

5C-2 (Time: 13:55 - 14:20)
TitleExploiting Power-Area Tradeoffs in Behavioural Synthesis through Clock and Operations Throughput Selection
Author*Marco A. Ochoa-Montiel, Bashir M. Al-Hashimi (University of Southampton, Great Britain), Peter Kollig (Philips Semiconductors, Great Britain)
Pagepp. 517 - 522
KeywordHigh Level Synthesis, low power
AbstractThis paper describes a new dynamic-power aware High Level Synthesis (HLS) data path approach that considers the close interrelation between clock choice and operations throughput selection whilst attempting to minimize area, power, or a combination thereof. It is shown that the proposed approach with its compound cost function and its novel clock and operations throughput selection algorithm, obtains solutions with lower power and area than using previous relevant work [11]. Moreover, different power-area tradeoffs can be explored due to the appropriate choice of clock period and operations throughput using our novel approach.

5C-3 (Time: 14:20 - 14:45)
TitleA Parameterized Architecture Model in High Level Synthesis for Image Processing Applications
Author*Yazhuo Dong, Yong Dou (National University of Defense Technology, China)
Pagepp. 523 - 528
Keywordhigh level synthesis, image processing, data reuse
AbstractMost image processing applications are computationally intensive and data intensive. Reconfigurable hardware boards provide a convenient and flexible solution to speed up these algorithms. To get a high performance design without going through the time-consuming hardware design process for each different algorithm, we present a universal parameterized architecture in high level synthesis to generate the hardware frames for all image processing applications automatically. The values of the parameters that determine the target architecture can be obtained from the compiler. The algorithm for obtaining these parameters is also discussed in this paper.

5C-4 (Time: 14:45 - 15:10)
TitleHigh-Level Power Estimation and Low-Power Design Space Exploration for FPGAs
Author*Deming Chen (University of Illinois at Urbana-Champaign, United States), Jason Cong, Yiping Fan, Zhiru Zhang (University of California, Los Angeles, United States)
Pagepp. 529 - 534
Keywordhigh-level synthesis, low-power, FPGA, power estimation
AbstractIn this paper, we present a simultaneous resource allocation and binding algorithm for FPGA power minimization. To fully validate our methodology and result, our work targets a real FPGA architecture - Altera Stratix FPGA [2], which includes generic logic elements, DSP cores, and memories, etc. We design a high-level power estimator for this architecture and evaluate its estimation accuracy against a commercial gate-level power estimator - Quartus II PowerPlay Analyzer [1]. During the synthesis stage, we pay special attention to interconnects and multiplexers. We concentrate on resource allocation and binding tasks because they are the key steps to determine the interconnections. We use a novel approach to explore the design space during synthesis. It forms, propagates, and prunes synthesis solution points, where each solution point represents one actual implementation of the datapath. Eventually, we generate a design solution curve, which can provide ideal solution points with low power and high performance. Experimental results show that our high-level power estimator is 8.7% away from PowerPlay Analyzer. Meanwhile, we are able to achieve a significant amount of power reduction (32%) with better circuit speed (16%) compared to a traditional resource allocation and binding algorithm.

5C-5 (Time: 15:10 - 15:35)
TitleNumerical Function Generators Using Edge-Valued Binary Decision Diagrams
Author*Shinobu Nagayama (Hiroshima City University, Japan), Tsutomu Sasao (Kyushu Institute of Technology, Japan), Jon Butler (Naval Postgraduate School, United States)
Pagepp. 535 - 540
Keywordedge-valued BDD, non-uniform segmentation, piecewise polynomial approximation, numerical function generator, FPGA
AbstractIn this paper, we introduce the edge-valued binary decision diagram (EVBDD) to reduce the memory and delay in numerical function generators (NFGs). An NFG realizes a function, such as a trigonometric, logarithmic, square root, or reciprocal function, in hardware. NFGs are important in, for example, digital signal applications, where high speed and accuracy are necessary. We use the EVBDD to produce a fast and compact segment index encoder (SIE) that is a key component in our NFG. We compare our approach with NFG designs based on multi-terminal BDD's (MTBDDs), and show that the EVBDD produces SIEs that have, on average, only 7% of the memory and 40% of the delay of those designed using MTBDDs. Therefore, our NFGs based on EVBDDs have, on average, only 38% of the memory and 59% of the delay of NFGs based on MTBDDs.


Session 5D Designers' Forum Panel : Presilicon SoC HW/SW Verification
Time: 13:30 - 15:35 Thursday, January 25, 2007
Location: Small Auditorium, 5F

5D-1
Title(Panel Discussion) Presilicon SoC HW/SW Verification
AuthorOrganizer: Tetsuji Sumioka, Moderator: Tetsuji Sumioka (SONY, Japan), Panelists: Jason Andrews (Cadence, United States), Graham Hellestrand (VaST Systems Technology, United States), Hidefumi Kurokawa (NEC Electronics, Japan), Ilya Klebanov (Advanced Micro Devices, Canada), Seiji Koino (Toshiba, Japan)


Session 6A Timing Modeling and Optimization
Time: 16:00 - 18:05 Thursday, January 25, 2007
Location: Room 411+412
Chairs: Masanori Hashimoto (Osaka Univ., Japan), Charlie Chung-Ping Chen (National Taiwan Univ., Taiwan)

6A-1 (Time: 16:00 - 16:25)
TitleClock Skew Scheduling with Delay Padding for Prescribed Skew Domains
AuthorChuan Lin (Magma Design Automation Inc., United States), *Hai Zhou (Northwestern University, United States)
Pagepp. 541 - 546
KeywordClock skew optimization
AbstractClock skew scheduling is a technique that intentionally introduces skews to memory elements to improve the performance of a sequential circuit. It was shown that the full optimization potential of clock skew scheduling can be reliably implemented using a few skew domains. In this paper we present an optimal skew scheduling algorithm for sequential circuits with flip-flops. Given a finite set of prescribed skew domains, the algorithm finds a domain assignment for each flip-flop such that the clock period is minimized with possible delay padding. Experimental results validate the efficiency of our algorithm and show 17% improvement on average in clock period.
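
The effect of intentional skew on the clock period can be shown with a simplified setup-constraint model. The sketch below ignores hold constraints, setup times, delay padding and skew domains, so it is not the algorithm of the paper; the function name, the two-flip-flop loop and its delays are hypothetical.

```python
def min_clock_period(paths, skew):
    """Smallest period T with skew[i] + d_max <= skew[j] + T for all paths.

    paths is a list of (source flip-flop, sink flip-flop, maximum delay).
    """
    return max(skew[i] - skew[j] + d_max for i, j, d_max in paths)


# Hypothetical loop of two flip-flops with unbalanced path delays.
paths = [("A", "B", 10), ("B", "A", 4)]

zero_skew_period = min_clock_period(paths, {"A": 0, "B": 0})   # 10
scheduled_period = min_clock_period(paths, {"A": 0, "B": 3})   # 7
```

Delaying the clock of flip-flop B by 3 time units lets the long path borrow time from the short one, reducing the period from 10 to 7, the average path delay around the loop.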

6A-2 (Time: 16:25 - 16:50)
TitleAn Efficient Computation of Statistically Critical Sequential Paths Under Retiming
AuthorMongkol Ekpanyapong (Intel Corporation, United States), Xin Zhao, *Sung Kyu Lim (Georgia Institute of Technology, United States)
Pagepp. 547 - 552
Keywordstatistical timing analysis, retiming
AbstractIn this paper we present the Statistical Retiming-based Timing Analysis (SRTA) algorithm. The goal is to compute the timing slack distribution for the nodes in the timing graph and identify the statistically critical paths under retiming, which are the paths with a high probability of becoming timing-critical after retiming. SRTA enables the designers to perform circuit optimization on these paths to reduce the probability of them becoming a timing bottleneck if the circuit is retimed as a post-process. We provide a comparison among static timing analysis (STA), statistical timing analysis (SSTA), retiming-based timing analysis (RTA), and our statistical retiming-based timing analysis (SRTA). Our results show that the placement optimization based on SRTA achieves the best performance results.

6A-3 (Time: 16:50 - 17:15)
TitleFast Electrical Correction Using Resizing and Buffering
AuthorShrirang Karandikar, *Charles J Alpert (IBM Austin Research Laboratory, United States), Mehmet Yildiz, Paul Villarrubia, Steve Quay, Tuhin Mahmud (IBM EDA, United States)
Pagepp. 553 - 558
KeywordElectrical correction, Slew/cap violations, buffering, gate sizing
AbstractCurrent design methodologies are geared towards meeting different design criteria, such as delay, area or power. However, in order to correctly identify the critical parts of a circuit for optimization, the circuit has to be electrically clean -- i.e., slews on each pin have to be within certain limits, a gate cannot drive more than a certain amount of capacitance, etc. Thus far, this requirement has largely been ignored in the literature. Instead, existing methods which optimize delay are used to fix electrical violations. This leads to solutions that are unnecessarily expensive, and still leave violations that remain unfixed. There is therefore a need for an area-efficient strategy that targets the electrical state of a circuit and fixes all violations quickly. This paper explicitly defines "electrical violations" and presents a flexible approach (called EVE, the Electrical Violation Eliminator) for fixing these. Experimental results validate our approach.

6A-4 (Time: 17:15 - 17:40)
TitleSmartSmooth: A Linear Time Convexity Preserving Smoothing Algorithm for Numerically Convex Data with Application to VLSI Design
AuthorSanghamitra Roy (University of Wisconsin-Madison, United States), *Charlie Chung-Ping Chen (National Taiwan University, Taiwan)
Pagepp. 559 - 564
KeywordConvex optimization, Smoothing, Numerically convex, Gate sizing
AbstractConvex optimization problems are very popular in the VLSI design community due to their guaranteed convergence to a global optimal point. When optimizing tabular data, significant effort is required to fit the data into convex form. Fitting the tables into analytically convex forms such as posynomials suffers from excessive fitting errors, as the fitting problem may be non-convex. In recent literature, optimal numerically convex tables have been proposed. Since these tables are numerical, it is extremely important to make the table data smooth and yet preserve its convexity. The smoothness ensures that the convex optimizer behaves predictably and converges quickly to the global optimal point. The existing smoothing techniques either cannot preserve convexity or require very high execution time. In this paper, we propose a linear time algorithm, SmartSmooth, that smooths given numerically convex data while preserving convexity, without introducing any additional error on the numerically convex data. This algorithm can be a significant contribution in the field of optimization of non-analytical data. We present our SmartSmooth results on industrial cell libraries. SmartSmooth, when applied to convex tables produced by ConvexFit, shows a 30X reduction in fitting error over a posynomial fitting algorithm and a 3X reduction in fitting error over the ConvexSmooth algorithm.
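
The property that a smoothing step must preserve can be made concrete for a one-dimensional table. The check below is a minimal sketch assuming equally spaced samples; it tests numerical convexity only and does not implement SmartSmooth. The function name and the sample tables are hypothetical.

```python
def is_numerically_convex(ys, tol=0.0):
    """True if equally spaced samples ys have non-negative second differences."""
    return all(
        ys[i - 1] - 2 * ys[i] + ys[i + 1] >= -tol
        for i in range(1, len(ys) - 1)
    )


convex_table = [9.0, 4.0, 1.0, 0.0, 1.0, 4.0]   # samples of x**2
bumpy_table = [9.0, 4.0, 1.0, 2.0, 1.0, 4.0]    # one entry pushed upward

convex_ok = is_numerically_convex(convex_table)   # True
bumpy_ok = is_numerically_convex(bumpy_table)     # False
```

A convexity-preserving smoother must return data for which this check still passes; a generic smoothing filter gives no such guarantee.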

6A-5 (Time: 17:40 - 18:05)
TitleModeling the Overshooting Effect for CMOS Inverter in Nanometer Technologies
Author*Zhangcai Huang, Hong Yu (The Graduate School of Information, Production and Systems, Waseda University, Japan), Atsushi Kurokawa (Sanyo Semiconductor Company, Japan), Yasuaki Inoue (The Graduate School of Information, Production and Systems, Waseda University, Japan)
Pagepp. 565 - 570
KeywordCMOS inverter, overshooting time, nanometer technologies, timing analysis
AbstractWith the scaling of CMOS technology, the overshooting time due to the input-to-output coupling capacitance has a much more significant effect on inverter delay. Moreover, the overshooting time is also an important parameter in short-circuit power estimation. Therefore, in this paper an effective analytical model is proposed to estimate the overshooting time for the CMOS inverter in nanometer technologies. Furthermore, the influence of process variation on the overshooting time is illustrated based on the proposed model. The proposed model is shown to agree closely with SPICE simulation results.


Session 6B Application Examples with Leading Edge Design Methodology
Time: 16:00 - 18:05 Thursday, January 25, 2007
Location: Room 413
Chairs: Ing-Jer Huang (National Sun-Yat-Sen Univ., Taiwan), Takeshi Ikenaga (Waseda Univ., Japan)

6B-1 (Time: 16:00 - 16:25)
TitleFlow-Through-Queue based Power Management for Gigabit Ethernet Controller
AuthorHwisung Jung (University of Southern California, United States), Andy Hwang (Broadcom Corp., United States), *Massoud Pedram (University of Southern California, United States)
Pagepp. 571 - 576
KeywordLow-power, Gigabit Ethernet controller, energy-efficient, SMDP
AbstractThis paper presents a novel architectural mechanism and a power management structure for the design of an energy-efficient Gigabit Ethernet controller. Key characteristics of such a controller are the low latency and high bandwidth required to meet the pressing demands of extremely high frame and control data, which in turn cause difficulties in managing power dissipation. We propose a flow-through-queue (FTQ) based power management method, which allows some of the tasks involved in processing the frame data to be offloaded. This in turn enables utilization of multiple clock rates and multiple voltages for different cores inside the Ethernet controller. A modeling approach based on semi-Markov decision process (SMDP) and queuing models is employed, which allows one to apply mathematical programming formulations for energy optimization under performance constraints. The proposed Gigabit Ethernet controller is designed with a 130nm CMOS technology that includes both high and low threshold voltages. Experimental results show that the proposed power optimization method can achieve system-wide energy savings under tighter performance constraints.

6B-2 (Time: 16:25 - 16:50)
TitleApproximation Algorithm for Process Mapping on Network Processor Architectures
Author*Chris Ostler, Karam S. Chatha, Goran Konjevod (Arizona State University, United States)
Pagepp. 577 - 582
Keywordnetwork processors, throughput maximization, approximation algorithm
AbstractThe high performance requirements of networking applications have led to the advent of programmable network processor (NP) architectures that incorporate symmetric multi-processing and block multi-threading. The paper presents an automated system-level design technique for process mapping on such architectures with an objective of maximizing the worst case throughput of the application. As this mapping must be done in the presence of resource (processors and code size) constraints, this is an NP-complete problem. We present a polynomial time approximation algorithm which has a proven guarantee to generate solutions with throughput at least 1/2 that of optimal solutions. The proposed algorithm was utilized to map realistic applications on the Intel IXP2400 (NP) architecture, and produced solutions within 78% of optimal.

6B-3 (Time: 16:50 - 17:15)
TitleImplementation of a Real Time Programmable Encoder for Low Density Parity Check Code on a Reconfigurable Instruction Cell Architecture (RICA)
Author*Zahid Khan, Tughrul Arslan (The University of Edinburgh, Great Britain)
Pagepp. 583 - 588
KeywordLDPC, FEC, WiMax, Reconfigurable Computing
AbstractThis paper presents a real time programmable irregular Low Density Parity Check (LDPC) Encoder as specified in the IEEE P802.16E/D7 standard. The encoder is programmable for frame sizes from 576 to 2304 and for five different code rates. The H matrix is efficiently generated and stored for a particular frame size and code rate. The encoder is implemented on the Reconfigurable Instruction Cell Architecture, which has recently emerged as an ultra low power, high performance, ANSI-C programmable embedded core. Different general and architecture specific optimization techniques are applied to enhance the throughput. With the architecture, a throughput from 10 to 19 Mbps has been achieved.

6B-4 (Time: 17:15 - 17:40)
TitleVLSI Design of Multi Standard Turbo Decoder for 3G and Beyond
Author*Imran Ahmed, Tughrul Arslan (University of Edinburgh, Great Britain)
Pagepp. 589 - 594
Keywordreconfigurable, domain specific, turbo decoder, viterbi, vlsi
AbstractTurbo codes have greater error correcting capability than any other known code. Due to their excellent performance, turbo codes have been employed in several transmission systems such as CDMA2000, WCDMA (UMTS), ADSL, IEEE 802.16 metropolitan networks, etc. The computation kernel of the algorithm is very similar and we have exploited this commonality for a turbo decoder VLSI design suitable for deployment using platform based system on chip methodologies. The Turbo and Viterbi components of the unified array are also individually reconfigurable for different standards. This supports the 4G concept that a user can be simultaneously connected to several access technologies (for example Wi-Fi, 3G, GSM, etc.) and can seamlessly move between them. A new normalization scheme for turbo decoding is presented to suit reconfigurable mappings. We also show a dynamic reconfiguration methodology for a context switch between Turbo and Viterbi decoders which does not waste any clock cycles. The reconfigurable Turbo decoder fabric is implemented reusing components of the Viterbi decoder on a 180 nm UMC process technology.

6B-5 (Time: 17:40 - 18:05)
TitleA High-Throughput Low-Power AES Cipher for Network Applications
AuthorShin-Yi Lin, *Chih-Tsun Huang (National Tsing Hua University, Taiwan)
Pagepp. 595 - 600
KeywordAES, Advanced Encryption Standard, Security, Block Cipher, VLSI Design
AbstractWe propose a full-featured high-throughput low-power AES cipher which is suitable for widespread network applications. Different modes of operation are implemented, i.e., the ECB, CBC, CTR and CCM modes. Our cipher utilizes a cost-efficient two-stage pipeline for the CCM mode by a single datapath. With the design-for-test circuitry, the maximum throughput is 4.27 Gbps using a 0.13um CMOS technology with a 333MHz clock rate. The hardware cost is 86.2K gates with the power of 40.9mW.
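The throughput and clock figures quoted above can be cross-checked with a little arithmetic. The sketch below is an inference from the published numbers, not a statement from the paper: it takes the 128-bit AES block size as fixed by the standard and derives how many clock cycles per block the quoted 4.27 Gbps at 333 MHz would imply. The function name is hypothetical.

```python
AES_BLOCK_BITS = 128  # block size fixed by the AES standard


def implied_cycles_per_block(throughput_bps, clock_hz, block_bits=AES_BLOCK_BITS):
    """Clock cycles spent per block if the datapath sustains the given throughput."""
    bits_per_cycle = throughput_bps / clock_hz
    return block_bits / bits_per_cycle


# Figures quoted in the abstract: 4.27 Gbps at a 333 MHz clock.
cycles = implied_cycles_per_block(4.27e9, 333e6)
print(f"about {cycles:.1f} cycles per 128-bit block")
```

The result is close to 10 cycles per block, which would be consistent with one AES round per cycle for a 128-bit key (10 rounds). The abstract does not state this, so treat it as a plausible reading only.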


Session 6C Module/Circuit Synthesis
Time: 16:00 - 18:05 Thursday, January 25, 2007
Location: Room 414+415
Chairs: Shinji Kimura (Waseda Univ., Japan), TingTing Hwang (National Tsing Hua Univ., Taiwan)

6C-1 (Time: 16:00 - 16:25)
TitleImproving XOR-Dominated Circuits by Exploiting Dependencies between Operands
Author*Ajay K. Verma, Paolo Ienne (Ecole Polytechnique Federale de Lausanne, Switzerland)
Pagepp. 601 - 608
KeywordLogic Synthesis, Selective Expansion, XOR-Dominated, Parallel Multiplier
AbstractLogic synthesis has made impressive progress in recent times, pervading digital design and universally replacing manual techniques. A remarkable exception is computer arithmetic, an example being multiple additions performed in carry-save form: column-compressors are usually built exploiting circuit regularity and are hardly optimised further, due to the large number of XOR operations. We show a general technique to optimise XOR-dominated circuits by exploiting the dependencies among the XOR operands, and demonstrate its effectiveness on multiplier-like circuits. We show that it significantly optimises the best parallel multipliers by exploiting complex dependencies between the addenda which escape known manual optimisations.

6C-2 (Time: 16:25 - 16:50)
TitleOptimum Prefix Adders in a Comprehensive Area, Timing and Power Design Space
AuthorJianhua Liu, Yi Zhu, Haikun Zhu (University of California, San Diego, United States), John Lillis (University of Illinois at Chicago, United States), *Chung-Kuan Cheng (University of California, San Diego, United States)
Pagepp. 609 - 615
Keywordlow power, physical synthesis, prefix addition
AbstractThe parallel prefix adder is the most flexible and widely used binary adder for ASIC designs. Many high-level synthesis techniques have been developed to find optimal prefix structures for specific applications. However, the gap between these techniques and back-end designs is increasingly large. In this paper, we propose an integer linear programming method to build minimal-power prefix adders within given timing and area constraints. It counts both gate and wire capacitances in the timing and power models, considers static and dynamic power consumption, and can handle gate sizing and buffer insertion to improve the performance further. The proposed method also adapts to non-uniform arrival time and required time on each bit position. Therefore our method produces the optimum prefix adder for realistic constraints.
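For readers unfamiliar with prefix addition, the following is a minimal bit-level sketch of one well-known prefix structure, the Kogge-Stone adder. It only illustrates what a "prefix structure" is. It is not the ILP method of the paper, which searches over many such structures, and the function name and default width are assumptions of this sketch.

```python
def kogge_stone_add(a, b, width=8):
    """Add two unsigned integers using a Kogge-Stone parallel prefix network."""
    # Bitwise generate (g) and propagate (p) signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Prefix network: after the last level, G[i] is the carry out of bit i.
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        G_next, P_next = G[:], P[:]
        for i in range(dist, width):
            G_next[i] = G[i] | (P[i] & G[i - dist])
            P_next[i] = P[i] & P[i - dist]
        G, P = G_next, P_next
        dist *= 2

    # The carry into bit i is the carry out of bit i - 1 (no carry into bit 0).
    carries = [0] + G[:-1]
    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total | (G[width - 1] << width)  # append the final carry out


print(kogge_stone_add(200, 100))
```

Other prefix structures compute the same carries with different trade-offs in logic depth, fanout, and wiring. That trade-off space is what an optimization over prefix structures explores.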

6C-3 (Time: 16:50 - 17:15)
TitleAn Interconnect-Centric Approach to Cyclic Shifter Design Using Fanout Splitting and Cell Order Optimization
AuthorHaikun Zhu, Yi Zhu, *Chung-Kuan Cheng (University of California, San Diego, United States), David M. Harris (Harvey Mudd College, United States)
Pagepp. 616 - 621
Keywordcyclic shifter, interconnect, fanout splitting, permutation, integer linear programming
AbstractWe propose two orthogonal approaches to logarithmic cyclic shifter design. The first method, called fanout splitting, replaces multiplexers in a conventional design with demultiplexers which have two fanouts driving the shifting and non-shifting paths separately. The use of demultiplexers has a two-fold effect; it cuts the accumulated wire load on the critical path from $O(N\log_2(N))$ to $O(N)$, and reduces the switching probabilities on the inter-stage long wires from 1/4 to 3/16. We then perform cell order optimization to further improve the delay, and formulate it as an integer linear programming problem. For the 64-bit case, the two approaches together reduce the total delay by 67.1% and dynamic power consumption by 17.6%, respectively.
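As background, a conventional logarithmic cyclic shifter can be sketched behaviourally as follows: stage k rotates by 2^k positions when bit k of the shift amount is set, so an N-bit rotate needs log2(N) stages. This models the conventional multiplexer-based design only, not the fanout-splitting or cell-ordering techniques of the paper. The function name and the rotation direction are assumptions of this sketch.

```python
def log_cyclic_shift(bits, amount):
    """Rotate a list so that the element at index i moves to (i + amount) mod N.

    N must be a power of two. Each loop iteration models one stage of 2:1
    multiplexers selecting between the shifted and non-shifted paths.
    """
    n = len(bits)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    stage = 0
    while (1 << stage) < n:
        k = 1 << stage
        select = (amount >> stage) & 1
        # Output j takes input (j - k) mod n when selected, else input j.
        bits = [bits[(j - k) % n] if select else bits[j] for j in range(n)]
        stage += 1
    return bits


print(log_cyclic_shift(list(range(8)), 3))
```

In hardware, the wires feeding input (j - k) mod n grow longer at later stages, which is why the interconnect load on the critical path is a central concern in this design.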

6C-4 (Time: 17:15 - 17:40)
TitleOptimization of Robust Asynchronous Circuits by Local Input Completeness Relaxation
Author*Cheoljoo Jeong, Steven M. Nowick (Columbia University, United States)
Pagepp. 622 - 627
Keywordasynchronous circuits, input completeness, dual-rail encoding, relaxation
AbstractAs process, temperature and voltage variations become significant in deep submicron design, timing closure becomes a critical challenge using synchronous CAD flows. One attractive alternative is to use robust asynchronous circuits which gracefully accommodate timing discrepancies. However, there is currently little CAD support for such robust methodologies. In this paper, optimization algorithms for a class of highly-robust asynchronous circuits are presented. Though the considered circuit style is robust to timing variation, it suffers from high area overhead inherent in the style. The proposed algorithm optimizes area and delay of these circuits by relaxing their overly-restrictive style. The algorithm was implemented and evaluated on MCNC circuits, achieving significant improvement while still preserving the same robustness property of the circuit. On average, 49.2% of the gates of the circuits could be implemented in a relaxed manner and, as a result, 34.9% area improvement was achieved, and 16.1% delay improvement was achieved using a simple heuristic for targeting the critical path in the circuit. This is the first proposed approach that systematically optimizes circuits based on the notion of local relaxation while still preserving the circuit's overall timing-robustness.

6C-5 (Time: 17:40 - 18:05)
TitleSafe Delay Optimization for Physical Synthesis
Author*Kai-hui Chang, Igor L. Markov, Valeria Bertacco (University of Michigan at Ann Arbor, United States)
Pagepp. 628 - 633
Keywordphysical synthesis, delay optimization, safe
AbstractPhysical synthesis is a relatively young field in Electronic Design Automation. Many published optimizations for physical synthesis end up hurting the final result, often by neglecting important physical aspects of the layout, such as long wires or routing congestion. In this work we propose SafeResynth, a safe resynthesis technique, which provides immediately-measurable delay improvement without altering the design's functionality. It can enhance circuit timing without detrimental effects on route length and congestion. When applied to IWLS'05 benchmarks, SafeResynth improves circuit delay by 11% on average after routing, while increasing route length and via count by less than 0.2%. Our resynthesis can also be used in an unsafe mode, akin to more traditional physical synthesis algorithms popular in commercial tools. Applied together, our safe and unsafe transformations achieve 24% average delay improvement for seven large benchmarks from the OpenCores suite. The relative contribution of safe and unsafe techniques varies depending on the amount of whitespace in the layout.


Session 6D Designers' Forum: Low-power SoC Technologies
Time: 16:00 - 17:50 Thursday, January 25, 2007
Location: Small Auditorium, 5F
Chairs: Haruyuki Tago (Toshiba Corporation, Japan), Kazutoshi Kobayashi (Kyoto University, Japan)

6D-1 (Time: 16:00 - 16:20)
Title(Invited Paper) Plenary Talk --Overview on Low Power SoC Design Technology--
Author*Kimiyoshi Usami (Shibaura Institute of Technology, Japan)
Pagepp. 634 - 636
Keywordlow power, leakage, power gating
AbstractSo far, low power design for SoC has mainly focused on techniques to reduce dynamic power and standby leakage power. In further scaled devices, design technology to reduce active leakage power in operation mode becomes indispensable. This is because the share of leakage power in the total operation power continues to increase as the device gets scaled. This paper gives a brief overview of the conventional leakage reduction techniques and describes novel approaches that use run-time power gating for active leakage reduction.

6D-2 (Time: 16:20 - 16:50)
Title(Invited Paper) Development of Low-power and Real-time VC-1/H.264/MPEG-4 Video Processing Hardware
Author*Masaru Hase, Kazushi Akie, Masaki Nobori, Keisuke Matsumoto (Renesas Technology, Japan)
Pagepp. 637 - 643
KeywordCodec, VC-1, H.264, MPEG-4, IP
AbstractThis paper covers a multi-functional hardware intellectual property (IP) for the encoding and decoding of digital moving pictures with low power consumption. The IP is mainly intended for mobile products such as cellular phones, digital still cameras (DSCs), and digital video cameras (DVCs). It includes VC-1 functionality for Internet content plus AVC (H.264) functionality for digital television broadcasting and MPEG-4 functionality for TV telephony, and is capable of processing D1-sized moving pictures (720 pixels by 480 lines) in real time at an operating frequency of 54 MHz. In addition, original algorithms employed in the IP reduce power consumption by up to 22%.

6D-3 (Time: 16:50 - 17:20)
Title(Invited Paper) Development of Low Power ISDB-T One-Segment Decoder by Mobile Multi-Media Engine SoC (S1G)
Author*Koichi Mori, Masakazu Suzuki, Yasuo Ohara, Satoru Matsuo, Atsushi Asano (Toshiba, Japan)
Pagepp. 644 - 648
KeywordISDB-T, Low Power, Multi-Media, Processor, Mobile Communication
AbstractTOSHIBA has developed a mobile multi-media engine SoC, called S1G, which realizes low power ISDB-T one-segment decoding at 42mW, in a short development period of eight months. Since MPEG2 TS de-multiplexing, AAC decoding and H.264 decoding should be simultaneously processed in ISDB-T one-segment decoding, two TOSHIBA MeP (Media embedded Processor) processors, one DSP and hardware blocks are used effectively with pipeline operation in this LSI. Although it is generally considered that a dedicated hardware accelerator should be used to realize low power operation for ISDB-T one-segment decoding, TOSHIBA succeeded in developing a low power ISDB-T one-segment decoder using maximum software resources.

6D-4 (Time: 17:20 - 17:50)
Title(Invited Paper) Low Power Techniques for Mobile Application SoCs based on Integrated Platform "UniPhier"
Author*Masaitsu Nakajima, Takao Yamamoto, Masayuki Yamasaki, Tetsu Hosoki, Masaya Sumita (Matsushita Electric Industrial, Japan)
Pagepp. 649 - 653
KeywordLow Power
AbstractIn this presentation, low power techniques for mobile application SoCs based on the integrated platform "UniPhier" are introduced. For SoCs, hierarchical power reduction approaches are prepared at each level: the SoC architecture level, the UniPhier Processor level, the IPP Processor level, and the circuit level. When developing a UniPhier-based SoC for mobile applications, we can pick a suitable combination of low power techniques to realize the target and can make a trade-off between power and cost.



Friday, January 26, 2007

Session 3K Keynote Address III
Time: 9:00 - 10:00 Friday, January 26, 2007
Location: Small Auditorium, 5F
Chair: Hidetoshi Onodera (Kyoto Univ., Japan)

3K-1 (Time: 9:00 - 10:00)
Title(Keynote Address) How Foundry can Help Improve your Bottom-line? Accuracy Matters!
AuthorFu-Chieh Hsu (Taiwan Semiconductor Manufacturing Company, Taiwan)
Keyword
AbstractAs the leading edge of technology advances into the nanometer era, process data accuracy becomes increasingly important to the success of product designs. The gap between theoretical benefit and benefit obtainable by designers grows wider with each new technology node. However, foundries and EDA tool vendors can collaborate to reclaim some of the lost benefits of these technology nodes. In this talk, I will discuss how foundries can contribute in the effort to reclaim lost benefits through better model and data accuracy, while EDA tool vendors contribute through improved design approaches. I will give some examples of TSMC’s approaches in improving SPICE model accuracy and DFM accuracy, as well as collaboration with EDA tool vendors in creating our DFM Data Kit. By increasing awareness of TSMC’s approach to this issue, I hope to stimulate discussion from all sides of the industry in the search for more solutions.


Session 7A Advanced Methods for Leakage Reduction
Time: 10:15 - 12:20 Friday, January 26, 2007
Location: Room 411+412
Chairs: Masanori Hashimoto (Osaka Univ., Japan), Ankur Gupta (Cadence Design System, United States)

7A-1 (Time: 10:15 - 10:40)
TitleSimultaneous Control of Subthreshold and Gate Leakage Current in Nanometer-Scale CMOS Circuits
AuthorYoungsoo Shin, Sewan Heo, *Hyung-Ock Kim (KAIST, Republic of Korea), Jung Yun Choi (Samsung Electronics, Republic of Korea)
Pagepp. 654 - 659
Keywordlow power, leakage, power gating, design methodology, gate leakage
AbstractPower gating has been widely used to reduce subthreshold leakage. However, its efficiency degrades very fast with technology scaling due to the gate leakage of circuits specific to power gating, such as storage elements and output interface circuits with a data-retention capability. A new scheme called supply switching with ground collapse is proposed to control both gate and subthreshold leakage in nanometer-scale CMOS circuits. Compared to power gating, the leakage is cut by a factor of 6.3 with 65nm and 8.6 with 45nm technology. Various issues in implementing the proposed scheme using standard-cell elements are addressed, from RTL to layout. The proposed design flow is demonstrated on a commercial design with 90nm technology, and a leakage saving by a factor of 32 is observed with increases of 3% in area and 6% in wirelength.

7A-2 (Time: 10:40 - 11:05)
TitleRuntime Leakage Power Estimation Technique for Combinational Circuits
Author*Yu-Shiang Lin, Dennis Sylvester (University of Michigan, United States)
Pagepp. 660 - 665
Keywordsubthreshold leakage, power estimation
AbstractThis paper carefully examines subthreshold leakage during circuit operation (runtime) and develops a novel analysis technique to capture this important effect, which is currently ignored in traditional steady-state leakage calculation approaches. We implement novel dynamic and static estimation methods that provide significant speed improvements over full SPICE simulations and yield estimation errors of approximately 12% on average compared to more than 2X errors in steady-state based subthreshold leakage analysis.

7A-3 (Time: 11:05 - 11:30)
TitleLogic and Layout Aware Voltage Island Generation for Low Power Design
Author*Liangpeng Guo, Yici Cai, Qiang Zhou, Xianlong Hong (Tsinghua Univ., China)
Pagepp. 666 - 671
KeywordVoltage Island, Low Power, Placement, Level Converter
AbstractMultiple supply voltage (MSV) is one of the most effective schemes to achieve low power, but most works are based on the logic level. A few recent works are based on the physical level, but none of them consider level converters, which have an important effect in dual-vdd design. In this work we propose a logic and layout aware approach for voltage assignment and voltage island generation in the placement process to minimize the number of level converters and to implement voltage islands with minimal overheads. Experimental results show that our approach uses far fewer level converters than the approach in [1] (reduced by 59.50% on average) when achieving the same power savings. The approach is able to produce feasible placement with a small impact on traditional placement goals.

7A-4 (Time: 11:30 - 11:55)
TitleA Fast Probability-Based Algorithm for Leakage Current Reduction Considering Controller Cost
AuthorTsung-Yi Wu, Jr-Luen Tzeng, *Kuang-Yao Chen (National Changhua University of Education, Taiwan)
Pagepp. 672 - 677
KeywordMinimum Leakage Vector, Input Vector Control, Leakage Current Reduction, Sleep Mode, Low Power Design
AbstractIn this paper, we propose a probability-based algorithm that can rapidly find a minimum leakage vector (MLV). Unlike most traditional techniques, which ignore the leakage current overhead of the newly added MLV controller, our technique takes it into account. Ignoring this overhead during solution exploration can cause a non-optimum solution to be mistaken for an optimum one. Experimental results show that our algorithm can reduce the leakage current by up to 48% and can find the optimum solutions on 85% of MCNC benchmarks.

7A-5 (Time: 11:55 - 12:20)
TitleA Timing-Driven Algorithm for Leakage Reduction in MTCMOS FPGAs
Author*Hassan Hassan, Mohab Anis, Mohamed Elmasry (University of Waterloo, Canada)
Pagepp. 678 - 683
KeywordFPGA, Leakage Power, MTCMOS
AbstractA timing-driven MTCMOS (T-MTCMOS) CAD methodology is proposed for subthreshold leakage power reduction in nanometer FPGAs. The methodology uses the circuit timing information to tune the performance penalty due to sleep transistors according to the path delays, achieving an average leakage reduction of 44.36% when applied to FPGA benchmarks using a CMOS 0.13um process. Moreover, the methodology is applied to several FPGA architectures and CMOS technologies.


Session 7B Uncertainty Aware Interconnect Design
Time: 10:15 - 12:20 Friday, January 26, 2007
Location: Room 413
Chairs: Chih-Tsun Huang (National Tsing Hua Univ., Taiwan), Takashi Sato (Tokyo Inst. of Tech., Japan)

7B-1 (Time: 10:15 - 10:40)
TitleApproaching Speed-of-light Distortionless Communication for On-chip Interconnect
AuthorHaikun Zhu, Rui Shi (University of California, San Diego, United States), Hongyu Chen (Synopsys Inc., United States), *Chung-Kuan Cheng (University of California, San Diego, United States)
Pagepp. 684 - 689
Keywordglobal interconnect, transmission line, distortionless, speed-of-light, serial link
AbstractWe extend the Surfliner on-chip distortionless transmission line scheme and provide more details for the implementation issues. Surfliner seeks to approach distortionless transmission by intentionally adding shunt resistors between the signal line and the ground. In theory if we distributively make the shunt conductance G=RC/L, there will be no distortion at the receiver end and the signal propagates at the speed of light. We show the feasibility and advantages of this shunt resistor scheme by a real design case of single-ended microstrip line in 0.10$\mu$m technology. The simulation results indicate we can achieve near perfect signaling of 10 Gbps data over a 10 mm serial link, yet no pre-emphasis/equalization or other special techniques are needed. Guidelines for determining the optimal value and spacing of the shunt resistors are also provided.
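The condition G = RC/L quoted above is the classical distortionless (Heaviside) condition, and a short derivation shows why it removes distortion. The symbols R, L, G, C are the per-unit-length resistance, inductance, shunt conductance, and capacitance. The symbols for the propagation constant, attenuation, phase constant, velocity, and characteristic impedance are standard transmission-line notation introduced here, not taken from the abstract.

```latex
\begin{align*}
\gamma &= \sqrt{(R + j\omega L)(G + j\omega C)}
  && \text{(propagation constant of a general line)} \\
G + j\omega C &= \frac{C}{L}\,(R + j\omega L)
  && \text{(substituting } G = RC/L \text{)} \\
\gamma &= \sqrt{\frac{C}{L}}\,(R + j\omega L)
  = \underbrace{R\sqrt{\frac{C}{L}}}_{\alpha}
  + j\,\underbrace{\omega\sqrt{LC}}_{\beta} \\
\alpha &= R\sqrt{\frac{C}{L}} = \sqrt{RG},
  \qquad v = \frac{\omega}{\beta} = \frac{1}{\sqrt{LC}},
  \qquad Z_0 = \sqrt{\frac{R + j\omega L}{G + j\omega C}} = \sqrt{\frac{L}{C}}
\end{align*}
```

The attenuation and the phase velocity are both independent of frequency, so every frequency component is scaled and delayed identically and the waveform shape is preserved. The velocity 1/sqrt(LC) is the speed of light in the surrounding dielectric. The cost is the constant attenuation, which grows with the added shunt conductance.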

7B-2 (Time: 10:40 - 11:05)
TitleDelay Uncertainty Reduction by Interconnect and Gate Splitting
AuthorVineet Agarwal, Jin Sun, Alexandar Mitev, *Janet Wang (University of Arizona at Tucson, United States)
Pagepp. 690 - 695
Keywordgate splitting, interconnect splitting
AbstractTraditional timing variation reduction techniques are only able to decrease the gate delay variation by incurring a delay overhead. In this work, we propose novel and effective splitting based variation reduction techniques for both interconnects and gates. We developed a new tool called Timing Uncertainty Reduction by Gate-Interconnect Splitting which reduces the timing variations of a circuit. It is shown that using splitting on interconnect can reduce the Chemical-Mechanical Polishing (CMP) induced dishing effect and can result in an average decrease of 5% in mean interconnect delay, in addition to a decrease in its variation. Improvements of up to 30% are achieved on timing variation for gates of various sizes, while a reduction of 55% can be observed in interconnect delay variation.

7B-3 (Time: 11:05 - 11:30)
TitleTransition Skew Coding: A Power and Area Efficient Encoding Technique for Global On-Chip Interconnects
Author*Charbel Akl, Magdy Bayoumi (University of Louisiana at Lafayette, United States)
Pagepp. 696 - 701
Keywordencoding, repeaters
AbstractGlobal signaling is becoming more and more challenging as technology scales down toward the deep submicron. We propose a new bus encoding technique, transition skew coding, that targets many of the global interconnect challenges such as crosstalk, peak energy and current, switching and leakage power, repeater area, wiring area, signal integrity and noise. Simulations are done on different bus lengths using a 90 nm library. Repeater sizing and spacing are optimized, and the proposed encoded bus is compared against a standard bus and a bus with shields inserted between every two wires. The encoding and decoding latencies are also analyzed. Simulations show that transition skew coding is efficient in terms of energy and area with low encoding and decoding latency overhead.

7B-4 (Time: 11:30 - 11:55)
TitleFast Buffered Delay Estimation Considering Process Variations
AuthorTien-Ting Fang, *Ting-Chi Wang (National Tsing Hua University, Taiwan)
Pagepp. 702 - 707
Keywordbuffer, timing estimation, statistical, process variations
AbstractAdvanced process technologies impose more significant challenges, especially when manufactured circuits exhibit substantial process variations. Consideration of process variations becomes critical to ensure high parametric timing yield. During the design stage, fast estimation of the achievable buffered delay can guide more accurate and efficient wire planning and timing analysis in floorplanning or global routing. In this paper, we derive approximated first-order canonical forms for buffered delay estimation which consider the effect of process variations and the presence of buffer blockages. We empirically show that an existing deterministic delay estimation method will be over-pessimistic and thus result in unnecessary design rollback. The experimental results also show that our method can estimate buffered delay with 4% average error but achieve up to 149 times speedup when compared to a state-of-the-art statistical buffer insertion method.
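A first-order canonical form, as commonly used in statistical timing analysis, writes a delay as a mean plus a linear combination of shared variation sources plus an independent random term. The sketch below shows how two such forms add along a path and how the resulting standard deviation is computed. It assumes unit-variance, mutually independent normal sources. The class name, field names, and numbers are hypothetical, and the paper's exact formulation is not given in the abstract.

```python
import math
from dataclasses import dataclass


@dataclass
class CanonicalDelay:
    """delay = mean + sum(sens[i] * X_i) + rand * R, with X_i and R ~ N(0, 1)."""
    mean: float
    sens: tuple   # sensitivities to shared (global) variation sources
    rand: float   # coefficient of the independent random term

    def __add__(self, other):
        # Shared sources add linearly; independent terms add in quadrature.
        # Both operands are assumed to use the same ordered list of sources.
        return CanonicalDelay(
            self.mean + other.mean,
            tuple(a + b for a, b in zip(self.sens, other.sens)),
            math.hypot(self.rand, other.rand),
        )

    def std(self):
        return math.sqrt(sum(s * s for s in self.sens) + self.rand ** 2)


# Illustrative numbers only (arbitrary units): a wire segment and a buffer.
wire = CanonicalDelay(mean=10.0, sens=(3.0, 0.0), rand=4.0)
buf = CanonicalDelay(mean=20.0, sens=(0.0, 4.0), rand=3.0)
path = wire + buf
print(path.mean, path.std())
```

Keeping delays in this form lets correlation through the shared sources be tracked exactly under the linear model, whereas a deterministic worst-case analysis would add worst-case values that cannot all occur at once.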

7B-5 (Time: 11:55 - 12:20)
TitlePredicting the Performance and Reliability of Carbon Nanotube Bundles for On-Chip Interconnect
Author*Arthur Nieuwoudt, Mosin Mondal, Yehia Massoud (Rice University, United States)
Pagepp. 708 - 713
Keywordcarbon nanotube, modeling, alternative interconnect technologies
AbstractSingle-walled carbon nanotube (SWCNT) bundles have the potential to provide an attractive solution for the resistivity and electromigration problems faced by traditional copper interconnect. In this paper, we evaluate the performance and reliability of nanotube bundles for future VLSI applications. We develop a scalable equivalent circuit model that captures the statistical distribution of metallic nanotubes while accurately incorporating recent experimental and theoretical results on inductance, contact resistance, and ohmic resistance. Leveraging the circuit model, we examine the performance and reliability of nanotube bundles including inductive effects. The results indicate that SWCNT interconnect bundles can provide significant improvement in delay over copper interconnect depending on the bundle geometry and process technology.


Session 7C Test Cost Reduction Techniques
Time: 10:15 - 12:20 Friday, January 26, 2007
Location: Room 414+415
Chairs: Sudhakar M. Reddy (Univ. of Iowa, United States), Tomoo Inoue (Hiroshima City Univ., Japan)

7C-1 (Time: 10:15 - 10:40)
TitleShelf Packing to the Design and Optimization of A Power-Aware Multi-Frequency Wrapper Architecture for Modular IP Cores
Author*Danella Zhao, Unni Chandran (University of Louisiana at Lafayette, United States), Hideo Fujiwara (Nara Institute of Science and Technology, Japan)
Pagepp. 714 - 719
KeywordModular SoC Test, Multi-frequency wrapper design, power aware architecture, resource constrained test scheduling
AbstractThis paper proposes a novel power-aware multi-frequency wrapper architecture design to achieve at-speed testability. The trade-offs between power dissipation, scan time and bandwidth are handled by gating off certain virtual cores at a time while parallelizing the remaining ones. A shelf packing based optimization algorithm is proposed to design and optimize the wrapper architecture while minimizing the test time under power and bandwidth constraints.

7C-2 (Time: 10:40 - 11:05)
TitleCore-Based Testing of Multiprocessor System-on-Chips Utilizing Hierarchical Functional Buses
Author*Fawnizu Azmadi Hussin, Tomokazu Yoneda (Nara Institute of Science and Technology, Japan), Alex Orailoglu (University of California, San Diego, United States), Hideo Fujiwara (Nara Institute of Science and Technology, Japan)
Pagepp. 720 - 725
KeywordSystem-on-Chip, Power-constrained, Multiprocessor, Packet-based, Test Scheduling
AbstractAn integrated test scheduling methodology for multiprocessor System-on-Chips (SOC) utilizing the functional buses for test data delivery is described. The proposed methodology handles both flat bus single processor SOC and hierarchical bus multiprocessor SOC. It is based on a resource graph manipulation and a packet-based packet set scheduling methodology. The resource graph is decomposed into a set of test configuration graphs, which are then used to determine the optimum test configurations and test delivery schedule under a given power constraint. In order to validate the effectiveness of the proposed methodology, a number of experiments are run on several modified benchmark circuits. The results clearly underscore the advantages of the proposed methodology.

7C-3 (Time: 11:05 - 11:30)
TitleAn Architecture for Combined Test Data Compression and Abort-on-Fail Test
Author*Erik Larsson, Jon Persson (Linköpings Universitet, Sweden)
Pagepp. 726 - 731
Keywordabort-on-fail, compression, ATE
AbstractThe low throughput at IC (Integrated Circuit) testing is mainly due to the increasing test data volume, which leads to high ATE (Automatic Test Equipment) memory requirements and long test application times. In contrast to previous approaches that address either test data compression or abort-on-fail testing, we propose an architecture for combined test data compression and abort-on-fail testing. The architecture improves throughput through multi-site testing as the ATE memory requirement is constant and independent of the degree of multi-site testing. For flexibility in modifying the test data at any time, we make use of a test program for decompression; only test independent evaluation logic is added to the IC. Major advantages compared to MISR (Multiple-Input Signature Register) based schemes are that our scheme (1) allows abort-on-fail testing at clock-cycle granularity, (2) does not impact diagnostic capabilities, and (3) needs no special care for the handling of unknowns (X).

7C-4 (Time: 11:30 - 11:55)
TitleRunBasedReordering: A Novel Approach for Test Data Compression and Scan Power
Author*Hao Fang, Chenguang Tong, Xu Cheng (Micro Processor Research and Development Center of Peking University, China)
Pagepp. 732 - 737
Keywordtest data compression, scan power, scan frame, reorder, run
AbstractAs the large test data volume is becoming one of the major problems in testing System-on-a-Chip (SoC) designs, several compression coding schemes have been proposed. Extended frequency-directed run-length (EFDR) is one of the best coding compression schemes. In this paper, we present a novel algorithm named RunBasedReordering (RBR), which is based on EFDR codes. Three techniques are applied in this algorithm: scan chain reordering, scan polarity adjustment and test pattern reordering. Experimental results show that the test data compression ratio is significantly improved and scan power consumption is dramatically reduced. Moreover, our algorithm can be easily integrated into the existing industrial flow with little area penalty.

7C-5 (Time: 11:55 - 12:20)
TitleSystematic Scan Reconfiguration
Author*Ahmad Al-Yamani (KFUPM, Saudi Arabia), Narendra Devta-Prasanna (University of Iowa, United States), Arun Gunda (LSI Logic, United States)
Pagepp. 738 - 743
KeywordDFT, Scan Test, Test Compression, Test Cost Reduction
AbstractWe present a new test data compression technique that achieves 10x to 40x compression ratios without requiring any information from the ATPG tool about the unspecified bits. The technique is applied to both single stuck-at and transition fault test sets. The technique allows aggressive parallelization of scan chains, leading to a similar reduction in test time. It also reduces tester pin requirements by similar ratios. The technique is implemented with a hardware overhead of a few gates per scan chain.


Session 7D SPECIAL SESSION: Multi-Processor Platforms for Next Generation Embedded Systems
Time: 10:15 - 12:20 Friday, January 26, 2007
Location: Room 416+417
Chair: Nikil Dutt (Univ. of California, Irvine, United States)

7D-1 (Time: 10:15 - 10:35)
Title(Invited Paper) Configurable Multi-Processor Platforms for Next Generation Embedded Systems
Author*David Goodwin, Chris Rowen, Grant Martin (Tensilica, United States)
Pagepp. 744 - 746
Keywordprocessor, configurable, mpsoc, embedded
AbstractNext-generation embedded systems in application domains such as multimedia, wired and wireless communications, and multipurpose portable devices, are increasingly turning to multiprocessor platforms as a vehicle for their realization. But entirely fixed platforms composed of entirely fixed components lack the flexibility and ability to be optimized to the application to offer the best solution in any of these areas. Configurability at multiple levels offers a much better chance to optimize the resulting multiprocessor platform. Existing and emerging technologies for configurable and extensible processors and the creation of configurable multiprocessor subsystem platforms offer significant capability to design teams to both differentiate and optimize their products.

7D-2 (Time: 10:35 - 10:55)
Title(Invited Paper) ARM MPCore: The Streamlined and Scalable ARM11 Processor Core
Author*Kazuyuki Hirata (ARM, Japan), John Goodacre (ARM, Great Britain)
Pagepp. 747 - 748
Keywordmulti, core, processor, ARM
AbstractThe required processing performance of embedded processor cores keeps rising, yet power consumption must not increase dramatically. At the same time, large SoC designs carry a higher risk of re-spins and long design times due to the complexity and difficulty of verification. ARM offers a multi-core solution to overcome this situation across various applications.

7D-3 (Time: 10:55 - 11:15)
Title(Invited Paper) The Potential of Cell BE as a Platform Technology for Embedded Systems
AuthorPeter Hofstee (IBM, United States)

7D-4 (Time: 11:15 - 11:35)
Title(Invited Paper) Many-Core Platforms in Search for Supporting Tools
AuthorRudy Lauwereins (IMEC, Belgium)

7D-5 (Time: 11:35 - 11:55)
Title(Invited Paper) Nomadik®: A Mobile Multimedia Application Processor Platform
Author*Maurizio Paganini (STMicroelectronics, France)
Pagepp. 749 - 750
AbstractThe Nomadik® platform supports a range of consumer-oriented multimedia applications, and is specifically designed for mobile applications. The combined use of an industry-standard host processor and multiple low-power DSPs associated with dedicated hardware accelerators makes for a flexible, yet low-cost and low-power solution.

7D-6 (Time: 11:55 - 12:20)
Title(Panel Discussion) Multi-Processor Platforms for Next Generation Embedded Systems
AuthorOrganizer: Nikil Dutt, Moderator: Nikil Dutt (Univ. of California, Irvine, United States), Panelists: David Goodwin (Tensilica, United States), Kazuyuki Hirata (ARM, Japan), Peter Hofstee (IBM, United States), Rudy Lauwereins (IMEC, Belgium), Maurizio Paganini (STMicroelectronics, France)


Session 8A Advancement in Power Analysis and Optimization
Time: 13:30 - 15:35 Friday, January 26, 2007
Location: Room 411+412
Chairs: Youngsoo Shin (KAIST, Republic of Korea), Takayuki Watanabe (Univ. of Shizuoka, Japan)

8A-1 (Time: 13:30 - 13:55)
TitleFast Decoupling Capacitor Budgeting for Power/Ground Network Using Random Walk Approach
Author*Le Kang, Yici Cai, Yi Zou, Jin Shi, Xianlong Hong (Tsinghua University, China), Sheldon X.-D. Tan (University of California, Riverside, United States)
Pagepp. 751 - 756
KeywordPower/Ground, Optimization, Random Walk, Leakage
AbstractThis paper proposes a fast and practical decoupling capacitor (decap) budgeting algorithm to optimize the power/ground (P/G) network design. The new method adopts a modified random walk process to partition the circuit. Then, by utilizing the isolation property of decaps, this new method avoids solving the large nonlinear programming problem of the traditional decap optimization process. This method also integrates a leakage current optimization algorithm that uses a refined leakage model. Experimental results demonstrate that our proposed method achieves approximately a 10X speedup over the sensitivity-based heuristic method, with only about 6% decap area deviation from the optimal budget obtained by the programming method.

8A-2 (Time: 13:55 - 14:20)
TitleTiming-Aware Decoupling Capacitance Allocation in Power Distribution Networks
Author*Sanjay Pant, David Blaauw (University of Michigan, United States)
Pagepp. 757 - 762
Keyworddecap, Ldidt, power grid, timing
AbstractPower supply noise increases the circuit delay, which may lead to performance failure of the design. Decoupling capacitance (decap) addition is effective in reducing the power supply noise, thus making the supply network more robust in the presence of large switching currents. Traditionally, decaps have been allocated in order to minimize the worst-case voltage drop in the power grid. In this paper, we propose an approach for timing-aware decap allocation which uses global timing slacks to drive the decap optimization. Non-critical gates with larger timing slacks can tolerate a relatively higher supply voltage drop than the gates on the critical paths. The decap allocation is formulated as a non-linear optimization problem using Lagrangian relaxation, and a modified adjoint method is used to efficiently obtain the sensitivities of the objective function to decap sizes. A fast path-based heuristic is also implemented and compared with the global optimization formulation. The approaches have been implemented and tested on ISCAS85 benchmark circuits and grids of different sizes. Compared to uniformly allocated decaps, the proposed approach utilizes 35.5% less total decap to meet the same delay target. For the same total decap budget, the proposed approach is shown to improve the circuit delay by 10.1% on average.

8A-3 (Time: 14:20 - 14:45)
TitleFast Placement Optimization of Power Supply Pads
AuthorYu Zhong, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, United States)
Pagepp. 763 - 767
Keywordpower grid, power pad
AbstractPower grid networks in VLSI circuits are required to provide adequate input supply to ensure reliable performance. In this paper, we propose algorithms to find the placement of power pads that minimizes not only the worst voltage drop but also the voltage deviation across the power grid. Our algorithm uses simulated annealing to minimize the total cost of voltage drops. The key enabler for efficient optimization is a fast localized node-based iterative method to compute the voltages after each movement of pads. Experimental results show that our algorithm demonstrates good runtime characteristics for power grids with large numbers of pad candidates in multi-million-size circuits. For a 16-million-node power grid with 646 thousand pad candidates, our algorithm took 72 minutes to improve the worst voltage drop from 0.398 V to 0.196 V and reduce the deviation of voltages on the power grid from 0.134 V to 0.024 V.

8A-4 (Time: 14:45 - 15:10)
TitleEfficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid
AuthorYu Zhong, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, United States)
Pagepp. 768 - 773
Keywordpower grid
AbstractDue to the extremely large sizes of power grids, IR drop analysis has become a computationally challenging problem both in terms of runtime and memory usage. It has been shown that first-order iterative algorithms based on node-by-node and row-by-row traversals of the power grid have both accuracy and runtime advantages over the well-known Random-Walk method. In this paper, we propose second-order iterative algorithms that can significantly reduce the runtime. The new algorithms are extremely fast, and we prove that they are guaranteed to converge to the exact solutions. Experimental results show that our algorithms outperform the Random-Walk algorithm and the first-order algorithms. For a 25-million-node problem, the Random-Walk algorithm takes 2 days with a maximum error of 6.1 mV, while our second-order row-based algorithm takes 32 minutes to get an exact solution. Moreover, we can get a solution with a maximum error of 2 mV in 10 minutes.
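
As background for the node-by-node traversal mentioned in this abstract, the sketch below shows a plain Gauss-Seidel style relaxation on a small resistive grid. It is an assumption that the first-order node-based method resembles this textbook scheme, and it is not the second-order algorithm the paper proposes. The grid model (uniform conductance `g` between neighbours, ideal pads held at `vdd`, a current load per node) and the function name `solve_grid` are hypothetical.

```python
from typing import List, Set, Tuple


def solve_grid(n: int, g: float, vdd: float,
               pads: Set[Tuple[int, int]],
               load: List[List[float]],
               tol: float = 1e-10,
               max_sweeps: int = 100000) -> List[List[float]]:
    """Node-by-node relaxation of an n x n resistive power grid (n >= 2).

    Each non-pad node satisfies Kirchhoff's current law:
        sum over neighbours of g * (V_nbr - V_node) = load drawn at the node.
    Solving that equation for V_node and sweeping over all nodes repeatedly
    is Gauss-Seidel iteration.
    """
    v = [[vdd] * n for _ in range(n)]   # start from the ideal supply voltage
    for _ in range(max_sweeps):
        max_delta = 0.0
        for i in range(n):
            for j in range(n):
                if (i, j) in pads:
                    continue            # pad voltages stay fixed at vdd
                nbrs = [(i + di, j + dj)
                        for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1))
                        if 0 <= i + di < n and 0 <= j + dj < n]
                new = (g * sum(v[a][b] for a, b in nbrs) - load[i][j]) / (g * len(nbrs))
                max_delta = max(max_delta, abs(new - v[i][j]))
                v[i][j] = new           # updated value is used immediately
        if max_delta < tol:
            break
    return v


if __name__ == "__main__":
    # 2 x 2 grid, one pad at (0, 0), 0.1 A drawn at the opposite corner.
    volts = solve_grid(2, g=1.0, vdd=1.0, pads={(0, 0)},
                       load=[[0.0, 0.0], [0.0, 0.1]])
    print(volts)
```

For this 2 x 2 example the load current splits evenly over the two paths from the pad, so the relaxation settles at 0.95 V on the two intermediate nodes and 0.90 V at the loaded corner.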

8A-5 (Time: 15:10 - 15:35)
TitleA Current-based Method for Short Circuit Power Calculation under Noisy Input Waveforms
AuthorHanif Fatemi, Shahin Nazarian, *Massoud Pedram (University of Southern California, United States)
Pagepp. 774 - 779
KeywordCurrent-based Method, Short Circuit Power, Noisy input, Crosstalk
AbstractAn accurate model is presented in this paper to calculate the short circuit energy dissipation of logic cells. The short circuit current is highly dependent on the input and output voltage values. Therefore, the actual shape of the voltage signal waveforms at the input and output of the cell should be considered in order to precisely calculate the short circuit energy. Previous approaches, such as the approximation of crosstalk-induced noisy waveforms with saturated ramps, can lead to short circuit energy estimation errors as high as orders of magnitude for a minimum-sized inverter. To resolve this shortcoming, a novel current-based logic cell model is utilized, which constructs the output voltage waveform for a given noisy input waveform. The input and output voltage waveforms are then used to calculate the short circuit current, and hence, short circuit energy dissipation. A characterization process is executed for each logic cell in the standard cell library to model the relevant electrical parameters, e.g., the parasitic capacitances and nonlinear current sources. Additionally, our model is capable of calculating the short circuit energy dissipation caused by glitches in VLSI circuits, which in some cases can be a key contributor to the total circuit energy dissipation. Experimental results show an average error of about 1% and a maximum error of about 3% compared to SPICE for different types of logic cells under noisy input waveforms including glitches, while the runtime speedup is up to 16000x.


Session 8B Electrical Optimization in Floorplanning/Placement
Time: 13:30 - 15:35 Friday, January 26, 2007
Location: Room 413
Chairs: Shigetoshi Nakatake (Univ. of Kitakyushu, Japan), David Pan (University of Texas at Austin, United States)

8B-1 (Time: 13:30 - 13:55)
TitleThermal-Aware 3D IC Placement Via Transformation
AuthorJason Cong, *Guojie Luo, Jie Wei, Yan Zhang (Department of Computer Science, University of California, Los Angeles, United States)
Pagepp. 780 - 785
Keyword3D-IC, placement, thermal-aware
Abstract3D IC technologies can help to improve circuit performance, lower power consumption by reducing wirelength, and realize heterogeneous system-on-chip design. In this paper, we propose a novel thermal-aware 3D cell placement approach, named T3Place, based on transforming a 2D placement with good wirelength into a 3D placement, with wirelength, through-the-silicon (TS) via number and temperature as objectives. Moreover, we propose a novel relaxed conflict-net (RCN) graph-based layer assignment method to further refine the 3D placements.

8B-2 (Time: 13:55 - 14:20)
TitleNoise-Direct: A Technique for Power Supply Noise Aware Floorplanning Using Microarchitecture Profiling
AuthorFayez Mohamood, Michael Healy, Sung Kyu Lim, *Hsien-Hsin S. Lee (Georgia Tech, United States)
Pagepp. 786 - 791
KeywordInductive Noise, Floorplanning, Microarchitecture, Power integrity
AbstractThis paper proposes Noise-Direct, a design methodology for power-integrity-aware floorplanning that uses microarchitectural feedback to guide module placement and tackle high-frequency inductive noise. Given the increasing use of clock-gating for saving power, reliability has been worsened by the large inductive noise it induces. In this work, we propose an average-case design method that considers the dynamic microarchitectural switching behavior to guarantee power integrity and alleviate the requirement for on-die decoupling capacitances.

8B-3 (Time: 14:20 - 14:45)
TitleOn Increasing Signal Integrity with Minimal Decap Insertion in Area-Array SoC Floorplan Design
Author*Chao-Hung Lu (National Central University, Taiwan), Hung-Ming Chen (National Chiao Tung University, Taiwan), Chien-Nan Jimmy Liu (National Central University, Taiwan)
Pagepp. 792 - 797
KeywordSignal Integrity, Floorplan Design, Decap Insertion
AbstractWith technology scaling further into the deep submicron era, more components can be placed onto one chip (System-on-chip, SoC). However, the same scaling brings design difficulties, among which signal integrity is one of the most important issues. Although flip-chip and area-array architectures have been proposed to strengthen the integrity, we still need careful planning in SoC designs. The power supply noise problem is getting worse due to serious IR-drop and simultaneous switching noise, and decoupling capacitance (decap) insertion is commonly applied to alleviate the noise. Some approaches address this issue, but they suffer either from over-design or from decap insertion late in the design stage. In this paper, we propose a methodology to insert decap in a more efficient and effective way during supply noise driven floorplanning in area-array designs. The experimental results are encouraging. Compared with the approaches in [Koh] and [Yan], we have inserted enough decap to meet the supply noise constraint while the others employ more area.

8B-4 (Time: 14:45 - 15:10)
TitleVoltage Island Generation under Performance Requirement for SoC Designs
Author*Wai-Kei Mak, Jr-Wei Chen (National Tsing Hua University, Taiwan)
Pagepp. 798 - 803
KeywordSoC, low power, floorplanning, voltage island
AbstractUsing multiple supply voltages on a SoC design is an efficient way to achieve low power. However, it may lead to a complex power network and a huge number of level shifters if we just set the cores to operate at their respective lowest voltage levels. We present two formulations for the voltage level assignment problem. The first is exact but takes longer to compute a solution. The second can be solved much faster with virtually no loss of optimality. In addition, we propose a modification to the traditional floorplanning framework. Unlike previous works, we can optimize the total power consumption, the level shifter overhead, and the power network complexity without compromising the wirelength and the chip area. In the experiments, we obtained 17-53% power savings with voltage island generation.

8B-5 (Time: 15:10 - 15:35)
TitleFast Flip-Chip Pin-Out Designation Respin by Pin-Block Design and Floorplanning for Package-Board Codesign
Author*Ren-Jie Lee, Ming-Fang Lai, Hung-Ming Chen (National Chiao Tung University, Taiwan)
Pagepp. 804 - 809
KeywordPin-Out Designation, Pin-Block Floorplanning, Package-Board Codesign
AbstractDeep submicron effects complicate the design of chips, as well as package design and communication between package and board. As a result, iterative interface design has become a time-consuming process. This paper proposes a novel and efficient approach to designating the pin-out for a flip-chip BGA package when designing chipsets. The proposed approach can not only automate the assignment of more than 200 I/O pins on a package, but also precisely evaluate the package size that accommodates all pins with almost no void pin positions, as good as the one from manual design. Furthermore, the practical experience and techniques used in designing such interfaces have been accounted for, including signal integrity, power delivery and routability. This efficient pin-out designation and package size estimation by pin-block design and floorplanning provides a much faster turnaround time, and thus an enormous improvement in meeting the design schedule. The results on two real cases show that our methodology is effective in achieving almost the same package dimensions as manual design, which takes weeks, while simultaneously considering critical issues in package-board codesign. To the best of our knowledge, this is the first attempt at solving the flip-chip pin-out placement problem in package-board codesign.


Session 8C Advances in Test and Diagnosis
Time: 13:30 - 15:35 Friday, January 26, 2007
Location: Room 414+415
Chairs: Erik Larsson (Royal Inst. of Tech., Sweden), Xiaoqing Wen (Kyushu Inst. of Tech., Japan)

8C-1 (Time: 13:30 - 13:55)
TitleA Technique to Reduce Peak Current and Average Power Dissipation in Scan Designs by Limited Capture
Author*Seongmoon Wang, Wenlong Wei (NEC Labs., America, United States)
Pagepp. 810 - 816
Keywordlow power testing, scan based testing, power dissipation during test application, low switching activity
AbstractIn this paper, a technique that can efficiently reduce peak and average switching activity during test application is proposed. The peak number of transitions is reduced by about 40% and the average number of transitions is reduced by about 56-75%. This reduction in peak and average switching is achieved without any decrease in fault coverage. The proposed method does not require any specific clock tree construction, special scan cells, or scan chain routing. Test cubes generated by any combinational ATPG can be processed by the proposed method to reduce peak and average switching activity without any capture violation. Hardware overhead for the proposed method is negligible. Further, the hardware for the proposed method can be implemented without detailed knowledge of the design.

8C-2 (Time: 13:55 - 14:20)
TitleWarning: Launch off Shift Tests for Delay Faults May Contribute to Test Escapes
AuthorZhuo Zhang, *Sudhakar Reddy (University of Iowa, United States), Irith Pomeranz (Purdue University, United States)
Pagepp. 817 - 822
Keywordtransition delay fault, launch off shift, test escape, functionally detectable faults
AbstractA concern often expressed in the literature is the potential overtesting or yield loss caused by the fact that launch off shift operates the circuit under test in a non-functional manner. In this paper we present data, for the first time, which points to another potential problem with launch off shift tests: test escapes. We also present data showing that if launch off shift tests with multiple fault activation cycles are used, essentially all functionally detectable faults can be detected.

8C-3 (Time: 14:20 - 14:45)
TitleA Wafer-Level Defect Screening Technique to Reduce Test and Packaging Costs for "Big-D/Small-A" Mixed-Signal SoCs
AuthorSudarshan Bahukudumbi, Sule Ozev, *Krishnendu Chakrabarty (Duke University, United States), Vikram Iyengar (IBM Corporation, United States)
Pagepp. 823 - 828
KeywordSoC test, cost model, wafer-level defect screening
AbstractProduct cost is a key driver in the consumer electronics market, which is characterized by low profit margins and the use of a variety of "big-D/small-A" mixed-signal system-on-chip (SoC) designs. Packaging cost has recently emerged as a major contributor to the product cost for such SoCs. Wafer-level testing can be used to screen defective dies, thereby reducing packaging cost. We propose a new correlation-based signature analysis technique that is especially suitable for mixed-signal test at the wafer-level using low-cost digital testers. The proposed method overcomes the limitations of measurement inaccuracies at the wafer-level. A generic cost model is developed to evaluate the effectiveness of wafer-level testing of analog and digital cores in a mixed-signal SoC, and to study its impact on test escapes, yield loss and packaging costs. Experimental results are presented for a typical mixed-signal "big-D/small-A" SoC, which contains a large section of flattened digital logic and several large mixed-signal cores.

8C-4 (Time: 14:45 - 15:10)
TitleFault Dictionary Size Reduction for Million-Gate Large Circuits
Author*Yu-Ru Hong, Juinn-Dar Huang (National Chiao Tung University, Taiwan)
Pagepp. 829 - 834
Keywordfault diagnosis, fault dictionary, fault dictionary size reduction, pass-fail fault dictionary
AbstractIn general, a fault dictionary is impractical for real applications because of its extremely large size. Several previous works have addressed fault dictionary size reduction. However, they might not be able to handle today’s million-gate circuits due to their high time and space complexity. In this paper, we propose an algorithm to significantly reduce the size of the fault dictionary while still preserving high diagnostic resolution. The proposed algorithm achieves extremely low time and space complexity by avoiding construction of the huge distinguishability table, which inevitably increases the required computational complexity. Experimental results demonstrate that the proposed algorithm is fully capable of handling industrial million-gate circuits in a reasonable amount of runtime and memory.

8C-5 (Time: 15:10 - 15:35)
TitleCyclic-CPRS : A Diagnosis Technique for BISTed Circuits for Nano-meter Technologies
Author*Chun-Yi Lee, Hung-Mao Lin, Fang-Min Wang, James Chien-Mo Li (Graduate Institute of Electronics Engineering, National Taiwan University, Taiwan)
Pagepp. 835 - 840
KeywordFault Diagnosis, BIST, CPRS, Scan Chain, Unknowns
AbstractA Cyclic-CPRS (Column Parity Row Selection) technique is presented to diagnose built-in self tested (BISTed) circuits, even in the presence of many unknowns and transient errors. The novel cyclic scan chains retain the transient errors and unknowns in the CUT until they are fully diagnosed. Instead of masking the unknowns, Cyclic-CPRS directly diagnoses the unknowns as if they were errors. Direct diagnosis of unknowns not only eliminates the masking circuitry but also enhances the diagnosis resolution. Experimental results show that Cyclic-CPRS is very successful even in the presence of 10% errors and unknowns. The proposed technique is especially suitable for nano-meter technologies, in which transient errors and systematic defects are becoming serious problems.


Session 8D Designers' Forum: High-speed Chip to Chip Signaling Solutions
Time: 13:30 - 15:35 Friday, January 26, 2007
Location: Small Auditorium, 5F
Chairs: Haruyuki Tago (Toshiba Corporation, Japan), Kazutoshi Kobayashi (Kyoto University, Japan)

8D-1 (Time: 13:30 - 14:00)
Title(Invited Paper) Preferable Improvements and Changes to FB-DiMM High-Speed Channel for 9.6Gbps Operation
Author*Atsushi Hiraishi, Toshio Sugano (Elpida Memory, Japan), Hideki Kusamitsu (Yamaichi Electronics, Japan)
Pagepp. 841 - 845
KeywordFB-DiMM, High-speed channel
AbstractIn this paper, we show the parts of the high-speed channel of an FB-DiMM system that degrade the signal, along with possible countermeasures. For verification purposes, and to establish a precise modeling and simulation method, we compared measurement and simulation up to 9.6Gbps operation on a test board and obtained good agreement between them. After calculating the loss budget of the estimated system, we make recommendations for preferable changes to the main board and DiMM socket.

8D-2 (Time: 14:00 - 14:30)
Title(Invited Paper) Xbox360™ Front Side Bus - A 21.6 GB/s End to End Interface Design
Author*David Siljenberg, Steve Baumgartner, Tim Buchholtz, Mark Maxson, Trevor Timpane (IBM, United States), Jeff Johnson (Cadence Design Systems, United States)
Pagepp. 846 - 853
Keywordsource synchronous, front side bus, serial link, chip to chip interconnect
AbstractWith a bandwidth of 21.6 GB/s, the Front Side Bus (FSB) of the Microsoft Xbox360™ is one of the fastest commercially available Front Side Bus interfaces in the consumer market. This paper explains the end-to-end system approach used in designing the bus that achieved volume production ramp 18 months after design start. The 90 nm SOI-CMOS CPU and 90 nm bulk CMOS GPU designs are described. The chip carrier, circuit board, and signal integrity analyses are described. The design approach used to achieve high volume, low cost, and short development time is explained.

8D-3 (Time: 14:30 - 15:00)
Title(Invited Paper) Design Consideration of 6.25 Gbps Signaling for High-Performance Server
Author*Jian Hong Jiang, Weixin Gai, Akira Hattori, Yasuo Hidaka, Takeshi Horie, Yoichi Koyanagi, Hideki Osone (Fujitsu Laboratories of America, United States)
Pagepp. 854 - 857
Keywordmulti-gigabit/s transceiver, multi-tap pre-emphasis, linear equalizer
AbstractAs network data rates increase rapidly, high-speed signaling circuits for server communication pose many design challenges due to various system requirements using different interconnect mediums. This paper discusses the main problems and solutions of high-speed circuits for server interconnect. Then, it presents a high-speed circuit implementation for such interconnect using 90nm CMOS technology that achieves a data rate of 6.25 Gbps in a backplane environment.

8D-4 (Time: 15:00 - 15:30)
Title(Invited Paper) System Co-Design and Co-Analysis Approach to Implementing the XDR™ Memory System of the Cell Broadband Engine™ Processor Realizing 3.2 Gbps Data Rate per Memory Lane in Low Cost, High Volume Production
Author*Wai-Yeung Yip, Scott Best, Wendemagegnehu Beyene, Ralf Schmitt (Rambus, United States)
Pagepp. 858 - 865
KeywordXDR, memory, Cell, interface, Rambus
AbstractThis paper describes the design and analysis of the 3.2 Gbps XDR™ memory system of the Cell Broadband Engine™ (Cell BE) processor developed by Sony Corporation, Sony Computer Entertainment, Toshiba and IBM. A System Co-Design and Co-Analysis Approach was applied where different components of the system are designed and analyzed simultaneously to allow trade-offs to be made to optimize system electrical characteristics at low overall system cost. The XDR memory interface circuit implemented in the Cell BE processor, the power delivery system design and analysis, and the interface statistical signal integrity analysis will be described to illustrate this design and analysis approach.


Session 9A Power Efficient Design Techniques
Time: 16:00 - 18:05 Friday, January 26, 2007
Location: Room 411+412
Chairs: Hiroyuki Tomiyama (Nagoya Univ., Japan), Gang Zeng (Nagoya Univ., Japan)

9A-1 (Time: 16:00 - 16:25)
TitleFlow Time Minimization under Energy Constraints
Author*Jian-Jia Chen (National Taiwan University, Taiwan), Kazuo Iwama (Kyoto University, Japan), Tei-Wei Kuo, Hsueh-I Lu (National Taiwan University, Taiwan)
Pagepp. 866 - 871
KeywordEnergy-aware systems, Scheduling, Flow time minimization, Dynamic voltage scaling
AbstractPower-aware and energy-efficient designs play important roles in modern hardware and software designs, especially for embedded systems. This paper targets a scheduling problem on a processor with the capability of dynamic voltage scaling (DVS), which can reduce the power consumption by slowing down the processor speed. The objective of the targeted problem is to minimize the average flow time of a set of jobs under a given energy constraint, where the flow time of a job is defined as the interval length between the arrival and the completion of the job. We consider two types of processors, which have either a continuous spectrum of available speeds or only a finite number of discrete speeds. Two algorithms are given: (1) An algorithm is proposed to derive optimal solutions for processors with a continuous spectrum of available speeds. (2) A greedy algorithm is designed for the derivation of optimal solutions for processors with a finite number of discrete speeds. The proposed algorithms are extended to cope with jobs with different weights for the minimization of the average weighted flow time. The proposed algorithms are also evaluated with comparisons to schedules which execute jobs at a common effective speed.
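
The flow time definition in this abstract, and the common-speed schedules used as the comparison baseline, can be made concrete with a short sketch. It is illustrative only and is not the paper's optimal algorithm. The first-come-first-served order, the job tuples, and the power model (power proportional to speed raised to `alpha`, a form often assumed for DVS processors) are assumptions introduced here.

```python
from typing import List, Tuple

Job = Tuple[float, float]   # (arrival time, amount of work), hypothetical units


def average_flow_time(jobs: List[Job], speed: float) -> float:
    """Average flow time when jobs run first-come-first-served at one speed.

    The flow time of a job is its completion time minus its arrival time.
    """
    clock = 0.0
    total_flow = 0.0
    for arrival, work in sorted(jobs):
        start = max(clock, arrival)       # wait for the job and the processor
        clock = start + work / speed      # completion time of this job
        total_flow += clock - arrival
    return total_flow / len(jobs)


def energy(jobs: List[Job], speed: float, alpha: float = 3.0) -> float:
    """Energy under the assumed model power = speed ** alpha.

    Running `work` units at `speed` takes work / speed time, so the energy
    is work * speed ** (alpha - 1).
    """
    return sum(work for _, work in jobs) * speed ** (alpha - 1)


if __name__ == "__main__":
    jobs = [(0.0, 2.0), (1.0, 2.0)]
    for s in (1.0, 2.0):
        print(s, average_flow_time(jobs, s), energy(jobs, s))
```

In this toy instance, doubling the common speed lowers the average flow time from 2.5 to 1.0 while the assumed energy rises from 4 to 16, which is the kind of trade-off an energy constraint bounds.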

9A-2 (Time: 16:25 - 16:50)
TitleIntegrating Power Management into Distributed Real-time Systems at Very Low Implementation Cost
AuthorBita Gorjiara, Nader Bagherzadeh, *Pai Chou (University of California, Irvine, United States)
Pagepp. 872 - 877
KeywordDynamic Power Management, real-time systems, distributed systems
AbstractThe development cost of low-power embedded systems can be reduced by reusing legacy designs and applying proper modifications to meet power constraints. Power management techniques that implement distributed power managers in multi-processor systems are very costly in terms of hardware/software modifications. In this paper, we propose a new centralized power management technique that reduces the power consumption of distributed systems at very low implementation cost. Our power manager uses the model of the system/application to compute the schedule of turn on/off commands. We applied our power management technique to a distributed software-defined radio system and achieved 60% to 87% energy savings.

9A-3 (Time: 16:50 - 17:15)
TitleA Software Technique to Improve Yield of Processor Chips in Presence of Ultra-Leaky SRAM Cells Caused by Process Variation
Author*Maziar Goudarzi, Tohru Ishihara, Hiroto Yasuura (Kyushu University, Japan)
Pagepp. 878 - 883
KeywordProcess variation, Leakage power, Software-based Technique, Yield, Embedded Systems
AbstractExceptionally leaky transistors are increasingly frequent in nano-scale technologies due to lower threshold voltage and its increased variation. Such leaky transistors may even change position with changes in the operating voltage and temperature, and hence, redundancy at circuit-level is not sufficient to tolerate such threats to yield. We show that in SRAM cells this leakage depends on the cell value and propose a first software-based runtime technique that suppresses such abnormal leakages by storing safe values in the corresponding cache lines before going to standby mode. Analysis shows the performance penalty is, in the worst case, linearly dependent on the number of so-cured cache lines while the energy saving linearly increases with the time spent in standby mode. Analysis and experimental results on commercial processors confirm that the technique is viable if the standby duration is more than a small fraction of a second.

9A-4 (Time: 17:15 - 17:40)
TitleProgram Phase Directed Dynamic Cache Way Reconfiguration for Power Efficiency
Author*Subhasis Banerjee (Sun Microsystems, India), Surendra G, S. K. Nandy (Indian Institute of Science, India)
Pagepp. 884 - 889
Keywordprogram phase, cache reconfiguration
AbstractAggressive superscalar processors with deep pipelines and sophisticated speculative execution techniques are pushing the power budget to its limit. It is found that a significant portion of this power is wasted during wrong-path execution and non-power-optimal allocation of power-hungry resources. Dynamic reconfiguration of micro-architectural resources can be exploited to bring down this waste at runtime. The lack of an architectural method to capture the behavior of a program at runtime makes dynamic reconfiguration a challenge. In this paper we propose a method to characterize program behavior at runtime using the conflict miss pattern of a data cache, which in turn identifies different program phases in terms of cache utilization. We use this phase information to enable/disable cache ways dynamically depending on the conflict miss pattern of a program. Using a hardware tracking mechanism we ensure that the program performance (throughput in terms of IPC) does not degrade beyond a tolerable limit.

9A-5 (Time: 17:40 - 18:05)
TitleCLIPPER: Counter-based Low Impact Processor Power Estimation at Run-time
Author*Jorgen Peddersen, Sri Parameswaran (University of New South Wales, Australia)
Pagepp. 890 - 895
Keywordpower, macro-modeling, run-time
AbstractNumerous dynamic power management techniques have been proposed which utilize the knowledge of processor power/energy consumption at run-time. So far, no efficient method to provide run-time power/energy data has been presented. Current measurement systems draw too much power to be used in small embedded designs and existing performance counters cannot provide sufficient information for run-time optimization. This paper presents a novel methodology to solve the problem of run-time power optimization by designing a processor that estimates its own power/energy consumption. Estimation is performed by the addition of small counters that tally events which consume power. This methodology has been applied to an existing processor resulting in an average power error of 2% and energy estimation error of 1.5%. The system adds little impact to the design, with only a 4.9% increase in chip area and a 3% increase in average power consumption. A case study of an application that utilizes the processor showcases the benefits the methodology enables in dynamic power optimization.


Session 9B Leading Edge Design Methodology for Processors
Time: 16:00 - 18:05 Friday, January 26, 2007
Location: Room 413
Chairs: Takashi Miyamori (Toshiba, Japan), Hideharu Amano (Keio Univ., Japan)

9B-1 (Time: 16:00 - 16:25)
TitleDesign Methodology for 2.4GHz Dual-Core Microprocessor
AuthorNoriyuki Ito, Hiroaki Komatsu, Akira Kanuma, Akihiro Yoshitake, Yoshiyasu Tanamura, Hiroyuki Sugiyama, Ryoichi Yamashita, *Ken-ichi Nabeya, Hironobu Yoshino, Hitoshi Yamanaka, Masahiro Yanagida, Yoshitomo Ozeki, Kinya Ishizaka, Takeshi Kono, Yutaka Isoda (Fujitsu Limited, Japan)
Pagepp. 896 - 901
Keywordmicroprocessor, dual-core, clock, custom macro
AbstractThis paper presents a design methodology that was applied to the design of a 2.4GHz dual-core SPARC64 microprocessor with 90nm CMOS technology. It focuses on the newly adopted techniques, such as efficient data management in dual-core design, fast delay calculation of the noise-immune clock distribution circuit, enhanced signal integrity analysis of a large-scale custom macro design, and enhanced diagnosis capability using a logic BIST circuit.

9B-2 (Time: 16:25 - 16:50)
TitleAn Embedded Low Power/Cost 16-Bit Data/Instruction Microprocessor Compatible with ARM7 Software Tools
Author*Fu-Ching Yang, Ing-Jer Huang (National Sun Yat-Sen University, Taiwan)
Pagepp. 902 - 907
Keywordmicroprocessor, compatible, low-power, low-cost, narrow width memory
AbstractA 16-bit THUMB instruction set microprocessor is proposed for low cost/power in short-precision computing. It achieves 40% gate count, 51% power consumption and 160% clock frequency compared to ARM7, and its performance is even 67% better with narrow-width memory at the same clock frequency. ARM7 software is also compatible.

9B-3 (Time: 16:50 - 17:15)
TitleA Novel Reconfigurable Low Power Distributed Arithmetic Architecture for Multimedia Applications
Author*Zhenyu Liu, Tughrul Arslan, Ahmet T. Erdogan (The University of Edinburgh, Great Britain)
Pagepp. 908 - 913
KeywordDistributed Arithmetic, reconfigurable, DCT
AbstractThe use of reconfigurable cores in system on chip (SoC) designs is increasingly becoming a trend. Such cores are being used for their flexibility, powerful functionality and low power consumption. Distributed Arithmetic (DA) is a powerful algorithm widely used in many fields of multimedia for its efficiency. This paper presents a novel reconfigurable adder-based architecture for DA to realize the inner product which is the key computation in many digital signal processing applications. 1D DCT is mapped onto the architecture. Compared with some existing ASIC designs, the new architecture achieves good performance in area, speed and power.

9B-4 (Time: 17:15 - 17:40)
TitleExploration of Low Power Adders for a SIMD Data Path
Author*Giacomo Paci (IMEC and DEIS, University of Bologna, Italy), Paul Marchal (IMEC, Belgium), Luca Benini (DEIS, University of Bologna, Italy)
Pagepp. 914 - 919
Keywordadders, SIMD, power, area
AbstractHardware for Ambient Intelligence needs to achieve extremely high computational efficiency (up to 40GOPS/W). An important way for reaching this is exploiting parallelism, and more specifically data-level parallelism enabled by SIMD. Whereas a large body of research exists on the benefits of, the architectural design of and compilation onto SIMD, the design of energy-optimal functional units for SIMD has received limited attention. It appears that existing SIMD functional units are designed in an area optimal, but not energy optimal way. By exploiting the difference in critical path length for the types of operations (e.g., 4x8/2x16/1x32), SIMD adders can be developed that save up to 40% of energy. In this paper, we will present these adders, the issues of building them and quantify their benefits for different usage scenarios and operating frequencies.

9B-5 (Time: 17:40 - 18:05)
TitleMicro-architecture Pipelining Optimization with Throughput-Aware Floorplanning
Author*Yuchun Ma, Zhuoyuan Li (Tsinghua University, China), Jason Cong (University of California, Los Angeles, United States), Xianlong Hong (Tsinghua University, China), Glenn Reinman (University of California, Los Angeles, United States), Sheqin Dong, Qiang Zhou (Tsinghua University, China)
Pagepp. 920 - 925
Keywordmicro-architecture, pipelining, throughput-aware, floorplanning
AbstractFor modern processor designs in nanometer technologies, both block and interconnect pipelining are needed to achieve multi-gigahertz clock frequency, but previous approaches consider block pipelining and interconnect pipelining separately. For example, all recent works on wire pipelining assume pre-pipelined components and consider only inserting pipeline stages on point-to-point wire or bus connections. To the best of our knowledge, this paper is the first that considers block pipelining and interconnect pipelining simultaneously. We optimize multiple critical paths or loops in the micro-architecture and insert the pipeline stages optimally in the blocks and wires of these loops to meet the clock frequency requirement. We propose two approaches to this problem. The first approach is based on mixed integer linear programming (MILP) which is theoretically guaranteed to produce the optimal solution, and the second one is an efficient graph-based algorithm that produces near-optimal solutions. Experimental results show that simultaneous block and interconnect pipelining leads to more than 20% improvement over wire-pipelining alone on the overall processor performance. Moreover, the graph-based approach gives solutions very close to the MILP results (2% more than MILP results on average) but in a much shorter runtime.


Session 9C Satisfiability and Applications
Time: 16:00 - 18:05 Friday, January 26, 2007
Location: Room 414+415
Chairs: Jun Sawada (IBM, United States), Takashi Takenaka (NEC, Japan)

9C-1 (Time: 16:00 - 16:25)
TitleMultithreaded SAT Solving
Author*Matthew Lewis, Tobias Schubert, Bernd Becker (Albert-Ludwigs-University of Freiburg, Germany)
Pagepp. 926 - 931
KeywordSAT, Solver, Threads, Multithreaded, Verification
AbstractThis paper describes the multithreaded MiraXT SAT Solver which was designed to take advantage of current and future shared memory multiprocessor systems. The paper highlights design and implementation details that allow the multiple threads to run and cooperate efficiently. Results show that in single threaded mode, MiraXT compares well to other state-of-the-art solvers on industrial problems. In threaded mode, it provides cutting edge performance, as speedup is obtained on both SAT and UNSAT instances.

9C-2 (Time: 16:25 - 16:50)
TitleTrace Compaction using SAT-based Reachability Analysis
Author*Sean Safarpour, Andreas Veneris, Hratch Mangassarian (University of Toronto, Canada)
Pagepp. 932 - 937
Keywordtrace compaction, reachability, SAT, debugging, trace reduction
AbstractIn today's designs, when functional verification fails, engineers perform debugging using the provided error traces. Reducing the length of error traces can help the debugging task by decreasing the number of variables and clock cycles that must be considered. We propose a novel trace length compaction approach based on SAT-based reachability analysis. We develop procedures and algorithms using pre-image computation to efficiently traverse the state space and reduce the trace lengths. We further introduce a data structure used to store the visited states which is critical to the performance of the proposed approach. Experiments demonstrate the effectiveness of the reachability approach as approximately 75% of the traces are reduced by one or two orders of magnitude.

9C-3 (Time: 16:50 - 17:15)
TitleCombinational Equivalence Checking Using Incremental SAT Solving, Output Ordering, and Resets
Author*Stefan Disch, Christoph Scholl (University of Freiburg, Germany)
Pagepp. 938 - 943
Keywordcombinational equivalence checking, incremental SAT
AbstractCombinational equivalence checking is an essential task in circuit design. In this paper we focus on SAT based equivalence checking making use of incremental SAT techniques which are well known from their application in Bounded Model Checking. Based on an analysis of shared circuit structures we present heuristics which try to maximize the benefit from incremental SAT solving in this application by looking for good orders in which the equivalence of different circuit outputs is checked. Moreover, we present a reset strategy for situations where the benefit from the incremental SAT approach seems to decrease. Experimental results demonstrate that our novel method outperforms traditional methods significantly.

9C-4 (Time: 17:15 - 17:40)
TitleFixing Design Errors with Counterexamples and Resynthesis
Author*Kai-hui Chang, Igor L. Markov, Valeria Bertacco (University of Michigan at Ann Arbor, United States)
Pagepp. 944 - 949
KeywordError correction, Resynthesis, Functional verification
AbstractIn this work we propose a new error-correction framework, called CoRe, which uses counterexamples, or bug traces, generated in verification to automatically correct errors in digital designs. CoRe is powered by two innovative resynthesis techniques, Goal-Directed Search (GDS) and Entropy-Guided Search (EGS), which modify the functionality of the circuit's internal nodes to match the desired specification. We evaluate our solution on designs and errors arising during combinational equivalence checking, as well as simulation-based verification of digital systems. Compared with previously proposed techniques, CoRe is more powerful in that: (1) it can fix a broader range of error types because it does not rely on specific error models; (2) it derives the correct functionality from simulation vectors, hence not requiring golden netlists; and (3) it can be applied to a range of verification flows, including formal and simulation-based.


Session 9D Designers' Forum Panel: Top 10 Design Issues
Time: 16:00 - 18:05 Friday, January 26, 2007
Location: Small Auditorium, 5F

9D-1
Title(Panel Discussion) Top 10 Design Issues
AuthorOrganizer: Haruyuki Tago (Toshiba, Japan), Moderator: Peter Hofstee (IBM, United States), Panelists: Toshihiro Hattori (Renesas Technology, Japan), Tadahiro Kuroda (Keio University, Japan), Toshinari Takayanagi (P.A. Semi, United States), Toshinori Sato (Kyushu University, Japan)