The 17th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule

Monday, January 30, 2012

Tutorial 1: Design for Manufacturability and Reliability in Nanoscale CMOS and 3D-IC
9:00 - 17:00

Tutorial 2: Wireless Body Sensor Network (WBSN) Design
9:00 - 17:00

Tutorial 3: Heterogeneity for Power Management: Devices to Systems
9:00 - 17:00

Tutorial 4: Energy Efficiency in Scalable Power Sources: Portable to Grid-Connected Systems
9:00 - 17:00

Tutorial 5: Assertion-based verification for SoC and embedded software
9:00 - 17:00

Tuesday, January 31, 2012

Room 204A	Room 204B	Room 203	Room 202
1K (Room 204A+204B) Opening & Keynote 1 8:30 - 9:50
Coffee Break 9:50 - 10:20
2K (Room 204A+204B) Keynote 2 10:20 - 11:10
3K (Room 204A+204B) Keynote 3 11:10 - 12:00
Lunch 12:00 - 14:00
S1 Special Session 1: Robust and Resilient Designs from the Bottom-Up: Technology, CAD, Circuit, and System Issues 14:00 - 15:40	1A Architecture Issues in Embedded Systems 14:00 - 15:40	1B Power Network Design and Analysis 14:00 - 15:40	1C Emerging Circuits and Memories 14:00 - 15:40
Coffee Break 15:40 - 16:10
S2 Special Session 2: Domain Specific Accelerators 16:10 - 17:50	2A System-Level Optimization Techniques for Multi-Core Architectures 16:10 - 17:50	2B High-Speed PCB Routing 16:10 - 17:50	2C Emerging Test Solutions 16:10 - 17:50

Wednesday, February 1, 2012

Room 204A	Room 204B	Room 203	Room 202
I1 (Room 204A) Invited Talk 1 8:30 - 9:20
I2 (Room 204A) Invited Talk 2 9:20 - 10:10
Coffee Break 10:10 - 10:40
S3 Special Session 3: Design and Prototyping of Invasive MPSoC Architectures 10:40 - 12:20	S4 Special Session 4: Making ESL Models Work 10:40 - 12:20	3B High-Level Synthesis 10:40 - 12:20	3C Yield and Manufacturability Enhancement 10:40 - 12:20
Lunch 12:20 - 14:00
S5 Special Session 5: Advanced Post-silicon Validation and Debugging Techniques for SoC 14:00 - 15:40	S6 Special Session 6: Design and Architecture of Emerging Non-volatile Memory Technologies 14:00 - 15:40	4B 3D IC Layout 14:00 - 15:40	4C Simulation and Modeling for Signal-Integrity Analysis 14:00 - 15:40
Coffee Break 15:40 - 16:10
S7 Special Session 7: Sensor Node Optimization in Machine-to-Machine (M2M) Networks 16:10 - 17:50	5A Adaptive and Power-Efficient NoC Architectures 16:10 - 17:25	5B Physical Optimization for Power and Timing 16:10 - 17:50	5C Parallelizing System-Level Simulation 16:10 - 17:25

Thursday, February 2, 2012

Room 204A	Room 204B	Room 203	Room 202
D1 University LSI Design Contest 1 8:30 - 10:10	6A Efficient Methods for Resource Utilization in Multi-Core NoC Designs 8:30 - 10:10	6B Circuit-Level Timing Optimization 8:30 - 10:10	6C Modeling and Simulation for Nanoscale Analog Circuits 8:30 - 10:10
Coffee Break 10:10 - 10:40
D2 University LSI Design Contest 2 10:40 - 12:20	7A System-Level Modeling, Simulation, and Verification 10:40 - 12:20	7B Timing, Thermal, and Power Issues in High-Performance Design 10:40 - 12:20	7C Interconnect, Cooling, and Charge Storage Technologies 10:40 - 12:20
Lunch 12:20 - 14:00
S8 Special Session 8: Design for Reconfigurability and Adaptivity: Device, Circuit and System Perspectives 14:00 - 15:40	8A Scheduling for Embedded and High-Performance Systems 14:00 - 15:40	8B Automated Debugging and Validation 14:00 - 15:40	8C DFM for Nanolithography 14:00 - 15:40
Coffee Break 15:40 - 16:10
S9 Special Session 9: Quality Assurance for 3D-Stacked ICs 16:10 - 17:50	9A Design for System Reliability 16:10 - 17:50	9B Logic and Datapath Synthesis 16:10 - 17:50	9C Video, Display, and Signal Processing Technologies and Techniques 16:10 - 17:50

List of Papers

Tuesday, January 31, 2012

Session 1K Opening & Keynote 1
Time: 8:30 - 9:50 Tuesday, January 31, 2012
Location: Room 204A+204B

1K-1 (Time: 8:30 - 9:20)

Title	(Keynote Address) Engineering Complex Systems for Health, Security and the Environment
Author	*Giovanni De Micheli (EPF Lausanne, Switzerland)
Page	pp. 1 - 6
Keyword	security
Abstract	Several important societal and economic world problems can be addressed by the smart use of technology. The last forty years have witnessed the realization of computational systems and networks, rooted in our ability of crafting complex integrated circuits out of billions of transistors. Nowadays, the ability of mastering materials at the molecular level and their interaction with living matter opens up unforeseeable horizons. Networking biological sensors through body-area, ad hoc and standard communication networks boosts the intrinsic power of local measurements, and allows us to reach new standards in health and environment management, with positive fallout on security of individuals and communities. This article reviews the Nano-Tera.ch research program, addressing the enabling and disruptive technologies that stem from the combination of nanotechnology with large (tera) -scale information and communication systems.

Session 2K Keynote 2
Time: 10:20 - 11:10 Tuesday, January 31, 2012
Location: Room 204A+204B

2K-1 (Time: 10:20 - 11:10)

Title	(Keynote Address) Antipodean VLSI Adventures
Author	Neil Weste (OzRunways Pty. Ltd, Australia)
Abstract	With sky-high NRE and CAD costs, and a small industrial IC footprint, not to mention the GFC, it is hard for the erstwhile VLSI impresario to find gainful employment in Australia. This talk will summarize two ways to stay busy that might be a guide to others finding themselves in the same or similar positions. The first method of 'staying in the business' is illustrated by a company with which the author has been associated for the last few years. The main principle is to find an application which requires a 'System on Chip' approach (that is Analog/RF and digital), can be completed in an older process and doesn't require a preponderance of digital gates. The other important feature is not to become a chip company but to outsource that part of the development process. We shall also learn that while an initial market area might drive the formation and technical direction of the company, another completely unrelated area might be responsible for the ultimate success of the company. The second method is to take the techniques honed over many decades writing VLSI design software and use them in some completely different area - 'beating swords into ploughshares (or vice versa)'. In this case the other area is Apps for the iPhone and iPad and the application area is Aviation software. The approaches and outcomes of both methods will be summarized.

Session 3K Keynote 3
Time: 11:10 - 12:00 Tuesday, January 31, 2012
Location: Room 204A+204B

3K-1 (Time: 11:10 - 12:00)

Title	(Keynote Address) Trends, Challenges and Solutions of Design Ecosystem for 20nm and Beyond
Author	Cliff Hou (Taiwan Semiconductor Manufacturing Co. Ltd., Taiwan)
Abstract	In moving to a new process technology, readiness of the design ecosystem - EDA and IP - plays crucial enabling roles. As the complexity and cost of feature scaling grow, the ecosystem is facing challenges that require collective changes by its constituents to evolve accordingly. In this talk, we will first look at the driving forces behind process technology roadmap. Next we will discuss the trend such roadmap ushers in, as well as the resultant challenges facing the ecosystem as the industry moves toward 20nm. We will then move to propose a collaborative framework under which foundry, EDA and IP vendors and customers partner to address the challenges. Finally, we will offer a preview of the ecosystem beyond 20nm and how collaboration will continue to be adjusted to enable future success.

Session S1 Special Session 1: Robust and Resilient Designs from the Bottom-Up: Technology, CAD, Circuit, and System Issues
Time: 14:00 - 15:40 Tuesday, January 31, 2012
Location: Room 204A
Chair: Martin D.F. Wong (University of Illinois at Urbana-Champaign, U.S.A.)

S1-1

Title	(Invited Paper) Robust and Resilient Designs from the Bottom-Up: Technology, CAD, Circuit, and System Issues
Author	Vijay J. Reddi, David Z. Pan (University of Texas at Austin, U.S.A.), Sani Nassif (IBM, U.S.A.), Keith A. Bowman (Intel, U.S.A.)
Page	pp. 7 - 16
Keyword	resilient
Abstract	The semiconductor industry is facing a critical research challenge: design future high performance and energy efficient systems while satisfying historical standards for reliability and lower costs. The primary cause of this challenge is device and circuit parameter variability, which results from the manufacturing process and system operation. As technology scales, the adverse impact of these variations on system-level metrics increases. In this paper, we describe an interdisciplinary effort toward robust and resilient designs that mitigate the effects of device and circuit parameter variations in order to enhance system performance, energy efficiency, and reliability. Collaboration between the technology, CAD, circuit, and system levels of the compute hierarchy can foster the development of cost-effective and efficient solutions.

S1-2 (Time: 14:00 - 14:25)

Title	(Invited Paper) Technology Challenges beyond 22nm
Author	*Sani Nassif (IBM, U.S.A.)

S1-3 (Time: 14:25 - 14:50)

Title	(Invited Paper) Physical CAD for Robust Designs
Author	*David Z. Pan (University of Texas at Austin, U.S.A.)

S1-4 (Time: 14:50 - 15:15)

Title	(Invited Paper) Resilient Circuit Design Trade-Offs for Improving Performance & Energy Efficiency
Author	*Keith A. Bowman (Intel, U.S.A.)

S1-5 (Time: 15:15 - 15:40)

Title	(Invited Paper) Coordinated System Design for Resiliency
Author	*Vijay J. Reddi (University of Texas at Austin, U.S.A.)

Session 1A Architecture Issues in Embedded Systems
Time: 14:00 - 15:40 Tuesday, January 31, 2012
Location: Room 204B
Chairs: Tei-Wei Kuo (National Taiwan University, Taiwan), Zili Shao (The Hong Kong Polytechnic University)

1A-1 (Time: 14:00 - 14:25)

Title	JOP-Plus - A Processor for Efficient Execution of Java Programs Extended with GALS Concurrency
Author	Muhammad Nadeem, *Morteza Biglari-Abhari, Zoran Salcic (University of Auckland, New Zealand)
Page	pp. 17 - 22
Keyword	GALS processor, Java processor, reactivity, concurrency, embedded systems
Abstract	In this paper we present an approach to efficiently mix Java with asynchronous and synchronous concurrency and execute it on a specialized Java processor extended with capabilities for concurrency and reactivity. A new processor, which uses JOP (Java Optimized Processor) as its base, executes concurrent programs that comply with the Globally Asynchronous Locally Synchronous (GALS) formal model of computation by clearly distinguishing between concurrency and reactivity control flow and Java control flow. The new processor, called JOP-Plus, can be used for embedded and even real-time applications in which the majority of the code is written in Java and the overall programs are specified and structured in the SystemJ system-level concurrent programming language.

1A-2 (Time: 14:25 - 14:50)

Title	An Application Classification Guided Cache Tuning Heuristic for Multi-core Architectures
Author	Marisha Rawlins, *Ann Gordon-Ross (University of Florida, U.S.A.)
Page	pp. 23 - 28
Keyword	multi-core, cache tuning, energy, embedded systems
Abstract	Since multi-core architectures are becoming more popular, recent multi-core optimizations focus on energy consumption. We present a level one data cache tuning heuristic for a heterogeneous multi-core system, which classifies applications based on data sharing and cache behavior, and uses this classification to guide cache tuning and reduce the number of cores that need to be tuned. Results reveal average energy savings of 25% for 2-, 4-, 8-, and 16-core systems while searching only 1% of the design space.

1A-3 (Time: 14:50 - 15:15)

Title	Security Enhanced Linux on Embedded Systems: a Hardware-accelerated Implementation
Author	*Leandro Fiorin, Alberto Ferrante, Konstantinos Padarnitas, Francesco Regazzoni (ALaRI, Faculty of Informatics, University of Lugano, Switzerland)
Page	pp. 29 - 34
Keyword	Security, SELinux, Embedded Systems, Access Control
Abstract	Security Enhanced Linux implements fine-grained mandatory access control. Despite its usefulness, the overhead of implementing it on embedded devices is prohibitive. Therefore, in the past it has been proposed to accelerate SELinux by means of dedicated hardware; in this work we demonstrate the feasibility of such an approach by implementing a hardware accelerator for SELinux on an FPGA-based platform. Our implementation obtains a huge reduction in the performance overhead and energy consumption of SELinux while employing limited chip area.

1A-4 (Time: 15:15 - 15:40)

Title	PRR: A Low-Overhead Cache Replacement Algorithm for Embedded Processors
Author	Wei-Che Tseng (University of Texas at Dallas, U.S.A.), *Chun Jason Xue (City University of Hong Kong, Hong Kong), Qingfeng Zhuge, Jingtong Hu, Edwin H.-M. Sha (University of Texas at Dallas, U.S.A.)
Page	pp. 35 - 40
Keyword	embedded, cache, replacement, algorithm, round-robin
Abstract	In embedded systems power consumption and area tightly constrain the cache capacity and management logic. Many good cache replacement policies have been proposed in the past, but none approach the performance of the least recently used (LRU) algorithm without incurring high overheads. In fact, many embedded designers consider even pseudo-LRU too complex for their embedded systems processors. In this paper, we propose a new level 1 (L1) data cache replacement algorithm, Protected Round-Robin (PRR) that is simple enough to be incorporated into embedded processors while providing miss rates that are very similar to the miss rates of LRU. Our experiments showed that on average the miss rates of PRR are only 0.22 higher than the miss rates of LRU on a 32KB, 4-way L1 data cache with 32 byte long cache lines. PRR has miss rates that are on average 4.72 and 4.66 lower than random and round-robin replacement algorithms, respectively.
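
For readers unfamiliar with the baselines named in this abstract, the sketch below compares LRU and plain round-robin replacement on a single 4-way cache set. It does not implement PRR, whose protection mechanism the abstract does not describe; the function names and the access trace are illustrative assumptions, not taken from the paper.

```python
from collections import OrderedDict


def lru_misses(trace, ways=4):
    """Count misses in one cache set under least-recently-used replacement."""
    lines = OrderedDict()  # keys ordered from least to most recently used
    misses = 0
    for tag in trace:
        if tag in lines:
            lines.move_to_end(tag)  # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)  # evict the least recently used tag
            lines[tag] = True
    return misses


def round_robin_misses(trace, ways=4):
    """Count misses in one cache set under round-robin replacement."""
    lines = [None] * ways
    victim = 0  # next way to overwrite; advances only on a miss
    misses = 0
    for tag in trace:
        if tag not in lines:
            misses += 1
            lines[victim] = tag
            victim = (victim + 1) % ways
    return misses


# Hypothetical trace: tag 0 is reused often while tags 1..5 stream through.
trace = [0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 1, 0, 2]
print(lru_misses(trace), round_robin_misses(trace))
```

On this trace round-robin evicts the frequently reused tag once because it ignores recency, which is the kind of gap a protection scheme would aim to close at lower cost than LRU bookkeeping.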

Session 1B Power Network Design and Analysis
Time: 14:00 - 15:40 Tuesday, January 31, 2012
Location: Room 203
Chairs: Kimiyoshi Usami (Shibaura Institute of Technology, Japan), Saibal Mukhopadhyay (Georgia Institute of Technology, U.S.A.)

1B-1 (Time: 14:00 - 14:25)

Title	Incremental Power Network Analysis Using Backward Random Walks
Author	*Baktash Boghrati, Sachin Sapatnekar (University of Minnesota, U.S.A.)
Page	pp. 41 - 46
Keyword	Random walks, Incremental analysis, Power network
Abstract	The process of power network analysis during VLSI chip design is inherently iterative. It is very common for the designer to make many small perturbations to an otherwise complete design, to enhance the design or fix design violations. Considering the size of the modern chips, updating the solution for the changed network can be a computationally intensive task. In this paper we propose an efficient and accurate incremental solver that utilizes the backward random walks to identify the region of influence of the perturbation. The solution of the network is updated for the significantly smaller region only. The proposed algorithm is capable of handling consecutive perturbations without any degradation. The experimental results show speedups of up to 13.7x as compared to a complete solution.

1B-2 (Time: 14:25 - 14:50)

Title	Thermal-aware Power Network Design for IR Drop Reduction in 3D ICs
Author	*Zuowei Li, Yuchun Ma, Qiang Zhou, Yici Cai, Yu Wang (Tsinghua University, China), Tingting Huang (National Tsing Hua University, China), Yuan Xie (Pennsylvania State University, U.S.A.)
Page	pp. 47 - 52
Keyword	P/G TSV plan, P/G network, Thermal, 3D ICs
Abstract	Due to the high integration on vertical stacked layers, power/ground network design becomes one of the critical challenges in 3D IC design. With the leakage-thermal dependency, the increasing on-chip temperature in 3D designs has serious impact on IR drop due to the increased wire resistance and increased leakage current. Power/ground (P/G) TSVs can help relieve the IR drop violation by vertically connecting the on-chip P/G networks on different layers. However, most previous works realize only a fraction of the full potential of P/G TSV planning since they restrict the P/G grid to a uniform topology. Besides, because they overlook resistance variation and leakage current, their results are less accurate. In this paper, we present an efficient thermal-aware P/G TSV planning algorithm based on a sensitivity model with temperature-dependent leakage current considered. The proposed method can overcome the limitation of P/G grid topology and make full use of P/G TSV planning for optimization of the P/G network by allowing short wires to connect the P/G TSVs to P/G grids in non-uniform topology. Moreover, with resistance variation and increased leakage current caused by high temperature in 3D ICs, more accurate results can be obtained. Both the theoretical analysis and experimental results show the efficiency of our approach. Results show that neglecting thermal impacts on power delivery can underestimate IR drop by about 11%. To relieve the severe IR drop violation, 51.8% more P/G TSVs are needed than in the cases without thermal impacts considered. Results also show that our P/G TSV planning based on the sensitivity model can reduce max IR drop by 42.3% and reduce the number of violated nodes by 82.4%.

1B-3 (Time: 14:50 - 15:15)

Title	The Feasibility of Carbon Nanotubes for Power Delivery in 3-D Integrated Circuits
Author	*Nauman Khan, Soha Hassoun (Tufts University, U.S.A.)
Page	pp. 53 - 58
Keyword	3-D IC, Through Silicon Via, Power Delivery Network Design, CNT
Abstract	Increased power density and package asymmetry pose challenges in designing power delivery networks for 3-D Integrated Circuits (ICs). The increased resistivity of Cu wires due to scaling has shifted attention to alternate interconnect technologies. Continued and significant innovations in CNT manufacturing at CMOS-compatible temperatures with quality low-resistive contacts promise to enable the use of CNT as a replacement. We investigate in this paper the feasibility of using CNTs for power delivery in 3-D ICs. We evaluate the use of CNTs as Through-Silicon Vias (TSVs) and as wiring for global power delivery grids, fabricated on interposer dies. We assume the CNT interconnect has a mix of single- and multi-walled CNTs with 30% metallic nanotubes. We design a 3-D system-level comparative framework that utilizes select traces from SPEC benchmarks to evaluate improvements of CNTs over Cu. Our results emphasize how CNTs can significantly improve power delivery for 3-D integrated circuits. Using CNTs for on-chip power grid and for TSVs reduces the number of TSVs by 71% when compared to a Cu implementation. For the same substrate area dedicated to power-TSVs, CNTs improve the maximum and average IR drop by 98% and 40%, respectively. Improvements in the Ldi/dt drop are 47% and 18%, respectively.

1B-4 (Time: 15:15 - 15:40)

Title	An Efficient Hamiltonian-Cycle Power-Switch Routing for MTCMOS Designs
Author	*Yi-Ming Wang (Dept. of Electronics Engineering, National Chiao Tung University, Taiwan), Shi-Hao Chen (Global Unichip Corp., Taiwan), Mango C.-T. Chao (Dept. of Electronics Engineering, National Chiao Tung University, Taiwan)
Page	pp. 59 - 65
Keyword	power gating, MTCMOS, low power
Abstract	MTCMOS is popular in industry for implementing a power gating design. Major IC foundries recommend turning on power switches one by one to reduce the peak current during the mode transition. In this paper, we propose a power-switch-routing framework, which can effectively and efficiently find a feasible Hamiltonian-cycle routing among power switches without violating the Manhattan distance constraint while handling the irregular placement of power switches. The framework is currently used in a design service company.

Session 1C Emerging Circuits and Memories
Time: 14:00 - 15:40 Tuesday, January 31, 2012
Location: Room 202
Chairs: Yiran Chen (University of Pittsburgh, U.S.A.), Hai Zhou (Northwestern University, U.S.A.)

1C-1 (Time: 14:00 - 14:25)

Title	An ILP-based Obstacle-Avoiding Routing Algorithm for Pin-Constrained EWOD Chips
Author	*Jia-Wen Chang, Tsung-Wei Huang, Tsung-Yi Ho (National Cheng Kung University, Taiwan)
Page	pp. 67 - 72
Keyword	Biochip, ILP, Routing
Abstract	Electrowetting-on-dielectric (EWOD) chips, by electrically providing flexible and efficient manipulations of microfluidics, have become the most popular actuator particularly for droplet-based digital microfluidic (DMF) systems. In order to enable the electrical manipulations, wire routing is a key problem in designing EWOD chips. Unlike traditional very-large-scale-integration (VLSI) routing problems, in addition to routing-path establishment on signal pins, the EWOD-chip routing problem needs to address the issue of signal sharing for pin-count reduction under a practical constraint posed by limited pin-count supply. Moreover, EWOD-chip designs might incur several obstacles in the routing regions due to embedded devices for specific fluidic protocols such as embedded magnets for sample purification and electrophoresis devices for particle separation. However, no existing works consider routing with obstacles. To remedy this insufficiency, we propose in this paper the first obstacle-avoiding routing algorithm for pin-constrained EWOD chips. Our algorithm, based on effective integer-linear-programming (ILP) formulation as well as efficient routing framework, can achieve high routability with a low design complexity. Experimental results based on real-life chips with obstacles demonstrate the high routability of our obstacle-avoiding routing algorithm for pin-constrained EWOD chips.

1C-2 (Time: 14:25 - 14:50)

Title	A Look Up Table Design with 3D Bipolar RRAMs
Author	Yi-Chung Chen (Polytechnic Institute of New York University, U.S.A.), *Wei Zhang (Nanyang Technological University, Singapore), Hai Li (Polytechnic Institute of New York University, U.S.A.)
Page	pp. 73 - 78
Keyword	RRAM, LUT, FPGA, 3D technology
Abstract	Look Up Table (LUT) is a basic configurable logic element in Field Programmable Gate Arrays (FPGAs). In commercial products, Static Random Access Memory (SRAM) has been widely used in each LUT to store the configured logic. Recently, the emerging Resistive RAM (RRAM) has attracted a lot of attention for its high density and non-volatility. In this work, we explore a novel LUT design with bipolar RRAM devices. To obtain design efficiency, a 3D high-density interleaved memory structure is introduced in the proposed LUT. The corresponding peripheral circuits were developed at TSMC 0.18µm technology node. Compared to the traditional SRAM-based FPGA, the RRAM-based LUT demonstrates advantages such as eliminating initialization stage, a much higher density with 56% area reduction, bit-addressable write scheme, dynamic reconfiguration, and the flexibility to support various configurations.

1C-3 (Time: 14:50 - 15:15)

Title	Low Power Memristor-Based ReRAM Design with Error Correcting Code
Author	*Dimin Niu, Yang Xiao, Yuan Xie (The Pennsylvania State University, U.S.A.)
Page	pp. 79 - 84
Keyword	Non-volatile Memory, ReRAM, ECC
Abstract	The emerging memristor-based Resistive RAM (ReRAM) has shown great potential as one of the most promising memory technologies, with the unique properties such as high density, low-power, good-scalability, and non-volatility. However, as the process technology scales, the process variation will cause the deviation of the actual electrical behavior of memristor. Recently, researchers have observed that the probability of a single ReRAM cell switching successfully follows a function of the logarithm of the total programming time. As a result, the uncertainty of the electrical behavior results in different degrees of error rates in ReRAM-based memory. Traditional ECC (Error Correcting Code) design for conventional DRAM memory is used to detect and correct the errors in the memory system. In this paper, based on the mathematical analysis of the error patterns in memristor-based ReRAM and the study of ECC designs, we proposed to use ECC code to relax the BER (Bit Error Rate) requirement of a single memory to improve the write energy consumption and latency for both the MOS based and cross point based memristor ReRAM designs. In addition, the performance/power/area overhead of the proposed design options is also evaluated in detail.

1C-4 (Time: 15:15 - 15:40)

Title	Synthesis of Reversible Circuits with Minimal Lines for Large Functions
Author	Mathias Soeken, *Robert Wille, Christoph Hilken, Nils Przigoda, Rolf Drechsler (University of Bremen, Germany)
Page	pp. 85 - 92
Keyword	Synthesis, Reversible Logic, Data Structures, QMDDs
Abstract	Reversible circuits are an emerging technology where all computations are performed in an invertible manner. Motivated by their promising applications, e.g. in the domain of quantum computation or in the low-power design, the synthesis of such circuits has been intensely studied. However, how to automatically realize reversible circuits with the minimal number of lines for large functions is an open research problem. In this paper, we propose a new synthesis approach which relies on concepts that are complementary to existing ones. While "conventional" function representations have been applied for synthesis so far (such as truth tables, ESOPs, BDDs), we exploit Quantum Multiple-valued Decision Diagrams (QMDDs) for this purpose. An algorithm is presented that performs transformations on this data-structure eventually leading to the desired circuit. Experimental results show the novelty of the proposed approach through enabling automatic synthesis of large reversible functions with the minimal number of circuit lines. Furthermore, the quantum cost of the resulting circuits is reduced by 50% on average compared to an existing state-of-the-art synthesis method.

Session S2 Special Session 2: Domain Specific Accelerators
Time: 16:10 - 17:50 Tuesday, January 31, 2012
Location: Room 204A
Chair: Vijaykrishnan Narayanan (Pennsylvania State University, U.S.A.)

S2-1 (Time: 16:10 - 16:30)

Title	(Invited Paper) Accelerated Processing and the Fusion System Architecture
Author	*Mike O'Connor (AMD Research, Texas, U.S.A.)
Page	p. 93
Keyword	Fusion System Architecture
Abstract	Fusion System Architecture (FSA) is an open, extensible architecture that unifies CPUs and GPUs in a flexible computing fabric. New and existing programming languages and tools can build upon this framework to enable applications that seamlessly move between CPU and GPU cores, exploiting the best attributes of each. The architecture addresses low overhead data and computation transfer, as well as integrated manageability.

S2-2 (Time: 16:30 - 16:50)

Title	(Invited Paper) Platform Characterization for Domain-Specific Computing
Author	*Alex Bui (Department of Radiological Sciences, University of California, Los Angeles, U.S.A.), Kwang-Ting (Tim) Cheng (Department of Electrical and Computer Engineering, University of California, Santa Barbara, U.S.A.), Jason Cong (Department of Computer Science, University of California, Los Angeles, U.S.A.), Luminita Vese (Department of Mathematics, University of California, Los Angeles, U.S.A.), Yi-Chu Wang (Department of Electrical and Computer Engineering, University of California, Santa Barbara, U.S.A.), Bo Yuan, Yi Zou (Department of Computer Science, University of California, Los Angeles, U.S.A.)
Page	pp. 94 - 99
Keyword	Domain specific computing
Abstract	We believe that by adapting architectures to fit the requirements of a given application domain, we can significantly improve the efficiency of computation. To validate the idea for our application domain, we evaluate a wide spectrum of commodity computing platforms to quantify the potential benefits of heterogeneity and customization for the domain-specific applications. In particular, we choose medical imaging as the application domain for investigation, and study the application performance and energy efficiency across a diverse set of commodity hardware platforms, such as general-purpose multi-core CPUs, massive parallel many-core GPUs, low-power mobile CPUs and fine-grain customizable FPGAs. This study leads to a number of interesting observations that can be used to guide further development of domain-specific architectures.

S2-3 (Time: 16:50 - 17:10)

Title	(Invited Paper) GreenDroid: An Architecture for the Dark Silicon Age
Author	Nathan Goulding-Hotta, Jack Sampson, Qiaoshi Zheng, Vikram Bhatt, Joe Auricchio, Steven Swanson, *Michael Bedford Taylor (University of California, San Diego, U.S.A.)
Page	pp. 100 - 105
Keyword	greendroid, utilization wall, c-core, dark silicon
Abstract	Our research attacks the Dark Silicon problem directly through a set of energy-saving accelerators, called Conservation Cores, or c-cores. C-cores are a post-multicore approach that constructively uses dark silicon to reduce the energy consumption of an application by 10x or more. To examine the utility of c-cores, we are developing GreenDroid, a multicore chip that targets the Android mobile software stack. Our mobile application processor prototype targets a 32-nm process and is comprised of hundreds of automatically generated, specialized, patchable c-cores. These cores target specific Android hotspots, including the kernel. Our preliminary results suggest that we can attain large improvements in energy efficiency using a modest amount of silicon.

S2-4 (Time: 17:10 - 17:30)

Title	(Invited Paper) Accelerator-Rich Architectures: Implications, Opportunities and Challenges
Author	*Ravi Iyer (Intel, U.S.A.)
Page	pp. 106 - 107
Keyword	accelerators
Abstract	Providing high performance at ultra-low power for a domain of applications is possible by designing and integrating accelerators. Accelerators may be fixed-function, programmable or re-configurable in nature. Integration of many such accelerators in a system-on-chip (SoC) or chip-multiprocessor (CMP) introduces several major implications on architecture, power/performance and programmability. In this paper, we will provide an overview of the key challenges and outline research opportunities and challenges for accelerator-rich architectures and devices. We will also describe example solutions in some of these areas as a potential direction for further exploration.

S2-5 (Time: 17:30 - 17:50)

Title	(Invited Paper) A Reconfigurable Platform for the Design and Verification of Domain-Specific Accelerators
Author	Sungho Park, Yong Cheol, Peter Cho, Kevin M. Irick, *Vijaykrishnan Narayanan (The Pennsylvania State University, U.S.A.)
Page	pp. 108 - 113
Keyword	accelerators
Abstract	In this paper we present Vortex: a reconfigurable Network-on-Chip platform suitable for implementing domain-specific hardware accelerators in a design efficient manner. Our Vortex platform provides a flexible means to compose domain-specific accelerators for streaming applications such as performance critical machine vision systems. By substituting a traditional shared-bus architecture with low latency packet-switched routers and high utility network adaptors, maximum performance is exploited with minimal regard to communication infrastructure design and validation. To highlight the utility of the Vortex platform we present a case study in which a video analytics pipeline is mapped onto a multi-FPGA system. The system meets real-time throughput requirements on 3 Megapixel 48-bit image sequences with minimal resource overhead attributed to the Vortex communication infrastructure.

Session 2A System-Level Optimization Techniques for Multi-Core Architectures
Time: 16:10 - 17:50 Tuesday, January 31, 2012
Location: Room 204B
Chairs: Kiyoung Choi (Seoul National University, Republic of Korea), Yuko Hara-Azumi (Ritsumeikan University, Japan)

2A-1 (Time: 16:10 - 16:35)

Title	Learning-Based Power Management for Multi-Core Processors via Idle Period Manipulation
Author	Rong Ye, *Qiang Xu (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 115 - 120
Keyword	Power management, Multicore processor, Machine learning
Abstract	Learning-based dynamic power management (DPM) techniques, being able to adapt to varying system conditions and workloads, have attracted lots of research attention recently. To the best of our knowledge, however, none of the existing learning-based DPM solutions are dedicated to power reduction in multi-core processors, although they can be utilized by treating each processor core as a standalone entity and conducting DPM for them separately. In this work, by including task allocation into our learning-based DPM framework for multi-core processors, we are able to manipulate idle periods on processor cores to achieve a better tradeoff between power consumption and system performance. Experimental results show that the proposed solution significantly outperforms existing DPM techniques.

2A-2 (Time: 16:35 - 17:00)

Title	Memory Access Aware Power Gating for MPSoCs
Author	*Ye-Jyun Lin, Chia-Lin Yang, Jiao-Wei Huang (National Taiwan University, Taiwan), Naehyuck Chang (Seoul National University, Republic of Korea)
Page	pp. 121 - 126
Keyword	mpsoc, low power, power gating
Abstract	As technology continues to scale, reducing leakage is critical to achieve energy efficiency. Power gating can potentially save a significant part of leakage but it incurs both energy and performance penalties. Therefore, power gating decisions need to be made carefully. In the current low-power SoC design, an IP core is power gated when it is not operating. In this paper, we explore the IP idle time due to memory accesses for further leakage reduction. In MPSoCs, due to contention among concurrent memory accesses from different IP cores, memory stall cycles vary significantly, ranging from 10 to 600 cycles according to our experiments. We propose a run-time mechanism that predicts the memory stall cycles of an individual IP, and makes the power gating decision based on the predicted memory latency and its break-even time. With the predicted memory latency, a power-gated IP can be woken up in advance to avoid performance degradation. The experimental results show that our power management mechanism can achieve 25.3% leakage energy saving within 4% performance penalty.

2A-3 (Time: 17:00 - 17:25)

Title	Buffer Minimization in Pipelined SDF Scheduling on Multi-Core Platforms
Author	Yuankai Chen, *Hai Zhou (Northwestern University, U.S.A.)
Page	pp. 127 - 132
Keyword	Buffer-size minimization, Multi-core, SDF, Scheduling, Pipeline
Abstract	With the increasing number of cores available on modern processors, it is imperative to solve the problem of mapping and scheduling a synchronous data flow graph onto a multi-core platform. Such a solution should not only meet the performance constraint, but also minimize resource usage. In this paper, we consider the pipeline scheduling problem for acyclic synchronous dataflow graphs on a given number of cores to minimize the total buffer size while meeting the throughput constraint. We propose a two-level heuristic algorithm for this problem. The inner level finds the optimal buffer size for a given topological order of the input task graph; the outer level explores the space of topological orders by applying perturbation to the topological order to improve buffer size. We compared our proposed algorithm to an enumeration algorithm which is able to generate optimal solutions for small graphs, and a greedy algorithm which is able to run on large graphs. The experimental results show that our two-level heuristic algorithm achieves near-optimal solutions compared to the enumeration algorithm, with only 0.8% increase in buffer size on average but with much shorter runtime, and achieves 38.8% less buffer usage on average, compared to the greedy algorithm.

2A-4 (Time: 17:25 - 17:50)

Title	A Hierarchical C2RTL Framework for FIFO-Connected Stream Applications
Author	*Shuangchen Li, Yongpan Liu, Daming Zhang, Xinyu He (TNList, EE Dept., Tsinghua University, China), Pei Zhang (Y Explorations Inc., U.S.A.), Huazhong Yang (TNList, EE Dept., Tsinghua University, China)
Page	pp. 133 - 138
Keyword	C2RTL, Hierarchical synthesis, FIFO sizing
Abstract	In modern embedded systems, the C2RTL (high-level synthesis) technology helps the designer to greatly reduce time-to-market, while satisfying the performance and cost constraints. To attack the performance challenges in complex designs, we propose a FIFO-connected hierarchical approach to replace the traditional flattened one in stream applications. Furthermore, we develop an analytical algorithm to find the optimal FIFO capacity to connect multiple modules efficiently. Finally, we prove the advantages of the proposed method and the feasibility of our algorithm in seven real applications. Experimental results show that the hierarchical approach can have an up to 10.43 times speedup compared to the flattened design, while our analytical FIFO sizing algorithm shrinks design time from hours to seconds with the same accuracy compared to the simulation-based approach.

Session 2B High-Speed PCB Routing
Time: 16:10 - 17:50 Tuesday, January 31, 2012
Location: Room 203
Chairs: Yih-Lang Li (National Chiao Tung University, Taiwan), Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.)

2B-1 (Time: 16:10 - 16:35)

Title	Escape Routing of Differential Pairs Considering Length Matching
Author	*Tai-Hung Li, Wan-Chun Chen, Xian-Ting Cai, Tai-Chen Chen (National Central University, Taiwan)
Page	pp. 139 - 144
Keyword	PCB, Routing, Differential Pair, Length Matching, skew
Abstract	The escape routing problem for PCB designs has been extensively studied in the literature. Although industrial tools and a few studies have worked on the escape routing of differential pairs, the routing solutions obtained by previous methods are not good enough. In this paper, we propose an escape routing approach for differential pairs considering length matching. The approach includes two stages. The first stage finds min-cost median points which connect two pins by minimum and equal wire lengths, while the second stage adopts a network-flow approach with min-cost max-flow to simultaneously route all differential pairs. Experimental results show that our approach can efficiently and effectively obtain length-matching differential pairs with significant reduction in maximum and average differential-pair skews.

2B-2 (Time: 16:35 - 17:00)

Title	An Any-Angle Routing Method using Quasi-Newton Method
Author	*Yukihide Kohira (The University of Aizu, Japan), Atsushi Takahashi (Osaka University, Japan)
Page	pp. 145 - 150
Keyword	PCB routing, package routing, any-angle routing, quasi-Newton method
Abstract	In this paper, we propose a routing method which solves an any-angle gridless routing problem by formulating it as a non-linear program, which is solved by a quasi-Newton method. Our proposed method minimizes the total wire length or the total length error while satisfying constraints such as the separation between a route and an obstacle, the separation between two routes, and the angle of a bend in a route. Experiments show that the proposed method is effective in obtaining any-angle gridless routes in short computational time.

2B-3 (Time: 17:00 - 17:25)

Title	Linear Optimal One-Sided Single-Detour Algorithm for Untangling Twisted Bus
Author	Tao Lin, *Sheqin Dong (Tsinghua University, China), Song Chen, Satoshi Goto (Waseda University, Japan)
Page	pp. 151 - 156
Keyword	linear, optimal, untangling, Twisted Bus
Abstract	We consider the one-sided single-detour problem of untangling twisted nets for printed circuit board bus routing. An optimal dynamic-programming-based O(n³) algorithm was proposed in a previous work, where n is the number of nets. In this paper, we propose an optimal O(n) untangling algorithm without considering capacity, and this algorithm is further modified to consider capacity with very little overhead. Experimental results show that our algorithm runs much faster than the previous work due to its low time complexity.

2B-4 (Time: 17:25 - 17:50)

Title	LEMAR: A Novel Length Matching Routing Algorithm for Analog and Mixed Signal Circuits
Author	*Hailong Yao, Yici Cai, Qiang Gao (Tsinghua University, China)
Page	pp. 157 - 162
Keyword	Analog and mixed signal circuits, Length matching, detailed routing
Abstract	Enabled by the heterogeneous integration in modern System-On-Chips (SOCs), the design automation for analog and mixed signal circuit components in SOCs is attracting increasing interest. Matching constraints for specific analog signals are critical for correct functionalities. This paper presents a novel single-layer detailed routing algorithm with the length matching constraint, called LEMAR. LEMAR features an innovative routing model for partitioning the routing layout for wire detouring, effective detouring patterns according to the geometric shapes of the partitioned tiles, an enhanced A*-search algorithm along with the backtrack technique for finding the routing path, and an iterative rip-up and reroute procedure for finding the feasible routing solution with the matching constraint. Experimental results are promising and show that LEMAR is both effective and efficient.

Session 2C Emerging Test Solutions
Time: 16:10 - 17:50 Tuesday, January 31, 2012
Location: Room 202
Chairs: Jiun-Lang Huang (National Taiwan University, Taiwan), Wu-Tung Cheng (Mentor Graphics, U.S.A.)

2C-1 (Time: 16:10 - 16:35)

Title	An Intelligent Analysis of Iddq Data for Chip Classification in Very Deep-Submicron (VDSM) CMOS Technology
Author	Chia-Ling Chang, Chia-Ching Chang, Hui-Ling Chan, *Charles H.-P. Wen (National Chiao Tung University, Taiwan), Jayanta Bhadra (Freescale Semiconductor Inc., U.S.A.)
Page	pp. 163 - 168
Keyword	Iddq testing, data mining
Abstract	Iddq testing has been a critical integral component in test suites for screening unreliable devices. As the silicon technology keeps shrinking, Iddq values and their variation increase as well. Moreover, along with rapid design scaling, defect-induced leakage currents become less significant when compared to full-chip current and also make themselves less distinguishable. Traditional Iddq methods become less effective and cause more test escapes and yield loss. Therefore, in this paper, a new test method named Sigma-Iddq testing is proposed, which integrates (1) a variation-aware full-chip leakage estimator and (2) a clustering algorithm to classify chips without using threshold values. Experimental results show that Sigma-Iddq testing achieves a higher classification accuracy in a 45nm technology when compared to single-threshold Iddq testing. As a result, both the process-variation and design-scaling impacts are successfully excluded and thus the defective chips can be identified intelligently.

2C-2 (Time: 16:35 - 17:00)

Title	CODA: A Concurrent Online Delay Measurement Architecture for Critical Paths
Author	Yubin Zhang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China), Haile Yu, *Qiang Xu (The Chinese University of Hong Kong, Hong Kong)
Page	pp. 169 - 174
Keyword	online test, delay of critical paths, delay measurement
Abstract	With technology scaling, integrated circuits behave more unpredictably due to process variation, environmental changes and aging effects. Correspondingly, various variation-aware and adaptive design methodologies have been proposed. More effective solutions can be obtained if we are able to collect real-time information such as the actual propagation delay of critical paths when the circuit is running in normal function mode. Therefore, we propose a novel concurrent online delay measurement architecture for critical paths. Experimental results demonstrate its high accuracy and practicality.

2C-3 (Time: 17:00 - 17:25)

Title	Low-cost Control Flow Error Protection by Exploiting Available Redundancies in the Pipeline
Author	Mohammad Abdur Rouf, *Soontae Kim (Korea Advanced Institute of Science and Technology, Republic of Korea)
Page	pp. 175 - 180
Keyword	Control flow error checking, transient fault, branch target buffer, low energy
Abstract	Due to device miniaturization and reducing supply voltage, embedded systems are becoming more susceptible to transient faults. Specifically, faults in control flow can change the execution sequence, which might be catastrophic for safety critical applications. Many techniques are devised using software, hardware or software-hardware co-design for control flow error checking. Software techniques suffer from a significant amount of code size overhead, and hence, negative impact on performance and energy consumption. On the other hand, hardware-based techniques have a significant amount of hardware and area cost. In this research we exploit the available redundancies in the pipeline. The branch target buffer stores target addresses of taken branches, and ALU generates target addresses using the low-order branch displacement bits of branch instructions. To exploit these redundancies in the pipeline, we propose a control flow error checking (CFEC) scheme. It can detect control flow errors and recover from them with negligible energy and performance overhead.

2C-4 (Time: 17:25 - 17:50)

Title	Detection and Diagnosis of Faulty Quantum Circuits
Author	*Alexandru Paler, Ilia Polian (University of Passau, Germany), John P. Hayes (University of Michigan, U.S.A.)
Page	pp. 181 - 186
Keyword	Probabilistic testing, Quantum circuits, Test generation, Fault diagnosis, Design for test
Abstract	A new approach to detecting and diagnosing faults in quantum circuits is introduced. In order to account for the probabilistic nature of quantum circuits, collections of test experiments, called binary tomographic tests (BTTs), are generated. A BTT can identify a fault with respect to some user-defined confidence threshold τ. We present an algorithm to generate BTTs that either detect, or ensure the absence of, all modeled faults in a given circuit. We also present an adaptive diagnostic method to locate quantum faults. While classical circuits, even probabilistic ones, only handle ordinary probabilities, quantum circuits deal with quantum states, which have phase as an extra probabilistic parameter. The tomographic testing methods introduced previously for probabilistic circuits are unable to detect differences in phase, and therefore leave many quantum faults undetected. In contrast, we develop a design-for-test method which is specifically intended to detect faults that only affect the phase of a quantum state. We give experimental results for benchmark and random circuits which show high coverage of quantum faults by BTTs, and good resolution in the case of the adaptive diagnosis method.

Wednesday, February 1, 2012

Session I1 Invited Talk 1
Time: 8:30 - 9:20 Wednesday, February 1, 2012
Location: Room 204A

I1-1 (Time: 8:30 - 9:20)

Title	(Invited Paper) Achieving Energy Efficiency by Dynamic Techniques
Author	Tanay Karnik, James Tschanz, Keith Bowman, Carlos Tokunaga, Vivek De, Shekhar Borkar (Intel, U.S.A.)
Keyword	energy efficiency
Abstract	This talk will introduce the audience to the research at Intel Labs. The world-class research overview will be followed by a special presentation on Intel’s academic programs. This will be followed by many innovations in improving energy efficiency in computation and communication. The primary focus of the rest of this talk will be on effects of voltage variations and device aging across a wide operating range of voltage and frequencies. The dynamic phenomena need dynamic techniques that include sensing and adaptation. We will present analog and digital adaptive circuit techniques with microarchitecture hooks to mitigate the variation effects.

Session I2 Invited Talk 2
Time: 9:20 - 10:10 Wednesday, February 1, 2012
Location: Room 204A

I2-1 (Time: 9:20 - 10:10)

Title	(Invited Paper) Multi-Threaded Processing Paradigms for Scalable Media Compression
Author	David Taubman (University of New South Wales, Australia)

Session S3 Special Session 3: Design and Prototyping of Invasive MPSoC Architectures
Time: 10:40 - 12:20 Wednesday, February 1, 2012
Location: Room 204A
Chair: Sri Parameswaran (University of New South Wales, Australia)

S3-1 (Time: 10:40 - 11:05)

Title	(Invited Paper) Approximate Time Functional Simulation of Resource-Aware Programming Concepts for Heterogeneous MPSoCs
Author	Sascha Roloff, Frank Hannig, Jürgen Teich (Department of Computer Science, University of Erlangen-Nuremberg, Germany)
Page	pp. 187 - 192
Keyword	MPSoC
Abstract	The design and the programming of heterogeneous future MPSoCs including thousands of processor cores is a hard challenge. Means are necessary to program and simulate the dynamic behavior of such systems in order to dimension the hardware design and to verify the software functionality as well as performance goals. Cycle-accurate simulation of multiple parallel applications simultaneously running on different cores of the architecture would be much too slow and is not the desired level of detail. In this paper, we therefore present a novel high-level simulation approach which tackles the complexity and the heterogeneity of such systems and enables the investigation of a new computing paradigm called invasive computing. Here, the workload and its distribution are not known at compile-time but are highly dynamic and have to be adapted to the status (load, temperature, etc.) of the underlying architecture at run-time. We propose an approach for the modeling of tiled MPSoC architectures and the simulation of resource-aware programming concepts on these. This approach delivers important timing information about the parallel execution and also takes into account the computational properties of possibly different types of cores.

S3-2 (Time: 11:05 - 11:30)

Title	(Invited Paper) Invasive Manycore Architectures
Author	Jörg Henkel (Karlsruhe Institute of Technology, Germany), Andreas Herkersdorf (Technical University of Munich, Germany), Lars Bauer (Karlsruhe Institute of Technology, Germany), Thomas Wild (Technical University of Munich, Germany), Michael Hubner (Karlsruhe Institute of Technology, Germany), Ravi Kumar Pujari (Technical University of Munich, Germany), Artjom Grudnitsky, Jan Heisswolf (Karlsruhe Institute of Technology, Germany), Aurang Zaib (Technical University of Munich, Germany), Benjamin Vogel (Karlsruhe Institute of Technology, Germany), Vahid Lari (University of Erlangen-Nuremberg, Germany), Sebastian Kobbe (Karlsruhe Institute of Technology, Germany)
Page	pp. 193 - 200
Keyword	invasive computing, manycore architectures
Abstract	This paper introduces a scalable hardware and software platform applicable for demonstrating the benefits of the invasive computing paradigm. The hardware architecture consists of a heterogeneous, tile-based manycore structure while the software architecture comprises a multi-agent management layer underpinned by distributed runtime and OS services. The necessity for invasive-specific hardware assist functions is analytically shown and their integration into the overall manycore environment is described.

S3-3 (Time: 11:30 - 11:55)

Title	(Invited Paper) Hardware Prototyping of Novel Invasive Multicore Architectures
Author	Jürgen Becker, Stephanie Friederich, Jan Heisswolf, Ralf Koenig (Karlsruhe Institute of Technology, Germany), David May (Technische Universität München, Germany)
Page	pp. 201 - 206
Keyword	invasive computing
Abstract	The sustained advance in technology will enable integrating hundreds of processing cores on a single die in the near future. However, it can already be foreseen that the management of the resources of such large systems will not scale in the same way as the hardware using today's entirely software-based and centralized management approaches. The invasive paradigm addresses this problem and proposes concepts to enable resource awareness and scalability – especially focusing on the resource management perspective – in future multicore systems. These concepts are based on distributed and software-hardware partitioned resource management strategies. High-level management decisions that are made by software thereby trigger lower-level management strategies that are autonomously carried out in hardware. Sufficiently accurate modeling of the overall invasive system is required to study and optimize such a decentralized, software-hardware partitioned control loop where decisions significantly depend on runtime dynamic effects. Software-based simulation cannot deliver the required speed or accuracy, making FPGA-based prototyping of invasive systems necessary. This paper describes our prototyping concepts and discusses possible implementation alternatives for invasive multicore architectures.

S3-4 (Time: 11:55 - 12:20)

Title	(Invited Paper) Invasive Computing for Robotic Vision
Author	Johny Paul, Walter Stechele (Technical University of Munich, Germany), M. Kröhnert, T. Asfour, R. Dillmann (Karlsruhe Institute of Technology, Germany)
Page	pp. 207 - 212
Keyword	invasive computing
Abstract	Most robotic vision algorithms are computationally intensive and operate on millions of pixels of real-time video sequences. But they offer a high degree of parallelism that can be exploited through parallel computing techniques like Invasive Computing. The conventional way of multi-processing alone (with static resource allocation) is not sufficient to handle a scenario like a robotic maneuver, where processing elements have to be shared between various applications and the computing requirements of such applications may not be known entirely at compile-time. Such static mapping schemes lead to inefficient utilization of resources. At the same time it is difficult to dynamically control and distribute resources among different applications running on a single chip, achieving high resource utilization under high-performance constraints. Invasive Computing gains importance under such circumstances, as it offers resource awareness to the application programs so that they can adapt themselves to the changing conditions at run-time. In this paper we demonstrate the resource-aware and self-organizing behavior of invasive applications using three widely used applications from the area of robotic vision - Optical Flow, Object Recognition and Disparity Map Computation. The applications can dynamically acquire and release hardware resources, considering the level of parallelism available in the algorithm and time-varying load.

Session S4 Special Session 4: Making ESL Models Work
Time: 10:40 - 12:20 Wednesday, February 1, 2012
Location: Room 204B
Chair: Gunar Schirner (Northeastern University, U.S.A.)

S4-1 (Time: 10:40 - 11:05)

Title	(Invited Paper) Abstract System-Level Models for Early Performance and Power Exploration
Author	*Andreas Gerstlauer, Suhas Chakravarty, Manan Kathuria, Parisa Razaghi (The University of Texas at Austin, U.S.A.)
Page	pp. 213 - 218
Keyword	System-level models, Host-compiled modeling, Design Space Exploration
Abstract	With increasing complexity of today’s embedded systems, research has focused on developing fast, yet accurate high-level and executable models of complete platforms. These models address the need for hardware/software co-simulation of the entire system at early stages of the design. Traditional models tend to be either slow or inaccurate. In this paper, we present ingredients for a class of abstract, high-level platform models that enable fast yet accurate performance and power simulation of application execution on heterogeneous multi-core/-processor architectures. Models are based on host-compiled simulation of the application code, which is instrumented with timing and power information. Back-annotated source code is further augmented with abstract OS and processor models that are integrated into standard co-simulation backplanes. The efficiency of the modeling platform has been evaluated by applying an industrial-strength benchmark, demonstrating the feasibility and benefits of such models for rapid, early exploration of the power, performance and cost design space. Results show that an accurate Pareto set of solutions can be obtained in a fraction of the time needed with traditional simulation and modeling approaches.

S4-2 (Time: 11:05 - 11:30)

Title	(Invited Paper) Virtual Prototyping of Cyber-Physical Systems
Author	*Wolfgang Mueller, Markus Becker, Ahmed Elfeky (University of Paderborn/C-LAB, Germany), Anthony DiPasquale (Northwestern University, U.S.A.)
Page	pp. 219 - 226
Keyword	cyber-physical systems, SystemC, QEMU
Abstract	The modeling and analysis of Cyber-Physical Systems (CPS) is one of the key challenges in complex system design as heterogeneous components are combined and their close interaction with the physical environment has to be considered. This article presents a methodology and an open toolset for the virtual prototyping of CPS. The focus of the methodology is the virtual prototyping of the embedded software combined with the prototyping of the physical environment in order to capture the complete closed control loop of the software over the hardware via sensors/actors with the physical objects. The methodology is based on the application of integrated open source tools and standard languages, i.e., C/C++, SystemC, and the Open Dynamics Engine, which are combined to a powerful simulation framework. Key activities of the methodology are outlined by the example of an electric two-wheel vehicle.

S4-3 (Time: 11:30 - 11:55)

Title	(Invited Paper) Parallel Discrete Event Simulation of Transaction Level Models
Author	*Rainer Doemer, Weiwei Chen, Xu Han (Center for Embedded Computer Systems, University of California, Irvine, U.S.A.)
Page	pp. 227 - 231
Keyword	MPSoC, PDES, SystemC, SpecC
Abstract	Describing Multi-Processor Systems-on-Chip (MPSoC) at the abstract Electronic System Level (ESL) is one task, validating them efficiently is another. Here, fast and accurate system-level simulation is critical. Recently, Parallel Discrete Event Simulation (PDES) has gained significant attraction again as it promises to utilize the existing parallelism in today’s multi-core CPU hosts. This paper discusses the parallel simulation of Transaction-Level Models (TLMs) described in System-Level Description Languages (SLDLs), such as SystemC and SpecC. We review how PDES exploits the explicit parallelism in the ESL design models and uses the parallel processing units available on multi-core host PCs to significantly reduce the simulation time. We show experimental results for two highly parallel benchmarks as well as for two actual embedded applications.

S4-4 (Time: 11:55 - 12:20)

Title	(Invited Paper) Post-Silicon Patching for Verification/Debugging with High-Level Models and Programmable Logic
Author	*Masahiro Fujita, Hiroaki Yoshida (VLSI Design and Education Center (VDEC), University of Tokyo/CREST, Japan Science and Technology Agency, Japan)
Page	pp. 232 - 237
Keyword	high-level models
Abstract	Due to the continuous increase of design complexity in SoC development, the time required for post-silicon verification and debugging keeps increasing, especially for electrical errors and very corner case bugs (which happen in extremely rare but actual situations), and it is now understood that some sort of programmability in silicon is essential to reduce the time for post-silicon verification and debugging. In this paper, we discuss partial use of in-field programmability in control parts of circuits for post-silicon debugging processes for electrical errors and corner case logical bugs. Our method deals with RTL designs in FSMD (Finite State Machine with Datapath) by adding partial in-field programmability, called "patch logic", in their control parts. If designs are given at a high level, like C-based designs, they are first synthesized by our high-level synthesis techniques to include such in-field programmability in the control parts of the synthesized RTL automatically. With patch logic we can dynamically change the behaviors of circuits in such a way that state transition sequences as well as values of internal values are traced based on user requests. Our patch logic can also periodically check whether there is an electrical error. Assuming that electrical errors occur very infrequently, an error can be detected by comparing the equivalence of the results of duplicated computations. Through experiments we discuss the area, timing, and power overhead due to the patch logic and also show results on electrical error detection with duplicated computations.

Session 3B High-Level Synthesis
Time: 10:40 - 12:20 Wednesday, February 1, 2012
Location: Room 203
Chairs: Nagisa Ishiura (Kwansei Gakuin University, Japan), Shigeru Yamashita (Ritsumeikan University, Japan)

3B-1 (Time: 10:40 - 11:05)

Title	Performance-Driven Register Write Inhibition in High-Level Synthesis under Strict Maximum-Permissible Clock Latency Range
Author	*Keisuke Inoue, Mineo Kaneko (Japan Advanced Institute of Science and Technology, Japan)
Page	pp. 239 - 244
Keyword	high-level synthesis, register write inhibition, FU binding, register binding
Abstract	Clock skew scheduling is a process of assigning intentional clock skews to registers for improving circuit performance and reliability. Due to the recent large effect of process variations, it becomes more and more difficult to reliably implement a large set of arbitrary clock latencies. Consequently, the optimization potential of clock skew scheduling should be highly limited. This paper points out that there is a chance to achieve further improvement of circuit performance by removing some register-writes while preserving reliability. This paper is the first work of the clock skew-aware high-level synthesis framework considering register write inhibition to minimize the clock period. A network flow-based heuristic algorithm to obtain the minimum clock period is presented and evaluated by experiments, which supports the effectiveness of the approach.

3B-2 (Time: 11:05 - 11:30)

Title	Clock Period Minimization with Minimum Area Overhead in High-Level Synthesis of Nonzero Clock Skew Circuits
Author	*Wen-Pin Tu, Shih-Hsu Huang, Chun-Hua Cheng (Chung Yuan Christian University, Taiwan)
Page	pp. 245 - 250
Keyword	Clock Skew Optimization, Clock Period Minimization, High-Level Synthesis, Resource Binding, Area Minimization
Abstract	Although clock skew can be utilized to reduce the clock period, the utilization of clock skew also limits the sharing of resources (including registers and functional units). Previous works have considered the influence of clock arrival times on register sharing, but they do not pay any attention to the influence of clock arrival times on functional unit sharing. As a result, extra functional units are often required during functional unit binding. Based on that observation, in this paper, we perform the simultaneous application of register binding and functional unit binding for the high-level synthesis of nonzero clock skew circuits. Our objective is to minimize the circuit area for working with the lower bound of the clock period. Compared with previous works, benchmark data show that our approach can achieve the lower bound of the clock period with a smaller area overhead.

3B-3 (Time: 11:30 - 11:55)

Title	Clock-Constrained Simultaneous Allocation and Binding for Multiplexer Optimization in High-Level Synthesis
Author	*Yuko Hara-Azumi, Hiroyuki Tomiyama (Ritsumeikan University, Japan)
Page	pp. 251 - 256
Keyword	High-level synthesis, Allocation, Binding, Multiplexer, Clock constraint
Abstract	This paper proposes a novel simultaneous allocation and binding method in high-level synthesis, which minimizes the circuit area including multiplexers (MUXs) under a clock constraint. Most existing works on binding minimize MUXs under given allocation by minimizing the number of interconnections, but do not care where the MUXs would be inserted in a circuit. As a result, they cannot guarantee the required clock frequency and often violate the clock constraint. On the contrary, our work globally optimizes binding and allocation for FUs and registers while meeting the clock constraint by considering where MUXs would be inserted. Our work is formulated as an ILP problem. Also, an effective ILP-based heuristic for non-small designs is presented. Experimental results demonstrate that our work satisfies the clock constraint with the minimum circuit area.

3B-4 (Time: 11:55 - 12:20)

Title	An Integrated and Automated Memory Optimization Flow for FPGA Behavioral Synthesis
Author	*Yuxin Wang (Computer Science Department, Peking University and UCLA/PKU Joint Research Institute in Science and Engineering, China), Peng Zhang (Computer Science Department, University of California, Los Angeles, U.S.A.), Xu Cheng (Computer Science Department, Peking University, China), Jason Cong (Computer Science Department, University of California, Los Angeles and UCLA/PKU Joint Research Institute in Science and Engineering, U.S.A.)
Page	pp. 257 - 262
Keyword	Behavioral Synthesis, Memory Partitioning, Memory Merging
Abstract	Behavioral synthesis tools have made significant progress in compiling high-level programs into register-transfer level (RTL) specifications. But manually rewriting code is still necessary in order to obtain better quality of results in memory system optimization. In recent years different automated memory optimization techniques have been proposed and implemented, such as data reuse and memory partitioning, but the problem of integrating these techniques into an applicable flow to obtain a better performance has become a challenge. In this paper we integrate data reuse, loop pipelining, memory partitioning, and memory merging into an automated optimization flow (AMO) for FPGA behavioral synthesis. We develop memory padding to help in the memory partitioning of indices with modulo operations. Experimental results on Xilinx Virtex-6 FPGAs show that our integrated approach can gain an average 5.8x throughput and 4.55x latency improvement compared to the approach without memory partitioning. Moreover, memory merging saves up to 44.32% of block RAM (BRAM).

Session 3C Yield and Manufacturability Enhancement
Time: 10:40 - 12:20 Wednesday, February 1, 2012
Location: Room 202
Chairs: Zheng Shi (Zhejiang University, China), Charles H.-P. Wen (National Chiao Tung University, Taiwan)

3C-1 (Time: 10:40 - 11:05)

Title	EPIC: Efficient Prediction of IC Manufacturing Hotspots With A Unified Meta-Classification Formulation
Author	Duo Ding, Bei Yu, Joydeep Ghosh, *David Z. Pan (The University of Texas at Austin, U.S.A.)
Page	pp. 263 - 270
Keyword	lithography hotspot detection, meta-classification, high performance, physical verification
Abstract	In this paper we present EPIC, a new generic and unified formulation to seamlessly combine the advantages of various types of lithographic hotspot detection techniques. With such formulation, we develop an efficient CAD flow and optimize it with quadratic programming techniques under industry-strength data. After integrating various machine learning and pattern matching detection methods, we evaluate EPIC with a number of industry benchmarks under advanced manufacturing conditions. EPIC demonstrates so far the best capability in selectively combining the desirable features of various hotspot detection methods (3.5-8.2% accuracy improvement) as well as significant suppression of the detection noise (e.g., around 80% false-alarm reduction). These characteristics make EPIC very suitable for conducting high performance physical verification and guiding efficient manufacturability-friendly physical design.

3C-2 (Time: 11:05 - 11:30)

Title	GNOMO: Greater-than-NOMinal V_dd Operation for BTI Mitigation
Author	*Saket Gupta, Sachin Sapatnekar (University of Minnesota, U.S.A.)
Page	pp. 271 - 276
Keyword	Reliability, Mitigation, BTI
Abstract	This paper presents a novel scheme for mitigating delay degradations in digital circuits due to bias temperature instability (BTI). The method works in two alternating phases. In the first, a greater-than-nominal supply voltage, Vdd,g, is used, which causes a task to complete more quickly but causes greater aging than the nominal supply voltage, Vdd,n. In the second, the circuit is power-gated, enabling the BTI recovery phase. We demonstrate, both at the circuit and the architectural levels, that this approach can significantly mitigate aging for a small performance penalty.

3C-3 (Time: 11:30 - 11:55)

Title	Tier-Adaptive-Voltage-Scaling (TAVS): A Methodology for Post-Silicon Tuning of 3D ICs
Author	Kwanyeob Chae, *Saibal Mukhopadhyay (Georgia Institute of Technology, U.S.A.)
Page	pp. 277 - 282
Keyword	3D IC, Post-Silicon Tuning, Adaptive-Voltage-Scaling, Variation
Abstract	This paper presents tier-adaptive-voltage-scaling (TAVS) as a post-silicon tuning methodology for improving parametric yield of 3D integrated circuits considering die-to-die and within-die process variations. The TAVS methodology senses process corners of individual tiers using on-tier delay sensors and adapts the supply voltage of each tier. The overall TAVS architecture is presented and the circuit issues associated with the design of 3D level shifters are discussed. Circuit-level simulation and statistical analysis of the TAVS architecture in predictive 45nm technology show the possibility of 26%-39% reduction in chip delay distribution.

3C-4 (Time: 11:55 - 12:20)

Title	Body Bias Clustering for Low Test-Cost Post-Silicon Tuning
Author	Shuta Kimura, *Masanori Hashimoto, Takao Onoye (Osaka University, Japan)
Page	pp. 283 - 289
Keyword	body biasing, body clustering, post-silicon tuning
Abstract	Post-silicon tuning is attracting a lot of attention for coping with increasing process variation. However, its tuning cost via testing is still a crucial problem. In this paper, we propose tuning-friendly body bias clustering with multiple bias voltages. The proposed method provides a small set of compensation levels so that the speed and leakage current vary monotonically according to the level. Thanks to this monotonic leveling and limitation of the number of levels, the test-cost of post-silicon tuning is significantly reduced. During the body bias clustering, the proposed method explicitly estimates and minimizes the average leakage after the post-silicon tuning. Experimental results demonstrate that the proposed method reduces the average leakage by 25.3 to 51.9% compared to the non-clustering case. We reveal that two bias voltages are sufficient when only a small number of compensation levels are allowed for test-cost reduction. We also give an implication on how to synthesize a circuit to which post-silicon tuning will be applied.

Session S5 Special Session 5: Advanced Post-silicon Validation and Debugging Techniques for SoC
Time: 14:00 - 15:40 Wednesday, February 1, 2012
Location: Room 204A
Chair: Masahiro Fujita (University of Tokyo, Japan)

S5-1 (Time: 14:00 - 14:25)

Title	(Invited Paper) Bug Localization Techniques for Effective Post-Silicon Validation
Author	*Subhasish Mitra, David Lin (Stanford University, U.S.A.), Nagib Hakim, Don Gardner (Intel Corporation, U.S.A.)
Page	p. 291
Keyword	bug localization
Abstract	Post-silicon validation is used to detect and fix bugs in integrated circuits and systems after manufacture. Due to sheer design complexity, it is nearly impossible to detect and fix all bugs before manufacture. Existing post-silicon validation methods barely cope with today’s complexity. New techniques are essential to minimize the effects of bugs and design flaws going forward. This talk will focus on two recent techniques, QED and IFRA, that can overcome significant challenges associated with a very crucial step in post-silicon validation: bug localization in a system setup. We demonstrate the effectiveness of these techniques using results from quad-core Intel Core i7 hardware platforms and Intel Nehalem processors, and using actual examples of "difficult" bugs that occurred in complex SoCs.

S5-2 (Time: 14:25 - 14:50)

Title	(Invited Paper) Improving Validation Coverage Metrics to Account for Limited Observability
Author	*Peter Lisherness, Kwang-Ting (Tim) Cheng (Electrical and Computer Engineering Department, UCSB, U.S.A.)
Page	pp. 292 - 297
Keyword	observability, coverage metrics
Abstract	In both pre-silicon and post-silicon validation, the detection of design errors requires both stimulus capable of activating the errors and checkers capable of detecting the behavior as erroneous. Most functional and code coverage metrics evaluate only the activation component of the testbench and ignore propagation and detection. In this paper, we summarize our recent work in developing improved metrics that account for propagation and/or detection of design errors. These works include tools for observability-enhanced code coverage and mutation analysis of high-level designs as well as an analytical method, Coverage Discounting, which adds checker sensitivity to arbitrary functional coverage metrics.

S5-3 (Time: 14:50 - 15:15)

Title	(Invited Paper) Automated Data Analysis Techniques for a Modern Silicon Debug Environment
Author	*Yu-Shen Yang (Vennsa Technologies, Canada), Andreas Veneris (Department of ECE and Department of CS, University of Toronto, Canada), Nicola Nicolici (Department of ECE, McMaster University, Canada), Masahiro Fujita (VLSI Design and Education Center, University of Tokyo, Japan)
Page	pp. 298 - 303
Keyword	silicon debug
Abstract	With the growing size of modern designs and stricter time-to-market constraints, design errors unavoidably escape pre-silicon verification and reside in silicon prototypes. As a result, silicon debug has become a necessary step in the digital integrated circuit design flow. Although embedded hardware blocks, such as scan chains and trace buffers, provide a means to acquire data of internal signals in real time for debugging, there is a relative shortage of methodologies to efficiently analyze this vast data to identify root causes. This paper presents an automated software solution that attempts to fill in the gap. The presented techniques automate the configuration process for trace-buffer-based hardware in order to acquire helpful information for debugging the failure, and detect suspects of the failure in both the spatial and temporal domains.

S5-4 (Time: 15:15 - 15:40)

Title	(Invited Paper) Optimizing Test-Generation to the Execution Platform
Author	Amir Nahir, *Avi Ziv (IBM Research, Haifa, Israel), Subrat Panda (IBM Systems and Technology Group, Bangalore, India)
Page	pp. 304 - 309
Keyword	test generation
Abstract	The role of stimuli generators is to reach all the dark corners of the design and expose the bugs hiding there. As such, stimuli generation is one of the cornerstones of dynamic verification. The quality of tools used for stimuli generation affects the outcome of the verification process. This paper discusses how differences between execution platforms, ranging from software simulators, through accelerators and emulators, to silicon, affect the requirements of stimuli generators and how stimuli generators targeting different execution platforms address these differences. We demonstrate how the unique added values of the platforms are combined to guarantee the high quality of the silicon, using examples of several IBM pre- and post-silicon stimuli generators with results from the verification of the IBM POWER7 processor chip.

Session S6 Special Session 6: Design and Architecture of Emerging Non-volatile Memory Technologies
Time: 14:00 - 15:40 Wednesday, February 1, 2012
Location: Room 204B
Chair: Zili Shao (Hong Kong Polytechnic University, Hong Kong)

S6-1 (Time: 14:00 - 14:25)

Title	(Invited Paper) When to Forget: A System-level Perspective on STT-RAMs
Author	*Karthik Swaminathan, Raghav Pisolkar, Cong Xu, Vijaykrishnan Narayanan (Dept. of Computer Science and Engineering, The Pennsylvania State University, U.S.A.)
Page	pp. 311 - 316
Keyword	STT-RAM
Abstract	The benefits of using STT-RAMs as an alternative to SRAMs are being examined in great detail. However, their comparatively higher write latencies and energies continue to be roadblocks for migrating to MRAM-based technology in memory hierarchies. In this paper, we present a novel method by which we demonstrate significant energy reduction in writing to the STT-RAM cell by relaxing its non-volatility property. We exploit this characteristic for optimizing system-level properties such as garbage collection. By categorizing the objects based on their lifetimes it is possible to tune the data retention time of the STT-RAM to minimize the write energy. Our scheme yielded 37% reduction in dynamic energy, 88% reduction in leakage and 85% improvement in the Energy-Delay Product over a corresponding SRAM-based memory structure.

S6-2 (Time: 14:25 - 14:50)

Title	(Invited Paper) Write-Activity-Aware Page Table Management for PCM-based Embedded Systems
Author	Tianzheng Wang, Duo Liu, *Zili Shao (The Hong Kong Polytechnic University, Hong Kong), Chengmo Yang (University of Delaware, U.S.A.)
Page	pp. 317 - 322
Keyword	PCM, Memory Management, Android, Operating Systems, Page Table
Abstract	Due to its low power consumption and high density, phase change memory (PCM) has become a promising main memory alternative to DRAM in embedded systems. PCM, however, has an endurance problem: the number of rewrites to each cell is quite limited compared with DRAM. Therefore, it is fundamental to eliminate unnecessary writes in PCM-based embedded systems. This paper presents a simple yet effective scheme to solve this problem, through redesigning existing software to exploit the write-activity-aware feature provided by the underlying hardware. In particular, we target page table management, a key kernel component residing in the memory management part of the Linux kernel. We present for the first time a write-activity-aware page table management scheme, WAPTM, accomplished through two simple modifications to the page table initialization and the page frame allocation process. The scheme has been implemented in Google Android 2.3 based on the ARM architecture and evaluated with real applications on the Android emulator. The experimental results show that the proposed scheme can significantly reduce write activities to page tables in the new kernel compared with the original Android. We hope this work can serve as a first step towards the design of write-activity-aware operating systems via simple and feasible modifications.

S6-3 (Time: 14:50 - 15:15)

Title	(Invited Paper) Probabilistic Design in Spintronic Memory and Logic Circuit
Author	*Yiran Chen, Yaojun Zhang, Peiyuan Wang (Dept. of ECE, University of Pittsburgh, U.S.A.)
Page	pp. 323 - 328
Keyword	spintronic
Abstract	Spin-transfer torque random access memory (STT-RAM) is a promising candidate for next-generation non-volatile memory technologies. It combines many attractive attributes such as nanosecond access time, high integration density, non-volatility, and good CMOS process compatibility. However, process variation continues to be a critical issue in the designs of STT-RAM and the derived spintronic logic. Besides the process-variation-induced persistent operation error, the non-persistent error that is incurred by the intrinsic thermal fluctuations of Magnetic Tunneling Junction (MTJ) devices significantly influences the spintronic circuit reliability. In this paper, we analyze these two types of STT-RAM operation errors at both single cell and array levels. On top of that, we quantitatively investigate the impacts of these errors on a nonvolatile spintronic flip-flop design. Some possible design techniques to reduce the operation error rate are also discussed. Our experimental results show that a statistical design technique must be adopted in spintronic memory and logic designs to achieve the desired operation reliability. We refer to this technique as “probabilistic design”.

S6-4 (Time: 15:15 - 15:40)

Title	(Invited Paper) Endurance-Aware Circuit Designs of Nonvolatile Logic and Nonvolatile SRAM Using Resistive Memory (Memristor) Device
Author	*Meng-Fan Chang, Ching-Hao Chuang, Min-Ping Chen, Lai-Fu Chen (Department of Electrical Engineering, National Tsing Hua University, Taiwan), Hiroyuki Yamauchi (Department of Information Electronics, Fukuoka Institute of Technology, Japan), Pi-Feng Chiu, Shyh-Shyuan Sheu (Electronics and Optoelectronics Research Laboratories, Industrial Technology Research Institute (ITRI), Taiwan)
Page	pp. 329 - 334
Keyword	memristor
Abstract	The use of low-voltage circuits and power-off mode helps to reduce the power consumption of chips. Non-volatile logic (nvLogic) and nonvolatile SRAM (nvSRAM) enable a chip to preserve its key local states and data, while providing faster power-on/off speeds than those available with conventional two-macro schemes. Resistive memory (memristor) devices feature fast write speed and low write power. Applying memristors to nvLogic and nvSRAMs not only enables chips to achieve low power consumption for store operations, but also to achieve fast power-on/off processes and reliable operation even in the event of sudden power failure. However, current memristor devices suffer from limited endurance, which influences the design of the circuit structure for memristor-based nvLogic and nvSRAM. Moreover, previous nvLogic/nvSRAM circuits cannot achieve low-voltage operation. This paper explores various circuit structures for nvLogic and nvSRAM, taking into account memristor endurance, especially for low-voltage applications.

Session 4B 3D IC Layout
Time: 14:00 - 15:40 Wednesday, February 1, 2012
Location: Room 203
Chairs: Yasuhiro Takashima (University of Kitakyushu, Japan), Yih-Lang Li (National Chiao Tung University, Taiwan)

4B-1 (Time: 14:00 - 14:25)

Title	Block-level 3D IC Design with Through-Silicon-Via Planning
Author	Dae Hyun Kim (Georgia Institute of Technology, U.S.A.), Rasit Onur Topaloglu (GLOBALFOUNDRIES, U.S.A.), *Sung Kyu Lim (Georgia Institute of Technology, U.S.A.)
Page	pp. 335 - 340
Keyword	3D IC, TSV, 3D RST, Floorplanning
Abstract	In this paper, we propose algorithms (finding signal TSV locations, assigning TSVs to whitespace blocks, and manipulating whitespace blocks) for post-floorplanning signal TSV planning in the block-level 3D IC design. Experimental results show that our signal TSV planner outperforms the state-of-the-art TSV-aware 3D floorplanner by 7% to 38% with respect to wirelength. In addition, our multiple TSV insertion algorithm outperforms a single TSV insertion algorithm by 27% to 37%.

4B-2 (Time: 14:25 - 14:50)

Title	Micro-Bump Assignment for 3D ICs using Order Relation
Author	Ta-Yu Kuan, Yi-Chun Chang, *Tai-Chen Chen (National Central University, Taiwan)
Page	pp. 341 - 346
Keyword	3D ICs, Micro Bump, Placement, RDL, Routing
Abstract	The routing quality on RDLs in 3D ICs is seriously affected by the micro-bump locations. In this paper, we propose a micro-bump assignment method using order relation to minimize the crossing problem and reduce the detours in RDLs. Experimental results show that our approach can obtain an assignment result with 100% routability and minimal wirelength in global routing.

4B-3 (Time: 14:50 - 15:15)

Title	Through-Silicon-Via-Induced Obstacle-Aware Clock Tree Synthesis for 3D ICs
Author	Xin Zhao, *Sung Kyu Lim (Georgia Institute of Technology, U.S.A.)
Page	pp. 347 - 352
Keyword	TSV, obstacle avoidance, clock synthesis, 3D ICs
Abstract	In this paper, we present an obstacle-aware clock tree synthesis method for through-silicon-via (TSV)-based 3D ICs. A unique aspect of this problem lies in the fact that various types of TSVs become obstacles during 3D clock routing including signal, power/ground, and clock TSVs. Some of these TSVs become placement obstacles, i.e., they interfere with clock buffers and clock TSVs; while other TSVs become routing obstacles, i.e., clock wires cannot route through them. Thus, the key is to perform TSV-induced obstacle-aware 3D clock routing under the following goals: (1) clock TSVs and clock buffers are located while avoiding overlap with placement obstacles; (2) clock wires are routed while avoiding routing obstacles; and (3) clock skew and slew constraints are satisfied. Related experiments show that our TSV-obstacle-aware clock tree does not sacrifice wirelength or clock power too much while avoiding various TSV-induced obstacles.

4B-4 (Time: 15:15 - 15:40)

Title	Parallel Implementation of R-trees on the GPU
Author	Lijuan Luo (University of Illinois at Urbana-Champaign/NVIDIA Corp., U.S.A.), *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.), Lance Leong (NVIDIA Corp., U.S.A.)
Page	pp. 353 - 358
Keyword	R-tree, GPU, parallel programming
Abstract	The R-tree is an important spatial data structure used in EDA as well as other fields. Although there is a large body of literature on parallel R-tree query, as far as we know, our work is the first to successfully parallelize R-tree query on the GPU. We also propose the first R-tree construction method on the GPU. Unlike the other parallel construction methods, our method does not depend on a partition algorithm and guarantees the same quality as the sequential construction. Experiments show that more than 30x speedup on R-tree query and more than 20x speedup on R-tree construction are achieved.

Session 4C Simulation and Modeling for Signal-Integrity Analysis
Time: 14:00 - 15:40 Wednesday, February 1, 2012
Location: Room 202
Chairs: Rung-Bin Lin (Yuan Ze University, Taiwan), Youngsoo Shin (KAIST, Republic of Korea)

4C-1 (Time: 14:00 - 14:25)

Title	An Adaptive LU Factorization Algorithm for Parallel Circuit Simulation
Author	*Xiaoming Chen, Yu Wang, Huazhong Yang (Tsinghua University, China)
Page	pp. 359 - 364
Keyword	Parallel LU Factorization, Parallel Circuit Simulation
Abstract	The sparse matrix solver has become the bottleneck in SPICE simulators. It is difficult to parallelize the solver because of the high data dependency during the numerical LU factorization. This paper proposes a parallel LU factorization (with partial pivoting) algorithm on shared-memory computers with multi-core CPUs, to accelerate circuit simulation. Since not every matrix is suitable for a parallel algorithm, a predictive method is proposed to decide whether a matrix should use the parallel or the sequential algorithm. The experimental results on 35 circuit matrices reveal that the developed algorithm achieves speedups of 2.11x~8.38x (geometric average), compared with KLU, with 1~8 threads, on the matrices which are suitable for the parallel algorithm. Our solver can be downloaded from http://nicslu.weebly.com.

4C-2 (Time: 14:25 - 14:50)

Title	Predictor-Corrector Latency Insertion Method for Fast Transient Analysis of Ill-Constructed Circuits
Author	*Hiroki Kurobe (Graduate School of Eng., Shizuoka University, Japan), Tadatoshi Sekine (Graduate School of Science and Tech., Shizuoka University, Japan), Hideki Asai (Shizuoka University, Japan)
Page	pp. 365 - 370
Keyword	coupled multiconductor, fast circuit simulation, high-speed interconnect, latency insertion method, predictor-corrector
Abstract	This paper describes a predictor-corrector latency insertion method (LIM) for a fast transient analysis of an ill-constructed circuit. First, the basic LIM algorithm and limitations of the method are described. Next, we propose the predictor-corrector LIM with a large value of fictitious latency for the ill-constructed topologies. Finally, numerical results show that our proposed method is applicable and efficient for the fast simulation of the ill-constructed circuit.

4C-3 (Time: 14:50 - 15:15)

Title	Crosstalk-Aware Statistical Interconnect Delay Calculation
Author	*Qin Tang, Amir Zjajo, Michel Berkelaar, Nick van der Meijs (Delft University of Technology, Netherlands)
Page	pp. 371 - 376
Keyword	crosstalk, interconnect delay, statistical delay calculation, coupling effects, process variations
Abstract	As the device geometries are shrinking, the impact of crosstalk effects increases, which results in a stronger dependence of interconnect delay on the input arrival time difference between victim and aggressor inputs (input skew). The increasing process variations lead to statistical input skew which induces significant interconnect delay variations. Therefore, it is necessary to take input skew variation into account for interconnect delay calculation in the presence of process variations. Existing timing analysis tools evaluate gate and interconnect delays separately. In this paper, we focus on statistical interconnect delay calculation considering crosstalk effects. A piecewise linear delay-change-curve model enables closed-form analytical evaluation of the statistical interconnect delay caused by input skew (SK) variations. This method can handle arbitrarily distributed SK variations. The process-variation (PV)-induced interconnect delay variation is handled in a quadratic delay model which considers coupling effects. The SK- and PV-induced interconnect delay variations are combined together for crosstalk-aware statistical interconnect delay calculation. The experimental results indicate that the proposed method can predict the interconnect delay impacted by both input skew variation and process variations with average (maximum) absolute mean error 0.25% (0.75%) and standard deviation error 1.31% (3.53%) for different types of coupled wires in a 65nm technology.

4C-4 (Time: 15:15 - 15:40)

Title	Fast Floating Random Walk Algorithm for Multi-Dielectric Capacitance Extraction with Numerical Characterization of Green's Functions
Author	Hao Zhuang (Tsinghua University/Peking University, China), *Wenjian Yu, Gang Hu, Zhi Liu, Zuochang Ye (Tsinghua University, China)
Page	pp. 377 - 382
Keyword	floating random walk, capacitance extraction, multiple dielectric, thin dielectric, finite difference method
Abstract	The floating random walk (FRW) algorithm has several advantages for extracting 3D interconnect capacitance. However, for multi-layer dielectrics in VLSI technology, the efficiency of the FRW algorithm is degraded by frequent stops of walks at dielectric interfaces and by the constraint on first-hop length, especially in thin dielectrics. In this paper, we tackle these problems with the numerical characterization of Green's function for cross-interface transition probabilities and weight values. We also present a space management technique with an Octree data structure to reduce the time of each hop, and parallelize the whole FRW by multi-threaded programming. Numerical results show large speedup brought by the proposed techniques for structures under the VLSI technology with thin dielectric layers.

Session S7 Special Session 7: Sensor Node Optimization in Machine-to-Machine (M2M) Networks
Time: 16:10 - 17:50 Wednesday, February 1, 2012
Location: Room 204A
Chairs: Tei-Wei Kuo (National Taiwan University, Taiwan), Yen-Kuang Chen (Intel Corp., U.S.A.)

S7-1 (Time: 16:10 - 16:35)

Title	(Invited Paper) Challenges and Opportunities of Internet of Things
Author	*Yen-Kuang Chen (Intel Corporation, U.S.A.)
Page	pp. 383 - 388
Keyword	internet
Abstract	To date, most Internet applications focus on providing information, interaction, and entertainment for humans. However, with the widespread deployment of networked, intelligent sensor technologies, an Internet of Things (IoT) is steadily evolving, much like the Internet decades ago. In the future, hundreds of billions of smart sensors and devices will interact with one another without human intervention, on a Machine-to-Machine (M2M) basis. They will generate an enormous amount of data at an unprecedented scale and resolution, providing humans with information and control of events and objects even in remote physical environments. The scale of the M2M Internet will be several orders of magnitude larger than the existing Internet, posing serious research challenges. This paper will provide an overview of challenges and opportunities presented by this new paradigm.

S7-2 (Time: 16:35 - 17:00)

Title	(Invited Paper) Application Specific Sensor Node Architecture Optimization --- Experiences from Field Deployments
Author	*Wei Liu, Xiaotian Fei, Tao Tang, Pengjun Wang, Hong Luo, Beixing Deng, Huazhong Yang (Department of Electronic Engineering, Tsinghua University, China)
Page	pp. 389 - 394
Keyword	sensor node architecture
Abstract	The Mote architecture is the most popular platform used in wireless sensor network applications. In this architecture, the microcontroller is responsible for all jobs, such as scheduling, sampling, computing, and communication. In the past year, two practical applications, a bridge structural health monitoring system and a rare animal monitoring system, were developed and deployed in Wuxi and Beijing, China. It was found that the Mote architecture faces many problems in these applications. First, sampling, computing, and communication conflict with each other if they are not carefully scheduled; second, some jobs are very difficult or even impossible to implement in the microcontroller; third, low power, one of the most fundamental design principles in wireless sensor networks, is sometimes violated with all jobs implemented in the microcontroller. Software optimization was attempted to solve these problems. However, the effect is very limited. Application-specific sensor node architecture is necessary for implementing these applications efficiently. In this paper, we propose new application-specific sensor node architectures and corresponding design principles, and then apply them in the field deployments. Experimental and field tests show that these architectures are more efficient than the Mote architecture in these applications.

S7-3 (Time: 17:00 - 17:25)

Title	(Invited Paper) System-Wide Profiling and Optimization with Virtual Machines
Author	*Shih-Hao Hung, Tei-Wei Kuo, Chi-Sheng Shih (Graduate Institute of Networking and Multimedia and Department of Computer Science and Information Engineering, National Taiwan University, Taiwan), Chia-Heng Tu (Graduate Institute of Networking and Multimedia, Taiwan)
Page	pp. 395 - 400
Keyword	profiling, virtual machines
Abstract	Simulation is a common approach for assisting system design and optimization. For system-wide optimization, energy and computational resources are often the two most critical limitations. Modeling the energy states of each hardware component and the time spent in each state is needed for accurate energy and performance prediction. Tracking software execution in a realistic operating environment with properly modeled input/output is key to accurate prediction. However, the conventional approaches can have difficulties in practice. First, for a complex system such as an Android smartphone, building a cycle-accurate simulation environment is no easy task. Secondly, for I/O-intensive applications, a slow simulation would significantly alter the application behavior and change its performance profile. Thirdly, conventional software profiling tools generally do not work on simulators, which makes performance analysis of complicated software difficult, e.g., Java applications executed by the Dalvik virtual machine. Recently, virtual machine technologies have been widely used to emulate a variety of computer systems. While virtual machines do not model the hardware components in the emulated system, we can ease the effort of building a simulation environment by leveraging the infrastructure of virtual machines and adding performance and power models. Moreover, multiple sets of the performance and energy models can be selectively used to verify whether the speed of the simulated system impacts the software behavior. Finally, performance monitoring facilities can be integrated to work with profiling tools. We believe this approach should help overcome the aforementioned difficulties. We have prototyped a framework, and our case studies showed that the information provided by our tools is useful for software optimization and system design for Android smartphones.

S7-4 (Time: 17:25 - 17:50)

Title	(Invited Paper) Power Optimization of Wireless Video Sensor Nodes in M2M Networks
Author	*Shao-Yi Chien, Teng-Yuan Cheng, Chieh-Chuan Chiu, Pei-Kuei Tsung (Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University, Taiwan), Chia-han Lee (Research Center for Information Technology Innovation, Academia Sinica, Taiwan), V. Srinivasa Somayazulu, Yen-Kuang Chen (Intel Corporation, U.S.A.)
Page	pp. 401 - 405
Keyword	M2M network
Abstract	Low-power wireless video sensor nodes play important roles in applications of machine-to-machine (M2M) networks. Several design issues in optimizing the power consumption of a video sensor node are addressed in this paper. For the video coding engine selection, the comparison between a conventional video coding system and a distributed video coding (DVC) system shows that although the rate-distortion performance of existing DVC codecs still has room to improve, DVC can provide lower power consumption with a noisy transmission channel. Furthermore, it is also demonstrated that a video analysis unit can help to filter out video contents without an event of interest to reduce transmission power. Finally, several future research directions are addressed, and the trade-off between the video analysis unit, video coding unit, and data transmission should be further studied to design wireless video sensors with optimized power consumption.

Session 5A Adaptive and Power-Efficient NoC Architectures
Time: 16:10 - 17:25 Wednesday, February 1, 2012
Location: Room 204B
Chairs: Karam Chatha (Arizona State University), Yu Wang (Tsinghua University, China)

5A-1 (Time: 16:10 - 16:35)

Title	A Multi-Vdd Dynamic Variable-Pipeline On-Chip Router for CMPs
Author	*Hiroki Matsutani, Yuto Hirata (Keio University, Japan), Michihiro Koibuchi (National Institute of Informatics, Japan), Kimiyoshi Usami (Shibaura Institute of Technology, Japan), Hiroshi Nakamura (The University of Tokyo, Japan), Hideharu Amano (Keio University, Japan)
Page	pp. 407 - 412
Keyword	Network-on-Chip, Low power, Chip multi-processor, Interconnection network
Abstract	We propose a multi-voltage (multi-Vdd) variable pipeline router to reduce the power consumption of Networks-on-Chip (NoCs) designed for chip multi-processors (CMPs). Our multi-Vdd variable pipeline router adjusts its pipeline depth (i.e., communication latency) and supply voltage level in response to the applied workload. Unlike dynamic voltage and frequency scaling (DVFS) routers, the operating frequency is the same for all routers throughout the CMP; thus, there is no need to synchronize neighboring routers working at different frequencies. In this paper, we implemented the multi-Vdd variable pipeline router, which selects two supply voltage levels and pipeline modes, using a 65nm CMOS process and evaluated it using a full-system CMP simulator. Evaluation results show that although the application performance degraded by 1.0% to 2.1%, the standby power of the NoCs was reduced by 10.4% to 44.4%.

5A-2 (Time: 16:35 - 17:00)

Title	ARB-NET: A Novel Adaptive Monitoring Platform for Stacked Mesh 3D NoC Architectures
Author	*Amir-Mohammad Rahmani, Khalid Latif, Vaddina Kameswar Rao (University of Turku/Turku Centre for Computer Science, Finland), Pasi Liljeberg, Juha Plosila, Hannu Tenhunen (University of Turku, Finland)
Page	pp. 413 - 418
Keyword	3D NoC-Bus Hybrid Architecture, Monitoring Platform, Adaptive Routing Algorithm, 3D ICs
Abstract	The emerging three-dimensional integrated circuits (3D ICs) offer a promising solution to mitigate the barriers of interconnect scaling in modern systems. In order to exploit the intrinsic capability of reducing the wire length in 3D ICs, the 3D NoC-Bus Hybrid mesh architecture was proposed. Besides its various advantages in terms of area, power consumption, and performance, this architecture offers a unique and hitherto unexplored way to implement an efficient system-wide monitoring network. In this paper, an integrated low-cost monitoring platform for 3D stacked mesh architectures is proposed which can be efficiently used for various system management purposes. The proposed generic monitoring platform, called ARB-NET, utilizes bus arbiters to exchange the monitoring information directly with each other without using the data network. As a test case, based on the proposed monitoring platform, a fully congestion-aware adaptive routing algorithm named AdaptiveXYZ is presented, taking advantage of viable information generated within bus arbiters. Our extensive simulations with synthetic and real benchmarks reveal that our architecture using the AdaptiveXYZ routing can help achieve significant power and performance improvements compared to recently proposed stacked mesh 3D NoCs.

5A-3 (Time: 17:00 - 17:25)

Title	Memory-Aware Mapping and Scheduling of Tasks and Communications on Many-Core SoC
Author	*Jinho Lee, Kiyoung Choi (Seoul National University, Republic of Korea)
Page	pp. 419 - 424
Keyword	network-on-chip(NoC), mapping, scheduling, QEA, communication type
Abstract	This paper presents an approach to automatic task mapping, scheduling, and communication routing on a many-core SoC, considering the trade-offs between two different communication types - message passing and shared memory - for the communication routing in order to optimize the energy consumption or performance. To solve the optimization problem, the approach uses the quantum-inspired evolutionary algorithm. For the scheduling of the tasks with backward dependencies, it uses the iterative modulo scheduling technique. Experiments with random task graphs as well as a set of real applications show the effectiveness of the proposed approach.

Session 5B Physical Optimization for Power and Timing
Time: 16:10 - 17:50 Wednesday, February 1, 2012
Location: Room 203
Chairs: Sheqin Dong (Tsinghua University, China), Shigetoshi Nakatake (The University of Kitakyushu, Japan)

5B-1 (Time: 16:10 - 16:35)

Title	A Fast Thermal Aware Placement with Accurate Thermal Analysis Based on Green Function
Author	Suradeth Aroonsantidecha, *Shih-Ying Liu, Ching-Yu Chin, Hung-Ming Chen (National Chiao Tung University, Taiwan)
Page	pp. 425 - 430
Keyword	placement, thermal, Green Function, analytical
Abstract	In this paper, we propose a fast and accurate thermal-aware analytical placer. A thermal model is constructed based on the Green function with enhanced DCT to generate a full-chip temperature profile. Unlike previous thermal-aware placers, our thermal model is tightly integrated with a flat force-directed placement. A thermal spreading force based on a 2D Gaussian model is proposed to reduce the maximum on-chip temperature with dynamic hot region size control, optimizing between total half-perimeter wirelength (HPWL) and on-chip temperature distribution.
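
As a rough illustration of the HPWL metric named in the abstract above, the following sketch computes the half-perimeter wirelength of a few nets. The function names and pin coordinates are hypothetical and are not taken from the paper, and the placer's thermal model is not reproduced here.

```python
# Minimal sketch: half-perimeter wirelength (HPWL) of a placement.
# Each net is a list of (x, y) pin coordinates; all values are made up.

def hpwl(net):
    """Half perimeter of the bounding box enclosing all pins of one net."""
    xs = [x for x, _ in net]
    ys = [y for _, y in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_hpwl(nets):
    """Sum of HPWL over all nets, the usual wirelength objective."""
    return sum(hpwl(net) for net in nets)

# Hypothetical example: a two-pin net and a three-pin net.
example_nets = [
    [(0, 0), (3, 4)],
    [(1, 1), (2, 5), (4, 2)],
]
example_total = total_hpwl(example_nets)
```

A thermal-aware placer of the kind described would trade this wirelength objective against a temperature objective.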

5B-2 (Time: 16:35 - 17:00)

Title	Crosstalk-Aware Power Optimization with Multi-Bit Flip-Flops
Author	*Chih-Cheng Hsu, Yao-Tsung Chang, Mark Po-Hung Lin (National Chung Cheng University, Taiwan)
Page	pp. 431 - 436
Keyword	power optimization, crosstalk, synthesis for low power, physical design, multi-bit flip-flop
Abstract	Applying multi-bit flip-flops (MBFFs) for clock power reduction in modern nanometer ICs has become a promising low-power design technique. Many previous works tried to utilize as many MBFFs with larger bit numbers as possible to gain more clock power saving. However, an MBFF with a larger bit number may lead to serious crosstalk due to the close interconnecting wires belonging to different signal nets which are connected to the same MBFF. To address the problem, this paper analyzes, evaluates, and compares the relationship between power consumption and crosstalk when applying MBFFs with different bit numbers. A novel crosstalk-aware power optimization approach is further proposed to optimize power consumption while satisfying the crosstalk constraint. Experimental results show that the proposed approach is very effective in crosstalk avoidance when applying MBFFs for power optimization. To the best of our knowledge, this is also the first work in the literature that considers the crosstalk effect for the MBFF application.

5B-3 (Time: 17:00 - 17:25)

Title	Topology-Aware Buffer Insertion and GPU-Based Massively Parallel Rerouting for ECO Timing Optimization
Author	*Yen-Hung Lin, Yun-Jian Lo, Jian-Syun Tong, Wen-Hao Liu, Yih-Lang Li (National Chiao Tung University, Taiwan)
Page	pp. 437 - 442
Keyword	Topology, Timing ECO, Routing, Parallel EDA, GPU
Abstract	Conventional buffer insertion in timing ECO involves only minimizing the arrival time of the most critical sink in one multi-pin net and neglects the obstacles and the topology of routed wire segments, which may worsen the arrival times of other sinks and burden subsequent timing ECO. This work develops a topology-aware ECO timing optimization (TOPO) flow that comprises three phases - buffering pair scoring, edge breaking and buffer connection, and topology restructuring. TOPO effectively improves the arrival times of violation sinks without worsening those of other sinks. Experimental results indicate that TOPO improves the worst negative slack (WNS) and total negative slack (TNS) of benchmarks by an average of 79.2% and 84.3%, respectively. The proposed algorithm improves the arrival time that is achieved using conventional two-pin net-based buffer insertion by an average of 40.4%, at the cost of consuming 19× runtime. To speed up routing and further improve sink slack, a highly scalable massively parallel maze routing on Graphics Processing Unit (GPU) platform is also developed to enable the proposed flow to explore more solution candidates. High scalability and parallelism are realized by block partitioning and staggering. Experiments reveal that the proposed GPU-based parallel maze routing can achieve near 12× runtime speedup for two-pin routings. With parallelized maze routing, WNS violations in four out of five cases can be resolved.

5B-4 (Time: 17:25 - 17:50)

Title	Voltage Island-Driven Floorplanning Considering Level Shifter Placement
Author	Richard C.J. Hsu, Wei-Yi Cheng, Chung-Lin Lee, *Jai-Ming Lin (National Cheng Kung University, Taiwan)
Page	pp. 443 - 448
Keyword	multiple-supply voltage (MSV), level-shifter, floorplanning/placement, Low power, physical design
Abstract	Low power has become a burning issue in modern VLSI design. To deal with this problem, multiple-supply voltage (MSV) is a technique widely applied to a design to reduce its power consumption. However, there exist several challenges in implementing multi-voltage designs, which include floorplanning, level-shifter placement, and power planning. Among these challenges, placement of level shifters has direct impacts on the chip area, total wirelength, and power planning. Although several works considering MSV-driven floorplanning have been proposed, they do not actually place level shifters in their flows, which makes their results unrealistic. Yu et al. first proposed a methodology to place level shifters during floorplanning. However, the level shifters are inserted in the whitespace of a chip, which would increase the wirelength of long wires and make power planning more difficult. Thus, in this paper, we first propose two ways to allocate regions for level shifters during floorplanning, and then give a two-stage approach to place these level shifters at proper locations. The experimental results reveal that the wirelength is underestimated if we do place level shifters, and that a smaller wirelength can be obtained if level shifters are considered during floorplanning.

Session 5C Parallelizing System-Level Simulation
Time: 16:10 - 17:25 Wednesday, February 1, 2012
Location: Room 202
Chairs: Chia-Lin Yang (National Taiwan University, Taiwan), Derek Chiou (University of Texas at Austin, U.S.A.)

5C-1 (Time: 16:10 - 16:35)

Title	Relaxed Synchronization Technique for Speeding-up the Parallel Simulation of Multiprocessor Systems
Author	Dukyoung Yun (Seoul National University, Republic of Korea), Sungchan Kim (Chonbuk National University, Republic of Korea), *Soonhoi Ha (Seoul National University, Republic of Korea)
Page	pp. 449 - 454
Keyword	multiprocessor, parallel simulation, time synchronization, simulation cache, relaxed memory model
Abstract	For design verification of an MPSoC, a virtual prototyping system has been widely used as a cheap and fast method without a hardware prototype. It usually consists of component simulators working together in a single simulation host. As the number of component simulators increases, the simulation performance degrades significantly due to frequent inter-simulator communication. In this paper, to boost the simulation speed further, we propose a novel technique, called relaxed synchronization, which uses a simulation cache at each component simulator for simulation purposes. Like an architectural cache that reduces the main memory access frequency, a simulation cache effectively reduces the count of synchronous communication between the corresponding component simulator and the simulation backplane. When a read or write request to a shared memory is made, a cache line, not a single element, is transferred to utilize the spatial and temporal locality for simulation. The proposed technique is based on an assumption that the application program uses a relaxed memory model. Through experiments with real-life applications, it is shown that the proposed approach improves the simulation performance by up to 330%.

5C-2 (Time: 16:35 - 17:00)

Title	Parallel Simulation of Mixed-abstraction SystemC Models on GPUs and Multicore CPUs
Author	*Rohit Sinha, Aayush Prakash, Hiren D. Patel (University of Waterloo, Canada)
Page	pp. 455 - 460
Keyword	GPU, SystemC, Parallel Simulation
Abstract	This work presents a methodology that parallelizes the simulation of mixed-abstraction level SystemC models across multicore CPUs and graphics processing units (GPUs) for improved simulation performance. Given a SystemC model, we partition it into processes suitable for GPU execution and CPU execution. We convert the processes identified for GPU execution into GPU kernels with additional SystemC wrapper processes that invoke these kernels. The wrappers enable seamless communication of events in all directions between the GPUs and CPUs. We alter the OSCI SystemC simulation kernel to allow parallel execution of processes. Hence, we co-simulate in parallel the SystemC processes on multiple CPUs and the GPU kernels on the GPUs, exploiting both the CPUs and GPUs for faster simulation. We experiment with synthetic benchmarks and a set-top box case study.

5C-3 (Time: 17:00 - 17:25)

Title	An Optimizing Compiler for Out-of-Order Parallel ESL Simulation Exploiting Instance Isolation
Author	*Weiwei Chen, Rainer Doemer (Center for Embedded Computer Systems, University of California, Irvine, U.S.A.)
Page	pp. 461 - 466
Keyword	Parallel Discrete Event Simulation, system-level description languages, Optimizing compiler
Abstract	Electronic system-level (ESL) design relies on fast discrete event (DE) simulation for the validation of design models written in system-level description languages (SLDLs). An advanced technique to speed up ESL validation is out-of-order parallel DE simulation, which allows multiple threads to run early and in parallel on multi-core hosts. To avoid data hazards and ensure timing accuracy, this technique requires the compiler to statically analyze the design model for potential data access conflicts. In this paper, we propose a compiler optimization that improves the data conflict analysis by exploiting instance isolation. The reduction in the number of conflicts increases the available parallelism and results in significantly reduced simulation time. Our experimental results show up to 90% gain in simulation speed for less than 6% increase in compilation time.

Thursday, February 2, 2012

Session D1 University LSI Design Contest 1
Time: 8:30 - 10:10 Thursday, February 2, 2012
Location: Room 204A

D1-1 (Time: 8:30 - 8:44)

Title	A 60-GHz 16QAM 11Gbps Direct-Conversion Transceiver in 65nm CMOS
Author	*Ryo Minami, Hiroki Asada, Ahmed Musa, Takahiro Sato, Ning Li, Tatsuya Yamaguchi, Yasuaki Takeuchi, Win Chiavipas, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page	pp. 467 - 468
Keyword	CMOS, Direct-Conversion, Transceiver, 60GHz
Abstract	This paper presents a 60-GHz direct-conversion transceiver using 60-GHz quadrature oscillators. The 65nm CMOS transceiver realizes the IEEE802.15.3c full-rate wireless communication for every 16QAM/8PSK/QPSK/BPSK mode. The maximum data rates with an antenna built in a package are 8Gbps in QPSK mode and 11Gbps in 16QAM mode within a BER of < 10^-3. The transceiver consumes 186mW while transmitting, and 106mW while receiving. The PLL also consumes 66mW.

D1-2 (Time: 8:44 - 8:58)

Title	A 120-mV Input, Fully Integrated Dual-Mode Charge Pump in 65-nm CMOS for Thermoelectric Energy Harvester
Author	*Po-Hung Chen, Koichi Ishida, Xin Zhang (The University of Tokyo, Japan), Yasuyuki Okuma, Yoshikatsu Ryu (Semiconductor Technology Academic Research Center, Japan), Makoto Takamiya, Takayasu Sakurai (The University of Tokyo, Japan)
Page	pp. 469 - 470
Keyword	Charge pump, Startup, Low voltage, Dual-mode
Abstract	In this paper, a fully integrated low voltage charge pump for thermoelectric energy harvesters is presented. The proposed dual-mode architecture achieves both the low startup voltage in a startup mode and high conversion efficiency in a normal operation mode without off-chip inductors and capacitors. In the measurement, the proposed circuit successfully converts 120mV input to 770mV output with 38.8% conversion efficiency.

D1-3 (Time: 8:58 - 9:12)

Title	CMA-2 : The Second Prototype of a Low Power Reconfigurable Accelerator
Author	*Mai Izawa, Nobuaki Ozaki, Yoshihiro Yasuda, Masayuki Kimura, Hideharu Amano (Keio University, Japan)
Page	pp. 471 - 472
Keyword	Reconfigurable System, Low Power Design, Real Chip Evaluation
Abstract	Cool Mega-Array (CMA) is a high energy-efficiency reconfigurable accelerator for battery-driven mobile devices. It consists of a large processing element (PE) array without memory elements for mapping the data-flow graph of the application being executed, a small simple programmable μ-controller for data management, and a data memory. A prototype CMA chip (CMA-1) with 8×8 PE array was implemented with 65nm CMOS technology. CMA-1 has several limitations for testing as a real accelerator attached to the host CPU. In order to relax the limitation of CMA-1, the second prototype chip (CMA-2) with 10×8 PE array was implemented with 40nm CMOS process. Evaluation result with real chip shows that the maximum energy efficiency is 233.7MOPS/mW.

D1-4 (Time: 9:12 - 9:26)

Title	Complexity-Effective Hilbert-Huang Transform (HHT) IP for Embedded Real-Time Applications
Author	Shyang-Chyun Chen, Chao-Chuan Chen, Wen-Chi Guo, *Tay-Jyi Lin, Ching-Wei Yeh (National Chung Cheng University, Taiwan)
Page	pp. 473 - 474
Keyword	HHT, EMD, multirate signal processing
Abstract	This paper presents a complexity-effective HHT IP for embedded real-time applications. The proposed HHT improves the original empirical mode decomposition (EMD) to reduce the interferences between signal components with filtering, similar to that in the wavelet transform. The IMF and residue signals are compacted to reduce computation and storage. Multirate Hilbert spectral analysis (HSA) is performed to further reduce computations. A prototype of an embedded HHT analyzer has been built to demonstrate the effectiveness.

D1-5 (Time: 9:26 - 9:40)

Title	Implementation of a Perpendicular MTJ-Based Read-Disturb-Tolerant 2T-2R Nonvolatile TCAM Based on a Reversed Current Reading Scheme
Author	*Shoun Matsunaga, Masanori Natsui, Shoji Ikeda (Center for Spintronics Integrated Systems, Tohoku University, Japan), Katsuya Miura (Hitachi Advanced Research Laboratory, Japan), Tetsuo Endoh, Hideo Ohno, Takahiro Hanyu (Center for Spintronics Integrated Systems, Tohoku University, Japan)
Page	pp. 475 - 476
Keyword	TCAM, nonvolatile, MTJ, power gating, standby power
Abstract	A perpendicular magnetic-tunnel-junction (MTJ)-based 2T-2R ternary content-addressable memory (TCAM) cell is proposed for a high-density nonvolatile word-parallel/bit-serial TCAM. The use of MOS/MTJ-hybrid logic makes it possible to implement a compact nonvolatile TCAM cell with 2.5 um² of a cell size in a 0.14-um CMOS and a 100-nm perpendicular-MTJ technologies. By reversed-current reading through the perpendicular MTJ device, tolerability of read disturb is greatly enhanced. Moreover, fine-grained power gating based on bit-level equality-search scheme achieves ultra-low activity rate of 4.1 % in a fabricated 72-bit x 128-word nonvolatile TCAM, which results in ultra-low active power and standby power.

D1-6 (Time: 9:40 - 9:54)

Title	Energy-Efficient RISC Design with On-Demand Circuit-Level Timing Speculation
Author	*Tay-Jyi Lin (National Chung Cheng University, Taiwan), Yu-Ting Kuo (Industrial Technology Research Institute, Taiwan), Yu-Jung Tsai, Ting-Yu Shyu (National Chung Cheng University, Taiwan), Yuan-Hua Chu (Industrial Technology Research Institute, Taiwan)
Page	pp. 477 - 478
Keyword	circuit-level timing speculation, energy-efficient, low-power
Abstract	This paper presents an energy-efficient RISC design with a novel on-demand timing speculation mechanism, which is implemented with dual timing-relaxed datapaths. The proposed approach significantly reduces the design complexity and the overheads of existing double latching approaches, such as Razor. The design has been implemented and fabricated using the TSMC 65GP technology. Its supply voltage can be lowered to 0.6V for 300MHz operations with only 5.35% timing faults, all of which can be rescued with our proposed mechanism at the cost of some extra execution cycles.

D1-7 (Time: 9:54 - 10:08)

Title	A 60mW Baseband SoC for CMMB Receiver
Author	*Chuan Wu, Jialin Cao, Dan Bao, Yun Chen, Xiaoyang Zeng (Fudan University, China)
Page	pp. 479 - 480
Keyword	Baseband Processor, SoC, CMMB
Abstract	This paper describes a baseband SoC implementation of a China Mobile Multimedia Broadcasting (CMMB) receiver, which integrates an analog-to-digital converter (ADC), a physical layer (PHY) baseband processor, and a medium access control (MAC) processor in a single silicon wafer. MAC functions are fully implemented by firmware on an embedded 32-bit RISC-based processor. In addition, several power management techniques are utilized to reduce the power consumption of the baseband SoC. The baseband SoC was successfully fabricated in a 0.13µm one-poly six-metal (1P6M) CMOS process. Both analog and digital circuits are integrated on a 4.8×4.8 mm² die consuming 60mW total power under 1.2V and 3.3V supplies. The experimental results reveal that the proposed baseband SoC has excellent performance under multipath channels.

Session 6A Efficient Methods for Resource Utilization in Multi-Core NoC Designs
Time: 8:30 - 10:10 Thursday, February 2, 2012
Location: Room 204B
Chairs: Jiang Xu (The Hong Kong University of Science & Technology, Hong Kong), David Atienza (EPFL, Switzerland)

6A-1 (Time: 8:30 - 8:55)

Title	Proximity-Aware Cache Replication
Author	Chongmin Li, Dongsheng Wang, *Haixia Wang, Yibo Xue (Department of Computer Science & Technology, Tsinghua University, China), Jian Li (IBM Research in Austin, U.S.A.)
Page	pp. 481 - 486
Keyword	Chip multiprocessor, Cache replication, Proximity
Abstract	We propose Proximity-Aware cache Replication (PAR), an LLC replication technique that elegantly integrates an intelligent cache replication placement mechanism and a hierarchical directory-based coherence protocol into one cost-effective and scalable design. PAR dynamically allocates replicas of either shared or private data to a few predefined and fixed locations that are calculated at chip design time. Therefore, PAR fits well to future many-core CMPs thanks to its scalable on-chip storage and coherence design.

6A-2 (Time: 8:55 - 9:20)

Title	Dynamic Reusability-based Replication with Network Address Mapping in CMPs
Author	Jinglei Wang, Dongsheng Wang, *Haixia Wang, Yibo Xue (Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, China)
Page	pp. 487 - 492
Keyword	Chip Multiprocessor, Shared Cache, Replication, Network on Chip
Abstract	In a Chip MultiProcessor (CMP) with shared caches, the last level cache is distributed across all the cores. This increases the on-chip communication delay and thus influences the processor's performance. Replication can be provided in shared caches to reduce the on-chip communication delay. However, current proposals do not take into account the access characteristics of replicated blocks or how to make the best of replicas, which limits their performance benefit. In this paper, we observe that the reusability of cache blocks severely influences the availability of a replication scheme. Based on this observation, we propose Dynamic Reusability-based Replication (DRR), a novel cache design that exploits efficient replica management using the blocks' reuse pattern. DRR monitors the access pattern of recently referenced cache blocks and replicates the blocks with high reusability to appropriate L2 slices, and the replicated copies can be shared by their nearby cores. We evaluate DRR for a 16-core system using the SPLASH-2 and PARSEC benchmarks. DRR improves performance by 30% on average over a conventional shared cache design, 16% over Victim Replication (VR), 8% over Adaptive Selected Replication (ASR), and 25% over R-NUCA.

6A-3 (Time: 9:20 - 9:45)

Title	Hungarian Algorithm Based Virtualization to Maintain Application Timing Similarity for Defect-Tolerant NoC
Author	Ke Yue, Frank Lockom, Zheng Li, Soumia Ghalim, *Shangping Ren (IIT, U.S.A.), Lei Zhang, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Page	pp. 493 - 498
Keyword	NoC, timing similarity, Hungarian method, defect-redundant
Abstract	Homogeneous manycore processors are emerging in broad application areas, including those with timing requirements, such as real-time and embedded applications. Typically, these processors employ Network-on-Chip (NoC) as the communication infrastructure and core-level redundancy is often used as an effective approach to improve the yield of manycore chips. For a given application's task graph and a task to core mapping strategy, the traffic pattern on the NoC is known a priori. However, when defective cores are replaced by redundant ones, the NoC topology changes. As a result, a fine-tuned program based on timing parameters given by one topology may not meet the expected timing behavior under the new one. To address this issue, a timing similarity metric is introduced to evaluate timing resemblances between different NoC topologies. Based on this metric, a Hungarian method based algorithm is developed to reconfigure a defect-tolerant manycore platform and form a unified application specific virtual core topology of which the timing variations caused by such reconfiguration are minimized. Our case studies indicate that the proposed metric is able to accurately measure the timing differences between different NoC topologies. The standard deviation between the calculated difference using the metric and the difference obtained through simulation is less than 6.58%. Our case studies also indicate that the developed Hungarian method based algorithm using the metric performs close to the optimal solution in comparison to random defect-redundant core assignments.
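
To make the assignment step in the abstract above more concrete, here is a minimal sketch of the textbook Hungarian method (the O(n^3) potentials formulation) applied to a small cost matrix. It is not the paper's algorithm: the reading of rows as virtual cores and columns as physical cores, the cost values, and the function name are assumptions made purely for illustration.

```python
# Minimal sketch of the Hungarian method for a min-cost assignment.
# cost[i][j] is a hypothetical "timing dissimilarity" incurred when
# virtual core i is mapped onto physical core j (rows <= columns).

def hungarian(cost):
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials (1-indexed)
    v = [0] * (m + 1)      # column potentials (1-indexed)
    p = [0] * (m + 1)      # p[j] = row currently matched to column j
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:   # reached a free column: augment
                break
        while j0 != 0:
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total

# Hypothetical 3x3 dissimilarity matrix.
example_cost = [[4, 1, 3],
                [2, 0, 5],
                [3, 2, 2]]
example_assignment, example_total = hungarian(example_cost)
```

In this toy instance the cheapest mapping sends virtual cores 0, 1, 2 to physical cores 1, 0, 2 with a total cost of 5; how the paper derives its actual costs from the timing similarity metric is not shown here.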

6A-4 (Time: 9:45 - 10:10)

Title	Using Link-level Latency Analysis for Path Selection for Real-time Communication on NoCs
Author	*Hany Kashif, Hiren D. Patel, Sebastian Fischmeister (University of Waterloo, Canada)
Page	pp. 499 - 504
Keyword	NoCs, path selection, real-time
Abstract	We present a path selection algorithm that is used when deploying hard real-time traffic flows onto a chip-multiprocessor system. This chip-multiprocessor system uses a priority-based real-time network-on-chip interconnect between the multiple processors. The problem we address is the following: given a mapping of the tasks onto a chip-multiprocessor system, we need to determine the paths that the traffic flows take such that the flows meet their deadlines. Furthermore, we must ensure that the deadline is met even in the presence of direct and indirect interference from other flows sharing network links on the path. To achieve this, our algorithm utilizes a link-level analysis to determine the impact of a link being used by a flow, and its effect on other flows sharing the link. Our experimental results show that we can improve schedulability by about 8% and 15% over Minimum Interference Routing and Widest Shortest Path algorithms, respectively.

Session 6B Circuit-Level Timing Optimization
Time: 8:30 - 10:10 Thursday, February 2, 2012
Location: Room 203
Chairs: Sachin Sapatnekar (University of Minnesota, U.S.A.), Iris Hui-Ru Jiang (National Chiao Tung University, Taiwan)

6B-1 (Time: 8:30 - 8:55)

Title	A Semi-Formal Min-Cost Buffer Insertion Technique Considering Multi-Mode Multi-Corner Timing Constraints
Author	*Shih Heng Tsai, Man Yu Li, Chung Yang Huang (GIEE, National Taiwan University, Taiwan)
Page	pp. 505 - 510
Keyword	Buffer insertion, optimization
Abstract	Buffer Insertion has always been the most effective approach for timing optimization in VLSI designs. However, the emerging low-power design paradigm and the consideration of multiple operation modes and process corners (MMMC) have raised great challenges. Traditional dynamic-programming-based techniques are unable to cope with these challenges. In this paper, we develop a novel buffer insertion algorithm that utilizes a neighborhood restriction to simplify the constraint formulation and applies a semi-formal buffer refinement process to minimize buffer cost. The experimental results show that our tool can significantly reduce the buffer cost while meeting the MMMC timing constraints.

6B-2 (Time: 8:55 - 9:20)

Title	ECO Timing Optimization with Negotiation-Based Re-Routing and Logic Re-Structuring Using Spare Cells
Author	Xing Wei, *Wai-Chung Tang, Yi Diao, Yu-Liang Wu (Chinese University of Hong Kong, Hong Kong)
Page	pp. 511 - 516
Keyword	ECO, Timing Optimization, Negotiation-based Routing, Logic Rewiring
Abstract	To maintain a lower re-masking cost, Engineering Change Order (ECO) using pre-placed spare cells for buffer insertion and gate sizing has been shown to be practical for fixing timing violating paths (ECO paths). However, in the previously known best scheme, DCP, re-routings are done with each path optimized according to its surrounding available spare cells without considering potential exchanges with neighboring active cells, and spare cell arbitration between competing ECO paths is less addressed. Besides, the extra flexibility of allowing logic restructuring was not exploited. In this work, we develop a framework harnessing the following more flexible strategies to make the usage of spare cells for ECO timing optimization more powerful: (1) a negotiation-based re-routing scheme yielding a more global view in solving resource competition arbitration; (2) an extended gate sizing operation to allow exchanges of active gates with spare gates of different function types through equivalent logic re-structuring. Our experiments on MCNC and ITC benchmarks with highly injected timing violations show that, compared to DCP, our newly proposed framework can cut down the average total negative slack (TNS) by 50% and reduce the number of unsolved ECO paths by 31%.

6B-3 (Time: 9:20 - 9:45)

Title	Clock Rescheduling for Timing Engineering Change Orders
Author	*Kuan-Hsien Ho, Xin-Wei Shih, Jie-Hong R. Jiang (National Taiwan University, Taiwan)
Page	pp. 517 - 522
Keyword	Timing ECO, Clock Rescheduling, Spare Cells, Gate Sizing, Buffer Insertion
Abstract	With increasing circuit complexities, design bugs are commonly found in late design stages, and thus engineering change orders (ECOs) have become an indispensable process in modern designs. Most prior approaches to the timing ECO problem are concerned about combinational logic optimization. In contrast, this paper addresses the problem in the sequential domain to explore more optimization flexibility. Experimental results based on five industrial designs show the effectiveness of our work. Our framework has been integrated into a commercial design flow.

6B-4 (Time: 9:45 - 10:10)

Title	Optimal Prescribed-Domain Clock Skew Scheduling
Author	Li Li, Yinghai Lu, *Hai Zhou (Northwestern University, U.S.A.)
Page	pp. 523 - 527
Keyword	prescribed-domain, clock skew scheduling, optimal
Abstract	Clock skew scheduling is an efficient technique to minimize the cycle period by properly assigning clock delays to registers in a circuit. But its effectiveness is limited by the difficulty in implementing a large number of arbitrary clock skews. Multi-domain clock skew scheduling and prescribed-domain clock skew scheduling are two alternatives to overcome this shortcoming by restricting the number of clock domains. While multi-domain clock skew scheduling has been proved to be NP-hard, the hardness of the prescribed-domain clock skew scheduling problem remains elusive. In this paper, we give a positive answer to the open question by presenting the first efficient and optimal algorithm for prescribed-domain clock skew scheduling. Besides the runtime improvement over the previous method, the experimental results on ISCAS89 benchmarks show comparable quality to those generated by optimal multi-domain clock skew scheduling.
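
As background for the abstract above, the following is a minimal sketch of the classic, unrestricted clock skew scheduling problem; it is not the prescribed-domain algorithm of the paper. It assumes zero setup and hold times, so each register-to-register path (i, j) with maximum delay D and minimum delay d gives the difference constraints s_i - s_j <= T - D (setup) and s_j - s_i <= d (hold). A period T is feasible when the constraint graph has no negative cycle, which is checked here with Bellman-Ford. The function names and the two-register example are hypothetical.

```python
def feasible(num_regs, paths, period):
    """Return True if some skew assignment meets all constraints at this period.

    paths: list of (i, j, d_max, d_min) for a combinational path from
    register i to register j. Setup and hold times are assumed to be zero.
    """
    edges = []
    for i, j, d_max, d_min in paths:
        edges.append((j, i, period - d_max))  # setup: s_i - s_j <= T - d_max
        edges.append((i, j, d_min))           # hold:  s_j - s_i <= d_min
    # Bellman-Ford from an implicit virtual source (all distances start at 0).
    dist = [0.0] * num_regs
    for _ in range(num_regs):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True   # converged: no negative cycle
    return False          # still relaxing after num_regs passes: negative cycle


def min_period(num_regs, paths, tol=1e-6):
    """Binary-search the smallest feasible clock period."""
    lo, hi = 0.0, max(p[2] for p in paths)  # zero skew is feasible at max delay
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(num_regs, paths, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Hypothetical two-register loop: 0 -> 1 (max 8, min 2), 1 -> 0 (max 4, min 1).
example_paths = [(0, 1, 8.0, 2.0), (1, 0, 4.0, 1.0)]
best_period = min_period(2, example_paths)
```

In this toy loop the zero-skew period is 8 (the longest path), while skewing the two registers lets the period approach the loop average of 6. Restricting the skews to a small set of prescribed domains, as the paper does, adds constraints that this sketch does not model.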

Session 6C Modeling and Simulation for Nanoscale Analog Circuits
Time: 8:30 - 10:10 Thursday, February 2, 2012
Location: Room 202
Chairs: Ngai Wong (University of Hong Kong, Hong Kong), Hao Yu (Nanyang Technological University, Singapore)

6C-1 (Time: 8:30 - 8:55)

Title	Fast Simulation of Hybrid CMOS and STT-MTJ Circuits with Identified Internal State Variables
Author	Yang Shang, Wei Fei, *Hao Yu (Nanyang Technological University, Singapore)
Page	pp. 529 - 534
Keyword	STT-MTJ, Internal State Variable, Fast simulation
Abstract	Hybrid integration of CMOS and non-volatile memory (NVM) devices has become the technology foundation for emerging non-volatile memory based computing. The primary challenge in validating a hybrid system with both CMOS and non-volatile devices is to develop a SPICE-like simulator that can simulate the dynamic behavior of the hybrid system accurately and efficiently. Since the spin-transfer-torque magnetic-tunneling-junction (STT-MTJ) device is one of the most promising candidates for next-generation NVM devices, there is great interest in including this new device in the standard CMOS design flow. Previous approaches require complex equivalent circuits to represent the STT-MTJ device, and ignore dynamic effects because they do not consider internal states. This paper proposes a new modified nodal analysis for the STT-MTJ device with identified internal state variables. As demonstrated by a number of experimental examples on hybrid systems with both CMOS and STT-MTJ devices, our newly developed SPICE-like simulator can handle the dynamic behavior of the STT-MTJ device under arbitrary driving conditions and reduce the CPU time by more than 20 times for memory circuits when compared to the previous equivalent circuit approaches.

6C-2 (Time: 8:55 - 9:20)

Title	Time-Domain Performance Bound Analysis of Analog Circuits Considering Process Variations
Author	Xue-Xin Liu, *Sheldon X.-D. Tan, Zhigang Hao (University of California, Riverside, U.S.A.), Guoyong Shi (Shanghai Jiao Tong University, China)
Page	pp. 535 - 540
Keyword	performance bound, time domain, interval, process variation
Abstract	In this paper, we propose a new time-domain performance bound analysis method for analog circuits with process variations. The proposed method, called TIDBA, consists of several steps to compute the performance bounds in the time domain. First, the performance bound in the frequency domain is computed for a linearized analog circuit by a variational symbolic analysis method and Kharitonov's functions. Then the time-domain performance bound is computed via a new general-signal transient bound analysis using FFT/IFFT. The new algorithm reliably gives correct lower and upper bounds on the performance variations of analog circuits. Experimental results from two industry benchmark circuits show that TIDBA gives the correct bounds for the Monte Carlo analysis while it delivers one order of magnitude speedup over the Monte Carlo method.

6C-3 (Time: 9:20 - 9:45)

Title	Hierarchical Graph Reduction Approach to Symbolic Circuit Analysis with Data Sharing and Cancellation-Free Properties
Author	Yang Song, *Guoyong Shi (Shanghai Jiao Tong University, China)
Page	pp. 541 - 546
Keyword	analog IC, BDD, cancellation-free, graph reduction, hierarchical analysis
Abstract	In contrast to algebraic methods, graphical circuit analysis methods have the advantage of being cancellation-free. This paper proposes a graph reduction method for hierarchical symbolic circuit analysis by applying a binary decision diagram (BDD) for data sharing. This method is extended from the Graph-Pair Decision Diagram (GPDD) method, which was developed for two-port dependent sources. New graph construction rules for multiple-port dependent sources are introduced, with which large analog circuits can be analyzed hierarchically. The new hierarchical method guarantees the cancellation-free property at each layer of the hierarchy. The BDD-based hierarchical analysis method can greatly reduce the analysis complexity of the entire circuit, while the software construction and circuit partition remain easy. The new method is compared to the algebraic hierarchical method based on DDD (Determinant Decision Diagram), which does not have the cancellation-free property. Comparable performance can be achieved with the new method, which has the extra cancellation-free property.

6C-4 (Time: 9:45 - 10:10)

Title	Weakly Nonlinear Circuit Analysis Based on Fast Multidimensional Inverse Laplace Transform
Author	*Tingting Wang, Haotian Liu, Yuanzhe Wang, Ngai Wong (The University of Hong Kong, Hong Kong)
Page	pp. 547 - 552
Keyword	Numerical inverse Laplace transform, Laguerre functions, parallel computing, nonlinear circuit, Volterra series
Abstract	There have been continuing thrusts in developing efficient modeling techniques for circuit simulation. However, most circuit simulation methods are time-domain solvers. In this paper we propose a frequency-domain simulation method based on Laguerre function expansion. The proposed method handles both linear and nonlinear circuits. The Laguerre method can invert multidimensional Laplace transform efficiently with a high accuracy, which is a key step of the proposed method. Besides, an adaptive mesh refinement (AMR) technique is developed and its parallel implementation is introduced to speed up the computation. Numerical examples show that our proposed method can accurately simulate large circuits while enjoying low computation complexity.

Session D2 University LSI Design Contest 2
Time: 10:40 - 12:20 Thursday, February 2, 2012
Location: Room 204A

D2-1 (Time: 10:40 - 10:54)

Title	A Reference-Free On-Chip Timing Jitter Measurement Circuit Using Self-Referenced Clock and a Cascaded Time Difference Amplifier in 65nm CMOS
Author	*Kiichi Niitsu, Masato Sakurai, Naohiro Harigai, Daiki Hirabayashi, Takahiro J. Yamaguchi, Haruo Kobayashi (Gunma University, Japan)
Page	pp. 553 - 554
Keyword	jitter, on-chip measurement, CMOS, time difference amplifier, BIST
Abstract	This paper demonstrates a reference-clock-free, high-resolution on-chip timing jitter measurement circuit. It combines a self-referenced clock and a cascaded time difference amplifier (TDA), which results in reference-clock-free, high-resolution timing jitter measurement without sacrificing operational speed. The test chip was designed and fabricated in 65 nm CMOS. Measured results of the proposed circuit show the possibility of detecting a timing jitter of 1.61-ps RMS in 820 MHz clock with less than 4% error.

D2-2 (Time: 10:54 - 11:08)

Title	Simultaneous Data and Power Transmission using Nested Clover Coils
Author	*Yasuhiro Take, Hayun Chung, Noriyuki Miura, Tadahiro Kuroda (Keio University, Japan)
Page	pp. 555 - 556
Keyword	power delivery, non-contact memory card, inductive-coupling, wireless
Abstract	This paper presents a simultaneous data and power transmission utilizing inductive-coupling interfaces for a non-contact memory card application. Nested clover coils are proposed to reduce interference from a power link. In order to maximize power transfer efficiency, the power transmitter tracks and predicts power consumption patterns of the memory card, and adjusts power transfer level. A test-chip prototype fabricated in a 65 nm CMOS process demonstrates 6 Gb/s data rate and 10% power transfer efficiency across a 0.1-2 kΩ load range.

D2-3 (Time: 11:08 - 11:22)

Title	Complexity-Effective Auditory Compensation with a Controllable Filter for Digital Hearing Aids
Author	Ya-Ting Chang, *Kuo-Chiang Chang, Yu-Ting Kuo, Chih-Wei Liu (National Chiao Tung University, Taiwan)
Page	pp. 557 - 558
Keyword	hearing aids, auditory compensation, low complexity
Abstract	Auditory compensation consumes significant power due to the computation-intensive operations in the filter bank. To reduce the complexity, a controllable filter was designed to replace the filter bank. Filter order was designed to match prescriptions within a specific error constraint with minimum computational cost. An interpolation scheme according to the variation of signal intensity was implemented to reduce the overhead of coefficients calculations. The proposed auditory compensation reduces 80% of multiplications and 30% of power consumption compared to the complexity-effective multi-rate filter bank architecture [4]. Moreover, the group delay was also reduced from 10 ms to 2.4 ms.

D2-4 (Time: 11:22 - 11:36)

Title	A Progressive Mixing 20GHz ILFD with Wide Locking Range for Higher Division Ratios
Author	*Ahmed Musa, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page	pp. 559 - 560
Keyword	ILFD, injection locking, frequency dividers, wide locking, divide by 4
Abstract	This paper proposes a Progressive Mixing Injection Locked Frequency Divider (PMILFD) technique that enhances the locking range for higher division ratios. As this technique uses lower and much stronger harmonics in the mixing process, it results in a stronger injection effect and a much wider locking range. Two 20 GHz PMILFDs were designed based on this approach to divide by 4 and 8 using a 65 nm CMOS process. The former achieves a 7.9 GHz (31.4%) locking range and the latter achieves 3.4 GHz (15.5%) while consuming 3.9 mW.

D2-5 (Time: 11:36 - 11:50)

Title	A 16-Gb/s Area-Efficient LD Driver with Interwoven Inductor in a 0.18-µm CMOS
Author	*Takeshi Kuboki (Kyoto University, Japan), Yusuke Ohtomo (NTT Microsystem Integration Laboratories, Japan), Akira Tsuchiya (Kyoto University, Japan), Keiji Kishine (University of Shiga Prefecture, Japan), Hidetoshi Onodera (Kyoto University, Japan)
Page	pp. 561 - 562
Keyword	LD Driver, interwoven, inductor
Abstract	This paper presents the fastest laser-diode driver with interwoven peaking inductor in 0.18-µm CMOS. Six and four inductors are interwoven into two sets of inductors for area-effective implementation as well as performance enhancement. The operation speed enhancement of the proposed circuit is achieved by tuning mutual inductances of interwoven inductors. The circuit area is 0.34 mm² and the maximum operating speed is 16 Gb/s.

D2-6 (Time: 11:50 - 12:04)

Title	A PVT-robust Feedback Class-C VCO Using an Oscillation Swing Enhancement Technique
Author	*Wei Deng, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page	pp. 563 - 564
Keyword	PVT, Class-C VCO, Startup
Abstract	This paper presents a feedback class-C VCO with PVT-robustness and enhanced oscillation swing. The proposed VCO starts oscillation as a differential LC-VCO for robust startup, and automatically adapts to an amplitude-enhanced class-C VCO in steady-state for lower phase noise. The proposed VCO is implemented in a 0.18 µm CMOS process. The measured phase noise at room temperature is -125 dBc/Hz @ 1 MHz offset with a power dissipation of 3.4 mW, from a carrier frequency of 4.84 GHz. The figure-of-merit is -193 dBc/Hz.
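
The figure-of-merit quoted above can be cross-checked from the other reported numbers. The sketch below assumes the abstract uses the conventional oscillator FoM, namely phase noise minus 20·log10(carrier/offset) plus 10·log10(power in mW); the abstract itself does not state the formula, and the function name is hypothetical.

```python
import math


def vco_fom(phase_noise_dbc_hz, carrier_hz, offset_hz, power_mw):
    """Conventional VCO figure-of-merit in dBc/Hz (assumed definition)."""
    return (phase_noise_dbc_hz
            - 20.0 * math.log10(carrier_hz / offset_hz)
            + 10.0 * math.log10(power_mw))


# Values reported in the abstract: -125 dBc/Hz at 1 MHz offset,
# 4.84 GHz carrier, 3.4 mW power dissipation.
fom = vco_fom(-125.0, 4.84e9, 1.0e6, 3.4)
```

Under that assumed definition the result is about -193.4 dBc/Hz, consistent with the -193 dBc/Hz stated in the abstract.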

D2-7 (Time: 12:04 - 12:18)

Title	A Single-Routing Layered LDPC Decoder for 10Gbase-T Ethernet in 130nm CMOS
Author	*Dan Bao, Xubin Chen, Yuebin Huang, Chuan Wu, Yun Chen, Xiao Yang Zeng (Fudan University, China)
Page	pp. 565 - 566
Keyword	LDPC, Decoder
Abstract	A highly-parallel LDPC decoder architecture for 10Gbase-T applications is designed in this paper. Firstly, we reduce the routing complexity and corresponding power consumption by the proposed decoder architecture based on single routing networks. Secondly, the proposed architecture is designed with pipelined layered scheduling and multi-block parallel decoding, which improves operation speed and removes pipeline stalls in conventional highly-parallel layered scheduling. Thirdly, we trade off between hardware cost and throughput by a digit-serial data-path. Fourthly, an efficient early-termination circuit suitable for layered decoding is designed. The decoder is implemented in a 130 nm 1P8M CMOS process. The core area is 18.4 mm² with 14% reduction, and the decoding throughput is 9.48 Gbps operating at 278 MHz and 5 iterations. The tested power consumption is 774 mW at 1.2 V and 80 MHz.

Session 7A System-Level Modeling, Simulation, and Verification
Time: 10:40 - 12:20 Thursday, February 2, 2012
Location: Room 204B
Chairs: Lovic Gauthier (Kyushu University, Japan), Alan Su (Synopsys, Taiwan)

7A-1 (Time: 10:40 - 11:05)

Title	Automatic Timing Granularity Adjustment for Host-Compiled Software Simulation
Author	Parisa Razaghi, *Andreas Gerstlauer (The University of Texas at Austin, U.S.A.)
Page	pp. 567 - 572
Keyword	Real-time systems, host-compiled simulation, Abstract RTOS modeling
Abstract	Host-compiled simulation has been widely adopted as a practical approach for fast and high-level evaluation of complex software-intensive systems at early stages of the design process. In such approaches, higher speed is achieved by coarse-grained simulation of the system, which also leads to a loss in timing accuracy. To eliminate the inherent speed and accuracy tradeoff, we present an adjustive software simulator, which automatically controls the timing model of the simulation platform to provide both fast and accurate results. At its core, we propose a novel RTOS model that permanently monitors the state of the system and optimally and automatically adjusts back-annotated timing granularities to provide an error-free task scheduling. We evaluated our approach on an industrial-strength example, and results show that the accuracy of a fine-grain simulation can be achieved while maintaining a speed of close to 900MIPS.

7A-2 (Time: 11:05 - 11:30)

Title	Performance Estimation of Embedded Software with Confidence Levels
Author	*Marco Lattuada, Fabrizio Ferrandi (Politecnico di Milano, Italy)
Page	pp. 573 - 578
Keyword	Performance Estimation, Prediction Intervals, Confidence Levels
Abstract	Since time constraints are a very critical aspect of an embedded system, performance evaluation cannot be postponed to the end of the design flow, but has to be introduced from its early stages. Estimation techniques based on mathematical models are usually preferred during this phase since they quickly provide a fairly accurate estimate of the application performance. However, the estimation error has to be considered during design space exploration to evaluate whether a solution can be accepted (e.g., by discarding solutions whose estimated time is too close to the constraint). Evaluating whether the possible error is significant by analyzing a point estimate is not a trivial task. In this paper we propose a methodology, based on statistical analysis, that provides a prediction interval on the estimation and a confidence level on the meeting of a time constraint. This information can drive design space exploration, reducing the number of solutions to be validated. The results show how the produced intervals effectively capture the estimation error introduced by a linear model.

7A-3 (Time: 11:30 - 11:55)

Title	Verifying Dynamic Power Management Schemes Using Statistical Model Checking
Author	Jayanand Asok Kumar, *Shobha Vasudevan (University of Illinois at Urbana-Champaign, U.S.A.)
Page	pp. 579 - 584
Keyword	dynamic power management, statistical model checking, RTL, multicore
Abstract	Dynamic power management (DPM) schemes, such as power gating, are important runtime strategies for saving power in multicore architectures. Safety and efficiency are probabilistic properties which need to be verified in order to evaluate a DPM scheme. In this work, we employ statistical model checking to verify probabilistic properties on Register Transfer Level (RTL) descriptions of multicores. Statistical model checking performs a system-level verification of the DPM scheme by simulating several sample paths of the entire RTL design until the verification results lie within tolerable bounds of error. We illustrate our approach on the RTL of OpenSPARC T2, a publicly available industry-strength multicore processor. We verify the safety and efficiency properties of several power gating schemes by considering the power manageable blocks in the floating-point graphics unit.
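
The idea of simulating sample paths until the result lies within tolerable error bounds can be illustrated with a fixed-sample-size Monte Carlo estimate. This sketch assumes the Chernoff-Hoeffding bound is used to choose the number of runs, N >= ln(2/delta) / (2·eps²); the abstract does not say which statistical test the authors use. `toy_path` is a hypothetical stand-in for one RTL simulation run that reports whether the property held.

```python
import math
import random


def required_samples(eps, delta):
    """Runs needed so the estimate is within eps of the true probability
    with confidence at least 1 - delta (Chernoff-Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_probability(simulate_path, eps, delta, seed=0):
    """Estimate P(property holds) from independent simulated sample paths."""
    rng = random.Random(seed)
    n = required_samples(eps, delta)
    hits = sum(1 for _ in range(n) if simulate_path(rng))
    return hits / n, n


def toy_path(rng):
    # Hypothetical placeholder: the property holds on 90% of sample paths.
    return rng.random() < 0.9


estimate, n_samples = estimate_probability(toy_path, eps=0.01, delta=0.05)
```

With eps = 0.01 and delta = 0.05 the bound asks for 18445 runs. Sequential tests can often stop earlier, which matters when each sample path is a full RTL simulation.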

7A-4 (Time: 11:55 - 12:20)

Title	Formal Methods for Coverage Analysis of Architectural Power States in Power-Managed Designs
Author	*Aritra Hazra, Pallab Dasgupta (Indian Institute of Technology Kharagpur, India), Ansuman Banerjee (Indian Statistical Institute Kolkata, India), Kevin Harer (Synopsys Inc., U.S.A.)
Page	pp. 585 - 590
Keyword	Formal Coverage Analysis, Formal Verification, Power Intent Verification, Assertions, Low-Power Designs
Abstract	The architectural power intent of a design defines the intended global power states of a power-managed integrated circuit. Verification of the implementation of power management logic involves the task of checking whether only the intended power states are reached. Typically, the number of global power states reachable by the global power management strategy is significantly smaller than the possible number of global power states. In this paper, we present a formal method for determining the set of reachable global power states in a power-managed design. Our approach demonstrates how this task can be further constrained as required by the verification engineer. We highlight the efficacy of the proposed methods over several test-cases.

Session 7B Timing, Thermal, and Power Issues in High-Performance Design
Time: 10:40 - 12:20 Thursday, February 2, 2012
Location: Room 203
Chairs: Yuchun Ma (Tsinghua University, China), Masanori Hashimoto (Osaka University, Japan)

7B-1 (Time: 10:40 - 11:05)

Title	The Impact of Hot Carriers on Timing in Large Circuits
Author	Jianxin Fang, *Sachin Sapatnekar (University of Minnesota, U.S.A.)
Page	pp. 591 - 596
Keyword	Hot Carrier Effects, Delay Degradation, Reliability Analysis, Static Timing Analysis
Abstract	This paper focuses on hot carrier (HC) effects in large scale digital circuits and proposes a scalable method for analyzing circuit-level delay degradations. At the transistor level, a multi-mode energy-driven model for nanometer technologies is employed. At the logic cell level, a methodology that captures the aging of a device as a sum of device age gains per signal transition is described, and the age gain is characterized using SPICE simulation. At the circuit level, the cell-level characterizations are used in conjunction with probabilistic methods to perform fast degradation analysis. The proposed analysis method is validated by Monte Carlo simulation on various benchmark circuits, and is shown to be accurate, efficient and scalable.

7B-2 (Time: 11:05 - 11:30)

Title	A Learning-Based Autoregressive Model for Fast Transient Thermal Analysis of Chip-Multiprocessors
Author	*Da-Cheng Juan, Huapeng Zhou, Diana Marculescu, Xin Li (Carnegie Mellon University, U.S.A.)
Page	pp. 597 - 602
Keyword	Thermal analysis, chip-multiprocessor, machine learning, autoregression, thermal optimization
Abstract	Thermal issues have become critical roadblocks for the development of advanced chip-multiprocessors (CMPs). In this paper, we introduce a new angle on transient thermal analysis: predicting the thermal profile instead of calculating it. We develop a systematic framework that can learn different thermal profiles of a CMP by an autoregressive (AR) model. Experimental results show that the proposed AR model can achieve 113X speed-up over existing thermal estimation methods, while introducing an error of only 0.8°C on average.

7B-3 (Time: 11:30 - 11:55)

Title	On-Chip Statistical Hot-Spot Estimation Using Mixed-Mesh Statistical Polynomial Expression Generating and Skew-Normal Based Moment Matching Techniques
Author	Pei-Yu Huang, Yu-Min Lee, *Chi-Wen Pan (National Chiao Tung University, Taiwan)
Page	pp. 603 - 608
Keyword	Thermal Analysis, Thermal Yield, Process Variation, Leakage Powers, Thermal-Aware Design
Abstract	This work introduces the concept of a thermal yield profile for hot-spot identification under process variations and provides an efficient technique for estimating the thermal yield profile. After executing a mixed-mesh strategy for generating a statistical polynomial expression of the on-chip temperature distribution, the thermal yield profile is obtained by a skew-normal based moment matching technique. Compared with the Monte Carlo method, experimental results demonstrate that our method can efficiently and accurately estimate the thermal yield profile. With the same level of accuracy, our skew-normal based method achieves 215X speedup over the state of the art, APEX, for estimating the thermal yield profile. Moreover, results show that our mixed-mesh statistical polynomial expression generator achieves 130X speedup over the statistical collocation based method and still accurately estimates the thermal yield profile.

7B-4 (Time: 11:55 - 12:20)

Title	Design Techniques for Functional-Unit Power Gating in the Ultra-Low-Voltage Region
Author	*Michael B. Henry, Leyla Nazhandali (Virginia Tech, U.S.A.)
Page	pp. 609 - 614
Keyword	ultra low voltage operation, power gating, functional unit power gating, low power
Abstract	In this paper, we investigate many of the important aspects of highly aggressive functional unit power gating in the context of ultra-low-voltage operation. Using an optimization framework, we demonstrate that functional unit power gating results in an average 30-40% drop in total functional unit energy across a range of benchmarks. We also analyze Sense-Amplifier Pass Transistor Logic (SAPTL) and show that, compared to CMOS, SAPTL needs much smaller footers and consumes 100 times less boot-up energy.

Session 7C Interconnect, Cooling, and Charge Storage Technologies
Time: 10:40 - 12:20 Thursday, February 2, 2012
Location: Room 202
Chairs: Wei Zhang (Nanyang Technological University, Singapore), Hai Zhou (Northwestern University, U.S.A.)

7C-1 (Time: 10:40 - 11:05)

Title	Post-Fabrication Reconfiguration for Power-Optimized Tuning of Optically Connected Multi-Core Systems
Author	Yan Zheng (Tsinghua University, China), *Peter Lisherness, Saeed Shamshiri, Amirali Ghofrani (University of California, Santa Barbara, U.S.A.), Shiyuan Yang (Tsinghua University, China), Kwang-Ting Tim Cheng (University of California, Santa Barbara, U.S.A.)
Page	pp. 615 - 620
Keyword	Post-fabrication Reconfiguration, Optically connected Multi-core system, Reliability, Power Consumption, Design-for-Yield
Abstract	Integrating optical interconnects into the next-generation multi-/many-core architecture has been considered a viable solution to addressing the limitations in throughput, latency, and power efficiency of electrical interconnects. Optical interconnects also allow the performance growth of inter-core connectivity to keep pace with the growth of the cores’ processing ability. However, variations in the fabrication process significantly impair an optical network’s communication quality. Existing post-fabrication tuning methods, which are based on adjusting the voltages and temperatures, have very limited tunability and require excessive power to fully compensate for the variation. In this paper, we study the sources and severity of process variation, and propose two methods to enhance the robustness of an on-chip optical network: 1) adding spare modulators and detectors for post-fabrication reconfiguration and low-power tuning, and 2) introducing a combined detector/modulator structure for a more robust network topology. Simulation results show that employing both methods can reduce the tuning power from hundreds of watts to 6W while maintaining a throughput of 99.7%. To maintain a throughput of 50%, the tuning power can be further reduced to only 12mW.

7C-2 (Time: 11:05 - 11:30)

Title	GLOW: A Global Router for Low-Power Thermal-reliable Interconnect Synthesis using Photonic Wavelength Multiplexing
Author	Duo Ding, Bei Yu, *David Z. Pan (The University of Texas at Austin, U.S.A.)
Page	pp. 621 - 626
Keyword	optical interconnect synthesis, low power, thermal reliability, physical design, nanophotonics WDM
Abstract	In this paper, we examine the integration potential and explore the design space of low-power, thermally reliable on-chip interconnect synthesis featuring nanophotonic Wavelength Division Multiplexing (WDM). With recent advancements, nanophotonics is expected to hold promise for future on-chip data signalling due to its unique power efficiency, signal delay and huge multiplexing potential. However, there are major challenges to address before feasible on-chip integration can be reached. We present GLOW, a hybrid global router that provides low-power opto-electronic interconnect synthesis under the considerations of thermal reliability and various physical design constraints such as power (thermal), delay and signal quality. GLOW is simulated and evaluated on various test cases derived from ISPD global routing contest benchmarks. Compared with a greedy heuristic approach, GLOW demonstrates 23%-50% optical power reduction, revealing the great potential of on-chip opto-electrical WDM interconnection.

7C-3 (Time: 11:30 - 11:55)

Title	Charge Replacement in Hybrid Electrical Energy Storage Systems
Author	Qing Xie, Yanzhi Wang (University of Southern California, U.S.A.), Younghyun Kim, Donghwa Shin, *Naehyuck Chang (Seoul National University, Republic of Korea), Massoud Pedram (University of Southern California, U.S.A.)
Page	pp. 627 - 632
Keyword	hybrid electrical energy storage, charge replacement, charge management
Abstract	Hybrid electrical energy storage (HEES) systems are composed of multiple banks of heterogeneous electrical energy storage (EES) elements with distinctive properties. Charge replacement in a HEES system (i.e., dynamic assignment of load demands to EES banks) is one of the key operations in the system. This paper formally describes the global charge replacement (GCR) optimization problem and provides an algorithm to find the near-optimal GCR control policy. The optimization problem is formulated as a mixed-integer nonlinear programming problem, where the objective function is the charge replacement efficiency. The constraints account for the energy conservation law, efficiency of the charger/converter, the rate capacity effect, and self-discharge rates plus internal resistances of the EES element arrays. The near-optimal solution to this problem is obtained while considering the state of charges (SoCs) of the EES element arrays, characteristics of the load devices, and estimates of energy contributions by the EES element arrays. Experimental results demonstrate significant improvements in the charge replacement efficiency in an example HEES system comprised of banks of battery and supercapacitor elements with a high-power pulsed military radio transceiver as the load device.

7C-4 (Time: 11:55 - 12:20)

Title	Prospects of Active Cooling with Integrated Super-Lattice based Thin-Film Thermoelectric Devices for Mitigating Hotspot Challenges in Microprocessors
Author	Borislav Alexandrov, Owen Sullivan, Satish Kumar, *Saibal Mukhopadhyay (Georgia Institute of Technology, U.S.A.)
Page	pp. 633 - 638
Keyword	Thermoelectric Coolers, Hot Spot, Active Cooling
Abstract	Super-lattice thin-film thermoelectric coolers (TEC) are emerging as a promising technology for hot spot mitigation in microprocessors. This paper studies the prospect of on-demand cooling with advanced TECs integrated at the back of the heat spreader inside a package (integrated TEC). The thermal compact models of the chip and package with integrated TECs are developed and used for steady-state and transient temperature analysis. The control principles for TEC assisted transient cooling are presented and their impact on reducing thermal violations in microprocessors and TEC energy dissipations are discussed.

Session S8 Special Session 8: Design for Reconfigurability and Adaptivity: Device, Circuit and System Perspectives
Time: 14:00 - 15:40 Thursday, February 2, 2012
Location: Room 204A
Chairs: Yiyu Shi (Missouri University of Science and Technology, U.S.A.), Shih-Chieh Chang (National Tsing Hua University, Taiwan)

S8-1 (Time: 14:00 - 14:25)

Title	(Invited Paper) Nano-Electro-Mechanical (NEM) Relays and their Application to FPGA Routing
Author	Chen Chen, Scott Lee, J. Provine, Soogine Chong, Roozbeh Parsa, Daesung Lee, Roger T. Howe, H.S. Philip Wong (Department of Electrical Engineering, Stanford University, U.S.A.), *Subhasish Mitra (Department of Electrical Engineering, Stanford University and Department of Computer Science, Stanford University, U.S.A.)
Page	p. 639
Keyword	NEM, FPGA routing
Abstract	Nano-Electro-Mechanical (NEM) relays are nano-scale switches that can be mechanically actuated by an electrical signal. Unlike conventional CMOS transistors, NEM relays exhibit zero off-state leakage and very sharp on-off transitions. As a result, NEM relays can be potentially used to design highly energy-efficient digital systems. NEM relays are also excellent candidates for programmable routing switches in Field Programmable Gate Arrays (FPGAs) due to their potentially low on-state resistances despite their long mechanical delays. Low-temperature fabrication of NEM relays creates opportunities for their integration on top of silicon CMOS circuits. Hysteresis properties of NEM relays can enable their use as FPGA programmable routing switches without requiring additional routing SRAM cells. In this talk, we will present an overview of NEM relays and their use in digital system design, and discuss design considerations for hybrid CMOS-NEM FPGAs.

S8-2 (Time: 14:25 - 14:50)

Title	(Invited Paper) Capturing the Phantom of the Power Grid – On the Runtime Adaptive Techniques for Noise Reduction
Author	Tao Wang (ECE Dept., Missouri University of Science and Technology, U.S.A.), Pei-Wen Luo, Yu-Shih Su, Liang-Chia Cheng, Ding-Ming Kwai (Industrial Technology Research Institute, Hsin-Chu, Taiwan), *Yiyu Shi (ECE Dept., Missouri University of Science and Technology, U.S.A.)
Page	pp. 640 - 645
Keyword	Power Grid
Abstract	Power supply noise has become one of the primary concerns in low power designs. To ensure power integrity, designers need to make sure that voltage droop and bounce do not exceed the noise margin in all possible scenarios. Since it is very difficult to capture the exact worst corner amid the complex functionalities in modern VLSI designs, statistical design methodologies have been adopted, which may bring significant design overhead. In view of this, various runtime techniques have been proposed in the literature to suppress power grid noise adaptively. This paper first presents various challenges in power grid designs from an industrial perspective, explains the difficulties in handling them at design time, and then reviews various runtime techniques to adaptively suppress power supply noise, including sensor-based power gating, re-routable decaps, proactive clock frequency actuator, and PLL based clocking.

S8-3 (Time: 14:50 - 15:15)

Title	(Invited Paper) Post Silicon Skew Tuning: Survey and Analysis
Author	Mac Y.C. Kao, Kun-Ting Tsai, Hsuan-Ming Chou, *Shih-Chieh Chang (NTHU, Taiwan)
Page	pp. 646 - 651
Keyword	skew tuning
Abstract	Clock skew minimization is an important design consideration. However, with advances in technology and continued device scaling, Process, Voltage, and Temperature (PVT) variations make clock skew minimization highly challenging. To mitigate the impact of PVT variations, many previous works proposed the Post Silicon Tuning (PST) architecture to dynamically balance the skew of a clock tree. The PST architecture has two main components: the Adjustable Delay Buffer (ADB) and the Phase Detector (PD). In this paper, we survey existing techniques for the PST architecture and introduce several important design concerns, such as ADB selection, system control, and design testing.

S8-4 (Time: 15:15 - 15:40)

Title	(Invited Paper) Compilation and Architecture Support for Customized Vector Instruction Extension
Author	*Jason Cong, Mohammad Ali Ghodrat, Michael Gill, Hui Huang, Bin Liu, Raghu Prabhakar, Glenn Reinman, Marco Vitanza (University of California, Los Angeles, U.S.A.)
Page	pp. 652 - 657
Keyword	custom vector instruction
Abstract	Vectorization has been commonly employed in modern processors. In this work we identify the opportunities to explore customized vector instructions and build an automatic compilation flow to efficiently identify those instructions. A composable vector unit (CVU) is proposed to support a large number of customized vector instructions with small area overhead. The results show that our approach achieves an average 1.41X speedup over the state-of-the-art vector ISA. We also observe a large area gain (around 11.6X) over the dedicated ASIC-based design.

Session 8A Scheduling for Embedded and High-Performance Systems
Time: 14:00 - 15:40 Thursday, February 2, 2012
Location: Room 204B
Chairs: Chun Jason Xue (City University of Hong Kong), Morteza Biglari-Abhari (The University of Auckland)

8A-1 (Time: 14:00 - 14:25)

Title	Thread Affinity Mapping for Irregular Data Access on Shared Cache GPGPU
Author	*Hsien-Kai Kuo, Kuan-Ting Chen, Bo-Cheng Charles Lai, Jing-Yang Jou (Department of Electronics Engineering, National Chiao Tung University, Taiwan)
Page	pp. 659 - 664
Keyword	GPGPU, irregular data, data parallel computing, memory coalescing, cache contention
Abstract	Memory Coalescing and on-chip shared Cache are two effective techniques to alleviate the memory bottleneck in modern GPGPUs. These two techniques are very useful on applications with regular memory accesses. However, they become ineffective on concurrent threads with large numbers of uncoordinated accesses and the potential performance benefit could be significantly degraded. This paper proposes a thread affinity mapping methodology to coordinate the irregular data accesses on shared cache GPGPUs. Based on the proposed affinity metrics, threads are congregated into execution groups which are able to fully exploit the memory coalescing and data sharing within an application. An average of 3.5x runtime speedup is achieved on a Fermi GPGPU. The speedup scales with the sizes of test cases, which makes the proposed methodology an effective and promising solution for the continually increasing complexities of applications in the future many-core systems.

8A-2 (Time: 14:25 - 14:50)

Title	Modular Scheduling of Distributed Heterogeneous Time-Triggered Automotive Systems
Author	*Martin Lukasiewycz (TUM CREATE Centre for Electromobility, Singapore), Dip Goswami, Reinhard Schneider, Samarjit Chakraborty (Technical University of Munich, Germany)
Page	pp. 665 - 670
Keyword	Scheduling, Automotive, Time-triggered, Real-time, Control
Abstract	This paper proposes a modular framework that enables scheduling for time-triggered distributed embedded systems. The framework provides a symbolic representation that is used by an Integer Linear Programming (ILP) solver to determine a schedule that respects all bus and processor constraints as well as end-to-end timing constraints. Unlike other approaches, the proposed technique complies with automotive specific requirements at system-level and is fully extensible. Formulations for common time-triggered automotive operating systems and bus systems are presented. The proposed model supports the automotive bus systems FlexRay 2.1 and 3.0. For the operating systems, formulations for an eCos-based non-preemptive component and a preemptive OSEKtime operating system are introduced. A case study from the automotive domain gives evidence of the applicability of the proposed approach by scheduling multiple distributed control functions concurrently. Finally, a scalability analysis is carried out with synthetic test cases.

8A-3 (Time: 14:50 - 15:15)

Title	RAISE: Reliability-Aware Instruction SchEduling for Unreliable Hardware
Author	Semeen Rehman, Muhammad Shafique, Florian Kriebel, *Jörg Henkel (Karlsruhe Institute of Technology (KIT), Germany)
Page	pp. 671 - 676
Keyword	Reliability, Reliable Software, Instruction Scheduling, Compiler, Reliability Estimation
Abstract	A compile-time Reliability-Aware Instruction SchEduling (RAISE) scheme is presented, which takes into account the spatial and temporal vulnerabilities of different processor resources (pipeline, register file, etc.) used during the execution of different instructions. It reduces the software program’s susceptibility towards failures by minimizing the occupancy cycles of critical instructions inside the pipeline stages in addition to reducing the vulnerable periods of their operands. To facilitate RAISE, a novel technique for static reliability estimation during compilation is presented (i.e. before instruction scheduling). Compared to state-of-the-art reliability-aware instruction schedulers, our scheme reduces software program failures by up to 32.7% over three different fault rates.

8A-4 (Time: 15:15 - 15:40)

Title	On-Line Leakage-Aware Energy Minimization Scheduling for Hard Real-Time Systems
Author	Huang Huang, Ming Fan, *Gang Quan (Florida International University, U.S.A.)
Page	pp. 677 - 682
Keyword	real-time scheduling, temperature dependency, dynamic voltage and frequency scaling, energy optimization
Abstract	As semiconductor technology proceeds into the deep sub-micron era, leakage and its dependency on temperature become critical in dealing with the power/energy minimization problem. In this paper, we develop an analytical method to estimate energy consumption on-line with the leakage/temperature dependency taken into consideration. Based on this method, we develop an on-line scheduling algorithm to reduce the overall energy consumption for a hard real-time system scheduled according to the Earliest Deadline First (EDF) policy. Our experimental results show that the proposed energy estimation method can achieve up to 210X speedup compared with an existing approach while still maintaining high accuracy. In addition, with a large number of different test cases, the proposed energy saving scheduling method consistently outperforms two closely related approaches, on average by 10% and 14%, respectively.

Session 8B Automated Debugging and Validation
Time: 14:00 - 15:40 Thursday, February 2, 2012
Location: Room 203
Chairs: Jiun-Lang Huang (National Taiwan University, Taiwan), Jai-Ming Lin (National Cheng Kung University, Taiwan)

8B-1 (Time: 14:00 - 14:25)

Title	A Formal Approach to Debug Polynomial Datapath Designs
Author	*Bijan Alizadeh (University of Tehran, Iran)
Page	pp. 683 - 688
Keyword	formal verification, debugging, polynomial datapath, mutation-based debugging
Abstract	With the increasing complexity of digital systems, debugging of such systems has become a major economic issue. In this paper, we introduce a mutation-based debugging technique that allows us to efficiently locate and then correct bugs in datapath dominated applications such as in Digital Signal Processing (DSP) for multimedia applications and embedded systems. In order to evaluate the effectiveness of our approach, we have applied the proposed debugging technique to several industrial designs. The experimental results show that the proposed debugging technique enables us to localize and correct even multiple bugs with reasonable run time and memory usage.

8B-2 (Time: 14:25 - 14:50)

Title	Automated Debugging of Counterexamples in Formal Verification of Pipelined Microprocessors
Author	*Miroslav N. Velev, Ping Gao (Aries Design Automation, U.S.A.)
Page	pp. 689 - 694
Keyword	formal verification, pipelined processors, automated debugging, abstraction, SAT
Abstract	We propose a novel method for error diagnosis of pipelined microprocessors that allows us to exploit Positive Equality in Correspondence Checking. We also present static CNF variable ordering heuristics that dramatically reduce the solution space during the debugging. Experimental results indicate speedup of up to 2 orders of magnitude relative to previous approaches when applying the method to automated debugging in formal verification of complex pipelined DSPs.

8B-3 (Time: 14:50 - 15:15)

Title	On Error Tolerance and Engineering Change with Partially Programmable Circuits
Author	Hratch Mangassarian (University of Toronto, Canada), Hiroaki Yoshida (University of Tokyo, Japan), Andreas Veneris (University of Toronto, Canada), Shigeru Yamashita (Ritsumeikan University, Japan), *Masahiro Fujita (University of Tokyo, Japan)
Page	pp. 695 - 700
Keyword	Yield enhancement, QBF, Engineering change, Stuck-at-fault
Abstract	The growing size, density and complexity of modern VLSI chips are contributing to an increase in hardware faults and design errors in the silicon, decreasing manufacturing yield and increasing the design cycle. The use of Partially Programmable Circuits (PPCs) has been recently proposed for yield enhancement with very small overhead. This new circuit structure is obtained from conventional logic by replacing some subcircuits with programmable LUTs. The present paper lays the theoretical groundwork for evaluating PPCs with Quantified Boolean Formula (QBF) satisfiability. First, QBF models are constructed to calculate the fault tolerance and design error tolerance of a PPC, namely the percentages of faults and design errors that can be masked using LUT reconfigurations. Next, zero-cost Engineering Change Order (ECO) in PPCs is investigated. QBF formulations are given for performing ECOs, and for quantifying the ECO coverage of a PPC architecture. Experimental results are presented evaluating PPCs from [1], demonstrating the applicability and accuracy of the proposed formulations.

8B-4 (Time: 15:15 - 15:40)

Title	On Error Modeling of Electrical Bugs for Post-Silicon Timing Validation
Author	Ming Gao, *Peter Lisherness (University of California, Santa Barbara, U.S.A.), Jing-Jia Liou (National Tsing Hua University, Taiwan), Kwang-Ting (Tim) Cheng (University of California, Santa Barbara, U.S.A.)
Page	pp. 701 - 706
Keyword	Electrical Bug, Post-silicon Validation, Error Model, Validation Metric
Abstract	There is great demand for an accurate and scalable metric to evaluate the functional stimuli, testbench checkers, and DfD (Design-for-Debug) structures used in post-silicon timing validation. In this paper, we show the inadequacy of existing methods (due to either inaccuracy or a lack of scalability) and propose an approach that leverages debug engineers' experience to model timing errors efficiently and with sufficient precision. Experimental results demonstrate that the proposed approach produced an error model six times more accurate than the prior art with a negligible simulation overhead.

Session 8C DFM for Nanolithography
Time: 14:00 - 15:40 Thursday, February 2, 2012
Location: Room 202
Chairs: David Z. Pan (University of Texas, Austin, U.S.A.), C.-K. Cheng (University of California, San Diego, U.S.A.)

8C-1 (Time: 14:00 - 14:25)

Title	Hybrid Lithography Optimization with E-Beam and Immersion Processes for 16nm 1D Gridded Design
Author	Yuelin Du, Hongbo Zhang, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.), Kai-Yuan Chao (Intel Corporation, U.S.A.)
Page	pp. 707 - 712
Keyword	Hybrid Lithography, E-Beam Lithography, 1-D Gridded Design, Cut Redistribution
Abstract	Since some major IC industry participants are moving to highly regular 1D gridded designs to enable scaling to sub-20nm nodes, how to manufacture the randomly distributed cuts with reasonable throughput and process variation becomes a big challenge. With the help of hybrid lithography, different types of processes can be applied to the manufacturing of a single layer, such that the advantages of different technologies can be combined to further benefit manufacturing. In this paper, targeting cut printing difficulties and hybrid lithography with electron beam (E-Beam) and 193 nm immersion (193i) processes, we propose a novel algorithm to optimally assign cuts to 193i or E-Beam processes with proper modifications on cut distribution, in order to maximize the overall throughput. To validate our method, we construct our algorithm based on the forbidden patterns obtained from the optical simulation; then we formulate the redistribution problem as a well-defined ILP problem and finally call a reliable solver to solve the whole problem. Experimental results show that the throughput is dramatically improved by the cut redistribution. Besides that, for sparser layers, the E-Beam process can be eliminated entirely, which largely reduces the fabrication cost.

8C-2 (Time: 14:25 - 14:50)

Title	Design-Patterning Co-optimization of SRAM Robustness for Double Patterning Lithography
Author	Vivek Joshi (GLOBALFOUNDRIES, U.S.A.), *Dennis Sylvester (University of Michigan, U.S.A.), Kanak Agarwal (IBM Research, U.S.A.)
Page	pp. 713 - 718
Keyword	Double Patterning, SRAM, yield
Abstract	This paper presents a comprehensive analysis and optimization framework that compares the layerwise impact of different Double Patterning Lithography (DPL) choices on SRAM robustness, density, and printability. It then performs a sizing optimization that accounts for increased variability due to DPL for each layer. Experimental results based on 45nm industrial models show that using the best DPL option for each layer, along with the sizing optimization presented, we can achieve single exposure robustness together with improved DPL printability at nearly no overhead (less than 0.2% increase in write energy).

8C-3 (Time: 14:50 - 15:15)

Title	Efficient Pattern Relocation for EUV Blank Defect Mitigation
Author	Hongbo Zhang, Yuelin Du, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.), Rasit O. Topaloglu (GLOBALFOUNDRIES, U.S.A.)
Page	pp. 719 - 724
Keyword	EUV, Defect Mitigation, Pattern Relocation
Abstract	Blank defect mitigation is a critical step for extreme ultraviolet (EUV) lithography. Targeting the defective blank, a layout relocation method, which shifts and rotates the whole layout pattern to a proper position, has been proved to be an effective way to reduce defect impact. Yet, there is still no published work on how to find the best pattern location to minimize the impact of the buried defects with a reasonable defect model and considerable process variation control. In this paper, we present an algorithm that can optimally solve this pattern relocation problem. Experimental results validate our method, and the relocation results with full scale layouts generated from the Nangate Open Cell Library show great advantages with competitive runtimes compared to the existing commercial tool.

8C-4 (Time: 15:15 - 15:40)

Title	Character Design and Stamp Algorithms for Character Projection Electron-Beam Lithography
Author	Peng Du, Wenbo Zhao, Shih-Hung Weng, *Chung-Kuan Cheng, Ronald Graham (University of California, San Diego, U.S.A.)
Page	pp. 725 - 730
Keyword	Characters Projection, E-Beam Lithography, Character Design, Optimization Methods
Abstract	In this paper, we propose a series of methods, including character design, stencil compaction and layout matching for Character Projection (CP) Electron-Beam Lithography. We solve the problems with emphasis on inter-cell routing including wires and vias. For wire layout, we design a small set of regular characters after layout normalization. Then we partition the layout into several rows and adopt a greedy algorithm for layout matching in each row. For via layout, we utilize a minimum path covering algorithm to group vias into paths, which are contained in characters with bounded length. We devise an efficient method to compact all characters into a stencil with much less area than the total area of characters. Experimental results show that our algorithms achieve up to 83.42% and 67.29% of the maximum throughput improvement of CP over Variable Shaped Beam (VSB) technology for wire and via layouts, respectively. Our characters can be applied to general purpose layouts to save the high cost of generating different stencils for different layouts.

Session S9 Special Session 9: Quality Assurance for 3D-Stacked ICs
Time: 16:10 - 17:50 Thursday, February 2, 2012
Location: Room 204A
Chair: Tai-Chen Chen (National Central University, Taiwan)

S9-1 (Time: 16:10 - 16:35)

Title	(Invited Paper) Yield Enhancement for 3D-Stacked ICs: Recent Advances and Challenges
Author	*Qiang Xu, Li Jiang (The Chinese University of Hong Kong, Hong Kong), Huiyun Li (Shenzhen Institutes of Advanced Technology, China), Bill Eklow (Cisco Systems, U.S.A.)
Page	pp. 731 - 737
Keyword	3D-stacked ICs, Manufacturing yield, Defect tolerance, TSV
Abstract	Three-dimensional (3D) integrated circuits (ICs) that stack multiple dies vertically using through-silicon vias (TSVs) have gained wide interest from the semiconductor industry. The shift towards volume production of 3D-stacked ICs, however, requires their manufacturing yield to be commercially viable. Various techniques have been presented in the literature to address this important problem, including pre-bond testing techniques to tackle the "known good die" problem, TSV redundancy designs to provide defect-tolerance, and wafer/die matching solutions to improve the overall stack yield. In this paper, we survey recent advances in this field and point out challenges to be resolved in the future.

S9-2 (Time: 16:35 - 17:00)

Title	(Invited Paper) Yield-Aware Time-Efficient Testing and Self-fixing Design for TSV-Based 3D ICs
Author	*Jing Xie, Yu Wang, Yuan Xie (Pennsylvania State University, U.S.A.)
Page	pp. 738 - 743
Keyword	3D ICs
Abstract	Testing for three dimensional (3D) integrated circuits (ICs) based on through-silicon-via (TSV) is one of the major challenges for improving the system yield and reducing the overall cost. The lack of pads on most tiers and the mechanical vulnerability of tiers after wafer thinning make it difficult to perform 3D Known-Good-Die (KGD) test with the existing 2D IC probing methods. This paper presents a novel and time-efficient 3D testing flow. In this Known-Good-Stack (KGS) flow, a yield-aware TSV defect searching and replacing strategy is introduced. The Built-In Self-Test (BIST) design with a TSV redundancy scheme can help improve the system yield for today’s imperfect TSV fabrication process. Our study shows that fewer than 6 redundant TSVs are enough to increase the TSV yield to 98% for a TSV cluster with a size under 16×16 with relatively low initial TSV yield. The average TSV cluster testing and self-fixing time is about 3-16 testing cycles, depending on the initial TSV yield.
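
Editorial illustration (not taken from the paper): the effect of TSV redundancy on cluster yield can be sketched with a simple binomial model. The function name `cluster_yield` and the model itself are assumptions made here for illustration only: every TSV, spare or not, is assumed to fail independently with the same probability, and a cluster is assumed repairable when the number of defective TSVs does not exceed the number of spares. The paper's own yield model and repair architecture may differ.

```python
from math import comb


def cluster_yield(n_signal, n_spare, tsv_yield):
    """Probability that a TSV cluster is repairable.

    Assumed model (hypothetical, not from the paper): each of the
    n_signal + n_spare TSVs fails independently with probability
    1 - tsv_yield, and the cluster works if at most n_spare TSVs
    in total are defective.
    """
    total = n_signal + n_spare
    fail = 1.0 - tsv_yield
    # Sum the binomial probabilities of 0, 1, ..., n_spare defects.
    return sum(
        comb(total, k) * fail**k * tsv_yield ** (total - k)
        for k in range(n_spare + 1)
    )


if __name__ == "__main__":
    # Sweep the number of spares for a 16x16 cluster; the single-TSV
    # yield of 0.99 is an arbitrary illustrative value.
    for spares in range(7):
        print(spares, cluster_yield(16 * 16, spares, 0.99))
```

Under this model the cluster yield rises quickly with the first few spares, which is the qualitative behavior that redundancy schemes rely on.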

S9-3 (Time: 17:00 - 17:25)

Title	(Invited Paper) On Test and Repair of 3D Random Access Memory
Author	Cheng-Wen Wu (National Tsing Hua University, Taiwan), *Shyue-Kun Lu (National Taiwan University of Science and Technology, Taiwan), Jin-Fu Li (National Central University, Taiwan)
Page	pp. 744 - 749
Keyword	3D Random Access Memory, Built-In Self-Repair, Yield-Enhancement, Built-In Self-Test
Abstract	The three-dimensional (3D) random access memory (RAM) using through-silicon via (TSV) has been considered a promising approach to overcome the memory wall. However, cost and yield are two key issues for volume production of 3D RAMs, and yield enhancement increasingly requires test techniques. In this paper, we first introduce issues and existing techniques for the testing and yield enhancement of 3D RAMs. Then, a built-in self-repair (BISR) technique for 3D RAM using global redundancy is presented. According to the redundancy analysis results of each die with the BISR circuit, the die-to-die (d2d) and wafer-to-wafer (w2w) stacking problems are transformed into the bipartite maximal matching problem. Then, heuristic algorithms are also proposed to optimize the stacking yield.
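
Editorial illustration (not taken from the paper): one way to picture die stacking as bipartite matching is sketched below. The compatibility rule is a hypothetical stand-in chosen for illustration: a top die and a bottom die are assumed stackable when their combined fault counts fit within a shared spare budget. The names `stack_dies`, `top_faults`, `bottom_faults` and `shared_spares` are assumptions, and the matching routine is the textbook augmenting-path algorithm rather than the heuristics proposed by the authors.

```python
def max_bipartite_matching(adj, n_right):
    """Augmenting-path (Kuhn's) maximum bipartite matching.

    adj[u] lists the right-side vertices compatible with left vertex u.
    Returns match_right, where match_right[v] is the left vertex matched
    to right vertex v, or -1 if v is unmatched.
    """
    match_right = [-1] * n_right

    def try_augment(u, seen):
        for v in adj[u]:
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can move.
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    for u in range(len(adj)):
        try_augment(u, set())
    return match_right


def stack_dies(top_faults, bottom_faults, shared_spares):
    """Pair top and bottom dies so that as many stacks as possible are repairable.

    Hypothetical rule: a pair is repairable if the two dies' fault counts
    together do not exceed the shared spare budget.
    """
    adj = [
        [j for j, fb in enumerate(bottom_faults) if ft + fb <= shared_spares]
        for ft in top_faults
    ]
    match_right = max_bipartite_matching(adj, len(bottom_faults))
    return [(u, v) for v, u in enumerate(match_right) if u != -1]


if __name__ == "__main__":
    pairs = stack_dies([1, 3, 0], [2, 0, 4], shared_spares=3)
    print(pairs)  # list of (top die index, bottom die index) pairs
```

In this toy instance the bottom die with four faults cannot be paired with any top die, so at most two of the three possible stacks are repairable.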

S9-4 (Time: 17:25 - 17:50)

Title	(Invited Paper) Design for Manufacturability and Reliability for TSV-based 3D-ICs
Author	*David Z. Pan (University of Texas at Austin, U.S.A.), Sung Kyu Lim, Krit Athikulwongse, Moongon Jung (Georgia Institute of Technology, U.S.A.), Joydeep Mitra, Jiwoo Pak (University of Texas at Austin, U.S.A.), Mohit Pathak (Georgia Institute of Technology, U.S.A.), Jae-seok Yang (University of Texas at Austin, U.S.A.)
Page	pp. 750 - 755
Keyword	reliability, manufacturability, through-silicon-vias, 3D-IC
Abstract	The 3D IC integration using through-silicon-vias (TSV) has gained tremendous momentum recently for industry adoption. However, as TSV involves disruptive manufacturing technologies, new modeling and design techniques need to be developed for 3D IC manufacturability and reliability. In particular, TSVs in 3D IC may cause significant thermal mechanical stress, which not only results in systematic mobility/performance variations, but also leads to mechanical reliability concerns such as interfacial cracking. Meanwhile, the huge dimensional gaps between TSV, on-chip wires, and bonding/packaging all lead to new electromigration concerns. Thus full-chip/package modeling and physical design tools need to be developed to achieve more reliable 3D IC integration. In this paper, we will discuss some key design for manufacturability and reliability challenges and possible solutions for TSV-based 3D IC integration, as well as future research directions.

Session 9A Design for System Reliability
Time: 16:10 - 17:50 Thursday, February 2, 2012
Location: Room 204B
Chairs: Naehyuck Chang (Seoul National University, Republic of Korea), Shih-Hao Hung (National Taiwan University, Taiwan)

9A-1 (Time: 16:10 - 16:35)

Title	The Synthesis of Linear Finite State Machine-Based Stochastic Computational Elements
Author	*Peng Li (University of Minnesota, U.S.A.), Weikang Qian (University of Michigan-Shanghai Jiao Tong University Joint Institute, China), Marc D. Riedel, Kia Bazargan, David J. Lilja (University of Minnesota, U.S.A.)
Page	pp. 757 - 762
Keyword	stochastic computing, fault tolerance, logic synthesis
Abstract	The Stochastic Computational Element (SCE) uses streams of random bits to perform computation with conventional digital logic gates. It can guarantee reliable computation using unreliable devices. In stochastic computing, the linear Finite State Machine (FSM) can be used to implement some sophisticated functions, such as the exponential and tanh functions, more efficiently than combinational logic. However, a general approach for synthesizing a linear FSM-based SCE for a target function is still unknown. In this paper, we will introduce three properties of the linear FSM used in stochastic computing and demonstrate a general approach to synthesize a linear FSM-based SCE for a target function. Experimental results show that our approach produces circuits that are much more tolerant of soft errors than deterministic implementations, while the area-delay product of the circuits is less than that of deterministic implementations.
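
Editorial illustration (not taken from the paper): a well-known linear FSM element from the stochastic computing literature is the saturating up/down counter that approximates tanh. The sketch below simulates it in software. The function name `stochastic_tanh`, the bipolar encoding of inputs and outputs, and all parameter values are assumptions chosen for illustration; this is not the synthesis procedure the paper proposes.

```python
import math
import random


def stochastic_tanh(x, n_states=8, length=100_000, seed=0):
    """Simulate a linear FSM (saturating counter) on a random bit stream.

    The input x in [-1, 1] is encoded in bipolar form: each input bit is 1
    with probability (x + 1) / 2. The output bit is 1 while the counter is
    in the upper half of its states. The decoded output is roughly
    tanh(n_states / 2 * x).
    """
    rng = random.Random(seed)
    p_one = (x + 1.0) / 2.0
    state = n_states // 2
    ones = 0
    for _ in range(length):
        if rng.random() < p_one:
            state = min(state + 1, n_states - 1)  # count up, saturate at top
        else:
            state = max(state - 1, 0)             # count down, saturate at 0
        ones += state >= n_states // 2
    # Decode the bipolar output stream back to a value in [-1, 1].
    return 2.0 * ones / length - 1.0


if __name__ == "__main__":
    for x in (-0.5, -0.2, 0.0, 0.2, 0.5):
        print(x, stochastic_tanh(x), math.tanh(4 * x))
```

Because the value is carried by the fraction of ones in a long stream, a single flipped bit changes the result only slightly, which is the source of the soft-error tolerance the abstract refers to.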

9A-2 (Time: 16:35 - 17:00)

Title	Selective Time Borrowing for DSP Pipelines with Hybrid Voltage Control Loop
Author	*Paul N. Whatmough (ARM Ltd. / University College London, U.K.), Shidhartha Das, David M. Bull (ARM Ltd., U.K.), Izzat Darwazeh (University College London, U.K.)
Page	pp. 763 - 768
Keyword	Dynamic Voltage Scaling, Timing Errors, Razor, DSP
Abstract	In this paper, we propose the use of a time borrowing window on critical logic paths, over which timing errors can resolve safely without an explicit replay mechanism. We demonstrate that time borrowing can be incorporated into DSP pipelines without increasing the minimum clock period, while removing the metastability risk associated with many previously published approaches to replay-free timing error tolerance. A novel hybrid control approach is used to ensure timing violations do not exceed the safe borrowing window.

9A-3 (Time: 17:00 - 17:25)

Title	EPROF: An Energy/Performance/Reliability Optimization Framework for Streaming Applications
Author	*Yavuz Yetim, Sharad Malik, Margaret Martonosi (Princeton University, U.S.A.)
Page	pp. 769 - 774
Keyword	Stochastic Architectures, Scheduling, Parallel Architectures
Abstract	Computer systems face increasing challenges in simultaneously meeting an application's energy, performance, and reliability goals. While energy and performance tradeoffs have been studied through different dynamic voltage and frequency scaling (DVFS) policies and power management schemes, tradeoffs of energy and performance with reliability have not been studied for general purpose computing. This is particularly relevant for application domains such as multimedia, where some limited application error tolerance can be exploited to reduce energy. In this paper, we present EPROF, an optimization framework based on Mixed-Integer Linear Programming (MILP) that selects possible schedules for running tasks on multiprocessors in order to minimize energy while meeting constraints on application performance and reliability. We consider parallel applications that express (on task graphs) the performance and reliability goals they need to achieve, and that run on chip multiprocessors made up of heterogeneous processor cores that offer different energy/performance/reliability tradeoffs. For the StreamIt benchmarks, EPROF can identify schedules that offer up to 34% energy reduction over a baseline method while achieving the targeted performance and reliability. More broadly, EPROF demonstrates how these three degrees of freedom (energy, performance and reliability) can be flexibly exploited as needed for different applications.

Session 9B Logic and Datapath Synthesis
Time: 16:10 - 17:50 Thursday, February 2, 2012
Location: Room 203
Chairs: Robert Wille (University of Bremen, Germany), Yuichi Nakamura (NEC, Japan)

9B-1 (Time: 16:10 - 16:35)

Title	BTI-Aware Design Using Variable Latency Units
Author	*Saket Gupta, Sachin Sapatnekar (University of Minnesota, U.S.A.)
Page	pp. 775 - 780
Keyword	Variable Latency Design, Reliability, Throughput, Area
Abstract	Circuit degradation due to bias temperature instability (BTI) can lead to timing failures in digital circuits. We develop variable latency unit (VLU) based BTI-aware designs, with a novel scheme for multioutput hold logic implementation for VLUs. A key observation is the identification and exploitation of specific supersetting patterns in the two-dimensional space of frequency and aging of the circuit. The multioutput hold logic scheme is used in conjunction with an adaptive body bias framework to achieve high performance, allowing the design to be easily incorporated in traditional synthesis flows. Compared to a conventional combinational BTI-resilience scheme, our design achieves an area reduction of 9.2%, with a significant throughput enhancement of 30.0%.

9B-2 (Time: 16:35 - 17:00)

Title	Linear Decomposition of Index Generation Functions
Author	*Tsutomu Sasao (Kyushu Institute of Technology, Japan)
Page	pp. 781 - 788
Keyword	linear transform, functional decomposition, code converter, random function, data compression
Abstract	This paper shows a heuristic method to reduce the number of variables to represent incompletely specified index generation functions using linear decompositions. To find good linear transformations, two measures are introduced: the imbalance measure and the ambiguity measure. Experimental results using m-out-of-n code to binary converters, randomly generated functions, IP address tables, and English word lists show the usefulness of the approach.

9B-3 (Time: 17:00 - 17:25)

Title	Fixed-Point Accuracy Analysis of Datapaths with Mixed CORDIC and Polynomial Computations
Author	*Omid Sarbishei, Katarzyna Radecka (Dept. of Electrical and Computer Engineering, McGill University, Canada)
Page	pp. 789 - 794
Keyword	Fixed-point format, polynomial datapaths, CORDIC units, precision analysis, range analysis
Abstract	Fixed-point accuracy analysis of imprecise datapaths in terms of Maximum-Mismatch (MM) [1], or Mean-Square-Error (MSE) [14], w.r.t. a reference model is a challenging task. Typically, arithmetic circuits are represented with polynomials; however, for a variety of functions, including trigonometric, hyperbolic, logarithm, exponential, square root and division, Coordinate Rotation Digital Computer (CORDIC) units can result in more efficient implementations with better accuracy. This paper presents a novel approach to robustly analyze the fixed-point accuracy of an imprecise datapath, which may consist of a combination of polynomials and CORDIC units. The approach builds a global polynomial for the error of the whole datapath by converting the CORDIC units and their errors into the lowest possible order Taylor series. The previous work for almost accurate analysis of MM [1] and MSE [14, 15] in large datapaths can only handle polynomial computations.

9B-4 (Time: 17:25 - 17:50)

Title	Algorithm for Synthesizing Design Context-Aware Fast Carry-Skip Adders
Author	*Kiyoung Kim, Taewhan Kim (Seoul National University, Republic of Korea)
Page	pp. 795 - 800
Keyword	timing, synthesis, optimization
Abstract	This work proposes a systematic synthesis algorithm for fast carry-skip adders that considers arbitrary bit-level arrival times of the addends. We formulate the carry group partitioning problem for minimal timing as a dynamic programming problem and solve it effectively. The experimental results with various real arithmetic designs show that our synthesis algorithm is able to reduce the circuit latency by up to 16% and 10% compared to the best known existing algorithms.

Session 9C Video, Display, and Signal Processing Technologies and Techniques
Time: 16:10 - 17:50 Thursday, February 2, 2012
Location: Room 202
Chairs: Shao-Yi Chien (National Taiwan University, Taiwan), Yen-Kuang Chen (Intel Corp., U.S.A.)

9C-1 (Time: 16:10 - 16:35)

Title	A 16-pixel Parallel Architecture with Block-level/Mode-level Co-reordering Approach for Intra Prediction in 4kx2k H.264/AVC Video Encoder
Author	Huailu Ren (College of Information Science and Engineering of Shandong University of Science and Technology, China), *Yibo Fan (State Key Lab of ASIC and System of Fudan University, China), Xinhua Chen (College of Information Science and Engineering of Shandong University of Science and Technology, China), Xiaoyang Zeng (State Key Lab of ASIC and System of Fudan University, China)
Page	pp. 801 - 806
Keyword	H.264/AVC, intra prediction, hardware architecture
Abstract	Intra prediction is the most important technology in the H.264/AVC intra frame encoder. However, the intra prediction process involves extremely complicated data dependency and an immense amount of computation. In order to meet the requirements of real-time coding and avoid hardware waste, this paper presents a parallel and highly efficient H.264/AVC intra prediction architecture which targets high-resolution (e.g. 4kx2k) video encoding applications. In this architecture, the optimized intra 4x4 prediction engine can process sixteen pixels in parallel at a slightly higher hardware cost (compared to the previous four-pixel parallel architecture). The intra 16x16 prediction engine works in parallel with the intra 4x4 prediction engine. It reuses the adder-tree of the Sum of Absolute Transformed Difference (SATD) generator. Moreover, in order to reduce the data-dependency in the intra 4x4 reconstruction loop, a block-level and mode-level co-reordering strategy is proposed. Therefore, the performance bottleneck of H.264/AVC intra encoding can be alleviated to a great extent. The proposed architecture supports full-mode intra prediction for H.264/AVC baseline, main and extended profiles. It takes only 163 cycles to complete the intra prediction process of one macroblock (MB). This design is synthesized with a SMIC 0.13µm CMOS cell library. The result shows that it takes 61k gates and can run at 215MHz, supporting real-time encoding of 4kx2k@40fps video sequences.

9C-2 (Time: 16:35 - 17:00)

Title	Fine-grained Dynamic Voltage Scaling on OLED Display
Author	Xiang Chen, Jian Zheng, *Yiran Chen (Dept. of Electrical and Computer Eng., University of Pittsburgh, U.S.A.), Hai Li (Dept. of Electrical and Computer Eng., Polytechnic Institute of New York University, U.S.A.), Wei Zhang (School of Computer Eng., Nanyang Technological University, Singapore)
Page	pp. 807 - 812
Keyword	OLED, Driver design, Dynamic voltage scaling
Abstract	OLED has emerged as the new-generation display technology, but it remains power-inefficient. In this work, we propose a fine-grained dynamic voltage scaling (FDVS) technique to reduce power consumption. An OLED panel is partitioned into multiple individual areas with objective DVS. A DVS-friendly OLED driver design is also proposed to enhance the color accuracy under DVS. Experiments show that, compared to the existing DVS technique, the FDVS technique can achieve efficient power saving and reduce the image compensation cost.

9C-3 (Time: 17:00 - 17:25)

Title	A Reconfigurable Accelerator for Neuromorphic Object Recognition
Author	Jagdish Sabarad, Srinidhi Kestur, Mi Sun Park, Dharav Dantara, *Vijaykrishnan Narayanan (The Pennsylvania State University, U.S.A.), Yang Chen, Deepak Khosla (HRL Laboratories, U.S.A.)
Page	pp. 813 - 818
Keyword	accelerator, neuromorphic vision, object recognition, fpga, convolution
Abstract	Advances in neuroscience have enabled researchers to develop computational models of auditory, visual and learning perceptions in the human brain. HMAX, which is a biologically inspired model of the visual cortex, has been shown to outperform standard computer vision approaches for multi-class object recognition. HMAX, while computationally demanding, can be potentially applied in various applications such as autonomous vehicle navigation, unmanned surveillance and robotics. In this paper, we present a reconfigurable hardware accelerator for the time-consuming S2 stage of the HMAX model. The accelerator leverages spatial parallelism, dedicated wide data buses with on-chip memories to provide an energy efficient solution to enable adoption into embedded systems. We present a systolic array-based architecture which includes a run-time reconfigurable convolution engine which can perform multiple variable-sized convolutions in parallel. An automation flow is described for this accelerator which can generate optimal hardware configurations for a given algorithmic specification and also perform run-time configuration and execution seamlessly. Experimental results on Virtex-6 FPGA platforms show 5X to 11X speedups and 14X to 33X higher performance-per-Watt over a CNS-based implementation on a Tesla GPU.

9C-4 (Time: 17:25 - 17:50)

Title	Efficient Implementation of Multi-Moduli Architectures for Binary-to-RNS Conversion
Author	*Hector Pettenghi (Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Portugal), Leonel Sousa (Instituto Superior Tecnico (IST)/ Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Portugal), Jude Angelo Ambrose (School of Computer Science and Engineering, University of New South Wales, Australia)
Page	pp. 819 - 824
Keyword	Residue number system, Binary-to-RNS converters, memory-less processors, Digital Signal Processing
Abstract	This paper presents a novel approach to improve the existing Binary-to-RNS multi-moduli architectures. Multi-moduli architectures are implemented serially or in parallel. A novel choice of the weights associated with the inputs provides a substantial improvement when applied to the most efficient multi-moduli architectures known to date. Experimental results suggest that the proposed memory-less multi-moduli architectures achieve speedups of 1.94 and 1.62 for parallel and serial implementations, respectively, in comparison with the most efficient state-of-the-art structures.
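
As background for the abstract above, here is a minimal sketch of memory-less binary-to-RNS conversion using per-chunk weights. It assumes the common moduli set {2^n - 1, 2^n, 2^n + 1}, which the abstract does not specify, and it illustrates the textbook weighting rather than the weight selection proposed in the paper. The function name `binary_to_rns` is hypothetical.

```python
def binary_to_rns(x, n):
    """Return (x mod 2**n - 1, x mod 2**n, x mod 2**n + 1) for x >= 0, n >= 2.

    The input is split into n-bit chunks; each chunk's weight is
    2**(n*i) reduced modulo the target modulus, so no lookup table is needed.
    """
    if x < 0 or n < 2:
        raise ValueError("expected x >= 0 and n >= 2")
    mask = (1 << n) - 1

    # Split x into n-bit chunks, least significant chunk first.
    chunks = []
    v = x
    while v:
        chunks.append(v & mask)
        v >>= n

    # Modulo 2**n: only the lowest chunk contributes.
    r_pow2 = x & mask
    # Modulo 2**n - 1: 2**n is congruent to 1, so every chunk has weight +1.
    r_minus = sum(chunks) % mask
    # Modulo 2**n + 1: 2**n is congruent to -1, so weights alternate +1, -1.
    r_plus = sum(c if i % 2 == 0 else -c
                 for i, c in enumerate(chunks)) % (mask + 2)
    return r_minus, r_pow2, r_plus
```

In hardware, the final `%` reductions would typically be replaced by modular adder trees; the sketch keeps them only for brevity.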
