# The 18th Asia and South Pacific Design Automation Conference Technical Program

Remark: The presenter of each paper is marked with "*".

## Session Schedule

 Wednesday, January 23, 2013

1K  Opening & Keynote I (8:30 - 10:00)

| Time | A | B | C | D |
|---|---|---|---|---|
| 10:20 - 12:20 | 1A  Special Session: Advanced Modeling and Simulation Techniques for Power/Signal Integrity in 3D Design | 1B  Disruptive NoCs for Next-Generation MPSoCs | 1C  Embedded Systems | 1D  University Design Contest |
| 13:40 - 15:40 | 2A  Special Session: Dependability of on-Chip Systems | 2B  Logic Synthesis | 2C  Simulation for Thermal and Power Grid Analysis | 2D  Advanced Routing Techniques for Chip and PCB Design |
| 16:00 - 18:00 | 3A  Special Session: Design Automation for Flow-Based Microfluidic Biochips: Connecting Biochemistry to Electronic Design Automation | 3B  System-Level Synthesis and Optimization | | 3D  Hardware-Software Co-Optimization for Emerging NVMs |

 Thursday, January 24, 2013

2K  Keynote II (9:00 - 10:00)

| Time | A | B | C | D |
|---|---|---|---|---|
| 10:20 - 12:20 | 4A  Special Session: High-Level Synthesis and Parallel Programming Models for FPGAs | 4B  Memory Hierarchy Optimization | 4C  Timing and Power Driven Design Flow | 4D  Special Session: Emerging Security Topics in Electronic Designs and Mobile Devices |
| 13:40 - 15:40 | 5A  Designers' Forum: Heterogeneous Devices and Multi-Dimensional Integration Design Technologies | 5B  Analysis and Verification of Reliable Systems | | 5D  Multi-/Many-Core System Optimization |
| 16:00 - 18:00 | 6A  Designers' Forum: Future Direction and Trend of Embedded GPU | 6B  Emerging Technologies | 6C  New Directions in Modeling, Simulation, and Integrity | |

 Friday, January 25, 2013

3K  Keynote III (9:00 - 10:00)

| Time | A | B | C | D |
|---|---|---|---|---|
| 10:20 - 12:20 | 7A  Special Session: Many-Core Architecture and Software Technology | 7B  Simulation Acceleration | 7C  Reliability Analysis and Test | 7D  Emerging Technologies in Cyber Systems |
| 13:40 - 15:40 | 8A  Designers' Forum: Photonics for Embedded Systems | 8B  Revisiting Latency and Reliability in Memory Architectures | 8C  New 3D IC Design Techniques | 8D  Advances in Simulation and Formal Verification |
| 16:00 - 18:00 | 9A  Designers' Forum: Harmonized Hardware-Software Co-Design and Co-Verification | 9B  Memory and Storage Management | 9C  Advanced Modeling and Analysis of Analog and Mixed-Signal Circuits | 9D  High-Level and Architectural Synthesis |

## List of Papers

 Wednesday, January 23, 2013

Session 1K  Opening & Keynote I
Time: 8:30 - 10:00 Wednesday, January 23, 2013
Chair: Shinji Kimura (Waseda University, Japan)

1K-1 (Time: 8:30 - 10:00)
 Title (Keynote Address) From Circuits to Cancer Author *Sani Nassif (IBM Austin Research Lab., U.S.A.) Abstract The human race has invested about a trillion dollars in the development of semiconductor electronics, and our lives have been improved greatly as a result. Smart devices are now taken for granted and permeate every aspect of our daily lives. One of the important products of this huge investment was the development of sophisticated design optimization and simulation tools to allow the largely automated design and verification of integrated circuits. Sometimes we in the EDA community do not realize quite how advanced we are in this field, and just how applicable much of the Silicon R&D work is to other areas... This talk will be about one such area, namely that of Proton Radiation Cancer Therapy, where a team at IBM, working with researchers at the M. D. Anderson Cancer Research center, have been busy applying knowledge from the VLSI area to this important problem. We will show examples of applying large scale analysis and optimization techniques to the treatment planning problem, and hopefully motivate other EDA researchers to seek applications of their deep knowledge in adjacent fields.

Session 1A  Special Session: Advanced Modeling and Simulation Techniques for Power/Signal Integrity in 3D Design
Time: 10:20 - 12:20 Wednesday, January 23, 2013
Organizer: Hideki Asai (Shizuoka University, Japan)

1A-1 (Time: 10:20 - 10:50)
 Title (Invited Paper) Equivalent Circuit Model Extraction for Interconnects in 3D ICs Author *A. Ege Engin (San Diego State University, U.S.A.) Page pp. 1 - 6 Keyword TSV, 3D IC, silicon interposer Abstract Parasitic RC behavior of VLSI interconnects has been the major bottleneck in terms of latency and power consumption of ICs. Recent 3D ICs promise to reduce the parasitic RC effect by making use of through silicon vias (TSVs). It is therefore essential to extract the RC model of TSVs to assess their promise. Unlike interconnects on metal layers, TSVs exhibit slow-wave and dielectric quasi-transverse-electromagnetic (TEM) modes due to the coupling to the semiconducting substrate. This TSV behavior can be simulated using analytical methods, 2D electrostatic simulators, or 3D full-wave electromagnetic simulators. In this paper, we describe a methodology to extract parasitic RC models from such simulation data for interconnects in a 3D IC. Slides

1A-2 (Time: 10:50 - 11:20)
 Title (Invited Paper) Unconditionally Stable Explicit Method for the Fast 3-D Simulation of On-Chip Power Distribution Network with Through Silicon Via Author *Tadatoshi Sekine, Hideki Asai (Shizuoka University, Japan) Page pp. 7 - 12 Keyword power distribution network, through silicon via, explicit method, unconditionally stable, fast circuit simulation Abstract In this work, we propose an explicit method that is stable with no stability condition for the fast simulation of the equivalent circuit of an on-chip power distribution network with a number of through silicon vias. Additionally, the proposed unconditionally stable explicit method is further accelerated by combining it with an order reduction technique.

1A-3 (Time: 11:20 - 11:50)
 Title (Invited Paper) Signal Integrity Modeling and Measurement of TSV in 3D IC Author *Joungho Kim, Joungho Kim (Korea Advanced Institute of Science and Technology, Republic of Korea) Page pp. 13 - 16 Keyword Through Silicon Via, Signal Integrity, Modeling, Measurement Abstract In order to guarantee signal integrity of a TSV-based channel in 3D IC design, modeling and measurements are conducted for electrical characterization of the TSV-based channel including TSVs and RDLs with various performance metrics such as insertion loss, noise coupling and eye diagrams. Based on the modeling and measurements of the fabricated TSV channels, a design guide for the signal integrity of the channel is proposed. Slides

1A-4 (Time: 11:50 - 12:20)
 Title (Invited Paper) Power Distribution Network Modeling for 3-D ICs with TSV Arrays Author Chi-Kai Shen, Yi-Chang Lu, Yih-Peng Chiou, Tai-Yu Cheng, *Tzong-Lin Wu (National Taiwan University, Taiwan) Page pp. 17 - 22 Keyword 3-D IC, PDN, equivalent circuit model, TSV, CNIM Abstract A coupling node insertion method (CNIM) is proposed to handle electrical coupling between top metals of on-chip interconnects and silicon substrate surfaces in three-dimensional integrated circuits (3-D ICs). This coupling effect should not be neglected, especially as metal area is intentionally increased in order to reduce resistance values. In this paper, we illustrate how to build the CNIM model and incorporate it into power distribution networks. The CNIM model is validated by comparing our results to those obtained from a full-wave simulator. The differences between the two approaches are within 5%, but our computation time is shorter than that required by a full-wave simulator.

Session 1B  Disruptive NoCs for Next-Generation MPSoCs
Time: 10:20 - 12:20 Wednesday, January 23, 2013
Chairs: Sri Parameswaran (University of New South Wales, Australia), Chung-Ta King (National Tsing Hua University, Taiwan)

1B-1 (Time: 10:20 - 10:50)
 Title A Case for Wireless 3D NoCs for CMPs Author *Hiroki Matsutani (Keio University, Japan), Paul Bogdan, Radu Marculescu (Carnegie Mellon University, U.S.A.), Yasuhiro Take, Daisuke Sasaki, Hao Zhang (Keio University, Japan), Michihiro Koibuchi (National Institute of Informatics, Japan), Tadahiro Kuroda, Hideharu Amano (Keio University, Japan) Page pp. 23 - 28 Keyword Network-on-Chip (NoC), 3-D NoC, irregular topology Abstract Inductive-coupling is yet another 3D integration technique that can be used to stack more than three known-good-dies in a SiP without wire connections. We present a topology-agnostic 3D CMP architecture using inductive-coupling that offers great flexibility in customizing the number of processor chips, SRAM chips, and DRAM chips in a SiP after chips have been fabricated. In this paper, first, we propose a routing protocol that exchanges the network information between all chips in a given SiP to establish efficient deadlock-free routing paths. Second, we propose its optimization technique that analyzes the application traffic patterns and selects different spanning tree roots so as to minimize the average hop counts and improve the application performance.

1B-2 (Time: 10:50 - 11:20)
 Title Deflection Routing in 3D Network-on-Chip with TSV Serialization Author *Jinho Lee, Dongwoo Lee, Sunwook Kim, Kiyoung Choi (Seoul National University, Republic of Korea) Page pp. 29 - 34 Keyword network-on-chip(NoC), deflection routing, TSV serialization, 3D NoC, network Abstract This paper proposes a deflection routing for 3D NoC with serialized TSVs. Bufferless deflection routing provides area- and power-efficient communication under low to medium traffic load. Under 3D circumstances, the bufferless deflection routing can yield even better performance than buffered routing when key aspects are properly taken into account. Evaluation of the proposed scheme shows its effectiveness in throughput, latency, and energy consumption. Slides

1B-3 (Time: 11:20 - 11:50)
 Title MD: Minimal Path-based Approach for Fault-Tolerant Routing in On-Chip Networks Author Masoumeh Ebrahimi, Masoud Daneshtalab, Juha Plosila (University of Turku, Finland), *Farhad Mehdipour (Kyushu University, Japan) Page pp. 35 - 40 Keyword Network-on-Chip, fault-tolerant approach, minimal path, adaptive routing algorithm Abstract The communication requirements of many-core embedded systems are met by the emerging Network-on-Chip (NoC) paradigm. As on-chip communication reliability is a crucial factor in many-core systems, the NoC paradigm should address the reliability issues. Using fault-tolerant routing algorithms to reroute packets around faulty regions will increase the packet latency and create congestion around the faulty region. On the other hand, the performance of NoC is highly affected by the congestion in the network. Congestion in the network can increase the delay of packets to route from a source to a destination, so it should be avoided. In this paper, a minimal and defect-resilient (MD) routing algorithm is proposed in order to route packets adaptively through shortest paths in the presence of one faulty link, as long as a path exists. To avoid congestion, output channels can be adaptively chosen whenever the distance from the current to destination node is greater than one hop along both directions. In addition, an analytical model is presented to evaluate MD in the presence of two faulty links.

1B-4 (Time: 11:50 - 12:20)
 Title A Dynamic Stream Link for Efficient Data Flow Control in NoC Based Heterogeneous MPSoC Author Claude Helmstetter, Sylvain Basset, *Romain Lemaire (CEA-Leti, Minatec Campus, France), Michel Langevin, Chuck Pilkington (STMicroelectronics, Ottawa, Canada), Fabien Clermidy (CEA-Leti, Minatec Campus, France), Pierre Paulin (STMicroelectronics, Ottawa, Canada), Pascal Vivet (CEA-Leti, Minatec Campus, France), Didier Fuin (STMicroelectronics, Grenoble, France) Page pp. 41 - 46 Keyword NoC, Stream Link, Heterogeneous MPSoC, Data Flow Abstract As System-on-Chip sizes increase, communication costs become critical and Networks-on-Chip (NoC) bring innovative solutions. Efficient stream-based protocols over NoC have been widely studied to address dataflow communications. They are usually controlled by a set of static parameters. However, new applications, such as high-resolution video decoders, present more data-dependent behaviors, forcing communication protocols to support higher dynamicity. For this purpose, we present in this paper dynamic stream links for stream-based end-to-end NoC communications by introducing two link protocols, both independent of the transfer size, which improve the hardware/software control flexibility. The proposed protocols have been modeled in an MPSoC virtual platform and the hardware cost evaluated. Based on simulations, we provide guidelines to exploit these protocols according to application needs. Slides

Session 1C  Embedded Systems
Time: 10:20 - 12:20 Wednesday, January 23, 2013
Chairs: Hiroyuki Tomiyama (Ritsumeikan University, Japan), Tohru Ishihara (Kyoto University, Japan)

1C-1 (Time: 10:20 - 10:50)
 Title On Real-Time STM Concurrency Control for Embedded Software with Improved Schedulability Author *Mohammed Elshambakey, Binoy Ravindran (ECE Dept, Virginia Tech, U.S.A.) Page pp. 47 - 52 Keyword stm, real time, contention manager Abstract We consider software transactional memory (STM) concurrency control for embedded multicore real-time software, and present a novel contention manager for resolving transactional conflicts, called PNF. We upper bound transactional retries and task response times. Our implementation in RSTM/real-time Linux reveals that PNF yields shorter or comparable retry costs than competitors. Slides

1C-2 (Time: 10:50 - 11:20)
 Title Schedule Integration for Time-Triggered Systems Author *Florian Sagstetter, Martin Lukasiewycz (TUM CREATE, Singapore), Samarjit Chakraborty (TU Munich, Germany) Page pp. 53 - 58 Keyword scheduling, time-triggered system, FlexRay Abstract This paper presents a framework for the schedule integration of time-triggered systems tailored to the automotive domain. In-vehicle networks might be very large and complex such that obtaining a schedule for a fully synchronous system becomes a challenging task since all bus and processor constraints as well as end-to-end-timing constraints have to be taken concurrently into account. Existing optimization approaches apply the schedule optimization to the entire network, limiting their application due to scalability issues. In contrast, the presented framework obtains the schedule for the entire network, using a two-step approach where for each cluster a local schedule is obtained and the local schedules are finally merged to the global schedule. This approach is also in accordance with the design process in the automotive industry where different subsystems are developed independently to reduce the design complexity and are finally combined in the integration stage. In this paper, a generic framework for schedule integration of time-triggered systems is presented. Further, we show how this framework is implemented for a FlexRay network using an Integer Linear Programming (ILP) approach which might also be easily adapted to other protocols. A realistic case study and a scalability analysis give evidence of the applicability and efficiency of our approach. Slides

1C-3 (Time: 11:20 - 11:50)
 Title Online Estimation of the Remaining Energy Capacity in Mobile Systems Considering System-Wide Power Consumption and Battery Characteristics Author Donghwa Shin (Seoul National University, Republic of Korea), Woojoo Lee (University of Southern California, U.S.A.), Kitae Kim (Seoul National University, Republic of Korea), Yanzhi Wang, Qing Xie (University of Southern California, U.S.A.), *Naehyuck Chang (Seoul National University, Republic of Korea), Massoud Pedram (University of Southern California, U.S.A.) Page pp. 59 - 64 Keyword Low-power design, Power estimation, Smartphone, Battery life, Quality of service Abstract Emerging mobile systems integrate a lot of functionality into a small form factor with a small energy source in the form of a rechargeable battery. This situation necessitates accurate estimation of the remaining energy in the battery such that user applications can be judicious in how they consume this scarce and precious resource. This paper thus focuses on estimating the remaining battery energy in Android OS-based mobile systems. This paper proposes to instrument the Android kernel in order to collect and report accurate subsystem activity values based on real-time profiling of the running applications. The activity information, along with offline-constructed, regression-based power macro models for major subsystems in the smartphone, yields the power dissipation estimate for the whole system. Next, while accounting for the rate-capacity effect in batteries, the total power dissipation data is translated into the battery's energy depletion rate, and subsequently used to compute the battery's remaining lifetime based on its current state of charge information. Finally, this paper describes a novel application design framework, which considers the battery's state-of-charge (SOC), the battery's energy depletion rate, and service quality of the target application. The benefits of the design framework are illustrated by examining an archetypical case, involving the design space exploration and optimization of a GPS-based application in an Android OS.

1C-4 (Time: 11:50 - 12:20)
 Title WUCC: Joint WCET and Update Conscious Compilation for Cyber-physical Systems Author Yazhi Huang, Mengying Zhao, *Chun Jason Xue (City University of Hong Kong, Hong Kong) Page pp. 65 - 70 Keyword WCET, code similarity, real time systems Abstract The cyber-physical system (CPS) is a desirable computing platform for many industrial and scientific applications. However, the application of CPSs has two challenges: First, CPSs often include a number of sensor nodes. Update of preloaded code on remote sensor nodes powered by batteries is extremely energy-consuming. The code update issue in the energy sensitive CPS must be carefully considered; Second, CPSs are often real-time embedded systems with real-time properties. Worst-Case Execution Time (WCET) is one of the most important metrics in real-time system design. While existing works only consider one of these two challenges at a time, in this paper, a compiler-level optimization, Joint WCET and Update Conscious Compilation (WUCC), is proposed to jointly consider WCET and code update for cyber-physical systems. The novelty of the proposed approach is that the WCET problem and code update problem are considered concurrently such that a balanced solution with minimal WCET and minimal code difference can be achieved. The experimental results show that the proposed technique can minimize WCET and code difference effectively. Slides

Session 1D  University Design Contest
Time: 10:20 - 12:20 Wednesday, January 23, 2013
Chairs: Hiroshi Kawaguchi (Kobe University, Japan), Tetsuo Hironaka (Hiroshima City University, Japan)

1D-1 (Time: 10:20 - 10:25)
 Title A 40-nm 144-mW VLSI Processor for Real-time 60-kWord Continuous Speech Recognition Author *Guangji He, Takanobu Sugahara, Tsuyoshi Fujinaga, Yuki Miyamoto, Hiroki Noguchi, Shintaro Izumi, Hiroshi Kawaguchi, Masahiko Yoshimoto (Kobe University, Japan) Page pp. 71 - 72 Keyword hidden Markov model (HMM), large vocabulary continuous speech recognition (LVCSR), memory bandwidth reduction Abstract We have developed a low-power VLSI chip for 60-kWord real-time continuous speech recognition based on a context-dependent Hidden Markov Model (HMM). Our implementation includes a cache architecture using locality of speech recognition, beam pruning using a dynamic threshold, two-stage language model searching, highly parallel Gaussian Mixture Model (GMM) computation based on the mixture level, a variable-frame look-ahead scheme, and elastic pipeline operation between the Viterbi transition and GMM processing. Results show that our implementation achieves 95% bandwidth reduction (70.86 MB/s) and 78% required frequency reduction (126.5 MHz). The test chip, fabricated using 40 nm CMOS technology, contains 1.9 M transistors for logic and 7.8 Mbit on-chip memory. It dissipates 144 mW at 126.5 MHz and 1.1 V for 60 kWord real-time continuous speech recognition. Slides

1D-2 (Time: 10:25 - 10:30)
 Title A 24.5-53.6pJ/pixel 4320p 60fps H.264/AVC Intra-Frame Video Encoder Chip in 65nm CMOS Author *Dajiang Zhou, Gang He, Wei Fei, Zhixiang Chen, Jinjia Zhou, Satoshi Goto (Waseda University, Japan) Page pp. 73 - 74 Keyword H.264/AVC, 4320p, video encoder, low power Abstract An H.264/AVC intra-frame video encoder is implemented in 65nm CMOS. With an efficient intra prediction design, its maximum throughput reaches 1991Mpixels/s for 7680x4320p 60fps video, 9.4x to 32x faster than previous designs. The encoder also incorporates a 1.41Gbins/s CABAC architecture that has been enhanced by 31%. Moreover, low energy consumption is achieved by the high parallelism and hardware efficiency of this design. 1080p 30fps encoding dissipates only 2mW at 0.8V and 9MHz. Slides
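The throughput and energy figures quoted in the 1D-2 abstract can be cross-checked with simple arithmetic. The sketch below is only a consistency check, not the authors' method; it assumes that "1080p" means a 1920x1080 frame, which the abstract does not state, and the function name is hypothetical.

```python
def pixel_rate(width, height, fps):
    """Pixels per second for a given frame size and frame rate."""
    return width * height * fps

# 7680x4320 at 60 fps, as quoted in the abstract
rate_4320p = pixel_rate(7680, 4320, 60)

# Assumption: "1080p" is taken to mean 1920x1080 (not stated in the abstract)
rate_1080p = pixel_rate(1920, 1080, 30)

# Energy per pixel at the quoted 2 mW for 1080p 30fps, in pJ/pixel
energy_pj_per_pixel = 2e-3 / rate_1080p * 1e12

# Average pixels processed per clock cycle at the quoted 9 MHz
pixels_per_cycle = rate_1080p / 9e6

print(f"4320p60 pixel rate: {rate_4320p / 1e6:.1f} Mpixels/s")
print(f"1080p30 energy:     {energy_pj_per_pixel:.2f} pJ/pixel")
print(f"1080p30 throughput: {pixels_per_cycle:.2f} pixels/cycle")
```

Under these assumptions the 4320p rate rounds to the 1991 Mpixels/s stated in the abstract, and the 1080p energy falls inside the 24.5-53.6 pJ/pixel range given in the title.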

1D-3 (Time: 10:30 - 10:35)
 Title A Low Power Multimedia Processor Implementing Dynamic Voltage and Frequency Scaling Technique Author Tadayoshi Enomoto, *Nobuaki Kobayashi (Chuo University, Japan) Page pp. 75 - 76 Keyword motion estimation, Multimedia Processor, DVFS, power dissipation Abstract A 90-nm CMOS multimedia processor was developed by employing a dynamic voltage and frequency scaling (DVFS) technique to greatly reduce the power dissipation (P). To adaptively predict the optimum supply voltage (VD) and the optimum clock frequency (fc), a fast motion estimation (ME) algorithm, an absolute difference accumulator, and a DVFS controller were developed. Measured P of the multimedia processor was 34.4 µW, which was only 0.48% of that of a conventional multimedia processor. Slides

1D-4 (Time: 10:35 - 10:40)
 Title A 40-nm 0.5-V 12.9-pJ/Access 8T SRAM Using Low-Power Disturb Mitigation Technique Author *Shusuke Yoshimoto, Masaharu Terada, Shunsuke Okumura (Kobe University, Japan), Toshikazu Suzuki (Panasonic Corporation, Japan), Shinji Miyano (Semiconductor Technology Academic Research Center, Japan), Hiroshi Kawaguchi, Masahiko Yoshimoto (Kobe University, Japan) Page pp. 77 - 78 Keyword SRAM, 8T, Low power, half select, write back Abstract This paper presents a novel disturb mitigation technique which achieves low-power and low-voltage SRAM. Our proposed technique consists of a floating bitline technique and a low-swing bitline driver (LSBD). We fabricated a 512-Kb 8T SRAM test chip that operates at a single 0.5-V supply voltage. The proposed technique achieves 1.52-pJ/access active energy in a write cycle and 72.8-uW leakage power, which are 59.4% and 26.0% better than the conventional write-back technique. Slides

1D-5 (Time: 10:40 - 10:45)
 Title A Physical Unclonable Function Chip Exploiting Load Transistors’ Variation in SRAM Bitcells Author *Shunsuke Okumura, Shusuke Yoshimoto, Hiroshi Kawaguchi (Kobe University, Japan), Masahiko Yoshimoto (Kobe University/JST CREST, Japan) Page pp. 79 - 80 Keyword SRAM, PUF, Chip ID Abstract We propose a chip identification (ID) generating scheme with random variation of transistor characteristics in SRAM bitcells. In the proposed scheme, a unique fingerprint is generated by grounding both bitlines. It has high speed, and it can be implemented in a very small area overhead. We fabricated test chips in a 65-nm process and obtained 12,288 sets of unique 128-bit fingerprints, which are evaluated in this paper. The failure rate of the IDs is found to be 2.1 × 10^-12. Slides

1D-6 (Time: 10:45 - 10:50)
 Title Over 10-Times High-speed, Energy Efficient 3D TSV-Integrated Hybrid ReRAM/MLC NAND SSD by Intelligent Data Fragmentation Suppression Author *Chao Sun (Chuo University/University of Tokyo, Japan), Hiroki Fujii (University of Tokyo, Japan), Kousuke Miyaji, Koh Johguchi (Chuo University, Japan), Kazuhide Higuchi (University of Tokyo, Japan), Ken Takeuchi (Chuo University, Japan) Page pp. 81 - 82 Keyword SSD, ReRAM, TSV, MLC NAND Abstract A 3D through-silicon-via (TSV)-integrated hybrid ReRAM/multi-level-cell (MLC) NAND solid-state drive (SSD) architecture is proposed with a NAND-like interface (I/F) and a sector-access overwrite policy for ReRAM. Furthermore, intelligent data management algorithms are proposed to suppress data fragmentation and excess usage of MLC NAND. As a result, an 11-times performance increase, a 6.9-times endurance enhancement and a 93% write energy reduction are achieved. Both ReRAM write and read latency should be less than 3 µs to obtain these improvements. The required endurance for ReRAM is 10^5.

1D-7 (Time: 10:50 - 10:55)
 Title Highly Reliable Solid-State Drives (SSDs) with Error-Prediction LDPC (EP-LDPC) Architecture and Error-Recovery Scheme Author *Shuhei Tanakamaru, Yuki Yanagihara (The University of Tokyo, Japan), Ken Takeuchi (Chuo University, Japan) Page pp. 83 - 84 Keyword Solid-state drive, SSD, Error-correcting code, ECC, LDPC Abstract An SSD with 11-times extended lifetime and 76% reduced error is proposed. The error-prediction LDPC realizes both 7-times faster read and high reliability. Errors are most efficiently corrected by calibrating memory data based on the VTH, inter-cell coupling, write/erase cycles and data-retention time. An error-recovery scheme with a program-disturb error-recovery pulse and a data-retention error-recovery pulse is also proposed to reduce the program-disturb error and the data-retention error by 76% and 56%, respectively.

1D-8 (Time: 10:55 - 11:00)
 Title A 3Gb/s 2.08mm^2 100b Error-Correcting BCH Decoder in 0.13µm CMOS Process Author *Youngjoo Lee, Hoyoung Yoo, In-Cheol Park (KAIST, Republic of Korea) Page pp. 85 - 86 Keyword ECC, BCH, decoder, optimization Abstract This paper presents a high-throughput BCH decoder that can correct 100 bit-errors. Several optimization methods are proposed to reduce the hardware complexity caused by the large error-correction capability. Based on the proposed methods, an 8-parallel decoder is designed for the (9592, 8192, 100) BCH code, which achieves a decoding throughput of 3Gb/s and occupies 2.08mm^2 in a 0.13µm CMOS process.
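The (9592, 8192, 100) parameters in the 1D-8 abstract can be unpacked with a short calculation. The sketch below reads the triple as (n, k, t) and relies on the general property that a binary BCH code over GF(2^m) needs at most m·t parity bits; the field size m is inferred here for illustration and is an assumption, since the abstract does not state it.

```python
# Code parameters as quoted in the abstract, read as (n, k, t)
n, k, t = 9592, 8192, 100

parity_bits = n - k  # redundancy added by the code

# Assumption: smallest m such that the natural length 2^m - 1 can hold n bits
m = 1
while 2**m - 1 < n:
    m += 1

natural_length = 2**m - 1          # unshortened BCH code length over GF(2^m)
shortened_by = natural_length - n  # bits removed by shortening
code_rate = k / n

print(f"parity bits:     {parity_bits}")
print(f"inferred m:      {m} (m * t = {m * t})")
print(f"natural length:  {natural_length}, shortened by {shortened_by}")
print(f"code rate:       {code_rate:.3f}")
```

With these numbers the parity count equals m·t exactly, which is consistent with a shortened binary BCH code over GF(2^14), though the paper itself should be consulted for the actual construction.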

1D-9 (Time: 11:00 - 11:05)
 Title A 6.72-Gb/s, 8pJ/bit/iteration WPAN LDPC Decoder in 65nm CMOS Author *Zhixiang Chen, Xiao Peng, Xiongxin Zhao, Leona Okamura, Dajiang Zhou, Satoshi Goto (Graduate School of Information, Production and Systems, Waseda University, Japan) Page pp. 87 - 88 Keyword LDPC, Decoder, WPAN, IEEE 802.15.3c, Permutation Network Abstract An LDPC decoder in 65nm CMOS targeting WPAN (IEEE 802.15.3c) is presented with measurement results. A modified-PCM based message permutation strategy with compatible data flow is proposed to solve the network problem raised by high-parallelism LDPC decoding. Compared to the state of the art, the decoder chip achieves 17.7%, 33.5% and 49% improvements in chip density, gate count and energy efficiency, respectively. Slides

1D-10 (Time: 11:05 - 11:10)
 Title A 7.5Gb/s Referenceless Transceiver for UHDTV with Adaptive Equalization and Bandwidth Scanning Technique in 0.13um CMOS Process Author *Junyoung Song (Korea University, Republic of Korea), Hyunwoo Lee (Hynix Inc., Republic of Korea), Sewook Hwang (Korea University, Republic of Korea), Inhwa Jung (Hynix Inc., Republic of Korea), Chulwoo Kim (Korea University, Republic of Korea) Page pp. 89 - 90 Keyword Transceiver, CDR, PLL, Equalizer, Wireline Abstract A 7.5Gb/s referenceless transceiver for the ultra-high definition television is designed in a 0.13µm CMOS process. By applying the dynamic pre-emphasis calibration and the bandwidth scanning clock generators, measured eye opening and jitter of the clock are enhanced by 39.6% and 40%, respectively. Also the data-width comparison based adaptive equalizer with self-adjusting reference voltage is proposed. Slides

1D-11 (Time: 11:10 - 11:15)
 Title A 12.5 Gb/s/Link Non-Contact Multi Drop Bus System with Impedance-Matched Transmission Line Couplers and Dicode Partial-Response Channel Transceivers Author *Atsutake Kosuge, Wataru Mizuhara, Noriyuki Miura, Masao Taguchi, Hiroki Ishikuro, Tadahiro Kuroda (Keio University, Japan) Page pp. 91 - 92 Keyword Memory Interface, Coupler, Partial Response, Multi-Drop Bus Abstract A reduced-reflection multi-drop bus system using Dicode (1-D) partial response signaling transceiver is presented for the first time in the world. Directional couplers on transmission lines arranged with equi-energy distributing and exact impedance matched conditions allow the bus to reach to 12.5Gbps/link speed, which is the world’s fastest data link speed with multi-drop bus architecture. Dicode partial-response signaling method with a half-rate architecture was used where a precoder is placed in the transmitter to make the signal best fit for the channel to eliminate inter symbol interference (ISI). Slides

1D-12 (Time: 11:15 - 11:20)
 Title 315MHz OOK Transceiver with 38-µW Receiver and 36-µW Transmitter in 40-nm CMOS Author *Shunta Iguchi (University of Tokyo, Japan), Akira Saito (Semiconductor Technology Academic Research Center, Japan), Kentaro Honda, Yunfei Zheng (University of Tokyo, Japan), Kazunori Watanabe (Semiconductor Technology Academic Research Center, Japan), Takayasu Sakurai, Makoto Takamiya (University of Tokyo, Japan) Page pp. 93 - 94 Keyword Transceiver, Sensor node, Low voltage, Low power, Intermittent sampling Abstract A 1-Mbps, 315MHz OOK transceiver in 40-nm CMOS for body area networks is developed. Both a 38-pJ/bit carrier-frequency-free intermittent sampling receiver with -55dBm sensitivity and a 36-pJ/bit transmitter applying a dual supply voltage scheme with -20dBm output power achieve the lowest energy among the published transceivers for wireless sensor networks. Slides
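The µW figures in the 1D-12 title and the pJ/bit figures in its abstract are related through the 1-Mbps data rate: energy per bit is power divided by bit rate. A minimal sketch of that conversion follows; the helper name is hypothetical, and it assumes the quoted power is drawn continuously at the full data rate.

```python
def energy_per_bit_pj(power_w, bit_rate_bps):
    """Energy per bit in picojoules, given power in watts and rate in bit/s."""
    return power_w / bit_rate_bps * 1e12

BIT_RATE = 1e6  # 1 Mbps, as quoted in the abstract

rx_pj = energy_per_bit_pj(38e-6, BIT_RATE)  # 38-µW receiver from the title
tx_pj = energy_per_bit_pj(36e-6, BIT_RATE)  # 36-µW transmitter from the title

print(f"receiver:    {rx_pj:.1f} pJ/bit")
print(f"transmitter: {tx_pj:.1f} pJ/bit")
```

Under that assumption the results match the 38-pJ/bit and 36-pJ/bit values stated in the abstract.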

1D-13 (Time: 11:20 - 11:25)
 Title A Full 4-Channel 60GHz Direct-Conversion Transceiver Author *Seitaro Kawai, Ryo Minami, Ahmed Musa, Takahiro Sato, Ning Li, Tatsuya Yamaguchi, Yasuaki Takeuchi, Yuki Tsukui, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan) Page pp. 95 - 96 Keyword 60GHz, CMOS, transceiver Abstract This paper presents a 60-GHz direct-conversion transceiver in 65 nm CMOS technology. By the proposed gain peaking technique, this transceiver realizes good gain flatness and is capable of more than 7Gbps in 16QAM wireless communication for every channel of the IEEE802.15.3c standard within EVM of around -23dB. The transceiver consumes 319 mW in transmitting and 223 mW in receiving, which includes the PLL consumption. Slides

1D-14 (Time: 11:25 - 11:30)
 Title A Sub-harmonic Injection-locked Frequency Synthesizer with Frequency Calibration Scheme for Use in 60GHz TDD Transceivers Author *Teerachot Siriburanon, Wei Deng, Ahmed Musa, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan) Page pp. 97 - 98 Keyword 60GHz, Synthesizer, Calibration, Injection-locked Abstract A 58.1-to-65.0 GHz frequency synthesizer using sub-harmonic injection-locking technique is presented. The synthesizer can generate all 60GHz channels defined by IEEE 802.15.3c, wirelessHD, IEEE 802.11ad, WiGig, and ECMA-387. A frequency calibration scheme is proposed to monitor frequency shift resulting from environmental variations. Implemented in a 65nm CMOS process, the synthesizer achieves a typical phase noise of -117 dBc/Hz @10MHz offset from a carrier frequency of 61.56 GHz.

1D-15 (Time: 11:30 - 11:35)
 Title A Fractional-N Harmonic Injection-locked Frequency Synthesizer with 10MHz-6.6GHz Quadrature Outputs for Software-Defined Radios Author *Wei Deng, Ahmed Musa, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan) Page pp. 99 - 100 Keyword synthesizer, fractional-N, SDR, PLL, injection-locked Abstract This paper presents an area-efficient frequency synthesizer with a quadrature phase output using a fractional-N injection-locking technique for software-defined radios. A background calibration scheme is proposed to compensate for the PVT variations. Implemented in a 65nm CMOS process, this work demonstrates 10 MHz to 6.6 GHz continuous quadrature frequency coverage, while occupying only a small area of 0.38 mm^2 and consuming 16-26 mW depending on output frequency, from a 1.2 V power supply. The normalized phase noise achieves -135.3 dBc/Hz at 3 MHz offset, and -95.1 dBc/Hz in-band phase noise at 10 kHz offset, from a 1.7 GHz carrier frequency.

1D-16 (Time: 11:35 - 11:40)
 Title A Ring-VCO-Based Sub-Sampling PLL CMOS Circuit with 0.73 ps Jitter and 20.4 mW Power Consumption Author *Kenta Sogo, Akihiro Toya, Takamaro Kikkawa (Research Institute for Nanodevice and Bio Systems, Hiroshima University, Japan) Page pp. 101 - 102 Keyword PLL, CMOS, Jitter, Sampling, Phase noise Abstract This paper presents a ring voltage-controlled-oscillator (ring-VCO)-based sub-sampling phase-locked loop (PLL) CMOS circuit with low phase noise and low jitter. A 2.08 GHz PLL is developed in 65 nm CMOS technology. The in-band phase noise is -119.1 dBc/Hz at 1 MHz, and the output jitter integrated from 1 kHz to 10 MHz is 0.73 ps (rms) with a power consumption of 20.4 mW. The normalized jitter-power product is -229.7 dB. Slides
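
The abstract of 1D-16 does not define its normalized jitter-power product. As a rough cross-check, the sketch below assumes the figure of merit commonly used for PLLs, FoM = 10·log10((σ / 1 s)² · (P / 1 mW)); the function name `jitter_power_fom_db` is hypothetical. With the quoted 0.73 ps and 20.4 mW it lands within about 0.1 dB of the reported -229.7 dB, a gap that is consistent with rounding of the quoted jitter.

```python
import math


def jitter_power_fom_db(jitter_rms_s: float, power_w: float) -> float:
    """Jitter-power figure of merit in dB (assumed definition, not from the paper).

    FoM = 10*log10((sigma / 1 s)^2 * (P / 1 mW))
    """
    return 20.0 * math.log10(jitter_rms_s) + 10.0 * math.log10(power_w / 1e-3)


# Values quoted in the abstract: 0.73 ps rms jitter, 20.4 mW power.
fom = jitter_power_fom_db(0.73e-12, 20.4e-3)
print(f"{fom:.1f} dB")
```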

1D-17 (Time: 11:40 - 11:45)
 Title Design of a Clock Jitter Reduction Circuit Using Gated Phase Blending Between Self-Delayed Clock Edges Author *Kiichi Niitsu, Naohiro Harigai, Daiki Hirabayashi, Daiki Oki, Masato Sakurai (Gunma University, Japan), Osamu Kobayashi (STARC, Japan), Takahiro J. Yamaguchi, Haruo Kobayashi (Gunma University, Japan) Page pp. 103 - 104 Keyword jitter, clock, PLL, jitter reduction, CMOS Abstract The design of a clock jitter reduction circuit that exploits phase blending between uncorrelated self-delayed clock edges is demonstrated. By blending uncorrelated clock edges, the output clock edges approach the ideal timing and, thus, timing jitter can be reduced by a factor of the square root of two per stage. Measurement results with a 180-nm CMOS prototype chip demonstrated an approximately four-fold reduction in timing jitter, from 30.2 ps to 8.8 ps in a 500-MHz clock, by cascading four stages of the proposed circuit.
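
As a worked check of the figures in the abstract of 1D-17, assume each stage ideally averages two fully uncorrelated edges, so that the rms jitter after $N$ stages is the input jitter $\sigma_0$ divided by $(\sqrt{2})^N$. The symbols $\sigma_0$ and $\sigma_N$ are introduced here for illustration only.

$$
\sigma_N = \frac{\sigma_0}{(\sqrt{2})^{N}}, \qquad \sigma_4 = \frac{30.2\ \text{ps}}{(\sqrt{2})^{4}} = \frac{30.2\ \text{ps}}{4} = 7.55\ \text{ps}
$$

The measured 8.8 ps corresponds to a reduction of $30.2 / 8.8 \approx 3.4$, somewhat short of the ideal factor of 4 under this assumption.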

1D-18 (Time: 11:45 - 11:50)
 Title A 25-Gb/s LD Driver with Area-Effective Inductor in a 0.18-µm CMOS Author *Takeshi Kuboki (Kyoto University, Japan), Yusuke Ohtomo (NTT Electronics, Japan), Akira Tsuchiya (Kyoto University, Japan), Keiji Kishine (University of Shiga Prefecture, Japan), Hidetoshi Onodera (Kyoto University, Japan) Page pp. 105 - 106 Keyword optical interconnect, LD driver, Interwoven inductor Abstract This paper presents a high-speed and area-efficient laser-diode driver with an interwoven inductor in 0.18-μm CMOS. We interweave ten peaking inductors for area-effective implementation as well as performance enhancement. The interwoven inductor not only achieves area efficiency but also tunes the frequency characteristic. The mutual inductances of the interwoven inductor enhance bandwidth and suppress group delay dispersion. The test chip area is 0.32 mm² and the maximum operating speed is 25 Gb/s. Slides

1D-19 (Time: 11:50 - 11:55)
 Title A Regulated Charge Pump with Low-Power Integrated Optimum Power Point Tracking Algorithm for Indoor Solar Energy Harvesting Author *Jungmoon Kim, Chulwoo Kim (Korea University, Republic of Korea) Page pp. 107 - 108 Keyword Photovoltaic systems, solar energy harvesting, charge pump, maximum power point tracking, optimum power point tracking Abstract This paper presents a regulated charge pump (CP) with an integrated optimum power point tracking (OPPT) algorithm designed for indoor solar energy harvesting. The proposed OPPT circuit does not require a current sensor that consumes power proportionally to the load. The solar cell voltage is regulated at the optimum power point; the CP output is regulated according to the target voltage. The controller of the OPPT circuit and CP dissipates only 450nW, so the proposed technique is appropriate for indoor solar energy harvesting applications under dim lighting conditions. Slides

1D-20 (Time: 11:55 - 12:00)
 Title A Low Voltage Buck DC-DC Converter Using On-Chip Gate Boost Technique in 40nm CMOS Author *Xin Zhang, Po-Hung Chen (University of Tokyo, Japan), Yoshikatsu Ryu (Semiconductor Technology Academic Research Center, Japan), Koichi Ishida (University of Tokyo, Japan), Yasuyuki Okuma, Kazunori Watanabe (Semiconductor Technology Academic Research Center, Japan), Takayasu Sakurai, Makoto Takamiya (University of Tokyo, Japan) Page pp. 109 - 110 Keyword DC-DC converter, PWM controller, low voltage Abstract A low voltage buck DC-DC converter (0.45-V input, 0.4-V output) with an on-chip gate boosted (OGB) and clock frequency scaled digital PWM controller is designed in a 40-nm CMOS process. The highest efficiency to date is achieved at output power below 40 µW. To compensate for the die-to-die delay variations of a delay line in the proposed digital PWM controller, a linear delay trimming by a logarithmic stress voltage (LSV) scheme with good controllability is also proposed and verified in measurement. Slides

1D-21 (Time: 12:00 - 12:05)
 Title A 0.35-0.8V 8b 0.5-35MS/s 2bit/step Extremely-low Power SAR ADC Author *Kentaro Yoshioka, Akira Shikata, Ryota Sekimoto, Tadahiro Kuroda, Hiroki Ishikuro (Keio University, Japan) Page pp. 111 - 112 Keyword SAR ADC, Extreme-low voltage, 2bit/step, Low power, Power efficient Abstract An extremely low-voltage, high-speed, low-power 2bit/step asynchronous SAR ADC is presented. A wide-range dynamic-threshold-configuring comparator is proposed to enable power- and area-efficient 2bit/step operation. By configuring the comparator threshold with simple Vcm-biased current sources, the ADC is immune to 10% power supply variation. The prototype ADC fabricated in 40nm CMOS achieved 44.3 dB SNDR at 6.14 MS/s with a single supply voltage of 0.5 V. The ADC achieved a peak FoM of 5.9 fJ/conv-step at 0.4 V and operates down to 0.35 V. Slides

Session 2A  Special Session: Dependability of on-Chip Systems
Time: 13:40 - 15:40 Wednesday, January 23, 2013
Organizer: Jörg Henkel (Karlsruhe Institute of Technology, Germany)

2A-1 (Time: 13:40 - 14:20)
 Title (Invited Paper) Thermal Management for Dependable on-chip Systems Author *Jörg Henkel, Thomas Ebi, Hussam Amrouch, Heba Khdr (Karlsruhe Institute of Technology, Germany) Page pp. 113 - 118 Keyword Dependability, Thermal Management, Aging Abstract Dependability has become a growing concern in the nano-CMOS era due to elevated temperatures and an increased susceptibility to temperature of the small structures. We present an overview of temperature-related effects that threaten dependability and a methodology for reducing the dependability concerns through thermal management utilizing the concept of aging budgeting.

2A-2 (Time: 14:20 - 15:00)
 Title (Invited Paper) Dependable VLSI Platform using Robust Fabrics Author *Hidetoshi Onodera (Kyoto University, Japan) Page pp. 119 - 124 Keyword dependable VLSI, DFM, variability, soft error, aging Abstract Extreme scaling imposes enormous challenges on LSI design, such as manufacturability degradation, variability increase, performance aging, and soft-error vulnerability. To overcome these difficulties, we have been developing a VLSI platform that can realize dependable circuits with the required level of reliability. The platform project tackles the challenges through collaborative research on layout, circuit, architecture, and design automation. An overview of the project and its key achievements at the component level and the architecture level is presented, followed by a brief introduction to the platform SoC and its C-based design tools.

2A-3 (Time: 15:00 - 15:40)
 Title (Invited Paper) Variability-Aware Memory Management for Nanoscale Computing Author *Nikil Dutt (University of California, Irvine, U.S.A.), Puneet Gupta (University of California, Los Angeles, U.S.A.), Alex Nicolau, Luis Angel D. Bathen (University of California, Irvine, U.S.A.), Mark Gottscho (University of California, Los Angeles, U.S.A.) Page pp. 125 - 132 Keyword Memory Management, Variation-Aware Design Abstract As the semiconductor industry continues to push the limits of sub-micron technology, the ITRS expects hardware (e.g., die-to-die, wafer-to-wafer, and chip-to-chip) variations to continue increasing over the next few decades. As a result, it is imperative for designers to build variation-aware software stacks that may adapt and opportunistically exploit said variations to increase system performance/responsiveness as well as minimize power consumption. The memory subsystem is one of the largest components in today’s computing system, a main contributor to the overall power consumption of the system, and therefore one of the most vulnerable components to the effects of variations (e.g., power). This paper discusses the concept of variability-aware memory management for nanoscale computing systems. We show how to opportunistically exploit the hardware variations in on-chip and off-chip memory at the system level through the deployment of variation-aware software stacks.

Session 2B  Logic Synthesis
Time: 13:40 - 15:40 Wednesday, January 23, 2013
Chairs: Yuko Hara-Azumi (Nara Institute of Science and Technology, Japan), Shigeru Yamashita (Ritsumeikan University, Japan)

2B-1 (Time: 13:40 - 14:10)
 Title MIXSyn: An Efficient Logic Synthesis Methodology for Mixed XOR-AND/OR Dominated Circuits Author *Luca Amarú, Pierre-Emmanuel Gaillardon, Giovanni De Micheli (Integrated Systems Laboratory, Ecole Polytechnique Federale de Lausanne, Switzerland) Page pp. 133 - 138 Keyword Logic Synthesis, XOR-intensive, Library-free Technology Mapping, Ambipolar Transistors Abstract We present a new logic synthesis methodology, called MIXSyn, that produces area-efficient results for mixed XOR-AND/OR dominated logic functions. MIXSyn is a two-step synthesis process. The first step is a hybrid logic optimization that enables selective and distinct optimization of AND/OR and XOR-intensive portions of the logic circuit. The second step is a library-free technology mapping that enhances design flexibility with a tractable computational cost. MIXSyn has been tested on a set of large MCNC benchmarks. Experimental results indicate that MIXSyn produces CMOS circuits with 18.0% and 9.2% fewer devices, on average, than state-of-the-art academic and commercial synthesis tools, respectively. MIXSyn can also exploit the novel XOR implementations offered by double-gate ambipolar devices. Experimental results show that MIXSyn can reduce the number of ambipolar transistors by 20.9% and 15.3%, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively. Slides

2B-2 (Time: 14:10 - 14:40)
 Title Optimizing Multi-level Combinational Circuits for Generating Random Bits Author Chen Wang, *Weikang Qian (Shanghai Jiao Tong University, China) Page pp. 139 - 144 Keyword logic synthesis, random bit generation, probabilistic computation Abstract Random bits are an important construct in many applications, such as hardware-based implementation of probabilistic algorithms and weighted random testing. One approach to generating random bits with required probabilities is to synthesize combinational circuits that transform a set of source probabilities into target probabilities. In [1], the authors proposed a greedy algorithm that synthesizes circuits in the form of a gate chain to approximate target probabilities. However, since this approach only considers circuits of such a special form, the resulting circuits are unsatisfactory in terms of both the approximation error and the circuit depth. In this paper, we propose a new algorithm to synthesize combinational circuits for generating random bits. Compared to the previous one, our approach greatly enlarges the search space. Also, we apply a linear property of probabilistic logic computation and an iterative local search method to increase the efficiency of our algorithm. Experimental results comparing the approximation error and the depth of the circuits synthesized by our method to those of the circuits produced by the previous approach demonstrate the superiority of our method. Slides
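
To make the idea in 2B-2 of transforming source probabilities concrete, the sketch below evaluates the output probability of a small gate chain, assuming every input bit is statistically independent. It is only an illustration of the probability arithmetic, not the greedy algorithm of [1] or the authors' synthesis method; the source set {0.4, 0.5} and the names `gate_probability` and `chain_probability` are hypothetical.

```python
from fractions import Fraction


def gate_probability(kind, p, q=None):
    """Probability that a gate outputs 1, given independent input probabilities."""
    if kind == "NOT":
        return 1 - p
    if kind == "AND":
        return p * q
    if kind == "OR":
        return p + q - p * q
    raise ValueError(f"unknown gate: {kind}")


def chain_probability(first, steps):
    """Evaluate a gate chain: each step combines the running output
    with one fresh, independent source bit (or inverts it)."""
    p = first
    for kind, q in steps:
        p = gate_probability(kind, p, q)
    return p


# Hypothetical source probabilities 0.4 and 0.5, kept exact with Fraction.
P04, P05 = Fraction(2, 5), Fraction(1, 2)
result = chain_probability(P04, [("AND", P05), ("NOT", None), ("AND", P04)])
print(result)  # 8/25
```

Here the chain yields 0.4 · 0.5 = 0.2, then 1 - 0.2 = 0.8, then 0.8 · 0.4 = 0.32, a probability that is not in the source set.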

2B-3 (Time: 14:40 - 15:10)
 Title Improving the Mapping of Reversible Circuits to Quantum Circuits Using Multiple Target Lines Author *Robert Wille, Mathias Soeken, Christian Otterstedt, Rolf Drechsler (University of Bremen, Germany) Page pp. 145 - 150 Keyword quantum, reversible, synthesis, optimization, circuits Abstract The efficient synthesis of quantum circuits is an active research area. Since many of the known quantum algorithms include a large Boolean component (e.g. the database in the Grover search algorithm), quantum circuits are commonly synthesized in a two-stage approach. First, the desired function is realized as a reversible circuit using existing synthesis methods for this domain. Afterwards, each reversible gate is mapped to a functionally equivalent quantum gate cascade. In this paper, we propose an improved mapping of reversible circuits to quantum circuits which exploits a certain structure of many reversible circuits. In fact, it can be observed that reversible circuits are often composed of similar gates which differ only in the position of their target lines. We introduce an extension of reversible gates which allows multiple target lines in a single gate. This enables a significantly cheaper mapping to quantum circuits. Experiments show that considering multiple target lines leads to improvements of up to 85% in the resulting quantum cost. Slides

Session 2C  Simulation for Thermal and Power Grid Analysis
Time: 13:40 - 15:40 Wednesday, January 23, 2013
Chair: Youngsoo Shin (Korea Advanced Institute of Science and Technology, Republic of Korea)

2C-1 (Time: 13:40 - 14:10)
 Title I-LUTSim: An Iterative Look-Up Table Based Thermal Simulator for 3-D ICs Author *Chi-Wen Pan, Yu-Min Lee (National Chiao Tung University, Taiwan), Pei-Yu Huang (Industrial Technology Research Institute, Taiwan), Chi-Ping Yang (National Chiao Tung University, Taiwan), Chang-Tzu Lin, Chia-Hsin Lee, Yung-Fa Chou, Ding-Ming Kwai (Industrial Technology Research Institute, Taiwan) Page pp. 151 - 156 Keyword 3-D, IC, Thermal, Simulator, Table Abstract This work presents an iterative look-up table based thermal simulator, I-LUTSim, to efficiently estimate the temperature profile of three-dimensional integrated circuits. I-LUTSim includes two stages. First, the pre-process stage constructs thermal impulse response tables. Then, the simulation stage iteratively calculates the temperature profile via table lookup. With this two-stage scheme, the maximum absolute error of I-LUTSim is less than 0.41% relative to the commercial tool ANSYS. Moreover, I-LUTSim is at least an order of magnitude faster than the fast matrix solver SuperLU for full-chip temperature simulation. Slides

2C-2 (Time: 14:10 - 14:40)
 Title Compact Nonlinear Thermal Modeling of Packaged Integrated Systems Author *Zao Liu, Sheldon X.-D. Tan, Hai Wang (University of California, Riverside, U.S.A.), Ashish Gupta (Intel Corporation, U.S.A.), Sahana Swarup (University of California, Riverside, U.S.A.) Page pp. 157 - 162 Keyword Thermal modeling, Nonlinear, Subspace identification Abstract This paper proposes a new nonlinear thermal modeling technique for packaged integrated systems. The thermal behavior of complicated systems such as packaged electronic systems may exhibit nonlinear and temperature-dependent properties. As a result, it is difficult to use a low-order linear model to approximate the thermal behavior of packaged integrated systems without accuracy loss. In this paper, we mitigate this problem by using a piecewise linear (PWL) approach to characterize the thermal behavior of those systems. The new method (called ThermSubPWL), which is the first proposed approach to the nonlinear thermal modeling problem, identifies linear local models for different temperature ranges using the subspace identification method. A linear transformation method is proposed to transform all the identified linear local models to a common state basis to build the continuous piecewise linear model. Experimental results validate the proposed method on a realistic packaged integrated system modeled via the multidomain/physics commercial tool, COMSOL, under practical power signal inputs. The new piecewise models can lead to a much smaller model order without accuracy loss, which translates to significant savings in both the simulation time and the time required to identify the reduced models compared to applying the high-order models. Slides

2C-3 (Time: 14:40 - 15:10)
 Title A Multilevel H-matrix-based Approximate Matrix Inversion Algorithm for Vectorless Power Grid Verification Author Wei Zhao, Yici Cai, *Jianlei Yang (Dept. of Computer Science and Technology, Tsinghua University, China) Page pp. 163 - 168 Keyword Power grid, Vectorless verification, H-matrix, Multilevel method Abstract Vectorless power grid verification technique makes it possible to estimate the worst-case voltage fluctuations of the on-chip power delivery network at the early design stage. For most of the existing vectorless verification algorithms, the subproblem of linear system solution which computes the inverse of the power grid matrix takes up a large part of the computation time and has become a critical bottleneck of the whole algorithm. In this paper, we propose a new algorithm that combines the H-matrix-based technique and the multilevel method to construct a data-sparse approximate inverse of the power grid matrix. Experimental results have shown that the proposed algorithm can obtain an almost linear complexity both in runtime and memory consumption for efficient vectorless power grid verification. Slides

2C-4 (Time: 15:10 - 15:40)
 Title Realization of Frequency-Domain Circuit Analysis Through Random Walk Author Tetsuro Miyakawa, Hiroshi Tsutsui, Hiroyuki Ochi, *Takashi Sato (Kyoto University, Japan) Page pp. 169 - 174 Keyword AC analysis, Random walk algorithm, Importance sampling, Incremental analysis Abstract This paper presents, for the first time, the realization of frequency-domain circuit analysis based on a random walk framework. In conventional random walk based circuit analyses, the sample movement at a node is randomly chosen according to the edge probabilities. The probabilities are determined by the admittances of the edges connected to the node, so the scheme cannot be applied to frequency-domain analysis, where the probabilities are imaginary numbers. By applying the idea of importance sampling, the intractable imaginary probabilities are converted into real numbers while maintaining the estimation correctness. Runtime acceleration through incremental analysis is also proposed.
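
The conventional scheme that the abstract of 2C-4 starts from can be sketched for a purely resistive (DC) network: from each node the walker steps to a neighbor with probability proportional to the connecting conductance, and the average voltage of the fixed-voltage nodes it ends on estimates the start node's voltage. This is a minimal sketch of that real-valued baseline only, not the paper's importance-sampling extension; the three-resistor chain and the name `random_walk_voltage` are hypothetical.

```python
import random


def random_walk_voltage(start, conductance, fixed, walks=20000, seed=0):
    """Estimate the DC voltage at `start` by averaging random-walk outcomes.

    conductance: {node: {neighbor: conductance_in_siemens}} for free nodes
    fixed:       {node: voltage} for nodes tied to a voltage source
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(walks):
        node = start
        while node not in fixed:
            neighbors = list(conductance[node])
            weights = [conductance[node][n] for n in neighbors]
            # Step to a neighbor with probability g_edge / sum of g at this node.
            node = rng.choices(neighbors, weights=weights)[0]
        total += fixed[node]
    return total / walks


# Chain 0 -- 1 -- 2 -- 3 with equal resistors; node 0 at 0 V, node 3 at 1 V.
conductance = {1: {0: 1.0, 2: 1.0}, 2: {1: 1.0, 3: 1.0}}
fixed = {0: 0.0, 3: 1.0}
estimate = random_walk_voltage(1, conductance, fixed)
print(estimate)  # close to the exact divider value 1/3
```

With complex admittances the edge weights above would no longer be valid probabilities, which is the obstacle the abstract describes.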

Session 2D  Advanced Routing Techniques for Chip and PCB Design
Time: 13:40 - 15:40 Wednesday, January 23, 2013
Chairs: Toshiyuki Shibuya (Fujitsu Laboratory, Japan), Jai-Ming Lin (National Cheng Kung University, Taiwan)

2D-1 (Time: 13:40 - 14:10)
 Title A Separation and Minimum Wire Length Constrained Maze Routing Algorithm under Nanometer Wiring Rules Author *Fong-Yuan Chang, Ren-Song Tsay, Wai-Kei Mak (National Tsing Hua University, Taiwan), Sheng-Hsiung Chen (Springsoft, Taiwan) Page pp. 175 - 180 Keyword Maze, Routing, Nanometer Wiring Rules, DFM, minimum wire length Abstract Due to process limitations, wiring rules are imposed by foundries on chip layout. Under nanometer wiring rules, the required separation between two wire ends depends on their surrounding wires, and there is a limit on the minimum length of each wire segment. Yet, traditional maze routing algorithms are not designed to handle these rules, so rule violations must be corrected by post-processing, and the quality of the result is seriously impacted. For this reason, we propose a new maze routing algorithm capable of handling these wiring rules. The proposed algorithm is proved to find a legal shortest path with a runtime complexity of O(n), where n is the number of grid points. Experiments with seven tight industrial cases show that the success rate of getting a DRC-clean routing by a commercial router is improved, and the average runtime is reduced by 2.3 times. Slides
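
For context on 2D-1, a traditional maze router of the kind the abstract contrasts itself with can be sketched as a breadth-first wave expansion on a grid, which visits each grid point at most once and is therefore O(n) in the number of grid points. The sketch below is that classical baseline only; it ignores the separation and minimum-length rules the paper handles, and the grid and the name `maze_route` are hypothetical.

```python
from collections import deque


def maze_route(grid, source, target):
    """Return a shortest 4-connected path of free cells, or None if unroutable.

    grid: list of rows, where 0 is a free cell and 1 is blocked.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {source: None}  # doubles as the visited set
    queue = deque([source])
    while queue:
        cell = queue.popleft()
        if cell == target:
            path = []
            while cell is not None:  # backtrace from target to source
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in parent:
                parent[(nr, nc)] = cell
                queue.append((nr, nc))
    return None


grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = maze_route(grid, (0, 0), (2, 0))
print(path)
```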

2D-2 (Time: 14:10 - 14:40)
 Title An ILP-based Automatic Bus Planner for Dense PCBs Author Pei-Ci Wu (University of Illinois at Urbana-Champaign, U.S.A.), Qiang Ma (Synopsys, Inc., U.S.A.), *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) Page pp. 181 - 186 Keyword bus planner, printed circuit boards Abstract Modern PCBs have to be routed manually since no EDA tools can successfully route these complex boards. An autorouter for PCBs would improve design productivity tremendously since each board takes about 2 months to route manually. This paper focuses on a major step in PCB routing called bus planning. In the bus planning problem, we need to simultaneously solve the bus decomposition, escape routing, layer assignment and global bus routing. This problem was partially addressed by Kong et al. in [3] where they only focused on the layer assignment and global bus routing, assuming bus decomposition and escape routing are given. In this paper, we present an ILP-based solution to the entire bus planning problem. We apply our bus planner to an industrial PCB (with over 7000 nets and 12 signal layers) which was previously successfully routed manually, and compare with a state-of-the-art industrial internal tool where the layer assignment and global bus routing are based on the algorithm in [3]. Our bus planner successfully routed 97.4% of all the nets. This is a huge improvement over the industrial tool which could only achieve 84.7% routing completion for this board.

2D-3 (Time: 14:40 - 15:10)
 Title Layer Minimization in Escape Routing for Staggered-Pin-Array PCBs Author *Yuan-Kai Ho, Xin-Wei Shih, Yao-Wen Chang (National Taiwan University, Taiwan), Chung-Kuan Cheng (University of California, San Diego, U.S.A.) Page pp. 187 - 192 Keyword Escape routing, PCB routing Abstract As the technology advances, the pin number of a high-end PCB design keeps increasing while the size of a PCB keeps shrinking. The staggered pin array is used to accommodate a larger pin number than the grid pin array of the same area. Nevertheless, escaping a large pin number to the boundary of a dense staggered pin array, namely multilayer escape routing for staggered pin arrays, is significantly harder than that for grid pin arrays. This paper addresses this multilayer escape routing problem to minimize the number of used layers in a staggered pin array for manufacturing cost reduction. We first present an escaped pin selection method to assign a maximal number of escaped pins in the current layer and also to increase useful routing regions for subsequent layers. Missing pins are also modeled in our routing network to utilize the routing resource effectively. Experimental results show that our approach can significantly reduce the required layer number for escape routing.

2D-4 (Time: 15:10 - 15:40)
 Title Network Flow Modeling for Escape Routing on Staggered Pin Arrays Author Pei-Ci Wu, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) Page pp. 193 - 198 Keyword escape routing, staggered pin array, network flow modeling Abstract Staggered pin arrays have recently been introduced for modern designs with high pin density. Although some studies have been done on escape routing for hexagonal arrays, the hexagonal array is only a special kind of staggered pin array. There exist other kinds of staggered pin arrays in current industrial designs, and the existing works cannot be extended to solve them. In this paper, we study the escape routing problem on staggered pin arrays. Network flow models are proposed to correctly model the capacity constraints of staggered pin arrays. Our models are guaranteed to find an escape routing satisfying the capacity constraints if one exists. The correctness of these models leads to an optimal algorithm.

Session 3A  Special Session: Design Automation for Flow-Based Microfluidic Biochips: Connecting Biochemistry to Electronic Design Automation
Time: 16:00 - 18:00 Wednesday, January 23, 2013
Organizer: Tsung-Yi Ho (National Cheng Kung University, Taiwan)

3A-1 (Time: 16:00 - 16:30)
 Title (Invited Paper) A Clique-Based Approach to Find Binding and Scheduling Result in Flow-Based Microfluidic Biochips Author *Trung Anh Dinh, Shigeru Yamashita (Ritsumeikan University, Japan), Tsung-Yi Ho (National Cheng Kung University, Taiwan), Yuko Hara-Azumi (Nara Institute of Science and Technology, Japan) Page pp. 199 - 204 Keyword Flow-based microfluidic biochips, Architectural synthesis, Routing constraints, Resource constraints, Clique Abstract Microfluidic biochips have been recently proposed to integrate all the necessary functions for biochemical analysis. There are several types of microfluidic biochips; among them there has been a great interest in flow-based microfluidic biochips, in which the flow of liquid is manipulated using integrated micro-valves. By combining several microvalves, more complex resource units such as micropumps, switches and mixers can be built. For efficient execution, the flow of liquid routes in microfluidic biochips needs to be scheduled under some specific constraints. Slides

3A-2 (Time: 16:30 - 17:00)
 Title (Invited Paper) Control Synthesis of the Flow-Based Microfluidic Large-Scale Integration Biochips Author Wajid Hassan Minhass, *Paul Pop, Jan Madsen (Technical University of Denmark, Denmark), Tsung-Yi Ho (National Cheng Kung University, Taiwan) Page pp. 205 - 212 Keyword microfluidic, biochips, synthesis, flow-based, control Abstract In this paper we are interested in flow-based microfluidic biochips, which are able to integrate the necessary functions for biochemical analysis on-chip. In these chips, the flow of liquid is manipulated using integrated microvalves. By combining several microvalves, more complex units, such as micropumps, mixers, and multiplexers, can be built. In this paper we propose, for the first time to our knowledge, a top-down control synthesis framework for the flow-based biochips. Starting from a given biochemical application and a biochip architecture, we synthesize the control logic that is used by the biochip controller to automatically execute the biochemical application. We also propose a control pin count minimization scheme aimed at efficiently utilizing chip area, reducing macro-assembly around the chip and enhancing chip scalability. We have evaluated our approach using both real-life applications and synthetic benchmarks.

3A-3 (Time: 17:00 - 17:30)
 Title (Invited Paper) A Network-Flow Based Valve-Switching Aware Binding Algorithm for Flow-Based Microfluidic Biochips Author *Kai-Han Tseng, Sheng-Chi You (National Cheng Kung University, Taiwan), Wajid Hassan Minhass (Technical University of Denmark, Denmark), Tsung-Yi Ho (National Cheng Kung University, Taiwan), Paul Pop (Technical University of Denmark, Denmark) Page pp. 213 - 218 Keyword Flow-based microfluidic biochip, Network flow, Valve minimization Abstract Designs of flow-based microfluidic biochips have recently received much attention because they replace the conventional biological automation paradigm and are able to integrate different biochemical analysis functions on a chip. However, as the design complexity increases, a flow-based microfluidic biochip needs more chip-integrated micro-valves, i.e., the basic unit of fluid-handling functionality, to manipulate the fluid flow for biochemical applications. Moreover, frequent switching of micro-valves may increase power consumption and even cause reliability problems. To minimize the valve-switching activities, we develop a network-flow based resource binding algorithm based on breadth-first search (BFS) and minimum cost maximum flow (MCMF) in architectural-level synthesis. The experimental results show that our methodology not only significantly reduces valve-switching activities but also shortens the application completion time for both real-life applications and a set of synthetic benchmarks. Slides

3A-4 (Time: 17:30 - 18:00)
 Title (Invited Paper) Design and Verification Tools for Continuous Fluid Flow-based Microfluidic Devices Author Jeffrey McDaniel, Aurelila Baez, Brian Crites, Aditya Tammewar, *Philip Brisk (University of California, Riverside, U.S.A.) Page pp. 219 - 224 Keyword Microfluidics, Hardware Design Language Abstract This paper describes an integrated design, verification, and simulation environment for programmable microfluidic devices called laboratories-on-chip (LoCs). Today’s LoCs are architected and laid out by hand, which is time-consuming, tedious, and error-prone. To increase designer productivity, this paper introduces a Microfluidic Hardware Design Language (MHDL) for LoC specification, along with software tools that help LoC designers verify the correctness of their specifications and estimate their performance. Slides

Session 3B  System-Level Synthesis and Optimization
Time: 16:00 - 18:00 Wednesday, January 23, 2013
Chairs: Antoine Trouve (ISIT, Japan), Farhad Mehdipour (Kyushu University, Japan)

3B-1 (Time: 16:00 - 16:30)
 Title Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications Author *Shuangchen Li, Yongpan Liu (Tsinghua University, China), X. Sharon Hu (University of Notre Dame, U.S.A.), Xinyu He, Yining Zhang (Tsinghua University, China), Pei Zhang (Y Explorations Inc., U.S.A.), Huazhong Yang (Tsinghua University, China) Page pp. 225 - 230 Keyword HLS, Partition Abstract Developing FPGA solutions for streaming applications written in C (or its variants) can benefit greatly from automatic C-to-RTL (C2RTL) synthesis. Yet, the complexity and stringent throughput/cost constraints of such applications are rather challenging for existing C2RTL synthesis tools. This paper considers automatic partition and block-level parallelization to address these challenges. An MILP-based approach is introduced for finding an optimal partition of a given program into blocks while allowing block-level parallelization. In order to handle extremely large problem instances, a heuristic algorithm is also discussed. Experimental results based on seven well-known multimedia applications demonstrate the effectiveness of both solutions. Slides

3B-2 (Time: 16:30 - 17:00)
 Title Multi-Mode Pipelined MPSoCs for Streaming Applications Author *Haris Javaid, Daniel Witono, Sri Parameswaran (University of New South Wales, Australia) Page pp. 231 - 236 Keyword Pipelined MPSoCs, Streaming Applications, Multi-mode Accelerators Abstract In this paper, we propose a design flow for the pipelined paradigm of Multi-Processor System on Chips (MPSoCs) targeting multiple streaming applications. A multi-mode pipelined MPSoC, used as a streaming accelerator, executes multiple, mutually exclusive applications through modes where each mode refers to the execution of one application. We model each application as a directed graph. The challenge is to merge application graphs into a single graph so that the multi-mode pipelined MPSoC derived from the merged graph contains minimal resources. We solve this problem by finding maximal overlap between application graphs. Three heuristics are proposed where two of them greedily merge application graphs while the third one finds an optimal merging at the cost of higher running time. The results indicate significant area saving (up to 62% processor area, 57% FIFO area and 44 processor/FIFO ports) with minuscule degradation of system throughput (up to 2%) and latency (up to 2%) and increase in energy values (up to 3%) when compared to the widely used approach of designing distinct pipelined MPSoCs for individual applications. Our work is the first step in the direction of multi-mode pipelined MPSoCs, and the results demonstrate the usefulness of resource sharing among pipelined MPSoCs based streaming accelerators in a multimedia platform. Slides

3B-3 (Time: 17:00 - 17:30)
 Title Network Simplex Method Based Multiple Voltage Scheduling in Power-Efficient High-Level Synthesis Author *Cong Hao, Song Chen, Takeshi Yoshimura (Waseda University, Japan) Page pp. 237 - 242 Keyword High-Level Synthesis, Scheduling, low-power Abstract In this work, we focus on the problem of latency-constrained scheduling with consideration of multiple voltage technologies in high-level synthesis. For the case without resource constraints, we propose an Integer Linear Programming (ILP) formulation for power minimization, whose constraint matrix is the node-arc incidence matrix of a network graph. Accordingly, the formulation is relaxed to a piecewise linear programming problem having only integer feasible solutions and is optimally solved using the efficient piecewise-linear extended network simplex method (PLNSM). The experimental results showed 80X+ speedup compared to the general linear programming formulation. To account for resource usage, we propose a two-stage heuristic, the Network Simplex Method based Power-efficient Multiple Voltage Scheduling (NPMVS) method. First, the above relaxed LP formulation is modified to perform mobility allocation and delay assignment for the operations so as to minimize the power and the differences between the allocated operation mobilities and the predefined target mobilities. The modified formulation is solved using the PLNSM and iteratively performed to minimize power and resource density variation in control steps by gradually updating the predefined target mobilities. Second, with the allocated operation mobilities, we apply dependency-free scheduling with the objective of minimizing the resource usage. Experimental results show that the proposed method can produce optimum solutions for all 6 benchmarks with 14 groups of data in a maximum time of 0.25 seconds. Slides

3B-4 (Time: 17:30 - 18:00)
 Title VISA Synthesis: Variation-Aware Instruction Set Architecture Synthesis Author *Yuko Hara-Azumi (Nara Institute of Science and Technology, Japan), Takuya Azumi (Ritsumeikan University, Japan), Nikil Dutt (University of California, Irvine, U.S.A.) Page pp. 243 - 248 Keyword ISA synthesis, process variation, SSTA, timing faults Abstract We present VISA: a novel Variation-aware Instruction Set Architecture synthesis approach that makes effective use of process variation from both software and hardware points of view. To achieve an efficient speedup, VISA selects custom instructions based on statistical static timing analysis (SSTA) for aggressive clocking. Furthermore, with minimum performance overhead, VISA dynamically detects and corrects timing faults resulting from aggressive clocking of the underlying processor. This hybrid software/hardware approach brings the significant speedup without degrading the yield. Our experimental results on commonly used ISA synthesis benchmarks demonstrate that VISA achieves significant performance improvement compared with a traditional deterministic worst case-based approach (up to 78.0%) and an existing SSTA-based approach (up to 49.4%). Slides

Time: 16:00 - 18:00 Wednesday, January 23, 2013
Chair: Hidetoshi Matsuoka (Fujitsu Laboratory, Japan)

3C-1 (Time: 16:00 - 16:30)
 Title L-Shape Based Layout Fracturing for E-Beam Lithography Author Bei Yu, Jhih-Rong Gao, *David Z. Pan (University of Texas at Austin, U.S.A.) Page pp. 249 - 254 Keyword Electron Beam Lithography, Layout Fracturing, L-shape shot Abstract Layout fracturing is a fundamental step in mask data preparation and e-beam lithography (EBL) writing. To increase EBL throughput, recently a new L-shape writing strategy is proposed, which calls for new L-shape fracturing, versus the conventional rectangular fracturing. Meanwhile, during layout fracturing, one must minimize very small/narrow features, also called slivers, due to manufacturability concern. This paper addresses this new research problem of how to perform L-shaped fracturing with sliver minimization. We propose two novel algorithms. The first one, rectangular merging (RM), starts from a set of rectangular fractures and merges them optimally to form L-shape fracturing. The second algorithm, direct L-shape fracturing (DLF), directly and effectively fractures the input layouts into L-shapes with sliver minimization. The experimental results show that our algorithms are very effective. Slides

3C-2 (Time: 16:30 - 17:00)
 Title High-throughput Electron Beam Direct Writing of VIA Layers by Character Projection using Character Sets Based on One-dimensional VIA Arrays with Area-efficient Stencil Design Author *Rimon Ikeno (The University of Tokyo, Japan), Takashi Maruyama (e-Shuttle, Inc., Japan), Tetsuya Iizuka, Satoshi Komatsu, Makoto Ikeda, Kunihiro Asada (The University of Tokyo, Japan) Page pp. 255 - 260 Keyword Electron Beam Direct Writing, Character Projection, DFM, Layout Design, Interconnect Design Abstract For high-speed electron beam direct writing (EBDW) of VIA layers by Character projection (CP), number of VIAs in each CP shot should be increased, but it will result in huge number of CP characters for arbitrary VIA placements. We adopt one-dimensional VIA arrays as the basic character architecture to increase VIA numbers in a CP shot while saving the stencil area by superposed array characters. CP throughput is further improved by layout constraints for VIA arrangement. Our experimental results give estimated CP exposure counts less than 174G shot/wafer in 14nm technology. Slides

3C-3 (Time: 17:00 - 17:30)
 Title Linear Time Algorithm to Find All Relocation Positions for EUV Defect Mitigation Author Yuelin Du (University of Illinois at Urbana-Champaign, U.S.A.), Hongbo Zhang, Qiang Ma (Synopsys, Inc., U.S.A.), *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) Page pp. 261 - 266 Keyword EUV, Blank Defect Mitigation, Linear Time, Relocation Position, Multi-die Placement Abstract In EUV mask fabrication, die size is usually much smaller than the exposure field, such that one blank can accommodate multiple copies of a die. For thorough utilization of blank area, the number of valid dies that are not impacted by any defects should be maximized. To do so, all relocation positions to place a single valid die must be determined first. In this paper, we develop an efficient linear time algorithm to solve this problem. Slides

3C-4 (Time: 17:30 - 18:00)
 Title Self-Aligned Double and Quadruple Patterning-Aware Grid Routing with Hotspots Control Author *Chikaaki Kodama (Toshiba Corporation, Japan), Hirotaka Ichikawa (Toshiba Microelectronics Corporation, Japan), Koichi Nakayama, Toshiya Kotani, Shigeki Nojima, Shoji Mimotogi, Shinji Miyamoto (Toshiba Corporation, Japan), Atsushi Takahashi (Tokyo Institute of Technology, Japan) Page pp. 267 - 272 Keyword Self-aligned double patterning, Self-aligned quadruple patterning, Grid routing, Lithography, Hotspot Abstract Self-Aligned Double and Quadruple Patterning (SADP, SAQP) have become the most promising processes for sub-20nm and sub-14nm node technology. We propose a simple grid routing method for SADP and SAQP that makes it possible to predict the wafer image. A new grid structure is prepared, and mandrel patterns can be easily derived without complex coloring or decomposition. We also try to reduce hotspots in a wafer image by dummy pattern flipping. A classical maze-routing algorithm is implemented and its effectiveness is confirmed. Slides

Session 3D  Hardware-Software Co-Optimization for Emerging NVMs
Time: 16:00 - 18:00 Wednesday, January 23, 2013
Chairs: Yun (Eric) Liang (Peking University, China), Yiran Chen (University of Pittsburgh, U.S.A.)

3D-1 (Time: 16:00 - 16:30)
 Title Compiler-Assisted Refresh Minimization for Volatile STT-RAM Cache Author Qingan Li (City University of Hong Kong, Hong Kong), Jianhua Li, Liang Shi (University of Science and Technology of China, China), *Chun Jason Xue (City University of Hong Kong, Hong Kong), Yiran Chen (University of Pittsburgh, U.S.A.), Yanxiang He (Wuhan University, China) Page pp. 273 - 278 Keyword volatile STT-RAM, refresh, compiler Abstract Recently, researchers propose to improve the efficiency of STT-RAM by relaxing its non-volatility. To avoid data loss resulting from volatility, dynamic refresh schemes are indispensable. In this paper, we propose to reduce dynamic refresh through re-arranging program data layout at compilation time. Experimental results show that, the proposed methods can reduce the number of refresh operations by 73.3%, reduce the dynamic energy consumption by 27.6%, and in the meantime slightly increase the performance by 0.7%. Slides

3D-2 (Time: 16:30 - 17:00)
 Title Curling-PCM: Application-Specific Wear Leveling for Phase Change Memory based Embedded Systems Author *Duo Liu (College of Computer Science, Chongqing University, China), Tianzheng Wang (Department of Computer Science, University of Toronto, Canada), Yi Wang, Zili Shao (Department of Computing, The Hong Kong Polytechnic University, Hong Kong), Qingfeng Zhuge, Edwin Sha (College of Computer Science, Chongqing University, China) Page pp. 279 - 284 Keyword Phase change memory, wear leveling, embedded systems, application-specific, response time Abstract Phase change memory (PCM) has been used as NOR replacement in embedded systems. However, endurance problems greatly limit its adoption in embedded systems. This paper utilizes application-specific features and proposes a wear leveling technique, Curling-PCM, which periodically moves the hot region and guarantees response time through a partial curling policy. Experimental results show effectiveness of the proposed technique. We expect this work can serve as a first step towards the utilization of application-specific features in PCM-based embedded systems. Slides

3D-3 (Time: 17:00 - 17:30)
 Title Selectively Protecting Error-Correcting Code for Area-Efficient and Reliable STT-RAM Caches Author *Junwhan Ahn (Seoul National University, Republic of Korea), Sungjoo Yoo (Pohang University of Science and Technology, Republic of Korea), Kiyoung Choi (Seoul National University, Republic of Korea) Page pp. 285 - 290 Keyword backhopping, caches, error-correcting code, STT-RAM Abstract Recent research on STT-RAM has revealed that device scaling makes its write operations unreliable. To mitigate the impact of this problem, this paper proposes a low-cost, ECC-based solution for STT-RAM caches. In particular, it proposes to share storage for ECC among different blocks within a set and to use it only for unsuccessful write operations. Experimental results show that our scheme reduces 74% to 98% of the area overhead incurred by the conventional per-block ECC while maintaining system performance and reliability.

3D-4 (Time: 17:30 - 18:00)
 Title Loadsa: A Yield-Driven Top-Down Design Method for STT-RAM Array Author Wujie Wen, Yaojun Zhang, Lu Zhang, *Yiran Chen (University of Pittsburgh, U.S.A.) Page pp. 291 - 296 Keyword Yield-driven, Top-down, Statistical design, STT-RAM Abstract As an emerging nonvolatile memory technology, spin transfer torque random access memory (STT-RAM) faces great design challenges. The large device variations and the thermal induced switching randomness of the magnetic tunneling junction (MTJ) introduce the persistent and non-persistent errors in STT-RAM operations, respectively. Modeling these statistical metrics generally requires the expensive Monte-Carlo simulations on the combined magnetic-CMOS models, which is hardly integrated in the modern micro-architecture and system designs. Also, the conventional bottom-up design method incurs costly iterations in the STT-RAM design toward specific system requirement. In this work, we propose Loadsa: a yield-driven top-down design method to explore the design space of STT-RAM array from a statistical point of view. Both array-level semi-analytical yield model and cell-level failure-probability model are developed to enable a top-down design method: The system-level requirements, e.g., the chip yield under power and area constraints, are hierarchically mapped to array and cell-level design parameters, e.g., redundancy, ECC scheme, and MOS transistor size, etc. Our simulation results show that "Loadsa" can accurately optimize the STT-RAM based on the system and cell level constraints with a linear computation complexity. Our method demonstrates great potentials in the early design stage of memory or micro-architecture by eliminating the design integrations, while offering a full statistical view of the design even when the common yield enhancement practices are applied. Slides

 Thursday, January 24, 2013

Session 2K  Keynote II
Time: 9:00 - 10:00 Thursday, January 24, 2013
Chair: Shinji Kimura (Waseda University, Japan)

2K-1 (Time: 9:00 - 10:00)
 Title (Keynote Address) Gearing Up for the Upcoming Technology Nodes Author Kee Sup Kim (Samsung, Republic of Korea) Abstract Upcoming technology nodes have many challenges. In this talk, Dr. Kee Sup Kim outlines the challenges in recent technology nodes and how Samsung prepared each generation of design infrastructure to overcome these challenges. The emphasis will be given to the challenges in the upcoming technology nodes and what approaches are being taken to overcome the challenges posed by double patterning, 3D transistors, 3D ICs and increasing process instabilities.

Session 4A  Special Session: High-Level Synthesis and Parallel Programming Models for FPGAs
Time: 10:20 - 12:20 Thursday, January 24, 2013
Organizer: Yun (Eric) Liang (Peking University, China)

4A-1 (Time: 10:20 - 10:50)
 Title (Invited Paper) Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms Author *Doris Chen, Deshanand Singh (Altera Toronto Technology Center, Canada) Page pp. 297 - 304 Keyword high-level synthesis, FPGA, GPU Abstract Fractal compression is an efficient technique for image and video encoding that has not gained widespread acceptance due to its computational intensity. In this paper, we present a real-time implementation of fractal compression in OpenCL, and show how the algorithm can be efficiently optimized for multi-CPUs, GPUs, and FPGAs. We show that the core computation implemented on the FPGA through OpenCL is 3x and 114x faster than a high-end GPU and multi-core CPU, respectively. We also compare to a hand-coded FPGA implementation to showcase the effectiveness of OpenCL-to-FPGA compilation. Slides

4A-2 (Time: 10:50 - 11:20)
 Title (Invited Paper) High Level Synthesis of Multiple Dependent CUDA Kernels for FPGA Author Swathi Gurumani, Hisham Cholakkail (Advanced Digital Sciences Center, Singapore), Yun Liang (Peking University, China), *Kyle Rupnow (Nanyang Technological University, Singapore), Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.) Page pp. 305 - 312 Keyword HLS, FPGA, CUDA Abstract High-level synthesis (HLS) tools provide automatic generation of hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages, and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning and resource sharing in order to improve performance, and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) language source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, demonstrate a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. This study demonstrates that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation, while consuming over 16X less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs. Slides

4A-3 (Time: 11:20 - 11:50)
 Title (Invited Paper) The Liquid Metal IP Bridge Author Perry Cheng, Stephen J. Fink, Rodric Rabbah, *Sunil Shukla (IBM Research, U.S.A.) Page pp. 313 - 319 Keyword High Level Synthesis, Heterogeneous Computing, FPGA Abstract Programmers are increasingly turning to heterogeneous systems to achieve performance. Examples include FPGA-based systems that integrate reconfigurable architectures with conventional processors. However, the burden of managing the coding complexity that is intrinsic to these systems falls entirely on the programmer. This limits the proliferation of these systems as only highly-skilled programmers and FPGA developers can unlock their potential. The goal of the Liquid Metal project at IBM Research is to address the programming complexity attributed to heterogeneous FPGA-based systems. A feature of this work is a vertically integrated development lifecycle that appeals to skilled software developers. A primary enabler for this work is a canonical IP bridge, designed to offer a uniform communication methodology between software and hardware, and that is applicable across a wide range of platforms available off-the-shelf.

Session 4B  Memory Hierarchy Optimization
Time: 10:20 - 12:20 Thursday, January 24, 2013
Chair: Jason Xue (City University of Hong Kong, Hong Kong)

4B-1 (Time: 10:20 - 10:50)
 Title TRISHUL: A Single-pass Optimal Two-level Inclusive Data Cache Hierarchy Selection Process for Real-time MPSoCs Author *Mohammad Shihabul Haque, Akash Kumar, Yajun Ha, Qiang Wu, Shaobo Luo (National University of Singapore, Singapore) Page pp. 320 - 325 Keyword data cache hierarchy configuration, real-time software, Single-pass, Simulation Abstract Hitherto discovered approaches analyze the execution time of a real-time application on all the possible cache hierarchy setups to find the application specific optimal two-level inclusive data cache hierarchy to reduce cost, space and energy consumption while satisfying the time deadline in real-time Multi-Processor Systems on Chip (MPSoC). These brute-force like approaches can take years to complete. Alternatively, application's memory access trace driven crude estimation methods can find a cache hierarchy quickly by compromising the accuracy of results. In this article, for the first time, we propose a fast and accurate application's trace driven approach to find the optimal real-time application specific two-level inclusive data cache hierarchy. Our proposed approach "TRISHUL" predicts the optimal cache hierarchy performance first and then utilizes that information to find the optimal cache hierarchy quickly. TRISHUL can suggest a cache hierarchy, which has up to 128 times smaller size, up to 7 times faster compared to the suggestion of the state-of-the-art crude trace driven two-level inclusive cache hierarchy selection approach for the application traces analyzed. Slides

4B-2 (Time: 10:50 - 11:20)
 Title Optimizing Translation Information Management in NAND Flash Memory Storage Systems Author *Qi Zhang, Xuandong Li, Linzhang Wang, Tian Zhang (Nanjing University, China), Yi Wang, Zili Shao (The Hong Kong Polytechnic University, Hong Kong) Page pp. 326 - 331 Keyword Translation block, NAND flash memory, On-demand, SSD Abstract Address mapping is one of the major functions in managing NAND flash. With the capacity increase of NAND flash, it becomes vitally important to reduce the RAM footprint of the address mapping table while not introducing big performance overhead. Demand-based address mapping is an effective approach to solve this problem, in which the address mapping table is stored in NAND flash (in so-called translation pages), and mapping items are cached on-demand in RAM. Therefore, it is critical to manage translation pages in demand-based address mapping. This paper solves the two most important problems in translation page management. First, to reduce frequent translation page updates caused by data requests, we propose a page-level caching mechanism to exploit the fundamental property of NAND flash where the basic read/write unit is one page. Second, to reduce the garbage collection overhead from translation pages, we propose a multiple write pointers strategy to group data pages corresponding to the same translation page into one data block, by which, when the data block is reclaimed via the garbage collection, we only need to update one translation page. We evaluate our scheme using a set of benchmarks from both real-world and synthetic traces. Experimental results show that our techniques can achieve significant reduction in the extra translation operations and improve the system response time. Slides
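
The demand-based mapping idea in the 4B-2 abstract can be illustrated with a minimal Python sketch. It is not the authors' mechanism: the class name `TranslationPageCache`, the LRU eviction policy, the `ENTRIES_PER_PAGE` value and the read/write counters are all assumptions made for illustration. It only shows mapping entries being cached in RAM at translation-page granularity and written back to flash when a modified page is evicted.

```python
from collections import OrderedDict

ENTRIES_PER_PAGE = 4  # assumed value, kept tiny for illustration


class TranslationPageCache:
    """Hypothetical LRU cache that holds whole translation pages in RAM."""

    def __init__(self, flash_pages, capacity):
        # flash_pages: dict mapping translation page number -> list of
        # physical page numbers (stands in for translation pages in flash)
        self.flash = flash_pages
        self.capacity = capacity
        self.cache = OrderedDict()  # page number -> [entries, dirty flag]
        self.flash_reads = 0
        self.flash_writes = 0

    def _load(self, tpn):
        if tpn in self.cache:
            self.cache.move_to_end(tpn)  # mark as most recently used
            return self.cache[tpn]
        if len(self.cache) >= self.capacity:
            # Evict the least recently used page; write back only if modified.
            old_tpn, (entries, dirty) = self.cache.popitem(last=False)
            if dirty:
                self.flash[old_tpn] = entries
                self.flash_writes += 1
        self.flash_reads += 1
        slot = [list(self.flash[tpn]), False]
        self.cache[tpn] = slot
        return slot

    def lookup(self, lpn):
        """Translate a logical page number to a physical page number."""
        entries, _ = self._load(lpn // ENTRIES_PER_PAGE)
        return entries[lpn % ENTRIES_PER_PAGE]

    def update(self, lpn, ppn):
        """Record a new physical location; the cached page becomes dirty."""
        slot = self._load(lpn // ENTRIES_PER_PAGE)
        slot[0][lpn % ENTRIES_PER_PAGE] = ppn
        slot[1] = True
```

In this sketch, several updates that fall in the same translation page cost a single flash write at eviction time, which is the kind of saving the abstract attributes to caching at page granularity.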

4B-3 (Time: 11:20 - 11:50)
 Title An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching Author *Xianglei Dang, Xiaoyin Wang, Dong Tong, Zichao Xie, Lingda Li, Keyi Wang (Peking University, China) Page pp. 332 - 337 Keyword data prefetching, energy efficiency, useless prefetch filtering, memory performance optimization Abstract As data prefetching is used in embedded processors, it is crucial to reduce the wasted energy for improving the energy efficiency. In this paper, we propose an adaptive prefetch filtering (APF) mechanism to reduce the wasted bandwidth and energy as well as the cache pollution caused by useless prefetches. APF records the prefetch-victim address pairs of issued prefetches and collects information about which address in each pair is first accessed by the processor to guide the filtering of new generated useless prefetches. Meanwhile, filtered prefetches are recorded for building the feedback mechanism to avoid filtering useful prefetches. Experimental results demonstrate that APF reduces useless prefetches by an average of 53.81% with a mere 5.28% reduction of useful prefetches, thus reducing the memory access bandwidth consumption by 59.92% and the L2 cache energy by 6.19%. APF also improves the performance of several programs by reducing the cache pollution incurred by useless prefetches, thus gaining an average performance improvement of 2.12%. Slides

4B-4 (Time: 11:50 - 12:20)
 Title Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs Author *Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai, Jing-Yang Jou (Department of Electronics Engineering, National Chiao Tung University, Taiwan) Page pp. 338 - 343 Keyword Shared cache, Cache contention, Thread scheduling, Irregular applications, GPGPUs Abstract On-chip shared cache is effective to alleviate the memory bottleneck in modern many-core systems, such as GPGPUs. However, when scheduling numerous concurrent threads on a GPGPU, a cache capacity agnostic scheduling scheme could lead to severe cache contention among threads and thus significant performance degradation. Moreover, the diverse working sets in irregular applications make the cache contention issue an even more serious problem. As a result, taking cache capacity into account has become a critical scheduling issue of GPGPUs. This paper formulates a Cache Capacity Aware Thread Scheduling Problem to capture the impact of cache capacity as well as different architectural considerations. With a proof to be NP-hard, this paper has proposed two algorithms to perform the cache capacity aware thread scheduling. The simulation results on Nvidia’s Fermi configuration have shown that the proposed scheduling scheme can effectively avoid cache contention, and achieve an average of 44.7% cache miss reduction and 28.5% runtime enhancement. The paper also shows the runtime can be enhanced up to 62.5% for more complex applications. Slides

Session 4C  Timing and Power Driven Design Flow
Time: 10:20 - 12:20 Thursday, January 24, 2013
Chairs: Masanori Hashimoto (Osaka University, Japan), Sheldon Tan (University of California, Riverside, U.S.A.)

4C-1 (Time: 10:20 - 10:50)
 Title Optimization for Overdrive Signoff Author Tuck-Boon Chan, Andrew B. Kahng, *Jiajia Li, Siddhartha Nath (University of California, San Diego, U.S.A.) Page pp. 344 - 349 Keyword overdrive, signoff, overdesign, multi-mode, optimization Abstract In modern SOC implementations, multi-mode design is commonly used to achieve better circuit performance and power across voltage-scaling, “turbo” and other operating modes. Although there are many tools for multi-mode circuit implementation, to our knowledge there is no available systematic analysis or methodology for the selection of associated signoff modes. We observe that the selection of signoff modes has significant impact on circuit area, power and performance. For example, incorrect choice of signoff voltages for required overdrive frequencies can result in a netlist with 15% suboptimality in power or 21% in area. In this paper, we propose a concept of mode dominance which can be used as a guideline for signoff mode selection. Further, we also propose efficient circuit implementation flows to optimize the selection of signoff modes within several distinct use cases. Our results show that our proposed methodology provides 5-7% improvement in performance compared to the traditional “signoff and scale” method. The signoff modes determined by our methods result in only 0.6% overhead in performance and 8% overhead in power after implementation, compared to the optimal signoff modes. Slides

4C-2 (Time: 10:50 - 11:20)
 Title Mountain-Mover: An Intuitive Logic Shifting Heuristic for Improving Timing Slack Violating Paths Author *Xing Wei, Wai-Chung Tang, Yu-Liang Wu (The Chinese University of Hong Kong, Hong Kong), Cliff Sze, Charles Alpert (IBM Austin Research Center, U.S.A.) Page pp. 350 - 355 Keyword logic rewiring, slack, timing optimization, post-placement Abstract Based on a simple intuitive notion, in this paper, we propose an efficient post-placement improvement scheme. Based on the given timing slack distribution of a circuit, a corresponding "slack mountain map" can be visualized with peaks and valleys representing the worst negative slack and non-critical positive slack areas respectively. Guided by this map, violating paths are improved while the slack mountain is flattened by applying a local logic perturbation technique (rewiring) repeatedly to shift logic resources from critical to non-critical areas. However, due to the locality property of the rewiring technique, to better avoid being stuck at local minimums, instead of firing rewiring operations from the peak top towards lower areas, we do this local logic shifting starting from "sea areas" (non-critical) towards peak (critical) areas. At the end, as the slack map is more flattened, a circuit with slack violations more evenly distributed can be yielded. Comparing to the recent work, our experimental results show that this scheme can obtain a better or comparable delay reduction but with CPU time one order of magnitude smaller. Slides

4C-3 (Time: 11:20 - 11:50)
 Title Pulsed-Latch ASIC Synthesis in Industrial Design Flow Author *Sangmin Kim, Duckhwan Kim, Youngsoo Shin (Department of Electrical Engineering, KAIST, Republic of Korea) Page pp. 356 - 361 Keyword pulsed-latch, pulse generator, design flow, scan latch, ASIC Abstract Flip-flop has long been used as a sequencing element of choice in ASIC design; commercial synthesis tools have also been developed in this context. This work has been motivated by a question of whether existing CAD tools can be employed from RTL to layout while pulsed latch replaces flip-flop as a sequencing element. Two important problems have been identified and their solutions are proposed: placement of pulse generators and latches for integrity of pulse shape, and design of special scan latches and their selective use to reduce hold violations. A reference design flow has also been set up using published documents, in order to assess the proposed one. In 40-nm technology, the proposed flow achieves 20% reduction in circuit area and 30% reduction in power consumption, on average of 12 test circuits.

4C-4 (Time: 11:50 - 12:20)
 Title Power Optimization for Application-Specific 3D Network-on-Chip with Multiple Supply Voltages Author *Kan Wang, Sheqin Dong (Tsinghua University, China) Page pp. 362 - 367 Keyword Layer Assignment, Multiple Supply Voltages, Application Specific 3D NoC, Inter-layer Communication, Power Consumption Abstract In this paper, an MSV-driven power optimization method is proposed for application-specific 3D NoC (MSV-3DNoC). A unified modeling method is presented for considering both layer assignment and voltage assignment, which achieves the best trade-off between core power and communication power. A 3D NoC synthesis is proposed to assign network components onto each layer and generate inter-layer interconnection. A global redistribution is applied to further reduce communication power. Experimental results show that compared to MSV-driven 2D NoC, the proposed method can improve total chip power greatly. Slides

Session 4D  Special Session: Emerging Security Topics in Electronic Designs and Mobile Devices
Time: 10:20 - 12:20 Thursday, January 24, 2013
Organizer: Yiran Chen (University of Pittsburgh, U.S.A.)

4D-1 (Time: 10:20 - 10:50)
 Title (Invited Paper) Hardware Security Strategies Exploiting Nanoelectronic Circuits Author Garrett S. Rose (Air Force Research Laboratory, U.S.A.), *Jeyavijayan Rajendran (New York University, U.S.A.), Nathan McDonald (Air Force Research Laboratory, U.S.A.), Ramesh Karri (New York University, U.S.A.), Miodrag Potkonjak (University of California, Los Angeles, U.S.A.), Bryant Wysocki (Air Force Research Laboratory, U.S.A.) Page pp. 368 - 372 Keyword Cybersecurity, PUF, VLSI, Nanotechnology, Memristor Abstract Hardware security has emerged as an important field of study aimed at mitigating issues such as piracy, counterfeiting, and side channel attacks. One popular solution for such hardware security attacks is the physical unclonable function (PUF), which provides a hardware specific unique signature or identification. Novel nanoelectronic technologies such as memristors are viable options for improved security in emerging integrated circuits. We provide an overview of memristor based PUF structures and circuits that illustrate the potential for nanoelectronic hardware security solutions.

4D-2 (Time: 10:50 - 11:20)
 Title (Invited Paper) Can We Identify Smartphone App by Power Trace? Author Mian Dong, Po-Hsiang Lai, *Zhu Li (Samsung Telecommunications America, U.S.A.) Page pp. 373 - 375 Keyword power, smartphone Abstract Power trace of a smartphone, as time series data, carries important information of the system behavior and is useful for many applications, such as energy management, software optimization and anomaly detection. However, the power trace measured from the battery terminals includes the power consumption of all the hardware components and thus describes the activity of the whole system. Yet modern smartphones are multiprocessing, i.e., multiple applications can be running simultaneously in the same system. Our goal is to answer the following question: “Can we identify smartphone app by power trace?” That is, whether the power trace of a smartphone differs when running different applications.

4D-3 (Time: 11:20 - 11:50)
 Title (Invited Paper) Secure Storage System and Key Technologies Author *Jiwu Shu, Zhirong Shen, Wei Xue, Yingxun Fu (Tsinghua University, China) Page pp. 376 - 383 Keyword secure storage, cloud storage security, privacy, data security Abstract With the rapid development of cloud storage, data security in storage receives great attention and becomes the top concern to block the spread development of cloud service. In this paper, we systematically study the security researches in the storage systems. We first present the design criteria that are used to evaluate a secure storage system and summarize the widely adopted key technologies. Then, we further investigate the security research in cloud storage and conclude the new challenges in the cloud environment. Finally, we give a detailed comparison among the selected secure storage systems and draw the relationship between the key technologies and the design criteria. Slides

4D-4 (Time: 11:50 - 12:20)
 Title (Invited Paper) Mobile User Classification and Authorization Based on Gesture Usage Recognition Author *Kent W. Nixon, Xiang Chen, Zhi-Hong Mao (University of Pittsburgh, U.S.A.), Kang Li (Rutgers University, U.S.A.), Yiran Chen (University of Pittsburgh, U.S.A.) Page pp. 384 - 389 Keyword Mobile Device, Gesture, Security Abstract Intelligent mobile devices have been widely serving in almost all aspects of everyday life, spanning from communication, web surfing, entertainment, to daily organizer. A large amount of sensitive and private information is stored on the mobile device, leading to severe data security concern. In this work, we propose a novel mobile user classification and authorization scheme based on the recognition of user’s gesture. Compared to other security solutions like password, track pattern and finger print etc.

Session 5A  Designers' Forum: Heterogeneous Devices and Multi-Dimensional Integration Design Technologies
Time: 13:40 - 15:40 Thursday, January 24, 2013
Organizer: Akihiko Okubora (Sony, Japan)

5A-1 (Time: 13:40 - 14:10)
 Title (Invited Paper) Challenges in Integration of Diverse Functionalities on CMOS Author *Kazuya Masu, Noboru Ishihara (Tokyo Institute of Technology, Japan), Toshifumi Konishi (NTT Advanced Technology, Japan), Katsuyuki Machida (Tokyo Institute of Technology, Japan), Hiroshi Toshiyoshi (The University of Tokyo, Japan) Page pp. 390 - 393 Abstract We introduce “Wafer Shuttle” that is suitable for integration of diverse functionalities. CMOS/MEMS design flow and environment based on SPICE is discussed. It is pointed out that modeling will be important to promote the R&D of MEMS/CMOS and/or diverse-functionalities integration on CMOS.

5A-2 (Time: 14:10 - 14:40)
 Title (Invited Paper) 3DIC from Concept to Reality Author Frank Lee, Bill Shen, Willy Chen, *Suk Lee (Taiwan Semiconductor Manufacturing Company, Taiwan) Page pp. 394 - 398 Keyword 3DIC, TSMC, System, Design Abstract 3DIC technology presents a new system integration strategy for the electronics industry to achieve superior system performance with lower power consumption, higher bandwidth, smaller system form factor, and shorter time to market through heterogeneous integration. TSMC's “Chip-on-Wafer-on-Substrate (CoWoS)” technology opens up a new opportunity to bring 3D chip stacking vision from concept to reality. The methodology is discussed in the context of this market trend and the different pieces needed to jointly make it a success, which include customers' required applications, TSMC's support design flow, as well as the ecosystem design enablement of multi-die implementation, DFT solution, thermal analysis, verification and new categories of IPs.

5A-3 (Time: 14:40 - 15:10)
 Title (Invited Paper) 2.5D Design Methodology Author *Sinya Tokunaga (Semiconductor Technology Academic Research Center, Japan) Page pp. 399 - 402 Keyword 3D-IC, Silicon interposer, TSV, Co-design, Co-analysis Abstract We present a 2.5D design methodology. A very important issue is high-frequency insertion loss on the silicon interposer. There are two wiring methodologies on the silicon interposer. One is the Manhattan wiring method, as in LSI wiring design, and the other is the transmission channel wiring method, as in package design. Using a component model 6 mm in length at 1 GHz, we have confirmed that the transmission channel wiring has electrical characteristics twice as good as those of the Manhattan wiring.

5A-4 (Time: 15:10 - 15:40)
 Title (Invited Paper) Design Issues in Heterogeneous 3D/2.5D Integration Author *Dragomir Milojevic, Pol Marchal, Erik Jan Marinissen, Geert Van der Plas, Diederik Verkest, Eric Beyne (IMEC, Belgium) Page pp. 403 - 410 Keyword Heterogeneous, 3D/2.5D Integration, Thermal, mechanical analysis, Design for test Abstract Efficient processing of fine-pitched Through Silicon Vias, micro-bumps and back-side re-distribution layers enable face-to-back or face-to-face integration of heterogeneous ICs using 3D stacking and/or Silicon Interposers. While these technology features are extremely compelling, they considerably stress the existing design practices and EDA tool flows typically conceived for 2D systems. With all system, technology and implementation level options brought with these features, the design space increases to an extent where traditional 2D tools cannot be used any more for efficient exploration. Therefore, the cost-effective design of future 3D ICs products will require new planning and co-optimisation techniques and tools that are fast and accurate enough to cope with these challenges. In this paper we present design methodology and the practical EDA tool chain that covers different aspects of the design flow and is specific to efficient design of 3D-ICs. Flow features include: fast synthesis and 3D design partitioning at gate level, TSV/micro-bump array planning, 3D floor planning, placement and routing, congestion analysis, fast thermal and mechanical modeling, easy technology vs. implementation trade-off analysis, 3D device models generations and Design-for-Test (DfT). The application of the tool chain is illustrated using concrete example of a real-world design, showing not only the applicability of the tool chain, but also the benefits of heterogeneous 2.5 and 3D integration technologies.

Session 5B  Analysis and Verification of Reliable Systems
Time: 13:40 - 15:40 Thursday, January 24, 2013
Chairs: Sri Parameswaran (University of New South Wales, Australia), Ittetsu Taniguchi (Ritsumeikan University, Japan)

5B-1 (Time: 13:40 - 14:10)
 Title Verifying Distributed Controllers using Time-Stamped ECAs Author *Matthias Kauer, Sebastian Steinhorst, Martin Lukasiewycz (TUM CREATE, Singapore), Dip Goswami, Reinhard Schneider, Samarjit Chakraborty (TU Munich, Germany) Page pp. 411 - 416 Keyword verification, event-count automata, linear control, timing analysis Abstract We study distributed controllers where sensor, controller, and actuator tasks are mapped onto different processors or Electronic Control Units (ECUs) in a distributed automotive architecture, communicating via a shared bus. Controllers in such setups are designed with a sampling period equal to the worst-case sensor-to-actuator message delay. However, this assumption of all messages having to meet their deadlines is too pessimistic. The inherent robustness of most controllers allows some of the messages to miss their deadlines, while still meeting specified control performance constraints. Given a controller, in this paper we first quantify the frequency of its acceptable deadline misses and represent this as a Linear Temporal Logic (LTL) formula. Further, we model the distributed architecture as a network of time-stamped event count automata (TS-ECA). Such a network of TS-ECAs is then model-checked to verify whether it satisfies the LTL formula. The verification ensures that the controller may be mapped onto the architecture and the control performance constraints will be satisfied. We have implemented this methodology in the Symbolic Analysis Laboratory (SAL), which is a well-known framework combining different tools for system verification. Our implementation and case studies using standard controller design show the applicability of our proposed controller/architecture co-verification. It represents a significant improvement in current design flows where, although controller models are formally verified, their implementation on a distributed architecture is done in an ad hoc fashion with extensive testing and integration effort.

5B-2 (Time: 14:10 - 14:40)
 Title Reliability Assessment of Safety-Relevant Automotive Systems in a Model-Based Design Flow Author *Sebastian Reiter, Michael Pressler, Alexander Viehl (FZI Forschungszentrum Informatik, Germany), Oliver Bringmann, Wolfgang Rosenstiel (University Tuebingen, Germany) Page pp. 417 - 422 Keyword reliability, model-based, error injection Abstract To support the reliability assessment of safety-relevant distributed automotive systems and reduce its complexity, this paper presents a novel approach that extends virtual prototyping towards error effect simulation. Besides the common functional and timed system simulation, error injection is used to stress error tolerance mechanisms. A quantitative assessment of the overall system reliability is performed by observing the system reactions and identifying incorrect system behavior. To foster the industrial application, the analysis is integrated in a model-based design flow, starting at the modeling level to assemble and parameterize the virtual prototype and to configure the analysis. The feasibility of the proposed approach is demonstrated by analyzing a representative safety-relevant automotive use case. Slides

5B-3 (Time: 14:40 - 15:10)
 Title Sequential Dependency and Reliability Analysis of Embedded System Author Hehua Zhang, *Yu Jiang (Tsinghua University, China), William N.N Hung (Synopsys, Inc., U.S.A.), Xiaoyu Song (Portland State University, U.S.A.), Jiaguang Sun (Tsinghua University, China) Page pp. 423 - 428 Keyword Dynamic Bayesian Network, embedded system, temporal correlations Abstract Embedded systems are becoming increasingly popular due to their widespread applications, and their reliability is a crucial issue. The complexity of the reliability analysis arises in handling the sequential feedback that makes the system output depend not only on the present input but also on the internal state. In this paper, we propose a novel probabilistic model, named sequential dependency model (SDM), for the reliability analysis of embedded systems with sequential feedback. It is constructed based on the structure of the system components and the signals among them. We prove that the SDM model is a Dynamic Bayesian Network (DBN) that captures: the spatial dependencies between system components in a single time slice, the temporal dependencies between system components of different time slices, and the temporal dependencies due to the sequential feedback. We initialize the conditional probability distribution (CPD) table of the SDM node with the failure probability of the corresponding system component. Then, the SDM model handles the spatial-temporal correlations at internal components as well as the higher order temporal correlations due to the sequential feedback with the computational mechanism of DBN. Experimental results demonstrate the accuracy of our model. Slides

5B-4 (Time: 15:10 - 15:40)
 Title Processor and DRAM Integration by TSV-Based 3-D Stacking for Power-Aware SOCs Author Shin-Shiun Chen, Chun-Kai Hsu, *Hsiu-Chuan Shih (National Tsing Hua University, Taiwan), Jen-Chieh Yeh (Industrial Technology Research Institute, Taiwan), Cheng-Wen Wu (National Tsing Hua University, Taiwan) Page pp. 429 - 434 Keyword 3D IC, DRAM, SOC, ESL, Power Abstract With the rapid popularization of mobile devices, low power and energy efficiency have become far more important than the system operating frequency. This work demonstrates a processor and DRAM integration scheme by TSV-based 3-D stacking, and the performance and energy efficiency are evaluated by an ESL design methodology. An integration scheme comprising the Sans-Cache DRAM (SCDRAM) architecture, which is designed with power and energy considerations in mind, is explored. Experimental results show that the proposed architecture can reduce energy by 80% while improving system performance by 23.5%. Slides

Session 5C  Advances in Physical Design
Time: 13:40 - 15:40 Thursday, January 24, 2013
Chairs: Sung Kyu Lim (Georgia Institute of Technology, U.S.A.), Yasuhiro Takashima (University of Kitakyushu, Japan)

5C-1 (Time: 13:40 - 14:10)
 Title A Flexible Fixed-outline Floorplanning Methodology for Mixed-size Modules Author *Kai-Chung Chan, Chao-Jam Hsu, Jai-Ming Lin (National Cheng Kung University, Taiwan) Page pp. 435 - 440 Keyword mixed-sized modules, fixed-outline, floorplanning Abstract This paper presents a new flow to handle fixed-outline floorplanning for mixed-size modules. It consists of two stages: a global distribution stage and a legalization stage. The methodology is very flexible, and it can be integrated into other methods or be extended to handle other constraints such as routability or thermal. The experimental results show that, on average, our method reduces wirelength by 22.5% and 4.7% compared with PATOMA and DeFer, respectively, on mixed-size benchmarks. Slides

5C-2 (Time: 14:10 - 14:40)
 Title Optimizing Routability in Large-Scale Mixed-Size Placement Author Jason Cong (University of California, Los Angeles, U.S.A.), Guojie Luo (Peking University, China), *Kalliopi Tsota, Bingjun Xiao (University of California, Los Angeles, U.S.A.) Page pp. 441 - 446 Keyword placement, routing, congestion, routability Abstract One of the necessary requirements for the placement process is that it should be capable of generating routable solutions. This paper describes methods leading to the reduction of the routing congestion and the final routed wirelength for large-scale mixed-size designs. In order to reduce routing congestion and improve routability, we propose blocking narrow regions on the chip. We also propose dummy-cell insertion inside regions characterized by reduced fixed-macro density. Our placer consists of three major components: (i) narrow channel reduction by performing neighbor-based fixed-macro inflation; (ii) dummy-cell insertion inside large regions with reduced fixed-macro density; and (iii) preplacement inflation by detecting tangled logic structures in the netlist and minimizing the maximum pin density. We evaluated the quality of our placer using the newly released DAC 2012 routability-driven placement contest designs and we compared our results to the top four teams that participated in the placement contest. The experimental results reveal that our placer improves the routability of the DAC 2012 placement contest designs and effectively reduces the routing congestion. Slides

5C-3 (Time: 14:40 - 15:10)
 Title Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment Author Xin-Wei Shih (MediaTek, Taiwan), *Tzu-Hsuan Hsu (Linkwish, Taiwan), Hsu-Chieh Lee (Google, Taiwan), Yao-Wen Chang (National Taiwan University, Taiwan), Kai-Yuan Chao (Intel, U.S.A.) Page pp. 447 - 452 Keyword clock, skew, supply voltage, IR-drop, power Abstract For high-performance synchronous systems, nonuniform/non-ideal supply voltages of buffers (e.g., due to IR-drop) may incur a large clock skew and thus serious performance degradation. This paper addresses this problem and presents the first symmetrical buffered clock-tree synthesis flow that considers supply voltage differences of buffers. We employ a two-phase technique of bottom-up clock sink clustering to determine the tree topology, followed by top-down buffer placement and wire routing to complete the clock tree. At each level of processing, clock skew and wirelength are minimized by the determination of buffer embedding regions and the alignment of buffer supply voltages. Experimental results show that, on average, our method can achieve a 76% (respectively, 40%) clock skew reduction with marginal resource and runtime overheads, compared to the state-of-the-art work without supply voltage consideration (with an extension for supply voltages based on our top-down flow). With the skew reductions, our method can meet the stringent skew constraint set by the 2010 ISPD contest for all cases, while other counterparts cannot. In particular, our work provides a key insight into the importance of handling practical design issues (such as IR-drop) for real-world clock-tree synthesis. Slides

5C-4 (Time: 15:10 - 15:40)
 Title BCell: Automatic Layout of Leaf Cells Author Stefan Hougardy, *Tim Nieberg, Jan Schneider (Research Institute for Discrete Mathematics, University of Bonn, Germany) Page pp. 453 - 460 Keyword Leaf Cells, Placement, Routing Abstract In this paper we present BonnCell, our solution to compute leaf cell layouts. Our placement algorithm allows to find very compact solutions and uses an accurate target function to guarantee routability. The routing algorithm handles all nets simultaneously using a constraint generation MIP based approach. BCell easily allows to adapt to new design rules as required for 14nm and beyond. The experimental results on current 22nm designs show significant improvements compared to manual designs done by experienced designers. Slides

Session 5D  Multi-/Many-Core System Optimization
Time: 13:40 - 15:40 Thursday, January 24, 2013
Chairs: Yuichi Nakamura (NEC, Japan), Yongpan Liu (Tsinghua University, China)

5D-1 (Time: 13:40 - 14:10)
 Title Register and Thread Structure Optimization for GPUs Author *Yun Liang (Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China), Zheng Cui (Advanced Digital Sciences Center, Illinois at Singapore, Singapore), Kyle Rupnow (Nanyang Technological University, Singapore), Deming Chen (University of Illinois, Urbana-Champaign, U.S.A.) Page pp. 461 - 466 Keyword GPU, register, thread structure, design space exploration Abstract GPUs are an increasingly popular implementation platform for a variety of general purpose applications from mobile and embedded devices to high performance computing. The CUDA and OpenCL parallel programming models enable easy utilization of the GPU's resources. However, tuning GPU applications' performance is a complex and labor intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between the thread parallelism and performance of a single thread. However, prior techniques ignore register allocation, a significant factor in single-thread performance that also indirectly affects the number of simultaneously active threads. In this paper, we show that joint optimization of register allocation and thread structure has great potential to significantly improve performance. However, the design space for this joint optimization can be large; therefore, we develop performance metrics appropriate for evaluation within a compiler's inner loop and efficient design space exploration techniques that use the metrics to narrow the search space. Across a range of GPU applications, we achieve average performance speedup of 1.33X (up to 1.73X) with design space exploration 355X faster than the exhaustive search. Slides

5D-2 (Time: 14:10 - 14:40)
 Title Real-Time Partitioned Scheduling on Multi-Core Systems with Local and Global Memories Author *Che-Wei Chang (National Taiwan University, Taiwan), Jian-Jia Chen (Karlsruhe Institute of Technology, Germany), Tei-Wei Kuo (National Taiwan University, Taiwan), Heiko Falk (Ulm University, Germany) Page pp. 467 - 472 Keyword real-time system, heterogeneous memory, partitioned scheduling, resource optimization, worst case execution time Abstract Real-time task scheduling becomes even more challenging with the emergence of island-based multi-core architectures, where the local memory module of an island offers shorter access time than the global memory module does. With such a popular architecture design in mind, this paper explores real-time task scheduling over island-based homogeneous cores with local and global memory pools. Joint considerations of real-time scheduling and memory allocation are presented to efficiently use the computing and memory resources. A polynomial-time algorithm with an asymptotic 4-approximation bound is proposed to minimize the number of islands needed to successfully schedule tasks. To evaluate the performance of the proposed algorithm, 82 benchmarks from the MRTC, MediaBench, UTDSP, NetBench, and DSPstone benchmark suites are profiled by the worst-case-execution-time analyzer aiT and included in the experiments.

5D-3 (Time: 14:40 - 15:10)
 Title Dynamic Thermal Management for Multi-Core Microprocessors Considering Transient Thermal Effects Author *Zao Liu (University of California, Riverside, U.S.A.), Tailong Xu (Anhui University, China), Sheldon X.-D. Tan (University of California, Riverside, U.S.A.), Hai Wang (UESTC, China) Page pp. 473 - 478 Keyword Dynamic thermal management, task migration, thermal analysis, moment matching, hot spots Abstract Dynamic thermal management is a viable way to effectively mitigate thermal emergencies. In this paper, a new thermal management scheme is proposed to reduce the on-chip temperature variance and the occurrence of hot spots by considering more transient thermal effects. The new method performs task migrations to reduce the temperature variations across the chip. Instead of intuitively assigning the heavy tasks to the low temperature cores to balance the thermal profile based on steady state thermal analysis, the proposed method applies moment matching based transient thermal analysis techniques for fast thermal estimation and prediction to guide the migration process. We show that by considering the dominant temperature moment component, the resulting algorithm can lead to significant reduction of hot spots with full transient thermal simulation. Our experimental results on a 16 core microprocessor demonstrate that the proposed method can reduce the number of hot spots by 50% compared to the simple lowest temperature based task scheduling method, leading to more uniform on-chip temperature distribution across the microprocessor cores. Slides

5D-4 (Time: 15:10 - 15:40)
 Title BAMSE: A Balanced Mapping Space Exploration Algorithm for GALS-based Manycore Platforms Author Mohammad Foroozannejad, Brent Bohnenstiehl, *Soheil Ghiasi (University of California, Davis, U.S.A.) Page pp. 479 - 484 Keyword Manycore, GALS, Mapping, Algorithm Abstract We study the problem of mapping concurrent tasks of an application modeled as a data flow graph onto processors of a GALS-based manycore platform. We propose a mapping algorithm called BAMSE, which exploits the characteristics of streaming applications and the specifications of the target architecture to optimize the mapping solution. Different configuration parameters embedded into the algorithm enable one to strike a balance between scalability of the approach and the quality of generated solutions. Experiments with several real life applications show that our algorithm outperforms hand-optimized manual mappings up to 65% in terms of longest inter-processor communication link, and as high as 19% with respect to total length of the links, when the two criteria are used as primary and secondary optimization objectives, respectively. Additionally, our algorithm delivers superior mappings compared to ILP generated solutions after 10 days of solver runtime. Slides

Session 6A  Designers' Forum: Future Direction and Trend of Embedded GPU
Time: 16:00 - 18:00 Thursday, January 24, 2013
Organizer: Masaitsu Nakajima (Panasonic, Japan), Moderator: Koji Inoue (Kyushu University, Japan)

6A-1
 Title (Panel Discussion) Future Direction and Trend of Embedded GPU Author Panelists: Jem Davies (ARM, U.S.A.), Hong Jiang (Intel, U.S.A.), Eisaku Ohbuchi (Digital Media Professionals Inc., Japan), Yasushi Sugama (Fujitsu Laboratories, Japan), Tony King-Smith (Imagination Technologies, U.K.)

Session 6B  Emerging Technologies
Time: 16:00 - 18:00 Thursday, January 24, 2013
Chairs: Tsung-Yi Ho (National Cheng Kung University, Taiwan), Zili Shao (The Hong Kong Polytechnic University, Hong Kong)

6B-1 (Time: 16:00 - 16:30)
 Title Thermal Simulator of 3D-IC with Modeling of Anisotropic TSV Conductance and Microchannel Entrance Effects Author Hanhua Qian, Hao Liang, Chip-Hong Chang, Wei Zhang, *Hao Yu (Nanyang Technological University, Singapore) Page pp. 485 - 490 Keyword thermal model, 3D-IC, TSV, entrance effect, microchannel Abstract This paper presents a fast and accurate steady state thermal simulator for heatsink and microfluid-cooled 3D-ICs. This model considers the thermal effect of TSVs at fine granularity by calculating the anisotropic equivalent thermal conductances of a solid grid cell if TSVs are inserted. The entrance effect of microchannels is also investigated for accurate modeling of microfluidic cooling. The proposed thermal simulator is verified against the commercial multiphysics solver COMSOL and compared with Hotspot and 3D-ICE. Simulation results show that for heatsink cooling, the proposed simulator is as accurate as Hotspot but runs much faster at moderate granularity. For microfluidic cooling, our proposed simulator is much more accurate than 3D-ICE in its estimation of steady state temperature and thermal distribution. Slides

6B-2 (Time: 16:30 - 17:00)
 Title A Novel Cell Placement Algorithm for Flexible TFT Circuit with Mechanical Strain and Temperature Consideration Author *Juin-Li Lin, Po-Hsun Wu, Tsung-Yi Ho (National Cheng Kung University, Taiwan) Page pp. 491 - 496 Keyword Placement, Mobility, TFT Abstract Mobility is the key device parameter affecting circuit performance in thin-film transistor (TFT) technologies, and it is very sensitive to changes in mechanical strain and temperature. However, existing algorithms only consider the impact of mechanical strain in cell placement of TFT circuits. Without taking temperature into consideration, mobility may be dramatically decreased, which leads to circuit performance degradation. This paper presents the first work to minimize the mobility variation caused by changes in mechanical strain and temperature simultaneously. Experimental results show that the proposed algorithms can effectively and efficiently reduce the mobility variation without routing overhead. Slides

6B-3 (Time: 17:00 - 17:30)
 Title Improving Energy Efficiency for Energy Harvesting Embedded Systems Author Yang Ge, Yukan Zhang, *Qinru Qiu (Syracuse University, U.S.A.) Page pp. 497 - 502 Keyword Hybrid electrical energy storage system, energy harvesting system, bank reconfiguration Abstract While the energy harvesting system (EHS) supplies green energy to the embedded system, it also suffers from uncertainty and large variation in harvesting rate. This constraint can be remedied by using efficient energy storage. Hybrid Electrical Energy Storage (HEES) system is proposed recently as a cost effective approach with high power conversion efficiency and low self-discharge. In this paper, we propose a fast heuristic algorithm to improve the efficiency of charge allocation and replacement in an EHS/HEES equipped embedded system. The goal of our algorithm is to minimize the energy overhead on the DC-DC converter while satisfying the task deadline constraints of the embedded workload and maximizing the energy stored in the HEES system. We first provide an approximated but accurate power consumption model of the DC-DC converter. Based on this model, the optimal operating point of the system can be analytically solved. Integrated with the dynamic reconfiguration of the HEES bank, our algorithm provides energy efficiency improvement and run-time overhead reduction compared to previous approaches.

6B-4 (Time: 17:30 - 18:00)
 Title Modeling Variability and Irreproducibility of Nanoelectronic Resistive Switches for Circuit Simulation Author *Arne Heittmann, Tobias G. Noll (RWTH Aachen University, Germany) Page pp. 503 - 508 Keyword variability, resistive switches, electrochemical metallization effect, hybrid circuits, nanoelectronics Abstract This paper presents a device model for nanoelectronic resistive switches which are based on the electrochemical metallization effect (ECM). The focus is set on modeling variability as well as irreproducibility which are essential properties of scaled nanoelectronic devices. In particular, a Poisson-based random ion deposition model and a non-linear filament surface effect are described. The model is especially useful for circuit simulation and can be implemented on standard circuit simulation platforms such as Spice or Spectre using inbuilt standard elements. Based on this model, effects of variability were examined by Monte Carlo simulation for a particular hybrid CMOS/nanoelectronic circuit. The results show that the proposed model is able to cover significant scaling effects, which is necessary for prospective design space exploration and circuit optimization.

Session 6C  New Directions in Modeling, Simulation, and Integrity
Time: 16:00 - 18:00 Thursday, January 24, 2013
Chairs: Hideki Asai (Shizuoka University, Japan), Sheldon Tan (University of California, Riverside, U.S.A.)

6C-1 (Time: 16:00 - 16:30)
 Title HS3DPG: Hierarchical Simulation for 3D P/G Network Author *Shuai Tao, Xiaoming Chen, Yu Wang, Yuchun Ma (Tsinghua University, China), Yiyu Shi (Missouri University of Science and Technology, U.S.A.), Hui Wang, Huazhong Yang (Tsinghua University, China) Page pp. 509 - 514 Keyword 3D P/G network, hierarchical simulation, port equivalent model Abstract As different chips are stacked together in 3D ICs, the power/ground (P/G) network simulation becomes more challenging than that of 2D cases. In this paper, we propose a hierarchical simulation method suitable for 3D P/G network (HS3DPG), which can ensure full parallelism and good scalability with the number of tiers. Besides, the "locality" property is introduced into HS3DPG to further simplify the simulation. Finally, we use HS3DPG to analyze the voltage distribution of a 3D P/G network with clustered TSVs. Slides

6C-2 (Time: 16:30 - 17:00)
 Title Piecewise-Polynomial Associated Transform Macromodeling Algorithm for Fast Nonlinear Circuit Simulation Author *Yang Zhang, Neric Fong, Ngai Wong (The University of Hong Kong, Hong Kong) Page pp. 515 - 520 Keyword Nonlinear MOR, Associated transform, TPWL, PWP, Macromodeling Abstract We present a piecewise-polynomial based associated transform algorithm (PWPAT) for macromodeling nonlinear circuits in system-level circuit design. The generated reduced model can provide both global and local accuracies with the most compact dimension. Numerical examples compare it with existing algorithms and verify its superior accuracy in higher order harmonics simulation over traditional Trajectory Piecewise-Linear (TPWL) approach. Slides

6C-3 (Time: 17:00 - 17:30)
 Title An Ultra-Compact Virtual Source FET Model for Deeply-Scaled Devices: Parameter Extraction and Validation for Standard Cell Libraries and Digital Circuits Author *Li Yu, Omar Mysore, Lan Wei, Luca Daniel, Dimitri Antoniadis (MIT, U.S.A.), Ibrahim Elfadel (Masdar Institute of Science and Technology, United Arab Emirates), Duane Boning (MIT, U.S.A.) Page pp. 521 - 526 Keyword ultra-compact model, parameter extraction, library cell characterization, VLSI timing, power analysis Abstract In this paper, we present the first validation of the virtual source (VS) charge-based compact model for standard cell libraries and large-scale digital circuits. With only a modest number of physically meaningful parameters, the VS model accounts for the main short-channel effects in nanometer technologies. Using a novel DC and transient parameter extraction methodology, the model is verified with simulated data from a well-characterized, industrial 40nm bulk silicon model. The VS model is used to fully characterize a standard cell library at the 40nm node with timing comparisons showing less than 2.7% error with respect to the industrial design kit. Furthermore, a 1001-stage inverter chain and a 32-bit ripple-adder are employed as test cases in a vendor CAD environment to validate the use of the VS model for large-scale digital circuit applications. Parametric Vdd sweeps show that the VS model is also ready for usage in low-power design methodologies. Finally, runtime comparisons have shown that the use of the VS model results in a speedup of about 7.6×.

6C-4 (Time: 17:30 - 18:00)

Time: 16:00 - 18:00 Thursday, January 24, 2013
Chairs: Yasuo Sato (Kyushu Institute of Technology, Japan), Takashi Sato (Kyoto University, Japan)

6D-1 (Time: 16:00 - 16:30)
 Title Provably Optimal Test Cube Generation using Quantified Boolean Formula Solving Author Matthias Sauer, *Sven Reimer (University of Freiburg, Germany), Ilia Polian (University of Passau, Germany), Tobias Schubert, Bernd Becker (University of Freiburg, Germany) Page pp. 533 - 539 Keyword Test Cube, X-input, QBF, SAT, Relaxation Abstract Circuits that employ test pattern compression rely on test cubes to achieve high compression ratios. The fewer inputs of a test pattern are specified, the better it can be compacted and hence the lower the test application time. Although there exist previous approaches to generate such test cubes, none of them are optimal. We present for the first time a framework that yields provably optimal test cubes by using the theory of quantified Boolean formulas (QBF). Extensive comparisons with previous methods demonstrate the quality gain of the proposed method. Slides

6D-2 (Time: 16:30 - 17:00)
 Title Synthesizing Multiple Scan Chains by Cost-Driven Spectral Ordering Author *Louis Y.-Z. Lin, Christina C.-H. Liao, Charles H.-P. Wen (Dept. of Elec. & Comp. engr., National Chiao Tung University, Taiwan) Page pp. 540 - 545 Keyword testing, scan chain, scan order Abstract Power cost and wire cost are the two most critical issues in scan-chain optimization for modern VLSI testing. Many previous works used layout-based partitioning and greedy heuristics to synthesize multiple scan chains, and thus suffer from (1) the nongeometric-cost problem and (2) the crossing-edge problem. Therefore, in this paper, we propose cost-driven spectral ordering, including (1) cost-driven k-way spectral partitioning and (2) greedy non-crossing 2-opt ordering, to resolve the two problems stated above, respectively. Experiments show that different cost metrics can be properly addressed in k-way spectral partitioning. Moreover, our cost-driven spectral ordering achieves on average a 9% mixed (power-and-wire) cost reduction compared with two previous works on benchmark circuits, which demonstrates its effectiveness for multiple scan-chain synthesis.

6D-3 (Time: 17:00 - 17:30)
 Title A Binding Algorithm in High-Level Synthesis for Path Delay Testability Author *Yuki Yoshikawa (Kure National College of Technology, Japan) Page pp. 546 - 551 Keyword Delay test, High-level synthesis, Resource binding, Synthesis for testability Abstract A binding method in high-level synthesis for path delay testability is proposed in this paper. For a given scheduled data flow graph, the proposed method synthesizes a path delay testable RTL datapath and its controller. Every path in the datapath is two-pattern testable with the controller if the path is activated in the functional operation, i.e., the path is not a false path. Our experimental results show that the proposed method can synthesize such RTL circuits with small area overhead compared with that augmented by some DFT techniques such as scan design. Slides

6D-4 (Time: 17:30 - 18:00)
 Title Full Exploitation of Process Variation Space for Continuous Delivery of Optimal Delay Test Quality Author Baris Arslan (University of California, San Diego/Qualcomm, U.S.A.), *Alex Orailoglu (University of California, San Diego, U.S.A.) Page pp. 552 - 557 Keyword delay test, test cost optimization, adaptive test, process-aware test Abstract The increasing magnitude of process variations individualizes effectively each chip, necessitating distinct quantities of test resources for each in order to optimize overall delay test quality without exceeding set test budgets. This paper proposes an analytical framework that delivers the optimal test time assignment per chip in order to minimize the delay defect escape rate. Adjustment of the chip-specific test time in the continuous process variation space is attained through an adaptive test flow that utilizes process data measurements from the device under test. The results evince that a substantial improvement in the delay test quality can be obtained at no increase whatsoever to test time consumed by conventional test flows.

 Friday, January 25, 2013

Session 3K  Keynote III
Time: 9:00 - 10:00 Friday, January 25, 2013
Chair: Shinji Kimura (Waseda University, Japan)

3K-1 (Time: 9:00 - 10:00)
 Title (Keynote Address) Human, Vehicle and Social Infrastructure System Development for Sustainable Mobility – Development Innovation based on Large-Scale Simulation – Author Hiroyuki Watanabe (Toyota Motor Corporation, Japan) Abstract In order to realize a sustainable mobility society, technology development is ongoing to handle energy security, CO2 reduction, traffic-congestion and road-accident related challenges. The vehicle itself and the surrounding social infrastructure system have to deal with 3 challenges. The first challenge will be to develop and raise the efficiency of a powertrain supporting renewable energy like bio-fuels, electricity, and hydrogen. The second will be the challenge to enhance vehicle dynamic performance and to innovate environmental and safety features of the vehicle by autonomous driving and its supporting technologies. The third challenge is ITS development applying ICT (Information and Communication Technology), namely, the development of a “connected” vehicle and its related social infrastructure. These developments, which began by utilizing simulation technology such as HILS and SILS, have evolved for the systems which include both human and society. In the development of the social infrastructure system, much progress has been made in the development process itself, such as application of real-time probe-data, and large-scale simulation using Big Data. In this keynote lecture, we will introduce approaches to resolving key challenges, and show the future technological trend together with a proposal to innovate development based on large-scale simulation.

Session 7A  Special Session: Many-Core Architecture and Software Technology
Time: 10:20 - 12:20 Friday, January 25, 2013
Chairs: Masato Edahiro (Nagoya University, Japan), Hiroyuki Tomiyama (Ritsumeikan University, Japan)

7A-1 (Time: 10:20 - 10:40)
 Title (Invited Paper) SMYLE Project: Toward High-Performance, Low-Power Computing on Manycore-Processor SoCs Author *Koji Inoue (Kyushu University, Japan) Page pp. 558 - 560 Keyword manycore, SoC, low power, high performance, processor Abstract This paper introduces a manycore research project called SMYLE (Scalable ManYcore for Low Energy computing). The aims of this project are: 1) proposing a manycore SoC architecture and developing a suitable programming and execution environment, 2) designing a domain specific manycore system for emerging video mining applications, and 3) releasing developed software tools and FPGA emulation environments to accelerate manycore research and development in the community. The project started in December 2010 with full support from the New Energy and Industrial Technology Development Organization (NEDO).

7A-2 (Time: 10:40 - 11:05)
 Title (Invited Paper) SMYLEref: A Reference Architecture for Manycore-Processor SoCs Author *Masaaki Kondo, Son Truong Nguyen (The University of Electro-Communications, Japan), Tomoya Hirao, Takeshi Soga, Hiroshi Sasaki, Koji Inoue (Kyushu University, Japan) Page pp. 561 - 564 Keyword Manycore Processor, Prototyping, FPGA Abstract Nowadays, the trend of developing micro-processors with tens of cores brings a promising prospect for embedded systems. Realizing a high performance and low power many-core processor is becoming a primary technical challenge. We are currently developing a many-core processor architecture for embedded systems as a part of a NEDO project. This paper introduces the many-core architecture called SMYLEref along with the concept of Virtual Accelerator on Many-core, in which many cores on a chip are utilized as a hardware platform for realizing multiple virtual accelerators. We are developing its prototype system with off-the-shelf FPGA evaluation boards. In this paper, we introduce the architecture of SMYLEref and the detail of the prototype system. In addition, several initial experiments with the prototype system are also presented. Slides

7A-3 (Time: 11:05 - 11:30)
 Title (Invited Paper) SMYLE OpenCL: A Programming Framework for Embedded Many-core SoCs Author *Hiroyuki Tomiyama, Takuji Hieda, Naoki Nishiyama, Noriko Etani, Ittetsu Taniguchi (Ritsumeikan University, Japan) Page pp. 565 - 567 Keyword manycore SoCs, OpenCL, embedded systems Abstract Embedded SoC architecture has shifted from single-core to multi/many-core paradigm because of better power/performance efficiency. In order to exploit the potential power/performance efficiency of the many-core architecture, a parallel computing framework is necessary. OpenCL is one of the most popular parallel computing frameworks in the field of general-purpose computing on GPUs and multicore servers. However, the existing OpenCL implementations are not suitable for embedded real-time systems because of the large runtime overhead. In this paper, we describe a lightweight OpenCL framework for embedded multi/many-core SoCs. Our OpenCL framework minimizes the runtime overhead by statically creating threads and mapping them onto cores. Preliminary experiments on an FPGA prototype board with a five-core architecture show a significant reduction in runtime overhead compared with an existing OpenCL framework.

7A-4 (Time: 11:30 - 11:55)
 Title (Invited Paper) Support Tools for Porting Legacy Applications to Multicore Author Yuri Ardila, *Natsuki Kawai, Takashi Nakamura, Yosuke Tamura (Fixstars Corporation, Japan) Page pp. 568 - 573 Keyword auto-parallelizer, performance estimation, benchmark, parallel computing Abstract This paper presents PEMAP, an automated performance estimation tool to project performance of hand-parallelized programs from sequential programs and BEMAP, a benchmark suite to measure an auto-parallelizer or even a machine's performance. BEMAP is an open-source project, and the documentation on code explanations and experimental results is also provided. Our experiments on PEMAP show we can estimate performance of hand-parallelized programs within an error of 0.44% of the sequential program's performance on average, while using BEMAP shows that the ability of an auto-parallelizer can be measured by comparing the compiled code to the hand-tuned parallelized OpenCL code, thereby assisting the development of the auto-parallelizer tool. Slides

7A-5 (Time: 11:55 - 12:20)
 Title (Invited Paper) Manycore Processor for Video Mining Applications Author *Yukoh Matsumoto, Hiroyuki Uchida, Michiya Hagimoto, Yasumori Hibi, Sunao Torii, Masamichi Izumida (TOPS Systems Corporation, Japan) Page pp. 574 - 575 Abstract Through Architecture-Algorithm co-design for Video Mining Applications we designed a scalable Manycore processor consisting of clustered heterogeneous cores with stream processing capabilities, and zero-overhead inter-process communication through FIFO with a hardware-software mechanism. For achieving high-performance and low-power consumption, especially so as to reduce memory access required for Video Mining Applications, each application is partitioned to exploit both task and data parallelism, and programmed as a distributed stream processing with relatively large local register-file based on Kahn Process Network model. Slides

Session 7B  Simulation Acceleration
Time: 10:20 - 12:20 Friday, January 25, 2013
Chairs: Farhad Mehdipour (Kyushu University, Japan), Antoine Trouve (Institute of Systems, Information Technologies and Nanotechnologies, Japan)

7B-1 (Time: 10:20 - 10:50)
 Title Native Simulation of Complex VLIW Instruction Sets using Static Binary Translation and Hardware-Assisted Virtualization Author *Mian-Muhammad Hamayun, Frédéric Pétrot, Nicolas Fournel (TIMA Laboratory, CNRS/INP Grenoble/UJF, France) Page pp. 576 - 581 Keyword System Simulation, Static Binary Translation, Hardware-Assisted Virtualization, VLIW Abstract We introduce a static binary translation flow in native simulation context for cross-compiled VLIW executables. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on a Hardware-Assisted Virtualization (HAV) based native platform. We have implemented this approach for a TI C6x series processor and our simulation results show a speed-up of around two orders of magnitude compared to the cycle accurate simulators. Slides

7B-2 (Time: 10:50 - 11:20)
 Title RExCache: Rapid Exploration of Unified Last-level Cache Author *Su Myat Min Shwe, Haris Javaid, Sri Parameswaran (University of New South Wales, Australia) Page pp. 582 - 587 Keyword estimator, exploration, cache Abstract In this paper, we propose to explore design space of a unified last-level cache to improve system performance and energy efficiency. The challenge is to quickly estimate the execution time and energy consumption of the system with distinct cache configurations using minimal number of slow full-system cycle-accurate simulations. To this end, we propose a novel, simple yet highly accurate execution time estimator and a simple, reasonably accurate energy estimator. Our framework, RExCache, combines a cycle-accurate simulator and a trace-driven cache simulator with our novel execution time estimator and energy estimator to avoid cycle-accurate simulations of all the last-level cache configurations. Once execution time and energy estimates are available from the estimators, RExCache chooses minimum execution time or minimum energy consumption cache configuration. Our experiments with nine different applications from mediabench, and 330 last-level cache configurations show that the execution time and energy estimators had at least average absolute accuracy of 99.74% and 80.31% respectively. RExCache took only a few hours (21 hours for H.264enc) to explore last-level cache configurations compared to several days of traditional method (36 days for H.264enc) and cycle-accurate simulations (257 days for H.264enc), enabling quick exploration of the last-level cache. When 100 different real-time constraints on execution time and energy were used, all the cache configurations found by RExCache were similar to those from cycle-accurate simulations. On the other hand, the traditional method found correct cache configurations for only 69 out of 100 constraints. Thus, RExCache has better absolute accuracy than the traditional method, yet reducing the simulation time by at least 97%. Slides
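As a quick arithmetic check of the H.264enc figures quoted in this abstract, the sketch below converts the reported exploration times to hours and computes the relative reduction. The helper name `reduction_pct` is hypothetical and not from the paper; only the three time figures come from the abstract.

```python
def reduction_pct(new_hours: float, old_hours: float) -> float:
    """Percentage reduction in time when going from old_hours to new_hours."""
    return 100.0 * (1.0 - new_hours / old_hours)


rexcache_h = 21.0               # RExCache on H.264enc, in hours
traditional_h = 36 * 24.0       # traditional method: 36 days
cycle_accurate_h = 257 * 24.0   # cycle-accurate simulation: 257 days

vs_traditional = reduction_pct(rexcache_h, traditional_h)
vs_cycle_accurate = reduction_pct(rexcache_h, cycle_accurate_h)
print(f"{vs_traditional:.2f}% vs traditional, {vs_cycle_accurate:.2f}% vs cycle-accurate")
```

For this one application both reductions exceed 97%, which is consistent with the "at least 97%" claim; the abstract does not give per-application times for the other eight benchmarks.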

7B-3 (Time: 11:20 - 11:50)
 Title An Efficient Hybrid Synchronization Technique for Scalable Multi-Core Instruction Set Simulations Author *Bo-Han Zeng, Ren-Song Tsay, Ting-Chi Wang (National Tsing Hua University, Taiwan) Page pp. 588 - 593 Keyword Timing Synchronization, Multi-Core Simulator, Instruction Set Simulator Abstract Multi-core system simulation techniques have been essential to system development in recent years. Although these techniques have been studied extensively, we have found that both conventional polling and collaborative timing synchronization approaches encounter a severe scalability issue when the number of target cores is more than that of the host cores. To resolve this issue, we propose an effective hybrid technique that combines the advantages of the two approaches. According to the experimental results, the proposed technique effectively resolves the scalability issue and shows one to four orders of magnitude of improvement compared to conventional approaches.

Session 7C  Reliability Analysis and Test
Time: 10:20 - 12:20 Friday, January 25, 2013
Chairs: David Z. Pan (University of Texas, Austin, U.S.A.), Alex Orailoglu (University of California, San Diego, U.S.A.)

7C-1 (Time: 10:20 - 10:50)
 Title Statistical Analysis of BTI in the Presence of Process-induced Voltage and Temperature Variations Author *Farshad Firouzi, Saman Kiamehr, Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany) Page pp. 594 - 600 Keyword NBTI, PBTI, PVT, Reliability, Timing analysis Abstract In nano-scale regime, there are various sources of uncertainty and unpredictability of VLSI designs such as transistor aging mainly due to Bias Temperature Instability (BTI) as well as Process-Voltage-Temperature (PVT) variations. BTI exponentially varies by temperature and the actual supply voltage seen by the transistors within the chip which are functions of leakage power. Leakage power is strongly impacted by PVT and BTI which in turn results in thermal-voltage variations. Hence, neglecting one or some of these aspects can lead to a considerable inaccuracy in the estimated BTI-induced delay degradation. However, a holistic approach to tackle all these issues and their interdependence is missing. In this paper, we develop an analytical model to predict the probability density function and covariance of temperatures and voltage droops of a die in the presence of the BTI and process variation. Based on this model, we propose a statistical method that characterizes the life-time of the circuit affected by BTI in the presence of process-induced temperature-voltage variations. We observe that for benchmark circuits, treating each aspect independently and ignoring their intrinsic interactions results in 16% over-design, translating to unnecessary yield and performance loss. Slides

7C-2 (Time: 10:50 - 11:20)
 Title CLASS: Combined Logic and Architectural Soft Error Sensitivity Analysis Author *Mojtaba Ebrahimi, Liang Chen (Karlsruhe Institute of Technology, Germany), Hossein Asadi (Sharif University of Technology, Iran), Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany) Page pp. 601 - 607 Keyword Reliability Analysis, Soft Error, ACE, Error Propagation, Markov Chains Abstract With continuous technology downscaling, the rate of radiation induced soft errors is rapidly increasing. Fast and accurate soft error vulnerability analysis in early design stages plays an important role in cost-effective reliability improvement. However, existing solutions are suitable for either regular (a.k.a address-based such as memory hierarchy) or irregular (random logic such as functional units and control logic) structures, failing to provide an accurate system level analysis. In this paper, we propose a hybrid approach integrating architecture-level and logic-level techniques to accurately estimate the vulnerability of all regular and irregular structures within a microprocessor. It carefully handles error propagation and masking scenarios among these structures. We have evaluated the vulnerability of the OR1200 processor using the proposed approach. Comparison with statistical fault injection shows an average inaccuracy of less than 5% with five orders of magnitude improvement in runtime. Slides

7C-3 (Time: 11:20 - 11:50)
 Title Application Specified Soft Error Failure Rate Analysis using Sequential Equivalence Checking Techniques Author *Tun Li, Dan Zhu, Sikun Li, Yang Guo (National University of Defense Technology, China) Page pp. 608 - 613 Keyword Soft-error, Failure rate analysis, Sequential equivalence checking, Application Abstract Soft errors have become a critical challenge as a result of technology scaling. However, evaluating the influence of soft errors in flip-flops (FFs) on the failure of a circuit is a hard verification problem. Here, we propose a novel flip-flop soft error failure rate analysis methodology using sequential equivalence checking (SEC) and taking the application behaviors into consideration, which combines the advantage of formal techniques based approaches in completeness and the advantage of application behaviors in accuracy in differentiating vulnerability of FFs. As a result, all the FFs in a circuit are sorted by their failure rates and designers can use this information to perform optimal hardening of selected sequential components against soft errors. Experimental results on an implementation of a SpaceWire end node and the set of the largest ISCAS’89 benchmark sequential circuits demonstrate the efficiency of our approach. A case study on an instruction decoder of a practical 32-bit microprocessor shows the applicability of our methodology. Slides

7C-4 (Time: 11:50 - 12:20)
 Title An Adaptive Current-Threshold Determination for IDDQ Testing Based on Bayesian Process Parameter Estimation Author *Michihiro Shintani, Takashi Sato (Graduate School of Informatics, Kyoto University, Japan) Page pp. 614 - 619 Keyword IDDQ testing, Statistical leakage current analysis, Bayes' Theorem Abstract Application of IDDQ testing to LSIs fabricated using advanced process technology is becoming increasingly difficult due to large variability of scaled devices. In this paper, we propose a novel technique that adaptively determines per-chip current-threshold for IDDQ testing to enhance test accuracy. In the proposed technique, process condition of a chip and fault sensitization vector are first estimated based on measured IDDQ currents through Bayesian inference. Then, using the estimated process condition, a statistical distribution of the leakage current for each test pattern is calculated and suitable current-threshold is determined by the distribution. Simulation experiments demonstrate that the proposed technique can successfully detect a very small leakage fault, down to 2% of the nominal IDDQ current with the test escape ratio of 3.1%. Slides

Session 7D  Emerging Technologies in Cyber Systems
Time: 10:20 - 12:20 Friday, January 25, 2013
Chairs: Hao Yu (Nanyang Technological University, Singapore), Guojie Luo (Peking University, China)

7D-1 (Time: 10:20 - 10:50)
 Title DARNS: A Randomized Multi-modulo RNS Architecture for Double-and-Add in ECC to Prevent Power Analysis Side Channel Attacks Author Jude Angelo Ambrose (University of New South Wales, Australia), *Hector Pettenghi, Leonel Sousa (Instituto de Engenharia de Sistemas e Computadores, Portugal) Page pp. 620 - 625 Keyword residue number systems, power analysis side channel attacks, multi-modulo architectures Abstract Security in embedded systems is of critical importance since most of our secure transactions are currently made via credit cards or mobile phones. Power analysis based side channel attacks have been proved as the most successful attacks on embedded systems to retrieve secret keys, allowing impersonation and theft. State-of-the-art solutions for such attacks in Elliptic Curve Cryptography (ECC), mostly in software, hinder performance and are repeatedly attacked using improved techniques. To protect the ECC from both simple power analysis and differential power analysis, as a hardware solution, we propose to take advantage of the inherent parallelization capability in Multi-modulo Residue Number Systems (RNS) architectures to obfuscate the secure information. Random selection of moduli is proposed to randomly choose the moduli sets for each key bit operation. This solution allows us to prevent power analysis, while still providing all the benefits of RNS. In this paper, we show that the DPA is indeed thwarted, as well as correlation analysis. Slides

7D-2 (Time: 10:50 - 11:20)
 Title ScanPUF: Robust Ultralow-Overhead PUF Using Scan Chain Author Yu Zheng, Aswin Raghav Krishna, *Swarup Bhunia (Case Western Reserve University, U.S.A.) Page pp. 626 - 631 Keyword PUF, DFT, Uniqueness, Stability, NBTI Abstract Physical Unclonable Functions (PUFs) have emerged as an attractive primitive to address diverse hardware security issues, such as chip authentication, intellectual property (IP) protection and cryptographic key generation. Existing PUFs, typically acquired and integrated in a design as a commodity, often incur considerable hardware overhead. Many of these PUFs also suffer from insufficient challenge-response pairs. In this paper, we propose ScanPUF, a novel PUF implementation using a common on-chip structure used for improving circuit testability, namely scan chain. It exploits path delay variations between the scan flip-flops in a scan chain to create high-quality (in terms of uniqueness and robustness) secret keys. Furthermore, since a scan chain provides large pool of scan paths to create a signature, we can achieve high volume of secret keys from each chip. Since it uses a prevalent on-chip structure, the overhead is extremely small (2.3% area of the RO-PUF), primarily contributed by small additional logic in the signature-generation cycle controller. Circuit-level simulation results with 1000 chips under inter- and intra-die process variations show high uniqueness of 49.9% average inter-die Hamming distance and good reproducibility of 5% intra-die Hamming distance below 85 °C. The temporal variations due to device aging effect e.g. bias temperature instability (BTI) lead to only 4% unstable bits for ten-year usage. The experimental evaluation on FPGA (Altera Cyclone-III) exhibits 47.1% average inter-Hamming distance, as well as 3.2% unstable bits at room temperature. Slides
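The uniqueness figure quoted in this abstract (average inter-die Hamming distance, ideally close to 50%) follows a standard definition that can be sketched as below. The signatures are made-up bit strings for illustration, not data from the paper, and the function names are hypothetical.

```python
from itertools import combinations


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of bit positions in which two equal-length signatures differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_inter_die_hd(signatures) -> float:
    """Average pairwise Hamming fraction over all pairs of chip signatures."""
    pairs = list(combinations(signatures, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


# Made-up 8-bit signatures from three hypothetical chips.
sigs = ["10110010", "01100111", "11011000"]
uniqueness = mean_inter_die_hd(sigs)
print(f"mean inter-die Hamming distance: {100 * uniqueness:.1f}%")
```

The intra-die (reproducibility) figure uses the same `hamming_fraction`, but compares repeated readouts of one chip under different conditions rather than readouts of different chips.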

7D-3 (Time: 11:20 - 11:50)
 Title An Efficient Compression Scheme for Checkpointing of FPGA-Based Digital Mockups Author *Ting-Shuo Chou (University of California, Irvine, U.S.A.), Chen Huang, Bailey Miller (University of California, Riverside, U.S.A.), Tony Givargis (University of California, Irvine, U.S.A.), Frank Vahid (University of California, Riverside, U.S.A.) Page pp. 632 - 637 Keyword Digital Mockups, Test Automation, Cyber-Physical Systems, Medical Cyber-Physical Systems, Hardware-in-the-Loop Abstract This paper outlines a transparent and nonintrusive checkpointing mechanism for use with FPGA-based digital mockups. A digital mockup is an executable model of a physical system, implemented on an FPGA, and used for real-time test and validation of cyber-physical devices that interact with the physical system. These digital mockups are typically defined in terms of a large set of ordinary differential equations (ODEs). A checkpoint is a snapshot of the internal state of the model at a specific point in time as captured by some controller that resides on the same FPGA. A further requirement is that the model continues uninterrupted execution during a checkpointing operation. Once a checkpoint is created, the corresponding state information is transferred from the FPGA to a host computer for visualization and other off-chip processing. We outline the architecture of a checkpointing controller that captures and transfers the state information at a desired clock cycle using an aggressive compression technique. Our controller achieves a 90% reduction in the amount of data that is transferred from the FPGA to the host computer under periodic checkpointing scenarios. Slides

7D-4 (Time: 11:50 - 12:20)
 Title Maximizing Return on Investment of a Grid-Connected Hybrid Electrical Energy Storage System Author Di Zhu, Yanzhi Wang, Siyu Yue, *Qing Xie (University of Southern California, U.S.A.), Naehyuck Chang (Seoul National University, Republic of Korea), Massoud Pedram (University of Southern California, U.S.A.) Page pp. 638 - 643 Keyword return on investment, capital cost, hybrid electrical energy storage system Abstract This paper is the first to present a comprehensive analysis of the profitability of the hybrid electrical energy storage (HEES) systems while further providing a HEES design and control optimization framework to maximize the total return on investment (ROI). The solution consists of two steps: (i) Derivation of an optimal HEES management policy to maximize the daily energy cost saving and (ii) Optimal design of the HEES system to maximize the amortized annual profit under budget and system volume constraints. We consider a HEES system comprised of lead-acid and Li-ion batteries for a case study. The optimal HEES system achieves an annual ROI of up to 60% higher than a lead-acid battery-only system (Li-ion battery-only) system. Slides

Session 8A  Designers' Forum: Photonics for Embedded Systems
Time: 13:40 - 15:40 Friday, January 25, 2013
Organizer: Toshiki Sugawara (Hitachi, Japan)

8A-1 (Time: 13:40 - 14:10)
 Title (Invited Paper) Silicon Photonics Technology Platform for Embedded and Integrated Optical Interconnect Systems Author *Peter De Dobbelaere (Luxtera, U.S.A.) Page pp. 644 - 647 Keyword silicon photonics, optical transceiver Abstract By taking advantage of the vast investments made by the semiconductor industry, silicon photonics allows high-volume, high yield and low-cost manufacturing of complex photonic integrated circuits, including high-speed optical transceivers. In order to take full advantage of the CMOS semiconductor technology, the development and progress of Si photonics should further standardize and adhere to established CMOS technology methodologies and roadmaps.

8A-2 (Time: 14:10 - 14:40)
 Title (Invited Paper) High-Frequency Circuit Design for 25Gb/s×4 Optical Transceiver Author *Norio Chujo, Takashi Takemoto, Fumio Yuki, Hiroki Yamashita (Hitachi, Japan) Page pp. 648 - 651 Keyword optical module, transceiver Abstract A 25-Gb/s optical transceiver module has been developed for backplanes. It is necessary to downsize current modules while reducing power consumption and increasing speed up to 25 Gb/s. We employed many approaches to achieve this by reducing crosstalk noise, by enhancing power integrity, and by using CMOS-based analog FE and on-chip termination and optical waveform optimization. The fully integrated transceiver IC was fabricated with the 65-nm CMOS process and the package was small, being 9 x 14 mm in size.

8A-3 (Time: 14:40 - 15:10)
 Title (Invited Paper) Design and Application of Highly Integrated Optical Switches Based on Silicon Photonics Author *Shigeru Nakamura (NEC, Japan) Page pp. 652 - 654 Keyword Silicon photonics Abstract Silicon photonics is promising for integrating various functional optical devices according to applications. Here we discuss the possibility of applying silicon photonics devices to optical path switching in wide area photonic network nodes. System design including optical switches and device design for implementing a silicon photonics based optical switch are discussed. We integrate thermo-optical switch elements based on silicon optical waveguides into a compact one-chip device as an 8 x 8 optical switch. We demonstrate its capabilities such as high extinction ratio operation independently of polarization and ambient temperature, which are considered a critical step toward real application.

8A-4 (Time: 15:10 - 15:40)
 Title (Invited Paper) High Performance PIN Ge Photodetector and Si Optical Modulator with MOS Junction for Photonics-Electronics Convergence System Author *Junichi Fujikata, Masataka Noguchi, Makoto Miura, Masashi Takahashi, Shigeki Takahashi, Tsuyoshi Horikawa, Yutaka Urino, Takahiro Nakamura, Yasuhiko Arakawa (PETRA, Japan) Page pp. 655 - 656 Keyword Si photonics, Ge photodetector, Si optical modulation Abstract We report on a high speed silicon-waveguide-integrated PIN Ge photodetector of 45 GHz bandwidth, and a high efficiency of 0.3 V·cm silicon optical modulator with a metal-oxide-semiconductor (MOS) junction by applying the low optical loss and high conductivity poly-silicon gate. These OE/EO devices enable low drive voltage of around 1V, which would contribute to a high density optical interposer of the future photonics-electronics convergence system.

Session 8B  Revisiting Latency and Reliability in Memory Architectures
Time: 13:40 - 15:40 Friday, January 25, 2013
Chairs: Luca Carloni (Columbia University, U.S.A.), Fabien Clermidy (CEA-LETI, France)

8B-1 (Time: 13:40 - 14:10)
 Title Reevaluating the Latency Claims of 3D Stacked Memories Author *Daniel W. Chang (University of Wisconsin, Madison, U.S.A.), Gyungsu Byun (West Virginia University, U.S.A.), Hoyoung Kim, Minwook Ahn, Soojung Ryu (Samsung Electronics Co., Ltd., Republic of Korea), Nam S. Kim, Michael Schulte (University of Wisconsin, Madison, U.S.A.) Page pp. 657 - 662 Keyword DRAM, 3D main memory, 3D memory latency, digital signal processor, embedded systems Abstract In recent years, 3D technology has been a popular area of study that has allowed researchers to explore a number of novel computer architectures. One of the more popular topics is that of integrating 3D main memory dies below the computing die and connecting them with through-silicon vias (TSVs). This is assumed to reduce off-chip main memory access latencies by roughly 45% to 60%. Our detailed circuit-level models, however, demonstrate that this latency reduction from the TSVs is significantly less. In this paper, we present these models, compare 2D and 3D main memory latencies, and show that the reduction in latency from using 3D main memory is no more than 2.4 ns. We also show that although the wider I/O bus width enabled by using TSVs increases performance, it may do so with an increase in power consumption. Although TSVs consume less power per bit transfer than off-chip metal interconnects (11.2 times less power per bit transfer), TSVs typically use considerably more bits and may result in a net increase in power due to the large number of bits in the memory I/O bus. Our analysis shows that although a 3D memory hierarchy exploiting a wider memory bus can increase performance, this performance increase may not justify the net increase in power consumption. Slides

8B-2 (Time: 14:10 - 14:40)
 Title Heterogeneous Memory Management for 3D-DRAM and External DRAM with QoS Author *Le-Nguyen Tran (University of California, Irvine, U.S.A.), Houman Homayoun (George Mason University, U.S.A.), Fadi J. Kurdahi, Ahmed M. Eltawil (University of California, Irvine, U.S.A.) Page pp. 663 - 668 Keyword 3D-DRAM, Memory management, QoS, Heterogeneous memory system, computer architecture Abstract This paper presents an innovative memory management approach to utilize both 3D-DRAM and external DRAM (ex-DRAM). Our approach dynamically allocates and relocates memory blocks between the 3D-DRAM and the ex-DRAM to exploit the high memory bandwidth and the low memory latency of the 3D-DRAM as well as the high capacity and the low cost of the ex-DRAM. Our simulation shows that in workloads that are not memory intensive, our memory management technique transfers all active memory blocks to the 3D-DRAM which runs faster than the ex-DRAM. In memory intensive workloads, our memory management technique utilizes both the 3D-DRAM and the ex-DRAM to increase the memory bandwidth to alleviate the bandwidth congestion. Our approach also supports Quality of Service (QoS) for “latency sensitive”, “bandwidth sensitive”, and “insensitive” applications. To improve the performance and satisfy a certain level of QoS, memory blocks of different application types are allocated differently. Compared to the scratchpad memory management mechanism, the average memory access latency of our approach decreases by 19% and 23%; the performance improves by up to 5% and 12% in single threaded benchmarks and multi-threaded benchmarks respectively. Moreover, using our approach, applications do not need to manage the memory explicitly like the scratchpad case. Our memory block relocation comes with negligible performance overhead, particularly for applications which have high spatial memory locality.

8B-3 (Time: 14:40 - 15:10)
 Title Line Sharing Cache: Exploring Cache Capacity with Frequent Line Value Locality Author *Keitarou Oka (Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan), Hiroshi Sasaki, Koji Inoue (Faculty of Information Science and Electrical Engineering, Kyushu University, Japan) Page pp. 669 - 674 Keyword Cache Memory, Frequent Value Locality, Compression Abstract This paper proposes a new LLC architecture called line sharing cache (LSC), which reduces the number of misses without increasing the size of the cache memory. LSC stores lines that have identical values in a single line entry, which allows more lines to be stored. Evaluation results show performance improvements of up to 35% across a set of SPEC CPU2000 benchmarks. Slides
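
The idea of storing identical lines in a single entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the architecture from the paper: the class name, the unbounded tag store, and the oldest-first eviction policy are all hypothetical choices made to keep the example short.

```python
class LineSharingCacheSketch:
    """Toy model: many addresses may point at one shared data entry."""

    def __init__(self, num_data_entries):
        if num_data_entries < 1:
            raise ValueError("need at least one data entry")
        self.capacity = num_data_entries
        self.tag_to_value = {}  # address -> line value (stands in for a data pointer)
        self.refcount = {}      # line value -> number of addresses sharing it

    def _release(self, addr):
        # Drop one address; free the data entry when nobody shares it.
        value = self.tag_to_value.pop(addr)
        self.refcount[value] -= 1
        if self.refcount[value] == 0:
            del self.refcount[value]

    def insert(self, addr, value):
        if addr in self.tag_to_value:
            self._release(addr)
        if value not in self.refcount:
            # A new distinct value needs a free data entry; evict the
            # oldest inserted addresses until one becomes available.
            while len(self.refcount) >= self.capacity:
                self._release(next(iter(self.tag_to_value)))
        self.tag_to_value[addr] = value
        self.refcount[value] = self.refcount.get(value, 0) + 1

    def lookup(self, addr):
        return self.tag_to_value.get(addr)


cache = LineSharingCacheSketch(num_data_entries=2)
cache.insert(0, b"\x00" * 64)
cache.insert(64, b"\x00" * 64)   # shares the all-zero entry
cache.insert(128, b"\xff" * 64)
print(len(cache.tag_to_value), len(cache.refcount))  # three lines held in two entries
```

In this toy model, two data entries hold three cached lines because two of them carry the same value, which is the capacity effect the abstract describes.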

8B-4 (Time: 15:10 - 15:40)
 Title ShieldUS: A Novel Design of Dynamic Shielding for Eliminating 3D TSV Crosstalk Coupling Noise Author *Yuan-Ying Chang, Yoshi Shih-Chieh Huang (National Tsing Hua University, Taiwan), Vijaykrishnan Narayanan (Pennsylvania State University, U.S.A.), Chung-Ta King (National Tsing Hua University, Taiwan) Page pp. 675 - 680 Keyword TSV, crosstalk Abstract 3D IC is a promising technology to meet the demands of high throughput, high scalability, and low power consumption for future generation integrated circuits. One way to implement the 3D IC is to interconnect layers of two-dimensional (2D) IC with Through-Silicon Via (TSV), which shortens the signal lengths. Unfortunately, while TSVs are bundled together as a cluster, the crosstalk coupling noise may lead to transmission errors. As a result, the working frequency of TSVs has to be lowered to avoid the errors, leading to narrower bandwidth that TSVs can provide. In this paper, we first derive the crosstalk noise model from the perspective of 3D chip and then propose ShieldUS, a runtime data-to-TSVs remapping strategy. With ShieldUS, the transition patterns of data over TSVs are observed at runtime, and relatively stable bits will be mapped to the TSVs which act as shields to protect the other bits which have more fluctuations. We evaluate the performance of ShieldUS with address lines from real benchmark traces and data lines of different self-similarities. The results show that ShieldUS is accurate and flexible. We further study dynamic shielding and our design of Interval Equilibration Unit (IEU) can intelligently select suitable parameters for dynamic shielding, which makes dynamic shielding practical and does not need to predefine parameters. This also improves the practicability of ShieldUS. Slides

Session 8C  New 3D IC Design Techniques
Time: 13:40 - 15:40 Friday, January 25, 2013
Chairs: Guojie Luo (Peking University, China), Wai-Kei Mak (National Tsing Hua University, Taiwan)

8C-1 (Time: 13:40 - 14:10)
 Title High-Density Integration of Functional Modules Using Monolithic 3D-IC Technology Author *Shreepad Panth (Georgia Institute of Technology, U.S.A.), Kambiz Samadi, Yang Du (Qualcomm Research, U.S.A.), Sung Kyu Lim (Georgia Institute of Technology, U.S.A.) Page pp. 681 - 686 Keyword monolithic, power reduction, 3D-IC, floorplanning, block-level design Abstract Three dimensional integrated circuits (3D-ICs) have emerged as a promising solution to continue device scaling. They can be realized using Through Silicon Vias (TSVs), or monolithic integration using Monolithic Inter-tier vias (MIVs), an emerging alternative that provides much higher via densities. In this paper, we provide a framework for floorplanning existing 2D IP blocks into 3D-ICs using MIVs. We take the floorplanning solution all the way through place-and-route and report post-layout metrics for area, wirelength, timing, and power consumption. Results show that TSV-based 3D designs improve wirelength over 2D designs by up to 14%, and only in large-scale circuits. MIV-based 3D designs, however, offer an average wirelength improvement of 33% for a wide range of benchmark circuits. We also show that while TSV-based 3D cannot improve the performance and power unless the TSV capacitance is reduced, MIV-based 3D offers significant reductions of up to 33% in the longest path delay and 35% in the inter-block net power. Slides

8C-2 (Time: 14:10 - 14:40)
 Title Block-level Designs of Die-to-Wafer Bonded 3D ICs and Their Design Quality Tradeoffs Author *Krit Athikulwongse (Georgia Institute of Technology, U.S.A.), Dae Hyun Kim (Cadence, U.S.A.), Moongon Jung, Sung Kyu Lim (Georgia Institute of Technology, U.S.A.) Page pp. 687 - 692 Keyword 3D IC, Block-level Design, Die-to-Wafer Bonding Abstract In 3D ICs, block-level designs provide various advantages over designs done at other granularities, such as gate-level, because they promote the reuse of IP blocks. In this paper, we study block-level 3D-IC designs where the footprints of the dies in the stack are different. This happens in the case of die-to-wafer bonding, which is a more popular choice for near-term low-cost 3D designs. We study design quality tradeoffs among three different ways to place through-silicon vias (TSVs): TSV-farm, TSV-distributed, and TSV-whitespace. In our holistic approach, we use wirelength, power, performance, temperature, and mechanical stress metrics to conduct comprehensive comparative studies on the three design styles. In addition, we provide analysis on the impact of TSV size and pitch on the design quality of these three styles. Slides

8C-3 (Time: 14:40 - 15:10)
 Title Thermal-reliable 3D Clock-tree Synthesis Considering Nonlinear Electrical-thermal-coupled TSV Model Author Yang Shang, Chun Zhang, *Hao Yu, Chuan Seng Tan (Nanyang Technological University, Singapore), Xin Zhao, Sung Kyu Lim (Georgia Institute of Technology, U.S.A.) Page pp. 693 - 698 Keyword 3D physical design, Clock tree synthesis, Nonlinear electrical-thermal TSV model Abstract 3D physical design needs an accurate model of through-silicon vias (TSVs). In this paper, a physics-based electrical-thermal model is introduced for both signal and dummy thermal TSVs considering nonlinear electrical-thermal dependence. A nonlinear programming-based clock-skew reduction problem is formulated to allocate thermal TSVs for clock-skew reduction under non-uniform temperature distribution. Experiments show that under the nonlinear electrical-thermal TSV model, insertion of thermal TSVs can effectively reduce temperature-gradient-induced clock skew by 58.4% on average, which is 11.6% higher than the result under the linear electrical-thermal model. Slides

8C-4 (Time: 15:10 - 15:40)
 Title Stacking Signal TSV for Thermal Dissipation in Global Routing for 3D IC Author *Po-Yang Hsu, Hsien-Te Chen, TingTing Hwang (National Tsing Hua University, Taiwan) Page pp. 699 - 704 Keyword 3D IC, stacked TSV Abstract With no further shrink of device size, three dimensional (3D) chip stacking by Through-Silicon Via (TSV) has been identified as an effective way to achieve better performance in speed and power. However, such a solution inevitably encounters challenges in thermal dissipation since stacked dies generate a significant amount of heat per unit volume. We leverage an integrated architecture of stacked signal TSVs to minimize temperature with small wiring overhead. Based on the structure of the stacked signal TSV, a two-stage TSV locating algorithm in global routing is designed. With this TSV locating algorithm, we demonstrate that our stacked signal TSV structure is able to reduce temperature by 17% with 4% wiring overhead and 3% performance loss, as calculated by a 3D Elmore delay model. Compared to previous work by Cong and Zhang [1], where additional thermal TSVs are inserted, our experimental results use on average 23% fewer TSVs under the same temperature constraint. Slides

Session 8D  Advances in Simulation and Formal Verification
Time: 13:40 - 15:40 Friday, January 25, 2013
Chairs: Miroslav Velev (Aries Design Automation, U.S.A.), Andreas Veneris (University of Toronto, Canada)

8D-1 (Time: 13:40 - 14:10)
 Title VFCC: A Verification Framework of Cache Coherence using Parallel Simulation Author *Qiaoli Xiong, Jiangfang Yi, Tianbao Song, Zichao Xie, Dong Tong (Peking University, China) Page pp. 705 - 710 Keyword cache coherence, verification, simulation Abstract A cache coherence protocol is a vital component of a multiprocessor for maintaining data consistency. In this paper, we propose VFCC, a simulation framework to validate a cache-coherence protocol implementation of a commercial 64-bit superscalar multiprocessor. It exploits multiple-level parallelism to accelerate validation without overheads among threads. Our experimental results demonstrate that VFCC achieves a 5.0x speedup over a traditional simulator on a conventional 16-core host machine.

8D-2 (Time: 14:10 - 14:40)
 Title A Computational Model for SAT-based Verification of Hardware-Dependent Low-Level Embedded System Software Author *Bernard Schmidt, Carlos Villarraga (University of Kaiserslautern, Germany), Jörg Bormann (-, Germany), Dominik Stoffel, Markus Wedler, Wolfgang Kunz (University of Kaiserslautern, Germany) Page pp. 711 - 716 Keyword HW/SW-Verification, low-level software, property checking Abstract This paper describes a method to generate a computational model for formal verification of hardware-dependent software in embedded systems. The computational model of the combined HW/SW system is a program netlist (PN) consisting of instruction cells connected in a directed acyclic graph that compactly represents all execution paths of the software. The model can be easily integrated into SAT-based verification environments such as those based on Bounded Model Checking (BMC). The proposed construction of the model, however, allows for an efficient reasoning of the SAT solver over entire execution paths. We demonstrate the efficiency of our approach by presenting experimental results from the formal verification of an industrial LIN (Local Interconnect Network) bus node, implemented as a software driver on a 32-bit RISC machine. Slides

8D-3 (Time: 14:40 - 15:10)
 Title Reviving Erroneous Stability-based Clock-Gating using Partial Max-SAT Author Bao Le, *Dipanjan Sengupta, Andreas Veneris (University of Toronto, Canada) Page pp. 717 - 722 Keyword Debugging, Design Errors, Low Power design, Clock Gating, Stability Condition Abstract Although recent developments have automated most of the low power implementations, designers often manually modify the circuit in order to achieve further power savings. This human intervention is often paved with many errors that are bound to typical logic functional failures. Debugging these errors can be a resource intensive process. This paper proposes a novel debugging methodology to rectify erroneous clock gating implementations. The net effect of the proposed methodology leads to shorter debug time ensuring additional power savings. Slides

8D-4 (Time: 15:10 - 15:40)
 Title Simplification of C-RTL Equivalent Checking for Fused Multiply Add Unit using Intermediate Models Author *Bin Xue, Prosenjit Chatterjee (Nvidia Corp, U.S.A.), Sandeep K. Shukla (Virginia Tech, U.S.A.) Page pp. 723 - 728 Keyword sequential equivalent checking, FMA, floating point Abstract The functionality of a fused multiply add (FMA) design can be formally verified by comparing its register-transfer level (RTL) implementation against its system-level specification, often modeled in C/C++, using sequential equivalent checking (SEC). However, C-RTL SEC does not scale for FMA because of the huge discrepancy between the two models. This paper analyzes the dissimilarities and proposes two intermediate models, one abstract RTL model and one rewritten C model, to bridge the gap. The original SEC proof is partitioned into three sub-proofs among the intermediate models, where a variety of simplification techniques are applied to further reduce the complexity. Experiments from an industry project show that with the two intermediate models, the SEC proof is complete and scalable for the FMA design. Slides

Session 9A  Designers' Forum: Harmonized Hardware-Software Co-Design and Co-Verification
Time: 16:00 - 18:00 Friday, January 25, 2013
Organizer: Nobuyuki Nishiguchi (STARC, Japan), Moderator: Koichiro Yamashita (Fujitsu Laboratories, Japan)

9A-1
 Title (Panel Discussion) Harmonized Hardware-Software Co-Design and Co-Verification Author Panelists: Atsushi Ike (Fujitsu Laboratories, Japan), Hiroyuki Ikegami (Renesas Electronics Corporation, Japan), Tsuyoshi Isshiki (Tokyo Institute of Technology, Japan), Rainer Leupers (RWTH Aachen, Germany), Yosinori Watanabe (Cadence Berkeley Lab, U.S.A.), Tim Kogel (Synopsys, U.S.A.)

Session 9B  Memory and Storage Management
Time: 16:00 - 18:00 Friday, January 25, 2013
Chairs: Philip Brisk (University of California, Riverside, U.S.A.), Samarjit Chakraborty (TU Munich, Germany)

9B-1 (Time: 16:00 - 16:30)
 Title Reconstruction of Memory Accesses Based on Memory Allocation Mechanism for Source-Level Simulation of Embedded Software Author *Kun Lu, Daniel Müller-Gritschneder, Ulf Schlichtmann (Technische Universität München, Germany) Page pp. 729 - 734 Keyword source level simulation, TLM, performance estimation Abstract To date, there is still no way to accurately simulate data memory accesses in source-level simulation (SLS) of host-compiled embedded SW. The difficulty is that the accessed addresses for the load and store instructions cannot be statically determined. Without knowing those addresses, the source code cannot be annotated appropriately for data cache simulation. In this paper, we show an approach that is capable of resolving the accessed memory addresses based on the memory allocation mechanism. Applying this approach, the source code can be annotated to perform precise data cache simulation. The novelty of our methodology is that it is the first of its kind to take the memory allocation mechanism into account and thus can handle all the stack, data, heap and text sections. Moreover, a method is also proposed to handle pointer dereferences. In experiments, SLS with our approach yields an almost identical cache miss rate and pattern when compared to the reference simulation. Slides

9B-2 (Time: 16:30 - 17:00)

9B-3 (Time: 17:00 - 17:30)

9B-4 (Time: 17:30 - 18:00)
 Title Scheduling Multiple Charge Migration Tasks in Hybrid Electrical Energy Storage Systems Author Qing Xie, Di Zhu, Yanzhi Wang (University of Southern California, U.S.A.), *Younghyun Kim, Naehyuck Chang (Seoul National University, Republic of Korea), Massoud Pedram (University of Southern California, U.S.A.) Page pp. 749 - 754 Keyword charge management, energy storage, charge migration, scheduling Abstract Hybrid electrical energy storage (HEES) systems are comprised of multiple banks of heterogeneous electrical energy storage (EES) elements with distinct properties. This paper defines and solves the problem of scheduling multiple charge migration tasks in HEES systems with the objective of minimizing the total energy drawn from the source banks. The solution approach consists of two steps: (i) Finding the best charging current profile and voltage level setting for the Charge Transfer Interconnect (CTI) bus for each charge migration task, and (ii) Merging and scheduling the charge migration tasks. Experimental results demonstrate improvements of up to 32.2% in the charge migration efficiency compared to baseline setups in an example HEES system. Slides

Session 9C  Advanced Modeling and Analysis of Analog and Mixed-Signal Circuits
Time: 16:00 - 18:00 Friday, January 25, 2013
Chairs: Sheldon Tan (University of California, Riverside, U.S.A.), Takashi Sato (Kyoto University, Japan)

9C-1 (Time: 16:00 - 16:30)
 Title Stable Backward Reachability Correction for PLL Verification with Consideration of Environmental Noise Induced Jitter Author Yang Song, Haipeng Fu, *Hao Yu (Nanyang Technological University, Singapore), Guoyong Shi (Shanghai Jiao Tong University, China) Page pp. 755 - 760 Keyword Analog/RF system verification, Reachability analysis, PLL jitter Abstract It is not yet known how to perform efficient PLL system-level verification with consideration of jitter induced by substrate or power-supply noise. Using a nonlinear phase noise macromodel, this paper introduces a forward reachability analysis with stable backward correction for PLL system-level verification with jitter. By refining the initial state of the PLL through backward correction, one can perform an efficient PLL verification that automatically adjusts the locking range with consideration of environmental noise induced jitter. Moreover, to overcome the instability of backward correction, a stability calibration is introduced in this paper to limit the error. To validate our method, the proposed approach is applied to verify a number of PLL designs, including single-LC and coupled-LC oscillators, described by system-level behavioral models with jitter. Experimental results show that our forward reachability analysis with backward correction succeeds in reaching the adjusted locking range by correcting initial states in the presence of environmental noise induced jitter. Slides

9C-2 (Time: 16:30 - 17:00)
 Title Performance Bound and Yield Analysis for Analog Circuits under Process Variations Author Xue-Xin Liu (University of California, Riverside, U.S.A.), Adolfo Adair Palma-Rodriguez, Santiago Rodriguez-Chavez (Institute of Astrophysics, Optics, and Electronics, Mexico), *Sheldon X.-D. Tan (University of California, Riverside, U.S.A.), Esteban Tlelo-Cuautle (Institute of Astrophysics, Optics, and Electronics, Mexico), Yici Cai (Tsinghua University, China) Page pp. 761 - 766 Keyword bound analysis, variation, yield, optimization, symbolic analysis Abstract Yield estimation for analog integrated circuits is crucial for analog circuit design and optimization in the presence of process variations. In this paper, we present a novel analog yield estimation method based on a performance bound analysis technique in the frequency domain. The new method first derives the transfer functions of linear (or linearized) analog circuits via a graph-based symbolic analysis method. Then frequency response bounds of the transfer functions in terms of magnitude and phase are obtained by a nonlinear constrained optimization technique. To predict the yield rate, the bound information is employed to calculate Gaussian distribution functions. Experimental results show that the new method can achieve similar accuracy while delivering a 20 times speedup over HSPICE Monte Carlo simulation on some typical analog circuits. Slides

9C-3 (Time: 17:00 - 17:30)
 Title Local Approximation Improvement of Trajectory Piecewise Linear Macromodels through Chebyshev Interpolating Polynomials Author Muhammad Umer Farooq, *Likun Xia (Universiti Teknologi PETRONAS, Malaysia) Page pp. 767 - 772 Keyword Chebyshev polynomial, Taylor polynomial, State space (SS) Abstract We introduce the concept of two-dimensional (2D) scalability of trajectory piecewise linear (TPWL) macromodels through the exploitation of Chebyshev interpolating polynomials in each piecewise region. The goal of 2D scalability is to improve the local approximation properties of TPWL macromodels. Horizontal scalability is achieved by reducing the number of linearization points along the trajectory; vertical scalability is obtained by extending the scope of the macromodel to predict the response of a nonlinear system for inputs far from the training trajectory. In this way, more efficient macromodels are obtained in terms of simulation speedup for complex nonlinear systems. We provide the implementation details and illustrate the 2D scalability concept with an example using a nonlinear transmission line.

Session 9D  High-Level and Architectural Synthesis
Time: 16:00 - 18:00 Friday, January 25, 2013
Chairs: Robert Wille (University of Bremen, Germany), Takashi Takenaka (NEC, Japan)

9D-1 (Time: 16:00 - 16:30)
 Title Range and Bitmask Analysis for Hardware Optimization in High-Level Synthesis Author *Marcel Gort, Jason H. Anderson (University of Toronto, Canada) Page pp. 773 - 779 Keyword High level synthesis, Range analysis, FPGA, Compiler, LLVM Abstract We consider how bit-level representations of variables in HLS can be used to optimize hardware. Range and bitmask based analyses are considered separately and in tandem, where range analysis pre-determines min/max ranges for variables in order to minimize the hardware that uses those variables and bitmask analysis characterizes individual bits within a word as either constants (1 or 0), sign bits, or unknowns, which may also permit hardware to be eliminated. Static compiler-based analysis is contrasted with dynamic profiling-based analysis in terms of their potential to impact area and speed of HLS-generated hardware. Results show optimizations in HLS based on static analysis reduce circuit area by 9%, while those based on dynamic analysis provide 34% area reduction. Slides
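
The two analyses named in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names are hypothetical, sign-bit tracking is omitted, and only a bitwise AND is modeled. Range analysis picks the narrowest width that holds a known min/max range; bitmask analysis tracks which bits are provably 0 or 1.

```python
def bits_for_range(lo, hi):
    """Minimum width holding every integer in [lo, hi].

    Unsigned when lo >= 0, two's complement otherwise.
    """
    if lo > hi:
        raise ValueError("empty range")
    if lo >= 0:
        return max(1, hi.bit_length())
    # Magnitude bits for the negative and non-negative ends, plus a sign bit.
    return max((-lo - 1).bit_length(), max(hi, 0).bit_length()) + 1


def known_bits_const(value, width):
    """Known bits of a constant as (known_zero_mask, known_one_mask)."""
    mask = (1 << width) - 1
    return (~value & mask, value & mask)


def known_bits_and(a, b):
    """Known bits of a & b: a result bit is 0 if it is known 0 on either
    side, and 1 only if it is known 1 on both sides."""
    return (a[0] | b[0], a[1] & b[1])


print(bits_for_range(0, 255))     # loop counter 0..255
print(bits_for_range(-128, 127))  # signed byte range

unknown_8bit = (0, 0)             # nothing known about any bit
zeros, ones = known_bits_and(unknown_8bit, known_bits_const(0x0F, 8))
print(hex(zeros), hex(ones))      # upper four bits are provably zero
```

In the last example, masking an unknown 8-bit value with 0x0F proves the upper four bits are constant zero, so hardware driving those bits could be removed.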

9D-2 (Time: 16:30 - 17:00)
 Title A Gradual Scheduling Framework for Problem Size Reduction and Cross Basic Block Parallelism Exploitation in High-level Synthesis Author *Hongbin Zheng, Qingrui Liu, Junyi Li, Dihu Chen, Zixin Wang (Sun Yat-sen University, China) Page pp. 780 - 786 Keyword High-level synthesis, Electronic design automation and methodology, Scheduling Abstract In High-level Synthesis (HLS), scheduling has a critical impact on the quality of the hardware implementation. However, the schedules of different operations have unequal impacts on the Quality of Result. Based on this fact, we propose a novel scheduling framework that schedules the operations separately according to their significance to the Quality of Result, to avoid wasting computational effort on noncritical operations. Furthermore, the proposed framework supports global code motion, which helps to improve the speed of the hardware implementation by distributing the execution time of operations across their parent BB. Slides

9D-3 (Time: 17:00 - 17:30)
 Title Implementing Microprocessors from Simplified Descriptions Author *Nikhil A. Patil, Derek Chiou (University of Texas at Austin, U.S.A.) Page pp. 787 - 793 Keyword high-level synthesis, Bluespec, Microcode, Processors Abstract Despite the proliferation of high-level synthesis tools, hardware description of microprocessors remains complex. We argue that much of the incidental complexity can be relieved by untangling the description into separate functional and microarchitectural components. Such an untangling can be achieved using a high-level microcode compiler that can generate not only microcode, but also the micro-instruction format and the interpretations of each control bit. Simplifying hardware description will help the designer make better design-space trade-offs, and close the design and verification loop faster. This paper takes the reader through an implementation of a simple Y86 processor to qualitatively illustrate the complexity reduction from the untangling. Slides

9D-4 (Time: 17:30 - 18:00)
 Title Application-Specific Fault-Tolerant Architecture Synthesis for Digital Microfluidic Biochips Author *Mirela Alistar, Paul Pop, Jan Madsen (Denmark Technical University, Denmark) Page pp. 794 - 800 Keyword digital microfluidics biochips, CAD tools, architecture synthesis, fault tolerant Abstract Microfluidic-based biochips are replacing the conventional biochemical analyzers, and are able to integrate on-chip all the necessary functions for biochemical analysis using microfluidics. The digital microfluidic biochips are based on the manipulation of liquids not as a continuous flow, but as discrete droplets on an array of electrodes. Microfluidic operations, such as transport, mixing, split, are performed on this array by routing the corresponding droplets on a series of electrodes. Researchers have proposed several approaches for the synthesis of digital microfluidic biochips. All previous work assumes that the biochip architecture is given, and most approaches consider a rectangular shape for the electrode array. However, non-regular application-specific architectures are common in practice. Hence, in this paper, we propose an approach to the application-specific architecture synthesis. Our approach can also help the designer to increase the yield by introducing redundant electrodes to tolerate permanent faults. The proposed architecture synthesis algorithm has been evaluated using several benchmarks. Slides