Wednesday, January 23, 2013 |
A | B | C | D |
---|---|---|---|
8:30 - 10:00 | | | |
10:20 - 12:20 | 10:20 - 12:20 | 10:20 - 12:20 | 10:20 - 12:20 |
13:40 - 15:40 | 13:40 - 15:40 | 13:40 - 15:40 | 13:40 - 15:40 |
16:00 - 18:00 | 16:00 - 18:00 | 16:00 - 18:00 | 16:00 - 18:00 |
Thursday, January 24, 2013 |
A | B | C | D |
---|---|---|---|
9:00 - 10:00 | | | |
10:20 - 12:20 | 10:20 - 12:20 | 10:20 - 12:20 | 10:20 - 12:20 |
13:40 - 15:40 | 13:40 - 15:40 | 13:40 - 15:40 | 13:40 - 15:40 |
16:00 - 18:00 | 16:00 - 18:00 | 16:00 - 18:00 | 16:00 - 18:00 |
Friday, January 25, 2013 |
A | B | C | D |
---|---|---|---|
9:00 - 10:00 | | | |
10:20 - 12:20 | 10:20 - 12:20 | 10:20 - 12:20 | 10:20 - 12:20 |
13:40 - 15:40 | 13:40 - 15:40 | 13:40 - 15:40 | 13:40 - 15:40 |
16:00 - 18:00 | 16:00 - 18:00 | 16:00 - 18:00 | 16:00 - 18:00 |
Wednesday, January 23, 2013 |
Title | (Keynote Address) From Circuits to Cancer |
Author | *Sani Nassif (IBM Austin Research Lab., U.S.A.) |
Abstract | The human race has invested about a trillion dollars in the development of semiconductor electronics, and our lives have been improved greatly as a result. Smart devices are now taken for granted and permeate every aspect of our daily lives. One of the important products of this huge investment was the development of sophisticated design optimization and simulation tools to allow the largely automated design and verification of integrated circuits. Sometimes we in the EDA community do not realize quite how advanced we are in this field, and just how applicable much of the Silicon R&D work is to other areas... This talk will be about one such area, namely that of Proton Radiation Cancer Therapy, where a team at IBM, working with researchers at the M. D. Anderson Cancer Research center, have been busy applying knowledge from the VLSI area to this important problem. We will show examples of applying large scale analysis and optimization techniques to the treatment planning problem, and hopefully motivate other EDA researchers to seek applications of their deep knowledge in adjacent fields. |
Title | (Invited Paper) Equivalent Circuit Model Extraction for Interconnects in 3D ICs |
Author | *A. Ege Engin (San Diego State University, U.S.A.) |
Page | pp. 1 - 6 |
Keyword | TSV, 3D IC, silicon interposer |
Abstract | Parasitic RC behavior of VLSI interconnects has been the major bottleneck in terms of latency and power consumption of ICs. Recent 3D ICs promise to reduce the parasitic RC effect by making use of through silicon vias (TSVs). It is therefore essential to extract the RC model of TSVs to assess their promise. Unlike interconnects on metal layers, TSVs exhibit slow-wave and dielectric quasi-transverse-electromagnetic (TEM) modes due to the coupling to the semiconducting substrate. This TSV behavior can be simulated using analytical methods, 2D electrostatic simulators, or 3D full-wave electromagnetic simulators. In this paper, we describe a methodology to extract parasitic RC models from such simulation data for interconnects in a 3D IC. |
Slides |
Title | (Invited Paper) Unconditionally Stable Explicit Method for the Fast 3-D Simulation of On-Chip Power Distribution Network with Through Silicon Via |
Author | *Tadatoshi Sekine, Hideki Asai (Shizuoka University, Japan) |
Page | pp. 7 - 12 |
Keyword | power distribution network, through silicon via, explicit method, unconditionally stable, fast circuit simulation |
Abstract | In this work, we propose an explicit method that is stable without any stability condition, for the fast simulation of the equivalent circuit of an on-chip power distribution network with a number of through silicon vias. Additionally, the proposed unconditionally stable explicit method is further accelerated by combining it with an order reduction technique. |
Title | (Invited Paper) Signal Integrity Modeling and Measurement of TSV in 3D IC |
Author | *Joungho Kim, Joungho Kim (Korea Advanced Institute of Science and Technology, Republic of Korea) |
Page | pp. 13 - 16 |
Keyword | Through Silicon Via, Signal Integrity, Modeling, Measurement |
Abstract | In order to guarantee signal integrity of a TSV-based channel in 3D IC design, modeling and measurements are conducted for electrical characterization of the TSV-based channel, including TSVs and RDLs, with various performance metrics such as insertion loss, noise coupling and eye diagrams. Based on the modeling and measurements of the fabricated TSV channels, a design guide for the signal integrity of the channel is proposed. |
Slides |
Title | (Invited Paper) Power Distribution Network Modeling for 3-D ICs with TSV Arrays |
Author | Chi-Kai Shen, Yi-Chang Lu, Yih-Peng Chiou, Tai-Yu Cheng, *Tzong-Lin Wu (National Taiwan University, Taiwan) |
Page | pp. 17 - 22 |
Keyword | 3-D IC, PDN, equivalent circuit model, TSV, CNIM |
Abstract | A coupling node insertion method (CNIM) is proposed to handle electrical coupling between top metals of on-chip interconnects and silicon substrate surfaces in three-dimensional integrated circuits (3-D ICs). This coupling effect should not be neglected, especially as metal area is intentionally increased in order to reduce resistance values. In this paper, we illustrate how to build the CNIM model and incorporate it into power distribution networks. The CNIM model is validated by comparing our results to those obtained from a full-wave simulator. The differences between the two approaches are within 5%, but our computation time is shorter than that required by a full-wave simulator. |
Title | A Case for Wireless 3D NoCs for CMPs |
Author | *Hiroki Matsutani (Keio University, Japan), Paul Bogdan, Radu Marculescu (Carnegie Mellon University, U.S.A.), Yasuhiro Take, Daisuke Sasaki, Hao Zhang (Keio University, Japan), Michihiro Koibuchi (National Institute of Informatics, Japan), Tadahiro Kuroda, Hideharu Amano (Keio University, Japan) |
Page | pp. 23 - 28 |
Keyword | Network-on-Chip (NoC), 3-D NoC, irregular topology |
Abstract | Inductive-coupling is yet another 3D integration technique that can be used to stack more than three known-good-dies in a SiP without wire connections. We present a topology-agnostic 3D CMP architecture using inductive-coupling that offers great flexibility in customizing the number of processor chips, SRAM chips, and DRAM chips in a SiP after chips have been fabricated. In this paper, first, we propose a routing protocol that exchanges the network information between all chips in a given SiP to establish efficient deadlock-free routing paths. Second, we propose its optimization technique that analyzes the application traffic patterns and selects different spanning tree roots so as to minimize the average hop counts and improve the application performance. |
Title | Deflection Routing in 3D Network-on-Chip with TSV Serialization |
Author | *Jinho Lee, Dongwoo Lee, Sunwook Kim, Kiyoung Choi (Seoul National University, Republic of Korea) |
Page | pp. 29 - 34 |
Keyword | network-on-chip(NoC), deflection routing, TSV serialization, 3D NoC, network |
Abstract | This paper proposes a deflection routing scheme for 3D NoC with serialized TSVs. Bufferless deflection routing provides area- and power-efficient communication under low to medium traffic load. Under 3D circumstances, bufferless deflection routing can yield even better performance than buffered routing when key aspects are properly taken into account. Evaluation of the proposed scheme shows its effectiveness in throughput, latency, and energy consumption. |
Slides |
Title | MD: Minimal Path-based Approach for Fault-Tolerant Routing in On-Chip Networks |
Author | Masoumeh Ebrahimi, Masoud Daneshtalab, Juha Plosila (University of Turku, Finland), *Farhad Mehdipour (Kyushu University, Japan) |
Page | pp. 35 - 40 |
Keyword | Network-on-Chip, fault-tolerant approach, minimal path, adaptive routing algorithm |
Abstract | The communication requirements of many-core embedded systems are met by the emerging Network-on-Chip (NoC) paradigm. As on-chip communication reliability is a crucial factor in many-core systems, the NoC paradigm should address the reliability issues. Using fault-tolerant routing algorithms to reroute packets around faulty regions will increase the packet latency and create congestion around the faulty region. On the other hand, the performance of NoC is highly affected by the congestion in the network. Congestion in the network can increase the delay of packets routed from a source to a destination, so it should be avoided. In this paper, a minimal and defect-resilient (MD) routing algorithm is proposed in order to route packets adaptively through shortest paths in the presence of one faulty link, as long as a path exists. To avoid congestion, output channels can be adaptively chosen whenever the distance from the current to the destination node is greater than one hop along both directions. In addition, an analytical model is presented to evaluate MD under the condition of two faulty links. |
Title | A Dynamic Stream Link for Efficient Data Flow Control in NoC Based Heterogeneous MPSoC |
Author | Claude Helmstetter, Sylvain Basset, *Romain Lemaire (CEA-Leti, Minatec Campus, France), Michel Langevin, Chuck Pilkington (STMicroelectronics, Ottawa, Canada), Fabien Clermidy (CEA-Leti, Minatec Campus, France), Pierre Paulin (STMicroelectronics, Ottawa, Canada), Pascal Vivet (CEA-Leti, Minatec Campus, France), Didier Fuin (STMicroelectronics, Grenoble, France) |
Page | pp. 41 - 46 |
Keyword | NoC, Stream Link, Heterogeneous MPSoC, Data Flow |
Abstract | As System-on-Chip sizes increase, communication costs become critical and Networks-on-Chip (NoC) bring innovative solutions. Efficient stream-based protocols over NoC have been widely studied to address dataflow communications. They are usually controlled by a set of static parameters. However, new applications, such as high-resolution video decoders, present more data-dependent behaviors, forcing communication protocols to support higher dynamicity. For this purpose, we present in this paper dynamic stream links for stream-based end-to-end NoC communications by introducing two link protocols, both independent of the transfer size, which improve the hardware/software control flexibility. The proposed protocols have been modeled in an MPSoC virtual platform and the hardware cost evaluated. Based on simulations, we provide guidelines to exploit these protocols according to application needs. |
Slides |
Title | On Real-Time STM Concurrency Control for Embedded Software with Improved Schedulability |
Author | *Mohammed Elshambakey, Binoy Ravindran (ECE Dept, Virginia Tech, U.S.A.) |
Page | pp. 47 - 52 |
Keyword | STM, real time, contention manager |
Abstract | We consider software transactional memory (STM) concurrency control for embedded multicore real-time software, and present a novel contention manager for resolving transactional conflicts, called PNF. We upper bound transactional retries and task response times. Our implementation in RSTM/real-time Linux reveals that PNF yields shorter or comparable retry costs than competitors. |
Slides |
Title | Schedule Integration for Time-Triggered Systems |
Author | *Florian Sagstetter, Martin Lukasiewycz (TUM CREATE, Singapore), Samarjit Chakraborty (TU Munich, Germany) |
Page | pp. 53 - 58 |
Keyword | scheduling, time-triggered system, FlexRay |
Abstract | This paper presents a framework for the schedule integration of time-triggered systems tailored to the automotive domain. In-vehicle networks might be very large and complex such that obtaining a schedule for a fully synchronous system becomes a challenging task since all bus and processor constraints as well as end-to-end-timing constraints have to be taken concurrently into account. Existing optimization approaches apply the schedule optimization to the entire network, limiting their application due to scalability issues. In contrast, the presented framework obtains the schedule for the entire network, using a two-step approach where for each cluster a local schedule is obtained and the local schedules are finally merged to the global schedule. This approach is also in accordance with the design process in the automotive industry where different subsystems are developed independently to reduce the design complexity and are finally combined in the integration stage. In this paper, a generic framework for schedule integration of time-triggered systems is presented. Further, we show how this framework is implemented for a FlexRay network using an Integer Linear Programming (ILP) approach which might also be easily adapted to other protocols. A realistic case study and a scalability analysis give evidence of the applicability and efficiency of our approach. |
Slides |
Title | Online Estimation of the Remaining Energy Capacity in Mobile Systems Considering System-Wide Power Consumption and Battery Characteristics |
Author | Donghwa Shin (Seoul National University, Republic of Korea), Woojoo Lee (University of Southern California, U.S.A.), Kitae Kim (Seoul National University, Republic of Korea), Yanzhi Wang, Qing Xie (University of Southern California, U.S.A.), *Naehyuck Chang (Seoul National University, Republic of Korea), Massoud Pedram (University of Southern California, U.S.A.) |
Page | pp. 59 - 64 |
Keyword | Low-power design, Power estimation, Smartphone, Battery life, Quality of service |
Abstract | Emerging mobile systems integrate a lot of functionality into a small form factor with a small energy source in the form of a rechargeable battery. This situation necessitates accurate estimation of the remaining energy in the battery such that user applications can be judicious in how they consume this scarce and precious resource. This paper thus focuses on estimating the remaining battery energy in Android OS-based mobile systems. This paper proposes to instrument the Android kernel in order to collect and report accurate subsystem activity values based on real-time profiling of the running applications. The activity information, along with offline-constructed, regression-based power macro models for major subsystems in the smartphone, yields the power dissipation estimate for the whole system. Next, while accounting for the rate-capacity effect in batteries, the total power dissipation data is translated into the battery’s energy depletion rate and subsequently used to compute the battery’s remaining lifetime based on its current state of charge information. Finally, this paper describes a novel application design framework, which considers the battery’s state-of-charge (SOC), the battery’s energy depletion rate, and the service quality of the target application. The benefits of the design framework are illustrated by examining an archetypical case, involving the design space exploration and optimization of a GPS-based application in an Android OS. |
Title | WUCC: Joint WCET and Update Conscious Compilation for Cyber-physical Systems |
Author | Yazhi Huang, Mengying Zhao, *Chun Jason Xue (City University of Hong Kong, Hong Kong) |
Page | pp. 65 - 70 |
Keyword | WCET, code similarity, real time systems |
Abstract | The cyber-physical system (CPS) is a desirable computing platform for many industrial and scientific applications. However, the application of CPSs has two challenges: First, CPSs often include a number of sensor nodes. Update of preloaded code on remote sensor nodes powered by batteries is extremely energy-consuming. The code update issue in the energy sensitive CPS must be carefully considered; Second, CPSs are often real-time embedded systems with real-time properties. Worst-Case Execution Time (WCET) is one of the most important metrics in real-time system design. While existing works only consider one of these two challenges at a time, in this paper, a compiler-level optimization, Joint WCET and Update Conscious Compilation (WUCC), is proposed to jointly consider WCET and code update for cyber-physical systems. The novelty of the proposed approach is that the WCET problem and code update problem are considered concurrently such that a balanced solution with minimal WCET and minimal code difference can be achieved. The experimental results show that the proposed technique can minimize WCET and code difference effectively. |
Slides |
Title | A 40-nm 144-mW VLSI Processor for Real-time 60-kWord Continuous Speech Recognition |
Author | *Guangji He, Takanobu Sugahara, Tsuyoshi Fujinaga, Yuki Miyamoto, Hiroki Noguchi, Shintaro Izumi, Hiroshi Kawaguchi, Masahiko Yoshimoto (Kobe University, Japan) |
Page | pp. 71 - 72 |
Keyword | hidden Markov model (HMM), large vocabulary continuous speech recognition (LVCSR), memory bandwidth reduction |
Abstract | We have developed a low-power VLSI chip for 60-kWord real-time continuous speech recognition based on a context-dependent Hidden Markov Model (HMM). Our implementation includes a cache architecture using locality of speech recognition, beam pruning using a dynamic threshold, two-stage language model searching, highly parallel Gaussian Mixture Model (GMM) computation based on the mixture level, a variable-frame look-ahead scheme, and elastic pipeline operation between the Viterbi transition and GMM processing. Results show that our implementation achieves 95% bandwidth reduction (70.86 MB/s) and 78% required frequency reduction (126.5 MHz). The test chip, fabricated using 40 nm CMOS technology, contains 1.9 M transistors for logic and 7.8 Mbit on-chip memory. It dissipates 144 mW at 126.5 MHz and 1.1 V for 60 kWord real-time continuous speech recognition. |
Slides |
Title | A 24.5-53.6pJ/pixel 4320p 60fps H.264/AVC Intra-Frame Video Encoder Chip in 65nm CMOS |
Author | *Dajiang Zhou, Gang He, Wei Fei, Zhixiang Chen, Jinjia Zhou, Satoshi Goto (Waseda University, Japan) |
Page | pp. 73 - 74 |
Keyword | H.264/AVC, 4320p, video encoder, low power |
Abstract | An H.264/AVC intra-frame video encoder is implemented in 65nm CMOS. With an efficient intra prediction design, its maximum throughput reaches 1991Mpixels/s for 7680x4320p 60fps video, 9.4x to 32x faster than previous designs. The encoder also incorporates a 1.41Gbins/s CABAC architecture that has been enhanced by 31%. Moreover, low energy consumption is achieved by the high parallelism and hardware efficiency of this design. 1080p 30fps encoding dissipates only 2mW at 0.8V and 9MHz. |
Slides |
Title | A Low Power Multimedia Processor Implementing Dynamic Voltage and Frequency Scaling Technique |
Author | Tadayoshi Enomoto, *Nobuaki Kobayashi (Chuo University, Japan) |
Page | pp. 75 - 76 |
Keyword | motion estimation, Multimedia Processor, DVFS, power dissipation |
Abstract | A 90-nm CMOS multimedia processor was developed by employing a dynamic voltage and frequency scaling (DVFS) technique to greatly reduce the power dissipation (P). To adaptively predict the optimum supply voltage (VD) and the optimum clock frequency (fc), a fast motion estimation (ME) algorithm, an absolute difference accumulator, and a DVFS controller were developed. Measured P of the multimedia processor was 34.4 µW, which was only 0.48% that of a conventional multimedia processor. |
Slides |
Title | A 40-nm 0.5-V 12.9-pJ/Access 8T SRAM Using Low-Power Disturb Mitigation Technique |
Author | *Shusuke Yoshimoto, Masaharu Terada, Shunsuke Okumura (Kobe University, Japan), Toshikazu Suzuki (Panasonic Corporation, Japan), Shinji Miyano (Semiconductor Technology Academic Research Center, Japan), Hiroshi Kawaguchi, Masahiko Yoshimoto (Kobe University, Japan) |
Page | pp. 77 - 78 |
Keyword | SRAM, 8T, Low power, half select, write back |
Abstract | This paper presents a novel disturb mitigation technique which achieves low-power and low-voltage SRAM. Our proposed technique consists of a floating bitline technique and a low-swing bitline driver (LSBD). We fabricated a 512-Kb 8T SRAM test chip that operates at a single 0.5-V supply voltage. The proposed technique achieves 1.52-pJ/access active energy in a write cycle and 72.8-µW leakage power, which are 59.4% and 26.0% better than the conventional write-back technique. |
Slides |
Title | A Physical Unclonable Function Chip Exploiting Load Transistors’ Variation in SRAM Bitcells |
Author | *Shunsuke Okumura, Shusuke Yoshimoto, Hiroshi Kawaguchi (Kobe University, Japan), Masahiko Yoshimoto (Kobe University/JST CREST, Japan) |
Page | pp. 79 - 80 |
Keyword | SRAM, PUF, Chip ID |
Abstract | We propose a chip identification (ID) generating scheme with random variation of transistor characteristics in SRAM bitcells. In the proposed scheme, a unique fingerprint is generated by grounding both bitlines. It has high speed, and it can be implemented with a very small area overhead. We fabricated test chips in a 65-nm process and obtained 12,288 sets of unique 128-bit fingerprints, which are evaluated in this paper. The failure rate of the IDs is found to be 2.1 × 10^-12. |
Slides |
Title | Over 10-Times High-speed, Energy Efficient 3D TSV-Integrated Hybrid ReRAM/MLC NAND SSD by Intelligent Data Fragmentation Suppression |
Author | *Chao Sun (Chuo University/University of Tokyo, Japan), Hiroki Fujii (University of Tokyo, Japan), Kousuke Miyaji, Koh Johguchi (Chuo University, Japan), Kazuhide Higuchi (University of Tokyo, Japan), Ken Takeuchi (Chuo University, Japan) |
Page | pp. 81 - 82 |
Keyword | SSD, ReRAM, TSV, MLC NAND |
Abstract | A 3D through-silicon-via (TSV)-integrated hybrid ReRAM/multi-level-cell (MLC) NAND solid-state drive (SSD) architecture is proposed with a NAND-like interface (I/F) and a sector-access overwrite policy for ReRAM. Furthermore, intelligent data management algorithms are proposed to suppress data fragmentation and excess usage of MLC NAND. As a result, an 11-times performance increase, a 6.9-times endurance enhancement and a 93% write energy reduction are achieved. Both ReRAM write and read latency should be less than 3 µs to obtain these improvements. The required endurance for ReRAM is 10^5. |
Title | Highly Reliable Solid-State Drives (SSDs) with Error-Prediction LDPC (EP-LDPC) Architecture and Error-Recovery Scheme |
Author | *Shuhei Tanakamaru, Yuki Yanagihara (The University of Tokyo, Japan), Ken Takeuchi (Chuo University, Japan) |
Page | pp. 83 - 84 |
Keyword | Solid-state drive, SSD, Error-correcting code, ECC, LDPC |
Abstract | An SSD with 11-times extended lifetime and 76% reduced error is proposed. The error-prediction LDPC realizes both 7-times faster read and high reliability. Errors are most efficiently corrected by calibrating memory data based on the VTH, inter-cell coupling, write/erase cycles and data-retention time. An error-recovery scheme with a program-disturb error-recovery pulse and a data-retention error-recovery pulse is also proposed to reduce the program-disturb error and the data-retention error by 76% and 56%, respectively. |
Title | A 3Gb/s 2.08mm^2 100b Error-Correcting BCH Decoder in 0.13µm CMOS Process |
Author | *Youngjoo Lee, Hoyoung Yoo, In-Cheol Park (KAIST, Republic of Korea) |
Page | pp. 85 - 86 |
Keyword | ECC, BCH, decoder, optimization |
Abstract | This paper presents a high-throughput BCH decoder that can correct 100 bit-errors. Several optimization methods are proposed to reduce the hardware complexity caused by the large error-correction capability. Based on the proposed methods, an 8-parallel decoder is designed for the (9592, 8192, 100) BCH code, which achieves a decoding throughput of 3Gb/s and occupies 2.08mm^2 in a 0.13µm CMOS process. |
Title | A 6.72-Gb/s, 8pJ/bit/iteration WPAN LDPC Decoder in 65nm CMOS |
Author | *Zhixiang Chen, Xiao Peng, Xiongxin Zhao, Leona Okamura, Dajiang Zhou, Satoshi Goto (Graduate School of Information, Production and Systems, Waseda University, Japan) |
Page | pp. 87 - 88 |
Keyword | LDPC, Decoder, WPAN, IEEE 802.15.3c, Permutation Network |
Abstract | An LDPC decoder in 65nm CMOS targeting WPAN (IEEE 802.15.3c) is presented with measurement results. A modified-PCM based message permutation strategy with compatible data flow is proposed to solve the network problem raised by high-parallelism LDPC decoding. Compared to the state of the art, the decoder chip achieves 17.7%, 33.5% and 49% improvements in chip density, gate count and energy efficiency, respectively. |
Slides |
Title | A 7.5Gb/s Referenceless Transceiver for UHDTV with Adaptive Equalization and Bandwidth Scanning Technique in 0.13µm CMOS Process |
Author | *Junyoung Song (Korea University, Republic of Korea), Hyunwoo Lee (Hynix Inc., Republic of Korea), Sewook Hwang (Korea University, Republic of Korea), Inhwa Jung (Hynix Inc., Republic of Korea), Chulwoo Kim (Korea University, Republic of Korea) |
Page | pp. 89 - 90 |
Keyword | Transceiver, CDR, PLL, Equalizer, Wireline |
Abstract | A 7.5Gb/s referenceless transceiver for ultra-high definition television is designed in a 0.13µm CMOS process. By applying dynamic pre-emphasis calibration and bandwidth scanning clock generators, the measured eye opening and jitter of the clock are enhanced by 39.6% and 40%, respectively. A data-width comparison based adaptive equalizer with a self-adjusting reference voltage is also proposed. |
Slides |
Title | A 12.5 Gb/s/Link Non-Contact Multi Drop Bus System with Impedance-Matched Transmission Line Couplers and Dicode Partial-Response Channel Transceivers |
Author | *Atsutake Kosuge, Wataru Mizuhara, Noriyuki Miura, Masao Taguchi, Hiroki Ishikuro, Tadahiro Kuroda (Keio University, Japan) |
Page | pp. 91 - 92 |
Keyword | Memory Interface, Coupler, Partial Response, Multi-Drop Bus |
Abstract | A reduced-reflection multi-drop bus system using a Dicode (1-D) partial-response signaling transceiver is presented for the first time in the world. Directional couplers on transmission lines arranged with equi-energy distributing and exact impedance-matched conditions allow the bus to reach a speed of 12.5Gbps/link, which is the world’s fastest data link speed with a multi-drop bus architecture. A Dicode partial-response signaling method with a half-rate architecture was used, where a precoder is placed in the transmitter to make the signal best fit the channel and eliminate inter-symbol interference (ISI). |
Slides |
Title | 315MHz OOK Transceiver with 38-µW Receiver and 36-µW Transmitter in 40-nm CMOS |
Author | *Shunta Iguchi (University of Tokyo, Japan), Akira Saito (Semiconductor Technology Academic Research Center, Japan), Kentaro Honda, Yunfei Zheng (University of Tokyo, Japan), Kazunori Watanabe (Semiconductor Technology Academic Research Center, Japan), Takayasu Sakurai, Makoto Takamiya (University of Tokyo, Japan) |
Page | pp. 93 - 94 |
Keyword | Transceiver, Sensor node, Low voltage, Low power, Intermittent sampling |
Abstract | A 1-Mbps, 315MHz OOK transceiver in 40-nm CMOS for body area networks is developed. Both a 38-pJ/bit carrier-frequency-free intermittent sampling receiver with -55dBm sensitivity and a 36-pJ/bit transmitter using a dual supply voltage scheme with -20dBm output power achieve the lowest energy among published transceivers for wireless sensor networks. |
Slides |
Title | A Full 4-Channel 60GHz Direct-Conversion Transceiver |
Author | *Seitaro Kawai, Ryo Minami, Ahmed Musa, Takahiro Sato, Ning Li, Tatsuya Yamaguchi, Yasuaki Takeuchi, Yuki Tsukui, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan) |
Page | pp. 95 - 96 |
Keyword | 60GHz, CMOS, transceiver |
Abstract | This paper presents a 60-GHz direct-conversion transceiver in 65 nm CMOS technology. Using the proposed gain peaking technique, this transceiver realizes good gain flatness and is capable of more than 7Gbps in 16QAM wireless communication for every channel of the IEEE802.15.3c standard within an EVM of around -23dB. The transceiver consumes 319 mW in transmitting and 223 mW in receiving, which includes the PLL consumption. |
Slides |
Title | A Sub-harmonic Injection-locked Frequency Synthesizer with Frequency Calibration Scheme for Use in 60GHz TDD Transceivers |
Author | *Teerachot Siriburanon, Wei Deng, Ahmed Musa, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan) |
Page | pp. 97 - 98 |
Keyword | 60GHz, Synthesizer, Calibration, Injection-locked |
Abstract | A 58.1-to-65.0 GHz frequency synthesizer using sub-harmonic injection-locking technique is presented. The synthesizer can generate all 60GHz channels defined by IEEE 802.15.3c, wirelessHD, IEEE 802.11ad, WiGig, and ECMA-387. A frequency calibration scheme is proposed to monitor frequency shift resulting from environmental variations. Implemented in a 65nm CMOS process, the synthesizer achieves a typical phase noise of -117 dBc/Hz @10MHz offset from a carrier frequency of 61.56 GHz. |
Title | A Fractional-N Harmonic Injection-locked Frequency Synthesizer with 10MHz-6.6GHz Quadrature Outputs for Software-Defined Radios |
Author | *Wei Deng, Ahmed Musa, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan) |
Page | pp. 99 - 100 |
Keyword | synthesizer, fractional-N, SDR, PLL, injection-locked |
Abstract | This paper presents an area-efficient frequency synthesizer with a quadrature phase output using a fractional-N injection-locking technique for software-defined radios. A background calibration scheme is proposed to compensate for the PVT variations. Implemented in a 65nm CMOS process, this work demonstrates 10 MHz to 6.6 GHz continuous quadrature frequency coverage, while occupying only a small area of 0.38 mm^2 and consuming 16-26 mW depending on output frequency, from a 1.2 V power supply. The normalized phase noise achieves -135.3 dBc/Hz at 3 MHz offset, and -95.1 dBc/Hz in-band phase noise at 10 kHz offset, from a 1.7 GHz carrier frequency. |
Title | A Ring-VCO-Based Sub-Sampling PLL CMOS Circuit with 0.73 ps Jitter and 20.4 mW Power Consumption |
Author | *Kenta Sogo, Akihiro Toya, Takamaro Kikkawa (Research Institute for Nanodevice and Bio Systems, Hiroshima University, Japan) |
Page | pp. 101 - 102 |
Keyword | PLL, CMOS, Jitter, Sampling, Phase noise |
Abstract | This paper presents a ring voltage-controlled-oscillator (ring-VCO)-based sub-sampling phase locked loop (PLL) CMOS circuit with low phase noise and low jitter. A 2.08 GHz PLL is developed using 65 nm CMOS technology. The in-band phase noise is -119.1 dBc/Hz at 1 MHz and the output jitter integrated from 1 kHz to 10 MHz is 0.73 ps (rms) with a power consumption of 20.4 mW. The normalized jitter-power product is -229.7 dB. |
Slides |
Title | Design of a Clock Jitter Reduction Circuit Using Gated Phase Blending Between Self-Delayed Clock Edges |
Author | *Kiichi Niitsu, Naohiro Harigai, Daiki Hirabayashi, Daiki Oki, Masato Sakurai (Gunma University, Japan), Osamu Kobayashi (STARC, Japan), Takahiro J. Yamaguchi, Haruo Kobayashi (Gunma University, Japan) |
Page | pp. 103 - 104 |
Keyword | jitter, clock, PLL, jitter reduction, CMOS |
Abstract | The design of a clock jitter reduction circuit that exploits the phase blending technique between uncorrelated self-delayed clock edges is demonstrated. By blending uncorrelated clock edges, the output clock edges approach the ideal timing and, thus, timing jitter can be reduced by a factor of the square root of two per stage. Measurement results with a 180-nm CMOS prototype chip demonstrated an approximately four-fold reduction in timing jitter, from 30.2ps to 8.8ps, in a 500-MHz clock by cascading four stages of the proposed circuit. |
Title | A 25-Gb/s LD Driver with Area-Effective Inductor in a 0.18-µm CMOS |
Author | *Takeshi Kuboki (Kyoto University, Japan), Yusuke Ohtomo (NTT Electronics, Japan), Akira Tsuchiya (Kyoto University, Japan), Keiji Kishine (University of Shiga Prefecture, Japan), Hidetoshi Onodera (Kyoto University, Japan) |
Page | pp. 105 - 106 |
Keyword | optical interconnect, LD driver, Interwoven inductor |
Abstract | This paper presents a high-speed and area-efficient laser-diode driver with an interwoven inductor in a 0.18-μm CMOS. We interweave ten peaking inductors for area-effective implementation as well as performance enhancement. The interwoven inductor not only achieves area efficiency but also tunes the frequency characteristic. Mutual inductances of the interwoven inductor enhance bandwidth and suppress group delay dispersion. The test chip area is 0.32 mm^2 and the maximum operating speed is 25 Gb/s. |
Slides |
Title | A Regulated Charge Pump with Low-Power Integrated Optimum Power Point Tracking Algorithm for Indoor Solar Energy Harvesting |
Author | *Jungmoon Kim, Chulwoo Kim (Korea University, Republic of Korea) |
Page | pp. 107 - 108 |
Keyword | Photovoltaic systems, solar energy harvesting, charge pump, maximum power point tracking, optimum power point tracking |
Abstract | This paper presents a regulated charge pump (CP) with an integrated optimum power point tracking (OPPT) algorithm designed for indoor solar energy harvesting. The proposed OPPT circuit does not require a current sensor that consumes power proportionally to the load. The solar cell voltage is regulated at the optimum power point; the CP output is regulated according to the target voltage. The controller of the OPPT circuit and CP dissipates only 450nW, so the proposed technique is appropriate for indoor solar energy harvesting applications under dim lighting conditions. |
Slides |
Title | A Low Voltage Buck DC-DC Converter Using On-Chip Gate Boost Technique in 40nm CMOS |
Author | *Xin Zhang, Po-Hung Chen (University of Tokyo, Japan), Yoshikatsu Ryu (Semiconductor Technology Academic Research Center, Japan), Koichi Ishida (University of Tokyo, Japan), Yasuyuki Okuma, Kazunori Watanabe (Semiconductor Technology Academic Research Center, Japan), Takayasu Sakurai, Makoto Takamiya (University of Tokyo, Japan) |
Page | pp. 109 - 110 |
Keyword | DC-DC converter, PWM controller, low voltage |
Abstract | A low voltage buck DC-DC converter (0.45-V input, 0.4-V output) with an on-chip gate boosted (OGB) and clock frequency scaled digital PWM controller is designed in a 40-nm CMOS process. The highest efficiency to date is achieved at output power of less than 40 µW. In order to compensate for the die-to-die delay variations of a delay line in the proposed digital PWM controller, a linear delay trimming by a logarithmic stress voltage (LSV) scheme with good controllability is also proposed and verified in measurement. |
Slides |
Title | A 0.35-0.8V 8b 0.5-35MS/s 2bit/step Extremely-low Power SAR ADC |
Author | *Kentaro Yoshioka, Akira Shikata, Ryota Sekimoto, Tadahiro Kuroda, Hiroki Ishikuro (Keio University, Japan) |
Page | pp. 111 - 112 |
Keyword | SAR ADC, Extreme-low voltage, 2bit/step, Low power, Power efficient |
Abstract | A high-speed, low-power 2bit/step asynchronous SAR ADC operating at extremely low voltage is presented. A comparator with wide-range dynamic threshold configuration is proposed to enable power- and area-efficient 2bit/step operation. By configuring the comparator threshold with simple Vcm-biased current sources, the ADC remains immune to 10% power supply variation. The prototype ADC fabricated in 40nm CMOS achieves 44.3 dB SNDR at 6.14 MS/s with a single supply voltage of 0.5 V. The ADC achieves a peak FoM of 5.9fJ/conv-step at 0.4V and operates down to 0.35V. |
Slides |
Title | (Invited Paper) Thermal Management for Dependable on-chip Systems |
Author | *Jörg Henkel, Thomas Ebi, Hussam Amrouch, Heba Khdr (Karlsruhe Institute of Technology, Germany) |
Page | pp. 113 - 118 |
Keyword | Dependability, Thermal Management, Aging |
Abstract | Dependability has become a growing concern in the nano-CMOS era due to elevated temperatures and an increased susceptibility to temperature of the small structures. We present an overview of temperature-related effects that threaten dependability and a methodology for reducing the dependability concerns through thermal management utilizing the concept of aging budgeting. |
Title | (Invited Paper) Dependable VLSI Platform using Robust Fabrics |
Author | *Hidetoshi Onodera (Kyoto University, Japan) |
Page | pp. 119 - 124 |
Keyword | dependable VLSI, DFM, variability, soft error, aging |
Abstract | Extreme scaling imposes enormous challenges on LSI design, such as manufacturability degradation, variability increase, performance aging, and soft-error vulnerability. To overcome these difficulties, we have been developing a VLSI platform that can realize dependable circuits with the required level of reliability. The platform project tackles the challenges through collaborative research on layout, circuit, architecture, and design automation. An overview of the project as well as key achievements at the component level and the architecture level will be explained, followed by a brief introduction of the platform SoC and its C-based design tools. |
Title | (Invited Paper) Variability-Aware Memory Management for Nanoscale Computing |
Author | *Nikil Dutt (University of California, Irvine, U.S.A.), Puneet Gupta (University of California, Los Angeles, U.S.A.), Alex Nicolau, Luis Angel D. Bathen (University of California, Irvine, U.S.A.), Mark Gottscho (University of California, Los Angeles, U.S.A.) |
Page | pp. 125 - 132 |
Keyword | Memory Management, Variation-Aware Design |
Abstract | As the semiconductor industry continues to push the limits of sub-micron technology, the ITRS expects hardware (e.g., die-to-die, wafer-to-wafer, and chip-to-chip) variations to continue increasing over the next few decades. As a result, it is imperative for designers to build variation-aware software stacks that may adapt and opportunistically exploit said variations to increase system performance/responsiveness as well as minimize power consumption. The memory subsystem is one of the largest components in today’s computing system, a main contributor to the overall power consumption of the system, and therefore one of the most vulnerable components to the effects of variations (e.g., power). This paper discusses the concept of variability-aware memory management for nanoscale computing systems. We show how to opportunistically exploit the hardware variations in on-chip and off-chip memory at the system level through the deployment of variation-aware software stacks. |
Title | MIXSyn: An Efficient Logic Synthesis Methodology for Mixed XOR-AND/OR Dominated Circuits |
Author | *Luca Amarú, Pierre-Emmanuel Gaillardon, Giovanni De Micheli (Integrated Systems Laboratory, Ecole Polytechnique Federale de Lausanne, Switzerland) |
Page | pp. 133 - 138 |
Keyword | Logic Synthesis, XOR-intensive, Library-free Technology Mapping, Ambipolar Transistors |
Abstract | We present a new logic synthesis methodology, called MIXSyn, that produces area-efficient results for mixed XOR-AND/OR dominated logic functions. MIXSyn is a two-step synthesis process. The first step is a hybrid logic optimization that enables selective and distinct optimization of AND/OR and XOR-intensive portions of the logic circuit. The second step is a library-free technology mapping that enhances design flexibility with a tractable computational cost. MIXSyn has been tested on a set of large MCNC benchmarks. Experimental results indicate that MIXSyn produces CMOS circuits with 18.0% and 9.2% fewer devices, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively. MIXSyn is also capable of exploiting the novel XOR implementations offered by double-gate ambipolar devices. Experimental results show that MIXSyn can reduce the number of ambipolar transistors by 20.9% and 15.3%, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively. |
Slides |
Title | Optimizing Multi-level Combinational Circuits for Generating Random Bits |
Author | Chen Wang, *Weikang Qian (Shanghai Jiao Tong University, China) |
Page | pp. 139 - 144 |
Keyword | logic synthesis, random bit generation, probabilistic computation |
Abstract | Random bits are an important construct in many applications, such as hardware-based implementation of probabilistic algorithms and weighted random testing. One approach to generating random bits with required probabilities is to synthesize combinational circuits that transform a set of source probabilities into target probabilities. In [1], the authors proposed a greedy algorithm that synthesizes circuits in the form of a gate chain to approximate target probabilities. However, since this approach only considers circuits of such a special form, the resulting circuits are unsatisfactory in terms of both the approximation error and the circuit depth. In this paper, we propose a new algorithm to synthesize combinational circuits for generating random bits. Compared to the previous one, our approach greatly enlarges the search space. Also, we apply a linear property of probabilistic logic computation and an iterative local search method to increase the efficiency of our algorithm. Experimental results comparing the approximation error and the depth of the circuits synthesized by our method to those of the circuits produced by the previous approach demonstrate the superiority of our method. |
Slides |
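To make the idea of transforming source probabilities into a target probability concrete, here is a minimal sketch that evaluates the output probability of a gate chain. It assumes all source bits are statistically independent; the gate set, the function name, and the example probabilities are illustrative choices, and the code implements neither the greedy algorithm of [1] nor the authors' synthesis method.

```python
def gate_chain_probability(source_probs, gates):
    """Probability that a chain of 2-input gates outputs 1.

    source_probs: probabilities of independent source bits being 1.
    gates: gate types; gate i combines the running output with source i + 1.
    """
    if len(gates) != len(source_probs) - 1:
        raise ValueError("need exactly one gate per additional source bit")
    p = source_probs[0]
    for gate, q in zip(gates, source_probs[1:]):
        if gate == "AND":
            p = p * q
        elif gate == "OR":
            p = p + q - p * q
        elif gate == "NAND":
            p = 1.0 - p * q
        elif gate == "NOR":
            p = 1.0 - (p + q - p * q)
        else:
            raise ValueError(f"unsupported gate: {gate}")
    return p


# Three independent source bits combined as (x0 AND x1) OR x2.
p_out = gate_chain_probability([0.4, 0.5, 0.4], ["AND", "OR"])
print(round(p_out, 4))
```

A synthesis procedure would search over gate types and source assignments to bring this value close to a target; the chain form restricts that search, which is the limitation the abstract points out.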
Title | Improving the Mapping of Reversible Circuits to Quantum Circuits Using Multiple Target Lines |
Author | *Robert Wille, Mathias Soeken, Christian Otterstedt, Rolf Drechsler (University of Bremen, Germany) |
Page | pp. 145 - 150 |
Keyword | quantum, reversible, synthesis, optimization, circuits |
Abstract | The efficient synthesis of quantum circuits is an active research area. Since many of the known quantum algorithms include a large Boolean component (e.g. the database in the Grover search algorithm), quantum circuits are commonly synthesized in a two-stage approach. First, the desired function is realized as a reversible circuit making use of existing synthesis methods for this domain. Afterwards, each reversible gate is mapped to a functionally equivalent quantum gate cascade. In this paper, we propose an improved mapping of reversible circuits to quantum circuits which exploits a certain structure of many reversible circuits. In fact, it can be observed that reversible circuits are often composed of similar gates which differ only in the position of their target lines. We introduce an extension of reversible gates that allows multiple target lines in a single gate. This enables a significantly cheaper mapping to quantum circuits. Experiments show that considering multiple target lines leads to improvements of up to 85% in the resulting quantum cost. |
Slides |
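The observation that neighboring gates often differ only in their target line can be illustrated with a small sketch. It assumes Toffoli-style gates given as (controls, target) pairs and merges adjacent gates with identical control sets into one multi-target gate; the representation and function names are hypothetical, and the quantum-cost mapping studied in the paper is not modeled here.

```python
from itertools import product


def merge_multi_target(gates):
    """Merge adjacent gates that share the same control set.

    gates: list of (controls, target) with the target not among the controls.
    Returns a list of (frozenset controls, frozenset targets).
    """
    merged = []
    for controls, target in gates:
        controls = frozenset(controls)
        if merged and merged[-1][0] == controls:
            _, targets = merged.pop()
            targets = targets ^ {target}   # a repeated target cancels itself
            if targets:
                merged.append((controls, targets))
        else:
            merged.append((controls, frozenset({target})))
    return merged


def simulate(gates, bits):
    """Apply gates in order; flip every target when all controls are 1."""
    bits = list(bits)
    for controls, targets in gates:
        if all(bits[c] for c in controls):
            for t in targets:
                bits[t] ^= 1
    return tuple(bits)


circuit = [({0, 1}, 2), ({0, 1}, 3), ({2}, 3)]
single_target = [(frozenset(c), frozenset({t})) for c, t in circuit]
multi_target = merge_multi_target(circuit)

# Exhaustively confirm both forms compute the same function on 4 lines.
equivalent = all(
    simulate(single_target, b) == simulate(multi_target, b)
    for b in product((0, 1), repeat=4)
)
print(len(circuit), "gates ->", len(multi_target), "gates, equivalent:", equivalent)
```

Merging is safe here because gates with the same controls never write to a control line, so they commute and can fire together.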
Title | I-LUTSim: An Iterative Look-Up Table Based Thermal Simulator for 3-D ICs |
Author | *Chi-Wen Pan, Yu-Min Lee (National Chiao Tung University, Taiwan), Pei-Yu Huang (Industrial Technology Research Institute, Taiwan), Chi-Ping Yang (National Chiao Tung University, Taiwan), Chang-Tzu Lin, Chia-Hsin Lee, Yung-Fa Chou, Ding-Ming Kwai (Industrial Technology Research Institute, Taiwan) |
Page | pp. 151 - 156 |
Keyword | 3-D, IC, Thermal, Simulator, Table |
Abstract | This work presents an iterative look-up table based thermal simulator, I-LUTSim, to efficiently estimate the temperature profile of three-dimensional integrated circuits. I-LUTSim includes two stages. First, the pre-process stage constructs thermal impulse response tables. Then, the simulation stage iteratively calculates the temperature profile via table lookup. With this two-stage scheme, the maximum absolute error of I-LUTSim relative to the commercial tool ANSYS is less than 0.41%. Moreover, I-LUTSim is at least an order of magnitude faster than the fast matrix solver SuperLU for full-chip temperature simulation. |
Slides |
Title | Compact Nonlinear Thermal Modeling of Packaged Integrated Systems |
Author | *Zao Liu, Sheldon X.-D. Tan, Hai Wang (University of California, Riverside, U.S.A.), Ashish Gupta (Intel Corporation, U.S.A.), Sahana Swarup (University of California, Riverside, U.S.A.) |
Page | pp. 157 - 162 |
Keyword | Thermal modeling, Nonlinear, Subspace identification |
Abstract | This paper proposes a new nonlinear thermal modeling technique for packaged integrated systems. The thermal behavior of complicated systems like packaged electronic systems may exhibit nonlinear and temperature-dependent properties. As a result, it is difficult to use a low-order linear model to approximate the thermal behavior of packaged integrated systems without accuracy loss. In this paper, we try to mitigate this problem by using a piecewise linear (PWL) approach to characterize the thermal behavior of those systems. The new method (called ThermSubPWL), which is the first proposed approach to the nonlinear thermal modeling problem, identifies the linear local models for different temperature ranges using the subspace identification method. A linear transformation method is proposed to transform all the identified linear local models to a common state basis to build the continuous piecewise linear model. Experimental results validate the proposed method on a realistic packaged integrated system modeled via the multidomain/physics commercial tool, COMSOL, under practical power signal inputs. The new piecewise models can lead to a much smaller model order without accuracy loss, which translates to significant savings in both the simulation time and the time required to identify the reduced models compared to applying the high-order models. |
Slides |
Title | A Multilevel H-matrix-based Approximate Matrix Inversion Algorithm for Vectorless Power Grid Verification |
Author | Wei Zhao, Yici Cai, *Jianlei Yang (Dept. of Computer Science and Technology, Tsinghua University, China) |
Page | pp. 163 - 168 |
Keyword | Power grid, Vectorless verification, H-matrix, Multilevel method |
Abstract | The vectorless power grid verification technique makes it possible to estimate the worst-case voltage fluctuations of the on-chip power delivery network at the early design stage. For most of the existing vectorless verification algorithms, the subproblem of linear system solution, which computes the inverse of the power grid matrix, takes up a large part of the computation time and has become a critical bottleneck of the whole algorithm. In this paper, we propose a new algorithm that combines the H-matrix-based technique and the multilevel method to construct a data-sparse approximate inverse of the power grid matrix. Experimental results have shown that the proposed algorithm can obtain an almost linear complexity both in runtime and memory consumption for efficient vectorless power grid verification. |
Slides |
Title | Realization of Frequency-Domain Circuit Analysis Through Random Walk |
Author | Tetsuro Miyakawa, Hiroshi Tsutsui, Hiroyuki Ochi, *Takashi Sato (Kyoto University, Japan) |
Page | pp. 169 - 174 |
Keyword | AC analysis, Random walk algorithm, Importance sampling, Incremental analysis |
Abstract | This paper presents, for the first time, the realization of frequency-domain circuit analysis based on a random walk framework. In conventional random walk based circuit analyses, the sample movement at a node is randomly chosen to follow the edge probabilities. The probabilities are determined by the admittances of the edges connected to the node, which makes the approach impossible to apply to frequency-domain analysis because the probabilities are imaginary numbers. By applying the idea of importance sampling, the intractable imaginary probabilities are converted into real numbers while maintaining the estimation correctness. Runtime acceleration through incremental analysis is also proposed. |
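The importance-sampling idea can be sketched on a tiny RC ladder. This is a generic illustration under stated assumptions, not the authors' algorithm: the walk picks the next edge with real probability proportional to the admittance magnitude and carries a complex weight that corrects for the difference from the complex ratio y / Σy. The circuit, node names, and function name are hypothetical, and current sources and the incremental analysis mentioned in the abstract are omitted.

```python
import random


def estimate_voltage(start, neighbors, known, walks, rng):
    """Monte Carlo estimate of the complex voltage at node `start`.

    neighbors[node]: list of (other_node, complex admittance) for unknown nodes.
    known[node]: fixed complex voltage (sources and ground) where walks end.
    """
    total = 0j
    for _ in range(walks):
        node, weight = start, 1 + 0j
        while node not in known:
            edges = neighbors[node]
            y_sum = sum(y for _, y in edges)       # complex total admittance
            mags = [abs(y) for _, y in edges]      # real sampling weights
            nxt, y = rng.choices(edges, weights=mags)[0]
            q = abs(y) / sum(mags)                 # real probability actually used
            weight *= (y / y_sum) / q              # importance-sampling correction
            node = nxt
        total += weight * known[node]
    return total / walks


# Two-stage RC ladder at a frequency where each capacitor has admittance 1j S
# and each resistor has admittance 1 S: src -R- n1 -R- n2, C from n1 and n2 to gnd.
neighbors = {
    "n1": [("src", 1 + 0j), ("n2", 1 + 0j), ("gnd", 1j)],
    "n2": [("n1", 1 + 0j), ("gnd", 1j)],
}
known = {"src": 1 + 0j, "gnd": 0j}

v2 = estimate_voltage("n2", neighbors, known, walks=50_000, rng=random.Random(1))
exact_v2 = -1j / 3  # from solving the two nodal equations by hand
print(abs(v2 - exact_v2))
```

Because every correction factor has magnitude of at least one, the weights can grow along long walks, so variance control matters in larger circuits; this sketch does not address that.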
Title | A Separation and Minimum Wire Length Constrained Maze Routing Algorithm under Nanometer Wiring Rules |
Author | *Fong-Yuan Chang, Ren-Song Tsay, Wai-Kei Mak (National Tsing Hua University, Taiwan), Sheng-Hsiung Chen (Springsoft, Taiwan) |
Page | pp. 175 - 180 |
Keyword | Maze, Routing, Nanometer Wiring Rules, DFM, minimum wire length |
Abstract | Due to process limitations, wiring rules are imposed by foundries on chip layout. Under nanometer wiring rules, the required separation between two wire ends is dependent on their surrounding wires, and there is a limit on the minimum length of each wire segment. Yet, traditional maze routing algorithms are not designed to handle these rules, so rule violations must be corrected by post-processing and the quality of result is seriously impacted. For this reason, we propose a new maze routing algorithm capable of handling these wiring rules. The proposed algorithm is proven to find a legal shortest path with runtime complexity of O(n), where n is the number of grid points. Experiments with seven tight industrial cases show that the success rate of getting a DRC-clean routing by a commercial router is improved, and the average runtime is reduced by a factor of 2.3. |
Slides |
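For context, the traditional maze routing that the abstract contrasts against can be sketched as a breadth-first search over grid points, which also runs in O(n). The sketch below is that classic baseline only: it does not enforce the end-separation or minimum-length rules the paper handles. The helper `segment_lengths` is a hypothetical addition that shows how a minimum-segment-length rule could be checked after the fact.

```python
from collections import deque


def maze_route(grid, source, target):
    """Shortest 4-connected path from source to target, or None if unroutable.

    grid[r][c] == 1 marks a blocked grid point; cells are (row, col) tuples.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {source: None}
    queue = deque([source])
    while queue:
        cell = queue.popleft()
        if cell == target:
            path = []
            while cell is not None:          # walk back through parents
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None


def segment_lengths(path):
    """Lengths (in grid steps) of the straight wire segments along a path."""
    lengths, run, prev_dir = [], 0, None
    for a, b in zip(path, path[1:]):
        step = (b[0] - a[0], b[1] - a[1])
        if step == prev_dir:
            run += 1
        else:
            if prev_dir is not None:
                lengths.append(run)
            run, prev_dir = 1, step
    if prev_dir is not None:
        lengths.append(run)
    return lengths


grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = maze_route(grid, (0, 0), (2, 0))
print(path, segment_lengths(path))
```

A post-processing flow would compare each entry of `segment_lengths(path)` against the minimum length and repair violations afterwards, which is the step the proposed algorithm aims to avoid.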
Title | An ILP-based Automatic Bus Planner for Dense PCBs |
Author | Pei-Ci Wu (University of Illinois at Urbana-Champaign, U.S.A.), Qiang Ma (Synopsys, Inc., U.S.A.), *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) |
Page | pp. 181 - 186 |
Keyword | bus planner, printed circuit boards |
Abstract | Modern PCBs have to be routed manually since no EDA tools can successfully route these complex boards. An autorouter for PCBs would improve design productivity tremendously since each board takes about 2 months to route manually. This paper focuses on a major step in PCB routing called bus planning. In the bus planning problem, we need to simultaneously solve the bus decomposition, escape routing, layer assignment and global bus routing. This problem was partially addressed by Kong et al. in [3] where they only focused on the layer assignment and global bus routing, assuming bus decomposition and escape routing are given. In this paper, we present an ILP-based solution to the entire bus planning problem. We apply our bus planner to an industrial PCB (with over 7000 nets and 12 signal layers) which was previously successfully routed manually, and compare with a state-of-the-art industrial internal tool where the layer assignment and global bus routing are based on the algorithm in [3]. Our bus planner successfully routed 97.4% of all the nets. This is a huge improvement over the industrial tool which could only achieve 84.7% routing completion for this board. |
Title | Layer Minimization in Escape Routing for Staggered-Pin-Array PCBs |
Author | *Yuan-Kai Ho, Xin-Wei Shih, Yao-Wen Chang (National Taiwan University, Taiwan), Chung-Kuan Cheng (University of California, San Diego, U.S.A.) |
Page | pp. 187 - 192 |
Keyword | Escape routing, PCB routing |
Abstract | As the technology advances, the pin number of a high-end PCB design keeps increasing while the size of a PCB keeps shrinking. The staggered pin array is used to accommodate a larger pin number than the grid pin array of the same area. Nevertheless, escaping a large pin number to the boundary of a dense staggered pin array, namely multilayer escape routing for staggered pin arrays, is significantly harder than that for grid pin arrays. This paper addresses this multilayer escape routing problem to minimize the number of used layers in a staggered pin array for manufacturing cost reduction. We first present an escaped pin selection method to assign a maximal number of escaped pins in the current layer and also to increase useful routing regions for subsequent layers. Missing pins are also modeled in our routing network to utilize the routing resource effectively. Experimental results show that our approach can significantly reduce the required layer number for escape routing. |
Title | Network Flow Modeling for Escape Routing on Staggered Pin Arrays |
Author | Pei-Ci Wu, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) |
Page | pp. 193 - 198 |
Keyword | escape routing, staggered pin array, network flow modeling |
Abstract | Recently, staggered pin arrays have been introduced for modern designs with high pin density. Although some studies have been done on escape routing for hexagonal arrays, the hexagonal array is only a special kind of staggered pin array. There exist other kinds of staggered pin arrays in current industrial designs, and the existing works cannot be extended to solve them. In this paper, we study the escape routing problem on staggered pin arrays. Network flow models are proposed to correctly model the capacity constraints of staggered pin arrays. Our models are guaranteed to find an escape routing satisfying the capacity constraints if one exists. The correctness of these models leads to an optimal algorithm. |
Title | (Invited Paper) A Clique-Based Approach to Find Binding and Scheduling Result in Flow-Based Microfluidic Biochips |
Author | *Trung Anh Dinh, Shigeru Yamashita (Ritsumeikan University, Japan), Tsung-Yi Ho (National Cheng Kung University, Taiwan), Yuko Hara-Azumi (Nara Institute of Science and Technology, Japan) |
Page | pp. 199 - 204 |
Keyword | Flow-based microfluidic biochips, Architectural synthesis, Routing constraints, Resource constraints, Clique |
Abstract | Microfluidic biochips have been recently proposed to integrate all the necessary functions for biochemical analysis. There are several types of microfluidic biochips; among them there has been a great interest in flow-based microfluidic biochips, in which the flow of liquid is manipulated using integrated micro-valves. By combining several microvalves, more complex resource units such as micropumps, switches and mixers can be built. For efficient execution, the flow of liquid routes in microfluidic biochips needs to be scheduled under some specific constraints. |
Slides |
Title | (Invited Paper) Control Synthesis of the Flow-Based Microfluidic Large-Scale Integration Biochips |
Author | Wajid Hassan Minhass, *Paul Pop, Jan Madsen (Technical University of Denmark, Denmark), Tsung-Yi Ho (National Cheng Kung University, Taiwan) |
Page | pp. 205 - 212 |
Keyword | microfluidic, biochips, synthesis, flow-based, control |
Abstract | In this paper we are interested in flow-based microfluidic biochips, which are able to integrate the necessary functions for biochemical analysis on-chip. In these chips, the flow of liquid is manipulated using integrated microvalves. By combining several microvalves, more complex units, such as micropumps, mixers, and multiplexers, can be built. In this paper we propose, for the first time to our knowledge, a top-down control synthesis framework for the flow-based biochips. Starting from a given biochemical application and a biochip architecture, we synthesize the control logic that is used by the biochip controller to automatically execute the biochemical application. We also propose a control pin count minimization scheme aimed at efficiently utilizing chip area, reducing macro-assembly around the chip and enhancing chip scalability. We have evaluated our approach using both real-life applications and synthetic benchmarks. |
Title | (Invited Paper) A Network-Flow Based Valve-Switching Aware Binding Algorithm for Flow-Based Microfluidic Biochips |
Author | *Kai-Han Tseng, Sheng-Chi You (National Cheng Kung University, Taiwan), Wajid Hassan Minhass (Technical University of Denmark, Denmark), Tsung-Yi Ho (National Cheng Kung University, Taiwan), Paul Pop (Technical University of Denmark, Denmark) |
Page | pp. 213 - 218 |
Keyword | Flow-based microfluidic biochip, Network flow, Valve minimization |
Abstract | Designs of flow-based microfluidic biochips have received much attention recently because they replace the conventional biological automation paradigm and are able to integrate different biochemical analysis functions on a chip. However, as the design complexity increases, a flow-based microfluidic biochip needs more chip-integrated micro-valves, i.e., the basic unit of fluid-handling functionality, to manipulate the fluid flow for biochemical applications. Moreover, frequent switching of micro-valves may increase power consumption and even cause reliability problems. To minimize the valve-switching activities, we develop a network-flow based resource binding algorithm based on breadth-first search (BFS) and minimum cost maximum flow (MCMF) in architectural-level synthesis. The experimental results show that our methodology not only significantly reduces valve-switching activities but also shortens the application completion time for both real-life applications and a set of synthetic benchmarks. |
Slides |
Title | (Invited Paper) Design and Verification Tools for Continuous Fluid Flow-based Microfluidic Devices |
Author | Jeffrey McDaniel, Aurelila Baez, Brian Crites, Aditya Tammewar, *Philip Brisk (University of California, Riverside, U.S.A.) |
Page | pp. 219 - 224 |
Keyword | Microfluidics, Hardware Design Language |
Abstract | This paper describes an integrated design, verification, and simulation environment for programmable microfluidic devices called laboratories-on-chip (LoCs). Today’s LoCs are architected and laid out by hand, which is time-consuming, tedious, and error-prone. To increase designer productivity, this paper introduces a Microfluidic Hardware Design Language (MHDL) for LoC specification, along with software tools that help LoC designers verify the correctness of their specifications and estimate their performance. |
Slides |
Title | Optimal Partition with Block-Level Parallelization in C-to-RTL Synthesis for Streaming Applications |
Author | *Shuangchen Li, Yongpan Liu (Tsinghua University, China), X. Sharon Hu (University of Notre Dame, U.S.A.), Xinyu He, Yining Zhang (Tsinghua University, China), Pei Zhang (Y Explorations Inc., U.S.A.), Huazhong Yang (Tsinghua University, China) |
Page | pp. 225 - 230 |
Keyword | HLS, Partition |
Abstract | Developing FPGA solutions for streaming applications written in C (or its variants) can benefit greatly from automatic C-to-RTL (C2RTL) synthesis. Yet, the complexity and stringent throughput/cost constraints of such applications are rather challenging for existing C2RTL synthesis tools. This paper considers automatic partition and block-level parallelization to address these challenges. An MILP-based approach is introduced for finding an optimal partition of a given program into blocks while allowing block-level parallelization. In order to handle extremely large problem instances, a heuristic algorithm is also discussed. Experimental results based on seven well known multimedia applications demonstrate the effectiveness of both solutions. |
Slides |
Title | Multi-Mode Pipelined MPSoCs for Streaming Applications |
Author | *Haris Javaid, Daniel Witono, Sri Parameswaran (University of New South Wales, Australia) |
Page | pp. 231 - 236 |
Keyword | Pipelined MPSoCs, Streaming Applications, Multi-mode Accelerators |
Abstract | In this paper, we propose a design flow for the pipelined paradigm of Multi-Processor System on Chips (MPSoCs) targeting multiple streaming applications. A multi-mode pipelined MPSoC, used as a streaming accelerator, executes multiple, mutually exclusive applications through modes where each mode refers to the execution of one application. We model each application as a directed graph. The challenge is to merge application graphs into a single graph so that the multi-mode pipelined MPSoC derived from the merged graph contains minimal resources. We solve this problem by finding maximal overlap between application graphs. Three heuristics are proposed where two of them greedily merge application graphs while the third one finds an optimal merging at the cost of higher running time. The results indicate significant area saving (up to 62% processor area, 57% FIFO area and 44 processor/FIFO ports) with minuscule degradation of system throughput (up to 2%) and latency (up to 2%) and increase in energy values (up to 3%) when compared to the widely used approach of designing distinct pipelined MPSoCs for individual applications. Our work is the first step in the direction of multi-mode pipelined MPSoCs, and the results demonstrate the usefulness of resource sharing among pipelined MPSoCs based streaming accelerators in a multimedia platform. |
Slides |
Title | Network Simplex Method Based Multiple Voltage Scheduling in Power-Efficient High-Level Synthesis |
Author | *Cong Hao, Song Chen, Takeshi Yoshimura (Waseda University, Japan) |
Page | pp. 237 - 242 |
Keyword | High-Level Synthesis, Scheduling, low-power |
Abstract | In this work, we focus on the problem of latency-constrained scheduling with consideration of multiple voltage technologies in high-level synthesis. Without considering resources, we propose an Integer Linear Programming (ILP) formulation, whose constraint matrix is the node-arc incidence matrix of a network graph, for power minimization. Accordingly, the formulation is relaxed to a piecewise Linear Programming problem having only integer feasible solutions and optimally solved using the efficient piecewise-linear extended network simplex method (PLNSM). The experimental results showed 80X+ speedup compared to the general linear programming formulation. Considering the resource usage, we propose a two-stage heuristic Network Simplex Method based Power-efficient Multiple Voltage Scheduling (NPMVS) method. Firstly, the above relaxed LP formulation is modified to perform mobility allocation and delay assignment for the operations so as to minimize the power and the differences between the allocated operation mobilities and the predefined target mobilities. The modified formulation is solved using the PLNSM and iteratively performed to minimize power and resource density variation in control steps by gradually updating the predefined target mobilities. Secondly, with the allocated operation mobilities, we apply dependency-free scheduling with the objective of minimizing the resource usage. Experimental results show that the proposed method can produce optimum solutions for all 6 benchmarks with 14 groups of data in a maximum time of 0.25 seconds. |
Slides |
Title | VISA Synthesis: Variation-Aware Instruction Set Architecture Synthesis |
Author | *Yuko Hara-Azumi (Nara Institute of Science and Technology, Japan), Takuya Azumi (Ritsumeikan University, Japan), Nikil Dutt (University of California, Irvine, U.S.A.) |
Page | pp. 243 - 248 |
Keyword | ISA synthesis, process variation, SSTA, timing faults |
Abstract | We present VISA: a novel Variation-aware Instruction Set Architecture synthesis approach that makes effective use of process variation from both software and hardware points of view. To achieve an efficient speedup, VISA selects custom instructions based on statistical static timing analysis (SSTA) for aggressive clocking. Furthermore, with minimum performance overhead, VISA dynamically detects and corrects timing faults resulting from aggressive clocking of the underlying processor. This hybrid software/hardware approach brings significant speedup without degrading the yield. Our experimental results on commonly used ISA synthesis benchmarks demonstrate that VISA achieves significant performance improvement compared with a traditional deterministic worst case-based approach (up to 78.0%) and an existing SSTA-based approach (up to 49.4%). |
Slides |
Title | L-Shape Based Layout Fracturing for E-Beam Lithography |
Author | Bei Yu, Jhih-Rong Gao, *David Z. Pan (University of Texas at Austin, U.S.A.) |
Page | pp. 249 - 254 |
Keyword | Electron Beam Lithography, Layout Fracturing, L-shape shot |
Abstract | Layout fracturing is a fundamental step in mask data preparation and e-beam lithography (EBL) writing. To increase EBL throughput, a new L-shape writing strategy has recently been proposed, which calls for new L-shape fracturing, versus the conventional rectangular fracturing. Meanwhile, during layout fracturing, one must minimize very small/narrow features, also called slivers, due to manufacturability concerns. This paper addresses this new research problem of how to perform L-shape fracturing with sliver minimization. We propose two novel algorithms. The first one, rectangular merging (RM), starts from a set of rectangular fractures and merges them optimally to form L-shape fracturing. The second algorithm, direct L-shape fracturing (DLF), directly and effectively fractures the input layouts into L-shapes with sliver minimization. The experimental results show that our algorithms are very effective. |
Slides |
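The rectangular merging idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' RM algorithm: it only shows a geometric test for whether two axis-aligned rectangles could be merged into a single L-shaped shot, plus a sliver check. The function names, the `(x1, y1, x2, y2)` tuple layout, and the `min_width` threshold are all assumptions made for illustration.

```python
def forms_l_shape(a, b):
    """Return True if rectangles a and b, given as (x1, y1, x2, y2) with
    x1 < x2 and y1 < y2, abut and their union is an L-shape (hypothetical helper)."""
    for axis in (0, 1):                      # 0: abut along x, 1: abut along y
        other = 1 - axis
        # One rectangle's high side coincides with the other's low side.
        touching = a[axis + 2] == b[axis] or b[axis + 2] == a[axis]
        # Along the shared side, exactly one end must be aligned:
        # both aligned gives a plain rectangle, neither gives a T or Z shape.
        same_low = a[other] == b[other]
        same_high = a[other + 2] == b[other + 2]
        if touching and (same_low != same_high):
            return True
    return False


def is_sliver(rect, min_width):
    """A rectangle counts as a sliver if its narrower side is below min_width
    (the threshold is an assumed parameter, not a value from the paper)."""
    x1, y1, x2, y2 = rect
    return min(x2 - x1, y2 - y1) < min_width
```

For example, `(0, 0, 2, 3)` and `(2, 0, 4, 1)` share their bottom edge level and abut at `x = 2`, so they qualify as an L-shape under this test, while two rectangles of equal height placed side by side do not.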
Title | High-throughput Electron Beam Direct Writing of VIA Layers by Character Projection using Character Sets Based on One-dimensional VIA Arrays with Area-efficient Stencil Design |
Author | *Rimon Ikeno (The University of Tokyo, Japan), Takashi Maruyama (e-Shuttle, Inc., Japan), Tetsuya Iizuka, Satoshi Komatsu, Makoto Ikeda, Kunihiro Asada (The University of Tokyo, Japan) |
Page | pp. 255 - 260 |
Keyword | Electron Beam Direct Writing, Character Projection, DFM, Layout Design, Interconnect Design |
Abstract | For high-speed electron beam direct writing (EBDW) of VIA layers by character projection (CP), the number of VIAs in each CP shot should be increased, but this results in a huge number of CP characters for arbitrary VIA placements. We adopt one-dimensional VIA arrays as the basic character architecture to increase the number of VIAs in a CP shot while saving stencil area through superposed array characters. CP throughput is further improved by layout constraints on VIA arrangement. Our experimental results give estimated CP exposure counts of less than 174G shots/wafer in 14nm technology. |
Slides |
Title | Linear Time Algorithm to Find All Relocation Positions for EUV Defect Mitigation |
Author | Yuelin Du (University of Illinois at Urbana-Champaign, U.S.A.), Hongbo Zhang, Qiang Ma (Synopsys, Inc., U.S.A.), *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.) |
Page | pp. 261 - 266 |
Keyword | EUV, Blank Defect Mitigation, Linear Time, Relocation Position, Multi-die Placement |
Abstract | In EUV mask fabrication, die size is usually much smaller than the exposure field, such that one blank can accommodate multiple copies of a die. For thorough utilization of blank area, the number of valid dies that are not impacted by any defects should be maximized. To do so, all relocation positions to place a single valid die must be determined first. In this paper, we develop an efficient linear time algorithm to solve this problem. |
Slides |
Title | Self-Aligned Double and Quadruple Patterning-Aware Grid Routing with Hotspots Control |
Author | *Chikaaki Kodama (Toshiba Corporation, Japan), Hirotaka Ichikawa (Toshiba Microelectronics Corporation, Japan), Koichi Nakayama, Toshiya Kotani, Shigeki Nojima, Shoji Mimotogi, Shinji Miyamoto (Toshiba Corporation, Japan), Atsushi Takahashi (Tokyo Institute of Technology, Japan) |
Page | pp. 267 - 272 |
Keyword | Self-aligned double patterning, Self-aligned quadruple patterning, Grid routing, Lithography, Hotspot |
Abstract | Self-Aligned Double and Quadruple Patterning (SADP, SAQP) have become the most promising processes for sub-20nm and sub-14nm node technology. We propose a simple grid routing method for SADP and SAQP that makes it possible to predict the wafer image. A new grid structure is prepared so that mandrel patterns can be easily derived without complex coloring or decomposition. We also try to reduce hotspots in the wafer image by dummy pattern flipping. A classical maze-routing algorithm is implemented and its effectiveness is confirmed. |
Slides |
Title | Compiler-Assisted Refresh Minimization for Volatile STT-RAM Cache |
Author | Qingan Li (City University of Hong Kong, Hong Kong), Jianhua Li, Liang Shi (University of Science and Technology of China, China), *Chun Jason Xue (City University of Hong Kong, Hong Kong), Yiran Chen (University of Pittsburgh, U.S.A.), Yanxiang He (Wuhan University, China) |
Page | pp. 273 - 278 |
Keyword | volatile STT-RAM, refresh, compiler |
Abstract | Recently, researchers have proposed improving the efficiency of STT-RAM by relaxing its non-volatility. To avoid data loss resulting from volatility, dynamic refresh schemes are indispensable. In this paper, we propose to reduce dynamic refresh by re-arranging the program data layout at compilation time. Experimental results show that the proposed methods can reduce the number of refresh operations by 73.3%, reduce the dynamic energy consumption by 27.6%, and at the same time slightly increase performance by 0.7%. |
Slides |
Title | Curling-PCM: Application-Specific Wear Leveling for Phase Change Memory based Embedded Systems |
Author | *Duo Liu (College of Computer Science, Chongqing University, China), Tianzheng Wang (Department of Computer Science, University of Toronto, Canada), Yi Wang, Zili Shao (Department of Computing, The Hong Kong Polytechnic University, Hong Kong), Qingfeng Zhuge, Edwin Sha (College of Computer Science, Chongqing University, China) |
Page | pp. 279 - 284 |
Keyword | Phase change memory, wear leveling, embedded systems, application-specific, response time |
Abstract | Phase change memory (PCM) has been used as a NOR replacement in embedded systems. However, endurance problems greatly limit its adoption in embedded systems. This paper utilizes application-specific features and proposes a wear leveling technique, Curling-PCM, which periodically moves the hot region and guarantees response time through a partial curling policy. Experimental results show the effectiveness of the proposed technique. We expect this work can serve as a first step towards the utilization of application-specific features in PCM-based embedded systems. |
Slides |
Title | Selectively Protecting Error-Correcting Code for Area-Efficient and Reliable STT-RAM Caches |
Author | *Junwhan Ahn (Seoul National University, Republic of Korea), Sungjoo Yoo (Pohang University of Science and Technology, Republic of Korea), Kiyoung Choi (Seoul National University, Republic of Korea) |
Page | pp. 285 - 290 |
Keyword | backhopping, caches, error-correcting code, STT-RAM |
Abstract | Recent research on STT-RAM has revealed that device scaling makes its write operations unreliable. To mitigate the impact of this problem, this paper proposes a low-cost, ECC-based solution for STT-RAM caches. In particular, it proposes to share storage for ECC among different blocks within a set and to use it only for unsuccessful write operations. Experimental results show that our scheme eliminates 74% to 98% of the area overhead incurred by the conventional per-block ECC while maintaining system performance and reliability. |
Title | Loadsa: A Yield-Driven Top-Down Design Method for STT-RAM Array |
Author | Wujie Wen, Yaojun Zhang, Lu Zhang, *Yiran Chen (University of Pittsburgh, U.S.A.) |
Page | pp. 291 - 296 |
Keyword | Yield-driven, Top-down, Statistical design, STT-RAM |
Abstract | As an emerging nonvolatile memory technology, spin transfer torque random access memory (STT-RAM) faces great design challenges. The large device variations and the thermally induced switching randomness of the magnetic tunneling junction (MTJ) introduce persistent and non-persistent errors in STT-RAM operations, respectively. Modeling these statistical metrics generally requires expensive Monte-Carlo simulations on combined magnetic-CMOS models, which are hard to integrate into modern micro-architecture and system designs. Also, the conventional bottom-up design method incurs costly iterations in the STT-RAM design toward a specific system requirement. In this work, we propose Loadsa: a yield-driven top-down design method to explore the design space of STT-RAM arrays from a statistical point of view. Both an array-level semi-analytical yield model and a cell-level failure-probability model are developed to enable a top-down design method: the system-level requirements, e.g., the chip yield under power and area constraints, are hierarchically mapped to array and cell-level design parameters, e.g., redundancy, ECC scheme, and MOS transistor size. Our simulation results show that "Loadsa" can accurately optimize the STT-RAM based on the system and cell level constraints with linear computation complexity. Our method demonstrates great potential in the early design stage of memory or micro-architecture by eliminating the design integrations, while offering a full statistical view of the design even when the common yield enhancement practices are applied. |
Slides |
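To make the idea of mapping a chip-level yield target onto cell-level failure probability and ECC strength more concrete, here is a small Python sketch. It is not the semi-analytical model from the paper: it assumes independent cell failures with a single probability `cell_fail_prob`, an ECC that corrects up to `correctable_bits` per word, and no row or column redundancy. All function and parameter names are hypothetical.

```python
from math import comb


def word_ok_probability(bits_per_word, cell_fail_prob, correctable_bits):
    """Probability that a word has at most `correctable_bits` failing cells,
    assuming independent cell failures (binomial model, an assumption)."""
    return sum(
        comb(bits_per_word, i)
        * cell_fail_prob ** i
        * (1 - cell_fail_prob) ** (bits_per_word - i)
        for i in range(correctable_bits + 1)
    )


def array_yield(num_words, bits_per_word, cell_fail_prob, correctable_bits):
    """Array yield if every word must be correctable and words are independent."""
    return word_ok_probability(bits_per_word, cell_fail_prob, correctable_bits) ** num_words
```

A top-down flow could sweep `correctable_bits` or the allowed `cell_fail_prob` until `array_yield` meets the system target, which is the direction of mapping the abstract describes.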
Thursday, January 24, 2013 |
Title | (Keynote Address) Gearing Up for the Upcoming Technology Nodes |
Author | Kee Sup Kim (Samsung, Republic of Korea) |
Abstract | Upcoming technology nodes have many challenges. In this talk, Dr. Kee Sup Kim outlines the challenges in recent technology nodes and how Samsung prepared each generation of design infrastructure to overcome these challenges. Emphasis will be given to the challenges in the upcoming technology nodes and the approaches being taken to overcome the challenges posed by double patterning, 3D transistors, 3D ICs and increasing process instabilities. |
Title | (Invited Paper) Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms |
Author | *Doris Chen, Deshanand Singh (Altera Toronto Technology Center, Canada) |
Page | pp. 297 - 304 |
Keyword | high-level synthesis, FPGA, GPU |
Abstract | Fractal compression is an efficient technique for image and video encoding that has not gained widespread acceptance due to its computational intensity. In this paper, we present a real-time implementation of fractal compression in OpenCL, and show how the algorithm can be efficiently optimized for multi-CPUs, GPUs, and FPGAs. We show that the core computation implemented on the FPGA through OpenCL is 3x and 114x faster than a high-end GPU and multi-core CPU, respectively. We also compare to a hand-coded FPGA implementation to showcase the effectiveness of OpenCL-to-FPGA compilation. |
Slides |
Title | (Invited Paper) High Level Synthesis of Multiple Dependent CUDA Kernels for FPGA |
Author | Swathi Gurumani, Hisham Cholakkail (Advanced Digital Sciences Center, Singapore), Yun Liang (Peking University, China), *Kyle Rupnow (Nanyang Technological University, Singapore), Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.) |
Page | pp. 305 - 312 |
Keyword | HLS, FPGA, CUDA |
Abstract | High-level synthesis (HLS) tools provide automatic generation of hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages, and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning and resource sharing in order to improve performance, and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) language source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, demonstrate a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. This study demonstrates that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation, while consuming over 16X less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs. |
Slides |
Title | (Invited Paper) The Liquid Metal IP Bridge |
Author | Perry Cheng, Stephen J. Fink, Rodric Rabbah, *Sunil Shukla (IBM Research, U.S.A.) |
Page | pp. 313 - 319 |
Keyword | High Level Synthesis, Heterogeneous Computing, FPGA |
Abstract | Programmers are increasingly turning to heterogeneous systems to achieve performance. Examples include FPGA-based systems that integrate reconfigurable architectures with conventional processors. However, the burden of managing the coding complexity that is intrinsic to these systems falls entirely on the programmer. This limits the proliferation of these systems as only highly-skilled programmers and FPGA developers can unlock their potential. The goal of the Liquid Metal project at IBM Research is to address the programming complexity attributed to heterogeneous FPGA-based systems. A feature of this work is a vertically integrated development lifecycle that appeals to skilled software developers. A primary enabler for this work is a canonical IP bridge, designed to offer a uniform communication methodology between software and hardware, and that is applicable across a wide range of platforms available off-the-shelf. |
Title | TRISHUL: A Single-pass Optimal Two-level Inclusive Data Cache Hierarchy Selection Process for Real-time MPSoCs |
Author | *Mohammad Shihabul Haque, Akash Kumar, Yajun Ha, Qiang Wu, Shaobo Luo (National University of Singapore, Singapore) |
Page | pp. 320 - 325 |
Keyword | data cache hierarchy configuration, real-time software, Single-pass, Simulation |
Abstract | Hitherto discovered approaches analyze the execution time of a real-time application on all the possible cache hierarchy setups to find the application-specific optimal two-level inclusive data cache hierarchy to reduce cost, space and energy consumption while satisfying the time deadline in real-time Multi-Processor Systems on Chip (MPSoC). These brute-force-like approaches can take years to complete. Alternatively, crude estimation methods driven by the application's memory access trace can find a cache hierarchy quickly by compromising the accuracy of results. In this article, for the first time, we propose a fast and accurate approach, driven by the application's trace, to find the optimal real-time application-specific two-level inclusive data cache hierarchy. Our proposed approach “TRISHUL” predicts the optimal cache hierarchy performance first and then utilizes that information to find the optimal cache hierarchy quickly. TRISHUL can suggest a cache hierarchy, which has up to 128 times smaller size, up to 7 times faster compared to the suggestion of the state-of-the-art crude trace driven two-level inclusive cache hierarchy selection approach for the application traces analyzed. |
Slides |
Title | Optimizing Translation Information Management in NAND Flash Memory Storage Systems |
Author | *Qi Zhang, Xuandong Li, Linzhang Wang, Tian Zhang (Nanjing University, China), Yi Wang, Zili Shao (The Hong Kong Polytechnic University, Hong Kong) |
Page | pp. 326 - 331 |
Keyword | Translation block, NAND flash memory, On-demand, SSD |
Abstract | Address mapping is one of the major functions in managing NAND flash. With the capacity increase of NAND flash, it becomes vitally important to reduce the RAM footprint of the address mapping table without introducing a large performance overhead. Demand-based address mapping is an effective approach to solve this problem, in which the address mapping table is stored in NAND flash (in so-called translation pages), and mapping items are cached on-demand in RAM. Therefore, it is critical to manage translation pages in demand-based address mapping. This paper solves the two most important problems in translation page management. First, to reduce frequent translation page updates caused by data requests, we propose a page-level caching mechanism to exploit the fundamental property of NAND flash where the basic read/write unit is one page. Second, to reduce the garbage collection overhead from translation pages, we propose a multiple write pointers strategy to group data pages corresponding to the same translation page into one data block, so that when the data block is reclaimed via garbage collection, we only need to update one translation page. We evaluate our scheme using a set of benchmarks from both real-world and synthetic traces. Experimental results show that our techniques can achieve significant reduction in the extra translation operations and improve the system response time. |
Slides |
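The demand-based mapping scheme described in the abstract above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the paper's mechanism: flash is modeled as a dictionary of translation pages, the RAM cache holds whole translation pages with LRU eviction, and updates and garbage collection are ignored. `ENTRIES_PER_PAGE`, the class name, and its methods are hypothetical.

```python
from collections import OrderedDict

ENTRIES_PER_PAGE = 4  # assumed number of mapping entries per translation page


class DemandMappingCache:
    """Toy demand-based address mapping with page-granularity caching."""

    def __init__(self, flash_translation_pages, capacity_pages):
        self.flash = flash_translation_pages  # page number -> list of physical addresses
        self.capacity = capacity_pages        # RAM budget, counted in translation pages
        self.cache = OrderedDict()            # LRU order, least recently used first
        self.flash_reads = 0                  # translation page reads from flash

    def lookup(self, logical_page):
        """Translate a logical page number, loading its translation page on a miss."""
        tpage, offset = divmod(logical_page, ENTRIES_PER_PAGE)
        if tpage in self.cache:
            self.cache.move_to_end(tpage)          # refresh LRU position on a hit
        else:
            self.flash_reads += 1                  # miss: read one whole page from flash
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)     # evict the least recently used page
            self.cache[tpage] = self.flash[tpage]
        return self.cache[tpage][offset]
```

Because a flash read always returns a full page, caching at page granularity means that neighboring logical addresses hit in RAM after a single miss, which is the property the abstract says the page-level caching mechanism exploits.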
Title | An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching |
Author | *Xianglei Dang, Xiaoyin Wang, Dong Tong, Zichao Xie, Lingda Li, Keyi Wang (Peking University, China) |
Page | pp. 332 - 337 |
Keyword | data prefetching, energy efficiency, useless prefetch filtering, memory performance optimization |
Abstract | As data prefetching is used in embedded processors, it is crucial to reduce the wasted energy to improve energy efficiency. In this paper, we propose an adaptive prefetch filtering (APF) mechanism to reduce the wasted bandwidth and energy as well as the cache pollution caused by useless prefetches. APF records the prefetch-victim address pairs of issued prefetches and collects information about which address in each pair is first accessed by the processor to guide the filtering of newly generated useless prefetches. Meanwhile, filtered prefetches are recorded for building the feedback mechanism to avoid filtering useful prefetches. Experimental results demonstrate that APF reduces useless prefetches by an average of 53.81% with a mere 5.28% reduction of useful prefetches, thus reducing the memory access bandwidth consumption by 59.92% and the L2 cache energy by 6.19%. APF also improves the performance of several programs by reducing the cache pollution incurred by useless prefetches, thus gaining an average performance improvement of 2.12%. |
Slides |
Title | Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs |
Author | *Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai, Jing-Yang Jou (Department of Electronics Engineering, National Chiao Tung University, Taiwan) |
Page | pp. 338 - 343 |
Keyword | Shared cache, Cache contention, Thread scheduling, Irregular applications, GPGPUs |
Abstract | On-chip shared cache is effective in alleviating the memory bottleneck in modern many-core systems, such as GPGPUs. However, when scheduling numerous concurrent threads on a GPGPU, a cache capacity agnostic scheduling scheme could lead to severe cache contention among threads and thus significant performance degradation. Moreover, the diverse working sets in irregular applications make the cache contention issue an even more serious problem. As a result, taking cache capacity into account has become a critical scheduling issue for GPGPUs. This paper formulates a Cache Capacity Aware Thread Scheduling Problem to capture the impact of cache capacity as well as different architectural considerations. After proving the problem to be NP-hard, this paper proposes two algorithms to perform the cache capacity aware thread scheduling. The simulation results on Nvidia’s Fermi configuration show that the proposed scheduling scheme can effectively avoid cache contention, and achieve an average of 44.7% cache miss reduction and 28.5% runtime enhancement. The paper also shows the runtime can be enhanced up to 62.5% for more complex applications. |
Slides |
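Since the scheduling problem above is stated to be NP-hard, a heuristic is the natural approach. The Python sketch below is not one of the paper's two algorithms: it is a generic first-fit-decreasing heuristic, shown only to illustrate what taking cache capacity into account can look like. It assumes each thread has a known working-set size and that threads in one batch run concurrently and share the cache. The names `group_threads`, `working_sets`, and `cache_capacity` are hypothetical.

```python
def group_threads(working_sets, cache_capacity):
    """Pack threads into batches whose combined working set fits the shared cache.

    working_sets: dict mapping thread id -> working-set size (same unit as capacity).
    Returns a list of batches, each a list of thread ids.
    """
    batches = []  # each entry: [remaining_capacity, [thread ids]]
    # Place the largest working sets first (first-fit decreasing).
    for tid in sorted(working_sets, key=working_sets.get, reverse=True):
        size = working_sets[tid]
        if size > cache_capacity:
            batches.append([0, [tid]])  # oversize thread: give it a batch of its own
            continue
        for batch in batches:
            if batch[0] >= size:        # first batch with enough room left
                batch[0] -= size
                batch[1].append(tid)
                break
        else:
            batches.append([cache_capacity - size, [tid]])
    return [ids for _, ids in batches]
```

With working sets of 6, 5, 4, 3 and 2 units and a capacity of 10, this yields two batches, each of which exactly fills the cache.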
Title | Optimization for Overdrive Signoff |
Author | Tuck-Boon Chan, Andrew B. Kahng, *Jiajia Li, Siddhartha Nath (University of California, San Diego, U.S.A.) |
Page | pp. 344 - 349 |
Keyword | overdrive, signoff, overdesign, multi-mode, optimization |
Abstract | In modern SOC implementations, multi-mode design is commonly used to achieve better circuit performance and power across voltage-scaling, “turbo” and other operating modes. Although there are many tools for multi-mode circuit implementation, to our knowledge there is no available systematic analysis or methodology for the selection of associated signoff modes. We observe that the selection of signoff modes has significant impact on circuit area, power and performance. For example, incorrect choice of signoff voltages for required overdrive frequencies can result in a netlist with 15% suboptimality in power or 21% in area. In this paper, we propose a concept of mode dominance which can be used as a guideline for signoff mode selection. Further, we also propose efficient circuit implementation flows to optimize the selection of signoff modes within several distinct use cases. Our results show that our proposed methodology provides 5-7% improvement in performance compared to the traditional “signoff and scale” method. The signoff modes determined by our methods result in only 0.6% overhead in performance and 8% overhead in power after implementation, compared to the optimal signoff modes. |
Slides |
Title | Mountain-Mover: An Intuitive Logic Shifting Heuristic for Improving Timing Slack Violating Paths |
Author | *Xing Wei, Wai-Chung Tang, Yu-Liang Wu (The Chinese University of Hong Kong, Hong Kong), Cliff Sze, Charles Alpert (IBM Austin Research Center, U.S.A.) |
Page | pp. 350 - 355 |
Keyword | logic rewiring, slack, timing optimization, post-placement |
Abstract | Based on a simple intuitive notion, in this paper, we propose an efficient post-placement improvement scheme. Based on the given timing slack distribution of a circuit, a corresponding “slack mountain map” can be visualized with peaks and valleys representing the worst negative slack and non-critical positive slack areas respectively. Guided by this map, violating paths are improved while the slack mountain is flattened by applying a local logic perturbation technique (rewiring) repeatedly to shift logic resources from critical to non-critical areas. However, due to the locality property of the rewiring technique, to better avoid being stuck at local minima, instead of firing rewiring operations from the peak top towards lower areas, we do this local logic shifting starting from “sea areas” (non-critical) towards peak (critical) areas. At the end, as the slack map is flattened further, a circuit with more evenly distributed slack violations is obtained. Compared to recent work, our experimental results show that this scheme can obtain a better or comparable delay reduction with CPU time one order of magnitude smaller. |
Slides |
Title | Pulsed-Latch ASIC Synthesis in Industrial Design Flow |
Author | *Sangmin Kim, Duckhwan Kim, Youngsoo Shin (Department of Electrical Engineering, KAIST, Republic of Korea) |
Page | pp. 356 - 361 |
Keyword | pulsed-latch, pulse generator, design flow, scan latch, ASIC |
Abstract | The flip-flop has long been the sequencing element of choice in ASIC design; commercial synthesis tools have also been developed in this context. This work has been motivated by the question of whether existing CAD tools can be employed from RTL to layout when the pulsed latch replaces the flip-flop as a sequencing element. Two important problems have been identified and their solutions are proposed: placement of pulse generators and latches for integrity of pulse shape, and design of special scan latches and their selective use to reduce hold violations. A reference design flow has also been set up using published documents, in order to assess the proposed one. In 40-nm technology, the proposed flow achieves a 20% reduction in circuit area and a 30% reduction in power consumption, on average over 12 test circuits. |
Title | Power Optimization for Application-Specific 3D Network-on-Chip with Multiple Supply Voltages |
Author | *Kan Wang, Sheqin Dong (Tsinghua University, China) |
Page | pp. 362 - 367 |
Keyword | Layer Assignment, Multiple Supply Voltages, Application Specific 3D NoC, Inter-layer Communication, Power Consumption |
Abstract | In this paper, an MSV-driven power optimization method is proposed for application-specific 3D NoC (MSV-3DNoC). A unified modeling method is presented for considering both layer assignment and voltage assignment, which achieves the best trade-off between core power and communication power. A 3D NoC synthesis is proposed to assign network components onto each layer and generate inter-layer interconnection. A global redistribution is applied to further reduce communication power. Experimental results show that compared to MSV-driven 2D NoC, the proposed method can improve total chip power greatly. |
Slides |
Title | (Invited Paper) Hardware Security Strategies Exploiting Nanoelectronic Circuits |
Author | Garrett S. Rose (Air Force Research Laboratory, U.S.A.), *Jeyavijayan Rajendran (New York University, U.S.A.), Nathan McDonald (Air Force Research Laboratory, U.S.A.), Ramesh Karri (New York University, U.S.A.), Miodrag Potkonjak (University of California, Los Angeles, U.S.A.), Bryant Wysocki (Air Force Research Laboratory, U.S.A.) |
Page | pp. 368 - 372 |
Keyword | Cybersecurity, PUF, VLSI, Nanotechnology, Memristor |
Abstract | Hardware security has emerged as an important field of study aimed at mitigating issues such as piracy, counterfeiting, and side channel attacks. One popular countermeasure to such hardware security attacks is the physical unclonable function (PUF), which provides a hardware-specific unique signature or identification. Novel nanoelectronic technologies such as memristors are viable options for improved security in emerging integrated circuits. We provide an overview of memristor-based PUF structures and circuits that illustrate the potential for nanoelectronic hardware security solutions. |
Title | (Invited Paper) Can We Identify Smartphone App by Power Trace? |
Author | Mian Dong, Po-Hsiang Lai, *Zhu Li (Samsung Telecommunications America, U.S.A.) |
Page | pp. 373 - 375 |
Keyword | power, smartphone |
Abstract | The power trace of a smartphone, as time series data, carries important information about system behavior and is useful for many applications, such as energy management, software optimization and anomaly detection. However, the power trace measured from the battery terminals includes the power consumption of all the hardware components and thus describes the activity of the whole system. Moreover, modern smartphones are multiprocessing, i.e., multiple applications can be running simultaneously in the same system. Our goal is to answer the following question: “Can we identify smartphone app by power trace?” That is, whether the power trace of a smartphone differs when running different applications. |
Title | (Invited Paper) Secure Storage System and Key Technologies |
Author | *Jiwu Shu, Zhirong Shen, Wei Xue, Yingxun Fu (Tsinghua University, China) |
Page | pp. 376 - 383 |
Keyword | secure storage, cloud storage security, privacy, data security |
Abstract | With the rapid development of cloud storage, data security in storage has received great attention and has become the top concern hindering the spread of cloud services. In this paper, we systematically survey security research on storage systems. We first present the design criteria that are used to evaluate a secure storage system and summarize the widely adopted key technologies. Then, we further investigate security research in cloud storage and identify the new challenges in the cloud environment. Finally, we give a detailed comparison among the selected secure storage systems and map the relationship between the key technologies and the design criteria. |
Slides |
Title | (Invited Paper) Mobile User Classification and Authorization Based on Gesture Usage Recognition |
Author | *Kent W. Nixon, Xiang Chen, Zhi-Hong Mao (University of Pittsburgh, U.S.A.), Kang Li (Rutgers University, U.S.A.), Yiran Chen (University of Pittsburgh, U.S.A.) |
Page | pp. 384 - 389 |
Keyword | Mobile Device, Gesture, Security |
Abstract | Intelligent mobile devices are widely used in almost all aspects of everyday life, from communication, web surfing, and entertainment to daily organization. A large amount of sensitive and private information is stored on the mobile device, leading to severe data security concerns. In this work, we propose a novel mobile user classification and authorization scheme based on the recognition of the user’s gesture. Compared to other security solutions like password, track pattern and finger print etc. |
Title | (Invited Paper) Challenges in Integration of Diverse Functionalities on CMOS |
Author | *Kazuya Masu, Noboru Ishihara (Tokyo Institute of Technology, Japan), Toshifumi Konishi (NTT Advanced Technology, Japan), Katsuyuki Machida (Tokyo Institute of Technology, Japan), Hiroshi Toshiyoshi (The University of Tokyo, Japan) |
Page | pp. 390 - 393 |
Abstract | We introduce “Wafer Shuttle” that is suitable for integration of diverse functionalities. CMOS/MEMS design flow and environment based on SPICE is discussed. It is pointed out that modeling will be important to promote the R&D of MEMS/CMOS and/or diverse-functionalities integration on CMOS. |
Title | (Invited Paper) 3DIC from Concept to Reality |
Author | Frank Lee, Bill Shen, Willy Chen, *Suk Lee (Taiwan Semiconductor Manufacturing Company, Taiwan) |
Page | pp. 394 - 398 |
Keyword | 3DIC, TSMC, System, Design |
Abstract | 3DIC technology presents a new system integration strategy for the electronics industry to achieve superior system performance with lower power consumption, higher bandwidth, smaller system form factor, and shorter time to market through heterogeneous integration. TSMC's “Chip-on-Wafer-on-Substrate (CoWoS)” technology opens up a new opportunity to bring the 3D chip stacking vision from concept to reality. This paper discusses this market trend and the different pieces needed to jointly make it a success, which include customers' required applications, TSMC's supported design flow, as well as the ecosystem design enablement of multi-die implementation, DFT solutions, thermal analysis, verification and new categories of IPs. |
Title | (Invited Paper) 2.5D Design Methodology |
Author | *Sinya Tokunaga (Semiconductor Technology Academic Research Center, Japan) |
Page | pp. 399 - 402 |
Keyword | 3D-IC, Silicon interposer, TSV, Co-design, Co-analysis |
Abstract | We present a 2.5D design methodology. A very important issue is high-frequency insertion loss on the silicon interposer. There are two wiring methodologies for the silicon interposer: the Manhattan wiring method, as in LSI wiring design, and the transmission channel wiring method, as in package design. Using a component model of 6mm length at 1 GHz, we have confirmed that transmission channel wiring has electrical characteristics twice as good as those of Manhattan wiring. |
Title | (Invited Paper) Design Issues in Heterogeneous 3D/2.5D Integration |
Author | *Dragomir Milojevic, Pol Marchal, Erik Jan Marinissen, Geert Van der Plas, Diederik Verkest, Eric Beyne (IMEC, Belgium) |
Page | pp. 403 - 410 |
Keyword | Heterogeneous, 3D/2.5D Integration, Thermal, mechanical analysis, Design for test |
Abstract | Efficient processing of fine-pitched Through Silicon Vias, micro-bumps and back-side re-distribution layers enables face-to-back or face-to-face integration of heterogeneous ICs using 3D stacking and/or Silicon Interposers. While these technology features are extremely compelling, they considerably stress the existing design practices and EDA tool flows typically conceived for 2D systems. With all the system, technology and implementation level options brought by these features, the design space increases to an extent where traditional 2D tools cannot be used any more for efficient exploration. Therefore, the cost-effective design of future 3D IC products will require new planning and co-optimisation techniques and tools that are fast and accurate enough to cope with these challenges. In this paper we present a design methodology and a practical EDA tool chain that covers different aspects of the design flow and is specific to efficient design of 3D-ICs. Flow features include: fast synthesis and 3D design partitioning at gate level, TSV/micro-bump array planning, 3D floor planning, placement and routing, congestion analysis, fast thermal and mechanical modeling, easy technology vs. implementation trade-off analysis, 3D device model generation and Design-for-Test (DfT). The application of the tool chain is illustrated using a concrete example of a real-world design, showing not only the applicability of the tool chain, but also the benefits of heterogeneous 2.5D and 3D integration technologies. |
Title | Verifying Distributed Controllers using Time-Stamped ECAs |
Author | *Matthias Kauer, Sebastian Steinhorst, Martin Lukasiewycz (TUM CREATE, Singapore), Dip Goswami, Reinhard Schneider, Samarjit Chakraborty (TU Munich, Germany) |
Page | pp. 411 - 416 |
Keyword | verification, event-count automata, linear control, timing analysis |
Abstract | We study distributed controllers where sensor, controller, and actuator tasks are mapped onto different processors or Electronic Control Units (ECUs) in a distributed automotive architecture, communicating via a shared bus. Controllers in such setups are designed with a sampling period equal to the worst-case sensor-to-actuator message delay. However, this assumption of all messages having to meet their deadlines is too pessimistic. The inherent robustness of most controllers allows some of the messages to miss their deadlines, while still meeting specified control performance constraints. Given a controller, in this paper we first quantify the frequency of its acceptable deadline misses and represent this as a Linear Temporal Logic (LTL) formula. Further, we model the distributed architecture as a network of time-stamped event count automata (TS-ECA). Such a network of TS-ECAs is then model-checked to verify whether it satisfies the LTL formula. The verification ensures that the controller may be mapped onto the architecture and the control performance constraints will be satisfied. We have implemented this methodology in the Symbolic Analysis Laboratory (SAL), which is a well-known framework combining different tools for system verification. Our implementation and case studies using standard controller design show the applicability of our proposed controller/architecture co-verification. It represents a significant improvement in current design flows where, although controller models are formally verified, their implementation on a distributed architecture is done in an ad hoc fashion with extensive testing and integration effort. |
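As a rough illustration of the kind of constraint this abstract describes, the sketch below checks a trace of deadline hits and misses against a bound of at most `max_misses` misses in any `window` consecutive samples. The window-based form of the bound, the function name, and its parameters are assumptions made for illustration only; the abstract does not state the form of the LTL formula, and this is a plain trace check, not TS-ECA model checking.

```python
from collections import deque
from typing import Iterable


def satisfies_miss_bound(missed: Iterable[bool], max_misses: int, window: int) -> bool:
    """Return True if no `window` consecutive samples hold more than `max_misses` misses.

    `missed` is a trace in which True means a message missed its deadline.
    Hypothetical helper; the windowed bound is an assumed constraint form.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    recent = deque(maxlen=window)  # sliding window over the latest samples
    misses_in_window = 0
    for m in missed:
        # When the window is full, the oldest sample is evicted by append().
        if len(recent) == window and recent[0]:
            misses_in_window -= 1
        recent.append(m)
        if m:
            misses_in_window += 1
        if misses_in_window > max_misses:
            return False
    return True
```

For example, a trace with two misses that are at least three samples apart passes a bound of one miss per window of three, while two misses inside the same window do not.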
Title | Reliability Assessment of Safety-Relevant Automotive Systems in a Model-Based Design Flow |
Author | *Sebastian Reiter, Michael Pressler, Alexander Viehl (FZI Forschungszentrum Informatik, Germany), Oliver Bringmann, Wolfgang Rosenstiel (University Tuebingen, Germany) |
Page | pp. 417 - 422 |
Keyword | reliability, model-based, error injection |
Abstract | To support the reliability assessment of safety-relevant distributed automotive systems and reduce its complexity, this paper presents a novel approach that extends virtual prototyping towards error effect simulation. Besides the common functional and timed system simulation, error injection is used to stress error tolerance mechanisms. A quantitative assessment of the overall system reliability is performed by observing the system reactions and identifying incorrect system behavior. To foster the industrial application, the analysis is integrated in a model-based design flow, starting at the modeling level to assemble and parameterize the virtual prototype and to configure the analysis. The feasibility of the proposed approach is demonstrated by analyzing a representative safety-relevant automotive use case. |
Slides |
Title | Sequential Dependency and Reliability Analysis of Embedded System |
Author | Hehua Zhang, *Yu Jiang (Tsinghua University, China), William N.N Hung (Synopsys, Inc., U.S.A.), Xiaoyu Song (Portland State University, U.S.A.), Jiaguang Sun (Tsinghua University, China) |
Page | pp. 423 - 428 |
Keyword | Dynamic Bayesian Network, embedded system, temporal correlations |
Abstract | Embedded systems are becoming increasingly popular due to their widespread applications, and their reliability is a crucial issue. The complexity of the reliability analysis arises in handling the sequential feedback that makes the system output depend not only on the present input but also on the internal state. In this paper, we propose a novel probabilistic model, named sequential dependency model (SDM), for the reliability analysis of embedded systems with sequential feedback. It is constructed based on the structure of the system components and the signals among them. We prove that the SDM is a Dynamic Bayesian Network (DBN) that captures: the spatial dependencies between system components in a single time slice, the temporal dependencies between system components of different time slices, and the temporal dependencies due to the sequential feedback. We initialize the conditional probability distribution (CPD) table of each SDM node with the failure probability of the corresponding system component. Then, the SDM handles the spatial-temporal correlations at internal components as well as the higher order temporal correlations due to the sequential feedback with the computational mechanism of DBN. Experimental results demonstrate the accuracy of our model. |
Slides |
Title | Processor and DRAM Integration by TSV-Based 3-D Stacking for Power-Aware SOCs |
Author | Shin-Shiun Chen, Chun-Kai Hsu, *Hsiu-Chuan Shih (National Tsing Hua University, Taiwan), Jen-Chieh Yeh (Industrial Technology Research Institute, Taiwan), Cheng-Wen Wu (National Tsing Hua University, Taiwan) |
Page | pp. 429 - 434 |
Keyword | 3D IC, DRAM, SOC, ESL, Power |
Abstract | With the rapid popularization of mobile devices, low power and energy efficiency have become far more important than the system operating frequency. This work demonstrates a processor and DRAM integration scheme using TSV-based 3-D stacking, and its performance and energy efficiency are evaluated with an ESL design methodology. An integration scheme comprising the Sans-Cache DRAM (SCDRAM) architecture, which is designed under power and energy considerations, is explored. Experimental results show that the proposed architecture can reduce energy by 80% while improving system performance by 23.5%. |
Slides |
Title | A Flexible Fixed-outline Floorplanning Methodology for Mixed-size Modules |
Author | *Kai-Chung Chan, Chao-Jam Hsu, Jai-Ming Lin (National Cheng Kung University, Taiwan) |
Page | pp. 435 - 440 |
Keyword | mixed-sized modules, fixed-outline, floorplanning |
Abstract | This paper presents a new flow to handle fixed-outline floorplanning for mixed-size modules. It consists of two stages: a global distribution stage and a legalization stage. The methodology is very flexible, and it can be integrated into other methods or be extended to handle other constraints such as routability or thermal constraints. The experimental results show that our method reduces wirelength on average by 22.5% and 4.7% compared with PATOMA and DeFer, respectively, on mixed-size benchmarks. |
Slides |
Title | Optimizing Routability in Large-Scale Mixed-Size Placement |
Author | Jason Cong (University of California, Los Angeles, U.S.A.), Guojie Luo (Peking University, China), *Kalliopi Tsota, Bingjun Xiao (University of California, Los Angeles, U.S.A.) |
Page | pp. 441 - 446 |
Keyword | placement, routing, congestion, routability |
Abstract | One of the necessary requirements for the placement process is that it should be capable of generating routable solutions. This paper describes methods leading to the reduction of the routing congestion and the final routed wirelength for large-scale mixed-size designs. In order to reduce routing congestion and improve routability, we propose blocking narrow regions on the chip. We also propose dummy-cell insertion inside regions characterized by reduced fixed-macro density. Our placer consists of three major components: (i) narrow channel reduction by performing neighbor-based fixed-macro inflation; (ii) dummy-cell insertion inside large regions with reduced fixed-macro density; and (iii) preplacement inflation by detecting tangled logic structures in the netlist and minimizing the maximum pin density. We evaluated the quality of our placer using the newly released DAC 2012 routability-driven placement contest designs and we compared our results to the top four teams that participated in the placement contest. The experimental results reveal that our placer improves the routability of the DAC 2012 placement contest designs and effectively reduces the routing congestion. |
Slides |
Title | Symmetrical Buffered Clock-Tree Synthesis with Supply-Voltage Alignment |
Author | Xin-Wei Shih (MediaTek, Taiwan), *Tzu-Hsuan Hsu (Linkwish, Taiwan), Hsu-Chieh Lee (Google, Taiwan), Yao-Wen Chang (National Taiwan University, Taiwan), Kai-Yuan Chao (Intel, U.S.A.) |
Page | pp. 447 - 452 |
Keyword | clock, skew, supply voltage, IR-drop, power |
Abstract | For high-performance synchronous systems, nonuniform/non-ideal supply voltages of buffers (e.g., due to IR-drop) may incur a large clock skew and thus serious performance degradation. This paper addresses this problem and presents the first symmetrical buffered clock-tree synthesis flow that considers supply voltage differences of buffers. We employ a two-phase technique of bottom-up clock sink clustering to determine the tree topology, followed by top-down buffer placement and wire routing to complete the clock tree. At each level of processing, clock skew and wirelength are minimized by the determination of buffer embedding regions and the alignment of buffer supply voltages. Experimental results show that, on average, our method can achieve a 76% (respectively, 40%) clock skew reduction with marginal resource and runtime overheads, compared to the state-of-the-art work without supply voltage consideration (with an extension for supply voltages based on our top-down flow). With the skew reductions, our method can meet the stringent skew constraint set by the 2010 ISPD contest for all cases, while other counterparts cannot. In particular, our work provides a key insight into the importance of handling practical design issues (such as IR-drop) for real-world clock-tree synthesis. |
Slides |
Title | BCell: Automatic Layout of Leaf Cells |
Author | Stefan Hougardy, *Tim Nieberg, Jan Schneider (Research Institute for Discrete Mathematics, University of Bonn, Germany) |
Page | pp. 453 - 460 |
Keyword | Leaf Cells, Placement, Routing |
Abstract | In this paper we present BonnCell, our solution to compute leaf cell layouts. Our placement algorithm finds very compact solutions and uses an accurate target function to guarantee routability. The routing algorithm handles all nets simultaneously using a constraint generation MIP-based approach. BCell can easily be adapted to new design rules as required for 14nm and beyond. The experimental results on current 22nm designs show significant improvements compared to manual designs done by experienced designers. |
Slides |
Title | Register and Thread Structure Optimization for GPUs |
Author | *Yun Liang (Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China), Zheng Cui (Advanced Digital Sciences Center, Illinois at Singapore, Singapore), Kyle Rupnow (Nanyang Technological University, Singapore), Deming Chen (University of Illinois, Urbana-Champaign, U.S.A.) |
Page | pp. 461 - 466 |
Keyword | GPU, register, thread structure, design space exploration |
Abstract | GPUs are an increasingly popular implementation platform for a variety of general purpose applications from mobile and embedded devices to high performance computing. The CUDA and OpenCL parallel programming models enable easy utilization of the GPU's resources. However, tuning GPU applications' performance is a complex and labor-intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between the thread parallelism and performance of a single thread. However, prior techniques ignore register allocation, which is a significant factor in single-thread performance and indirectly affects the number of simultaneously active threads. In this paper, we show that joint optimization of register allocation and thread structure has great potential to significantly improve performance. However, the design space for this joint optimization can be large; therefore, we develop performance metrics appropriate for evaluation within a compiler's inner loop and efficient design space exploration techniques that use the metrics to narrow the search space. Across a range of GPU applications, we achieve average performance speedup of 1.33X (up to 1.73X) with design space exploration 355X faster than the exhaustive search. |
Slides |
Title | Real-Time Partitioned Scheduling on Multi-Core Systems with Local and Global Memories |
Author | *Che-Wei Chang (National Taiwan University, Taiwan), Jian-Jia Chen (Karlsruhe Institute of Technology, Germany), Tei-Wei Kuo (National Taiwan University, Taiwan), Heiko Falk (Ulm University, Germany) |
Page | pp. 467 - 472 |
Keyword | real-time system, heterogeneous memory, partitioned scheduling, resource optimization, worst case execution time |
Abstract | Real-time task scheduling becomes even more challenging with the emergence of island-based multi-core architectures, where the local memory module of an island offers shorter access time than the global memory module does. With such a popular architecture design in mind, this paper explores real-time task scheduling over island-based homogeneous cores with local and global memory pools. Joint considerations of real-time scheduling and memory allocation are presented to efficiently use the computing and memory resources. A polynomial-time algorithm with an asymptotic 4-approximation bound is proposed to minimize the number of needed islands to successfully schedule tasks. To evaluate the performance of the proposed algorithm, 82 benchmarks from the MRTC, MediaBench, UTDSP, NetBench, and DSPstone benchmark suites are profiled by a worst-case-execution-time analyzer aiT and included in the experiments. |
Title | Dynamic Thermal Management for Multi-Core Microprocessors Considering Transient Thermal Effects |
Author | *Zao Liu (University of California, Riverside, U.S.A.), Tailong Xu (Anhui University, China), Sheldon X.-D. Tan (University of California, Riverside, U.S.A.), Hai Wang (UESTC, China) |
Page | pp. 473 - 478 |
Keyword | Dynamic thermal management, task migration, thermal analysis, moment matching, hot spots |
Abstract | Dynamic thermal management is a viable way to effectively mitigate thermal emergencies. In this paper, a new thermal management scheme is proposed to reduce the on-chip temperature variance and the occurrence of hot spots by considering more transient thermal effects. The new method performs task migrations to reduce the temperature variations across the chip. Instead of intuitively assigning the heavy tasks to the low temperature cores to balance the thermal profile based on steady state thermal analysis, the proposed method applies moment matching based transient thermal analysis techniques for fast thermal estimation and prediction to guide the migration process. We show that by considering the dominant temperature moment component, the resulting algorithm can lead to significant reduction of hot spots with full transient thermal simulation. Our experimental results on a 16-core microprocessor demonstrate that the proposed method can reduce the number of hot spots by 50% compared to the simple lowest temperature based task scheduling method, leading to more uniform on-chip temperature distribution across the microprocessor cores. |
Slides |
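The baseline that the abstract above compares against, assigning heavy tasks to low-temperature cores, can be sketched as follows. This is only an illustration of that intuitive baseline, not the proposed moment-matching method; the function name, the use of task power as the measure of "heavy", and the one-task-per-core restriction are assumptions.

```python
from typing import Dict, Sequence


def lowest_temperature_assignment(
    task_power: Sequence[float], core_temp: Sequence[float]
) -> Dict[int, int]:
    """Pair the heaviest tasks with the coolest cores, one task per core.

    Returns a mapping from task index to core index.
    Hypothetical baseline sketch based on a steady-state snapshot of temperatures.
    """
    if len(task_power) > len(core_temp):
        raise ValueError("more tasks than cores")
    # Heaviest task first.
    tasks = sorted(range(len(task_power)), key=lambda t: task_power[t], reverse=True)
    # Coolest core first.
    cores = sorted(range(len(core_temp)), key=lambda c: core_temp[c])
    return dict(zip(tasks, cores))
```

Because this pairing looks only at a snapshot of core temperatures, it cannot account for how quickly each core heats up, which is the transient behavior the abstract says the proposed method considers.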
Title | BAMSE: A Balanced Mapping Space Exploration Algorithm for GALS-based Manycore Platforms |
Author | Mohammad Foroozannejad, Brent Bohnenstiehl, *Soheil Ghiasi (University of California, Davis, U.S.A.) |
Page | pp. 479 - 484 |
Keyword | Manycore, GALS, Mapping, Algorithm |
Abstract | We study the problem of mapping concurrent tasks of an application modeled as a data flow graph onto processors of a GALS-based manycore platform. We propose a mapping algorithm called BAMSE, which exploits the characteristics of streaming applications and the specifications of the target architecture to optimize the mapping solution. Different configuration parameters embedded into the algorithm enable one to strike a balance between scalability of the approach and the quality of generated solutions. Experiments with several real life applications show that our algorithm outperforms hand-optimized manual mappings by up to 65% in terms of longest inter-processor communication link, and by as much as 19% with respect to total length of the links, when the two criteria are used as primary and secondary optimization objectives, respectively. Additionally, our algorithm delivers superior mappings compared to ILP generated solutions after 10 days of solver runtime. |
Slides |
Title | (Panel Discussion) Future Direction and Trend of Embedded GPU |
Author | Panelists: Jem Davies (ARM, U.S.A.), Hong Jiang (Intel, U.S.A.), Eisaku Ohbuchi (Digital Media Professionals Inc., Japan), Yasushi Sugama (Fujitsu Laboratories, Japan), Tony King-Smith (Imagination Technologies, U.K.) |
Title | Thermal Simulator of 3D-IC with Modeling of Anisotropic TSV Conductance and Microchannel Entrance Effects |
Author | Hanhua Qian, Hao Liang, Chip-Hong Chang, Wei Zhang, *Hao Yu (Nanyang Technological University, Singapore) |
Page | pp. 485 - 490 |
Keyword | thermal model, 3D-IC, TSV, entrance effect, microchannel |
Abstract | This paper presents a fast and accurate steady state thermal simulator for heatsink and microfluid-cooled 3D-ICs. This model considers the thermal effect of TSVs at fine-granularity by calculating the anisotropic equivalent thermal conductances of a solid grid cell if TSVs are inserted. Entrance effect of microchannels is also investigated for accurate modeling of microfluidic cooling. The proposed thermal simulator is verified against commercial multiphysics solver COMSOL and compared with Hotspot and 3D-ICE. Simulation results show that for heatsink cooling, the proposed simulator is as accurate as Hotspot but runs much faster at moderate granularity. For microfluidic cooling, our proposed simulator is much more accurate than 3D-ICE in its estimation of steady state temperature and thermal distribution. |
Slides |
Title | A Novel Cell Placement Algorithm for Flexible TFT Circuit with Mechanical Strain and Temperature Consideration |
Author | *Juin-Li Lin, Po-Hsun Wu, Tsung-Yi Ho (National Cheng Kung University, Taiwan) |
Page | pp. 491 - 496 |
Keyword | Placement, Mobility, TFT |
Abstract | Mobility is the key device parameter affecting circuit performance in thin-film transistor (TFT) technologies, and it is very sensitive to changes in mechanical strain and temperature. However, existing algorithms only consider the impact of mechanical strain in cell placement of TFT circuits. Without taking temperature into consideration, mobility may decrease dramatically, which leads to circuit performance degradation. This paper presents the first work to minimize the mobility variation caused by changes in mechanical strain and temperature simultaneously. Experimental results show that the proposed algorithms can effectively and efficiently reduce the mobility variation without routing overhead. |
Slides |
Title | Improving Energy Efficiency for Energy Harvesting Embedded Systems |
Author | Yang Ge, Yukan Zhang, *Qinru Qiu (Syracuse University, U.S.A.) |
Page | pp. 497 - 502 |
Keyword | Hybrid electrical energy storage system, energy harvesting system, bank reconfiguration |
Abstract | While the energy harvesting system (EHS) supplies green energy to the embedded system, it also suffers from uncertainty and large variation in harvesting rate. This constraint can be remedied by using efficient energy storage. The Hybrid Electrical Energy Storage (HEES) system was recently proposed as a cost-effective approach with high power conversion efficiency and low self-discharge. In this paper, we propose a fast heuristic algorithm to improve the efficiency of charge allocation and replacement in an EHS/HEES equipped embedded system. The goal of our algorithm is to minimize the energy overhead on the DC-DC converter while satisfying the task deadline constraints of the embedded workload and maximizing the energy stored in the HEES system. We first provide an approximate but accurate power consumption model of the DC-DC converter. Based on this model, the optimal operating point of the system can be analytically solved. Integrated with the dynamic reconfiguration of the HEES bank, our algorithm provides energy efficiency improvement and run-time overhead reduction compared to previous approaches. |
Title | Modeling Variability and Irreproducibility of Nanoelectronic Resistive Switches for Circuit Simulation |
Author | *Arne Heittmann, Tobias G. Noll (RWTH Aachen University, Germany) |
Page | pp. 503 - 508 |
Keyword | variability, resistive switches, electrochemical metallization effect, hybrid circuits, nanoelectronics |
Abstract | This paper presents a device model for nanoelectronic resistive switches which are based on the electrochemical metallization effect (ECM). The focus is set on modeling variability as well as irreproducibility which are essential properties of scaled nanoelectronic devices. In particular, a Poisson-based random ion deposition model and a non-linear filament surface effect are described. The model is especially useful for circuit simulation and can be implemented on standard circuit simulation platforms such as Spice or Spectre using inbuilt standard elements. Based on this model, effects of variability were examined by Monte Carlo simulation for a particular hybrid CMOS/nanoelectronic circuit. The results show that the proposed model is able to cover significant scaling effects, which is necessary for prospective design space exploration and circuit optimization. |
Title | HS3DPG: Hierarchical Simulation for 3D P/G Network |
Author | *Shuai Tao, Xiaoming Chen, Yu Wang, Yuchun Ma (Tsinghua University, China), Yiyu Shi (Missouri University of Science and Technology, U.S.A.), Hui Wang, Huazhong Yang (Tsinghua University, China) |
Page | pp. 509 - 514 |
Keyword | 3D P/G network, hierarchical simulation, port equivalent model |
Abstract | As different chips are stacked together in 3D ICs, the power/ground (P/G) network simulation becomes more challenging than that of 2D cases. In this paper, we propose a hierarchical simulation method suitable for 3D P/G networks (HS3DPG), which can ensure full parallelism and good scalability with the number of tiers. Besides, the "locality" property is introduced into HS3DPG to further simplify the simulation. Finally, we use HS3DPG to analyze the voltage distribution of a 3D P/G network with clustered TSVs. |
Slides |
Title | Piecewise-Polynomial Associated Transform Macromodeling Algorithm for Fast Nonlinear Circuit Simulation |
Author | *Yang Zhang, Neric Fong, Ngai Wong (The University of Hong Kong, Hong Kong) |
Page | pp. 515 - 520 |
Keyword | Nonlinear MOR, Associated transform, TPWL, PWP, Macromodeling |
Abstract | We present a piecewise-polynomial based associated transform algorithm (PWPAT) for macromodeling nonlinear circuits in system-level circuit design. The generated reduced model can provide both global and local accuracies with the most compact dimension. Numerical examples compare it with existing algorithms and verify its superior accuracy in higher order harmonics simulation over the traditional Trajectory Piecewise-Linear (TPWL) approach. |
Slides |
Title | An Ultra-Compact Virtual Source FET Model for Deeply-Scaled Devices: Parameter Extraction and Validation for Standard Cell Libraries and Digital Circuits |
Author | *Li Yu, Omar Mysore, Lan Wei, Luca Daniel, Dimitri Antoniadis (MIT, U.S.A.), Ibrahim Elfadel (Masdar Institute of Science and Technology, United Arab Emirates), Duane Boning (MIT, U.S.A.) |
Page | pp. 521 - 526 |
Keyword | ultra-compact model, parameter extraction, library cell characterization, VLSI timing, power analysis |
Abstract | In this paper, we present the first validation of the virtual source (VS) charge-based compact model for standard cell libraries and large-scale digital circuits. With only a modest number of physically meaningful parameters, the VS model accounts for the main short-channel effects in nanometer technologies. Using a novel DC and transient parameter extraction methodology, the model is verified with simulated data from a well-characterized, industrial 40nm bulk silicon model. The VS model is used to fully characterize a standard cell library at the 40nm node with timing comparisons showing less than 2.7% error with respect to the industrial design kit. Furthermore, a 1001-stage inverter chain and a 32-bit ripple-adder are employed as test cases in a vendor CAD environment to validate the use of the VS model for large-scale digital circuit applications. Parametric Vdd sweeps show that the VS model is also ready for usage in low-power design methodologies. Finally, runtime comparisons have shown that the use of the VS model results in a speedup of about 7.6×. |
Title | On Potential Design Impacts of Electromigration Awareness |
Author | Andrew B. Kahng, *Siddhartha Nath, Tajana S. Rosing (University of California, San Diego, U.S.A.) |
Page | pp. 527 - 532 |
Keyword | Electromigration, Fmax, EM slack, tradeoff, reliability |
Abstract | Reliability issues significantly limit performance improvements from Moore’s-Law scaling. At 45nm and below, electromigration (EM) is a serious reliability issue which affects global and local interconnects in a chip and limits performance scaling. Traditional IC implementation flows meet a 10-year lifetime requirement by overdesigning and sacrificing performance. At the same time, it is well-known among circuit designers that Black’s Equation [2] suggests that lifetime can be traded for performance. In our work, we carefully study the impacts of EM-awareness on IC implementation outcomes, and show that circuit performance does not trade off so smoothly with mean time to failure (MTTF) as suggested by Black’s Equation. We conduct two basic studies: EM lifetime versus performance with fixed resource budget, and EM lifetime versus resource with fixed performance. Using design examples implemented in two process nodes, we show that performance scaling achieved by reducing the EM lifetime requirement depends on the EM slack in the circuit, which in turn depends on factors such as timing constraints, length of critical paths and the mix of cell sizes. Depending on these factors, the performance gain can range from 10% to 80% when the lifetime requirement is reduced from 10 years to one year. We show that at a fixed performance requirement, power and area resources are affected by the timing slack and can either decrease by 3% or increase by 7.8% when the MTTF requirement is reduced. We also study how conventional EM fixes using per net Non-Default Rule (NDR) routing, downsizing of drivers, and fanout reduction affect performance at reduced lifetime requirements. Our study indicates, e.g., that NDR routing can increase performance by up to 5% but at the cost of 2% increase in area at a reduced 7-year lifetime requirement. |
Slides |
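For reference, the idealized lifetime-versus-current trade-off that Black's Equation suggests, and that the abstract above argues does not carry over so smoothly to circuit performance, can be written out as below. Here $A$ is a technology-dependent constant, $J$ the current density, $n$ the current-density exponent, $E_a$ the activation energy, $k$ Boltzmann's constant, and $T$ the absolute temperature. The value $n = 2$ in the worked number is an illustrative assumption, not a figure from the paper.

```latex
\mathrm{MTTF} = \frac{A}{J^{n}} \exp\!\left(\frac{E_a}{k T}\right)
\quad\Longrightarrow\quad
\frac{J_2}{J_1} = \left(\frac{\mathrm{MTTF}_1}{\mathrm{MTTF}_2}\right)^{1/n}
\quad \text{(at fixed } T \text{)}

% Illustrative assumption: n = 2, lifetime requirement reduced from 10 years to 1 year
\frac{J_2}{J_1} = 10^{1/2} \approx 3.16
```

Under these assumptions the equation alone would permit roughly a threefold higher current density, whereas the abstract reports that the achievable performance gain depends on the EM slack in the circuit.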
Title | Provably Optimal Test Cube Generation using Quantified Boolean Formula Solving |
Author | Matthias Sauer, *Sven Reimer (University of Freiburg, Germany), Ilia Polian (University of Passau, Germany), Tobias Schubert, Bernd Becker (University of Freiburg, Germany) |
Page | pp. 533 - 539 |
Keyword | Test Cube, X-input, QBF, SAT, Relaxation |
Abstract | Circuits that employ test pattern compression rely on test cubes to achieve high compression ratios. The fewer inputs of a test pattern are specified, the better it can be compacted and hence the lower the test application time. Although there exist previous approaches to generate such test cubes, none of them are optimal. We present for the first time a framework that yields provably optimal test cubes by using the theory of quantified Boolean formulas (QBF). Extensive comparisons with previous methods demonstrate the quality gain of the proposed method. |
Slides |
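To illustrate why fewer specified inputs make a test pattern easier to compact, the sketch below represents test cubes as strings over `0`, `1`, and `X` (unspecified) and merges two cubes when they do not conflict. The string encoding and the helper names `compatible` and `merge` are assumptions for illustration; this shows only generic cube merging and has nothing to do with the QBF-based generation in the paper.

```python
def compatible(a: str, b: str) -> bool:
    """Two cubes are compatible if no position has conflicting specified values."""
    return len(a) == len(b) and all(
        x == y or "X" in (x, y) for x, y in zip(a, b)
    )


def merge(a: str, b: str) -> str:
    """Merge two compatible cubes into one that satisfies both.

    Hypothetical helper: each position takes whichever value is specified.
    """
    if not compatible(a, b):
        raise ValueError("cubes conflict and cannot be merged")
    return "".join(y if x == "X" else x for x, y in zip(a, b))
```

A cube with many `X` positions is compatible with more of its neighbors, so more cubes can be folded into a single applied pattern.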
Title | Synthesizing Multiple Scan Chains by Cost-Driven Spectral Ordering |
Author | *Louis Y.-Z. Lin, Christina C.-H. Liao, Charles H.-P. Wen (Dept. of Elec. & Comp. engr., National Chiao Tung University, Taiwan) |
Page | pp. 540 - 545 |
Keyword | testing, scan chain, scan order |
Abstract | Power cost and wire cost are the two most critical issues in scan-chain optimization for modern VLSI testing. Many previous works used layout-based partitioning and greedy heuristics to synthesize multiple scan chains, and thus suffer from (1) the nongeometric-cost problem and (2) the crossing-edge problem. Therefore, in this paper, we propose cost-driven spectral ordering, including (1) cost-driven k-way spectral partitioning and (2) greedy non-crossing 2-opt ordering, to resolve the two problems stated above, respectively. Experiments show that different cost metrics can be properly addressed in k-way spectral partitioning. Moreover, our cost-driven spectral ordering achieves on average a 9% mixed (power-and-wire) cost reduction compared with two previous works on benchmark circuits, which demonstrates its effectiveness for multiple scan-chain synthesis. |
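As background for the 2-opt ordering mentioned in this abstract, here is a minimal sketch of textbook 2-opt applied to the order of cells along a single scan chain. It uses Manhattan distance between cell positions as a stand-in for wire cost; that cost model and the function names are assumptions, and this generic version is not the paper's cost-driven or greedy non-crossing variant.

```python
from typing import List, Sequence, Tuple

Point = Tuple[int, int]


def manhattan(p: Point, q: Point) -> int:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def path_length(order: Sequence[int], pts: Sequence[Point]) -> int:
    """Total wire length of an open chain visiting cells in the given order."""
    return sum(manhattan(pts[order[i]], pts[order[i + 1]]) for i in range(len(order) - 1))


def two_opt(order: Sequence[int], pts: Sequence[Point]) -> List[int]:
    """Reverse segments of the chain as long as doing so shortens it."""
    best = list(order)
    improved = True
    while improved:
        improved = False
        for i in range(len(best) - 1):
            for j in range(i + 1, len(best)):
                # Candidate order with the segment best[i..j] reversed.
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if path_length(candidate, pts) < path_length(best, pts):
                    best = candidate
                    improved = True
    return best
```

Each accepted reversal strictly shortens the chain, so the loop terminates at a local optimum; recomputing the full length for every candidate keeps the sketch simple at the cost of speed.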
Title | A Binding Algorithm in High-Level Synthesis for Path Delay Testability |
Author | *Yuki Yoshikawa (Kure National College of Technology, Japan) |
Page | pp. 546 - 551 |
Keyword | Delay test, High-level synthesis, Resource binding, Synthesis for testability |
Abstract | A binding method in high-level synthesis for path delay testability is proposed in this paper. For a given scheduled data flow graph, the proposed method synthesizes a path delay testable RTL datapath and its controller. Every path in the datapath is two-pattern testable with the controller if the path is activated in the functional operation, i.e., the path is not a false path. Our experimental results show that the proposed method can synthesize such RTL circuits with small area overhead compared with that augmented by some DFT techniques such as scan design. |
Slides |
Title | Full Exploitation of Process Variation Space for Continuous Delivery of Optimal Delay Test Quality |
Author | Baris Arslan (University of California, San Diego/Qualcomm, U.S.A.), *Alex Orailoglu (University of California, San Diego, U.S.A.) |
Page | pp. 552 - 557 |
Keyword | delay test, test cost optimization, adaptive test, process-aware test |
Abstract | The increasing magnitude of process variations effectively individualizes each chip, necessitating distinct quantities of test resources for each in order to optimize overall delay test quality without exceeding set test budgets. This paper proposes an analytical framework that delivers the optimal test time assignment per chip in order to minimize the delay defect escape rate. Adjustment of the chip-specific test time in the continuous process variation space is attained through an adaptive test flow that utilizes process data measurements from the device under test. The results evince that a substantial improvement in the delay test quality can be obtained at no increase whatsoever to test time consumed by conventional test flows. |
Friday, January 25, 2013 |
Title | (Keynote Address) Human, Vehicle and Social Infrastructure System Development for Sustainable Mobility – Development Innovation based on Large-Scale Simulation – |
Author | Hiroyuki Watanabe (Toyota Motor Corporation, Japan) |
Abstract | In order to realize a sustainable mobility society, technology development is ongoing to handle energy security, CO2 reduction, traffic-congestion and road-accident related challenges. The vehicle itself and the surrounding social infrastructure system have to deal with three challenges. The first challenge will be to develop and raise the efficiency of a powertrain supporting renewable energy like bio-fuels, electricity, and hydrogen. The second will be the challenge to enhance vehicle dynamic performance and to innovate environmental and safety features of the vehicle by autonomous driving and its supporting technologies. The third challenge is ITS development applying ICT (Information and Communication Technology), namely, the development of a “connected” vehicle and its related social infrastructure. These developments, which began by utilizing simulation technology such as HILS and SILS, have evolved to cover systems that include both humans and society. In the development of the social infrastructure system, much progress has been made in the development process itself, such as application of real-time probe-data, and large-scale simulation using Big Data. In this keynote lecture, we will introduce approaches to resolving key challenges, and show the future technological trend together with a proposal to innovate development based on large-scale simulation. |
Title | (Invited Paper) SMYLE Project: Toward High-Performance, Low-Power Computing on Manycore-Processor SoCs |
Author | *Koji Inoue (Kyushu University, Japan) |
Page | pp. 558 - 560 |
Keyword | manycore, SoC, low power, high performance, processor |
Abstract | This paper introduces a manycore research project called SMYLE (Scalable ManYcore for Low Energy computing). The aims of this project are: 1) proposing a manycore SoC architecture and developing a suitable programming and execution environment, 2) designing a domain specific manycore system for emerging video mining applications, and 3) releasing developed software tools and FPGA emulation environments to accelerate manycore research and development in the community. The project started in December 2010 with full support from the New Energy and Industrial Technology Development Organization (NEDO). |
Title | (Invited Paper) SMYLEref: A Reference Architecture for Manycore-Processor SoCs |
Author | *Masaaki Kondo, Son Truong Nguyen (The University of Electro-Communications, Japan), Tomoya Hirao, Takeshi Soga, Hiroshi Sasaki, Koji Inoue (Kyushu University, Japan) |
Page | pp. 561 - 564 |
Keyword | Manycore Processor, Prototyping, FPGA |
Abstract | Nowadays, the trend of developing microprocessors with tens of cores brings a promising prospect for embedded systems. Realizing a high performance and low power many-core processor is becoming a primary technical challenge. We are currently developing a many-core processor architecture for embedded systems as part of a NEDO project. This paper introduces the many-core architecture called SMYLEref along with the concept of Virtual Accelerator on Many-core, in which many cores on a chip are utilized as a hardware platform for realizing multiple virtual accelerators. We are developing its prototype system with off-the-shelf FPGA evaluation boards. In this paper, we introduce the architecture of SMYLEref and the details of the prototype system. In addition, several initial experiments with the prototype system are also presented. |
Slides |
Title | (Invited Paper) SMYLE OpenCL: A Programming Framework for Embedded Many-core SoCs |
Author | *Hiroyuki Tomiyama, Takuji Hieda, Naoki Nishiyama, Noriko Etani, Ittetsu Taniguchi (Ritsumeikan University, Japan) |
Page | pp. 565 - 567 |
Keyword | manycore SoCs, OpenCL, embedded systems |
Abstract | Embedded SoC architecture has shifted from single-core to multi/many-core paradigm because of better power/performance efficiency. In order to exploit the potential power/performance efficiency of the many-core architecture, a parallel computing framework is necessary. OpenCL is one of the most popular parallel computing frameworks in the field of general-purpose computing on GPUs and multicore servers. However, the existing OpenCL implementations are not suitable for embedded real-time systems because of the large runtime overhead. In this paper, we describe a lightweight OpenCL framework for embedded multi/many-core SoCs. Our OpenCL framework minimizes the runtime overhead by statically creating threads and mapping them onto cores. Preliminary experiments on an FPGA prototype board with a five-core architecture show a significant reduction in runtime overhead compared with an existing OpenCL framework. |
Title | (Invited Paper) Support Tools for Porting Legacy Applications to Multicore |
Author | Yuri Ardila, *Natsuki Kawai, Takashi Nakamura, Yosuke Tamura (Fixstars Corporation, Japan) |
Page | pp. 568 - 573 |
Keyword | auto-parallelizer, performance estimation, benchmark, parallel computing |
Abstract | This paper presents PEMAP, an automated performance estimation tool to project performance of hand-parallelized programs from sequential programs, and BEMAP, a benchmark suite to measure an auto-parallelizer or even a machine's performance. BEMAP is an open-source project, and documentation on code explanations and experimental results is also provided. Our experiments on PEMAP show we can estimate performance of hand-parallelized programs within an error of 0.44% of the sequential program's performance on average, while using BEMAP shows that the ability of an auto-parallelizer can be measured by comparing the compiled code to the hand-tuned parallelized OpenCL code, thereby assisting the development of the auto-parallelizer tool. |
Slides |
Title | (Invited Paper) Manycore Processor for Video Mining Applications |
Author | *Yukoh Matsumoto, Hiroyuki Uchida, Michiya Hagimoto, Yasumori Hibi, Sunao Torii, Masamichi Izumida (TOPS Systems Corporation, Japan) |
Page | pp. 574 - 575 |
Abstract | Through Architecture-Algorithm co-design for Video Mining Applications, we designed a scalable Manycore processor consisting of clustered heterogeneous cores with stream processing capabilities and zero-overhead inter-process communication through FIFO with a hardware-software mechanism. To achieve high performance and low power consumption, and especially to reduce the memory accesses required for Video Mining Applications, each application is partitioned to exploit both task and data parallelism, and programmed as distributed stream processing with a relatively large local register file based on the Kahn Process Network model. |
Slides |
Title | Native Simulation of Complex VLIW Instruction Sets using Static Binary Translation and Hardware-Assisted Virtualization |
Author | *Mian-Muhammad Hamayun, Frédéric Pétrot, Nicolas Fournel (TIMA Laboratory, CNRS/INP Grenoble/UJF, France) |
Page | pp. 576 - 581 |
Keyword | System Simulation, Static Binary Translation, Hardware-Assisted Virtualization, VLIW |
Abstract | We introduce a static binary translation flow in native simulation context for cross-compiled VLIW executables. This approach is interesting in situations where either the source code is not available or the target platform is not supported by any retargetable compilation framework, which is usually the case for VLIW processors. The generated simulators execute on a Hardware-Assisted Virtualization (HAV) based native platform. We have implemented this approach for a TI C6x series processor and our simulation results show a speed-up of around two orders of magnitude compared to the cycle accurate simulators. |
Slides |
Title | RExCache: Rapid Exploration of Unified Last-level Cache |
Author | *Su Myat Min Shwe, Haris Javaid, Sri Parameswaran (University of New South Wales, Australia) |
Page | pp. 582 - 587 |
Keyword | estimator, exploration, cache |
Abstract | In this paper, we propose to explore design space of a unified last-level cache to improve system performance and energy efficiency. The challenge is to quickly estimate the execution time and energy consumption of the system with distinct cache configurations using minimal number of slow full-system cycle-accurate simulations. To this end, we propose a novel, simple yet highly accurate execution time estimator and a simple, reasonably accurate energy estimator. Our framework, RExCache, combines a cycle-accurate simulator and a trace-driven cache simulator with our novel execution time estimator and energy estimator to avoid cycle-accurate simulations of all the last-level cache configurations. Once execution time and energy estimates are available from the estimators, RExCache chooses minimum execution time or minimum energy consumption cache configuration. Our experiments with nine different applications from mediabench, and 330 last-level cache configurations show that the execution time and energy estimators had at least average absolute accuracy of 99.74% and 80.31% respectively. RExCache took only a few hours (21 hours for H.264enc) to explore last-level cache configurations compared to several days of traditional method (36 days for H.264enc) and cycle-accurate simulations (257 days for H.264enc), enabling quick exploration of the last-level cache. When 100 different real-time constraints on execution time and energy were used, all the cache configurations found by RExCache were similar to those from cycle-accurate simulations. On the other hand, the traditional method found correct cache configurations for only 69 out of 100 constraints. Thus, RExCache has better absolute accuracy than the traditional method, yet reducing the simulation time by at least 97%. |
Slides |
Title | An Efficient Hybrid Synchronization Technique for Scalable Multi-Core Instruction Set Simulations |
Author | *Bo-Han Zeng, Ren-Song Tsay, Ting-Chi Wang (National Tsing Hua University, Taiwan) |
Page | pp. 588 - 593 |
Keyword | Timing Synchronization, Multi-Core Simulator, Instruction Set Simulator |
Abstract | Multi-core system simulation techniques have been essential to system development in recent years. Although these techniques have been studied extensively, we have found that both conventional polling and collaborative timing synchronization approaches encounter a severe scalability issue when the number of target cores is more than that of the host cores. To resolve this issue, we propose an effective hybrid technique that combines the advantages of the two approaches. According to the experimental results, the proposed technique effectively resolves the scalability issue and shows one to four orders of magnitude of improvement compared to conventional approaches. |
Title | Statistical Analysis of BTI in the Presence of Process-induced Voltage and Temperature Variations |
Author | *Farshad Firouzi, Saman Kiamehr, Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany) |
Page | pp. 594 - 600 |
Keyword | NBTI, PBTI, PVT, Reliability, Timing analysis |
Abstract | In nano-scale regime, there are various sources of uncertainty and unpredictability of VLSI designs such as transistor aging mainly due to Bias Temperature Instability (BTI) as well as Process-Voltage-Temperature (PVT) variations. BTI exponentially varies by temperature and the actual supply voltage seen by the transistors within the chip which are functions of leakage power. Leakage power is strongly impacted by PVT and BTI which in turn results in thermal-voltage variations. Hence, neglecting one or some of these aspects can lead to a considerable inaccuracy in the estimated BTI-induced delay degradation. However, a holistic approach to tackle all these issues and their interdependence is missing. In this paper, we develop an analytical model to predict the probability density function and covariance of temperatures and voltage droops of a die in the presence of the BTI and process variation. Based on this model, we propose a statistical method that characterizes the life-time of the circuit affected by BTI in the presence of process-induced temperature-voltage variations. We observe that for benchmark circuits, treating each aspect independently and ignoring their intrinsic interactions results in 16% over-design, translating to unnecessary yield and performance loss. |
Slides |
Title | CLASS: Combined Logic and Architectural Soft Error Sensitivity Analysis |
Author | *Mojtaba Ebrahimi, Liang Chen (Karlsruhe Institute of Technology, Germany), Hossein Asadi (Sharif University of Technology, Iran), Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany) |
Page | pp. 601 - 607 |
Keyword | Reliability Analysis, Soft Error, ACE, Error Propagation, Markov Chains |
Abstract | With continuous technology downscaling, the rate of radiation induced soft errors is rapidly increasing. Fast and accurate soft error vulnerability analysis in early design stages plays an important role in cost-effective reliability improvement. However, existing solutions are suitable for either regular (a.k.a. address-based, such as the memory hierarchy) or irregular (random logic such as functional units and control logic) structures, failing to provide an accurate system level analysis. In this paper, we propose a hybrid approach integrating architecture-level and logic-level techniques to accurately estimate the vulnerability of all regular and irregular structures within a microprocessor. It carefully handles error propagation and masking scenarios among these structures. We have evaluated the vulnerability of the OR1200 processor using the proposed approach. Comparison with statistical fault injection shows an average inaccuracy of less than 5% with five orders of magnitude improvement in runtime. |
Slides |
Title | Application Specified Soft Error Failure Rate Analysis using Sequential Equivalence Checking Techniques |
Author | *Tun Li, Dan Zhu, Sikun Li, Yang Guo (National University of Defense Technology, China) |
Page | pp. 608 - 613 |
Keyword | Soft-error, Failure rate analysis, Sequential equivalence checking, Application |
Abstract | Soft errors have become a critical challenge as a result of technology scaling. However, evaluating the influence of soft errors in flip-flops (FFs) on circuit failure is a hard verification problem. Here, we propose a novel flip-flop soft error failure rate analysis methodology using sequential equivalence checking (SEC) and taking the application behaviors into consideration, which combines the advantage of formal-technique-based approaches in completeness and the advantage of application behaviors in accuracy in differentiating the vulnerability of FFs. As a result, all the FFs in a circuit are sorted by their failure rates and designers can use this information to perform optimal hardening of selected sequential components against soft errors. Experimental results on an implementation of a SpaceWire end node and the set of the largest ISCAS’89 benchmark sequential circuits demonstrate the efficiency of our approach. A case study on an instruction decoder of a practical 32-bit microprocessor shows the applicability of our methodology. |
Slides |
Title | An Adaptive Current-Threshold Determination for IDDQ Testing Based on Bayesian Process Parameter Estimation |
Author | *Michihiro Shintani, Takashi Sato (Graduate School of Informatics, Kyoto University, Japan) |
Page | pp. 614 - 619 |
Keyword | IDDQ testing, Statistical leakage current analysis, Bayes' Theorem |
Abstract | Application of IDDQ testing to LSIs fabricated using advanced process technology is becoming increasingly difficult due to large variability of scaled devices. In this paper, we propose a novel technique that adaptively determines per-chip current-threshold for IDDQ testing to enhance test accuracy. In the proposed technique, process condition of a chip and fault sensitization vector are first estimated based on measured IDDQ currents through Bayesian inference. Then, using the estimated process condition, a statistical distribution of the leakage current for each test pattern is calculated and suitable current-threshold is determined by the distribution. Simulation experiments demonstrate that the proposed technique can successfully detect a very small leakage fault, down to 2% of the nominal IDDQ current with the test escape ratio of 3.1%. |
Slides |
Title | DARNS: A Randomized Multi-modulo RNS Architecture for Double-and-Add in ECC to Prevent Power Analysis Side Channel Attacks |
Author | Jude Angelo Ambrose (University of New South Wales, Australia), *Hector Pettenghi, Leonel Sousa (Instituto de Engenharia de Sistemas e Computadores, Portugal) |
Page | pp. 620 - 625 |
Keyword | residue number systems, power analysis side channel attacks, multi-modulo architectures |
Abstract | Security in embedded systems is of critical importance since most of our secure transactions are currently made via credit cards or mobile phones. Power analysis based side channel attacks have proved to be the most successful attacks on embedded systems to retrieve secret keys, allowing impersonation and theft. State-of-the-art solutions for such attacks in Elliptic Curve Cryptography (ECC), mostly in software, hinder performance and are repeatedly attacked using improved techniques. To protect ECC from both simple power analysis and differential power analysis (DPA), as a hardware solution, we propose to take advantage of the inherent parallelization capability in Multi-modulo Residue Number Systems (RNS) architectures to obfuscate the secure information. Random selection of moduli is proposed to randomly choose the moduli sets for each key bit operation. This solution allows us to prevent power analysis, while still providing all the benefits of RNS. In this paper, we show that DPA is indeed thwarted, as well as correlation analysis. |
Slides |
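Note | The RNS idea in the abstract above can be illustrated with a minimal Python sketch. It is not the DARNS architecture: the moduli pool, the function names, and the use of plain integer multiplication in place of elliptic-curve double-and-add are all assumptions made for illustration. It only shows that a value can be represented by residues under a randomly chosen moduli set, operated on channel by channel, and recovered with the Chinese Remainder Theorem.

```python
import random
from math import prod

# Hypothetical pool of pairwise-coprime moduli (not taken from the paper).
MODULI_POOL = [251, 253, 255, 256, 257]

def pick_moduli(k, rng):
    """Randomly choose k moduli from the pairwise-coprime pool."""
    return rng.sample(MODULI_POOL, k)

def to_rns(x, moduli):
    """Represent x by its residues, one per modulus."""
    return [x % m for m in moduli]

def rns_mul(a, b, moduli):
    """Multiply two RNS values; each residue channel is independent."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, moduli)]

def from_rns(residues, moduli):
    """Recover the integer with the Chinese Remainder Theorem."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return total % M

rng = random.Random(0)
x, y = 1234, 567
for _ in range(3):
    moduli = pick_moduli(3, rng)  # a fresh moduli set for each operation
    result = from_rns(rns_mul(to_rns(x, moduli), to_rns(y, moduli), moduli), moduli)
    print(moduli, result)
```

The result is the same for every moduli set because the product stays below the dynamic range of any three moduli from the pool, while the intermediate residues differ from one set to the next.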
Title | ScanPUF: Robust Ultralow-Overhead PUF Using Scan Chain |
Author | Yu Zheng, Aswin Raghav Krishna, *Swarup Bhunia (Case Western Reserve University, U.S.A.) |
Page | pp. 626 - 631 |
Keyword | PUF, DFT, Uniqueness, Stability, NBTI |
Abstract | Physical Unclonable Functions (PUFs) have emerged as an attractive primitive to address diverse hardware security issues, such as chip authentication, intellectual property (IP) protection and cryptographic key generation. Existing PUFs, typically acquired and integrated in a design as a commodity, often incur considerable hardware overhead. Many of these PUFs also suffer from insufficient challenge-response pairs. In this paper, we propose ScanPUF, a novel PUF implementation using a common on-chip structure used for improving circuit testability, namely scan chain. It exploits path delay variations between the scan flip-flops in a scan chain to create high-quality (in terms of uniqueness and robustness) secret keys. Furthermore, since a scan chain provides a large pool of scan paths to create a signature, we can achieve a high volume of secret keys from each chip. Since it uses a prevalent on-chip structure, the overhead is extremely small (2.3% area of the RO-PUF), primarily contributed by small additional logic in the signature-generation cycle controller. Circuit-level simulation results with 1000 chips under inter- and intra-die process variations show high uniqueness of 49.9% average inter-die Hamming distance and good reproducibility of 5% intra-die Hamming distance below 85 °C. The temporal variations due to device aging effects, e.g., bias temperature instability (BTI), lead to only 4% unstable bits for ten-year usage. The experimental evaluation on FPGA (Altera Cyclone-III) exhibits 47.1% average inter-die Hamming distance, as well as 3.2% unstable bits at room temperature. |
Slides |
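Note | The uniqueness figure quoted in the abstract above is an average inter-die Hamming distance. The following Python sketch shows how such a metric is commonly computed over a set of signatures. The function names and the toy 4-bit signatures are hypothetical and are not data from the paper.

```python
from itertools import combinations

def hamming_fraction(a, b):
    """Fraction of bit positions in which two equal-length signatures differ."""
    if len(a) != len(b):
        raise ValueError("signatures must have equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def average_inter_die_hd(signatures):
    """Mean pairwise Hamming fraction over all pairs of chips."""
    pairs = list(combinations(signatures, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)

# Toy signatures, one bit string per chip (illustrative only).
chips = ["0000", "1111", "0011"]
print(average_inter_die_hd(chips))
```

An ideal PUF gives a value close to 0.5, meaning that any two chips disagree on about half of their signature bits. The same `hamming_fraction` helper applied to repeated readings of one chip gives the intra-die distance used as a reproducibility measure.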
Title | An Efficient Compression Scheme for Checkpointing of FPGA-Based Digital Mockups |
Author | *Ting-Shuo Chou (University of California, Irvine, U.S.A.), Chen Huang, Bailey Miller (University of California, Riverside, U.S.A.), Tony Givargis (University of California, Irvine, U.S.A.), Frank Vahid (University of California, Riverside, U.S.A.) |
Page | pp. 632 - 637 |
Keyword | Digital Mockups, Test Automation, Cyber-Physical Systems, Medical Cyber-Physical Systems, Hardware-in-the-Loop |
Abstract | This paper outlines a transparent and nonintrusive checkpointing mechanism for use with FPGA-based digital mockups. A digital mockup is an executable model of a physical system, implemented on an FPGA, and used for real-time test and validation of cyber-physical devices that interact with the physical system. These digital mockups are typically defined in terms of a large set of ordinary differential equations (ODEs). A checkpoint is a snapshot of the internal state of the model at a specific point in time as captured by some controller that resides on the same FPGA. A further requirement is that the model continues uninterrupted execution during a checkpointing operation. Once a checkpoint is created, the corresponding state information is transferred from the FPGA to a host computer for visualization and other off-chip processing. We outline the architecture of a checkpointing controller that captures and transfers the state information at a desired clock cycle using an aggressive compression technique. Our controller achieves 90% reduction in the amounts of data that is transferred from the FPGA to the host computer under periodic checkpointing scenarios. |
Slides |
Title | Maximizing Return on Investment of a Grid-Connected Hybrid Electrical Energy Storage System |
Author | Di Zhu, Yanzhi Wang, Siyu Yue, *Qing Xie (University of Southern California, U.S.A.), Naehyuck Chang (Seoul National University, Republic of Korea), Massoud Pedram (University of Southern California, U.S.A.) |
Page | pp. 638 - 643 |
Keyword | return on investment, capital cost, hybrid electrical energy storage system |
Abstract | This paper is the first to present a comprehensive analysis of the profitability of hybrid electrical energy storage (HEES) systems while further providing a HEES design and control optimization framework to maximize the total return on investment (ROI). The solution consists of two steps: (i) derivation of an optimal HEES management policy to maximize the daily energy cost saving and (ii) optimal design of the HEES system to maximize the amortized annual profit under budget and system volume constraints. We consider a HEES system comprised of lead-acid and Li-ion batteries for a case study. The optimal HEES system achieves an annual ROI of up to 60% higher than a lead-acid battery-only (or Li-ion battery-only) system. |
Slides |
Title | (Invited Paper) Silicon Photonics Technology Platform for Embedded and Integrated Optical Interconnect Systems |
Author | *Peter De Dobbelaere (Luxtera, U.S.A.) |
Page | pp. 644 - 647 |
Keyword | silicon photonics, optical transceiver |
Abstract | By taking advantage of the vast investments made by the semiconductor industry, silicon photonics allows high-volume, high yield and low-cost manufacturing of complex photonic integrated circuits, including high-speed optical transceivers. In order to take full advantage of the CMOS semiconductor technology, the development and progress of Si photonics should further standardize and adhere to established CMOS technology methodologies and roadmaps. |
Title | (Invited Paper) High-Frequency Circuit Design for 25Gb/s×4 Optical Transceiver |
Author | *Norio Chujo, Takashi Takemoto, Fumio Yuki, Hiroki Yamashita (Hitachi, Japan) |
Page | pp. 648 - 651 |
Keyword | optical module, transceiver |
Abstract | A 25-Gb/s optical transceiver module has been developed for backplanes. It is necessary to downsize current modules while reducing power consumption and increasing speed up to 25 Gb/s. We employed many approaches to achieve this by reducing crosstalk noise, by enhancing power integrity, and by using CMOS-based analog FE and on-chip termination and optical waveform optimization. The fully integrated transceiver IC was fabricated with the 65-nm CMOS process and the package was small, being 9 x 14 mm in size. |
Title | (Invited Paper) Design and Application of Highly Integrated Optical Switches Based on Silicon Photonics |
Author | *Shigeru Nakamura (NEC, Japan) |
Page | pp. 652 - 654 |
Keyword | Silicon photonics |
Abstract | Silicon photonics is promising for integrating various functional optical devices according to applications. Here we discuss the possibility of applying silicon photonics devices to optical path switching in wide area photonic network nodes. System design including optical switches and device design for implementing a silicon-photonics-based optical switch are discussed. We integrate thermo-optical switch elements based on silicon optical waveguides into a compact one-chip device as an 8 x 8 optical switch. We demonstrate its capabilities such as high extinction ratio operation independently of polarization and ambient temperature, which are considered a critical step toward real application. |
Title | (Invited Paper) High Performance PIN Ge Photodetector and Si Optical Modulator with MOS Junction for Photonics-Electronics Convergence System |
Author | *Junichi Fujikata, Masataka Noguchi, Makoto Miura, Masashi Takahashi, Shigeki Takahashi, Tsuyoshi Horikawa, Yutaka Urino, Takahiro Nakamura, Yasuhiko Arakawa (PETRA, Japan) |
Page | pp. 655 - 656 |
Keyword | Si photonics, Ge photodetector, Si optical modulation |
Abstract | We report on a high speed silicon-waveguide-integrated PIN Ge photodetector of 45 GHz bandwidth, and a high efficiency of 0.3 V·cm silicon optical modulator with a metal-oxide-semiconductor (MOS) junction by applying the low optical loss and high conductivity poly-silicon gate. These OE/EO devices enable low drive voltage of around 1V, which would contribute to a high density optical interposer of the future photonics-electronics convergence system. |
Title | Reevaluating the Latency Claims of 3D Stacked Memories |
Author | *Daniel W. Chang (University of Wisconsin, Madison, U.S.A.), Gyungsu Byun (West Virginia University, U.S.A.), Hoyoung Kim, Minwook Ahn, Soojung Ryu (Samsung Electronics Co., Ltd., Republic of Korea), Nam S. Kim, Michael Schulte (University of Wisconsin, Madison, U.S.A.) |
Page | pp. 657 - 662 |
Keyword | DRAM, 3D main memory, 3D memory latency, digital signal processor, embedded systems |
Abstract | In recent years, 3D technology has been a popular area of study that has allowed researchers to explore a number of novel computer architectures. One of the more popular topics is that of integrating 3D main memory dies below the computing die and connecting them with through-silicon vias (TSVs). This is assumed to reduce off-chip main memory access latencies by roughly 45% to 60%. Our detailed circuit-level models, however, demonstrate that this latency reduction from the TSVs is significantly less. In this paper, we present these models, compare 2D and 3D main memory latencies, and show that the reduction in latency from using 3D main memory is no more than 2.4 ns. We also show that although the wider I/O bus width enabled by using TSVs increases performance, it may do so with an increase in power consumption. Although TSVs consume less power per bit transfer than off-chip metal interconnects (11.2 times less power per bit transfer), TSVs typically use considerably more bits and may result in a net increase in power due to the large number of bits in the memory I/O bus. Our analysis shows that although a 3D memory hierarchy exploiting a wider memory bus can increase performance, this performance increase may not justify the net increase in power consumption. |
Slides |
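Note | The power argument in the abstract above can be made concrete with a back-of-the-envelope calculation. This Python sketch is not the paper's circuit-level model: it assumes that I/O power scales linearly with the number of bus bits and that both buses run at the same transfer rate per bit. Only the 11.2x per-bit ratio comes from the abstract, and the bus widths are hypothetical.

```python
PER_BIT_RATIO = 11.2  # TSV power per bit transfer is 11.2x lower (from the abstract)

def relative_io_power(bus_width_3d, bus_width_2d, per_bit_ratio=PER_BIT_RATIO):
    """3D I/O power relative to 2D I/O power under the linear-scaling assumption."""
    return (bus_width_3d / bus_width_2d) / per_bit_ratio

# Hypothetical bus widths, chosen only to show both sides of the break-even point.
for width_3d in (512, 1024):
    ratio = relative_io_power(width_3d, 64)
    verdict = "net increase" if ratio > 1 else "net saving"
    print(width_3d, round(ratio, 3), verdict)
```

Under these assumptions the break-even point is a 3D bus 11.2 times wider than the 2D bus. Any wider bus costs more I/O power in total despite the cheaper per-bit transfer.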
Title | Heterogeneous Memory Management for 3D-DRAM and External DRAM with QoS |
Author | *Le-Nguyen Tran (University of California, Irvine, U.S.A.), Houman Homayoun (George Mason University, U.S.A.), Fadi J. Kurdahi, Ahmed M. Eltawil (University of California, Irvine, U.S.A.) |
Page | pp. 663 - 668 |
Keyword | 3D-DRAM, Memory management, QoS, Heterogeneous memory system, computer architecture |
Abstract | This paper presents an innovative memory management approach to utilize both 3D-DRAM and external DRAM (ex-DRAM). Our approach dynamically allocates and relocates memory blocks between the 3D-DRAM and the ex-DRAM to exploit the high memory bandwidth and the low memory latency of the 3D-DRAM as well as the high capacity and the low cost of the ex-DRAM. Our simulation shows that in workloads that are not memory intensive, our memory management technique transfers all active memory blocks to the 3D-DRAM which runs faster than the ex-DRAM. In memory intensive workloads, our memory management technique utilizes both the 3D-DRAM and the ex-DRAM to increase the memory bandwidth to alleviate the bandwidth congestion. Our approach also supports Quality of Service (QoS) for “latency sensitive”, “bandwidth sensitive”, and “insensitive” applications. To improve the performance and satisfy a certain level of QoS, memory blocks of different application types are allocated differently. Compared to the scratchpad memory management mechanism, the average memory access latency of our approach decreases by 19% and 23%; the performance improves by up to 5% and 12% in single threaded benchmarks and multi-threaded benchmarks respectively. Moreover, using our approach, applications do not need to manage the memory explicitly like the scratchpad case. Our memory block relocation comes with negligible performance overhead, particularly for applications which have high spatial memory locality. |
Title | Line Sharing Cache: Exploring Cache Capacity with Frequent Line Value Locality |
Author | *Keitarou Oka (Graduate School of Information Science and Electrical Engineering, Kyushu University, Japan), Hiroshi Sasaki, Koji Inoue (Faculty of Information Science and Electrical Engineering, Kyushu University, Japan) |
Page | pp. 669 - 674 |
Keyword | Cache Memory, Frequent Value Locality, Compression |
Abstract | This paper proposes a new LLC architecture called line sharing cache (LSC), which reduces the number of misses without increasing the size of the cache memory. LSC stores lines that have identical values in a single line entry and thus allows a greater number of lines to be stored. Evaluation results show performance improvements of up to 35% across a set of SPEC CPU2000 benchmarks. |
Slides |
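Note | The line-sharing idea in the abstract above can be sketched as a store in which many addresses point to one shared data entry. This Python sketch is an assumption-laden illustration, not the LSC hardware design: the class name, the reference-counting scheme, and the absence of a replacement policy are all hypothetical.

```python
class LineSharingStore:
    """Toy model: addresses with identical line values share one data entry."""

    def __init__(self, data_entries):
        self.data_entries = data_entries  # capacity counted in unique line values
        self.addr_to_value = {}           # tag side: line address -> line value
        self.refcount = {}                # data side: line value -> number of sharers

    def _release(self, addr):
        """Drop an address and free its data entry if no sharer remains."""
        value = self.addr_to_value.pop(addr)
        self.refcount[value] -= 1
        if self.refcount[value] == 0:
            del self.refcount[value]

    def insert(self, addr, value):
        """Store a line; return False (address left unmapped) if the data side is full."""
        if addr in self.addr_to_value:
            self._release(addr)
        if value not in self.refcount:
            if len(self.refcount) >= self.data_entries:
                return False  # a real cache would evict here; omitted in this sketch
            self.refcount[value] = 0
        self.refcount[value] += 1
        self.addr_to_value[addr] = value
        return True

    def lookup(self, addr):
        """Return the line value on a hit, or None on a miss."""
        return self.addr_to_value.get(addr)

store = LineSharingStore(data_entries=2)
zero_line = bytes(64)
for addr in (0x000, 0x040, 0x080):
    store.insert(addr, zero_line)        # three all-zero lines share one entry
store.insert(0x0C0, b"\x01" * 64)
print(len(store.addr_to_value), len(store.refcount))
```

Here four lines are resident while only two data entries are occupied, which is the capacity effect the abstract describes for workloads with frequent identical line values.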
Title | ShieldUS: A Novel Design of Dynamic Shielding for Eliminating 3D TSV Crosstalk Coupling Noise |
Author | *Yuan-Ying Chang, Yoshi Shih-Chieh Huang (National Tsing Hua University, Taiwan), Vijaykrishnan Narayanan (Pennsylvania State University, U.S.A.), Chung-Ta King (National Tsing Hua University, Taiwan) |
Page | pp. 675 - 680 |
Keyword | TSV, crosstalk |
Abstract | 3D IC is a promising technology to meet the demands of high throughput, high scalability, and low power consumption for future generation integrated circuits. One way to implement the 3D IC is to interconnect layers of two-dimensional (2D) IC with Through-Silicon Via (TSV), which shortens the signal lengths. Unfortunately, while TSVs are bundled together as a cluster, the crosstalk coupling noise may lead to transmission errors. As a result, the working frequency of TSVs has to be lowered to avoid the errors, leading to narrower bandwidth that TSVs can provide. In this paper, we first derive the crosstalk noise model from the perspective of 3D chip and then propose ShieldUS, a runtime data-to-TSVs remapping strategy. With ShieldUS, the transition patterns of data over TSVs are observed at runtime, and relatively stable bits will be mapped to the TSVs which act as shields to protect the other bits which have more fluctuations. We evaluate the performance of ShieldUS with address lines from real benchmark traces and data lines of different self-similarities. The results show that ShieldUS is accurate and flexible. We further study dynamic shielding and our design of Interval Equilibration Unit (IEU) can intelligently select suitable parameters for dynamic shielding, which makes dynamic shielding practical and does not need to predefine parameters. This also improves the practicability of ShieldUS. |
Slides |
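Note | The remapping step described in the abstract above (observe transitions at runtime, then place relatively stable bits on shielding TSVs) can be sketched in Python as follows. This is one plausible reading of that sentence, not the ShieldUS algorithm: the crosstalk model and the Interval Equilibration Unit are not reproduced, and the shield positions, function names, and sample trace are hypothetical.

```python
def transition_counts(words, width):
    """Count how often each bit toggles between consecutive words."""
    counts = [0] * width
    for prev, cur in zip(words, words[1:]):
        diff = prev ^ cur
        for bit in range(width):
            counts[bit] += (diff >> bit) & 1
    return counts

def remap(words, width, shield_positions):
    """Map the most stable data bits to the TSV positions chosen as shields."""
    counts = transition_counts(words, width)
    # Most stable bits first; ties are broken by bit index for determinism.
    by_stability = sorted(range(width), key=lambda b: (counts[b], b))
    shields = list(shield_positions)
    others = [p for p in range(width) if p not in shields]
    return dict(zip(by_stability, shields + others))

# Hypothetical 4-bit trace observed over one interval.
trace = [0b0000, 0b0101, 0b0000, 0b0101, 0b0001]
print(transition_counts(trace, 4))
print(remap(trace, 4, shield_positions=[1, 2]))
```

In this trace bits 1 and 3 never toggle, so they are assigned to the two shield positions, while the frequently toggling bits 0 and 2 take the remaining TSVs.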
Title | High-Density Integration of Functional Modules Using Monolithic 3D-IC Technology |
Author | *Shreepad Panth (Georgia Institute of Technology, U.S.A.), Kambiz Samadi, Yang Du (Qualcomm Research, U.S.A.), Sung Kyu Lim (Georgia Institute of Technology, U.S.A.) |
Page | pp. 681 - 686 |
Keyword | monolithic, power reduction, 3D-IC, floorplanning, block-level design |
Abstract | Three dimensional integrated circuits (3D-ICs) have emerged as a promising solution to continue device scaling. They can be realized using Through Silicon Vias (TSVs), or monolithic integration using Monolithic Inter-tier vias (MIVs), an emerging alternative that provides much higher via densities. In this paper, we provide a framework for floorplanning existing 2D IP blocks into 3D-ICs using MIVs. We take the floorplanning solution all the way through place-and-route and report post-layout metrics for area, wirelength, timing, and power consumption. Results show that the wirelength of TSV-based 3D designs outperforms 2D designs by up to 14% in large-scale circuits only. MIV-based 3D designs, however, offer an average wirelength improvement of 33% for a wide range of benchmark circuits. We also show that while TSV-based 3D cannot improve the performance and power unless the TSV capacitance is reduced, MIV-based 3D offers significant reduction of up to 33% in the longest path delay and 35% in the inter-block net power. |
Slides |
Title | Block-level Designs of Die-to-Wafer Bonded 3D ICs and Their Design Quality Tradeoffs |
Author | *Krit Athikulwongse (Georgia Institute of Technology, U.S.A.), Dae Hyun Kim (Cadence, U.S.A.), Moongon Jung, Sung Kyu Lim (Georgia Institute of Technology, U.S.A.) |
Page | pp. 687 - 692 |
Keyword | 3D IC, Block-level Design, Die-to-Wafer Bonding |
Abstract | In 3D ICs, block-level designs provide various advantages over designs done at other granularities, such as gate level, because they promote the reuse of IP blocks. In this paper, we study block-level 3D-IC designs where the footprints of the dies in the stack differ. This happens in the case of die-to-wafer bonding, which is a more popular choice for near-term low-cost 3D designs. We study design quality tradeoffs among three different ways to place through-silicon vias (TSVs): TSV-farm, TSV-distributed, and TSV-whitespace. In our holistic approach, we use wirelength, power, performance, temperature, and mechanical stress metrics to conduct comprehensive comparative studies on the three design styles. In addition, we provide analysis on the impact of TSV size and pitch on the design quality of these three styles. |
Slides |
Title | Thermal-reliable 3D Clock-tree Synthesis Considering Nonlinear Electrical-thermal-coupled TSV Model |
Author | Yang Shang, Chun Zhang, *Hao Yu, Chuan Seng Tan (Nanyang Technological University, Singapore), Xin Zhao, Sung Kyu Lim (Georgia Institute of Technology, U.S.A.) |
Page | pp. 693 - 698 |
Keyword | 3D physical design, Clock tree synthesis, Nonlinear electrical-thermal TSV model |
Abstract | 3D physical design needs an accurate model of through-silicon vias (TSVs). In this paper, a physics-based electrical-thermal model is introduced for both signal and dummy thermal TSVs, considering nonlinear electrical-thermal dependence. A nonlinear programming-based clock-skew reduction problem is formulated to allocate thermal TSVs for clock-skew reduction under non-uniform temperature distribution. Experiments show that under the nonlinear electrical-thermal TSV model, insertion of thermal TSVs can effectively reduce temperature-gradient-induced clock skew by 58.4% on average, which is 11.6% higher than the result under the linear electrical-thermal model. |
Slides |
Title | Stacking Signal TSV for Thermal Dissipation in Global Routing for 3D IC |
Author | *Po-Yang Hsu, Hsien-Te Chen, TingTing Hwang (National Tsing Hua University, Taiwan) |
Page | pp. 699 - 704 |
Keyword | 3D IC, stacked TSV |
Abstract | With no further shrink of device size, three dimensional (3D) chip stacking by Through-Silicon Via (TSV) has been identified as an effective way to achieve better performance in speed and power. However, such a solution inevitably encounters challenges in thermal dissipation, since stacked dies generate a significant amount of heat per unit volume. We leverage an integrated architecture of stacked-signal-TSVs to minimize temperature with small wiring overhead. Based on the structure of stacked signal TSVs, a two-stage TSV locating algorithm in global routing is designed. Using this TSV locating algorithm, we demonstrate that our stacking signal TSV structure is able to reduce temperature by 17% with 4% wiring overhead and 3% performance loss, as calculated by a 3D Elmore delay model. Compared to a previous work by Cong and Zhang [1] where additional thermal TSVs are inserted, our experimental results use on average 23% fewer TSVs than Cong and Zhang's [1] under the same temperature constraint. |
Slides |
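The abstract above reports performance loss using a 3D Elmore delay model. As background, the classic Elmore delay of a simple RC chain can be sketched as follows. This is the textbook lumped-RC formula, not the paper's 3D model: the function name and the idea of treating a TSV as one extra (R, C) segment are assumptions for illustration.

```python
def elmore_delay(segments):
    """Elmore delay from the driver to the far end of an RC chain.

    segments: list of (resistance, capacitance) pairs ordered from driver
    to sink, with each capacitance lumped at the far node of its segment.
    A TSV could be modeled as one more (R, C) pair in the list (assumption).
    """
    delay = 0.0
    downstream_cap = sum(c for _, c in segments)
    for r, c in segments:
        # Each resistor charges all the capacitance located after it.
        delay += r * downstream_cap
        downstream_cap -= c
    return delay


# Two segments: 1*(2+4) + 3*4 = 18 (in units of ohm * farad = seconds).
example_delay = elmore_delay([(1.0, 2.0), (3.0, 4.0)])
```

Each resistance is multiplied by the total capacitance downstream of it, which is why adding a vertical segment near the driver costs more delay than adding one near the sink.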
Title | VFCC: A Verification Framework of Cache Coherence using Parallel Simulation |
Author | *Qiaoli Xiong, Jiangfang Yi, Tianbao Song, Zichao Xie, Dong Tong (Peking University, China) |
Page | pp. 705 - 710 |
Keyword | cache coherence, verification, simulation |
Abstract | A cache coherence protocol is a vital component of a multiprocessor that maintains data consistency. In this paper, we propose VFCC, a simulation framework to validate a cache-coherence protocol implementation of a commercial 64-bit superscalar multiprocessor. It exploits multiple-level parallelism to accelerate validation without overheads among threads. Our experimental results demonstrate that VFCC achieves a 5.0x speedup over a traditional simulator on a conventional 16-core host machine. |
Title | A Computational Model for SAT-based Verification of Hardware-Dependent Low-Level Embedded System Software |
Author | *Bernard Schmidt, Carlos Villarraga (University of Kaiserslautern, Germany), Jörg Bormann (-, Germany), Dominik Stoffel, Markus Wedler, Wolfgang Kunz (University of Kaiserslautern, Germany) |
Page | pp. 711 - 716 |
Keyword | HW/SW-Verification, low-level software, property checking |
Abstract | This paper describes a method to generate a computational model for formal verification of hardware-dependent software in embedded systems. The computational model of the combined HW/SW system is a program netlist (PN) consisting of instruction cells connected in a directed acyclic graph that compactly represents all execution paths of the software. The model can be easily integrated into SAT-based verification environments such as those based on Bounded Model Checking (BMC). The proposed construction of the model, however, allows the SAT solver to reason efficiently over entire execution paths. We demonstrate the efficiency of our approach by presenting experimental results from the formal verification of an industrial LIN (Local Interconnect Network) bus node, implemented as a software driver on a 32-bit RISC machine. |
Slides |
Title | Reviving Erroneous Stability-based Clock-Gating using Partial Max-SAT |
Author | Bao Le, *Dipanjan Sengupta, Andreas Veneris (University of Toronto, Canada) |
Page | pp. 717 - 722 |
Keyword | Debugging, Design Errors, Low Power design, Clock Gating, Stability Condition |
Abstract | Although recent developments have automated most low power implementations, designers often manually modify the circuit in order to achieve further power savings. This human intervention is error-prone, and the resulting errors typically cause logic functional failures. Debugging these errors can be a resource-intensive process. This paper proposes a novel debugging methodology to rectify erroneous clock gating implementations. The net effect of the proposed methodology is a shorter debug time while ensuring additional power savings. |
Slides |
Title | Simplification of C-RTL Equivalent Checking for Fused Multiply Add Unit using Intermediate Models |
Author | *Bin Xue, Prosenjit Chatterjee (Nvidia Corp, U.S.A.), Sandeep K. Shukla (Virginia Tech, U.S.A.) |
Page | pp. 723 - 728 |
Keyword | sequential equivalent checking, FMA, floating point |
Abstract | The functionality of a fused multiply add (FMA) design can be formally verified by comparing its register transfer level (RTL) implementation against its system level specification, often modeled in C/C++, using sequential equivalent checking (SEC). However, C-RTL SEC does not scale for FMA because of the huge discrepancy that exists between the two models. This paper analyzes the dissimilarities and proposes two intermediate models, one abstract RTL and one rewritten C model, to bridge the gap. The original SEC proof is partitioned into three sub-proofs among the intermediate models, where a variety of simplification techniques are applied to further reduce the complexity. Experiments from an industry project show that with the two intermediate models, the SEC proof is complete and scalable for FMA design. |
Slides |
Title | (Panel Discussion) Harmonized Hardware-Software Co-Design and Co-Verification |
Author | Panelists: Atsushi Ike (Fujitsu Laboratories, Japan), Hiroyuki Ikegami (Renesas Electronics Corporation, Japan), Tsuyoshi Isshiki (Tokyo Institute of Technology, Japan), Rainer Leupers (RWTH Aachen, Germany), Yosinori Watanabe (Cadence Berkeley Lab, U.S.A.), Tim Kogel (Synopsys, U.S.A.) |
Title | Reconstruction of Memory Accesses Based on Memory Allocation Mechanism for Source-Level Simulation of Embedded Software |
Author | *Kun Lu, Daniel Müller-Gritschneder, Ulf Schlichtmann (Technische Universität München, Germany) |
Page | pp. 729 - 734 |
Keyword | source level simulation, TLM, performance estimation |
Abstract | To date, there is still no way to accurately simulate data memory accesses in source-level simulation (SLS) of host-compiled embedded SW. The difficulty is that the accessed addresses for the load and store instructions cannot be statically determined. Without knowing those addresses, the source code cannot be annotated appropriately for data cache simulation. In this paper, we show an approach that is capable of resolving the accessed memory addresses based on the memory allocation mechanism. Applying this approach, the source code can be annotated to perform precise data cache simulation. The novelty of our methodology is that it is the first of its kind to take the memory allocation mechanism into account and thus can handle all of the stack, data, heap and text sections. Moreover, a method is also proposed to handle pointer dereferences. In experiments, SLS with our approach yields almost identical cache miss rates and patterns when compared to the reference simulation. |
Slides |
Title | Shared Cache Aware Task Mapping for WCRT Minimization |
Author | *Huping Ding (National University of Singapore, Singapore), Yun Liang (Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China), Tulika Mitra (National University of Singapore, Singapore) |
Page | pp. 735 - 740 |
Keyword | Worst-Case Execution Time (WCET), Worst-Case Response Time (WCRT), Task mapping, Shared cache modeling, Multi-core |
Abstract | The Worst-Case Response Time (WCRT) of multi-tasking applications running on multi-cores is an important metric for real-time embedded systems. The WCRT is determined by the mapping of the tasks to the cores (which determines load balancing) and the Worst-Case Execution Time (WCET) of the tasks. However, the WCET of a task is also influenced by the conflicts in the shared cache from concurrently executing tasks on other cores in a multi-core system. In other words, the mapping of the tasks to the cores indirectly influences the WCET of the tasks, which in turn impacts the WCRT of the entire application. Thus the mapping of the tasks to the cores should simultaneously maximize workload balance and minimize shared cache interference. We propose an integer-linear programming (ILP) formulation to achieve this objective. Experimental evaluation shows that shared cache aware task mapping achieves on average 25% and 33% WCRT reduction for real-life and synthetic applications, respectively, compared to a traditional approach that is agnostic to shared cache conflicts and focuses solely on load balancing. |
Slides |
Title | Scratchpad Memory Aware Task Scheduling with Minimum Number of Preemptions on a Single Processor |
Author | *Qing Wan, Hui Wu, Jingling Xue (University of New South Wales, Australia) |
Page | pp. 741 - 748 |
Keyword | Task Scheduling, Worst-Case Execution Time, Scratchpad Memory |
Abstract | We propose a unified approach to the problem of scheduling a set of tasks with individual release times, deadlines and precedence constraints, and allocating the data of each task to the SPM (Scratchpad Memory) on a single processor system. Our approach consists of a task scheduling algorithm and an SPM allocation algorithm. The former constructs a feasible schedule incrementally, aiming to minimize the number of preemptions in the feasible schedule. The latter allocates a portion of the SPM to each task in an efficient way by employing a novel data structure, namely, the preemption graph. We have evaluated our approach and a previous approach using six task sets. The results show that our approach achieves up to 20.31% WCRT (Worst-Case Response Time) reduction over the previous approach. |
Title | Scheduling Multiple Charge Migration Tasks in Hybrid Electrical Energy Storage Systems |
Author | Qing Xie, Di Zhu, Yanzhi Wang (University of Southern California, U.S.A.), *Younghyun Kim, Naehyuck Chang (Seoul National University, Republic of Korea), Massoud Pedram (University of Southern California, U.S.A.) |
Page | pp. 749 - 754 |
Keyword | charge management, energy storage, charge migration, scheduling |
Abstract | Hybrid electrical energy storage (HEES) systems are comprised of multiple banks of heterogeneous electrical energy storage (EES) elements with distinct properties. This paper defines and solves the problem of scheduling multiple charge migration tasks in HEES systems with the objective of minimizing the total energy drawn from the source banks. The solution approach consists of two steps: (i) Finding the best charging current profile and voltage level setting for the Charge Transfer Interconnect (CTI) bus for each charge migration task, and (ii) Merging and scheduling the charge migration tasks. Experimental results demonstrate improvements of up to 32.2% in the charge migration efficiency compared to baseline setups in an example HEES system. |
Slides |
Title | Stable Backward Reachability Correction for PLL Verification with Consideration of Environmental Noise Induced Jitter |
Author | Yang Song, Haipeng Fu, *Hao Yu (Nanyang Technological University, Singapore), Guoyong Shi (Shanghai Jiao Tong University, China) |
Page | pp. 755 - 760 |
Keyword | Analog/RF system verification, Reachability analysis, PLL jitter |
Abstract | It remains unknown how to perform efficient PLL system-level verification with consideration of jitter induced by substrate or power-supply noise. Considering a nonlinear phase noise macromodel, this paper introduces a forward reachability analysis with stable backward correction for PLL system-level verification with jitter. By refining the initial state of the PLL through backward correction, one can perform an efficient PLL verification that automatically adjusts the locking range with consideration of environmental noise induced jitter. Moreover, to overcome the unstable nature of backward correction, a stability calibration is introduced in this paper to limit error. To validate our method, the proposed approach is applied to verify a number of PLL designs, including single-LC or coupled-LC oscillators, described by a system-level behavioral model with jitter. Experimental results show that our forward reachability analysis with backward correction can succeed in reaching the adjusted locking range by correcting initial states in the presence of environmental noise induced jitter. |
Slides |
Title | Performance Bound and Yield Analysis for Analog Circuits under Process Variations |
Author | Xue-Xin Liu (University of California, Riverside, U.S.A.), Adolfo Adair Palma-Rodriguez, Santiago Rodriguez-Chavez (Institute of Astrophysics, Optics, and Electronics, Mexico), *Sheldon X.-D. Tan (University of California, Riverside, U.S.A.), Esteban Tlelo-Cuautle (Institute of Astrophysics, Optics, and Electronics, Mexico), Yici Cai (Tsinghua University, China) |
Page | pp. 761 - 766 |
Keyword | bound analysis, variation, yield, optimization, symbolic analysis |
Abstract | Yield estimation for analog integrated circuits is crucial for analog circuit design and optimization in the presence of process variations. In this paper, we present a novel analog yield estimation method based on a performance bound analysis technique in the frequency domain. The new method first derives the transfer functions of linear (or linearized) analog circuits via a graph-based symbolic analysis method. Then frequency response bounds of the transfer functions in terms of magnitude and phase are obtained by a nonlinear constrained optimization technique. To predict the yield rate, the bound information is employed to calculate Gaussian distribution functions. Experimental results show that the new method can achieve similar accuracy while delivering a 20 times speedup over Monte Carlo simulation with HSPICE on some typical analog circuits. |
Slides |
Title | Local Approximation Improvement of Trajectory Piecewise Linear Macromodels through Chebyshev Interpolating Polynomials |
Author | Muhammad Umer Farooq, *Likun Xia (Universiti Teknologi PETRONAS, Malaysia) |
Page | pp. 767 - 772 |
Keyword | Chebyshev polynomial, Taylor polynomial, State space (SS) |
Abstract | We introduce the concept of 2-dimensional (2D) scalability of trajectory piecewise linear (TPWL) macromodels through the exploitation of Chebyshev interpolating polynomials in each piecewise region. The goal of 2D scalability is to improve the local approximation properties of TPWL macromodels. Horizontal scalability is achieved through the reduction of the number of linearization points along the trajectory; vertical scalability is obtained by extending the scope of the macromodel to predict the response of a nonlinear system for inputs far from the training trajectory. In this way, more efficient macromodels are obtained in terms of simulation speedup of complex nonlinear systems. We provide the implementation details and illustrate the 2D scalability concept with an example using a nonlinear transmission line. |
Title | Range and Bitmask Analysis for Hardware Optimization in High-Level Synthesis |
Author | *Marcel Gort, Jason H. Anderson (University of Toronto, Canada) |
Page | pp. 773 - 779 |
Keyword | High level synthesis, Range analysis, FPGA, Compiler, LLVM |
Abstract | We consider how bit-level representations of variables in HLS can be used to optimize hardware. Range and bitmask based analyses are considered separately and in tandem, where range analysis pre-determines min/max ranges for variables in order to minimize the hardware that uses those variables and bitmask analysis characterizes individual bits within a word as either constants (1 or 0), sign bits, or unknowns, which may also permit hardware to be eliminated. Static compiler-based analysis is contrasted with dynamic profiling-based analysis in terms of their potential to impact area and speed of HLS-generated hardware. Results show optimizations in HLS based on static analysis reduce circuit area by 9%, while those based on dynamic analysis provide 34% area reduction. |
Slides |
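The two analyses named in the abstract above (range analysis and bitmask analysis) can be illustrated with a minimal sketch. It is not the authors' LLVM-based implementation: the helper names and the (known-zero mask, known-one mask) representation are assumptions, and sign-bit tracking, which the abstract also mentions, is omitted.

```python
def bits_for_range(lo, hi):
    """Range analysis: minimum width for an unsigned value known to lie in [lo, hi]."""
    if lo < 0 or hi < lo:
        raise ValueError("expected 0 <= lo <= hi")
    return max(1, hi.bit_length())


def constant(value, width):
    """Bitmask for a constant: every bit is known, either 0 or 1."""
    mask = (1 << width) - 1
    return (~value & mask, value & mask)  # (known-zero bits, known-one bits)


def unknown():
    """Bitmask for a value about which nothing is known."""
    return (0, 0)


def known_bits_and(a, b):
    """Propagate known bits through a bitwise AND.

    A result bit is known 0 if it is known 0 in either operand,
    and known 1 only if it is known 1 in both operands.
    """
    (a_zero, a_one), (b_zero, b_one) = a, b
    return (a_zero | b_zero, a_one & b_one)


# A loop counter known to stay in [0, 100] needs 7 bits rather than 32.
counter_width = bits_for_range(0, 100)

# x & 0x0F on an 8-bit unknown x: the upper four bits are known to be zero,
# so the hardware driving them could be removed.
masked = known_bits_and(unknown(), constant(0x0F, 8))
```

A static analysis would derive such facts from the program text, while a profiling-based analysis would observe them at run time; the sketch only shows the bookkeeping common to both.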
Title | A Gradual Scheduling Framework for Problem Size Reduction and Cross Basic Block Parallelism Exploitation in High-level Synthesis |
Author | *Hongbin Zheng, Qingrui Liu, Junyi Li, Dihu Chen, Zixin Wang (Sun Yat-sen University, China) |
Page | pp. 780 - 786 |
Keyword | High-level synthesis, Electronic design automation and methodology, Scheduling |
Abstract | In High-level Synthesis (HLS), scheduling has a critical impact on the quality of the hardware implementation. However, the schedules of different operations actually have unequal impacts on the Quality of Result. Based on this fact, we propose a novel scheduling framework that schedules operations separately according to their significance to the Quality of Result, to avoid wasting computational effort on noncritical operations. Furthermore, the proposed framework supports global code motion, which helps to improve the speed performance of the hardware implementation by distributing the execution time of operations across their parent BB. |
Slides |
Title | Implementing Microprocessors from Simplified Descriptions |
Author | *Nikhil A. Patil, Derek Chiou (University of Texas at Austin, U.S.A.) |
Page | pp. 787 - 793 |
Keyword | high-level synthesis, Bluespec, Microcode, Processors |
Abstract | Despite the proliferation of high-level synthesis tools, hardware description of microprocessors remains complex. We argue that much of the incidental complexity can be relieved by untangling the description into separate functional and microarchitectural components. Such an untangling can be achieved using a high-level microcode compiler that can generate not only microcode, but also the micro-instruction format and the interpretations of each control bit. Simplifying hardware description will help the designer make better design-space trade-offs, and close the design and verification loop faster. This paper takes the reader through an implementation of a simple Y86 processor to qualitatively illustrate the complexity reduction from the untangling. |
Slides |
Title | Application-Specific Fault-Tolerant Architecture Synthesis for Digital Microfluidic Biochips |
Author | *Mirela Alistar, Paul Pop, Jan Madsen (Denmark Technical University, Denmark) |
Page | pp. 794 - 800 |
Keyword | digital microfluidics biochips, CAD tools, architecture synthesis, fault tolerant |
Abstract | Microfluidic-based biochips are replacing conventional biochemical analyzers, and are able to integrate on-chip all the necessary functions for biochemical analysis using microfluidics. Digital microfluidic biochips are based on the manipulation of liquids not as a continuous flow, but as discrete droplets on an array of electrodes. Microfluidic operations, such as transport, mixing, and split, are performed on this array by routing the corresponding droplets on a series of electrodes. Researchers have proposed several approaches for the synthesis of digital microfluidic biochips. All previous work assumes that the biochip architecture is given, and most approaches consider a rectangular shape for the electrode array. However, non-regular application-specific architectures are common in practice. Hence, in this paper, we propose an approach to application-specific architecture synthesis. Our approach can also help the designer to increase the yield by introducing redundant electrodes to tolerate permanent faults. The proposed architecture synthesis algorithm has been evaluated using several benchmarks. |
Slides |