
The 19th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule


Tuesday, January 21, 2014

Room 302    Room 300    Room 301    Room 303
1K  (Room 300)
Opening & Keynote I

8:30 - 10:00
Break
10:00 - 10:40
1S  Special Session: Normally-Off Computing: Towards Zero Stand-by Power Management
10:40 - 12:20
1A  University Design Contest
10:40 - 12:20
1B  Planning and Placement for Design Closure and Manufacturability
10:40 - 12:20
1C  Circuit, Architecture, and System for Emerging Technologies
10:40 - 12:20
Lunch Break
12:20 - 13:50
2S  Special Session: EDA for Energy
13:50 - 15:30
2A  Distributed and Mixed-Criticality Real-Time Systems
13:50 - 15:30
2B  Advanced Patterning for Advanced Layout
13:50 - 15:30
2C  Timing-Driven Design, Modeling, and Optimization
13:50 - 15:30
Break
15:30 - 15:50
3S  Special Session: Neuron Inspired Computing using Nanotechnology
15:50 - 17:30
3A  Synthesis and Exploration Techniques for Computing Platforms
15:50 - 17:30
3B  Advances in Microfluidic Biochips
15:50 - 17:30
3C  Advanced Modeling and Simulation Techniques for Analog/Mixed-Signal Circuits
15:50 - 17:30



Wednesday, January 22, 2014

Room 302    Room 300    Room 301    Room 303
2K  (Room 300)
Keynote II

8:30 - 9:30
Break
9:30 - 10:10
4S  Special Session: Design Automation Methods for Highly-Complex Multimedia Systems
10:10 - 12:15
4A  System-Level Thermal and Power Optimization Techniques
10:10 - 12:15
4B  Emerging Techniques for Future NoC
10:10 - 12:15
4C  Emerging Applications
10:10 - 12:15
Lunch Break
12:15 - 13:50
5S  Special Session: Billion Chips of Trillion Transistors
13:50 - 15:30
5A  Simulation and Modeling
13:50 - 15:30
5B  Reliability Analysis and Enhancement
13:50 - 15:30
5C  Variational Design Techniques for Analog/Mixed-Signal Circuits
13:50 - 15:30
Break
15:30 - 15:50
6S  Special Session: Overcoming Major Silicon Bottlenecks: Variability, Reliability, Validation and Debug
15:50 - 17:30
6A  Synthesis of Quantum Circuits and Adaptive Logic
15:50 - 17:30
6B  Contemporary Routing
15:50 - 17:30
6C  Power Supply Noise Aware Design Optimization
15:50 - 17:30
Break
17:30 - 18:30
BK  (Flower Field Hall, Gardens by the Bay)
Banquet & Banquet Keynote

18:30 - 21:00



Thursday, January 23, 2014

Room 302    Room 300    Room 301    Room 303
3K  (Room 300)
Keynote III

8:30 - 9:30
Break
9:30 - 10:10
7S  Special Session: Brain Like Computing: Modelling, Technology, and Architecture
10:10 - 12:15
7A  Power and Life Time Issues of Memory Subsystem
10:10 - 12:15
7B  Advances in High-Level and Logic Synthesis
10:10 - 12:15
7C  Advanced Test Solutions
10:10 - 12:15
Lunch Break
12:15 - 13:50
8S  Special Session: Design Flow for Integrated Circuits using Magnetic Tunnel Junction Switched by Spin Orbit Torque
13:50 - 15:30
8A  Analysis, Optimization, and Scheduling for Multiprocessor Platforms
13:50 - 15:30
8B  Advances in Formal Verification and Debugging
13:50 - 15:30
8C  Advances in CAD Techniques for Signal Integrity
13:50 - 15:30
Break
15:30 - 15:50
9S  Special Session: The Role of Photons in Harming or Increasing Security
15:50 - 17:30
9A  System-Level Verification
15:50 - 17:30
9B  Modeling and Evaluator for Emerging Technologies
15:50 - 17:30
9C  Design and Simulation Toward Power and Temperature Awareness
15:50 - 17:30


List of Papers

Tuesday, January 21, 2014

Session 1K  Opening & Keynote I
Time: 8:30 - 10:00 Tuesday, January 21, 2014
Location: Room 300
Chairs: Yong Lian (National University of Singapore, Singapore), Yajun Ha (National University of Singapore, Singapore)

1K-1 (Time: 9:00 - 10:00)
Title: (Keynote Address) All Programmable SOC FPGA for Networking and Computing in Big Data Infrastructure
Author: Ivo Bolsens (Senior VP and CTO, Xilinx, U.S.A.)
Abstract: Today's FPGAs have become 'All Programmable SOC Platforms' that integrate in one single device multi-core CPUs, programmable DSP functions, programmable IO and programmable logic, all immersed in a rich and configurable interconnect network. These programmable platform FPGAs allow for the implementation of heterogeneous multi-core architectures that combine traditional CPUs with application-specific processing cores and dedicated data transfer and storage functions. This is enabled by tools that guide designers during the partitioning and mapping of high-level specifications onto a combination of software running on embedded processors and hardware implemented in programmable logic. FPGAs are well placed to continue to benefit from Moore's law. Advances in process scaling will be augmented with new circuit and architectural improvements along with innovations in system-in-package technology to solve IO challenges and integrate heterogeneous technologies. These innovations will allow designers to build higher performance and lower power systems that optimally exploit the programmable FPGA architecture. As FPGA platforms continue to deliver more performance at lower cost and lower power, they are becoming the heart of embedded applications such as complex packet processing for networks with line rates of 400+ Gbps; high performance digital signal processing in novel wireless baseband and radio functions; and high flexibility to enable programmable networking and data storage functions in cloud infrastructure.


Session 1S  Special Session: Normally-Off Computing: Towards Zero Stand-by Power Management
Time: 10:40 - 12:20 Tuesday, January 21, 2014
Location: Room 302
Organizer: Hiroshi Nakamura (University of Tokyo, Japan)

1S-1 (Time: 10:40 - 11:05)
Title: (Invited Paper) Normally-Off Computing Project: Challenges and Opportunities
Author: *Hiroshi Nakamura, Takashi Nakada, Shinobu Miwa (The University of Tokyo, Japan)
Page: pp. 1 - 5
Keyword: normally-off, non-volatile memory, power gating
Abstract: Normally-Off is a way of computing which aggressively powers off components of computer systems when they need not operate. Simple power gating cannot fully exploit the opportunities for power reduction because volatile memories lose data when power is turned off. Recently, new non-volatile memories (NVMs) have appeared. High attention has been paid to normally-off computing using these NVMs. In this paper, the expectations and challenges of normally-off computing are addressed, with a brief introduction of our project, which started in 2011.
Slides

1S-2 (Time: 11:05 - 11:30)
Title: (Invited Paper) Novel Nonvolatile Memory Hierarchies to Realize "Normally-Off Mobile Processors"
Author: *Shinobu Fujita, Kumiko Nomura, Hiroki Noguchi, Susumu Takeda, Keiko Abe (Toshiba Corporation, Japan)
Page: pp. 6 - 11
Keyword: STT-MRAM, Normally-off computer, Normally-off processor, mobile processor, nonvolatile memory
Abstract: This paper presents a novel processor architecture for a high-performance (HP) processor with nonvolatile/volatile hybrid cache memory. Simulations of an HP processor using MTJs show that the total power of the HP processor using perpendicular (p-)STT-MRAM can be reduced by over 90% with little degradation of processor performance. The presented architecture with a nonvolatile memory hierarchy will realize “normally-off computers”.
Slides

1S-3 (Time: 11:30 - 11:55)
Title: (Invited Paper) Normally-Off MCU Architecture for Low-Power Sensor Node
Author: *Masanori Hayashikoshi, Yohei Sato, Hiroshi Ueki, Hiroyuki Kawai, Toru Shimizu (Renesas Electronics Corporation, Japan)
Page: pp. 12 - 16
Keyword: Normally-off, Low-power, Microcontroller, MCU
Abstract: The production volume of sensor nodes has increased greatly with the development of cyber-physical systems. It is therefore important to reduce the power consumption of the huge number of sensor nodes. In this work, a normally-off microcontroller architecture for future low-power sensor nodes is proposed. To realize true low-power effects with normally-off computing technology, co-design of hardware and software is very important. In this work, the power consumption of sensor nodes can be reduced by around 70%.
Slides

1S-4 (Time: 11:55 - 12:20)
Title: (Invited Paper) Normally-Off Technologies for Healthcare Appliance
Author: *Shintaro Izumi, Hiroshi Kawaguchi, Yoshimoto Masahiko (Kobe University, Japan), Yoshikazu Fujimori (Rohm, Japan)
Page: pp. 17 - 20
Keyword: ECG, Heart rate, Healthcare
Abstract: The battery mass and power consumption of wearable systems must be reduced because the key factors affecting wearable system usability are miniaturization and weight reduction. This report describes a wearable biosignal monitoring system using normally-off technologies to minimize the power consumption. In particular, we focus on daily-life monitoring and an electrocardiograph processor. Our system employs FeRAM and Near Field Communication (NFC). A robust heart rate monitor and a Cortex M0 core are used for on-node processing to reduce the logged data.
Slides


Session 1A  University Design Contest
Time: 10:40 - 12:20 Tuesday, January 21, 2014
Location: Room 300
Chair: Chun Huat Heng (National University of Singapore, Singapore)

1A-1 (Time: 10:40 - 10:44)
Title: A Dual-Loop Injection-Locked PLL with All-Digital Background Calibration System for On-Chip Clock Generation
Author: *Wei Deng, Ahmed Musa, Teerachot Siriburanon, Masaya Miyahara, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page: pp. 21 - 22
Keyword: PLL, All-digital, Calibration, Dual-loop, Clock Generation
Abstract: This paper presents a compact, low power, and low jitter dual-loop injection-locked PLL with synthesizable all-digital background calibration system for clock generation. Implemented in a 65nm CMOS process, this work demonstrates a 0.7-ps RMS jitter at 1.2 GHz while having 0.97-mW power consumption resulting in an FOM of -243dB. It also consumes an area of only 0.022 mm², resulting in the best performance-area trade-off system presented to date.
Slides
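As a rough consistency check on the figures quoted in 1A-1, the sketch below evaluates the jitter-power figure of merit commonly used for PLLs, FoM = 10·log10((jitter / 1 s)² · (power / 1 mW)). The abstract does not state which FoM definition the authors used, so both the formula and the function name `pll_jitter_fom_db` are assumptions made here for illustration only.

```python
import math


def pll_jitter_fom_db(rms_jitter_s: float, power_w: float) -> float:
    """Jitter-power FoM in dB: 10*log10((jitter / 1 s)^2 * (power / 1 mW)).

    Assumed definition; the abstract does not say which FoM it reports.
    """
    return 10.0 * math.log10((rms_jitter_s ** 2) * (power_w / 1e-3))


# Figures quoted in the 1A-1 abstract: 0.7 ps RMS jitter, 0.97 mW.
fom = pll_jitter_fom_db(0.7e-12, 0.97e-3)
print(f"FoM = {fom:.1f} dB")
```

Under this assumed definition the result rounds to the -243 dB quoted in the abstract.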

1A-2 (Time: 10:44 - 10:48)
Title: A 950µW 5.5-GHz Low Voltage PLL with Digitally-Calibrated ILFD and Linearized Varactor
Author: *Sho Ikeda, Tatsuya Kamimura, Sangyeop Lee, Hiroyuki Ito, Noboru Ishihara, Kazuya Masu (Tokyo Institute of Technology, Japan)
Page: pp. 23 - 24
Keyword: PLL, low voltage, low power
Abstract: This paper proposes an ultra-low-power 5.5-GHz PLL which employs a divide-by-4 injection-locked frequency divider (ILFD), which is calibrated by digital circuits, and linearity-compensated varactors for low supply-voltage operation. The proposed PLL was fabricated in 65nm CMOS. It shows a 1-MHz-offset phase noise of -106 dBc/Hz and the total power consumption of 950 µW at 5.5 GHz.
Slides

1A-3 (Time: 10:48 - 10:52)
Title: A Swing-Enhanced Current-Reuse Class-C VCO with Dynamic Bias Control Circuits
Author: *Teerachot Siriburanon, Wei Deng, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page: pp. 25 - 26
Keyword: Class-C, Current-Reuse, VCO, Phase noise, FoM
Abstract: A swing-enhanced current-reuse class-C VCO which can theoretically achieve the same phase noise figure-of-merit (FoM) as other class-C VCOs at the lowest power consumption is presented. Swing enhancement in class-C operation and oscillation robustness are achieved through dynamic bias control circuits for both NMOS and PMOS transistors. The proposed VCO has been fabricated in a 180nm CMOS process and oscillates at 4.6 GHz. The measured phase noise is -119 dBc/Hz at 1 MHz offset while consuming 1.6 mA from a 1.5 V supply. An FoM of -189 dBc/Hz is achieved.

1A-4 (Time: 10:52 - 10:56)
Title: Design of A High-Performance Millimeter-Wave Amplifier Using Specific Modeling
Author: *Xiaojun Bi (National University of Singapore/Institute of Microelectronics, Agency for Science, Technology and Research, Singapore), Yongxin Guo (National University of Singapore, Singapore/National University of Singapore (Suzhou) Research Institute, China), M. Annamalai Arasu (Institute of Microelectronics, Agency for Science, Technology and Research, Singapore), M. S. Zhnag (National University of Singapore, Singapore), Yong Zhong Xiong, Minkyu Je (Institute of Microelectronics, Agency for Science, Technology and Research, Singapore)
Page: pp. 27 - 28
Keyword: Amplifier, Modeling, SiGe
Abstract: In this design contest, the design methodology leading to a high performance Millimeter-wave amplifier in 0.13 µm SiGe BiCMOS is elaborated. Equivalent circuit models of the utilized cascode shielding structure are developed to assist the amplifier design. Meanwhile, final layouts of the passive connections are verified by 3D electromagnetic simulation in ANSYS HFSS. The implemented amplifier obtained a gain of more than 45 dB in band, which is the gain record for silicon-based amplifiers in the W-band.

1A-5 (Time: 10:56 - 11:00)
Title: A Multi-Mode Reconfigurable Analog Baseband with I/Q Calibration for GNSS Receivers
Author: *Zheng Song, Nan Qi, Baoyong Chi, Zhihua Wang (Tsinghua University, China)
Page: pp. 29 - 30
Keyword: Analog baseband, I/Q calibration, Reconfigurable, AGC
Abstract: A multi-mode reconfigurable analog baseband for GNSS receivers is presented. It provides I/Q mismatch auto-calibration with the aid of an FPGA. The 3rd/5th-order reconfigurable C-BPF supports various bandwidths from 2.2 to 10MHz, with center frequency from 3.996 to 16MHz. The AGC loop features 5-50dB gain range and 1dB step, and digital AGC control algorithms. The auto DC-offset cancellation is also integrated on-chip. The analog baseband consumes 6.5-13mA current. The measured image-rejection ratio is 45-55dB, improved by 22dB after calibration.
Slides

1A-6 (Time: 11:00 - 11:04)
Title: An 8b Extremely Area Efficient Threshold Configuring SAR ADC with Source Voltage Shifting Technique
Author: *Kentaro Yoshioka, Akira Shikata, Ryota Sekimoto, Tadahiro Kuroda, Hiroki Ishikuro (Keio University, Japan)
Page: pp. 31 - 32
Keyword: Threshold configuring, SAR ADC, Low power, Extremely small area
Abstract: An extremely low power and area efficient threshold configuring ADC (TC-ADC) for time interleaved ADC is proposed. The threshold configuring comparator (TCC) performs a binary search, and the 8b output is obtained by the proposed source voltage shifting and threshold interpolation technique. The prototype ADC in 40nm CMOS occupies a core area of only 0.0038 mm². With a supply voltage of 0.7V, the ADC achieves 7.0 ENOB with 24MS/s. Peak FoM of 9.8fJ/conv. is obtained at 0.5V supply, which is over 15x improvement compared with conventional TC-ADC.
Slides

1A-7 (Time: 11:04 - 11:08)
Title: A Single-Inductor 8-Channel Output DC-DC Boost Converter with Time-Limited Power Distribution Control and Single Shared Hysteresis Comparator
Author: *Jungmoon Kim, Chulwoo Kim (Korea University, Republic of Korea)
Page: pp. 33 - 34
Keyword: single-inductor multi-output (SIMO) DC-DC converter, cross-regulation, All-comparator control, comparator sharing technique, hysteresis comparator
Abstract: This paper describes a time-limited power distribution control (TPDC) technique that can be used for single-inductor multiple-output (SIMO) DC-DC converters with many unbalanced loads. Furthermore, the true all-comparator control technique that raises no stability or complexity issues is proposed. This all-comparator technique for SIMO converters is realized only with a single shared hysteresis comparator at a constant switching frequency of 800 kHz. The maximum efficiency reaches 92%. The fabricated chip with 8-channel outputs occupies 2.4×2.1 mm² in a 0.35-μm CMOS process.

1A-8 (Time: 11:08 - 11:12)
Title: A DC-DC Boost Converter with Variation Tolerant MPPT Technique and Efficient ZCS Circuit for Thermoelectric Energy Harvesting Applications
Author: *Jungmoon Kim, Minseob Shim, Junwon Jung, Heejun Kim, Chulwoo Kim (Korea University, Republic of Korea)
Page: pp. 35 - 36
Keyword: Battery charger, boost converter, energy harvesting, maximum power point tracking, variation
Abstract: This paper presents a boost converter with the maximum power point tracking (MPPT) technique for thermoelectric energy harvesting (EH) applications. The technique realizes variation tolerance by adjusting the switching frequency f_SW of the converter. A finely controlled zero-current switching (ZCS) scheme together with the accurate MPPT technique enhances the overall efficiency (η) of the converter because of an optimal turn-on time generated by a one-shot pulse generator that is proposed. Moreover, the ZCS technique can deal with low and high temperature differences applied to the thermoelectric generator. Experimentally, the converter implemented in a 0.35 μm BCDMOS process had a peak η of 72% at the input voltage V_IN of 500mV while supplying a 5.62V output.

1A-9 (Time: 11:12 - 11:16)
Title: 7.3 Gb/s Universal BCH Encoder and Decoder for SSD Controllers
Author: *Hoyoung Yoo, Youngjoo Lee, In-Cheol Park (Korea Advanced Institute of Science and Technology, Republic of Korea)
Page: pp. 37 - 38
Keyword: BCH enc/dec, universal BCH encoder, multi-mode, syndrome calculation
Abstract: This paper presents a universal BCH encoder and decoder that can support multiple error-correction capabilities. A novel encoding architecture and on-demand syndrome calculation technique is proposed to reduce both hardware complexity and power consumption. Based on the proposed methods, 32-parallel universal encoder and decoder are designed for BCH (8192+14t, 8192, t) codes, where the error-correction capability t is configurable to 8, 11, 16, 24, 32, and 64. The prototype chip achieves a throughput of 7.3 Gb/s and occupies 2.24 mm² in 0.13μm CMOS technology.
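The BCH (8192+14t, 8192, t) notation in 1A-9 fixes the message length at 8192 bits and adds 14 parity bits per correctable error. The sketch below only tabulates the resulting codeword length and code rate for the values of t listed in the abstract. The helper name `bch_params` is hypothetical, and nothing here reflects the authors' encoder or decoder architecture.

```python
MESSAGE_BITS = 8192          # k, from the abstract
PARITY_BITS_PER_ERROR = 14   # from the (8192+14t, 8192, t) notation
T_VALUES = (8, 11, 16, 24, 32, 64)


def bch_params(t: int) -> tuple[int, int, float]:
    """Return (codeword bits n, parity bits, code rate k/n) for capability t."""
    parity = PARITY_BITS_PER_ERROR * t
    n = MESSAGE_BITS + parity
    return n, parity, MESSAGE_BITS / n


for t in T_VALUES:
    n, parity, rate = bch_params(t)
    print(f"t={t:2d}  n={n}  parity={parity:3d}  rate={rate:.4f}")
```

The code rate falls as t grows, which is the usual trade-off between redundancy and error-correction capability.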

1A-10 (Time: 11:16 - 11:20)
Title: A High-Speed and Low-Complexity Lens Distortion Correction Processor for Wide-Angle Cameras
Author: *Won-Tae Kim, Hui-Sung Jeong, Gwang-Ho Lee, Tae-Hwan Kim (School of Electronics, Telecommunication and Computer Engineering, Korea Aerospace University, Republic of Korea)
Page: pp. 39 - 40
Keyword: barrel distortion, wide-angle camera, memory interface, raster scan, hardware implementation
Abstract: This paper presents a high-speed and low-complexity lens distortion correction processor for wide-angle cameras. In the proposed processor, the conventional correction process is modified to be performed incrementally so as to reduce the hardware complexity. In addition, an efficient memory interface is proposed by utilizing the locality of the memory access in the correction process. The proposed processor is implemented with 17.2K logic gates in a 0.11 µm CMOS process and its correction speed is 205 Mpixels/s.
Slides


Session 1B  Planning and Placement for Design Closure and Manufacturability
Time: 10:40 - 12:20 Tuesday, January 21, 2014
Location: Room 301
Chairs: Shigetoshi Nakatake (University of Kitakyushu, Japan), Hung-Ming Chen (National Chiao Tung University, Taiwan)

1B-1 (Time: 10:40 - 11:05)
Title: Analytical Placement of Mixed-Size Circuits for Better Detailed-Routability
Author: Shuai Li, *Cheng-Kok Koh (Purdue University, U.S.A.)
Page: pp. 41 - 46
Keyword: routability-driven placement, pin density, mixed-size circuit
Abstract: We propose an analytical placer for generating placement results with better detailed-routability. By including a group of pin density constraints in its mathematical formulation, the placer manages to alleviate pin congestion when distributing cells. Moreover, for mixed-size circuits, we adopt a scaled smoothing method to minimize the possible negative influence of fixed macro blocks in placement and routing. Routing solutions obtained by a commercial router show the good detailed-routability of the placement results generated by our analytical placer.
Slides

1B-2 (Time: 11:05 - 11:30)
Title: Lithographic Defect Aware Placement Using Compact Standard Cells Without Inter-Cell Margin
Author: *Seongbo Shim, Yoojong Lee, Youngsoo Shin (KAIST, Republic of Korea)
Page: pp. 47 - 52
Keyword: placement, lithography, PVB, defect
Abstract: Conventional standard cells contain extra space, called inter-cell margin, to prevent potential defects caused by lithography process. Margin is indeed necessary between some cell pairs, but there are also lots of cell pairs that do not yield any defects (or have very low probability of defects) when they are placed without margin. We address a new placement problem using standard cells without inter-cell margin. Placement should be done such that defect probability is made as small as possible while standard objectives such as wirelength are also pursued. The key in this approach is efficient computation of defect probabilities of all cell pairs and arranging them as a table that is referred to by a placer. We study how the cell pairs can be grouped by examining similar patterns along cell boundary, which greatly reduces the number of defect probability computations. The proposed placement method was evaluated on a few test circuits using 28-nm technology. Chip area was reduced by 10.8% on average with average and maximum defect probability kept below 0.4% and 4.1%, respectively.
Slides

1B-3 (Time: 11:30 - 11:55)
Title: Structural Planning of 3D-IC Interconnects by Block Alignment
Author: *Johann Knechtel (Institute of Electromechanical and Electronic Design, Dresden University of Technology, Germany), Evangeline F. Y. Young (Department of Computer Science and Engineering, Chinese University of Hong Kong, Hong Kong), Jens Lienig (Institute of Electromechanical and Electronic Design, Dresden University of Technology, Germany)
Page: pp. 53 - 60
Keyword: 3D-IC interconnect structures, block alignment, 3D floorplanning
Abstract: Three-dimensional integrated circuits rely on optimized interconnect structures for blocks which are spread among one or multiple dies. We demonstrate how 2D and 3D block alignment can be efficiently utilized for structural planning of different interconnects. To realize this, we extend the corner block list and provide effective techniques for 3D layout generation, i.e., block placement and alignment. Our techniques are made available in an open-source, simulated-annealing-based tool. Besides block alignment, it accounts for key objectives in 3D design like fast thermal management and fixed-outline floorplanning. Experimental results on GSRC and IBM-HB+ circuits demonstrate the capabilities of our tool for both planning 3D-IC interconnects by block alignment and for 3D floorplanning in general.
Slides

1B-4 (Time: 11:55 - 12:20)
Title: Comprehensive Die-Level Assessment of Design Rules and Layouts
Author: Rani Ghaida (GLOBALFOUNDRIES, U.S.A.), Yasmine Badr (University of California, Los Angeles, U.S.A.), Mukul Gupta (Qualcomm Inc., U.S.A.), Ning Jin (GLOBALFOUNDRIES, U.S.A.), *Puneet Gupta (University of California, Los Angeles, U.S.A.)
Page: pp. 61 - 66
Keyword: Design Technology Co-optimization (DTCO), Design for Manufacturing (DFM), Design Rules, Layout, Technology
Abstract: Co-development of design rules and layout methodologies is the key to successful adoption of a technology. We develop the first framework for systematic evaluation of design rules and their interaction with layouts, performance, margins and yield at the chip-scale (as opposed to cell-level). A "good chips per wafer" metric is used to unify area, performance, variability and functional yield. For instance, a study of the well-to-active spacing rule reveals a non-monotone dependence of chip area on the rule value (although the cell area relationship is monotone).
Slides


Session 1C  Circuit, Architecture, and System for Emerging Technologies
Time: 10:40 - 12:20 Tuesday, January 21, 2014
Location: Room 303
Chairs: Hai (Helen) Li (University of Pittsburgh, U.S.A.), Danghui Wang (Northwestern Polytechnical University, China)

1C-1 (Time: 10:40 - 11:05)
Title: Prefetching Techniques for STT-RAM Based Last-Level Cache in CMP Systems
Author: Mengjie Mao (University of Pittsburgh, U.S.A.), Guangyu Sun (Peking University, China), Yong Li, Alex K. Jones, *Yiran Chen (University of Pittsburgh, U.S.A.)
Page: pp. 67 - 72
Keyword: prefetch, STT-RAM, last-level cache
Abstract: Prefetching is widely used in modern computer systems to mitigate the impact of long memory access latency by paying extra cost in memory and cache accesses. However, the efficacy of prefetching significantly degrades in the memory hierarchy using the emerging spin-transfer torque random access memory (STT-RAM) as last-level cache (LLC) due to the long write access latency. In this work, we propose two orthogonal but complementary techniques to improve the prefetching efficacy of STT-RAM based LLC in chip multi-processor systems, namely, request prioritization (RP) and hybrid local-global prefetch control (HLGPC). Simulation results show that by combining these two techniques, we can achieve 6.5%~11% system performance improvement and 4.8%~7.3% LLC energy reduction in a quad-core system with 2MB~8MB STT-RAM based LLC, compared to baseline with basic prefetching.
Slides

1C-2 (Time: 11:05 - 11:30)
Title: CNPUF: A Carbon Nanotube-based Physically Unclonable Function for Secure Low-Energy Hardware Design
Author: *Sven Tenzing Choden Konigsmark, Leslie K. Hwang, Deming Chen, Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.)
Page: pp. 73 - 78
Keyword: PUF, CNT, Low power, Security, Emerging Technology
Abstract: Physically Unclonable Functions (PUFs) are used to provide identification, authentication and secret key generation based on unique and unpredictable physical characteristics. Carbon Nanotube Field Effect Transistors (CNFETs) were shown to have excellent electrical and unique physical characteristics and are promising candidates to replace silicon transistors in future Very Large Scale Integration (VLSI) designs. We present Carbon Nanotube PUF (CNPUF), the first PUF design that takes advantage of unique CNFET characteristics. We achieve higher reliability against environmental variations and increased resistance against modeling attacks. Furthermore, we achieve considerable power and energy reductions of 89.6% and 98%, respectively, in comparison to previous ultra-low power PUF designs. Additionally, CNPUF allows a power-security tradeoff.
Slides

1C-3 (Time: 11:30 - 11:55)
Title: 3DCoB: A New Design Approach for Monolithic 3D Integrated Circuits
Author: *Hossam Sarhan, Sebastien Thuries, Olivier Billoint, Fabien Clermidy (CEA-LETI, France)
Page: pp. 79 - 84
Keyword: 3D-IC, Monolithic, Sequential Integration, cell-on-cell, cell-on-buffer
Abstract: 3D Monolithic Integration (3DMI) technology provides very high-density vertical interconnects with low parasitics. Previous 3DMI design approaches provide either cell-on-cell or transistor-on-transistor integration. In this paper we present 3D Cell-on-Buffer (3DCoB) as a novel design approach for 3DMI. Our approach provides a sign-off physical implementation flow that is fully compatible with conventional 2D tools. We implement our approach on some benchmark circuits using 28nm-FDSOI technology. The sign-off performance results show 35% improvement compared to the same 2D design.
Slides

1C-4 (Time: 11:55 - 12:20)
Title: Emulator-Oriented Tiny Processors for Unreliable Post-Silicon Devices: A Case Study
Author: *Yuko Hara-Azumi (Nara Institute of Science and Technology/JST, PRESTO, Japan), Masaya Kunimoto, Yasuhiko Nakashima (Nara Institute of Science and Technology, Japan)
Page: pp. 85 - 90
Keyword: Emulator-oriented processor, Reliability, Post-silicon
Abstract: Although various post-silicon devices have been invested in recent years, they still have a major issue of reliability. Because circuit area is an essential factor of reliability, especially for such unreliable post-silicon devices, it is desired to build small circuits which can reuse as many of today's application programs as possible even if the performance is not very high. This paper presents the very first work to study novel, efficient techniques of emulating wider-bit guest processors (e.g., 32-bit) on a narrower-bit host processor (e.g., 8-bit) with very limited hardware resources while mitigating performance degradation. We propose three types of tiny emulation-oriented processors varying in available hardware resources and reliability enhancement approaches. Quantitative evaluation and discussions are done for comparing those three processors. We believe that this work will lead not only to accelerated development of post-silicon technology but also to a big paradigm shift in building digital devices.
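To illustrate the general idea in 1C-4 of running wider-bit guest operations on a narrower-bit host, the sketch below performs a 32-bit addition using only 8-bit additions with carry propagation. This is a generic multi-precision technique, not the emulation scheme proposed in the paper, and the function names `add8` and `add32_on_8bit_host` are hypothetical.

```python
def add8(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Model an 8-bit adder: return (8-bit sum, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add32_on_8bit_host(x: int, y: int) -> int:
    """Add two 32-bit values byte by byte, least significant byte first."""
    result = 0
    carry = 0
    for i in range(4):
        a = (x >> (8 * i)) & 0xFF
        b = (y >> (8 * i)) & 0xFF
        s, carry = add8(a, b, carry)
        result |= s << (8 * i)
    # The final carry is dropped, which gives 32-bit wraparound.
    return result


print(hex(add32_on_8bit_host(0x12345678, 0x0FEDCBA8)))
```

One guest addition costs four host additions here, which hints at the performance degradation the abstract says such emulation has to mitigate.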


Session 2S  Special Session: EDA for Energy
Time: 13:50 - 15:30 Tuesday, January 21, 2014
Location: Room 302
Organizers: Fadi Kurdahi (University of California, Irvine, U.S.A.), Sani Nassif (IBM Austin Research Lab, U.S.A.), Mohammad Al Faruque (University of California, Irvine, U.S.A.)

2S-1 (Time: 13:50 - 14:20)
Title: (Invited Paper) Applying VLSI EDA to Energy Distribution System Design
Author: *Sani Nassif, Gi-Joon Nam, Jerry Hayes (IBM Austin Research Laboratory, U.S.A.), Sani Fakhouri (University of California, Irvine, U.S.A.)
Page: pp. 91 - 96
Keyword: EDA, Energy distribution network, Simulation, Optimization
Abstract: Energy distribution networks refer to that part of the electricity network that delivers power to homes and businesses. It is reported that significant amounts of energy are being wasted simply due to inefficiencies in this network. Further, this domain is rapidly changing with new types of loads such as electric vehicles or the spread of new types of energy sources such as photo-voltaic and wind. In this paper, we demonstrate a comprehensive design automation capability for energy distribution networks leading to a much more flexible yet effective system. The new system's capabilities include power load distribution and transfers, equipment upgrading, geospatial-aware network optimization, outage identification, contingency planning and loss analysis/reduction. These features are enabled by advanced simulation, analysis and optimization engines that are adapted from those available in the traditional VLSI design automation area. The paper will conclude with potential future research directions that require further innovations in energy distribution networks.

2S-2 (Time: 14:20 - 14:50)
Title: (Invited Paper) A Model-Based Design of Cyber-Physical Energy Systems
Author: Mohammad Abdullah Al Faruque, *Fereidoun Ahourai (University of California, Irvine, U.S.A.)
Page: pp. 97 - 104
Keyword: Model-based design, Cyber-physical systems, cyber-physical energy system, gridlab-d, co-simulation
Abstract: Cyber-Physical Energy Systems (CPES) are an amalgamation of both power grid technology, and the intelligent communication and co-ordination between the supply and the demand side through distributed embedded computing. Through this combination, CPES are intended to deliver power efficiently, reliably, and economically. The design and development work needed to either implement a new power grid network or upgrade a traditional power grid to a CPES-compliant one is both challenging and time consuming due to the heterogeneous nature of the associated components/subsystems. The Model Based Design (MBD) methodology has been widely seen as a promising solution to address the associated design challenges of creating a CPES. In this paper, we demonstrate an MBD method and its associated tool for the purpose of designing and validating various control algorithms for a residential microgrid. Our presented co-simulation engine GridMat is a MATLAB/Simulink toolbox; its purpose is to co-simulate the power systems modeled in GridLAB-D as well as the control algorithms that are modeled in Simulink. We have presented various use cases to demonstrate how different levels of control algorithms may be developed, simulated, debugged, and analyzed by using our GridMat toolbox for a residential microgrid.

2S-3 (Time: 14:50 - 15:20)
Title: (Invited Paper) The Data Center as a Grid Load Stabilizer
Author: Hao Chen, Michael C. Caramanis, *Ayse K. Coskun (Boston University, U.S.A.)
Page: pp. 105 - 112
Keyword: demand response, regulation service, data center energy management, power market
Abstract: To accommodate the increasing presence of volatile and intermittent renewable energy sources in power generation, independent system operators (ISO) offer opportunities for demand side regulation service (RS) so as to stabilize the grid load. These power market features allow the demand side to earn monetary credits by modulating its power consumption dynamically following an RS signal broadcast by ISO. This paper studies the capacities and benefits of a major potential demand side, the data center, to provide RS. We propose a dynamic control policy that modulates the data center power consumption in response to ISO requests by leveraging server power capping techniques and various server power states. Results demonstrate that using our policy, data centers can provide fast reserves in quantities that are substantial proportions (around 50%) of their average energy consumption, with no major deterioration in quality of service (QoS). By doing so, data centers decrease their energy costs by around 50%, while providing the ISOs and the society in general with cost effective demand side reserves that render massive renewable generation adoption affordable.
Slides


Session 2A  Distributed and Mixed-Criticality Real-Time Systems
Time: 13:50 - 15:30 Tuesday, January 21, 2014
Location: Room 300
Chair: Muhammad Shafique (Karlsruhe Institute of Technology, Germany)

2A-1 (Time: 13:50 - 14:15)
TitleBounding Buffer Space Requirements for Real-Time Priority-Aware Networks
AuthorHany Kashif, *Hiren D. Patel (University of Waterloo, Canada)
Pagepp. 113 - 118
KeywordReal-time, Network-on-Chip, Buffer space
AbstractOne implementation alternative for network interconnects in modern chip-multiprocessor systems is priority-aware arbitration networks. To enable the deployment of real-time applications to priority-aware networks, recent research proposes worst-case latency (WCL) analyses for such networks. Buffer space requirements in priority-aware networks, however, are seldom addressed. In this work, we bound the buffer space required for valid WCL analyses and consequently optimize router design for application specifications by computing the required buffer space at each virtual channel in priority-aware routers. In addition to the obvious advantage of bounding buffer space while providing valid WCL bounds, buffer space reduction decreases chip area and saves energy in priority-aware networks. Our experiments show that the proposed buffer space computation reduces the number of unfeasible implementations by 42% compared to an existing buffer space analysis technique. It also reduces the required buffer space in priority-aware routers by up to 79%.

2A-2 (Time: 14:15 - 14:40)
TitleTask- and Network-Level Schedule Co-Synthesis of Ethernet-Based Time-Triggered Systems
Author*Licong Zhang, Dip Goswami, Reinhard Schneider, Samarjit Chakraborty (TU Munich, Germany)
Pagepp. 119 - 124
KeywordSchedule Optimization, Ethernet, Time-triggered Traffic
AbstractIn this paper, we study time-triggered distributed systems where periodic application tasks are mapped onto different end stations (processing units) communicating over a switched Ethernet network. We address the problem of application level (i.e., both task- and network-level) schedule synthesis and optimization. In this context, most of the recent works [10], [11] either focus on communication schedule or consider a simplified task model. In this work, we formulate the co-synthesis problem of task and communication schedules as a Mixed Integer Programming (MIP) model taking into account a number of Ethernet-specific timing parameters such as interframe gap, precision and synchronization error. Our formulation is able to handle one or multiple timing objectives such as application response time, end-to-end delay and their combinations. We show the applicability of our formulation considering an industrial size case study using a number of different sets of objectives. Further, we show that our formulation scales to systems with reasonably large size.
Slides

2A-3 (Time: 14:40 - 15:05)
TitleService Adaptions for Mixed-Criticality Systems
Author*Pengcheng Huang, Georgia Giannopoulou, Nikolay Stoimenov, Lothar Thiele (ETH Zurich, Switzerland)
Pagepp. 125 - 130
KeywordMixed Criticality, EDF, Online Reconfiguration, Real Time
AbstractComplex embedded systems are typically mixed-critical, where heterogeneous guarantees must be provided for functionalities of different criticalities. We study in this paper the reconfiguration of services provided to low criticality tasks in reaction to the overruns of high criticality tasks. We further investigate the quantification of the resetting time of the system services. For both service reconfiguration and resetting, we derive tight analysis under Earliest Deadline First (EDF) Scheduling.
Slides

2A-4 (Time: 15:05 - 15:30)
TitleEfficient Feasibility Analysis of DAG Scheduling with Real-Time Constraints in the Presence of Faults
Author*Xiaotong Cui, Jun Zhang, Kaijie Wu, Edwin Sha (College of Computer Science, Chongqing University, China)
Pagepp. 131 - 136
KeywordFault tolerance, feasibility test, frame-based real-time system, worst-case analysis, critical task
AbstractTasks in hard real-time systems are required to meet deadlines in the presence of faults. We conclude that a sufficient condition for a task set to experience its worst-case finish time (WCFT) is that its critical task (CT) incurs all faults. An algorithm is presented to identify the CT and the WCFT in O(N^2) time, with N being the number of tasks. A common practice that estimates the WCFT using the task with the longest re-execution time could underestimate it by up to 35%.
Slides


Session 2B  Advanced Patterning for Advanced Layout
Time: 13:50 - 15:30 Tuesday, January 21, 2014
Location: Room 301
Chairs: Martin Wong (University of Illinois, Urbana-Champaign, U.S.A.), Shigeki Nojima (Toshiba Corporation, Japan)

2B-1 (Time: 13:50 - 14:15)
TitleFlexible Packed Stencil Design with Multiple Shaping Apertures for E-Beam Lithography
AuthorChris Chu (Iowa State University, U.S.A.), *Wai-Kei Mak (National Tsing Hua University, Taiwan)
Pagepp. 137 - 142
KeywordElectron-beam direct write lithography, Character projection, Stencil design
AbstractElectron-beam direct write (EBDW) lithography is a promising solution for chip production in the sub-22nm regime. To improve the throughput of EBDW lithography, the character projection method is commonly employed, and a critical problem is to pack as many characters as possible onto the stencil. In this paper, we consider two enhancements in packed stencil design over previous works. First, the use of multiple shaping apertures with different sizes is explored. Second, the fact that the pattern of a character can be located anywhere within its enclosing projection region is exploited to facilitate flexible blank space sharing. For this packed stencil design problem with multiple shaping apertures and flexible blank space sharing, a dynamic programming based algorithm is proposed. Experimental results show that the proposed enhancement and the associated algorithm can significantly reduce the total shot count and hence improve the throughput of EBDW lithography.
Slides

2B-2 (Time: 14:15 - 14:40)
TitleSelf-Aligned Double Patterning Layout Decomposition with Complementary E-Beam Lithography
AuthorJhih-Rong Gao, Bei Yu, *David Z. Pan (University of Texas at Austin, U.S.A.)
Pagepp. 143 - 148
KeywordSADP, E-Beam, layout decomposition, double patterning
AbstractAdvanced lithography techniques enable higher pattern resolution; however, techniques such as extreme ultraviolet lithography and e-beam lithography (EBL) are not yet ready for high volume production. Recently, complementary lithography has become promising, which allows two different lithography processes to work together to achieve high quality layout patterns without greatly increasing manufacturing cost. In this paper, we present a new layout decomposition framework for self-aligned double patterning and complementary EBL, which considers overlay minimization and EBL throughput optimization simultaneously. We perform conflict elimination by a merge-and-cut technique and formulate it as a matching-based problem. The results show that our approach is fast and effective, where all conflicts are solved with minimal overlay error and e-beam utilization.
Slides

2B-3 (Time: 14:40 - 15:05)
TitleFixing Double Patterning Violations with Look-Ahead
Author*Sambuddha Bhattacharya, Subramanian Rajagopalan, Shabbir H Batterywala (Synopsys India Pvt. Ltd., India)
Pagepp. 149 - 154
KeywordDPT, DRC, lithography, linear program
AbstractDouble Patterning Technology (DPT) conflicts express themselves as odd cycles of spacing between layout shapes. One way of resolving these is by imposing a large spacing constraint between a pair of shapes participant in an odd cycle. However, this may shrink spacing in other parts of the layout and introduce DRC violations or new DPT conflicts. In this work, we model DPT conflict resolution as a constrained linear optimization problem, look ahead to upfront estimate potential violations and preclude them with additional constraints. We borrow the approach of Satisfiability Modulo Theory (SMT) solvers to simultaneously check satisfiability of linear constraint set and resolution of DPT conflicts. These two are interleaved and feed information to each other to churn out a feasible set of constraints that fixes DPT and DRC violations. We demonstrate the efficacy of the method on layouts at advanced nodes.
Slides

2B-4 (Time: 15:05 - 15:30)
TitleEUV-CDA: Pattern Shift Aware Critical Density Analysis for EUV Mask Layouts
Author*Abde Ali Kagalwalla (University of California, Los Angeles, U.S.A.), Michael Lam, Kostas Adam (Mentor Graphics, U.S.A.), Puneet Gupta (University of California, Los Angeles, U.S.A.)
Pagepp. 155 - 160
KeywordEUV, Mask, Yield, DFM, Defect
AbstractDespite the use of mask defect avoidance and mitigation techniques, finding a usable defective mask blank remains a challenge for EUVL at sub-10nm node due to dense layouts and low CD tolerance. In this work, we propose a pattern shift-aware metric called critical density, which can quickly evaluate the robustness of EUV layouts to mask defects (300-1300X faster than naïve Monte Carlo), thereby enabling design-level mask defect mitigation techniques. Our experimental results indicate that regularity hurts layout robustness to mask defects.
Slides


Session 2C  Timing-Driven Design, Modeling, and Optimization
Time: 13:50 - 15:30 Tuesday, January 21, 2014
Location: Room 303
Chairs: Mango Chia-Tso Chao (National Chiao Tung University, Taiwan), Tai-Chen Chen (National Central University, Taiwan)

2C-1 (Time: 13:50 - 14:15)
TitleStatistical Analysis of Random Telegraph Noise in Digital Circuits
Author*Xiaoming Chen, Yu Wang (Tsinghua University, China), Yu Cao (Arizona State University, U.S.A.), Huazhong Yang (Tsinghua University, China)
Pagepp. 161 - 166
Keywordrandom telegraph noise, statistical analysis, reliability
AbstractRandom telegraph noise (RTN) has become an important reliability issue at the sub-65nm technology node. Existing RTN simulation approaches mainly focus on single trap induced RTN and transient response of RTN, and they are usually time-consuming for circuit-level simulation. This paper proposes a statistical algorithm to study multiple traps induced RTN in digital circuits, to show the temporal distribution of circuit delay under RTN. Based on the simulation results we show how to protect circuit from RTN. Bias dependence of RTN is also discussed.
Slides

2C-2 (Time: 14:15 - 14:40)
TitleSemi-Analytical Current Source Modeling of FinFET Devices Operating in Near/Sub-Threshold Regime with Independent Gate Control and Considering Process Variation
AuthorTiansong Cui, Yanzhi Wang, Xue Lin, Shahin Nazarian, *Massoud Pedram (University of Southern California, U.S.A.)
Pagepp. 167 - 172
KeywordCurrent Source Modeling (CSM), FinFET circuits, near/sub-threshold, independent gate control, process variation
AbstractFinFET has been proposed as an alternative for bulk CMOS. The characteristics of FinFETs operating in the near/sub-threshold region made it hard to verify the timing of a circuit using the conventional SSTA. In this paper, we extend the CSM to FinFET devices operating in multiple voltage regimes subject to independent gate control and process variations.
Slides

2C-3 (Time: 14:40 - 15:05)
Title2-SAT Based Linear Time Optimum Two-Domain Clock Skew Scheduling
Author*Yukihide Kohira (The University of Aizu, Japan), Atsushi Takahashi (Tokyo Institute of Technology, Japan)
Pagepp. 173 - 178
KeywordMulti-domain clock skew scheduling, Two-domain clock skew schedule, 2-SAT
AbstractMulti-domain clock skew scheduling is an effective technique to improve the performance of sequential circuits using a practical clock distribution network. Although the upper bound on the performance of a circuit increases as the number of clock domains increases in multi-domain clock skew scheduling, the improvement in performance becomes smaller while the cost of the clock distribution network increases considerably. In this paper, a linear time algorithm that finds an optimum two-domain clock skew schedule is proposed. Experimental results show that optimum circuits are efficiently obtained by our method in a short time.
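
Note: the abstract does not say how two-domain scheduling is reduced to 2-SAT, so the sketch below only illustrates why a 2-SAT instance can be decided in time linear in the number of variables and clauses. It is a generic solver based on strongly connected components (Kosaraju's algorithm), not the authors' scheduling algorithm; the function name `solve_2sat` and the signed-integer literal encoding are assumptions made for illustration.

```python
def solve_2sat(num_vars, clauses):
    """Generic 2-SAT solver (illustrative only).

    clauses: list of (a, b) pairs meaning (a OR b); literal k > 0 is x_k,
    literal k < 0 is NOT x_k (variables are 1-based).
    Returns a list of booleans, or None if the formula is unsatisfiable.
    """
    n = 2 * num_vars

    def idx(lit):
        # Node 2*v is x_v, node 2*v + 1 is NOT x_v (v is 0-based here).
        v = abs(lit) - 1
        return 2 * v + (1 if lit < 0 else 0)

    graph = [[] for _ in range(n)]
    rgraph = [[] for _ in range(n)]
    for a, b in clauses:
        # (a OR b) is equivalent to (NOT a -> b) AND (NOT b -> a).
        for u, v in ((idx(-a), idx(b)), (idx(-b), idx(a))):
            graph[u].append(v)
            rgraph[v].append(u)

    # Pass 1: iterative DFS on the implication graph, recording finish order.
    visited = [False] * n
    order = []
    for s in range(n):
        if visited[s]:
            continue
        visited[s] = True
        stack = [(s, 0)]
        while stack:
            u, i = stack.pop()
            if i < len(graph[u]):
                stack.append((u, i + 1))
                v = graph[u][i]
                if not visited[v]:
                    visited[v] = True
                    stack.append((v, 0))
            else:
                order.append(u)

    # Pass 2: DFS on the reversed graph in decreasing finish time.
    # Components are numbered in topological order of the implication graph.
    comp = [-1] * n
    c = 0
    for s in reversed(order):
        if comp[s] != -1:
            continue
        comp[s] = c
        stack = [s]
        while stack:
            u = stack.pop()
            for v in rgraph[u]:
                if comp[v] == -1:
                    comp[v] = c
                    stack.append(v)
        c += 1

    assignment = []
    for v in range(num_vars):
        if comp[2 * v] == comp[2 * v + 1]:
            return None  # x_v and NOT x_v imply each other: unsatisfiable
        # Pick the literal whose component comes later in topological order.
        assignment.append(comp[2 * v] > comp[2 * v + 1])
    return assignment


if __name__ == "__main__":
    print(solve_2sat(3, [(1, 2), (-1, 2), (-2, 3)]))
```

Each graph traversal visits every node and edge once, which is where the linear running time comes from.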

2C-4 (Time: 15:05 - 15:30)
TitlePower Minimization of Pipeline Architecture through 1-Cycle Error Correction and Voltage Scaling
Author*Insup Shin (KAIST, Republic of Korea), Jae-Joon Kim (POSTECH, Republic of Korea), Youngsoo Shin (KAIST, Republic of Korea)
Pagepp. 179 - 184
Keywordtiming speculation, low power design, error correction, voltage scaling
AbstractWe present a new 1-cycle timing error correction method, which allows aggressive voltage scaling in a pipelined architecture. The proposed method differs from the state-of-the-art in that the pipeline stage where the timing error occurs can continue to receive input data without halting to avoid data collision. This feature allows the pipeline to avoid recurring clock gating when timing errors happen at multiple stages or timing errors continue to occur at a certain stage. Compared to a state-of-the-art method, the proposed method shows 2-6% energy reduction for a 5-stage pipeline and 7-11% reduction for a 10-stage pipeline. In addition, the proposed logic to propagate the clock gating signal is much simpler than that of the previous method, because it eliminates the reverse propagation path of the clock gating signal.
Slides


Session 3S  Special Session: Neuron Inspired Computing using Nanotechnology
Time: 15:50 - 17:30 Tuesday, January 21, 2014
Location: Room 302
Organizers: Kevin Cao (Arizona State University, U.S.A.), Sarma Vrudhula (Arizona State University, U.S.A.)

3S-1 (Time: 15:50 - 16:20)
Title(Invited Paper) A Silicon Nanodisk Array Structure Realizing Synaptic Response of Spiking Neuron Models with Noise
Author*Takashi Morie, Haichao Liang, Yilai Sun, Takashi Tohara (Kyushu Institute of Technology, Japan), Makoto Igarashi, Seiji Samukawa (Tohoku University, Japan)
Pagepp. 185 - 190
Keywordnanostructure, nanodevice, spiking neuron, fluctuation, noise
AbstractIn the implementation of spiking neuron models, which can achieve realistic neuron operation, generation of post-synaptic potentials (PSPs) is an essential function. We have already proposed a new nanodisk array structure for generating PSPs using delay in electron hopping among nanodisks. Generated PSPs have fluctuation caused by stochastic electron movement. Noise or fluctuation is effectively used in neural processing. In this paper, we review our proposed structure and show fluctuation controllability based on single-electron circuit simulation.

3S-2 (Time: 16:20 - 16:50)
Title(Invited Paper) Energy Efficient In-Memory Machine Learning for Data Intensive Image-Processing by Non-Volatile Domain-Wall Memory
Author*Hao Yu, Yuhao Wang, Shuai Chen, Wei Fei (Nanyang Technological University, Singapore), Chuliang Weng, Junfeng Zhao, Zhulin Wei (Huawei Shannon Laboratory, China)
Pagepp. 191 - 196
Keywordneural network, logic-in-memory, non-volatile memory, domain wall, image processing
AbstractImage processing in conventional logic-memory I/O-integrated systems will incur significant communication congestion at memory I/Os for excessively big image data at exa-scale. This paper explores in-memory machine learning on a neural network architecture by utilizing the newly introduced domain-wall nanowire, called DW-NN. We show that all operations involved in machine learning on a neural network can be mapped to a logic-in-memory architecture by non-volatile domain-wall nanowire. Domain-wall nanowire based logic is customized for in-memory machine learning within image data storage. As such, both neural network training and processing can be performed locally within the memory. The experimental results show that system throughput in DW-NN is improved by 11.6x and the energy efficiency is improved by 92x when compared to a conventional image processing system.
Slides

3S-3 (Time: 16:50 - 17:20)
Title(Invited Paper) Lessons from the Neurons Themselves
Author*Louis Scheffer (Howard Hughes Medical Institute, U.S.A.)
Pagepp. 197 - 200
KeywordNeuromorphic, Artificial neuron, neurons
AbstractNatural neural circuits, optimized by millions of years of evolution, are fast, low power, and robust, all characteristics we would love to have in systems we ourselves design. Recently there have been enormous advances in understanding how neurons implement computations within the brain of living creatures. Can we use this new-found knowledge to create better artificial systems? What lessons can we learn from the neurons themselves that can help us create better neuromorphic circuits?
Slides


Session 3A  Synthesis and Exploration Techniques for Computing Platforms
Time: 15:50 - 17:30 Tuesday, January 21, 2014
Location: Room 300
Chairs: Sri Parameswaran (University of New South Wales, Australia), Kyle Rupnow (Nanyang Technological University, Singapore)

3A-1 (Time: 15:50 - 16:15)
TitleLeveraging the Error Resilience of Machine-Learning Applications for Designing Highly Energy Efficient Accelerators
Author*Zidong Du (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China), Avinash Lingamneni (Electrical and Computer Engineering, Rice University, U.S.A.), Yunji Chen (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China), Krishna Palem (Rice University, U.S.A.), Olivier Temam (INRIA, France), Chengyong Wu (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 201 - 206
KeywordAccelerator, Inexact computing, Hardware Neuron Network
AbstractIn recent years, inexact computing has been increasingly regarded as one of the most promising approaches for slashing energy consumption in many applications that can tolerate a certain degree of inaccuracy. Driven by the principle of trading tolerable amounts of application accuracy in return for significant resource savings -- the energy consumed, the (critical path) delay and the (silicon) area -- this approach has been limited to ASICs so far. These ASIC realizations have a narrow application scope and are often rigid in their tolerance to inaccuracy, as currently designed; the latter often determines the extent of resource savings we would achieve. In this paper, we propose to improve the application scope, error resilience as well as the energy savings of inexact computing by combining it with hardware neural networks. These neural networks are fast emerging as popular candidate accelerators for future heterogeneous multi-core platforms and have flexible error tolerance limits owing to their ability to be trained. Our results in 65nm technology demonstrate that the proposed inexact neural network accelerator could achieve 1.78x-2.67x savings in energy consumption (with corresponding delay and area savings being 1.23x and 1.46x respectively) when compared to the existing baseline neural network implementation, at the cost of a small accuracy loss (MSE increases from 0.14 to 0.20 on average).

3A-2 (Time: 16:15 - 16:40)
TitleArISE: Aging-Aware Instruction Set Encoding for Lifetime Improvement
Author*Fabian Oboril, Mehdi Tahoori (KIT, Germany)
Pagepp. 207 - 212
KeywordOpcode, Transistor Aging, Reliability, Decoder, Microprocessor Pipeline
AbstractMicroprocessors fabricated at nanoscale nodes are exposed to accelerated transistor aging due to Bias Temperature Instability and Hot Carrier Injection. As a result, device delays increase over time, reducing Mean Time To Failure (MTTF). To address this challenge, many (micro-)architectural techniques target the execution stage of the instruction pipeline. However, the decoding stages can also become aging-critical and limit the microprocessor lifetime. Therefore, we propose a novel aging-aware instruction set encoding methodology, which increases the MTTF of these stages in the case of the FabScalar microprocessor by 2x with negligible implementation costs.
Slides

3A-3 (Time: 16:40 - 17:05)
TitleDRuiD: Designing Reconfigurable Architectures with Decision-Making Support
Author*Giovanni Mariani (Universita della Svizzera Italiana - ALaRI, Switzerland/Politecnico di Milano, Italy), Gianluca Palermo (Politecnico di Milano - DEIB, Italy), Roel Meeuws, Vlad-Mihai Sima (Delft Technical University, Netherlands), Cristina Silvano (Politecnico di Milano - DEIB, Italy), Koen Bertels (Delft Technical University, Netherlands)
Pagepp. 213 - 218
KeywordReconfigurable architectures, Heterogeneous architectures, Machine learning, Random forest, FPGA
AbstractThe development process for heterogeneous computing platforms requires a clear understanding of both application requirements and heterogeneous computing technologies. To support the development process, we propose a framework called DRuiD capable of learning characteristics that make application functionalities suitable for certain computing elements. An expert system supports the designer in the mapping decision and gives hints on possible code modifications to be applied to make the applications more suitable for a given computing element.
Slides

3A-4 (Time: 17:05 - 17:30)
TitleEdit Distance Based Instruction Merging Technique to Improve Flexibility of Custom Instructions Toward Flexible Accelerator Design
AuthorHui Huang (University of California, Los Angeles, U.S.A.), *Taemin Kim, Yatin Hoskote (Intel Labs, U.S.A.)
Pagepp. 219 - 224
KeywordInstruction set extension, Flexibility, Application specific instruction set processor, System-On-a-Chip
AbstractDue to ever shortening time-to-market of a system-on-a-chip (SoC) and increasing NRE cost of designing accelerators in the SoC, a design methodology for a flexible accelerator is desirable. We propose a novel technique to make custom instructions (CIs) of an application specific instruction-set processor (ASIP) flexible. By doing so, CIs can support applications that were not considered at design time of the ASIP, which is difficult to do with a conventional CI design method. We have shown that custom instructions generated by our technique can support future applications by up to 7X better than those from a conventional method.
Slides


Session 3B  Advances in Microfluidic Biochips
Time: 15:50 - 17:30 Tuesday, January 21, 2014
Location: Room 301
Chairs: Tsung-Yi Ho (National Cheng Kung University, Taiwan), Juinn-Dar Huang (National Chiao Tung University, Taiwan)

3B-1 (Time: 15:50 - 16:15)
TitleA Network-Flow-Based Optimal Sample Preparation Algorithm for Digital Microfluidic Biochips
Author*Trung Anh Dinh, Shigeru Yamashita (Ritsumeikan University, Japan), Tsung-Yi Ho (National Cheng Kung University, Taiwan)
Pagepp. 225 - 230
Keyworddigital microfluidic biochips, sample preparation, minimum-cost maximum-flow
AbstractSample preparation, which is a front-end process to produce droplets of the desired target concentrations from input reagents, plays a pivotal role in every assay, laboratory, and application in biomedical engineering and life science. The consumption of sample/buffer/waste is usually used to evaluate the effectiveness of a sample preparation process. In this paper, we present the first optimal sample preparation algorithm based on a minimum-cost maximum-flow model. By using the proposed model, we can obtain both the optimal cost of sample and buffer usage and the waste amount even for multiple-target concentrations. Experiments demonstrate that we can consistently achieve much better results not only in the consumption of sample and buffer but also in the waste amount when compared with all the state-of-the-art previous approaches.
Slides

3B-2 (Time: 16:15 - 16:40)
TitleExploring Speed and Energy Tradeoffs in Droplet Transport for Digital Microfluidic Biochips
AuthorJohnathan Fiske, *Daniel Grissom, Philip Brisk (University of California, Riverside, U.S.A.)
Pagepp. 231 - 237
KeywordMicrofluidics, Cyber-Physical System, Droplet Transport
AbstractThis paper transforms the problem of droplet routing for digital microfluidic biochips (DMFBs) from the discrete into the continuous domain, based on the observation that droplet transport velocity is a function of the actuation voltage applied to electrodes that control the devices. A new formulation of the DMFB droplet routing problem is introduced for the continuous domain, which attempts to minimize total energy consumption while meeting a timing constraint. Henceforth, DMFBs should be viewed as continuous, highly integrated cyber-physical systems that interact with and manipulate physical quantities, as opposed to inherently discrete and fully synchronized devices.
Slides

3B-3 (Time: 16:40 - 17:05)
TitleGeneral Purpose Cross-Referencing Microfluidic Biochip with Reduced Pin-Count
Author*Jackson Ho Chuen Yeung, Evangeline F.Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 238 - 243
KeywordDMFB, microfluidics, biochip, routing
AbstractThe number of control pins used is a major factor affecting the manufacturing cost of a Digital Microfluidic Biochip (DMFB). Pin-count on a DMFB can be reduced by sharing control pins between electrodes. Most existing works on reducing pin-count are problem specific. Problem specific optimizations result in DMFBs that can only perform certain specific bioassays. A Cross-Referencing DMFB has a full array layout that is fully reconfigurable for any bioassay. A conventional Cross-Referencing DMFB uses m + n pins. We have devised a non-problem specific pin assignment methodology that uses only √2(√m + √n) pins. The resulting DMFBs are still fully reconfigurable. We have developed a droplet router specifically for cross-referencing DMFBs with shared control pins. All real bioassays tested can be routed using a fixed and problem independent control pin mapping. Reduction in pin count ranges from 50% to 67%.
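
Note: the two pin-count expressions quoted in the abstract can be compared numerically with the short sketch below. It evaluates the raw formulas only; reading m and n as the two dimensions of the electrode array is an assumption, the abstract does not say how a non-integer result is rounded to an actual pin count, and the function names are hypothetical.

```python
import math


def conventional_pins(m, n):
    """Pins for a conventional cross-referencing DMFB, per the abstract: m + n."""
    return m + n


def shared_pins(m, n):
    """Pins under the shared-pin scheme, per the abstract: sqrt(2) * (sqrt(m) + sqrt(n)).

    The raw (possibly fractional) value is returned; rounding is not
    specified in the abstract, so none is applied here.
    """
    return math.sqrt(2) * (math.sqrt(m) + math.sqrt(n))


def reduction(m, n):
    """Fractional reduction in pin count relative to the conventional scheme."""
    return 1.0 - shared_pins(m, n) / conventional_pins(m, n)


if __name__ == "__main__":
    for size in (8, 12, 18):
        print(size, conventional_pins(size, size),
              round(shared_pins(size, size), 2),
              round(100 * reduction(size, size), 1))
```

For m = n = 8 the expression evaluates to √2 · 2 · 2√2 = 8 pins against 16 conventionally, a 50% reduction, and the saving grows with array size.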

3B-4 (Time: 17:05 - 17:30)
TitleWash Optimization for Cross-Contamination Removal in Flow-Based Microfluidic Biochips
AuthorKai Hu (Duke University, U.S.A.), *Tsung-Yi Ho (National Cheng Kung University, Taiwan), Krishnendu Chakrabarty (Duke University, U.S.A.)
Pagepp. 244 - 249
Keywordwash optimization, flow-based biochip, cross-contamination
AbstractRecent advances in flow-based microfluidics have enabled the emergence of biochemistry-on-a-chip as a new paradigm in drug discovery and point-of-care disease diagnosis. However, these applications in biochemistry require high precision to avoid erroneous assay outcomes, and therefore are vulnerable to contamination between two fluidic flows with different biochemistries. Moreover, to wash contaminated sites, the buffer solution in flow-based biochips has to be guided along pre-etched channel networks. This constraint makes washing in flow-based microfluidics even harder. In this paper, we propose the first approach for automated wash optimization for contamination removal in flow-based microfluidic biochips. The proposed approach targets the generation of washing pathways to clean all contaminated microchannels with minimum execution time. A path dictionary is first established by pre-searching physically implementable paths in a given chip layout. When wash targets and occupied microchannels are defined, the proposed methods determine an optimized path set with the least washing time by calculating the priorities of wash targets. Two fabricated biochips are used to evaluate the proposed washing method. Compared to an ad hoc baseline method, the proposed approach leads to more efficient washing in all cases.
Slides


Session 3C  Advanced Modeling and Simulation Techniques for Analog/Mixed-Signal Circuits
Time: 15:50 - 17:30 Tuesday, January 21, 2014
Location: Room 303
Chairs: Hao Yu (Nanyang Technological University, Singapore), Shi Guoyong (Shanghai Jiao Tong University, China)

3C-1 (Time: 15:50 - 16:15)
TitleABCD-NL: Approximating Continuous Non-Linear Dynamical Systems Using Purely Boolean Models for Analog/Mixed-Signal Verification
Author*Aadithya V. Karthik, Sayak Ray, Pierluigi Nuzzo, Alan Mishchenko, Robert Brayton, Jaijeet Roychowdhury (The University of California, Berkeley, U.S.A.)
Pagepp. 250 - 255
KeywordAMS verification, Booleanization, SPICE, Modelling, FSMs
AbstractWe present ABCD-NL, a technique that approximates non-linear analog circuits using purely Boolean models, to high accuracy. Given an analog/mixed-signal (AMS) system (e.g., a SPICE netlist), ABCD-NL produces a Boolean circuit representation (e.g., an And Inverter Graph, Finite State Machine, or Binary Decision Diagram) that captures the I/O behaviour of the given system, to near SPICE-level accuracy, without making any a priori simplifications. The Boolean models produced by ABCD-NL can be used for high-speed simulation and formal verification of AMS designs, by leveraging existing tools developed for Boolean/hybrid systems analysis (e.g., ABC). We apply ABCD-NL to a number of SPICE-level AMS circuits, including data converters, charge pumps, comparators, non-linear signaling/communications sub-systems, etc. Also, we formally verify the throughput of an AMS signaling system -- modelled in SPICE using 22nm BSIM4 transistors, Booleanized with high accuracy using ABCD-NL, and property-checked using ABC.

3C-2 (Time: 16:15 - 16:40)
TitleToward Efficient Programming of Reconfigurable Radio Frequency (RF) Receivers
Author*Jun Tao, Ying-Chih Wang, Minhee Jun, Xin Li, Rohit Negi, Tamal Mukherjee, Lawrence Pileggi (Carnegie Mellon University, U.S.A.)
Pagepp. 256 - 261
KeywordReconfigurable radio frequency (RF) system programming, Two-phase relaxation search, Pareto-based search space reduction
AbstractReconfigurable radio frequency (RF) system is an emerging component to mitigate the growing engineering cost for wireless chip design. In this paper, we propose a new methodology for efficient programming of reconfigurable RF receiver. The proposed method is facilitated by two novel techniques: two-phase relaxation search and Pareto-based search space reduction. Our numerical experiments demonstrate that the proposed methodology is more robust (i.e., close to global optimum) and/or efficient (i.e., with low computational cost) than other traditional algorithms based on either local relaxation or simulated annealing.
Slides

3C-3 (Time: 16:40 - 17:05)
TitleEfficient Matrix Exponential Method Based on Extended Krylov Subspace for Transient Simulation of Large-Scale Linear Circuits
AuthorQuan Chen, *Wenhui Zhao, Ngai Wong (The University of Hong Kong, Hong Kong)
Pagepp. 262 - 266
Keywordtransient simulation, matrix exponential, extended Krylov subspace
AbstractMatrix exponential (MEXP) method has been demonstrated to be a competitive candidate for transient simulation of very large-scale integrated circuits. Nevertheless, the performance of MEXP based on ordinary Krylov subspace is unsatisfactory for stiff circuits, wherein the underlying Arnoldi process tends to oversample the high magnitude part of the system spectrum while undersampling the low magnitude part which is important to the final accuracy. In this work we explore the use of extended Krylov subspace to generate more accurate and efficient approximation for MEXP. We also develop a formulation that allows unequal positive and negative dimensions in the generated Krylov subspace for better performance. Numerical results demonstrate the efficacy of the proposed method.
Slides



Wednesday, January 22, 2014

Session 2K  Keynote II
Time: 8:30 - 9:30 Wednesday, January 22, 2014
Location: Room 300
Chair: Nagisa Ishiura (Kwansei Gakuin University, Japan)

2K-1 (Time: 8:30 - 9:30)
Title(Keynote Address) Designing Analog Functions without Analog Transistors
AuthorGeorges Gielen (Katholieke Universiteit Leuven, Belgium)
AbstractAnalog functions are indispensable for most electronic applications, ranging from telecom to biomedical or automotive applications. Yet, designing the analog circuits has become a large burden, especially in advanced CMOS technologies where reduced voltage headrooms and increased variability and reliability problems challenge the design of power-efficient analog circuits. Together with the lack of adequate EDA tools this also jeopardizes efficient analog circuit design. This keynote describes a possible way forward. The industry clearly has reached a bifurcation point. Many applications will leave the scaling race, and adopt older or nonstandard (e.g. flexible organic) technologies for the analog circuits, offering the increased functionality essentially through heterogeneous integration. Many other applications will stick to advanced CMOS, but will shift the analog design paradigm from analog-heavy to digital-heavy minimalistic-analog circuits. The presentation will discuss and illustrate the challenges and solutions in such an approach to designing analog functions without analog transistors.


Session 4S  Special Session: Design Automation Methods for Highly-Complex Multimedia Systems
Time: 10:10 - 12:15 Wednesday, January 22, 2014
Location: Room 302
Organizer: Sri Parameswaran (University of New South Wales, Australia)

4S-1 (Time: 10:10 - 10:40)
Title(Invited Paper) SDG2KPN: System Dependency Graph to Function-Level KPN Generation of Legacy Code for MPSoCs
AuthorJude Angelo Ambrose, Jorgen Peddersen (University of New South Wales, Australia), Alvin Labios, Yusuke Yachide (Canon Information Systems Research Australia (CiSRA), Australia), *Sri Parameswaran (University of New South Wales, Australia)
Pagepp. 267 - 273
KeywordMPSoC, KPN
AbstractThe Multiprocessor System-on-Chip (MPSoC) paradigm as a viable implementation platform for parallel processing has expanded to encompass embedded devices. The ability to execute code in parallel gives MPSoCs the potential to achieve high performance with low power consumption. In order for sequential legacy code to take advantage of the MPSoC design paradigm, it must first be partitioned into data flow graphs (such as Kahn Process Networks, or KPNs) to ensure the data elements can be correctly passed between the separate processing elements that operate on them. Existing techniques are inadequate for use in complex legacy code. This paper proposes SDG2KPN, a System Dependency Graph to KPN conversion methodology targeting the conversion of legacy code. By creating KPNs at the granularity of the function-/procedure-level, SDG2KPN is the first of its kind to support shared and global variables as well as many more program patterns/application types. We also provide a design flow which allows the creation of MPSoC systems utilizing the produced KPNs. We demonstrate the applicability of our approach by retargeting several sequential applications to the Tensilica MPSoC framework. Our system parallelized AES, an application of 950 lines, in 4.8 seconds, while H.264, of 57896 lines, took 164.9 seconds to parallelize.
Slides
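
Illustrative note on 4S-1: the abstract above targets Kahn Process Networks, in which sequential processes communicate only through FIFO channels with blocking reads. The following is a minimal Python sketch of that model, not of SDG2KPN itself. The process names (`producer`, `square`, `accumulate`), the `END` sentinel, and the use of threads and `queue.Queue` as channels are all assumptions made for illustration.

```python
import queue
import threading

# Sentinel marking end-of-stream on a channel (an assumption of this sketch).
END = object()


def producer(out_ch, values):
    """Source process: writes each value to its output channel, then END."""
    for v in values:
        out_ch.put(v)
    out_ch.put(END)


def square(in_ch, out_ch):
    """Transform process: blocking read, compute, write."""
    while True:
        v = in_ch.get()  # blocks until a token is available
        if v is END:
            out_ch.put(END)
            return
        out_ch.put(v * v)


def accumulate(in_ch, result):
    """Sink process: sums every token it reads and appends the total."""
    total = 0
    while True:
        v = in_ch.get()
        if v is END:
            result.append(total)
            return
        total += v


def run_network(values):
    """Wire producer -> square -> accumulate with unbounded FIFO channels."""
    a, b = queue.Queue(), queue.Queue()
    result = []
    procs = [
        threading.Thread(target=producer, args=(a, values)),
        threading.Thread(target=square, args=(a, b)),
        threading.Thread(target=accumulate, args=(b, result)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return result[0]
```

Because each process only performs blocking reads on its own input channel, the value returned by `run_network` does not depend on how the threads are scheduled, which is the property that makes this model attractive for mapping code onto separate processing elements.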

4S-2 (Time: 10:40 - 11:10)
Title(Invited Paper) Low Power Design of the Next-Generation High Efficiency Video Coding
Author*Muhammad Shafique, Jörg Henkel (Karlsruhe Institute of Technology, Germany)
Pagepp. 274 - 281
KeywordLow Power, HEVC, Temperature, Accelerator, Video Memory
AbstractThis paper provides a comprehensive analysis of the computational complexity, temperature, and memory access behavior for the next-generation High Efficiency Video Coding (HEVC) standard. We highlight the associated design challenges and present several low-power algorithmic and architectural techniques for developing power-efficient HEVC-based multimedia systems. We explore the interplay between the algorithms and architectures to provide high power efficiency while leveraging the application-specific knowledge and video content characteristics.
Slides

4S-3 (Time: 11:10 - 11:40)
Title(Invited Paper) Mapping Complex Algorithm into FPGA with High Level Synthesis
Author*Kazutoshi Wakabayashi, Takashi Takenaka, Hiroaki Inoue (NEC Corp., Japan)
Pagepp. 282 - 284
KeywordHigh Level Synthesis, FPGA, Control dependency, data dependency, compiler
AbstractThis presentation compares “Reconfigurable Chip with High Level Synthesis” and “CPU, GPGPU with compiler such as CUDA” from the compiler perspective. Initially, we introduce several demands for acceleration with FPGA to achieve low-latency calculation and control. As an application example, we show high-frequency trading. We accelerate it with an FPGA NIC using C-based and SQL-based HLS, and show the necessity of a reconfigurable chip that is customizable in a high-level language. Then, we illustrate the difference between FPGA and processor (CPU, GPGPU) with the “FSM+Datapath” model and examine how the architectural difference affects the delay and parallelism of operations. Next, we discuss parallelization of operations and threads with High Level Synthesis for FPGA and with software compilers for processors. The main advantage of the former method is that it can parallelize operations beyond control dependencies, while the latter method has to obey control dependencies. Finally, some experimental results show that “FPGA and HLS” deliver better performance than a processor for control-intensive algorithms.

4S-4 (Time: 11:40 - 12:10)
Title(Invited Paper) Leveraging Parallelism in the Presence of Control Flow on CGRAs
AuthorJihyun Ryoo, Kyuseung Han, *Kiyoung Choi (Seoul National University, Republic of Korea)
Pagepp. 285 - 291
KeywordCGRA, control flow, mapping
AbstractCoarse-Grained Reconfigurable Architectures (CGRAs) are suitable for accelerating data-intensive applications in embedded systems due to their high performance and power efficiency. However, as application programs become more complex and contain more control flows, it becomes harder to accelerate such programs on CGRAs. Previous research on this issue has focused on correct execution of control flows rather than their acceleration. This paper reveals how control flows degrade the performance of programs and proposes software approaches to accelerating control flows by exploiting parallelism residing in each conditional as well as among conditionals. Experiments show that our proposed techniques improve performance by 2.51 times on average.
Slides


Session 4A  System-Level Thermal and Power Optimization Techniques
Time: 10:10 - 12:15 Wednesday, January 22, 2014
Location: Room 300
Chairs: Yun (Eric) Liang (Peking University, China), Wengfai Wong (National University of Singapore, Singapore)

4A-1 (Time: 10:10 - 10:35)
TitlePhysical-Aware Task Migration Algorithm for Dynamic Thermal Management of SMT Multi-Core Processors
AuthorBagher Salami (Ferdowsi University of Mashhad, Iran), Mohammadreza Baharani (University of Tehran, Iran), Hamid Noori (Ferdowsi University of Mashhad, Iran), *Farhad Mehdipour (Kyushu University, Japan)
Pagepp. 292 - 297
Keyworddynamic thermal management, multi-core processors, SMT, DVFS, task migration
AbstractThis paper presents a task migration algorithm for dynamic thermal management of SMT multi-core processors. The unique features of this algorithm include: 1) considering the SMT capability of the processors for task scheduling, 2) using an adaptive task migration threshold, and 3) considering the cores' physical features. This algorithm is evaluated on a commercial SMT quad-core processor. The experimental results indicate that our technique can significantly decrease the average and peak temperature compared to the standard Linux scheduler and two well-known thermal management techniques.
Slides

4A-2 (Time: 10:35 - 11:00)
TitleAgile Frequency Scaling for Adaptive Power Allocation in Many-Core Systems Powered by Renewable Energy Sources
Author*Xiaohang Wang, Zhiming Li (Guangzhou Institute of Advanced Technology, CAS, China), Mei Yang, Yingtao Jiang (University of Nevada, Las Vegas, U.S.A.), Masoud Daneshtalab (University of Turku, Finland), Terrence Mak (The Chinese University of Hong Kong, China)
Pagepp. 298 - 303
Keywordpower allocation, many-core
AbstractAs low-power electronics and miniaturization conspire to populate the world with emerging devices, one appealing approach is to power these multi-core/many-core-based devices with energy harvested from various environments. One of the most important issues concerning these devices is how to effectively allocate the power budget among the cores competing for power, which is formulated as one specific type of power-performance optimization problem in this paper. We attempt to solve this problem by proposing an Adaptive Power Allocation Technique (APAT) that explores a dynamic programming network. Our goal here is to maximize the overall system performance, taking into account the unique yet challenging fact that the available power budget might undergo a significant change when energy is scavenged from a renewable source. APAT has linear time complexity and low hardware overhead. Experiments have confirmed that APAT can reduce execution time by 20-30% compared to other state-of-the-art power allocation algorithms. In addition, APAT is quite insensitive to the rate of change of the power, lending itself well to power management in many-core systems powered by energy-harvesting sources.
Slides

4A-3 (Time: 11:00 - 11:25)
TitleVariation Aware Voltage Island Formation for Power Efficient Near-Threshold Manycore Architectures
Author*Ioannis Stamelakos, Sotirios Xydis, Gianluca Palermo, Cristina Silvano (Politecnico di Milano, Italy)
Pagepp. 304 - 310
KeywordNear Threshold Computing, Voltage Island, Variability Aware Design, Dark Silicon, Power Aware Manycore Design
AbstractThe Power Wall problem is gaining a lot of attention as the main obstacle to feasible/efficient scaling in the manycore era. In this paper, we introduce a variability-aware voltage island formation framework for exploring the potential of near-threshold voltage (NTV) power efficiency. By analyzing the voltage island granularities, we show that there is a strong dependence between the efficacy of NTV operation and the parallelism scaling of the application. Operating a 128-core chip in the NTV region delivers significant power gains.
Slides

4A-4 (Time: 11:25 - 11:50)
TitleAn Evaluation of an Energy Efficient Many-Core SoC with Parallelized Face Detection
Author*Hiroyuki Usui, Jun Tanabe, Toru Sano, Hui Xu, Takashi Miyamori (Toshiba Corporation, Japan)
Pagepp. 311 - 316
Keywordmany-core, NoC, image recognition, face detection
AbstractWe have developed a many-core SoC that includes two many-core clusters with 32 energy-efficient processor cores connected by a low-latency tree-based NoC. In this paper, we evaluate the performance of the many-core SoC using face detection as an example of real image recognition applications and discuss two parallelized implementations on the many-core clusters. By balancing workloads across the cores, the performance scales up to 64 cores and the SoC consumes only 2.21W.
Slides

4A-5 (Time: 11:50 - 12:15)
TitleEnergy Aware Real-Time Scheduling Policy with Guaranteed Security Protection
Author*Wei Jiang (School of Computer Science and Engineering, University of Electronic Science and Technology of China, China), Ke Jiang (Department of Computer and Information Science, Linköping University, Sweden), Xia Zhang (School of Information and Software Engineering, University of Electronic Science and Technology of China, China), Yue Ma (Department of Computer Science and Engineering, University of Notre Dame, U.S.A.)
Pagepp. 317 - 322
KeywordReal-time System, Security, Energy, Scheduling
AbstractIn this work, we address the emerging scheduling problem that arises in the design of secure and energy-efficient real-time embedded systems. The objective is to minimize the energy consumption subject to security and schedulability constraints. Due to the complexity of the problem, we propose a dynamic-programming-based approximation approach to find near-optimal solutions with respect to a predefined security constraint. The proposed technique has polynomial time complexity, which is about half that of traditional approximation approaches. The efficiency of our algorithm is validated by extensive experiments.
Slides


Session 4B  Emerging Techniques for Future NoC
Time: 10:10 - 12:15 Wednesday, January 22, 2014
Location: Room 301
Chairs: Paul Bogdan (University of Southern California, U.S.A.), Wei Zhang (HKUST, Hong Kong)

4B-1 (Time: 10:10 - 10:35)
TitleA Comprehensive and Accurate Latency Model for Network-on-Chip Performance Analysis
Author*Zhiliang Qian (The Hong Kong University of Science and Technology, Hong Kong), Da-cheng Juan (Carnegie Mellon University, U.S.A.), Paul Bogdan (University of Southern California, U.S.A.), Chi-Ying Tsui (The Hong Kong University of Science and Technology, Hong Kong), Diana Marculescu, Radu Marculescu (Carnegie Mellon University, U.S.A.)
Pagepp. 323 - 328
KeywordQueuing model, Analytical model, Network on Chip, Latency
AbstractIn this work, we propose a new, accurate, and comprehensive analytical model for Network-on-Chip (NoC) performance analysis. Given the application communication graph, the NoC architecture, the task mapping and the routing algorithm, the proposed framework analyzes the link dependencies and then determines the ordering of queuing analysis for accurate performance modeling. Toward this end, the channel waiting times in the links are estimated using a generalized G/G/1/K queuing model, which can tackle bursty traffic and dependent arrival times with general service time distributions. The proposed model is general and can be used to analyze various traffic scenarios for NoC platforms with arbitrary buffer and packet lengths. Experimental results on both synthetic and real applications demonstrate the accuracy and scalability of the newly proposed model.
Slides
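
Illustrative note on 4B-1: the abstract above estimates channel waiting times with a G/G/1/K queuing model. As a simpler point of reference, and explicitly not the paper's model, the sketch below uses Kingman's classic approximation for the mean waiting time in an infinite-buffer G/G/1 queue. The function name `kingman_wait` and its parameter names are assumptions made for illustration. It ignores the finite buffer size K and any dependence between arrivals.

```python
def kingman_wait(arrival_rate, mean_service, ca2, cs2):
    """Approximate mean waiting time in queue for a G/G/1 queue (Kingman).

    arrival_rate : average arrivals per unit time
    mean_service : average service time per packet
    ca2, cs2     : squared coefficients of variation of the inter-arrival
                   and service times (1.0 each for exponential traffic)
    """
    rho = arrival_rate * mean_service  # channel utilization
    if not 0 <= rho < 1:
        raise ValueError("utilization must be in [0, 1) for a stable queue")
    # Wq ~= rho / (1 - rho) * (ca2 + cs2) / 2 * mean_service
    return (rho / (1.0 - rho)) * ((ca2 + cs2) / 2.0) * mean_service
```

With `ca2 = cs2 = 1` the formula reduces to the exact M/M/1 result, and a larger `ca2` (burstier arrivals) raises the estimated waiting time at the same utilization.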

4B-2 (Time: 10:35 - 11:00)
TitleA Low-Latency Asynchronous Interconnection Network with Early Arbitration Resolution
AuthorGeorgios Faldamis (Cavium, Inc., U.S.A.), *Weiwei Jiang (Columbia University, U.S.A.), Gennette Gill (D.E. Shaw Research, U.S.A.), Steven M. Nowick (Columbia University, U.S.A.)
Pagepp. 329 - 336
Keywordasynchronous, network-on-chip, low-latency, arbitration, mesh-of-trees
AbstractA new asynchronous arbitration node is introduced for use as a building block in an asynchronous interconnection network. The target network topology is a variant Mesh-of-Trees (MoT), combining a binary fan-out (i.e. routing) network and a binary fan-in (i.e. arbitration) network, which is becoming widely used for multi-core shared-memory interfaces. The two key features are: (i) each fan-in node can resolve its arbitration and pre-allocate the corresponding input channel, before the actual data arrives; and (ii) a lightweight shadow monitoring network fast-forwards information as soon as data enters the network, in continuous time, without synchronization to a fixed-rate clock, notifying each fan-in node on its path to enable the early arbitration. The router nodes were designed in IBM 90nm technology using an ARM standard cell library. SPECTRE simulations indicate that the new arbitration node provided significant reductions in latency of up to 54.4% over prior designs, while maintaining roughly comparable throughput. Network-level simulations were then performed on eight diverse synthetic benchmarks, comparing the new approach ("early arbitration") with two earlier alternative asynchronous MoT networks ("baseline" and "predictive"), using a mix of random and deterministic traffic. Considerable improvements in system latency were obtained on all benchmarks, ranging from 13.0% to 38.7%. The early arbitration strategy also showed direct benefits for the two most adversarial benchmarks, "uniform random traffic" and "hotspot8".
Slides

4B-3 (Time: 11:00 - 11:25)
TitleA Vertically Integrated and Interoperable Multi-Vendor Synthesis Flow for Predictable NoC Design in Nanoscale Technologies
Author*Alberto Ghiribaldi, Herve Tatenguem Fankem (University of Ferrara, Italy), Federico Angiolini (iNoCs, Switzerland), Mikkel Stensgaard, Tobias Bjerregaard (Teklatech, Denmark), Davide Bertozzi (University of Ferrara, Italy)
Pagepp. 337 - 342
KeywordNetwork-on-Chip, Design Flow, EDA tool, Embedded Systems
AbstractWe deliver a design flow for the synthesis and convergence of application-specific networks-on-chip. The flow comes with novel features that can better address nanoscale design challenges: front-end driven floorplanning, dynamic IR-drop minimization, fast and accurate system-level power grid models, and predictable link design. Above all, such features are addressed by different prototype engines, even from different vendors, that can be smoothly integrated into the flow by means of a common specification format, the Communication Exchange Format (CEF), that enables unprecedented tool interactions. This flow is validated through an extensive demonstration framework.
Slides

4B-4 (Time: 11:25 - 11:50)
TitleFuzzy Flow Regulation for Network-on-Chip Based Chip Multiprocessors Systems
Author*Yuan Yao, Zhonghai Lu (KTH Royal Institute of Technology, Sweden)
Pagepp. 343 - 348
KeywordNetwork-on-Chip, Chip Multiprocessor, Flow regulation, Fuzzy logic
AbstractFlow regulation is a traffic shaping technique that can be used to improve communication performance with better utilization of network resources in chip multi-processors (CMPs). This paper presents fuzzy flow regulation. Unlike the static flow regulation policy, our system makes regulation decisions fully dynamically according to traffic dynamism and the state of the interconnection network. The central idea is to use fuzzy logic to mimic the behavior of an expert that can recognize the network status and then intelligently control the admission of input flows. As the experimental results show, the maximum improvement in average delay reaches 53.0% against static regulation and 37.4% against no regulation. The maximum improvement in average throughput reaches 37.5% against static regulation and 23.8% against no regulation.
Slides
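
Illustrative note on 4B-4: the abstract above uses fuzzy logic to decide how much traffic to admit based on the network status. The sketch below shows the general shape of such a controller and is not the paper's rule base. The single input (a congestion level normalized to 0..1), the three triangular membership functions, the output admission rates (1.0, 0.5, 0.1), and all function names are assumptions made for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a to b, falling from b to c."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def admission_rate(congestion):
    """Map a congestion level in [0, 1] to an admitted fraction of traffic."""
    x = min(max(congestion, 0.0), 1.0)  # clamp the input to the valid range

    # Fuzzify: degree to which congestion is "low", "medium", or "high".
    low = tri(x, -0.5, 0.0, 0.5)
    medium = tri(x, 0.0, 0.5, 1.0)
    high = tri(x, 0.5, 1.0, 1.5)

    # Rules: low congestion admits everything, high congestion admits little.
    rules = [(low, 1.0), (medium, 0.5), (high, 0.1)]

    # Defuzzify with a weighted average of the rule outputs.
    weight = sum(mu for mu, _ in rules)
    return sum(mu * out for mu, out in rules) / weight
```

For example, `admission_rate(0.25)` blends the "low" and "medium" rules equally and returns 0.75, so the admitted rate falls smoothly as congestion grows instead of switching at a fixed threshold as a static policy would.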

4B-5 (Time: 11:50 - 12:15)
TitleAdjustable Contiguity of Run-Time Task Allocation in Networked Many-Core Systems
Author*Mohammad Fattah, Pasi Liljeberg, Juha Plosila, Hannu Tenhunen (University of Turku, Finland)
Pagepp. 349 - 354
KeywordRun-Time Application Mapping, Dynamic Many-Core Systems
AbstractIn this paper, we propose a run-time mapping algorithm, CASqA, for networked many-core systems. In this algorithm, the level of contiguousness of the allocated processors (α) can be adjusted in a fine-grained fashion. A strictly contiguous allocation (α = 0) decreases the latency and power dissipation of the network and improves the applications' execution time. However, it limits the achievable throughput and increases the turnaround time of the applications. As a result, recent works consider non-contiguous allocation (α = 1) to improve the throughput at the expense of the applications' execution time and network metrics. Experimental results show that relentlessly allowing non-contiguous allocation not only cripples the network performance, but also degrades the achievable throughput compared to moderated cases (0<α<1). More precisely, up to a 35% drop in the network costs can be gained by adjusting the level of contiguity compared to non-contiguous cases, while the achieved throughput is kept constant. Moreover, CASqA provides at least 32% energy saving in the network compared to other works.
Slides


Session 4C  Emerging Applications
Time: 10:10 - 12:15 Wednesday, January 22, 2014
Location: Room 303
Chairs: Yu Wang (Tsinghua University, China), Dajiang Zhou (Waseda University, Japan)

4C-1 (Time: 10:10 - 10:35)
TitleSTD-TLB: A STT-RAM-Based Dynamically-Configurable Translation Lookaside Buffer for GPU Architectures
AuthorXiaoxiao Liu, Yong Li, Yaojun Zhang, Alex K. Jones, *Yiran Chen (University of Pittsburgh, U.S.A.)
Pagepp. 355 - 360
KeywordTLB, GPU, STT-RAM
AbstractThe translation lookaside buffer (TLB) was recently introduced into modern graphics processing unit (GPU) architectures to support virtual memory addressing. Compared to CPUs, the performance of GPUs is more sensitive to the capacity of TLBs because of heavier memory accesses. However, large SRAM cell area greatly limits the implementable capacity of conventional SRAM-based TLBs. In this work, we propose using STT-RAM to construct TLBs in light of the unique memory access pattern in GPUs, i.e., infrequent data updates. An STT-RAM TLB can replace its same-area SRAM counterpart with greater capacity, similar read performance and lower energy consumption. As an optimization of the STT-RAM TLB, we further propose an STT-RAM-based dynamically-configurable TLB (STD-TLB) by leveraging a differential sensing technique. STD-TLB can switch between high-capacity mode and high-performance mode on-the-fly based on real-time application needs. Our experiments show that compared to an SRAM TLB, a standard STT-RAM TLB improves the performance and energy delay product of GPU address translation by 32% and 75%, respectively, while STD-TLB achieves an additional 15% and 13% improvement over the standard STT-RAM TLB.
Slides

4C-2 (Time: 10:35 - 11:00)
TitleTraining Itself: Mixed-Signal Training Acceleration for Memristor-Based Neural Network
Author*Boxun Li, Yuzhi Wang, Yu Wang (Tsinghua University, China), Yiran Chen (University of Pittsburgh, U.S.A.), Huazhong Yang (Tsinghua University, China)
Pagepp. 361 - 366
KeywordNeural Network, Training, Memristor
AbstractThe artificial neural network (ANN) is among the most widely used methods in data processing applications. The memristor-based neural network further demonstrates a power-efficient hardware realization of ANN. The training phase is the critical operation of a memristor-based neural network. However, the traditional training method for memristor-based neural networks is time consuming and energy inefficient. Users have to first work out the parameters of memristors through digital computing systems and then tune the memristor to the corresponding state. In this work, we introduce a mixed-signal training acceleration framework, which realizes the self-training of a memristor-based neural network. We first modify the original stochastic gradient descent algorithm by approximating calculations and designing an alternative computing method. We then propose a mixed-signal acceleration architecture for the modified training algorithm by equipping the original memristor-based neural network architecture with the copy crossbar technique, weight update units, sign calculation units and other assistant units. The experiment on the MNIST database demonstrates that our mixed-signal acceleration is 3 orders of magnitude faster and 4 orders of magnitude more energy efficient than the CPU implementation counterpart at the cost of a slight decrease of the recognition accuracy (<5%).
Slides

4C-3 (Time: 11:00 - 11:25)
TitleHDTV1080p HEVC Intra Encoder with Source Texture Based CU/PU Mode Pre-decision
Author*Jia Zhu, Zhenyu Liu, Dongsheng Wang (Tsinghua University, China), Qingrui Han, Yang Song (Huawei Technologies Co., Ltd., China)
Pagepp. 367 - 372
KeywordHEVC, Intra, VLSI, RDO, CU-Depth
AbstractHEVC doubles the coding efficiency with more than 4x coding complexity as compared to H.264/AVC. To alleviate the burden of the Intra encoder, we estimate the RD-cost from the source image textures, and dynamically select two promising CU/PU mode candidates to execute exhaustive RDO processing. As integrated in our hardwired encoder, an average of 61.7% of the computational complexity was saved with a 4.53% rate increase. With TSMC 90nm technology, the real-time encoder for HDTV1080p at 44fps is implemented with 2269k gates at 357MHz operating frequency.
Slides

4C-4 (Time: 11:25 - 11:50)
TitleFast Large-Scale Optimal Power Flow Analysis for Smart Grid through Network Reduction
Author*Yi Liang, Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 373 - 378
Keywordsmart grid, power system, optimal power flow, network reduction, congestion
AbstractOptimal power flow (OPF) plays an important role in power system operation. The emerging smart grid aims to create an automated energy delivery system that enables two-way flows of electricity and information. As a result, it will be desirable if OPF can be solved in real time in order to allow the implementation of time-sensitive applications, such as real-time pricing. In this paper, we present a novel algorithm to accelerate the computation of alternating current optimal power flow (ACOPF) through power system network reduction (NR). We formulate the OPF problem based on an equivalent reduced system and interpret its solution; the detailed optimal dispatch for the original power system is then obtained using a distributed algorithm. Our results are compared with two widely used methods: full ACOPF and the linearized OPF with DC power flow and lossless network assumption, the so-called DCOPF. Experimental results show that for a large power system, our method achieves 7.01X speedup over ACOPF with only 1.72% error, and is 75.7% more accurate than the DCOPF solution. Our method is even 10% faster than DCOPF. Our experimental results demonstrate the unique strength of the proposed technique for fast, scalable, and accurate OPF computation. We also show that our method is effective for smaller benchmarks.

4C-5 (Time: 11:50 - 12:15)
TitleStorage-Less and Converter-Less Maximum Power Point Tracking of Photovoltaic Cells for a Nonvolatile Microprocessor
Author*Cong Wang (Tsinghua University, China), Naehyuck Chang, Younghyun Kim, Sangyoung Park (Seoul National University, Republic of Korea), Yongpan Liu (Tsinghua University, China), Hyung Gyu Lee (Daegu University, Republic of Korea), Rong Luo, Huazhong Yang (Tsinghua University, China)
Pagepp. 379 - 384
KeywordStorage-less, Converter-less, MPPT, Nonvolatile Processor
AbstractThis paper pioneers the maximum power point tracking (MPPT) of photovoltaic (PV) cells that directly supply power to a microprocessor without an energy storage element (a battery or a large-size capacitor) or power converters. The maximum power point tracking is conventionally performed by an MPPT charger that stores energy in the energy storage element, and a voltage regulator (typically a DC-DC converter) produces a proper voltage level for the microprocessor. The energy storage element is an energy buffer and makes it possible to perform MPPT of the PV cells and power management of the microprocessor independently. However, the energy storage element, MPPT charger and DC-DC converter cause a seriously limited lifetime (when a typical battery is adopted), significant energy loss (typically over 20%), increased weight/volume and high cost, etc. The proposed method enables extremely fine-grain dynamic power management (DPM) every few hundred microseconds and performs the MPPT without using an MPPT charger, a DC-DC converter, or an energy storage element. We achieve 84.5% of energy harvesting efficiency using the proposed setup with a huge reduction in cost, weight and volume, and extended lifetime, which is not even numerically comparable with conventional MPPT methods.
Slides


Session 5S  Special Session: Billion Chips of Trillion Transistors
Time: 13:50 - 15:30 Wednesday, January 22, 2014
Location: Room 302
Organizer: Chen-Yong Cher (IBM TJ Watson Research Center, U.S.A.)

5S-1 (Time: 13:50 - 14:20)
Title(Invited Paper) Soft Error Resiliency Characterization on IBM BlueGene/Q Processor
Author*Chen-Yong Cher, K. Paul Muller, Ruud A. Haring, David L. Satterfield, Thomas E. Musta, Thomas M. Gooding, Kristan D. Davis, Marc B. Dombrowa, Gerard V. Kopcsay, Robert M. Senger, Yutaka Sugawara, Krishnan Sugavanam (IBM T. J. Watson Research Center, U.S.A.)
Pagepp. 385 - 387
Keywordsoft error rate, fault injection, high-performance computing, chip irradiation
AbstractFault injection through accelerated irradiation is an effective way to evaluate the overall soft error resiliency of microprocessors. In this work, we report on irradiation experiments on a Blue Gene/Q (BG/Q) compute processor chip running selected applications. Blue Gene/Q is the third generation of IBM’s massively parallel, energy efficient Blue Gene series of supercomputers. In the experiments, we found 26 code fails that are relevant for the calculation of the mean-time-between-failures (MTBF) for a 20-PetaFLOP, 96-rack system running a comparable workload mix. The expected MTBF for check-stops due to cosmic radiation and alpha particles from chip packaging materials is calculated to be 51 days at sea level in New York City running the application mix studied. If the most vulnerable application is run exclusively, the projected MTBF is 35 days. These are outstanding results for a machine of this magnitude. The beaming experiment and projected MTBF validate the necessity to include autonomous hardware detection and recovery at the cost of design effort, silicon area and power.

5S-2 (Time: 14:20 - 14:50)
Title(Invited Paper) Resiliency for Many-Core System on a Chip
Author*Tanay Karnik, James Tschanz, Nitin Borkar, Jason Howard, Sriram Vangal, Vivek De, Shekhar Borkar (Intel Corporation, U.S.A.)
Pagepp. 388 - 389
KeywordResiliency, SOC
AbstractResilient techniques are commonly employed for dynamic and static variation tolerance. In this paper, we present an adaptive clocking technique that achieves 31% throughput increase with 15% energy reduction, and an adaptive interconnect fabric technique that increases bandwidth by 63% with 14.6% energy reduction. We also discuss variations in many-core microprocessors and some techniques to enable a resilient many-core system on a chip.

5S-3 (Time: 14:50 - 15:20)
Title(Invited Paper) Rethinking Error Injection for Effective Resilience
AuthorShahrzad Mirkhani (University of Texas, U.S.A.), Hyungmin Cho, Subhasish Mitra (Stanford University, U.S.A.), *Jacob Abraham (University of Texas, U.S.A.)
Pagepp. 390 - 393
Keywordtransient fault, soft error, error injection
AbstractSoft errors, caused by radiation, have become a major challenge in today’s computer systems and networking equipment, making it imperative that systems be designed to be resilient to errors. Error injection is a powerful approach to evaluate system resilience, and current practice is to inject errors in architectural registers of processors, program variables of applications, or storage elements in the hardware model. This paper, using answers to frequently asked questions, discusses the need for rethinking conventional approaches to error injection, showing data from recent research and our simulation results. Approaches to improving current error injections are also suggested.
Slides


Session 5A  Simulation and Modeling
Time: 13:50 - 15:30 Wednesday, January 22, 2014
Location: Room 300
Chairs: Atushi Ike (Fujitsu Laboratories, Japan), Yuichi Nakamura (NEC, Japan)

5A-1 (Time: 13:50 - 14:15)
TitleAmphisbaena: Modeling Two Orthogonal Ways to Hunt on Heterogeneous Many-Cores
Author*Jun Ma (University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences, China), Guihai Yan, Yinhe Han, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 394 - 399
Keywordperformance modeling, heterogeneous architecture, amphisbaena, scale-out speedup, scale-up speedup
AbstractHeterogeneous many-cores can deliver high performance or energy efficiency. There are two orthogonal ways to improve performance: 1) scale-out by exploiting thread-level parallelism, and 2) scale-up by enabling core heterogeneity. Predicting the performance of such architectures is increasingly challenging. We propose a comprehensive performance model, Amphisbaena, or Phi, built from two orthogonal functions alpha and beta. Function alpha describes the scale-out speedup and function beta handles the scale-up speedup. The Phi model can clearly tell not only the overall speedup of a given multithreading and core mapping strategy, but also how to improve the multithreading and core mapping, and hence should be a promising performance predictor for future heterogeneous many-cores. The results show that the Phi model’s error rate is within 12%, which is lower than state-of-the-art methods. We demonstrate the application of the Phi model by introducing a heuristic scheduling algorithm, which outperforms the baselines by 13% on average.

5A-2 (Time: 14:15 - 14:40)
TitleCo-Simulation Framework for Streamlining Microprocessor Development on Standard ASIC Design Flow
Author*Tomoyuki Nakabayashi, Tomoyuki Sugiyama, Takahiro Sasaki (Mie University, Japan), Eric Rotenberg (North Carolina State University, U.S.A.), Toshio Kondo (Mie University, Japan)
Pagepp. 400 - 405
Keywordco-simulation, development environment, ASIC design, microprocessor
AbstractIn this paper, we present a practical processor co-simulation framework on a standard ASIC design flow. We propose an off-chip system call emulator, a checkpoint mechanism, and a cache warming mechanism to streamline design and verification of a processor. These mechanisms provide a short turnaround time, processor prototyping, and highly accurate evaluation results. All our proposed approaches can be consistently used not only in RTL simulation but also in gate/transistor simulation and even in chip evaluation with an LSI tester.
Slides

5A-3 (Time: 14:40 - 15:05)
TitleAnnotation and Analysis Combined Cache Modeling for Native Simulation
AuthorRongjie Yan (State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China), *De Ma (Institute of Microelectronic CAD, Hangzhou Dianzi University, China), Kai Huang, Xiaoxu Zhang, Siwen Xiu (Institute of VLSI Design, Zhejiang University, China)
Pagepp. 406 - 411
Keywordcache model, dynamic annotation, static analysis, native simulation
AbstractTo accelerate the speed of performance estimation and raise its accuracy for MPSoC, we propose a static analysis and dynamic annotation combined method to efficiently model cache mechanism in native simulation. We use a new cache model to analyze segmental profiling results statically to speed up simulation, and take advantage of a dynamic annotation technique to exactly trace the addresses of local variables. Experimental results show the efficiency of the proposed techniques for more accurate system performance estimation.

5A-4 (Time: 15:05 - 15:30)
TitleA Scorchingly Fast FPGA-Based Precise L1 LRU Cache Simulator
Author*Josef Schneider, Jorgen Peddersen, Sri Parameswaran (University of New South Wales, Australia)
Pagepp. 412 - 417
KeywordCache simulation, FPGA, LRU
AbstractJudicious selection of cache configuration is critical in embedded systems as the cache design can impact power consumption and processor throughput. A large cache increases cache hits but requires more hardware and more power, and will be slower for each access. A smaller cache is more economical and faster per access, but may result in significantly more cache misses resulting in a slower system. For a given application or a class of applications on a given hardware system, the designer can aim to optimise cache configuration through cache simulation. We present here the first multiple cache simulator based on hardware. The FPGA implementation is characterised by a trace consumption rate of 100MHz making our cache simulation core up to 53x faster, for a set of benchmarks, than the fastest software based cache simulator. Our cache simulator can determine the hit rates of 308 cache configurations, of which it can determine the hit rates of 44 simultaneously.
Slides
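
As a software illustration of what the 5A-4 abstract means by determining hit rates for several cache configurations from one trace, the following minimal sketch simulates a set-associative LRU cache in Python. It is not the authors' FPGA design; the function name `lru_hit_rate`, the address trace, and the three configurations are hypothetical and chosen only to keep the example small.

```python
from collections import OrderedDict


def lru_hit_rate(trace, num_sets, ways, line_bytes):
    """Return the hit rate of a set-associative LRU cache over an address trace."""
    # One ordered dict per set; insertion order tracks recency (oldest first).
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in trace:
        line = addr // line_bytes      # cache-line address
        index = line % num_sets        # set the line maps to
        tag = line // num_sets         # remaining bits identify the line
        lru = sets[index]
        if tag in lru:
            hits += 1
            lru.move_to_end(tag)       # mark as most recently used
        else:
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[tag] = True
    return hits / len(trace) if trace else 0.0


# Hypothetical trace: three cache lines accessed round-robin.
trace = [0, 64, 128, 0, 64, 128, 0, 64, 128]

# Hypothetical configurations: (number of sets, ways, line size in bytes).
configs = [(1, 2, 64), (1, 4, 64), (2, 2, 64)]
rates = {cfg: lru_hit_rate(trace, *cfg) for cfg in configs}

for cfg, rate in rates.items():
    print(cfg, round(rate, 3))
```

The two-way, single-set configuration thrashes on this trace, while the other two hold all three lines after the initial misses. A software loop like this evaluates one configuration at a time; the abstract's point is that hardware can evaluate many simultaneously.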


Session 5B  Reliability Analysis and Enhancement
Time: 13:50 - 15:30 Wednesday, January 22, 2014
Location: Room 301
Chair: Shigeki Nojima (Toshiba Corporation, Japan)

5B-1 (Time: 13:50 - 14:15)
TitleRedundant-Via-Aware ECO Routing
Author*Hsi-An Chien, Ting-Chi Wang (National Tsing Hua University, Taiwan)
Pagepp. 418 - 423
KeywordRedundant Via, ECO Routing
AbstractRedundant via insertion (RVI) has become an inevitable means adopted in the routing or post-routing stage to enhance chip reliability and yield as feature size shrinks down to nanometer scale. The remaining routing resources, however, could become so limited after RVI that engineering change order (ECO) routing during a pre-mask stage or even a post-mask stage is difficult to complete. In this paper, we study an ECO routing problem where redundant vias are present in the given layout but can be considered for replacement or removal to increase the routability and improve the routing quality. To find an ECO routing path, we construct only the necessary part of the routing graph on-the-fly, and develop an A* search based algorithm for achieving efficient path finding. We also take redundant via replacement, removal, and insertion into account when formulating the routing cost, and apply a state-of-the-art method to perform redundant via replacement and insertion. Experiments show that our algorithm not only successfully routes all test cases but also efficiently produces high-quality solutions.

5B-2 (Time: 14:15 - 14:40)
TitleA Fast and Provably Bounded Failure Analysis of Memory Circuits in High Dimensions
AuthorWei Wu, Fang Gong (University of California, Los Angeles, U.S.A.), Gengsheng Chen (Fudan University, China), *Lei He (University of California, Los Angeles, U.S.A.)
Pagepp. 424 - 429
KeywordSRAM, Failure probability, Importance sampling, High dimensions
AbstractMemory circuits demand extremely high integration density and reliability under process variations. The most challenging task is how to accurately estimate the extremely small failure probability of memory circuits where the circuit failure is a "rare event". Classic importance sampling has been widely recognized to be inaccurate and unreliable in high dimensions. To address this issue, we propose a fast statistical analysis to estimate the probability of rare events in high dimensions and prove that the estimation is always bounded. This methodology has been successfully applied to the failure analysis of memory circuits with hundreds of variables. Experiments on a 54-dimensional SRAM cell circuit show that the proposed approach achieves 1150X speedup over Monte Carlo without compromising any accuracy. It also outperforms the classification based method (e.g., Statistical Blockade) by 204X and existing importance sampling method (e.g., Spherical Sampling) by 5X. On another 90-dimensional circuit, the proposed approach yields 364X speedup over Monte Carlo while existing importance sampling methods completely fail to provide reasonable accuracy.
Slides
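
To make the rare-event estimation problem in the 5B-2 abstract concrete, the sketch below applies classic mean-shift importance sampling to a one-dimensional Gaussian tail, where the exact answer is known. This is the textbook technique that the abstract says becomes unreliable in high dimensions, not the method proposed in the paper; the threshold of 4 sigma, the sample count, and the function names are assumptions made for illustration.

```python
import math
import random


def tail_prob_importance_sampling(threshold, n_samples, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by mean-shift importance sampling."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw from the shifted proposal N(threshold, 1) so failures are common.
        x = rng.gauss(threshold, 1.0)
        if x > threshold:
            # Likelihood ratio N(0, 1) / N(threshold, 1) evaluated at x.
            total += math.exp(-threshold * x + 0.5 * threshold * threshold)
    return total / n_samples


def tail_prob_exact(threshold):
    """Exact Gaussian tail probability via the complementary error function."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


estimate = tail_prob_importance_sampling(4.0, 20000)
exact = tail_prob_exact(4.0)
print(f"estimate={estimate:.3e} exact={exact:.3e}")
```

Plain Monte Carlo with the same 20000 samples would be expected to see fewer than one failure at this threshold, which is why a shifted sampling distribution is used for rare events.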

5B-3 (Time: 14:40 - 15:05)
TitlePredicting Circuit Aging Using Ring Oscillators
AuthorDeepashree Sengupta, *Sachin Sapatnekar (University of Minnesota, U.S.A.)
Pagepp. 430 - 435
KeywordBTI, Ring oscillators, UofM model
AbstractThis paper presents a method for inferring circuit delay shifts due to bias temperature instability using ring oscillator (ROSC) sensors. This procedure is based on presilicon analysis, postsilicon ROSC measurements, a new aging analysis model called the Upperbound on f_Max (UofM), and a look-up table that stores a precomputed degradation ratio that translates delay shifts in the ROSC to those in the circuits. This method not only yields delay estimates within 0.2% of the true values with very low runtime, but is also independent of temperature and supply voltage variations.
Slides

5B-4 (Time: 15:05 - 15:30)
TitleStatistical Analysis of Process Variation Based on Indirect Measurements for Electronic System Design
Author*Ivan Ukhov, Mattias Villani, Petru Eles, Zebo Peng (Linköping University, Sweden)
Pagepp. 436 - 442
Keywordstatistical analysis, process variation, Bayesian inference
AbstractWe present a framework for the analysis of process variation across semiconductor wafers. The framework is capable of quantifying the primary parameters affected by process variation, e.g., the effective channel length, which is in contrast with the former techniques wherein only secondary parameters were considered, e.g., the leakage current. Instead of taking direct measurements of the quantity of interest, we employ Bayesian inference to draw conclusions based on indirect observations, e.g., on temperature.
Slides


Session 5C  Variational Design Techniques for Analog/Mixed-Signal Circuits
Time: 13:50 - 15:30 Wednesday, January 22, 2014
Location: Room 303
Chairs: C.Y. Tsui (Hong Kong University of Science and Technology, Hong Kong), Hideki Asai (Shizuoka University, Japan)

5C-1 (Time: 13:50 - 14:15)
TitleSymbolic Computation of SNR for Variational Analysis of Sigma-Delta Modulator
Author*Jiandong Cheng, Guoyong Shi (Shanghai Jiao Tong University, China)
Pagepp. 443 - 448
KeywordSigma-delta modulator, Statistical analysis, Switched-capacitor, Symbolic analysis, Signal-to-noise ratio
AbstractSignal-to-noise ratio (SNR) is an important design metric for switched-capacitor sigma-delta modulators (SC-SDMs). In an automatic synthesis environment, fast SNR computation is of paramount importance. So far the main SNR computation method has been behavioral simulation. Other less accurate methods are based on empirical formulas. These methods contribute little to enhancing synthesis efficiency. In this work a highly efficient and purely symbolic SNR computation method is proposed. The difficulty in the computation of noise power (requiring integration of a rational function) is overcome by a Taylor polynomial approximation. Together with a symbolic loop-transfer analysis tool, the SNR can be computed fully symbolically. This novel computation method is applied to variational SC-SDM analysis. The effectiveness and efficiency are compared to behavioral Monte Carlo simulation results.

5C-2 (Time: 14:15 - 14:40)
TitleSparse Statistical Model Inference for Analog Circuits under Process Variations
Author*Yan Zhang, Sriram Sankaranarayanan, Fabio Somenzi (University of Colorado at Boulder, U.S.A.)
Pagepp. 449 - 454
KeywordSparse regression, Statistical model inference, Analog verification
AbstractIn this paper, we address the problem of performance modeling for transistor-level circuits under process variations. A sparse regression technique is introduced to characterize the relationship between the process parameters and the output responses. This approach relies on repeated simulations to find polynomial approximations of response surfaces. It employs a heuristic to construct sparse polynomial expansions and a stepwise regression algorithm based on LASSO to find low degree polynomial approximations. The proposed technique is able to handle many tens of process parameters with a small number of simulations when compared to an earlier approach using ordinary least squares. We present our approach in the context of statistical model inference (SMI), a recently proposed statistical verification framework for transistor-level circuits. Our experimental evaluation compares percentage yields predicted by our approach with Monte-Carlo simulations and SMI using ordinary least squares on benchmarks with up to 30 process parameters. The sparse-SMI approach is shown to require significantly fewer simulations, achieving orders of magnitude improvement in the run times with small differences in the resulting yield estimates.
Slides

5C-3 (Time: 14:40 - 15:05)
TitleTime-Domain Performance Bound Analysis for Analog and Interconnect Circuits Considering Process Variations
Author*Tan Yu, Sheldon Tan (University of California, Riverside, U.S.A.), Yici Cai (Tsinghua University, China), Puying Tang (University of Electronic Science and Technology of China, China)
Pagepp. 455 - 460
KeywordTime-domain, bound analysis
AbstractTime-domain worst-case or performance bound estimation for analog integrated circuits and interconnect circuits is crucial for both analog and digital circuit design and optimization in the presence of process variations. In this paper, we present a novel non-Monte-Carlo performance bound analysis technique in the time domain. The new method consists of several steps. First, the symbolic transient modified nodal analysis (MNA) formulation of the circuit matrices of (linearized) analog and interconnect circuits at a time step is formed. Then the closed-form expressions of the performance of interest in terms of variational parameters of the circuit matrices of (linearized) analog and interconnect circuits are derived via a graph-based symbolic analysis method. Then the time-domain performance response bounds of the current time step are obtained by a nonlinear constrained optimization process subject to the parameter variations and variational circuit state bounds computed from the previous time step. The proposed method is more amenable to computing high-sigma bounds than the standard MC method. Experimental results show that the new method can deliver orders-of-magnitude speedup over standard Monte Carlo simulation on some typical analog circuits and interconnect circuits with very high accuracy.
Slides

5C-4 (Time: 15:05 - 15:30)
TitleA Robustness Optimization of SRAM Dynamic Stability by Sensitivity-Based Reachability Analysis
AuthorYang Song, *Sai Manoj P. D., Hao Yu (Nanyang Technological University, Singapore)
Pagepp. 461 - 466
KeywordDynamic stability optimization, Reachability analysis, Large-signal sensitivity
AbstractA robustness optimization of SRAM dynamic stability at nano-scale is developed in this paper by zonotope-based reachability analysis. A backward Euler method is developed to efficiently perform reachability analysis with zonotopes to deal with multiple device parameters with tuning ranges. Moreover, a sensitivity calculation of zonotopes is developed to optimize safety distance by simultaneously tuning multiple SRAM device parameters without multiple repeated computations. As such, sequential robustness optimizations can be performed such that the optimized SRAM designs depart from the unsafe region and converge into the safe region. The proposed method is implemented inside a SPICE-like simulator. As shown by numerical experiments, the proposed method can achieve 600x speedup on average compared to the traditional Monte-Carlo verification method at similar accuracy. In addition, compared to the traditional small-signal based sensitivity optimization, our method can converge faster with high accuracy.
Slides


Session 6S  Special Session: Overcoming Major Silicon Bottlenecks: Variability, Reliability, Validation and Debug
Time: 15:50 - 17:30 Wednesday, January 22, 2014
Location: Room 302
Organizer: Subhasish Mitra (Stanford University, U.S.A.)

6S-1 (Time: 15:50 - 16:20)
Title(Invited Paper) Accurate and Inexpensive Performance Monitoring for Variability-Aware Systems
AuthorLiangzhen Lai, *Puneet Gupta (UCLA, U.S.A.)
Pagepp. 467 - 473
Keywordvariability, performance monitoring, reliability, adaptive system
AbstractDesigning reliable integrated systems has become a major challenge with shrinking geometries, increasing fault rates and devices which age substantially in their usage life. The proposed research is motivated by the observation that many of the infield failures are delay failures and several variability signatures are also delay-related. The origins of temporal delay fluctuations include manufacturing variability, voltage/temperature changes, negative or positive bias temperature instability-related Vth degradation, etc. Since the actual delay changes depend on process variations as well as workload, on-chip monitoring may be the best way of predicting them. There is a need to monitor circuit performance during manufacturing as well as at runtime to predict achievable performance and warn against impending failures. Adaptive mechanisms in hardware and/or software can optimize the trade-off between errors, energy and performance based on the feedback from runtime circuit performance monitors. This paper presents approaches for automated synthesis of design-dependent performance monitors. These monitors can be used to predict impending delay failures relatively inexpensively. For low-overhead monitoring, we propose multiple design-dependent ring oscillators (DDROs) as smart canary structures which can reliably predict achievable chip frequency but with margins for local variations. Early silicon results indicate that DDROs can reduce delay monitoring error by 35% compared to conventional ring oscillators. To further improve the prediction (albeit at a higher overhead), we propose in-situ slack monitors (SlackProbe) which can match local variations as well at overheads much smaller than monitoring all sequential elements. SlackProbe reduces the number of monitors required by over 15X with 5% additional delay margin in several commercial processor benchmarks. Finally, we show an example of a software testbed that demonstrates a variability-aware system that utilizes the hardware monitors and operates with both hardware and software adaptation.

6S-2 (Time: 16:20 - 16:50)
Title(Invited Paper) Quantifying Workload Dependent Reliability in Embedded Processors
Author*Vikas Chandra (ARM, U.S.A.)
Pagepp. 474 - 477
KeywordReliability, BTI, TDDB, Soft Error
AbstractWith nearly three decades of continued CMOS scaling, the devices have now been pushed to their physical and reliability limits. Scaling to sub-20nm technology nodes changes the nature of reliability effects from abrupt functional problems to progressive degradation of the performance characteristics of devices and system components. The impact of unreliability results in time-dependent variability, directly translating into design uncertainty in manufactured chips. Further, application workloads can significantly affect the overall system reliability. In this work, we have analyzed aging effects on various design hierarchies of an embedded processor in 28nm running real-world applications. We have also quantified the dependencies of aging effects on switching-activity and power-state of workloads. Implementation results show that the processor timing degradation can vary from 2% to 11%, depending on the workload.

6S-3 (Time: 16:50 - 17:20)
Title(Invited Paper) QED Post-Silicon Validation and Debug: Frequently Asked Questions
AuthorDavid Lin, *Subhasish Mitra (Stanford University, U.S.A.)
Pagepp. 478 - 482
KeywordDebug, Post-Silicon Validation, Quick Error Detection, Verification
AbstractDuring post-silicon validation and debug, one or more manufactured integrated circuits (ICs) are tested in actual system environments to detect and fix design flaws (bugs). According to several industrial reports, the costs of post-silicon validation and debug are rising faster than design costs. Hence, new systematic techniques are essential to overcome the rising costs of existing post-silicon validation and debug techniques. QED, an acronym for Quick Error Detection, is such a technique that effectively overcomes several post-silicon validation and debug challenges. QED systematically creates a wide variety of validation tests to quickly detect bugs, not only inside processor cores, but also in uncore components (i.e., components in an SoC that are neither processor cores nor co-processors) of multi-core system-on-chips. In this paper, we present a brief overview of QED through a series of frequently asked questions.


Session 6A  Synthesis of Quantum Circuits and Adaptive Logic
Time: 15:50 - 17:30 Wednesday, January 22, 2014
Location: Room 300
Chairs: Yusuke Matsunaga (Kyushu University, Japan), Deming Chen (University of Illinois, Urbana-Champaign, U.S.A.)

6A-1 (Time: 15:50 - 16:15)
TitleEfficient Synthesis of Quantum Circuits Implementing Clifford Group Operations
Author*Philipp Niemann (University of Bremen, Germany), Robert Wille (University of Bremen/Cyber Physical Systems DFKI GmbH/Technical University Dresden, Germany), Rolf Drechsler (University of Bremen/Cyber Physical Systems DFKI GmbH, Germany)
Pagepp. 483 - 488
Keywordquantum circuits, synthesis, Clifford groups, stabilizer circuits
AbstractQuantum circuits established themselves as a promising emerging technology and, hence, attracted significant attention in the domain of computer-aided design. As a result, many approaches for synthesis of corresponding netlists have been proposed in the last decade. However, as the design of quantum circuits poses significant obstacles caused by phenomena such as superposition, entanglement, and phase shifts, automatic synthesis still represents a significant challenge. In this paper, we propose an automatic synthesis approach for quantum circuits that implement Clifford Group operations. These circuits constitute an important subclass of quantum computation and cover core aspects of quantum functionality. The proposed approach exploits specific properties of the unitary transformation matrices that are associated to each quantum operation. Furthermore, Quantum Multiple-Valued Decision Diagrams (QMDDs) are employed for an efficient representation of these matrices. Experimental results confirm that this enables a compact realization of the respective quantum functionality.
Slides

6A-2 (Time: 16:15 - 16:40)
TitleOptimal SWAP Gate Insertion for Nearest Neighbor Quantum Circuits
Author*Robert Wille (University of Bremen/Cyber Physical Systems DFKI GmbH/Technical University Dresden, Germany), Aaron Lye (University of Bremen, Germany), Rolf Drechsler (University of Bremen/Cyber Physical Systems DFKI GmbH, Germany)
Pagepp. 489 - 494
Keywordquantum circuits, optimization, nearest neighbor, synthesis
AbstractMotivated by its promising applications e.g. for database search or factorization, significant progress has been made in the development of automated design methods for quantum circuits. But in order to keep up with recent physical developments in this domain, new technological constraints have to be considered. Limited interaction distance between gate qubits is one of the most common of these constraints. This led to the development of several strategies aiming at making a given quantum circuit nearest neighbor-complying by adding SWAP gates into the existing circuit structure. However, all of these strategies are of heuristic nature. In this work, we present an exact approach that enables nearest neighbor compliance by adding a minimal number of SWAP gates. Experiments demonstrate the efficiency of the approach.
Slides
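
The nearest neighbor constraint in the 6A-2 abstract can be illustrated with a small sketch. For qubits placed on a line, a two-qubit gate whose qubits are d positions apart needs d - 1 SWAPs to make them adjacent, so summing d - 1 over all gates gives a simple cost for a fixed qubit order. The code below computes that cost and brute-forces the best fixed order for a tiny circuit. It is an assumption-laden illustration, not the exact SWAP-minimization method of the paper; the gate list and function names are hypothetical.

```python
from itertools import permutations


def nearest_neighbor_cost(gates, order):
    """Sum over two-qubit gates of (distance between the qubits - 1) on a line."""
    pos = {qubit: i for i, qubit in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) - 1 for a, b in gates)


def best_linear_order(gates, qubits):
    """Exhaustively search all qubit orders; only feasible for a few qubits."""
    return min(permutations(qubits),
               key=lambda order: nearest_neighbor_cost(gates, order))


# Hypothetical circuit: each pair is the two qubits of one gate.
gates = [("q0", "q2"), ("q0", "q3"), ("q1", "q3")]
qubits = ["q0", "q1", "q2", "q3"]

initial_cost = nearest_neighbor_cost(gates, qubits)
best_order = best_linear_order(gates, qubits)
best_cost = nearest_neighbor_cost(gates, best_order)
print(initial_cost, best_order, best_cost)
```

Here a single global reordering removes every non-adjacent interaction. In general no fixed order achieves zero cost, and SWAP gates must be inserted between gates, which is the problem the abstract addresses exactly.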

6A-3 (Time: 16:40 - 17:05)
TitleQubit Placement to Minimize Communication Overhead in 2D Quantum Architectures
AuthorAlireza Shafaei, Mehdi Saeedi, *Massoud Pedram (University of Southern California, U.S.A.)
Pagepp. 495 - 500
KeywordQuantum Computing, Qubit Placement, 2D Quantum Architectures, Interaction Distance
AbstractRegular, local-neighbor topologies of quantum architectures restrict interactions to adjacent qubits, which in turn increases the latency of quantum circuits mapped to these architectures. To alleviate this effect, optimization methods that consider qubit-to-qubit interactions in 2D grid architectures are presented in this paper. The proposed approaches benefit from a Mixed Integer Programming (MIP) formulation for the qubit placement problem. Simulation results on various benchmarks show a 27% average reduction in communication overhead between qubits compared to the best results of previous work.
Slides

6A-4 (Time: 17:05 - 17:30)
TitleA Novel Wirelength-Driven Packing Algorithm for FPGAs with Adaptive Logic Modules
AuthorSheng-Kai Wu, *Po-Yi Hsu, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Pagepp. 501 - 506
KeywordFPGA, packing, clustering, ALM
AbstractAdaptive logic module (ALM) in modern field programmable gate array can serve as one 6-input lookup table (LUT) or two smaller lookup tables under certain constraints. In a typical design flow, a netlist of LUTs formed after technology mapping has to be merged into ALMs and then packed into coarse-grained logic blocks (CLBs) before placement and routing. How the LUTs are merged and the ALMs are packed has a significant impact on the quality of the placement. We propose a novel wirelength-driven algorithm to merge the LUTs and pack the ALMs to ensure that it will not adversely affect the final wirelength. Experimental results show that substituting AAPack [7] by our algorithm yields about 14.69% reduction in number of tracks and 16.83% wirelength improvement for ALM-based FPGA. Applying our algorithm to traditional FPGA, the minimum number of tracks and wirelength are reduced by 16.32% and 17.90%, respectively, compared to T-VPack.
Slides


Session 6B  Contemporary Routing
Time: 15:50 - 17:30 Wednesday, January 22, 2014
Location: Room 301
Chairs: Mark Lin (National Chung Cheng University, Taiwan), Toshiyuki Shibuya (Fujitsu Laboratories, Japan)

6B-1 (Time: 15:50 - 16:15)
TitleA Topology-Based ECO Routing Methodology for Mask Cost Minimization
Author*Po-Hsun Wu, Shang-Ya Bai, Tsung-Yi Ho (National Cheng Kung University, Taiwan)
Pagepp. 507 - 512
KeywordECO, Routing, Mask Cost
AbstractEngineering Change Order (ECO) routing, which is a complicated and difficult task due to limited routing resources and increasing design rules, is requested in later design stages for the purpose of delay and noise optimization. After the ECO routing procedure, some routing layers may be modified and the corresponding masks need to be remanufactured, which leads to high mask re-spin cost. Although several ECO routers have been proposed to obtain a routing solution based on different design objectives, mask re-spin cost still cannot be effectively reduced because the ECO routing problem is handled in a sequential manner. This paper presents a three-stage ECO routing flow which can simultaneously route all ECO nets while considering routing layer minimization. Initially, several routing paths for each ECO net are efficiently generated and an Integer Linear Programming (ILP) model is developed to simultaneously determine the routing path of each ECO net to minimize the number of changed masks. Moreover, a minimum-cost-maximum-flow (MCMF) algorithm is applied to further reduce the number of changed masks. Experimental results demonstrate that our proposed ECO routing flow can effectively reduce the number of changed masks with only negligible wirelength and via overhead.

6B-2 (Time: 16:15 - 16:40)
TitleBOB-Router: A New Buffering-Aware Global Router with Over-the-Block Routing Resources Optimization
AuthorYilin Zhang (University of Texas at Austin, U.S.A.), Salim Chowdhury (Oracle, U.S.A.), *David Z. Pan (University of Texas at Austin, U.S.A.)
Pagepp. 513 - 518
KeywordRouting, Over-the-block, Congestion, Slew
AbstractIn this paper, we propose a new global router, BOB-Router, endowed with the ability to use over-the-block routing resources to the greatest extent in addition to the traditional routing concepts of minimizing wirelength, via count and overflow. In previous global routing formulations, the routing resources over the IP blocks were either treated as routing blockages, leading to significant waste, or simply treated in the same way as outside-the-block routing resources, which would violate the slew constraints and thus fail buffering. Utilizing over-the-block routing resources could dramatically improve the routing solution, yet requires special attention, since the slew, affected by different RC on different metal layers, must be constrained by buffering and is easily violated. Moreover, even if all nets are slew-legalized, the routing solution could still suffer from heavy congestion. For the first time, BOB-Router tries to solve the over-the-block global routing problem through minimizing overflows, wirelength and via count simultaneously without violating slew constraints. BOB-Router generates a slew-legalized initial solution followed by a Lagrangian-multiplier-based pricing phase and RC-constrained A* search to help explore new buffering-aware topologies on all metal layers. Our experimental results show that BOB-Router completely satisfies the slew constraints and significantly outperforms the obstacle-avoiding global routers in terms of wirelength, via count and overflows.
Slides

6B-3 (Time: 16:40 - 17:05)
TitleRoutability-Driven Bump Assignment for Chip-Package Co-Design
AuthorMeng-Ling Chen, Tu-Hsiung Tsai, *Hung-Ming Chen (National Chiao Tung University, Taiwan), Shi-Hao Chen (Global Unichip Corporation, Taiwan)
Pagepp. 519 - 524
KeywordBump assignment, co-design
AbstractIn current chip and package designs, it is a bottleneck to simultaneously optimize both pin assignment and pin routing for different design domains (chip, package, and board). Usually, the whole process costs a huge manual effort and multiple iterations, thus reducing profit margin. Therefore, we propose a fast heuristic chip-package co-design algorithm in order to automatically obtain a bump assignment which introduces high routability both in RDL routing and package routing (100% in our real case). Experimental results show that the proposed method (inspired by board escape routing algorithms) automatically finishes bump assignment, RDL routing and package routing in a short time, while the traditional co-design flow requires weeks or even months.
Slides

6B-4 (Time: 17:05 - 17:30)
TitleVFGR: A Very Fast Parallel Global Router with Accurate Congestion Modeling
Author*Zhongdong Qi, Yici Cai, Qiang Zhou (Tsinghua University, China), Zhuoyuan Li, Mike Chen (Nimbus Automation Technologies, China)
Pagepp. 525 - 530
KeywordGlobal Routing, Parallelization, Congestion Modeling
AbstractWith the rapid growth of design size and complexity, global routing has always been a hard problem. Several new factors contribute to global routing congestion and can only be measured and optimized in 3-D global routing rather than 2-D routing. To more accurately reflect modern design rule requirements and various new factors, we propose a practical congestion model in global routing. To achieve better global and detailed routing solution quality, we propose a 3-D global router, VFGR, with parallel computing using the proposed congestion model. Experimental results show that VFGR can achieve comparable or better solution quality than two state-of-the-art global routers with shorter runtime. It is also demonstrated that by adopting the proposed congestion model in global routing, higher solution quality and much shorter runtime can be achieved in the detailed routing stage.
Slides


Session 6C  Power Supply Noise Aware Design Optimization
Time: 15:50 - 17:30 Wednesday, January 22, 2014
Location: Room 303
Chairs: Wenjian Yu (Tsinghua University, China), Guoyong Shi (Shanghai Jiao Tong University, China)

6C-1 (Time: 15:50 - 16:15)
TitleEfficient Simulation-Based Optimization of Power Grid with On-Chip Voltage Regulator
AuthorTing Yu, *Martin D.F. Wong (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 531 - 536
KeywordPower grid, IR-drop, LDO
AbstractIR-drop values of power grid can be reduced through inserting on-chip low-dropout voltage regulators (LDO). In this paper, we explore the optimization of LDOs to meet the IR-drop constraint, where the maximum IR-drop value is less than 10% of power supply. With Cholesky direct solver and SPICE, we propose a method to simulate power grid with LDOs. Based on the simulation method, we develop an efficient flow to optimize the number and locations of the LDOs. Effectiveness of the proposed method is verified by the experimental results. To the best of our knowledge, this is the first work optimizing the number and locations of LDOs to meet the IR-drop constraint.
Slides

6C-2 (Time: 16:15 - 16:40)
TitleWalking Pads: Fast Power-Supply Pad-Placement Optimization
AuthorKe Wang (University of Virginia, U.S.A.), *Brett Meyer (McGill University, Canada), Runjie Zhang, Kevin Skadron, Mircea Stan (University of Virginia, U.S.A.)
Pagepp. 537 - 543
Keywordmulti-core, power delivery, C4 pad allocation, IR Drop, heuristic optimization
AbstractWe propose a novel C4 pad placement optimization framework for 2D power delivery grids: Walking Pads (WP). WP optimizes pad locations by moving pads according to the "virtual forces" exerted on them by other pads and current sources in the system. WP algorithms achieve the same IR drop as state-of-the-art techniques, but are up to 634X faster. We further propose an analytical model relating pad count and IR drop for determining the optimal pad count for a given IR drop budget.
Slides

6C-3 (Time: 16:40 - 17:05)
TitlePower Supply Noise-Aware Workload Assignments for Homogenous 3D MPSoCs with Thermal Consideration
Author*Yuanqing Cheng (LIRMM, France), Aida Todri-Sanial (CNRS/LIRMM, France), Alberto Bosio (University of Montpellier/LIRMM, France), Luigi Dilillo, Patrick Girard (CNRS/LIRMM, France), Arnaud Virazel (University of Montpellier/LIRMM, France)
Pagepp. 544 - 549
Keyword3D Homogeneous MPSoC, Workload Assignment, Power Supply Noise, Thermal
AbstractIn order to improve performance and reduce cost, multi-processor system on chip (MPSoC) is becoming increasingly attractive. At the same time, 3D integration emerges as a promising technology for high density integration. 3D homogeneous MPSoCs combine the benefits of both. However, high current demand and large on-chip switching activity variations introduce severe power supply noise (PSN) for 3D MPSoCs, which can increase critical path delay, and degrade chip performance and reliability. Meanwhile, thermal gradients should also be considered for 3D MPSoCs to avoid generating hotspots. In this paper, we investigate the PSN effects of different workloads and propose an effective PSN estimation method. Then, a heuristic workload assignment algorithm is proposed to suppress PSN under the given thermal constraint. The experimental results show that PSN can be reduced significantly compared with a thermal-balanced workload assignment scheme, and the system performance can be improved as well.
Slides

6C-4 (Time: 17:05 - 17:30)
TitleSwimmingLane: A Composite Approach to Mitigate Voltage Droop Effects in 3D Power Delivery Network
Author*Xing Hu (Institute of Computing Technology, University of Chinese Academy of Sciences, China), Yi Xu (Space Science Institute, Macau University of Science and Technology, Macau/Advanced Micro Devices Research China Laboratory, China), Yu Hu (Institute of Computing Technology, University of Chinese Academy of Sciences, China), Yuan Xie (Advanced Micro Devices, China/Pennsylvania State University, U.S.A.)
Pagepp. 550 - 555
Keyword3D chip, Voltage droop
AbstractDespite the promising features of rapid data transfer across layers, low transmission power and high device density, 3D integration technology also presents many challenges, one of which is power integrity. By stacking multiple dies vertically, 3D chips have a higher load than same-sized 2D chips, leading to larger voltage droop and exacerbating damage to power integrity. To alleviate this problem, we first analyze the impact of application behaviors on voltage droop in a 3D power delivery network (PDN) and observe that voltage droop is extremely imbalanced both across different layers and among the cores in the same layer. Then we propose a hardware and heuristic software co-design: (1) mitigating the interference among different dies via a layer-independent scheme, and (2) balancing the intra-layer voltage droop and reducing the worst-case margin via OS scheduling. Compared to conventional designs, our schemes can reduce power consumption by 18%, worst-case voltage droops by 13%, and the number of voltage violations by 40%.
Slides


Session BK  Banquet & Banquet Keynote
Time: 18:30 - 21:00 Wednesday, January 22, 2014
Location: Flower Field Hall, Gardens by the Bay
Chair: Masahiro Fujita (University of Tokyo, Japan)

BK-1 (Time: 19:30 - 20:00)
Title(Keynote Address) The Art of Innovation - How Singapore Will Continue to Drive the Progress in Semiconductor Technologies
AuthorUlf Schneider (Managing Director, Lantiq Asia Pacific/President, SSIA, Singapore)
AbstractSince the mid-1960s, Singapore has been an important pillar of the worldwide semiconductor industry, reinventing its portfolio, focus and strategy a few times to keep up with overall trends. Preparing for the next decade, Singapore’s industry, research and academia must again set the right directions and strategy to keep pace in an increasingly competitive global environment. The talk will cover some of the unique opportunities that Singapore has in this respect.



Thursday, January 23, 2014

Session 3K  Keynote III
Time: 8:30 - 9:30 Thursday, January 23, 2014
Location: Room 300
Chair: Naehyuck Chang (Seoul National University, Republic of Korea)

3K-1 (Time: 8:30 - 9:30)
Title(Keynote Address) Beyond Charge-Based Computing
AuthorKaushik Roy (Purdue University, U.S.A.)
AbstractThe trend towards ultra low power logic and low leakage embedded memories for System-On-Chips has prompted researchers to consider the possibility of replacing charge as the state variable for computation. Recent experiments on spin devices like magnetic tunnel junctions (MTJs), domain wall magnets (DWM) and spin valves have led to the possibility of using "spin" as the state variable for computation, achieving very high density on-chip memories and ultra low voltage logic. High density of memories can be exploited to develop memory-centric reconfigurable computing fabrics that provide significant improvements in energy efficiency and reliability compared to conventional FPGAs. While the possibility of having on-chip spin transfer torque memories is close to reality, several questions still exist regarding the energy benefits of spin as the state variable for logic computation. The latest experiments on lateral spin valves (LSV) have shown switching of nano-magnets using spin-polarized current injection through a metallic channel such as Cu. Such lateral spin valves, having multiple input magnets connected to an output magnet using metal channels, can be used to mimic "neurons". The spin-based neurons can be integrated with CMOS and other devices like phase change memories to realize ultra low-power data processing hardware based on neural networks, and are suitable for different classes of applications such as cognitive computing, programmable Boolean logic, and analog and digital signal processing. Note that for some of these applications, CMOS technologies may not be suitable for ultra low power implementation. In this talk I will first discuss the advantages of using spin (as opposed to charge) as the state variable for both memory and logic and then present how a cellular array of magneto-metallic devices, operating at terminal voltages of ~20mV, can do efficient hybrid digital/analog computation for applications such as cognitive computing.
Finally, I will consider recent advances in other non-charge based computing paradigms such as magnetic quantum cellular automata.


Session 7S  Special Session: Brain Like Computing: Modelling, Technology, and Architecture
Time: 10:10 - 12:15 Thursday, January 23, 2014
Location: Room 302
Chair: Ahmed Hemani (KTH, Sweden)

7S-1 (Time: 10:10 - 10:40)
Title(Invited Paper) Spiking Brain Models: Computation, Memory and Communication Constraints for Custom Hardware Implementation
Author*Anders Lansner, Ahmed Hemani, Nasim Farahini (KTH, Sweden)
Pagepp. 556 - 562
KeywordBrain, Neural network, custom VLSI, BCPNN, Associative memory
AbstractWe estimate the computational capacity required to simulate in real time the neural information processing in the human brain. We show that the computational demands of a detailed implementation are beyond reach of current technology, but that some biologically plausible reductions of problem complexity can give performance gains between two and six orders of magnitude, which put implementations within reach of tomorrow’s technology.

7S-2 (Time: 10:40 - 11:10)
Title(Invited Paper) Advanced Technologies for Brain-Inspired Computing
Author*Fabien Clermidy, Rodolphe Heliot, Alexandre Valentian (CEA-LETI, France), Christian Gamrat, Olivier Bichler, Marc Duranton (CEA-LIST, France), Bilel Belhadj, Olivier Temam (INRIA, France)
Pagepp. 563 - 569
KeywordNeuromorphic, Memristor, 3D TSV, 3D monolithic
AbstractThis paper aims at presenting how new technologies can overcome classical implementation issues of Neural Networks. Resistive memories such as Phase Change Memories and Conductive-Bridge RAM can be used for obtaining low-area synapses thanks to programmable resistance also called Memristors. Similarly, the high capacitance of Through Silicon Vias can be used to greatly improve analog neurons and reduce their area. The very same devices can also be used for improving connectivity of Neural Networks as demonstrated by an application. Finally, some perspectives are given on the usage of 3D monolithic integration for better exploiting the third dimension and thus obtaining systems closer to the brain.
Slides

7S-3 (Time: 11:10 - 11:40)
Title(Invited Paper) GPGPU Accelerated Simulation and Parameter Tuning for Neuromorphic Applications
AuthorKristofor D. Carlson, Michael Beyeler, *Nikil Dutt, Jeffrey L. Krichmar (UC Irvine, U.S.A.)
Pagepp. 570 - 577
Keywordgraphics processing units, spiking neural networks, evolutionary algorithms, GPUs, SNNs
AbstractNeuromorphic engineering takes inspiration from biology to design brain-like systems that are extremely low-power, fault-tolerant, and capable of adaptation to complex environments. The design of these artificial nervous systems involves both the development of neuromorphic hardware devices and the development of neuromorphic simulation tools. In this paper, we describe a simulation environment that can be used to design, construct, and run spiking neural networks (SNNs) quickly and efficiently using graphics processing units (GPUs). We then explain how the design of the simulation environment utilizes the parallel processing power of GPUs to simulate large-scale SNNs and describe recent modeling experiments performed using the simulator. Finally, we present an automated parameter tuning framework that utilizes the simulation environment and evolutionary algorithms to tune SNNs. We believe the simulation environment and associated parameter tuning framework presented here can accelerate the development of neuromorphic software and hardware applications by making the design, construction, and tuning of SNNs an easier task.
Slides

7S-4 (Time: 11:40 - 12:10)
Title(Invited Paper) A Scalable Custom Simulation Machine for the Bayesian Confidence Propagation Neural Network Model of the Brain
AuthorNasim Farahini, *Ahmed Hemani, Anders Lansner (KTH, Sweden), Fabien Clermidy (CEA-LETI, France), Christer Svensson (Linköping University, Sweden)
Pagepp. 578 - 585
KeywordBrain simulation, BCPNN, Custom supercomputer, Spiking Neural Network
AbstractA multi-chip custom digital supercomputer called eBrain is proposed for simulating the Bayesian Confidence Propagation Neural Network (BCPNN) model of the human brain. It uses Hybrid Memory Cube (HMC) 3D stacked DRAM for storing synaptic weights, integrated with a custom designed logic chip that implements the BCPNN model. In a 22nm node, eBrain executes BCPNN in real time at 740 TFlop/s while accessing 30 TB of synaptic weights with a bandwidth of 112 TB/s and consuming less than 6 kW of power in the typical case. This efficiency is three orders of magnitude better than that of general purpose supercomputers in the same technology node.


Session 7A  Power and Life Time Issues of Memory Subsystem
Time: 10:10 - 12:15 Thursday, January 23, 2014
Location: Room 300
Chairs: Muhammad Shafique (Karlsruhe Institute of Technology, Germany), Wei Zhang (Hong Kong University of Science and Technology, Hong Kong)

7A-1 (Time: 10:10 - 10:35)
TitleNo△: Leveraging Delta Compression for End-to-End Memory Access in NoC Based Multicores
Author*Jia Zhan, Matt Poremba (The Pennsylvania State University, U.S.A.), Yi Xu (AMD Research, China), Yuan Xie (Advanced Micro Devices, China/Pennsylvania State University, U.S.A.)
Pagepp. 586 - 591
KeywordNetwork-on-Chip, Data Compression
AbstractAs the number of on-chip processing elements increases, the interconnection backbone bears bursty traffic from memory and cache access. In this paper, we propose a compression technique called No△, which leverages delta compression to compress network traffic. Specifically, it conducts data encoding prior to packet injection and decoding before ejection in the network interface. The key idea of No△ is to store a data packet in the Network-on-Chip as a common base value plus an array of relative differences (△). It can improve the overall network performance and achieve energy savings because of the decreased network load. Moreover, this scheme does not require modifications of the cache storage design and is complementary to any optimization techniques for the on-chip interconnect. Our experiments reveal that the proposed No△ incurs negligible overhead and outperforms state-of-the-art zero-content compression and frequent-value compression.
Slides
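
As a rough illustration of the base-plus-differences idea described in the 7A-1 abstract, the Python sketch below encodes a packet of words as one base value and an array of deltas. The function names, the choice of the first word as the base, and the 8-bit signed delta width are assumptions made for illustration; they are not details taken from the paper.

```python
def delta_encode(words, delta_bits=8):
    """Return (base, deltas) if every delta fits in delta_bits signed bits, else None.

    Assumption: the first word is used as the common base value.
    """
    if not words:
        return None
    base = words[0]
    lo = -(1 << (delta_bits - 1))
    hi = (1 << (delta_bits - 1)) - 1
    deltas = [w - base for w in words]
    if all(lo <= d <= hi for d in deltas):
        return base, deltas
    return None  # deltas too wide: the packet would be sent uncompressed


def delta_decode(base, deltas):
    """Rebuild the original words from the base value and the deltas."""
    return [base + d for d in deltas]


packet = [0x1000, 0x1004, 0x1008, 0x0FFC]
encoded = delta_encode(packet)
```

In this sketch a packet of four 32-bit words would shrink to one 32-bit base plus four 8-bit deltas whenever the words are numerically close, which is the situation the abstract's approach targets.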

7A-2 (Time: 10:35 - 11:00)
TitleDPA: A Data Pattern Aware Error Prevention Technique for NAND Flash Lifetime Extension
AuthorJie Guo, Zhijie Chen (University of Pittsburgh, U.S.A.), Danghui Wang (Northwestern Polytechnical University, China), Zili Shao (The Hong Kong Polytechnic University, Hong Kong), *Yiran Chen (University of Pittsburgh, U.S.A.)
Pagepp. 592 - 597
KeywordNAND flash, Error pattern, Endurance
AbstractPrevious work reveals that the bit error rate of a NAND flash cell is highly dependent on the stored data patterns. Based on this observation, we propose a Data Pattern Aware (DPA) error protection technique to extend the lifespan of NAND flash based storage systems. DPA manipulates the ratio of 1’s and 0’s in the stored data to reduce the probability of the data patterns which are susceptible to noise. By minimizing the vulnerable data patterns, our scheme can effectively reduce the bit error rate and therefore improve the system endurance. Simulation results show that the DPA scheme can increase the flash system life expectancy by up to 4×, complementing the efforts of other orthogonal techniques like wear-leveling.
Slides

7A-3 (Time: 11:00 - 11:25)
TitleScattered Refresh: An Alternative Refresh Mechanism to Reduce Refresh Cycle Time
Author*T. Venkata Kalyan, Ravi Kasha, Madhu Mutyam (Indian Institute of Technology - Madras, India)
Pagepp. 598 - 603
KeywordDRAM, High-density, Refresh, Row-Mapping
AbstractWith the realization of high density DRAM devices, the amount of time spent in refreshing a DRAM bank is increasing. This reduces the availability of the bank to requests from the processing cores, leading to degradation in performance. In this work we aim to reduce the refresh cycle time of the DRAM device by scattering the rows in a refresh operation to different subarrays and leveraging the available parallelism in their access. Considering 8Gb devices, we show that Scattered Refresh improves overall system performance by up to 10.2%. Being orthogonal to the existing refresh handling techniques, Scattered Refresh can be employed along with any of them, boosting their effectiveness further.
Slides

7A-4 (Time: 11:25 - 11:50)
TitleA Read-Write Aware DRAM Scheduling for Power Reduction in Multi-Core Systems
Author*Chih-Yen Lai, Gung-Yu Pan, Hsien-Kai Kuo (Department of Electronics Engineering & Institute of Electronics, National Chiao Tung University, Taiwan), Jing-Yang Jou (Department of Electrical Engineering and Department of Electronics Engineering, National Central University/Institute of Electronics, National Chiao Tung University, Taiwan)
Pagepp. 604 - 609
KeywordLow power, DRAM, Scheduling, Multi-core
AbstractThe demand for high performance and low power has increased the importance of power efficiency in multi-core systems. In modern multi-core architectures, DRAM dominates the power consumption, and therefore reordering based DRAM scheduling has been intensively studied to reduce the power. However, the benefit of reordering has not been fully explored by previous studies. To further reduce the power, this paper proposes read-write reordering and read-write aware throttling. Compared to the existing work, the proposed techniques reduce DRAM power by a further 10% with slight performance degradation.
Slides

7A-5 (Time: 11:50 - 12:15)
TitleA Coherent Hybrid SRAM and STT-RAM L1 Cache Architecture for Shared Memory Multicores
Author*Jianxing Wang, Yenni Tim, Weng-Fai Wong, Zhong-Liang Ong (National University of Singapore, Singapore), Zhenyu Sun, Hai (Helen) Li (University of Pittsburgh, U.S.A.)
Pagepp. 610 - 615
KeywordCache, STT-RAM, MESI
AbstractSTT-RAM is an emerging NVRAM technology that promises high density, low energy and an access speed comparable to conventional SRAM. This paper proposes a hybrid L1 cache architecture that incorporates both SRAM and STT-RAM. By exploiting the MESI coherence protocol to perform dynamic block reallocation between different cache partitions, our hybrid scheme achieves a 38% energy saving with a mere 0.8% decline in IPC while extending the lifespan of the STT-RAM partition.
Slides


Session 7B  Advances in High-Level and Logic Synthesis
Time: 10:10 - 12:15 Thursday, January 23, 2014
Location: Room 301
Chairs: Yuko Hara-Azumi (Nara Institute of Science and Technology, Japan), Robert Wille (University of Bremen, Germany)

7B-1 (Time: 10:10 - 10:35)
TitleAllocation of FPGA DSP-Macros in Multi-Process High-Level Synthesis Systems
Author*Benjamin Carrion Schafer (The Hong Kong Polytechnic University, Hong Kong)
Pagepp. 616 - 621
KeywordHigh-Level Synthesis, Design Space Exploration, DSP-macros, FPGAs
AbstractHigh-Level Synthesis (HLS) is a single process synthesis method that has been shown to produce very good results compared to hand coded RTL, especially for DSP-related applications. At the same time, FPGAs are reaching capacities that allow entire systems to be implemented on them. Most of these systems are also DSP-related and make intensive use of the FPGAs’ embedded hardmacros (e.g. DSP-blocks). This work presents a method to efficiently allocate DSP-macros in multi-process systems created using HLS in order to minimize the overall area. The proposed method calculates the area sensitivity of each process when its multiply-accumulate (MAC) operations are mapped either onto the FPGA’s hardmacros or onto its configurable resources, and allocates the available hardmacros across all processes. Experimental results show that our method creates very good results compared to the optimal solution at a negligible running time.
Slides

7B-2 (Time: 10:35 - 11:00)
TitleArray Scalarization in High Level Synthesis
AuthorPreeti Ranjan Panda, *Namita Sharma (Indian Institute of Technology Delhi, India), Arun Kumar Pilania, Gummidipudi Krishnaiah, Sreenivas Subramoney, Ashok Jagannathan (Intel Technology India Pvt. Ltd., India)
Pagepp. 622 - 627
KeywordHigh level synthesis, Behavioral Synthesis, Array Scalarization
AbstractParallelism across loop iterations present in behavioral specifications can typically be exposed and optimized using well known techniques such as Loop Unrolling. However, since behavioral arrays are usually mapped to memories (SRAM) during synthesis, performance bottlenecks arise due to memory port constraints. We study array scalarization, the transformation of an array into a group of scalar variables. We propose a technique for selectively scalarizing arrays for improving the performance of synthesized designs by taking into consideration the latency benefits as well as the area overhead caused by using discrete registers for storing array elements instead of denser SRAM. Our experiments on several benchmark examples indicate promising speedups of more than 10x for several designs due to scalarization.
Slides

7B-3 (Time: 11:00 - 11:25)
TitleData Compression via Logic Synthesis
Author*Luca Amaru, Pierre-Emmanuel Gaillardon (EPFL-LSI, Switzerland), Andreas Burg (EPFL-TCL, Switzerland), Giovanni De Micheli (EPFL-LSI, Switzerland)
Pagepp. 628 - 633
KeywordLogic Synthesis, Data Compression
AbstractNowadays, most software and hardware applications are committed to reducing the footprint and resource usage of data. In this general context, lossless data compression is a beneficial technique that encodes information using fewer (or at most an equal number of) bits as compared to the original representation. A traditional compression flow consists of two phases: data decorrelation and entropy encoding. Data decorrelation, also called entropy reduction, aims at reducing the autocorrelation of the input data stream to be compressed in order to enhance the efficiency of entropy encoding. Entropy encoding reduces the size of the previously decorrelated data by using techniques such as Huffman coding, arithmetic coding, and others. When the data decorrelation is optimal, entropy encoding produces the strongest lossless compression possible. While efficient solutions for entropy encoding exist, data decorrelation is still a challenging problem limiting ultimate lossless compression opportunities. In this paper, we use logic synthesis to remove redundancy in binary data, aiming to unlock the full potential of lossless compression. Embedded in a complete lossless compression flow, our logic synthesis based methodology is capable of identifying the underlying function correlating a data set. Experimental results on data sets deriving from different causal processes show that the proposed approach achieves the highest compression ratio compared to state-of-the-art compression tools such as ZIP, bzip2 and 7zip.
Slides
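
The two-phase flow described in the 7B-3 abstract (decorrelation, then entropy encoding) can be illustrated with a small Python sketch. Note that it swaps in simple delta coding as the decorrelation step, not the logic-synthesis method of the paper, and it only measures empirical Shannon entropy as a proxy for how well an entropy coder could do. All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy_bits_per_symbol(data):
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * log2(c / n) for c in counts.values())


def delta_decorrelate(data):
    """Stand-in decorrelation step: keep the first value, then successive differences."""
    return [data[0]] + [b - a for a, b in zip(data, data[1:])]


ramp = list(range(64))            # strongly correlated input: 0, 1, 2, ..., 63
residual = delta_decorrelate(ramp)  # 0 followed by sixty-three 1s

h_before = entropy_bits_per_symbol(ramp)
h_after = entropy_bits_per_symbol(residual)
```

The ramp has 64 distinct, equally frequent symbols, so its empirical entropy is 6 bits per symbol, while the residual is almost constant and its entropy falls well below 1 bit per symbol. This is the sense in which a better decorrelation step leaves less work for the entropy coder.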

7B-4 (Time: 11:25 - 11:50)
TitleSynthesis of Power- and Area-Efficient Binary Machines for Incompletely Specified Sequences
Author*Nan Li, Elena Dubrova (Royal Institute of Technology, Sweden)
Pagepp. 634 - 639
KeywordLFSR, NLFSR, binary machine, LBIST, PRNG
AbstractBinary Machines (BMs) are a generalization of Linear Feedback Shift Registers (LFSRs) in which the current state is a nonlinear function of the previous state. It is known how to construct a BM generating a given completely specified binary sequence. In this paper, we present an algorithm which can efficiently handle the case of incompletely specified sequences. Our experimental results show that it significantly outperforms the approaches based on all-0 or random fill in both area and power dissipation. On average, it reduces dynamic power dissipation by a factor of two compared to the all-0 fill approach and by a factor of six compared to the random fill approach. The presented algorithm can potentially be useful for many applications, including Logic Built-In Self Test (LBIST).
Slides
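
To make the relationship between LFSRs and Binary Machines in the 7B-4 abstract concrete, the Python sketch below models a BM as one update function per state bit and expresses a 4-bit LFSR as a special case. It illustrates the definition only and is not the synthesis algorithm of the paper; the names `bm_step`, `LFSR4`, `NLFSR4` and `generate` are hypothetical.

```python
def bm_step(state, update_fns):
    """One binary-machine step: bit i of the next state is update_fns[i](state)."""
    return tuple(f(state) for f in update_fns)


# A 4-bit LFSR written as a binary machine: three bits shift, one bit is an XOR.
# The feedback s[0] ^ s[1] corresponds to the polynomial x^4 + x + 1.
LFSR4 = (
    lambda s: s[1],
    lambda s: s[2],
    lambda s: s[3],
    lambda s: s[0] ^ s[1],
)

# A nonlinear variant (illustrative only): the feedback also contains an AND term.
NLFSR4 = LFSR4[:3] + (lambda s: s[0] ^ s[1] ^ (s[2] & s[3]),)


def generate(state, update_fns, n):
    """Emit n output bits, taking bit 0 of the state as the output at each step."""
    out = []
    for _ in range(n):
        out.append(state[0])
        state = bm_step(state, update_fns)
    return out


sequence = generate((1, 0, 0, 0), LFSR4, 30)
```

Starting from the state (1, 0, 0, 0), the LFSR output repeats with period 15. A general BM may use an arbitrary Boolean function for every bit, as `NLFSR4` hints for a single bit.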

7B-5 (Time: 11:50 - 12:15)
TitleMulti-Mode Trace Signal Selection for Post-Silicon Debug
Author*Min Li, Azadeh Davoodi (University of Wisconsin - Madison, U.S.A.)
Pagepp. 640 - 645
Keywordpost-silicon debug, trace buffers
AbstractTrace buffers are used during post-silicon debug to allow restoring the internal signals of a chip via online tracing of a few state elements within a capture window. In this work, we show that the quality of restoration corresponding to a set of trace signals, selected for a single operating mode, may significantly degrade over the remaining operating modes of a design. This is the first work to study the multi-mode trace signal selection problem in order to maximize the restoration over all the operating modes.
Slides


Session 7C  Advanced Test Solutions
Time: 10:10 - 12:15 Thursday, January 23, 2014
Location: Room 303
Chairs: Jiun-Lang Huang (National Taiwan University, Taiwan), Mango Chia-Tso Chao (National Chiao Tung University, Taiwan)

7C-1 (Time: 10:10 - 10:35)
TitleImplicit Intermittent Fault Detection in Distributed Systems
Author*Peter Waszecki, Matthias Kauer, Martin Lukasiewycz (TUM CREATE, Singapore), Samarjit Chakraborty (TU Munich, Germany)
Pagepp. 646 - 651
Keywordfault detection, distributed systems, reliability, automotive
AbstractThis paper presents a novel approach to detect resources in distributed systems with an increased occurrence of intermittent faults that exceed the amount of unavoidable transient faults caused by environmental phenomena. Intermittent faults occur due to stressed resources and often are a precursor of permanent faults. The proposed early fault detection and diagnosis allows the use of precautionary measures before the permanent failure of a component in a distributed system occurs. In this paper, we present four methods that can implicitly detect intermittent faults by taking the distributed applications and their dependencies into account. Thus, explicit tests are not required which would lead to additional costs and resource load. On the other hand, the implicit approach may considerably reduce the number of plausibility tests compared to the conservative solution with one test per resource. We analyzed and evaluated implementations of the proposed fault detection principle. The experimental results give evidence of the feasibility of our approach and show a comparison of the implemented methods in terms of runtime and detection rate.
Slides

7C-2 (Time: 10:35 - 11:00)
TitleA Segmentation-Based BISR Scheme
AuthorGeorgios Zervakis, Nikolaos Eftaxiopoulos, Kostas Tsoumanis, Nicholas Axelos, *Kiamal Pekmestzi (National Technical University of Athens, Greece)
Pagepp. 652 - 657
KeywordBuilt-In Self-Repair, segmentation-based, reparability, memory
AbstractWith memory estate increasing in System-On-Chips and highly integrated products, memory defects and wearout effects are the determining factor in the chip’s yield loss and reliability. In this paper, a multiple cache-based Built-In Self-Repair scheme is proposed that is able to repair from the word level down to the bit level. Moreover, it is proved that the level of segmentation does not affect the repair efficiency. An exploration is then conducted to find the optimal scheme in terms of area overhead.
Slides

7C-3 (Time: 11:00 - 11:25)
TitleFault-Tolerant TSV by Using Scan-Chain Test TSV
Author*Fu-Wei Chen, Hui-Ling Ting, TingTing Hwang (National Tsing Hua University, Taiwan)
Pagepp. 658 - 663
Keyword3-D IC, through-silicon-via, redundant TSV, 3D scan-chain, Fault-tolerant
AbstractIn order to increase the yield of 3-D ICs, a fault-tolerance technique to recover failed TSVs is essential. In this paper, an architecture for TSV recovery using scan-chain test TSVs is proposed. With this architecture, only a small number of redundant TSVs need to be inserted. The extra TSV area incurred by our method is much less than that of other methods. Moreover, a 3D-IC scan-chain optimization algorithm is proposed that takes into consideration the locations of functional TSVs as well as test TSVs, so that the total number of TSVs in a 3-D IC design, including test TSVs and extra redundant TSVs, is effectively reduced.
Slides

7C-4 (Time: 11:25 - 11:50)
TitleSuppressing Test Inflation in Shared-Memory Parallel Automatic Test Pattern Generation
AuthorJerry C. Y. Ku, Ryan H.-M. Huang, Louis Y. -Z. Lin, *Charles H.-P. Wen (Dept. of Elec. Comp. Engr., National Chiao Tung University, Taiwan)
Pagepp. 664 - 669
Keywordparallel ATPG, test inflation
AbstractMulti-core machines enable parallel computing in Automatic Test Pattern Generation (ATPG). With sufficient computing power, previously proposed parallel ATPG has reached near linear speedup. However, test inflation in parallel ATPG still arises as a critical problem and limits its practicality. Therefore, we developed a parallel ATPG system that incorporates (1) concurrent interruption (CI), (2) ripple compaction (RC) and (3) fan-in-cone based fault ordering (FIC) to deal with this problem. Concurrent interruption aborts test generation on faults simultaneously detected by fault simulation. Ripple compaction combines tests for different faults, while fan-in-cone based fault ordering strategically arranges the fault list to reduce the number of test generations and thus speeds up the ATPG process. As a result, the proposed parallel ATPG system effectively reduces the pattern count by 11% with ~0% test inflation while maintaining an average 6.5X speedup with no attenuation in fault coverage on experimental circuits.

7C-5 (Time: 11:50 - 12:15)
TitleA Volume Diagnosis Method for Identifying Systematic Faults in Lower-Yield Wafer Occurring during Mass Production
Author*Tsutomu Ishida, Izumi Nitta (Fujitsu Laboratories LTD., Japan), Koji Banno (Fujitsu Semiconductor LTD., Japan), Yuzi Kanazawa (Fujitsu Laboratories LTD., Japan)
Pagepp. 670 - 675
KeywordVolume diagnosis, Combinatorial optimization
AbstractThis work focuses on volume diagnosis for identifying systematic faults in lower-yield wafers, whose yields are lower than the baseline level due to systematic faults during mass production. We develop a model-based volume diagnosis method. To diagnose accurately using the fail data from a single lower-yield wafer, we apply modeling techniques for handling pseudo-faults and random faults in the fail data. Experimental results show our method’s efficiency; we succeeded in identifying the failure layer for 20 of 22 data sets from actual lower-yield wafers.
Slides


Session 8S  Special Session: Design Flow for Integrated Circuits using Magnetic Tunnel Junction Switched by Spin Orbit Torque
Time: 13:50 - 15:30 Thursday, January 23, 2014
Location: Room 302
Organizer: Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)

8S-1 (Time: 13:50 - 14:15)
Title(Invited Paper) An Overview of Spin-Based Integrated Circuits
AuthorWang Kang (Beihang University, China/IEF, Université Paris-Sud, France), *Weisheng Zhao, Zhaohao Wang, Jacques-Olivier Klein, Yue Zhang, Djaafar Chabi (IEF, Université Paris-Sud, France), Youguang Zhang (Beihang University, China), Dafiné Ravelosona, Claude Chappert (IEF, Université Paris-Sud, France)
Pagepp. 676 - 683
Keywordnon-volatile, fast speed, spintronics, low power
AbstractConventional CMOS integrated circuits suffer from severe power and scalability challenges as technology scales into ultra-deep-submicron nodes. Alternative approaches beyond charge-only based circuits are therefore needed. In particular, spin-based devices and integrated circuits show promising merits to overcome these issues by adding the spin degree of freedom of electrons to electronic circuits. Spintronics has now become a hot topic in both academia and industry. This paper gives an overview of the status and prospects of spin-based integrated circuits under intense investigation, and addresses in particular their merits and challenges for practical applications.
Slides

8S-2 (Time: 14:15 - 14:40)
Title(Invited Paper) Advances in Spintronics Devices for Microelectronics - from Spin-Transfer Torque to Spin-Orbit Torque
Author*Shunsuke Fukami, Hideo Sato, Michihiko Yamanouchi, Shoji Ikeda, Fumihiro Matsukura, Hideo Ohno (Tohoku University, Japan)
Pagepp. 684 - 691
Keywordspintronics, spin-transfer torque, spin-orbit torque, magnetic random access memory, magnetic tunnel junction
AbstractRecent advances in spintronics devices make it possible to open a new era of microelectronics. In this paper, we review the spintronics devices utilizing spin-transfer torques (STTs) and spin-orbit torques (SOTs) developed in recent years. Progress on the two-terminal STT device with a CoFeB-MgO based magnetic tunnel junction (MTJ), the three-terminal magnetic domain wall (DW) motion device with a Co/Ni multilayer, and the three-terminal SOT device with a Cu-based channel is described. Integrated circuits with the developed spintronics devices are also reviewed.

8S-3 (Time: 14:40 - 15:05)
Title(Invited Paper) Hybrid CMOS/Magnetic Process Design Kit and SOT-Based Non-Volatile Standard Cell Architectures
Author*Gregory Di Pendina, Kotb Jabeur, Guillaume Prenat (Spintec Laboratory, CEA-INAC/CNRS/UJF/G-INP, France)
Pagepp. 692 - 699
KeywordSpin Orbit Torque, Compact model, Standard Cell, Magnetic Random Access Memory, Process Design Kit
AbstractThis paper gives an overview of hybrid CMOS/magnetic logic circuit design. We describe the magnetic devices, the expected advantages of using them alongside CMOS to help circumvent the upcoming limits of VLSI circuits, and the tools required to design such circuits, including the Process Design Kit (PDK) and Standard Cells (SC). As a case study, we focus in particular on a new and promising device technology based on the Spin Orbit Torque (SOT) effect.
Slides

8S-4 (Time: 15:05 - 15:30)
Title(Invited Paper) Architectural Aspects in Design and Analysis of SOT-Based Memories
AuthorRajendra Bishnoi, Mojtaba Ebrahimi, Fabian Oboril, *Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 700 - 707
KeywordSpin Orbit Torque, non-volatile memory, magnetic memory, design space exploration, cache
AbstractMagnetic Random Access Memory (MRAM) and in particular SOT-MRAM is a promising emerging memory technology because of its various advantages. In this work, we provide an analysis of SOT-MRAM at circuit- and architecture-level, and compare SOT-MRAM with several other technologies. Our architecture-level analysis shows that a hybrid-combination of SRAM and SOT-MRAM for the L1- and L2-cache, respectively, can significantly reduce area and energy while the performance slightly increases.
Slides


Session 8A  Analysis, Optimization, and Scheduling for Multiprocessor Platforms
Time: 13:50 - 15:30 Thursday, January 23, 2014
Location: Room 300
Chairs: Sebastian Steinhorst (TUM CREATE, Singapore), Akash Kumar (National University of Singapore, Singapore)

8A-1 (Time: 13:50 - 14:15)
TitleTiming Anomalies in Multi-Core Architectures due to the Interference on the Shared Resources
Author*Hardik Shah, Kai Huang, Alois Knoll (Technical University Munich, Germany)
Pagepp. 708 - 713
KeywordMulti-core, WCET, Interference, shared memory
AbstractTiming anomalies in single-core processors are a theoretically explained and well understood phenomenon. This paper presents new timing anomalies which occur in multi-core architectures due to the interference on the shared resources. We derive a formulation to capture these anomalies and provide practical evidence using real applications from the Mälardalen WCET benchmark suite executing on a NIOS II multi-core architecture on an Altera FPGA.
Slides

8A-2 (Time: 14:15 - 14:40)
TitleA Unified Online Directed Acyclic Graph Flow Manager for Multicore Schedulers
Author*Karim Kanoun, David Atienza (École Polytechnique Fédérale de Lausanne, Switzerland), Nicholas Mastronarde (State University of New York at Buffalo, U.S.A.), Mihaela van der Schaar (University of California, Los Angeles, U.S.A.)
Pagepp. 714 - 719
KeywordDirected Acyclic Graph (DAG), Online task graph analyzer, Parallel processing, Multimedia embedded systems, Online energy-efficient scheduler
AbstractThe Directed Acyclic Graph (DAG) monitoring solutions used by existing energy-efficient schedulers to analyze DAGs make a priori assumptions about the workload and the relationship between the task dependencies. Thus, these schedulers are restricted to a limited subset of DAG models. To address this problem, we propose a unified online DAG monitoring solution for all possible DAG models to assist online schedulers. We validate our approach using an H.264 video decoding application and synthetic DAG models.
Slides

8A-3 (Time: 14:40 - 15:05)
TitleVariation-Aware Statistical Energy Optimization on Voltage-Frequency Island Based MPSoCs under Performance Yield Constraints
Author*Song Jin (Department of Electronic and Communication Engineering, School of Electrical and Electronic Engineering, North China Electric Power University, China), Yinhe Han (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China), Songwei Pei (Department of Computer Science and Technology, Beijing University of Chemical Technology, China)
Pagepp. 720 - 725
Keywordenergy efficiency, process variation, voltage-frequency island, statistical design, performance yield
AbstractEnergy efficiency is a primary design concern for embedded multiprocessor systems-on-chip (MPSoCs). Recently, the Voltage-Frequency Island (VFI)-based design paradigm was introduced for fine-grained power management, which can be seamlessly combined with the task scheduling algorithm to optimize system energy. However, ever-increasing variability causes large uncertainty in delay and power. This statistical nature of the performance parameters makes it hard for deterministic energy optimization to achieve a desirable performance yield, defined as the probability of the design meeting the timing constraints of the system. In this paper, we propose a variation-aware statistical energy optimization framework, which takes performance yield constraints into account in the energy-aware task scheduling, voltage assignment, and VFI partitioning process. Energy optimization sensitivity, defined as the energy variation of a task under voltage scaling, is combined with the statistical slack of the task to guide the overall optimization flow. Experimental results demonstrate the effectiveness of the proposed scheme.
Slides

8A-4 (Time: 15:05 - 15:30)
TitleQoS-Aware Dynamic Resource Allocation for Spatial-Multitasking GPUs
Author*Paula Aguilera, Katherine Morrow, Nam Sung Kim (University of Wisconsin - Madison, U.S.A.)
Pagepp. 726 - 731
KeywordGPGPU, QoS, spatial multitasking, resource partitioning
AbstractGPGPU computing is becoming widely adopted. Some GPGPU applications fail to fully utilize available GPU resources, motivating the use of spatial multitasking (partitioning resources between simultaneously running applications). When applications have quality-of-service (QoS) requirements, enough resources must be allocated to satisfy those requirements. Remaining resources can be disabled to reduce power consumption or used to accelerate other applications. We propose a runtime algorithm to dynamically partition GPU resources between concurrently running applications when at least one has QoS requirements.
Slides


Session 8B  Advances in Formal Verification and Debugging
Time: 13:50 - 15:30 Thursday, January 23, 2014
Location: Room 301
Chairs: Charles H.-P. Wen (National Chiao Tung University, Taiwan), Vishvender Singh (Infineon Technologies Asia-Pacific, Singapore)

8B-1 (Time: 13:50 - 14:15)
TitleAutomated Debugging of Missing Assumptions
AuthorBrian Keng (University of Toronto, Canada), Evean Qin (Vennsa Technologies Inc., Canada), *Andreas Veneris, Bao Le (University of Toronto, Canada)
Pagepp. 732 - 737
KeywordDebugging, Assumptions, Verification, Formal
AbstractFormal verification has increased efficiency by detecting corner case design bugs but it has also introduced new challenges when failures are detected. Once a counter-example is returned by a formal tool, the user typically does not know if the failure is caused by a design bug, an incorrectly written assertion, or a missing assumption. Previous work in debug automation has focused on the first two cases. This paper introduces a novel methodology to automatically debug missing assumptions. It begins by generating multiple formal counter-examples for the error. Next, a function is extracted from these counter-examples that encodes the input combinations that cause the assertion to fail. This function is later used to generate a list of fixed cycle assumptions that prevent failures similar to the generated counter-examples. These filtered assumptions can then be used as hints for the actual missing assumption. Further, if a missing assumption is not the cause of the failure, the method offers the additional benefit that the counter-examples it generates can be utilized to debug the RTL and/or the assertion. An extensive set of experimental results on OpenCores designs and assertions shows that the number of generated assumptions can be reduced by an average of 38% using ten counter-examples, while an average of 28 assumptions is returned to the user.
Slides

8B-2 (Time: 14:15 - 14:40)
TitleProperty Directed Reachability for QF_BV with Mixed Type Atomic Reasoning Units
Author*Tobias Welp (University of California, Berkeley, U.S.A.), Andreas Kuehlmann (Coverity, Inc./University of California, Berkeley, U.S.A.)
Pagepp. 738 - 743
KeywordIC3, PDR, Model Checking, QF_BV
AbstractA generalization of Property Directed Reachability (PDR) for the theory QF_BV presented at DATE 2013 outperforms the original formulation if the required inductive invariant can be represented efficiently as a set of polytopes. However, many QF_BV model checking instances do not belong in this class and can be solved quickly with the original PDR algorithm. In this paper, we present a hybrid approach which uses both polytopes and Boolean cubes as atomic reasoning units combining the advantages of either homogeneous approach. We discuss theoretic properties of the presented algorithm and report experimental results demonstrating its effectiveness.
Slides

8B-3 (Time: 14:40 - 15:05)
TitleAdaptive Interpolation-Based Model Checking
Author*Chien-Yu Lai, Cheng-Yin Wu, Chung-Yan (Ric) Huang (National Taiwan University, Taiwan)
Pagepp. 744 - 749
Keywordinterpolation, model checking, formal verification
AbstractInterpolation-based model checking (IMC) is an important technique in modern formal verification tools. In essence, it relies on an abstraction and refinement process to derive an adequate image approximation for the reachability analysis. However, previous IMC algorithms only offer fixed degrees of abstraction and thus may fail in the proofs if the abstraction is too coarse- or fine-grained. In this paper, we propose an adaptive interpolation-based model checking algorithm in which the degree of abstraction can be adjusted on demand. That is, during the proof process, we closely monitor the effectiveness of the interpolation-based over-approximated image computation and thus adjust the degree of abstraction for the best performance. The experimental results confirm that our flexible interpolation indeed leads to an adequate degree of abstraction as our IMC algorithm outperforms previous ones in various aspects.
Slides

8B-4 (Time: 15:05 - 15:30)
TitleEfficient Parallel GPU Algorithms for BDD Manipulation
Author*Miroslav Velev, Ping Gao (Aries Design Automation, U.S.A.)
Pagepp. 750 - 755
KeywordBinary Decision Diagrams, Boolean Satisfiability, Formal Verification, GPU, Parallel Execution
AbstractWe present parallel algorithms for Binary Decision Diagram (BDD) manipulation optimized for efficient execution on Graphics Processing Units (GPUs). Compared to a sequential CPU-based BDD package with the same capabilities, our GPU implementation achieves at least 5 orders of magnitude speedup. To the best of our knowledge, this is the first work on using GPUs to accelerate a BDD package.


Session 8C  Advances in CAD Techniques for Signal Integrity
Time: 13:50 - 15:30 Thursday, January 23, 2014
Location: Room 303
Chairs: Rung-Bin Lin (Yuan Ze University, Taiwan), Sheldon Tan (University of California, Riverside, U.S.A.)

8C-1 (Time: 13:50 - 14:15)
TitleEfficient Techniques for the Capacitance Extraction of Chip-Scale VLSI Interconnects Using Floating Random Walk Algorithm
Author*Chao Zhang, Wenjian Yu (Tsinghua University, China)
Pagepp. 756 - 761
Keywordcapacitance extraction, floating random walk, Gaussian surface generation, parallel computing
AbstractTo enable the capacitance extraction of large chip-scale VLSI layouts using the floating random walk (FRW) algorithm, two techniques are proposed. The first is a virtual Gaussian surface sampling technique. It is used to construct the Gaussian surface for complex nets with vias, and optimizes the sampling and placement of the Gaussian surface to reduce the random walk time. The second is a parallelized, improved construction approach for the Octree-based space management structure. It can be over 5000X faster than the existing approach and provides the same convenience to the FRW procedure. Numerical experiments on large cases with up to half a million conductors validate the proposed techniques, and demonstrate a fast FRW solver for chip-scale extraction tasks.
Slides

8C-2 (Time: 14:15 - 14:40)
Title3DLAT: TSV-Based 3D ICs Crosstalk Minimization Utilizing Less Adjacent Transition Code
Author*Qiaosha Zou, Dimin Niu, Yan Cao (Pennsylvania State University, U.S.A.), Yuan Xie (Advanced Micro Devices, China/Pennsylvania State University, U.S.A.)
Pagepp. 762 - 767
Keyword3D IC, capacitive crosstalk, power saving
Abstract3D integration is one of the promising solutions to overcome the interconnect bottleneck with vertical interconnect through-silicon vias (TSVs). This paper investigates the crosstalk in 3D IC designs, especially the capacitive crosstalk in TSV interconnects. We propose a novel w-LAT coding scheme to reduce the capacitive crosstalk and minimize the power consumption overhead in the TSV array. Combined with transition signaling, the LAT coding scheme restricts the number of transitions in every transmission cycle to minimize the crosstalk and power consumption. Compared to other 3D crosstalk minimization coding schemes, the proposed coding can provide the same delay reduction with affordable overhead. The performance and power analysis show that when w is 4, the proposed LAT coding scheme can achieve a 38% interconnect crosstalk delay reduction compared to data transmission without coding. By reducing the value of w, further reduction can be achieved.
Slides

8C-3 (Time: 14:40 - 15:05)
TitleTackling Close-to-Band Passivity Violations in Passive Macro-Modeling
Author*Moning Zhang, Zuochang Ye (Tsinghua National Laboratory for Information Science and Technology, Institute of Microelectronics, Tsinghua University, China)
Pagepp. 768 - 773
Keywordpassivity enforcement, S parameter, state-space model
AbstractPassivity enforcement is important for macromodeling of passive systems from measured or simulated S-parameter data. State-space systems generated from vector fitting usually present strong passivity violations outside the frequency bandwidth, especially in the close-to-band region. Removing such close-to-band violations is very difficult with existing passivity enforcement techniques without severely sacrificing the accuracy of the model. In this paper we propose a frequency data extension method which aims to reduce or even eliminate such close-to-band violations without sacrificing model accuracy. The generated model can be used in a later stage for further passivity enforcement. Experiments show that by applying the proposed method, the accuracy of the generated model can be significantly improved.
Slides

8C-4 (Time: 15:05 - 15:30)
TitleHIE-Block Latency Insertion Method for Fast Transient Simulation of Nonuniform Multiconductor Transmission Lines
Author*Takahiro Takasaki, Tadatoshi Sekine, Hideki Asai (Shizuoka University, Japan)
Pagepp. 774 - 779
Keywordblock latency insertion method, fast circuit simulation, hybrid implicit-explicit scheme, nonuniform multiconductor transmission lines, numerical stability condition
AbstractThis paper describes a hybrid implicit-explicit block latency insertion method (HIE-block-LIM) for the fast simulation of nonuniform multiconductor transmission lines. In the HIE-block-LIM, an implicit difference method is used with respect to the current variables in one direction, and an explicit method is adopted to update the other variables. The HIE-block-LIM can alleviate the time step size limitation of the existing block-LIM by taking advantage of both the explicit and implicit difference methods.
Slides


Session 9S  Special Session: The Role of Photons in Harming or Increasing Security
Time: 15:50 - 17:30 Thursday, January 23, 2014
Location: Room 302
Organizers: Francesco Regazzoni (University of Lugano, Switzerland), Edoardo Charbon (Delft University of Technology, Netherlands)

9S-1 (Time: 15:50 - 16:30)
Title(Invited Paper) The Role of Photons in Cryptanalysis
Author*Juliane Krämer (University Berlin, Germany), Michael Kasper (Fraunhofer Institute for Secure Information Technology, Germany), Jean-Pierre Seifert (University Berlin, Germany)
Pagepp. 780 - 787
KeywordPhotonic Side Channel, AES, SPEA, DPEA
AbstractPhotons can be exploited to reveal secrets of security ICs like smartcards, secure microcontrollers, and cryptographic coprocessors. One such secret is the secret key of cryptographic algorithms. This work gives an overview of current research on revealing these secret keys by exploiting the photonic side channel. Different analysis methods are presented. It is shown that the analysis of photonic emissions also helps to gain knowledge about the attacked device and thus poses a threat to modern security ICs. The presented results illustrate the differences between the photonic side channel and other side channels, which do not provide fine-grained spatial information. It is shown that the photonic side channel has to be addressed by software engineers and during chip design.

9S-2 (Time: 16:30 - 17:10)
Title(Invited Paper) SPADs for Quantum Random Number Generators and Beyond
AuthorSamuel Burri (EPFL, Switzerland), Damien Stucki (ID Quantique, Switzerland), Yuki Maruyama (Delft University of Technology, Netherlands), Claudio Bruschini (EPFL, Switzerland), Edoardo Charbon (Delft University of Technology, Netherlands), *Francesco Regazzoni (ALaRI - USI, Switzerland)
Pagepp. 788 - 794
KeywordSPADs, Random Number Generators, Security
AbstractThis paper explores the design of a QRNG based on a massively parallel array of SPADs. The matrix comprises 512x128 independent cells that convert photons into a raw stream of random bits. The sequences are read out over a 128-bit parallel bus, concatenated, and pipelined into a de-biasing filter. Reported results, achieved on the manufactured devices, show that the architecture can reach up to 5 Gbit/s while consuming 25 pJ/bit, demonstrating scalability and performance for any RNG based on SPADs.

9S-3 (Time: 17:10 - 17:50)
Title(Invited Paper) Quantum Key Distribution with Integrated Optics
Author*Mirko Lobino (Griffith University, Australia), Anthony Laing (University of Bristol, U.K.), Pei Zhang (Xi'an Jiaotong University, China), Kanin Aungskunsiri, Enrique Martin-Lopez (University of Bristol, U.K.), Joachim Wabnig (Nokia Research Centre, U.K.), Richard W. Nock, Jack Munns, Damien Bonneau, Pisu Jiang (University of Bristol, U.K.), Hong Wei Li (Nokia Research Centre, U.K.), John G. Rarity (University of Bristol, U.K.), Antti O. Niskanen (Nokia Research Centre, U.K.), Mark G. Thompson, Jeremy L. O'Brien (University of Bristol, U.K.)
Pagepp. 795 - 799
KeywordQuantum cryptography, integrated optics
AbstractWe report on a quantum key distribution (QKD) experiment where a client with an on-chip polarisation rotator can access a server through a telecom-fibre link. Large resources such as photon source and detectors are situated at server-side. We employ a reference frame independent QKD protocol for polarisation qubits and show that it overcomes detrimental effects of drifting fibre birefringence in a polarisation maintaining fibre.


Session 9A  System-Level Verification
Time: 15:50 - 17:30 Thursday, January 23, 2014
Location: Room 300
Chairs: Yinhe Han (Chinese Academy of Sciences, China), Akash Kumar (National University of Singapore, Singapore)

9A-1 (Time: 15:50 - 16:15)
TitleConstraint-Based Platform Variants Specification for Early System Verification
Author*Andreas Burger, Alexander Viehl, Andreas Braun (FZI Research Center for Information Technology, Germany), Finn Haedicke (solvertec/University of Bremen, Germany), Daniel Große (solvertec, Germany), Oliver Bringmann, Wolfgang Rosenstiel (FZI Research Center for Information Technology/University of Tübingen, Germany)
Pagepp. 800 - 805
KeywordSpecification, Language, System-level modeling, Verification
AbstractTo overcome the verification gap arising from significantly increased external IP integration and reuse during electronic platform design and composition, we present a model-based approach to specify platform variants. The variants specification is processed automatically by formalizing and solving the integrated constraint sets to derive valid platforms. These constraint sets enable a precise specification of the required platform variants for verification, exploration and test. Experimental results demonstrate the applicability, versatility and scalability of our novel model-based approach.
Slides

9A-2 (Time: 16:15 - 16:40)
TitleA Transaction-Oriented UVM-Based Library for Verification of Analog Behavior
Author*Alexander Wolfgang Rath, Volkan Esen, Wolfgang Ecker (Infineon Technologies AG, Germany)
Pagepp. 806 - 811
KeywordUVM, analog, mixed signal, transaction, RNM
AbstractThe Universal Verification Methodology (UVM) has become a de facto standard in today’s functional verification of digital designs. However, it is rarely used for the verification of Designs Under Test containing Real Number Models. This paper presents a new UVM-based technique for comparing models of analog circuitry at different levels of abstraction. It makes use of statistical metrics. The presented technique enables us to ensure that Real Number Models used in chip projects match the transistor-level circuitry during the whole life cycle of the project.
Slides

9A-3 (Time: 16:40 - 17:05)
TitleAutomata-Theoretic Modeling of Fixed-Priority Non-Preemptive Scheduling for Formal Timing Verification
Author*Matthias Kauer, Sebastian Steinhorst (TUM CREATE, Singapore), Reinhard Schneider (TU Munich, Germany), Martin Lukasiewycz (TUM CREATE, Singapore), Samarjit Chakraborty (TU Munich, Germany)
Pagepp. 812 - 817
KeywordCAN, Formal Verification, Model Checking, Timing Analysis, FPNS
AbstractThe design process of safety-critical systems requires formal analysis methods to ensure their correct functionality without over-sized safety margins and extensive testing. For architectures with state-based events or scheduling, such as load-dependent frequency scaling, model checking has emerged as a promising tool. It formally verifies the timing behavior of real-time systems with minimal over-approximation of the worst-case delays. In this context, ECA have become a valuable modeling approach because they are specifically designed to handle typical arrival patterns and integrate well with analytic techniques. In this work, we propose an extension of the ECA framework's semantics and use it in an FPNS model that correctly abstracts the intra-slot behavior in the slotted-time model of the ECA. This is challenging because straightforward implementations cannot capture the full behavior of event-triggered scheduling with such a time model, which the ECA shares with most model checking based methods. In a case study, we obtain bounds via model checking a basic model and then our proposed model. We compare these bounds with a SystemC simulation. This shows that the bounds from the basic model are too optimistic (and are exceeded in practice) because it does not capture the full behavior, while the bounds from the proposed extended model are both safe and reasonably tight.


Session 9B  Modeling and Evaluator for Emerging Technologies
Time: 15:50 - 17:30 Thursday, January 23, 2014
Location: Room 301
Chairs: Guangyu Sun (Peking University, China), Wei Zhang (Hong Kong University of Science and Technology, Hong Kong)

9B-1 (Time: 15:50 - 16:15)
TitlePROCEED: A Pareto Optimization-Based Circuit-Level Evaluator for Emerging Devices
Author*Shaodi Wang, Andrew Pan, Chi On Chui, Puneet Gupta (University of California, Los Angeles, U.S.A.)
Pagepp. 818 - 824
KeywordTunneling(T) FET, silicon-on-insulator (SOI), circuit-level device evaluation, Pareto optimization, simulation based optimization
AbstractEvaluation of novel devices in a circuit context is crucial to identifying and maximizing their value. We propose a new framework, PROCEED, and metrics for accurate device-circuit co-evaluation by properly optimizing digital circuit benchmarks. PROCEED assesses technology suitability over a wide operating region (MHz to GHz) by leveraging available circuit knobs (Vt assignment, power management, sizing, etc.) and is up to 21X more accurate than existing methods. As an example, we use PROCEED to compare CMOS and tunneling transistor devices.
Slides

9B-2 (Time: 16:15 - 16:40)
TitleModeling and Design Analysis of 3D Vertical Resistive Memory - A Low Cost Cross-Point Architecture
Author*Cong Xu, Dimin Niu (Pennsylvania State University, U.S.A.), Shimeng Yu (Arizona State University, U.S.A.), Yuan Xie (Advanced Micro Devices, China/Pennsylvania State University, U.S.A.)
Pagepp. 825 - 830
KeywordReRAM, 3D
AbstractResistive Random Access Memory (ReRAM) is one of the most promising emerging non-volatile memory (NVM) candidates due to its fast read/write speed, excellent scalability and low-power operation. The recently proposed 3D vertical cross-point ReRAM (3D-VRAM) architecture has attracted a lot of attention because it offers a cost-competitive solution for NAND Flash replacement. In this work, we first develop an array-level model which includes the geometries and properties of all the components in the 3D structure. The model is capable of analyzing the read/write noise margin of a 3D-VRAM array in the presence of sneak leakage current and voltage drop. Then we build a system-level design tool that is able to explore the design space with specified constraints and find the optimal design points for different targets. We also study the impact of different design parameters on the array size, bit density and overall cost-per-bit. Compared to the state-of-the-art 3D horizontal ReRAM (3D-HRAM), the 3D-VRAM shows a great cost advantage when stacking more than 16 layers.

9B-3 (Time: 16:40 - 17:05)
TitleThe Stochastic Modeling of TiO2 Memristor and Its Usage in Neuromorphic System Design
AuthorMiao Hu (University of Pittsburgh, U.S.A.), Yu Wang (Tsinghua University, China), Qinru Qiu (Syracuse University, U.S.A.), Yiran Chen, *Hai Li (University of Pittsburgh, U.S.A.)
Pagepp. 831 - 836
KeywordMemristor, Stochastic model, Neuromorphic system
AbstractThe memristor, the fourth basic circuit element, has shown great potential in neuromorphic circuit design for its unique synapse-like feature. However, there still exists a large gap between the theoretical memristor characteristics and the experimental data obtained from real device measurements. For instance, though the continuous resistance state of the memristor has been expected to facilitate neuromorphic circuit designs, obtaining and maintaining an arbitrary intermediate state cannot be well controlled in today's memristive systems. Moreover, stochastic behaviors have been widely observed in real device measurements. To facilitate the investigation of memristor-based hardware implementation, we first built a stochastic behavior model for TiO2 memristive devices based on real experimental results. We then proposed a macro cell design composed of multiple parallel-connected memristors. By leveraging the stochastic behavior of memristors, the macro cell can be successfully used as the weight storage unit and the stochastic neuron, the two fundamental components widely adopted in neural networks, providing a feasible solution for memristor-based hardware implementation of neuromorphic systems.
Slides

9B-4 (Time: 17:05 - 17:30)
TitleThrough-Silicon-Via Inductor: Is It Real or Just A Fantasy?
Author*Umamaheswara Rao Tida (Missouri University of Science and Technology, U.S.A.), Cheng Zhuo (Intel Research, U.S.A.), Yiyu Shi (Missouri University of Science and Technology, U.S.A.)
Pagepp. 837 - 842
Keyword3D IC, TSV Inductor, Through-Silicon-Vias, Micro-channel, On-chip inductors
AbstractThrough-silicon-vias (TSVs) can potentially be used to implement inductors in three-dimensional (3D) integrated systems for minimal footprint and large inductance. However, unlike conventional 2D spiral inductors, TSV inductors are fully buried in the lossy substrate, thus suffering from inferior quality factors. As such, the literature has pointed out that TSV inductors should be used only when area is the sole concern, which essentially means they are useless. In this paper, we propose a novel shield mechanism utilizing the micro-channel, a technique conventionally used for heat removal, to reduce the substrate loss. The technique increases the quality factor and the inductance of the TSV inductor by up to 21x and 17x respectively. It enables us to implement TSV inductors of up to 38x smaller area and 33% higher quality factor, compared with spiral inductors of the same inductance. To the best of the authors’ knowledge, this is the first proposal on improving the quality factor of TSV inductors. We hope our study will point out a new and exciting research direction for 3D IC designers.
Slides


Session 9C  Design and Simulation Toward Power and Temperature Awareness
Time: 15:50 - 17:30 Thursday, January 23, 2014
Location: Room 303
Chairs: Yasuhiro Takashima (University of Kitakyushu, Japan), Yukihide Kohira (The University of Aizu, Japan)

9C-1 (Time: 15:50 - 16:15)
TitleDesign and Control Methodology for Fine Grain Power Gating Based on Energy Characterization and Code Profiling of Microprocessors
Author*Kimiyoshi Usami, Masaru Kudo, Kensaku Matsunaga, Tsubasa Kosaka, Yoshihiro Tsurui (Shibaura Institute of Technology, Japan), Weihan Wang, Hideharu Amano (Keio University, Japan), Hiroaki Kobayashi, Ryuichi Sakamoto, Mitaro Namiki (Tokyo University of Agriculture and Technology, Japan), Masaaki Kondo (The University of Electro-Communications, Japan), Hiroshi Nakamura (University of Tokyo, Japan)
Pagepp. 843 - 848
Keywordpower gating
AbstractThis paper describes a design and control scheme for an embedded processor whose internal function units are power gated on an instruction-by-instruction basis. Enabling and disabling of the power gating is adaptively controlled with the support of on-chip leakage monitors and the operating system to minimize the energy overhead due to sleep-in and wakeup. Measured results of the fabricated chip demonstrate that our approach reduces energy by up to 15% over the range of 25-85 °C compared to the conventional fine-grain power gating technique.
Slides

9C-2 (Time: 16:15 - 16:40)
TitleA Hybrid Random Walk Algorithm for 3-D Thermal Analysis of Integrated Circuits
Author*Yuan Liang, Wenjian Yu (Tsinghua University, China), Haifeng Qian (IBM T. J. Watson Research Center, U.S.A.)
Pagepp. 849 - 854
Keywordthermal analysis, random walk method
AbstractIn this work, a hybrid random walk method is proposed for the thermal analysis of integrated circuits. Preserving the advantage of the generic random walk method (GRW), i.e. its suitability for simulating local hot-spots, the proposed techniques greatly reduce its runtime for accurate high-resolution simulation, and are suitable for the realistic pyramid-shaped IC model. This is achieved by combining the GRW and the floating random walk techniques, and by a novel use of a rectangular cuboid transition domain. The techniques to handle the Neumann boundary and the convective boundary in thermal simulation are also discussed. Numerical experiments on several IC test cases validate the efficiency and accuracy of the proposed techniques, and demonstrate more than 100X speedup over the GRW method.

9C-3 (Time: 16:40 - 17:05)
TitleLightSim: A Leakage Aware Ultrafast Temperature Simulator
AuthorSmruti R. Sarangi, *Gayathri Ananthanarayanan, M. Balakrishnan (IIT Delhi, India)
Pagepp. 855 - 860
Keywordtemperature, estimation, thermal, analysis
AbstractIn this paper, we propose the design of an ultra-fast temperature simulator, LightSim, which can perform both steady state and transient thermal analysis, and also take the effect of leakage power into account. We use a novel Hankel transform based technique to derive a transient version of the Green's function for a chip, which takes into account the feedback loop between temperature and leakage. Subsequently, we calculate the temperature map of a chip by convolving the derived Green's function with the power map. Our simulator is at least 3500 times faster than HotSpot, and at least 2.3 times faster than competing research prototypes. The total error is limited to 0.18 K.
Slides

9C-4 (Time: 17:05 - 17:30)
TitleFast Vectorless Power Grid Verification Using Maximum Voltage Drop Location Estimation
AuthorWei Zhao, Yici Cai, *Jianlei Yang (Tsinghua University, China)
Pagepp. 861 - 866
KeywordPower Grid, Vectorless Verification, Voltage Drop, Location Estimation
AbstractPower grid integrity verification is critical for reliable chip design. Vectorless power grid verification provides a promising approach to evaluate the worst-case voltage fluctuations without detailed information about circuit activities. Vectorless verification usually requires solving numerous linear programming problems to obtain the worst-case voltage fluctuation throughout the grid, which is extremely time-consuming for large-scale verification. In this paper, a maximum voltage drop location estimation approach is proposed for efficient vectorless verification. The power grid nodes are grouped into disjoint subsets, and an estimation strategy is utilized to roughly locate the nodes which have the worst-case voltage drop in each group. Consequently, the verification problem size can be significantly reduced compared with accurate verification. Experimental results show that the proposed approach can achieve remarkable speedups with acceptable accuracy loss.
Slides