
The 26th Asia and South Pacific Design Automation Conference
Technical Program

Remark:
  • Presentations and chat Q&A are available from Jan. 12 to 29.
  • Before the live chat Q&A sessions, the videos are broadcast via Zoom on the same day, mainly for mainland China.
  • Live chat Q&A sessions, attended by all the speakers and session chairs of each session, are held according to the "Live Tutorial / Live Q&A Session Schedule".
  • Tutorials are given live on Zoom according to the timetable; the videos will be made available later.
  • The time zone is JST (UTC+9:00).
  • The presenter of each paper is marked with "*".



Live Tutorial / Live Q&A Session Schedule

Monday, January 18, 2021

Room T1, T4   Room T2, T5   Room T3
T1  (Room T1)
Tutorial-1 (live session and its video)

9:00 - 12:00
T2  (Room T2)
Tutorial-2 (live session and its video)

9:00 - 12:00
T3  (Room T3)
Tutorial-3 (live session and its video)

9:00 - 12:00
T4  (Room T4)
Tutorial-4 (live session and its video)

14:00 - 17:00
T5  (Room T5)
Tutorial-5 (live session and its video)

14:00 - 17:00




Tuesday, January 19, 2021

Room 1A - 3A   Room 1B - 3B   Room 1C - 3C   Room 1D - 3D   Room 1E - 3E
1K  (Room K)
Opening and Keynote Session I (video and its broadcast via Zoom)

1A  (Room 1A)
University Design Contest I

15:00 - 15:30
1B  (Room 1B)
Accelerating Design and Simulation

15:00 - 15:30
1C  (Room 1C)
Process-in-Memory for Efficient and Robust AI

15:00 - 15:30
1D  (Room 1D)
Validation and Verification

15:00 - 15:30
1E  (Room 1E)
Design Automation Methods for Various Microfluidic Platforms

15:00 - 15:30
2A  (Room 2A)
University Design Contest II

15:30 - 16:00
2B  (Room 2B)
Emerging Non-Volatile Processing-In-Memory for Next Generation Computing

15:30 - 16:00
2C  (Room 2C)
(SS-1) Emerging Trends for Cross-Layer Co-Design: From Device, Circuit, to Architecture, Application

15:30 - 16:00
2D  (Room 2D)
Machine Learning Techniques for EDA in Analog/Mixed-Signal ICs

15:30 - 16:00
2E  (Room 2E)
Innovating Ideas in VLSI Routing Optimization

15:30 - 16:00
3A  (Room 3A)
(SS-2) ML-Driven Approximate Computing

16:00 - 16:30
3B  (Room 3B)
Architecture-Level Exploration

16:00 - 16:30
3C  (Room 3C)
Core Circuits for AI Accelerators

16:00 - 16:30
3D  (Room 3D)
Stochastic and Approximate Computing

16:00 - 16:30
3E  (Room 3E)
Timing Analysis and Timing-Aware Design

16:00 - 16:30



Wednesday, January 20, 2021

Room 4A - 6A   Room 4B - 6B   Room 4C - 6C   Room 4D - 6D   Room 4E - 6E
2K  (Room K)
Keynote Session II (video and its broadcast via Zoom)

4A  (Room 4A)
(SS-3) Technological Advancements inside the AI chips, and using the AI Chips

15:00 - 15:30
4B  (Room 4B)
System-Level Modeling, Simulation, and Exploration

15:00 - 15:30
4C  (Room 4C)
Neural Network Optimizations for Compact AI Inference

15:00 - 15:30
4D  (Room 4D)
Brain-Inspired Computing

15:00 - 15:30
4E  (Room 4E)
Cross-Layer Hardware Security

15:00 - 15:30
5A  (Room 5A)
(DF-1): New-Principle Computer

15:30 - 16:00
5B  (Room 5B)
Embedded Operating Systems and Information Retrieval

15:30 - 16:00
5C  (Room 5C)
(SS-4) Security Issues in AI and Their Impacts on Hardware Security

15:30 - 16:00
5D  (Room 5D)
Advances in Logic and High-level Synthesis

15:30 - 16:00
5E  (Room 5E)
Hardware-Oriented Threats and Solutions in Neural Networks

15:30 - 16:00
6A  (Room 6A)
(DF-2): Advanced Sensing Technology and Automotive Application

16:00 - 16:30
6B  (Room 6B)
Advanced Optimizations for Embedded Systems

16:00 - 16:30
6C  (Room 6C)
Design and Learning of Logic Circuits and Systems

16:00 - 16:30
6D  (Room 6D)
Hardware Locking and Obfuscation

16:00 - 16:30
6E  (Room 6E)
Efficient Solutions for Emerging Technologies

16:00 - 16:30



Thursday, January 21, 2021

Room 7A - 9A   Room 7B - 9B   Room 7C - 9C   Room 7D - 9D   Room 7E - 9E
3K  (Room K)
Keynote Session III (video and its broadcast via Zoom)

7A  (Room 7A)
(SS-5) Platform-Specific Neural Network Acceleration

15:00 - 15:30
7B  (Room 7B)
Toward Energy-Efficient Embedded Systems

15:00 - 15:30
7C  (Room 7C)
Software and System Support for Nonvolatile Memory

15:00 - 15:30
7D  (Room 7D)
Learning-Driven VLSI Layout Automation Techniques

15:00 - 15:30
7E  (Room 7E)
DNN-Based Physical Analysis and DNN Accelerator Design

15:00 - 15:30
8A  (Room 8A)
(DF-3): Emerging Open Design Platform

15:30 - 16:00
8B  (Room 8B)
Embedded Neural Networks and File Systems

15:30 - 16:00
8C  (Room 8C)
(SS-6) Design Automation for Future Autonomy

15:30 - 16:00
8D  (Room 8D)
Emerging Hardware Verification

15:30 - 16:00
8E  (Room 8E)
Optimization and Mapping Methods for Quantum Technologies

15:30 - 16:00
9A  (Room 9A)
(DF-4): Technological Utilization in COVID-19 Pandemic

16:00 - 16:30
9B  (Room 9B)
Emerging System Architectures for Edge-AI

16:00 - 16:30
9C  (Room 9C)
(SS-7) Cutting-Edge EDA Techniques for Advanced Process Technologies

16:00 - 16:30
9D  (Room 9D)
(SS-8) Robust and Reliable Memory Centric Computing at Post-Moore

16:00 - 16:30
9E  (Room 9E)
Design for Manufacturing and Soft Error Tolerance

16:00 - 16:30



DF: Designers' Forum, SS: Special Session

List of papers




Monday, January 18, 2021


Session T1  Tutorial-1 (live session and its video)
Time: 9:00 - 12:00, Monday, January 18, 2021
Location: Room T1

T1-1
Title: (Tutorial) Achieving Quantum Computing's Disruptive Capabilities through Error-Mitigating Software
Author: Joseph Emerson (Quantum Benchmark Inc./University of Waterloo, Canada)
Abstract: Quantum computers will achieve disruptive capabilities only if they can overcome their intrinsic sensitivity to quantum error sources. These error sources, if unmitigated, lead to inaccurate and incorrect quantum computing solutions. I will describe the current status of quantum computing capabilities, key challenges, and the roadmap to achieving quantum advantage in the future. I will present some of our state-of-the-art results from current generation quantum computing platforms that leverage Quantum Benchmark's error-mitigating software solutions.



Session T2  Tutorial-2 (live session and its video)
Time: 9:00 - 12:00, Monday, January 18, 2021
Location: Room T2

T2-1
Title: (Tutorial) Reliability and Availability of Hardware-Software Systems — Stochastic Reliability Models of Real Systems
Author: Kishor S. Trivedi (Duke University, USA)
Abstract: High reliability and availability are requirements for most technical systems, including computer and communication systems. Reliability and availability assurance methods based on probabilistic models are the topic addressed in this talk. Non-state-space solution methods are often used to solve models based on reliability block diagrams, fault trees and reliability graphs. Relatively efficient algorithms are known to handle systems with hundreds of components and have been implemented in many software packages. Nevertheless, many practical problems cannot be handled by such algorithms. Bounding algorithms are then used in such cases, as was done for a major subsystem of the Boeing 787. Non-state-space methods derive their efficiency from the independence assumption, which is often violated in practice. State-space methods based on Markov chains, stochastic Petri nets, semi-Markov and Markov regenerative processes can be used to model various kinds of dependencies among system components. The Linux operating system and WebSphere Application Server are used as examples of Markov models. The IBM research cloud is used as an example of a stochastic Petri net model. However, the state space explosion of such models severely restricts the size of the problem that can be solved. Hierarchical and fixed-point iterative methods provide a scalable alternative that combines the strengths of state-space and non-state-space methods, and they have been extensively used to solve real-life problems. Real-world examples of such multi-level models from IBM, Cisco and Sun Microsystems will be discussed. Hardware systems as well as software systems and their combinations will be addressed via these examples. Novel approaches to software fault tolerance will be discussed. These methods and applications are fully described in a recently completed book: Reliability and Availability Engineering: Modeling, Analysis and Applications, Cambridge University Press, 2017.
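
Illustration (not part of the tutorial material): the non-state-space methods mentioned in the abstract can be sketched for the simplest case, a reliability block diagram whose components fail independently. The component values and the example system below are hypothetical and chosen only to show the series and parallel rules.

```python
from functools import reduce


def series(*reliabilities):
    """All blocks must work: R = product of R_i (independent failures assumed)."""
    return reduce(lambda acc, r: acc * r, reliabilities, 1.0)


def parallel(*reliabilities):
    """At least one block must work: R = 1 - product of (1 - R_i)."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), reliabilities, 1.0)


# Hypothetical system: two redundant servers (0.9 each) behind one switch (0.99).
system = series(parallel(0.9, 0.9), 0.99)
print(f"system reliability = {system:.4f}")
```

The closed-form product rules hold only under the independence assumption; as the abstract notes, dependencies among components call for state-space models instead.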



Session T3  Tutorial-3 (live session and its video)
Time: 9:00 - 12:00, Monday, January 18, 2021
Location: Room T3

T3-1
Title: (Tutorial) Machine Learning in EDA Tutorial: Approaches, Advantages, Challenges and Examples
Author: Elias Fallon (Cadence Design Systems, Inc., USA)
Abstract: Electronic Design Automation (EDA) software has delivered semiconductor design productivity improvements for decades. The next leap in productivity will come from the addition of machine learning (ML) techniques to the toolbox of computational software capabilities employed by EDA developers. Recent research and development into machine learning for EDA point to clear patterns for how it impacts EDA tools, flows, and design challenges. This research has also illustrated some of the challenges that will come with production deployment of machine learning techniques into EDA tools and flows. This tutorial will detail patterns observed in ML for EDA development. The advantages and disadvantages of different ML approaches as well as the challenges for deployment in EDA developments will be discussed. Specific examples of different ML approaches in the EDA domain will be presented, and the opportunities and open questions for future research will be shown.



Session T4  Tutorial-4 (live session and its video)
Time: 14:00 - 17:00, Monday, January 18, 2021
Location: Room T4

T4-1
Title: (Tutorial) The Latest Heterogenous Integration Packaging Trends for 5G, Artificial Intelligence, Automotive Electronics, and High Performance Computing
Author: Henry H. Utsunomiya (Interconnection Technologies, Inc., Japan)
Abstract: Heterogeneous Integration refers to the assembly and packaging of multiple separately manufactured components in order to improve functionality and enhance operating characteristics. It allows for the packaging of components with different functionalities, different process technologies, different materials, and sometimes from separate manufacturers. Since the pace of Moore’s Law scaling has been slowing down, increasing the functionality of a monolithic die on the same area and/or footprint with a System on Chip (SoC), without increasing the cost per transistor, is becoming difficult. Heterogeneous Integration building blocks such as 3D integration, chiplets, and dies embedded in the packaging substrate, implemented as a System-in-Package (SiP), provide alternative solutions to both the microelectronics industry and electronic systems, with shorter time to market and affordable cost. This tutorial introduces the current status of Heterogeneous Integration technology, its building blocks, and use-case examples, e.g., 5th Generation Mobile Communication (5G), Ambient Assisted Living, Artificial Intelligence (AI), Autonomous Driving, Industry 4.0, Health Care, and the Internet of Things (IoT). The supply chain and a roadmap toward 2030 will also be discussed.



Session T5  Tutorial-5 (live session and its video)
Time: 14:00 - 17:00, Monday, January 18, 2021
Location: Room T5

T5-1
Title: (Tutorial) Emerging Devices from Manufacturing Point of View: 3D NAND Flash Memory, PCRAM and Carbon Nanotube
Author: Koukou Suu (Ulvac Technologies, Inc., USA), Yoshihiro Hirota (Tokyo Electron Limited, Japan), Shigemi Murakawa (Zeon Corporation, Japan)
Abstract: This tutorial covers three topics:

1. Manufacturing Technology of Phase Change Memory for Storage Class Memory and AI Applications
Koukou Suu (Ulvac Technologies, Inc.)
Thin-film functional materials such as phase-change (Ge2Sb2Te5) and selector materials have been utilized to form advanced semiconductor devices, including Phase-Change Random Access Memory (PCRAM) for Storage Class Memory (Ge-As-Se) and analog Artificial Intelligence (AI). In this talk, we will present our development activities on phase-change material thin-film processing technologies, including sputtering and ALD/CVD, as well as manufacturing processes for PCRAM.

2. 3D NAND Flash Memory, Manufacturing Technology and Trend
Yoshihiro Hirota (Tokyo Electron Limited)
NAND Flash memory was invented in the 1980s, and the world's first paper was published at IEDM 1987 by the F. Masuoka group at Toshiba. Bit density has been increased through scaling of the manufacturing process, and the structure has changed from 2D to 3D to increase bit density further. Recently, 3D NAND with 128 WLs has been developed and shipped, and the bit density has reached 7.8 Gbit/mm2. A general manufacturing process flow of 3D NAND and some key processes are introduced. The aggressive manufacturing process technologies are shown visually. High-aspect-ratio etching to fabricate the memory cell holes is highly aggressive; the aspect ratio now exceeds 45. Conformal thin-film deposition into the vertically structured memory cell holes is also highly aggressive. Vertical scaling, horizontal scaling, and electrical scaling are the challenges in increasing bit density today. Vertical scaling includes increasing the number of stacked layers, shrinking the thickness of each layer, and others. Horizontal scaling includes scaling of the memory cell hole layout, the introduction of the Circuit (or CMOS) under Array (CuA) structure, and other new structures. Finally, the electrical scaling challenge is multi-bit cell technology. 3D NAND comes in Floating Gate and Charge Trap Film types. The advantages and disadvantages of both types are also discussed.

3. Carbon nanotube and its Electronic Device Applications
Shigemi Murakawa (Zeon Corporation)
This tutorial reviews the manufacturing method of the single-walled carbon nanotube and its advantageous characteristics, such as extremely low impurities, high specific surface area and high electric conductivity. Also described are its various device applications, including non-volatile memory, high-density capacitors, electric elastomers, microwave-shielding film and thermo-electric conversion devices. Here, the adaptive design and operation of the device are important, considering the process and material-characteristics windows. Also, an AI-based QC system that covers everything from material characteristics to device performance is highly anticipated.



Tuesday, January 19, 2021


Session 1K  Opening and Keynote Session I (video and its broadcast via Zoom)
Location: Room K

1K-1
Title: Opening
1. Welcome by GC (Mr. Hattori)
2. Welcome by SC-Chair (Prof. Onodera)
3. Program Report by TPC Chair (Prof. Tan)
    3-1. Best Paper Award Presentation (Prof. Hashimoto)
    3-2. 10-Year Retrospective Most Influential Paper Award Presentation (Prof. Hashimoto)
4. Designers' Forum Report by DF Chair (Mr. Yamashita)
5. Design Contest Report by UDC Co-Chair (Prof. Tsuchiya)
    5-1. UDC Award Presentation (Prof. Tsuchiya)
6. Student Research Forum Report by SRF Chair (Prof. Weichen Liu)
7. IEEE CEDA Awards by CEDA Award Chair (Prof. Wakabayashi)
    7-1. CEDA Outstanding Service Award (Prof. Wakabayashi)
8. Welcome message for ASP-DAC 2022 by 2022GC (Prof. Ting-Chi Wang)

1K-2
Title: Introduction of Prof. Kaushik Roy by Toshihiro Hattori (Renesas Electronics, Japan)

1K-3
Title: (Keynote Address) Re-Engineering Computing with Neuro-Inspired Learning: Devices, Circuits, and Systems
Author: Kaushik Roy (Purdue University, USA)
Abstract: Advances in machine learning, notably deep learning, have led to computers matching or surpassing human performance in several cognitive tasks including vision, speech and natural language processing. However, implementation of such neural algorithms in conventional "von-Neumann" architectures is several orders of magnitude more area- and power-expensive than the biological brain. Hence, we need fundamentally new approaches to sustain exponential growth in performance at high energy efficiency beyond the end of the CMOS roadmap in the era of ‘data deluge’ and emergent data-centric applications. Exploring the new paradigm of computing necessitates a multi-disciplinary approach: exploration of new learning algorithms inspired by neuroscientific principles, developing network architectures best suited for such algorithms, new hardware techniques to achieve orders-of-magnitude improvement in energy consumption, and nanoscale devices that can closely mimic the neuronal and synaptic operations of the brain, leading to a better match between the hardware substrate and the model of computation. In this talk, I will focus on our recent works on neuromorphic computing with spike-based learning and the design of underlying hardware that can lead to quantum improvements in energy efficiency with good accuracy.



Session 1A  University Design Contest I
Time: 15:00 - 15:30, Tuesday, January 19, 2021
Location: Room 1A
Chairs: Kousuke Miyaji (Shinshu University, Japan), Akira Tsuchiya (The University of Shiga Prefecture, Japan)

Best Design Award
1A-1
Title: A DSM-based Polar Transmitter with 23.8% System Efficiency
Author: *Yuncheng Zhang, Bangan Liu, Xiaofan Gu, Chun Wang, Atsushi Shirane, Kenichi Okada (Tokyo Institute of Technology, Japan)
Page: pp. 1 - 2
Keyword: Power amplifier, DSM, Digital transmitter, Injection-locked PLL
Abstract: An energy-efficient digital polar transmitter (TX) based on a 1.5-bit Delta-Sigma modulator (DSM) and a fractional-N injection-locked phase-locked loop (IL-PLL) is proposed. In the proposed TX, the redundant charge and discharge of turned-off capacitors in conventional switched-capacitor power amplifiers (SCPAs) are avoided, which drastically improves the efficiency at power back-off. In the PLL, a spur-mitigation technique is proposed to reduce the frequency mismatch between the oscillator and the reference. The transmitter, implemented in 65nm CMOS, achieves a PAE of 29% at an EVM of -25.1dB, and a system efficiency of 23.8%.

1A-2
Title: A 0.41W 34Gb/s 300GHz CMOS Wireless Transceiver
Author: *Ibrahim Abdo, Takuya Fujimura, Tsuyoshi Miura, Korkut K. Tokgoz, Atsushi Shirane, Kenichi Okada (Tokyo Institute of Technology, Japan)
Page: pp. 3 - 4
Keyword: CMOS, 300GHz, subharmonic mixer, wireless transceiver, 65nm
Abstract: A 300GHz CMOS-only wireless transceiver that achieves a maximum data rate of 34Gb/s while consuming a total power of 0.41W from a 1V supply is introduced. A subharmonic mixer with low conversion loss is proposed to compensate for the absence of RF amplifiers in the TX and RX, as a mixer-last-mixer-first topology is adopted. The TRX covers 19 IEEE802.15.3d channels (13-23, 39-43, 52-53, 59).

1A-3
Title: Capacitive Sensor Circuit with Relative Slope-Boost Method Based on a Relaxation Oscillator
Author: *Ryo Onishi, Koki Miyamoto, Korkut Kaan Tokgoz, Noboru Ishihara, Hiroyuki Ito (Tokyo Institute of Technology, Japan)
Page: pp. 5 - 6
Keyword: Capacitive sensor circuit, CMOS, ultra-low power, relaxation oscillator, slope-boost
Abstract: This paper presents a relative slope-boosting technique for a capacitive sensor circuit based on a relaxation oscillator. Our technique improves jitter, i.e., resolution, by changing both the voltage slope on the sensing and the reference sides with respect to the sensor capacitance. The sensor prototype circuit is implemented in a 180-nm standard CMOS process and achieves a resolution of 710 aF while consuming 12.7 pJ of energy per cycle at a 13.78 kHz output frequency. The measured power consumption from a 1.2 V DC supply is 430 nW.

1A-4
Title: 28GHz Phase Shifter with Temperature Compensation for 5G NR Phased-array Transceiver
Author: *Yi Zhang, Jian Pang, Kiyoshi Yanagisawa, Atsushi Shirane, Kenichi Okada (Tokyo Institute of Technology, Japan)
Page: pp. 7 - 8
Keyword: vector summing type phase shifter, phase error, temperature compensation, 5G NR, phased-array
Abstract: A phase shifter with temperature compensation for a 28GHz phased-array TRX is presented. A precise low-voltage current reference is proposed for the IDAC biasing circuit. The total gain variation for a single TX path, including the phase shifter and post-stage amplifiers, over -40°C to 80°C is only 1dB in measurement, and the overall phase error due to temperature is less than 1 degree without off-chip calibration.

1A-5
Title: An up to 35 dBc/Hz Phase Noise Improving Design Methodology for Differential-Ring-Oscillators Applied in Ultra-Low Power Systems
Author: *Peter Toth, Hiroki Ishikuro (Keio University, Japan)
Page: pp. 9 - 10
Keyword: time-mode circuit, phase-noise improving, amplitude feedback loop
Abstract: This work presents a novel control loop concept that dynamically adjusts the biasing of a differential ring oscillator (DRO) in order to improve phase noise (PN) performance in the ultra-low-power domain. The proposed feedback system can be applied to any DRO with a tail current source. The paper presents the proposed concept and includes measurements of a 180 nm CMOS integrated prototype system, which underline the feasibility of the discussed idea. Measurements show a phase noise improvement of up to 35 dBc/Hz with an active control loop. Moreover, the tuning range of the implemented ring oscillator is extended by about 430% compared to fixed-bias operation. These values are measured at a minimum oscillation power consumption of 55 pW/Hz.

1A-6
Title: Gate Voltage Optimization in Capacitive DC-DC Converters for Thermoelectric Energy Harvesting
Author: *Yi Tan, Yohsuke Shiiki, Hiroki Ishikuro (Keio University, Japan)
Page: pp. 11 - 12
Keyword: Charge pump, Capacitive DC-DC
Abstract: This paper presents a gate-voltage-optimized, fully integrated charge pump for thermoelectric energy harvesting applications. The trade-off created by raising the gate voltage of the switching transistors is discussed. The proposed 5/3-stage design, implemented in 180 nm CMOS technology, achieves startup voltages down to 0.12 V/0.13 V, respectively, with the proposed technique. A 20% improvement in peak power conversion efficiency is achieved compared with a similar 3-stage linear charge pump from previous state-of-the-art research.

1A-7
Title: An 0.57 GOPS/DSP Object Detection PIM Accelerator on FPGA
Author: *Bo Jiao, Jinshan Zhang, Yuanyuan Xie, Shunli Wang (Shanghai Engineering Research Center of AI & Robotics, Fudan University, China), Haozhe Zhu (Fudan University, China), Xiaoyang Kang, Zhiyan Dong, Lihua Zhang, Chixiao Chen (Shanghai Engineering Research Center of AI & Robotics, Fudan University, China)
Page: pp. 13 - 14
Keyword: Artificial Intelligence, PIM, FPGA, co-design
Abstract: The paper presents an object detection accelerator featuring a processing-in-memory (PIM) architecture on FPGAs. PIM architectures are well known for their energy efficiency and avoidance of the memory wall. In the accelerator, a PIM unit is developed using BRAM and LUT-based counters, which also helps to improve the DSP performance density. The overall architecture consists of 64 PIM units and three memory buffers to store inter-layer results. A shrunk and quantized Tiny-YOLO network is mapped to the PIM accelerator, where DRAM access is fully eliminated during inference. The design achieves a throughput of 201.6 GOPS at a 100 MHz clock rate and, correspondingly, a performance density of 0.57 GOPS/DSP.

1A-8
Title: Supply Noise Reduction Filter for Parallel Integrated Transimpedance Amplifiers
Author: Shinya Tanimura, *Akira Tsuchiya, Toshiyuki Inoue, Keiji Kishine (The University of Shiga Prefecture, Japan)
Page: pp. 15 - 16
Keyword: transimpedance amplifier, supply noise, noise reduction
Abstract: This paper presents a noise reduction technique for transimpedance amplifiers (TIAs) for optical interconnection. Multi-channel TIAs suffer from inter-channel interference via the supply and the ground line. We employ a simple RC filter to reduce the supply noise. The RC filter is inserted at the supply node of the TIA. The filter behaves differently against the noise from the TIA, from the supply line, and from the ground line. Thus, we discuss the time constant tuning to maximize the effect of the filter. The proposed circuit was fabricated in a 180-nm CMOS process. The measurement results verify a 38% noise reduction at 5 Gbps operation without extra area and power.



Session 1B  Accelerating Design and Simulation
Time: 15:00 - 15:30, Tuesday, January 19, 2021
Location: Room 1B
Chairs: Chien-Chung Ho (National Chung Cheng University, Taiwan), Chun-Yi Lee (National Tsing Hua University, Taiwan)

1B-1
Title: A Fast Yet Accurate Message-level Communication Bus Model for Timing Prediction of SDFGs on MPSoC
Author: *Hai-Dang Vu, Sebastien Le Nours, Sebastien Pillement (University of Nantes, France), Ralf Stemmer, Kim Gruettner (OFFIS, Germany)
Page: pp. 17 - 22
Keyword: System-level modeling, Multi Processor, Timing prediction
Abstract: Fast yet accurate performance and timing prediction of complex parallel data flow applications on multi-processor systems remains a difficult discipline. The difficulty stems from the complexity of the data flow applications and of the hardware platform with shared resources, like buses and memories. This combination may lead to complex timing interferences that are difficult to express in pure analytical or classical simulation-based approaches. In this work, we propose a message-level communication model for timing and performance prediction of Synchronous Data Flow (SDF) applications on MPSoCs with shared memories. We compare our work against measurement and TLM simulation-based performance prediction models on two case studies from the computer vision domain. We show that the accuracy and execution time of our simulation outperform existing approaches and that it is suitable for fast yet accurate design space exploration.

1B-2
Title: Simulation of Ideally Switched Circuits in SystemC
Author: *Breytner Joseph Fernandez-Mesa, Liliana Andrade, Frédéric Pétrot (Univ. Grenoble Alpes, CNRS, Grenoble INP, TIMA, France)
Page: pp. 23 - 28
Keyword: Power systems, system-level, SystemC, AMS, electronic circuits
Abstract: Modeling and simulation of power systems at low levels of abstraction is supported by specialized tools such as SPICE and MATLAB. But when power systems are part of larger systems including digital hardware and software, low-level models become over-detailed; at the system level, models must be simple and execute fast. We present an extension to SystemC that relies on efficient modeling, simulation, and synchronization strategies for Ideally Switched Circuits. Our solution enables designers to specify circuits and to jointly simulate them with other SystemC hardware and software models. We test our extension with three power converter case studies and show a simulation speed-up between 1.2 and 2.7 times w.r.t. the reference tool while preserving accuracy. This work demonstrates the suitability of SystemC for the simulation of heterogeneous models to meet system-level goals such as validation, verification, and integration.

1B-3
Title: HW-BCP: A Custom Hardware Accelerator for SAT Suitable for Single Chip Implementation for Large Benchmarks
Author: *Soowang Park (University of Southern California, USA), Jae-Won Nam (Seoul National University of Science and Technology, Republic of Korea), Sandeep K. Gupta (University of Southern California, USA)
Page: pp. 29 - 34
Keyword: Custom hardware, SAT, BCP, von Neumann machine, CAM
Abstract: Boolean Satisfiability (SAT) has broad usage in Electronic Design Automation (EDA), artificial intelligence (AI), and theoretical studies. Further, as an NP-complete problem, acceleration of SAT will also enable acceleration of a wide range of combinatorial problems. We propose a completely new custom hardware design to accelerate SAT. Starting with the well-known fact that Boolean Constraint Propagation (BCP) takes most of the SAT solving time (80-90%), we focus on accelerating BCP. By profiling a widely-used software SAT solver, MiniSAT v2.2.0 (MiniSAT2) [1], we identify opportunities to accelerate BCP via parallelization and elimination of von Neumann overheads, especially data movement. The proposed hardware for BCP (HW-BCP) achieves these goals via a customized combination of content-addressable memory (CAM) cells, SRAM cells, logic circuitry, and optimized interconnects. In 65nm technology, on the largest SAT instances in the SAT Competition 2017 benchmark suite, our HW-BCP dramatically accelerates BCP (4.5ns per BCP in simulations) and hence provides a 62-185x speedup over optimized software implementation running on general purpose processors. Finally, we extrapolate our HW-BCP design to 7nm technology and estimate area and delay. The analysis shows that in 7nm, in a realistic chip size, HW-BCP would be large enough for the largest SAT instances in the benchmark suite.
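
Illustration (not taken from the paper): the Boolean Constraint Propagation step that the abstract targets can be sketched in plain software as repeated unit propagation over a clause list. This naive version rescans every clause on each pass, which is only meant to show what BCP computes; it is not MiniSAT's watched-literal scheme, and it says nothing about the proposed hardware. The function name and the DIMACS-style integer literals are assumptions made for the example.

```python
def unit_propagate(clauses, assignment):
    """Repeatedly assign literals forced by unit clauses.

    clauses: list of lists of non-zero ints (positive = variable, negative = its negation).
    assignment: dict mapping variable -> bool, extended in place.
    Returns False if some clause has all literals false (conflict), else True.
    """
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var in assignment:
                    if assignment[var] == (lit > 0):
                        satisfied = True
                        break
                else:
                    unassigned.append(lit)
            if satisfied:
                continue
            if not unassigned:
                return False  # every literal is false: conflict
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0  # forced assignment
                changed = True
    return True


# Hypothetical formula: (x1) AND (NOT x1 OR x2) AND (NOT x2 OR NOT x3)
clauses = [[1], [-1, 2], [-2, -3]]
assignment = {}
ok = unit_propagate(clauses, assignment)
print(ok, assignment)
```

Each pass touches every clause and every literal, which hints at why the data movement cited in the abstract dominates software BCP.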



Session 1C  Process-in-Memory for Efficient and Robust AI
Time: 15:00 - 15:30, Tuesday, January 19, 2021
Location: Room 1C
Chairs: Shouzhen Gu (East China Normal University, China), Weiwen Jiang (University of Notre Dame, USA)

1C-1
Title: A Novel DRAM-Based Process-in-Memory Architecture and its Implementation for CNNs
Author: *Chirag Sudarshan (Technische Universität Kaiserslautern, Germany), Taha Soliman, Cecilia De la Parra (Robert Bosch GmbH – Corporate Research, Germany), Christian Weis (Technische Universität Kaiserslautern, Germany), Leonardo Ecco (Robert Bosch GmbH – Corporate Research, Germany), Matthias Jung (Fraunhofer IESE, Germany), Norbert Wehn (Technische Universität Kaiserslautern, Germany), Andre Guntoro (Robert Bosch GmbH – Corporate Research, Germany)
Page: pp. 35 - 42
Keyword: Processing-in-Memory (PIM), DRAM, Convolution Neural Network (CNN)
Abstract: Processing-in-Memory (PIM) is an emerging approach to bridge the memory-computation gap. One of the key challenges of PIM architectures in the scope of neural network inference is the deployment of traditional area-intensive arithmetic multipliers in memory technology, especially for DRAM-based PIM architectures. Hence, existing DRAM PIM architectures are either confined to binary networks or exploit the analog property of the sub-array bitlines to perform bulk bit-wise logic operations. The former reduces the accuracy of predictions, i.e., quality of results, while the latter increases overall latency and power consumption. In this paper, we present a novel DRAM-based PIM architecture and implementation for multi-bit-precision CNN inference. The proposed implementation relies on shifter-based approximate multiplications specially designed to fit into commodity DRAM architectures and their technology. The main goal of this work is to propose an architecture that is fully compatible with commodity DRAM architecture and to maintain a similar thermal design power (i.e., <1 W). Our evaluation shows that the proposed DRAM-based PIM has a small area overhead of 6.6% when compared with an 8 Gb commodity DRAM. Moreover, the architecture delivers a peak performance of 8.192 TOPS per memory channel while maintaining a very high energy efficiency. Finally, our evaluation also shows that the use of approximate multipliers results in a negligible drop in prediction accuracy (i.e., <2%) in comparison with conventional CNN inference that relies on traditional arithmetic multipliers.
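
Illustration (not the paper's multiplier): one generic way to replace a multiplier with a shifter is to round the weight to the nearest power of two and shift the activation by that exponent. The abstract does not describe the authors' actual scheme, so the rounding rule and function names below are assumptions made purely to show the idea and its approximation error.

```python
def nearest_pow2_exponent(w):
    """Exponent e such that 2**e is the power of two closest to |w| (w != 0).

    Ties are resolved toward the smaller power of two.
    """
    mag = abs(w)
    lo = mag.bit_length() - 1  # largest e with 2**e <= mag
    hi = lo + 1
    return lo if (mag - (1 << lo)) <= ((1 << hi) - mag) else hi


def shift_multiply(x, w):
    """Approximate x * w for integers using a single shift instead of a multiply."""
    if w == 0:
        return 0
    product = x << nearest_pow2_exponent(w)
    return -product if w < 0 else product


# Compare exact and approximate products for a few hypothetical (activation, weight) pairs.
for x, w in [(7, 8), (7, 6), (5, -3)]:
    print(x, w, x * w, shift_multiply(x, w))
```

The result is exact when the weight is already a power of two and can be off substantially otherwise, so any accuracy claim depends on the specific approximation and network, as evaluated in the paper.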

1C-2
TitleA Quantized Training Framework for Robust and Accurate ReRAM-based Neural Network Accelerators
Author*Chenguang Zhang, Pingqiang Zhou (ShanghaiTech University, China)
Pagepp. 43 - 48
KeywordReRAM, Variation, Robust, Quantize
AbstractNeural networks (NN), especially deep neural networks (DNN), have achieved great success in many fields. The ReRAM crossbar, as a promising candidate, is widely employed to accelerate neural networks owing to its natural ability to perform matrix-vector multiplication (MVM). However, the ReRAM crossbar suffers from high conductance variation due to many non-ideal effects, resulting in severe inference accuracy degradation. Recent works use uniform quantization to enhance the tolerance of conductance variation, but these methods still suffer high accuracy loss under large variation. In this paper, we first analyze the impact of quantization and conductance variation on accuracy. Then, based on two observations, we propose a quantized training framework that enhances the robustness and accuracy of the neural network running on the accelerator by introducing a smart non-uniform quantizer. This framework consists of a robust trainable quantizer and a corresponding training method, needs no extra hardware overhead, and is compatible with a standard neural network training procedure. Experimental results show that our proposed method can improve inference accuracy by 10% ∼ 30% under large variation, compared with the uniform quantization method.
Slides

1C-3
TitleAttention-in-Memory for Few-Shot Learning with Configurable Ferroelectric FET Arrays
Author*Dayane Reis, Ann Franchesca Laguna, Michael Niemier, X. Sharon Hu (University of Notre Dame, USA)
Pagepp. 49 - 54
KeywordComputing-in-memory, Few-shot learning, Emerging technologies, FeFETs, Ferroelectric FETs
AbstractAttention-in-Memory (AiM), a computing-in-memory (CiM) design, is introduced to implement the attentional layer of Memory Augmented Neural Networks (MANNs). AiM consists of a memory array based on Ferroelectric FETs (FeFET) along with CMOS peripheral circuits implementing configurable functionalities, i.e., it can be dynamically changed from a ternary content-addressable memory (TCAM) to a general-purpose (GP) CiM. When compared to state-of-the-art accelerators, AiM achieves comparable end-to-end speed-up and energy for MANNs, with better accuracy (95.14% vs. 92.21%, and 95.14% vs. 91.98%) at iso-memory size, for a 5-way 5-shot inference task with the Omniglot dataset.


[To Session Table]

Session 1D  Validation and Verification
Time: 15:00 - 15:30, Tuesday, January 19, 2021
Location: Room 1D
Chairs: Seetal Potluri (NC State University, USA), He Li (University of Cambridge, UK)

1D-1
TitleMutation-based Compliance Testing for RISC-V
AuthorVladimir Herdt (DFKI GmbH, Germany), *Sören Tempel (University of Bremen, Germany), Daniel Große (Johannes Kepler University Linz, Austria), Rolf Drechsler (University of Bremen & DFKI GmbH, Germany)
Pagepp. 55 - 60
KeywordRISC-V, Compliance Testing, Simulation, Mutation, Symbolic Execution
AbstractCompliance testing for RISC-V is very important. Essentially, it ensures that compatibility is maintained between RISC-V implementations and the ever-growing RISC-V ecosystem. Therefore, an official compliance testsuite is being actively developed. However, it is very difficult to ensure that all relevant functional behavior is comprehensively tested. In this paper we propose a mutation-based approach to boost RISC-V compliance testing by providing more comprehensive testing results. To this end, we define mutation classes tailored for RISC-V to assess the quality of the compliance testsuite and provide a symbolic execution framework to generate new testcases that kill the undetected mutants. Our experimental results demonstrate the effectiveness of our approach. We identified several serious gaps in the compliance testsuite and generated new tests to close these gaps.
Slides

1D-2
TitleA General Equivalence Checking Framework for Multivalued Logic
Author*Chia-Chun Lin, Hsin-Ping Yen, Sheng-Hsiu Wei, Pei-Pei Chen (National Tsing Hua University, Taiwan), Yung-Chih Chen (Yuan Ze University, Taiwan), Chun-Yao Wang (National Tsing Hua University, Taiwan)
Pagepp. 61 - 66
KeywordEquivalence checking, multivalued logic, SAT solvers
AbstractLogic equivalence checking is a critical task in the ASIC design flow. Due to the rapid development of nanotechnology-based devices, an efficient implementation of multivalued logic has become practical. As a result, many synthesis algorithms for ternary logic have been proposed. In this paper, we present an equivalence checking framework for multivalued logic that exploits modern SAT solvers. Furthermore, a structural conflict-driven clause learning (SCDCL) technique is proposed to accelerate the SAT solving process. The SCDCL algorithm deploys several strategies to prune the search space for SAT algorithms. The experimental results show that the proposed SCDCL technique saves 42% of the SAT solvers' CPU time on average over a set of industrial benchmarks.
Slides

1D-3
TitleATLaS: Automatic Detection of Timing-based Information Leakage Flows for SystemC HLS Designs
Author*Mehran Goli, Rolf Drechsler (University of Bremen/DFKI, Germany)
Pagepp. 67 - 72
KeywordSystemC, Timing Flows, HLS, IFT
AbstractIn order to meet the time-to-market constraint, High-level Synthesis (HLS) is being increasingly adopted by the semiconductor industry. HLS designs, which can be automatically translated into the Register Transfer Level (RTL), are typically written in SystemC at the Electronic System Level (ESL). Timing-based information leakage and its countermeasures, while well-known at RTL and below, have not yet been considered for HLS. The paper makes a contribution to this emerging research area by proposing ATLaS, a novel approach for detecting timing-based information leakage flows in SystemC HLS designs. The efficiency of our approach in identifying timing channels for SystemC HLS designs is demonstrated on two security-critical architectures: a shared interconnect and a crypto core.


[To Session Table]

Session 1E  Design Automation Methods for Various Microfluidic Platforms
Time: 15:00 - 15:30, Tuesday, January 19, 2021
Location: Room 1E
Chairs: Tsun-Ming Tseng (Technical University of Munich, Germany), Yamashita Shigeru (Ritsumeikan University, Japan)

1E-1
TitleA multi-commodity network flow based routing algorithm for paper-based digital microfluidic biochips
Author*Nai-Ren Shih, Tsung-Yi Ho (National Tsing Hua University, Taiwan)
Pagepp. 73 - 78
KeywordRouting, Network-Flow, Microfluidic
AbstractPaper-based digital microfluidic biochips (P-DMFBs) have emerged as a safe, low-cost, and fast-responsive platform for biochemical assays. In P-DMFBs, droplet manipulations are executed by the electrowetting technology. In order to enable the electrowetting technology, patterned arrays of electrodes and control lines are coated on paper with a hydrophobic Teflon film and dielectric parylene-C film. Different from traditional DMFBs, the manufacturing of P-DMFBs is efficient and inexpensive since the electrodes and control lines are printed on photo paper with an inkjet printer. The active paper-based hybridized chip (APHC) is a type of P-DMFB that has open and closed parts. APHCs are more convenient than common P-DMFBs since there is no need to fabricate and maintain the micro gap between the glass and the paper chip, which requires highly delicate treatment. However, the pattern rails of electrodes in APHCs are denser than in traditional P-DMFBs, which makes existing electrode routing algorithms fail on APHCs. To deal with the challenge in electrode routing of APHCs, this paper proposes a multi-commodity network flow-based routing algorithm, which simultaneously maximizes the routability and minimizes the total wire length of control lines. The multi-commodity flow model can utilize pin-sharing between electrodes, which can improve routability and reduce the detour of routing lines. Moreover, the activation sequences of electrodes are considered, which guarantees that the bioassay will not be interfered with after pin-sharing. The proposed method achieves a 100% successful routing rate on real-life APHCs, while other electrode routing methods cannot solve the electrode routing of APHCs successfully.
Slides

1E-2
TitleInterference-free Design Methodology for Paper-Based Digital Microfluidic Biochips
Author*Yun-Chen Lo (National Tsing Hua University, Taiwan), Bing Li (Technische Universität München, Germany), Sooyong Park, Kwanwoo Shin (Sogang University, Republic of Korea), Tsung-Yi Ho (National Tsing Hua University, Taiwan)
Pagepp. 79 - 84
KeywordPaper-based microfluidic biochips, Interference-free design automation, Soft control interference constraint relaxation
AbstractPaper-based digital microfluidic biochips (P-DMFBs) have recently attracted great attention for their low-cost, in-place, and fast fabrication. This technology is essential for agile bio-assay development and deployment. P-DMFBs print electrodes and associated control lines on paper to control droplets and complete bio-assays. However, P-DMFBs have the following issues: 1) control line interference may cause unwanted droplet movements, 2) avoiding control interference degrades assay performance and routability, 3) single-layer fabrication limits routability, and 4) expensive ink cost limits the low-cost benefits of P-DMFBs. To solve the above issues, this work proposes an interference-free design methodology to design P-DMFBs with fast assay speed, better routability, and compact printing area. The contributions are as follows: First, we categorize control interference into soft and hard. Second, we identify that only soft interference happens and propose to relax soft control interference. Third, we propose an interference-free design methodology. Finally, we propose a cost-efficient ILP-based fluidic design module. Experimental results show that the proposed method outperforms prior work across all bio-assay benchmarks. Compared to previous work, our cost-optimized designs use only 47%∼78% area, gain 3.6%∼16.2% more routing resources, and achieve 0.97x∼1.5x shorter assay completion time. Our performance-optimized designs can accelerate assay speed by 1.05x∼1.65x using 81%∼96% printed area.
Slides

1E-3
TitleAccurate and Efficient Simulation of Microfluidic Networks
Author*Gerold Fink, Philipp Ebner (Institute for Integrated Circuits, Johannes Kepler University, Austria), Medina Hamidović, Werner Haselmayr (Institute for Communications Engineering and RF-Systems, Johannes Kepler University, Austria), Robert Wille (Institute for Integrated Circuits, Johannes Kepler University, Austria)
Pagepp. 85 - 90
Keywordmicrofluidics, 1D-model, simulation, CFD, droplet-based
AbstractMicrofluidics is a prospective field which provides technological advances to the life sciences. However, the design process for microfluidic devices is still in its infancy and frequently results in a "trial-and-error" scheme. In order to overcome this problem, simulation methods provide a powerful solution, allowing for deriving a design, validating its functionality, or exploring alternatives without the need of an actual fabricated and costly prototype. To this end, several physical models are available such as Computational Fluid Dynamics (CFD) or the 1-dimensional analysis model. However, while CFD-simulations have high accuracy, they also have high costs with respect to setup and simulation time. On the other hand, the 1D-analysis model is very efficient but lacks in accuracy when it comes to certain phenomena. In this work, we present ideas to combine these two models and, thus, to provide an accurate and efficient simulation approach for microfluidic networks. A case study confirms the general suitability of the proposed approach.


[To Session Table]

Session 2A  University Design Contest II
Time: 15:30 - 16:00, Tuesday, January 19, 2021
Location: Room 2A
Chairs: Kousuke Miyaji (Shinshu University, Japan), Akira Tsuchiya (The University of Shiga Prefecture, Japan)

Special Feature Award
2A-1
TitleA 65nm CMOS Process Li-ion Battery Charging Cascode SIDO Boost Converter with 89% Maximum Efficiency for RF Wireless Power Transfer Receiver
Author*Yasuaki Isshiki, Dai Suzuki, Ryo Ishida, Kousuke Miyaji (Shinshu University, Japan)
Pagepp. 91 - 92
KeywordRF wireless power transfer, SIDO boost converter, battery charger, low power
AbstractThis paper proposes a 65nm CMOS process cascode single-inductor-dual-output (SIDO) boost converter for RF wireless power transfer (WPT) receiver. In order to withstand 4.2V Li-ion battery output, cascode 2.5V I/O PFETs are used at the power stage while 2.5V cascode NFETs are used for 1V output to supply low voltage control circuit. By using NFETs, 1V output with 5V tolerance can be achieved. Measurement results show conversion efficiency of 89% at P_IN=7.9mW and V_BAT=3.4V.

2A-2
TitleA High Accuracy Phase and Amplitude Detection Circuit for Calibration of 28GHz Phased Array Beamformer System
Author*Joshua Alvin, Jian Pang, Atsushi Shirane, Kenichi Okada (Tokyo Institute of Technology, Japan)
Pagepp. 93 - 94
KeywordPhased Array, 5G, Millimeter wave, Calibration
AbstractThis paper presents high-accuracy phase and amplitude detection circuits for the calibration of 5G millimeter-wave phased array beamformer systems. The phase and amplitude detection circuits, which are implemented in a 65nm CMOS process, can realize phase and amplitude detections with RMS phase error of 0.17 degree and RMS gain error of 0.12 dB, respectively. The total power consumption of the circuits is 59mW.

2A-3
TitleA Highly Integrated Energy-efficient CMOS Millimeter-wave Transceiver with Direct-modulation Digital Transmitter, Quadrature Phased-coupled Frequency Synthesizer and Substrate-Integrated Waveguide E-shaped Patch Antenna
Author*Wei Deng, Zheng Song, Ruichang Ma, Haikun Jia, Baoyong Chi (Tsinghua University, China)
Pagepp. 95 - 96
Keywordmm-wave, transceiver, transmitter, PLL, antenna
AbstractAn energy-efficient millimeter-wave transceiver with direct-modulation digital transmitter (TX) and I/Q phase-coupled frequency synthesizer is presented in this paper. A wideband Substrate-Integrated Waveguide (SIW) E-shaped patch antenna is designed for high-level integration. This work demonstrates a fully functional wireless link based on the proposed mm-wave transceiver. The proposed transceiver achieves a 10-Gbps data rate while consuming 340.4 mW. The measured Over-the-Air (OTA) EVM is -13.8 dB. The energy efficiency is 34 pJ/bit, which is a significant improvement compared with the state-of-the-art mm-wave transceivers.
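
The energy-efficiency figure quoted in this abstract can be reproduced from the two other numbers it reports, since energy per bit is power divided by data rate. The short check below is only a sanity calculation; the helper name `energy_per_bit_pj` is made up for this illustration.

```python
def energy_per_bit_pj(power_mw: float, data_rate_gbps: float) -> float:
    """Energy per transferred bit in picojoules: power / data rate."""
    power_w = power_mw * 1e-3          # mW -> W
    bits_per_s = data_rate_gbps * 1e9  # Gbps -> bit/s
    return power_w / bits_per_s * 1e12  # J/bit -> pJ/bit


if __name__ == "__main__":
    # Values from the abstract: 340.4 mW at a 10-Gbps data rate.
    print(f"{energy_per_bit_pj(340.4, 10.0):.2f} pJ/bit")
```

With 340.4 mW and 10 Gbps this gives about 34.04 pJ/bit, consistent with the 34 pJ/bit stated above.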

2A-4
TitleA 3D-Stacked SRAM Using Inductive Coupling Technology for AI Inference Accelerator in 40-nm CMOS
Author*Kota Shiba, Tatsuo Omori, Mototsugu Hamada, Tadahiro Kuroda (The University of Tokyo, Japan)
Pagepp. 97 - 98
Keyword3D integration, 3D memory architecture, deep neural networks, inductive coupling, SRAM
AbstractA 3D-stacked SRAM using an inductive coupling wireless inter-chip communication technology (TCI) is presented for an AI inference accelerator. The energy and area efficiency are improved thanks to the introduction of a proposed low-voltage NMOS push-pull transmitter and a 12:1 SerDes. A termination scheme to short unused open coils is proposed to eliminate the ringing in an inductive coupling bus. Test chips were fabricated in a 40-nm CMOS technology confirming 0.40-V operation of the proposed transmitter with successful stacked SRAM operation.

2A-5
TitleSub-10-μm Coil Design for Multi-Hop Inductive Coupling Interface
Author*Tatsuo Omori, Kota Shiba, Mototsugu Hamada, Tadahiro Kuroda (The University of Tokyo, Japan)
Pagepp. 99 - 100
Keyword3D memory, 3D system integration, inductive coupling, through-silicon via (TSV), ThruChip Interface (TCI)
AbstractSub-10-μm on-chip coils are designed and prototyped for the multi-hop inductive coupling interface in a 40-nm CMOS. Multi-layer coils and a new receiver circuit are employed to compensate for the decrease of the coupling coefficient due to the small coil size. The prototype emulates a 3D stacked module with 8 dies in a 7-nm CMOS and shows that a 0.1-pJ/bit and 41-Tb/s/mm² inductive coupling interface is achievable.

2A-6
TitleCurrent-Starved Chaotic Oscillator Over Multiple Frequency Decades on Low-Cost CMOS
Author*Korkut Kaan Tokgoz, Ludovico Minati, Hiroyuki Ito (Tokyo Institute of Technology, Japan)
Pagepp. 101 - 102
Keywordchaotic oscillator, CMOS, current-starved, distributed sensing, multiple frequency decades
AbstractThis work presents a current-starved cross-coupled chaotic oscillator achieving multiple decades of oscillation frequency spanning 2 kHz to 15 MHz. The main circuit characteristics are low-power consumption (<100 nW to 25 µW, at 1 V supply voltage), and controllability of the oscillation frequency, enabling future applications such as in distributed environmental sensing. The IC was implemented in 180 nm standard CMOS process, yielding a core area of 0.028 mm².

2A-7
TitleTCI tester: Tester for Through Chip Interface
Author*Hideto Kayashima, Hideharu Amano (Keio University, Japan)
Pagepp. 103 - 104
KeywordWireless inter-chip communication, TCI
AbstractTCI tester is a chip for evaluating the electrical characteristics of the TCI IP embedded into family chips for stacking. The evaluation of various chips that include the TCI IP revealed problems, and layout guidelines are presented.

2A-8
TitleAn 18 Bit Time-to-Digital Converter Design with Large Dynamic Range and Automated Multi-Cycle Concept
Author*Peter Toth, Hiroki Ishikuro (Keio University, Japan)
Pagepp. 105 - 106
KeywordTime-Domain Converter, Multi-Stage TDC, Low-Power TDC, Sensor Interface
AbstractThis paper presents a wide-dynamic-range high-resolution time-domain converter concept tailored for low-power sensor interfaces. The unique system structure applies different techniques to reduce circuit complexity, power consumption, and noise sensitivity. A multi-cycle concept allows a virtual delay line extension and is applied to achieve high resolution down to 1 ns. At the same time, it expands the dynamic range drastically up to 2.35 ms. Moreover, individually tunable delay elements in the range of 1 ns to 12 ns allow on-demand flexible operation in a low- or high-resolution mode for smart sensing applications and flexible power control. The concept of this paper is evaluated by a custom-designed FPGA supported PCB. The presented concept is highly suitable for on-chip integration.


[To Session Table]

Session 2B  Emerging Non-Volatile Processing-In-Memory for Next Generation Computing
Time: 15:30 - 16:00, Tuesday, January 19, 2021
Location: Room 2B
Chairs: Md Tanvir Arafin (Morgan State University, USA), Tae Hyoung (Tony) Kim (Nanyang Technological University, Singapore)

Best Paper Award
2B-1
TitleConnection-based Processing-In-Memory Engine Design Based on Resistive Crossbars
Author*Shuhang Zhang (Technical University of Munich, Germany), Hai Li (Duke University, USA), Ulf Schlichtmann (Technical University of Munich, Germany)
Pagepp. 107 - 113
KeywordAccelerator, Deep neural network, Processing-in-memory, Resistive random access memory
AbstractDeep neural networks have successfully been applied to various fields. The efficient deployment of neural network models emerges as a new challenge. Processing-in-memory (PIM) engines that carry out computation within memory structures are widely studied for improving computation efficiency and data communication speed. In particular, resistive memory crossbars can naturally realize the dot-product operations and show great potential in PIM design. The common practice of a current-based design is to map a matrix to a crossbar, apply the input data from one side of the crossbar, and extract the accumulated currents as the computation results at the orthogonal direction. In this study, we propose a novel PIM design concept that is based on the crossbar connections. Our analysis on star-mesh network transformation reveals that in a crossbar storing both input data and weight matrix, the dot-product result is embedded within the network connection. Our proposed connection-based PIM design leverages this feature and discovers the latent dot-products directly from the connection information. Moreover, in the connection-based PIM design, the output current range of resistive crossbars can easily be adjusted, leading to more linear conversion to voltage values, and the output circuitry can be shared by multiple resistive crossbars. The simulation results show that our design can achieve on average 46.23% and 33.11% reductions in area and energy consumption, with a merely 3.85% latency overhead compared with current-based designs.
Slides
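
The current-based baseline described in the abstract of 2B-1 (matrix mapped to cell conductances, inputs applied as row voltages, column currents read out) can be sketched with an idealized model. The sketch below assumes perfect wires and linear cells, so each column current is just the sum of voltage times conductance along that column; it does not model the connection-based design the paper proposes. The function name `crossbar_currents` is hypothetical.

```python
def crossbar_currents(voltages, conductances):
    """Ideal current-based crossbar: I[j] = sum_i V[i] * G[i][j].

    voltages:      one input voltage per crossbar row
    conductances:  rows x cols matrix of cell conductances (the stored matrix)
    Assumption: no wire resistance, sneak paths, or device non-linearity.
    """
    rows = len(conductances)
    if rows == 0 or len(voltages) != rows:
        raise ValueError("need one input voltage per crossbar row")
    cols = len(conductances[0])
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


if __name__ == "__main__":
    v = [1.0, 2.0]               # row voltages (input vector)
    g = [[1.0, 0.5],             # cell conductances (weight matrix)
         [0.25, 2.0]]
    print(crossbar_currents(v, g))  # one accumulated current per column
```

Each output current is one dot product between the input vector and a matrix column, which is why a single read of the array yields a full matrix-vector product.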

2B-2
TitleFePIM: Contention-Free In-Memory Computing Based on Ferroelectric Field-Effect Transistors
Author*Xiaoming Chen, Yuping Wu, Yinhe Han (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 114 - 119
KeywordFeFET, data contention, PIM
AbstractThe memory wall bottleneck has caused a large portion of the energy to be consumed by data transfer between processors and memories when dealing with data-intensive workloads. By giving some processing abilities to memories, processing-in-memory (PIM) is a promising technique to alleviate the memory wall bottleneck. In this work, we propose a novel PIM architecture by employing ferroelectric field-effect transistors (FeFETs). The proposed design, named FePIM, is able to perform in-memory bitwise logic and add operations between two selected rows or between one selected row and an immediate operand. By utilizing unique features of FeFET devices, we further propose novel solutions to eliminate simultaneous-read-and-write (SRAW) contentions such that stalls are eliminated. Experimental results show that FePIM reduces the memory access latency by 15% and the memory access energy by 44%, compared with an enhanced version of a state-of-the-art FeFET-based PIM design which cannot handle SRAW contentions.

2B-3
TitleRIME: A Scalable and Energy-Efficient Processing-In-Memory Architecture for Floating-Point Operations
Author*Zhaojun Lu (University of Maryland, USA), Md Tanvir Arafin (Morgan State University, USA), Gang Qu (University of Maryland, USA)
Pagepp. 120 - 125
KeywordProcessing in-memory, Resistive random access memory (RRAM), Floating-point multiplier, Minority logic
AbstractProcessing in-memory (PIM) is an emerging technology poised to break the memory-wall in the conventional von Neumann architecture. PIM reduces data movement from the memory systems to the CPU by utilizing memory cells for logic computation. However, existing PIM designs do not support high precision computation (e.g., floating-point operations) essential for critical data-intensive applications. Furthermore, PIM architectures require a complex control module and costly peripheral circuits to harness the full potential of in-memory computation. These peripherals and control modules usually suffer from scalability and efficiency issues. Hence, in this paper, we explore the analog properties of the resistive random access memory (RRAM) crossbar and propose a scalable RRAM-based in-memory floating-point computation architecture (RIME). RIME uses single-cycle NOR, NAND, and Minority logic to achieve floating-point operations. RIME features a centralized control module and a simplified peripheral circuit that eliminate data movement during parallel computation. An experimental 32-bit multiplier designed using RIME demonstrates a 4.8X speedup, a 1.9X area improvement, and a 5.4X energy-efficiency improvement over state-of-the-art RRAM-based PIM multipliers.

2B-4
TitleA Non-Volatile Computing-In-Memory Framework With Margin Enhancement Based CSA and Offset Reduction Based ADC
Author*Yuxuan Huang, Yifan He, Jinshan Yue, Huazhong Yang, Yongpan Liu (Tsinghua University, China)
Pagepp. 126 - 131
Keywordcomputing-in-memory, RRAM, margin enhancement, offset reduction
AbstractNowadays, deep neural networks (DNNs) play an important role in machine learning. Non-volatile computing-in-memory (nvCIM) for DNNs has become a new architecture to optimize hardware performance and energy efficiency. However, the existing nvCIM accelerators focus on system-level performance but ignore analog factors. In this paper, the sense margin and offset are considered in the proposed nvCIM framework. The margin enhancement based current-mode sense amplifier (MECSA) and the offset reduction based analog-to-digital converter (ORADC) are proposed to improve the accuracy of the ADC. Based on the above methods, the nvCIM framework is presented, and the experimental results show that the proposed framework improves area, power, and latency while keeping the accuracy of network models high, and that its energy efficiency is 2.3 - 20.4x that of the existing RRAM-based nvCIM accelerators.


[To Session Table]

Session 2C  (SS-1) Emerging Trends for Cross-Layer Co-Design: From Device, Circuit, to Architecture, Application
Time: 15:30 - 16:00, Tuesday, January 19, 2021
Location: Room 2C
Chairs: Xunzhao Yin (Zhejiang University, China), Mohsen Imani (University of California at Irvine, USA)

2C-1
Title(Invited Paper) Cross-layer Design for Computing-in-Memory: From Devices, Circuits, to Architectures and Applications
AuthorHussam Amrouch (University of Stuttgart, Germany), Xiaobo Sharon Hu (University of Notre Dame, USA), Mohsen Imani (University of California, Irvine, USA), Ann Franchesca Laguna, Michael Niemier (University of Notre Dame, USA), Simon Thomann (Karlsruhe Institute of Technology, Germany), Xunzhao Yin, Cheng Zhuo (Zhejiang University, China)
Pagepp. 132 - 139

2C-2
Title(Invited Paper) Impact of Emerging Devices on Future Computing
AuthorHussam Amrouch (University of Stuttgart, Germany)

2C-3
Title(Invited Paper) SaVI: Seed-and-Vote based In-Memory Accelerator for DNA Read Mapping
Author*Ann Franchesca Laguna, Xiaobo Sharon Hu (University of Notre Dame, USA)

2C-4
Title(Invited Paper) Digital-based Processing In-Memory for Acceleration of Unsupervised Learning
AuthorMohsen Imani (University of California, Irvine, USA)


[To Session Table]

Session 2D  Machine Learning Techniques for EDA in Analog/Mixed-Signal ICs
Time: 15:30 - 16:00, Tuesday, January 19, 2021
Location: Room 2D
Chairs: Chien-Nan Jimmy Liu (National Chiao Tung University, Taiwan), Fan Yang (Fudan University, China)

Best Paper Candidate
2D-1
TitleAutomatic Surrogate Model Generation and Debugging of Analog/Mixed-Signal Designs Via Collaborative Stimulus Generation and Machine Learning
Author*Jun Yang Lei, Abhijit Chatterjee (Georgia Institute of Technology, USA)
Pagepp. 140 - 145
KeywordMixed-signal and analog design validation, test generation, machine learning
AbstractIn top-down mixed-signal design, a key problem is to ensure that the netlist or physical design does not contain unanticipated behaviors. Mismatches between netlist level circuit descriptions and high level behavioral models need to be captured at all stages of the design process for accuracy of system level simulation as well as fast convergence of the design. To support the above, we present a guided test generation algorithm that explores the input stimulus space and generates new stimuli which are likely to excite differences between the model and its netlist description. Subsequently, a recurrent neural network (RNN) based learning model is used to learn divergent model and netlist behaviors and absorb them into the model to minimize these differences. The process is repeated iteratively and in each iteration, a Bayesian optimization algorithm is used to find optimal RNN hyperparameters to maximize behavior learning. The result is a circuit-accurate behavioral model that is also much faster to simulate than a circuit simulator. In addition, another sub-goal is to perform design bug diagnosis to track the source of observed behavioral anomalies down to individual modules or small levels of circuit detail. An optimization-based diagnosis approach using Volterra learning kernels that is easily integrated into circuit simulators is proposed. Results on representative circuits are presented.

2D-2
TitleA Robust Batch Bayesian Optimization for Analog Circuit Synthesis via Local Penalization
Author*Jiangli Huang, Fan Yang, Changhao Yan (Fudan University, China), Dian Zhou (UT Dallas, USA), Xuan Zeng (Fudan University, China)
Pagepp. 146 - 151
KeywordAnalog circuit synthesis, Bayesian optimization, Local penalization
AbstractBayesian optimization has been successfully introduced to analog circuit synthesis recently. Since the evaluations of performances are computationally expensive, batch Bayesian optimization has been proposed to run simulations in parallel. However, circuit simulations may fail during the optimization due to improper design variables. In such cases, Bayesian optimization methods may have poor performance. In this paper, we propose a Robust Batch Bayesian Optimization approach (RBBO) for analog circuit synthesis. Local penalization (LP) is used to capture the local repulsion between query points in one batch. The diversity of the query points can thus be guaranteed. The failed points and their neighborhoods can also be excluded by LP. Moreover, we propose an Adaptive Local Penalization (ALP) strategy to adaptively scale the penalized areas to improve the convergence of our proposed RBBO method. The proposed approach is compared with the state-of-the-art algorithms on several practical analog circuits. The experimental results demonstrate the efficiency and robustness of the proposed method.
Slides

2D-3
TitleLayout Symmetry Annotation for Analog Circuits with Graph Neural Networks
Author*Xiaohan Gao (Peking University, China), Chenhui Deng (Cornell University, USA), Mingjie Liu (University of Texas at Austin, USA), Zhiru Zhang (Cornell University, USA), David Z. Pan (University of Texas at Austin, USA), Yibo Lin (Peking University, China)
Pagepp. 152 - 157
KeywordAnalog layout generation, Machine learning, Graph neural networks
AbstractThe performance of analog circuits is susceptible to various layout constraints, such as symmetry, matching, shielding, etc. Modern analog placement and routing algorithms usually need to take these constraints as input for high quality solutions, while manually annotating such constraints is tedious and requires design expertise. Thus, automatic constraint annotation from circuit netlists is a critical step to analog layout automation. In this work, we propose a graph learning based framework to learn the general rules for annotation of the symmetry constraints with path-based feature extraction and label filtering techniques. Experimental results on the open-source analog circuit designs demonstrate that our framework is able to achieve significantly higher accuracy compared with the most recent works on symmetry constraint detection leveraging graph similarity and signal flow analysis techniques. The framework is general and can be extended to other pairwise constraints as well.

2D-4
TitleFast and Efficient Constraint Evaluation of Analog Layout using Machine Learning Models
Author*Tonmoy Dhar, Jitesh Poojary (University of Minnesota, Twin Cities, USA), Yaguang Li (Texas A&M University, USA), Kishor Kunal, Meghna Madhusudan, Arvind Kumar Sharma, Susmita Dey Manasi (University of Minnesota, Twin Cities, USA), Jiang Hu (Texas A&M University, USA), Ramesh Harjani, Sachin Sapatnekar (University of Minnesota, Twin Cities, USA)
Pagepp. 158 - 163
KeywordAnalog, Automation, Parasitic, Machine, Learning
AbstractPlacement algorithms for analog circuits use iterative methods to search for an optimal layout. During this search, multiple layout configurations are explored, each of which corresponds to a different set of wire parasitics in the circuit. To ensure that the layout meets a set of electrical constraints, it is essential to develop fast predictors of circuit performance that can guide the layout engine. This work presents a novel methodology to incorporate performance requirements in the analog IC layout generation process. The flow starts by discerning rough bounds on layout parasitics based on the circuit netlist. Next, a Latin hypercube sampling technique is used to select a sample set from the reduced search space. Based on classification accuracy, the framework first leverages a support vector machine (SVM) with a linear kernel. Depending on the accuracy of this SVM, a denser sample set is then chosen to refine the SVM, or a multilayer perceptron is used to model nonlinearities. The resulting machine learning model is used to rapidly evaluate a placement to determine whether or not it satisfies circuit constraints, and can rapidly direct the placement engine away from areas of the design space that do not meet electrical constraints.
Slides


[To Session Table]

Session 2E  Innovating Ideas in VLSI Routing Optimization
Time: 15:30 - 16:00, Tuesday, January 19, 2021
Location: Room 2E
Chairs: Yibo Lin (Peking University, China), Gengjie Chen (Giga Design Automation, Hong Kong)

Best Paper Award
2E-1
TitleTreeNet: Deep Point Cloud Embedding for Routing Tree Construction
Author*Wei Li, Yuxiao Qu (The Chinese University of Hong Kong, Hong Kong), Gengjie Chen (Giga Design Automation, China), Yuzhe Ma, Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 164 - 169
KeywordRouting Tree, Point Cloud
AbstractIn routing tree construction, both wirelength (WL) and pathlength (PL) are of importance. Among all methods, PD-II and SALT are the two most prominent ones. However, neither PD-II nor SALT always dominates the other in terms of both WL and PL for all nets. In addition, estimating the best parameters for both algorithms is still an open problem. In this paper, we model the pins of a net as a point cloud and further propose TreeNet, a novel deep net architecture to obtain the embedding of the point cloud object. Based on the obtained cloud embedding, an adaptive workflow is designed for the routing tree construction. Experimental results show that the proposed TreeNet is superior to other deep learning models for the point cloud on classification tasks. Moreover, the proposed adaptive workflow for the routing tree construction outperforms SALT and PD-II in terms of both efficiency and effectiveness.
Slides

2E-2
TitleA Unified Printed Circuit Board Routing Algorithm With Complicated Constraints and Differential Pairs
Author*Ting-Chou Lin, Devon Merrill, Yen-Yi Wu, Chester Holtz, Chung-Kuan Cheng (University of California, San Diego, USA)
Pagepp. 170 - 175
Keywordprinted circuit board, routing
AbstractThe printed circuit board (PCB) routing problem has been studied extensively in recent years. Due to continually growing net/pin counts, extremely high pin density, and unique physical constraints, the manual routing of PCBs has become a time-consuming task to reach design closure. Previous works break down the problem into escape routing and area routing and focus on these problems separately. However, there is always a gap between these two problems, requiring a massive amount of human effort to fine-tune the algorithms back and forth. Besides, previous works on area routing mainly focus on routing between escape-routed ball-grid-array (BGA) packages. Nevertheless, in practice, many components are not in the form of BGA packages, such as passive devices, decoupling capacitors, and through-hole pin arrays. To mitigate the deficiencies of previous works, we propose a full-board routing algorithm that can handle multiple real-world complicated constraints to facilitate printed circuit board routing and produce high-quality manufacturable layouts. Experimental results show that our algorithm is effective and efficient. Specifically, for all given test cases, our router can achieve 100% routability without any design rule violation, while the other two state-of-the-art routers fail to complete the routing for some test cases and incur design rule violations.

2E-3
TitleMulti-FPGA Co-optimization: Hybrid Routing and Competitive-based Time Division Multiplexing Assignment
AuthorXiaopeng Zhang, *Dan Zheng, Chak-Wa Pui, Evangeline F.Y. Young (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 176 - 182
KeywordFPGA, routing, time-division multiplexing, logic verification
AbstractIn multi-FPGA systems, time-division multiplexing (TDM) is a widely used technique to transfer signals between FPGAs. While TDM can greatly increase logic utilization, the inter-FPGA delay will also become longer. A good time-multiplexing scheme for inter-FPGA signals is very important for optimizing the system performance. In this work, we propose a fast algorithm to generate high quality time-multiplexed routing results for multiple FPGA systems. A hybrid routing algorithm is proposed to route the nets between FPGAs, by maze routing and by a fast minimum terminal spanning tree method. After obtaining a routing topology, a two-step method is applied to perform TDM assignment to optimize timing, which includes an initial assignment and a competitive-based refinement. Experimental results show that our system-level routing and TDM assignment algorithm can outperform both the top winner of the ICCAD 2019 Contest and the state-of-the-art methods. Moreover, compared to the state-of-the-art work, our approach has better run time by more than 2x with comparable TDM performance.

2E-4
TitleBoosting Pin Accessibility Through Cell Layout Topology Diversification
Author*Suwan Kim, Kyeongrok Jo, Taewhan Kim (Seoul National University, Republic of Korea)
Pagepp. 183 - 188
KeywordPin Accessibility, Cell Replacement, Standard cell, DTCO, ECO-routing
AbstractAs the layout of standard cells is becoming dense, accessing pins is much harder in detailed routing. The conventional solutions to resolving the pin access issue are to attempt cell flipping, cell shifting, cell swapping, and/or cell dilating in the placement optimization stage, expecting to acquire high pin accessibility. However, those solutions do not guarantee close-to-100% pin accessibility to ensure safe manual fixing afterward in the routing stage. Furthermore, there is as yet no easy and effective methodology to fix the inaccessibility in the detailed routing stage. This work addresses the problem of fixing the inaccessibility in the detailed routing stage. Precisely, (1) we produce, for each type of cell, multiple layouts with diverse pin locations and access points by modifying the core engines, i.e., gate poly ordering and middle-of-line dummy insertion, in the flow of design-technology co-optimization based automatic cell layout generation. Then, (2) we propose a systematic method to make use of those layouts to fix the routing failures caused by pin inaccessibility in the ECO (Engineering Change Order) routing stage. Experimental results demonstrate that our proposed cell layout diversification and replacement approach can fix 93.22% of metal-2 shorts in the ECO routing stage.


[To Session Table]

Session 3A  (SS-2) ML-Driven Approximate Computing
Time: 16:00 - 16:30, Tuesday, January 19, 2021
Location: Room 3A
Chairs: Hussam Amrouch (Karlsruhe Institute of Technology, Germany), Jörg Henkel (Karlsruhe Institute of Technology, Germany)

3A-1
Title(Invited Paper) Approximate Computing for ML: State-of-the-art, Challenges and Visions
AuthorGeorgios Zervakis (Karlsruhe Institute of Technology, Germany), Hassaan Saadat (University of New South Wales, Australia), Hussam Amrouch (Karlsruhe Institute of Technology, Germany), Andreas Gerstlauer (University of Texas at Austin, USA), Sri Parameswaran (University of New South Wales, Australia), Jörg Henkel (Karlsruhe Institute of Technology, Germany)
Pagepp. 189 - 196

3A-2
Title(Invited Paper) ML-Driven Run-time Configurable Approximate circuits
Author*Georgios Zervakis, Hussam Amrouch, Jörg Henkel (Karlsruhe Institute of Technology, Germany)

3A-3
Title(Invited Paper) The Art of Creating Approximate Components for Machine Learning
AuthorSri Parameswaran (University of New South Wales, Australia)

3A-4
Title(Invited Paper) Approximate High-Level Synthesis
AuthorSeogoo Lee (Cadence, USA), Andreas Gerstlauer (University of Texas at Austin, USA)


[To Session Table]

Session 3B  Architecture-Level Exploration
Time: 16:00 - 16:30, Tuesday, January 19, 2021
Location: Room 3B
Chairs: Raymond RuiRui Huang (Alibaba Cloud, China), Vassos Soteriou (Cyprus University of Technology, Cyprus)

Best Paper Candidate
3B-1
TitleBridging the Frequency Gap in Heterogeneous 3D SoCs through Technology-Specific NoC Router Architectures
Author*Jan Moritz Joseph (RWTH Aachen University, Germany), Lennart Bamberg (GrAI Matter Labs, Netherlands), Geonhwa Jeong, Ruei-Ting Chien (Georgia Institute of Technology, USA), Rainer Leupers (RWTH Aachen University, Germany), Alberto García-Ortiz (University of Bremen, Germany), Tushar Krishna (Georgia Institute of Technology, USA), Thilo Pionteck (Otto-von-Guericke Universitaet Magdeburg, Germany)
Pagepp. 197 - 203
KeywordNoC, heterogeneous 3D Integration, SoC
AbstractIn heterogeneous 3D System-on-Chips (SoCs), NoCs with uniform properties suffer from one major limitation: the clock frequency of routers varies due to different manufacturing technologies. For example, digital nodes allow for a higher clock frequency of routers than mixed-signal nodes. This large frequency gap is commonly tackled by complex and expensive pseudo-mesochronous or asynchronous router architectures. Here, a more efficient approach is chosen to bridge the frequency gap. We propose to use a heterogeneous network architecture. We show that reducing the number of VCs allows bridging a frequency gap of up to 2x. We achieve a system-level latency improvement of up to 47% for uniform random traffic and up to 59% for PARSEC benchmarks, a maximum throughput increase of 50%, up to 68% reduced area and 38% reduced power in an exemplary setting combining 15-nm digital and 30-nm mixed-signal nodes and comparing against a homogeneous synchronous network architecture. Versus asynchronous and pseudo-mesochronous router architectures, the proposed optimization consistently performs better in area and power, and the average flit latency improvement can exceed 51%.

3B-2
TitleCombining Memory Partitioning and Subtask Generation for Parallel Data Access on CGRAs
Author*Cheng Li, Jiangyuan Gu, Shouyi Yin, Leibo Liu, Shaojun Wei (Tsinghua University, China)
Pagepp. 204 - 209
KeywordCGRA, Memory Partitioning, Subtask Generation
AbstractCoarse-Grained Reconfigurable Architectures (CGRAs) are attractive reconfigurable platforms with the advantages of high performance and power efficiency. In a CGRA based computing system, the computations are often mapped onto the CGRA with parallel memory accesses. To fully exploit the on-chip memory bandwidth, memory partitioning algorithms are widely used to reduce access conflicts. CGRAs have a fixed storage fabric and limited size memory due to the severe area constraints. Previous memory partitioning algorithms assumed that data could be completely transferred into the target memory. However, in practice, we often encounter situations where on-chip storage is insufficient to store the complete data. In order to perform the computation of these applications in the memory-limited CGRA, we first develop a memory partitioning strategy with continual placement, which can also avoid data preprocessing, and then divide the kernel into multiple subtasks that suit the size of the target memory. Experimental results show that, compared to the state-of-the-art method, our approach achieves a 43.2% reduction in data preparation time and an 18.5% improvement in overall performance. If the subtask generation scheme is adopted, our approach can achieve a 14.4% overall performance improvement while reducing memory requirements by 99.7%.

3B-3
TitleA Dynamic Link-latency Aware Cache Replacement Policy (DLRP)
Author*Yen-Hao Chen, Allen Wu, TingTing Hwang (National Tsing Hua University, Taiwan)
Pagepp. 210 - 215
KeywordMPSoCs
AbstractMultiprocessor system-on-chips (MPSoCs) with the non-uniform cache architecture (NUCA) have been widely used in modern devices. Even though the physical distances from cores to data locations may affect the network latency of cache accesses, they do not always affect the overall performance. For instance, when a bank-interleaved accessing pattern accesses all banks evenly, all cache banks will have the same cache performance regardless of the physical distances from cores to data locations. The distance-sensitive accessing pattern, on the other hand, results in a significant cache performance degradation on the cache banks that are far away from the core. We have also observed that a set of commonly used neural network application kernels, including the neural network fully-connected and convolutional layers, contains substantial distance-sensitive accessing patterns. In this paper, we propose an identification mechanism to detect such distance-sensitive accessing patterns and a distance-aware replacement policy (DARP) to resolve this problem. With little hardware overhead from some additional monitors, our method outperforms LRU, NRU, and SRRIP in performance by 43%, 36%, and 27%, respectively. Moreover, the hardware monitor consumes less than 0.07% of the total energy of the cache system, which is negligible.
Slides

3B-4
TitlePrediction of Register Instance Usage and Time-sharing Register for Extended Register Reuse Scheme
Author*Shuxin Zhou (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Huandong Wang (Loongson Corporation, China), Dong Tong (Peking University, China)
Pagepp. 216 - 221
KeywordRegister renaming, physical register file, time-sharing
AbstractRegister renaming is key to the performance of out-of-order processors. However, the release mechanism of the physical register may cause waste in the time dimension. The register reuse technique is the earliest solution to release a physical register at the renaming stage, which takes advantage of register instances that are used only once. However, the range of possible reuse mined by this scheme is limited, and the physical structure of the register has to be modified. Aiming at these two problems, we propose an extended register reuse scheme. Our work presents: 1) prediction of the number of uses of a register instance, so that physical registers can be reused at the end of the last use, expanding the range of possible reuse; and 2) a design of a time-sharing register file with little overhead, implemented by Backup Registers, which avoids modifying the physical register structure. Compared with the original register reuse technique, this work achieves an 8.5% performance improvement or, alternatively, a 9.6% decrease in the number of physical registers with minor hardware overhead.


[To Session Table]

Session 3C  Core Circuits for AI Accelerators
Time: 16:00 - 16:30, Tuesday, January 19, 2021
Location: Room 3C
Chairs: Cheng Zhuo (Zhejiang University, China), Grace Li Zhang (Technical University of Munich, Germany)

3C-1
TitleResidue-Net: Multiplication-free Neural Network by In-situ, No-loss Migration to Residue Number Systems
Author*Sahand Salamat, Sumiran Shubhi, Behnam Khaleghi, Tajana Rosing (University of California, San Diego, USA)
Pagepp. 222 - 228
KeywordNeural Network Acceleration, FPGAs, Residue Number System
AbstractDeep neural networks are widely deployed on embedded devices to solve a wide range of problems from robotics to autonomous driving. The accuracy of these networks is usually proportional to their complexity. Quantization of model parameters (i.e., weights) and/or activations to alleviate the complexity of these networks while preserving accuracy is a popular technique. Nonetheless, previous studies have shown that quantization is limited to six bits as the accuracy of the network decreases afterward. We propose Residue-Net, a multiplication-free accelerator for neural networks that uses Residue Number System (RNS) to achieve substantial energy reduction. RNS breaks down the operations to several smaller operations that are simpler to implement. Moreover, Residue-Net replaces the costly copious multiplications with non-complex, energy-efficient shift and addition operations to further reduce the computational complexity of neural networks. To evaluate the efficiency of our proposed accelerator, we compared the performance of Residue-Net with a baseline FPGA implementation of four widely-used networks, LeNet, AlexNet, VGG16, and ResNet50. Residue-Net delivers the same accuracy as the fixed-point model with considerable speedup and energy reduction.

3C-2
TitleA Multiple-Precision Multiply and Accumulation Design with Multiply-Add Merged Strategy for AI Accelerating
Author*Song Zhang, Jiangyuan Gu, Shouyi Yin, Leibo Liu, Shaojun Wei (Tsinghua University, China)
Pagepp. 229 - 234
KeywordMAC, Balanced Pipeline, Bit-width Adjustable, Signed/Unsigned Unified, Multiply-Add Merged
AbstractMultiply-and-accumulate (MAC) operations are fundamental to domain-specific accelerators, with AI applications ranging from filtering to convolutional neural networks (CNNs). This paper proposes an energy-efficient MAC design, supporting a wide range of bit-widths, for both signed and unsigned operands. Firstly, based on the classic Booth algorithm, we propose a multiply-add merged strategy. The design can not only support both signed and unsigned operations but also eliminate the delay, area and power overheads from the adder of traditional MAC units. Then a multiply-add merged design method for flexible bit-width adjustment is proposed using the fusion strategy. In addition, treating the addend as a partial product makes the operation easy to pipeline and balanced. The comprehensive improvement in delay, area and power can meet various requirements from different applications and hardware designs. By using the proposed method, we have synthesized MAC units for several operation modes using a SMIC 40-nm library. Comparison with other MAC designs shows that the proposed design method can achieve up to 24.1% and 28.2% PDP and ADP improvement for bit-width fixed MAC designs, and 28.43% to 38.16% for bit-width adjustable ones. When pipelined, the design decreases the latency by more than 13%. The improvement in power and area is up to 8.0% and 8.1%, respectively.

3C-3
TitleDeepOpt: Optimized Scheduling of CNN Workloads for ASIC-based Systolic Deep Learning Accelerators
Author*Susmita Dey Manasi, Sachin S. Sapatnekar (University of Minnesota, Twin Cities, USA)
Pagepp. 235 - 241
KeywordCNN, Scheduling, Hardware accelerator
AbstractScheduling computations in each layer of a convolutional neural network on a deep learning (DL) accelerator involves a large number of choices, each of which involves a different set of memory reuse and memory access patterns. Since memory transactions are the primary bottleneck in DL acceleration, these choices can strongly impact the energy and throughput of the accelerator. This work proposes an optimization framework, DeepOpt, for general ASIC-based systolic hardware accelerators for layer-specific and hardware-specific scheduling strategy for each layer of a CNN to optimize energy and latency. Optimal hardware allocation significantly reduces execution cost as compared to generic static hardware resource allocation, e.g., improvements of up to 50X in the energy-delay product for VGG-16 and 41X for GoogleNet-v1.
Slides

3C-4
TitleValue-Aware Error Detection and Correction for SRAM Buffers in Low-Bitwidth, Floating-Point CNN Accelerators
Author*Jun-Shen Wu, Chi-En Wang, Ren-Shuo Liu (National Tsing Hua University, Taiwan)
Pagepp. 242 - 247
KeywordCNN, SRAM, Stuck-at fault, voltage scaling, accelerator
AbstractLow-power CNN accelerators are a key technique to enable the future artificial intelligence world. Dynamic voltage scaling is an essential low-power strategy, but it is bottlenecked by on-chip SRAM. More specifically, SRAM can exhibit stuck-at (SA) faults at a rate as high as 0.1% when the supply voltage is lowered to 0.5V. Although this issue has been studied in CPU cache design, the existing solutions are tailored for CPUs instead of CNN accelerators, so they inevitably incur unnecessary design complexity and SRAM capacity overhead. To address the above issue, we conduct simulations and analyses that enable us to propose error detecting and correcting mechanisms tailored for our target low-bitwidth, floating-point (LBFP) CNN accelerators. We analyze the impacts of SA faults in different SRAM positions, and we also analyze the impacts of different SA types, i.e., stuck-at-one (SA1) and stuck-at-zero (SA0). The experimental results lead us to error detecting and correcting mechanisms that are aware of different SA values and prioritize fixing SA1 appearing at SRAM positions where the exponent bits of LBFP are stored. The experimental results show that our proposed mechanisms can help to push the voltage scaling limit down to a voltage level, e.g., 0.5V, with 0.1% SA faults.
Slides


[To Session Table]

Session 3D  Stochastic and Approximate Computing
Time: 16:00 - 16:30, Tuesday, January 19, 2021
Location: Room 3D
Chairs: Georgios Zervakis (Karlsruhe Institute of Technology, Germany), Iraklis Anagnostopoulos (Southern Illinois University, USA)

3D-1
TitleMIPAC: Dynamic Input-Aware Accuracy Control for Dynamic Auto-Tuning of Iterative Approximate Computing
Author*Taylor Kemp, Yao Yao, Younghyun Kim (University of Wisconsin-Madison, USA)
Pagepp. 248 - 253
KeywordApproximate computing, Quality control, Logic synthesis
AbstractFor many applications that exhibit strong error resilience, such as machine learning and signal processing, energy efficiency and performance can be dramatically improved by allowing for slight errors in intermediate computations. Iterative methods (IMs), wherein the solution is improved over multiple executions of an approximation algorithm, allow for an energy-quality trade-off at run-time by adjusting the number of iterations (NOI). However, in prior IM circuits, NOI adjustment has been made based on a pre-characterized NOI-quality mapping, which is input-agnostic and thus results in an undesirably large variation in output quality. In this paper, we propose a novel design framework that incorporates a lightweight quality controller that makes input-dependent predictions on the output quality and determines the optimal NOI at run-time. The proposed quality controller is composed of accurate yet low-overhead NOI predictors, generated by a novel logic reduction technique. We evaluate the proposed design framework on several IM circuits and demonstrate significant improvements in energy-quality performance.

3D-2
TitleNormalized Stability: A Cross-Level Design Metric for Early Termination in Stochastic Computing
Author*Di Wu, Ruokai Yin, Joshua San Miguel (University of Wisconsin-Madison, USA)
Pagepp. 254 - 259
KeywordStochastic computing, Early termination, Metric
AbstractStochastic computing is a statistical computing scheme that represents data as serial bit streams to greatly reduce hardware complexity. The key trade-off is that processing more bits in the streams yields higher computation accuracy at the cost of more latency and energy consumption. To maximize efficiency, it is desirable to account for the error tolerance of applications and terminate stochastic computations early when the result is acceptably accurate. Currently, the stochastic computing community lacks a standard means of measuring a circuit's potential for early termination and predicting at what cycle it would be safe to terminate. To fill this gap, we propose normalized stability, a metric that measures how fast a bit stream converges under a given accuracy budget. Our unit-level experiments show that normalized stability accurately reflects and contrasts the early-termination capabilities of varying stochastic computing units. Furthermore, our application-level experiments on low-density parity-check decoding, machine learning and image processing show that normalized stability can reduce the design space and predict the timing to terminate early.
Slides

3D-3
TitleZero Correlation Error: A Metric for Finite-Length Bitstream Independence in Stochastic Computing
Author*Hsuan Hsiao (University of Toronto, Canada), Joshua San Miguel (University of Wisconsin-Madison, USA), Yuko Hara-Azumi (Tokyo Institute of Technology, Japan), Jason Anderson (University of Toronto, Canada)
Pagepp. 260 - 265
Keywordstochastic computing
AbstractStochastic computing (SC), with its probabilistic data representation format, has sparked renewed interest due to its ability to use very simple circuits to implement complex operations. However, unlike traditional binary computing, SC needs to carefully handle correlations that exist across data values to avoid the risk of unacceptably inaccurate results. With many SC circuits designed to operate under the assumption that input values are independent, it is important to provide the ability to accurately measure and characterize independence of SC bitstreams. We propose zero correlation error (ZCE), a metric that quantifies how independent two finite-length bitstreams are, and show that it addresses fundamental limitations in metrics currently used by the SC community. Through evaluation at both the functional unit level and application level, we demonstrate how ZCE can be an effective tool for analyzing SC bitstreams, simulating circuits and design space exploration.

3D-4
TitleAn Efficient Approximate Node Merging with an Error Rate Guarantee
AuthorKit Seng Tam, *Chia-Chun Lin (National Tsing Hua University, Taiwan), Yung-Chih Chen (Yuan Ze University, Taiwan), Chun-Yao Wang (National Tsing Hua University, Taiwan)
Pagepp. 266 - 271
KeywordApproximate logic synthesis
AbstractApproximate computing is an emerging design paradigm for error-tolerant applications, e.g., signal processing and machine learning. In approximate computing, the area, delay, or power consumption of an approximate circuit can be improved by trading off its accuracy. In this paper, we propose an approximate logic synthesis approach based on a node-merging technique with an error rate guarantee. The ideas of our approach are to replace internal nodes by constant values and to merge two similar nodes in the circuit in terms of functionality. We conduct experiments on a set of IWLS 2005 and MCNC benchmarks. The experimental results show that our approach can reduce area by up to 80%, and 31% on average. As compared with the state-of-the-art method, our approach has a speedup of 51 under the same 5% error rate constraint.
Slides


[To Session Table]

Session 3E  Timing Analysis and Timing-Aware Design
Time: 16:00 - 16:30, Tuesday, January 19, 2021
Location: Room 3E
Chairs: Sabya Das (Synopsys Inc, USA), Wenjian Yu (Tsinghua University, China)

Best Paper Candidate
3E-1
TitleAn Adaptive Delay Model for Timing Yield Estimation under Wide-Voltage Range
Author*Hao Yan (South-east University, China), Xiao Shi (Artificial Intelligence Dept., Southeast University, USA), Chengzhen Xuan, Peng Cao, Longxing Shi (South-east University, China)
Pagepp. 272 - 277
KeywordDelay Model, Timing Yield, Wide-Voltage Scaling, Non-Gaussian, Nonlinear Sampling
AbstractYield analysis for wide-voltage circuit design is a strongly nonlinear integration problem. The most challenging task is how to accurately estimate the yield of a long-tail distribution. This paper proposes an adaptive delay model to substitute expensive transistor-level simulation for timing yield estimation. We use the Low-Rank Tensor Approximation (LRTA) to model the delay variation from a large number of process parameters. Moreover, an adaptive nonlinear sampling algorithm is adopted to calibrate the model iteratively, which can capture the larger variability of delay distribution for different voltage regions. The proposed method is validated on benchmark circuits of TAU15 in 45nm free PDK. The experimental results show that our method achieves 20-100X speedup compared to Monte Carlo simulation at the same accuracy level.

3E-2
TitleATM: A High Accuracy Extracted Timing Model for Hierarchical Timing Analysis
Author*Kuan-Ming Lai (NTHU, Taiwan), Tsung-Wei Huang (University of Utah, USA), Pei-Yu Lee (MaxEDA, Taiwan), Tsung-Yi Ho (NTHU, Taiwan)
Pagepp. 278 - 283
KeywordStatic Timing Analysis, Macro Modeling
AbstractAs technology advances, the complexity and size of integrated circuits continue to grow. Hierarchical design flow is a mainstream solution to speed up timing closure. Static timing analysis is a pivotal step in the flow but it can be time-consuming on large flat designs. To reduce the long runtime, we introduce ATM, a high-accuracy extracted timing model for hierarchical timing analysis. The interface logic model (ILM) and extracted timing model (ETM) are the two popular paradigms for generating timing macros. ILM is accurate but large in model size, and ETM is compact but less accurate. Recent research has applied graph compression techniques to ILM to reduce model size with simultaneous high accuracy. However, the generated models are still very large compared to ETM, and their efficiency of in-context usage may be limited. We base ATM on the ETM paradigm and address its accuracy limitation. Experimental results on TAU 2017 benchmarks show that ATM reduces the maximum absolute error of ETM from 131 ps to less than 1 ps. Compared to the ILM-based approach, our accuracy differs within 1 ps and the generated model can be up to 270× smaller.

3E-3
TitleMode-wise Voltage-scalable Design with Activation-aware Slack Assignment for Energy Minimization
Author*TaiYu Cheng (Osaka University, Japan), Yutaka Masuda (Nagoya University, Japan), Jun Nagayama, Yoichi Momiyama (Socionext Inc., Japan), Jun Chen, Masanori Hashimoto (Osaka University, Japan)
Pagepp. 284 - 290
Keywordmode-wise Voltage-scaling, activation-aware slack assignment, multi-corner multi-mode, downhill simplex method
AbstractThis paper proposes a design optimization methodology that can achieve a mode-wise voltage-scalable (MWVS) design by applying the activation-aware slack assignment (ASA). Originally, ASA allocates the timing margin of critical paths with the stochastic treatment of timing errors, which limits its application. Instead, this work employs ASA while guaranteeing no timing errors. The MWVS design is formulated as an optimization problem that minimizes the overall power consumption considering each mode duration, achievable voltage reduction, and accompanied circuit overhead explicitly, and explores the solution space with the downhill simplex algorithm that does not require numerical derivation. For obtaining a solution, i.e., a design, in the optimization process, we exploit the multi-corner multi-mode design flow in a commercial tool for performing mode-wise ASA with sets of false paths dedicated to individual modes. Experimental results based on a RISC-V design show that the proposed methodology saves 20% more power compared to the conventional voltage scaling approach and attains a 15% gain over the single-mode ASA. Also, the cycle-by-cycle fine-grained false path identification reduced leakage power by 42%.
Slides

3E-4
TitleA Timing Prediction Framework for Wide Voltage Design with Data Augmentation Strategy
Author*Peng Cao, Wei Bao, Kai Wang, Tai Yang (Southeast University, China)
Pagepp. 291 - 296
KeywordPVT corners, path delay prediction, data augmentation
AbstractWide voltage design has been widely used to achieve power reduction and energy efficiency improvement. The consequent increasing number of PVT (Process–Voltage–Temperature) corners poses severe challenges to timing analysis in terms of accuracy and efficiency. The data insufficiency issue during path delay acquisition raises the difficulty of training machine learning models, especially at low voltage corners, due to tremendous library characterization effort and/or simulation cost. In this paper, a learning-based timing prediction framework is proposed to predict path delays across a wide voltage region by LightGBM (Light Gradient Boosting Machine) with data augmentation strategies including CTGAN (Conditional Generative Adversarial Networks) and SMOTER (Synthetic Minority Oversampling Technique for Regression), which generate realistic synthetic data of circuit delays to improve prediction precision and reduce data sampling effort. Experimental results demonstrate that with the proposed framework, the path delays at low voltage could be predicted from their delays at high voltage corners with an rRMSE (relative Root of Mean Square Error) of less than 5%, owing to the data augmentation strategies, which achieve significant prediction error reduction by up to 12X.



Wednesday, January 20, 2021


Session 2K  Keynote Session II (video and its broadcast via Zoom)
Location: Room K

2K-1
TitleIntroduction of Prof. Krishnendu Chakrabarty by Toshihiro Hattori (Renesas Electronics, Japan)

2K-2
Title(Keynote Address) Secure and Trustworthy Microfluidic Biochips: Protecting Medical Diagnostics, Bioassay IP, and DNA Forensics
AuthorKrishnendu Chakrabarty (Duke University, USA)
AbstractToday's microfluidic biochips are integrated with sensors and intelligent control, and networked for data analysis. These systems are cyberphysical in nature and are unfortunately coming of age in an era of rampant cybersecurity issues. Consequently, we anticipate novel security and trust problems that need to be addressed using interdisciplinary expertise in microfluidics, microbiology, hardware design, and cybersecurity. This presentation will first describe security threats and attack surfaces, and their consequences for the research landscape, industry, and society. The speaker will next present countermeasures against bioassay outcome manipulation, biochip actuation tampering, and bioassay IP theft. Experimental results will be presented for securing biomolecular protocols on the benchtop and for protecting microfluidic biochip prototypes with security primitives.



Session 4A  (SS-3) Technological Advancements inside the AI chips, and using the AI Chips
Time: 15:00 - 15:30, Wednesday, January 20, 2021
Location: Room 4A
Chair: Ravikumar Chakaravarthy (Xilinx Inc., USA)

4A-1
Title(Invited Paper) Energy-Efficient Deep Neural Networks with Mixed-Signal Neurons and Dense-Local and Sparse-Global Connectivity
AuthorBaibhab Chatterjee, *Shreyas Sen (Purdue University, USA)
Pagepp. 297 - 304
Keywordartificial neural networks, CMOS, low-energy, mixed-signal, neuromorphic computing
AbstractNeuromorphic Computing has become tremendously popular due to its ability to solve certain classes of learning tasks better than traditional von-Neumann computers. Data-intensive classification and pattern recognition problems have been of special interest to Neuromorphic Engineers, as these problems present complex use-cases for Deep Neural Networks (DNNs), which are motivated by the architecture of the human brain and employ densely connected neurons organized in a hierarchical manner. However, as these systems become larger in order to handle an increasing amount of data and higher dimensionality of features, the designs often become connectivity constrained. To solve this, the computation core needs to be divided into multiple cores/islands. Today, the communication among these cores is carried out through a power-hungry network-on-chip (NoC), and hence the optimal distribution of these islands along with energy-efficient communication strategies holds immense promise to reduce the overall energy consumption of the neuromorphic computer, which is currently orders of magnitude higher than that of the biological human brain. In this paper, we extensively analyze the choice of the size of the islands based on digital and mixed-signal neurons/synapses for different signal to noise ratios (SNR), and propose strategies for local and global communication for reduction of the system-level energy consumption.
Slides

4A-2
Title(Invited Paper) Merged Logic and Memory Fabrics for AI Workloads
AuthorBrian Crafton, Samuel Spetalnick, *Arijit Raychowdhury (Georgia Institute of Technology, USA)
Pagepp. 305 - 310
KeywordCompute In-Memory, Neural Network, Emerging, Non-Volatile
AbstractAs we approach the end of the silicon roadmap, we observe a steady increase in both the research effort toward and quality of embedded non-volatile memories (eNVM). Integrated in a dense array, eNVM such as resistive random access memory (RRAM) can perform compute in-memory (CIM) using the physical properties of the device. The combination of eNVM and CIM seeks to minimize both data transport and leakage power while offering density up to 10× that of traditional 6T SRAM. Despite these exciting new properties, these devices introduce problems that were not faced by traditional CMOS and SRAM based designs. While some of these problems will be solved by further research and development, properties such as significant cell-to-cell variance and high write power will persist due to the physical limitations of the devices. As a result, circuit and system level designs must account for and mitigate the problems that arise. In this work we introduce these problems from the system level and propose solutions that improve performance while mitigating the impact of the non-ideal properties of eNVM.

4A-3
Title(Invited Paper) Vision Control Unit in Fully Self Driving Vehicles using Xilinx MPSoC and Opensource Stack
Author*Ravikumar Chakaravarthy, Hyun Kwon, Hua Jiang (Xilinx Inc, USA)
Pagepp. 311 - 317
KeywordFSD, MPSoC, AI, Heterogeneous Processing, XTA
AbstractThe vision control unit of an FSD vehicle is responsible for camera video capture, image processing and rendering, AI algorithm processing, and data and meta-data transfer to the next stage of the FSD pipeline. In this proposed solution we have used many open source stacks and frameworks for video and AI processing. The processing of the video pipeline and algorithms takes full advantage of pipelining and parallelism using all the heterogeneous cores of the Xilinx MPSoC. In addition, we have developed an extensible, scalable, adaptable and configurable AI backend framework, XTA, for acceleration purposes that is derived from a popular, open source AI backend framework, TVM-VTA. XTA uses all the MPSoC cores for its computation in a parallel and pipelined fashion. XTA also adapts to the compute and memory parameters of the system and can scale to achieve optimal performance for any given AI problem. The FSD system design is based on a distributed system architecture and uses open source components and a real-time kernel to coordinate the actions. The details of image capture, rendering and AI processing of the vision perception pipeline will be presented along with the performance measurements of the vision pipeline.



Session 4B  System-Level Modeling, Simulation, and Exploration
Time: 15:00 - 15:30, Wednesday, January 20, 2021
Location: Room 4B
Chairs: Lei Yang (University of New Mexico, USA), Yaoyao Ye (Shanghai Jiao Tong University, China)

4B-1
TitleConstrained Conservative State Symbolic Co-analysis for Ultra-low-power Embedded Systems
Author*Shashank Hegde, Subhash Sethumurugan (University of Minnesota, USA), Hari Cherupalli (Synopsys Inc., USA), Henry Duwe (Iowa State University, USA), John Sartori (University of Minnesota, USA)
Pagepp. 318 - 324
Keywordsymbolic execution, symbolic simulation, gate level analysis, hw/sw co-analysis
AbstractSymbolic simulation and symbolic execution techniques have long been used for verifying designs and testing software. Recently, symbolic hardware-software co-analysis, which characterizes unused hardware resources across all possible executions of an application running on a processor, has been leveraged to enable application-specific analysis and optimization techniques. Like other symbolic simulation techniques, symbolic hardware-software co-analysis does not scale well to complex applications, due to an explosion in the number of execution paths that must be analyzed to characterize all possible executions of an application. To overcome this issue, prior work proposed a scalable approach by maintaining conservative states of the system at previously-visited locations in the application. However, this approach can be too pessimistic in determining the exercisable subset of resources of a hardware design. Since a less pessimistic estimate of the exercisable logic can lead to greater benefits, techniques based on conservative states leave potential benefits on the table. In this paper, we propose a technique for performing symbolic co-analysis of an application on a processor's netlist by identifying, propagating, and imposing constraints from the software level onto the gate-level simulation. This produces a more precise, less pessimistic estimate of the gates that an application can exercise when executing on a processor while guaranteeing coverage of all possible gates that the application can exercise. Compared to the state-of-the-art analysis based on conservative states, our constrained approach reduces the number of exercisable gates by up to 34.98%, 12.7% on average, and analysis runtime by up to 84.61%, 43.8% on average.

4B-2
TitleArbitrary and Variable Precision Floating Point Arithmetic Support in Dynamic Binary Translation
Author*Marie Badaroux, Frédéric Pétrot (Univ. Grenoble Alpes, CNRS, Grenoble INP, TIMA, France)
Pagepp. 325 - 330
KeywordSystem-Level Simulation, Dynamic Binary Translation, Arbitrary Precision Floating-Point
AbstractFloating point hardware support was more or less settled 35 years ago by the adoption of the IEEE 754 standard. However, many scientific applications require higher accuracy than what can be represented on 64 bits, and to that end make use of dedicated arbitrary precision software libraries. To reach a good performance/accuracy trade-off, developers use variable precision, requiring e.g. more accuracy as the computation progresses. Hardware accelerators for this kind of computation do not exist yet, and independently of the actual quality of the underlying arithmetic computations, defining the right instruction set architecture, memory representations, etc., for them is a challenging task. We investigate in this paper the support for arbitrary and variable precision arithmetic in a dynamic binary translator, to help gain a system level view of what such an accelerator could provide as an interface to compilers, and thus programmers. We detail our design and present an implementation in QEMU using the MPFR library for the RISC-V processor.
Slides

4B-3
TitleOptimizing Temporal Decoupling using Event Relevance
Author*Lukas Jünger (Institute for Communication Technologies and Embedded Systems, RWTH Aachen, Germany), Carmine Bianco, Kristof Niederholtmeyer, Dietmar Petras (Synopsys GmbH, Germany), Rainer Leupers (Institute for Communication Technologies and Embedded Systems, RWTH Aachen, Germany)
Pagepp. 331 - 337
KeywordTLM, Temporal Decoupling, Quantum, SystemC, ESL
AbstractOver the last decades, HW/SW systems have grown ever more complex. System simulators, so-called virtual platforms, have been an important tool for developing and testing these systems. However, the rise in overall complexity has also impacted the simulators. Complex platforms require fast simulation components and a sophisticated simulation infrastructure to meet today’s performance demands. With the introduction of SystemC TLM2.0, temporal decoupling has become a staple in the arsenal of simulation acceleration techniques. Temporal decoupling yields a significant simulation performance increase at the cost of diminished accuracy. The two prevalent approaches are called static quantum and dynamic quantum. In this work both are analyzed using a state-of-the-art, industrial virtual platform as a case study. While dynamic quantum offers an ideal trade-off between simulation performance and accuracy in a single-core scenario, performance reductions can be observed in multi-core platforms. To address this, a novel performance optimization is proposed, achieving a 14.32% performance gain in our case study while keeping near-perfect accuracy.
Slides

4B-4
TitleDesign Space Exploration of Heterogeneous-Accelerator SoCs with Hyperparameter Optimization
Author*Thanh Cong (University Rennes, Inria, IRISA, France), François Charot (Inria, University Rennes, IRISA, France)
Pagepp. 338 - 343
KeywordHeterogeneous architecture design, System-on-chip, Hardware accelerators, Hyperparameter optimization, Simulation
AbstractModern SoC systems consist of general-purpose processor cores augmented with large numbers of specialized accelerators. Building such systems requires a design flow allowing the design space to be explored at the system level with an appropriate strategy. In this paper, we describe a methodology for exploring the power-performance design space of heterogeneous SoCs by combining an architecture simulator (gem5-Aladdin) and a hyperparameter optimization method (Hyperopt). This methodology allows different types of parallelism with loop unrolling strategies and memory coherency interfaces to be swept. The flow has been applied to a convolutional neural network algorithm. We show that the most energy-efficient architecture achieves a 2x to 4x improvement in energy-delay-product compared to an architecture without parallelism. Furthermore, the obtained solution is more efficient than commonly implemented architectures (Systolic, 2D-mapping, and Tiling). We also applied the methodology to find the optimal architecture, including its coherency interface, for a complex SoC made up of six accelerated workloads. We show that a hybrid interface appears to be the most efficient; it reaches 22% and 12% improvement in energy-delay-product compared to using only non-coherent and only LLC-coherent models, respectively.
Slides



Session 4C  Neural Network Optimizations for Compact AI Inference
Time: 15:00 - 15:30, Wednesday, January 20, 2021
Location: Room 4C
Chairs: Ngai Wong (HKU, Hong Kong), Hai-Bao Chen (Shanghai Jiao Tong University, China)

4C-1
TitleDNR: A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs
Author*Souvik Kundu, Mahdi Nazemi, Peter A. Beerel, Massoud Pedram (University of Southern California, USA)
Pagepp. 344 - 350
KeywordAdversarial robustness, Robust pruning, Structured pruning, adversarial pruning, model compression
AbstractThis paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the disclosed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet-18, on CIFAR-10, CIFAR-100 as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over 20× compression on all the datasets with no significant drop in either clean or adversarial classification accuracy. Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives. Our models and test codes are available at the anonymous link https://drive.google.com/drive/u/2/folders/1Y2rpsUTPxJVb1s7PThJaTtIz-TGQXVM0

4C-2
TitleDynamic Programming Assisted Quantization Approaches for Compressing Normal and Robust DNN Models
Author*Dingcheng Yang, Wenjian Yu, Haoyuan Mu (Tsinghua University, China), Gary Yao (Case Western Reserve University, USA)
Pagepp. 351 - 357
KeywordDynamic Programming, Neural Network Compression, Quantization, Robust Model, Weight Sharing
AbstractIn this work, we present effective quantization approaches for compressing deep neural networks (DNNs). A key ingredient is a novel dynamic programming (DP) based algorithm to obtain the optimal solution of scalar K-means clustering. Based on the approaches with regularization and quantization function, two weight quantization approaches called DPR and DPQ, respectively, are proposed for compressing normal DNNs. Experiments show that they produce models with higher inference accuracy than recently proposed counterparts while achieving the same or larger compression. They are also extended for compressing robust DNNs, and the relevant experiments show 16X compression of the robust ResNet-18 model with less than 3% accuracy drop on both natural and adversarial examples.
Slides
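
Illustrative note on 4C-2: the abstract relies on the fact that scalar (one-dimensional) K-means can be solved exactly, because an optimal clustering of sorted values always consists of contiguous runs. The following is a minimal sketch of the textbook O(K n^2) dynamic program for that problem, not the authors' algorithm, which may differ in formulation and complexity. The function name optimal_scalar_kmeans and its interface are assumptions made for this illustration.

```python
from typing import List, Tuple


def optimal_scalar_kmeans(values: List[float], k: int) -> Tuple[float, List[float]]:
    """Return (minimal sum of squared errors, sorted cluster centers).

    Hypothetical helper: classical exact DP for 1-D K-means.
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= len(values)")

    # Prefix sums of x and x^2 give the cost of any contiguous run in O(1).
    ps = [0.0] * (n + 1)
    ps2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        ps[i + 1] = ps[i] + x
        ps2[i + 1] = ps2[i] + x * x

    def seg_cost(i: int, j: int) -> float:
        # Squared error of xs[i:j] around its own mean (requires i < j).
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    inf = float("inf")
    # cost[m][j]: best error for the first j sorted points using m clusters.
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for m in range(1, k + 1):
        for j in range(m, n + 1):
            for i in range(m - 1, j):
                c = cost[m - 1][i] + seg_cost(i, j)
                if c < cost[m][j]:
                    cost[m][j] = c
                    cut[m][j] = i  # last cluster is xs[i:j]

    # Walk the stored cut points backwards to recover the cluster means.
    centers: List[float] = []
    j = n
    for m in range(k, 0, -1):
        i = cut[m][j]
        centers.append((ps[j] - ps[i]) / (j - i))
        j = i
    centers.reverse()
    return cost[k][n], centers
```

In a weight-sharing setting, the returned centers would play the role of the shared quantization levels for a layer's weights; how DPR and DPQ integrate this step into training is described in the paper itself.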

4C-3
TitleAccelerate Non-unit Stride Convolutions with Winograd Algorithms
Author*Junhao Pan, Deming Chen (University of Illinois at Urbana-Champaign, USA)
Pagepp. 358 - 364
KeywordMachine Learning, Hardware Accelerators
AbstractWhile computer vision tasks target increasingly challenging scenarios, the need for real-time processing of images rises as well, requiring more efficient methods to accelerate convolutional neural networks. For unit stride convolutions, we use FFT-based methods and Winograd algorithms to compute matrix convolutions, which effectively lower the computing complexity by reducing the number of multiplications. For non-unit stride convolutions, we usually cannot directly apply those algorithms to accelerate the computations. In this work, we propose a novel universal approach to construct the non-unit stride convolution algorithms for any given stride and filter sizes from Winograd algorithms. Specifically, we first demonstrate the steps to decompose an arbitrary convolutional kernel and apply the Winograd algorithms separately to compute non-unit stride convolutions. We then present the derivation of this method and proof by construction to confirm the validity of this approach. Finally, we discuss the minimum number of multiplications and additions necessary for the non-unit stride convolutions and evaluate the performance of the decomposed Winograd algorithms. From our analysis of the computational complexity, the new approach can benefit from 1.5x to 3x fewer multiplications. In our experiments in real DNN layers, we have acquired around 1.3x speedup (T_old / T_new) of the Winograd algorithms against the conventional convolution algorithm in various experiment settings.

4C-4
TitleEfficient Accuracy Recovery in Approximate Neural Networks by Systematic Error Modelling
Author*Cecilia De la Parra, Andre Guntoro (Robert Bosch GmbH, Germany), Akash Kumar (Technical University of Dresden, Germany)
Pagepp. 365 - 371
KeywordApproximate Computing, deep neural networks, approximation error model, DNN optimization
AbstractApproximate Computing is a promising paradigm for mitigating the computational demands of Deep Neural Networks (DNNs), by leveraging DNN performance and area, throughput or power. The DNN accuracy, affected by such approximations, can then be effectively improved through retraining. In this paper, we present a novel methodology for modelling the approximation error introduced by approximate hardware in DNNs, which accelerates retraining and achieves negligible accuracy loss. To this end, we implement the behavioral simulation of several approximate multipliers and model the error generated by such approximations on pre-trained DNNs for image classification on CIFAR10 and ImageNet. Finally, we optimize the DNN parameters by applying our error model during DNN retraining, to recover the accuracy lost due to approximations. Experimental results demonstrate the efficiency of our proposed method for accelerated retraining (11x faster for CIFAR10 and 8x faster for ImageNet) for full DNN approximation, which allows us to deploy approximate multipliers with energy savings of up to 36% for 8-bit precision DNNs with an accuracy loss lower than 1%.



Session 4D  Brain-Inspired Computing
Time: 15:00 - 15:30, Wednesday, January 20, 2021
Location: Room 4D
Chairs: Hussam Amrouch (Karlsruhe Institute of Technology, Germany), Xunzhao Yin (Zhejiang University, China)

Best Paper Candidate
4D-1
TitleMixed Precision Quantization for ReRAM-based DNN Inference Accelerators
Author*Sitao Huang (University of Illinois at Urbana-Champaign, USA), Aayush Ankit (Purdue University, USA), Plinio Silveira, Rodrigo Antunes (Hewlett Packard Enterprise, Brazil), Sai Rahul Chalamalasetti (Hewlett Packard Enterprise, USA), Izzat El Hajj (American University of Beirut, Lebanon), Dong-Eun Kim (Purdue University, USA), Glaucimar Aguiar (Hewlett Packard Enterprise, Brazil), Pedro Bruel (University of São Paulo, Brazil), Sergey Serebryakov, Cong Xu, Can Li, Paolo Faraboschi, John Paul Strachan (Hewlett Packard Enterprise, USA), Deming Chen (University of Illinois at Urbana-Champaign, USA), Kaushik Roy (Purdue University, USA), Wen-mei Hwu (University of Illinois at Urbana-Champaign, USA), Dejan Milojicic (Hewlett Packard Enterprise, USA)
Pagepp. 372 - 377
KeywordMixed precision quantization, ReRAM, DNN inference accelerators
AbstractReRAM-based accelerators have shown great potential for accelerating DNN inference because ReRAM crossbars can perform analog matrix-vector multiplication operations with low latency and energy consumption. However, these crossbars require the use of ADCs which constitute a significant fraction of the cost of MVM operations. The overhead of ADCs can be mitigated via partial sum quantization. However, prior quantization flows for DNN inference accelerators do not consider partial sum quantization which is not strongly relevant to traditional architectures. To address this issue, we propose a mixed precision quantization scheme for ReRAM-based DNN inference accelerators where weight quantization, input quantization, and partial sum quantization are jointly applied for each DNN layer. We also propose an automated quantization flow powered by deep reinforcement learning to search for the best quantization configuration in the large design space. Our evaluation shows that the proposed mixed precision quantization scheme and quantization flow reduce inference latency and energy consumption by up to 3.89x and 4.84x, respectively, while only losing 1.18% in DNN inference accuracy.
Slides

4D-2
TitleA reduced-precision streaming SpMV architecture for Personalized PageRank on FPGA
Author*Alberto Parravicini, Francesco Sgherzi, Marco Domenico Santambrogio (Politecnico di Milano, Italy)
Pagepp. 378 - 383
KeywordFPGA, Graph Algorithms, Approximate Computing, Graph Architectures
AbstractSparse matrix-vector multiplication is often employed in many data-analytic workloads in which low latency and high throughput are more valuable than exact numerical convergence. FPGAs provide quick execution times while offering precise control over the accuracy of the results thanks to reduced-precision fixed-point arithmetic. In this work, we propose a novel streaming implementation of Coordinate Format (COO) sparse matrix-vector multiplication, and study its effectiveness when applied to the Personalized PageRank algorithm, a common building block of recommender systems in e-commerce websites and social networks. Our implementation achieves speedups up to 6x over a reference floating-point FPGA architecture and a state-of-the-art multi-threaded CPU implementation on 8 different data-sets, while preserving the numerical fidelity of the results and reaching up to 42x higher energy efficiency compared to the CPU implementation.
Slides
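
Illustrative note on 4D-2: the abstract combines a streaming Coordinate Format (COO) sparse matrix-vector product with reduced-precision fixed-point arithmetic inside Personalized PageRank. The software sketch below shows only the arithmetic idea and is not the authors' FPGA design. The 20-bit fractional format, the function names, the damping factor 0.85, and the assumption that every vertex has at least one outgoing edge are all choices made for this illustration.

```python
from typing import List, Tuple

FRAC_BITS = 20          # assumed fixed-point format: value = integer / 2**FRAC_BITS
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a non-negative float to the assumed fixed-point format."""
    return int(round(x * ONE))


def coo_spmv_fixed(n_rows: int,
                   entries: List[Tuple[int, int, int]],
                   vec: List[int]) -> List[int]:
    """y = A @ vec, where A arrives as a stream of (row, col, value) triples."""
    out = [0] * n_rows
    for row, col, val in entries:
        # One multiply-accumulate per streamed non-zero, rescaled after the product.
        out[row] += (val * vec[col]) >> FRAC_BITS
    return out


def personalized_pagerank(n: int,
                          edges: List[Tuple[int, int]],
                          source: int,
                          alpha: float = 0.85,
                          iters: int = 30) -> List[float]:
    """Power iteration p = alpha * P @ p + (1 - alpha) * e_source in fixed point.

    Assumes every vertex has at least one outgoing edge (no dangling vertices).
    """
    outdeg = [0] * n
    for src, _ in edges:
        outdeg[src] += 1
    # Transition matrix entry (dst, src) = 1 / outdeg(src), stored in COO form.
    entries = [(dst, src, to_fixed(1.0 / outdeg[src])) for src, dst in edges]

    a = to_fixed(alpha)
    p = [0] * n
    p[source] = ONE
    for _ in range(iters):
        y = coo_spmv_fixed(n, entries, p)
        p = [(a * yi) >> FRAC_BITS for yi in y]
        p[source] += ONE - a   # teleport back to the personalization vertex
    return [x / ONE for x in p]
```

For a three-vertex cycle, personalized_pagerank(3, [(0, 1), (1, 2), (2, 0)], source=0, iters=100) returns scores that sum to approximately 1, with the largest score on vertex 0. Lowering FRAC_BITS trades accuracy for narrower arithmetic, which is the kind of trade-off the abstract refers to.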

4D-3
TitleHyperRec: Efficient Recommender Systems with Hyperdimensional Computing
AuthorYunhui Guo, Mohsen Imani, *Jaeyoung Kang, Sahand Salamat, Justin Morris, Baris Aksanli, Yeseong Kim, Tajana Rosing (UCSD, USA)
Pagepp. 384 - 389
KeywordHyperdimensional Computing, Recommender Systems
AbstractRecommender systems are important tools for many commercial applications such as online shopping websites. They recommend users a list of products based on the users’ preferences. There are several issues that make the recommendation task very challenging in practice. The first is that an efficient and compact representation is needed to represent users, items and relations. The second issue is that online markets are changing dynamically; it is thus important that the recommendation algorithm is suitable for fast updates and hardware acceleration. In this paper, we propose a new hardware-friendly recommendation algorithm based on Hyperdimensional Computing, called HyperRec. Unlike existing solutions, which leverage floating-point numbers for the data representation, in HyperRec, users and items are modeled with binary vectors in a high dimension. The binary representation enables the reasoning process of the proposed algorithm to be performed using only Boolean operations, which is efficient on various computing platforms and suitable for hardware acceleration. In this work, we show how to utilize GPU and FPGA to accelerate the proposed HyperRec. The experimental results show that the proposed solution achieves high efficiency on various computing platforms, while providing superior recommendation quality. For example, when compared with the state-of-the-art methods for rating prediction, the CPU-based HyperRec implementation is 13.75× faster and consumes 87% less memory, while decreasing the mean squared error (MSE) for the prediction by as much as 31.84%. Our FPGA implementation is on average 67.0× faster and has 6.9× higher energy efficiency as compared to CPU. Our GPU implementation further achieves on average 3.1× speedup as compared to FPGA, while providing only 1.2× lower energy efficiency.

4D-4
TitleEfficient Techniques for Training the Memristor-based Spiking Neural Networks Targeting Better Speed, Energy and Lifetime
Author*Yu Ma, Pingqiang Zhou (ShanghaiTech University, China)
Pagepp. 390 - 395
KeywordMemristor, Spiking neural network, Drift, Reinforcement learning, Model tuning
AbstractSpeed and energy consumption are two important metrics in designing spiking neural networks (SNNs). The inference process of current SNNs is terminated after a preset number of time steps for all images, which leads to a waste of time and spikes. We can instead terminate the inference process after a proper number of time steps for each image. Besides, the normalization method also influences the time and spikes of SNNs. In this work, we first use a reinforcement learning algorithm to develop an efficient termination strategy which can help find the right number of time steps for each image. Then we propose a model tuning technique for memristor-based crossbar circuits to optimize the weight and bias of a given SNN. Experimental results show that the proposed techniques can reduce crossbar energy consumption by about 58.7% and time consumption by over 62.5%, and double the drift lifetime of memristor-based SNNs.
Slides



Session 4E  Cross-Layer Hardware Security
Time: 15:00 - 15:30, Wednesday, January 20, 2021
Location: Room 4E
Chairs: Song Bian (Kyoto University, Japan), Gang Qu (University of Maryland, USA)

Best Paper Candidate
4E-1
TitlePCBench: Benchmarking of Board-Level Hardware Attacks and Trojans
Author*Huifeng Zhu (Washington University in St. Louis, USA), Xiaolong Guo (Kansas State University, USA), Yier Jin (University of Florida, USA), Xuan Zhang (Washington University in St. Louis, USA)
Pagepp. 396 - 401
KeywordPCB-level attack, hardware security, Trojan benchmarking, attacks taxonomy
AbstractMost modern electronic systems are hosted by printed circuit boards (PCBs), making them a ubiquitous system component that can take many different shapes and forms. In order to achieve a high level of economy of scale, the global supply chain of electronic systems has evolved into disparate segments for the design, fabrication, assembly, and testing of PCB boards and their various associated components. As a consequence, the modern PCB supply chain exposes many vulnerabilities along its different stages, allowing adversaries to introduce malicious alterations to facilitate board-level attacks. As an emerging hardware threat, the attack and defense techniques at the board level have not yet been systematically explored and thus require a thorough and comprehensive investigation. In the absence of a standard board-level attack benchmark, current research on prospective countermeasures is likely to be evaluated on proprietary variants of ad-hoc attacks, preventing credible and verifiable comparison among different techniques. In response to this need, in this paper, we systematically define and categorize a broad range of board-level attacks. For the first time, the attack vectors and construction rules for board-level attacks are developed. A practical and reliable board-level attack benchmark generation scheme is also developed, which can be used to produce references for evaluating countermeasures. Finally, based on the proposed approach, we have created a comprehensive set of board-level attack benchmarks for open-source release.

4E-2
TitleCache-Aware Dynamic Skewed Tree for Fast Memory Authentication
Author*Saru Vig (Nanyang Technological University, Singapore), Rohan Juneja (Qualcomm, India), Siew-Kei Lam (Nanyang Technological University, Singapore)
Pagepp. 402 - 407
KeywordMemory, Security
AbstractMemory integrity trees are widely used to protect external memories in embedded systems against bus attacks. However, existing methods often result in high performance overheads incurred during memory authentication. To reduce memory accesses during authentication, the tree nodes are cached on-chip. In this paper, we propose a cache-aware technique to dynamically skew the integrity tree based on the application workloads in order to reduce the performance overhead. The tree is initialized using van Emde Boas (vEB) organization to take advantage of locality of reference. At run time, the nodes of the integrity tree are dynamically positioned based on their memory access patterns. In particular, frequently accessed nodes are placed closer to the root to reduce the memory access overheads. The proposed technique is compared with existing methods on Multi2Sim using benchmarks from SPEC-CPU2006, SPLASH-2, and PARSEC to demonstrate its performance benefits.

4E-3
TitleAutomated Test Generation for Hardware Trojan Detection using Reinforcement Learning
Author*Zhixin Pan, Prabhat Mishra (University of Florida, USA)
Pagepp. 408 - 413
KeywordHardware Trojan Detection, Machine Learning, Logic Testing, Reinforcement Learning
AbstractDue to the globalized semiconductor supply chain, there is an increasing risk of exposing System-on-Chip (SoC) designs to hardware Trojans. Traditional simulation-based validation using millions of test vectors is unsuitable for detecting stealthy Trojans with extremely rare trigger conditions. There is a critical need to develop efficient Trojan detection techniques to ensure trustworthy SoCs. In this paper, we propose a novel logic testing approach for Trojan detection using an effective combination of testability analysis and reinforcement learning. We utilize both static controllability and observability analysis along with dynamic simulation to significantly improve the trigger coverage. By using reinforcement learning, we considerably reduce the test generation time without sacrificing the test quality.

4E-4
TitleOn the Impact of Aging on Power Analysis Attacks Targeting Power-Equalized Cryptographic Circuits
Author*Md Toufiq Hasan Anik (University of Maryland Baltimore County (UMBC), USA), Bijan Fadaeinia, Amir Moradi (Ruhr University Bochum, Germany), Naghmeh Karimi (University of Maryland Baltimore County (UMBC), USA)
Pagepp. 414 - 420
KeywordDevice Aging, Power Analysis Attack, Cryptographic Chips, Power Attack Resilient, SABL
AbstractSide-channel analysis attacks exploit the physical characteristics of cryptographic chip implementations to extract their embedded secret keys. In particular, Power Analysis (PA) attacks make use of the dependency of the power consumption on the data being processed by the cryptographic devices. To tackle the vulnerability of cryptographic circuits against PA attacks, various countermeasures have been proposed in the literature and adopted by industry, among which a branch of hiding schemes opts to equalize the power consumption of the chip regardless of the processed data. Although these countermeasures are supposed to reduce the information leakage of cryptographic chips, they fail to consider the impact of aging that occurs during the device lifetime. Due to aging, the specifications of transistors, and in particular their threshold voltage, deviate from their fabrication-time specification, leading to a change in the circuit’s delay and power consumption over time. In this paper, we show that the aging-induced impacts result in imbalances in the equalized power consumption achieved by hiding countermeasures. This makes such protected cryptographic chips vulnerable to PA attacks when aged. The experimental results extracted through the aging simulation of the PRESENT cipher protected by Sense Amplifier Based Logic (SABL), one of the well-known hiding countermeasures, show that the achieved protection may not last during the circuit lifetime.
Slides



Session 5A  (DF-1): New-Principle Computer
Time: 15:30 - 16:00, Wednesday, January 20, 2021
Location: Room 5A
Organizer/Chair: Chihiro Yoshimura (Hitachi, Ltd., Japan), Organizer: Noriyuki Miura (Osaka University, Japan)

5A-1
Title(Designers' Forum) Challenges in Ultra-High-Performance Low-Power Computing towards the Post Moore Era ~ A Computer Architecture Perspective ~
AuthorKoji Inoue (Kyushu University, Japan)
KeywordNew Principle Computers
AbstractSince the development of the world’s first single-chip processors in the early 1970s, advances in computer technology have taken place at an ever-increasing rate, with semiconductor integration densities increasing fourfold approximately every three years. However, this trend (known as Moore’s Law) is finally coming to an end. In other words, the quantitative approach to processor development based on increasing the integration density of transistors will no longer be practically viable (marking the beginning of the so-called post-Moore era). To address this critical issue, we need a paradigm shift from the “quantitative approach” to a “qualitative approach” based on creating and using novel devices. This talk explores the trend in next-generation computing technologies with emerging devices and introduces two examples of research activities regarding superconductor computing and nano-photonic computing.

5A-2
Title(Designers' Forum) CMOS Annealing Machine for Combinatorial Optimization Problems
AuthorMasanao Yamaoka (Hitachi Ltd., Japan)
KeywordNew Principle Computers
AbstractA domain-specific computing architecture is promising beyond the scaling limit in the post-Moore era. Real-time control in many fields is among the most important tasks of computers in the IoT era. A new computing architecture, an annealing machine, which is specialized to solve combinatorial optimization problems, is proposed. The annealing machine maps optimization problems to an Ising model and solves the optimization problems in an instant by its own convergence property. We propose a CMOS annealing machine, a CMOS implementation of the annealing machine, which is a type of in-memory computing architecture. The card-size prototype supports 60kbit problems. The CMOS annealing machines are expected to achieve high performance both in the cloud and at the edge. We are working on specific studies of applications toward the practical stage. For example, we have been conducting proof-of-concept experiments for portfolio optimization of financial products.
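
The mapping of an optimization problem to an Ising model mentioned in the abstract above can be illustrated with a small software sketch. This is not the CMOS annealing machine itself: it is a plain simulated-annealing loop, and the function names (`ising_energy`, `maxcut_to_ising`, `anneal`), the max-cut example, and the temperature schedule are all assumptions chosen for illustration.

```python
import math
import random


def ising_energy(spins, couplings):
    """Ising energy E = -sum J_ij * s_i * s_j, with spins s_i in {-1, +1}."""
    return -sum(j * spins[a] * spins[b] for (a, b), j in couplings.items())


def maxcut_to_ising(edges):
    """Illustrative mapping: J_ij = -1 on every edge, so each cut edge lowers E."""
    return {edge: -1.0 for edge in edges}


def anneal(couplings, n, steps=5000, t_start=2.0, t_end=0.05, seed=0):
    """Single-spin-flip simulated annealing; returns (best_energy, best_spins)."""
    rng = random.Random(seed)
    spins = [rng.choice((-1, 1)) for _ in range(n)]
    energy = ising_energy(spins, couplings)
    best = (energy, spins[:])
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (an assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / (steps - 1))
        i = rng.randrange(n)
        spins[i] = -spins[i]  # propose flipping one spin
        new_energy = ising_energy(spins, couplings)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            energy = new_energy  # accept the flip
            if energy < best[0]:
                best = (energy, spins[:])
        else:
            spins[i] = -spins[i]  # reject the flip and restore the spin
    return best


# Max-cut on a 4-node ring: an alternating spin pattern cuts every edge.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
best_energy, best_spins = anneal(maxcut_to_ising(ring), n=4)
print(best_energy, best_spins)
```

With this mapping the energy equals the number of uncut edges minus the number of cut edges, so minimizing the energy maximizes the cut.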

5A-3
Title(Designers' Forum) Massively Parallel Noisy Intermediate Scale Quantum Computer
AuthorRyuta Tsuchiya (Hitachi Ltd., Japan)
KeywordNew Principle Computers
AbstractQuantum computers are expected to be a new generation of computing technology that solves problems exponentially faster than conventional processors. The major challenge in quantum computing is to build a large-scale integrated system. This difficulty comes from the total number of external control signals when it comes to integrating millions of qubits on one chip. Here, we propose a silicon-based quantum bit (qubit) array structure. In this array, multiple qubits are controlled in a two-dimensional array by using orthogonal word lines/bit lines like DRAM and FLASH memory cells. By sharing the signal wiring, large-scale integration of qubits is possible while suppressing an increase in the number of wirings. The fabricated array is shown to have an allowably large operating voltage window to control the number of qubits. The proposed structure has the potential to pave the way for a massively parallel noisy intermediate-scale quantum computer.

5A-4
Title(Designers' Forum) Designing Functional Neuronal Networks with Living Cells
AuthorHideaki Yamamoto, Ayumi Hirano-Iwata, Shigeo Sato (Tohoku University, Japan)
KeywordNew Principle Computers
AbstractHow does a complex network of biological elements realize robust and energy-efficient computation in the brain? We approach this question by integrating semiconductor microfabrication, cell-culture technology, and numerical simulations. More precisely, we prepare micropatterns of cell-adhesion proteins on coverslips, which serve as guidance cues for the growth of cultured neurons into structured networks. Their dynamics are then recorded by fluorescence imaging, allowing us to study how network topology sculpts multicellular dynamics. Such bottom-up analysis of living neuronal networks provides a unique approach for investigating how neurons form structured networks, generate high-dimensional dynamics, and realize information processing.



Session 5B  Embedded Operating Systems and Information Retrieval
Time: 15:30 - 16:00, Wednesday, January 20, 2021
Location: Room 5B
Chairs: Tony Givargis (University of California, Irvine, USA), Chengmo Yang (University of Delaware, USA)

5B-1
TitleEnergy-Performance Co-Management of Mixed-Sensitivity Workloads on Heterogeneous Multi-core Systems
Author*Elham Shamsa, Anil Kanduri (University of Turku, Finland), Amir M. Rahmani (University of California, USA), Pasi Liljeberg (University of Turku, Finland)
Pagepp. 421 - 427
KeywordOn-chip Resource allocation, Heterogeneous Multi-core Systems, Performance, latency, throughput, Concurrent applications
AbstractSatisfying the performance requirements of complex workload scenarios with respect to energy consumption on Heterogeneous Multi-core Platforms (HMPs) is challenging when considering i) the increasing variety of applications, and ii) the large space of resource management configurations. Existing run-time resource management approaches use online and offline learning to handle such complexity. However, they focus on one type of application, neglecting concurrent execution of mixed-sensitivity workloads. In this work, we propose an energy-performance co-management method which prioritizes mixed types of applications at run-time and searches the configuration space to find the optimal configuration for each application, one which satisfies the performance requirements while saving energy. We evaluate our approach on a real Odroid XU3 platform over mixed-sensitivity embedded workloads. Experimental results show our approach provides 54% lower performance violation with 50% higher energy saving compared to the existing approaches.

5B-2
TitleOptimizing Inter-Core Data-Propagation Delays in Industrial Embedded Systems under Partitioned Scheduling
AuthorLamija Hasanagić, *Tin Vidović, Saad Mubeen, Mohammad Ashjaei (Mälardalen University, Sweden), Matthias Becker (KTH Royal Institute of Technology, Sweden)
Pagepp. 428 - 434
KeywordReal-Time, Inter-Core Communication, Data Propagation Delay, Scheduling, Time-Triggered
AbstractThis paper addresses the scheduling of industrial time-critical applications on multi-core embedded systems. A novel scheduling technique under partitioned scheduling is proposed that minimizes inter-core data-propagation delays between tasks that are activated with different periods. The proposed technique is based on the read-execute-write model for the execution of tasks to guarantee temporal isolation when accessing the shared resources. A Constraint Programming formulation is presented to find the schedule for each core. Evaluations are performed to assess the scalability as well as the resulting schedulability ratio, which is still 18% for two cores that are both utilized at 90%. Furthermore, an automotive industrial case study is performed to demonstrate the applicability of the proposed technique to industrial systems. The case study also presents a comparative evaluation of the schedules generated by (i) the proposed technique and (ii) the Rubus-ICE industrial tool suite with respect to jitter, inter-core data-propagation delays and their impact on data age of task chains that span multiple cores.
Slides

5B-3
TitleLiteIndex: Memory-Efficient Schema-Agnostic Indexing for JSON documents in SQLite
Author*Siqi Shang, Qihong Wu, Tianyu Wang, Zili Shao (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 435 - 440
KeywordJSON, SQLite, Indexing, Memory-Efficient, Schema-Agnostic
AbstractSQLite with JSON (JavaScript Object Notation) format is widely adopted for local data storage in mobile applications such as Twitter and Instagram. As more data are generated and stored, it becomes vitally important to efficiently index and search JSON records in SQLite. However, current methods in SQLite either require full-text search (which incurs high memory usage and long query latency) or indexing based on expressions (which must be created manually by specifying search keys). On the other hand, existing JSON automatic indexing techniques, mainly focusing on big data and cloud environments, depend on a colossal tree structure that cannot be applied in memory-constrained mobile devices. In this paper, we propose a novel schema-agnostic indexing technique called LiteIndex that can automatically index JSON records by extracting keywords from long text and maintaining user-preferred items within a given memory constraint. This is achieved by a memory-efficient index organization with light-weight keyword extraction from long text and a user-preference-aware reinforcement-learning-based index pruning mechanism. LiteIndex has been implemented on an Android smartphone platform and evaluated with a dataset from Tweet. Experimental results show that LiteIndex can significantly reduce the query latency by up to 10x with less memory usage compared with SQLite with FTS3/FTS4 extensions.
Slides
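
The "indexing based on expression" baseline that the 5B-3 abstract contrasts against can be sketched with Python's built-in `sqlite3` module. This shows only the manual approach, where the search key must be named by hand, and not LiteIndex. The table name, the sample documents, and the helper `find_by_user` are made up for illustration, and the sketch assumes the linked SQLite build provides the JSON functions (`json_extract`).

```python
import json
import sqlite3


def find_by_user(docs, user):
    """Store JSON documents as text and query them through an expression index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tweets (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT INTO tweets (body) VALUES (?)",
        [(json.dumps(doc),) for doc in docs],
    )
    # The indexed key ('$.user') has to be chosen manually, which is the
    # limitation of expression indexes that the abstract points out.
    conn.execute(
        "CREATE INDEX idx_tweets_user ON tweets (json_extract(body, '$.user'))"
    )
    rows = conn.execute(
        "SELECT id FROM tweets WHERE json_extract(body, '$.user') = ? ORDER BY id",
        (user,),
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]


sample_docs = [
    {"user": "alice", "text": "hello"},
    {"user": "bob", "text": "design automation"},
    {"user": "alice", "text": "indexing JSON"},
]
print(find_by_user(sample_docs, "alice"))
```

A query on any other key, such as `$.text`, would not benefit from this index, which is why a schema-agnostic scheme is attractive when the keys are not known in advance.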



Session 5C  (SS-4) Security Issues in AI and Their Impacts on Hardware Security
Time: 15:30 - 16:00, Wednesday, January 20, 2021
Location: Room 5C
Chairs: Jiliang ZHANG (Hunan University, China), Lingjuan Wu (Huazhong Agricultural University, China)

5C-1
Title(Invited Paper) Micro-architectural Cache Side-Channel Attacks and Countermeasures
Author*Chaoqun Shen, Congcong Chen, Jiliang Zhang (Hunan University, China)
Pagepp. 441 - 448
Keywordhardware security, micro-architecture, cache, side-channel attacks
AbstractThe Central Processing Unit (CPU) is considered the brain of a computer. If the CPU has vulnerabilities, the security of software running on it is difficult to guarantee. In recent years, various micro-architectural cache side-channel attacks on the CPU such as Spectre and Meltdown have appeared. They exploit contention on internal components of the processor to leak secret information between processes. This newly evolving research area has aroused significant interest due to the broad application range and harmfulness of these attacks. This article reviews recent research progress on micro-architectural cache side-channel attacks and defenses. First, the various micro-architectural cache side-channel attacks are classified and discussed. Then, the corresponding countermeasures are summarized. Finally, the limitations and future development trends are outlined.
Slides

5C-2
Title(Invited Paper) Security of Neural Networks from Hardware Perspective: A Survey and Beyond
Author*Qian Xu (University of Maryland College Park, USA), Md Tanvir Arafin (Morgan State University, USA), Gang Qu (University of Maryland College Park, USA)
Pagepp. 449 - 454
Keywordneural networks, hardware security, side-channel attacks, hardware trojan, trusted execution environment
AbstractRecent advances in neural networks (NNs) and their application in deep learning techniques have made the security aspects of NNs an important and timely topic for fundamental research. This work surveys the current state-of-the-art security opportunities and challenges for computing hardware used in implementing deep neural networks to address this issue. First, we explore the hardware attack surfaces for deep neural networks (DNN). Then, we report the current state of hardware-based attacks (i.e., hardware trojan insertion, fault injection, and side-channel analysis) on DNN. Next, we document the recent developments on detection and countermeasures for hardware-oriented attacks. We also provide details on the application of secure enclaves for the trusted execution of NN-based algorithms. Finally, we discuss hardware security promises for intellectual property protection for deep learning systems. Based on our survey, we find ample opportunities for hardware research to secure the next generation of DNN-based artificial intelligence and machine learning platforms.

5C-3
Title(Invited Paper) Learning Assisted Side Channel Delay Test for Detection of Recycled ICs
Author*Ashkan Vakil (George Mason University, USA), Farzad Niknia (University of Maryland Baltimore County, USA), Ali Mirzaeian, Avesta Sasan (George Mason University, USA), Naghmeh Karimi (University of Maryland Baltimore County, USA)
Pagepp. 455 - 462
KeywordCounterfeit IC, Aging, Recycled IC, Hardware Security, Learning
AbstractWith the outsourcing of design flow, ensuring the security and trustworthiness of integrated circuits has become more challenging. Among the security threats, IC counterfeiting and recycled ICs have received a lot of attention due to their inferior quality, and in turn, their negative impact on the reliability and security of the underlying devices. Detecting recycled ICs is challenging due to the effect of process variations and process drift occurring during the chip fabrication. Moreover, relying on a golden chip as a basis for comparison is not always feasible. Accordingly, this paper presents a recycled IC detection scheme based on delay side-channel testing. The proposed method relies on the features extracted during the design flow and the sample delays extracted from the target chip to build a Neural Network model using which the target chip can be truly identified as new or recycled. The proposed method classifies the timing paths of the target chip into two groups based on their vulnerability to aging using the information collected from the design and detects the recycled ICs based on the deviation of the delay of these two sets from each other.
Slides

5C-4
Title(Invited Paper) ML-augmented Methodology for Fast Thermal Side-Channel Emission Analysis
Author*Norman Chang, Deqi Zhu, Lang Lin, Dinesh Selvakumaran, Jimin Wen, Stephen Pan, Wenbo Xia, Hua Chen, Calvin Chow (Ansys, Inc., USA), Gary Chen (National Taiwan University, Taiwan)
Pagepp. 463 - 468
KeywordCPA, side-channel attack
AbstractAccurate side-channel attacks can non-invasively or semi-invasively extract secure information from hardware devices using “side-channel” measurements. The thermal profile of an IC is one class of side channel that can be used to exploit the security weaknesses in a design. Measurement of junction temperature from an on-chip thermal sensor, or of top metal layer temperature using an infrared thermal image of an IC with the package removed, can disclose secret keys of a cryptographic design through correlation power analysis. In order to identify the design vulnerabilities to thermal side-channel attacks, design-time simulation tools are highly important. However, simulation of thermal side-channel emission is highly complex and computationally intensive due to the scale of simulation vectors required and the multi-physics simulation models involved. Hence, in this paper, we propose a fast and comprehensive Machine Learning (ML) augmented thermal simulation methodology for thermal Side-Channel emission Analysis (SCeA).



Session 5D  Advances in Logic and High-level Synthesis
Time: 15:30 - 16:00, Wednesday, January 20, 2021
Location: Room 5D
Chairs: Jun Zhou (University of Electronic Science and Technology of China, China), Grace Li Zhang (Technical University of Munich, Germany)

Best Paper Candidate
5D-1
Title1st-Order to 2nd-Order Threshold Logic Gate Transformation with an Enhanced ILP-based Identification Method
Author*Li-Cheng Zheng (National Central University, Taiwan), Hao-Ju Chang, Yung-Chih Chen (Yuan Ze University, Taiwan), Jing-Yang Jou (National Central University, Taiwan)
Pagepp. 469 - 474
KeywordLogic optimization, Threshold logic, 2nd-order threshold logic
AbstractThis paper introduces a method to enhance an integer linear programming (ILP)-based method for transforming a 1st-order threshold logic gate (1-TLG) to a 2nd-order TLG (2-TLG) with lower area cost. We observe that for a 2-TLG, most of the 2nd-order weights (2-weights) are zero. That is, in the ILP formulation, most of the variables for the 2-weights could be set to zero. Thus, we first propose three sufficient conditions for transforming a 1-TLG to a 2-TLG by extracting 2-weights. These extracted weights are seen to be more likely non-zero. Then, we simplify the ILP formulation by eliminating the non-extracted 2-weights to speed up the ILP solving. The experimental results show that, to transform a set of 1-TLGs to 2-TLGs, the enhanced method saves an average of 24% CPU time with only an average of 1.87% quality loss in terms of the area cost reduction rate.
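
The difference between 1st-order and 2nd-order threshold logic gates in the 5D-1 abstract can be made concrete with a small evaluator. The sketch assumes the usual definitions: a 1-TLG outputs 1 when the weighted sum of its inputs reaches the threshold, and a 2-TLG adds weighted products of input pairs (the 2-weights). The function names and the XOR example are illustrative and are not taken from the paper.

```python
from itertools import product


def tlg1(weights, threshold, x):
    """1st-order gate: output 1 iff sum(w_i * x_i) >= threshold."""
    return int(sum(w * xi for w, xi in zip(weights, x)) >= threshold)


def tlg2(weights, pair_weights, threshold, x):
    """2nd-order gate: adds w_ij * x_i * x_j terms; pair_weights maps (i, j) -> w_ij."""
    total = sum(w * xi for w, xi in zip(weights, x))
    total += sum(w * x[i] * x[j] for (i, j), w in pair_weights.items())
    return int(total >= threshold)


def xor_as_tlg2(x):
    """XOR with a single non-zero 2-weight: x0 + x1 - 2*x0*x1 >= 1."""
    return tlg2((1, 1), {(0, 1): -2}, 1, x)


def some_tlg1_realizes_xor(bound=3):
    """Brute-force check over small integer weights and thresholds."""
    inputs = list(product((0, 1), repeat=2))
    values = range(-bound, bound + 1)
    for w0, w1, t in product(values, repeat=3):
        if all(tlg1((w0, w1), t, x) == (x[0] ^ x[1]) for x in inputs):
            return True
    return False


for x in product((0, 1), repeat=2):
    print(x, xor_as_tlg2(x))
print(some_tlg1_realizes_xor())
```

XOR is not linearly separable, so no single 1-TLG realizes it, while one 2-TLG with a single non-zero 2-weight does. This mirrors the abstract's observation that only a few 2-weights need to be non-zero.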

5D-2
TitleA Novel Technology Mapper for Complex Universal Gates
AuthorMeng-Che Wu, *Ai Quoc Dao (National Chung Cheng University, Taiwan), Mark Po-Hung Lin (National Chiao Tung University, Taiwan)
Pagepp. 475 - 480
KeywordLogic synthesis, Technology mapping, universal gate, FPGA
AbstractComplex universal logic gates, which may have higher density and flexibility than basic logic gates and look-up tables (LUT), are useful for cost-effective or security-oriented VLSI design requirements. However, most of the technology mapping algorithms aim to optimize combinational logic with basic standard cells or LUT components. It is desirable to investigate optimal technology mappers for complex universal gates in addition to basic standard cells and LUT components. This paper proposes a novel technology mapper for complex universal gates with a tight integration of the following techniques: Boolean network simulation with permutation classification, supergate library construction, dynamic programming based cut enumeration, Boolean matching with optimal universal cell covering. Experimental results show that the proposed method outperforms the state-of-the-art technology mapper in ABC, in terms of both area and delay.
Slides

5D-3
TitleHigh-Level Synthesis of Transactional Memory
Author*Omar Ragheb, Jason H. Anderson (University of Toronto, Canada)
Pagepp. 481 - 486
KeywordHLS, transactional memory, Pthreads, FPGA, synchronization
AbstractThe rising popularity of high-level synthesis (HLS) is due to the complexity and amount of background knowledge required to design hardware circuits. Despite significant recent advances in HLS research, HLS-generated circuits may be of lower quality than human-expert-designed circuits, from the performance, power, or area perspectives. In this work, we aim to raise circuit performance by introducing a transactional memory (TM) synchronization model to the open-source LegUp HLS tool. LegUp HLS supports the synthesis of multi-threaded software into parallel hardware, including support for mutual-exclusion lock-based synchronization. With the introduction of transactional memory based synchronization, location-specific (i.e. finer-grained) memory locks are made possible, where instead of placing an access lock around an entire array, one can place a lock around individual array elements. Significant circuit performance improvements are observed through reduced stalls due to contention, and greater memory-access parallelism. On a set of 5 parallel benchmarks, wall-clock time is improved by 2.0×, on average, by the TM synchronization model vs. mutex-based locks.



Session 5E  Hardware-Oriented Threats and Solutions in Neural Networks
Time: 15:30 - 16:00, Wednesday, January 20, 2021
Location: Room 5E
Chairs: Jiaji He (Tsinghua University, China), Tanvir Arafin (Morgan State University, USA)

5E-1
TitleVADER: Leveraging the Natural Variation of Hardware to Enhance Adversarial Attack
Author*Hao Lv (Institute of Computing Technology, Chinese Academy of Sciences, China), Bing Li (Capital Normal University, China), Ying Wang (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China), Cheng Liu, Lei Zhang (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 487 - 492
KeywordRRAM, Security, Hardware Variation
AbstractAdversarial attacks have been viewed as the primary threat to the security of neural networks. Hence, extensive adversarial defense techniques have been proposed to protect neural networks from adversarial attacks, allowing for the application of neural networks to security-sensitive tasks. Recently, emerging devices, e.g., Resistive RAM (RRAM), have attracted extensive attention for establishing the hardware platform for neural networks to tackle the inadequate computing capability of the traditional computing platform. Though the emerging devices exhibit intrinsic instability issues due to the advanced manufacturing technology, including hardware variations and defects, the error-resilience capability of neural networks enables the wide deployment of neural networks on the emerging devices. In this work, we find that the natural instability in emerging devices impairs the security of neural networks. Specifically, we design an enhanced adversarial attack, the Variation-oriented ADvERsarial (VADER) attack, which leverages the inherent hardware variations in RRAM chips to penetrate the protection of adversarial defense and mislead the prediction of neural networks. We evaluated the effectiveness of VADER across various protected neural network models, and the results show that VADER achieves a higher attack success rate than other adversarial attacks.
Slides

5E-2
TitleEntropy-Based Modeling for Estimating Adversarial Bit-flip Attack Impact on Binarized Neural Network
Author*Navid Khoshavi (Florida Polytechnic University, USA), Saman Sargolzaei (University of Tennessee at Martin, USA), Yu Bi (University of Rhode Island, USA), Arman Roohi (University of Nebraska–Lincoln, USA)
Pagepp. 493 - 498
Keywordblack-box attack, white-box attack, deep neural network accelerator, bit-flip attack, Statistical Model
AbstractOver the past years, the high demand to efficiently process deep learning (DL) models has driven the market of chip design companies. However, the recent Deep Chip architectures, a common term for DL hardware accelerators, have paid little attention to the security requirements of quantized neural networks (QNNs), while black/white-box adversarial attacks can jeopardize the integrity of the inference accelerator. Therefore, this paper presents a comprehensive study of the resiliency of QNN topologies to black-box attacks. Herein, different attack scenarios are performed on an FPGA-processor co-design and the collected results are extensively analyzed to estimate the degree of impact of different types of attacks on the QNN topology. To be specific, we evaluated the sensitivity of the QNN accelerator to a range of numbers of bit-flip attacks (BFAs) that might occur in the operational lifetime of the device. The BFAs are injected at uniformly distributed times either across the entire QNN or per individual layer during the image classification. The acquired results are utilized to build the entropy-based model that can be leveraged to construct QNN architectures resilient to bit-flip attacks.

5E-3
TitleA Low Cost Weight Obfuscation Scheme for Security Enhancement of ReRAM Based Neural Network Accelerators
AuthorYuhang Wang, *Song Jin (North China Electric Power University, China), Tao Li (University of Florida, USA)
Pagepp. 499 - 504
KeywordReRAM, Security, Model stealing attack, Weight obfuscation
AbstractThe resistive random-access memory (ReRAM) based accelerator can execute large-scale neural network (NN) applications in an extremely energy-efficient way. However, the non-volatile feature of the ReRAM introduces some security vulnerabilities. The weight parameters of a well-trained NN model deployed on the ReRAM based accelerator persist even after the chip is powered off. Adversaries who have physical access to the accelerator can hence launch the model stealing attack and extract these weights by some micro-probing methods. Run-time encryption of the weights is an intuitive way to protect the NN model but largely degrades execution performance and device endurance. Obfuscation of the weight rows, in turn, incurs tremendous hardware area overhead in order to achieve high security. In view of the above-mentioned problems, in this paper we propose a low-cost weight obfuscation scheme to secure the NN model deployed on the ReRAM based accelerators from the model stealing attack. We partition the crossbar into many virtual operation units (VOUs) and perform full permutation on the weights of the VOUs along the column dimension. Without the keys, the attackers cannot perform the correct NN computations even if they have obtained the obfuscated model. Compared with the weight-row-based obfuscation, our scheme can achieve the same level of security with an order of magnitude less hardware area and power overhead.



Session 6A  (DF-2): Advanced Sensing Technology and Automotive Application
Time: 16:00 - 16:30, Wednesday, January 20, 2021
Location: Room 6A
Organizer/Chair: Masaki Sakakibara (Sony Semiconductor Solutions Corporation, Japan), Organizer: Yuji Ishikawa (Toshiba Corporation, Japan)

6A-1
Title(Designers' Forum) A 32x32-Pixel Global Shutter CMOS THz Imager with VCO-Based ADC
AuthorYuri Kanazawa (Hokkaido University, Japan)
KeywordAdvanced Sensing
AbstractTerahertz imaging has attracted attention because of its many useful characteristics. A low-cost terahertz detector would further improve its desirability. Antenna-type CMOS terahertz imagers are low-cost and enable high-speed image capturing. However, their performance depends on their readout architecture. In this work, taking advantage of a pixel-parallel ADC architecture, our real-time terahertz imager achieves the following features: a global shutter function and digital CDS using a VCO-based ADC, and noise suppression for OOK-modulated signals employing the CDS. Additionally, we achieve high responsivity (218kV/W@0.93THz) with low power consumption using a MOSFET amplifier-based detector fabricated in a low-cost 180nm Si-CMOS process.

6A-2
Title(Designers' Forum) Design Strategies of a Vertical Avalanche Photodiode (VAPD), Photon Count Equalizer (PCE) and Subrange Codes (SRC) for an Ultra-long Range (>250 m) Direct-/indirect Mixed Time-of-Flight (TOF) System with Reconfigurable Resolution
AuthorT. Okino, M. Ishii, Y. Sakata, S. Yamada, A. Inoue, S. Kasuga, M. Takemoto, M. Tamaru, H. Koshida, M. Usuda, T. Kunikyo, Y. Yuasa, T. Kabe, S. Saito, Y. Sugiura, K. Nakanishi, N. Torazawa, T. Shirono, Y. Nose, S. Koyama, M. Mori, Y. Hirose, M. Sawada, A. Odagawa, T. Tanaka (Panasonic Corporation, Japan)
KeywordAdvanced Sensing
AbstractWe present design strategies of our recently developed direct-/indirect mixed time-of-flight (TOF) system based on a 6 μm□, 1200×900-pixel, Geiger-mode operated vertical avalanche photodiode (VAPD) CMOS image sensor. The device is capable of ranging up to 250 m full range with 10 cm lateral resolution by the direct TOF mode. For short ranges (< 20 m), with the assistance of the indirect TOF mode, depth resolution of 10 cm is demonstrated. Full range images are output in real time (30 fps) with a 450 fps capture speed of each subrange image. We focus on three key elements: (i) a pixel design based on VAPD with a capacitive self-quenching free from after-pulse; (ii) a photon count equalizer (PCE) that enables clear separation of photon count peaks; (iii) subrange-coding (SRC)/synthesis (SRS) architectures configurable for various applications and scenes. Major tradeoffs such as SRC vs. pulse counts and external light vs. range are discussed.

6A-3
Title(Designers' Forum) A 240×192-Pixel 225m-Range Automotive LiDAR SoC Using a 40ch Voltage/Time Dual-Data-Converter-Based AFE
AuthorSatoshi Kondo (Toshiba Corporation, Japan)
KeywordAdvanced Sensing
AbstractA safe and reliable self-driving system is a key enabling technology for a society without traffic jams or accidents; LiDAR plays an essential role for such systems. To ensure higher levels of safety and comfort, early detection of small objects (e.g., debris/children) is crucial. To achieve this, state-of-the-art LiDARs must attain even more finely scaled pixel resolution. However, hybrid LiDAR systems require a pair of TDC/ADC AFEs per pixel to obtain both precise short-range distance measurement (DM) and 200m long-range DM. Scaling the pixel resolution will significantly enlarge the SoC area and explode its cost. We report a dual-data converter (DDC) that consolidates the functions of ADC and TDC into a single circuit; as such, a significant reduction in the area of the hybrid LiDAR AFE is achieved. The DDC acquires both high-precision time and voltage data from the input, yet it achieves a 5× smaller AFE area than prior art. This innovation leads to a 40-channel AFE integration of the SoC without increasing cost. Moreover, owing to the high ADC performance of DDCs, DM under 70klux sunlight was 12% longer, achieving 225m.

6A-4
Title(Designers' Forum) Visconti™: Edge AI Processor for Automotive Application
AuthorYutaka Yamada (Toshiba Corporation, Japan)
KeywordAdvanced Sensing
AbstractIn this presentation, we introduce our image recognition SoC for advanced driver assistance systems (ADAS). Vehicles with ADAS contribute to reducing traffic accidents and are widely available today. Image recognition SoCs are essential devices for ADAS applications, which include many image recognition functions. ADAS applications need high-performance SoCs that execute functions such as raw image capture, several image filters, object recognition, and sending alerts. On the other hand, these SoCs face tight power-consumption constraints due to the high temperature inside the vehicle and the limited cooling system. Our SoC provides all the functions that ADAS applications need through several power-efficient and programmable hardware accelerators.



Session 6B  Advanced Optimizations for Embedded Systems
Time: 16:00 - 16:30, Wednesday, January 20, 2021
Location: Room 6B
Chairs: Qi Zhu (Northwestern University, USA), Xiaolin Xu (Northeastern University, USA)

6B-1
TitlePuncturing the memory wall: Joint optimization of network compression with approximate memory for ASR application
Author*Qin Li (Tsinghua University, China), Peiyan Dong (Northeastern University, USA), Zijie Yu, Changlu Liu, Fei Qiao (Tsinghua University, China), Yanzhi Wang (Northeastern University, USA), Huazhong Yang (Tsinghua University, China)
Pagepp. 505 - 511
KeywordMemory wall, joint optimization, network compression, approximate memory, automatic speech recognition
AbstractThe automatic speech recognition (ASR) system is becoming increasingly irreplaceable in smart speech interaction applications. Nonetheless, these applications confront the memory wall when embedded in energy- and memory-constrained Internet of Things devices. Therefore, it is extremely challenging but imperative to design a memory-saving and energy-saving ASR system. This paper proposes a joint-optimized scheme of network compression with approximate memory for an economical ASR system. At the algorithm level, this work presents block-based pruning and quantization with error model (BPQE), an optimized compression framework including a novel pruning technique coordinated with low-precision quantization and the approximate memory scheme. The BPQE-compressed recurrent neural network (RNN) model comes with an ultra-high compression rate and a fine-grained structured pattern that reduce the amount of memory access immensely. At the hardware level, this work presents an ASR-adapted incremental retraining method to further obtain optimal power saving. This retraining method stimulates the utility of the approximate memory scheme, while maintaining considerable accuracy. According to the experimental results, the proposed joint-optimized scheme achieves 58.6% power saving and 40× memory saving with a phone error rate of 20%.

6B-2
TitleCanonical Huffman Decoder on Fine-grain Many-core Processor Arrays
Author*Satyabrata Sarangi, Bevan Baas (University of California, Davis, USA)
Pagepp. 512 - 517
KeywordCanonical Huffman decoder, Many-core processors
AbstractCanonical Huffman codecs have been used in a wide variety of platforms ranging from mobile devices to data centers which all demand high energy efficiency and high throughput. This work presents bit-parallel canonical Huffman decoder implementations on a fine-grain many-core array built using simple RISC-style programmable processors. We develop multiple energy-efficient and area-efficient decoder implementations and the results are compared with an Intel i7-4850HQ and a massively parallel GT 750M GPU executing the corpus benchmarks: Calgary, Canterbury, Artificial, and Large. The many-core implementations achieve a scaled throughput per chip area that is 324× and 2.7× greater on average than the i7 and GT 750M respectively. In addition, the many-core implementations yield a scaled energy efficiency (bytes decoded per energy) that is 24.1× and 4.6× greater than the i7 and GT 750M respectively.
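As background for readers unfamiliar with the term, a canonical Huffman code can be rebuilt from the code lengths alone, which keeps the decoder's tables small. The sketch below is a minimal bit-serial software illustration of that property. It is not the bit-parallel many-core implementation described in the abstract, and the function names and the example alphabet are assumptions made for illustration.

```python
def build_canonical_codes(code_lengths):
    """Assign canonical codes from a {symbol: bit length} mapping.

    Symbols are ordered by (length, symbol); each code is the previous
    code plus one, left-shifted whenever the length increases.
    """
    ordered = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
    codes = {}
    code = 0
    prev_len = 0
    for sym in ordered:
        length = code_lengths[sym]
        code <<= (length - prev_len)
        codes[sym] = (code, length)
        code += 1
        prev_len = length
    return codes


def encode(text, code_lengths):
    """Encode text as a string of '0'/'1' characters."""
    codes = build_canonical_codes(code_lengths)
    return "".join(format(codes[ch][0], f"0{codes[ch][1]}b") for ch in text)


def decode(bits, code_lengths):
    """Decode a '0'/'1' string one bit at a time."""
    codes = build_canonical_codes(code_lengths)
    lookup = {(code, length): sym for sym, (code, length) in codes.items()}
    out = []
    code = 0
    length = 0
    for bit in bits:
        code = (code << 1) | int(bit)
        length += 1
        sym = lookup.get((code, length))
        if sym is not None:
            out.append(sym)
            code = 0
            length = 0
    if length != 0:
        raise ValueError("bit stream ends in the middle of a code word")
    return "".join(out)


# Hypothetical four-symbol alphabet; only the lengths are stored.
example_lengths = {"a": 1, "b": 2, "c": 3, "d": 3}
example_bits = encode("abacad", example_lengths)
example_text = decode(example_bits, example_lengths)
```

Only the code lengths need to be transmitted or stored, since both encoder and decoder derive the same code words from them.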

6B-3
TitleA Decomposition-Based Synthesis Algorithm for Sparse Matrix-Vector Multiplication in Parallel Communication Structure
Author*Mingfei Yu, Ruitao Gao, Masahiro Fujita (The University of Tokyo, Japan)
Pagepp. 518 - 523
Keywordsparse matrix-vector multiplication, parallel computing, communication structure, convolutional neural network
AbstractHardware such as many-core CPUs, GPUs, and FPGAs is increasingly used for the computationally intensive tasks brought by deep learning applications, a large proportion of which can be formulated as sparse matrix-vector multiplication (SpMV). Scheduling solutions for SpMV targeting parallel processing turn out to be irregular, so optimal synthesis becomes time-consuming or even infeasible as the size of the involved matrix increases. In this paper, a decomposition-based synthesis algorithm is put forward. Because the proposed method is guided by known sub-scheduling solutions, the irregularity of the SpMV problem is avoided and the search space of the synthesis problem is reduced. Compared with existing methods, the proposed method is shown to offer high-quality scheduling solutions: on average it utilizes 65.27% of the sparseness of the involved matrix and achieves 91.39% of the performance of the solutions generated by exhaustive search, with a remarkable saving in compilation time and the best scalability among the above-mentioned approaches.
Slides


[To Session Table]

Session 6C  Design and Learning of Logic Circuits and Systems
Time: 16:00 - 16:30, Wednesday, January 20, 2021
Location: Room 6C
Chairs: Wei Zhang (Hong Kong University of Science and Technology, Hong Kong), Yung-Chih Chen (Yuan Ze University, Taiwan)

6C-1
TitleLearning Boolean Circuits from Examples for Approximate Logic Synthesis
Author*Sina Boroumand, Christos-Savvas Bouganis, George Constantinides (Imperial College London, UK)
Pagepp. 524 - 529
Keywordapproximate computing, logic synthesis, machine learning, information theory
AbstractMany computing applications are inherently error resilient. Thus, it is possible to decrease computing accuracy to achieve greater efficiency in area, performance, and/or energy consumption. In recent years, a slew of automatic techniques for approximate computing has been proposed; however, most of these techniques require full knowledge of an exact, or `golden' circuit description. In contrast, there has been significant recent interest in synthesizing computation from examples, a form of supervised learning. In this paper, we explore the relationship between supervised learning of Boolean circuits and existing work on synthesizing incompletely-specified functions. We show that when considered through a machine learning lens, the latter work provides a good training accuracy but poor test accuracy. We contrast this with prior work from the 1990s which uses mutual information to steer the search process, aiming for good generalization. By combining this early work with a recent approach to learning logic functions, we are able to achieve a scalable and efficient machine learning approach for Boolean circuits in terms of area/delay/test-error trade-off.
Slides

6C-2
TitleRead your Circuit: Leveraging Word Embedding to Guide Logic Optimization
Author*Walter Lau Neto (University of Utah, USA), Matheus Trevisan Moreira (Chronos Tech, USA), Luca Amaru (Synopsys Inc., USA), Cunxi Yu, Pierre-Emmanuel Gaillardon (University of Utah, USA)
Pagepp. 530 - 535
KeywordTiming Closure, Electronic Design Automation, Logic Optimization, Word Embedding, Machine Learning
AbstractTo tackle the involved complexity, Electronic Design Automation (EDA) tools are broken into well-defined steps, each operating at a different abstraction level. Higher levels of abstraction shorten the flow run-time while sacrificing correlation with the physical circuit implementation. Bridging this gap between Logic Synthesis tools and Physical Design (PnR) tools is key to improving Quality of Results (QoR), while possibly shortening the time-to-market. To address this problem, in this work, we formalize logic paths as sentences, with the gates being a bag of words. Thus, we show how word embedding can be leveraged to represent generic paths and predict if a given path is likely to be critical post-PnR. We demonstrate the effectiveness of our approach, with accuracy over 90% for our test-cases. Finally, we go a step further and introduce an intelligent and non-intrusive flow that uses this information to guide optimization. Our flow achieves improvements of up to 15.53% in area-delay product (ADP) and 18.56% in power-delay product (PDP), compared to a standard flow.

6C-3
TitleExploiting HLS-Generated Multi-Version Kernels to Improve CPU-FPGA Cloud Systems
AuthorBernardo Neuhaus Lignati, Michael Guilherme Jordan, Guilherme Korol (Institute of Informatics - Federal University of Rio Grande do Sul (UFRGS), Brazil), Mateus Beck Rutzig (Electronics and Computing Department - Federal University of Santa Maria (UFSM), Brazil), *Antonio Carlos Schneider Beck (Institute of Informatics - Federal University of Rio Grande do Sul (UFRGS), Brazil)
Pagepp. 536 - 541
Keywordcollaborative, CPU-FPGA, energy, HLS, makespan
AbstractCloud Warehouses have been exploiting CPU-FPGA collaborative execution environments, where multiple clients share the same infrastructure to maximize resource utilization with the highest possible energy efficiency and scalability. However, resource provisioning is challenging in these environments, since kernels may be dispatched to both CPU and FPGA concurrently in a highly variable scenario, in terms of available resources and workload characteristics. In this work, we propose MultiVers, a framework that leverages automatic HLS generation to enable further gains in such CPU-FPGA collaborative systems. MultiVers exploits the automatic generation from HLS to build libraries containing multiple versions of each incoming kernel request, greatly enlarging the design space that the allocation strategies in the cloud provider can explore and optimize. MultiVers makes kernel multiversioning and the allocation strategy work symbiotically, allowing fine-tuning in terms of resource usage, performance, energy, or any combination of these parameters. We show the efficiency of MultiVers by using real-world cloud request scenarios with a diversity of benchmarks, achieving average improvements on makespan and energy of up to 4.62× and 19.04×, respectively, over traditional allocation strategies executing non-optimized kernels.
Slides


[To Session Table]

Session 6D  Hardware Locking and Obfuscation
Time: 16:00 - 16:30, Wednesday, January 20, 2021
Location: Room 6D
Chairs: Xueyan Wang (Beihang University, China), Yier Jin (University of Florida, USA)

6D-1
TitleArea Efficient Functional Locking through Coarse Grained Runtime Reconfigurable Architectures
Author*Jianqi Chen, Benjamin Carrion Schafer (The University of Texas at Dallas, USA)
Pagepp. 542 - 547
KeywordFunctional locking, CGRRA, High-Level Synthesis
AbstractThe protection of Intellectual Property (IP) has emerged as one of the most important issues in the hardware design industry. Most VLSI design companies are now fabless and need to protect their IP from being illegally distributed. One of the main approaches to address this has been logic locking. Logic locking prevents IPs from being reverse engineered and prevents untrusted foundries from overbuilding the hardware circuit. One of the main problems with existing logic locking techniques is that the foundry has full access to the entire design, including the logic locking mechanism. Because of the importance of this topic, more robust locking mechanisms are continuously proposed, and new methods to break them appear equally fast. One alternative approach is to lock a circuit through omission. The main idea is to selectively map a portion of the IP onto an embedded FPGA (eFPGA). Because the foundry does not have access to the bitstream, the circuit cannot be used until programmed by the legitimate user. One of the main problems with this approach is the large overhead in terms of area and power, as well as timing degradation. Area is especially a concern for price-sensitive applications. To address this, in this work we present a method to map portions of a design onto a Coarse Grained Runtime Reconfigurable Architecture (CGRRA) such that multiple parts of a design can be hidden on the CGRRA, reducing the total area overhead.
Slides

6D-2
TitleObfusX: Routing Obfuscation with Explanatory Analysis of a Machine Learning Attack
Author*Wei Zeng, Azadeh Davoodi (University of Wisconsin-Madison, USA), Rasit Onur Topaloglu (IBM, USA)
Pagepp. 548 - 554
Keywordrouting obfuscation, split manufacturing, explainable artificial intelligence, machine learning
AbstractThis is the first work that incorporates recent advancements in "explainability" of machine learning (ML) to build a routing obfuscator called ObfusX. We adopt a recent metric, the SHAP value, which explains to what extent each layout feature can reveal each unknown connection for a recent ML-based split manufacturing attack model. The unique benefits of SHAP-based analysis include the ability to identify the best candidates for obfuscation, together with the dominant layout features which make them vulnerable. As a result, ObfusX can achieve a better hit rate (97% lower) while perturbing significantly fewer nets when obfuscating using a via perturbation scheme, compared to prior work. When imposing the same wirelength limit using a wire lifting scheme, ObfusX performs significantly better in performance metrics (e.g., 2.4 times more reduction on average in percentage of netlist recovery).
Slides

6D-3
TitleBreaking Analog Biasing Locking Techniques via Re-Synthesis
Author*Julian Leonhard, Mohamed Elshamy, Marie-Minerve Louërat, Haralampos-G. Stratigopoulos (Sorbonne Université, CNRS, LIP6, France)
Pagepp. 555 - 560
KeywordHardware security and trust, IP/IC piracy, locking, analog circuit synthesis
AbstractWe demonstrate an attack to break all analog circuit locking techniques that act upon the biasing of the circuit. The attack is based on re-synthesizing the biasing circuits and requires only the use of an optimization algorithm. It is generally applicable to any analog circuit class. For the attacker the method requires no in-depth understanding or analysis of the circuit. The attack is demonstrated on a bias-locked Low-Dropout (LDO) regulator. As the underlying optimization algorithm we employ a Genetic Algorithm (GA).


[To Session Table]

Session 6E  Efficient Solutions for Emerging Technologies
Time: 16:00 - 16:30, Wednesday, January 20, 2021
Location: Room 6E
Chairs: Xueqing Li (Tsinghua University, China), Sangyoung Park (TU Berlin, Germany)

6E-1
TitleEnergy and QoS-Aware Dynamic Reliability Management of IoT Edge Computing Systems
Author*Kazim Ergun (University of California, San Diego, USA), Raid Ayoub, Pietro Mercati (Intel Corporation, USA), Dan Liu, Tajana Rosing (University of California, San Diego, USA)
Pagepp. 561 - 567
KeywordInternet of Things, Edge Computing, Dynamic Reliability Management
AbstractInternet of Things (IoT) systems, like any electronic or mechanical system, are prone to failures. Hard failures in hardware due to aging and degradation are particularly important since they are irrecoverable, requiring maintenance for the replacement of defective parts, at high cost. In this paper, we propose a novel dynamic reliability management (DRM) technique for IoT edge computing systems to satisfy the Quality of Service (QoS) and reliability requirements while maximizing the remaining energy of the edge device batteries. We formulate a state-space optimal control problem with a battery energy objective and with QoS and terminal reliability constraints. We decompose the problem into low-overhead sub-problems and solve it employing a hierarchical and multi-timescale control approach, distributed over the edge devices and the gateway. Our results, based on real measurements and trace-driven simulation, demonstrate that the proposed scheme can achieve similar battery lifetime compared to the state-of-the-art approaches while satisfying reliability requirements, where other approaches fail to do so.

6E-2
TitleLight: A Scalable and Efficient Wavelength-Routed Optical Networks-On-Chip Topology
Author*Zhidan Zheng, Mengchu Li, Tsun-Ming Tseng, Ulf Schlichtmann (Technical University of Munich, Germany)
Pagepp. 568 - 573
KeywordWavelength-routed Optical NoCs, topology synthesis, Parallel Switching Elements
AbstractWavelength-routed optical networks-on-chip (WRONoCs) are known for delivering collision- and arbitration-free on-chip communication in many-core systems. While appealing for low latency and high predictability, WRONoCs are challenged by scalability concerns for two reasons: (1) State-of-the-art WRONoC topologies use a large number of microring resonators (MRRs), which results in high MRR tuning power and crosstalk noise. (2) The positions of master and slave nodes in current topologies do not match realistic layout constraints. Thus, many additional waveguide crossings will be introduced during physical implementation, which degrades the network performance. In this work, we propose an N×(N-1) WRONoC topology, Light, with a 4×3 router, Hash, as the basic building block, and a simple but efficient approach to configure the resonant wavelength for each MRR. Experimental results show that Light outperforms state-of-the-art topologies in terms of enhancing signal-to-noise ratio (SNR) and reducing insertion loss, especially for large-scale networks. Furthermore, Light can be easily implemented onto a physical plane without causing external waveguide crossings.
Slides

Best Paper Candidate
6E-3
TitleOne-Pass Synthesis for Field-coupled Nanocomputing Technologies
Author*Marcel Walter (Group of Computer Architecture, University of Bremen, Germany), Winston Haaswijk (Cadence Design Systems, Inc., USA), Robert Wille (Institute for Integrated Circuits, Johannes Kepler University Linz, Austria), Frank Sill Torres (Department for the Resilience of Maritime Systems, DLR, Germany), Rolf Drechsler (Group of Computer Architecture, University of Bremen, Germany)
Pagepp. 574 - 580
KeywordField-coupled Nanocomputing, One-pass Synthesis, Quantum-dot Cellular Automata, Physical Design, Placement & Routing
AbstractField-coupled Nanocomputing (FCN) is a class of post-CMOS emerging technologies which promises to overcome certain physical limitations of conventional solutions such as CMOS, by allowing for high computational throughput with low power dissipation. Despite their promises, the design of corresponding FCN circuits is still in its infancy. In fact, state-of-the-art solutions still heavily rely on conventional synthesis approaches that do not take the tight physical constraints of FCN circuits (particularly with respect to routability and clocking) into account. Instead, physical design is conducted in a second step in which a classical logic network is mapped onto an FCN layout. Using this two-stage approach with a classical and FCN-oblivious logic network as an intermediate result frequently leads to substantial quality loss or even completely impractical results. In this work, we propose a one-pass synthesis scheme for FCN circuits which conducts both steps, synthesis and physical design, in a single run. For the first time, this allows us to generate exact, i.e., minimal FCN circuits for a given functionality.



Thursday, January 21, 2021

[To Session Table]

Session 3K  Keynote Session III (video and its broadcast via Zoom)
Location: Room K

3K-1
TitleIntroduction of Prof. Jun Mitani by Toshihiro Hattori (Renesas Electronics, Japan)

3K-2
Title(Keynote Address) The Frontier of Origami Science
AuthorJun Mitani (University of Tsukuba, Japan)
AbstractOrigami, the art of folding paper into shapes, is not only a traditional Japanese art but also a subject of research in a wide range of fields such as mathematics, engineering, and education. The technique of folding a single sheet of material without cutting contributes to compactly folding objects and improving their portability. Therefore, origami technology is being applied to space engineering, architecture, and the development of medical devices. However, folding a single sheet of paper into a desired shape is difficult to achieve because of its strict geometric constraints. Much research has been conducted to solve this problem. In this talk, I will first introduce the history of origami in Japan from the past to today. After that, I will introduce a wide range of topics, including the geometry of origami, my long-term work on curved fold design, and recent trends in origami research.


[To Session Table]

Session 7A  (SS-5) Platform-Specific Neural Network Acceleration
Time: 15:00 - 15:30, Thursday, January 21, 2021
Location: Room 7A
Chair: Yanzhi Wang (Northeastern University, USA)

7A-1
Title(Invited Paper) Real-Time Mobile Acceleration of DNNs: From Computer Vision to Medical Applications
Author*Hongjia Li, Geng Yuan (Northeastern University, USA), Wei Niu (College of William and Mary, USA), Yuxuan Cai, Mengshu Sun, Zhengang Li (Northeastern University, USA), Bin Ren (College of William and Mary, USA), Xue Lin, Yanzhi Wang (Northeastern University, USA)
Pagepp. 581 - 586
Keywordcomputer vision, real-time, mobile acceleration
AbstractWith the growth of mobile vision applications, there is a growing need to break through the current performance limitation of mobile platforms, especially for computationally intensive applications, such as object detection, action recognition, and medical diagnosis. To achieve this goal, we present our unified real-time mobile DNN inference acceleration framework, seamlessly integrating hardware-friendly, structured model compression with mobile-targeted compiler optimizations. We aim at unprecedented, real-time performance of such large-scale neural network inference on mobile devices. A fine-grained block-based pruning scheme is proposed to be universally applicable to all types of DNN layers, such as convolutional layers with different kernel sizes and fully connected layers. Moreover, it is also successfully extended to 3D convolutions. With the assistance of our compiler optimizations, the fine-grained block-based sparsity is fully utilized to achieve high model accuracy and high hardware acceleration simultaneously. To validate our framework, three representative fields of applications are implemented and demonstrated: object detection, activity detection, and medical diagnosis. All applications achieve real-time inference using an off-the-shelf smartphone, outperforming the representative mobile DNN inference acceleration frameworks by up to 6.7× in speed. The demonstrations of these applications can be found at the following link: https://bit.ly/39lWpYu.
Slides

7A-2
Title(Invited Paper) Dynamic Neural Network to Enable Run-Time Trade-off between Accuracy and Latency
Author*Li Yang, Deliang Fan (Arizona State University, USA)
Pagepp. 587 - 592
KeywordDynamic Neural Network
AbstractIn this work, we review the methods to construct a dynamic neural network framework, which mainly includes two phases: sub-net generation and fused sub-net training. Two different sub-net generation methods are presented, for uniform and non-uniform dynamic neural networks respectively. In addition, to reduce the training cost of the multi-path non-uniform sub-net generation method, we propose a single-path method. Experiments on CIFAR-10 and ImageNet both demonstrate the effectiveness of the method. Beyond that, we study the trade-off between inference accuracy and latency for each sub-net on a Titan GPU and a Xeon CPU.

7A-3
Title(Invited Paper) When Machine Learning Meets Quantum Computer: Network-Circuit Co-Design via Quantum-Aware Neural Architecture Search
Author*Weiwen Jiang (University of Notre Dame, USA), Jinjun Xiong (IBM Thomas J. Watson Research Center, USA), Yiyu Shi (University of Notre Dame, USA)
Pagepp. 593 - 598
Keywordneural networks, MNIST dataset, quantum computing, IBM Quantum, IBM Qiskit
AbstractAlong with the development of AI democratization, the machine learning approach, in particular neural networks, has been applied to a wide range of applications. In different application scenarios, the neural network will be accelerated on a tailored computing platform. The acceleration of neural networks on classical computing platforms, such as CPU, GPU, FPGA, and ASIC, has been widely studied; however, as the scale of the application continues to grow, the memory bottleneck, widely known as the memory wall, becomes obvious. In response to such a challenge, advanced quantum computing, which can represent 2^N states with N quantum bits (qubits), is regarded as a promising solution. It is urgent to understand how to design the quantum circuit for accelerating neural networks. Most recently, there have been initial works studying how to map neural networks to actual quantum processors. To better understand the state-of-the-art design and inspire new design methodology, this paper carries out a case study to demonstrate an end-to-end implementation. On the neural network side, we employ the multilayer perceptron to complete image classification tasks using the standard and widely used MNIST dataset. On the quantum computing side, we target IBM Quantum processors, which can be programmed and simulated by using IBM Qiskit. This work targets the acceleration of the inference phase of a trained neural network on the quantum processor. Along with the case study, we will demonstrate the typical procedure for mapping neural networks to quantum circuits.
Slides

7A-4
Title(Invited Paper) Improving Efficiency in Neural Network Acceleration using Operands Hamming Distance Optimization
Author*Meng Li, Yilei Li, Vikas Chandra (Facebook, Inc., USA)
Pagepp. 599 - 604
KeywordNeural network accelerator, datapath efficiency, Hamming distance optimization
AbstractNeural network accelerators are a key enabler for on-device AI inference, for which energy efficiency is an important metric. The datapath energy, including the computation energy and the data movement energy among the arithmetic units, claims a significant part of the total accelerator energy. By revisiting the basic physics of the arithmetic logic circuits, we show that the datapath energy is highly correlated with the bit flips when streaming the input operands into the arithmetic units, defined as the Hamming distance (HD) of the input operand matrices. Based on this insight, we propose a post-training optimization algorithm and an HD-aware training algorithm to co-design and co-optimize the accelerator and the network synergistically. The experimental results based on post-layout simulation with MobileNetV2 demonstrate on average 2.85× datapath energy reduction and up to 8.51× datapath energy reduction for certain layers.
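To make the bit-flip metric in this abstract concrete, the following minimal sketch counts the Hamming distance between consecutive operands of a stream. The 8-bit operand width, the function names, and the example values are assumptions for illustration, and the reordered stream only shows that the metric depends on streaming order; it does not reproduce the paper's post-training or HD-aware training algorithms.

```python
def hamming_distance(a, b, width=8):
    """Number of differing bits between a and b, viewed as width-bit words."""
    mask = (1 << width) - 1
    return bin((a ^ b) & mask).count("1")


def stream_bit_flips(operands, width=8):
    """Total bit flips seen on a bus when operands are streamed in order."""
    return sum(
        hamming_distance(prev, cur, width)
        for prev, cur in zip(operands, operands[1:])
    )


# Hypothetical 8-bit operand stream and a reordering of the same values.
original_order = [0x00, 0xFF, 0x01, 0xFE]
reordered = [0x00, 0x01, 0xFF, 0xFE]

flips_original = stream_bit_flips(original_order)
flips_reordered = stream_bit_flips(reordered)
```

In this toy case the same four values cost 23 bit flips in the first order and 9 in the second.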

7A-5
Title(Invited Paper) Lightweight Run-Time Working Memory Compression for Deployment of Deep Neural Networks on Resource-Constrained MCUs
Author*Zhepeng Wang, Yawen Wu, Zhenge Jia (University of Pittsburgh, USA), Yiyu Shi (University of Notre Dame, USA), Jingtong Hu (University of Pittsburgh, USA)
Pagepp. 607 - 614
KeywordNeural Network Deployment, Neural Network Compression, Edge Computing, Artificial Intelligence of Things
AbstractThis work aims to achieve intelligence on embedded devices by deploying deep neural networks (DNNs) onto resource-constrained microcontroller units (MCUs). Apart from the low frequency (e.g., 1-16 MHz) and limited storage (e.g., 16KB to 256KB ROM), one of the largest challenges is the limited RAM (e.g., 2KB to 64KB), which is needed to save the intermediate feature maps of a DNN. Most existing neural network compression algorithms aim to reduce the model size of DNNs so that they can fit into limited storage. However, they do not significantly reduce the size of the intermediate feature maps, which is referred to as working memory and might exceed the capacity of RAM. Therefore, it is possible that DNNs cannot run on MCUs even after compression. To address this problem, this work proposes a technique to dynamically prune the activation values of the intermediate output feature maps at runtime to ensure that they can fit into limited RAM. The results of our experiments show that this method could significantly reduce the working memory of DNNs to satisfy the hard constraint of RAM size, while maintaining satisfactory accuracy with relatively low overhead on memory and run-time latency.
Slides


[To Session Table]

Session 7B  Toward Energy-Efficient Embedded Systems
Time: 15:00 - 15:30, Thursday, January 21, 2021
Location: Room 7B
Chairs: Shiyan Hu (University of Southampton, UK), Xiang Chen (George Mason University, USA)

7B-1
TitleEHDSktch: A Generic Low Power Architecture for Sketching in Energy Harvesting Devices
Author*Priyanka Singla (Indian Institute of Technology Delhi, India), Chandran Goodchild (University of Freiburg, Germany), Smruti R. Sarangi (Indian Institute of Technology Delhi, India)
Pagepp. 615 - 620
KeywordStreaming algorithms, Hardware for sketching, Approximate computing
AbstractEnergy harvesting devices (EHDs) are becoming extremely prevalent in remote and hazardous environments. They sense the ambient parameters and compute some statistics on them, which are then sent to a remote server. Due to the resource-constrained nature of EHDs, it is challenging to perform exact computations on streaming data; however, if we are willing to tolerate a slight amount of inaccuracy, we can leverage the power of sketching algorithms to provide quick answers with significantly lower energy consumption. In this paper, we propose a novel hardware architecture called EHDSktch – a set of IP blocks that can be used to implement most of the popular sketching algorithms. We demonstrate an energy savings of 4-10X and a speedup of more than 10X over state-of-the-art software implementations. Leveraging the temporal locality further provides us a performance gain of 3-20% in energy and time and reduces the on-chip memory requirement by at least 50-75%.
Slides
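For readers new to sketching, the Count-Min sketch is one widely known example of the kind of streaming algorithm this abstract refers to: it trades a bounded overestimate for fixed, small memory. The software sketch below is only an illustration under assumed parameters (table width, depth, SHA-256 based hashing); the abstract does not specify which sketches EHDSktch implements or how its IP blocks are organized.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using depth rows of width counters."""

    def __init__(self, width=64, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by prefixing the row number to the item.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum over rows
        # is the tightest estimate and never undercounts.
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


# Hypothetical sensor-event stream.
sketch = CountMinSketch()
for event in ["temp", "temp", "hum", "temp", "hum", "temp", "temp"]:
    sketch.add(event)
temp_estimate = sketch.estimate("temp")
```

The estimate is never below the true count; how far above it can be depends on the chosen width and depth.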

7B-2
TitleEnergy-Aware Design Methodology for Myocardial Infarction Detection on Low-Power Wearable Devices
Author*Mohanad Odema, Nafiul Rashid, Mohammad Abdullah Al Faruque (University of California, Irvine, USA)
Pagepp. 621 - 626
KeywordMobile Health, Myocardial Infarction, Wearable Devices, Multi-Objective Bayesian Optimization, Neural Architecture Search
AbstractMyocardial Infarction (MI) is a heart disease that damages the heart muscle and requires immediate treatment. Its silent and recurrent nature necessitates real-time continuous monitoring of patients. Nowadays, wearable devices are smart enough to perform on-device processing of heartbeat segments and report any irregularities in them. However, the small form factor of wearable devices imposes resource constraints and requires energy-efficient solutions to satisfy them. In this paper, we propose a design methodology to automate the design space exploration of neural network architectures for MI detection. This methodology incorporates Neural Architecture Search (NAS) using Multi-Objective Bayesian Optimization (MOBO) to render Pareto optimal architectural models. These models minimize both detection error and energy consumption on the target device. The design space is inspired by Binary Convolutional Neural Networks (BCNNs) suited for mobile health applications with limited resources. The models’ performance is validated using the PTB diagnostic ECG database from PhysioNet. Moreover, energy-related measurements are directly obtained from the target device in a typical hardware-in-the-loop fashion. Finally, we benchmark our models against other related works. One model exceeds state-of-the-art accuracy on wearable devices (reaching 91.22%), whereas others trade off some accuracy to reduce their energy consumption (by a factor reaching 8.26×).
Slides
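The notion of Pareto-optimal models used in this abstract can be illustrated with a short filter over (detection error, energy) pairs, where lower is better for both. The candidate values below are made up for illustration and are not results from the paper, and the brute-force filter stands in for the idea of a Pareto front only; it is not the Multi-Objective Bayesian Optimization procedure itself.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is an (error, energy) tuple; lower is better in both.
    A point is dominated if some different point is no worse in both
    objectives.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


# Hypothetical (error rate, energy per inference) candidates.
candidates = [(0.09, 10.0), (0.12, 4.0), (0.15, 6.0), (0.20, 1.5)]
front = pareto_front(candidates)
```

Here (0.15, 6.0) is dropped because (0.12, 4.0) is better on both objectives, while the remaining three candidates each represent a different accuracy-energy trade-off.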

7B-3
TitlePower-Efficient Layer Mapping for CNNs on Integrated CPU and GPU Platforms: A Case Study
Author*Tian Wang (Nanjing University of Science and Technology, China), Kun Cao (Jinan University, China), Junlong Zhou, Gongxuan Zhang, Xiji Wang (Nanjing University of Science and Technology, China)
Pagepp. 627 - 632
Keywordmultiprocessor system on a chip, convolution neural network, CPU and GPU, low power, mapping
AbstractHeterogeneous multiprocessor systems on a chip (MPSoCs) consisting of integrated CPUs and GPUs are suitable platforms for embedded applications running on handheld devices such as smart phones. As the handheld devices are mostly powered by battery, the integrated CPU and GPU MPSoC is usually designed with an emphasis on low-power rather than performance. In this paper, we are interested in exploring a power-efficient layer mapping of convolution neural networks (CNNs) deployed on integrated CPU and GPU platforms. Specifically, we investigate the impact of layer mapping of YoloV3-Tiny (i.e., a widely-used CNN in both industry and academia) on system power consumption through numerous experiments on NVIDIA board Jetson TX2. The experimental results indicate that 1) almost all of the convolution layers are not suitable for mapping to CPU, 2) the pooling layer can be mapped to CPU for reducing power consumption, but the mapping may lead to a decrease in inference speed when the layer’s output tensor size is large, 3) the detection layer can be mapped to CPU as long as its floating-point operation scale is not too large, and 4) the channel and upsampling layers are both suitable for mapping to CPU. These observations obtained in this study can be further utilized to guide the design of power-efficient layer mapping strategies for integrated CPU and GPU platforms.
Slides

7B-4
TitleA Write-friendly Arithmetic Coding Scheme for Achieving Energy-Efficient Non-Volatile Memory Systems
Author*Yi-Shen Chen (Department of Computer Science and Information Engineering, National Taiwan University, Taiwan), Chun-Feng Wu (Department of Computer Science and Information Engineering, National Taiwan University/Institute of Information Science, Academia Sinica, Taiwan), Yuan-Hao Chang (Institute of Information Science, Academia Sinica, Taiwan), Tei-Wei Kuo (Department of Computer Science and Information Engineering, National Taiwan University/College of Engineering, City University of Hong Kong, Taiwan)
Pagepp. 633 - 638
KeywordInternet of Things, non-volatile memory, data compression, arithmetic coding, energy efficiency
AbstractIn the era of the Internet of Things (IoT), wearable IoT devices have become popular and closely tied to our lives. Most of these devices are based on embedded systems that have to operate on limited energy resources, such as batteries or energy harvesters. Therefore, energy efficiency is one of the critical issues for these devices. To reduce energy consumption by reducing the total number of accesses to the memory and storage layers, storage-class memory (SCM) and data compression techniques are applied to eliminate data movements and shrink the data size, respectively. However, the information gap between them hinders cooperation between the two techniques in further minimizing energy consumption. This work proposes a write-friendly arithmetic coding scheme that jointly manages both techniques to achieve energy-efficient non-volatile memory (NVM) systems. In particular, the concept of "ignorable bits" is introduced to skip further write operations while storing the compressed data into SCM devices. The proposed design was evaluated by a series of intensive experiments, and the results are encouraging.


[To Session Table]

Session 7C  Software and System Support for Nonvolatile Memory
Time: 15:00 - 15:30, Thursday, January 21, 2021
Location: Room 7C
Chairs: Zhaoyan Shen (Shandong University, China), In-chao Lin (National Cheng Kung University, Taiwan)

7C-1
TitleDP-Sim: A Full-stack Simulation Infrastructure for Digital Processing In-Memory Architectures
Author*Minxuan Zhou (University of California, San Diego, USA), Mohsen Imani (University of California, Irvine, USA), Yeseong Kim (Daegu Gyeongbuk Institute of Science and Technology, Republic of Korea), Saransh Gupta, Tajana Rosing (University of California, San Diego, USA)
Pagepp. 639 - 644
KeywordProcessing in-memory, Simulator
AbstractDigital processing in-memory (DPIM) is a promising technology that significantly reduces data movements while providing high parallelism. In this work, we design and implement the first full-stack DPIM simulation infrastructure, DP-Sim, which evaluates a comprehensive range of DPIM-specific design space with respect to both software and hardware. DP-Sim provides a C++ library to enable DPIM acceleration in general programs while supporting several aspects of software-level exploration by a convenient interface. The DP-Sim software front-end generates specialized instructions that can be processed by a hardware simulator based on a new DPIM-enabled architecture model which is 10.3% faster than conventional memory simulation models. We use DP-Sim to explore the DPIM-specific design space of acceleration for various emerging applications. Our experiments show that bank-level control is 11.3× faster than conventional channel-level control because of higher computing parallelism. Furthermore, cost-aware memory allocation can provide at least 2.2× speedup vs. heuristic methods, showing the importance of data layout in DPIM acceleration.

7C-2
TitleSAC: A Stream Aware Write Cache Scheme for Multi-Streamed Solid State Drives
Author*Bo Zhou, Chuanming Ding, Yina Lv (School of Computer Science and Technology, East China Normal University, China), Chun Xue (City University of Hong Kong, Hong Kong), Qingfeng Zhuge, Edwin Sha, Liang Shi (School of Computer Science and Technology, East China Normal University, China)
Pagepp. 645 - 650
KeywordSSD, Cache, multi-stream
AbstractThis work finds that state-of-the-art multi-streamed SSDs are used inefficiently due to two issues. First, the write cache inside SSDs is not aware of data from different streams, which induces conflicts among streams. Second, the current stream identification methods are not accurate and should be optimized inside SSDs. This work proposes a novel write cache scheme to efficiently utilize and optimize the multiple streams. First, an inter-stream aware cache partitioning scheme is proposed to manage the data from different streams. Second, an intra-stream based active cache eviction scheme is proposed that preferentially evicts data to blocks with more invalid pages. Experimental results show that the proposed scheme reduces the write amplification factor (WAF) of multi-streamed SSDs by up to 28% with negligible cost.

7C-3
TitleProviding Plug N' Play for Processing-in-Memory Accelerators
AuthorPaulo Cesar Santos, Bruno E. Forlin, *Luigi Carro (Federal University of Rio Grande do Sul, Brazil)
Pagepp. 651 - 656
Keywordprocessing-in-memory, code-offloading, cache-coherence, virtual-memory support, efficiency
AbstractAlthough Processing-in-Memory (PIM) emerged as a solution to avoid unnecessary and expensive data movements between host and accelerators, its widespread usage is still difficult: to use a PIM device effectively, huge and costly modifications must be made on the host processor side to allow instruction offloading, cache coherence, virtual memory management, and communication between different PIM instances. The present work addresses these challenges by presenting non-invasive solutions for those requirements. We demonstrate that, at compile time, and without any host modifications or programmer intervention, it is possible to exploit already available resources to allow efficient host and PIM communication and task partitioning, without disturbing either the host or the memory hierarchy. We present Plug&PIM, a plug n' play strategy for PIM adoption with minimal performance penalties.
Slides

7C-4
TitleAging Aware Request Scheduling for Non-Volatile Main Memory
AuthorShihao Song, *Anup Das (Drexel University, USA), Onur Mutlu (ETH Zurich, Switzerland), Nagarajan Kandasamy (Drexel University, USA)
Pagepp. 657 - 664
KeywordNon-volatile Memory, Circuit Aging, BTI
AbstractModern computing systems are embracing non-volatile memory (NVM) to implement high-capacity and low-cost main memory. The elevated voltages of NVM accelerate the aging of CMOS transistors in the peripheral circuit within each memory bank. Aggressive device scaling increases power density and temperature, which further accelerates the aging, challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural technique to mitigate the circuit aging-related reliability problems of NVM-based main memory. HEBE is built on three contributions. First, we propose a new analytical model that can dynamically track the aging in each peripheral circuit based on workload-specific accesses. Second, we develop a new lifetime reliability-aware access scheduler that exploits this reliability model at run time to de-stress peripheral circuits only when their aging exceeds a critical threshold. Third, we propose a simple microarchitectural change of introducing an isolation transistor to decouple parts of a peripheral circuit operating at different voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations independently and off the critical path of memory read and write accesses, improving performance. We evaluate HEBE using both single-core and multi-programmed workloads and show that it significantly improves both performance and reliability of NVM-based computing systems.


[To Session Table]

Session 7D  Learning-Driven VLSI Layout Automation Techniques
Time: 15:00 - 15:30, Thursday, January 21, 2021
Location: Room 7D
Chairs: Jinwook Jung (IBM Research, USA), Daijoon Hyun (Cheongju University, Republic of Korea)

7D-1
TitlePlacement for Wafer-Scale Deep Learning Accelerator
Author*Benzheng Li, Qi Du, Dingcheng Liu, Jingchong Zhang (Xidian University, China), Gengjie Chen (Giga Design Automation, China), Hailong You (Xidian University, China)
Pagepp. 665 - 670
Keywordplacement, wafer-scale engine, AI accelerator
AbstractTo meet the growing demand from deep learning applications for computing resources, ASIC-based accelerators are necessary. A wafer-scale engine (WSE) was recently proposed, which is able to simultaneously accelerate multiple layers of a neural network (NN). However, without a high-quality placement that properly maps NN layers onto the WSE, the acceleration efficiency cannot be achieved. The WSE placement resembles the traditional ASIC floorplanning problem of placing blocks onto a chip region, but the two are fundamentally different. Since the slowest layer determines the compute time of the whole NN on the WSE, a layer with a heavier workload needs more computing resources. In addition, the locations of layers and the protocol adapter cost of internal IO connections influence the inter-layer communication overhead. In this paper, we propose [anonymous] to handle this new challenge. A binary-search-based framework is developed to obtain a minimum compute time of the NN. Two dynamic-programming-based algorithms with different optimizing strategies are integrated to produce a legal placement. The distance and adapter cost between connected layers are further minimized by refinements. Compared with the first-place entry of the ISPD2020 Contest, [anonymous] reduces the contest metric by up to 6.89% and on average 2.09%, while running 7.23× faster.
Slides
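
The binary-search idea mentioned in the 7D-1 abstract can be illustrated with a minimal sketch. It is not the authors' framework: the cost model (a layer with workload w on k compute units takes w / k time), the integer time grid, and the names `units_needed` and `min_compute_time` are all assumptions made for illustration, and the placement legality and adapter costs described in the abstract are ignored.

```python
import math


def units_needed(workloads, target_time):
    # Assumed cost model: a layer with workload w running on k units takes
    # w / k time, so it needs ceil(w / target_time) units to meet the target.
    return sum(math.ceil(w / target_time) for w in workloads)


def min_compute_time(workloads, total_units):
    """Smallest integer time T for which all layers fit in total_units."""
    if total_units < len(workloads):
        raise ValueError("need at least one unit per layer")
    # T = max(workloads) is always feasible: one unit per layer suffices.
    lo, hi = 1, max(workloads)
    while lo < hi:
        mid = (lo + hi) // 2
        if units_needed(workloads, mid) <= total_units:
            hi = mid          # feasible: try a smaller compute time
        else:
            lo = mid + 1      # infeasible: the slowest layer needs more time
    return lo


# Hypothetical workloads for three layers sharing seven compute units.
best_time = min_compute_time([120, 60, 30], total_units=7)
```

The search works because the number of units needed never increases as the target time grows, so feasibility is monotone in T.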

7D-2
TitleNet2: A Graph Attention Network Method Customized for Pre-Placement Net Length Estimation
Author*Zhiyao Xie (Duke University, USA), Rongjian Liang (Texas A&M University, USA), Xiaoqing Xu (ARM Inc, USA), Jiang Hu (Texas A&M University, USA), Yixiao Duan, Yiran Chen (Duke University, USA)
Pagepp. 671 - 677
KeywordPlacement, Net length, Machine learning, Physical-aware synthesis, Graph neural network
AbstractNet length is a key proxy metric for optimizing timing and power across various stages of a standard digital design flow. However, the bulk of net length information is not available until cell placement, and hence it is a significant challenge to explicitly consider net length optimization in design stages prior to placement, such as logic synthesis. This work addresses this challenge by proposing a customized graph attention network method, called Net2, to estimate individual net length before cell placement. Its accuracy-oriented version Net2a achieves about 15% better accuracy than several previous works in identifying both long nets and long critical paths. Its fast version Net2f is more than 1000× faster than placement while still outperforming previous works and other neural network techniques in terms of various accuracy metrics.

7D-3
TitleMachine Learning-based Structural Pre-route Insertability Prediction and Improvement with Guided Backpropagation
Author*Tao-Chun Yu, Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan), Hsien-Shih Chiu, Kai-Shun Hu, Chin-Hsiung Hsu (Synopsys Taiwan, Co., Ltd., Taiwan), Philip Hui-Yuh Tai (Synopsys, Inc., USA), Cindy Chin-Fang Shen (Synopsys Taiwan, Co., Ltd., Taiwan)
Pagepp. 678 - 683
Keywordstructural pre-route, machine learning, placement legalization
AbstractWith the development of semiconductor technology nodes, standard cells have become smaller and the number of standard cells has increased dramatically to bring more functionality into integrated circuits (ICs). However, the shrinking of standard cell sizes causes many problems in ICs, such as timing, power, and electromigration (EM) issues. To tackle these problems, a new style of pre-route, the structural pre-route (SPR), has been proposed. This type of pre-route is composed of redundant parallel metals and vias, so that the low resistance and the redundant sub-structures can improve timing and yield. However, the large area overhead is the major problem of inserting such pre-routes all over a design. In this paper, we propose a machine learning-based approach to predict the insertability of SPRs for placed designs. In addition, we apply a pattern visualization method using a guided backpropagation technique to look deeper into our model and identify the problematic layout features causing SPR insertion failures. The experimental results not only show the excellent performance of our model, but also show that avoiding the identified critical features during legalization can improve SPR insertability compared to a commercial SPR-aware placement tool.

7D-4
TitleStandard Cell Routing with Reinforcement Learning and Genetic Algorithm in Advanced Technology Nodes
Author*Haoxing Ren, Matthew Fojtik (Nvidia Corporation, USA)
Pagepp. 684 - 689
KeywordStandard Cell Layout, Routing, Reinforcement Learning, Genetic Algorithm, Advanced Geometry with Unidirectional Metal
AbstractStandard cell layout in advanced technology nodes is done manually in the industry today. Automating the standard cell layout process, in particular the routing step, is challenging because of the constraints imposed by an enormous number of design rules. In this paper we propose a machine learning based approach that applies a genetic algorithm to create initial routing candidates and uses reinforcement learning (RL) to fix the design rule violations incrementally. A design rule checker feeds the violations back to the RL agent, and the agent learns how to fix them based on the data. This approach is also applicable to future technology nodes with unseen design rules. We demonstrate the effectiveness of this approach on a number of standard cells. We show that it can route a cell that is deemed unroutable manually, reducing the cell size by 11%.


[To Session Table]

Session 7E  DNN-Based Physical Analysis and DNN Accelerator Design
Time: 15:00 - 15:30, Thursday, January 21, 2021
Location: Room 7E
Chairs: Fan Yang (Fudan University, China), Zuochang Ye (Tsinghua University, China)

7E-1
TitleThermal and IR Drop Analysis Using Convolutional Encoder-Decoder Networks
Author*Vidya A. Chhabria (University of Minnesota, USA), Vipul Ahuja, Ashwath Prabhu, Nikhil Patil, Palkesh Jain (Qualcomm Technologies Inc, India), Sachin S. Sapatnekar (University of Minnesota, USA)
Pagepp. 690 - 696
KeywordThermal, IR-drop, Convolutional neural networks, Encoder decoder networks, image-to-image translation
AbstractComputationally expensive temperature and power grid analyses are required during the design cycle to guide IC design. This paper employs encoder-decoder based generative (EDGe) networks to map these analyses to fast and accurate image-to-image and sequence-to-sequence translation tasks. The network takes a power map as input and outputs the corresponding temperature or IR drop map. We propose two networks: (i) ThermEDGe: a static and dynamic full-chip temperature estimator and (ii) IREDGe: a full-chip static IR drop predictor based on input power, power grid distribution, and power pad distribution patterns. The models are design-independent and must be trained just once for a particular technology and packaging solution. ThermEDGe and IREDGe are demonstrated to rapidly predict on-chip temperature and IR drop contours in milliseconds (in contrast with commercial tools that require several hours or more) and provide an average error of 0.6% and 0.008% respectively.
Slides

7E-2
TitleGRA-LPO: Graph Convolution Based Leakage Power Optimization
Author*Uday Mallappa, Chung-Kuan Cheng (University of California, San Diego, USA)
Pagepp. 697 - 702
KeywordLeakage Recovery, Gate Sizing, GCN, Machine Learning, Optimization
AbstractStatic power consumption is a critical challenge for IC designs, particularly for mobile and IoT applications. A final post-layout step in modern design flows involves a leakage recovery step that is embedded in signoff static timing analysis tools. The goal of such recovery is to make use of the positive slack (if any) and recover the leakage power by performing cell swaps with footprint compatible variants. Though such swaps result in unaltered routing, the hard constraint is not to introduce any new timing violations. This process can require up to tens of hours of runtime, just before the tapeout, when schedule and resource constraints are tightest. The physical design teams can benefit greatly from a fast predictor of the leakage recovery step: if the eventual recovery will be too small, the entire step can be skipped, and the resources can be allocated elsewhere. If we represent the circuit netlist as a graph with cells as vertices and nets connecting these cells as edges, the leakage recovery step is an optimization step on this graph. If we can learn these optimizations over several graphs with various logic-cone structures, we can generalize the learning to unseen graphs. Using graph convolution neural networks, we develop a learning-based model that predicts per-cell recoverable slack and translates these slack values to equivalent power savings. For designs of up to 1.6M instances, our inference step takes less than 12 seconds on a Tesla P100 GPU, with the additional feature extraction and post-processing steps consuming 420 seconds. The model is accurate, with relative error under 6.2% for the design-specific context.

7E-3
TitleDEF: Differential Encoding of Featuremaps for Low Power Convolutional Neural Network Accelerators
Author*Alexander Montgomerie-Corcoran, Christos-Savvas Bouganis (Imperial College London, UK)
Pagepp. 703 - 708
KeywordPower Optimisation, Activity Coding, Neural Networks
AbstractAs the need to deploy Deep Learning applications on edge-based devices becomes increasingly prominent, power consumption starts to become a limiting factor on the performance that can be achieved by the computational platforms. A significant source of power consumption for these edge-based machine learning accelerators is off-chip memory transactions. In the case of Convolutional Neural Network (CNN) workloads, a predominant workload in deep learning applications, those memory transactions are typically attributed to the store and recall of feature-maps. There is therefore a need to explicitly reduce the power dissipation of these transactions whilst minimising any overheads needed to do so. In this work, a Differential Encoding of Feature-maps (DEF) scheme is proposed, which aims at minimising activity on the memory data bus, specifically for CNN workloads. The coding scheme uses domain-specific knowledge, exploiting statistics of feature-maps alongside knowledge of the data types commonly used in machine learning accelerators as a means of reducing power consumption. DEF is able to outperform recent state-of-the-art coding schemes, with significantly less overhead, achieving up to 50% reduction of activity across a number of modern CNNs.
Slides
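
To make "activity on the memory data bus" from the 7E-3 abstract concrete, the following minimal sketch counts bit flips between consecutive bus words and compares raw values with a generic delta encoding. This is not the DEF scheme itself, whose details are not given in this listing; the 8-bit width, the sample values, and the function names are assumptions for illustration only.

```python
def bus_toggles(words):
    """Count bit flips between consecutive words driven on a data bus."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))


def delta_encode(values, width=8):
    """Send the first value as-is, then each difference modulo 2**width."""
    mask = (1 << width) - 1
    return [values[0] & mask] + [(b - a) & mask for a, b in zip(values, values[1:])]


def delta_decode(codes, width=8):
    """Invert delta_encode by accumulating the differences."""
    mask = (1 << width) - 1
    out = [codes[0]]
    for code in codes[1:]:
        out.append((out[-1] + code) & mask)
    return out


# Hypothetical slowly varying feature-map values (assumed 8-bit unsigned).
values = [100, 101, 102, 103, 104]
codes = delta_encode(values)
raw_activity = bus_toggles(values)    # bit flips when sending raw values
coded_activity = bus_toggles(codes)   # bit flips when sending differences
```

Whether a differential code lowers activity depends on the data: it helps here because neighbouring values are close, so the transmitted differences repeat.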

7E-4
TitleTemperature-Aware Optimization of Monolithic 3D Deep Neural Network Accelerators
Author*Prachi Shukla, Sean S. Nemtzow (Boston University, USA), Vasilis F. Pavlidis (The University of Manchester, UK), Emre Salman (Stony Brook University, USA), Ayse K. Coskun (Boston University, USA)
Pagepp. 709 - 714
KeywordMonolithic 3D, systolic arrays, temperature, energy efficiency
AbstractWe propose an automated method to facilitate the design of energy-efficient Mono3D DNN accelerators with safe on-chip temperatures for mobile systems. We introduce an optimizer to investigate the effect of different aspect ratios and footprint specifications of the chip, and select energy-efficient accelerators under user-specified thermal and performance constraints. We also demonstrate that using our optimizer, we can reduce energy consumption by 1.6x and area by 2x with a maximum of 9.5% increase in latency compared to a Mono3D DNN accelerator optimized only for performance.


[To Session Table]

Session 8A  (DF-3): Emerging Open Design Platform
Time: 15:30 - 16:00, Thursday, January 21, 2021
Location: Room 8A
Organizer/Chair: Yuji Ishikawa (Toshiba Corporation, Japan), Organizer: Noriyuki Miura (Osaka University, Japan)

8A-1
Title(Designers' Forum) Impact of Open Source and NDA-free on LSI Design and Fabrication
AuthorJunichi Akita (Kanazawa University, Japan)
KeywordOpen Design
AbstractThe continuous development of LSI technologies has been enabling a more sophisticated society based on advanced technologies such as AI and IoT. It has also been creating demand for more complicated, larger-scale LSI chips that integrate whole computing systems into a single chip, as well as for more complicated EDA tools. There has also been continuous development of open-source EDA tools in each layer of LSI design, from process simulation to high-level synthesis, and there are some tools that we can use for practical LSI designs. It is also notable that the development of LSI technologies requires us to sign an NDA to obtain the process design kit including the libraries, since the process technologies are highly confidential information. This NDA has also been restricting the potential applications and users of LSI technologies, while there are some recent activities using open-source, NDA-free technologies to form an emerging ecosystem of LSI industries. In this talk, I introduce the history and the recent trends of open-source, NDA-free LSI design tools and practices.

8A-2
Title(Designers' Forum) Simulated Bifurcation Algorithm for Large-scale Combinatorial Optimization
AuthorHayato Goto (Toshiba Corporation, Japan)
KeywordOpen Design
AbstractCombinatorial optimization problems appear in various situations, including circuit design. However, such problems are extremely hard because the number of solution candidates increases exponentially with problem size. For this reason, quantum computing is expected to be useful for combinatorial optimization. In 2019, we proposed a new quantum-inspired classical algorithm for combinatorial optimization, which we call “simulated bifurcation (SB).” Exploiting the high parallelizability of SB, our SB-based machine implemented with an FPGA or GPUs demonstrated high performance for large-scale problems. This result opens the possibility of accelerating combinatorial optimization with many-core processors.

8A-3
Title(Designers' Forum) Towards a Hardware Synthesis Environment from the Functional Language Elixir
AuthorHideki Takase (Kyoto University/JST PRESTO, Japan)
KeywordOpen Design
AbstractElixir is a functional language that runs on the Erlang VM. We suggest that Elixir descriptions have an affinity with the data flow hardware architecture. This talk presents a hardware synthesis method from Elixir descriptions. We propose a synthesis flow for the data flow architecture on an FPGA from a design description written in Elixir. Our method can synthesize a functionally equivalent circuit from the libraries for direct manipulation and parallel processing of data collections in Elixir. The data flow is constructed directly based on the pipe operator, which connects functions in the order in which they process the data. Our method can contribute to performance and power efficiency in servers that reconfigure their hardware as required by Elixir applications.


[To Session Table]

Session 8B  Embedded Neural Networks and File Systems
Time: 15:30 - 16:00, Thursday, January 21, 2021
Location: Room 8B
Chairs: Zili Shao (The Chinese University of Hong Kong, Hong Kong), Mohammad Al Faruque (University of California, Irvine, USA)

Best Paper Candidate
8B-1
TitleGravity: An Artificial Neural Network Compiler for Embedded Applications
Author*Tony Givargis (University of California, Irvine, USA)
Pagepp. 715 - 721
KeywordEmbedded Software, Compilers for Embedded Systems, Artificial Neural Networks, Design Automation
AbstractThis paper introduces the Gravity compiler. Gravity is an open source optimizing Artificial Neural Network (ANN) to ANSI C compiler with two unique design features that make it ideal for use in resource constrained embedded systems: (1) the generated ANSI C code is self-contained and devoid of any library or platform dependencies and (2) the generated ANSI C code is optimized for maximum performance and minimum memory usage. Moreover, Gravity is constructed as a modern compiler consisting of an intuitive input language, an expressive Intermediate Representation (IR), a mapping to a Fictitious Instruction Set Machine (FISM) and a retargetable backend, making it an ideal research tool for exploring high-performance embedded software strategies in AI and Deep-Learning applications. We validate the efficacy of Gravity by solving the MNIST handwritten digit recognition problem on an embedded device. We measured a 300x reduction in memory, 2.5x speedup in inference and 33% speedup in training compared to TensorFlow. We also outperformed TVM by over 2.4x in inference speed.

8B-2
TitleA Self-Test Framework for Detecting Fault-induced Accuracy Drop in Neural Network Accelerators
Author*Fanruo Meng, Fateme Hosseini, Chengmo Yang (University of Delaware, USA)
Pagepp. 722 - 727
KeywordNeural Network Accelerator, Fault tolerance, Reliability, Testing
AbstractHardware accelerators built with SRAM or emerging memory devices are essential to accommodating the ever-increasing Deep Neural Network (DNN) workloads on resource-constrained devices. After deployment, however, the performance of these accelerators is threatened by faults in their on-chip and off-chip memories, where millions of DNN weights are held. Different types of faults may exist depending on the underlying memory technology, degrading inference accuracy. To tackle this challenge, this paper proposes an online self-test framework that monitors the accuracy of the accelerator with a small set of test images selected from the test dataset. Upon detecting a noticeable accuracy drop, the framework uses additional test images to identify the corresponding fault type and predict the severity of the faults by analyzing the change in the ranking of the test images. Experimental results show that our method can quickly detect the fault status of a DNN accelerator and provide fault type and fault severity information, allowing for subsequent recovery and self-healing processes.

8B-3
TitleFacilitating the Efficiency of Secure File Data and Metadata Deletion on SMR-based Ext4 File System
Author*Ping-Xiang Chen (Department of Computer Science, National Tsing Hua University, Taiwan), Shuo-Han Chen (Department of Computer Science and Information Engineering, National Taipei University of Technology, Taiwan), Yuan-Hao Chang, Yu-Pei Liang (Institute of Information Science, Academia Sinica, Taiwan), Wei-Kuan Shih (Department of Computer Science, National Tsing Hua University, Taiwan)
Pagepp. 728 - 733
Keywordsecure deletion, shingled magnetic recording, metadata, ext4
AbstractThe efficiency of secure deletion is highly dependent on the data layout of underlying storage devices. In particular, owing to the sequential-write constraint of the emerging Shingled Magnetic Recording (SMR) technology, an improper data layout could lead to serious write amplification and hinder the performance of secure deletion. The performance degradation of secure deletion on SMR drives is further aggravated with the need to securely erase the file system metadata of deleted files due to the small-size nature of file system metadata. Such an observation motivates us to propose a secure-deletion and SMR-aware space allocation (SSSA) strategy to facilitate the process of securely erasing both the deleted files and their metadata simultaneously. The proposed strategy is integrated within the widely-used extended file system 4 (ext4) and is evaluated through a series of experiments to demonstrate the effectiveness of the proposed strategy.
Slides


[To Session Table]

Session 8C  (SS-6) Design Automation for Future Autonomy
Time: 15:30 - 16:00, Thursday, January 21, 2021
Location: Room 8C
Chair: Qi Zhu (Northwestern University, USA)

8C-1
Title(Invited Paper) Efficient Computing Platform Design for Autonomous Driving Systems
Author*Shuang Liang, Xuefei Ning, Jincheng Yu, Kaiyuan Guo (Department of Electronic Engineering, Tsinghua University, China), Tianyi Lu, Changcheng Tang (Novauto Co., Ltd., China), Shulin Zeng, Yu Wang (Department of Electronic Engineering, Tsinghua University, China), Diange Yang (School of Vehicle and Mobility, Tsinghua University, China), Huazhong Yang (Department of Electronic Engineering, Tsinghua University, China)
Pagepp. 734 - 741
Keywordautonomous driving, computing platform, neural networks, hardware accelerators
AbstractAutonomous driving is becoming a hot topic in both academic and industrial communities. Traditional algorithms can hardly accomplish the complex tasks and meet the high safety criteria. Recent research on deep learning shows significant performance improvement over traditional algorithms, and deep learning is believed to be a strong candidate for autonomous driving systems. Despite the attractive performance, deep learning does not solve the problem completely. The application scenario requires that an autonomous driving system work in real time to ensure safety, but the high computational complexity of neural network models, together with complicated pre-processing and post-processing, brings great challenges. System designers need to perform dedicated optimizations to build a practical computing platform for autonomous driving. In this paper, we introduce our work on efficient computing platform design for autonomous driving systems. At the software level, we introduce neural network compression and hardware-aware architecture search to reduce the workload. At the hardware level, we propose customized hardware accelerators for the pre- and post-processing of deep learning algorithms. Finally, we introduce the hardware platform design, NOVA-30, and our on-vehicle evaluation project.

8C-2
Title(Invited Paper) On Designing Computing Systems for Autonomous Vehicles: a PerceptIn Case Study
Author*Bo Yu (PerceptIn, USA), Jie Tang (South China University of Technology, China), Shaoshan Liu (PerceptIn, USA)
Pagepp. 742 - 747
Keywordautonomous vehicle, FPGA, localization
AbstractPerceptIn develops and commercializes autonomous vehicles for micromobility around the globe. This paper gives a holistic summary of PerceptIn’s development and operating experiences. It provides the business tale behind our product and presents the development of the computing system for our vehicles. We illustrate the design decisions made for the computing system and show the advantage of offloading localization workloads onto an FPGA platform.
Slides

8C-3
Title(Invited Paper) Runtime Software Selection for Adaptive Automotive Systems
AuthorChia-Ching Fu, Ben-Hau Chia, *Chung-Wei Lin (National Taiwan University, Taiwan)
Pagepp. 748 - 752
Keywordintelligent vehicles, over-the-air update, plug-and-play automotive systems, quality-of-service
AbstractAs automotive systems become more intelligent than ever, they need to handle many functional tasks, resulting in more and more software programs running in automotive systems. However, whether a software program should be executed depends on the environmental conditions (surrounding conditions). For example, a deraining algorithm supporting object detection and image recognition should only be executed when it is raining. Supported by the advance of over-the-air (OTA) updates and plug-and-play systems, adaptive automotive systems, where the software programs are updated, activated, and deactivated before driving and during driving, can be realized. In this paper, we consider the upcoming environmental conditions of an automotive system and target the corresponding software selection problem during runtime. We formulate the problem as a set cover problem with timing constraints and then propose a heuristic approach to solve the problem. The approach is very efficient so that it can be applied during runtime, and it is a preliminary step towards the broad realization of adaptive automotive systems.
Slides
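
The set cover formulation mentioned in the 8C-3 abstract can be illustrated with the classic greedy heuristic. This is a textbook sketch, not the authors' approach: the timing constraints from the abstract are omitted, and the environmental conditions and program names below are hypothetical.

```python
def greedy_set_cover(universe, candidates):
    """Pick candidate names until every element of `universe` is covered."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Choose the candidate covering the most still-uncovered elements.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            raise ValueError("remaining conditions cannot be covered")
        chosen.append(best)
        uncovered -= gained
    return chosen


# Hypothetical upcoming conditions and the conditions each program handles.
conditions = {"rain", "night", "fog", "highway"}
programs = {
    "derain": {"rain"},
    "low_light": {"night", "fog"},
    "lane_keep": {"highway"},
    "all_weather": {"rain", "fog"},
}
selection = greedy_set_cover(conditions, programs)
```

Each greedy step only scans the candidate list once, which is the kind of low cost that makes a heuristic usable at runtime; adding deadlines would require an extra feasibility check per candidate.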

8C-4
Title(Invited Paper) Safety-Assured Design and Adaptation of Learning-Enabled Autonomous Systems
Author*Qi Zhu, Chao Huang, Ruochen Jiao, Shuyue Lan, Hengyi Liang, Xiangguo Liu, Yixuan Wang, Zhilu Wang, Shichao Xu (Northwestern University, USA)
Pagepp. 753 - 760
KeywordAutonomous systems, Machine learning, Safety, Design automation
AbstractFuture autonomous systems will employ sophisticated machine learning techniques for the sensing and perception of the surroundings and for making corresponding decisions for planning, control, and other actions. They often operate in highly dynamic, uncertain and challenging environments, and need to meet stringent timing, resource, and mission requirements. In particular, it is critical and yet very challenging to ensure the safety of these autonomous systems, given the uncertainties of the system inputs, the constant disturbances on the system operations, and the lack of analyzability for many machine learning methods (particularly those based on neural networks). In this paper, we will discuss some of these challenges, and present our work in developing automated, quantitative, and formalized methods and tools for ensuring the safety of autonomous systems in their design and during their runtime adaptation. We argue that it is essential to take a holistic approach in addressing system safety and other safety-related properties, vertically across the functional, software, and hardware layers, and horizontally across the autonomy pipeline of sensing, perception, planning, and control modules. This approach could be further extended from a single autonomous system to a multi-agent system where multiple autonomous agents perform tasks in a collaborative manner. We will use connected and autonomous vehicles (CAVs) as the main application domain to illustrate the importance of such a holistic approach and show our initial efforts in this direction.
Slides


[To Session Table]

Session 8D  Emerging Hardware Verification
Time: 15:30 - 16:00, Thursday, January 21, 2021
Location: Room 8D
Chairs: He Li (University of Cambridge, UK), Daniel Grosse (Johannes Kepler University Linz, Austria)

8D-1
TitleSystem-Level Verification of Linear and Non-Linear Behaviors of RF Amplifiers using Metamorphic Relations
Author*Muhammad Hassan (DFKI GmbH / University of Bremen, Germany), Daniel Große (Johannes Kepler University Linz / DFKI GmbH, Austria), Rolf Drechsler (University of Bremen / DFKI GmbH, Germany)
Pagepp. 761 - 766
KeywordMetamorphic Testing, system-level verification, RF amplifiers, Analog/mixed signal, SystemC-AMS
AbstractSystem-on-Chips (SoC) have imposed new yet stringent design specifications on the Radio Frequency (RF) subsystems. The Timed Data Flow (TDF) model of computation available in SystemC-AMS offers here a good trade-off between accuracy and simulation-speed at the system-level. However, one of the main challenges in system-level verification is the availability of reference models traditionally used to verify the correctness of the Design Under Verification (DUV). Recently, Metamorphic testing (MT) introduced a new verification perspective in the software domain to alleviate this problem. MT uncovers bugs just by using and relating test-cases. In this paper, we present a novel MT-based verification approach to verify the linear and non-linear behaviors of RF amplifiers at the system-level. The central element of our MT-approach is a set of Metamorphic Relations (MRs) which describes the relation of the inputs and outputs of consecutive DUV executions. For the class of Low Noise Amplifiers (LNAs) we identify 12 high-quality MRs. We demonstrate the effectiveness of our proposed MT-based verification approach in an extensive set of experiments on an industrial system-level LNA model without the need of a reference model.
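
The idea of a metamorphic relation in 8D-1 can be sketched on a toy amplifier model. The example below is an illustration under assumptions: the `amplifier` function is a hypothetical stand-in for a DUV, and the scaling relation it checks is a generic linearity property, not necessarily one of the 12 MRs identified in the paper.

```python
import math


def amplifier(samples, gain=10.0, clip=None):
    """Toy amplifier: ideal linear gain, optionally followed by hard clipping."""
    out = [gain * s for s in samples]
    if clip is not None:
        out = [max(-clip, min(clip, v)) for v in out]
    return out


def scaling_relation_holds(dut, samples, k=2.0, tol=1e-9):
    """Metamorphic relation for linear behavior: dut(k * x) == k * dut(x).

    Only two executions of the same DUV are compared; no reference model
    of the expected output is needed.
    """
    base = dut(samples)
    scaled = dut([k * s for s in samples])
    return all(
        math.isclose(k * b, s, rel_tol=tol, abs_tol=tol)
        for b, s in zip(base, scaled)
    )


# Small sine-like stimulus (hypothetical test case).
stimulus = [0.01 * math.sin(0.1 * n) for n in range(50)]

linear_ok = scaling_relation_holds(amplifier, stimulus)
clipped_ok = scaling_relation_holds(lambda x: amplifier(x, clip=0.05), stimulus)
```

In this sketch the ideal amplifier satisfies the relation, while the variant that clips its output violates it, so the deviation from linear behavior is exposed purely by relating two runs.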

8D-2
TitleRandom Stimuli Generation for the Verification of Quantum Circuits
Author*Lukas Burgholzer, Richard Kueng, Robert Wille (Johannes Kepler University Linz, Austria)
Pagepp. 767 - 772
Keywordquantum computing, stimuli generation, verification of quantum circuits, simulative verification, emerging technologies
AbstractVerification of quantum circuits is essential for guaranteeing correctness of quantum algorithms and/or quantum descriptions across various levels of abstraction. In this work, we show that there are promising ways to check the correctness of quantum circuits using simulative verification and random stimuli. To this end, we investigate how to properly generate stimuli for efficiently checking the correctness of a quantum circuit. More precisely, we introduce, illustrate, and analyze three schemes for quantum stimuli generation, offering a trade-off between the error detection rate (as well as the required number of stimuli) and efficiency. In contrast to the verification in the classical realm, we show (both theoretically and empirically) that even if only a few randomly-chosen stimuli (generated from the proposed schemes) are considered, high error detection rates can be achieved for quantum circuits. The results of these conceptual and theoretical considerations have also been empirically confirmed, with a grand total of approximately 10^6 simulations conducted across 50 000 benchmark instances.
Slides

8D-3
TitleExploiting Extended Krylov Subspace for the Reduction of Regular and Singular Circuit Models
AuthorChrysostomos Chatzigeorgiou, Dimitrios Garyfallou, *George Floros, Nestor Evmorfopoulos, George Stamoulis (University of Thessaly, Greece)
Pagepp. 773 - 778
KeywordModel Order Reduction, Moment-Matching, Krylov Methods, Circuit Simulation
AbstractDuring the past decade, Model Order Reduction (MOR) has become a key enabler for the efficient simulation of large circuit models. MOR techniques based on moment-matching are well established due to their simplicity and computational performance in the reduction process. However, moment-matching methods based on the ordinary Krylov subspace are usually inadequate to accurately approximate the original circuit behavior. In this paper, we present a moment-matching method which is based on the extended Krylov subspace and exploits the superposition property in order to deal with many terminals. The proposed method can handle large-scale regular and singular circuits and generate accurate and efficient reduced-order models for circuit simulation. Experimental results on industrial IBM power grid benchmarks demonstrate that our method achieves an error reduction of up to 83.69% over a standard Krylov subspace technique.
Slides


[To Session Table]

Session 8E  Optimization and Mapping Methods for Quantum Technologies
Time: 15:30 - 16:00, Thursday, January 21, 2021
Location: Room 8E
Chairs: Debjyoti Bhattacharjee (imec, Belgium), Rudy Raymond H.P. (IBM Research - Tokyo, Japan)

8E-1
TitleAlgebraic and Boolean Optimization Methods for AQFP Superconducting Circuits
Author*Eleonora Testa, Siang-Yun Lee, Heinz Riener, Giovanni De Micheli (EPFL, Switzerland)
Pagepp. 779 - 785
KeywordAQFP, superconducting electronics, majority logic, logic synthesis
AbstractAdiabatic quantum-flux-parametron (AQFP) circuits are a family of superconducting electronic (SCE) circuits that have recently gained growing interest due to their low energy consumption, and may serve as an alternative technology to overcome the down-scaling limitations of CMOS. The logic design of AQFP circuits differs from classic digital design in many respects. For this reason, as of today, the design of complex AQFP circuits is still limited by the ability of design tools to efficiently take into account the different AQFP technology requirements. For instance, AQFP logic cells are abstracted by the majority operation, require data and clocking in specific timing windows and have fan-out limitations. In this work, we implement a novel majority-based logic synthesis flow addressing AQFP technology. In particular, we present both algebraic and Boolean methods over majority-inverter graphs (MIGs) aiming at optimizing size and depth of logic circuits. The technology limitations and constraints of the AQFP technology (e.g., path balancing and maximum fanout) are considered during the optimization steps. The experimental results show that our flow reduces both size and depth of MIGs, while meeting the constraints set by the AQFP technology. Further, we demonstrate an improvement for both area and delay when the MIGs are mapped into the AQFP technology.
Slides

8E-2
TitleDynamical Decomposition and Mapping of MPMCT Gates to Nearest Neighbor Architectures
Author*Atsushi Matsuo (IBM Quantum, IBM Research - Tokyo/Ritsumeikan University, Japan), Wakaki Hattori, Shigeru Yamashita (Ritsumeikan University, Japan)
Pagepp. 786 - 791
KeywordQuantum Circuit, Quantum Compiler, Mixed-Polarity Multiple-Control Toffoli (MPMCT) gate, Nearest Neighbor Architecture (NNA)
AbstractWe usually use Mixed-Polarity Multiple-Control Toffoli (MPMCT) gates to realize large control logic functions for quantum computation. A logic circuit consisting of MPMCT gates needs to be mapped to a quantum computing device that has some physical limitations: (1) we need to decompose MPMCT gates into one- or two-qubit gates, and then (2) we need to insert SWAP gates such that all the gates can be performed on Nearest Neighbor Architectures (NNAs). To date, the above two processes have been studied intensively but independently. This paper points out that we can decrease the total number of gates in a circuit if the above two processes are considered dynamically as a single step; we propose a method that inserts SWAP gates while decomposing MPMCT gates, unlike most of the existing methods. Our additional idea is to consider the effect on the latter part of a circuit carefully by considering the qubit layout when composing an MPMCT gate. We show some experimental results to confirm the effectiveness of our method.
Slides

8E-3
TitleExploiting Quantum Teleportation in Quantum Circuit Mapping
Author*Stefan Hillmich, Alwin Zulehner, Robert Wille (Institute for Integrated Circuits, Johannes Kepler University Linz, Austria)
Pagepp. 792 - 797
Keyworddesign automation, quantum computing, mapping, compilation, teleportation
AbstractQuantum computers are constantly growing in their number of qubits, but continue to suffer from restrictions such as the limited pairs of qubits that may interact with each other. Thus far, this problem has been addressed by mapping and moving qubits to suitable positions for the interaction (known as quantum circuit mapping). However, this movement requires additional gates to be incorporated into the circuit, whose number should be kept as small as possible since each gate increases the likelihood of errors and decoherence. State-of-the-art mapping methods utilize swapping and bridging to move the qubits along the static paths of the coupling map, solving this problem without exploiting all means the quantum domain has to offer. In this paper, we propose to additionally exploit quantum teleportation as a possible complementary method. Quantum teleportation conceptually allows moving the state of a qubit over arbitrarily long distances with constant overhead, providing the potential of determining cheaper mappings. The potential is demonstrated by a case study on the IBM Q Tokyo architecture which already shows promising improvements. With the emergence of larger quantum computing architectures, quantum teleportation will become more effective in generating cheaper mappings.
Slides


[To Session Table]

Session 9A  (DF-4): Technological Utilization in COVID-19 Pandemic
Time: 16:00 - 16:30, Thursday, January 21, 2021
Location: Room 9A
Organizer/Chair: Koichiro Yamashita (Fujitsu R&D Center Co., LTD., Japan)

9A-1
Title(Designers' Forum) ToF 3D Object Detection: Crowd Status Monitoring with Protected Privacy
AuthorWang Weihang, Ge Hao (Data Miracle Intelligent Technology Co.,Ltd., China), Liu Peilin (Shanghai Jiao Tong University, China)
KeywordCOVID
AbstractTo ensure the safety of the public during the outbreak of COVID-19, it is necessary to monitor crowd density and distribution in real time with ToF cameras. They are employed to achieve intelligent sensing and computing with protected privacy, such as the number of people in an area of interest, crowd density distribution, distance between people, and so on. In this report, research on and applications of head detection, people counting, and crowd density estimation using the ToF camera are introduced.

9A-2
Title(Designers' Forum) Smartphone App to Support Team Communication in Remote Work
AuthorSatomi Tsuji (Hitachi Ltd., Japan)
KeywordCOVID
AbstractDue to COVID-19, Hitachi Ltd. has introduced remote working across the company since April 2020. The concern there was the lack of spontaneous communication, especially chit-chat, which reduced teamwork, decision-making speed and creativity, and increased the stress risk of employees. We have conducted research on the utility of face-to-face communication and have confirmed that short conversations and an even distribution of speaking among members are effective in improving team productivity. Based on these findings, we created a smartphone app that includes a communication support and evaluation system. In this presentation, we will introduce the approach and results of our PoC with hundreds of employees called "Remote work together".

9A-3
Title(Designers' Forum) Re-ID Technology: Chase of the Contact Person with Protected Privacy
AuthorChang Zhigang, Zheng Shibao (Shanghai Jiao Tong University, China)
KeywordCOVID
AbstractPerson Re-ID plays an important role in multi-camera tracking, by matching paired person images across non-overlapping views. It protects privacy better than face recognition and considers more comprehensive factors than facial features, such as body shape, clothes, poses, and gaits. In this report, research on feature extraction, metric learning, solutions for difficulties in Re-ID, and semi-supervised/unsupervised Re-ID methods is introduced. The application of tracking contact persons during the COVID-19 epidemic is also discussed.

9A-4
Title(Designers' Forum) Actlyzer: Video-based Behavior Analysis Technology
AuthorToshiaki Wakama, Sho Iwasaki (Fujitsu Laboratories LTD., Japan)
KeywordCOVID
AbstractNew behaviors, such as social distancing and frequent hand washing, are required to prevent the spread of COVID-19. We developed "Actlyzer", a novel video-based behavioral analysis technology, which makes it possible to raise alerts on behaviors with a high risk of infection and to record properly performed hygiene management behaviors. Actlyzer can detect almost 100 basic human actions, which enables it to recognize more complex behaviors by defining them as combinations of the basic actions. For example, it is possible to detect two people facing each other for a long time and body contact such as hugging and handshaking. Moreover, Actlyzer can also recognize hand actions and can be used to judge whether hand washing is performed correctly. Even people unfamiliar with the procedure can wash their hands properly since it provides real-time guidance depending on their actions. We will make use of Actlyzer to support more effective measures against COVID-19 at medical sites, event sites, and restaurants.


[To Session Table]

Session 9B  Emerging System Architectures for Edge-AI
Time: 16:00 - 16:30, Thursday, January 21, 2021
Location: Room 9B
Chairs: Sai Manoj Pudukotai Dinakarrao (George Mason University, USA), Shinya Takamaeda-Yamazaki (The University of Tokyo, Japan)

Best Paper Candidate
9B-1
TitleHardware-Aware NAS Framework with Layer Adaptive Scheduling on Embedded System
Author*Chuxi Li, Xiaoya Fan, Shengbing Zhang (Northwestern Polytechnical University/Engineering Research Center of Embedded System Integration, Ministry of Education, China), Zhao Yang, Miao Wang (Northwestern Polytechnical University, China), Danghui Wang, Meng Zhang (Northwestern Polytechnical University/Engineering Research Center of Embedded System Integration, Ministry of Education, China)
Pagepp. 798 - 805
Keywordneural networks, NAS, hardware-aware, dataflow scheduling, embedded system
AbstractNeural Architecture Search (NAS) has been proven to be an effective solution for building Deep Convolutional Neural Network (DCNN) models automatically. Subsequently, several hardware-aware NAS frameworks incorporate hardware latency into the search objectives to avoid the potential risk that the searched network cannot be deployed on target platforms. However, the mismatch between NAS and hardware persists due to the absence of rethinking the applicability of the searched network layer characteristics and hardware resource allocation. A single convolution layer can be executed under various dataflows, to which the current hardware-aware NAS frameworks are insensitive. This ignorance results in significant performance degradation for some maladaptive layers obtained from NAS, which might achieve a much better latency when the adopted dataflow changes. Thus, the network latency is insufficient to evaluate the deployment efficiency. To address the issue, this paper proposes a novel hardware-aware NAS framework in consideration of the adaptability between layers and dataflow patterns. Besides, we develop an optimized layer adaptive data scheduling strategy as well as a coarse-grained reconfigurable computing architecture so as to deploy the searched networks with high energy-efficiency by selecting the most appropriate dataflow pattern layer-by-layer under limited resources. Evaluation results show that the NAS framework can obtain DCNNs of similar accuracy to the state-of-the-art ones and the proposed architecture can provide both power-efficiency improvement and energy consumption saving.
Slides

9B-2
TitleDataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package
Author*Robert Guirado (Universitat Politècnica de Catalunya, Spain), Hyoukjun Kwon (Georgia Institute of Technology, USA), Sergi Abadal, Eduard Alarcon (Universitat Politècnica de Catalunya, Spain), Tushar Krishna (Georgia Institute of Technology, USA)
Pagepp. 806 - 812
KeywordDNN accelerator, Wireless Network-on-Chip, Dataflow
AbstractDeep neural network (DNN) models continue to grow in size and complexity, demanding higher computational power to enable real-time inference. To efficiently deliver such computational demands, hardware accelerators are being developed and deployed across scales. This naturally requires an efficient scale-out mechanism for increasing compute density as required by the application. 2.5D integration over interposer has emerged as a promising solution, but as we show in this work, the limited interposer bandwidth and multiple hops in the Network-on-Package (NoP) can diminish the benefits of the approach. To cope with this challenge, we propose WIENNA, a wireless NoP-based 2.5D DNN accelerator. In WIENNA, the wireless NoP connects an array of DNN accelerator chiplets to the global buffer chiplet, providing high-bandwidth multicasting capabilities. Here, we also identify the dataflow style that most efficiently exploits the wireless NoP's high-bandwidth multicasting capability on each layer. With modest area and power overheads, WIENNA achieves 2.2X-5.1X higher throughput and 38.2% lower energy than an interposer-based NoP design.
Slides

9B-3
TitleBlock-Circulant Neural Network Accelerator Featuring Fine-Grained Frequency-Domain Quantization and Reconfigurable FFT Modules
Author*Yifan He, Jinshan Yue, Yongpan Liu, Huazhong Yang (Tsinghua University, China)
Pagepp. 813 - 818
KeywordArtificial neural networks, Acceleration, Quantization, Energy Efficiency
AbstractBlock-circulant based compression is a popular technique to accelerate neural network inference. Though storage and computing costs can be reduced by transforming weights into block-circulant matrices, this method incurs uneven data distribution in the frequency domain and imbalanced workload. In this paper, we propose RAB: a Reconfigurable Architecture Block-Circulant Neural Network Accelerator to solve the problems via two techniques. First, a fine-grained frequency-domain quantization is proposed to accelerate MAC operations. Second, a reconfigurable architecture is designed to transform FFT/IFFT modules into MAC modules, which alleviates the imbalanced workload and further improves efficiency. Experimental results show that RAB can achieve 1.9x/1.8x area/energy efficiency improvement compared with the state-of-the-art block-circulant compression based accelerator.
Slides
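
The reason FFT modules appear in block-circulant accelerators such as the one in 9B-3 is that multiplying a circulant matrix by a vector is a circular convolution, which can be computed in the frequency domain. The following is a minimal pure-Python sketch of that identity only; it assumes power-of-two block sizes, uses a textbook recursive FFT, and says nothing about RAB's quantization or reconfigurable hardware.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def circulant_matvec_fft(c, x):
    """y = C x, where C is the circulant matrix whose first column is c."""
    n = len(c)
    # Element-wise product in the frequency domain == circular convolution.
    prod = [a * b for a, b in zip(fft(c), fft(x))]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v.real / n for v in fft(prod, inverse=True)]


def circulant_matvec_direct(c, x):
    """Reference O(n^2) product using C[i][j] = c[(i - j) mod n]."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


c = [1.0, 2.0, 3.0, 4.0]   # first column of one circulant block
x = [1.0, 0.0, -1.0, 2.0]  # input activations for that block
y_fft = circulant_matvec_fft(c, x)
y_ref = circulant_matvec_direct(c, x)
```

Only the n entries of `c` need to be stored per block instead of n² weights, and the element-wise product replaces the dense multiply, which is where the storage and compute savings mentioned in the abstract come from.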

9B-4
TitleBatchSizer: Power-Performance Trade-off for DNN Inference
Author*Seyed Morteza Nabavinejad (Institute for Research in Fundamental Sciences (IPM), Iran), Sherief Reda (Brown University, USA), Masoumeh Ebrahimi (KTH Royal Institute of Technology, Sweden)
Pagepp. 819 - 824
Keywordbatch size, inference, deep neural networks, power cap
AbstractGPU accelerators can deliver significant improvement for DNN processing; however, their performance is limited by internal and external parameters. A well-known parameter that restricts the performance of various computing platforms in real-world setups, including GPU accelerators, is the power cap, usually imposed by an external power controller. A common approach to meet the power cap constraint is using the Dynamic Voltage Frequency Scaling (DVFS) technique. However, the functionality of this technique is limited and platform-dependent. To improve the performance of DNN inference on GPU accelerators, we propose a new control knob, which is the size of input batches fed to the GPU accelerator in DNN inference applications. After evaluating the impact of this control knob on power consumption and performance of GPU accelerators and DNN inference applications, we introduce the design and implementation of a fast and lightweight runtime system, called BatchSizer. This runtime system leverages the new control knob for managing the power consumption of GPU accelerators in the presence of the power cap. Conducting several experiments using a modern GPU and several DNN models and input datasets, we show that our BatchSizer can significantly surpass the conventional DVFS technique regarding performance (up to 29%), while successfully meeting the power cap.
Slides


[To Session Table]

Session 9C  (SS-7) Cutting-Edge EDA Techniques for Advanced Process Technologies
Time: 16:00 - 16:30, Thursday, January 21, 2021
Location: Room 9C
Chairs: Wenjian Yu (Tsinghua University, China), Lifeng Wu (Huada Empyrean Inc., China)

9C-1
Title(Invited Paper) Deep Learning for Mask Synthesis and Verification: A Survey
Author*Yibo Lin (Peking University, China)
Pagepp. 825 - 832
Keywordlithography modeling, mask optimization, hotspot detection, deep learning
AbstractAchieving lithography compliance is increasingly difficult in advanced technology nodes. Due to complicated lithography modeling and long simulation cycles, verifying and optimizing photomasks becomes extremely expensive. To speed up design closure, deep learning techniques have been introduced to enable data-assisted optimization and verification. Such approaches have demonstrated promising results with high solution quality and efficiency. Recent research efforts show that learning-based techniques can accomplish more and more tasks, from classification and simulation to optimization. In this paper, we will survey the successful attempts at advancing mask synthesis and verification with deep learning and highlight the domain-specific learning techniques. We hope this survey can shed light on future development of learning-based design automation methodologies.

9C-2
Title(Invited Paper) Physical Synthesis for Advanced Neural Network Processors
Author*Zhuolun He, Peiyu Liao, Siting Liu, Yuzhe Ma (The Chinese University of Hong Kong, Hong Kong), Yibo Lin (Peking University, China), Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 833 - 840
KeywordPhysical Synthesis, Neural Network Processor, Datapath, Placement
AbstractThe remarkable breakthroughs in deep learning have led to a dramatic thirst for computational resources to tackle interesting real-world problems. Various neural network processors have been proposed for the purpose, yet, far fewer discussions have been made on the physical synthesis for such specialized processors, especially in advanced technology nodes. In this paper, we review several physical synthesis techniques for advanced neural network processors. We especially argue that datapath design is an essential methodology in the above procedures due to the organized computational graph of neural networks. As a case study, we investigate a wafer-scale deep learning accelerator placement problem in detail.
Slides

9C-3
Title(Invited Paper) Advancements and Challenges on Parasitic Extraction for Advanced Process Technologies
Author*Wenjian Yu, Mingye Song, Ming Yang (Tsinghua University, China)
Pagepp. 841 - 846
KeywordAdvanced Process Technology
Abstractmore complicated and design margin is shrinking, accurate parasitic extraction during IC design is largely demanded. In this invited paper, we survey the recent advancements on parasitic extraction techniques, especially those enhancing the floating random walk based capacitance solver and incorporating machine learning methods. The work dealing with process variation is also addressed. After that, we briefly discuss the challenges for capacitance extraction under advanced process technologies, including manufacture-aware geometry variations and accurate modeling of FinFETs, etc.
Slides


[To Session Table]

Session 9D  (SS-8) Robust and Reliable Memory Centric Computing at Post-Moore
Time: 16:00 - 16:30, Thursday, January 21, 2021
Location: Room 9D
Chairs: Grace Li Zhang (Technical University of Munich, Germany), Cheng Zhuo (Zhejiang University, China)

9D-1
Title(Invited Paper) Reliability-Aware Training and Performance Modeling for Processing-In-Memory Systems
Author*Hanbo Sun, Zhenhua Zhu, Yi Cai, Shulin Zeng, Kaizhong Qiu, Yu Wang, Huazhong Yang (Tsinghua University, China)
Pagepp. 847 - 852
KeywordPIM accelerator, Reliability-Aware Training, Performance Modeling
AbstractMemristor based Processing-In-Memory (PIM) systems give alternative solutions to boost the computing energy efficiency of Convolutional Neural Network (CNN) based algorithms. However, Analog-to-Digital Converters' (ADCs) high interface costs and the limited size of the memristor crossbars make it challenging to map CNN models onto PIM systems with both high accuracy and high energy efficiency. Besides, it takes a long time to simulate the performance of large-scale PIM systems, resulting in unacceptable development time for the PIM system. To address these problems, we propose a reliability-aware training framework and a behavior-level modeling tool (MNSIM 2.0) for PIM accelerators. The proposed reliability-aware training framework, containing network splitting/merging analysis and a PIM-based non-uniform activation quantization scheme, can improve the energy efficiency by reducing the ADC resolution requirements in memristor crossbars. Moreover, MNSIM 2.0 provides a general modeling method for PIM architecture design and computation data flow; it can evaluate both accuracy and hardware performance within a short time. Experiments based on MNSIM 2.0 show that the reliability-aware training framework can improve the energy efficiency of PIM accelerators by 3.4× with little accuracy loss. The equivalent energy efficiency is 9.02 TOPS/W, nearly 2.6∼4.2× that of the existing work. We also evaluate more case studies of MNSIM 2.0, which help us balance the trade-off between accuracy and hardware performance.

9D-2
Title(Invited Paper) Robustness of Neuromorphic Computing with RRAM-based Crossbars and Optical Neural Networks
Author*Grace Li Zhang, Bing Li, Ying Zhu (Technical University of Munich, Germany), Tianchen Wang, Yiyu Shi (University of Notre Dame, USA), Xunzhao Yin, Cheng Zhuo (Zhejiang University, China), Huaxi Gu (Xidian University, China), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Ulf Schlichtmann (Technical University of Munich, Germany)
Pagepp. 853 - 858
KeywordNeuromorphic computing, RRAM crossbar, optical neural networks, robustness, variations and noise
AbstractRRAM-based crossbars and optical neural networks are attractive platforms to accelerate neuromorphic computing. However, both accelerators suffer from hardware uncertainties such as process variations. If these uncertainty issues are left unaddressed, the inference accuracy of these computing platforms can degrade significantly. In this paper, a statistical training method where weights under process variations and noise are modeled as statistical random variables is presented. To incorporate these statistical weights into training, the computations in neural networks are modified accordingly. For optical neural networks, we modify the cost function during software training to reduce the effects of process variations and thermal imbalance. In addition, the residual effects of process variations are extracted and calibrated in hardware test, and thermal variations on devices are also compensated in advance. Simulation results demonstrate that the inference accuracy can be improved significantly under hardware uncertainties for both platforms.
Slides

9D-3
Title(Invited Paper) Uncertainty Modeling of Emerging Device based Computing-in-Memory Neural Accelerators with Application to Neural Architecture Search
Author*Zheyu Yan (University of Notre Dame, USA), Da-Cheng Juan (National Tsing Hua University, Taiwan), Xiaobo Juan, Yiyu Shi (University of Notre Dame, USA)
Pagepp. 859 - 864
KeywordNAS, RRAM
AbstractEmergent Device based Computing-in-memory (CiM) has been proved to be a promising candidate for high energy efficiency deep neural network (DNN) computations. However, most emergent devices suffer from uncertainty issues, resulting in a difference between the actual data stored and the weight value it is designed to be. This leads to an accuracy drop from trained models to actually deployed platforms. In this work, we offer a thorough analysis of the effect of such uncertainty-induced changes in DNN models. To reduce the impact of device uncertainties, we propose UAE, an uncertainty-aware Neural Architecture Search scheme to identify a DNN model that is both accurate and robust against device uncertainties.

9D-4
Title(Invited Paper) A Physical-Aware Framework for Memory Network Design Space Exploration
AuthorTianhao Shen, Di Gao, Li Zhang (Zhejiang University, China), Jishen Zhao (University of California, San Diego, USA), *Cheng Zhuo (Zhejiang University, China)
Pagepp. 865 - 871
Keywordmemory networks, router, Physical-Aware
AbstractIn the era of big data, there have been growing demands for server memory capacity and performance. A memory network is a promising alternative to provide high bandwidth and low latency through distributed memory nodes connected by high-speed interconnect. However, most existing memory networks are designed purely at the logic level and ignore the physical impact of network interconnect latency, processor placement and the interplay between processor and memory. In this work, we propose a Physical-Aware framework for memory network design space exploration, which facilitates the design of an energy-efficient and physical-aware memory network system. Experimental results on various workloads show that the proposed framework can help customize network topology with significant improvements on various design metrics when compared to other commonly used topologies.


[To Session Table]

Session 9E  Design for Manufacturing and Soft Error Tolerance
Time: 16:00 - 16:30, Thursday, January 21, 2021
Location: Room 9E
Chairs: Yongfu Li (Shanghai Jiao Tong University, China), Muhammad Shafique (NYU Abu Dhabi, United Arab Emirates)

9E-1
TitleManufacturing-Aware Power Staple Insertion Optimization by Enhanced Multi-Row Detailed Placement Refinement
Author*Yu-Jin Xie, Kuan-Yu Chen, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Pagepp. 872 - 877
Keywordstaple, IR-drop, DP, optimization
AbstractPower staple insertion is a new methodology for IR drop mitigation in advanced technology nodes. Detailed placement refinement which perturbs an initial placement slightly is an effective strategy to increase the success rate of power staple insertion. We are the first to address the manufacturing-aware power staple insertion optimization problem by triple-row placement refinement. We present a correct-by-construction approach based on dynamic programming to maximize the total number of legal power staples inserted subject to the design rule for 1D patterning. Instead of using a multi-dimensional array which incurs huge space overhead, we show how to construct a directed acyclic graph (DAG) on the fly efficiently to implement the dynamic program for multi-row optimization in order to conserve memory usage. The memory usage can thus be reduced by a few orders of magnitude in practice.

9E-2
TitleA Hierarchical Assessment Strategy on Soft Error Propagation in Deep Learning Controller
Author*Ting Liu, Yuzhuo Fu, Yan Zhang, Bin Shi (Shanghai Jiao Tong University, China)
Pagepp. 878 - 884
Keyworddeep learning controller, soft error, fault propagation, fault diagnosis, recurrent neural networks
AbstractDeep learning techniques have been introduced into the field of intelligent controller design in recent years and have become an effective alternative in complex control scenarios. In addition to improving control robustness, deep learning controllers (DLCs) also provide potential fault tolerance to internal disturbances (such as soft errors) due to the inherently redundant structure of deep neural networks (DNNs). In this paper, we propose a hierarchical assessment to characterize the impact of soft errors on the dependability of a PID controller and its DLC alternative. Single-bit-flip injections in the underlying hardware and time series data collection from multiple abstraction layers (ALs) are performed on a virtual prototype system based on an ARM Cortex-A9 CPU, with a PID controller and a corresponding DLC, implemented as a recurrent neural network (RNN), deployed on it. We employ Generative Adversarial Networks and Bayesian Networks to characterize the local and global dependencies caused by soft errors across the system. By analyzing cross-AL fault propagation paths and component sensitivities, we discover that the parallel data processing pipelines and the regular feature size scaling mechanism in the DLC can effectively prevent critical failure-causing faults from propagating to the control output.
Slides
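
As a rough illustration of the single-bit-flip fault model mentioned in the 9E-2 abstract, the following is a minimal sketch that inverts one bit in the IEEE 754 single-precision encoding of a value. It is not the authors' injection framework, which operates on a virtual prototype of the hardware; the function name `flip_bit_float32` and the choice of a 32-bit float as the fault target are assumptions made for this example.

```python
import struct


def flip_bit_float32(value: float, bit: int) -> float:
    """Return `value` with one bit of its float32 encoding inverted.

    Hypothetical helper for illustration only: bit 0 is the least
    significant mantissa bit, bit 31 is the sign bit.
    """
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range [0, 31]")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    # XOR with a one-hot mask inverts exactly the selected bit.
    (faulty,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return faulty


if __name__ == "__main__":
    # Show how the position of the upset changes the size of the error.
    for bit in (0, 23, 30, 31):
        print(bit, flip_bit_float32(1.0, bit))
```

The effect of a flip depends strongly on its position: inverting the sign bit turns 1.0 into -1.0, while inverting the lowest mantissa bit changes the value only marginally. This is one reason an assessment of this kind tracks how individual faults propagate rather than treating all bit flips alike.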

9E-3
TitleAttacking a CNN-based Layout Hotspot Detector Using Group Gradient Method
Author*Haoyu Yang, Shifan Zhang (Chinese University of Hong Kong, Hong Kong), Kang Liu (New York University, USA), Siting Liu (Chinese University of Hong Kong, Hong Kong), Benjamin Tan, Ramesh Karri, Siddharth Garg (New York University, USA), Bei Yu, Evangeline F.Y. Young (Chinese University of Hong Kong, Hong Kong)
Pagepp. 885 - 891
KeywordLithography Hotspot Detection, Adversarial Attack, Deep Learning
AbstractDeep neural networks are being used in disparate VLSI design automation tasks, including layout printability estimation, mask optimization, and routing congestion analysis. Preliminary results show the power of deep learning as an alternate solution in state-of-the-art design and sign-off flows. However, deep learning is vulnerable to adversarial attacks. In this paper, we examine the risk of state-of-the-art deep learning-based layout hotspot detectors under practical attack scenarios. We show that legacy gradient-based attacks do not adequately consider the design rule constraints. We present an innovative adversarial attack formulation to attack the layout clips and propose a fast group gradient method to solve it. Experiments show that the attack can deceive the deep neural networks using small perturbations in clips which preserve layout functionality while meeting the design rules.

9E-4
TitleBayesian Inference on Introduced General Region: An Efficient Parametric Yield Estimation Method for Integrated Circuits
Author*Zhengqi Gao, Zihao Chen, Jun Tao, Yangfeng Su (Fudan University, China), Dian Zhou (University of Texas at Dallas, USA), Xuan Zeng (Fudan University, China)
Pagepp. 892 - 897
Keywordparametric yield estimation, Bayesian Inference, Machine Learning
AbstractIn this paper, we propose an efficient parametric yield estimation method based on Bayesian Inference. Observing that analog and mixed-signal circuits are now designed via a multi-stage flow, and that the circuit performance correlation between the early stage and the late stage is naturally symmetrical, we introduce a general region to capture the common features of the early and late stages. Meanwhile, two private regions are incorporated to represent the unique features of these two stages respectively. Afterwards, we introduce classifiers, one for each region, to explicitly encode the correlation information. Next, we set up a graphical model and adopt Bayesian Inference to calculate the model parameters. Finally, based on the obtained optimal model parameters, we can accurately and efficiently estimate the parametric yield with a simple sampling method. Our numerical experiments demonstrate that, compared to state-of-the-art algorithms, our proposed method estimates the yield more accurately while significantly reducing the number of circuit simulations.
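
For readers unfamiliar with the term, parametric yield is the probability that a circuit meets its performance specification under process variation. The sketch below shows only the plain Monte Carlo baseline (draw parameter samples, evaluate, count passes), not the Bayesian Inference method of 9E-4. The toy performance model, the specification threshold, and all names (`estimate_yield`, `toy_gain`, `meets_spec`, `sample_params`) are assumptions for illustration.

```python
import random


def estimate_yield(performance, spec_ok, sample_params, n_samples=10_000, seed=0):
    """Estimate yield as the fraction of sampled parameter sets that pass the spec."""
    rng = random.Random(seed)  # local generator keeps the estimate reproducible
    passed = 0
    for _ in range(n_samples):
        params = sample_params(rng)
        if spec_ok(performance(params)):
            passed += 1
    return passed / n_samples


def sample_params(rng):
    # Hypothetical process variation: two independent standard normal parameters.
    return (rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))


def toy_gain(params):
    # Hypothetical stand-in for a circuit simulation: gain in dB, linear in the parameters.
    return 40.0 + 2.0 * params[0] - 1.0 * params[1]


def meets_spec(gain_db):
    # Hypothetical specification: the gain must be at least 38 dB.
    return gain_db >= 38.0


if __name__ == "__main__":
    print(estimate_yield(toy_gain, meets_spec, sample_params))
```

In a real flow each call to `performance` would be a circuit simulation, so the number of samples dominates the cost. Methods such as the one in this paper aim to reach comparable accuracy with far fewer simulations.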

9E-5
TitleAnalog IC Aging-induced Degradation Estimation via Heterogeneous Graph Convolutional Networks
Author*Tinghuan Chen, Qi Sun (The Chinese University of Hong Kong, Hong Kong), Canhui Zhan, Changze Liu, Huatao Yu (Hisilicon Technologies Co., China), Bei Yu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 898 - 903
KeywordAging, graph convolutional network, heterogeneous
AbstractWith continued scaling, transistor aging induced by Hot Carrier Injection and Bias Temperature Instability causes increasing failure of nanometer-scale integrated circuits (ICs). In this paper, we propose a heterogeneous graph convolutional network (H-GCN) to quickly and accurately estimate aging-induced transistor degradation in post-layout netlists of analog ICs. To characterize the multi-typed devices and connection ports, a heterogeneous directed multigraph is adopted to efficiently represent circuit netlists. An embedding generation algorithm is developed to aggregate information from the node itself and its multi-typed neighboring nodes through multi-typed edges in our proposed H-GCN. Our proposed H-GCN can replace static aging analysis, and it estimates transistor degradation more accurately than traditional static aging analysis. We conduct experiments on very advanced 5nm industrial designs to show that, compared with traditional machine learning methods and the typical graph convolutional network (GCN), the proposed H-GCN achieves more accurate estimates of aging-induced transistor degradation. Compared with an industrial dynamic aging design-for-reliability tool, our proposed H-GCN achieves a 24.623x speedup on average.
Slides