The 22nd Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule


Tuesday, January 17, 2017

Room 103   Room 102   Room 104   Room 105
1K  (International Conference Room)
Opening & Keynote I

8:30 - 10:35
Coffee Break
10:35 - 11:05
1S  University Design Contest
11:05 - 12:20
1A  Design Assurance and Reliability
11:05 - 12:20
1B  New Frontiers of Hardware Accelerator Synthesis
11:05 - 12:20
1C  Analysis Techniques for Reliability and Manufacturability
11:05 - 12:20
Lunch Break
12:20 - 13:50
2S  (Special Session) Neuromorphic Computing and Low-Power Image Recognition
13:50 - 15:30
2A  System-level Techniques for Energy and Performance Optimization
13:50 - 15:30
2B  Pushing the Limits of Logic Synthesis
13:50 - 15:30
2C  Design Techniques for Reliability Enhancement
13:50 - 15:30
Coffee Break
15:30 - 15:50
3S  (Special Session) Let's Secure the Physics of Cyber-Physical Systems
15:50 - 17:30
3A  Novel Techniques to Improve the Simulation Performance
15:50 - 17:30
3B  Formal and Informal Verification
15:50 - 17:30
3C  Pursuing System to Circuit Level Optimality in Timing and Power Integrity
15:50 - 17:30



Wednesday, January 18, 2017

Room 103   Room 102   Room 104   Room 105
2K  (International Conference Room)
Keynote II

9:00 - 9:50
Coffee Break
9:50 - 10:15
4S  (Special Session) Emerging Technologies for Biomedical Applications: Artificial Vision Systems and Brain Machine Interface
10:15 - 12:15
4A  Power and Thermal Management
10:15 - 12:20
4B  Emerging Topics in Hardware Security
10:15 - 12:20
4C  Manufacturability and Emerging Techniques
10:15 - 12:20
Lunch Break
12:20 - 13:50
5S  (Designers' Forum) Advanced Devices and Networks for IoT Applications
13:50 - 15:30
5A  Approximate Computation for Energy Efficiency
13:50 - 15:30
5B  Advance Test and Fault Tolerant Technologies
13:50 - 15:30
5C  Advanced Placement and Routing Techniques
13:50 - 15:30
Coffee Break
15:30 - 15:50
6S  (Designers' Forum) Panel Discussion: What is future AI we will create ? - "Doraemon" or "Terminator" ? -
15:50 - 17:30
6A  Recent Advances in Circuit Simulation and Optimization
15:50 - 17:30
6B  Application-Aware Embedded Architecture Design
15:50 - 17:30
6C  Advances in Microfluidic Biochips
15:50 - 17:30
Banquet (Convention Hall A)
18:00 - 20:00



Thursday, January 19, 2017

Room 103   Room 102   Room 104   Room 105
3K  (International Conference Room)
Keynote III

9:00 - 9:50
Coffee Break
9:50 - 10:15
7S  (Special Session) When Backend Meets Frontend: Cross-Layer Design & Optimization for System Robustness
10:15 - 12:15
7A  NVM/Flash: From Advanced Storage Design to Emerging Applications
10:15 - 12:20
7B  Hardware Diversity and Hardware Trojan
10:15 - 12:20
7C  Hardware Accelerator for Emerging Applications
10:15 - 12:20
Lunch Break
12:20 - 13:50
8S  (Designers' Forum) Advanced Automotive Security
13:50 - 15:30
8A  Scheduling, Resource Management, and Simulation for Multi-Core Systems
13:50 - 15:30
8B  Machine Learning: Acceleration and Application
13:50 - 15:30
8C  Design Automation and Modeling for Emerging Technologies
13:50 - 15:30
Coffee Break
15:30 - 15:50
9S  (Designers' Forum) Advanced Image Sensing and Processing
15:50 - 17:30
9A  New Directions in Networks on Chip
15:50 - 17:30
9B  Memory Architecture: Now and Future
15:50 - 17:30
9C  Intelligent Computing with Memristor Technologies
15:50 - 17:30


List of papers

Tuesday, January 17, 2017

Session 1K  Opening & Keynote I
Time: 8:30 - 10:35 Tuesday, January 17, 2017
Location: International Conference Room

1K-1
Title(Keynote Address) In Memory of Edward J. McCluskey: The Next Wave of Pioneering Innovations
AuthorOrganizers/Chairs: Subhasish Mitra (Stanford University, U.S.A.), Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.)
Pagep. 1
AbstractThis special plenary session will celebrate Prof. McCluskey (who passed away in 2016) through three keynote speeches by world-renowned scholars on the next wave of pioneering innovations, starting with a memorial speech by Prof. Jacob Abraham of University of Texas at Austin.

1K-2
Title(Keynote Address) Heterogeneous Integration of X-tronics: Design Automation and Education
AuthorK.-T. Tim Cheng (Hong Kong University of Science and Technology, Hong Kong)
Pagep. 2
AbstractAdvances in photonics, flexible electronics, emerging memories, etc. and Si electronics’ integration with these devices have enabled new classes of integrated circuits and systems with enhanced functionality, higher performance, or lower power consumption. Driving greater integration of such heterogeneous X-tronics can facilitate the continued proliferation of low-cost micro-/nano-systems for a wide range of applications. However, achieving their large-scale integration will require design ecosystem and design automation tools/methodologies much like those that enabled electronic integration in previous decades. In this talk, I will briefly introduce two recent Manufacturing Innovation Institutes, on Integrated Photonics and on Flexible Hybrid Electronics respectively, and a research center on developing 3D Hybrid CMOS-memristor circuits, which bring together academia, industry, and federal partners to increase U.S. manufacturing competitiveness in these areas. I will then focus on their design automation efforts and highlight the needs, challenges and opportunities of developing a robust design ecosystem for X-tronics integration. I will also share the educational challenges of talent development for X-tronics design automation.

1K-3
Title(Keynote Address) Electronics for the Human Body
AuthorJohn Rogers (Northwestern University, U.S.A.)
Pagep. 3
AbstractBiology is soft, curvilinear and transient; modern semiconductor technologies are rigid, planar and everlasting. Electronic and optoelectronic systems that eliminate this profound mismatch in properties create opportunities for devices that can intimately integrate with the body, for diagnostic, therapeutic or surgical function with important, unique capabilities in biomedical research and clinical healthcare. Over the last decade, a convergence of new concepts in mechanical engineering, materials science, electrical engineering and advanced manufacturing has led to the emergence of diverse, novel classes of 'biocompatible' electronic platforms. This talk describes the key ideas, with examples ranging from wireless, skin-like electronic 'tattoos' for continuous monitoring of physiological health, to multiplexed, conformal sensor sheets for mapping cardiac electrophysiology, to bioresorbable intracranial sensors for treating traumatic brain injury.

1K-4
Title(Keynote Address) Design of Society: Beyond Digital System Design
AuthorHiroto Yasuura (Kyushu University, Japan)
Pagep. 4
AbstractThe progress of digital system design and production technologies has produced social innovation by Information Communication Technology (ICT). Most social systems and our daily lives are fully supported by ICT. The progress has been accelerated exponentially and destructive innovations have occurred in various fields in industries and societies. Governments emphasize Industry 4.0 or Society 5.0 and people are looking for new businesses with IoT and AI with Big Data. In this talk, I will look back on the growth of ICT and look forward to future society which we will create using ICT. We can say that design technology of digital systems is now expanding to design of societies.


Session 1S  University Design Contest
Time: 11:05 - 12:20 Tuesday, January 17, 2017
Location: Room 103
Chairs: Noriyuki Miura (Kobe University, Japan), Hiroyuki Ito (Tokyo Institute of Technology, Japan)

1S-1 (Time: 11:05 - 11:08)
TitleW-Band Ultra-High Data-Rate 65nm CMOS Wireless Transceiver
Author*Korkut Kaan Tokgoz, Shotaro Maki, Seitarou Kawai, Noriaki Nagashima (Tokyo Institute of Technology, Japan), Yoichi Kawano, Toshihide Suzuki, Taisuke Iwai (Fujitsu Laboratories, Japan), Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Pagepp. 5 - 6
KeywordCMOS, ultra-high data-rate, millimeter-wave, wireless, frequency-interleave
AbstractA W-band ultra-high data-rate (56Gb/s) wideband (68-102GHz) 65nm bulk CMOS wireless transceiver is presented. Frequency interleaving, in which two IF data signals are up-converted to W-band using 68 and 102GHz LOs, is applied to achieve wideband operation and world-record 56Gb/s wireless communication on CMOS. 16QAM modulation is used for the 6.5GHz (26Gb/s) low-band and 7.5GHz (30Gb/s) high-band data. The transmitter/receiver (TX/RX) consumes 260/300mW from a 1V DC supply.

1S-2 (Time: 11:08 - 11:11)
TitleAn Image Sensor/Processor 3D Stacked Module Featuring ThruChip Interfaces
Author*Masayuki Ikebe, Tetsuya Asai, Masafumi Mori, Toshiyuki Itou, Daisuke Uchida (Hokkaido University, Japan), Yasuhiro Take, Tadahiro Kuroda (Keio University, Japan), Masato Motomura (Hokkaido University, Japan)
Pagepp. 7 - 8
KeywordTCI, 3D IC, Imager, computational imaging
AbstractA 1,000 fps motion vector estimation and classification engine for high-speed computational imaging in a 3D stacked imager/processor module is proposed, prototyped, assembled, and tested. The module features 1) ThruChip interfaces for high fps image transfer, 2) a motion vector estimation architecture that is orders of magnitude more area/power efficient than conventional ones, and 3) a cognitive classification scheme employed on motion vector patterns, enabling the classification of moving objects not possible in conventional proposals.

1S-3 (Time: 11:11 - 11:14)
TitleA 686Mbps 1.85mm2 Near-Optimal Symbol Detector for Spatial Modulation MIMO Systems in 0.18μm CMOS
Author*Hye-Yeon Yoon, Gwang-Ho Lee, Tae-Hwan Kim (Korea Aerospace University, Republic of Korea)
Pagepp. 9 - 10
KeywordSpatial modulation, MIMO, Detectors, Implementation, verification
AbstractTargeting the spatial-modulation (SM) MIMO systems, a symbol detector is designed, implemented, and verified. The detector is designed based on a low-complexity dual datapath architecture performing the modified signal-vector-based list detection. Implemented in 0.18um CMOS, the detector occupies 1.85mm2 and shows the throughput of 686Mbps for 16 X 4 256-QAM SM-MIMO systems. Evaluated under a hardware-in-the-loop environment, the error rate is close to the optimal.

1S-4 (Time: 11:14 - 11:17)
TitleA Scalable Time-Domain Biosensor Array Using Logarithmic Cyclic Time-Attenuation-Based TDC for High-Resolution and Large-Scale Bio-Imaging
AuthorKei Ikeda, Atsuki Kobayashi, Kazuo Nakazato (Nagoya University, Japan), *Kiichi Niitsu (Nagoya University, JST PRESTO, Japan)
Pagepp. 11 - 12
KeywordBioimaging, time domain, TDC, cyclic, high resolution
AbstractThis paper presents a time-domain biosensor array that uses a capacitor-less current-mode analog-to-time converter (CMATC) and logarithmic cyclic time-attenuation-based TDC with discharging acceleration. Combining the exponential function of the CMATC and logarithmic function of the proposed TDC offers linear input–output characteristics. The time-domain property enables bio-imaging at a high spatial resolution and large scale while maintaining scalability. Measurement results with a 0.25-μm test chip successfully demonstrated linear input–output characteristics.

1S-5 (Time: 11:17 - 11:20)
TitleAn HDL-Synthesized Injection-Locked PLL Using LC-Based DCO for On-chip Clock Generation
Author*Dongsheng Yang, Wei Deng, Bangan Liu, Aravind Tharayil Narayanan, Teerachot Siriburanon, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Pagepp. 13 - 14
KeywordSynthesizable, Injection locking, IL-PLL, LC-DCO, Low jitter
AbstractThis paper presents an HDL-synthesized injection-locked phase-locked loop using an LC-based DCO for on-chip clock generation. The superior noise performance of the LC-DCO enables the proposed synthesizable PLL to achieve top performance among the existing designs. Fabricated in a 65nm CMOS process, this prototype demonstrates a 0.142ps integrated jitter at 3.0GHz and consumes 4.6mW while only occupying an area of 0.12mm2. It achieves a figure of merit (FoM) of -250.3dB, which is the best among synthesized PLLs to date.

1S-6 (Time: 11:20 - 11:23)
TitleA 14bit 80kSPS Non-Binary Cyclic ADC without High Accuracy Analog Components
Author*Yuki Watanabe, Hayato Narita, Hiroyuki Tsuchiya (Tokyo City University, Japan), Tatsuji Matsuura (Tokyo University of Science, Japan), Hao San, Masao Hotta (Tokyo City University, Japan)
Pagepp. 15 - 16
Keywordβ-expansion, Cyclic ADC, non-binary
AbstractThis paper presents a prototype of a 14bit 80kSPS non-binary cyclic ADC based on β-expansion. Since β-expansion based ADCs are robust against non-idealities such as capacitor mismatch and finite amplifier DC gain, the design of this high-accuracy ADC can focus solely on the capacitance of the sampling capacitor needed to satisfy the overall kTC noise target and on the drivability of the amplifier. The proposed proof-of-concept cyclic ADC is designed and fabricated in TSMC 90nm CMOS technology. A peak SNDR of 81.9dB is achieved at Fs=80kSPS with an amplifier gain as low as 66dB, dissipating 8mW at VDD=3.3V in the analog circuits.

1S-7 (Time: 11:23 - 11:26)
TitleNon-Binary Cyclic ADC with Correlated Level Shifting Technique
Author*Hiroyuki Tsuchiya, Asato Uchiyama, Yuta Misima, Yuki Watanabe, Hao San, Masao Hotta (Tokyo City University, Japan), Tatsuji Matsuura (Tokyo University of Science, Japan)
Pagepp. 17 - 18
Keywordβ-expansion, Cyclic ADC, CLS
AbstractA proof-of-concept non-binary cyclic ADC with the proposed correlated level shifting (CLS) technique is designed and fabricated in 90nm CMOS technology. By applying the odd/even structure to a multiplying digital-to-analog converter (MDAC), the amplifier with CLS can be used for a cyclic ADC, so that the allowable dynamic range of the proposed cyclic ADC is almost doubled compared to the conventional cyclic ADC. As a result, the SNDR of the ADC can be improved with a smaller power penalty. Measurement results of the prototype verify the effectiveness of the proposed CLS technique for cyclic ADCs.

1S-8 (Time: 11:26 - 11:29)
TitleA Current-Integration-Based CMOS Amperometric Sensor with 1.2 μm × 2.05 μm Electroless-Plated Microelectrode Array for High-Sensitivity Bacteria Counting
Author*Kohei Gamo, Kazuo Nakazato (Nagoya University, Japan), Kiichi Niitsu (Nagoya University, JST PRESTO, Japan)
Pagepp. 19 - 20
KeywordAmperometry, Microelectrode, Current integrator, electroless-plating, cyclic voltammetry
AbstractA current-integration-based CMOS amperometric sensor with a bacteria-sized (1.2 μm × 2.05 μm) electroless-plated microelectrode array for high-sensitivity bacteria counting is presented. For high-sensitivity bacteria counting with sufficient SNR, noise must be reduced because the bacteria-sized microelectrode can handle only small current on the order of nA. The proposed current integration can reduce noise associated with the CMOS sensor. Measurement results with a 0.6-μm test chip demonstrate successful high-sensitivity 2D direct counting of microbeads with 27 dB SNR.

1S-9 (Time: 11:29 - 11:32)
TitleA Real-time 17-Scale Object Detection Accelerator with Adaptive 2000-Stage Classification in 65nm CMOS
Author*Minkyu Kim, Abinash Mohanty, Deepak Kadetotad (Arizona State University, U.S.A.), Naveen Suda (ARM, Inc., U.S.A.), Luning Wei (Zhejiang University, China), Pooja Saseendran (Arizona State University, U.S.A.), Xiaofei He (Zhejiang University, China), Yu Cao, Jae-sun Seo (Arizona State University, U.S.A.)
Pagepp. 21 - 22
Keywordobject detection, machine learning, classification, real-time/low-power, special-purpose accelerator
AbstractThis paper presents an object detection accelerator that features many-scale (17), many-object (up to 50), multi-class (e.g., face, traffic sign), and high accuracy (average precision (AP) of 0.81/0.72 for AFW/BTSD datasets) detection. Employing 10 gradient/color channels, integral features are extracted and 2,000 simple classifiers for rigid boosted templates are adaptively combined to make a strong classification. The prototype chip implemented in 65nm CMOS demonstrates 16-40 frames per second and 22-160 mW power at 0.6-1.0V supply.

1S-10 (Time: 11:32 - 11:35)
TitleA 15 x 15 SPAD Array Sensor with Breakdown-Pixel-Extraction Architecture for Efficient Data Readout
Author*Xiao Yang (Department of Electrical Engineering and Information Systems, The University of Tokyo, Japan), Hongbo Zhu, Toru Nakura, Tetsuya Iizuka, Kunihiro Asada (VLSI Design and Education Center (VDEC), The University of Tokyo, Japan)
Pagepp. 23 - 24
KeywordSPAD, sensor, readout circuit
AbstractThis design proposes a breakdown-pixel-extraction architecture for SPAD-based faint light detection systems. The proposed readout circuit detects the breakdown pixels, and only their addresses are read out. Therefore, under a faint light environment, this SPAD sensor significantly improves the data readout efficiency. A test-of-concept chip with a 15 x 15 SPAD array sensor was fabricated in a 0.18 um CMOS process, and a high-speed readout is verified by measurement.

1S-11 (Time: 11:35 - 11:38)
TitleDesign of an Energy-Autonomous Bio-Sensing System Using a Biofuel Cell and 0.19V 53µW Integrated Supply-Sensing Sensor with a Supply-Insensitive Temperature Sensor and Inductive-Coupling Transmitter
Author*Atsuki Kobayashi, Kei Ikeda (Nagoya University, Japan), Yudai Ogawa, Matsuhiko Nishizawa (Tohoku University, Japan), Kazuo Nakazato (Nagoya University, Japan), Kiichi Niitsu (Nagoya University, JST PRESTO, Japan)
Pagepp. 25 - 26
KeywordEnergy autonomy, Bio-sensing, Biofuel cell, Supply-sensing, Temperature sensor
AbstractThis paper presents an energy-autonomous bio-sensing system with the capability of wireless communication. The proposed system includes a biofuel cell as a power source and sensing frontend associated with the integrated supply-sensing sensor. The sensor consists of a digital-based gate leakage timer, supply-insensitive time-domain temperature sensor, and inductive-coupling transmitter. A test chip using 65-nm CMOS technology was operated with a supply of 0.19 V and consumed 53 µW to successfully demonstrate wireless communication with an asynchronous receiver.

1S-12 (Time: 11:38 - 11:41)
TitleA 13.56MHz CMOS Active Diode Full-Wave Rectifier Achieving ZVS with Voltage-Time-Conversion Delay-Locked Loop for Wireless Power Transmission
AuthorKeita Yogosawa, Hideki Shinohara, *Kousuke Miyaji (Shinshu University, Japan)
Pagepp. 27 - 28
Keywordwireless power transmission, active rectifier, ZVS
AbstractA CMOS active diode rectifier for wireless power transmission with proposed voltage-time-conversion (VTC) delay-locked loop (DLL) control suppresses reverse current by realizing zero-voltage switching (ZVS), regardless of AC input and process variations. The proposed circuit is implemented in a standard 0.18um CMOS process using I/O MOSFETs, which corresponds to 0.35um technology. The maximum power conversion efficiency of 78% is obtained at 231ohm load resistance.

1S-13 (Time: 11:41 - 11:44)
TitleCMOS-on-Quartz Pulse Generator for Low Power Applications
Author*Parit Kanjanavirojkul, Nguyen Ngoc Mai-Khanh, Tetsuya Iizuka, Toru Nakura, Kunihiro Asada (The University of Tokyo, Japan)
Pagepp. 29 - 30
Keywordpulse generation, UWB-IR, CMOS, low-duty-cycle, flip-chip devices
AbstractA pulse generator (PG) for low power and low duty cycle applications is presented. The PG employs a CMOS switch and a quarter-wavelength transmission line resonator. Since the architecture does not involve feedback gain, the PG is theoretically capable of generating a pulse efficiently at a high oscillation frequency fosc, at which transistor gain is limited. The PG also features a quick starting time and zero stand-by power. The PG is designed on a 0.18 um CMOS flipped over a transmission line resonator, implemented by aluminum on a quartz substrate. The prototype consumes 5.4 pJ/pulse, with energy conversion efficiency (ECE) of 2.37% at fosc of 11.5 GHz.

1S-14 (Time: 11:44 - 11:47)
TitleA 13.56 MHz On/Off Delay-Compensated Fully-Integrated Active Rectifier for Biomedical Wireless Power Transfer Systems
Author*Lin Cheng, Wing-Hung Ki, Tak-Sang Yim (The Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 31 - 32
KeywordWireless power transfer, active rectifier, implantable medical devices, delay compensation, PVT variations and mismatches
AbstractA 13.56 MHz 64.8 mW fully-integrated CMOS active rectifier for biomedical wireless power transfer system is presented in this summary. It employs an adaptive on/off delay compensation technique to accurately compensate for the circuit delays despite PVT variations and mismatches. With an AC input that ranges from 1.8 V to 3.6 V, the measured voltage conversion ratio is higher than 90% and the measured power conversion efficiency is higher than 89.1% for a load resistor of 500 Ω.

1S-15 (Time: 11:47 - 11:50)
TitleA Wireless Power Receiver with a 3-Level Reconfigurable Resonant Regulating Rectifier for Mobile-Charging Applications
Author*Lin Cheng, Wing-Hung Ki, Chi-Ying Tsui (The Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 33 - 34
KeywordWireless charging, 3-level reconfigurable resonant regulating rectifier, resonant wireless power transfer, one-stage topology, A4WP
AbstractA wireless power receiver using a 3-level reconfigurable resonant regulating (R3) rectifier is presented in this summary. The receiver improves power conversion efficiency and reduces die area and off-chip components by achieving power conversion and voltage regulation in one stage, using only 4 on-chip power switches and 1 off-chip capacitor. The receiver regulates the output voltage at 5 V and delivers a maximum power of 6 W. It was fabricated in a standard 0.35 µm CMOS process with a die area of 4.77 mm2, and the measured peak efficiency reaches 92.2%.

1S-16 (Time: 11:50 - 11:53)
TitleSub-1-µs Start-up Time, 32-MHz Relaxation Oscillator for Low-Power Intermittent VLSI Systems
Author*Hiroki Asano, Tetsuya Hirose, Taro Miyoshi, Keishi Tsubaki, Toshihiro Ozaki, Nobutaka Kuroki, Masahiro Numa (Kobe University, Japan)
Pagepp. 35 - 36
KeywordRelaxation Oscillator (ROSC), Fast start-up, Intermittent operation, High accuracy, PVT variation
AbstractWe propose a sub-1-µs start-up time, fully integrated 32-MHz relaxation oscillator (ROSC) for intermittent VLSI systems. Our proposed ROSC employs current mode architecture that is different from conventional voltage mode architecture. This enables compact and fast switching speed to be achieved. The measurement results demonstrated that the ROSC achieved sub-1-µs start-up time and generated stable output frequency of 32.6 MHz. Measured line regulation, temperature coefficient, and variation coefficient in 10 samples were ±0.69, ±0.38, and 0.62%, respectively.

1S-17 (Time: 11:53 - 11:56)
TitleA 19-μA Metabolic Equivalents Monitoring SoC Using Adaptive Sampling
Author*Mio Tsukahara, Shintaro Izumi, Motofumi Nakanishi, Hiroshi Kawaguchi (Kobe University, Japan), Hiromitsu Kimura, Kyoji Marumoto, Takaaki Fuchikami, Yoshikazu Fujimori (Rohm Co. Ltd., Japan), Masahiko Yoshimoto (Kobe University, Japan)
Pagepp. 37 - 38
Keywordphysical activity, metabolic equivalents, non-volatile CPU, normally-off computing, wearable sensor SoC
AbstractThis paper presents a low-power metabolic equivalents (METs) estimation SoC for monitoring physical activity with a wearable sensor. Long-term continuous METs monitoring can contribute to detection of non-communicable diseases. The proposed SoC consists of a non-volatile CPU and dedicated hardware for heart rate extraction and METs estimation to reduce the power consumption. A test chip is fabricated in a 130-nm CMOS process. Evaluation results show that the proposed system, which consists of the test chip and an accelerometer, consumes about 19 uA on average.

1S-18 (Time: 11:56 - 11:59)
TitleAn FPGA-Compatible PLL-Based Sensor against Fault Injection Attack
AuthorWei He, Jakub Breier, *Shivam Bhasin (Nanyang Technological University, Singapore), Noriyuki Miura, Makoto Nagata (Kobe University, Japan)
Pagepp. 39 - 40
KeywordLFI, EMFI, FPGA, Ring Oscillator
AbstractLaser based fault injection (LFI) and Electromagnetic fault injection (EMFI) are powerful techniques for fault injection in security critical circuits. Since LFI/EMFI creates faults by injecting high energy disturbances, it can be detected in advance by a sensitive embedded sensor. In this paper, a PLL based sensor system for detecting laser fault injection is presented. Experiments show a high detection rate, with significant power security margin, whilst maintaining low hardware cost, on multiple FPGA platforms.

1S-19 (Time: 11:59 - 12:02)
TitleVariability Mapping at Runtime Using the PAnDA Multi-reconfigurable Architecture
Author*Simon Bale (University of York, U.K.), James Walker (University of Hull, U.K.), Martin Trefzer, Andy Tyrrell (University of York, U.K.)
Pagepp. 41 - 42
KeywordReconfigurable architectures, evolvable hardware, intrinsic variability
AbstractThis paper describes a novel multi-reconfigurable architecture, which allows variability-aware design, rapid prototyping and post-fabrication optimisation of digital systems. This is achieved by exploiting reconfiguration at both the digital function level and the transistor level. A runtime variability map of the architecture, created using ring oscillators, is presented.

1S-20 (Time: 12:02 - 12:05)
TitleDesign of High-Frequency Piezoelectric Resonator-Based Cascaded Fractional-N PLL with Sub-ppb-Order Channel Adjusting Technique
Author*Yosuke Ishikawa, Sho Ikeda, Hiroyuki Ito (Tokyo Institute of Technology, Japan), Akifumi Kasamatsu (National Institute of Information and Communications Technology, Japan), Takayoshi Obara, Naoki Noguchi, Koji Kamisuki, Yao Jiyang (Tokyo Institute of Technology, Japan), Shinsuke Hara, Ruibing Dong (National Institute of Information and Communications Technology, Japan), Shiro Dosho, Noboru Ishihara, Kazuya Masu (Tokyo Institute of Technology, Japan)
Pagepp. 43 - 44
KeywordPLL, RF
AbstractWe reported a high-frequency piezoelectric resonator (PZR)-based cascaded fractional-N PLL featuring a channel adjusting technique with sub-ppb-order frequency resolution, which can overcome the difficulty of using a narrow-range GHz PZR. This paper details the design of the proposed cascaded PLL. In order to reduce the power consumption of the 2nd PLL, a power-efficient latch for the pre-scaler is proposed. A 3rd-1st cascaded delta-sigma modulator can reduce the number of gates. The prototype PLL was fabricated in 65nm CMOS and achieved 8.484GHz to 8.912GHz output, 180 fs rms-jitter, and -244 dB FOM while consuming 12.7mW.


Session 1A  Design Assurance and Reliability
Time: 11:05 - 12:20 Tuesday, January 17, 2017
Location: Room 102
Chairs: Chih-Tsun Huang (National Tsing Hua University, Taiwan), Franco Fummi (University of Verona, Italy)

1A-1 (Time: 11:05 - 11:30)
TitleAGARSoC: Automated Test and Coverage-Model Generation for Verification of Accelerator-Rich SoCs
AuthorBiruk Mammo, *Doowon Lee, Harrison Davis, Yijun Hou, Valeria Bertacco (University of Michigan, U.S.A.)
Pagepp. 45 - 50
KeywordSoC, accelerators, verification, coverage, test generation
AbstractSoC design trends show increasing integration of special-purpose, third-party hardware blocks to accelerate diverse types of computation. These accelerator blocks interact with each other in unexpected ways when integrated into a complex, accelerator-rich SoC. In this work we propose a novel solution that guides verification engineers to the high-priority accelerator interaction scenarios during RTL verification. We observe that interaction scenarios frequently exercised by software for the SoC, which is typically developed alongside the RTL, should be the highest priority targets for verification. To this end we analyze the behavior of software executed on high-level simulation models to identify commonly occurring accelerator interaction scenarios. We encapsulate scenarios observed from diverse software executions into an abstract representation that can then be used to extract coverage models and generate test programs. Our experiments show that our solution is able to identify frequently exercised scenarios, extract coverage models, and generate compact, high-quality tests for two completely different SoC designs.

1A-2 (Time: 11:30 - 11:55)
TitleFeature Extraction from Design Documents to Enable Rule Learning for Improving Assertion Coverage
Author*Kuo-Kai Hsieh, Sebastian Siatkowski, Li-Chung Wang (University of California, Santa Barbara, U.S.A.), Wen Chen, Jayanta Bhadra (NXP Semiconductors, U.S.A.)
Pagepp. 51 - 56
KeywordFunctional verification, Text mining, Rule Learning, Functional coverage
AbstractFeature selection is essential to rule learning in the context of functional verification. In practice today, features are selected manually and the selection requires domain knowledge. In contrast, this work proposes using automatic feature extraction from design documents as a viable approach to support rule learning. To demonstrate its effectiveness, document-extracted features are employed to learn the rules for covering a set of assertions based on a commercial SoC. Experiments show that 100%-accurate rules can be obtained for more than 70% of the assertions.

1A-3 (Time: 11:55 - 12:20)
TitleTrust is good, Control is better: Hardware-based Instruction-Replacement for Reliable Processor-IPs
Author*Kenneth Schmitz, Arun Chandrasekharan, Jonas Gomes Filho, Daniel Große, Rolf Drechsler (University of Bremen, Germany)
Pagepp. 57 - 62
KeywordIP-integration, high-level synthesis, design automation, instruction-replacement
AbstractFault-free function and defect tolerance are key requirements for modern embedded systems. To meet time-to-market constraints, complex IP-components are used to assemble even more complex semiconductor products. Often, trust is required since these IPs are developed, verified and tested by external third-party IP-providers. In this work, we focus specifically on processor-IPs. A method for run-time instruction-replacement on hardware-level is presented to increase the reliability of the system. In contrast to existing techniques, our scheme can easily deal with black-box components and is comparatively lightweight. Furthermore, it includes an easy-to-use methodology for automated and convenient implementation. The results show the successful application of this novel technique for reliable integration of state-of-the-art RISC-based processor-IPs.


Session 1B  New Frontiers of Hardware Accelerator Synthesis
Time: 11:05 - 12:20 Tuesday, January 17, 2017
Location: Room 104
Chairs: Seiya Shibata (NEC Corp., Japan), Takefumi Miyoshi (e-trees.Japan)

1B-1 (Time: 11:05 - 11:30)
TitleEfficient Floating Point Precision Tuning for Approximate Computing
Author*Nhut-Minh Ho, Elavarasi Manogaran, Weng-Fai Wong (National University of Singapore, Singapore), Asha Anoosheh (University of California, Berkeley, U.S.A.)
Pagepp. 63 - 68
Keywordapproximate computing, floating-point precision optimization, bitwidth optimization, high-level synthesis
AbstractThis paper presents an automatic toolchain that efficiently computes the precision of floating point variables down to the bit level of the mantissa. Our toolchain uses a distributed algorithm that can analyze thousands of variables. We successfully used the tool to transform floating point signal processing programs to their arbitrary precision fixed-point equivalents, obtaining about 82% and 66% average reduction in resources when compared to the double precision and single precision versions, respectively.
Slides
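
The bit-level mantissa tuning described in this abstract can be illustrated with a minimal sketch. It is not the authors' distributed tool-chain: it only truncates the inputs of a single function (not intermediate variables) and uses a plain linear search. The helper names `truncate_mantissa` and `min_mantissa_bits` and the relative-error criterion are assumptions made for illustration.

```python
import struct

def truncate_mantissa(x: float, bits: int) -> float:
    """Keep only the top `bits` of the 52-bit double mantissa (truncation, no rounding)."""
    if not 0 <= bits <= 52:
        raise ValueError("bits must be in 0..52")
    raw = struct.unpack("<Q", struct.pack("<d", x))[0]
    # Clear the low (52 - bits) mantissa bits; sign and exponent are untouched.
    mask = ~((1 << (52 - bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", raw & mask))[0]

def min_mantissa_bits(func, inputs, rel_tol: float) -> int:
    """Smallest mantissa width (hypothetical criterion) for which func evaluated on
    truncated inputs stays within rel_tol of the full-precision result."""
    for bits in range(53):
        if all(
            abs(func(truncate_mantissa(x, bits)) - func(x)) <= rel_tol * abs(func(x))
            for x in inputs
        ):
            return bits
    return 52  # unreachable: bits == 52 reproduces the input exactly

if __name__ == "__main__":
    print(min_mantissa_bits(lambda v: v * v, [1.5, 3.14159, 0.1], 1e-3))
```

A real tool would tune every variable in the program jointly; the sketch only shows what "precision down to the bit level of the mantissa" means for one value.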

1B-2 (Time: 11:30 - 11:55)
TitleArea-Constrained Technology Mapping for In-Memory Computing Using ReRAM Devices
AuthorDebjyoti Bhattacharjee, Arvind Easwaran, *Anupam Chattopadhyay (Nanyang Technological University, Singapore)
Pagepp. 69 - 74
KeywordReRAM, technology mapping, area constrained, in memory computing
AbstractIn-memory computing platforms, such as Resistive RAM (ReRAM), offer a natural advantage to data-intensive applications. The benefits of data locality and the capability to perform native Boolean operations are exploited for significant performance advantage in multiple contexts ranging across neuromorphic computing, associative memory-based computing, arithmetic benchmarks and general-purpose programmable logic-in-memory computing. Despite these advances, design automation tools supporting in-memory computing are still in a nascent phase. In this work, we investigate, for the first time, the problem of minimizing delay under an arbitrary area constraint of ReRAM devices. We formulate the problem of area-constrained delay minimization as an Integer Linear Programming (ILP) problem and further propose heuristics that offer scalability as well as solutions close to optimal performance. Area-constrained technology mapping enables unlocking a significantly large space of design trade-offs.

1B-3 (Time: 11:55 - 12:20)
TitleTessellating Memory Space for Parallel Access
Author*Juan Escondido, Mingjie Lin (UCF, U.S.A.)
Pagepp. 75 - 80
KeywordTessellation, Memory
AbstractModern reconfigurable computing chips, such as FPGAs, offer an unprecedented opportunity to achieve both multifunctionality and real-time responsiveness for memory-intensive embedded applications. However, how to cost-effectively synthesize application-specific hardware constructs that fully exploit memory-level parallelism remains a key challenge. To address this problem, we propose a new tessellation-based memory partitioning and mapping scheme that aims at maximizing parallel memory accesses while conserving both hardware resources and energy. Compared with the existing linear skewing and hyper-plane partitioning methodologies, our proposed technique exploits the regularity of tessellation patterns to assign memory banks and calculate intra-bank offsets in a direct, geometry-based manner, and is therefore not only intuitive to comprehend but also straightforward to implement in hardware.


Session 1C  Analysis Techniques for Reliability and Manufacturability
Time: 11:05 - 12:20 Tuesday, January 17, 2017
Location: Room 105
Chairs: Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan), Song Chen (Univ. of Science and Tech. of China, China)

1C-1 (Time: 11:05 - 11:30)
TitleLithography Hotspot Detection by Two-stage Cascade Classifier Using Histogram of Oriented Light Propagation
Author*Yoichi Tomioka (University of Aizu, Japan), Tetsuaki Matsunawa, Chikaaki Kodama, Shigeki Nojima (Toshiba Corporation, Japan)
Pagepp. 81 - 86
KeywordDesign for manufacturability, Lithography hotspot detection, Real AdaBoost
AbstractIn advanced semiconductor-process technology, the ability to detect and repair lithography hotspots, which can affect printability, is essential. In this paper, we propose a two-stage cascade classifier for accurate hotspot detection. Our classifier uses a novel layout feature based on the propagation of light passing through a photomask. We performed experiments to evaluate our cascade classifier by applying it to the ICCAD-2012 CAD contest problem. The hotspot detection performance was evaluated according to two indices: (I1) the number of detected hotspots over the number of actual hotspots and (I2) the number of detected hotspots over the number of false hotspots. The results showed that the proposed method gained a 1.15% improvement in I1 and 24.4 times improvement in I2 on average compared to existing state-of-the-art methods, even the one with the best I1.
Slides
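
The two evaluation indices named in this abstract are simple ratios, sketched below. The function name `hotspot_indices` is hypothetical, and "false hotspots" is read here as false alarms (reported hotspots that are not actual hotspots), which is an assumption about the abstract's wording.

```python
def hotspot_indices(detected_true: int, actual: int, false_alarms: int):
    """Return (I1, I2) as defined in the abstract.

    I1 = detected hotspots / actual hotspots
    I2 = detected hotspots / false hotspots (assumed to mean false alarms)
    """
    if actual <= 0:
        raise ValueError("actual must be positive")
    i1 = detected_true / actual
    # With no false alarms the ratio is unbounded; report infinity.
    i2 = detected_true / false_alarms if false_alarms else float("inf")
    return i1, i2

if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(hotspot_indices(90, 100, 30))
```

A higher I1 means fewer missed hotspots, while a higher I2 means fewer false alarms per correctly detected hotspot.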

1C-2 (Time: 11:30 - 11:55)
TitleReliability Analysis of Memories suffering MBUs for the Effect of Negative Bias Temperature Instability
Author*Shanshan Liu, Liyi Xiao, Xuebing Cao, Zhigang Mao (Harbin Institute of Technology, China)
Pagepp. 87 - 92
KeywordMemory, reliability, Negative bias temperature instability, multiple bit upsets, error correction codes
AbstractIn this paper, the effect of negative bias temperature instability (NBTI) on the MBU sensitivity of memories in 65 nm bulk technology is analyzed and simulated with Geant4. An MTTF reliability model that includes NBTI stress time is proposed for memories protected by error correction codes (ECCs). Both the scrubbing and non-scrubbing cases are considered. The MTTF predicted by the proposed model aligns well with the MTTF obtained by simulation in the radiation environment.
Slides

1C-3 (Time: 11:55 - 12:20)
TitleEfficient Circuit Failure Probability Calculation along Product Lifetime Considering Device Aging
Author*Hiromitsu Awano, Masayuki Hiromoto, Takashi Sato (Kyoto University, Japan)
Pagepp. 93 - 98
KeywordAging, Yield estimation, BTI
AbstractA device-aging simulation that efficiently estimates the temporal degradation of the failure probability of a circuit is proposed. As the size of transistors shrinks, consideration of device aging in addition to manufacturing variability has become an urgent issue for maintaining the reliability of LSIs. In contrast to existing techniques that handle manufacturing variability and device aging separately, we propose a simultaneous evaluation approach using an augmented reliability and subset simulation. By eliminating the repetitive failure-probability calculations at each device age, the proposed method reduces the number of required circuit simulations to about 1/6 of that of the conventional method without compromising accuracy.
Slides


Session 2S  (Special Session) Neuromorphic Computing and Low-Power Image Recognition
Time: 13:50 - 15:30 Tuesday, January 17, 2017
Location: Room 103
Organizers/Chairs: Yiran Chen (University of Pittsburgh, U.S.A.), Bo Yuan (City University of New York, U.S.A.), Yung-Hsiang Lu (Purdue University, U.S.A.), Ying Wang (Institute of Computing Technology, Chinese Academy of Sciences, China)

2S-1 (Time: 13:50 - 14:15)
Title(Invited Paper) Low-Power Image Recognition Challenge
Author*Kent Gauen, Rohit Rangan, Anup Mohan, Yung-Hsiang Lu (Purdue University, U.S.A.), Wei Liu, Alexander C. Berg (University of North Carolina, U.S.A.)
Pagepp. 99 - 104
Keywordobject detection, image processing, machine learning, low-power, competition
AbstractThe Low-Power Image Recognition Challenge (LPIRC) is, to our knowledge, the only on-site competition that considers both energy consumption and recognition accuracy. LPIRC was held as a one-day workshop at the Design Automation Conference in 2015 and 2016. The score was the ratio of recognition accuracy to energy consumption. The winner of 2016 was able to analyze 7,347 images and achieve 9.44% normalized mAP (mean average precision) with an average power consumption of 4.7 W. Another team analyzed 1,020 images and achieved 25.7% normalized mAP.
Slides
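
The scoring rule stated in this abstract, accuracy divided by energy, can be written out as a short sketch. The function name `lpirc_score`, the choice of watt-hours as the energy unit, and the run duration used in the example are assumptions; the abstract does not give the run time or the exact units.

```python
def lpirc_score(map_accuracy: float, avg_power_w: float, duration_s: float) -> float:
    """Score = recognition accuracy / energy consumed.

    Energy is computed as average power times duration and expressed in Wh
    (an assumed unit, chosen for illustration).
    """
    energy_wh = avg_power_w * duration_s / 3600.0
    if energy_wh <= 0:
        raise ValueError("energy must be positive")
    return map_accuracy / energy_wh

if __name__ == "__main__":
    # Hypothetical run: 50% mAP at 5 W for 12 minutes, i.e. 1 Wh of energy.
    print(lpirc_score(0.5, 5.0, 720.0))
```

Under such a rule, a team can trade accuracy for lower energy, which is consistent with the abstract reporting one team with lower mAP over more images and another with higher mAP over fewer.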

2S-2 (Time: 14:15 - 14:40)
Title(Invited Paper) CNN-based Object Detection Solutions for Embedded Heterogeneous Multi-core SoCs
AuthorCheng Wang, *Ying Wang, Yinhe Han, Lili Song, Zhenyu Quan, Jiajun Li, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 105 - 110
KeywordCNN, image recognition, heterogeneous SoC, energy efficiency
AbstractThis paper surveys how to use Convolutional Neural Networks (CNN) to hypothesize object location and categorization from images or videos in mobile heterogeneous SoCs. Recently a variety of CNN-based object detection frameworks have demonstrated both increasing accuracy and speed. Though they are making fast progress in high quality image recognition, state-of-the-art CNN-based detection frameworks seldom discuss their hardware-dependent aspects and the cost-effectiveness of real-time image analysis in off-the-shelf low-power devices. As the focus of deep learning and convolutional neural nets is shifting to embedded or mobile applications with limited power and computational resources, scaling down object detection frameworks and CNNs is becoming a new and important direction. In this work we conduct a comprehensive comparative study of state-of-the-art real-time object detection frameworks with respect to their performance and cost-effectiveness/energy-efficiency (in the metric of mAP/Wh) in off-the-shelf mobile GPU devices. Based on the analysis results and observations made in the investigation, we propose to adjust the design parameters of such frameworks and employ a design space exploration procedure to maximize the energy-efficiency (mAP/Wh) of real-time object detection solutions in mobile GPUs. As shown in the benchmarking result, we successfully boost the energy-efficiency of multiple popular CNN-based detection solutions by maximizing the utility of the computation resources of the SoC and trading off between prediction accuracy and energy cost. In the second Low-Power Image Recognition Challenge (LPIRC), our system achieved the best result measured in mAP/Energy on the embedded Jetson TX1 CPU+GPU SoC.

2S-3 (Time: 14:40 - 15:05)
Title(Invited Paper) Low-Power Neuromorphic Speech Recognition Engine with Coarse-Grain Sparsity
AuthorShihui Yin, Deepak Kadetotad (Arizona State University, U.S.A.), Bonan Yan, Chang Song, Yiran Chen (University of Pittsburgh, U.S.A.), Chaitali Chakrabarti, *Jae-sun Seo (Arizona State University, U.S.A.)
Pagepp. 111 - 114
Keywordneuromorphic computing, spiking neural network, speech recognition, low power
AbstractIn recent years, we have seen a surge of interest in neuromorphic computing and its hardware design for cognitive applications. In this work, we present new neuromorphic architecture, circuit, and device co-designs that enable spike-based classification for speech recognition tasks. The proposed neuromorphic speech recognition engine supports a sparsely connected deep spiking network with coarse granularity, leading to large memory reduction with minimal index information. Simulation results show that the proposed deep spiking neural network accelerator achieves a phoneme error rate (PER) of 20.5% for the TIMIT database, and consumes 3.0 mW in 40 nm CMOS for real-time performance. To alleviate the memory bottleneck, the usage of non-volatile memory is also evaluated and discussed.

2S-4 (Time: 15:05 - 15:30)
Title(Invited Paper) Towards Acceleration of Deep Convolutional Neural Networks Using Stochastic Computing
AuthorJi Li (University of Southern California, U.S.A.), Ao Ren, Zhe Li, Caiwen Ding (Syracuse University, U.S.A.), Bo Yuan (City University of New York, U.S.A.), Qinru Qiu, *Yanzhi Wang (Syracuse University, U.S.A.)
Pagepp. 115 - 120
KeywordDeep Convolutional Neural Networks, Stochastic Computing
AbstractIn recent years, the Deep Convolutional Neural Network (DCNN) has become the dominant approach for almost all recognition and detection tasks and has outperformed humans on certain tasks. Nevertheless, the high power consumption and complex topologies have hindered the widespread deployment of DCNNs, particularly in wearable devices and embedded systems with limited area and power budget. This paper presents a fully parallel and scalable hardware-based DCNN design using Stochastic Computing (SC), which leverages the energy-accuracy trade-off through optimizing SC components in different layers. We first conduct a detailed investigation of the Approximate Parallel Counter (APC) based neuron and multiplexer-based neuron using SC, and analyze the impacts of various design parameters, such as bit stream length and input number, on the energy/power/area/accuracy of the neuron cell. Then, from an architecture perspective, the influence of inaccuracy of neurons in different layers on the overall DCNN accuracy (i.e., software accuracy of the entire DCNN) is studied. Accordingly, a structure optimization method is proposed for a general DCNN architecture, in which neurons in different layers are implemented with optimized SC components, so as to reduce the area, power, and energy of the DCNN while maintaining the overall network performance in terms of accuracy. Experimental results show that the proposed approach can find a satisfactory DCNN configuration, which achieves 55X, 151X, and 2X improvement in terms of area, power and energy, respectively, while the error is increased by 2.86%, compared with the conventional binary ASIC implementation.
Slides
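
For readers unfamiliar with Stochastic Computing, the following minimal sketch shows the basic idea that the bit stream length trades accuracy for cost. It uses the common unipolar encoding, in which a value in [0, 1] is the probability of a 1 in a random bit stream and multiplication is a bitwise AND of two independent streams. This is a generic illustration, not the APC-based or multiplexer-based neuron design of the paper; the function names are hypothetical.

```python
import random

def to_stream(p: float, length: int, rng: random.Random) -> list:
    """Encode probability p as a random bit stream of the given length (unipolar SC)."""
    return [1 if rng.random() < p else 0 for _ in range(length)]

def sc_multiply(p: float, q: float, length: int, seed: int = 0) -> float:
    """Estimate p * q by ANDing two independent bit streams and counting ones."""
    rng = random.Random(seed)
    a = to_stream(p, length, rng)
    b = to_stream(q, length, rng)
    return sum(x & y for x, y in zip(a, b)) / length

if __name__ == "__main__":
    # Longer streams give estimates closer to the exact product 0.5 * 0.6 = 0.3.
    for n in (64, 1024, 16384):
        print(n, sc_multiply(0.5, 0.6, n))
```

The estimate is noisy by construction, which is why stream length appears among the design parameters that the abstract analyzes against energy, power, area, and accuracy.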


Session 2A  System-level Techniques for Energy and Performance Optimization
Time: 13:50 - 15:30 Tuesday, January 17, 2017
Location: Room 102
Chairs: Liang Shi (Chongqing University, China), Takatsugu Ono (Kyushu University, Japan)

2A-1 (Time: 13:50 - 14:15)
TitleEnabling Fast Preemption via Dual-Kernel Support on GPUs
AuthorLi-Wei Shieh (National Taiwan University, Taiwan), *Kun-Chih Chen (National Sun Yat-sen University, Taiwan), Hsueh-Chun Fu, Po-Han Wang, Chia-Lin Yang (National Taiwan University, Taiwan)
Pagepp. 121 - 126
KeywordGraphics Processing Unit, Preemption
AbstractTo meet QoS requirements, we enable fast preemption on GPUs. First, we propose a dual-kernel approach to support fine-grained preemption, and an allocation policy to avoid resource fragmentation. Second, we propose a victim selection scheme to reduce the preemption cost while satisfying a required preemption latency. Evaluations show that our scheme comes within 2% of the ideal preemption scheme in terms of deadline violations. On average, we improve GPU resource utilization during preemption by 2.93x over the prior technique.
Slides

2A-2 (Time: 14:15 - 14:40)
TitleEfficient Mapping of CDFG onto Coarse-Grained Reconfigurable Array Architectures
Author*Satyajit Das, Kevin Martin, Philippe Coussy (LabSTICC, University of South Brittany, France), Davide Rossi, Luca Benini (University of Bologna, Italy)
Pagepp. 127 - 132
KeywordCDFG Mapping, CGRA, Energy efficient, Low power, control flow
AbstractIn the approaching era of IoT, flexible and low power accelerators have become essential to meet aggressive energy efficiency targets. During the last few decades, Coarse Grain Reconfigurable Arrays (CGRA) have demonstrated high energy efficiency as accelerators, especially for high-performance streaming applications. While existing CGRAs mostly rely on partial and full predication techniques to support conditional branches, inefficient architecture and mapping support for handling control flow limits the use of CGRAs to accelerating only inner loop bodies or transformed loops specifically adapted to the target CGRA. This paper proposes a novel CGRA architecture with support for jump and conditional jump instructions and a lightweight global synchronization mechanism to enable complete Control Data Flow Graph (CDFG) mapping in an ultra-low-power environment. The architecture is coupled with a complete design flow that efficiently maps applications with heavy control flow starting from a generic C language description. The proposed mapping approach reduces the impact of the wasteful instruction issues of conventional predication approaches, providing an average energy improvement of 1.44x and 1.6x when compared to the state-of-the-art partial and full predication techniques. Moreover, the proposed method achieves an average speed-up of up to 21x and an energy improvement of up to 50.42x while executing applications with heavy control flow with respect to sequential execution on a low-power embedded CPU, demonstrating its suitability for next generation IoT applications.
Slides

2A-3 (Time: 14:40 - 15:05)
TitleTiming Window Wiper: A New Scheme for Reducing Refresh Power of DRAM
Author*Ho Hyun Shin, Hyeokjun Seo, Byunghoon Lee, Jeongbin Kim, Eui-Young Chung (Yonsei University, Republic of Korea)
Pagepp. 133 - 138
KeywordDRAM, refresh, power
AbstractDRAM refresh power, which is consumed solely to preserve data, is rapidly increasing as capacity increases. A study predicts that it will account for up to 35% of total DRAM power in high-capacity DRAM devices. While various schemes have been proposed to reduce the refresh power, they could not be adopted in commercial products due to cost overhead and design complexity issues. In this paper, we propose a simple refresh power saving scheme called TWW. We implement it with a much smaller register size than previous works and without modifying the OS or the DRAM controller. It eliminates unnecessary refresh operations on pre-activated rows in a specific timing window with an optimum register size. We can save up to 16% of the refresh power with only 6.2 KB of registers in a DRAM device. This paper explains the implementation of the proposed scheme and shows its power saving effectiveness with the gem5 full-system simulator.
Slides

2A-4 (Time: 15:05 - 15:30)
TitleOn Efficient Message Passing in Energy Harvesting Based Distributed System
Author*Ye Tian, Qiang Xu (The Chinese University of Hong Kong, Hong Kong), Jason Xue (City University of Hong Kong, Hong Kong)
Pagepp. 139 - 144
Keywordenergy harvesting, message passing, energy efficiency
AbstractEnergy harvesting systems, which recycle energy from the ambient environment for extended lifetime, have been proposed as the next step in the evolution of the Internet of Things. One of the largest challenges of energy harvesting is the insufficiency and instability of ambient energy sources. Unlike in the wireless distributed systems considered to date, frequent power failures need to be taken into account in designing energy-efficient transmission strategies. In this paper, we propose a message passing method between energy harvesting nodes, which predicts the probability of success before message passing and adjusts the transmission strategy online according to the input power trace. Experimental results show that our method can reduce the energy consumption of transmission significantly compared to traditional ones and remains effective for various kinds of unstable input power.
Slides


Session 2B  Pushing the Limits of Logic Synthesis
Time: 13:50 - 15:30 Tuesday, January 17, 2017
Location: Room 104
Chairs: Shouyi Yin (Tsinghua University, China), Shinobu Nagayama (Hiroshima City University, Japan)

2B-1 (Time: 13:50 - 14:15)
TitleFast Extract with Cube Hashing
Author*Bruno Schmitt (UFRGS, Brazil), Alan Mishchenko (UC Berkeley, U.S.A.), Victor Kravets (IBM, U.S.A.), Robert Brayton (UC Berkeley, U.S.A.), André Reis (UFRGS, Brazil)
Pagepp. 145 - 150
Keywordlogic synthesis and optimization, factoring, combinational circuits
AbstractThe fast extract algorithm is a well-known algebraic method for factoring and decomposing Boolean expressions. Since it uses pairwise comparison between cubes to find factors, its performance greatly degrades when processing a network whose output function has thousands of cubes when expressed in terms of its primary inputs. This paper describes a full implementation of the fast extract with cube hashing algorithm, fxch, which reduces complexity to linear in the number of cubes by using sub-cube hashing to find factors.
Slides
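
The contrast between pairwise comparison and hashing in this abstract can be sketched as follows. The code is only an illustration of the hashing idea, not the fxch algorithm: it finds cubes that become identical after dropping one literal by inserting every such sub-cube into a hash table, so the work grows with the number of cubes (times cube size) rather than with the number of cube pairs. The cube representation and the function name `shared_subcubes` are assumptions.

```python
from collections import defaultdict

def shared_subcubes(cubes):
    """Map each sub-cube (a cube minus one literal) to the indices of the cubes
    that produce it, keeping only sub-cubes shared by at least two cubes.

    cubes: list of frozensets of literal names, e.g. frozenset({"a", "b", "c"}).
    """
    table = defaultdict(list)
    for idx, cube in enumerate(cubes):
        for literal in cube:
            # Hash the sub-cube obtained by removing a single literal.
            table[cube - {literal}].append(idx)
    return {sub: ids for sub, ids in table.items() if len(ids) > 1}

if __name__ == "__main__":
    cubes = [frozenset("abc"), frozenset("abd"), frozenset("ef")]
    # Cubes 0 and 1 share the sub-cube {a, b}: abc + abd = ab(c + d).
    print(shared_subcubes(cubes))
```

Each shared sub-cube is a candidate common factor; a full extraction algorithm would additionally rank candidates and rewrite the network.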

2B-2 (Time: 14:15 - 14:40)
TitleA Novel Basis for Logic Rewriting
Author*Winston Haaswijk, Mathias Soeken (EPFL, Switzerland), Luca Amarú (Synopsys, U.S.A.), Pierre-Emmanuel Gaillardon (The University of Utah, U.S.A.), Giovanni De Micheli (EPFL, Switzerland)
Pagepp. 151 - 156
Keywordlogic, synthesis, rewriting, xmg
AbstractGiven a set of logic primitives and a Boolean function, exact synthesis finds the optimum representation (e.g., depth or size) of the function in terms of the primitives. Due to its high computational complexity, the use of exact synthesis is limited to small networks. Some logic rewriting algorithms use exact synthesis to replace small subnetworks by their optimum representations. However, conventional approaches have two major drawbacks. First, their scalability is limited, as Boolean functions are enumerated to precompute their optimum representations. Second, the strategies used to replace subnetworks are not satisfactory. We show how the use of exact synthesis for logic rewriting can be improved. To this end, we propose a novel method that includes various improvements over conventional approaches: (i) we improve the subnetwork selection strategy, (ii) we show how enumeration can be avoided, allowing our method to scale to larger subnetworks, and (iii) we introduce XOR Majority Graphs (XMGs) as compact logic representations that make exact synthesis more efficient. We show a 45.8% geometric mean reduction, a 6.5% size reduction, and depth · size reductions of 8.6%, compared to the academic state-of-the-art. Finally, we improve on 3 of the 9 best known size results for the EPFL benchmark suite, reducing size by up to 11.5% and depth by up to 46.7%.

2B-3 (Time: 14:40 - 15:05)
TitleMulti-level Logic Benchmarks: An Exactness Study
Author*Luca Amaru (Synopsys, U.S.A.), Mathias Soeken, Winston Haaswijk, Eleonora Testa (EPFL, Switzerland), Patrick Vuillod, Jiong Luo (Synopsys, U.S.A.), Pierre-Emmanuel Gaillardon (University of Utah, U.S.A.), Giovanni De Micheli (EPFL, Switzerland)
Pagepp. 157 - 162
KeywordLogic Synthesis, Benchmarks, Exact Synthesis, Multi-level Logic Synthesis
AbstractIn this paper, we study exact multi-level logic benchmarks. We refer to an exact logic benchmark, or exact benchmark in short, as the optimal implementation of a given Boolean function, in terms of minimum number of logic levels and/or nodes. Exact benchmarks are of paramount importance to design automation because they allow engineers to test the efficiency of heuristic techniques used in practice. When dealing with two-level logic circuits, tools to generate exact benchmarks are available, e.g., espresso-exact, and scale up to relatively large size. However, when moving to modern multi-level logic circuits, the problem of deriving exact benchmarks is inherently more complex. Indeed, few solutions are known. In this paper, we present a scalable method to generate exact multi-level benchmarks with the optimum, or provably close to the optimum, number of logic levels. Our technique involves concepts from graph theory and joint support decomposition. Experimental results show an asymptotic exponential gap between state-of-the-art synthesis techniques and our exact results. Our findings underline the need for strong new research in logic synthesis.
Slides

2B-4 (Time: 15:05 - 15:30)
TitleApproximate Logic Synthesis for FPGA by Wire Removal and Local Function Change
Author*Yi Wu, Chuyu Shen, Yi Jia, Weikang Qian (Shanghai Jiao Tong University, China)
Pagepp. 163 - 169
KeywordLogic Synthesis, Approximate Computing, Approximate Logic Synthesis, FPGA, Look-up Table
AbstractApproximate computing is a new design paradigm targeting error-tolerant applications. By allowing a small amount of inaccuracy in the computation, it can significantly reduce circuit area and power consumption. Several logic synthesis methods for approximate computing were proposed recently. However, these methods are mainly aimed at ASIC designs. In this work, we propose a novel approximate logic synthesis method targeting FPGA designs. We exploit the flexibility of look-up tables and propose a method that combines wire removal and local function change. The experimental results show that our method produces better results than the state-of-the-art approximate logic synthesis method adapted to FPGA designs. Moreover, it can be combined with the state-of-the-art method to further improve the design quality.
Slides


Session 2C  Design Techniques for Reliability Enhancement
Time: 13:50 - 15:30 Tuesday, January 17, 2017
Location: Room 105
Chairs: Yukihide Kohira (University of Aizu, Japan), Tetsuaki Matsunawa (Toshiba, Japan)

2C-1 (Time: 13:50 - 14:15)
TitleGuiding Template-aware Routing Considering Redundant Via Insertion for Directed Self-Assembly
AuthorKun-Lin Lin, *Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 170 - 175
KeywordDirected Self-Assembly, Redundant Via Insertion, DSA-aware Routing, Guiding Template
AbstractDirected self-assembly (DSA) technology has shown great potential in via/contact layer fabrication for sub-10-nm technology nodes. To guarantee sufficient overlay accuracy of generated vias, only a few guiding templates are feasible, and thus manufacturable via patterns are limited. In addition, redundant via insertion has become a necessary step during circuit design to improve reliability and yield. However, routing that considers only redundant vias or only DSA may either deteriorate the redundant via insertion rate or damage via manufacturability. This paper presents the first work on detailed routing that simultaneously considers guiding template feasibility and redundant via insertion.
Slides

2C-2 (Time: 14:15 - 14:40)
TitleWorkload-aware Static Aging Monitoring of Timing-critical Flip-flops
Author*Arunkumar Vijayan, Saman Kiamehr, Fabian Oboril (Karlsruhe Institute of Technology, Germany), Krishnendu Chakrabarty (Duke University, U.S.A.), Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 176 - 181
KeywordBTI, static aging, monitoring, workload
AbstractIn advanced technology nodes, Bias Temperature Instability (BTI) has emerged as a prominent reliability concern. The worst-case effects of BTI occur during specific workload phases in which flip-flops on a critical path do not switch their logic values for a long duration. These inactive flip-flops in the circuit experience accelerated workload-dependent static-BTI stress. The aging effect of static BTI for a few hours has been shown to be equivalent to one year of aging due to dynamic BTI, which can eventually cause circuit failure. The techniques available to mitigate static-BTI stress during standby mode of circuits are pessimistic, thereby limiting the performance of the circuit. To address this problem, we propose a runtime monitoring method to raise a flag when a timing-critical flip-flop experiences severe static-BTI stress. To reduce the monitoring costs, we select a small representative set of flip-flops offline based on workload-aware correlation analysis, and these selected flip-flops are monitored online for static aging phases. Our experiments conducted on two processors show that less than 0.5% of the total number of flip-flops need to be selected as representative flip-flops for static-BTI stress monitoring.
Slides

2C-3 (Time: 14:40 - 15:05)
TitleEnhancing Robustness of Sequential Circuits Using Application-specific Knowledge and Formal Methods
Author*Sebastian Huhn (University of Bremen, Germany), Stefan Frehse (DFKI GmbH Bremen, Germany), Robert Wille (Johannes Kepler University Linz, Austria), Rolf Drechsler (University of Bremen, Germany)
Pagepp. 182 - 187
KeywordRobustness Enhancement, Transient Faults, Formal Methods
AbstractDue to shrinking feature sizes, integrated circuits are getting more vulnerable to transient faults. Methods that increase the robustness of circuits against these faults have existed for a long time but either introduce huge additional logic, increase the latency of the circuit, or are applicable only to dedicated circuits such as microprocessors. This work proposes an alternative hardening method which requires only a slight increase in additional hardware, does not influence the timing behavior, and is automatically applicable to arbitrary circuits. To this end, application-specific knowledge of the considered circuit is exploited, analyzed by a dedicated orchestration of formal techniques, and, eventually, used to synthesize a fault detection mechanism enhancing the robustness of the circuit. Experimental evaluations show that the proposed solution leads to a significant increase in the robustness, while the hardware overhead is kept moderate.
Slides

2C-4 (Time: 15:05 - 15:30)
TitleWIPE: Wearout Informed Pattern Elimination to Improve the Endurance of NVM-based Caches
Author*Sina Asadi, Amir Mahdi Hosseini Monazzah, Hamed Farbeh, Seyed Ghassem Miremadi (Sharif University of Technology, Iran)
Pagepp. 188 - 193
KeywordCache, Endurance, Non-volatile Memory, Frequent data pattern, Online Coding
AbstractWith the recent development in Non-Volatile Memory (NVM) technologies, several studies have suggested using them as an alternative to SRAMs in on-chip caches. Limited endurance of NVMs is a major challenge when employed in the caches. This paper proposes a data manipulation scheme, so-called Wearout Informed Pattern Elimination (WIPE), to improve the endurance of NVM-based caches by minimizing the activity of frequent data patterns. Simulation results show that WIPE improves the endurance by up to 93% with negligible overhead.
Slides


Session 3S  (Special Session) Let's Secure the Physics of Cyber-Physical Systems
Time: 15:50 - 17:30 Tuesday, January 17, 2017
Location: Room 103
Organizers/Chairs: Mohammad Al Faruque (University of California Irvine, U.S.A.), Anupam Chattopadhyay (Nanyang Technological University, Singapore), Francesco Regazzoni (ALaRI - USI, Switzerland)

3S-1 (Time: 15:55 - 16:25)
Title(Invited Paper) Securing the Hardware of Cyber-Physical Systems
Author*Francesco Regazzoni (ALaRI - USI, Switzerland), Ilia Polian (University of Passau, Germany)
Pagepp. 194 - 199
KeywordCyber-Physical Systems, Security, Hardware Security
AbstractThe cyber-physical system (CPS) paradigm offers tremendous advantages in many application scenarios and promises a solution to a large number of pressing individual and societal needs. However, the properties of such systems, such as heterogeneity, lack of perimeter protection, longevity, pervasive diffusion and strictly constrained resources, also give rise to new security vulnerabilities. In this paper, we discuss security threats related to the hardware blocks of a CPS. We first review attack scenarios affecting the security attributes of confidentiality, integrity and authenticity, and then outline novel attack vectors that target the cyber and the physical aspects of a CPS simultaneously.

3S-2 (Time: 16:25 - 16:55)
Title(Invited Paper) Cross-Domain Security of Cyber-Physical Systems
AuthorSujit Rokka Chhetri, Jiang Wan, *Mohammad Al Faruque (University of California Irvine, U.S.A.)
Pagepp. 200 - 205
KeywordCyber-Physical, Security
AbstractThe interaction between the cyber domain and the physical domain components and processes can be leveraged to enhance the security of the cyber-physical system. In order to do so, we must first analyze various cyber domain and physical domain information flows, and characterize the relation between them using model functions. In this paper, we present a notion of cross-domain security of cyber-physical systems, whereby we present a security analysis framework that can be used for generating novel cross-domain attack models, attack detection methods, etc. We demonstrate how information flows such as discrete domain signal flows and continuous domain energy flows in the cyber and physical domains can be used to generate model functions using data-driven estimation, and use these model functions for performing various cross-domain security analyses. We also demonstrate the practical applicability of the cross-domain security analysis framework using a cyber-physical manufacturing system as a case study.
Slides

3S-3 (Time: 16:55 - 17:25)
Title(Invited Paper) A Systematic Security Analysis of Real-Time Cyber-Physical Systems
AuthorArvind Easwaran, *Anupam Chattopadhyay, Shivam Bhasin (Nanyang Technological University, Singapore)
Pagepp. 206 - 213
KeywordReal-time security, security modeling and analysis, cyber-physical systems security, deadline attacks
AbstractSecurity in Cyber-Physical Systems (CPS) has become a serious concern owing to the rapid adoption of technologies such as plug-and-play connectivity, robotics and remote co-ordination and control. It is well understood that the performance overhead incurred due to security considerations is rather high, which needs to be captured holistically for a real-time CPS with a strict timing budget and hard deadlines. Additionally, attacks in a real-time CPS may only alter the timing behaviour of system components without any changes in functionality, resulting in serious consequences due to missed deadlines. To address this challenging issue, it is necessary to understand the role of diverse components in a real-time CPS and how they expose the system to a malicious attacker. In this paper, we propose a systematic security analysis flow, using a novel Attack Sequence Diagram (ASD), which links the sources, intermediate components and final manifestations of an attack, thereby clearly delineating the attack surfaces of a complex real-time CPS. Based on the ASD, it is possible to evaluate the complexity of an attack and the performance overhead of a countermeasure, and to explore different design trade-offs for a real-time CPS. With the help of real-world and synthetic examples, we demonstrate that ASD seamlessly enables one to map the existing vulnerabilities and uncover new attack possibilities.


Session 3A  Novel Techniques to Improve the Simulation Performance
Time: 15:50 - 17:30 Tuesday, January 17, 2017
Location: Room 102
Chairs: Ing-Jer Huang (National Sun Yat-sen University, Taiwan), Masashi Tawada (Waseda University, Japan)

3A-1 (Time: 15:50 - 16:15)
TitleAutomated Generation of Dynamic Binary Translators for Instruction Set Simulation
Author*Katsumi Okuda, Minoru Yoshida, Haruhiko Takeyama, Minoru Nakamura (Mitsubishi Electric Corporation, Japan)
Pagepp. 214 - 219
KeywordISS, instruction set simulator, DBT, binary translation, emulator
AbstractInstruction set simulators (ISSs) are indispensable tools for developing new architectures and embedded software. Due to the increasing variety of architectures and time-to-market pressure, it is important to efficiently develop fast ISSs based on dynamic binary translation. However, the implementation of such ISSs needs more effort than interpretive ISSs. In this paper, we propose a novel framework that generates ISSs based on dynamic binary translation from the descriptions of interpretive ISSs. Our results on SH, MIPS64, and ARM show that the generated ISSs are 1.4 to 13.4 times faster than their original interpretive ISSs.
Slides

3A-2 (Time: 16:15 - 16:40)
TitleLoop Aware IR-Level Annotation Framework for Performance Estimation in Native Simulation
Author*Omayma Matoussi, Frédéric Pétrot (Laboratoire TIMA, Université Grenoble Alpes, France)
Pagepp. 220 - 225
KeywordELS, System Level Simulation, Native Simulation
AbstractNative simulation is an interesting virtual prototyping candidate to speed up architecture exploration and early software development. However, it does not provide, out of the box, the non-functional information needed for software performance estimation. Annotating software with this information is complex because high-level code and binary code have different structures due to compiler optimizations. This work proposes an annotation framework at the compiler IR level that focuses on loop structures and reflects optimizations through a mapping scheme between the binary and the high-level IR. Experiments on instruction count show an average error of around 2%.
Slides

3A-3 (Time: 16:40 - 17:05)
TitleHybrid Analysis of SystemC Models for Fast and Accurate Parallel Simulation
Author*Tim Schmidt, Guantao Liu, Rainer Dömer (University of California, Irvine, U.S.A.)
Pagepp. 226 - 231
KeywordSystemC, Simulation, Parallel Simulation, Multi Core Simulation, Dynamic Analysis
AbstractParallel SystemC approaches expect a thread safe and race condition free model from the designer or use a compiler which identifies the race conditions. However, they have strong limitations for real world examples. Two major obstacles are: a) all the source code must be available and b) the entire design must be statically analyzable. In this paper, we propose a solution for a fast and fully accurate parallel SystemC simulation which overcomes these two obstacles a) and b). We propose a hybrid approach which includes both static and dynamic analysis of the design model. We also handle library calls in the compiler analysis where the source code of the library functions is not available. Our experiments demonstrate a 100% accurate execution and a speedup of 6.39× for a Network-on-Chip particle simulator.
Slides

3A-4 (Time: 17:05 - 17:30)
TitleVirtual Prototyping of Smart Systems through Automatic Abstraction and Mixed-Signal Scheduling
AuthorMichele Lora, Enrico Fraccaroli, *Franco Fummi (University of Verona, Italy)
Pagepp. 232 - 237
KeywordSystem-level simulation, Virtual Platforms, Analog Mixed-Signals
AbstractThis work proposes a methodology to abstract mixed-signal systems by integrating digital and analog components in a homogeneous virtual platform model for efficient simulation. Two main contributions are provided: 1) an automatic abstraction technique for analog components, which preserves only the details meaningful for the functional behavior of the entire platform, and 2) a novel scheduling technique that exploits temporal decoupling and synchronization of digital and analog processes to simulate them together in a homogeneous model.
Slides


Session 3B  Formal and Informal Verification
Time: 15:50 - 17:30 Tuesday, January 17, 2017
Location: Room 104
Chairs: Jason Verley (Sandia National Laboratory, U.S.A.), Rajit Manohar (Cornell University, U.S.A.)

3B-1 (Time: 15:50 - 16:15)
TitleEfficient Parallel Verification of Galois Field Multipliers
Author*Cunxi Yu, Maciej Ciesielski (University of Massachusetts, Amherst, U.S.A.)
Pagepp. 238 - 243
KeywordFormal verification, Galois field, computer algebra
AbstractGalois field (GF) arithmetic is used to implement critical arithmetic components in communication and security-related hardware, and verification of such components is of prime importance. Current techniques for formally verifying such components are based on computer algebra methods that proved successful in verification of integer arithmetic circuits. However, these methods are sequential in nature and do not offer any parallelism. This paper presents an algebraic functional verification technique for gate-level GF(2^m) multipliers, in which verification is performed in bit-parallel fashion. The method is based on extracting a unique polynomial in the Galois field for each output bit independently. We demonstrate that this method is able to verify an n-bit GF multiplier in n threads. Experiments performed on pre- and post-synthesized Mastrovito and Montgomery multipliers show high efficiency up to 571 bits.
Slides
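
The idea of checking each output bit of a GF(2^m) multiplier independently can be illustrated with a minimal Python sketch. This is not the algebraic rewriting method of the paper: it is an exhaustive simulation for small m, and the names gf2m_mul and output_bit_matches, as well as the choice of reduction polynomial, are assumptions made for illustration only.

```python
def gf2m_mul(a: int, b: int, m: int, poly: int) -> int:
    """Reference product of a and b in GF(2^m).

    poly is the irreducible reduction polynomial, including its x^m term,
    encoded as an integer bit mask (an illustrative convention).
    """
    result = 0
    for _ in range(m):
        if b & 1:               # add (XOR) the current multiple of a
            result ^= a
        b >>= 1
        a <<= 1                 # multiply a by x
        if a & (1 << m):        # reduce when the degree reaches m
            a ^= poly
    return result


def output_bit_matches(candidate, k: int, m: int, poly: int) -> bool:
    """Exhaustively compare output bit k of candidate against the reference.

    Each bit is checked on its own, so the m checks are independent and
    could be distributed over m threads.
    """
    mask = 1 << k
    return all(
        (candidate(a, b) & mask) == (gf2m_mul(a, b, m, poly) & mask)
        for a in range(1 << m)
        for b in range(1 << m)
    )
```

A candidate with a fault on one output bit fails only the check for that bit, which is what makes a per-bit decomposition attractive for parallel execution.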

3B-2 (Time: 16:15 - 16:40)
TitleProperty Mining Using Dynamic Dependency Graphs
Author*Jan Malburg (German Aerospace Center, Germany), Tino Flenker (University of Bremen, Germany), Görschwin Fey (German Aerospace Center, Germany)
Pagepp. 244 - 250
KeywordProperty-Mining, Simulation based, Dataflow-analysis, RTL
AbstractWe present a technique to automatically generate SystemVerilog Assertions from designs using dynamic dependency graphs. With this technique, we extract relations between signals of the design using only a few simulation runs, which drastically reduces the required number of use cases compared to other approaches. Additionally, unlike previous approaches, we do not use expression templates to establish those relations. We abstract from the concrete use cases by inserting symbolic values and by merging similar conditions in time. A model checker verifies the correctness of the generated properties. The evaluation showed that our approach is able to create more expressive properties than the GoldMine tool, while requiring less simulation data.
Slides

3B-3 (Time: 16:40 - 17:05)
TitleCEGAR-Based EF Synthesis of Boolean Functions with an Application to Circuit Rectification
Author*Heinz Riener (German Aerospace Center, Germany), Rüdiger Ehlers (University of Bremen, Germany), Görschwin Fey (German Aerospace Center, Germany)
Pagepp. 251 - 256
KeywordFormal methods, Boolean reasoning, Circuit Rectification, Logic Synthesis
AbstractThe Exists-Forall (EF) synthesis problem deals with finding parameters such that for all input assignments a correctness specification is met. Many standard problems from computer-aided design and verification can be formulated as an instance of EF synthesis when a function template with holes (parameters to be synthesized) is provided. In this paper, we generalize the idea of EF synthesis in the context of Boolean logic by allowing existential quantification over the domain of Boolean functions (rather than Boolean variables) and present a bounded synthesis approach guided by counterexamples to generate them using techniques from Boolean learning. As an application, we present Engineering Change Order (ECO) as an EF synthesis problem and apply the presented approach to incrementally synthesize patches for digital circuits with multiple seeded faults.
Slides
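
A counterexample-guided loop for the basic EF problem (find parameters p such that a specification holds for all inputs x) can be sketched as follows. This is a brute-force illustration under stated assumptions, not the bounded synthesis approach of the paper: the function names, the enumeration-based synthesize and verify steps, and the encoding of a Boolean function as its truth-table bits are all hypothetical choices.

```python
from itertools import product


def synthesize(spec, n_params, counterexamples):
    """Return some parameter tuple consistent with all counterexamples, or None."""
    for p in product((False, True), repeat=n_params):
        if all(spec(p, x) for x in counterexamples):
            return p
    return None


def verify(spec, p, n_inputs):
    """Return an input that violates the specification for p, or None."""
    for x in product((False, True), repeat=n_inputs):
        if not spec(p, x):
            return x
    return None


def cegar_ef(spec, n_params, n_inputs):
    """Counterexample-guided search for p with spec(p, x) true for every x."""
    counterexamples = []
    while True:
        p = synthesize(spec, n_params, counterexamples)
        if p is None:
            return None               # no parameters satisfy the specification
        x = verify(spec, p, n_inputs)
        if x is None:
            return p                  # p is correct for all inputs
        counterexamples.append(x)     # refine and try again


def lut_equals_xor(p, x):
    """Template with holes: a 2-input lookup table whose 4 entries are p."""
    return p[x[0] + 2 * x[1]] == (x[0] != x[1])
```

Calling cegar_ef(lut_equals_xor, 4, 2) searches for the four truth-table bits of a two-input function that matches XOR on every input. Each failed candidate contributes one counterexample, so the candidate search only ever reasons about a small set of inputs.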

3B-4 (Time: 17:05 - 17:30)
TitleAn Extensible Perceptron Framework for Revision RTL Debug Automation
Author*John Adler, Ryan Berryhill, Andreas Veneris (University of Toronto, Canada)
Pagepp. 257 - 262
Keyworddebugging, verification, RTL, revision, regression verification
AbstractAutomated debugging techniques can significantly reduce the manual effort required to localize RTL errors. These techniques return to the user a set of RTL locations where a change can correct erroneous behavior. However, each location must be manually investigated. This problem is exacerbated by the increasing number of failures in the modern regression verification cycle. Recent work in clustering-based revision debugging mitigates this cost by ranking revisions based on their likelihood of having introduced an error. This work presents a perceptron-based approach to revision debugging that can be extended to leverage the revision history of a design directly. Perceptrons are trained using labeled revisions from the design history. They are then used to predict the probability that a revision has introduced an error. The proposed methodology performs competitively with the state of the art, but can be extended to handle more features. This allows for an automated regression debug flow integrated with Version Control and Issue Tracking Systems.
Slides
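
Training a perceptron on labeled revisions and using it to rank new ones can be sketched as below. The abstract does not list the features or the exact model, so everything here is an assumption for illustration: the two example features, the classic perceptron update rule, and the logistic squashing used to turn the activation into a score between 0 and 1.

```python
import math


def train_perceptron(samples, epochs=20, lr=1.0):
    """Train on (features, label) pairs; label 1 = revision introduced an error."""
    n = len(samples[0][0])
    w = [0.0] * n
    b = 0.0
    for _ in range(epochs):
        for x, label in samples:
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if activation > 0 else 0
            err = label - pred
            if err:                      # update only on misclassification
                w = [wi + lr * err * xi for wi, xi in zip(w, x)]
                b += lr * err
    return w, b


def error_score(w, b, x):
    """Squash the activation into (0, 1); an assumed stand-in for a probability."""
    activation = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-activation))


# Hypothetical features: (fraction of design lines changed, touches failing module)
history = [
    ([0.9, 1.0], 1),
    ([0.8, 1.0], 1),
    ([0.1, 0.0], 0),
    ([0.2, 0.0], 0),
]
weights, bias = train_perceptron(history)
```

Adding a feature, for example one derived from an issue tracker, only lengthens the feature vector, which is the kind of extensibility the abstract refers to.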


Session 3C  Pursuing System to Circuit Level Optimality in Timing and Power Integrity
Time: 15:50 - 17:30 Tuesday, January 17, 2017
Location: Room 105
Chairs: Takashi Sato (Kyoto University, Japan), Sheldon Tan (University of California, Riverside, U.S.A.)

3C-1 (Time: 15:50 - 16:15)
TitleAlgorithm for Synthesis and Exploration of Clock Spines
Author*Youngchan Kim, Taewhan Kim (Seoul National University, Republic of Korea)
Pagepp. 263 - 268
Keywordclock spine, clock network, timing, power, noise
AbstractThis work addresses the problem of developing a synthesis algorithm for clock spine networks. The idea is to transform the problem of allocating and placing clock spines on a plane into a slicing floorplan optimization problem, in which every candidate clock spine network structure is uniquely expressed in postfix notation to enable fast cost computation in the slicing floorplan optimization. As a result, it can explore diverse clock spine network structures to find globally optimal ones.
Slides
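
Fast cost computation from a postfix expression can be illustrated with the standard stack-based evaluation of a slicing floorplan. This is a generic sketch, not the spine-specific encoding or cost function of the paper, which the abstract does not describe; the operator names "H" and "V", the block names, and the bounding-box cost are assumptions.

```python
def slicing_bbox(postfix, sizes):
    """Evaluate a slicing floorplan given in postfix notation.

    postfix: sequence of block names and cut operators.
      "V" places the two sub-floorplans side by side (vertical cut line).
      "H" stacks the two sub-floorplans (horizontal cut line).
    sizes: dict mapping block name to (width, height).
    Returns the (width, height) of the bounding box in one linear pass.
    """
    stack = []
    for token in postfix:
        if token in ("H", "V"):
            w2, h2 = stack.pop()
            w1, h1 = stack.pop()
            if token == "V":
                stack.append((w1 + w2, max(h1, h2)))
            else:
                stack.append((max(w1, w2), h1 + h2))
        else:
            stack.append(sizes[token])
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


# Hypothetical block dimensions
blocks = {"a": (2, 3), "b": (1, 4), "c": (3, 1)}
bbox = slicing_bbox(["a", "b", "V", "c", "H"], blocks)
```

Because each candidate is a flat token sequence, an optimizer can perturb it (swap operands, flip operators) and re-evaluate the cost cheaply.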

3C-2 (Time: 16:15 - 16:40)
TitleYield-Driven Redundant Power Bump Assignment for Power Network Robustness
AuthorYu-Min Lee, Chi-Han Lee, *Yan-Cheng Zhu (National Chiao Tung University, Taiwan)
Pagepp. 269 - 274
Keywordpower bump, yield-driven, power supply network, power pad, power network
AbstractDuring the package manufacturing process, open defects of power bumps may cause insufficient power supply and degrade the power network yield. This work presents a redundant power bump insertion method to ensure power integrity by considering the power bump yields. The proposed method can efficiently assign redundant bumps by accurately estimating the location of the worst load yield and minimizing the number of redundant bumps needed to enhance the power network yield.

3C-3 (Time: 16:40 - 17:05)
TitleA Tighter Recursive Calculus to Compute the Worst Case Traversal Time of Real-Time Traffic over NoCs
Author*Meng Liu, Matthias Becker, Moris Behnam, Thomas Nolte (Mälardalen University, Sweden)
Pagepp. 275 - 282
Keywordnetwork-on-chip, timing analysis, round-robin arbitration, real-time application, worst-case traversal time
AbstractNetwork-on-Chip (NoC) is a communication subsystem which has been widely utilized in many-core processors and system-on-chips in general. In this paper, we focus on a Round-Robin Arbitration (RRA) based wormhole-switched NoC, which is a common architecture used in most of the existing implementations. In order to execute real-time applications on such a NoC-based platform, a number of given real-time requirements need to be fulfilled. One of the most typical requirements is schedulability, which refers to determining if real-time packets can be delivered within the given time durations. Timing analysis is a common tool to verify the schedulability of a real-time system. Unfortunately, the existing timing analyses of RRA-based NoCs either provide overly pessimistic estimates, which result in over-allocated resources, or require a large amount of processing, which limits their applicability in practice. Therefore, in this paper, we present an improved timing analysis, aiming to provide more accurate estimates along with acceptable computation time. From the evaluation results, we can clearly observe the improvement achieved by the proposed timing analysis.
Slides

3C-4 (Time: 17:05 - 17:30)
TitleAn Efficient Homotopy-Based Poincaré-Lindstedt Method for the Periodic Steady-State Analysis of Nonlinear Autonomous Oscillators
AuthorZhongming Chen, *Kim Batselier (The University of Hong Kong, Hong Kong), Haotian Liu (Cadence Design Systems, Inc., U.S.A.), Ngai Wong (The University of Hong Kong, Hong Kong)
Pagepp. 283 - 288
Keywordperiodic steady state, autonomous, nonlinear, homotopy, Poincare Lindstedt
AbstractThe periodic steady-state analysis of nonlinear systems has always been an important topic in electronic design automation (EDA). For autonomous systems, the mainstream approaches, like shooting Newton and harmonic balance, are difficult to employ since the period itself becomes an unknown. This paper presents an innovative state-space homotopy-based Poincaré-Lindstedt method, with a novel Padé approximation of the stretched time axis, that effectively overcomes this hurdle. Examples demonstrate the excellent efficiency and scalability of the proposed approach.
Slides



Wednesday, January 18, 2017

Session 2K  Keynote II
Time: 9:00 - 9:50 Wednesday, January 18, 2017
Location: International Conference Room
Chair: Masaharu Imai (Osaka University, Japan)

2K-1 (Time: 9:00 - 9:50)
Title(Keynote Address) Emerging Medical Technologies for Interfacing the Brain: From Deep Brain Stimulation to Brain Computer Interfaces
AuthorNapoleon Torres-Martinez (CEA LETI, France)
Pagep. 289
AbstractEvolving medical technologies, including stimulators, infusion pumps, and neuroprostheses, are progressively addressing a wide range of neurological conditions, bringing fresh hope to patients where other solutions have proven ineffective. In this context, brain-computer interfaces (BCIs), which allow interaction between neural tissue and an external device, have been developed for many diverse conditions; in particular, they have allowed severely motor-disabled patients to communicate and integrate better within their environment. Further, brain electrical stimulators or deep brain stimulators (DBS) have been in use for several decades for Parkinson's disease and other diseases, giving patients new levels of quality of life not possible with more conventional pharmacological therapies. In addition, there are recent reports of various physical optical phenomena that produce a reduction in degeneration within targeted brain areas, opening new avenues for the treatment of debilitating neurological conditions. All these advances are, however, limited by the mandatory long process of technological maturation and testing needed to optimize benefit and ensure safety. Our institution has been developing new devices in these key areas, integrating medical teams and engineering experts from device conception onward and addressing clear clinical problems from the early steps. New technologies are becoming simpler and closer than ever to reality and clinical trials; the design of innovative solutions to improve implantable devices opens a new era in clinical research.


Session 4S  (Special Session) Emerging Technologies for Biomedical Applications: Artificial Vision Systems and Brain Machine Interface
Time: 10:15 - 12:15 Wednesday, January 18, 2017
Location: Room 103
Organizer: Masaharu Imai (Osaka University, Japan), Moderator: Yoshinori Takeuchi (Osaka University, Japan)

4S-1 (Time: 10:15 - 10:45)
Title(Invited Paper) Smart Electrode - Toward a Retinal Stimulator with the Large Number of Electrodes -
Author*Jun Ohta (NAIST, Japan)
Pagep. 290
Keywordretinal prosthesis, stimulator
AbstractTo achieve better vision through a retinal prosthesis, over 1000 electrodes are preferable. When increasing the number of electrodes, we may face a critical issue associated with interconnecting the stimulus electrodes and lead wires while keeping good mechanical flexibility. To address the issue, we have proposed and developed a smart electrode that consists of a stimulus electrode combined with a CMOS microchip on a flexible substrate. Since each microchip can turn its associated electrode on and off for stimulation through external control circuitry, only a small number of interconnection wires, four or five, is required. In this presentation, the concept, fabrication process, and experimental results in vitro and in vivo are shown.

4S-2 (Time: 10:45 - 11:15)
Title(Invited Paper) Strategic Circuits for Neuromodulation of the Visual System
Author*Gregg Jorgen Suaning (University of New South Wales, Australia)
Pagepp. 291 - 294
Keywordneuroprosthesis, visual prosthesis, neuromodulation, stimulation circuitry
AbstractAdvancement of retinal neuro-prosthesis technologies has reached a point where multiple devices are now available clinically as a therapeutic treatment for degenerative disorders of the retina. Reported outcomes have typically fallen short of patient and researcher expectations, and true restoration of vision remains an elusive objective. Here, the state of the art in visual prostheses is explored from the perspective of stimulus delivery, and solutions for outstanding challenges are proposed and analyzed.
Slides

4S-3 (Time: 11:15 - 11:45)
Title(Invited Paper) Design Considerations and Clinical Applications of Closed-Loop Neural Disorder Control SoCs
Author*Chung-Yu Wu, Cheng-Hsiang Cheng, Yi-Huan Ou-Yang (National Chiao Tung University, Taiwan), Chiung-Chu Chen (Chang Gung Memorial Hospital and University, Taiwan), Wei-Ming Chen, Ming-Dou Ker, Chen-Yi Lee (National Chiao Tung University, Taiwan), Sheng-Fu Liang, Fu-Zen Shaw (National Cheng Kung University, Taiwan)
Pagepp. 295 - 298
KeywordClosed-loop neuromodulation, Deep brain stimulator, Epileptic seizure control SoC
AbstractThis paper presents the closed-loop neural disorder control concept and some design considerations. Two architectures of closed-loop neuromodulation for Parkinson's disease and epileptic seizure are proposed. One is a closed-loop deep brain stimulator, which meets the IEC 60601-1 standard. The other one is an implantable SoC for epileptic seizure control, which is verified by animal experiment.
Slides

4S-4 (Time: 11:45 - 12:15)
Title(Panel Discussion) Emerging Technologies for Biomedical Applications: Artificial Vision Systems and Brain Machine Interface
Author*Panelists: Jun Ohta (NAIST, Japan), Gregg Jorgen Suaning (University of New South Wales, Australia), Chung-Yu Wu (National Chiao Tung University, Taiwan), Napoleon Torres-Martinez (CEA-Leti, France)
Pagep. 299
KeywordBiomedical Applications, Artificial Vision Systems, Brain Machine Interface
AbstractBiomedical applications will be among the most important applications in the VLSI design field in the near future, and this special session presents the hottest biomedical applications related to artificial vision systems and brain machine interfaces. We invite three speakers, Prof. Ohta from Japan, Prof. Suaning from Australia, and Prof. Wu from Taiwan, together with keynote speaker Dr. Torres-Martinez from France, as leading authorities in this field for a panel discussion. Attendees will learn about the worldwide state of the art in this area and about the difficulties and challenges of developing biomedical application systems and biomedical applications themselves.
Slides


Session 4A  Power and Thermal Management
Time: 10:15 - 12:20 Wednesday, January 18, 2017
Location: Room 102
Chair: Koji Inoue (Kyushu University, Japan)

4A-1 (Time: 10:15 - 10:40)
TitleA Tool for Synthesizing Power-Efficient and Custom-Tailored Wavelength-Routed Optical Rings
Author*Marta Ortín-Obón (University of Zaragoza, Spain), Luca Ramini (University of Ferrara, Italy), Víctor Viñals-Yúfera (University of Zaragoza, Spain), Davide Bertozzi (University of Ferrara, Italy)
Pagepp. 300 - 305
Keywordnetworks-on-chip, silicon photonics, optical ring
AbstractOut of all the optical network-on-chip topologies, the ring has been proved to be far superior to its competitors: the contention-free all-to-all communications offer the lowest latency possible, while its clean physical design with few crossings and ring resonators provides unmatchable power results. The ring implements simultaneous communications by using a communication matrix that sets a distinctive waveguide-wavelength pair for each of them. That communication matrix has a high impact on energy consumption, but so far there have been very few efforts towards optimizing and automating its design. As far as we know, we propose the best optical ring design algorithm, which produces rings with the lowest number of wavelengths and waveguides in the literature. The algorithm is completed with a layout-aware and fully automated laser power calculation framework to help the user choose the most power-efficient design point.
Slides

4A-2 (Time: 10:40 - 11:05)
TitleIslands of Heaters: A Novel Thermal Management Framework for Photonic NoCs
Author*Dharanidhar Dang (Texas A&M University, U.S.A.), Sai Vineel Reddy Chittamuru (Colorado State University, U.S.A.), Rabi N Mahapatra (Texas A&M University, U.S.A.), Sudeep Pasricha (Colorado State University, U.S.A.)
Pagepp. 306 - 311
KeywordPhotonics, NoC, Adaptive, Simulation
AbstractSilicon-photonics has become a promising candidate for networks-on-chip (NoC). But photonic NoCs (PNoCs) are very sensitive to temperature variations that frequently occur on-chip. These variations can create significant reliability-issues for PNoCs. This paper proposes a novel run-time framework which integrates an adaptive-heater-mechanism with a thread-migration scheme to overcome temperature-induced issues in PNoCs. Simulations on 64-core PNoCs show total power reduction by up to 64.1% while maintaining high network bandwidth.
Slides

4A-3 (Time: 11:05 - 11:30)
TitleEnergy-Aware Loops Mapping on Multi-Vdd CGRAs without Performance Degradation
Author*Jiangyuan Gu, Shouyi Yin, Leibo Liu, Shaojun Wei (Tsinghua University, China)
Pagepp. 312 - 317
KeywordCGRA, Multi-Vdd, Modulo Scheduling, Performance, Energy-Efficiency
AbstractCoarse Grained Reconfigurable Architectures (CGRAs) have received increasing attention due to their inherent advantages of high performance and energy efficiency. The multi-Vdd technique is popularly used to reduce energy consumption, and modulo scheduling is one of the widely used pipelining techniques to improve performance. To achieve both high performance and energy efficiency simultaneously, this paper proposes an energy-aware mapping algorithm integrating multi-Vdd assignment into the scheduling and mapping procedures of loop applications. Also, an energy-aware FDS (eFDS) algorithm and a rapid MCC searching method based on the compatibility concept are adopted to solve the bi-objective optimization problem. The experimental results show that the proposed approach brings 18.7% energy reduction and 1.44X energy-efficiency improvement while keeping optimized performance.
Slides

4A-4 (Time: 11:30 - 11:55)
TitleAlgorithm Accelerations for Luminescent Solar Concentrator-Enhanced Reconfigurable Onboard Photovoltaic System
Author*Caiwen Ding (Syracuse University, U.S.A.), Ji Li (University of Southern California, U.S.A.), Weiwei Zheng (Syracuse University, U.S.A.), Naehyuck Chang (Korea Advanced Institute of Science and Technology (KAIST), Republic of Korea), Xue Lin (Northeastern University, U.S.A.), Yanzhi Wang (Syracuse University, U.S.A.)
Pagepp. 318 - 323
KeywordLuminescent Solar Concentrator, Photovoltaic, EV/HEV, Reconfiguration, Partial Shading
AbstractElectric vehicles (EVs) and hybrid electric vehicles (HEVs) are growing in popularity. Onboard photovoltaic (PV) systems have been proposed to overcome the limited all-electric driving range of EVs/HEVs. However, there exist obstacles to the wide adoption of onboard PV systems such as low efficiency, high cost, and low compatibility. To tackle these limitations, we propose to adopt the semiconductor nanomaterial-based luminescent solar concentrator (LSC)-enhanced PV cells into the onboard PV systems. In this paper, we investigate methods of accelerating the reconfiguration algorithm for the LSC-enhanced onboard PV system to reduce computational/energy overhead and capital cost. First, in the system design stage, we group LSC-enhanced PV cells into macrocells and reconfigure the onboard PV system based on macrocells. Second, we simplify the partial shading scenario by assuming an LSC-enhanced PV cell is either lighted or completely shaded (Algorithm 1). Third, we make use of the observation that the conversion efficiency of the charger is high and nearly constant as long as its input voltage exceeds a threshold value (Algorithm 2). We test and evaluate the effectiveness of the two proposed algorithms by comparing them with the optimal PV array reconfiguration algorithm and simulating an LSC-enhanced reconfigurable onboard PV system using solar irradiance traces measured during vehicle driving. Experiments demonstrate that the output power of Algorithm 1 in the first scenario is 9.0% lower on average than that of the optimal PV array reconfiguration algorithm. In the second scenario, we observe an average of 1.16X performance improvement for the proposed Algorithm 2.
Slides

4A-5 (Time: 11:55 - 12:20)
TitleTwo-stage Thermal-Aware Scheduling of Task Graphs on 3D Multi-cores Exploiting Application and Architecture Characteristics
Author*Zuomin Zhu (Hong Kong University of Science and Technology, Hong Kong), Vivek Chaturvedi (Nanyang Technological University, Singapore), Amit Kumar Singh (University of Southampton, U.K.), Wei Zhang (Hong Kong University of Science and Technology, Hong Kong), Yingnan Cui (Nanyang Technological University, Singapore)
Pagepp. 324 - 329
KeywordThermal-aware scheduling, 3D CMP, Task graph
AbstractIn this paper, we propose a two-stage thermal-aware task scheduling policy which exploits the application and system architecture characteristics to decouple the mapping of task-graphs for performance and peak temperature optimization into two stages. At the first stage, the algorithm collects the best mapping of task-graphs, exploiting the application and architecture characteristics to minimize the makespan of the task-graphs. At the second stage, a light-weight online algorithm comprised of efficient thermal rank and combined power models is performed to map the task nodes to the real cores for temperature minimization while maintaining the best possible performance achieved in the first stage. Compared to previous approaches, which perform the performance and temperature optimization together, our method can reduce the online mapping algorithm complexity and improve its efficiency. Experiments on real benchmarks show that an average of 6.3°C peak temperature reduction and 6.8% performance improvement can be achieved compared to other existing methods.
Slides


Session 4B  Emerging Topics in Hardware Security
Time: 10:15 - 12:20 Wednesday, January 18, 2017
Location: Room 104
Chairs: Xiaoxiao Wang (Beihang University, China), Kazuo Sakiyama (University of Electrical Communications, Japan)

4B-1 (Time: 10:15 - 10:40)
TitleEnsuring System Security through Proximity Based Authentication
AuthorJoshua Marxen, *Alex Orailoglu (University of California, San Diego, U.S.A.)
Pagepp. 330 - 335
KeywordSecurity, Authentication, Localization, RSSR
AbstractAs Internet of Things applications using embedded systems enter wider markets, securing systems against attacks becomes necessary. In many applications, securely determining transmitter location helps maintain system security. A relatively new RF-based localization technique called Received Signal Strength Ratio (RSSR) has potential utility in securing ad-hoc networks and body-area networks. We describe a novel attack on the security of such systems, and discuss a set of mitigation strategies that restore the effectiveness of RSSR.

4B-2 (Time: 10:40 - 11:05)
TitleVOLtA: Voltage Over-scaling Based Lightweight Authentication for IoT Applications
AuthorMd Tanvir Arafin, Mingze Gao, *Gang Qu (University of Maryland, College Park, U.S.A.)
Pagepp. 336 - 341
KeywordVoltage over-scaling, Authentication, Approximate Computation, IoT
AbstractIncorporating security protocols in IoT components is challenging due to their extremely constrained resources. We address this challenge by proposing a hardware-oriented lightweight authentication protocol based on a device signature generated during voltage over-scaling (VOS). First, we demonstrate that VOS-based computing leaves a process-variation-dependent error signature in its approximate results. This error can be methodically profiled to extract information about the underlying process variation in the computation unit. We then combine this error profile with security key based authentication schemes to create a two-factor authentication mechanism. To understand the effectiveness of this protocol, we perform detailed security analysis under various attack scenarios. Finally, we simulate the authentication hardware using a process variation aware 45nm design library in HSpice. Simulation results show that our VOS-based assumptions are valid, and this authentication mechanism can withstand basic environmental variations. Overall, our work provides a unique approach for using hardware process variations as a key for authentication.
Slides

4B-3 (Time: 11:05 - 11:30)
TitleSecurity Analysis of Anti-SAT
AuthorMuhammad Yasin (New York University, U.S.A.), Bodhisatwa Mazumdar, *Ozgur Sinanoglu (New York University Abu Dhabi, United Arab Emirates), Jeyavijayan Rajendran (The University of Texas at Dallas, U.S.A.)
Pagepp. 342 - 347
Keywordhardware security, logic encryption, design for trust, Boolean Satisfiability
AbstractLogic encryption protects integrated circuits (ICs) against intellectual property (IP) piracy and overbuilding attacks by encrypting the IC with a key. A Boolean satisfiability (SAT) based attack breaks all existing logic encryption techniques within a few hours. Recently, a defense mechanism known as Anti-SAT was presented that protects against the SAT attack by rendering the SAT-attack effort exponential in terms of the number of key gates. In this paper, we highlight the vulnerabilities of Anti-SAT and propose a signal probability skew (SPS) attack against the Anti-SAT block. The SPS attack leverages the structural traces in the Anti-SAT block to identify and isolate it. The attack is 100% successful on all variants of the Anti-SAT block. The SPS attack is scalable to large circuits, as it breaks circuits with up to 22K gates within two minutes.

4B-4 (Time: 11:30 - 11:55)
TitleExploiting Accelerated Aging Effect for On-line Configurability and Hardware Tracking
AuthorYang You, *Jie Gu (Northwestern University, U.S.A.)
Pagepp. 348 - 353
KeywordAging, Configurability, Tracking, Security
AbstractConventional CMOS technology lacks an efficient way of realizing reconfigurability, which is a highly desired feature for applications such as hardware tracking for security. On the other hand, the traditionally “undesirable” aging effect has presented a non-volatile “memory” to the CMOS chip. This paper exploits the aging effects in standard CMOS to enable non-volatile configurability of the chip for the application of hardware tracking. A novel accelerated aging circuit is developed to shorten the required stress time to a few seconds of operation. Due to the significant challenges posed by process variation in advanced CMOS technology, a novel stochastic processing methodology is proposed to significantly reduce the failure rate of the tracking and detection. Combining both circuit-level and system-level acceleration, chip usage tracking can be realized within seconds of usage, in contrast with the days of operation required by previously reported aging monitors. The design was implemented and simulated in 45nm CMOS technology with less than 25µW power consumption and compact sizes for easy insertion as a silicon IP using only core transistors. The robustness of the proposed stochastic processing technique has been verified using transistor-level Monte-Carlo simulation. Compared with existing aging monitors, the proposed techniques accelerate the process by thousands of times, enabling the desired online configurability.
Slides

4B-5 (Time: 11:55 - 12:20)
TitleSGXCrypter: IP Protection for Portable Executables Using Intel's SGX Technology
Author*Dimitrios Tychalas (Electrical and Computer Engineering New York University Abu Dhabi, United Arab Emirates), Nektarios Georgios Tsoutsos (Computer Science and Engineering New York University Polytechnic School of Engineering, U.S.A.), Michail Maniatakos (Electrical and Computer Engineering New York University Abu Dhabi, United Arab Emirates)
Pagepp. 354 - 359
KeywordCrypter, SGX, Portable Executable
AbstractExecutable packing schemes are popular for obfuscating the binary code of a target program through compression or encryption, and can be leveraged for protecting proprietary code against reverse-engineering. Although ensuring confidentiality, packed executables are prepended with decryption or decompression code that processes the rest of the binary, which is a lucrative target for reverse-engineering attackers. To thwart such attacks, we introduce a novel packing scheme, SGXCrypter, which utilizes Intel's novel Software Guard Extensions to securely unpack and execute Windows binaries.
Slides


Session 4C  Manufacturability and Emerging Techniques
Time: 10:15 - 12:20 Wednesday, January 18, 2017
Location: Room 105
Chairs: Taewhan Kim (Seoul National University, Republic of Korea), Wenjing Rao (University of Illinois, U.S.A.)

4C-1 (Time: 10:15 - 10:40)
TitleNetwork Flow Based Cut Redistribution and Insertion for Advanced 1D Layout Design
Author*Ye Zhang, Wai-Shing Luk, Fan Yang, Changhao Yan (Fudan University, China), Hai Zhou (Northwestern University, U.S.A.), Dian Zhou (University of Texas at Dallas, U.S.A.), Xuan Zeng (Fudan University, China)
Pagepp. 360 - 365
Keywordcut redistribution, 1D layout, cut insertion, network flow
AbstractEnd Cutting 1D layout design is a promising candidate for sub-10nm process nodes. Given a 1D layout with horizontal wires, a cut redistribution technique is used to slide the line-end cuts in order to align them vertically or resolve spacing conflicts. The aligned cuts can then be merged into a single shot of cuts. In this paper, we propose a network flow based method for efficient cut redistribution and insertion. Normally, a pair of movable cuts can have three possible relations: left-of, right-of and merge-into. We observe that if the left-right-merge orderings of cuts are fixed, the cut redistribution can be formulated as a network flow problem, which can be solved efficiently. We also find that inserting cuts can resolve the spacing conflicts in some circumstances. This cut insertion strategy is introduced in our proposed method to reduce the spacing conflicts. Moreover, the complementary e-beam lithography for printing the cuts is also considered in this paper. Experimental results show that compared with a previous ILP-based method, our method can achieve a 200X speedup and competitive solution quality.
Slides

4C-2 (Time: 10:40 - 11:05)
TitleAn Efficient Algorithm for Stencil Planning and Optimization in E-Beam Lithography
Author*Jiabei Ge, Changhao Yan (State Key Lab of ASIC&Syst, Fudan University, China), Hai Zhou (Department of Electrical Engineering and Computer Science, Northwestern University, U.S.A.), Dian Zhou (Department of Electrical Engineering, University of Texas at Dallas, U.S.A.), Xuan Zeng (State Key Lab of ASIC&Syst, Fudan University, China)
Pagepp. 366 - 371
KeywordElectron Beam Lithography, Overlapping aware Stencil Planning
AbstractCharacter projection is a promising technique to dramatically improve the throughput of E-beam lithography. However, its effectiveness depends on how well the stencils are planned and optimized. Recently, Kuang and Young proposed an efficient heuristic based on 2-D bin-packing for stencil optimization. In this paper, we identify drawbacks in their approach and develop a better algorithm that reduces the shot numbers to less than half of theirs on average. The key idea is to introduce the merit frequency/area (f/A) to select candidate characters. Experimental results verify the effectiveness of the proposed method.
Slides
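
As a rough illustration of the frequency/area merit mentioned in the abstract of 4C-2, the sketch below ranks candidate characters by f/A and greedily fills a stencil with a fixed area budget. It is a toy under stated assumptions (a scalar area budget, no overlap sharing between characters, and hypothetical names such as `Character` and `select_characters`), not the authors' algorithm.

```python
from typing import List, NamedTuple


class Character(NamedTuple):
    name: str
    frequency: int  # assumed: number of occurrences of the pattern in the layout
    area: float     # assumed: stencil area the character would occupy


def select_characters(candidates: List[Character], stencil_area: float) -> List[Character]:
    """Greedily pick the characters with the highest frequency/area ratio that still fit."""
    ranked = sorted(candidates, key=lambda c: c.frequency / c.area, reverse=True)
    chosen: List[Character] = []
    used = 0.0
    for c in ranked:
        if used + c.area <= stencil_area:
            chosen.append(c)
            used += c.area
    return chosen


# Made-up candidates purely for illustration.
candidates = [
    Character("A", frequency=90, area=30.0),  # f/A = 3.0
    Character("B", frequency=40, area=10.0),  # f/A = 4.0
    Character("C", frequency=50, area=25.0),  # f/A = 2.0
    Character("D", frequency=10, area=20.0),  # f/A = 0.5
]
chosen = select_characters(candidates, stencil_area=50.0)
print([c.name for c in chosen])
```

Ranking by f/A rather than by frequency alone favors small, frequently used characters, which is the intuition the abstract attributes to the merit; a real stencil planner would also have to handle 2-D packing and overlapping blank margins.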

4C-3 (Time: 11:05 - 11:30)
TitleFlexible Interconnect in 2.5D ICs to Minimize the Interposer's Metal Layers
AuthorDaniel P. Seemuth, *Azadeh Davoodi, Katherine Morrow (University of Wisconsin - Madison, U.S.A.)
Pagepp. 372 - 377
KeywordFlexible interconnect, FPGA, 2.5D IC, pin assignment, 3D routing
AbstractIn 2.5D ICs, the number of metal layers in interposers contributes strongly to interposer manufacturing costs. Many systems implemented as 2.5D ICs include component dies, e.g. FPGA dies, that have flexible interconnect that increases connectivity options within the 2.5D IC. We present the first work to leverage flexible interconnect in FPGA dies within a 2.5D IC to decrease routing metal layers in the interposer. This is done by performing 3D global routing and reassigning flexible pins in the FPGA dies. In our experiments, we reduce the number of metal layers by up to 33% versus the number of layers required before reassigning flexible pins.
Slides

4C-4 (Time: 11:30 - 11:55)
TitleOptimizing DSA-MP Decomposition and Redundant Via Insertion with Dummy Vias
AuthorChung-Yao Hung, *Peng-Yi Chou, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Pagepp. 378 - 383
KeywordDSA, Redundant Via Insertion, Multiple Patterning, Dummy Via
AbstractBlock copolymer directed self-assembly (DSA) has emerged as an economical complementary technology in the midst of next generation lithography. In particular, DSA has a strong potential for contact hole and via patterning. However, high via density in sub-10nm technology nodes makes it very hard to manufacture all the vias using DSA only. Complementing DSA with multiple patterning (MP) is an option, in which multiple masks are used to print the DSA guiding templates that guide the self-assembly of the block copolymer. In addition, redundant via insertion is desirable because it is an effective means to reduce yield loss due to via defects and to improve reliability. To the best of our knowledge, this is the first work to consider DSA-MP decomposition and redundant via insertion with dummy via consideration to enhance manufacturability. Experimental results show the effectiveness and efficiency of our method compared with previous works; it resulted in 0 unmanufacturable vias for all benchmarks.
Slides

4C-5 (Time: 11:55 - 12:20)
TitleDesign of Multiple Fanout Clock Distribution Network for Rapid Single Flux Quantum Technology
AuthorNaveen Katam, Alireza Shafaei, *Massoud Pedram (University of Southern California, U.S.A.)
Pagepp. 384 - 389
KeywordRSFQ, Fanout, margin, yield
AbstractRapid Single Flux Quantum (RSFQ) logic cells have traditionally been limited to driving one fanout cell because of the difficulty in distributing the single flux quantum pulse to multiple fanouts. This paper presents a method to modify the standard RSFQ cells at their input/output interfaces to other cells in order to support a multiple-fanout drive capability. This capability is especially useful for clock distribution, because RSFQ logic requires the clock signal to be provided to every logic gate. This paper therefore focuses on the clock signal driving more than one cell without the use of splitters. The potential tradeoff is lower margins for the cells. However, by careful design of the RSFQ cells, the yield is not compromised by our proposed technique.
Slides


Session 5S  (Designers' Forum) Advanced Devices and Networks for IoT Applications
Time: 13:50 - 15:30 Wednesday, January 18, 2017
Location: Room 103
Organizers: Koichiro Yamashita (Fujitsu Laboratories, Japan), Tatsuo Shiozawa (Toshiba, Japan), Masaru Kokubo (Hitachi, Japan), Chair: Koichiro Yamashita (Fujitsu Laboratories, Japan)

5S-1 (Time: 13:50 - 14:15)
Title(Invited Paper) Implementation of Reliable and Maintenance-Free Wireless Multihop Networks
AuthorRen Sakata, Suhwuk Kim, Hiroki Kudo (Toshiba Corporation, Japan)
KeywordIoT
AbstractWireless sensor networks have recently attracted attention since they are able to realize continuous observation of old infrastructures such as bridges and tunnels. Even places requiring long-term observation for detecting structural deterioration can be monitored simply by attaching a sensor. In this presentation, a prototype of a maintenance-free sensor node is introduced. The key features of this technology are the autonomous formation of a mesh network and the achievement of low power consumption. Without the use of special settings, each wireless sensor autonomously recognizes ambient conditions, adjusts the timing of communications and forms a communication network with neighboring sensors. The network also minimizes standby power based on communication conditions, so that wireless communication can be maintained for long periods using batteries alone. This maintenance-free system reliably collects sensor data across a wide area.

5S-2 (Time: 14:15 - 14:40)
Title(Invited Paper) High-performance and Low-power Embedded Memory for Edge Computing System
AuthorMasami Nakajima (Renesas Electronics Corporation, Japan)
KeywordIoT
AbstractIn future edge computing systems, we need to reduce size and weight while maintaining high performance. The CPU must deliver 100-200MHz performance. We must also reduce the size of the battery, which is the largest and heaviest part of all. Therefore, low power is necessary. In this work, by using high-performance and low-power embedded flash technology and a low-power memory access scheme that includes a low-voltage bus architecture and an automatic power on/off scheme, we have succeeded in reducing power consumption while retaining high computing power.

5S-3 (Time: 14:40 - 15:05)
Title(Invited Paper) Ultra-Low-Power Wireless Sensor Nodes with Energy Harvesting, and IoT gateway Technology
AuthorHiroki Morimura (NTT Corporation, Japan)
KeywordIoT
AbstractUltra-low-power circuit techniques for energy harvesting and IoT gateway technology for connecting various sensors are described. The power generated by an energy harvester falls to the nanowatt level when a sensor node shrinks to millimeter size. The portfolio of energy harvesting and circuit technology is discussed from the viewpoints of technical and application issues. Then, nanowatt-level wireless circuit techniques are explained. Moreover, the concept of IoT gateway technology for improving “connectivity” when gathering huge amounts of data is presented.

5S-4 (Time: 15:05 - 15:30)
Title(Invited Paper) Fast Channel Switching Technique for Interference Avoidance with 5 GHz Dual Channel Wireless LAN
AuthorTakashi Takeuchi (Hitachi Ltd., Japan)
KeywordIoT
AbstractIn 5 GHz wireless LAN, a DFS (dynamic frequency selection) technology must be implemented to avoid interference from radars. However, in accordance with Japanese regulations, communication is interrupted for 60 seconds by the radar scan after communication frequency switches. We addressed this challenge by developing a novel wireless access point possessing two RF modules. Preparing another channel in the access point enables fast and stable frequency switching at the initiative of a station. Measurement results of the prototype system showed a frequency switching delay time of 11 milliseconds under the DFS environment.


Session 5A  Approximate Computation for Energy Efficiency
Time: 13:50 - 15:30 Wednesday, January 18, 2017
Location: Room 102
Chairs: Li Shang (University of Colorado, U.S.A.), Shinobu Miwa (University of Electro-Communications, Japan)

5A-1 (Time: 13:50 - 14:15)
TitleA Novel Data Format for Approximate Arithmetic Computing
Author*Mingze Gao, Qian Wang, Akshaya Sharma Kankanhalli Nagendra, Gang Qu (University of Maryland, College Park, U.S.A.)
Pagepp. 390 - 395
KeywordApproximate Computing, Low Power Design, Approximate Data Format
AbstractApproximate computing has become one of the most popular paradigms in the modern computing field. It takes advantage of the error-tolerant nature of applications such as machine learning and image/signal processing, and balances the tradeoff between computation quality and computation overhead. In this paper, we propose an approximate integer format (AIF) and a corresponding arithmetic computing scheme. During computation, we dynamically segment the operands and apply precise computation to the important segments so that the computation can be allocated to computing units with much smaller bit-width. The experimental results show that our approximation scheme significantly reduces power consumption while guaranteeing the overall accuracy of the results. Moreover, unlike existing approximation hardware designs, which only allow one type of arithmetic operation (e.g. approximate adder or multiplier), our approximation scheme can be implemented on all the arithmetic operations with very minor changes. The proposed approximate integer format can be transplanted to the fixed-point format so that applications are not constrained to integer operations.
Slides
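
To make the idea of operand segmentation in 5A-1 more concrete, here is a minimal sketch that keeps only the `k` most significant bits of each non-negative operand (starting at its leading one) and multiplies those narrow segments. The functions `leading_segment` and `approx_multiply`, the choice of `k`, and the truncation rule are all assumptions made for illustration; the abstract does not specify the actual format.

```python
from typing import Tuple


def leading_segment(x: int, k: int) -> Tuple[int, int]:
    """Return (segment, shift): the top k bits of x and how far they were shifted down."""
    if x < 0:
        raise ValueError("this sketch handles non-negative integers only")
    shift = max(x.bit_length() - k, 0)
    return x >> shift, shift


def approx_multiply(a: int, b: int, k: int = 4) -> int:
    """Multiply two integers using only their k-bit leading segments."""
    seg_a, shift_a = leading_segment(a, k)
    seg_b, shift_b = leading_segment(b, k)
    # The narrow k-bit by k-bit product is exact; the result is then scaled back.
    return (seg_a * seg_b) << (shift_a + shift_b)


a, b = 1000, 300
exact = a * b
approx = approx_multiply(a, b, k=4)
rel_error = abs(exact - approx) / exact
print(exact, approx, round(rel_error, 4))
```

With `k=4` the multiplier only ever sees 4-bit inputs, which is where a power saving would come from in hardware; the price is a relative error that shrinks as `k` grows.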

5A-2 (Time: 14:15 - 14:40)
TitleApproxPIM: Exploiting Realistic 3D-stacked DRAM for Energy-Efficient Processing In-memory
Author*Yibin Tang (University of Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences, China), Ying Wang, Huawei Li, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 396 - 401
KeywordPIM, memory wall, approximate computing
AbstractIn this paper, we propose a light-weight PIM architecture, approxPIM, which leverages approximate computing techniques to enable In-Memory Processing in realistic 3D-stacked DRAM, Micron's Hybrid Memory Cube (HMC). Using the newly-released atomic instruction support of the HMC, approxPIM can process a wide range of data-intensive applications without adding any logic resources into the memory devices. Evaluation results show that our approxPIM significantly boosts the energy-efficiency and performance of the whole system.
Slides

5A-3 (Time: 14:40 - 15:05)
TitleApproxEye: Enabling Approximate Computation Reuse for Microrobotic Computer Vision
AuthorXin He (State Key Laboratory of Computer Architecture, Institute of computing technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), *Guihai Yan (State Key Laboratory of Computer Architecture, Institute of computing technology, Chinese Academy of Sciences, China), Faqiang Sun (State Key Laboratory of Computer Architecture, Institute of computing technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Yinhe Han, Xiaowei Li (State Key Laboratory of Computer Architecture, Institute of computing technology, Chinese Academy of Sciences, China)
Pagepp. 402 - 407
Keywordapproximate computing, computation reuse
AbstractAiming at real-life problems, microrobotic systems have gained more and more attention. However, the limited achievable performance of a microrobotic system prevents it from carrying out complex tasks. Current research works propose customized designs for different applications and incorporate dedicated accelerators for high energy efficiency. Such techniques not only require significant effort and expertise for specific applications, but also consume a non-negligible amount of chip resources. In this paper we propose ApproxEye, a partial approximate computation reuse framework to accelerate microrobotic computer vision. Leveraging computation locality, ApproxEye reuses results of previous "similar" computations to reduce redundant computations. To exploit every computation reuse opportunity, ApproxEye heuristically defines the optimal reuse granularity and applies adaptive reuse requirements to different computations. Moreover, to reduce the latency of computation reuse, ApproxEye tailors a parallel search scheme for approximate computation reuse. Experimental results show that ApproxEye can effectively exploit the potential of computation reuse and achieve a 57.05% speedup on average.
Slides

5A-4 (Time: 15:05 - 15:30)
TitleOn Resilient Task Allocation and Scheduling with Uncertain Quality Checkers
Author*Qian Zhang, Ting Wang, Qiang Xu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 408 - 413
KeywordApproximate Computing, Task Scheduling, Quality Management, Quality Checking
AbstractMany emerging applications are inherently error-resilient and hence do not require exact computation. Previous work on the resilience-aware task allocation and scheduling problem first generates an initial energy-efficient task schedule on a voltage-scalable multiprocessor system at design time, and then conducts voltage adjustment according to runtime quality checking results. While the energy efficiency improvements are quite encouraging, the final quality requirement might be violated because quality checkers are usually designed based on partial information and are not always correct. In this paper, we propose to address the uncertainty issue of quality checkers in resilient task allocation and scheduling. To be specific, given the initial task schedule and quality checkers, we propose (i) a solution that ensures the final output quality with maximized probability, which can be applied to any approximate computing quality management system; and (ii) a greedy runtime algorithm to achieve optimized energy efficiency gains. Experimental results on various task graphs demonstrate the efficacy of our proposed technique.
Slides


Session 5B  Advance Test and Fault Tolerant Technologies
Time: 13:50 - 15:30 Wednesday, January 18, 2017
Location: Room 104
Chairs: Satoshi Ohtake (Oita University, Japan), Ying Wang (Chinese Academy of Sciences, China)

5B-1 (Time: 13:50 - 14:15)
TitleAn Artificial Neural Network Approach for Screening Test Escapes
AuthorFan Lin (University of California, Santa Barbara, U.S.A.), *Kwang-Ting Tim Cheng (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 414 - 419
Keywordartificial neural network, test escape, yield loss, statistical test, machine learning
AbstractIn this paper we investigate the application of an artificial neural network (ANN) for screening test escapes. Specifically, we propose to train an autoencoder, an ANN, in an unsupervised way to fit the good chip population. We demonstrate that an autoencoder-based classification could achieve a higher detection rate for test escapes and a significant reduction in runtime and memory usage, compared with an SVM applied on the same features and some additional proximity features generated from multiple nonlinear transformations.
Slides

5B-2 (Time: 14:15 - 14:40)
TitleProcessor Shield for L1 Data Cache Software-Based On-line Self-testing
Author*Ching-Wen Lin, Chung-Ho Chen (Institute of Computer and Communication Engineering, National Cheng Kung University, Taiwan)
Pagepp. 420 - 425
KeywordDVFS, fault coverage, March algorithm, on-line testing, SBST
AbstractConventional software-based cache self-tests typically ignore system related testing issues, such as physical memory layout, virtual memory mapping, and isolating faulty effects, especially for on-line testing. We propose an architectural support for data cache software-based self-testing (SBST): Processor Shield, which can tackle difficult-to-test issues during on-line SBST. The proposed processor shield includes a software framework and design for testing (DFT) hardware, which enables SBST program to run without influencing other processes and on-bus devices even if a cache test fails. The proposed SBST process can be iteratively executed and cooperate with dynamic voltage frequency scaling (DVFS) system to calibrate the required guardbands to accommodate transistor aging effects. Finally, we present a case study that performs SBST programs under Linux kernel on an ARMv5-compatible processor system. Our method can successfully switch between the SBST process and the kernel process and achieve the expected high fault coverages for cache control logic and RAM module testing.
Slides

5B-3 (Time: 14:40 - 15:05)
TitlePredicting Vt Variation and Static IR Drop of Ring Oscillators Using Model-Fitting Techniques
AuthorTzu-Hsuan Huang, *Wei-Tse Hung, Hao-Yu Yang, Wen-Hsiang Chang (National Chiao Tung University, Taiwan), Ying-Yen Chen, Chun-Yi Kuo, Jih-Nung Lee (Realtek Semiconductor Corp., Taiwan), Mango Chia-Tso Chao (National Chiao Tung University, Taiwan)
Pagepp. 426 - 431
Keywordring oscillator, Vt variation, IR drop, process monitoring, model-fitting techniques
AbstractThis paper presents a statistical model-fitting framework to efficiently decompose the impact of device Vt variation and power-network IR drop from the measured ring-oscillator frequencies without adding any extra circuitry to the original ring oscillators. The framework applies Gaussian process regression as its core model-fitting technique and stepwise regression as a preprocessing step to select significant predictor features. Experiments based on SPICE simulation of an industrial 28nm technology demonstrate that our framework can simultaneously predict the NMOS Vt, PMOS Vt and static IR drop of the ring oscillators based on their frequencies measured at different external supply voltages. The resulting R-squared values of the predicted features all exceed 99.93%.
Slides

5B-4 (Time: 15:05 - 15:30)
TitleA Local Reconfiguration Based Scalable Fault Tolerant Many-processor Array
AuthorSoumya Banerjee, *Wenjing Rao (University of Illinois at Chicago, U.S.A.)
Pagepp. 432 - 437
KeywordFault Tolerance, Many-processor Arrays, Self Repair, Online Reconfiguration
AbstractThis paper presents a reconfigurable Many-processor Array utilizing a layer of Routers with localized interconnects to provide fault tolerance for Processing Elements (PEs). In such a system, each PE is assigned to a Router in the neighborhood. The required interconnect topology among the PEs is implemented via a fixed Backbone Network connecting all the Routers. A localized Auxiliary Network is used to provide assignment flexibility between each Router and its peripheral PEs. Faulty PEs are repaired via spare PEs in the array, and to extend the reach of spares, repair is done via Replacement Chains: a faulty PE's Router will be assigned to another functional PE within its neighborhood; the Router of the replacement PE will then be reassigned to another PE, until eventually a spare PE is reached. In this paper, we propose a Many-processor Array on the basis of this principle, and show that this architecture is able to deliver a high level of fault tolerance while being scalable in hardware and interconnect overheads.


Session 5C  Advanced Placement and Routing Techniques
Time: 13:50 - 15:30 Wednesday, January 18, 2017
Location: Room 105
Chairs: Seokhyeong Kang (UNIST, Republic of Korea), Wai-Kei Mak (National Tsing Hua University, Taiwan)

5C-1 (Time: 13:50 - 14:15)
TitleRegularity-aware Routability-driven Placement Prototyping Algorithm for Hierarchical Mixed-size Circuits
AuthorJai-Ming Lin, Bo-Heng Yu, *Li-Yen Chang (National Cheng Kung University, Taiwan)
Pagepp. 438 - 443
KeywordRegularity, Routability, Powerplanning, Macro Placement, VLSI design
AbstractThe paper introduces a routability-driven placement prototyping algorithm for hierarchical mixed-size circuits and pays special attention to regular placement of macros. The three-stage approach is the most popular mixed-size placement algorithm and fits best into existing commercial design flows; placement prototyping is its most important stage, and the locations of macros and standard cells are affected by the result. In addition to normal cells and macros, modern designs contain sets of macros that have identical shapes and similar hierarchy. Placing these macros regularly can facilitate powerplanning and induce better routability. Experimental results have demonstrated the effectiveness of our approach on industry benchmarks and in an actual design flow.
Slides

5C-2 (Time: 14:15 - 14:40)
TitleFloorplan and Placement Methodology for Improved Energy Reduction in Stacked Power-Domain Design
Author*Kristof Blutman, Hamed Fatemi (NXP Semiconductors, Netherlands), Andrew B. Kahng (University of California, San Diego, U.S.A.), Ajay Kapoor (NXP Semiconductors, Netherlands), Jiajia Li (University of California, San Diego, U.S.A.), José Pineda de Gyvez (NXP Semiconductors, U.S.A.)
Pagepp. 444 - 449
Keywordstacked domain, partitioning, floorplan, battery lifetime, power domain
AbstractEnergy and battery lifetime constraints are critical challenges to IC designs. Stacked power-domain implementation, which stacks voltage domains in a design, can effectively improve the power delivery efficiency and thus improve battery lifetime. However, such an approach requires balanced current between different domains across multiple operating scenarios. Furthermore, level shifter insertion (together with shifters’ delay impacts), along with placement constraints imposed by power domain regions, can incur power and area penalties. To our knowledge, no existing work performs sub-block-level partitioning optimization for stacked-domain designs. In this paper, we present an optimization framework for stacked-domain designs. Based on an initial placement solution, we apply a flow-based partitioning that is aware of multiple operating scenarios, cell placement, and timing-critical paths to partition cells into two power domains with balanced current and minimized number of inserted level shifters. We further propose heuristics to define regions for each power domain so as to minimize placement perturbation, as well as a dynamic programming-based method to minimize the area cost of power domain generation. In an updated floorplan, we perform matching-based optimization to insert level shifters with minimized wirelength penalty. Overall, our method achieves more than ~10% and 3X battery lifetime improvements in function and sleep modes, respectively.

5C-3 (Time: 14:40 - 15:05)
TitleAn Effective Legalization Algorithm for Mixed-Cell-Height Standard Cells
Author*Chao-Hung Wang, Yen-Yi Wu (National Taiwan University, Taiwan), Jianli Chen (Fuzhou University, China), Yao-Wen Chang, Sy-Yen Kuo (National Taiwan University, Taiwan), Wenxing Zhu, Genghua Fan (Fuzhou University, China)
Pagepp. 450 - 455
KeywordLegalization, Mixed-cell-height cells, Detailed placement, Standard cell design, Physical design
AbstractFor circuit designs in advanced technologies, standard-cell libraries consist of cells with different heights. Such mixed cell heights incur new, complicated challenges for layout designs, due mainly to the heterogeneity in cell dimensions and thus their larger solution spaces. This paper addresses the legalization problem of mixed-height standard cells, which aims to place cells without any overlap and with minimized displacement. Experimental results show that our algorithm can achieve about 50% smaller wirelength increase than a state-of-the-art work.
Slides

5C-4 (Time: 15:05 - 15:30)
TitleDelay-driven Layer Assignment for Advanced Technology Nodes
AuthorSzu-Yuan Han (National Tsing Hua University, Taiwan), Wen-Hao Liu (Cadence Design Systems, U.S.A.), Rickard Ewetz (University of Central Florida, U.S.A.), Cheng-Kok Koh (Purdue University, U.S.A.), Kai-Yuan Chao (Intel Corporation, U.S.A.), *Ting-Chi Wang (National Tsing Hua University, Taiwan)
Pagepp. 456 - 462
Keywordlayer assignment, global routing
AbstractThis paper addresses a delay-driven layer assignment problem with consideration of via delay and coupling effect in the global routing stage. A negotiation-based framework is proposed to balance delay, congestion, and via count. Coupling capacitance is considered using a probabilistic look-up table. Finally, the proposed algorithm uses both parallel wires and wide wires to reduce wire delay. The effectiveness of our layer assignment algorithm is supported by extensive experimental results.
Slides


Session 6S  (Designers' Forum) Panel Discussion: What is future AI we will create ? - "Doraemon" or "Terminator" ? -
Time: 15:50 - 17:30 Wednesday, January 18, 2017
Location: Room 103
Organizers: Hiroe Iwasaki (NTT, Japan), Sunao Torii (ExaScaler, Japan), Akihiko Inoue (Panasonic, Japan), Chair: Satoshi Kurihara (The University of Electro-Communications, Japan)

6S-1 (Time: 15:50 - 17:30)
Title(Panel Discussion) What is future AI we will create? - "Doraemon" or "Terminator" ? -
AuthorPanelists: Hiroshi Yamakawa (dwango, Japan), Luca Rigazio (Panasonic Silicon Valley Lab, Japan), Takeshi Yamada (NTT, Japan), Akira Naruse (NVIDIA, Japan), Shinji Nakadai (NEC, Japan)
AbstractArtificial Intelligence (AI) research and development is now in its third boom. The driving force behind this advancement is deep learning technology. Its high-level feature extraction ability was adopted in the AI Go application "AlphaGO", which accomplished the great feat of defeating a human champion. At the same time, the emergence of strong AI is stirring up controversy. Might our jobs be replaced by AI? Is that true? What should we do if an autonomous AI has consciousness and is hostile toward humans? These concerns may be excessive assumptions, but we cannot ignore them. In this panel session, we will discuss the real situation of current AI and think about future AI. The keywords are "Doraemon" and "Terminator".


Session 6A  Recent Advances in Circuit Simulation and Optimization
Time: 15:50 - 17:30 Wednesday, January 18, 2017
Location: Room 102
Chairs: Markus Olbrich (University of Hannover, Germany), Ibrahim (Abe) Elfadel (Masdar Institute of Science and Technology, United Arab Emirates)

6A-1 (Time: 15:50 - 16:15)
TitleSTEAM: Spline-based Tables for Efficient and Accurate Device Modelling
Author*Archit Gupta, Tianshi Wang, Ahmet Gokcen Mahmutoglu, Jaijeet Roychowdhury (UC Berkeley, U.S.A.)
Pagepp. 463 - 468
Keywordtable-based modelling, circuit-simulation, cubic splines
AbstractA common complaint from users of device models is that the "better" the model, the longer it takes to simulate. Modelling based on interpolation between sampled data points is attractive in this context because it offers low model evaluation times. Although such "table-based" modelling has a long history, important conceptual and implementation issues have been obscure in the literature. These issues include: separating the algebraic ("DC") and dynamic ("charge/flux") components properly; extrapolation outside sampled regions; smoothness; accuracy vs computation vs memory trade-offs; and suitability of the table-based model for various analyses (such as DC, AC, transient, RF, etc., analyses). In this paper, we clarify precisely what functions should be sampled for a table-based device to work properly in any analysis. We re-visit interpolation, showing that well-implemented cubic splines provide excellent smoothness and arbitrarily great accuracy at low, almost-constant evaluation cost. However, memory requirements increase with accuracy. We present a novel extrapolation scheme using passivity concepts that aids convergence. Using Berkeley MAPP, we demonstrate speedups of 150X in core BSIM model evaluations (translating to overall simulation speedups of 6-18X) with relative errors of 0.001%. Our approach can convert any existing device model to a smooth/accurate table-based model with small, fixed evaluation cost. Unlike previous work, our code will be released as open source, serving as a platform for the community to evaluate and experiment with table-based models quickly and conveniently.
Slides

6A-2 (Time: 16:15 - 16:40)
TitleA Time Domain Behavioral Model for Oscillators Considering Flicker Noise
Author*Hui Zhang, Bo Wang (Peking University Shenzhen Graduate School, China)
Pagepp. 469 - 474
Keywordoscillators, modeling, noise
AbstractThe phase noise behavior due to flicker noise of an oscillator has not been modeled accurately by a time domain behavioral model. In this paper, the mathematical foundation to model the up-converted flicker noise region of the phase noise is discussed and derived in detail. Based on the foundation, we present a time domain behavioral model of the oscillator and implement it in Simulink. Comparisons show that the model is as precise as the direct transistor-level circuit simulation, whether in the up-converted thermal or flicker noise region.
Slides

6A-3 (Time: 16:40 - 17:05)
TitleParasitic-Aware GP-based Many-objective Sizing Methodology for Analog and RF Integrated Circuits
Author*Tuotian Liao, Lihong Zhang (Memorial University of Newfoundland, Canada)
Pagepp. 475 - 480
KeywordCircuit Sizing, Geometric Programming, Many-objective Algorithm, Interconnect Parasitics Modelling
AbstractIn this paper, an efficient parasitic-aware geometric programming and many-objective evolution algorithm based two-phase hybrid sizing methodology is presented. It considers circuit performance constraints and layout parasitics simultaneously within a concurrent process by using convex optimization in the first phase, and knowledge-driven heuristic refinement in the second phase. The proposed method has been used to optimize several analog and RF circuits in different CMOS technologies with high efficacy demonstrated.
Slides

6A-4 (Time: 17:05 - 17:30)
TitleHigh-Speed Stochastic Circuits Using Synchronous Analog Pulses
Author*M. Hassan Najafi, David J. Lilja (University of Minnesota, Twin Cities, U.S.A.)
Pagepp. 481 - 487
KeywordStochastic computing, stochastic number generation, pulse width modulation, high-performance computing, mixed-signal design
AbstractThe primary advantages of stochastic computing are the very simple hardware required to implement complex operations, its ability to gracefully tolerate noise, and its skew tolerance. Its relatively long latency, however, is a potential barrier to widespread use of this paradigm, particularly when high accuracy is required. This work proposes a new, high-speed, yet accurate approach for implementing stochastic circuits that uses synchronized analog pulses as a new way of representing correlated stochastic numbers.
Slides


Session 6B  Application-Aware Embedded Architecture Design
Time: 15:50 - 17:30 Wednesday, January 18, 2017
Location: Room 104
Chairs: Chun-Yi Lee (NTHU, Taiwan), Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)

6B-1 (Time: 15:50 - 16:15)
TitleThroughput Optimization for Streaming Applications on CPU-FPGA Heterogeneous Systems
Author*Xuechao Wei, Yun Liang (Center for Energy-Efficient Computing and Applications (CECA), School of EECS, Peking University, China), Tao Wang (Center for Energy-Efficient Computing and Applications (CECA), School of EECS, Peking University/PKU-UCLA Joint Research Institute in Science and Engineering, China), Songwu Lu, Jason Cong (Center for Energy-Efficient Computing and Applications (CECA), School of EECS, Peking University/Computer Science Department, UCLA/PKU-UCLA Joint Research Institute in Science and Engineering, U.S.A.)
Pagepp. 488 - 493
Keywordstreaming, FPGA, heterogeneous, optimization, model
AbstractStreaming processing is an important technology that finds applications in networking, multimedia, signal processing, etc. However, it is very challenging to design and implement streaming applications as they impose complex constraints. First, the tasks involved in streaming applications must complete their computation under a latency constraint. Second, streaming systems are built under increasingly stringent power budgets. Hence, power capping techniques are employed to manage the power consumption of streaming systems. To accommodate these needs, heterogeneous systems that consist of CPUs and FPGAs are becoming increasingly popular due to their performance and power benefits. In this paper, we optimize the throughput for streaming applications on CPU-FPGA heterogeneous systems under latency and power constraints. We develop two algorithms to map the tasks onto the heterogeneous system and order their execution by exploiting the heterogeneity in architectural capabilities and task characteristics. We also employ pipelining to improve the throughput by overlapping the execution of different frames and use frequency scaling to adjust the execution of tasks for power saving. Experiments using a variety of streaming applications show that our heterogeneous solution can successfully meet the latency and power constraints for the cases where the CPU implementation fails. Furthermore, our technique can improve the throughput by 37.32% on average.
Slides

6B-2 (Time: 16:15 - 16:40)
TitleDark Silicon-Aware Hardware-Software Collaborated Design for Heterogeneous Many-Core Systems
AuthorLei Yang, *Weichen Liu (Chongqing University, China), Nan Guan (Hong Kong Polytechnic University, Hong Kong), Mengquan Li, Peng Chen, Edwin H. M. Sha (Chongqing University, China)
Pagepp. 494 - 499
KeywordHeterogeneous Multi-Processing systems, Dark silicon, Hardware-software co-design
AbstractARM's big.LITTLE architecture coupled with Heterogeneous Multi-Processing (HMP) has enabled energy-efficient solutions in the dark silicon era. System-level techniques activate nonadjacent cores to eliminate thermal hotspots. However, this unexpectedly increases communication delay due to longer distances in the network architecture, and in turn degrades application performance and system energy efficiency. In this paper, we present a novel hierarchical hardware-software collaborated approach to address the performance/temperature conflict in dark silicon many-core systems. Optimizations on inter-processor communication, application performance, chip temperature and energy consumption are well isolated and addressed in different phases. Evaluation results show that on average a 22.57% reduction of communication latency, a 23.04% improvement in energy efficiency and a 6.11 reduction of chip peak temperature are achieved compared with state-of-the-art techniques.

6B-3 (Time: 16:40 - 17:05)
TitleNon-Intrusive Dynamic Profiler for Multicore Embedded Systems
AuthorSudarshan Sargur, *Roman Lysecky (University of Arizona, U.S.A.)
Pagepp. 500 - 505
KeywordProfiling, Dynamic profiling, Multicore embedded systems
AbstractApplication profiling is an important step in the design and optimization of embedded systems. Accurately identifying and analyzing the execution of frequently executed computational kernels is needed to effectively optimize the system implementation, at both design time and runtime. Most previous profiling approaches are software based, which can incur significant overhead and may be prohibitive or impractical for profiling embedded systems at runtime. In addition, profiling methods typically focus on profiling the execution of specific tasks executing on a single core, but do not consider accurate and holistic profiling across multiple processor cores. Directly utilizing and naively combining isolated profiles from multiple processor cores can lead to significant profile inaccuracy. In this paper, we present a hardware-based dynamic profiler for non-intrusively and accurately profiling software applications in multicore embedded systems. The profiler provides a detailed execution profile for computational kernels and maintains profile accuracy across multiple processor cores. The hardware-based profiler achieves an average error of less than 0.5% for the percentage execution time of profiled applications.
Slides

6B-4 (Time: 17:05 - 17:30)
TitleDesign of A Pre-Scheduled Data Bus for Advanced Encryption Standard Encrypted System-on-Chips
AuthorXiaokun Yang (University of Houston Clear Lake, U.S.A.), *Wujie Wen (Florida International University, U.S.A.)
Pagepp. 506 - 511
KeywordAdvanced Encryption Standard (AES), Bus, FPGA, System-on-Chips (SoCs)
AbstractThis paper proposes a high-efficiency data bus (DBUS) for Advanced Encryption Standard (AES) encrypted system-on-chips (SoCs). Using DBUS, the data sequence can be pre-selected for AES encryption/decryption, so that the state buffering and rescheduling overhead can be reduced. FPGA results show that the DBUS-based design lowers the dynamic energy to 66.93%, and achieves up to 1.30 times higher valid throughput compared with the Advanced eXtensible Interface (AXI) based implementation.
Slides


Session 6C  Advances in Microfluidic Biochips
Time: 15:50 - 17:30 Wednesday, January 18, 2017
Location: Room 105
Chairs: Tohru Ishihara (Kyoto University, Japan), Weikang Qian (Shanghai Jiao Tong University, China)

6C-1 (Time: 15:50 - 16:15)
TitlePiracy Prevention of Digital Microfluidic Biochips
Author*Ching-Wei Hsieh (National Tsing Hua University, Taiwan), Zipeng Li (Duke University, U.S.A.), Tsung-Yi Ho (National Tsing Hua University, Taiwan)
Pagepp. 512 - 517
KeywordDigital Microfluidics, Privacy, Security
AbstractDigital microfluidic biochips (DMFBs) play an important role in the healthcare industry due to their advantages such as low cost, portability, and efficiency. According to a recent market report, the biochip market is growing twice as fast as before. However, as the enormous business opportunities grow, piracy attacks, which are exploited by unscrupulous people to gain illegal profits, become a severe threat to DMFBs. To prevent piracy attacks, the conventional approach uses secret keys to perform authentication. Nevertheless, DMFBs only consist of electrodes to control the operations of droplets, and there are no memories or logic gates integrated on them to store secret keys. This makes designing secure defenses of DMFBs against piracy attacks more difficult. Thus, in this paper, we propose the first authentication method for piracy prevention of DMFBs based on a novel Physical Unclonable Function (PUF). The proposed PUF utilizes the inherent variation of electrodes on DMFBs to generate secret keys, so it does not require memory. Experimental results demonstrate the feasibility of our proposed PUF. Finally, we analyze the security of the proposed method against piracy attacks.
Slides

6C-2 (Time: 16:15 - 16:40)
TitleOn Reliability Hardening in Cyber-Physical Digital-Microfluidic Biochips
Author*Guan-Ruei Lu (Institute of Electronics Engineering, National Chiao Tung University, Taiwan), Guan-Ming Huang (Institute of Electrical and Computer Engineering, National Chiao Tung University, Taiwan), Ansuman Banerjee, Bhargab B. Bhattacharya (Advanced Computing & Microelectronics Unit, Indian Statistical Institute, India), Tsung-Yi Ho (Institute of Computer Science and Engineering, National Tsing Hua University, Taiwan), Hung-Ming Chen (Institute of Electronics Engineering, National Chiao Tung University, Taiwan)
Pagepp. 518 - 523
KeywordCyber-Physical, Reliability
AbstractIn the area of biomedical engineering, digital-microfluidic biochips (DMFBs) have received considerable attention, because of their capability of providing an efficient and reliable platform for conducting point-of-care clinical diagnostics. System reliability, in turn, mandates error-recoverability while implementing biochemical assays on-chip for medical applications. Unfortunately, the technology of DMFBs is not yet fully equipped to handle error-recovery from various microfluidic operations involving droplet motion and reaction. Recently, a number of cyber-physical systems have been proposed to provide real-time checking and error-recovery in assays based on the feedback received from a few on-chip checkpoints. However, in order to synthesize robust feedback systems for different types of DMFBs, certain practical issues need to be considered such as co-optimization of checkpoint placement and layout of droplet-routing pathways. For application-specific DMFBs, we propose here an algorithm that minimizes the number of checkpoints and determines their locations to cover every path in a given droplet-routing solution. Next, for general-purpose DMFBs, where the checkpoints are pre-deployed in specific locations, we present a checkpoint-aware routing algorithm such that every droplet-routing path passes through at least one checkpoint to enable error-recovery and to ensure physical routability of all droplets. Our experiments on assay benchmarks show encouraging results in terms of latest-arrival-time and routability of droplets. The proposed methods thus provide convenient reliability-hardening mechanisms for a wide class of cyber-physical DMFBs.
Slides

6C-3 (Time: 16:40 - 17:05)
TitleHamming-Distance-Based Valve-Switching Optimization for Control-Layer Multiplexing in Flow-Based Microfluidic Biochips
Author*Qin Wang, Shiliang Zuo, Hailong Yao (Tsinghua University, China), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Bing Li, Ulf Schlichtmann (Technical University of Munich, Germany), Yici Cai (Tsinghua University, China)
Pagepp. 524 - 529
KeywordMicrofluidic biochips, Flow-based, Multiplexer, Valve-switching
AbstractFlow-based microfluidic biochips have progressed significantly in the last decade. Through innovations in fabrication techniques, the integration of thousands of microvalves and large-scale networks of microchannels on a chip is enabled. The development in flow-based microfluidic biochip integration has usually been compared to the evolution of Moore's Law. Microvalves are critical components used to control the movement of fluids and complete complex operations. In order to control the open and closed states of a microvalve, off-chip control ports are required. With the sharp increase in the number of microvalves, the manufacturing cost increases rapidly. As a result, a software-programmable microfluidic platform has been proposed to reduce the number of off-chip control ports by integrating a microfluidic multiplexer to control the array of microvalves. The multiplexer needs to be switched when the states of microvalves change between two adjacent time slots. Furthermore, different switching orders of microvalves lead to different switching frequencies of the multiplexer. A high switching frequency will make the multiplexer vulnerable and decrease the reliability of chips. This paper proposes the first Hamming-distance-based switching order optimization method for the multiplexer. Experimental results show that our method can reduce the switching frequency of the multiplexer effectively, and the solution is very close to the optimal lower bound.
Slides

6C-4 (Time: 17:05 - 17:30)
TitleClose-to-Optimal Placement and Routing for Continuous-Flow Microfluidic Biochips
Author*Andreas Grimmer (Johannes Kepler University, Austria), Qin Wang, Hailong Yao (Tsinghua University, China), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Robert Wille (Johannes Kepler University, Austria)
Pagepp. 530 - 535
KeywordContinuous-Flow Microfluidics, Placement and Routing
AbstractContinuous-flow microfluidics automate laboratory procedures. Existing EDA solutions for placement and routing of corresponding components and channels rely on heuristics and consider placement and routing independently. Consequently, the results are suboptimal. We propose a close-to-optimal design solution that considers all, or as many as possible, solutions. To this end, solving engines and pruning schemes are used to handle the complexity. Evaluations show optimal results for small experiments, close-to-optimal results for large experiments, and improvements of up to 1-2 orders of magnitude compared to the current state-of-the-art.
Slides



Thursday, January 19, 2017

Session 3K  Keynote III
Time: 9:00 - 9:50 Thursday, January 19, 2017
Location: International Conference Room
Chair: Kazutoshi Kobayashi (Kyoto Institute of Technology, Japan)

3K-1 (Time: 9:00 - 9:50)
Title(Keynote Address) All-Programmable FPGAs: More Powerful Devices Require More Powerful Tools
AuthorSteve Trimberger (Xilinx Research Labs, U.S.A.)
Pagep. 536
AbstractSince their inception, FPGAs have changed significantly in their capacity and architecture. The devices we use today are called upon to solve problems in mixed-signal, high-speed communications, signal processing and compute acceleration that early devices could not address. The devices continue to grow in capability and complexity. In order for designers to use them effectively, new tools are required. This talk describes the evolution of FPGA devices and tools from the earliest days to the present day, and outlines the devices and tools needed in the coming decade.


Session 7S  (Special Session) When Backend Meets Frontend: Cross-Layer Design & Optimization for System Robustness
Time: 10:15 - 12:15 Thursday, January 19, 2017
Location: Room 103
Organizers/Chairs: Cheng Zhuo (Zhejiang University, China), Masanori Hashimoto (Osaka University, Japan)

7S-1 (Time: 10:15 - 10:45)
Title(Invited Paper) Containing Guardbands
Author*Hussam Amrouch, Jörg Henkel (Karlsruhe Institute of Technology, Germany)
Pagepp. 537 - 542
KeywordBTI, Instantaneous Aging, Reliability, Guardband, Logic Synthesis
AbstractReliability concerns may overtake conventional design constraints such as cost and performance because transistors in the deep nano-CMOS era are increasingly susceptible to degradation effects. This has made reliability unsustainably expensive due to the need for wider and wider guardbands (i.e., safety margins). It is in fact time to reverse this trend: instead of widening guardbands, it is imperative to contain them. In this work, we summarize three novel means to achieve this goal: interdependencies of degradation effects, aging-aware logic synthesis, and modeling of instantaneous transistor aging.

7S-2 (Time: 10:45 - 11:15)
Title(Invited Paper) Pattern Based Runtime Voltage Emergency Prediction: An Instruction-Aware Block Sparse Compressed Sensing Approach
AuthorYu-Guang Chen (National Tsing Hua University, Taiwan), Michihiro Shintani, *Takashi Sato (Kyoto University, Japan), Yiyu Shi (University of Notre Dame, U.S.A.), Shih-Chieh Chang (National Tsing Hua University, Taiwan)
Pagepp. 543 - 548
KeywordCompressed Sensing
AbstractThe relentless technology scaling calls for reduced supply voltage for dynamic power suppression. On the other hand, transistor threshold voltage cannot be scaled at the same pace to avoid excessive leakage power. Consequently, the noise margin is significantly reduced, leading to the deployment of various noise management systems that handle runtime voltage emergencies. Most of these systems rely on on-chip noise sensors, which are large in size and consume significant power. To tackle this issue, in this paper we propose a sensor-less voltage emergency estimation framework. It explores the relationship between switching activities and noise, and takes advantage of block sparse compressed sensing developed by the signal processing society. Experimental results on a few industrial designs show that by monitoring registers, voltage emergencies can be successfully predicted.

7S-3 (Time: 11:15 - 11:45)
Title(Invited Paper) Heterogeneous Chip Power Delivery Modeling and Co-Synthesis for Practical 3DIC Realization
AuthorWei-Hsun Liao (National Chiao Tung University, Taiwan), Chang-Tzu Lin (ITRI, Taiwan), Sheng-Hsin Fang, Chien-Chia Huang, *Hung-Ming Chen (National Chiao Tung University, Taiwan), Ding-Ming Kwai, Yung-Fa Chou (ITRI, Taiwan)
Pagepp. 549 - 553
KeywordHeterogeneous Die Modeling, Power Delivery Network Synthesis, 3DIC
AbstractThree-dimensional IC (3DIC) is becoming practical in today's consumer electronics designs. However, one major problem remains in design synthesis and flow: how to model heterogeneous die(s) together with the major logic die for power synthesis and signoff. This work provides a realistic model and principle for a heterogeneous die’s power network for 3DICs. It is based on given abstract or early-stage information, such as bump location and power consumption, from the provider. Our work also uses this model to synthesize the power network with the bottom logic die in the design flow. The result is a DRC-clean power network without IR and EM violations for all power domains. First, we analyze the location and power consumption of the power bumps for the heterogeneous die(s). Second, according to this analysis, we decide the stripe location and power sink location of the heterogeneous die’s model by a clustering method. After the initial model is synthesized, we convert it to a node graph with the corresponding resistances of vias and metal layers, as well as nodal voltages. Third, the model is optimized by using Sequential Linear Programming (SLP) to adjust stripe width. This improves the model iteratively until the target IR-drop is met. Furthermore, our work creates a pseudo DEF of the proposed model to be incorporated with the commercial tool for verification. We experiment on a real case from a design house containing a 3D DRAM stack to demonstrate the effectiveness of this cross-layer realization. Results show that we can save 34% metal layer usage in one of the power domains in our case by using the proposed methodology.

7S-4 (Time: 11:45 - 12:15)
Title(Invited Paper) CN-SIM: A Cycle-Accurate Full System Power Delivery Noise Simulator
Author*Kassan Unda (University of Notre Dame, U.S.A.), Chung-Han Chou, Shih-Chieh Chang (National Tsing Hua University, Taiwan), Cheng Zhuo (Zhejiang University, China), Yiyu Shi (University of Notre Dame, U.S.A.)
Pagepp. 554 - 559
KeywordPower delivery noise, cross layer
AbstractThis paper introduces CN-SIM, a cycle-accurate, full-system, power delivery (PD) noise simulator. CN-SIM provides cross-layer connectivity from the application layer, to the architecture layer, to the circuit layer, which is much needed to realistically estimate PD noise, thus making it easier for system architects to explore multilayer design optimizations. CN-SIM’s granularity at its deepest is at the functional unit (FU) level. The experimental results of running PARSEC suite benchmarks for different system configurations and different industrial PD designs have illustrated CN-SIM’s capability to capture the cross-layer impact on PD noise.


Session 7A  NVM/Flash: From Advanced Storage Design to Emerging Applications
Time: 10:15 - 12:20 Thursday, January 19, 2017
Location: Room 102
Chairs: Sungjoo Yoo (Seoul National University, Republic of Korea), Ya-Shu Chen (National Taiwan University of Science and Technology, Taiwan)

7A-1 (Time: 10:15 - 10:40)
TitleImproving LDPC Performance Via Asymmetric Sensing Level Placement on Flash Memory
Author*Qiao Li, Liang Shi (College of Computer Science, Chongqing University, China), Chun Jason Xue (Department of Computer Science, City University of Hong Kong, Hong Kong), Qingfeng Zhuge, Edwin H.-M. Sha (College of Computer Science, Chongqing University, China)
Pagepp. 560 - 565
KeywordLDPC, flash memory, sensing voltage
AbstractFlash memory development through technology scaling and bit density has a significant impact on the reliability of flash cells. Hence strong error correction code (ECC) schemes are highly recommended. With a strong error correction capability, low-density parity-check (LDPC) code is now applied in state-of-the-art flash memory. However, LDPC has long decoding latency when the raw bit error rates (RBER) are high. This is because it needs fine-grained soft sensing between states to iteratively decode the raw data. In this work, we propose a smart sensing level placement scheme to reduce the LDPC decoding latency. The basic idea for the placement scheme is motivated by two asymmetric error characteristics of flash memory: the asymmetric errors at different states, and the asymmetric errors caused by voltage left-shifts and right-shifts. With an understanding of these two types of error characteristics, the sensing levels are smartly placed to reduce the number of sensing levels while maintaining the error correction capability of LDPC. Experimental analysis shows that the proposed scheme achieves significant performance improvement.
Slides

7A-2 (Time: 10:40 - 11:05)
TitleA Flash Scheduling Strategy for Current Capping in Multi-Power-Mode SSDs
AuthorLi-Pin Chang, Chia-Hsiang Cheng, *Kai-Hsiang Lin (National Chiao Tung University, Taiwan)
Pagepp. 566 - 571
Keywordflash storage, solid-state disks, power management
AbstractSolid state disks (SSDs) employ internal parallelism for high throughput, but concurrent flash operations can draw a high instantaneous current. This study presents a flash scheduling algorithm to optimize the SSD internal parallelism subject to the current limit. Based on realistic flash current models, the scheduler decides the actual starting times of every flash operation, and it efficiently examines the peak current only at a few time points. Our experiments show that our approach outperformed existing methods.
Slides

7A-3 (Time: 11:05 - 11:30)
TitleTemperature-Aware Data Allocation Strategy for 3D Charge-Trap Flash Memory
Author*Yi Wang, Mingxu Zhang (Shenzhen University, China), Jing Yang (Harbin Institute of Technology, China)
Pagepp. 572 - 577
Keyword3D flash memory, temperature-aware, garbage collection, space allocation
AbstractThree-dimensional (3D) flash memory is emerging as an attractive solution to overcome the scaling bottleneck in sub-20 nanometer design. Compared to conventional planar flash memory, current 3D flash memory adopts charge-trap technology that can significantly enhance cell density and storage capacity. Despite these advantages, the novel materials and fabrication processes in charge-trap flash memory bring new challenges. Recent studies demonstrate that charge-trap flash is sensitive to thermal yield. This issue does not arise in two-dimensional flash memory, which adopts floating gate technology. For 3D flash memory with charge-trap technology, high temperature will incur both charge loss and retention degradation. The large-capacity data block in 3D flash memory also causes extra garbage collection overhead, which makes the temperature issue even worse. This paper presents TempLoad, a temperature-aware data allocation strategy for 3D charge-trap flash memory. TempLoad is a novel hardware and file system interface that can transparently allocate physical space based on the temperature status. TempLoad adopts several address mapping strategies to fully utilize the storage capacity and reduce the garbage collection overhead. The objective is to prevent the generation of hotspots and enhance the data integrity of 3D flash memory. Experimental results show that the proposed technique can reduce the peak temperature by 28.49% and reduce uncorrectable page errors by 83.71% with negligible timing overhead in comparison with previous work.
Slides

7A-4 (Time: 11:30 - 11:55)
TitleScalable Frequent-Pattern Mining on Nonvolatile Memories
Author*Yi Lin (Chongqing University, China), Po-Chun Huang (Yuan Ze University, Taiwan), Duo Liu, Liang Liang (Chongqing University, China)
Pagepp. 578 - 583
Keywordfrequent pattern mining, fp-tree, nonvolatile memory, phase change memory
AbstractFrequent-pattern mining is a common means to reveal the hidden trends behind a set of data. However, existing frequent-pattern mining algorithms are primarily designed for DRAM, instead of the energy-economic nonvolatile memories (NVMs). Due to the huge differences between the intrinsic characteristics of NVMs and those of DRAM, existing frequent-pattern mining algorithms might suffer from serious overheads of write amplification or energy consumption when they are used on NVMs. The design complexity is further exacerbated when parallel computing is used to accelerate the mining process. In this paper, we propose PevFP-tree, a solution to the parallel frequent-pattern mining problem on NVMs, such as phase-change memory (PCM). By jointly considering the characteristics of NVMs, PevFP-tree could accelerate the mining process and enhance the energy efficiency. In addition, PevFP-tree offers superior scalability in terms of the degree of parallelism of the mining algorithm, as well as the branching factor of its tree structure. The efficacy of PevFP-tree is then evaluated by experiments based on realistic datasets, where the results are encouraging.
Slides

7A-5 (Time: 11:55 - 12:20)
TitleKVFTL: Optimization of Storage Space Utilization for Key-Value-Specific Flash Storage Devices
Author*Yen-Ting Chen (National Tsing Hua University, Taiwan), Ming-Chang Yang, Yuan-Hao Chang, Tseng-Yi Chen (Academia Sinica, Taiwan), Hsin-Wen Wei (Tamkang University, Taiwan), Wei-Kuan Shih (National Tsing Hua University, Taiwan)
Pagepp. 584 - 590
Keywordflash storage device, key-value store, space utilization
AbstractThe strong momentum of key-value store applications drives the commercialization of key-value-specific hard disk drives. To achieve a higher degree of performance, specific flash-based solid state drives would also be commercialized for key-value store applications in the foreseeable future. However, the existing fixed-sized management strategies of flash-based devices would potentially result in low storage space utilization when managing variable-sized key-value data. This problem inspires this paper to propose a key-value flash translation layer (KVFTL) design to improve the storage space utilization of key-value-specific solid state drives (KVSSDs). A series of experiments was conducted to evaluate the proposed design, and the experimental results on space utilization and device performance are very encouraging.
Slides


Session 7B  Hardware Diversity and Hardware Trojan
Time: 10:15 - 12:20 Thursday, January 19, 2017
Location: Room 104
Chairs: Wujie Wen (Florida International University, U.S.A.), Chip Hong Chang (Nanyang Technological University, Singapore)

7B-1 (Time: 10:15 - 10:40)
TitleTrojan Localization Using Symbolic Algebra
AuthorFarimah Farahmandi, Yuanwen Huang, *Prabhat Mishra (University of Florida, U.S.A.)
Pagepp. 591 - 597
Keywordformal verification, Hardware Trojan Localization, Hardware Security, Polynomial Manipulation, Groebner Basis Theory
AbstractGrowing reliance on reusable hardware Intellectual Property (IP) blocks, often gathered from untrusted third-party vendors, severely affects the security and trustworthiness of System-on-Chip computing platforms. These IPs may come with deliberate malicious implants to incorporate undesired functionality and work as hidden backdoors. In this paper, we propose an automated approach to identify untrustworthy IPs and localize malicious functional modifications (if any). The technique is based on extracting polynomials from the gate-level implementation of the untrustworthy IP and comparing them with specification polynomials. The proposed approach is applicable when the specification is available. Our approach is scalable due to manipulation of polynomials instead of the BDD-based analysis used in traditional equivalence checking techniques. Experimental results using Trust-HUB benchmarks demonstrate that our approach improves both localization and test generation efficiency by several orders of magnitude compared to state-of-the-art Trojan detection techniques.
Slides

7B-2 (Time: 10:40 - 11:05)
TitleDetecting Hardware Trojans in Unspecified Functionality Through Solving Satisfiability Problems
Author*Nicole Fern (UC Santa Barbara, U.S.A.), Ismail San (Anadolu University, Turkey), Kwang-Ting (Tim) Cheng (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 598 - 604
KeywordHardware Trojans, Unspecified Functionality, Satisfiability Problems, SMT Solvers
AbstractFor modern complex designs it is impossible to fully specify design behavior, and only feasible to verify functionally meaningful scenarios. Hardware Trojans modifying only unspecified functionality are not possible to detect using existing verification methodologies and Trojan detection strategies. We propose a detection methodology for these Trojans by 1) precisely defining "suspicious" unspecified functionality in terms of information leakage, and 2) formulating detection as a satisfiability problem that can take advantage of the recent advances in both boolean and satisfiability modulo theory (SMT) solvers. The formulated detection procedure can be applied to a gate-level design using commercial equivalence checking tools, or directly to the Verilog/VHDL code by reasoning about the satisfiability of SMT expressions built from traversing the data-flow graph. We demonstrate the effectiveness of our approach on an adder coprocessor and a UART communication controller infested with Trojans which process information leaked from the on-chip bus during idle cycles using signals with only partially specified behavior.
Slides

7B-3 (Time: 11:05 - 11:30)
TitleRouting Perturbation for Enhanced Security in Split Manufacturing
Author*Yujie Wang, Pu Chen, Jiang Hu (Texas A&M University, U.S.A.), Jeyavijayan Rajendran (The University of Texas at Dallas, U.S.A.)
Pagepp. 605 - 610
KeywordHardware Security, Split Manufacturing, Routing
AbstractSplit manufacturing can mitigate security vulnerabilities at untrusted foundries by exposing only partial designs. Even so, attackers can make educated guesses according to design conventions and thereby recover entire chip designs. In this work, a routing perturbation-based defense method is proposed so that such attacks become very difficult while wirelength/timing overhead is kept very small. Experimental results on benchmark circuits confirm the effectiveness of the proposed techniques. The new techniques also significantly outperform the latest previous work.
Slides

7B-4 (Time: 11:30 - 11:55)
TitleMUTARCH: Architectural Diversity for FPGA Device and IP Security
Author*Robert Karam, Tamzidul Hoque (University of Florida, U.S.A.), Sandip Ray (NXP Semiconductors, U.S.A.), Mark Tehranipoor, Swarup Bhunia (University of Florida, U.S.A.)
Pagepp. 611 - 616
KeywordFPGA, bitstream, security, architectural diversity
AbstractField Programmable Gate Arrays are being increasingly deployed in diverse applications, but bitstream security is lacking. Existing techniques incur significant overhead, are susceptible to side channel attacks, and are vulnerable to piracy and malicious alteration during in-field upgrade. Instead, we explore the concept of mutable architectures, FPGA fabrics that are physically different from one another and can change with time for vastly increased security. The approach integrates with existing design methodologies and requires minimal changes to the toolflows. Our analysis and simulations demonstrate robust security against major bitstream attacks.
Slides

7B-5 (Time: 11:55 - 12:20)
TitleSecurity Vulnerability Analysis of Design-for-Test Exploits for Asset Protection in SoCs
AuthorGustavo K. Contreras, Adib Nahiyan, Swarup Bhunia, Domenic Forte, *Mark Tehranipoor (University of Florida, U.S.A.)
Pagepp. 617 - 622
KeywordSecurity, Information assurance, Information flow tracking, confidentiality and integrity, Design-for-test
AbstractSoCs implementing security modules should be both testable and secure. In this paper, for the first time, we propose a novel automated security vulnerability analysis framework to identify violations of confidentiality, integrity, and availability policies caused by test structures and designer oversights during SoC integration. Results demonstrate existing information leakage vulnerabilities in implementations of various encryption algorithms and secure microprocessors. These can be exploited to obtain secret keys, control finite state machines, or gain unauthorized access to memory read/write functions.


Session 7C  Hardware Accelerator for Emerging Applications
Time: 10:15 - 12:20 Thursday, January 19, 2017
Location: Room 105
Chairs: Tohru Ishihara (Kyoto University, Japan), Yongpan Liu (Tsinghua University, China)

7C-1 (Time: 10:15 - 10:40)
TitleTowards Scalable and Efficient GPU-Enabled Slicing Acceleration in Continuous 3D Printing
AuthorAosen Wang, Chi Zhou (State University of New York at Buffalo, U.S.A.), *Zhanpeng Jin (State University of New York at Binghamton, U.S.A.), Wenyao Xu (State University of New York at Buffalo, U.S.A.)
Pagepp. 623 - 628
KeywordContinuous 3D Printing, GPU, Pixelwise Parallel Slicing, Fully Parallel Slicing
AbstractRecently, continuous 3D printing, a revolutionary branch of legacy additive manufacturing, has achieved a two-orders-of-magnitude breakthrough in time efficiency for industrial manufacturing. As the manufacturing technique advances rapidly, the prefabrication step that slices the 3D object into image layers can become a bottleneck that impedes further improvement of production efficiency. In this paper, we present two scalable and efficient graphics processing unit (GPU) enabled schemes, i.e., pixelwise parallel slicing and fully parallel slicing, to accelerate the image-projection based slicing algorithm in continuous 3D printing. Specifically, the pixelwise approach utilizes pixel-level parallelism and exploits in-shared-memory computing on the GPU. The fully parallel method aggressively expands the parallelism across both triangle mesh size and slicing layers. The thread-level priority competition issue that results from full parallelism is addressed by a critical area using atomic operations. Experiments with real 3D object benchmarks show that our pixelwise parallel slicing achieves a runtime reduction of one order of magnitude relative to the CPU, and the fully parallel slicing achieves an improvement of two orders of magnitude. We also evaluate the scalability of both proposed schemes.
Slides

7C-2 (Time: 10:40 - 11:05)
TitleFPGA-based Accelerator for Long Short-Term Memory Recurrent Neural Networks
Author*Yijin Guan, Zhihang Yuan (Peking University, China), Guangyu Sun (Peking University/PKU-UCLA Joint Research Institute in Science and Engineering, China), Jason Cong (Peking University/PKU-UCLA Joint Research Institute in Science and Engineering/University of California, Los Angeles, U.S.A.)
Pagepp. 629 - 634
KeywordFPGA, Accelerator, LSTM-RNN
AbstractLong Short-Term Memory Recurrent Neural Networks (LSTM-RNNs) have been widely used for speech recognition, machine translation, scene analysis, etc. Unfortunately, general-purpose processors like CPUs and GPGPUs cannot implement LSTM-RNNs efficiently due to the recurrent nature of LSTM-RNNs. FPGA-based accelerators have attracted the attention of researchers because of their good performance, high energy efficiency and great flexibility. In this work, we present an FPGA-based accelerator for LSTM-RNNs that optimizes both computation performance and communication requirements. The peak performance of our accelerator reaches 7.26 GFLOP/S, which significantly outperforms previous approaches.
Slides

7C-3 (Time: 11:05 - 11:30)
TitleFine-Grained Accelerators for Sparse Machine Learning Workloads
AuthorAsit K. Mishra, *Eriko Nurvitadhi, Ganesh Venkatesh, Jonathan Pearce, Debbie Marr (Intel Corp., U.S.A.)
Pagepp. 635 - 640
KeywordAccelerators, MachineLearning, BigData, Sparse
AbstractText analytics applications using machine learning techniques have grown in importance with the ever increasing amount of data being generated from web-scale applications, social media and digital repositories. Apart from being large in size, these generated data are often unstructured and are heavily sparse in nature. The performance of these applications on current systems is hampered by hard-to-predict branches and a low compute-per-byte ratio. This paper proposes a set of fine-grained accelerators that improve the performance and energy-envelope of these applications by an order of magnitude. Our proposed accelerators are simple to integrate with a core, easy to interface with the memory system, occupy very little area compared to other core components, and are generic enough to cover a wide variety of text analytics applications.
Slides

7C-4 (Time: 11:30 - 11:55)
TitleHigh Throughput Hardware Architecture for Accurate Semi-Global Matching
AuthorYan Li, Chen Yang, Wei Zhong, Zhiwei Li, *Song Chen (University of Science and Technology of China, China)
Pagepp. 641 - 646
Keywordstereo matching, SGM, high throughput
AbstractAs the most important step of a stereo vision system, stereo matching, which finds the correspondences in stereo image pairs, requires high-quality real-time depth computation. In this paper, a high-accuracy and high-throughput full-pipeline hardware architecture with disparity and row parallelism is proposed. In the semi-global aggregation stage, to improve the accuracy in discontinuous regions, adaptive weighted path costs are adopted, and five aggregation paths are used without consuming external memory resources. The proposed hardware architecture is implemented on a Stratix V FPGA and achieves a throughput of 1280×960 at 197 fps with 64 disparity levels at 156 MHz.
Slides
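
As a rough sanity check on the throughput figures quoted in the 7C-4 abstract, the short sketch below converts them into pixels per second and pixels per clock cycle. It assumes the figure means 1280×960 frames at 197 frames per second and ignores any blanking or memory overhead; the function names are illustrative only and do not come from the paper.

```python
# Figures quoted in the abstract (interpretation of the frame rate is an assumption).
WIDTH, HEIGHT = 1280, 960
FRAMES_PER_SECOND = 197
CLOCK_HZ = 156_000_000
DISPARITY_LEVELS = 64


def pixels_per_second(width, height, fps):
    """Number of output pixels produced per second at the given frame rate."""
    return width * height * fps


def pixels_per_cycle(width, height, fps, clock_hz):
    """Average number of pixels that must be completed per clock cycle."""
    return pixels_per_second(width, height, fps) / clock_hz


rate = pixels_per_second(WIDTH, HEIGHT, FRAMES_PER_SECOND)
per_cycle = pixels_per_cycle(WIDTH, HEIGHT, FRAMES_PER_SECOND, CLOCK_HZ)

print(rate)                     # 242073600 pixels per second
print(round(per_cycle, 2))      # 1.55 pixels per clock cycle
print(rate * DISPARITY_LEVELS)  # 15492710400 pixel-disparity pairs per second
```

Under these assumptions the design has to finish more than one pixel per cycle on average, which appears consistent with the row parallelism mentioned in the abstract.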

7C-5 (Time: 11:55 - 12:20)
TitleA Memristor-based Neuromorphic Engine with a Current Sensing Scheme for Artificial Neural Network Applications
AuthorChenchen Liu, Qing Yang (University of Pittsburgh, U.S.A.), Chi Zhang, Hao Jiang (San Francisco State University, U.S.A.), Qing Wu (Air Force Research Lab, U.S.A.), *Hai (Helen) Li (University of Pittsburgh, U.S.A.)
Pagepp. 647 - 652
Keywordmemristor, neuromorphic computing, current sensing
AbstractFollowing the big data revolution, neuromorphic computing is making a comeback thanks to its great potential in information processing capability. Despite the many types of architectures reported in the conventional CMOS domain, the memristor, as an example of emerging devices, demonstrates intrinsic support for the parallel matrix-vector multiplication operation that is widely used in artificial neural network applications. However, its computation accuracy and speed are far from satisfactory, mainly constrained by the features of the memristor crossbar array and peripheral circuitry. In this work, we propose a new memristor crossbar based computing engine design by leveraging a current sensing scheme. High parallelism in operation, and therefore fast computation, can be achieved by simultaneously supplying analog voltages to a memristor crossbar and directly converting the weighted current through a current-to-voltage converter. We implemented and compared feed-forward neural networks with different array sizes and layer numbers. Our design demonstrates good computation accuracy, e.g., 96.6% classification accuracy for MNIST handwritten digits in a two-layer design.
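
The matrix-vector multiplication mentioned in the 7C-5 abstract can be illustrated with an idealized crossbar model: each cell contributes a current equal to its conductance times the applied row voltage, and the currents on a column add up. The sketch below is a minimal numerical illustration under that ideal assumption (no wire resistance, no device variation); the function names and the simple current-to-voltage stage are hypothetical and are not taken from the paper.

```python
def crossbar_mvm(conductances, voltages):
    """Ideal crossbar: column current I_j = sum over rows i of G[i][j] * V[i].

    conductances is a list of rows (siemens), voltages has one entry per row (volts).
    """
    rows = len(conductances)
    cols = len(conductances[0])
    if len(voltages) != rows:
        raise ValueError("need exactly one input voltage per crossbar row")
    return [
        sum(conductances[i][j] * voltages[i] for i in range(rows))
        for j in range(cols)
    ]


def current_to_voltage(currents, feedback_resistance):
    """Idealized current sensing: V_out = I * R_f (sign inversion ignored)."""
    return [current * feedback_resistance for current in currents]


# Example: a 2x2 array of conductances driven by two row voltages.
G = [[1.0, 2.0],
     [3.0, 4.0]]
V = [1.0, 0.5]

I = crossbar_mvm(G, V)
print(I)                           # [2.5, 4.0]
print(current_to_voltage(I, 2.0))  # [5.0, 8.0]
```

All column currents are produced in one step, which is the source of the parallelism the abstract refers to; a real design would also have to account for the non-ideal effects omitted here.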


Session 8S  (Designers' Forum) Advanced Automotive Security
Time: 13:50 - 15:30 Thursday, January 19, 2017
Location: Room 103
Organizers: Shinichi Shibahara (Renesas System Design, Japan), Akihiko Inoue (Panasonic, Japan), Chair: Shinichi Shibahara (Renesas System Design, Japan)

8S-1 (Time: 13:50 - 14:15)
Title(Invited Paper) Using Security Applications for Automotive Hardware Security Modules
AuthorDennis Kengo Oka (ETAS, Japan)
KeywordSecurity, Automotive
AbstractWith advancements in connectivity and new technologies, the attack surface for automotive systems is increasing. Various new cybersecurity threats are emerging that can have major safety and financial impacts. As a result, it is imperative to understand these threats to apply the appropriate security measures. This talk will present a number of threats and discuss how such threats can be countered using security applications for automotive hardware security modules.

8S-2 (Time: 14:15 - 14:40)
Title(Invited Paper) An Embedded Hardware Security Module for Automotive ECUs
AuthorYasuhisa Shimazaki (Renesas Electronics Corporation, Japan)
AbstractIn the coming autonomous-driving era, automobiles and many kinds of facilities will be connected to each other to provide a safe, comfortable and efficient driving environment for drivers. Vehicle-to-vehicle communication, for example, is necessary to obtain safety-related information such as the distance, speed, and condition of neighboring vehicles, and communication between a vehicle and the cloud is used to obtain traffic information, to update the firmware of electronic control units (ECUs), and so forth. This means, however, that we need to pay close attention to micro controller unit (MCU) design in terms of cyber security. In 2015, a remote attack on a running car through a cellular network was demonstrated, resulting in 1.4 million recalls. To address this issue, MCUs need security measures that protect themselves and their communication channels effectively and efficiently. Based on this motivation, we have implemented a hardware security module on our MCUs. In this presentation, an actual implementation of the hardware security module that supports SHE and EVITA, which are widely accepted security specifications for automotive applications, will be shown. The presentation will also cover performance evaluation results of elliptic curve digital signature algorithm (ECDSA) verification used in vehicle-to-x (V2X) communication.

8S-3 (Time: 14:40 - 15:05)
Title(Invited Paper) Security Hardware for Automotive Applications
AuthorTakeshi Fujino (Ritsumeikan University, Japan)
AbstractSeveral kinds of malicious attacks against vehicles have been demonstrated in the past few years. The first kind, seen since 2010, is an invasive attack on the in-vehicle CAN network. The speedometer on the dashboard can be made to display wrong values, and some equipment such as the windshield wipers or turn signals can be controlled by commands injected from a PC via the OBD-II port. The next stage is a remote attack from a cellular network, where abnormal operation of the accelerator and engine was demonstrated on a Jeep in 2015. FCA issued a recall for 1.4 million vehicles because of vulnerable software. In the future, authenticated CAN communication and ECUs with a secure-boot sequence will be deployed against these attacks; however, secure key management and storage on ECUs are problems still to be solved. We proposed a key management method using a tamper-resistant AES cryptographic circuit and a PUF. The latest attack targets are ADAS sensors such as sonar, radar, and camera. After the vulnerability reports presented by some security researchers including us, a demonstration in which the auto-pilot system on Tesla’s car is easily fooled was reported in the summer of 2016. The development of tamper-resistant ADAS sensors will be important for future self-driving cars.

8S-4 (Time: 15:05 - 15:30)
Title(Invited Paper) Physical and Logical Attacks against LSI Chips and Their Countermeasures
AuthorShinichi Kawamura (Toshiba Corporation, Japan)
AbstractSince the proposal of timing attacks and power analysis in the mid-1990s, physical and logical attacks against LSI chips have been among the most critical issues in research on crypto implementation and in the smart card industry. Since the LSI for a smart card is almost bare, it is not difficult for attackers to gain direct access to the controller. In the smart card industry, however, a validation program and testing laboratories have been established and are working well to prevent critical incidents. Automobile controllers are now expected to be the next critical target of such attacks. Unlike the case of smart cards, an attack on the LSI chip of an automobile controller could directly lead to the loss of human life. Therefore, we have to learn seriously from the cases in the smart card industry how we should prepare against such threats.


Session 8A  Scheduling, Resource Management, and Simulation for Multi-Core Systems
Time: 13:50 - 15:30 Thursday, January 19, 2017
Location: Room 102
Chairs: Yuko Hara-Azumi (Tokyo Institute of Technology, Japan), Yi Wang (Shenzhen University, China)

8A-1 (Time: 13:50 - 14:15)
TitleAn Adaptive On-line CPU-GPU Governor for Games on Mobile Devices
AuthorPo-Kai Chuang, *Ya-Shu Chen, Po-Hao Huang (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 653 - 658
KeywordMobile Device, Power Management, User Experience
AbstractEnergy efficiency is a critical issue for battery-driven mobile devices. The popularity of mobile games with increasingly sophisticated graphics raises an urgent need for an on-line power governor for both CPUs and GPUs. This study proposes an adaptive on-line CPU-GPU governor for games on mobile devices to minimize energy consumption. The concept is implemented on a Google Nexus 7 device and evaluated using real-world gaming applications (APPs). The results show energy savings of up to 26% compared to the Performance governor in Linux (including network, screen, and system idle power) while maintaining a stable user experience.
Slides

8A-2 (Time: 14:15 - 14:40)
TitleA Static Scheduling Approach to Enable Safety-Critical OpenMP Applications
AuthorAlessandra Melani (Scuola Superiore Sant'Anna, Italy), *Maria A. Serrano (Barcelona Supercomputing Center and Technical University of Catalonia, Spain), Marko Bertogna (Università di Modena e Reggio Emilia, Italy), Isabella Cerutti (Scuola Superiore Sant'Anna, Italy), Eduardo Quiñones (Barcelona Supercomputing Center, Spain), Giorgio Buttazzo (Scuola Superiore Sant'Anna, Italy)
Pagepp. 659 - 665
KeywordOpenMP, Scheduling
AbstractParallel computation is fundamental to satisfy the performance requirements of advanced safety-critical systems. OpenMP is a good candidate to exploit the performance opportunities of parallel platforms. However, safety-critical systems are often based on static allocation strategies, whereas current OpenMP implementations are based on dynamic schedulers. This paper proposes two OpenMP-compliant static allocation approaches: an optimal but costly approach based on an ILP formulation, and a sub-optimal but tractable approach that computes a worst-case makespan bound close to the optimal one.
Slides

8A-3 (Time: 14:40 - 15:05)
TitleCommunication Driven Remapping of Processing Element (PE) in Fault-tolerant NoC-based MPSoCs
AuthorChia-Ling Chen, *Yen-Hao Chen, TingTing Hwang (National Tsing Hua University, Taiwan)
Pagepp. 666 - 671
KeywordNoC
AbstractWe propose a remapping algorithm to tolerate the failures of Processing Elements (PEs) on Multiprocessor System-on-Chip. A new graph modeling is proposed to precisely define the increase of communication cost among PEs after remapping. Our method can be used not only to repair faults but also to improve the communication cost of given initial mapping results. Experimental results show that under multiple failures, the communication cost by our method is 5.92% less on average compared with that by previous work using the same number of spare PEs. Moreover, the communication cost is further reduced by 4.53% after applying our method to initial mappings produced by NMAP.
Slides

8A-4 (Time: 15:05 - 15:30)
TitleDetailed and Highly Parallelizable Cycle-Accurate Network-on-Chip Simulation on GPGPU
Author*Amir Charif, Alexandre Coelho, Nacer-Eddine Zergainoh, Michael Nicolaidis (TIMA Laboratory, France)
Pagepp. 672 - 677
Keywordnetwork-on-chip, parallel simulation, gpu, cuda, noc simulator
AbstractAs the number of processing elements in modern chips keeps increasing, the evaluation of new designs will need to account for various challenges at the NoC level. To cope with the impractically long run times when simulating large NoCs, we introduce a novel GPU-based parallel simulation method that can speed up simulations by over 250x, while offering RTL-like accuracy. These promising results make our simulation method ideal for evaluating future NoCs comprising thousands of nodes.
Slides


Session 8B  Machine Learning: Acceleration and Application
Time: 13:50 - 15:30 Thursday, January 19, 2017
Location: Room 104
Chairs: Weichen Liu (Chongqing University, China), Nan Guan (The Hong Kong Polytechnic University, Hong Kong)

8B-1 (Time: 13:50 - 14:15)
TitleSpendthrift: Machine Learning Based Resource and Frequency Scaling for Ambient Energy Harvesting Nonvolatile Processors
Author*Kaisheng Ma, Xueqing Li, Srivatsa Rangachar Srinivasa (Pennsylvania State University, U.S.A.), Yongpan Liu (Tsinghua University, China), John (Jack) Sampson (Pennsylvania State University, U.S.A.), Yuan Xie (University of California at Santa Barbara, U.S.A.), Vijaykrishnan Narayanan (Pennsylvania State University, U.S.A.)
Pagepp. 678 - 683
KeywordNonvolatile processor, energy harvesting, machine learning, power-adaptive microarchitecture, Internet of Things
AbstractBatteryless energy harvesting systems face a twofold challenge in converting incoming energy into forward progress. Not only must such systems contend with inherently weak and fluctuating power sources, but they have very limited temporal windows for capitalizing on transitory periods of above-average power. To maximize forward progress, such systems should aggressively consume energy when it is available, rather than optimizing for peak average-case efficiency. However, there are multiple ways that a processor can trade between consumption and performance. In this paper, we examine two approaches, frequency scaling and resource scaling, and develop a predictor-driven scheme for dynamically allocating future power budgets between the two techniques. We show that our solution can achieve forward progress equal to 2.08X of the baseline Out-of-Order (OoO) processor with the best static configuration of frequency and resources. The combined technique outperforms either technique in isolation, with frequency-only and resource-only approaches achieving 1.43x and 1.61x forward progress improvements, respectively.

8B-2 (Time: 14:15 - 14:40)
TitleModular Reinforcement Learning for Self-Adaptive Energy Efficiency Optimization in Multicore System
AuthorZhe Wang, *Zhongyuan Tian, Jiang Xu, Rafael Kioji Vivas Maeda, Haoran Li, Peng Yang, Zhehui Wang, Luan H. K. Duong, Zhifei Wang, Xuanqi Chen (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 684 - 689
Keywordadaptive, modular reinforcement learning, multicore processor, DVFS, embedded system
AbstractBeing able to adapt to varying system conditions, learning-based DVFS control techniques can effectively improve energy efficiency for embedded systems. However, these techniques are known not to scale to multi-core systems when globally optimized solutions are required. For this reason, we propose a self-adaptive online DVFS control strategy based on core-level Modular Reinforcement Learning. Experimental results show that the proposed approach can improve energy efficiency by up to 28% compared to an existing learning-based technique on different system scales.
Slides

8B-3 (Time: 14:40 - 15:05)
TitleBHNN: a Memory-Efficient Accelerator for Compressing Deep Neural Networks with Blocked Hashing Techniques
AuthorJingyang Zhu (Hong Kong University of Science and Technology, Hong Kong), *Zhiliang Qian (Shanghai Jiao Tong University, China), Chi-Ying Tsui (Hong Kong University of Science and Technology, Hong Kong)
Pagepp. 690 - 695
Keyworddeep learning, model compression, hardware acceleration, FPGA
AbstractIn this paper, we propose a novel algorithm for compressing neural networks to reduce the memory requirements by using blocked hashing techniques. By adding blocked constraints on top of the conventional hashing technique, the test error rate is maintained while the spatial locality of the computations is preserved. Using this scheme, the synaptic connections are compressed by at least an order of magnitude (10x) compared with the plain neural network with virtually no loss in prediction accuracy. Compared with other compression techniques, the proposed algorithm achieves the best performance in the heavy compression regions. The blocked hashing techniques are also hardware friendly, allowing the memory hierarchy of the hardware architecture to be implemented efficiently. To demonstrate the hardware efficiency, we implement the hardware architecture of the deep neural networks using the proposed blocked hashing techniques on a Xilinx Virtex-7 FPGA board. With a hardware parallelism of 32, the accelerator achieves a speed-up of 22x over the CPU, and 3~5x over the GPU in the inference phase.
Slides

8B-4 (Time: 15:05 - 15:30)
TitleScalable Stochastic-Computing Accelerator for Convolutional Neural Networks
Author*Hyeonuk Sim, Dong Nguyen, Jongeun Lee (UNIST, Republic of Korea), Kiyoung Choi (Seoul National University, Republic of Korea)
Pagepp. 696 - 701
KeywordStochastic computing, Convolutional neural network, Accelerator, Low-cost
AbstractStochastic Computing (SC) is an alternative design paradigm particularly useful for applications where cost is critical. SC has been applied to neural networks, as neural networks are known for their high computational complexity. However, previous work in this area has critical limitations, such as the fully-parallel architecture assumption, which prevent it from being applicable to recent networks such as convolutional neural networks, or ConvNets. This paper presents the first SC architecture for ConvNets and shows its feasibility with detailed analyses of implementation overheads. Our SC-ConvNet is a hybrid between SC and conventional binary design, which is a marked difference from earlier SC-based neural networks. Though this might seem like a compromise, it is a novel feature driven by the need to support modern ConvNets at scale, which commonly have many large layers. Our proposed architecture also features hybrid layer composition, which helps achieve very high recognition accuracy. Our detailed evaluation results involving functional simulation and RTL synthesis suggest that SC-ConvNets are indeed competitive with conventional binary designs, even without considering the inherent error resilience of SC.
Slides
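
For readers unfamiliar with stochastic computing, the sketch below illustrates the basic idea behind its low cost: in the common unipolar encoding, a value in [0, 1] is represented by a random bitstream whose fraction of ones equals the value, and multiplying two independent streams reduces to a bitwise AND. This is a generic textbook illustration, not the SC-ConvNet architecture of paper 8B-4; the function names, stream length, and seed are assumptions chosen for the example.

```python
import random


def encode(value, length, rng):
    """Unipolar SC encoding: each bit is 1 with probability `value`."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("unipolar stochastic numbers must lie in [0, 1]")
    return [1 if rng.random() < value else 0 for _ in range(length)]


def decode(bits):
    """Estimate the encoded value as the fraction of ones in the stream."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with one AND gate per bit."""
    return [a & b for a, b in zip(a_bits, b_bits)]


rng = random.Random(0)       # fixed seed so the run is repeatable
a = encode(0.5, 20000, rng)
b = encode(0.6, 20000, rng)
product_estimate = decode(sc_multiply(a, b))
print(product_estimate)      # close to 0.5 * 0.6 = 0.3, up to random error
```

The result is only an estimate whose error shrinks as the stream gets longer, which is the usual trade-off between accuracy and latency in stochastic computing.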


Session 8C  Design Automation and Modeling for Emerging Technologies
Time: 13:50 - 15:30 Thursday, January 19, 2017
Location: Room 105
Chair: Yiran Chen (University of Pittsburgh, U.S.A.)

8C-1 (Time: 13:50 - 14:15)
TitleReservoir and Mixer Constrained Scheduling for Sample Preparation on Digital Microfluidic Biochips
Author*Varsha Agarwal, Ananya Singla (Indian Institute of Technology Roorkee, India), Mahammad Samiuddin (Indian Institute of Technology Kharagpur, India), Sudip Roy (Indian Institute of Technology Roorkee, India), Tsung-Yi Ho (National Tsing Hua University, Taiwan), Indranil Sengupta (Indian Institute of Technology Kharagpur, India), Bhargab B. Bhattacharya (Indian Statistical Institute Kolkata, India)
Pagepp. 702 - 707
Keywordbiochip, microfluidics, sample preparation, mixing, scheduling
AbstractIn recent years, digital microfluidic biochips have been widely used for implementing a wide range of biochemical laboratory protocols (bioprotocols) on hand-held devices. Accurate preparation of fluid samples is a fundamental preprocessing step that is needed in many bioprotocols. Oftentimes, the number of reservoirs built on-chip may be far less than the number of reactant fluids to be mixed. Hence, during the execution of an assay, several fluids have to be unloaded from the reservoirs to make room for loading new fluids stored off-line. Such unload-wash-load steps (switching) may be required several times, and these steps, being manual, significantly impact assay-completion time. In this paper, we propose a new scheduling scheme, namely Reservoir and Mixer constrained Scheduling (RMS), that can schedule a mixing tree obtained by a mixing algorithm while minimizing the number of switching steps so that the total completion time can be minimized. Simulation results over a large number of target ratios show that, given the mixing trees obtained by standard mixing algorithms such as MinMix/RMA/CoDOS, RMS reduces switching steps (on average by 40.3%/41.9%/33%) at the cost of increasing mixing time (by only 3.5%/6.2%/4.8%), compared to an existing scheduling scheme invoked with reservoir constraints.
Slides

8C-2 (Time: 14:15 - 14:40)
TitleExact Routing for Micro-Electrode-Dot-Array Digital Microfluidic Biochips
Author*Oliver Keszocze (University of Bremen, Germany), Zipeng Li (Duke University, U.S.A.), Andreas Grimmer, Robert Wille (Johannes Kepler University, Austria), Krishnendu Chakrabarty (Duke University, U.S.A.), Rolf Drechsler (University of Bremen and DFKI GmbH, Germany)
Pagepp. 708 - 713
KeywordDMFB, MEDA, routing
AbstractDigital microfluidics is an emerging technology that provides fluidic-handling capabilities on a chip. One of the most important issues when conducting experiments on such a biochip is the routing of droplets. A more recent variant of biochips uses a micro-electrode-dot-array (MEDA), which yields finer controllability of the droplets. Although this new technology allows for more advanced routing possibilities, it also poses new challenges to the corresponding CAD methods. In contrast to conventional microfluidic biochips, droplets on MEDA biochips may move diagonally on the grid and are not bound to have the same shape during the whole experiment. In this work, we present an exact routing method that copes with these challenges while, at the same time, guaranteeing to find the minimal solution with respect to completion time. For the first time, this allows for evaluating the benefits of MEDA biochips compared to their conventional counterparts as well as a quality assessment of previously proposed routing methods in this domain.
Slides

8C-3 (Time: 14:40 - 15:05)
TitleMajority Logic Circuits Optimisation by Node Merging
AuthorChun-Che Chung (Department of Computer Science, National Tsing Hua University, Taiwan), Yung-Chih Chen (Department of Computer Science & Engineering, Yuan Ze University, Taiwan), Chun-Yao Wang, *Chia-Cheng Wu (Department of Computer Science, National Tsing Hua University, Taiwan)
Pagepp. 714 - 719
Keywordmajority logic, node merging, logic synthesis, optimization
AbstractQuantum-dot Cellular Automata (QCA) has emerged as a new design paradigm for nanotechnologies. Since the operational logic in QCA is the majority logic, much research on the synthesis and optimisation of majority logic has been proposed recently. In this paper, we propose an optimisation method that merges nodes in the Majority-Inverter-Graph, which is the representation of majority logic circuits. Instead of using satisfiability solvers, our approach identifies node mergers by using logic implications for circuit size reduction. The experimental results show that for a set of EPFL benchmarks, our approach reduces the node count by 21% on average when integrated with the state-of-the-art.
Slides
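
As background for the majority logic discussed in the 8C-3 abstract, the sketch below shows the three-input majority function and how AND and OR arise from it by fixing one input to a constant, which is why a graph of majority nodes and inverters can represent any logic circuit. It illustrates the primitive only and is not the node-merging method of the paper; all function names are illustrative.

```python
from itertools import product


def maj(a, b, c):
    """Three-input majority: 1 when at least two of the inputs are 1."""
    return (a & b) | (a & c) | (b & c)


def and2(a, b):
    """AND as a majority node with one input tied to constant 0."""
    return maj(a, b, 0)


def or2(a, b):
    """OR as a majority node with one input tied to constant 1."""
    return maj(a, b, 1)


def check_identities():
    """Exhaustively verify the identities over all 0/1 input combinations."""
    for a, b, c in product((0, 1), repeat=3):
        assert and2(a, b) == (a & b)
        assert or2(a, b) == (a | b)
        # Self-duality: inverting every input inverts the output.
        assert maj(1 - a, 1 - b, 1 - c) == 1 - maj(a, b, c)
    return True


print(check_identities())  # True
```

The self-duality checked at the end is one of the properties that lets inverters be moved through majority nodes during optimisation.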

8C-4 (Time: 15:05 - 15:30)
TitleA Statistical STT-RAM Retention Model for Fast Memory Subsystem Designs
AuthorZihao Liu, *Wujie Wen (Florida International University, U.S.A.), Lei Jiang (Indiana University Bloomington, U.S.A.), Yier Jin (University of Central Florida, U.S.A.), Gang Quan (Florida International University, U.S.A.)
Pagepp. 720 - 725
KeywordSTT-RAM, Retention, Statistical, Compact Model
AbstractSpin-transfer torque random access memory (STT-RAM) is a promising nonvolatile memory (NVM) solution for implementing on-chip caches and off-chip main memories owing to its high integration density and short access time, but it suffers from considerable write latency and energy overhead. Aggressively relaxing its non-volatility to obtain fast and energy-efficient writes in memory subsystems has been quite debatable, due to the unclear retention behavior on a timescale of microseconds-to-seconds. Moreover, recent studies project that retention failure will eventually dominate the cell reliability as STT-RAM scales. As a result, a comprehensive understanding of the thermal noise induced STT-RAM retention mechanism has become a must. In this work, we develop a compact semi-analytical model for fast retention failure analysis. We then systematically analyze critical factors (e.g., initial angle, device dimension, etc.) and their impacts on the STT-RAM retention behavior through our model. Our experimental results show that STT-RAM suffers from a soft-error style retention failure, which may happen instantly just after the last write finishes and is totally different from that of DRAM and Flash, i.e., the gradual charge loss process. Our model offers excellent agreement with the results from golden macromagnetic simulations in the region of interest without conducting expensive Monte-Carlo runs. Finally, we demonstrate that our model can enable architectural designers to rethink STT-RAM based memory designs by emphasizing its probabilistic retention property.
Slides


Session 9S  (Designers' Forum) Advanced Image Sensing and Processing
Time: 15:50 - 17:30 Thursday, January 19, 2017
Location: Room 103
Organizers: Yusuke Oike (Sony Semiconductor Solutions, Japan), Masaitsu Nakajima (Socionext, Japan)

9S-1 (Time: 15:50 - 16:15)
Title(Invited Paper) An APS-H-Size 250Mpixel CMOS Image Sensor Using Column Single-Slope ADCs with Dual-Gain Amplifiers
AuthorHirofumi Totsuka (Canon, Japan)
AbstractThis talk presents an APS-H size 250Mpixel CMOS image sensor. The sensor is fabricated with a 0.13μm CMOS technology and is based on 1.5μm-pitch pixels and single-slope column ADCs with dual-gain amplifiers featuring 6dB wider dynamic range and 75% shorter conversion time than a conventional single-slope ADC.

9S-2 (Time: 16:15 - 16:40)
Title(Invited Paper) A 1/1.7-inch 20Mpixel Back-Illuminated Stacked CMOS Image Sensor
AuthorChihiro Okada (Sony Semiconductor Solutions, Japan)
AbstractThis talk presents a 1/1.7-inch 20Mpixel back-illuminated stacked CMOS image sensor with multi-functional modes: parallel multiple sampling, two simultaneous output streams, and data compression. The sensor achieves an RMS random noise of 1.3e- with parallel multiple sampling, and provides two simultaneous output streams of 4Mpixel for a movie mode and 16Mpixel for a still mode with a 2.3Gb/s/lane high-speed interface. Furthermore, it offers a high-speed output mode of 16Mpixel at 120fps with low-image-degradation compression. The stacked structure realizes an analog implementation of the double column-parallel ADCs.

9S-3 (Time: 16:40 - 17:05)
Title(Invited Paper) Emerging Applications Based on High-speed Computational Vision
AuthorYoshihiro Watanabe (The University of Tokyo, Japan)
AbstractHigh-speed computational vision can simultaneously execute not just image capturing and recording but also image processing at the level of 1,000 fps. This system is expected to open up emerging applications in various fields. This presentation gives an overview of the system architectures, image processing, and image sensing technologies, along with actual application examples such as high-speed digital archiving and interactive display systems.

9S-4 (Time: 17:05 - 17:30)
Title(Invited Paper) Acceleration of Partial Image Matching on FPGA Platforms Using OpenCL
AuthorNoboru Yoneoka (Fujitsu Laboratories, Japan)
AbstractThe increasing amount of data such as presentation materials and visualized documents leads to high demand for efficient document search systems. In this presentation, we introduce a visual document search system with a partial image matching engine accelerated on an FPGA. The FPGA accelerator is developed in an OpenCL environment, which enables highly efficient description of FPGA hardware and software. With task parallelization and a data pipeline architecture, the system provides high throughput and quick response.


Session 9A  New Directions in Networks on Chip
Time: 15:50 - 17:30 Thursday, January 19, 2017
Location: Room 102
Chair: Kun-Chih Chen (National Sun Yat-Sen University, Taiwan)

9A-1 (Time: 15:50 - 16:15)
TitleDLPS: Dynamic Laser Power Scaling for Optical Network-on-Chip
Author*Fan Lan (Zhejiang University, China), Rui Wu, Chong Zhang (University of California, Santa Barbara, U.S.A.), Yun Pan (Zhejiang University, China), Kwang-ting Cheng (University of California, Santa Barbara, U.S.A.)
Pagepp. 726 - 731
KeywordNetwork-on-Chip, optical interconnect, power efficiency
AbstractOptical Network-on-Chip (NoC), offering the advantages of low energy consumption, high bandwidth, and low latency, is a promising solution for on-chip communications of multi-core systems. However, on-chip lasers, a key element in optical NoCs, are a dominant source of power consumption. In this paper, we propose dynamic laser power scaling (DLPS), a fine-grained control strategy for minimizing laser power consumption while meeting the communication bandwidth required for the application. The proposed DLPS strategy intelligently switches among multiple operation modes based on the communication traffic pattern. Our experiments show that by introducing two new modes (standby, and intermediate data rate), DLPS can further reduce the communication energy as well as reduce the execution time for communication-intensive applications, compared to a simple on-off control strategy that dynamically turns lasers either completely on or off.
Slides

9A-2 (Time: 16:15 - 16:40)
TitleAdaptive Load Distribution in Mixed-Critical Networks-On-Chip
Author*Adam Kostrzewa, Sebastian Tobuschat, Leonardo Ecco, Rolf Ernst (TU Braunschweig, Germany)
Pagepp. 732 - 737
Keywordreal-time, mixed-critical, networks-on-chip, safety
AbstractModern Networks-on-Chip (NoCs) must accommodate a diversity of temporal requirements, e.g., provide guarantees for real-time senders with minimum impact on performance-sensitive best-effort (BE) traffic. In this work, we propose a protocol-based adaptive load distribution which, by selectively detouring BE traffic (i.e., load balancing), significantly improves the NoC’s performance without costly hardware extensions. The introduced method offers safe and efficient runtime integration of mixed-critical workloads by coupling the flow control with path selection based on the global NoC state. The requested real-time reliability of the interconnect is achieved through predictable synchronization with control messages, supported by a formal analysis and an experimental evaluation.
Slides

9A-3 (Time: 16:40 - 17:05)
TitleBoDNoC: Providing Bandwidth-on-Demand Interconnection for Multi-Granularity Memory Systems
Author*Shiqi Lian, Ying Wang, Yinhe Han, Xiaowei Li (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 738 - 743
KeywordNetwork-on-Chip, Access granularity
AbstractA multi-granularity memory system provides multiple access granularities for applications with various spatial localities. Under the multi-granularity access pattern, a one-size-bandwidth NoC design cannot utilize the bandwidth efficiently. We propose a novel NoC design, called BoDNoC, which can merge multiple narrow subnets to provide various bandwidths for data access. The new design also adopts an optimization algorithm to take full advantage of the bandwidth. Experimental results show that BoDNoC can improve the throughput by 23.5% and reduce the energy consumption by 37.2% in comparison with a one-size-bandwidth NoC design.
Slides

9A-4 (Time: 17:05 - 17:30)
TitleUsing Segmentation to Improve Schedulability of RRA-based NoCs with Mixed Traffic
AuthorMeng Liu, *Matthias Becker, Moris Behnam, Thomas Nolte (Mälardalen University, Sweden)
Pagepp. 744 - 750
Keywordnetwork-on-chip, real-time application, segmentation, schedulability
AbstractNetwork-on-Chip (NoC) is the interconnect of choice for manycore processors and system-on-chips in general. Most of the existing NoC designs focus on the performance with respect to average throughput, which makes them less applicable for real-time applications especially when applications have hard timing requirements on the worst-case scenarios. In this paper, we focus on a Round-Robin Arbitration (RRA) based wormhole-switched NoC which is a common architecture used in most of the existing implementations. We propose a novel segmentation algorithm targeting RRA-based NoCs in order to improve the schedulability of real-time traffic without modifying the hardware architecture. Additionally, we also address the problem of transmitting both real-time traffic and best-effort traffic in the same NoC. The proposed solutions aim to provide timing guarantees to real-time traffic and achieve low latency for best-effort traffic. According to the evaluation results, the proposed segmentation solution can significantly improve the schedulability of the whole network.
Slides


Session 9B  Memory Architecture: Now and Future
Time: 15:50 - 17:30 Thursday, January 19, 2017
Location: Room 104
Chairs: Hyung Gyu Lee (Daegu University, Republic of Korea), Shimpei Sato (Tokyo Institute of Technology, Japan)

9B-1 (Time: 15:50 - 16:15)
TitleBuilding Energy-Efficient Multi-Level Cell STT-RAM Caches with Data Compression
AuthorLiu Liu, Ping Chi, Shuangchen Li, Yuanqing Cheng, *Yuan Xie (University of California, Santa Barbara, U.S.A.)
Pagepp. 751 - 756
KeywordCache, Data Compression
AbstractSpin-transfer torque magnetic random access memory (STT-RAM) technology has emerged as a potential replacement of SRAM in cache design, especially for building large-scale and energy-efficient last level caches. Compared with single-level cell (SLC), multi-level cell (MLC) STT-RAM is expected to double cache capacity and increase system performance. However, the two-step read/write access schemes incur considerable energy consumption and performance degradation. In this paper, we propose two techniques using data compression to optimize MLC STT-RAM cache design. The first technique tries to compress a cache line and fit it into only the soft-bit region of the cells, so that reading or writing this cache line takes only one step, which is fast and energy-efficient. We introduce a second technique to increase the cache capacity by enabling the left hard-bit region to store another compressed cache line, which can improve the system performance for memory intensive workloads. The experimental results show that, compared with a conventional MLC STT-RAM last level cache design, our overhead minimized technique reduces the dynamic energy consumption by 38.2% on average with the same system performance, and our capacity augmented technique boosts the system performance by 6.1% with 19.2% dynamic energy saving on average, across the evaluated multi-programmed benchmarks.

9B-2 (Time: 16:15 - 16:40)
TitleMPIM: Multi-Purpose In-Memory Processing Using Configurable Resistive Memory
AuthorMohsen Imani, *Yeseong Kim, Tajana Rosing (University of California San Diego, U.S.A.)
Pagepp. 757 - 763
KeywordProcessing in memory, Non-volatile memory, Content addressable memory, Search computation, bitwise operation
AbstractRunning Internet of Things applications on general purpose processors results in a large energy and performance overhead, due to the high cost of data movement. Processing in-memory is a promising solution to reduce the data movement cost by processing the data locally inside the memory. In this paper, we design a Multi-Purpose In-Memory Processing (MPIM) system, which can be used as main memory and for processing. MPIM consists of multiple crossbar memories with the capability of efficient in-memory computations. Instead of transferring the large dataset to the processors, MPIM provides two important in-memory processing capabilities: i) data searching for the nearest neighbor, and ii) bitwise operations including OR, AND and XOR with small analog sense amplifiers. The experimental results show that the MPIM can achieve up to 5.5x energy savings and 19x speedup for the search operations as compared to an AMD GPU-based implementation. For bitwise vector processing, we present 11000x energy improvements with 62x speedup over the SIMD-based computation, while outperforming other state-of-the-art in-memory processing techniques.

9B-3 (Time: 16:40 - 17:05)
TitleExtending the Lifetime of Object-based NAND Flash Device with STT-RAM/DRAM Hybrid Buffer
AuthorChuhan Min, Jie Guo, Hai Li, *Yiran Chen (University of Pittsburgh, U.S.A.)
Pagepp. 764 - 769
KeywordNAND flash memories, object-based interface, hybrid buffer, STT-RAM, simulation
AbstractA major limitation of NAND flash memory is its erase-before-program characteristic. It incurs write amplification, severely degrading system performance and endurance. Previous works reveal that metadata updates substantially contribute to write amplification in object-based NAND flash devices (ONFD). To further reduce the overhead of metadata updates in ONFD, we propose a hybrid buffer scheme (HBS) that utilizes the lower latency and byte-addressable characteristics of the promising emerging non-volatile memory STT-RAM. Our HBS stores the ONFD metadata with the highest cost in a complementary STT-RAM buffer to reduce write amplification. Considering the limited size of STT-RAM, we propose a hybrid buffer management technique to maximize effective memory utilization. In addition, by leveraging the non-volatility of STT-RAM, our HBS can also substantially reduce data recovery overhead and complexity upon power failure. Experimental results show that the proposed design can achieve up to 15% performance improvement with an average 34% endurance extension compared to the state-of-the-art works.

9B-4 (Time: 17:05 - 17:30)
TitleLocality-Aware Bank Partitioning for Shared DRAM MPSoCs
Author*Yangguo Liu, Junlin Lu, Dong Tong, Xu Cheng (Peking University, China)
Pagepp. 770 - 775
KeywordMPSoC, memory partitioning, memory interference, spatial locality, system performance
AbstractMemory interference is a critical impediment to system performance in MPSoCs. To address this problem, we first propose a Locality-Aware Bank Partitioning (LABP), which partitions memory banks according to applications’ memory access behavior. The key idea is to separate memory intensive applications with high row-buffer locality from the other applications. Moreover, we integrate LABP with a bandwidth allocation scheme to leverage the architecture advantages, and present a comprehensive approach named Integrated Bandwidth and Bank Partitioning (IBBP) to further alleviate the interference. Experimental results show LABP improves system throughput/fairness by 10.8%/26.4%. IBBP provides 14.1% better system throughput and 34.2% better system fairness. Our methods are better than other recent works, including bandwidth throttling, DBP and DBP-TCM.
Slides


Session 9C  Intelligent Computing with Memristor Technologies
Time: 15:50 - 17:30 Thursday, January 19, 2017
Location: Room 105
Chair: Yuan-Hao Chang (Academia Sinica, Taiwan)

9C-1 (Time: 15:50 - 16:15)
TitleClassification Accuracy Improvement for Neuromorphic Computing Systems with One-level Precision Synapses
AuthorYandan Wang, Wei Wen, Linghao Song, *Hai Li (University of Pittsburgh, U.S.A.)
Pagepp. 776 - 781
KeywordMemristor, Neuromorphic Computing, Binary, Neural Networks, Classification
AbstractBrain inspired neuromorphic computing has demonstrated remarkable advantages over traditional von Neumann architecture for its high energy efficiency and parallel data processing. However, the limited resolution of synaptic weights degrades system accuracy and thus impedes the use of neuromorphic systems. In this work, we propose three orthogonal methods to learn synapses with one-level precision, namely, distribution-aware quantization, quantization regularization and bias tuning, to make image classification accuracy comparable to the state-of-the-art. Experiments on both multi-layer perceptron and convolutional neural networks show that the accuracy drop can be well controlled within 0.19% (5.53%) for MNIST (CIFAR-10) database, compared to an ideal system without quantization.

9C-2 (Time: 16:15 - 16:40)
TitleBinary Convolutional Neural Network on RRAM
Author*Tianqi Tang, Lixue Xia, Boxun Li, Yu Wang, Huazhong Yang (Tsinghua University, China)
Pagepp. 782 - 787
Keywordbinary neural network
AbstractConvolutional Neural Networks (CNNs) have achieved great recognition performance, though they require larger computational intensity and higher bandwidth. The emerging RRAM-based computing system has been considered a promising solution. In this paper, an RRAM crossbar-based accelerator is proposed for the CNN forward process. The RRAM-based Convolver Circuit and the structure for intermediate data caching are introduced. Moreover, the RRAM-based implementation of binary CNN is discussed in detail. The robustness of low bit-level network weights under device variation is demonstrated.
Slides

9C-3 (Time: 16:40 - 17:05)
TitleAlgorithm-Hardware Co-Optimization of the Memristor-Based Framework for Solving SOCP and Homogeneous QCQP Problems
Author*Ao Ren (Syracuse University, U.S.A.), Sijia Liu (University of Michigan, U.S.A.), Ruizhe Cai (Syracuse University, U.S.A.), Wujie Wen (Florida International University, U.S.A.), Pramod K. Varshney, Yanzhi Wang (Syracuse University, U.S.A.)
Pagepp. 788 - 793
KeywordMemristor crossbar, SOCP, homogeneous QCQP, ADMM, process variations
AbstractA memristor device, which was recently invented by HP Lab, has a unique ability to change and record its own state. Moreover, its exciting features like high density, low power and great scalability make it an important candidate to be constructed as a crossbar structure to participate in a large number of matrix operations. The memristor crossbar technology can potentially be utilized for developing a low-complexity and high-scalability solution framework for a large category of convex optimization problems, which involve extensive matrix operations and have critical applications in multiple disciplines. This paper, as the first attempt in this direction, proposes a novel memristor crossbar-based framework for solving two important convex optimization problems, i.e., second-order cone programming (SOCP) and homogeneous quadratically constrained quadratic programming (QCQP) problems. The proposed framework has innovations at both algorithm and hardware levels. At the algorithm level, an operator splitting method, the alternating directions method of multipliers (ADMM), is adopted to split the SOCP and homogeneous QCQP problems into the forms of solving linear systems, which could be effectively solved using the memristor crossbar in O(1) time complexity. The proposed framework is an iterative procedure, which iterates a constant number of times with O(N) complexity for each iteration. Therefore the SOCP and homogeneous QCQP problems can be solved with a time complexity of pseudo-O(N), which is a significant reduction compared to the state-of-the-art software solvers of O(N^3.5) - O(N^4). Experimental results demonstrate that reliable performance with high accuracy can be achieved under process variations.
Slides
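
The complexity claim in the 9C-3 abstract can be restated as a short calculation. The symbols K (the constant number of iterations) and T_total are introduced here only for illustration and do not come from the paper.

```latex
T_{\text{total}} \;=\; \underbrace{K}_{\text{constant iterations}} \cdot \underbrace{O(N)}_{\text{cost per iteration}} \;=\; O(N),
\qquad \text{compared with } O(N^{3.5}) \text{ to } O(N^{4}) \text{ for software solvers.}
```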

9C-4 (Time: 17:05 - 17:30)
TitleComputation-Oriented Fault-Tolerance Schemes for RRAM Computing Systems
AuthorWenqin Huangfu, *Lixue Xia, Ming Cheng, Xiling Yin, Tianqi Tang, Boxun Li (Tsinghua University, China), Krishnendu Chakrabarty (Duke University, U.S.A.), Yuan Xie (University of California at Santa Barbara, U.S.A.), Yu Wang, Huazhong Yang (Tsinghua University, China)
Pagepp. 794 - 799
KeywordRRAM, SAFs, Fault-Tolerance, Mapping Algorithm, Redundancy Schemes
AbstractThe emerging metal-oxide resistive switching random-access memory (RRAM) devices and RRAM crossbar arrays have demonstrated their potential in enormously boosting the speed and energy-efficiency of analog matrix-vector multiplication. Unfortunately, due to the immature fabrication technology, commonly occurring Stuck-At-Faults (SAFs) seriously degrade the computational accuracy of RRAM crossbar based Computing System (RCS). In this paper, we propose a Mapping Algorithm with inner fault-tOlerant ability (MAO) to convert matrix parameters into RRAM conductances in RCS by providing larger mapping space and fully exploring the available mapping space. Furthermore, we present two computation-oriented redundancy schemes, 'Redundant Crossbars' (RX) and 'Independent Redundant Columns' (IRC), to alleviate the loss of computational accuracy due to SAFs. RX adds redundant RRAM crossbar arrays and IRC introduces independent redundant RRAM columns. Compared with the original design of RCS, experimental results show that, with 5% SAFs, MAO can improve the recognition accuracy of the Modified National Institute of Standards and Technology (MNIST) dataset from 47.89% to 95.99% with no extra overhead. With 10% SAFs, RX and IRC can improve the recognition accuracy of the MNIST dataset from 25.83% to 97.17% and 96.13% with energy overhead of 37.58% and 27.22%, respectively.
Slides
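
To make the problem statement in the 9C-4 abstract concrete, the following is a minimal sketch of how stuck-at faults can perturb a matrix-vector product. It is a plain software illustration under stated assumptions, not the paper's MAO, RX, or IRC schemes: the function names, the uniform random fault model, and the stuck values 0.0 and 1.0 are all hypothetical choices made for this example.

```python
import random


def matvec(matrix, vector):
    """Ideal matrix-vector product: one output value per matrix row."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def inject_stuck_at_faults(matrix, fault_rate, low=0.0, high=1.0, seed=0):
    """Return a copy of `matrix` where each cell is, with probability
    `fault_rate`, replaced by a stuck value (`low` or `high`).

    Assumption: faults are independent and uniformly distributed, which is
    a simplification chosen for illustration only.
    """
    rng = random.Random(seed)
    faulty = [row[:] for row in matrix]
    for row in faulty:
        for c in range(len(row)):
            if rng.random() < fault_rate:
                row[c] = rng.choice((low, high))
    return faulty


def max_abs_error(a, b):
    """Largest element-wise deviation between two output vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    weights = [[0.2, 0.5, 0.1], [0.7, 0.3, 0.9], [0.4, 0.6, 0.8]]
    inputs = [1.0, 2.0, 3.0]
    ideal = matvec(weights, inputs)
    faulty = matvec(inject_stuck_at_faults(weights, fault_rate=0.1), inputs)
    print("ideal :", ideal)
    print("faulty:", faulty)
    print("max abs error:", max_abs_error(ideal, faulty))
```

With `fault_rate=0.0` the faulty output equals the ideal one, so the error measured this way isolates the effect of the injected faults.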