The 20th Asia and South Pacific Design Automation Conference
Technical Program

Remark: The presenter of each paper is marked with "*".

Session Schedule


Tuesday, January 20, 2015

1K  Opening & Keynote I (International Conference Room, 8:30 - 9:50)

1S  University Design Contest (Room 103, 10:20 - 12:10)
1A  NoCS I (Performance and Fault Tolerance) (Room 102, 10:20 - 12:00)
1B  Toward Power Efficient Design (Room 104, 10:20 - 12:00)
1C  Modeling and Design Methodologies of Post-silicon Devices (Room 105, 10:20 - 12:00)

2S  (Special Session) Internet of Things (Room 103, 13:50 - 15:30)
2A  NoCS II (Power and Emerging Technology) (Room 102, 13:50 - 15:30)
2B  Design Automation for Tomorrow’s Circuit Technologies (Room 104, 13:50 - 15:30)
2C  Emerging Applications (Room 105, 13:50 - 15:30)

3S  (Special Session) New Challenges and Solutions in Nanometer Physical Design (Room 103, 15:50 - 17:30)
3A  Circuits for Performance and Reliability (Room 102, 15:50 - 16:40)
3B  Frontiers in Logic Synthesis (Room 104, 15:50 - 17:30)
3C  Energy Optimization for Electric Vehicles and Smart Grids (Room 105, 15:50 - 17:30)



Wednesday, January 21, 2015

2K  Keynote II (International Conference Room, 9:00 - 9:50)

4S  (Special Session) Machine Learning in EDA: Promises and Challenges in Selected Applications (Room 103, 10:15 - 12:20)
4A  Efficient NVM Management, from Register to Disk (Room 102, 10:15 - 12:20)
4B  Robust Timing, and P/G Modeling and Design (Room 104, 10:15 - 12:20)
4C  New Issues in Placement and Routing (Room 105, 10:15 - 12:20)

5S  (Designers' Forum) Car Electronics (Room 103, 13:50 - 15:30)
5A  Optimization and Exploration for Caches (Room 102, 13:50 - 15:30)
5B  CAD for Analog/RF/Mixed-Signal Design (Room 104, 13:50 - 15:30)
5C  Next-Generation Clock Network Synthesis (Room 105, 13:50 - 15:30)

6S  (Designers' Forum) Panel Discussion: Challenges in the Era of Big-Data Computing (Room 103, 15:50 - 17:30)
6A  Optimization Techniques for Non-Volatile Memory based Systems (Room 102, 15:50 - 17:30)
6B  Test for Higher Quality (Room 104, 15:50 - 17:30)
6C  Reliability (Room 105, 15:50 - 17:30)

Banquet (Convention Hall A, 18:00 - 20:00)



Thursday, January 22, 2015

3K  Keynote III (International Conference Room, 9:00 - 9:50)

7S  (Special Session) The Future of Emerging ReRAM Technology (Room 103, 10:15 - 12:20)
7A  Ensuring the Correctness of System Integration (Room 102, 10:15 - 12:20)
7B  Orchestrating Tasks, Cores, and Communication (Room 104, 10:15 - 12:20)
7C  Design for Manufacturability (Room 105, 10:15 - 12:20)

8S  (Designers' Forum) Technology Trend toward 8K Era (Room 103, 13:50 - 15:30)
8A  Exploring Better Architecture of Your Systems (Room 102, 13:50 - 15:30)
8B  Circuit-Level Modeling and Simulation (Room 104, 13:50 - 15:30)
8C  Reliable and Trustworthy Electronics (Room 105, 13:50 - 15:30)

9S  (Designers' Forum) Panel Discussion: IP Base SoC Design and IP Design Innovation (Room 103, 15:50 - 17:30)
9A  Power/Thermal Management and Modeling (Room 102, 15:50 - 17:30)
9B  (Special Session) System-Level Designs and Tools for Multicore Systems (Room 104, 15:50 - 17:30)
9C  Building Secure Systems (Room 105, 15:50 - 17:30)


List of papers

Tuesday, January 20, 2015

Session 1K  Opening & Keynote I
Time: 8:30 - 9:50 Tuesday, January 20, 2015
Location: International Conference Room
Chair: Kunio Uchiyama (Hitachi)

1K-1 (Time: 8:30 - 9:50)
Title: (Keynote Address) The Required Technologies for Automotive towards 2020
Author: *Udo Wolz (Bosch Corporation, Japan)
Page: p. 1
Abstract: This keynote speech deals with the future of the automotive industry and the requirements arising from new applications and technologies. The mobility of the future will be electric, automated and connected. Until 2020 the internal combustion engine will still dominate the powertrain with approximately 90% share. This includes systems with mild electrification such as start/stop. For more strongly electrified vehicles such as hybrids, plug-in hybrids and full EVs, battery technologies and battery management are key. Automated driving is already on the way with driver assistance functions and will end up with fully automated driving. Surround sensing of cars and connections between cars and from cars to infrastructure will lead to extremely high demands on computing power. Information and Communication Technology is key here. For safety functions, extremely short reaction times in the millisecond range are necessary. Security mechanisms to ensure proper, unimpaired operation are a must. Seamless communication from home to car and to other parts of life is expected. This leads to smartphone connectivity with the car, car connectivity with the cloud, etc. The car will be part of the Internet. The technologies to achieve this are already moving from consumer electronics and IT into cars and vice versa. For example, MEMS, the highly reliable micromechanical sensors long used in automotive applications, have now entered the consumer electronics market: Bosch sensors can be found in more than every second smartphone worldwide. With increasing electrification, automation and connectivity, the requirements and the market demand for VLSI and embedded systems' computing power will increase continuously.


Session 1S  University Design Contest
Time: 10:20 - 12:10 Tuesday, January 20, 2015
Location: Room 103
Chairs: Hiroyuki Ito (Tokyo Institute of Technology, Japan), Noriyuki Miura (Kobe University, Japan)

1S-1 (Time: 10:20 - 10:24)
Title: An HDL-Synthesized Gated-Edge-Injection PLL with A Current Output DAC
Author: *Dongsheng Yang, Wei Deng, Tomohiro Ueno, Teerachot Siriburanon, Satoshi Kondo, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page: pp. 2 - 3
Keyword: Synthesizable, Logic synthesis, ADPLL, Gated edge injection, Standard cell
Abstract: This paper presents a small-area, low-power, fully synthesizable PLL with a current output DAC and an interpolative-phase coupled oscillator using an edge injection technique for on-chip clock generation. A prototype PLL fabricated in a 65nm digital CMOS process achieves a 1.7-ps integrated jitter at 0.9 GHz and consumes 0.78 mW, leading to an FOM of -236.5 dB, while occupying an area of only 0.0066 mm². It achieves the best performance-area trade-off.
Slides
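
As a sanity check on the figures quoted in the 1S-1 abstract, the sketch below recomputes the figure of merit from the reported jitter and power. It assumes the commonly used jitter-power definition, FOM = 10 log10((jitter / 1 s)^2 x (power / 1 mW)); the abstract does not state which definition the authors use, and the function name is hypothetical.

```python
import math


def pll_jitter_fom_db(jitter_s: float, power_mw: float) -> float:
    """Jitter-power figure of merit in dB.

    Assumed definition (not stated in the abstract):
    FOM = 10 * log10((jitter / 1 s)^2 * (power / 1 mW)).
    """
    return 10.0 * math.log10((jitter_s ** 2) * power_mw)


# Values reported in the abstract: 1.7 ps integrated jitter, 0.78 mW.
fom = pll_jitter_fom_db(1.7e-12, 0.78)
print(f"FOM = {fom:.1f} dB")
```

Under this assumed definition the reported jitter and power reproduce the quoted -236.5 dB to within rounding.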

1S-2 (Time: 10:24 - 10:28)
Title: An Oscillator-Based True Random Number Generator with Process and Temperature Tolerance
Author: Takehiko Amaki, *Masanori Hashimoto, Takao Onoye (Osaka University, Japan)
Page: pp. 4 - 5
Keyword: true random number generator, process variation, temperature fluctuation
Abstract: This paper presents an oscillator-based true random number generator (TRNG) that automatically adjusts the duty cycle of a fast oscillator to 50 %, and generates unbiased random numbers tolerating process variation and dynamic temperature fluctuation. Measurement results with 65nm test chips show that the proposed TRNG adjusted the probability of ‘1’ to within 50 ± 0.07 % in five chips in the temperature range of 0 ℃ to 75 ℃. Consequently, the proposed TRNG passed the NIST and DIEHARD tests at 7.5 Mbps with 6,670 μm² area.
Slides

1S-3 (Time: 10:28 - 10:32)
Title: Implementation of Double Arbiter PUF and Its Performance Evaluation on FPGA
Author: *Takanori Machida (The University of Electro-Communications, Japan), Dai Yamamoto (Fujitsu Laboratories Ltd., Japan), Mitsugu Iwamoto, Kazuo Sakiyama (The University of Electro-Communications, Japan)
Page: pp. 6 - 7
Keyword: Arbiter-based PUF, FPGA, Equal-length wiring, Uniqueness, Machine-learning attacks
Abstract: Low uniqueness and vulnerability to machine-learning attacks are known as two major problems of Arbiter-Based Physically Unclonable Function (APUF) implemented on FPGAs. In this paper, we implement Double APUF (DAPUF) that duplicates the original APUF in order to overcome the problems. From the experimental results on Xilinx Virtex-5, we show that the uniqueness of DAPUF becomes almost ideal, and the prediction rate of the machine-learning attack decreases from 86% to 57%.

1S-4 (Time: 10:32 - 10:36)
Title: A Negative-Resistance Sense Amplifier for Low-Voltage Operating STT-MRAM
Author: *Yohei Umeki, Koji Yanagida (Graduate School of System Informatics, Kobe University, Japan), Shusuke Yoshimoto (Department of Electrical Engineering, Stanford University, U.S.A.), Shintaro Izumi, Masahiko Yoshimoto, Hiroshi Kawaguchi (Graduate School of System Informatics, Kobe University, Japan), Koji Tsunoda, Toshihiro Sugii (Low-Power Electronics Association and Project (LEAP), Japan)
Page: pp. 8 - 9
Keyword: Low-voltage, STT-MRAM, Non-volatile memory
Abstract: This paper presents a 65-nm 8-Mb spin transfer torque magnetoresistance random access memory (STT-MRAM) operating at 0.38 V. The proposed sense amplifier comprises a boosted-gate nMOS and negative-resistance pMOSes as loads, which maximizes the readout margin. The STT-MRAM achieves a cycle time of 1.9 μs (= 0.526 MHz) at 0.38 V. The operating power is 1.70 μW at that voltage.
Slides

1S-5 (Time: 10:36 - 10:40)
Title: A High Stability, Low Supply Voltage and Low Standby Power Six-Transistor CMOS SRAM
Author: *Nobuaki Kobayashi, Ryusuke Ito, Tadayoshi Enomoto (Chuo University, Japan)
Page: pp. 10 - 11
Keyword: CMOS, SRAM, Standby Power Dissipation, Self-controllable Voltage Level (SVL) Circuit
Abstract: Static random access memories (SRAMs) having high “read” and “write” margins, and a small standby power (PST) are needed for use in low supply voltage battery-driven portable systems. The decrease in MOSFET sizes increases not only leakage currents, but also threshold voltage variation that results in smaller margins [1], [2]. To solve these problems, a very small circuit called a “Self-controllable Voltage Level (SVL)” circuit [3] was used in the newly developed (dvlp.) SRAM. The dvlp. SRAM succeeded in increasing margins, reducing the standby power and lowering the supply voltage (VDD). The PST of the 2-kbit memory cell array of the dvlp. SRAM was only 0.938 μW, namely, 9.17% of the PST (10.23 μW) of the conventional (conv.) SRAM at VDD=1.0 V. The “read” margin (VRM) of the dvlp. SRAM was 0.1923 V, which was 2.09 times larger than the VRM (0.0919 V) of the conv. SRAM at VDD=1.0 V.
Slides

1S-6 (Time: 10:40 - 10:44)
Title: An Efficient Multi-Port Memory Controller for Multimedia Applications
Author: *Xuan-Thuan Nguyen, Cong-Kha Pham (University of Electro-Communications, Japan)
Page: pp. 12 - 13
Keyword: multi-port memory controller, high bandwidth, fpga, multimedia
Abstract: A remedy for the processor-memory bottleneck has been considered a key to success because of the substantial growth in multimedia applications. In this paper, we propose an efficient external multi-port memory controller (MPMC), which consists of several buffers to speed up transactions, embedded memory to store the configuration, and an arbiter to schedule all accesses. The experimental results show that the proposed design can operate independently of other system architectures, support up to 16 simultaneous external components with different clocks and data widths, and achieve up to 88% and 92% of the theoretical peak bandwidth for write and read processes, respectively.

1S-7 (Time: 10:44 - 10:48)
Title: Reliability-Configurable Mixed-Grained Reconfigurable Array Compatible with High-Level Synthesis
Author: *Masanori Hashimoto, Dawood Alnajjar, Hiroaki Konoura (Osaka University/JST, CREST, Japan), Yukio Mitsuyama (Kochi University of Technology/JST, CREST, Japan), Hajime Shimada (Nagoya University/JST, CREST, Japan), Kazutoshi Kobayashi (Kyoto Institute of Technology/JST, CREST, Japan), Hiroyuki Kanbara (ASTEM/JST, CREST, Japan), Hiroyuki Ochi (Ritsumeikan University/JST, CREST, Japan), Takashi Imagawa (Kyoto University/JST, CREST, Japan), Kazutoshi Wakabayashi (NEC Corp./JST, CREST, Japan), Takao Onoye (Osaka University/JST, CREST, Japan), Hidetoshi Onodera (Kyoto University/JST, CREST, Japan)
Page: pp. 14 - 15
Keyword: reconfigurable device, soft error, high-level synthesis, reliability, irradiation test
Abstract: This paper presents a mixed-grained reconfigurable VLSI array architecture that can cover applications ranging from mission-critical systems to consumer products through C-to-array application mapping. A proof-of-concept VLSI chip was fabricated in a 65nm process. Measurement results show that applications on the chip can operate in a harsh radiation environment.
Slides

1S-8 (Time: 10:48 - 10:52)
Title: A 14μA ECG Processor with Noise Tolerant Heart Rate Extractor and FeRAM for Wearable Healthcare Systems
Author: *Yozaburo Nakai, Shintaro Izumi, Ken Yamashita, Masanao Nakano, Hiroshi Kawaguchi, Masahiko Yoshimoto (Kobe University, Japan)
Page: pp. 16 - 17
Keyword: biomedical signal processing, electrocardiography, heart rate extraction, mobile healthcare, wearable sensors
Abstract: This report describes an electrocardiograph (ECG) processor for use with a wearable healthcare system. It comprises an analog front end, a 12-bit ADC, a robust Instantaneous Heart Rate (IHR) monitor, a 32-bit Cortex-M0 core, and 64 Kbyte Ferroelectric Random Access Memory (FeRAM). The IHR monitor uses a short-term autocorrelation (STAC) algorithm to improve the heart-rate detection accuracy even in noisy conditions. The ECG processor chip consumes 13.7μA in a heart rate logging application.
Slides

1S-9 (Time: 10:52 - 10:56)
Title: A 128-Way FPGA Platform for the Acceleration of KLMS Algorithm
Author: *Xiaowei Ren, Qihang Yu, Badong Chen, Nanning Zheng, Pengju Ren (Xi'an Jiaotong University, China)
Page: pp. 18 - 19
Keyword: KLMS, FPGA, parallel, acceleration
Abstract: This paper proposes a 128-way parallel FPGA platform to accelerate the kernel least mean square (KLMS) algorithm. By adopting a quantized method and pipelining, this platform, which runs at 200MHz, is on average 4827 times faster than Matlab code running on a 3GHz Intel(R) Core(TM) i5-2320 CPU.
Slides

1S-10 (Time: 10:56 - 11:00)
Title: A Real-Time Permutation Entropy Computation for EEG Signals
Author: *Xiaowei Ren, Qihang Yu, Badong Chen, Nanning Zheng, Pengju Ren (Xi'an Jiaotong University, China)
Page: pp. 20 - 21
Keyword: Permutation Entropy, FPGA, parallel, acceleration
Abstract: In this paper, we implement a reconfigurable FPGA accelerator that computes multiscale permutation entropy for 128 EEG signals simultaneously in real time. When it runs at 150MHz with a window size of 256, the average speedup over C code running on a 3GHz Intel(R) Core(TM) i5-2320 CPU is 3748.
Slides
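
For readers unfamiliar with the quantity computed in 1S-10, here is a minimal software sketch of multiscale permutation entropy in its usual textbook form: coarse-grain the signal by averaging non-overlapping blocks, then take the normalized Shannon entropy of the ordinal patterns. It is not the authors' FPGA design; the function names, the default order of 3, and the tie-breaking rule are assumptions made for illustration.

```python
import math
from collections import Counter


def coarse_grain(x, scale):
    """Average non-overlapping blocks of length `scale`."""
    n = len(x) // scale
    return [sum(x[i * scale:(i + 1) * scale]) / scale for i in range(n)]


def permutation_entropy(x, order=3, delay=1):
    """Normalized permutation entropy of sequence x, in [0, 1]."""
    span = (order - 1) * delay
    if len(x) <= span:
        raise ValueError("sequence too short for the chosen order and delay")
    counts = Counter()
    for i in range(len(x) - span):
        window = [x[i + k * delay] for k in range(order)]
        # Ordinal pattern: indices sorted by value (ties keep index order).
        pattern = tuple(sorted(range(order), key=lambda k: window[k]))
        counts[pattern] += 1
    total = sum(counts.values())
    h = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(math.factorial(order))


def multiscale_permutation_entropy(x, order=3, scales=(1, 2, 3)):
    """Permutation entropy of the coarse-grained signal at each scale."""
    return [permutation_entropy(coarse_grain(x, s), order) for s in scales]


# Small worked example: four rising pairs and two falling pairs.
pe_example = permutation_entropy([4, 7, 9, 10, 6, 11, 3], order=2)
# A strictly increasing signal has a single ordinal pattern.
pe_ramp = permutation_entropy(list(range(16)), order=3)
```

In the example, the pattern frequencies are 4/6 and 2/6, so the entropy is about 0.918, while the ramp gives 0.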

1S-11 (Time: 11:00 - 11:04)
Title: A High Efficient Hardware Architecture for Multiview 3DTV
Author: *Jiang Yu, Geng Liu, Xin Zhang, Pengju Ren (Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, China)
Page: pp. 22 - 23
Keyword: 3DTV, Architecture, FPGA, Multiview
Abstract: There are three main challenges in designing an efficient multiview 3DTV SoC: (1) how to organize DRAM address mapping to maximize off-chip bandwidth utilization; (2) how to design a parallel configurable image scaling engine to interpolate various viewpoints in real time; (3) how to reduce the computational complexity of floating-point sub-pixel rearrangement with sufficient accuracy. To this end, we present a highly optimized hardware architecture, which saves 38.4% logic and 37.5% memory resources when implementing a multiview 1080P@60Hz 3DTV on the Xilinx XC5VLX330 FPGA.
Slides

1S-12 (Time: 11:04 - 11:08)
Title: Design of A Scalable Many-Core Processor for Embedded Applications
Author: *Hsiao-Wei Chien, Jyun-Long Lai, Chao-Chieh Wu, Chih-Tsun Huang, Ting-Shuo Hsu, Jing-Jia Liou (National Tsing Hua University, Taiwan)
Page: pp. 24 - 25
Keyword: Many-Core, Hardware/Software Co-Validation
Abstract: We present a novel design of a scalable many-core processor with its comprehensive development framework, including the Electronic System Level, Register Transfer Level, and full-system prototyping platforms. Architecture exploration, performance evaluation and system verification/validation can be done across different abstraction levels. With our hardware-independent software layer, applications built on top of the fast virtual platform can be executed seamlessly on the prototype. The emulation result demonstrates the effectiveness of our processor architecture in embedded applications.

1S-13 (Time: 11:08 - 11:12)
Title: A DPA/DEMA/LEMA-Resistant AES Cryptographic Processor with Supply-Current Equalizer and Micro EM Probe Sensor
Author: *Daisuke Fujimoto, Noriyuki Miura (Kobe University, Japan), Yu-ichi Hayashi, Naofumi Homma, Takafumi Aoki (Tohoku University, Japan), Makoto Nagata (Kobe University, Japan)
Page: pp. 26 - 27
Keyword: AES, side-channel attack, EM attack, power equalizer, sensor
Abstract: The combination of a supply-current equalizer (EQ) and a micro EM probe sensor (EMS) exhibits strong resiliency against the three major low-cost side-channel attacks (DPA, DEMA, and LEMA) on a cryptographic processor. Test-chip measurements with a 128bit AES cryptographic processor in 0.18μm CMOS successfully demonstrate the secret key protection from all three attacks. A digital-oriented circuit implementation together with a careful design optimization minimizes the hardware overhead of EQ and EMS to +33%, +1.6% in area, +7.6%, +0.15% in power, and ~0%, -0.2% in performance of an unprotected AES, respectively.

1S-14 (Time: 11:12 - 11:16)
Title: A 64×64 1200fps Dual-Mode CMOS Ion-Image Sensor for Accurate DNA Sequencing
Author: *Xiwei Huang, Jing Guo, Mei Yan, Hao Yu (Nanyang Technological University, Singapore)
Page: pp. 28 - 29
Keyword: CIS, ISFET, pH detection, contact imaging, DNA sequencing
Abstract: A dual-mode CMOS ion-image sensor is demonstrated towards accurate high-throughput DNA sequencing. Dual-mode (optical/pH) sensing is realized by integrating the ion-sensitive field-effect transistor (ISFET) with a standard 4T CMOS image sensor (CIS) pixel fabricated in a standard 0.18μm CIS process. With accurate determination of microbead physical locations by optical contact imaging, local pH can be obtained for one DNA slice with accurate correlation to improve sequencing accuracy from a system perspective. Moreover, for high-throughput large-arrayed sequencing, pixel-to-pixel ISFET threshold voltage mismatch is reduced by a correlated double sampling (CDS) readout that supports both image and pH modes. Measurement results show a sensitivity of 103.8mV/pH, a fixed-pattern-noise (FPN) reduction from 4% to 0.3%, and a readout speed of 1200 frames/second (fps).
Slides

1S-16 (Time: 11:16 - 11:20)
Title: A 0.21-V Minimum Input, 73.6% Maximum Efficiency, Fully Integrated 3-Terminal Voltage Converter with MPPT for Low-Voltage Energy Harvesters
Author: *Toshihiro Ozaki, Tetsuya Hirose, Takahiro Nagai, Keishi Tsubaki, Nobutaka Kuroki, Masahiro Numa (Kobe University, Japan)
Page: pp. 30 - 31
Keyword: Energy harvesting, voltage converter, low-voltage, low-power
Abstract: We propose a fully integrated 3-terminal voltage converter with a maximum power point tracking (MPPT) circuit for low-voltage energy harvesting. The MPPT circuit dissipates nano-watt power to extract maximum output power. The measurement results demonstrated that the circuit converted a 0.49-V input to a 1.46-V output with 73.6% power conversion efficiency when the output power was 348 μW. The circuit can operate at an extremely low input voltage of 0.21 V.
Slides

1S-17 (Time: 11:20 - 11:24)
Title: Dual-Output Wireless Power Delivery System for Small Size Large Volume Wireless Memory Card
Author: *Junki Hashiba, Toru Kawajiri, Yuya Hasegawa, Hiroki Ishikuro (Keio University, Japan)
Page: pp. 32 - 33
Keyword: wireless power delivery, single-inductor dual-output, Pseudo-random-sequence PWM
Abstract: A single-inductor dual-output wireless power delivery system for a small-size battery-less NAND flash memory card is presented. The power delivery system uses two rectifiers connected to a single inductor, which is switched by a pseudo-random-sequence PWM signal synchronously with the induced AC voltage. The power delivery system can generate 8 V and 16 V with a peak efficiency of 40 % and a maximum total transmitted power of 0.5 W. The test chip was designed and fabricated using 0.18μm CMOS with a high-voltage LDMOS option.
Slides

1S-18 (Time: 11:24 - 11:28)
Title: A Tri-Level 50MS/s 10-bit Capacitive-DAC for Bluetooth Applications
Author: *Daisuke Kanemoto (University of Yamanashi, Japan), Keigo Oshiro, Keiji Yoshida, Haruichi Kanaya (Kyushu University, Japan)
Page: pp. 34 - 35
Keyword: Capacitive, DAC
Abstract: This document summarizes, for the university design contest, the chip design of a low-power, small-die-area 10-bit capacitive digital-to-analog converter (DAC) in a 0.18 μm CMOS process. The power dissipation of this chip is 350 μW including the output buffers. The die area is 0.081 mm².

1S-19 (Time: 11:28 - 11:32)
Title: A Tail-Current Modulated VCO with Adaptive-Bias Scheme
Author: *Aravind Tharayil Narayanan, Wei Deng, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page: pp. 36 - 37
Keyword: VCO, Class-C, PVT, flicker noise
Abstract: This paper proposes a tail-current modulated VCO with adaptive-bias scheme. The proposed adaptive-bias scheme ensures robust startup conditions for the tail-feedback VCO. A tail-feedback VCO using the proposed scheme is implemented in a 0.18-μm CMOS process. The measured phase noise is -119.3dBc/Hz at 1MHz offset with a power dissipation of 6.8mW at 4.6GHz.
Slides

1S-20 (Time: 11:32 - 11:36)
Title: A Low-Power VCO Based ADC with Asynchronous Sigma-Delta Modulator in 65nm CMOS
Author: *Jili Zhang, Chenluan Wang, Shengxi Diao, Fujiang Lin (University of Science and Technology of China, China)
Page: pp. 38 - 39
Keyword: VCO, ADC, ASDM, Nonlinearity
Abstract: This paper presents a low power VCO based ADC with asynchronous sigma-delta modulator (ASDM). A prototype is designed in 65nm CMOS technology with a measured performance of 54.3dB SNDR and 68dB SFDR over 8MHz bandwidth while consuming 2.8mW from a 1.2V supply.
Slides

1S-21 (Time: 11:36 - 11:40)
Title: A 0.5-V 5.8-GHz Low-Power Asymmetrical QPSK/OOK Transceiver for Wireless Sensor Network
Author: *Sho Ikeda, Sang_yeop Lee, Shin Yonezawa, Yiming Fang, Motohiro Takayasu, Taisuke Hamada, Yosuke Ishikawa, Hiroyuki Ito, Noboru Ishihara, Kazuya Masu (Tokyo Institute of Technology, Japan)
Page: pp. 40 - 41
Keyword: Transceiver, PLL, VCO, 0.5V
Abstract: This paper proposes a low-power RF transceiver suitable for wireless sensor networks. Using the 5.8GHz band makes a small wireless sensor module possible, because antennas are smaller at higher frequencies. The proposed transceiver uses different modulation schemes for uplink and downlink to optimize power consumption and spectral efficiency. In addition, a supply voltage of 0.5V reduces the power consumption of the overall RF transceiver. The prototype transceiver was fabricated in a 65nm CMOS process; the transmitter achieved an EVM of 12.6% while consuming 2.86mW, and the receiver achieved a sensitivity of -75dBm while consuming 0.83mW.
Slides

1S-22 (Time: 11:40 - 11:44)
Title: A 58.3-to-65.4 GHz 34.2 mW Sub-Harmonically Injection-Locked PLL with a Sub-Sampling Phase Detection
Author: *Teerachot Siriburanon, Tomohiro Ueno, Kento Kimura, Satoshi Kondo, Wei Deng, Kenichi Okada, Akira Matsuzawa (Tokyo Institute of Technology, Japan)
Page: pp. 42 - 43
Keyword: PLL, sub-sampling, mm-wave, 60GHz, in-band phase noise
Abstract: This paper presents a low power and low noise sub-harmonically injection-locked PLL based on a 20GHz sub-sampling PLL (SS-PLL) and a quadrature injection locked oscillator (QILO). Relatively lower in-band phase noise and out-of-band phase noise have been achieved through the sub-sampling phase detection and sub-harmonic injection techniques, respectively. Implemented in a 65nm CMOS process, this work can support all 60GHz channels and achieves a phase noise of -115dBc/Hz at 10MHz offset while consuming 20.2mW and 14mW from the 20GHz SS-PLL and the QILO, respectively.

1S-23 (Time: 11:44 - 11:48)
Title: Circuit and Package Design for 44GB/s Inductive-Coupling DRAM/SoC Interface
Author: *Akira Okada, Abdul Raziz Junaidi, Yasuhiro Take, Atsutake Kosuge, Tadahiro Kuroda (Keio University, Japan)
Page: pp. 44 - 45
Keyword: TCI, phase division multiplexing, UT-FOWLP
Abstract: A 44GB/s inductive-coupling DRAM/SoC interface is developed in a PoP configuration. It combines the advantages of both TSV and LPDDR by using the ThruChip Interface (TCI) and the ultra-thin fan-out wafer level package (UT-FOWLP). The proposed interface outperforms WIO2 with TSV in terms of area efficiency (4x better), immunity from simultaneous switching output noise (32x better) and manufacturing cost (40% cheaper). In addition, it outperforms LPDDR4 in PoP in terms of power dissipation (5x lower) and ease of timing control.
Slides

1S-24 (Time: 11:48 - 11:52)
Title: Design and Analysis for ThruChip Design for Manufacturing (DFM)
Author: *Li-Chung Hsu, Yasuhiro Take, Atsutake Kosuge, So Hasegawa, Junichiro Kadamoto, Tadahiro Kuroda (Keio University, Japan)
Page: pp. 46 - 47
Keyword: ThruChip, TCI, DFM
Abstract: The impacts of wafer thinning, power mesh, and dummy metal fill on a 1GB/s ThruChip interface (TCI) are analyzed and evaluated with test chip measurements and field solver simulation. The measurement results show that the TCI coil dimension can be scaled down with wafer thinning by following the D/Z=3 rule. However, the experiment shows a 20% power reduction when the TCI coil is enlarged (D/Z=6). A power mesh lying between TCI coils can dramatically decrease the TCI magnetic pulse strength and hence cause TCI to fail. Dummy metal within TCI coils has no impact on TCI transmission.
Slides


Session 1A  NoCS I (Performance and Fault Tolerance)
Time: 10:20 - 12:00 Tuesday, January 20, 2015
Location: Room 102
Chairs: Yoshinori Takeuchi (Osaka University, Japan), Takashi Miyamori (Toshiba)

1A-1 (Time: 10:20 - 10:45)
Title: A Novel Approach Using a Minimum Cost Maximum Flow Algorithm for Fault-Tolerant Topology Reconfiguration in NoC Architectures
Author: Leibo Liu, *Yu Ren, Chenchen Deng (Tsinghua University, China), Jie Han (University of Alberta, Canada), Shouyi Yin, Shaojun Wei (Tsinghua University, China)
Page: pp. 48 - 53
Keyword: Network-on-chip, fault tolerance, topology reconfiguration
Abstract: An approach using a minimum cost maximum flow algorithm is proposed for fault-tolerant topology reconfiguration in a Network-on-Chip system. Topology reconfiguration is converted into a network flow problem by constructing a directed graph with capacity constraints. A cost factor is considered to differentiate between processing elements. This approach maximizes the use of spare cores to repair faulty systems, with minimal impact on area, throughput and delay. It also provides a transparent virtual topology to alleviate the burden for operating systems.
Slides
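
To illustrate the kind of formulation the 1A-1 abstract describes, the sketch below assigns spare cores to faulty cores with a textbook minimum cost maximum flow (successive shortest augmenting paths). The graph layout (source -> faulty core -> spare core -> sink, all capacities 1), the Manhattan-distance cost, and every name in the code are assumptions made for illustration; the paper's actual graph construction and cost factor are not given in the abstract.

```python
from collections import deque


class MinCostMaxFlow:
    def __init__(self, n):
        self.n = n
        # Each edge is [to, residual capacity, cost, index of reverse edge].
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = cost = 0
        while True:
            # Queue-based Bellman-Ford: cheapest path in the residual graph.
            dist = [float("inf")] * self.n
            prev = [None] * self.n
            queued = [False] * self.n
            dist[s] = 0
            queue = deque([s])
            while queue:
                u = queue.popleft()
                queued[u] = False
                for i, (v, cap, c, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + c < dist[v]:
                        dist[v] = dist[u] + c
                        prev[v] = (u, i)
                        if not queued[v]:
                            queue.append(v)
                            queued[v] = True
            if prev[t] is None:  # sink unreachable: flow is maximal
                return flow, cost
            # Find the bottleneck capacity, then push flow along the path.
            push, v = float("inf"), t
            while v != s:
                u, i = prev[v]
                push = min(push, self.graph[u][i][1])
                v = u
            v = t
            while v != s:
                u, i = prev[v]
                self.graph[u][i][1] -= push
                self.graph[v][self.graph[u][i][3]][1] += push
                v = u
            flow += push
            cost += push * dist[t]


def assign_spares(faulty, spares):
    """Map each repairable faulty core to a spare (hypothetical cost model)."""
    nf, ns = len(faulty), len(spares)
    source, sink = 0, nf + ns + 1
    net = MinCostMaxFlow(nf + ns + 2)
    for i, (fx, fy) in enumerate(faulty):
        net.add_edge(source, 1 + i, 1, 0)
        for j, (sx, sy) in enumerate(spares):
            # Assumed cost: Manhattan distance between the two grid positions.
            net.add_edge(1 + i, 1 + nf + j, 1, abs(fx - sx) + abs(fy - sy))
    for j in range(ns):
        net.add_edge(1 + nf + j, sink, 1, 0)
    repaired, total_cost = net.solve(source, sink)
    mapping = {}
    for i in range(nf):
        for v, cap, _, _ in net.graph[1 + i]:
            if v > nf and cap == 0:  # saturated edge to a spare core
                mapping[i] = v - 1 - nf
    return repaired, total_cost, mapping


result = assign_spares(faulty=[(0, 0), (1, 2)], spares=[(0, 3), (2, 3)])
print(result)
```

In this toy case both faulty cores are repaired; pairing core 0 with spare 0 and core 1 with spare 1 costs 3 + 2 = 5, against 7 for the opposite pairing.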

1A-2 (Time: 10:45 - 11:10)
Title: Adaptive Remaining Hop Count Flow Control: Consider the Interaction between Packets
Author: *Peng Wang, Sheng Ma, Hongyi Lu, Zhiying Wang, Chen Li (National University of Defense Technology, China)
Page: pp. 54 - 60
Keyword: Flow Control, Remaining Hop Count, interaction between packets
Abstract: The interaction between packets affects the performance and global fairness of a Network-on-Chip. Preferentially transferring packets with small remaining hop counts (PPSR) can reduce the number of in-flight packets and thus improve performance. Yet, the global fairness is negatively affected. In contrast, preferentially transferring packets with large remaining hop counts (PPLR) can achieve better global fairness at the cost of poorer performance. In this paper, we propose adaptive remaining hop count flow control, which dynamically switches between PPSR and PPLR. In this way, we can achieve higher performance and better global fairness.
Slides

1A-3 (Time: 11:10 - 11:35)
Title: A Flexible Hardware Barrier Mechanism for Many-Core Processors
Author: *Takeshi Soga (ISIT Kyushu, JST CREST, Japan), Hiroshi Sasaki, Tomoya Hirao (Kyushu University, Japan), Masaaki Kondo (The University of Tokyo, Japan), Koji Inoue (Kyushu University, Japan)
Page: pp. 61 - 68
Keyword: Hardware Barrier, On-chip, Flexible, Small Area, Low Latency
Abstract: This paper proposes a new hardware barrier mechanism which offers the flexibility to select which cores should join the synchronization, allowing for executing multiple multi-threaded applications by dividing a many-core processor into several groups. Experimental results based on an RTL simulation show that our hardware barrier achieves a 66-fold reduction in latency over typical software based implementations, with a hardware overhead of the processor of only 1.8%. Additionally, we demonstrate that the proposed mechanism is sufficiently flexible to cover a variety of core groups with minimal hardware overhead.
Slides

1A-4 (Time: 11:35 - 12:00)
Title: A Performance Enhanced Dual-Switch Network-on-Chip Architecture
Author: *Lian Zeng, Takahiro Watanabe (Waseda University, Japan)
Page: pp. 69 - 74
Keyword: Network-on-Chip, Dual-switch allocator, High performance
Abstract: Network-on-Chip is an attractive solution for future systems on chip. As the network becomes more congested, packets are blocked more frequently, which degrades network performance. In this article, we propose an innovative dual-switch allocator (DSA) design. By introducing two switch allocators, we can make the utmost use of idle output ports. Experimental results show that our design achieves a significant performance improvement in terms of throughput and latency at the cost of very little power overhead.
Slides


Session 1B  Toward Power Efficient Design
Time: 10:20 - 12:00 Tuesday, January 20, 2015
Location: Room 104
Chairs: Kimiyoshi Usami (Shibaura Institute of Technology, Japan), Masanori Hashimoto (Osaka University, Japan)

1B-1 (Time: 10:20 - 10:45)
Title: A Cross-Layer Framework for Designing and Optimizing Deeply-Scaled FinFET-Based SRAM Cells under Process Variations
Author: *Alireza Shafaei, Shuang Chen, Yanzhi Wang, Massoud Pedram (University of Southern California, U.S.A.)
Page: pp. 75 - 80
Keyword: Deeply-scaled FinFET devices, SRAM cell design, Near-threshold computing
Abstract: A cross-layer framework (spanning device and circuit levels) is presented for designing robust and energy-efficient SRAM cells, made of deeply-scaled FinFET devices. In particular, 7nm FinFET devices are designed and simulated by using Synopsys TCAD tool suite, Sentaurus. Next, 6T and 8T SRAM cells, which are composed of these devices, are designed and optimized. To enhance the cell stability and reduce leakage energy consumption, the dual (i.e., front and back) gate control feature of FinFETs is exploited. This is, however, done without requiring any external signal to drive the back gates of the FinFET devices. Subsequently, the effect of process variations on the aforesaid SRAMs is investigated and steps are presented to protect the cells against these variations. More precisely, the SRAM cells are first designed to minimize the expected energy consumption (per clock cycle) subject to the non-destructive read and successful write requirements under worst-case process corner conditions. These SRAM cells, which are overly pessimistic, are then refined by selectively adjusting some transistor sizes, which in turn reduces the expected energy consumption while ensuring that the parametric yield of the cells remains above some pre-specified threshold. To do this efficiently, an analytical method for estimating the yield of SRAM cells under process variations is also presented and integrated in the refinement procedure. A dual-gate controlled 6T SRAM cell operating at 324mV (in the near-threshold supply regime) is finally presented as a high-yield and energy-efficient memory cell in the 7nm FinFET technology.
Slides

1B-2 (Time: 10:45 - 11:10)
TitleControlled Placement of Standard Cell Memory Arrays for High Density and Low Power in 28nm FD-SOI
Author*Adam Teman (EPFL, Switzerland), Davide Rossi (University of Bologna, Italy), Pascal Meinerzhagen (EPFL, Switzerland), Luca Benini (University of Bologna, Italy/ETH, Switzerland), Andreas Burg (EPFL, Switzerland)
Pagepp. 81 - 86
KeywordStandard Cell Memories, Controlled Placement, Place and Route, Low Voltage Memories, Physical Implementation Methodology
AbstractStandard cell memories (SCMs) have recently become a popular alternative to SRAM IPs due to their design flexibility, ease of implementation, and robust operation at low supply voltages. Exclusively composed of standard cells, these memory macros are implemented as part of the standard digital design flow. However, the synthesis and place and route (P&R) algorithms employed by this flow do not exploit the distinct and regular structure of a memory array, leaving room for optimization. In this paper, we present a controlled placement design methodology for optimizing the physical implementation of SCM macros, leading to a structured, non-congested layout with close to 100% placement utilization and reduced wirelength as compared to unstructured layouts. Three sample SCM macro sizes were implemented according to the proposed methodology in a state-of-the-art 28nm FD-SOI technology, and compared with equivalent macros designed with the non-controlled, standard flow, achieving as much as a 22% reduction in area, a 57% reduction in switching power, and a 42% reduction in leakage power. In addition, these macros provide as much as an 88% reduction in switching power, as compared to equivalently sized, foundry-provided SRAM IPs, while enabling robust functionality well below the minimum operating voltage of these IPs.
Slides

1B-3 (Time: 11:10 - 11:35)
TitleMicroarchitectural-Level Statistical Timing Models for Near-Threshold Circuit Design
Author*Jun Shiomi, Tohru Ishihara, Hidetoshi Onodera (Kyoto University, Japan)
Pagepp. 87 - 93
KeywordNear-threshold computing, statistical static timing analysis (SSTA)
AbstractThis paper proposes architectural-level statistical static timing analysis models for near-threshold voltage computing, where the path delay distribution is approximated as a lognormal distribution. First, we prove several important theorems that help in considering architectural design strategies for high-performance and energy-efficient near-threshold computing. After that, we present numerical experiments with Monte Carlo simulations using a commercial 28-nm process technology model and demonstrate that the properties presented in the theorems hold for practical near-threshold logic circuits.

1B-4 (Time: 11:35 - 12:00)
TitleStress-Aware P/G TSV Planning in 3D-ICs
Author*Shengcheng Wang, Farshad Firouzi, Fabian Oboril, Mehdi B. Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 94 - 99
Keyword3D-IC, TSV, Stress, IR-drop, Timing analysis
AbstractPower/Ground (P/G) Through-Silicon-Vias (TSVs) in the Power Distribution Network (PDN) of Three-Dimensional-Integrated-Circuit (3D-IC) have a twofold impact on the delays of the surrounding gates. TSV fabrication causes thermal stress around TSVs, which results in significant carrier mobility variations in their vicinity. On the other hand, the insertion of P/G TSVs will change the voltage of each node in the power grid, which also impacts the delays of the connected gates. Thus, it is necessary to consider the combined effect on delay variation during the P/G TSV planning. In this work, we propose a methodology using Mixed-Integer-Bilinear-Programming (MIBLP) to optimize this delay variation by a refined P/G TSV allocation. Taking into account the impact of thermal stress as well as voltage drop on the circuit delay, we optimally plan the P/G TSVs to minimize the circuit delay for different keep-out zones (KOZs) and PDN pitches.
Slides


Session 1C  Modeling and Design Methodologies of Post-silicon Devices
Time: 10:20 - 12:00 Tuesday, January 20, 2015
Location: Room 105
Chairs: Zili Shao (Hong Kong Polytechnic University, Hong Kong), Duo Liu (Chongqing University, China)

1C-1 (Time: 10:20 - 10:45)
TitleQuantitative Modeling of Racetrack Memory, A Tradeoff among Area, Performance, and Power
Author*Chao Zhang, Guangyu Sun, Weiqi Zhang (CECA, Peking University, China), Fan Mi, Hai Li (University of Pittsburgh, U.S.A.), Weisheng Zhao (Spintronics Interdisciplinary Center, Beihang University, China)
Pagepp. 100 - 105
KeywordRacetrack Memory, modeling, Macro Unit, Cache
AbstractRecently, an emerging non-volatile memory called Racetrack Memory (RM) has become a promising candidate for meeting the growing demand for on-chip memory capacity. However, the lack of circuit-level modeling has limited RM design exploration. We develop an RM circuit-level model, with careful study of device configurations and circuit layouts. This model introduces the Macro Unit (MU) as the building block of RM. Our case study demonstrates significant variation in area, performance, and energy. In addition, cross-layer optimization is critical for RM as on-chip memory.
Slides

1C-2 (Time: 10:45 - 11:10)
TitleTechnological Exploration of RRAM Crossbar Array for Matrix-Vector Multiplication
Author*Peng Gu, Boxun Li, Tianqi Tang (Tsinghua University, China), Shimeng Yu, Yu Cao (Arizona State University, U.S.A.), Yu Wang, Huazhong Yang (Tsinghua University, China)
Pagepp. 106 - 111
KeywordRRAM, crossbar array, matrix computation
AbstractMatrix-vector multiplication is the key operation for many computationally intensive algorithms. In recent years, the emerging metal oxide resistive switching random access memory (RRAM) device and the RRAM crossbar array have demonstrated a promising hardware realization of analog matrix-vector multiplication with ultra-high energy efficiency. In this paper, we analyze the impact of the nonlinear voltage-current relationship of RRAM devices, the interconnect resistance, and other crossbar array parameters on circuit performance, and present a design guide. On top of that, we propose a technological exploration flow for device parameter configuration to overcome the impact of nonideal factors and achieve a better trade-off among performance, energy and reliability for each specific application. The simulation results of a support vector machine (SVM) on the MNIST pattern recognition dataset show that the RRAM crossbar array-based SVM is robust to input signal fluctuation but sensitive to tunneling gap deviation. A further resistance resolution test shows that a 4-bit RRAM device is able to realize a recognition accuracy of ~90%, indicating the physical feasibility of the RRAM crossbar array-based SVM. In addition, the proposed technological exploration flow is able to achieve a 10.98% improvement in recognition accuracy on the MNIST dataset and 26.4% energy savings compared with previous work.

1C-3 (Time: 11:10 - 11:35)
TitleModeling Framework for Cross-Point Resistive Memory Design Emphasizing Reliability and Variability Issues
AuthorYang Zheng, Cong Xu (Pennsylvania State University, U.S.A.), *Yuan Xie (Pennsylvania State University/University of California, Santa Barbara, U.S.A.)
Pagepp. 112 - 117
KeywordReRAM, Reliability, Variability, cross-point structure
AbstractIn this paper, pseudo-hard error caused by temporal variation is defined for the first time as a unique type of error in ReRAM cross-point array. A comprehensive model is proposed to numerically evaluate all kinds of reliability and variability issues including voltage drop, read/write disturbance, spatial/temporal variations, and hard errors. Detailed analysis and solutions including dual-port write and test-and-flip strategy are proposed to shed light on reliable ReRAM cross-point memory design.

1C-4 (Time: 11:35 - 12:00)
TitleA Defect-Aware Approach for Mapping Reconfigurable Single-Electron Transistor Arrays
Author*Ching-Yi Huang, Chian-Wei Liu, Chun-Yao Wang (National Tsing Hua University, Taiwan), Yung-Chih Chen (Yuan Ze University, Taiwan), Suman Datta, Vijaykrishnan Narayanan (The Pennsylvania State University, U.S.A.)
Pagepp. 118 - 123
KeywordSingle-Electron Transistor, Reliability, Area Optimization, Defect-aware mapping algorithm
AbstractSingle-Electron Transistor (SET) at room temperature has been demonstrated as a promising device for extending Moore's law due to its ultra low power consumption. However, early realizations of SET array lacked variability and reliability due to their fixed architectures and high defect rates of nanowire segments. Therefore, a reconfigurable version of SET was proposed to deal with these issues. Recently, several automated mapping approaches were proposed for area minimization of reconfigurable SET arrays. However, to the best of our knowledge, no mapping approaches that consider the existence of defective nanowire segments were proposed. Thus, this paper presents the first defect-aware approach for mapping reconfigurable SET arrays. The experimental results show that our approach can successfully map the SET arrays with 20% width overhead on average in the presence of 5000 ppm defects.
Slides


Session 2S  (Special Session) Internet of Things
Time: 13:50 - 15:30 Tuesday, January 20, 2015
Location: Room 103
Chair: Li Shang (University of Colorado Boulder, U.S.A.)

2S-1 (Time: 13:50 - 14:20)
Title(Invited Paper) Powering the IoT: Storage-Less and Converter-Less Energy Harvesting
Author*Hyung Gyu Lee (Daegu University, Republic of Korea), Naehyuck Chang (KAIST, Republic of Korea)
Pagepp. 124 - 129
KeywordInternet of Things, Energy harvesting, Storageless, Converterless
AbstractWidespread adoption of the Internet of Things (IoT) still faces hurdles in cost and maintenance. Energy harvesting is a promising option to mitigate battery replacement, but current energy harvesting methods still rely on batteries or equivalent storage, and on power converters for maximum power point tracking (MPPT). Unfortunately, batteries are subject to wear and tear, which is a primary obstacle to maintenance-free operation. Power converters are expensive, heavy and lossy as well. In this paper, we introduce a novel energy harvesting and management technique to power the IoT which, unlike traditional energy harvesting systems, requires neither long-term energy storage nor voltage converters. Extensive simulations and measurements from our prototype demonstrate that the proposed method harvests 8% more energy and extends the operation time of the device by 60% over a day. This paper also demonstrates a UV (ultraviolet) level meter for skin protection, named SmartPatch, using the proposed energy harvesting method. The proposed method is not limited to photovoltaic energy harvesting but is applicable to most energy harvesting IoT power supplies that require impedance tracking.

2S-2 (Time: 14:20 - 14:50)
Title(Invited Paper) Distributed Computing in IoT: System-on-a-Chip for Smart Cameras as an Example
Author*Shao-Yi Chien, Wei-Kai Chan, Yu-Hsiang Tseng (National Taiwan University, Taiwan), Chia-Han Lee (Academia Sinica, Taiwan), V. Srinivasa Somayazulu, Yen-Kuang Chen (Intel Corporation, U.S.A.)
Pagepp. 130 - 135
KeywordIoT, video sensors, smart camera, distributed computing
AbstractThere are four major components in application systems with the internet-of-things (IoT): sensors, communications, computation and service, where large amounts of data are acquired for ultra-big data analysis to discover the context information and knowledge behind signals. To support such large-scale data sizes and computation tasks, it is not feasible to employ centralized solutions on cloud servers. Thanks to advances in silicon technology, the cost of computation has become lower, and it is possible to distribute computation to every node in the IoT. In this paper, we take a video sensing network as an example to show the idea of distributed computing in IoT. Existing related works are reviewed, and the architecture of a system-on-a-chip solution for distributed smart cameras is proposed with a coarse-grained reconfigurable image stream processing architecture. It can accelerate various computer vision algorithms for distributed smart cameras in IoT.

2S-3 (Time: 14:50 - 15:30)
Title(Invited Paper) Data Sensing and Analysis: Challenges for Wearables
AuthorJames Williamson, Qi Liu, Fenglong Lu, Wyatt Mohrman, Kun Li (University of Colorado Boulder, U.S.A.), Robert P. Dick (University of Michigan, U.S.A.), *Li Shang (University of Colorado Boulder, U.S.A.)
Pagepp. 136 - 141
KeywordWearable technology, Low-power design, Quantified self
AbstractWearables are a leading category in the Internet of Things. Compared with mainstream mobile phones, wearables target a form factor reduction of one order of magnitude, and offer the potential of providing ubiquitous, personalized services to end users. Aggressive reduction in size imposes serious limits on battery capacity. Wearables are equipped with a range of sensors, such as accelerometers and gyroscopes. Most economical sensors were developed for mobile phones, with energy consumption more appropriate for phones than for ultra-compact wearables. This article describes the energy challenges for wearable sensing technologies, with a primary focus on the most widely used wearable sensors: MEMS-based inertial measurement units. Using sports and fitness wearables as the pilot application, we analyze the energy characteristics of MEMS IMU data sensing, analysis, and wireless communication. We then discuss the technologies needed to solve the power and energy consumption challenges for wearables.


Session 2A  NoCS II (Power and Emerging Technology)
Time: 13:50 - 15:30 Tuesday, January 20, 2015
Location: Room 102
Chairs: Mehdi Tahoori (Karlsruhe Institute of Technology, Germany), Tomoya Horiguchi (Toshiba)

2A-1 (Time: 13:50 - 14:15)
TitleShuttleNoC: Boosting On-Chip Communication Efficiency by Enabling Localized Power Adaptation
AuthorHang Lu (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), *Guihai Yan, Yinhe Han, Ying Wang (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China), Xiaowei Li (State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China)
Pagepp. 142 - 147
KeywordNetworks-on-Chip (NoC), Power Adaptation, Bandwidth Scaling, Power Gating, Traffic Heterogeneity
AbstractThe Network-on-Chip (NoC) is gradually becoming a major contributor to chip-level power consumption. Due to the temporal and spatial heterogeneity of on-chip traffic, existing power management approaches cannot adapt the NoC power consumption to its traffic intensity, and hence lead to suboptimal power efficiency. They resort either to over-provisioned NoC designs that suit only the spatial distribution of traffic, or to coarse-grained power gating that serves only its temporal variation. In this paper, we propose a novel NoC architecture called Shuttle Networks-on-Chip (ShuttleNoC). By permitting packets to shuttle between multiple subnetworks, localized power adaptation can be achieved. Experimental results show that ShuttleNoC could achieve optimal power efficiency with up to 23.5% power savings and a 22.3% performance boost in comparison with traditional heterogeneity-agnostic NoC designs.

2A-2 (Time: 14:15 - 14:40)
TitleEnergy-Efficient Optical Crossbars on Chip with Multi-Layer Deposited Silicon
AuthorHui LI, *Sébastien Le Beux (Lyon Institute of Nanotechnology, France), Gabriela Nicolescu (Ecole Polytechnique de Montréal, Canada), Ian O'Connor (Lyon Institute of Nanotechnology, France)
Pagepp. 148 - 153
KeywordOptical Network on Chip, crossbar, optical loss
AbstractThe many-core design research community has shown high interest in on-chip optical crossbars for more than a decade. Key properties of optical crossbars, namely a) contention-free data routing, b) low-latency communication and c) potential for high bandwidth through the use of WDM, have motivated several implementations. These implementations demonstrate very different scalability and power efficiency depending on three key design factors: a) the network topology, b) the considered layout and c) the insertion losses induced by the fabrication process. The emerging design technique relying on multi-layer deposited silicon allows optical losses to be reduced, which may lead to a significant reduction in power consumption. In this paper, multi-layer deposited silicon based crossbars are proposed and compared. The results indicate that the proposed ring-based network exhibits, on average, 22% and 51.4% improvement for worst-case and average losses respectively compared to the most power-efficient related crossbars.

2A-3 (Time: 14:40 - 15:05)
TitleTwo-Phase Protocol Converters for 3D Asynchronous 1-of-n Data Links
AuthorJulian Hilgemberg Pontes, *Pascal Vivet, Yvain Thonnart (CEA/LETI, France)
Pagepp. 154 - 159
KeywordNoC, Asynchronous Circuits, Two-phase Handshake, Delay Insensitive Encoding, 3D
AbstractDesigning a fully synchronous System on Chip is becoming a challenging task. This task is even more difficult in advanced nodes and 3D designs, where local and global variability can turn timing closure into an overwhelming task. In this context, the use of asynchronous circuits for long-link and 3D-link communication can provide better robustness to both local and inter-die variability and achieve faster timing closure by extending the Globally Asynchronous Locally Synchronous style to 3D architectures. However, while the 4-phase protocol is very well adapted to on-chip DI communication, it cannot be adapted to off-chip and 3D interface communication due to potentially large interface delays. In this paper, we propose to use a simple 2-phase DI protocol based on transitions for 1-of-n codes, and we propose new 4-phase / 2-phase data converters. The proposed circuit reduces dynamic power by 20% and increases throughput by 31% for long-link communications.
Slides

2A-4 (Time: 15:05 - 15:30)
TitleFine-Grained Runtime Power Budgeting for Networks-on-Chip
Author*Xiaohang Wang, Tengfei Wang (Guangzhou Institute of Advanced Technology, Chinese Academy of Sciences, China), Terrence Mak (Guangzhou Institute of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong, China), Mei Yang, Yingtao Jiang (University of Nevada, Las Vegas, U.S.A.), Masoud Daneshtalab (Royal Institute of Technology, Sweden/University of Turku, Finland)
Pagepp. 160 - 165
KeywordNetworks-on-chip, power budgeting, dynamic programming network, latency
AbstractPower budgeting for NoCs needs to be performed to meet a limited power budget while assuring the best possible overall system performance. For simplicity and ease of implementation, existing NoC power budgeting schemes treat all individual routers indiscriminately when allocating power to them, irrespective of the fact that the packet arrival rates of different NoC routers may vary significantly. However, such homogeneous power allocation may provide excess power to routers with low packet arrival rates and insufficient power to those with high arrival rates. In this paper, we formulate the NoC power budgeting problem as optimizing the network performance under a power budget through per-router frequency scaling, taking into account the heterogeneous packet arrival rates across different routers imposed by run-time traffic dynamics. Correspondingly, we propose a fine-grained solution using an agile dynamic programming network with linear time complexity. In essence, the frequency of each router is set individually according to its contribution to the average network latency while meeting the power budget. Experimental results have confirmed that, with fairly low runtime and hardware overhead, the proposed scheme can help save up to 50% of application execution time when compared with the best existing methods.
Slides


Session 2B  Design Automation for Tomorrow’s Circuit Technologies
Time: 13:50 - 15:30 Tuesday, January 20, 2015
Location: Room 104
Chairs: Anupam Chattopadhyay (RWTH Aachen University, Germany), Shigeru Yamashita (Ritsumeikan University)

2B-1 (Time: 13:50 - 14:15)
TitleNonvolatile Memory Allocation and Hierarchy Optimization for High-Level Synthesis
AuthorShuangchen Li (Tsinghua University, China/University of California, Santa Barbara, U.S.A.), Ang Li, Yongpan Liu (Tsinghua University, China), *Yuan Xie (University of California, Santa Barbara, U.S.A.), Huazhong Yang (Tsinghua University, China)
Pagepp. 166 - 171
Keywordnonvolatile memory, high-level synthesis, emerging technology, system-level optimization
AbstractThe emerging nonvolatile memory (NVM) technology can potentially change the landscape of future IC designs with numerous benefits, such as high performance, instant on/off, ultra-low standby leakage power, and data retention. These advantages motivate designers to exploit NVM in application-specific circuit designs. The NVM architecture in ASICs and FPGAs, however, is quite different from the conventional memory architecture in microprocessors. It is distributed and needs optimization for specific memory access patterns. Furthermore, unique challenges, such as large write energy and asymmetric read/write operations, lead to extra design knobs. This paper focuses on NVM allocation and hierarchy optimization in high-level synthesis. This is the first framework that integrates NVM architectures in high-level synthesis. A hierarchical hybrid memory architecture is presented. The NVM architecture optimization decides the memory hierarchy, type (NVM or SRAM) and capacity. It is formulated as a mixed-integer linear programming (MILP) problem. In addition, a branch-and-bound heuristic is developed to handle cases in which the MILP is too costly. Experimental results on real-world benchmarks demonstrate that both solutions reduce power consumption by up to 69.3% under given performance/area constraints, compared with traditional designs without NVM.

2B-2 (Time: 14:15 - 14:40)
TitleReverse BDD-Based Synthesis for Splitter-Free Optical Circuits
AuthorRobert Wille, *Oliver Keszocze, Clemens Hopfmuller, Rolf Drechsler (University of Bremen, Germany)
Pagepp. 172 - 177
Keywordsynthesis, optical circuits, binary decision diagrams, splitter
AbstractWith the advancements in silicon photonics, optical devices have found applications, e.g., in ultra-high speed and low-power interconnects as well as in functional computations to be realized on-chip. Driven by the increasing complexity of the underlying functionality, the need for computer-aided design methods for this technology is also rising. Motivated by that, initial work on the development of synthesis methods for optical circuits has been performed. But all approaches proposed thus far suffer, e.g., from large synthesis results and restricted scalability. In particular, splittings in the resulting circuits, which degrade the optical signals into hardly measurable fractions, prevent an efficient and scalable synthesis for optical circuits. In this work, we present a synthesis approach based on Binary Decision Diagrams (BDDs) that overcomes these obstacles. The approach yields circuits that require no splitters at all, at the expense of a moderate increase in the number of optical gates. Experiments confirm that, by this, an efficient and scalable synthesis scheme for optical circuits eventually becomes available.
Slides

2B-3 (Time: 14:40 - 15:05)
TitleDetermining the Minimal Number of SWAP Gates for Multi-Dimensional Nearest Neighbor Quantum Circuits
AuthorAaron Lye (University of Bremen, Germany), *Robert Wille, Rolf Drechsler (University of Bremen/Cyber Physical Systems, DFKI GmbH, Germany)
Pagepp. 178 - 183
Keywordoptimization, quantum circuits, nearest neighbor, exact, synthesis
AbstractMotivated by the promise of significant speed-ups for certain problems, quantum computing has received significant attention in the past. While much progress has been made in the development of synthesis methods for quantum circuits, new physical developments constantly lead to new constraints to be addressed. The limited interaction distance between the respective qubits (i.e. nearest neighbor optimization) has already been considered intensely. But with the emergence of multi-dimensional quantum architectures, another physical constraint came up for which only a few automatic synthesis solutions exist so far, all of them heuristic in nature. In this work, we propose an exact scheme for nearest neighbor optimization in multi-dimensional quantum circuits. Although the complexity of the problem is a serious obstacle, our experimental evaluation shows that the proposed solution is sufficient to allow for a qualitative evaluation of the respective optimization steps. At the same time, this enabled an exact comparison to heuristic results for the first time.
Slides


Session 2C  Emerging Applications
Time: 13:50 - 15:30 Tuesday, January 20, 2015
Location: Room 105
Chairs: Juinn-Dar Huang (National Chiao Tung University, Taiwan), Youhua Shi (Waseda University)

2C-1 (Time: 13:50 - 14:15)
TitleDesign and Optimization of 3D Digital Microfluidic Biochips for the Polymerase Chain Reaction
AuthorZipeng Li (Duke University, U.S.A.), Tsung-Yi Ho (National Chiao Tung University, Taiwan), *Krishnendu Chakrabarty (Duke University, U.S.A.)
Pagepp. 184 - 189
KeywordDigital Microfluidics, real-time PCR, Three-dimensional model, Layout optimization
AbstractA digital microfluidic biochip (DMFB) is an attractive technology platform for revolutionizing immunoassays, clinical diagnostics, drug discovery, DNA sequencing, and other laboratory procedures in biochemistry. In most of these applications, real-time polymerase chain reaction (PCR) is an indispensable step for amplifying specific DNA segments. In recent years, three-dimensional (3D) DMFBs that integrate photodetectors (i.e., cyberphysical DMFBs) have been developed. They offer the benefits of smaller size, higher sensitivity and quicker time-to-results. However, current DMFB design methods target optimization in only two dimensions, hence they ignore the 3D two-layer structure of a DMFB. Moreover, these techniques ignore practical constraints related to the interference between on-chip device pairs, the performance-critical PCR thermal loop, and the physical size of devices. In this paper, we describe an optimization solution for a 3D DMFB, and present a three-stage algorithm to realize a compact 3D PCR chip layout, which includes: (i) PCR thermal-loop optimization; (ii) 3D global placement based on Strong-Push-Weak-Pull (SPWP) model; (iii) constraint-aware legalization. Simulation results for four laboratory protocols demonstrate that the proposed approach is effective for the design and optimization of a 3D chip for real-time PCR.
Slides

2C-2 (Time: 14:15 - 14:40)
TitleAn Accurate and Low Cost PM2.5 Estimation Method Based on Artificial Neural Network
Author*Lixue Xia, Rong Luo, Bin Zhao, Yu Wang, Huazhong Yang (Tsinghua University, China)
Pagepp. 190 - 195
KeywordPM2.5, Air quality, Neural network
AbstractPM2.5 has become a major pollutant in many cities in China. It is a harmful pollutant which may cause several kinds of lung diseases. However, the existing methods for monitoring PM2.5 with high accuracy are too expensive to be widely deployed. The high cost also limits further research on PM2.5. This paper implements a method to estimate PM2.5 with low cost and high accuracy by an Artificial Neural Network (ANN) technique using other pollutants and meteorological factors that are easy to monitor. An Entropy Maximization step is proposed to avoid the over-fitting related to the distribution of the pollutant data. Also, the choice of input attributes is formulated as an optimization problem. An iterative greedy algorithm is proposed to solve it, which reduces the cost and increases the estimation accuracy at the same time. The experiment shows that the linear correlation coefficient between the estimated value and the real value is 0.9488. Our model can also classify PM2.5 levels with high accuracy. Additionally, the trade-off between accuracy and cost is investigated according to the price and error rate of each sensor.
Slides

2C-3 (Time: 14:40 - 15:05)
TitleIterative Disparity Voting Based Stereo Matching Algorithm and Its Hardware Implementation
AuthorZhi Hu, *Yibo Fan, Xiaoyang Zeng (State Key Lab of ASIC & System, Fudan University, China)
Pagepp. 196 - 201
Keywordstereo matching, hardware-oriented, disparity voting
AbstractStereo matching is one of the key problems in computer vision. A large number of algorithms have been proposed but few of them achieve both high accuracy and short processing time on hardware. This paper presents a hardware-oriented stereo matching algorithm which is able to generate software-oriented-level results for 1920×1080 images at 48fps. Such performance prefigures new vistas of the applications of VLSI in stereo vision.
Slides

2C-4 (Time: 15:05 - 15:30)
TitleObstacle-Avoiding Wind Turbine Placement for Power-Loss and Wake-Effect Optimization
Author*Yu-Wei Wu (National Cheng Kung University, Taiwan), Yi-yu Shi (Missouri University of Science and Technology, U.S.A.), Sudip Roy (National Cheng Kung University, Taiwan), Tsung-Yi Ho (National Chiao Tung University, Taiwan)
Pagepp. 202 - 207
KeywordPlacement, Wind Turbine
AbstractAs finite energy resources are being consumed at a faster rate than they can be replaced, renewable energy resources have drawn extensive attention. Wind power development is one such example, which is growing significantly throughout the world. The main difficulty in wind power development is that wind turbines interfere with each other, and the resulting turbulence directly affects the power produced; this is known as the wake effect. In addition, the wirelength among wind turbines is not merely an economic factor, but also determines the power loss in the wiring. Moreover, in reality, unavoidable obstacles exist in the wind farm, e.g., private land, lakes and so on. Nevertheless, to the best of our knowledge, none of the existing works considers wake effect, wirelength and obstacle avoidance at the same time in the wind turbine placement problem. In this paper, we propose an analytical method to solve obstacle-avoiding placement of wind turbines for power-loss and wake-effect optimization. Experimental results show that the wind power produced by our tool is similar to that produced by the industrial tool AWS OpenWind. Besides, our algorithm can reduce the wirelength and avoid obstacles successfully while finding the locations of wind turbines at the same time.
Slides


Session 3S  (Special Session) New Challenges and Solutions in Nanometer Physical Design
Time: 15:50 - 17:30 Tuesday, January 20, 2015
Location: Room 103
Chair: Mark Po-Hung Lin (National Chung Cheng University, Taiwan)

3S-1 (Time: 15:50 - 16:15)
Title(Invited Paper) An Efficient Linear Time Triple Patterning Solver
AuthorHaitong Tian (University of Illinois at Urbana-Champaign, U.S.A.), Hongbo Zhang (Synopsys Inc., U.S.A.), Zigang Xiao, *Martin D. F. Wong (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 208 - 213
KeywordTriple patterning, lithography, stitches
AbstractTriple patterning lithography (TPL) has been recognized as one of the most promising techniques for the 14/10nm technology nodes. In this paper, we apply triple patterning lithography to standard cell based designs, and propose a novel algorithm to optimally solve the problem. The algorithm is able to find all legal stitch candidates, and is guaranteed to find a legal TPL decomposition with the optimal number of stitches if one exists. Experimental results show that the proposed algorithm is very efficient, achieving a 39.1% runtime improvement and an 18.4% memory reduction compared with the state-of-the-art TPL algorithm on the same problem.

3S-2 (Time: 16:15 - 16:40)
Title(Invited Paper) Gate Sizing and Threshold Voltage Assignment for High Performance Microprocessor Designs
AuthorTiago Reimann (Universidade Federal do Rio Grande do Sul, Brazil), Cliff C.N. Sze (IBM, U.S.A.), *Ricardo Reis (Universidade Federal do Rio Grande do Sul, Brazil)
Pagepp. 214 - 219
Keywordpower optimization, physical synthesis, gate sizing
AbstractTiming-constrained power-driven gate sizing has aroused a lot of research interest after the two recent discrete gate sizing contests organized by the International Symposium on Physical Design. Since then, plenty of research papers have been published and new algorithms have been proposed based on the ISPD 2013 contest formulation. However, almost all (new and old) papers in the literature ignore the details of how power-driven gate sizing fits in industrial physical synthesis flows, which limits their practical usage. This paper aims at filling this knowledge gap. We explain our approach to integrating a state-of-the-art Lagrangian Relaxation-based gate sizing into our actual physical synthesis framework, and explain the challenges and issues we observed from the point of view of VLSI design flows.
Slides

3S-3 (Time: 16:40 - 17:05)
Title(Invited Paper) Analytical Placement for Rectilinear Blocks
Author*Yasuhiro Takashima (University of Kitakyushu, Japan)
Pagepp. 220 - 225
KeywordRectilinear Block, Analytical Placement, Overlap Removable Length, Stable-LSE
AbstractThis paper proposes a fast analytical placement for rectilinear blocks. As LSI production methods improve, a much larger number of elements fit on one chip. On the other hand, the turn-around time is the same as or shorter than that of previous designs. To solve this difficulty, the reuse of designed modules, that is, IPs, is promising. However, the improvement in LSI production also leads to larger numbers of IPs, while the shape of an IP tends to be rectilinear. Thus, a fast analytical placement for rectilinear blocks is needed. In this paper, we extend Overlap-Removable Length, which was introduced for rectangular block placement, to rectilinear block placement. We show the efficiency of the proposed method empirically.
Slides

3S-4 (Time: 17:05 - 17:30)
Title(Invited Paper) IR to Routing Challenge and Solution for Interposer-Based Design
Author*Eric Jia-Wei Fang, Terry Chi-Jih Shih, Darton Shen-Yu Huang (MediaTek, Taiwan)
Pagepp. 226 - 230
KeywordRouting, Interposer, IR drop
AbstractA novel IR-aware chip and interposer co-design methodology is presented to handle both chip-interposer routing and micro-bump planning for IR drops. Based on bump rules and power information in a chip, the methodology analyzes the locations of micro bumps to meet IR constraints. For chip-interposer routing, computational geometry techniques (e.g., Delaunay triangulation and Voronoi diagrams) are applied to a network flow formulation for minimizing both IR drops and total wirelength. With the chip and interposer co-design flow, IR constraints can be met with 100% chip-interposer routing completion. Experimental results based on industry designs demonstrate the high quality of our algorithm.


Session 3A  Circuits for Performance and Reliability
Time: 15:50 - 16:40 Tuesday, January 20, 2015
Location: Room 102
Chairs: Sri Parameswaran (University of New South Wales, Australia), Chengmo Yang (University of Delaware)

3A-1 (Time: 15:50 - 16:15)
TitleAging Mitigation in Memory Arrays Using Self-Controlled Bit-Flipping Technique
Author*Anteneh Gebregiorgis (TU Delft, Netherlands), Mojtaba Ebrahimi, Saman Kiamehr, Fabian Oboril (Karlsruhe Institute of Technology, Germany), Said Hamdioui (TU Delft, Netherlands), Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 231 - 236
KeywordAging, Reliability
AbstractBy downscaling CMOS technologies into the nanometer regime, the reliability of SRAM memories is threatened by accelerated transistor aging mechanisms such as Bias Temperature Instability (BTI). BTI leads to a considerable degradation of SRAM cell Static Noise Margin (SNM), which increases the memory failure rate. Since BTI is workload dependent, the aging rates of different cells in a memory array are quite nonuniform. To address this issue, a variety of bit-flipping techniques has been proposed to decrease the SNM degradation by balancing the signal probabilities of the cells. However, existing bit-flipping techniques impose too much area and power overhead as at least an additional column is required to store the inversion flags. In this paper, we propose a low cost self-controlled bit-flipping technique which inverts all bit positions with respect to an existing bit. This technique is applied to a register-file and cache units of an embedded microprocessor. Our simulation results show that the reliability of the proposed technique is similar to that of existing bit-flipping techniques, while imposing 64% less area overhead.

3A-2 (Time: 16:15 - 16:40)
TitleDesign Methodology for Approximate Accumulator Based on Statistical Error Model
AuthorChang Liu, *Xinghua Yang, Fei Qiao, Qi Wei, Huazhong Yang (Dept. of Electronic Engineering, Tsinghua University, China)
Pagepp. 237 - 242
Keywordapproximate-computing, statistical model, multistage speculative adder
AbstractApproximate computing technology has aroused growing interest in circuit and system design for its favorable tradeoff between output quality and performance. Numerous basic circuits and system design methodologies for approximate computing have been proposed. Considering that the existing methodologies for evaluating the tradeoff between output quality and performance are time-consuming, this paper presents a fast design methodology for approximate accumulators based on statistical error models, in which the inexact multistage speculative adder is adopted and modeled for its advantage of compact error patterns. To validate the proposed methodology, a Support Vector Machine (SVM) algorithm is analyzed and mapped to a hardware system composed of inexact and accurate computing circuits. Results show that the time for searching the optimal mapping circuits is reduced by 22.08% compared with functional-based simulation, and the final approximate system design achieves a 1.57× speedup with 8.56% accuracy degradation.
Slides


Session 3B  Frontiers in Logic Synthesis
Time: 15:50 - 17:30 Tuesday, January 20, 2015
Location: Room 104
Chairs: Robert Wille (University of Bremen, Germany), Yuko Hara-Azumi (Tokyo Institute of Technology)

3B-1 (Time: 15:50 - 16:15)
TitleMultiple Independent Gate FETs: How Many Gates Do We Need?
Author*Luca Amaru (Integrated Systems Laboratory - EPFL, Switzerland), Gage Hills (Robust Systems Group - Stanford University, U.S.A.), Pierre-Emmanuel Gaillardon (Integrated Systems Laboratory - EPFL, Switzerland), Subhasish Mitra (Robust Systems Group - Stanford University, U.S.A.), Giovanni De Micheli (Integrated Systems Laboratory - EPFL, Switzerland)
Pagepp. 243 - 248
KeywordMIGFET, Logic Synthesis, Emerging Devices, Enhanced Functionality, CAD
AbstractMultiple Independent Gate Field Effect Transistors (MIGFETs) are expected to push FET technology further into the semiconductor roadmap. In a MIGFET, supplementary gates either provide (i) enhanced conduction properties or (ii) more intelligent switching functions. In general, each additional gate also introduces a side implementation cost. To enable more efficient digital systems, MIGFETs must leverage their expressive power to realize complex logic circuits with few physical resources. Researchers then face the question: How many gates do we need? In this paper, we address the logic side of this question. We determine whether or not an increasing number of gates leads to more compact logic implementations. For this purpose, we develop a logic synthesis flow that intrinsically exploits a MIGFET switching function. Using simplified design assumptions and device/interconnect models, we synthesize MCNC benchmarks on 5 promising MIGFET devices, with the number of gates ranging from 1 to 7. Experimental results show nontrivial area/delay/energy minima, located between 1 and 4 gates, depending on the MIGFET switching function and device/interconnect technology.
Slides

3B-2 (Time: 16:15 - 16:40)
TitlePolynomial Time Algorithm for Area and Power Efficient Adder Synthesis in High-Performance Designs
AuthorSubhendu Roy (The University of Texas at Austin, U.S.A.), Mihir Choudhury, Ruchir Puri (IBM T J Watson Research Center, U.S.A.), *David Z Pan (The University of Texas at Austin, U.S.A.)
Pagepp. 249 - 254
KeywordParallel prefix adder, Polynomial algorithm, logic synthesis
AbstractAdders are the most fundamental arithmetic units, and are often on the timing-critical paths of microprocessors. Among various adder configurations, parallel prefix structures provide high-performance adders for higher bit-widths. With aggressive technology scaling, the performance of a parallel prefix adder, in addition to its dependence on the logic level, is determined by wire-length and congestion, which can be mitigated by adjusting fan-out. This paper proposes a polynomial-time algorithm to synthesize n-bit parallel prefix adders targeting the minimization of the size of the prefix graph with log2 n logic levels and any arbitrary fan-out restriction. The design space exploration by our algorithm provides a set of Pareto-optimal solutions for the delay vs. power trade-off, and these Pareto-optimal solutions can be used in high-performance designs instead of picking from a fixed library (Kogge-Stone, Sklansky, etc.). Experimental results demonstrate that our approach (i) outperforms the highly competitive industry-standard Synopsys Design Compiler adder (128 bit) in performance (2%), area (25%) and power (13.3%) in the 32nm technology node, and (ii) improves performance/area over even 64-bit custom-designed adders targeting a 22nm technology library and implemented in an industrial high-performance design.
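For reference, one of the fixed library structures named in the abstract, Kogge-Stone, can be sketched in a few lines. This is only an illustrative baseline and not the synthesis algorithm the paper proposes; the function name kogge_stone_add and its return values are assumptions made for this sketch. It shows the log2 n logic levels of a parallel prefix adder and counts the prefix-graph nodes, the quantity the paper's algorithm aims to minimize.

```python
def kogge_stone_add(a, b, n):
    """Add two n-bit integers with a Kogge-Stone prefix graph.

    Returns (sum, logic_levels, prefix_nodes). Hypothetical helper for
    illustration only.
    """
    # Bitwise generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]

    G, P = g[:], p[:]
    levels = 0
    nodes = 0
    d = 1
    while d < n:
        G_next, P_next = G[:], P[:]
        for i in range(d, n):
            # Prefix operator: combine span ending at i with span ending at i - d.
            G_next[i] = G[i] | (P[i] & G[i - d])
            P_next[i] = P[i] & P[i - d]
            nodes += 1
        G, P = G_next, P_next
        d *= 2
        levels += 1

    # G[i] is now the carry out of bit i; sum bit i is p[i] XOR carry into bit i.
    total = 0
    for i in range(n):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    total |= G[n - 1] << n  # final carry out
    return total, levels, nodes


result, levels, nodes = kogge_stone_add(200, 100, 8)
print(result, levels, nodes)
```

For 8-bit operands this structure uses 3 levels and 17 prefix nodes; other prefix graphs with the same depth can use fewer nodes at the cost of higher fan-out, which is the trade-off space the abstract refers to.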

3B-3 (Time: 16:40 - 17:05)
TitleAccelerating SAT-Based Boolean Matching for Heterogeneous FPGAs Using One-Hot Encoding and CEGAR Technique
Author*Yusuke Matsunaga (Kyushu University, Japan)
Pagepp. 255 - 260
KeywordFPGA, technology mapping, SAT solver, CEGAR
AbstractThis paper describes two speed-up techniques for Boolean matching of LUT-based circuits. One is a one-hot encoding technique for variables representing input assignments. Though it requires more variables than the existing binary encoding technique, almost all clauses added by one-hot encoding are binary clauses, which are suitable for efficient Boolean constraint propagation. The other is a CEGAR (counterexample-guided abstraction refinement) technique, which reduces the CPU time significantly. With both techniques, we can solve Boolean matching problems with 9-input functions in 20 milliseconds on average, which is more than one order of magnitude faster than the existing algorithms.
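The remark about binary clauses can be illustrated with a minimal sketch. It assumes a hypothetical helper, exactly_one_clauses, and DIMACS-style integer literals (a negative number is a negated variable); neither is taken from the paper, whose actual encoding of input assignments may differ. For k one-hot selector variables, every "at most one" constraint is a two-literal clause, and only the single "at least one" clause is longer.

```python
from itertools import combinations


def exactly_one_clauses(variables):
    """Return CNF clauses forcing exactly one of `variables` to be true.

    `variables` is a list of positive integers (DIMACS-style variable ids).
    Hypothetical helper for illustration only.
    """
    # At least one selector is true: a single clause with all variables.
    clauses = [list(variables)]
    # At most one selector is true: for each pair, not both (binary clauses).
    clauses += [[-a, -b] for a, b in combinations(variables, 2)]
    return clauses


selectors = [1, 2, 3, 4]  # one-hot choice among four candidate signals
cnf = exactly_one_clauses(selectors)
binary = [c for c in cnf if len(c) == 2]
print(len(cnf), len(binary))
```

With four candidates, a binary encoding would need only two variables instead of four, but six of the seven clauses produced here are binary, which is the property the abstract credits for efficient Boolean constraint propagation.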


Session 3C  Energy Optimization for Electric Vehicles and Smart Grids
Time: 15:50 - 17:30 Tuesday, January 20, 2015
Location: Room 105
Chairs: Hideki Takase (Kyoto University, Japan), Yongpan Liu (Tsinghua University, China)

3C-1 (Time: 15:50 - 16:15)
TitleNegotiation-Based Task Scheduling and Storage Control Algorithm to Minimize User’s Electric Bills under Dynamic Prices
AuthorJi Li, Yanzhi Wang, Xue Lin, Shahin Nazarian, *Massoud Pedram (USC, U.S.A.)
Pagepp. 261 - 266
KeywordSmart Grid, Dynamic Pricing, Energy Storage, Optimization, Task Scheduling
AbstractDynamic energy pricing is a promising technique in the Smart Grid to alleviate the mismatch between electricity generation and consumption. Energy consumers are incentivized to shape their power demands, or more specifically, schedule their electricity-consuming applications (tasks) more prudently to minimize their electric bills. This has become a particularly interesting problem with the availability of residential photovoltaic (PV) power generation facilities and controllable energy storage systems. This paper addresses the problem of joint task scheduling and energy storage control for energy consumers with PV and energy storage facilities, in order to minimize the electricity bill. A general type of dynamic pricing scenario is assumed where the energy price is both time-of-use and power-dependent, and various energy loss components are considered, including power dissipation in the power conversion circuitries as well as the rate capacity effect in the storage system. A negotiation-based iterative approach, inspired by state-of-the-art Field-Programmable Gate Array (FPGA) routing algorithms, is proposed for joint residential task scheduling and energy storage control. In each iteration, it rips up and re-schedules all tasks under a fixed storage control scheme, and then derives a new charging/discharging scheme for the energy storage based on the latest task scheduling. The concept of congestion is introduced to dynamically adjust the schedule of each task based on the historical results as well as the current scheduling status, and a near-optimal storage control algorithm is effectively implemented by solving convex optimization problem(s) with polynomial time complexity. Experimental results demonstrate that the proposed algorithm achieves up to a 64.22% reduction in total energy cost compared with the baseline methods.
Slides

3C-2 (Time: 16:15 - 16:40)
TitleMany-to-Many Active Cell Balancing Strategy Design
Author*Matthias Kauer, Swaminathan Narayanaswamy, Sebastian Steinhorst, Martin Lukasiewycz (TUM CREATE, Singapore), Samarjit Chakraborty (TU Munich, Germany)
Pagepp. 267 - 272
Keywordelectromobility, cell balancing, battery management, balancing strategy
AbstractIn the context of active cell balancing of electric vehicle battery cells, we deal with circuit architectures for inductor-based charge transfer and the corresponding high-level modeling and strategy development. In this work, we introduce a circuit architecture to transfer charge between arbitrarily many source and destination cells (many-to-many) for the first time and analyze the advantages over one-to-one transfer. Balancing simulation with numerical solvers remains challenging because of non-differentiable PWM signals, while the search space for high-level strategy design -- crucial for time and energy efficiency -- becomes even larger. Consequently, we develop a closed-form charge transfer model that extends state-of-the-art approaches and is three orders of magnitude faster than step-size controlled simulation. With an initial algorithm design based on experimentally derived rules, we demonstrate that many-to-many transfer dominates neighbor-only approaches in speed and efficiency even though it requires only one additional switch per circuit module.

3C-3 (Time: 16:40 - 17:05)
TitleIntra-Vehicle Network Routing Algorithm for Wiring Weight and Wireless Transmit Power Minimization
Author*Ta-Yang Huang, Chia-Jui Chang (National Cheng Kung University, Taiwan), Chung-Wei Lin (University of California at Berkeley, U.S.A.), Sudip Roy (National Cheng Kung University, Taiwan), Tsung-Yi Ho (National Chiao Tung University, Taiwan)
Pagepp. 273 - 278
KeywordIn-Vehicle Network, Routing, Power Consumption
AbstractAs the complexity of vehicle distributed systems increases rapidly, several hundred devices (sensors, actuators, etc.) are being placed in a modern automotive system. With the increase in wiring cables connecting these devices, the weight of a car increases significantly, which degrades fuel efficiency. In order to reduce the weight of the car, wireless communication has been introduced to replace wiring cables among some devices. However, the extra energy consumption for packet transmissions among wireless devices requires frequent maintenance, e.g., recharging of batteries. In this paper, we propose an intra-vehicle network routing algorithm to simultaneously minimize the wiring weight and the transmission power for wireless communication. Experimental results on a set of test cases show that the proposed method can effectively minimize the wiring weight and the transmit power for wireless communication.

3C-4 (Time: 17:05 - 17:30)
TitleAn Autonomous Decentralized Mechanism for Energy Interchanges with Accelerated Diffusion Based on MCMC
Author*Yusuke Sakumoto (Tokyo Metropolitan University, Japan), Ittetsu Taniguchi (Ritsumeikan University, Japan)
Pagepp. 279 - 284
KeywordRenewable energy, Micro-grid, Autonomous decentralized mechanism, Energy interchange
AbstractIt is not easy to provide an energy supply based on renewable energy that is sufficient to satisfy energy demand anytime and anywhere, because the amount of renewable energy depends on geographical conditions and the time of day. This paper proposes a novel autonomous decentralized mechanism of energy interchanges between distributed batteries on the basis of the diffusion equation and MCMC (Markov chain Monte Carlo) for matching energy supply to energy demand. Experimental results show that the proposed mechanism works effectively under several situations.
Slides



Wednesday, January 21, 2015

Session 2K  Keynote II
Time: 9:00 - 9:50 Wednesday, January 21, 2015
Location: International Conference Room
Chair: Kunio Uchiyama (Hitachi)

2K-1 (Time: 9:00 - 9:50)
Title(Keynote Address) Programmable Network
Author*Atsushi Takahara (NTT Network Innovation Laboratories, Japan)
Pagep. 285
AbstractNetwork virtualization such as SDN (Software Defined Network) or NFV (Network Functions Virtualization) is an important technology in the new generation network architecture. It provides flexible networking for various kinds of network usage demand. Network virtualization requires the definition of a user-specific network called a “slice” and a method for programming a slice design of programmable forwarding nodes in the network. The key aspect is to introduce programmability into the network. This can provide new value for users and application providers by working together with computation and peripheral technologies such as cloud computing, Internet of Things, mobile devices and so on. Also, the delivery time of a slice can be shorter than in existing networks. “Programmable Network” has huge potential to change the game in creating network services. In this talk, the R&D activities of network virtualization and its programmability methods are introduced to explain how the flexibility is realized as a hardware and software system. Then, we discuss how we utilize programmable networks for creating network services. Several use cases such as 4K/8K contents distribution, resiliency of network systems, and edge distributed computing are introduced to show the possibility of creating new value by Programmable Network. Finally, we discuss the possibility of applying EDA technologies for supporting the design flow of a user-specific network in Programmable Network.


Session 4S  (Special Session) Machine Learning in EDA: Promises and Challenges in Selected Applications
Time: 10:15 - 12:20 Wednesday, January 21, 2015
Location: Room 103
Chair: Li-C. Wang (University of California at Santa Barbara, U.S.A.)

4S-1 (Time: 10:15 - 10:45)
Title(Invited Paper) Machine Learning and Pattern Matching in Physical Design
AuthorBei Yu, *David Z. Pan (University of Texas at Austin, U.S.A.), Tetsuaki Matsunawa (Toshiba Corporation, Japan), Xuan Zeng (Fudan University, China)
Pagepp. 286 - 293
KeywordMachine Learning, Pattern Matching, Physical Design
AbstractMachine learning (ML) and pattern matching (PM) are powerful computer science techniques which can derive knowledge from big data, and provide prediction and matching. Since nanometer VLSI design and manufacturing have extremely high complexity and gigantic data, there has been a surge recently in applying and adapting machine learning and pattern matching techniques in VLSI physical design (including physical verification), e.g., lithography hotspot detection and data/pattern-driven physical design, as ML and PM can raise the level of abstraction from detailed physics-based simulations and provide reasonably good quality-of-result. In this paper, we will discuss key techniques and recent results of machine learning and pattern matching, with their applications in physical design.

4S-2 (Time: 10:45 - 11:15)
Title(Invited Paper) Self-Learning and Adaptive Board-Level Functional Fault Diagnosis
AuthorFangming Ye, *Krishnendu Chakrabarty (Duke University, U.S.A.), Zhaobo Zhang, Xinli Gu (Huawei Technologies, U.S.A.)
Pagepp. 294 - 301
Keywordboard-level, machine learning, fault diagnosis
AbstractFunctional fault diagnosis is necessary for board-level product qualification. However, ambiguous diagnosis results can lead to long debug times and wrong repair actions, which significantly increase repair cost and adversely impact yield. A state-of-the-art functional fault diagnosis system involves several key components: (1) design of functional test programs, (2) collection of functional-failure syndromes, (3) building of the diagnosis engine, (4) isolation of root causes, and (5) evaluation of the diagnosis engine. Advances in each of these components can pave the way for a more effective diagnosis system, thus improving diagnosis accuracy and reducing diagnosis time. Machine-learning and data analysis techniques offer an unprecedented opportunity to develop an automated and adaptive diagnosis system to increase diagnosis accuracy and reduce diagnosis time. This talk will describe how all the above components of an advanced diagnosis system can benefit from machine learning and information theory. Some of the topics to be discussed include incremental learning, decision trees, root-cause analysis and evaluation metrics, data acquisition, and knowledge transfer.

4S-3 (Time: 11:15 - 11:45)
Title(Invited Paper) Fast Statistical Analysis of Rare Failure Events for Memory Circuits in High-Dimensional Variation Space
AuthorShupeng Sun, *Xin Li (Carnegie Mellon University, U.S.A.)
Pagepp. 302 - 307
KeywordMemory, Monte Carlo
AbstractAccurately estimating the rare failure rates for nanoscale memory circuits is a challenging task, especially when the variation space is high-dimensional. In this paper, we summarize two novel techniques to address this technical challenge. First, we describe a subset simulation (SUS) technique to estimate the rare failure rates for continuous performance metrics. The key idea of SUS is to express the rare failure probability of a given circuit as the product of several large conditional probabilities by introducing a number of intermediate failure events. These conditional probabilities can be efficiently estimated with a set of Markov chain Monte Carlo samples generated by a modified Metropolis algorithm. Second, to efficiently estimate the rare failure rates for discrete performance metrics, scaled-sigma sampling (SSS) can be used. SSS aims to generate random samples from a distorted probability distribution for which the standard deviation (i.e., sigma) is scaled up. Next, the failure rate is accurately estimated from these scaled random samples by using an analytical model derived from the theorem of “soft maximum”. Our experimental results of several nanoscale circuit examples demonstrate that SUS and SSS achieve significantly improved accuracy over other traditional techniques when the dimensionality of the variation space is more than a few hundred.
Slides
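The key idea of SUS described in the abstract can be written compactly. The notation below is assumed here for illustration and is not taken from the paper: F denotes the rare failure event, and F_1, ..., F_K are nested intermediate failure events ending in F.

```latex
P(F) \;=\; P(F_1)\,\prod_{k=2}^{K} P\!\left(F_k \mid F_{k-1}\right),
\qquad F_1 \supset F_2 \supset \cdots \supset F_K = F
```

As a purely hypothetical example, if K = 6 and every factor is about 0.1, the product is about 10^{-6}, yet each individual factor is large enough to be estimated from a modest number of samples.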

4S-4 (Time: 11:45 - 12:20)
Title(Invited Paper) Data Mining in Functional Test Content Optimization
Author*Li-C. Wang (University of California at Santa Barbara, U.S.A.)
Pagepp. 308 - 315
KeywordFunctional verification, Data Mining, Test content optimization, Machine learning
AbstractThis paper reviews the data mining methodologies proposed for functional test content optimization where tests are sequences of instructions or transactions. Basic machine learning concepts and the key ideas of these methodologies are explained. Challenges for implementing these methodologies in practice are illustrated. Promises are demonstrated through experimental results based on industrial verification settings.


Session 4A  Efficient NVM Management, from Register to Disk
Time: 10:15 - 12:20 Wednesday, January 21, 2015
Location: Room 102
Chairs: Kyoungwoo Lee (Yonsei University, Republic of Korea), Koji Nii (Renesas Electronics)

4A-1 (Time: 10:15 - 10:40)
TitleCheckpoint-Aware Instruction Scheduling for Nonvolatile Processor with Multiple Functional Units
AuthorMimi Xie, Chen Pan, *Jingtong Hu (Oklahoma State University, U.S.A.), Chengmo Yang (University of Delaware, U.S.A.), Yiran Chen (University of Pittsburgh, U.S.A.)
Pagepp. 316 - 321
KeywordNonvolatile registers, FRAM, Energy, Writes
AbstractEmbedded systems powered by harvested energy experience frequent execution interruptions due to unstable energy sources. Nonvolatile (NV) register-based processors have been proposed to realize fast resume after power failure. The states in the volatile registers are checkpointed to NV registers. However, frequent checkpointing causes performance degradation and consumes excessive power. In this paper, we propose the checkpoint-aware instruction scheduling (CAIS) algorithm to reduce the writes to NV registers. Experiments show that CAIS can improve performance and reduce power consumption.
Slides

4A-2 (Time: 10:40 - 11:05)
TitleBalloonfish: Utilizing Morphable Resistive Memory in Mobile Virtualization
AuthorLinbo Long, Duo Liu, *Xiao Zhu, Kan Zhong (Chongqing University, China), Zili Shao (The Hong Kong Polytechnic University, Hong Kong), Edwin H.-M. Sha (Chongqing University, China)
Pagepp. 322 - 327
Keywordmobile virtualization, phase change memory, Morphable Resistive Memory, page allocation
AbstractVirtualization offers significant benefits such as better isolation and security for mobile systems. However, the limited amount of memory and virtualization's memory-demanding nature make it challenging to virtualize mobile systems efficiently. In this paper, we utilize morphable resistive memories to design a high-performance mobile system with extensible memory space. With morphable resistive memory, we convert the memory cell state between multi-level and single-level to achieve a balance between performance and memory space. Our evaluation based on the Samsung Exynos 5250 SoC with real Android applications shows that our system achieves a 27% performance improvement compared with the baseline scheme.

4A-3 (Time: 11:05 - 11:30)
TitleA Three-Stage-Write Scheme with Flip-Bit for PCM Main Memory
AuthorYanbin Li, *Xin Li, Lei Ju, Zhiping Jia (School of Computer Science and Technology, Shandong University, China)
Pagepp. 328 - 333
KeywordPCM, three-stage-write, two-stage-write, flip-N-write, performance improvement
AbstractPhase-change memory (PCM) is a non-volatile memory which suffers from slow write performance and limited write endurance. Besides, writing a one to a PCM cell takes longer but needs less electrical current than writing a zero. In traditional PCM, zeros and ones in a word are written at the same time and the word write time has to be the time to write a one, thus wasting time. In this paper, we propose a three-stage write scheme with flip-bit for PCM main memory to reduce the number of changed bits and the write latency. In our scheme, a write operation is divided into comparison, write-0 and write-1 stages. In the comparison stage, new data and old data are compared and the new data is re-encoded by a flip-bit to minimize changed bits. Then the flip-bit and re-encoded data are written to PCM cells in an accelerated manner. All zero bits and one bits are written separately in the latter two stages to avoid the time wasted in a traditional write. Our scheme shortens write time and reduces the bit changes caused by write operations compared with other existing schemes. The experimental results show that this scheme reduces bit changes by 43.5%, write time by 16.6% and write energy consumption by 34.6% on average.
Slides
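The comparison stage described in the abstract can be sketched in the spirit of the flip-N-write idea listed among the keywords. This is a minimal sketch under stated assumptions: the function names (compare_and_encode, split_stages, decode) are hypothetical, words are modeled as Python integers, and the paper's actual encoding and stage timing are not given in the abstract and may differ.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def compare_and_encode(old_word, old_flip, new_data, width):
    """Comparison stage: store new_data plain or inverted, whichever
    changes fewer cells (data cells plus the flip-bit cell)."""
    mask = (1 << width) - 1
    inverted = ~new_data & mask
    plain_cost = popcount((old_word ^ new_data) & mask) + (1 if old_flip != 0 else 0)
    flipped_cost = popcount((old_word ^ inverted) & mask) + (1 if old_flip != 1 else 0)
    if flipped_cost < plain_cost:
        return inverted, 1
    return new_data & mask, 0


def split_stages(old_word, stored, width):
    """Split the changed data cells into those written in the write-0
    stage and those written in the write-1 stage (flip-bit cell omitted)."""
    mask = (1 << width) - 1
    changed = (old_word ^ stored) & mask
    write0 = changed & ~stored & mask  # cells that become 0
    write1 = changed & stored          # cells that become 1
    return write0, write1


def decode(stored, flip, width):
    """Recover the original data from the stored word and its flip bit."""
    mask = (1 << width) - 1
    return (~stored & mask) if flip else (stored & mask)


stored, flip = compare_and_encode(0b00000000, 0, 0b11111110, 8)
write0, write1 = split_stages(0b00000000, stored, 8)
print(bin(stored), flip, bin(write0), bin(write1))
```

In this example, storing the data inverted changes one data cell and the flip bit instead of seven data cells, and decode(stored, flip, 8) returns the original value.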

4A-4 (Time: 11:30 - 11:55)
TitleA Garbage Collection Aware Stripping Method for Solid-State Drives
Author*Min Huang (Harbin Institute of Technology, China), Yi Wang (Shenzhen University/Hong Kong Polytechnic University, China), Zhaoqing Liu, Liyan Qiao (Harbin Institute of Technology, China), Zili Shao (The Hong Kong Polytechnic University, Hong Kong)
Pagepp. 334 - 339
KeywordSSD, garbage collection, stripping, parallelism
AbstractThis paper presents a Garbage Collection Aware Stripping Method (GCAS), which, to the best of our knowledge, is the first work that jointly optimizes the garbage collection operation and the I/O performance of stripping methods in NAND flash memory based SSDs. We implemented GCAS on a real hardware platform. Experiments show that GCAS can reduce the block erase count by up to 15.87% and avoid 47.6% of worst-case response times compared with the Round-Robin stripping method.

4A-5 (Time: 11:55 - 12:20)
TitleUnified Non-Volatile Memory and NAND Flash Memory Architecture in Smartphones
Author*Renhai Chen (The Hong Kong Polytechnic University, Hong Kong), Yi Wang (Shenzhen University, China), Jingtong Hu (Oklahoma State University, U.S.A.), Duo Liu (Chongqing University, China), Zili Shao (The Hong Kong Polytechnic University, Hong Kong), Yong Guan (Capital Normal University, China)
Pagepp. 340 - 345
KeywordSecondary Storage, Smartphones, Non-volatile Memory
AbstractI/O is becoming one of the major performance bottlenecks in NAND-flash-based smartphones. Novel NVMs can provide fast read/write operations. In this paper, we propose a unified NVM/flash architecture to improve the I/O performance. A cross-layer transparent scheme, vFlash (Virtualized Flash), is also proposed to manage the unified architecture. The experimental results show that the read and write performance of the proposed scheme is 2.45 times and 3.37 times better than that of the stock Android 4.2 system, respectively.


Session 4B  Robust Timing, and P/G Modeling and Design
Time: 10:15 - 12:20 Wednesday, January 21, 2015
Location: Room 104
Chairs: Ray Cheung (City University of Hong Kong, Hong Kong), Fan Yang (Fudan University, China)

4B-1 (Time: 10:15 - 10:40)
TitleA Retargetable and Accurate Methodology for Logic-IP-Internal Electromigration Assessment
AuthorPalkesh Jain (Qualcomm India Pvt Ltd, India), *Sachin S. Sapatnekar (University of Minnesota, U.S.A.), Jordi Cortadella (Universitat Politècnica de Catalunya, Spain)
Pagepp. 346 - 351
KeywordElectromigration, Characterization, Library, Automotive, Reliability
AbstractA new methodology for SoC-level logic-IP-internal EM verification is presented in this work, which provides an on-the-fly retargeting capability for reliability constraints. This flexibility is at design-verification stage, in the form of allowing arbitrary specifications (of lifetimes, temperatures, voltages and failure rates), as well as interoperability of IPs across foundries. The methodology is characterization and reuse based, and naturally incorporates complex effects like clock gating and variable switching rates at different pins. The benefit from such a framework is demonstrated on a 28nm design, with close SPICE-correlation and verification in a retargeted reliability condition.

4B-2 (Time: 10:40 - 11:05)
TitleNew Electromigration Modeling and Analysis Considering Time-Varying Temperature and Current Densities
AuthorHai-Bao Chen, *Sheldon X.-D. Tan, Xin Huang (Department of Electrical Engineering, University of California, Riverside, U.S.A.), Valeriy Sukharev (Mentor Graphics Corporation, U.S.A.)
Pagepp. 352 - 357
KeywordElectromigration, Reliability, Temperature, Current density
AbstractElectromigration (EM) is projected to be the major reliability issue for current and future VLSI technologies. However, existing EM models and assessment techniques are mainly based on constant current density and temperature. Such models will not work well at the system level, as the current density (power) and temperature change with time due to the different tasks (their loads) applied at run time. Existing EM approaches using average current density or temperature, however, will lead to significant errors, as shown in this work. In this paper, we propose a new physics-based EM model considering time-varying temperature and current density, which reflects more practical chip working conditions, especially for multi-core and emerging 3D ICs. We study the impacts of time-varying current densities and temperature profiles on the EM-induced lifetime of a wire for both the nucleation phase and the growth phase. We propose a fast stress calculation method for given time-varying temperature and current densities for the nucleation phase. We further develop new formulae to compute the resistance changes in the growth phase due to changing temperature and current densities. Experimental results show that the proposed method is in excellent agreement with detailed numerical analysis but with much improved efficiency.

4B-3 (Time: 11:05 - 11:30)
TitleGenerating Circuit Current Constraints to Guarantee Power Grid Safety
Author*Zahi Moudallal, Farid N Najm (University of Toronto, Canada)
Pagepp. 358 - 365
KeywordIntegrated Circuits, Power Grid, Verification, Generating Constraints
AbstractEfficient and early verification of the chip power distribution network is a critical step in modern IC design. Vectorless verification, developed over the last decade as an alternative to simulation based methods, requires user-specified current constraints and checks if the corresponding worst-case voltage drops at all grid nodes are below user-specified thresholds. However, obtaining/specifying the current constraints remains a burdensome task for design teams. In this paper, we define and address the inverse problem: for a given grid, we will generate circuit current constraints which, if adhered to by the underlying logic, would guarantee grid safety. There are many potential applications for this approach, including various grid quality metrics, as well as power grid-aware placement and floorplanning. We give a rigorous problem definition and develop some key theoretical results related to maximality of the current space defined by the constraints. Based on this, we then develop two algorithms for constraints generation that target key quality metrics like the peak total power allowed by the grid and the uniformity of the temperature distribution.
Slides

4B-4 (Time: 11:30 - 11:55)
TitleBEE: Predicting Realistic Worst Case and Stochastic Eye Diagrams by Accounting for Correlated Bitstreams and Coding Strategies
AuthorAadithya Karthik (UC Berkeley, U.S.A.), Sayak Ray (Princeton University, U.S.A.), *Jaijeet Roychowdhury (UC Berkeley, U.S.A.)
Pagepp. 366 - 371
KeywordWorst case eye diagram, Peak distortion analysis, Coding/Communications Schemes, Inter-Symbol Interference, Jitter
AbstractModern high-speed links and I/O subsystems often employ sophisticated coding strategies to boost error resilience. The analysis of such systems, which involves accurate prediction of worst-case and stochastic eye diagrams, is challenging. Existing techniques such as Peak Distortion Analysis (PDA) typically predict overly pessimistic eye diagrams. Monte-Carlo methods, on the other hand, often predict overly optimistic eye diagrams, and they are also very time-consuming. As an alternative, we present BEE, an accurate and efficient computational technique to predict realistic worst-case and stochastic eye diagrams in modern high-speed links with neither excessive pessimism nor undue optimism. BEE is able to fully and correctly take into account many features underlying modern communications systems, including arbitrary transmit-side coding schemes as well as non-idealities such as ISI, crosstalk, asymmetric rise/fall times, jitter, parameter variability, etc. We demonstrate BEE on links involving (7,4)-Hamming and 8b/10b SERDES encoders, featuring channels that give rise to multiple reflections, dispersion, loss, and overshoot/undershoot. BEE successfully predicts actual worst case eye openings in all these real-world systems, which can be twice as large as the eye openings predicted by overly pessimistic methods like PDA. Also, BEE can be an order of magnitude faster than Monte-Carlo based eye estimation methods.

4B-5 (Time: 11:55 - 12:20)
TitleA Fast Parallel Approach for Common Path Pessimism Removal
Author*Chung-Hao Tsai, Wai-Kei Mak (National Tsing Hua University, Taiwan)
Pagepp. 372 - 377
KeywordTiming analysis, Common path pessimism removal
AbstractStatic timing analysis has always been indispensable in integrated circuit design. In order to consider design and electrical complexities (e.g., crosstalk coupling, voltage drops) as well as manufacturing and environmental variations, timing analysis is typically done using an “early-late” split. The early-late split timing analysis enables timers to effectively account for any within-chip variation effects. However, this dual-mode analysis may introduce unnecessary pessimism, which can lead to an over-conservative design. Thus, common path pessimism removal (CPPR) is introduced to eliminate this pessimism during timing analysis. A naive approach would require the analysis of all paths in the design. For today’s designs with millions of gates, enumerating all paths is impractical. In this paper, we propose a new approach to effectively prune the redundant paths and develop a multi-threaded timing analysis tool called MTimer for fast and accurate CPPR. The results show that our timer achieves a 3.53X speedup compared with the winner of the TAU 2014 contest while maintaining 100% accuracy in removing common path pessimism during timing analysis.
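As an illustration of the pessimism that CPPR removes, the following is a minimal sketch under assumed data structures, not the pruning algorithm of MTimer. It assumes a clock tree given as a hypothetical `parent` map, with hypothetical per-edge `early` and `late` delays (integers, in picoseconds) stored at the child node. The credit for a launch/capture pair is the sum of late-minus-early differences along the clock path the two flip-flops share.

```python
def path_from_root(node, parent):
    """Return the list of nodes from the clock root down to `node`."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path[::-1]


def cppr_credit(launch, capture, parent, early, late):
    """Sum (late - early) over the clock-tree edges shared by both flip-flops.

    early[n] / late[n] are the assumed early/late delays of the edge
    entering node n from its parent.
    """
    a = path_from_root(launch, parent)
    b = path_from_root(capture, parent)
    credit = 0
    # Skip the root itself (it has no incoming edge) and walk down together.
    for u, v in zip(a[1:], b[1:]):
        if u != v:
            break
        credit += late[u] - early[u]
    return credit


# Hypothetical clock tree: root -> b1 -> b2 -> {ff1, ff2}, root -> b3 -> ff3
parent = {"b1": "root", "b2": "b1", "ff1": "b2", "ff2": "b2",
          "b3": "root", "ff3": "b3"}
early = {"b1": 10, "b2": 20, "ff1": 5, "ff2": 5, "b3": 12, "ff3": 6}
late = {"b1": 12, "b2": 25, "ff1": 6, "ff2": 7, "b3": 15, "ff3": 8}

print(cppr_credit("ff1", "ff2", parent, early, late))  # shares b1 and b2
print(cppr_credit("ff1", "ff3", parent, early, late))  # shares only the root
```

A timer would add this credit back to the slack of each affected path; the difficulty addressed in the abstract is doing so without enumerating every path.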


Session 4C  New Issues in Placement and Routing
Time: 10:15 - 12:20 Wednesday, January 21, 2015
Location: Room 105
Chairs: Shigetoshi Nakatake (University of Kitakyushu, Japan), Yuzi Kanazawa (Fujitsu Laboratories Ltd.)

4C-1 (Time: 10:15 - 10:40)
TitleDetailed-Routing-Driven Analytical Standard-Cell Placement
Author*Chau-Chin Huang, Chien-Hsiung Chiou, Kai-Han Tseng, Yao-Wen Chang (National Taiwan University, Taiwan)
Pagepp. 378 - 383
KeywordPlacement, Routing, Routability, Global-/Detailed-Routing, Analytical Placement
AbstractDue to the significant mismatch between global-routing congestions estimated during placement and the resulting design-rule violations in detailed routing, considering both global and detailed routability during placement is of particular importance for modern circuit designs. This paper presents an analytical standard-cell placement algorithm to optimize detailed routability with three major techniques: (1) A routability-driven wirelength model that directly minimizes routing congestion and wirelength simultaneously with no additional computational overhead in global placement. (2) A detailed-routability-aware whitespace allocation technique in legalization. (3) A multi-stage congestion-aware cell spreading method in detailed placement. Compared with the participating teams of the 2014 ISPD Detailed-Routing-Driven Placement Contest and a state-of-the-art routability-driven placer, our placer achieves the best quality in both detailed-routing violation and wirelength scores.

4C-2 (Time: 10:40 - 11:05)
TitleAn Approach to Anchoring and Placing High Performance Custom Digital Designs
Author*Shih-Ying Liu (National Chiao Tung University/MediaTek Inc., Taiwan), Tung-Chieh Chen (Synopsys Inc., Taiwan), Hung-Ming Chen (National Chiao Tung University, Taiwan)
Pagepp. 384 - 389
KeywordPlacement, Custom Digital
AbstractCustom layouts of digital blocks are often used in mixed-signal designs in order to meet critical performance requirements. Unlike traditional standard-cell based digital placement, custom-cell based digital placement may need additional manual help and intervention to achieve higher performance. The need for manual intervention is primarily due to the inability of modern analytical placers to deliver satisfactory performance when placing designs without pre-placed blocks. While most design flows work in a flat or top-down fashion, custom digital design generally works in a bottom-up fashion in which there is no prior knowledge of the I/O pin plan, since it can be changed by the owners of the modules. With no or very few fixed I/O locations, modern analytical placers tend to produce unsatisfactory results. In this work, we propose a method, mimicking the process of making beds, to guide state-of-the-art analytical placers to deliver better placement results. With the crafted pseudo anchors and nets, total HPWL on Capo10.5 [1], mPL6 [2], NTUplace3 [3] and VDAPlace [4] improved by 2.92%, 8.69%, 25.26% and 10.72% respectively on a set of industry custom digital designs, and by 2.19%, 4.34%, 36.09%, and 14.27% respectively on Peko-Suite1 benchmarks.

4C-3 (Time: 11:05 - 11:30)
TitleNon-Stitch Triple Patterning-Aware Routing Based on Conflict Graph Pre-Coloring
Author*Po-Ya Hsu, Yao-Wen Chang (National Taiwan University, Taiwan)
Pagepp. 390 - 395
Keywordtriple patterning lithography, routing, non-stitch triple patterning-aware routing, conflict graph
AbstractConsidering decomposition constraints earlier during routing becomes critical for realizing triple patterning lithography. In addition, stitches typically cause significant yield loss because of the overlay errors among different masks. As a result, leading foundries even get rid of the use of stitches in their design methodology. In order to completely avoid stitch-induced yield loss, we address the non-stitch triple patterning-aware routing problem. We observe that directly extending a state-of-the-art triple patterning-aware routing work to non-stitch routing might generate improper self-crossing nets and degrade routing quality due to sequential coloring for mask selection. To resolve these problems, we propose a new graph model to prevent self-crossing nets during routing and use a weighted conflict graph to globally consider net coloring. We then propose the first non-stitch triple patterning-aware routing scheme, which consists of two main stages: (1) conflict graph pre-coloring followed by (2) pre-coloring-based non-stitch routing. Experimental results show the effectiveness and efficiency of our routing scheme.
Slides

4C-4 (Time: 11:30 - 11:55)
TitleCut Mask Optimization with Wire Planning in Self-Aligned Multiple Patterning Full-Chip Routing
Author*Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan)
Pagepp. 396 - 401
Keywordself-aligned multiple patterning, cut mask optimization, SAMP-aware routing, wire planning
AbstractSelf-aligned quadruple patterning (SAQP) will be required for advanced sub-10nm nodes. However, cut masks used in SAQP are hardly manufacturable for arbitrary layouts because cut mask rules are limited by conventional 193nm lithography. To the best of our knowledge, existing SADP- and SAQP-aware detailed routers would fail to generate cut mask-friendly routing results for general SAMP. In this paper, we propose the first work of cut mask optimization with wire planning in SAMP full-chip routing. Experimental results show that the proposed routing algorithms are effective in generating routing results with optimized cut masks.
Slides

4C-5 (Time: 11:55 - 12:20)
TitleA Length Matching Routing Method for Disordered Pins in PCB Design
Author*Ran Zhang, Tieyuan Pan, Li Zhu, Takahiro Watanabe (Waseda University, Japan)
Pagepp. 402 - 407
KeywordPCB routing, length matching routing, EDA
AbstractIn this paper, a heuristic algorithm is proposed to obtain a length matching routing for disordered pins in printed circuit board (PCB) design. We first check the longest common subsequence of pin pairs to assign layers to pins. Then, we adopt single-commodity flow to generate base routes. Finally, R-flip and C-flip are carried out to adjust the wire length. The experiments show that our algorithm generates the optimal routes with better wire balance within reasonable CPU times.
Slides
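The layer-assignment step mentioned in the abstract of 4C-5 relies on the longest common subsequence (LCS). The sketch below shows the standard dynamic-programming LCS and one plausible way to use it, which is an assumption for illustration and not necessarily the authors' procedure: nets whose pins appear in the same relative order on both sides are peeled off layer by layer. The function names and the two pin orderings are hypothetical.

```python
def longest_common_subsequence(a, b):
    """Standard O(len(a) * len(b)) dynamic program; returns one LCS as a list."""
    n, m = len(a), len(b)
    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table to recover one optimal subsequence.
    out, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return out


def assign_layers(left_order, right_order):
    """Greedy illustration: each layer takes the nets forming an LCS.

    Assumes both arguments are permutations of the same net names.
    """
    left, right = list(left_order), list(right_order)
    layers = []
    while left:
        planar = longest_common_subsequence(left, right)
        layers.append(planar)
        chosen = set(planar)
        left = [p for p in left if p not in chosen]
        right = [p for p in right if p not in chosen]
    return layers


print(assign_layers([1, 2, 3, 4, 5], [2, 1, 3, 5, 4]))
```

Nets in one LCS keep the same order on both sides, so they can be drawn without crossing each other on a single layer; the remaining nets are deferred to further layers.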


Session 5S  (Designers' Forum ) Car Electronics
Time: 13:50 - 15:30 Wednesday, January 21, 2015
Location: Room 103
Organizer: Shinichi Shibahara (Renesas Electronics, Japan), Chair: Koji Inoue (Kyushu University, Japan)

5S-1 (Time: 13:50 - 14:20)
Title(Invited Paper) Systems Modeling for Additional Development in Automotive E/E Architecture
Author*Hidekazu Nishimura (Keio University, Japan)
Pagepp. 408 - 409
KeywordCar Electronics, System Modeling
AbstractSystems modeling in automotive electrical and electronic (E/E) architecture is described according to the systems engineering process. The purpose of my presentation is to show the importance of system architecture using SysML (Systems Modeling Language). An access and protection system is treated as an example designed in additional development based on new requirements such as “user friendliness” and “omotenashi”. It is shown that model-based systems engineering is effective and useful for obtaining the system architecture.

5S-2 (Time: 14:20 - 14:50)
Title(Invited Paper) Implementation and Evaluation of Image Recognition Algorithm for An Intelligent Vehicle using Heterogeneous Multi-Core SoC
Author*Nau Ozaki, Masato Uchiyama, Yasuki Tanabe, Shuichi Miyazaki, Takaaki Sawada, Takanori Tamai, Moriyasu Banno (Toshiba Corp., Japan)
Pagepp. 410 - 415
KeywordCar Electronics, Image Recognition
AbstractImage recognition is becoming one of the most important technologies for intelligent vehicle applications such as Advanced Driver Assistance Systems (ADAS); however, its computational costs are still considerably high. To realize such applications using image recognition algorithms as hard real-time tasks with low power consumption, we have developed a heterogeneous multi-core SoC specialized for image recognition [1]. Subsequently, several image recognition applications have been developed using this SoC. In this paper, we address two ADAS applications and image recognition algorithms for them, and evaluate them on the SoC. The results of the evaluation show that the SoC allows these applications to run with significantly lower power consumption compared with a general-purpose CPU.

5S-3 (Time: 14:50 - 15:20)
Title(Invited Paper) Trend in Power Devices for Electric and Hybrid Electric Vehicles
Author*Khalid Hussein, Akira Fujita, Katsumi Sato (Mitsubishi Electric Corp., Japan)
Pagep. 416
KeywordCar Electronics, Power Devices
AbstractSince the very successful release of the world’s first mass-produced HEV (hybrid electric vehicle) in 1997, the number of HEVs as well as EVs (electric vehicles) has been gradually increasing in major cities around the world. Reliable and efficient power devices transferring energy between the battery and the motor/generator represent the heart of the electric power-train that realizes the electric-mobility concept. The objective of this presentation is to give a quick preview of the early mass-produced power modules for EV/HEVs and to highlight the advancement achieved so far in addressing the severe reliability and high performance requirements of automotive applications. The presentation will also cover some of the power device trends in terms of the major pillars comprising automotive power modules: (1) power chip technology, (2) packaging technology, and (3) functional integration.


Session 5A  Optimization and Exploration for Caches
Time: 13:50 - 15:30 Wednesday, January 21, 2015
Location: Room 102
Chairs: Hiroyuki Tomiyama (Ritsumeikan University, Japan), Lin Meng (Ritsumeikan University, Japan)

5A-1 (Time: 13:50 - 14:15)
TitleMultilane Racetrack Caches: Improving Efficiency Through Compression and Independent Shifting
Author*Haifeng Xu (University of Pittsburgh, U.S.A.), Yong Li (VMware, Inc., U.S.A.), Rami Melhem, Alex K. Jones (University of Pittsburgh, U.S.A.)
Pagepp. 417 - 422
KeywordRacetrack memory, Cache, Compression
AbstractRacetrack memory (RM), a spintronic domain-wall non-volatile memory, has recently received attention as a high-capacity replacement for various structures in the memory system from secondary storage through caches. The main advantage of RM is improved density, and, like other non-volatile memory structures, its static power is dramatically lower than that of conventional CMOS memories. However, a major challenge of employing RM in universal memory components is the added access latency and dynamic energy consumption caused by shifts to align the data of interest with an access port. We propose multilane Racetrack caches (MRC), an RM last level cache design utilizing lightweight compression combined with independent shifting. MRC allows cache lines mapped to the same Racetrack structure to be accessed in parallel when compressed, mitigating potential shifting stalls in the RM cache. Our results demonstrate that, unlike previously proposed RM caches, an isocapacity MRC cache replacement can outperform SRAM caches while providing energy improvement over STT-MRAM caches. In particular, MRC improves performance by 5% and reduces energy by 19% compared to an isocapacity baseline RM cache, resulting in an energy delay product improvement of 25%.
Slides

5A-2 (Time: 14:15 - 14:40)
TitleManaging Hybrid On-Chip Scratchpad and Cache Memories for Multi-Tasking Embedded Systems
AuthorZimeng Zhou, *Lei Ju, Zhiping Jia, Xin Li (Shandong University, China)
Pagepp. 423 - 428
Keywordscratchpad memory, cache, multi-tasking, performance, energy efficiency
AbstractOn-chip memory management is essential in the design of high performance and energy-efficient embedded systems. While many off-the-shelf embedded processors employ a hybrid on-chip SRAM architecture including both scratchpad memories (SPMs) and caches, many existing works on SPM management ignore the synergy between caches and SPMs. In this work, we propose a static SPM allocation strategy for the hybrid on-chip memory architecture in a multi-tasking environment, which minimizes the overall access latency and energy consumption of the instruction memory subsystem. We capture cache conflict misses via a fine-grained temporal cache behavior model. An integer linear programming (ILP) based formulation is proposed to generate a function-level SPM allocation scheme, where both intra- and inter-task cache interference as well as access frequency are captured for an optimal memory subsystem design. Compared with the state-of-the-art static SPM allocation strategy in a multi-tasking environment, experimental results show that our SPM management scheme achieves 30.51% further improvement in instruction memory subsystem performance, and up to 34.92% in terms of energy saving.
Slides

5A-3 (Time: 14:40 - 15:05)
TitleOptimizing Thread-to-Core Mapping on Manycore Platforms with Distributed Tag Directories
Author*Guantao Liu, Tim Schmidt, Rainer Doemer (University of California, Irvine, U.S.A.), Ajit Dingankar, Desmond Kirkpatrick (Intel Corporation, U.S.A.)
Pagepp. 429 - 434
Keywordthread-to-core mapping, manycore platforms, on-chip communication
AbstractWith the increasing demand for parallel computing power, manycore platforms are attracting more and more attention due to their potential to improve performance and scalability of parallel applications. However, as the core count increases, core-to-core communication becomes expensive. For manycore architectures using directory-based cache coherence protocols, the core-to-core communication latency depends not only on the physical placement on the chip, but also on the location of the distributed cache tag directory. In this paper, we first define the concept of core distance for multicore and manycore architectures. Using a ping-pong spin-lock benchmark, we quantify the core distance on a ring-network platform and propose an approach to optimize thread-to-core mapping in order to minimize on-chip communication overhead. In our experiments, our approach speeds up communication-intensive benchmarks by more than 25% on average over the Linux default mapping strategy.
Slides

5A-4 (Time: 15:05 - 15:30)
TitleAccelerating Non-Volatile/Hybrid Processor Cache Design Space Exploration for Application Specific Embedded Systems
Author*Mohammad Shihabul Haque, Ang Li, Akash Kumar (National University of Singapore, Singapore), Qingsong Wei (Data Storage Institute, Singapore)
Pagepp. 435 - 440
KeywordCache, Embedded, NVM, Hybrid, Modeling
AbstractIn this article, we propose a technique to accelerate design space exploration of non-volatile and hybrid (volatile and non-volatile) processor caches for application specific embedded systems. Utilizing a novel cache behavior modeling equation and a new accurate cache miss prediction mechanism, our proposed technique can accelerate NVM/hybrid FIFO processor cache design space exploration for SPEC CPU 2000 applications by up to 249 times compared to the conventional approach.


Session 5B  CAD for Analog/RF/Mixed-Signal Design
Time: 13:50 - 15:30 Wednesday, January 21, 2015
Location: Room 104
Chairs: Sheldon Tan (University of California, Riverside, U.S.A.), Mark Po-Hung Lin (National Chung Cheng University, Taiwan)

5B-1 (Time: 13:50 - 14:15)
TitleAccurate Passivity-Enforced Macromodeling for RF Circuits via Iterative Zero/Pole Update Based on Measurement Data
AuthorYing-Chih Wang, Shihui Yin, Minhee Jun, *Xin Li, Lawrence T. Pileggi, Tamal Mukherjee, Rohit Negi (Carnegie Mellon University, U.S.A.)
Pagepp. 441 - 446
KeywordRF circuits, Modeling
AbstractPassive macromodeling for RF circuit blocks is a critical task to facilitate efficient system-level simulation for large-scale RF systems (e.g., wireless transceivers). In this paper we propose a novel algorithm to find the optimal macromodel that minimizes the modeling error based on measurement data, while simultaneously guaranteeing passivity. The key idea is to attack the passive macromodeling problem by solving a sequence of convex semi-definite programming (SDP) problems. As such, the proposed method can iteratively find the optimal poles and zeros for macromodeling. Our experimental results with several commercial RF circuit examples demonstrate that the proposed macromodeling method reduces the modeling error by 1.31-2.74x over other conventional approaches.
Slides

5B-2 (Time: 14:15 - 14:40)
TitlePhysical Verification Flow for Hierarchical Analog IC Design Constraints
Author*Volker Meyer zu Bexten, Markus Tristl (Infineon Technologies AG, Germany), Göran Jerke (Robert Bosch GmbH, Germany), Hartmut Marquardt (Mentor Graphics, Germany), Dina Medhat (Mentor Graphics, Egypt)
Pagepp. 447 - 453
Keywordconstraint verification, hierarchical constraints, constraint generation, circuit recognition
AbstractIn automotive applications -- as well as in other domains -- it is a major requirement that all relevant design constraints should be consistently derived and evaluated, as well as seamlessly enforced and verified in a well-documented manner. The authors present a new and industrial-strength approach to (1) derive appropriate design constraints from circuit structures, and (2) to verify constraints, such as clustering, matched orientation, matched parameters, alignment, and symmetry. Experimental results based on real-world automotive IC designs are shown.
Slides

5B-3 (Time: 14:40 - 15:05)
TitleAutomatic Design for Analog/RF Front-End System in 802.11ac Receiver
Author*Zhijian Pan, Chuan Qin, Zuochang Ye, Yan Wang (Institute of Microelectronics, Tsinghua University, China)
Pagepp. 454 - 459
KeywordAnalog/RF front-end, Design Automation
AbstractAlthough automatic optimization for individual analog/RF modules has been studied for many years, design automation for analog/RF systems that contain a complicated hierarchy of mixed-signal modules is still very challenging owing to the lack of an efficient way to bridge between different levels of description in the design hierarchy. In this paper, we apply sparse regression as a modeling tool to model the modules that need to be optimized and embed the modules in a large system to accomplish a realistic 802.11ac system design. The wireless system specification (e.g. bit error rate) for comprehensively evaluating the analog/RF front-ends is used as the optimization objective. The proposed method is implemented by linking the block-level performance metrics to the wireless system using a mixed-signal simulation platform with performance modeling and Pareto optimal fronts. With this method, the receiver for 802.11ac systems is successfully designed and the worst error vector magnitude (EVM) is decreased by 34% relative to the coarse design.

5B-4 (Time: 15:05 - 15:30)
TitleSIPredict: Efficient Post-Layout Waveform Prediction via System Identification
Author*Qicheng Huang, Xiao Li, Fan Yang, Xuan Zeng (Fudan University, China), Xin Li (Fudan University, China/Carnegie Mellon University, U.S.A.)
Pagepp. 460 - 465
KeywordSystem Identification, Post-layout, Waveform Prediction
AbstractWe propose a post-layout waveform prediction method to give designers a quick view of the post-layout waveforms in an iterative design process. Via system identification techniques, we build models to describe the relationships between pre-layout and post-layout simulation results. The model parameters are calibrated by using the simulation results of the first few data points of the pre-layout and post-layout stages. By taking the corresponding pre-layout simulation results as inputs of the calibrated models, the rest of the post-layout waveform can be predicted.
Slides


Session 5C  Next-Generation Clock Network Synthesis
Time: 13:50 - 15:30 Wednesday, January 21, 2015
Location: Room 105
Chairs: Atsushi Takahashi (Tokyo Institute of Technology), David Z. Pan (University of Texas, Austin, U.S.A.)

5C-1 (Time: 13:50 - 14:15)
TitleUseful Clock Skew Scheduling Using Adjustable Delay Buffers in Multi-Power Mode Designs
Author*Juyeon Kim, Taewhan Kim (Seoul National University, Republic of Korea)
Pagepp. 466 - 471
Keywordtiming optimization, clock skew, multiple power modes
AbstractThis work addresses the problem of useful clock skew scheduling for designs with multiple power modes, which are now mainstream in low-power design. Precisely, we propose an optimal solution to the problem of useful clock skew scheduling for designs with multiple power modes with the objective of minimizing the number of adjustable delay buffers (ADBs) used. In addition, we solve two practical extensions: optimally allocating ADBs having quantized delay values and optimally allocating ADBs with a delay upper bound.
Slides

5C-2 (Time: 14:15 - 14:40)
TitleFast Clock Skew Scheduling Based on Sparse-Graph Algorithms
Author*Rickard Ewetz (Purdue University, U.S.A.), Shankarshana Janarthanan (NVIDIA corporation, U.S.A.), Cheng-Kok Koh (Purdue University, U.S.A.)
Pagepp. 472 - 477
KeywordClock synthesis, skew, scheduling
AbstractIncorporating timing constraints explicitly imposed by the data and control paths during clock network synthesis can enhance the robustness of the synthesized clock networks. With these constraints, a clock scheduler can be used to guide the synthesis of a clock network by specifying a set of feasible arrival times at the respective sequential elements. Clock scheduling can be either static or dynamic. In static clock scheduling, a clock schedule is first specified; next, a clock network is constructed realizing the prescribed schedule. Clock trees constructed using this approach may consume significant routing resources. In dynamic clock scheduling, the clock tree and clock schedule are both simultaneously constructed and determined, respectively. In earlier studies, the scalability of dynamic clock scheduling, which is essentially a shortest path problem, has been limited. The bottleneck is in finding the shortest paths between different vertices in an incrementally changing weighted graph. In this work, we present two clock schedulers that address the scalability issues by exploiting the sparsity of this weighted graph. Experimental results show that the proposed clock schedulers are one to two orders of magnitude faster compared to a published scheduler in an earlier work. The proposed clock schedulers are scalable, and are tested on a synthesized circuit with 348 710 cells, 57 491 sequential elements, and 496 727 explicit timing constraints.
Slides
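The abstract of 5C-2 notes that clock scheduling is essentially a shortest path problem. A minimal sketch of that connection follows, using the textbook Bellman-Ford treatment of difference constraints rather than the sparse-graph schedulers proposed in the paper. Each timing constraint is assumed to have the form t[i] - t[j] <= c, where t is the clock arrival time at a sequential element; the function name and the example constraints are hypothetical.

```python
def feasible_schedule(nodes, constraints):
    """Find arrival times satisfying constraints t[i] - t[j] <= c.

    constraints: list of (i, j, c) tuples. Each one is an edge j -> i of
    weight c. Starting every distance at 0 is equivalent to adding a
    virtual source with zero-weight edges to all nodes.
    Returns a dict of arrival times, or None if a negative cycle makes the
    constraints infeasible.
    """
    dist = {v: 0 for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for i, j, c in constraints:
            if dist[j] + c < dist[i]:
                dist[i] = dist[j] + c
                changed = True
        if not changed:
            return dist  # converged: every constraint holds
    return None  # still relaxing after |V| passes: negative cycle


nodes = ["ff1", "ff2", "ff3"]
constraints = [
    ("ff1", "ff2", 3),   # t[ff1] - t[ff2] <= 3
    ("ff2", "ff1", -1),  # t[ff2] - t[ff1] <= -1
    ("ff3", "ff2", 2),
    ("ff2", "ff3", 0),
]
print(feasible_schedule(nodes, constraints))
```

This version rescans every constraint on each pass. In the dynamic setting described in the abstract, the graph changes incrementally, which is where exploiting sparsity pays off.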

5C-3 (Time: 14:40 - 15:05)
TitleModeling and Optimization of Low Power Resonant Clock Mesh
Author*Wulong Liu (Tsinghua University, China), Guoqing Chen (Research Lab, Advanced Micro Devices, China), Yu Wang, Huazhong Yang (Tsinghua University, China)
Pagepp. 478 - 483
KeywordResonant clock, Clock mesh, Low Power, Transmission line
AbstractPower consumption is becoming more critical in modern IC designs, and the clock network is one of the major contributors to on-chip power. Resonant clocking has been investigated as a potential solution to reduce the power consumption in the clock network by recycling the energy with on-chip inductors. Most previous resonant clock work focuses on H-tree structures, while in this work, we propose a modeling and optimization method for the resonant clock mesh structure, which suffers from high power consumption much more than the tree structure. Closed-form expressions for the transfer function, skew, and power are derived. Based on these expressions, the buffer size, LC tank location, grid size, wire width, and the sparsity of buffers and LC tanks are fully explored to make trade-offs among power, skew, and area. The exploration is also extended to 3D ICs, and different mesh structures are evaluated.
Slides

5C-4 (Time: 15:05 - 15:30)
TitleSynthesis of Resonant Clock Networks Supporting Dynamic Voltage / Frequency Scaling
Author*Seyong Ahn, Minseok Kang (Seoul National University, Republic of Korea), Marios C. Papaefthymiou (University of Michigan, U.S.A.), Taewhan Kim (Seoul National University, Republic of Korea)
Pagepp. 484 - 489
Keywordresonant clock, synthesis, low power
AbstractSo far, no work has addressed the problem of synthesizing resonant clock networks that can operate in designs with DVFS capability, even though the problem is potentially very important for maximizing the synergistic effect on power saving. In this context, this work proposes a comprehensive solution to the problem. Precisely, we propose a two-phase synthesis algorithm: (1) formulating the problem of inductor allocation, placement, and adjustable sizing to support DVFS as a weighted set cover problem with the objective of minimizing the total area of inductors, followed by (2) resizing adjustable driving buffers to support switching of the driving strength according to the clock frequencies set by DVFS.
Slides
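Phase (1) of 5C-4 is formulated as a weighted set cover problem. The sketch below shows the classic greedy approximation for weighted set cover, offered only to make the formulation concrete; the abstract does not say how the authors solve it. The candidate inductors, their areas (used as costs), and the clock regions they can serve are all hypothetical.

```python
def greedy_weighted_set_cover(universe, candidates):
    """Greedy approximation: repeatedly pick the best cost-per-new-element set.

    candidates: dict mapping a name to (cost, set_of_covered_elements).
    Returns the list of chosen names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best, best_ratio = None, None
        for name, (cost, covers) in candidates.items():
            gain = len(covers & uncovered)
            if gain == 0:
                continue
            ratio = cost / gain
            if best_ratio is None or ratio < best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            raise ValueError("universe cannot be covered by the candidates")
        chosen.append(best)
        uncovered -= candidates[best][1]
    return chosen


# Hypothetical example: regions to be served and candidate inductors (area, regions).
regions = {"r1", "r2", "r3", "r4"}
inductors = {
    "L_small_A": (2, {"r1", "r2"}),
    "L_small_B": (2, {"r3"}),
    "L_large": (5, {"r1", "r2", "r3", "r4"}),
    "L_mid": (3, {"r3", "r4"}),
}
picked = greedy_weighted_set_cover(regions, inductors)
print(picked, sum(inductors[n][0] for n in picked))
```

The greedy rule carries a logarithmic approximation guarantee but is not optimal in general; an exact formulation would typically go through integer linear programming.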


Session 6S  (Designers' Forum) Panel Discussion: Challenges in the Era of Big-Data Computing
Time: 15:50 - 17:30 Wednesday, January 21, 2015
Location: Room 103
Organizer: Koji Inoue (Kyushu University, Japan), Moderator: Koichiro Yamashita (Fujitsu Laboratories Ltd., Japan)

6S-1 (Time: 15:50 - 17:30)
Title(Panel Discussion) Challenges in the Era of Big-Data Computing
AuthorPanelists: Kento Aida (National Institute of Informatics, Japan), Derek Chiou (Microsoft, U.S.A.), Hiroshi Nakamura (The University of Tokyo, Japan), Hiroyuki Tanaka (Nippon Telegraph and Telephone Corporation, Japan), Iwao Yamazaki (Fujitsu Ltd., Japan)
AbstractThe advent of the big-data era may require a paradigm shift in designing computing systems. The amount of data obtained from the real world increases exponentially every year, whereas the performance of conventional computing systems improves quite slowly compared to the rapid growth of big-data applications. It is therefore time to revisit computer system architecture and its design. The panel discusses the direction of computing platforms needed to satisfy such performance requirements in the next-generation big-data era.


Session 6A  Optimization Techniques for Non-Volatile Memory based Systems
Time: 15:50 - 17:30 Wednesday, January 21, 2015
Location: Room 102
Chairs: Guangyu Sun (Peking University, China), Ju Lei (Shandong University)

6A-1 (Time: 15:50 - 16:15)
TitleAn Efficient STT-RAM-Based Register File in GPU Architectures
AuthorXiaoxiao Liu, Mengjie Mao, Xiuyuan Bi, Hai Li, *Yiran Chen (University of Pittsburgh, U.S.A.)
Pagepp. 490 - 495
KeywordSTT-RAM, MLC, Register file, GPU
AbstractModern GPGPUs employ a large register file (RF) to efficiently process heavily parallel threads in single instruction multiple thread (SIMT) fashion. The up-scaling of RF capacity, however, is greatly constrained by the large cell area and high leakage power consumption of the SRAM implementation. In this work, we propose a novel GPU RF design based on the emerging multi-level cell (MLC) spin-transfer torque RAM (STT-RAM) technology. Compared to SRAM, MLC STT-RAM (or MLC-STT) has much smaller cell area and almost zero standby power due to its nonvolatility. Moreover, by leveraging the asymmetric performance of the soft and the hard bits of an MLC-STT cell, we propose a remapping strategy to perform a flexible tradeoff between the access time and the capacity of the RF based on run-time access patterns. A novel rescheduling scheme is also developed to minimize the waiting time of the issued warps to access register banks. Experimental results over ISPASS2009 and CUDA benchmarks show that on average, our proposed MLC-STT RF can achieve 3.28% performance improvement, 9.48% energy reduction, and 38.9% energy efficiency enhancement compared to a conventional SRAM-based design.
Slides

6A-2 (Time: 16:15 - 16:40)
TitleA Bit-Write Reduction Method based on Error-Correcting Codes for Non-Volatile Memories
Author*Masashi Tawada, Shinji Kimura, Masao Yanagisawa, Nozomu Togawa (Waseda University, Japan)
Pagepp. 496 - 501
KeywordNon-volatile memory, Bit-write reduction, Error-correcting codes, Encode/decode, Write-Reduction Code
AbstractNon-volatile memory has many advantages over SRAM. However, one of its largest problems is that it consumes a large amount of energy in writing. In this paper, we propose a bit-write reduction method based on error-correcting codes for non-volatile memories. When data is written into memory cells, we do not write it directly but encode it into a codeword. We focus on error-correcting codes and generate new codes called write-reduction codes. In our write-reduction codes, each data value corresponds to an information vector in an error-correcting code, and an information vector corresponds not to a single codeword but to a set of write-reduction codewords. Given the data to be written and the current memory bits, we can deterministically select a particular write-reduction codeword corresponding to the data to be written, where the maximum number of flipped bits is theoretically minimized. Then the number of bits written into memory cells will also be minimized. We perform several experimental evaluations and demonstrate up to 72% energy reduction.
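
The codeword-selection idea in the abstract above can be illustrated with a minimal sketch. The 3-bit code below is a hypothetical toy chosen for this example, not the write-reduction code constructed in the paper: each 2-bit data value owns a set of two complementary codewords, and the encoder picks whichever one is closest in Hamming distance to the bits already stored. The names `CODEWORDS`, `encode`, and `worst_case_flips` are assumptions made for illustration.

```python
# Toy illustration only: this 3-bit code is a hypothetical example,
# not the write-reduction code constructed in the paper.

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

# Each 2-bit data value owns a set of two complementary 3-bit codewords.
# The four sets are disjoint, so every stored word decodes uniquely.
CODEWORDS = {
    0b00: (0b000, 0b111),
    0b01: (0b001, 0b110),
    0b10: (0b010, 0b101),
    0b11: (0b011, 0b100),
}
DECODE = {cw: data for data, cws in CODEWORDS.items() for cw in cws}

def encode(data: int, current: int) -> int:
    """Pick the codeword for `data` that flips the fewest bits of `current`."""
    return min(CODEWORDS[data], key=lambda cw: hamming(cw, current))

def worst_case_flips() -> int:
    """Maximum number of flipped bits over all data values and memory states."""
    return max(
        hamming(encode(data, current), current)
        for data in CODEWORDS
        for current in range(8)
    )
```

In this toy code any write flips at most one of the three cells, whereas writing two data bits directly can flip both. The price is one extra cell per two data bits.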

6A-3 (Time: 16:40 - 17:05)
TitleMinimizing MLC PCM Write Energy for Free through Profiling-Based State Remapping
Author*Mengying Zhao (City University of Hong Kong, Hong Kong), Yuan Xue, Chengmo Yang (University of Delaware, U.S.A.), Chun Jason Xue (City University of Hong Kong, Hong Kong)
Pagepp. 502 - 507
KeywordPCM, energy, state remapping
AbstractPhase change memory is becoming one of the most promising candidates to replace DRAM as main memory in deep sub-micron regime. Multi-level cell (MLC) PCM outperforms single level cell (SLC) PCM in terms of storage capacity but requires an iterative programming-and-verifying scheme to program cells to different resistance levels. The energy consumed in programming different MLC states varies significantly, thus motivating a state remapping technique to minimize the overall write energy. In this paper, we first compare dynamic and static state remapping strategies in terms of their efficacy in reducing energy, and then propose an effective and low-cost static state remapping algorithm. The experimental studies show 10.6% average (up to 16.9%) reduction in MLC PCM write energy, achieved within negligible hardware and performance overhead. Compared with the most related work, the proposed scheme saves more write energy on average, with near-zero performance, area and energy overhead.
Slides
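
As a rough illustration of profiling-based static state remapping (paper 6A-3 above), the sketch below counts how often each 2-bit logical state is written in a profiled trace and assigns the most frequent logical states to the cheapest physical states. The per-state energy values and the function names are hypothetical placeholders, not figures or algorithms from the paper.

```python
from collections import Counter

# Hypothetical per-state programming energies (arbitrary units), for illustration only.
STATE_ENERGY = {0b00: 1.0, 0b01: 4.0, 0b10: 3.0, 0b11: 2.0}

def build_remap(trace, state_energy):
    """Map frequently written logical states to low-energy physical states."""
    counts = Counter(trace)
    # Logical states, most frequently written first (unseen states count as 0).
    logical = sorted(state_energy, key=lambda s: -counts.get(s, 0))
    # Physical states, cheapest to program first.
    physical = sorted(state_energy, key=lambda s: state_energy[s])
    return dict(zip(logical, physical))

def write_energy(trace, state_energy, remap=None):
    """Total energy of writing `trace`, optionally through a remapping."""
    if remap is None:
        remap = {s: s for s in state_energy}
    return sum(state_energy[remap[s]] for s in trace)

profiled_trace = [0b01, 0b01, 0b01, 0b01, 0b10, 0b10, 0b00, 0b11]
remap = build_remap(profiled_trace, STATE_ENERGY)
print(write_energy(profiled_trace, STATE_ENERGY))         # 25.0
print(write_energy(profiled_trace, STATE_ENERGY, remap))  # 15.0
```

Because the mapping is fixed after profiling, reads only need the inverse table; this sketch does not model that hardware or any dynamic remapping.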

6A-4 (Time: 17:05 - 17:30)
TitleImproving Performance and Lifetime of DRAM-PCM Hybrid Main Memory through a Proactive Page Allocation Strategy
AuthorHoda Aghaei Khouzani, *Chengmo Yang (University of Delaware, U.S.A.), Jingtong Hu (Oklahoma State University, U.S.A.)
Pagepp. 508 - 513
KeywordPhase Change Memory, Hybrid Main Memory, Memory Management, Page Allocation
AbstractThis paper aims to reduce both DRAM misses and PCM writes in a DRAM-PCM hybrid memory architecture. We propose a proactive page allocation approach, exploiting the flexibility of mapping virtual pages to physical pages. By taking into consideration both the segment information and the number of conflict misses in DRAM, the proposed algorithm distributes heavily written pages across different DRAM sets. Trace-driven experiments show that the proposed technique is able to improve performance and lifetime of DRAM-PCM hybrid memory simultaneously.


Session 6B  Test for Higher Quality
Time: 15:50 - 17:30 Wednesday, January 21, 2015
Location: Room 104
Chairs: Tomokazu Yoneda (NAIST, Japan), Stefan Holst (Kyushu Institute of Technology)

6B-1 (Time: 15:50 - 16:15)
TitleEnhanced LCCG: A Novel Test Clock Generation Scheme for Faster-than-at-Speed Delay Testing
Author*Songwei Pei, Ye Geng (Department of Computer Science and Technology, Beijing University of Chemical Technology, China), Huawei Li (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, China), Jun Liu (School of Computer and Information, Hefei University of Technology, China), Song Jin (School of Electrical and Electronic Engineering, North China Electric Power University, China)
Pagepp. 514 - 519
Keywordfaster-than-at-speed, delay testing, small delay defects
AbstractOn-chip faster-than-at-speed delay testing provides a promising way for small delay defect detection. However, the frequency of on-chip generated test clock would be impacted by process variations. Hence, it requires determining the actual frequency of generated test clock to ensure the effectiveness of faster-than-at-speed delay testing. In this paper, we present a novel test clock generation scheme, namely Enhanced LCCG, for faster-than-at-speed delay testing. In the proposed scheme, faster-than-at-speed test clock is firstly generated by configuring the corresponding control information specified in the test pattern into Enhanced LCCG. Then, by constructing oscillation paths and counting the corresponding oscillation iteration numbers, the actual frequency of test clock can be measured and calculated with high resolution. Experimental results are presented to validate the proposed method.

6B-2 (Time: 16:15 - 16:40)
TitleAn Efficient 3D-IC On-Chip Test Framework to Embed TSV Testing in Memory BIST
AuthorLiang-Che Li, Wen-Hsuan Hsu, *Kuen-Jong Lee (National Cheng Kung University, Taiwan), Chun-Lung Hsu (Industrial Technology Research Institute, Taiwan)
Pagepp. 520 - 525
Keyword3D-IC, TSV testing, Memory testing
Abstract3D-ICs use Through Silicon Vias (TSVs) to reduce the connection length and enhance I/O bandwidth. In this paper, we present an efficient on-chip 3D-IC testing framework that merges TSV testing with BIST-based memory testing. During memory testing, the memory test patterns are also used to test the TSVs, and TSV testing can be repeated many times within the memory test time to improve the detectability of TSV defects. The experimental results show test time savings and low area cost.

6B-3 (Time: 16:40 - 17:05)
TitleAn Integrated Temperature-Cycling Acceleration and Test Technique for 3D Stacked ICs
Author*Nima Aghaee, Zebo Peng, Petru Eles (Linköping University, Sweden)
Pagepp. 526 - 531
Keyword3D Stacked IC test, temperature cycling, test scheduling, early-life failures
AbstractThrough silicon vias in 3D ICs are subject to undesirable early-life effects such as protrusion and voids. Operating the ICs under extreme temperature cycling can accelerate these early-life failures and make them detectable. An integrated temperature-cycling acceleration and test technique is introduced in this paper that combines a temperature-cycling acceleration procedure with pre-, mid-, and post-bond tests for 3D ICs. The proposed method schedules the tests, cooling intervals, and heating sequences in order to provide the required temperature cycling effect.
Slides

6B-4 (Time: 17:05 - 17:30)
TitleSoftware-Based Test and Diagnosis of SoCs Using Embedded and Wide-I/O DRAM
Author*Sergej Deutsch, Krishnendu Chakrabarty (Duke University, U.S.A.)
Pagepp. 532 - 537
KeywordFault Diagnosis, Test-data Compression, Online Test
AbstractWe propose a test and diagnosis solution that makes use of software-based decompression of deterministic scan-test patterns and allows for test application from on-chip DRAM to the logic die, extending traditional hardware-based methods and allowing for online scan-based test and diagnosis. This solution therefore targets SoCs that contain, in addition to a microprocessor, multiple digital-logic cores and glue logic, all of which need to be tested using scan test patterns.


Session 6C  Reliability
Time: 15:50 - 17:30 Wednesday, January 21, 2015
Location: Room 105
Chairs: Xuan Zeng (Fudan University, China), Martin Wong (UIUC, U.S.A.)

6C-1 (Time: 15:50 - 16:15)
TitleLogic-DRAM Co-Design to Efficiently Repair Stacked DRAM With Unused Spares
AuthorMinjie Lv, *Hongbin Sun, Jingmin Xin, Nanning Zheng (Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University, China)
Pagepp. 538 - 543
KeywordMemory repair, 3D IC, Design for repair, DRAM, logic-DRAM co-design
AbstractThis paper exploits a cost efficient approach to repair 3D integration induced defective cells in stacked DRAM with unused spares, by leveraging logic-DRAM co-design. In particular, we propose to make the DRAM array open its spares to off-chip access by small architecture modification, and further design the defective address comparison and redundant address remapping with very efficient architecture on logic die to achieve the equivalent memory repair.

6C-2 (Time: 16:15 - 16:40)
TitleElectromigration-Aware Redundant via Insertion
AuthorJiwoo Pak, Bei Yu, *David Z. Pan (The University of Texas at Austin, U.S.A.)
Pagepp. 544 - 549
KeywordElectromigration, Via, Optimization
AbstractAs the feature size shrinks, electromigration (EM) becomes a more critical reliability issue in IC design. EM around the via structures accounts for many of the reliability problems in ICs, and the insertion of redundant vias can mitigate the adverse effect of EM by reducing current density. In this paper, we model EM reliability of redundant via structures, considering current distribution with different via layouts. Based on our EM model, we choose redundant via layouts that can increase the EM-related lifetime by using integer linear programming (ILP). To overcome the runtime issue of ILP, we also propose speed-up techniques for our EM-aware redundant via insertion. Experimental results show that our scheme brings much more EM-robustness to circuits with a similar number of redundant vias, compared to the conventional redundant via insertion techniques.

6C-3 (Time: 16:40 - 17:05)
TitleSynthesis of Resilient Circuits from Guarded Atomic Actions
AuthorYuankai Chen (Synopsys Inc., U.S.A.), *Hai Zhou (Northwestern University, U.S.A.)
Pagepp. 550 - 555
KeywordReliability, High-level synthesis
AbstractWith aggressive scaling of minimum feature sizes, supply voltages, and design guard-band, transient faults have become critical issues in modern electronic circuits. Synthesis from guarded atomic actions has been investigated by Arvind et al. to explore non-determinism for hardware concurrency. We show in this work that non-determinism in the guarded atomic actions can be further explored for synthesis of resilient circuits. When an error happens in one atomic action, it may not need to be recomputed if there exist other feasible actions. Such flexibilities will be increased in the specification and explored in the synthesis for efficient error resiliency. This expanded solution space offers the possibility of performance optimization. Experimental results demonstrate the effectiveness and efficiency of our synthesis approach.

6C-4 (Time: 17:05 - 17:30)
TitleIncremental Latin Hypercube Sampling for Lifetime Stochastic Behavioral Modeling of Analog Circuits
AuthorYen-Lung Chen (National Central University, Taiwan), Wei Wu (University of California, Los Angeles, U.S.A.), *Chien-Nan Jimmy Liu (National Central University, Taiwan), Lei He (University of California, Los Angeles, U.S.A.)
Pagepp. 556 - 561
KeywordStochastic Behavioral Modeling, Lifetime yield, Incremental sampling, Analog, Aging effects
AbstractIn advanced technology nodes, not only process variations but also aging effects have critical impacts on circuit performance. Most existing works consider process variations and aging effects separately while building the corresponding behavior models. Because of the time-varying circuit properties, parametric yield needs to be reanalyzed in each aging time step. This results in expensive simulation cost for reliability analysis due to the huge number of circuit simulation runs. In this paper, an incremental Latin hypercube sampling (LHS) approach is proposed to build the stochastic behavior models for analog/mixed-signal (AMS) circuits while simultaneously considering process variations and aging effects. By reusing previous sampling information, only a few new samples are incrementally updated to build an accurate stochastic model in different time steps, which significantly reduces the number of simulations for aging analysis. Experiments on an operational amplifier and a DAC circuit achieve 242x speedup over the traditional reliability analysis method with similar accuracy.
Slides
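
For readers unfamiliar with the sampling scheme named in paper 6C-4 above, here is a minimal sketch of plain Latin hypercube sampling on the unit hypercube. It shows only the basic stratification property (each dimension is cut into n equal strata and every stratum is hit exactly once). The incremental sample reuse across aging time steps that the paper proposes is not modeled, and the function name `latin_hypercube` is an assumption for this example.

```python
import random

def latin_hypercube(n, dims, rng):
    """Draw n points in [0, 1)^dims with one point per stratum in every dimension."""
    samples = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Randomly assign each of the n strata of dimension d to one sample.
        strata = list(range(n))
        rng.shuffle(strata)
        for i, s in enumerate(strata):
            # Place the sample uniformly at random inside its stratum [s/n, (s+1)/n).
            samples[i][d] = (s + rng.random()) / n
    return samples

# Example: 8 samples of 3 hypothetical process parameters, seeded for repeatability.
points = latin_hypercube(8, 3, random.Random(0))
```

In a circuit setting each coordinate would then be mapped through the inverse distribution of the corresponding process parameter before simulation.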



Thursday, January 22, 2015

Session 3K  Keynote III
Time: 9:00 - 9:50 Thursday, January 22, 2015
Location: International Conference Room
Chair: Kunio Uchiyama (Hitachi)

3K-1 (Time: 9:00 - 9:50)
Title(Keynote Address) When and How Will an AI Be Smart Enough to Design?
Author*Noriko Arai (National Institute of Informatics, Japan)
Pagep. 562
AbstractThe current rise of AI has mainly two origins. The first one is, of course, the invention of machine learning. Statistics and optimization deliver their theories to machine learning. The combination of big data and massively parallel computing enables machines to “learn” from the data existing on the web, the network and the database, though there is only small hope that machine learning technologies help the machine to solve design problems such as design automation. Another rather inconspicuous origin is the sophistication of the traditional logical approach. The virtue of the logical approach is in its ability to express complex input-output relations, such as the mapping from natural language text to its meaning and the logical relation between a premise and its consequence, in a way that a human can understand. In this talk, I introduce an AI grand challenge, the “Todai Robot Project” (Can an AI get into the University of Tokyo?), initiated by the National Institute of Informatics in 2011, and discuss the impact of near-term AI technologies on design automation.


Session 7S  (Special Session) The Future of Emerging ReRAM Technology
Time: 10:15 - 12:20 Thursday, January 22, 2015
Location: Room 103
Chairs: Guangyu Sun (Peking University, China), Yuan Xie (University of California at Santa Barbara, U.S.A.)

7S-1 (Time: 10:15 - 10:45)
Title(Invited Paper) Toward Large-Scale Access-Transistor-Free Memristive Crossbars
AuthorAmirali Ghofrani, Miguel Angel Lastras-Montaño, *K.-T. Tim Cheng (University of California at Santa Barbara, U.S.A.)
Pagepp. 563 - 568
KeywordReRAM, access-transistor-free crossbar, write disturbance, sneak current, leakage current
AbstractMemristive crossbars have been shown to be excellent candidates for building an ultra-dense memory system because a per-cell access-transistor may no longer be necessary. However, the elimination of the access-transistor introduces several parasitic effects due to the existence of partially-selected devices during memory accesses, which could limit the scalability of access-transistor-free (ATF) memristive crossbars. In this paper we discuss these challenges in detail and describe some solutions addressing these challenges at multiple levels of design abstraction.
Slides

7S-2 (Time: 10:45 - 11:15)
Title(Invited Paper) Read Circuits for Resistive Memory (ReRAM) and Memristor-Based Nonvolatile Logics
Author*Meng-Fan Chang, Albert Lee, Chien-Chen Lin (National Tsing Hua University, Taiwan), Mon-Shu Ho (National Chung Hsing University, Taiwan), Ping-Cheng Chen (I-Shou University, Taiwan), Chia-Chen Kuo, Ming-Pin Chen, Pei-Ling Tseng, Tzu-Kun Ku (Industrial Technology Research Institute, Taiwan), Chien-Fu Chen, Kai-Shin Li, Jia-Min Shieh (National Nano Device Laboratories, Taiwan)
Pagepp. 569 - 574
KeywordReRAM, Memristor, sense amplifier
AbstractResistive memory device (Memristor) is one of the candidates for energy-efficient nonvolatile memory and nonvolatile logics (nvLogics) in the applications of wearable, IoT, cloud computing, and big-data processing. However, resistive RAM (ReRAM) and memristor-based nvLogics suffer from limited performance and low yield due to process variations in transistors and in the resistance of the memristor. This presentation discusses the design challenges in read circuits for high-speed, area-efficient, and low-voltage ReRAM and nvLogics. Memristor-based nvLogics, such as nonvolatile SRAM (nvSRAM), nonvolatile flip-flops (nvFF), and nonvolatile TCAM (nvTCAM), are included in this presentation. Several silicon-verified solutions on read schemes and sense amplifiers are also discussed in this presentation.

7S-3 (Time: 11:15 - 11:45)
Title(Invited Paper) 3D ReRAM with Field Assisted Super-Linear Threshold (FAST™) Selector Technology for Super-Dense, Low Power, Low Latency Data Storage Systems
AuthorSung Hyun Jo, Tanmay Kumar, Mehdi Asnaashari, Wei D. Lu, *Hagop Nazarian (Crossbar Inc., U.S.A.)
Pagep. 575
Abstract3D Resistive RAM (ReRAM) technology exhibits the best attributes to suit present and emerging non-volatile memory storage applications. However, the major challenge in making ReRAM work in a 3D crossbar array is the integration of a selector device with a ReRAM device. The selector device will solve the so-called “sneak path” barrier and enable large density memory arrays with low power consumption. Here we report a Field Assisted Superlinear Threshold (FAST) Selector technology that overcomes the sneak path barrier with a selectivity ratio of 10e10. The switching and recovery speed, on/off ratio, switching slope, W/E and read endurance, and variability of the FAST selector will be discussed. Prototype 1S1R devices with the FAST selector integrated with a low current ReRAM cell have been demonstrated and characterized. This technology readily leads to the implementation of 1TnR and 3D ReRAM architectures, and provides significant architectural benefits, power reduction, performance improvement, and overall system cost reduction advantages for IoT, enterprise storage, mobile, and wearable systems utilizing ReRAM technology.

7S-4 (Time: 11:45 - 12:20)
Title(Invited Paper) Modeling and Design Optimization of ReRAM
Author*J. F. Kang, H. T. Li, P. Huang, Z. Chen, B. Gao, X. Y. Liu (Peking University, China), Z. Z. Jiang, H.-S. P. Wong (Stanford University, U.S.A.)
Pagepp. 576 - 581
Keywordemerging memory, resistive switching memory, SPICE model
AbstractResistive switching devices (RRAM) have been widely studied for application in next-generation data storage and neuromorphic computing systems. To meet the requirements of device-circuit-system co-design and optimization, a SPICE model of RRAM that can reproduce the device characteristics in circuit simulations is needed. In this talk, we will address a physics-based SPICE model that we have developed, which can capture all the essential features of HfOx-based RRAM including the DC/AC and multi-level switching behaviors, switching reliability (endurance and read disturb) and variation characteristics, and resistance distributions. A novel extraction strategy is developed to extract the critical model parameters from the fabricated RRAM devices. A variety of electrical measurements on various RRAMs are performed to verify and calibrate the model. The verified model can be applied to explore a wide range of applications including: 1) variation-aware and reliability-emphasized system design; 2) system performance evaluation; 3) array architecture optimization. This verified design tool not only enables the system design but also provides solutions for system optimization that capitalize on device/circuit interaction for both data storage and neuromorphic computing applications.
Slides


Session 7A  Ensuring the Correctness of System Integration
Time: 10:15 - 12:20 Thursday, January 22, 2015
Location: Room 102
Chairs: Takeshi Matsumoto (Ishikawa National College of Technology), Akash Kumar (National University of Singapore, Singapore)

7A-1 (Time: 10:15 - 10:40)
TitleEvaluation of Runtime Monitoring Methods for Real-Time Event Streams
Author*Biao Hu, Kai Huang, Gang Chen, Alois Knoll (Technical University of Muenchen, Germany)
Pagepp. 582 - 587
KeywordRuntime Monitor, dynamic counter, l-repetitive function, FPGA
AbstractRuntime monitoring is of great importance as a safeguard to guarantee the correctness of system runtime behaviors. Two new methods, i.e., dynamic counters and l-repetitive function, were recently developed to tackle runtime monitoring for hard real-time systems. This paper investigates these two newly developed runtime monitoring methods in depth, evaluating and identifying their strengths and weaknesses. Representative scenarios are used as case studies to demonstrate our comparisons quantitatively. We also provide FPGA implementations and resource usages of both methods.
Slides

7A-2 (Time: 10:40 - 11:05)
TitleAutomatic Timing-Coherent Transactor Generation for Mixed-Level Simulations
Author*Li-chun Chen, Hsin-I Wu, Ren-Song Tsay (National Tsing Hua University, Taiwan)
Pagepp. 588 - 593
KeywordMixed-level simulations, ESL, system simulators, transactor, timing coherent
AbstractIn this paper we extend the concept of the traditional transactor, which focuses on correct content transfer, to a new timing-coherent transactor that also accurately aligns the timing of each transaction boundary so that designers can perform precise concurrent system behavior analysis in mixed-abstraction-level system simulations which are essential to increasingly complex system designs. To streamline the process, we also develop an automatic approach for timing-coherent transactor generation. Our approach is actually applied in mixed-level simulations and the results show that it achieves 100% timing accuracy while the conventional approach produces results of 25% to 44% error rate.
Slides

7A-3 (Time: 11:05 - 11:30)
TitleHybrid Coverage Assertions for Efficient Coverage Analysis Across Simulation and Emulation Environments
AuthorHsuan-Ming Chou, Hong-Chang Wu, Yi-Chiao Chen, *Jean Tsao, Shih-Chieh Chang (National Tsing Hua University, Taiwan)
Pagepp. 594 - 599
Keywordcoverage analysis, assertion, hardware-accelerated environment
AbstractCoverage metrics are used to measure the completeness of the verification tests. However, in a modern hardware-accelerated environment, coverage may be analyzed across a simulator and an emulator. Hence, conventional coverage techniques cannot be directly applied. To resolve the above problem, we propose using coverage assertions to detect coverage events across a simulator and an emulator. In addition, an Assertion Operation Graph and graph-based algorithms are proposed to minimize the hardware and performance overheads of coverage assertions.
Slides

7A-4 (Time: 11:30 - 11:55)
TitleSWAT: Assertion-Based Debugging of Concurrency Issues at System Level
Author*Luis Gabriel Murillo, Róbert Lajos Bücs, Daniel Hincapie, Rainer Leupers, Gerd Ascheid (RWTH Aachen University, Germany)
Pagepp. 600 - 605
KeywordMany-core, Concurrency, Debug, HW/SW, Virtual Platforms
AbstractModern multi- and many-core systems are prone to concurrency-related bugs which surface only at system level. Detecting these bugs might require dealing with low-level hardware (HW) protocols and/or software (SW) inter-task interactions. Virtual platforms (VPs) offer a viable vehicle to conveniently debug HW/SW functionality, yet developers are mostly limited to manually break-point, step and interact with the system. To ease debugging during integration at system level, this paper introduces SWAT, an assertion-based debugging framework that checks and correlates system-wide interactions among HW and SW components. SWAT is used together with VPs to enable detecting HW and/or SW concurrency bugs with lower effort than existing techniques. Our proposed approach is evaluated in two state-of-the-art platforms running real-world SW stacks.

7A-5 (Time: 11:55 - 12:20)
TitleCommunication Protocol Analysis of Transaction-Level Models Using Satisfiability Modulo Theories
Author*Che-Wei Chang, Rainer Doemer (University of California, Irvine, U.S.A.)
Pagepp. 606 - 611
KeywordTLM, SMT, Verification, AMBA bus, CAN bus
AbstractA critical aspect in SoC design is the correctness of communication between system blocks. In this work, we present a novel approach to formally verify various aspects of communication models, including timing constraints and liveness. Our approach automatically extracts timing relations and constraints from the design and builds a Satisfiability Modulo Theories (SMT) model whose assertions are then formally verified along with properties of interest input by the designer. Our method also addresses the complexity growth with a hierarchical approach. We demonstrate our approach on models communicating over the industry-standard bus protocols AMBA AHB and CAN bus. Our results show that the generated assertions can be solved within reasonable time.
Slides


Session 7B  Orchestrating Tasks, Cores, and Communication
Time: 10:15 - 12:20 Thursday, January 22, 2015
Location: Room 104
Chairs: Zili Shao (Hong Kong Polytechnic University, Hong Kong), Masanori Hashimoto (Osaka University, Japan)

7B-1 (Time: 10:15 - 10:40)
TitleGuiding Fault-Driven Adaption in Multicore Systems through a Reliability-Aware Static Task Schedule
AuthorLaura A Rozo Duque, *Chengmo Yang (University of Delaware, U.S.A.)
Pagepp. 612 - 617
KeywordSystem reliability, Static scheduling, Runtime adaptation, Multicore systems
AbstractFuture multicore systems are expected to suffer from high and varying fault rates. Efficient fault tolerant solutions capable of combining the advantages of static optimization and runtime adaptation are needed. This paper proposes a collaborative reliability-aware scheduling framework that considers “reliability level” (RL) as an intermediate scheduling dimension and creates a “task-to-RL-to-core” mapping. This mapping is used to guide runtime adaptation, thus effectively relieving most of the computational overhead and improving application performance in a non-constant fault rate environment.

7B-2 (Time: 10:40 - 11:05)
TitleApproximation-Aware Scheduling on Heterogeneous Multi-Core Architectures
Author*Cheng Tan, Thannirmalai Somu Muthukaruppan, Tulika Mitra (National University of Singapore, Singapore), Lei Ju (Shandong University, China)
Pagepp. 618 - 623
Keywordenergy-efficiency, DVFS, TDP, QoS
AbstractEmbedded devices in the multi-core era are performance-per-watt optimized, rendering energy-efficiency important. The heterogeneity of multi-core platform makes the problem more complicated. Fortunately, the accommodation of minimal loss in accuracy for applications yields chances to improve energy-efficiency. In this paper, we introduce an approximation-aware scheduling framework for real-time tasks on the heterogeneous multi-core architectures. It efficiently schedules tasks running on the different cores with appropriate frequency under thermal design power constraint while minimizing energy consumption and maximizing the Quality-of-Service.
Slides

7B-3 (Time: 11:05 - 11:30)
TitleComposing Real-Time Applications from Communicating Black-Box Components
Author*Martin Becker (Real-Time Computer Systems, Technical University of Munich, Germany), Alejandro Masrur (Software Technology for Embedded Systems, Technical University Chemnitz, Germany), Samarjit Chakraborty (Real-Time Computer Systems, Technical University of Munich, Germany)
Pagepp. 624 - 629
KeywordReal-Time, Software, Components, Composition, Library
AbstractTo handle complexity, embedded software is usually divided into components that are developed independently from each other and then need to be integrated in a reliable and deterministic manner. This involves buffering and synchronizing exchanged signals, as well as finding a feasible execution schedule, which is a tedious and error-prone procedure. We propose a framework that automatically performs such an integration, without requiring access to the components' source code. The developer only needs to declare interface signals between the components, connect them and define their execution periods. A software library then synthesizes deterministic communication mechanisms and provides a flexible, yet safe interface for time-triggered execution. Our approach does not require a run-time environment or a special compiler, which makes it light-weight and suitable for use on embedded platforms with limited resources.
Slides

7B-4 (Time: 11:30 - 11:55)
TitleEnhanced Partitioned Scheduling of Mixed-Criticality Systems on Multicore Platforms
Author*Zaid Al-bayati (McGill University, Canada), Qingling Zhao (Zhejiang University, China), Ahmed Youssef (McGill University, Canada), Haibo Zeng (Virginia Tech, U.S.A.), Zonghua Gu (Zhejiang University, China)
Pagepp. 630 - 635
KeywordMixed-criticality systems, scheduling, multicore
AbstractMixed Criticality Systems (MCS) have gained increasing interest in the past few years due to their industrial relevance. When mixed-criticality systems are implemented on multicore architectures, several challenges arise, such as the efficient partitioning of these systems. In this paper, we address this issue by presenting a novel mixed-criticality partitioning algorithm, the Dual-Partitioned Mixed-Criticality (DPM) algorithm, that allows limited migration of LO-criticality tasks to enhance the efficiency of the partitioning while maintaining many of the advantages of partitioned systems. Experimental results show that DPM consistently outperforms existing mixed-criticality partitioning algorithms; for example, at utilizations of 0.8 or higher, DPM is able to schedule 17% more systems.

7B-5 (Time: 11:55 - 12:20)
TitleReducing Dynamic Dispatch Overhead (DDO) of SLDL-Synthesized Embedded Software
AuthorJiaxing Zhang, Sanyuan Tang, *Gunar Schirner (Northeastern University, U.S.A.)
Pagepp. 636 - 643
KeywordEmbedded software, Software Synthesis, Optimization, SpecC, Compiler
AbstractSystem-Level Design Languages (SLDL) allow component-oriented specifications, e.g. for separating computation and communication. This separation allows for flexible model composition, refinement and exploration. This flexibility, however, requires dynamic dispatch during execution that degrades the simulation performance. After synthesis to a target platform, the model re-composition is no longer required. Then, the involved Dynamic Dispatch Overhead (DDO) only limits performance without providing benefits. Thus, approaches are needed for software synthesis to analyze model connectivity and eliminate the DDO wherever possible. This paper introduces a static dispatch type analysis as part of the DDO-aware embedded C code synthesis from SLDL models. Our DDO-aware software (SW) synthesis emits faster, more readable static dispatch code whenever a static connectivity is determinable. By replacing virtual functions with direct function calls, the DDO can be totally eliminated, allowing for aggressive inlining optimizations by the compiler. We demonstrate the benefits of the improved SW synthesis on a JPEG encoder, which runs up to 16% faster with DDO-reduction on an ARM9-based HW/SW platform. Our approach combines the flexibility benefits in specification modeling with efficient execution when synthesized to embedded targets.


Session 7C  Design for Manufacturability
Time: 10:15 - 12:20 Thursday, January 22, 2015
Location: Room 105
Chairs: Shigeki Nojima (Toshiba Corporation, Japan), Eric J.-W. Fang (MediaTek, Taiwan)

7C-1 (Time: 10:15 - 10:40)
TitleContact Pitch and Location Prediction for Directed Self-Assembly Template Verification
AuthorZigang Xiao, Yuelin Du, *Martin D.F. Wong (University of Illinois at Urbana-Champaign, U.S.A.), He Yi, H.-S. Philip Wong (Stanford University, U.S.A.), Hongbo Zhang (Synopsys Inc., U.S.A.)
Pagepp. 644 - 651
KeywordDirected Self-Assembly, Machine Learning, Pitch Prediction, Verification, Design for Manufacturability
AbstractIn Directed Self-Assembly (DSA), the patterning variation in the templates is very likely to affect the placement and shape of the final contact holes. However, rigorous DSA simulation is unacceptably slow for full chip verification. This paper presents a machine learning based DSA verification method that is able to learn a model of high accuracy in predicting the pitch size and hole location. The experimental results show that our method achieves high accuracy and low time cost compared to simulation-based methods.
Slides

7C-2 (Time: 10:40 - 11:05)
TitleLayout Decomposition Co-Optimization for Hybrid E-Beam and Multiple Patterning Lithography
Author*Yunfeng Yang, Wai-Shing Luk (Fudan University, China), Hai Zhou (Fudan University, China/Northwestern University, U.S.A.), Changhao Yan, Xuan Zeng (Fudan University, China), Dian Zhou (Fudan University, China/University of Texas at Dallas, U.S.A.)
Pagepp. 652 - 657
Keywordhybrid lithography, multiple patterning, e-beam, layout decomposition, primal-dual
AbstractAs the feature size keeps scaling down and the circuit complexity increases rapidly, a more advanced hybrid lithography, which combines multiple patterning and e-beam lithography (EBL), is promising to further enhance the pattern resolution. In this paper, we formulate the layout decomposition problem for this hybrid lithography as a minimum vertex deletion K-partition problem, where K is the number of masks in multiple patterning. Stitch minimization and EBL throughput are considered uniformly by adding a virtual vertex between two feature vertices for each stitch candidate during the conflict graph construction phase. For K = 2, we propose a primal-dual method for solving the underlying minimum odd-cycle cover problem efficiently. In addition, a chain decomposition algorithm is employed for removing all "non-cyclable" edges. For K > 2, we propose a random-initialized local search method that iteratively applies the primal-dual solver. Experimental results show that compared with a two-stage method, our proposed methods reduce the EBL usage by 64.4% with double patterning and 38.7% with triple patterning on average for the benchmarks.
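
As a hedged illustration of the K = 2 case in the abstract above: a conflict graph can be split over two masks exactly when it contains no odd cycle, which is why the problem reduces to covering odd cycles. The sketch below is only a feasibility check by breadth-first two-coloring, not the paper's primal-dual method; the function name `two_color` and the edge-list input format are assumptions made for this example.

```python
from collections import deque


def two_color(num_features, conflicts):
    """Assign each feature to mask 0 or 1 so that no conflict edge joins
    two features on the same mask.

    Returns the list of masks, or None when the conflict graph contains
    an odd cycle (some feature would then have to be deleted from the
    graph, e.g. handed to e-beam lithography).
    """
    adj = [[] for _ in range(num_features)]
    for u, v in conflicts:
        adj[u].append(v)
        adj[v].append(u)

    mask = [None] * num_features
    for start in range(num_features):
        if mask[start] is not None:
            continue
        mask[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if mask[v] is None:
                    mask[v] = 1 - mask[u]  # neighbor goes on the other mask
                    queue.append(v)
                elif mask[v] == mask[u]:
                    return None  # odd cycle found
    return mask
```

For example, a four-feature conflict ring is decomposable, while a conflict triangle is not until one of its features is removed from the graph.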

7C-3 (Time: 11:05 - 11:30)
TitlePolynomial Time Optimal Algorithm for Stencil Row Planning in E-Beam Lithography
AuthorDaifeng Guo, Yuelin Du, *Martin D.F. Wong (University of Illinois at Urbana-Champaign, U.S.A.)
Pagepp. 658 - 664
KeywordE-Beam, Stencil Planning, 1D Character Row Ordering
AbstractElectron beam lithography (EBL) is a very promising candidate for integrated circuit (IC) fabrication beyond the 10 nm technology node. To address its throughput issue, the Character Projection (CP) technique has been proposed, and its stencil planning can be optimized with awareness of overlapping characters. However, the top-level 2D stencil planning problem has been proved to be NP-hard. Its most essential step, the 1D row ordering, is believed to be hard as well, and no polynomial time optimal solution has been provided so far. In this paper, we propose a polynomial time optimal algorithm to solve the row ordering problem, which serves as the major subroutine for the entire stencil planning problem. Proof and experimental results are also provided to verify the correctness and efficiency of our algorithm.

7C-4 (Time: 11:30 - 11:55)
TitleFast Mask Assignment Using Positive Semidefinite Relaxation in LELECUT Triple Patterning Lithography
Author*Yukihide Kohira (The University of Aizu, Japan), Tomomi Matsui (Tokyo Institute of Technology, Japan), Yoko Yokoyama, Chikaaki Kodama (Toshiba Corporation, Japan), Atsushi Takahashi (Tokyo Institute of Technology, Japan), Shigeki Nojima, Satoshi Tanaka (Toshiba Corporation, Japan)
Pagepp. 665 - 670
Keywordtriple patterning lithography, LELECUT, mask assignment, positive semidefinite relaxation
AbstractRecently, LELECUT type triple patterning lithography (TPL) technology, where the third mask is used to cut the patterns, has been discussed as a way to alleviate native conflict and overlay problems in LELELE type TPL. In this paper, we formulate the LELECUT mask assignment problem, which maximizes the compliance to the lithography, and apply a positive semidefinite relaxation. In our proposed method, the positive semidefinite relaxation is defined by extracting cut candidates from the layout, and a mask assignment is obtained from an optimum solution of the relaxation by a randomized rounding technique.

7C-5 (Time: 11:55 - 12:20)
TitleLayout Decomposition for Spacer-is-Metal (SIM) Self-Aligned Double Patterning
Author*Shao-Yun Fang (National Taiwan University of Science and Technology, Taiwan), Yi-Shu Tai, Yao-Wen Chang (National Taiwan University, Taiwan)
Pagepp. 671 - 676
Keywordself-aligned double patterning, layout decomposition, spacer-is-metal
AbstractSelf-aligned double patterning (SADP) has become a preferred double patterning technology, due to its better overlay controllability. Since there is only one satisfiability-based previous work on SIM-type layout decomposition, which typically has higher decomposition flexibility (especially for gridless designs), we propose an efficient graph-based SIM-type layout decomposition heuristic. The decomposition problem is first transformed into a constrained set-covering problem. Then, an efficient algorithm composed of a greedy heuristic followed by a partition-based solution refinement scheme is proposed. Experimental results show that the algorithm can efficiently derive a good decomposition solution with minimized pattern conflicts.
Slides


Session 8S  (Designers' Forum) Technology Trend toward 8K Era
Time: 13:50 - 15:30 Thursday, January 22, 2015
Location: Room 103
Organizer: Hiroe Iwasaki (NTT Media Intelligence Laboratories, Japan), Chair: Masaitsu Nakajima (Panasonic Corporation, Japan)

8S-1 (Time: 13:50 - 14:15)
Title(Invited Paper) The Prospects of Next Generation Television - Japan’s Initiative to 2020 -
Author*Keiya Motohashi (NetTV Forum, Japan)
Pagepp. 677 - 679
Keyword8K, Television
AbstractIn Japan, the telecommunications authority proposed the first grand schedule for the launch and spread of UHDTV & advanced smart TV, called the ‘roadmap’, in June 2013. The roadmap suggested that business-based 4KTV & 8KTV broadcasts would be launched by 2020. In 2014, experimental 4K broadcasts started and the ‘roadmap’ was revised. Overall, the grand schedule for launching UHDTV in Japan is said to have moved up by about 2 years.

8S-2 (Time: 14:15 - 14:40)
Title(Invited Paper) 8K LCD : Technologies and Challenges toward the Realization of SUPER Hi-VISION TV
Author*Takeshi Kumakura (SHARP Corporation, Japan)
Pagepp. 680 - 683
Keyword8K, LCD
AbstractSince 2011, we have successfully developed a number of 8K LCD prototypes for SUPER Hi-VISION. The latest prototype has an 85-inch panel of 7680 by 4320 pixels, achieves a frame rate of 120Hz, and has a color system expanded to a wide gamut compliant with BT2020. Furthermore, the input interface has been upgraded to a novel optical fiber system capable of transmitting 256Gbps over one cable.

8S-3 (Time: 14:40 - 15:05)
Title(Invited Paper) The World's 1st Complete-4K SoC Solution with Hybrid Memory System
Author*Daisuke Murakami, Yuki Soga, Daisuke Imoto, Yoshiharu Watanabe, Takashi Yamada (Panasonic Corporation, Japan)
Pagepp. 684 - 686
Keyword8K, SoC
AbstractThe 4K2K display market is expanding faster than expected. For this market, we introduce the world’s 1st Complete-4K SoC solution with a hybrid memory system. This SoC supports High Efficiency Video Coding (HEVC) 10bit 4K 60p decode, graphics processing, picture blending, and external video input/output. These operations can be performed simultaneously with 4K quality. To realize that capability, we adopt a highly efficient novel hybrid memory system with Wide-IO 266 memory and DDR3-1866 memory (24.4GB/s, 20Gbits).

8S-4 (Time: 15:05 - 15:30)
Title(Invited Paper) H.265/HEVC Encoder for UHDTV
Author*Mitsuo Ikeda (NTT, Japan)
Pagepp. 687 - 688
Keyword8K, UHDTV
AbstractIn recent years, the growing demand for ultra-high-definition television (UHDTV) services has generated the rapid development of UHDTV technologies. A key issue in providing UHDTV services is achieving efficient video coding. This paper presents an outline of H.265/HEVC, the latest video coding standard, and reviews the technologies it includes.


Session 8A  Exploring Better Architecture of Your Systems
Time: 13:50 - 15:30 Thursday, January 22, 2015
Location: Room 102
Chairs: Rainer Doemer (University of California, Irvine, U.S.A.), Hoeseok Yang (Ajou University, Republic of Korea)

8A-1 (Time: 13:50 - 14:15)
TitleAn Accurate ACOSSO Metamodeling Technique for Processor Architecture Design Space Exploration
Author*Hongwei Wang (Beijing Key Laboratory of Mobile Computing and Pervasive Device/Institute of Computing Technology, Chinese Academy of Sciences/University of Chinese Academy of Sciences, China), Ziyuan Zhu, Jinglin Shi, Yongtao Su (Beijing Key Laboratory of Mobile Computing and Pervasive Device/Institute of Computing Technology, Chinese Academy of Sciences, China)
Pagepp. 689 - 694
Keyworddesign space exploration, metamodeling, ACOSSO
AbstractProcessor architects usually design uniprocessors or chip multiprocessors (CMPs) by using a platform-based approach. One of the major challenges in this approach is to explore the exponential-size design space composed of many tunable and interacting architectural parameters. An exhaustive search of the design space is prohibitive because of the expensive run-time of simulations. An efficient design space exploration (DSE) strategy is therefore needed that can quickly find the multi-objective architectural configurations (points in design space) in terms of system metrics like performance and energy. In this paper, we propose an accurate and efficient adaptive component selection and smoothing operator (ACOSSO) metamodel assisted NSGA-II (MA-NSGA-II) multi-objective optimization (MOO) technique for processor DSE. We show the effectiveness of our methodology by comparing with linear regression (LR), restricted cubic splines (RCS), natural cubic splines (NCS) and artificial neural network (ANN) metamodeling techniques for processor design metrics prediction and architecture optimization. The experimental results show that the proposed methodology achieves higher prediction accuracy and better architecture optimization results.

8A-2 (Time: 14:15 - 14:40)
TitleSpeeding Up Single Pass Simulation of PLRUt Caches
Author*Josef Schneider, Jorgen Peddersen, Sri Parameswaran (The University of New South Wales, Australia)
Pagepp. 695 - 700
KeywordCache Simulation, PLRUt
AbstractCPU caches have become an essential component in many computer systems as they can significantly increase system performance by alleviating the effects of memory latency. For many designers part of the system design flow is the selection of an appropriately configured cache, a task which can be performed using cache simulators. Exploring the entire design space through precise cache simulation is a lengthy process, and while certain cache replacement policies have been optimised for fast simulation execution (such as LRU and FIFO), no effective optimisations have been proposed for an extremely effective replacement policy: Pseudo Least Recently Used tree-based, also known as PLRUt. In this paper we are the first to present a number of characteristics of the PLRUt replacement policy that lend themselves to the design of an optimised hash table-based cache simulator. We demonstrate that our optimised simulator is up to 1.93x faster than an un-optimised implementation.
Slides
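
For readers unfamiliar with the replacement policy named in 8A-2, the following is a minimal sketch of tree-based pseudo-LRU (PLRUt) for a single cache set. It illustrates the policy only and is not the optimised hash table-based simulator described in the abstract; the class name `PLRUtSet`, the heap-ordered bit layout, and the use of `None` for an empty way are assumptions made for this example.

```python
class PLRUtSet:
    """One cache set using tree-based pseudo-LRU (PLRUt) replacement."""

    def __init__(self, ways):
        assert ways >= 2 and ways & (ways - 1) == 0, "ways must be a power of two"
        self.ways = ways
        self.tags = [None] * ways      # None marks an empty way
        # ways - 1 tree bits in heap order (children of node i: 2i+1, 2i+2).
        # A bit of 0 sends the victim search left, 1 sends it right.
        self.bits = [0] * (ways - 1)

    def _touch(self, way):
        """Flip the bits on the path to `way` so they point away from it."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:
                self.bits[node] = 1    # accessed left half, victim goes right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0    # accessed right half, victim goes left
                node, lo = 2 * node + 2, mid

    def _victim(self):
        """Follow the tree bits from the root to the way to be replaced."""
        node, lo, hi = 0, 0, self.ways
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        return lo

    def access(self, tag):
        """Return True on a hit, False on a miss (filling or evicting)."""
        if tag in self.tags:
            self._touch(self.tags.index(tag))
            return True
        way = self.tags.index(None) if None in self.tags else self._victim()
        self.tags[way] = tag
        self._touch(way)
        return False
```

In a 4-way set, accessing tags 1, 2, 3, 4, 5 and then 1 again evicts tag 3 on the last miss, whereas true LRU would evict tag 2; this is the "pseudo" part of the policy, bought with only ways - 1 bits of state per set.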

8A-3 (Time: 14:40 - 15:05)
TitleADAPT: An ADAptive Manycore Methodology for Software Pipelined ApplicaTions
Author*Xi Zhang, Haris Javaid (University of New South Wales, Australia), Muhammad Shafique (Karlsruhe Institute of Technology, Germany), Jude Angelo Ambrose (University of New South Wales, Australia), Jörg Henkel (Karlsruhe Institute of Technology, Germany), Sri Parameswaran (University of New South Wales, Australia)
Pagepp. 701 - 706
KeywordManycore System, MPSoC, Streaming Applications, Software pipeline, Run-time adaptation
AbstractFuture multiprocessor architectures are expected to have hundreds of processors on a chip. To amortize the cost of such systems, they would be expected to be used in a variety of situations, and for a number of applications. In this paper, we examine how software pipelines, which are useful for streaming/multimedia applications, can be efficiently implemented in such multiprocessor systems. The goal is to balance the stages of the pipeline in the presence of workload variations. This paper shows a method to detect bottleneck stages and adds processors to those bottleneck stages at run-time. Further, if there are no free processors, then a shuffling of processors is performed. Our methodology (which was simulated on a commercial simulating system) adapts in less than two thousand cycles, and for a variety of benchmarks achieves up to 2.1× the throughput when compared to the state-of-the-art technology (modified and implemented in the same platform for purposes of comparison).
Slides

8A-4 (Time: 15:05 - 15:30)
TitleA Trace-Driven Approach for Fast and Accurate Simulation of Manycore Architectures
Author*Anastasiia Butko, Rafael Garibotti, Luciano Ost, Vianney Lapotre, Abdoulaye Gamatie, Gilles Sassatelli (LIRMM/CNRS/University of Montpellier II, France), Chris Adeniyi-Jones (ARM, Ltd., U.K.)
Pagepp. 707 - 712
Keywordmanycore architecture, modeling, trace-driven simulation, gem5 simulator, multi-threading
AbstractThe evolution of manycore systems, forecasted to feature hundreds of cores by the end of the decade, calls for efficient solutions for design space exploration and debugging. Among the relevant existing solutions, the well-known gem5 simulator provides a rich architecture description framework. However, these features come at the price of prohibitive simulation time that limits the scope of possible explorations to configurations made of tens of cores. To address this limitation, this paper proposes a novel trace-driven simulation approach for efficient exploration of manycore architectures.
Slides


Session 8B  Circuit-Level Modeling and Simulation
Time: 13:50 - 15:30 Thursday, January 22, 2015
Location: Room 104
Chairs: Luca Daniel (Massachusetts Institute of Technology, U.S.A.), Takashi Sato (Kyoto University)

8B-1 (Time: 13:50 - 14:15)
TitleCompact Modeling of Microbatteries Using Behavioral Linearization and Model-Order Reduction
AuthorMohammed Shemsu Nesro (Masdar Institute of Technology, United Arab Emirates), Lizhong Sun (Applied Materials, U.S.A.), *Ibrahim (Abe) M. Elfadel (Masdar Institute of Science and Technology, United Arab Emirates)
Pagepp. 713 - 718
Keywordcompact modeling, micro batteries, model-order reduction
AbstractThin-film, solid-state microbatteries now represent a viable alternative for powering small form-factor microsystems or storing the power harvested by energy microsensors. One obstacle to their widespread use in integrated systems has been the absence of a high-fidelity, physics-based, compact model describing their operation and enabling their design and verification in the same CAD environment as integrated systems or energy harvesters. In this work, we develop and validate such a model using a thorough analysis of the electrochemistry of a thin-film, solid-state lithium-ion microbattery. Our compact model is based on carefully validating and exploiting the electroneutrality assumption of the thin-film, solid-state electrolyte. Such an assumption enables the replacement of the nonlinear partial differential equations (PDEs) describing the microbattery electrochemistry with linear ones with virtually no loss in accuracy. We apply to the latter equations the well-established methodology of Arnoldi-based model order reduction (MOR) techniques to develop a compact microbattery model capable of reproducing its input-output electrical behavior with less than 1% error with respect to the full nonlinear PDEs and with at least a 30X speedup in transient simulation.

8B-2 (Time: 14:15 - 14:40)
TitleGPU-Accelerated Parallel Monte Carlo Analysis of Analog Circuits by Hierarchical Graph-Based Solver
AuthorYan Zhu, *Sheldon X.-D. Tan (University of California, Riverside, U.S.A.)
Pagepp. 719 - 724
Keywordhierarchical graph based solver, Monte Carlo Analysis, symbolic method, GPU
AbstractIn this article, we propose a new parallel matrix solver, which is very amenable to Graphics Processing Unit (GPU) based fine-grain massively-threaded parallel computing. The new method is based on the graph-based symbolic analysis technique to generate the computing sequence of determinants in terms of determinant decision diagrams (DDDs). A DDD exhibits very simple data dependence and data parallelism, which can be exploited much more easily by GPU massively-threaded parallel computing than by existing LU-based methods. The new method is based on the hierarchical determinant decision diagrams (HDDDs). Inspired by the inherent data parallelism and simple data dependence in the evaluation process of HDDD, we design GPU-amenable continuous data structures to enable fast memory access and evaluation of massive parallel threads. In addition to parallelism in the DDD graph, the new algorithm can naturally exploit the data independence existing in Monte Carlo and frequency domain analysis. The resulting algorithm is a general-purpose matrix solver suitable for fine-grain massive GPU-based computing for any circuit matrices. Experimental results show that the new evaluation algorithm can achieve about two orders of magnitude speedup over the serial CPU based evaluation and more than 4X speedup over a numerical SPICE-based simulation method on some large analog circuits.

8B-3 (Time: 14:40 - 15:05)
TitleAutomated Generation of Hybrid System Models for Reachability Analysis of Nonlinear Analog Circuits
Author*Hyun-Sek Lukas Lee (Institute of Microelectronic Systems, Leibniz Universität Hannover, Germany), Matthias Althoff (Institute of Robotics and Embedded Systems, Technische Universität München, Germany), Stefan Hoelldampf, Markus Olbrich, Erich Barke (Institute of Microelectronic Systems, Leibniz Universität Hannover, Germany)
Pagepp. 725 - 730
Keywordformal verification, reachability analysis, hybrid model, piecewise-linear, state-space explosion problem
AbstractWe address the problem of formally verifying analog circuits with an uncertain initial set by computing their reachable set. Our method is based on linearizations of the nonlinear circuit, which results in a piecewise-linear system. To limit the number of required locations, our approach computes locations on-the-fly. The method is fully automatic and only requires a circuit netlist. It provides a guaranteed bound on the number of linearization locations that have to be explicitly computed for such a circuit.

8B-4 (Time: 15:05 - 15:30)
TitleArea Efficient Device-Parameter Estimation Using Sensitivity-Configurable Ring Oscillator
Author*Shoichi Iizuka, Yuma Higuchi, Masanori Hashimoto, Takao Onoye (Osaka University, Japan)
Pagepp. 731 - 736
Keyworddevice-parameter estimation, ring oscillator, manufacturing variability, variation sensor
AbstractThis paper proposes an area efficient device-parameter estimation method with a sensitivity-configurable ring oscillator (RO). This sensitivity-configurable RO has a number of configurations and the proposed method exploits this property for reducing sensor area and/or improving estimation accuracy. Experimental results with a 32nm predictive technology model show that the proposed method can reduce the estimation error by 49% or reduce the sensor area by 75% while keeping the accuracy.
Slides


Session 8C  Reliable and Trustworthy Electronics
Time: 13:50 - 15:30 Thursday, January 22, 2015
Location: Room 105
Chairs: Takashi Aikyo (Semiconductor Technology Academic Research Center, Japan), Eishi Ibe (Hitachi)

8C-1 (Time: 13:50 - 14:15)
TitleOn Test Syndrome Merging for Reasoning-Based Board-Level Functional Fault Diagnosis
AuthorZelong Sun (The Chinese University of Hong Kong, Hong Kong), *Li Jiang (Shanghai Jiao Tong University, China), Qiang Xu (The Chinese University of Hong Kong, Hong Kong), Zhaobo Zhang, Zhiyuan Wang, Xinli Gu (Huawei Technologies, Inc., U.S.A.)
Pagepp. 737 - 742
KeywordBoard Test, Diagnosis
AbstractMachine learning algorithms are advocated for automated diagnosis of board-level functional failures due to the extreme complexity of the problem. Such reasoning-based solutions, however, remain ineffective at the early stage of the product cycle, simply because there are insufficient historical data for training the diagnostic system that has a large number of test syndromes. In this paper, we present a novel test syndrome merging methodology to tackle this problem. That is, by leveraging the domain knowledge of the diagnostic tests and the board structural information, we adaptively reduce the feature size of the diagnostic system by selectively merging test syndromes such that it can effectively utilize the available training cases. Experimental results demonstrate the effectiveness of the proposed solution.
Slides

8C-2 (Time: 14:15 - 14:40)
TitleEvent-Driven Transient Error Propagation: A Scalable and Accurate Soft Error Rate Estimation Approach
AuthorMojtaba Ebrahimi, Razi Seyyedi, Liang Chen, *Mehdi Tahoori (Karlsruhe Institute of Technology, Germany)
Pagepp. 743 - 748
KeywordSoft errors, Reliability, fault simulation
AbstractFast and accurate soft error vulnerability assessment is an integral part of cost-effective robust system design. The de facto approach is expensive fault simulation or emulation in which the error is injected in random bits and cycles, and then the effect is simulated for millions of cycles. In this paper, we propose a novel alternative approach to obtain the soft error vulnerability by integrating transient error propagation in an event-driven gate-level logic simulator which captures the combined effect of various masking factors. By carefully combining various generated errors at different cycles, in one pass all the error generation and propagation effects across all bits and all cycles are analyzed. This enables us to drastically reduce the runtime while maintaining the accuracy compared to statistical fault injection.

8C-3 (Time: 14:40 - 15:05)
TitleA Novel Methodology for Testing Hardware Security and Trust Exploiting On-Chip Power Noise Measurement
Author*Daisuke Fujimoto, Makoto Nagata (Kobe University, Japan), Shivam Bhasin, Jean-Luc Danger (Telecom Paristech, France)
Pagepp. 749 - 754
KeywordTest, Hardware Security, Trust, Side-channel measurement, Trojan Detection
AbstractFor security-critical applications, the security and trust of devices must be tested before shipping. In this paper, we promote the use of On-Chip Power noise Measurements (OCM), in order to test security using side-channel techniques. We then propose for the first time a standard side-channel measurement setup using OCM. Finally, we provide some key ideas on methodology to integrate the validation of hardware security and trust in the standard testing flow, exploiting OCM.

8C-4 (Time: 15:05 - 15:30)
TitleHardware Trojan Detection Using Exhaustive Testing of k-bit Subspaces
AuthorNicole Lesperance, Shrikant Kulkarni, *Kwang-Ting Cheng (UC Santa Barbara, U.S.A.)
Pagepp. 755 - 760
KeywordHardware Trojan, Hardware Security, Cryptographic Hardware, Pseudoexhaustive Testing
AbstractPost-silicon hardware Trojan detection is challenging because the attacker only needs to implement one of many possible design modifications, while the verification effort must guarantee the absence of all imaginable malicious circuitry. Existing test generation strategies for Trojan detection use controllability and observability metrics to limit the modifications targeted. However, for cryptographic hardware, the n plaintext bits are ideal for an attacker to use in Trojan triggering because the size of n prohibits exhaustive testing, and all n bits have identical controllability, making it impossible to bias testing using existing methods. Our detection method addresses this difficult case by observing that an attacker can realistically only afford to use a small subset, k, of all n possible signals for triggering. By aiming to exhaustively cover all possible k subsets of signals, we guarantee detection of Trojans using fewer than k plaintext bits in the trigger. We provide suggestions on how to determine k, and validate our approach using an AES design.
Slides
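
The coverage goal stated in 8C-4 can be made concrete with a small checker: a test set exhaustively covers all k-bit subspaces when, for every choice of k of the n input bits, the tests exhibit all 2^k value combinations on those bits. The sketch below only verifies that property for a given test set; it does not generate tests and is not the authors' method. The function name `covers_all_k_subspaces` and the encoding of tests as integers are assumptions made for this example.

```python
from itertools import combinations


def covers_all_k_subspaces(tests, n, k):
    """Return True if, for every choice of k of the n bit positions,
    the tests show all 2**k value patterns on those positions.

    `tests` is an iterable of integers; bit p of a test is input bit p.
    """
    tests = list(tests)
    for positions in combinations(range(n), k):
        seen = {tuple((t >> p) & 1 for p in positions) for t in tests}
        if len(seen) < 2 ** k:
            return False  # some k-bit trigger pattern is never applied
    return True
```

For n = 3 and k = 2, the four even-parity vectors 000, 011, 101 and 110 already cover every pair of bits, compared with 8 vectors for fully exhaustive testing. The checker itself enumerates all C(n, k) subsets, so it is only practical for small k.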


Session 9S  (Designers' Forum) Panel Discussion: IP Base SoC Design and IP Design Innovation
Time: 15:50 - 17:30 Thursday, January 22, 2015
Location: Room 103
Organizer: Nobuyuki Nishiguchi (Cadence Design Systems, Japan), Moderator: Toshihiro Hattori (Renesas System Design Co., Ltd., Japan)

9S-1 (Time: 15:50 - 17:30)
Title(Panel Discussion) IP Base SoC Design and IP Design Innovation
AuthorPanelists: Hironori Ando (Synopsys, Japan), Kevin Yee (Cadence, U.S.A.), Randy Smith (Sonics, U.S.A.), Neil Parris (ARM, U.K.)
AbstractRecent SoCs use many IPs. This session discusses what innovations will occur in next-generation SoC design with IPs and in IP design itself. Four major IP vendors are invited to share their views on future SoC design innovation with their IPs, covering the numbers and types of IPs (digital, analog, RF, and even MEMS), their variety (CPU, GPU, memory, bus, interface, and so on), usage models in the design hierarchy, and the modeling and integration methods for those IPs. To achieve SoC design innovation, they will also address design methodology for the IP itself, including planning, specification, implementation, verification, validation, and qualification. Comments, questions, and discussion with the audience at the panel are welcome.


Session 9A  Power/Thermal Management and Modeling
Time: 15:50 - 17:30 Thursday, January 22, 2015
Location: Room 102
Chairs: Donghwa Shin (Yeungnam University, Republic of Korea), Takashi Nakada (University of Tokyo, Japan)

9A-1 (Time: 15:50 - 16:15)
TitleAROMA: A Highly Accurate Microcomponent-Based Approach for Embedded Processor Power Analysis
AuthorZih-Ci Huang, *Chi-Kang Chen, Ren-Song Tsay (National Tsing Hua University, Taiwan)
Pagepp. 761 - 766
KeywordEmbedded System, Power Optimization, Power Analysis, Power Profiling, Peak Power Analysis
AbstractWe propose a new embedded processor power analysis approach that maps instruction executions to microarchitecture components for highly efficient and accurate power evaluations, which are crucial for embedded system designs. We observe that in practice the execution of each high-level instruction in a processor always triggers the same microcomponent activity sequence, while the difference in power consumption values of different instructions is mainly due to timing variations caused by hazards and cache misses. Hence, by incorporating accurately pre-characterized microcomponent power consumption values into an efficient instruction-microcomponent processor timing simulation tool, we construct a highly accurate embedded processor power analysis tool. Additionally, based on the proposed approach we accurately and effortlessly capture the power waveform at any time point for power profiling, peak power and dynamic thermal distribution analysis. The experimental results show that the proposed approach is nearly as accurate as gate-level simulators, with an error rate of less than 1.2% while achieving simulation speeds of up to 20 MIPS, five orders of magnitude faster than a commercial gate-level simulator.
Slides

9A-2 (Time: 16:15 - 16:40)
TitleBattery-Aware Mapping Optimization of Loop Nests for CGRAs
Author*Yu Peng, Shouyi Yin, Leibo Liu, Shaojun Wei (Institute of Microelectronics, Tsinghua University, China)
Pagepp. 767 - 772
Keywordreconfigurable computing, loop nests, energy consumption, polyhedral model
AbstractCoarse-grained Reconfigurable Architecture (CGRA) is a promising mobile computing platform that provides both high performance and high energy efficiency. Since loop nests are usually mapped onto CGRA for acceleration, optimizing the mapping is an important goal in the design of CGRAs. Moreover, how to reduce energy consumption has also become one of the primary concerns in using CGRAs. This paper makes three contributions: a) Proposing an energy consumption model for CGRA; b) Formulating the loop nest mapping problem to minimize the battery charge loss; c) Extracting an efficient heuristic algorithm called BPMap. Experimental results show that our methods improve the performance of the kernels and lower the energy consumption.
Slides

9A-3 (Time: 16:40 - 17:05)
TitleTHOR: Orchestrated Thermal Management of Cores and Networks in 3D Many-Core Architectures
Author*Jinho Lee, Junwhan Ahn, Kiyoung Choi (Seoul National University, Republic of Korea), Kyungsu Kang (Samsung Electronics, Republic of Korea)
Pagepp. 773 - 778
KeywordDynamic thermal management, 3D stacking, network-on-chip, many-core
AbstractMost previous research on thermal management of many-core architectures focuses on the control of either core resources or network resources only, even though both have significant thermal impacts. This paper proposes a holistic thermal management that applies dynamic voltage/frequency scaling to cores and routers together to maximize system performance under a temperature constraint. The proposed method first determines a power budget given in aggregate weighted power for every pillar of vertically adjacent tiles. Then it performs voltage/frequency assignment under the budget while exploiting the characteristics of the applications. Experiments show that our approach outperforms existing methods.
Slides

9A-4 (Time: 17:05 - 17:30)
TitleEarly Stage Real-Time SoC Power Estimation Using RTL Instrumentation
AuthorJianlei Yang (Tsinghua University/Intel Corporation, China), *Liwei Ma, Kang Zhao (Intel Corporation, China), Yici Cai (Tsinghua University, China), Tin-Fook Ngai (Intel Corporation, China)
Pagepp. 779 - 784
KeywordReal-Time, Power Estimation, RTL Instrumentation, Singular Value Decomposition (SVD)
AbstractEarly stage power estimation is critical for SoC architecture exploration and validation in modern VLSI design, but real-time, long time interval and accurate estimation is still challenging for system-level estimation and software/hardware tuning. This work proposes a model abstraction approach for real-time power estimation in the manner of machine learning. The singular value decomposition (SVD) technique is exploited to abstract the principal components of the relationship between the register toggling profile and the accurate power waveform. The abstracted power model is automatically instrumented into the RTL implementation and synthesized onto an FPGA platform for real-time power estimation by instrumenting the register toggling profile. The prototype implementation on three IP cores predicts the cycle-by-cycle power dissipation within 5% accuracy loss compared with a commercial power estimation tool.
Slides


Session 9B  (Special Session) System-Level Designs and Tools for Multicore Systems
Time: 15:50 - 17:30 Thursday, January 22, 2015
Location: Room 104
Chair: Chung-Ta King (National Tsing Hua University, Taiwan)

9B-1 (Time: 15:50 - 16:15)
Title(Invited Paper) Heterogeneous Architecture Design with Emerging 3D and Non-Volatile Memory Technologies
AuthorQiaosha Zou, Matthew Poremba (The Pennsylvania State University, U.S.A.), Rui He, Wei Yang, Junfeng Zhao (Huawei Shannon Lab, China), *Yuan Xie (University of California at Santa Barbara, U.S.A.)
Pagepp. 785 - 790
AbstractIn this paper, different perspectives of heterogeneous architecture options will be presented. With 3D die stacking, disparate and heterogeneous technologies can be integrated on the same chip, such as CMOS logic and emerging non-volatile memory, enabling a new paradigm of architecture design. Furthermore, DRAM/NVM heterogeneous memory architecture also combines the benefits from different technologies and sheds light on future hybrid memory architectures. Design tradeoffs will be discussed with preliminary results presented.

9B-2 (Time: 16:15 - 16:40)
Title(Invited Paper) Alleviate Chip I/O Pin Constraints for Multicore Processors through Optical Interconnects
Author*Zhehui Wang, Jiang Xu, Peng Yang, Xuan Wang, Zhe Wang, Luan H.K. Duong, Zhifei Wang, Haoran Li, Rafael K.V. Maeda, Xiaowen Wu (Hong Kong University of Science and Technology, Hong Kong), Yaoyao Ye, Qinfen Hao (Huawei Technologies, China)
Pagepp. 791 - 796
Keywordinterconnect, modeling, performance
AbstractChip I/O pins are an increasingly limited resource and significantly affect the performance, power and cost of multicore processors. Optical interconnects promise low power and high bandwidth, and are potential alternatives to electrical interconnects. This work systematically developed a set of analytical models for electrical and optical interconnects to study their structures, receiver sensitivities, crosstalk noises, and attenuations. We verified the models by published implementation results. The analytical models quantitatively identified the advantages of optical interconnects in terms of bandwidth, energy consumption, and transmission distance. We showed that optical interconnects can significantly reduce chip pin counts. For example, compared to electrical interconnects, optical interconnects can save at least 92% signal pins when connecting chips more than 25 cm (10 inches) apart.
Slides

9B-3 (Time: 16:40 - 17:05)
Title(Invited Paper) A Fast and Accurate Network-on-Chip Timing Simulator with a Flit Propagation Model
AuthorTing-Shuo Hsu, Jun-Lin Chiu, Chao-Kai Yu, *Jing-Jia Liou (National Tsing Hua University, Taiwan)
Pagepp. 797 - 802
Keywordnetwork-on-chip, network-on-chip simulator, wormhole switching, router microarchitecture
AbstractNetwork-on-chip (NoC) can be a simulation bottleneck in a many-core system. Traditional cycle-accurate NoC simulators need a long simulation time, as they synchronize all components (routers and FIFOs) every cycle to guarantee the exact behaviors. Also, a NoC simulation does not benefit from transaction-level modeling (TLM) in speed without any accuracy loss, because the transaction timings of a simulated packet depend on other packets due to wormhole switching. In this paper, we propose a novel NoC simulation method which can calculate cycle-accurate timings with wormhole switching. Instead of updating states of routers and FIFOs cycle-by-cycle, we use a pre-built model to calculate a flit's exact times at ports of routers in a NoC. The results of the proposed simulator are verified with NoC implementations (cycle-accurate at RTL) created by a commercial NoC compiler. All timing results match perfectly with packet waveforms generated by the above NoCs (with a 40-325 times speedup). As another comparison, the speed of the simulator is similar to or faster than (0.5-23X) that of a TG2 NoC model, which is a SystemC and transaction-level model without timing accuracy (due to ignoring wormhole traffic).
Slides

9B-4 (Time: 17:05 - 17:30)
Title(Invited Paper) Application-Level Embedded Communication Tracer for Many-Core Systems
Author*Chih-Tsun Huang, Kuan-Chun Tasi, Jun-Shen Lin, Hsiao-Wei Chien (National Tsing Hua University, Taiwan)
Pagepp. 803 - 808
KeywordMany-Core Systems, Embedded Tracer, Application Level, Debugging
AbstractDesign verification and debugging with both software and hardware is increasingly challenging for many-core systems. We present the embedded tracer architecture for application-level communication. Not only can the trace information be optimized, but also the verification can be performed at the system level efficiently. The unified architecture consolidates the debugging flow at different abstraction levels, and facilitates the performance analysis of the entire system as well. The use-case study and experiments have justified the effectiveness of the proposed tracer architecture.


Session 9C  Building Secure Systems
Time: 15:50 - 17:30 Thursday, January 22, 2015
Location: Room 105
Chairs: Wenjing Rao (University of Illinois, Chicago, U.S.A.), Sandip Ray (Intel Corporation, Portland, U.S.A.)

9C-1 (Time: 15:50 - 16:15)
TitleTiming-Based Anomaly Detection in Embedded Systems
AuthorSixing Lu, Minjun Seo, *Roman Lysecky (University of Arizona, U.S.A.)
Pagepp. 809 - 814
KeywordNon-intrusive monitoring, anomaly detection, mimicry attack
AbstractRecent research has demonstrated that many systems are vulnerable to numerous types of malicious activity. As the pervasiveness of embedded systems with network connectivity continues to increase, embedded systems security has become a critical challenge. However, most existing techniques for detecting malware utilize software-based methods that incur significant performance overheads that are often not feasible in embedded systems. In this paper, we present an overview of a novel method for non-intrusively detecting malware in embedded systems. The proposed technique utilizes timing requirements to improve detection performance and provide increased resilience to mimicry attacks.
Slides

9C-2 (Time: 16:15 - 16:40)
TitleSatisfiability Don't Care Condition Based Circuit Fingerprinting Techniques
Author*Carson J Dunbar, Gang Qu (University of Maryland, U.S.A.)
Pagepp. 815 - 820
KeywordSDC, fingerprint, IP, SoC
AbstractCircuit fingerprints allow the authors of design intellectual properties (IPs) to trace each copy of their IPs by embedding features, known as digital fingerprints, which are unique to each device. In this paper, we propose a novel gate replacement approach to encode fingerprints based on the inherent Satisfiability Don’t Care (SDC) conditions in the circuit. Moreover, existing fingerprinting schemes all require redesign of the circuit, which makes them prohibitively expensive for manufacturing. We develop a practical method to implement our SDC-based circuit fingerprint. First, we introduce flexibilities during the logic synthesis phase by replacing certain library cells with versatile multiplexers (MUXs). The MUX can be configured either as the original gate or one of its replacements with identical functionality except the SDC conditions. Then at the post-silicon stage, we configure these MUXs to create distinct fingerprints. We consider standard benchmark circuits and demonstrate that even on these circuits with limited size, we can find sufficient locations to embed fingerprints. Simulation with TSMC 0.35μm technology shows non-trivial design overhead; however, such overhead will become negligible for large real-life circuits.
Slides
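
The gate replacement idea in the abstract above can be illustrated with a toy circuit. The sketch below is not taken from the paper: the circuit, the function names, and the choice of an OR/XOR pair are assumptions made for illustration. The two internal signals a and b can never both be 1, so the input pattern (1, 1) is a satisfiability don't care for the final gate, and OR and XOR (which differ only on that pattern) are interchangeable there. A fingerprint bit selects which gate is used, much as a configurable MUX might.

```python
from itertools import product


def or_gate(a, b):
    return a | b


def xor_gate(a, b):
    return a ^ b


def circuit(x, y, fingerprint_bit):
    """Toy circuit whose last gate is chosen by a fingerprint bit."""
    a = x & y          # a = x AND y
    b = x & (1 - y)    # b = x AND (NOT y)
    gate = xor_gate if fingerprint_bit else or_gate
    return gate(a, b)


inputs = list(product((0, 1), repeat=2))

# Input patterns that can actually reach the final gate.
reachable = {(x & y, x & (1 - y)) for x, y in inputs}
# Patterns that can never occur are the SDC conditions of that gate.
sdc = set(inputs) - reachable

# Both fingerprint settings yield the same primary output for every input.
equivalent = all(circuit(x, y, 0) == circuit(x, y, 1) for x, y in inputs)
```

Each such location contributes one bit of fingerprint, so n independent locations would allow up to 2^n distinct but functionally identical copies in this simplified picture.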

9C-3 (Time: 16:40 - 17:05)
TitleIC Piracy Prevention via Design Withholding and Entanglement
AuthorSoroush Khaleghi, Kai Da Zhao, *Wenjing Rao (University of Illinois at Chicago, U.S.A.)
Pagepp. 821 - 826
KeywordHardware Security, IC Piracy, Reverse Engineering, Design Withholding
AbstractGlobalization of the semiconductor industry has raised serious concerns about trustworthy hardware. Particularly, an untrusted manufacturer can steal the information of a design (Reverse Engineering), and/or produce extra chips illegally (IC Piracy). Among many candidates that address these attacks, Design Withholding techniques work by replacing a part of the design with a reconfigurable block on chip, so that none of the manufactured chips will function properly until they are activated in a trusted facility, where the withheld function is restored back into the reconfigurable block on chip. However, most existing approaches are ad-hoc based, and are facing two major challenges: 1) susceptibility to a category of algorithmic attacks, from attackers in a strong position, such as a manufacturer; and 2) scaling up the defense level is checkmated by the explosion of hardware cost that has to be paid at the designer’s side. In this paper, we propose a novel protection scheme, called Entanglement, which can substantially strengthen the Design Withholding framework: 1) the algorithmic attacks are prevented by forcing the attacker to solve a huge number of problems of high computational complexity; 2) the attack cost (in terms of computational complexity) is quantitatively controllable at the designer’s end, with low hardware overhead: while the cost of attack can be increased exponentially, the hardware overhead imposed on the designer’s side grows only linearly. The proposed work distinguishes itself from the previous works by not relying on the difficulty of finding the solution for some NP-Complete/NP-Hard problems, but rather, on the exponentially boosted number of such problems that an attacker has to solve, while carefully maintaining the growth of the hardware overhead to be scalable via Entanglement.
Slides

9C-4 (Time: 17:05 - 17:30)
TitleVulnerability Analysis for Crypto Devices against Probing Attack
Author*Lingxiao Wei, Jie Zhang, Feng Yuan, Yannan Liu (The Chinese University of Hong Kong, Hong Kong), Junfeng Fan (Open Security Research, China), Qiang Xu (The Chinese University of Hong Kong, Hong Kong)
Pagepp. 827 - 832
KeywordProbing Attack, Vulnerability Analysis, Crypto Devices
AbstractProbing attack is a severe threat to the security of hardware cryptographic modules (HCMs). In this paper, we take the first step to evaluate the vulnerability of HCMs against probing attack, wherein we investigate the probing complexity and the key candidate reduction capability for probing attack on every signal in the circuit. We also present approximate solutions for the calculation of the proposed metrics to reduce computational complexity. Experimental results demonstrate that the proposed evaluation metric is both effective and efficient.