ASP-DAC 2016 Technical Program

The 21st Asia and South Pacific Design Automation Conference

Session 1A The Optimization of Memory Architecture and Management
Time: 10:20 - 12:00 Tuesday, January 26, 2016
Location: TF4203
Chairs: Yun Liang (Peking University, China), Swathi Gurumani (Advanced Digital Sciences Center, Singapore (UIUC-ASTAR center), Singapore)

1A-1 (Time: 10:20 - 10:45)

Title	Performance-centric Register File Design for GPUs using Racetrack Memory
Author	*Shuo Wang, Yun Liang, Chao Zhang, Xiaolong Xie, Guangyu Sun (Peking University, China), Yongpan Liu, Yu Wang (Tsinghua University, China), Xiuhong Li (Peking University, China)
Page	pp. 25 - 30
Keyword	GPU, Performance, Register File, Racetrack Memory, Compiler
Abstract	In this paper, we explore racetrack memory for designing a high-performance register file for GPU architectures. The high storage density of racetrack memory helps to improve thread-level parallelism, but its lengthy shift operations may significantly degrade performance. To mitigate the shift overhead, we develop a compile-time managed register mapping algorithm that optimizes the mapping of registers to physical addresses in the register file. Experimental results demonstrate that our technique achieves up to 24% (19% on average) improvement in performance for a variety of GPU applications.
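
To illustrate why register placement matters on a shift-based memory, the following is a minimal sketch under a simplified cost model that is an assumption of this example, not the authors' algorithm: a single track with one access port, where moving from one register to another costs as many shift steps as the distance between their positions. The function name `shift_cost` and the register names are hypothetical.

```python
def shift_cost(accesses, position):
    """Total shift steps on a single-port track (assumed cost model).

    accesses: register names in the order they are accessed.
    position: maps each register name to its slot on the track.
    The port is assumed to start aligned with the first access.
    """
    cost = 0
    for prev, cur in zip(accesses, accesses[1:]):
        # Each access shifts the track by the distance between slots.
        cost += abs(position[cur] - position[prev])
    return cost


# Hypothetical access trace: r0 and r3 are used together frequently.
accesses = ["r0", "r3", "r0", "r3", "r1", "r2"]

# Naive mapping places registers in name order.
naive = {"r0": 0, "r1": 1, "r2": 2, "r3": 3}
# Alternative mapping places r0 and r3 next to each other.
tuned = {"r0": 0, "r3": 1, "r1": 2, "r2": 3}

print(shift_cost(accesses, naive))  # 12
print(shift_cost(accesses, tuned))  # 5
```

Under this toy model, placing frequently co-accessed registers in adjacent slots reduces the number of shifts, which is the kind of effect a compile-time mapping could exploit.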

1A-2 (Time: 10:45 - 11:10)

Title	Improving Read Performance of STT-MRAM based Main Memories through Smash Read and Flexible Read
Author	Lei Jiang (Advanced Micro Devices, U.S.A.), Wujie Wen (Florida International University, U.S.A.), *Danghui Wang (Northwestern Polytechnical University, China), Lide Duan (University of Texas at San Antonio, U.S.A.)
Page	pp. 31 - 36
Keyword	STT-MRAM, read disturbance, main memory, read scheme, LPDDR3
Abstract	Spin Transfer Torque Magnetoresistive RAM (STT-MRAM) has recently been deemed a promising main memory alternative for high-end mobile processors. With process technology scaling, the amplitude of the write current approaches that of the read current in deep sub-micrometer STT-MRAM arrays. As a result, read disturbance errors (RDEs) emerge. Both high current restore required (HCRR) reads and low current long latency (LCLL) reads can guarantee read reliability and completely remove RDEs. However, both degrade system performance because of extra restores or a longer read latency, and neither consistently achieves the better performance across a wide variety of applications. In this paper, we present two architectural techniques to boost read performance for STT-MRAM based main memories in the presence of RDEs. We first propose Smash Read (S-RD) to shorten the latency of HCRR reads by injecting a larger read current. We further introduce Flexible Read (F-RD) to dynamically adopt different types of read schemes, S-RD and LCLL, to maximize main memory system performance. On average, our techniques improve system performance by 9~13% and reduce total energy by 4~8% over all existing read schemes including HCRR and LCLL.

1A-3 (Time: 11:10 - 11:35)

Title	STLAC: A Spatial and Temporal Locality-Aware Cache and Network-on-Chip Codesign for Tiled Many-core Systems
Author	*Mingyu Wang (Institute of Microelectronics, Tsinghua University, China), Zhaolin Li (Research Institute of Information Technology, Tsinghua University, China)
Page	pp. 37 - 42
Keyword	Many-core, Adaptive Cache, Network-on-chip
Abstract	The spatial and temporal locality of workloads is the fundamental reason cache designs can overcome the memory wall problem. However, few existing state-of-the-art designs exploit both locality features to optimize the memory hierarchy of tiled many-core systems, which loses opportunities for further performance improvement. To address this problem, an adaptive spatial and temporal locality-aware cache and network-on-chip (NoC) codesign (STLAC) is proposed, which dynamically partitions the last level cache (LLC) as data prefetch buffer or victim cache for locality prediction and exploits a hybrid burst-support NoC for fast data prefetch. The data prefetch buffer speculates the data blocks in subsequent addresses to exploit the spatial locality, while the victim cache collects the evicted data blocks from the upper memory hierarchy to exploit the temporal locality. By combining the proposed adaptive cache partition with the hybrid burst-support NoC, the off-chip misses and on-chip network usage are greatly reduced. Experimental results demonstrate that the proposed STLAC reduces off-chip misses by up to 43% and improves performance by 15% on average compared with the traditional shared LLC design.

1A-4 (Time: 11:35 - 12:00)

Title	A Lightweight OpenMP4 Run-time for Embedded Systems
Author	Roberto E. Vargas, Sara Royuela, *Maria A. Serrano, Xavi Martorell, Eduardo Quiñones (Barcelona Supercomputing Center, Spain)
Page	pp. 43 - 49
Keyword	OpenMP4, Parallel programming Models, Many-core embedded processors, Compiler Analysis, Task Dependency Graph
Abstract	OpenMP is increasingly being adopted by current many-core embedded processors to exploit their parallel computation capabilities. Unfortunately, current run-time implementations of the latest specification (v4.0) are not suitable for processors relying on small and fast on-chip memories, due to their memory consumption. This paper proposes an OpenMP4 run-time that reduces the memory consumption while providing the same performance. Our run-time relies on a new compiler pass capable of generating the task dependency graph of OpenMP programs, which is then efficiently stored in memory.
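
As a rough illustration of what a task dependency graph contains, the sketch below derives one from per-task input and output variable lists, in the spirit of OpenMP 4.0 `depend(in: ...)` and `depend(out: ...)` clauses. It is a generic construction and an assumption of this example, not the compiler pass or storage format described in the paper; the function name `build_tdg` and the task list are hypothetical.

```python
def build_tdg(tasks):
    """Build dependency edges from tasks listed in creation order.

    tasks: list of (ins, outs) pairs, each a list of variable names,
           standing in for depend(in: ...) and depend(out: ...) lists.
    Returns a sorted list of (predecessor, successor) task indices.
    """
    last_writer = {}  # variable -> index of the last task that wrote it
    readers = {}      # variable -> tasks that read it since the last write
    edges = set()
    for tid, (ins, outs) in enumerate(tasks):
        for var in ins:
            # A reader must wait for the previous writer of the variable.
            if var in last_writer:
                edges.add((last_writer[var], tid))
            readers.setdefault(var, []).append(tid)
        for var in outs:
            # A writer must wait for the previous writer ...
            if var in last_writer:
                edges.add((last_writer[var], tid))
            # ... and for every task that read the previous value.
            for reader in readers.get(var, []):
                if reader != tid:
                    edges.add((reader, tid))
            last_writer[var] = tid
            readers[var] = []
    return sorted(edges)


# Hypothetical program: T0 produces a, T1 and T2 consume a,
# T3 consumes their results and overwrites a.
tasks = [
    ([], ["a"]),
    (["a"], ["b"]),
    (["a"], ["c"]),
    (["b", "c"], ["a"]),
]
print(build_tdg(tasks))
# [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
```

If such a graph is computed ahead of time, a run-time only needs to store the edge list rather than track dependences dynamically, which hints at where memory savings could come from.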