
The 20th Asia and South Pacific Design Automation Conference

Session 5A  Optimization and Exploration for Caches
Time: 13:50 - 15:30 Wednesday, January 21, 2015
Location: Room 102
Chairs: Hiroyuki Tomiyama (Ritsumeikan University, Japan), Lin Meng (Ritsumeikan University, Japan)

5A-1 (Time: 13:50 - 14:15)
Title: Multilane Racetrack Caches: Improving Efficiency Through Compression and Independent Shifting
Author: *Haifeng Xu (University of Pittsburgh, U.S.A.), Yong Li (VMware, Inc., U.S.A.), Rami Melhem, Alex K. Jones (University of Pittsburgh, U.S.A.)
Page: pp. 417 - 422
Keyword: Racetrack memory, Cache, Compression
Abstract: Racetrack memory (RM), a spintronic domain-wall non-volatile memory, has recently received attention as a high-capacity replacement for various structures in the memory system, from secondary storage through caches. The main advantage of RM is its improved density, and, like other non-volatile memory structures, its static power is dramatically lower than that of conventional CMOS memories. However, a major challenge in employing RM in universal memory components is the added access latency and dynamic energy consumption caused by the shifts required to align the data of interest with an access port. We propose multilane Racetrack caches (MRC), an RM last-level cache design utilizing lightweight compression combined with independent shifting. MRC allows cache lines mapped to the same Racetrack structure to be accessed in parallel when compressed, mitigating potential shifting stalls in the RM cache. Our results demonstrate that, unlike previously proposed RM caches, an isocapacity MRC replacement can outperform SRAM caches while providing an energy improvement over STT-MRAM caches. In particular, MRC improves performance by 5% and reduces energy by 19% compared to an isocapacity baseline RM cache, resulting in an energy-delay product improvement of 25%.
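The shifting overhead that MRC targets can be illustrated with a small cost model. The following sketch is not from the paper; the port positions and the distance-based cost are illustrative assumptions. It counts the shift operations needed to align a bit in a racetrack with its nearest access port:

```python
def shifts_needed(bit_pos: int, port_positions: list[int], offset: int = 0) -> int:
    """Shift operations to align the domain holding `bit_pos` with the
    nearest access port, given the track's current shift offset.
    Illustrative model only: a real racetrack controller tracks a per-lane
    offset, which is what independent shifting per lane exploits."""
    return min(abs((bit_pos + offset) - p) for p in port_positions)

# A track with access ports at domains 0, 8 and 16:
# reading bit 10 costs two shifts (to the port at domain 8).
print(shifts_needed(10, [0, 8, 16]))
```

Serializing such shifts is exactly the stall the abstract describes; accessing compressed lines on separate lanes in parallel hides part of this cost.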

5A-2 (Time: 14:15 - 14:40)
Title: Managing Hybrid On-Chip Scratchpad and Cache Memories for Multi-Tasking Embedded Systems
Author: Zimeng Zhou, *Lei Ju, Zhiping Jia, Xin Li (Shandong University, China)
Page: pp. 423 - 428
Keyword: scratchpad memory, cache, multi-tasking, performance, energy efficiency
Abstract: On-chip memory management is essential in the design of high-performance, energy-efficient embedded systems. While many off-the-shelf embedded processors employ a hybrid on-chip SRAM architecture including both scratchpad memories (SPMs) and caches, much existing work on SPM management ignores the synergy between caches and SPMs. In this work, we propose a static SPM allocation strategy for the hybrid on-chip memory architecture in a multi-tasking environment, which minimizes the overall access latency and energy consumption of the instruction memory subsystem. We capture cache conflict misses via a fine-grained temporal cache behavior model. An integer linear programming (ILP) based formulation is proposed to generate a function-level SPM allocation scheme, where both intra- and inter-task cache interference as well as access frequency are captured for an optimal memory subsystem design. Compared with the state-of-the-art static SPM allocation strategy in a multi-tasking environment, experimental results show that our SPM management scheme achieves a 30.51% further improvement in instruction memory subsystem performance and up to 34.92% in energy savings.
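At its simplest, static SPM allocation resembles a 0/1 knapsack: choose the set of functions whose combined size fits the SPM and whose combined benefit is maximal. The sketch below uses that simplification with made-up function names and benefit values; the paper's ILP additionally folds intra- and inter-task cache interference into the objective, which a plain knapsack cannot express.

```python
from itertools import combinations

def best_spm_allocation(funcs, capacity):
    """funcs: list of (name, size, benefit) tuples, where `benefit` stands in
    for the latency/energy saved by placing the function in SPM.
    Exhaustive search; an ILP solver replaces this at realistic scales."""
    best_gain, best_set = 0, ()
    for r in range(len(funcs) + 1):
        for combo in combinations(funcs, r):
            if sum(size for _, size, _ in combo) <= capacity:
                gain = sum(b for _, _, b in combo)
                if gain > best_gain:
                    best_gain, best_set = gain, tuple(n for n, _, _ in combo)
    return best_set, best_gain

# Two smaller functions together beat the single most beneficial one.
funcs = [("fft", 4, 10), ("fir", 3, 7), ("crc", 2, 5)]
print(best_spm_allocation(funcs, capacity=5))  # (('fir', 'crc'), 12)
```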

5A-3 (Time: 14:40 - 15:05)
Title: Optimizing Thread-to-Core Mapping on Manycore Platforms with Distributed Tag Directories
Author: *Guantao Liu, Tim Schmidt, Rainer Doemer (University of California, Irvine, U.S.A.), Ajit Dingankar, Desmond Kirkpatrick (Intel Corporation, U.S.A.)
Page: pp. 429 - 434
Keyword: thread-to-core mapping, manycore platforms, on-chip communication
Abstract: With the increasing demand for parallel computing power, manycore platforms are attracting more and more attention due to their potential to improve performance and scalability of parallel applications. However, as the core count increases, core-to-core communication becomes expensive. For manycore architectures using directory-based cache coherence protocols, the core-to-core communication latency depends not only on the physical placement on the chip, but also on the location of the distributed cache tag directory. In this paper, we first define the concept of core distance for multicore and manycore architectures. Using a ping-pong spin-lock benchmark, we quantify the core distance on a ring-network platform and propose an approach to optimize thread-to-core mapping in order to minimize on-chip communication overhead. In our experiments, our approach speeds up communication-intensive benchmarks by more than 25% on average over the Linux default mapping strategy.
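On a ring network, one simple proxy for core distance is hop count, and for small thread counts a mapping minimizing weighted communication distance can be found exhaustively. The sketch below uses that proxy as an illustrative assumption: the paper measures distances empirically with a ping-pong spin-lock benchmark, and real latencies also depend on where the tag directory entry resides, not just the two cores.

```python
from itertools import permutations

def ring_distance(a: int, b: int, n_cores: int) -> int:
    """Hop count between cores a and b on an n_cores-node ring."""
    d = abs(a - b)
    return min(d, n_cores - d)

def best_mapping(comm, n_threads, n_cores):
    """comm: {(thread_i, thread_j): message_weight}. Exhaustively finds a
    thread-to-core assignment minimizing total weight * ring distance."""
    best_cost, best = float("inf"), None
    for perm in permutations(range(n_cores), n_threads):
        cost = sum(w * ring_distance(perm[i], perm[j], n_cores)
                   for (i, j), w in comm.items())
        if cost < best_cost:
            best_cost, best = cost, perm
    return best, best_cost

# Two heavily communicating threads end up on adjacent ring nodes.
print(best_mapping({(0, 1): 10}, n_threads=2, n_cores=8))
```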

5A-4 (Time: 15:05 - 15:30)
Title: Accelerating Non-Volatile/Hybrid Processor Cache Design Space Exploration for Application Specific Embedded Systems
Author: *Mohammad Shihabul Haque, Ang Li, Akash Kumar (National University of Singapore, Singapore), Qingsong Wei (Data Storage Institute, Singapore)
Page: pp. 435 - 440
Keyword: Cache, Embedded, NVM, Hybrid, Modeling
Abstract: In this article, we propose a technique to accelerate design space exploration of non-volatile and hybrid (volatile/non-volatile) processor caches for application-specific embedded systems. Utilizing a novel cache behavior modeling equation and a new, accurate cache miss prediction mechanism, our proposed technique can accelerate NVM/hybrid FIFO processor cache design space exploration for SPEC CPU 2000 applications by up to 249 times compared to the conventional approach.
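The conventional approach being accelerated here is exhaustive simulation: replay the application's address trace through every candidate cache configuration and count misses. A minimal sketch of that baseline for a set-associative FIFO cache follows; the parameter names and block-size default are illustrative, not from the paper.

```python
from collections import deque

def fifo_cache_misses(trace, n_sets, assoc, block_bits=6):
    """Count misses for a set-associative cache with FIFO replacement.
    trace: iterable of byte addresses; blocks are 2**block_bits bytes."""
    sets = [deque() for _ in range(n_sets)]
    misses = 0
    for addr in trace:
        block = addr >> block_bits
        lines = sets[block % n_sets]
        if block in lines:          # hit: FIFO does not reorder on access
            continue
        misses += 1
        if len(lines) >= assoc:     # evict the oldest resident block
            lines.popleft()
        lines.append(block)
    return misses

# Addresses 0 and 4 share a 64-byte block, so only two accesses miss.
# A design space sweep repeats this simulation for every (n_sets, assoc)
# point, which is the cost the proposed miss predictor avoids.
print(fifo_cache_misses([0, 4, 64, 0], n_sets=2, assoc=1))  # 2
```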