
The 18th Asia and South Pacific Design Automation Conference

Session 4B: Memory Hierarchy Optimization
Time: 10:20 - 12:20 Thursday, January 24, 2013
Chair: Jason Xue (City University of Hong Kong, Hong Kong)

4B-1 (Time: 10:20 - 10:50)
Title: TRISHUL: A Single-pass Optimal Two-level Inclusive Data Cache Hierarchy Selection Process for Real-time MPSoCs
Author: *Mohammad Shihabul Haque, Akash Kumar, Yajun Ha, Qiang Wu, Shaobo Luo (National University of Singapore, Singapore)
Page: pp. 320 - 325
Keyword: data cache hierarchy configuration, real-time software, Single-pass, Simulation
Abstract: Existing approaches analyze the execution time of a real-time application on all possible cache hierarchy setups to find the application-specific optimal two-level inclusive data cache hierarchy that reduces cost, space, and energy consumption while satisfying the time deadline in real-time Multi-Processor Systems-on-Chip (MPSoCs). These brute-force-like approaches can take years to complete. Alternatively, crude estimation methods driven by the application's memory access trace can find a cache hierarchy quickly, but compromise the accuracy of the results. In this article, for the first time, we propose a fast and accurate trace-driven approach to find the optimal application-specific two-level inclusive data cache hierarchy for real-time applications. Our proposed approach, "TRISHUL", first predicts the performance of the optimal cache hierarchy and then uses that information to find the optimal cache hierarchy quickly. For the application traces analyzed, TRISHUL suggests a cache hierarchy that is up to 128 times smaller and up to 7 times faster than the suggestion of the state-of-the-art crude trace-driven two-level inclusive cache hierarchy selection approach.
Slides
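
The abstract above contrasts exhaustive search over every two-level inclusive hierarchy (accurate but impractically slow) with crude trace-driven estimation. As a point of reference only, the Python sketch below shows the brute-force baseline being argued against, not the TRISHUL algorithm itself; the candidate sizes and the simulate callback are hypothetical placeholders.

from itertools import product

# Candidate capacities; purely hypothetical placeholders.
L1_SIZES_KB = [4, 8, 16, 32]
L2_SIZES_KB = [32, 64, 128, 256, 512]

def brute_force_select(trace, deadline_cycles, simulate):
    """Return the smallest (L1, L2) pair meeting the deadline, or None.

    simulate(trace, l1_kb, l2_kb) must return the worst-case execution
    time (in cycles) of the trace on that inclusive hierarchy; a cache
    simulator plays this role in practice.
    """
    best = None
    for l1_kb, l2_kb in product(L1_SIZES_KB, L2_SIZES_KB):
        if l2_kb <= l1_kb:                     # inclusion needs L2 strictly larger than L1
            continue
        wcet = simulate(trace, l1_kb, l2_kb)   # one full trace simulation per configuration
        if wcet > deadline_cycles:
            continue                           # misses the real-time deadline
        total_kb = l1_kb + l2_kb
        if best is None or total_kb < best[0]:
            best = (total_kb, l1_kb, l2_kb, wcet)
    return best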

4B-2 (Time: 10:50 - 11:20)
Title: Optimizing Translation Information Management in NAND Flash Memory Storage Systems
Author: *Qi Zhang, Xuandong Li, Linzhang Wang, Tian Zhang (Nanjing University, China), Yi Wang, Zili Shao (The Hong Kong Polytechnic University, Hong Kong)
Page: pp. 326 - 331
Keyword: Translation block, NAND flash memory, On-demand, SSD
Abstract: Address mapping is one of the major functions in managing NAND flash. As NAND flash capacity grows, it becomes vitally important to reduce the RAM footprint of the address mapping table without introducing a large performance overhead. Demand-based address mapping is an effective approach to this problem: the address mapping table is stored in NAND flash (in so-called translation pages), and mapping items are cached on demand in RAM. Managing translation pages is therefore critical in demand-based address mapping. This paper solves the two most important problems in translation page management. First, to reduce frequent translation page updates caused by data requests, we propose a page-level caching mechanism that exploits the fundamental property of NAND flash that the basic read/write unit is one page. Second, to reduce the garbage collection overhead from translation pages, we propose a multiple-write-pointer strategy that groups data pages corresponding to the same translation page into one data block, so that when the data block is reclaimed during garbage collection, only one translation page needs to be updated. We evaluate our scheme using a set of benchmarks from both real-world and synthetic traces. Experimental results show that our techniques significantly reduce the extra translation operations and improve the system response time.
Slides
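
For readers unfamiliar with demand-based address mapping, the Python sketch below illustrates the general idea the abstract builds on: logical-to-physical mappings live in translation pages on flash, and a miss in the RAM cache pulls in an entire translation page, matching the page-granularity read/write unit of NAND flash. This is a minimal illustration, not the paper's scheme; the entry count, the read_tpage interface, and the LRU policy are assumptions, and write-back of dirty translation pages is omitted.

from collections import OrderedDict

ENTRIES_PER_TPAGE = 512          # hypothetical: mapping entries held by one translation page

class DemandMappingCache:
    def __init__(self, flash, capacity_tpages):
        self.flash = flash                      # assumed object with read_tpage(tpage_no) -> {lpn: ppn}
        self.capacity = capacity_tpages
        self.cache = OrderedDict()              # tpage_no -> {lpn: ppn}, kept in LRU order

    def lookup(self, lpn):
        tpage_no = lpn // ENTRIES_PER_TPAGE
        if tpage_no not in self.cache:          # miss: fetch the whole translation page from flash
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict the least-recently-used translation page
            self.cache[tpage_no] = self.flash.read_tpage(tpage_no)
        self.cache.move_to_end(tpage_no)        # mark this page as most recently used
        return self.cache[tpage_no].get(lpn)    # physical page number, or None if unmapped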

4B-3 (Time: 11:20 - 11:50)
Title: An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching
Author: *Xianglei Dang, Xiaoyin Wang, Dong Tong, Zichao Xie, Lingda Li, Keyi Wang (Peking University, China)
Page: pp. 332 - 337
Keyword: data prefetching, energy efficiency, useless prefetch filtering, memory performance optimization
Abstract: As data prefetching is adopted in embedded processors, reducing the energy wasted by useless prefetches is crucial to improving energy efficiency. In this paper, we propose an adaptive prefetch filtering (APF) mechanism to reduce the wasted bandwidth and energy as well as the cache pollution caused by useless prefetches. APF records the prefetch-victim address pairs of issued prefetches and collects information about which address in each pair is accessed first by the processor to guide the filtering of newly generated useless prefetches. Meanwhile, filtered prefetches are recorded to build a feedback mechanism that avoids filtering useful prefetches. Experimental results demonstrate that APF reduces useless prefetches by an average of 53.81% with a mere 5.28% reduction in useful prefetches, thereby reducing memory access bandwidth consumption by 59.92% and L2 cache energy by 6.19%. APF also improves the performance of several programs by reducing the cache pollution incurred by useless prefetches, yielding an average performance improvement of 2.12%.
Slides
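
The filtering idea in the abstract (track prefetch-victim address pairs, observe which of the two addresses the processor touches first, and keep a feedback path so that filtered-but-needed prefetches are not suppressed again) can be sketched in software as follows. This is a simplified model, not the APF hardware; unbounded Python sets stand in for the fixed-size tables a real implementation would use.

class PrefetchFilter:
    def __init__(self):
        self.pending = {}       # prefetch_addr -> victim_addr, issued but not yet resolved
        self.useless = set()    # prefetch addresses whose victim was touched first
        self.filtered = set()   # prefetches we suppressed, kept for the feedback path

    def on_prefetch_request(self, prefetch_addr, victim_addr):
        """Return True to issue the prefetch, False to filter it out."""
        if prefetch_addr in self.useless:
            self.filtered.add(prefetch_addr)   # remember what we dropped
            return False
        self.pending[prefetch_addr] = victim_addr
        return True

    def on_demand_access(self, addr):
        """Observe every demand access to classify earlier decisions."""
        if addr in self.filtered:              # feedback: a filtered prefetch was actually needed
            self.filtered.discard(addr)
            self.useless.discard(addr)
        if addr in self.pending:               # prefetched line used first -> useful prefetch
            del self.pending[addr]
            self.useless.discard(addr)
            return
        for p, victim in list(self.pending.items()):
            if victim == addr:                 # victim touched first -> prefetch was useless
                self.useless.add(p)
                del self.pending[p]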

4B-4 (Time: 11:50 - 12:20)
Title: Cache Capacity Aware Thread Scheduling for Irregular Memory Access on Many-Core GPGPUs
Author: *Hsien-Kai Kuo, Ta-Kan Yen, Bo-Cheng Charles Lai, Jing-Yang Jou (Department of Electronics Engineering, National Chiao Tung University, Taiwan)
Page: pp. 338 - 343
Keyword: Shared cache, Cache contention, Thread scheduling, Irregular applications, GPGPUs
Abstract: On-chip shared caches are effective in alleviating the memory bottleneck in modern many-core systems such as GPGPUs. However, when scheduling numerous concurrent threads on a GPGPU, a cache-capacity-agnostic scheduling scheme can lead to severe cache contention among threads and thus significant performance degradation. Moreover, the diverse working sets of irregular applications make cache contention an even more serious problem. As a result, taking cache capacity into account has become a critical scheduling issue on GPGPUs. This paper formulates a Cache Capacity Aware Thread Scheduling Problem to capture the impact of cache capacity as well as different architectural considerations. After proving the problem NP-hard, the paper proposes two algorithms to perform cache capacity aware thread scheduling. Simulation results on NVIDIA's Fermi configuration show that the proposed scheduling scheme effectively avoids cache contention, achieving an average 44.7% cache miss reduction and 28.5% runtime improvement. For more complex applications, the runtime improvement can reach up to 62.5%.
Slides
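
The core constraint in the abstract (co-scheduled threads should have a combined working set that fits the shared cache, otherwise they evict each other's data) resembles bin packing. The sketch below is a plain first-fit-decreasing heuristic under that assumption; it is not one of the paper's two algorithms, and it assumes per-thread working-set sizes are already known.

def schedule_thread_groups(working_sets_kb, cache_capacity_kb):
    """Group thread ids into batches whose combined working set fits the shared cache.

    working_sets_kb: dict {thread_id: working_set_size_kb}
    Returns a list of batches (lists of thread ids), intended to run one after another.
    """
    batches = []                               # each entry: [used_kb, [thread_ids]]
    # First-fit decreasing: place the largest working sets first.
    for tid, ws in sorted(working_sets_kb.items(), key=lambda kv: -kv[1]):
        for batch in batches:
            if batch[0] + ws <= cache_capacity_kb:
                batch[0] += ws
                batch[1].append(tid)
                break
        else:
            batches.append([ws, [tid]])        # start a new batch for this thread
    return [tids for _, tids in batches]

# Hypothetical example: eight threads with mixed working sets and a 768 KB shared L2.
print(schedule_thread_groups(
    {t: ws for t, ws in enumerate([512, 384, 256, 128, 128, 64, 64, 32])}, 768))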