ASP-DAC 2012 Technical Program

The 17th Asia and South Pacific Design Automation Conference

Session 6A Efficient Methods for Resource Utilization in Multi-Core NoC Designs
Time: 8:30 - 10:10 Thursday, February 2, 2012
Location: Room 204B
Chairs: Jiang Xu (The Hong Kong University of Science & Technology, Hong Kong), David Atienza (EPFL, Switzerland)

6A-1 (Time: 8:30 - 8:55)

Title	Proximity-Aware Cache Replication
Author	Chongmin Li, Dongsheng Wang, *Haixia Wang, Yibo Xue (Department of Computer Science & Technology, Tsinghua University, China), Jian Li (IBM Research in Austin, U.S.A.)
Page	pp. 481 - 486
Keyword	Chip multiprocessor, Cache replication, Proximity
Abstract	We propose Proximity-Aware cache Replication (PAR), an LLC replication technique that elegantly integrates an intelligent cache replication placement mechanism and a hierarchical directory-based coherence protocol into one cost-effective and scalable design. PAR dynamically allocates replicas of either shared or private data to a few predefined and fixed locations that are calculated at chip design time. Therefore, PAR is well suited to future many-core CMPs thanks to its scalable on-chip storage and coherence design.

6A-2 (Time: 8:55 - 9:20)

Title	Dynamic Reusability-based Replication with Network Address Mapping in CMPs
Author	Jinglei Wang, Dongsheng Wang, *Haixia Wang, Yibo Xue (Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, China)
Page	pp. 487 - 492
Keyword	Chip Multiprocessor, Shared Cache, Replication, Network on Chip
Abstract	In a Chip MultiProcessor (CMP) with shared caches, the last-level cache is distributed across all the cores. This increases the on-chip communication delay and thus influences the processor's performance. Replication can be provided in shared caches to reduce the on-chip communication delay. However, current proposals do not take into account the access characteristics of replicated blocks or how to make the best use of replicas, which limits their performance benefit. In this paper, we observe that the reusability of cache blocks strongly influences the effectiveness of a replication scheme. Based on this observation, we propose Dynamic Reusability-based Replication (DRR), a novel cache design that manages replicas efficiently using blocks' reuse patterns. DRR monitors the access patterns of recently referenced cache blocks and replicates blocks with high reusability to appropriate L2 slices, where the replicated copies can be shared by nearby cores. We evaluate DRR on a 16-core system using the SPLASH-2 and PARSEC benchmarks. DRR improves performance by 30% on average over a conventional shared cache design, 16% over Victim Replication (VR), 8% over Adaptive Selective Replication (ASR), and 25% over R-NUCA.

6A-3 (Time: 9:20 - 9:45)

Title	Hungarian Algorithm Based Virtualization to Maintain Application Timing Similarity for Defect-Tolerant NoC
Author	Ke Yue, Frank Lockom, Zheng Li, Soumia Ghalim, *Shangping Ren (IIT, U.S.A.), Lei Zhang, Xiaowei Li (Institute of Computing Technology, Chinese Academy of Sciences, China)
Page	pp. 493 - 498
Keyword	NoC, timing similarity, Hungarian method, defect-redundant
Abstract	Homogeneous manycore processors are emerging in broad application areas, including those with timing requirements, such as real-time and embedded applications. Typically, these processors employ a Network-on-Chip (NoC) as the communication infrastructure, and core-level redundancy is often used as an effective approach to improve the yield of manycore chips. For a given application's task graph and a task-to-core mapping strategy, the traffic pattern on the NoC is known a priori. However, when defective cores are replaced by redundant ones, the NoC topology changes. As a result, a program fine-tuned to the timing parameters of one topology may not meet the expected timing behavior under the new one. To address this issue, a timing similarity metric is introduced to evaluate timing resemblances between different NoC topologies. Based on this metric, a Hungarian-method-based algorithm is developed to reconfigure a defect-tolerant manycore platform and form a unified application-specific virtual core topology in which the timing variations caused by such reconfiguration are minimized. Our case studies indicate that the proposed metric is able to accurately measure the timing differences between different NoC topologies. The standard deviation between the difference calculated using the metric and the difference obtained through simulation is less than 6.58%. Our case studies also indicate that the developed Hungarian-method-based algorithm using the metric performs close to the optimal solution in comparison to random defect-redundant core assignments.
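Note	The core remapping described in this abstract can be read as an assignment problem: each virtual core must be mapped to exactly one working physical core so that a total cost is minimized. The sketch below is not the authors' algorithm and does not implement the Hungarian method itself. It is a brute-force search over all permutations, which finds the same optimum that the Hungarian method would find in polynomial time. The function name `best_assignment` and the cost matrix are hypothetical, and the cost values are made up for illustration rather than derived from the paper's timing similarity metric.

```python
from itertools import permutations


def best_assignment(cost):
    """Return (total_cost, assignment) minimizing sum of cost[v][assignment[v]].

    cost[v][p] is an assumed, illustrative penalty for mapping virtual core v
    onto physical core p. Brute force: O(n!) time, suitable only for tiny n.
    """
    n = len(cost)
    best_total, best_perm = None, None
    for perm in permutations(range(n)):
        total = sum(cost[v][perm[v]] for v in range(n))
        if best_total is None or total < best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Hypothetical 3x3 cost matrix (rows: virtual cores, columns: physical cores).
cost = [
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
total, assignment = best_assignment(cost)
print(total, assignment)  # 5 [1, 0, 2]
```

Here virtual core 0 maps to physical core 1, virtual core 1 to physical core 0, and virtual core 2 to physical core 2. For realistic core counts, a polynomial-time solver for the same problem would replace the permutation loop.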

6A-4 (Time: 9:45 - 10:10)

Title	Using Link-level Latency Analysis for Path Selection for Real-time Communication on NoCs
Author	*Hany Kashif, Hiren D. Patel, Sebastian Fischmeister (University of Waterloo, Canada)
Page	pp. 499 - 504
Keyword	NoCs, path selection, real-time
Abstract	We present a path selection algorithm that is used when deploying hard real-time traffic flows onto a chip-multiprocessor system. This chip-multiprocessor system uses a priority-based real-time network-on-chip interconnect between the multiple processors. The problem we address is the following: given a mapping of the tasks onto a chip-multiprocessor system, we need to determine the paths that the traffic flows take such that the flows meet their deadlines. Furthermore, we must ensure that each deadline is met even in the presence of direct and indirect interference from other flows sharing network links on the path. To achieve this, our algorithm utilizes a link-level analysis to determine the impact of a link being used by a flow, and its effect on other flows sharing the link. Our experimental results show that we can improve schedulability by about 8% and 15% over the Minimum Interference Routing and Widest Shortest Path algorithms, respectively.