ASP-DAC 2013 Technical Program

The 18th Asia and South Pacific Design Automation Conference

Session 5D Multi-/Many-Core System Optimization
Time: 13:40 - 15:40 Thursday, January 24, 2013
Chairs: Yuichi Nakamura (NEC, Japan), Yongpan Liu (Tsinghua University, China)

5D-1 (Time: 13:40 - 14:10)

Title	Register and Thread Structure Optimization for GPUs
Author	*Yun Liang (Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China), Zheng Cui (Advanced Digital Sciences Center, Illinois at Singapore, Singapore), Kyle Rupnow (Nanyang Technological University, Singapore), Deming Chen (University of Illinois, Urbana-Champaign, U.S.A.)
Page	pp. 461 - 466
Keyword	GPU, register, thread structure, design space exploration
Abstract	GPUs are an increasingly popular implementation platform for a variety of general-purpose applications, from mobile and embedded devices to high-performance computing. The CUDA and OpenCL parallel programming models enable easy utilization of the GPU's resources. However, tuning the performance of GPU applications is a complex and labor-intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between thread parallelism and single-thread performance. However, prior techniques ignore register allocation, which is a significant factor in single-thread performance and indirectly affects the number of simultaneously active threads. In this paper, we show that joint optimization of register allocation and thread structure has great potential to significantly improve performance. However, the design space for this joint optimization can be large; therefore, we develop performance metrics appropriate for evaluation within a compiler's inner loop and efficient design space exploration techniques that use the metrics to narrow the search space. Across a range of GPU applications, we achieve an average performance speedup of 1.33X (up to 1.73X) with design space exploration 355X faster than exhaustive search.
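The link between register allocation and the number of simultaneously active threads, as described in the 5D-1 abstract, can be illustrated with a rough occupancy estimate. This is a minimal sketch and not the paper's metric: the function name and the per-multiprocessor limits (`regs_per_sm`, `max_threads_per_sm`, `max_blocks_per_sm`) are hypothetical values chosen for illustration, and the model ignores shared memory and register allocation granularity.

```python
def max_active_threads(regs_per_thread, threads_per_block,
                       regs_per_sm=32768,        # assumed register file size
                       max_threads_per_sm=1536,  # assumed thread limit
                       max_blocks_per_sm=8):     # assumed block limit
    """Estimate resident threads on one multiprocessor (simplified model)."""
    regs_per_block = regs_per_thread * threads_per_block
    # Each limit caps the number of thread blocks that can be resident at once.
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_threads = max_threads_per_sm // threads_per_block
    blocks = min(blocks_by_regs, blocks_by_threads, max_blocks_per_sm)
    return blocks * threads_per_block


# Sweep the register budget for a fixed thread structure of 256 threads per block.
for regs in (20, 32, 63):
    print(regs, max_active_threads(regs, threads_per_block=256))
```

Under these assumed limits, giving each thread more registers lowers the number of resident threads, which is the tradeoff that a joint search over register allocation and thread structure has to balance.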

5D-2 (Time: 14:10 - 14:40)

Title	Real-Time Partitioned Scheduling on Multi-Core Systems with Local and Global Memories
Author	*Che-Wei Chang (National Taiwan University, Taiwan), Jian-Jia Chen (Karlsruhe Institute of Technology, Germany), Tei-Wei Kuo (National Taiwan University, Taiwan), Heiko Falk (Ulm University, Germany)
Page	pp. 467 - 472
Keyword	real-time system, heterogeneous memory, partitioned scheduling, resource optimization, worst case execution time
Abstract	Real-time task scheduling becomes even more challenging with the emergence of island-based multi-core architectures, where the local memory module of an island offers shorter access time than the global memory module does. With such a popular architecture design in mind, this paper explores real-time task scheduling over island-based homogeneous cores with local and global memory pools. Joint considerations of real-time scheduling and memory allocation are presented to efficiently use the computing and memory resources. A polynomial-time algorithm with an asymptotic 4-approximation bound is proposed to minimize the number of islands needed to successfully schedule tasks. To evaluate the performance of the proposed algorithm, 82 benchmarks from the MRTC, MediaBench, UTDSP, NetBench, and DSPstone benchmark suites are profiled by the worst-case execution time analyzer aiT and included in the experiments.

5D-3 (Time: 14:40 - 15:10)

Title	Dynamic Thermal Management for Multi-Core Microprocessors Considering Transient Thermal Effects
Author	*Zao Liu (University of California, Riverside, U.S.A.), Tailong Xu (Anhui University, China), Sheldon X.-D. Tan (University of California, Riverside, U.S.A.), Hai Wang (UESTC, China)
Page	pp. 473 - 478
Keyword	Dynamic thermal management, task migration, thermal analysis, moment matching, hot spots
Abstract	Dynamic thermal management is a viable way to effectively mitigate thermal emergencies. In this paper, a new thermal management scheme is proposed to reduce the on-chip temperature variance and the occurrence of hot spots by considering more transient thermal effects. The new method performs task migrations to reduce the temperature variations across the chip. Instead of intuitively assigning the heavy tasks to the low-temperature cores to balance the thermal profile based on steady-state thermal analysis, the proposed method applies moment-matching-based transient thermal analysis techniques for fast thermal estimation and prediction to guide the migration process. We show that by considering the dominant temperature moment component, the resulting algorithm can lead to a significant reduction of hot spots under full transient thermal simulation. Our experimental results on a 16-core microprocessor demonstrate that the proposed method can reduce the number of hot spots by 50% compared to the simple lowest-temperature-based task scheduling method, leading to a more uniform on-chip temperature distribution across the microprocessor cores.

5D-4 (Time: 15:10 - 15:40)

Title	BAMSE: A Balanced Mapping Space Exploration Algorithm for GALS-based Manycore Platforms
Author	Mohammad Foroozannejad, Brent Bohnenstiehl, *Soheil Ghiasi (University of California, Davis, U.S.A.)
Page	pp. 479 - 484
Keyword	Manycore, GALS, Mapping, Algorithm
Abstract	We study the problem of mapping concurrent tasks of an application modeled as a data flow graph onto processors of a GALS-based manycore platform. We propose a mapping algorithm called BAMSE, which exploits the characteristics of streaming applications and the specifications of the target architecture to optimize the mapping solution. Different configuration parameters embedded into the algorithm enable one to strike a balance between scalability of the approach and the quality of generated solutions. Experiments with several real-life applications show that our algorithm outperforms hand-optimized manual mappings by up to 65% in terms of the longest inter-processor communication link, and by as much as 19% with respect to the total length of the links, when the two criteria are used as primary and secondary optimization objectives, respectively. Additionally, our algorithm delivers superior mappings compared to ILP-generated solutions after 10 days of solver runtime.
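The two mapping criteria named in the 5D-4 abstract, longest inter-processor link as the primary objective and total link length as the secondary one, can be sketched as a cost function over a candidate placement. This is an illustrative sketch only and not the BAMSE algorithm: the function name, the 2D grid coordinates, and the use of Manhattan distance as link length are assumptions made for the example.

```python
def link_costs(edges, placement):
    """Return (longest_link, total_length) for a task-to-processor placement.

    edges:     list of (producer, consumer) task pairs from the data flow graph
    placement: dict mapping each task to an assumed (row, col) grid position
    """
    lengths = [
        abs(placement[u][0] - placement[v][0]) + abs(placement[u][1] - placement[v][1])
        for u, v in edges
    ]
    # Tuple order encodes the priority: longest link first, total length second.
    return (max(lengths, default=0), sum(lengths))


# Hypothetical four-task graph: a chain a-b-c-d plus a feedback edge a-d.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]
in_a_row = {"a": (0, 0), "b": (0, 1), "c": (0, 2), "d": (0, 3)}
in_a_square = {"a": (0, 0), "b": (0, 1), "c": (1, 1), "d": (1, 0)}

# Tuples compare lexicographically, so min() applies both objectives in order.
best = min((in_a_row, in_a_square), key=lambda p: link_costs(edges, p))
print(link_costs(edges, in_a_row), link_costs(edges, in_a_square))
```

Returning a tuple keeps the primary and secondary objectives separate while still letting ordinary comparison rank candidate mappings.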