Title | Register and Thread Structure Optimization for GPUs |
Author | *Yun Liang (Center for Energy-efficient Computing and Applications, School of EECS, Peking University, China), Zheng Cui (Advanced Digital Sciences Center, Illinois at Singapore, Singapore), Kyle Rupnow (Nanyang Technological University, Singapore), Deming Chen (University of Illinois, Urbana-Champaign, U.S.A.) |
Page | pp. 461 - 466 |
Keyword | GPU, register, thread structure, design space exploration |
Abstract | GPUs are an increasingly popular implementation platform for a variety of general-purpose applications, from mobile and embedded devices to high-performance computing. The CUDA and OpenCL parallel programming models enable easy utilization of the GPU's resources. However, tuning GPU applications' performance is a complex and labor-intensive task. Software programmers employ a variety of optimization techniques to explore tradeoffs between thread parallelism and the performance of a single thread. However, prior techniques ignore register allocation, a significant factor in single-thread performance that also indirectly affects the number of simultaneously active threads. In this paper, we show that joint optimization of register allocation and thread structure has great potential to significantly improve performance. However, the design space for this joint optimization can be large; therefore, we develop performance metrics appropriate for evaluation within a compiler's inner loop and efficient design space exploration techniques that use the metrics to narrow the search space. Across a range of GPU applications, we achieve an average performance speedup of 1.33X (up to 1.73X) with design space exploration 355X faster than exhaustive search. |
Title | Real-Time Partitioned Scheduling on Multi-Core Systems with Local and Global Memories |
Author | *Che-Wei Chang (National Taiwan University, Taiwan), Jian-Jia Chen (Karlsruhe Institute of Technology, Germany), Tei-Wei Kuo (National Taiwan University, Taiwan), Heiko Falk (Ulm University, Germany) |
Page | pp. 467 - 472 |
Keyword | real-time system, heterogeneous memory, partitioned scheduling, resource optimization, worst case execution time |
Abstract | Real-time task scheduling becomes even more challenging with the emergence of island-based multi-core architectures, where the local memory module of an island offers shorter access time than the global memory module does. With such a popular architecture design in mind, this paper explores real-time task scheduling over island-based homogeneous cores with local and global memory pools. Joint considerations of real-time scheduling and memory allocation are presented to efficiently use the computing and memory resources. A polynomial-time algorithm with an asymptotic 4-approximation bound is proposed to minimize the number of islands needed to successfully schedule tasks. To evaluate the performance of the proposed algorithm, 82 benchmarks from the MRTC, MediaBench, UTDSP, NetBench, and DSPstone benchmark suites are profiled by the worst-case execution time analyzer aiT and included in the experiments. |
Title | Dynamic Thermal Management for Multi-Core Microprocessors Considering Transient Thermal Effects |
Author | *Zao Liu (University of California, Riverside, U.S.A.), Tailong Xu (Anhui University, China), Sheldon X.-D. Tan (University of California, Riverside, U.S.A.), Hai Wang (UESTC, China) |
Page | pp. 473 - 478 |
Keyword | Dynamic thermal management, task migration, thermal analysis, moment matching, hot spots |
Abstract | Dynamic thermal management is a viable way to effectively mitigate thermal emergencies. In this paper, a new thermal management scheme is proposed to reduce the on-chip temperature variance and the occurrence of hot spots by taking transient thermal effects into account. The new method performs task migrations to reduce temperature variations across the chip. Instead of intuitively assigning heavy tasks to low-temperature cores to balance the thermal profile based on steady-state thermal analysis, the proposed method applies moment-matching-based transient thermal analysis techniques for fast thermal estimation and prediction to guide the migration process. We show that by considering the dominant temperature moment component, the resulting algorithm can lead to a significant reduction of hot spots under full transient thermal simulation. Our experimental results on a 16-core microprocessor demonstrate that the proposed method can reduce the number of hot spots by 50% compared to the simple lowest-temperature-based task scheduling method, leading to a more uniform on-chip temperature distribution across the microprocessor cores. |
Title | BAMSE: A Balanced Mapping Space Exploration Algorithm for GALS-based Manycore Platforms |
Author | Mohammad Foroozannejad, Brent Bohnenstiehl, *Soheil Ghiasi (University of California, Davis, U.S.A.) |
Page | pp. 479 - 484 |
Keyword | Manycore, GALS, Mapping, Algorithm |
Abstract | We study the problem of mapping the concurrent tasks of an application, modeled as a data flow graph, onto the processors of a GALS-based manycore platform. We propose a mapping algorithm called BAMSE, which exploits the characteristics of streaming applications and the specifications of the target architecture to optimize the mapping solution. Different configuration parameters embedded into the algorithm enable one to strike a balance between the scalability of the approach and the quality of the generated solutions. Experiments with several real-life applications show that our algorithm outperforms hand-optimized manual mappings by up to 65% in terms of the longest inter-processor communication link, and by as much as 19% with respect to the total length of the links, when the two criteria are used as primary and secondary optimization objectives, respectively. Additionally, our algorithm delivers superior mappings compared to ILP-generated solutions obtained after 10 days of solver runtime. |