
The 18th Asia and South Pacific Design Automation Conference

Session 4A Special Session: High-Level Synthesis and Parallel Programming Models for FPGAs
Time: 10:20 - 12:20 Thursday, January 24, 2013
Organizer: Yun (Eric) Liang (Peking University, China)

4A-1 (Time: 10:20 - 10:50)
Title: (Invited Paper) Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms
Author: *Doris Chen, Deshanand Singh (Altera Toronto Technology Center, Canada)
Page: pp. 297-304
Keyword: high-level synthesis, FPGA, GPU
Abstract: Fractal compression is an efficient technique for image and video encoding that has not gained widespread acceptance because of its computational intensity. In this paper, we present a real-time implementation of fractal compression in OpenCL and show how the algorithm can be efficiently optimized for multi-core CPUs, GPUs, and FPGAs. We show that the core computation, implemented on the FPGA through OpenCL, is 3x faster than a high-end GPU and 114x faster than a multi-core CPU. We also compare against a hand-coded FPGA implementation to demonstrate the effectiveness of OpenCL-to-FPGA compilation.

4A-2 (Time: 10:50 - 11:20)
Title: (Invited Paper) High Level Synthesis of Multiple Dependent CUDA Kernels for FPGA
Author: Swathi Gurumani, Hisham Cholakkail (Advanced Digital Sciences Center, Singapore), Yun Liang (Peking University, China), *Kyle Rupnow (Nanyang Technological University, Singapore), Deming Chen (University of Illinois at Urbana-Champaign, U.S.A.)
Page: pp. 305-312
Keyword: HLS, FPGA, CUDA
Abstract: High-level synthesis (HLS) tools automatically generate register-transfer level (RTL) hardware from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages and assist users in design space exploration through automation and feedback on the performance bottlenecks of their designs. This design space exploration applies techniques such as pipelining, partitioning, and resource sharing to improve performance and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code remains superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, present a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. The study shows that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation while consuming over 16x less energy than the GPU. Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs.

4A-3 (Time: 11:20 - 11:50)
Title: (Invited Paper) The Liquid Metal IP Bridge
Author: Perry Cheng, Stephen J. Fink, Rodric Rabbah, *Sunil Shukla (IBM Research, U.S.A.)
Page: pp. 313-319
Keyword: High Level Synthesis, Heterogeneous Computing, FPGA
Abstract: Programmers are increasingly turning to heterogeneous systems to achieve performance. Examples include FPGA-based systems that integrate reconfigurable architectures with conventional processors. However, the burden of managing the coding complexity intrinsic to these systems falls entirely on the programmer. This limits the proliferation of these systems, as only highly skilled programmers and FPGA developers can unlock their potential. The goal of the Liquid Metal project at IBM Research is to address the programming complexity of heterogeneous FPGA-based systems. A feature of this work is a vertically integrated development lifecycle that appeals to skilled software developers. A primary enabler for this work is a canonical IP bridge, designed to offer a uniform communication methodology between software and hardware and applicable across a wide range of off-the-shelf platforms.