
The 17th Asia and South Pacific Design Automation Conference

Session 9C  Video, Display, and Signal Processing Technologies and Techniques
Time: 16:10 - 17:50 Thursday, February 2, 2012
Location: Room 202
Chairs: Shao-Yi Chien (National Taiwan University, Taiwan), Yen-Kuang Chen (Intel Corp., U.S.A.)

9C-1 (Time: 16:10 - 16:35)
Title: A 16-pixel Parallel Architecture with Block-level/Mode-level Co-reordering Approach for Intra Prediction in 4kx2k H.264/AVC Video Encoder
Author: Huailu Ren (College of Information Science and Engineering of Shandong University of Science and Technology, China), *Yibo Fan (State Key Lab of ASIC and System of Fudan University, China), Xinhua Chen (College of Information Science and Engineering of Shandong University of Science and Technology, China), Xiaoyang Zeng (State Key Lab of ASIC and System of Fudan University, China)
Page: pp. 801 - 806
Keyword: H.264/AVC, intra prediction, hardware architecture
Abstract: Intra prediction is the most important technique in the H.264/AVC intra-frame encoder, but its complicated data dependencies and immense computational load make real-time coding difficult. To meet real-time coding requirements while avoiding wasted hardware, this paper presents a parallel, highly efficient H.264/AVC intra prediction architecture targeting high-resolution (e.g. 4kx2k) video encoding applications. In this architecture, the optimized intra 4x4 prediction engine processes sixteen pixels in parallel at a slightly higher hardware cost than the previous four-pixel parallel architecture. The intra 16x16 prediction engine works in parallel with the intra 4x4 prediction engine and reuses the adder tree of the Sum of Absolute Transformed Differences (SATD) generator. Moreover, to reduce the data dependency in the intra 4x4 reconstruction loop, a block-level and mode-level co-reordering strategy is proposed, which alleviates the performance bottleneck of H.264/AVC intra encoding to a great extent. The proposed architecture supports full-mode intra prediction for the H.264/AVC baseline, main, and extended profiles, and takes only 163 cycles to complete the intra prediction of one macroblock (MB). Synthesized with a SMIC 0.13µm CMOS cell library, the design occupies 61k gates, runs at 215MHz, and supports real-time encoding of 4kx2k@40fps video sequences.
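The adder tree the abstract refers to accumulates the standard SATD cost: the residual block is transformed with a 4x4 Hadamard matrix and the absolute coefficients are summed. A minimal pure-Python sketch of that metric (illustrative only; the naming is ours, not the paper's RTL):

```python
# 4x4 Hadamard matrix used in the H.264 SATD cost metric (symmetric, so H == H^T).
H = [[1,  1,  1,  1],
     [1,  1, -1, -1],
     [1, -1, -1,  1],
     [1, -1,  1, -1]]

def matmul(a, b):
    """Plain 4x4 matrix multiply."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def satd_4x4(original, predicted):
    """SATD of one 4x4 block: Hadamard-transform the residual
    (H * R * H^T), then sum absolute coefficients -- the summation
    corresponds to the adder-tree stage in hardware."""
    residual = [[original[i][j] - predicted[i][j] for j in range(4)]
                for i in range(4)]
    transformed = matmul(matmul(H, residual), H)  # H is symmetric
    return sum(abs(v) for row in transformed for v in row)
```

For a residual of all ones, only the DC coefficient is nonzero and the cost is 16; an exact prediction gives a cost of 0.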

9C-2 (Time: 16:35 - 17:00)
Title: Fine-grained Dynamic Voltage Scaling on OLED Display
Author: Xiang Chen, Jian Zheng, *Yiran Chen (Dept. of Electrical and Computer Eng., University of Pittsburgh, U.S.A.), Hai Li (Dept. of Electrical and Computer Eng., Polytechnic Institute of New York University, U.S.A.), Wei Zhang (School of Computer Eng., Nanyang Technological University, Singapore)
Page: pp. 807 - 812
Keyword: OLED, Driver design, Dynamic voltage scaling
Abstract: OLED has emerged as a new-generation display technology, but its power consumption remains inefficient. In this work, we propose a fine-grained dynamic voltage scaling (FDVS) technique to reduce power consumption: an OLED panel is partitioned into multiple areas, and DVS is applied to each area individually. A DVS-friendly OLED driver design is also proposed to improve color accuracy under DVS. Experiments show that, compared to the existing DVS technique, FDVS achieves greater power savings and reduces the image-compensation cost.
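A rough, hypothetical model of why per-area scaling beats a single panel-wide supply: each area's voltage need only cover its own brightest pixel, so dark regions stop paying for the global peak. The luminance-to-power relation and headroom term below are our simplifying assumptions, not the paper's driver model:

```python
def panel_power(areas, headroom=1.0):
    """Toy FDVS model: 'areas' is a list of areas, each a list of pixel
    luminances (0-1). Each area's supply is scaled to its own peak
    luminance plus a fixed headroom; power ~ luminance * voltage
    (arbitrary units)."""
    total = 0.0
    for pixels in areas:
        v = max(pixels) + headroom        # per-area scaled supply
        total += sum(pixels) * v
    return total

def global_power(areas, headroom=1.0):
    """Baseline: one supply for the whole panel, set by the global peak."""
    v = max(max(pixels) for pixels in areas) + headroom
    return sum(sum(pixels) for pixels in areas) * v
```

With one dark area and one bright area, the per-area model always consumes no more than the global-supply baseline, and strictly less whenever the areas' peaks differ.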

9C-3 (Time: 17:00 - 17:25)
Title: A Reconfigurable Accelerator for Neuromorphic Object Recognition
Author: Jagdish Sabarad, Srinidhi Kestur, Mi Sun Park, Dharav Dantara, *Vijaykrishnan Narayanan (The Pennsylvania State University, U.S.A.), Yang Chen, Deepak Khosla (HRL Laboratories, U.S.A.)
Page: pp. 813 - 818
Keyword: accelerator, neuromorphic vision, object recognition, fpga, convolution
Abstract: Advances in neuroscience have enabled researchers to develop computational models of auditory perception, visual perception, and learning in the human brain. HMAX, a biologically inspired model of the visual cortex, has been shown to outperform standard computer vision approaches for multi-class object recognition. Although computationally demanding, HMAX can potentially be applied in areas such as autonomous vehicle navigation, unmanned surveillance, and robotics. In this paper, we present a reconfigurable hardware accelerator for the time-consuming S2 stage of the HMAX model. The accelerator leverages spatial parallelism and dedicated wide data buses with on-chip memories to provide an energy-efficient solution suitable for embedded systems. We present a systolic-array-based architecture with a run-time reconfigurable convolution engine that can perform multiple variable-sized convolutions in parallel. We also describe an automation flow that generates optimal hardware configurations for a given algorithmic specification and seamlessly performs run-time configuration and execution. Experimental results on Virtex-6 FPGA platforms show 5X to 11X speedups and 14X to 33X higher performance-per-Watt over a CNS-based implementation on a Tesla GPU.
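The core operation the convolution engine parallelizes is a sliding-window dot product between a feature map and a (variable-sized) template kernel. A plain-Python sketch of that valid-region operation (illustrative only, not the paper's systolic-array implementation):

```python
def conv2d_valid(image, kernel):
    """Sliding-window correlation over the 'valid' region (no padding):
    at each position, the kernel-sized window is multiplied elementwise
    with the kernel and summed. Kernel dimensions are arbitrary, which
    mirrors the engine's support for variable-sized convolutions."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out
```

A 3x3 input with a 2x2 kernel yields a 2x2 output; the hardware's gain comes from evaluating many such kernels (and output positions) concurrently rather than in this nested-loop order.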

9C-4 (Time: 17:25 - 17:50)
Title: Efficient Implementation of Multi-Moduli Architectures for Binary-to-RNS Conversion
Author: *Hector Pettenghi (Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Portugal), Leonel Sousa (Instituto Superior Tecnico (IST)/ Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Portugal), Jude Angelo Ambrose (School of Computer Science and Engineering, University of New South Wales, Australia)
Page: pp. 819 - 824
Keyword: Residue number system, Binary-to-RNS converters, memory-less processors, Digital Signal Processing
Abstract: This paper presents a novel approach to improving existing Binary-to-RNS multi-moduli architectures, which can be implemented serially or in parallel. A novel choice of the weights associated with the inputs yields substantial improvements when applied to the most efficient multi-moduli architectures known to date. Experimental results suggest that the proposed memory-less multi-moduli architectures achieve speedups of 1.94 and 1.62 for parallel and serial implementations, respectively, compared with the most efficient state-of-the-art structures.
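For the common three-moduli set {2^n - 1, 2^n, 2^n + 1}, the "weights associated with the inputs" arise because 2^n ≡ 1 (mod 2^n - 1) and 2^n ≡ -1 (mod 2^n + 1): each n-bit chunk of the binary input enters the residue sums with weight +1 or alternating ±1. A sketch under that assumption (the paper's specific weight assignment is not reproduced here):

```python
def to_rns(x, n):
    """Binary-to-RNS conversion for the moduli set {2^n - 1, 2^n, 2^n + 1}.
    x is split into n-bit chunks; the chunk weights (+1 everywhere for
    2^n - 1, alternating +1/-1 for 2^n + 1) are what converter designs
    choose and optimize."""
    m1, m2, m3 = (1 << n) - 1, 1 << n, (1 << n) + 1
    chunks = []
    t = x
    while t:
        chunks.append(t & (m2 - 1))                 # low n bits
        t >>= n
    r1 = sum(chunks) % m1                           # weights: +1, +1, +1, ...
    r2 = x % m2                                     # just the low n bits
    r3 = sum(c if i % 2 == 0 else -c
             for i, c in enumerate(chunks)) % m3    # weights: +1, -1, +1, ...
    return (r1, r2, r3)
```

For n = 4 (moduli 15, 16, 17), to_rns(1000, 4) returns (10, 8, 14), matching (1000 % 15, 1000 % 16, 1000 % 17) without ever performing a wide division.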