Title | A 16-pixel Parallel Architecture with Block-level/Mode-level Co-reordering Approach for Intra Prediction in 4kx2k H.264/AVC Video Encoder |
Author | Huailu Ren (College of Information Science and Engineering of Shandong University of Science and Technology, China), *Yibo Fan (State Key Lab of ASIC and System of Fudan University, China), Xinhua Chen (College of Information Science and Engineering of Shandong University of Science and Technology, China), Xiaoyang Zeng (State Key Lab of ASIC and System of Fudan University, China) |
Page | pp. 801 - 806 |
Keyword | H.264/AVC, intra prediction, hardware architecture |
Abstract | Intra prediction is the most important technology in the H.264/AVC intra-frame encoder. However, the intra prediction process involves extremely complicated data dependencies and an immense amount of computation. In order to meet the requirements of real-time coding and avoid hardware waste, this paper presents a parallel, highly efficient H.264/AVC intra prediction architecture targeting high-resolution (e.g. 4kx2k) video encoding applications. In this architecture, the optimized intra 4x4 prediction engine can process sixteen pixels in parallel at a slightly higher hardware cost (compared to the previous four-pixel parallel architecture). The intra 16x16 prediction engine works in parallel with the intra 4x4 prediction engine and reuses the adder tree of the Sum of Absolute Transformed Difference (SATD) generator. Moreover, in order to reduce the data dependency in the intra 4x4 reconstruction loop, a block-level and mode-level co-reordering strategy is proposed. Therefore, the performance bottleneck of H.264/AVC intra encoding can be alleviated to a great extent. The proposed architecture supports full-mode intra prediction for the H.264/AVC baseline, main and extended profiles. It takes only 163 cycles to complete the intra prediction process of one macroblock (MB). The design is synthesized with a SMIC 0.13µm CMOS cell library. The result shows that it takes 61k gates and can run at 215MHz, supporting real-time encoding of 4kx2k@40fps video sequences. |
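The adder-tree reuse mentioned above concerns the standard SATD cost metric, which applies a 4x4 Hadamard transform to the prediction residual and sums the absolute transformed coefficients. The following C sketch of that metric is purely illustrative (function and variable names are ours, not the paper's):

    /* Illustrative 4x4 SATD: Hadamard-transform the residual block,
     * then sum the absolute coefficients. Names and scaling are
     * illustrative only, not the paper's hardware datapath. */
    #include <stdlib.h>

    int satd4x4(const int orig[4][4], const int pred[4][4])
    {
        int d[4][4], t[4][4];
        int sum = 0;

        /* prediction residual */
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                d[i][j] = orig[i][j] - pred[i][j];

        /* horizontal 1-D Hadamard butterflies */
        for (int i = 0; i < 4; i++) {
            int a = d[i][0] + d[i][3], b = d[i][1] + d[i][2];
            int c = d[i][1] - d[i][2], e = d[i][0] - d[i][3];
            t[i][0] = a + b; t[i][1] = e + c;
            t[i][2] = a - b; t[i][3] = e - c;
        }
        /* vertical 1-D Hadamard and absolute-value accumulation */
        for (int i = 0; i < 4; i++) {
            int a = t[0][i] + t[3][i], b = t[1][i] + t[2][i];
            int c = t[1][i] - t[2][i], e = t[0][i] - t[3][i];
            sum += abs(a + b) + abs(e + c) + abs(a - b) + abs(e - c);
        }
        return sum;  /* encoders often halve this in their cost functions */
    }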
Title | Fine-grained Dynamic Voltage Scaling on OLED Display |
Author | Xiang Chen, Jian Zheng, *Yiran Chen (Dept. of Electrical and Computer Eng., University of Pittsburgh, U.S.A.), Hai Li (Dept. of Electrical and Computer Eng., Polytechnic Institute of New York University, U.S.A.), Wei Zhang (School of Computer Eng., Nanyang Technological University, Singapore) |
Page | pp. 807 - 812 |
Keyword | OLED, Driver design, Dynamic voltage scaling |
Abstract | OLED has emerged as a new-generation display technology, but its power consumption remains inefficient. In this work, we propose a fine-grained dynamic voltage scaling (FDVS) technique to reduce power consumption. The OLED panel is partitioned into multiple individual areas, to each of which DVS is applied independently. A DVS-friendly OLED driver design is also proposed to enhance color accuracy under DVS. Experiments show that, compared to the existing DVS technique, the FDVS technique achieves efficient power saving and reduces the image compensation cost. |
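The fine-grained scaling idea rests on the observation that an OLED region's supply only needs enough headroom for its brightest pixel, so independently scaled regions waste less power on dark content. Below is a minimal, hypothetical C sketch of per-region supply selection under a simple linear headroom model; the paper's actual driver and compensation scheme are not reproduced here:

    /* Illustrative per-region supply selection (our sketch, not the
     * paper's driver): each panel region gets the lowest supply voltage
     * that still covers its brightest pixel. Voltage bounds and the
     * linear model are assumptions for illustration only. */
    #include <stdint.h>

    #define VDD_MIN 4.0f   /* assumed minimum panel supply, volts   */
    #define VDD_MAX 6.0f   /* assumed full-brightness supply, volts */

    float region_vdd(const uint8_t *lum, int n_pixels)
    {
        uint8_t peak = 0;
        for (int i = 0; i < n_pixels; i++)
            if (lum[i] > peak)
                peak = lum[i];
        /* linear headroom model: scale supply with peak luminance */
        return VDD_MIN + (VDD_MAX - VDD_MIN) * peak / 255.0f;
    }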
Title | A Reconfigurable Accelerator for Neuromorphic Object Recognition |
Author | Jagdish Sabarad, Srinidhi Kestur, Mi Sun Park, Dharav Dantara, *Vijaykrishnan Narayanan (The Pennsylvania State University, U.S.A.), Yang Chen, Deepak Khosla (HRL Laboratories, U.S.A.) |
Page | pp. 813 - 818 |
Keyword | accelerator, neuromorphic vision, object recognition, fpga, convolution |
Abstract | Advances in neuroscience have enabled researchers to develop computational models of auditory, visual and learning perceptions in the human brain. HMAX, a biologically inspired model of the visual cortex, has been shown to outperform standard computer vision approaches for multi-class object recognition. While computationally demanding, HMAX can potentially be applied in applications such as autonomous vehicle navigation, unmanned surveillance and robotics. In this paper, we present a reconfigurable hardware accelerator for the time-consuming S2 stage of the HMAX model. The accelerator leverages spatial parallelism and dedicated wide data buses with on-chip memories to provide an energy-efficient solution that enables adoption into embedded systems. We present a systolic-array-based architecture that includes a run-time reconfigurable convolution engine capable of performing multiple variable-sized convolutions in parallel. An automation flow for this accelerator is described that can generate optimal hardware configurations for a given algorithmic specification and also perform run-time configuration and execution seamlessly. Experimental results on Virtex-6 FPGA platforms show 5X to 11X speedups and 14X to 33X higher performance-per-Watt over a CNS-based implementation on a Tesla GPU. |
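For context, the S2 stage of HMAX correlates a bank of stored prototype patches of several sizes against the C1 feature maps, which is what makes it convolution-dominated. The C sketch below shows one common formulation of a single S2 response (a Gaussian radial-basis similarity); the exact response function and the names used are our assumptions, not taken from the paper:

    /* Illustrative HMAX-style S2 response: slide a k x k prototype over
     * a C1 feature map and compute a Gaussian radial-basis similarity at
     * each position. The response function and names are assumptions. */
    #include <math.h>

    void s2_response(const float *c1, int w, int h,     /* C1 map          */
                     const float *proto, int k,         /* k x k prototype */
                     float sigma, float *out)           /* (w-k+1)x(h-k+1) */
    {
        for (int y = 0; y + k <= h; y++)
            for (int x = 0; x + k <= w; x++) {
                float dist2 = 0.0f;
                for (int dy = 0; dy < k; dy++)
                    for (int dx = 0; dx < k; dx++) {
                        float d = c1[(y + dy) * w + (x + dx)]
                                - proto[dy * k + dx];
                        dist2 += d * d;
                    }
                out[y * (w - k + 1) + x] =
                    expf(-dist2 / (2.0f * sigma * sigma));
            }
    }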
Title | Efficient Implementation of Multi-Moduli Architectures for Binary-to-RNS Conversion |
Author | *Hector Pettenghi (Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Portugal), Leonel Sousa (Instituto Superior Tecnico (IST)/ Instituto de Engenharia de Sistemas e Computadores (INESC-ID), Portugal), Jude Angelo Ambrose (School of Computer Science and Engineering, University of New South Wales, Australia) |
Page | pp. 819 - 824 |
Keyword | Residue number system, Binary-to-RNS converters, memory-less processors, Digital Signal Processing |
Abstract | This paper presents a novel approach to improving existing Binary-to-RNS multi-moduli architectures. Multi-moduli architectures can be implemented serially or in parallel. A novel choice of the weights associated with the inputs provides a significant improvement when applied to the most efficient multi-moduli architectures known to date. Experimental results suggest that the proposed memory-less multi-moduli architectures achieve speedups of 1.94 and 1.62 for parallel and serial implementations, respectively, in comparison with the most efficient state-of-the-art structures. |
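By way of background, Binary-to-RNS conversion for the common moduli set {2^n - 1, 2^n, 2^n + 1} exploits the fact that each n-bit chunk of the input carries a fixed weight modulo each modulus (2^n ≡ 1 mod 2^n - 1, and 2^n ≡ -1 mod 2^n + 1), which is why the choice of input weights matters. The C sketch below illustrates only this chunk-weight idea, not the paper's specific architecture or weight assignment:

    /* Illustrative Binary-to-RNS conversion for {2^n - 1, 2^n, 2^n + 1}
     * (assumes 0 < n < 64): split x into n-bit chunks and combine them
     * with per-modulus weights (+1 for 2^n - 1, alternating +/-1 for
     * 2^n + 1). A software sketch of the idea, not the paper's design. */
    #include <stdint.h>

    void bin_to_rns(uint64_t x, unsigned n, uint64_t r[3])
    {
        const uint64_t m1 = (1ULL << n) - 1;   /* 2^n - 1 */
        const uint64_t m3 = (1ULL << n) + 1;   /* 2^n + 1 */
        uint64_t s1 = 0;
        int64_t  s3 = 0;
        int sign = 1;

        r[1] = x & m1;                          /* residue mod 2^n is a mask */
        for (uint64_t v = x; v; v >>= n) {
            uint64_t chunk = v & m1;
            s1 += chunk;                        /* weight +1     mod 2^n - 1 */
            s3 += sign * (int64_t)chunk;        /* weight (-1)^i mod 2^n + 1 */
            sign = -sign;
        }
        r[0] = s1 % m1;
        r[2] = (uint64_t)(((s3 % (int64_t)m3) + (int64_t)m3) % (int64_t)m3);
    }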