

#### Development of Full-HD Multi-standard Video CODEC IP Based on Heterogeneous Multiprocessor Architecture

<u>H.Nakata<sup>1</sup></u>, K.Hosogi<sup>1</sup>, M.Ehama<sup>1</sup>, T.Yuasa<sup>1</sup>, T.Fujihira<sup>1</sup>, K.Iwata<sup>2</sup>, M.Kimura<sup>2</sup>, F.Izuhara<sup>2</sup>, S.Mochizuki<sup>2</sup>, M.Nobori<sup>2</sup>

<sup>1</sup>Embedded System Platform Laboratory, Central Research Laboratory, Hitachi, Ltd.

<sup>2</sup>System Design Div., System Solution Business Group, Renesas Technology Corp.

# 1. Introduction

- 2. Multiprocessor Architecture for Video CODEC
- 3. Development Methodology
- 4. Implementation Results
- 5. Summary and Conclusions


The number of video codec standards is increasing...

MPEG-1, MPEG-2, MPEG-4, H.263, H.264 (MPEG-4/AVC), VC-1, etc.

Video resolution is getting higher...

Many consumer devices now support full-HD.



## Our target for video CODEC



To meet this target, we applied a heterogeneous multiprocessor architecture to a video CODEC.

#### CODEC IP applicable to many purposes



HDL: Hardware Description Language

# 2. Multiprocessor Architecture for Video CODEC

## Top level architecture

- All modules are connected to SBUS
- SBUS is built from 2 unidirectional **shift-register-based** 64-bit buses
- The 2 buses run in opposite directions
- Some modules use our original programmable processors
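The behaviour of two opposite shift-register buses can be illustrated with a toy model. This is only a sketch under assumptions that are not in the slides: the modules are taken to sit in a linear chain, a word is assumed to advance one module position per cycle, and the names `Word`, `ShiftBus`, and `send` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    dst: int       # index of the destination module (assumed addressing scheme)
    payload: int   # one 64-bit data word


class ShiftBus:
    """One unidirectional bus: one register per module, shifted every cycle."""

    def __init__(self, n_modules, step):
        self.slots = [None] * n_modules
        self.step = step  # +1 shifts toward higher indices, -1 toward lower ones

    def inject(self, src, word):
        """Place a word in the register at module `src` if it is free."""
        if self.slots[src] is not None:
            return False
        self.slots[src] = word
        return True

    def tick(self):
        """Shift every word by one position and return the words that arrived."""
        n = len(self.slots)
        shifted = [None] * n
        delivered = []
        for pos, word in enumerate(self.slots):
            if word is None:
                continue
            new_pos = pos + self.step
            if new_pos == word.dst:
                delivered.append(word)
            elif 0 <= new_pos < n:
                shifted[new_pos] = word
        self.slots = shifted
        return delivered


def send(buses, src, dst, payload):
    """Send one word on the bus that points from src to dst; return the latency in cycles."""
    if src == dst:
        raise ValueError("source and destination must differ")
    forward, backward = buses
    bus = forward if dst > src else backward
    bus.inject(src, Word(dst, payload))
    cycles = 0
    while True:
        cycles += 1
        if bus.tick():
            return cycles


buses = (ShiftBus(6, +1), ShiftBus(6, -1))
print(send(buses, 1, 4, 0xAB))  # travels on the forward bus
print(send(buses, 5, 0, 0xCD))  # travels on the backward bus
```

In this model the latency is simply the distance between the two modules, and having a bus in each direction means any module can reach any other without wrapping around.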



## Separate stream domain and pixel domain

- Separate both domains by intermediate stream buffers
- Optimize performance for each domain



Note) This figure shows the decode process. Data transfer directions are opposite for the encode process.

## Distribute to multiple intermediate streams

The pixel domain has 2 CEs, which work in parallel

VLCS has to distribute an intermediate stream to both CEs for the decode process



Note) This figure shows the decode process. The data flow is opposite for the encode process.

## Stream domain operation cycle budgeting



Reserve 100 fixed operation cycles per MB and assign 3 cycles/bit for the bits in a stream. (This meets the 40 Mbps performance target, including a 10% margin.)
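The budget can be checked with a few lines of arithmetic. The sketch below uses only figures that appear in these slides (162 MHz clock, 244,800 MB/s for full-HD, 40 Mbps); the function name is hypothetical.

```python
# Figures taken from the slides
CLOCK_HZ = 162_000_000        # target operating frequency
MB_RATE = 244_800             # full-HD macroblocks per second
BITRATE_BPS = 40_000_000      # target stream bit rate
FIXED_CYCLES_PER_MB = 100     # fixed reservation per macroblock
CYCLES_PER_BIT = 3            # budget per stream bit


def stream_domain_budget(clock_hz=CLOCK_HZ, mb_rate=MB_RATE, bitrate_bps=BITRATE_BPS):
    """Return (cycles required per second, fraction of the clock left as margin)."""
    required = mb_rate * FIXED_CYCLES_PER_MB + bitrate_bps * CYCLES_PER_BIT
    margin = (clock_hz - required) / clock_hz
    return required, margin


required, margin = stream_domain_budget()
print(f"required: {required / 1e6:.2f} Mcycles/s, margin: {margin:.1%}")
# prints "required: 144.48 Mcycles/s, margin: 10.8%"
```

The per-bit term dominates: about 120 of the 144 Mcycles/s come from the 3 cycles/bit allowance, which is why the stream domain is budgeted by bit rate rather than by macroblock count.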

## Intermediate stream compaction

- Intermediate stream is compacted by a simple coding method
  - Coded by
    - 1. fixed length code (FLC)
    - 2. FLC and Exp-Golomb combined code (EGFLC)
  - EGFLC is used for coefficients and MVs.
    - Intermediate stream can be encoded and decoded fast by simple logic
    - Reduces the size of the intermediate buffer and the bandwidth for intermediate data transfer
    - EGFLC is about 20% smaller than the normal Exp-Golomb code in our case.
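For reference, the normal Exp-Golomb code used as the baseline in the comparison above can be sketched as follows. This is the standard order-0 code with the signed mapping used by H.264 `se(v)`, not EGFLC itself, whose bit layout the slides do not give; bit strings are used instead of packed bits purely for readability, and the function names are illustrative.

```python
def ue_encode(code_num):
    """Unsigned order-0 Exp-Golomb: N leading zeros, then (code_num + 1) in N + 1 bits."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    binary = bin(code_num + 1)[2:]
    return "0" * (len(binary) - 1) + binary


def ue_decode(bits, pos=0):
    """Decode one unsigned code starting at `pos`; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def se_encode(value):
    """Signed mapping: 0, 1, -1, 2, -2, ... map to code numbers 0, 1, 2, 3, 4, ..."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return ue_encode(code_num)


def se_decode(bits, pos=0):
    """Decode one signed code starting at `pos`; return (value, next position)."""
    code_num, end = ue_decode(bits, pos)
    magnitude = (code_num + 1) // 2
    return (magnitude if code_num % 2 == 1 else -magnitude), end


# Round trip of a short list of hypothetical motion-vector components
mvs = [0, 1, -1, 3]
stream = "".join(se_encode(v) for v in mvs)
decoded, pos = [], 0
while pos < len(stream):
    value, pos = se_decode(stream, pos)
    decoded.append(value)
print(stream, decoded)  # prints "101001100110 [0, 1, -1, 3]"
```

The decoder needs only a leading-zero count and a fixed-width read, which is the kind of simple logic the slide refers to.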



## VLCS structure

- Stream syntax is analyzed by our original 2-way LIW processor, STX, except for some syntax elements
- Some dedicated circuits are provided for performance (40 Mbps at 162 MHz)
- VSVLC decodes/encodes various variable length codes for stream I/O.




## Syntax analysis processor (STX)

- Two 32-bit instruction slots are available

STX instruction slot assignments

| Inst. slot A (32 bit)                           | Inst. slot B (32 bit) |
|-------------------------------------------------|-----------------------|
| load/store<br>stream I/O<br>accelerator control |                       |

Usage rate of both instruction slots

| Stream Type | Rate |
|-------------|------|
| H.264 CAVLC | 32%  |
| H.264 CABAC | 38%  |
| MPEG-2      | 48%  |
| MPEG-4      | 45%  |
| VC-1        | 46%  |



## Pixel-domain operation cycle budgeting

The required operation amount per MB does not vary much

Assign an operation cycle budget per macroblock

Full-HD (1920 × 1080, 30 fps) video MB rate: 244,800 MB/s

Target operation frequency: 162 MHz

Only 661 cycles are available per MB in a processing pipeline stage

Too strict for processor-based operation (an MB has 384 pixels for luma and chroma)

Assign $661 \times 2 = 1,322$ cycles by **2 parallel processing** (**1,200** cycles for actual operation, 122 cycles for margin)
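The per-macroblock budget follows directly from the frame size, frame rate, and clock. A minimal sketch of that arithmetic, assuming 16 × 16 macroblocks with 4:2:0 chroma (which is what 384 pixels per MB implies) and a picture height rounded up to a whole number of macroblock rows; the function name is hypothetical.

```python
CLOCK_HZ = 162_000_000       # target operating frequency
WIDTH, HEIGHT, FPS = 1920, 1080, 30
MB_SIZE = 16                 # macroblock edge in luma pixels
NUM_CES = 2                  # codec elements working in parallel
OPERATION_CYCLES = 1200      # cycles assigned to actual operation


def pixel_domain_budget():
    """Derive the per-MB cycle budget of one pipeline stage."""
    mb_cols = -(-WIDTH // MB_SIZE)            # ceiling division
    mb_rows = -(-HEIGHT // MB_SIZE)           # 1080 lines need 68 MB rows
    mb_rate = mb_cols * mb_rows * FPS         # macroblocks per second
    cycles_per_mb = CLOCK_HZ // mb_rate       # budget with a single CE
    pixels_per_mb = MB_SIZE * MB_SIZE + 2 * (MB_SIZE // 2) ** 2  # luma + 2 chroma blocks
    stage_budget = NUM_CES * cycles_per_mb    # budget when 2 MBs are processed in parallel
    return {
        "mb_rate": mb_rate,
        "cycles_per_mb": cycles_per_mb,
        "pixels_per_mb": pixels_per_mb,
        "cycles_per_pixel_single_ce": cycles_per_mb / pixels_per_mb,
        "stage_budget": stage_budget,
        "margin": stage_budget - OPERATION_CYCLES,
    }


for name, value in pixel_domain_budget().items():
    print(f"{name}: {value}")
```

With a single CE there are fewer than two cycles per pixel in each pipeline stage, which is why the budget is considered too strict for a programmable processor and two CEs are used.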

### Hierarchical parallel processing

- Pixel domain uses hierarchical parallel processing technique
  - 1. 2 MBs are processed by 2 codec elements (CEs) in parallel
  - 2. Each MB is processed in a pipeline: each module is assigned as a pipeline stage.
  - 3. Parallel processing is executed within each module: processor-type modules have several tiny processor elements.



#### Pixel domain processor (Programmable Image Processing Element: PIPE)



#### PIPE based on MIAD architecture (MIAD: Multiple Instruction Arrayed Data)

- LD-CPU, Media-CPU, and ST-CPU each have their own program counter
- These CPUs synchronize with each other by sync flags in the operation code



## **PIPE** extension

PIPE instruction set is extended for each module

- Major extensions are added to Media-CPU
- Some data setup operation extensions are added to LD/ST-CPU



## Hybrid architecture

A CE works by combining PIPEs and dedicated circuits

- PIPE architecture is optimized for 2D arrayed pixel processing
- Dedicated circuits are used for the functions that PIPE handles inefficiently

Modules implemented by dedicated circuits in pixel-domain

| Module name | Main functions                                                     | Reasons to use dedicated circuits |
|-------------|--------------------------------------------------------------------|-----------------------------------|
| VLCF        | decode/encode intermediate stream<br>MV calculation                | PIPE is inefficient               |
| PMD         | intra prediction mode selection (used by H.264 encode process only) | logic size                        |
| LMC         | internal line buffer control                                       | PIPE is inefficient               |
| CME         | coarse motion estimation and compensation                          | performance                       |
| MEC         | frame buffer access control for CME operations                     | PIPE is inefficient               |

# 3. Development Methodology

## Design flow



All modules are connected to SBUS

The SBUS traffic of the C model is designed to be the same as that of the RTL



## Firmware development

- Processors (STX and PIPE) are designed as C-language-based models
  - Processor models in the C model can take binary code
  - Processor models are cycle accurate



- Rough performance evaluated in C model design
- Revise architecture if any problems found

- Firmware is developed in assembly language, because...
  - Firmware code size is small
  - It saves the time needed to develop high-level language tools

#### Concurrent C model development

- An intermediate stream generator was developed for concurrent design



## **VLCS RTL verification**

- Difficult to produce the same traffic in the C model and the RTL for VLCS
  - Multiple streams are transferred by the local DMAC (LDMAC) (impossible to predict the stream data transfer order)
  - VLCS works tightly with the global DMAC (GDMAC) for stream handling (a GDMAC model is required as the test environment)
    - Verify final result values in internal and external memories
    - Use the real GDMAC model for the test environment
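Checking final memory contents instead of bus traffic can be illustrated with a small comparison routine. This is a hypothetical sketch, not the project's actual test bench: memories are modelled as address-to-value dictionaries, and `apply_writes` and `diff_memory` are illustrative names.

```python
from typing import Dict, Iterable, List, Optional, Tuple


def apply_writes(writes: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Replay (address, value) writes in the given order and return the final memory image."""
    memory: Dict[int, int] = {}
    for addr, value in writes:
        memory[addr] = value
    return memory


def diff_memory(
    expected: Dict[int, int], actual: Dict[int, int]
) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (address, expected value, actual value) for every location that differs."""
    mismatches = []
    for addr in sorted(expected.keys() | actual.keys()):
        if expected.get(addr) != actual.get(addr):
            mismatches.append((addr, expected.get(addr), actual.get(addr)))
    return mismatches


# The same transfers arriving in a different order give the same final image.
c_model_writes = [(0x100, 0x11), (0x200, 0x22), (0x104, 0x33)]
rtl_writes = [(0x200, 0x22), (0x100, 0x11), (0x104, 0x33)]
print(diff_memory(apply_writes(c_model_writes), apply_writes(rtl_writes)))  # prints []

# A wrong value is still caught.
bad_rtl_writes = [(0x200, 0x22), (0x100, 0x11), (0x104, 0x34)]
print(diff_memory(apply_writes(c_model_writes), apply_writes(bad_rtl_writes)))
```

Because only the end state is compared, the check tolerates any interleaving of the LDMAC transfers as long as they target different addresses, while still catching wrong data.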





## PIPE based module RTL design & verification



## Verification using FPGA

- FPGAs used for detailed verification
- How to implement a large IP on FPGAs
  - Allocate to 9 FPGAs (Xilinx Virtex-4 XC4VLX200)
  - Connect FPGAs using SBUS
  - Verify encoder mode and decoder mode separately (remove unnecessary logic for each mode)

Bugs found by FPGA verification

- Stall control
- Interrupt control
- Synchronization between processors
- Error stream handling
- Corner cases (Need to verify with many video streams)

## Adding codec standard support

- Codec standards added step by step
  - The basic IP architecture anticipates adding codec standard support
  - But supporting even one codec standard requires much work...



3 phases for IP development

- first phase
  - Designed basic architecture for multi-codec support
  - Designed detailed logic for H.264/MPEG-4 AVC w/o MBAFF (decode/encode)
- second phase
  - Added support for MPEG-2 and MPEG-4 (decode/encode)
  - Optimized PIPE micro architecture for logic size compaction
- third phase
  - Added support for H.264/MPEG-4 AVC MBAFF
  - Added support for VC-1 (decode only)

For codec support extension, firmware and additional RTL are developed

# 4. Implementation Results

## **Developed CODEC IP**

- The IP was developed in 3 phases
- The 3<sup>rd</sup> phase IP development has been completed

| Development Phase                                                                               | Phase 1                             | Phase 2                                                 | Phase 3                                                                     |
|-------------------------------------------------------------------------------------------------|-------------------------------------|---------------------------------------------------------|-----------------------------------------------------------------------------|
| VLCS Logic<br>[Relative logic size]                                                             | 240kG<br>[1.00]                     | 289kG<br>[1.20]                                         | 337kG<br>[1.40]                                                             |
| PIPE-Based Logic<br>(Sum of all PIPE based<br>modules in the CODEC IP)<br>[Relative logic size] | 2694kG<br>[1.00]                    | 2475kG (*1)<br>[0.92]                                   | 2712kG<br>[1.01]                                                            |
| Supported Codec<br>Standard                                                                     | H.264/<br>MPEG-4 AVC<br>(w/o MBAFF) | H.264/<br>MPEG-4 AVC<br>(w/o MBAFF)<br>MPEG-2<br>MPEG-4 | H.264/<br>MPEG-4 AVC<br>(w/ MBAFF)<br>MPEG-2<br>MPEG-4<br>VC-1(decode only) |

(\*1) Smaller than phase 1 because of PIPE micro architecture optimization
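The relative logic sizes in the table are each phase's gate count divided by the phase 1 value. A short sketch that reproduces them from the kG figures above (the dictionary and function names are illustrative):

```python
# Gate counts in kG, copied from the table
VLCS_KG = {"Phase 1": 240, "Phase 2": 289, "Phase 3": 337}
PIPE_KG = {"Phase 1": 2694, "Phase 2": 2475, "Phase 3": 2712}


def relative_sizes(sizes, baseline="Phase 1"):
    """Normalize every phase to the baseline phase, rounded to two decimals."""
    base = sizes[baseline]
    return {phase: round(kg / base, 2) for phase, kg in sizes.items()}


print(relative_sizes(VLCS_KG))  # {'Phase 1': 1.0, 'Phase 2': 1.2, 'Phase 3': 1.4}
print(relative_sizes(PIPE_KG))  # {'Phase 1': 1.0, 'Phase 2': 0.92, 'Phase 3': 1.01}
```

The VLCS logic grows by about 20% with each phase, while the PIPE-based logic ends phase 3 within about 1% of its phase 1 size despite the added standards.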

## Sample implementation results on a chip

- The 1<sup>st</sup> phase IP has been implemented in the test chip

| Technology                                       | 65 nm, 7-layer, Cu, CMOS                                        |
|--------------------------------------------------|-----------------------------------------------------------------|
| Supply Voltage                                   | 1.2 V (Internal)<br>1.8 V (I/O)                                 |
| Clock Frequency                                  | 162 MHz (Internal)<br>324 MHz (DDR-SDRAM I/O)                   |
| Supported Codec<br>Standard                      | H.264/MPEG-4 AVC (w/o MBAFF)<br>High profile level 4.1          |
| Performance                                      | 1920x1080 30 fps<br>40 Mbps (CABAC)                             |
| CODEC Logic                                      | 3745 kG                                                         |
| CODEC Internal<br>Memory                         | 228 kB                                                          |
| Measured Power<br>Consumption<br>(excluding I/O) | Encoding: 256 mW<br>Decoding: 172 mW<br>(both for full-HD case) |

Figure: Micrograph of the test chip (\*1), © 2008 IEEE

(\*1) K.Iwata, et al. "A 256mW Full-HD H.264 High-Profile CODEC Featuring Dual Macroblock-Pipeline Architecture in 65nm CMOS," 2008 Symposium on VLSI Circuits Digest of Technical Papers, pp.102-103

### **Design comparison**



# 5. Summary and Conclusions

- 1. A multi-standard video CODEC IP has been developed.
- 2. The IP can handle full-HD (1920 × 1080, 30 fps) video at 162 MHz for MPEG-2, MPEG-4, and H.264 decode/encode. VC-1 is supported for decode only.
- 3. The IP adopts a heterogeneous multiprocessor architecture; it uses 2 kinds of processors, STX and PIPE, and PIPE was extended for each module.
- 4. A test chip was developed with the 1<sup>st</sup> phase IP; the CODEC consumes only 256 mW for full-HD H.264 encode and 172 mW for decode. This power consumption is very low even though we used processors for flexibility.

#### Acknowledgement

- Thank you to everyone who gave me this opportunity to present.
- I want to say to all of you

