# Data-Model-Circuit Tri-Design for Ultra-Light Video Intelligence on Edge Devices

Yimeng Zhang<sup>1</sup>, **Akshay Kamath<sup>2</sup>**, Qiucheng Wu<sup>3</sup>, Zhiwen Fan<sup>4</sup>, Wuyang Chen<sup>4</sup>, Zhangyang Wang<sup>4</sup>, Shiyu Chang<sup>3</sup>, Sijia Liu<sup>1</sup>, Cong Hao<sup>2</sup>

<sup>1</sup>*Michigan State University, USA* <sup>2</sup>*Georgia Institute of Technology, USA* <sup>3</sup>*University of California at Santa Barbara, USA* <sup>4</sup>*University of Texas at Austin, USA*



ASP-DAC 2023

## **Speaker Bio**

### ECE graduate student at Georgia Tech, USA

2021 - 2023

*Thesis Advisor: Prof. Cong (Callie) Hao*

*Research Area: Hardware accelerators for deep learning*



Akshay Kamath



Interned in the SoC Design and Integration team at Apple Silicon Engineering Group, Cupertino, USA



RTL Design Engineer at Samsung Semiconductor India Research

Cryptographic accelerator IPs, SerDes IPs, and test chips.



### Bachelor's degree from NITK Surathkal in India

Electronics and Communication Engineering

MITACS Research Intern at University of Toronto in Summer 2016

### **Overview**

- Multiple Object Tracking (MOT) on HD Video Stream
  - Challenges for MOT deployment on Edge Devices
  - Data-Model-Circuit Tri-Design Framework and its Efficacy
    - Temporal Frame Filtering & Spatial Saliency Focusing
    - Hardware-friendly Sparsity Pattern-aware Pruning
    - Scalable Dataflow-style FPGA-based Accelerator

# Multiple Object Tracking (MOT)

- Basis for video intelligence in
  - Autonomous driving
  - Virtual reality
  - Medical imaging
- Involves object
  - Detection
  - Localization
  - Association



Source: <u>BDD100K</u> Driving Dataset Example for Object Tracking

### **Problem & Motivation**

### Accurate real-time High-Definition (HD) video processing for MOT

### **Critical Limitations:**

- Computationally-expensive ML models
  - High in accuracy but not hardware friendly
- Massive size of video data as input
  - Further adds to computational complexity
- Poor scalability of hardware devices
  - Limited computing resources and power
  - Low degree of parallelism by design tools



### **Challenges & Approaches**



1. MOT implementation prohibitively expensive in energy and latency due to large data size
   - Eliminate temporal and spatial redundancies in video frames to reduce data complexity
2. Gigantic number of parameters in state-of-the-art MOT models
   - Hardware-friendly model compression necessary
3. Lack of software-hardware full-stack design approach
   - Realistic implementation and evaluation on hardware

No prior work addresses 1-3 in a single unified MOT implementation pipeline!

### **Proposed Data-Model-Circuit Tri-Design Framework**

State-of-the-art MOT Model: QDTrack<sup>1</sup> with BDD100K dataset

- Faster R-CNN with Feature Pyramid Network (FPN) backbone
- Contrastive Learning to optimize backbone parameters
- Bi-directional Softmax for object association and tracking
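The bi-directional softmax used for association can be illustrated with a short NumPy sketch. It averages a softmax taken over tracks with one taken over detections, so a pair scores highly only when the match is mutual. The function name `bi_softmax`, the random embeddings, and the argmax assignment are assumptions made for illustration, not the QDTrack implementation.

```python
import numpy as np

def bi_softmax(sim: np.ndarray) -> np.ndarray:
    """Average of row-wise and column-wise softmax of a similarity matrix.

    sim[i, j] is the embedding similarity between detection i and track j.
    """
    # Subtract the max before exponentiating for numerical stability.
    row = np.exp(sim - sim.max(axis=1, keepdims=True))
    row /= row.sum(axis=1, keepdims=True)    # detection -> track direction
    col = np.exp(sim - sim.max(axis=0, keepdims=True))
    col /= col.sum(axis=0, keepdims=True)    # track -> detection direction
    return 0.5 * (row + col)

# Hypothetical embeddings: 3 detections, 2 existing tracks, 4-D features.
rng = np.random.default_rng(0)
det = rng.normal(size=(3, 4))
trk = rng.normal(size=(2, 4))
scores = bi_softmax(det @ trk.T)
best_track = scores.argmax(axis=1)           # naive per-detection assignment
```

A real tracker would also threshold the scores and handle unmatched detections; that logic is omitted here.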



<sup>1</sup>Pang et al., *Quasi-Dense Similarity Learning for Multiple Object Tracking*, CVPR 2021.



### **Data-Model-Hardware Tri-Design Overview**



## **RL-based Model for Temporal Frame Filtering**



### **Frame Filtering Model Architecture**



## **Temporal Frame Filtering Model Behavior**

#### **Before Frame Dropping**





#### After Frame Dropping



## **Spatial Saliency Focusing**



#### Spatial Saliency Reduction (SSR) Method

- Decompose input image into multiple patches
- Obtain saliency score using Sobel operator
- Create binary masks based on the scores
- Apply to both image and feature spaces
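A minimal sketch of these four steps, assuming a grayscale input whose height and width are multiples of the patch size. Scoring each patch by its mean Sobel gradient magnitude and keeping a fixed fraction of patches (`keep_ratio`) are illustrative choices, and all function names are hypothetical.

```python
import numpy as np

def sobel_magnitude(gray: np.ndarray) -> np.ndarray:
    """Gradient magnitude of a 2-D image using 3x3 Sobel kernels."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    padded = np.pad(gray.astype(float), 1, mode="edge")
    h, w = gray.shape
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[dy:dy + h, dx:dx + w]
            gx += kx[dy, dx] * window
            gy += ky[dy, dx] * window
    return np.hypot(gx, gy)

def saliency_patch_mask(gray, patch=60, keep_ratio=0.5):
    """Binary (H, W) mask that keeps the most salient patches."""
    h, w = gray.shape
    ph, pw = h // patch, w // patch           # assumes H, W divisible by patch
    mag = sobel_magnitude(gray)
    # Per-patch saliency score: mean gradient magnitude inside the patch.
    scores = mag.reshape(ph, patch, pw, patch).mean(axis=(1, 3))
    n_keep = max(1, int(round(keep_ratio * scores.size)))
    threshold = np.sort(scores.ravel())[-n_keep]
    patch_mask = (scores >= threshold).astype(np.uint8)
    # Expand the patch-level decision back to pixel resolution.
    return np.kron(patch_mask, np.ones((patch, patch), dtype=np.uint8))

# Synthetic 120x120 frame: only the top-left 60x60 patch has texture.
rng = np.random.default_rng(0)
img = np.zeros((120, 120))
img[:60, :60] = rng.integers(0, 256, size=(60, 60))
mask = saliency_patch_mask(img, patch=60, keep_ratio=0.25)
masked_img = img * mask
```

The same mask could be subsampled to a feature map's resolution (for example `mask[::stride, ::stride]` for an assumed stride) to apply it in feature space.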

#### Highlights

- Drop unimportant pixels
- Align with visual spatial saliency; easy to compute
- Eliminate redundancy in both images and features

### **Saliency Score and Patch Size Estimation**



Saliency score performs better than random patch dropping

Hardware-friendly "sweet" patch size: 60 x 60

### **Spatial Saliency Focusing Behavior**



# Hardware-aware Model Pruning

SOTA Iterative Magnitude Pruning (IMP) applied to QDTrack

| Metrics             | Dense Model (0%) | Global Pruning (80%) | Global Pruning (85%) | Global Pruning (90%) |
|---------------------|------------------|----------------------|----------------------|----------------------|
| IDF1 ↑              | 0.714            | 0.712                | 0.706                | 0.703                |
| MOTA ↑              | 0.637            | 0.631                | 0.627                | 0.624                |
| Sparse Kernel Ratio | 0%               | 36.59%               | 34.44%               | 32.53%               |

#### But... IMP is unstructured

- ✗ Unstructured pruning: hinders parallelization
- ✓ Structured pruning: hardware-friendly

Sparse Kernel Ratio: ratio of pruned 3 x 3 kernel weights over the total number of pruned weights

Kernel-wise sparsity embedded in irregular weight pruning on QDTrack!
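One way to measure this ratio on a pruned convolution layer is sketched below. It assumes that "pruned 3 x 3 kernel weights" means weights sitting in kernels whose entries are all zero, and that a pruned weight is stored as exactly zero; `sparse_kernel_ratio` is a hypothetical helper name.

```python
import numpy as np

def sparse_kernel_ratio(weight: np.ndarray) -> float:
    """weight: conv weights of shape (C_out, C_in, kH, kW); zeros are pruned."""
    pruned = (weight == 0)
    total_pruned = int(pruned.sum())
    if total_pruned == 0:
        return 0.0
    kernel_size = weight.shape[2] * weight.shape[3]
    # A kernel counts as sparse only when every one of its weights is pruned.
    zero_kernels = pruned.reshape(*weight.shape[:2], -1).all(axis=-1)
    return float(zero_kernels.sum() * kernel_size / total_pruned)

# Toy layer: one fully pruned kernel (9 weights) plus 3 scattered zeros.
w = np.ones((2, 2, 3, 3))
w[0, 0] = 0.0
w[1, 1, 0, :] = 0.0
ratio = sparse_kernel_ratio(w)   # 9 of the 12 pruned weights are kernel-wise
```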

### **Irregular Pattern-aware Pruning**

- For uncovered sparse weights
  - From kernel-wise sparse patterns
- Effective area of a convolution kernel
  - Maintains specific sparse patterns
  - May not yield kernels with all zero weights
- Pre-defined irregular sparse patterns<sup>[1]</sup>
  - Leveraged for 3 x 3 kernels
- Fixed number of such sparse patterns
  - Facilitates efficient hardware implementation



#### **Pattern Pruning**

<sup>[1]</sup>Ma et al., *An Image Enhancing Pattern-based Sparsity for Real-time Inference on Mobile Devices*, ECCV 2020.

## **Proposed Pruning Method**

- Find irregular weight sparse mask $M_0$
  - As per model weight magnitudes
- Extract kernel-wise sparsity mask $M_k$ from $M_0$
- Find irregular pattern-aware mask $M_p$
  - On remaining weights identified by $(1 - M_k)$
- **Retrain** non-zero model weights
  - Under fixed pruning mask $(M_k + M_p)$
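The mask construction in the first three steps can be sketched in NumPy as follows. In every mask, 1 marks a pruned weight, which matches the use of $(1 - M_k)$ for the remaining weights. The four-entry `PATTERNS` library, the global quantile threshold, and the rule of picking the pattern that preserves the most magnitude are assumptions for illustration; they are not the pattern set of [1] or the exact selection rule used in this work. Retraining is not shown.

```python
import numpy as np

# Hypothetical pattern library: each 3x3 pattern marks the 4 weights to KEEP.
PATTERNS = np.array([
    [[1, 1, 0], [1, 1, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 1, 1], [0, 0, 0]],
    [[0, 0, 0], [1, 1, 0], [1, 1, 0]],
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
], dtype=bool)

def build_masks(weight, prune_ratio):
    """Return boolean (M_0, M_k, M_p); True marks a pruned weight."""
    mag = np.abs(weight)
    # Step 1: irregular magnitude-based mask M_0 with a global threshold.
    threshold = np.quantile(mag, prune_ratio)
    m0 = mag < threshold
    # Step 2: kernel-wise mask M_k = kernels that M_0 prunes entirely.
    kernel_pruned = m0.all(axis=(2, 3), keepdims=True)
    mk = np.broadcast_to(kernel_pruned, weight.shape)
    # Step 3: on the remaining kernels (1 - M_k), keep the pattern that
    # preserves the most magnitude and prune everything outside it.
    kept_mag = np.einsum("oihw,phw->oip", mag, PATTERNS.astype(float))
    best = kept_mag.argmax(axis=-1)               # shape (C_out, C_in)
    mp = ~PATTERNS[best] & ~mk
    return m0, mk, mp

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 4, 3, 3))
m0, mk, mp = build_masks(w, prune_ratio=0.8)
final_mask = mk | mp          # M_k + M_p (the two masks never overlap)
w_pruned = w * ~final_mask    # weights that would then be retrained
```

Because only a fixed, small number of patterns can appear, a hardware kernel needs to support just those few non-zero layouts, which is the property the slides highlight.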

Our method achieves MOT performance comparable to IMP across different pruning ratios!



#### Channel-wise pruning sensitive to pruning ratio

## **Computation Complexity Comparison**



Baseline models and our tri-design approaches

### **Back-end Hardware Accelerator Overview**

- a) Feature maps partitioned
  - Limited on-chip memory
  - Tile size same as patch
- b) Parallel computations
  - Unrolling each row
- c) Overlapping operations
  - Effectively single-cycle
- d) Pruning-aware design
  - Skip pruned channels
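Points (a) and (d) can be modeled in software with the sketch below: the feature map is processed tile by tile, each tile is loaded with a one-pixel halo as a stand-in for an on-chip buffer, and fully zero kernels are skipped. This is a behavioral illustration in NumPy, not the FPGA design; treating a "pruned channel" as an all-zero (output, input) kernel, the zero padding, and all names are assumptions.

```python
import numpy as np

def conv3x3_valid(x, k):
    """Valid 3x3 correlation of a 2-D array x with kernel k."""
    h, w = x.shape[0] - 2, x.shape[1] - 2
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * x[dy:dy + h, dx:dx + w]
    return out

def tiled_sparse_conv(fmap, weight, tile=60):
    """fmap: (C_in, H, W); weight: (C_out, C_in, 3, 3).

    Returns the (C_out, H, W) output and the number of skipped kernel passes.
    """
    c_out, c_in = weight.shape[:2]
    _, h, w = fmap.shape
    padded = np.pad(fmap, ((0, 0), (1, 1), (1, 1)))   # zero "same" padding
    out = np.zeros((c_out, h, w))
    live = np.abs(weight).sum(axis=(2, 3)) > 0        # kernels with any weight
    skipped = 0
    for y0 in range(0, h, tile):
        for x0 in range(0, w, tile):
            y1, x1 = min(y0 + tile, h), min(x0 + tile, w)
            # Tile plus a 1-pixel halo, as if loaded into on-chip memory.
            buf = padded[:, y0:y1 + 2, x0:x1 + 2]
            for o in range(c_out):
                for i in range(c_in):
                    if not live[o, i]:
                        skipped += 1                  # pruning-aware skip
                        continue
                    out[o, y0:y1, x0:x1] += conv3x3_valid(buf[i], weight[o, i])
    return out, skipped

rng = np.random.default_rng(0)
fmap = rng.normal(size=(3, 120, 90))
weight = rng.normal(size=(2, 3, 3, 3))
weight[0, 1] = 0.0                 # pretend pruning removed this kernel
out, skipped = tiled_sparse_conv(fmap, weight, tile=60)
```

Row unrolling and operation overlap, points (b) and (c), are hardware scheduling choices and have no direct counterpart in this sequential model.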





### **Scalable Multi-FPGA Dataflow Architecture**



### **Performance Comparison**

| Methods                | IDF1 ↑ | MOTA ↑ | Latency ↓ (ms) | EFR  | Power ↓ (W) | Energy Efficiency ↓ (J/frame) |
|------------------------|--------|--------|----------------|------|-------------|-------------------------------|
| QDTrack (GPU Baseline) | 0.714  | 0.637  | 60.9           | 22.5 | 296         | 13.2                          |
| QDTrack on FPGA        | 0.714  | 0.637  | 554.7          | 1.8  | 50.8        | 28.2                          |
| Tri-design (proposed)  | 0.704  | 0.617  | 44.4           | 37.6 | 50.8        | **1.35**                      |

- Standard MOT metrics: IDF1 score and Multi-Object Tracking Accuracy (MOTA).
- IDF1 emphasizes association accuracy, while MOTA is concerned with object detection accuracy.
- EFR (Effective Frame Rate) is indicative of throughput in processing video frames.
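The energy column appears to follow from the power and EFR columns. The sketch below assumes EFR is expressed in frames per second and that energy per frame is power divided by EFR; under that assumption the computed values agree with the table to within rounding.

```python
# (power in W, EFR) copied from the table; EFR is assumed to be frames/s.
results = {
    "QDTrack (GPU Baseline)": (296.0, 22.5),
    "QDTrack on FPGA": (50.8, 1.8),
    "Tri-design (proposed)": (50.8, 37.6),
}

def energy_per_frame(power_w: float, efr_fps: float) -> float:
    """Joules per frame = watts / (frames per second)."""
    return power_w / efr_fps

energy = {name: energy_per_frame(p, f) for name, (p, f) in results.items()}
for name, joules in energy.items():
    print(f"{name}: {joules:.2f} J/frame")

# Relative energy saving of the tri-design over the GPU baseline.
baseline = energy["QDTrack (GPU Baseline)"]
proposed = energy["Tri-design (proposed)"]
print(f"Energy reduction vs GPU baseline: {baseline / proposed:.1f}x")
```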

### **Tri-Design Summary w.r.t. State-of-the-Art Baseline**





Highly-efficient video processing algorithm/hardware pipeline

- DNN-based state-of-the-art MOT model
- BDD100K dataset contents as inputs
- FPGA cluster with Alveo U50 and ZCU104
- Data-model-circuit tri-design for MOT implementation on edge
  - Aggressive data reduction techniques
  - Hardware-friendly model compression
  - SW optimization-aware dataflow accelerator

# **Thank You!**

Akshay Kamath: akshay.k2@gatech.edu

Dr. Callie Hao: callie.hao@gatech.edu

