#### Supporting Compressed-Sparse Activations and Weights on SIMD-like Accelerator for Sparse Convolutional Neural Networks

Chien-Yu Lin and Bo-Cheng Lai

Institute of Electronics Engineering, National Chiao Tung University



#### **Convolutional Neural Network**

- CNNs now dominate visual recognition applications
  - Face recognition, object detection, autonomous vehicles...
- Major components: deep convolutional layers



#### **Convolutional Layer**

- Many multiplications and additions that can run in parallel



Computations of a conv layer
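
To make the volume of multiply-accumulate (MAC) operations concrete, here is a minimal sketch of a convolutional layer written as explicit loops. It assumes stride 1 and no padding, and the function name `conv_layer` and the nested-list layout are illustrative choices, not taken from the slides.

```python
def conv_layer(acts, weights):
    """Naive convolution (stride 1, no padding) written as explicit MACs.

    acts:    nested list of shape [C_in][H][W]
    weights: nested list of shape [C_out][C_in][K][K]
    Returns (outputs of shape [C_out][H-K+1][W-K+1], number of MACs performed).
    """
    c_in, h, w = len(acts), len(acts[0]), len(acts[0][0])
    c_out, k = len(weights), len(weights[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(c_out)]
    macs = 0
    for o in range(c_out):
        for y in range(out_h):
            for x in range(out_w):
                acc = 0
                # Each output value is an independent dot product of
                # C_in * K * K activation/weight pairs.
                for c in range(c_in):
                    for dy in range(k):
                        for dx in range(k):
                            acc += acts[c][y + dy][x + dx] * weights[o][c][dy][dx]
                            macs += 1
                out[o][y][x] = acc
    return out, macs
```

Every output value is computed independently of the others, which is why the work maps naturally onto parallel MAC hardware.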

#### **CNN Acceleration with SIMD**

• MAC units perform CNN computations efficiently and are therefore adopted by many CNN accelerators [Google TPU, DianNao, Zhang 2015, Cambricon-X]



N. P. Jouppi et al., In-Datacenter Performance Analysis of a Tensor Processing Unit, ISCA 2017

T. Chen et al., DianNao: a small-footprint high-throughput accelerator for ubiquitous machine learning, ASPLOS 2014

C. Zhang et al., Optimizing fpga-based accelerator design for deep convolutional neural networks, FPGA 2015

S. Zhang et al., Cambricon-X: An accelerator for sparse neural networks, MICRO 2016

#### **CNN is Sparse**

- About 60% of weights and activations are ZEROs
  - Zeros in activations are dynamically generated after ReLU
  - Zeros in weights are obtained with Network Pruning
- Sparsity is promising for speedup (Zero-skipping) and energy reduction (smaller memory footprint)



A. Parashar et al., SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks, ISCA 2017

S. Han et al., Learning both weights and connections for efficient neural network, NIPS 2015

#### Sparse CNN on SIMD?



#### Sparse CNN on SIMD!



#### A Simple Sparse Layer

[Figure: a simple sparse layer, with panels labeled Act, Weight, and Output; zero entries are shown as 0.]


#### Compressed-Sparse Data: Only Keep Non-Zeros





#### Plus Index: Bit-Vector Recording Zero/Non-Zero
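
The compressed-sparse format described on these two slides can be sketched as follows: the non-zero values are stored in order, and a bit vector records which positions were non-zero. The function names `compress` and `decompress` are hypothetical, and the sample values are made up for illustration.

```python
def compress(dense):
    """Split a dense vector into (non-zero values in order, bit-vector index)."""
    index = [1 if v != 0 else 0 for v in dense]   # 1 marks a non-zero position
    values = [v for v in dense if v != 0]         # zeros are dropped entirely
    return values, index


def decompress(values, index):
    """Rebuild the dense vector by re-inserting zeros where the index bit is 0."""
    remaining = iter(values)
    return [next(remaining) if bit else 0 for bit in index]


# Hypothetical 8-element activation vector with two non-zeros.
values, index = compress([0, 0, 7, 0, 9, 0, 0, 0])
# values == [7, 9], index == [0, 0, 1, 0, 1, 0, 0, 0]
```

The index costs one bit per position, so the format saves storage whenever a stored value is wider than one bit and a meaningful fraction of the entries are zero.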



#### Target of DIM: Find Out Effectual Pairs



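
As a reference point, the sketch below shows what "effectual pairs" means on uncompressed data: only positions where both the activation and the weight are non-zero contribute to the output. The values and function names are hypothetical and serve only to illustrate the idea.

```python
def dense_dot(acts, weights):
    """Dot product over every position, including zero operands."""
    return sum(a * w for a, w in zip(acts, weights))


def effectual_pairs(acts, weights):
    """Keep only pairs whose activation and weight are both non-zero."""
    return [(a, w) for a, w in zip(acts, weights) if a != 0 and w != 0]


# Hypothetical dense vectors (not the values from the slides).
acts    = [0, 0, 3, 0, 5, 0, 2, 0]
weights = [0, 4, 1, 0, 2, 6, 0, 0]

pairs = effectual_pairs(acts, weights)
# Skipping ineffectual pairs leaves the result unchanged.
assert sum(a * w for a, w in pairs) == dense_dot(acts, weights)
```

Here only 2 of the 8 multiplications are effectual. The difficulty on compressed data is that a matching activation and weight generally sit at different offsets in their respective compressed arrays.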

#### **Dual Indexing Module: Step1 Activation Index** Weight Index 0 0 0 0 0 0 0 U AND 1 Act Weight 0 Weight Output 0 Act Index Index ( ] ) 0 0 0 0 0 () **Co-activated Index** Effectual 0 1 **Pairs** a3 1 **w3** $\left( \begin{array}{c} \\ \end{array} \right)$ 0 0 **w5** 01 a5 1 1 **w6** 0 1 1 0 0 0

#### Dual Indexing Module: Step2

#### Dual Indexing Module: Step3

#### Dual Indexing Module: Step4

#### Alignment Issue Solved!
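
The figures for these steps did not survive, so the following is only a software sketch of one way to resolve the alignment, not the authors' hardware design. It assumes that, after the two bit-vector indices are ANDed, the offset of each effectual operand in its compressed array equals the number of non-zeros that precede it (a prefix count over the index). The names `dual_index` and `sparse_dot` are hypothetical.

```python
def dual_index(act_index, weight_index):
    """For each co-activated position, return the offsets of the matching
    activation and weight inside their compressed arrays."""
    act_offsets, weight_offsets = [], []
    act_seen = weight_seen = 0            # running counts of non-zeros so far
    for a_bit, w_bit in zip(act_index, weight_index):
        if a_bit & w_bit:                 # both operands are non-zero here
            act_offsets.append(act_seen)
            weight_offsets.append(weight_seen)
        act_seen += a_bit
        weight_seen += w_bit
    return act_offsets, weight_offsets


def sparse_dot(act_vals, act_index, w_vals, weight_index):
    """Dot product computed directly on compressed-sparse operands."""
    a_off, w_off = dual_index(act_index, weight_index)
    return sum(act_vals[i] * w_vals[j] for i, j in zip(a_off, w_off))


# Hypothetical compressed operands: dense activations [0,0,3,0,5,0,2,0]
# and dense weights [0,4,1,0,2,6,0,0].
result = sparse_dot([3, 5, 2], [0, 0, 1, 0, 1, 0, 1, 0],
                    [4, 1, 2, 6], [0, 1, 1, 0, 1, 1, 0, 0])
```

With the offsets in hand, the effectual operands can be gathered into adjacent lanes, so the MAC array only ever sees aligned non-zero pairs.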



#### **Accelerator Design**

Extended from Cambricon-X [MICRO 2016]



#### **Accelerator Design**

• Plug DIM into each PE



#### Accelerator Design

Encode output activations on-the-fly
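
A minimal sketch of on-the-fly encoding, assuming that each output passes through ReLU and is then written in the same bit-vector format as the inputs. The class name `OutputEncoder` and its interface are hypothetical; the slides do not describe the encoder at this level of detail.

```python
class OutputEncoder:
    """Streams outputs into compressed-sparse form, one value at a time."""

    def __init__(self):
        self.values = []   # non-zero output activations, in order
        self.index = []    # one bit per output: 1 = non-zero, 0 = zero

    def push(self, pre_activation):
        out = max(pre_activation, 0)          # ReLU turns negatives into zeros
        self.index.append(1 if out != 0 else 0)
        if out != 0:
            self.values.append(out)           # zeros are never stored


encoder = OutputEncoder()
for partial_sum in [-2, 5, 0, 3]:             # hypothetical accumulator results
    encoder.push(partial_sum)
```

Because the outputs are already compressed when they leave the processing element, the next layer can consume them without a separate compression pass.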



#### **Evaluation Methodology**

- Logic: Synthesis with TSMC 40nm
- SRAM and DRAM: CACTI
- Benchmark: Open Sparse-AlexNet + ImageNet Data
- Experiments: In-house performance simulator

#### **Accelerator Variants**

• Overheads of SpAW over DAW: 14.4% in area and 19.5% in power

| Acc  | Act    | Weights | Index | Encoder | Area (mm<sup>2</sup>) | Power (mW) |
|------|--------|---------|-------|---------|-----------------------|------------|
| DAW  | Dense  | Dense   | N/A   | N/A     | 2.05                  | 395        |
| SpA  | Sparse | Dense   | IM    | Yes     | 2.15                  | 428        |
| SpW  | Dense  | Sparse  | IM    | N/A     | 2.23                  | 441        |
| SpAW | Sparse | Sparse  | DIM   | Yes     | 2.34                  | 472        |

#### **DRAM Access Volume**

• 47.3% lower DRAM access volume than DAW



#### **Energy Consumption**

46% energy reduction compared to DAW



#### **Energy-Delay-Product**

55.4% EDP reduction compared to DAW
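
As a rough consistency check, the energy and EDP figures can be related through the definition of EDP. This assumes both percentages describe the same workload and the same DAW baseline; if they are averages taken separately over layers, the relation holds only approximately.

$$
\mathrm{EDP} = E \times D
\quad\Rightarrow\quad
\frac{D_{\text{new}}}{D_{\text{DAW}}}
= \frac{\mathrm{EDP}_{\text{new}} / \mathrm{EDP}_{\text{DAW}}}{E_{\text{new}} / E_{\text{DAW}}}
= \frac{1 - 0.554}{1 - 0.46}
= \frac{0.446}{0.54}
\approx 0.83
$$

Under that assumption, the reported numbers would correspond to a delay about 17% shorter than DAW.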



#### Summary

- SIMD-like accelerators face an alignment issue when running sparse CNNs
- We propose a novel *Dual Indexing Module* (DIM) to handle the alignment issue efficiently
- By keeping data in a **compressed-sparse format**, a CNN accelerator with DIM reduces DRAM access volume, energy consumption, and EDP by 47.3%, 46%, and 55.4%, respectively

# Thank You!

# **Additional Materials**

#### **Design Parameters of SpAW**

| Accelerator Parameters      | Value       |
|-----------------------------|-------------|
| Clock Rate                  | 1 GHz       |
| Number of PEs               | 16          |
| WBs (Total)                 | 32 KB       |
| WBs-Idx (Total)             | 2 KB        |
| AB-In/AB-Out (each)         | 8 KB        |
| AB-In-Idx/AB-Out-Idx (each) | 500 B       |

| PE Parameters               | Value       |
|-----------------------------|-------------|
| Multiplier Precision        | 16-bit      |
| MAC Width                   | 16 × 16-bit |
| WB                          | 2 KB        |
| WB-Idx                      | 128 B       |

#### **Related Work - Cnvlutin**

• Decouples neuron lanes to skip zero-valued neurons



#### **Related Work - Cambricon-X**

Utilizing weight sparsity with step indexing (a compressed-sparse format)
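
For comparison with the bit-vector index, here is a minimal sketch of a step-style index, assuming each stored step is the distance from the previous non-zero position (with the first step measured from position -1). The exact convention used by Cambricon-X may differ; the function names are hypothetical.

```python
def step_index_encode(dense):
    """Return (non-zero values, step to each non-zero from the previous one)."""
    values, steps = [], []
    prev = -1                          # assumed starting point before position 0
    for pos, v in enumerate(dense):
        if v != 0:
            values.append(v)
            steps.append(pos - prev)   # distance from the previous non-zero
            prev = pos
    return values, steps


def step_index_decode(values, steps, length):
    """Rebuild the dense vector by accumulating steps into positions."""
    dense = [0] * length
    pos = -1
    for v, step in zip(values, steps):
        pos += step
        dense[pos] = v
    return dense


# Hypothetical weight vector.
values, steps = step_index_encode([0, 4, 1, 0, 2, 6, 0, 0])
# values == [4, 1, 2, 6], steps == [2, 1, 2, 1]
```

Unlike the bit vector, a step index stores one entry per non-zero rather than one bit per position.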

