#### BARVINN: Arbitrary Precision DNN Accelerator Controlled by a RISC-V CPU

**MohammadHossein AskariHemmat**<sup>1</sup>, Sean Wagner<sup>2</sup>, Olexa Bilaniuk<sup>3</sup>, Yassine Hariri<sup>4</sup>, Yvon Savaria<sup>1</sup>, Jean-Pierre David<sup>1</sup> <sup>1</sup>Ecole Polytechnique Montreal, Canada <sup>2</sup>IBM, Toronto, Canada, <sup>3</sup>Mila, Montreal, Canada, <sup>4</sup>CMC Microsystems, Kingston, Canada

28th Asia and South Pacific Design Automation Conference, ASP-DAC 2023 Tokyo, Japan Jan 16-19 2023





POLYTECHNIQUE MONTRÉAL, UNIVERSITÉ D'INGÉNIERIE







#### Outline

- 1. Motivation and Background.
- 2. BARVINN Overall Architecture:
  - a. MVU Array and Architecture.
  - b. PITO: Multi-Threaded RISC-V Controller.
- 3. BARVINN programming model and software stack.
- 4. Experiments and Results.
- 5. Conclusion and Future Work





# 1. Motivation and Background

- Quantized Deep Neural Networks (QDNNs) rely on floating-point computations.
- Compared to fixed-point and integer operations, floating-point computations are slow and costly in terms of power consumption and silicon area.
- On the other hand, it has been shown that quantized models can achieve near floating-point accuracy in vision tasks.

| Task             | Dataset   | Model           | Precision (A/W) | Accuracy / mAP | Size (MB) |
|------------------|-----------|-----------------|-----------------|----------------|-----------|
| Classification   | CIFAR-100 | ResNet18        | LSQ(2/2)        | 76.81          | 2.889     |
| Classification   | CIFAR-100 | ResNet18        | LSQ(4/4)        | 76.92          | 5.559     |
| Classification   | CIFAR-100 | ResNet18        | LSQ(8/8)        | 78.45          | 10.87     |
| Classification   | CIFAR-100 | ResNet18        | FP32            | 76.82          | 42.8      |
| Object Detection | VOC-2007  | SSD300-ResNet18 | LSQ(2/2)        | 0.61           | 10.34     |
| Object Detection | VOC-2007  | SSD300-ResNet18 | LSQ(4/4)        | 0.60           | 11.81     |
| Object Detection | VOC-2007  | SSD300-ResNet18 | LSQ(8/8)        | 0.68           | 14.77     |
| Object Detection | VOC-2007  | SSD300-ResNet18 | FP32            | 0.59           | 32.49     |



- However, there are no commercially available general-purpose processors (CPU or GPU) that can efficiently process data in arbitrary precision.



- Introducing BARVINN!
- BARVINN is a DNN accelerator for running arbitrary-precision quantized models.
- It has 8 processing elements (MVUs) that are controlled by a RISC-V controller.
- It provides an overall 8.2 TMACs of computational power (binary operations).
- It has been implemented on the Alveo U250 FPGA platform.
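
The 8.2 TMACs figure can be sanity-checked from numbers given elsewhere in this talk: 8 MVUs, a 64 x 64 tile of 1-bit multiply-accumulates per MVU per cycle, and the 250 MHz clock from the synthesis results. The sketch below assumes that peak binary throughput is simply the product of these quantities; the function name is illustrative and not part of BARVINN.

```python
def peak_binary_macs_per_second(num_mvus: int, tile_rows: int,
                                tile_cols: int, clock_hz: float) -> float:
    """Peak 1-bit MAC rate, assuming every MVU finishes one full tile per cycle."""
    return num_mvus * tile_rows * tile_cols * clock_hz


peak = peak_binary_macs_per_second(num_mvus=8, tile_rows=64,
                                   tile_cols=64, clock_hz=250e6)
print(f"{peak / 1e12:.2f} TMAC/s")  # prints "8.19 TMAC/s"
```

Higher operand precisions take more cycles per result under bit-serial arithmetic, so this number is an upper bound that applies to 1-bit weights and activations only.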








# Matrix Vector Units: BARVINN Processing Elements






- BARVINN has an array of 8 Matrix Vector Units (MVUs).
- Each MVU consists of:
  - RAMs for activations, weights, scalers and biases.
  - A Matrix Vector Product (MVP) unit.
  - Pooling and activation units.
  - A scaler unit.
  - A quantizer unit.
- Using 64 input elements and a 64 x 64 element matrix tile from the weight RAM, each MVU computes 64 output vector elements per clock cycle.

[Figure: Matrix Vector Unit (MVU) block diagram showing the write and read interconnect and controllers; the activation, weight, scaler and bias RAMs; the 64 x 64-bit MVP; and the scaler, Pool/ReLU and quantizer stages.]








- Matrix Vector Product (MVP) units:
  - Compute on fixed-point operands of arbitrary precision, from 1 to 16 bits.
  - Each MVP has 64 Vector-Vector Product (VVP) units.
  - Each cycle, 64 bits from the activation RAM are broadcast to each of the 64 VVPs, and a 64x64 matrix tile is read from the weight RAM and distributed across the separate VVPs.
  - The VVPs compute a 64-element dot product on 1-bit operands (as displayed in the adder tree).
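
As a rough software model of one MVP step, a 64-element dot product on 1-bit operands reduces to a bitwise AND followed by a population count. The sketch below assumes unsigned 0/1 bits packed into Python integers and one tile row per VVP; the function names are illustrative and the hardware adder tree is not modelled.

```python
from typing import List

LANES = 64
MASK = (1 << LANES) - 1


def vvp_1bit(activation_word: int, weight_word: int) -> int:
    """Dot product of two 64-element vectors of 1-bit values packed into integers."""
    return bin(activation_word & weight_word & MASK).count("1")


def mvp_1bit(activation_word: int, weight_tile: List[int]) -> List[int]:
    """One MVP step: broadcast the activation word to 64 VVPs, one tile row each."""
    assert len(weight_tile) == LANES
    return [vvp_1bit(activation_word, row) for row in weight_tile]


activations = 0b1011
tile = [0b0001, 0b1111] + [0] * (LANES - 2)
outputs = mvp_1bit(activations, tile)
print(outputs[:2])  # [1, 3]
```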








- We use **bit-serial** math to support arbitrary precision.
- Example: A = 0b011011, B = 0b10100, $C = A \times B$. Each step adds one partial sum to C:

| Step | Start cycle | A      | B     | Partial sum | C          |
|------|-------------|--------|-------|-------------|------------|
| 1    | t=0         | 011011 | 10100 | + 0         | 000000000  |
| 2    | t=1         | 011011 | 10100 | + 1         | 000000001  |
| 3    | …           | …      | …     | …           | 000…       |
| 4    | t=6         | 011011 | 10100 | + 1         | 000000111  |
| 5    | t=10        | 011011 | 10100 | + 10        | 0000010000 |
| 6    | …           | …      | …     | …           | 000…       |
| 7    | t=20        | 011011 | 10100 | + 1         | …100011    |
| 8    | t=24        | 011011 | 10100 | + 1         | 0010000111 |
| 9    | …           | …      | …     | …           | 010…       |

- Result: C = 1000011100. Total time: **30 cycles**.
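
The idea can be sketched in software by processing one pair of operand bits per cycle and accumulating the shifted partial products. This is a minimal illustration for unsigned operands; the order in which bit pairs are visited here is an assumption and does not reproduce the exact schedule shown on the slide, but the cycle count (6 bits x 5 bits) and the final product match.

```python
def bit_serial_multiply(a: int, b: int, a_bits: int, b_bits: int):
    """Multiply two unsigned integers one bit pair per cycle.

    Returns (product, cycles).
    """
    product = 0
    cycles = 0
    for i in range(a_bits):
        a_i = (a >> i) & 1
        for j in range(b_bits):
            b_j = (b >> j) & 1
            # A 1-bit multiply is an AND; its weight is 2**(i + j).
            product += (a_i & b_j) << (i + j)
            cycles += 1
    return product, cycles


c, cycles = bit_serial_multiply(0b011011, 0b10100, a_bits=6, b_bits=5)
print(bin(c), cycles)  # 0b1000011100 30
```

Because the cost is the product of the two bit widths, lowering either operand's precision reduces latency proportionally.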



## PITO: A Simple RISC-V Processor to Control MVU Arrays






- PITO is a RISC-V processor that supports the RV32I instruction set.
- PITO has 8KB of instruction RAM and 8KB of data RAM.
- It supports privileged mode for reading from and writing to CSRs.
- It uses 75 extra RISC-V CSRs per MVU to control different MVU parameters.
- It has a custom C runtime for controlling different MVUs in a thread-safe environment, reducing the need for a custom RTOS.





## How to control multiple processing elements?






- Solution 1: Use a separate controller for each PE.
  - High throughput.
  - Fine control over each PE.
  - High resource utilization.
- Solution 2: Use a shared controller for all PEs.
  - Low resource utilization.
  - Lower throughput.
- Our solution: barrel processing. Share the data path among multiple hardware threads (HARTs).
  - High throughput, fine-grained control, low resource usage, and no need for data or control hazard logic.
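
A toy model helps show why barrel processing removes the need for hazard logic. The sketch below assumes 8 harts (one per MVU, which the slides imply but do not state outright) and strict round-robin issue; it is not a model of PITO's actual pipeline.

```python
from collections import defaultdict


def barrel_schedule(num_harts: int, num_cycles: int):
    """Return the hart that issues an instruction on each cycle (strict round-robin)."""
    return [cycle % num_harts for cycle in range(num_cycles)]


def issue_gaps(schedule):
    """Collect the cycle distances between consecutive issues of the same hart."""
    last_seen = {}
    gaps = defaultdict(set)
    for cycle, hart in enumerate(schedule):
        if hart in last_seen:
            gaps[hart].add(cycle - last_seen[hart])
        last_seen[hart] = cycle
    return dict(gaps)


schedule = barrel_schedule(num_harts=8, num_cycles=24)
print(schedule[:10])         # [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
print(issue_gaps(schedule))  # every hart maps to {8}
```

Two instructions from the same hart are always 8 cycles apart, so as long as the pipeline is no deeper than the number of harts, an instruction has left the pipeline before that hart's next one enters it.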




















- BARVINN takes in a model in ONNX format.
- Our code generator traverses the computation graph and, based on the node type, generates the appropriate jobs and assigns them to different MVUs.
- The code generator then produces C code for each node.
- Finally, using the RISC-V toolchain, our C runtime and the memory map, a binary is generated.
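
The flow above can be illustrated with a small stand-in for the graph traversal. Everything in this sketch is hypothetical: the `Node` class replaces a real ONNX graph, the job names and the emitted C calls are invented for illustration, and the round-robin MVU assignment is an assumption rather than BARVINN's actual policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    op_type: str  # e.g. "Conv" or "Gemm"


# Hypothetical mapping from node type to an MVU job kind.
JOB_FOR_OP = {"Conv": "conv2d_job", "Gemm": "matmul_job"}


def generate_c(nodes: List[Node], num_mvus: int = 8) -> List[str]:
    """Walk nodes in topological order and emit one C call per supported node."""
    lines = []
    mvu = 0
    for node in nodes:
        job = JOB_FOR_OP.get(node.op_type)
        if job is None:
            raise ValueError(f"unsupported node type: {node.op_type}")
        lines.append(f'{job}(/*mvu=*/{mvu}, /*node=*/"{node.name}");')
        mvu = (mvu + 1) % num_mvus
    return lines


graph = [Node("conv1", "Conv"), Node("conv2", "Conv"), Node("fc", "Gemm")]
for line in generate_c(graph):
    print(line)
```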








- BARVINN can be programmed for two computation modes:
  - Pipelined mode: Each computation node is assigned to a separate MVU.
  - Distributed mode: Computation of a single layer is distributed among multiple MVUs.
  - For now, our code generator only supports pipelined mode code generation.
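
The difference between the two modes can be sketched as two ways of assigning work to 8 MVUs. This is only an illustration: splitting a layer by contiguous output rows in distributed mode is an assumption, since the slides do not say how a layer is partitioned, and both function names are made up.

```python
from typing import Dict, List, Tuple


def pipelined_assignment(layers: List[str], num_mvus: int = 8) -> Dict[str, int]:
    """Pipelined mode: each layer (computation node) gets its own MVU."""
    if len(layers) > num_mvus:
        raise ValueError("more layers than MVUs; this sketch does not handle reuse")
    return {layer: mvu for mvu, layer in enumerate(layers)}


def distributed_assignment(output_rows: int, num_mvus: int = 8) -> List[Tuple[int, int]]:
    """Distributed mode: split one layer's output rows into [start, end) ranges."""
    base, extra = divmod(output_rows, num_mvus)
    ranges, start = [], 0
    for mvu in range(num_mvus):
        size = base + (1 if mvu < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


print(pipelined_assignment(["conv1", "conv2", "fc"]))  # {'conv1': 0, 'conv2': 1, 'fc': 2}
print(distributed_assignment(output_rows=128))         # 8 ranges of 16 rows each
```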










# 3. Use case: Matrix Multiply

- Example, Matrix Multiply:
  - Weight Matrix: 128 x 128
  - Input Vector: 8 x 128
  - Weight precision: 2 bits
  - Input precision: 2 bits
  - Activation precision: 2 bits






```
# matmul.S
# Kernel code for Matrix Multiply
#include "pito_def.h"

addi  x1, x0, 0
addi  x2, x0, 2             // set weight precision to 2
add   x1, x1, x2
slli  x3, x2, 6             // set input precision to 2
add   x1, x1, x3
slli  x3, x2, 12            // set output precision to 2
add   x1, x1, x3
csrw  mvu_precision, x1
csrwi mvu_quant, 10         // set quant_msbidx to 10
csrwi mvu_wbaseaddr, 0      // set weight address to 0
csrwi mvu_ibaseaddr, 0      // set input address to 0
addi  x1, x0, 1
slli  x1, x1, 10            // set output address to 0x400
csrw  mvu_obaseaddr, x1
csrwi mvu_wstride_0, 30     // 1 tile back move x 2 bits
csrwi mvu_wstride_1, 2      // 1 tile ahead move x 2 bits
csrwi mvu_wstride_2, 0
csrwi mvu_wstride_3, 0
csrwi mvu_istride_0, 30     // 1 tile back move x 2 bits
csrwi mvu_istride_1, 0
csrwi mvu_istride_2, 0
csrwi mvu_istride_3, 30
csrwi mvu_ostride_0, 0
csrwi mvu_ostride_1, 0
csrwi mvu_ostride_2, 0      // operand missing on the slide; 0 assumed
csrwi mvu_ostride_3, 0
csrwi mvu_wlength_0, 1      // 2 tiles in width
csrwi mvu_wlength_1, 3      // number bit combinations i.e. 2x2 bits
csrwi mvu_wlength_2, 1      // 2 tiles in height
csrwi mvu_wlength_3, 0
csrwi mvu_ilength_0, 1      // 2 tiles in height
csrwi mvu_ilength_1, 0      // number bit combinations
csrwi mvu_ilength_2, 0      // 2 tiles in width of matrix operand
csrwi mvu_ilength_3, 0
csrwi mvu_olength_0, 1
csrwi mvu_olength_1, 0      // operand illegible on the slide; 0 assumed
csrwi mvu_olength_2, 0
csrwi mvu_olength_3, 0
addi  x1, x0, 1
slli  x1, x1, 30            // mul mode 01
addi  x1, x1, 16
csrw  mvu_command, x1       // Kick start MVU, 2 tiles x 2 tiles x 2bit x 2bits
ebreak
```



The kernel is annotated in three phases:

- Setting input, weight and output address.
- Setting input, weight and output memory access pattern variables.
- Kick start the accelerator.
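
To make the register arithmetic in the kernel easier to follow, the sketch below recomputes the three values that the kernel builds with shifts and adds before writing them to CSRs. The field positions (bits 0, 6 and 12, and bit 30 for the mode) are read off the shift amounts in the assembly; the field widths and the meaning of the low bits of the command word are not specified in the slides, and the helper name is illustrative.

```python
def pack_precision(weight_bits: int, input_bits: int, output_bits: int) -> int:
    """Mirror the kernel's shifts: weight at bit 0, input at bit 6, output at bit 12."""
    return weight_bits + (input_bits << 6) + (output_bits << 12)


precision = pack_precision(2, 2, 2)  # value written to mvu_precision
obaseaddr = 1 << 10                  # value written to mvu_obaseaddr
command = (1 << 30) + 16             # value written to mvu_command

print(hex(precision))  # 0x2082
print(hex(obaseaddr))  # 0x400
print(hex(command))    # 0x40000010
```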










- The configurations are written to each MVU through the APB bus.
- Once the MVU is done computing, it sends an interrupt to the corresponding HART.








# 4. Experiments and Results

- We synthesized BARVINN for the Alveo U250 FPGA from AMD.
- We used Vivado 2021.4 for synthesis.
- We used 8 MVUs with 64KB weight and data RAMs, 2KB scaler RAMs and 4KB bias RAMs.
- For PITO, we used 8KB instruction and data caches.

| Resource      | Pito RISC-V | MVU Array | Overall  |
|---------------|-------------|-----------|----------|
| LUT           | 10454       | 190625    | 201079   |
| BRAM          | 15          | 1312      | 1327     |
| DSP           | 0           | 512       | 512      |
| Dynamic Power | 0.410 W     | 21.066 W  | 21.504 W |
| Frequency     | 250 MHz     | 250 MHz   | 250 MHz  |






- We compared our platform against FINN and FILM-QNN.
- We used models provided by the FINN repository.
- For the CNV model on CIFAR10, over different bit precisions:
  - We achieve better FPS/kLUT.
  - FINN uses fewer LUTs.



|      | Bits (W/A) | kLUT          | BRAM | DSP | FPS   | FPS/kLUT |
|------|------------|---------------|------|-----|-------|----------|
| Ours | 1/1        | 201.1 (15.0%) | 1327 | 512 | 61035 | 303.…    |
| Ours | 1/2        | 201.1 (15.0%) | 1327 | 512 | 30517 | 151.…    |
| Ours | 2/2        | 201.1 (15.0%) | 1327 | 512 | 15258 | 75.…     |
| FINN | 1/1        | 28.2 (2.1%)   | 150  | 0   | 7716  | 273.…    |
| FINN | 1/2        | 19.8 (1.47%)  | 103  | 0   | 2170  | 109.…    |
| FINN | 2/2        | 24.3 (1.81%)  | 202  | 0   | 2170  | 89.…     |
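
The FPS/kLUT column is the ratio of the FPS and kLUT columns, so it can be recomputed from the table. The sketch below does that with the kLUT and FPS values copied from the table; the helper name is illustrative.

```python
rows = [
    ("Ours", "1/1", 201.1, 61035),
    ("Ours", "1/2", 201.1, 30517),
    ("Ours", "2/2", 201.1, 15258),
    ("FINN", "1/1", 28.2, 7716),
    ("FINN", "1/2", 19.8, 2170),
    ("FINN", "2/2", 24.3, 2170),
]


def fps_per_klut(fps: float, kluts: float) -> float:
    """Throughput normalized by LUT usage (frames per second per thousand LUTs)."""
    return fps / kluts


for design, bits, kluts, fps in rows:
    print(f"{design} {bits}: {fps_per_klut(fps, kluts):.1f} FPS/kLUT")
```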





- For the ResNet50 model, we achieve better FPS/Watt compared to FINN and FILM-QNN.

|               | Bits (W/A) | Clock Freq. | FPS  | FPS/Watt |
|---------------|------------|-------------|------|----------|
| Ours          | 1/2        | 250 MHz     | 2296 | 106.8    |
| FINN-R [1][6] | 1/2        | 178 MHz     | 2873 | 41.0     |
| FILM-QNN [20] | 4(8)/5     | 150 MHz     | 109  | 8.4      |





# **BARVINN is open source!**



## **Documentation**:

https://barvinn.readthedocs.io/en/latest/






## **Source Code:**

https://github.com/hossein1387/BARVINN





# 5. Conclusion and Future Work

- In this work, we presented BARVINN, an arbitrary-precision DNN accelerator controlled by a RISC-V processor.
- We presented the architecture of its different components.
- We presented synthesis results.
- We used BARVINN to run inference on different models with different bit precisions.
- We are preparing BARVINN for ASIC implementation (GF 22 nm or GF 12 nm).
- We are planning to use TVM to improve code generation and optimization.





# 5. Acknowledgement:

- This project was made possible with help and support from:
  - CMC Microsystems
  - IBM
  - MILA
  - Mitacs
  - FRQNT






# **Thank You!**








- We analyzed over 50 classification models from the ONNX Model Zoo.
- Around 79% of these models use convolution with input channel sizes that are multiples of 64.










- Computation complexity diagram.












## MVU Data Storage Format

- Data is stored in blocks of 64 n-bit numbers.
- Blocks are organized in bit-sliced order.
- Each value is a "column".
- Each row holds the bits from each value with the same bit position.
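
The bit-sliced layout can be sketched as a bit transpose: 64 n-bit values become n words of 64 bits, one word per bit position. In this illustration the row order (most significant bit first) and the mapping of column index to bit index within a word are assumptions, and the function names are made up.

```python
from typing import List

BLOCK = 64


def to_bit_sliced(values: List[int], n_bits: int) -> List[int]:
    """Pack 64 n-bit values into n 64-bit words, one word per bit position (MSB first)."""
    assert len(values) == BLOCK
    words = []
    for bit in range(n_bits - 1, -1, -1):
        word = 0
        for column, value in enumerate(values):
            word |= ((value >> bit) & 1) << column
        words.append(word)
    return words


def from_bit_sliced(words: List[int], n_bits: int) -> List[int]:
    """Inverse of to_bit_sliced: rebuild the 64 values from their bit rows."""
    values = [0] * BLOCK
    for row, word in enumerate(words):
        bit = n_bits - 1 - row
        for column in range(BLOCK):
            values[column] |= ((word >> column) & 1) << bit
    return values


values = [i % 4 for i in range(BLOCK)]
words = to_bit_sliced(values, n_bits=2)
assert from_bit_sliced(words, n_bits=2) == values
print(len(words))  # 2
```

With this layout a block of 2-bit values occupies two 64-bit rows, so storage and the number of rows read per operand both scale with the chosen precision.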










- Low-precision computation pipeline.

Low-precision computation in LSQ. This image was taken from the LSQ paper, S. K. Esser et al. (2020).




