

# Training Low Bit-width Convolutional Neural Networks on RRAM

<u>Yi Cai</u>, Tianqi Tang, Lixue Xia, Ming Cheng, Zhenhua Zhu, Yu Wang, Huazhong Yang

Dept. of E.E., Tsinghua National Laboratory for Information Science and Technology (TNList), Tsinghua University, Beijing, China E-mail: yu-wang@tsinghua.edu.cn



# Outline

- Background & Motivation
- RRAM-based Low-bit CNN Training System
  - Overall Framework
  - Quantization and AD/DA Conversion
- Experimental Results
- Conclusion and Future Work
- Reference



### **Convolutional Neural Network**





#### **Object Detection**



Natural Language Processing

# **Energy-efficient platforms are required**

#### Training CNNs is time-and-energy-consuming



**Exponentially Growing Data** 

**NN Computing Accelerators** 



# **Accelerating Inference is not enough**



On-line training and learning are necessary, while most embedded devices are <u>energy-limited</u>.

# **Memory Wall in von Neumann Architecture**



#### Too many memory accesses result in high power

| Operation       | Energy [pJ] | Relative Cost |
|-----------------|-------------|---------------|
| 16 bit int ADD  | 0.06        | 1             |
| 16 bit FP ADD   | 0.45        | 8             |
| 16 bit int MULT | 0.8         | 13            |
| 16 bit FP MULT  | 1.1         | 18            |
| 32b LPDDR2 DRAM | 640         | 10667         |



# **RRAM-based NN Computing Systems**

#### RRAM has become a promising candidate for NN training



### High-precision data & weights are not supported in RRAM-based Systems



# **Training Low Bit-width CNN is Feasible**

Binarized Neural Networks (BNN) [courbariaux2016binarynet]

- Training Neural Networks with Weights and Activations Constrained to +1 or -1
- Binary Weights and Activations (DNN)

XNOR-Net [rastegari2016xnor]

- Both the filters and the input to convolutional layers are binary (CNN)

| Network Variations | Inputs and Weights | Operations used in Convolution | Memory Saving (Inference) | Computation Saving (Inference) | Accuracy on ImageNet (AlexNet) |
|--------------------|--------------------|--------------------------------|---------------------------|--------------------------------|--------------------------------|
| Standard Convolution | Real-value inputs, real-value weights | +, -, × | 1x | 1x | 56.7% |
| Binary Weight | Real-value inputs, binary weights | +, - | ~32x | ~2x | 56.8% |
| Binary Weight, Binary Input (**XNOR-Net**) | Binary inputs, binary weights | XNOR, bitcount | ~32x | ~58x | 44.2% |
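The XNOR/bitcount row can be illustrated with a small sketch. It assumes a bit encoding in which a set bit means +1 and a cleared bit means -1. The helper names `encode` and `xnor_dot` and the example vectors are hypothetical and are not taken from XNOR-Net itself.

```python
def encode(values):
    """Pack a list of +1/-1 values into an int (bit i set means values[i] == +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def xnor_dot(a_bits, b_bits, n):
    """Dot product of two length-n {+1, -1} vectors from their bit encodings."""
    mask = (1 << n) - 1
    agree = ~(a_bits ^ b_bits) & mask   # XNOR: 1 wherever the two signs match
    matches = bin(agree).count("1")     # bitcount
    return 2 * matches - n              # (#matches) - (#mismatches)


# Hypothetical binary weights and binary inputs.
weights = [1, -1, 1, -1, 1, 1]
inputs = [1, -1, 1, -1, -1, 1]

fast = xnor_dot(encode(weights), encode(inputs), len(weights))
reference = sum(w * x for w, x in zip(weights, inputs))
print(fast, reference)
```

Both values agree because each matching sign contributes +1 to the dot product and each mismatch contributes -1, so multiplications and additions reduce to one XNOR and one bitcount.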

DoReFa-Net [zhou2016dorefanet]

- Using low bitwidth parameter gradients to train CNNs
- Can accelerate both <u>training and inference</u>

# Contributions

#### The main contributions of this work include:

- We propose an <u>RRAM-based low-bitwidth CNN training system</u>, together with an <u>algorithm</u> for training low-bitwidth convolutional neural networks that enables on-chip CNN training on an RRAM-based system. We also propose <u>quantization and AD/DA conversion strategies</u> that adapt to the RRAM-based computing system (RCS) and improve the accuracy of the CNN model.
- We explore the configuration space of the <u>bitwidths of activations</u>, <u>convolution outputs</u>, <u>weights</u>, and <u>gradients</u> by training LeNet-5 and ResNet-20 on the proposed system and testing on the MNIST and CIFAR-10 datasets, respectively. We also discuss the <u>tradeoff</u> between energy overhead and prediction accuracy.
- We analyze the **probability distribution** of RRAM's stochastic **disturbance** and run experiments to explore its effect on CNN training.

# **RRAM-based Low-bit CNN Training System**



Fig. 2: Framework of RRAM-based low-bitwidth CNN training system.


# **Training Process-Inference**



# **Training Process-Backpropagation-1**



# **Training Process-Backpropagation-2**



### **Training Process-Parameters Update**



$$\text{GetDeltaW}\left(\frac{\partial J}{\partial K},\chi,m\right) = \Delta K = \chi \cdot \frac{\partial J}{\partial K} + m \cdot \Delta K_{last}$$

where $\chi$ is the learning rate and $m$ is the momentum.
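A minimal sketch of the GetDeltaW step in plain Python, operating element-wise on flat lists. The function name `get_delta_w`, the example values, and the final subtraction `K <- K - ΔK` are assumptions made for illustration; the slide only defines ΔK itself.

```python
def get_delta_w(grad, lr, momentum, delta_last):
    """Return Delta K = lr * dJ/dK + momentum * Delta K_last, element-wise."""
    return [lr * g + momentum * d for g, d in zip(grad, delta_last)]


# Hypothetical kernel, gradient, and hyperparameters.
kernel = [0.5, -0.2, 0.1]
grad = [0.4, -0.1, 0.2]
delta_last = [0.0, 0.0, 0.0]

# First step: with no history, Delta K is just lr * grad.
delta_1 = get_delta_w(grad, lr=0.1, momentum=0.9, delta_last=delta_last)

# Second step with the same gradient: the momentum term adds 0.9 * delta_1.
delta_2 = get_delta_w(grad, lr=0.1, momentum=0.9, delta_last=delta_1)

# Assumed update rule (gradient descent): subtract Delta K from the kernel.
kernel = [k - d for k, d in zip(kernel, delta_1)]
print(delta_1, delta_2, kernel)
```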


# **Quantization and AD/DA Conversion**

• Quantizing floating-point x to k-bit fixed-point number
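One common way to do this is a uniform quantizer over [0, 1], as used in DoReFa-Net. The sketch below assumes that form; it is not necessarily the exact rule used in this system, and the name `quantize_k` and the clipping to [0, 1] are illustrative choices.

```python
import math


def quantize_k(x, k):
    """Map x in [0, 1] to the nearest of 2**k evenly spaced fixed-point levels."""
    levels = (1 << k) - 1                 # index of the largest level
    x = min(max(x, 0.0), 1.0)             # clip to the assumed input range
    # floor(v + 0.5) rounds half up, avoiding Python's round-half-to-even.
    return math.floor(x * levels + 0.5) / levels


for k in (1, 2, 4):
    print(k, [quantize_k(x, k) for x in (0.0, 0.3, 0.6, 1.0)])
```

With `k = 1` every value collapses to 0 or 1, while `k = 2` gives the four levels 0, 1/3, 2/3, and 1.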



### **Quantization and AD/DA Conversion**

• Digital-to-Analog conversion strategy in the input interfaces of RRAM crossbars



# **Quantization and AD/DA Conversion**

• Analog-to-Digital conversion strategy in the output interfaces of RRAM crossbars



# **Experimental Results**

- Benchmarks
  - LeNet-5 on MNIST dataset
  - ResNet-20 on CIFAR-10 dataset
- Evaluation
  - Energy efficiency
  - Accuracy
- Disturbance setup
  - Default: satisfying $\frac{\text{Weight}_{\text{disturbance}}}{\text{Weight}_{\text{expected}}} \sim U[-5\%, 5\%]$

- Comparisons
  - GPU: NVIDIA Titan X (Pascal)
  - CPU: Intel (R) Core(TM) i7-6900K CPU @ 3.20GHz
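The default disturbance setup can be simulated with a short sketch. It reads the setup as: the written weight equals the expected weight plus a deviation whose ratio to the expected weight is uniform in [-5%, 5%]. The function name `disturb` and the example weights are hypothetical.

```python
import random


def disturb(weights, max_ratio=0.05, rng=random):
    """Return weights with a relative error drawn from U[-max_ratio, max_ratio]."""
    return [w * (1.0 + rng.uniform(-max_ratio, max_ratio)) for w in weights]


rng = random.Random(0)                  # fixed seed so the run is repeatable
expected = [0.5, -0.25, 1.0, 0.0]       # hypothetical target weights
actual = disturb(expected, rng=rng)

for e, a in zip(expected, actual):
    print(f"expected={e:+.4f}  written={a:+.4f}")
```

Because the error is relative, a weight of exactly zero is left unchanged under this reading.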

### **Experimental Results - Accuracy**

TABLE I: Classification accuracy on the MNIST test dataset for different bitwidth combinations in LeNet-5. A, CO, W, and G are the bitwidths of activations, convolution outputs, weights, and gradients.

| A              | CO | W  | G  | Accuracy<br>without disturbance | Accuracy<br>with disturbance |
|----------------|----|----|----|---------------------------------|------------------------------|
| 32<sup>a</sup> | 32 | 32 | 32 | 0.9914                          | *                            |
| 8              | 8  | 8  | 8  | 0.9828                          | 0.9825                       |
| 6              | 6  | 6  | 6  | 0.9745                          | 0.9733                       |
| 4              | 4  | 4  | 4  | 0.9767                          | 0.9797                       |
| 3              | 3  | 2  | 4  | 0.9687                          | 0.9736                       |
| 2              | 2  | 2  | 2  | 0.9670                          | 0.9752                       |
| 2              | 2  | 1  | 4  | -<sup>b</sup>                   |                              |
| 2              | 2  | 2  | 1  | 0.9633                          | 0.9647                       |
| 1              | 1  | 2  | 1  | 0.9416                          | 0.9375                       |
| 1              | 1  | 1  | 1  | -                               | -                            |

Slide annotations on this table: **Disturbance-Tolerant**, **Better Generalization**.

<sup>a</sup> bitwidth=32 means 32-bit floating-point numbers.

<sup>b</sup> '-' means failing to train a convergent model under such bitwidth.

### **Experimental Results - Accuracy**



Fig. 3: Error Rate Curves of ResNet-20 on CIFAR-10: Accuracy Under Different Combinations of Bitwidth (A,CO,W,G).

# **Experimental Results - Tradeoff**



# **Experimental Results - Energy**

TABLE II: Energy overhead estimation of CNNs on different training platforms.

| Dataset  | Platform        | CNN Model | Component | Energy (μJ/img/iter) | Share  | Total vs. RRAM |
|----------|-----------------|-----------|-----------|----------------------|--------|----------------|
| MNIST    | CPU<sup>a</sup> | LeNet-5   | Conv+FC   | 7997.6               | 88.7%  |                |
|          |                 |           | Others    | 1015.6               | 11.3%  |                |
|          |                 |           | All       | 9013.2               | 100.0% | 16.8x          |
|          | GPU<sup>b</sup> |           | Conv+FC   | 11824.9              | 96.0%  |                |
|          |                 |           | Others    | 491.3                | 4.0%   |                |
|          |                 |           | All       | 12316.2              | 100.0% | 23.0x          |
|          | RRAM            |           | Conv+FC   | 44.96                | 8.4%   |                |
|          |                 |           | Others    | 491.3                | 91.6%  |                |
|          |                 |           | All       | 536.26               | 100.0% | 1.0x           |
| CIFAR-10 | CPU<sup>a</sup> | ResNet-20 | Conv+FC   | 262523.4             | 77.9%  |                |
|          |                 |           | Others    | 74414.1              | 22.1%  |                |
|          |                 |           | All       | 336937.5             | 100.0% | 8.8x           |
|          | GPU<sup>b</sup> |           | Conv+FC   | 133066.9             | 79.4%  |                |
|          |                 |           | Others    | 34465.2              | 20.6%  |                |
|          |                 |           | All       | 167532.1             | 100.0% | 4.4x           |
|          | RRAM            |           | Conv+FC   | 3653.7               | 9.6%   |                |
|          |                 |           | Others    | 34465.2              | 90.4%  |                |
|          |                 |           | All       | 38118.9              | 100.0% | 1.0x           |

<sup>a</sup> Intel(R) Core(TM) i7-6900K CPU @ 3.20GHz.

<sup>b</sup> NVIDIA TITAN X (Pascal).
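As a worked check of the comparison in Table II, the sketch below divides each platform's "All" energy by the RRAM "All" energy for the same benchmark. The totals are copied from the table; the dictionary layout and the name `ratios_vs_rram` are illustrative.

```python
# "All" energy per platform in uJ/img/iter, copied from Table II.
totals = {
    "MNIST / LeNet-5": {"CPU": 9013.2, "GPU": 12316.2, "RRAM": 536.26},
    "CIFAR-10 / ResNet-20": {"CPU": 336937.5, "GPU": 167532.1, "RRAM": 38118.9},
}


def ratios_vs_rram(energy):
    """Energy of each platform relative to RRAM, rounded to one decimal."""
    return {name: round(value / energy["RRAM"], 1) for name, value in energy.items()}


ratios = {bench: ratios_vs_rram(energy) for bench, energy in totals.items()}
for bench, r in ratios.items():
    print(bench, r)
```

On these totals the RRAM system uses roughly 17x to 23x less energy than the CPU and GPU for LeNet-5, and roughly 4x to 9x less for ResNet-20, where the "Others" component dominates the RRAM total.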

# Conclusion

#### Challenges:

- RRAM cannot efficiently support high-precision data & weights
- **Disturbance** on RRAM's resistance

#### Solutions:

- Low-bit CNN training system & algorithm design
- Disturbance analysis and noise-tolerant training

#### Future Work:

- Improving the training algorithm to get higher accuracy
- Enabling RRAM-based logic computing



#### Reference

[1] K. Simonyan et al., "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.

[2] J. Fan et al., "Human tracking using convolutional neural networks," IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1610–1623, 2010.

[3] G. Hinton et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.

[4] A. Karpathy et al., "Deep visual-semantic alignments for generating image descriptions," in Computer Vision and Pattern Recognition, 2015.

[5] B. Li et al., "Merging the interface: Power, area and accuracy co-optimization for rram crossbar-based mixed-signal computing system," in DAC, 2015, p. 13.

[6] L. Xia et al., "Selected by input: Energy efficient structure for rram-based convolutional neural network," in DAC, 2016.

[7] P. Chi et al., "Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory," in ISCA, vol. 43, 2016.

[8] A. Shafiee et al., "Isaac: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in Proc. ISCA, 2016.

[9] F. Alibart et al., "High precision tuning of state for memristive devices by adaptable variation-tolerant algorithm," Nanotechnology, vol. 23, no. 7, p. 075201, 2012.

[10] R. Degraeve et al., "Causes and consequences of the stochastic aspect of filamentary rram," Microelectronic Engineering, vol. 147, pp. 171–175, 2015.

[11] M. Courbariaux et al., "Binarized neural network: Training deep neural networks with weights and activations constrained to +1 or -1," arXiv preprint arXiv:1602.02830, 2016.

[12] M. Rastegari et al., "Xnor-net: Imagenet classification using binary convolutional neural networks," arXiv preprint arXiv:1603.05279, 2016.

[13] S. Zhou et al., "Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients," arXiv preprint arXiv:1606.06160, 2016.

[14] T. Tang et al., "Binary convolutional neural network on rram," in Design Automation Conference (ASP-DAC), 2017 22nd Asia and South Pacific. IEEE, 2017, pp. 782–787.

[15] M. Cheng et al., "Time: A training-in-memory architecture for memristor-based deep neural networks," in Proceedings of the 54<sup>th</sup> Annual Design Automation Conference 2017. ACM, 2017, p. 26.

[16] Z. Jiang et al., "A compact model for metal-oxide resistive random access memory with experiment verification," IEEE Transactions on Electron Devices, vol. 63, no. 5, pp. 1884–1892, 2016.

[17] S. Yu et al., "Scaling-up resistive synaptic arrays for neuro-inspired architecture: Challenges and prospect," in Electron Devices Meeting (IEDM), 2015 IEEE International. IEEE, 2015, pp. 17–3.

[18] Y. LeCun et al., "Comparison of learning algorithms for handwritten digit recognition," in International conference on artificial neural networks, vol. 60. Perth, Australia, 1995, pp. 53–60.

[19] K. He et al., "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[20] C. M. Bishop, "Training with noise is equivalent to tikhonov regularization," Neural computation, vol. 7, no. 1, pp. 108–116, 1995.

[21] A. F. Murray and P. J. Edwards, "Enhanced mlp performance and fault tolerance resulting from synaptic weight noise during training," IEEE Transactions on neural networks, vol. 5, no. 5, pp. 792–802, 1994.

[22] X. Dong et al., "Nvsim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," TCAD, vol. 31, no. 7, pp. 994–1007, 2012.

# THANKS FOR WATCHING

