

# Energy Efficient In-Memory Machine Learning for Data Intensive Image-Processing by Nonvolatile Domain-Wall Memory

Hao Yu<sup>1</sup>, Yuhao Wang<sup>1</sup>, Shuai Chen<sup>1</sup>, Wei Fei<sup>1</sup>, Chuliang Weng<sup>2</sup>, Junfeng Zhao<sup>2</sup>, and Zhulin Wei<sup>2</sup>

<sup>1</sup>School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore

<sup>2</sup>Huawei Shannon Laboratory, China

http://www.ntucmosetgp.net

# **Machine Learning for Image Recognition**

"We took an artificial neural network and spread the computation across 16,000 of our CPU cores (in our data centers), and trained models with more than 1 billion connections." -- Google brain team

#### **A Cat Neuron**



"One of the neurons in the artificial neural network, trained from still frames from unlabeled YouTube videos, learned to detect cats."

# **Big-data Center at Exascale**

- 1 Core = <u>Microprocessor</u> (= 6 Giga Flops @ 1.5 GHz)
  - 4 FPUs + RegFiles
- 1 Chip = 742 Cores (= 4.5 Tera Flops/s)
  - 213 MB of L1 I&D + 93 MB of L2
- 1 Node = 1 Chip + 16 **DRAMs** (16 GB)
- 1 Group = 12 Nodes + 12 Routers (= 54 Tera Flops/s)
- 1 Rack = 32 Groups (= 1.7 Peta Flops/s)
  - 384 nodes / rack
- 1 Data Center (= 1 Exa Flops/s)
  - 3.6 EB of Disk Storage
  - 3.6 PB = 0.0036 bytes/flops
  - 583 Racks
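The hierarchy above is a chain of multiplications, so the per-level figures can be checked with a few lines of arithmetic. The sketch below uses only the numbers on the slide; the variable names are illustrative, and `CAPACITY_BYTES` is simply the 3.6 PB figure.

```python
# Figures copied from the slide; variable names are illustrative.
GFLOPS_PER_CORE = 6.0      # 1 core = 6 Giga Flops at 1.5 GHz
CORES_PER_CHIP = 742
NODES_PER_GROUP = 12       # 1 node holds 1 chip
GROUPS_PER_RACK = 32
RACKS = 583
CAPACITY_BYTES = 3.6e15    # the 3.6 PB figure

chip_tflops = GFLOPS_PER_CORE * CORES_PER_CHIP / 1e3      # about 4.5 TFlops
group_tflops = chip_tflops * NODES_PER_GROUP              # about 54 TFlops
rack_pflops = group_tflops * GROUPS_PER_RACK / 1e3        # about 1.7 PFlops
center_eflops = rack_pflops * RACKS / 1e3                 # about 1 EFlops
nodes_per_rack = NODES_PER_GROUP * GROUPS_PER_RACK
bytes_per_flop = CAPACITY_BYTES / (center_eflops * 1e18)

print(f"chip  : {chip_tflops:.2f} TFlops")
print(f"group : {group_tflops:.1f} TFlops")
print(f"rack  : {rack_pflops:.2f} PFlops ({nodes_per_rack} nodes)")
print(f"center: {center_eflops:.3f} EFlops, {bytes_per_flop:.4f} bytes/flops")
```

The rounded results agree with the slide: the unrounded data-center total comes out slightly under 1 Exa Flops/s.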

Thousand cores in big memory



100Gbps bandwidth with 68MW power

# **Nonvolatile Memory Device**

- 1. Non-volatile state
- 2. No leakage power consumption
- 3. Small overhead between on/off switching
- 4. Universal memory for logic-in-memory



#### **Power issue**

# **In-memory Computing Architecture**



**Bandwidth** issue

# **Non-volatile In-memory Computing**



#### **Outline**

- NVM Device Modeling
- NVM In-memory Logic
- NVM In-memory Architecture for Machine Learning

#### **State of STT-MTJ Devices**

- Macro-scale state of spintronic device:
  - Magnetization angle $\theta(t)$ between successive magnetic layers
  - State dynamics governed by Landau-Lifshitz-Gilbert equation



State $\theta(t)$ in terms of giant magnetoresistance (GMR):

$$R(\theta) = R(\theta_0) + \Delta R_{GMR} \, \frac{1 - \cos \theta(t)}{2}$$
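As a quick numerical illustration of the GMR equation, the sketch below evaluates the resistance at a few angles. It assumes that $\theta_0 = 0$ is the parallel (low-resistance) state, and it borrows the $R_{on}$ = 1000 Ω and $R_{off}$ = 2600 Ω values that the deck quotes for the domain-wall nanowire purely as example numbers; the function name is hypothetical.

```python
import math

def gmr_resistance(theta, r_theta0, delta_r_gmr):
    """R(theta) = R(theta_0) + delta_R_GMR * (1 - cos(theta)) / 2."""
    return r_theta0 + delta_r_gmr * (1.0 - math.cos(theta)) / 2.0

# Example values only (assumption): parallel state at 1000 ohm,
# anti-parallel state at 2600 ohm, so delta_R_GMR = 1600 ohm.
R_ON = 1000.0
R_OFF = 2600.0
DELTA_R = R_OFF - R_ON

for theta in (0.0, math.pi / 2, math.pi):
    print(f"theta = {theta:.3f} rad -> R = {gmr_resistance(theta, R_ON, DELTA_R):.1f} ohm")
```

Under these assumptions the resistance moves smoothly from the parallel value at $\theta = 0$ to the anti-parallel value at $\theta = \pi$, passing through the midpoint at $\theta = \pi/2$.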

#### **NVM SPICE for STT-MTJ**



- `plot v(n1#theta)`
- `plot (v(3)-v(4))/i(vasst)`

http://www.nvmspice.org



| Array size | Behavioral Macromodel (s) | Physical model in NVM-SPICE (s) | Speedup ratio |
|------------|---------------------------|---------------------------------|---------------|
| 8*8        | 2.522                     | 0.257                           | 10x           |
| 16*16      | 98.131                    | 1.87                            | 52x           |
| 32*32      | 1119.99                   | 11.533                          | 97x           |
| 64*64      | 22188.8                   | 189                             | 117x          |

#### From STT-MTJ to Domain-wall Nanowire

#### Shift, write, and read operations:

- 1. Apply shift current to select domain
- 2. Apply write/read current through write/read port
- 3. The state can be read out by detecting the MTJ resistance
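The three steps can be mimicked with a toy behavioral model. This is a sketch under stated assumptions, not the device model used in NVM-SPICE: the class and method names are hypothetical, the nanowire is treated as circular to avoid modelling spare domains, and the mapping of bit 1 to the high-resistance state is chosen arbitrarily.

```python
class DomainWallNanowire:
    """Toy model: a strip of magnetic domains with one fixed access port."""

    def __init__(self, n_domains, port=0):
        self.domains = [0] * n_domains  # one stored bit per domain
        self.port = port                # fixed position of the read/write port
        self.offset = 0                 # net shift applied so far

    def shift(self, steps=1):
        """Step 1: a shift current moves all domains past the port."""
        # Wrap-around is a simplification (assumption), not device behavior.
        self.offset = (self.offset + steps) % len(self.domains)

    def _selected(self):
        return (self.port + self.offset) % len(self.domains)

    def write(self, bit):
        """Step 2: a write current sets the domain under the port."""
        self.domains[self._selected()] = bit

    def read(self, r_on=1000.0, r_off=2600.0):
        """Step 3: sense the MTJ resistance and threshold it to a bit."""
        resistance = r_off if self.domains[self._selected()] else r_on
        return resistance, int(resistance > (r_on + r_off) / 2)


wire = DomainWallNanowire(4)
for bit in [1, 0, 1, 1]:
    wire.write(bit)   # write the domain under the port
    wire.shift(1)     # then bring the next domain to the port
```

Reading the four domains back in the same order (read, then shift) returns the bits in the order they were written.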



#### **Outline**

- NVM Device Modeling
- NVM In-memory Logic
- NVM In-memory Architecture for Machine Learning

# **Domain-wall based XOR Logic**



- The XOR gate is the most complex of the logic gates (16 transistors per gate)
- XOR is heavily used in big-data applications, for example in comparison and addition
- Power-optimized XOR gate built with DWL

# Two domain-wall nanowire devices to build one XOR gate:

- Write A to left nanowire
- Shift A to constructed port
- Write B to right nanowire
- Shift B to constructed port
- Read resistance of constructed port
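A minimal behavioral sketch of this sequence follows. It assumes that the constructed port shows low resistance when the two shifted-in domains are parallel (A = B) and high resistance when they are anti-parallel (A ≠ B), so that thresholding the read resistance yields A XOR B. The function name and the resistance values (reused from the deck's $R_{on}$/$R_{off}$ figures) are illustrative.

```python
R_ON = 1000.0    # assumed resistance when the two domains are parallel
R_OFF = 2600.0   # assumed resistance when the two domains are anti-parallel

def dw_xor(a, b):
    """Return A XOR B by 'reading' the resistance of the constructed port."""
    left_domain = a    # write A to the left nanowire, shift it to the port
    right_domain = b   # write B to the right nanowire, shift it to the port
    resistance = R_ON if left_domain == right_domain else R_OFF
    # Sense step: compare against the midpoint between the two levels.
    return int(resistance > (R_ON + R_OFF) / 2)

truth_table = [(a, b, dw_xor(a, b)) for a in (0, 1) for b in (0, 1)]
for a, b, y in truth_table:
    print(a, b, y)
```

In this model the logic value comes directly from a single resistance read rather than from a transistor network.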

# **Domain-wall based Full-adder and Multiplier**



# **Domain-wall based LUT Logic**



Single 5-bit input multiplication with constant



- Any logic function y = f(x) can be mapped to a look-up table (LUT) with specified inputs
- In the DWM LUT, the word-line and bit-line decoders take the input and locate the target nanowire cell that stores the result
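The LUT idea can be sketched for the 5-bit multiply-by-constant case mentioned above. The constant, the 8 × 4 array shape, and the split of the input into word-line and bit-line address bits are all assumptions made for illustration; the slide does not specify them.

```python
N_BITS = 5       # 5-bit input, as on the slide
COL_BITS = 2     # assumed: low 2 bits select the bit-line, high 3 the word-line
CONSTANT = 13    # hypothetical constant

def build_lut(f):
    """Precompute f(x) for every input and store it as a rows x columns array."""
    n_cols = 1 << COL_BITS
    n_rows = 1 << (N_BITS - COL_BITS)
    return [[f(row * n_cols + col) for col in range(n_cols)]
            for row in range(n_rows)]

def lookup(lut, x):
    """Decode the input into (word-line, bit-line) and fetch the stored result."""
    word_line = x >> COL_BITS
    bit_line = x & ((1 << COL_BITS) - 1)
    return lut[word_line][bit_line]

lut = build_lut(lambda x: x * CONSTANT)
print(lookup(lut, 7))   # the stored product for input 7
```

Once the table is filled, evaluating the function costs one decode and one read, regardless of how complex f is.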



#### **Outline**

- NVM Device Modeling
- NVM In-memory Logic
- NVM In-memory Architecture for Machine Learning

# **Neuron and Neural Network**



#### **Neuron model**

- An assembly of interconnected nodes and weighted links
- The output node sums its input values according to the weights of its links
- The sum at the output node is compared against a threshold *t*
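The neuron model described above amounts to a weighted sum followed by a threshold comparison. A minimal sketch, in which the function name and the choice of "greater than or equal" for the comparison are assumptions:

```python
def neuron(inputs, weights, t):
    """Weighted sum of the inputs, then comparison against threshold t."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum >= t else 0

# Illustrative values only.
fires = neuron([1, 0, 1], [0.5, 0.9, 0.4], t=0.8)   # weighted sum 0.9
quiet = neuron([0, 1, 0], [0.5, 0.9, 0.4], t=1.0)   # weighted sum 0.9
```

The multiply and add steps are the operations that the domain-wall adder and multiplier are meant to provide in memory.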

#### **Neural network**

- A set of neurons with feed-forward connections from inputs to outputs
- Hidden-layer weights are obtained by off-line training and updated by on-line learning

# Non-volatile In-Memory ELM-SR



# Extreme learning machine

- Single hidden layer feed-forward neural networks
- Tuning-free without expensive iterative training of parameters
- ELM-based image super-resolution (ELM-SR)
  - Enhance image resolution for recognition
- How to map ELM-SR to nonvolatile in-memory architecture?

### Extreme Learning Machine based Super-resolution

#### **ELM-SR flow:**

- a) Input (offline memory): feature vector <u>P</u> extracted from images
- b) Training (offline memory) → obtain output weight matrix <u>ow</u>
- c) Randomly generated input weight <u>iw</u> and bias <u>b</u> matrices (offline memory): no parameter tuning required
- d) Testing (online logic)
  - 1. Multiply the input vectors by the input weight matrix: $P \cdot iw$
  - 2. Apply the sigmoid function: $s = \mathrm{sigmoid}(P \cdot iw + b)$
  - 3. Multiply by the output weight matrix: $s \cdot ow$
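The testing phase in step d) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the helper names, the uniform range for the random weights, and the seed are assumptions, and training of `ow` in step b) is not shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a_row[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for a_row in a]

def random_params(n_features, n_hidden, seed=0):
    """Step c): random input weights iw and biases b, never tuned afterwards."""
    rng = random.Random(seed)
    iw = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_features)]
    b = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    return iw, b

def elm_test(P, iw, b, ow):
    """Step d): P*iw, then sigmoid(P*iw + b), then multiplication by ow."""
    weighted = matmul(P, iw)                                   # d.1
    s = [[sigmoid(v + b[j]) for j, v in enumerate(row)]        # d.2
         for row in weighted]
    return matmul(s, ow)                                       # d.3
```

The flow needs only multiply, add, and a sigmoid evaluation, which is why the later slides map it to the weighted-sum and sigmoid operations.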

# **ELM-SR Operation Mapping: Weighted Sum**



# **ELM-SR Operation Mapping: Sigmoid**



# **Experimental Settings and Methodology**

# Conventional general purpose processor platform



# Proposed in-memory domain-wall based neural network platform

Evaluation flow (figure): DW geometric and magnetic parameters<sup>1</sup>; **DW-CACTI** and **NVM-SPICE**; DW-ADDER/DW-MULTIPLIER, DW-LUT, and DW-LOGIC performance; in-memory behavioral simulator; ELM-SR; area, power, and timing.

<sup>1</sup> Technology node of 32 nm is assumed with width of 32 nm, length of 64 nm per domain, and thickness of 2.2 nm for one domain-wall nanowire; $R_{off}$ is set at 2600 Ω, $R_{on}$ at 1000 Ω, the writing current at 100 µA, and the current density at 6 × 10<sup>8</sup> A/cm<sup>2</sup> for the shift operation.

# **Preliminary Results and Conclusions**

Machine learning for super-resolution imaging

Comparisons with conventional architecture





| Platform | DW-NN | GPP (with on-chip memory) | GPP (with off-chip memory) |
|----------|-------|---------------------------|----------------------------|
| Computational resources utilized | 1×Processor<br>7714×DW-ADDER<br>7714×DW-MUL<br>551×DW-LUT<br>1×controller | 1×Processor | 1×Processor |
| Area of computational units | 18 mm<sup>2</sup> (processor) + 0.5 mm<sup>2</sup> (accelerators) | 18 mm<sup>2</sup> | 18 mm<sup>2</sup> |
| Power (W) | 10.1 | 12.5 | 12.5 |
| Throughput (MBytes/s) | 108 | 9.3 | 9.3 |
| Energy efficiency (nJ/bit) | 7 | 389 | 642 |

- 1. All operations involved in machine learning on a neural network can be mapped to a logic-in-memory architecture built from non-volatile domain-wall nanowires.
- 2. I/O traffic in the proposed DW-NN is greatly reduced, giving a 92x improvement in energy efficiency and an 11.6x improvement in throughput over a conventional image-processing system on a general-purpose processor with off-chip memory.
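The improvement factors can be reproduced from the comparison table. The sketch below only divides the reported entries; the dictionary layout and key names are illustrative.

```python
# Values copied from the comparison table.
results = {
    "DW-NN":          {"throughput_mb_s": 108.0, "energy_nj_per_bit": 7.0},
    "GPP (on-chip)":  {"throughput_mb_s": 9.3,   "energy_nj_per_bit": 389.0},
    "GPP (off-chip)": {"throughput_mb_s": 9.3,   "energy_nj_per_bit": 642.0},
}

dw = results["DW-NN"]
throughput_gain = dw["throughput_mb_s"] / results["GPP (off-chip)"]["throughput_mb_s"]
energy_gain_off_chip = results["GPP (off-chip)"]["energy_nj_per_bit"] / dw["energy_nj_per_bit"]
energy_gain_on_chip = results["GPP (on-chip)"]["energy_nj_per_bit"] / dw["energy_nj_per_bit"]

print(f"throughput gain          : {throughput_gain:.1f}x")
print(f"energy gain vs off-chip  : {energy_gain_off_chip:.0f}x")
print(f"energy gain vs on-chip   : {energy_gain_on_chip:.1f}x")
```

The 92x energy figure corresponds to the off-chip-memory column; against the on-chip-memory column the same calculation gives a smaller factor.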

# Thank you!



Please send comments to haoyu@ntu.edu.sg

http://www.ntucmosetgp.net

