

## BEYOND CHARGE-BASED COMPUTING

KAUSHIK ROY

MRIGANK SHARAD, DELIANG FAN, KARTHIK YOGENDRA, CHARLES AUGUSTINE, GEORGE PANAGOPOULOS, XUANYAO FONG

> ELECTRICAL & COMPUTER ENGINEERING PURDUE UNIVERSITY WEST LAFAYETTE, IN 47906, USA

# Why "Beyond Charge based Computing"?




- Traditional computing models (Boolean logic, von Neumann architectures) are highly inefficient at performing tasks that humans routinely perform, such as visual recognition, semantic analysis, and reasoning.
- Bio-inspired computation can outperform von Neumann designs in many such data processing applications if the computing model matches the device architecture and the devices operate at very low voltage





Design methods to exploit the advantages of technology innovations



Fe (4), Co (3) and Ni (2) : unpaired electrons per atom

### Sources of Magnetic Moments in Ferromagnetic Materials



Cobalt Lattice Structure (Hexagonal Close Pack)

Incomplete 3d orbitals are the principal source of magnetic moments in transition metals

# State Variable: Charge & Spin



## **Recent Inventions**



- Memories: high density, stability, low read/write current, access time, zero leakage
- Logic (Boolean & non-Boolean): ultra-low-voltage switch, neuromorphic computing (all-spin logic, no spin-charge conversion), input/output isolation, zero leakage
- Interconnects: spin channel (short), ultra-low voltage swing for charge-based interconnects

# MOS vs. Magnets Switching Energy



Theoretical switching energy of magnet is 0.1aJ << 1fJ (MOSFET)

## **All-Spin Based Systems**



### Energy Barrier Modulation: MOSFETs vs. Magnets

**Conventional charge-based MOSFET (Bulk/SOI/FinFETs)**

The energy barrier in the active channel region can be modulated using:

- Doping
  - Uniform channel doping
  - Symmetric/asymmetric halo doping
  - Source/drain doping
- Work function and material engineering
- Gate dielectric and thickness (T<sub>ox</sub>) modulation

**Nano-magnets with electron spin as state variable**

The energy barrier in magnets is defined as the product of anisotropy and volume ($E_B = K_{u2} V$). It can be modulated using:

- Shape and interfacial magnetic anisotropy (K<sub>u2</sub>)
- Volume of the nano-magnet (thickness (t) and cross-sectional area (A))
- Saturation magnetization ($M_{SAT}$)
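As a rough illustration of the barrier relation $E_B = K_{u2} V$, the sketch below computes the barrier for a cylindrical free layer and expresses it in units of $k_B T$. The ratio $E_B / (k_B T)$ is the usual thermal stability factor; it does not appear on the slide and is added here as an assumption, as are the function names and every numeric value (anisotropy, diameter, thickness, temperature).

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K


def energy_barrier(k_u2, area, thickness):
    """E_B = K_u2 * V, with V = cross-sectional area * thickness (SI units)."""
    return k_u2 * area * thickness


def thermal_stability(e_b, temperature=300.0):
    """Barrier height in units of k_B*T (assumed figure of merit, not from the slide)."""
    return e_b / (K_B * temperature)


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    k_u2 = 1.0e5                       # anisotropy constant, J/m^3
    diameter = 40e-9                   # m
    thickness = 2e-9                   # m
    area = math.pi * (diameter / 2) ** 2

    e_b = energy_barrier(k_u2, area, thickness)
    print(f"E_B = {e_b:.3e} J, E_B/(k_B T) = {thermal_stability(e_b):.1f}")

    # Halving the thickness halves the volume and therefore the barrier.
    print(f"half thickness: {thermal_stability(energy_barrier(k_u2, area, thickness / 2)):.1f}")
```

The second print shows the volume knob from the list above: the barrier scales linearly with thickness and area, so scaling the magnet trades retention for easier switching.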



# Changing the State of a Magnet

- External magnetic field
  - Current-carrying wires, coils, etc.
  - Dipolar coupling
  - ...
- Current-induced spin-transfer torque (1996)

# Magnetic Tunneling Junctions (MTJ) Basics: Read Operation



### MTJ Basics: Spin-transfer torque induced write operation



 $P \rightarrow AP$  magnetization switching is more difficult than  $AP \rightarrow P$  due to lower spin injection efficiency of the free layer



# Bit-Cell Design using STT-MTJ

# 1-T, 1 R STT-MRAM



#### Key Advantages

- High density (1 transistor per bit cell)
- Non-volatility
- No leakage power in unaccessed cells

#### Limitations

- Limited write speed

### Design Challenge 1: Read and Write Failures in 1T-1R Bit-cell Structures



## **Design Challenge 2:** Read-Write Conflict



## Design Challenge 3: Stray Fields and Their Impact on the MTJ Stack



WRITE time increases for higher stray fields  $\rightarrow$  write failures

### Design Challenge 4: Stochastic Switching Behavior due to Thermal Fluctuations





- Thermally induced initial magnetic oscillations are stochastic in nature (white noise)
- STT switching delay distribution has a longer tail, degrading memory yield
- Higher temperature helps in squeezing the switching delay distribution but degrades thermal stability



### Nanoelectronics Research Laboratory

# STT-MTJ Stacks: Stability, Read/Write Conflicts, Write power, ...

# **Alternative Device Structures**



Alternative Device designs may alleviate some of the problems related to scaling

### **Differential Spin Hall MRAM**



m1 and m2 are anti-parallel

#### **Energy-efficient write**

- Spin injection efficiency > 100%
- Low resistance in write current path
- No tunneling barrier reliability issue
- 10X lower write energy compared to 1T1R STT-MRAM

#### **High-speed differential read**

- No need for a reference cell
- 2X signal margin
- 1.6X faster read speed compared to single-ended reading

Y Kim, SH Choday, K Roy, EDL 2013

### **Spin Valves**





### Non-Local STT-MRAM (Separate R/W Paths)



M. Sharad et al., DRC, 2012

# **Simulation Framework**





- a) S. Yuasa et al., Nature Materials vol. 3, no. 12, pp. 868-871, Dec. 2004.
- b) C. J. Lin et al., IEDM, Dec. 2009, pp. 11.6.1-11.6.4.
- c) T. Kishi et al., IEDM, Dec. 2008, pp. 12.6.1-12.6.4.
- Device level simulation results may be calibrated to experimental data
- Device parameters are imported into SPICE model for circuit level simulations


## **Read-Only Memory-Embedded MRAM**

(Roy)



Lee, Fong, Roy, "Rom-Embedded MRAM", EDL 2013

• No performance penalty to Random Access Memory (RAM) mode.

## STT-MRAM: Last Level Cache

(Roy)

• Drop-in replacement of SRAM with STT-MRAM for LLC leads to improvement in capacity and leakage, but higher latency and active energy.



• Circuit and architectural techniques can greatly improve the efficiency of spin-based caches.






Park, Raghunathan, Roy; DAC'12

C-SPIN

Annual Review, Sept 26, 2013



# Comparison of 1T-1R with SRAM Last Level Caches with Similar Cache Area (2MB SRAM, 8MB STT MRAM)



[1] S. P. Park *et al*, DAC 2012

## System Level Implication of STT MRAM Last Level Caches



**Compared to SRAM caches, STT-MRAM caches offer:**

- Higher throughput due to larger integration density and lower cache miss rate
- Lower Energy due to low leakage and low utilization



# Boolean/Non-Boolean Computing with Spin Torque Transfer Devices

# **Spin-Torque Experiments**

Nano-magnets in a lateral spin valve (LSV) can interact and switch through spin-torque

Non-local STT switching in LSVs can be employed to realize the All Spin Logic Device (ASLD)

Low-current, high-speed domain-wall motion can be achieved in scaled ferromagnetic nano-strips with perpendicular magnetic anisotropy (PMA)

Such low energy DW motion can be employed in logic computation





## **Compact Full Adder: Non-Boolean Logic**

- Option 1: with standard NAND logic (44 magnets)
- Option 2: functionality-enhanced NB logic (5 magnets)







Truth Table for SUM

# Boolean Logic Using Spin Torque Devices

Low-voltage spin-torque switching in spin valves can possibly achieve lower switching energy than CMOS

Compact All Spin Logic gates can achieve higher area density





#### **Majority Gate**

| A | B | C | OUT |
|---|---|---|-----|
| 0 | 0 | 0 | 0   |
| 0 | 0 | 1 | 0   |
| 0 | 1 | 0 | 0   |
| 0 | 1 | 1 | 1   |
| 1 | 0 | 0 | 0   |
| 1 | 0 | 1 | 1   |
| 1 | 1 | 0 | 1   |
| 1 | 1 | 1 | 1   |
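The truth table above can be reproduced with a short sketch. It also shows one well-known threshold-logic identity for building a full adder from majority gates: carry is the 3-input majority, and sum is a 5-input majority in which the inverted carry is counted twice. This is offered only as an illustration of functionality-enhanced majority logic; it is not claimed to be the 5-magnet layout from the slides, and the function names are hypothetical.

```python
from itertools import product


def majority(*bits):
    """Return 1 if more than half of the (odd number of) input bits are 1."""
    if len(bits) % 2 == 0:
        raise ValueError("majority needs an odd number of inputs")
    return int(sum(bits) > len(bits) // 2)


def full_adder(a, b, cin):
    """Full adder from majority gates (illustrative identity, not the slide's layout)."""
    cout = majority(a, b, cin)
    # The inverted carry enters twice, i.e. with weight 2.
    s = majority(a, b, cin, 1 - cout, 1 - cout)
    return s, cout


if __name__ == "__main__":
    print("A B C | OUT | SUM COUT")
    for a, b, c in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, c)
        print(a, b, c, "|", majority(a, b, c), "  |", s, "  ", cout)
```

The weight-2 input is where a non-Boolean (threshold) device helps: a current-summing magnet can apply unequal weights directly, whereas a NAND-only design has to expand the same function into many gates.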

## Switching Energy vs. Speed



## 2-Phase Pipelining for Enhanced Performance



Non-volatile nanomagnets facilitate pipelining simply by the use of clocked power supply

No extra latches needed!



M. Sharad, K. Roy, ISQED, 2013

## **2-Phase Pipelining for Enhanced Performance**

For a logic block with N-magnets in series :

$E_{switch} = N \, (T_{switch} \times I_{switch} \times V_{switch})$, where $T_{logic} = N \, T_{switch}$

For fine-grained pipelining:

$T_{logic,pl} = T_{switch,pl}$. Hence, $T_{switch,pl} = T_{switch} / N$ and $E_{switch,pl} = E_{switch} / N$

Reduction in switching energy proportional to the number of logic stages in a pipeline can be achieved
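The scaling stated above can be checked numerically. The sketch below simply evaluates the slide's expressions for the unpipelined and the fine-grained pipelined case; the function names and all numeric values (stage count, switching time, current, voltage) are hypothetical and chosen only to make the factor-of-N reduction visible.

```python
def switching_energy(n_stages, t_switch, i_switch, v_switch):
    """E_switch = N * (T_switch * I_switch * V_switch) for N magnets in series."""
    return n_stages * t_switch * i_switch * v_switch


def pipelined_switching_energy(n_stages, t_switch, i_switch, v_switch):
    """E_switch_pl = E_switch / N under fine-grained pipelining, as stated on the slide."""
    return switching_energy(n_stages, t_switch, i_switch, v_switch) / n_stages


if __name__ == "__main__":
    # Hypothetical operating point, for illustration only.
    n, t_sw, i_sw, v_sw = 8, 1e-9, 10e-6, 20e-3
    e_plain = switching_energy(n, t_sw, i_sw, v_sw)
    e_pipe = pipelined_switching_energy(n, t_sw, i_sw, v_sw)
    print(f"unpipelined: {e_plain:.2e} J, pipelined: {e_pipe:.2e} J, ratio: {e_plain / e_pipe:.1f}")
```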



## 2-Phase Pipe-lined 8-bit Multiplier (Carry-save)



Clocked power supply requires transistors which increase the required voltage supply

#### 2-Phase Pipelining: Design Optimization

The supply voltage needed for pipelined spin logic decreases as the clocking-transistor area (Area\_Tx) increases.

Higher transistor width and lower supply voltage lead to a reduction in static power, but the dynamic clocking power increases.



### 2-Phase Pipe-lining: Design Optimization

Area benefit of pipelining over 15nm CMOS design reduces with increasing Tx area for spin logic.

Power consumption of pipelined design reduces with increasing Tx area

It reaches a saturation point due to increase in dynamic power



## 2-Phase Pipelining: Design Trade-off



Minimum size clocking transistors require large  $V_{ds}$  and hence overall power consumption can be high

Area of spin logic block can be traded off (by using larger clocking transistors) to achieve larger power saving.

#### **3-D FOR HIGH DENSITY AND LOW POWER COMPUTATION BLOCKS**

3-D spin logic can be constructed by stacking 2-D layers along the vertical direction.

All the layers in the vertical direction are supplied current using the same CMOS transistors.



M. Sharad, K. Roy, ISQED, 2013

# **3-D FOR HIGH DENSITY AND LOW POWER COMPUTATION BLOCKS**



Due to low resistance of the metallic vias, the current injection per magnet remains almost the same for a given transistor width and supply voltage, even when multiple layers are stacked.

#### **3-D FOR HIGH DENSITY AND LOW POWER COMPUTATION BLOCKS**



Stacking N layers reduces the effective power consumption by a factor of N

For a given number of stacks, larger size of clocking transistors can be used to lower the supply voltage and hence the static power



## Non-Boolean/Neuromorphic Computing with Spin Devices

M. Sharad, K. Roy; TNANO 2011, DAC 2012, DRC 2012, IEDM 2012

## Non-Boolean & Neuromorphic Computing

- Traditional computing models (Boolean logic, von Neumann architectures) are highly inefficient at performing tasks that humans routinely perform, such as visual recognition, semantic analysis, and reasoning.
- Bio-inspired computation can outperform Von-Neumann designs in many such data processing applications



### Hardware Implementation of Neural Networks



Digital designs consume large area, whereas analog designs are power hungry. Hence, there is a need to match the characteristics of the devices to the computing models to provide drastic improvements in efficiency.

## **Artificial Neural Networks**



Sharad, Roy, TNANO 2012; Sharad, Roy, DAC 2012; Sharad, Roy, IEDM 2012; Sharad, Roy, IJCNN 2012

- Low-impedance, current-mode STT switches can provide compact and energy-efficient mapping for the analog thresholding operation of a "neuron".
- Spin neurons with spin/charge-based synapses can facilitate the design of ultra-low-power neuromorphic hardware for non-Boolean data processing.

## Lateral Spin Valve with Decoupled Read-Write Path : Non-local Spin Injection



Read-write decoupling in LSVs facilitates ultra-low-voltage switching


## Lateral Spin Valve with Decoupled Read-Write Path: Local Spin Injection



- This structure employs local spin injection and can achieve higher injection efficiency in addition to read-write decoupling
- An extended read port is used for sensing, while write current is injected into the output lead through the magnet
- The small dimension of $m_2$ ensures mono-domain behavior, despite the read extension

## **Bipolar Spin Neuron Using Lateral Spin Valve**



The neuron device essentially acts as an ultra low voltage current comparator and can be employed to perform analog-mode computation



## **Neuron Models Using Domain Wall Magnets**

Properties of DWM favorable for high-speed, low-current switching:

- Scaling (DW width as well as thickness) reduces the critical current for DW motion [2, 4]
- PMA devices achieve significantly lower critical current density, ~10<sup>6</sup> A/cm<sup>2</sup> [4]
- Lower saturation magnetization and higher coercive field can achieve faster switching [1]
- Non-volatility can be sacrificed for high-speed computing to further reduce the switching current

- [1] Duc-The Ngo et al., *arXiv preprint arXiv:1110.5112* (2011).
- [2] K. Ikeda et al., *Applied Physics Express* 4.9 (2011): 3002.
- [3] C. K. Lim et al., *Applied Physics Letters* 84.15 (2004): 2820-2822.
- [4] S. Fukami et al., *IEEE Symp. on VLSI Tech.*, 2009.

## **Unipolar Domain Wall Neuron**



A thin and short PMA DWM free layer can be switched with a small current ($\sim 1\,\mu A$) within 1 ns

Such a magneto-metallic device can be used to perform ultra low voltage and low energy current-mode summation and thresholding, like a "**neuron**"

## **Bipolar Domain Wall Neuron**



- BDWN can compare the input currents received through the two input domains
- The resolution is determined by the critical switching current density for DW nano-strip

## **Communication between Neurons**



CMOS latch detects the state of neuron-MTJ without static current injection

- A deep-triode current source (DTCS) transistor driven by an output latch transmits a current-mode signal to receiving neurons
- Computation current flows across a terminal voltage of $\Delta V$

## **Communicating Neurons**

Neurons can be interconnected using weighted DTCS (Deep Triode Current Source) transistors.



Depending upon the polarity of inter-neuron weight, output of a source neuron connects to one of the inputs of the receiving neuron.

Current-mode, inter-neuron signaling takes place through DTCS transistors operating at ~20mV drain-to-source voltage resulting in low computation power
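A behavioral sketch of this signaling scheme may help. Each source neuron contributes a current proportional to the magnitude of its weight; the sign of the weight selects which of the two inputs of the receiving neuron it feeds, and the receiving neuron acts as a current comparator. The ~20 mV terminal voltage is taken from the text above; the function names, the 1 µA unit current, and the example weights are hypothetical.

```python
def bipolar_neuron(source_outputs, weights, unit_current=1e-6):
    """Route weighted currents to the +/- inputs and compare them.

    source_outputs: 0/1 outputs of the source neurons.
    weights: signed inter-neuron weights (DTCS strength is |w|).
    unit_current: current per unit weight in amperes (assumed value).
    """
    i_plus = 0.0
    i_minus = 0.0
    for out, w in zip(source_outputs, weights):
        current = unit_current * abs(w) * out
        if w >= 0:
            i_plus += current   # positive weight drives the first input
        else:
            i_minus += current  # negative weight drives the second input
    fired = 1 if i_plus > i_minus else 0
    return fired, i_plus, i_minus


def signaling_power(total_current, delta_v=20e-3):
    """Static power of current-mode signaling across a terminal voltage delta_v."""
    return total_current * delta_v


if __name__ == "__main__":
    fired, i_p, i_m = bipolar_neuron([1, 1, 1], [2.0, -1.0, -0.5])
    print("output:", fired)
    print(f"signaling power: {signaling_power(i_p + i_m):.2e} W")
```

Because power scales with the terminal voltage, operating the DTCS transistors at tens of millivolts rather than a full CMOS supply is what keeps the computation power low in this picture.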

## **Programmable Synapses: Memristors/PCM**



- Memristors, PCM, or DWMs can be used to realize low-power neuromorphic computation arrays using bipolar spin neurons
- The magneto-metallic neurons facilitate input voltage levels of ~20mV resulting in low computation power

## Synapse: Domain Wall Magnet



A DWM consists of opposite-polarity domains separated by a non-magnetic region called the domain wall, which can be moved by charge injection or a magnetic field

#### Device level Programmability using Domain Wall Magnet as Synapse



A neuron with a small number of programmable DWM inputs can be employed to realize a configurable data-processing array of cellular neurons

#### **Drawing Analogy with Biological Neural Network!**



#### **DWM Based Neuron-Synapse Unit: Modeling**

Relative spin potential in the 2-D channel for neuron with 24 input synapses and charge current  $\sim$ 10µA per synapse



The neuron is simulated using a self-consistent solution of 2-D spin transport in the channel and the LLG equation for the magnets
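For readers unfamiliar with the magnet half of that framework, here is a minimal macrospin sketch of the Landau-Lifshitz-Gilbert (LLG) equation in its explicit form, $\frac{d\mathbf{m}}{dt} = -\frac{\gamma}{1+\alpha^2}\left[\mathbf{m} \times \mathbf{H} + \alpha\, \mathbf{m} \times (\mathbf{m} \times \mathbf{H})\right]$. It is an assumption-laden toy: a single-domain magnet in dimensionless units, forward-Euler integration with renormalization, no spin-torque term, no thermal field, and no coupling to the 2-D spin transport solver mentioned above. All names and parameter values are hypothetical.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def llg_step(m, h, alpha, gamma, dt):
    """One forward-Euler step of the explicit LLG equation, then renormalize |m| = 1."""
    mxh = cross(m, h)
    mxmxh = cross(m, mxh)
    pref = -gamma / (1.0 + alpha * alpha)
    dm = tuple(pref * (p + alpha * q) for p, q in zip(mxh, mxmxh))
    return normalize(tuple(mi + dt * di for mi, di in zip(m, dm)))


def relax(m0, h, alpha=0.1, gamma=1.0, dt=0.01, steps=20000):
    """Integrate the magnetization from m0 in a constant effective field h."""
    m = normalize(m0)
    for _ in range(steps):
        m = llg_step(m, h, alpha, gamma, dt)
    return m


if __name__ == "__main__":
    # Magnet starts almost in-plane; the damping term pulls it toward the field along +z.
    m_final = relax((1.0, 0.0, 0.1), (0.0, 0.0, 1.0))
    print("final magnetization:", tuple(round(c, 4) for c in m_final))
```

In a full device simulation the effective field would include anisotropy and demagnetization terms, and the spin current computed from the channel transport would enter as an additional torque.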



## **Application Examples**




# Ultra low energy Analog signal acquisition and processing



Cell state equation for discrete-time CNN:

$$x_{ij}(n) = \sum_{(k,l) \in N(i,j)} A(i, j; k, l) \, y_{kl}(n) + \sum_{(k,l) \in N(i,j)} B(i, j; k, l) \, u_{kl}(n) + Z(i, j)$$

$$y_{ij}(n) = f'(x_{ij}(n-1)) = \begin{cases} 1 & \text{if } x_{ij}(n-1) > 0 \\ 0 & \text{if } x_{ij}(n-1) < 0 \end{cases}$$

- $x(n)$: cell state at the $n$-th time step
- $A$: 3x3 feedback template
- $B$: 3x3 input template
- $u(n)$: 3x3 neighborhood input at $T = n$
- $Z$: cell bias
- $y(n)$: cell output at $T = n$
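The cell state equation can be exercised with a small pure-Python sketch. Several choices here are assumptions rather than anything stated on the slide: the templates are space-invariant and indexed by neighbor offset, cells outside the array contribute nothing (zero padding), a single scalar bias is used for all cells, and a state of exactly zero (which the threshold definition leaves open) maps to output 0. The function names and the example templates are hypothetical.

```python
def dtcnn_state(y, u, A, B, z):
    """Compute x(n) from outputs y(n), inputs u(n), 3x3 templates A, B and bias z."""
    rows, cols = len(u), len(u[0])
    x = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = z
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    k, l = i + di, j + dj
                    if 0 <= k < rows and 0 <= l < cols:  # zero padding (assumption)
                        acc += A[di + 1][dj + 1] * y[k][l]
                        acc += B[di + 1][dj + 1] * u[k][l]
            x[i][j] = acc
    return x


def dtcnn_output(x_prev):
    """Threshold the previous state: 1 if x > 0, else 0 (x == 0 mapped to 0 by assumption)."""
    return [[1 if v > 0 else 0 for v in row] for row in x_prev]


if __name__ == "__main__":
    # Hypothetical templates: no feedback, pass-through input, bias of -0.5.
    A = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
    B = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
    u = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
    y0 = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]

    x1 = dtcnn_state(y0, u, A, B, z=-0.5)
    for row in dtcnn_output(x1):
        print(row)
```

With these templates the network simply binarizes the input at 0.5; edge detection and similar image-processing tasks follow the same loop with different A, B, and Z values.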

Hardware based on Cellular neural networks can be employed in several image processing applications

## Example : Ultra Low Power SAR-ADC



SAR ADC employs recursive evaluation akin to CNN equation and hence can be implemented using the bipolar spin neuron
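To make the recursion concrete, the following behavioral sketch runs a successive-approximation conversion in which the only decision element is a current comparator, the role the bipolar spin neuron would play. It models no device physics; the function names, the normalized full-scale current, and the 4-bit resolution are hypothetical.

```python
def comparator(i_in, i_ref):
    """Ideal current comparator standing in for the spin neuron: 1 if i_in > i_ref."""
    return 1 if i_in > i_ref else 0


def sar_adc(i_in, full_scale, n_bits):
    """Successive approximation: test one bit per step, from MSB to LSB."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)                    # tentatively set this bit
        i_dac = full_scale * trial / (1 << n_bits)   # DAC current for the trial code
        if comparator(i_in, i_dac):
            code = trial                             # keep the bit if input exceeds the DAC level
    return code


if __name__ == "__main__":
    for sample in (0.0, 0.3, 0.7, 0.999):
        print(sample, "->", format(sar_adc(sample, 1.0, 4), "04b"))
```

Each loop iteration reuses the previous decision to form the next reference, which is the recursive, feedback-plus-threshold structure that the text likens to the CNN state update.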

# Ultra low energy Analog Signal Acquisition and Processing: Performance



(S<sup>2</sup> × #Pixels × fps)

- S: technology ratio
- fps: frames/sec

#### **Spin Based Neuron for Cross-Bar Neural Network**



# Neural Computing With Spin Neurons and Resistive Crossbar Memory



The spin-based neuron unit achieves ~100x improvement in power consumption for crossbar ANN architectures based on memristors/PCM



## Interconnects


#### STT based Energy-Efficient Global Interconnects

#### **Motivation:**

- Global interconnects like data buses and memory bit-lines can account for more than 90% of the power for on-chip memory
- Emerging CMPs may dissipate ~50% of their power for global communication among memory and multiple processors

#### **Technology/circuit solutions:**

- On-chip transmission line
- CMOS current-mode interconnects
- Low voltage swing
- Optical, plasmonic, graphene, CNT

#### **Our goals and approaches:**

- Emerging spin-torque phenomena, like the Spin Hall Effect (SHE), may lead to high-speed, low-voltage current-mode switches based on nano-scale magnets.
- We propose and analyze spin-torque switches in the design of energy-efficient and high-performance current-mode on-chip global interconnects.
- Simulations show up to two orders of magnitude higher energy efficiency




## **Unipolar Domain Wall Switch for Interconnect**



Unipolar Domain-Wall Switch

DWS as a trans-impedance converter

DW motion (with SHE assist) can be used to realize a two-terminal current-mode STT switch

Such a switch can convert an ultra-low-voltage current-mode input signal into a full-swing output voltage in two steps: charge-to-spin conversion through DW motion, followed by spin-to-charge conversion through the MTJ



## Interconnects Using Spin-Torque Switches



For 2 Gbps signaling over a 10 mm on-chip interconnect:

Conventional CMOS:

- Voltage mode: $0.5\,CV^2 = 2.5\,\text{pF} \times 0.0625\,\text{V}^2 \approx 150\,\text{fJ}$
- Current mode: 90 fJ (S. Lee et al., ISSCC 2013)

Spintronic current mode:

$$E_{signaling} = I_{switch} T_{bit} \Delta V + I_{MTJ} V_{MTJ} T_{bit} + T_{bit} (P_{inverter} + P_{driver})$$

$$= (20\,\mu\text{A} \times 0.5\,\text{ns} \times 100\,\text{mV}) + (0.7\,\mu\text{A} \times 0.6\,\text{V} \times 0.5\,\text{ns}) + (0.5\,\text{ns} \times 0.7\,\mu\text{W}) \sim 1\,\text{fJ}$$
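The arithmetic behind the spintronic estimate can be reproduced directly. The sketch below evaluates the three terms of $E_{signaling}$ with the values quoted above and compares the total against the 150 fJ voltage-mode CMOS figure. The function and parameter names are hypothetical, and the inverter and driver powers are lumped into one static-power argument because the slide gives only their sum (0.7 µW).

```python
def spintronic_signaling_energy(i_switch, delta_v, i_mtj, v_mtj, p_static, t_bit):
    """E = I_switch*T_bit*dV + I_MTJ*V_MTJ*T_bit + T_bit*(P_inverter + P_driver)."""
    e_switch = i_switch * t_bit * delta_v   # current-mode signaling across dV
    e_read = i_mtj * v_mtj * t_bit          # MTJ read-out (spin-to-charge) term
    e_static = t_bit * p_static             # inverter + driver power over one bit time
    return e_switch + e_read + e_static


if __name__ == "__main__":
    t_bit = 0.5e-9  # one bit period at 2 Gbps
    e_spin = spintronic_signaling_energy(
        i_switch=20e-6, delta_v=100e-3,
        i_mtj=0.7e-6, v_mtj=0.6,
        p_static=0.7e-6, t_bit=t_bit,
    )
    e_cmos_voltage_mode = 150e-15  # figure quoted on the slide
    print(f"spintronic: {e_spin * 1e15:.2f} fJ per bit")
    print(f"CMOS voltage mode / spintronic: {e_cmos_voltage_mode / e_spin:.0f}x")
```

Summing the three terms gives a value on the order of 1 fJ, dominated by the signaling term, which is consistent with the roughly two orders of magnitude advantage claimed over voltage-mode CMOS.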

Sharad, Roy, EDL, 2013; Sharad, Roy, IEDM, 2013

 STT switches can be employed for ultra-low energy and compact current mode global interconnects





### **Device-Circuit Co-Simulation-Framework**





## MODELING, SIMULATION, AND EVALUATION FRAMEWORK



Predictive modeling infrastructure will be co-developed with theme 4


# Summary

- Spin-torque memories for on-chip applications are becoming a reality
- Spin-torque devices such as LSVs and DWMs show potential for a class of applications that CMOS is not efficient at implementing, such as neuromorphic designs, associative computing, semantic analysis, etc.
- Potential for large improvement in energy consumption

### **References (mostly our work)**

- [1] M. Sharad, G. Panagopoulos and K. Roy, "Spin Neuron for Ultra Low Power Computational Hardware", DRC, 2012.
- [2] M. Sharad, "Spin-Based Neuron Model with Domain Wall Magnet as Synapse", IEEE Transactions on Nanotechnology, 2012.
- [3] M. Sharad, C. Augustine, G. Panagopoulos and K. Roy, "Cognitive Computing with Spin Based Neural Networks", DAC, 2012.
- [4] M. Sharad, C. Augustine, G. Panagopoulos and K. Roy, "Spin Based Neuron-Synapse Unit for Ultra Low Power Programmable Computational Networks", IJCNN, 2012.
- [5] M. Sharad, C. Augustine, G. Panagopoulos and K. Roy, "Ultra Low Energy Analog Signal Processing Using Spin-Based Neurons", NanoArch, 2012.
- [6] M. Sharad et al., "Spin-Based Neuron Model with Domain Wall Magnet as Synapse", IEEE Transactions on Nanotechnology, 2012, arxiv.org.
- [7] M. Sharad et al., "Ultra Low Energy Analog Signal Processing Using Spin Neurons Based on Nano Magnets", 2012, arxiv.org.
- [8] M. Sharad et al., "Proposal for Neuromorphic Hardware Using Spin Devices", 2012, arxiv.org.
- [9] Behin-Aein et al., "Proposal for an all-spin logic device with built-in memory", Nature Nanotechnology, 2010.
- [10] C. Augustine et al., "Low-Power Functionality Enhanced Computation Architecture Using Spin-Based Devices", NanoArch, 2011.
- [11] C. Augustine et al., "A Self-Consistent Simulation Framework for Spin-Torque Induced Domain Wall Propagation", IEDM, 2011.
- [12] Kimura et al., "Switching magnetization of a nanoscale ferromagnetic particle using nonlocal spin injection", Phys. Rev. Lett., 2006.

# Acknowledgement

- Niladri Mojumder, GF
- Charles Augustine, Intel
- Mrigank Sharad, Purdue
- Deliang Fan, Purdue
- Xuanyao Fong, Purdue
- Sumeet Gupta, Purdue
- Harsha Choday, Purdue
- Sang-Phill Park, Intel
- Prof. Supriyo Datta
- Prof. Swarup Bhunia, Case Western U.
- Prof. Anand Raghunathan, Purdue

StarNet, NRI, INDEX, SRC, Intel, NSF, Qualcomm