



## MPICC: Multiple-Precision Inter-Combined MAC Unit with Stochastic Rounding for Ultra-Low-Precision Training

### Leran Huang<sup>1</sup> (Speaker)

Yongpan Liu<sup>2</sup>, Xinyuan Lin<sup>2</sup>, Chenhan Wei<sup>2</sup>, Wenyu Sun<sup>2</sup>, Zengwei Wang<sup>2</sup>, Boran Cao<sup>1</sup>, Chi Zhang<sup>2</sup>, Xiaoxia Fu<sup>2</sup>, Wentao Zhao<sup>2</sup>, and **Sheng Zhang<sup>1\*</sup>**

<sup>1</sup> Tsinghua Shenzhen International Graduate School, Shenzhen, China

<sup>2</sup> Department of Electronic Engineering, Tsinghua University, Beijing, China

## **Self Introduction**

- B.Eng. degree from Harbin Engineering University, Harbin, China, in 2022.
- M.Eng. candidate at Tsinghua University, Beijing, China.
- My research interests include the design of energy-efficient AI accelerators and multi-precision processing elements.



Leran Huang

## Outline

- Background & Challenges
  - Flexible Support for Ultra-Low-Precisions
  - Reduce Bit Width for Accumulation
  - Support High-Precision at Low Cost
- Overall Introduction of MPICC
- Key Features of MPICC
- Experiment & Results
- Conclusion

## **Cost Trend of Deep Learning Training**



### **Model Complexity**

- Larger model sizes
- Greater compute load

### Hardware Needs

- More processing units
- Larger memory capacity
- Wider bandwidth
- Longer processing time


## **Quantization: Reduce Costs & Improve Performance**



## **Research on Ultra-Low-Precision Training**



| Model    | FP32 | MXFP4 Wt & MXFP6 Act |
|----------|------|----------------------|
| GPT-20M  | 3.98 | 4.04                 |
| GPT-300M | 3.11 | 3.14                 |
| GPT-1.5B | 2.74 | 2.76                 |

[Figure: training curves comparing FP32 with MXFP4 Wt / MXFP6 Act for GPT-20M, GPT-150M, GPT-345M, and GPT-1.5B]


## **Chal.1: Flexible Support for Ultra-Low-Precisions**

Features of Ultra-Low-Bit Quantization

[Figure: **ResNet-18** activation tensor distributions (intra-tensor and inter-tensor), with regions annotated "High precision" and "Uniform-like"]

- Use multiple ultra-low-precisions (FP, INT, LOG)
- Determine precision based on **tensor distribution**

### **Demands for Direct Computing**



## **Chal.2: Reduce Bit Width for Accumulation**

### Why Bit Width Is Difficult to Reduce

- Ensure a wide range to prevent overflow
- Prevent training stagnation caused by <u>swamping</u>



[Naigang Wang et al., NeurIPS 2018]
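Swamping can be reproduced in a few lines. The sketch below is illustrative and not taken from the slides: it emulates an FP16 accumulator with the half-precision format (`"e"`) of Python's `struct` module, and the helper names `to_fp16` and `accumulate_fp16` are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate_fp16(value: float, count: int) -> float:
    """Add `value` to an FP16 accumulator `count` times, rounding after each add."""
    acc = 0.0
    for _ in range(count):
        acc = to_fp16(acc + value)
    return acc


if __name__ == "__main__":
    # The exact sum is 4096. Once the accumulator reaches 2048, the FP16
    # spacing is 2, so adding 1 rounds back to 2048 and the sum stops growing.
    print(accumulate_fp16(1.0, 4096))
```

The small addend is lost entirely once it falls to half of the accumulator's spacing, which is why a narrow accumulator with nearest rounding can stall.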

- Necessity of Reducing Bit Width
  - FP adders become power **bottleneck**
  - Reduce the cost of result compression

### **FP Processing Power Breakdown**



## **Chal.3: Support High-Precision at Low Cost**

### Use High-Precision for Critical Layers

- Improve accuracy and support more networks
- Maintain performance (in [3], throughput drops by 24%)

### **Cost Versus Bit Width**

- Cost increases rapidly with bit width
- Supporting high precision is <u>expensive</u>

| Work | Model       | Method                                              | Degradation |
|------|-------------|-----------------------------------------------------|-------------|
| [1]  | AlexNet     | Upgrade the final layer precision from FP8 to FP16  | 0.07%       |
| [2]  | MobileNetV2 | Upgrade Conv1x1 layers precision from FP4 to FP8    | 0.60%       |
| [3]  | ResNet-50   | Add 3 FP16 fine-tuning epochs (originally 4bit)     | 0.32%       |

Reference List

- [1] [Naigang Wang et al., NeurIPS 2018]
- [2] [Xiao Sun et al., NeurIPS 2020]
- [3] [Brian Chmiel et al., ICLR 2023]

| Operation    | Energy (pJ) |
|--------------|-------------|
| 8b INT Add   | 0.03        |
| 16b INT Add  | 0.05        |
| 32b INT Add  | 0.1         |
| 16b FP Add   | 0.4         |
| 32b FP Add   | 0.9         |
| 8b INT Mult  | 0.2         |
| 32b INT Mult | 3.1         |
| 16b FP Mult  | 1.1         |
| 32b FP Mult  | 3.7         |

[M. Horowitz, ISSCC 2014]

## Outline

- Background & Challenges
- Overall Introduction of MPICC
  - Multiple-Precision Inter-Combined Computing
- Key Features of MPICC
- Experiment & Results
- Conclusion

## **Overall Introduction of MPICC**

### Function of MPICC

- Performs a dot product over K = 1, 2, or 4 operand pairs per operation
- Accumulates the results in FP12 & INT8
- $F = \begin{cases} A_1 \times B_1 + C & K = 1 \\ A_1 \times B_1 + A_2 \times B_2 + C & K = 2 \\ A_1 \times B_1 + A_2 \times B_2 + A_3 \times B_3 + A_4 \times B_4 + C & K = 4 \end{cases}$
- Supported Precision Modes
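The function of the unit can be written as a short reference model. This is a minimal sketch in ordinary Python floats; it ignores the input formats, the FP12/INT8 accumulator, and rounding, and the name `mpicc_mac` is hypothetical.

```python
from typing import Sequence


def mpicc_mac(a: Sequence[float], b: Sequence[float], c: float) -> float:
    """Return A1*B1 + ... + AK*BK + C for K in {1, 2, 4}."""
    k = len(a)
    if k not in (1, 2, 4) or len(b) != k:
        raise ValueError("a and b must have the same length K, with K in {1, 2, 4}")
    return sum(x * y for x, y in zip(a, b)) + c


if __name__ == "__main__":
    print(mpicc_mac([1.5], [2.0], 0.5))                # K = 1
    print(mpicc_mac([1.0, 2.0], [3.0, 4.0], 1.0))      # K = 2
    print(mpicc_mac([1, 2, 3, 4], [1, 1, 1, 1], 10))   # K = 4
```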



### MPICC Overall Architecture



## Feature 1 of MPICC

### Feature1: MPICC Architecture

- Aim: Support inter-computations among multiple FP, INT, and LOG formats
- **Chal. 1:** Meet the need for multiple precisions in ultra-low-precision training; direct computing saves resources

### Principle of MPICC

Compute after decoding into a <u>unified format</u>






## Feature 2 of MPICC

[Figure: MPICC overall architecture, showing the left decoder, the right decoder, and the high-precision emulating controller]

### Feature2: Optimized Stochastic Rounding

- Aim: Perform SR in advance to maintain accuracy and reduce the accumulation bit width
- Chal. 2: Prevent <u>swamping</u> and training stagnation, and reduce accumulator resource consumption

### Principle of Variable-SR



## **Feature 3 of MPICC**

### Feature3: High-Precision Emulating

- Aim: <u>Decompose</u> FP12 computation into multiple FP8 computations in low-precision hardware
- Chal. 3: <u>Avoid adding</u> high-precision hardware & Only 5.7% increase in area cost for control

### Principle of High-Precision Emulating






## Outline

- Background & Challenges
- Overall Introduction of MPICC
- Key Features of MPICC
  - Feature 1: MPICC Architecture
  - Feature 2: Optimized Stochastic Rounding Strategy
  - Feature 3: High-Precision Emulating Controller
- Experiment & Results
- Conclusion

## **Feature 1 — MPICC Implementation Principle**

#### **Realize Multi-Shape Multiplication (INT)**

- Decompose 2<sup>n</sup>-bit multiplication into the sum of multiple 2-bit multiplications after shifting
- Adjust shift amounts to perform multi-shape multiplication

#### **Migration of INT Scheme to FP**

- Add FP-specific processes such as decoding, normalization, and rounding
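The decomposition into shifted 2-bit products can be sketched as follows. This is an illustrative unsigned-integer model, not the MPICC datapath: the function names and the two mode strings are hypothetical, and signs and FP-specific steps are left out.

```python
from typing import List


def split_2bit(x: int, width: int) -> List[int]:
    """Split an unsigned `width`-bit integer into 2-bit digits, least significant first."""
    return [(x >> s) & 0b11 for s in range(0, width, 2)]


def multi_shape_mul(a_word: int, b_word: int, mode: str) -> int:
    """Combine the sixteen 2-bit x 2-bit products of two 8-bit words.

    mode "8x8":     one 8-bit multiplication a_word * b_word.
    mode "2x(4x4)": each word holds two 4-bit operands; returns
                    a_lo * b_lo + a_hi * b_hi (a 2-element dot product).
    Only the selection and shift of the partial products differ between modes.
    """
    da, db = split_2bit(a_word, 8), split_2bit(b_word, 8)
    total = 0
    for i in range(4):
        for j in range(4):
            partial = da[i] * db[j]
            if mode == "8x8":
                total += partial << (2 * (i + j))
            elif mode == "2x(4x4)":
                if i // 2 == j // 2:  # digits belong to the same 4-bit lane
                    total += partial << (2 * ((i % 2) + (j % 2)))
            else:
                raise ValueError("unknown mode: " + mode)
    return total


if __name__ == "__main__":
    print(multi_shape_mul(0xB7, 0x5C, "8x8") == 0xB7 * 0x5C)
    print(multi_shape_mul(0xB7, 0x5C, "2x(4x4)") == 0x7 * 0xC + 0xB * 0x5)
```

The same 2-bit multipliers serve both shapes, so supporting a new shape costs shift and selection logic rather than new multipliers.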



## Feature 1 — Logic of Decoding

### Decode to Uniform Type

- Decode to sign, exponent, and mantissa

| Result | Input          |
|--------|----------------|
| S1E5M3 | FP8, FP6, INT4 |
| S1E4M1 | FP4, LOG4      |

### Input & Output Arrangement

- Input comprises two 8-bit data or four 4-bit data
- Adjust bias when decoding exponents
- Mantissa decoding result contains the leading bit

[Figure: input arrangement for FP4 with K=4; the 16-bit input packs A<sub>4</sub>, A<sub>3</sub>, A<sub>2</sub>, A<sub>1</sub>, with field boundaries at bits 15, 11, 7, and 3]
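As an illustration of unpacking and decoding, the sketch below splits a 16-bit input into four 4-bit codes and decodes each into sign, exponent, and a mantissa with an explicit leading bit. Several details are assumptions, not taken from the slides: FP4 is modeled as an E2M1 layout with bias 1, A<sub>1</sub> is placed in the lowest bits, and the width and re-biasing of the unified exponent field are not modeled. All names are hypothetical.

```python
from typing import List, NamedTuple

FP4_BIAS = 1  # assumption: FP4 uses an E2M1 layout with exponent bias 1


class Decoded(NamedTuple):
    sign: int      # 0 or 1
    exponent: int  # unbiased exponent
    mantissa: int  # 2-bit integer: explicit leading bit, then 1 fraction bit

    def value(self) -> float:
        return (-1) ** self.sign * (self.mantissa / 2) * 2.0 ** self.exponent


def unpack_fp4x4(word: int) -> List[int]:
    """Split a 16-bit word into four 4-bit codes [A1, A2, A3, A4], A1 in the low bits."""
    return [(word >> shift) & 0xF for shift in (0, 4, 8, 12)]


def decode_fp4(code: int) -> Decoded:
    """Decode one 4-bit code into sign, unbiased exponent, and mantissa."""
    sign = (code >> 3) & 1
    exp_field = (code >> 1) & 0b11
    frac = code & 1
    if exp_field == 0:  # subnormal: the leading bit is 0
        return Decoded(sign, 1 - FP4_BIAS, frac)
    return Decoded(sign, exp_field - FP4_BIAS, 0b10 | frac)


if __name__ == "__main__":
    for code in unpack_fp4x4(0x7A21):
        d = decode_fp4(code)
        print(code, d, d.value())
```

Making the leading bit explicit at decode time lets normal and subnormal inputs share one multiplier path afterwards.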



## Feature 1 — Multiple Dot Product Unit

### Functions of the 5 Pipeline Stages

- 1: Product unpacking and cancellation
- 2: Multiplication, exponent comparison, and alignment
- 3: Reduction and product normalization
- 4: Addition of dot product result and partial sum
- 5: Normalization and rounding



### Multiple Dot Product Unit Architecture



## Feature 2 — Introduction of Stochastic Rounding

### Principle & Advantages

- $SR(x) = \begin{cases} \lfloor x \rfloor, & \text{with probability } 1 - (x - \lfloor x \rfloor) \\ \lceil x \rceil, & \text{with probability } x - \lfloor x \rfloor \end{cases}$
- Prevent <u>swamping</u> and training stagnation
- Reduce bit width and cost for accumulation
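The definition of stochastic rounding translates directly into code. The following is a minimal sketch for rounding to integers; the function name and the use of Python's `random.Random` as the random source are illustrative choices.

```python
import math
import random


def stochastic_round(x: float, rng: random.Random) -> int:
    """Round x down with probability 1 - frac and up with probability frac."""
    floor_x = math.floor(x)
    frac = x - floor_x
    return floor_x + (1 if rng.random() < frac else 0)


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [stochastic_round(2.25, rng) for _ in range(20_000)]
    # Every sample is 2 or 3, and the average approaches 2.25: SR is unbiased.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Because the expected value of the rounded result equals the input, small addends still contribute on average instead of being dropped every time.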

#### **Comparison of FP16 and FP32**

[Figure: accumulation values versus accumulation length for FP32, FP16 with SR, and FP16 with nearest rounding (NR) at chunk sizes from 1 to 256]

[Naigang Wang et al., NeurIPS 2018]

### Hardware Implementation of SR



Example of 2 Fraction Bits SR

| Fraction Bits | SR Result    |
|---------------|--------------|
| 00            | 0% round up  |
| 01            | 25% round up |
| 10            | 50% round up |
| 11            | 75% round up |

[Sun Chang et al., DATE 2023]
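One common way to obtain these probabilities in hardware is to add a uniformly random number as wide as the discarded fraction and then truncate. The sketch below assumes that scheme (the slide's own circuit is not reproduced here) and enumerates all random values to get the exact round-up probability; the function names are hypothetical.

```python
from fractions import Fraction


def sr_fixed(value: int, rand: int, frac_bits: int = 2) -> int:
    """Add a `frac_bits`-bit random number, then truncate the fraction bits."""
    return (value + rand) >> frac_bits


def round_up_probability(frac: int, frac_bits: int = 2) -> Fraction:
    """Exact round-up probability for a fraction field, over all random values."""
    n = 1 << frac_bits
    # With an integer part of 0, sr_fixed returns 1 exactly when rounding up.
    ups = sum(sr_fixed(frac, r, frac_bits) for r in range(n))
    return Fraction(ups, n)


if __name__ == "__main__":
    for frac in range(4):
        print(format(frac, "02b"), round_up_probability(frac))
```

A carry out of the fraction field occurs for exactly `frac` of the four random values, which gives the 0%, 25%, 50%, and 75% entries.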

## Feature 2 — Introduction of Advanced SR

#### **Comparison of 2 FP Adders with SR**

[Figure: block diagrams of two FP adders, one with normal SR and one with advanced SR] [Sami Ben Ali et al., DATE 2024]

### Motivation of Adv. SR

- SR starts early, during addition
- **<u>Correct</u>** SR results at rounding
- Reduce bit width of LZD & Norm.

### Advantage of Adv. SR

- Lower logic latency
- Smaller area & power usage
- Maintained accuracy

### **Existing Problems**

- 1. Support for FP addition only
- 2. Cannot be applied to any MAC

## Feature 2 — Principle of Variable-SR

### Multiple Dot Product Unit

[Figure: multiple dot product unit, with output `psum_out`]

### Variable-SR During Mantissa Addition

- The product's higher bits go to the mantissa adder; the lower bits go to the SR adder
- Insert the random number during exponent alignment, depending on the case
- **Case 1:** Random number is inserted to the right of PSUM
- **Case 2:** Random number is inserted to the right of PROD and into SR Adder2




## Feature 2 — Principle of Variable-SR


### **SR Correction During Normalization**

- Four cases, distinguished by the count of leading zeros (CNT)
- **CNT=0:** Carry-over; perform a 2-bit addition to correct
- **CNT=1:** Hold; perform a 1-bit addition to correct
- **CNT=2:** Carry-back; no correction needed
- **CNT≥3:** Cancellation; no SR needed



## Feature 3 — High-Precision Emulating Controller

### Realization of High-Precision Emulating

- $FP12_1 \times FP12_2 = FP8_{H1} \times FP8_{H2} + FP8_{H1} \times FP8_{L2} + FP8_{L1} \times FP8_{H2} + FP8_{L1} \times FP8_{L2}$, where the last term, $FP8_{L1} \times FP8_{L2}$, is omitted
- $FP12 \times FP8 = FP12_H \times FP8 + FP12_L \times FP8$
- Multiplex the low-precision MAC

### Specific Decompose Process

- FP8<sub>H</sub> is the FP12 value truncated and rounded
- FP8<sub>L</sub> is smaller and represents the remainder
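The high/low split can be checked numerically. The sketch below is an illustration under stated assumptions, not the controller's implementation: the wide operand is assumed to have 6 fraction bits and the narrow parts 3 fraction bits, exponent range and overflow are not modeled, and all function names are hypothetical.

```python
import math
from typing import Tuple


def round_to_frac_bits(x: float, frac_bits: int) -> float:
    """Round x to a significand with `frac_bits` fraction bits (round half to even)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2.0 ** (frac_bits + 1)    # frac_bits + 1 significant bits in total
    return math.ldexp(round(m * scale) / scale, e)


def split_hi_lo(x: float, hi_frac_bits: int = 3) -> Tuple[float, float]:
    """Return (high part, remainder) such that high + remainder == x."""
    hi = round_to_frac_bits(x, hi_frac_bits)
    return hi, x - hi


def emulated_product(x: float, y: float, hi_frac_bits: int = 3,
                     keep_low_low: bool = True) -> float:
    """Multiply via narrow partial products; optionally drop the low x low term."""
    xh, xl = split_hi_lo(x, hi_frac_bits)
    yh, yl = split_hi_lo(y, hi_frac_bits)
    total = xh * yh + xh * yl + xl * yh
    if keep_low_low:
        total += xl * yl
    return total


if __name__ == "__main__":
    x, y = 85 / 64, 115 / 64          # both have 6 fraction bits
    print(split_hi_lo(x), split_hi_lo(y))
    print(x * y, emulated_product(x, y), emulated_product(x, y, keep_low_low=False))
```

With all four partial products the result is exact; dropping the low-by-low term, the smallest of the four, changes the result only by the product of the two remainders.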



## Outline

- Background & Challenges
- Overall Introduction of MPICC
- Key Features of MPICC
- Experiment & Results
  - Analysis of Different Accumulator Modes
  - Practical Results of Network Training
  - Accuracy for High-Precision Emulation
  - Performance Comparison
- Conclusion

## **Experiment Conditions**

### Hardware Experiment Conditions

- TSMC 28nm technology
- Under typical corner (TT, 0.9V, 25°C)
- Area from Synopsys Design Compiler 2021.09
- Power consumption from Synopsys PrimeTime PX 2021.06

### Software Experiment Conditions

- ResNet-20 model on CIFAR-100 dataset
- SGD optimizer with an initial learning rate of 0.1
- Use gradient scaling, gradient clipping, and a momentum parameter of 0.9
- Train with a batch size of 128 for 200 epochs

## **Analysis of Different Accumulator Modes**

Accuracy Analysis



Figure 7: Accumulation results for different accumulators

[Eager-SR] [Sami Ben Ali et al., DATE 2024]

### Experiment Setup

- The operation mode is FP8×FP8
- One input is set to 1
- The other follows a uniform distribution (mean=1, standard deviation=1)
- Accumulate 15,000 times
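A software analogue of this setup can be sketched as follows. It is a floating-point emulation, not the MPICC hardware: the accumulator is assumed to keep 6 fraction bits, the inputs are not quantized to FP8, and the uniform distribution is taken as the interval [1 − √3, 1 + √3], which has mean 1 and standard deviation 1. All names and seeds are hypothetical.

```python
import math
import random
from typing import Optional, Sequence, Tuple


def quantize(x: float, frac_bits: int, rng: Optional[random.Random] = None) -> float:
    """Round x to `frac_bits` fraction bits: RNE if rng is None, otherwise SR."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                      # x = m * 2**e with 0.5 <= |m| < 1
    scaled = m * 2.0 ** (frac_bits + 1)
    if rng is None:
        q = round(scaled)                     # round half to even
    else:
        low = math.floor(scaled)
        q = low + (1 if rng.random() < scaled - low else 0)
    return math.ldexp(q, e - (frac_bits + 1))


def accumulate(values: Sequence[float], frac_bits: int,
               rng: Optional[random.Random] = None) -> float:
    """Sum `values`, rounding the accumulator after every addition."""
    acc = 0.0
    for v in values:
        acc = quantize(acc + v, frac_bits, rng)
    return acc


def run_experiment(n: int = 15_000, frac_bits: int = 6,
                   seed: int = 0) -> Tuple[float, float, float]:
    """Return (exact sum, RNE accumulator result, SR accumulator result)."""
    data_rng = random.Random(seed)
    half_width = math.sqrt(3.0)
    values = [data_rng.uniform(1.0 - half_width, 1.0 + half_width) for _ in range(n)]
    exact = math.fsum(values)
    rne = accumulate(values, frac_bits)
    sr = accumulate(values, frac_bits, random.Random(seed + 1))
    return exact, rne, sr


if __name__ == "__main__":
    print(run_experiment())
```

In this emulation the RNE accumulator stops growing once its spacing exceeds twice the largest addend, while the SR accumulator keeps tracking the exact sum with some random error.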

### Experiment Result

- Round-to-nearest-even (RNE) leads to swamping
- Variable-SR is <u>more accurate than</u> Eager-SR
- Variable-SR performs <u>one fewer</u> rounding than Eager-SR

## **Analysis of Different Accumulator Modes**

### Experiment Setup

- Select accumulators in MPICC MACs
- Eager-SR performs SR on product to ensure that subsequent addition bit widths are equal

[10] [Sami Ben Ali et al., DATE 2024]

### Experiment Result

- Compared with Conventional-SR, area/power reduced by 6.6%/11.3%
- Compared with E5M10 RNE, delay/area/power reduced by 16.7%/15.7%/14.9%

| Configuration        | Area (µm²) | Power (µW/MHz) | Delay (ns) | Accuracy |
|----------------------|------------|----------------|------------|----------|
| E5M6 RNE             | 320.81     | 0.282          | 5          | Low      |
| E5M6 Variable-SR     | 390.94     | 0.337          | 5          | High     |
| E5M6 Eager-SR[10]    | 401.16     | 0.342          | 5          | Middle   |
| E5M6 Conventional-SR | 418.52     | 0.375          | 5          | High     |
| E5M10 RNE            | 463.64     | 0.396          | 6          | High     |

### **Table 3: Performance comparison of different accumulators**

## **Practical Results of Network Training**

### Experiment Result

- E8M6 RNE shows a **2.17%** degradation
- E8M6 SR12 shows a **0.64%** degradation
- E8M6 SR12 is 0.38% more accurate than E8M10 RNE

### Existing Improvements

- SR16 accuracy is reduced due to the small model
- Not employing a SOTA quantization scheme to compress the 8-bit exponent to 5 bits

Act & Wt: FP8

| Precision | Rounding Mode | SR Bit Width | Accuracy (%) |
|-----------|---------------|--------------|--------------|
| E8M23     | RNE           | -            | 75.05        |
| E8M10     | RNE           | -            | 74.03        |
| E8M6      | RNE           | -            | 72.88        |
| E8M6      | SR            | 16           | 74.39        |
| E8M6      | SR            | 12           | 74.41        |
| E8M6      | SR            | 8            | 74.23        |
| E8M6      | SR            | 4            | 73.94        |

### Table 4: Training accuracy with ResNet-20 on CIFAR-100

## **Accuracy for High-Precision Emulation**

### Area Decomposition of MPICC MAC

[Figure: area breakdown of the MPICC MAC into the following components]

- Left decoder
- Right decoder
- Multiple dot product unit
- High-precision emulating controller

### Accuracy Result for High-Precision

### Experiment Setup

- Perform GEMM with random matrices of different sizes on MPICC MACs
- "Direct precision decomposition" in [12] means not using rounding or subnormals

### Experiment Result

- FP8×FP8 MSE is 3× larger than FP12×FP8 and 22-135× larger than FP12×FP12 (ours)
- Small accuracy improvement over [12]

[12] [Stefano Markidis et al., IPDPSW 2023]

Figure 8: MSE for different sizes of GEMM computations

## **Performance Comparison**



Figure 9: Comparison of area/energy efficiencies among PEs

### Reference List

[6] [Jing Zhang et al., GLSVLSI 2023]
[7] [Luca Bertaccini et al., NeurIPS 2024]
[16] [Zürich Switzerland et al., TVLSI 2021]

### Summary of Improvement

- Supports the widest range of precision combinations (18)
- FP8 area/energy efficiencies up 1.17×/1.19×
- FP4 area/energy efficiencies up 4.69×/3.64×

### Improvement Analysis

- Focused Design: Target ultra-low precision and <u>eliminate support</u> for redundant precisions
- Integrated Multiple-Precision: Do not combine various single-precision PEs, for efficiency
- MPICC: Use <u>direct computation</u> to eliminate precision conversion overhead

## Outline

- Background & Challenges
- Overall Introduction of MPICC
- Key Features of MPICC
- Experiment & Results
- Conclusion

## Conclusion

We present an MPICC MAC unit that maximizes the hardware performance gains of ultra-low-precision training.

| Design | FP16         | FP8          | FP4          | LOG4         | INT4         | Stochastic Rounding | MPICC        | High-Precision Emulation |
|--------|--------------|--------------|--------------|--------------|--------------|---------------------|--------------|--------------------------|
| Ours   | ○            | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$ | $\checkmark$        | $\checkmark$ | $\checkmark$             |
| [16]   | $\checkmark$ | $\checkmark$ | ×            | ×            | ×            | ×                   | ×            | ×                        |
| [7]    | $\checkmark$ | $\checkmark$ | ×            | ×            | ×            | $\checkmark$        | ×            | ×                        |
| [10]   | ×            | ×            | $\checkmark$ | ×            | $\checkmark$ | ×                   | $\checkmark$ | ×                        |

### Table 1: Characteristics comparison of PEs for deep learning

○: Supports FP12 with the same accumulation accuracy as FP16

- Three key features are proposed:
  - **An MPICC architecture** that supports inter-computations among multiple precisions;
  - A Variable-SR strategy to reduce accumulator bit width while maintaining accuracy;
  - A high-precision emulating controller to support high-precision at low cost.
- Compared to SOTA designs, our design supports the widest precisions and improves area/energy efficiencies by 1.17×/1.19×(FP8) & 4.69×/3.64×(FP4).

# Thanks for Attention! Q&A