AVS-Aware Power-Gate Sizing for Maximum Performance and Power Efficiency of Power-Constrained Processors

> Abhishek Sinkar and Nam Sung Kim January 28, 2011

Department of Electrical and Computer Engineering University of Wisconsin - Madison



- Introduction
- Impact of process variations and PG-device size
  - Impact on $VV_{DD}$
  - Impact on $F_{MAX}$ and $P_{TOT}$
- AVS-aware PG-device size optimization
  - Algorithm and simulation result
- Impact of WID variation on PG sizing
  - Global clocking
  - Frequency island clocking
- Experimental methodology
- Conclusion

### Introduction



### **Contributions**

- Analyze the impact of D2D variations on
  - Virtual rail voltage ( $VV_{DD}$ )
  - Maximum operating frequency  $(F_{MAX})$
  - Total power consumption ( $P_{TOT}$ )
- Propose algorithm to find
  - Optimal PG size
  - Optimal degree of AVS
- Extend algorithm to
  - Multicore processors w/ WID variation
    - Global clocking
    - Frequency Island clocking




### P. V. + PG Size Impact on VV<sub>DD</sub>

- $R_{PG-SLOW} > R_{PG-FAST}$
- $s_{PG}$ increases $\rightarrow$ $R_{PG}$ decreases $\rightarrow$ $VV_{DD}$ increases



- F<sub>MAX</sub> gain diminishes rapidly as $s_{PG}$ increases
- P<sub>TOT</sub> increases faster than F<sub>MAX</sub>

Larger PG device suitable for fast die, smaller for slow die
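The chain $s_{PG} \rightarrow R_{PG} \rightarrow VV_{DD}$ and its effect on $F_{MAX}$ and $P_{TOT}$ can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration, not taken from the slides: $R_{PG}$ is modeled as inversely proportional to $s_{PG}$, $F_{MAX}$ follows an alpha-power law, leakage is linear in voltage, and all constants (`VTH`, `ALPHA`, `K`, `C_EFF`, `I_LEAK_REF`, `R_PG_UNIT`) are hypothetical placeholders.

```python
# Toy model of a power-gated core. All constants are hypothetical placeholders.
VTH = 0.3          # threshold voltage (V)
ALPHA = 1.3        # alpha-power-law exponent
K = 4.0            # frequency scaling constant
C_EFF = 20.0       # effective switched capacitance (nF)
I_LEAK_REF = 10.0  # leakage current at V_REF (A)
V_REF = 0.9        # reference voltage for the leakage model (V)
R_PG_UNIT = 3e-4   # PG resistance at s_PG = 1 (ohm)


def core_model(vv):
    """Return (f_max in GHz, total current in A) at virtual rail voltage vv."""
    f_max = K * (vv - VTH) ** ALPHA / vv   # alpha-power-law frequency model
    i_dyn = C_EFF * f_max * vv             # nF * GHz * V = A
    i_leak = I_LEAK_REF * vv / V_REF       # crude linear leakage model
    return f_max, i_dyn + i_leak


def solve_virtual_rail(s_pg, vdd=0.9, iters=60):
    """Solve VV_DD = V_DD - I(VV_DD) * R_PG by bisection, with R_PG = R_PG_UNIT / s_pg.

    Returns (vv_dd, f_max, p_tot), where p_tot is the power drawn from V_DD.
    """
    r_pg = R_PG_UNIT / s_pg
    lo, hi = VTH, vdd
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        _, i_tot = core_model(mid)
        if vdd - i_tot * r_pg > mid:
            lo = mid   # rail can still rise
        else:
            hi = mid   # IR drop too large for this voltage
    vv = 0.5 * (lo + hi)
    f_max, i_tot = core_model(vv)
    return vv, f_max, vdd * i_tot


for s_pg in (0.25, 0.5, 0.75, 1.0):
    vv, f_max, p_tot = solve_virtual_rail(s_pg)
    print(f"s_PG={s_pg:.2f}  VV_DD={vv:.3f} V  F_MAX={f_max:.3f} GHz  P_TOT={p_tot:.1f} W")
```

Under these assumptions, each further increase in `s_pg` raises $VV_{DD}$ by less than the previous one, and the relative growth in `p_tot` exceeds the relative growth in `f_max`.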


### **AVS Aware PG Size Optimization**



# **Algorithm**



# **Simulation Result**

| Proc.<br>Corners                         | Slow  |       | Nom   |        | Fast   |        |
|------------------------------------------|-------|-------|-------|--------|--------|--------|
| P<sub>TDP</sub>                          | 65W   | 70W   | 90W   | 100W   | 120W   | 130W   |
| V<sub>DD</sub> @ s<sub>PG</sub>=1 (V)    | 0.945 | 0.955 | 0.900 | 0.923  | 0.825  | 0.840  |
| s<sub>PG,OPT</sub>                       | 0.755 | 0.715 | 0.640 | 0.515  | 0.568  | 0.455  |
| V<sub>DD,OPT</sub> (V)                   | 0.955 | 0.975 | 0.915 | 0.948  | 0.845  | 0.875  |
| f<sub>MAX</sub> (normalized)             | 0.999 | 0.999 | 0.999 | ~1.000 | ~1.000 | ~1.000 |

24.5%, 36% and 43.2% reduction in PG size for slow, nominal and fast corners (at the lower P<sub>TDP</sub> of each corner: 65W, 90W and 120W)
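The quoted reductions follow directly from the optimal PG sizes in the table, taking $s_{PG} = 1$ as the baseline. The short sketch below redoes that arithmetic for all six columns; the dictionary layout and function name are illustrative choices, while the numbers are copied from the table.

```python
# Optimal PG sizes from the simulation-result table, keyed by (corner, P_TDP in W).
s_pg_opt = {
    ("slow", 65): 0.755, ("slow", 70): 0.715,
    ("nominal", 90): 0.640, ("nominal", 100): 0.515,
    ("fast", 120): 0.568, ("fast", 130): 0.455,
}


def pg_reduction_percent(s_opt, s_base=1.0):
    """Percentage reduction in PG size relative to the baseline size s_base."""
    return round(100.0 * (s_base - s_opt) / s_base, 1)


reductions = {key: pg_reduction_percent(s) for key, s in s_pg_opt.items()}

for (corner, p_tdp), pct in reductions.items():
    print(f"{corner:8s} P_TDP={p_tdp:3d} W  PG size reduction = {pct}%")
```

The 24.5%, 36% and 43.2% figures correspond to the first $P_{TDP}$ column of each corner; the second column of each corner gives a larger reduction.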


# **WID Variation**



S. Sarangi et al., IEEE Trans. on Semiconductor Manufacturing, vol. 21, no. 1, pp. 3-13, Feb. 2008.

- WID variation is spatially correlated
- Results in core-to-core (C2C) F<sub>MAX</sub> and I<sub>LEAK</sub> variation
- As the # of cores increases
  - Relative variation among cores becomes more significant

# **Global Clocking**

- Limits F<sub>MAX</sub> of a multicore processor to that of the slowest core
- Total power consumption of an N-core processor
  - $P_{TOT} = \sum_{i=1}^{N} \left( F_{MAX,j}\left(VV_{DD,j}\right) \times C_{EFF} \times VV_{DD,i} + I_{LEAK,i}\left(VV_{DD,i}\right) \right) \times VV_{DD,i}$
  - where $N$ = no. of cores
    - $j$ = index of slowest core on die
    - $C_{EFF}$ = effective switched capacitance per core
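A minimal sketch of the global-clocking power sum may help make the role of the slowest core concrete. The function name and the sample per-core numbers are hypothetical; the per-core $F_{MAX}$ and $I_{LEAK}$ values are assumed to be already evaluated at each core's $VV_{DD}$.

```python
def total_power_global(vv_dd, f_max, i_leak, c_eff):
    """P_TOT under global clocking.

    vv_dd, f_max, i_leak are per-core lists (V, GHz, A); c_eff is in nF.
    Every core is clocked at the F_MAX of the slowest core on the die.
    """
    f_chip = min(f_max)  # F_MAX,j of the slowest core j
    return sum((f_chip * c_eff * v + i) * v for v, i in zip(vv_dd, i_leak))


# Hypothetical 4-core die: faster cores are also leakier.
vv_dd = [0.9, 0.9, 0.9, 0.9]
f_max = [2.0, 2.2, 2.4, 2.6]
i_leak = [2.0, 3.0, 4.0, 5.0]

p_global = total_power_global(vv_dd, f_max, i_leak, c_eff=5.0)
print(f"P_TOT (global clocking) = {p_global:.2f} W at {min(f_max):.1f} GHz")
```

In this form the faster cores contribute only extra leakage: their higher $F_{MAX}$ never appears in the sum.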

# **Global Clocking**



As PG size increases, $I_{LEAK}$ of the faster (leakier) cores increases, while $F_{MAX}$ remains fixed by the slowest core.

### **Frequency Island Clocking**

- Each core runs at its own max. frequency
- Total power consumption of an N-core processor
  - $P_{TOT} = \sum_{i=1}^{N} \left( F_{MAX,i}\left(VV_{DD,i}\right) \times C_{EFF} \times VV_{DD,i} + I_{LEAK,i}\left(VV_{DD,i}\right) \right) \times VV_{DD,i}$
- For compute-bound workloads with sufficient # of threads
  - Performance (Throughput) = $\frac{1}{N} \sum_{i=1}^{N} F_{MAX,i}$
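The frequency-island formulas can be sketched the same way, with each core using its own $F_{MAX,i}$ and throughput taken as the mean frequency. Function names and sample values are hypothetical, and per-core quantities are assumed to be already evaluated at each core's $VV_{DD}$.

```python
def total_power_fi(vv_dd, f_max, i_leak, c_eff):
    """P_TOT under frequency-island clocking: core i runs at its own F_MAX,i."""
    return sum((f * c_eff * v + i) * v for v, f, i in zip(vv_dd, f_max, i_leak))


def throughput_fi(f_max):
    """Throughput proxy for compute-bound, thread-rich workloads: mean F_MAX."""
    return sum(f_max) / len(f_max)


# Hypothetical 4-core die: faster cores are also leakier.
vv_dd = [0.9, 0.9, 0.9, 0.9]
f_max = [2.0, 2.2, 2.4, 2.6]
i_leak = [2.0, 3.0, 4.0, 5.0]

p_fi = total_power_fi(vv_dd, f_max, i_leak, c_eff=5.0)
perf_fi = throughput_fi(f_max)
print(f"P_TOT (FI clocking) = {p_fi:.2f} W, throughput proxy = {perf_fi:.2f} GHz")
```

With these sample numbers the throughput proxy exceeds the slowest core's frequency, at the cost of higher dynamic power in the faster cores.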

# **Throughput Experiment with FI Clk.**



#### PG Sizing with FI clocking and WID Var.



Use F<sub>MAX,AVG</sub> to compute P<sup>3</sup>/W constraint
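As a rough illustration, the metric can be computed as below. This assumes that P<sup>3</sup>/W denotes performance cubed per watt and that performance is represented by F<sub>MAX,AVG</sub>; both readings are assumptions, and the function name and sample values are hypothetical.

```python
def perf3_per_watt(f_max_per_core, p_tot):
    """Performance^3 per watt, using the average per-core F_MAX as performance."""
    f_max_avg = sum(f_max_per_core) / len(f_max_per_core)
    return f_max_avg ** 3 / p_tot


# Hypothetical values: four cores (GHz) and total power (W).
metric = perf3_per_watt([2.0, 2.2, 2.4, 2.6], p_tot=50.0)
print(f"P^3/W = {metric:.4f} GHz^3/W")
```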


### **Experimental Methodology**

- For frequency and leakage modeling with power gates
  - $V_{th}$ and $L_{eff}$ WID spatial and D2D variation map\*
    - WID variation: correlation coefficient = 0.5
    - D2D variation: $\sigma_{V_{th}}^{sys} = 6.4\%$



- 24 FO4 INV chain for measuring $f(V_{DD})$
- 32nm PTM SPICE model
- Dummy gates for measuring $I(V_{DD})$
  - Effective widths: 50% INV, 30% NAND, 20% NOR

[Figure: power-gated test circuit; labels: $V_{DD}$, $VV_{DD}$, $I_{DYN}(VV_{DD})$, $I_{LEAK}$, SLEEP]

### **Experimental Methodology**

- Power and thermal constraint
  - $P_{TDP}$ at $V_{DD,TDP}$ = 90 W (at the nominal corner)
  - $T_{jmax} = 100$ °C
- Performance simulation with GPGPU-Sim
  - Simulator modified to support FI clocking

| # of SM Cores                          | 4/8/16                      | Shared Mem/SM                               | 16KB        |
|----------------------------------------|-----------------------------|---------------------------------------------|-------------|
| SIMD Width/SM                          | 1/4/8                       | # of Mem Ch.                                | 4           |
| # of Threads /SM                       | 1024                        | BW/ Mem Ch.                                 | 8B/Cycle    |
| # of CTAs/SM                           | 8                           | DRAM Rq. Queue                              | 16          |
| # of Registers/SM                      | 16384                       | Mem Controller                              | FR-FCFS     |
| Constant and<br>Texture Cache<br>Sizes | 8KB, 2-Way,<br>64B Line     | GDDR3 Mem.<br>tCL/tRP/tRAS                  | 10/10/35/25 |

# Conclusion

- Effect of PG-device sizing on F<sub>MAX</sub> and P<sub>TOT</sub>
  - F<sub>MAX</sub> and P<sub>TOT</sub> both increase; P<sub>TOT</sub> increases faster than F<sub>MAX</sub>
  - Rate of increase diminishes more quickly for a slow die than for a fast die
- Reduction in PG-device size
  - D2D variation: 24.5% (slow), 36% (nominal) and 43.2% (fast)
  - WID variation
    - Global Clk.: 59%
    - FI Clk.: 58% (4 cores), 57% (16 cores)
- F<sub>MAX</sub> penalty
  - D2D variation: negligible
  - WID variation: improved $F_{MAX}$ by ~3%
- As # of cores increases
  - Opt. PG size increases
  - Opt. V<sub>DD</sub> for AVS decreases