## Variation-aware Statistical Energy Optimization on Voltage-Frequency Island based MPSoCs under Performance Yield Constraints

#### Song Jin<sup>1</sup>, Yinhe Han<sup>2</sup> and Songwei Pei<sup>3</sup>

<sup>1</sup>Department of Electronic and Communication Engineering, School of Electrical and Electronic Engineering, North China Electric Power University, P. R. China.

<sup>2</sup>State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, P. R. China.

<sup>3</sup>Department of Computer Science and Technology, Beijing University of Chemical Technology, P. R. China.

## Summary

- Motivation
- > Preliminaries
- Statistical energy optimization
- > Experiments

Energy efficiency is a primary design concern for embedded MPSoCs since the power supply commonly relies on the battery.



Concept of voltage-frequency island (VFI) was introduced into multi-core design for fine-grained energy management.

Each VFI operates at its own voltage and frequency

Mixed clock/mixed voltage FIFOs serve to data synchronization



VFI 3 (V<sub>3</sub>, f<sub>3</sub>, V<sub>t3</sub>)

A multi-processor chip with 3 VFIs

(Ogras et al. Voltage-Frequency Island Partitioning for GALS-based Networks-on-Chip, DAC'07)

- Existing work on VFI-based energy optimization
- 1. Commercial production
  - TILE64, ISSCC'08
  - Niagara2, ISSCC'07
- 2. Academic research Orgas, DAC'07
   Jang, ICCAD08
  - Marculescu, TCAD'08

Deterministic task scheduling

- V/F assignment
  - VFI partition

The exacerbated process variation (PV) degrades the optimization effectiveness of the existing work.



As a result, the deterministic optimization:

◆at worst case is costly and may not be a viable option.

We proposed a statistical energy optimization framework which

(1) considers probability features in execution latency and power of the task and,

(2) combines energy optimization sensitivity with statistical slack of the task to guide the overall optimization flow.

Our purpose is to

(1) achieve maximum energy saving meanwhile

(2) meeting performance yield constraints, defined as the probability of the design meeting timing constraints of the system.

# Summary

- Motivation
- > Preliminaries
- Statistical energy optimization
- > Experiments

## Preliminaries

- Target MPSoC platform:
- Tile base structure with heterogeneous PEs
- Inter-tile communication supported by Network-on-chip (NoC)

Data synchronization supported by Mixed V/F FIFO

| Energy model                                                                                                           | With PV effects                                 |  |  |  |
|------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------|--|--|--|
| $E_{sys} = E_{comp} + E_{comm}$                                                                                        | •Timing quantities: variables                   |  |  |  |
| $-E_{comp} = \sum_{i=1}^{n} (NC_j \cdot C_i \cdot V_i^2)$                                                              | •Normal distribution                            |  |  |  |
|                                                                                                                        | •Tight probability: variable comparison         |  |  |  |
| k=1<br>$n_V$ $V^2$                                                                                                     | <ul> <li>Clark equation: max, min</li> </ul>    |  |  |  |
| $-E_{bit} = \sum_{i}^{n_V} \left( E_{bit}^R(i) + E_{bit}^{Link}(i) + E_{bit}^{FIFO}(i) \right) \frac{V_i^2}{V_{DD}^2}$ | <ul> <li>Statistical task scheduling</li> </ul> |  |  |  |
|                                                                                                                        |                                                 |  |  |  |

# Summary

- Motivation
- > Preliminaries
- Statistical energy optimization
- > Experiments

## Statistical energy optimization – Problem Formulation

Problem formulation, given:

♦A directed acyclic task graph, including task related information, control and data dependency between tasks

MPSoC platform, frequency distribution under PV effects, votlage levels

♦A set of design constraints, such as maximum VFI number, performance yield constraints

To determine task schedule and VFI partition for energy minimization and satisfaction of performance yield.

## Statistical energy optimization - Framework Overview

Major optimization steps:

- Statistical task scheduling
- ♦ Voltage assignment to PEs
- ♦VFI partition by merging PEs



## Statistical energy optimization - Details

1. Statistical task scheduling and voltage assignment:

◆Purpose:

Schedule the task having larger energy optimization sensitivity onto the core that provides larger statistical slack.

#### ♦ Guiding metric:

Energy optimization sensitivity: the energy variations of a task under voltage scaling.

$$S_E = E_{norm} \sum_{i=1}^n \frac{V_i^2}{V_{DD}^2}$$

## Statistical energy optimization - Details

#### Procedure of Statistical Task Scheduling and Voltage Assignment

- 1. //Energy-aware Task Scheduling Process
  - . // Input: Full Task List FTL; Full PE List FPL;
- 3. // Output: Task schedule:  $t_i$  in FTL  $\rightarrow P_j$  in FPL;
- . Construct Ready Task List RTL and Available PE List APL;
- while(RTL is not empty && APL is not empty){
  - Select the task  $t_i$  with largest  $S_E$  in RTL:
  - For all the PEs in APL
    - Calculate  $FT(t_i, P_j)$  and  $SSlack(FT(t_i), P_j)$ ;
  - Apply tight probability to calculate  $\max(SSlack(FT(t_i), P_j));$
  - Schedule  $t_i$  onto  $P_j$  given max(SSlack(FT( $t_i$ ), $P_j$ ));
- Remove t<sub>i</sub> from RTL and P<sub>j</sub> from APL;}

12. Jump to step 4;

- 13. Voltage Assignment Process
- 14. // Input: Task Schedule; Available voltage levels; FTL;FPL;
  - 15. // Output: Assigned voltages for PEs in FPL;
  - 16. Assign initial voltages to PEs in FPL to minimize total energy;
  - 17. Perform SSTA to calculate Perf. yield;
    - while(Perf. Yield constraints are not satisfied){
      - Select  $P_j$  in FPL with minimum  $\sum S_E$  (scheduled tasks);
      - Relax assigned voltage of  $P_j$  and flag  $P_j$  as adjusted PE:
      - Perform SSTA to calculate Perf. Yield;}

## Statistical energy optimization - Details

2. VFI partition: meet maximum allowable number of VFIs



We evaluate all the VFI partitions from 1- $\sqrt[3]{FI}$  to maxima M/Failow algorithm of the maxima M/Failow algorithm of the number. The VFI partition with minimum energy while still meeting the performance yield constraints is selected as the final partitioning decision.

# Summary

- Motivation
- > Preliminaries
- Statistical energy optimization
- > Experiments

## Experiment - experimental setup

- 1. Simulation platform
- Heterogeneous MPSoC with 4-by-4 topology
- Frequency of PE ranges from 350MHz to 750MHz
- ♦ Five voltage levels available [0.7V,0.8V,0.9V,1.0V,1.1V]
- x-y routing algorithm
- Mixed V/F FIFOs
- 2. Frequency distribution of PEs
- HSPICE Monte Carlo simulation
- PTM 65nm technology node
- Critical path with logic levels 12 to 20

#### Experiment – experimental setup

- 3. Task graph
- Industrial benchmark E3S: multiple task graphs are grouped into one
- Random task graph by TGFF: number of tasks ranges from 80 to 100

STATISTIC OF TASK GRAPHS

|             | Task # | Communication. | Perf. yield |
|-------------|--------|----------------|-------------|
| Consumer    | 11     | low            | 100%        |
| Auto-indust | 24     | medium         | 100%        |
| Networks    | 13     | high           | 100%        |
| Telecomm    | 30     | high           | 100%        |
| TG1         | 88     | medium         | 100%        |
| TG2         | 96     | medium         | 100%        |
| TG3         | 101    | high           | 100%        |
| TG4         | 91     | medium         | 98%         |
| TG5         | 96     | medium         | 97%         |
| TG6         | 100    | high           | 96%         |

#### Experiment – experimental results

#### Comparison of energy optimizations

| Bench       | Optimal VFI Number |      | Energy Normalized |      | Performance Yield |      | leld | CPU time (mins) |      |                 |
|-------------|--------------------|------|-------------------|------|-------------------|------|------|-----------------|------|-----------------|
|             | D-NC               | D-WC | ours              | D-NC | D-WC              | ours | D-NC | D-WC            | ours | CFO time (mins) |
| Consumer    | 3                  | 3    | 5                 | 0.56 | 0.89              | 0.52 | 52%  | 100%            | 100% | 3               |
| Auto-indust | 4                  | 5    | 6                 | 0.59 | 0.91              | 0.60 | 60%  | 100%            | 100% | 7               |
| Networks    | 3                  | 3    | 4                 | 0.75 | 0.96              | 0.72 | 56%  | 100%            | 100% | 3               |
| Telecomm    | 4                  | 4    | 4                 | 0.51 | 0.87              | 0.52 | 60%  | 100%            | 100% | 8               |
| TG1         | 5                  | 6    | 6                 | 0.43 | 0.83              | 0.42 | 57%  | 100%            | 100% | 15              |
| TG2         | 4                  | 5    | 6                 | 0.46 | 0.90              | 0.46 | 61%  | 100%            | 100% | 17              |
| TG3         | 4                  | 4    | 5                 | 0.40 | 0.87              | 0.41 | 56%  | 100%            | 100% | 17              |
| TG4         | 4                  | 4    | 6                 | 0.53 | 0.91              | 0.61 | 51%  | 97%             | 98%  | 16              |
| TG5         | 4                  | 3    | 6                 | 0.58 | 0.86              | 0.66 | 53%  | 97%             | 97%  | 18              |
| TG6         | 4                  | 6    | 5                 | 0.52 | 0.93              | 0.58 | 53%  | 96%             | 96%  | 18              |

#### STATISTIC OF EXPERIMENTAL RESULTS

Energy normalized: normalized energy to the one in 1-VFI; D-NC&D-WC: deterministic optimizations on nominal value and worst case

### Experiment – experimental results

Impact of slack on optimization results

◆TG6 is chosen as an example

Relax deadline constraint of TG6 to make its performance yield of 100%



Fig. 3. Impact of slack on voltage assignment. (a) original version of TG6 with tight deadline, and (b) modified version of TG6 with loose deadline.

# Thank You (Q&A)