#### Energy Efficient Joint Scheduling and Multi-Core Interconnect Design

#### Cathy Xu, Chun Jason Xue, Yi He, Edwin Sha

January, 2010 ASPDAC 2010, Taipei

# Outline

- Introduction
  - Interconnect with segmented buses
  - Switch design
  - Energy estimation model
- Motivation Example
- Application Specific Interconnect Design and Scheduling with Segmented Buses
- Experimental Results
- Conclusion

## Introduction

#### Objectives:

- To analyze the interconnect energy consumption impacting factors.
- To determine the minimum number of segmented buses required and related switch configurations.
- To propose a computation and communication coscheduling algorithm to minimize the interconnect energy consumption.

## Interconnect with Segmented Buses Architecture

- Heterogeneity of processors can mean two things:
  - All processors of the system have the same functional capability, but they perform the same task at different speeds.
  - The processors have different functional capabilities: that is, certain tasks can only performed by certain processors.
- Our research addresses the processor heterogeneity of the first type.
- Definition: A target parallel system M<sub>heter</sub> = (P, w) consists of a set of processors P<sub>i</sub>, whose heterogeneity, in terms of processing speed is described by the execution time function w. The processors are connected by a communication network.



# **Switch Diagram and Connectivity**



ASPDAC2010, Taipei

## Estimation Model on Interconnect Energy Consumption

- **Energy consumption** E<sub>all</sub> = E<sub>dynamic</sub> + E<sub>static</sub>
- $E_{dynamic} = Dis^* P_{seg}$  where  $Dis = \sum {}^{N}Seg_i$  $P_{seg}$  is the unit dynamic power required to transfer one data over one seğinent
- $E_{static} = a^*b(M-1)L_{sch}Pseg$  where a is a chosen factor related to VLSI process
- The energy consumption factors:
  - **N** number of data transfers **b** number of buses
- - **Dis** the total segment span
  - Lsch schedule length



## Modeling Application with Inter Iteration Data Dependencies



V: set of the computation nodes (instructions)

E: set of edges that define data dependency relations

t: given computation time of node based on the underlying hardware resource

d(e): non-negative delays for an edge

To solve y''+3xy'+3y = 0



# **Original Schedule**



ASPDAC2010, Taipei

#### **Original Schedule's Data Transfer Diagram**



ASPDAC2010, Taipei

#### **Original Schedule's Data Transfer Diagram**





### **Data Transfer Diagram after Rotation 1**



ASPDAC2010, Taipei



#### **Data Transfer Diagram after Rotation 2** Control step Core 1 Core 2 Core 3 2,7 2,7 Core 4 Core 5 Minimum of **1** segmented buses

# **Summary on the Example DFG**

|                      | Ν    | Dis  | b    | Lsch | Edynamic | Estatic |
|----------------------|------|------|------|------|----------|---------|
| Original<br>Schedule | 7    | 15   | 1.25 | 7    | 15Pseg   | 35aPseg |
| Rotation Step 1      | 7    | 14   | 1.25 | 6    | 14Pseg   | 30aPseg |
| Rotation Step 2      | 5    | 11   | 1    | 6    | 11Pseg   | 8aPseg  |
| Improvement %        | 28.5 | 26.7 | 25   | 14.3 | 26.7     | 17.1    |

## Modeling Data Transfers over Segmented Buses



Set of X*i*,*j*,*k* variables are used to represent the data transfer.

- X*i,j,k* means the *i*th data transfer at control step *k* over segment *j*.
- If transfer occurs at control step k over segment j, then X*i*,*j*,*k* = 1. Otherwise, X*i*,*j*,*k* = 0.
- Constraints:
  - 1. Data transfer over one segment takes one control step.
  - 2. Data transfer has to travel through all segments from start to end in sequential order.
  - 3. Maximum concurrent data transfers at any single control step over any one segment is subject to the bandwidth requirement.
  - 4. Concurrent data transfers are allowed on non-overlapping segment at any control step.

# **Related Constraints**

#### • Constraints:

- Data transfer over one segment takes one control step. For each i, 1< i  $\leq$  N; each j, 1< j  $\leq$  J,  $\sum_{k} x_{iik} = 1$
- Data transfer has to travel through all segments from start to end in sequential order.
- For each i, 1< i  $\leq$ N; each j, 1< j  $\leq$ J, 1<k $\leq$ K, x<sub>ijk</sub> +  $\sum_{i}$ <sup>(j+1)  $\leq$ I  $\leq$ J $\sum_{m}$  0  $\leq$ m  $\leq$ (k-1)x<sub>ilm</sub>  $\leq$  2</sup>
- Maximum concurrent data transfers at any single control step over any one segment is subject to the bandwidth requirement.

For each k, 1<k $\leq$ K, each j, 1< j  $\leq$ J,  $\Sigma_i x_{ijk} \leq b_j$ 

 Concurrent data transfers are allowed on non-overlapping segment at any control step.

For each k, 1<k $\leq$ K, each j,  $\sum_{\text{distinct i}} x_{ijk} + x_{i(j+1)k} \leq b_j$ 

## **Valid Data Transfers Scenarios**



8

**X**i,1,7=**0** 

Xi,2,7=**0** 

7

k=7

6

7

8

**X**i,1,7=**1** 

Xi,2,7=**1** 



### **Invalid Data Transfers Scenarios**

7

k=7

6

←

7

8

Xi,1,7=**1** 

Xi,2,7=**0** 







# Steps on Interconnect Communication Scheduling (ICScheduling) Algorithm

- Derive Data Transfer Diagram (DTD) on given schedule input
- Define vector X<sup>T</sup> = (...x<sub>ijk</sub>, ..., x<sub>NJK</sub>) where N is the total number of transfers and K is the number of segments, L is the schedule length.
- Define vector S<sup>T</sup> = (s<sub>1</sub>, ..., s<sub>j</sub>, ..., s<sub>n</sub>) where s<sub>j</sub> is non negative integer
- Define vector V<sup>T</sup>=(1,...,1, b,...,b). There are N 1s and K bs in the vector.
- Construct matrix C based on the properties of the DTS
  - Data transfer over one segment takes one control step.
    For each i, 1< i ≤N; each j, 1< j ≤J, ∑<sub>k</sub> x<sub>iik</sub> = 1
  - Data transfer has to travel through all segments from start to end in sequential order.

For each i, 1< i  $\leq$ N; each j, 1< j  $\leq$ J, 1<k $\leq$ K, x<sub>ijk</sub> + $\Sigma_1$ <sup>(j+1)  $\leq$ I  $\leq$ J $\Sigma_m$ <sup>0  $\leq$ m  $\leq$ (k-1) x<sub>ilm</sub>  $\leq$  2</sup></sup>

• Maximum concurrent data transfers at any single control step over any one segment is subject to the bandwidth requirement.

For each k, 1<k $\leq$ K, each j, 1< j  $\leq$ J,  $\Sigma_i x_{ijk} \leq b_i$ 

 Concurrent data transfers are allowed on non-overlapping segment at any control step.

For each k, 1<k $\leq$ K, each j,  $\sum_{\text{distinct i}} x_{ijk} + x_{i(j+1)} k \leq b_j$ 

- Construct matrix C
- Solving CY = V for X

# **Switch Configurations**



#### Interconnect Energy Minimization Scheduling (IEMS) Algorithm

**Input:** DFG G(V,E,t,d), CLU, lc **Output:** *bmin*, *DSmin*, *CSmin*, *Emin*  $CS \leftarrow$  Initial Schedule(G,CLU):  $CSmin \leftarrow CS$ :  $(bmin, DSmin) \leftarrow ICScheduling(CS);$ *Emin* ← Caculate\_Energy(*bmin*, DSmin, CSmin); for each *i*,  $0 < i \leq 2|V|$  do CS ← CSRStep(CS, *lc*): (b, DS) ← ICScheduling (CS); *E*← Caculate\_Energy(*b*, *DS*, **CS**); if E < Emin then  $bmin \leftarrow b$ ;  $DSmin \leftarrow DS$ ; Smin  $\leftarrow S$ ; Emin  $\leftarrow E$ ; end if end for return bmin, DTmin, Smin, Emin;

•CSRStep is a function to perform the rotation on rotatable nodes at each control step •A node is rotatable if it doesn't have any input edges or all input edges have non-negative delays. •Bipartite matching for reassigning the rotatable nodes •Bipartite weight  $W(v, seati) = \sum_{w in V}$ e<sub>v.w</sub>|Core(w) – Core(seati)| •Calculate\_Energy is a function based on the interconnect energy estimation model

# **Experimental Data**

| Apps         | Ν  | Lsch | Dis1 | Dis2 | b1 | b2 | R <sub>Ed</sub> | R <sub>Es</sub> |
|--------------|----|------|------|------|----|----|-----------------|-----------------|
| liR          | 4  | 4    | 4    | 9    | 2  | 2  | 0.55            | 0               |
| IIR uf2      | 7  | 4    | 15   | 30   | 4  | 4  | 0.5             | 0               |
| llR uf3      | 12 | 5    | 41   | 81   | 6  | 6  | 0.49            | 0               |
| IIR uf4      | 14 | 7    | 29   | 78   | 4  | 5  | 0.62            | 0.2             |
| Volterra     | 11 | 12   | 9    | 63   | 1  | 5  | 0.85            | 0.8             |
| Volterra uf2 | 21 | 12   | 23   | 170  | 2  | 13 | 0.86            | 0.84            |
| C-schwa      | 7  | 9    | 7    | 28   | 1  | 7  | 0.75            | 0.85            |
| C-schwa uf2  | 15 | 9    | 27   | 120  | 2  | 12 | 0.77            | 0.83            |
| 4 Lattice    | 13 | 9    | 25   | 96   | 3  | 11 | 0.73            | 0.72            |
| All-pole     | 5  | 13   | 1    | 5    | 1  | 1  | 0.8             | 0               |
| All-pole uf2 | 14 | 18   | 5    | 23   | 2  | 4  | 0.78            | 0.5             |
| All-pole uf3 | 21 | 24   | 5    | 40   | 2  | 3  | 0.87            | 0.3             |
| Average      |    |      |      |      |    |    | 0.71            | 0.23            |

# Conclusion

- Interconnect network has become the performance bottleneck and it represents a considerable amount of the total area and energy consumption
- Scheduler gets more challenge as it has to take the inter-processor communication into account
- This paper focuses on developing models, methodologies and algorithms on interconnect with
  - Segmented buses
    - Determine the minimum number of segmented buses and the switch configurations.
    - Scheduling to minimize the interconnect energy consumption.

# **Any Question?**

# **Backup Slides**

7

k=7

6

7

8

Xi,1,7=1

Xi,2,7=**1** 





7 8 k=6 k=7 6 6 Xi,1,7=1 Xi,1,6=1 Xi,2,6=**0 X**i,2,7=**0 X**i,3,7=**1 X**i,3,6=**0** <7

(m)





(0)



7 8 k=6 k=7 6 Xi,1,7=**0** Xi,1,6**=0** Xi,2,7=**1** Xi,2,6=1 **X**i,3,7=**1 X**i,3,6=**0** 7

(q)









(u)