

# <u>Micro-architecture Pipelining</u> <u>Optimization with Throughput-</u> <u>Aware Floorplanning</u>

Yuchun Ma\* Zhuoyuan Li\* Jason Cong\*\* Xianlong Hong\* Glenn Reinman\*\* Sheqin Dong\* Qiang Zhou\*

\*Department of Computer Science & Technology, Tsinghua University, Beijng 100084, P.R.China \*\*Department of Computer Science, UCLA, USA





#### Introduction

- Problem Formulation
- Simultaneous Block and Interconnect Pipelining
  - > MILP formulation
  - > Graph-based heuristic algorithm
- Throughput aware floorplanning with pipelining
- Case Study for a Design Driver
- Conclusions and Future Works





#### Introduction

- Problem Formulation
- Simultaneous Block and Interconnect Pipelining
  - > MILP formulation
  - > Graph-based heuristic algorithm
- Throughput aware floorplanning with pipelining
- Case Study for a Design Driver
- Conclusions and Future Works





### **Effects of Wire Delay in Pipelined Design**

- In DSM regime, Single-cycle full chip communication will be no longer possible.
- Deeper superscalar pipelines increased latencies and a significant degradation in instruction throughput



- ITRS'01 0.07um Tech
- 5.63GHz across-chip clock
- 800mm<sup>2</sup> (28.3mm x 28.3mm)
- IPEM BIWS estimations
  - Buffer size: 100x
  - Driver/receiver size: 100x
- Pipelining
  - Need 5 clock cycles from corner to corner
  - Performance reduced by a factor of up to 2 to 3





### **Throughput-aware Strategies**

- Physical design should focus on keeping the throughput-critical paths as short as possible;
  - Provide feedback to architects at very early design stages.
    - > Throughput-aware design at the floorplanning level
    - > Architecture pipelining with physical design





### **Previous Works**

#### Throughput-aware floorplanning

- Cong, DAC03 and ASPDAC05]: the first automated microarchitecture exploration system combined with physical planning;
- [Long, DAC04]: a throughput lookup table (LUT);
- Ekpanyapong, DAC04]: minimize a weighted sum of bus latencies;
- > [Nookala, DAC05]: a statistical design of experiments which identify the performance critical buses in a micro-architecture.

#### Retiming techniques

- > Move the flip-flops within a circuit while keeping its functionality;
- Minimize the clock period;
- > The pipeline designs inside blocks are still assumed fixed ;



Different from micro-architecture pipelining which minimizes the latencies on critical paths



### Pipeline models in previous works

Wire-pipelining under the assumption that the blocks are separately designed.

pipelining designs are often sub-optimal. Assume all the paths can be optimized and pipelined independently.

- Block sharing in different loops.
- It is important to use accurate models which can consider both block and wire pipelining simultaneously.





#### Introduction

Problem Formulation

Simultaneous Block and Interconnect Pipelining

- > MILP formulation
- > Graph-based heuristic algorithm
- Throughput aware floorplanning with pipelining
- Case Study for a Design Driver
- Conclusions and Future Works





### Simultaneous Block and Interconnect Pipelining

Path-based pipelining design should optimize the distribution of flip-flops along multiple paths.

- > Represent the micro-architecture design by a *path graph G(V,E)*;
  - Treat blocks and wires equally as the components with corresponding delays;
  - Each node *v* is the joint between a block and a wire.;
  - Each directed edge  $e_i$  represents a wire or a block.







### Simultaneous Block and Interconnect Pipelining

- A problem of labeling the edge with the number of flipflops on each edge.
- Therefore the objective is to find a feasible solution with the optimal performance.
  - > Min: the weighted sum of latencies along the critical paths;
  - Constraints: the number of flip-flops should make clock period feasible along any segment of the paths.







#### Introduction

- Problem Formulation
- Simultaneous Block and Interconnect Pipelining
  - > MILP formulation
  - > Graph-based heuristic algorithm
- Throughput aware floorplanning with pipelining
- Case Study for a Design Driver
- Conclusions and Future Works





### **MILP Formulation**

We define a term  $a(v, P_i)$  that represents the maximum arrival time at node v along path  $P_i$ ;

With the given clock period  $\Phi$ , the set of paths P with the weight w<sub>pi</sub> for each path P<sub>i</sub>, we can then formulate the problem.



| Obj. | $Min \sum_{P_i \in P} (w_{P_i} \times \sum_{e_i \in P_i} f_{e_i})$                             |     |
|------|------------------------------------------------------------------------------------------------|-----|
| s.t. |                                                                                                | (1) |
|      | $a(v,P_i) \geq 0 \ \forall v \in V, \ P_i \in P$                                               | (2) |
|      | ${f}_{\scriptscriptstyle{ei}}\!\!\geq\!\!0  orall \;\; e_i\!\in\!E$                           | (3) |
| a(v) | $(P_i) \geq a(u,P_i) + d_{ei} - \Phi * f_{ei} \forall e_i \in E \text{ and } e_i \text{ is a}$ | 4   |
| с    | connection from node $u$ to node $v$ along $P_i$ .                                             | (4) |

 $a(v,P_1) - a(u,P_1) \ge d_{uv} - \phi \times f_{uv}$ 





### **MILP Formulation**



With the values of  $a(v,P_i)$  and  $f_{ei}$  for all  $e_i \in E$ , we can place flipflops accordingly.





## **Analysis of MILP Approach**

∑|P<sub>i</sub>| real variables a(v,P<sub>i</sub>), |E| integer variables f<sub>ei</sub>, and 2∑|P<sub>i</sub>| + 2|V| constraints
Pros

> Optimal pipeline

#### Cons

•

- > Time consuming
- Not applicable to be embedded in floorplanning iterations.

Obj. 
$$Min \sum_{P_i \in P} (w_{P_i} \times \sum_{e_i \in P_i} f_{e_i})$$

s.t.  $a(v,P_i) \leq \Phi \quad \forall v \in V, P_i \in P$  (1)

$$a(v,P_i) \geq 0 \quad \forall \quad v \in V, \quad P_i \in P \tag{2}$$

$$f_{ei} \ge 0 \quad \forall \ e_i \in E \tag{3}$$

$$a(v,P_i) \ge a(u,P_i) + d_{ei} - \Phi * f_{ei} \forall e_i \in E \text{ and } e_i \text{ is a } connection from node u to node v along  $P_i$ . (4)$$





### **Slacks along paths**

#### Slacks along paths: 1.60

Slack(P) = 0.4  $\Phi$   $1.0\Phi$  Slack<sub>2</sub> = 0.4  $\Phi$   $0.6\Phi$ 

| Sla | ck <sub>1</sub> = 0.4 ⊕ | $Slack_2 = 0 \Phi$ | П |
|-----|-------------------------|--------------------|---|
|     | <b>0.6</b> $\Phi$       | <b>1.0</b> $\Phi$  |   |

Slack<sub>1</sub> > Slack(P) One extra cycle!!



 We can optimize the pipeline design by controlling the extra delay slacks when inserting flip-flop along paths.





#### Dynamic scanning heuristic for combinational circuits

#### For a combinational circuit, there is no loop.

- > A directed acyclic graph (DAG) G'(V',E');
- > A pair of source node s and target node t.

#### Scan the graph in topology order and try to insert flipflop to meet the clock period with the least extra cycles.

- >  $a(v, P_i)$  is the arrival time for each passing node along each path;
- During the traversing, the paths are pipelined partially;
  - Delay slacks on the partially pipelined paths;
  - A newly inserted flip-flop will change the timing distribution in the graph;
  - If the delay between a newly inserted flip-flop and the previous flipflop (or source node) is less than one clock period, it will introduce extra slack.





## **Slack Information**

#### Ideal Slack along path:

if the edges along path are pipelined with no extra slack.

Add a new flip-flop *f*: To meet the clock period, each path *P<sub>i</sub>* passing *e<sub>i</sub>* should satisfie:

 $a(u, P_i) + d_{uf} \leq \Phi$ 

The extra slack caused by f is

 $Extra_Slack(P_i, f) = \Phi - (a(u, P_i) + d_{uf})$ 

- If  $Extra_Slack(P_i, f) > Ideal_Slack(P_i)$ 
  - > f will generate an extra cycle on path  $P_i$
  - $Ideal\_Slack(P_i) = \Phi + Ideal\_Slack_{cur}(P_i) Extra\_Slack(P_i, f)$
- If  $Extra_Slack(P_{i}, f) \leq Ideal_Slack(P_{i})$ 
  - f will not generate an extra cycle on path  $P_i$ .



 $deal\_Slack(P_i) = Ideal\_Slack_{cur}(P_i) - Extra\_Slack(f)$ 

 $Ideal\_Slack_{cur}(P) = (2 - 1.8) \varPhi = 0.2 \varPhi$ 



$$\begin{split} a(n0,P) &= 0\\ d_{0f} &= 0.6\,\varPhi\\ Extra\_Slack(P,f) &= \varPhi \text{-} (\ a(n0,P)\\ + \ d_{0f}) &= 0.4\,\varPhi \text{>} Ideal\_Slack_{cur}(P) \end{split}$$

one extra cycle



## **Dynamic Scanning**



| V   | a(v,P1)    | Ideal<br>Slack<br>(P1) | a (v, P2) | Ideal<br>Slack<br>(p2) | Extra<br>slack | f               | duf | Extra<br>cycle |
|-----|------------|------------------------|-----------|------------------------|----------------|-----------------|-----|----------------|
| S   | 0          | 0.2                    | 0         | 0.6                    | -              | ×               | ×   |                |
| n1  | 0<br>0     | -                      | ×         | ×                      | -              | ×               |     | ×              |
| n2  |            | ×                      | -         |                        | -              | -               | -   | -              |
| n3  | 0.5        |                        | ×         | ×                      | -              | -               |     |                |
| n4  | 0.7        | ×                      | -         |                        | -              | -               | -   | -              |
| n5  | 1.1<br>0.1 | 0.2<br>0.2             | -<br>0.9  | -<br>0.6               | 0              | e <sub>35</sub> | 0.6 | 0              |
| n6  |            |                        |           |                        |                |                 |     |                |
| n7  |            |                        |           |                        |                |                 |     |                |
| n8  |            |                        |           |                        |                |                 |     |                |
| n9  |            |                        |           |                        |                |                 |     |                |
| n10 |            |                        |           |                        |                |                 |     |                |
| Т   | -          | -                      | -         | -                      | -              | -               | -   | -              |



### **Dynamic Scanning**

ab/30/2007



n2

| v   | a(v,P1)     | Ideal<br>Slack<br>(P1) | a (v, P2)  | Ideal<br>Slack<br>(p2) | Extra<br>slack | f                      | duf            | Extra<br>cycle |
|-----|-------------|------------------------|------------|------------------------|----------------|------------------------|----------------|----------------|
| S   | 0           | 0.2                    | 0          | 0.6                    | -              | ×                      | ×              |                |
| n1  | 0<br>V      | -                      | ×          | ×                      | -              | ×                      |                | ×              |
| n2  | <b>0</b>    | ×                      | -          |                        | -              | -                      | -              | -              |
| n3  | 0.5         |                        | ×          | ×                      | -              | -                      |                |                |
| n4  | 0.7         | ×                      | -          |                        | -              | -                      | -              | -              |
| n5  | 1.1<br>0.1  | 0.2<br>0.2             | -<br>0.9   | -<br>0.6               | 0              | <i>e</i> <sub>35</sub> | 0.6            | 0              |
| n6  | 0.6         | -                      | 1.4<br>0.5 | 0.6                    | 0.1            | e <sub>45</sub>        | 0.2            | 0              |
| n7  | 1.3<br>₿.3  | 0.2<br>0.2             | ×          | ×                      | (              | e <sub>6</sub>         | <sub>7</sub> 0 | .4             |
| n8  | ×           | ×                      | 0.         | 8 -                    | -              | -                      |                | -              |
| n9  | <b>Ō.</b> 8 | -                      | ×          | ×                      | -              | -                      | -              |                |
| n10 | ×           | ×                      | 1<br>0.5   | .5 0.                  | 5 (            | e <sub>8</sub>         | 90             | .2             |
| Т   | -0          | -                      | -          | -                      | -              | -                      | -              | -              |



### **Analysis of Dynamic Scanning**

#### The worst complexity should be |Path|<sup>2</sup>|V|.

> |Path| is the number of critical paths in the architecture and |V| is the number of nodes in the graph which is about the total number of blocks and wires.

#### Cycle break for sequential circuits.

- In micro-architecture, almost every component is involved in one or more loops;
- Break the cycles by directly changing the direction of the back edge to the target node t;
- The information of the cycles will be lost, but the error is pretty small as shown in experimental results section.





#### Introduction

Problem Formulation

Simultaneous Block and Interconnect Pipelining

- > MILP formulation
- > Graph-based heuristic algorithm
- Throughput aware floorplanning with pipelining
- Case Study for a Design Driver
- Conclusions and Future Works





### **Superscalar Processors**

 Each of these blocks spans one or more pipeline stages, and instructions typically flow through these stages ;
Extra delay in any of these sub-systems will have different effects on system performance.







### Throughput aware floorplanning with pipelining

#### **Given:**

- > Target clock period  $\Phi$
- Clocking overhead T<sub>overhead</sub>
- > List of blocks with their area, dimensions and delay
- Critical architectural paths with performance sensitivities
- Objective: Generate a floorplan which optimizes for the die area, wirelength and performance, based on the pipeline design for blocks and wires.
  - > Based on a simulated annealing framework with CBL representation
  - Integrate the graph based dynamic approach
  - > Extra latency from the wires is used to compute the new BIPS

 $cos t = w1 * \frac{1}{BIPS} + w2 * Area + w3 * Wire$ 





#### Introduction

- Problem Formulation
- Simultaneous Block and Interconnect Pipelining
  - > MILP formulation
  - > Graph-based heuristic algorithm
- Throughput aware floorplanning with pipelining
- Case Study for a Design Driver
- Conclusions and Future Works





### **Design Driver**

An out-of-order superscalar processor micro-architecture with 4 banks of L2 cache in 70*nm* technology.

| Instruction Cache | 32KB, 32B/block, 2-way                       |
|-------------------|----------------------------------------------|
| Decode Width      | 4                                            |
| ROB Size          | 128 entries                                  |
| Issue Queue       | 32 entries                                   |
| Issue Width       | 4 ALU ops, 2 MEM ops per cycle               |
| Register File     | 70 INT and 70 FP                             |
| Functional Units  | Units 2 IntALU, 1 FPALU, 1 IntMult, 1 FPMult |
| Load/Store Queue  | 32 entries                                   |
| L1Data Cache      | 16KB, 32B/block, 4-way, 2RW ports            |
| Unified L2 cache  | 1MB, 64B/block, 8-way                        |

#### Comparisons

- In MEVA([Jason, DAC03 and ASPDAC05]), the flip-flops are assumed to be ideally inserted along each path, which gives an upper bound for pipeline design(UB)
- > wire-pipelining results (WP)



- the solutions obtained from the MILP solver (MILP)
- our graph-based heuristic approach (GH)



### **Impact of Frequencies**

# Take a fixed packing result and run the different pipeline approaches under the frequency from 2GHz to 7GHz.

- > MEVA is far off from the real designs in most cases(on average 20% larger );
- Wire pipelining loses a lot of the flexibility to get a better design, our approach gives about a 27% performance improvement over it;
- > Our approach has an average of 2% error to the results of the MILP;
- The running time for our approach is about 1 second, while the MILP need about 200s to get a good result.







## **Different Packing results**

# Randomly pick 5 packings and run the pipelining approaches under 3GHz.

- > The difference between our approach and the MILP solution is about 3%;
- The error for upper bound in MEVA is about 16% and the error for wirepipelining results is about 24%.

| packing | Area      | wire   | vire BIPS |       |       |       |  |
|---------|-----------|--------|-----------|-------|-------|-------|--|
|         | (mm*mm)   |        | UB        | WP    | GH    | MILP  |  |
| 1       | 4.1*7.76  | 98471  | 2.677     | 1.539 | 2.136 | 2.262 |  |
| 2       | 4.24*7.92 | 106594 | 2.417     | 1.731 | 2.139 | 2.16  |  |
| 3       | 4.9*6.35  | 102578 | 2.547     | 1.748 | 2.091 | 2.202 |  |
| 4       | 5.48*5.88 | 134051 | 2.654     | 1.563 | 2.319 | 2.346 |  |
| 5       | 6.99*4.64 | 135455 | 2.793     | 1.731 | 2.235 | 2.286 |  |
| Average |           |        | 1.16      | 0.738 | 0.97  | 1     |  |
| Error   |           |        |           |       |       |       |  |





### Integrated with floorplanning

- We integrate graph-based heuristic with the thoughput-driven floorplannning
- Take MILP approach as a post process at the end of the floorplanning.
  - It is an accurate evaluation of the pipelining design, which enables the optimization process to be guided correctly and converge well.

| Frequenc | UB+post_MILP  |              |       | GH            |              |       |  |
|----------|---------------|--------------|-------|---------------|--------------|-------|--|
| y<br>GHz | Area<br>(mm²) | Wire<br>(mm) | BIPS  | Area<br>(mm²) | Wire<br>(mm) | BIPS  |  |
| 2        | 32.           | 115.6        | 1.492 | 31.8          | 142          | 1.714 |  |
| 3        | 34.6          | 103.7        | 2.139 | 33.3          | 108.4        | 2.22  |  |
| 4        | 32.4          | 98.7         | 2.776 | 36.1          | 124.3        | 2.828 |  |
| 5        | 32.8          | 126.2        | 2.885 | 32.6          | 94.17        | 3.35  |  |
| 6        | 36.0          | 108.4        | 3.636 | 33.7          | 100.3        | 3.882 |  |
| 7        | 35.9          | 112.5        | 3.479 | 36.8          | 129.9        | 3.906 |  |
| Comp     | 1             | 1            | 1     | 1.003         | 1.05         | 1.091 |  |





### **Conclusion and Future Works**

First work that considers block pipelining and interconnect pipelining simultaneously;

- > A MILP formulation.
- > A novel dynamic scanning heuristic
- heuristic gives solutions very close to the MILP results (2% more than MILP results on average) but in a much shorter runtime;
- The dynamic scanning heuristic is stable and effective which is applicable in floorplanning optimization;
- Refine the MILP formulation and attempt to handle it in a much more efficient way.



