

### Network-on-Chip Router Design with Buffer-Stealing

Wan-Ting Su, Jih-Sheng Shen\*, Pao-Ann Hsiung

Embedded Systems Laboratory, Department of Computer Science & Information Engineering, National Chung Cheng University

ASP-DAC 2011

## Outline

- Introduction
- Motivation and Objective
- Buffer Stealing Design
- Experiments
- Conclusions

## Network-on-Chip

- A basic Network-on-Chip (NoC) architecture consists of:
  - Routers
  - Communication links
  - Network-Interface Components (NICs)



## NoC Buffer

- Flits are buffered when they:
  - Wait for routing decisions
  - Compete for the same output channel
  - Find no buffer space in the next-hop router
- The utilization of input buffers directly influences:
  - NoC congestion
  - Throughput
  - Packet latency

## Motivation

- Increasing the buffer size (in the limit, an infinite buffer) would:
  - Reduce NoC congestion and packet latency
  - Enhance throughput
- However, large buffers have two problems:
  - High hardware resource overhead
  - Large energy consumption
    - 64% of the total router leakage power [32]
    - More dynamic energy consumption [33]

[32] H. Wang, L.-S. Peh, and S. Malik, "Power-driven design of router microarchitectures in on-chip networks," in *Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture* (MICRO), pp. 105-116, 2003.

[33] T. T. Ye, L. Benini, and G. De Micheli, "Analysis of power consumption on switch fabrics in network routers," in *Proceedings of the 39th Design Automation Conference* (DAC), pp. 524-529, 2002.

## Related Work: Virtual Channel Method

| Buffer-sharing methods | Lai [7] (DAC'08) | Liu [4] (CSS'07)<br>Liu [9] (MWSCAS'06) | Neishaburi [11] (GLSVLSI'09) | Our method (BS) |
|---|---|---|---|---|
| Simulation environment | Cycle-accurate simulator | Flexsim 1.2 | RTL, VHDL | RTL, VHDL<br>Modelsim SE 5.8d |
| Routing algorithm | Dimension routing | Adaptive routing | XY-YX routing | XY routing |
| Traffic | | | | |
| | | (one VC = 4 flits) | | |
| Performance enhancement | 1) Throughput: 8.3% increase<br>2) Latency: 19.6% decrease | Throughput:<br>2% higher than SAMQ<br>1% higher than DAMQall | Latency:<br>7.1% decrease in uniform<br>3.5% decrease in transpose | 1) Throughput: 40% increase (23.47% on average)<br>2) Latency: 22.46% decrease (10.17% on average) |
| Compared to | 1) 2 VCs router<br>2) 4 VCs router | 1) SAMQ<br>2) DAMQall | 4 VCs router (one VC = 16 flits) | Extended original buffer (with the same additional hardware overhead) |

- Virtual channel buffers require up to nearly 50% of the area and account for 64% of the leakage power in a router implemented in 70 nm CMOS technology.

## Related Work: Central Buffer Sharing Method

A central buffer-sharing method has been proposed in [17, 18].



[17] P.-T. Huang and W. Hwang, "2-level FIFO architecture design for switch fabrics in network-on-chip," in Proceedings of the International Symposium on Circuits and Systems. IEEE, May 2006, pp. 4863–4866.
[18] L.-F. Leung and C.-Y. Tsui, "Optimal link scheduling on improving best-effort and guaranteed services performance in network-on-chip systems," in Proceedings of the 43rd Design Automation Conference. ACM, July 2006, pp. 833–838.

## Objective

#### Buffer Stealing (BS) mechanism

- Use the free input buffers of other input channels at runtime
  - Instead of increasing the buffer size at design time
- Advantages:
  1. Congestion reduction
  2. Throughput enhancement
  3. Efficient buffer utilization
  4. Low resource overhead
- The concept of BS can easily be generalized to routers with VCs
  - The hardware overhead and power drawbacks incurred by VC routers must then be considered

## An Example of Buffer Stealing



## **Circuit Designs for Buffer Stealing**

#### Thief buffer design

- A thief buffer is a buffer that steals buffer space from other channels when its own free space is not enough to store incoming flits.

#### Victim buffer design

- A victim buffer is a buffer whose free space can be stolen by a thief buffer.
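The thief and victim roles can be illustrated with a small software model. This is only a behavioral sketch in Python, not the circuit from the slides: the class name `StealingBuffers`, the buffer depth, and the victim-selection policy (pick the channel with the most free slots) are all assumptions made for illustration.

```python
from collections import deque


class StealingBuffers:
    """Toy model of per-channel input buffers with buffer stealing.

    Assumptions (not from the slides): every channel has the same depth,
    and a thief picks the victim that currently has the most free slots.
    """

    def __init__(self, num_channels=5, depth=4):
        self.depth = depth
        # used[b] = number of occupied slots in physical buffer b
        self.used = [0] * num_channels
        # fifo[c] = logical FIFO of channel c: (flit, physical buffer holding it)
        self.fifo = [deque() for _ in range(num_channels)]

    def free_slots(self, buf):
        return self.depth - self.used[buf]

    def push(self, ch, flit):
        """Store a flit arriving on channel ch.

        Returns the index of the physical buffer that holds the flit,
        or None if no buffer has free space (the sender must wait).
        """
        if self.free_slots(ch) > 0:
            home = ch  # normal case: use the channel's own buffer
        else:
            # Channel ch becomes a thief and looks for a victim buffer.
            victims = [v for v in range(len(self.used))
                       if v != ch and self.free_slots(v) > 0]
            if not victims:
                return None
            home = max(victims, key=self.free_slots)
        self.used[home] += 1
        self.fifo[ch].append((flit, home))
        return home

    def pop(self, ch):
        """Remove the oldest flit of channel ch and free its slot."""
        if not self.fifo[ch]:
            return None
        flit, home = self.fifo[ch].popleft()
        self.used[home] -= 1
        return flit
```

With three channels of depth 2, four flits arriving on channel 0 fill its own two slots and then steal one slot each from channels 1 and 2, while `pop(0)` still returns them in arrival order.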





## System Architecture

NoC parameters used in this work:

- A 64-bit router with 5 input buffers

| Parameter | Selected option | Advantages |
|---|---|---|
| Arbitration logic | Round-robin algorithm | 1) All input requests are eventually granted<br>2) Prevents starvation |
| Switching method | Wormhole switching | 1) Minimal buffering requirement<br>2) Low-latency communication |
| Routing algorithm | X-Y routing algorithm | 1) Minimizes the area overhead<br>2) Minimizes the control overhead |
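Two of the selected parameters are simple enough to sketch in software. The Python below is a generic illustration of X-Y routing on a 2D mesh and of round-robin arbitration, not the VHDL used in this work. The port names, the coordinate convention (larger y is "NORTH"), and the names `xy_route` and `RoundRobinArbiter` are assumptions.

```python
def xy_route(current, destination):
    """Dimension-ordered X-Y routing: correct the X offset first, then Y.

    Coordinates are (x, y) tuples on a 2D mesh. The port names and the
    orientation of the axes are assumptions made for this sketch.
    """
    cx, cy = current
    dx, dy = destination
    if dx > cx:
        return "EAST"
    if dx < cx:
        return "WEST"
    if dy > cy:
        return "NORTH"
    if dy < cy:
        return "SOUTH"
    return "LOCAL"  # the flit has arrived at its destination router


class RoundRobinArbiter:
    """Grants one request per call, rotating priority after each grant."""

    def __init__(self, num_inputs=5):
        self.num_inputs = num_inputs
        # Start so that input 0 has the highest priority on the first call.
        self.last_granted = num_inputs - 1

    def grant(self, requests):
        """requests is a list of booleans, one per input; returns an index or None."""
        for offset in range(1, self.num_inputs + 1):
            idx = (self.last_granted + offset) % self.num_inputs
            if requests[idx]:
                self.last_granted = idx
                return idx
        return None
```

Because the search always starts just after the last granted input, an input that keeps requesting is granted within at most `num_inputs` grants, which is the starvation-free property listed in the table.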

## **Experiment Environment**

- Simulation environment
  - Modelsim SE 5.8d
  - Cycle-accurate simulation, VHDL RTL coding style
- Synthesis environment
  - XILINX ISE 8.2.03i
  - Target platform: XILINX ML410 (xc4vfx60-11ff1152) FPGA

## Comparisons of Different Buffer Implementation

- Comparison with the conventional design under the same additional hardware overhead
  - Based on the NoC area model in [6]
  - Extended Buffer (EB): 80 bits vs. Buffer Stealing design: 64 bits

[6] M.-M. Kim, J.-D. Davis, M. Oskin, and T. Austin, "Polymorphic On-Chip Networks," in Proceedings of the 35th International Symposium on Computer Architecture, 2008, pp. 101-112.

### **Comparison with Extended Buffer**

Throughput



### Comparison with Extended Buffer (cont.)

Latency

Figure axis label: Input period : Output period

## Comparisons of Different Buffer Implementation

- Comparison with the **central buffer design** [18]
  - [18] L.-F. Leung and C.-Y. Tsui, "Optimal link scheduling on improving best-effort and guaranteed services performance in network-on-chip systems," in Proceedings of the 43rd Design Automation Conference. ACM, July 2006, pp. 833–838.

## **Comparison with Central Buffer**

#### Throughput

Figure axis label: Input period : Output period

#### Latency

Figure axis label: Input period : Output period

## Performance / Hardware Resource Overhead

#### TABLE I

Synthesis results for different buffer designs

| Buffer design   | Frequency (MHz) | Hardware overhead (%) |
|-----------------|-----------------|-----------------------|
| Original buffer | 232.056         | baseline              |
| Buffer Stealing | 203.442         | 22.18                 |
| Central Buffer  | 142.511         | 215.48                |

- The results show the large hardware overhead of the central buffer (an additional 215.5% of resources required).
  - The central buffer design is not a **cost-efficient** implementation.
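Table I also gives the synthesized clock frequency of each design. As a small worked example, the Python sketch below computes how much lower each frequency is relative to the original buffer. The frequencies are the values from Table I; the dictionary and function names are illustrative.

```python
# Synthesized clock frequencies in MHz, copied from Table I.
SYNTHESIS_MHZ = {
    "Original buffer": 232.056,
    "Buffer Stealing": 203.442,
    "Central Buffer": 142.511,
}


def frequency_drop_percent(design, baseline="Original buffer"):
    """Relative clock-frequency reduction of a design versus the baseline, in percent."""
    return (1.0 - SYNTHESIS_MHZ[design] / SYNTHESIS_MHZ[baseline]) * 100.0


for design in ("Buffer Stealing", "Central Buffer"):
    print(f"{design}: {frequency_drop_percent(design):.2f}% lower clock frequency")
```

By this calculation the Buffer Stealing design clocks roughly 12% lower than the original buffer, and the central buffer roughly 39% lower.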

- Throughput to hardware overhead ratio (in 1350 cycles)
  - Throughput / hardware overhead (%)
- 1/Latency to hardware overhead ratio (300 flits)
  - (1 / Latency) / hardware overhead (%)

Figure axis label: Input period : Output period
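The two cost-efficiency metrics above are plain ratios, sketched here in Python. The function names are illustrative. The hardware overhead values come from Table I, but the throughput number in the usage lines is a placeholder chosen only to show the arithmetic, not a measured result.

```python
def throughput_per_overhead(throughput_flits, overhead_percent):
    """Throughput / hardware overhead (%)."""
    if overhead_percent <= 0:
        raise ValueError("the ratio is undefined for the baseline (no added overhead)")
    return throughput_flits / overhead_percent


def inverse_latency_per_overhead(latency_cycles, overhead_percent):
    """(1 / Latency) / hardware overhead (%)."""
    if overhead_percent <= 0:
        raise ValueError("the ratio is undefined for the baseline (no added overhead)")
    return (1.0 / latency_cycles) / overhead_percent


# Hardware overhead values from Table I.
BS_OVERHEAD = 22.18
CENTRAL_OVERHEAD = 215.48

# PLACEHOLDER throughput (not a measured value): the same number is used
# for both designs to isolate the effect of the overhead term.
placeholder_throughput = 1000
print(throughput_per_overhead(placeholder_throughput, BS_OVERHEAD))
print(throughput_per_overhead(placeholder_throughput, CENTRAL_OVERHEAD))
```

At equal throughput, the design with the smaller hardware overhead scores higher on the ratio, which is why the central buffer has to deliver far more throughput to be competitive on this metric.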

## Conclusions

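The main idea of the proposed Buffer Stealing (BS) design: use the free input buffers of other input channels at runtime instead of increasing the buffer size at design time.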

#### Experimental results show

- Throughput increases by 23.47% on average
- Latency decreases by 10.17% on average
  - Better congestion tolerance

#### Future work

- More **real-world example** implementations
- Support for dynamically reconfigurable systems

## Thank you for listening!