

# FPGA-Accelerated Maze Routing Kernel for VLSI Designs

Xun Jiang<sup>1</sup>, Jiarui Wang<sup>2</sup>, Yibo Lin<sup>2</sup>, and Zhongfeng Wang<sup>1</sup>

<sup>1</sup> ICAIS Lab, School of Electronic Science and Engineering, Nanjing University

<sup>2</sup> CECA, CS Department, Peking University

ASP-DAC, Jan. 17-20, 2022, Virtual Conference







Outline



### • Design Methodology

### • Experiment



Outline



### •Design Methodology

•Experiment

### **Detailed Routing**



Solution of Dr.CU on ISPD18 test10 [1]

- Main Challenges
  - Complicated design rules
  - Large solution space  $(10^4 \times 10^4 \times 10 \text{ grid graph})$
  - More time-consuming with new technology

[1] Dr. CU: Detailed routing by sparse grid graph and minimum-area-captured path search, TCAD, 2019

### Maze Routing Kernel

|  |    | 3               | 2                | 3                | 4               | 5  | 6               | 7               | 8               | 9                | 10              | <mark>11</mark> | <mark>12</mark> | <mark>13</mark> | <mark>14</mark>  |                 |                 |
|--|----|-----------------|------------------|------------------|-----------------|----|-----------------|-----------------|-----------------|------------------|-----------------|-----------------|-----------------|-----------------|------------------|-----------------|-----------------|
|  |    | 2               | 1                | 2                | 3               | 4  | 5               | 6               | 7               | 8                | 9               | <mark>10</mark> | 11              | <mark>12</mark> | <mark>1 3</mark> | <mark>14</mark> |                 |
|  |    | 1               | S                | •                | 2               | 8  | 4               | 5               | 6               | 7                | 8               | 9               | 10              | <mark>11</mark> | <mark>12</mark>  | <mark>13</mark> | <mark>14</mark> |
|  |    | 2               | 1                | 2                | 3               | 4  | 5               | 6               | 7               | 8                | 9               | 10              | 11              | <mark>12</mark> | <mark>1 3</mark> | <mark>14</mark> |                 |
|  |    |                 |                  |                  |                 | 5  | 6               | 7               | 8               | 9                | 10              | <mark>11</mark> | <mark>12</mark> | <mark>13</mark> | <mark>14</mark>  |                 |                 |
|  |    |                 |                  | 14               |                 | 6  | 7               | 8               | 9               | 10               | 11              | <mark>12</mark> | <mark>13</mark> | 14              |                  |                 |                 |
|  |    |                 | 14               | <mark>13</mark>  |                 | 7  | 8               | 9               | 10              | 11               | <mark>12</mark> | <mark>13</mark> | <mark>14</mark> |                 |                  |                 |                 |
|  | Т  | <mark>14</mark> | <mark>1 3</mark> | <mark>12</mark>  |                 | 8  | 9               | 10              | 11              | <mark>12</mark>  | <mark>13</mark> | 14              |                 |                 |                  |                 |                 |
|  | 14 | <mark>13</mark> | 12               | 11               | 10              | 9  | <mark>10</mark> | 11              | <mark>12</mark> | <mark>1 3</mark> | 14              |                 |                 |                 |                  |                 |                 |
|  |    | <mark>14</mark> | <mark>1 3</mark> | <mark>12</mark>  | <mark>11</mark> | 10 | <mark>11</mark> | <mark>12</mark> | <mark>13</mark> | 14               |                 |                 |                 |                 |                  |                 |                 |
|  |    |                 | <mark>14</mark>  | <mark>1 3</mark> | <mark>12</mark> | 11 | <mark>12</mark> | <mark>13</mark> | 14              |                  |                 |                 |                 |                 |                  |                 |                 |

- Searching Space:
  - ◆ 2D/3D Grid Graph
- Searching Process:
  - Select vertex with minimal cost
  - Expand frontier
  - Check terminal vertex
  - Reconstruct path



### **FPGA** Acceleration



# Next is EDA?



### **Previous Works and Challenges**

- Previous Works
  - Detailed routing: TritonRoute[ICCAD'18], Dr. CU[TCAD'20], and et al
    - Only leverage parallelization on CPU
  - FPGA-accelerated FPGA routing: [Korolija et al, IPDPSW'19]
    - 4-6x slower than Intel Core i5
- Challenges
  - Data dependency in maze routing
  - Different size of nets
  - A large number of random memory access





### Introduction

### • Design Methodology

• Experiment



### **Multi-Pin Maze Routing**

- Goal: Connect all pins on the net with minimal cost
- Input: A grid graph G(V, E), N sets of vertices {S<sub>n</sub>} related to the set of pins {p<sub>n</sub>}
- Output: Path *P* with minimum total cost

#### **Cost Metrics:**

- 1. Total wire length
- 2. Total via count
- 3. Non-preferred usage
- 4. Design rule violations





Path **P** = Set of Partial Searched Paths

### **Hardware Optimization Methods**

#### **Baseline Software Dr.CU[2]**



#### **Hardware Optimization Methods**

- **Priority Queue Operation:** 
  - Pipelined hardware priority queue
  - Prediction of priority queue
- Computing Dominated Part:
  - ♦ Fast hardware circuit
- Memory Access Dominated Part:
  - Compact graph data structure
  - Temporary cost reuse on chip
- Complex Control Flow:
  - Predefined operation and schedule

### Maze Routing Processing Element

**Maze Routing PE Structure** 

**Operations and schedule** 



- Priority Queue : sort vertices by temporary cost in ascending order
- Controller: schedule different operations of hardware shown in the right picture
- Executor: compute the new length/penalty/cost of adjacent vertices and update the temporary cost

### **Tri-State Priority Queue**

#### Base architecture: P-Heap [3]



Time complexity: O(1)

FPGA LUT resource: O(log(n))

FPGA SRAM resource: O(n)

Only leverage internal parallelism of Priority Queue

#### Priority Queue with Prediction:



Predict Policy: Predict the new top vertex before the last popped vertex's adjacent vertices pushed into the Priority Queue.

#### Analysis of prediction in timeline:



[3] R. Bhagwan and B. Lin, "Fast and scalable priority queue architecture for high-speed network switches," in Proceedings IEEE INFOCOM 2000



**Graph Data Structure** 

(1) Graph Data

#### (2) Vertex Data

|                           | EdgeCost1 |            |      |      |    |         |    | EdgeCost0  |            |      |         |      |  |
|---------------------------|-----------|------------|------|------|----|---------|----|------------|------------|------|---------|------|--|
|                           |           | EdgeCost3  |      |      |    |         |    |            | EdgeCost2  |      |         |      |  |
| 1 Graph Pin Table         |           | EdgeCost5  |      |      |    |         |    |            | EdgeCost4  |      |         |      |  |
| 2 Pin Vertex Table        |           | newLenAdd1 |      |      |    |         |    |            | newLenAdd0 |      |         |      |  |
| ③ Chip Data Block         |           | newLenAdd5 |      |      |    |         |    | newLenAdd4 |            |      |         |      |  |
| -                         |           | \          | Vert | tex3 |    | Vertex2 |    |            | Vertex1    |      | Vertex0 |      |  |
| 4 Vertex Property Table 🗕 |           | М          | uL   | L5   | L4 | L3      | L2 | L1 L0      | Vertex5    |      | Vertex4 |      |  |
|                           |           |            |      |      |    | Pir     | າ5 | Pin4       | Pin3       | Pin2 | Pin1    | Pin0 |  |

#### **Adjacent List Data Structure**

Compact vertex data

Contiguous graph data  $\rightarrow$  Better data transfer via PCIe interface

Better memory access on DRAM  $\rightarrow$ 

### **Batch Nets Parallelization**

**Control Signal to PE** 

**AXI Memory Access** 

**Graph Data in DDR** 

**Vertex Data in PE** 

**Batch Data in DDR** 

**PE Done Signal** 



#### **Functions of Graph Dispatcher:**

- 1. Start Maze Routing PEs and monitor their state
- 2. Schedule batched graphs

#### Schedule Policy:

- 1. Fixed Priority
- 2. Start PE whenever PE is idle until no workloads reserved in DDR

#### **Principle:**

- 1. Make all PEs busy
- 2. PEs run in parallel independently





### Introduction

### •Design Methodology

### • Experiment



### **Experiment Settings**

- Experimental Platform (Amazon EC2 f1.2xlarge instance)
  - ♦ 8-core Intel Xeon CPU E2-2686 v4 @2.30GHz, 122GB Memory
  - ◆ Xilinx Ultrascale+ VU9P FPGA, 1 16GB DDR4 Interface
- Configurations
  - CPU : 1/2/4/8 threads
  - ◆ FPGA : 1/2/4/8/12 PEs, 125MHz frequency
- Test sets
  - ISPD 2018 contest benchmarks[3]
  - Node number of net less than 2<sup>16</sup> (more than 90% of all nets for large batch)
- [3] S. Mantik et al "ISPD 2018 initial detailed routing contest and benchmarks," ISPD 2018

### **Routing Quality & Runtime**



- CPU with 8 threads (maximum performance), FPGA with 12 PEs
- Runtime is the sum of the time of tested nets running in the first iteration
- Quality degradation is less than 1%
- The speed of each test is above 2x; the speedup of large tests is almost 3x

### Speedup & Scalability

#### Speedup with **batch size**



#### Speedup with **PEs/Threads number**



#CPU runtime(N threads)

batch size = 5923

### **Hardware Resource**



#### Module Resource Usage

BRAM

URAM

- **URAM** is the key to reuse more data on chip (#size of URAM = 8 \* #size of BRAM)
- LUTs and FFs is plenty compared with RAMs
- Still has potential to increase the number of PEs





### Introduction

### •Design Methodology

### •Experiment





NAN.TTNG

- Provide a design methodology to accelerate maze routing on FPGA
- Design an efficient data structure, algorithm and hardware implementation
- Better scalability than multi-threads software
- Up to 3.1x speed-up
- Future Work:
  - Optimize the performance of routing on the CPU-FPGA system
  - Process nets with larger size on FPGA



## Thanks! Questions are welcome

Email: jiangx@smail.nju.edu.cn ICAIS Lab, Nanjing University, China