

Authors: Yuan Dai, Xuchen Gao\*, et al.

Presenter: Xuchen Gao

Institution: Fudan University, China

Email: xcgao22@m.fudan.edu.cn

- Introduction
- Problem Formulation
- Proposed Methodology
- Experiment
- Conclusion

- Introduction
- Problem Formulation
- Proposed Methodology
- Experiment
- Conclusion

## **Background**

 The Workflow of Coarse-Grained Reconfigurable Architecture (CGRA)



#### **Motivation**

#### Motivation Example



Motivating example code

DFG of example code

#### **Motivation**

#### Motivation Example

■ Invalid scheme with N = 2 and B = 32



| Bank0      | Bank1      |
|------------|------------|
| A[0][0~31] | A[1][0~31] |
| A[2][0~31] | A[3][0~31] |
|            | •••••      |

Bank index =  $\left| \frac{x}{32} \right| \% 2$ x: the address of data  $\bigcirc LD_0: A[i-1][j]$ 

 $lueblight LD_1: A[i-1][j-1] \ lueblight LD_2: A[i+2][j-1] \ lueblight ST_0: A[i][i]$ 

 $LD_0$ 

ADD

#### **Motivation**

#### Motivation Example



 $\bigcirc LD_0: A[i-1][i]$  $\blacksquare LD_1 : A[i-1][j-1]$  $D_{2}$ : A[i+2][i-1]  $OST_0$ : A[i][i]

ADD

- Introduction
- Problem Formulation
- Proposed Methodology
- Experiment
- Conclusion

#### **Definitions**

Affine Address Access

$$\phi(\vec{i}) = [AS_0 \ AS_1 \dots AS_{l-1}] \begin{vmatrix} i_0 \\ i_1 \\ \dots \\ i_{l-1} \end{vmatrix} + BA$$

Access Pattern

$$P_A = \{\phi_1(\vec{i_1}), ..., \phi_m(\vec{i_m})\}$$

Control Step Access Pattern

$$P_A^n = \{\phi_1(\vec{i_1}), ..., \phi_q(\vec{i_q})\}$$
 $q (q \le m)$  memory accesses at  $n^{th}$  cycle

#### **Definitions**

Memory Partition

$$f(\phi) = \lfloor \frac{\phi}{B} \rfloor \% N; \qquad g(\phi) = \lfloor \frac{\phi}{N \cdot B} \rfloor \cdot B + \phi \% B$$
 
$$f(\phi) \text{: bank function} \qquad g(\phi) \text{: offset function}$$

Iteration Distance

$$iter_{d}^{\alpha\beta} = \frac{data\_delay}{Mapped\_II}$$

## **Target**

1 Access II

$$AII = MAX(Con_i), 1 \le i \le N$$

2 Iteration Count Vector

$$\forall \vec{i_{\alpha}}, \vec{i_{\beta}} \in M, \ \forall \phi_{\alpha}(\vec{i_{\alpha}}), \phi_{\beta}(\vec{i_{\beta}}) \in P_{A}^{n}, \alpha \neq \beta$$

$$(\vec{i_{\alpha}} - \vec{i_{\beta}}) \cdot (1, ic_{0}, ..., \prod_{t=0}^{l-2} ic_{t}) = iter_{d}^{\alpha\beta}, \ f(\phi_{\alpha}(\vec{i_{\alpha}})) \neq f(\phi_{\beta}(\vec{i_{\beta}}))$$

Given an l-level loop in the iteration domain M with m affine memory accesses on the same array

$$P_A = \{\phi_1(\overrightarrow{i_1}), \dots, \phi_m(\overrightarrow{i_m})\}$$

find the N and B such that:

Minimize AII in ① & Hold the equation ②

- Introduction
- Problem Formulation
- Proposed Methodology
- Experiment
- Conclusion

#### **End-to-End Framework: Hardware**



- Enhanced IOB to access different banks during execution
- Specific mechanism for loading/storing data from/to external memory

The SoC supports parallel memory access

#### **End-to-End Framework: Software**



Software flow with the proposed memory partition algorithm

## **End-to-End Framework: Software**



## **CSP** formulation

- Constraint Satisfaction Problem(CSP)
  - Iteration Constraint(pre-mapping): Before mapping, all the accesses in the  $P_A$  are regarded in the same iteration

$$\vec{i_{\alpha}} = \vec{i_{\beta}} = \vec{i}; \ \vec{i} \in M$$

■ Bank Constraint:  $\phi_{\alpha}(\vec{i}_{\alpha})$  and  $\phi_{\beta}(\vec{i}_{\beta})$  access the same bank equation:

$$f(\phi_{\alpha}(\vec{i_{\alpha}})) = f(\phi_{\beta}(\vec{i_{\beta}})) \Leftrightarrow \lfloor \frac{\phi_{\alpha}(\vec{i_{\alpha}})}{B} \rfloor \% N = \lfloor \frac{\phi_{\beta}(\vec{i_{\beta}})}{B} \rfloor \% N$$

$$\Leftrightarrow \exists k \in \mathbb{Z}, \ s.t. \ \lfloor \frac{\phi_{\alpha}(\vec{i_{\alpha}})}{B} \rfloor = \lfloor \frac{\phi_{\beta}(\vec{i_{\beta}})}{B} \rfloor + kN$$





#### **CSP** formulation

Constraint Satisfaction Problem(CSP)

$$-B < \left[ (AS_0^{\alpha} - AS_0^{\beta}) \dots (AS_{l-1}^{\alpha} - AS_{l-1}^{\beta}) \right] \begin{bmatrix} i_0 \\ i_1 \\ \dots \\ i_{l-1} \end{bmatrix} + (BA^{\alpha} - BA^{\beta}) - kNB < B$$

**\*\*CSP problem:** the conflict exists when one solution is found under such the constraints

# **Graph Coloring-based Scheduling**

- When failing to find a conflict-free solution
  - schedule the conflict accesses into different control steps
  - minimizing the All



Example of graph coloring and scheduling process

# **CSP-based Post-mapping Checking**

• Iteration Constraint: now  $iter_d^{\alpha\beta} \neq 0$ 

$$(\vec{i}_{\alpha} - \vec{i}_{\beta}) \cdot (1, ic_0, ..., \prod_{t=0}^{l-2} ic_t) = iter_d^{\alpha\beta}$$

It can be transformed into a CSP problem, where the conflict exists when one solution is found

CSP Check

$$\exists k \in \mathbb{Z}, \ s.t. \ -B < \phi_{\alpha}(\vec{i_{\alpha}}) - \phi_{\beta}(\vec{i_{\beta}}) - kNB < B$$

- Introduction
- Problem Formulation
- Proposed Methodology
- Experiment
- Conclusion

## **System Level Evaluation**



# **System Level Evaluation**



Comparison of performance



Comparison of energy efficiency

## **System Level Evaluation**



Comparison of compilation time

## **CGRA Level Evaluation**

- $Energy\ efficiency = \frac{Throughput}{Power\ consumption}$
- $Throughput = \frac{Frequency}{Mapped_{II}}$



- Introduction
- Framework of our work
- Analyzing & transforming passes
- SoC runtime configuration of CGRA
- Experiment
- Conclusion

## Conclusion

- A CSP-based memory conflict detection for memory partition
- A graph coloring-based approach for memory access scheduling during the partition process to improve performance
- An end-to-end CGRA framework, including the data parallel architecture and compiler with efficient data parallelism support



# Thanks

Yuan Dai, Xuchen Gao, Chen Shen, Bingbing Peng, Wenbo Yin, Wai-Shing Luk\*, Lingli Wang\*

State Key Laboratory of Integrated Chips and Systems Fudan University, Shanghai, China Email: xcgao22@m.fudan.edu.cn, \*llwang@fudan.edu.cn