

ACE-Lab

### A Hierarchical Dataflow-Driven Heterogeneous Architecture for Wireless Baseband Processing

Limin Jiang, Yi Shi, Yintao Liu, Qingyu Deng, Siyi Xu, Yihao Shen, Fangfang Ye, Shan Cao, and Zhiyuan Jiang

School of Communication and Information Engineering, Shanghai University, China

### Outline

- Background & Motivation
- Related Works
- System Design
- Evaluation
- Conclusion

### Background

Gexpansion drives demand for energy-efficient, open-source hardware to replace proprietary solutions.


Implementing hardware for data-intensive wireless baseband processing (WBP) poses major challenges.



### Background

WBP: How?

- DSP:
  - VLIW boosts ILP
  - High control overhead limits scalability
- x86 Server/GPGPU:
  - Massive computing capability
  - High energy consumption
- ASIC:
  - Best PPA
  - Long time to market



### **Motivation**


Two characteristics of WBP:

- Modular: The TDD frame structure separates uplink and downlink into data-independent, successive modules.
- Cyclical: Signal generation or decoding based on communication protocols is on a *subframe* (periodic) basis.



### WBP is decoupled and predictable.

### Contribution

Because WBP data processing is predictable, a cache-free manycore architecture is proposed to improve energy and area efficiency without compromising performance.

- We develop a *pack-and-ship* data dispatch system to enable the tiles to operate in a **bundled access and execution** style, which can drastically reduce the cost of data movement.
- A hierarchical dataflow task scheduling scheme is designed and two strategies, namely multi-threading and lazy-deletion, are proposed to fully utilize the hardware resources.


### **Related Works**

Various works have been presented in academia seeking a way towards manycore parallel computing for WBP.

| Category           | Work     | Core Heterogeneity | Scalability | DLP                     | TLP        | HW/SW Co-design |
|--------------------|----------|--------------------|-------------|-------------------------|------------|-----------------|
| GPPs               | Sora     |                    | $\odot$     | $\odot$                 | $\odot$    |                 |
| Sys-level Analyses | TeraPool |                    |             |                         | -          |                 |
|                    | SPECTRUM |                    | $\bigcirc$  |                         | $\bigcirc$ |                 |
| NoCs               | MACRON   |                    |             | $\bigcirc$              | -          |                 |
|                    | MAGALI   |                    |             | $\textcircled{\bullet}$ | -          | •••             |
| ASIP               | DXT501   |                    |             |                         | $\bigcirc$ |                 |
|                    | Ours     |                    |             |                         |            |                 |

### **System Design: Architecture**

- Tile: RV32IM core with a customized vector extension and local scratchpad memory.
- L2 DMA: Orchestrates tiles via a scalar scheduler.
- CS-SPM: Swap space for the cluster.
- Main scheduler: Manages high-level scheduling and directs the main DMA engine to transfer data between main memory and the clusters.



### System Design: Pack-n-Ship

#### NUMA Approach

- Each tile and cluster has its own SPM, which is accessed by an *outside* DMA engine.
- This eliminates fragmented memory accesses.

#### Before Execution

1. Alter T-SPM direction by atomic instructions.
2. DMA moves data from CS-SPM to T-SPM.
3. Change back T-SPM direction to the RV core.
4. De-assert core reset.

#### After Execution

1. Store results in T-SPM.
2. Notify attributes of the return value to CSRs.
3. Alter T-SPM direction and issue an interrupt.
4. DMA retrieves data back to CS-SPM.
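The before/after-execution steps can be sketched as a small software model. This is only an illustration under stated assumptions: the `Tile` class, the `Owner` enum, the `csr_result_len` field, and the dictionary standing in for the CS-SPM are hypothetical names, not taken from the actual hardware or its software stack.

```python
from dataclasses import dataclass, field
from enum import Enum


class Owner(Enum):
    CORE = "core"  # T-SPM is visible to the RV core
    DMA = "dma"    # T-SPM is visible to the outside DMA engine


@dataclass
class Tile:
    spm: dict = field(default_factory=dict)  # stand-in for the T-SPM
    owner: Owner = Owner.CORE
    in_reset: bool = True
    csr_result_len: int = 0                   # hypothetical CSR: size of the return value
    irq: bool = False


def ship_and_run(tile, cs_spm, bundle_key, kernel):
    """Walk one bundle through the before/after-execution steps."""
    # Before execution
    tile.owner = Owner.DMA                     # 1. hand the T-SPM to the DMA
    tile.spm["in"] = list(cs_spm[bundle_key])  # 2. DMA copies CS-SPM -> T-SPM
    tile.owner = Owner.CORE                    # 3. hand the T-SPM back to the core
    tile.in_reset = False                      # 4. de-assert core reset

    # Execution touches only the local scratchpad
    assert tile.owner is Owner.CORE
    tile.spm["out"] = kernel(tile.spm["in"])   # results stay in the T-SPM

    # After execution
    tile.csr_result_len = len(tile.spm["out"])  # notify return-value attributes
    tile.owner = Owner.DMA                      # hand the T-SPM to the DMA again
    tile.irq = True                             # interrupt the scheduler
    out = tile.spm["out"][: tile.csr_result_len]
    cs_spm[bundle_key + ".result"] = out        # DMA copies T-SPM -> CS-SPM
    return out


cs_spm = {"fft_in": [1, 2, 3, 4]}
result = ship_and_run(Tile(), cs_spm, "fft_in", lambda xs: [2 * x for x in xs])
```

The point of the model is that the kernel never reaches outside its own scratchpad: all data movement happens in two bulk transfers around the execution phase.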



### System Design: Heterogeneous Configuration

#### Configurable dimensions for WBP

1. Number of clusters and tiles: Enhances thread- and task-level parallelism when processing Tx and Rx in consecutive time slots; depends on protocol throughput.
2. SPM footprint: Depends on computation type (FFT, Polar decoding) and multithreading capability.
3. Number of lanes and VRFs: Enhances DLP capabilities.
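One way to picture these dimensions is as a plain configuration record. The sketch below is hypothetical: the class names and the `spm_kib` values are illustrative assumptions, while the lane and VRF counts follow the Large and Small tile definitions given in the ablation study. Reading "3C4T-2L2S" as three clusters of four tiles, each with two large and two small tiles, is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TileConfig:
    lanes: int    # vector lanes in the VXU
    vrfs: int     # VRF count as listed in the ablation study
    spm_kib: int  # hypothetical T-SPM size; not given in the slides


@dataclass(frozen=True)
class ClusterConfig:
    tiles: tuple


LARGE = TileConfig(lanes=64, vrfs=32, spm_kib=64)  # spm_kib is a placeholder
SMALL = TileConfig(lanes=8, vrfs=64, spm_kib=16)   # spm_kib is a placeholder

# Assumed reading of 3C4T-2L2S: 3 clusters x 4 tiles, 2 large + 2 small each
system = tuple(ClusterConfig(tiles=(LARGE, LARGE, SMALL, SMALL)) for _ in range(3))

total_tiles = sum(len(c.tiles) for c in system)
total_lanes = sum(t.lanes for c in system for t in c.tiles)
```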



### **Execution Model: Dataflow Model**


Thread: Several related

#### WBP interpreted as DAGs

- A subsequent module is activated only when all preceding tasks are complete.
- WBP follows a consistent flow over time, enhancing data locality as the DAG information is unlikely to be reconfigured on the hardware.
- Less scheduler-bounded
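The activation rule can be illustrated with a generic ready-queue traversal of a DAG (Kahn's algorithm). This is a textbook sketch, not the scheduler described in the talk; the receive-chain task names are hypothetical.

```python
from collections import deque


def run_dag(deps):
    """deps maps each task to the set of tasks it waits for."""
    remaining = {task: set(pre) for task, pre in deps.items()}
    successors = {task: [] for task in deps}
    for task, pre in deps.items():
        for p in pre:
            successors[p].append(task)

    # A task is ready once it has no unfinished predecessors
    ready = deque(t for t, pre in remaining.items() if not pre)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # "execute" the task
        for nxt in successors[task]:
            remaining[nxt].discard(task)
            if not remaining[nxt]:
                ready.append(nxt)  # all preceding tasks are complete
    return order


# Hypothetical receive chain
rx = {
    "fft": set(),
    "chan_est": {"fft"},
    "equalize": {"fft", "chan_est"},
    "demod": {"equalize"},
    "decode": {"demod"},
}
order = run_dag(rx)
```

Because the same DAG recurs every subframe, the dependency tables built at the top of `run_dag` would only need to be constructed once in a real system.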




### **Execution Model: Dataflow Model**

- Attributes guide the scheduler in selecting the most suitable tile for deployment.
- Runtime adjustment of DAGs
  - Tasks can be dismissed on-the-fly once the worst-case DAG is determined.
  - If the blind detection task detects fewer users, the computational burden can be reduced.
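A minimal sketch of one possible reading of this runtime adjustment: the DAG is built for the worst case, and per-user tasks are dismissed once blind detection reports the actual user count. The task naming scheme, `MAX_USERS`, and the per-user chain are assumptions made for illustration only.

```python
MAX_USERS = 4  # hypothetical worst case the DAG is built for


def build_worst_case_dag(max_users):
    """One blind-detection task followed by a per-user demod/decode chain."""
    dag = {"blind_detect": set()}
    for u in range(max_users):
        dag[f"demod_u{u}"] = {"blind_detect"}
        dag[f"decode_u{u}"] = {f"demod_u{u}"}
    return dag


def lazy_delete(dag, detected_users):
    """Dismiss tasks belonging to users that blind detection did not find."""
    def keep(task):
        if "_u" not in task:
            return True  # shared tasks always stay
        return int(task.rsplit("_u", 1)[1]) < detected_users

    return {t: {p for p in pre if keep(p)} for t, pre in dag.items() if keep(t)}


dag = build_worst_case_dag(MAX_USERS)
pruned = lazy_delete(dag, detected_users=2)
```

With two detected users out of a worst case of four, four of the nine tasks are dropped without rebuilding the DAG from scratch.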




### **Execution Model: Multi-Level Scheduling**

### Tile-Level Scheduling

L2 Scheduler: Processes the nodes (tasks) in the task code pool and checks their readiness through the FIFO queues.

**DAG** nodes.

**Load Indications:** Task to be processed and the preferred tile.

L2 DMA: Transfers data from the compute data section to the heterogeneous tiles
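The interplay of the FIFO queue and the load indications can be sketched as follows. The fallback rule (use any free tile when no tile of the preferred kind is free) and all names are assumptions; the actual selection policy is not spelled out on the slide.

```python
from collections import deque


def l2_schedule(ready_fifo, tile_busy, tile_kinds):
    """Pop ready tasks in FIFO order and emit load indications (task, tile)."""
    indications = []
    deferred = deque()
    while ready_fifo:
        task, preferred_kind = ready_fifo.popleft()
        free = [t for t, busy in tile_busy.items() if not busy]
        # Prefer a free tile of the requested kind, else fall back to any free tile
        match = [t for t in free if tile_kinds[t] == preferred_kind] or free
        if not match:
            deferred.append((task, preferred_kind))  # retry once a tile frees up
            continue
        tile = match[0]
        tile_busy[tile] = True
        indications.append((task, tile))
    ready_fifo.extend(deferred)
    return indications


tile_kinds = {"T0": "large", "T1": "large", "T2": "small", "T3": "small"}
tile_busy = {t: False for t in tile_kinds}
fifo = deque([("fft", "large"), ("scramble", "small"), ("bp_decode", "large"),
              ("demod", "small"), ("crc", "small")])
loads = l2_schedule(fifo, tile_busy, tile_kinds)
```

Each emitted pair corresponds to one load indication; in the architecture described above, the L2 DMA would then move the matching bundle to that tile.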







### Single-Tile Performance

Config: A 64-lane, 50.1 GOPS VXU

- Lies between commercial hardware and ASICs
- 2.3x in FFT & 2x in BP
- Still suffers from the Von Neumann bottleneck

| Kernel    | Platform | Dec. Length | Norm. Thrpt. |
|-----------|----------|-------------|--------------|
| BP Decode | GPU      | 512         | 0.25         |
| BP Decode | GPU      | 1024        | 0.21         |
| BP Decode | ASIC     | 1024        | 15.23        |
| BP Decode | Ours     | 512         | 0.54         |
| BP Decode | Ours     | 1024        | 0.53         |

| Kernel | Platform  | FFT Length | Clock Cycles |
|--------|-----------|------------|--------------|
| FFT    | DSP       | 128        | 588          |
| FFT    | DSP       | 512        | 2559         |
| FFT    | DSP       | 2048       | 11922        |
| FFT    | HW Accel. | 128        | 211          |
| FFT    | HW Accel. | 512        | 845          |
| FFT    | HW Accel. | 2048       | 3875         |
| FFT    | Ours      | 128        | 251          |
| FFT    | Ours      | 512        | 1122         |
| FFT    | Ours      | 2048       | 5073         |


### Ablation Study

- 12T vs. 3C4T-2L2S
  - 6.5% power increase; 1.3x throughput
  - Under-utilization in single-level arch.
- 6.4% and 9.5% gain under lazy-deletion

Config:

- Large (L) Tile (T): w/ 64-lane VXU, 32 VRFs
- Small (S) Tile: w/ 8-lane VXU, 64 VRFs

| Baseline + extra features | Power (W), 12T | Power (W), 3C4T | Throughput (Mbps), Single-Level | Throughput (Mbps), Multi-Level |
|---------------------------|----------------|-----------------|---------------------------------|--------------------------------|
| 12T / 3C4T arch.          |                | 3.45            | 8.5                             | 21.2                           |
| + Multi-Threading         | 3.24           |                 | 64.1                            | 84.7                           |
| + Lazy-Deletion           |                |                 | 68.5                            | 93.6                           |
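The headline figures can be rechecked from the table values. The short calculation below is only a sanity check. It assumes the 1.3x figure compares the multi-threading row and that the lazy-deletion gains are expressed relative to the throughput after lazy-deletion, since that convention reproduces 6.4% and 9.5%; measured against the multi-threading row instead, the same data gives roughly 6.9% and 10.5%.

```python
power_w = {"12T": 3.24, "3C4T": 3.45}
thrpt_mbps = {
    "multi_threading": {"single": 64.1, "multi": 84.7},
    "lazy_deletion": {"single": 68.5, "multi": 93.6},
}

# Power increase of 3C4T over 12T
power_increase = (power_w["3C4T"] - power_w["12T"]) / power_w["12T"]

# Multi-level vs. single-level throughput with multi-threading enabled
speedup = thrpt_mbps["multi_threading"]["multi"] / thrpt_mbps["multi_threading"]["single"]


def gain(level):
    """Return (gain relative to the new value, gain relative to the old value)."""
    before = thrpt_mbps["multi_threading"][level]
    after = thrpt_mbps["lazy_deletion"][level]
    return (after - before) / after, (after - before) / before


for level in ("single", "multi"):
    rel_new, rel_old = gain(level)
    print(f"{level}: {rel_new:.1%} of new value, {rel_old:.1%} over old value")
```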


### Link throughput experiment on prototype

- 5C9T
- 288 Mbps





| Module               | Configuration   |
|----------------------|-----------------|
| Channel Coding       | Polar Codes     |
| Rate-Matching        | RV0             |
| Scrambling           | Gold Sequence   |
| Modulation           | QPSK            |
| OFDM                 | 128 subcarriers |
| Channel Estimation   | Least Squares   |
| Channel Equalization | Zero Forcing    |
| Channel Decoding     | Min-Sum BP      |
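To make three of the listed modules concrete, the sketch below runs QPSK modulation, least-squares channel estimation, and zero-forcing equalization per subcarrier in the frequency domain. It is a simplified, noise-free illustration and not the prototype's code: the channel model, pilot choice, and Gray mapping are assumptions, and coding, rate-matching, scrambling, and the OFDM IFFT/FFT are omitted.

```python
import cmath
import random

N_SC = 128  # subcarriers, as in the table


def qpsk_mod(bits):
    """Map bit pairs to unit-energy QPSK symbols (Gray mapping assumed)."""
    s = 2 ** -0.5
    return [complex(s * (1 - 2 * b0), s * (1 - 2 * b1))
            for b0, b1 in zip(bits[0::2], bits[1::2])]


def qpsk_demod(symbols):
    """Hard decision on the sign of the real and imaginary parts."""
    bits = []
    for x in symbols:
        bits += [int(x.real < 0), int(x.imag < 0)]
    return bits


rng = random.Random(0)
# Hypothetical channel: one random complex gain per subcarrier
h = [cmath.rect(rng.uniform(0.5, 1.5), rng.uniform(-cmath.pi, cmath.pi))
     for _ in range(N_SC)]

pilot = qpsk_mod([0, 0] * N_SC)  # known pilot symbol on every subcarrier
bits = [rng.randint(0, 1) for _ in range(2 * N_SC)]
data = qpsk_mod(bits)

rx_pilot = [hk * p for hk, p in zip(h, pilot)]  # noise-free for brevity
rx_data = [hk * d for hk, d in zip(h, data)]

h_ls = [y / p for y, p in zip(rx_pilot, pilot)]  # least-squares estimate
eq = [y / hk for y, hk in zip(rx_data, h_ls)]    # zero-forcing equalization
recovered = qpsk_demod(eq)
```

Both estimation and equalization reduce to independent per-subcarrier divisions, which is the kind of data-parallel work that maps naturally onto vector lanes.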

### Conclusion

- We propose a pack-and-ship approach within a cache-free NUMA system.
  - Instructions and data are organized in bundles and delivered by schedulers to local scratchpad memory in order to reduce data movement costs.
- We also develop a hierarchical dataflow scheme along with two strategies, namely multi-threading and lazy-deletion, to exploit and allocate the hardware resources more efficiently.
- Our HW/SW co-design surpasses existing architectures, owing to strong single-tile performance, flexible scalability, and coarse-grained parallelism.



<u>Advanced Communication and Computing Electronics Lab</u>


### Q&A

Thank you for your attention!

A Hierarchical Dataflow-Driven Heterogeneous Architecture for Wireless Baseband Processing

> Presenter: Limin Jiang (jianglimin@shu.edu.cn), Shanghai University