#### A Vertically Integrated and Interoperable Multi-Vendor Synthesis Flow for Predictable NoC Design in Nanoscale Technologies

#### <u>Alberto Ghiribaldi</u>

University of Ferrara

Mikkel Stensgaard, Teklatech A/S

Hervé Tatenguem Fankem, University of Ferrara

Tobias Bjerregaard, Teklatech A/S

Federico Angiolini, iNoCs SaRL

Davide Bertozzi, University of Ferrara







# **Networks in Embedded Systems**

#### **General-Purpose Systems-on-Chip**

- Starting point: basic computation tile & switching element
- System assembly: tile-based composition in a grid structure



#### **Application-Specific Systems-on-Chip**

- Starting point:
   Core-graph representation with annotated avg. communication
- System assembly: Custom-Tailored Topology Synthesis



- An ad-hoc design methodology is required for customizable network topologies. It must:
  - span different levels of abstraction
  - derive the most efficient NoC configuration for a given application domain
- For application-specific NoCs, the synthesis flow makes the difference.

### **Criticalities of current flows**

- No industrial partner has visibility of the whole design process.
  - The final design process ends up being a composition of loosely coupled steps
- If tool interoperability is not properly addressed in a multi-vendor flow, the design cycle may become considerably longer.
  - Lack of consistent representation of design data
  - Lack of consistent interpretation of technology information
- Nanoscale design issues (power grid integrity, interconnect delay, ...) can no longer be addressed as an afterthought. They must be handled from the ground up, by ranking designs upfront with respect to these issues.

# **Evolving Technology Requirements**

**Our objective:** *develop a comprehensive and interoperable multi-vendor synthesis flow for Application-Specific Network-on-Chip,* 

- from application requirements down to layout,
- deal with Network-on-Chip design on nanoscale technologies with *fewer design iterations*,
- **integrating** *prototype with mainstream tools.*

Our flow targets the following NoC synthesis challenges, in order to cope with the threats arising from the underlying silicon technology:

- front-end driven floorplanning
- fast and accurate system-level power grid models
- dynamic IR-drop minimization
- ad-hoc strategies for link inference



## **Communication Exchange Format**

#### Key enabler: Communication Exchange Format (CEF)

- Open design format based on XML.
- Specifies SoC-level architecture and communication infrastructure.
- Designed to ease tool interoperability.
- Allows iterative and incremental design.
- Allows expression of design intent and design implementation:
  1. system cores, including interface parameters;
  2. communication requirements across cores;
  3. clock and power domains;
  4. architectural parameters and implementation;
  5. floorplan of the design.



#### **CEF File Structure**

    <cef>
      <communication></communication>
      <system>
        <system_properties></system_properties>
        <domains></domains>
        <blocks></blocks>
        <links></links>
        <routes></routes>
      </system>
    </cef>
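As a rough illustration of how a tool might consume this skeleton, the sketch below parses it with Python's standard `xml.etree.ElementTree` and lists the sections it finds. It assumes the element names use underscores (for example `system_properties`). The `CEF_SKELETON` string and the `list_sections` helper are hypothetical and not part of any CEF tool.

```python
import xml.etree.ElementTree as ET

# Skeleton copied from the slide; the underscore in system_properties is an assumption.
CEF_SKELETON = """
<cef>
  <communication></communication>
  <system>
    <system_properties></system_properties>
    <domains></domains>
    <blocks></blocks>
    <links></links>
    <routes></routes>
  </system>
</cef>
"""

def list_sections(cef_text):
    """Return (top-level section names, section names inside <system>)."""
    root = ET.fromstring(cef_text.strip())
    top = [child.tag for child in root]
    system = root.find("system")
    inner = [child.tag for child in system] if system is not None else []
    return top, inner

top_sections, system_sections = list_sections(CEF_SKELETON)
print(top_sections)
print(system_sections)
```

Keeping the communication requirements and the system description in separate top-level sections is what lets each tool in the flow read or update only the part it is responsible for.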

### **Tool Environment**

- **Teklatech**'s Floorplanner – Floorplan definition
- **iNoCs**'s Topology Synthesizer – Design space exploration
- **iNoCs**'s RTL generator – NoC RTL generation
- Design Compiler – Synthesis and mapping
- IC Compiler – Layout generation
- (In-Design) PrimeRail – Power grid analysis
- Modelsim – Activity annotation and functional testing
- PrimeTime – Power and timing analysis

Thanks to CEF, the proposed flow results from the vertical integration of mainstream and prototype tools.

## **Flow Overview**



- System IP cores, knowledge of communication streams, and use cases (among other things)
  - Application specification from Intel Mobile Communications
  - Target Technology: 40 nm Intel Mobile Communications Technology Library

### Example application: Full HD-TV on smartphones

New mobile devices with big display sizes already feature full HD-TV resolution





#### Full-HD video playback:

- 1920x1080 pixel,
- 60 frames/s,
- true color.
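To give a feel for the traffic these parameters imply, the following sketch computes the raw (uncompressed) pixel bandwidth. It assumes that "true color" means 24 bits per pixel, which the slide does not state. The function name is illustrative only.

```python
# Raw pixel bandwidth for the playback parameters on the slide.
WIDTH, HEIGHT = 1920, 1080
FRAMES_PER_SECOND = 60
BYTES_PER_PIXEL = 3  # assumption: "true color" taken as 24 bits per pixel

def raw_video_bandwidth(width, height, fps, bytes_per_pixel):
    """Bytes per second needed to move every uncompressed pixel once."""
    return width * height * fps * bytes_per_pixel

bandwidth = raw_video_bandwidth(WIDTH, HEIGHT, FRAMES_PER_SECOND, BYTES_PER_PIXEL)
print(f"{bandwidth / 1e6:.1f} MB/s")  # 373.2 MB/s
```

This is only a lower bound for a single pass over the frame data. Actual NoC traffic depends on how many times each frame crosses the interconnect.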

#### System composition:

- 7 Masters
- 9 Slaves
- 10 separate clock domains

#### **CEF Communication - Example 1**

```
<communication>
    <use_case>
        <name>FullHDVideoPlayback</name>
        <id>167772161</id>
        <domain_operating_points>
            <domain>
                <domain_id>134217729</domain_id>
                <operating_point_id>150994945</operating_point_id>
            </domain>
        </domain_operating_points>
        <weight>0.166667</weight>
    </use_case>
</communication>
```

#### **CEF Communication - Example 2**

    <flow>
      <id>201326593</id>
      <flow_type>RD</flow_type>
      <max_outstanding_transactions>1</max_outstanding_transactions>
      <sustained_bandwidth>6.83594</sustained_bandwidth>
      <peak_bandwidth>6.83594</peak_bandwidth>
      <zero_load_latency_bound>-1</zero_load_latency_bound>
      <latency_bound>-1</latency_bound>
      <priority_class>0</priority_class>
      <gid>4294967295</gid>
      <burstiness>
        <burst>
          <length>64</length>
          <percent>100</percent>
        </burst>
      </burstiness>
    </flow>
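A minimal sketch of how such a flow record could be loaded into a typed structure is shown below. The `Flow` dataclass and `parse_flow` helper are hypothetical, only a subset of the fields is read, and the units of the bandwidth and latency fields are not given on the slide, so none are assumed here.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Subset of the flow shown on the slide.
FLOW_XML = """
<flow>
  <id>201326593</id>
  <flow_type>RD</flow_type>
  <sustained_bandwidth>6.83594</sustained_bandwidth>
  <peak_bandwidth>6.83594</peak_bandwidth>
  <latency_bound>-1</latency_bound>
  <burstiness>
    <burst><length>64</length><percent>100</percent></burst>
  </burstiness>
</flow>
"""

@dataclass
class Flow:
    flow_id: int
    flow_type: str
    sustained_bandwidth: float
    peak_bandwidth: float
    latency_bound: float
    bursts: list  # (length, percent) pairs

def parse_flow(text):
    """Build a Flow from one <flow> element (hypothetical helper)."""
    node = ET.fromstring(text.strip())
    bursts = [(int(b.findtext("length")), float(b.findtext("percent")))
              for b in node.iter("burst")]
    return Flow(
        flow_id=int(node.findtext("id")),
        flow_type=node.findtext("flow_type"),
        sustained_bandwidth=float(node.findtext("sustained_bandwidth")),
        peak_bandwidth=float(node.findtext("peak_bandwidth")),
        latency_bound=float(node.findtext("latency_bound")),
        bursts=bursts,
    )

flow = parse_flow(FLOW_XML)
print(flow.flow_type, flow.sustained_bandwidth, flow.bursts)
```

In this example the sustained and peak bandwidths are equal and a single burst length covers 100 percent of the traffic, which describes a steady stream rather than a bursty one.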

## **Flow Overview**



- The communication streams are passed to the floorplanning tool (among other things).
- Teklatech's floorplanning tool has two objective functions:
  - placing the system blocks that communicate the most close to each other
  - minimizing IR drop (following a fast early-phase IR drop analysis)
- Floorplans can be created to take into account both communication cost and IR drop, only one of them, or neither.
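The slides do not give the floorplanner's actual cost functions. As a toy model only, the sketch below assumes that communication cost is the sum of bandwidth times Manhattan distance between block centers, and that the two objectives are combined as a weighted sum. All names, positions and bandwidths are hypothetical.

```python
# Toy model of the two floorplanning objectives; not Teklatech's actual cost function.

def manhattan(a, b):
    """Manhattan distance between two (x, y) block centers."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def communication_cost(positions, flows):
    """Sum of bandwidth x distance over all (source, destination, bandwidth) flows."""
    return sum(bw * manhattan(positions[src], positions[dst])
               for src, dst, bw in flows)

def floorplan_score(positions, flows, ir_drop, w_comm=1.0, w_ir=1.0):
    """Weighted sum of both objectives; a weight of 0 switches an objective off."""
    return w_comm * communication_cost(positions, flows) + w_ir * ir_drop

# Hypothetical blocks, positions and bandwidths, for illustration only.
flows = [("CPU", "DDR", 10.0), ("Video", "DDR", 2.0)]
close = {"CPU": (0, 0), "DDR": (1, 0), "Video": (4, 3)}
far = {"CPU": (0, 0), "DDR": (4, 3), "Video": (1, 0)}

print(communication_cost(close, flows))  # 22.0
print(communication_cost(far, flows))    # 82.0
```

Setting `w_comm` or `w_ir` to zero mirrors the option of optimizing for only one of the two objectives. Placing the heavily communicating pair next to each other lowers the cost in this model.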

## **Floorplanning the target design**

#### **Best Communication**



#### **Best IR Drop**




#### **Worst Communication**



Whether high communication traffic (thick lines) travels over short (left) or long (right) links is likely to heavily affect the power required for data transmission.

#### Worst IR Drop



Best IR drop solutions spread out the hot spot across a large part of the floorplan (left), instead of concentrating it in a specific region (right).

## **Floorplanning the target design**

- We selected four representative points of the design space, to later prove:
  - Correlation of communication cost with dynamic power models and with actual dynamic power.
  - Accuracy of early phase power grid analysis



### **CEF Block Example 1 (Core)**

    <block>
      <name>CPU</name>
      .....
      <size>
        <corners>
          <corner> <x>0</x> <y>0</y> </corner>
          <corner> <x>3600</x> <y>1273</y> </corner>
        </corners>
        <soft> true </soft>
        <area> 4582800 </area>
      </size>
      <bottom_left_corner_position>
        <x>1192</x>
        <y>3975</y>
      </bottom_left_corner_position>
      <orientation>
        <rotatable>true</rotatable>
        <mirrorable>true</mirrorable>
        <rotation>0</rotation>
        <mirroring_x>false</mirroring_x>
        <mirroring_y>false</mirroring_y>
      </orientation>
    </block>
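The geometry fields of this block can be cross-checked with a few lines of code. The sketch below assumes the two `<corner>` entries are opposite corners of a rectangle, which is consistent with the declared area (3600 × 1273 = 4582800). The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <size> element of the CPU block from the slide.
SIZE_XML = """
<size>
  <corners>
    <corner><x>0</x><y>0</y></corner>
    <corner><x>3600</x><y>1273</y></corner>
  </corners>
  <soft>true</soft>
  <area>4582800</area>
</size>
"""

def bounding_box_area(size_node):
    """Area of the rectangle spanned by the listed corners (assumed opposite corners)."""
    xs = [int(c.findtext("x")) for c in size_node.iter("corner")]
    ys = [int(c.findtext("y")) for c in size_node.iter("corner")]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))

size = ET.fromstring(SIZE_XML.strip())
declared = int(size.findtext("area").strip())
computed = bounding_box_area(size)
print(computed, declared)  # 4582800 4582800
```

A check of this kind is one way a receiving tool could detect inconsistent design data when a CEF file passes between vendors.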

### **Flow Overview**



- The IP core placement and the communication streams are passed to the topology synthesis tool (among other things).
  - The professional iNoCs topology synthesis toolflow was used for this purpose.

# **Floorplan-Aware Topology Synthesis**

A large number of topologies was synthesized for each floorplan provided by Teklatech

[Figure: latency (ns) versus power (mW) for the synthesized topologies, one series per floorplan run (run1 to run4, labeled by IR-drop and communication quality), with communication costs of 15.83, 18.87, 43.91 and 44.52.]

- Every dot is an average across use cases.
- Red and Brown dots (lowest communication cost topologies) show lower latency and power with respect to **Blue** and **Dark** ones (topologies with higher communication cost).
- **The correlation between communication-aware floorplanning and topology dynamic power is proved!**
- Interoperability between Teklatech and iNoCs is proved!

#### **CEF Block Example - Switch**

    <block>
      <name>switch req0 0</name>
      <id>16777902</id>
      <block_type>300</block_type>
      <ports> ..... </ports>
      <size> ..... </size>
      <bottom_left_corner_position> .. </bottom_left_corner_position>
      <orientation> ..... </orientation>
      <architecture>
        <input_buffers>
          <depth>2</depth>
          <depth>2</depth>
        </input_buffers>
        <output_buffers>
          <depth>6</depth>
          <depth>6</depth>
          <depth>6</depth>
        </output_buffers>
        <virtual_channels>1</virtual_channels>
        <arbitration_policy>fixedpriority0high</arbitration_policy>
      </architecture>
    </block>
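As a small sketch of how the architectural parameters of this switch could be read back, the code below extracts the buffer depths from the `<architecture>` element. It assumes underscore-separated element names and that each `<depth>` entry describes one port, neither of which is stated on the slide. The helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <architecture> element of the switch block from the slide.
ARCH_XML = """
<architecture>
  <input_buffers><depth>2</depth><depth>2</depth></input_buffers>
  <output_buffers><depth>6</depth><depth>6</depth><depth>6</depth></output_buffers>
  <virtual_channels>1</virtual_channels>
  <arbitration_policy>fixedpriority0high</arbitration_policy>
</architecture>
"""

def buffer_depths(arch, section):
    """List of buffer depths under the given section, one per <depth> entry."""
    return [int(d.text) for d in arch.find(section).findall("depth")]

arch = ET.fromstring(ARCH_XML.strip())
inputs = buffer_depths(arch, "input_buffers")
outputs = buffer_depths(arch, "output_buffers")
total_slots = sum(inputs) + sum(outputs)
print(len(inputs), "inputs,", len(outputs), "outputs,", total_slots, "buffer slots")
```

Under the one-entry-per-port reading, this describes a 2-input, 3-output switch with 22 buffer slots in total.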

### **Flow Overview**



- The CEF file is converted to a floorplan script for a mainstream backend tool.
  - Currently supported: Synopsys ICC and Cadence SoC Encounter
- The flow also provides an RTL description of the NoC topology and ad-hoc scripts for design import into Synopsys DC.

#### **Concurrent Hierarchical Layout Flow**

#### **Toplevel flow**

Complete Design Import

#### Floorplan Definition from CEF file

- Chip Area
- Blocks' Hard Bounds

#### Power Ground Network Definition

• From Teklatech's specifications

Commit: from Plan Groups to Soft Macros

Place FP Pins

Save Hierarchy

**Toplevel Placement & Routing**

Uncommit: from Soft Macros to Plan Groups

Timing- and Power-driven Global Refinements

Reflects mature industrial practice (e.g., Toshiba)

#### Block Implementation Methodology

- Synthesis is performed block by block
- Boundary constraints given:
  - Timing margins: 25% of fastest clock in design
  - Output pin capacitance: 100x the input capacitance of the biggest inverter
  - Transition time at input pins: 0.4 ns
- Overconstraining IO blocks is the most power-efficient approach that we found.
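The timing-margin constraint can be worked out from the clock domains listed in the timing-convergence table of this deck. The sketch below assumes that "25% of fastest clock" means 25% of the period of the fastest clock. The function name is illustrative.

```python
# Clock frequencies (MHz) as listed in the timing-convergence table of this deck.
CLOCKS_MHZ = {
    "clk_Audio": 100, "clk_CPU": 500, "clk_DDR": 250, "clk_DMA": 200,
    "clk_DSP": 300, "clk_Radio": 150, "clk_SD_USB_WiFi": 200,
    "clk_SPI": 140, "clk_SRAM": 500, "clk_Video": 300,
}

def timing_margin_ns(clocks_mhz, fraction=0.25):
    """Margin as a fraction of the fastest clock's period (assumed interpretation)."""
    fastest_mhz = max(clocks_mhz.values())
    period_ns = 1000.0 / fastest_mhz  # 1 / MHz expressed in ns
    return fraction * period_ns

margin = timing_margin_ns(CLOCKS_MHZ)
print(margin)  # 0.5
```

With a fastest clock of 500 MHz (2 ns period), this reading gives a 0.5 ns margin at block boundaries.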

#### **Experimental Results**

- Correlation of Dynamic Power
  - Correlation of absolute power numbers between Topology Synthesis Tool and Post-Layout analysis
- Correlation of IR Drop Maps
  - Between early phase analysis in front-end floorplanning and Post-layout analysis
- Fast Timing Convergence

### **Correlation of absolute power values**



The logic synthesis tool infers different types of flip-flops from those assumed by the power models in topology synthesis. These flip-flops consume more static power but less energy when active.

# Correlation of IR Drop Maps

Same Voltage Drop trend, and localized hotspots are successfully predicted

(On the right: Worst Comm./ Worst IR drop topology)

Correlation proved for both peak locations and maps

#### IR Drop Map from Teklatech's floorplanner



#### **IR Drop Map from Synopsys ICC Rail Analysis**



#### Timing convergence (min-max analysis)

| Clock Domain    | Clock Frequency | Slack Pre-Opt | Slack Post-Opt |
|-----------------|-----------------|---------------|----------------|
| clk_Audio       | 100 MHz         | 5.75 ns       | 3.45 ns        |
| clk_CPU         | 500 MHz         | 0.2 ns        | 0.09 ns        |
| clk_DDR         | 250 MHz         | 0.38 ns       | 0.17 ns        |
| clk_DMA         | 200 MHz         | 0.65 ns       | 0.39 ns        |
| clk_DSP         | 300 MHz         | 0.34 ns       | 0.19 ns        |
| clk_Radio       | 150 MHz         | 1.12 ns       | 0.82 ns        |
| clk_SD_USB_WiFi | 200 MHz         | 0.53 ns       | 0.23 ns        |
| clk_SPI         | 140 MHz         | 6.33 ns       | 2.27 ns        |
| clk_SRAM        | 500 MHz         | -0.19 ns      | 0.14 ns        |
| clk_Video       | 300 MHz         | 0.38 ns       | 0.2 ns         |

- Cells in those clock domains that have a big slack are relaxed (from a driving strength viewpoint) to save power.
- Small timing violations in the fastest clock domains are easily fixed.
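The two observations above can be read directly off the table. The following sketch, with hypothetical helper names, lists the domains with negative slack before and after optimization and ranks the domains by how much slack was traded away.

```python
# domain: (frequency in MHz, slack before optimization in ns, slack after in ns)
TIMING = {
    "clk_Audio": (100, 5.75, 3.45),
    "clk_CPU": (500, 0.20, 0.09),
    "clk_DDR": (250, 0.38, 0.17),
    "clk_DMA": (200, 0.65, 0.39),
    "clk_DSP": (300, 0.34, 0.19),
    "clk_Radio": (150, 1.12, 0.82),
    "clk_SD_USB_WiFi": (200, 0.53, 0.23),
    "clk_SPI": (140, 6.33, 2.27),
    "clk_SRAM": (500, -0.19, 0.14),
    "clk_Video": (300, 0.38, 0.20),
}

def violations(timing, column):
    """Domains whose slack is negative; column 1 = pre-opt, 2 = post-opt."""
    return [name for name, row in timing.items() if row[column] < 0]

def slack_given_up(timing):
    """Slack traded away during optimization, largest first."""
    deltas = {name: round(pre - post, 2) for name, (_, pre, post) in timing.items()}
    return sorted(deltas.items(), key=lambda item: item[1], reverse=True)

print(violations(TIMING, 1))  # ['clk_SRAM']
print(violations(TIMING, 2))  # []
print([name for name, _ in slack_given_up(TIMING)[:2]])  # ['clk_SPI', 'clk_Audio']
```

The only pre-optimization violation is in one of the two 500 MHz domains, and the largest slack reductions occur in the two slowest domains, which matches both bullets.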

The hierarchical methodology delivers the fast convergence claimed by the proposed flow (first-time-right).

# **Final Design**



#### Hierarchical view of the layout of the NoC custom-tailored for the "Best Communication/Best IR Drop" floorplan

Total NoC Area: 946152 μm<sup>2</sup>

Total NoC Power: 158 mW for VideoPlayback (out of 1569 mW for the total chip)
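For scale, the figures above can be converted into more familiar terms with a few lines of arithmetic. The variable names are illustrative.

```python
# Figures quoted on the slide.
NOC_AREA_UM2 = 946152
NOC_POWER_MW = 158
CHIP_POWER_MW = 1569

noc_area_mm2 = NOC_AREA_UM2 / 1e6          # 1 mm^2 = 1e6 um^2
noc_power_share = NOC_POWER_MW / CHIP_POWER_MW

print(f"NoC area: {noc_area_mm2:.3f} mm^2")                # NoC area: 0.946 mm^2
print(f"NoC share of chip power: {noc_power_share:.1%}")   # NoC share of chip power: 10.1%
```

The NoC therefore occupies just under 1 mm² and accounts for roughly one tenth of the chip power in the VideoPlayback use case.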

### Conclusions

- We deliver a vertically integrated, interoperable and multi-vendor synthesis flow for NoCs.
  - Ranges from application specification to layout generation
  - Integrates **prototype** with **mainstream** tools
  - Addresses selected challenges of nanoscale design
- Validated by proving:
  - Correlation of design choices across the design hierarchy
  - Accuracy of abstract models (IR drop, dynamic power) used in early phase analysis and synthesis
  - Fast convergence (first-time-right)

#### **Backup Slides**

#### **Communication Exchange Format**



### **Available CEF Tools**

| Vendor       | Tool                         | Release                  |
|--------------|------------------------------|--------------------------|
| iNoCs        | "Aphrodite" GUI              | Commercial, Beta         |
| iNoCs        | "Cratus" NoC Synthesizer     | Commercial, Beta         |
| iNoCs        | "dNoCSim" NoC Simulator      | Commercial, Beta         |
| iNoCs        | NoC RTL Generator            | Internal                 |
| iNoCs        | "Hades" NoC Verifier         | Internal                 |
| Teklatech    | FloorDirector Floorplanner   | Commercial               |
| Teklatech    | cef2fp Floorplanning Utility | Free, Beta               |
| U P Valencia | gNoCSim NoC Simulator        | Internal                 |
| Simula       | gen-route NoC Router         | Internal, in publication |
| Simula       | dlbdrvd NoC Router           | Internal, in publication |