#### **Cycle Accurate Power Estimation Tool**

Rajat Chaudhry, Daniel Stasiak, Stephen Posluszny, Sang Dhong

> Cell Processor Development, IBM Corporation, Austin, Texas

# Outline

- Motivation and Background
- Tool Methodology
- Usage on the Cell Processor
- Hardware Correlation
- Summary

# **Motivation and Background**

## Background

Chip power density has increased due to

- Increased chip frequency
- Exponential increase in leakage due to scaling

Traditionally, low power design was limited to battery powered devices, where it served to increase battery life

Higher power density is driving up

- Packaging costs
- Cooling system costs and dimensions

# Background

- Due to these high costs, even wall socket powered devices need to implement low power design strategies
- Especially chips used in high volume and price sensitive products like gaming consoles
- The Cell processor had stringent power limits with high frequency targets
- This required Power Efficient Architecture, Microarchitecture and circuits

#### **Motivation**

To check the power efficiency of the design, there was a need for a tool

- To provide accurate power estimates
- To provide cycle-by-cycle power estimates, given the heavy use of fine-grain clock gating

Need highly accurate transistor level power simulation

- High percentage of custom macros with unique circuit topologies including arrays and dynamic circuits

Need high throughput RTL simulation

- Thousands of cycles need to be simulated to estimate power for different workloads

#### Scope

The scope of the tool was to

- Verify the RTL and circuit implementation of the design from a power perspective
- Estimate Active workload dependent power

Did not estimate leakage power

- There was no logic implementation to mitigate leakage power
- Leakage was independently calculated based on transistor widths and other process and technology parameters

Did not estimate global clock grid power

- Not workload dependent

#### **Tool Methodology**

## **Switching Power**

Switching Power of a circuit in a given cycle is defined by the following equation

 $P = \frac{1}{2} C V^2 f$ 

Where:

- *C* is the total node capacitance switched
- *V* is the power supply voltage
- *f* is the clock frequency

*V* and *f* are fixed for the design.

The factors affecting the switched node capacitance (*C*) are input switching and clock gating in the circuit.
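As a rough illustration of the equation above, the sketch below evaluates the switching power for two values of switched capacitance at a fixed voltage and frequency. The function name and every numeric value are hypothetical and are not taken from the Cell design.

```python
def switching_power(c_switched_farads: float, vdd_volts: float, freq_hz: float) -> float:
    """Return P = 1/2 * C * V^2 * f in watts."""
    return 0.5 * c_switched_farads * vdd_volts ** 2 * freq_hz


# Hypothetical operating point (illustrative values only).
VDD_VOLTS = 1.0
FREQ_HZ = 4.0e9

# With V and f fixed, power scales linearly with the capacitance switched.
p_low = switching_power(1.0e-9, VDD_VOLTS, FREQ_HZ)
p_high = switching_power(2.0e-9, VDD_VOLTS, FREQ_HZ)

print(f"{p_low:.2f} W, {p_high:.2f} W")
```

Because *V* and *f* are constants here, doubling the switched capacitance doubles the power, which is why the methodology concentrates on the two factors that change *C* from cycle to cycle.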

# Methodology

The CAPET (Cycle Accurate Power Estimation Tool) methodology consists of the following steps

- Use circuit simulation to build macro power models based on Input Switching and Clock Gating
- Extract Cycle by Cycle Input Switching and Clock Gating information for each macro instance from RTL simulation
- Use the switching and clock gating information to calculate power for each macro instance to get total macro power
- Power due to signal interconnect capacitance can be estimated using
  - Signal Switching information
  - Interconnect Capacitance estimated using Steiner routes or 3D extraction
- Total Power = Macro Power + Net Switching Power
- Repeat this for every cycle
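The net switching power step above can be sketched as follows, assuming a per-net capacitance table (for example from Steiner routes or 3D extraction) and the set of nets that toggled in one cycle. The names `net_switching_power`, `net_cap_farads`, and `toggled_nets`, and all values, are illustrative assumptions rather than part of CAPET.

```python
from typing import Dict, Iterable


def net_switching_power(
    net_cap_farads: Dict[str, float],
    toggled_nets: Iterable[str],
    vdd_volts: float,
    freq_hz: float,
) -> float:
    """Return 1/2 * C_net * V^2 * f for the nets that switched in one cycle."""
    # Only nets that changed state in this cycle contribute capacitance.
    c_switched = sum(net_cap_farads[name] for name in toggled_nets)
    return 0.5 * c_switched * vdd_volts ** 2 * freq_hz


# Hypothetical capacitance table and one cycle of toggle information.
caps = {"net_a": 1.0e-12, "net_b": 2.0e-12, "net_c": 3.0e-12}
net_power_w = net_switching_power(caps, ["net_a", "net_c"], vdd_volts=1.0, freq_hz=1.0e9)

print(f"{net_power_w * 1e3:.3f} mW")
```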

# **Tool Flow**



#### **Macro Power Model**

A macro is defined as a floorplannable block, ranging from hundreds to thousands of gates

The macro power model is created using IBM's CPAM tool

Definition: the Input Switching Factor is the percentage of inputs that switch state between two consecutive clock cycles

**CPAM runs random vectors on the schematic using multiple Switching Factors under two conditions** 

- With all clock buffers turned on for fully clock active power
- With all clock buffers forced off to get fully clock gated power

The Power model assumes power is linear with input switching factor and clock activity

#### **Macro Power Model**



 $BlkPwr(SF, CLK) = P_{clk0}(SF) + \left(P_{clk100}(SF) - P_{clk0}(SF)\right) \cdot CLK$ 

SF is the input switching factor for the macro

 $P_{clk0}(SF)$  is the power at input switching factor SF when clock activity is 0%

 $P_{clk100}(SF)$  is the power at input switching factor SF when clock activity is 100%

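CLK is the percent of clock activity (applied as a fraction between 0 and 1 in the equation, so that 100% activity yields $P_{clk100}(SF)$)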
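A minimal sketch of how such a model could be evaluated is shown below. It assumes the characterization produces a few (switching factor, power) points for each clock condition and interpolates linearly between them, in line with the linearity assumption of the model. The table layout, the function names, and the numbers are hypothetical; the source does not describe the CPAM model format. SF and CLK are expressed as fractions between 0 and 1.

```python
from bisect import bisect_right
from typing import List, Tuple

Points = List[Tuple[float, float]]  # (switching factor, power in mW), sorted by SF


def interpolate(points: Points, sf: float) -> float:
    """Piecewise-linear interpolation; clamps outside the characterized range."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    if sf <= xs[0]:
        return ys[0]
    if sf >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, sf)  # index of the first point strictly above sf
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (sf - x0) / (x1 - x0)


def blk_pwr(p_clk0: Points, p_clk100: Points, sf: float, clk: float) -> float:
    """BlkPwr(SF, CLK) = P_clk0(SF) + (P_clk100(SF) - P_clk0(SF)) * CLK."""
    p0 = interpolate(p_clk0, sf)
    p100 = interpolate(p_clk100, sf)
    return p0 + (p100 - p0) * clk


# Hypothetical characterization data for one macro (mW).
P_CLK0_POINTS = [(0.0, 1.0), (0.5, 3.0), (1.0, 5.0)]
P_CLK100_POINTS = [(0.0, 11.0), (0.5, 13.0), (1.0, 15.0)]

macro_mw = blk_pwr(P_CLK0_POINTS, P_CLK100_POINTS, sf=0.25, clk=0.5)
print(f"{macro_mw:.1f} mW")
```

With CLK at 0 the result collapses to the fully clock gated curve, and with CLK at 1 to the fully clock active curve, matching the two characterization conditions.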

#### **RTL Simulation**

**RTL simulations are done using IBM's Mesa simulator** 

For each macro instance the state of each input is monitored at cycle boundaries to measure the input switching factor
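The input switching factor measurement can be illustrated with a short sketch that compares the sampled input states of one macro instance at two consecutive cycle boundaries. The function name and the list-of-bits representation are assumptions made for illustration; they do not reflect the Mesa simulator interface.

```python
from typing import Sequence


def input_switching_factor(prev_inputs: Sequence[int], curr_inputs: Sequence[int]) -> float:
    """Fraction of inputs whose state differs between two consecutive cycles."""
    if len(prev_inputs) != len(curr_inputs):
        raise ValueError("input vectors must have the same width")
    if not prev_inputs:
        return 0.0
    toggled = sum(1 for p, c in zip(prev_inputs, curr_inputs) if p != c)
    return toggled / len(prev_inputs)


# Hypothetical input samples for a 4-input macro over three cycle boundaries.
trace = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 0],
]
sf_per_cycle = [input_switching_factor(a, b) for a, b in zip(trace, trace[1:])]
print(sf_per_cycle)
```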

The switching of each global net is monitored to calculate interconnect switching power

## **Measuring Clock Activity**

Clock activity for custom macros is measured as follows

- All the clock buffers that are turned on in the macro are monitored
- The designers provide a table with relative power weights for each clock buffer
- The clock activity is determined by adding the weights of the clock buffers that are turned on

For synthesized macros

- Clock activity is measured by the percent of latch bits that are active in the given cycle
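Both clock activity measurements can be sketched in a few lines. The custom macro version sums the designer-provided weights of the buffers that are on; dividing by the total weight to obtain a fraction is an assumption of this sketch, since the source only says the weights are added. Buffer names, weights, and latch counts are hypothetical.

```python
from typing import Dict, Iterable


def custom_clock_activity(buffer_weights: Dict[str, float], active_buffers: Iterable[str]) -> float:
    """Weighted fraction of clock buffers that are on (custom macros)."""
    total = sum(buffer_weights.values())
    active = sum(buffer_weights[name] for name in active_buffers)
    # Normalizing by the total weight is an assumption of this sketch.
    return active / total


def synthesized_clock_activity(active_latch_bits: int, total_latch_bits: int) -> float:
    """Fraction of latch bits that are clocked in the cycle (synthesized macros)."""
    return active_latch_bits / total_latch_bits


# Hypothetical weight table for a custom macro with three clock buffers.
weights = {"lcb0": 1.0, "lcb1": 3.0, "lcb2": 4.0}
clk_custom = custom_clock_activity(weights, ["lcb1"])
clk_synth = synthesized_clock_activity(active_latch_bits=32, total_latch_bits=128)

print(clk_custom, clk_synth)
```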

#### **Total Design Power**

Using the clock activity and Input Switching Factor for each macro instance in a cycle, the total power in a given cycle *C* is calculated by

 $TotalPower(C) = \sum_{\text{macros}} BlkPwr(SF, CLK) + \frac{1}{2} C_{net}(C) V^2 f$ 

Where:

- *C<sub>net</sub>* is the total interconnect capacitance switched
- *V* is the power supply voltage
- *f* is the clock frequency
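Putting the pieces together, the sketch below sums the macro power of every instance for one cycle and adds the net switching term. Each macro is represented by a hypothetical callable that returns BlkPwr for a given switching factor and clock activity; the example model and all numbers are illustrative assumptions, not CAPET data.

```python
from typing import Callable, Iterable, Tuple

# (macro power model, input switching factor, clock activity) for one instance.
MacroSample = Tuple[Callable[[float, float], float], float, float]


def total_power(
    macro_samples: Iterable[MacroSample],
    c_net_farads: float,
    vdd_volts: float,
    freq_hz: float,
) -> float:
    """Sum of BlkPwr over all macro instances plus 1/2 * C_net * V^2 * f, in watts."""
    macro_power = sum(model(sf, clk) for model, sf, clk in macro_samples)
    net_power = 0.5 * c_net_farads * vdd_volts ** 2 * freq_hz
    return macro_power + net_power


def example_model(sf: float, clk: float) -> float:
    """Hypothetical linear macro model in watts, in the BlkPwr form."""
    p_clk0 = 0.1 + 0.2 * sf
    p_clk100 = 0.6 + 0.2 * sf
    return p_clk0 + (p_clk100 - p_clk0) * clk


# One cycle: an active instance and a fully clock gated, quiet instance.
cycle_samples = [(example_model, 0.5, 1.0), (example_model, 0.0, 0.0)]
power_w = total_power(cycle_samples, c_net_farads=1.0e-10, vdd_volts=1.0, freq_hz=4.0e9)

print(f"{power_w:.2f} W")
```

Repeating this calculation with the per-cycle switching and clock activity data from RTL simulation gives the cycle-by-cycle power trace.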

# **Usage on The Cell Processor**

#### **Cell Broadband Engine Processor**



## **RTL Workloads**

Each core or functional unit was required to run three sets of workloads

- Idle power
- Typical power
- Max power

For each workload, designers were provided with power, average switching activity and clock activity for their macros

Idle power was the most useful in catching bugs in the clock gating logic

#### **Power Grid and Thermal Analysis**

Power numbers from the highest power workloads for each core were used as input to the IR drop analysis tool

Workloads with maximum *dI/dt* characteristics were used as input for analyzing mid frequency AC noise on the power grid

The highest power workloads were used to simulate the thermal profile of the chip

# **Chip Level Estimates**

RTL Simulation is not fast enough to run a complete program

- The state of the partitions under the 3 different workloads, idle, typical and high, were estimated using architectural level simulations
- Based on the utilization rate a target power for a partition was calculated

The chip level power target was the sum of all partition powers

#### **Results: Typical Workload for a partition**



#### **Results: Different workloads on a partition**



#### **Results: Clock Activity based on workload**

**Clock Buffers Active** 



# **Hardware Correlation**

#### **Hardware Correlation**

Hardware results were correlated by running the hardware under the same conditions as simulation.

The high power workloads were converted to infinite loops and run on the hardware to measure the average power. The same workloads were run on CAPET.

- On-chip temperature was monitored using an on-chip thermal diode and maintained using a water cooler
- Hardware correlation on multiple parts was done by IBM and Sony.
- Adjustment for process parameters was done.
- Results from all sites show the power estimates are within 10% of hardware

# Summary

The CAPET methodology was successfully used to estimate the power for the first generation Cell processor

It helped designers in measuring the effects of clock gating and circuit tuning

Hardware results validate the accuracy of the tool