#### Compiler-Assisted Refresh Minimization for Volatile STT-RAM Cache

Qingan Li, Jianhua Li, Liang Shi, Chun Jason Xue, Yiran Chen, Yanxiang He

> City University of Hong Kong University of Pittsburg

# Outline

- Background
- Observation
- Solution
- Experiments

## STT-RAM for cache

- STT-RAM compared to traditional SRAM caches:
  - Leakage power consumption



SRAM caches vs STT-RAM caches

#### STT-RAM for cache

- STT-RAM has been proposed as a new candidate for building cache.
- Lots of work has been done to mitigate the costly write operations on STT-RAM
  - smart write buffer, early write termination, hybrid cache, etc.
  - relaxing the non-volatile property

# Volatile STT-RAM for cache

- Volatile STT-RAM
  - Pros: trade non-volatility for write performance
  - Cons: need refresh schemes to ensure correctness

|                    | Design 1 | Design 2 | Design 3 |
|--------------------|----------|----------|----------|
| Cell size (F2)     | 23       | 22       | 27.3     |
| MTJ sw time (ns)   | 10       | 5        | 1.5      |
| Retention Time     | 4.27yr   | 3.24s    | 26.5µs   |
| Write Latency (ns) | 10.378   | 5.37     | 1.5      |
| Write Dyn. Eng(nJ) | 0.958    | 0.466    | 0.187    |

As the retention time decreases, write speed and write dynamic energy are improved.

# Refresh schemes for volatile STT-RAM cache

- The reduced retention time (T) of STT-RAM cache needs refresh schemes to avoid data loss.
- Refresh schemes:

#### – Dram-style refresh:

• refresh all blocks within every retention time **T**.

#### – Full-refresh<sup>1</sup>:

- attach a counter to each block to track its lifespan asynchronously.
- Only refresh the blocks not written or refreshed within **T**.

#### – Dirty-refresh<sup>2</sup>:

• Only refresh dirty blocks not written or refreshed within **T.** 

# Refresh schemes for volatile STT-RAM cache

- Each refresh operation for STT-RAM cache involves:
  - read data from cache and write them into buffer
  - read data from buffer and write them back into cache

refresh consumes lots of energy

• Refresh scheme is by-product of volatile STT-RAM design, and consumes lots of energy.

This paper proposes a compilation approach to minimize the required refresh operations to improve the energy efficiency.

# Outline

- Background
- Observation
- Solution
- Experiments

#### Observation 1: N-refresh scheme

- Some data blocks are left in idle status for a long time before next use. (An extreme case: dead blocks)
- Previous refresh schemes refresh these blocks many times for little benefits.
- We propose a *N***-refresh** scheme:
  - refresh blocks sequently for at most (N-1) times.
  - can reduce the number of refresh operations
  - only need add logN bits to the counter attached to each block (extension over Full-refresh scheme)

#### Observation 2: passive & active refresh

- To recharge the lifespan of a cache block
  - *passive refresh*: depends on memory trace and cache misses
    - loading data from the main memory to this cache block
    - an instruction writes data to this block (the whole block will be **passively** refreshed)
  - *active refresh*: depends on refresh schemes (full-refresh, dirty-refresh, N-refresh)
- So, within the retention time **T**, if there is a passive refresh or an active refresh, there is no data loss.

## When do we need an *active refresh* ?

• Given a data write trace, *Dtrace*, with timing information

| data write trace | а | С | b  | d  | а  | С  | b  | d  | а | allocation : {a,b}, {c,d}   |
|------------------|---|---|----|----|----|----|----|----|---|-----------------------------|
| time stamp(us)   | 6 | 9 | 12 | 15 | 18 | 21 | 24 | 27 |   | retention time <b>T</b> : 5 |

• Transform the data write trace into a block write trace: is the interval between two writes less than **T**?



#### Reduce active refresh by data layout



We found that, by re-arrange the data layout, we can change the behavior of *passive refresh* such that, the required number of *active refresh* is minimized.

# Outline

- Background
- Observation
- Solution
- Experiments

# Problem

- Given a data write trace, *Dtrace*, with timing information,
  - allocate data objects into memory blocks
  - transform the data write trace into a block write trace,
    Btrace, by mapping the data to the owner memory blocks
  - split Btrace into a set of sub-traces, where each sub-trace recording writes to the same block
  - compute the required number of *active refresh* for each sub-trace

$$nRfr_{b^{i}} = \frac{\overset{N}{\overset{0}{\beta}}}{\overset{0}{\beta}} \left(TS_{bw_{j+1}^{i}} - TS_{bw_{j}^{i}}\right) / T^{\acute{U}}_{\acute{U}} + \overset{0}{\overset{0}{\beta}} \left(TS_{bw_{1}^{i}} - TS_{begin}\right) / T^{\acute{U}}_{\acute{U}} + \overset{0}{\overset{0}{\beta}} \left(TS_{end} - TS_{bw_{N}^{i}}\right) / T^{\acute{U}}_{\acute{U}}$$

• Solution: find an allocation to minimize:  $\sum nRfr_{b^i}$ 

# Problem (more details)

- How to **compute** the required number of *active* refresh for each sub-trace  $nRfr = \sum_{nRfr} |(TS = -TS = )/T|$ 
  - *T*: the retention time
  - N: the number of writes to b
  - $bw_{i}^{i}$  and  $bw_{i+1}^{i}$ : a pair of consecutive writes to b
  - *TS<sub>bw</sub>*: the time stamp of a block write, *bw*
  - *Ts*<sub>begin</sub> & *TS*<sub>end</sub>: the starting/ending time of the program
- the objective:

$$\min\sum_{i} nRfr_{b^{i}}$$

This problem is hard to solve. Need simplification!

$$\begin{split} & Rfr_{b^{i}} = \sum_{j=1} \left[ \left( TS_{bw_{j+1}^{j}} - TS_{bw_{j}^{j}} \right) / T \right] \\ & + \left[ \left( TS_{bw_{1}^{i}} - TS_{begin} \right) / T \right] \\ & + \left[ \left( TS_{end} - TS_{bw_{N}^{j}} \right) / T \right] \end{split}$$

# Simplified problem

- Given a data write trace, *Dtrace*, with timing information,
  - extract a set of sub-traces from *Dtrace*, where each subtrace recording writes to a pair of data objects

| (a,b) sub-trace | а | b  | а  | b  | а  |
|-----------------|---|----|----|----|----|
| time stamp(us)  | 6 | 12 | 18 | 24 | 30 |

| (a,c) sub-trace | а | С | а  | С  | а  |
|-----------------|---|---|----|----|----|
| time stamp(us)  | 6 | 9 | 18 | 21 | 30 |

| (a,d) sub-trace | а | d  | а  | d  | а  |
|-----------------|---|----|----|----|----|
| time stamp(us)  | 6 | 15 | 18 | 27 | 30 |

| (b,c) sub-trace | С | b  | С  | b  |
|-----------------|---|----|----|----|
| time stamp(us)  | 9 | 12 | 21 | 24 |

| (b,d) sub-trace | b  | d  | b  | d  |
|-----------------|----|----|----|----|
| time stamp(us)  | 12 | 15 | 24 | 27 |

| (c,d) sub-trace | С | d  | С  | d  |
|-----------------|---|----|----|----|
| time stamp(us)  | 9 | 15 | 21 | 27 |

# Simplified problem

- For each sub-trace corresponding to objects x and y, if assigning x and y into the same block, the cost is approximated as:  $\sum_{x=1}^{M} |(-x) - y| = 0$ 

$$\cos t_{x,y} = \sum_{k=1}^{M} \left[ \left( TS_{dw_{k+1}} - TS_{dw_{k}} \right) / T \right]$$

- *M*: the number of data writes in the sub-trace
- $dw_k$  and  $dw_{k+1}$ : a pair of consecutive writes to x or y otherwise, the cost is zero.
- The goal is to find an allocation (mapping data to memory blocks) to minimize:

$$\sum_{i \in Data} \sum_{\substack{j \in Data \\ j \neq i}} \cos t_{i,j} \cdot x_{i,j}$$

 $x_{i,j}$  is an binary variable to indicate whether *i* and *j* are allocated into the same block

# Heuristic algorithm

- This simplified problem can be easily modeled as a quadratic assignment problem (QAP), which is NP-hard in general.
- A heuristic algorithm to solve this simplified problem
  - Encode the problem using a complete graph G: encode each data object as the vertex, and cost<sub>i,j</sub> as the edge weight
  - 2. Allocate a pair of objects into the same block, if the related edge weight is minimum among all edges of *G*
  - 3. Update *G* by merging this pair of objects
  - 4. If the remaining graph is not empty, go to 2

#### Heuristic algorithm

- $cost_{i,i}$  is encoded in a complete graph, as the edge weight
- assume each block can hold only three objects
- merge function: min {edge1, edge2, ...}



# Outline

- Background
- Observation
- Solution
- Experiments

#### **Experimental setup**



#### Normalized to FR: Energy



#### Normalized to FR: Performance



# Conclusion

- Volatile STT-RAM cache has been proposed for better performance. But it requires refresh schemes to avoid data loss. And, refresh schemes bring about additional energy overhead.
- This paper proposes a compilation approach to minimize the refresh overhead by re-arranging data layout.
- Experiments show that, this approach can reduce refresh operations (73.3%), energy consumption (27.6%), and cycles (0.6%).

# Thank you!

#### Questions?

#### Sensitiveness to different cache block size



For *adpcm* and *bcnt*, the proposed method can consistently reduce the *active refresh*, and increase cache hit rates, for varied cache block size.

#### Sensitiveness to different cache size



For *adpcm* and *bcnt*, the proposed method can consistently reduce the *active refresh*, and increase cache hit rates, for varied cache size.