### WIT-Greedy: Hardware System Design of Weighted ITerative Greedy Decoder for Surface Code

Wang LIAO<sup>1</sup>, Yasunari Suzuki<sup>2, 3</sup>, Teruo Tanimoto<sup>3, 4</sup>, Yosuke Ueno<sup>1</sup>, Yuuki Tokunaga<sup>2</sup>



The University of Tokyo
 NTT Computer and Data Science Laboratories
 JST PRESTO
 Kyushu University

2023/01/17 @ ASPDAC



Quantum computing (QC) era is coming, but lifetime of qubits limits its application

- QC enables exponential speed-up for several important information processing
- Error due to short lifetime of qubit is the current bottleneck
  - Only 100 µs, but we need...



### QEC is necessary due to high error rate

- Error rate due to short lifetime of physical qubit is large
- Quantum error correction(QEC) reduces error rate
  - Physical qubits are encoded to logical qubits
  - Error rate of logical qubit can be reduced to any small value





Logical error rate is reduced w/ the same single bit error rate

### QEC is necessary due to high error rate

- Error rate due to short lifetime of physical qubit is large
- Quantum error correction(QEC) reduces error rate
  - Physical qubits are encoded to logical qubits
  - Error rate of logical qubit can be reduced to any small value





Logical error rate is reduced w/ the same single bit error rate

• Reason:

Decoding: error estimation in QEC code

 Fast correction is necessary due to short lifetime, but assumes ideal uniform error rates (unweighted decoding)



• Reason:

Decoding: error estimation in QEC code

- Fast correction is necessary due to short lifetime, but assumes ideal uniform error rates (unweighted decoding)
- Non-uniform error rates are the nature of real qubits



Ideal qubits: uniform error rates Real qubits: Non-uniform err. rates

• Reason:

Decoding: error estimation in QEC code

- Fast correction is necessary due to short lifetime, but assumes ideal uniform error rates (unweighted decoding)
- Non-uniform error rates are the nature of real qubits



Profiling distribution of non-uniform error rates of real qubits in IBMQ chips (Eagle)[1]

[1] IBMQ, https://quantum-computing.ibm.com/services/resources

Ideal qubits: uniform error rates Real qubits: Non-uniform err. rates

• Reason:

Decoding: error estimation in QEC code

- Fast correction is necessary due to short lifetime, but assumes ideal uniform error rates (unweighted decoding)
- Non-uniform error rates are the nature of real qubits



• Reason:

Decoding: error estimation in QEC code

- Fast correction is necessary due to short lifetime, but assumes ideal uniform error rates (unweighted decoding)
- Non-uniform error rates are the nature of real qubits



### • Reason:

- Fast correction is necessary due to short lifetime, but assumes ideal uniform error rates (unweighted decoding)
- Non-uniform error rates are the nature of real qubits
- Current issue: we solved in this paper
  - No fast error estimating device (i.e. decoder) for weighted decoding (i.e., error estimation w/ non-uniform error rates) large-distance code

TABLE I COMPARISON OF HARDWARE IMPLEMENTATION OF SURFACE CODE DECODERS. OUR FRAMEWORK IS ABLE TO SCALE UP AND TAKE FABRICATION VARIANCE INTO CONSIDERATION.

| Pre | Decoding algorithm<br>vious work | $d_{max}$ | Latency   | Measurement<br>error?<br>de distanc | Decoding<br>window<br>Hetel y | Target<br>device | Weighted matching? | Scalability |        |
|-----|----------------------------------|-----------|-----------|-------------------------------------|-------------------------------|------------------|--------------------|-------------|--------|
| ٦   | Look-up table [15]               | 5         | 42 ns     | Y                                   | 3                             | FPGA             | Y                  | Hard        |        |
|     | Neural network [16]              | 5         | 194 ns    | N                                   | -                             | FPGA             | <u>YNO</u> Supp    | omtafor we  | ighted |
|     | Union-find [17]                  | 11        | 325 ns    | Y                                   | 11                            | FPGA             | Ν                  | Easy        | U      |
|     | Greedy [18]                      | 9         | 162.72 ps | N                                   | -                             | SFQ              | N                  | Easy        |        |
|     | Greedy [19]                      | 13        | 215 ps    | Y                                   | 3                             | SFQ              | N                  | Easy        |        |
| [   | Greedy (Our proposal)            | 11        | 370 ns    | Y                                   | 11                            | FPGA             | Y                  | Easy        |        |

10

# Highlight of this paper

- Current issue: No fast decoder for weighted decoding (i.e., error estimation w/ non-uniform error rates) large-distance code
- Our contribution: we solved it by designing a fast hardware decoder
  - Fast: average delay within time budget
  - Weighted decoding (i.e., error estimation w/ non-uniform error rates)
  - Large distance: support surface code up to d=11

COMPARISON OF HARDWARE IMPLEMENTATION OF SURFACE CODE DECODERS. OUR FRAMEWORK IS ABLE TO SCALE UP AND TAKE FABRICATION VARIANCE INTO CONSIDERATION.

|     | Decoding algorithm    | $d_{max}$ | Latency   | Measurement | Decoding        | Target | Weighted  | Scalability |        |
|-----|-----------------------|-----------|-----------|-------------|-----------------|--------|-----------|-------------|--------|
| Pre | vious work            |           | Small co  | de distanc  | window<br>Ceony | device | matching? |             |        |
| Г   | Look-up table [15]    | 5         | 42 ns     | Y           | 3               | FPGA   | Y         | Hard        |        |
|     | Neural network [16]   | 5         | 194 ns    | N           | -               | FPGA   | YNO SUP   | ometafor we | ighted |
| -4  | Union-find [17]       | 11        | 325 ns    | Y           | 11              | FPGA   | N         | Easy        |        |
|     | Greedy [18]           | 9         | 162.72 ps | N           | -               | SFQ    | N         | Easy        |        |
|     | Greedy [19]           | 13        | 215 ps    | Y           | 3               | SFQ    | N         | Easy        |        |
|     | Greedy (Our proposal) | 11        | 370 ns    | Y           | 11              | FPGA   | Y         | Easy        |        |

## Contents

- Introduction of decoder for surface code
- Proposed weighted hardware decoder of WIT-Greedy
- Conclusion

## Contents

- Introduction of decoder for surface code
- Proposed weighted hardware decoder of WIT-Greedy
- Conclusion

## Overview of decoding for surface code

- Surface code is currently the most promising QEC code
- Decoding: estimation of error allocation inside code
- Decoding task is essentially a classical problem: solve matching problem of active syndrome nodes



## Overview of decoding for surface code

- Surface code is currently the most promising QEC code
- Decoding: estimation of error allocation inside code
- Decoding task is essentially a classical problem: solve matching problem of active syndrome nodes



### Error syndrome using surface code

 Error syndrome: parity check # of errors in neighbor data qubits → error detection in QEC



O Ancilla qubit → Syndrome



Logical qubit

Error syndrome via ancilla qubit

Measurement value

### Error syndrome using surface code

 Error syndrome: parity check # of errors in neighbor data qubits → error detection in QEC



### Error syndrome using surface code

- Error syndrome: parity check # of errors in neighbor data qubits → error detection in QEC
- Code distance d: error detection ability



### Successive syndrome during computation

 Syndrome measurement is conducted every computation code cycle due to large error rates



OMeasurement value = 1\*

### Successive syndrome during computation

 Syndrome measurement is conducted every computation code cycle due to large error rates



How can we estimate errors using syndrome queue?

OMeasurement value = 1\*

Convert decoding task of syndrome queue to a matching problem

 Most probable error estimation -> find minimum weight perfect matching (i.e. most probable error pattern)



Weight of edge: related to the error rate of each qubit Meaning of edges in graph:

- Horizontal edge > data qubit error
- Vertical edge -> ancilla qubit error

## Summary of decoding process

- Pick up all active syndrome node (i.e. measurement value = 1)
- Re-construct complete graph with weights of path
- Solve matching problem of active syndrome nodes
- Error assignment



## Contents

- Introduction of decoder for surface code
- Proposed weighted hardware decoder of WIT-Greedy
- Conclusion

## Our contribution

Syndrome.

input

Syndrome\_

input

- We developed fast weighted decoder with the same computation complexity of the baseline unweighted decoder
- Key idea: weight tables and parallel processing

#### Complexity and flow of hardware decoder



Parallel access

Weight tables

## Our contribution

- We developed fast weighted decoder with the same computation complexity of the baseline unweighted decoder
- Key idea: weight tables and parallel processing

#### **Complexity and flow of hardware decoder**



25



## Difficulty to consider non-uniform weights

- Hard to find shortest path betw. nodes
  - Unweighted one can be calculated by Manhattan distance
- Shortest path needs repeated enumeration
  - Complexity of  $O(d^3)$  for  $d^6$  times

*d* is code distance



Matching result of **uniform** weight

Shortest path in uniform weight can be replaced by Manhattan distance



Matching result of **non-uniform** weight

However, for non-uniform case, we can not use Manhattan distance

### Our proposal: Pre-calculated weight table of shortest paths

- Previous solutions:
  - 1. Perform shortest-path finding in each iteration -> time consuming
  - 2. Full look-up table for matching -> mem. resource exhaust at d=5 [1]

Complexity  $O(a^3)$ 

Increasing rate of size at  $2^{d^3}$ 

- Our solution:
  - Construct pre-calculated weight table of shortest path of each node pair



## Our contribution

- We developed fast weighted decoder with the same computation complexity of the baseline unweighted decoder
- Key idea: weight tables and parallel processing

Accelerate matching

#### Complexity and flow of hardware decoder

d is code distance



• Processing flow chart



- Processing flow chart
  - Active syndrome node encoding —
  - Pair table based parallel comparator

Generating address to query weight tables



- Processing flow chart
  - Active syndrome node encoding
  - Pair table based parallel comparator

→ Matching nodes



- Processing flow chart
  - Active syndrome node encoding
  - · Pair table based parallel comparator



### Parallel processing in active syndrome encoding

- Encoding rows: 1 clock delay for entire row
- Merge to syndrome queue



Row encoding unit

### Parallel processing in active syndrome encoding

- Encoding rows: 1 clock delay for entire row
- Merge to syndrome queue: 3-stage pipeline



- Processing flow chart
  - Active syndrome node encoding
  - Pair table based parallel comparator



35

### Generate pair table for parallel matching

- Large pair table of registers is created for comparing all the pairs simultaneously
  - Parallel accessing weight tables to read weight of each node pairs
  - Record existing top least weight





### Generate pair table for parallel matching

- Large pair table of registers is created for comparing all the pairs simultaneously
  - Parallel accessing weight tables to read weight of each node pairs
  - Record existing top least weight



- Parallel greedy algorithm:
  - pick up the first pair of the current least weight @minimum of 1 clock
  - Turn off flags of pairs containing matched nodes
  - Pop-up pairs for Pauli frame generation in parallel



Node2

Node4

- Parallel greedy algorithm:
  - pick up the first pair of the current least weight @minimum of 1 clock
  - Turn off flags of pairs containing matched nodes
  - Pop-up pairs for Pauli frame generation in parallel





- Parallel greedy algorithm:
  - pick up the first pair of the current least weight @minimum of 1 clock
  - Turn off flags of pairs containing matched nodes
  - Pop-up pairs for Pauli frame generation in parallel



- Parallel greedy algorithm:
  - pick up the first pair of the current least weight @minimum of 1 clock
  - Turn off flags of pairs containing matched nodes
  - Pop-up pairs for Pauli frame generation in parallel



**Parallel processing acceleration** 

(*d* is code distance, N = N<sub>active\_syn</sub> is paring window size)



42

## Implementation result using FPGA

- Requirement of latency is satisfied
  - We have an average delay of 370 ns at d=11
- Scale of circuit implementation is reasonable

| d  | De  | ecoding cyl | ces | Buffer overflow times |  |  |
|----|-----|-------------|-----|-----------------------|--|--|
| u  | min | average     | max |                       |  |  |
| 5  | 18  | 19          | 71  | 0                     |  |  |
| 7  | 19  | 22          | 97  | 0                     |  |  |
| 9  | 24  | 29          | 105 | 0                     |  |  |
| 11 | 33  | 37          | 103 | 0                     |  |  |
|    |     |             |     |                       |  |  |

TABLE III RESOURCE UTILIZATION OVERHEAD FOR DECODING LOGIC

| 10 2.5 |
|--------|
| 10 10  |
| 10 35  |
| 10 126 |
|        |

#### Latency fits into 1µs time budget of code cycle

Condition of latency evaluation:

- RTL simulation
- Write random syndrome flipping testbench and monitor the delay
- Flip rate = 6\*average error rate
- 10<sup>4</sup> random syndrome cycle sample

Condition of resource evaluation

- Queue overflow probability < 1e-15
- Clock constraint = 100 MHz

Reasonable resource utilization of circuits for d=11

• placement&routing in FPGA (field programmable gate array, user reconfigurable circuits)

## Conclusion

- Weighted decoding with non-uniformity is a necessity
- We made it with results of
  - Iterative greedy matching within average 370 ns
  - Support surface code up to d=11
  - Implementable with mid-class classical commercial device

 TABLE I

 COMPARISON OF HARDWARE IMPLEMENTATION OF SURFACE CODE DECODERS. OUR FRAMEWORK IS ABLE TO SCALE UP AND TAKE FABRICATION

 VARIANCE INTO CONSIDERATION.

|     | Decoding algorithm    | $d_{max}$ | Latency   | Measurement | Decoding         | Target | Weighted  | Scalability |
|-----|-----------------------|-----------|-----------|-------------|------------------|--------|-----------|-------------|
| Pre | vious work            | Sr        | nall cod  | eԾՈյ        | window<br>length | device | matching? |             |
| Г   | Look-up table [15]    | 5         | 42 ns     | Y           | 3                | FPGA   | No supp   | ortweighted |
|     | Neural network [16]   | 5         | 194 ns    | N           | -                | FPGA   | Y         | Hard        |
| 4   | Union-find [17]       | 11        | 325 ns    | Y           | 11               | FPGA   | N         | Easy        |
|     | Greedy [18]           | 9         | 162.72 ps | N           | -                | SFQ    | N         | Easy        |
| L   | Greedy [19]           | 13        | 215 ps    | Y           | 3                | SFQ    | N         | Easy        |
|     | Greedy (Our proposal) | 11        | 370 ns    | Y           | 11               | FPGA   | Y         | Easy        |