# Spatially parallel decoding for multi-qubit lattice surgery

Sophia Fuhui Lin\*1,2, Eric C. Peterson<sup>1</sup>, Krishanu Sankar<sup>1</sup>, Prasahnt Sivarajah<sup>1</sup>

<sup>1</sup>AWS Center for Quantum Computing

<sup>2</sup>University of Chicago

Abstract—Running quantum algorithms protected by quantum error correction requires a real time, classical decoder. To prevent the accumulation of a backlog, this decoder must process syndromes from the quantum device at a faster rate than they are generated. Most prior work on real time decoding has focused on an isolated logical qubit encoded in the surface code. However, for surface code, quantum programs of utility will require multiqubit interactions performed via lattice surgery. A large merged patch can arise during lattice surgery — possibly as large as the entire device. This puts a significant strain on a real time decoder, which must decode errors on this merged patch and maintain the level of fault-tolerance that it achieves on isolated logical qubits.

These requirements are relaxed by using spatially parallel decoding, which can be accomplished by dividing the physical qubits on the device into multiple overlapping groups and assigning a decoder module to each. We refer to this approach as spatially parallel windows. While previous work has explored similar ideas, none have addressed system-specific considerations pertinent to the task or the constraints from using hardware accelerators. In this work, we demonstrate how to configure spatially parallel windows, so that the scheme (1) is compatible with hardware accelerators, (2) supports general lattice surgery operations, (3) maintains the fidelity of the logical qubits, and (4) meets the throughput requirement for real time decoding. Furthermore, our results reveal the importance of optimally choosing the buffer width to achieve a balance between accuracy and throughput — a decision that should be influenced by the device's physical noise.

# I. INTRODUCTION

Given the error rates experienced by quantum computers, quantum error correction (QEC) is necessary for running large-scale quantum applications. QEC protects quantum information by encoding each logical qubit in multiple physical *data* qubits, and using another set of physical *syndrome* qubits to detect errors on the data qubits. In each measurement cycle, syndrome qubits extract parity information from the data qubits on which they act. This parity information is then sent to a classical decoder that decodes the syndromes and reports a correction.

Here we focus on the surface code [13], [19], [26], a popular QEC that tolerates relatively high physical noise, supports relatively easy logical operations, and only requires nearest-neighbor grid connectivity. In recent years, academic and industry labs have experimentally demonstrated small instances of a surface code logical memory [1], [27], [43] and lattice surgery [16]. While an offline decoder — applied after

the experiment has completed — is sufficient for such demonstrations, applications with nontrivial information processing will require real time decoding [19].

A requirement for real time decoding is that the throughput of the decoder match the rate of syndrome measurements, which avoids an exponential backlog of data [38]. And the requirement can be quite strict: each syndrome measurement cycle on a superconducting device can be completed in  $\sim 1~\mu s$  (921 ns in [1]). The strict timescales have persuaded the community to explore hardware accelerators for the task, such as Field Programmable Gate Arrays (FPGAs) [9], [31], [40], Application Specific Integrated Circuits (ASICs) [10], or onchip SFQ-based superconducting digital circuits [23], [32], [39]. While their software counterparts — designed to run on general-purpose CPUs [22] — offer more flexibility, hardware decoders promise higher throughput, deterministic execution, and tighter integration with the rest of the control system [2].

Whichever decoder one chooses, a single monolithic instance will inevitably struggle to meet the decoding demands that arise during multi-qubit logic gates. For surface codes, lattice surgery [18], [24], [28]–[30], [36] is the most efficient way to perform multi-qubit logical operations. When lattice surgery is performed on two or more (possibly distant) patches of logical qubits, they are merged into a single patch for multiple measurement cycles before being split apart; a snapshot of the procedure is shown in Figure 1. It's worth noting that current state of the art decoders can only meet the throughput demands on a patch that is the size of an individual logical qubit up to distance 20 to 30 [2]. Worse, their throughput is inversely proportional to the size of the patch [22]. Meanwhile, lattice surgery requires merging multiple such patches [29] and decoding the merged patch.

Dividing the decoding task into overlapping windows is a promising approach [4], [34], [37] to manage this scalability challenge. For instance, prior work [34], [37] leverages temporal parallelism by processing the syndrome data from many measurement rounds with *temporally* parallel windows. This limits the growth of syndrome backlog when the inner decoder is slower than syndrome generation in terms of throughput. But this strategy doesn't mitigate the large spatial window that each parallel decoder must cover. In [34], the authors introduce the idea of dividing the decoding problem in both time and space, but do not analyze its feasibility. In [4], the authors leverage a different kind of spatial parallelism, where independent subgraphs arising from the logical circuit are decoded in parallel. This addresses the spatial challenge,

<sup>\*</sup>Corresponding author: Sophia Lin (email: lfsophia@amazon.com).



Fig. 1: A merge operation in lattice surgery, specifically, a logical  $Z \otimes Z$  measurement.

but requires a dynamic choice in where parity information is routed. We suspect this dynamic routing may prove complicated for hardware decoders to realize in practice.

Here, we explore spatially parallel windows. To the best of our knowledge, ours is the first analysis of a decoding scheme that is both capable of handling large patches that arise during logical operations and compatible with practical system level constraints of hardware accelerators. As alluded to in [34], we employ the strategy of dividing the decoding task into multiple overlapping windows, and assign a decoder module to each window. The inner decoder that operates on an individual window can be any real time decoder; our scheme is agnostic to the choice. The important point is that these decoders can be coordinated to output a correction when a merged patch spans multiple windows. This technique works because we only need to protect logical information up to the distance of the isolated logical qubits. This allows us to overlap windows with local views and focus on resolving the disagreements at their seams. The decoders are scheduled to run during different time steps, so that they only need to resolve with their neighbors before and after they run.

A motivating principle of our work is that various design choices need to be carefully balanced to ensure that a scheme with spatially parallel windows can meet the requirements of real time decoding. A hardware constraint specific to the problem is that each window needs to be pre-assigned to an area on the device. This is because real time decoders that operate on individual windows, regardless of the specific implementation, require hardware accelerators with fixed positions. Furthermore, our work emphasizes the importance of avoiding larger windows than necessary, since large windows challenge the scalability of inner decoders.

We begin by demonstrating how to configure spatially parallel windows in a manner that (1) is compatible with hardware accelerators like FPGAs and ASICs, and (2) can handle the larger merged patches that arise during lattice surgery operations. The decoding scheme is scalable, in the sense that its speed and resource requirements are independent of the number of patches that are merged or how far they are apart on the device. Then we examine the factors influencing the performance, focusing specifically on two key aspects: accuracy and throughput. We proceed to analyze how to achieve a balance between these requirements. Finally, we

estimate the size of the individual code patches this scheme can support, assuming one uses an FPGA-based inner decoder.

By performing numerical simulations and analyzing the mechanisms through which spatially parallel windows lead to extra decoding errors, we identify that the size of the windows and the width of the overlapping areas (buffer width) are important factors that determine the accuracy. To maintain the fidelity of the logical qubits, the size of the windows cannot be smaller than the individual patches that encode the logical qubits. The buffer width, however, is a key variable in the trade-off between accuracy and throughput. Enlarging the buffer results in larger windows, which subsequently leads to lower throughput (or significantly raises the requirement on computing resources). Conversely, when the buffer is too narrow, it compromises the accuracy of the decoding. We find that the optimal choice of buffer width not only depends on the size of individual code patches, but also depends on the level of physical noise on the device. This can be explained by examining the terms in the logical error rate expression, especially the entropic factors. At a modest physical noise level, the buffer width should be between one-half and twothirds of the width of an individual patch of code.

The throughput of the decoding scheme is determined by its slowest component. To identify the bottleneck, we separately estimate the speed of both the inter-window communication and the decoding modules. Our findings indicate that the interwindow communication will not become the bottleneck, even when the width of the windows exceed 100. This means the overall throughput of the decoding scheme meets the requirement for real time decoding if and only if each inner decoder module runs sufficiently fast on its window. We discuss the scalability of two state-of-the-art real time decoders [2], [31] when applied to individual windows.

Although we focus on surface code in this work, our results are also relevant to other local codes, e.g. color codes.

#### II. BACKGROUND

# A. Surface code, lattice surgery, and decoding

A square patch of surface code encodes one logical qubit in a  $d \times d$  grid of data qubits, where d is the code distance, and uses  $d^2-1$  syndrome qubits for the stabilizer measurements. A distance-5 example is shown in Figure 2a. The stabilizer measurements project errors on the data qubits into Pauli X,Y or Z errors. Each X (Z) stabilizer measures the X (Z) parity of the data qubits that it acts on, and detects the Z (X) errors on them since the X and Z operators anti-commute. An X (Z) syndrome qubit reads 1 if an odd number of its data qubits are affected by a Z (X) error. In this case we say the stabilizer is flipped. A flipped syndrome readout is also called an *anyon* or a *defect* in literature. A Y error can be viewed as the product of an X error and a Z error. It can be detected by both types of the stabilizers.

An X(Z) logical operator,  $\hat{X_L}(\hat{Z_L})$ , is a product chain of X(Z) operators on a set of data qubits that connect the two X(Z) boundaries. Examples of them are shown in Figure 2a. The logical operators do not flip any stabilizer. Note that if a chain of operators is equivalent to a linear combination of



Fig. 2: (a) A square patch of surface code. The red and blue faces are the X and Z stabilizers, respectively. The blue and red lines mark  $\hat{Z_L}$  and  $\hat{X_L}$ , the Z and X logical operators. (b) An example decoding graph for the X errors. The vertices are the Z stabilizers, augmented by the virtual boundary node (in yellow). The stabilizers that are flipped are marked in red, and a minimum weight matching is denoted by the green edges.

the stabilizers, it does not flip any stabilizer either, but it acts trivially on the logical qubit. We define the code distance for X (Z) errors,  $d_X$  ( $d_Z$ ), as the weight of the shortest X (Z) logical operator. It is the minimum weight of a nontrivial X (Z) error chain that cannot be detected by the code. The code distance d is the minimum of  $d_X$  and  $d_Z$ . If a device has biased noise, e.g. more Z errors than X errors, then a rectangular patch with different  $d_X$  and  $d_Z$  could be used to further suppress the dominant type of errors.

Lattice surgery implements logical multi-qubit Pauli measurements via the merging and splitting of patches, which can be used for performing logical gates. For instance, when two patches are merged along boundaries that coincide with their Z logical operators, the value of  $Z\otimes Z$  can be inferred from the product of the Z stabilizers spanning the boundary. X Pauli products are measured similarly, while measurements that involve the Y operator can be done with twist-based lattice surgery [30] or an alternative protocol [7]. When logical qubits are separated in space, the merge operation can be facilitated by a long ancilla patch as in Figure 5 in [30].

While all errors chains with weight smaller than d are detectable by the code, only the ones with distance no more than  $\lfloor \frac{d}{2} \rfloor$  are guaranteed to be correctable. The decoding of syndrome is performed on a decoding graph (Figure 2b). For the surface code, independent X and Z decoding graphs can be used for correcting X and Z logical errors. On a decoding graph, each edge represents an error mechanism that is detectable by its vertices. An edge between two adjacent Z stabilizers represents an X flip on the data qubit shared by them. An edge connected to the virtual boundary node represents an error that is only detected by one stabilizer.

For simplicity, Figure 2b only shows a 2D decoding graph for one round of syndrome measurement. The full decoding graph that a decoder works with has a time dimension because measurement errors also need to be represented. Such a graph can be viewed as multiple copies of Figure 2b stacked on top of each other. If two vertices represent the readouts of the same stabilizer from adjacent time steps, they are connected by an edge that represents a flip on the syndrome qubit.

A popular technique for decoding syndromes from a surface code is minimum-weight perfect matching (MWPM) [13], which identifies the lowest weight error patterns on a given decoding graph. Sparse Blossom [22] and Fusion Blossom [42] are recent, fast software implementations of this technique. Union-Find (UF) [12] is another popular technique, and can be viewed as an approximation of the Blossom algorithm [14], [15] that implements MWPM [41]. It is known to have better runtime complexity and slightly lower accuracy than MWPM.

## B. Real time decoding with FPGA

For concreteness, we explore the implications of using an FPGA for the inner decoder. FPGAs offer an attractive alternative to general-purpose CPUs for tasks with strict timing requirements and modest programmability requirements. An FPGA is made up of a "fabric" consisting of a great many components whose behaviors are individually programmable, whose input and output can be flexibly routed among other components, and whose timing is uniformly specified so as to support easy latching of inter-component signals. These properties imbue FPGAs with excellent parallelism and pipelinability, enabling them to achieve excellent throughput on highbandwidth tasks. The primary trade-off for these abilities is that the FPGA fabric is almost always programmed "once and for all"—that is, the individually programmable components nonetheless have their behaviors fixed for the lifetime of an application.

Because surface code syndromes arrive in a stream at approximately  $(d^2-1)$  Mbps, decoding is a strong candidate application for an FPGA [9], [31], [40]. However, the constraints of FPGA programming bear directly on the decoding problem:

- Scalability: An individual FPGA enjoys a fixed set of components, making it suitable for decoding only up to a certain size. Meanwhile, arbitrarily long swaths of surface code may appear during lattice surgery protocols.
- Communication: The high-speed communication afforded by the lock-step evolution of FPGA components is somewhat lost when connecting two separate fabrics together (e.g., over ethernet). Accordingly, programmers need to keep tight control over inter-fabric communication to keep it from dominating the runtime of an application. Additionally, local connections between components are set out at "compile time", sometimes coming at significant layout cost.
- Fixed gateware: Since FPGAs only permit modest onthe-fly reprogramming, care must be taken to match these limited abilities with the changing decoder requirements of a surface code patch undergoing lattice surgery.

Deploying an FPGA to accomplish decoding requires accommodating each of these constraints.

## III. RELATED WORK

In recent years, there has been a surge in research on real time decoding for QEC. Helios [31] is an FPGA-based UF decoder. When acting on a surface code with distance d, it achieves a sublinear average time complexity per cycle, at the

cost of  $O(d^3)$  hardware resources. The work demonstrated an implementation operable up to d=21. Riverlane [2] recently implemented a UF decoder on both FPGAs and ASICs and reported results for up to d=23. Astrea [40] and LILLIPUT [9] output the same solution as a MWPM decoder for surface code up to d=7 and d=5, respectively. These can be implemented on FGPAs. AFS proposes to implement the UF decoder on ASICs [10]. There are also real time decoders with SFQ-based superconducting digital circuits [23], [32], [39]. Some decoders use hierarchical decoding [8], [32], [35].

A few recent works divide the decoding task into overlapping windows [4], [34], [37], an approach that is also employed in our work. In [37] and [34], the authors primarily address parallelization in time, which helps contain the backlog when the throughput of the decoder does not match the rate that the syndrome is generated. In [34], the authors also includes a discussion on dividing the decoding problem in space but do not explore the consequences. The information passed between neighboring windows is similar, whether they are temporally or spatially parallel. However, temporally and spatially parallel windows apply to different problems and so have different constraints and goals. For example, [34] does not investigate how a narrow buffer would compromise the accuracy of decoding.

A paper from PsiQuantum [4] is the only prior work that targets lattice-surgery style fault-tolerant blocks. Its focus is different from our work though. It prioritizes minimizing the latency of a software implementation and does not address "further hardware and systems considerations that are relevant". However, we put more emphasis on the constraints of hardware accelerators, and consider throughput instead of latency as the more important measure of speed for real time decoders. For example, they develop algorithms that take lattice surgery blocks as input, then assign windows, while we explicitly require that windows have pre-assigned positions so that they are compatible with hardware accelerators. Both papers include numerical results on how the buffer width affects decoding accuracy, but we provide more analysis and also study how the influence depends on the physical noise level.

#### IV. FRAMEWORK OF THE DECODING SCHEME

In Section II-B, we discussed three constraints of an FPGA implementation. Our protocol keeps these constraints foremost in mind: our starting premise is that a single decoder handles a fixed (small) window; we explicitly analyze the single burst of communication needed from one stage of decoding to the next, which takes place over a fixed subset of a nearest-neighbor topology network; and we limit our reprogrammability requirements to a simple form of "masking", where read and write operations are individually dis/allowed at given cells.

In this section we explain key aspects of the design. We first show how the decoder modules should be connected and why it is necessary for them to overlap in buffer regions. Then we discuss the metrics of the decoding scheme and the design choices that influence them. Finally we introduce a configuration that is compatible with the hardware constraints.





Fig. 3: A segment from a longer patch of surface code, divided into 2 non-overlapping windows by an artificial boundary (dashed line). The optimal corrections are shown in green and the suboptimal ones in purple. (a) One flipped syndrome in window A. The suboptimal correction will be selected if matching to the artificial boundary is allowed. (b) An error flips two syndromes, one in each window. The suboptimal correction will be selected if matching to the artificial boundary is disallowed.

### A. Connecting decoder modules

The time that it takes for a monolithic decoder to process a long patch of code scales at least linearly with the area of the patch. (Exceptions like look-up-table-based decoders have exponentially high storage overhead so they only work for very small patches [9]). Spatially parallel decoding circumvents this by dividing the decoding task into multiple processes. In order to avoid frequent communication between the processes, we do not consider fine-grained schemes where each (ancilla or data) qubit is assigned its own decoder module. We only consider coarse-grained parallel schemes where each decoder module works on an area that is similar to an individual patch of logical qubit like Figure 2(a). We refer to the area that a decoder module acts on as a window.

It might seem resource efficient to divide the device into non-overlapping windows, and run decoders on each window simultaneously. However, this will either completely sacrifice the accuracy of decoding, or require high communication overhead between neighboring decoder modules. If no communication is allowed between non-overlapping windows, the decoder has two choices when acting on one window: (1) it allows matching across the boundary with neighboring windows, (2) it does not allow such matching. The examples in Figure 3 show that either case leads to easy decoding failures. In case (1), the flipped syndrome in Figure 3(a) would be matched to the boundary between the two windows, while it should be matched to the lower boundary. In case (2), the flipped syndromes in Figure 3(b) would be matched to the top and bottom boundaries, while they should be matched to each other. Both examples result in logical errors.

In order to maintain the accuracy of decoding and avoid high communication overhead, one can take the approach of using overlapping windows and applying decoder modules on neighboring windows in different *layers* (time steps). We will use terms *rough boundary* and *smooth boundary* in the explanation. On a decoding graph, a boundary with edges to



Fig. 4: Decoding graphs for X errors, used by (a) window A and (b) window B. In both (a) and (b), the buffer is marked in pink, and the two nodes where artificial defects could be introduced are marked by solid black dots. In (a), the red edges are the ones that connect the nodes in A's commit region to the buffer. Corrections along the dotted edges are on nodes in the buffer, and are not committed by window A. In (b) the decoding graph has solid edges in the buffer, because the corrections in the buffer are finalized in this step.

the virtual boundary node is termed rough, while a boundary with no edges out of it termed smooth. These boudaries can refer to either ones that already exist in the global decoding graph (e.g. the top and bottom boundaries of Fig. 4 are rough), or an artificial one that arises from partitioning the patch into multiple windows (e.g. the vertical boundaries in Fig. 4 are rough and smooth respectively). These terms should not be confused with the rough and smooth boundaries of a surface code, which are synonymous with boundaries that absorb strings of Z and X flips, respectively.

Figure 4 illustrates this approach for a segment of the X decoding graph taken from a longer patch. As shown in shown in Figure 4a, window A has a rough boundary on its right side. The edges that connect to this boundary are derived from those that straddle window A in the global decoding graph. In the first layer, only window A is acted on by a decoder. After it obtains a correction, the decoder only commits the part that acts on the nodes in A's *commit region*, marked by the solid edges in Figure 4a. The edges with dotted lines are in the buffer between the windows A and B, and corrections in this region are not applied by window A.

The only communication between windows A and B happens when A finishes decoding. And the only information A passes to B is the *artificial defects*, the nodes along the left (smooth) boundary of B that are flipped by corrections in the commit region of A. The artificial defects are marked with solid black dots in Figure 4. After B learns about these

defects, it performs corresponding flips on the nodes along its left boundary and begins decoding, as shown in Figure 4b. This ensures consistent corrections across the windows.

Unlike in window A, the decoding graph of window B (Figure 4b) does not have an artificial rough boundary with neighboring windows. This is true for any window scheduled in the final layer. In the X decoding graphs shown in Figure 4, there are still natural rough boundaries at the top and bottom; however, these would be smooth in the corresponding Z decoding graph and so this graph would have no rough boundaries. When a decoding graph has no rough boundaries, the decoder will fail to output a correction if the syndrome it receives contains an odd number of flipped nodes. In other words, there is a *lone defect* that cannot be matched. When this happens, we terminate decoding and report that a logical error has occurred. This failure is due to suboptimal decoding in previous layers.

#### B. Metrics

Throughput is one of the most important metrics for a real time decoder. It should match the rate at which syndromes are generated so that the syndrome backlog will not grow exponentially, exhausting storage space and slowing down logical operations [38]. When the decoder has enough throughput, it is sufficient to apply the sliding window technique [25] along the time dimension. When the inner decoder is not fast enough, a temporally parallel decoding scheme can mitigate the backlog problem. But temporal parallelization alone is not an ideal solution for the long patches that arise from lattice surgery, which will require more layers of temporally parallel windows. This will slow down the logical clock rate further and also incur more hardware costs if the decoder is implemented with hardware accelerators.

Accuracy is a critical metric for any decoding scheme. When multiple logical qubits undergo a logical operation, they should ideally maintain the same level of fidelity as when they are isolated. As we will show in Section V, this requires sufficiently large windows and buffers.

Latency is another metric of the decoder's efficiency. It measures the total delay between the time that a syndrome measurement cycle finishes and the decoder outputs a correction. It is particularly important when performing conditional operations, like gate-by-measurement. The latency of the decoder determines how long a logical qubit needs to idle before the conditional operation can be executed. This is because the value that conditions the operation is in part determined by the decoder's corrections. While it doesn't have a strict limit like throughput, a large latency is still undesirable because it leads to an increase in the space-time volume of the computation [8], [11]. Specifically, the time of the computation increases linearly with the latency, and to protect the logical information, the code distance should be increased logarithmically to compensate. However, the scheme we consider only has a latency of a few cycles. The logical error rate per cycle is roughly proportional to  $(100p)^{(d+1)/2}$  [17]. At  $p \sim 0.1\%$ , increasing d by 2 would suppress the logical error rate by  $\sim 10X$ , which is enough for overcoming the latency.



Fig. 5: (a) A patch with a tree structure is 2-colorable. (b) A patch that arises during a  $Y \otimes Y$  logical measurement. Not 2-colorable.

The latency of the spatially parallel windows is determined by (1) the latency in each component of the decoding scheme, which is already reflected in the throughput, and (2) the number of layers that the windows are divided into. To reduce the latency, it is preferable to use a window configuration with a small number of layers. This is one of the considerations in the following subsection.

#### C. Window configuration for hardware decoding

For a software implementation of the spatially parallel windows, one can dynamically and flexibly adjust the configuration of the windows during the computation, given the lattice surgery operations at each time step. However, for real time decoding using a hardware accelerator like an FPGA or ASIC, the positions of the windows and the links between them should ideally be fixed. Here, we seek a window configuration that is compatible with hardware accelerators and also accommodates general lattice surgery operations.

Besides having fixed window positions and links, the window configuration and mapping/compiling strategy should respect three other constraints to ensure the accuracy of decoding. First, the commit regions of the windows in the same layer must be disjoint. Unlike windows in different layers (e.g. the two windows in Figure 4), they cannot share data qubits. This is because no information is shared between windows in the same layer, while the information necessary for making consistent corrections is passed between neighboring windows in different layers. Second, the windows should not have smaller size than an isolated patch of code (e.g. Figure 2a). Thirdly, when a patch spans multiple commit regions, the intersection of the patch with each commit region should not be too small. The second and third points will be further explained in Section V.

To reduce the latency of the decoder, the windows should be arranged in a small number of layers. Can a two-layer configuration meet the requirements? For the large patch in Figure 1, it suffices to use a 2-colorable checkerboard configuration of windows. This is because the patch has a tree structure. See Figure 5a for a window configuration that accommodates a multi-patch merge of this type (only the commit region of each window is shown). In this example, although the commit regions with the same color overlap at

corners, their committed corrections do not conflict because no active qubits are at these corners.

However, not all patches from lattice surgery operations have a tree structure similar to that in Figure 1 and Figure 5a. These examples only involve X and Z logical measurements. But when a lattice surgery operation involves a Y measurement on a logical qubit, the patch may need to be touched on more than one X or Z edge, which may result in a patch that cannot be accommodated by a checkerboard pattern. For instance, Figure 5b shows a merged patch that arises during a  $Y \otimes Y$  logical measurement [6].

To accommodate general lattice surgery operations we propose to use a 3-coloring of staggered squares (Figure 6). It is a simple configuration that satisfies all the constraints mentioned above. The size of each commit region is the same as the size of an individual patch of code. If the rows of windows are not staggered then 4 colors are required, which leads to an unnecessary increase of latency.

# V. LOGICAL ERROR RATES

In this section we study the logical error rates (LER) when using spatially parallel windows. We begin by describing the simulation setup on a rectangular surface code patch, and then study logical errors along the short and long edges of the patch. We then move on to simulations and analysis for a more general setup, five windows stacked in two rows. We also discuss how the results influence design decisions (e.g. buffer width) and introduce constraints on mapping.

# A. Methodology for simulations

We perform quantum memory simulations, where each shot includes d cycles of stabilizer measurements. We use Stim [20] for simulating stabilizer circuits along with a circuit-level noise model that includes single-qubit, two-qubit, and measurement errors characterized by a single parameter p. We use PyMatching2 [22], a state-of-the-art implementation of the MWPM decoder, both as the inner decoder applied to individual windows and as the baseline global decoder. Although evaluation of the throughput would require a real time inner decoder, PyMatching2 is sufficient for simulations that focus on logical error rates and syndrome densities.

For Sections V-B and V-C we use a rectangular surface code patch that would arise when merging two square-shaped patches of distance d during lattice surgery. Our goal is to maintain the fidelity of the individual patches, and we study the impact of the buffer width and physical noise p on this fidelity. As such, we use the simplest setup (Figure 7) with just two overlapping windows, with window A applied before B. This configuration is such that the square-patches are the same size as the commit regions of the windows.

## B. Logical errors along the short edge

In a memory experiment, a logical error occurs if an X or Z logical operator is flipped after applying both the noise and the correction. We first study the impact of logical errors along the short edge of Figure 7, with code distance d=15. We



Fig. 6: A window configuration with staggered squares. (a) Only showing the commit region of each window. Each arrow represents a one-way link that communicates information on artificial defects. (b-d) The windows in the first, second, and third layers, respectively. The buffers are shown in lighter color and surrounded by dotted edges. The windows on the last layer do not have buffers. The buffer width in this figure is chosen to improve readability.



Fig. 7: Setup for the numerical simulations.

only use one of the X and Z (global) decoding graphs: the one with rough boundaries on the top and bottom edges. For this setup the logical operator should be a horizontal string that connects the left and right edges of the entire patch.

In Figure 8 we plot the LER against w, the buffer width (e.g. Figure 4 has w=3). The figure shows that when the buffer width is small, the accuracy of the parallel decoder is orders of magnitude worse than its global counterpart. It also shows that increasing w closes the gap. In fact, the gap diminishes before the buffer is grown to match the length of the short edge, which implies that w need not be as large as d to maintain accuracy. This is beneficial for throughput, since large buffers invariably decrease the speed of each decoder module or place tighter requirements on the underlying hardware.

So how large does w need to be? From Figure 8 we observe that the optimal choice of w depends on the physical noise p. At p=0.1%, the 95% confidence intervals of the LERs from parallel and global decoding overlap when  $w \geq 10$ . At w=10 and p=0.1%, the ratio of parallel to global LER is  $\sim 1.32$ . But at p=0.5%, the ratio is below 1.3 when  $w \geq 6$ , and it is only  $\sim 1.05$  at w=10. This shows that one might need a larger w at lower p to match the accuracy of global decoding.

In Figure 9, we provide an another view of the results from Figure 8. In Figure 9a we plot the LERs against p on log scale. The slope of a line in the plot shows how quickly the LER is suppressed when p decreases, and is a proxy for effective code distance. When we plot the slope against w in Figure 9b, we



Fig. 8: Logical error rate v.s. buffer width, using the setup in Figure 7 with d=15, counting the logical errors along the short edge (that connect the top and bottom boundaries). The dashed lines are the baseline results from using a *global* decoder on the same underlying patch. The shaded regions indicate the 95% confidence intervals. Each data point at p=0.1% is obtained with 8B shots.

see that the benefit of growing w diminishes as w increases.

To facilitate understanding of the numerical results, we also study error strings that confuse the parallel decoder but can be corrected by a global decoder. In Figure 10 we show examples of such error strings for d=15, w=9 and 3. The example error strings have lengths (l) 6 and 4, respectively. Since l is no more than  $\lfloor \frac{d}{2} \rfloor$  (which is 7) for both strings, they can be corrected by a global decoder.

However, in Figure 10a, window A can match the left end of the string to the top boundary (and commit this), and the right end to the rough boundary outside of the buffer (tentatively). Such a matching consists of 6 edges, the same as the correct one, so window A will output one or the other with equal probability. When window B is in action, it only sees the right end of the error string, and matches it to the bottom boundary. Since one end of the error string is matched to the top boundary and the other is matched to the bottom, the correction flips the horizontal logical operator while the noise does not. Thus the operator is flipped at the end and we have



Fig. 9: The data in Figure 8 represented differently. (a) Logical error rate v.s. physical error rate, for parallel decoding with different buffer width w. (b) The slopes of the lines in (a) and (b), plotted against w.



Fig. 10: Example error strings that confuse the parallel decoder but can be corrected by a global decoder. Left: w=9. Right: w=3. In each example, the region between the two dashed vertical lines is the overlap of windows A and B. Window A is decoded first. The flipped stabilizers are in red. The error strings are in green. The purple edges are wrong decoder outputs by A. The dotted horizontal line marks the middle of the patch.

a logical error. In Figure 10b, window A can match the left end of the error string to the bottom boundary. The right end of the string is invisible to A, so A does not need to act on it. Such a matching is again the same weight as the actual error string. Then window B only sees the right end of the error string and matches it to the top boundary.

How long do these strings need to be? For a connected error string (that is correctable by some global decoder) to result in a logical error along the short edge, it should satisfy all the three conditions below:

- One end of it is matched to the top boundary and the other is matched to the bottom, in the committed matching
- 2) Exactly one end of it is in the commit region of window A. If neither end is in the commit region of A, no correction will be committed by window A, and the error is entirely decoded by B as in the case of a global decoder. If both ends are in the commit region of A, then the corrections to both defects are committed by A.
- 3) In window A, the end that is not matched to the top/bottom boundary is matched to the rough boundary out of the buffer. If both ends are matched to the same boundary, we do not get a logical error. If they are matched to opposite boundaries by A, the global decoder would also do such a matching.

Let l be the weight of the error string (its number of edges). The right end of the error string can be just above or just below the middle of the patch, and the left end can be just within the commit region of A. The weight of the wrong matching in A, described above, is the sum of the distance from the right end to the rough boundary out of the buffer, and from the left end to the top/bottom boundary. Assuming an odd d, the first part is  $\max(w+1-l,0)$ , and the second part is  $\max(\frac{d-1}{2}+1-l,0)$ , for a diagonal error string similar to the ones in Figure 10. The wrong matching might be output by decoder A if it has no higher weight than any correct one. Then for d=15, a string with l=4 (5, 6, 7) can confuse a parallel decoder with  $w\leq 3$  ( $w\leq 6$ ,  $w\leq 9$ ,  $w\leq 12$ ). Note that when  $w\geq d-2$ , a connected error string that can confuse the parallel decoder is also long enough to confuse any global decoder.

The lengths and the quantity of error strings help explain the LER. The chance of having a weight l error is roughly  $(1-p)^{n-l}p^l$ , where n is the total number of edges in the 3D decoding graph (we say *roughly* because the probability associated with an edge in the decoding graph is not exactly p—the circuit-level noise model adds complications). The LER can be expressed as  $P = \sum_{l=l_{\min}}^{n} N_{\text{fail}}(l)(1-p)^{n-l}p^l$ , where  $N_{\text{fail}}(l)$  is the number of weight-l errors that cause logical error and  $l_{\min}$  is the minimum weight of an uncorrectable error (which does not need to be a connected string).

When p is low, the LER is usually dominated by the term from  $l_{\min}$ . This is why  $l_{\min}$  is an important quantity, whether in the decoder-specific context in this paper or the decoder-independent context where it is equivalent to  $\lceil \frac{d}{2} \rceil$ . The range of p we use for the numerical simulations includes 0.1% which is already low. However, our simulation results cannot be fully explained by the min weight of uncorrectable errors. For

example, Figure 10b has a  $l_{\rm min}$  of 4 with parallel decoding, but it has more than two orders of magnitude lower LER than a distance 8 surface code decoded globally (which also has  $l_{\rm min}=4$ ) under the same p of 0.1%.

The value  $l_{\min}$  alone does not fully explain the simulation results because the entropy of error chains sometimes play a more important role in determining the LER at modest physical noise [3], [5]. When the buffer is just narrow enough for an error of weight  $l_{\min}$  to cause a logical error, as in both examples in Figure 10, the entropic factor  $N_{\text{fail}}(l_{\min})$  is a tiny value. The entropic factors of higher order terms in P are significantly larger. When p is not low enough, the disparity in error probability is not enough to offset the difference in the entropic factors, then the main contribution to the LER is from the higher order terms. If the LER from the parallel decoder is dominated by terms with weight no less than  $\lfloor \frac{d}{2} \rfloor$ , then it will be close to the LER from the global decoder, even when the two decoding schemes have different  $l_{\min}$ . This explains the results in Figure 8.

# C. Logical errors along the long edge

For a rectangular patch of surface code with a long edge and a short edge, the logical errors along the long edge constitute only a small portion of the total LER when a global decoder is used. Here, we check that spatially parallel windows preserve this property.

We again use the setup in Figure 7, and show the results of numerical simulations in Figure 11. This time, we observe that the LER gap between parallel and global decoding stays constant as the buffer is grown. We note a distinction between Figure 11 and Figure 8. As the buffer width w increases, the code distance increases along the long edge but remains constant along the short edge. This is why the LER from the global baseline decreases as w grows in Figure 11 but remains almost constant in Figure 8. We also find that the LER from parallel decoding coincides with the one from applying the global decoder to a smaller patch that is the same size as window A.

Along the long edge, we must consider error strings like the one in Figure 12. Window A in this figure has code distance 9, the entire patch has distance 13, and the green error string has weight 5. The string is not long enough to confuse the global decoder on the entire patch, but is long enough to confuse a window. In particular, window A would prefer the incorrect purple matching to the green one, because it only has weight 4. Then window B, which only sees the right end of the error string, has no choice but to match it to right rough boundary of its decoding graph. The result is a logical error along the horizontal edge.

Our findings about the LER along the long edge are conclusive: as long as the horizontal dimension of window A in Figure 7 is longer than d (which will be the case due to the buffer), the increased LER from logical errors along the longer edge is negligible when compared to the increase from the errors along shorter edge. Therefore we do not need to consider this type of logical errors when choosing w.



Fig. 11: Logical error rate v.s. buffer width, using the setup in Figure 7 with d=7, counting the logical errors along the long edge. The dashed lines are the baseline results from using a *global* decoder on the same patch. The results from applying the global decoder on a patch that is the size of window A are also shown.



Fig. 12: An example error string that causes a logical error along the long edge, when the parallel decoder is used. As in Figure 10, window A is applied before B, the buffer is between the vertical dashed lines, the error string is in green, and a wrong decoder output by A is in purple.

# D. Generalizing to 2D configurations

As mentioned in Section IV-C, some lattice surgery operations involve wider merged patches. This section investigates whether the window configuration in Figure 6 works on these patches, and the buffer width requirement for decoding on these patches.

We perform simulations with the 2 settings shown in Figure 13, where we allocate an underlying patch in different regions of Figure 6, and perform decoding with the given window configurations. In practice, such a large patch could be the result of a merge: we start with two vertically-stacked square patches, each with the same size as a commit region in Figure 6. They are first deformed into rectangles with longer horizontal edges, so that they can perform  $Y \otimes Y$  logical measurement [6], [30]. Then the top and bottom rectangles are merged. This example scenario motivates our choices of baselines: the total LERs of the two isolated patches, either before all the logical operations (when they are squares), or right before the merge (when they are longer rectangles).

The simulation results in Figure 14 shows that, in this setting, the LER of parallel windows is suppressed exponentially



Fig. 13: Setups for simulations with 5 windows stacked in 2 rows. As in Figure 6, the color of a window indicates the step that the decoder acts on it. The area of the underlying patch is enclosed by the pink square. The vertical and horizontal logical operators are in green and red. In (a) the two original patches that are merged are labelled b1 and o1, while in (b) they are labelled g1 and b1.



Fig. 14: LER v.s. buffer width at p=0.4% over 15 rounds, using the setups (a) in Fig. 13a, and (b) in Fig. 13b. We show the different logical errors from using the horizontal and the vertical logical operators. The size of each smaller square (e.g. the commit region of b1) is chosen to be  $15\times15$ . The blue (purple) baseline is the LER of two isolated patches, each with size  $15\times29$  ( $15\times15$ ), taken over 15 rounds and with logical errors along the edge with d=15.

with w, and is below the baseline before w is grown to more than half the commit region width. Compared to the linear case presented earlier, the parallel windows can be seen to make use of the extra distance present during this logical operation — the buffers extend in both directions rather than one direction. The difference between Figures 14a and b is due to the order in which the windows act. In (a), the case with vertical logical operator has higher errors because the intersection of the merged patch and the commit regions of o2 and b3 have short horizontal edges, and o2 and b3 are scheduled before their neighbor g2. This does not apply to (b) because in the lower row of (b), the middle window acts first.

# E. Implications

The results in this section could help us determine the buffer width w that is needed for achieving high accuracy. Linear settings (Figure 7) have larger buffer width requirements than multi-row settings (Figure 13), and the logical error rate in the linear setting is dominated by errors along the shorter edge. Therefore, it suffices to choose w based on one's error budget and the physical noise level, according to the results in Section V-B. In general, w should be around d/2. Its size could be larger if the physical noise level is low or if a closer match to the logical error rate of the global decoder is required. Conversely, it could be smaller if the physical noise level is high or if a looser match is acceptable.

Section V-C shows that the size of the windows (or specifically, the intersection of a patch with a window) also bound the fidelity that can be achieved. This is acceptable as long as a window is wider than an isolated patch of code. One mapping/compiling constraint should be noted though. When a patch spans multiple commit regions, and its intersection with one of the windows is too small, the type of logical errors in Figure 12 can be large. This is the case in Figure 13a, where the intersection of the patch with the commit region of o2 has width d/2. When w is at least half of d, o2's total width is at least d, so the contribution from this type of error is not too large. However, the intersection of a patch with a commit region should not be smaller than half the commit region.

#### VI. THROUGHPUT

In this section, we shift the focus to throughput and study how large the window size can grow before the throughput limit is reached. We quantify throughput by the average time that it takes for a decoder to process one round of measurements. Since each surface code measurement cycle takes  $\sim 1~\mu s$  on a superconducting qubit-based quantum computer, we set it as the standard that a real time decoder should meet.

Figure 15 shows a flowchart of a real time decoder using the spatially parallel windows. Each window receives information from the control system measuring the qubits and its neighboring windows from previous layers. After decoding, each window sends the committed corrections to the runtime and its artificial defects to neighboring windows in later layers.

This results in a pipelined scheme where the throughput is limited by the slower of two components: the inter-window



Fig. 15: A flowchart of the parallel decoding scheme, the colored squares correspond to the windows in Figure 6.

communication or the inner decoder. In this section, we first show that the inter-window communication will not be a bottleneck for an FPGA-based implementation, then discuss the scalability of inner decoders and how it affects the maximum window size that an FPGA implementation can have. To achieve larger window sizes than supported by FPGA-based inner decoders, one may turn to more resource efficient ASICs [2].

#### A. Inter-window communication

The throughput of the communication links depends on the number of bits that it takes to transmit information on the artificial defects. As explained in Section IV-A, the only information that a decoder passes to its neighbor in a subsequent layer is the artificial defects. The length of this message is determined by (1) the number of bits required to represent an artificial defect and (2) the expected number of artificial defects that arise during decoding.

Quantity (1) depends on the number of different locations for artificial defects. Between two neighboring windows A and B, this is proportional to the area of  $A_{\text{commit}} \cap B$ , the intersection of window B and the commit region of A. Suppose we use the staggered square configuration, where the commit regions are  $d \times d$  and d cycles are processed at once. Then, between any pair of neighboring windows, the number of possible artificial defect positions is  $\sim d^2/2$  for either the X or the Z decoding graphs. This translates to  $\lceil \log_2(d^2/2) \rceil$  bits for representing each artificial defect — 9 bits for d ranging from 23 to 32, and 12 bits for d from 65 to 89.

To study quantity (2), we look at the distributions of artificial defects recorded from the simulations in Section V-B (Figure 16). We found that the distributions are independent of w and heavily dependent on p. Distributions from different values of p can be fit with the Poisson distribution with different means  $(\lambda)$ , with the mean of the distribution roughly proportional to p. When we compared the data from Section V-B and Section V-C, we find that the mean of the distribution is also proportional to the area of  $A_{\text{commit}} \cap B$ . This is not surprising: p is proportional to the chance that an



Fig. 16: The number of artificial defects, from the same simulations as Figure 8. The curves are fit to the Poisson distribution.

artificial defect arises at each possible location, and the area of  $A_{\text{commit}} \cap B$  is associated with the number of such locations.

Given p and the area of the face where the artificial defects lie, we can estimate the mean count of artificial defects by extrapolating from our simulation results. For example, at p=0.1% and  $d\sim30$ , the mean count of artificial defects is  $\sim5$ . Then for each of the X and Z decoding graphs, it takes about  $5\times9=45$  bits to transmit the artificial defect information, without further optimization. This sums up to  $\sim90$  bits in total, for a link between two neighboring windows.

For an FPGA-based implementation, each copy of the inner decoder is implemented on one FPGA and acts on one window. Therefore, the inter-window communication is between FPGAs. For reference, the communication between two Xilinx FPGAs using a standard high-speed serial communication link is 54 clock cycles, plus 1 clock cycle per 64-bits [33]. On FPGAs, clock cycles of 4 ns (250 MHz) can be routinely achieved, even at high resource utilization. For windows with  $d \times d$  commit regions arranged as in Figure 6, we have shown that, at p = 0.1%, the information passed between two neighbors is at most  $\sim 94$  bits for  $d \leq 30$ , if d rounds of syndrome is processed at once. That means each of these inter-FPGA links will take roughly 56 clock cycles, or  $\sim 224$  ns if running at 250 MHz. Divided by d, the cost of communication is only 7 ns per measurement cycle, well below the requirement of 1  $\mu s$ . We also calculated the communication overhead for larger d, and found that it decreases with d in the range that is relevant. At d = 151, the average cost of an inter-window communication is below 3 ns per measurement cycle. We performed the same calculation at a high physical noise of p = 0.5%, and found that the overhead is 44 ns per round at d = 5 and steadily drops to below 9 ns at d=151. This is because when we divide the link latency by d, the contribution from the constant latency (54 clock cycles) decreases as d increases, which outweighs the increase in the other term that is proportional to  $d^2$ . We conclude that the inter-window communication does not limit the size of windows.

#### B. Inner decoders

The throughput of the inner decoder depends on the size of the windows and the choice of decoder. Since spatially parallel windows can be layered over any inner decoder, and the development of real time decoders is still an active field, there is no clear answer for the maximum size of the window. Nonetheless, we can discuss the scalability of real time decoders that are currently available. Note that we also view the link from the control system to a window and the one from a window to the runtime as parts of an inner decoder.

An inner decoder's scalability bottleneck depends on its design. As examples, we look at two FPGA-based real time decoders that work for at least medium-sized codes and have different scalability challenges. Helios [31] is a distributed version of a Union-Find decoder that exploits parallel computing resources for speedup. It has been demonstrated on a  $21 \times 21$ patch and operates with  $O(d^3)$  processing elements, each assigned to a node in the decoding graph and communicates via shared memory. Its average runtime per measurement round decreases with d, as long as d is not large enough to require board to board communication between FPGAs (which is much slower than shared memory based communication). Aside from this limit, the main scalability challenge of Helios is the  $O(d^3)$  scaling of its computing resources. A recent update of Helios [21] improved its scalability to a  $51 \times 51$ patch at the cost of increased latency.

Riverlane's Collision Clustering (CC) decoder [2] is another UF-based decoder, with demonstrated implementations on both FPGA and ASIC. The advantage of CC is its efficient use of storage resources, so its scalability is not constrained by hardware capacity, but instead constrained by the throughput. On an FPGA, its average execution time per measurement round increases with d, but is below 1  $\mu s$  until d=23. Like Helios, each instance of the CC decoder also needs to fit on 1 FPGA due to the usage of shared memory.

The maximum window size supported by the FPGA is constrained by the scalability of the inner decoder. The decoding graphs in [2], [31] are cubes, but for the spatially parallel windows, the number of measurement rounds that need to be processed at once does not depend on the full width of the window, but rather the width of the commit region. This would decrease the volume of the decoding graph by roughly half, but in practice, applying sliding windows along the time dimension would increase the volume of the decoding graph by 2X. Therefore we expect that volume of the decoding graph in a window would be similar to the ones used in [2], [31], which means an inner decoder for d = 51translates to a commit region width of  $\sim 25$  for the spatially parallel windows, assuming that the buffer is slightly wider than half of the commit region. We conclude that FPGAbased implementations of real time decoders connected into spatially parallel windows will be able to support at least medium term demonstrations of logical operations. In the long term, ASIC-based implementations should be able to support larger code distances, because they can be optimized for higher performance and lower cost [2].

How does the maximum window size depend on the phys-

ical noise p? Recall from Section V that when p is lower, a wider buffer is required for the spatially parallel windows to achieve good accuracy. Wider buffers lead to larger windows, which impose higher requirements on the inner decoder. But it is worth noting that most decoders also run faster at low p [22], which is natural since physical errors are sparse at low p. If the main constraint on the scalability of an inner decoder is its throughput, as in the case of the CC decoder, then these are competing factors that determine the maximum window size it can support at low p.

# VII. CONCLUSION

Spatially parallel windows make real time decoding of QEC codes scalable for the large patches of codes that arise during fault-tolerant quantum logical operations. We show that the scheme is compatible with the constraints imposed by using hardware accelerators, with proper window configuration. We perform numerical studies of the decoding accuracy and investigate the mechanisms through which the spatially parallel windows might introduce additional logical errors. The implications of the results impose constraints on mapping and introduce requirements on the window size and buffer width. We specifically study how the requirement on buffer width depends on the physical noise level. We assess the throughput of the decoding scheme and analyze the maximum window size that can be supported before the throughput falls below the requirement. We find that the communication between neighboring windows never becomes the bottleneck in the decoding scheme, so the maximum window size is limited by the scalability of the inner decoders. Since running on larger windows decreases the throughput of the inner decoders and/or requires more hardware resources, the window size and the buffer width need to be carefully chosen to achieve a balance between throughput/cost and accuracy.

#### **ACKNOWLEDGEMENTS**

This work is funded in part by the STAQ project under award NSF Phy-1818914/232580 and in part by the NSF Quantum Leap Challenge Institute for Hybrid Quantum Architectures and Networks (NSF Award 2016136).

#### REFERENCES

- [1] "Suppressing quantum errors by scaling a surface code logical qubit," *Nature*, vol. 614, no. 7949, pp. 676–681, 2023.
- [2] B. Barber, K. M. Barnes, T. Bialas, O. Buğdaycı, E. T. Campbell, N. I. Gillespie, K. Johar, R. Rajan, A. W. Richardson, L. Skoric *et al.*, "A real-time, scalable, fast and highly resource efficient decoder for a quantum computer," *arXiv preprint arXiv:2309.05558*, 2023.
- [3] M. E. Beverland, B. J. Brown, M. J. Kastoryano, and Q. Marolleau, "The role of entropy in topological quantum error correction," *Journal of Statistical Mechanics: Theory and Experiment*, vol. 2019, no. 7, p. 073404, 2019.
- [4] H. Bombín, C. Dawson, Y.-H. Liu, N. Nickerson, F. Pastawski, and S. Roberts, "Modular decoding: parallelizable real-time decoding for quantum computers," arXiv preprint arXiv:2303.04846, 2023.
- [5] S. Bravyi and A. Vargo, "Simulation of rare events in quantum error correction," *Physical Review A*, vol. 88, no. 6, p. 062308, 2013.
- [6] C. Chamberland and E. T. Campbell, "Circuit-level protocol and analysis for twist-based lattice surgery," *Physical Review Research*, vol. 4, no. 2, p. 023090, 2022.

- [7] C. Chamberland and E. T. Campbell, "Universal quantum computing with twist-free and temporally encoded lattice surgery," *PRX Quantum*, vol. 3, no. 1, p. 010331, 2022.
- [8] C. Chamberland, L. Goncalves, P. Sivarajah, E. Peterson, and S. Grimberg, "Techniques for combining fast local decoders with global decoders under circuit-level noise," *Quantum Science and Technology*, vol. 8, no. 4, p. 045011, 2023.
- [9] P. Das, A. Locharla, and C. Jones, "Lilliput: a lightweight low-latency lookup-table decoder for near-term quantum error correction," in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022, pp. 541–553.
- [10] P. Das, C. A. Pattison, S. Manne, D. M. Carmean, K. M. Svore, M. Qureshi, and N. Delfosse, "Afs: Accurate, fast, and scalable errordecoding for fault-tolerant quantum computers," in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022, pp. 259–273.
- [11] N. Delfosse, A. Paz, A. Vaschillo, and K. M. Svore, "How to choose a decoder for a fault-tolerant quantum computer? the speed vs accuracy trade-off," arXiv preprint arXiv:2310.15313, 2023.
- [12] N. Delfosse and G. Zémor, "Linear-time maximum likelihood decoding of surface codes over the quantum erasure channel," *Physical Review Research*, vol. 2, no. 3, p. 033042, 2020.
- [13] E. Dennis, A. Kitaev, A. Landahl, and J. Preskill, "Topological quantum memory," *Journal of Mathematical Physics*, vol. 43, no. 9, pp. 4452– 4505, 2002.
- [14] J. Edmonds, "Maximum matching and a polyhedron with 0, 1-vertices," Journal of research of the National Bureau of Standards B, vol. 69, no. 125-130, pp. 55–56, 1965.
- [15] J. Edmonds, "Paths, trees, and flowers," Canadian Journal of mathematics, vol. 17, pp. 449–467, 1965.
- [16] A. Erhard, H. Poulsen Nautrup, M. Meth, L. Postler, R. Stricker, M. Stadler, V. Negnevitsky, M. Ringbauer, P. Schindler, H. J. Briegel et al., "Entangling logical qubits with lattice surgery," *Nature*, vol. 589, no. 7841, pp. 220–224, 2021.
- [17] A. G. Fowler, S. J. Devitt, and C. Jones, "Surface code implementation of block code state distillation," *Scientific reports*, vol. 3, no. 1, p. 1939, 2013
- [18] A. G. Fowler and C. Gidney, "Low overhead quantum computation using lattice surgery," arXiv preprint arXiv:1808.06709, 2018.
- [19] A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, "Surface codes: Towards practical large-scale quantum computation," *Physical Review A*, vol. 86, no. 3, p. 032324, 2012.
- [20] C. Gidney, "Stim: a fast stabilizer circuit simulator," Quantum, vol. 5, p. 497, Jul. 2021. [Online]. Available: https://doi.org/10.22331/q-2021-07-06-497
- [21] N. Godawatte Liyanage, Y. Wu, S. Tagare, and L. Zhong, "Multi-fpga union-find decoder for surface codes," *Bulletin of the American Physical* Society, 2024.
- [22] O. Higgott and C. Gidney, "Sparse blossom: correcting a million errors per core second with minimum-weight matching," arXiv preprint arXiv:2303.15933, 2023.
- [23] A. Holmes, M. R. Jokar, G. Pasandi, Y. Ding, M. Pedram, and F. T. Chong, "Nisq+: Boosting quantum computing power by approximating quantum error correction," in 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2020, pp. 556–569
- [24] D. Horsman, A. G. Fowler, S. Devitt, and R. Van Meter, "Surface code quantum computing by lattice surgery," *New Journal of Physics*, vol. 14, no. 12, p. 123011, 2012.
- [25] A. R. İyengar, M. Papaleo, P. H. Siegel, J. K. Wolf, A. Vanelli-Coralli, and G. E. Corazza, "Windowed decoding of protograph-based ldpc convolutional codes over erasure channels," *IEEE Transactions on Information Theory*, vol. 58, no. 4, pp. 2303–2320, 2011.
- [26] A. Kitaev, "Fault-tolerant quantum computation by anyons," *Annals of Physics*, vol. 303, no. 1, pp. 2–30, 2003. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0003491602000180
- [27] S. Krinner, N. Lacroix, A. Remm, A. Di Paolo, E. Genois, C. Leroux, C. Hellings, S. Lazar, F. Swiadek, J. Herrmann *et al.*, "Realizing repeated quantum error correction in a distance-three surface code," *Nature*, vol. 605, no. 7911, pp. 669–674, 2022.
- [28] L. Lao, B. van Wee, I. Ashraf, J. Van Someren, N. Khammassi, K. Bertels, and C. G. Almudever, "Mapping of lattice surgery-based quantum circuits on surface code architectures," *Quantum Science and Technology*, vol. 4, no. 1, p. 015005, 2018.
- [29] D. Litinski, "A game of surface codes: Large-scale quantum computing with lattice surgery," *Quantum*, vol. 3, p. 128, 2019.

- [30] D. Litinski and F. von Oppen, "Lattice surgery with a twist: simplifying clifford gates of surface codes," *Quantum*, vol. 2, p. 62, 2018.
- [31] N. Liyanage, Y. Wu, A. Deters, and L. Zhong, "Scalable quantum error correction for surface codes using fpga," arXiv preprint arXiv:2301.08419, 2023.
- [32] G. S. Ravi, J. M. Baker, A. Fayyazi, S. F. Lin, A. Javadi-Abhari, M. Pedram, and F. T. Chong, "Better than worst-case decoding for quantum error correction," in *Proceedings of the 28th ACM International* Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2023, pp. 88–102.
- [33] Vivado Design Suite, Aurora 64B/66B LogiCORE IP Product Guide (PG074), AMD.
- [34] L. Skoric, D. E. Browne, K. M. Barnes, N. I. Gillespie, and E. T. Campbell, "Parallel window decoding enables scalable fault tolerant quantum computation," *Nature Communications*, vol. 14, no. 1, p. 7040, 2023
- [35] S. C. Smith, B. J. Brown, and S. D. Bartlett, "Local predecoder to reduce the bandwidth and latency of quantum error correction," *Physical Review Applied*, vol. 19, no. 3, p. 034050, 2023.
- [36] D. B. Tan, M. Y. Niu, and C. Gidney, "A sat scalpel for lattice surgery: Representation and synthesis of subroutines for surface-code fault-tolerant quantum computing," in 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024, pp. 325–339.
- [37] X. Tan, F. Zhang, R. Chao, Y. Shi, and J. Chen, "Scalable surface code decoders with parallelization in time," arXiv preprint arXiv:2209.09219, 2022
- [38] B. M. Terhal, "Quantum error correction for quantum memories," Reviews of Modern Physics, vol. 87, no. 2, p. 307, 2015.
- [39] Y. Ueno, M. Kondo, M. Tanaka, Y. Suzuki, and Y. Tabuchi, "Qecool: On-line quantum error correction with a superconducting decoder for surface code," in 2021 58th ACM/IEEE Design Automation Conference (DAC). IEEE, 2021, pp. 451–456.
- [40] S. Vittal, P. Das, and M. Qureshi, "Astrea: Accurate quantum error-decoding via practical minimum-weight perfect-matching," in *Proceedings of the 50th Annual International Symposium on Computer Architecture*, 2023, pp. 1–16.
- [41] Y. Wu, N. Liyanage, and L. Zhong, "An interpretation of union-find decoder on weighted graphs," arXiv preprint arXiv:2211.03288, 2022.
- [42] Y. Wu and L. Zhong, "Fusion blossom: Fast mwpm decoders for qec," arXiv preprint arXiv:2305.08307, 2023.
- [43] Y. Zhao, Y. Ye, H.-L. Huang, Y. Zhang, D. Wu, H. Guan, Q. Zhu, Z. Wei, T. He, S. Cao et al., "Realization of an error-correcting surface code with superconducting qubits," *Physical Review Letters*, vol. 129, no. 3, p. 030501, 2022.