Fermat's Library | Scalable Neural Network Decoders for Higher Dimensional Quantum Codes annotated/explained version.

Scalable Neural Network Decoders

for Higher Dimensional Quantum Codes

N. P. Breuckmann

and X. Ni

1,2

Institute for Quantum Information, RWTH Aachen University, Germany

Max Planck Institute of Quantum Optics, Germany

May 16, 2018

Machine learning has the potential to

become an important tool in quantum er-

ror correction as it allows the decoder

to adapt to the error distribution of a

quantum chip. An additional motivation

for using neural networks is the fact that

they can be evaluated by dedicated hard-

ware which is very fast and consumes little

power. Machine learning has been previ-

ously applied to decode the surface code.

However, these approaches are not scal-

able as the training has to be redone for

every system size which becomes increas-

ingly diﬃcult. In this work the existence

of local decoders for higher dimensional

codes leads us to use a low-depth convo-

lutional neural network to locally assign

a likelihood of error on each qubit. For

noiseless syndrome measurements, numer-

ical simulations show that the decoder has

a threshold of around 7.1% when applied

to the 4D toric code. When the syndrome

measurements are noisy, the decoder per-

forms better for larger code sizes when

the error probability is low. We also give

theoretical and numerical analysis to show

how a convolutional neural network is dif-

ferent from the 1-nearest neighbor algo-

rithm, which is a baseline machine learning

method.

1 Introduction

A full-featured quantum computer will rely on

some form of error correction as physical qubits

are prone to the eﬀects of environmental noise.

The authors contribute equally to this work.

When using error correcting codes, decoders play

a large role in the performance of the fault-

tolerant protocols. Using neural networks to de-

code the surface code has been suggested in ear-

lier works [3, 21, 36, 37]. The primary motivation

inspiring these works to use neural networks is

their ability to adapt to the error model. How-

ever, an issue that has not been addressed so far

is scalability: in previous approaches, the neu-

ral networks have to be re-trained for every sys-

tem size, despite the fact that in general machine

learning methods become problematic when the

input space becomes too large. Thus, it is in-

teresting to ask whether there exists a family of

quantum error correcting codes, which we can de-

code using neural networks in a clearly scalable

way.

To provide a positive answer to this question,

we introduce a decoder for the four-dimensional

version of the toric code based on neural net-

works for which the training only has to be per-

formed once on a small system size. Afterwards

the decoder can be scaled up to arbitrary system

sizes without re-training. The layout of the net-

work is a convolutional neural network which are

widely used as a building block in image recogni-

tion. Furthermore, the neural network is con-

stant depth with respect to the scaling of the

system size. Our approach is informed by the

existence of local decoders for the 4D toric code

[6, 7, 11, 28]. A drawback of those decoders

is their low threshold of less than 2% against

independent noise (assuming perfect syndrome

measurements). The (global) minimum-weight

decoder, on the other hand, has a threshold of

10.1% [33] but it is not known to be computa-

tionally eﬃcient. Our decoder shown a threshold

of 7.1% while having a complexity close to local

decoders. This result is very close to the 7.3%

Accepted in Quantum 2018-05-04, click title to verify 1

arXiv:1710.09489v3 [quant-ph] 14 May 2018

reported in [14] which uses a renormalization ap-

proach. When the syndrome measurements are

noisy, our decoder still performs better for larger

lattice when error probability is low, see Fig-

ure 10 for the plot. More numerical simulation

needs to be done to determine the threshold.

This paper is structured as follows. In sec-

tion 2 we discuss previous work in which ma-

chine learning was applied to decode quantum

codes. In section 3 we give a review of the 4D

toric code and its properties. In section 4, we

provide a general introduction to machine learn-

ing. In section 5 we describe the application of

convolutional neural networks as a subroutine in

our decoding algorithm. In section 6 we discuss

the results of Monte-Carlo simulations. In Ap-

pendix A, we give a short introduction to the

backpropagation algorithm which we use to train

the neural network. In Appendix D, we give a

toy example which shows how a baseline machine

learning model can fail at a task similar to de-

coding when the system size becomes large. This

highlights the importance of choosing a transla-

tional invariant model as we do. In Appendix E,

we observed and analyzed one property of multi-

layer convolutional neural networks, and show it

ﬁts nicely to the task of assigning likelihood of

qubit errors for high dimensional toric code.

2 Previous Work

It is known for a while in the classical error cor-

rection community that the errors and syndromes

of a parity checking code can be put together into

a probabilistic graphical model, which is a major

tool for machine learning. Probabilistic graphical

models include for example Bayesian networks

and the Ising model (see [39] for an introduction

to probabilistic graphical model and connection

to error correcting codes). In [15] (and references

therein), it is shown that by using belief propa-

gation, the decoder can achieve very good perfor-

mance for LDPC and turbo code that are close to

the Shannon limit. We want to note that while

there is no formal connection (to our knowledge)

between neural networks and graphical models, a

graphical model of a problem is often helpful for

designing the architecture of the neural network.

For example, in [25] the structure of the Tanner

graph is used to construct a neural network, in

order to improve belief propagation for (classical)

codes that have high density checks.

In [29], the authors discussed applying be-

lief propagation to decoding of quantum stabi-

lizer codes. While stabilizer codes and classi-

cal parity checking codes share many similarities,

the authors highlighted several issues that might

cause belief propagation to work much worse for

stabilizer codes (even with low density checks).

In [12, 13], the authors used belief propagation as

a subroutine in the renormalization decoder for

topological codes, with the purpose of synchro-

nizing marginal probabilities across neighbouring

cells.

Neural network decoder for quantum stabilizer

codes has been studied in [3, 21, 36, 37] (more

precisely in [36] hidden Boltzmann machines are

used instead of neural networks). Our work will

use a similar scheme as [21, 36], where the neural

networks predict which qubits have undergone an

error. In [3, 37], the neural networks output 1 or

2 bits which will correct the ﬁnal measurement of

the logical operators. The main diﬀerence of our

work is that our machine learning decoder is nat-

urally scalable through the use of convolutional

neural networks.

3 The 4D Toric Code

3.1 Deﬁnition

The 4D toric code is a stabilizer quantum code

deﬁned on a four-dimensional (4D) hypercubic

lattice which was ﬁrst considered in [11]. We will

consider periodic boundary conditions so that

topologically the lattice is a 4D torus. The faces

of the lattice are squares and they are identi-

ﬁed with the qubits. From now on we will use

the words face and qubit interchangeably. The

number of faces in a lattice of side-length L is





= 6L

. This follows from the fact that

every face is uniquely determined by a vertex v

and two coordinates i, j ∈ {x, y, z, w}, so that the

face with base-point v lies in the i-j-plane. More

generally, the number of k-dimensional objects

in the lattice (1-dimensional objects would be

edges or 3-dimensional objects would be cubes)

is given by





. There are stabilizer checks for

every edge and cube in the lattice. A stabilizer

check associated with a particular edge acts as

Pauli-X on all faces which are incident to it (see

Figure 1 (left)), whereas a stabilizer check associ-

ated with a particular cube acts as Pauli-Z on all

Accepted in Quantum 2018-05-04, click title to verify 2

Figure 1: The stabilizer checks of the 4D toric code

correspond to edges which act as Pauli-X on all qubits

incident to an edge (left) and cubes which act as Pauli-Z

on all qubits incident to an edge (right).

faces incident to this particular cube (see Figure 1

(right)). All stabilizer checks act on 6 qubits and

each qubit is acted upon by 8 stabilizer checks.

Consider an operator E acting as Pauli-Z on

some subset of qubits. For E to commute with

an X-check their overlap has to be even. Hence,

to commute with all X-checks E can not contain

any connected components that have a bound-

ary where a single face is incident to an edge.

Geometrically this means that E has to be a col-

lection of boundaryless (closed) surfaces. If E

itself is the boundary of a 3D volume inside the

hypercubic lattice then it is the product of all Z-

stabilizer elements inside this volume. But there

are other examples of surfaces which do not have

a boundary: Since the topology of the lattice

is non-trivial we can consider a sheet extending

over the whole x-y-plane. Due to the periodic-

ity of the lattice this sheet has no boundary, but

is not itself the boundary of any 3D volume and

thus it is not the product of any Z-stabilizers.

It is therefore a logical operator. There is one

such operator for every plane in the lattice. Each

plane is labelled by two coordinates (x-y-plane,

x-z-plane, . . . ) so that the 4D toric code en-

codes





= 6 logical qubits. A sheet extending

through the whole lattice consists of at least L

faces which means that the 4D toric code has dis-

tance growing quadratically with the number of

physical qubits. The parameters of the code are

[[n = 6L

, k = 6, d = L

]].

In [33] the authors argue that the threshold

of the 4D toric code against independent bit-ﬂip

and phase-ﬂip errors is approximately p

≈ 0.11

under minimum-weight decoding. This is the

same threshold as for the 2D toric code and the

surface code.

3.2 Syndromes

A feature of the 4D toric code is that the set of

violated stabilizer checks form extended objects:

since we pick up the boundary of an error (which

is a collection of surfaces), the syndrome will al-

ways form closed loops as opposed to a set of

points for the 2D toric code or surface code. A

minimum-weight decoding, similar to minimum-

weight perfect matching for the 2D toric code, is

ﬁnding a minimum-area surface among all sur-

faces that have the syndrome as its boundary.

It is known that the problem of ﬁnding such

a minimum-are surface can be solved eﬃciently

when the lattice is three-dimensional [32], but it

is open whether it can be solved eﬃciently for a

4D lattice. However, there are several decoders

with worse error-correction performance than the

minimum weight decoder but which are compu-

tationally eﬃcient [1, 7, 11, 18, 28]. The com-

mon way these decoders operate is by iteratively

shortening the syndrome loops by ﬂipping nearby

faces.

The fact that the syndromes form closed loops

can be understood in terms of local linear depen-

dencies between the stabilizer elements: Taking

all X-checks (edges) which are incident to a par-

ticular vertex gives the identity operator. This is

the case because for every triple of vertex, edge

and face which are pairwise incident to one an-

other we can always ﬁnd one more edge which

is incident to the same vertex and face (see Fig-

ure 2). Similarly, taking all Z-checks (cubes) of

one hypercube gives the identity since there are

two cubes are overlapping on every face of the

hypercube.

Figure 2: The dependency of the edge stabilizers which

are incident to a common vertex. Taking the product

of all edge-stabilizers (red) incident to a common vertex

(red) gives the identity.

Each of those local linear dependencies can

be interpreted as a (classical) parity check on

the syndrome. More explicitly: Both the X-

Accepted in Quantum 2018-05-04, click title to verify 3

syndrome and the Z-syndrome are encoded by

a classical linear code. The code words of this

code are the valid syndromes. What are the pa-

rameters of this code? - There is a bit for every

X-check (edge) in the lattice, so the block size

of the code is 4L

. There is a local dependency

for every vertex. However, the local dependen-

cies themselves are not independent. Taking the

product over all vertices and all edge-checks in-

cident to this vertex gives the identity since ev-

ery edge is incident to two vertices. Hence, the

number of independent checks in the classical lin-

ear code encoding the syndrome information is

−1. The encoded syndrome information there-

fore contains 4L

−(L

−1) = 3L

+ 1 bits. The

distance of the classical code is 4 since adding

the boundary of a face takes us from one valid

syndrome to another valid syndrome.

Since the syndrome of the 4D toric code is

encoded it has some build-in robustness against

syndrome errors. In comparison, it is known

that the 2D toric code does not have a decoding

threshold in the presence of syndrome noise. The

parity measurements have to be repeated and the

record of the repeated syndrome measurements

is decoded. This essentially implements a repe-

tition code in time. Repeating measurements is

not necessary for the 4D toric code to have in-

creased error suppression capabilities with larger

system size. The fact that measurements are not

repeated is referred to as single-shot measure-

ments. It has been shown analytically in [5] that

a single-shot decoder can have a threshold and

single-shot decoders for the 4D toric code have

analyzed numerically in [2, 7, 14].

As 4D space is hard to visualize it is useful

to consider the 3D toric code to gain some geo-

metric intuition. It is deﬁned on a 3D cubic lat-

tice with periodic boundaries. Similarly to the

4D toric code the qubits are identiﬁed with the

faces of the lattice and the X-stabilizer checks

act on all faces incident to an edge (weight 4)

while the Z-stabilizer checks act on all faces inci-

dent to a cube (weight 6). A boundaryless sheet

acting as Pauli-Z commutes with all X-stabilizer

checks just as for the 4D toric code. The X-

stabilizer checks also satisfy the local linear de-

pendencies shown in Figure 2. To commute with

all Z-stabilizer checks it suﬃces to take a closed

loop of Pauli-X operators in the dual lattice. The

existence of these string-like logical operators re-

sults in a threshold which is generally lower under

minimum-weight decoding [33, 38].

4 Basics of Machine Learning

To have a better understanding of our decod-

ing procedure, it is useful to ﬁrst give a small

overview of machine learning. This will provide

insights to advantages and limitations of our pro-

cedure, and possible ways to improve it. For more

extensive review on machine learning and in par-

ticular neural networks, we refer to [17, 22, 26].

Machine learning is often categorized into su-

pervised learning, unsupervised learning and re-

inforcement learning, each has a diﬀerent focus.

However, all of them usually involve the use of

some models. In this paper, the word “model”

simply means a family of functions f

or sub-

routines s

, where the (most likely multidimen-

sional) parameter α needs to be trained by cer-

tain learning algorithms. Below we give a not

very comprehensive introduction to these three

kinds of learning

• Supervised learning is to ﬁnd α such that

= f

hidden

, where f

hidden

is given im-

plicitly by a dataset of input-output pairs

, y

= f

hidden

)).

• Unsupervised learning refers to many diﬀer-

ent tasks which are related to ﬁnding struc-

tures in a dataset. In contrary to supervised

learning, there is no desired output for each

entry in the dataset. For example, consider-

ing tasks in error correction, inferring prop-

erties of the error model from a dataset of

syndromes or using the dataset to generate

new similar syndromes can be viewed as un-

supervised learning, as well as some tasks

studied in [9].

• Reinforcement learning is concerned with

how agents ought to take actions in an en-

vironment so as to maximize some notion of

cumulative reward. However, in this paper

we will use the word “reinforcement learn-

ing” to denote the optimization of a subrou-

tine s

of a program, so that at the end of the

program a predeﬁned score is maximized.

For complicated tasks, it is beneﬁcial to com-

bine these approaches. A notable example which

Accepted in Quantum 2018-05-04, click title to verify 4

highlights this is the recent success of the AI Al-

phaGo [30, 31]. We give a short summary of this

combined approach in Appendix C and discuss

some ideas for applying it to the decoding prob-

lem of quantum codes.

One major diﬃculty encountered in machine

learning is the overwhelmingly large input/state

space. Here we will only give a short explanation

of how this diﬃculty arises in supervised learn-

ing. Recall that we are given a dataset (x

, y

)

with the goal being approximating f

hidden

. If the

input space is large, then for a random x, it is

unlikely that x is “similar” to any of the x

from

the dataset. Therefore, it is hard to guess the

value of f

hidden

(x) and approximate it by f

(x).

This is why often the most important part of a

machine learning project is to choose a suitable

model, usually by encoding the prior knowledge

of the task into it. The model we use in this pa-

per is neural networks which we introduce in the

following section.

4.1 Basics of Neural Networks

A neural network is a directed, multipartite

graph consisting of layers l = 0, . . . , L. The ver-

tices in layer l are connected to vertices in the

following layer l + 1.

The vertices of the network are called neu-

rons. The main idea behind the neural network

is the following: Each neuron computes a primi-

tive non-linear function f

w,b

: R

→ R, for which

the input values are given by neurons of the

previous layer connected to it. The subscripts

w ∈ R

and b ∈ R are parameters which can dif-

fer for each neuron. Before we discuss the func-

tion f

w,b

in more detail let us ﬁrst understand

how the network performs a computation. The

neurons in the ﬁrst layer do not have any pre-

decessors and their output is simply set to be

the input of the network, which is a bit string

x ∈ {0, 1}

. The values of the neurons in the

last layer are interpreted as the output of the

network. We see that the network describes a

function F : {0, 1}

→ [0, 1]

. The ﬁrst layer

l = 0 is called the input layer and the last layer

l = L is called the output layer. All other layers

l = 1, . . . , L−1 are called hidden layers since they

are considered to be internal to the network.

The parameters w and b are called weights and

biases. They deﬁne a linear map w · y + b where

y is the input of the neuron. The function that

(x)

Input layer

l = 0

Hidden layer

l = 1

Ouput layer

l = 2

Figure 3: A neural network consisting of 3 layers. The

network takes input x ∈ {0, 1}

which is represented

as the ﬁrst layer of neurons. The values of neurons in

the hidden layer and the output layer are given by the

function f

w,b

: R

→ [0, 1] evaluated on their input

(indicated by q incoming arrows). The parameters w ∈

and b ∈ R, called weights and bias, can be diﬀerent

for each neuron. The values of the neurons in the last

layer are the output of the network F

(x) ∈ R

each neuron computes has the form:

w,b

: R

→ [0, 1], y 7→ σ(w · y + b) (1)

where in this paper σ is either tanh or the sigmoid

function

σ(z) =

1 + exp(−z)

(2)

which is plotted in Figure 4. Both non-linear

functions can be thought of as a smoothed step

function. The smoothness of σ is important to

the training of the neural networks as it is based

on optimizing certain objective function with gra-

dient descent.

Let us have a look at a small example which

gives some intuition why neural networks are able

to perform interesting computations: Consider a

single neuron which takes q = 2 inputs and has

weights w = (−12, −12) and bias b = 17. The

neuron computes the following values for each in-

put:

f(0, 0) = σ(17) ≈ 1

f(1, 0) = σ(5) ≈ 1

f(0, 1) = σ(5) ≈ 1

f(1, 1) = σ(−7) ≈ 0.

(3)

Accepted in Quantum 2018-05-04, click title to verify 5

−4 −2 0 2 4

0.5

σ(z)

Θ(z)

Figure 4: The sigmoid function σ is a smooth version of

the Heaviside step function Θ.

We observe that these are approximately the in-

put/output relations of the NAND gate. The

approximation can be made arbitrarily close by

increasing the absolute values of the weights w

and the bias b. Since any Boolean function

F : {0, 1}

→ {0, 1}

can be computed by a

network of NAND gates, there consequently also

exists a representation of F as a neural network.

However, in practice it is more eﬃcient to adapt

the connectivity of the network to the problem

at hand.

The process of training the network, i.e. grad-

ually adapting the values of the weights w and

biases b to make the network a desired function,

is described in Appendix A.

4.2 Convolutional Neural Networks

Compared to fully connected neural networks,

convolutional neural networks (CNN) require

much fewer parameters to describe. We will start

with the deﬁnition of one convolutional layer.

The input to the layer resides on a D-dimensional

lattice of size L

. On each lattice site, there is

a d-dimensional vector x

∈ R

, where the sub-

script u ∈ Z

(we use Z

to denote integer in

range [0, L − 1]). We deﬁne the kernel to be a

vector K

u,i

, where u ∈ Z

and i ∈ Z

. With a

slight abuse of notation, we will say such a kernel

has size n

. The convolution is then

u∈Z

i∈Z

v−u,i

u,i

, (4)

where x

v−u,i

is the ith element of x

v−u

, and each

element of v has range [n−1, L−1]. The index v

of the output y can be shifted according to needs

(e.g. to Z

L−n+1

). In this paper, we solely work

with input that has the periodic boundary con-

dition. Thus, we can indeed treat the indices u

of input x

as elements of the cyclic group Z

The convolution given in Equation 4 can then

be modiﬁed accordingly such that v ∈ Z

. If

there are r diﬀerent kernels, we can apply Equa-

tion 4 for each kernel individually and obtain

v,i

, where i ∈ Z

. We will say the resulted out-

put y

v,i

has r channels, and further convolution

can be again applied on y

v,i

according to Equa-

tion 4. Non-linear functions are usually applied

after each convolution. In this paper, convolu-

tional neural networks are neural networks with

only convolutional layers (where in computer vi-

sion, the convolutional neural networks usually

contains coarse-grain layers, etc).

The connectivity of a 1D convolutional neu-

ral networks is illustrated in Figure 13. We can

see that a convolutional layer is simply a neu-

ral network layer with local connectivity and the

weights are translationally invariant.

5 Machine Learning for Decoding

Quantum Codes

Decoding quantum error correcting codes can be

considered both as a supervised and reinforce-

ment learning problem.

Supervised learning: In most scenarios, we

can generate pairs of (syndrome, correct de-

coding), either by simulation on classical

computer or data from real experiments.

Thus, we can use the pairs to train suitable

models.

Reinforcement learning: Training certain

types of decoders is more naturally con-

sidered as reinforcement learning problem.

For example, we can leave some freedom

in the cellular automaton decoder to be

trained, with the goal of optimizing the

memory time. In this setting, there is not

a clear correct answer at each time step,

and therefore we cannot directly train the

cellular automaton with a input-output

relation.

One major diﬃculty of applying machine learn-

ing to decoding is the variable lattice size (or in

Accepted in Quantum 2018-05-04, click title to verify 6

general, variable code size of some code family).

Note that usually learning algorithms run on one

code size at a time, and it is fair to assume that

the training time will grow when we increase the

code size, possibly exponentially. Thus, it is not

trivial to train a machine learning decoder on a

larger lattice to utilize the error suppression abil-

ity provided by topological codes. Moreover, if

we restrict the machine learning models to neural

networks, it is tempting to directly train a fully-

connected neural nets with the pair (syndrome,

correct decoding). While this is theoretically pos-

sible, in practice it is very hard to achieve for

large lattices, due to the following:

1. Decoding quantum error correcting codes is

a quite complicated task. In order for the

fully-connected neural nets to approximate

the input-output relation, they need to have

large amount of parameters (either by hav-

ing many layers or many neurons in each

layer). This might already require too many

computational resources to set up the nets

or train them.

2. A neural net with large amount of parame-

ters in turn requires large amount of training

data, in order to not overﬁt.

See Appendix D for a relevant discussion on ob-

stacle of applying a baseline machine learning al-

gorithm to decoding problem on large lattice.

5.1 Architecture and Training

In this subsection, we will only concern about

Pauli-X stabilizer checks. We can pack the 6L

qubits into a multi-dimensional array of shape

(L, L, L, L, 6), and the Pauli-X stabilizer checks

into one of shape (L, L, L, L, 4). We choose to

use the following architecture for decoding:

Architecture: Given an X-syndrome with

shape (L, L, L, L, 4), a convolutional neural

net (CNN) is applied, with an output of

shape (L, L, L, L, 6). Pauli-X is applied to

one or multiple qubits according to the out-

put, and the error syndrome is updated ac-

cordingly. This process is repeated until no

syndrome is left or certain condition for halt-

ing is met. For the particular training pro-

cess we will describe shortly after, we choose

to ﬂip qubits corresponding to the largest

values in the outputs from the CNN.

Training: It is not clear how the CNN should

decide which qubits to ﬂip, thus a natural

approach would be optimizing the CNN to

achieve the best memory time. However, we

decide to explicitly train the CNN to esti-

mate whether a bit-ﬂip has occurred on each

qubit. While not optimal, it allows a much

faster training process and it is less prone to

ﬁnite size eﬀect potentially caused by rein-

forcement learning on a small lattice.

This architecture has a few desired properties.

First, the convolutional neural net is a constant-

depth (with respect to L) and translational-

invariant local circuit. As a result, each element

of the (L, L, L, L, 6) output is determined by its

surrounding region in the same way. Note that

this also means that the training only needs to be

done on a constant size region, which is a huge

advantage for high dimensional lattices as the

computational resource needed grows very fast

with respect to L. Once we trained the CNN on

a lattice with size L, we can naturally extend it

to a larger size L

> L, therefore we can evalu-

ate the performance of a ﬁxed CNN on several

diﬀerent size L. This is crucial for discussing the

error suppression ability by increasing the system

size, or performing an estimation of the decoding

threshold. We shall point out that the architec-

ture together with training procedure is designed

with 4D toric code in mind, where multiple local

decoders have already shown a threshold behav-

ior.

To some degree, the aim of our neural network

decoder is to develop a local decoding strategy

such as those analyzed in [7]. In order to see this,

we will give a brief introduction to the DKLP rule

and the Toom’s rule. For the DKLP rule, a qubit

is ﬂipped if 3 or 4 adjacent edge stabilizer checks

are violated, and is ﬂipped with probability 1/2

if only two are violated. This update rule should

not apply simultaneously to faces that share an

edge. Toom’s rule is deﬁned for a 2D lattice,

where each face is associated with a degree of

freedom which can take the values +1 and −1.

In each step of the Toom’s rule, the value of a

face will be ﬂipped if it is diﬀerent from both

its ‘north’ and ‘east’ faces’ values. It can be ap-

plied to the 4D toric code by applying it to every

2D plane of the 4D hypercubic lattice (e.g. ap-

plying to all x-y-planes and then applying to all

x-z-planes, etc). It is not hard to see that for

Accepted in Quantum 2018-05-04, click title to verify 7

both rules, the criteria which decides whether a

qubit should be ﬂipped or not can be written

as a neural network. After all, the neural net-

works can approximate any function on a ﬁnite

input spaces. Intuitively, the DKLP rule tries

to ﬂip qubits that likely have an error on them,

which our neural network decoder also intends

to achieve due to the speciﬁc training procedure.

On the other hand, Toom’s rule is not designed

with this goal in mind. Therefore, it will not be

obtained by our training procedure. This serves

as a reminder that some good decoders require

a diﬀerent training procedure to ﬁnd, e.g. some

kind of reinforcement learning. It is worth not-

ing that the schedule components of these rules

(i.e. applying Toom’s rule to 2D planes sequen-

tially and applying DKLP rule to non-adjacent

faces) are not contained in the architecture of

our decoder. If the goal is to accurately simulate

the DKLP and Toom’s rules, we can introduce

a clock variable as an additional input on each

face.

One detail of our architecture is that, as

we mentioned in the architecture paragraph,

we ﬂip qubits corresponding to the largest val-

ues in the outputs from the CNN. In order

to reduce the computational time of decod-

ing, we can ﬂip multiple qubits based on a

single evaluation of the CNN. In the numeri-

cal simulation described in section 6, the num-

ber of qubits to be ﬂipped is chosen to be

#(current violated syndrome checks)/x, where x

is around 50. We did not study the eﬀect of diﬀer-

ent choice of x. It can be viewed as a tunable pa-

rameter that requires further optimization. An-

other reasonable way is to ﬂip the qubits whose

corresponding values from the CNN’s output are

larger than a pre-determined threshold. If this

approach is used, the decoder will be completely

local with respect to the 4D lattice.

We also add an additional step in the archi-

tecture when decoding syndromes from perfect

measurements of check operators, which is to ap-

ply the “parallel line decoder” introduced in Ap-

pendix B. The motivation comes from the obser-

vation that without the parallel line decoder, a

large percentage of the failures are resulted from

the decoder getting stuck at syndromes that only

contains a few pairs of parallel lines (e.g. Fig-

ure 11). Failures of this type are called “energy-

barrier limited decoding” in [7]. As we do not

focus on building a completely local decoder, we

choose to use this simple and fast subroutine to

improve the performance.

6 Numerical Analysis of the Neural

Network Decoder

6.1 Error Model and Setup

We perform a numerical analysis of the neural

network decoder by Monte Carlo simulations.

We consider two error models: (a) In the ﬁrst

error model we assume that the measurements

of the check operators can be done perfectly.

We consider the independent X-Z-error model

in which a Pauli-X and a Pauli-Z are applied in-

dependently to each qubit with probability p. (b)

In the second error model we take measurement

errors into account. The errors on the qubits

are modeled as in (a). Additionally each out-

come of a syndrome measurement is ﬂipped with

probability q. The error correction procedure of

the Pauli-X errors and the Pauli-Z errors can

be done independently as each parity-check can

only detect either type and all parity-checks are

pairwise commuting.

For error model (a) where measurements are

perfect we numerically estimate the rate of logical

errors

P depending on the physical error rate p.

The rate of logical errors

P is the probability of

the neural network decoder to apply a non-trivial

logical operator (or exceeding a time limit). We

can estimate

P for a ﬁxed p as follows: First,

we sample an error E by applying a Pauli-X or

Pauli-Z each with probability p. We then com-

pute the result of the syndrome check measure-

ments s and give it to the neural network decoder

which determines a recovery operator R. After

the application of R we are back in a code state.

The decoder was successful if the application of E

and R acts trivially on the code state. This can

be checked eﬃciently since any operator which

leaves the code space as a whole invariant acts

non-trivially within the code space if and only if

it commutes with all other logical operators.

For error model (b), which takes syndrome er-

rors into account, the neural network decoder can

in general not correct back to a code state since

the decoder has access to the erroneous measure-

ment result only. To obtain a threshold we es-

timate the memory time T which gives the av-

Accepted in Quantum 2018-05-04, click title to verify 8

erage number of error correction rounds until a

logical failure occurs. If the physical error rate

is below threshold the memory time will diverge

with increasing lattice size L. More concretely,

in each error correction round, we ﬁrst sample an

error E by applying a Pauli-X or Pauli-Z each

with probability p and update the (noiseless) syn-

drome check measurement s from the last round.

We then ﬂip each bit in s with probability q to

obtain the faulty syndrome check measurement

, and feed s

to the neural network decoder.

The decoder will output a list of position where

Pauli-X or Pauli-Z are applied, so we can keep

track of the noiseless syndrome measurement s

out

afterwards. To evaluate whether a logical failure

has occurred, we input s

out

to the decoder we use

in the error model (a). Note that we do not have

to repeat the syndrome measurements as for the

2D toric code.

The networks we consider consist of a input

layer which receives the result of the parity check

measurements, one convolutional layer with ker-

nel size 3

(i.e. n = 3 in Equation 4) and then

two convolutional layer with kernel size 1

. The

number of channels in the hidden layer is 15 for

the noiseless syndrome measurement, and 20 for

the noisy case. The choices of these numbers here

are mostly aimed for a large neural network with-

out getting too slow or require too much memory

to train. After the linear map of the ﬁnal con-

volutional layer, a softmax function is applied on

all inputs:

→

(5)

It is a widely used in the scenario when we want

the output to be a probability distribution (i.e.

non-negative numbers that sum to 1). However,

there is no obvious reason that the softmax layer

is needed in our neural networks, as during the

decoding process we only care about the relative

order of the output numbers, which the softmax

layer preserves. The rational behind this is that

among the limited number of neural networks

we trained, the ones with the softmax layer per-

forms slightly better. Note that the softmax layer

breaks the locality of the networks. Nevertheless,

if preferred, we can view softmax together with

the cross-entropy as the modiﬁed cost function,

while the neural networks remains local.

A 1D slice of the neural network connectivity is

shown in Figure 5. It is obvious that without the

…

Softmax

Cross-entropy cost function

Dataset

Inputs

Outputs

Figure 5: A 1D slice of the decoding neural network.

The neurons in the ﬁrst hidden layers depend on the

neighbouring sites of the input layer, while each neuron

in the later layers only depend on the neurons in the

same sites in the previous layer. The input from the

dataset is the error syndrome. The softmax layer does

introduce non-locality in the network. However, we can

eﬀectively view softmax together with the cross-entropy

as the modiﬁed cost function.

softmax layer, each element in the output is de-

termined only by a corresponding 3

local region

on the input. The reason that we only have one

convolutional layer with kernel size 3

is mainly

due to the faster runtime and likely will cause less

ﬁnite-size eﬀect on the lattice sizes we are testing

on. See Appendix E for more discussion on using

deeper neural networks.

6.2 Training

We trained two neural networks for error model

(a) and (b) described in section 6.1 respectively.

However, the same neural network will be used

for decoding syndromes generated according to

diﬀerent error probabilities. Roughly speaking,

the training of the neural networks is done using

gradient descent (see section 4.1). The inputs are

error syndromes of shape (L, L, L, L, 4). The out-

puts have shape (L, L, L, L, 6), where an element

is equal to 1/N if an error happened on the corre-

sponding qubit and equal to 0 otherwise, with N

being the total number of errors. The normaliza-

tion is done to match the normalization done by

the softmax layer. We use a variation of the gra-

dient descent algorithm called Adam [20], which

is included in the Tensorflow library. The cost

function we use is cross-entropy. We also manu-

ally lower the learning rate of Adam when the

Accepted in Quantum 2018-05-04, click title to verify 9

Figure 6: Illustration of the neural network. Each array

of neurons is shown as a 2D square grid but in fact has

the same dimensionality as the lattice (3D or 4D). The

network consists of a single input layer (leftmost array)

which receives the measurement results. The input layer

is followed by three hidden layers. The number of chan-

nels in each hidden layer is 4 in this illustration. The

ﬁnal layer returns the probability distribution as output.

A single convolution between the input layer and the ﬁrst

hidden layer is indicated in blue.

decrease of cost function slows down. For er-

ror model (a), we train the network with syn-

dromes corresponding to error rate p uniformly

distributed from 3% to 7%. For error model (b),

we train the network with error rate p uniformly

distributed from 2% to 3%, and q being a con-

stant 2.5%. These values were determined to be

be the approximate locations of the thresholds in

trial runs.

We also applied the neural network decoder to

the 3D toric code under error model (a). The

convolutions chosen to be three dimensional in

this case, but the structure of the neural network

is identical otherwise. The network is trained for

p at 17% which is again just below the value of

the threshold.

We do want to note that the details of the

training are not very important (e.g. the vari-

ation of gradient descent we use, the concrete

parameters we used for gradient descent, etc).

Indeed, the neural nets in this paper are fairly

shallow compared to the neural networks used by

the machine learning community at the present

time. Based on experience, any reﬁned gradient

optimizer should be able to train the networks

decently well without ﬁne-tuning the training pa-

rameters. Additionally, we almost did not do any

post-selection on the training of neural networks,

and in general the decoder works reasonably well

(a) 0 steps

(b) 15 steps

Figure 7: Applying the neural network decoder to the

3D toric code with L = 5. The syndrome is highlighted

in red. The neural network outputs a probability dis-

tribution over the faces indicating where it believes an

error to be present. The probability of each face is indi-

cated by its opaqueness. Each ﬁgure shows the current

syndrome and the output of the network during the de-

coding. In each step the decoder ﬂips the face with the

highest probability.

Accepted in Quantum 2018-05-04, click title to verify 10

with a trained neural network. We believe the

performance of decoder observed in this paper is

not a rare event.

6.3 Performance

To evaluate the performance, we follow the pro-

cedures described in section 6.1. We will ﬁrst

discuss the results for error model (a) where we

assume perfect stabilizer measurements. As we

mentioned in section 5.1, the parallel line decoder

is applied after the neural network decoder when

obtaining these results. In Figure 8 and Figure 9,

we plot the logical error rates versus the physical

error rates and compare them with the minimum-

weight decoder. The minimum-weight decoder is

implemented by mapping the problem of ﬁnding

a minimum surface with the syndrome as bound-

ary to a linear integer programming problem.

We assume a scaling behavior in the variable

x = (p − p

1/ν

(6)

to determine the critical error probability p

. We

expand the logical error probability

P for small

x (around p = p

where the dependence on the

system size L is small):

P (p, L) = A + Bx + Cx

(7)

For the 4D toric code we obtain by ﬁt-

ting Equation 7 to the data for p =

0.066, 0.068, 0.072, 0.074 (see Figure 9). The ﬁt-

ting parameters p

and ν were determined by a

non-linear ﬁt:

= 0.071 ± 0.003, ν = 0.65 ± 0.02 (8)

This is comparable to the performance of the 4D

renormalization group decoder described in [14]

which achieves a threshold of p

= 0.073 ± 0.001

for the same error model.

Below we will discuss error model (b) with

measurement errors. Here we will always set

p = q, where p is the error rate of the physical

qubits, and q is the error rate of each stabilizer

parity measurement. For clariﬁcation, we do not

use the parallel line decoder for this error model.

In Figure 10, we plot the average memory time T

versus the error rate p. Due to a time restriction

on the computing cluster some runs were forced

to halt where the simulated system was still in a

correctable state. In the worst case (large system

0.1 0.15 0.2 0.25

0.5

(a) Neural network decoder

0.2 0.22 0.24 0.26

0.1

0.2

0.3

0.4

(b) Minimum-weight decoder

Figure 8: (a) The results of the numerical simula-

tion for the 3D toric code for Z-errors only, assum-

ing perfect measurements. We considered system sizes

L = 6, 8, 10, 12. The lines cross at around 17.5%.

(b) Results for the minimum-weight decoder which has

exponential run-time. The lines cross at around 23%.

Note that the threshold of the line-like logical operator

will be signiﬁcantly lower (see [33]).

Accepted in Quantum 2018-05-04, click title to verify 11

0.065 0.07 0.075

0.2

0.4

0.6

0.8

(a) Neural network decoder

0.05 0.1 0.15

0.2

0.4

(b) Minimum-weight decoder

Figure 9: (a) The results of the numerical simulation for

the 4D toric code assuming perfect measurements. We

considered system sizes L = 5, 6, 7, 8. The solid lines

are given by the values of Equation 7. (b) Numerical

simulation for the minimum-weight decoder assuming

perfect measurements. The lines cross at around 11%

which is in agreement with the numerical results of [2].

size and low physical error rate) only around 10%

of the runs ﬁnished. In order to obtain mean-

ingful statistics from the data, we make the as-

sumption that for a ﬁxed L and p, the memory

time follows an exponential distribution, and all

unﬁnished runs have longer memory time than

ﬁnished runs (which is called type II censoring

on the right in the statistics literature). Under

these assumption, the mean (and its conﬁdence

interval) can be estimated according to [10].

0.027 0.03 0.032

Figure 10: The results of the numerical simulation for

the 4D toric code assuming noisy measurements. We

considered system sizes L = 5, 6, 7, 8. The error bars

show the 80% conﬁdence interval. The dashed lines in-

dicate a linear ﬁt in the log-log–plot. The memory time

is expected to decay exponentially with p and diverge ex-

ponentially with increasing L for any ﬁxed p < p

. The

data does not allow a precise estimation of the thresh-

old p

7 Discussion

We have shown how convolutional neural net-

works can be utilized to decode higher dimen-

sional topological quantum codes, leading to a

scalable architecture. The neural network is

trained once and can then be scaled up to arbi-

trary system sizes. However, our approach is only

one way of utilizing neural networks to decode

high-dimensional quantum codes. Given the ver-

satility of neural networks, it is clear that there

will be other approaches to the problem, which

are worth exploring. As we mentioned in sec-

tion 5, ideally we need to perform reinforcement

learning (i.e. optimize the parameters of the neu-

ral networks with the objective being lowest log-

ical error rate or longest memory time).

Accepted in Quantum 2018-05-04, click title to verify 12

There is another important reason that we

choose neural networks instead of other machine

learning models for decoding. As discussed in [4],

good learning algorithms should be eﬃcient in

terms of human involvement. For example, it is

highly undesirable if small changes in the quan-

tum error correcting code and the experimental

hardware require a large amount of human ef-

fort to rewrite the learning algorithms. These en-

vironmental changes are certainly going to hap-

pen very frequently before a large-scale quantum

computer is built. On the other hand, if a certain

class of learning algorithms are fast to implement,

then with the same amount of man-hour we can

test more diﬀerent algorithms for the problem.

Therefore, less human involvement often trans-

lates to better performance. At the moment of

writing, neural networks have some of the most

ﬂexible and automated software packages among

machine learning models.

We would also like to highlight that using

neural networks for decoding may have advan-

tages from a practical perspective. Many spe-

cialized neural network chips have been manu-

factured [19, 23, 24] which promise much lower

power consumption. If the decoding has to take

place inside a fridge such dedicated hardware

could help to keep down thermal noise. For ex-

ample, in [23] the authors report on an integrated

circuit implementing a network of 1 million neu-

rons on a 240µm ×390µm CMOS chip. The chip

draws 20mW/cm

as compared to 50W/cm

100W/cm

for a modern CPU or 30W/cm

for

an FPGA [34], thereby reducing potential ther-

mal noise by orders of magnitude.

A natural question is whether we can build a

similar convolutional neural network decoder for

2D toric code. As the architecture we proposed

heavily based on the fact that 4D toric code can

be decoded in a local single-shot manner, it can-

not be directly applied to 2D toric code. How-

ever, we foresee that a decent convolutional neu-

ral network decoder for 2D toric code exists, and

it will likely share many similarities to the renor-

malization group decoder in [13].

Acknowledgments

We would like to thank Christophe Vuillot and

Kasper Duivenvoorden for interesting discus-

sions and Barbara Terhal for feedback on our

manuscript.

References

[1] Charlene Sonja Ahn. Extending quantum

error correction: new continuous measure-

ment protocols and improved fault-tolerant

overhead. PhD thesis, California Institute

of Technology, 2004.

[2] Gaku Arakawa, Ikuo Ichinose, Tetsuo Mat-

sui, and Koujin Takeda. Self-duality and

phase structure of the 4d random-plaquette

z2 gauge model. Nuclear Physics B, 709(1):

296–306, 2005.

[3] Paul Baireuther, Thomas E. O'Brien, Brian

Tarasinski, and Carlo W. J. Beenakker.

Machine-learning-assisted correction of cor-

related qubit errors in a topological code.

Quantum, 2:48, jan 2018. DOI: 10.22331/q-

2018-01-29-48.

[4] Yoshua Bengio, Yann LeCun, et al. Scaling

learning algorithms towards ai. Large-scale

kernel machines, 34(5):1–41, 2007.

[5] H´ector Bomb´ın. Single-shot fault-tolerant

quantum error correction. Physical Review

X, 5(3):031043, 2015. DOI: 10.1103/Phys-

RevX.5.031043.

[6] Nikolas P. Breuckmann. Homological

quantum codes beyond the toric code.

PhD thesis, RWTH Aachen University,

2017. URL https://doi.org/10.18154/

rwth-2018-01100.

[7] Nikolas P Breuckmann, Kasper Duivenvoor-

den, Dominik Michels, and Barbara M Ter-

hal. Local decoders for the 2d and 4d toric

code. Quantum Information and Compu-

tation, 17(3 and 4):0181–0208, 2017. DOI:

10.26421/QIC17.3-4.

[8] Christopher Clark and Amos Storkey. Train-

ing deep convolutional neural networks to

play go. In International Conference on Ma-

chine Learning, pages 1766–1774, 2015.

[9] Joshua Combes, Christopher Ferrie, Chris

Cesare, Markus Tiersch, GJ Milburn,

Hans J Briegel, and Carlton M Caves. In-

situ characterization of quantum devices

with error correction. 2014. URL https:

//arxiv.org/abs/1405.5656.

[10] H.A. David and H.N. Nagaraja. Or-

der Statistics. Wiley Series in Probabil-

ity and Statistics. Wiley, 2004. ISBN

9780471654018.

[11] Eric Dennis, Alexei Kitaev, Andrew Lan-

dahl, and John Preskill. Topological quan-

Accepted in Quantum 2018-05-04, click title to verify 13

tum memory. Journal of Mathematical

Physics, 43(9):4452–4505, 2002.

[12] Guillaume Duclos-Cianci and David Poulin.

Fast decoders for topological quantum

codes. Physical review letters, 104

(5):050504, 2010. DOI: 10.1103/Phys-

RevLett.104.050504.

[13] Guillaume Duclos-Cianci and David Poulin.

Fault-tolerant renormalization group de-

coder for abelian topological codes. Quan-

tum Information & Computation, 14(9-10):

721–740, 2014.

[14] Kasper Duivenvoorden, Nikolas P Breuck-

mann, and Barbara M Terhal. Renormal-

ization group decoder for a four-dimensional

toric code. arXiv preprint arXiv:1708.09286,

2017.

[15] Brendan J Frey and David JC MacKay.

A revolution: Belief propagation in graphs

with cycles. Advances in neural information

processing systems, pages 479–485, 1998.

[16] Gabriel Goh. Why momentum really works.

Distill, 2017. DOI: 10.23915/distill.00006.

[17] Ian Goodfellow, Yoshua Bengio, and Aaron

Courville. Deep Learning. MIT Press, 2016.

[18] Matthew B Hastings. Decoding in hyper-

bolic spaces: Ldpc codes with linear rate

and eﬃcient error correction. Quantum In-

formation and Computation, 14, 2014.

[19] Norman P Jouppi, Cliﬀ Young, Nishant

Patil, David Patterson, Gaurav Agrawal,

Raminder Bajwa, Sarah Bates, Suresh Bha-

tia, Nan Boden, Al Borchers, et al. In-

datacenter performance analysis of a tensor

processing unit. 44th International Sympo-

sium on Computer Architecture, 2017. DOI:

10.1145/3079856.3080246.

[20] Diederik Kingma and Jimmy Ba. Adam: A

method for stochastic optimization. 3rd In-

ternational Conference for Learning Repre-

sentations, San Diego, 2015. URL https:

//arxiv.org/abs/1412.6980.

[21] Stefan Krastanov and Liang Jiang. Deep

neural network probabilistic decoder for sta-

bilizer codes. Scientiﬁc Reports, 7(1), sep

2017. DOI: 10.1038/s41598-017-11266-1.

[22] Stephen Marsland. Machine Learning: An

Algorithmic Perspective, Second Edition.

Chapman & Hall/CRC, 2nd edition, 2014.

ISBN 1466583282, 9781466583283.

[23] Paul A Merolla, John V Arthur, Ro-

drigo Alvarez-Icaza, Andrew S Cassidy, Jun

Sawada, Filipp Akopyan, Bryan L Jackson,

Nabil Imam, Chen Guo, Yutaka Nakamura,

et al. A million spiking-neuron integrated

circuit with a scalable communication net-

work and interface. Science, 345(6197):668–

673, 2014. DOI: 10.1126/science.1254642.

[24] Janardan Misra and Indranil Saha. Artiﬁcial

neural networks in hardware: A survey of

two decades of progress. Neurocomputing,

74(1–3):239 – 255, 2010. ISSN 0925-2312.

DOI: 10.1016/j.neucom.2010.03.021.

[25] Eliya Nachmani, Yair Be'ery, and David

Burshtein. Learning to decode linear

codes using deep learning. In 2016 54th

Annual Allerton Conference on Commu-

nication, Control, and Computing (Aller-

ton). IEEE, sep 2016. DOI: 10.1109/aller-

ton.2016.7852251.

[26] Michael A. Nielsen. Neural Networks and

Deep Learning. Determination Press, 2015.

[27] Genevieve B Orr and Klaus-Robert M¨uller.

Neural networks: tricks of the trade.

Springer, 2003.

[28] Fernando Pastawski. Quantum memory: de-

sign and applications. PhD thesis, LMU

Munich, 2012. URL https://edoc.ub.

uni-muenchen.de/14703/.

[29] David Poulin and Yeojin Chung. On the

iterative decoding of sparse quantum codes.

Quantum Information and Computation, 8

(10):0987–1000, 2008.

[30] David Silver, Aja Huang, Chris J Maddi-

son, Arthur Guez, Laurent Sifre, George

Van Den Driessche, Julian Schrittwieser,

Ioannis Antonoglou, Veda Panneershelvam,

Marc Lanctot, et al. Mastering the game

of go with deep neural networks and tree

search. Nature, 529(7587):484–489, 2016.

DOI: 10.1038/nature16961.

[31] David Silver, Julian Schrittwieser, Karen Si-

monyan, Ioannis Antonoglou, Aja Huang,

Arthur Guez, Thomas Hubert, Lucas Baker,

Matthew Lai, Adrian Bolton, Yutian Chen,

Timothy Lillicrap, Fan Hui, Laurent Sifre,

George van den Driessche, Thore Graepel,

and Demis Hassabis. Mastering the game of

go without human knowledge. Nature, 550

(7676):354–359, Oct 2017. ISSN 0028-0836.

DOI: 10.1038/nature24270.

[32] John M Sullivan. A crystalline approxima-

Accepted in Quantum 2018-05-04, click title to verify 14

tion theorem for hypersurfaces. PhD thesis,

Princeton University, 1990.

[33] Koujin Takeda and Hidetoshi Nishi-

mori. Self-dual random-plaquette

gauge model and the quantum toric

code. Nuclear Physics B, 686(3):377

– 396, 2004. ISSN 0550-3213. DOI:

10.1016/j.nuclphysb.2004.03.006.

[34] David Barrie Thomas, Lee Howes, and

Wayne Luk. A comparison of cpus, gpus, fp-

gas, and massively parallel processor arrays

for random number generation. In Proceed-

ings of the ACM/SIGDA International Sym-

posium on Field Programmable Gate Arrays,

FPGA ’09, pages 63–72, New York, NY,

USA, 2009. ACM. ISBN 978-1-60558-410-

2. DOI: 10.1145/1508128.1508139.

[35] Yu Tomita and Krysta M Svore. Low-

distance surface codes under realistic quan-

tum noise. Physical Review A, 90(6):062320,

2014. DOI: 10.1103/PhysRevA.90.062320.

[36] Giacomo Torlai and Roger G Melko. A neu-

ral decoder for topological codes. Physical

Review Letters, 119(3):030501, 2017. DOI:

10.1103/PhysRevLett.119.030501.

[37] Savvas Varsamopoulos, Ben Criger, and

Koen Bertels. Decoding small surface codes

with feedforward neural networks. Quan-

tum Science and Technology, 3(1):015004,

nov 2017. DOI: 10.1088/2058-9565/aa955a.

[38] Chenyang Wang, Jim Harrington, and John

Preskill. Conﬁnement-higgs transition in a

disordered gauge theory and the accuracy

threshold for quantum memory. Annals of

Physics, 303(1):31–58, 2003.

[39] Jonathan S Yedidia, William T Freeman,

and Yair Weiss. Understanding belief prop-

agation and its generalizations. Exploring

artiﬁcial intelligence in the new millennium,

8:236–239, 2003.

[40] Chiyuan Zhang, Samy Bengio, Moritz

Hardt, Benjamin Recht, and Oriol Vinyals.

Understanding deep learning requires re-

thinking generalization. 5th International

Conference on Learning Representations,

2016.

A Training of Neural networks

We have mentioned in the main text that neural

networks are a powerful ansatz to model func-

tions F : {0, 1}

→ [0, 1]

. The question is how

to choose the individual weights and biases of

the neurons to make the network compute F , or

at least give a good approximation. This task

can be formulated in terms of an optimization

problem where pairs of input and desired out-

put (x, F (x)) are used to ﬁnd the right weights

and biases. In our setup we assume that the in-

puts of F are weighed by some probability dis-

tribution P : {0, 1}

→ [0, 1]. The distribu-

tion P prioritizes certain inputs over others and

eﬀectively reduces the dimensionality of the in-

put space. In principle we would want to op-

timize over all possible pairs of inputs and out-

puts of F (while taking P into account). How-

ever, this is generally not practicable so that we

restrict ourselves to optimize over some subset

D ⊂ {(x, F (x)) | x ∈ {0, 1}

}. The set D is

sampled according to this distribution P . The

optimization of the network is called training and

the sample D is called the training data or train-

ing set.

We will now describe the training of neural net-

works based on gradient descent. We denote the

weight vector of the ith neuron in layer l by w

and the jth entry of this vector by w

i,j

. Simi-

larly, the bias of the ith neuron in the lth layer

is labeled b

. These are the parameters that we

need to optimize. An essential ingredient for the

optimization is a measure of how good a neural

network performs on the training data D. This

measure is called the cost function C

i,j

, b

)

which usually maps the values of the weights w

i,j

and biases b

of the neural network into [0, ∞]. If

the value of the cost function is small then this

is an indicator that the network performs well.

For reasons that will become apparent in the fol-

lowing discussion, we demand C

to be diﬀer-

entiable. An obvious choice for the cost func-

tion is the average squared L

norm k · k

of the

diﬀerence of the networks output F

(x, w

i,j

, b

which depends on the choice of the weights w

i,j

and biases b

, and the desired value F (x) over all

elements of the training set D:

i,j

, b

) =

2|D|

(x,F (x))∈D

(x, w

i,j

, b

) − F (x)k

(9)

To optimize the weights and biases we per-

form an iterative procedure called gradient de-

Accepted in Quantum 2018-05-04, click title to verify 15

scent. A good introduction to gradient descent

and its variants can be found in [16]. Generally,

gradient descent is a tool to ﬁnd a local mini-

mum of a diﬀerentiable function f : R

→ R

which is close to some initial point x

∈ R

In the ﬁrst step we evaluate the negative gra-

dient −∇f at x

. By following the negative gra-

dient for a small enough distance we will ob-

tain a point x

:= x

− η

∇f(x

) such that

f(x

) ≤ f(x

). Iterating this process gives a se-

quence x

, x

, . . . , where

i+1

:= x

− η

∇f(x

). (10)

If the parameters η

where chosen small enough

we have that f (x

i+1

) ≤ f(x

) so that the se-

quence x

will converge towards the location of

a local minimum. Clearly, we do not want to

choose the η

too small or otherwise the rate of

convergence of the x

will be slow. However, if

we are choosing η

too large it will make us over-

shoot the location of the local minimum. There

are situations in which f satisﬁes conditions, i.e.

if f is convex and smooth, in which there ex-

ists an explicit choice of η

for which the con-

vergence can be guaranteed. In the context of

training neural networks the parameters η

are

collectively referred to as the learning rate. At

the time of writing there is no developed theory

on how to choose the learning rates optimally

and we have to consider heuristics. An overview

of several heuristics can be found in [27].

Let us now apply gradient descent to optimize

neural networks. The setup is the following: We

have some set of training data D, a neural net-

work with some initial choice of weights and bi-

ases and a cost function C

. The task is to ﬁnd

weights and biases which (locally) minimize the

cost function C

. This confronts us with the

problem of how to compute the gradient ∇C

This is solved by the second major ingredient of

the training of neural networks: The backprop-

agation algorithm. The backpropagation algo-

rithm consists of two steps: In the ﬁrst step we

compute the cost function C

of the neural net-

work (Eqn. 9) as well as record all values of the

neurals in the network. To evaluate the output of

the network F

we evaluating the input x on the

ﬁrst hidden layer and then feed the output of the

ﬁrst hidden layer into the second hidden layer and

so forth until we obtain the output of the network

at the last layer. In the second step of the back-

propagation algorithm we compute the derivative

of the cost function with respect to all weights

and biases. The derivatives can be computed in

linear time in the number of neurons. Obtaining

the derivatives is a matter of applying the chain

rule several times. To simplify notation we intro-

duce the variable s

k∈pred(i,l)

i,k

l−1

+ b

where pred(i, l) and f

are the predecessors and

the value of the ith neuron in the lth layer re-

spectively. Remember that the value of a neuron

is f

= σ(s

). The derivatives of the cost func-

tion C

with respect to the weight w

i,j

can be

expanded as

∂C

∂w

i,j

∂C

∂s

∂w

i,j

(11)

The second factor of Equation 11 is simply

∂s

∂w

i,j

= f

l−1

. (12)

The form of the ﬁrst term of Equation 11 depends

on whether l is a hidden layer or the output layer.

For l = L we expand over the values of the neu-

rons in the output layer F

= f

∂C

∂s

k=1

∂C

∂f

∂s

∂C

∂f

) (13)

For all hidden layers l < L we expand over the

sums s

l+1

of neurons which are in the next layer

and connected to the ith neuron in layer l

∂C

∂s

k∈succ(i,l)

∂C

∂s

l+1

∂s

l+1

∂s

k∈succ(i,l)

∂C

∂s

l+1

i,k

)

(14)

where succ(i, l) indicates the set of all neurons

in the next layer connected to the ith neuron in

layer l. The derivatives with respect to the bi-

ases b

proceeds completely analogously, the only

diﬀerence being that Equation 12 evaluates to 1.

Note that in order to compute Equation 14 for

some layer l we need to have the result of layer

l +1. Hence we ﬁrst evaluate Equation 13 for the

output layer and then go backwards through the

layers of the network (hence the name backprop-

agation). Finally, having obtained all derivatives

with respect to the weights and biases allows us

to perform a single step in the gradient descent

(see Equation 10).

Accepted in Quantum 2018-05-04, click title to verify 16

In the discussion above we made two simpliﬁ-

cations which we are going to address now: The

ﬁrst simpliﬁcation was the choice of the cost func-

tion. The L

norm is very intuitive but it leads

to a very slow convergence of the training. The

reason for this can be seen in Equation 13 and

Equation 14. The derivatives are proportional to

the derivative of the sigmoid function σ

(z) which

is close to 0 when |z| is suﬃciently large. This

is avoided by choosing the cross-entropy as cost-

function:

= −

|D|

x∈D

k=1

(x) log



(x)



+ (1 − F

(x)) log



1 − F

(x)

i

(15)

The cross-entropy has a less intuitive form. How-

ever, since F (x) ∈ [0, 1] and 0 < F

(x) < 1

one can see that the cross-entropy is (a) positive

and (b) small when the output of the network

is close to the desired output. The cross-entropy

has the advantage that in Eqn. 11 the derivatives

of the sigmoid function cancel, see [17, 22, 26] for

a derivation.

The second simpliﬁcation was taking the av-

erage over the whole training set D. In prac-

tice, the training set is usually subdivided into

several subsets called batches so that a batch of

data can be loaded into memory. This is known

as stochastic gradient descent. Furthermore, part

of the available training data is kept aside and

not used for the training of the network. This

data set is the called the validation set V and it

is only used to evaluate the cost function after

every step of the training. The reason to keep

the validation set separate is to check whether

the neural network performs well on data out-

side of the training data. In other words, this is

the ﬁrst measure against overﬁtting To summa-

rize the training procedure:

1. Initialize the weights and biases.

2. Pick a batch B ⊂ D and learning rate η.

3. Perform a single step of the gradient descent,

using backpropagation to compute the gra-

dient.

4. Compute the cost function C

on the vali-

dation set.

As long as C

keeps descending we repeat steps 2

- 4. The initial values in step 1 are usually chosen

to be random.

B Description of the parallel line de-

coder

In this section, we will describe in detail the par-

allel line decoder. As we mentioned in section 5.1,

its purpose to decode the syndromes that consist

only a few parallel lines, and we designed it with

easiness to code in mind. The steps are the fol-

lowing:

1. Make a list of current violated parity checks.

Order the violated edges in the list by their

direction.

2. For each edge e in the list: Assume e has

coordinate (x

, x

, d), we look for the

closest edge e

= (y

, y

, d) that still

has a violated parity check, such that x

. We then change one of the x

in e so

that e and e

get closer by ﬂipping the cor-

responding qubit. Update the syndrome.

3. Repeat Step 1, 2 until no parity check is vi-

olated or certain time limit is exceeded.

C Brief summary of the AlphaGo ar-

chitecture

In this section we will give a brief summary about

the architecture of AlphaGo [30], which ends up

being a strong AI at playing board game Go. It

highlights the importance of combining diﬀerent

type of machine learning in solving complicated

tasks.

1. Supervised learning is used to train a model

to approximate human strategy, where the

optimization is done to predict human move

Accepted in Quantum 2018-05-04, click title to verify 17

Figure 11: The eﬀect of Step 2 illustrated in a 2D cross

section. The solid blue lines are the initial violated parity

checks, and the dotted green lines are the violated parity

checks after a single execution of the Step 2. In the next

loop, no violated parity check will be left.

with most accuracy. While solely mimicking

human moves can already achieve non-trivial

performance [8], it is almost surely not the

best strategy. Therefore the following steps

are needed.

2. Reinforcement learning is then used to fur-

ther improve the model, with the goal of

achieving best win rate against previous

trained models. Roughly speaking, it is

done by gradually changing parameters of

the neural network towards the direction of

winning more games, starting from the pa-

rameters obtained from the supervised learn-

ing above.

3. Monte-Carlo tree search is hand-picked as a

subroutine, which reﬂects the 2-player turn

by turn nature of the game Go. The details

of Monte-Carlo tree search is not of concern

to this paper. Here the point being a large

and important fraction of the AlphaGo AI is

pre-determined.

While it is possible with only the reinforcement

learning in the step 2, the trained AI can still

achieve a similar performance as the AlphaGo,

it is a much riskier approach. This is because

the reinforcement learning can get stuck at some

local minimum. The goal of step 1 is to set the

start point of optimization to be approximately

the human strategy, so that we can hope the local

minimum obtained by reinforcement learning is

at least better.

In a later paper [31], the authors proposed a

new training procedure which does not require

the dataset of human moves. With the nature of

2-player game in mind, they use the policy neu-

ral network which decides the next move to play

out the following few turns of both players. By

doing that, they are able to recognize the bet-

ter next move, and use that as a training target

for the policy neural network. Overall, this ap-

proach avoids the diﬃculty of training the policy

neural network solely based on the win or loss of

a match which typically consists of hundreds of

moves. Similarly, in our paper we try to avoid

the same diﬃculty by explicitly training the neu-

ral network to recognize qubits aﬀected by errors

from the syndromes, instead of training the neu-

ral network to achieve longer memory times. We

envision the possibility of applying reinforcement

learning after the supervised learning phase for

our decoder, so that it can even ﬁnd a slightly

better strategy or adapt to not drastically diﬀer-

ent noise models.

D A toy example of error correction

with 1-nearest neighbor algorithm

The goal of this section is to demonstrate a base-

line machine learning algorithm can fail at a toy

problem similar to decoding topological codes

when the lattice gets large. This should serve

as an alert when attempt to use neural networks

for decoding, especially due to the lack of under-

standing of the generalization behavior of neu-

ral networks (see [40]). The simple algorithm we

will be considering is the 1-nearest neighbor al-

gorithm. It belongs to the family of k-nearest

neighbor algorithms, which is used for classiﬁ-

cation and regression. The output of these algo-

rithms only depend on the k entries in the dataset

that are closest to the input, e.g. average over

the k entries for regression. It can be consid-

ered as a natural extension to the lookup table

when the input space is too large and we can-

not pre-compute all possible inputs (lookup ta-

ble has been used for decoding the surface code

in [35]) The concrete procedure of using 1-nearest

neighbor algorithm for our toy problem will be

explained below.

Suppose we are given a square lattice with size

. Among all the plaquettes, bpL

c are ﬂipped,

where p < 0.1 is some ﬁxed small number. Our

goal is to ﬂip them back, but it is forbidden to

look at the lattice directly, and we can only com-

Accepted in Quantum 2018-05-04, click title to verify 18

pare it to entries of a database. The database

contains N = poly(L) entries, and each one is

generated by randomly ﬂip bpL

c plaquettes. A

natural way to use the database is to ﬁnd the

entry that has the most overlapping ﬂipped pla-

quettes with our lattice, and then ﬂip all the pla-

quettes in that entry. This approach can be con-

sidered as a 1-nearest neighbor algorithm. In-

tuitively, this strategy will perform poorly if L is

large, as the entries of the database are too sparse

compared to all possibilities. In more detail, de-

ﬁne the random variable C to be the number of

overlapping ﬂipped plaquettes between two ran-

domly generated lattice conﬁguration described

above. For large L, C can be well approximated

by N (L

, L

pα). Or equivalently,

C − L

√

pα

≈ N (0, 1) (16)

Since it is known that [10]



max

i≤N

)



∼

log N (17)

Var



max

i≤N

)



∼ 1/ log N (18)

where X

∼ N (0, 1). Thus, in the database,

the most similar entry will typically have around

√

p log N α

overlapping plaquettes. This

is much smaller compare to L

p. Therefore, if we

ﬂip plaquettes according the entry, the total num-

ber of ﬂipped plaquettes will increase instead of

decrease.

Instead of this toy problem, we can apply a

similar 1-nearest neighbor algorithm to a dataset

of (syndrome, position of ﬂipped qubits). Let

us consider the regime where the error rate is

low. In this case, most ﬂipped qubits are sepa-

rated from each other, and the syndromes in the

dataset are almost equivalent to a list of ﬂipped

qubits. Therefore, we might extrapolate from the

toy problem discussed above that this particular

1-nearest neighbor algorithm will fail to decode

topological codes when the lattice becomes large

enough. Thus, if we want to have a machine

learning decoder that is scalable, it should ide-

ally distinguish from such an approach.

E Analysis of the Convolutional Neural

Network

In the paper, the convolutional neural networks

we use only have one convolution layer with ker-

nel size 3

and others are 1 × 1 × 1 × 1. There-

fore each element in the output (before applying

softmax) is determined by its surrounding 3

re-

gion. This is mainly for the sake of fast train-

ing and evaluation. It is interesting to ask what

will happen if we have a deeper neural network,

especially considering the recent success of deep

neural networks in image classiﬁcation and pro-

cessing. More concretely, we wish to know if we

increase the number of convolution layers (and as

a result increase the region each qubit can “look

around” to decide its error probability), will the

machine learning procedure described in our pa-

per perform better? In the previous section we

give an example that a nearest neighbor algo-

rithm performs worse on a larger region. And

in theory a deeper neural network will have a

better ability to remember all the training data,

thus has the capacity to be a nearest neighbor

algorithm. Therefore, it is important to try to

understand some diﬀerence between how neural

nets and nearest neighbor algorithm generalize to

unknown inputs. To this end, we tested a slightly

larger neural network for our decoder.

E.1 Behavior of a deeper neural network

trained as decoder

In this subsection, We consider a neural network

which has two convolution layer with kernel size

. Thus, each element in the output of the neu-

ral network can be inﬂuenced by a 5

region, or

around 2500 input bits. Generally speaking, if we

have large neural nets on this many input bits,

overﬁtting is very likely to happen. For our par-

ticular setup of convolutional neural networks, it

is much harder to estimate how many training

data are needed to avoid overﬁtting. Neverthe-

less, it is still interesting to look at the following

two facts:

• The trained larger network has a similar

performance when decoding noiseless syn-

dromes compared to the network we use in

the main text, but is not tested as exten-

sively.

• The larger network exhibits the decay of sen-

sitivity, which means each element in the

output will on average change more if an in-

put bit closer to it is being changed. See

Figure 12 for the plot.

Accepted in Quantum 2018-05-04, click title to verify 19

Figure 12: Sensitivity of the neural network output with

respect to changing a single input bit. The neural net-

work is trained to estimate the probability of an error

happened on the center qubit. The X-axis is the dis-

tance between the single changed input bit and the cen-

ter, induced by L

-norm. For each distance, we pick a

random input bit and evaluate the diﬀerence of output

if we change that bit. This is repeated for 20 times and

an average is computed.

Note that the second fact partially explained the

ﬁrst one: if the input bits outside the 3

region

do not have much inﬂuence on the corresponding

output bit, then the smaller neural network used

in the paper can in principle approximate the

same input-output relation. The decay of sen-

sitivity is likely resulted from the following two

reasons. First, the decoding task, and thus the

dataset we train our neural networks on have this

structure. In more detail, measurement outcome

of a parity check far away from a qubit contains

very little information about whether an error

happened on the qubit. Secondly, as we will dis-

cuss in the next subsection, a convolutional neu-

ral network has some tendency of prioritizing lo-

cal information. Intuitively, the alignment of the

tendency of the convolutional neural network and

the decoding task will lower the chance of over-

ﬁtting. Therefore, we might think convolutional

neural network is not only a convenient machine

learning model for this task, but also a naturally

suitable one.

E.2 Tendency of using local information

As mentioned in the last subsection, here we will

discuss one mechanism that make the convolu-

tional neural network prioritize local information.

The mechanism mainly involves the local connec-

tivity, the initialization and training of the net-

works. We will explain the mechanism using a

highly simpliﬁed setting. However, we believe it

still has some eﬀect for more general cases. We

do not aim for rigorousness in this subsection.

In Figure 13 we draw a m = 4 layer 1D CNN

with n = 7 input bits and 1 output bit. The

variables x

are the values of the neurons (as well

as inputs). We will ﬁrst consider the following

simpliﬁed situation:

1. There is no non-linear activation function

applied in the middle of the network. The

function tanh is only applied after the ﬁnal

layer x

(In the ﬁgure this is x

). There

is also no bias term in the network. So the

last layer x

is a linear combination of the

input bits {x

}. For the ease of notation,

the number of channels for each layer is set

to 1.

2. The neural network is not convolutional, but

instead a normal NN with the same local

connectivity as a CNN.

3. The input-output relation to be learned as

y = x

= x

= ±1, where k = bn/2c, y

is the desired output, and the probabilities

of being 1 and −1 are both 0.5. Other x

are set to 0. The cost function is c = (y −

tanh x

)

It is obvious that the neural net can approxi-

mate the above input-output relation well by be-

ing a linear combination x

= a

+ a

with the weights satisfy a

+ a

 1. We

will argue that with gradient descent as training

method, in general we will have a

> a

. In other

words, the output of the network y

= tanh(x

)

will depends more heavily based on x

compared

to x

. Operationally, this can be checked by set-

ting x

6= x

. As we never train the neural net

with data that have x

6= x

, this can be con-

sidered as checking how the neural net generalize

to unseen inputs.

We will assume that the weights {w

} are ini-

tialized with the same normal distributions with

mean 0 and standard deviation σ. Since there is

no non-linear activation function, we have

(ij)

(19)

We will show that when the weights of the net-

work are initialized, Var(a

(ij)

) is proportional to

Accepted in Quantum 2018-05-04, click title to verify 20

…

Figure 13: An illustration of the connectivity of a small

1D convolutional neural network. In this section, we set

the number of channels to 1. Thus, each x

is a real

number. The arrays represent dependence relation. For

example, x

is a function of x

, x

, and in this

section the function is simply a linear sum.

the total number of paths from x

to x

. Let p

be any path from x

to x

. We can think p as

the set of weights w

on the path. It is easy to

show that

(ij)

∈p

≡

(20)

Note that for diﬀerent path p

, p

, while

is not independent of W

, we have

Cov(W

, W

) = 0. Thus,

Var(a

(ij)

) =

Var(W

) (21)

More generally, we can formulate the above ar-

gument as the following observation:

Observation 1. Deﬁne b

)

p∈S

∈p

, where S is the set of paths

from x

to x

. When the weights of the net-

work are initialized, Var(b

)

) is proportional

to the total number of paths in S.

We want to use the variance to compare the

magnitudes of b

)

. In order for this to work,

one condition is that the distributions of b

)

should all have a “regular shape” (e.g. approxi-

mated by normal distribution). In Figure 15, we

plotted the distribution of a

(m1)

bn/2c

, which is deﬁned

in Equation 19. Again m is the layer of the net-

work and n is the size of input. From this numer-

ical experiment, it is likely that the probability

distribution of b

)

will have a bell shape when

and x

are reasonably far away. Thus,

the variances will provide us a very rough way

to compare the magnitudes of b

)

. For exam-

ple, this implies that in Figure 13, x

likely has

a larger dependence on the x

compared to x

1bn/2c

Figure 14: All paths through the grey area contributes

to the gradient of w

, while only the path along the

edge contributes to w

(assuming there is no path from

1bn/2c

to w

Figure 15: The distribution of the outputs from 3 and

4 layer neural network when the weights are initialized.

600 numerical trials are done in order to plot each ﬁgure.

The networks have the connectivity shown in Figure 13.

The only non-zero element we set in the input is the

center element x

1bn/2c

= 1. Thus, what we are showing

here is the distribution of a

(m1)

bn/2c

, where a

(ij)

is deﬁned in

Equation 19 and x

is the output of the network. We

see that the distribution can be described as bell shapes.

Thus, the variances already contain a large amount of

information of the distribution.

on x

. Some other conditions will be needed if

we want to make the above comparison rigorous.

However, we will skip further discussion on this

topic and be contend with an incomplete analy-

sis.

We can use this approach to compare the gra-

dients of the cost function c with respect to dif-

ferent w

. The gradient sum over the two possible

inputs x

= ±1 is

=±1

∂c

∂w

=±1

∂c

∂f

∂x

∂w

(22)

It is easy to see that

∂c

∂f



= −

∂c

∂f



=−1

(23)

Accepted in Quantum 2018-05-04, click title to verify 21

and

∂x

∂w

p∈S

∈p,j6=i

(24)

where S

is the set of paths which go through w

and one of x

and x

1bn/2c

. Substitute these in,

we have

=±1

∂c

∂w

= 2

∂c

∂f



p∈S

∈p,j6=i

(25)

Since the ﬁrst two terms on the r.h.s are the

same when we compute

∂c

∂w

for diﬀerent w

the same network, to compare the gradients we

only need to consider the last term. By using

Observation 1, we know that just after the ini-

tialization (or when w

have not been too corre-

lated), the gradient of the weight that involved

in more paths connecting x

or x

1bn/2c

to x

has a larger variance, and obviously the mean of

the gradients are 0. On a speculation level, this

means most weights connecting x

1bn/2c

to x

changes faster compared to those connects x

in the same layer. Intuitively, this trends

will continue, since the gradient w.r.t to a weight

will be larger if other weights on the path have

larger absolute values. Indeed, in Figure 16, we

can see for one particular setting, the gradient

with respect to a weight in the center region is

in general larger compared to one in the corner

throughout the whole training process.

There is another important factor that causes

1bn/2c

to have a larger inﬂuence on the output.

Observation 2. If we assume that at the end of

the training, all the weights have the same value,

e.g. 1. In this case, x

1bn/2c

will still have a larger

coeﬃcient in the expansion of the output, because

there are more paths connecting from it.

So the high chance of x

1bn/2c

having larger in-

ﬂuence is likely a combined result of the Observa-

tion 2 and the gradient mechanism we discussed

previously.

Now let us discuss the simpliﬁcation we made

above. If the network is convolutional, which

means the weights are shared in the same layer,

Observation 2 is still true. However, it is not

clear whether the above argument about gradi-

ents is still relevant, as apparently the weights

are changing at the same rates regardless of its

position in one layer. Below we will look at a

very speciﬁc scenario, in which the magnitudes

of the gradients with respect to diﬀerent weights

still play a role in the ﬁnal outcome. First, note

that Observation 1 still holds, because it remains

true that Cov(W

, W

) = 0 for any two diﬀer-

ent paths p

and p

. Assume in the Fig 14, w

and

are shared weights and thus w

= w

. Let us

consider the possible scenario that the gradient

with respect to w

and w

have diﬀerent signs,

but the norm of the gradient w.r.t w

is much

larger. Then the update of w

(and w

) will be ac-

cording to the gradient of w

, which increase the

likelihood of the ﬁnal decision x

to be based

on x

1bn/2c

. Overall, more study needs to be done

regarding the convolutional neural networks.

The eﬀect of activation function in the mid-

dle of the network will depend on the choice

of the particular function and the initial distri-

bution of the weights. For example, if we ini-

tialize all weights with a small enough variance,

then initially all x

are still in the linear regime

of activation function such as sigmoid or tanh.

For the popular relu activation function (i.e.

y = max(x, 0)), since it is linear whenever the in-

put is larger then 0, we can expect observation 1

to still be true.

Lastly, we considered a very simple input prob-

ability distribution. In general, the inputs are

likely to be noisy, and the input-output relation

is more complicated than an identity function.

To have an toy example of noisy inputs, later in

the next subsection we are going to perform a nu-

merical experiment where we replace the 0 input

bits to a random ±1 in the input distribution we

considered above. Intuitively, we can still argue

that for x

in the middle of a layer will have a

larger signal-to-noise ratio compared to the ones

on the edge, as a larger part of the variance come

actually from the signal. This might suggest that

the weights in the middle will change faster.

E.3 Numerical results

We setup the network and the training data ac-

cording to the description above, with the only

change being that inputs are 2D with size 7 × 7

and 9 ×9. In particular, the neural networks are

not convolutional, but have the same local con-

nectivity as a 2D convolutional one. The number

of channels for each hidden layer is 10. We choose

one of the corner input bits to be the same (or

Accepted in Quantum 2018-05-04, click title to verify 22

reverse) of the center bit. The weights of the net-

work are initialized with normal distribution that

has mean 0 and standard variance 0.1. We use

the vanilla gradient descent with some exponen-

tial decay of the learning rate. The ﬁnal output

of the tanh is rounded to integer and compared to

the desired output. The training stops when the

average of cost function c is smaller then 0.02 for

a batch of 100 training data (the threshold of 0.02

is chosen so that when given the reversed input,

the output of tanh will be rounded to ±1 instead

of 0). We then test the network on inputs where

the chosen corner bit and the center bit have dif-

ferent signs. We repeat the above procedure for

20 times on both 7×7 and 9×9 inputs. In all ex-

periments, the network always output the center

input bit when the center and corner input bits

have diﬀerent sign. To verify the argument in

the previous subsection that the gradients with

respect to the variables in the center region are

larger, we numerically compute the gradients at

10 selected time steps during the training pro-

cess. The result can be found in Figure 16. Here

we set the input size to be 9 ×9, and the number

of channels for each hidden layer to be 1. We

compute the gradient of the loss function with

respect to two weights. Both weights are con-

necting from the ﬁrst to the second hidden lay-

ers. One of the weights located in the center,

while the other located in the corner correspond-

ing to the non-zero element in the input. In the

plot, each vertical segment represents the range

of the gradients at a particular time step of the

30 training runs, while the width of transparent

color patches represents the distribution. Indeed

we can see the gradient of the weight in the cen-

ter region is in general larger compared to the

one in the corner.

We also do a test where the scenario is more

complicated (and to some degree more practi-

cal). As discussed previously, we can add noise

by changing all the 0 in the inputs to a random

±1. relu activation function (i.e. y = max(x, 0)

is added to the middle layers of the network. The

size of the input is 9 × 9, and accuracy is evalu-

ated by comparing the signs of the ﬁnal tanh to

the desired outputs. Otherwise the setting is the

same as in the previous experiment. In Figure 17,

we plot the accuracy during the training process.

We can see while the accuracy rises when given

the aligned inputs (i.e. the particular corner bit

Figure 16: Absolute values of gradients with respect to

two weights during diﬀerent stages of the training. Both

weights are connecting from the ﬁrst to the second hid-

den layers. The data comes from running the same train-

ing process for 30 times, and we compute the gradient

at 10 speciﬁc time steps during the training. In the plot,

the vertical segments made of solid lines represent the

ranges of the gradients, while the width of the transpar-

ent color patches represents the distribution. We can

see the gradient of the weight in the center region is in

general larger compared to the one in the corner.

and the center bit are equal), the network be-

comes more deterministically at outputting the

center bit when given reversed inputs (i.e. the

particular corner bit and the center bit have in-

versed sign). The test are repeated several times

where similar results are observed.

Accepted in Quantum 2018-05-04, click title to verify 23

Figure 17: Accuracy of matching the center input bits

during training. Inputs are noisy and there are non-linear

activation functions for all layers. The blue solid line

corresponds to the accuracy when given aligned inputs,

while the orange dotted one corresponds to reversed in-

puts. Note that in order to diﬀerentiate the lines, the

accuracy for reversed inputs is deﬁned to be the percent-

age of output matching the corner bit. Thus towards the

end of the training, the outputs almost always matching

the center bits when given reversed input. The accuracy

is evaluated on a batch of 50 inputs.

Accepted in Quantum 2018-05-04, click title to verify 24

Comments

Products

Project