Quantum Information
Chapter 10. Quantum Shannon Theory
John Preskill
Institute for Quantum Information and Matter
California Institute of Technology
Updated April 2016
For further updates and additional chapters, see:
http://www.theory.caltech.edu/preskill/ph219/
Please send corrections to preskill@caltech.edu
Contents
10 Quantum Shannon Theory
10.1 Shannon for Dummies
10.1.1 Shannon entropy and data compression
10.1.2 Joint typicality, conditional entropy, and mutual information
10.1.3 Distributed source coding
10.1.4 The noisy channel coding theorem
10.2 Von Neumann Entropy
10.2.1 Mathematical properties of H(ρ)
10.2.2 Mixing, measurement, and entropy
10.2.3 Strong subadditivity
10.2.4 Monotonicity of mutual information
10.2.5 Entropy and thermodynamics
10.2.6 Bekenstein's entropy bound
10.2.7 Entropic uncertainty relations
10.3 Quantum Source Coding
10.3.1 Quantum compression: an example
10.3.2 Schumacher compression in general
10.4 Entanglement Concentration and Dilution
10.5 Quantifying Mixed-State Entanglement
10.5.1 Asymptotic irreversibility under LOCC
10.5.2 Squashed entanglement
10.5.3 Entanglement monogamy
10.6 Accessible Information
10.6.1 How much can we learn from a measurement?
10.6.2 Holevo bound
10.6.3 Monotonicity of Holevo χ
10.6.4 Improved distinguishability through coding: an example
10.6.5 Classical capacity of a quantum channel
10.6.6 Entanglement-breaking channels
10.7 Quantum Channel Capacities and Decoupling
10.7.1 Coherent information and the quantum channel capacity
10.7.2 The decoupling principle
10.7.3 Degradable channels
10.8 Quantum Protocols
10.8.1 Father: Entanglement-assisted quantum communication
10.8.2 Mother: Quantum state transfer
10.8.3 Operational meaning of strong subadditivity
10.8.4 Negative conditional entropy in thermodynamics
10.9 The Decoupling Inequality
10.9.1 Proof of the decoupling inequality
10.9.2 Proof of the mother inequality
10.9.3 Proof of the father inequality
10.9.4 Quantum channel capacity revisited
10.9.5 Black holes as mirrors
10.10 Summary
10.11 Bibliographical Notes
10.12 Exercises
References
Preface
This is the 10th and final chapter of my book on Quantum Information, based on the course I have been teaching at Caltech since 1997. An early version of this chapter (originally Chapter 5) has been available on the course website since 1998, but this version is substantially revised and expanded.
The level of detail is uneven, as I've aimed to provide a gentle introduction, but I've also tried to avoid statements that are incorrect or obscure. Generally speaking, I chose to include topics that are both useful to know and relatively easy to explain; I had to leave out a lot of good stuff, but on the other hand the chapter is already quite long.
My version of Quantum Shannon Theory is no substitute for the more careful treatment in Wilde's book [1], but it may be more suitable for beginners. This chapter contains occasional references to earlier chapters in my book, but I hope it will be intelligible when read independently of other chapters, including the chapter on quantum error-correcting codes.
This is a working draft of Chapter 10, which I will continue to update. See the URL on the title page for further updates and drafts of other chapters. Please send an email to preskill@caltech.edu if you notice errors.
Eventually, the complete book will be published by Cambridge University Press. I hesitate to predict the publication date; they have been far too patient with me.
10
Quantum Shannon Theory
Quantum information science is a synthesis of three great themes of 20th century thought: quantum physics, computer science, and information theory. Up until now, we have given short shrift to the information theory side of this trio, an oversight now to be remedied.
A suitable name for this chapter might have been Quantum Information Theory, but I prefer for that term to have a broader meaning, encompassing much that has already been presented in this book. Instead I call it Quantum Shannon Theory, to emphasize that we will mostly be occupied with generalizing and applying Claude Shannon's great (classical) contributions to a quantum setting. Quantum Shannon theory has several major thrusts:
1. Compressing quantum information.
2. Transmitting classical and quantum information through noisy quantum channels.
3. Quantifying, characterizing, transforming, and using quantum entanglement.
A recurring theme unites these topics: the properties, interpretation, and applications of Von Neumann entropy.
My goal is to introduce some of the main ideas and tools of quantum Shannon theory, but there is a lot we won't cover. For example, we will mostly consider information theory in an asymptotic setting, where the same quantum channel or state is used arbitrarily many times, thus focusing on issues of principle rather than more practical questions about devising efficient protocols.
10.1 Shannon for Dummies
Before we can understand Von Neumann entropy and its relevance to quantum information, we should discuss Shannon entropy and its relevance to classical information.
Claude Shannon established the two core results of classical information theory in his landmark 1948 paper. The two central problems that he solved were:
1. How much can a message be compressed; i.e., how redundant is the information? This question is answered by the “source coding theorem,” also called the “noiseless coding theorem.”
2. At what rate can we communicate reliably over a noisy channel; i.e., how much redundancy must be incorporated into a message to protect against errors? This question is answered by the “noisy channel coding theorem.”
Both questions concern redundancy: how unexpected is the next letter of the message, on the average? One of Shannon's key insights was that entropy provides a suitable way to quantify redundancy.
I call this section “Shannon for Dummies” because I will try to explain Shannon's ideas quickly, minimizing distracting details. That way, I can compress classical information theory to about 14 pages.
10.1.1 Shannon entropy and data compression
A message is a string of letters, where each letter is chosen from an alphabet of k possible letters. We'll consider an idealized setting in which the message is produced by an “information source” which picks each letter by sampling from a probability distribution
X := {x, p(x)}; (10.1)
that is, the letter has the value
x ∈ {0, 1, 2, . . . , k−1} (10.2)
with probability p(x). If the source emits an n-letter message, the particular string x = x_1 x_2 . . . x_n occurs with probability
p(x_1 x_2 . . . x_n) = ∏_{i=1}^n p(x_i). (10.3)
Since the letters are statistically independent, and each is produced by
consulting the same probability distribution X, we say that the letters
are independent and identically distributed, abbreviated i.i.d. We'll use X^n to denote the ensemble of n-letter messages in which each letter is generated independently by sampling from X, and ~x = (x_1 x_2 . . . x_n) to denote a string of letters.
Now consider long n-letter messages, n ≫ 1. We ask: is it possible to compress the message to a shorter string of letters that conveys essentially the same information? The answer is: Yes, it's possible, unless the distribution X is uniformly random.
If the alphabet is binary, then each letter is either 0 with probability 1−p or 1 with probability p, where 0 ≤ p ≤ 1. For n very large, the law of large numbers tells us that typical strings will contain about n(1−p) 0's and about np 1's. The number of distinct strings of this form is of order the binomial coefficient (n choose np), and from the Stirling approximation log n! = n log n − n + O(log n) we obtain
log (n choose np) = log [ n! / ((np)! (n(1−p))!) ]
≈ n log n − n − (np log np − np + n(1−p) log n(1−p) − n(1−p))
= nH(p), (10.4)
where
H(p) = −p log p − (1−p) log(1−p) (10.5)
is the entropy function.
In this derivation we used the Stirling approximation in the appropriate form for natural logarithms. But from now on we will prefer to use logarithms with base 2, which is more convenient for expressing a quantity of information in bits; thus if no base is indicated, it will be understood that the base is 2 unless otherwise stated. Adopting this convention in the expression for H(p), the number of typical strings is of order 2^{nH(p)}.
To convey essentially all the information carried by a string of n bits, it suffices to choose a block code that assigns a nonnegative integer to each of the typical strings. This block code needs to distinguish about 2^{nH(p)} messages (all occurring with nearly equal a priori probability), so we may specify any one of the messages using a binary string with length only slightly longer than nH(p). Since 0 ≤ H(p) ≤ 1 for 0 ≤ p ≤ 1, and H(p) = 1 only for p = 1/2, the block code shortens the message for any p ≠ 1/2 (whenever 0 and 1 are not equally probable). This is Shannon's result. The key idea is that we do not need a codeword for every sequence of letters, only for the typical sequences. The probability that the actual message is atypical becomes negligible asymptotically, i.e., in the limit n → ∞.
Similar reasoning applies to the case where X samples from a k-letter alphabet. In a string of n letters, x typically occurs about np(x) times, and the number of typical strings is of order
n! / ∏_x (np(x))! ∼ 2^{nH(X)}, (10.6)
where we have again invoked the Stirling approximation and now
H(X) = −∑_x p(x) log_2 p(x) (10.7)
is the Shannon entropy (or simply entropy) of the ensemble X = {x, p(x)}.
Adopting a block code that assigns integers to the typical sequences, the information in a string of n letters can be compressed to about nH(X) bits. In this sense a letter x chosen from the ensemble carries, on the average, H(X) bits of information.
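To make eq.(10.7) concrete, here is a minimal Python sketch that computes the Shannon entropy of an ensemble; the example distributions are purely illustrative.
```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits.
    Terms with p(x) = 0 contribute nothing, by continuity."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# A biased binary source: H(0.1) ≈ 0.469 bits per letter,
# so a long message is compressible by more than a factor of 2.
print(shannon_entropy([0.9, 0.1]))

# A uniform 4-letter alphabet attains the maximum log2(4) = 2 bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
```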
It is useful to restate this reasoning more carefully using the strong law of large numbers, which asserts that a sample average for a random variable almost certainly converges to its expected value in the limit of many trials. If we sample from the distribution Y = {y, p(y)} n times, let y_i, i ∈ {1, 2, . . . , n} denote the ith sample, and let
μ[Y] = ⟨y⟩ = ∑_y y p(y) (10.8)
denote the expected value of y. Then for any positive ε and δ there is a positive integer N such that
| (1/n) ∑_{i=1}^n y_i − μ[Y] | ≤ δ (10.9)
with probability at least 1−ε for all n ≥ N. We can apply this statement to the random variable −log_2 p(x). Let us say that a sequence of n letters is δ-typical if
H(X) − δ ≤ −(1/n) log_2 p(x_1 x_2 . . . x_n) ≤ H(X) + δ; (10.10)
then the strong law of large numbers says that for any ε, δ > 0 and n sufficiently large, an n-letter sequence will be δ-typical with probability ≥ 1 − ε.
Since each δ-typical n-letter sequence ~x occurs with probability p(~x) satisfying
p_min = 2^{−n(H+δ)} ≤ p(~x) ≤ 2^{−n(H−δ)} = p_max, (10.11)
we may infer upper and lower bounds on the number N_typ(ε, δ, n) of typical sequences:
N_typ p_min ≤ ∑_{typical ~x} p(~x) ≤ 1, N_typ p_max ≥ ∑_{typical ~x} p(~x) ≥ 1−ε, (10.12)
which implies
2^{n(H+δ)} ≥ N_typ(ε, δ, n) ≥ (1−ε) 2^{n(H−δ)}. (10.13)
Therefore, we can encode all typical sequences using a block code with length n(H + δ) bits. That way, any message emitted by the source can be compressed and decoded successfully as long as the message is typical; the compression procedure achieves a success probability p_success ≥ 1 − ε, no matter how the atypical sequences are decoded.
What if we try to compress the message even further, say to H(X) − δ′ bits per letter, where δ′ is a constant independent of the message length n? Then we'll run into trouble, because there won't be enough codewords to cover all the typical messages, and we won't be able to decode the compressed message with negligible probability of error. The probability p_success of successfully decoding the message will be bounded above by
p_success ≤ 2^{n(H−δ′)} 2^{−n(H−δ)} + ε = 2^{−n(δ′−δ)} + ε; (10.14)
we can correctly decode only 2^{n(H−δ′)} typical messages, each occurring with probability no higher than 2^{−n(H−δ)}; we add ε, an upper bound on the probability of an atypical message, allowing optimistically for the possibility that we somehow manage to decode the atypical messages correctly. Since we may choose ε and δ as small as we please, this success probability becomes small as n → ∞, if δ′ is a positive constant.
The number of bits per letter encoding the compressed message is called the rate of the compression code, and we say a rate R is achievable asymptotically (as n → ∞) if there is a sequence of codes with rate at most R and error probability approaching zero in the limit of large n. To summarize our conclusion, we have found that
Compression Rate = H(X) + o(1) is achievable,
Compression Rate = H(X) − Ω(1) is not achievable, (10.15)
where o(1) denotes a positive quantity which may be chosen as small as we please, and Ω(1) denotes a positive constant. This is Shannon's source coding theorem.
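As a numerical illustration of the theorem (a sketch I add here, assuming a binary i.i.d. source), the following Python code counts the δ-typical strings exactly, grouping them by Hamming weight, and checks the behavior promised by eq.(10.13): the typical set carries almost all the probability, yet its size is only about 2^{nH(p)}.
```python
import numpy as np
from math import lgamma, log

def log2comb(n, w):
    # log2 of the binomial coefficient, via lgamma to avoid huge integers
    return (lgamma(n + 1) - lgamma(w + 1) - lgamma(n - w + 1)) / log(2)

def typical_set_stats(n, p, delta):
    """Size and total probability of the delta-typical set (eq. 10.10)
    for n i.i.d. bits with Pr[1] = p. All strings of Hamming weight w
    are equiprobable, so we only need to scan the n+1 possible weights."""
    H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)
    log2_sizes, prob = [], 0.0
    for w in range(n + 1):
        surprise = -(w * np.log2(p) + (n - w) * np.log2(1 - p)) / n
        if H - delta <= surprise <= H + delta:
            log2_sizes.append(log2comb(n, w))
            prob += 2.0 ** (log2comb(n, w) - n * surprise)
    # log-sum-exp in base 2 gives log2 of the typical-set size
    m = max(log2_sizes)
    log2_N = m + np.log2(sum(2.0 ** (x - m) for x in log2_sizes))
    return H, log2_N, prob

H, log2_N, prob = typical_set_stats(n=1000, p=0.1, delta=0.05)
print(f"H(p) = {H:.3f}; typical-set probability = {prob:.3f}")
print(f"(1/n) log2 N_typ = {log2_N/1000:.3f}, within delta of H(p)")
```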
We have not discussed at all the details of the compression code. We might imagine a huge lookup table which assigns a unique codeword to each message and vice versa, but because such a table has size exponential in n it is quite impractical for compressing and decompressing long messages. It is fascinating to study how to make the coding and decoding efficient while preserving a near optimal rate of compression, and quite important, too, if we really want to compress something. But this practical aspect of classical compression theory is beyond the scope of this book.
10.1.2 Joint typicality, conditional entropy, and mutual information
The Shannon entropy quantifies my ignorance per letter about the output of an information source. If the source X produces an n-letter message, then n(H(X) + o(1)) bits suffice to convey the content of the message, while n(H(X) − Ω(1)) bits do not suffice.
Two information sources X and Y can be correlated. Letters drawn from the sources are governed by a joint distribution XY = {(x, y), p(x, y)}, in which a pair of letters (x, y) appears with probability p(x, y). The sources are independent if p(x, y) = p(x)p(y), but correlated otherwise. If XY is a joint distribution, we use X to denote the marginal distribution, defined as
X = { x, p(x) = ∑_y p(x, y) }, (10.16)
and similarly for Y. If X and Y are correlated, then by reading a message generated by Y^n I reduce my ignorance about a message generated by X^n, which should make it possible to compress the output of X further than if I did not have access to Y.
To make this idea more precise, we use the concept of jointly typical sequences. Sampling from the distribution X^n Y^n, that is, sampling n times from the joint distribution XY, produces a message (~x, ~y) = (x_1 x_2 . . . x_n, y_1 y_2 . . . y_n) with probability
p(~x, ~y) = p(x_1, y_1) p(x_2, y_2) . . . p(x_n, y_n). (10.17)
Let us say that (~x, ~y) drawn from X^n Y^n is jointly δ-typical if
2^{−n(H(X)+δ)} ≤ p(~x) ≤ 2^{−n(H(X)−δ)},
2^{−n(H(Y)+δ)} ≤ p(~y) ≤ 2^{−n(H(Y)−δ)},
2^{−n(H(XY)+δ)} ≤ p(~x, ~y) ≤ 2^{−n(H(XY)−δ)}. (10.18)
Then, applying the strong law of large numbers simultaneously to the three distributions X^n, Y^n, and X^n Y^n, we infer that for ε, δ > 0 and n sufficiently large, a sequence drawn from X^n Y^n will be jointly δ-typical with probability ≥ 1−ε. Using Bayes' rule, we can then obtain upper and lower bounds on the conditional probability p(~x|~y) for jointly typical sequences:
p(~x|~y) = p(~x, ~y)/p(~y) ≥ 2^{−n(H(XY)+δ)} / 2^{−n(H(Y)−δ)} = 2^{−n(H(X|Y)+2δ)},
p(~x|~y) = p(~x, ~y)/p(~y) ≤ 2^{−n(H(XY)−δ)} / 2^{−n(H(Y)+δ)} = 2^{−n(H(X|Y)−2δ)}. (10.19)
Here we have introduced the quantity
H(X|Y) = H(XY) − H(Y) = ⟨−log p(x, y) + log p(y)⟩ = ⟨−log p(x|y)⟩, (10.20)
which is called the conditional entropy of X given Y.
The conditional entropy quantifies my remaining ignorance about x once I know y. From eq.(10.19) we see that if (~x, ~y) is jointly typical (as is the case with high probability for n large), then the number of possible values for ~x compatible with the known value of ~y is no more than 2^{n(H(X|Y)+2δ)}; hence we can convey ~x with a high success probability using only H(X|Y) + o(1) bits per letter. On the other hand we can't do much better, because if we use only 2^{n(H(X|Y)−δ′)} codewords, we are limited to conveying reliably no more than a fraction 2^{−n(δ′−2δ)} of all the jointly typical messages. To summarize, H(X|Y) is the number of additional bits per letter needed to specify both ~x and ~y once ~y is known. Similarly, H(Y|X) is the number of additional bits per letter needed to specify both ~x and ~y when ~x is known.
The information about X that I gain when I learn Y is quantified by how much the number of bits per letter needed to specify X is reduced when Y is known. This is
I(X; Y) := H(X) − H(X|Y)
= H(X) + H(Y) − H(XY)
= H(Y) − H(Y|X), (10.21)
which is called the mutual information. The mutual information I(X; Y) quantifies how X and Y are correlated, and is symmetric under interchange of X and Y: I find out as much about X by learning Y as about Y by learning X. Learning Y never reduces my knowledge of X, so I(X; Y) is obviously nonnegative, and indeed the inequality H(X) − H(X|Y) ≥ 0 follows easily from the convexity of the log function.
Of course, if X and Y are completely uncorrelated, we have p(x, y) = p(x)p(y), and
I(X; Y) = ⟨ log [ p(x, y) / (p(x)p(y)) ] ⟩ = 0; (10.22)
we don't find out anything about X by learning Y if there is no correlation between X and Y.
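These quantities are easy to evaluate for a small joint distribution. The Python sketch below, using a pair of correlated bits as a toy example of my own choosing, computes eq.(10.20) and eq.(10.21) directly from the definitions.
```python
import numpy as np

def H(p):
    """Shannon entropy in bits of a (possibly multi-dimensional) distribution."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Joint distribution p(x, y) for a pair of correlated bits:
# x is uniform, and y equals x except flipped with probability 0.1.
pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])
px, py = pxy.sum(axis=1), pxy.sum(axis=0)

H_X_given_Y = H(pxy) - H(py)          # eq. (10.20)
I_XY = H(px) + H(py) - H(pxy)         # eq. (10.21)
print(f"H(X|Y) = {H_X_given_Y:.3f} bits, I(X;Y) = {I_XY:.3f} bits")
# Here H(X|Y) = H2(0.1) ≈ 0.469 and I(X;Y) = 1 − H2(0.1) ≈ 0.531.
```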
10.1.3 Distributed source coding
To sharpen our understanding of the operational meaning of conditional entropy, consider this situation: Suppose that the joint distribution XY is sampled n times, where Alice receives the n-letter message ~x and Bob receives the n-letter message ~y. Now Alice is to send a message to Bob which will enable Bob to determine ~x with high success probability, and Alice wants to send as few bits to Bob as possible. This task is harder than in the scenario considered in §10.1.2, where we assumed that the encoder and the decoder share full knowledge of ~y, and can choose their code for compressing ~x accordingly. It turns out, though, that even in this more challenging setting Alice can compress the message she sends to Bob down to n(H(X|Y) + o(1)) bits, using a method called Slepian-Wolf coding.
Before receiving (~x, ~y), Alice and Bob agree to sort all the possible n-letter messages that Alice might receive into 2^{nR} possible bins of equal size, where the choice of bins is known to both Alice and Bob. When Alice receives ~x, she sends nR bits to Bob, identifying the bin that contains ~x. After Bob receives this message, he knows both ~y and the bin containing ~x. If there is a unique message in that bin which is jointly typical with ~y, Bob decodes accordingly. Otherwise, he decodes arbitrarily. This procedure can fail either because ~x and ~y are not jointly typical, or because there is more than one message in the bin which is jointly typical with ~y. Otherwise, Bob is sure to decode correctly.
Since ~x and ~y are jointly typical with high probability, the compression scheme works if it is unlikely for a bin to contain an incorrect message which is jointly typical with ~y. If ~y is typical, what can we say about the number N_{typ|~y} of messages ~x that are jointly typical with ~y? Using eq.(10.19), we have
1 ≥ ∑_{typical ~x|~y} p(~x|~y) ≥ N_{typ|~y} 2^{−n(H(X|Y)+2δ)}, (10.23)
and thus
N_{typ|~y} ≤ 2^{n(H(X|Y)+2δ)}. (10.24)
Now, to estimate the probability of a decoding error, we need to specify how the bins are chosen. Let's assume the bins are chosen uniformly at random, or equivalently, let's consider averaging uniformly over all codes that divide the length-n strings into 2^{nR} bins of equal size. Then the probability that a particular bin contains a message jointly typical with a specified ~y purely by accident is bounded above by
2^{−nR} N_{typ|~y} ≤ 2^{−n(R−H(X|Y)−2δ)}. (10.25)
We conclude that if Alice sends R bits to Bob per each letter of the message ~x, where
R = H(X|Y) + o(1), (10.26)
then the probability of a decoding error vanishes in the limit n → ∞, at least when we average uniformly over all codes. Surely, then, there must exist a particular sequence of codes Alice and Bob can use to achieve the rate R = H(X|Y) + o(1), as we wanted to show.
In this scenario, Alice and Bob jointly know (~x, ~y), but initially neither Alice nor Bob has access to all their shared information. The goal is to merge all the information on Bob's side with minimal communication from Alice to Bob, and we have found that H(X|Y) + o(1) bits of communication per letter suffice for this purpose. Similarly, the information can be merged on Alice's side using H(Y|X) + o(1) bits of communication per letter from Bob to Alice.
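A minimal simulation of this protocol is sketched below (my own illustration). The random binning follows the description above; as a simplifying assumption, Bob's joint-typicality test is replaced by minimum Hamming distance decoding, which plays the same role when the correlation between ~x and ~y is a bit-flip channel. The block length and rate are illustrative only.
```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 16, 0.11                      # block length; Pr[x_i != y_i]
H_cond = -q*np.log2(q) - (1-q)*np.log2(1-q)   # H(X|Y) ≈ 0.50 bits/letter
R = 0.75                             # bin rate, comfortably above H(X|Y)
n_bins = 2 ** round(R * n)           # 2^{nR} bins

# Random binning: every n-bit string (as an integer) gets a random bin.
bins = rng.integers(0, n_bins, size=2**n)

errors, trials = 0, 200
for _ in range(trials):
    x = int(rng.integers(0, 2**n))   # Alice's string (uniform source here)
    noise = sum(1 << i for i in range(n) if rng.random() < q)
    y = x ^ noise                    # Bob's side information
    # Bob searches the announced bin for the string most compatible with y;
    # minimum Hamming distance stands in for the joint-typicality test.
    candidates = np.flatnonzero(bins == bins[x])
    dists = [bin(int(c) ^ y).count("1") for c in candidates]
    errors += (candidates[int(np.argmin(dists))] != x)
print(f"rate R = {R} vs H(X|Y) ≈ {H_cond:.2f}: error rate ≈ {errors/trials:.2f}")
```
At this short block length the failure probability is noticeable; the theorem concerns the limit n → ∞, where any R above H(X|Y) suffices.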
10.1.4 The noisy channel coding theorem
Suppose Alice wants to send a message to Bob, but the communication channel linking Alice and Bob is noisy. Each time they use the channel, Bob receives the letter y with probability p(y|x) if Alice sends the letter x. Using the channel n ≫ 1 times, Alice hopes to transmit a long message to Bob.
Alice and Bob realize that to communicate reliably despite the noise they should use some kind of code. For example, Alice might try sending the same bit k times, with Bob using a majority vote of the k noisy bits he receives to decode what Alice sent. One wonders: for a given channel, is it possible to ensure perfect transmission asymptotically, i.e., in the limit where the number of channel uses n → ∞? And what can be said about the rate of the code; that is, how many bits must be sent per letter of the transmitted message?
Shannon answered these questions. He showed that any channel can be used for perfectly reliable communication at an asymptotic nonzero rate, as long as there is some correlation between the channel's input and its output. Furthermore, he found a useful formula for the optimal rate that can be achieved. These results are the content of the noisy channel coding theorem.
Capacity of the binary symmetric channel. To be concrete, suppose we use the binary alphabet {0, 1}, and the binary symmetric channel; this channel acts on each bit independently, flipping its value with probability p, and leaving it intact with probability 1−p. Thus the conditional probabilities characterizing the channel are
p(0|0) = 1−p, p(0|1) = p,
p(1|0) = p, p(1|1) = 1−p. (10.27)
We want to construct a family of codes with increasing block size n, such that the probability of a decoding error goes to zero as n → ∞. For each n, the code contains 2^k codewords among the 2^n possible strings of length n. The rate R of the code, the number of encoded data bits transmitted per physical bit carried by the channel, is
R = k/n. (10.28)
To protect against errors, we should choose the code so that the codewords are as “far apart” as possible. For given values of n and k, we want to maximize the number of bits that must be flipped to change one codeword to another, the Hamming distance between the two codewords.
For any n-bit input message, we expect about np of the bits to flip; the input diffuses into one of about 2^{nH(p)} typical output strings, occupying an “error sphere” of “Hamming radius” np about the input string. To decode reliably, we want to choose our input codewords so that the error spheres of two different codewords do not overlap substantially. Otherwise, two different inputs will sometimes yield the same output, and decoding errors will inevitably occur. To avoid such decoding ambiguities, the total number of strings contained in all 2^k = 2^{nR} error spheres should not exceed the total number 2^n of possible output strings; we therefore require
2^{nH(p)} 2^{nR} ≤ 2^n (10.29)
or
R ≤ 1 − H(p) := C(p). (10.30)
If transmission is highly reliable, we cannot expect the rate of the code to exceed C(p). But is the rate R = C(p) actually achievable asymptotically?
In fact transmission with R = C − o(1) and negligible decoding error probability is possible. Perhaps Shannon's most ingenious idea was that this rate can be achieved by an average over “random codes.” Though choosing a code at random does not seem like a clever strategy, rather surprisingly it turns out that random coding achieves as high a rate as any other coding scheme in the limit n → ∞. Since C is the optimal rate for reliable transmission of data over the noisy channel it is called the channel capacity.
Suppose that X is the uniformly random ensemble for a single bit (either 0 with p = 1/2 or 1 with p = 1/2), and that we sample from X^n a total of 2^{nR} times to generate 2^{nR} “random codewords.” The resulting code is known by both Alice and Bob. To send nR bits of information, Alice chooses one of the codewords and sends it to Bob by using the channel n times. To decode the n-bit message he receives, Bob draws a “Hamming sphere” with “radius” slightly larger than np, containing
2^{n(H(p)+δ)} (10.31)
strings. If this sphere contains a unique codeword, Bob decodes the message accordingly. If the sphere contains more than one codeword, or no codewords, Bob decodes arbitrarily.
How likely is a decoding error? For any positive δ, Bob's decoding sphere is large enough that it is very likely to contain the codeword sent by Alice when n is sufficiently large. Therefore, we need only worry that the sphere might contain another codeword just by accident. Since there are altogether 2^n possible strings, Bob's sphere contains a fraction
f = 2^{n(H(p)+δ)} / 2^n = 2^{−n(C(p)−δ)} (10.32)
of all the strings. Because the codewords are uniformly random, the probability that Bob's sphere contains any particular codeword aside from the one sent by Alice is f, and the probability that the sphere contains any one of the 2^{nR} − 1 invalid codewords is no more than
2^{nR} f = 2^{−n(C(p)−R−δ)}. (10.33)
Since δ may be as small as we please, we may choose R = C(p) − c where c is any positive constant, and the decoding error probability will approach zero as n → ∞.
When we speak of codes chosen at random, we really mean that we are averaging over many possible codes. The argument so far has shown that the average probability of error is small, where we average over the choice of random code, and for each specified code we also average over all codewords. It follows that there must be a particular sequence of codes such that the average probability of error (when we average over the codewords) vanishes in the limit n → ∞. We would like a stronger result: that the probability of error is small for every codeword.
To establish the stronger result, let p_i denote the probability of a decoding error when codeword i is sent. For any positive ε and sufficiently large n, we have demonstrated the existence of a code such that
(1/2^{nR}) ∑_{i=1}^{2^{nR}} p_i ≤ ε. (10.34)
Let N_{2ε} denote the number of codewords with p_i ≥ 2ε. Then we infer that
(1/2^{nR}) N_{2ε} (2ε) ≤ ε or N_{2ε} ≤ 2^{nR−1}; (10.35)
we see that we can throw away at most half of the codewords, to achieve p_i ≤ 2ε for every codeword. The new code we have constructed has
Rate = R − 1/n, (10.36)
which approaches R as n → ∞. We have seen, then, that the rate R = C(p) − o(1) is asymptotically achievable with negligible probability of error, where C(p) = 1 − H(p).
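The random coding argument can be watched in action with a small Monte Carlo sketch (my own illustration, with illustrative parameters; minimum-distance decoding stands in for Bob's Hamming-sphere decoder). At such a short block length the error rate is modest; it falls as n grows.
```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 15, 0.1                       # block length, flip probability
C = 1 - (-p*np.log2(p) - (1-p)*np.log2(1-p))   # C(p) ≈ 0.531
k = 6                                 # rate R = 6/15 = 0.4 < C(p)
codebook = rng.integers(0, 2, size=(2**k, n))  # 2^k random codewords

errors, trials = 0, 2000
for _ in range(trials):
    i = rng.integers(0, 2**k)
    received = codebook[i] ^ (rng.random(n) < p)  # BSC noise
    # Decode to the closest codeword in Hamming distance (a stand-in
    # for decoding to the unique codeword in Bob's sphere).
    dists = np.sum(codebook != received, axis=1)
    errors += (np.argmin(dists) != i)
print(f"R = {k/n:.2f}, C(p) = {C:.2f}, block error rate ≈ {errors/trials:.3f}")
```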
Mutual information as an achievable rate. Now consider how to apply this random coding argument to more general alphabets and channels. The channel is characterized by p(y|x), the conditional probability that the letter y is received when the letter x is sent. We fix an ensemble X = {x, p(x)} for the input letters, and generate the codewords for a length-n code with rate R by sampling 2^{nR} times from the distribution X^n; the code is known by both the sender Alice and the receiver Bob. To convey an encoded nR-bit message, one of the 2^{nR} n-letter codewords is selected and sent by using the channel n times. The channel acts independently on the n letters, governed by the same conditional probability distribution p(y|x) each time it is used. The input ensemble X, together with the conditional probability characterizing the channel, determines the joint ensemble XY for each letter sent, and therefore the joint ensemble X^n Y^n for the n uses of the channel.
To define a decoding procedure, we use the notion of joint typicality introduced in §10.1.2. When Bob receives the n-letter output message ~y, he determines whether there is an n-letter input codeword ~x jointly typical with ~y. If such ~x exists and is unique, Bob decodes accordingly. If there is no ~x jointly typical with ~y, or more than one such ~x, Bob decodes arbitrarily.
How likely is a decoding error? For any positive ε and δ, the (~x, ~y) drawn from X^n Y^n is jointly δ-typical with probability at least 1 − ε if n is sufficiently large. Therefore, we need only worry that there might be more than one codeword jointly typical with ~y.
Suppose that Alice samples X^n to generate a codeword ~x, which she sends to Bob using the channel n times. Then Alice samples X^n a second time, producing another codeword ~x′. With probability close to one, both ~y and ~x′ are δ-typical. But what is the probability that ~x′ is jointly δ-typical with ~y?
Because the samples are independent, the probability of drawing these two codewords factorizes as p(~x′, ~x) = p(~x′)p(~x), and likewise the channel output ~y when the first codeword is sent is independent of the second channel input ~x′, so p(~x′, ~y) = p(~x′)p(~y). From eq.(10.18) we obtain an upper bound on the number N_{j.t.} of jointly δ-typical (~x, ~y):
1 ≥ ∑_{j.t. (~x,~y)} p(~x, ~y) ≥ N_{j.t.} 2^{−n(H(XY)+δ)} ⟹ N_{j.t.} ≤ 2^{n(H(XY)+δ)}. (10.37)
We also know that each δ-typical ~x′ occurs with probability p(~x′) ≤ 2^{−n(H(X)−δ)} and that each δ-typical ~y occurs with probability p(~y) ≤ 2^{−n(H(Y)−δ)}. Therefore, the probability that ~x′ and ~y are jointly δ-typical is bounded above by
∑_{j.t. (~x′,~y)} p(~x′)p(~y) ≤ N_{j.t.} 2^{−n(H(X)−δ)} 2^{−n(H(Y)−δ)}
≤ 2^{n(H(XY)+δ)} 2^{−n(H(X)−δ)} 2^{−n(H(Y)−δ)} = 2^{−n(I(X;Y)−3δ)}. (10.38)
If there are 2^{nR} codewords, all generated independently by sampling X^n, then the probability that any other codeword besides ~x is jointly typical with ~y is bounded above by
2^{nR} 2^{−n(I(X;Y)−3δ)} = 2^{n(R−I(X;Y)+3δ)}. (10.39)
Since ε and δ are as small as we please, we may choose R = I(X; Y) − c, where c is any positive constant, and the decoding error probability will approach zero as n → ∞.
So far we have shown that the error probability is small when we average over codes and over codewords. To complete the argument we use the same reasoning as in our discussion of the capacity of the binary symmetric channel. There must exist a particular sequence of codes with zero error probability in the limit n → ∞, when we average over codewords. And by pruning the codewords, reducing the rate by a negligible amount, we can ensure that the error probability is small for every codeword. We conclude that the rate
R = I(X; Y) − o(1) (10.40)
is asymptotically achievable with negligible probability of error. This result provides a concrete operational interpretation for the mutual information I(X; Y); it is the information per letter we can transmit over the channel, supporting the heuristic claim that I(X; Y) quantifies the information we gain about X when we have access to Y.
The mutual information I(X; Y) depends not only on the channel's conditional probability p(y|x) but also on the a priori probability p(x) defining the codeword ensemble X. The achievability argument for random coding applies for any choice of X, so we have demonstrated that errorless transmission over the noisy channel is possible for any rate R strictly less than
C := max_X I(X; Y). (10.41)
This quantity C is called the channel capacity; it depends only on the conditional probabilities p(y|x) that define the channel.
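The maximization in eq.(10.41) can be carried out numerically. A standard tool is the Blahut-Arimoto algorithm; the sketch below (an illustration I add here, not part of the argument above) iterates between the output distribution and a reweighted input distribution, and for the binary symmetric channel it reproduces C(p) = 1 − H(p).
```python
import numpy as np

def channel_capacity(W, iters=200):
    """Blahut-Arimoto: maximize I(X;Y) over input distributions p(x)
    for a channel W[x, y] = p(y|x). Returns the capacity in bits."""
    nx = W.shape[0]
    p = np.full(nx, 1.0 / nx)            # start from the uniform input
    for _ in range(iters):
        q = p @ W                        # output distribution p(y)
        # D(x) = sum_y p(y|x) log2 [ p(y|x) / p(y) ]
        with np.errstate(divide="ignore", invalid="ignore"):
            terms = np.where(W > 0, W * np.log2(W / q), 0.0)
        p = p * 2.0 ** terms.sum(axis=1) # reweight inputs toward high D(x)
        p /= p.sum()
    q = p @ W                            # mutual information of the final p
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(W > 0, W * np.log2(W / q), 0.0)
    return float(p @ terms.sum(axis=1))

# Binary symmetric channel with flip probability 0.1:
p_flip = 0.1
W = np.array([[1 - p_flip, p_flip],
              [p_flip, 1 - p_flip]])
print(channel_capacity(W))   # ≈ 0.531 = 1 − H2(0.1)
```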
Upper bound on the capacity. We have now shown that any rate R < C is achievable, but can R exceed C with the error probability still approaching 0 for large n? To see that a rate for errorless transmission exceeding C is not possible, we reason as follows.
Consider any code with 2^{nR} codewords, and consider the uniform ensemble on the codewords, denoted X̃^n, in which each codeword occurs with probability 2^{−nR}. Evidently, then,
H(X̃^n) = nR. (10.42)
Sending the codewords through n uses of the channel we obtain an ensemble Ỹ^n of output states, and a joint ensemble X̃^n Ỹ^n.
Because the channel acts on each letter independently, the conditional probability for n uses of the channel factorizes:
p(y_1 y_2 · · · y_n | x_1 x_2 · · · x_n) = p(y_1|x_1) p(y_2|x_2) · · · p(y_n|x_n), (10.43)
and it follows that the conditional entropy satisfies
H(Ỹ^n|X̃^n) = ⟨−log p(~y|~x)⟩ = ∑_i ⟨−log p(y_i|x_i)⟩ = ∑_i H(Ỹ_i|X̃_i), (10.44)
where X̃_i and Ỹ_i are the marginal probability distributions for the ith letter determined by our distribution on the codewords. Because Shannon entropy is subadditive, H(XY) ≤ H(X) + H(Y), we have
H(Ỹ^n) ≤ ∑_i H(Ỹ_i), (10.45)
and therefore
I(Ỹ^n; X̃^n) = H(Ỹ^n) − H(Ỹ^n|X̃^n)
≤ ∑_i ( H(Ỹ_i) − H(Ỹ_i|X̃_i) )
= ∑_i I(Ỹ_i; X̃_i) ≤ nC. (10.46)
The mutual information of the messages sent and received is bounded above by the sum of the mutual information per letter, and the mutual information for each letter is bounded above by the capacity, because C is defined as the maximum of I(X; Y) over all input ensembles.
Recalling the symmetry of mutual information, we have
I(X̃^n; Ỹ^n) = H(X̃^n) − H(X̃^n|Ỹ^n)
= nR − H(X̃^n|Ỹ^n) ≤ nC. (10.47)
Now, if we can decode reliably as n → ∞, this means that the input codeword is completely determined by the signal received, or that the conditional entropy of the input (per letter) must get small:
(1/n) H(X̃^n|Ỹ^n) → 0. (10.48)
If errorless transmission is possible, then, eq.(10.47) becomes
R ≤ C + o(1), (10.49)
in the limit n → ∞. The asymptotic rate cannot exceed the capacity. In Exercise 10.8, you will sharpen the statement eq.(10.48), showing that
(1/n) H(X̃^n|Ỹ^n) ≤ (1/n) H_2(p_e) + p_e R, (10.50)
where p_e denotes the decoding error probability, and H_2(p_e) = −p_e log_2 p_e − (1−p_e) log_2(1−p_e).
We have now seen that the capacity C is the highest achievable rate of communication through the noisy channel, where the probability of error goes to zero as the number of letters in the message goes to infinity. This is Shannon's noisy channel coding theorem. What is particularly remarkable is that, although the capacity is achieved by messages that are many letters in length, we have obtained a single-letter formula for the capacity, expressed in terms of the optimal mutual information I(X; Y) for just a single use of the channel.
The method we used to show that R = C − o(1) is achievable, averaging over random codes, is not constructive. Since a random code has no structure or pattern, encoding and decoding are unwieldy, requiring an exponentially large code book. Nevertheless, the theorem is important and useful, because it tells us what is achievable, and not achievable, in principle. Furthermore, since I(X; Y) is a concave function of X = {x, p(x)} (with {p(y|x)} fixed), it has a unique local maximum, and C can often be computed (at least numerically) for channels of interest. Finding codes which can be efficiently encoded and decoded, and come close to achieving the capacity, is a very interesting pursuit, but beyond the scope of our lightning introduction to Shannon theory.
10.2 Von Neumann Entropy
In classical information theory, we often consider a source that prepares messages of n letters (n ≫ 1), where each letter is drawn independently from an ensemble X = {x, p(x)}. We have seen that the Shannon entropy H(X) is the number of incompressible bits of information carried per letter (asymptotically as n → ∞).
We may also be interested in correlations among messages. The correlations between two ensembles of letters X and Y are characterized by conditional probabilities p(y|x). We have seen that the mutual information
I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X), (10.51)
is the number of bits of information per letter about X that we can acquire by reading Y (or vice versa). If the p(y|x)'s characterize a noisy channel, then, I(X; Y) is the amount of information per letter that can be transmitted through the channel (given the a priori distribution X for the channel inputs).
We would like to generalize these considerations to quantum information. We may imagine a source that prepares messages of n letters, but where each letter is chosen from an ensemble of quantum states. The signal alphabet consists of a set of quantum states {ρ(x)}, each occurring with a specified a priori probability p(x).
As we discussed at length in Chapter 2, the probability of any outcome of any measurement of a letter chosen from this ensemble, if the observer has no knowledge about which letter was prepared, can be completely characterized by the density operator
ρ = ∑_x p(x) ρ(x); (10.52)
for a POVM E = {E_a}, the probability of outcome a is
Prob(a) = tr(E_a ρ). (10.53)
For this (or any) density operator, we may define the Von Neumann entropy
H(ρ) = −tr(ρ log ρ). (10.54)
Of course, we may choose an orthonormal basis {|a⟩} that diagonalizes ρ,
ρ = ∑_a λ_a |a⟩⟨a|; (10.55)
the vector of eigenvalues λ(ρ) is a probability distribution, and the Von Neumann entropy of ρ is just the Shannon entropy of this distribution,
H(ρ) = H(λ(ρ)). (10.56)
If ρ_A is the density operator of system A, we will sometimes use the notation
H(A) := H(ρ_A). (10.57)
Our convention is to denote quantum systems with A, B, C, . . . and classical probability distributions with X, Y, Z, . . . .
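In practice H(ρ) is computed from the spectrum, per eq.(10.56). A minimal Python sketch, with example states chosen purely for illustration:
```python
import numpy as np

def von_neumann_entropy(rho):
    """H(rho) = -tr(rho log2 rho), computed from the eigenvalues of rho
    (eq. 10.56): the Shannon entropy of the spectrum."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]          # 0 log 0 = 0 by continuity
    return float(-np.sum(evals * np.log2(evals)))

# A pure state has zero entropy; the maximally mixed qubit has 1 bit.
pure = np.array([[1, 0], [0, 0]], dtype=complex)
mixed = np.eye(2) / 2
print(von_neumann_entropy(pure), von_neumann_entropy(mixed))   # 0.0, 1.0
```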
In the case where the signal alphabet {|ϕ(x)⟩, p(x)} consists of mutually orthogonal pure states, the quantum source reduces to a classical one; all of the signal states can be perfectly distinguished, and H(ρ) = H(X), where X is the classical ensemble {x, p(x)}. The quantum source is more interesting when the signal states {ρ(x)} are not mutually commuting. We will argue that the Von Neumann entropy quantifies the incompressible information content of the quantum source (in the case where the signal states are pure) much as the Shannon entropy quantifies the information content of a classical source.
Indeed, we will find that Von Neumann entropy plays multiple roles. It quantifies not only the quantum information content per letter of the pure-state ensemble (the minimum number of qubits per letter needed to reliably encode the information) but also its classical information content (the maximum amount of information per letter, in bits, not qubits, that we can gain about the preparation by making the best possible measurement). And we will see that Von Neumann information enters quantum information in yet other ways, for example, quantifying the entanglement of a bipartite pure state. Thus quantum information theory is largely concerned with the interpretation and uses of Von Neumann entropy, much as classical information theory is largely concerned with the interpretation and uses of Shannon entropy.
In fact, the mathematical machinery we need to develop quantum information theory is very similar to Shannon's mathematics (typical sequences, random coding, . . . ); so similar as to sometimes obscure that the conceptual context is really quite different. The central issue in quantum information theory is that nonorthogonal quantum states cannot be perfectly distinguished, a feature with no classical analog.
10.2.1 Mathematical properties of H(ρ)
There are a handful of properties of the Von Neumann entropy H(ρ) which are frequently useful, many of which are closely analogous to corresponding properties of the Shannon entropy H(X). Proofs of some of these are Exercises 10.1, 10.2, 10.3.
1. Pure states. A pure state ρ = |ϕ⟩⟨ϕ| has H(ρ) = 0.
2. Unitary invariance. The entropy is unchanged by a unitary change of basis,
H(UρU^{−1}) = H(ρ), (10.58)
because H(ρ) depends only on the eigenvalues of ρ.
3. Maximum. If ρ has d nonvanishing eigenvalues, then
H(ρ) ≤ log d, (10.59)
with equality when all the nonzero eigenvalues are equal. The entropy is maximized when the quantum state is maximally mixed.
4. Concavity. For λ_1, λ_2, · · · , λ_n ≥ 0 and λ_1 + λ_2 + · · · + λ_n = 1,
H(λ_1 ρ_1 + · · · + λ_n ρ_n) ≥ λ_1 H(ρ_1) + · · · + λ_n H(ρ_n). (10.60)
The Von Neumann entropy is larger if we are more ignorant about how the state was prepared. This property is a consequence of the convexity of the log function.
5. Subadditivity. Consider a bipartite system AB in the state ρ_AB. Then
H(AB) ≤ H(A) + H(B) (10.61)
(where ρ_A = tr_B(ρ_AB) and ρ_B = tr_A(ρ_AB)), with equality for ρ_AB = ρ_A ⊗ ρ_B. Thus, entropy is additive for uncorrelated systems, but otherwise the entropy of the whole is less than the sum of the entropy of the parts. This property is the quantum generalization of subadditivity of Shannon entropy:
H(XY) ≤ H(X) + H(Y). (10.62)
6. Bipartite pure states. If the state ρ_AB of the bipartite system AB is pure, then
H(A) = H(B), (10.63)
because ρ_A and ρ_B have the same nonzero eigenvalues.
7. Quantum mutual information. As in the classical case, we define the mutual information of two quantum systems as
I(A; B) = H(A) + H(B) − H(AB), (10.64)
which is nonnegative because of the subadditivity of Von Neumann entropy, and zero only for a product state ρ_AB = ρ_A ⊗ ρ_B.
8. Triangle inequality (Araki-Lieb inequality). For a bipartite system,
H(AB) ≥ |H(A) − H(B)|. (10.65)
To derive the triangle inequality, consider the tripartite pure state |ψ⟩_ABC which purifies ρ_AB = tr_C(|ψ⟩⟨ψ|). Since |ψ⟩ is pure, H(A) = H(BC) and H(C) = H(AB); applying subadditivity to BC yields H(A) ≤ H(B) + H(C) = H(B) + H(AB). The same inequality applies with A and B interchanged, from which we obtain eq.(10.65).
The triangle inequality contrasts sharply with the analogous property of Shannon entropy,
H(XY) ≥ H(X), H(Y). (10.66)
The Shannon entropy of just part of a classical bipartite system cannot be greater than the Shannon entropy of the whole system. Not so for the Von Neumann entropy! For example, in the case of an entangled bipartite pure quantum state, we have H(A) = H(B) > 0, while H(AB) = 0. The entropy of the global system vanishes because our ignorance is minimal: we know as much about AB as the laws of quantum physics will allow. But we have incomplete knowledge of the parts A and B, with our ignorance quantified by H(A) = H(B). For a quantum system, but not for a classical one, information can be encoded in the correlations among the parts of the system, yet be invisible when we look at the parts one at a time.
Equivalently, a property that holds classically but not quantumly is
H(X|Y) = H(XY) − H(Y) ≥ 0. (10.67)
The Shannon conditional entropy H(X|Y) quantifies our remaining ignorance about X when we know Y, and equals zero when knowing Y makes us certain about X. On the other hand, the Von Neumann conditional entropy,
H(A|B) = H(AB) − H(B), (10.68)
can be negative; in particular we have H(A|B) = −H(A) = −H(B) < 0 if ρ_AB is an entangled pure state. How can it make sense that “knowing” the subsystem B makes us “more than certain” about the subsystem A? We'll return to this intriguing question in §10.8.2.
When X and Y are perfectly correlated, then H(XY) = H(X) = H(Y); the conditional entropy is H(X|Y) = H(Y|X) = 0 and the mutual information is I(X; Y) = H(X). In contrast, for a bipartite pure state of AB, the quantum state for which we may regard A and B as perfectly correlated, the mutual information is I(A; B) = 2H(A) = 2H(B). In this sense the quantum correlations are stronger than classical correlations.
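Several of the properties above are easy to verify numerically. The sketch below (an illustration I add here) builds the maximally entangled state of two qubits, takes partial traces, and checks subadditivity, the Araki-Lieb inequality, and the negativity of H(A|B):
```python
import numpy as np

def entropy(rho):
    ev = np.linalg.eigvalsh(rho)
    ev = ev[ev > 1e-12]
    return float(-np.sum(ev * np.log2(ev)))

def partial_trace(rho, dA, dB, keep):
    """Reduced state of a (dA*dB)-dimensional bipartite density matrix."""
    r = rho.reshape(dA, dB, dA, dB)
    if keep == "A":
        return np.trace(r, axis1=1, axis2=3)   # trace out B
    return np.trace(r, axis1=0, axis2=2)       # trace out A

# Maximally entangled pair of qubits: |phi+> = (|00> + |11>)/sqrt(2)
phi = np.zeros(4); phi[0] = phi[3] = 1/np.sqrt(2)
rho_AB = np.outer(phi, phi)
rho_A = partial_trace(rho_AB, 2, 2, "A")
rho_B = partial_trace(rho_AB, 2, 2, "B")

H_AB, H_A, H_B = entropy(rho_AB), entropy(rho_A), entropy(rho_B)
print(f"H(AB) = {H_AB:.3f}, H(A) = {H_A:.3f}, H(B) = {H_B:.3f}")
print("subadditivity:", H_AB <= H_A + H_B + 1e-9)        # eq. (10.61)
print("Araki-Lieb:   ", H_AB >= abs(H_A - H_B) - 1e-9)   # eq. (10.65)
print(f"H(A|B) = {H_AB - H_B:.3f}  (negative for an entangled pure state)")
```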
10.2.2 Mixing, measurement, and entropy
The Shannon entropy also has a property called Schur concavity, which means that if X = {x, p(x)} and Y = {y, q(y)} are two ensembles such that p ≺ q, then H(X) ≥ H(Y). In fact, any function on probability vectors is Schur concave if it is invariant under permutations of its arguments and also concave in each argument. Recall that p ≺ q (q majorizes p) means that p is at least as random as q in the sense that p = Dq for some doubly stochastic matrix D. Thus Schur concavity of H says that an ensemble with more randomness has higher entropy.
The Von Neumann entropy H(ρ) of a density operator is the Shannon entropy of its vector of eigenvalues λ(ρ). Furthermore, we showed in Exercise 2.6 that if the quantum state ensemble {|ϕ(x)⟩, p(x)} realizes ρ, then p ≺ λ(ρ); therefore H(ρ) ≤ H(X), where equality holds only for an ensemble of mutually orthogonal states. The decrease in entropy H(X) − H(ρ) quantifies how distinguishability is lost when we mix nonorthogonal pure states. As we will soon see, the amount of information we can gain by measuring ρ is no more than H(ρ) bits, so some of the information about which state was prepared has been irretrievably lost if H(ρ) < H(X).
If we perform an orthogonal measurement on ρ by projecting onto the basis {|y⟩}, then outcome y occurs with probability
q(y) = ⟨y|ρ|y⟩ = ∑_a |⟨y|a⟩|^2 λ_a, where ρ = ∑_a λ_a |a⟩⟨a| (10.69)
and {|a⟩} is the basis in which ρ is diagonal. Since D_{ya} = |⟨y|a⟩|^2 is a doubly stochastic matrix, q ≺ λ(ρ) and therefore H(Y) ≥ H(ρ), where equality holds only if the measurement is in the basis {|a⟩}. Mathematically, the conclusion is that for a nondiagonal and nonnegative Hermitian matrix, the diagonal elements are more random than the eigenvalues. Speaking more physically, the outcome of an orthogonal measurement is easiest to predict if we measure an observable which commutes with the density operator, and becomes less predictable if we measure in a different basis.
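A quick numerical check of this statement (an illustrative sketch of mine): measuring a qubit state in a basis rotated away from its eigenbasis increases the Shannon entropy of the outcome distribution, with equality at zero rotation.
```python
import numpy as np

def shannon(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 1e-12]
    return float(-np.sum(p * np.log2(p)))

# A qubit state diagonal in the computational basis.
rho = np.diag([0.9, 0.1])
H_rho = shannon(np.linalg.eigvalsh(rho))

# Measure in a basis rotated by angle theta from the eigenbasis.
for theta in (0.0, np.pi/8, np.pi/4):
    c, s = np.cos(theta), np.sin(theta)
    U = np.array([[c, -s], [s, c]])        # columns = measurement basis
    q = np.diag(U.T @ rho @ U).copy()      # outcome probabilities q(y)
    print(f"theta = {theta:.2f}: H(Y) = {shannon(q):.3f} >= H(rho) = {H_rho:.3f}")
```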
This majorization property has a further consequence, which will be useful for our discussion of quantum compression. Suppose that ρ is a density operator of a d-dimensional system, with eigenvalues λ_1 ≥ λ_2 ≥ · · · ≥ λ_d, and that E′ = ∑_{i=1}^{d′} |e_i⟩⟨e_i| is a projector onto a subspace Λ of dimension d′ ≤ d with orthonormal basis {|e_i⟩}. Then
tr(ρE′) = ∑_{i=1}^{d′} ⟨e_i|ρ|e_i⟩ ≤ ∑_{i=1}^{d′} λ_i, (10.70)
where the inequality follows because the diagonal elements of ρ in the basis {|e_i⟩} are majorized by the eigenvalues of ρ. In other words, if we perform a two-outcome orthogonal measurement, projecting onto either Λ or its orthogonal complement Λ^⊥, the probability of projecting onto Λ is no larger than the sum of the d′ largest eigenvalues of ρ (the Ky Fan dominance principle).
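Here is a small randomized check of the Ky Fan dominance principle (an illustrative sketch; the state and subspace are drawn at random):
```python
import numpy as np

rng = np.random.default_rng(3)
d, d_sub = 6, 3

# Random density operator: rho = M M† / tr(M M†) is positive with unit trace.
M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
rho = M @ M.conj().T
rho /= np.trace(rho).real

# Random d_sub-dimensional subspace: first d_sub columns of a random unitary.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)))
E = Q[:, :d_sub] @ Q[:, :d_sub].conj().T       # projector onto the subspace

lam = np.sort(np.linalg.eigvalsh(rho))[::-1]   # eigenvalues, descending
lhs = np.trace(rho @ E).real
print(f"tr(rho E) = {lhs:.4f} <= sum of {d_sub} largest eigenvalues = {lam[:d_sub].sum():.4f}")
```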
10.2.3 Strong subadditivity
In addition to the subadditivity property I(X; Y) ≥ 0, correlations of classical random variables obey a further property called strong subadditivity:
I(X; YZ) ≥ I(X; Y). (10.71)
This is the eminently reasonable statement that the correlations of X with YZ are at least as strong as the correlations of X with Y alone.
There is another useful way to think about (classical) strong subadditivity. Recalling the definition of mutual information we have
I(X; YZ) − I(X; Y) = ⟨ −log [ p(x)p(y, z) / p(x, y, z) ] + log [ p(x, y) / (p(x)p(y)) ] ⟩
= ⟨ −log [ (p(x, y)/p(y)) (p(y, z)/p(y)) (p(y) / p(x, y, z)) ] ⟩
= ⟨ −log [ p(x|y) p(z|y) / p(x, z|y) ] ⟩
= ∑_y p(y) I(X; Z|y) ≥ 0. (10.72)
For each fixed y, p(x, z|y) is a normalized probability distribution with nonnegative mutual information; hence I(X; YZ) − I(X; Y) is a convex combination of nonnegative terms and therefore nonnegative. The quantity I(X; Z|Y) := I(X; YZ) − I(X; Y) is called the conditional mutual information, because it quantifies how strongly X and Z are correlated when Y is known; strong subadditivity can be restated as the nonnegativity of conditional mutual information,
I(X; Z|Y) ≥ 0. (10.73)
One might ask under what conditions strong subadditivity is satisfied as an equality; that is, when does the conditional mutual information vanish? Since I(X; Z|Y) is a sum of nonnegative terms, each of these terms must vanish if I(X; Z|Y) = 0. Therefore for each y with p(y) > 0, we have I(X; Z|y) = 0. The mutual information vanishes only for a product distribution; therefore
p(x, z|y) = p(x|y) p(z|y) ⟹ p(x, y, z) = p(x|y) p(z|y) p(y). (10.74)
This means that the correlations between x and z arise solely from their shared correlation with y, in which case we say that x and z are conditionally independent.
Correlations of quantum systems also obey strong subadditivity:
I(A; BC) − I(A; B) := I(A; C|B) ≥ 0. (10.75)
But while the proof is elementary in the classical case, in the quantum setting strong subadditivity is a rather deep result with many important consequences. We will postpone the proof until §10.8.3, where we will be able to justify the quantum statement by giving it a clear operational meaning. We'll also see in Exercise 10.3 that strong subadditivity follows easily from another deep property, the monotonicity of relative entropy:
D(ρ_A ‖ σ_A) ≤ D(ρ_AB ‖ σ_AB), (10.76)
where
D(ρ ‖ σ) := tr ρ (log ρ − log σ). (10.77)
The relative entropy of two density operators on a system AB cannot be less than the induced relative entropy on the subsystem A. Insofar as we can regard the relative entropy as a measure of the “distance” between density operators, monotonicity is the reasonable statement that quantum states become no easier to distinguish when we look at the subsystem A than when we look at the full system AB. It also follows (Exercise 10.3) that the action of a quantum channel N cannot increase relative entropy:
D(N(ρ) ‖ N(σ)) ≤ D(ρ ‖ σ). (10.78)
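Monotonicity can be spot-checked numerically for a concrete channel; in the sketch below the qubit depolarizing channel stands in for N (an illustrative choice on my part, not a channel singled out by the argument above).
```python
import numpy as np

def mlog2(rho):
    """Matrix log base 2 of a positive definite Hermitian matrix."""
    ev, V = np.linalg.eigh(rho)
    return V @ np.diag(np.log2(ev)) @ V.conj().T

def rel_entropy(rho, sigma):
    """D(rho||sigma) = tr[rho (log2 rho - log2 sigma)], eq. (10.77)."""
    return float(np.trace(rho @ (mlog2(rho) - mlog2(sigma))).real)

def depolarize(rho, lam):
    """Qubit depolarizing channel: with probability lam, replace by I/2."""
    return (1 - lam) * rho + lam * np.eye(2) / 2

# Two full-rank qubit states (full rank keeps the matrix log well defined).
rho = np.array([[0.8, 0.3], [0.3, 0.2]])
sigma = np.array([[0.5, 0.1], [0.1, 0.5]])

print(rel_entropy(rho, sigma), ">=",
      rel_entropy(depolarize(rho, 0.4), depolarize(sigma, 0.4)))
```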
There are a few other ways of formulating strong subadditivity which are helpful to keep in mind. By expressing the quantum mutual information in terms of the Von Neumann entropy we find
H(ABC) + H(B) ≤ H(AB) + H(BC). (10.79)
While A, B, C are three disjoint quantum systems, we may view AB and BC as overlapping systems with intersection B and union ABC; then strong subadditivity says that the sum of the entropies of two overlapping systems is at least as large as the sum of the entropies of their union and their intersection. In terms of conditional entropy, strong subadditivity becomes
H(A|B) ≥ H(A|BC); (10.80)
loosely speaking, our ignorance about A when we know only B is no smaller than our ignorance about A when we know both B and C, but with the proviso that for quantum information “ignorance” can sometimes be negative!
As in the classical case, it is instructive to consider the condition for equality in strong subadditivity. What does it mean for systems to have quantum conditional independence, I(A; C|B) = 0? It is easy to formulate a sufficient condition. Suppose that system B has a decomposition as a direct sum of tensor products of Hilbert spaces
H_B = ⊕_j H_{B_j} = ⊕_j H_{B_j^L} ⊗ H_{B_j^R}, (10.81)
and that the state of ABC has the block diagonal form
ρ_ABC = ⊕_j p_j ρ_{A B_j^L} ⊗ ρ_{B_j^R C}. (10.82)
In each block labeled by j the state is a tensor product, with conditional mutual information
I(A; C|B_j) = I(A; B_j C) − I(A; B_j) = I(A; B_j^L) − I(A; B_j^L) = 0. (10.83)
What is less obvious is that the converse is also true: any state with I(A; C|B) = 0 has a decomposition as in eq.(10.82). This is a useful fact, though we will not give the proof here.
10.2.4 Monotonicity of mutual information
Strong subadditivity implies another important property of quantum mutual information, its monotonicity: a quantum channel acting on system B cannot increase the mutual information of A and B. To derive monotonicity, suppose that a quantum channel N_{B→B′} maps B to B′. Like any quantum channel, N has an isometric extension, its Stinespring dilation U_{B→B′E}, mapping B to B′ and a suitable environment system E. Since the isometry U does not change the eigenvalues of the density operator, it preserves the entropy of B and of AB,
H(B) = H(B′E), H(AB) = H(AB′E), (10.84)
which implies
I(A; B) = H(A) + H(B) − H(AB)
= H(A) + H(B′E) − H(AB′E) = I(A; B′E). (10.85)
From strong subadditivity, we obtain
I(A; B) = I(A; B′E) ≥ I(A; B′), (10.86)
the desired statement of monotonicity.
10.2.5 Entropy and thermodynamics
The concept of entropy first entered science through the study of thermodynamics, and the mathematical properties of entropy we have enumerated have many interesting thermodynamic implications. Here we will just mention a few ways in which the nonnegativity and monotonicity of quantum relative entropy relate to ideas encountered in thermodynamics.
There are two distinct ways to approach the foundations of quantum statistical physics. In one, we consider the evolution of an isolated closed quantum system, but ask what we will observe if we have access to only a portion of the full system. Even though the evolution of the full system is unitary, the evolution of a subsystem is not, and the subsystem may be accurately described by a thermal ensemble at late times. Information which is initially encoded locally in an out-of-equilibrium state becomes encoded more and more nonlocally as the system evolves, eventually becoming invisible to an observer confined to the subsystem.
In the other approach, we consider the evolution of an open system A, in contact with an unobserved environment E, and track the evolution of A only. From a fundamental perspective this second approach may be regarded as a special case of the first, since AE is closed, with A as a privileged subsystem. In practice, though, it is often more convenient to describe the evolution of an open system using a master equation as in Chapter 3, and to analyze evolution toward thermal equilibrium without explicit reference to the environment.
Free energy and the second law. Tools of quantum Shannon theory can help us understand why the state of an open system with Hamiltonian H might be expected to be close to the thermal Gibbs state
ρ_β = e^{−βH} / tr(e^{−βH}), (10.87)
where kT = β^{−1} is the temperature. Here let's observe one noteworthy feature of this state. For an arbitrary density operator ρ, consider its free energy
F(ρ) = E(ρ) − β^{−1} S(ρ), (10.88)
where E(ρ) = ⟨H⟩_ρ denotes the expectation value of the Hamiltonian in this state; for this subsection we respect the conventions of thermodynamics by denoting Von Neumann entropy by S(ρ) rather than H(ρ) (lest H be confused with the Hamiltonian H), and by using natural logarithms. Expressing F(ρ) and the free energy F(ρ_β) of the Gibbs state as
F(ρ) = tr(ρH) − β^{−1} S(ρ) = β^{−1} tr [ρ (ln ρ + βH)],
F(ρ_β) = −β^{−1} ln [tr(e^{−βH})], (10.89)
we see that the relative entropy of ρ and ρ_β is
D(ρ ‖ ρ_β) = tr(ρ ln ρ) − tr(ρ ln ρ_β) = β [F(ρ) − F(ρ_β)] ≥ 0, (10.90)
with equality only for ρ = ρ_β. The nonnegativity of relative entropy implies that at a given temperature β^{−1}, the Gibbs state ρ_β has the lowest possible free energy. Our open system, in contact with a thermal reservoir at temperature β^{−1}, will prefer the Gibbs state if it wishes to minimize its free energy.
What can we say about the approach to thermal equilibrium of an open system? We may anticipate that the joint unitary evolution of system and reservoir induces a quantum channel N acting on the system alone, and we know that relative entropy is monotonic: if
N : ρ ↦ ρ′, N : σ ↦ σ′, (10.91)
then
D(ρ′ ‖ σ′) ≤ D(ρ ‖ σ). (10.92)
Furthermore, if the Gibbs state is an equilibrium state, we expect this channel to preserve the Gibbs state,
N : ρ_β ↦ ρ_β; (10.93)
therefore,
D(ρ′ ‖ ρ_β) = β [F(ρ′) − F(ρ_β)] ≤ β [F(ρ) − F(ρ_β)] = D(ρ ‖ ρ_β), (10.94)
and hence
F(ρ′) ≤ F(ρ). (10.95)
Any channel that preserves the Gibbs state cannot increase the free energy; instead, the free energy of an out-of-equilibrium state is monotonically decreasing under open-system evolution. This statement is a version of the second law of thermodynamics.
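The identity in eq.(10.90) can be verified directly for a small example (an illustrative sketch, with an arbitrarily chosen qubit Hamiltonian and state):
```python
import numpy as np

def mlog(rho):
    """Matrix natural log of a positive definite Hermitian matrix."""
    ev, V = np.linalg.eigh(rho)
    return V @ np.diag(np.log(ev)) @ V.conj().T

beta = 2.0
H_op = np.diag([0.0, 1.0])                 # a qubit Hamiltonian
gibbs = np.diag(np.exp(-beta * np.diag(H_op)))
gibbs /= np.trace(gibbs)                   # eq. (10.87)

rho = np.array([[0.3, 0.1], [0.1, 0.7]])   # an out-of-equilibrium state

def free_energy(r):
    S = -np.trace(r @ mlog(r)).real        # Von Neumann entropy (in nats)
    return np.trace(r @ H_op).real - S / beta   # F = E - S/beta, eq. (10.88)

D = np.trace(rho @ (mlog(rho) - mlog(gibbs))).real
print(f"D(rho||rho_beta)              = {D:.4f}")
print(f"beta*(F(rho) - F(rho_beta))   = {beta*(free_energy(rho) - free_energy(gibbs)):.4f}")
# The two numbers agree (eq. 10.90), and both are nonnegative.
```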
10.2.6 Bekenstein’s entropy bound.
Similar ideas lead to Bekenstein's bound on entropy in quantum field
theory. The field-theoretic details, though interesting, would lead us
far afield. The gist is that Bekenstein proposed an inequality relating
the energy and the entropy in a bounded spatial region. This bound
was motivated by gravitational physics, but can be formulated without
reference to gravitation, and follows from properties of relative entropy.
A subtlety is that the entropy of a region is infinite in quantum field theory,
because of contributions coming from arbitrarily short-wavelength quan-
tum fluctuations near the boundary of the region. Therefore we have to
make a subtraction to define a finite quantity. The natural way to do this
is to subtract away the entropy of the same region in the vacuum state
of the theory, as any finite-energy state in a finite volume has the same
structure as the vacuum at very short distances. Although the vacuum is
a pure state, it, and any other reasonable state, has a marginal state in a
finite region which is highly mixed, because of entanglement between the
region and its complement.
For the purpose of our discussion here, we may designate any mixed
state ρ_0 we choose supported in the bounded region as the “vacuum,”
and define a corresponding “modular Hamiltonian” K by

ρ_0 = e^{−K} / tr(e^{−K}).  (10.96)

That is, we regard the state as the thermal mixed state of K, with the
temperature arbitrarily set to unity (which is just a normalization con-
vention for K). Then by rewriting eq.(10.90) we see that, for any state
ρ, D(ρ‖ρ_0) ≥ 0 implies

S(ρ) − S(ρ_0) ≤ tr(ρK) − tr(ρ_0 K).  (10.97)
The left-hand side, the entropy with vacuum entropy subtracted, is not
larger than the right-hand side, the (modular) energy with vacuum energy
subtracted. This is one version of Bekenstein's bound. Here K, which is
dimensionless, can be loosely interpreted as ER, where E is the energy
contained in the region and R is its linear size.
While the bound follows easily from nonnegativity of relative entropy,
the subtle part of the argument is recognizing that the (suitably sub-
tracted) expectation value of the modular Hamiltonian is a reasonable
way to define ER. The detailed justification for this involves properties
of relativistic quantum field theory that we won't go into here. Suffice it
to say that, because we constructed K by regarding the marginal state
of the vacuum as the Gibbs state associated with the Hamiltonian K,
we expect K to be linear in the energy, and dimensional analysis then
requires inclusion of the factor of R (in units with ℏ = c = 1).
Bekenstein was led to conjecture such a bound by thinking about black
hole thermodynamics. Leaving out numerical factors, just to get a feel
for the orders of magnitude of things, the entropy of a black hole with
circumference R is S ∼ R²/G, and its mass (energy) is E ∼ R/G,
where G is Newton's gravitational constant; hence S ∼ ER for a black
hole. Bekenstein realized that unless S = O(ER) for arbitrary states and
regions, we could throw extra stuff into the region, making a black hole
with lower entropy than the initial state, thus violating the (generalized)
second law of thermodynamics. Though black holes provided the motiva-
tion, G drops out of the inequality, which holds even in nongravitational
relativistic quantum field theories.
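Since eq.(10.97) is just D(ρ‖ρ_0) ≥ 0 in disguise, it can be spot-checked in finite dimensions. The sketch below (assuming Python with NumPy and SciPy, and with no field-theoretic content) draws a random full-rank “vacuum” ρ_0, sets K = −ln ρ_0, and verifies the subtracted entropy bound for another random state:

import numpy as np
from scipy.linalg import logm

def S(rho):
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

def rand_rho(d, rng):
    # a random full-rank density operator
    A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    rho = A @ A.conj().T
    return rho / np.trace(rho)

rng = np.random.default_rng(1)
d = 4
rho0 = rand_rho(d, rng)                  # the designated "vacuum"
K = -logm(rho0)                          # modular Hamiltonian; here tr(e^{-K}) = 1
rho = rand_rho(d, rng)
lhs = S(rho) - S(rho0)                                   # subtracted entropy
rhs = np.real(np.trace(rho @ K) - np.trace(rho0 @ K))    # subtracted modular energy
print(lhs <= rhs + 1e-9)                 # True: eq.(10.97), i.e. D(rho||rho0) >= 0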
10.2.7 Entropic uncertainty relations
The uncertainty principle asserts that noncommuting observables can-
not simultaneously have definite values. To translate this statement into
mathematics, recall that a Hermitian observable A has spectral represen-
tation

A = Σ_x |x⟩ a(x) ⟨x|,  (10.98)

where {|x⟩} is the orthonormal basis of eigenvectors of A and {a(x)} is
the corresponding vector of eigenvalues; if A is measured in the state ρ,
the outcome a(x) occurs with probability p(x) = ⟨x|ρ|x⟩. Thus A has
expectation value tr(ρA) and variance

(ΔA)² = tr(ρA²) − (tr ρA)².  (10.99)
Using the Cauchy-Schwarz inequality, we can show that if A and B are
two Hermitian observables and ρ = |ψ⟩⟨ψ| is a pure state, then

ΔA ΔB ≥ (1/2) |⟨ψ|[A, B]|ψ⟩|.  (10.100)
Eq.(10.100) is a useful statement of the uncertainty principle, but it has
drawbacks. It depends on the state |ψ⟩, and for that reason does not
fully capture the incompatibility of the two observables. Furthermore,
the variance does not characterize very well the unpredictability of the
measurement outcomes; entropy would be a more informative measure.
In fact there are entropic uncertainty relations which do not suffer from
these deficiencies. If we measure a state ρ by projecting onto the orthonor-
mal basis {|x⟩}, the outcomes define a classical ensemble

X = {x, p(x) = ⟨x|ρ|x⟩};  (10.101)

that is, a probability vector whose entries are the diagonal elements of ρ
in the x-basis. The Shannon entropy H(X) quantifies how uncertain we
are about the outcome before we perform the measurement. If {|z⟩} is
another orthonormal basis, there is a corresponding classical ensemble Z
describing the probability distribution of outcomes when we measure the
same state ρ in the z-basis. If the two bases are incompatible, there is a
tradeoff between our uncertainty about X and about Z, captured by the
inequality

H(X) + H(Z) ≥ log(1/c) + H(ρ),  (10.102)

where

c = max_{x,z} |⟨x|z⟩|².  (10.103)
The second term on the right-hand side, which vanishes if ρ is a pure
state, reminds us that our uncertainty increases when the state is mixed.
Like many good things in quantum information theory, this entropic un-
certainty relation follows from the monotonicity of the quantum relative
entropy.
For each measurement there is a corresponding quantum channel, re-
alized by performing the measurement and printing the outcome in a
classical register,

M_X : ρ ↦ Σ_x |x⟩⟨x| ρ |x⟩⟨x| =: ρ_X,
M_Z : ρ ↦ Σ_z |z⟩⟨z| ρ |z⟩⟨z| =: ρ_Z.  (10.104)
The Shannon entropy of the measurement outcome distribution is also
the Von Neumann entropy of the corresponding channel's output state,

H(X) = H(ρ_X),  H(Z) = H(ρ_Z);  (10.105)

the entropy of this output state can be expressed in terms of the relative
entropy of input and output, and the entropy of the channel input, as in

H(X) = −tr(ρ_X log ρ_X) = −tr(ρ log ρ_X) = D(ρ‖ρ_X) + H(ρ).  (10.106)
Using the monotonicity of relative entropy under the action of the chan-
nel M_Z, we have

D(ρ‖ρ_X) ≥ D(ρ_Z‖M_Z(ρ_X)),  (10.107)

where

D(ρ_Z‖M_Z(ρ_X)) = −H(ρ_Z) − tr( ρ_Z log M_Z(ρ_X) ),  (10.108)

and

M_Z(ρ_X) = Σ_{x,z} |z⟩⟨z|x⟩⟨x|ρ|x⟩⟨x|z⟩⟨z|.  (10.109)
Writing

log M_Z(ρ_X) = Σ_z |z⟩ log( Σ_x ⟨z|x⟩⟨x|ρ|x⟩⟨x|z⟩ ) ⟨z|,  (10.110)

we see that

−tr( ρ_Z log M_Z(ρ_X) ) = −Σ_z ⟨z|ρ|z⟩ log( Σ_x ⟨z|x⟩⟨x|ρ|x⟩⟨x|z⟩ ).  (10.111)

Now, because −log(·) is a monotonically decreasing function, we have

−log( Σ_x ⟨z|x⟩⟨x|ρ|x⟩⟨x|z⟩ ) ≥ −log( max_{x,z} |⟨x|z⟩|² Σ_x ⟨x|ρ|x⟩ ) = log(1/c),  (10.112)

and therefore

−tr( ρ_Z log M_Z(ρ_X) ) ≥ log(1/c).  (10.113)
Finally, putting together eqs.(10.106), (10.107), (10.108), (10.113), we find

H(X) − H(ρ) = D(ρ‖ρ_X) ≥ D(ρ_Z‖M_Z(ρ_X))
  = −H(Z) − tr( ρ_Z log M_Z(ρ_X) ) ≥ −H(Z) + log(1/c),  (10.114)

which is equivalent to eq.(10.102).
We say that two different bases {|x⟩}, {|z⟩} for a d-dimensional Hilbert
space are mutually unbiased if for all x, z

|⟨x|z⟩|² = 1/d;  (10.115)

thus, if we measure any x-basis state |x⟩ in the z-basis, all d outcomes
are equally probable. For measurements in two mutually unbiased bases
performed on a pure state, the entropic uncertainty relation becomes

H(X) + H(Z) ≥ log d.  (10.116)

Clearly this inequality is tight, as it is saturated by x-basis (or z-basis)
states, for which H(X) = 0 and H(Z) = log d.
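Here is a minimal numerical check of eq.(10.102), assuming Python with NumPy; for a qubit, the computational and Hadamard bases are mutually unbiased, so c = 1/2:

import numpy as np

def shannon(p):
    p = p[p > 1e-12]
    return -np.sum(p * np.log2(p))

d = 2
Bx = np.eye(d)                                    # x-basis (columns)
Bz = np.array([[1, 1], [1, -1]]) / np.sqrt(2)     # z-basis: Hadamard, mutually unbiased

rng = np.random.default_rng(7)
A = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
rho = A @ A.conj().T; rho /= np.trace(rho)        # a random mixed state

pX = np.real(np.array([Bx[:, i].conj() @ rho @ Bx[:, i] for i in range(d)]))
pZ = np.real(np.array([Bz[:, i].conj() @ rho @ Bz[:, i] for i in range(d)]))
c = max(abs(Bx[:, i].conj() @ Bz[:, j])**2 for i in range(d) for j in range(d))

Hrho = shannon(np.linalg.eigvalsh(rho))           # H(rho)
print(shannon(pX) + shannon(pZ) >= np.log2(1/c) + Hrho - 1e-9)   # True, eq.(10.102)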
10.3 Quantum Source Coding
What is the quantum analog of Shannon's source coding theorem?
Let's consider a long message consisting of n letters, where each letter
is a pure quantum state chosen by sampling from the ensemble

{|ϕ(x)⟩, p(x)}.  (10.117)

If the states of this ensemble are mutually orthogonal, then the message
might as well be classical; the interesting quantum case is where the
states are not orthogonal and therefore not perfectly distinguishable. The
density operator realized by this ensemble is

ρ = Σ_x p(x) |ϕ(x)⟩⟨ϕ(x)|,  (10.118)

and the entire n-letter message has the density operator

ρ^{⊗n} = ρ ⊗ ··· ⊗ ρ.  (10.119)

How redundant is the quantum information in this message? We would
like to devise a quantum code allowing us to compress the message to a
smaller Hilbert space, but without much compromising the fidelity of the
message. Perhaps we have a quantum memory device, and we know the
statistical properties of the recorded data; specifically, we know ρ. We
want to conserve space on our (very expensive) quantum hard drive by
compressing the data.
The optimal compression that can be achieved was found by Schu-
macher. As you might guess, the message can be compressed to a Hilbert
space H with

dim H = 2^{n(H(ρ)+o(1))}  (10.120)

with negligible loss of fidelity as n → ∞, while errorless compression to
dimension 2^{n(H(ρ)−Ω(1))} is not possible. In this sense, the Von Neumann
entropy is the number of qubits of quantum information carried per letter
of the message. Compression is always possible unless ρ is maximally
mixed, just as we can always compress a classical message unless the
information source is uniformly random. This result provides a precise
operational interpretation for Von Neumann entropy.
Once Shannon's results are known and understood, the proof of Schu-
macher's compression theorem is not difficult, as the mathematical ideas
needed are very similar to those used by Shannon. But conceptually
quantum compression is very different from its classical counterpart, as
the imperfect distinguishability of nonorthogonal quantum states is the
central idea.
10.3.1 Quantum compression: an example
Before discussing Schumacher's quantum compression protocol in full gen-
erality, it is helpful to consider a simple example. Suppose that each letter
is a single qubit drawn from the ensemble

|↑_z⟩ = (1, 0)ᵀ,  p = 1/2,  (10.121)
|↑_x⟩ = (1/√2, 1/√2)ᵀ,  p = 1/2,  (10.122)

so that the density operator of each letter is

ρ = (1/2)|↑_z⟩⟨↑_z| + (1/2)|↑_x⟩⟨↑_x|
  = (1/2)[1 0; 0 0] + (1/2)[1/2 1/2; 1/2 1/2] = [3/4 1/4; 1/4 1/4].  (10.123)
As is obvious from symmetry, the eigenstates of ρ are qubits oriented up
and down along the axis n̂ = (1/√2)(x̂ + ẑ):

|0′⟩ ≡ |↑_n̂⟩ = (cos π/8, sin π/8)ᵀ,
|1′⟩ ≡ |↓_n̂⟩ = (−sin π/8, cos π/8)ᵀ;  (10.124)
the eigenvalues are

λ(0′) = 1/2 + 1/(2√2) = cos²(π/8),
λ(1′) = 1/2 − 1/(2√2) = sin²(π/8);  (10.125)

evidently λ(0′) + λ(1′) = 1 and λ(0′)λ(1′) = 1/8 = det ρ. The eigenstate
|0′⟩ has equal (and relatively large) overlap with both signal states,

|⟨0′|↑_z⟩|² = |⟨0′|↑_x⟩|² = cos²(π/8) = .8535,  (10.126)

while |1′⟩ has equal (and relatively small) overlap with both,

|⟨1′|↑_z⟩|² = |⟨1′|↑_x⟩|² = sin²(π/8) = .1465.  (10.127)

Thus if we don't know whether |↑_z⟩ or |↑_x⟩ was sent, the best guess we
can make is |ψ⟩ = |0′⟩. This guess has the maximal fidelity with ρ,

F = (1/2)|⟨↑_z|ψ⟩|² + (1/2)|⟨↑_x|ψ⟩|²,  (10.128)

among all possible single-qubit states |ψ⟩ (F = .8535).
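The numbers in eqs.(10.125)-(10.128) are easy to reproduce; here is a minimal check, assuming Python with NumPy:

import numpy as np

up_z = np.array([1.0, 0.0])
up_x = np.array([1.0, 1.0]) / np.sqrt(2)
rho = 0.5 * np.outer(up_z, up_z) + 0.5 * np.outer(up_x, up_x)   # eq.(10.123)

evals, evecs = np.linalg.eigh(rho)     # ascending: sin^2(pi/8), cos^2(pi/8)
guess = evecs[:, 1]                    # |0'>, the likely eigenstate

F = 0.5 * (up_z @ guess)**2 + 0.5 * (up_x @ guess)**2   # eq.(10.128)
print(evals[1], F)                     # both 0.85355... = cos^2(pi/8)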
Now imagine that Alice needs to send three letters to Bob, but she can
afford to send only two qubits. Still, she wants Bob to reconstruct her
state with the highest possible fidelity. She could send Bob two of her
three letters, and ask Bob to guess |0′⟩ for the third. Then Bob receives
two letters with perfect fidelity, and his guess has F = .8535 for the
third; hence F = .8535 overall. But is there a more clever procedure that
achieves higher fidelity?
Yes, there is. By diagonalizing ρ, we decomposed the Hilbert space
of a single qubit into a “likely” one-dimensional subspace (spanned by
|0′⟩) and an “unlikely” one-dimensional subspace (spanned by |1′⟩). In a
similar way we can decompose the Hilbert space of three qubits into likely
and unlikely subspaces. If |ψ⟩ = |ψ₁⟩ ⊗ |ψ₂⟩ ⊗ |ψ₃⟩ is any signal state,
where the state of each qubit is either |↑_z⟩ or |↑_x⟩, we have

|⟨0′0′0′|ψ⟩|² = cos⁶(π/8) = .6219,
|⟨0′0′1′|ψ⟩|² = |⟨0′1′0′|ψ⟩|² = |⟨1′0′0′|ψ⟩|² = cos⁴(π/8) sin²(π/8) = .1067,
|⟨0′1′1′|ψ⟩|² = |⟨1′0′1′|ψ⟩|² = |⟨1′1′0′|ψ⟩|² = cos²(π/8) sin⁴(π/8) = .0183,
|⟨1′1′1′|ψ⟩|² = sin⁶(π/8) = .0031.  (10.129)
Thus, we may decompose the space into the likely subspace Λ spanned
by {|0′0′0′⟩, |0′0′1′⟩, |0′1′0′⟩, |1′0′0′⟩}, and its orthogonal complement Λ^⊥.
If we make an incomplete orthogonal measurement that projects a signal
state onto Λ or Λ^⊥, the probability of projecting onto the likely subspace
Λ is

p_likely = .6219 + 3(.1067) = .9419,  (10.130)

while the probability of projecting onto the unlikely subspace is

p_unlikely = 3(.0183) + .0031 = .0581.  (10.131)
To perform this measurement, Alice could, for example, first apply
a unitary transformation U that rotates the four high-probability basis
states to

|·⟩ ⊗ |·⟩ ⊗ |0⟩,  (10.132)

and the four low-probability basis states to

|·⟩ ⊗ |·⟩ ⊗ |1⟩;  (10.133)

then Alice measures the third qubit to perform the projection. If the
outcome is |0⟩, then Alice's input state has in effect been projected onto
Λ. She sends the remaining two unmeasured qubits to Bob. When Bob
receives this compressed two-qubit state |ψ_comp⟩, he decompresses it by
appending |0⟩ and applying U^{−1}, obtaining

|ψ′⟩ = U^{−1}(|ψ_comp⟩ ⊗ |0⟩).  (10.134)

If Alice's measurement of the third qubit yields |1⟩, she has projected her
input state onto the low-probability subspace Λ^⊥. In this event, the best
thing she can do is send the state that Bob will decompress to the most
likely state |0′0′0′⟩; that is, she sends the state |ψ_comp⟩ such that

|ψ′⟩ = U^{−1}(|ψ_comp⟩ ⊗ |0⟩) = |0′0′0′⟩.  (10.135)

Thus, if Alice encodes the three-qubit signal state |ψ⟩, sends two qubits
to Bob, and Bob decodes as just described, then Bob obtains the state ρ′,

|ψ⟩⟨ψ| ↦ ρ′ = E|ψ⟩⟨ψ|E + |0′0′0′⟩⟨ψ|(I − E)|ψ⟩⟨0′0′0′|,  (10.136)
where E is the projection onto Λ. The fidelity achieved by this procedure
is

F = ⟨ψ|ρ′|ψ⟩ = (⟨ψ|E|ψ⟩)² + ⟨ψ|(I − E)|ψ⟩ |⟨ψ|0′0′0′⟩|²
  = (.9419)² + (.0581)(.6219) = .9234.  (10.137)

This is indeed better than the naive procedure of sending two of the three
qubits each with perfect fidelity.
As we consider longer messages with more letters, the fidelity of the
compression improves, as long as we don't try to compress too much.
The Von Neumann entropy of the one-qubit ensemble is

H(ρ) = H(cos²(π/8)) = .60088 . . .  (10.138)

Therefore, according to Schumacher's theorem, we can shorten a long
message by the factor, say, .6009, and still achieve very good fidelity.
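The whole three-qubit protocol fits in a short numerical sketch (assuming Python with NumPy; the helper names are ours, not the text's). It averages the fidelity of eq.(10.136) over the eight equiprobable signal states and reproduces eq.(10.137):

import numpy as np
from itertools import product

c, s = np.cos(np.pi / 8), np.sin(np.pi / 8)
e0, e1 = np.array([c, s]), np.array([-s, c])          # |0'>, |1'>
up_z, up_x = np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2)

def kron3(a, b, c3):
    return np.kron(np.kron(a, b), c3)

# projector E onto Lambda, spanned by primed-basis states of weight <= 1
E = sum(np.outer(v, v) for bits in product((0, 1), repeat=3) if sum(bits) <= 1
        for v in [kron3(*[(e0, e1)[b] for b in bits])])
ground = kron3(e0, e0, e0)                            # |0'0'0'>

F_av = 0.0
for letters in product((up_z, up_x), repeat=3):       # 8 equiprobable signals
    psi = kron3(*letters)
    pE = psi @ E @ psi                                # <psi|E|psi> = .9419
    F_av += (pE**2 + (1 - pE) * (psi @ ground)**2) / 8
print(F_av)                                           # 0.9234, eq.(10.137)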
10.3.2 Schumacher compression in general
The key to Shannon's noiseless coding theorem is that we can code the
typical sequences and ignore the rest, without much loss of fidelity. To
quantify the compressibility of quantum information, we promote the
notion of a typical sequence to that of a typical subspace. The key to
Schumacher's noiseless quantum coding theorem is that we can code the
typical subspace and ignore its orthogonal complement, without much
loss of fidelity.
We consider a message of n letters where each letter is a pure quantum
state drawn from the ensemble {|ϕ(x)⟩, p(x)}, so that the density operator
of a single letter is

ρ = Σ_x p(x) |ϕ(x)⟩⟨ϕ(x)|.  (10.139)

Since the letters are drawn independently, the density operator of the
entire message is

ρ^{⊗n} ≡ ρ ⊗ ··· ⊗ ρ.  (10.140)

We claim that, for n large, this density matrix has nearly all of its sup-
port on a subspace of the full Hilbert space of the messages, where the
dimension of this subspace asymptotically approaches 2^{nH(ρ)}.
This claim follows directly from the corresponding classical statement,
for we may consider ρ to be realized by an ensemble of orthonormal pure
states, its eigenstates, where the probability assigned to each eigenstate
is the corresponding eigenvalue. In this basis our source of quantum
information is effectively classical, producing messages which are tensor
products of ρ eigenstates, each with a probability given by the product
of the corresponding eigenvalues. For a specified n and δ, define the δ-
typical subspace Λ as the space spanned by the eigenvectors of ρ^{⊗n} with
eigenvalues λ satisfying

2^{−n(H−δ)} ≥ λ ≥ 2^{−n(H+δ)}.  (10.141)

Borrowing directly from Shannon's argument, we infer that for any δ, ε >
0 and n sufficiently large, the sum of the eigenvalues of ρ^{⊗n} that obey
this condition satisfies

tr(ρ^{⊗n} E) ≥ 1 − ε,  (10.142)

where E denotes the projection onto the typical subspace Λ, and the
number dim(Λ) of such eigenvalues satisfies

2^{n(H+δ)} ≥ dim(Λ) ≥ (1 − ε) 2^{n(H−δ)}.  (10.143)
Our coding strategy is to send states in the typical subspace faithfully.
We can make a measurement that projects the input message onto either
Λ or Λ^⊥; the outcome will be Λ with probability p_Λ = tr(ρ^{⊗n}E) ≥ 1 − ε.
In that event, the projected state is coded and sent. Asymptotically, the
probability of the other outcome becomes negligible, so it matters little
what we do in that case.
The coding of the projected state merely packages it so it can be carried
by a minimal number of qubits. For example, we apply a unitary change
of basis U that takes each state |ψ_typ⟩ in Λ to a state of the form

U|ψ_typ⟩ = |ψ_comp⟩ ⊗ |0_rest⟩,  (10.144)

where |ψ_comp⟩ is a state of n(H + δ) qubits, and |0_rest⟩ denotes the state
|0⟩ ⊗ . . . ⊗ |0⟩ of the remaining qubits. Alice sends |ψ_comp⟩ to Bob, who
decodes by appending |0_rest⟩ and applying U^{−1}.
Suppose that

|ϕ(x⃗)⟩ = |ϕ(x₁)⟩ ⊗ . . . ⊗ |ϕ(x_n)⟩  (10.145)

denotes any one of the n-letter pure state messages that might be sent.
After coding, transmission, and decoding are carried out as just described,
Bob has reconstructed a state

|ϕ(x⃗)⟩⟨ϕ(x⃗)| ↦ ρ′(x⃗) = E|ϕ(x⃗)⟩⟨ϕ(x⃗)|E
  + ρ_Junk(x⃗) ⟨ϕ(x⃗)|(I − E)|ϕ(x⃗)⟩,  (10.146)

where ρ_Junk(x⃗) is the state we choose to send if the measurement yields
the outcome Λ^⊥. What can we say about the fidelity of this procedure?
The fidelity varies from message to message, so we consider the fidelity
averaged over the ensemble of possible messages:

F̄ = Σ_{x⃗} p(x⃗) ⟨ϕ(x⃗)|ρ′(x⃗)|ϕ(x⃗)⟩
  = Σ_{x⃗} p(x⃗) (⟨ϕ(x⃗)|E|ϕ(x⃗)⟩)²
    + Σ_{x⃗} p(x⃗) ⟨ϕ(x⃗)|ρ_Junk(x⃗)|ϕ(x⃗)⟩ ⟨ϕ(x⃗)|(I − E)|ϕ(x⃗)⟩
  ≥ Σ_{x⃗} p(x⃗) (⟨ϕ(x⃗)|E|ϕ(x⃗)⟩)²,  (10.147)

where the last inequality holds because the “Junk” term is nonnegative.
Since any real number z satisfies

(z − 1)² ≥ 0, or z² ≥ 2z − 1,  (10.148)

we have (setting z = ⟨ϕ(x⃗)|E|ϕ(x⃗)⟩)

(⟨ϕ(x⃗)|E|ϕ(x⃗)⟩)² ≥ 2⟨ϕ(x⃗)|E|ϕ(x⃗)⟩ − 1,  (10.149)

and hence

F̄ ≥ Σ_{x⃗} p(x⃗) ( 2⟨ϕ(x⃗)|E|ϕ(x⃗)⟩ − 1 )
  = 2 tr(ρ^{⊗n}E) − 1 ≥ 2(1 − ε) − 1 = 1 − 2ε.  (10.150)

Since ε and δ can be as small as we please, we have shown that it is
possible to compress the message to n(H + o(1)) qubits, while achieving
an average fidelity that becomes arbitrarily good as n gets large.
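The typical-subspace counting is also easy to probe numerically. Below is a sketch assuming Python with NumPy and SciPy, applied to the qubit ensemble of §10.3.1, whose single-letter eigenvalues are cos²(π/8) and sin²(π/8); each eigenvalue of ρ^{⊗n} with k factors of sin²(π/8) occurs C(n, k) times, so tr(ρ^{⊗n}E) and dim(Λ) reduce to sums over k:

import numpy as np
from scipy.special import gammaln

lam = np.array([np.cos(np.pi/8)**2, np.sin(np.pi/8)**2])   # eigenvalues of rho
H = -np.sum(lam * np.log2(lam))                  # H(rho) = 0.60088...

n, delta = 2000, 0.05
k = np.arange(n + 1)
# log2 of multiplicity C(n,k) and of each eigenvalue lam0^(n-k) lam1^k
logdim = (gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1)) / np.log(2)
logp = (n - k) * np.log2(lam[0]) + k * np.log2(lam[1])
typical = np.abs(-logp / n - H) <= delta         # eq.(10.141)

p_typ = np.sum(np.exp2(logdim + logp)[typical])  # tr(rho^{(x)n} E), eq.(10.142)
m = logdim[typical].max()                        # log-sum-exp for dim(Lambda)
log_dim = m + np.log2(np.sum(np.exp2(logdim[typical] - m)))
print(p_typ, log_dim / n)   # tends to 1 and to H(rho) as n grows, eq.(10.143)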
Is further compression possible? Let us suppose that Bob will decode
the message ρ_comp(x⃗) that he receives by appending qubits and applying
a unitary transformation U^{−1}, obtaining

ρ′(x⃗) = U^{−1}( ρ_comp(x⃗) ⊗ |0⟩⟨0| ) U  (10.151)

(“unitary decoding”), and suppose that ρ_comp(x⃗) has been compressed
to n(H − δ′) qubits. Then, no matter how the input messages have been
encoded, the decoded messages are all contained in a subspace Λ′ of Bob's
Hilbert space with dim(Λ′) = 2^{n(H−δ′)}.
If the input message is |ϕ(x⃗)⟩, then the density operator reconstructed
by Bob can be diagonalized as

ρ′(x⃗) = Σ_{a_{x⃗}} λ_{a_{x⃗}} |a_{x⃗}⟩⟨a_{x⃗}|,  (10.152)
where the |a_{x⃗}⟩'s are mutually orthogonal states in Λ′. The fidelity of the
reconstructed message is

F(x⃗) = ⟨ϕ(x⃗)|ρ′(x⃗)|ϕ(x⃗)⟩ = Σ_{a_{x⃗}} λ_{a_{x⃗}} |⟨ϕ(x⃗)|a_{x⃗}⟩|²
  ≤ Σ_{a_{x⃗}} |⟨ϕ(x⃗)|a_{x⃗}⟩|² ≤ ⟨ϕ(x⃗)|E′|ϕ(x⃗)⟩,  (10.153)

where E′ denotes the orthogonal projection onto the subspace Λ′. The
average fidelity therefore obeys

F̄ = Σ_{x⃗} p(x⃗) F(x⃗) ≤ Σ_{x⃗} p(x⃗) ⟨ϕ(x⃗)|E′|ϕ(x⃗)⟩ = tr( ρ^{⊗n} E′ ).  (10.154)
But, according to the Ky Fan dominance principle discussed in §10.2.2,
since E′ projects onto a space of dimension 2^{n(H−δ′)}, tr(ρ^{⊗n}E′) can be
no larger than the sum of the 2^{n(H−δ′)} largest eigenvalues of ρ^{⊗n}. The
δ-typical eigenvalues of ρ^{⊗n} are no larger than 2^{−n(H−δ)}, so the sum of
the 2^{n(H−δ′)} largest eigenvalues can be bounded above:

tr(ρ^{⊗n}E′) ≤ 2^{n(H−δ′)} 2^{−n(H−δ)} + ε = 2^{−n(δ′−δ)} + ε,  (10.155)

where the + ε accounts for the contribution from the atypical eigenvalues.
Since we may choose ε and δ as small as we please for sufficiently large
n, we conclude that the average fidelity F̄ gets small as n → ∞ if we
compress to H(ρ) − Ω(1) qubits per letter. We find, then, that H(ρ)
qubits per letter is the optimal compression of the quantum information
that can be achieved if we are to obtain good fidelity as n goes to infinity.
This is Schumacher's quantum source coding theorem.
The above argument applies to any conceivable encoding scheme, but
only to a restricted class of decoding schemes, unitary decodings. The
extension of the argument to general decoding schemes is sketched in
§10.6.3. The conclusion is the same. The point is that n(H − δ) qubits
are too few to faithfully encode the typical subspace.
There is another useful way to think about Schumacher's quantum com-
pression protocol. Suppose that Alice's density operator ρ_A^{⊗n} has a pu-
rification |ψ⟩_{RA} which Alice shares with Robert. Alice wants to convey
her share of |ψ⟩_{RA} to Bob with high fidelity, sending as few qubits to Bob
as possible. To accomplish this task, Alice can use the same procedure
as described above, attempting to compress the state of A by projecting
onto its typical subspace Λ. Alice's projection succeeds with probability

P(E) = ⟨ψ|(I ⊗ E)|ψ⟩ = tr( ρ^{⊗n} E ) ≥ 1 − ε,  (10.156)

where E projects onto Λ, and when successful prepares the state

(I ⊗ E)|ψ⟩ / √P(E).  (10.157)

Therefore, after Bob decompresses, the state he shares with Robert has
fidelity F_e with |ψ⟩ satisfying

F_e ≥ ( ⟨ψ|(I ⊗ E)|ψ⟩ )² = ( tr(ρ^{⊗n}E) )² ≥ (1 − ε)² ≥ 1 − 2ε.  (10.158)

We conclude that Alice can transfer her share of the pure state |ψ⟩_{RA} to
Bob by sending nH(ρ) + o(n) qubits, achieving arbitrarily good entan-
glement fidelity F_e as n → ∞.
To summarize, there is a close analogy between Shannon's classical
source coding theorem and Schumacher's quantum source coding theorem.
In the classical case, nearly all long messages are typical sequences, so we
can code only these and still have a small probability of error. In the
quantum case, nearly all long messages have nearly perfect overlap with
the typical subspace, so we can code only the typical subspace and still
achieve good fidelity.
Alternatively, Alice could send classical information to Bob, the string
x₁x₂···x_n, and Bob could follow these classical instructions to recon-
struct Alice's state |ϕ(x₁)⟩ ⊗ . . . ⊗ |ϕ(x_n)⟩. By this means, they could
achieve high-fidelity compression to H(X) + o(1) bits (or qubits) per
letter, where X is the classical ensemble {x, p(x)}. But if {|ϕ(x)⟩, p(x)}
is an ensemble of nonorthogonal pure states, this classically achievable
amount of compression is not optimal; some of the classical information
about the preparation of the state is redundant, because the nonorthog-
onal states cannot be perfectly distinguished. Schumacher coding goes
further, achieving optimal compression to H(ρ) + o(1) qubits per let-
ter. Quantum compression packages the message more efficiently than
classical compression, but at a price: Bob receives the quantum state
Alice intended to send, but Bob doesn't know what he has. In contrast
to the classical case, Bob can't accurately decipher Alice's quantum message.
An attempt to read the message will unavoidably disturb it.
10.4 Entanglement Concentration and Dilution
Any bipartite pure state that is not a product state is entangled. But
how entangled? Can we compare two states and say that one is more
entangled than the other?
For example, consider the two bipartite states

|φ⁺⟩ = (1/√2)(|00⟩ + |11⟩),
|ψ⟩ = (1/√2)|00⟩ + (1/2)|11⟩ + (1/2)|22⟩.  (10.159)

|φ⁺⟩ is a maximally entangled state of two qubits, while |ψ⟩ is a partially
entangled state of two qutrits. Which is more entangled?
It is not immediately clear that the question has a meaningful answer.
Why should it be possible to find an unambiguous way of ordering all
bipartite pure states according to their degree of entanglement? Can we
compare a pair of qutrits with a pair of qubits any more than we can
compare apples and oranges?
A crucial feature of entanglement is that it cannot be created by local
operations and classical communication (LOCC). In particular, if Alice
and Bob share a bipartite pure state, its Schmidt number does not increase
if Alice or Bob performs a unitary transformation on her/his share of
the state, nor if Alice or Bob measures her/his share, even if Alice and
Bob exchange classical messages about their actions and measurement
outcomes. Therefore, any quantitative measure of entanglement should
have the property that LOCC cannot increase it, and it should also vanish
for an unentangled product state. An obvious candidate is the Schmidt
number, but on reflection it does not seem very satisfactory. Consider

|ψ_ε⟩ = √(1 − 2|ε|²) |00⟩ + ε|11⟩ + ε|22⟩,  (10.160)
which has Schmidt number 3 for any |ε| > 0. Do we really want to say
that |ψ_ε⟩ is “more entangled” than |φ⁺⟩? Entanglement, after all, can be
regarded as a resource; we might plan to use it for teleportation, for
example, and it seems clear that |ψ_ε⟩ (for |ε| ≪ 1) is a less valuable
resource than |φ⁺⟩.
It turns out, though, that there is a natural and useful way to quantify
the entanglement of any bipartite pure state. To compare two states,
we use LOCC to convert both states to a common currency that can be
compared directly. The common currency is maximal entanglement, and
the amount of shared entanglement can be expressed in units of Bell pairs
(maximally entangled two-qubit states), also called ebits of entanglement.
To quantify the entanglement of a particular bipartite pure state,
|ψ⟩_{AB}, imagine preparing n identical copies of that state. Alice and Bob
share a large supply of maximally entangled Bell pairs. Using LOCC,
they are to convert k Bell pairs (|φ⁺⟩_{AB})^{⊗k} to n high-fidelity copies of
the desired state (|ψ⟩_{AB})^{⊗n}. What is the minimum number k_min of Bell
pairs with which they can perform this task?
To obtain a precise answer, we consider the asymptotic setting, requir-
ing arbitrarily high-fidelity conversion in the limit of large n. We say that
a rate R of conversion from |φ⁺⟩ to |ψ⟩ is asymptotically achievable if for
any ε, δ > 0, there is an LOCC protocol with

k/n ≤ R + δ,  (10.161)

which prepares the target state |ψ⟩^{⊗n} with fidelity F ≥ 1 − ε. We define
the entanglement cost E_C of |ψ⟩ as the infimum of achievable conversion
rates:

E_C(|ψ⟩) := inf {achievable rate for creating |ψ⟩ from Bell pairs}.  (10.162)

Asymptotically, we can create many copies of |ψ⟩ by consuming E_C Bell
pairs per copy.
Now imagine that n copies of |ψ⟩_{AB} are already shared by Alice and
Bob. Using LOCC, Alice and Bob are to convert (|ψ⟩_{AB})^{⊗n} back to the
standard currency: k′ Bell pairs (|φ⁺⟩_{AB})^{⊗k′}. What is the maximum number
k′_max of Bell pairs they can extract from (|ψ⟩_{AB})^{⊗n}? In this case we say that
a rate R′ of conversion from |ψ⟩ to |φ⁺⟩ is asymptotically achievable if for
any ε, δ > 0, there is an LOCC protocol with

k′/n ≥ R′ − δ,  (10.163)

which prepares the target state (|φ⁺⟩)^{⊗k′} with fidelity F ≥ 1 − ε. We define
the distillable entanglement E_D of |ψ⟩ as the supremum of achievable
conversion rates:

E_D(|ψ⟩) := sup {achievable rate for distilling Bell pairs from |ψ⟩}.  (10.164)

Asymptotically, we can convert many copies of |ψ⟩ to Bell pairs, obtaining
E_D Bell pairs per copy of |ψ⟩ consumed.
Since it is an inviolable principle that LOCC cannot create entan-
glement, it is certain that

E_D(|ψ⟩) ≤ E_C(|ψ⟩);  (10.165)

otherwise Alice and Bob could increase their number of shared Bell pairs
by converting them to copies of |ψ⟩ and then back to Bell pairs. In fact
the entanglement cost and distillable entanglement are equal for bipartite
pure states. (The story is more complicated for bipartite mixed states;
see §10.5.) Therefore, for pure states at least, we may drop the subscript,
using E(|ψ⟩) to denote the entanglement of |ψ⟩. We don't need to dis-
tinguish between entanglement cost and distillable entanglement because
conversion of entanglement from one form to another is an asymptotically
reversible process. E quantifies both what we have to pay in Bell pairs to
create |ψ⟩, and the value of |ψ⟩ in Bell pairs for performing tasks like quantum
teleportation which consume entanglement.
But what is the value of E(|ψ⟩_{AB})? Perhaps you can guess it is

E(|ψ⟩_{AB}) = H(ρ_A) = H(ρ_B),  (10.166)
the Von Neumann entropy of Alice's density operator ρ_A (or equivalently
Bob's density operator ρ_B). This is clearly the right answer in the case
where |ψ⟩_{AB} is a product of k Bell pairs. In that case ρ_A (or ρ_B) is (1/2)I
for each qubit in Alice's possession,

ρ_A = ( (1/2)I )^{⊗k},  (10.167)

and

H(ρ_A) = k H( (1/2)I ) = k.  (10.168)

How do we see that E = H(ρ_A) is the right answer for any bipartite pure
state?
Though it is perfectly fine to use Bell pairs as the common currency
for comparing bipartite entangled states, in the asymptotic setting it is
simpler and more natural to allow fractions of a Bell pair, which is what
we'll do here. That is, we'll consider a maximally entangled state of two d-
dimensional systems to be log₂ d Bell pairs, even if d is not a power of two.
So our goal will be to show that Alice and Bob can use LOCC to convert
shared maximal entanglement of systems with dimension d = 2^{n(H(ρ_A)+δ)}
into n copies of |ψ⟩, for any positive δ and with arbitrarily good fidelity
as n → ∞, and conversely that Alice and Bob can use LOCC to convert
n copies of |ψ⟩ into a shared maximally entangled state of d-dimensional
systems with arbitrarily good fidelity, where d = 2^{n(H(ρ_A)−δ)}. This suffices
to demonstrate that E_C(|ψ⟩) = E_D(|ψ⟩) = H(ρ_A).
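For bipartite pure states, E is computable directly from the Schmidt coefficients. Here is a minimal sketch assuming Python with NumPy, which also settles the comparison posed in eq.(10.159):

import numpy as np

def E(psi, dA, dB):
    # entanglement E = H(rho_A), from the Schmidt coefficients (singular values)
    s = np.linalg.svd(psi.reshape(dA, dB), compute_uv=False)
    p = s**2
    p = p[p > 1e-12]
    return -np.sum(p * np.log2(p))

phi_plus = np.zeros(4); phi_plus[[0, 3]] = 1 / np.sqrt(2)      # (|00>+|11>)/sqrt(2)
psi = np.zeros(9); psi[[0, 4, 8]] = [1/np.sqrt(2), 0.5, 0.5]   # the qutrit state

print(E(phi_plus, 2, 2))   # 1.0 ebit
print(E(psi, 3, 3))        # 1.5 ebits: |psi> is the more entangled state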
First let's see that if Alice and Bob share k = n(H(ρ_A) + δ) Bell
pairs, then they can prepare |ψ⟩_{AB}^{⊗n} with high fidelity using LOCC. They
perform this task, called entanglement dilution, by combining quantum
teleportation with Schumacher compression. To get started, Alice locally
creates n copies of |ψ⟩_{AC}, where A and C are systems she controls in her
laboratory. Next she wishes to teleport the C^{⊗n} share of these copies to
Bob, but to minimize the consumption of Bell pairs, she should compress
C^{⊗n} before teleporting it.
If A and C are d-dimensional, then the bipartite state |ψ⟩_{AC} can be
expressed in terms of its Schmidt basis as

|ψ⟩_{AC} = √p₀ |00⟩ + √p₁ |11⟩ + . . . + √p_{d−1} |d−1, d−1⟩,  (10.169)
and n copies of the state can be expressed as

|ψ⟩_{AC}^{⊗n} = Σ_{x₁,...,x_n=0}^{d−1} √( p(x₁) . . . p(x_n) ) |x₁x₂ . . . x_n⟩_{A^n} ⊗ |x₁x₂ . . . x_n⟩_{C^n}
  = Σ_{x⃗} √(p(x⃗)) |x⃗⟩_{A^n} ⊗ |x⃗⟩_{C^n},  (10.170)
where Σ_{x⃗} p(x⃗) = 1. If Alice attempts to project onto the δ-typical sub-
space of C^{⊗n}, she succeeds with high probability

P = Σ_{δ-typical x⃗} p(x⃗) ≥ 1 − ε,  (10.171)

and when successful prepares the post-measurement state

|Ψ⟩_{A^n C^n} = P^{−1/2} Σ_{δ-typical x⃗} √(p(x⃗)) |x⃗⟩_{A^n} ⊗ |x⃗⟩_{C^n},  (10.172)

such that

⟨Ψ|ψ^{⊗n}⟩ = P^{−1/2} Σ_{δ-typical x⃗} p(x⃗) = √P ≥ √(1 − ε).  (10.173)
Since the typical subspace has dimension at most 2^{n(H(ρ)+δ)}, Alice can
teleport the C^{⊗n} half of |Ψ⟩ to Bob with perfect fidelity using no more than
n(H(ρ) + δ) Bell pairs shared by Alice and Bob. The teleportation uses
LOCC: Alice's entangled measurement, classical communication from Al-
ice to Bob to convey the measurement outcome, and Bob's unitary trans-
formation conditioned on the outcome. Finally, after the teleportation,
Bob decompresses, so that Alice and Bob share a state which has high
fidelity with |ψ⟩_{AB}^{⊗n}. This protocol demonstrates that the entanglement
cost E_C of |ψ⟩ is not more than H(ρ_A).
Now consider the distillable entanglement E_D. Suppose Alice and Bob
share the state |ψ⟩_{AB}^{⊗n}. Since |ψ⟩_{AB} is, in general, a partially entangled
state, the entanglement that Alice and Bob share is in a diluted form.
They wish to concentrate their shared entanglement, squeezing it down
to the smallest possible Hilbert space; that is, they want to convert it to
maximally-entangled pairs. We will show that Alice and Bob can “distill”
at least

k′ = n( H(ρ_A) − δ )  (10.174)

Bell pairs from |ψ⟩_{AB}^{⊗n}, with high likelihood of success.
To illustrate the concentration of entanglement, imagine that Alice and
Bob have n copies of the two-qubit state |ψ⟩, which is

|ψ(p)⟩ = √(1 − p) |00⟩ + √p |11⟩,  (10.175)
where 0 ≤ p ≤ 1, when expressed in its Schmidt basis. That is, Alice and
Bob share the state

|ψ(p)⟩^{⊗n} = ( √(1 − p) |00⟩ + √p |11⟩ )^{⊗n}.  (10.176)

When we expand this state in the {|0⟩, |1⟩} basis, we find 2^n terms, in
each of which Alice and Bob hold exactly the same binary string of length
n.
Now suppose Alice (or Bob) performs a local measurement on her (his)
n qubits, measuring the total spin along the z-axis,

σ₃^{(total)} = Σ_{i=1}^{n} σ₃^{(i)}.  (10.177)
Equivalently, the measurement determines the Hamming weight of Alice's
n qubits, the number of |1⟩'s in Alice's n-bit string; that is, the number
of spins pointing up.
In the expansion of |ψ(p)⟩^{⊗n} there are C(n, m) = n!/(m!(n−m)!) terms
in which Alice's string has Hamming weight m, each occurring with the
same amplitude: (1 − p)^{(n−m)/2} p^{m/2}. Hence the probability that Alice's
measurement finds Hamming weight m is

p(m) = C(n, m) (1 − p)^{n−m} p^m.  (10.178)
Furthermore, because Alice is careful not to acquire any additional infor-
mation besides the Hamming weight when she conducts the measurement,
by measuring the Hamming weight m she prepares a uniform superpo-
sition of all C(n, m) strings with m up spins. Because Alice and Bob have
perfectly correlated strings, if Bob were to measure the Hamming weight
of his qubits he would find the same outcome as Alice. Alternatively,
Alice could report her outcome to Bob in a classical message, saving Bob
the trouble of doing the measurement himself. Thus, Alice and Bob share
a maximally entangled state

(1/√D) Σ_{i=1}^{D} |i⟩_A ⊗ |i⟩_B,  (10.179)

where the sum runs over the D = C(n, m) strings with Hamming weight m.
For n large the binomial distribution {p(m)} approaches a sharply
peaked function of m with mean µ = np and variance σ² = np(1 − p).
Hence the probability of a large deviation from the mean,

|m − np| = Ω(n),  (10.180)
is exp(−Ω(n)). Using Stirling's approximation, it then follows that

2^{n(H(p)−o(1))} ≤ D ≤ 2^{n(H(p)+o(1))}  (10.181)

with probability approaching one as n → ∞, where H(p) = −p log₂ p −
(1 − p) log₂(1 − p) is the entropy function. Thus with high probability
Alice and Bob share a maximally entangled state of Hilbert spaces H_A
and H_B with dim(H_A) = dim(H_B) = D and log₂ D ≥ n(H(p) − δ). In
this sense Alice and Bob can distill H(p) − δ Bell pairs per copy of |ψ⟩_{AB}.
Though the number m of up spins that Alice (or Bob) finds in her (his)
measurement is typically close to np, it can fluctuate about this value.
Sometimes Alice and Bob will be lucky, and then will manage to distill
more than H(p) Bell pairs per copy of |ψ(p)⟩_{AB}. But the probability of
doing substantially better becomes negligible as n → ∞.
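One can watch the concentration rate emerge by simulating Alice's Hamming-weight measurement; a sketch assuming Python with NumPy and SciPy:

import numpy as np
from scipy.special import gammaln

p, n = 0.2, 10_000
H = -p * np.log2(p) - (1 - p) * np.log2(1 - p)      # H(p) = 0.7219...

rng = np.random.default_rng(0)
m = rng.binomial(n, p, size=5)                      # outcomes drawn from eq.(10.178)
# log2 of the Schmidt number D = C(n, m) of the post-measurement state
log2_D = (gammaln(n + 1) - gammaln(m + 1) - gammaln(n - m + 1)) / np.log(2)
print(H, log2_D / n)      # ebits distilled per copy: each close to H(p)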
The same idea applies to bipartite pure states in larger Hilbert spaces.
If A and B are d-dimensional systems, then |ψ⟩_{AB} has the Schmidt de-
composition

|ψ(X)⟩_{AB} = Σ_{x=0}^{d−1} √(p(x)) |x⟩_A ⊗ |x⟩_B,  (10.182)

where X is the classical ensemble {x, p(x)}, and H(ρ_A) = H(ρ_B) =
H(X). The Schmidt decomposition of n copies of |ψ⟩ is

Σ_{x₁,x₂,...,x_n=0}^{d−1} √( p(x₁)p(x₂) . . . p(x_n) ) |x₁x₂ . . . x_n⟩_{A^n} ⊗ |x₁x₂ . . . x_n⟩_{B^n}.
  (10.183)
Now Alice (or Bob) can measure the total number of |0⟩'s, the total
number of |1⟩'s, etc. in her (his) possession. If she finds m₀ |0⟩'s, m₁ |1⟩'s,
etc., then her measurement prepares a maximally entangled state with
Schmidt number

D(m₀, m₁, . . . , m_{d−1}) = n! / ( m₀! m₁! . . . m_{d−1}! ),  (10.184)

and this outcome occurs with probability

p(m⃗) = D(m₀, m₁, . . . , m_{d−1}) p(0)^{m₀} p(1)^{m₁} . . . p(d−1)^{m_{d−1}}.  (10.185)
For n large, Alice will typically find m_x ≈ np(x), and again the probability
of a large deviation is small, so that, from Stirling's approximation,

2^{n(H(X)−o(1))} ≤ D ≤ 2^{n(H(X)+o(1))}  (10.186)

with high probability. Thus, asymptotically for n → ∞, n(H(ρ_A) − o(1))
high-fidelity Bell pairs can be distilled from n copies of |ψ⟩, establishing
that E_D(|ψ⟩) ≥ H(ρ_A), and therefore E_D(|ψ⟩) = E_C(|ψ⟩) = E(|ψ⟩).
This entanglement concentration protocol uses local operations but
does not require any classical communication. When Alice and Bob do the
same measurement they always get the same outcome, so there is no need
for them to communicate. Classical communication really is necessary,
though, to perform entanglement dilution. The protocol we described
here, based on teleportation, requires two bits of classical one-way com-
munication per Bell pair consumed; in a more clever protocol this can be
reduced to O(√n) bits, but no further. Since the classical communica-
tion cost is sublinear in n, the number of bits of classical communication
needed per copy of |ψ⟩ becomes negligible in the limit n → ∞.
10.5 Quantifying Mixed-State Entanglement
10.5.1 Asymptotic irreversibility under LOCC
The entanglement cost E_C and the distillable entanglement E_D are nat-
ural and operationally meaningful ways to quantify entanglement. It's
quite satisfying to find that, because entanglement dilution and concen-
tration are asymptotically reversible for pure states, these two measures of
pure-state bipartite entanglement agree, and provide another operational
role for the Von Neumann entropy of a marginal quantum state.
We can define E_C and E_D for bipartite mixed states just as we did
for pure states, but the story is more complicated: when we prepare
many copies of a mixed state shared by Alice and Bob, the dilution of Bell
pairs is not in general reversible, even asymptotically, and the distillable
entanglement can be strictly less than the entanglement cost, though it
can never be larger. There are even bipartite mixed states with nonzero
entanglement cost and zero distillable entanglement, a phenomenon called
bound entanglement. This irreversibility is not shocking; any bipartite
operation which maps many copies of the pure state |φ⁺⟩_{AB} to many
copies of the mixed state ρ_{AB} necessarily discards some information to
the environment, and we don't normally expect a process that forgets
information to be reversible.
This separation between E_C and E_D raises the question: what is the
preferred way to quantify the amount of entanglement when two parties
share a mixed quantum state? The answer is, it depends. Many different
measures of bipartite mixed-state entanglement have been proposed, each
with its own distinctive advantages and disadvantages. Even though they
do not always agree, both E_C and E_D are certainly valid measures. A
further distinction can be made between the rate E_{D1} at which entangle-
ment can be distilled with one-way communication between the parties,
and the rate E_D with two-way communication. There are bipartite mixed
states for which E_D > E_{D1}, and even states for which E_D is nonzero while
E_{D1} is zero. In contrast to the pure-state case, we don't have nice formu-
las for the values of the various entanglement measures, though there are
useful upper and lower bounds. We will derive a lower bound on E_{D1} in
§10.8.2 (the hashing inequality).
There are certain properties that any reasonable measure of bipartite
quantum entanglement should have. The most important is that it must
not increase under local operations and classical communication, because
quantum entanglement cannot be created by LOCC alone. A function
on bipartite states that is nonincreasing under LOCC is called an en-
tanglement monotone. Note that an entanglement monotone will also be
invariant under local unitary operations U_{AB} = U_A ⊗ U_B, for if U_{AB} can
reduce the entanglement for any state, its inverse can increase entangle-
ment.
A second important property is that a bipartite entanglement measure
must vanish for separable states. Recall from Chapter 4 that a bipartite
mixed state is separable if it can be expressed as a convex combination
of product states,

ρ_{AB} = Σ_x p(x) |α(x)⟩⟨α(x)|_A ⊗ |β(x)⟩⟨β(x)|_B.  (10.187)

A separable state is not entangled, as it can be created using LOCC.
Via classical communication, Alice and Bob can establish a shared source
of randomness, the distribution X = {x, p(x)}. Then they may jointly
sample from X; if the outcome is x, Alice prepares |α(x)⟩ while Bob
prepares |β(x)⟩.
A third desirable property for a bipartite entanglement measure is that
it should agree with E = E_C = E_D for bipartite pure states. Both the
entanglement cost and the distillable entanglement respect all three of
these properties.
We remark in passing that, despite the irreversibility of entanglement
dilution under LOCC, there is a mathematically viable way to formulate
a reversible theory of bipartite entanglement which applies even to mixed
states. In this formulation, we allow Alice and Bob to perform arbitrary
bipartite operations that are incapable of creating entanglement; these
include LOCC as well as additional operations which cannot be realized
using LOCC. In this framework, dilution and concentration of entan-
glement become asymptotically reversible even for mixed states, and a
unique measure of entanglement can be formulated characterizing the op-
timal rate of conversion between copies of ρ_{AB} and Bell pairs using these
non-entangling operations.
Irreversible bipartite entanglement theory under LOCC, and also the
reversible theory under non-entangling bipartite operations, are both ex-
amples of resource theories. In the resource theory framework, one or
more parties are able to perform some restricted class of operations, and
they are capable of preparing a certain restricted class of states using
these operations. In addition, the parties may also have access to re-
source states, which are outside the class they can prepare on their own.
Using their restricted operations, they can transform resource states from
one form to another, or consume resource states to perform operations
beyond what they could achieve with their restricted operations alone.
The name “resource state” conveys that such states are valuable because
they may be consumed to do useful things.
In a two-party setting, where LOCC is allowed or more general non-
entangling operations are allowed, bipartite entangled states may be re-
garded as a valuable resource. Resource theory also applies if the allowed
operations are required to obey certain symmetries; then states breaking
this symmetry become a resource. In thermodynamics, states deviat-
ing from thermal equilibrium are a resource. Entanglement theory, as a
particularly well developed resource theory, provides guidance and tools
which are broadly applicable to many different interesting situations.
10.5.2 Squashed entanglement
As an example of an alternative bipartite entanglement measure, consider
the squashed entanglement E_sq, defined by

E_sq(ρ_{AB}) = inf { (1/2) I(A; B|C) : ρ_{AB} = tr_C(ρ_{ABC}) }.  (10.188)

The squashed entanglement of ρ_{AB} is the greatest lower bound on the
quantum conditional mutual information of all possible extensions of ρ_{AB}
to a tripartite state ρ_{ABC}; it can be shown to be an entanglement mono-
tone. The locution “squashed” conveys that choosing an optimal condi-
tioning system C squashes out the non-quantum correlations between A
and B.
For pure states the extension is superfluous, so that

E_sq(|ψ⟩_{AB}) = (1/2) I(A; B) = H(A) = H(B) = E(|ψ⟩_{AB}).  (10.189)

For a separable state, we may choose the extension

ρ_{ABC} = Σ_x p(x) |α(x)⟩⟨α(x)|_A ⊗ |β(x)⟩⟨β(x)|_B ⊗ |x⟩⟨x|_C,  (10.190)

where {|x⟩_C} is an orthonormal set; the state ρ_{ABC} has the block-diagonal
form eq.(10.82) and hence I(A; B|C) = 0. Conversely, if ρ_{AB} has any
extension ρ_{ABC} with I(A; B|C) = 0, then ρ_{ABC} has the form eq.(10.82)
and therefore ρ_{AB} is separable.
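The separable-extension argument can be verified directly in small dimensions. The sketch below (assuming Python with NumPy; the partial-trace helper is ours) builds ρ_{ABC} as in eq.(10.190) for a two-state ensemble and confirms I(A; B|C) = 0:

import numpy as np
from functools import reduce

def vn(rho):
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]
    return -np.sum(p * np.log2(p))

def ptrace(rho, dims, keep):
    # partial trace over every subsystem not listed in `keep`
    rho = rho.reshape(dims + dims)
    for sys in sorted(set(range(len(dims))) - set(keep), reverse=True):
        rho = np.trace(rho, axis1=sys, axis2=sys + rho.ndim // 2)
    d = int(np.prod([dims[i] for i in keep]))
    return rho.reshape(d, d)

proj = lambda v: np.outer(v, v.conj())
alpha = [np.array([1.0, 0]), np.array([1.0, 1]) / np.sqrt(2)]
beta = [np.array([0, 1.0]), np.array([1.0, -1]) / np.sqrt(2)]
ket = np.eye(2)

rho = sum(0.5 * reduce(np.kron, [proj(alpha[x]), proj(beta[x]), proj(ket[x])])
          for x in range(2))                          # eq.(10.190)

dims = [2, 2, 2]
I_cond = (vn(ptrace(rho, dims, [0, 2])) + vn(ptrace(rho, dims, [1, 2]))
          - vn(rho) - vn(ptrace(rho, dims, [2])))     # I(A;B|C)
print(I_cond)   # 0.0: conditioning on C squashes all the correlations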
E_sq is difficult to compute, because the infimum is to be evaluated over
all possible extensions, where the system C may have arbitrarily high
dimension. This property also raises the logical possibility that there are
nonseparable states for which the infimum vanishes; conceivably, though
a nonseparable ρ_{AB} can have no finite-dimensional extension for which
I(A; B|C) = 0, perhaps I(A; B|C) can approach zero as the dimension of
C increases. Fortunately, though this is not easy to show, it turns out
that E_sq is strictly positive for any nonseparable state. In this sense, then,
it is a faithful entanglement measure, strictly positive if and only if the
state is nonseparable.
One desirable property of E_sq, not shared by E_C and E_D, is its addi-
tivity on tensor products (Exercise 10.6),

E_sq(ρ_{AB} ⊗ ρ_{A′B′}) = E_sq(ρ_{AB}) + E_sq(ρ_{A′B′}).  (10.191)

Though, unlike E_C and E_D, squashed entanglement does not have an
obvious operational meaning, any additive entanglement monotone which
matches E for bipartite pure states is bounded above and below by E_C
and E_D respectively,

E_C ≥ E_sq ≥ E_D.  (10.192)
10.5.3 Entanglement monogamy
Classical correlations are polyamorous; they can be shared among many
parties. If Alice and Bob read the same newspaper, then they have in-
formation in common and become correlated. Nothing prevents Claire
from reading the same newspaper; then Claire is just as strongly corre-
lated with Alice and with Bob as Alice and Bob are with one another.
Furthermore, David, Edith, and all their friends can read the newspaper
and join the party as well.
Quantum correlations are not like that; they are harder to share. If
Bob's state is pure, then the tripartite quantum state is a product ρ_B ⊗
ρ_{AC}, and Bob is completely uncorrelated with Alice and Claire. If Bob's
state is mixed, then he can be entangled with other parties. But if Bob is
fully entangled with Alice (shares a pure state with Alice), then the state is
a product ρ_{AB} ⊗ ρ_C; Bob has used up all his ability to entangle by sharing
with Alice, and Bob cannot be correlated with Claire at all. Conversely,
if Bob shares a pure state with Claire, the state is ρ_A ⊗ ρ_{BC}, and Bob
is uncorrelated with Alice. Thus we say that quantum entanglement is
monogamous.
Entanglement measures obey monogamy inequalities which reflect this
tradeoff between Bob's entanglement with Alice and with Claire in
a three-party state. Squashed entanglement, in particular, obeys a
monogamy relation following easily from its definition, which was our
primary motivation for introducing this quantity; we have

E_sq(A; B) + E_sq(A; C) ≤ E_sq(A; BC).  (10.193)

In particular, in the case of a pure tripartite state, E_sq(A; BC) = H(A) is the
(pure-state) entanglement shared between A and BC. The inequality is
saturated if Alice's system is divided into subsystems A₁ and A₂ such
that the tripartite pure state is

|ψ⟩_{ABC} = |ψ₁⟩_{A₁B} ⊗ |ψ₂⟩_{A₂C}.  (10.194)
In general, combining eq.(10.192) with eq.(10.193) yields

E_D(A; B) + E_D(A; C) ≤ E_C(A; BC);  (10.195)

loosely speaking, the entanglement cost E_C(A; BC) imposes a ceiling on
Alice's ability to entangle with Bob and Claire individually, requiring
her to trade in some distillable entanglement with Bob to increase her
distillable entanglement with Claire.
To prove the monogamy relation eq.(10.193), we note that mutual in-
formation obeys a chain rule which is really just a restatement of the
definition of conditional mutual information:

I(A; BC) = I(A; C) + I(A; B|C).  (10.196)

A similar equation follows directly from the definition if we condition on
a fourth system D,

I(A; BC|D) = I(A; C|D) + I(A; B|CD).  (10.197)
Now, E_sq(A; BC) is the infimum of (1/2)I(A; BC|D) over all possible exten-
sions of ρ_{ABC} to ρ_{ABCD}. But since ρ_{ABCD} is also an extension of ρ_{AB}
and ρ_{AC}, the chain rule eq.(10.197) gives

(1/2) I(A; BC|D) = (1/2) I(A; C|D) + (1/2) I(A; B|CD)
  ≥ E_sq(A; C) + E_sq(A; B)  (10.198)

for any such extension. Taking the infimum over all ρ_{ABCD} yields
eq.(10.193).
A further aspect of monogamy arises when we consider extending a
quantum state to more parties. We say that the bipartite state ρ_{AB} of sys-
tems A and B is k-extendable if there is a (k+1)-part state ρ_{AB₁...B_k} whose
marginal state on AB_j matches ρ_{AB} for each j = 1, 2, . . . , k, and such that
ρ_{AB₁...B_k} is invariant under permutations of the k systems B₁, B₂, . . . , B_k.
Separable states are k-extendable for every k, and entangled pure states
are not even 2-extendable. Every entangled mixed state fails to be k-
extendable for some finite k, and we may regard the maximal value k_max
for which such a symmetric extension exists as a rough measure of how
entangled the state is; bipartite entangled states with larger and larger
k_max are closer and closer to being separable.
10.6 Accessible Information
10.6.1 How much can we learn from a measurement?
Consider a game played by Alice and Bob. Alice prepares a quantum state
drawn from the ensemble E = {ρ(x), p(x)} and sends the state to Bob.
Bob knows this ensemble, but not the particular state that Alice chose
to send. After receiving the state, Bob performs a POVM with elements
{E(y)} ≡ E, hoping to find out as much as he can about what Alice
sent. The conditional probability that Bob obtains outcome y if Alice
sent ρ(x) is p(y|x) = tr( E(y) ρ(x) ), and the joint distribution governing
Alice's preparation and Bob's measurement is p(x, y) = p(y|x) p(x).
Before he measures, Bob's ignorance about Alice's state is quantified
by H(X), the number of “bits per letter” needed to specify x; after he
measures, his ignorance is reduced to H(X|Y) = H(XY) − H(Y). The
improvement in Bob's knowledge achieved by the measurement is Bob's
information gain, the mutual information

I(X; Y) = H(X) − H(X|Y).  (10.199)

Bob's best strategy (his optimal measurement) maximizes this information
gain. The best information gain Bob can achieve,

Acc(E) = max_E I(X; Y),  (10.200)

is a property of the ensemble E called the accessible information of E.
If the states {ρ(x)} are mutually orthogonal, they are perfectly distin-
guishable. Bob can identify Alice's state with certainty by choosing E(x)
to be the projector onto the support of ρ(x); then p(y|x) = δ_{x,y} = p(x|y),
hence H(X|Y) = ⟨−log p(x|y)⟩ = 0 and Acc(E) = H(X). Bob's task is
more challenging if Alice's states are not orthogonal. Then no measure-
ment will identify the state perfectly, so H(X|Y) is necessarily positive
and Acc(E) < H(X).
Though there is no simple general formula for the accessible infor-
mation of an ensemble, we can derive a useful upper bound, called
the Holevo bound. For the special case of an ensemble of pure states
E = {|ϕ(x)⟩, p(x)}, the Holevo bound becomes

Acc(E) ≤ H(ρ), where ρ = Σ_x p(x) |ϕ(x)⟩⟨ϕ(x)|,  (10.201)

and a sharper statement is possible for an ensemble of mixed states, as we
will see. Since the entropy of a quantum system with dimension d can
be no larger than log d, the Holevo bound asserts that Alice, by sending
n qubits to Bob (d = 2^n), can convey no more than n bits of information.
This is true even if Bob performs a sophisticated collective measurement
on all the qubits at once, rather than measuring them one at a time.
Therefore, if Alice wants to convey classical information to Bob by
sending qubits, she can do no better than treating the qubits as though
they were classical, sending each qubit in one of the two orthogonal states
{|0⟩, |1⟩} to transmit one bit. This statement is not so obvious. Alice
might try to stuff more classical information into a single qubit by sending
a state chosen from a large alphabet of pure single-qubit signal states,
distributed uniformly on the Bloch sphere. But the enlarged alphabet is
to no avail, because as the number of possible signals increases the signals
also become less distinguishable, and Bob is not able to extract the extra
information Alice hoped to deposit in the qubit.
If we can send information more efficiently by using an alphabet of
mutually orthogonal states, why should we be interested in the accessible
information for an ensemble of non-orthogonal states? There are many
possible reasons. Perhaps Alice finds it easier to send signals, like co-
herent states, which are imperfectly distinguishable rather than mutually
orthogonal. Or perhaps Alice sends signals to Bob through a noisy chan-
nel, so that signals which are orthogonal when they enter the channel are
imperfectly distinguishable by the time they reach Bob.
The accessible information game also arises when an experimental
physicist tries to measure an unknown classical force using a quantum
system as a probe. For example, to measure the z-component of a mag-
netic field, we may prepare a spin-1/2 particle pointing in the x-direction;
the spin precesses for time t in the unknown field, producing an ensem-
ble of possible final states (which will be an ensemble of mixed states if
the initial preparation is imperfect, or if decoherence occurs during the
experiment). The more information we can gain about the final state of
the spin, the more accurately we can determine the value of the magnetic
field.
10.6.2 Holevo bound
Recall that quantum mutual information obeys monotonicity: if a quan-
tum channel maps B to B′, then I(A; B) ≥ I(A; B′). We derive the
Holevo bound by applying monotonicity of mutual information to the ac-
cessible information game. We will suppose that Alice records her chosen
state in a classical register X and Bob likewise records his measurement
outcome in another register Y, so that Bob's information gain is the mu-
tual information I(X; Y) of the two registers. After Alice's preparation
of her system A, the joint state of XA is

ρ_{XA} = Σ_x p(x) |x⟩⟨x| ⊗ ρ(x).  (10.202)

Bob's measurement is a quantum channel mapping A to AY according to

ρ(x) ↦ Σ_y M(y) ρ(x) M(y)† ⊗ |y⟩⟨y|,  (10.203)

where M(y)† M(y) = E(y), yielding the state for XAY

ρ′_{XAY} = Σ_{x,y} p(x) |x⟩⟨x| ⊗ M(y) ρ(x) M(y)† ⊗ |y⟩⟨y|.  (10.204)

Now we have

I(X; Y)_{ρ′} ≤ I(X; AY)_{ρ′} ≤ I(X; A)_ρ,  (10.205)

where the subscript indicates the state in which the mutual information
is evaluated; the first inequality uses strong subadditivity in the state ρ′,
and the second uses monotonicity under the channel mapping ρ to ρ′.
The quantity I(X; A) is an intrinsic property of the ensemble E; it
is denoted χ(E) and called the Holevo chi of the ensemble. We have
shown that however Bob chooses his measurement, his information gain is
bounded above by the Holevo chi; therefore,

Acc(E) ≤ χ(E) := I(X; A)_ρ.  (10.206)

This is the Holevo bound.
Now let's calculate I(X; A)_ρ explicitly. We note that

H(XA) = −tr( Σ_x p(x)|x⟩⟨x| ⊗ ρ(x) log( Σ_{x′} p(x′)|x′⟩⟨x′| ⊗ ρ(x′) ) )
  = −Σ_x tr( p(x) ρ(x) ( log p(x) + log ρ(x) ) )
  = H(X) + Σ_x p(x) H(ρ(x)),  (10.207)

and therefore

H(A|X) = H(XA) − H(X) = Σ_x p(x) H(ρ(x)).  (10.208)

Using I(X; A) = H(A) − H(A|X), we then find

χ(E) = I(X; A) = H(ρ_A) − Σ_x p(x) H(ρ_A(x)) ≡ H(A)_E − ⟨H(A)⟩_E.  (10.209)
For an ensemble of pure states, χ is just the entropy of the density operator
arising from the ensemble, but for an ensemble E of mixed states it is a
strictly smaller quantity: the difference between the entropy H(ρ_E) of the
convex sum of signal states and the convex sum ⟨H⟩_E of the signal state
entropies; this difference is always nonnegative because of the concavity
of the entropy function (or because mutual information is nonnegative).
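Eq.(10.209) makes χ straightforward to compute. Here is a minimal sketch assuming Python with NumPy, using the two-state qubit ensemble of §10.3.1:

import numpy as np

def vn(rho):
    p = np.linalg.eigvalsh(rho)
    p = p[p > 1e-12]
    return -np.sum(p * np.log2(p))

kets = [np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2)]
probs = [0.5, 0.5]
states = [np.outer(k, k) for k in kets]

rho = sum(p * s for p, s in zip(probs, states))
chi = vn(rho) - sum(p * vn(s) for p, s in zip(probs, states))   # eq.(10.209)
print(chi)   # 0.6009 = H(rho): for pure states chi is just the ensemble entropy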
10.6.3 Monotonicity of Holevo χ
Since Holevo χ is the mutual information I(X; A) of the classical register
X and the quantum system A, the monotonicity of mutual information
also implies the monotonicity of χ. If N : A → A′ is a quantum channel,
then I(X; A′) ≤ I(X; A), and therefore

χ(E′) ≤ χ(E),  (10.210)

where

E = {ρ(x), p(x)} and E′ = {ρ′(x) = N(ρ(x)), p(x)}.  (10.211)

A channel cannot increase the Holevo χ of an ensemble.
Its monotonicity provides a further indication that χ(E) is a useful
measure of the information encoded in an ensemble of quantum states;
the decoherence described by a quantum channel can reduce this quantity,
but never increases it. In contrast, the Von Neumann entropy may either
increase or decrease under the action of a channel. Mapping pure states
to mixed states can increase H, but a channel might instead map the
mixed states in an ensemble to a fixed pure state |0⟩⟨0|, decreasing H
and improving the purity of each signal state, but without improving the
distinguishability of the states.
We discussed the asymptotic limit H(ρ) on quantum compression per
letter in §10.3.2. There we considered unitary decoding; invoking the
monotonicity of Holevo χ clarifies why more general decoders cannot do
better. Suppose we compress and decompress the ensemble E^{⊗n} using an
encoder N_e and a decoder N_d, where both maps are quantum channels:

N_e : E^{⊗n} ↦ Ẽ^{(n)},  N_d : Ẽ^{(n)} ↦ Ẽ′^{(n)} ≈ E^{⊗n}.  (10.212)

The Holevo χ of the input pure-state product ensemble is additive,
χ(E^{⊗n}) = H(ρ^{⊗n}) = nH(ρ), and the χ of a d-dimensional system is no larger
than log₂ d; therefore if the ensemble Ẽ^{(n)} is compressed to q qubits per
letter, then because of the monotonicity of χ the decompressed ensemble
Ẽ′^{(n)} has Holevo chi per letter (1/n) χ(Ẽ′^{(n)}) ≤ q. If the decompressed output
ensemble has high fidelity with the input ensemble, its χ per letter should
nearly match the χ per letter of the input ensemble, hence

q ≥ (1/n) χ(Ẽ′^{(n)}) ≥ H(ρ) − δ  (10.213)

for any positive δ and sufficiently large n. We conclude that high-fidelity
compression to fewer than H(ρ) qubits per letter is impossible asymptot-
ically, even when the compression and decompression maps are arbitrary
channels.
10.6.4 Improved distinguishability through coding: an example
To better acquaint ourselves with the concept of accessible information, let's consider a single-qubit example. Alice prepares one of the three possible pure states

    |ϕ₁⟩ = |n̂₁⟩ = ( 1, 0 ),
    |ϕ₂⟩ = |n̂₂⟩ = ( −1/2, √3/2 ),
    |ϕ₃⟩ = |n̂₃⟩ = ( −1/2, −√3/2 );    (10.214)

a spin-½ object points in one of three directions that are symmetrically distributed in the xz-plane. Each state has a priori probability 1/3. Evidently, Alice's signal states are nonorthogonal:

    ⟨ϕ₁|ϕ₂⟩ = ⟨ϕ₁|ϕ₃⟩ = ⟨ϕ₂|ϕ₃⟩ = −1/2.    (10.215)
Bob's task is to find out as much as he can about what Alice prepared by making a suitable measurement. The density matrix of Alice's ensemble is

    ρ = (1/3) ( |ϕ₁⟩⟨ϕ₁| + |ϕ₂⟩⟨ϕ₂| + |ϕ₃⟩⟨ϕ₃| ) = ½ I,    (10.216)

which has H(ρ) = 1. Therefore, the Holevo bound tells us that the mutual information of Alice's preparation and Bob's measurement outcome cannot exceed 1 bit.

In fact, though, the accessible information is considerably less than the one bit allowed by the Holevo bound. In this case, Alice's ensemble has enough symmetry that it is not hard to guess the optimal measurement. Bob may choose a POVM with three outcomes, where

    E_a = (2/3) ( I − |ϕ_a⟩⟨ϕ_a| ), a = 1, 2, 3;    (10.217)
we see that

    p(a|b) = ⟨ϕ_b| E_a |ϕ_b⟩ = { 0 if a = b;  1/2 if a ≠ b }.    (10.218)

The measurement outcome a excludes the possibility that Alice prepared a, but leaves equal a posteriori probabilities p = ½ for the other two states. Bob's information gain is

    I = H(X) − H(X|Y) = log₂ 3 − 1 = .58496.    (10.219)
To show that this measurement is really optimal, we may appeal to a variation on a theorem of Davies, which assures us that an optimal POVM can be chosen with three E_a's that share the same three-fold symmetry as the three states in the input ensemble. This result restricts the possible POVMs enough so that we can check that eq. (10.217) is optimal with an explicit calculation. Hence we have found that the ensemble E = {|ϕ_a⟩, p_a = 1/3} has accessible information

    Acc(E) = log₂ (3/2) = .58496...    (10.220)

The Holevo bound is not saturated.
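The joint distribution p(x, y) = (1/3)⟨ϕ_x|E_y|ϕ_x⟩ and the resulting mutual information are easy to verify directly; a short Python sketch:

    import numpy as np

    phis = [np.array([1.0, 0.0]),
            np.array([-0.5, np.sqrt(3) / 2]),
            np.array([-0.5, -np.sqrt(3) / 2])]
    E = [(2.0 / 3.0) * (np.eye(2) - np.outer(v, v)) for v in phis]
    assert np.allclose(sum(E), np.eye(2))                    # completeness of the POVM
    p = np.array([[(v @ Ey @ v) / 3.0 for Ey in E] for v in phis])   # p(x, y)
    px, py = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    I = np.sum(p[mask] * np.log2(p[mask] / np.outer(px, py)[mask]))
    print(I)    # 0.58496..., matching eq.(10.220), short of the Holevo chi = 1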
Now suppose that Alice has enough cash so that she can afford to send two qubits to Bob, where again each qubit is drawn from the ensemble E. The obvious thing for Alice to do is prepare one of the nine states

    |ϕ_a⟩ ⊗ |ϕ_b⟩, a, b = 1, 2, 3,    (10.221)

each with p_ab = 1/9. Then Bob's best strategy is to perform the POVM eq. (10.217) on each of the two qubits, achieving a mutual information of .58496 bits per qubit, as before.
But, determined to do better, Alice and Bob decide on a different strategy. Alice will prepare one of three two-qubit states

    |Φ_a⟩ = |ϕ_a⟩ ⊗ |ϕ_a⟩, a = 1, 2, 3,    (10.222)

each occurring with a priori probability p_a = 1/3. Considered one qubit at a time, Alice's choice is governed by the ensemble E, but now her two qubits have (classical) correlations: both are prepared the same way.
The three |Φ_a⟩'s are linearly independent, and so span a three-dimensional subspace of the four-dimensional two-qubit Hilbert space. In Exercise 10.4, you will show that the density operator

    ρ = (1/3) Σ_{a=1}^{3} |Φ_a⟩⟨Φ_a|    (10.223)

has the nonzero eigenvalues 1/2, 1/4, 1/4, so that

    H(ρ) = −(1/2) log₂ (1/2) − 2 · (1/4) log₂ (1/4) = 3/2.    (10.224)

The Holevo bound requires that the accessible information per qubit is no more than 3/4 bit, which is at least consistent with the possibility that we can exceed the .58496 bits per qubit attained by the nine-state method.
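The eigenvalues quoted in eq.(10.224) can be confirmed in a few lines; a minimal numerical sketch:

    import numpy as np

    phis = [np.array([1.0, 0.0]),
            np.array([-0.5, np.sqrt(3) / 2]),
            np.array([-0.5, -np.sqrt(3) / 2])]
    Phis = [np.kron(v, v) for v in phis]               # |Phi_a> = |phi_a> x |phi_a>
    rho = sum(np.outer(P, P) for P in Phis) / 3.0
    evals = np.sort(np.linalg.eigvalsh(rho))[::-1]
    print(evals)                                        # [0.5, 0.25, 0.25, 0.0]
    nz = evals[evals > 1e-12]
    print(-np.sum(nz * np.log2(nz)))                    # H(rho) = 1.5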
Naively, it may seem that Alice won't be able to convey as much classical information to Bob if she chooses to send one of only three possible states instead of nine. But on further reflection, this conclusion is not obvious. True, Alice has fewer signals to choose from, but the signals are more distinguishable; we have

    ⟨Φ_a|Φ_b⟩ = 1/4, a ≠ b,    (10.225)

instead of eq. (10.215). It is up to Bob to exploit this improved distinguishability in his choice of measurement. In particular, Bob will find it advantageous to perform collective measurements on the two qubits instead of measuring them one at a time.
It is no longer obvious what Bob's optimal measurement will be. But Bob can invoke a general procedure that, while not guaranteed optimal, is usually at least pretty good. We'll call the POVM constructed by this procedure a "pretty good measurement" (or PGM).
Consider some collection of vectors |Φ̃_a⟩ that are not assumed to be orthogonal or normalized. We want to devise a POVM that can distinguish these vectors reasonably well. Let us first construct

    G = Σ_a |Φ̃_a⟩⟨Φ̃_a|;    (10.226)

this is a positive operator on the space spanned by the |Φ̃_a⟩'s. Therefore, on that subspace, G has an inverse G⁻¹, and that inverse has a positive square root G^(−1/2). Now we define

    E_a = G^(−1/2) |Φ̃_a⟩⟨Φ̃_a| G^(−1/2),    (10.227)

and we see that

    Σ_a E_a = G^(−1/2) ( Σ_a |Φ̃_a⟩⟨Φ̃_a| ) G^(−1/2) = G^(−1/2) G G^(−1/2) = I,    (10.228)

on the span of the |Φ̃_a⟩'s. If necessary, we can augment these E_a's with one more positive operator, the projection E₀ onto the orthogonal complement of the span of the |Φ̃_a⟩'s, and so construct a POVM. This POVM is the PGM associated with the vectors |Φ̃_a⟩.
In the special case where the |Φ̃_a⟩'s are orthogonal,

    |Φ̃_a⟩ = √λ_a |φ_a⟩    (10.229)

(where the |φ_a⟩'s are orthonormal), we have

    E_a = Σ_{b,c} ( |φ_b⟩ λ_b^(−1/2) ⟨φ_b| ) ( |φ_a⟩ λ_a ⟨φ_a| ) ( |φ_c⟩ λ_c^(−1/2) ⟨φ_c| ) = |φ_a⟩⟨φ_a|;    (10.230)

this is the orthogonal measurement that perfectly distinguishes the |Φ̃_a⟩'s and so clearly is optimal. If the |Φ̃_a⟩'s are linearly independent but not orthogonal, then the PGM is again an orthogonal measurement (because n one-dimensional operators in an n-dimensional space can constitute a POVM only if mutually orthogonal; see Exercise 3.11), but in that case the measurement may not be optimal.
In Exercise 10.4, you'll construct the PGM for the vectors |Φ_a⟩ in eq. (10.222), and you'll show that

    p(a|a) = ⟨Φ_a| E_a |Φ_a⟩ = (1/3) ( 1 + 1/√2 )² = .971405,
    p(b|a) = ⟨Φ_a| E_b |Φ_a⟩ = (1/6) ( 1 − 1/√2 )² = .0142977    (10.231)

(for b ≠ a). It follows that the conditional entropy of the input is

    H(X|Y) = .215893,    (10.232)

and since H(X) = log₂ 3 = 1.58496, the information gain is

    I(X; Y) = H(X) − H(X|Y) = 1.36907,    (10.233)

a mutual information of .684535 bits per qubit. Thus, the improved distinguishability of Alice's signals has indeed paid off: we have exceeded the .58496 bits that can be extracted from a single qubit. We still didn't saturate the Holevo bound (I ≤ 1.5 in this case), but we came a lot closer than before.
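The numbers in eqs.(10.231)-(10.233) can also be reproduced numerically. The sketch below builds G^(−1/2) on the span by eigendecomposition, following eqs.(10.226)-(10.227):

    import numpy as np

    phis = [np.array([1.0, 0.0]),
            np.array([-0.5, np.sqrt(3) / 2]),
            np.array([-0.5, -np.sqrt(3) / 2])]
    Phis = [np.kron(v, v) for v in phis]
    G = sum(np.outer(P, P) for P in Phis)
    lam, U = np.linalg.eigh(G)
    inv_sqrt = np.array([1.0 / np.sqrt(x) if x > 1e-12 else 0.0 for x in lam])
    G_inv_sqrt = U @ np.diag(inv_sqrt) @ U.T            # G^{-1/2} on the span
    E = [G_inv_sqrt @ np.outer(P, P) @ G_inv_sqrt for P in Phis]
    print(Phis[0] @ E[0] @ Phis[0])                     # p(a|a) = 0.971405...
    print(Phis[0] @ E[1] @ Phis[0])                     # p(b|a) = 0.0142977...
    p = np.array([[(Pa @ Eb @ Pa) / 3.0 for Eb in E] for Pa in Phis])
    px, py = p.sum(axis=1), p.sum(axis=0)
    print(np.sum(p * np.log2(p / np.outer(px, py))))    # I(X;Y) = 1.36907... bits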
This example, first described by Peres and Wootters, teaches some useful lessons. First, Alice is able to convey more information to Bob by "pruning" her set of codewords. She is better off choosing among fewer signals that are more distinguishable than among more signals that are less distinguishable. An alphabet of three letters encodes more than an alphabet of nine letters.

Second, Bob is able to read more of the information if he performs a collective measurement instead of measuring each qubit separately. His optimal orthogonal measurement projects Alice's signal onto a basis of entangled states.
10.6.5 Classical capacity of a quantum channel
This example illustrates how coding and collective measurement can enhance accessible information, but while using the code narrowed the gap between the accessible information and the Holevo chi of the ensemble, it did not close the gap completely. As is often the case in information theory, we can characterize the accessible information more precisely by considering an asymptotic i.i.d. setting. To be specific, we'll consider the task of sending classical information reliably through a noisy quantum channel N_{A→B}.

An ensemble of input signal states E = {ρ(x), p(x)} prepared by Alice is mapped by the channel to an ensemble of output signals E′ = {N(ρ(x)), p(x)}. If Bob measures the output, his information gain

    Acc(E′) ≤ I(X; B) = χ(E′)    (10.234)

is bounded above by the Holevo chi of the output ensemble E′. To convey as much information through the channel as possible, Alice and Bob may choose the input ensemble E that maximizes the Holevo chi of the output ensemble E′. The maximum value

    χ(N) := max_E χ(E′) = max_E I(X; B)    (10.235)

of χ(E′) is a property of the channel, which we will call the Holevo chi of N.
As we've seen, Bob's actual optimal information gain in this single-shot setting may fall short of χ(E′) in general. But instead of using the channel just once, suppose that Alice and Bob use the channel n ≫ 1 times, where Alice sends signal states chosen from a code, and Bob performs an optimal measurement to decode the signals he receives. Then an information gain of χ(N) bits per letter really can be achieved asymptotically as n → ∞.
Let's denote Alice's ensemble of encoded n-letter signal states by Ẽ^(n), denote the ensemble of classical labels carried by the signals by X̃^n, and denote Bob's ensemble of measurement outcomes by Ỹ^n. Let's say that the code has rate R if Alice may choose from among 2^(nR) possible signals to send. If classical information can be sent through the channel with rate R − o(1) such that Bob can decode the signal with negligible error probability as n → ∞, then we say the rate R is achievable. The classical capacity C(N) of the quantum channel N_{A→B} is the supremum of all achievable rates.
Just as in our discussion of the capacity of a classical channel in §10.1.4, the conditional entropy per letter (1/n) H(X̃^n | Ỹ^n) approaches zero as n → ∞ if the error probability is asymptotically negligible; therefore

    R ≤ (1/n) I(X̃^n; Ỹ^n) + o(1)
      ≤ (1/n) max_{E^(n)} I(X^n; B^n) + o(1)
      = (1/n) χ(N^⊗n) + o(1),    (10.236)

where we obtain the first inequality as in eq.(10.47) and the second inequality by invoking the Holevo bound, optimized over all possible n-letter input ensembles. We therefore infer that

    C(N) ≤ lim_{n→∞} (1/n) χ(N^⊗n);    (10.237)

the classical capacity is bounded above by the asymptotic Holevo χ per letter of the product channel N^⊗n.
In fact this upper bound is actually an achievable rate, and hence equal to the classical capacity C(N). However, this formula for the classical capacity is not very useful as it stands, because it requires that we optimize the Holevo χ over message ensembles of arbitrary length; we say that the formula for capacity is regularized if, as in this case, it involves taking a limit in which the number of channel uses tends to infinity. It would be far preferable to reduce our expression for C(N) to a single-letter formula involving just one use of the channel. In the case of a classical channel, the reduction of the regularized expression to a single-letter formula was possible, because the conditional entropy for n uses of the channel is additive as in eq.(10.44).
For quantum channels the situation is more complicated, as channels are known to exist such that the Holevo χ is strictly superadditive:

    χ(N₁ ⊗ N₂) > χ(N₁) + χ(N₂).    (10.238)
Therefore, at least for some channels, we are stuck with the not-very-useful regularized formula for the classical capacity. But we can obtain a single-letter formula for the optimal achievable communication rate if we put a restriction on the code used by Alice and Bob. In general, Alice is entitled to choose input codewords which are entangled across the many uses of the channel, and when such entangled codes are permitted the computation of the classical channel capacity may be difficult. But suppose we demand that all of Alice's codewords are product states. With that proviso the Holevo chi becomes subadditive, and we may express the optimal rate as

    C₁(N) = χ(N).    (10.239)

C₁(N) is called the product-state capacity of the channel.
Let's verify the subadditivity of χ for product-state codes. The product channel N^⊗n maps product states to product states; hence if Alice's input signals are product states then so are Bob's output signals, and we can express Bob's n-letter ensemble as

    E′^(n) = {ρ′(x₁) ⊗ ρ′(x₂) ⊗ ··· ⊗ ρ′(x_n), p(x₁x₂ . . . x_n)},    (10.240)

which has Holevo χ

    χ(E′^(n)) = I(X^n; B^n) = H(B^n) − H(B^n | X^n).    (10.241)
While the Von Neumann entropy is subadditive,

    H(B^n) ≤ Σ_{i=1}^{n} H(B_i),    (10.242)

the (negated) conditional entropy

    H(B^n | X^n) = Σ_{x⃗} p(x⃗) H(ρ′(x⃗))    (10.243)
(see eq.(10.209)) is not subadditive in general. But for the product-state ensemble eq.(10.240), since the entropy of a product is additive, we have

    H(B^n | X^n) = Σ_{x₁,x₂,...,x_n} p(x₁x₂ . . . x_n) ( Σ_{i=1}^{n} H(ρ′(x_i)) )
                 = Σ_{i=1}^{n} Σ_{x_i} p_i(x_i) H(ρ′(x_i)) = Σ_{i=1}^{n} H(B_i | X_i),    (10.244)

where p_i(x_i) is the marginal probability distribution for the ith letter. Eq.(10.244) is a quantum analog of eq.(10.44), which holds for product-state ensembles but not in general for entangled ensembles.
Combining eq.(10.241), (10.242), (10.244), we have

    I(X^n; B^n) ≤ Σ_{i=1}^{n} ( H(B_i) − H(B_i | X_i) ) = Σ_i I(X_i; B_i) ≤ nχ(N).    (10.245)
Therefore the Holevo χ of a channel is subadditive when restricted to
product-state codewords, as we wanted to show.
We won't give a careful argument here that C₁(N) is an asymptotically achievable rate using product-state codewords; we'll just give a rough sketch of the idea. We demonstrate achievability with a random coding argument similar to Shannon's. Alice fixes an input ensemble E = {ρ(x), p(x)}, and samples from the product ensemble E^⊗n to generate a codeword; that is, the codeword

    ρ(x⃗) = ρ(x₁) ⊗ ρ(x₂) ⊗ ··· ⊗ ρ(x_n)    (10.246)

is selected with probability p(x⃗) = p(x₁)p(x₂) . . . p(x_n). (In fact Alice should choose each ρ(x⃗) to be pure to optimize the communication rate.)
This codeword is sent via n uses of the channel N, and Bob receives the product state

    N^⊗n(ρ(x⃗)) = N(ρ(x₁)) ⊗ N(ρ(x₂)) ⊗ ··· ⊗ N(ρ(x_n)).    (10.247)

Averaged over codewords, the joint state of Alice's classical register X^n and Bob's system B^n is

    ρ_{X^n B^n} = Σ_{x⃗} p(x⃗) |x⃗⟩⟨x⃗| ⊗ N^⊗n(ρ(x⃗)).    (10.248)
To decode, Bob performs a POVM designed to distinguish the codewords effectively; a variant of the pretty good measurement described in §10.6.4 does the job well enough. The state Bob receives is mostly supported on a typical subspace with dimension 2^(n(H(B)+o(1))), and for each typical codeword that Alice sends, what Bob receives is mostly supported on a much smaller typical subspace with dimension 2^(n(H(B|X)+o(1))). The key point is that the ratio of these dimensions is exponentially small in the mutual information of X and B:

    2^(n(H(B|X)+o(1))) / 2^(n(H(B)−o(1))) = 2^(−n(I(X;B)−o(1))).    (10.249)
Each of Bob's POVM elements has support on the typical subspace arising from a particular one of Alice's codewords. The probability that any codeword is mapped, purely by accident, to the decoding subspace of a different codeword is suppressed by the ratio eq.(10.249). Therefore, the probability of a decoding error remains small even when there are 2^(nR) codewords to distinguish, for R = I(X; B) − o(1).
We complete the argument with standard Shannonisms. Since the probability of decoding error is small when we average over codes, it must also be small, averaged over codewords, for a particular sequence of codes. Then by pruning half of the codewords, reducing the rate by a negligible amount, we can ensure that the decoding errors are improbable for every codeword in the code. Therefore I(X; B) is an achievable rate for classical communication. Optimizing over all product-state input ensembles, we obtain eq.(10.239).
To turn this into an honest argument, we would need to specify Bob's decoding measurement more explicitly and do a careful error analysis. This gets a bit technical, so we'll skip the details. Somewhat surprisingly, though, it turns out to be easier to prove capacity theorems when quantum channels are used for other tasks besides sending classical information. We'll turn to that in §10.7.
10.6.6 Entanglement-breaking channels
Though Holevo chi is superadditive for some quantum channels, there are classes of channels for which chi is additive, and for any such channel N the classical capacity is C = χ(N) without any need for regularization. For example, consider entanglement-breaking channels. We say that N_{A→B} is entanglement breaking if for any input state ρ_{RA}, I ⊗ N(ρ_{RA}) is a separable state on RB: the action of N on A always breaks its entanglement with R. We claim that if N₁ is entanglement breaking, and N₂ is an arbitrary channel, then

    χ(N₁ ⊗ N₂) ≤ χ(N₁) + χ(N₂).    (10.250)
To bound the chi of the product channel, consider an input ensemble

    ρ_{XA₁A₂} = Σ_x p(x) |x⟩⟨x| ⊗ ρ(x)_{A₁A₂}.    (10.251)

Because N₁ is entanglement breaking, ρ(x)_{A₁A₂} is mapped by the product channel to a separable state:

    N₁ ⊗ N₂ : ρ(x)_{A₁A₂} ↦ Σ_y p(y|x) σ(x, y)_{B₁} ⊗ τ(x, y)_{B₂}.    (10.252)
Now χ(N₁ ⊗ N₂) is the maximum of I(X; B₁B₂)_ρ, evaluated in the state

    ρ_{XB₁B₂} = Σ_{x,y} p(x) p(y|x) |x⟩⟨x| ⊗ σ(x, y)_{B₁} ⊗ τ(x, y)_{B₂},    (10.253)

which may be regarded as the marginal state (after tracing out Y) of

    ρ̃_{XYB₁B₂} = Σ_{x,y} p(x, y) |x, y⟩⟨x, y| ⊗ σ(x, y)_{B₁} ⊗ τ(x, y)_{B₂}.    (10.254)
Because ρ̃ becomes a product state when conditioned on (x, y), it satisfies

    H(B₁B₂ | XY) = H(B₁ | XY) + H(B₂ | XY),    (10.255)

and from the subadditivity and strong subadditivity of entropy we have

    I(X; B₁B₂) ≤ I(XY; B₁B₂) = H(B₁B₂) − H(B₁B₂ | XY)
               ≤ H(B₁) + H(B₂) − H(B₁ | XY) − H(B₂ | XY)
               = I(XY; B₁) + I(XY; B₂).    (10.256)
The right-hand side is bounded above by χ(N₁) + χ(N₂), and maximizing the left-hand side yields eq.(10.250).
An example of an entanglement-breaking channel is a classical-quantum channel, also called a c-q channel, which acts according to

    N_{A→B} : ρ_A ↦ Σ_x ⟨x| ρ_A |x⟩ σ(x)_B,    (10.257)

where {|x⟩} is an orthonormal basis. In effect, the channel performs a complete orthogonal measurement on the input state and then prepares an output state conditioned on the measurement outcome. The measurement breaks the entanglement between system A and any other system with which it was initially entangled. Therefore, c-q channels are entanglement breaking and have additive Holevo chi.
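Because a c-q channel is measure-then-prepare, applying it to half of an entangled pair yields a manifestly separable output. A minimal sketch (the output states σ(x) here are arbitrary illustrative choices, not drawn from the text):

    import numpy as np

    def cq_apply(rho_RA, dA, sigmas):
        # Apply N: rho_A -> sum_x <x|rho|x> sigma(x) to the A factor of rho_RA
        dR = rho_RA.shape[0] // dA
        dB = sigmas[0].shape[0]
        out = np.zeros((dR * dB, dR * dB), dtype=complex)
        for x in range(dA):
            bra = np.zeros((1, dA)); bra[0, x] = 1.0
            proj = np.kron(np.eye(dR), bra)           # I_R (x) <x|_A
            rho_R_x = proj @ rho_RA @ proj.conj().T   # unnormalized conditional state of R
            out += np.kron(rho_R_x, sigmas[x])
        return out

    bell = np.array([1, 0, 0, 1]) / np.sqrt(2)        # maximally entangled RA pair
    sigmas = [np.array([[1.0, 0.0], [0.0, 0.0]]),
              np.array([[0.5, 0.5], [0.5, 0.5]])]
    rho_RB = cq_apply(np.outer(bell, bell), 2, sigmas)
    # rho_RB = (1/2) sum_x |x><x| (x) sigma(x): separable, so entanglement is broken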
10.7 Quantum Channel Capacities and Decoupling
10.7.1 Coherent information and the quantum channel capacity
As we have already emphasized, it's marvelous that the capacity for a classical channel can be expressed in terms of the optimal correlation between input and output for a single use of the channel,

    C := max_X I(X; Y).    (10.258)
Another pleasing feature of this formula is its robustness. For example, the capacity does not increase if we allow the sender and receiver to share randomness, or if we allow feedback from receiver to sender. But for quantum channels the story is more complicated. We've seen already that no simple single-letter formula is known for the classical capacity of a quantum channel, if we allow entanglement among the channel inputs, and we'll soon see that the same is true for the quantum capacity. In addition, it turns out that entanglement shared between sender and receiver can boost the classical and quantum capacities of some channels, and so can "backward" communication from receiver to sender. There are a variety of different notions of capacity for quantum channels, all reasonably natural, and all with different achievable rates.
While Shannon's theory of classical communication over noisy classical channels is pristine and elegant, the same cannot be said for the theory of communication over noisy quantum channels, at least not in its current state. It's still a work in progress. Perhaps some day another genius like Shannon will construct a beautiful theory of quantum capacities. For now, at least there are a lot of interesting things we can say about achievable rates. Furthermore, the tools that have been developed to address questions about quantum capacities have other applications beyond communication theory.
The most direct analog of the classical capacity of a classical channel is the quantum capacity of a quantum channel, unassisted by shared entanglement or feedback. The quantum channel N_{A→B} is a TPCP map from H_A to H_B, and Alice is to use the channel n times to convey a quantum state to Bob with high fidelity. She prepares her state |ψ⟩ in a code subspace

    H^(n) ⊆ H_A^⊗n    (10.259)

and sends it to Bob, who applies a decoding map, attempting to recover |ψ⟩. The rate R of the code is the number of encoded qubits sent per channel use,

    R = (1/n) log₂ dim H^(n).    (10.260)

We say that the rate R is achievable if there is a sequence of codes with increasing n such that for any ε, δ > 0 and for sufficiently large n the rate is at least R − δ and Bob's recovered state ρ has fidelity F = ⟨ψ|ρ|ψ⟩ ≥ 1 − ε. The quantum channel capacity Q(N) is the supremum of all achievable rates.
There is a regularized formula for Q(N). To understand the formula we first need to recall that any channel N_{A→B} has an isometric Stinespring dilation U_{A→BE}, where E is the channel's "environment." Furthermore, any input density operator ρ_A has a purification; if we introduce a reference system R, for any ρ_A there is a pure state ψ_{RA} such that ρ_A = tr_R (|ψ⟩⟨ψ|). (I will sometimes use ψ rather than the Dirac ket |ψ⟩ to denote a pure state vector, when the context makes the meaning clear and the ket notation seems unnecessarily cumbersome.) Applying the channel's dilation to ψ_{RA}, we obtain an output pure state φ_{RBE}, which we represent graphically as a circuit diagram:

[Figure: R is retained, while A enters the isometry U, whose outputs are B and E.]

We then define the one-shot quantum capacity of the channel N by

    Q₁(N) := max_A ( −H(R|B)_{φ_RBE} ).    (10.261)
Here the maximum is taken over all possible input density operators {ρ_A}, and H(R|B) is the quantum conditional entropy

    H(R|B) = H(RB) − H(B) = H(E) − H(B),    (10.262)

where in the last equality we used H(RB) = H(E) in a pure state of RBE. The quantity −H(R|B) has such a pivotal role in quantum communication theory that it deserves to have its own special name. We call it the coherent information from R to B and denote it

    I_c(R⟩B)_φ = −H(R|B)_φ = H(B)_φ − H(E)_φ.    (10.263)
This quantity does not depend on how the purification φ of the density operator ρ_A is chosen; any one purification can be obtained from any other by a unitary transformation acting on R alone, which does not alter H(B) or H(E). Indeed, since the expression H(B) − H(E) only depends on the marginal state of BE, for the purpose of computing this quantity we could just as well consider the input to the channel to be the mixed state ρ_A obtained from ψ_{RA} by tracing out the reference system R.
For a classical channel, H(R|B) is always nonnegative and the coherent information is never positive. In the quantum setting, I_c(R⟩B) is positive if the reference system R is more strongly correlated with the channel output B than with the environment E. Indeed, an alternative way to express the coherent information is

    I_c(R⟩B) = ½ ( I(R; B) − I(R; E) ) = H(B) − H(E),    (10.264)

where we note that (because φ_{RBE} is pure)

    I(R; B) = H(R) + H(B) − H(RB) = H(R) + H(B) − H(E),
    I(R; E) = H(R) + H(E) − H(RE) = H(R) + H(E) − H(B).    (10.265)
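Given Kraus operators {K_i} of N, both H(B) and H(E) are computable without constructing the dilation explicitly, since the environment's marginal has matrix elements (ρ_E)_ij = tr(K_i ρ_A K_j†). A sketch, using an amplitude-damping channel with an assumed damping probability as the illustrative example:

    import numpy as np

    def entropy(rho):
        lam = np.linalg.eigvalsh(rho)
        lam = lam[lam > 1e-12]
        return float(-np.sum(lam * np.log2(lam)))

    def coherent_info(rho_A, kraus):
        # I_c(R>B) = H(B) - H(E) for the channel with the given Kraus operators
        rho_B = sum(K @ rho_A @ K.conj().T for K in kraus)
        rho_E = np.array([[np.trace(Ki @ rho_A @ Kj.conj().T) for Kj in kraus]
                          for Ki in kraus])
        return entropy(rho_B) - entropy(rho_E)

    p = 0.2                                           # damping probability (assumed)
    K0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1 - p)]])
    K1 = np.array([[0.0, np.sqrt(p)], [0.0, 0.0]])
    print(coherent_info(np.diag([0.5, 0.5]), [K0, K1]))   # ~0.502, positive for p < 1/2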
Now we can state the regularized formula for the quantum channel capacity: it is the optimal asymptotic coherent information per letter,

    Q(N_{A→B}) = lim_{n→∞} max_{A^n} (1/n) I_c(R^n⟩B^n)_{φ_{R^n B^n E^n}},    (10.266)

where the input density operator ρ_{A^n} is allowed to be entangled across the n channel uses. If coherent information were subadditive, we could reduce this expression to a single-letter quantity, the one-shot capacity Q₁(N). But, unfortunately, for some channels the coherent information can be superadditive, in which case the regularized formula is not very informative. At least we can say that Q₁(N) is an achievable rate, and therefore a lower bound on the capacity.
10.7.2 The decoupling principle
Before we address achievability, let's understand why eq.(10.266) is an upper bound on the capacity. First we note that the monotonicity of mutual information implies a corresponding monotonicity property for the coherent information. Suppose that the channel N₁_{A→B} is followed by a channel N₂_{B→C}. Because mutual information is monotonic we have

    I(R; A) ≥ I(R; B) ≥ I(R; C),    (10.267)

which can also be expressed as

    H(R) − H(R|A) ≥ H(R) − H(R|B) ≥ H(R) − H(R|C),    (10.268)

and hence

    I_c(R⟩A) ≥ I_c(R⟩B) ≥ I_c(R⟩C).    (10.269)

A quantum channel cannot increase the coherent information, which has been called the quantum data-processing inequality.
Suppose now that ρ_A is a quantum code state, and that the two channels acting in succession are a noisy channel N_{A→B} and the decoding map D_{B→B̂} applied by Bob to the channel output in order to recover the channel input. Consider the action of the dilation U_{A→BE} of N followed by the dilation V_{B→B̂B′} of D on the input purification ψ_{RA}, under the assumption that Bob is able to recover perfectly:

    ψ_{RA} →(U)→ φ_{RBE} →(V)→ ψ̃_{RB̂B′E} = ψ_{RB̂} ⊗ χ_{B′E}.    (10.270)
If the decoding is perfect, then after decoding Bob holds in system B̂ the purification of the state of R, so that

    H(R) = I_c(R⟩A)_ψ = I_c(R⟩B̂)_ψ̃.    (10.271)

Since the initial and final states have the same coherent information, the quantum data processing inequality implies that the same must be true for the intermediate state φ_{RBE}:

    H(R) = I_c(R⟩B) = H(B) − H(E) = H(RE) − H(E)
    ⟹ H(RE) = H(R) + H(E).    (10.272)

Thus the state of RE is a product state. We have found that if Bob is able to recover perfectly from the action of the channel dilation U_{A→BE} on the pure state ψ_{RA}, then, in the resulting channel output pure state φ_{RBE}, the marginal state ρ_{RE} must be the product ρ_R ⊗ ρ_E.
Conversely, suppose that ψ_{RA} is an entangled pure state, and Alice wishes to transfer the purification of R to Bob by sending it through the noisy channel U_{A→BE}. And suppose that in the resulting tripartite pure state φ_{RBE}, the marginal state of RE factorizes as ρ_{RE} = ρ_R ⊗ ρ_E. Then B decomposes into subsystems B = B₁B₂ such that

    φ_{RBE} = ψ̃_{RB₁} ⊗ χ_{B₂E}.    (10.273)
Now Bob can construct an isometric decoder V_{B₁→B̂}, which extracts the purification of R into Bob's preferred subsystem B̂. Since all purifications of R differ by an isometry on Bob's side, Bob can choose his decoding map to output the state ψ_{RB̂}; then the input state of RA is successfully transmitted to RB̂ as desired. Furthermore, we may choose the initial state to be a maximally entangled state Φ_{RA} of the reference system with the code space of a quantum code; if the marginal state of RE factorizes in the resulting output pure state φ_{RBE}, then by the relative state method of Chapter 3 we conclude that any state in the code space can be sent through the channel and decoded with perfect fidelity by Bob.
We have found that purified quantum information transmitted through the noisy channel is exactly correctable if and only if the reference system is completely uncorrelated with the channel's environment, or as we sometimes say, decoupled from the environment. This is the decoupling principle, a powerful notion underlying many of the key results in the theory of quantum channels.
So far we have shown that exact correctability corresponds to exact decoupling. But we can likewise see that approximate correctability corresponds to approximate decoupling. Suppose for example that the state of RE is close to a product state in the L¹ norm:

    ‖ρ_{RE} − ρ_R ⊗ ρ_E‖₁ ≤ ε.    (10.274)
As we learned in Chapter 2, if two density operators are close together in this norm, that means they also have fidelity close to one and hence purifications with a large overlap. Any purification of the product state ρ_R ⊗ ρ_E has the form

    φ̃_{RBE} = ψ̃_{RB₁} ⊗ χ_{B₂E},    (10.275)

and since all purifications of ρ_{RE} can be transformed to one another by an isometry acting on the purifying system B, there is a way to choose the decomposition B = B₁B₂ such that

    F(ρ_{RE}, ρ_R ⊗ ρ_E) = |⟨φ_{RBE}|φ̃_{RBE}⟩|² ≥ 1 − ‖ρ_{RE} − ρ_R ⊗ ρ_E‖₁ ≥ 1 − ε.    (10.276)
Furthermore, because fidelity is monotonic, both under tracing out E and under the action of Bob's decoding map, and because Bob can decode φ̃_{RBE} perfectly, we conclude that

    F( D_{B→B̂}(ρ_{RB}), ψ_{RB̂} ) ≥ 1 − ε    (10.277)

if Bob chooses the proper decoding map D. Thus approximate decoupling in the L¹ norm implies high-fidelity correctability. It is convenient to note that the argument still works the same way if ρ_{RE} is ε-close in the L¹ norm to ρ̃_R ⊗ ρ̃_E, where ρ̃_R is not necessarily tr_E(ρ_{RE}) and ρ̃_E is not necessarily tr_R(ρ_{RE}). We'll use this form of the argument in what follows.
On the other hand, if (approximate) decoupling fails, the fidelity of Bob's decoded state will be seriously compromised. Suppose that in the state φ_{RBE} we have

    H(R) + H(E) − H(RE) = ε > 0.    (10.278)

Then the coherent information of φ is

    I_c(R⟩B)_φ = H(B)_φ − H(E)_φ = H(RE)_φ − H(E)_φ = H(R)_φ − ε.    (10.279)
By the quantum data processing inequality, we know that the coherent information of Bob's decoded state ψ̃_{RB̂} is no larger; hence

    I_c(R⟩B̂)_ψ̃ = H(R)_ψ − H(RB̂)_ψ̃ ≤ H(R)_ψ − ε,    (10.280)

and therefore

    H(RB̂)_ψ̃ ≥ ε.    (10.281)

The deviation from perfect decoupling means that the decoded state of RB̂ has some residual entanglement with the environment E, and is therefore impure.
Now we have the tools to derive an upper bound on the quantum channel capacity Q(N). For n channel uses, let ψ^(n) be a maximally entangled state of a reference system H_R^(n) ⊆ H_R^⊗n with a code space H_A^(n) ⊆ H_A^⊗n, where dim H_A^(n) = 2^(nR), so that

    I_c(R^n⟩A^n)_{ψ^(n)} = H(R^n)_{ψ^(n)} = nR.    (10.282)

Now A^n is transmitted to B^n through (U_{A→BE})^⊗n, yielding the pure state φ^(n) of R^n B^n E^n. If Bob can decode with high fidelity, then his decoded state must have coherent information H(R^n)_{ψ^(n)} − o(n), and the quantum data processing inequality then implies that

    I_c(R^n⟩B^n)_{φ^(n)} ≥ H(R^n)_{ψ^(n)} − o(n) = nR − o(n)    (10.283)

and hence

    R ≤ (1/n) I_c(R^n⟩B^n)_{φ^(n)} + o(1).    (10.284)
Taking the limit n → ∞, we see that the expression for Q(N) in eq.(10.266) is an upper bound on the quantum channel capacity. In Exercise 10.9, you will sharpen the statement eq.(10.283), showing that

    H(R^n) − I_c(R^n⟩B^n) ≤ 2H₂(ε) + 4εnR.    (10.285)

To show that Q(N) is an achievable rate, rather than just an upper bound, we will need to formulate a quantum version of Shannon's random coding argument. Our strategy (see §10.9.3) will be to demonstrate the existence of codes that achieve approximate decoupling of E^n from R^n.
10.7.3 Degradable channels
Though coherent information can be superadditive in some cases, there are classes of channels for which the coherent information is additive, and therefore the quantum channel capacity matches the single-shot capacity, for which there is a single-letter formula. One such class is the class of degradable channels.

To understand what a degradable channel is, we first need the concept of a complementary channel. Any channel N_{A→B} has a Stinespring dilation U_{A→BE}, from which we obtain N_{A→B} by tracing out the environment E. Alternatively we obtain the channel N^c_{A→E} complementary to N_{A→B} by tracing out B instead. Since we have the freedom to compose U_{A→BE} with an isometry V_{E→E′} without changing N_{A→B}, the complementary channel is defined only up to an isometry acting on E. This lack of uniqueness need not trouble us, because the properties of interest for the complementary channel are invariant under such isometries.
We say that the channel N_{A→B} is degradable if we can obtain its complementary channel by composing N_{A→B} with a channel mapping B to E:

    N^c_{A→E} = T_{B→E} ∘ N_{A→B}.    (10.286)
In this sense, when Alice sends a state through the channel, Bob, who holds system B, receives a less noisy copy than Eve, who holds system E.

Now suppose that U₁_{A₁→B₁E₁} and U₂_{A₂→B₂E₂} are dilations of the degradable channels N₁ and N₂. Alice introduces a reference system R and prepares an input pure state ψ_{RA₁A₂}, then sends the state to Bob via N₁ ⊗ N₂, preparing the output pure state φ_{RB₁B₂E₁E₂}. We would like to evaluate the coherent information I_c(R⟩B₁B₂)_φ in this state.
The key point is that because both channels are degradable, there is a product channel T₁ ⊗ T₂ mapping B₁B₂ to E₁E₂, and the monotonicity of mutual information therefore implies

    I(B₁; B₂) ≥ I(E₁; E₂).    (10.287)

Therefore, the coherent information satisfies

    I_c(R⟩B₁B₂) = H(B₁B₂) − H(E₁E₂)
                = H(B₁) + H(B₂) − I(B₁; B₂) − H(E₁) − H(E₂) + I(E₁; E₂)
                ≤ H(B₁) − H(E₁) + H(B₂) − H(E₂).    (10.288)
These quantities are all evaluated in the state φ_{RB₁B₂E₁E₂}. But notice that for the evaluation of H(B₁) − H(E₁), the isometry U₂_{A₂→B₂E₂} is irrelevant. This quantity is really the same as the coherent information I_c(RA₂⟩B₁), where now we regard A₂ as part of the reference system for the input to channel N₁. Similarly H(B₂) − H(E₂) = I_c(RA₁⟩B₂), and therefore,

    I_c(R⟩B₁B₂) ≤ I_c(RA₂⟩B₁) + I_c(RA₁⟩B₂) ≤ Q₁(N₁) + Q₁(N₂),    (10.289)

where in the last inequality we use the definition of the one-shot capacity as coherent information maximized over all inputs. Since Q₁(N₁ ⊗ N₂) is likewise defined by maximizing the coherent information I_c(R⟩B₁B₂), we find that

    Q₁(N₁ ⊗ N₂) ≤ Q₁(N₁) + Q₁(N₂)    (10.290)

if N₁ and N₂ are degradable.
The regularized formula for the capacity of N is

    Q(N) = lim_{n→∞} (1/n) Q₁(N^⊗n) ≤ Q₁(N),    (10.291)

where the last inequality follows from eq.(10.290), assuming that N is degradable. We'll see that Q₁(N) is actually an achievable rate, and therefore a single-letter formula for the quantum capacity of a degradable channel.
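For a degradable channel, then, Q(N) = Q₁(N) reduces to a finite-dimensional optimization. A sketch for the amplitude-damping channel (degradable for p < 1/2), restricting the scan to diagonal inputs, which suffices for this channel by its phase covariance together with the concavity of coherent information for degradable channels (an assumption we flag rather than prove here):

    import numpy as np

    def entropy(rho):
        lam = np.linalg.eigvalsh(rho)
        lam = lam[lam > 1e-12]
        return float(-np.sum(lam * np.log2(lam)))

    def coherent_info(rho, kraus):
        rho_B = sum(K @ rho @ K.conj().T for K in kraus)
        rho_E = np.array([[np.trace(Ki @ rho @ Kj.conj().T) for Kj in kraus]
                          for Ki in kraus])
        return entropy(rho_B) - entropy(rho_E)

    p = 0.2                                           # damping probability (assumed)
    K0 = np.array([[1.0, 0.0], [0.0, np.sqrt(1 - p)]])
    K1 = np.array([[0.0, np.sqrt(p)], [0.0, 0.0]])
    Q1 = max(coherent_info(np.diag([1 - q, q]), [K0, K1])
             for q in np.linspace(0.0, 1.0, 201))
    print(Q1)     # one-shot capacity, equal to Q(N) since the channel is degradable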
As a concrete example of a degradable channel, consider the generalized dephasing channel with dilation

    U_{A→BE} : |x⟩_A ↦ |x⟩_B ⊗ |α_x⟩_E,    (10.292)

where {|x⟩_A}, {|x⟩_B} are orthonormal bases for H_A, H_B respectively, and the states {|α_x⟩_E} of the environment are not necessarily orthogonal.
The corresponding channel is

    N_{A→B} : ρ ↦ Σ_{x,x′} |x⟩ ⟨x|ρ|x′⟩ ⟨α_{x′}|α_x⟩ ⟨x′|,    (10.293)

which has the complementary channel

    N^c_{A→E} : ρ ↦ Σ_x |α_x⟩ ⟨x|ρ|x⟩ ⟨α_x|.    (10.294)

In the special case where the states {|α_x⟩_E = |x⟩_E} are orthonormal, we obtain the completely dephasing channel

    Δ_{A→B} : ρ ↦ Σ_x |x⟩ ⟨x|ρ|x⟩ ⟨x|,    (10.295)

whose complement Δ^c_{A→E} has the same form as Δ_{A→B}. We can easily check that

    N^c_{A→E} = N^c_{C→E} ∘ Δ_{B→C} ∘ N_{A→B};    (10.296)

therefore N^c ∘ Δ degrades N to N^c. Thus N is degradable and Q(N) = Q₁(N).
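That check is also quick to do numerically, for a sample (assumed) choice of non-orthogonal environment states |α_x⟩:

    import numpy as np

    alphas = [np.array([1.0, 0.0]),
              np.array([np.cos(0.3), np.sin(0.3)])]   # normalized, non-orthogonal

    def N(rho):      # generalized dephasing channel, eq.(10.293)
        return np.array([[rho[x, y] * np.vdot(alphas[y], alphas[x])
                          for y in range(2)] for x in range(2)])

    def N_c(rho):    # complementary channel, eq.(10.294)
        return sum(rho[x, x] * np.outer(alphas[x], alphas[x].conj()) for x in range(2))

    def Delta(rho):  # completely dephasing channel, eq.(10.295)
        return np.diag(np.diag(rho))

    rho = np.array([[0.7, 0.3], [0.3, 0.3]])
    assert np.allclose(N_c(rho), N_c(Delta(N(rho))))  # the degrading map works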
Further examples of degradable channels are discussed in Exercise
10.11.
10.8 Quantum Protocols
Using the decoupling principle in an i.i.d. setting, we can prove achievable rates for two fundamental quantum protocols. These are fondly known as the father and mother protocols, so named because each spawns a brood of interesting corollaries. We will formulate these protocols and discuss some of their "children" in this section, postponing the proofs until §10.9.
10.8.1 Father: Entanglement-assisted quantum communication
The father protocol is a scheme for entanglement-assisted quantum communication. Through many uses of a noisy quantum channel N_{A→B}, this protocol sends quantum information with high fidelity from Alice to Bob, while also consuming some previously prepared quantum entanglement shared by Alice and Bob. The task performed by the protocol is summarized by the father resource inequality

    ⟨N_{A→B} : ρ_A⟩ + ½ I(R; E) [qq] ≥ ½ I(R; B) [q→q],    (10.297)

where the resources on the left-hand side can be used to achieve the result on the right-hand side, in an asymptotic i.i.d. setting. That is, for any positive ε, the quantum channel N may be used n times to transmit (n/2) I(R; B) − o(n) qubits with fidelity F ≥ 1 − ε, while consuming (n/2) I(R; E) + o(n) ebits of entanglement shared between sender and receiver. These entropic quantities are evaluated in a tripartite pure state φ_{RBE}, obtained by applying the Stinespring dilation U_{A→BE} of N_{A→B} to the purification ψ_{RA} of the input density operator ρ_A. Eq.(10.297) means that for any input density operator ρ_A, there exists a coding procedure that achieves the quantum communication at the specified rate by consuming entanglement at the specified rate.
To remember the father resource inequality, it helps to keep in mind that I(R; B) quantifies something good, the correlation with the reference system which survives transmission through the channel, while I(R; E) quantifies something bad, the correlation between the reference system R and the channel's environment E, which causes the transmitted information to decohere. The larger the good quantity I(R; B), the higher the rate of quantum communication. The larger the bad quantity I(R; E), the more entanglement we need to consume to overcome the noise in the channel. To remember the factor of ½ in front of I(R; B), consider the case of a noiseless quantum channel, where ψ_{RA} is maximally entangled; in that case there is no environment,

    φ_{RB} = (1/√d) Σ_{i=0}^{d−1} |i⟩_R |i⟩_B,    (10.298)

and ½ I(R; B) = H(R) = H(B) = log₂ d is just the number of qubits in A. To remember the factor of ½ in front of I(R; E), consider the case of a noiseless classical channel, where the quantum information completely decoheres in a preferred basis; in that case

    φ_{RBE} = (1/√d) Σ_{i=0}^{d−1} |i⟩_R |i⟩_B |i⟩_E,    (10.299)

and I(R; B) = I(R; E) = H(R) = H(B) = log₂ d. Then the father inequality merely says that we can teleport n/2 qubits by consuming n/2 ebits and sending n classical bits.
Before proving the father resource inequality, we will first discuss a few of its interesting consequences.
Entanglement-assisted classical communication. Suppose Alice wants to send classical information to Bob, rather than quantum information. Then we can use superdense coding to turn the quantum communication achieved by the father protocol into classical communication, at the cost of consuming some additional entanglement. By invoking the superdense coding resource inequality

    SD : [q→q] + [qq] ≥ 2[c→c]    (10.300)

(n/2) I(R; B) times, and combining with the father resource inequality, we obtain I(R; B) bits of classical communication per use of the channel while consuming a number of ebits

    ½ I(R; E) + ½ I(R; B) = H(R)    (10.301)

per channel use. Thus we obtain an achievable rate for entanglement-assisted classical communication through the noisy quantum channel:

    ⟨N_{A→B} : ρ_A⟩ + H(R) [qq] ≥ I(R; B) [c→c].    (10.302)

We may define the entanglement-assisted classical capacity C_E(N) as the supremum over achievable rates of classical communication per channel use, assuming that an unlimited amount of entanglement is available at no cost. Then the resource inequality eq.(10.302) implies

    C_E(N) ≥ max_A I(R; B).    (10.303)
In this case there is a matching upper bound, so eq.(10.303) is really an equality, and hence a single-letter formula for the entanglement-assisted classical capacity. Furthermore, eq.(10.302) tells us a rate of entanglement consumption which suffices to achieve the capacity. If we disregard the cost of entanglement, the father protocol shows that a rate can be achieved for entanglement-assisted quantum communication which is half the entanglement-assisted classical capacity C_E(N) of the noisy channel N. That's clearly true, since by consuming entanglement we can use teleportation to convert n bits of classical communication into n/2 qubits of quantum communication.
Quantum channel capacity. It may be that Alice wants to send quantum information to Bob, but Alice and Bob are not so fortunate as to have pre-existing entanglement at their disposal. They can still make use of the father protocol, if we are willing to loan them some entanglement, which they are later required to repay. In this case we say that the entanglement catalyzes the quantum communication. Entanglement is needed to activate the process to begin with, but at the conclusion of the process no net entanglement has been consumed.

In this catalytic setting, Alice and Bob borrow ½ I(R; E) ebits of entanglement per use of the channel to get started, execute the father protocol,
and then sacrifice some of the quantum communication they have generated to replace the borrowed entanglement via the resource inequality

    [q→q] ≥ [qq].    (10.304)

After repaying their debt, Alice and Bob retain a number of qubits of quantum communication per channel use

    ½ I(R; B) − ½ I(R; E) = H(B) − H(E) = I_c(R⟩B),    (10.305)

the channel's coherent information from R to B. We therefore obtain the achievable rate for quantum communication

    ⟨N_{A→B} : ρ_A⟩ ≥ I_c(R⟩B) [q→q],    (10.306)

albeit in the catalyzed setting. It can actually be shown that this same rate is achievable without invoking catalysis (see §10.9.4). As already discussed in §10.7.1, though, because of the superadditivity of coherent information this resource inequality does not yield a general single-letter formula for the quantum channel capacity Q(N).
10.8.2 Mother: Quantum state transfer
In the mother protocol, Alice, Bob, and Eve initially share a tripartite pure state φ_{ABE}; thus Alice and Bob together hold the purification of Eve's system E. Alice wants to send her share of this purification to Bob, using as few qubits of noiseless quantum communication as possible. Therefore, Alice divides her system A into two subsystems A₁ and A₂, where A₁ is as small as possible and A₂ is uncorrelated with E. She keeps A₂ and sends A₁ to Bob. After receiving A₁, Bob divides A₁B into two subsystems B₁ and B₂, where B₁ purifies E and B₂ purifies A₂. Thus, at the conclusion of the protocol, Bob holds the purification of E in B₁, and in addition Alice and Bob share a bipartite pure state in A₂B₂. The protocol is portrayed in the following diagram:
[Figure: starting from φ_ABE, Alice splits A into A₁A₂ and sends A₁ to Bob; Bob then splits A₁B into B₁, which purifies E, and B₂, which purifies A₂.]
In the i.i.d. version of the mother protocol, the initial state is φ_{ABE}^⊗n, and the task achieved by the protocol is summarized by the mother resource inequality

    ⟨φ_{ABE}⟩ + ½ I(A; E) [q→q] ≥ ½ I(A; B) [qq] + ⟨φ_{B̃E}⟩,    (10.307)
where the resources on the left-hand side can be used to achieve the result on the right-hand side, in an asymptotic i.i.d. setting, and the entropic quantities are evaluated in the state φ_{ABE}. That is, if A₁^(n) denotes the state Alice sends and A₂^(n) denotes the state she keeps, then for any positive ε, the state of A₂^(n)E^n is ε-close in the L¹ norm to a product state, where log |A₁^(n)| = (n/2) I(A; E) + o(n), while A₂^(n)B₂^(n) contains (n/2) I(A; B) − o(n) shared ebits of entanglement. Eq.(10.307) means that for any input pure state φ_{ABE} there is a way to choose the subsystem A₂^(n) of the specified dimension such that A₂^(n) and E^n are nearly uncorrelated and the specified amount of entanglement is harvested in A₂^(n)B₂^(n).
The mother protocol is in a sense dual to the father protocol. While the father protocol consumes entanglement to achieve quantum communication, the mother protocol consumes quantum communication and harvests entanglement. For the mother, I(A; B) quantifies the correlation between Alice and Bob at the beginning of the protocol (something good), and I(A; E) quantifies the noise in the initial shared entanglement (something bad). The mother protocol can also be viewed as a quantum generalization of the Slepian-Wolf distributed compression protocol discussed in §10.1.3. The mother protocol merges Alice's and Bob's shares of the purification of E by sending Alice's share to Bob, much as distributed source coding merges the classical correlations shared by Alice and Bob by sending Alice's classical information to Bob. For this reason the mother protocol has been called the fully quantum Slepian-Wolf protocol; the modifier "fully" will be clarified below, when we discuss a variant on quantum state transfer in which classical communication is assumed to be freely available.
We may also view the mother protocol as a generalization of the entanglement concentration protocol discussed in §10.4, extending that discussion in three ways:

1. The initial entangled state shared by Alice and Bob may be mixed rather than pure.

2. The communication from Alice to Bob is quantum rather than classical.

3. The amount of communication that suffices to execute the protocol is quantified by the resource inequality.
Also note that if the state of AE is pure (uncorrelated with B), then the mother protocol reduces to Schumacher compression. In that case ½ I(A; E) = H(A), and the mother resource inequality states that the purification of A^n can be transferred to Bob with high fidelity using nH(A) + o(n) qubits of quantum communication.
Before proving the mother resource inequality, we will first discuss a few of its interesting consequences.
Hashing inequality. Suppose Alice and Bob wish to distill entanglement from many copies of the state φ_{ABE}, using only local operations and classical communication (LOCC). In the catalytic setting, they can borrow some quantum communication, use the mother protocol to distill some shared entanglement, and then use classical communication and their harvested entanglement to repay their debt via quantum teleportation. Using the teleportation resource inequality

    TP : [qq] + 2[c→c] ≥ [q→q]    (10.308)

(n/2) I(A; E) times, and combining with the mother resource inequality, we obtain

    ⟨φ_{ABE}⟩ + I(A; E) [c→c] ≥ I_c(A⟩B) [qq] + ⟨φ_{B̃E}⟩,    (10.309)

since the net amount of distilled entanglement is ½ I(A; B) per copy of φ achieved by the mother minus the ½ I(A; E) per copy consumed by teleportation, and

    ½ I(A; B) − ½ I(A; E) = H(B) − H(E) = I_c(A⟩B).    (10.310)
Eq.(10.309) is the hashing inequality, which quantifies an achievable rate for distilling ebits of entanglement shared by Alice and Bob from many copies of a mixed state ρ_{AB}, using one-way classical communication, assuming that I_c(A⟩B) = −H(A|B) is positive. Furthermore, the hashing inequality tells us how much classical communication suffices for this purpose.

In the case where the state ρ_{AB} is pure, I_c(A⟩B) = H(B) − H(AB) = H(A) and there is no environment E; thus we recover our earlier conclusion about concentration of pure-state bipartite entanglement: H(A) Bell pairs can be extracted per copy, with a negligible classical communication cost.
State merging. Suppose Alice and Bob share the purification of Eve's state, and Alice wants to transfer her share of the purification to Bob, where now unlimited classical communication from Alice to Bob is available at no cost. In contrast to the mother protocol, Alice wants to achieve the transfer with as little one-way quantum communication as possible, even if she needs to send more bits in order to send fewer qubits.
In the catalytic setting, Alice and Bob can borrow some quantum communication, perform the mother protocol, then use teleportation and the entanglement extracted by the mother protocol to repay some of the borrowed quantum communication. Combining teleportation of (n/2) I(A; B) qubits with the mother resource inequality, we obtain

    ⟨φ_{ABE}⟩ + H(A|B) [q→q] + I(A; B) [c→c] ≥ ⟨φ_{B̃E}⟩,    (10.311)

using

    ½ I(A; E) − ½ I(A; B) = H(E) − H(B) = H(AB) − H(B) = H(A|B).    (10.312)

Eq.(10.311) is the state-merging inequality, expressing how much quantum and classical communication suffices to achieve the state transfer in an i.i.d. setting, assuming that H(A|B) is nonnegative.
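The identity eq.(10.312) is a purely entropic statement about tripartite pure states, and can be spot-checked on a random example (random seed chosen arbitrarily):

    import numpy as np

    rng = np.random.default_rng(7)
    psi = rng.normal(size=(2, 2, 2)) + 1j * rng.normal(size=(2, 2, 2))
    psi /= np.linalg.norm(psi)                        # random pure state of ABE

    def H(psi, keep):
        # entropy of the marginal on the kept subsystems (axes 0=A, 1=B, 2=E)
        axes = [i for i in range(3) if i not in keep]
        m = np.tensordot(psi, psi.conj(), axes=(axes, axes))
        d = int(round(np.sqrt(m.size)))
        lam = np.linalg.eigvalsh(m.reshape(d, d))
        lam = lam[lam > 1e-12]
        return float(-np.sum(lam * np.log2(lam)))

    HA, HB, HE = H(psi, (0,)), H(psi, (1,)), H(psi, (2,))
    HAB, HAE = H(psi, (0, 1)), H(psi, (0, 2))
    lhs = 0.5 * (HA + HE - HAE) - 0.5 * (HA + HB - HAB)   # (1/2)I(A;E) - (1/2)I(A;B)
    print(lhs, HAB - HB)                                  # both equal H(A|B)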
Like the mother protocol, this state merging protocol can be viewed as a (partially) quantum version of the Slepian-Wolf protocol for merging classical correlations. In the classical setting, H(X|Y) quantifies Bob's remaining ignorance about Alice's information X when Bob knows only Y; correspondingly, Alice can reveal X to Bob by sending H(X|Y) bits per letter of X. Similarly, state merging provides an operational meaning to the quantum conditional entropy H(A|B), as the number of qubits per copy of φ that Alice sends to Bob to convey her share of the purification of E, assuming classical communication is free. In this sense we may regard H(A|B) as a measure of Bob's remaining "ignorance" about the shared purification of E when he holds only B.
Classically, H(X|Y) is nonnegative, and zero if and only if Bob, who holds Y, is already certain about X; but quantumly H(A|B) can be negative. How can Bob have "negative uncertainty" about the quantum state of AB? If H(A|B) < 0, or equivalently if I(A; E) < I(A; B), then the mother protocol yields more quantum entanglement than the amount of quantum communication it consumes. Therefore, when H(A|B) is negative (i.e. I_c(A⟩B) is positive), the mother resource inequality implies the hashing inequality, asserting that classical communication from Alice to Bob not only achieves state transfer, but also distills −H(A|B) ebits of entanglement per copy of φ. These distilled ebits can be deposited in the entanglement bank, to be withdrawn as needed in future rounds of state merging, thus reducing the quantum communication cost of those future rounds. Bob's "negative uncertainty" today reduces the quantum communication cost of tasks to be performed tomorrow.
10.8.3 Operational meaning of strong subadditivity
The observation that H(A|B) is the quantum communication cost of state merging allows us to formulate a simple operational proof of the strong subadditivity of Von Neumann entropy, expressed in the form

    H(A|BC) ≤ H(A|B), or −H(A|B) ≤ −H(A|BC).    (10.313)

When H(A|B) is positive, eq.(10.313) is the obvious statement that it is no harder to merge Alice's system with Bob's if Bob holds C as well as B. When H(A|B) is negative, eq.(10.313) is the obvious statement that Alice and Bob can distill no less entanglement using one-way classical communication if Bob holds C as well as B.

To complete this argument, we need to know that H(A|B) is not only achievable but also that it is the optimal quantum communication cost of state merging, and that −H(A|B) ebits is the optimal yield of hashing. The optimality follows from the principle that, for a bipartite pure state, k qubits of quantum communication cannot increase the shared entanglement of AB by more than k ebits.
If H(A|B) is negative, consider cutting the system ABE into the two parts AE and B, as in the following figure:

[Figure: the tripartite state φ_ABE cut into AE and B; after the protocol, A₂ is paired with B₂ and E is paired with B₁.]
In the hashing protocol, applied to n copies of φ_{ABE}, the entanglement across this cut at the beginning of the protocol is nH(B). By the end of the protocol E^n has decoupled from A₂^(n) and has entanglement nH(E) with B₁^(n), ignoring o(n) corrections. If k ebits shared by Alice and Bob are distilled, the final entanglement across the AE-B cut is

    nH(E) + k ≤ nH(B) ⟹ (k/n) ≤ H(B) − H(E) = −H(A|B).    (10.314)

This inequality holds because LOCC cannot increase the entanglement across the cut, and implies that no more than −H(A|B) ebits of entanglement per copy of φ_{ABE} can be distilled in the hashing protocol, asymptotically.
On the other hand, if H(A|B) is positive, at the conclusion of state merging B₁^(n) is entangled with E^n, and the entanglement across the AE-B cut is at least nH(E). To achieve this increase in entanglement, the number of qubits sent from Alice to Bob must be at least

    k ≥ nH(E) − nH(B) ⟹ (k/n) ≥ H(E) − H(B) = H(A|B).    (10.315)

This inequality holds because the entanglement across the cut cannot increase by more than the quantum communication across the cut, and implies that at least H(A|B) qubits must be sent per copy of φ_{ABE} to achieve state merging.
To summarize, we have proven strong subadditivity, not by the traditional route of sophisticated matrix analysis, but via a less direct method. This proof is built on two cornerstones of quantum information theory, the decoupling principle and the theory of typical subspaces, which are essential ingredients in the proof of the mother resource inequality.
10.8.4 Negative conditional entropy in thermodynamics
As a further application of the mother resource inequality, we now revisit Landauer's Principle, developing another perspective on the implications of negative quantum conditional entropy. Recall that erasure of a bit is a process which maps the bit to 0 irrespective of its initial value. This process is irreversible: knowing only the final state 0 after erasure, we cannot determine whether the initial state before erasure was 0 or 1. Irreversibility implies that erasure incurs an unavoidable thermodynamic cost. According to Landauer's Principle, erasing a bit at temperature T requires work no less than W = kT ln 2.
A specific erasure procedure is analyzed in Exercise 10.14. Suppose a two-level quantum system has energy eigenstates |0⟩, |1⟩ with corresponding eigenvalues E₀ and E₁, where ΔE = E₁ − E₀ ≥ 0. Initially the qubit is in an unknown mixture of these two states, and the energy splitting is ΔE = 0. We erase the bit in three steps. In the first step, we bring the bit into contact with a heat bath at temperature T > 0, and wait for the bit to come to thermal equilibrium with the bath. In this step the bit "forgets" its initial value, but the bit is not yet erased because it has not been reset. In the second step, with the bit still in contact with the bath, we turn on a control field which slowly increases E₁ to a value much larger than kT while maintaining thermal equilibrium all the while, thus resetting the bit to |0⟩. In the third step, we isolate the bit from the bath and turn off the control field, so the two states of the bit become degenerate again. As shown in Exercise 10.14, work W = kT ln 2 is required to execute step 2, with the energy dissipated as heat flowing from bit to bath.
We can also run the last two steps backward, increasing E₁ while the bit is isolated from the bath, then decreasing E₁ with the bit in contact with the bath. This procedure maps the state |0⟩ to the maximally mixed state of the bit, extracting work W = kT ln 2 from the bath in the process.
Erasure is irreversible because the agent performing the erasure does not know the information being erased. (If a copy of the information were stored in her memory, survival of that copy would mean that the erasure had not succeeded.) From an information-theoretic perspective, the reduction in the thermodynamic entropy of the erased bit, and hence the work required to perform the erasure, arises because erasure reduces the agent's ignorance about the state of the bit, ignorance which is quantified by the Shannon entropy. But to be more precise, it is the conditional entropy of the system, given the state of the agent's memory, which captures the agent's ignorance before erasure and therefore also the thermodynamic cost of erasing. Thus the minimal work needed to erase system A should be expressed as

    W(A|O) = H(A|O) kT ln 2,    (10.316)

where O is the memory of the observer who performs the erasure, and H(A|O) quantifies that observer's ignorance about the state of A.
But what if A and O are quantum sy stems? We know that if A and
O are entangled, then the conditional entropy H(A|O) can be negative.
Does that mean we can erase A while extracting work rather than doing
work?
Yes, we can! Suppose for example that A and O are qubits and their initial state is maximally entangled. By controlling the contact between AO and the heat bath, the observer can extract work W = 2kT log 2 while transforming AO to a maximally mixed state, using the same work extraction protocol as described above. Then she can do work W = kT log 2 to return A to the state |0⟩. The net effect is to erase A while extracting work W = kT log 2, satisfying the equality eq.(10.316).
To appreciate why this trick works, we should consider the joint state of AO rather than the state of A alone. Although the marginal state of A is mixed at the beginning of the protocol and pure at the end, the state of AO is pure at the beginning and mixed at the end. Positive work is extracted by sacrificing the purity of AO.
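To make the bookkeeping concrete, here is a minimal sketch (Python with NumPy assumed) computing H(A|O) = H(AO) − H(O) for the maximally entangled pair just discussed; the answer −1 means eq.(10.316) assigns the erasure a negative work cost, −kT log 2.

    import numpy as np

    def vn_entropy(rho):
        # von Neumann entropy in bits
        w = np.linalg.eigvalsh(rho)
        w = w[w > 1e-12]
        return float(-(w * np.log2(w)).sum())

    phi = np.array([1, 0, 0, 1]) / np.sqrt(2)       # |phi+> on qubits A, O
    rho_AO = np.outer(phi, phi.conj())
    rho_O = rho_AO.reshape(2, 2, 2, 2).trace(axis1=0, axis2=2)  # trace out A
    print(vn_entropy(rho_AO) - vn_entropy(rho_O))   # H(A|O) = -1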
To generalize this idea, let's consider n ≫ 1 copies of the state ρ_AO of system A and memory O. Our goal is to map the n copies of A to the erased state |000…0⟩ while using or extracting the optimal amount of work. In fact, the optimal work per copy is given by eq.(10.316) in the n → ∞ limit.
To achieve this asymptotic work per copy, the observer first projects Aⁿ onto its typical subspace, succeeding with probability 1 − o(1). A unitary transformation then rotates the typical subspace to a subsystem Ā containing n(H(A) + o(1)) qubits, while erasing the complementary qubits as in eq.(10.144). Now it only remains to erase Ā.
The mother resource inequality ensures that we may decompose Ā into subsystems A₁A₂ such that A₂ contains (n/2)(I(A;O) − o(1)) qubits and is nearly maximally entangled with a subsystem of Oⁿ. What is important for the erasure protocol is that we may identify a subsystem of ĀOⁿ containing n(I(A;O) − o(1)) qubits which is only distance o(1) away from a pure state. By controlling the contact between this subsystem and the heat bath, we may extract work W = n(I(A;O) − o(1)) kT log 2 while transforming the subsystem to a maximally mixed state. We then proceed to erase Ā, expending work kT log |Ā| = n(H(A) + o(1)) kT log 2. The net work cost of the erasure, per copy of ρ_AO, is therefore

W = (H(A) − I(A;O) + o(1)) kT log 2 = (H(A|O) + o(1)) kT log 2,   (10.317)

and the erasure succeeds with probability 1 − o(1). A notable feature of the protocol is that only the subsystem of Oⁿ which is entangled with A₂ is affected. Any correlation of the memory O with other systems remains intact, and can be exploited in the future to reduce the cost of erasure of those other systems.
Like the state merging protocol, this erasure protocol provides an operational interpretation of strong subadditivity. For positive H(A|O), H(A|OO′) ≤ H(A|O) means that it is no harder to erase A if the observer has access to both O and O′ than if she has access to O alone. For negative H(A|O), −H(A|OO′) ≥ −H(A|O) means that we can extract at least as much work from AOO′ as from its subsystem AO.
To carry out this protocol and extract the optimal amount of work while erasing A, we need to know which subsystem of Oⁿ provides the purification of A₂. The decoupling argument ensures that this subsystem exists, but does not provide a constructive method for finding it, and therefore no concrete protocol for erasing at optimal cost. This quandary is characteristic of Shannon theory; for example, Shannon's noisy channel coding theorem ensures the existence of a code that achieves the channel capacity, but does not provide any explicit code construction.
10.9 The Decoupling Inequality
Achievable rates for quantum protocols are derived by using random codes, much as in classical Shannon theory. But this similarity between classical and quantum Shannon theory is superficial; at a deeper conceptual level, quantum protocols differ substantially from classical ones. Indeed, the decoupling principle underlies many of the key findings of quantum Shannon theory, providing a unifying theme that ties together many different results. In particular, the mother and father resource inequalities, and hence all their descendants enumerated above, follow from an inequality that specifies a sufficient condition for decoupling.
This decoupling inequality addresses the following question: Suppose that Alice and Eve share a quantum state σ_AE, where A is an n-qubit system. This state may be mixed, but in general A and E are correlated; that is, I(A;E) > 0. Now Alice starts discarding qubits one at a time, where each qubit is a randomly selected two-dimensional subsystem of what Alice holds. Each time Alice discards a qubit, her correlation with E grows weaker. How many qubits should she discard so that the subsystem she retains has a negligible correlation with Eve's system E?
To make the question precise, we need to formalize what it means to discard a random qubit. More generally, suppose that A has dimension |A|, and Alice decomposes A into subsystems A₁ and A₂, then discards A₁ and retains A₂. We would like to consider many possible ways of choosing the discarded system with specified dimension |A₁|. Equivalently, we may consider a fixed decomposition A = A₁A₂, where we apply a unitary transformation U to A before discarding A₁. Then discarding a random subsystem with dimension |A₁| is the same thing as applying a random unitary U before discarding the fixed subsystem A₁:
[Diagram: the state σ_AE; a random unitary U acts on A, whose output is divided into A₁ (discarded) and A₂ (retained), while E is untouched.]
To analyze the consequences of discarding a random subsystem, then, we will need to be able to compute the expectation value of a function f(U) when we average U uniformly over the group of unitary |A| × |A| matrices. We denote this expectation value as E_U[f(U)]; to perform computations we will only need to know that E_U is suitably normalized, and is invariant under left or right multiplication by any constant unitary matrix V:

E_U[I] = 1,   E_U[f(U)] = E_U[f(V U)] = E_U[f(U V)].   (10.318)

These conditions uniquely define E_U[f(U)], which is sometimes described as the integral over the unitary group using the invariant measure or Haar measure on the group.
If we apply the unitary transformation U to A, and then discard A₁, the marginal state of A₂E is

σ_{A₂E}(U) := tr_{A₁} [ (U_A ⊗ I_E) σ_AE (U_A† ⊗ I_E) ].   (10.319)
The decoupling inequality expresses how close (in the L¹ norm) σ_{A₂E} is to a product state when we average over U:

( E_U ‖σ_{A₂E}(U) − σ^max_{A₂} ⊗ σ_E‖₁ )² ≤ (|A₂|·|E| / |A₁|) tr σ²_AE,   (10.320)

where

σ^max_{A₂} := I / |A₂|   (10.321)

denotes the maximally mixed state on A₂, and σ_E is the marginal state tr_A σ_AE.
This inequality has interesting consequences even in the case where there is no system E at all and σ_A is pure, where it becomes

( E_U ‖σ_{A₂}(U) − σ^max_{A₂}‖₁ )² ≤ (|A₂| / |A₁|) tr σ²_A = |A₂| / |A₁|.   (10.322)

Eq.(10.322) implies that, for a randomly chosen pure state of the bipartite system A = A₁A₂, where |A₂|/|A₁| ≪ 1, the density operator on A₂ is very nearly maximally mixed with high probability. One can likewise show that the expectation value of the entanglement entropy of A₁A₂ is very close to the maximal value: E[H(A₂)] ≥ log₂|A₂| − |A₂| / (2|A₁| ln 2). Thus, if for example A₂ is 50 qubits and A₁ is 100 qubits, the typical entropy deviates from maximal by only about 2⁻⁵⁰ ≈ 10⁻¹⁵.
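These statements are easy to probe numerically. The following minimal sketch (Python with NumPy assumed; the dimensions are arbitrary choices) draws a Haar-random pure state on A₁A₂ and compares the L¹ distance of the marginal on A₂ from maximally mixed with the bound of eq.(10.322).

    import numpy as np

    rng = np.random.default_rng(0)
    d1, d2 = 2**7, 2**2                  # |A1| = 128, |A2| = 4
    # A normalized complex Gaussian vector is Haar-random; index psi[a1, a2].
    psi = rng.normal(size=(d1, d2)) + 1j * rng.normal(size=(d1, d2))
    psi /= np.linalg.norm(psi)
    rho2 = psi.T @ psi.conj()            # marginal state on A2
    delta = rho2 - np.eye(d2) / d2
    trace_dist = np.abs(np.linalg.eigvalsh(delta)).sum()   # L1 distance
    print(trace_dist, np.sqrt(d2 / d1))  # typically well inside the bound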
10.9.1 Proof of the decoupling inequality
To prove the decoupling inequality, we will first bound the distance between σ_{A₂E} and a product state in the L² norm, and then use the Cauchy-Schwarz inequality to obtain a bound on the L¹ distance. Eq.(10.320) follows from

E_U ‖σ_{A₂E}(U) − σ^max_{A₂} ⊗ σ_E‖₂² ≤ (1/|A₁|) tr σ²_AE,   (10.323)

combined with

(E[f])² ≤ E[f²]  and  ‖M‖₁² ≤ d ‖M‖₂²   (10.324)

(for nonnegative f, where d is the dimension of the matrix M), which implies

( E[‖·‖₁] )² ≤ E[ ‖·‖₁² ] ≤ |A₂|·|E| · E[ ‖·‖₂² ].   (10.325)
We also note that

‖σ_{A₂E} − σ^max_{A₂} ⊗ σ_E‖₂² = tr [ (σ_{A₂E} − σ^max_{A₂} ⊗ σ_E)² ] = tr σ²_{A₂E} − (1/|A₂|) tr σ²_E,   (10.326)

because

tr [ (σ^max_{A₂})² ] = 1/|A₂|;   (10.327)
therefore, to prove eq.(10.323) it suffices to show

E_U [ tr σ²_{A₂E}(U) ] ≤ (1/|A₂|) tr σ²_E + (1/|A₁|) tr σ²_AE.   (10.328)

We can facilitate the computation of E_U [ tr σ²_{A₂E}(U) ] using a clever trick. For any bipartite system BC, imagine introducing a second copy B′C′ of the system. Then (Exercise 10.7)
tr_C σ²_C = tr_{BCB′C′} [ (I_{BB′} ⊗ S_{CC′}) (σ_BC ⊗ σ_{B′C′}) ],   (10.329)

where S_{CC′} denotes the swap operator, which acts as

S_{CC′} : |i⟩_C |j⟩_{C′} ↦ |j⟩_C |i⟩_{C′}.   (10.330)
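The swap trick is easy to verify numerically; here is a minimal sketch (Python with NumPy assumed) checking eq.(10.329) for a random state on BC, contracting the swap directly with einsum.

    import numpy as np

    rng = np.random.default_rng(1)
    dB, dC = 2, 3
    X = rng.normal(size=(dB*dC, dB*dC)) + 1j * rng.normal(size=(dB*dC, dB*dC))
    sigma = X @ X.conj().T
    sigma /= np.trace(sigma)                         # random state on BC

    sigma_C = sigma.reshape(dB, dC, dB, dC).trace(axis1=0, axis2=2)
    lhs = np.trace(sigma_C @ sigma_C).real           # tr sigma_C^2

    # tr[(I_{BB'} (x) S_{CC'})(sigma (x) sigma)], written as a contraction:
    T = sigma.reshape(dB, dC, dB, dC)                # T[b, c, b~, c~]
    rhs = np.einsum('ixiy,jyjx->', T, T).real
    print(np.allclose(lhs, rhs))                     # True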
In particular, then,

tr_{A₂E} [ σ²_{A₂E}(U) ] = tr_{AEA′E′} [ (I_{A₁A₁′} ⊗ S_{A₂A₂′} ⊗ S_{EE′}) (σ_AE(U) ⊗ σ_{A′E′}(U)) ]
  = tr_{AEA′E′} [ (M_{AA′}(U) ⊗ S_{EE′}) (σ_AE ⊗ σ_{A′E′}) ],   (10.331)

where

M_{AA′}(U) = (U_A† ⊗ U_{A′}†) (I_{A₁A₁′} ⊗ S_{A₂A₂′}) (U_A ⊗ U_{A′}).   (10.332)

The expectation value of M_{AA′}(U) is evaluated in Exercise 10.7; there we find

E_U [ M_{AA′}(U) ] = c_I I_{AA′} + c_S S_{AA′},   (10.333)
where

c_I = (1/|A₂|) · (1 − 1/|A₁|²) / (1 − 1/|A|²) ≤ 1/|A₂|,
c_S = (1/|A₁|) · (1 − 1/|A₂|²) / (1 − 1/|A|²) ≤ 1/|A₁|.   (10.334)
Plugging into eq.(10.331), we then obtain

E_U [ tr_{A₂E} σ²_{A₂E}(U) ] ≤ tr_{AEA′E′} [ ( (1/|A₂|) I_{AA′} + (1/|A₁|) S_{AA′} ) ⊗ S_{EE′} (σ_AE ⊗ σ_{A′E′}) ]
  = (1/|A₂|) tr σ²_E + (1/|A₁|) tr σ²_AE,   (10.335)

thus proving eq.(10.328) as desired.
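A Monte-Carlo check of eq.(10.328) is straightforward; the following minimal sketch (Python with NumPy assumed; the dimensions and sample count are arbitrary choices) averages the purity of σ_{A₂E}(U) over sampled Haar-random unitaries and compares with the bound.

    import numpy as np

    rng = np.random.default_rng(2)

    def haar_unitary(d):
        # QR of a complex Gaussian matrix, with phases fixed, is Haar-distributed.
        z = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
        q, r = np.linalg.qr(z)
        return q * (np.diag(r) / np.abs(np.diag(r)))

    d1, d2, dE = 2, 2, 3
    dA = d1 * d2
    X = rng.normal(size=(dA*dE, dA*dE)) + 1j * rng.normal(size=(dA*dE, dA*dE))
    sigma = X @ X.conj().T
    sigma /= np.trace(sigma)                        # random sigma_AE

    def purity_A2E(U):
        W = np.kron(U, np.eye(dE))                  # U acts on A only
        s = W @ sigma @ W.conj().T
        T = s.reshape(d1, d2, dE, d1, d2, dE)
        s2 = np.einsum('abcade->bcde', T).reshape(d2*dE, d2*dE)  # trace out A1
        return np.trace(s2 @ s2).real

    avg = np.mean([purity_A2E(haar_unitary(dA)) for _ in range(500)])
    sigma_E = sigma.reshape(dA, dE, dA, dE).trace(axis1=0, axis2=2)
    bound = np.trace(sigma_E @ sigma_E).real / d2 + np.trace(sigma @ sigma).real / d1
    print(avg <= bound, avg, bound)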
10.9.2 Proof of the mother inequality
The mother inequality eq.(10.307) follows from the decoupling inequality eq.(10.320) in an i.i.d. setting. Suppose Alice, Bob, and Eve share the pure state φ_ABE^⊗n. Then there are jointly typical subspaces of Aⁿ, Bⁿ, and Eⁿ, which we denote by Ā, B̄, Ē, such that

|Ā| = 2^{nH(A)+o(n)},  |B̄| = 2^{nH(B)+o(n)},  |Ē| = 2^{nH(E)+o(n)}.   (10.336)

Furthermore, the normalized pure state φ_{ĀB̄Ē} obtained by projecting φ_ABE^⊗n onto ĀB̄Ē deviates from φ_ABE^⊗n by distance o(1) in the L¹ norm.
In order to transfer the purification of Eⁿ to Bob, Alice first projects Aⁿ onto its typical subspace, succeeding with probability 1 − o(1), and compresses the result. She then divides her compressed system Ā into two parts Ā₁Ā₂, and applies a random unitary to Ā before sending Ā₁ to Bob. Quantum state transfer is achieved if Ā₂ decouples from Ē.
Because φ_{ĀB̄Ē} is close to φ_ABE^⊗n, we can analyze whether the protocol is successful by supposing the initial state is φ_{ĀB̄Ē} rather than φ_ABE^⊗n.
According to the decoupling inequality,

( E_U ‖σ_{Ā₂Ē}(U) − σ^max_{Ā₂} ⊗ σ_Ē‖₁ )² ≤ (|Ā|·|Ē| / |Ā₁|²) tr σ²_{ĀĒ}
  = (1/|Ā₁|²) 2^{n(H(A)+H(E)+o(1))} tr σ²_{ĀĒ}
  = (1/|Ā₁|²) 2^{n(H(A)+H(E)−H(B)+o(1))};   (10.337)

here we have used properties of typical subspaces in the second line, as well as the property that σ_{ĀĒ} and σ_{B̄} have the same nonzero eigenvalues, because φ_{ĀB̄Ē} is pure.
Eq.(10.337) bounds the L¹ distance of σ_{Ā₂Ē}(U) from a product state when averaged over all unitaries, and therefore suffices to ensure the existence of at least one unitary transformation U such that the L¹ distance is bounded above by the right-hand side. Therefore, by choosing this U, Alice can decouple Ā₂ from Eⁿ to o(1) accuracy in the L¹ norm by sending to Bob

log₂ |Ā₁| = (n/2)(H(A) + H(E) − H(B) + o(1)) = (n/2)(I(A;E) + o(1))   (10.338)

qubits, randomly chosen from the (compressed) typical subspace of Aⁿ.
Alice retains nH(A) − (n/2) I(A;E) − o(n) qubits of her compressed system, which are nearly maximally mixed and uncorrelated with Eⁿ; hence at the end of the protocol she shares with Bob this many qubit pairs, which have high fidelity with a maximally entangled state. Since φ_ABE is pure, and therefore H(A) = ½ (I(A;E) + I(A;B)), we conclude that Alice and Bob distill (n/2) I(A;B) − o(n) ebits of entanglement, thus proving the mother resource inequality.
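The entropy identities used here are quick to verify for a concrete state. The following minimal sketch (Python with NumPy assumed) evaluates the mutual informations in the three-qubit GHZ state, the example taken up in Exercise 10.10, and confirms H(A) = ½(I(A;E) + I(A;B)) for a tripartite pure state.

    import numpy as np

    def vn(rho):
        w = np.linalg.eigvalsh(rho)
        w = w[w > 1e-12]
        return float(-(w * np.log2(w)).sum())

    ghz = np.zeros(8); ghz[0] = ghz[7] = 1 / np.sqrt(2)
    T = np.outer(ghz, ghz).reshape(2, 2, 2, 2, 2, 2)   # indices A,B,E,A',B',E'
    rho_A  = np.einsum('abcdbc->ad', T)                # trace out B, E
    rho_B  = np.einsum('abcadc->bd', T)                # trace out A, E
    rho_E  = np.einsum('abcabd->cd', T)                # trace out A, B
    rho_AB = np.einsum('abcdec->abde', T).reshape(4, 4)
    rho_AE = np.einsum('abcdbe->acde', T).reshape(4, 4)
    I_AB = vn(rho_A) + vn(rho_B) - vn(rho_AB)
    I_AE = vn(rho_A) + vn(rho_E) - vn(rho_AE)
    print(vn(rho_A), I_AB, I_AE)   # 1.0, 1.0, 1.0: H(A) = (I(A;B)+I(A;E))/2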
We can check that this conclusion is plausible using a crude counting argument. Disregarding the o(n) corrections in the exponent, the state φ_ABE^⊗n is nearly maximally mixed on a typical subspace of AⁿEⁿ with dimension 2^{nH(AE)}, i.e. the marginal state on ĀĒ can be realized as a nearly uniform ensemble of this many mutually orthogonal states. If Ā₁ is randomly chosen and sufficiently small, we expect that, for each state in this ensemble, Ā₁ is nearly maximally entangled with a subsystem of the much larger system Ā₂Ē, and that the marginal states on Ā₂Ē arising from different states in the ĀĒ ensemble have a small overlap. Therefore, we anticipate that tracing out Ā₁ yields a state on Ā₂Ē which is nearly maximally mixed on a subspace with dimension |Ā₁| 2^{nH(AE)}. Approximate decoupling occurs when this state attains full rank on Ā₂Ē, since in that case it is close to maximally mixed on Ā₂Ē and therefore close to a product state on its support. The state transfer succeeds, therefore, provided
|Ā₁| 2^{nH(AE)} ≳ |Ā₂|·|Ē| = |Ā|·|Ē| / |Ā₁| ≈ 2^{n(H(A)+H(E))} / |Ā₁|,  that is,  |Ā₁|² ≳ 2^{nI(A;E)},   (10.339)

as in eq.(10.338).
Our derivation of the mother resource inequality, based on random coding, does not exhibit any concrete protocol that achieves the claimed rate, nor does it guarantee the existence of any protocol in which the required quantum processing can be executed efficiently. Concerning the latter point, it is notable that our derivation of the decoupling inequality applies not just to the expectation value averaged uniformly over the unitary group, but also to any average over unitary transformations which satisfies eq.(10.333). In fact, this identity is satisfied by a uniform average over the Clifford group, which means that there is some Clifford transformation on Ā which achieves the rates specified in the mother resource inequality. Any Clifford transformation on n qubits can be reached by a circuit with O(n²) gates. Since it is also known that Schumacher compression can be achieved by a polynomial-time quantum computation, Alice's encoding operation can be carried out efficiently.
In fact, after compressing, Alice encodes the quantum information she sends to Bob using a stabilizer code (with Clifford encoder U), and Bob's task, after receiving Ā₁, is to correct the erasure of Ā₂. Bob can replace each erased qubit by the standard state |0⟩, and then measure the code's check operators. With high probability, there is a unique Pauli operator acting on the erased qubits that restores Bob's state to the code space, and the recovery operation can be efficiently computed using linear algebra. Hence, Bob's part of the mother protocol, like Alice's, can be executed efficiently.
10.9.3 Proof of the father inequality
One-shot version. In the one-shot version of the father protocol, Alice and Bob share a pair of maximally entangled systems A₁B₁, and in addition Alice holds an input state ρ_{A₂} of system A₂ which she wants to convey to Bob. Alice encodes ρ_{A₂} by applying a unitary transformation V to A = A₁A₂, then sends A to Bob via the noisy quantum channel N^{A→B₂}. Bob applies a decoding map D^{B₁B₂→Ã₂} jointly to the channel output and his half of the entangled state he shares with Alice, hoping to recover Alice's input state with high fidelity:
[Circuit diagram: Alice applies V to A = A₁A₂ and sends A through the channel N, producing B₂; Bob applies the decoder D to B₁B₂, producing the output Ã₂.]
We would like to know how much shared entanglement suffices for Alice and Bob to succeed.
This question can be answered using the decoupling inequality. First we introduce a reference system R′ which is maximally entangled with A₂; then Bob succeeds if his decoder can extract the purification of R′. Because the systems R′B₁ and A₂A₁ are maximally entangled, the encoding unitary V acting on A₁A₂ can be replaced by its transpose Vᵀ acting on R′B₁. We may also replace N by its Stinespring dilation U^{A₁A₂→B₂E}, so that the extended output state φ of R′B₁B₂E is pure:
[Circuit diagram: applying V to A₁A₂ before the dilation U is equivalent to applying Vᵀ to R′B₁; in either case the joint output state on R′B₁B₂E is pure.]
Finally we invoke the decoupling principle: if R′ and E decouple, then R′ is purified by a subsystem of B₁B₂, which means that Bob can recover ρ_{A₂} with a suitable decoding map.
If we consider V, and hence also Vᵀ, to be a random unitary, then we may describe the situation this way: We have a tripartite pure state φ_{RB₂E}, where R = R′B₁, and we would like to know whether the marginal state of R′E is close to a product state when the random subsystem B₁ is discarded from R. This is exactly the question addressed by the decoupling inequality, which in this case may be expressed as
( E_V ‖σ_{R′E}(V) − σ^max_{R′} ⊗ σ_E‖₁ )² ≤ (|R|·|E| / |B₁|²) tr σ²_RE.   (10.340)

Eq.(10.340) asserts that the L¹ distance from a product state is bounded above when averaged uniformly over all unitary V's; therefore there must be some particular encoding unitary V that satisfies the same bound. We conclude that near-perfect decoupling of R′E, and therefore high-fidelity decoding of B₂, is achievable provided that

|A₁| = |B₁| ≥ |R′|·|E| tr σ²_RE = |A₂|·|E| tr σ²_{B₂},   (10.341)

where to obtain the second equality we use the purity of φ_{RB₂E} and recall that the reference system R′ is maximally entangled with A₂.
i.i.d. version. In the i.i.d. version of the father protocol, Alice and Bob achieve high fidelity entanglement-assisted quantum communication through n uses of the quantum channel N^{A→B}. The code they use for this purpose can be described in the following way: Consider an input density operator ρ_A of system A, which is purified by a reference system R. Sending the purified input state ψ_RA through U^{A→BE}, the isometric dilation of N^{A→B}, generates the tripartite pure state φ_RBE. Evidently applying (U^{A→BE})^⊗n to ψ_RA^⊗n produces φ_RBE^⊗n.
But now suppose that before transmitting the state to Bob, Alice projects Aⁿ onto its typical subspace Ā, succeeding with probability 1 − o(1) in preparing a state of ĀR̄ that is nearly maximally entangled, where R̄ is the typical subspace of Rⁿ. Imagine dividing R̄ into a randomly chosen subsystem B₁ and its complementary subsystem R′; then there is a corresponding decomposition Ā = A₁A₂ such that A₁ is very nearly maximally entangled with B₁ and A₂ is very nearly maximally entangled with R′.
If we interpret B₁ as Bob's half of an entangled state of A₁B₁ shared with Alice, this becomes the setting where the one-shot father protocol applies, if we ignore the small deviation from maximal entanglement in A₁B₁ and R′A₂. As for our analysis of the i.i.d. mother protocol, we apply
the one-shot father inequality not to φ_RBE^⊗n, but rather to the nearby state φ_{R̄B̄Ē}, where B̄ and Ē are the typical subspaces of Bⁿ and Eⁿ respectively. Applying eq.(10.340), and using properties of typical subspaces, we can bound the square of the L¹ deviation of R′E from a product state, averaged over the choice of B₁, by

(|R̄|·|Ē| / |B₁|²) tr σ²_{B̄} = 2^{n(H(R)+H(E)−H(B)+o(1))} / |B₁|² = 2^{n(I(R;E)+o(1))} / |B₁|²;   (10.342)
hence the bound also applies for some particular way of choosing B₁. This choice defines the code used by Alice and Bob in a protocol which consumes

log₂ |B₁| = (n/2) I(R;E) + o(n)   (10.343)

ebits of entanglement, and conveys from Alice to Bob

nH(R) − (n/2) I(R;E) − o(n) = (n/2) I(R;B) − o(n)   (10.344)

high-fidelity qubits. This proves the father resource inequality.
10.9.4 Quantum channel capacity revisited
In §10.8.1 we showed that the coherent information is an achievable rate for quantum communication over a noisy quantum channel. That derivation, a corollary of the father resource inequality, applied to a catalytic setting, in which shared entanglement between sender and receiver can be borrowed and later repaid. It is useful to see that the same rate is achievable without catalysis, a result we can derive from an alternative version of the decoupling inequality.
This version applies to the setting depicted here:
[Circuit diagram: the purification ψ_RA; a random unitary V acts on R = R₁R₂ and R₁ is projected onto |0⟩, while A passes through the dilation U of the channel N, producing outputs B and E.]
A density operator ρ_A for system A, with purification ψ_RA, is transmitted through a channel N^{A→B} which has the isometric dilation U^{A→BE}. The reference system R has a decomposition into subsystems R₁R₂. We apply a random unitary transformation V to R, then project R₁ onto a fixed vector |0⟩_{R₁}, and renormalize the resulting state. In effect, then, we are projecting R onto a subspace with dimension |R₂|, which purifies a corresponding code subspace of A. This procedure prepares a normalized pure state φ_{R₂BE}, and a corresponding normalized marginal state σ_{R₂E} of R₂E.
If R₂ decouples from E, then R₂ is purified by a subsystem of B, which means that the code subspace of A can be recovered by a decoder applied to B. A sufficient condition for approximate decoupling can be derived from the inequality

( E_V ‖σ_{R₂E}(V) − σ^max_{R₂} ⊗ σ_E‖₁ )² ≤ |R₂|·|E| tr σ²_RE.   (10.345)
Eq.(10.345) resembles eq.(10.320) and can be derived by a similar method. Note that the right-hand side of eq.(10.345) is enhanced by a factor of |R₁| relative to the right-hand side of eq.(10.320). This factor arises because after projecting R₁ onto the fixed state |0⟩ we need to renormalize the state by multiplying by |R₁|, while on the other hand the projection suppresses the expected distance squared from a product state by a factor of |R₁|.
In the i.i.d. setting where the noisy channel is used n times, we consider φ_RBE^⊗n, and project onto the jointly typical subspaces R̄, B̄, Ē of Rⁿ, Bⁿ, Eⁿ respectively, succeeding with high probability. We choose a code by projecting R̄ onto a random subspace with dimension |R₂|. Then the right-hand side of eq.(10.345) becomes

|R₂| · 2^{n(H(E)−H(B)+o(1))},   (10.346)

and since the inequality holds when we average uniformly over V, it surely holds for some particular V. That unitary defines a code which achieves decoupling and has the rate

(1/n) log₂ |R₂| = H(B) − H(E) − o(1) = I_c(R⟩B) − o(1).   (10.347)
Hence the coherent information is an achievable rate for high-fidelity
quantum communication over the noisy channel.
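To make the rate concrete, the one-shot coherent information is easy to evaluate for a simple channel. The following minimal sketch (Python with NumPy assumed) computes it for the qubit amplitude damping channel of Exercise 10.11, using the standard fact for this channel that a diagonal input achieves the optimum.

    import numpy as np

    def H2(x):
        x = np.clip(x, 1e-15, 1 - 1e-15)
        return -x*np.log2(x) - (1 - x)*np.log2(1 - x)

    def Q1_amp_damp(p, qs=np.linspace(1e-6, 1 - 1e-6, 2001)):
        # Input diag(1-q, q): the output has eigenvalues {1 - q(1-p), q(1-p)}
        # and the environment {1 - qp, qp}, so
        # I_c = H(B) - H(E) = H2(q(1-p)) - H2(qp); maximize over q.
        return float(np.max(H2(qs*(1 - p)) - H2(qs*p)))

    for p in (0.1, 0.25, 0.5):
        print(p, Q1_amp_damp(p))   # positive for p < 1/2, and ~0 at p = 1/2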
10.9.5 Black holes as mirrors
As our final application of the decoupling inequality, we consider a highly idealized model of black hole dynamics. Suppose that Alice holds a k-qubit system A which she wants to conceal from Bob. To be safe, she discards her qubits by tossing them into a large black hole, where she knows Bob will not dare to follow. The black hole B is an (n−k)-qubit system, which grows to n qubits after merging with A, where n is much larger than k.
Black holes are not really completely black; they emit Hawking radiation. But qubits leak out of an evaporating black hole very slowly, at a rate per unit time which scales like n^{−1/2}. Correspondingly, it takes time Θ(n^{3/2}) for the black hole to radiate away a significant fraction of its qubits. Because the black hole Hilbert space is so enormous, this is a very long time, about 10⁶⁷ years for a solar mass black hole, for which n ≈ 10⁷⁸. Though Alice's qubits might not remain secret forever, she is content knowing that they will be safe from Bob for 10⁶⁷ years.
But in her haste, Alice fails to notice that her black hole is very, very
old. It has been evaporating for so long that it has already radiated away
more than half of its qubits. Let’s assume that the joint state of the black
hole and its emitted radiation is pure, and furthermore that the radiation
is a Haar-random subsystem of the full system.
Because the black hole B is so old, |B| is much smaller than the dimension of the radiation subsystem; therefore, as in eq.(10.322), we expect the state of B to be very nearly maximally mixed with high probability. We denote by R_B the subsystem of the emitted radiation which purifies B; thus the state of BR_B is very nearly maximally entangled. We assume that R_B has been collected by Bob and is under his control.
To keep track of what happens to Alice's k qubits, we suppose that her k-qubit system A is maximally entangled with a reference system R_A. After A enters the black hole, Bob waits for a while, until a k′-qubit system A′ is emitted in the black hole's Hawking radiation. After retrieving A′, Bob hopes to recover the purification of R_A by applying a suitable decoding map to A′R_B. Can he succeed?
We've learned that Bob can succeed with high fidelity if the remaining black hole system B′ decouples from Alice's reference system R_A. Let's suppose that the qubits emitted in the Hawking radiation are chosen randomly; that is, A′ is a Haar-random k′-qubit subsystem of the n-qubit system AB, as depicted here:
[Circuit diagram: Alice's system A, entangled with R_A, joins the black hole B; a unitary U scrambles AB; the subsystem A′ is radiated and collected by Bob, who also holds R_B, while B′ remains inside.]
The double lines indicate the very large systems B and B′, and single lines the smaller systems A and A′. Because the radiated qubits are random, we can determine whether B′ decouples from R_A using the decoupling inequality, which for this case becomes

E_U ‖σ_{B′R_A}(U) − σ^max_{B′} ⊗ σ_{R_A}‖₁ ≤ √( (|ABR_A| / |A′|²) tr σ²_{ABR_A} ).   (10.348)
Because the state of AR_A is pure, and B is maximally entangled with R_B, we have tr σ²_{ABR_A} = 1/|B|, and therefore the Haar-averaged L¹ distance of σ_{B′R_A} from a product state is bounded above by

√( |AR_A| / |A′|² ) = |A| / |A′|.   (10.349)

Thus, if Bob waits for only k′ = k + c qubits of Hawking radiation to be emitted after Alice tosses in her k qubits, Bob can decode her qubits with excellent fidelity F ≥ 1 − 2^{−c}.
Alice made a serious mistake. Rather than waiting for Ω(n) qubits to emerge from the black hole, Bob can already decode Alice's secret quite well when he has collected just a few more than k qubits. And Bob is an excellent physicist, who knows enough about black hole dynamics to infer the encoding unitary transformation U, information he uses to find the right decoding map.
We could describe the conclusion, more prosaically, by saying that the random unitary U applied to AB encodes a good quantum error-correcting code, which achieves high-fidelity entanglement-assisted transmission of quantum information through an erasure channel with a high erasure probability. Of the n input qubits, only k′ randomly selected qubits are received by Bob; the rest remain inside the black hole and hence are inaccessible. The input qubits, then, are erased with probability p = (n − k′)/n, while nearly error-free qubits are recovered from the input qubits at a rate

R = k/n = 1 − p − (k′ − k)/n;   (10.350)

in the limit n → ∞ with c = k′ − k fixed, this rate approaches 1 − p, the entanglement-assisted quantum capacity of the erasure channel.
So far, we've assumed that the emitted system A′ is a randomly selected subsystem of AB. That won't be true for a real black hole. However, it is believed that the internal dynamics of actual black holes mixes quantum information quite rapidly (the fast scrambling conjecture). For a black hole with temperature T, it takes time of order ℏ/kT for each qubit to be emitted in the Hawking radiation, and a time longer by only a factor of log n for the dynamics to mix the black hole degrees of freedom sufficiently for our decoupling estimate to hold with reasonable accuracy. For a solar mass black hole, Alice's qubits are revealed just a few milliseconds after she deposits them, much faster than the 10⁶⁷ years she had hoped for! Because Bob holds the system R_B which purifies B, and because he knows the right decoding map to apply to A′R_B, the black hole behaves like an information mirror: Alice's qubits bounce right back!
If Alice is more careful, she will dump her qubits into a young black hole instead. If we assume that the initial black hole B is in a pure state, then σ_{ABR_A} is also pure, and the Haar-averaged L¹ distance of σ_{B′R_A} from a product state is bounded above by

√( |ABR_A| / |A′|² ) = √( 2^{n+k} / 2^{2k′} ) = 2^{−c/2}   (10.351)

after

k′ = ½ (n + k + c)   (10.352)

qubits are emitted. In this case, Bob needs to wait a long time, until more than half of the qubits in AB are radiated away. Once Bob has acquired k + c more qubits than the number still residing in the black hole, he is empowered to decode Alice's k qubits with fidelity F ≥ 1 − 2^{−c/2}. In fact, there is nothing special about Alice's subsystem A; by adjusting his decoding map appropriately, Bob can decode any k qubits he chooses from among the n qubits in the initial black hole AB.
There is far more to learn about quantum information processing by black holes, an active topic of current research (as of this writing in 2016), but we will not delve further into this fascinating topic here. We can be confident, though, that the tools and concepts of quantum information theory discussed in this book will be helpful for addressing the many unresolved mysteries of quantum gravity.
10.10 Summary
Shannon entropy and classical data compression. The Shannon entropy of an ensemble X = {x, p(x)} is H(X) ≡ ⟨−log p(x)⟩; it quantifies the compressibility of classical information. A message n letters long, where each letter is drawn independently from X, can be compressed to H(X) bits per letter (and no further), yet can still be decoded with arbitrarily good accuracy as n → ∞.
Conditional entropy and information merging. The conditional entropy H(X|Y) = H(XY) − H(Y) quantifies how much the information source X can be compressed when Y is known. If n letters are drawn from XY, where Alice holds X and Bob holds Y, Alice can convey X to Bob by sending H(X|Y) bits per letter, asymptotically as n → ∞.
Mutual information and classical channel capacity. The mutual information I(X;Y) = H(X) + H(Y) − H(X,Y) quantifies how information sources X and Y are correlated; when we learn the value of y we acquire (on the average) I(X;Y) bits of information about x, and vice versa. The capacity of a memoryless noisy classical communication channel is C = max_X I(X;Y). This is the highest number of bits per letter that can be transmitted through n uses of the channel, using the best possible code, with negligible error probability as n → ∞.
Von Neumann entropy and quantum data compression. The Von Neumann entropy of a density operator ρ is

H(ρ) = −tr ρ log ρ;   (10.353)

it quantifies the compressibility of an ensemble of pure quantum states. A message n letters long, where each letter is drawn independently from the ensemble {|ϕ(x)⟩, p(x)}, can be compressed to H(ρ) qubits per letter (and no further) where ρ = Σ_x p(x)|ϕ(x)⟩⟨ϕ(x)|, yet can still be decoded with arbitrarily good fidelity as n → ∞.
Entanglement concentration and dilution. The entanglement E of a bipartite pure state |ψ⟩_AB is E = H(ρ_A) where ρ_A = tr_B(|ψ⟩⟨ψ|). With local operations and classical communication, we can prepare n copies of |ψ⟩_AB from nE Bell pairs (but not from fewer), and we can distill nE Bell pairs (but not more) from n copies of |ψ⟩_AB, asymptotically as n → ∞.
Accessible information. The Holevo chi of an ensemble E = {ρ(x), p(x)} of quantum states is

χ(E) = H( Σ_x p(x)ρ(x) ) − Σ_x p(x) H(ρ(x)).   (10.354)

The accessible information of an ensemble E of quantum states is the maximal number of bits of information that can be acquired about the preparation of the state (on the average) with the best possible measurement. The accessible information cannot exceed the Holevo chi of the ensemble. The product-state capacity of a quantum channel N is

C₁(N) = max_E χ(N(E)).   (10.355)

This is the highest number of classical bits per letter that can be transmitted through n uses of the quantum channel, with negligible error probability as n → ∞, assuming that each codeword is a product state.
Decoupling and quantum communication. In a tripartite pure state φ_RBE, we say that systems R and E decouple if the marginal density operator of RE is a product state, in which case R is purified by a subsystem of B. A quantum state transmitted through a noisy quantum channel N^{A→B} (with isometric dilation U^{A→BE}) can be accurately decoded if a reference system R which purifies the channel's input A nearly decouples from the channel's environment E.
Father and mother protocols. The father and mother resource inequalities specify achievable rates for entanglement-assisted quantum communication and quantum state transfer, respectively. Both follow from the decoupling inequality, which establishes a sufficient condition for approximate decoupling in a tripartite mixed state. By combining the father and mother protocols with superdense coding and teleportation, we can derive achievable rates for other protocols, including entanglement-assisted classical communication, quantum communication, entanglement distillation, and quantum state merging.
Homage to Ben Schumacher:
Ben.
He rocks.
I remember
When
He showed me how to fit
A qubit
In a small box.
I wonder how it feels
To be compressed.
And then to pass
A fidelity test.
Or does it feel
At all, and if it does
Would I squeal
Or be just as I was?
If not undone
I’d become as I’d begun
And write a memorandum
On being random.
Had it felt like a belt
Of rum?
And might it be predicted
That I’d become addicted,
Longing for my session
Of compression?
I’d crawl
To Ben again.
And call,
Put down your pen!
Don’t stall!
Make me small!
10.11 Bibliographical Notes
Cover and Thomas [2] is an excellent textbook on classical information
theory. Shannon’s original paper [3] is still very much worth reading.
Nielsen and Chuang [4] provide a clear introduction to some aspects
of quantum Shannon theory. Wilde [1] is a more up-to-date and very
thorough account.
Properties of entropy are reviewed in [5]. Strong subadditivity of Von Neumann entropy was proven by Lieb and Ruskai [6], and the condition for equality was derived by Hayden et al. [7]. The connection between separability and majorization was pointed out by Nielsen and Kempe [8]. Bekenstein's entropy bound was formulated in [9] and derived by Casini [10]. Entropic uncertainty relations are reviewed in [11], and I follow their derivation. The original derivation, by Maassen and Uffink [12], uses different methods.
Schumacher compression was first discussed in [13, 14], and Bennett et al. [15] devised protocols for entanglement concentration and dilution. Measures of mixed-state entanglement are reviewed in [16]. The reversible theory of mixed-state entanglement was formulated by Brandão and Plenio [17]. Squashed entanglement was introduced by Christandl and Winter [18], and its monogamy discussed by Koashi and Winter [19]. Brandão, Christandl, and Yard [20] showed that squashed entanglement is positive for any nonseparable bipartite state. Doherty, Parrilo, and Spedalieri [21] showed that every nonseparable bipartite state fails to be k-extendable for some finite k.
The Holevo bound was derived in [22]. Peres-Wootters coding was discussed in [23]. The product-state capacity formula was derived by Holevo [24] and by Schumacher and Westmoreland [25]. Hastings [26] showed that Holevo chi can be superadditive. Horodecki, Shor, and Ruskai [27] introduced entanglement-breaking channels, and additivity of Holevo chi for these channels was shown by Shor [28].
Necessary and sufficient conditions for quantum error correction were formulated in terms of the decoupling principle by Schumacher and Nielsen [29]; that (regularized) coherent information is an upper bound on quantum capacity was shown by Schumacher [30], Schumacher and Nielsen [29], and Barnum et al. [31]. That coherent information is an achievable rate for quantum communication was conjectured by Lloyd [32] and by Schumacher [30], then proven by Shor [33] and by Devetak [34]. Devetak and Winter [35] showed it is also an achievable rate for entanglement distillation. The quantum Fano inequality was derived by Schumacher [30].
Approximate decoupling was analyzed by Schumacher and Westmoreland [36], and used to prove capacity theorems by Devetak [34], by Horodecki et al. [37], by Hayden et al. [38], and by Abeyesinghe et al. [39]. The entropy of Haar-random subsystems had been discussed earlier, by Lubkin [40], Lloyd and Pagels [41], and Page [42]. Devetak, Harrow, and Winter [43, 44] introduced the mother and father protocols and their descendants. Devetak and Shor [45] introduced degradable quantum channels and proved that coherent information is additive for these channels. Bennett et al. [46, 47] found the single-letter formula for entanglement-assisted classical capacity. Superadditivity of coherent information was discovered by Shor and Smolin [48] and by DiVincenzo et al. [49]. Smith and Yard [50] found extreme examples of superadditivity, in which two zero-capacity channels have nonzero capacity when used jointly. The achievable rate for state merging was derived by Horodecki et al. [37], and used by them to prove strong subadditivity of Von Neumann entropy.
Decoupling was applied to Landauer's principle by Renner et al. [51], and to black holes by Hayden and Preskill [52]. The fast scrambling conjecture was proposed by Sekino and Susskind [53].
10.12 Exercises
10.1 Positivity of quantum relative entropy
a) Show that ln x ≤ x − 1 for all positive real x, with equality iff x = 1.
b) The (classical) relative entropy of a probability distribution {p(x)} relative to {q(x)} is defined as

D(p ‖ q) ≡ Σ_x p(x) (log p(x) − log q(x)).   (10.356)

Show that

D(p ‖ q) ≥ 0,   (10.357)

with equality iff the probability distributions are identical. Hint: Apply the inequality from (a) to ln(q(x)/p(x)).
c) The quantum relative entropy of the density operator ρ with respect to σ is defined as

D(ρ ‖ σ) = tr ρ (log ρ − log σ).   (10.358)

Let {p_i} denote the eigenvalues of ρ and {q_a} denote the eigenvalues of σ. Show that

D(ρ ‖ σ) = Σ_i p_i ( log p_i − Σ_a D_{ia} log q_a ),   (10.359)

where D_{ia} is a doubly stochastic matrix. Express D_{ia} in terms of the eigenstates of ρ and σ. (A matrix is doubly stochastic if its entries are nonnegative real numbers, where each row and each column sums to one.)
d) Show that if D_{ia} is doubly stochastic, then (for each i)

log ( Σ_a D_{ia} q_a ) ≥ Σ_a D_{ia} log q_a,   (10.360)

with equality only if D_{ia} = 1 for some a.
e) Show that

D(ρ ‖ σ) ≥ D(p ‖ r),   (10.361)

where r_i = Σ_a D_{ia} q_a.
f) Show that D(ρ ‖ σ) ≥ 0, with equality iff ρ = σ.
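A quick numerical sanity check of part (f) is below (Python with NumPy and SciPy assumed); random full-rank density operators give a strictly positive relative entropy.

    import numpy as np
    from scipy.linalg import logm

    rng = np.random.default_rng(3)

    def rand_rho(d):
        X = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
        r = X @ X.conj().T          # positive and full rank with probability 1
        return r / np.trace(r)

    d = 4
    rho, sigma = rand_rho(d), rand_rho(d)
    D = np.trace(rho @ (logm(rho) - logm(sigma))).real / np.log(2)
    print(D)   # nonnegative; zero only if rho == sigma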
10.2 Properties of Von Neumann entropy
a) Use nonnegativity of quantum relative entropy to prove the subadditivity of Von Neumann entropy,

H(ρ_AB) ≤ H(ρ_A) + H(ρ_B),   (10.362)

with equality iff ρ_AB = ρ_A ⊗ ρ_B. Hint: Consider the relative entropy of ρ_AB and ρ_A ⊗ ρ_B.
b) Use subadditivity to prove the concavity of the Von Neumann entropy:

H( Σ_x p_x ρ_x ) ≥ Σ_x p_x H(ρ_x).   (10.363)

Hint: Consider

ρ_AB = Σ_x p_x (ρ_x)_A ⊗ (|x⟩⟨x|)_B,   (10.364)

where the states {|x⟩_B} are mutually orthogonal.
c) Use the condition

H(ρ_AB) = H(ρ_A) + H(ρ_B) iff ρ_AB = ρ_A ⊗ ρ_B   (10.365)

to show that, if all p_x's are nonzero,

H( Σ_x p_x ρ_x ) = Σ_x p_x H(ρ_x)   (10.366)

iff all the ρ_x's are identical.
10.3 Monotonicity of qua ntum relative entropy
Quantum relative entropy has a property called monotonicity:
D(ρ
A
kσ
A
) D(ρ
AB
kσ
AB
); (10.367)
The relative entropy of two density operators on a system AB can-
not be less than the induced relative entropy on the subsystem A.
a) Use monotonicity of quantum relative entropy to prove the
strong subadditivity property of Von Neumann entropy. Hint:
On a tripartite system ABC, consider the relative entropy of
ρ
ABC
and ρ
A
ρ
BC
.
b) Use monotonicity of quantum relative entropy to sh ow that the
action of a quantum channel N cannot increase r elative en-
tropy:
D(N(ρ)kN(σ) D(ρkσ), (10.368)
Hint: Recall that any quantum channel has an isometric dila-
tion.
10.4 The Peres–Wootters POVM.
Consider the Peres–Wootters information source described in §10.6.4 of the lecture notes. It prepares one of the three states

|Φ_a⟩ = |ϕ_a⟩ ⊗ |ϕ_a⟩,  a = 1, 2, 3,   (10.369)

each occurring with a priori probability 1/3, where the |ϕ_a⟩'s are defined in eq.(10.214).
a) Express the density matrix

ρ = (1/3) Σ_a |Φ_a⟩⟨Φ_a|,   (10.370)

in terms of the Bell basis of maximally entangled states {|φ^±⟩, |ψ^±⟩}, and compute H(ρ).
b) For the three vectors |Φ_a⟩, a = 1, 2, 3, construct the "pretty good measurement" defined in eq.(10.227). (Again, expand the |Φ_a⟩'s in the Bell basis.) In this case, the PGM is an orthogonal measurement. Express the elements of the PGM basis in terms of the Bell basis.
c) Compute the mutual information of the PGM outcome and the preparation.
10.5 Separability and majorization
The hallmark of entanglement is that in an entangled state the whole is less random than its parts. But in a separable state the correlations are essentially classical, and so are expected to adhere to the classical principle that the parts are less disordered than the whole. The objective of this problem is to make this expectation precise by showing that if the bipartite (mixed) state ρ_AB is separable, then

λ(ρ_AB) ≺ λ(ρ_A),  λ(ρ_AB) ≺ λ(ρ_B).   (10.371)

Here λ(ρ) denotes the vector of eigenvalues of ρ, and ≺ denotes majorization.
A separable state can be realized as an ensemble of pure product states, so that if ρ_AB is separable, it may be expressed as

ρ_AB = Σ_a p_a |ψ_a⟩⟨ψ_a| ⊗ |ϕ_a⟩⟨ϕ_a|.   (10.372)

We can also diagonalize ρ_AB, expressing it as

ρ_AB = Σ_j r_j |e_j⟩⟨e_j|,   (10.373)

where {|e_j⟩} denotes an orthonormal basis for AB; then by the HJW theorem, there is a unitary matrix V such that

√r_j |e_j⟩ = Σ_a V_{ja} √p_a |ψ_a⟩ ⊗ |ϕ_a⟩.   (10.374)

Also note that ρ_A can be diagonalized, so that

ρ_A = Σ_a p_a |ψ_a⟩⟨ψ_a| = Σ_µ s_µ |f_µ⟩⟨f_µ|;   (10.375)

here {|f_µ⟩} denotes an orthonormal basis for A, and by the HJW theorem, there is a unitary matrix U such that

√p_a |ψ_a⟩ = Σ_µ U_{aµ} √s_µ |f_µ⟩.   (10.376)
Now show that there is a doubly stochastic matrix D such that

r_j = Σ_µ D_{jµ} s_µ.   (10.377)

That is, you must check that the entries of D_{jµ} are real and nonnegative, and that Σ_j D_{jµ} = 1 = Σ_µ D_{jµ}. Thus we conclude that λ(ρ_AB) ≺ λ(ρ_A). Just by interchanging A and B, the same argument also shows that λ(ρ_AB) ≺ λ(ρ_B).
Remark: Note that it follows from the Schur concavity of Shannon entropy that, if ρ_AB is separable, then the von Neumann entropy has the properties H(AB) ≥ H(A) and H(AB) ≥ H(B). Thus, for separable states, conditional entropy is nonnegative: H(A|B) = H(AB) − H(B) ≥ 0 and H(B|A) = H(AB) − H(A) ≥ 0. In contrast, if H(A|B) is negative, then according to the hashing inequality the state of AB has positive distillable entanglement ≥ −H(A|B), and therefore is surely not separable.
10.6 Additivity of squashed entanglement
Suppose that Alice holds systems A, A′ and Bob holds systems B, B′. How is the entanglement of AA′ with BB′ related to the entanglement of A with B and of A′ with B′? In this problem we will show that the squashed entanglement is superadditive,

E_sq(ρ_{ABA′B′}) ≥ E_sq(ρ_AB) + E_sq(ρ_{A′B′})   (10.378)

and is strictly additive for a tensor product,

E_sq(ρ_AB ⊗ ρ_{A′B′}) = E_sq(ρ_AB) + E_sq(ρ_{A′B′}).   (10.379)

a) Use the chain rule for mutual information, eq.(10.196) and eq.(10.197), and the nonnegativity of quantum conditional mutual information to show that

I(AA′; BB′|C) ≥ I(A; B|C) + I(A′; B′|AC),   (10.380)

and show that eq.(10.378) follows.
b) Show that for any extension ρ_ABC ⊗ ρ_{A′B′C′} of the product state ρ_AB ⊗ ρ_{A′B′}, we have

I(AA′; BB′|CC′) = I(A; B|C) + I(A′; B′|C′).   (10.381)

Conclude that

E_sq(ρ_AB ⊗ ρ_{A′B′}) ≤ E_sq(ρ_AB) + E_sq(ρ_{A′B′}),   (10.382)

which, when combined with eq.(10.378), implies eq.(10.379).
10.7 Proof of the decoupling inequality
In this problem we complete the derivation of the decoupling inequality sketched in §10.9.1.
a) Verify eq.(10.329).
To derive the expression for E_U[M_{AA′}(U)] in eq.(10.333), we first note that the invariance property eq.(10.318) implies that E_U[M_{AA′}(U)] commutes with V ⊗ V for any unitary V. Therefore, by Schur's lemma, E_U[M_{AA′}(U)] is a weighted sum of projections onto irreducible representations of the unitary group. The tensor product of two fundamental representations of U(d) contains two irreducible representations, the symmetric and antisymmetric tensor representations. Therefore we may write

E_U[M_{AA′}(U)] = c_sym Π^{(sym)}_{AA′} + c_anti Π^{(anti)}_{AA′};   (10.383)

here Π^{(sym)}_{AA′} is the orthogonal projector onto the subspace of AA′ symmetric under the interchange of A and A′, Π^{(anti)}_{AA′} is the projector onto the antisymmetric subspace, and c_sym, c_anti are suitable constants. Note that

Π^{(sym)}_{AA′} = ½ (I_{AA′} + S_{AA′}),  Π^{(anti)}_{AA′} = ½ (I_{AA′} − S_{AA′}),   (10.384)

where S_{AA′} is the swap operator, and that the symmetric and antisymmetric subspaces have dimension ½|A|(|A| + 1) and dimension ½|A|(|A| − 1) respectively.
Even if you are not familiar with group representation theory, you might regard eq.(10.383) as obvious. We may write M_{AA′}(U) as a sum of two terms, one symmetric and the other antisymmetric under the interchange of A and A′. The expectation of the symmetric part must be symmetric, and the expectation value of the antisymmetric part must be antisymmetric. Furthermore, averaging over the unitary group ensures that no symmetric state is preferred over any other.
b) To evaluate the constant c_sym, multiply both sides of eq.(10.383) by Π^{(sym)}_{AA′} and take the trace of both sides, thus finding

c_sym = (|A₁| + |A₂|) / (|A| + 1).   (10.385)

c) To evaluate the constant c_anti, multiply both sides of eq.(10.383) by Π^{(anti)}_{AA′} and take the trace of both sides, thus finding

c_anti = (|A₁| − |A₂|) / (|A| − 1).   (10.386)

d) Using

c_I = ½ (c_sym + c_anti),  c_S = ½ (c_sym − c_anti)   (10.387)

prove eq.(10.334).
10.8 Fano’s inequality
Suppose X = {x, p(x)} is a probability distribution for a letter x drawn from an alphabet of d possible letters, and that XY is the joint distribution for x and another random variable y which is correlated with x. Upon receiving y we estimate the value of x by evaluating a function x̂(y). We may anticipate that if our estimate is usually correct, then the conditional entropy H(X|Y) must be small. In this problem we will confirm that expectation.
Let e ∈ {0, 1} denote a binary random variable which takes the value e = 0 if x = x̂(y) and takes the value e = 1 if x ≠ x̂(y), and let XYE denote the joint distribution for x, y, e. The error probability P_e is the probability that e = 1, averaged over this distribution. Our goal is to derive an upper bound on H(X|Y) depending on P_e.
a) Show that

H(X|Y) = H(X|Y E) + H(E|Y) − H(E|XY).   (10.388)

Note that H(E|XY) = 0 because e is determined when x and y are known, and that H(E|Y) ≤ H(E) because mutual information is nonnegative. Therefore,

H(X|Y) ≤ H(X|Y E) + H(E).   (10.389)

b) Noting that

H(X|Y E) = p(e = 0) H(X|Y, e = 0) + p(e = 1) H(X|Y, e = 1),   (10.390)

and that H(X|Y, e = 0) = 0 (because x = x̂(y) is determined by y when there is no error), show that

H(X|Y E) ≤ P_e log₂(d − 1).   (10.391)
c) Finally, show that

H(X|Y) ≤ H₂(P_e) + P_e log₂(d − 1),   (10.392)

which is Fano's inequality.
d) Use Fano's inequality to derive eq.(10.50), hence completing the proof that the classical channel capacity C is an upper bound on achievable rates for communication over a noisy channel with negligible error probability.
10.9 A quantum version of Fano’s inequality
a) In a d-dimensional system, suppose a density operator ρ approximates the pure state |ψ⟩ with fidelity

F = ⟨ψ|ρ|ψ⟩ = 1 − ε.   (10.393)

Show that

H(ρ) ≤ H₂(ε) + ε log₂(d − 1).   (10.394)

Hint: Recall that if a complete orthogonal measurement performed on the state ρ has distribution of outcomes X, then H(ρ) ≤ H(X), where H(X) is the Shannon entropy of X.
b) As in §10.7.2, suppose that the noisy channel N^{A→B} acts on the pure state ψ_RA, and is followed by the decoding map D^{B→C}. Show that

H(R)_ρ − I_c(R⟩B)_ρ ≤ 2H(RC)_σ,   (10.395)

where

ρ_RB = N(ψ_RA),  σ_RC = D ∘ N(ψ_RA).   (10.396)

Therefore, if the decoder's output (the state of RC) is almost pure, then the coherent information of the channel N comes close to matching its input entropy. Hint: Use the data processing inequality I_c(R⟩C)_σ ≤ I_c(R⟩B)_ρ and the subadditivity of von Neumann entropy. It is convenient to consider the joint pure state of the reference system, the output, and environments of the dilations of N and D.
c) Suppose that the decoding map recovers the channel input with high fidelity,

F(D ∘ N(ψ_RA), ψ_RC) = 1 − ε.   (10.397)
Show that

H(R)_ρ − I_c(R⟩B)_ρ ≤ 2H₂(ε) + 2ε log₂(d² − 1),   (10.398)

assuming that R and C are d-dimensional. This is a quantum version of Fano's inequality, which we may use to derive an upper bound on the quantum channel capacity of N.
10.10 Mother protocol for the GHZ state
The mother resource inequality expresses an asymptotic resource conversion that can be achieved if Alice, Bob, and Eve share n copies of the pure state φ_ABE: by sending (n/2) I(A;E) qubits to Bob, Alice can destroy the correlations of her state with Eve's state, so that Bob alone holds the purification of Eve's state, and furthermore Alice and Bob share (n/2) I(A;B) ebits of entanglement at the end of the protocol; here I(A;E) and I(A;B) denote quantum mutual informations evaluated in the state φ_ABE.
Normally, the resource conversion can be realized with arbitrarily good fidelity only in the limit n → ∞. But in this problem we will see that the conversion can be perfect if Alice, Bob and Eve share only n = 2 copies of the three-qubit GHZ state

|φ⟩_ABE = (1/√2)(|000⟩ + |111⟩).   (10.399)

The protocol achieving this perfect conversion uses the notion of coherent classical communication defined in Chapter 4.
a) Show that in the GHZ state |φ⟩_ABE, I(A;E) = I(A;B) = 1. Thus, for this state, the mother inequality becomes

2⟨φ_ABE⟩ + [q → q]_{A→B} ≥ [qq]_{AB} + 2⟨φ_{B̃E}⟩.   (10.400)
b) Suppose that in the GHZ state Alice measures the Pauli operator X, gets the outcome +1, and broadcasts her outcome to Bob and Eve. What state do Bob and Eve then share? What if Alice gets the outcome −1 instead?
c) Suppose that Alice, Bob, and Eve share just one copy of the GHZ state φ_ABE. Find a protocol such that, after one unit of coherent classical communication from Alice to Bob, the shared state becomes |φ⁺⟩_AB ⊗ |φ⁺⟩_BE, where |φ⁺⟩ = (|00⟩ + |11⟩)/√2 is a maximally entangled Bell pair.
d) Now suppose that Alice, Bob, and Eve start out with two copies of the GHZ state, and suppose that Alice and Bob can borrow an ebit of entanglement, which will be repaid later, to catalyze the resource conversion. Use coherent superdense coding to construct a protocol that achieves the (catalytic) conversion eq.(10.400) perfectly.
10.11 Degradability of amplitude damping and erasure
The qubit amplitude damping channel N^{A→B}_{a.d.}(p) discussed in §3.4.3 has the dilation U^{A→BE} such that

U : |0⟩_A ↦ |0⟩_B ⊗ |0⟩_E,
    |1⟩_A ↦ √(1−p) |1⟩_B ⊗ |0⟩_E + √p |0⟩_B ⊗ |1⟩_E;

a qubit in its "ground state" |0⟩_A is unaffected by the channel, while a qubit in the "excited state" |1⟩_A decays to the ground state with probability p, and the decay process excites the environment. Note that U is invariant under interchange of systems B and E accompanied by the transformation p ↔ (1 − p). Thus the channel complementary to N^{A→B}_{a.d.}(p) is N^{A→E}_{a.d.}(1 − p).
a) Show that N^{A→B}_{a.d.}(p) is degradable for p ≤ 1/2. Therefore, the quantum capacity of the amplitude damping channel is its optimized one-shot coherent information. Hint: It suffices to show that

N^{A→E}_{a.d.}(1 − p) = N^{B→E}_{a.d.}(q) ∘ N^{A→B}_{a.d.}(p),   (10.401)

where 0 ≤ q ≤ 1.
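The composition law behind this hint is easy to test numerically. The following minimal sketch (Python with NumPy assumed) checks that damping by p followed by damping by q = (1 − 2p)/(1 − p), a valid channel for p ≤ 1/2, reproduces damping by 1 − p, which is the complementary channel.

    import numpy as np

    def ad_kraus(p):
        # Standard Kraus operators for qubit amplitude damping.
        K0 = np.array([[1, 0], [0, np.sqrt(1 - p)]])
        K1 = np.array([[0, np.sqrt(p)], [0, 0]])
        return [K0, K1]

    def apply(kraus, rho):
        return sum(K @ rho @ K.conj().T for K in kraus)

    p = 0.3
    q = (1 - 2*p) / (1 - p)           # dampings compose: 1-(1-p)(1-q) = 1-p
    rho = np.array([[0.4, 0.2 - 0.1j], [0.2 + 0.1j, 0.6]])
    lhs = apply(ad_kraus(q), apply(ad_kraus(p), rho))
    rhs = apply(ad_kraus(1 - p), rho)
    print(np.allclose(lhs, rhs))      # True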
The erasure channel N^{A→B}_{erase}(p) has the dilation U^{A→BE} such that

U : |ψ⟩_A ↦ √(1−p) |ψ⟩_B ⊗ |e⟩_E + √p |e⟩_B ⊗ |ψ⟩_E;   (10.402)

Alice's system passes either to Bob (with probability 1 − p) or to Eve (with probability p), while the other party receives the "erasure symbol" |e⟩, which is orthogonal to Alice's Hilbert space. Because U is invariant under interchange of systems B and E accompanied by the transformation p ↔ (1 − p), the channel complementary to N^{A→B}_{erase}(p) is N^{A→E}_{erase}(1 − p).
b) Show that N^{A→B}_{erase}(p) is degradable for p ≤ 1/2. Therefore, the quantum capacity of the erasure channel is its optimized one-shot coherent information. Hint: It suffices to show that

N^{A→E}_{erase}(1 − p) = N^{B→E}_{erase}(q) ∘ N^{A→B}_{erase}(p),   (10.403)

where 0 ≤ q ≤ 1.
c) Show that for p ≤ 1/2 the quantum capacity of the erasure channel is

Q(N^{A→B}_{erase}(p)) = (1 − 2p) log₂ d,   (10.404)

where A is d-dimensional, and that the capacity vanishes for 1/2 ≤ p ≤ 1.
10.12 Capacities of the depolarizing channel
Consider the depolarizing channel N_{depol.}(p), which acts on a pure state |ψ⟩ of a single qubit according to

N_{depol.}(p) : |ψ⟩⟨ψ| ↦ (1 − (4/3)p) |ψ⟩⟨ψ| + (4/3)p · ½ I.   (10.405)

For this channel, compute the product-state classical capacity C₁(p), the entanglement-assisted classical capacity C_E(p), and the one-shot quantum capacity Q₁(p). Plot the results as a function of p. For what value of p does Q₁ hit zero?
The depolarizing channel is not degradable, and in fact the quantum capacity Q(p) is larger than Q₁(p) when the channel is sufficiently noisy. The function Q(p) is still unknown.
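A numerical sketch of these quantities follows (Python with NumPy assumed). It assumes the standard optimal inputs, an equiprobable orthogonal pure-state ensemble for C₁ and a maximally entangled input for C_E and Q₁ (so the joint output is a Werner state); treat the closed forms as an assumption to be verified when working the exercise.

    import numpy as np

    def H2(x):
        x = np.clip(x, 1e-15, 1 - 1e-15)
        return -x*np.log2(x) - (1 - x)*np.log2(1 - x)

    # Total Pauli-error probability p, split equally among X, Y, Z:
    def C1(p):  return 1 - H2(2*p/3)              # chi of orthogonal ensemble
    def CE(p):  return 2 - H2(p) - p*np.log2(3)   # 2 - H(Werner state)
    def Q1(p):  return 1 - H2(p) - p*np.log2(3)   # maximal coherent information

    ps = np.linspace(1e-6, 0.5, 200001)
    p0 = np.interp(0.0, -Q1(ps), ps)              # zero crossing of Q1
    print(C1(0.1), CE(0.1), Q1(0.1))
    print(p0)                                     # Q1 vanishes near p ~ 0.19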
10.13 Noisy superdense coding and teleportation.
a) By converting the entanglement achieved by the mother protocol into classical communication, prove the noisy superdense coding resource inequality:

Noisy SD : ⟨φ_ABE⟩ + H(A)[q → q] ≥ I(A;B)[c → c].   (10.406)

Verify that this matches the standard noiseless superdense coding resource inequality when φ is a maximally entangled state of AB.
b) By converting the entanglement achieved by the mother protocol into quantum communication, prove the noisy teleportation resource inequality:

Noisy TP : ⟨φ_ABE⟩ + I(A;B)[c → c] ≥ I_c(A⟩B)[q → q].   (10.407)

Verify that this matches the standard noiseless teleportation resource inequality when φ is a maximally entangled state of AB.
10.14 The cost of erasure
Erasure of a bit is a process in which the state of the bit is reset to 0. Erasure is irreversible: knowing only the final state 0 after erasure, we cannot determine whether the initial state before erasure was 0 or 1. This irreversibility implies that erasure incurs an unavoidable thermodynamic cost. According to Landauer's Principle, erasing a bit at temperature T requires work W ≥ kT log 2. In this problem you will verify that a particular procedure for achieving erasure adheres to Landauer's Principle.
Suppose that the two states of the bit both have zero energy. We erase the bit in two steps. In the first step, we bring the bit into contact with a reservoir at temperature T > 0, and wait for the bit to come to thermal equilibrium with the reservoir. In this step the bit "forgets" its initial value, but the bit is not yet erased because it has not been reset.
We reset the bit in the second step, by slowly turning on a control field λ which splits the degeneracy of the two states. For λ ≥ 0, the state 0 has energy E₀ = 0 and the state 1 has energy E₁ = λ. After the bit thermalizes in step one, the value of λ increases gradually from the initial value λ = 0 to the final value λ = ∞; the increase in λ is slow enough that the qubit remains in thermal equilibrium with the reservoir at all times. As λ increases, the probability P(0) that the qubit is in the state 0 approaches unity, i.e., the bit is reset to the state 0, which has zero energy.
(a) For λ ≠ 0, find the probability P(0) that the qubit is in the state 0 and the probability P(1) that the qubit is in the state 1.
(b) How much work is required to increase the control field from λ to λ + dλ?
(c) How much work is expended as λ increases slowly from λ = 0 to λ = ∞? (You will have to evaluate an integral, which can be done analytically.)
10.15 The first law of Von Neumann entropy
Writing the density operator in terms of its modular Hamiltonian K as in §10.2.6,

ρ = e^{−K} / tr(e^{−K}),   (10.408)

consider how the entropy S(ρ) = −tr(ρ ln ρ) changes when the density operator is perturbed slightly:

ρ → ρ′ = ρ + δρ.   (10.409)

Since ρ and ρ′ are both normalized density operators, we have tr(δρ) = 0. Show that

S(ρ′) − S(ρ) = tr(ρ′K) − tr(ρK) + O((δρ)²);   (10.410)

that is,

δS = δ⟨K⟩   (10.411)

to first order in the small change in ρ. This statement generalizes the first law of thermodynamics; for the case of a thermal density operator with K = H/T (where H is the Hamiltonian and T is the temperature), it becomes the more familiar statement

δE = δ⟨H⟩ = T δS.   (10.412)
References
[1] M. M. Wilde, Quantum Information Theory (Cambridge, 2013).
[2] T. M. Cover and J. A. Thomas, Information Theory (Wiley, 1991).
[3] C. E. Shannon and W. Weaver, The Mathematical Theory of Communication (Illinois, 1949).
[4] M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge, 2000).
[5] A. Wehrl, General properties of entropy, Rev. Mod. Phys. 50, 221 (1978).
[6] E. H. Lieb and M. B. Ruskai, A fundamental property of quantum-mechanical entropy, Phys. Rev. Lett. 30, 434 (1973).
[7] P. Hayden, R. Jozsa, D. Petz, and A. Winter, Structure of states which satisfy strong subadditivity with equality, Comm. Math. Phys. 246, 359-374 (2003).
[8] M. A. Nielsen and J. Kempe, Separable states are more disordered globally than locally, Phys. Rev. Lett. 86, 5184 (2001).
[9] J. Bekenstein, Universal upper bound on the entropy-to-energy ratio for bounded systems, Phys. Rev. D 23, 287 (1981).
[10] H. Casini, Relative entropy and the Bekenstein bound, Class. Quant. Grav. 25, 205021 (2008).
[11] P. J. Coles, M. Berta, M. Tomamichel, and S. Wehner, Entropic uncertainty relations and their applications, arXiv:1511.04857 (2015).
[12] H. Maassen and J. Uffink, Phys. Rev. Lett. 60, 1103 (1988).
[13] B. Schumacher, Quantum coding, Phys. Rev. A 51, 2738 (1995).
[14] R. Jozsa and B. Schumacher, A new proof of the quantum noiseless coding theorem, J. Mod. Optics 41, 2343-2349 (1994).
[15] C. H. Bennett, H. J. Bernstein, S. Popescu, and B. Schumacher, Concen-
trating partial entanglement by local operations, Phys. Rev. A 53, 2046
(1996).
[16] R. Horodecki, P. Horodecki, M. Horodecki, and K. Horodecki, Quantum entanglement, Rev. Mod. Phys. 81, 865 (2009).
[17] F. G. S. L. Brandão and M. B. Plenio, A reversible theory of entanglement and its relation to the second law, Comm. Math. Phys. 295, 829-851 (2010).
[18] M. Christandl and A. Winter, “Squashed entanglement”: an additive en-
tanglement measure, J. Math. Phys. 45, 829 (2004).
[19] M. Koashi and A. Winter, Monogamy of quantum entanglement and other correlations, Phys. Rev. A 69, 022309 (2004).
[20] F. G. S. L. Brandão, M. Christandl, and J. Yard, Faithful squashed entanglement, Comm. Math. Phys. 306, 805-830 (2011).
[21] A. C. Doherty, P. A. Parrilo, and F. M. Spedalieri, Complete family of separability criteria, Phys. Rev. A 69, 022308 (2004).
[22] A. S. Holevo, Bounds for the quantity of information transmitted by a
quantum communication channel, Probl. Peredachi Inf. 9, 3-11 (1973).
[23] A. Peres and W. K. Wootters, Optimal detection of quantum information, Phys. Rev. Lett. 66, 1119 (1991).
[24] A. S. Holevo, The capacity of the quantum channel with general signal states, arXiv:quant-ph/9611023.
[25] B. Schumacher and M. D. Westmoreland, Sending classical information via
noisy quantum channels, Phys. Rev. A 56, 131-138 (1997).
[26] M. B. Hastings, Superadditivity of communication capacity using entangled inputs, Nature Physics 5, 255-257 (2009).
[27] M. Horodecki, P. W. Shor, and M. B. Ruskai, Entanglement breaking chan-
nels, Rev. Math. Phys. 15, 629-641 (2003).
[28] P. W. Shor, Additivity of the classical capacity for entanglement-breaking quantum channels, J. Math. Phys. 43, 4334 (2002).
[29] B. Schumacher and M. A. Nielsen, Quantum data processing and error correction, Phys. Rev. A 54, 2629 (1996).
[30] B. Schumacher, Sending entanglement through noisy quantum channels, Phys. Rev. A 54, 2614 (1996).
[31] H. Barnum, E. Knill, and M. A. Nielsen, On quantum fidelities and channel capacities, IEEE Trans. Inf. Theory 46, 1317-1329 (2000).
[32] S. Lloyd, Capacity of the noisy quantum channel, Phys. Rev. A 55, 1613
(1997).
[33] P. W. Shor, unpublished (2002).
[34] I. Devetak, The private classical capacity and quantum capacity of a quantum channel, IEEE Trans. Inf. Theory 51, 44-55 (2005).
[35] I. Devetak and A. Winter, Distillation of secret key and entanglement from
quantum states, Proc. Roy. Soc. A 461, 207-235 (2005).
[36] B. Schumacher and M. D. Westmoreland, Approximate quantum error correction, Quant. Inf. Proc. 1, 5-12 (2002).
[37] M. Horodecki, J. Oppenheim, and A. Winter, Quantum state merging and negative information, Comm. Math. Phys. 269, 107-136 (2007).
[38] P. Hayden, M. Horodecki, A. Winter, and J. Yard, A decoupling approach to the quantum capacity, Open Syst. Inf. Dyn. 15, 7-19 (2008).
[39] A. Abeyesinghe, I. Devetak, P. Hayden, and A. Winter, The mother of all protocols: restructuring quantum information's family tree, Proc. Roy. Soc. A 465, 2537-2563 (2009).
[40] E. Lubkin, Entropy of an n-system from its correlation with a k-reservoir, J. Math. Phys. 19, 1028 (1978).
[41] S. Lloyd and H. Pagels, Complexity as thermodynamic depth, Ann. Phys. 188, 186-213 (1988).
[42] D. N. Page, Average entropy of a subsystem, Phys. Rev. Lett. 71, 1291
(1993).
[43] I. Devetak, A. W. Harrow, and A. Winter, A family of quantum protocols, Phys. Rev. Lett. 93, 230504 (2004).
[44] I. Devetak, A. W. Harrow, and A. Winter, A resource framework for quantum Shannon theory, IEEE Trans. Inf. Theory 54, 4587-4618 (2008).
[45] I. Devetak and P. W. Shor, The capacity of a quantum channel for simultaneous transmission of classical and quantum information, Comm. Math. Phys. 256, 287-303 (2005).
[46] C. H. Bennett, P. W. Shor, J. A. Smolin, and A. V. Thapliyal, Entanglement-assisted classical capacity of noisy quantum channels, Phys. Rev. Lett. 83, 3081 (1999).
[47] C. H. Bennett, P. W. Shor, J. A. Smolin, and A. V. Thapliyal, Entanglement-assisted classical capacity of a quantum channel and the reverse Shannon theorem, IEEE Trans. Inf. Theory 48, 2637-2655 (2002).
[48] P. W. Shor and J. A. Smolin, Quantum error-correcting codes need not completely reveal the error syndrome, arXiv:quant-ph/9604006.
[49] D. P. DiVincenzo, P. W. Shor, and J. A. Smolin, Quantum channel capacity
of very noisy channels, Phys. Rev. A 57, 830 (1998).
[50] G. Smith and J. Yard, Quantum communication with zero-capacity channels, Science 321, 1812-1815 (2008).
[51] L. del Rio, J. Aberg, R. Renner, O. Dahlsten, and V. Vedral, The thermodynamic meaning of negative entropy, Nature 474, 61-63 (2011).
[52] P. Hayden and J. Preskill, Black holes as mirrors: quantum information in random subsystems, JHEP 09, 120 (2007).
[53] Y. Sekino and L. Susskind, Fast scramblers, JHEP 10, 065 (2008).