Christian Robert is a Professor at Université Paris-Dauphine (Franc...

Julian Besag (March 1945 – August 2010) was a British statistician...

The MANIAC built under the direction of Nicholas Metropolis at the ...

Los Alamos during World War 2 include led many of the efforts for t...

Stanislaw Ulam (April 1909 - May 1984) was a Polish mathematician a...

John Von Neuman (December 1903 - February 1957) was Hungarian-Ameri...

The ENIAC was the first electronic general-purpose computer; it was...

A Markov chain is irreducible if it is possible to get to any state...

The simulated annealing algorithm is worth reading about, it is a p...

Josiah Willard Gibbs (1839-1903) was an American scientist who amon...

arXiv:0808.2902v7 [stat.CO] 9 Jan 2012

Statistical Science

2011, Vol. 26, No. 1, 102–115

DOI:

10.1214/10-STS351

c

Institute of Mathematical Statistics, 2011

A Short History of Markov Chain Monte

Carlo: Subjective Recol lections from

Incomplete Data

1

Christian Robert and George Casella

This paper is dedicated to the memory of our friend Julian Besag,

a giant in the ﬁeld of MCMC.

Abstract. We attempt to trace the history and development of Markov

chain Monte Carlo (MCMC) from its early inception in the late 1940s

through its use today. We see how the earlier stages of Monte Carlo

(MC, not MCMC) research have led to the algorithms currently in use.

More importantly, we see how the d evelopment of this methodology

has not only changed our solutions to problems, but has changed the

way we think about problems.

Key words and phrases: Gibbs sampling, Metropolis–Hasting algo-

rithm, hierarchical models, Bayesian methods.

1. INTRODUCTION

Markov chain Monte Carlo (MCMC) m ethods ha-

ve been around for almost as long as Monte C arlo

techniques, even though their impact on Statistics

has not been truly felt until the very early 1990s,

except in the specialized ﬁelds of Spatial Statistics

and Image Analysis, where those methods appeared

earlier. The emergence of Markov based techniques

in Physics is a story that remains untold within this

Christian Robert is Professor of S tatistics,

CEREMADE, Universit´e Paris Dauphine, 75785 Paris

cedex 16, France e-mail:

xian@ceremade.dauphine.fr.

George Casella is Distinguished Professor, Departm ent

of Statistics, University of Florida, Gainesville, Florida

32611, USA e-mail:

casella@uﬂ.edu.

1

A shorter version of this paper appears as Chapter 1

in Handbook of Markov Chain Monte Carlo (2011), edited

by Steve Brooks, Andrew Gelman, Galin Jones and Xiao-Li

Meng. Chapman & Hall/CRC Handbooks of Modern Statis-

tical Methods, Boca Raton, Florida.

This is an electro nic reprint of the original article

published by the Institute of Mathematical Sta tis tics in

Statistical Science, 2011 , Vol. 26, No. 1, 102–115. This

reprint diﬀers from the or iginal in pagination and

typographic detail.

survey (see Landau and Binder, 2005 ). Also, we will

not enter into a description of MCMC techniques.

A comprehensive treatment of MCMC tech niques,

with f urther references, can be found in Robert and

Casella (

2004).

We will distinguish between the introduction of

Metropolis–Hastings based algorithms and those re-

lated to Gibbs sampling, since they each stem from

radically diﬀerent origins, even though their math-

ematical justiﬁcation via Markov chain theory is the

same. Tracing the development of Monte Carlo meth-

ods, we will also brieﬂy mention what we might call

the “second-generation MCMC revolution.” Starting

in the mid-to-late 1990s, this includes the develop-

ment of particle ﬁlters, reversible jump and perfect

sampling, and concludes with more curr ent work on

population or sequential Monte Carlo and regenera-

tion an d the computing of “honest” standard errors.

As mentioned above, the realization that Markov

chains could be used in a wide variety of situa-

tions only came (to mainstream s tatisticians) with

Gelfand and Smith (

1990), despite earlier p ublica-

tions in the statistical literature like Hastings (1970),

Geman and Geman (1984) and Tanner and Wong

(

1987). Several reasons can be ad vanced: lack of

computing machinery (think of the computers of

1

2 C. ROBERT AND G. CASELLA

1970!), or back grou nd on Markov chains, or hesi-

tation to tr ust in the practicality of the method.

It thus required visionary researchers like Gelfand

and Smith to convince the community, supported

by papers that demonstrated, through a series of

applications, that the method was easy to under-

stand, easy to implement and practical (Gelfand

et al.,

1990; Gelfand, Smith and Lee, 1992; Smith

and Gelfand, 1992; Wakeﬁeld et al., 1994). The r ap id

emergence of the d edicated BUGS (Bayesian infer-

ence Using Gibbs Sampling) software as early as

1991, when a paper on BUGS was presented at the

Valencia meeting, was another compelling argum ent

for adopting, at large, MCMC algorithms.

2

2. BEFORE THE REVOLUTION

Monte Carlo methods were born in Los Alamos,

New Mexico during World War II, eventually result-

ing in the Metropolis algorithm in the early 1950s.

While Monte Carlo method s were in use by that

time, MCMC was brought closer to statistical prac-

ticality by the work of Hastings in the 1970s.

What can be reasonably seen as the ﬁrst MCMC

algorithm is what we now call the Metropolis algo-

rithm, published by Metropolis et al. (

1953). It em-

anates from the same group of scientists who pro-

duced the Monte Carlo method, namely, the research

scientists of Los Alamos, mostly physicists working

on mathematical physics and the atomic bomb.

MCMC algorithms therefore date back to the same

time as the development of regular (MC only) Monte

Carlo methods, which are usually traced to Ulam

and von Neumann in the late 1940s. Stanislaw Ulam

associates the original id ea with an intractable com-

binatorial computation he attempted in 1946 (cal-

culating the probability of winning at the card game

“solitaire”). This idea was enthusiastically adopted

by John von Neumann for implementation with di-

rect applications to neutron diﬀusion, the name

“Monte Carlo” being suggested by Nicholas Metropo-

lis. (Eckhardt,

1987, describes these early Monte Car-

lo developments, and Hitchcock,

2003, gives a brief

history of the Metropolis algorithm.)

These o ccurrences very closely coincide with the

appearance of the very ﬁrst computer, the ENIAC,

which came to life in February 1946, after three

years of construction. The Monte Carlo method was

2

Historically speaking, the development of BUGS initiated

from Geman and Geman (

1984) and Pearl (1987), in accord

with the developments in the artiﬁcial intelligence community,

and it predates Gelfand and Smith (1990).

set up by von Neumann, who was using it on thermo-

nuclear and ﬁssion problems as early as 1947. At the

same time, that is, 1947, Ulam and von Neumann in-

vented inversion and accept-reject techniques (also

recounted in Eckhardt,

1987) to simulate from nonu-

niform distr ibutions. Without computers, a rudimen-

tary version invented by Ferm i in the 1930s did n ot

get any recognition (Metropolis,

1987). Note also th at,

as early as 1949, a symposium on Monte Carlo was

supported by Rand, NBS and the Oak Ridge labora-

tory and that Metropolis and Ulam (

1949) published

the very ﬁrst paper about the Monte Carlo method.

2.1 The Metropolis et al. (1953) Paper

The ﬁrst MCMC algorithm is associated with a se-

cond computer, called MANIAC, built

3

in Los Ala-

mos under the direction of Metropolis in early 1952.

Both a physicist and a mathematician, Nicolas Me-

tropolis, who died in Los Alamos in 1999, came to

this place in April 1943. The other members of the

team also came to Los Alamos during those years,

including the controversial Teller. As early as 1942,

he became obsessed with the hydrogen (H) bomb,

which he eventually managed to design with Stanis-

law Ulam, using the better computer facilities in the

early 1950s.

Published in June 1953 in the Journal of Chem-

ical Physics, the primary focus of Metropolis et al.

(

1953) is the computation of integrals of the form

I =

Z

F (θ) exp{−E(θ)/kT }dθ

.

Z

exp{−E(θ)/kT }dθ,

on R

2N

, θ denoting a set of N particles on R

2

, with

the energy E being deﬁned as

E(θ) =

1

2

N

X

i=1

N

X

j=1

j6=i

V (d

ij

),

where V is a potential function and d

ij

the Eu-

clidean distance between particles i and j in θ. Th e

Boltzmann distribution exp{−E(θ)/kT } is parame-

terized by the temperature T , k being the Boltzmann

constant, with a normalization factor

Z(T ) =

Z

exp{−E(θ)/kT }dθ,

that is not available in closed form, except in tr ivial

cases. Since θ is a 2N -dimensional vector, numerical

3

MANIAC stands for Mathematical Analyzer, Numerical

Integrator and Computer.

A SH ORT HISTORY OF MARKOV CHAIN MONTE CARLO 3

integration is impossible. Given the large dimension

of the problem, even standard Monte Carlo tech-

niques fail to correctly approximate I, since

exp{−E(θ)/kT } is very small for most realizations

of the random conﬁgurations of the particle system

(uniformly in the 2N sq uare). In order to improve

the eﬃciency of the Monte Carlo method, Metropo-

lis et al. (

1953) propose a random walk mod iﬁca-

tion of the N particles. That is, for each particle i

(1 ≤i ≤ N ), values

x

′

i

= x

i

+ σξ

1i

and y

′

i

= y

i

+ σξ

2i

are proposed, where both ξ

1i

and ξ

2i

are uniform

U(−1, 1). The energy diﬀerence ∆E between the

new conﬁguration and the pr evious one is then com-

puted and the new conﬁguration is accepted with

probability

min{1, exp(−∆E/kT )},(1)

and otherwise the previous conﬁguration is repli-

cated, in the sense that its counter is increased by

one in the ﬁnal average of the F (θ

t

)’s over the τ

moves of the random walk, 1 ≤ t ≤ τ . Note that

Metropolis et al. (

1953) move one particle at a time,

rather than moving all of them together, which ma-

kes the initial algorithm appear as a primitive kind

of Gibbs sampler!

The auth ors of Metropolis et al. (

1953) demon-

strate the validity of the algorithm by ﬁrst establish-

ing irreducibility, which they call ergodici ty, and sec-

ond proving ergodicity, that is, convergence to the

stationary distribution. The second part is obtained

via a discretization of th e sp ace: They ﬁrst note that

the proposal move is reversible, then establish that

exp{−E/kT } is invariant. The result is therefore

proven in its full generality, minus the discretiza-

tion. The number of iterations of the Metropolis al-

gorithm used in the paper seems to be limited: 16

steps for bu rn-in and 48 to 64 subsequent iterations,

which required four to ﬁve hours on the Los Alamos

computer.

An interesting variation is the Simulated Anneal-

ing algorithm, developed by Kirkpatrick , Gelatt and

Vecchi (

1983), who connected optimization with an-

nealing, the cooling of a metal. Th eir variation is to

allow the temperature T in (

1) to chan ge as the algo-

rithm runs, according to a “cooling schedule,” and

the Simulated Annealing algorithm can be shown

to ﬁnd the global maximum with probability 1, al-

though the analysis is quite complex d ue to the fact

that, with varying T , the algorithm is no longer

a time-homogeneous Markov chain.

2.2 The Hastings (

1970) Paper

The Metropolis algorithm was later generalized

by Hastings (1970) and his s tudent Peskun (1973,

1981) as a statistical simulation tool that could over-

come th e curse of dimensionality met by regular

Monte Carlo methods, a point already emphasized

in Metropolis et al. (

1953).

4

In his Biometrika paper,

5

Hastings (1970) also de-

ﬁnes his methodology for ﬁnite and reversible Mar-

kov chains, treating the continuous case by using

a discretization analogy. The generic pr ob ab ility of

acceptance for a move from state i to state j is

α

ij

=

s

ij

1 + (π

i

/π

j

)(q

ij

/q

ji

)

,

where s

ij

= s

ji

, π

i

denotes the target and q

ij

the

proposal. This generic form of probability encom-

passes the forms of both Metropolis et al. (

1953) and

Barker (

1965). At this stage, Hastings mentions that

little is known about the relative merits of those two

choices (even though) Metropolis’s method may be

preferable. He also warns against high re jection rates

as indicative of a poor choice of transition matrix,

but does n ot mention the opposite pitfall of low re-

jection rates, associated with a slow exploration of

the target.

The examples given in the paper are a Poisson tar-

get with a ±1 rand om walk proposal, a normal tar-

get with a uniform random walk proposal mixed

with its r eﬂection, that is, a uniform proposal cen-

tered at −θ

t

rather than at the current value θ

t

of

the Markov chain, and then a multivariate target

where Hastings introduces a Gibbs sampling strat-

egy, updating one component at a time an d deﬁning

the composed transition as s atisfying the stationary

condition because each component does leave the

target invariant. Hastings (

1970) actually refers to

Ehrman, Fosdick and Handscomb (

1960) as a preli-

minary, if speciﬁc, instance of this samp ler. More pre-

cisely, this is Metropolis-within-Gibbs except for the

name. This ﬁrst introd uction of the Gibbs sampler

has thus been completely overlooked, even though

the proof of convergence is completely general, based

on a composition argument as in Tierney (

1994), dis-

cussed in Section

4.1. The r emainder of the paper

4

In fact, Hastings starts by mentioning a decomposition

of the target distribution into a product of one-dimensional

conditional distributions, but this falls short of an early Gibbs

sampler.

5

Hastings (

1970) is one of the ten Biometrika papers re-

produced in Titterington and Cox (

2001).

4 C. ROBERT AND G. CASELLA

deals with (a) an importance sampling version of

MCMC, (b) general remarks about assessment of the

error, and (c) an application to random orthogonal

matrices, with another example of Gibbs sampling.

Three years later, Peskun (

1973) published a com-

parison of Metropolis’ and Barker’s forms of accep-

tance probabilities and shows in a discrete setup

that the optimal choice is that of Metropolis, where

optimality is to be understood in terms of the asymp-

totic variance of any empirical average. The proof is

a direct consequence of a result by Kemeny and Snell

(

1960) on the asymptotic variance. Peskun also es-

tablishes that this asymptotic variance can imp rove

upon the i.i.d. case if and only if the eigenvalues of

P − A are all negative, when A is the transition

matrix corresponding to i.i.d. simulation and P th e

transition matrix corresponding to the Metropolis

algorithm, but he concludes that the trace of P −A

is always positive.

3. SEEDS OF THE REVOLUTION

A number of earlier pioneers had brought forward

the seeds of Gibbs sampling; in particular, Hammer-

sley and Cliﬀord had produced a constructive argu-

ment in 1970 to recover a joint distribution from its

conditionals, a result later called the Hammersley–

Cliﬀord theorem by Besag (

1974, 1986). Besides Has-

tings (

1970) and Geman and Geman (1984), already

mentioned, other papers that contained the seeds

of Gibbs sampling are Besag and Cliﬀord (

1989),

Broniatowski, Celeux and Diebolt (

1984), Qian and

Titterington (1990) and Tann er and Wong (1987).

3.1 Besag’s Early Work and the Fundamental

(Missing) Theorem

In the early 1970’s, Hammersley, Cliﬀord and Be-

sag were working on the speciﬁcation of joint distri-

butions from conditional distributions and on nec-

essary and suﬃ cient conditions for the conditional

distributions to be compatible with a joint distri-

bution. Wh at is now known as the Hammersley–

Cliﬀord theorem states that a joint distribu tion for

a vector associated with a dependence graph (edge

meaning dependence and absence of edge conditional

indepen dence) must be represented as a product of

functions over the cliques of the graphs, that is,

of functions depending only on the components in-

dexed by the labels in the clique.

6

6

A clique is a maximal subset of the nodes of a graphs su ch

that every pair of nodes within the clique is connected by an

edge (Cressie,

1993).

From a historical point of v iew, Hammersley (

1974)

explains why the Hammersley–Cliﬀord theorem was

never published as such, but only through Besag

(

1974). The r eason is that Cliﬀord and Hammers-

ley were dissatisﬁed with the positivity constraint:

The joint density could be recovered from the full

conditionals only when the support of the joint was

made of the product of the supports of the full con-

ditionals. While they strived to make the theorem

independent of any positivity condition, their grad-

uate student publish ed a counter-example that put

a full stop to their endeavors (Moussouris,

1974).

While Besag (1974) can certainly be credited to

some extent of the (re-)discovery of the Gibbs sam-

pler, Besag (

1975) expressed doubt about the prac-

ticality of his method, noting that “the technique

is unlikely to be particularly helpful in many other

than binary situations and the Markov chain itself

has no practical interpretation,” clearly understat-

ing the importance of his own work.

A more optimistic sentiment was expressed ear-

lier by Hammersley and Handscomb (

1964) in their

textbook on Monte Carlo methods. There they cover

such topics as “Crude Monte Carlo,” importance

sampling, control variates and “Conditional Monte

Carlo,” which looks surp risingly like a missing-data

completion approach. Of course, they do not cover

the Hammersley–Cliﬀord theorem, but they state in

the Preface:

We are convinced nevertheless that Monte

Carlo methods will one day reach an im-

pressive maturity.

Well said!

3.2 EM and Its Simulated Versions as Precursors

Because of its use for missing data problems, the

EM algorithm (Dempster, Laird and Rubin,

1977)

has early connections with Gib bs sampling. For in-

stance, Broniatowski, Celeux and Diebolt (

1984) and

Celeux and Diebolt (

1985) had tried to overcome the

dependence of EM methods on the starting value

by replacing the E step with a simulation step, the

missing data z being generated conditionally on the

observation x and on the current value of the pa-

rameter θ

m

. The maximization in the M step is then

done on the simulated complete-data log-likelihood,

a predecessor to the Gibbs step of Diebolt and R o-

bert (

1994) for mixture estimation. Unfortunately,

the theoretical convergence results for these meth-

ods are limited. Celeux and Diebolt (

1990) have,

however, solved th e convergence pr ob lem of SEM

A SH ORT HISTORY OF MARKOV CHAIN MONTE CARLO 5

by devising a hybrid vers ion called SAEM (for Simu-

lated Annealing EM ), where the amount of random-

ness in the simulations decreases with the iterations,

ending up with an EM algorithm.

7

3.3 Gibbs a nd B eyond

Although somewhat removed from statistical in-

ference in the classical sense and based on earlier

techniques used in Statistical Physics, the landmark

paper by Geman and Geman (

1984) brought Gibbs

sampling into the arena of statistical application.

This paper is also responsible for the name Gibbs

sampling, because it implemented this method for

the Bayesian study of Gibbs random ﬁelds which,

in turn, derive their name from the physicist Josiah

Willard Gibbs (1839–1903). T his original implemen-

tation of the Gibbs sampler was applied to a discrete

image processing problem and did not involve com-

pletion. But this was one more spark that led to

the exp losion, as it had a clear inﬂuence on Gr een,

Smith, Spiegelhalter and others.

The extent to which Gibbs sampling and Metropo-

lis algorithms were in use within the image analy-

sis and point process communities is actually q uite

large, as illustrated in Ripley (

1987) where Sec-

tion 4.7 is entitled “Metropolis’ method and ran-

dom ﬁelds” an d describes the implementation and

the validation of the Metropolis algorithm in a ﬁ-

nite setting with an application to Markov random

ﬁelds and the corresponding issue of bypassing the

normalizing constant. Besag, York and Molli´e (

1991)

is another striking example of the activity in the spa-

tial statistics community at the end of the 1980s.

4. THE REVOLUTION

The gap of more than 30 years between Metropo-

lis et al. (1953) and Gelfand and Smith (1990) can

still be partially attributed to the lack of appropri-

ate computing power, as most of the examples now

processed by MCMC algorithms could not have been

treated previously, even though the hundreds of di-

mensions processed in Metropolis et al. (

1953) were

quite formidable. However, by the mid-1980s, the

pieces were all in place.

7

Other and more well-known connections between EM and

MCMC algorithms can be found in the literature (Liu and

Rubin,

1994; Meng and Rubin, 1993; Wei and Tanner, 1990),

but the connection with Gibbs sampling is more tenuous in

that the simulation methods are used to approximate quan-

tities in a Monte Carlo fashion.

After Peskun, MC MC in the statistical world was

dormant for about 10 years, and then several pa-

pers appeared that highlighted its usefulness in spe-

ciﬁc settings like pattern recognition, image analysis

or spatial statistics. In particular, Geman and Ge-

man (

1984) inﬂuenced Gelfand and S mith (1990) to

write a paper that is the genuine starting point for

an intens ive use of MCMC methods by the main-

stream s tatistical community. It sparked new inter-

est in Bayesian methods, statistical computing, al-

gorithms and stochastic processes through the use

of computing algorithms such as the Gibbs sam-

pler and the Metropolis–Hastings algorithm. (See

Casella and George, 1992, for an elementary intro-

duction to the Gibbs sampler.

8

)

Interestingly, the earlier paper by Tanner and

Wong (1987) had essentially the same ingredients

as Gelfand and Smith (

1990), namely, th e fact that

simulating from the conditional distributions is suf-

ﬁcient to asymptotically simulate from the joint.

This paper was considered important enough to be

a discussion paper in the Journal of the American

Statistical Association, but its impact was somehow

limited, compared with Gelfand and Smith (

1990).

There are several reasons for this; one being that the

method seemed to only apply to missin g data prob-

lems, this impression being reinforced by the name

data augmentation, and another is that the authors

were more focused on approximating the posterior

distribution. They suggested a MCMC approxima-

tion to the target π(θ|x) at each iteration of the

sampler, based on

1

m

m

X

k=1

π(θ|x, z

t,k

),

z

t,k

∼ ˆπ

t−1

(z|x), k = 1, . . . , m,

that is, by replicating m times the simulations from

the current approximation ˆπ

t−1

(z|x) of the marginal

posterior distribution of the missing data. This fo-

cus on estimation of the posterior distribution con-

nected the original Data Augmentation algorithm to

EM, as pointed out by Dempster in the discussion.

8

On a humorous note, the original Technical Report of this

paper was called Gibbs for Kids, which was changed because

a referee did not appreciate the humor. However, our colleague

Dan Gianola, an Animal Breeder at Wisconsin, liked the title.

In using Gibbs sampling in his work, he gave a presentation in

1993 at the 44th Annual Meeting of the Europ ean Association

for Animal Production, Arhus, Denmark. The title: Gibbs for

Pigs.

6 C. ROBERT AND G. CASELLA

Although the discussion by Carl Morris gets very

close to the two-stage Gibbs sampler for hierarchi-

cal models, he is still concerned about doing m iter-

ations, and worries about how costly that would be.

Tanner and Wong mention taking m = 1 at the end

of the paper, referring to this as an “extreme case.”

In a sense, Tanner and Wong (

1987) were still too

close to Rubin’s

1978 multiple imputation to start

a new revolution. Yet another reason for th is may

be that the theoretical background was based on

functional analysis rather than Markov ch ain theory,

which needed, in particular, for th e Markov kernel

to be uniformly bounded and equicontinuous. This

may have discouraged potential users as requiring

too much mathematics.

The authors of this review were fortunate enough

to attend many focused conferences d uring this time,

where we were able to witness the explosion of Gibbs

sampling. In the summer of 1986 in Bowling Green,

Ohio, Adrian Smith gave a series of ten lectures

on hierarchical models. Although there was a lot of

computing mentioned, the Gibbs sampler was not

fully developed yet. In another lecture in June 1989

at a Bayesian workshop in Sherbrooke, Qu´ebec, he

revealed for the ﬁrst time the generic features of

the Gibbs sampler, and we still remember vividly

the shock induced on ourselves and on the whole

audience by the sheer breadth of the method: This

development of Gibbs sampling, MCMC, and the re-

sulting seminal paper of Gelfand and Smith (

1990)

was an epiphany in the world of Statistics.

Definition (Epiphany n). A spiritual event in

which the essence of a given object of manifesta-

tion appears to the subject, as in a sudden ﬂash of

recognition.

The explosion had begun, and just two years later,

at an MCMC conference at Ohio State University

organized by Alan Gelfand, Prem Goel and Ad rian

Smith, there were three full days of talks. The pre-

senters at the conference read like a Who’s Who of

MCMC, and the level, intensity an d imp act of that

conference, and the subsequent research, are immea-

surable. Many of the talks were to become inﬂuential

papers, including Albert and Chib (1993), Gelman

and Rubin (

1992), Geyer (1992), Gilks (1992), Liu,

Wong and Kong (1994, 1995) and Tierney (1994).

The program of the conference is reproduced in the

Appendix.

Approximately one year later, in May of 1992,

there was a meeting of the Royal Statistical Soci-

ety on “The Gibbs sampler and other Markov chain

Monte Carlo methods,” where four papers were pre-

sented followed by much discussion. The papers ap-

pear in the ﬁrst volume of JRSSB in 1993, together

with 49(!) pages of discussion. The excitement is

clearly evident in the writings, even though the the-

ory and implementation were not always perfectly

understood.

9

Looking at these meetings, we can see the paths

that Gib bs sampling would lead us down. In the next

two sections we will summarize some of the advances

from the early to mid 1990s.

4.1 Advances in MCMC Theory

Perhaps the most inﬂuential MCMC theory pa-

per of the 1990s is Tierney (

1994), who carefully

laid out all of the assumptions needed to analyze

the Markov chains and then developed their prop-

erties, in particular, convergence of ergodic averages

and central limit theorems. In one of the discu ssions

of that paper, Chan and Geyer (

1994) were ab le to

relax a condition on Tierney’s Central Limit Theo-

rem, and this new condition plays an important role

in research today (see Section

5.4). A pair of very in-

ﬂuential, and in novative, papers is the work of Liu,

Wong and Kong (

1994, 1995), who very carefully an-

alyzed the covariance str ucture of Gibbs sampling,

and were able to formally establish the validity of

Rao–Blackwellization in Gibbs sampling. Gelfand

and Smith (

1990) had used Rao–Blackwellization,

but it was not justiﬁed at that time, as the origi-

nal theorem was only applicable to i.i.d. sampling,

which is not the case in MCMC. Another signiﬁcant

entry is Rosenthal (

1995), who obtained one of the

earliest results on exact rates of convergence.

Another paper must be singled out, namely, Men-

gersen and Tweedie (

1996), for setting the tone for

the study of the speed of convergence of MCMC

algorithms to the target distribution. Subsequent

works in this area by Richard Tweedie, Gareth Ro-

berts, Jeﬀ R osenthal and co-authors are too numer-

ous to be mentioned here, even though the paper

9

On another humorous note, Peter Cliﬀord opened the

discussion by noting “. . .we have had the opportunity to

hear a large amount about an important new area in statis-

tics. It may well be remembered as the ‘afternoon of the 11

Bayesians.’ Bayesianism has obviously come a long way. It

used to be that you could tell a Bayesian by his tendency

to hold meetings in isolated parts of Spain and his obsession

with coherence, self-interrogation and other manifestations of

paranoia. Things have changed, and there may be a general

lesson here for statistics. Isolation is counter-productive.”

A SH ORT HISTORY OF MARKOV CHAIN MONTE CARLO 7

by Roberts, Gelman and Gilks (

1997) must be cited

for setting explicit targets on the acceptance rate

of the ran dom walk Metropolis–Hastings algorithm,

as well as Roberts and Rosenthal (

1999) for get-

ting an upper bound on the number of iterations

(523) needed to approximate the target up to 1%

by a slice samp ler. The untimely death of Richard

Tweedie in 2001, alas, had a major impact on the

book about MCMC convergence he was contemplat-

ing with Gareth Roberts.

One pitfall arising from the widespread use of

Gibbs sampling was the tendency to specify models

only through their conditional distributions, almost

always without referring to the positivity conditions

in Section

3. Unfortunately, it is possible to spec-

ify a perfectly legitimate-looking set of conditionals

that do not correspon d to any joint distribution, and

the resulting Gibbs chain cannot converge. Hobert

and Casella (

1996) were able to document the con-

ditions needed for a convergent Gibbs chain, and

alerted the Gibbs community to this problem, which

only arises when improper priors are used, but this

is a frequent occurrence.

Much other work followed, and continues to grow

today. Geyer and Thomp son (

1995) describe how to

put a “ladder” of chains together to have both “hot”

and “cold” exploration, followed by Neal’s

1996 in-

troduction of tempering; Athreya, Doss and Sethu-

raman (1996) gave more easily veriﬁable conditions

for convergence; Meng and van Dyk (

1999) and Liu

and Wu (

1999) developed the theory of parame-

ter expansion in the Data Augmentation algorithm,

leading to construction of chains w ith faster con-

vergence, and to the work of Hobert and Marchev

(

2008), who give precise constructions and theorems

to show how parameter expansion can uniformly im-

prove over the original chain.

4.2 Advances in MCMC Applications

The real reason for the explosion of MCMC meth-

ods was the fact that an enorm ou s number of prob-

lems that were deemed to be computational night-

mares now cracked open like eggs. As an example,

consider this very simple random eﬀects model from

Gelfand and Smith (

1990). Observe

Y

ij

= θ

i

+ ε

ij

, i = 1, . . . , K, j = 1, . . . , J,(2)

where

θ

i

∼N(µ, σ

2

θ

),

ε

ij

∼N(0, σ

2

ε

), independent of θ

i

.

Estimation of the variance components can be diﬃ-

cult for a frequentist (REML is typically preferr ed),

but it indeed was a nightmare for a Bayesian, as the

integrals were intractable. However, with the usual

priors on µ, σ

2

θ

and σ

2

ε

, the f ull cond itionals are triv-

ial to sample from and the problem is easily solved

via Gibbs sampling. Moreover, we can increase the

numb er of variance components and the Gibbs so-

lution remains easy to implement.

During the early 1990s, researchers found that

Gibbs, or Metropolis–Hastings, algorithms would be

able to give solutions to almost any problem that

they looked at, and there was a veritable ﬂood of pa-

pers applying MCMC to previously intractable mod-

els, and getting good answers. For example, bu ild-

ing on (

2), it was quickly r ealized that Gibbs sam-

pling was an easy route to getting estimates in the

linear mixed models (Wang, Rutledge and Gianola,

1993, 1994), and even generalized linear mixed mod-

els (Zeger and Karim,

1991). Building on the expe-

rience gained with the EM algorithm, similar ar-

guments made it possible to analyze probit models

using a latent variable approach in a linear mixed

model (Alber t and Chib,

1993), and in mixture mod-

els with Gibbs sampling (Diebolt and Robert,

1994).

It progressively dawned on the community that la-

tent variables could be artiﬁcially introduced to run

the Gibbs sampler in about every situation, as even-

tually published in Damien, Wakeﬁeld and Walker

(1999), the main example being the slice sampler

(Neal,

2003). A very incomplete list of some other

applications include changepoint analysis (Carlin,

Gelfand and Smith,

1992; Stephens, 1994), Genomics

(Churchill,

1995; Lawrence et al., 1993; Stephens

and Smith,

1993), capture–recapture (Dupuis, 1995;

George and Robert,

1992), variable selection in re-

gression (George an d McCulloch,

1993), spatial sta-

tistics (Raftery and Banﬁeld,

1991), and longitudi-

nal studies (Lange, Carlin and Gelfand,

1992).

Many of these applications were advanced though

other developments su ch as the Adaptive Rejection

Sampling of Gilks (

1992); Gilks, Best and Tan (1995),

and the simulated tempering approaches of Geyer

and T hompson (

1995) or Neal (1996).

5. AFTER THE REVOLUTION

After the revolution comes the “second” revolu-

tion, but now we have a more mature ﬁeld. The

revolution has slowed, and the problems are being

solved in, perhaps, deeper and more sophisticated

ways, even though Gibbs sampling also oﬀers to the

8 C. ROBERT AND G. CASELLA

amateur the possibility to handle Bayesian analy-

sis in complex models at little cost, as exhibited by

the widespread use of BUGS, which mostly focuses

10

on this approach. But, as before, the methodology

continues to expand the set of problems for which

statisticians can provide meaningful solutions, and

thus continues to further the impact of Statistics.

5.1 A Brief Glimpse at Particle Systems

The realization of the possibilities of iterating im-

portance sampling is not new: in fact, it is about as

old as Monte Carlo methods themselves. It can be

found in the molecular simulation literature of the

50s, as in Hammersley and Morton (

1954), Rosen-

bluth and Rosenbluth (

1955) and Marshall (1965).

Hammersley and colleagues proposed such a method

to simulate a self-avoiding random walk (see Madras

and Slade,

1993) on a grid, due to huge ineﬃcien-

cies in regular importance sampling and rejection

techniques. Although this early implementation oc-

curred in particle physics, the use of the term “par-

ticle” only dates back to Kitagawa (

1996), while

Carpenter, Cliﬀord and Fernhead (

1997) coined the

term “particle ﬁlter.” In signal processing, early oc-

currences of a particle ﬁlter can be traced back to

Handschin and Mayne (

1969).

More in connection with ou r theme, the landmark

paper of Gordon, Salmond and Smith (1993) in-

troduced the bootstrap ﬁlter which, wh ile formally

connected with importance sampling, involves past

simulations and p ossible MCMC steps (Gilks and

Berzuini,

2001). As described in the volume edited

by Doucet, de Freitas and Gordon (

2001), parti-

cle ﬁlters are simulation methods adapted to se-

quential settings where data are collected progres-

sively in time, as in radar detection, telecommuni-

cation correction or ﬁnancial volatility estimation.

Taking advantage of state-space representations of

those dynamic models, particle ﬁ lter methods pro-

duce Monte Carlo approximations to the posterior

distributions by prop agating simulated samples who-

se weights are actualized against the incoming ob-

servations. Since the importance weights have a ten-

dency to d egenerate, that is, all weights but one are

close to zero, additional MCMC steps can be intro-

duced at times to recover the variety and rep resen-

tativeness of the sample. Mod er n connections with

MCMC in the construction of the proposal kernel

10

BUGS now uses both Gibbs sampling and Metropolis–

Hastings algorithms.

are to be f ou nd, for instance, in Doucet, God sill and

Andrieu (

2000) and in Del Moral, Doucet and Jasra

(

2006). At the same time, sequential imputation was

developed in Kong, Liu and Wong (

1994), while Liu

and Chen (

1995) ﬁrst formally pointed out the im-

portance of resampling in sequential Monte Carlo,

a term coined by them.

The recent literature on the topic more closely

bridges the gap between sequential Monte Carlo and

MCMC methods by making adap tive MCMC a pos-

sibility (see, e.g., And rieu et al.,

2004, or Roberts

and Rosenthal,

2007).

5.2 Perfect Sampling

Introduced in the semin al paper of Propp and

Wilson (1996), perf ect sampling, namely, the abil-

ity to use MCMC methods to produce an exact

(or perfect) simulation from the target, maintains

a unique place in the history of MCMC methods.

Although this exciting discovery led to an outburst

of p apers, in particular, in the large body of work of

Møller and coauthors, including the book by Møller

and Waagepetersen (

2003), as well as many reviews

and introductory materials, like Casella, Lavine and

Robert (2001), Fismen (1998) and Dimakos (2001),

the excitement quickly dried out. The major reason

for this ephemeral lifesp an is that the construction

of perfect samplers is most often close to impossible

or impractical, despite some advances in the imple-

mentation (Fill,

1998a, 1998b).

There is, h owever, ongoing activity in the area

of point processes and stochastic geometry, much

from the work of Møller and Kendall. In particu-

lar, Kendall and Møller (2000) developed an alter-

native to the Coupling From The Past (C FPT) al-

gorithm of Propp and Wilson (1996), called horizon-

tal CFTP, which m ainly applies to point p rocesses

and is based on continuous time birth-and-death

processes. See also Fern´andez, Ferrari and Garcia

(1999) for another horizontal CFTP algorithm for

point processes. Berthelsen and Møller (

2003) ex-

hibited a use of these algorithms for nonparametric

Bayesian inference on point processes.

5.3 Reversible Jump and Variable Dimensions

From many viewpoints, the invention of th e re-

versible jump algorithm in Green (

1995) can be seen

as the start of the second MCMC revolution: the

formalization of a Markov chain that moves across

models and parameter spaces allowed for the Baye-

sian processing of a wide variety of new models and

A SH ORT HISTORY OF MARKOV CHAIN MONTE CARLO 9

contributed to the success of Bayesian model choice

and subsequently to its adoption in other ﬁelds. The-

re exist earlier alternative Monte Carlo solutions

like Gelfand and Dey (

1994) and Carlin and Chib

(

1995), the later being very close in spirit to re-

versible ju mp MCMC (as shown by the completion

scheme of Brooks, Giudici and Roberts, 2003), but

the deﬁnition of a proper balance condition on cross-

model Markov kernels in Green (1995) gives a generic

setup for exploring variable dimension spaces, even

when the number of models under comparison is in-

ﬁnite. The impact of this new idea was clearly per -

ceived when looking at the First European Confer-

ence on Highly Structured Stochastic Systems that

took place in Rebild, Denmark, the next year, or-

ganized by Stephen Lauritzen an d Jesper Møller:

a large majority of the talks were aimed at direct

implementations of RJMCMC to various inference

problems. The application of RJMCMC to mixture

order estimation in the discussion paper of Richard-

son and Green (

1997) ensured furth er dissemination

of the technique. Continuing to develop RJMC MC t,

Stephens (2000) pr oposed a continuous time version

of R JMCMC, based on earlier ideas of Geyer and

Møller (1994), but with similar properties (Capp´e,

Robert and Ryd´en,

2003), while Brooks, Giudici and

Roberts (

2003) made proposals for increasing the ef-

ﬁciency of the moves. In retrospect, while reversible

jump is somehow u navoidable in the processing of

very large numbers of models under comparison,

as, for instance, in variable selection (Marin and

Robert,

2007), the implementation of a complex al-

gorithm like RJMCMC for the comparison of a few

models is somewhat of an overkill since there may

exist alternative solutions based on mod el speciﬁc

MCMC chains, for example (Chen, Shao and Ibra-

him,

2000).

5.4 Regeneration and the CLT

While the Central Limit Theorem (CLT) is a cen-

tral tool in Monte Carlo convergence assessment, its

use in MCMC setups took longer to emerge, despite

early signals by Geyer (

1992), and it is only recently

that s uﬃciently clear conditions emerged. We re-

call that the Ergodic Theorem (see, e.g., Robert and

Casella,

2004, Theorem 6.63) states that, if (θ

t

)

t

is

a Markov chain with stationary distribution π, and

h(·) is a function with ﬁnite variance, then under

fairly mild conditions,

lim

n→∞

¯

h

n

=

Z

h(θ)π(θ) dθ = E

π

h(θ),(3)

almost everywhere, where

¯

h

n

= (1/n)

P

n

i=1

h(θ

i

).

For the CLT to be used to monitor this convergence,

√

n(

¯

h

n

−E

π

h(θ))

p

Var h(θ)

→N(0, 1),(4)

there are two roadblock s. First, convergence to nor-

mality is strongly aﬀected by the lack of indepen-

dence. To get CLTs f or Markov chains, we can use

a result of Kipnis and Varadhan (

1986), which re-

quires the chain to be reversible, as is the case for

holds for Metropolis–Hastings chains, or we must

delve into mixing conditions (Billingsley, 1995, Sec-

tion 27), which are typically not easy to verify. How-

ever, Chan and Geyer (1994) showed how the condi-

tion of geometric er godicity could be used to estab-

lish CLTs for Markov chains. But getting the con-

vergence is only half of the p roblem. In order to

use (

4), we must be able to consistently estimate the

variance, which turns out to be an other diﬃcult en-

deavor. The “na¨ıve” estimate of the usual standard

error is not consistent in the dependent case and

the most promising paths for consistent variance es-

timates seems to be through regeneration and batch

means.

The theory of r egeneration uses the concept of

a split chain (Athreya and Ney,

1978), and allows

us to independently restart the chain while preserv-

ing the stationary distribution. These independent

“tours” then allow the calculation of consistent vari-

ance estimates and honest monitoring of convergence

through (4). Early work on applying regeneration

to MCMC chains was done by Mykland, Tierney

and Yu (

1995) and R obert (1995), who showed how

to constru ct the chains and use them for variance

calculations and diagnostics (see also Guihenneuc-

Jouyaux and Robert,

1998), as well as deriving adap-

tive MCMC algorithms (Gilks, Roberts and Sahu,

1998). Rosenthal (1995) also showed how to con-

struct an d use regenerative chains, and much of this

work is reviewed in Jones and Hobert (2001). The

most interesting and p ractical developments, how-

ever, are in Hobert et al. (2002) and Jones et al.

(

2006), where consistent estimators are constr ucted

for Var h(X), allowing valid monitoring of conver-

gence in chains that satisfy the CLT. Interestingly,

although Hob er t et al. (2002) use regeneration, Jones

et al. (2006) get their consistent estimators thorough

another technique, that of cumulative batch means.

6. CONCLUSION

The impact of Gibbs samplin g and MCMC was to

change our entire method of think ing and attacking

10 C. ROBERT AND G. CASELLA

problems, representing a paradigm shift (Kuh n,

1996).

Now , the collection of real problems that we could

solve grew almost without bound. Markov chain Mon-

te Carlo changed our emphasis from “closed form”

solutions to algorithms, expanded our impact to solv-

ing “real” applied problems and to improving nu-

merical algorithms using statistical ideas, and led us

into a world where “exact” now m eans “simulated.”

This has truly been a quantum leap in the evo-

lution of the ﬁeld of statistics, and the evidence is

that there are no signs of slowing down. Although

the “explosion” is over, the current work is going

deeper into theory and applications, and continues

to expand our horizons and inﬂuen ce by increasing

our ability to solve even bigger and more important

problems. The size of the data sets, and of the mod-

els, for example, in genomics or climatology, is some-

thing that could not have been conceived 60 years

ago, wh en Ulam and von Neumann invented the

Monte Carlo method. Now we continue to plod on,

and hope that the advances that we make here will,

in some way, help our colleagues 60 years in the fu-

ture solve the problems that we cannot yet conceive.

APPENDIX: WORKSHOP ON

BAYESIAN COMPUTATION

This section contains the program of the Work-

shop on Bayesian Computation via Stochastic Sim-

ulation, held at Ohio State University, February 15–

17, 1991. The organizers, an d their aﬃliations at the

time, were Alan Gelfand, Un iversity of Connecti-

cut, Prem Goel, Ohio State University, and Adrian

Smith, Imperial College, London.

• Friday, Feb. 15, 1991.

(a) Theoretical Aspect of Iterative Sampling,

Chair: Adrian Sm ith .

(1) Martin Tanner, University of Rochester:

EM, MCEM, DA and PMDA.

(2) Nick Polson, Carnegie Mellon Univer-

sity: On the Convergence of the Gibbs Sam-

pler and Its Rate.

(3) Wing-Hung Wong, Augustin Kong and

Jun Liu, University of Chicago: Correlation

Structure and Convergence of the Gibbs Sam-

pler and Related Algorithms.

(b) Applications—I, Chair: Prem Goel.

(1) Nick Lange, Brown University, Brad

Carlin, Carnegie Mellon University and Alan

Gelfand, University of Conn ecticut: Hierar-

chical Bayes Models for Progression of HIV

Infection.

(2) Cliﬀ Litton, Nottingham University,

England: Archaeologi cal Applications of Gibbs

Sampling.

(3) Jonas Mockus, Lithuanian Academy of

Sciences, Vilnius: Bayesian Approach to Global

and Stochastic O ptimization.

• Saturday, Feb. 16, 1991.

(a) Posterior Simulation an d Markov Sampling,

Chair: Alan Gelfand.

(1) Lu ke Tierney, University of Minnesota:

Exploring Posterior Distributions Using Mar-

kov Chains.

(2) Peter Mueller, Purdue University:

A Generic Approach to Posterior Integration

and Bayesian Sampling.

(3) Andrew Gelman, University of Califor-

nia, Berkeley and Donald P. Rubin, Harvard

University: On the Routine Use of Markov

Chains for Simulations.

(4) Jon Wakeﬁeld, Imperial College, Lon-

don: Parameterization Issues in Gibbs Sam-

pling.

(5) Panickos Palettas, Virginia Polytech-

nic In stitute: Acceptance–Rejection Method in

Posterior Computations.

(b) Applications—II, Chair: Mark Berliner.

(1) David Stephens, Imperial College, Lon-

don: Gene Mapping via Gibbs Sampling.

(2) Constantine Gatsonis, Harvard Univer-

sity: Random Eﬂeeds Model for Ordinal Cate-

qorica! Data with an Application to ROC Anal-

ysis.

(3) Arnold Zellner, University of Chicago,

Luc Bauwens and Herman Van Dijk: Bayesian

Speciﬁcation Analysis and Estimation of Simul-

taneous Equation M odels Using Monte Carlo

Methods.

(c) Adaptive Sampling, Chair: Carl Morris.

(1) Mike Evans, University of Toronto and

Carnegie Mellon University: Some Uses of

Adaptive Importance Sampling and Chaining.

(2) Wally Gilks, Medical Research Coun-

cil, Cambridge, England: Adaptive Rejec tion

Sampling.

A SH ORT HISTORY OF MARKOV CHAIN MONTE CARLO 11

(3) Mike West, Duke University: Mixture

Model Approximations, Sequential Updating

and Dynamic M odels.

• Su nday, Fe b. 17, 1991.

(a) Generalized Linear and Nonlinear Models,

Chair: Rob K ass.

(1) Ruey Tsay and R obert McCulloch, Uni-

versity of Chicago: Bayesian Analysis of Au-

toregressive Time Series.

(2) Christian Ritter, Univer sity of Wiscon-

sin: Sampling Based Inference in Non Li near

Regression.

(3) William DuMouch el, BBN Software,

Boston: Application of the Gibbs Sampler to

Variance Component Modeling.

(4) James Albert, Bowling Green Univer-

sity and Sidhartha Chib, Washington Univer-

sity, St. Louis: Bayesian Regression Analysis

of Binary D ata.

(5) Ed win Green and William Strawder-

man, Rutgers University: Bayes E stimates for

the Linear Model with Unequal Variances.

(b) Maximum Likelihood and Weighted Boots-

trapping, Chair: George Casella.

(1) Adrian Raftery and Michael Newton,

University of Washington: Approximate Baye-

sian Infere nce by the Weighted Bootstrap.

(2) Charles Geyer, Universlty of Chicago:

Monte Carlo Maximum Likelihood via Gibbs

Sampling.

(3) Elizabeth Thompson, University of

Washington: Stochastic Simulation for Com-

plex Genetic Analysis.

(c) Panel Discussion—Future of Bayesian Infer-

ence Using Stochastic Simulation, Chair: Prem

Gael.

• Panel—Jim Berger, Alan Gelfand and Ad -

rian Smith.

ACKNOWLEDGMENTS

We are grateful for comments and suggestions

from Brad Carlin, Olivier Capp´e, David Spiegelhal-

ter, Alan Gelfand, Peter Green, Jun Liu, Sharon

McGrayn e, Peter M¨uller, Gareth Roberts and Adrian

Smith. Christian Robert’s work was partly done dur-

ing a visit to the Queensland University of Technol-

ogy, Brisbane, and the author is grateful to Ker-

rie Mengersen for her hospitality and support. Sup-

ported by the Agence Nationale de la Recherche

(ANR, 212, rue de Bercy 75012 Paris) through the

2006–2008 project ANR=05-BLAN-0299 Adap’MC.

Supported by NSF Grants DMS-06-31632, SES-06-

31588 and MMS 1028329.

REFERENCES

Albert, J. H. and Chib, S. (1993). Bayesian analysis of bi-

nary and polychotomous resp on se data. J. Amer. Statist.

Assoc. 88 669–679.

MR1224394

Andrieu, C., de Freitas, N., Doucet, A. and Jordan, M.

(2004). An introduction t o MCMC for machine learning.

Machine Learning 50 5–43.

Athreya, K. B. and Ney, P. (1978). A new approach to

the limit theory of recurrent Markov chains. Trans. Amer.

Math. Soc. 245 493–501.

MR0511425

Athreya, K. B. , Doss, H. and Sethuraman, J. (1996). On

the convergence of the Markov chain simulation method.

Ann. Statist. 24 69–100.

MR1389881

Barker, A. (1965). Monte Carlo calculations of the radial

distribution functions for a proton electron p lasma. Aust.

J. Physics 18 119–133.

Berthelsen, K. and Møller, J. (2003). Likelihood and

non-parametric Bayesian MCMC inference for spatial point

processes based on perfect simulation and path sampling.

Scand. J. Statist. 30 549–564.

Besag, J. (1974). Spatial interaction and the statistical anal-

ysis of lattice systems (with discussion). J. Roy. Statist.

Soc. Ser. B 36 192–236.

MR0373208

Besag, J. (1975). Statistical analysis of non-lattice data. The

Statistician 24 179–195.

Besag, J. (1986). On the statistical analysis of dirty pictures.

J. Roy. Statist. Soc. Ser. B 48 259–302.

MR0876840

Besag, J. and C li f ford, P. (1989). Generalized Monte Car-

lo signiﬁcance tests. Biometrika 76 633–642.

MR1041408

Besag, J., York, J. and Molli

´

e, A. (1991). Bayesian image

restoration, with two applications in spatial statistics (with

discussion). Ann. I nst. Statist. Math. 43 1–59.

MR1105822

Billingsley, P. ( 1995). Probability and Measure, 3rd ed. Wi-

ley, New York.

MR1324786

Broniatowski, M., Ce le ux, G. and Diebolt, J. (1984).

Reconnaissance de m´elanges de densit´es par un algorithme

d’apprentissage probab iliste. In Data Analysis and Infor-

matics, I II (Versailles, 1983) 359–373. N orth-Holland, Am-

sterdam.

MR0787647

Brooks, S. P., Giudici, P. and Roberts, G. O. (2003). Ef-

ﬁcient construction of reversible jump Markov chain Monte

Carlo proposal distributions (with discussion). J. R. Stat.

Soc. Ser. B Stat. Methodol. 65 3–55.

MR1959092

Capp

´

e, O. , Robert, C. P. and Ryd

´

en, T. (2003). Reversible

jump, birth-and-death and more general continuous time

Markov chain Monte Carlo samplers. J. R. Stat. Soc. Ser.

B Stat. Methodol. 65 679–700. MR1998628

Carlin, B. and Chib, S. (1995). Bayesian model choice

through Markov chain Monte Carlo. J. Roy. Statist. Soc.

Ser. B 57 473–484.

Carlin, B., Gelfand , A. an d Smith, A. (1992). Hierarchical

Bayesian analysis of change point problems. Appl. Statist.

41 389–405.

12 C. ROBERT AND G. CASELLA

Carpenter, J., Clifford, P. and Fernhead, P. (1997).

Building robust simulation-based ﬁlters for evolving

datasets. Technical report, Dept. Statistics, Oxford

Univ.

Casella, G. and Ge orge, E. I. (1992). Explaining the Gibbs

sampler. Amer. Statist. 46 167–174.

MR1183069

Casella, G., Lavine, M. and Robert, C. P. (2001). Ex-

plaining the perfect sampler. Amer. Statist. 55 299–305.

MR1939363

Celeux, G. and Diebolt, J. (1985). The SEM algorithm:

A probabilistic teacher algorithm derived from the EM al-

gorithm for the mixture problem. Comput. Statist. Quater

2 73–82.

Celeux, G. and Diebolt, J. (1990). Une version de type

recuit simul´e de l’algorithme EM. C. R. Acad. Sci. Paris

S´er. I Math. 310 119–124.

MR1044628

Chan, K. and Geyer, C. (1994). Discussion of “Markov

chains for exploring posterior d istribu tion.” Ann. Statist.

22 1747–1758.

Chen, M.-H. , Shao, Q.-M. and Ibrahim, J. G. (2000).

Monte Carlo Methods in Bayesian Computation. Springer,

New York.

MR1742311

Churchill, G. (1995). Accurate restoration of DNA se-

quences (with discussion). In Case Studies in Bayesian

Statistics ( C. Gatsonis and J. S. Hodges, eds.) 2 90–148.

Springer, New York.

Cressie, N. A. C. (1993). Statistics for Spatial Data. Wiley,

New York. Revised reprint of the 1991 edition.

MR1239641

Damien, P., Wakefield, J. and Walker, S. (1999). Gibbs

sampling for Bayesian non-conjugate and hierarchical mod-

els by using auxiliary variables. J. R. Stat. Soc. Ser. B Stat.

Methodol. 61 331–344.

MR1680334

Del Moral, P., Doucet, A. and Jasra, A. (2006). Sequen-

tial Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat.

Methodol. 68 411–436.

MR2278333

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977).

Maximum likelihood from incomplete data via the EM al-

gorithm (with discussion). J. Roy. Statist. Soc. Ser. B 39

1–38.

MR0501537

Diebolt, J. and Robert, C. P. (1994). Estimation of ﬁnite

mixture distributions through Bayesian sampling. J. Roy.

Statist. Soc. Ser. B 56 363–375.

MR1281940

Dimakos, X. K. (2001). A guid e to exact simulation. Inter-

nat. Statist. Rev. 69 27–48.

Doucet, A., de Freitas, N. and Gordon, N. (2001). Se-

quential Monte Carlo Methods in Practice. Springer, New

York.

MR1847783

Doucet, A., Godsill, S. and Andrieu, C . (2000). On se-

quential Monte Carlo sampling methods for Bayesian ﬁl-

tering. Statist. Comput. 10 197–208.

Dupuis, J. A. (1995). Bayesian estimation of movement

and survival probabilities from capture–recapture data.

Biometrika 82 761–772.

MR1380813

Eckhardt, R. (1987). Stan Ulam, John von Neumann, and

the Monte Carlo method. Los Alamos Sci . 15 Special Issue

131–137. MR0935772

Ehrman, J. R., Fosdick, L. D. and Handscomb, D. C.

(1960). Computation of order parameters in an Ising lattice

by the Monte Carlo method. J. Math. Phys. 1 547–558.

MR0117974

Fern

´

andez, R., Ferrari, P. and Ga rcia, N. L. (1999).

Perfect simulation for interacting point processes, loss n et -

works and Ising models. Technical report, Laboratoire

Raphael Salem, Univ. de Rouen.

Fill, J. A. (1998a). An interruptible algorithm for perfect

sampling via Markov chains. Ann. Appl . Probab. 8 131–

162.

MR1620346

Fill, J. (1998b). The move-to front rule: A case study for

two perfect sampling algorithms. Prob. Eng. Info. Sci 8

131–162.

Fismen, M. (1998). Exact simulation using Markov chains.

Technical Report 6/98, Institutt for Matematiske Fag,

Oslo. Diploma-thesis.

Gelfand, A. E. and Dey, D . K. (1994). Bayesian model

choice: Asymptotics and exact calculations. J. Roy. Statist.

Soc. Ser. B 56 501–514.

MR1278223

Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-ba-

sed approaches to calculating marginal densities. J. Amer.

Statist. Assoc. 85 398–409.

MR1141740

Gelfand, A. E., Smith, A. F. M. and Le e, T.-M. (1992).

Bayesian analysis of constrained parameter and truncated

data problems using Gibbs sampling. J. Amer. Statist. As-

soc. 87 523–532.

MR1173816

Gelfand, A., Hills, S., Racine-Poon, A . and S mith, A.

(1990). Illustration of Bayesian inference in normal data

mod els using Gibbs sampling. J. Amer. Statist. Assoc. 85

972–982.

Gelman, A. and Rubin, D. (1992). Inference from itera-

tive simulation using multiple sequences (with discussion).

Statist. Sci. 7 457–511.

Geman, S . and Geman, D. (1984). Stochastic relaxation,

Gibbs distributions and the Bayesian restoration of images.

IEEE Trans. Pattern Anal. Mach. Intell . 6 721–741.

George, E. and McC ulloch, R. (1993). Variable selection

via Gibbbs sampling. J. Amer. Statist. Assoc. 88 881–889.

George, E. I. an d Robert, C. P. (1992). Capture–recapture

estimation via Gibbs sampling. Biometrika 79 677–683.

MR1209469

Geyer, C. (1992). Practical Monte Carlo Markov chain (with

discussion). Statist. Sci. 7 473–511.

Geyer, C. J. and M øller, J. (1994). Simulation proced ures

and likelihood inference for spatial point processes. Scand.

J. Statist. 21 359–373.

MR1310082

Geyer, C. and Thompson, E. (1995). Annealing Markov

chain Monte Carlo with applications to ancestral inference.

J. Amer. Statist. Assoc. 90 909–920.

Gilks, W. (1992). Derivative-free adaptive rejection sam-

pling for Gibbs sampling. In Bayesian Statistics 4

(J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M.

Smith, eds.) 641–649. Oxford Univ. Press, Oxford.

Gilks, W. R. and Ber zuini, C. (2001). Following a moving

target—Monte Carlo inference for dynamic Bayesian mod-

els. J. R. Stat. Soc. Ser. B Stat. Methodol. 63 127–146.

MR1811995

Gilks, W. , Best, N. and Tan, K. (1995). Adaptive rejection

Metropolis sampling within Gibbs sampling. Appl. Statist.

44 455–472.

Gilks, W. R., Roberts, G. O. and Sahu, S. K. (1998).

Adaptive Markov chain Monte Carlo through regeneration.

J. Amer. Statist. Assoc. 93 1045–1054.

MR1649199

A SH ORT HISTORY OF MARKOV CHAIN MONTE CARLO 13

Gordon, N., Salmond, J. and Smith, A. ( 1993). A novel

approach to non-linear/non-Gaussian Bayesian state esti-

mation. IEEE Proceedings on Radar and Signal Processing

140 107–113.

Green, P. J. (1995). Reversible jump Markov chain Monte

Carlo computation and Bayesian model determination.

Biometrika 82 711–732.

MR1380810

Guihenneuc-Jouyaux, C. and Robert, C. P. ( 1998). Dis-

cretization of continuous Markov chains and Markov chain

Monte Carlo convergence assessment. J. Amer. Statist. As-

soc. 93 1055–1067.

MR1649200

Hammersley, J. M. (1974). D iscussion of Mr Besag’s paper.

J. Roy. Statist. Soc. Ser. B 36 230–231.

Hammersley, J. M. and Handscomb, D. C. (1964). Monte

Carlo Methods. Wiley, New York.

Hammersley, J. M. and Morton, K. W. (1954). Poor

man’s Monte Carlo. J. Roy. Statist. Soc. Ser. B. 16 23–

38. MR0064475

Handschin, J. E. and Mayne, D . Q. (1969). Monte Carlo

techniques to estimate the conditional expectation in

multi-stage non-linear ﬁltering. Internat. J. Control 9 547–

559.

MR0246490

Hastings, W. (1970). Monte Carlo sampling methods using

Markov chains and their application. Biometrika 57 97–

109.

Hitchcock, D. B. (2003). A history of the Metropolis–

Hastings algorithm. Amer. Statist. 57 254–257. MR2037852

Hobert, J. P. and Casella, G. (1996). The eﬀect of im-

proper priors on Gibbs sampling in hierarchical linear

mixed models. J. Amer. Statist. Assoc. 91 1461–1473.

MR1439086

Hobert, J. P. and Marchev, D. (2008). A theoretical

comparison of the data augmentation, marginal augmen-

tation and PX-DA algorithms. Ann. Statist. 36 532–554.

MR2396806

Hobert, J. P., Jones, G. L., Presnell, B. and Rosen-

thal, J. S. (2002). On the applicability of regenerative

simulation in Markov chain Monte Carlo. Biometrika 89

731–743.

MR1946508

Jones, G. L. and Hobert, J. P. (2001). Honest exploration

of intractable probability distributions via Markov chain

Monte Carlo. Statist. Sci. 16 312–334. MR1888447

Jones, G. L., Haran, M., Ca ffo, B. S . and Neath, R.

(2006). Fixed-width output analysis for Markov chain

Monte Carlo. J. Amer. Statist. Assoc. 101 1537–1547.

MR2279478

Kemeny, J. G. and Snell, J. L. (1960). Finite Markov

Chains. Van Nostrand, Princeton, NJ.

MR0115196

Kendall, W. S. and Møller, J. (2000). Perfect simula-

tion u sing dominating processes on ordered spaces, with

application to locally stable point processes. Adv. in Appl.

Probab. 32 844–865.

MR1788098

Kipnis, C. and Varadhan, S. R. S. (1986). Central limit

theorem for additive functionals of reversible Markov pro-

cesses and applications to simple exclusions. Comm. Math.

Phys. 104 1–19.

MR0834478

Kirkpatrick, S., Gelatt, C. D. Jr. and Vecchi, M. P.

(1983). Optimization by simulated annealing. Science 220

671–680. MR0702485

Kitagawa, G. (1996). Monte Carlo ﬁlter and smo other for

non-Gaussian nonlinear state space models. J. Comput.

Graph. Statist. 5 1–25. MR1380850

Kong, A., Liu, J. and Wong, W. (1994). Sequential im-

putations and Bayesian missing data problems. J. Amer.

Statist. Assoc. 89 278–288.

Kuhn, T. (1996). The Structure of Scientiﬁc Revolutions, 3rd

ed. Univ. Chicago Press, Chicago.

Landau, D. P. and Binder, K. (2005). A Guide to Monte

Carlo Simulations in Statistical Physics. Cambridge Univ.

Press, Cambridge.

Lange, N., Carlin, B. P. and Gelfand, A. E. (1992). Hier-

archical Bayes models for the progression of HIV infection

using longitudinal CD4 T-cell numbers. J. Amer. Statist.

Assoc. 87 615–626.

Lawrence, C. E., Altschul, S. F., Boguski, M. S.,

Liu, J. S., Neuwald, A. F. and Wootton, J. C. (1993).

Detecting subtle sequence signals: A Gibbs sampling strat-

egy for multiple alignment. Science 262 208–214.

Liu, J. and Chen, R. (1995). Blind deconvolution via sequen-

tial imputations. J. Amer. Statist. Assoc. 90 567–576.

Liu, C. and Rubin, D. B. (1994). The ECME algorithm:

A simple extension of EM and ECM with faster mon otone

convergence. Biometrika 81 633–648.

MR1326414

Liu, J. S. and Wu, Y. N. (1999). Parameter expansion for

data augmentation. J. Amer. Statist. Assoc. 94 1264–1274.

MR1731488

Liu, J. S., Wong, W. H. and Kong, A. (1994). Covari-

ance structure of the Gibbs sampler with applications to

the comparisons of estimators and augmentation schemes.

Biometrika 81 27–40.

MR1279653

Liu, J. S., Wong, W. H. and Kong, A. (1995). Covari-

ance structure and convergence rate of the Gibbs sampler

with various scans. J. Roy. Statist. Soc. Ser. B 57 157–169.

MR1325382

Madras, N. and Slade, G . (1993). The Self-Avoiding Walk.

Birkh¨auser, Boston, MA.

MR1197356

Marin, J.-M. and Robert, C. P. (2007). Bayesian Core:

A Practical Approach to Computational Bayesian Statis-

tics. Springer, New York . MR2289769

Marshall, A. (1965). The use of multi-stage sampling

schemes in Monte Carlo computations. In Symposium on

Monte Carlo Methods. Wiley, New York.

Meng, X.-L. and Rubin, D. B. (1993). Maximum likelihood

estimation via th e ECM algorithm: A general framework.

Biometrika 80 267–278.

MR1243503

Meng, X.-L. and van Dyk, D . A. (1999). Seeking eﬃcient

data au gmentation schemes via conditional and marginal

augmentation. Biometrika 86 301–320. MR1705351

Mengersen, K. L. and Tweedie, R. L. (1996). Rates of con-

vergence of the Hastings and Metropolis algorithms. Ann.

Statist. 24 101–121.

MR1389882

Metropolis, N. (1987). The beginning of th e Monte

Carlo method. Los Alamos Sci. 15 Special Issue 125–130.

MR0935771

Metropolis, N. and U lam, S. (1949). The Monte Carlo

metho d. J. Amer. Statist. Assoc. 44 335–341. MR0031341

Metropolis, N., Rosenbluth, A., Rosenbluth, M.,

Teller, A. and Teller, E. (1953). Eq uations of state

14 C. ROBERT AND G. CASELLA

calculations by fast computing machines. J. Chem. Phys.

21 1087–1092.

Møller, J. and Waagepetersen, R. (2003). Statistical In-

ference and Simulation for Spatial Point Processes. Chap-

man & Hall/CRC, Bo ca Raton.

Moussouris, J. (1974). Gibbs and Markov random systems

with constraints. J. Statist. Phys. 10 11–33.

MR0432132

Mykland, P., Tierney, L. and Yu, B. (1995). Regeneration

in Markov chain samplers. J. Amer. Statist. Assoc. 90 233–

241. MR1325131

Neal, R. ( 1996). Sampling from multimodal distributions us-

ing tempered transitions. Statist. Comput. 6 353–356.

Neal, R. M. (2003). Slice sampling (with discussion). Ann.

Statist. 31 705–767.

MR1994729

Pearl, J. (1987). Evidential reasoning using stochastic sim-

ulation of causal models. Artiﬁcial Intelligence 32 245–257.

MR0885357

Peskun, P. H. (1973). Optimum Monte-Carlo sampling using

Markov chains. Biometrika 60 607–612. MR0362823

Peskun, P. H. (1981). Guidelines for choosing the transi-

tion matrix in Monte Carlo metho ds using Markov chains.

J. Comput. Phys. 40 327–344.

MR0617102

Propp, J. G. and Wilson, D. B. (1996). Exact sampling

with coupled Markov chains and applications to statisti-

cal mechanics. In Proceedings of the Seventh International

Conference on Random Structures and Algorithms (At-

lanta, GA, 1995) 9 223–252.

MR1611693

Qian, W. and Titterington, D. M. (1990). Parameter es-

timation for hidden Gibb s chains. Statist. Probab. Lett. 10

49–58. MR1060302

Raftery, A. and Banfield, J. (1991). Stopping the Gibbs

sampler, th e use of morp hology, and other issues in spatial

statistics. Ann. Inst. Statist. Math. 43 32–43.

Richardson, S. and Green, P. J. (1997). On Bayesian anal-

ysis of mixtures with an unknown number of components

(with discussion). J. Roy. Statist. Soc. Ser. B 59 731–792.

MR1483213

Ripley, B. D. (1987). Stochastic Simulation. Wiley, New

York. MR0875224

Robert, C. P. (1995). Convergence control methods for

Markov chain Monte Carlo algorithms. Statist. Sci. 10 231–

253.

MR1390517

Robert, C. P. and Casella, G. (2004). Monte Carlo Sta-

tistical Methods, 2nd ed. Sprin ger, New York.

MR2080278

Roberts, G. O. and Rosenthal, J. S. (1999). Convergence

of slice sampler Markov chains. J. R. Stat. Soc. Ser. B Stat.

Methodol. 61 643–660.

MR1707866

Roberts, G. O. and Rosenthal, J. S. (2007). Coupling

and ergod icity of adaptive Markov chain Monte Carlo al-

gorithms. J. Appl. Probab. 44 458–475. MR2340211

Roberts, G. O., Gelman, A. and Gilks, W . R. (1997).

Weak convergence and optimal scaling of random walk

Metropolis algorithms. Ann. Appl. Probab. 7 110–120.

MR1428751

Rosenbluth, M. and Rosenbluth, A. (1955). Monte Carlo

calculation of the average extension of molecular chains.

J. Chem. Phys. 23 356–359.

Rosenthal, J. S. (1995). Minorization conditions and con-

vergence rates for Markov chain Monte Carlo. J. Amer.

Statist. Assoc. 90 558–566.

MR1340509

Rubin, D. (1978). Multiple imputation in sample surveys:

A phenomenological Bayesian approach to nonresponse. In

Imputation and Editing of Faulty or Missing Survey Data.

U.S. Department of Commerce, Washington, DC.

Smith, A. F. M. and Gelfand, A. E. (1992). Bayesian

statistics without tears: A sampling-resampling perspec-

tive. Amer. Statist. 46 84–88.

MR1165566

Stephens, D. A. (1994). Bayesian retrospective multiple-

changepoint identiﬁcation. Appl. Statist. 43 159–178.

Stephens, M. (2000). Bayesian analysis of mixture mod-

els with an unknown number of components—an alterna-

tive to reversible jump methods. Ann. Statist. 28 40–74.

MR1762903

Stephens, D. A. and Smith, A. F. M. (1993). Bayesian

inference in multipoint gene mapping. Ann. Hum. Genetics

57 65–82.

Tanner, M. A. and Wong, W. H. (1987). The calculation

of posterior distributions by data augmentation (with dis-

cussion). J. Amer. Statist. Assoc. 82 528–550.

MR0898357

Tierney, L. (1994). Markov chains for exploring posterior

distributions (with discussion). Ann. Statist. 22 1701–1786.

MR1329166

Titterington, D. and Cox, D . R. (2001). Biometrika: One

Hundred Years. Oxford Univ. Press, Oxford.

MR1841263

Wakefield, J., Smith, A., Racine-Poon, A. and Gel-

fand, A. (1994). Bayesian analysis of linear and non-linear

population models using the Gibbs sampler. Appl. Statist.

43 201–222.

Wang, C. S., Rutledge, J. J. an d Gianola, D. (1993).

Marginal inferences about variance-components in a mixed

linear model using Gibbs sampling. Gen. Sel. Evol. 25 41–

62.

Wang, C. S., Rutledge, J. J. an d Gianola, D. (1994).

Bayesian analysis of mixed linear models via Gibbs sam-

pling with an application to litter size in Iberian pigs. Gen.

Sel. Evol. 26 91–115.

Wei, G. and Tanner, M. (1990). A Monte Carlo implemen-

tation of the EM algorithm and the poor man’s data aug-

mentation algorithm. J. Amer. Statist. Assoc. 85 699–704.

Zeger, S. L. and Karim, M. R. (1991). Generalized linear

mod els with random eﬀects; a Gibbs sampling approach.

J. Amer. Statist. Assoc. 86 79–86.

MR1137101

The ENIAC was the first electronic general-purpose computer; it was Turing-complete, and digital.
![Imgur](https://i.imgur.com/AzLlTfe.png)
Josiah Willard Gibbs (1839-1903) was an American scientist who among many many other things, helped create the field of statistical mechanics, which explains the laws of thermodynamics as consequences of the statistical properties of ensembles of the possible states of physical systems of many particles.
For more: https://en.wikipedia.org/wiki/Josiah_Willard_Gibbs
Los Alamos during World War 2 include led many of the efforts for the development of nuclear weapons (Manhattan project), and was responsible for the design and overall coordination of development. The lab was only officially revealed to the public in the post-WW 2 era.
Stanislaw Ulam (April 1909 - May 1984) was a Polish mathematician and nuclear physicist. Among many other accomplishments, he invented the Monte Carlo method of computation, and made the major realization that computers could be used to apply statistical methods to functions without known solutions. The ubiquitous software package used for MCMC, Stan, is named in honor of Stanislaw.
For more about Stanislaw: https://en.wikipedia.org/wiki/Stanislaw_Ulam
The MANIAC built under the direction of Nicholas Metropolis at the Los Alamos Scientific Laboratory and based on the von Neumann architecture of the IAS. The computer was used for atomic bomb design.
![](https://i.imgur.com/pQ6PxAY.png)
Julian Besag (March 1945 – August 2010) was a British statistician known for his work in spatial statistics, and Bayesian inference, and made major contributions to Markov Chain Monte Carlo algorithms.
His 1986 paper "On the Statistical Analysis of Dirty Pictures" was the most cited paper by a UK mathematical scientist in the 1980s.
For more: https://en.wikipedia.org/wiki/Julian_Besag
Christian Robert is a Professor at Université Paris-Dauphine (France) and part-time Professor at the University of Warwick, in their Department of Statistics. He is a former editor of the Journal of the Royal Statistical Society, and his work focuses on Bayesian and computational statistics.
George Casella (January 1951 – June 2012) was a Distinguished Professor in the Department of Statistics at the University of Florida, and made major contributions to Bayesian and empirical Bayes methods, helping to invent the Bayesian lasso.
A Markov chain is irreducible if it is possible to get to any state from any other state.
The simulated annealing algorithm is worth reading about, it is a probabilistic technique for approximating the global optimum of a given function, and is widely used, especially for combinatorial problems.
For more:
https://en.wikipedia.org/wiki/Simulated_annealing
John Von Neuman (December 1903 - February 1957) was Hungarian-American mathematician, physicist, computer scientist, and polymath, regarded as the foremost mathematician of his time. During World War 2 at Los Alamos, his hydrogen bomb work was done using computing, where he and Stanisław Ulam developed simulations on von Neumann's digital computers for the hydrodynamic computations. Together, they developed the Monte Carlo method, finding solutions to complicated problems by making approximations using random numbers.
For more about Von Neuman: https://en.wikipedia.org/wiki/John_von_Neumann￼