In the last chapter we learned that deep neural networks are
often much harder to train than shallow neural networks.
That's unfortunate, since we have good reason to believe that if
we could train deep nets they'd be much more powerful than
shallow nets. But while the news from the last chapter is
discouraging, we won't let it stop us. In this chapter, we'll
develop techniques which can be used to train deep networks,
and apply them in practice. We'll also look at the broader
picture, briefly reviewing recent progress on using deep nets
for image recognition, speech recognition, and other
applications. And we'll take a brief, speculative look at what the
future may hold for neural nets, and for artificial intelligence.
The chapter is a long one. To help you navigate, let's take a
tour. The sections are only loosely coupled, so provided you
have some basic familiarity with neural nets, you can jump to
whatever most interests you.
The main part of the chapter is an introduction to one of the
most widely used types of deep network: deep convolutional
networks. We'll work through a detailed example - code and all
- of using convolutional nets to solve the problem of classifying
handwritten digits from the MNIST data set:
We'll start our account of convolutional networks with the
shallow networks used to attack this problem earlier in the
book. Through many iterations we'll build up more and more
powerful networks. As we go we'll explore many powerful
techniques: convolutions, pooling, the use of GPUs to do far
more training than we did with our shallow networks, the
algorithmic expansion of our training data (to reduce
overfitting), the use of the dropout technique (also to reduce
overfitting), the use of ensembles of networks, and others. The
result will be a system that offers near-human performance. Of
the 10,000 MNIST test images - images not seen during
training! - our system will classify 9,967 correctly. Here's a
peek at the 33 images which are misclassified. Note that the
correct classification is in the top right; our program's
classification is in the bottom right:
Many of these are tough even for a human to classify. Consider,
for example, the third image in the top row. To me it looks
more like a "9" than an "8", which is the official classification.
Our network also thinks it's a "9". This kind of "error" is at the
very least understandable, and perhaps even commendable.
We conclude our discussion of image recognition with a survey
of some of the spectacular recent progress using networks
(particularly convolutional nets) to do image recognition.
The remainder of the chapter discusses deep learning from a
broader and less detailed perspective. We'll briefly survey other
models of neural networks, such as recurrent neural nets and
long short-term memory units, and how such models can be
applied to problems in speech recognition, natural language
processing, and other areas. And we'll speculate about the
future of neural networks and deep learning, ranging from
ideas like intention-driven user interfaces, to the role of deep
learning in artificial intelligence.
The chapter builds on the earlier chapters in the book, making
use of and integrating ideas such as backpropagation,
regularization, the softmax function, and so on. That said, you
don't need to have worked in detail through all the earlier
chapters to read this one. It will, however, help to have
read Chapter 1, on the basics of neural networks. When I use
concepts from Chapters 2 to 5, I provide links so you can
familiarize yourself, if necessary.
It's worth noting what the chapter is not. It's not a tutorial on
the latest and greatest neural networks libraries. Nor are we
going to be training deep networks with dozens of layers to
solve problems at the very leading edge. Rather, the focus is on
understanding some of the core principles behind deep neural
networks, and applying them in the simple, easy-to-understand
context of the MNIST problem. Put another way: the chapter is
not going to bring you right up to the frontier. Rather, the
intent of this and earlier chapters is to focus on fundamentals,
and so to prepare you to understand a wide range of current
work.
The chapter is currently in beta. I welcome notification of
typos, bugs, minor errors, and major misconceptions. Please
drop me a line if you spot such an error.
Introducing convolutional networks
In earlier chapters, we taught our neural networks to do a
pretty good job recognizing images of handwritten digits:
We did this using networks in which adjacent network layers
are fully connected to one another. That is, every neuron in the
network is connected to every neuron in adjacent layers:
In particular, for each pixel in the input image, we encoded the
pixel's intensity as the value for a corresponding neuron in the
input layer. For the 28 × 28 pixel images we've been using, this
means our network has 784 (= 28 × 28) input neurons. We then
trained the network's weights and biases so that the network's
output would - we hope! - correctly identify the input image:
'0', '1', '2', ..., '8', or '9'.
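To make the input encoding concrete, here is a minimal sketch in plain Python. It is not code from the book: the helper name `flatten_image` and the synthetic test image are assumptions made purely for illustration, and row-major ordering is just one reasonable choice of pixel-to-neuron assignment.

```python
# A minimal sketch: flatten a 28 x 28 grayscale image into 784 input
# activations. `flatten_image` is an illustrative helper, not code from
# the book, and the image below is synthetic rather than real MNIST data.

ROWS, COLS = 28, 28


def flatten_image(image):
    """Return the pixel intensities in row-major order as one flat list."""
    assert len(image) == ROWS and all(len(row) == COLS for row in image)
    return [pixel for row in image for pixel in row]


# Synthetic image with intensities in [0, 1], brighter toward the
# bottom-right corner.
image = [[(r + c) / (ROWS + COLS - 2) for c in range(COLS)]
         for r in range(ROWS)]

inputs = flatten_image(image)
print(len(inputs))  # 784

# The pixel at row r, column c becomes input neuron number r * COLS + c.
r, c = 3, 5
print(inputs[r * COLS + c] == image[r][c])  # True
```

Notice that once the image is flattened, nothing in the list itself records which entries were neighbours in the original grid.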
Our earlier networks work pretty well: we've obtained a
classification accuracy better than 98 percent, using training
and test data from the MNIST handwritten digit data set. But
upon reflection, it's strange to use networks with fully-
connected layers to classify images. The reason is that such a
network architecture does not take into account the spatial
structure of the images. For instance, it treats input pixels
which are far apart and close together on exactly the same
footing. Such concepts of spatial structure must instead be
inferred from the training data. But what if, instead of starting
with a network architecture which is tabula rasa, we used an
architecture which tries to take advantage of the spatial
structure? In this section I describe convolutional neural
networks*. These networks use a special architecture which is
particularly well-adapted to classify images. Using this
architecture makes convolutional networks fast to train. This,
in turn, helps us train deep, many-layer networks, which are
very good at classifying images. Today, deep convolutional
networks or some close variant are used in most neural networks
for image recognition.
*The origins of convolutional neural networks go back to the 1970s. But the
seminal paper establishing the modern subject of convolutional networks was
a 1998 paper, "Gradient-based learning applied to document recognition", by
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. LeCun has
since made an interesting remark on the terminology for convolutional nets:
"The [biological] neural inspiration in models like convolutional nets is very
tenuous. That's why I call them 'convolutional nets' not 'convolutional neural
nets', and why we call the nodes 'units' and not 'neurons'". Despite this
remark, convolutional nets use many of the same ideas as the neural networks
we've studied up to now: ideas such as backpropagation, gradient descent,
regularization, non-linear activation functions, and so on. And so we will
follow common practice, and consider them a type of neural network. I will use
the terms "convolutional neural network" and "convolutional net(work)"
interchangeably. I will also use the terms "[artificial] neuron" and "unit"
interchangeably.
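The earlier claim that a fully-connected layer treats near and distant pixels on the same footing can be checked with a small experiment. The sketch below is an illustration under stated assumptions, not code from the book: the helper `dense_layer`, the random weights, and the choice of 30 neurons are all hypothetical, and the activation function is left out because it does not affect the argument. We scramble the pixel positions with one fixed permutation, reorder each neuron's weights in the same way, and compare the weighted inputs.

```python
import random

# A minimal sketch (illustrative names, random data): a fully-connected
# layer has no built-in notion of which pixels are neighbours.


def dense_layer(weights, biases, inputs):
    """Weighted input of each neuron: sum over k of w[j][k] * x[k], plus b[j]."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


rng = random.Random(0)
n_inputs, n_neurons = 784, 30  # 30 is an arbitrary choice for this demo

x = [rng.random() for _ in range(n_inputs)]           # stand-in "image"
W = [[rng.gauss(0, 1) for _ in range(n_inputs)]       # random weights
     for _ in range(n_neurons)]
b = [rng.gauss(0, 1) for _ in range(n_neurons)]       # random biases

# Scramble the pixel positions with one fixed permutation, and reorder
# every neuron's weights with the same permutation.
perm = list(range(n_inputs))
rng.shuffle(perm)
x_shuffled = [x[k] for k in perm]
W_shuffled = [[row[k] for k in perm] for row in W]

z = dense_layer(W, b, x)
z_shuffled = dense_layer(W_shuffled, b, x_shuffled)

# The two results agree up to floating-point rounding.
max_diff = max(abs(u - v) for u, v in zip(z, z_shuffled))
print(max_diff < 1e-9)  # True
```

Because the scrambled layer computes the same weighted inputs, a fully-connected network could in principle learn equally well from images whose pixels have been consistently shuffled. Any sense of "nearby" has to be learned from the data, which is the limitation convolutional networks are designed to address.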
Convolutional neural networks use three basic ideas: local
receptive fields, shared weights, and pooling. Let's look at each
in turn.