CHAPTER 4
A visual proof that neural nets can compute any function
One of the most striking facts about neural networks is that they can compute any function at all. That is, suppose someone hands you some complicated, wiggly function, $f(x)$:
No matter what the function, there is guaranteed to be a neural network so that for every possible input, $x$, the value $f(x)$ (or some close approximation) is output from the network, e.g.:
This result holds even if the function has many inputs, $f = f(x_1, \ldots, x_m)$, and many outputs. For instance, here's a network computing a function with $m = 3$ inputs and $n = 2$ outputs:
This result tells us that neural networks have a kind of
universality. No matter what function we want to compute, we
know that there is a neural network which can do the job.
What's more, this universality theorem holds even if we restrict
our networks to have just a single layer intermediate between
the input and the output neurons - a so-called single hidden
layer. So even very simple network architectures can be
extremely powerful.
The universality theorem is well known by people who use
neural networks. But why it's true is not so widely understood.
Most of the explanations available are quite technical. For
instance, one of the original papers proving the result*
*Approximation by superpositions of a sigmoidal function, by George Cybenko
(1989). The result was very much in the air at the time, and several groups
proved closely related results. Cybenko's paper contains a useful discussion of
much of that work. Another important early paper is Multilayer feedforward
networks are universal approximators, by Kurt Hornik, Maxwell Stinchcombe,
and Halbert White (1989). This paper uses the Stone-Weierstrass theorem to
arrive at similar results.
did so using the Hahn-Banach theorem, the Riesz
Representation theorem, and some Fourier analysis. If you're a
mathematician the argument is not difficult to follow, but it's
not so easy for most people. That's a pity, since the underlying
reasons for universality are simple and beautiful.
In this chapter I give a simple and mostly visual explanation of
the universality theorem. We'll go step by step through the
underlying ideas. You'll understand why it's true that neural
networks can compute any function. You'll understand some of
the limitations of the result. And you'll understand how the
result relates to deep neural networks.
To follow the material in the chapter, you do not need to have
read earlier chapters in this book. Instead, the chapter is
structured to be enjoyable as a self-contained essay. Provided
you have just a little basic familiarity with neural networks, you
should be able to follow the explanation. I will, however,
provide occasional links to earlier material, to help fill in any
gaps in your knowledge.
Universality theorems are a commonplace in computer science,
so much so that we sometimes forget how astonishing they are.
But it's worth reminding ourselves: the ability to compute an
arbitrary function is truly remarkable. Almost any process you
can imagine can be thought of as function computation.
Consider the problem of naming a piece of music based on a
short sample of the piece. That can be thought of as computing
a function. Or consider the problem of translating a Chinese
text into English. Again, that can be thought of as computing a
function*
*Actually, computing one of many functions, since there are often many
acceptable translations of a given piece of text.
. Or consider the problem of taking an mp4 movie file and
generating a description of the plot of the movie, and a
discussion of the quality of the acting. Again, that can be
thought of as a kind of function computation*
*Ditto the remark about translation and there being many possible functions.
. Universality means that, in principle, neural networks can do
all these things and many more.
Of course, just because we know a neural network exists that
can (say) translate Chinese text into English, that doesn't mean
we have good techniques for constructing or even recognizing
such a network. This limitation applies also to traditional
universality theorems for models such as Boolean circuits. But,
as we've seen earlier in the book, neural networks have
powerful algorithms for learning functions. That combination
of learning algorithms + universality is an attractive mix. Up to
now, the book has focused on the learning algorithms. In this
chapter, we focus on universality, and what it means.
Two caveats
Before explaining why the universality theorem is true, I want
to mention two caveats to the informal statement "a neural
network can compute any function".
First, this doesn't mean that a network can be used to exactly
compute any function. Rather, we can get an approximation
that is as good as we want. By increasing the number of hidden
neurons we can improve the approximation. For instance,
earlier I illustrated a network computing some function $f(x)$ using three hidden neurons. For most functions only a low-quality approximation will be possible using three hidden
neurons. By increasing the number of hidden neurons (say, to
five) we can typically get a better approximation:
And we can do still better by further increasing the number of
hidden neurons.
To make this statement more precise, suppose we're given a function $f(x)$ which we'd like to compute to within some desired accuracy $\epsilon > 0$. The guarantee is that by using enough hidden neurons we can always find a neural network whose output $g(x)$ satisfies $|g(x) - f(x)| < \epsilon$, for all inputs $x$. In other words, the approximation will be good to within the desired accuracy for every possible input.
The second caveat is that the class of functions which can be approximated in the way described are the continuous functions. If a function is discontinuous, i.e., makes sudden, sharp jumps, then it won't in general be possible to
approximate using a neural net. This is not surprising, since
our neural networks compute continuous functions of their
input. However, even if the function we'd really like to compute
is discontinuous, it's often the case that a continuous
approximation is good enough. If that's so, then we can use a
neural network. In practice, this is not usually an important
limitation.
Summing up, a more precise statement of the universality
theorem is that neural networks with a single hidden layer can
be used to approximate any continuous function to any desired
precision. In this chapter we'll actually prove a slightly weaker
version of this result, using two hidden layers instead of one. In
the problems I'll briefly outline how the explanation can, with a
few tweaks, be adapted to give a proof which uses only a single
hidden layer.
Universality with one input and one
output
To understand why the universality theorem is true, let's start
by understanding how to construct a neural network which
approximates a function with just one input and one output:
It turns out that this is the core of the problem of universality.
Once we've understood this special case it's actually pretty easy
to extend to functions with many inputs and many outputs.
To build insight into how to construct a network to compute ,
let's start with a network containing just a single hidden layer,
with two hidden neurons, and an output layer containing a
single output neuron:
f
To get a feel for how components in the network work, let's
focus on the top hidden neuron. In the diagram below, click on
the weight, $w$, and drag the mouse a little ways to the right to increase $w$. You can immediately see how the function
computed by the top hidden neuron changes:
As we learnt earlier in the book, what's being computed by the
hidden neuron is $\sigma(wx + b)$, where $\sigma(z) \equiv 1/(1 + e^{-z})$ is the sigmoid function. Up to now, we've made frequent use of this
algebraic form. But for the proof of universality we will obtain
more insight by ignoring the algebra entirely, and instead
manipulating and observing the shape shown in the graph.
This won't just give us a better feel for what's going on, it will
also give us a proof*
*Strictly speaking, the visual approach I'm taking isn't what's traditionally
thought of as a proof. But I believe the visual approach gives more insight into
why the result is true than a traditional proof. And, of course, that kind of
insight is the real purpose behind a proof. Occasionally, there will be small
gaps in the reasoning I present: places where I make a visual argument that is
plausible, but not quite rigorous. If this bothers you, then consider it a
challenge to fill in the missing steps. But don't lose sight of the real purpose: to
understand why the universality theorem is true.
of universality that applies to activation functions other than
the sigmoid function.
To get started on this proof, try clicking on the bias, $b$, in the
diagram above, and dragging to the right to increase it. You'll
see that as the bias increases the graph moves to the left, but its
shape doesn't change.
Next, click and drag to the left in order to decrease the bias.
You'll see that as the bias decreases the graph moves to the
right, but, again, its shape doesn't change.
Next, decrease the weight to around $2$ or $3$. You'll see that as
you decrease the weight, the curve broadens out. You might
need to change the bias as well, in order to keep the curve in-
frame.
Finally, increase the weight up past $w = 100$. As you do, the curve gets steeper, until eventually it begins to look like a step function. Try to adjust the bias so the step occurs near $x = 0.3$.
The following short clip shows what your result should look
like. Click on the play button to play (or replay) the video:
We can simplify our analysis quite a bit by increasing the
weight so much that the output really is a step function, to a
very good approximation. Below I've plotted the output from
the top hidden neuron when the weight is $w = 999$. Note that
this plot is static, and you can't change parameters such as the
weight.
It's actually quite a bit easier to work with step functions than
general sigmoid functions. The reason is that in the output
layer we add up contributions from all the hidden neurons. It's
easy to analyze the sum of a bunch of step functions, but rather
more difficult to reason about what happens when you add up
a bunch of sigmoid shaped curves. And so it makes things
much easier to assume that our hidden neurons are outputting
step functions. More concretely, we do this by fixing the weight $w$ to be some very large value, and then setting the position of
the step by modifying the bias. Of course, treating the output as
a step function is an approximation, but it's a very good
approximation, and for now we'll treat it as exact. I'll come
back later to discuss the impact of deviations from this
approximation.
At what value of $x$ does the step occur? Put another way, how
does the position of the step depend upon the weight and bias?
To answer this question, try modifying the weight and bias in
the diagram above (you may need to scroll back a bit). Can you
figure out how the position of the step depends on $w$ and $b$?
With a little work you should be able to convince yourself that
the position of the step is proportional to $b$, and inversely proportional to $w$.
In fact, the step is at position $s = -b/w$, as you can see by modifying the weight and bias in the following diagram:
It will greatly simplify our lives to describe hidden neurons using just a single parameter, $s$, which is the step position, $s = -b/w$. Try modifying $s$ in the following diagram, in order to get used to the new parameterization:
As noted above, we've implicitly set the weight $w$ on the input to be some large value - big enough that the step function is a very good approximation. We can easily convert a neuron parameterized in this way back into the conventional model, by choosing the bias $b = -ws$.
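As a quick numerical check, here's a small sketch (my own, not part of the chapter's interactive diagrams; it assumes NumPy is available) of that conversion: pick a step position s, fix a large weight, choose the bias b = -ws, and the neuron's output is, to a very good approximation, a step at x = s.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_output(x, s, w=1000.0):
    # Neuron described by its step position s: a large weight w and bias b = -w*s,
    # so that sigma(w*x + b) is, to a very good approximation, a step at x = s.
    b = -w * s
    return sigmoid(w * x + b)

x = np.linspace(0.0, 1.0, 11)
print(np.round(hidden_output(x, s=0.4), 3))
# Values near 0 for x < 0.4 and near 1 for x > 0.4 (exactly 0.5 at x = 0.4).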
Up to now we've been focusing on the output from just the top
hidden neuron. Let's take a look at the behavior of the entire
network. In particular, we'll suppose the hidden neurons are
computing step functions parameterized by step points $s_1$ (top neuron) and $s_2$ (bottom neuron). And they'll have respective output weights $w_1$ and $w_2$. Here's the network:
What's being plotted on the right is the weighted output $w_1 a_1 + w_2 a_2$ from the hidden layer. Here, $a_1$ and $a_2$ are the outputs from the top and bottom hidden neurons, respectively*
*Note, by the way, that the output from the whole network is $\sigma(w_1 a_1 + w_2 a_2 + b)$, where $b$ is the bias on the output neuron. Obviously, this isn't the same as the weighted output from the hidden layer, which is what we're plotting here. We're going to focus on the weighted output from the hidden layer right now, and only later will we think about how that relates to the output from the whole network.
. These outputs are denoted with $a$'s because they're often
known as the neurons' activations.
Try increasing and decreasing the step point $s_1$ of the top
hidden neuron. Get a feel for how this changes the weighted
output from the hidden layer. It's particularly worth
understanding what happens when $s_1$ goes past $s_2$. You'll see
that the graph changes shape when this happens, since we have
moved from a situation where the top hidden neuron is the first
to be activated to a situation where the bottom hidden neuron
is the first to be activated.
Similarly, try manipulating the step point $s_2$ of the bottom
hidden neuron, and get a feel for how this changes the
combined output from the hidden neurons.
Try increasing and decreasing each of the output weights.
Notice how this rescales the contribution from the respective
hidden neurons. What happens when one of the weights is
zero?
Finally, try setting $w_1$ to be $0.8$ and $w_2$ to be $-0.8$. You get a "bump" function, which starts at point $s_1$, ends at point $s_2$, and has height $0.8$. For instance, the weighted output might look like this:
Of course, we can rescale the bump to have any height at all. Let's use a single parameter, $h$, to denote the height. To reduce clutter I'll also remove the "$s_1 = \ldots$" and "$w_1 = \ldots$" notations.
Try changing the value of $h$ up and down, to see how the height
of the bump changes. Try changing the height so it's negative,
and observe what happens. And try changing the step points to
see how that changes the shape of the bump.
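In code, the same bump can be assembled from a pair of such step neurons, with output weights h and -h (again a sketch of my own, assuming NumPy; the particular step points and height are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bump(x, s1, s2, h, w=1000.0):
    # Weighted output from a pair of hidden neurons: approximately a bump of
    # height h that starts at x = s1 and ends at x = s2 (assumes s1 < s2).
    a1 = sigmoid(w * (x - s1))   # top neuron: steps up at s1
    a2 = sigmoid(w * (x - s2))   # bottom neuron: steps up at s2
    return h * a1 - h * a2       # output weights h and -h

x = np.linspace(0.0, 1.0, 11)
print(np.round(bump(x, s1=0.3, s2=0.6, h=0.8), 3))
# Roughly 0 outside [0.3, 0.6] and roughly 0.8 inside.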
You'll notice, by the way, that we're using our neurons in a way
that can be thought of not just in graphical terms, but in more
conventional programming terms, as a kind of if-then-else
statement, e.g.:
if input >= step point:
    add 1 to the weighted output
else:
    add 0 to the weighted output
For the most part I'm going to stick with the graphical point of
view. But in what follows you may sometimes find it helpful to
switch points of view, and think about things in terms of if-
then-else.
We can use our bump-making trick to get two bumps, by gluing
two pairs of hidden neurons together into the same network:
I've suppressed the weights here, simply writing the values of $h$ for each pair of hidden neurons. Try increasing and decreasing both $h$ values, and observe how it changes the graph. Move the
bumps around by changing the step points.
More generally, we can use this idea to get as many peaks as we
want, of any height. In particular, we can divide the interval $[0, 1]$ up into a large number, $N$, of subintervals, and use $N$ pairs of hidden neurons to set up peaks of any desired height. Let's see how this works for $N = 5$. That's quite a few neurons,
so I'm going to pack things in a bit. Apologies for the
complexity of the diagram: I could hide the complexity by
abstracting away further, but I think it's worth putting up with
a little complexity, for the sake of getting a more concrete feel
for how these networks work.
You can see that there are five pairs of hidden neurons. The step points for the respective pairs of neurons are $0, 1/5$, then $1/5, 2/5$, and so on, out to $4/5, 5/5$. These values are fixed - they make it so we get five evenly spaced bumps on the graph.
Each pair of neurons has a value of $h$ associated to it. Remember, the connections output from the neurons have weights $h$ and $-h$ (not marked). Click on one of the $h$ values, and drag the mouse to the right or left to change the value. As you do so, watch the function change. By changing the output
weights we're actually designing the function!
Contrariwise, try clicking on the graph, and dragging up or
down to change the height of any of the bump functions. As you
change the heights, you can see the corresponding change in $h$ values. And, although it's not shown, there is also a change in the corresponding output weights, which are $+h$ and $-h$.
In other words, we can directly manipulate the function appearing in the graph on the right, and see that reflected in the $h$ values on the left. A fun thing to do is to hold the mouse
button down and drag the mouse from one side of the graph to
the other. As you do this you draw out a function, and get to
watch the parameters in the neural network adapt.
Time for a challenge.
Let's think back to the function I plotted at the beginning of the
chapter:
I didn't say it at the time, but what I plotted is actually the function
$f(x) = 0.2 + 0.4x^2 + 0.3x \sin(15x) + 0.05\cos(50x)$,   (113)
plotted over $x$ from $0$ to $1$, and with the $y$ axis taking values from $0$ to $1$.
That's obviously not a trivial function.
You're going to figure out how to compute it using a neural
network.
In our networks above we've been analyzing the weighted combination $\sum_j w_j a_j$ output from the hidden neurons. We now know how to get a lot of control over this quantity. But, as I
noted earlier, this quantity is not what's output from the
network. What's output from the network is $\sigma(\sum_j w_j a_j + b)$, where $b$ is the bias on the output neuron. Is there some way we can achieve control over the actual output from the network?
The solution is to design a neural network whose hidden layer has a weighted output given by $\sigma^{-1} \circ f(x)$, where $\sigma^{-1}$ is just the inverse of the $\sigma$ function. That is, we want the weighted output from the hidden layer to be:
If we can do this, then the output from the network as a whole will be a good approximation to $f(x)$*
*Note that I have set the bias on the output neuron to $0$.
.
Your challenge, then, is to design a neural network to
approximate the goal function shown just above. To learn as
much as possible, I want you to solve the problem twice. The
first time, please click on the graph, directly adjusting the
heights of the different bump functions. You should find it
fairly easy to get a good match to the goal function. How well
you're doing is measured by the average deviation between the
goal function and the function the network is actually
computing. Your challenge is to drive the average deviation as low as possible. You complete the challenge when you drive the average deviation to $0.40$ or below.
Once you've done that, click on "Reset" to randomly re-
initialize the bumps. The second time you solve the problem,
resist the urge to click on the graph. Instead, modify the $h$ values on the left-hand side, and again attempt to drive the average deviation to $0.40$ or below.
You've now figured out all the elements necessary for the network to approximately compute the function $f(x)$! It's only a
coarse approximation, but we could easily do much better,
merely by increasing the number of pairs of hidden neurons,
allowing more bumps.
In particular, it's easy to convert all the data we have found
back into the standard parameterization used for neural
networks. Let me just recap quickly how that works.
The first layer of weights all have some large, constant value, say $w = 1000$.
The biases on the hidden neurons are just $b = -ws$. So, for instance, for the second hidden neuron $s = 0.2$ becomes $b = -1000 \times 0.2 = -200$.
The final layer of weights are determined by the $h$ values. So, for instance, the value you've chosen above for the first $h$, $-1.4$, means that the output weights from the top two hidden neurons are $-1.4$ and $1.4$, respectively. And so on, for the entire layer of output weights.
Finally, the bias on the output neuron is $0$.
That's everything: we now have a complete description of a
neural network which does a pretty good job computing our
original goal function. And we understand how to improve the
quality of the approximation by increasing the number of
hidden neurons.
What's more, there was nothing special about our original goal function, $f(x) = 0.2 + 0.4x^2 + 0.3x\sin(15x) + 0.05\cos(50x)$. We could have used this procedure for any continuous function from $[0, 1]$ to $[0, 1]$. In essence, we're using our single-layer
neural networks to build a lookup table for the function. And
we'll be able to build on this idea to provide a general proof of
universality.
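To make the lookup-table idea concrete, here's a sketch (my own code, assuming NumPy; the weight scale, the number of sample points, and the midpoint sampling are arbitrary choices) that assembles the whole construction for the goal function above: $N$ pairs of hidden neurons produce bumps whose heights follow $\sigma^{-1}$ of $f$, and the output neuron (bias $0$) then approximates $f$ itself. The maximum deviation shrinks as $N$ grows.

import numpy as np

def sigmoid(z):
    # Logistic function; clipping z avoids overflow warnings for very large weights.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def sigmoid_inverse(y):
    return np.log(y / (1.0 - y))

def f(x):
    # The goal function from the chapter; its values stay inside (0, 1) on [0, 1].
    return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

def network_output(x, N):
    # Single hidden layer with N pairs of step neurons. The weighted output from
    # the hidden layer approximates sigma^{-1}(f(x)); the output neuron, with
    # bias 0, then gives approximately f(x).
    w = 100.0 * N                # large compared to the subinterval width 1/N
    weighted = np.zeros_like(x)
    for j in range(N):
        s1, s2 = j / N, (j + 1) / N              # step points of the j-th pair
        h = sigmoid_inverse(f((s1 + s2) / 2))    # bump height on this subinterval
        weighted += h * sigmoid(w * (x - s1)) - h * sigmoid(w * (x - s2))
    return sigmoid(weighted)

x = np.linspace(0.01, 0.99, 1000)
for N in (5, 50, 500):
    deviation = np.max(np.abs(network_output(x, N) - f(x)))
    print(f"N = {N:3d}   max deviation = {deviation:.3f}")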
Many input variables
Let's extend our results to the case of many input variables.
This sounds complicated, but all the ideas we need can be
understood in the case of just two inputs. So let's address the
two-input case.
We'll start by considering what happens when we have two
inputs to a neuron:
Here, we have inputs $x$ and $y$, with corresponding weights $w_1$ and $w_2$, and a bias $b$ on the neuron. Let's set the weight $w_2$ to $0$, and then play around with the first weight, $w_1$, and the bias, $b$, to see how they affect the output from the neuron:
As you can see, with $w_2 = 0$ the input $y$ makes no difference to the output from the neuron. It's as though $x$ is the only input.
Given this, what do you think happens when we increase the weight $w_1$ to $w_1 = 100$, with $w_2$ remaining $0$? If you don't
immediately see the answer, ponder the question for a bit, and
see if you can figure out what happens. Then try it out and see
if you're right. I've shown what happens in the following movie:
Just as in our earlier discussion, as the input weight gets larger
the output approaches a step function. The difference is that
now the step function is in three dimensions. Also as before, we
can move the location of the step point around by modifying
the bias. The actual location of the step point is $s_x \equiv -b/w_1$.
Let's redo the above using the position of the step as the
parameter:
Here, we assume the weight on the $x$ input has some large value - I've used $w_1 = 1000$ - and the weight $w_2 = 0$. The number on the neuron is the step point, and the little $x$ above the number reminds us that the step is in the $x$ direction. Of course, it's also possible to get a step function in the $y$ direction, by making the weight on the $y$ input very large (say, $w_2 = 1000$), and the weight on the $x$ equal to $0$, i.e., $w_1 = 0$:
The number on the neuron is again the step point, and in this case the little $y$ above the number reminds us that the step is in the $y$ direction. I could have explicitly marked the weights on the $x$ and $y$ inputs, but decided not to, since it would make the diagram rather cluttered. But do keep in mind that the little $y$ marker implicitly tells us that the $y$ weight is large, and the $x$ weight is $0$.
We can use the step functions we've just constructed to compute a three-dimensional bump function. To do this, we use two neurons, each computing a step function in the $x$ direction. Then we combine those step functions with weight $h$ and $-h$, respectively, where $h$ is the desired height of the bump. It's all illustrated in the following diagram:
Try changing the value of the height, $h$. Observe how it relates to the weights in the network. And see how it changes the height of the bump function on the right.
Also, try changing the step point $0.30$ associated to the top hidden neuron. Witness how it changes the shape of the bump. What happens when you move it past the step point $0.70$ associated to the bottom hidden neuron?
We've figured out how to make a bump function in the $x$ direction. Of course, we can easily make a bump function in the $y$ direction, by using two step functions in the $y$ direction. Recall that we do this by making the weight large on the $y$
input, and the weight $0$ on the $x$ input. Here's the result:
This looks nearly identical to the earlier network! The only thing explicitly shown as changing is that there's now little $y$ markers on our hidden neurons. That reminds us that they're producing $y$ step functions, not $x$ step functions, and so the weight is very large on the $y$ input, and zero on the $x$ input, not vice versa. As before, I decided not to show this explicitly, in order to avoid clutter.
Let's consider what happens when we add up two bump functions, one in the $x$ direction, the other in the $y$ direction, both of height $h$:
To simplify the diagram I've dropped the connections with zero weight. For now, I've left in the little $x$ and $y$ markers on the hidden neurons, to remind you in what directions the bump functions are being computed. We'll drop even those markers later, since they're implied by the input variable.
Try varying the parameter $h$. As you can see, this causes the output weights to change, and also the heights of both the $x$ and $y$ bump functions.
What we've built looks a little like a tower function:
If we could build such tower functions, then we could use them
to approximate arbitrary functions, just by adding up many
towers of different heights, and in different locations:
Of course, we haven't yet figured out how to build a tower
function. What we have constructed looks like a central tower,
of height $2h$, with a surrounding plateau, of height $h$.
But we can make a tower function. Remember that earlier we
saw neurons can be used to implement a type of if-then-else
statement:
if input >= threshold:
    output 1
else:
    output 0
That was for a neuron with just a single input. What we want is
to apply a similar idea to the combined output from the hidden
neurons:
if combined output from hidden neurons >= threshold:
    output 1
else:
    output 0
If we choose the threshold appropriately - say, a value of $3h/2$,
which is sandwiched between the height of the plateau and the
height of the central tower - we could squash the plateau down
to zero, and leave just the tower standing.
Can you see how to do this? Try experimenting with the
following network to figure it out. Note that we're now plotting
the output from the entire network, not just the weighted
output from the hidden layer. This means we add a bias term to the weighted output from the hidden layer, and apply the sigma function. Can you find values for $h$ and $b$ which produce a tower? This is a bit tricky, so if you think about this for a while and remain stuck, here's two hints: (1) To get the output neuron to show the right kind of if-then-else behaviour, we need the input weights (all $h$ or $-h$) to be large; and (2) the value of $b$ determines the scale of the if-then-else threshold.
With our initial parameters, the output looks like a flattened version of the earlier diagram, with its tower and plateau. To get the desired behaviour, we increase the parameter $h$ until it becomes large. That gives the if-then-else thresholding behaviour. Second, to get the threshold right, we'll choose $b = -3h/2$. Try it, and see how it works!
Here's what it looks like, when we use $h = 10$:
Even for this relatively modest value of $h$, we get a pretty good tower function. And, of course, we can make it as good as we want by increasing $h$ still further, and keeping the bias as $b = -3h/2$.
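Here's a minimal numerical sketch of that tower (my own code, assuming NumPy; the rectangle, h = 10, and the first-layer weight are arbitrary choices): the hidden layer adds an x bump and a y bump of height h, and the output neuron with bias -3h/2 squashes the plateau, leaving only the tower.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower_2d(x, y, h=10.0, w=1000.0):
    # Hidden layer: an x bump on (0.3, 0.7) and a y bump on (0.4, 0.6), each of
    # height h. Output neuron: bias -3h/2, sandwiched between the plateau (height h)
    # and the central tower (height 2h), so only the tower survives the squashing.
    bump_x = h * sigmoid(w * (x - 0.3)) - h * sigmoid(w * (x - 0.7))
    bump_y = h * sigmoid(w * (y - 0.4)) - h * sigmoid(w * (y - 0.6))
    return sigmoid(bump_x + bump_y - 1.5 * h)

print(round(float(tower_2d(0.5, 0.5)), 3))   # inside the rectangle -> close to 1
print(round(float(tower_2d(0.5, 0.9)), 3))   # on the plateau       -> close to 0
print(round(float(tower_2d(0.1, 0.9)), 3))   # outside entirely     -> close to 0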
Let's try gluing two such networks together, in order to
compute two different tower functions. To make the respective
roles of the two sub-networks clear I've put them in separate
boxes, below: each box computes a tower function, using the
technique described above. The graph on the right shows the
weighted output from the second hidden layer, that is, it's a
weighted combination of tower functions.
In particular, you can see that by modifying the weights in the
final layer you can change the height of the output towers.
The same idea can be used to compute as many towers as we
like. We can also make them as thin as we like, and whatever
height we like. As a result, we can ensure that the weighted
output from the second hidden layer approximates any desired
function of two variables:
In particular, by making the weighted output from the second hidden layer a good approximation to $\sigma^{-1} \circ f$, we ensure the output from our network will be a good approximation to any desired function, $f$.
What about functions of more than two variables?
Let's try three variables $x_1, x_2, x_3$. The following network can be used to compute a tower function in four dimensions:
Here, the $x_1, x_2, x_3$ denote inputs to the network. The $s_1, t_1$ and so on are step points for neurons - that is, all the weights in the first layer are large, and the biases are set to give the step points $s_1, t_1, s_2, \ldots$. The weights in the second layer alternate $+h, -h$, where $h$ is some very large number. And the output bias is $-5h/2$.
This network computes a function which is $1$ provided three conditions are met: $x_1$ is between $s_1$ and $t_1$; $x_2$ is between $s_2$ and $t_2$; and $x_3$ is between $s_3$ and $t_3$. The network is $0$ everywhere else. That is, it's a kind of tower which is $1$ in a little region of input space, and $0$ everywhere else.
By gluing together many such networks we can get as many
towers as we want, and so approximate an arbitrary function of
three variables. Exactly the same idea works in $m$ dimensions. The only change needed is to make the output bias $(-m + 1/2)h$,
in order to get the right kind of sandwiching behavior to level
the plateau.
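And a sketch of the same recipe for m inputs (again my own code, assuming NumPy; the box and the constants h and w are arbitrary): each input feeds a pair of step neurons with second-layer weights +h and -h, and the output bias (-m + 1/2)h levels everything except the tower.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tower(xs, intervals, h=10.0, w=1000.0):
    # Approximately 1 when every coordinate xs[j] lies inside its interval
    # (s_j, t_j), and approximately 0 otherwise. Each input feeds a pair of step
    # neurons with second-layer weights +h and -h; the output bias is (-m + 1/2)*h.
    m = len(xs)
    total = 0.0
    for x_j, (s_j, t_j) in zip(xs, intervals):
        total += h * sigmoid(w * (x_j - s_j)) - h * sigmoid(w * (x_j - t_j))
    return sigmoid(total + (-m + 0.5) * h)

box = [(0.2, 0.4), (0.5, 0.7), (0.1, 0.9)]           # (s_j, t_j) for three inputs
print(round(float(tower([0.3, 0.6, 0.5], box)), 3))  # inside the box     -> close to 1
print(round(float(tower([0.3, 0.6, 0.95], box)), 3)) # one coordinate out -> close to 0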
Okay, so we now know how to use neural networks to
approximate a real-valued function of many variables. What
about vector-valued functions $f(x_1, \ldots, x_m) \in \mathbb{R}^n$? Of course,
such a function can be regarded as just $n$ separate real-valued functions, $f^1(x_1, \ldots, x_m), f^2(x_1, \ldots, x_m)$, and so on. So we create a network approximating $f^1$, another network for $f^2$, and so
on. And then we simply glue all the networks together. So that's
also easy to cope with.
Problem
We've seen how to use networks with two hidden layers to
approximate an arbitrary function. Can you find a proof
showing that it's possible with just a single hidden layer?
As a hint, try working in the case of just two input
variables, and showing that: (a) it's possible to get step
functions not just in the $x$ or $y$ directions, but in an
arbitrary direction; (b) by adding up many of the
constructions from part (a) it's possible to approximate a
tower function which is circular in shape, rather than
rectangular; (c) using these circular towers, it's possible to
approximate an arbitrary function. To do part (c) it may
help to use ideas from a bit later in this chapter.
Extension beyond sigmoid neurons
We've proved that networks made up of sigmoid neurons can
compute any function. Recall that in a sigmoid neuron the inputs $x_1, x_2, \ldots$ result in the output $\sigma(\sum_j w_j x_j + b)$, where $w_j$ are the weights, $b$ is the bias, and $\sigma$ is the sigmoid function:
What if we consider a different type of neuron, one using some other activation function, $s(z)$:
That is, we'll assume that if our neurons have inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$ and bias $b$, then the output is $s(\sum_j w_j x_j + b)$.
We can use this activation function to get a step function, just as we did with the sigmoid. Try ramping up the weight in the following, say to $w = 100$:
Just as with the sigmoid, this causes the activation function to
contract, and ultimately it becomes a very good approximation
to a step function. Try changing the bias, and you'll see that we
can set the position of the step to be wherever we choose. And
so we can use all the same tricks as before to compute any
desired function.
What properties does $s(z)$ need to satisfy in order for this to work? We do need to assume that $s(z)$ is well-defined as $z \rightarrow -\infty$ and $z \rightarrow \infty$. These two limits are the two values taken on by our step function. We also need to assume that these limits are different from one another. If they weren't, there'd be no step, simply a flat graph! But provided the activation function $s(z)$ satisfies these properties, neurons based on such
an activation function are universal for computation.
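As a quick numerical illustration (my own sketch with a made-up activation function, assuming NumPy), any s(z) that has well-defined, distinct limits as z goes to minus and plus infinity turns into a (rescaled) step function once the weight is cranked up, which is all the construction above needs:

import numpy as np

def s(z):
    # An arbitrary made-up activation function with s(z) -> 0 as z -> -infinity
    # and s(z) -> 2 as z -> +infinity. Any function with distinct, well-defined
    # limits would do; this particular formula is just an example.
    return 1.0 + np.tanh(z) + 0.3 * np.exp(-z**2)

x = np.linspace(0.0, 1.0, 11)
step_at = 0.65
for w in (5.0, 1000.0):
    b = -w * step_at         # put the "step" at x = 0.65, just as with the sigmoid
    print(f"w = {w:6.0f}:", np.round(s(w * x + b), 2))
# With the large weight the output is essentially 0 below x = 0.65 and 2 above it:
# a step function (up to rescaling and shifting), which is all the argument needs.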
Problems
Earlier in the book we met another type of neuron known
as a rectified linear unit. Explain why such neurons don't
satisfy the conditions just given for universality. Find a
proof of universality showing that rectified linear units are
universal for computation.
Suppose we consider linear neurons, i.e., neurons with the
activation function $s(z) = z$. Explain why linear neurons
don't satisfy the conditions just given for universality.
Show that such neurons can't be used to do universal
computation.
Fixing up the step functions
Up to now, we've been assuming that our neurons can produce
step functions exactly. That's a pretty good approximation, but
it is only an approximation. In fact, there will be a narrow
window of failure, illustrated in the following graph, in which
the function behaves very differently from a step function:
In these windows of failure the explanation I've given for
universality will fail.
Now, it's not a terrible failure. By making the weights input to
the neurons big enough we can make these windows of failure
as small as we like. Certainly, we can make the window much
narrower than I've shown above - narrower, indeed, than our
eye could see. So perhaps we might not worry too much about
this problem.
Nonetheless, it'd be nice to have some way of addressing the
problem.
In fact, the problem turns out to be easy to fix. Let's look at the
fix for neural networks computing functions with just one input
and one output. The same ideas work also to address the
problem when there are more inputs and outputs.
In particular, suppose we want our network to compute some function, $f$. As before, we do this by trying to design our network so that the weighted output from our hidden layer of neurons is $\sigma^{-1} \circ f(x)$:
If we were to do this using the technique described earlier, we'd
use the hidden neurons to produce a sequence of bump
functions:
Again, I've exaggerated the size of the windows of failure, in
order to make them easier to see. It should be pretty clear that
if we add all these bump functions up we'll end up with a reasonable approximation to $\sigma^{-1} \circ f(x)$, except within the windows of failure.
Suppose that instead of using the approximation just described, we use a set of hidden neurons to compute an approximation to half our original goal function, i.e., to $\sigma^{-1} \circ f(x)/2$. Of course, this looks just like a scaled down version
of the last graph:
And suppose we use another set of hidden neurons to compute
an approximation to $\sigma^{-1} \circ f(x)/2$, but with the bases of the bumps shifted by half the width of a bump:
Now we have two different approximations to $\sigma^{-1} \circ f(x)/2$. If we add up the two approximations we'll get an overall approximation to $\sigma^{-1} \circ f(x)$. That overall approximation will still have failures in small windows. But the problem will be much less than before. The reason is that points in a failure window for one approximation won't be in a failure window for the other. And so the approximation will be a factor roughly $2$ better in those windows.
We could do even better by adding up a large number, $M$, of overlapping approximations to the function $\sigma^{-1} \circ f(x)/M$. Provided the windows of failure are narrow enough, a point will only ever be in one window of failure. And provided we're using a large enough number $M$ of overlapping
approximations, the result will be an excellent overall
approximation.
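Here's a small numerical sketch of that patching trick (my own code, assuming NumPy; the values of N, the first-layer weight, and the evaluation grid are arbitrary choices). It compares one full-scale bump approximation of sigma^{-1} of f with the sum of two half-scale copies whose bumps are shifted by half a width; the combined error should come out roughly a factor of two smaller, as the argument above predicts.

import numpy as np

def sigmoid(z):
    # Clipping z avoids overflow warnings when the weights are very large.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500.0, 500.0)))

def sigmoid_inverse(y):
    return np.log(y / (1.0 - y))

def f(x):
    return 0.2 + 0.4 * x**2 + 0.3 * x * np.sin(15 * x) + 0.05 * np.cos(50 * x)

N = 50          # number of bumps in each copy
W = 2000.0      # first-layer weight; sets the width of the windows of failure

def bumpy_approx(x, scale, offset):
    # Weighted hidden-layer output approximating scale * sigma^{-1}(f(x)),
    # built from bumps of width 1/N whose edges are shifted by `offset`.
    out = np.zeros_like(x)
    for j in range(-1, N + 1):                 # extra bumps so the edges stay covered
        s1 = j / N + offset
        s2 = s1 + 1.0 / N
        mid = np.clip((s1 + s2) / 2, 0.0, 1.0)
        h = scale * sigmoid_inverse(f(mid))
        out += h * sigmoid(W * (x - s1)) - h * sigmoid(W * (x - s2))
    return out

x = np.linspace(0.05, 0.95, 2000)
target = sigmoid_inverse(f(x))

one_copy = bumpy_approx(x, scale=1.0, offset=0.0)
two_copies = (bumpy_approx(x, scale=0.5, offset=0.0) +
              bumpy_approx(x, scale=0.5, offset=0.5 / N))   # shifted by half a bump width

print("max error, one full-scale approximation: ",
      round(float(np.max(np.abs(one_copy - target))), 3))
print("max error, two shifted half-scale copies:",
      round(float(np.max(np.abs(two_copies - target))), 3))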
Conclusion
The explanation for universality we've discussed is certainly
not a practical prescription for how to compute using neural
networks! In this, it's much like proofs of universality for NAND
gates and the like. For this reason, I've focused mostly on
trying to make the construction clear and easy to follow, and
not on optimizing the details of the construction. However, you
may find it a fun and instructive exercise to see if you can
improve the construction.
Although the result isn't directly useful in constructing
networks, it's important because it takes off the table the
question of whether any particular function is computable
using a neural network. The answer to that question is always
"yes". So the right question to ask is not whether any particular
function is computable, but rather what's a good way to
compute the function.
The universality construction we've developed uses just two
hidden layers to compute an arbitrary function. Furthermore,
as we've discussed, it's possible to get the same result with just
a single hidden layer. Given this, you might wonder why we
would ever be interested in deep networks, i.e., networks with
many hidden layers. Can't we simply replace those networks
with shallow, single hidden layer networks?
Chapter acknowledgments: Thanks to Jen Dodd and Chris Olah for many
discussions about universality in neural networks. My thanks, in particular, to
Chris for suggesting the use of a lookup table to prove universality. The
interactive visual form of the chapter is inspired by the work of people such as
Mike Bostock, Amit Patel, Bret Victor, and Steven Wittens.
While in principle that's possible, there are good practical
reasons to use deep networks. As argued in Chapter 1, deep
networks have a hierarchical structure which makes them
particularly well adapted to learn the hierarchies of knowledge
that seem to be useful in solving real-world problems. Put more
concretely, when attacking problems such as image
recognition, it helps to use a system that understands not just
individual pixels, but also increasingly more complex concepts:
from edges to simple geometric shapes, all the way up through
complex, multi-object scenes. In later chapters, we'll see
evidence suggesting that deep networks do a better job than
shallow networks at learning such hierarchies of knowledge. To
sum up: universality tells us that neural networks can compute
any function; and empirical evidence suggests that deep
networks are the networks best adapted to learn the functions
useful in solving many real-world problems.

Discussion

This sounds like "given enough time, any number can be factorized into primes". The obvious question of practicality is: How many neurons do I need to add, to increase the fidelity of the approximation by N bits? Importantly, is the growth exponential?