and try training with the same hyper-parameters:
>>> net = network2.Network([784, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
This gives an improved classification accuracy, 96.90 percent.
That's encouraging: a little more depth is helping. Let's add
another 30-neuron hidden layer:
>>> net = network2.Network([784, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
That doesn't help at all. In fact, the result drops back down to
96.57 percent, close to our original shallow network. And
suppose we insert one further hidden layer:
>>> net = network2.Network([784, 30, 30, 30, 30, 10])
>>> net.SGD(training_data, 30, 10, 0.1, lmbda=5.0,
... evaluation_data=validation_data, monitor_evaluation_accuracy=True)
The classification accuracy drops again, to 96.53 percent.
That's probably not a statistically significant drop, but it's not
encouraging, either.
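To see roughly why that 0.04-point gap isn't convincing, here's a quick back-of-the-envelope check (my own addition, not from the book's code): treating each of the 10,000 validation classifications as an independent trial, the statistical noise in the measured accuracy is several times larger than the gap itself.

# Rough estimate of the noise in an accuracy measured on the 10,000-image
# validation set, using the binomial standard error sqrt(p*(1-p)/n).
# Assumes the validation classifications are independent trials.
from math import sqrt

n = 10000          # validation set size
p = 0.9657         # measured accuracy of the three-hidden-layer network
se = sqrt(p * (1 - p) / n)

print("standard error: {:.2f} percentage points".format(100 * se))
# ~0.18 percentage points, several times larger than the 0.04-point drop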
This behaviour seems strange. Intuitively, extra hidden layers
ought to make the network able to learn more complex
classification functions, and thus do a better job classifying.
Certainly, things shouldn't get worse, since the extra layers can,
in the worst case, simply do nothing*. But that's not what's going on.

*See this later problem to understand how to build a hidden layer that does
nothing.
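In fact, it's easy to check numerically how an extra sigmoid layer could do (approximately) nothing. The sketch below is mine, not code from the book: it relies on the fact that for tiny weighted inputs, σ(z) ≈ 1/2 + z/4, so the following layer can undo the extra layer by rescaling its weights by 4/ε and shifting its biases.

# A numerical sketch (not part of network2.py) of a "do nothing" layer.
# With tiny incoming weights, sigmoid(z) is almost exactly 1/2 + z/4, so
# the next layer can recover the original activations by scaling its
# weights by 4/eps and adjusting its biases.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

a = np.random.rand(30)        # activations from the previous layer, in [0, 1)
eps = 1e-3                    # small weight scale (illustrative choice)

h = sigmoid(eps * a)          # the extra layer: weights eps * I, biases 0
a_recovered = (h - 0.5) * 4.0 / eps   # what the following layer can reconstruct

print(np.max(np.abs(a - a_recovered)))   # error on the order of 1e-7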
So what is going on? Let's assume that the extra hidden layers
really could help in principle, and the problem is that our
learning algorithm isn't finding the right weights and biases.
We'd like to figure out what's going wrong in our learning
algorithm, and how to do better.
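One concrete way to probe this is to measure how large the gradient of the cost is for each neuron, which is the quantity behind the plot described next. Here's a rough sketch of how one might compute it (my own, assuming, as in the book's repository, that mnist_loader.load_data_wrapper() returns the three data sets and that network2.Network.backprop(x, y) returns the per-layer gradient lists (nabla_b, nabla_w)):

# Average size of dC/db for each hidden neuron over a small batch of
# training examples -- a crude measure of how fast each neuron is learning.
# Assumes the book's mnist_loader.py and network2.py are on the path.
import numpy as np
import mnist_loader
import network2

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
net = network2.Network([784, 30, 30, 10])

speeds = [np.zeros(b.shape) for b in net.biases]
batch = list(training_data)[:100]
for x, y in batch:
    nabla_b, nabla_w = net.backprop(x, y)
    for s, nb in zip(speeds, nabla_b):
        s += np.abs(nb) / len(batch)

# speeds[0] and speeds[1] hold the per-neuron "bar lengths" for the two
# hidden layers; speeds[-1] is the output layer.
print("hidden layer 1:", float(np.mean(speeds[0])))
print("hidden layer 2:", float(np.mean(speeds[1])))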
To get some insight into what's going wrong, let's visualize how
the network learns. Below, I've plotted part of a [784, 30, 30, 10]
network, i.e., a network with two hidden layers, each
containing 30 hidden neurons. Each neuron in the diagram has
a little bar on it, representing how quickly that neuron is
changing as the network learns. A big bar means the neuron's
weights and bias are changing rapidly, while a small bar means