gradient terms for the quadratic cost have an extra $\sigma' = \sigma(1-\sigma)$ term in them. Suppose we average this over values for $\sigma$, $\int_0^1 \sigma(1-\sigma)\,d\sigma = 1/6$. We see that (very roughly) the quadratic cost learns an average of 6 times slower, for the same learning rate. This suggests that a reasonable starting point is to divide the learning rate for the quadratic cost by 6. Of course, this argument is far from rigorous, and shouldn't be taken too seriously. Still, it can sometimes be a useful starting point.
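In case that factor of $6$ seems to come out of nowhere, the average is just the elementary integral
\[
\int_0^1 \sigma(1-\sigma)\,d\sigma
  \;=\; \left[\frac{\sigma^2}{2} - \frac{\sigma^3}{3}\right]_0^1
  \;=\; \frac{1}{2} - \frac{1}{3}
  \;=\; \frac{1}{6},
\]
so, at the same learning rate, the quadratic-cost gradient is on average about one sixth the size, which is where the rough "6 times slower" figure comes from.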
and we train for 30 epochs. The interface to network2.py is slightly different from that of network.py, but it should still be clear
what is going on. You can, by the way, get documentation about
network2.py's interface by using commands such as
help(network2.Network.SGD) in a Python shell.
>>> import mnist_loader
>>> training_data, validation_data, test_data = \
... mnist_loader.load_data_wrapper()
>>> import network2
>>> net = network2.Network([784, 30, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)
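Here the positional arguments to SGD are the training data, the number of epochs (30), the mini-batch size (10), and the learning rate $\eta = 0.5$; setting monitor_evaluation_accuracy=True asks the network to report its classification accuracy on evaluation_data (here, test_data) after each epoch.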
Note, by the way, that the net.large_weight_initializer()
command is used to initialize the weights and biases in the
same way as described in Chapter 1. We need to run this
command because later in this chapter we'll change the default
weight initialization in our networks. The result from running
the above sequence of commands is a network with 95.49 percent accuracy. This is pretty close to the result we obtained in Chapter 1, 95.42 percent, using the quadratic cost.
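For reference, the Chapter 1 scheme that large_weight_initializer() restores draws every weight and bias from a Gaussian with mean 0 and standard deviation 1. Here is a minimal NumPy sketch of that idea, using a hypothetical sizes list rather than the network2.py source itself:

>>> import numpy as np
>>> sizes = [784, 30, 10]  # layer sizes, as passed to the Network constructor
>>> # One bias vector per non-input layer, entries drawn from N(0, 1)
>>> biases = [np.random.randn(y, 1) for y in sizes[1:]]
>>> # Weight matrices connecting consecutive layers, entries drawn from N(0, 1)
>>> weights = [np.random.randn(y, x)
...            for x, y in zip(sizes[:-1], sizes[1:])]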
Let's look also at the case where we use 100 hidden neurons, the cross-entropy, and otherwise keep the parameters the same. In this case we obtain an accuracy of 96.82 percent. That's a substantial improvement over the results from Chapter 1, where we obtained a classification accuracy of 96.59 percent, using the quadratic cost. That may look like a small change, but consider that the error rate has dropped from 3.41 percent to 3.18 percent. That is, we've eliminated about one in fourteen of
the original errors. That's quite a handy improvement.
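Concretely, a run like this can be reproduced with the same commands as before, changing only the size of the hidden layer and leaving the other parameters untouched:

>>> net = network2.Network([784, 100, 10], cost=network2.CrossEntropyCost)
>>> net.large_weight_initializer()
>>> net.SGD(training_data, 30, 10, 0.5, evaluation_data=test_data,
... monitor_evaluation_accuracy=True)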
It's encouraging that the cross-entropy cost gives us similar or
better results than the quadratic cost. However, these results
don't conclusively prove that the cross-entropy is a better
choice. The reason is that I've put only a little effort into
choosing hyper-parameters such as learning rate, mini-batch
size, and so on. For the improvement to be really convincing
we'd need to do a thorough job optimizing such hyper-