Gradient Descent is a method of finding the minimum of a function. This is extremely useful in machine learning and neural networks, as we want to find the combination of weights that minimises the error at the output of the network. We can treat the error function for a neural network with one hundred connections as a function with one hundred parameters: \( f(x_1, x_2, \ldots, x_{100}) \). The goal is to find the combination of \(x_i\) values that makes the function as close to zero as possible.
This is a difficult process to imagine in \(100\) dimensions, so let's imagine we have a neural network that has only two connections. We can then write the function we are trying to solve as \(f(x, y) = 0\), where \(x\) and \(y\) are the weights for the two connections. Each weight is a data point in the function we are trying to minimise.
In three dimensions, we can view the error function as a surface, and we are trying to find the \(x\) and \(y\) values of the surface's minimum.
There are three common types of gradient descent: Full Gradient Descent; Batch Gradient Descent; and Stochastic Gradient Descent.
In Full Gradient Descent, we find the error at every data point and take small steps proportional to the negative of the gradient:
\[
\mathbf{x_{i+1}}=\mathbf{x_i}-\gamma\nabla f(\mathbf{x_i})
\]
We tune how fast the function learns using the learning rate \(\gamma\).
Using this technique, we can find a local minimum of the function.
In the context of our three-dimensional function, we find the error with our current choice of \(x\) and \(y\), and then update both of them according to the full gradient descent update rule.
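As a rough sketch, the update rule above can be written in a few lines of Python for a two-parameter error function. The bowl-shaped function, starting point, and learning rate here are illustrative assumptions, not taken from any particular network:

```python
import numpy as np

def f(x):
    # Illustrative error surface: a bowl with its minimum at (1, -2)
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2

def grad_f(x):
    # Analytic gradient of f
    return np.array([2.0 * (x[0] - 1.0), 2.0 * (x[1] + 2.0)])

def full_gradient_descent(x0, gamma=0.1, steps=100):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        # x_{i+1} = x_i - gamma * grad f(x_i)
        x = x - gamma * grad_f(x)
    return x

print(full_gradient_descent([5.0, 5.0]))  # approaches (1, -2)
```

Every step moves both weights at once using the full gradient, which is what distinguishes this from the batch and stochastic variants.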
In Batch Gradient Descent, we split the dataset into batches of data points.
We then carry out full gradient descent on each batch, shuffling the data points each time.
\[
\mathbf{x'_{i+1}}=\mathbf{x'_i}-\gamma\nabla f(\mathbf{x'_i})
\quad \text{where} \quad
\mathbf{x'} \subseteq \mathbf{x}
\]
This process is often repeated multiple times.
Batch Gradient Descent can be useful if the dataset is extremely large, as we don't have to process all of the data at once.
In the context of our three-dimensional function, this is like finding and reducing the error in the \(x\) and \(y\) values separately.
Batch gradient descent doesn't make much sense in this context since we only have two possible data points.
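A minimal sketch of the batched update, following the framing above of splitting the values into shuffled batches and applying the update to each batch in turn. The quadratic error function, batch size, and learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Gradient of an illustrative bowl with its minimum at the origin
    return 2.0 * x

def batch_gradient_descent(x0, batch_size=2, gamma=0.1, epochs=50):
    x = np.array(x0, dtype=float)
    indices = np.arange(len(x))
    for _ in range(epochs):
        rng.shuffle(indices)  # shuffle the data points each pass
        for start in range(0, len(x), batch_size):
            batch = indices[start:start + batch_size]  # x' is a subset of x
            x[batch] -= gamma * grad_f(x)[batch]       # update only this batch
    return x

print(batch_gradient_descent([4.0, -3.0, 2.0, -1.0]))  # every entry approaches 0
```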
In Stochastic Gradient Descent, we pick one data point at random and adjust only that data point, leaving the rest unmodified.
We then repeat this many times until the error is small.
\[
x'_{i+1}=x'_i-\gamma\nabla f(x'_i)
\quad \text{where} \quad
x' \in \mathbf{x}
\]
Stochastic gradient descent usually converges faster (in terms of time) than full gradient descent, since instead of updating hundreds of parameters, only one parameter needs to be updated at each step.
The overall error in stochastic gradient descent tends to be greater than in full gradient descent.
This can be because the function reaches the optimal value and then oscillates around it.
In the context of our three-dimensional function, we randomly choose whether to update \(x\) or \(y\), and then update accordingly.
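Following the same framing, the stochastic variant picks one random entry per step and adjusts it alone. This is a sketch under the same assumed quadratic error function:

```python
import numpy as np

rng = np.random.default_rng(42)

def grad_f(x):
    # Gradient of an illustrative bowl with its minimum at the origin
    return 2.0 * x

def stochastic_gradient_descent(x0, gamma=0.1, steps=500):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        j = rng.integers(len(x))      # pick one random data point from x
        x[j] -= gamma * grad_f(x)[j]  # adjust only that entry
    return x

print(stochastic_gradient_descent([4.0, -3.0]))  # both entries approach 0
```

Each step is much cheaper than a full update, but the path towards the minimum is noisier, matching the oscillation described above.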
When carrying out a gradient descent method, the learning rate \(\gamma\) needs to be chosen carefully.
Choosing a value that is too high can cause the minimum of the function to be overshot, so the method may never converge.
Choosing a value that is too low will make convergence very slow.
Variable learning rates can be used to modify the learning rate under different conditions.
For example, if the error plateaus, this may be because the function is oscillating around the minimum; the learning rate is too high for it to get any closer.
In that case, the learning rate can be lowered.
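One way to sketch such a schedule is to halve the learning rate whenever the error has not improved for a few consecutive steps. The error function, patience, and decay factor below are all illustrative assumptions, and the default \(\gamma = 1.1\) is deliberately too high, so the method oscillates until the schedule lowers it:

```python
import numpy as np

def f(x):
    # Illustrative error function: a bowl with its minimum at the origin
    return float(np.sum(x ** 2))

def grad_f(x):
    return 2.0 * x

def descent_with_plateau_decay(x0, gamma=1.1, steps=100, patience=5, factor=0.5):
    x = np.array(x0, dtype=float)
    best_err, stale = f(x), 0
    for _ in range(steps):
        x = x - gamma * grad_f(x)
        err = f(x)
        if err < best_err:
            best_err, stale = err, 0  # still improving
        else:
            stale += 1
            if stale >= patience:
                gamma *= factor  # plateau detected: lower the learning rate
                stale = 0
    return x, gamma

x, final_gamma = descent_with_plateau_decay([1.0, 1.0])
```

With these numbers the error grows at first, the schedule halves \(\gamma\) after a few stale steps, and the method then converges.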
Due to the choice of learning rate and/or starting location, the gradient can grow so large that it causes an overflow error in the maths.
This is known as Gradient Explosion.
Ways to counter this include capping the maximum size of the new \(x\) and \(y\) values (known as gradient clipping), or decreasing the learning rate.
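A sketch of the clipping idea, capping the size of the gradient before each step. The steep quartic error function, threshold, and starting point are illustrative assumptions; with these numbers the uncapped update would overshoot further each step and eventually overflow:

```python
import numpy as np

def grad_f(x):
    # Gradient of an illustrative steep function f(x) = x^4,
    # which grows quickly away from the minimum at 0
    return 4.0 * x ** 3

def clipped_descent(x0, gamma=0.01, steps=500, clip=100.0):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        norm = np.linalg.norm(g)
        if norm > clip:
            g = g * (clip / norm)  # cap the gradient so each step stays bounded
        x = x - gamma * g
    return x

print(clipped_descent([10.0]))  # stays finite and heads towards 0
```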
You can see how the different parameters affect gradient descent by modifying them.
You can apply a varying learning rate and clipping if you start to experience gradient explosion.
You can view the resulting vectors.
Each of these has its uses, but not every one will be useful in all situations.
The average vector is useful if the gradient descent method oscillates around a minimum.
The best vector is useful if the final vector is not the most optimal solution.
This might happen if the learning rate is too high.
You can view the process of finding the minimum on the surface plot here.
Along with this, you can view a graph of the error in the error graph tab.