Optimiser

As explained in Train a neural network, the formula for updating the weights of a neural network is:

$$w_{new} = w_{old} - \alpha \frac{d\text{error}}{dw}$$

$\alpha$ is called the learning rate and controls how quickly we move through the weight space to minimise the loss. $\alpha$ changes during the training process and is set by an algorithm called the optimiser.
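As a rough sketch of what this update rule looks like in code, the step below applies it to a NumPy array of weights; the function name `sgd_step` and the example values are made up for illustration:

```python
import numpy as np

def sgd_step(w, grad, lr):
    """Apply one update: w_new = w_old - lr * d(error)/dw."""
    return w - lr * grad

# Hypothetical example: two weights and their error gradients.
w = np.array([0.5, -1.2])
grad = np.array([0.08, -0.3])
w = sgd_step(w, grad, lr=0.01)
print(w)  # weights nudged against the gradient
```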

There are multiple optimiser algorithms, including:

SGD

$\alpha$ decreases at a steady rate and tends to zero as the number of steps increases, for example:

$$\alpha = \frac{\alpha_{\text{start}}}{1 + \text{steps}}$$
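A minimal sketch of this kind of schedule, assuming the $\frac{\alpha_{\text{start}}}{1 + \text{steps}}$ decay above (the function name `decayed_lr` is made up):

```python
def decayed_lr(lr_start, step):
    """Learning rate that shrinks towards zero as the step count grows."""
    return lr_start / (1 + step)

# Hypothetical starting rate of 0.1: 0.1, 0.05, 0.033, 0.025, ...
for step in range(4):
    print(step, decayed_lr(0.1, step))
```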

Adam

The effective step size depends on the strength of recent updates, through the following equations:

$$m_{t+1} = \beta_1 m_t + (1 - \beta_1) \frac{d\text{error}}{dw}, \qquad v_{t+1} = \beta_2 v_t + (1 - \beta_2) \left(\frac{d\text{error}}{dw}\right)^2$$

$$\hat{m}_{t+1} = \frac{m_{t+1}}{1 - \beta_1^{t+1}}, \qquad \hat{v}_{t+1} = \frac{v_{t+1}}{1 - \beta_2^{t+1}}$$

$$w_{t+1} = w_t - \eta \frac{\hat{m}_{t+1}}{\sqrt{\hat{v}_{t+1}} + \epsilon}$$

$w_{t+1}$ is the new weight, $\eta$ is the base learning rate, $m$ and $v$ are moving averages of the gradient and of its square ($\hat{m}$ and $\hat{v}$ are their bias-corrected versions), and the $\beta$ values are fixed hyper-parameters chosen at the start of the training process based on trial and error and results from other papers.

The values $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are known to work well.
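Putting the equations and these default values together, here is a minimal sketch of one Adam step, assuming the standard formulation; the name `adam_step` and the stand-in gradient are made up for illustration:

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on weights w, given gradient grad and moment estimates m, v."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad**2        # moving average of the squared gradient
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # weight update
    return w, m, v

# Hypothetical usage on a small weight vector.
w = np.array([0.5, -1.2])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 4):
    grad = np.array([0.08, -0.3])  # stand-in gradient for this step
    w, m, v = adam_step(w, grad, m, v, t)
print(w)
```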