Gradient descent is by far the most commonly used way to optimize neural networks [26]. It is an iterative optimization algorithm used to find the values of the parameters (coefficients) of a function that minimize a cost function. Although various algorithms have been developed to optimize gradient descent, they are usually used as black-box optimizers because practical explanations of their strengths and weaknesses are hard to come by.
Depending on how much data is used to compute the gradient of the objective function, gradient descent variants fall into two categories: batch gradient descent (BGD) and stochastic gradient descent (SGD). BGD is guaranteed to converge to the global minimum for convex error surfaces and to a local minimum for non-convex surfaces. However, BGD can be very time-consuming because it needs to calculate the gradients over the whole dataset to perform just a single update, and it is therefore intractable for datasets that do not fit in memory. In addition, BGD cannot be used to update the model online. In contrast, SGD performs one update at a time and thus avoids the redundant computations that BGD carries out for large datasets. As a result, SGD is usually much faster than BGD, and it can be used to learn the model online. The drawback of SGD is that its frequent, high-variance updates cause the objective function to fluctuate heavily. However, if the learning rate is slowly decreased over time, SGD shows the same convergence behaviour as BGD, almost certainly converging to a local minimum for non-convex optimization and to the global minimum for convex optimization.
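As a minimal illustration of this difference (not part of the original formulation), the following NumPy sketch contrasts one BGD step, which uses the gradient over the whole dataset, with one SGD epoch, which performs one update per training example. The gradient function `grad(theta, X, y)`, the learning rate value and the data arrays are assumptions for illustration only.

```python
import numpy as np

def bgd_step(theta, X, y, grad, lr=0.01):
    # Batch gradient descent: a single update computed from the whole dataset.
    return theta - lr * grad(theta, X, y)

def sgd_epoch(theta, X, y, grad, lr=0.01):
    # Stochastic gradient descent: one update per (shuffled) training example.
    for i in np.random.permutation(len(X)):
        theta = theta - lr * grad(theta, X[i:i+1], y[i:i+1])
    return theta
```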
Although SGD can often lead to good convergence, a few challenges need to be addressed. For instance, it is difficult to determine a proper learning rate and an annealing schedule, and it is hard to update features to different extents while avoiding suboptimal local minima. Ruder [26] outlines several algorithms that are widely used by the deep learning community to deal with these challenges, including Momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSprop, Adam, AdaMax and Nadam. Ruder also states that Adagrad, Adadelta, RMSprop and Adam can all significantly improve the robustness of SGD and do not require much manual tuning of the learning rate. These four optimizers are therefore selected and discussed in more detail in this paper.
3.1 Adagrad
Adagrad is a gradient-based optimizer that adapts the learning rate to the parameters, performing larger updates for infrequent parameters and smaller updates for frequent ones. It is therefore well suited to sparse data. Adagrad uses a different learning rate for every parameter \({\theta }_{i}\) at every time step t, so the gradient of the objective function \({g}_{t,i}\) with respect to the parameter \({\theta }_{i}\) at time step t is written as:
$${g}_{t,i}={\nabla }_{{\theta }_{t}}J\left({\theta }_{t,i}\right),$$
(16)
SGD updates every parameter \({\theta }_{i}\) at each time step t according to the following equation:
$$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}.$$
(17)
Adagrad modifies the general learning rate η at each time step t for every parameter \({\theta }_{i}\) based on the past gradients:
$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii} + \epsilon}} \cdot g_{t,i},$$
(18)
where \({G}_{t}\in {R}^{d\times d}\) is a diagonal matrix in which each diagonal element i,i is the sum of the squares of the gradients with respect to the parameter \({\theta }_{i}\) up to time step t, and \(\epsilon\) is a smoothing term used to avoid division by zero.
One of the main advantages of Adagrad is that it does not require manual tuning of the learning rate; the default value is set to 0.01. Its main drawback is that the accumulation of the squared gradients in the denominator causes the learning rate to shrink and eventually become infinitesimally small, at which point the algorithm can no longer acquire additional knowledge.
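A rough sketch of Eqs. (16)–(18) in NumPy is given below: the squared gradients are accumulated per parameter and used to scale the learning rate. The default learning rate of 0.01 follows the text, while the gradient function `grad(theta)`, the value of \(\epsilon\) and the number of steps are illustrative assumptions.

```python
import numpy as np

def adagrad(theta, grad, lr=0.01, eps=1e-8, steps=100):
    # G accumulates the squared gradients for every parameter (diagonal of G_t).
    G = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)                       # gradient g_t, Eq. (16)
        G += g ** 2                           # accumulate squared gradients
        theta = theta - lr / np.sqrt(G + eps) * g   # per-parameter update, Eq. (18)
    return theta
```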
3.2 Adadelta
To counteract the monotonically decreasing learning rate, an extension of Adagrad, named Adadelta, has been proposed. Instead of accumulating all past squared gradients, it restricts the accumulation to a fixed-size window of past gradients, with the sum of gradients recursively defined as a decaying average of all past squared gradients. Thus, the running average of the squared gradients of the objective function at time step t depends only on the previous average and the current gradient:
$${\text{E}}[g^{2} ]_{t} = \gamma {\text{E}}[g^{2} ]_{{t - 1}} + \left( {1 - \gamma } \right)g_{t}^{2} ,$$
(19)
where \(\gamma\) is a decay rate, similar to the momentum term, which is normally set to around 0.9 [26].
The SGD parameter update vector \({\Delta \theta }_{t}\) at each time step t is then:
$$\Delta \theta _{t} = - \eta \cdot{g}_{t,i} ,$$
(20)
$$\theta _{{t + 1}} = \theta _{t} + \Delta \theta _{t} .$$
(21)
According to the Adagrad update rule, simply replacing the diagonal matrix \({G}_{t}\) with the decaying average of past squared gradients \({E}{[{g}^{2}]}_{t}\) gives the parameter update vector of Adadelta:
$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot g_{t,i}.$$
(22)
As \(\sqrt{{E}{\left[{g}^{2}\right]}_{t}+\epsilon }\) is the root mean squared (RMS) error criterion of the gradient, it can then be written as:
$${\Delta \theta }_{t}=-\frac{\eta }{RMS{\left[g\right]}_{t}}\cdot{g}_{t,i}.$$
(23)
Since the update should have the same hypothetical units as the parameter, an exponentially decaying average of the squared parameter updates should be used in place of the learning rate:
$$E{\left[{\Delta \theta }^{2}\right]}_{t}=\gamma E{\left[{\Delta \theta }^{2}\right]}_{t-1}+\left(1-\gamma \right){\Delta \theta }_{t}^{2},$$
(24)
$$\Delta \theta _{t} = - \frac{{{\text{RMS}}\left[ {\Delta \theta } \right]_{{t - 1}} }}{{RMS\left[ g \right]_{t} }} \cdot{g}_{t} ,$$
(25)
$$\theta _{{t + 1}} = \theta _{t} + \Delta \theta _{t} .$$
(26)
With the Adadelta update rule, there is no need to set a default learning rate, since it has been eliminated from the update.
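A minimal sketch of the Adadelta recursion in Eqs. (19) and (24)–(26) is shown below: two running averages, one of squared gradients and one of squared updates, replace the global learning rate. The value \(\gamma = 0.9\) follows the text, while `grad(theta)`, the value of \(\epsilon\) and the number of steps are illustrative assumptions.

```python
import numpy as np

def adadelta(theta, grad, gamma=0.9, eps=1e-6, steps=100):
    Eg2 = np.zeros_like(theta)    # running average of squared gradients, Eq. (19)
    Edx2 = np.zeros_like(theta)   # running average of squared updates, Eq. (24)
    for _ in range(steps):
        g = grad(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2
        # RMS[dx]_{t-1} / RMS[g]_t scaling, Eq. (25)
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * g
        Edx2 = gamma * Edx2 + (1 - gamma) * dx ** 2
        theta = theta + dx                            # Eq. (26)
    return theta
```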
3.3 RMSprop
RMSprop is an adaptive learning rate method designed for neural networks that has grown in popularity in recent years. Similar to Adadelta, its central idea is to keep a moving average of the squared gradients for each weight and to divide the gradient by the square root of this mean square. Good default values for the decay parameter \(\gamma\) and the learning rate are 0.9 and 0.001, respectively:
$${\text{E}}\left[ {g^{2} } \right]_{t} = 0.9{\text{E}}\left[ {g^{2} } \right]_{{t - 1}} + 0.1g_{t}^{2} ,$$
(27)
$$\theta_{t+1} = \theta_t - \frac{0.001}{\sqrt{\text{E}[g^2]_t + \epsilon}} \cdot g_{t,i}.$$
(28)
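Following Eqs. (27)–(28), RMSprop can be sketched in the same style. The decay parameter 0.9 and learning rate 0.001 are the defaults quoted above; `grad(theta)`, the value of \(\epsilon\) and the number of steps are illustrative assumptions.

```python
import numpy as np

def rmsprop(theta, grad, lr=0.001, gamma=0.9, eps=1e-8, steps=100):
    Eg2 = np.zeros_like(theta)   # moving average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        Eg2 = gamma * Eg2 + (1 - gamma) * g ** 2     # Eq. (27)
        theta = theta - lr / np.sqrt(Eg2 + eps) * g  # Eq. (28)
    return theta
```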
3.4 Adam
Adaptive Moment Estimation (Adam) [27] is another method that computes adaptive learning rates for each parameter. Adam not only stores an exponentially decaying average of past squared gradients \({v}_{t}\), but also keeps an exponentially decaying average of past gradients \({m}_{t}\), as shown in Eqs. (29) and (30):
$$m_{t} = \beta _{1} m_{{t - 1}} + \left( {1 - \beta _{1} } \right)g_{t} ,$$
(29)
$$v_{t} = \beta _{2} v_{{t - 1}} + \left( {1 - \beta _{2} } \right)g_{t}^{2} ,$$
(30)
where \({m}_{t}\) is the estimate of the first moment (the mean) of the gradients and \({v}_{t}\) is the estimate of the second moment (the uncentered variance) of the gradients. As \({m}_{t}\) and \({v}_{t}\) are initialized as vectors of zeros, they are biased towards zero, especially during the initial time steps and when the decay rates are small. These biases are counteracted by computing bias-corrected first and second moment estimates:
$$\hat{m}_{t} = \frac{{m_{t} }}{{1 - \beta _{1}^{t} }},$$
(31)
$$\hat{v}_{t} = \frac{{v_{t} }}{{1 - \beta _{2}^{t} }}.$$
(32)
Therefore, the update rule of Adam can be derived as:
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \cdot \hat{m}_t.$$
(33)
The proposed values for \({\beta }_{1}\), \({\beta }_{2}\) and \(\epsilon\) are 0.9, 0.999 and \(10^{-8}\), respectively.
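The complete Adam update of Eqs. (29)–(33) can be sketched as follows, using the default values \({\beta }_{1} = 0.9\), \({\beta }_{2} = 0.999\) and \(\epsilon = 10^{-8}\) quoted above; the learning rate of 0.001 and the gradient function `grad(theta)` are assumptions for illustration only.

```python
import numpy as np

def adam(theta, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, steps=100):
    m = np.zeros_like(theta)   # first moment estimate, Eq. (29)
    v = np.zeros_like(theta)   # second moment estimate, Eq. (30)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)   # bias correction, Eq. (31)
        v_hat = v / (1 - beta2 ** t)   # bias correction, Eq. (32)
        theta = theta - lr / (np.sqrt(v_hat) + eps) * m_hat   # update rule, Eq. (33)
    return theta
```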