Gradient descent is one of the most common methods for training a neural network. It is an optimization algorithm used to tune the parameters (e.g., weights and biases) of the network. You define a loss (cost) function that measures how well the parameters fit the training data: the greater the loss, the poorer the fit. Gradient descent then reduces the loss gradually by updating the weights and biases in the direction in which the loss decreases. That direction is determined from the gradients (derivatives, or slopes) of the loss with respect to the weights and biases: each parameter is moved a small step against its gradient. Once the loss is sufficiently reduced, we obtain values of the weights and biases that fit the training data well.

The update rule for the weights and biases is:

$$\displaystyle \begin{array}{l}W = W - \eta \frac{\partial \text{loss}}{\partial W}\\\\b = b - \eta \frac{\partial \text{loss}}{\partial b},\text{ where }\eta \text{ is the learning rate (incremental step).}\end{array}$$
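As a minimal sketch of this update rule, the Python snippet below fits a model with one weight and one bias to toy data using a mean squared error loss. The data, the model, the choice of loss, and the names `gradients` and `gradient_descent` are assumptions made for illustration and do not come from any particular library.

```python
# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def gradients(w, b):
    """Gradients of the mean squared error loss with respect to w and b."""
    n = len(xs)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def gradient_descent(eta=0.05, steps=2000):
    """Plain gradient descent: step against the gradient, scaled by eta."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        dw, db = gradients(w, b)
        w = w - eta * dw  # W = W - eta * d(loss)/dW
        b = b - eta * db  # b = b - eta * d(loss)/db
    return w, b


w, b = gradient_descent()
print(w, b)  # should be close to the values used to generate the data
```

With this toy data the loss is a simple bowl, so the parameters settle near the values that generated the data. Real networks have far more parameters and a much less regular loss surface.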

In most cases, standard gradient descent works well. However, a poorly chosen learning rate (incremental step) can cause overshooting (when the learning rate is too high) or slow convergence (when it is too low). To speed up convergence and reduce overshooting, we make a slight modification to standard gradient descent: we use gradient descent with momentum.
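To make the effect of the learning rate concrete, here is a small sketch on the one-dimensional loss `loss(w) = w**2`, whose gradient is `2 * w` and whose minimum is at `w = 0`. The loss, the starting point, the step count, and the helper name `run_gradient_descent` are all assumptions chosen for illustration.

```python
def run_gradient_descent(eta, steps=20, w=1.0):
    """Run plain gradient descent on loss(w) = w**2 and return the final w."""
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - eta * grad  # each step multiplies w by (1 - 2 * eta)
    return w


too_high = run_gradient_descent(eta=1.1)   # overshoots: |w| grows every step
too_low = run_gradient_descent(eta=0.001)  # converges, but w has barely moved
moderate = run_gradient_descent(eta=0.1)   # shrinks steadily towards 0

print(too_high, too_low, moderate)
```

On this loss each update multiplies `w` by `(1 - 2 * eta)`, so the iterates diverge once that factor has magnitude greater than 1 and crawl when it is only slightly below 1.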

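Its algorithm is almost the same as standard gradient descent; the only difference is the update step. Instead of using the raw gradient, the update uses an exponentially weighted moving average of past gradients. We update the weights and biases as: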

$\displaystyle \begin{array}{l}\text{Iteration:}\\\{\\\quad\text{Compute }\frac{\partial \text{loss}}{\partial W}\text{ and }\frac{\partial \text{loss}}{\partial b}\text{ on the current batch (or mini-batch).}\\\quad V_{dW} = \beta V_{dW} + (1-\beta)\frac{\partial \text{loss}}{\partial W}\\\quad V_{db} = \beta V_{db} + (1-\beta)\frac{\partial \text{loss}}{\partial b}\\\quad\text{Update weight }W\text{ and bias }b\text{:}\\\quad W = W - \eta V_{dW}\\\quad b = b - \eta V_{db}\\\}\end{array}$

Here, β and η are hyperparameters: η is the learning rate and β is the momentum coefficient. The most commonly used value for β is 0.9; however, you can also choose its value by hyperparameter tuning.

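Further, VdW and Vdb are initialized as matrices with the same dimensions as W (or dW) and b (or db), respectively, typically filled with zeros. In some formulations the (1-β) factor is omitted; the velocity terms are then larger by a factor of 1/(1-β), so the learning rate usually has to be rescaled accordingly.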
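The momentum update can be sketched in Python as follows. This is an illustrative example under stated assumptions, not a reference implementation: the toy data, the scalar model, the mean squared error loss, and the names `gradients`, `momentum_descent`, and `use_one_minus_beta` are all hypothetical. Because the model has a single weight and a single bias, the velocity terms are plain floats initialized to zero; for a real network they would be zero-filled arrays with the same shapes as W and b.

```python
# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def gradients(w, b):
    """Gradients of the mean squared error loss with respect to w and b."""
    n = len(xs)
    dw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db


def momentum_descent(eta=0.05, beta=0.9, steps=2000, use_one_minus_beta=True):
    """Gradient descent with momentum on the toy problem above."""
    w, b = 0.0, 0.0
    v_dw, v_db = 0.0, 0.0  # velocities start at zero, same shape as w and b
    # Some formulations drop the (1 - beta) factor on the gradient term.
    scale = (1.0 - beta) if use_one_minus_beta else 1.0
    for _ in range(steps):
        dw, db = gradients(w, b)
        v_dw = beta * v_dw + scale * dw  # V_dW = beta * V_dW + (1 - beta) * dW
        v_db = beta * v_db + scale * db  # V_db = beta * V_db + (1 - beta) * db
        w = w - eta * v_dw
        b = b - eta * v_db
    return w, b


w_m, b_m = momentum_descent()
# Dropping (1 - beta) while shrinking eta by the same factor follows
# the same trajectory up to floating-point rounding.
w_alt, b_alt = momentum_descent(eta=0.05 * (1.0 - 0.9), use_one_minus_beta=False)
print(w_m, b_m, w_alt, b_alt)
```

The second call illustrates why omitting (1-β) is mostly a matter of convention: with zero-initialized velocities, the two forms differ only by a constant rescaling of the learning rate.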
