A mathematical approach towards Gradient Descent Algorithm

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function (commonly called as loss/cost functions in machine learning and deep learning). Minimization of loss function is done to obtain the optimium values for learning parameters like weights ‘w’ and bias ‘b’. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent.

gradient descent

To understand the algorithm we will take an example of Logistic regression 

Let, 'X ' be  the fearture matrix (i.e. X_train) and, 'Y ' be their corresponding labels.

\[ \displaystyle \text{ X=}{{\left( {\begin{array}{*{20}{c}} {{{x}_{{11}}}} & \ldots & {{{x}_{{1n}}}} \\ \vdots & \ddots & \vdots \\ {{{x}_{{m1}}}} & \cdots & {{{x}_{{mn}}}} \end{array}} \right)}_{{m*n}}},\text{ target = Y =}{{\left( {\begin{array}{*{20}{c}} {{{y}_{1}}} \\ \vdots \\ {{{y}_{m}}} \end{array}} \right)}_{{m*1}}} \]

First of all, we initilize weight matrix and bias with some random values.
If, 'W ' be the weight matrix, then let it be represented as,

\[ \displaystyle \text{W=}{{\left( {\begin{array}{*{20}{c}} {{{w}_{1}}} \\ \vdots \\ {{{w}_{n}}} \end{array}} \right)}_{{n*1}}} \]

where, ​w1,..., wn are random values that we have to define before training our model. 

and, b = b1 be the bias.

Now, we calculate the linear transformation as,

\[ \displaystyle \begin{array}{l}Z=X*W+b\text{ or, }\\\text{Z}=W*{{X}^{T}}+b\end{array} \]

The linear transformation is then passed to a sigmoid Function.
(In neural network in place of the sigmoid function, any other activation function can be used like ReLU, Tanh.)

\[ \displaystyle y\_pred(or\text{ y }\!\!\_\!\!\text{ output})=sig(Z)=\frac{1}{{1+{{e}^{{-Z}}}}}\text{ } \]

Now, its time to calculate loss.

I have used Binary cross-entropy loss function. It can be replaced by MSE or MAE or any other loss functions depending upon the problem. To read more about loss function click here.

Consider a binary cross-entropy loss function 'L', which is given as,

\[ \displaystyle L=-\frac{1}{m}\sum\limits_{{i=1}}^{m}{{\text{ }\!\!\{\!\!\text{ target*log(y }\!\!\_\!\!\text{ pred)+(1-target)*log(1-y }\!\!\_\!\!\text{ pred) }\!\!\}\!\!\text{ }}} \]

Now, we start to train our model.

Training of the model means to find the optimium value of  ‘W ‘ and bias ‘b ‘, for which the loss function ‘L‘ gives minimum value.

To obtain the minimum, we use gradient descent algorithm.

Gradient Descent Algorithm

We will see how gradient descent algorithm works in two steps. And the steps involved are as follows,

Step 1: Calculate the gradient of ‘L’ w.r.t weight ‘w‘ and bias ‘b’ as,

\[ \displaystyle \frac{{\partial L}}{{\partial {{w}_{j}}}}=\frac{1}{m}\sum\limits_{{i=1}}^{m}{{(y\_outpu{{t}^{i}}-{{Y}^{i}})}}x_{j}^{i},\text{ where, j =1,2,3,}…\text{,n}\text{. } \]

While implementing in code, we use

\[ \displaystyle \left( {\frac{{\partial L}}{{\partial W}}} \right)\text{ = }\frac{1}{m}\left\{ {{{X}^{T}}*\left( {y\_output-Y} \right)} \right\} \]

Where,

\[ \displaystyle \left( {\frac{{\partial L}}{{\partial W}}} \right)={{\left( {\begin{array}{*{20}{c}} {\frac{{\partial L}}{{\partial {{w}_{1}}}}} \\ \vdots \\ {\frac{{\partial L}}{{\partial {{w}_{n}}}}} \end{array}} \right)}_{{n*1}}}and,(y\_output-Y)={{\left( {\begin{array}{*{20}{c}} {y\_outpu{{t}^{1}}-{{Y}^{1}}} \\ \vdots \\ {y\_outpu{{t}^{m}}-{{Y}^{m}}} \end{array}} \right)}_{{m*1}}} \]

And the gradient of bias is calculates as,

\[ \displaystyle \frac{{\partial L}}{{\partial b}}=\frac{1}{m}\sum\limits_{{i=1}}^{m}{{(y\_outpu{{t}^{i}}-{{Y}^{i}})}} \]

If you are eager to know where these expression came from, then you may visit this link.

Step 2: Update the previous weights and biases as,

\[ \displaystyle \begin{array}{l}\text{W = W – }\eta \left( {\frac{{\partial L}}{{\partial W}}} \right)\\\\b\text{ = }b\text{ – }\eta \frac{{\partial L}}{{\partial b}},\text{ where }\eta \text{ is the learning rate}\text{.}\end{array} \]

learning rate is a factor that determines how big or small the step should be on each iterations. If the learning rate is too small, it would take long time to converge and thus will be computationally too expensive. In the same way if the value of learning rate is too large, it will overshoot the minimum and fail to converge. 

Iterate Steps 1 and 2 untill the loss is significantly reduced.

Further, if you want to study about different variants of gradient descent algorithm, you can visit here. Also, if you want to understand the gradient descent algorithm with oython code visit this.

That’s it. Isn’t the Gradient descent algorithm easy to understand. If you still have doubt you can comment below.

 

Leave a Reply

Insert math as
Block
Inline
Additional settings
Formula color
Text color
#333333
Type math using LaTeX
Preview
\({}\)
Nothing to preview
Insert
%d bloggers like this: