Gradient descent is a first-order iterative optimization algorithm for finding the minimum of a function (commonly called as loss/cost functions in machine learning and deep learning). To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or approximate gradient) of the function at the current point. If, instead, one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent. Gradient descent is also known as steepest descent.
Before going through the variants of gradient descent let’s first understand how gradient descent algorithm works. For this let’s take an example of logistic regression model. For convenience, let’s suppose that our model has only two parameters: one weight w and bias b. For this the algorithm can be written as:
- Initialize our w and b with random guesses.(e.g. say w=1, b=1).
- Select a value for learning rate α. learning rate is a factor that determines how big or small the step should be on each iterations. If the learning rate is too small, it would take long time to converge and thus will be computationally too expensive. In the same way if the value of learning rate is too large, it will overshoot the minimum and fail to converge.
- The data should be normalized to a suitable scale if is highly varying. It will decrease the chances of error. There are various techniques of normalization. To read about normalization you can visit this page.
- Suppose our cost function/ loss function ( for brief about loss/cost functions visit here.) which is to be minimized be J(w,b). On each iteration, we take partial derivative of cost function J(w,b) with respect to the parameters (w,b) which are called gradients of loss function w.r.t weights and bias:
- We then update our previous weight w and bias b as shown below:
- We continue this process until the cost function converges.
Now let’s discuss in brief about the variants of gradient descent algorithms.
• Batch Gradient Descent
It is also simply called gradient descent. In this algorithm, we consider all of our examples (whole data set) on each iteration when we update our parameters.
i. It has unbiased estimate of gradient i.e. lower chances of error.
ii. We can use a fixed learning rate during training for our entire example.
i. It takes long time to converge if we have large data set.
ii. It is computationally expensive.
• Mini-batch Gradient Descent
In this algorithm, instead of going through entire examples (whole data set), we perform gradient descent algorithm taking several mini-batches. So even we have large number of training examples, it is processed in batches of certain example (batch size). We divide our data set into several mini batches say n batches with certain batch size.
The batch size can be tuned by us. Generally it is chosen as power of 2, examples 32, 64, 128 etc. It is done because some hardware such as GPUs achieves better runtime with batch size as power of 2.
i. It is faster than the batch gradient descent.
ii. Random selection of examples will help to avoid examples that are very similar and do not contribute much to the learning.
i. Mini-batch requires the configuration of an additional “mini-batch size” hyperparameter for the learning algorithm.
ii. Error information must be accumulated across mini-batches of training examples like batch gradient descent.
• Stochastic Gradient Descent
Stochastic gradient descent, often abbreviated SGD, is a variation of the gradient descent algorithm that calculates the error and updates the model for each example in the training dataset. In other words, can say it is a mini batch gradient descent with batch size 1 and has batches equal to the number of training examples.
i. It is simple to understand and implement, especially for beginners.
ii. This increase the model updates frequency that can result in faster learning on some problems.
i. Updating the model frequently can be computationally too much expensive in comparison to the above variants.
ii. The frequent updates may result in a noisy gradient signal, that may cause the model parameters and turn the model error to jump around.