An activation function is a function that defines the output of a neuron. The activation function associated with each neuron in a neural network determines whether that neuron should be activated, based on the function's output. Some activation functions also help normalize the output of each neuron to a range such as [0, 1] or [-1, 1].
In this article, we will learn about the most common activation functions in deep learning and see when to use them. We will also see why non-linear activation functions are the ones used most commonly.
Activation functions can be divided into three types:
- Binary Step Activation Function
- Linear Activation Function
- Non-linear Activation Functions
- Binary Step Activation Function:
A binary step activation function is a threshold-based activation function. If the input value to the activation function is above a certain threshold, the neuron is activated and sends the same fixed signal to the next layer; otherwise, the neuron is not activated.
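The threshold behaviour described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `binary_step`, the default threshold of 0, the 0/1 output values, and treating an input exactly at the threshold as "activated" are all assumptions made for the example.

```python
def binary_step(x, threshold=0.0):
    """Return 1.0 (activated) if x reaches the threshold, else 0.0.

    Assumptions for this sketch: threshold defaults to 0, the outputs
    are 0.0 and 1.0, and x == threshold counts as activated.
    """
    return 1.0 if x >= threshold else 0.0


# Every activated input produces the same output, whatever its size.
outputs = [binary_step(x) for x in (-2.0, -0.5, 0.0, 0.5, 2.0)]
print(outputs)  # [0.0, 0.0, 1.0, 1.0, 1.0]
```

Note that the output carries no information about how far the input was above the threshold, only whether it was.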
- Linear or Identity Activation Function:
As its name suggests, this activation function is linear in nature, which means its graph is a straight line. The output of this function is therefore not confined to any range, i.e. its range is (-infinity, +infinity).
Equation : f(x) = x
Real-world problems are generally non-linear, so the use of a linear activation function is often discouraged. Linear activation functions can still be used in limited cases, such as when the data is linear in nature or when the neural network is only one layer deep.
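One way to see why depth does not help with a linear activation is that stacking linear layers gives another linear function. The sketch below uses single-input, single-output "layers" for simplicity; the function names and the weight and bias values are hypothetical, chosen only for illustration.

```python
def linear_layer(x, weight, bias):
    """One neuron with the identity activation f(x) = x."""
    return weight * x + bias


def two_linear_layers(x):
    # Two stacked layers, each using the identity activation.
    hidden = linear_layer(x, weight=2.0, bias=1.0)
    return linear_layer(hidden, weight=-3.0, bias=0.5)


def collapsed_layer(x):
    # Expanding -3 * (2x + 1) + 0.5 gives -6x - 2.5: a single linear layer.
    return linear_layer(x, weight=-6.0, bias=-2.5)


for x in (-1.0, 0.0, 2.5):
    print(x, two_linear_layers(x), collapsed_layer(x))
```

Both functions return the same value for every input, so the second layer adds no expressive power.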
- Non-linear Activation Function:
The input to a neuron's activation function is usually a linear transformation (i.e. input*weight + bias), but most real-world data is non-linear. Non-linear activation functions are applied to that linear transformation so that the network can model non-linear relationships. In other words, non-linear activation functions are the functions that add non-linearity to the network.
Some of the commonly used non-linear activation functions are as follows:
- Sigmoid Activation Function:
The Sigmoid activation function produces a smooth S-shaped curve (the sigmoid curve), with a minimum value approaching 0 and a maximum value approaching 1. Its advantage over the linear activation function is that, unlike a linear function, its output is bounded in the range (0, 1). For this reason, it is widely used in models that have to predict a probability as output.
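As a rough sketch, the logistic sigmoid, 1 / (1 + e^(-x)), can be written with the Python standard library alone. The two-branch form is a choice made here so that `math.exp` never overflows for large negative inputs; it is one common way to write the function, not the only one.

```python
import math


def sigmoid(x):
    """Logistic sigmoid, 1 / (1 + e^(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the algebraically equal form e^x / (1 + e^x)
    # so that exp() is only ever called on a non-positive number.
    z = math.exp(x)
    return z / (1.0 + z)


print(sigmoid(0.0))   # 0.5
print(sigmoid(-6.0))  # close to 0
print(sigmoid(6.0))   # close to 1
```

Mathematically the output never reaches 0 or 1, although in floating-point arithmetic it rounds to those values for inputs of very large magnitude.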
- Tanh Activation Function:
Tanh is similar to the logistic sigmoid function. Mathematically, the tanh function is a scaled and shifted version of the sigmoid function: tanh(x) = 2*sigmoid(2x) - 1. The sigmoid function maps values to between 0 and 1, whereas the tanh function maps them to between -1 and 1. As a result, the tanh activation function almost always works better than the sigmoid function in hidden layers.
Because its values lie between -1 and +1, the activations that come out of a hidden layer have a mean close to zero, which makes learning for the next layer a little easier. The tanh function is mainly used for classification between two classes.
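The relationship between tanh and the sigmoid can be checked numerically. The sketch below compares `math.tanh` with the identity tanh(x) = 2*sigmoid(2x) - 1, and then compares the mean output of both functions on a small set of inputs that is symmetric around zero. The helper names and the sample inputs are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_from_sigmoid(x):
    # tanh is the sigmoid stretched to (-1, 1) and evaluated at 2x.
    return 2.0 * sigmoid(2.0 * x) - 1.0


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]  # symmetric around zero

for x in inputs:
    assert math.isclose(tanh_from_sigmoid(x), math.tanh(x), abs_tol=1e-12)

mean_tanh = sum(math.tanh(x) for x in inputs) / len(inputs)
mean_sigmoid = sum(sigmoid(x) for x in inputs) / len(inputs)
print(mean_tanh)     # approximately 0
print(mean_sigmoid)  # approximately 0.5
```

On inputs centred on zero, tanh outputs are centred on zero as well, while sigmoid outputs are centred on 0.5.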
- Rectified linear unit (ReLU):
The Rectified Linear Unit (ReLU) is another very popular activation function in machine learning. These days, ReLU is the most widely used activation function in deep learning problems, and it is used in almost all convolutional neural networks.
Equation : a = max(0, z)
From the equation of the ReLU function, the value of ‘a’ (i.e. the output of ReLU) equals the supplied input for any input greater than or equal to zero, and is zero for all negative inputs. The issue is that every negative value becomes zero immediately, which decreases the ability of the model to fit or train from the input data properly. Because the gradient of ReLU is also zero for negative inputs, a neuron whose input stays negative keeps outputting zero, receives no weight updates, and can stop learning altogether. This problem is called the dying ReLU condition.
Therefore the range of ReLU is [0, infinity).
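A minimal sketch of ReLU and its gradient makes the dying ReLU issue concrete. The function names are illustrative, and returning a gradient of 0 at exactly zero is a convention assumed here, since the derivative is not defined at that point.

```python
def relu(z):
    """ReLU: the input itself when it is non-negative, otherwise zero."""
    return max(0.0, z)


def relu_grad(z):
    # Slope is 1 for positive inputs and 0 for negative inputs.
    # Assumption: the undefined derivative at z == 0 is treated as 0.
    return 1.0 if z > 0 else 0.0


values = (-3.0, 0.0, 2.5)
print([relu(z) for z in values])       # [0.0, 0.0, 2.5]
print([relu_grad(z) for z in values])  # [0.0, 0.0, 1.0]
```

For a negative input, both the output and the gradient are zero, so gradient-based training has no signal with which to adjust that neuron.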
If you are not sure which function to use for your hidden layers, ReLU might be a good choice. Be aware, however, that there are no perfect guidelines about which function to use, because your data and your problem will always be unique.
- Leaky rectified linear unit (Leaky ReLU):
Leaky ReLU is an attempt to solve the dying ReLU problem. It is a slightly modified version of the ReLU function: instead of the slope being zero when z is negative, the function has a small non-zero slope.
For negative inputs, the output of Leaky ReLU is ‘0.01z‘, i.e. the slope is 0.01 (the most commonly used value). When the slope is instead learned during training, the function is called Parametric ReLU (PReLU); when it is sampled at random during training, it is called Randomized ReLU (RReLU).
Therefore the range of Leaky ReLU is (-infinity, infinity).
Leaky ReLU works a bit better than ReLU most of the time, but it isn’t used as much in practice.
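A short sketch of Leaky ReLU and its gradient, for comparison with plain ReLU. The parameter name `negative_slope` and the function names are assumptions made for this example; the default of 0.01 follows the value mentioned above.

```python
def leaky_relu(z, negative_slope=0.01):
    """Leaky ReLU: z for non-negative inputs, negative_slope * z otherwise."""
    return z if z >= 0 else negative_slope * z


def leaky_relu_grad(z, negative_slope=0.01):
    # Unlike ReLU, the gradient for negative inputs is small but not zero.
    # Assumption: the undefined derivative at z == 0 is treated as negative_slope.
    return 1.0 if z > 0 else negative_slope


for z in (-100.0, -1.0, 0.0, 2.5):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Because the gradient never becomes exactly zero for negative inputs, a neuron that receives negative inputs can still be updated during training.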