An activation function is a function that defines the output of a neuron. The activation function associated with each neuron in a neural network determines whether that neuron should be activated, based on the function's output. There are three types of activation functions: binary, linear, and non-linear. This article focuses mainly on non-linear activation functions.
The input to each neuron is usually a linear transformation (i.e. input*weight + bias), but most real-world data is non-linear. A network built only from linear transformations can represent only linear relationships, so non-linear activation functions are applied to the result of that transformation. Non-linear activation functions are the functions that add non-linearity to the network.
The following are some of the most commonly used non-linear activation functions:
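To see why the non-linearity matters, here is a minimal sketch using single-neuron "layers". The weights, biases, and function names are made up for illustration, and ReLU (described later in this article) stands in for any non-linear activation. Two stacked linear transformations collapse into one linear transformation, while putting an activation between them does not.

```python
def linear(x, weight, bias):
    """Linear transformation of a single input: x * weight + bias."""
    return x * weight + bias


def relu(z):
    """ReLU activation, used here only as an example non-linearity."""
    return max(0.0, z)


# Hypothetical weights and biases for two one-neuron layers.
w1, b1 = 2.0, -1.0
w2, b2 = -3.0, 0.5


def two_linear_layers(x):
    # Layer 2 applied directly to the output of layer 1, no activation.
    return linear(linear(x, w1, b1), w2, b2)


def collapsed_single_layer(x):
    # Algebraically the same mapping, written as ONE linear layer:
    # w2 * (w1 * x + b1) + b2 = (w2 * w1) * x + (w2 * b1 + b2)
    return linear(x, w2 * w1, w2 * b1 + b2)


def two_layers_with_relu(x):
    # With an activation in between, the mapping is no longer a straight line.
    return linear(relu(linear(x, w1, b1)), w2, b2)


for x in [-2.0, 1.0, 3.0]:
    print(x, two_linear_layers(x), collapsed_single_layer(x), two_layers_with_relu(x))
```

The first two columns of results always match, so adding more purely linear layers adds no expressive power. The third column changes slope depending on the input, which is what lets the network fit non-linear data.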
- Rectified linear unit (ReLU)
The Rectified Linear Unit (ReLU) is a very popular activation function in machine learning. These days it is the most widely used activation function in deep learning, and it is used in almost all convolutional neural networks.
It is defined as:
a = max(0, z)
From the equation of the ReLU function it is clear that the value of ‘a’ (i.e. the output of ReLU) is equal to the supplied input ‘z’ for any input greater than or equal to zero, and is zero for all negative inputs. The issue is that every negative value becomes zero, and the slope of the function is also zero there, so no gradient flows back through a neuron whose input is negative. If a neuron's input is negative for every training example, its weights stop being updated and it outputs zero permanently, which decreases the ability of the model to fit the data. This problem is called the dying ReLU condition.
Therefore the range of ReLU is [0, infinity).
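The behaviour described above can be sketched in a few lines of plain Python. The function names, the example neuron, and the choice of gradient 0 at exactly z = 0 are assumptions made for illustration (the derivative is undefined at that point, and conventions vary).

```python
def relu(z):
    """ReLU: returns the input for z >= 0 and zero for negative inputs."""
    return z if z >= 0 else 0.0


def relu_grad(z):
    """Slope of ReLU: 1 for positive inputs, 0 for negative inputs.

    The value at z == 0 is a convention; 0 is assumed here.
    """
    return 1.0 if z > 0 else 0.0


# A hypothetical neuron whose bias is so negative that every input
# produces a negative pre-activation.
inputs = [0.5, 1.0, 2.0]
weight, bias = 1.0, -10.0

pre_activations = [x * weight + bias for x in inputs]
outputs = [relu(z) for z in pre_activations]
gradients = [relu_grad(z) for z in pre_activations]

print(outputs)    # every output is zero
print(gradients)  # every gradient is zero, so the weights would not update
```

Because both the output and the gradient are zero for every input, gradient-based training has no signal with which to move this neuron's weights, which is the "dying" behaviour.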
If you are not sure which function to use for your hidden layers, ReLU is a good default choice. Be aware, though, that there are no perfect guidelines about which function to use, because the best choice depends on your data and your problem.
- Leaky rectified linear unit (Leaky ReLU):
Leaky ReLU is an attempt to solve the dying ReLU problem. It is a slightly modified version of the ReLU function: instead of the slope being zero when z is negative, the function has a small non-zero slope, so its graph is a gently sloped line to the left of zero and the same straight line as ReLU to the right.
For negative inputs, the output of Leaky ReLU is ‘0.01z’, i.e. the slope is a small constant (0.01 is the most common choice). When the slope is instead learned as a parameter during training, the function is called Parametric ReLU (PReLU); when it is sampled at random during training, it is called Randomized ReLU (RReLU).
Therefore the range of Leaky ReLU is (-infinity, infinity).
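As a rough sketch, Leaky ReLU and its slope can be written as follows. The parameter name `alpha` and the function names are assumptions for illustration; the default of 0.01 is the value mentioned above.

```python
def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: z for z >= 0, alpha * z for negative z."""
    return z if z >= 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Slope of Leaky ReLU: 1 for positive z, alpha for negative z.

    The value at z == 0 is a convention; alpha is assumed here.
    """
    return 1.0 if z > 0 else alpha


for z in [-5.0, -0.5, 0.0, 2.0]:
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Unlike plain ReLU, the gradient for negative inputs is small but never zero, so a neuron that receives only negative inputs can still have its weights updated.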
Leaky ReLU often works a bit better than ReLU, but it isn’t used as much in practice.
- Sigmoid Activation Function:
The sigmoid activation function produces a smooth S-shaped curve (the sigmoid curve), with a minimum value approaching 0 and a maximum value approaching 1. Its advantage over a linear activation function is that its output is bounded in the range (0, 1). For this reason, it is widely used in models that have to predict a probability as output.
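A minimal sketch of the sigmoid function, using the standard logistic formula 1 / (1 + e^(-z)). The branch on the sign of z is an implementation choice made here to avoid overflow in `math.exp` for large negative inputs; it is not part of the definition.

```python
import math


def sigmoid(z):
    """Logistic sigmoid: 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, exp(-z) can overflow, so use the equivalent form
    # exp(z) / (1 + exp(z)), where exp(z) is always a small number.
    e = math.exp(z)
    return e / (1.0 + e)


for z in [-10.0, -1.0, 0.0, 1.0, 10.0]:
    print(z, sigmoid(z))
```

The output rises smoothly from near 0 to near 1 and equals 0.5 at z = 0, which is why it can be read as a probability. Mathematically the output never reaches 0 or 1, although floating-point rounding can produce those values for extreme inputs.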
- Tanh Activation Function:
Tanh is similar to the logistic sigmoid function. Mathematically, the tanh function is a scaled and shifted version of the sigmoid function: tanh(z) = 2*sigmoid(2z) - 1. The sigmoid function maps values to between 0 and 1, while the tanh function maps them to between -1 and 1. In hidden layers, the tanh activation function almost always works better than the sigmoid function.
Because its values lie between -1 and +1, the activations that come out of a hidden layer have a mean close to zero, which makes learning for the next layer a little easier. The tanh function is mainly used for classification between two classes.
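The relationship between tanh and sigmoid, and the zero-mean property, can be checked with a short sketch. The helper names and the sample inputs are made up for illustration. It relies on the identity tanh(z) = 2 * sigmoid(2z) - 1, so tanh is a scaled and shifted sigmoid.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, written to avoid overflow for negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def tanh_from_sigmoid(z):
    """tanh expressed through sigmoid: 2 * sigmoid(2z) - 1."""
    return 2.0 * sigmoid(2.0 * z) - 1.0


def mean(values):
    return sum(values) / len(values)


# Hypothetical pre-activations, symmetric around zero.
zs = [-2.0, -1.0, 0.0, 1.0, 2.0]

tanh_mean = mean([math.tanh(z) for z in zs])
sigmoid_mean = mean([sigmoid(z) for z in zs])

print(tanh_mean)     # close to 0
print(sigmoid_mean)  # close to 0.5
```

For inputs centred on zero, tanh activations average to roughly zero while sigmoid activations average to roughly 0.5, which illustrates why tanh outputs are easier for the next layer to work with.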