A Complete Guide to Activation Functions and Their Types

Sweta
10 min read · Aug 19, 2020


In this article, I explain every activation function in detail, with its advantages, disadvantages, and important points, along with graphs.

Activation functions are an extremely important feature of artificial neural networks. They decide whether a neuron should be activated or not, i.e. whether the information the neuron is receiving is relevant for the given task or should be ignored.

I would also recommend the post “What is Activation Function?”, where I explain the idea in a bit more detail.

The activation function is the non-linear transformation that we apply to the input signal of a neuron. This transformed output is then passed to the next layer of neurons as input.

In other words, the activation function introduces the non-linearity that lets the network model complex relationships in the data.

Broadly, the activation function could be a:

  • Linear function
  • Binary function
  • Non-linear function

Linear function:

The function is a straight line, i.e. linear in its input.

function:

f(x) = mx + c

Derivative:

f’(x) = m

Note: the derivative of f(x) is a constant, independent of x.

Points:

  • The output of the function is not confined to any range.
  • A linear function cannot capture the complexity inside the data.
  • It is poor at finding relationships in complex datasets.

If we are just trying to map data from one scale to another, we can use a linear function as the activation inside a hidden layer, but it is not suitable for complex data.
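
As a minimal NumPy sketch (the function names are mine, not from the original), a linear activation and its constant derivative could look like this:

```python
import numpy as np

def linear(x, m=1.0, c=0.0):
    """Linear activation f(x) = m*x + c."""
    return m * x + c

def linear_derivative(x, m=1.0):
    """The derivative is the constant slope m, independent of x."""
    return np.full_like(x, m, dtype=float)

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(linear(x, m=2.0, c=1.0))        # [-3. -1.  1.  3.  5.]
print(linear_derivative(x, m=2.0))    # [2. 2. 2. 2. 2.]
```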

Binary function:

It outputs either 0 or 1, and is also called the threshold (step) function.

function:

f(x) = 1 if x ≥ 0, else 0 (the output is only ever 0 or 1)

(derivative) f’(x) = 0 for all x (undefined exactly at the threshold)

Points:

  • The derivative of this function is always zero, which is a huge issue in backpropagation because backpropagation needs non-zero derivatives to update the weights.
  • It can handle only two classes, 0 and 1, which is also a big issue because it fails when there are multiple classes.
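
A minimal sketch of the threshold function and its zero derivative, assuming a threshold at 0 (my choice, not stated in the original):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    """Threshold (step) activation: 1 if x >= threshold, else 0."""
    return np.where(x >= threshold, 1, 0)

def binary_step_derivative(x):
    """Derivative is 0 everywhere (undefined exactly at the threshold),
    which is why backpropagation cannot learn through it."""
    return np.zeros_like(x, dtype=float)

x = np.array([-1.5, -0.1, 0.0, 0.7])
print(binary_step(x))             # [0 0 1 1]
print(binary_step_derivative(x))  # [0. 0. 0. 0.]
```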

Non-linear function:

We prefer non-linear functions because they make it easier for the model to generalize, adapt to a variety of data, and differentiate between outputs.

Points:

  • We use non-linear functions so that the model can capture the complexity of the data.
  • Their derivatives exist, so during backpropagation we can compute gradients and update the neurons’ weights.

Types of Activation Function

  1. Sigmoid function
  2. Tanh function
  3. ReLU function
  4. Leaky ReLU function
  5. Parametric ReLU (PReLU) function
  6. ELU (Exponential Linear Unit) function
  7. Swish (self-gated or sigmoid linear unit) function
  8. Softmax or normalized exponential function
  9. Softplus function
  10. Maxout function

Here are all the activation functions in one place; let us go through each of them in detail.

1. Sigmoid function

function:

σ(x) = 1 / (1 + e^(−x))

Output range is between 0 and 1, i.e. (0, 1).

Derivative of the function:

σ’(x) = σ(x)(1 − σ(x))

Output range is between 0 and 0.25, i.e. (0, 0.25].

Points:

  • The sigmoid function curve looks like an S shape.
  • The function squashes extreme values or outliers in the data without removing them.
  • It converts independent variables of near-infinite range into simple probabilities between 0 and 1, and most of its output will be very close to 0 or 1.

Advantages:

  • It introduces non-linearity into the data.
  • It is a differentiable function, so we can compute a derivative everywhere.
  • It normalises the output to a fixed range.

Disadvantages:

  • Vanishing gradients: during backpropagation, the derivative of the error with respect to the weights (dE/dw) keeps shrinking in each layer, so the updates reaching the earlier layers become negligible.

Note: simply adding more neurons or layers is not a good fix; beyond a certain depth it only makes the problem worse.

  • Sigmoid is not a zero-centered function: its outputs are always positive, so the gradient updates tend to push in the same direction, which makes optimization harder because there is no consistency of sign in the updates.

Its outputs sit near 0 or 1 and are never centered around zero, hence it is not a zero-centered function.

  • It is computationally expensive because of the exponential.
  • No matter how large the input is, the output stays between 0 and 1, so in the saturated regions the derivative produces only very small changes.
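
A minimal NumPy sketch of the sigmoid and its derivative (function names are mine), showing how the derivative peaks at 0.25 and vanishes for large inputs:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    """Derivative sigma(x) * (1 - sigma(x)); its maximum is 0.25 at x = 0,
    which is why gradients shrink layer after layer."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(np.round(sigmoid(x), 4))             # [0.     0.2689 0.5    0.7311 1.    ]
print(np.round(sigmoid_derivative(x), 4))  # [0.     0.1966 0.25   0.1966 0.    ]
```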

2. Tanh function

Tanh is a modification of the sigmoid function.

function:

tanh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

Output lies between −1 and 1, i.e. (−1, 1).

Derivative:

tanh’(x) = 1 − tanh²(x)

Output lies between 0 and 1, i.e. (0, 1].

Points:

  • It works better than the sigmoid function.
  • Tanh is mathematically a scaled and shifted version of the sigmoid: tanh(x) = 2σ(2x) − 1.
  • The tanh function maps inputs to values between −1 and 1.
  • Because the values lie between −1 and 1, the mean of the activations coming out of a hidden layer is close to zero, which makes learning in the next layer a little easier.

Advantages:

  • It is differentiable.
  • It is a zero-centered function.
  • Using it in the hidden units of a neural network almost always works better than using the sigmoid function.
  • Optimization is easier compared to the sigmoid.
  • Its derivative lies between 0 and 1.

Disadvantages:

  • It is computationally heavy, so the forward computation is slow.
  • Vanishing gradients: useful gradients exist only in the curved region of the function; beyond a certain point the curve flattens into a line almost parallel to the x-axis, and the gradient becomes nearly zero.

Note:

  • With tanh, the activations (and hence the weight updates) can move in the positive as well as the negative direction, whereas with sigmoid they are pushed in only one direction. So tanh is slightly better than the sigmoid function.
  • Sigmoid always gives positive values even if the input is negative, whereas tanh outputs can be positive or negative.
  • Because sigmoid is not zero-centered, its outputs all share the same sign, which is an issue because the updates cannot easily change direction; tanh is zero-centered, so the updates can be positive or negative.
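
A similar NumPy sketch for tanh and its derivative (an illustrative version only, using NumPy's built-in tanh):

```python
import numpy as np

def tanh(x):
    """Tanh: a zero-centered squashing function with output in (-1, 1)."""
    return np.tanh(x)

def tanh_derivative(x):
    """Derivative 1 - tanh(x)^2, which lies in (0, 1] and peaks at x = 0."""
    t = np.tanh(x)
    return 1.0 - t ** 2

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(np.round(tanh(x), 4))             # [-0.964  -0.4621  0.      0.4621  0.964 ]
print(np.round(tanh_derivative(x), 4))  # [0.0707 0.7864 1.     0.7864 0.0707]
```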

What is Vanishing Gradient Problem?

The problem of vanishing gradients arises due to the nature of backpropagation-based optimization:

  • Gradients tend to get smaller and smaller as we keep moving backwards through the network.
  • This implies that neurons in the earlier layers learn very slowly compared to neurons in the later layers.

The vanishing gradient problem results in a decrease in the prediction accuracy of the model and makes the model take a long time to train.
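
A tiny numeric illustration of the effect, assuming a hypothetical 10-layer stack where each sigmoid contributes at most a derivative of 0.25 to the chain rule:

```python
# Chain rule through a stack of sigmoid layers: the gradient reaching the
# earliest layer is (roughly) a product of per-layer local derivatives,
# each of which is at most 0.25 for the sigmoid.
gradient = 1.0
for layer in range(1, 11):
    gradient *= 0.25   # best case: sigmoid derivative at its peak
    print(f"after layer {layer}: gradient factor = {gradient:.2e}")
# After 10 layers the factor is 0.25**10, about 9.5e-07: earlier layers barely learn.
```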

3. Rectified linear Unit (ReLU) :

function:

f(x) = max(0, x)

(derivative) f’(x) = 1 if x > 0, else 0

Points:

  • The input can be anything, but the output is always non-negative.
  • The derivative is 1 as long as x is positive and 0 when x is negative.

Note: If you are not sure which function to use for your hidden layer, then the ReLU function is a good choice, but be aware that there are no perfect guidelines about which function to use, because your data and problem will always be unique. Choosing the right one is more of an art than a science. Consequently, you should try things out if you are not sure.

Advantages:

  • We can compute the derivative (everywhere except exactly at x = 0).
  • It helps overcome the vanishing gradient problem.
  • It does not activate all the neurons at the same time; negative inputs are zeroed out, giving sparse activations.
  • Calculation is fast compared to sigmoid or tanh because there is no exponential involved.
  • In the positive region it behaves linearly, which keeps the computation simple in both the forward and the backward pass.

Disadvantages:

  • It is not a zero-centered function.
  • For negative inputs it is completely inactive (the “dying ReLU” problem).
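
A minimal NumPy sketch of ReLU and its derivative (taking the derivative at exactly x = 0 to be 0, one common convention):

```python
import numpy as np

def relu(x):
    """ReLU: passes positive inputs through unchanged, zeroes out negatives."""
    return np.maximum(0.0, x)

def relu_derivative(x):
    """Derivative: 1 for x > 0, 0 for x < 0 (conventionally 0 at x = 0)."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))             # [0.  0.  0.  0.5 3. ]
print(relu_derivative(x))  # [0. 0. 0. 1. 1.]
```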

4. Leaky ReLU function

This function is a slightly modified version of the ReLU function.

function:

f(x) = x if x > 0, αx otherwise (with a small fixed α)

(derivative) f’(x) = 1 if x > 0, α otherwise

Advantages:

  • It keeps giving you a gradient, even for negative inputs.
  • It is better than ReLU because it does something for negative values instead of zeroing them out.

Disadvantages:

  • There is no consistent prediction for negative inputs, because the small slope α is fixed arbitrarily.
  • Generally we take the value α = 0.01.
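
A minimal NumPy sketch of Leaky ReLU with the usual α = 0.01 (function names are mine):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: like ReLU, but negative inputs get a small slope alpha."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    """Derivative: 1 for x > 0, alpha otherwise, so the gradient never dies."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-100.0, -1.0, 0.0, 1.0, 100.0])
print(leaky_relu(x))             # [ -1.    -0.01   0.     1.   100.  ]
print(leaky_relu_derivative(x))  # [0.01 0.01 0.01 1.   1.  ]
```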

5. Parametric ReLU function

It is an improved version of Leaky ReLU.

function:

f(x) = x if x > 0, αx otherwise

Here,

α is not a constant; it is a parameter that the model learns and updates during training.

Derivative: same as Leaky ReLU (1 for x > 0, α otherwise)

We are giving the model the flexibility to learn the slope of the activation function itself; this was not available in any of the functions we have covered so far.

Note:

  • Leaky ReLU: α = 0.01 (a small fixed constant)
  • Parametric ReLU: α is not a constant; it is learned and changes during training.
  • ReLU: α = 0
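
A minimal sketch of a PReLU unit with a single scalar α; in a real framework the optimizer would update α using the gradient shown below (this is only an illustration, not any particular library's API):

```python
import numpy as np

def prelu(x, alpha):
    """PReLU: same shape as Leaky ReLU, but alpha is a learnable parameter."""
    return np.where(x > 0, x, alpha * x)

def prelu_grad_alpha(x):
    """Gradient of the output with respect to alpha: x for x <= 0, else 0.
    This is what lets the optimizer update alpha during training."""
    return np.where(x > 0, 0.0, x)

alpha = 0.25                       # initial value; it changes as training proceeds
x = np.array([-2.0, -0.5, 1.0, 3.0])
print(prelu(x, alpha))             # [-0.5   -0.125  1.     3.   ]
print(prelu_grad_alpha(x))         # [-2.  -0.5  0.   0. ]
```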

6. ELU( Exponential Linear Unit)

function:

f(x) = x if x ≥ 0, α(e^x − 1) if x < 0

Derivative:

f’(x) = 1 if x ≥ 0, αe^x if x < 0

Exponential Linear Unit, widely known as ELU, is a function that tends to converge the cost to zero faster and produce more accurate results. Unlike other activation functions, ELU has an extra constant α, which should be a positive number.

ELU is very similar to ReLU except for negative inputs. Both are the identity function for non-negative inputs. For negative inputs, ELU smoothly saturates towards −α, whereas ReLU bends sharply at zero.

Advantages:

  • ELU becomes smooth slowly until its output equals −α, whereas ReLU bends sharply at zero.
  • ELU is a strong alternative to ReLU. Unlike ReLU, ELU can produce negative outputs.

Disadvantages:

  • For x > 0 it can blow up the activation, since the output range is [0, ∞).
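
A minimal NumPy sketch of ELU and its derivative with the common default α = 1 (my assumption; the original leaves α unspecified):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: identity for x >= 0, alpha * (exp(x) - 1) for x < 0,
    so negative outputs smoothly saturate towards -alpha."""
    return np.where(x >= 0, x, alpha * (np.exp(x) - 1.0))

def elu_derivative(x, alpha=1.0):
    """Derivative: 1 for x >= 0, alpha * exp(x) (i.e. elu(x) + alpha) for x < 0."""
    return np.where(x >= 0, 1.0, alpha * np.exp(x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(elu(x), 4))             # [-0.9933 -0.6321  0.      1.      5.    ]
print(np.round(elu_derivative(x), 4))  # [0.0067 0.3679 1.     1.     1.    ]
```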

7. Swish function

function:

y = x · sigmoid(x)

OR

y = swish(x) = xσ(βx)

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. β can either be a constant defined prior to training or a parameter that is trained during training.

The value of β can greatly influence the shape of the curve and hence the output, accuracy, and training time; plotting the swish function for various values of β shows this clearly.

Derivative:

f’(x) = βf(x) + σ(βx)(1 − βf(x))

Like the function itself, the shape of the derivative also changes with β.

Advantages:

  • No vanishing gradient
  • It is continuous and differentiable at all points.
  • Unlike ReLU, it does not suffer from the problem of dying neurons.
  • It is simple and easy to use.

Disadvantages:

  • It is slower to compute than ReLU and its variants such as Leaky ReLU and Parametric ReLU, because of the sigmoid involved in computing the output.
  • The swish activation function is unstable, and its behaviour cannot be predicted a priori.
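
A minimal NumPy sketch of swish and the derivative given above, with β = 1 by default (my choice):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish: x * sigmoid(beta * x). With beta = 1 it is also called SiLU."""
    return x * sigmoid(beta * x)

def swish_derivative(x, beta=1.0):
    """Derivative: beta * f(x) + sigmoid(beta * x) * (1 - beta * f(x))."""
    f = swish(x, beta)
    return beta * f + sigmoid(beta * x) * (1.0 - beta * f)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(swish(x), 4))             # [-0.0335 -0.2689  0.      0.7311  4.9665]
print(np.round(swish_derivative(x), 4))  # [-0.0265  0.0723  0.5     0.9277  1.0265]
```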

8. Softmax or normalized exponential function:

Softmax is a very interesting activation function because it not only maps our output to a [0,1] range but also maps each output in such a way that the total sum is 1. The output of Softmax is therefore a probability distribution.

Softmax is used for multi-class classification (as in a multinomial logistic regression model), whereas sigmoid is used for binary classification in a logistic regression model; with softmax, the probabilities across all classes sum to one.

function:

softmax(x)_i = e^(x_i) / Σ_j e^(x_j)

Derivative:

∂softmax(x)_i / ∂x_j = softmax(x)_i (δ_ij − softmax(x)_j)
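
A minimal NumPy sketch of softmax and its Jacobian (the max-subtraction trick for numerical stability is my addition, not from the original):

```python
import numpy as np

def softmax(x):
    """Softmax: exponentiate, then normalise so the outputs sum to 1.
    Subtracting the max first is a standard numerical-stability trick."""
    exps = np.exp(x - np.max(x))
    return exps / np.sum(exps)

def softmax_jacobian(p):
    """Jacobian of softmax at probabilities p: J[i, j] = p[i] * (delta_ij - p[j])."""
    return np.diag(p) - np.outer(p, p)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(np.round(probs, 4))   # [0.659  0.2424 0.0986]
print(probs.sum())          # ~1.0
print(np.round(softmax_jacobian(probs), 4))
```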

9. Softplus activation function:

The softplus function can be considered a smooth (“softened”) version of ReLU, which computes y = max(0, x).

function:

f(x) = ln(1 + e^x)

Derivative:

dy/dx = 1 / (1 + e^(−x))

The derivative of softplus is exactly the sigmoid function.

Points:

  • Outputs produced by the sigmoid and tanh functions have upper and lower limits, whereas the softplus function produces outputs in the range (0, +∞). That is the essential difference.
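
A minimal NumPy sketch of softplus, whose derivative comes out as the sigmoid:

```python
import numpy as np

def softplus(x):
    """Softplus: ln(1 + e^x), a smooth approximation of ReLU with output in (0, inf)."""
    return np.log1p(np.exp(x))

def softplus_derivative(x):
    """The derivative of softplus is exactly the sigmoid function."""
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.round(softplus(x), 4))             # [0.0067 0.3133 0.6931 1.3133 5.0067]
print(np.round(softplus_derivative(x), 4))  # [0.0067 0.2689 0.5    0.7311 0.9933]
```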

10. Maxout function:

The Maxout activation is a generalization of the ReLU and the leaky ReLU functions. It is a learnable activation function.

function:

maxout(x) = max(w1·x + b1, w2·x + b2, …, wk·x + bk), the maximum over k learned linear functions of the input

Points:

  • It is a piecewise linear function that returns the maximum of the inputs, designed to be used in conjunction with the dropout regularization technique.
  • Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron, therefore, enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).
  • However, it doubles the number of parameters for each neuron (in the common two-piece case), so a higher total number of parameters needs to be trained.
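
A minimal sketch of a maxout unit over k affine pieces, plus a two-piece weight choice that reproduces ReLU (the weights here are hand-picked for illustration; in practice they are learned):

```python
import numpy as np

def maxout(x, W, b):
    """Maxout unit: the maximum over k affine pieces, max_j (W[j] @ x + b[j]).
    W has shape (k, d) and b has shape (k,) for a d-dimensional input x."""
    return np.max(W @ x + b)

# A 2-piece maxout with W = [[1], [0]] and b = [0, 0] reproduces ReLU: max(x, 0).
W = np.array([[1.0], [0.0]])
b = np.array([0.0, 0.0])
for value in (-2.0, 0.5, 3.0):
    x = np.array([value])
    print(value, "->", maxout(x, W, b))   # -2.0 -> 0.0, 0.5 -> 0.5, 3.0 -> 3.0
```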

