Adagrad and RMSprop intuition | How do the Adagrad and RMSprop optimizers work in deep learning?

Sweta
5 min read · Aug 28, 2020

Let us start with a diagram.

You are probably familiar with it: the usual gradient-descent picture, where the weight moves step by step down the loss curve toward the global minimum.

What is happening here is that the weight of the neural network is being optimized using gradient descent.

Initially, let's say the weight is 0.8, i.e.,

w1 = 0.8

In the next epoch, the weight becomes w2 = 0.7

Then, w3 = 0.6

Further weights are calculated using the formula below:

w_t = w_(t-1) - η · ∂L/∂w_(t-1)

where η is the learning rate and ∂L/∂w_(t-1) is the derivative of the loss with respect to the previous weight.

We are familiar with this formula, right?

Note: Let's name this formula eq 1. Keep it in mind, because later in the explanation, whenever I say "eq 1", I mean this formula.

This formula describes how gradient descent optimizes the weight of a network. When the weight goes from 0.8 to 0.7, that change is called the step size, i.e., how much we increase or decrease the weight. Here, we are decreasing it by 0.1.

Hence, we call 0.1 the step size.
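To make eq 1 concrete, here is a minimal sketch of the plain gradient-descent update in Python. The quadratic loss, the learning rate, and the starting weight are my own illustrative choices, not values from this post (they are picked so the first update takes the weight from 0.8 to 0.7, as above). The point is only that the step size is learning_rate × gradient, and the learning rate itself never changes.

```python
# A minimal sketch of the plain gradient-descent update (eq 1).
# The quadratic loss and all the numbers are illustrative choices.

def grad(w):
    # dL/dw for an example loss L(w) = (w - 0.3)**2
    return 2.0 * (w - 0.3)

w = 0.8             # initial weight, as in the example above
learning_rate = 0.1

for epoch in range(4):
    g = grad(w)
    step = learning_rate * g   # this is the "step size" (A, B, C, D, ...)
    w = w - step               # eq 1: w_t = w_(t-1) - lr * dL/dw_(t-1)
    print(f"epoch {epoch + 1}: step size = {step:.4f}, new weight = {w:.4f}")
```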

For example, in the diagram above, where I have labelled the steps A, B, C, …

we can say that:

A is the step size for epoch 1,

B is the step size for epoch 2,

C is the step size for epoch 3,

D is the step size for epoch 4,

and similarly for all the epochs that follow.

But the step size should be small when we approach the global minimum, and it should be large when we are far from it.

So, basically, what we need is this: if we start from point 1 (see the diagram), there has to be a way to drop quickly from point 1 (since we are far from the global minimum, the step size should be large there), and as we get close to the global minimum, the step size should become small.

Now, how should the step size be computed so that this objective of varying it is achieved?

For this, Adagrad is introduced.

Adagrad

Adagrad stands for adaptive gradient.

What is adaptive?

The learning rate itself is adjusted (optimized) at every step, using the formula below:

η_t = η / √(α_t + ε)

Note: I have named this equation eq 2. Here, η is the initial learning rate and η_t is the new learning rate at time 't'.

And alpha at time 't' is

α_t = (∂L/∂w_1)² + (∂L/∂w_2)² + … + (∂L/∂w_t)²

We can also describe it as the sum of the squares of all the previous gradients.

This means the learning rate is re-tuned every time: all the previous gradients are accumulated into α_t and fed into eq 2, which gives the new learning rate, and that result is finally fed into eq 1 to get the new weight.

What is epsilon in eq 2?

Epsilon is a small positive number.

What is the use of epsilon in eq 2?

Sometimes it may happen that alpha at time 't' becomes very low. If we put that low value into eq 2, the denominator becomes very low, or nearly zero. We do not want the denominator to be zero, so to be on the safer side we add epsilon.

Reason: if alpha at time 't' were zero, the new learning rate would shoot up towards infinity, and the weight update in eq 1 would then move gradient descent in a completely different direction. We don't want this to happen, so we use epsilon.

This is how, in Adagrad, the learning rate changes in every epoch.
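Here is a small sketch of the Adagrad update described by eq 1 and eq 2. The toy loss function, the learning rate, and epsilon are illustrative assumptions; the important lines are the running sum of squared gradients (alpha) and the new learning rate computed from it.

```python
# A sketch of the Adagrad update (eq 2 feeding eq 1).
# The loss and hyperparameters are illustrative, not from the post.
import math

def grad(w):
    # dL/dw for an example loss L(w) = (w - 0.3)**2
    return 2.0 * (w - 0.3)

w = 0.8
learning_rate = 0.1    # the initial learning rate (eta)
eps = 1e-8             # the small positive epsilon from eq 2
alpha = 0.0            # alpha_t: running sum of squared gradients

for epoch in range(5):
    g = grad(w)
    alpha += g ** 2                                   # add the new squared gradient
    new_lr = learning_rate / math.sqrt(alpha + eps)   # eq 2: the adapted learning rate
    w = w - new_lr * g                                # feed it into eq 1
    print(f"epoch {epoch + 1}: new lr = {new_lr:.4f}, weight = {w:.4f}")
```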

Advantages:

Here, we take all the previous gradients into consideration when we compute the new learning rate. Taken together, the gradients tell you

  • At what speed you should move or
  • In which direction you should move

And that is how the step size gets optimized.

Disadvantages:

In the term α_t, we are accumulating all the previous gradients, i.e., all the previous derivatives of the loss function w.r.t. the weight.

Now, there is a possibility that alpha at time 't' becomes very large, and if we put this large value into eq 2, the denominator becomes large, which means the new learning rate becomes very small.

Which means that the step size will be very small.

Now, while finding the new weight using eq 1, the current weight at time 't' and the previous weight at time 't-1' become almost the same. But we don't want this; we want the weights to keep getting optimized.
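The shrinking learning rate is easy to see with numbers. In the sketch below I hold the gradient at a made-up constant value of 1.0, so alpha grows linearly with time and the new learning rate decays like 1/√t.

```python
# Adagrad's learning rate decays even when the gradient stays at a
# (made-up) constant value of 1.0, because alpha only ever grows.
import math

learning_rate, eps, alpha = 0.1, 1e-8, 0.0
for t in range(1, 1001):
    alpha += 1.0 ** 2                    # squared gradients keep piling up
    if t in (1, 10, 100, 1000):
        new_lr = learning_rate / math.sqrt(alpha + eps)
        print(f"t = {t:4d}: new lr = {new_lr:.5f}")
# prints roughly 0.10000, 0.03162, 0.01000, 0.00316:
# the updates become tiny even if we are still far from the minimum
```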

The solution to this is RMSprop.

RMSprop

In Adagrad (adaptive gradient), we use

α_t = (∂L/∂w_1)² + (∂L/∂w_2)² + … + (∂L/∂w_t)²

But in RMSprop, we use

S_t = β · S_(t-1) + (1 - β) · (∂L/∂w_t)²

and the new learning rate becomes

η_t = η / √(S_t + ε)

where β is a decay factor, typically around 0.9 or 0.95.

Let us understand the equation for S_t. I have separated it into two terms and named them A and B: term A is β · S_(t-1) and term B is (1 - β) · (∂L/∂w_t)².

Term A: this is the recent moving average of the squared gradients, and I am giving it more weightage (β).

Term B: the squared gradient was there in Adagrad also, and here I am giving it less weightage (1 - β).

Overall, what I want to say is:

we are giving less weightage to the plain sum of all the previous squared derivatives, and more weightage to the recent moving average.

This is how RMSprop works internally. This is not the one absolute mathematical formula used inside every implementation, but this is how we take care of weight optimization: the weights keep getting optimized in a better way, and the learning rate is not left constant.
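Here is the same toy setup rewritten with the RMSprop moving average; beta, the loss function, and the other numbers are illustrative assumptions, not values from this post. Because S_t is an exponential moving average rather than an ever-growing sum, the new learning rate does not keep shrinking towards zero.

```python
# A sketch of the RMSprop update; beta, the loss, and the other
# numbers are illustrative assumptions.
import math

def grad(w):
    # dL/dw for an example loss L(w) = (w - 0.3)**2
    return 2.0 * (w - 0.3)

w = 0.8
learning_rate = 0.01
beta = 0.9             # decay factor for the moving average
eps = 1e-8
s = 0.0                # S_t: exponential moving average of squared gradients

for epoch in range(5):
    g = grad(w)
    s = beta * s + (1 - beta) * g ** 2           # term A + term B
    new_lr = learning_rate / math.sqrt(s + eps)  # stays usable, unlike Adagrad's
    w = w - new_lr * g
    print(f"epoch {epoch + 1}: new lr = {new_lr:.4f}, weight = {w:.4f}")
```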

Recap:

In gradient descent, the problem was that the learning rate stayed the same every time. Adagrad changes the learning rate (and hence the step size), but it has its own problem: the denominator can become very large, and the updates almost stop. To control this, we give a smaller weightage to the plain sum of all the previous squared gradients and a larger weightage to the recent moving average, and that is exactly what RMSprop does.

If you like this blog, give it a clap; it will motivate me to write more. Also, I will be very happy to hear suggestions from you.

Thank you ☺

Keep learning, keep reading. ☺

Reference:

  1. https://youtu.be/S2IzmZ7dzBU
