Gradient descent explained in a simple way

Sweta
Aug 20, 2020

Gradient descent is nothing but an algorithm that minimises a function by optimising its parameters.

Content

  1. Gradient descent with an example
  2. Gradient descent with one parameter
  3. Gradient descent with more than one parameter
  4. How does the algorithm know where to stop?

Gradient descent with an example

Let us understand gradient descent with an example.

Now, my friend and I always used to compare our marks. One day we had a maths test in which the maximum marks were 50.

Let me tell you about the conversation we had. Kindly observe it carefully.

Me: So, we got the results of our maths test. How much did you score?

Friend: Guess!

Me: 40 (a random guess)

Friend: Hey, no. I am not that intelligent.

Me: Oh! Ok, is it 30?

Friend: No, not that low either.

Me: Ok, then 35.

Friend: Ahh! Not exactly, but you are very close to the answer.

Now, this is exactly how gradient descent works internally.

In Gradient descent, we start with a random guess and then we slowly move to the best answer to minimise our error.

In this example, I first started with 40. Since 40 was wrong, we jumped directly to 30, which was quite a large jump. 30 was also wrong, but now we knew from our friend that the answer was between 30 and 40. So, we reduced our jump and guessed 35. This was a small jump compared to the previous one.

If 35 is also not correct, we will move 1 step up or down.

This is exactly how parameters are optimized in gradient descent.

Common formula for a parameter update in gradient descent

New value = old value - step size

OR

New value = old value - (learning rate * slope)

In gradient descent, the step size is computed as:

step size = learning rate * slope

The new value is simply the old value adjusted by this step size.

Now, if we compare the formula with our example, it looks like this:

New guess = old guess - (shifting)

Here, shifting determines how much you want to shift (in our example, we first shifted by +10 and then by -5).
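To make the update rule concrete, here is a minimal Python sketch of a single parameter update. The names (update, old_value, slope, learning_rate) are just illustrative, not from any library:

```python
# One gradient descent update: new value = old value - (learning rate * slope)
def update(old_value, slope, learning_rate=0.1):
    step_size = learning_rate * slope   # how far we shift
    return old_value - step_size        # move against the slope

# If the slope at the current guess is positive, the new value moves down;
# if the slope is negative, the new value moves up.
print(update(40, slope=2.0))    # 39.8 -> shifted down
print(update(30, slope=-2.0))   # 30.2 -> shifted up
```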

Example of gradient descent with one parameter

This is the square function (it contains only one parameter):

f(x) = x²

This is its graph:

This function has its minimum value at the point (0,0).

Now, since (0,0) is the minimum point, we have to reach that point by doing something.

Let’s understand that something.

Case 1 :

Let us randomly take a point on the graph and name it point 1, as shown in the example.

Now, from that point, where should we move: upward or downward? We don't know.

Then, how do we find out which direction to go in order to reach the minimum point (0,0)? To solve this, we use the concept of the derivative.

We compute the derivative of this curve w.r.t. x at point 1.

From point 1, if I increase the value of x, my function f(x) decreases. And here, our objective is to decrease f(x).

Hence, from this point, the correct movement is going downwards.

Case 2 :

Another case could be that I randomly start from a different point and name it point 2.

From this point, if I compute the slope w.r.t. x, I find that if I increase the value of x, the value of my function also increases. But our objective is to decrease f(x) to reach the minimum point.

So, in this case, we will have to decrease the value of x in order to decrease f(x).

This is the reason the slope at point 2 is +ve and the slope at point 1 is -ve.
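As a quick check, here is a small Python sketch (with arbitrary example points) that evaluates the derivative f'(x) = 2x on the left and right of the minimum, showing what the sign of the slope looks like at points like 1 and 2:

```python
def f(x):
    return x ** 2      # the function we want to minimise

def f_prime(x):
    return 2 * x       # its derivative (the slope)

# A point like 1, to the left of the minimum: slope is -ve, so we must increase x
print(f_prime(-3))     # -6

# A point like 2, to the right of the minimum: slope is +ve, so we must decrease x
print(f_prime(3))      # 6
```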

function: f(x) = x²

derivative: f'(x) = 2x

formula: new value = old value - (LR * slope)

We use this formula in case 1 and case 2.

Case 1: where slope = -ve

If the slope is -ve, the term -(LR * slope) is +ve, so the value of x increases.

Therefore, in case 1,

New value > old value

Case 2: where slope = +ve

If the slope is +ve, the term -(LR * slope) is -ve, so the value of x decreases.

Therefore, in case 2,

New value < old value

Therefore, the slope tells us which direction to move in, whether to decrease x or increase x, in order to reach the global minimum.
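Putting it all together, here is a short, self-contained Python sketch that repeatedly applies new value = old value - (LR * slope) to f(x) = x²; the starting point, learning rate and number of iterations are arbitrary choices for illustration:

```python
def f_prime(x):
    return 2 * x                      # slope of f(x) = x^2

x = 4.0                               # random starting guess (a negative start works too)
learning_rate = 0.1

for _ in range(50):
    slope = f_prime(x)
    x = x - learning_rate * slope     # new value = old value - (LR * slope)

print(round(x, 4))                    # very close to 0, the global minimum of x^2
```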

So, this was for a 2D plane.

But what do we do when we have more than one parameter? How do we optimise those functions? What is the process of optimisation?

Don’t worry, I am sure at the end of this article, you will have answers to all these questions.

Gradient descent with more than one parameter

Let's take the example of a regression model.

In a plane, a regression model looks like:

y = mx + c

We know the cost function for a linear regression model (the mean squared error):

J(m, c) = (1/n) * Σ (yᵢ - (m·xᵢ + c))²

Here, the sum runs over the n training points (xᵢ, yᵢ).

What is the use of gradient descent?

The use of gradient descent is to minimise the cost function so that the model fits in the best way possible.
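To see what this cost looks like in code, here is a minimal Python sketch of the mean squared error for y = mx + c; the toy data points are made up purely for illustration:

```python
def cost(m, c, xs, ys):
    # Mean squared error between predictions (m*x + c) and the actual y values
    n = len(xs)
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys)) / n

# Hypothetical training data, generated from y = 2x + 1
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

print(cost(m=1, c=0, xs=xs, ys=ys))   # 13.5 -> high cost with a bad guess
print(cost(m=2, c=1, xs=xs, ys=ys))   # 0.0  -> no cost with the true parameters
```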

In a linear regression model, we have two parameters, 'm' and 'c'.

So, we will have a 3D graph.

Let's say:

x axis = cost function

y axis = m

z axis = c

Now, this function needs to be minimised to find the optimal values of m and c.

Let us do some calculation, don’t worry, it will be easy and we will have fun doing that.

Let us assume

Our training data is:

We are actually trying to train a linear regressor using gradient descent.

Let us understand, how does the internal mathematics work.

Again, don’t panic, this will be simple mathematics.

Step 1: Just like in our marks example, where we first guessed random marks, we will do the same thing here and assume initial values.

Initial assumption: c = 0 and m = 1.

Let us first find the new value of 'c'.

Taking the derivative w.r.t. 'c', using the chain rule of differentiation, we get:

∂J/∂c = (-2/n) * Σ (yᵢ - (m·xᵢ + c))

If you are unfamiliar with the chain rule, I would highly recommend you also visit this link: Backward Propagation in simple way

Updating the existing 'c':

new c = old c - (learning rate * ∂J/∂c)

The learning rate represents how slowly we change our parameters at each step. Here, we take learning rate = 0.001.

We started with c = 0, and our new value of c is 0.004.
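To see where 0.004 comes from: with the update new c = old c - (learning rate * ∂J/∂c), starting at c = 0 with a learning rate of 0.001, a new value of 0.004 means the gradient computed from the training data was -4, since 0 - 0.001 * (-4) = 0.004. (The exact gradient value depends on the training data used.)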

Similarly, the cost function is differentiated w.r.t. 'm', and the new value of 'm' is found.

Now, we take the updated values of 'c' and 'm' for the next iteration, and this gets repeated until we reach the minimum of the cost function.
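Here is a small, self-contained Python sketch of this whole loop for y = mx + c; the training data, learning rate and number of iterations below are hypothetical stand-ins, not the article's own numbers:

```python
# Gradient descent for simple linear regression y = m*x + c with the MSE cost
xs = [1, 2, 3, 4]          # hypothetical training inputs
ys = [3, 5, 7, 9]          # hypothetical targets (generated from y = 2x + 1)

m, c = 1.0, 0.0            # initial assumption, as in the article
learning_rate = 0.001
n = len(xs)

for _ in range(100000):
    # Partial derivatives of the MSE cost w.r.t. m and c (chain rule)
    dm = (-2 / n) * sum(x * (y - (m * x + c)) for x, y in zip(xs, ys))
    dc = (-2 / n) * sum((y - (m * x + c)) for x, y in zip(xs, ys))
    # Update both parameters with the common formula: new = old - (LR * slope)
    m = m - learning_rate * dm
    c = c - learning_rate * dc

print(round(m, 3), round(c, 3))   # approaches m = 2, c = 1
```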

How does the algorithm know where to stop?

There are multiple ways, but typically there comes a time when, after thousands of iterations, the cost function value has not changed much. Gradient descent then understands that there is not much improvement left in the cost function, and at that point it stops.
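A common way to implement this is to watch how much the cost changes between iterations and stop when the change drops below a small tolerance (while also capping the total number of iterations). A minimal sketch, with a hypothetical tolerance and the simple f(x) = x² cost from earlier:

```python
def f(x):
    return x ** 2              # the cost we are minimising

def f_prime(x):
    return 2 * x               # its slope

x, learning_rate = 5.0, 0.1
tolerance = 1e-9               # hypothetical threshold for "not much improvement"
previous_cost = f(x)

for i in range(100000):        # cap on iterations, in case it never converges
    x = x - learning_rate * f_prime(x)
    current_cost = f(x)
    if abs(previous_cost - current_cost) < tolerance:
        break                  # the cost barely changed, so stop here
    previous_cost = current_cost

print(i, round(x, 6))          # iteration at which we stopped, and the final x
```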

Give it a clap if you liked the way I explained this topic; your clap will motivate me to write more.

Thank you for reading.

Keep learning, keep reading ☺

