Batch, Mini-Batch and Stochastic Gradient Descent

Sweta
8 min read · Aug 26, 2020

Content:

  1. What is an Optimizer?
  2. Batch Gradient Descent
  3. Mini-Batch Gradient Descent
  4. What is batch size?
  5. Stochastic Gradient Descent
  6. Comparison

If you don’t have a good understanding of gradient descent, I would highly recommend visiting this link first: Gradient Descent explained in simple way, and then continuing here. ☺

What is an Optimizer?

An optimizer is an algorithm or method used to change the attributes of a neural network, such as its weights and learning rate, in order to reduce the loss.

In other words, how we should change the weights and learning rates of our neural network to reduce the loss is defined by the optimizer we use.

Overall, we can say

  • (Predicted output − actual output) is the loss, and this loss is what we pass to the optimizer.
  • Role of the optimizer: it is responsible for reducing the loss and providing the most accurate results possible by updating the attributes of the neural network, as sketched below.
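
To make that role concrete, here is a minimal sketch of the update rule a gradient-descent optimizer applies. The function and variable names are purely illustrative, and the gradient values are made up; in a real network dE/dw would come from backpropagation.

```python
import numpy as np

# Minimal sketch of the update rule a gradient-descent optimizer applies:
# new_weight = old_weight - learning_rate * dE/dw
def gradient_descent_step(weights, gradient, learning_rate=0.01):
    """Move the weights a small step against the gradient of the loss."""
    return weights - learning_rate * gradient

# Illustrative call with made-up numbers (dE/dw would come from backpropagation)
w = np.array([0.5, -1.2])
grad = np.array([0.1, -0.3])
w = gradient_descent_step(w, grad)
```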

There are different types of optimizers:

  • Batch Gradient Descent
  • Mini-Batch Gradient Descent
  • Stochastic Gradient Descent

These types are simply different approaches for sending data through the network.

Batch Gradient Descent

Batch gradient descent (BGD) is a variation of the gradient descent algorithm that calculates the error for each example in the training dataset, but only updates the model after all training examples have been evaluated.

Let us understand it like this: suppose I have n records. BGD sends all the data through the network, calculates the summation of the loss, and then computes dE/dw (where E is the summation of the loss). In other words, it performs backward propagation based on the summed loss over all records.

Here, batch size = n

One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that BGD performs model updates at the end of each training epoch.
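
As a rough illustration, here is a minimal sketch of BGD for a simple linear model with a mean-squared-error loss. The model, loss, and function names are assumptions made just for this example, not part of any particular library.

```python
import numpy as np

# Sketch of batch gradient descent for a linear model y ≈ X @ w with MSE loss.
# All n records contribute to a single weight update per epoch (batch size = n).
def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        error = X @ w - y            # (predicted - actual) for every record
        grad = X.T @ error / len(y)  # dE/dw accumulated over ALL records
        w -= lr * grad               # one update at the end of the epoch
    return w
```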

Advantages:

  • It is more computationally efficient than SGD, because far fewer weight updates are made (one per epoch).
  • The learnable parameters are derived from all of the data: whenever we calculate a new weight, we consider all the data available to us through the summation of the loss, so the new value of each weight/bias (a learnable parameter) reflects the entire dataset.

Disadvantages:

  • Memory consumption is too high: we send all the data through the network one record at a time, so we need memory to store the loss received in each and every iteration. Only after the whole dataset has passed through the network do we compute the total loss. In this case, memory consumption is very high, and this happens at every step.
  • Because memory consumption is so high, computation is also heavy, calculation is very slow, and optimization is slower compared to the other optimizers.

Advice: try to decrease your batch size.

Points:

  • BGD tries to converge, so it is able to reach a global minimum, i.e.,

If it converges → global minimum (dE/dw = 0), and

if it does not converge → local minimum.

Convergence: reaching a point at which gradient descent makes only very small changes to your objective function is called convergence. This does not necessarily mean it has reached the optimal result (but it is usually quite near, if not on it).
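
In practice, convergence is often checked with a simple tolerance test like the sketch below; the tolerance value 1e-6 is an arbitrary choice for illustration.

```python
import numpy as np

# Sketch of a convergence test: stop when the gradient is (almost) zero,
# i.e. dE/dw ≈ 0. The tolerance is an arbitrary illustrative value.
def has_converged(gradient, tol=1e-6):
    return np.linalg.norm(gradient) < tol
```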

Mini-Batch Gradient Descent (MGD)

MGD is a variation of the gradient descent algorithm that splits the training dataset into small batches, which are used to calculate the model error and update the model coefficients.

Let us understand it like this:

Suppose I have 1000 records and my batch size = 50. I randomly choose 50 records, calculate the summation of their loss, and then send that loss to the optimizer to compute dE/dw and update the weights.

Note: batches are formed by random selection from the dataset (see the sketch below).
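
Continuing the 1000-records / batch-size-50 example, here is a minimal sketch of MGD using the same assumed linear model and MSE loss as the BGD sketch above; shuffling provides the random selection of batches.

```python
import numpy as np

# Sketch of mini-batch gradient descent: shuffle the records, slice them into
# batches of 50, and update the weights once per batch (linear model + MSE).
def minibatch_gradient_descent(X, y, lr=0.01, epochs=10, batch_size=50):
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(epochs):
        order = np.random.permutation(n)           # random selection of records
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            error = X[idx] @ w - y[idx]
            grad = X[idx].T @ error / len(idx)     # dE/dw over this batch only
            w -= lr * grad                         # update after every batch
    return w
```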

Advantages:

  • The model update frequency is higher than in BGD: with MGD we do not wait for the entire dataset; we pass just 50 (or 100, 200, 256, …) records and then perform an optimization step.
  • Batching gives the efficiency of not holding all the training data in memory, and it simplifies algorithm implementations. We also control memory consumption, since we only store the losses for the current batch.
  • Batched updates provide a computationally more efficient process than SGD.

Disadvantages:

  • There is no guarantee of better convergence of the error.
  • The 50 sample records we take may not represent the properties (or variance) of the entire dataset. This is why we may not get clean convergence, i.e., we won't reach an absolute global or local minimum at any point in time.
  • While using MGD, since we take records in batches, different batches may produce different errors. So we have to control the learning rate ourselves whenever we use MGD: if the learning rate is very low, the convergence rate also falls; if the learning rate is too high, we won't get an absolute global or local minimum. So we need to tune the learning rate.

Note: if the batch size = the total number of records, then MGD is identical to BGD.

What is batch size?

The batch size defines the number of samples that will be propagated through the network.

For instance, let's say you have 1050 training samples and you want to set up a batch_size equal to 100. The algorithm takes the first 100 samples (1st to 100th) from the training dataset and trains the network. Next, it takes the second 100 samples (101st to 200th) and trains the network again. We keep doing this until we have propagated all samples through the network. A problem can arise with the last set of samples: in our example, 1050 is not divisible by 100 without a remainder. The simplest solution is to take the final 50 samples and train the network on them.
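
The slicing in this 1050/100 example can be sketched in a couple of lines; the start/end index pairs are just for illustration.

```python
# Sketch of how 1050 samples are sliced into batches of 100: ten full batches
# plus a final batch containing the remaining 50 samples.
n_samples, batch_size = 1050, 100
batches = [(start, min(start + batch_size, n_samples))
           for start in range(0, n_samples, batch_size)]
print(len(batches))   # 11 batches in total
print(batches[-1])    # (1000, 1050) -> the last batch holds only 50 samples
```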

Advantages of using a batch size < number of all samples:

It requires less memory. Since you train the network using fewer samples, the overall training procedure requires less memory. That's especially important if you are not able to fit the whole dataset in your machine's memory.

Typically networks train faster with mini-batches, because we update the weights after each propagation. In our example we propagated 11 batches (10 of them had 100 samples and 1 had 50 samples), and after each of them we updated the network's parameters. If we used all samples in a single propagation, we would make only 1 update of the network's parameters.

Disadvantages of using a batch size < number of all samples:

The smaller the batch, the less accurate the estimate of the gradient will be. In the figure below, you can see that the direction of the mini-batch gradient (green) fluctuates much more than the direction of the full-batch gradient (blue).

Stochastic gradient descent is just mini-batch gradient descent with batch_size equal to 1. In that case, the gradient changes its direction even more often than a mini-batch gradient.

Stochastic Gradient Descent

SGD is a variation of gradient descent that calculates the error and updates the model for each individual record in the training dataset.

Let us understand it like this:

Here, we randomly select only one record and send it through the neural network. Then we calculate the loss, send that loss to the optimizer to compute dE/dw, and update the weights.

Note: it keeps sending records one by one until it converges to a minimum point (see the sketch below).
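
Here is a minimal sketch of SGD under the same assumed linear-model/MSE setup as the earlier sketches: one randomly chosen record per weight update.

```python
import numpy as np

# Sketch of stochastic gradient descent: take one record at a time (in random
# order), compute its loss gradient, and update the weights immediately.
def stochastic_gradient_descent(X, y, lr=0.01, epochs=10):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in np.random.permutation(len(y)):
            error = X[i] @ w - y[i]   # (predicted - actual) for ONE record
            grad = X[i] * error       # dE/dw for this single record
            w -= lr * grad            # update after every record
    return w
```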

Advantages:

  • For every record, we update the weights, so the model is learning continuously.
  • Weight updates are faster.
  • The loss calculation does not have to wait for the entire dataset.
  • The optimizer does not have to wait for the entire dataset either.
  • Memory consumption is also low.
  • SGD makes updates faster than MGD and BGD (each update needs only one record).

Points: since, in SGD, records are sent one by one, if we talk about minima we will encounter multiple minima-like points, as there is effectively a minimum for each record, and the path fluctuates as described below:

It keeps fluctuating, and we may fall into a local minimum at any point in time.

Disadvantages:

  • It has huge oscillations: SGD jumps from one point to another for every single record, so it is tough to reach an absolute minimum, and we end up with multiple minima-like points.
  • We need to control the learning rate: if the learning rate is too high, the next record may not show the same properties; still, the effect of the learning rate in SGD is a little smaller compared to BGD and MGD.

Comparison: if we compare all three optimizers, each has its own advantages and disadvantages. We cannot conclude which optimizer is best; it totally depends on the dataset.

But this diagram may clear up your overall confusion.

Related to this topic, we also need to discuss momentum-based optimizers, which I will post about very soon.

I will be very happy ☺ if you give this blog a clap, but only if you liked it. This will motivate me to write more.

Thank you

Keep reading, keep learning ☺
