Stochastic gradient descent for machine learning



In gradient descent, the batch is the set of samples used to compute the gradient in a single iteration. So far, we have assumed that the batch is the entire dataset. At Google's scale, however, datasets often contain billions or even hundreds of billions of samples, and they usually include massive numbers of features as well. A batch can therefore be enormous.
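To make the cost concrete, here is a minimal sketch of a full-batch gradient computation for single-feature linear regression under squared loss. The dataset, loss, and variable names are illustrative assumptions, not from the original; note that every sample is touched on every iteration.

```python
import numpy as np

# Illustrative setup (assumed, not from the article): one weight w, one
# bias b, squared loss averaged over the WHOLE dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)                      # stand-in for a huge dataset
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=X.size)

def full_batch_gradient(w, b):
    """Gradient of mean squared error over ALL samples (batch = entire dataset)."""
    err = (w * X + b) - y
    grad_w = 2.0 * np.mean(err * X)              # d(MSE)/dw
    grad_b = 2.0 * np.mean(err)                  # d(MSE)/db
    return grad_w, grad_b

grad_w, grad_b = full_batch_gradient(w=0.0, b=0.0)
```

At a billion samples, every single update would require a pass over all of them, which is exactly the problem the rest of this article addresses.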

With a very large batch, even a single iteration can take a long time to compute.

A large dataset of randomly sampled examples probably contains redundant data; in fact, the larger the batch, the higher the chance of redundancy. Some redundancy can help smooth out noisy gradients, but beyond a certain point, enormous batches offer little more predictive value than merely large ones.

What if we could obtain the right gradient, on average, with far less computation?

By choosing samples at random from our dataset, we can estimate an average over the full dataset from a much smaller one (albeit noisily). Stochastic gradient descent (SGD) takes this idea to the extreme: it uses only one sample per iteration (a batch size of 1). Given enough iterations, SGD still works, but the process is very noisy. The term "stochastic" indicates that the single sample making up each batch is chosen at random.
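A minimal sketch of batch-size-1 SGD on the same kind of single-feature regression problem (the data, learning rate, and step count are illustrative assumptions): each update uses exactly one randomly chosen sample, so each step is a noisy estimate of the true gradient, yet the parameters still drift toward the right values.

```python
import numpy as np

# Illustrative single-feature regression data (assumed, not from the article).
rng = np.random.default_rng(0)
X = rng.normal(size=1_000)
y = 3.0 * X + 1.0 + rng.normal(scale=0.1, size=X.size)

w, b, lr = 0.0, 0.0, 0.05
for step in range(5_000):
    i = rng.integers(X.size)          # "stochastic": pick ONE random sample
    err = (w * X[i] + b) - y[i]
    w -= lr * 2.0 * err * X[i]        # noisy one-sample estimate of the gradient
    b -= lr * 2.0 * err
# w and b wander near the true 3.0 and 1.0, though each step is noisy
```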

Mini-batch stochastic gradient descent

Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD. A mini-batch typically contains 10 to 1,000 randomly chosen samples. Mini-batch SGD reduces the noise of plain SGD while remaining far more efficient than full batch.

To simplify the explanation, we have focused on gradient descent for a single feature. Rest assured, gradient descent also works with sets of multiple features.
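To illustrate that last point, here is a hedged sketch of how the same update generalizes to several features: the weight becomes a vector, and the per-feature partial derivatives are computed in one vectorized step. The three-feature setup and constants are illustrative assumptions.

```python
import numpy as np

# Illustrative multi-feature data (assumed): 3 features instead of 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=500)

w, lr = np.zeros(3), 0.1
for _ in range(500):
    err = X @ w - y
    w -= lr * 2.0 * (X.T @ err) / X.shape[0]  # gradient for all features at once
```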

This work is licensed under a CC agreement; reprints must credit the author and link to this article.