Questions tagged [mini-batch-gradient-descent]

Mini-batch gradient descent is a variation of the gradient descent algorithm that splits the training dataset into small batches, which are used to calculate the model error and update the model coefficients. Implementations may average (or sum) the gradient over the mini-batch, which reduces the variance of the gradient estimate compared with single-example updates. Because the weights are updated several times per epoch rather than once, mini-batch training typically converges faster than full-batch gradient descent while remaining more computationally efficient than pure stochastic gradient descent.
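
As a concrete illustration of the update scheme described above, here is a minimal NumPy sketch of mini-batch gradient descent for linear regression (function and variable names are illustrative only, not taken from any question below): the data are shuffled each epoch, split into batches, and the gradient averaged over each batch drives one parameter update, so the weights are updated many times per epoch.

```python
import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, batch_size=32, epochs=100, seed=0):
    """Fit y ≈ X @ w + b by mini-batch gradient descent on the MSE loss."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0

    for _ in range(epochs):
        # Shuffle once per epoch so each mini-batch is a random subset.
        order = rng.permutation(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            X_batch, y_batch = X[idx], y[idx]

            # Gradient of the MSE averaged over the mini-batch; averaging
            # (rather than summing) keeps the step size independent of
            # the batch size.
            error = X_batch @ w + b - y_batch
            grad_w = X_batch.T @ error / len(idx)
            grad_b = error.mean()

            # One parameter update per mini-batch: many updates per epoch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy usage: recover w ≈ [2, -3], b ≈ 0.5 from noisy synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
y = X @ np.array([2.0, -3.0]) + 0.5 + 0.01 * rng.normal(size=1000)
w, b = minibatch_gradient_descent(X, y)
print(w, b)
```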

56 questions
22 votes • 2 answers

Sliding window leads to overfitting in LSTM?

Will I overfit my LSTM if I train it via the sliding-window approach? Why do people not seem to use it for LSTMs? For a simplified example, assume that we have to predict the sequence of characters: A B C D E F G H I J K L M N O P Q R S T U V W X Y…
Kari • 2,686
13 votes • 2 answers

Why does averaging the gradient work in gradient descent?

In full-batch gradient descent or mini-batch GD we compute gradients from several training examples. We then average them to obtain a "high-quality" gradient from several estimates and finally use it to correct the network all at once. But why…
Kari • 2,686
8 votes • 2 answers

Why is taking the gradient of the average error in SGD not correct, but rather the average of the gradients of single errors?

I am a little confused about taking averages in cost functions and SGD. So far I have always thought that in SGD you would compute the average error for a batch and then backpropagate it, but then I was told in a comment on this question that that was wrong…
7 votes • 1 answer

sklearn: SGDClassifier yields lower accuracy than LogisticRegression

I'm participating in the Kaggle Iceberg Classifier Challenge, where the idea is to classify whether an object present in a radar image is an iceberg or a ship. I am currently trying to implement stochastic gradient descent to get a better idea for…
6 votes • 1 answer

Changing the batch size during training

The choice of batch size is in some sense a measure of stochasticity: on the one hand, smaller batch sizes make gradient descent more stochastic, so SGD can deviate significantly from exact GD on the whole dataset, but they allow for more…
6 votes • 1 answer

Does small batch size improve the model?

I'm training an LSTM with Keras. I've noticed that the smaller the batch size, the more the loss decreases during each epoch, which makes me think that the network handles fewer items at a time better. Is this normal behavior in general?
pairon • 395
6 votes • 2 answers

Latent loss in variational autoencoder drowns generative loss

I'm trying to run a variational autoencoder on the CIFAR-10 dataset, for which I've put together a simple network in TensorFlow with 4 layers each in the encoder and decoder and an encoded vector size of 256. For calculating the latent loss, I'm…
6 votes • 1 answer

How backpropagation through gradient descent represents the error after each forward pass

In a neural network multilayer perceptron, I understand that the main difference between stochastic gradient descent (SGD) and gradient descent (GD) lies in how many samples are chosen during training. That is, SGD iteratively chooses one…
5 votes • 1 answer

Train loss vs validation loss

I have a few basic questions about tracking losses during training. If I am using mini-batch training, should I validate after each batch update or after I have seen the entire dataset? What should be the condition to stop the training to prevent…
4 votes • 2 answers

In sequence models, is it possible to have training batches with different timesteps each to reduce the required padding per input sequence?

I want to train an LSTM model with variable length inputs. Specifically I want to use as little padding as possible while still using minibatches. As far as I understand each batch requires a fixed number of timesteps for all inputs, necessitating…
3 votes • 1 answer

Will stochastic gradient descent converge for multivariate linear regression

I am trying to figure out if stochastic gradient descent for a multivariate linear regression will converge (assuming there is no mini-batching, i.e., the batch size is 1). My guess is yes, based on the fact that stochastic gradient descent will…
3 votes • 1 answer

Plotting Gradient Descent in 3d - Contour Plots

I have generated 3 parameters along with the cost function. I have the $\theta$ lists and the cost list of 100 values from the 100 iterations. I would like to plot the last 2 parameters against cost in 3d to visualize the level sets on the contour…
3 votes • 1 answer

Training a model on random samples from a large dataset

I have a huge dataset (more than 1 million data points). My dataset is text, and I am doing NER on it to identify a few entities. If I randomly choose 100 data points from the total dataset and train my model (LSTM), will this yield good results? I will be…
rawwar • 831
3 votes • 1 answer

What does a minibatch for an LSTM look like?

A minibatch is a collection of examples that are fed into the network (example after example), and backprop is done after every single example. We then take the average of these gradients and update our weights. This completes the processing of one minibatch. I…
Kari • 2,686
2 votes • 3 answers

How much of a problem is each member of a batch having the same label?

I have a batch size of 128 and a total data size of around 10 million, and I am classifying between 4 different label values. How much of a problem is it if each batch only contains data with one label? So for example - batch 0 all have the 3rd…