Questions tagged [optimization]

In statistics, this refers to selecting an estimator of a parameter by maximizing or minimizing some function of the data. One very common example is choosing the estimator that maximizes the joint density (or mass function) of the observed data, a procedure referred to as Maximum Likelihood Estimation (MLE).

494 questions
114 votes · 10 answers

Choosing a learning rate

I'm currently working on implementing Stochastic Gradient Descent (SGD) for neural nets using back-propagation, and while I understand its purpose, I have some questions about how to choose values for the learning rate. Is the learning rate related…
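A minimal sketch of the update this question is about, on an assumed toy one-parameter quadratic loss (not any real network): the learning rate scales every gradient step, and its size decides whether the iterates contract toward the minimizer or overshoot it.

```python
# Toy illustration: assumed loss f(w) = (w - 3)^2, minimized at w = 3.
# The learning rate lr scales each SGD step w <- w - lr * grad.
def sgd_step(w, grad, lr):
    return w - lr * grad

def grad_f(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w - 3)^2

w = 0.0
for _ in range(100):
    w = sgd_step(w, grad_f(w), lr=0.1)
# with lr = 0.1 each step shrinks the error by a factor 0.8, so w -> 3
```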
63 votes · 6 answers

Should a model be re-trained if new observations are available?

So, I have not been able to find any literature on this subject, but it seems like something worth thinking about: what are the best practices in model training and optimization if new observations are available? Is there any way to determine the…
asked by yad
50 votes · 2 answers

Why not always use the ADAM optimization technique?

It seems the Adaptive Moment Estimation (Adam) optimizer nearly always works better (faster and more reliably reaching a global minimum) when minimising the cost function in training neural nets. Why not always use Adam? Why even bother using…
asked by PyRsquared
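For readers weighing Adam against the alternatives, here is a bare-bones sketch of its update rule (beta and epsilon defaults as in the Kingma & Ba paper; the step size and loss are toy choices, and this is a simplified reading, not any library's implementation):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy run on f(w) = (w - 3)^2; the per-coordinate scaling by sqrt(v_hat)
# makes the effective step size roughly lr regardless of gradient magnitude.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t)
```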
46 votes · 5 answers

Does gradient descent always converge to an optimum?

I am wondering whether there is any scenario in which gradient descent does not converge to a minimum. I am aware that gradient descent is not always guaranteed to converge to a global optimum. I am also aware that it might diverge from an optimum…
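One concrete failure mode behind this question: even on the simplest convex function, a step size above the stability threshold makes gradient descent diverge. A toy demonstration on f(w) = w², where each update multiplies w by 1 − 2·lr, so any lr > 1 blows up:

```python
def gradient_descent(w, lr, steps):
    for _ in range(steps):
        w -= lr * 2.0 * w    # gradient of f(w) = w^2 is 2w
    return w

small = gradient_descent(1.0, lr=0.1, steps=50)  # |1 - 0.2| = 0.8 < 1: converges
large = gradient_descent(1.0, lr=1.5, steps=50)  # |1 - 3.0| = 2.0 > 1: diverges
```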
40 votes · 4 answers

Guidelines for selecting an optimizer for training neural networks

I have been using neural networks for a while now. However, one thing that I constantly struggle with is the selection of an optimizer for training the network (using backprop). What I usually do is just start with one (e.g. standard SGD) and then…
asked by mplappert
32 votes · 2 answers

Are there any rules for choosing the size of a mini-batch?

When training neural networks, one hyperparameter is the size of a mini-batch. Common choices are 32, 64, and 128 elements per mini-batch. Are there any rules/guidelines for how big a mini-batch should be? Or any publications which investigate the…
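Whatever size is chosen, the mechanics are the same: shuffle the dataset each epoch and slice it into consecutive chunks. A minimal sketch (illustrative helper, not any framework's loader), using 32 as one of the common sizes mentioned above:

```python
import random

def minibatches(data, batch_size, seed=0):
    """Yield the dataset in shuffled mini-batches; the last one may be short."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

data = list(range(100))
batches = list(minibatches(data, batch_size=32))
# 100 examples -> three batches of 32 and a final batch of 4
```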
27 votes · 2 answers

local minima vs saddle points in deep learning

I heard Andrew Ng (in a video I unfortunately can't find anymore) talk about how the understanding of local minima in deep learning problems has changed in the sense that they are now regarded as less problematic because in high-dimensional spaces…
asked by oW_
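The canonical two-dimensional picture behind that claim: f(x, y) = x² − y² has zero gradient at the origin, yet the origin is a saddle, not a minimum, because the curvature is positive along x and negative along y. A quick numerical check:

```python
def f(x, y):
    return x * x - y * y

# The gradient (2x, -2y) vanishes at the origin...
grad_at_origin = (2 * 0.0, -2 * 0.0)

# ...but the origin is not a minimum: f increases along x yet decreases along y.
f_origin = f(0.0, 0.0)
f_along_x = f(0.1, 0.0)   # +0.01
f_along_y = f(0.0, 0.1)   # -0.01
```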
16 votes · 1 answer

How many features to sample using Random Forests

The Wikipedia page which quotes "The Elements of Statistical Learning" says: Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split. I understand that this is a fairly good educated…
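The quoted heuristic as a one-liner (if memory serves, scikit-learn exposes the same default through the max_features="sqrt" setting on its random-forest classifier, but treat that as an assumption to verify):

```python
import math

def default_max_features(p):
    """floor(sqrt(p)) features considered at each split, per the quoted rule."""
    return math.floor(math.sqrt(p))

# e.g. 10 features -> 3 per split, 100 features -> 10 per split
```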
15 votes · 2 answers

Why aren't Genetic Algorithms used for optimizing neural networks?

From my understanding, Genetic Algorithms are powerful tools for multi-objective optimization. Furthermore, training Neural Networks (especially deep ones) is hard and has many issues (non-convex cost functions - local minima, vanishing and…
asked by cat91
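For contrast with the gradient-based methods discussed elsewhere on this page, here is a toy genetic algorithm (truncation selection plus Gaussian mutation; an illustrative sketch, not a production GA) minimizing the same kind of scalar objective a gradient method would handle. Note that it uses only fitness evaluations, never a gradient:

```python
import random

def ga_minimize(fitness, pop_size=30, generations=60, sigma=0.5, seed=0):
    rng = random.Random(seed)
    pop = [rng.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                       # rank by fitness (lower is better)
        parents = pop[: pop_size // 2]              # keep the better half (elitism)
        children = [p + rng.gauss(0, sigma) for p in parents]  # Gaussian mutation
        pop = parents + children
    return min(pop, key=fitness)

best = ga_minimize(lambda w: (w - 3.0) ** 2)        # minimum at w = 3
```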
13 votes · 4 answers

Is Gradient Descent central to every optimizer?

I want to know whether Gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp and several other optimizers.
11 votes · 1 answer

Fisher Scoring vs Coordinate Descent for MLE in R

The R base function glm() uses Fisher Scoring for MLE, while glmnet appears to use the coordinate descent method to solve the same equation. Coordinate descent is more time-efficient than Fisher Scoring, as Fisher Scoring calculates the second…
asked by gol
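To make the comparison concrete, here is what coordinate descent looks like on a small coupled quadratic (a generic sketch, unrelated to glmnet's actual lasso updates): each coordinate is minimized exactly while the others are held fixed, and the sweeps contract toward the joint minimizer.

```python
# Minimize f(x, y) = x^2 + y^2 + x*y - 3x by exact coordinate minimizations:
#   over x (y fixed): df/dx = 2x + y - 3 = 0  ->  x = (3 - y) / 2
#   over y (x fixed): df/dy = 2y + x     = 0  ->  y = -x / 2
def coordinate_descent(steps=50):
    x, y = 0.0, 0.0
    for _ in range(steps):
        x = (3.0 - y) / 2.0
        y = -x / 2.0
    return x, y

x, y = coordinate_descent()
# each full sweep quarters the error, converging to (x, y) = (2, -1)
```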
11 votes · 2 answers

Difference between RMSProp with momentum and Adam Optimizers

According to this scintillating blogpost, Adam is very similar to RMSProp with momentum. From the tensorflow documentation we see that tf.train.RMSPropOptimizer has the following parameters: __init__( learning_rate, decay=0.9, momentum=0.0, …
asked by hans
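A simplified reading of that update rule, mirroring the quoted parameter names (an assumption for illustration, not TensorFlow's exact code): RMSProp keeps a decaying average of squared gradients to scale each step, and the momentum term then accumulates those scaled steps.

```python
import math

def rmsprop_momentum_step(w, grad, ms, mom, lr=0.01, decay=0.9,
                          momentum=0.9, eps=1e-10):
    ms = decay * ms + (1 - decay) * grad * grad           # EMA of squared grads
    mom = momentum * mom + lr * grad / math.sqrt(ms + eps)
    return w - mom, ms, mom

# Single illustrative step from w = 1.0 with gradient 2.0, momentum disabled:
w, ms, mom = rmsprop_momentum_step(1.0, 2.0, 0.0, 0.0,
                                   lr=0.1, momentum=0.0, eps=0.0)
# ms = 0.1 * 2^2 = 0.4, step = 0.1 * 2 / sqrt(0.4) ≈ 0.316
```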
11 votes · 2 answers

Why is learning rate causing my neural network's weights to skyrocket?

I am using tensorflow to write simple neural networks for a bit of research, and I have had many problems with 'nan' weights while training. I tried many different solutions, like changing the optimizer, changing the loss, the data size, etc., but with…
10 votes · 1 answer

Backpropagation: In second-order methods, would ReLU derivative be 0? and what its effect on training?

ReLU is an activation function defined as $h = \max(0, a)$ where $a = Wx + b$. Normally, we train neural networks with first-order methods such as SGD, Adam, RMSprop, Adadelta, or Adagrad. Backpropagation in first-order methods requires first-order…
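The premise of the question in code form: ReLU is piecewise linear, so its first derivative is a step function (0 or 1) and its second derivative is 0 everywhere it is defined, meaning the curvature information a second-order method would use simply vanishes.

```python
def relu(a):
    return max(0.0, a)                  # h = max(0, a)

def relu_grad(a):
    return 1.0 if a > 0 else 0.0        # step function; undefined at a = 0

def relu_second_derivative(a):
    # Piecewise linear: zero curvature everywhere away from the kink at a = 0.
    return 0.0
```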
9 votes · 1 answer

clipping the reward for adam optimizer in keras

I would like to clip the reward in keras. I saw it is possible to clip the norm and clip the value in SGD as follows: sgd = optimizers.SGD(lr=0.01, clipnorm=1.) sgd = optimizers.SGD(lr=0.01, clipvalue=0.5) What are clipping the norm and clipping…
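The difference between the two clipping modes named in that excerpt can be sketched in plain Python (illustrative helpers, not keras internals): clipvalue clamps each gradient component independently, while clipnorm rescales the whole gradient vector when its L2 norm exceeds the threshold, preserving its direction.

```python
import math

def clip_by_value(grads, clipvalue):
    """Clamp each component into [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in grads]

def clip_by_norm(grads, clipnorm):
    """Rescale the whole vector if its L2 norm exceeds clipnorm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clipnorm:
        return list(grads)
    return [g * clipnorm / norm for g in grads]

grads = [3.0, 4.0]                 # L2 norm = 5.0
by_value = clip_by_value(grads, 0.5)   # [0.5, 0.5]: direction changes
by_norm = clip_by_norm(grads, 1.0)     # [0.6, 0.8]: direction preserved, norm 1.0
```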