Questions tagged [optimization]

In statistics, this refers to selecting an estimator of a parameter by maximizing or minimizing some function of the data. One very common example is choosing the estimator that maximizes the joint density (or mass function) of the observed data, a procedure referred to as Maximum Likelihood Estimation (MLE).

494 questions
114 votes · 10 answers

Choosing a learning rate

I'm currently working on implementing Stochastic Gradient Descent (SGD) for neural nets using back-propagation, and while I understand its purpose, I have some questions about how to choose values for the learning rate. Is the learning rate related…
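A minimal sketch of the update this question is about, on an assumed toy one-parameter quadratic loss (not any real network): the learning rate scales every gradient step, and its size decides whether the iterates contract toward the minimizer or overshoot it.

```python
# Toy illustration: assumed loss f(w) = (w - 3)^2, minimized at w = 3.
# The learning rate lr scales each SGD step w <- w - lr * grad.
def sgd_step(w, grad, lr):
    return w - lr * grad

def grad_f(w):
    return 2.0 * (w - 3.0)  # gradient of f(w) = (w - 3)^2

w = 0.0
for _ in range(100):
    w = sgd_step(w, grad_f(w), lr=0.1)
# with lr = 0.1 each step shrinks the error by a factor 0.8, so w -> 3
```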
63 votes · 6 answers

Should a model be re-trained if new observations are available?

So, I have not been able to find any literature on this subject, but it seems like something worth thinking about: what are the best practices in model training and optimization if new observations are available? Is there any way to determine the…
asked by yad
50 votes · 2 answers

Why not always use the ADAM optimization technique?

It seems the Adaptive Moment Estimation (Adam) optimizer nearly always works better (faster and more reliably reaching a global minimum) when minimising the cost function in training neural nets. Why not always use Adam? Why even bother using…
asked by PyRsquared
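For readers weighing Adam against the alternatives, here is a bare-bones sketch of its update rule (beta and epsilon defaults as in the Kingma & Ba paper; the step size and loss are toy choices, and this is a simplified reading, not any library's implementation):

```python
import math

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad    # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy run on f(w) = (w - 3)^2; the per-coordinate scaling by sqrt(v_hat)
# makes the effective step size roughly lr regardless of gradient magnitude.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2.0 * (w - 3.0)
    w, m, v = adam_step(w, g, m, v, t)
```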
46 votes · 5 answers

Does gradient descent always converge to an optimum?

I am wondering whether there is any scenario in which gradient descent does not converge to a minimum. I am aware that gradient descent is not always guaranteed to converge to a global optimum. I am also aware that it might diverge from an optimum…
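One concrete failure mode behind this question: even on the simplest convex function, a step size above the stability threshold makes gradient descent diverge. A toy demonstration on f(w) = w², where each update multiplies w by 1 − 2·lr, so any lr > 1 blows up:

```python
def gradient_descent(w, lr, steps):
    for _ in range(steps):
        w -= lr * 2.0 * w    # gradient of f(w) = w^2 is 2w
    return w

small = gradient_descent(1.0, lr=0.1, steps=50)  # |1 - 0.2| = 0.8 < 1: converges
large = gradient_descent(1.0, lr=1.5, steps=50)  # |1 - 3.0| = 2.0 > 1: diverges
```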
40 votes · 4 answers

Guidelines for selecting an optimizer for training neural networks

I have been using neural networks for a while now. However, one thing that I constantly struggle with is the selection of an optimizer for training the network (using backprop). What I usually do is just start with one (e.g. standard SGD) and then…
asked by mplappert
32 votes · 2 answers

Are there any rules for choosing the size of a mini-batch?

When training neural networks, one hyperparameter is the size of a mini-batch. Common choices are 32, 64, and 128 elements per mini-batch. Are there any rules/guidelines for how big a mini-batch should be? Or any publications which investigate the…
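Whatever size is chosen, the mechanics are the same: shuffle the dataset each epoch and slice it into consecutive chunks. A minimal sketch (illustrative helper, not any framework's loader), using 32 as one of the common sizes mentioned above:

```python
import random

def minibatches(data, batch_size, seed=0):
    """Yield the dataset in shuffled mini-batches; the last one may be short."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    for start in range(0, len(idx), batch_size):
        yield [data[i] for i in idx[start:start + batch_size]]

data = list(range(100))
batches = list(minibatches(data, batch_size=32))
# 100 examples -> three batches of 32 and a final batch of 4
```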
27 votes · 2 answers

local minima vs saddle points in deep learning

I heard Andrew Ng (in a video I unfortunately can't find anymore) talk about how the understanding of local minima in deep learning problems has changed in the sense that they are now regarded as less problematic because in high-dimensional spaces…
asked by oW_
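The canonical two-dimensional picture behind that claim: f(x, y) = x² − y² has zero gradient at the origin, yet the origin is a saddle, not a minimum, because the curvature is positive along x and negative along y. A quick numerical check:

```python
def f(x, y):
    return x * x - y * y

# The gradient (2x, -2y) vanishes at the origin...
grad_at_origin = (2 * 0.0, -2 * 0.0)

# ...but the origin is not a minimum: f increases along x yet decreases along y.
f_origin = f(0.0, 0.0)
f_along_x = f(0.1, 0.0)   # +0.01
f_along_y = f(0.0, 0.1)   # -0.01
```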
16 votes · 1 answer

How many features to sample using Random Forests

The Wikipedia page which quotes "The Elements of Statistical Learning" says: Typically, for a classification problem with $p$ features, $\lfloor \sqrt{p}\rfloor$ features are used in each split. I understand that this is a fairly good educated…
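The quoted heuristic as a one-liner (if memory serves, scikit-learn exposes the same default through the max_features="sqrt" setting on its random-forest classifier, but treat that as an assumption to verify):

```python
import math

def default_max_features(p):
    """floor(sqrt(p)) features considered at each split, per the quoted rule."""
    return math.floor(math.sqrt(p))

# e.g. 10 features -> 3 per split, 100 features -> 10 per split
```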
15 votes · 2 answers

Why aren't Genetic Algorithms used for optimizing neural networks?

From my understanding, Genetic Algorithms are powerful tools for multi-objective optimization. Furthermore, training Neural Networks (especially deep ones) is hard and has many issues (non-convex cost functions - local minima, vanishing and…
asked by cat91
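For contrast with the gradient-based methods discussed elsewhere on this page, here is a toy genetic algorithm (truncation selection plus Gaussian mutation; an illustrative sketch, not a production GA) minimizing the same kind of scalar objective a gradient method would handle. Note that it uses only fitness evaluations, never a gradient:

```python
import random

def ga_minimize(fitness, pop_size=30, generations=60, sigma=0.5, seed=0):
    rng = random.Random(seed)
    pop = [rng.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness)                       # rank by fitness (lower is better)
        parents = pop[: pop_size // 2]              # keep the better half (elitism)
        children = [p + rng.gauss(0, sigma) for p in parents]  # Gaussian mutation
        pop = parents + children
    return min(pop, key=fitness)

best = ga_minimize(lambda w: (w - 3.0) ** 2)        # minimum at w = 3
```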
13 votes · 4 answers

Is Gradient Descent central to every optimizer?

I want to know whether Gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp and several other optimizers.
11 votes · 1 answer

Fisher Scoring vs Coordinate Descent for MLE in R

The R base function glm() uses Fisher Scoring for MLE, while glmnet appears to use the coordinate descent method to solve the same equation. Coordinate descent is more time-efficient than Fisher Scoring, as Fisher Scoring calculates the second…
asked by gol
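To make the comparison concrete, here is what coordinate descent looks like on a small coupled quadratic (a generic sketch, unrelated to glmnet's actual lasso updates): each coordinate is minimized exactly while the others are held fixed, and the sweeps contract toward the joint minimizer.

```python
# Minimize f(x, y) = x^2 + y^2 + x*y - 3x by exact coordinate minimizations:
#   over x (y fixed): df/dx = 2x + y - 3 = 0  ->  x = (3 - y) / 2
#   over y (x fixed): df/dy = 2y + x     = 0  ->  y = -x / 2
def coordinate_descent(steps=50):
    x, y = 0.0, 0.0
    for _ in range(steps):
        x = (3.0 - y) / 2.0
        y = -x / 2.0
    return x, y

x, y = coordinate_descent()
# each full sweep quarters the error, converging to (x, y) = (2, -1)
```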
11 votes · 2 answers

Difference between RMSProp with momentum and Adam Optimizers

According to this scintillating blogpost, Adam is very similar to RMSProp with momentum. From the tensorflow documentation we see that tf.train.RMSPropOptimizer has the following parameters: __init__( learning_rate, decay=0.9, momentum=0.0, …
asked by hans
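A simplified reading of that update rule, mirroring the quoted parameter names (an assumption for illustration, not TensorFlow's exact code): RMSProp keeps a decaying average of squared gradients to scale each step, and the momentum term then accumulates those scaled steps.

```python
import math

def rmsprop_momentum_step(w, grad, ms, mom, lr=0.01, decay=0.9,
                          momentum=0.9, eps=1e-10):
    ms = decay * ms + (1 - decay) * grad * grad           # EMA of squared grads
    mom = momentum * mom + lr * grad / math.sqrt(ms + eps)
    return w - mom, ms, mom

# Single illustrative step from w = 1.0 with gradient 2.0, momentum disabled:
w, ms, mom = rmsprop_momentum_step(1.0, 2.0, 0.0, 0.0,
                                   lr=0.1, momentum=0.0, eps=0.0)
# ms = 0.1 * 2^2 = 0.4, step = 0.1 * 2 / sqrt(0.4) ≈ 0.316
```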
11 votes · 2 answers

Why is learning rate causing my neural network's weights to skyrocket?

I am using tensorflow to write simple neural networks for a bit of research, and I have had many problems with 'nan' weights while training. I tried many different solutions, like changing the optimizer, changing the loss, the data size, etc., but with…
10 votes · 1 answer

Backpropagation: In second-order methods, would ReLU derivative be 0? and what its effect on training?

ReLU is an activation function defined as $h = \max(0, a)$ where $a = Wx + b$. Normally, we train neural networks with first-order methods such as SGD, Adam, RMSprop, Adadelta, or Adagrad. Backpropagation in first-order methods requires first-order…
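The premise of the question in code form: ReLU is piecewise linear, so its first derivative is a step function (0 or 1) and its second derivative is 0 everywhere it is defined, meaning the curvature information a second-order method would use simply vanishes.

```python
def relu(a):
    return max(0.0, a)                  # h = max(0, a)

def relu_grad(a):
    return 1.0 if a > 0 else 0.0        # step function; undefined at a = 0

def relu_second_derivative(a):
    # Piecewise linear: zero curvature everywhere away from the kink at a = 0.
    return 0.0
```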
9 votes · 1 answer

clipping the reward for adam optimizer in keras

I would like to clip the reward in keras. I saw it is possible to clip the norm and clip the value in SGD as follows: sgd = optimizers.SGD(lr=0.01, clipnorm=1.) sgd = optimizers.SGD(lr=0.01, clipvalue=0.5) What are clipping the norm and clipping…
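The difference between the two clipping modes named in that excerpt can be sketched in plain Python (illustrative helpers, not keras internals): clipvalue clamps each gradient component independently, while clipnorm rescales the whole gradient vector when its L2 norm exceeds the threshold, preserving its direction.

```python
import math

def clip_by_value(grads, clipvalue):
    """Clamp each component into [-clipvalue, clipvalue]."""
    return [max(-clipvalue, min(clipvalue, g)) for g in grads]

def clip_by_norm(grads, clipnorm):
    """Rescale the whole vector if its L2 norm exceeds clipnorm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= clipnorm:
        return list(grads)
    return [g * clipnorm / norm for g in grads]

grads = [3.0, 4.0]                 # L2 norm = 5.0
by_value = clip_by_value(grads, 0.5)   # [0.5, 0.5]: direction changes
by_norm = clip_by_norm(grads, 1.0)     # [0.6, 0.8]: direction preserved, norm 1.0
```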