Questions tagged [gradient-descent]

Gradient Descent is an algorithm for finding the minimum of a function. It iteratively computes the gradient (the vector of partial derivatives) of the function and descends in steps proportional to the negative of that gradient. One major application of Gradient Descent is fitting a parameterized model to a set of data: the function to be minimized is an error function for the model.

Gradient descent is a first-order iterative optimization algorithm. It is used to find the values of the parameters (coefficients) of a function f that minimize a cost function.

To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.

Gradient descent is best used when the parameters cannot be calculated analytically (e.g. using linear algebra) and must be searched for by an optimization algorithm.

Gradient descent is also known as steepest descent, or the method of steepest descent.

https://en.wikipedia.org/wiki/Gradient_descent
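For illustration, a minimal sketch of the update rule in Python, on the one-dimensional function f(x) = (x - 3)^2, whose minimum is at x = 3; the starting point, learning rate, and iteration count are arbitrary choices for the example:

```python
# Gradient descent on f(x) = (x - 3)**2, whose minimum is at x = 3.
def gradient(x):
    return 2 * (x - 3)                     # f'(x) = 2(x - 3)

x = 0.0                                    # arbitrary starting point
learning_rate = 0.1                        # step size (a hyperparameter)
for _ in range(100):
    x -= learning_rate * gradient(x)       # step against the gradient

print(x)                                   # converges toward 3.0
```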

454 questions
75 votes · 6 answers

What is the difference between Gradient Descent and Stochastic Gradient Descent?

What is the difference between Gradient Descent and Stochastic Gradient Descent? I am not very familiar with these; can you describe the difference with a short example?
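As a hedged sketch of the distinction (the toy linear model and data are made up for the example): batch gradient descent averages the gradient over the whole dataset before each update, while stochastic gradient descent updates after every individual example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)
y = 2.0 * X + rng.normal(scale=0.1, size=100)   # true slope is 2

def grad(w, x, t):
    return 2 * (w * x - t) * x                  # d/dw of (w*x - t)**2

# Batch gradient descent: one update per pass over all the data.
w = 0.0
for _ in range(50):
    w -= 0.1 * np.mean(grad(w, X, y))           # averaged gradient

# Stochastic gradient descent: one update per training example.
w_sgd = 0.0
for x_i, y_i in zip(X, y):
    w_sgd -= 0.1 * grad(w_sgd, x_i, y_i)        # noisy single-example gradient

print(w, w_sgd)                                 # both approach 2.0
```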
46 votes · 5 answers

Does gradient descent always converge to an optimum?

I am wondering whether there is any scenario in which gradient descent does not converge to a minimum. I am aware that gradient descent is not always guaranteed to converge to a global optimum. I am also aware that it might diverge from an optimum…
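One concrete scenario can be sketched in a few lines: on a non-convex function, plain gradient descent converges to whichever local minimum the starting point's basin contains (the function and settings below are chosen only to illustrate this).

```python
# Gradient descent on the non-convex f(x) = x**4 - 2*x**2 + 0.5*x,
# which has two local minima; the result depends on where we start.
def grad(x):
    return 4 * x**3 - 4 * x + 0.5

def descend(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

print(descend(-2.0))   # about -1.06, the global minimum
print(descend(+2.0))   # about +0.93, a worse local minimum
```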
28 votes · 4 answers

Scikit-learn: Getting SGDClassifier to predict as well as a Logistic Regression

A way to train a Logistic Regression is by using stochastic gradient descent, which scikit-learn offers an interface to. What I would like to do is take scikit-learn's SGDClassifier and have it score the same as a Logistic Regression here…
hlin117
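As a sketch of the comparison being asked about, using scikit-learn (the log loss is spelled loss="log_loss" in recent versions and loss="log" in older ones); the dataset and hyperparameters are illustrative, not a recipe for matching scores exactly:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Logistic regression fit with a batch solver (L-BFGS by default).
lr = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The same objective fit by stochastic gradient descent: with a log
# loss, SGDClassifier optimizes the logistic-regression cost.
sgd = SGDClassifier(loss="log_loss", max_iter=1000, tol=1e-4,
                    random_state=0).fit(X_train, y_train)

print(lr.score(X_test, y_test), sgd.score(X_test, y_test))
```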
23 votes · 1 answer

How do Gradient Descent and Backpropagation work together?

Please forgive me as I am new to this. I have attached a diagram trying to model my understanding of neural networks and back-propagation. From videos on Coursera and resources online, I formed the following understanding of how neural networks…
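The division of labor can be sketched on a tiny one-hidden-layer network (architecture, data, and learning rate are all made up for illustration): back-propagation is the chain rule that computes the gradients, and gradient descent is the rule that uses those gradients to update the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))                # toy inputs
y = X[:, :1] * X[:, 1:]                     # toy nonlinear targets, shape (64, 1)

W1 = rng.normal(scale=0.5, size=(2, 8))     # hidden-layer weights
W2 = rng.normal(scale=0.5, size=(8, 1))     # output-layer weights
lr = 0.05

for _ in range(2000):
    # Forward pass.
    h = np.tanh(X @ W1)
    pred = h @ W2
    # Back-propagation: chain rule, from the loss back to each layer.
    g_pred = 2 * (pred - y) / len(y)        # dLoss/dpred for mean squared error
    g_W2 = h.T @ g_pred                     # dLoss/dW2
    g_h = g_pred @ W2.T                     # dLoss/dh
    g_W1 = X.T @ (g_h * (1 - h**2))         # dLoss/dW1 (tanh' = 1 - tanh**2)
    # Gradient descent: step each weight matrix against its gradient.
    W2 -= lr * g_W2
    W1 -= lr * g_W1

print(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))   # the loss has decreased
```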
23 votes · 2 answers

Why is ReLU better than the other activation functions?

The answer here refers to the vanishing and exploding gradients that occur in sigmoid-like activation functions, but I guess ReLU has a disadvantage, and it is its expected value. There is no limit on the output of ReLU, so its expected…
13 votes · 4 answers

Is Gradient Descent central to every optimizer?

I want to know whether gradient descent is the main algorithm used in optimizers like Adam, Adagrad, RMSProp, and several others.
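A sketch of the relationship: optimizers such as Adam still step against a gradient; what changes is how the step is rescaled. Below, plain gradient descent and Adam (with its usual default coefficients, used here only for illustration) minimize the same toy function.

```python
import math

def grad(x):
    return 2 * (x - 3)                      # gradient of f(x) = (x - 3)**2

# Plain gradient descent.
x = 0.0
for _ in range(200):
    x -= 0.1 * grad(x)

# Adam: the same gradient, rescaled by running moment estimates.
x_adam, m, v = 0.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 201):
    g = grad(x_adam)
    m = beta1 * m + (1 - beta1) * g         # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g * g     # second moment (mean of squares)
    m_hat = m / (1 - beta1 ** t)            # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(x, x_adam)                            # both approach 3.0
```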
13 votes · 2 answers

Why does averaging the gradient work in Gradient Descent?

In full-batch gradient descent or minibatch GD, we get gradients from several training examples. We then average them to obtain a "high-quality" gradient from several estimates and finally use it to correct the network all at once. But why…
Kari
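Part of the answer can be shown directly: because differentiation is linear, the mean of the per-example gradients is exactly the gradient of the mean loss over the batch, which the numerical check below confirms (toy model and data made up for the example).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=16)
y = 3.0 * X
w = 0.5

# Mean of the per-example gradients of (w*x_i - y_i)**2.
mean_of_grads = np.mean(2 * (w * X - y) * X)

# Gradient of the mean loss, computed numerically via central differences.
loss = lambda w: np.mean((w * X - y) ** 2)
h = 1e-6
numeric = (loss(w + h) - loss(w - h)) / (2 * h)

print(mean_of_grads, numeric)               # the two agree up to float error
```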
12 votes · 1 answer

What feature engineering is necessary with tree based algorithms?

I understand data hygiene, which is probably the most basic feature engineering. That is, making sure all your data is properly loaded, making sure N/As are treated as a special value rather than a number between -1 and 1, and tagging your…
11 votes · 2 answers

Why is the learning rate causing my neural network's weights to skyrocket?

I am using TensorFlow to write simple neural networks for a bit of research, and I have had many problems with 'nan' weights while training. I tried many different solutions, like changing the optimizer, changing the loss, the data size, etc., but with…
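One common mechanism, sketched without assuming anything about the asker's actual network: on a quadratic with curvature c, gradient descent diverges once the learning rate exceeds 2/c, because every step overshoots the minimum by more than it corrects; the weights then grow geometrically until they overflow to inf and then nan.

```python
def grad(w):
    return 2 * w                 # gradient of f(w) = w**2 (curvature c = 2)

# Stable: learning rate below 2/c = 1.0, so each step shrinks w.
w = 1.0
for _ in range(50):
    w -= 0.1 * grad(w)
print(w)                         # decays toward 0

# Unstable: learning rate above the threshold, so each step overshoots
# and |w| grows geometrically until it overflows.
w = 1.0
for _ in range(5000):
    w -= 1.1 * grad(w)
print(w)                         # nan (after passing through inf)
```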
10 votes · 1 answer

How flexible is the link between objective function and output layer activation function?

It seems standard in many neural network packages to pair up the objective function to be minimised with the activation function in the output layer. For instance, for a linear output layer used for regression, it is standard (and often the only choice)…
Neil Slater
10 votes · 1 answer

What is the difference between the SGD classifier and Logistic regression?

To my understanding, the SGD classifier and Logistic regression seem similar. An SGD classifier with loss = 'log' implements Logistic regression, and loss = 'hinge' implements a Linear SVM. I also understand that logistic regression uses gradient…
10 votes · 4 answers

Why does it speed up gradient descent if the function is smooth?

I am reading a book titled "Hands-On Machine Learning with Scikit-Learn and TensorFlow", and in Chapter 11 it gives the following description in its explanation of ELU (Exponential Linear Unit): Third, the function is smooth everywhere, including around z…
Blaszard
10 votes · 2 answers

Stochastic gradient descent based on vector operations?

Let's assume that I want to train a stochastic gradient descent regression algorithm using a dataset that has N samples. Since the size of the dataset is fixed, I will reuse the data T times. At each iteration or "epoch", I use each training sample…
Pablo Suau
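The usual compromise can be sketched as minibatch SGD: the updates stay stochastic, but each batch gradient is computed with a single vectorized matrix operation instead of a Python loop over samples (model, data, and batch size below are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
true_w = rng.normal(size=d)
y = X @ true_w                              # noiseless linear targets

w = np.zeros(d)
batch_size, lr = 32, 0.1
for epoch in range(20):
    order = rng.permutation(N)              # reshuffle every epoch
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # One vectorized expression computes the whole batch gradient.
        w -= lr * 2 * Xb.T @ (Xb @ w - yb) / len(idx)

print(np.linalg.norm(w - true_w))           # near zero: w recovers true_w
```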
9 votes · 3 answers

What is momentum in a neural network?

While using the "Two-Class Neural Network" in Azure ML, I encountered the "Momentum" property. As per the documentation, which is not clear, it says: For The momentum, type a value to apply during learning as a weight on nodes from previous…
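The property being asked about can be sketched with the classic momentum update (assuming the standard "heavy ball" formulation, not necessarily what Azure ML implements): a velocity term accumulates an exponentially decaying average of past gradients, and the momentum value (e.g. 0.9) is the decay weight on that history.

```python
def grad(x):
    return 2 * (x - 3)           # gradient of f(x) = (x - 3)**2

x, velocity = 0.0, 0.0
lr, momentum = 0.01, 0.9
for _ in range(500):
    # The velocity remembers past gradients, decayed by `momentum`;
    # consistent gradients accumulate, oscillating ones cancel out.
    velocity = momentum * velocity - lr * grad(x)
    x += velocity                # step along the accumulated velocity

print(x)                         # approaches 3.0
```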
9 votes · 1 answer

Understanding dropout and gradient descent

I am looking at how to implement dropout on deep neural networks, and I found something counter-intuitive. In the forward phase, dropout masks activations with a random tensor of 1s and 0s to force the net to learn the average of the weights. This helps the…
emanuele
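The masking described in the question can be sketched with the common "inverted dropout" variant, in which activations are rescaled at training time so that no adjustment is needed at test time (shapes and keep probability are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(activations, keep_prob=0.8, train=True):
    if not train:
        return activations       # test time: use every unit, unscaled
    # Random mask of 0s and 1s; dividing by keep_prob keeps the expected
    # activation equal to the undropped one ("inverted" dropout).
    mask = (rng.random(activations.shape) < keep_prob) / keep_prob
    return activations * mask    # backprop would reuse this same mask

h = rng.normal(size=(4, 8))      # some hidden-layer activations
print(dropout_forward(h).round(2))
```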