Questions tagged [gradient]

32 questions
3 votes · 1 answer

How does the batch normalization layer resolve the vanishing gradient problem?

According to this article: https://towardsdatascience.com/the-vanishing-gradient-problem-69bf08b15484 The vanishing gradient problem occurs when using the sigmoid activation function, because sigmoid maps a large input space into a small space, so the…
3 votes · 3 answers

Why is the sign of the gradient (plus or minus) not enough for finding the steepest ascent?

Consider a simple 1-D function $y = x^2$, and suppose we want to find a maximum with the gradient ascent method. If we start at the point 3 on the x-axis: $$ \frac{\partial y}{\partial x} \biggr\rvert_{x=3} = 2x \biggr\rvert_{x=3} = 6 $$ This means that a direction in which…
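A minimal sketch (my own, not from the question) of why the update uses the full gradient rather than just its sign: with a fixed learning rate, a sign-only step always moves by the same amount, while a gradient-scaled step grows or shrinks with the local slope.

    def grad(x):
        return 2 * x  # dy/dx for y = x^2

    lr = 0.1
    x_full, x_sign = 3.0, 3.0
    for _ in range(5):
        x_full += lr * grad(x_full)                          # step length scales with |gradient|
        x_sign += lr * (1.0 if grad(x_sign) > 0 else -1.0)   # step length ignores the magnitude
        print(f"full gradient: x = {x_full:.3f}   sign only: x = {x_sign:.3f}")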
3 votes · 1 answer

Gradient Checking: Mean Squared Error. Why does a huge epsilon improve the discrepancy?

I am using custom C++ code, and coded a simple "Mean Squared Error" layer. I am temporarily using it for a classification task, not simple regression ... maybe this causes the issues? I don't have anything else before this layer - not even a simple…
Kari · 2,686
3 votes · 1 answer

Differentiable approximation for counting negative values in an array

I have an array of arrival times and I want to convert it to count data using PyTorch in a differentiable way. Example arrival times: arrival_times = [2.1, 2.9, 5.1] and let's say the total range is 6 seconds. What I want to have is: counts = [0,…
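One common relaxation (my own sketch, not the asker's code; the grid and temperature are assumptions) is to replace the hard step "arrival ≤ t" with a steep sigmoid, so the per-second cumulative counts stay differentiable with respect to the arrival times:

    import torch

    arrival_times = torch.tensor([2.1, 2.9, 5.1], requires_grad=True)
    t_grid = torch.arange(1, 7, dtype=torch.float32)   # seconds 1..6, the assumed range

    temperature = 0.01  # smaller -> closer to a hard count, but steeper gradients
    soft_counts = torch.sigmoid(
        (t_grid[:, None] - arrival_times[None, :]) / temperature
    ).sum(dim=1)

    soft_counts.sum().backward()     # gradients flow back to arrival_times
    print(soft_counts)               # approaches the hard counts [0, 0, 2, 2, 2, 3]
    print(arrival_times.grad)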
2 votes · 1 answer

Gradient passthrough in PyTorch

I need to quantize the inputs, but the method (bucketize) I need to use is not differentiable. I can of course detach the tensor, but then I lose the flow of gradients to earlier weights. I guess the question is quite simple: how do you continue…
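A minimal sketch of the straight-through estimator, a common workaround for exactly this situation (the bucket boundaries and shapes below are illustrative assumptions, not the asker's code): use the quantized values in the forward pass, but let gradients flow as if the operation were the identity.

    import torch

    boundaries = torch.tensor([0.25, 0.5, 0.75])                 # assumed bucket edges
    x = torch.rand(4, requires_grad=True)

    hard = torch.bucketize(x.detach(), boundaries).to(x.dtype)   # non-differentiable forward value
    quantized = x + (hard - x).detach()                          # forward: hard value, backward: identity

    quantized.sum().backward()
    print(x.grad)                                                # all ones: the gradient passes through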
2 votes · 1 answer

How to choose an appropriate epsilon value when approximating gradients to check training?

While approximating gradients, using the actual epsilon to shift the weights results in wildly large gradient approximations, as the "width" of the approximation triangle used is disproportionately small. In Andrew Ng's course he uses 0.01, but I…
Dávid Tóth · 145
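A small sketch of the usual trade-off (my own toy loss, not the asker's network): with the centered difference $(f(w+\epsilon) - f(w-\epsilon)) / (2\epsilon)$ in double precision, values around $10^{-4}$ to $10^{-6}$ typically balance truncation error (epsilon too large) against round-off error (epsilon too small).

    import numpy as np

    f = lambda w: np.sum(w ** 3)          # hypothetical loss with a non-zero third derivative
    w = np.array([0.3, -1.2, 2.0])
    true_grad = 3 * w ** 2

    for eps in [1e-1, 1e-4, 1e-7, 1e-12]:
        approx = np.array([
            (f(w + eps * e) - f(w - eps * e)) / (2 * eps)
            for e in np.eye(len(w))
        ])
        print(f"eps={eps:.0e}  max error={np.max(np.abs(approx - true_grad)):.2e}")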
2 votes · 1 answer

Tensorflow.Keras: How to get the gradient for an output class w.r.t. a given input?

I have implemented and trained a sequential model using tf.keras. Say I am given an input array of size 8x8 and an output [0,1,0,...(rest all 0)]. How do I calculate the gradient of the input w.r.t. the given output? model = ... output =…
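A minimal sketch with tf.GradientTape (the model architecture and shapes below are placeholders, not the asker's trained model): take the score of the class that is 1 in the one-hot output and differentiate it with respect to the input.

    import tensorflow as tf

    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(8, 8)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(10),
    ])

    x = tf.random.uniform((1, 8, 8))
    with tf.GradientTape() as tape:
        tape.watch(x)                     # x is a plain tensor, so watch it explicitly
        logits = model(x)
        class_score = logits[:, 1]        # the class marked 1 in the one-hot output

    grad = tape.gradient(class_score, x)  # shape (1, 8, 8), same as the input
    print(grad.shape)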
2 votes · 1 answer

Vanishing Gradient vs Exploding Gradient as Activation function?

ReLU is used as an activation function that serves two purposes: breaking linearity in a DNN, and helping handle the vanishing gradient problem. For the exploding gradient problem, we use the gradient clipping approach, where we set the max threshold limit of…
vipin bansal · 1,252
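For reference, a minimal gradient-clipping sketch in PyTorch (model and data are hypothetical): gradients whose global norm exceeds the threshold are rescaled before the optimizer step, which bounds exploding gradients without changing their direction.

    import torch
    from torch import nn

    model = nn.Linear(10, 1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # cap the global gradient norm
    optimizer.step()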
2 votes · 1 answer

What does it mean for a method to be invariant to diagonal rescaling of the gradients?

In the paper that describes Adam: A Method for Stochastic Optimization, the authors state: The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the…
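One way to read that claim (a sketch of my own, ignoring the small $\epsilon$ added to the denominator): if every gradient is rescaled as $g_t \mapsto D g_t$ for a fixed positive diagonal matrix $D$, then the moment estimates scale as $m_t \mapsto D m_t$ and $v_t \mapsto D^2 v_t$, so the elementwise update step is unchanged: $$ \frac{D\,\hat{m}_t}{\sqrt{D^2\,\hat{v}_t}} = \frac{D\,\hat{m}_t}{D\sqrt{\hat{v}_t}} = \frac{\hat{m}_t}{\sqrt{\hat{v}_t}} $$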
1 vote · 0 answers

Which Neural Network or Gradient Boosting framework is the simplest for Custom Loss Functions?

I need to implement a custom loss function. The function is relatively simple: $$-\sum \limits_{i=1}^m [O_{1,i} \cdot y_i-1] \ \cdot \ \operatorname{ReLU}(O_{1,i} \cdot \hat{y}_i - 1)$$ with $O$ being some external attribute specific to each case. I…
Borut Flis · 189
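A minimal sketch of one reading of that formula in PyTorch (shapes and the way $O$ is passed in are my assumptions, not the asker's setup); any framework with autograd will then supply the gradients of the custom loss for free.

    import torch

    def custom_loss(y_pred, y_true, o):
        # -sum_i (O_i * y_i - 1) * ReLU(O_i * y_hat_i - 1)
        return -torch.sum((o * y_true - 1) * torch.relu(o * y_pred - 1))

    y_pred = torch.rand(8, requires_grad=True)   # hypothetical predictions
    y_true = torch.rand(8)
    o = torch.rand(8)                            # the external per-sample attribute O

    loss = custom_loss(y_pred, y_true, o)
    loss.backward()                              # autograd supplies the gradient w.r.t. y_pred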
1 vote · 1 answer

Why does my manual derivative of Layer Normalization imply no gradient flow?

I recently tried computing the derivative of the layer norm function (https://arxiv.org/abs/1607.06450), an essential component of transformers, but the result suggests that no gradient flows through the operation, which can't be true. Here's my…
Alex · 13
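A quick autograd check (not the asker's derivation) highlights a common gotcha here: when the upstream gradient is constant across the normalized dimension (e.g. backpropagating out.sum()), the layer-norm input gradient really is zero, because mean-subtraction removes any constant direction; a non-constant upstream gradient shows that gradients do flow.

    import torch

    x = torch.randn(4, 8, requires_grad=True)
    ln = torch.nn.LayerNorm(8)

    ln(x).sum().backward()            # constant upstream gradient -> (near) zero input gradient
    print(x.grad.abs().max())

    x.grad = None
    (ln(x) ** 2).sum().backward()     # non-constant upstream gradient -> clearly non-zero
    print(x.grad.abs().max())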
1 vote · 1 answer

Vanishing gradient and zero gradient

There is a well-known problem, the vanishing gradient, in backpropagation training of a Feedforward Neural Network (FNN) (here we don't consider the vanishing gradient of a Recurrent Neural Network). I don't understand why a vanishing gradient does not mean the…
user6703592 · 127
1 vote · 1 answer

Can mini-batch gradient descent outperform batch gradient descent?

While going through the second course of Andrew Ng's deep learning specialization, I came across a sentence that said: With a well-tuned mini-batch size, usually it outperforms either gradient descent or stochastic gradient descent…
1 vote · 1 answer

CNN gradients with different magnitudes

I have a CNN architecture with two cross-entropy losses $\mathcal{L}_1$ and $\mathcal{L}_2$ summed into the total loss $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$. The task I want to solve is Unsupervised Domain Adaptation. I have attested the…
aretor · 117
1 vote · 0 answers

Matlab Optimization. Meaning of warning: "The slope should be 2. It appears to be 1."

I'm using the manopt package to solve some optimization problems in Matlab. The problem is of the form: problem.cost = @(x) f(x); problem.egrad = @(x) g(x). After the problem definition, I check the gradient consistency using the following…
Springberg · 111