
I know that softmax is:

$$ \operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$

This is an $\mathbb{R}^n \to \mathbb{R}^n$ function, and the elements of the output add up to 1. I understand that the purpose of normalizing is to have the elements of the output represent the probabilities of each class (in classification).

However, I don't understand why we need to take $\exp(x_i)$ of each element, instead of just normalizing:

$$ \operatorname{softmax}(x)_i = \frac{x_i}{\sum_{j=1}^{n} x_j} $$

This should achieve a similar result, especially since both functions seem differentiable and could represent probability distributions.
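
For concreteness, here is a minimal NumPy sketch (the input vector is made up for illustration) of why the two formulas look interchangeable at first glance: for strictly positive inputs, both produce vectors that sum to 1.

```python
import numpy as np

def softmax(x):
    # Exponentiate each element, then normalize by the sum of exponentials.
    e = np.exp(x - np.max(x))  # subtracting the max improves numerical stability
    return e / e.sum()

def plain_normalize(x):
    # The proposed alternative: divide each element by the sum of all elements.
    return x / x.sum()

x = np.array([1.0, 2.0, 3.0])     # made-up positive scores
print(softmax(x))                 # approx [0.090, 0.245, 0.665], sums to 1
print(plain_normalize(x))         # approx [0.167, 0.333, 0.500], also sums to 1
```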

Could you tell me:

  • What is the advantage of the exponential function inside?
  • When should one be used over the other?
I'm not sure if this is exactly what you are looking for, but for the sake of curiosity there are some other intuitions [here](https://youtu.be/Z1pcTxvCOgw?t=1202) (about 2 or 3 minutes) which I believe can give you an idea of what to search for. – Luciano Dourado Aug 23 '23 at 13:26

1 Answer


Usually the softmax is applied to logits (you can think of them as unnormalized log-probabilities), which are the output of the neural net. The logits are unbounded, i.e. they lie in $(-\infty, \infty)$, so taking the $\exp$ of them yields strictly positive values, which are then normalized to sum to one.
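
To make this concrete, here is a small NumPy sketch (the logit values are made up) of what the exponential buys you when logits can be negative: softmax still yields a valid probability vector, whereas dividing by the sum directly does not.

```python
import numpy as np

def softmax(x):
    # exp maps unbounded logits to positive values, which are then normalized.
    e = np.exp(x - np.max(x))  # max-subtraction for numerical stability
    return e / e.sum()

logits = np.array([-2.0, 0.5, 1.0])   # unbounded, possibly negative logits

print(softmax(logits))        # approx [0.030, 0.366, 0.604]: positive, sums to 1
print(logits / logits.sum())  # [4., -1., -2.]: sums to 1 but is not a probability vector
```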

In principle, you can avoid exponentiating when the neural net's output is strictly positive, but I don't know whether that would be detrimental at the gradient level, i.e. when backpropagating the loss.
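
As a sketch of that strictly-positive case (softplus here is just one hypothetical way to force positive outputs; the answer does not specify one), plain normalization does yield a valid distribution, while the question of how its gradients compare to softmax's is left open.

```python
import numpy as np

logits = np.array([-2.0, 0.5, 1.0])

# Hypothetical setup: make the net's outputs strictly positive via softplus,
# then normalize them directly instead of exponentiating.
positive = np.log1p(np.exp(logits))   # softplus(x) = log(1 + exp(x)) > 0
probs = positive / positive.sum()

print(probs)        # approx [0.053, 0.404, 0.544]: all positive
print(probs.sum())  # 1.0
```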