
I know that softmax is:

$$ \operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$

This is an $\mathbb{R}^n \to \mathbb{R}^n$ function, and the elements of the output add up to 1. I understand that the purpose of normalizing is to have the elements of the output represent the probabilities of each class (in classification).

However, I don't understand why we need to take $\exp(x_i)$ of each element, instead of just normalizing:

$$ \operatorname{softmax}(x)_i = \frac{x_i}{\sum_{j=1}^{n} x_j} $$

This should achieve a similar result, especially since both functions seem differentiable and could represent probability distributions.
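
For concreteness, here is a minimal NumPy sketch (the input vector is made up for illustration) of why the two formulas look interchangeable at first glance: for strictly positive inputs, both produce vectors that sum to 1.

```python
import numpy as np

def softmax(x):
    # Exponentiate each element, then normalize by the sum of exponentials.
    e = np.exp(x - np.max(x))  # subtracting the max improves numerical stability
    return e / e.sum()

def plain_normalize(x):
    # The proposed alternative: divide each element by the sum of all elements.
    return x / x.sum()

x = np.array([1.0, 2.0, 3.0])     # made-up positive scores
print(softmax(x))                 # approx [0.090, 0.245, 0.665], sums to 1
print(plain_normalize(x))         # approx [0.167, 0.333, 0.500], also sums to 1
```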

Could you tell me:

  • What is the advantage of the exponential function inside?
  • When should one be used over the other?
I'm not sure if this is exactly what you are looking for, but for the sake of curiosity there are some other intuitions [here](https://youtu.be/Z1pcTxvCOgw?t=1202) (about 2 or 3 minutes) which I believe can give you an idea of what to search for. – Luciano Dourado Aug 23 '23 at 13:26

1 Answer


Usually the softmax is applied to logits (you can think of them as unnormalized log-probabilities), which are the output of the neural net. The logits are unbounded, i.e. they lie in $(-\infty, \infty)$, so taking the $\exp$ of them yields strictly positive values, which are then normalized to sum to one.
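
To make this concrete, here is a small NumPy sketch (the logit values are made up) of what the exponential buys you when logits can be negative: softmax still yields a valid probability vector, whereas dividing by the sum directly does not.

```python
import numpy as np

def softmax(x):
    # exp maps unbounded logits to positive values, which are then normalized.
    e = np.exp(x - np.max(x))  # max-subtraction for numerical stability
    return e / e.sum()

logits = np.array([-2.0, 0.5, 1.0])   # unbounded, possibly negative logits

print(softmax(logits))        # approx [0.030, 0.366, 0.604]: positive, sums to 1
print(logits / logits.sum())  # [4., -1., -2.]: sums to 1 but is not a probability vector
```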

In principle, you can avoid exponentiating when the neural net's output is strictly positive, but I don't know whether that would be detrimental at the gradient level, i.e. when backpropagating the loss.
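
As a sketch of that strictly-positive case (softplus here is just one hypothetical way to force positive outputs; the answer does not specify one), plain normalization does yield a valid distribution, while the question of how its gradients compare to softmax's is left open.

```python
import numpy as np

logits = np.array([-2.0, 0.5, 1.0])

# Hypothetical setup: make the net's outputs strictly positive via softplus,
# then normalize them directly instead of exponentiating.
positive = np.log1p(np.exp(logits))   # softplus(x) = log(1 + exp(x)) > 0
probs = positive / positive.sum()

print(probs)        # approx [0.053, 0.404, 0.544]: all positive
print(probs.sum())  # 1.0
```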