I know that softmax is:
$$ \operatorname{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $$
This is an $\mathbb{R}^n \to \mathbb{R}^n$ function, and the elements of the output sum to 1. I understand that the purpose of this normalization is to have the output elements represent the probabilities of each class (in classification).
However, I don't understand why we need to take $\exp(x_i)$ of each element, instead of just normalizing directly:
$$ f(x)_i = \frac{x_i}{\sum_{j=1}^{n} x_j} $$
This seems like it should achieve a similar result, especially since both functions appear to be differentiable and produce outputs that sum to 1, so either could represent a probability distribution.
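To make the comparison concrete, here is a minimal sketch of both normalizations side by side (using NumPy; the input vector is just an arbitrary example I chose):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Softmax: exponentiate first, then normalize
softmax = np.exp(x) / np.exp(x).sum()

# Plain normalization: divide each element by the sum
linear = x / x.sum()

# Both outputs sum to 1
print(softmax.sum(), linear.sum())
```

Both produce a vector summing to 1 for this input, which is why I'm unsure what the exponential buys us.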
Could you tell me:
- What is the advantage of the exponential function inside?
- When should one be used over the other?