
I understand the softmax equation is

$\boldsymbol{P}(y=j \mid x)=\frac{e^{x_{j}}}{\sum_{k=1}^{K} e^{x_{k}}}$

My question is: why use $e^x$ instead of, say, $3^x$? I understand that $e^x$ is its own derivative, but how is that advantageous in this situation?

I'm generally trying to understand why Euler's number appears everywhere, especially in statistics and probability, but specifically in this case.
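
For concreteness, here is a rough NumPy sketch of the comparison I have in mind (the logits are made up); both bases produce valid probability distributions, just not the same one:

```python
import numpy as np

def softmax_base(x, base=np.e):
    """Softmax-like normalization with an arbitrary base."""
    z = base ** (x - np.max(x))   # subtract the max for numerical stability
    return z / z.sum()

x = np.array([1.0, 2.0, 3.0])     # made-up logits
print(softmax_base(x))            # base e: approx. [0.090, 0.245, 0.665]
print(softmax_base(x, base=3.0))  # base 3: approx. [0.077, 0.231, 0.692]
```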

– Codedorf

1 Answer


Choosing a different base would merely squash or stretch the graph of the function uniformly in the horizontal direction, since $$ a^x = e^{x\cdot \ln(a)}. $$
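
A quick numerical check of that identity (a NumPy sketch with arbitrary inputs): a base-$a$ softmax is exactly a base-$e$ softmax applied to inputs scaled by $\ln(a)$.

```python
import numpy as np

def softmax(x):
    """Standard base-e softmax."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

x = np.array([0.5, -1.2, 3.0])   # arbitrary inputs
a = 3.0

# Base-a version: a**x / sum(a**x)
base_a = a ** x / np.sum(a ** x)

# It matches the base-e softmax of ln(a) * x
print(np.allclose(base_a, softmax(np.log(a) * x)))  # True
```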

The exponential function with base $e$ is widely considered the simplest exponential function. It has nice properties that no other base has, mainly:

  • The function $e^x$ is its own derivative.
  • It has a particularly simple power series expansion: $$ e^x = 1 + x + \frac12 x^2 + \frac16 x^3 + \cdots + \frac1{n!}x^n + \cdots $$ All of the coefficients are rational numbers. If the base had been something intuitively "nicer" than $e$, such as an integer, the coefficients would need to be irrational.
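
For comparison, substituting $x\ln 3$ into the same series shows what base $3$ would look like: $$ 3^x = e^{x\ln 3} = 1 + (\ln 3)\,x + \frac{(\ln 3)^2}{2} x^2 + \frac{(\ln 3)^3}{6} x^3 + \cdots, $$ so every coefficient past the constant term picks up an irrational factor of $\ln 3$.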

For this reason, most mathematicians will pick $e^x$ when they need an exponential function and have no particular reason to choose one base over another. (Except for computer scientists and information theorists, who sometimes prefer $2^x$).

  • Reference: hmakholm left over Monica (https://math.stackexchange.com/users/14366/hmakholm-left-over-monica), "Why does sigmoid function use e instead of another constant?", Mathematics Stack Exchange, https://math.stackexchange.com/q/3195726 (version: 2019-04-21) – Amirhossein Rezaei Aug 26 '22 at 21:01
  • It also has nice properties for complex numbers. – Green Falcon Aug 27 '22 at 09:53
  • Did this answer the question? Do the properties you've listed make a difference over using $3^x$ (e.g. does "rational coefficients" make it run quicker on a GPU, or anything like that)? If not, could we consider $2.6^x$ vs. $e^x$ vs. $2.9^x$ just another hyperparameter of our models, one we could be tuning? – Darren Cook Aug 29 '22 at 20:18
  • I don't think there's a need to see it as a hyperparameter, for a simple reason: the prior layers are multiplied by trainable weights that change during backpropagation. This matters here because the output values of the prior neurons are the inputs to the next layer. In essence, using the property $a^x = e^{x\cdot \ln(a)}$, you can think of $\ln(a)$ as a constant multiplied into the prior layer's weights, which are adjusted again during backpropagation; there's no need for hyperparameter tuning. – Amirhossein Rezaei Aug 29 '22 at 20:59
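
A rough NumPy sketch of that last argument (the layer sizes and the names `W`, `b`, `h` are purely illustrative): scaling the logits by $\ln(a)$ is the same as scaling the final layer's weights and bias, which is exactly the kind of rescaling training can absorb.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)
h = rng.normal(size=4)        # illustrative activations from the previous layer
W = rng.normal(size=(3, 4))   # illustrative final-layer weights
b = rng.normal(size=3)        # illustrative final-layer bias

a = 3.0
logits = W @ h + b

# Base-a softmax of the logits ...
p_base_a = softmax(np.log(a) * logits)
# ... is identical to a base-e softmax with rescaled weights and bias,
# a rescaling that backpropagation could learn on its own.
p_rescaled = softmax((np.log(a) * W) @ h + np.log(a) * b)

print(np.allclose(p_base_a, p_rescaled))  # True
```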