How does one derive the modified tanh activation proposed by LeCun?

Question

In "Efficient Backprop" (http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf), LeCun and others propose a modified tanh activation function of the form:

$$ f(x) = 1.7159 \tanh\left(\frac{2}{3} x\right) $$

They argue that:

It is easier to approximate with polynomials
It is said to fit the targets, in that its second derivative is maximal at $x = 1$

I tried starting from a function of the form $f(x) = a \tanh(bx)$ and deriving the values of $a$ and $b$ that match the aforementioned properties.

Any idea how those constants are derived? Under what assumptions? Does the function match its expected properties by construction?

score 1 · Accepted Answer · answered Jan 29 '22 at 22:02

In "Generalization and Network Design Strategies", LeCun argues that he chose parameters that satisfy: $$ f (\pm1) = \pm1$$

The rationale behind this is that the overall gain of the squashing transformation is around 1 in normal operating conditions, and the interpretation of the state of the network is simplified. Moreover, the absolute value of the second derivative of $f$ is a maximum at $+1$ and $-1$, which improves the convergence at the end of the learning session.

Visualization of the derivatives using this code:
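A minimal sketch of such a check, assuming plain Python with the standard `math` module (this is not the original plotting code, and the names `f`, `d2f` and `x_peak` are only illustrative). Instead of plotting, it evaluates the closed-form second derivative of $f(x) = a\tanh(bx)$ on a grid and reports where its absolute value peaks:

```python
import math

A, B = 1.7159, 2.0 / 3.0  # constants from "Efficient BackProp"


def f(x, a=A, b=B):
    """Scaled tanh: f(x) = a * tanh(b * x)."""
    return a * math.tanh(b * x)


def d2f(x, a=A, b=B):
    """Closed-form second derivative: -2 a b^2 tanh(bx) / cosh^2(bx)."""
    return -2.0 * a * b**2 * math.tanh(b * x) / math.cosh(b * x) ** 2


# Coarse grid search on [0, 3] for the point where |f''| is largest.
xs = [i / 1000 for i in range(3001)]
x_peak = max(xs, key=lambda x: abs(d2f(x)))

print(f"f(1)            = {f(1.0):.6f}")
print(f"argmax |f''(x)| = {x_peak:.3f}")
```

With the published constants, the peak of $|f''|$ should land close to, but not exactly at, $x = 1$, which is consistent with the small mismatch discussed further down.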

From his description of the maxima of the second derivative, I conclude that the third derivative of $f(x) = a \tanh(bx)$ should be zero at $x = \pm 1$. The third derivative is: $$\frac{d^{3} f}{d x^3} = \frac{-2ab^3}{\cosh^2(bx)}\left(\frac{1}{\cosh^2(bx)} - 2\tanh^2(bx)\right)$$

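Setting it to zero and multiplying through by $\cosh^4(bx)$ gives: $$1 - 2\sinh^2(bx) = 0, \quad x = \pm 1$$ so that, for $x = 1$, $$bx = \operatorname{arcsinh}\left(\frac{1}{\sqrt{2}}\right)$$ Plugging the values into numpy I get: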

$$b=0.6584789484624083$$ Plugging the result into the condition $f(1) = 1$ I get: $$a=1.7320509044937022$$
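A minimal sketch of this calculation, using Python's standard `math` module rather than NumPy (an assumption, since the original code is not shown; `third_derivative` is an illustrative helper name). It solves the two conditions in closed form and then checks the third derivative at $x = 1$ for both sets of constants:

```python
import math


def third_derivative(x, a, b):
    """f'''(x) for f(x) = a * tanh(b * x), using the formula above."""
    sech2 = 1.0 / math.cosh(b * x) ** 2
    return -2.0 * a * b**3 * sech2 * (sech2 - 2.0 * math.tanh(b * x) ** 2)


# Condition f'''(1) = 0  =>  sinh(b) = 1 / sqrt(2)
b = math.asinh(1.0 / math.sqrt(2.0))
# Condition f(1) = 1  =>  a = 1 / tanh(b)
a = 1.0 / math.tanh(b)

print("b =", b)
print("a =", a)
print("f'''(1) with derived constants:", third_derivative(1.0, a, b))
print("f'''(1) with 1.7159 and 2/3:   ", third_derivative(1.0, 1.7159, 2.0 / 3.0))
```

The last two lines give a quick way to see how far each pair of constants is from satisfying $f'''(1) = 0$.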

This means there is a slight difference between my constants and his. Comparing the two tanh functions with our different constants, I get $\delta = 0.0012567267661376946$ for $x=1$ using my numpy code.

Either I made a mistake, he did not have such an accurate numerical solver/lookup table, or he chose a "nicer looking" number.

score 1 · Answer 2 · answered Apr 06 '23 at 09:48

I once set out to derive a symbolic solution (without trigonometric functions) for myself, mostly relying on Wolfram Alpha to do the heavy lifting, using the same constraints as @a-doering ($f(\pm1)=\pm1$ and $f'''(\pm1)=0$). I arrived at these coefficients, which are fairly nice looking, all things considered:

$$\begin{aligned} f(x) &= a \tanh(bx) \\ a &= \sqrt{3} &&\approx 1.732050808 &&\approx 1.7159 \\ b &= \frac{-\ln(2 - \sqrt{3})}{2} &&\approx 0.658478948 &&\approx \frac{2}{3} \end{aligned}$$

Unfortunately, I do not remember the steps I took to get there.
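One possible route to these closed forms, offered as a reconstruction under the two constraints $f(1)=1$ and $f'''(1)=0$ with $b>0$, and not necessarily the steps originally taken. The condition $f'''(1)=0$ reduces to $\sinh(b) = 1/\sqrt{2}$, and the rest follows from standard hyperbolic identities:

$$\begin{aligned}
\cosh(b) &= \sqrt{1 + \sinh^2(b)} = \sqrt{\tfrac{3}{2}} \\
\tanh(b) &= \frac{\sinh(b)}{\cosh(b)} = \frac{1}{\sqrt{3}} \quad\Rightarrow\quad a = \frac{1}{\tanh(b)} = \sqrt{3} \\
e^{b} &= \sinh(b) + \cosh(b) = \frac{1 + \sqrt{3}}{\sqrt{2}} \quad\Rightarrow\quad e^{2b} = \frac{4 + 2\sqrt{3}}{2} = 2 + \sqrt{3} \\
b &= \frac{\ln(2 + \sqrt{3})}{2} = \frac{-\ln(2 - \sqrt{3})}{2}
\end{aligned}$$

The last equality uses $(2+\sqrt{3})(2-\sqrt{3}) = 1$, so $\ln(2+\sqrt{3}) = -\ln(2-\sqrt{3})$.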

Here's an interactive graph for anyone interested in playing around with it: https://www.desmos.com/calculator/tf4udjl8cn Original in red, mine in green. I doubt there would be any real difference in training outcome between them.
