
# Activation Functions

| Name | Activation \(f(x)\) | Inverse Activation \(f^{-1}(y)\) | Output Type | Range | Free from Vanishing Gradients | Zero-Centered | Comment |
|---|---|---|---|---|---|---|---|
| Identity | \(x\) | \(y\) | Continuous | \((-\infty, \infty)\) | ✅ | | |
| Binary Step | \(\begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}\) | | Binary | \(\{0, 1\}\) | ❌ | | |
| Tanh | \(\tanh(x)\) | \(\tanh^{-1}(y)\) | Continuous | \([-1, 1]\) | ❌ | ✅ | |
| Fast Softsign Tanh | \(\dfrac{x}{1 + \vert x \vert}\) | | | | | | |
| ArcTan | \(\tan^{-1}(x)\) | \(\tan(y)\) | Continuous | \((-\pi/2, \pi/2)\) | | | |
| Exponential | \(e^x\) | \(\ln(y)\) | Continuous | \((0, \infty)\) | | | |
| ReLU (Rectified Linear Unit) | \(\begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}\) | | Continuous | \([0, \infty)\) | ✅ | ❌ | ✅ Computationally efficient<br>❌ Not differentiable at \(x = 0\)<br>❌ Dead neurons due to poor initialization or a high learning rate; initialize with a slight +ve bias |
| SoftPlus (smooth alternative to ReLU) | \(\dfrac{1}{k} \ln \Big\vert 1 + \exp \{ k (x - x_0) \} \Big\vert\) | \(x_0 + \dfrac{1}{k} \ln(e^{ky} - 1)\) | Continuous | \((0, \infty)\) | ❌ | | |
| Parametric/Leaky ReLU | \(\begin{cases} \alpha x, & x < 0 \\ x, & x \ge 0 \end{cases}\) | | Continuous | \((-\infty, \infty)\) | ✅ | ✅ | All the positives of ReLU |
| Exponential Linear Unit (ELU) | \(\begin{cases} \alpha (e^x - 1), & x < 0 \\ x, & x \ge 0 \end{cases}\) | | Continuous | \((-\alpha, \infty)\) | ✅ | ❌ | \(\exp\) is computationally expensive, though not significant in large networks |
| Maxout | \(\max(w_1 x + b_1, w_2 x + b_2)\) | | | | ✅ | ✅ | Generalization of ReLU and Leaky ReLU<br>❌ Doubles the number of parameters |
| Generalized Logistic | \(a + (b - a) \dfrac{1}{1 + e^{-k(x - x_0)}}\)<br>\(a =\) minimum<br>\(b =\) maximum<br>\(k =\) steepness<br>\(x_0 =\) \(x\)-value of the midpoint | \(x_0 + \dfrac{1}{k} \ln \left\vert \dfrac{y - a}{b - y} \right\vert\) | Continuous | \([a, b]\) | ❌ | Depends on \(a\) and \(b\) | ❌ \(\exp\) is computationally expensive, though not significant in large networks<br>✅ Easy to interpret: "probabilistic", saturating "firing rate" of a neuron |
| Sigmoid/Standard Logistic/Soft Step | \(\dfrac{1}{1 + e^{-x}}\) | \(\ln \left\vert \dfrac{y}{1 - y} \right\vert\) | Binary-Continuous | \([0, 1]\) | ❌ | ❌ | \(\exp\) is computationally expensive, though not significant in large networks<br>✅ Easy to interpret: "probabilistic", saturating "firing rate" of a neuron |
| Fast Softsign Sigmoid | \(0.5 \Big( 1 + \dfrac{x}{1 + \vert x \vert} \Big)\) | | | | | | |
| Softmax | \(\dfrac{e^{x_i}}{\sum_{j=1}^k e^{x_j}}\), where \(k =\) number of classes, such that \(\sum_i p_i = 1\) | | Discrete-Continuous | \([0, 1]\) | ❌ | | |
| Softmax with Temperature | \(\dfrac{e^{x_i/T}}{\sum_{j=1}^k e^{x_j/T}}\) | | Discrete-Continuous | | ❌ | | Exposes more "dark knowledge" |

![Activation functions](activation_functions.svg)
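
For reference, here is a minimal NumPy sketch of a few of the activations in the table above (the parameter defaults such as `alpha=0.01`, `k=1.0`, and `x0=0.0` are illustrative assumptions, not prescribed values):

```python
import numpy as np

def relu(x):
    # max(0, x): cheap, but zero gradient for x < 0 (dead neurons possible)
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha * x for x < 0 keeps a small gradient alive on the negative side
    return np.where(x < 0, alpha * x, x)

def elu(x, alpha=1.0):
    # alpha * (exp(x) - 1) for x < 0: smooth saturation at -alpha
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

def softplus(x, k=1.0, x0=0.0):
    # (1/k) * log(1 + exp(k * (x - x0))); logaddexp avoids overflow
    return np.logaddexp(0.0, k * (x - x0)) / k

def sigmoid(x):
    # 1 / (1 + exp(-x)): output in (0, 1), not zero-centered
    return 1.0 / (1.0 + np.exp(-x))

def fast_softsign_tanh(x):
    # x / (1 + |x|): cheap, tanh-like squashing into (-1, 1)
    return x / (1.0 + np.abs(x))

x = np.linspace(-3, 3, 7)
for f in (relu, leaky_relu, elu, softplus, sigmoid, fast_softsign_tanh):
    print(f"{f.__name__:>18}", np.round(f(x), 3))
```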

## Softmax with temperature

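A small sketch of how the temperature \(T\) softens the output distribution (the logits below are illustrative values): as \(T\) grows, the probabilities of the non-argmax classes become clearly visible, which is the "dark knowledge" being exposed.

```python
import numpy as np

def softmax(x, T=1.0):
    # subtract the max for numerical stability, then scale logits by 1/T
    z = (x - np.max(x)) / T
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 2.0, 0.5])   # illustrative logits for 3 classes

for T in (1.0, 2.0, 5.0):
    # higher T -> softer distribution over the classes
    print(f"T={T}:", np.round(softmax(logits, T), 3))
```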

## Why use an activation function for hidden layers?

Otherwise, the network would just be ordinary linear regression/logistic regression, so there would be no point in having hidden layers.

Not using an activation function \(\implies\) using the identity activation function.

The only place the identity activation is acceptable is as the final output activation in regression.

### Linear Regression

$$
\begin{aligned}
\hat y &= w_{h_1 \hat y} h_1 + w_{h_2 \hat y} h_2 \\
&= w_{h_1 \hat y} (w_{x_1 h_1} x_1 + w_{x_2 h_1} x_2) + w_{h_2 \hat y} (w_{x_1 h_2} x_1 + w_{x_2 h_2} x_2) \\
&= \cdots \\
&= w_1 x_1 + w_2 x_2
\end{aligned}
$$
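
A quick numerical check of this collapse (the weights and shapes below are arbitrary illustrative values): with identity activations, the two layers are equivalent to a single weight vector \(w = W_2 W_1\).

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input -> hidden weights (the w_{x h} terms)
W2 = rng.normal(size=(1, 2))   # hidden -> output weights (the w_{h y} terms)
x = rng.normal(size=(2,))

h = W1 @ x                     # identity activation on the hidden layer
y_hat = W2 @ h                 # network output

w = W2 @ W1                    # single equivalent weight vector (w_1, w_2)
print(np.allclose(y_hat, w @ x))   # True: two linear layers = one linear layer
```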

### Logistic Regression

$$
\begin{aligned}
\hat y &= \sigma(w_{h_1 \hat y} h_1 + w_{h_2 \hat y} h_2) \\
&= \sigma \Big( w_{h_1 \hat y} (w_{x_1 h_1} x_1 + w_{x_2 h_1} x_2) + w_{h_2 \hat y} (w_{x_1 h_2} x_1 + w_{x_2 h_2} x_2) \Big) \\
&= \cdots \\
&= \sigma(w_1 x_1 + w_2 x_2)
\end{aligned}
$$
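
The same check with a sigmoid only at the output (again with illustrative weights): the linear hidden layer still collapses, leaving plain logistic regression.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 2))   # input -> hidden (identity activation)
W2 = rng.normal(size=(1, 2))   # hidden -> output
x = rng.normal(size=(2,))

y_hat = sigmoid(W2 @ (W1 @ x))                       # "deep" network with a linear hidden layer
print(np.allclose(y_hat, sigmoid((W2 @ W1) @ x)))    # True: just logistic regression
```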

## Why is non-zero-centering bad?

A non-zero-centered activation function such as sigmoid always outputs positive values, which constrains the gradients of all parameters of a layer to be either all positive or all negative.

This leads to sub-optimal, zig-zagging steps in the update procedure and hence slower convergence.
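
A toy sketch of that constraint: for a neuron with pre-activation \(z = w^\top a + b\) and all-positive inputs \(a\) (e.g. the sigmoid outputs of the previous layer), the chain rule gives \(\partial L / \partial w_i = (\partial L / \partial z) \, a_i\), so every component of the weight gradient shares the sign of the single scalar \(\partial L / \partial z\). The values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 1.0 / (1.0 + np.exp(-rng.normal(size=5)))   # previous layer's sigmoid outputs: all positive

# pretend the loss gradient w.r.t. the pre-activation z = w @ a + b is some scalar
for dL_dz in (+0.7, -1.3):
    dL_dw = dL_dz * a                            # chain rule: dL/dw_i = dL/dz * a_i
    print(dL_dz, np.sign(dL_dw))                 # all gradient components share one sign
```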
