
Activation Functions

| Name | Activation \(f(x)\) | Inverse Activation \(f^{-1}(y)\) | Output Type | Range | Free from Vanishing Gradients | Zero-Centered | Comment |
|---|---|---|---|---|---|---|---|
| Identity | \(x\) | \(y\) | Continuous | \((-\infty, \infty)\) | ✅ | | |
| Binary Step | \(\begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}\) | | Binary | \(\{0, 1\}\) | ❌ | | |
| Tanh | \(\tanh(x)\) | \(\tanh^{-1}(y)\) | Continuous | \([-1, 1]\) | ❌ | ✅ | |
| Fast Softsign Tanh | \(\dfrac{x}{1 + \vert x \vert}\) | | | | | | |
| ArcTan | \(\tan^{-1}(x)\) | \(\tan(y)\) | Continuous | \((-\pi/2, \pi/2)\) | | | |
| Exponential | \(e^x\) | \(\ln(y)\) | Continuous | \([0, \infty)\) | | | |
| ReLU (Rectified Linear Unit) | \(\begin{cases} 0, & x < 0 \\ x, & x \ge 0 \end{cases}\) | | Continuous | \([0, \infty)\) | ✅ | ❌ | ✅ Computationally efficient<br>❌ Not differentiable at \(x=0\)<br>❌ Dead neurons due to poor initialization or a high learning rate; initialize with a slight +ve bias |
| SoftPlus (smooth alternative to ReLU) | \(\dfrac{1}{k} \ln \left( 1 + e^{k(x-x_0)} \right)\) | \(x_0 + \dfrac{1}{k} \ln(e^{ky} - 1)\) | Continuous | \([0, \infty)\) | ❌ | | |
| Parametric/Leaky ReLU | \(\begin{cases} \alpha x, & x < 0 \\ x, & x \ge 0 \end{cases}\) | | Continuous | \((-\infty, \infty)\) | ✅ | ✅ | All positives of ReLU |
| Exponential Linear Unit (ELU) | \(\begin{cases} \alpha (e^x-1), & x < 0 \\ x, & x \ge 0 \end{cases}\) | | Continuous | \((-\alpha, \infty)\) | ✅ | | ❌ \(\exp\) is computationally expensive, though not significant in large networks |
| Maxout | \(\max(w_1 x + b_1, w_2 x + b_2)\) | | | | ✅ | ✅ | Generalization of ReLU and Leaky ReLU<br>❌ Doubles the number of parameters |
| Generalized Logistic | \(a + (b-a) \dfrac{1}{1+e^{-k(x-x_0)}}\)<br>\(a =\) minimum, \(b =\) maximum, \(k =\) steepness, \(x_0 =\) \(x\) center | \(x_0 + \dfrac{1}{k} \ln \left\vert \dfrac{y-a}{b-y} \right\vert\) | Continuous | \([a, b]\) | ❌ | Depends on \(a\) and \(b\) | ❌ \(\exp\) is computationally expensive, though not significant in large networks<br>✅ Easy to interpret: "probabilistic", saturating "firing rate" of a neuron |
| Sigmoid/Standard Logistic/Soft Step | \(\dfrac{1}{1+e^{-x}}\) | \(\ln \left\vert \dfrac{y}{1-y} \right\vert\) | Binary-Continuous | \([0, 1]\) | ❌ | ❌ | ❌ \(\exp\) is computationally expensive, though not significant in large networks<br>✅ Easy to interpret: "probabilistic", saturating "firing rate" of a neuron |
| Fast Softsign Sigmoid | \(0.5 \left( 1 + \dfrac{x}{1 + \vert x \vert} \right)\) | | | | | | |
| Softmax | \(\dfrac{e^{x_i}}{\sum_{j=1}^k e^{x_j}}\), where \(k =\) no of classes, such that \(\sum_{i=1}^k p_i = 1\) | | Discrete-Continuous | \([0, 1]\) | ❌ | | |
| Softmax with Temperature | \(\dfrac{e^{x_i/T}}{\sum_{j=1}^k e^{x_j/T}}\) | | Discrete-Continuous | | ❌ | | Exposes more "dark knowledge" |
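
A minimal NumPy sketch of a few of the activations tabulated above; the parameter defaults (\(\alpha\), \(k\), \(x_0\), \(T\)) are illustrative assumptions, not values prescribed by the table:

```python
import numpy as np

def relu(x):
    # max(0, x), element-wise
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # alpha * x for x < 0, x otherwise
    return np.where(x < 0, alpha * x, x)

def elu(x, alpha=1.0):
    # alpha * (e^x - 1) for x < 0, x otherwise
    return np.where(x < 0, alpha * (np.exp(x) - 1.0), x)

def softplus(x, k=1.0, x0=0.0):
    # (1/k) * ln(1 + exp(k * (x - x0))), smooth alternative to ReLU
    return np.log1p(np.exp(k * (x - x0))) / k

def sigmoid(x):
    # 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, T=1.0):
    # Softmax with temperature T; T=1 recovers plain softmax
    z = x / T
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

x = np.linspace(-3, 3, 7)
print(relu(x))
print(softmax(np.array([2.0, 1.0, 0.1])))
```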

![Activation functions](activation_functions.svg)

Figure: Softmax with temperature
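
A small numeric illustration of the temperature effect (the logits here are made up): as \(T\) grows, the distribution softens, so the relative ordering among the small probabilities, i.e. the "dark knowledge", becomes visible.

```python
import numpy as np

def softmax_T(logits, T=1.0):
    # e^(x_i / T) / sum_j e^(x_j / T), with a max-shift for stability
    z = logits / T
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([8.0, 3.0, 1.0])   # hypothetical class scores

for T in (1.0, 2.0, 5.0):
    print(T, np.round(softmax_T(logits, T), 4))
# Higher T moves probability mass away from the dominant class,
# making the ratios among the smaller classes easier to see.
```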

Why use activation functions for hidden layers?

Otherwise, the network would reduce to ordinary linear regression/logistic regression, so the hidden layers would serve no purpose.

Not using an activation function \(\implies\) using the identity activation function.

The only place the identity activation function is acceptable is as the output activation of the final layer in regression.

Linear Regression

```mermaid
flowchart LR
a((x1)) & b((x2)) -->
d((h1)) & e((h2)) -->
y(("&ycirc;"))
```
\[ \begin{aligned} \hat y &= w_{h_1 \hat y} h_1 + w_{h_2 \hat y} h_2 \\ &= w_{h_1 \hat y} (w_{x_1 h_1} x_1 + w_{x_2 h_1} x_2) + w_{h_2 \hat y} (w_{x_1 h_2} x_1 + w_{x_2 h_2} x_2) \\ &= \cdots \\ &= w_1 x_1 + w_2 x_2 \end{aligned} \]
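
A quick numerical check of this collapse, using random weights purely for illustration: composing two linear layers without an activation gives exactly the same map as a single linear layer whose weights are the product of the two weight matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(2, 2))   # weights x -> hidden (no activation)
w_hy = rng.normal(size=(1, 2))   # weights hidden -> y_hat

x = rng.normal(size=(2, 1))

y_two_layer = w_hy @ (W_xh @ x)   # "deep" linear network
w_collapsed = w_hy @ W_xh         # equivalent single weight vector [w1, w2]
y_one_layer = w_collapsed @ x

print(np.allclose(y_two_layer, y_one_layer))   # True
```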

Logistic Regression

```mermaid
flowchart LR
a((x1)) & b((x2)) -->
d((h1)) & e((h2)) -->
s(("&sigma;")) -->
y(("&ycirc;"))
```
\[ \begin{aligned} \hat y &= \sigma(w_{h_1 \hat y} h_1 + w_{h_2 \hat y} h_2) \\ &= \sigma(w_{h_1 \hat y} (w_{x_1 h_1} x_1 + w_{x_2 h_1} x_2) + w_{h_2 \hat y} (w_{x_1 h_2} x_1 + w_{x_2 h_2} x_2)) \\ &= \cdots \\ &= \sigma(w_1 x_1 + w_2 x_2) \end{aligned} \]

Why is non-zero-centering bad?

Since a non-zero-centered activation function such as sigmoid always outputs positive values, it constrains the gradients of a layer's parameters to share the same sign: either all positive or all negative.

This leads to sub-optimal (zig-zag) steps in the update procedure, and hence slower convergence.
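
A sketch of why the signs are tied: for a neuron \(z = w^\top a + b\) whose inputs \(a\) are the all-positive outputs of a sigmoid layer, the gradient \(\partial L / \partial w_i = \delta \, a_i\) carries the sign of the upstream error \(\delta\) for every \(i\), so all components of \(\nabla_w L\) move together. The values below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

a = 1.0 / (1.0 + np.exp(-rng.normal(size=5)))  # sigmoid outputs: all positive
delta = -0.7                                   # upstream error dL/dz (arbitrary scalar)

grad_w = delta * a                             # dL/dw_i = delta * a_i
print(np.sign(grad_w))                         # every component shares delta's sign
```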

