Information Theory¶
Entropy¶
Entropy, as it relates to machine learning, is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information.
Information entropy was developed to describe the average amount of information needed to specify the state of a random variable (RV)
Quantifies how much new information about a RV \(x\) is obtained when we observe a specific value \(x_i\)
- Depends on ‘degree of surprise’: a highly improbable value conveys more information than a likely one
- If we know an event is certain to happen, we would receive no information when we actually observe it happening
Let \(h(x)\) denote the information content of an event \(x\), such that
- \(h(x)\) should be a monotonically decreasing function of \(p(x)\): the less probable the event, the more information its occurrence conveys
- For 2 independent events \(A\) and \(B\), the information should add, \(h(A, B) = h(A) + h(B)\), while the probabilities multiply, \(p(A, B) = p(A) \, p(B)\)
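Together, these two requirements lead to the logarithmic form (up to the choice of log base), which is the \(h(x)\) used in the entropy definition below: $$ \begin{aligned} h(A, B) &= h(A) + h(B), \quad p(A, B) = p(A) \, p(B) \\ \implies h(x) &= \log \dfrac{1}{p(x)} = -\log p(x) \end{aligned} $$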
Shannon/Information Entropy¶
Entropy of a probability distribution \(p(x)\) is the average amount of information transmitted by discrete random variable \(x\) $$ \begin{aligned} H(p) &= E_p[h(x)] \\ &= \sum_x p(x) \log \dfrac{1}{p(x)} \end{aligned} $$
- Distributions with sharp peaks around a few values will have a relatively low entropy
- Those spread evenly across many values will have higher entropy (both cases are contrasted in the sketch below)
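A quick numerical illustration (a minimal sketch using NumPy; the two distributions below are made up for illustration):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(p) = sum_x p(x) * log(1 / p(x))."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # treat 0 * log(1/0) as 0
    return np.sum(p * np.log(1.0 / p)) / np.log(base)

peaked  = [0.97, 0.01, 0.01, 0.01]        # sharp peak around one value
uniform = [0.25, 0.25, 0.25, 0.25]        # spread evenly across all values

print(entropy(peaked))                    # ~0.24 bits (low entropy)
print(entropy(uniform))                   # 2.0 bits (maximal for 4 outcomes)
```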
If we use \(\log_2\) for \(H(p)\), then \(H(p)\) is a lower bound on the average number of bits needed to encode a RV with distribution \(p\); using \(\ln\) instead gives the same quantity in nats. Achieving this bound requires using an optimal coding scheme designed for \(p\) (a numerical sketch follows the list below), which assigns
- shorter codes to higher probability events
- longer codes to less probable events
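For intuition, the ideal code length for a value \(x\) under such a scheme is \(\log_2 \frac{1}{p(x)}\) bits, and the expected code length then equals \(H(p)\). A minimal sketch with a made-up distribution:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # hypothetical source distribution
ideal_lengths = np.log2(1.0 / p)          # optimal code lengths: shorter for likelier values
avg_length = np.sum(p * ideal_lengths)    # expected bits per encoded value

print(ideal_lengths)                      # [1. 2. 3. 3.]
print(avg_length)                         # 1.75, which equals H(p) in bits
```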
Cross Entropy¶
The average amount of information needed to specify \(x\) as a result of using pdf \(q(x)\) instead of true \(p(x)\)
If we use \(\log_2\), it is the average number of bits needed to encode a RV using a coding scheme designed for \(q(x)\) instead of the true \(p(x)\) $$ \begin{aligned} H(p, q) &= E_p \left[ \log \dfrac{1}{q(x)} \right] \\ &= \sum_x p(x) \log \dfrac{1}{q(x)} \end{aligned} $$
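A minimal numerical sketch (NumPy; \(p\) and \(q\) below are made up for illustration) showing that coding with the wrong distribution \(q\) costs more bits on average than coding with the true \(p\):

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """H(p, q) = sum_x p(x) * log(1 / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                          # terms with p(x) = 0 contribute nothing
    return np.sum(p[mask] * np.log(1.0 / q[mask])) / np.log(base)

p = [0.5, 0.25, 0.125, 0.125]             # true distribution
q = [0.25, 0.25, 0.25, 0.25]              # assumed (wrong) distribution

print(cross_entropy(p, q))                # 2.0 bits with the wrong code
print(cross_entropy(p, p))                # 1.75 bits = H(p), the minimum possible
```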
KL Divergence¶
Dissimilarity measure of 2 probability distributions
Represents the average additional information required to specify \(x\) due to using \(q(x)\) instead of the true \(p(x)\) $$ \begin{aligned} D_\text{KL}(p \vert \vert q) &= \text{Cross Entropy} - \text{Entropy} \\ &= H(p, q) - H(p) \\ &= \sum_x p(x) \log \dfrac{p(x)}{q(x)} & \text{(Discrete)} \\ &= \int_x p(x) \log \dfrac{p(x)}{q(x)} \, dx & \text{(Continuous)} \end{aligned} $$ Properties
- \(D_\text{KL}(p \vert \vert q) = 0 \iff p=q\)
- KL divergence is asymmetric \(\implies D_\text{KL}(p \vert \vert q) \ne D_\text{KL}(q \vert \vert p)\). Hence it is not a proper distance measure
- Non-negative: \(D_\text{KL}(p \vert \vert q) \ge 0\) (these properties are checked numerically in the sketch below)
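These properties, and the identity \(D_\text{KL}(p \vert \vert q) = H(p, q) - H(p)\), can be checked numerically (a minimal sketch; the distributions are illustrative):

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """D_KL(p || q) = sum_x p(x) * log(p(x) / q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                        # 0 * log(0 / q) is taken as 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

p = np.array([0.5, 0.25, 0.125, 0.125])                 # "true" distribution
q = np.array([0.7, 0.1, 0.1, 0.1])                      # approximating distribution

print(kl_divergence(p, q))                              # ~0.168 bits, never negative
print(kl_divergence(q, p))                              # ~0.143 bits -> asymmetric
print(kl_divergence(p, p))                              # 0.0 -> zero iff p == q

# Identity: D_KL(p || q) = H(p, q) - H(p)
h_p, h_pq = np.sum(p * np.log2(1 / p)), np.sum(p * np.log2(1 / q))
print(np.isclose(kl_divergence(p, q), h_pq - h_p))      # True
```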
Given a fixed distribution \(p\), optimizing \(q\) toward the following 3 goals is equivalent (a numerical check follows the list)
- minimize \(D_\text{KL}(p \vert \vert q)\)
- minimize \(H(p, q)\)
- maximize the likelihood \(L(q \vert x)\) of data \(x \sim p(x)\)
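A sketch of why these are equivalent, with made-up numbers: since \(p\) is fixed, \(H(p)\) is a constant, so \(D_\text{KL}(p \vert \vert q)\) and \(H(p, q)\) differ only by that constant, and the average negative log-likelihood of samples drawn from \(p\) is a Monte Carlo estimate of \(H(p, q)\).

```python
import numpy as np

rng = np.random.default_rng(0)

p = np.array([0.5, 0.25, 0.125, 0.125])      # fixed data distribution
q = np.array([0.7, 0.1, 0.1, 0.1])           # candidate model distribution

h_p  = np.sum(p * np.log2(1 / p))            # entropy H(p) -- constant w.r.t. q
h_pq = np.sum(p * np.log2(1 / q))            # cross entropy H(p, q)
kl   = h_pq - h_p                            # D_KL(p || q)

# Minimizing H(p, q) over q is the same as minimizing D_KL(p || q),
# since they differ only by the q-independent constant H(p).
print(h_pq, kl + h_p)                        # same value (up to float rounding)

# Maximizing the likelihood of samples x ~ p(x) under q is (asymptotically)
# the same as minimizing H(p, q): the average negative log-likelihood estimates it.
x = rng.choice(len(p), size=100_000, p=p)    # samples drawn from p
avg_nll = np.mean(np.log2(1 / q[x]))         # Monte Carlo estimate of H(p, q)
print(h_pq, avg_nll)                         # close, e.g. ~1.918 vs ~1.92
```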