Information Theory

Self-information

Self-information converts probability to bits: \(I(X) = -\log_2 p\) for an event \(X\) that happens with probability \(p\). Note that

  • \(I(X)\geqslant 0\) since \(p\leqslant 1\), with \(I(X)=0\) only when \(p=1\).

  • An event with smaller probability carries more information (bits) when it happens.

  • We can interpret the self-information (\(-\log(p)\)) as the amount of surprise we have at seeing a particular outcome.
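As a quick numeric illustration, here is a minimal sketch using only the Python standard library (the function name self_information is our own):

```python
import math

def self_information(p):
    """Self-information in bits of an event with probability p."""
    return -math.log2(p)

print(self_information(0.5))    # 1.0 bit: a fair coin flip
print(self_information(0.125))  # 3.0 bits: a rarer event carries more information
```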

Entropy

For any random variable \(X\) that follows a probability distribution \(P\) with a probability density function (p.d.f.) or a probability mass function (p.m.f.) \(p(x)\), we measure the expected amount of information through entropy (or Shannon entropy)

\[H(X) = - E_{x \sim P_X} [\log p(x)].\]
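For a discrete distribution, the expectation becomes a sum over outcomes, as in this sketch (terms with zero probability are skipped, using the convention \(0 \log 0 = 0\)):

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit for a fair coin
print(entropy([0.25] * 4))   # 2.0 bits for a uniform 4-outcome variable
print(entropy([0.9, 0.1]))   # about 0.469 bits: a biased coin is less surprising on average
```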

Joint Entropy

Similar to the entropy of a single random variable, we define the joint entropy \(H(X, Y)\) of a pair of random variables \((X, Y)\) as

\[H(X, Y) = -E_{(x, y) \sim P_{X, Y}} [\log p_{X, Y}(x, y)]. \]
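The following sketch (assuming the joint p.m.f. is given as a dictionary mapping outcome pairs to probabilities) shows the computation and two limiting cases:

```python
import math

def joint_entropy(joint_pmf):
    """Joint entropy in bits given a joint p.m.f. as a dict {(x, y): probability}."""
    return -sum(p * math.log2(p) for p in joint_pmf.values() if p > 0)

# Two independent fair coins: H(X, Y) = H(X) + H(Y) = 2 bits.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(joint_entropy(independent))  # 2.0

# Perfectly correlated coins (Y = X): H(X, Y) = H(X) = 1 bit.
correlated = {(0, 0): 0.5, (1, 1): 0.5}
print(joint_entropy(correlated))   # 1.0
```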