Information theory

Entropy

The entropy of a probability distribution can be interpreted as a measure of the uncertainty, or lack of predictability, associated with a random variable drawn from that distribution. We can also use entropy to define the information content of a data source: a distribution with high entropy has high information content.

Entropy for discrete random variables

The entropy of a discrete random variable $X$ with distribution $p$ over $K$ states is defined by

$$\mathbb{H}(X) \triangleq -\sum_{k = 1}^{K} p(X = k)\ \text{log}_2\ p(X = k) = -\mathbb{E}_X[\text{log}_2\ p(X)]$$
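
As a quick sanity check, here is a minimal NumPy sketch that evaluates this sum directly (the distributions passed in are arbitrary examples, not from the text):

```python
import numpy as np

def entropy(p, base=2):
    """Entropy of a discrete distribution p; terms with p_k = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # by convention, 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

# A uniform distribution over 4 states has the maximum entropy, log2(4) = 2 bits,
# while a deterministic distribution has entropy 0.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0
```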

Cross entropy

The cross entropy between distributions $p$ and $q$ is defined by

$$\mathbb{H}(p, q) \triangleq -\sum_{k=1}^{K} p_k \text{log\ } q_k$$
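
For illustration, a small sketch that evaluates this sum (the arrays `p` and `q` below are made-up examples); note that $\mathbb{H}(p, p) = \mathbb{H}(p)$, and cross entropy is minimized over $q$ when $q = p$:

```python
import numpy as np

def cross_entropy(p, q, base=2):
    """Cross entropy H(p, q) = -sum_k p_k log q_k for discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p_k = 0 contribute 0
    return -np.sum(p[mask] * np.log(q[mask])) / np.log(base)

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(cross_entropy(p, p))  # equals H(p) = 1.5 bits
print(cross_entropy(p, q))  # larger: log2(3) ~ 1.585 bits
```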

Relative entropy (KL divergence)

Given two distributions $p$ and $q$, it is often useful to define a way of measuring how "close" or "similar" they are. More precisely, we say that $D$ is a divergence if $D(p, q) \geq 0$ with equality iff $p = q$; unlike a metric, it need not be symmetric or satisfy the triangle inequality $D(p, r) \leq D(p, q) + D(q, r)$.

For discrete distributions, the KL divergence is defined as follows:

$$\mathbb{KL}(p || q) \triangleq \sum_{k = 1}^{K} p_k \text{log\ } \frac{p_k}{q_k}$$
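
A minimal sketch that evaluates the discrete KL divergence (the distributions are arbitrary examples); it also illustrates the standard identity $\mathbb{KL}(p\,||\,q) = \mathbb{H}(p, q) - \mathbb{H}(p)$:

```python
import numpy as np

def kl_divergence(p, q, base=2):
    """KL(p || q) = sum_k p_k log(p_k / q_k); assumes q_k > 0 wherever p_k > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                      # terms with p_k = 0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

p = np.array([0.5, 0.25, 0.25])
q = np.array([1/3, 1/3, 1/3])
print(kl_divergence(p, q))  # ~0.085 bits = H(p, q) - H(p) = 1.585 - 1.5
print(kl_divergence(p, p))  # 0.0, since KL(p || q) = 0 iff p = q
```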

This naturally extends to continuous distributions as well:

$$\mathbb{KL}(p || q) \triangleq \int p(x)\ \text{log\ } \frac{p(x)}{q(x)}\ dx$$
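
As a sketch (the two Gaussians below are arbitrary choices, not from the text), we can approximate this integral numerically and compare it against the well-known closed form for the KL divergence between two univariate Gaussians:

```python
import numpy as np
from scipy.stats import norm

# Two univariate Gaussians p = N(0, 1) and q = N(1, 2^2), chosen as an example.
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
p, q = norm(mu1, s1), norm(mu2, s2)

# Approximate the integral on a fine grid (result in nats, since we use log base e).
x = np.linspace(-10, 10, 200_001)
integrand = p.pdf(x) * np.log(p.pdf(x) / q.pdf(x))
kl_numeric = np.sum(integrand) * (x[1] - x[0])

# Closed-form KL(N(mu1, s1^2) || N(mu2, s2^2)).
kl_exact = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

print(kl_numeric, kl_exact)   # both approximately 0.443 nats
```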

Mutual information

The KL divergence gave us a way to measure how similar two distributions are. How should we measure how dependent two random variables are? One approach is to turn the question of measuring the dependence of two random variables into a question about the similarity of their distributions: compare the joint distribution to the product of the marginals. This gives rise to the notion of mutual information (MI) between two random variables, which is defined below.

$$\mathbb{I} (X; Y) \triangleq \mathbb{KL} (p(x, y) || p(x)p(y)) = \sum_{y \in Y} \sum_{x \in X} p(x, y)\ \text{log\ } \frac{p(x, y)}{p(x) p(y)}$$

MI measures the information gain if we update from a model that treats the two variables as independent, $p(x)p(y)$, to one that models their true joint distribution, $p(x, y)$.
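
A small sketch (the 2x2 joint table below is a made-up example) that evaluates this double sum, i.e. the KL divergence between the joint and the product of its marginals:

```python
import numpy as np

def mutual_information(pxy, base=2):
    """I(X; Y) = sum_{x,y} p(x,y) log[ p(x,y) / (p(x) p(y)) ] for a joint table pxy."""
    pxy = np.asarray(pxy, float)
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x), shape (K, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y), shape (1, K)
    mask = pxy > 0                        # terms with p(x,y) = 0 contribute 0
    ratio = pxy[mask] / (px @ py)[mask]   # p(x,y) / (p(x) p(y))
    return np.sum(pxy[mask] * np.log(ratio)) / np.log(base)

# Strongly dependent variables: X = Y with probability 0.9.
pxy = np.array([[0.45, 0.05],
                [0.05, 0.45]])
print(mutual_information(pxy))            # ~0.53 bits

# Independent variables: the joint factorizes into p(x) p(y), so the MI is 0.
px = pxy.sum(axis=1, keepdims=True)
py = pxy.sum(axis=0, keepdims=True)
print(mutual_information(px @ py))        # 0.0 (up to rounding)
```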