Bayesian decision theory

Introduction

Bayesian inference provides the optimal way to update our beliefs about hidden quantities H given observed data X = x by computing the posterior p(H|x). However, at the end of the day, we need to turn our beliefs into actions that we can perform in the world. How can we decide which action is best? This is where Bayesian decision theory comes in.

In decision theory, we assume the decision maker, or agent, has a set of possible actions, \mathcal{A}, to choose from. Each of these actions has costs and benefits, which depend on the underlying state of nature H \in \mathcal{H}. We can encode this information into a loss function l(h, a), which specifies the loss we incur if we take action a \in \mathcal{A} when the state of nature is h \in \mathcal{H}.

For example, suppose the state is defined by the age of the patient (young vs old), and whether they have COVID-19 or not. Note that the age can be observed directly, but the disease state must be inferred from noisy observations; thus the state is partially observed.

Let us assume that the cost of administering the drug is the same, no matter what the state of the patient is. However, the benefits will differ. If the patient is young, we expect them to live a long time, so the cost of not giving the drug if they have COVID-19 is high; but if the patient is old, they have fewer years to live, so the cost of not giving the drug is arguably less. In medical circles, a common unit of cost is quality-adjusted life years or QALY. Suppose that the expected QALY for a young person is 60, and for an old person is 10. Let us assume the drug costs the equivalent of 8 QALY, due to induced pain and suffering from side effects.

Once we have specified the loss function, we can compute the posterior expected loss or risk for each possible action:

R(a|x) \triangleq \mathbb{E}_{p(h|x)} [l(h, a)] = \sum_{h \in \mathcal{H}} l(h, a) p(h|x)

The optimal policy (also called the Bayes estimator) specifies what action to take for each possible observation so as to minimize risk:

\pi^* (\mathbf{x}) = \underset{a \in \mathcal{A}}{\text{argmin}}\ \mathbb{E}_{p(h|\mathbf{x})} [l(h, a)]
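
To make this concrete, here is a minimal numpy sketch of the rule above: given a loss matrix and a posterior over states, compute the risk of every action and pick the minimizer. The particular loss values and posterior below are hypothetical numbers, loosely inspired by the QALY discussion, not part of the original example.

```python
import numpy as np

def bayes_action(loss, posterior):
    """Pick the action that minimizes the posterior expected loss.

    loss:      array of shape (num_states, num_actions), loss[h, a] = l(h, a)
    posterior: array of shape (num_states,), giving p(h|x)
    """
    risk = posterior @ loss   # R(a|x) = sum_h l(h, a) p(h|x)
    return int(np.argmin(risk)), risk

# Hypothetical loss matrix for a young patient, in QALY lost:
# states = (no COVID, COVID), actions = (no drug, give drug).
loss = np.array([[0.0, 8.0],      # healthy: the drug only costs side effects
                 [60.0, 8.0]])    # sick: withholding the drug is very costly
posterior = np.array([0.7, 0.3])  # p(h|x) from some diagnostic test (made up)

a_star, risk = bayes_action(loss, posterior)
print(risk)    # [18.  8.] -> giving the drug has lower risk
print(a_star)  # 1
```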

Classification problems

In this section, we use Bayesian decision theory to decide the optimal class label to predict given an observed input x \in \mathcal{X}.

Zero-one loss

Suppose the states of nature correspond to class labels, so \mathcal{H} = \mathcal{Y} = \{ 1, \ldots, C \}. Furthermore, suppose the actions also correspond to class labels, so \mathcal{A} = \mathcal{Y}. In this setting, a very commonly used loss function is the zero-one loss, l_{01}(h, a) = \mathbb{I}(h \neq a), which assigns a loss of 0 to a correct prediction and 1 to any error.

Hence the action that minimizes the expected loss is to choose the most probable label:

\pi(\mathbf{x}) = \underset{y \in \mathcal{Y}}{\text{argmax}}\ p(y|\mathbf{x})

This corresponds to the mode of the posterior distribution, also known as the maximum a posteriori or MAP estimate.

Reject option

In some cases, we may be able to say "I don't know" instead of returning an answer that we don't really trust; this is particularly important in risk-averse domains. This is known as the reject option. We can formalize it by letting action 0 denote "reject", and using the following loss function:

l(h, a) = \begin{cases} 0 & \text{if } h = a \text{ and } a \in \{ 1, \ldots, C \} \\ \lambda_r & \text{if } a = 0 \\ \lambda_e & \text{otherwise} \end{cases}

where \lambda_r is the cost of the reject action, and \lambda_e is the cost of a classification error.
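
It can be shown that, under this loss, the optimal policy is to pick the most probable label whenever its posterior probability exceeds 1 - \lambda_r/\lambda_e, and to reject otherwise. Below is a minimal sketch of that rule; the class probabilities are assumed to come from some upstream classifier.

```python
import numpy as np

def classify_with_reject(probs, lambda_r=1.0, lambda_e=5.0):
    """Return the most probable class in 1..C, or 0 (reject) if unsure.

    probs: array of shape (C,) holding the posterior p(y|x).
    The optimal confidence threshold is 1 - lambda_r / lambda_e.
    """
    threshold = 1.0 - lambda_r / lambda_e
    best = int(np.argmax(probs))
    return best + 1 if probs[best] >= threshold else 0

print(classify_with_reject(np.array([0.55, 0.45])))  # 0: reject, too uncertain
print(classify_with_reject(np.array([0.95, 0.05])))  # 1: confident enough
```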

ROC curves

We showed that we can pick the optimal label in binary classification by thresholding the probability using a value derived from the relative cost of a false positive and a false negative. Instead of picking a single threshold, we can instead consider using a set of different thresholds, and comparing the resulting performance.

Class confusion matrices

For any fixed threshold \tau, we can compute the empirical number of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN). We can store these results in a 2 \times 2 class confusion matrix C, where C_{ij} is the number of times an item with true class label i was (mis)classified as having label j. From these counts we can compute the true positive rate, TPR = TP/(TP + FN), and the false positive rate, FPR = FP/(FP + TN). We can now plot the TPR vs the FPR as an implicit function of \tau. This is called a receiver operating characteristic or ROC curve.

The quality of the ROC curve is often summarized as a single number using the area under the curve or AUC score; higher AUC scores are better. Another summary statistic that is used is the equal error rate or EER, also called the cross-over rate, defined as the value which satisfies FPR = FNR. Since FNR = 1 - TPR, we can compute the EER by drawing a line from the top left to the bottom right and seeing where it intersects the ROC curve; lower EER scores are better.
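
The sketch below illustrates how the curve and the AUC could be computed with numpy by sweeping a grid of thresholds over classifier scores. It is meant only as an illustration of the definitions above (library implementations, e.g. in scikit-learn, are preferable in practice), and the scores and labels are made-up toy data.

```python
import numpy as np

def roc_points(scores, labels, num_thresholds=101):
    """Return (FPR, TPR) pairs as an implicit function of the threshold tau."""
    thresholds = np.linspace(scores.min(), scores.max(), num_thresholds)
    fpr, tpr = [], []
    for tau in thresholds:
        pred = scores >= tau
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        tn = np.sum(~pred & (labels == 0))
        tpr.append(tp / (tp + fn))
        fpr.append(fp / (fp + tn))
    return np.array(fpr), np.array(tpr)

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])  # toy classifier scores
labels = np.array([0, 0, 1, 1, 1, 0])               # toy true labels

fpr, tpr = roc_points(scores, labels)
order = np.argsort(fpr)
auc = np.trapz(tpr[order], fpr[order])  # area under the curve
print(auc)
```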

Class imbalance

In some problems, there is severe class imbalance. For example, the set of negatives is usually much larger than the set of positives. The usefulness of the ROC curve may be reduced in such cases, since a large change in the absolute number of false positives will not change the false positive rate very much, because FPR = FP/(FP + TN) is dominated by the large number of true negatives. Thus all the action happens in the extreme left part of the curve. In such cases, we may choose other ways of summarizing the class confusion matrix, such as precision-recall curves.

Precision-recall curves

In some problems, the notion of a "negative" is not well-defined. In these kinds of situations, we may choose to use a precision-recall curve to summarize the performance of the system.

The key idea is to replace the FPR with precision, TP/(TP + FP); recall is the same as the TPR, TP/(TP + FN). If \widehat{y}_n \in \{ 0, 1 \} is the predicted label and y_n \in \{ 0, 1 \} is the true label, we can estimate precision and recall using

\mathcal{P}(\tau) = \frac{\sum_n y_n \widehat{y}_n}{\sum_n \widehat{y}_n}, \quad \mathcal{R}(\tau) = \frac{\sum_n y_n \widehat{y}_n}{\sum_n y_n}

We can now plot precision vs recall as we vary the threshold \tau. Hugging the top right is the best we can do.
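
A minimal sketch of these estimators, assuming the predictions \widehat{y}_n are obtained by thresholding a score at \tau (the toy scores and labels below are made up):

```python
import numpy as np

def precision_recall(scores, labels, tau):
    """Empirical precision P(tau) and recall R(tau) at a given threshold."""
    y_hat = (scores >= tau).astype(int)      # predicted labels
    y = labels.astype(int)                   # true labels
    tp = np.sum(y * y_hat)                   # sum_n y_n * y_hat_n
    precision = tp / max(np.sum(y_hat), 1)   # guard against no predictions
    recall = tp / max(np.sum(y), 1)          # guard against no positives
    return precision, recall

scores = np.array([0.9, 0.8, 0.6, 0.4, 0.3])
labels = np.array([1, 0, 1, 1, 0])
for tau in [0.2, 0.5, 0.85]:
    print(tau, precision_recall(scores, labels, tau))
```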

F-scores

For a fixed threshold, corresponding to a single point on the PR curve, we can compute a single precision and recall value, which we will denote by \mathcal{P} and \mathcal{R}. These are often combined into a single statistic called the F_\beta score, which weights recall as \beta > 0 times more important than precision:

\frac{1}{F_\beta} = \frac{1}{1 + \beta^2} \frac{1}{\mathcal{P}} + \frac{\beta^2}{1 + \beta^2} \frac{1}{\mathcal{R}}

or equivalently

F_\beta \triangleq (1 + \beta^2) \frac{\mathcal{P} \cdot \mathcal{R}}{\beta^2 \mathcal{P} + \mathcal{R}}

If we set β=1\beta = 1, we get the harmonic mean of precision and recall:

F_1 = 2 \frac{\mathcal{P} \cdot \mathcal{R}}{\mathcal{P} + \mathcal{R}}

Using the F_1 score weights precision and recall equally. However, if recall is more important, we may use \beta = 2, and if precision is more important, we may use \beta = 0.5.
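
A small sketch of the F_\beta formula, assuming the precision and recall values have already been computed:

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

p, r = 0.8, 0.5
print(f_beta(p, r, beta=1.0))   # F1: harmonic mean of P and R
print(f_beta(p, r, beta=2.0))   # weights recall more heavily
print(f_beta(p, r, beta=0.5))   # weights precision more heavily
```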

Regression problems

So far, we have considered the case where there are a finite number of actions \mathcal{A} and states of nature \mathcal{H}. In this section, we consider the case where the set of actions and states are both equal to the real line.

L2 loss

The l_2 loss, also called the squared error or quadratic loss, is defined as follows:

l_2 (h, a) = (h - a)^2

L1 loss

The l_2 loss penalizes deviations from the truth quadratically, and thus is sensitive to outliers. A more robust alternative is the absolute or l_1 loss

l_1(h, a) = |h - a|

Huber loss

Another robust loss function is the Huber loss, defined as follows:

l_{\delta}(h, a) = \begin{cases} r^2/2 & \text{if } |r| \leq \delta \\ \delta |r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}

where r = h - a. This is equivalent to l_2 for errors that are smaller than \delta, and is equivalent to l_1 for larger errors.
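
The sketch below evaluates the three losses on a grid of residuals r = h - a, showing that the Huber loss is quadratic for |r| \leq \delta and linear beyond:

```python
import numpy as np

def l2_loss(r):
    return r**2

def l1_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.5):
    quadratic = 0.5 * r**2                       # used when |r| <= delta
    linear = delta * np.abs(r) - 0.5 * delta**2  # used when |r| > delta
    return np.where(np.abs(r) <= delta, quadratic, linear)

r = np.linspace(-4, 4, 9)   # residuals h - a
print(l2_loss(r))
print(l1_loss(r))
print(huber_loss(r))
```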

Probabilistic prediction problems

In this section, we assume the set of possible actions is to pick a probability distribution over some value of interest. That is, we want to perform probabilistic prediction or probabilistic forecasting, rather than predicting a specific value. More precisely, we assume the true "state of nature" is a distribution, h = p(Y|x), the action is another distribution, a = q(Y|x), and we want to pick q to minimize \mathbb{E}[l(p, q)] for a given x.

KL, cross-entropy and log-loss

A common loss function for comparing two distributions is the Kullback-Leibler divergence or KL divergence, which is defined as follows:

\mathbb{KL}(p || q) \triangleq \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{q(y)}

We can expand the KL as follows:

\begin{aligned} \mathbb{KL}(p || q) & = \sum_{y \in \mathcal{Y}} p(y) \log p(y) - \sum_{y \in \mathcal{Y}} p(y) \log q(y) \\ \mathbb{H}(p) & \triangleq - \sum_y p(y) \log p(y) \\ \mathbb{H}(p, q) & \triangleq - \sum_y p(y) \log q(y) \end{aligned}

\mathbb{H}(p) is known as the entropy. This is a measure of the uncertainty or variance of p; it is maximal if p is uniform, and is 0 if p is a degenerate or deterministic delta function. The quantity \mathbb{H}(p, q) is known as the cross-entropy, and satisfies \mathbb{KL}(p || q) = \mathbb{H}(p, q) - \mathbb{H}(p).

Now consider a special case in which the true state of nature is a degenerate distribution, which puts all its mass on a single outcome, say c, i.e., h = p(Y|x) = \mathbb{I}(Y = c). In this case, the cross-entropy becomes

\mathbb{H}(\delta(Y = c), q) = - \sum_{y \in \mathcal{Y}} \delta(y = c) \log q(y) = -\log q(c)

This is known as the log loss of the predictive distribution q given the target label c.
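
The following numpy sketch computes the entropy, cross-entropy, and KL divergence for discrete distributions, and checks that the cross-entropy against a one-hot (degenerate) distribution reduces to the log loss -\log q(c). The example distributions are made up.

```python
import numpy as np

def entropy(p):
    """H(p) = -sum_y p(y) log p(y), with the convention 0 log 0 = 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """H(p, q) = -sum_y p(y) log q(y), summing only over the support of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def kl_divergence(p, q):
    """KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.4, 0.1])
print(entropy(p), cross_entropy(p, q), kl_divergence(p, q))

one_hot = np.array([1.0, 0.0, 0.0])              # all mass on class c = 0
print(cross_entropy(one_hot, q), -np.log(q[0]))  # both equal the log loss
```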

Brier score

Another commonly used metric for probabilistic prediction is the Brier score, which is defined as follows:

l(p, q) \triangleq \frac{1}{C} \sum_{c=1}^{C} (q(y = c | x) - p(y = c | x))^2

This is just the squared error of the predictive distribution compared to the true distribution, when viewed as vectors. Since it is based on squared error, the Brier score is less sensitive to extremely rare or extremely common classes.
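
A minimal sketch of the Brier score, treating the true and predicted distributions as vectors (the example distributions are made up):

```python
import numpy as np

def brier_score(p, q):
    """Mean squared difference between the predicted and true distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.mean((q - p)**2)

p = np.array([1.0, 0.0, 0.0])            # true label is class 1 (one-hot)
print(brier_score(p, [0.7, 0.2, 0.1]))   # confident and correct: low score
print(brier_score(p, [1e-4, 0.5, 0.5]))  # badly wrong, but still bounded
```

Note that, unlike the log loss, the score stays bounded even when the probability assigned to the true class is close to zero.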

A/B testing

Suppose you are trying to decide which version of a product is likely to sell more, or which version of a drug is likely to work better. Let us call the two versions you are choosing between A and B; sometimes version A is called the treatment and version B the control. To simplify the notation, we will refer to picking version A as choosing action 1, and picking version B as choosing action 0. We will call the resulting outcome from each action that you want to maximize the reward.

A very common approach to such problems is to use an A/B test, in which you try both actions out for a while, by randomly assigning a different action to different subsets of the population, and then you measure the results and pick the winner. This is sometimes called a "test and roll" problem, since you test which method is best, and then roll it out for the rest of the population.

A key problem in A/B testing is to come up with a decision rule, or policy, for deciding which action is best, after obtaining potentially noisy results during the test phase. Another problem is to choose how many people to assign to the treatment, n_1, and how many to the control, n_0.

A Bayesian approach

We assume the i'th reward for action j is given by Y_{ij} \sim \mathcal{N}(\mu_j, \sigma_j^2) for i = 1:n_j and j = 0:1, where j = 1 corresponds to the treatment (version A) and j = 0 corresponds to the control (version B). For simplicity, we assume the \sigma_j^2 are known. For the unknown \mu_j parameters (expected reward), we will use Gaussian priors, \mu_j \sim \mathcal{N}(m_j, \tau_j^2). If we assign n_1 people to the treatment and n_0 to the control, then the expected reward in the testing phase is given by

\mathbb{E}[R_{test}] = n_0 m_0 + n_1 m_1

The expected reward in the roll phase depends on the decision rule \pi(\mathbf{y}_1, \mathbf{y}_0), which specifies which action to deploy, where \mathbf{y}_j = (y_{1j}, \dots, y_{n_j, j}) is the data from action j. We derive the optimal policy below, and then discuss some of its properties.

Optimal policy

The optimal policy is to choose the action with the greater posterior mean reward. Applying Bayes' rule for Gaussians, we find that the corresponding posterior is given by

p(\mu_j | \mathbf{y}_j, n_j) = \mathcal{N}(\mu_j | \widehat{m}_j, \widehat{\tau}_j^2)

The optimal policy is given by

\pi^*(\mathbf{y}_1, \mathbf{y}_0) = \begin{cases} 1 & \text{if } \mathbb{E}[\mu_1 | \mathbf{y}_1] \geq \mathbb{E}[\mu_0 | \mathbf{y}_0] \\ 0 & \text{if } \mathbb{E}[\mu_1 | \mathbf{y}_1] < \mathbb{E}[\mu_0 | \mathbf{y}_0] \end{cases}

Expected reward

If the total population size is N, and we cannot reuse people from the testing phase, the expected reward from the roll stage is

\mathbb{E}[R_{roll}] = (N - n_1 - n_0) \left( m_1 + e \Phi\left(\frac{e}{v}\right) + v \phi\left(\frac{e}{v}\right) \right)

where \phi is the Gaussian pdf, \Phi is the Gaussian cdf, e = m_0 - m_1, and

v = \sqrt{ \frac{\tau_1^4}{\tau_1^2 + \sigma_1^2 / n_1} + \frac{\tau_0^4}{ \tau_0^2 + \sigma_0^2 / n_0 } }
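
The sketch below puts these pieces together: it uses the standard conjugate Gaussian update (with known \sigma_j^2) to obtain the posterior means, applies the decision rule \pi^*, and evaluates the expected roll reward formula above. The priors, variances, and simulated rewards are all made-up illustrative values.

```python
import numpy as np
from scipy.stats import norm

def posterior_mean(y, m, tau2, sigma2):
    """Conjugate Gaussian update for an unknown mean with known variance."""
    n = len(y)
    tau2_hat = 1.0 / (1.0 / tau2 + n / sigma2)
    m_hat = tau2_hat * (m / tau2 + np.sum(y) / sigma2)
    return m_hat, tau2_hat

def expected_roll_reward(N, n0, n1, m0, m1, tau0, tau1, sigma0, sigma1):
    """E[R_roll] = (N - n1 - n0) * (m1 + e * Phi(e/v) + v * phi(e/v))."""
    e = m0 - m1
    v = np.sqrt(tau1**4 / (tau1**2 + sigma1**2 / n1)
                + tau0**4 / (tau0**2 + sigma0**2 / n0))
    return (N - n1 - n0) * (m1 + e * norm.cdf(e / v) + v * norm.pdf(e / v))

rng = np.random.default_rng(0)
y1 = rng.normal(1.2, 1.0, size=50)   # simulated treatment rewards
y0 = rng.normal(1.0, 1.0, size=50)   # simulated control rewards

m1_hat, _ = posterior_mean(y1, m=1.0, tau2=1.0, sigma2=1.0)
m0_hat, _ = posterior_mean(y0, m=1.0, tau2=1.0, sigma2=1.0)
action = 1 if m1_hat >= m0_hat else 0    # optimal policy pi*(y1, y0)
print(action, m1_hat, m0_hat)

print(expected_roll_reward(N=10_000, n0=50, n1=50, m0=1.0, m1=1.0,
                           tau0=1.0, tau1=1.0, sigma0=1.0, sigma1=1.0))
```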

Bandit problems

In A/B testing, the decision maker tries two different actions a fixed number of times and then picks the best action. We can obviously generalize this beyond two actions. More importantly, we can generalize this beyond a one-stage decision problem. In particular, suppose we allow the decision maker to try an action a_t, observe the reward r_t, and then decide what to do at time step t + 1, rather than waiting until all n_1 + n_0 experiments are finished. This immediate feedback allows for adaptive policies that can result in much higher expected reward (lower regret). We have converted a one-stage decision problem into a sequential decision problem. There are many kinds of sequential decision problems, but here we consider the simplest kind, known as a bandit problem.

In a bandit problem, there is an agent (decision maker) which gets to see some state of nature, uses it to choose an action according to some policy, and then receives a reward sampled from the environment. In the finite horizon formulation, the goal is to maximize the expected cumulative reward.

Contextual bandits

In the basic bandit problem, the state of nature s_t is fixed, meaning that the world does not change. However, the agent's internal model of the world does change, as it learns about which actions have the highest reward. If we allow the state of the environment s_t to vary randomly over time, the model is known as a contextual bandit, which is a more flexible model.

Markov decision processes

We can consider a generalization of the contextual bandit model, in which the next state s_{t+1} depends on s_t and the agent's action a_t; this is called a Markov decision process or MDP. This defines a controlled Markov chain.

p(s_{0:T} | a_{0:T-1}) = p(s_0) \prod_{t=1}^{T} p(s_t | s_{t-1}, a_{t-1})

where p(s_t | s_{t-1}, a_{t-1}) is known as the state transition model.
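
As a small illustration of this factorization, the sketch below samples a trajectory from a toy controlled Markov chain with a tabular state transition model and a uniformly random policy; the transition probabilities are made up.

```python
import numpy as np

# Toy MDP with 2 states and 2 actions: trans[a, s, s'] = p(s' | s, a).
trans = np.array([
    [[0.9, 0.1],    # action 0
     [0.2, 0.8]],
    [[0.5, 0.5],    # action 1
     [0.6, 0.4]],
])
p_s0 = np.array([1.0, 0.0])   # initial state distribution p(s_0)

rng = np.random.default_rng(0)
T = 5
s = rng.choice(2, p=p_s0)
states, actions = [int(s)], []
for t in range(T):
    a = rng.choice(2)                  # uniformly random policy (illustration only)
    s = rng.choice(2, p=trans[a, s])   # s_{t+1} ~ p(. | s_t, a_t)
    actions.append(int(a))
    states.append(int(s))

print(states)   # sampled s_0, ..., s_T
print(actions)  # sampled a_0, ..., a_{T-1}
```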

Exploration-exploitation tradeoff

The fundamental difficulty in solving bandit problems is known as the exploration-exploitation tradeoff. This refers to the fact that the agent needs to try new state/action combinations (this is known as exploration) in order to learn the reward function R(s, a), before it can exploit its knowledge by picking the predicted best action for each state.

Bayesian hypothesis testing

Suppose we have two hypotheses or models, commonly called the null hypothesis, M_0, and the alternative hypothesis, M_1, and we want to know which one is more likely to be true. This is called hypothesis testing.

If we use 0-1 loss, the optimal decision is to pick the alternative hypothesis iff p(M_1 | \mathcal{D}) / p(M_0 | \mathcal{D}) > 1. If we use a uniform prior, the decision rule becomes: select M_1 iff p(\mathcal{D} | M_1) / p(\mathcal{D} | M_0) > 1. This quantity, which is the ratio of marginal likelihoods of the two models, is known as the Bayes factor:

B_{1,0} \triangleq \frac{ p(\mathcal{D} | M_1) }{ p(\mathcal{D} | M_0) }

This is the likelihood ratio, except we integrate out the parameters, which allows us to compare models of different complexity, due to the Bayesian Occam's razor effect.
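
As a simple worked illustration (not taken from the text above), the sketch below computes a Bayes factor for coin-toss data, comparing M_0, a fair coin with \theta = 0.5, against M_1, a coin with unknown bias and a uniform Beta(1, 1) prior that gets integrated out in the marginal likelihood.

```python
import numpy as np
from scipy.special import betaln

def log_marglik_m1(n_heads, n_tails, a=1.0, b=1.0):
    """log p(D | M1): Bernoulli likelihood with theta integrated out
    under a Beta(a, b) prior, giving B(a + N1, b + N0) / B(a, b)."""
    return betaln(a + n_heads, b + n_tails) - betaln(a, b)

def log_marglik_m0(n_heads, n_tails):
    """log p(D | M0): theta fixed at 0.5 (no free parameters)."""
    return (n_heads + n_tails) * np.log(0.5)

n_heads, n_tails = 9, 1   # illustrative data: 9 heads out of 10 tosses
log_bf = log_marglik_m1(n_heads, n_tails) - log_marglik_m0(n_heads, n_tails)
print(np.exp(log_bf))     # B_{1,0} > 1 favors the biased-coin model M1
```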