Bayesian inference provides the optimal way to update our beliefs about hidden quantities $H$ given observed data $Y = y$ by computing the posterior $p(H \mid y)$. However, at the end of the day, we need to turn our beliefs into actions that we can perform in the world. How can we decide which action is best? This is where Bayesian decision theory comes in.
In decision theory, we assume the decision maker, or agent, has a set of possible actions, $\mathcal{A}$, to choose from. Each of these actions has costs and benefits, which depend on the underlying state of nature $h \in \mathcal{H}$. We can encode this information into a loss function $\ell(h, a)$, which specifies the loss we incur if we take action $a \in \mathcal{A}$ when the state of nature is $h \in \mathcal{H}$.
For example, suppose the state is defined by the age of the patient (young vs old), and whether they have COVID-19 or not. Note that the age can be observed directly, but the disease state must be inferred from noisy observations; thus the state is partially observed.
Let us assume that the cost of administering the drug is the same, no matter what the state of the patient is. However, the benefits will differ. If the patient is young, we expect them to live a long time, so the cost of not giving the drug if they have COVID-19 is high; but if the patient is old, they have fewer years to live, so the cost of not giving the drug is arguably less. In medical circles, a common unit of cost is quality-adjusted life years or QALY. Suppose that the expected QALY for a young person is 60, and for an old person is 10. Let us assume the drug costs the equivalent of 8 QALY, due to induced pain and suffering from side effects.
Once we have specified the loss function, we can compute the posterior expected loss or risk for each possible action:
$$R(a \mid y) \triangleq \mathbb{E}_{p(h \mid y)}\left[\ell(h, a)\right] = \sum_{h \in \mathcal{H}} \ell(h, a)\, p(h \mid y)$$
The optimal policy (also called the Bayes estimator) specifies what action to take for each possible observation so as to minimize the risk:
$$\pi^*(y) = \operatorname*{argmin}_{a \in \mathcal{A}} \mathbb{E}_{p(h \mid y)}\left[\ell(h, a)\right]$$
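To make this concrete, here is a minimal sketch in Python (with NumPy) that computes the risk of each action and picks the Bayes-optimal one. The loss matrix is one illustrative encoding of the QALY example above; the particular states, actions, and posterior values are assumptions made for the sake of the example.

```python
import numpy as np

# Illustrative loss matrix L[h, a] for the COVID example (units: QALY lost).
# States h: (young, covid), (young, no covid), (old, covid), (old, no covid).
# Actions a: 0 = do not give drug, 1 = give drug (costs 8 QALY in side effects).
L = np.array([
    [60, 8],   # young with covid: not treating loses 60 QALY, treating costs 8
    [0,  8],   # young without covid
    [10, 8],   # old with covid
    [0,  8],   # old without covid
])

def bayes_action(posterior, loss):
    """Return the action minimizing posterior expected loss (the Bayes estimator)."""
    risk = posterior @ loss          # risk[a] = sum_h p(h|y) * loss[h, a]
    return int(np.argmin(risk)), risk

# Example: patient known to be young, 30% posterior probability of COVID.
posterior = np.array([0.3, 0.7, 0.0, 0.0])
a_star, risk = bayes_action(posterior, L)
print(risk, a_star)   # risk = [18.0, 8.0] -> give the drug
```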
In this section, we use Bayesian decision theory to decide the optimal class label to predict given an observed input $x$.
Suppose the states of nature correspond to class labels, so $\mathcal{H} = \mathcal{Y} = \{1, \dots, C\}$. Furthermore, suppose the actions also correspond to class labels, so $\mathcal{A} = \mathcal{Y}$. In this setting, a very commonly used loss function is the zero-one loss, $\ell_{01}(y, \hat{y}) = \mathbb{I}(y \neq \hat{y})$.
Hence the action that minimizes the expected loss is to choose the most probable label:
$$\pi(x) = \operatorname*{argmax}_{y \in \mathcal{Y}} p(y \mid x)$$
This corresponds to the mode of the posterior distribution, also known as the maximum a posteriori or MAP estimate.
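As a quick sketch (with made-up posterior probabilities), the MAP decision is just an argmax over the posterior vector:

```python
import numpy as np

post = np.array([0.1, 0.6, 0.3])   # hypothetical p(y|x) over three classes
y_map = int(np.argmax(post))       # MAP estimate: class 1, the posterior mode
```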
In some cases, we may be able to say "I don't know" instead of returning an answer that we don't really trust; this is particularly important in risk-averse domains. This is known as the reject option. To formalize it, suppose the set of actions is $\mathcal{A} = \mathcal{Y} \cup \{\text{reject}\}$, and consider the following loss function:
$$\ell(y, a) = \begin{cases} 0 & \text{if } a = y \text{ and } a \in \mathcal{Y} \\ \lambda_r & \text{if } a = \text{reject} \\ \lambda_e & \text{otherwise} \end{cases}$$
where $\lambda_r$ is the cost of the reject action, and $\lambda_e$ is the cost of a classification error.
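Here is a minimal sketch of the resulting decision rule, assuming the loss function above with hypothetical costs; it simply compares the posterior expected loss of the best class label against the cost of rejecting.

```python
import numpy as np

def decide_with_reject(post, cost_reject, cost_error):
    """Pick a class label or reject, by minimizing posterior expected loss
    (0 for a correct label, cost_error for a wrong label, cost_reject for
    abstaining)."""
    risks = cost_error * (1.0 - post)   # expected loss of predicting each class
    best = int(np.argmin(risks))
    if risks[best] <= cost_reject:
        return best                     # confident enough to answer
    return "reject"                     # abstain

print(decide_with_reject(np.array([0.55, 0.45]), cost_reject=0.1, cost_error=1.0))  # reject
print(decide_with_reject(np.array([0.95, 0.05]), cost_reject=0.1, cost_error=1.0))  # 0
```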
We showed that we can pick the optimal label in binary classification by thresholding the probability using a value derived from the relative cost of a false positive and a false negative. Instead of picking a single threshold, we can instead consider using a set of different thresholds, and comparing the resulting performance.
For any fixed threshold $\tau$, we can compute the empirical number of false positives (FP), false negatives (FN), true positives (TP), and true negatives (TN). We can store these results in a class confusion matrix $C$, where $C_{ij}$ is the number of times an item with true class label $i$ was (mis)classified as having label $j$. From these counts we can compute the true positive rate, TPR = TP/(TP + FN), and the false positive rate, FPR = FP/(FP + TN). We can now plot TPR vs FPR as an implicit function of $\tau$. This is called a receiver operating characteristic or ROC curve.
The quality of the ROC curve is often summarized as a single number using the area under the curve or AUC score; higher AUC scores are better. Another summary statistic that is used is the equal error rate or EER, also called the cross-over rate, defined as the value which satisfies FPR = FNR. Since FNR = 1 - TPR, we can compute the EER by drawing a line from the top left to the bottom right and seeing where it intersects the ROC curve; lower EER scores are better.
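The sketch below (plain NumPy, with hypothetical scores and labels) sweeps a grid of thresholds, computes the confusion-matrix counts at each one, and approximates the AUC with the trapezoid rule; a real implementation would typically call a library routine instead.

```python
import numpy as np

def roc_points(y_true, scores, thresholds):
    """Compute (FPR, TPR) for each threshold by sweeping the confusion matrix."""
    pts = []
    for tau in thresholds:
        y_hat = (scores >= tau).astype(int)
        tp = np.sum((y_hat == 1) & (y_true == 1))
        fp = np.sum((y_hat == 1) & (y_true == 0))
        fn = np.sum((y_hat == 0) & (y_true == 1))
        tn = np.sum((y_hat == 0) & (y_true == 0))
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(pts)

# Hypothetical scores p(y=1|x) and true labels.
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1,   1,   0,   1,    0,   1,   0,   0  ])

pts = roc_points(y_true, scores, thresholds=np.linspace(1, 0, 101))
fpr, tpr = pts[:, 0], pts[:, 1]
auc = np.trapz(tpr, fpr)   # area under the ROC curve (higher is better)
```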
In some problems, there is severe class imbalance. For example, the set of negatives is usually much larger than the set of positives. The usefulness of the ROC curve may be reduced in such cases, since a large change in the absolute number of false positives will not change the false positive rate very much, because FPR = FP/(FP + TN) is dominated by the large number of true negatives. Thus all the action happens in the extreme left part of the curve. In such cases, we may choose other ways of summarizing the class confusion matrix, such as precision-recall curves.
In some problems, the notion of a "negative" is not well-defined. In these kinds of situations, we may choose to use a precision-recall curve to summarize the performance of the system.
The key idea is to replace FPR with precision. Recall is the same as TPR. If $\hat{y}_n$ is the predicted label and $y_n$ is the true label for example $n$, we can estimate precision and recall using
$$\mathcal{P}(\tau) = \frac{\sum_n \mathbb{I}(\hat{y}_n = 1, y_n = 1)}{\sum_n \mathbb{I}(\hat{y}_n = 1)}, \qquad \mathcal{R}(\tau) = \frac{\sum_n \mathbb{I}(\hat{y}_n = 1, y_n = 1)}{\sum_n \mathbb{I}(y_n = 1)}$$
We can now plot precision vs recall as we vary the threshold $\tau$. Hugging the top right is the best we can do.
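A minimal sketch of estimating precision and recall at a given threshold, using the indicator-function definitions above (hypothetical data, binary labels in {0, 1}):

```python
import numpy as np

def precision_recall(y_true, scores, tau):
    """Precision = TP/(TP+FP); recall (= TPR) = TP/(TP+FN) at threshold tau."""
    y_hat = (scores >= tau).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    precision = tp / max(np.sum(y_hat == 1), 1)
    recall = tp / max(np.sum(y_true == 1), 1)
    return precision, recall

scores = np.array([0.9, 0.8, 0.7, 0.55, 0.4, 0.3, 0.2, 0.1])
y_true = np.array([1,   1,   0,   1,    0,   1,   0,   0  ])

# One (precision, recall) point per threshold traces out the PR curve.
curve = [precision_recall(y_true, scores, tau) for tau in np.linspace(0.95, 0.05, 19)]
```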
For a fixed threshold, corresponding to a single point on the PR curve, we can compute a single precision and recall value, which we will denote by $\mathcal{P}$ and $\mathcal{R}$. These are often combined into a single statistic called the $F_\beta$ score, which weights recall as $\beta$ times more important than precision:
$$\frac{1}{F_\beta} = \frac{1}{1+\beta^2}\,\frac{1}{\mathcal{P}} + \frac{\beta^2}{1+\beta^2}\,\frac{1}{\mathcal{R}}$$
or equivalently
$$F_\beta = (1+\beta^2)\,\frac{\mathcal{P}\cdot\mathcal{R}}{\beta^2\,\mathcal{P} + \mathcal{R}}$$
If we set $\beta = 1$, we get the harmonic mean of precision and recall:
$$F_1 = \frac{2}{1/\mathcal{P} + 1/\mathcal{R}} = \frac{2\,\mathcal{P}\cdot\mathcal{R}}{\mathcal{P} + \mathcal{R}}$$
Using the $F_1$ score weights precision and recall equally. However, if recall is more important, we may use $\beta = 2$, and if precision is more important, we may use $\beta = 0.5$.
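A small helper implementing the formula above; the example precision and recall values are made up.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

f1 = f_beta(0.8, 0.6)           # harmonic mean of precision and recall ~= 0.686
f2 = f_beta(0.8, 0.6, beta=2)   # weights recall more heavily ~= 0.632
```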
So far, we have considered the case where there are a finite number of actions and states of nature. In this section, we consider the case where the set of actions and states are both equal to the real line, $\mathcal{A} = \mathcal{H} = \mathbb{R}$.
The most common choice is the $\ell_2$ loss, also called squared error or quadratic loss, which is defined as follows:
$$\ell_2(h, a) = (h - a)^2$$
The $\ell_2$ loss penalizes deviations from the truth quadratically, and thus is sensitive to outliers. A more robust alternative is the absolute or $\ell_1$ loss,
$$\ell_1(h, a) = |h - a|$$
Another robust loss function is the Huber loss, defined as follows:
$$\ell_\delta(h, a) = \begin{cases} \tfrac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta |r| - \tfrac{1}{2}\delta^2 & \text{otherwise} \end{cases}$$
where $r = h - a$. This is equivalent to the $\ell_2$ loss for errors that are smaller than $\delta$, and is equivalent to the $\ell_1$ loss (up to a constant) for larger errors.
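The sketch below implements the three losses as functions of the residual $r = h - a$; the value of $\delta$ is arbitrary and chosen just for illustration.

```python
import numpy as np

def l2_loss(r):
    return r ** 2

def l1_loss(r):
    return np.abs(r)

def huber_loss(r, delta=1.5):
    """Quadratic for |r| <= delta, linear (up to a constant) beyond."""
    quad = 0.5 * r ** 2
    lin = delta * np.abs(r) - 0.5 * delta ** 2
    return np.where(np.abs(r) <= delta, quad, lin)

r = np.linspace(-4, 4, 9)
print(huber_loss(r))   # grows quadratically near 0, linearly in the tails
```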
In this section, we assume the set of possible actions is to pick a probability distribution over some value of interest. That is, we want to perform probabilistic prediction or probabilistic forecasting rather than predicting a specific value. More precisely, we assume the true "state of nature" is a distribution, $p$, the action is another distribution, $q$, and we want to pick $q$ to minimize $\ell(p, q)$ for a given $p$.
A common form of loss function for comparing two distributions is the Kullback Leibler divergence or KL divergence, which is defined as follows:
$$D_{\mathrm{KL}}(p \,\|\, q) \triangleq \sum_{y \in \mathcal{Y}} p(y) \log \frac{p(y)}{q(y)}$$
We can expand the KL as follows:
$$D_{\mathrm{KL}}(p \,\|\, q) = \sum_{y} p(y) \log p(y) - \sum_{y} p(y) \log q(y) = -\mathbb{H}(p) + \mathbb{H}_{ce}(p, q)$$
The first term, $\mathbb{H}(p) \triangleq -\sum_y p(y) \log p(y)$, is known as the entropy. This is a measure of the uncertainty or variance of $p$; it is maximal if $p$ is uniform, and is 0 if $p$ is a degenerate or deterministic delta function. The second term, $\mathbb{H}_{ce}(p, q) \triangleq -\sum_y p(y) \log q(y)$, is known as the cross-entropy.
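A minimal sketch of these three quantities for discrete distributions (assuming $q > 0$ wherever $p > 0$, and using natural logarithms); the example distributions are made up.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(np.where(p > 0, p * np.log(p), 0.0))

def cross_entropy(p, q):
    return -np.sum(np.asarray(p) * np.log(np.asarray(q)))

def kl_divergence(p, q):
    """KL(p || q) = cross_entropy(p, q) - entropy(p) >= 0."""
    return cross_entropy(p, q) - entropy(p)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(p, q))   # small positive number; 0 iff p == q
```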
Now consider a special case in which the true state of nature is a degenerate distribution, which puts all its mass on a single outcome, say $c$, i.e., $p = \mathbb{I}(Y = c)$. In this case the entropy term is zero, and the KL reduces to the cross-entropy:
$$D_{\mathrm{KL}}(p \,\|\, q) = \mathbb{H}_{ce}(p, q) = -\log q(c)$$
This is known as the log loss of the predictive distribution $q$ when given target label $c$.
An alternative loss is the Brier score, defined (for a $C$-class problem) as follows:
$$\ell(p, q) \triangleq \frac{1}{C}\sum_{c=1}^{C}\big(q(y = c \mid x) - p(y = c \mid x)\big)^2$$
This is just the squared error of the predictive distribution compared to the true distribution, when viewed as vectors. Since it is based on squared error, the Brier score is less sensitive to extremely rare or extremely common classes than the log loss.
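The sketch below compares the two losses on a hypothetical two-class prediction; note how the log loss blows up for a confident mistake while the Brier score stays bounded.

```python
import numpy as np

def log_loss(q, c):
    """Negative log probability assigned to the true class c."""
    return -np.log(q[c])

def brier_score(q, c):
    """Mean squared difference between q and the one-hot encoding of c."""
    p = np.zeros(len(q))
    p[c] = 1.0
    return np.mean((np.asarray(q) - p) ** 2)

q = np.array([0.01, 0.99])    # predictive distribution; true class is 0
print(log_loss(q, 0))         # ~4.6: huge penalty for a confident mistake
print(brier_score(q, 0))      # ~0.98: penalty is bounded
```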
Suppose you are trying to decide which version of a product is likely to sell more, or which version of a drug is likely to work better. Let us call the versions you are choosing between A and B; sometimes version A is called the treatment, and version B is called the control. To simplify the notation, we will refer to picking version A as choosing action 1, and picking version B as choosing action 0. We will call the outcome resulting from each action, which we want to maximize, the reward.
A very common approach to such problems is to use an A/B test, in which you try both actions out for a while, by randomly assigning a different action to different subsets of the population, and then you measure the results and pick the winner. This is sometimes called a "test and roll" problem, since you test which method is best, and then roll it out for the rest of the population.
A key problem in A/B testing is to come up with a decision rule, or policy, for deciding which action is best, after obtaining potentially noisy results during the test phase. Another problem is to choose how many people to assign to the treatment, $n_1$, and how many to the control, $n_0$.
We assume the $i$'th reward for action $j$ is given by $y_{ij} \sim \mathcal{N}(\mu_j, \sigma_j^2)$ for $i = 1:n_j$ and $j = 0:1$, where $j = 1$ corresponds to the treatment (action A) and $j = 0$ corresponds to the control (action B). For simplicity we assume the $\sigma_j^2$ are known. For the unknown parameters $\mu_j$ (expected reward), we will use Gaussian priors, $\mu_j \sim \mathcal{N}(m_j, \tau_j^2)$. If we assign $n_1$ people to the treatment and $n_0$ to the control, then the expected reward in the testing phase is given by
$$\mathbb{E}\left[R_{\text{test}}\right] = n_1 m_1 + n_0 m_0$$
The expected reward in the roll phase depends on the decision rule $\pi(\mathcal{D}_1, \mathcal{D}_0)$, which specifies which action to deploy, where $\mathcal{D}_j$ is the data from action $j$. We derive the optimal policy below, and then discuss some of its properties.
The optimal policy is to choose the action with the greater expected posterior mean reward. Applying Bayes' rule for Gaussians, we find that the corresponding posterior is given by
$$p(\mu_j \mid \mathcal{D}_j) = \mathcal{N}(\mu_j \mid \widehat{m}_j, \widehat{\tau}_j^2), \qquad \widehat{\tau}_j^2 = \left(\frac{1}{\tau_j^2} + \frac{n_j}{\sigma_j^2}\right)^{-1}, \qquad \widehat{m}_j = \widehat{\tau}_j^2\left(\frac{m_j}{\tau_j^2} + \frac{n_j \bar{y}_j}{\sigma_j^2}\right)$$
where $\bar{y}_j$ is the sample mean of the rewards for action $j$. The optimal policy is given by
$$\pi^*(\mathcal{D}_1, \mathcal{D}_0) = \begin{cases} 1 & \text{if } \widehat{m}_1 \ge \widehat{m}_0 \\ 0 & \text{otherwise} \end{cases}$$
If the total population size is $N$, and we cannot reuse people from the testing phase, the expected reward from the roll stage is
where $\phi$ is the Gaussian pdf and $\Phi$ is the Gaussian cdf.
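The following Monte Carlo sketch (with made-up prior and noise parameters) simulates one run of the test-and-roll procedure under the model above: generate test-phase data, perform the conjugate Gaussian posterior update, deploy the arm with the higher posterior mean, and tally the resulting reward. It is an illustration of the setup, not the closed-form analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# All parameter values below are assumptions for illustration.
m = np.array([5.0, 5.0])        # prior means for [control (j=0), treatment (j=1)]
tau = np.array([1.0, 1.0])      # prior standard deviations
sigma = np.array([2.0, 2.0])    # known reward noise standard deviations
n = np.array([500, 500])        # test-phase sample sizes [n_0, n_1]
N = 100_000                     # total population size

mu_true = rng.normal(m, tau)                     # draw the true expected rewards
ybar = rng.normal(mu_true, sigma / np.sqrt(n))   # test-phase sample means

# Conjugate Gaussian posterior over each mu_j given its test data.
post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)
m_post = post_var * (m / tau**2 + n * ybar / sigma**2)

a_star = int(np.argmax(m_post))                  # deploy the arm that looks better
reward_test = float(np.sum(n * mu_true))         # expected reward during testing
reward_roll = (N - n.sum()) * mu_true[a_star]    # expected reward during roll-out
print(a_star, reward_test + reward_roll)
```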
In A/B testing, the decision maker tries two different actions a fixed number of times and then picks the best action. We can obviously generalize this beyond two actions. More importantly, we can generalize this beyond a one-stage decision problem. In particular, suppose we allow the decision maker to try an action $a_t$, observe the reward $r_t$, and then decide what to do at time step $t+1$, rather than waiting until all the experiments are finished. This immediate feedback allows for adaptive policies that can result in much higher expected reward (lower regret). We have converted a one-stage decision problem into a sequential decision problem. There are many kinds of sequential decision problems, but here we consider the simplest kind, known as a bandit problem.
In a bandit problem, there is an agent (decision maker) which gets to see some state of nature, which it uses to choose an action according to some policy, and then finally receives a reward sampled from the environment. In the finite horizon formulation, the goal is to maximize the expected cumulative reward.
In the basic bandit problem, the state of nature is fixed, meaning that the world does not change. However, the agent's internal model of the world does change, as it learns about which actions have the highest reward. If we allow the state of the environment to vary randomly over time, the model is known as a contextual bandit, which is a more flexible model.
We can consider a generalization of the contextual bandit model, in which the next state $s_{t+1}$ depends on the current state $s_t$ and the agent's action $a_t$; this is called a Markov decision process or MDP. This defines a controlled Markov chain,
$$p(s_{0:T} \mid a_{0:T-1}) = p(s_0) \prod_{t=1}^{T} p(s_t \mid s_{t-1}, a_{t-1})$$
where $p(s_t \mid s_{t-1}, a_{t-1})$ is known as the state transition model.
The fundamental difficulty in solving bandit problems is known as the exploration--exploitation tradeoff. This refers to the fact that the agent needs to try new state/action combinations (this is known as exploration) in order to learn the reward function before it can exploit its knowledge by picking the predicted best action for each state.
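As one concrete (if simplified) example of a bandit policy that balances exploration and exploitation, here is a sketch of Thompson sampling for a Bernoulli bandit with made-up reward probabilities: at each step we sample a plausible reward rate per arm from its Beta posterior and pull the arm whose sample is largest.

```python
import numpy as np

rng = np.random.default_rng(1)
true_probs = np.array([0.3, 0.5, 0.6])   # hypothetical Bernoulli reward probabilities
alpha = np.ones(3)                        # Beta(1, 1) prior per arm
beta = np.ones(3)

total_reward = 0
for t in range(5000):
    theta = rng.beta(alpha, beta)         # sample a plausible reward rate per arm
    a = int(np.argmax(theta))             # exploration and exploitation via sampling
    r = rng.random() < true_probs[a]      # pull the arm, observe a Bernoulli reward
    alpha[a] += r                         # conjugate posterior update
    beta[a] += 1 - r
    total_reward += r

print(total_reward, alpha / (alpha + beta))   # posterior means concentrate on arm 2
```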
Suppose we have two hypotheses or models, commonly called the null hypothesis, $M_0$, and the alternative hypothesis, $M_1$, and we want to know which one is more likely to be true. This is called hypothesis testing.
If we use 0-1 loss, the optimal decision is to pick the alternative hypothesis iff $p(M_1 \mid \mathcal{D}) > p(M_0 \mid \mathcal{D})$. If we use a uniform prior, the decision rule becomes: select $M_1$ iff $p(\mathcal{D} \mid M_1)/p(\mathcal{D} \mid M_0) > 1$. This quantity, which is the ratio of marginal likelihoods of the two models, is known as the Bayes factor:
$$B_{1,0} \triangleq \frac{p(\mathcal{D} \mid M_1)}{p(\mathcal{D} \mid M_0)}$$
This is similar to a likelihood ratio, except we integrate out the parameters, which allows us to compare models of different complexity, due to the Bayesian Occam's razor effect.
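As an illustration, here is a sketch of computing the Bayes factor for a hypothetical coin-tossing example, where $M_0$ fixes the heads probability at $1/2$ and $M_1$ integrates it out under a Beta(1, 1) prior; the data values are made up.

```python
import numpy as np
from scipy.special import betaln, gammaln

# Hypothetical data: N coin tosses, k heads.
N, k = 100, 62
a0, b0 = 1.0, 1.0                 # Beta prior on theta under M1

log_nck = gammaln(N + 1) - gammaln(k + 1) - gammaln(N - k + 1)

# M0: theta fixed at 0.5, so p(D|M0) = C(N,k) * 0.5^N.
log_marg_M0 = log_nck + N * np.log(0.5)

# M1: theta ~ Beta(a0, b0) integrated out (Beta-Binomial marginal likelihood):
# p(D|M1) = C(N,k) * B(k + a0, N - k + b0) / B(a0, b0).
log_marg_M1 = log_nck + betaln(k + a0, N - k + b0) - betaln(a0, b0)

bayes_factor = np.exp(log_marg_M1 - log_marg_M0)
print(bayes_factor)   # > 1 favors M1 (biased coin), < 1 favors M0 (fair coin)
```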