Week 6 - Bayesian learning

Bayes rule

To start this lecture let's remind ourselves of the definition of conditional probability.

Conditional probability

For two events $A$ and $B$ the conditional probability of $A$ happening given $B$ has happened is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$


As a simple corollary we get Bayes rule.

Bayes rule

Statement

For two events $A$ and $B$ we have the following equality on their conditional probabilities:
$$P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}.$$

Proof

This follows from the definition of conditional probability: $P(A \mid B) \, P(B) = P(A \cap B) = P(B \mid A) \, P(A)$, so dividing both sides by $P(B)$ gives the result.


Question

Suppose there is an illness affecting a small proportion of the population. We have a test with a high true positive rate (i.e. $P(+ \mid \text{ill})$ is close to $1$) and a high true negative rate (i.e. $P(- \mid \text{healthy})$ is close to $1$). Given you have a positive test result, are you more likely to have the illness or not?

Here we apply Bayes rule: $P(\text{ill} \mid +) = \frac{P(+ \mid \text{ill}) \, P(\text{ill})}{P(+)}$, whereas $P(\text{healthy} \mid +) = \frac{P(+ \mid \text{healthy}) \, P(\text{healthy})}{P(+)}$. Because the illness is rare, $P(\text{healthy})$ is so much larger than $P(\text{ill})$ that the second quantity wins, giving that we are more likely to not have the illness than to have it even with a positive result.
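The original figures for this question have not survived in these notes, so the numbers in the following sketch are assumed purely for illustration: a $1\%$ prevalence, a $95\%$ true positive rate, and a $95\%$ true negative rate.

```python
# Hypothetical numbers for illustration only; the lecture's original
# figures are not preserved in these notes.
p_ill = 0.01                 # P(ill): prevalence of the illness
p_pos_given_ill = 0.95       # P(+ | ill): true positive rate
p_neg_given_healthy = 0.95   # P(- | healthy): true negative rate

# P(+) by the law of total probability
p_pos = p_pos_given_ill * p_ill + (1 - p_neg_given_healthy) * (1 - p_ill)

# Bayes rule: P(ill | +) = P(+ | ill) P(ill) / P(+)
p_ill_given_pos = p_pos_given_ill * p_ill / p_pos
print(f"P(ill | +) = {p_ill_given_pos:.3f}")  # ~0.161, so more likely healthy
```

Even with a seemingly accurate test, the rarity of the illness dominates the posterior.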

Applying this to learning

Suppose $h$ is a hypothesis belonging to our hypothesis space $H$ and we have data $D$. Then to see how probable our hypothesis is given the data we can use Bayes rule to reduce it to things we can calculate:
$$P(h \mid D) = \frac{P(D \mid h) \, P(h)}{P(D)}.$$

  • Here $P(D \mid h)$ is the likelihood of seeing the data if $h$ were true, i.e. the accuracy of our prediction.
  • Then $P(h)$ is a reflection of prior knowledge about which hypotheses are likely or not.
  • Lastly $P(D)$ reflects our prior knowledge about the data we are sampling from.

When we are training our model on training data $T$, we are trying to find
$$\operatorname*{argmax}_{h \in H} P(h \mid T).$$

Though for each $h$ we have the same $P(T)$, as this does not depend on $h$. So we might as well remove it from our calculation:
$$\operatorname*{argmax}_{h \in H} P(h \mid T) = \operatorname*{argmax}_{h \in H} \frac{P(T \mid h) \, P(h)}{P(T)} = \operatorname*{argmax}_{h \in H} P(T \mid h) \, P(h).$$
Here we get the maximum a posteriori probability estimate.

Maximum a posteriori probability estimate (MAP)

Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. Furthermore, suppose we have a prior belief about the likelihood of each hypothesis, represented by a probability distribution $P(h)$ over $H$. The maximum a posteriori probability estimate is
$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(D \mid h) \, P(h).$$

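To make the definition concrete, here is a minimal sketch of computing $h_{MAP}$ over a small discrete hypothesis space; the candidate coin biases, the prior, and the flip data are all invented for illustration.

```python
import math

# Hypothetical setup: each hypothesis is a possible bias of a coin,
# and the data is a sequence of observed flips (1 = heads).
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]                       # H
prior = {0.1: 0.1, 0.3: 0.2, 0.5: 0.4, 0.7: 0.2, 0.9: 0.1}   # P(h), assumed
data = [1, 1, 0, 1, 1, 1, 0, 1]                              # D

def likelihood(h, data):
    """P(D | h): probability of the observed flips if the bias were h."""
    return math.prod(h if d == 1 else 1 - h for d in data)

# h_MAP = argmax_h P(D | h) P(h)
h_map = max(hypotheses, key=lambda h: likelihood(h, data) * prior[h])
print(f"h_MAP = {h_map}")
```

Note that only the product $P(D \mid h) \, P(h)$ is compared; $P(D)$ never needs to be computed.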

Sometimes we will have no prior preference on the hypothesis space in this case we might as well assume it is uniform and remove it from our calculations - here we get the maximum likelihood estimation.

Maximum likelihood estimation (MLE)

Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. The maximum likelihood estimation is
$$h_{MLE} = \operatorname*{argmax}_{h \in H} P(D \mid h).$$

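Continuing the invented coin sketch above, the MLE simply drops the prior, which is the same as MAP with a uniform prior.

```python
import math

# Same invented coin-flip setup as in the MAP sketch
hypotheses = [0.1, 0.3, 0.5, 0.7, 0.9]
data = [1, 1, 0, 1, 1, 1, 0, 1]

def likelihood(h, data):
    return math.prod(h if d == 1 else 1 - h for d in data)

# h_MLE = argmax_h P(D | h): the prior term is gone
h_mle = max(hypotheses, key=lambda h: likelihood(h, data))
print(f"h_MLE = {h_mle}")
```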

Though actually calculating these can be very hard, as the hypothesis space might be very large.

Noise free data

Suppose:

  • we have a finite hypothesis space $H$ with no prior preference between hypotheses,
  • the training data $D$ is noise free, and
  • the target concept is contained in $H$.

Now we use Bayes rule to calculate $P(h \mid D)$ for each $h \in H$.

  • As we have no prior preference on $H$ we have $P(h) = \frac{1}{|H|}$.
  • As we know the data is noise free we have $P(D \mid h) = 1$ if $h$ is consistent with $D$ and $P(D \mid h) = 0$ otherwise. The consistent hypotheses are exactly the version space for $H$ with $D$, $VS_{H,D}$, so $P(D \mid h) = 1$ if and only if $h \in VS_{H,D}$.
  • As the hypotheses are mutually exclusive events, by the law of total probability we have $$P(D) = \sum_{h \in H} P(D \mid h) \, P(h) = \frac{|VS_{H,D}|}{|H|}.$$ This gives $$P(h \mid D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \in VS_{H,D}, \\ 0 & \text{otherwise}, \end{cases}$$ as illustrated in the sketch below.
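Here is a minimal computational sketch of this posterior; the threshold hypotheses and the training data are invented for illustration.

```python
# Toy setup: hypotheses are threshold classifiers on the integers,
# h_t(x) = 1 if x >= t. Both H and D are invented for illustration.
thresholds = range(0, 11)            # H: one hypothesis per threshold t
data = [(2, 0), (5, 1), (8, 1)]      # D: noise-free (x, label) pairs

def consistent(t, data):
    """True when h_t labels every training example correctly."""
    return all((1 if x >= t else 0) == y for x, y in data)

version_space = [t for t in thresholds if consistent(t, data)]

# Noise-free posterior: uniform over the version space, zero elsewhere
posterior = {t: (1 / len(version_space) if t in version_space else 0.0)
             for t in thresholds}
print(posterior)  # thresholds 3, 4 and 5 each get probability 1/3
```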

Gaussian noise

In the previous setup we assumed there was no noise; this time we will introduce some.

Suppose:

  • we have training data $D = \{(x_1, d_1), \ldots, (x_m, d_m)\}$,
  • each label is generated as $d_i = f(x_i) + \epsilon_i$ for some target function $f$, and
  • the noise terms $\epsilon_i$ are drawn independently from a Gaussian $\mathcal{N}(0, \sigma^2)$.

Now let's try to compute the maximum likelihood estimation for our hypothesis space $H$:
$$h_{MLE} = \operatorname*{argmax}_{h \in H} P(D \mid h) = \operatorname*{argmax}_{h \in H} \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(d_i - h(x_i))^2}{2\sigma^2} \right).$$
Taking logarithms, which preserves the argmax, and dropping the terms that do not depend on $h$:
$$h_{MLE} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^{m} (d_i - h(x_i))^2.$$

So this shows that finding the maximum likelihood estimate under normally distributed noise is the same as minimising the mean squared error.

Note

If we switch our assumption about how the noise is distributed, then we find a different loss function will be appropriate. For example, if the noise were Laplace distributed, the same derivation would lead to minimising the mean absolute error instead.

This shows that the loss function we use really relates to the noise we have in our observations.
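To see this equivalence numerically, here is a minimal sketch; the observations, the noise scale, and the grid of constant-prediction hypotheses are all invented for illustration.

```python
import math

# Invented data: noisy observations of some unknown constant
observations = [2.1, 1.9, 2.4, 2.0, 1.8]
sigma = 0.5                                  # assumed noise scale
candidates = [c / 10 for c in range(0, 41)]  # H: constant predictions h(x) = c

def gaussian_log_likelihood(c):
    """log P(D | h_c) under d_i = c + Normal(0, sigma^2) noise."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (d - c) ** 2 / (2 * sigma**2) for d in observations)

def squared_error(c):
    return sum((d - c) ** 2 for d in observations)

best_by_likelihood = max(candidates, key=gaussian_log_likelihood)
best_by_mse = min(candidates, key=squared_error)
print(best_by_likelihood == best_by_mse)  # True: the two criteria agree
```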

Probability length

Length of a probability

For an event $A$ with $P(A) > 0$ we say the length of $A$ is
$$\operatorname{length}(A) = -\log_2(P(A)).$$
Note the lower the probability of the event, the longer it is; for example, an event with probability $\frac{1}{8}$ has length $3$.


Now assume we have a prior distribution on our hypothesis space such that $P(h)$ is higher when $h$ is a simpler explanation. For some training data $D$ let's look at the maximum a posteriori probability estimate:
$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(D \mid h) \, P(h) = \operatorname*{argmax}_{h \in H} \left( \log_2 P(D \mid h) + \log_2 P(h) \right) = \operatorname*{argmin}_{h \in H} \left( \operatorname{length}(D \mid h) + \operatorname{length}(h) \right),$$
where $\operatorname{length}(D \mid h) = -\log_2 P(D \mid h)$.

Here we have a trade-off: longer explanations in $H$ may lead to better explanations of $D$ by the hypothesis, though the improvement has to be worth the increase in the length of that hypothesis. This is Occam's razor in an equation.
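As a tiny invented example of this trade-off: suppose a simple hypothesis $h_1$ has $\operatorname{length}(h_1) = 2$ bits but explains the data poorly, with $\operatorname{length}(D \mid h_1) = 10$ bits, while a more complex $h_2$ has $\operatorname{length}(h_2) = 8$ bits and fits well, with $\operatorname{length}(D \mid h_2) = 3$ bits. Then $h_1$ costs $12$ bits in total and $h_2$ costs $11$, so MAP prefers the complex hypothesis here; if $h_2$ instead took $11$ bits to state, the simpler one would win.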

Bayesian classification

Bayes optimal classifier


Suppose we have a classification problem for some function $f: X \to V$, where $V$ is a finite set of labels. We have a hypothesis space $H$ and training data $D$, and we want to work out the best label for a new point $x$. We can use the maximum a posteriori probabilities $P(h \mid D)$ to calculate the Bayes optimal classifier:
$$v_{OB} = \operatorname*{argmax}_{v \in V} \sum_{h \in H} P(v \mid h, x) \, P(h \mid D).$$

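As a closing sketch, here is the Bayes optimal classifier computed over the invented threshold hypotheses from the noise-free example, with an assumed posterior that is uniform over the version space $\{3, 4, 5\}$.

```python
# Threshold hypotheses h_t(x) = 1 if x >= t, with an invented posterior
# P(h_t | D) that is uniform over the version space {3, 4, 5}.
posterior = {t: (1 / 3 if t in (3, 4, 5) else 0.0) for t in range(0, 11)}

def bayes_optimal_label(x, posterior):
    """argmax over labels v of sum_h P(v | h, x) P(h | D)."""
    def weighted_vote(v):
        # P(v | h_t, x) is 1 when h_t predicts v at x, and 0 otherwise
        return sum(p for t, p in posterior.items()
                   if (1 if x >= t else 0) == v)
    return max({0, 1}, key=weighted_vote)

print(bayes_optimal_label(4, posterior))  # -> 1, by a 2/3 posterior-weighted vote
```

No single hypothesis needs to be committed to: every hypothesis votes on the label, weighted by how plausible it is given the data.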