Week 6 - Bayesian learning
Bayes rule
To start this lecture let's remind ourselves of the definition of conditional probability.
Conditional probability
For two events $A$ and $B$, the conditional probability of $A$ happening given $B$ has happened is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
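As a quick worked example (not from the lecture), take a fair six-sided die, with $A$ the event that the roll is even and $B$ the event that the roll is at least $4$:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{2/6}{3/6} = \frac{2}{3}.$$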
As a simple corollary we get Bayes rule.
Bayes rule
Statement
Bayes Rule
For two events $A$ and $B$ we have the following equality on their conditional probabilities
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$
Proof
This follows from the definition of conditional probability, since $P(A \mid B) P(B) = P(A \cap B) = P(B \mid A) P(A)$.
Question
Suppose there is an illness that affects a proportion $p$ of the population. We have a test with a true positive rate of $t_+$ (i.e. $P(\text{positive} \mid \text{ill}) = t_+$) and a true negative rate of $t_-$ (i.e. $P(\text{negative} \mid \text{healthy}) = t_-$). Given you have a positive test result, are you more likely to have the illness or not?
Here we apply Bayes rule.
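As a concrete sketch in Python (the prevalence, true positive rate and true negative rate below are assumed illustrative numbers, not the lecture's actual figures):

```python
# Bayes rule for the illness test, with assumed illustrative numbers.
prior_ill = 0.01          # P(ill): assume 1% of the population is affected
true_positive = 0.95      # P(positive | ill): assumed true positive rate
true_negative = 0.90      # P(negative | healthy): assumed true negative rate

# P(positive) by the law of total probability.
p_positive = true_positive * prior_ill + (1 - true_negative) * (1 - prior_ill)

# P(ill | positive) by Bayes rule.
posterior_ill = true_positive * prior_ill / p_positive
print(f"P(ill | positive) = {posterior_ill:.3f}")   # ~0.088
```

With these numbers the low prior dominates, so even a positive result leaves you more likely to be healthy than ill.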
Applying this to learning
Suppose we have a hypothesis space $H$ and some data $D$. Bayes rule gives
$$P(h \mid D) = \frac{P(D \mid h) P(h)}{P(D)}.$$
- Here $P(D \mid h)$ is the accuracy of our prediction (the likelihood of the data under the hypothesis).
- Then $P(h)$ is a reflection of prior knowledge about which hypotheses are likely or not.
- Lastly $P(D)$ reflects our prior knowledge about the data we are sampling from.
When we are training our model on training data $D$ we want to find the hypothesis $h \in H$ that maximises $P(h \mid D)$. Though for each $h$ the term $P(D)$ is the same, so we can ignore it when comparing hypotheses.
Maximum a posteriori probability estimate (MAP)
Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. Furthermore, suppose we have a prior belief about the likelihood of each hypothesis, represented by a probability distribution $P(h)$ over $H$. The maximum a posteriori probability estimate is
$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(h \mid D) = \operatorname*{argmax}_{h \in H} P(D \mid h) P(h).$$
Sometimes we will have no prior preference on the hypothesis space. In this case we might as well assume the prior is uniform and remove it from our calculations - this gives the maximum likelihood estimation.
Maximum likelihood estimation (MLE)
Suppose we have a hypothesis space $H$ and we want to pick the best hypothesis given some data $D$. The maximum likelihood estimation is
$$h_{MLE} = \operatorname*{argmax}_{h \in H} P(D \mid h).$$
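A minimal sketch of both estimates over a small finite hypothesis space, assuming coin-bias hypotheses with a made-up prior and flip sequence (none of these values are from the lecture):

```python
import math

# Hypothesis space: possible biases of a coin, i.e. P(heads) under each hypothesis.
hypotheses = [0.2, 0.5, 0.8]
prior = {0.2: 0.1, 0.5: 0.8, 0.8: 0.1}   # assumed prior favouring a fair coin
data = [1, 1, 0, 1, 1]                    # made-up flips: 1 = heads, 0 = tails

def likelihood(h, data):
    """P(D | h) for i.i.d. Bernoulli flips with bias h."""
    return math.prod(h if x == 1 else 1 - h for x in data)

h_mle = max(hypotheses, key=lambda h: likelihood(h, data))
h_map = max(hypotheses, key=lambda h: likelihood(h, data) * prior[h])
print("MLE:", h_mle)   # 0.8 - the best raw fit to the data
print("MAP:", h_map)   # 0.5 - the strong prior pulls the estimate back
```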
Though actually calculating these can be very hard, as the hypothesis space might be very large.
Noise free data
Suppose:
- There is some target function $f$.
- We have some irreducible-error-free (noise free) training data $D = \{(x_i, y_i)\}_{i=1}^n$, so for each $i$ we have $y_i = f(x_i)$.
- We have a finite hypothesis space $H$ which contains the target, $f \in H$.
- We have no prior preference on the hypothesis space $H$.
- Each hypothesis is an independent event.
Now we use Bayes rule to calculate $P(h \mid D)$:
- As we have no prior preference on $H$ we have $P(h) = \frac{1}{|H|}$.
- As we know the data is noise free we have $P(D \mid h) = 1$ if $h$ agrees with $D$ and $P(D \mid h) = 0$ otherwise. However, this describes the version space $VS_{H,D}$ for $H$ with $D$, so $P(D \mid h) = 1$ if $h \in VS_{H,D}$ and $0$ otherwise.
- As each hypothesis is an independent event we have $P(D) = \sum_{h \in H} P(D \mid h) P(h) = \frac{|VS_{H,D}|}{|H|}$.
This gives
$$P(h \mid D) = \begin{cases} \frac{1}{|VS_{H,D}|} & \text{if } h \in VS_{H,D}, \\ 0 & \text{otherwise.} \end{cases}$$
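A small sketch of this posterior, assuming a toy hypothesis space given as explicit lookup tables and made-up noise-free data:

```python
# Toy noise-free setting: hypotheses over inputs {0, 1, 2}, given as lookup tables.
hypotheses = {
    "h1": {0: 0, 1: 0, 2: 0},
    "h2": {0: 0, 1: 1, 2: 0},
    "h3": {0: 0, 1: 1, 2: 1},
}
data = [(0, 0), (1, 1)]   # noise-free (x, f(x)) pairs, made up for illustration

# Version space: hypotheses consistent with every training example.
version_space = [name for name, h in hypotheses.items()
                 if all(h[x] == y for x, y in data)]

# Uniform prior: the posterior is uniform over the version space and 0 elsewhere.
posterior = {name: (1 / len(version_space) if name in version_space else 0.0)
             for name in hypotheses}
print(posterior)   # {'h1': 0.0, 'h2': 0.5, 'h3': 0.5}
```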
Gaussian noise
In the previous setup we assumed there was no noise; this time we will introduce some.
Suppose:
- There is some target function $f$.
- We have some i.i.d. normally distributed noise values $\epsilon_i \sim N(0, \sigma^2)$, one for each of our training data points.
- The training data $D = \{(x_i, y_i)\}_{i=1}^n$ is such that we have $y_i = f(x_i) + \epsilon_i$.
Now let's try to compute the maximum likelihood estimation for our hypothesis space $H$:
$$h_{MLE} = \operatorname*{argmax}_{h \in H} P(D \mid h) = \operatorname*{argmax}_{h \in H} \prod_{i=1}^n \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( - \frac{(y_i - h(x_i))^2}{2 \sigma^2} \right).$$
Taking logarithms (which does not change the argmax) and dropping constants gives
$$h_{MLE} = \operatorname*{argmax}_{h \in H} \sum_{i=1}^n - \frac{(y_i - h(x_i))^2}{2 \sigma^2} = \operatorname*{argmin}_{h \in H} \sum_{i=1}^n (y_i - h(x_i))^2.$$
So this shows that finding the maximum likelihood estimation for normally distributed noise is the same as minimising mean squared error.
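A quick numerical check of this equivalence, assuming a hypothesis space of constant predictors and some made-up observations:

```python
import math

ys = [2.1, 1.9, 2.4, 2.0, 1.6]              # made-up noisy observations of a constant target
candidates = [c / 10 for c in range(0, 41)]  # hypothesis space: constant predictions h(x) = c
sigma = 0.5                                  # assumed noise standard deviation

def log_likelihood(c):
    """log P(D | h) when each y_i ~ N(c, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma**2)
               - (y - c) ** 2 / (2 * sigma**2) for y in ys)

def mse(c):
    return sum((y - c) ** 2 for y in ys) / len(ys)

print(max(candidates, key=log_likelihood))   # 2.0
print(min(candidates, key=mse))              # 2.0 - the same hypothesis wins either way
```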
Note
If we switch our assumption about how the noise is distributed, then we find a different loss function will be appropriate. For example, Laplace-distributed noise leads to minimising the mean absolute error instead.
This shows that the loss function we use really relates to the noise we have in our observations.
Probability length
Length of a probability
For an event $A$, if $P(A) > 0$ then we say the length of $A$ is
$$\operatorname{length}(A) = - \log_2(P(A)).$$
Note the lower the probability of the event, the longer it is.
Now assume we have a prior distribution $P(h)$ on our hypothesis space $H$. Rewriting the MAP estimate in terms of lengths gives
$$h_{MAP} = \operatorname*{argmax}_{h \in H} P(D \mid h) P(h) = \operatorname*{argmin}_{h \in H} \left[ \operatorname{length}(D \mid h) + \operatorname{length}(h) \right].$$
Here we have a pay-off: longer length explanations in $H$ (less likely hypotheses) need to compensate by explaining the data $D$ better (giving it a shorter length).
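A minimal sketch of this trade-off, reusing the coin-bias hypotheses from the MAP/MLE example above (all values assumed for illustration): minimising the total length picks the same hypothesis as maximising the posterior.

```python
import math

def length(p):
    """Length of an event with probability p, in bits."""
    return -math.log2(p)

prior = {0.2: 0.1, 0.5: 0.8, 0.8: 0.1}   # assumed prior over coin biases
data = [1, 1, 0, 1, 1]                    # made-up flips: 1 = heads, 0 = tails

def likelihood(h):
    return math.prod(h if x == 1 else 1 - h for x in data)

map_by_prob = max(prior, key=lambda h: likelihood(h) * prior[h])
map_by_length = min(prior, key=lambda h: length(prior[h]) + length(likelihood(h)))
print(map_by_prob, map_by_length)   # both pick 0.5
```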
Bayesian classification
Bayes optimal classifier
Suppose we have a classification problem for some function $f \colon X \to Y$ with a finite set of labels $Y$. We have a hypothesis space $H$ and training data $D$. We want to work out what the best label is for a new point $x$. We can use the posterior probabilities $P(h \mid D)$, as in the maximum a posteriori calculation, to define the Bayes optimal classifier
$$y^* = \operatorname*{argmax}_{y \in Y} \sum_{h \in H} P(y \mid x, h) \, P(h \mid D).$$
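A small sketch of this classifier, assuming made-up posterior values $P(h \mid D)$ and deterministic hypotheses, so that $P(y \mid x, h)$ is $1$ for the label $h$ predicts at $x$ and $0$ otherwise:

```python
# Toy Bayes optimal classification at a query point x (all values illustrative).
posterior = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # assumed P(h | D)
prediction = {"h1": "+", "h2": "-", "h3": "-"}  # the label each hypothesis gives x

labels = set(prediction.values())
# P(y | x, D) = sum over hypotheses of P(y | x, h) * P(h | D).
label_probs = {y: sum(p for h, p in posterior.items() if prediction[h] == y)
               for y in labels}
print(label_probs)                              # {'+': 0.4, '-': 0.6}
print(max(label_probs, key=label_probs.get))    # '-'
```

Note that here the MAP hypothesis h1 would predict '+', yet weighting all hypotheses by their posterior gives '-': the Bayes optimal label need not agree with the single best hypothesis.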