Week 6 - Bayesian Inference
Joint distributions
We can use the definition of conditional probability to help calculate joint probabilities.
Therefore independence simplifies the amount of knowledge we need to store to compute the probability of a large state. It is therefore useful to generalise the definition of independence.
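For instance, rearranging the definition of conditional probability gives the joint probability, and independence lets us drop the conditioning entirely:
$$P(A \cap B) = P(A \mid B)\,P(B), \qquad \text{and if } A \text{ and } B \text{ are independent,} \qquad P(A \cap B) = P(A)\,P(B).$$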
Transclude of Conditional-independence
Why is it a generalisation?
Note if $A$ and $B$ are independent events then $P(A \cap B) = P(A)\,P(B)$, so we could say that $A$ is conditionally independent of $B$ given the whole sample space $\Omega$: ordinary independence is the special case of conditional independence with trivial conditioning.
Bayesian networks (belief networks)
For a set of random variables we can capture their dependence structure in a directed acyclic graph.
Bayesian network
Let $G = (V, E)$ be a directed acyclic graph and let $\{X_v\}_{v \in V}$ be a set of random variables. We say $(G, \{X_v\}_{v \in V})$ forms a Bayesian network if the probability density function is given by
$$P(X_{v} = x_{v} \text{ for all } v \in V) = \prod_{v \in V} P\left(X_v = x_v \mid X_u = x_u \text{ for all } u \in \mathrm{pa}(v)\right),$$
where $\mathrm{pa}(v)$ denotes the parents of $v$ in $G$.
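For instance, for three variables with edges $A \to B$, $A \to C$ and $B \to C$, the definition asks for
$$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A, B).$$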
Independent variables
Let $X$ and $Y$ be independent random variables. Then the graph on $\{X, Y\}$ with no edges forms a Bayesian network, as $P(X, Y) = P(X)\,P(Y)$. Note that it would also form a Bayesian network with the edge $X \to Y$ or $Y \to X$, but it would not be minimal.
Conditionally independent
Suppose we have random variables $X$, $Y$ and $Z$ where $X$ is conditionally independent of $Y$ given $Z$. Then the graph $X \leftarrow Z \rightarrow Y$ forms a Bayesian network, as $P(X, Y, Z) = P(Z)\,P(X \mid Z)\,P(Y \mid Z)$.
The property of being conditionally independent is deducible from the Bayesian network
Chain rule (probability)
For random variables $X_i$ for $1 \le i \le n$ we have
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}).$$
This follows from the definition of conditional probability.
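For example, with three variables this reads
$$P(X_1, X_2, X_3) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2).$$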
Let's order the variables $X_1, \ldots, X_n$ in a topological order of the graph (so every variable comes after its parents) and compare the chain rule
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$$
to the Bayesian network factorisation
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i)).$$
The two agree precisely when each $X_i$ is conditionally independent of its other predecessors given its parents. This is in fact a defining property of a Bayesian network.
Local Markov property
Let $G = (V, E)$ be a directed acyclic graph and $\{X_v\}_{v \in V}$ a set of random variables. We say $(G, \{X_v\}_{v \in V})$ satisfies the local Markov property if for all $v \in V$ and $u \in V$ such that $u \notin \mathrm{pa}(v)$ and there is no directed path from $v$ to $u$, the variable $X_v$ is conditionally independent of $X_u$ given its parents $\{X_w : w \in \mathrm{pa}(v)\}$.
Bayesian network if and only if it satisfies the local Markov Property
Statement
Lemma
Let $G = (V, E)$ be a directed acyclic graph and $\{X_v\}_{v \in V}$ a set of random variables. Then $(G, \{X_v\}_{v \in V})$ is a Bayesian network if and only if it satisfies the local Markov property.
Proof
Inference rules
Once you have a Bayesian network you can use it to calculate all kinds of conditional probabilities. There are 3 main rules that help you do this.
Marginalisation (probability)
Suppose we have two random variables $X$ and $Y$ over domains $D_X$ and $D_Y$ respectively. If we know their joint distribution then we can calculate either $X$'s or $Y$'s (marginal) distribution, i.e.
$$P(X = x) = \sum_{y \in D_Y} P(X = x, Y = y).$$
The name comes from the fact that if you were to draw a table of the joint distribution, summing the rows and columns would put both $X$'s and $Y$'s distributions in the margins.
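A quick sketch of marginalisation on a toy joint table (the numbers here are made up purely for illustration):

```python
# A toy joint distribution P(X, Y) stored as a dictionary (illustrative numbers only).
joint = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.30,
    ("x2", "y1"): 0.25, ("x2", "y2"): 0.35,
}

# Marginalise out Y to get P(X): sum the joint over every value of Y.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

print(p_x)  # {'x1': 0.4, 'x2': 0.6} (up to floating point rounding)
```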
Chain rule (probability)
For random variables $X_i$ for $1 \le i \le n$ we have
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}).$$
This follows from the definition of conditional probability.
Bayes Rule
Statement
For two events $A$ and $B$ with $P(B) > 0$ we have the following equality on their conditional probabilities
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
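This follows from writing the joint probability two ways using the definition of conditional probability:
$$P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A).$$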
Example
Suppose we have two boxes: the first box has 3 green balls and 1 orange, the second has 3 blue and 2 green. Then we do the following: we pick a box with a half chance of picking either, $B$; we draw and do not replace 1 ball from that box, $D_1$; lastly we draw another ball from the same box, $D_2$. This can be represented as a Bayesian network with the graph $B \to D_1$, $B \to D_2$, $D_1 \to D_2$. Note this means $D_1$ depends on $B$, and $D_2$ depends on $B$ and $D_1$. What is $P(D_1 = \text{blue} \mid D_2 = \text{green})$?
We can write out the conditional probability tables quite easily when they are conditioned on all of the variables they depend on. First, $P(D_1 \mid B)$:

| $B$ | green | orange | blue |
|---|---|---|---|
| 1 | 3/4 | 1/4 | 0 |
| 2 | 2/5 | 0 | 3/5 |
The second table gives $P(D_2 \mid B, D_1)$:

| $B$ | $D_1$ | green | orange | blue |
|---|---|---|---|---|
| 1 | green | 2/3 | 1/3 | 0 |
| 1 | orange | 1 | 0 | 0 |
| 2 | green | 1/4 | 0 | 3/4 |
| 2 | blue | 1/2 | 0 | 1/2 |
These tables perfectly set us up to calculate the probability of an exact state using the chain rule, so we first rewrite the probability we want in terms of exact states. Using the definition of conditional probability and marginalisation,
$$P(D_1 = \text{blue} \mid D_2 = \text{green}) = \frac{P(D_1 = \text{blue}, D_2 = \text{green})}{P(D_2 = \text{green})} = \frac{\sum_{b} P(B = b, D_1 = \text{blue}, D_2 = \text{green})}{\sum_{b, d} P(B = b, D_1 = d, D_2 = \text{green})}.$$
Lastly we use the chain rule to turn each term into the probabilities we have in the tables above. For example
$$P(B = 2, D_1 = \text{blue}, D_2 = \text{green}) = P(B = 2)\,P(D_1 = \text{blue} \mid B = 2)\,P(D_2 = \text{green} \mid B = 2, D_1 = \text{blue}) = \tfrac{1}{2} \cdot \tfrac{3}{5} \cdot \tfrac{1}{2} = \tfrac{3}{20}.$$
Doing this for each quantity we get
$$P(D_1 = \text{blue} \mid D_2 = \text{green}) = \frac{3/20}{1/4 + 1/8 + 0 + 1/20 + 0 + 3/20} = \frac{3/20}{23/40} = \frac{6}{23}.$$
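As a sanity check, here is a small sketch that brute-forces the same answer by enumerating the network (the tables are the ones above; the variable names are just for this sketch):

```python
# Conditional probability tables from the example above.
p_box = {1: 0.5, 2: 0.5}                                    # P(B)
p_d1 = {                                                    # P(D1 | B)
    1: {"green": 3/4, "orange": 1/4, "blue": 0},
    2: {"green": 2/5, "orange": 0, "blue": 3/5},
}
p_d2 = {                                                    # P(D2 | B, D1)
    (1, "green"):  {"green": 2/3, "orange": 1/3, "blue": 0},
    (1, "orange"): {"green": 1,   "orange": 0,   "blue": 0},
    (2, "green"):  {"green": 1/4, "orange": 0,   "blue": 3/4},
    (2, "blue"):   {"green": 1/2, "orange": 0,   "blue": 1/2},
}

def joint(b, d1, d2):
    """Chain rule: P(B, D1, D2) = P(B) * P(D1 | B) * P(D2 | B, D1)."""
    if p_d1[b][d1] == 0:          # impossible first draw, so the joint is 0
        return 0.0
    return p_box[b] * p_d1[b][d1] * p_d2[(b, d1)][d2]

colours = ["green", "orange", "blue"]
# Marginalise over the unobserved variables to get numerator and denominator.
num = sum(joint(b, "blue", "green") for b in p_box)
den = sum(joint(b, d1, "green") for b in p_box for d1 in colours)
print(num / den)  # 0.2608... = 6/23
```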
Naive Bayes Classifier
Let's picture a Bayesian network with one root node and all other nodes as children of it. For this it is useful to think of an example.
Spam mail
We are given 3 words, Viagra, Price and Udacity, and depending on the presence of these words in an email we have to decide whether it is spam or not. We can sample from already labelled data to estimate the probabilities of these words appearing or not appearing.
First let's draw the Naive Bayes Classifier diagram.
Transclude of Naice-Bayes
This is a model for whether the email is spam or not. Let's assume we know the probability an email is spam, $P(\text{spam})$, and that we have estimated the following conditional probabilities of each word appearing:

| | Viagra | Price | Udacity |
|---|---|---|---|
| $P(\text{word} \mid \text{spam})$ | 0.3 | 0.2 | 0.0001 |
| $P(\text{word} \mid \text{not spam})$ | 0.001 | 0.1 | 0.1 |
To work these out we only had to count, in our sample of spam and non-spam mail, how many emails contain each word.
Now we would like to know $P(\text{spam} \mid \text{words in the email})$, the probability an email is spam given which of these words it contains.
We want to do this without having to crack out the pen and paper every time.
Naive Bayes classifier
Suppose we have a classification problem and we are in the modelling framework. We are given features $X_1, \ldots, X_n$ where each feature $X_i$ takes finitely many values in a domain $D_i$ (we split the input random variable into $X = (X_1, \ldots, X_n)$). Let our classification $Y$ also have finite domain $D_Y$. We can use our training data to estimate
$$P(Y = y) \quad \text{and} \quad P(X_i = x_i \mid Y = y)$$
for each $1 \le i \le n$, each $x_i \in D_i$ and each $y \in D_Y$.
Assumption
We make the assumption that $Y$ and each of the $X_i$ form a Bayesian network where $Y$ is the root with one arrow to each of the $X_i$, with the probabilities above. Using Bayes rule, then marginalisation and the chain rule, we can calculate
$$P(Y = y \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{1}{Z}\, P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y)$$
with the normalisation term $Z = \sum_{y' \in D_Y} P(Y = y') \prod_{i=1}^{n} P(X_i = x_i \mid Y = y')$. Then to compute the maximum a posteriori class we do not need to compute $Z$, and we let
$$\hat{y} = \operatorname*{argmax}_{y \in D_Y}\; P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y).$$
This exactly forms our Naive Bayes classifier for this problem.
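A minimal sketch of this MAP rule applied to the spam example, assuming a placeholder prior $P(\text{spam}) = 0.4$ (the word probabilities are the ones from the table above):

```python
# Estimated probabilities P(word appears | class) from the spam example above.
p_word_given_class = {
    "spam":     {"viagra": 0.3,   "price": 0.2, "udacity": 0.0001},
    "not spam": {"viagra": 0.001, "price": 0.1, "udacity": 0.1},
}
p_class = {"spam": 0.4, "not spam": 0.6}  # placeholder prior, not from the notes

def map_class(present_words):
    """Return argmax_y P(y) * prod_i P(x_i | y) for the observed word indicators."""
    scores = {}
    for y, prior in p_class.items():
        score = prior
        for word, p in p_word_given_class[y].items():
            # P(word present | y) if the word occurs in the email, else P(word absent | y).
            score *= p if word in present_words else (1 - p)
        scores[y] = score
    return max(scores, key=scores.get), scores

print(map_class({"viagra"}))   # strongly favours "spam"
print(map_class({"udacity"}))  # strongly favours "not spam"
```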
Run time
It is an eager learner, so it requires you to sample the data initially. Though we only ever need to see each data point once, so it is fairly fast. Estimating the parameters is fairly simple. This model also has close to the minimum number of parameters you need in most cases.
Correctness
Whilst this does make a large assumption about the different attributes, empirically it is considered very successful. It is thought that Google use this a lot in Gmail for actual spam filtering!
Limitations
- It makes a strong assumption that the attributes are not related to one another, which seems ridiculous.
- It is very vulnerable to co-dependent variables, as their influence will grow exponentially the more variables are related (i.e. if 3 variables were the same we would cube that term).
- Though in practice it turns out not to matter that much!
- If any of the estimated probabilities $P(X_i = x_i \mid Y = y)$ is zero, it zeros out the entire product.
- We should account for this by always making these values non-zero. This is called “smoothing” of the probabilities.
- We need a very large amount of data to guarantee our probabilities are accurate.
This is cool as,
- Normally Inference is expensive to do, however in this case it makes it really cheap.
- The model has few parameters so has low space complexity.
- If you have sufficient data you can make very good estimates on these probabilities.
- This connects Inference and classification problems.
- Empirically widely successful.
Though the main consideration when using this is that you will need to smooth the estimated probabilities, to make sure we don't zero out options due to limitations of our data.
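A common way to do this is add-one (Laplace) smoothing: add a small pseudo-count before normalising so that no estimated probability is ever exactly zero. A minimal sketch, with a hypothetical count-based estimator:

```python
def smoothed_estimate(count_word_and_class, count_class, n_values=2, alpha=1.0):
    """Laplace-smoothed estimate of P(word value | class).

    count_word_and_class: how many training emails of this class contain the word
    count_class: how many training emails of this class we saw in total
    n_values: number of values the feature can take (2 for present/absent)
    alpha: pseudo-count added to every value (alpha = 1 is add-one smoothing)
    """
    return (count_word_and_class + alpha) / (count_class + alpha * n_values)

# Even if a word never appeared in a spam email in our sample,
# the smoothed estimate stays non-zero instead of zeroing out the whole product.
print(smoothed_estimate(0, 1000))  # 1/1002, roughly 0.001
```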
Why sample
- To get indicative distributions of the data. Potentially this is costly and we can now use these probabilities to run simulations.
- To simulate a complex process by sampling smaller bits of it.
- Approximate inference, the process of looking at samples that only have a certain property to understand joint distributions.
- Actual inference is hard!
- To “visualise” or get a feel for the data so we have intuition to reason about the state.