Week 6 - Bayesian Inference
Joint distributions
We can use the definition of conditional probability to help calculate joint probabilities.
Therefore independence simplifies the amount of knowledge we need to store to compute the probability of a large state. It is therefore useful to generalise the definition of independence.
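For instance, rearranging the definition of conditional probability gives the joint probability, and independence lets us drop the conditioning entirely:
$$P(A \cap B) = P(A \mid B)\,P(B), \qquad \text{and if } A \text{ and } B \text{ are independent,} \qquad P(A \cap B) = P(A)\,P(B).$$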
Transclude of Conditional-independence
Why is it a generalisation?
Note if $A$ and $B$ are independent events then $P(A \cap B) = P(A)\,P(B)$, so we could say that $A$ is conditionally independent of $B$ given the whole sample space $\Omega$: ordinary independence is the special case of conditional independence with trivial conditioning.
Bayesian networks (belief networks)
For a set of random variables we can capture their dependence structure in a directed acyclic graph.
Bayesian network
Let $G = (V, E)$ be a directed acyclic graph and let $\{X_v\}_{v \in V}$ be a set of random variables. We say $(G, \{X_v\}_{v \in V})$ forms a Bayesian network if the probability density function is given by
$$P(X_{v} = x_{v} \text{ for all } v \in V) = \prod_{v \in V} P\left(X_v = x_v \mid X_u = x_u \text{ for all } u \in \mathrm{pa}(v)\right),$$
where $\mathrm{pa}(v)$ denotes the parents of $v$ in $G$.
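For instance, for three variables with edges $A \to B$, $A \to C$ and $B \to C$, the definition asks for
$$P(A, B, C) = P(A)\,P(B \mid A)\,P(C \mid A, B).$$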
Independent variables
Let $X$ and $Y$ be independent random variables. Then the graph on $\{X, Y\}$ with no edges forms a Bayesian network, as $P(X, Y) = P(X)\,P(Y)$. Note that it would also form a Bayesian network with the edge $X \to Y$ or $Y \to X$, but it would not be minimal.
Conditionally independent
Suppose we have random variables $X$, $Y$ and $Z$ where $X$ is conditionally independent of $Y$ given $Z$. Then the graph $X \leftarrow Z \rightarrow Y$ forms a Bayesian network, as $P(X, Y, Z) = P(Z)\,P(X \mid Z)\,P(Y \mid Z)$.
The property of being conditionally independent is deducible from the Bayesian network
Chain rule (probability)
For random variables $X_i$ for $1 \le i \le n$ we have
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}).$$
This follows from the definition of conditional probability.
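For example, with three variables this reads
$$P(X_1, X_2, X_3) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_1, X_2).$$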
Let's order the variables $X_1, \ldots, X_n$ in a topological order of the graph (so every variable comes after its parents) and compare the chain rule
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1})$$
to the Bayesian network factorisation
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \mathrm{pa}(X_i)).$$
The two agree precisely when each $X_i$ is conditionally independent of its other predecessors given its parents. This is in fact a defining property of a Bayesian network.
Local Markov property
Let $G = (V, E)$ be a directed acyclic graph and $\{X_v\}_{v \in V}$ a set of random variables. We say $(G, \{X_v\}_{v \in V})$ satisfies the local Markov property if for all $v \in V$ and $u \in V$ such that $u \notin \mathrm{pa}(v)$ and there is no directed path from $v$ to $u$, the variable $X_v$ is conditionally independent of $X_u$ given its parents $\{X_w : w \in \mathrm{pa}(v)\}$.
Bayesian network if and only if it satisfies the local Markov Property
Statement
Lemma
Let $G = (V, E)$ be a directed acyclic graph and $\{X_v\}_{v \in V}$ a set of random variables. Then $(G, \{X_v\}_{v \in V})$ is a Bayesian network if and only if it satisfies the local Markov property.
Proof
Inference rules
Once you have a Bayesian network you can use it to calculate all kinds of conditional probabilities. There are 3 main rules that help you do this.
Marginalisation (probability)
Suppose we have two random variables $X$ and $Y$ over domains $D_X$ and $D_Y$ respectively. If we know their joint distribution then we can calculate either $X$'s or $Y$'s (marginal) distribution, i.e.
$$P(X = x) = \sum_{y \in D_Y} P(X = x, Y = y).$$
The name comes from the fact that if you were to draw a table of the joint distribution, summing the rows and columns would put both $X$'s and $Y$'s distributions in the margins.
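A quick sketch of marginalisation on a toy joint table (the numbers here are made up purely for illustration):

```python
# A toy joint distribution P(X, Y) stored as a dictionary (illustrative numbers only).
joint = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.30,
    ("x2", "y1"): 0.25, ("x2", "y2"): 0.35,
}

# Marginalise out Y to get P(X): sum the joint over every value of Y.
p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

print(p_x)  # {'x1': 0.4, 'x2': 0.6} (up to floating point rounding)
```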
Chain rule (probability)
For random variables $X_i$ for $1 \le i \le n$ we have
$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid X_1, \ldots, X_{i-1}).$$
This follows from the definition of conditional probability.
Bayes Rule
Statement
For two events $A$ and $B$ with $P(B) > 0$ we have the following equality on their conditional probabilities
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$
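This follows from writing the joint probability two ways using the definition of conditional probability:
$$P(A \mid B)\,P(B) = P(A \cap B) = P(B \mid A)\,P(A).$$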
Example
Suppose we have two boxes: the first box has 3 green balls and 1 orange, the second has 3 blue and 2 green. Then we do the following: we pick a box with a half chance of picking either, $B$; we draw and do not replace 1 ball from that box, $D_1$; lastly we draw another ball from the same box, $D_2$. This can be represented as a Bayesian network with the graph $B \to D_1$, $B \to D_2$, $D_1 \to D_2$. Note this means $D_1$ depends on $B$, and $D_2$ depends on $B$ and $D_1$. What is $P(D_1 = \text{blue} \mid D_2 = \text{green})$?
We can write out the conditional probability tables quite easily when they are conditioned on all of the variables they depend on. First, $P(D_1 \mid B)$:

| $B$ | green | orange | blue |
|---|---|---|---|
| 1 | 3/4 | 1/4 | 0 |
| 2 | 2/5 | 0 | 3/5 |
The second table gives $P(D_2 \mid B, D_1)$:

| $B$ | $D_1$ | green | orange | blue |
|---|---|---|---|---|
| 1 | green | 2/3 | 1/3 | 0 |
| 1 | orange | 1 | 0 | 0 |
| 2 | green | 1/4 | 0 | 3/4 |
| 2 | blue | 1/2 | 0 | 1/2 |
These tables perfectly set us up to calculate the probability of an exact state using the chain rule, so we first rewrite the probability we want in terms of exact states. Using the definition of conditional probability and marginalisation,
$$P(D_1 = \text{blue} \mid D_2 = \text{green}) = \frac{P(D_1 = \text{blue}, D_2 = \text{green})}{P(D_2 = \text{green})} = \frac{\sum_{b} P(B = b, D_1 = \text{blue}, D_2 = \text{green})}{\sum_{b, d} P(B = b, D_1 = d, D_2 = \text{green})}.$$
Lastly we use the chain rule to turn each term into the probabilities we have in the tables above. For example
$$P(B = 2, D_1 = \text{blue}, D_2 = \text{green}) = P(B = 2)\,P(D_1 = \text{blue} \mid B = 2)\,P(D_2 = \text{green} \mid B = 2, D_1 = \text{blue}) = \tfrac{1}{2} \cdot \tfrac{3}{5} \cdot \tfrac{1}{2} = \tfrac{3}{20}.$$
Doing this for each quantity we get
$$P(D_1 = \text{blue} \mid D_2 = \text{green}) = \frac{3/20}{1/4 + 1/8 + 0 + 1/20 + 0 + 3/20} = \frac{3/20}{23/40} = \frac{6}{23}.$$
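As a sanity check, here is a small sketch that brute-forces the same answer by enumerating the network (the tables are the ones above; the variable names are just for this sketch):

```python
# Conditional probability tables from the example above.
p_box = {1: 0.5, 2: 0.5}                                    # P(B)
p_d1 = {                                                    # P(D1 | B)
    1: {"green": 3/4, "orange": 1/4, "blue": 0},
    2: {"green": 2/5, "orange": 0, "blue": 3/5},
}
p_d2 = {                                                    # P(D2 | B, D1)
    (1, "green"):  {"green": 2/3, "orange": 1/3, "blue": 0},
    (1, "orange"): {"green": 1,   "orange": 0,   "blue": 0},
    (2, "green"):  {"green": 1/4, "orange": 0,   "blue": 3/4},
    (2, "blue"):   {"green": 1/2, "orange": 0,   "blue": 1/2},
}

def joint(b, d1, d2):
    """Chain rule: P(B, D1, D2) = P(B) * P(D1 | B) * P(D2 | B, D1)."""
    if p_d1[b][d1] == 0:          # impossible first draw, so the joint is 0
        return 0.0
    return p_box[b] * p_d1[b][d1] * p_d2[(b, d1)][d2]

colours = ["green", "orange", "blue"]
# Marginalise over the unobserved variables to get numerator and denominator.
num = sum(joint(b, "blue", "green") for b in p_box)
den = sum(joint(b, d1, "green") for b in p_box for d1 in colours)
print(num / den)  # 0.2608... = 6/23
```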
Naive Bayes Classifier
Let's picture a Bayesian network with one root node and all other nodes as children of it. For this it is useful to think of an example.
Spam mail
We are given 3 words, Viagra, Price and Udacity, and depending on the presence of these words in an email we have to decide whether it is spam or not. We can sample from already labelled data to estimate the probabilities of these words appearing or not appearing.
First let's draw the Naive Bayes Classifier diagram.
Transclude of Naice-Bayes
This is a model for whether the email is spam or not. Let's assume we know the probability an email is spam, $P(\text{spam})$, and that we have estimated the following conditional probabilities of each word appearing:

| | Viagra | Price | Udacity |
|---|---|---|---|
| $P(\text{word} \mid \text{spam})$ | 0.3 | 0.2 | 0.0001 |
| $P(\text{word} \mid \text{not spam})$ | 0.001 | 0.1 | 0.1 |
To work these out we only had to count, in our sample of spam and non-spam mail, how many emails contain each word.
Now we would like to know $P(\text{spam} \mid \text{words in the email})$, the probability an email is spam given which of these words it contains.
We want to do this without having to crack out the pen and paper every time.
Naive Bayes classifier
Suppose we have a classification problem and we are in the modelling framework. We are given features $X_1, \ldots, X_n$ where each feature $X_i$ takes finitely many values in a domain $D_i$ (we split the input random variable into $X = (X_1, \ldots, X_n)$). Let our classification $Y$ also have finite domain $D_Y$. We can use our training data to estimate
$$P(Y = y) \quad \text{and} \quad P(X_i = x_i \mid Y = y)$$
for each $1 \le i \le n$, each $x_i \in D_i$ and each $y \in D_Y$.
Assumption
We make the assumption that $Y$ and each of the $X_i$ form a Bayesian network where $Y$ is the root with one arrow to each of the $X_i$, with the probabilities above. Using Bayes rule, then marginalisation and the chain rule, we can calculate
$$P(Y = y \mid X_1 = x_1, \ldots, X_n = x_n) = \frac{1}{Z}\, P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y)$$
with the normalisation term $Z = \sum_{y' \in D_Y} P(Y = y') \prod_{i=1}^{n} P(X_i = x_i \mid Y = y')$. Then to compute the maximum a posteriori class we do not need to compute $Z$, and we let
$$\hat{y} = \operatorname*{argmax}_{y \in D_Y}\; P(Y = y) \prod_{i=1}^{n} P(X_i = x_i \mid Y = y).$$
This exactly forms our Naive Bayes classifier for this problem.
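A minimal sketch of this MAP rule applied to the spam example, assuming a placeholder prior $P(\text{spam}) = 0.4$ (the word probabilities are the ones from the table above):

```python
# Estimated probabilities P(word appears | class) from the spam example above.
p_word_given_class = {
    "spam":     {"viagra": 0.3,   "price": 0.2, "udacity": 0.0001},
    "not spam": {"viagra": 0.001, "price": 0.1, "udacity": 0.1},
}
p_class = {"spam": 0.4, "not spam": 0.6}  # placeholder prior, not from the notes

def map_class(present_words):
    """Return argmax_y P(y) * prod_i P(x_i | y) for the observed word indicators."""
    scores = {}
    for y, prior in p_class.items():
        score = prior
        for word, p in p_word_given_class[y].items():
            # P(word present | y) if the word occurs in the email, else P(word absent | y).
            score *= p if word in present_words else (1 - p)
        scores[y] = score
    return max(scores, key=scores.get), scores

print(map_class({"viagra"}))   # strongly favours "spam"
print(map_class({"udacity"}))  # strongly favours "not spam"
```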
Run time
It is an eager learner, so it requires you to sample the data initially. Though we only ever need to see each data point once, so it is fairly fast. Estimating the parameters is fairly simple. This model also has close to the minimum number of parameters you need in most cases.
Correctness
Whilst this does make a large assumption about the different attributes, empirically it is considered very successful. It is thought that Google use this a lot in Gmail for actual spam filtering!
Limitations
- It makes a strong assumption that the attributes are not related to one another, which seems ridiculous.
- It is very vulnerable to co-dependent variables, as their influence will grow exponentially the more variables are related (i.e. if 3 variables were the same we would cube that term).
- Though in practice it turns out not to matter that much!
- If any of the estimated probabilities $P(X_i = x_i \mid Y = y)$ is zero, it zeros out the entire product.
- We should account for this by always making these values non-zero. This is called “smoothing” of the probabilities.
- We need a very large amount of data to guarantee our probabilities are accurate.
This is cool as,
- Normally Inference is expensive to do, however in this case it makes it really cheap.
- The model has few parameters so has low space complexity.
- If you have sufficient data you can make very good estimates on these probabilities.
- This connects Inference and classification problems.
- Empirically widely successful.
Though the main consideration when using this is that you will need to smooth the estimated probabilities, to make sure we don't zero out options due to limitations of our data.
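A common way to do this is add-one (Laplace) smoothing: add a small pseudo-count before normalising so that no estimated probability is ever exactly zero. A minimal sketch, with a hypothetical count-based estimator:

```python
def smoothed_estimate(count_word_and_class, count_class, n_values=2, alpha=1.0):
    """Laplace-smoothed estimate of P(word value | class).

    count_word_and_class: how many training emails of this class contain the word
    count_class: how many training emails of this class we saw in total
    n_values: number of values the feature can take (2 for present/absent)
    alpha: pseudo-count added to every value (alpha = 1 is add-one smoothing)
    """
    return (count_word_and_class + alpha) / (count_class + alpha * n_values)

# Even if a word never appeared in a spam email in our sample,
# the smoothed estimate stays non-zero instead of zeroing out the whole product.
print(smoothed_estimate(0, 1000))  # 1/1002, roughly 0.001
```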
Why sample
- To get indicative distributions of the data. Potentially this is costly and we can now use these probabilities to run simulations.
- To simulate a complex process by sampling smaller bits of it.
- Approximate inference, the process of looking at samples that only have a certain property to understand joint distributions.
- Actual inference is hard!
- To “visualise” or get a feel for the data so we have intuition to reason about the state.