Week 7 - Information Theory
Information
We want to measure the amount of information we need to transmit to send a message. However, the underlying facts about the message we will receive affect how long that message needs to be.
For example, if we have a fair coin and I need to tell you about 10 flips of this coin, I will need to transmit a message of 10 bits. There are $2^{10}$ equally likely outcomes, so $\log_2(2^{10}) = 10$ bits are needed to distinguish between them.
In comparison, if the coin we are flipping is unfair with the probability of heads being 1, then there is only 1 potential outcome. Therefore I don't need to transmit any information to you for you to know the outcome of me flipping it 10 times - you already know.
Suppose now we are relaying messages about words made up of 4 letters. If each letter has equal probability of appearing in a word, then to communicate a word of length 4 we need 8 bits, using 2 bits to indicate each letter.
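To see this, note there are $4^4 = 256$ equally likely words, and
$$\log_2(4^4) = 4 \log_2(4) = 4 \times 2 = 8 \text{ bits}.$$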
Instead let's suppose the letters appear with the probabilities given in the following table, and we represent them as shown.
| Letter | A | B | C | D |
|---|---|---|---|---|
| Probability | 0.5 | 0.25 | 0.125 | 0.125 |
| Representation | 0 | 10 | 110 | 111 |
| Expected length (bits) | 0.5 | 0.5 | 0.375 | 0.375 |
Using the underlying probability distribution we can design representations of these letters that reduce the expected length of a single letter to 1.75 bits, so the 4-letter word takes only 7 bits in expectation.
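Explicitly, summing the expected-length row of the table,
$$0.5 \cdot 1 + 0.25 \cdot 2 + 0.125 \cdot 3 + 0.125 \cdot 3 = 1.75 \text{ bits per letter},$$
and $4 \times 1.75 = 7$ bits for the word.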
This has an interesting side effect: the further the underlying distribution is from being uniformly random, the less information it takes to transmit a message about it.
Mathematically we use information entropy to measure how close a random variable is to being uniform, with higher values indicating it is closer to uniformly random.
Information entropy
Suppose we have a discrete random variable $X$ that can take values in $\mathcal{X}$. We define the information entropy of $X$ to be
$$H(X) = - \sum_{x \in \mathcal{X}} p(x) \log_2(p(x)).$$
The higher the entropy the more uniform the random variable.
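A minimal sketch of this definition in Python (the helper name `entropy` and the example lists are mine, not part of the notes); it reproduces the numbers from the examples above: 1 bit per flip for the fair coin, 0 bits for the always-heads coin, and 1.75 bits for the skewed letter distribution.

```python
import math

def entropy(probs):
    """Information entropy H(X) = -sum_x p(x) log2 p(x), in bits.
    Terms with p(x) = 0 are skipped, following the convention 0 * log 0 = 0."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                 # fair coin: 1.0 bit per flip
print(entropy([1.0, 0.0]))                 # always-heads coin: 0.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # skewed letter distribution: 1.75 bits
print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform letters: 2.0 bits (the maximum)
```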
Information between two variables
First let's remind ourselves of the definition of a joint distribution.
Joint distribution
Given two random variables $X$ and $Y$ over domains $\mathcal{X}$ and $\mathcal{Y}$, the joint distribution of $(X, Y)$ is over $\mathcal{X} \times \mathcal{Y}$ and is given by
$$p_{X,Y}(x, y) = P(X = x, Y = y).$$
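For example, for two independent fair coin flips $X$ and $Y$ over $\{H, T\}$,
$$p_{X,Y}(x, y) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \quad \text{for each } (x, y) \in \{H, T\} \times \{H, T\}.$$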
We can use this to define what the Joint Entropy between two variables is.
Joint Entropy
Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. The joint entropy is defined by
$$H(X, Y) = - \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2(p(x, y)).$$
This is the information entropy of the joint distribution.
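As a sketch, here is the joint entropy of a small made-up joint distribution (the table `joint`, in which $Y$ is a noisy copy of a fair coin $X$, is purely illustrative and not from the notes):

```python
import math

def entropy(probs):
    """Entropy in bits of a collection of probabilities."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution of (X, Y), for illustration only:
# X is a fair coin, Y agrees with X three quarters of the time.
joint = {("H", "H"): 3/8, ("H", "T"): 1/8,
         ("T", "H"): 1/8, ("T", "T"): 3/8}

# Marginal distributions, found by summing out the other variable.
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in ("H", "T")}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ("H", "T")}

print(entropy(joint.values()))  # joint entropy H(X, Y) ≈ 1.811 bits
print(entropy(p_x.values()))    # H(X) = 1.0 bit
print(entropy(p_y.values()))    # H(Y) = 1.0 bit
```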
We want a measure of how much one variable tells us about another, i.e. how much uncertainty about one variable remains once we know the value of the other.
Conditional entropy
Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. The conditional entropy is defined by
$$H(X \mid Y) = - \sum_{y \in \mathcal{Y}} p(y) \sum_{x \in \mathcal{X}} p(x \mid y) \log_2(p(x \mid y)).$$
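Continuing the illustrative joint distribution from the previous sketch, a direct computation of $H(X \mid Y)$ from the definition:

```python
import math

# The same illustrative joint distribution as in the joint entropy sketch.
joint = {("H", "H"): 3/8, ("H", "T"): 1/8,
         ("T", "H"): 1/8, ("T", "T"): 3/8}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ("H", "T")}

# H(X | Y) = -sum_y p(y) sum_x p(x|y) log2 p(x|y)
#          = -sum_{x,y} p(x, y) log2( p(x, y) / p(y) )
h_x_given_y = sum(-p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

print(h_x_given_y)  # ≈ 0.811 bits: knowing Y removes some, but not all, uncertainty about X
```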
Notice how independence plays out here.
If two variables are independent joint entropy is additive
Statement
Lemma
Suppose we have two independent random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. Then the joint entropy is additive:
$$H(X, Y) = H(X) + H(Y).$$
Proof
This follows from the definitions.
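A sketch of the computation, using $p(x, y) = p(x)\,p(y)$ for independent variables:
$$H(X, Y) = -\sum_{x, y} p(x)p(y)\big(\log_2 p(x) + \log_2 p(y)\big) = H(X) + H(Y),$$
where the last step uses $\sum_x p(x) = \sum_y p(y) = 1$.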
If two variables are independent conditional entropy excludes the dependent
Statement
Lemma
Suppose we have two independent random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. Then the conditional entropy of $X$ given $Y$ is just the entropy of $X$:
$$H(X \mid Y) = H(X).$$
Proof
This follows from the definitions.
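Again a sketch: independence gives $p(x \mid y) = p(x)$, so
$$H(X \mid Y) = -\sum_{y} p(y) \sum_{x} p(x) \log_2 p(x) = \Big(\sum_{y} p(y)\Big) H(X) = H(X).$$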
In the first case, by looking at their joint distribution we still need the same amount of information as if we considered the two variables individually. In the second case, telling you about $Y$ gives you no extra information about $X$.
Conditional entropy could be small in two cases: either $X$ had little entropy to begin with, or $Y$ tells us most of what there is to know about $X$. Mutual information is designed to separate these two cases.
Mutual information
Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. Then the mutual information is defined to be
$$I(X; Y) = H(X) - H(X \mid Y),$$
where $H(X)$ is the information entropy of $X$ and $H(X \mid Y)$ is the conditional entropy.
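Putting the previous sketches together for the same made-up joint distribution (again, the numbers are illustrative, not from the notes):

```python
import math

def entropy(probs):
    """Entropy in bits."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# The same illustrative joint distribution as in the previous sketches.
joint = {("H", "H"): 3/8, ("H", "T"): 1/8,
         ("T", "H"): 1/8, ("T", "T"): 3/8}
p_x = {x: sum(p for (a, _), p in joint.items() if a == x) for x in ("H", "T")}
p_y = {y: sum(p for (_, b), p in joint.items() if b == y) for y in ("H", "T")}

# Conditional entropy H(X | Y), directly from its definition.
h_x_given_y = sum(-p * math.log2(p / p_y[y]) for (x, y), p in joint.items() if p > 0)

# I(X; Y) = H(X) - H(X | Y)
print(entropy(p_x.values()) - h_x_given_y)  # ≈ 0.189 bits of information Y carries about X
```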
Let us note one useful property of this definition.
Mutual information is symmetric
Statement
Lemma
Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. We have that mutual information is symmetric:
$$I(X; Y) = I(Y; X).$$
Proof
This follows from Bayes rule, rules of logarithms and marginalisation.
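A sketch of the computation: writing $p(x \mid y) = p(x, y)/p(y)$ and expanding both terms,
$$I(X; Y) = H(X) - H(X \mid Y) = \sum_{x, y} p(x, y) \log_2\left(\frac{p(x, y)}{p(x)\,p(y)}\right),$$
which is unchanged when $X$ and $Y$ swap roles, so it also equals $H(Y) - H(Y \mid X) = I(Y; X)$.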
KL-divergence
We would like to be able to tell when two distributions are similar. For this we use the following definition. Notice if the two distributions are identical the divergence is $0$.
Kullback–Leibler divergence
Given two probability distributions over $\mathcal{X}$ called $P$ and $Q$, the Kullback–Leibler divergence is the expected value of the log difference between $P$ and $Q$, with the probabilities for each value being given by $P$:
$$D_{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log_2\left(\frac{P(x)}{Q(x)}\right).$$
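A minimal sketch in Python, comparing the skewed letter distribution from earlier against a uniform one (the function name `kl_divergence` and the dictionaries `p`, `q` are mine):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits: the expected log difference between P and Q,
    with the expectation taken under P. Assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)

# The skewed letter distribution from earlier, against a uniform one.
p = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
q = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}

print(kl_divergence(p, q))  # 0.25 bits
print(kl_divergence(p, p))  # 0.0 - identical distributions have zero divergence
```

Here $D_{KL}(P \,\|\, Q) = 2 - 1.75 = 0.25$ bits, which matches the per-letter cost of using the uniform 2-bit code instead of the code tailored to the skewed distribution at the start of the section.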