Week 7 - Information Theory

Information

We want to measure the amount of information we need to transmit to send a message. However, the underlying facts about the message we will receive affect the length of the message we need to send.

For example, if we have a fair coin and I need to tell you about 10 flips of this coin - I will need to transmit a message of 10 bits. There are $2^{10} = 1024$ potential outcomes, all equally likely, so we can do no better than a message of 10 bits.

In comparison, if the coin we are flipping is unfair with the probability of heads being 1, then there is only one potential outcome. Therefore I don't need to transmit any information for you to know the outcome of me flipping it 10 times - you already know.
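
A minimal sketch of this counting argument (assuming each equally likely outcome is encoded by its own distinct bit string; `bits_needed` is a hypothetical helper, not a library function):

```python
import math

def bits_needed(num_outcomes: int) -> float:
    """Bits needed to distinguish between equally likely outcomes."""
    return math.log2(num_outcomes)

print(bits_needed(2 ** 10))  # fair coin, 10 flips: 10.0 bits
print(bits_needed(1))        # always-heads coin, 10 flips: 0.0 bits
```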

Suppose now we are relaying messages about words built from an alphabet of 4 letters. If each letter has equal probability of appearing in a word, then to communicate a word of length 4 we need 8 bits, using 2 bits to indicate each letter.

Instead let's suppose the first letter has probability $0.5$ of appearing in each position, the second letter has probability $0.25$ and the last 2 letters each have probability $0.125$. For a word of length 4, can we do better than 8 bits on expectation?

| Letter | A | B | C | D |
| --- | --- | --- | --- | --- |
| Probability | 0.5 | 0.25 | 0.125 | 0.125 |
| Representation | 0 | 10 | 110 | 111 |
| Expected length | 0.5 | 0.5 | 0.375 | 0.375 |

Using the underlying probability distribution we can design representations of these letters that reduce the expected length of a single letter to 1.75 bits, so the 4-letter word uses only 7 bits on expectation.
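
A quick sketch of that calculation in Python (the probabilities and codewords are the ones from the table above):

```python
# Expected code length for the prefix code in the table above
probabilities = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
codewords = {"A": "0", "B": "10", "C": "110", "D": "111"}

expected_length = sum(p * len(codewords[letter]) for letter, p in probabilities.items())
print(expected_length)      # 1.75 bits per letter
print(4 * expected_length)  # 7.0 bits for a 4-letter word on expectation
```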

This has the interesting side effect that the further the background distribution is from being uniformly random, the less information it takes to transmit a message about it.

Mathematically we use Information entropy to measure how close a random variable is to being uniform, with higher values indicating it is closer to uniformly random.

Information entropy

Information Entropy

Suppose we have a discrete random variable $X$ that can take values in $\mathcal{X}$. We define the information entropy of $X$ to be
$$H(X) = -\sum_{x \in \mathcal{X}} p(x) \log_2(p(x)).$$
The higher the entropy the more uniform the random variable.

Link to original
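
A small sketch of this definition in Python (the `entropy` helper is hypothetical, not part of any library):

```python
import math

def entropy(probabilities):
    """Information entropy H(X) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))                 # fair coin: 1.0 bit
print(entropy([1.0]))                      # certain outcome: 0.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # letter example above: 1.75 bits
```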

Information between two variables

First let's remind ourselves of

Joint distribution

Joint distribution

Given two random variables $X$ and $Y$ over domains $\mathcal{X}$ and $\mathcal{Y}$, the joint distribution of $(X, Y)$ is over $\mathcal{X} \times \mathcal{Y}$ and is given by
$$p_{X,Y}(x, y) = P(X = x, Y = y).$$

Link to original

We can use this to define the Joint Entropy between two variables.

Joint Entropy

Joint Entropy

Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. The joint entropy is defined by
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2(p(x, y)).$$
This is the Information entropy of the joint distribution.

Link to original
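
Continuing the sketch above, the same entropy helper applied to a joint distribution gives the joint entropy (the joint probabilities here are an assumed example of two independent fair coins):

```python
import math

def entropy(probabilities):
    """Information entropy of a distribution given as a list of probabilities, in bits."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Joint distribution p(x, y) for two independent fair coins
joint = {("H", "H"): 0.25, ("H", "T"): 0.25, ("T", "H"): 0.25, ("T", "T"): 0.25}
print(entropy(joint.values()))  # H(X, Y) = 2.0 bits
```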

We want a measure of how much one variable tells us about another, i.e. is $H(X, Y)$ considerably larger than $H(X)$, or is the information in $Y$ in some way consumed by $X$? First let's look at the Information entropy of the conditional probability.

Conditional entropy

Conditional entropy

Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. The conditional entropy is defined by
$$H(Y \mid X) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log_2(p(y \mid x)).$$

Link to original
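
A sketch of this quantity in Python, assuming the joint distribution is given as a dictionary keyed by pairs $(x, y)$; the noisy-channel numbers below are made up for illustration:

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """H(Y|X) = -sum_{x,y} p(x, y) log2 p(y|x), from a dict {(x, y): probability}."""
    marginal_x = defaultdict(float)
    for (x, _), p in joint.items():
        marginal_x[x] += p
    return -sum(p * math.log2(p / marginal_x[x]) for (x, _), p in joint.items() if p > 0)

# Y is a noisy copy of a fair bit X, correct with probability 0.9
joint = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}
print(conditional_entropy(joint))  # ~0.469 bits of uncertainty left in Y once X is known
```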

Notice how independence plays out here.

If two variables are independent joint entropy is additive

Statement

Lemma

Suppose we have two independent random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. Then the Joint Entropy is additive
$$H(X, Y) = H(X) + H(Y).$$

Proof

This follows from the definitions: independence gives $p(x, y) = p(x)p(y)$, so $\log_2(p(x, y)) = \log_2(p(x)) + \log_2(p(y))$ and the double sum splits into $H(X) + H(Y)$.

Link to original

If two variables are independent conditional entropy excludes the dependent

Statement

Lemma

Suppose we have two independent random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. Then the Conditional entropy excludes the variable we condition on
$$H(Y \mid X) = H(Y).$$

Proof

This follows from the definitions: independence gives $p(y \mid x) = p(y)$, and marginalising over $x$ reduces the sum to $H(Y)$.

Link to original

In the first case, by looking at their joint distribution we still need the same amount of information as if we considered them individually. Secondly, my telling you about $X$ doesn't make it take any less information to tell you about $Y$.
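
A numerical check of both lemmas for two independent fair coins; it uses the chain rule identity $H(Y \mid X) = H(X, Y) - H(X)$, which is not stated explicitly above but follows from the definitions:

```python
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

p_x = {"H": 0.5, "T": 0.5}
p_y = {"H": 0.5, "T": 0.5}
joint = {(x, y): p_x[x] * p_y[y] for x in p_x for y in p_y}  # independence: p(x, y) = p(x)p(y)

h_x, h_y = entropy(p_x.values()), entropy(p_y.values())
h_xy = entropy(joint.values())
h_y_given_x = h_xy - h_x  # chain rule H(Y|X) = H(X, Y) - H(X)

print(math.isclose(h_xy, h_x + h_y))   # True: joint entropy is additive
print(math.isclose(h_y_given_x, h_y))  # True: conditioning on X changes nothing
```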

Conditional entropy $H(Y \mid X)$ could be small in two cases - either $X$ tells us a lot about $Y$ (i.e. knowing $X$ removes most of the uncertainty in $Y$) or $H(Y)$ was very small to begin with. So we need a better idea to tell us how related $X$ and $Y$ are.

Mutual information

Mutual information

Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. Then the mutual information is defined to be
$$I(X; Y) = H(Y) - H(Y \mid X)$$
where $H(Y)$ is the information entropy of $Y$ and $H(Y \mid X)$ is the conditional entropy.

Link to original

Let $X$ and $Y$ represent two i.i.d. fair coin tosses. Then we have $I(X; Y) = H(Y) - H(Y \mid X) = 1 - 1 = 0$, i.e. these events tell us nothing about one another. Whereas $I(X; X) = H(X) - H(X \mid X) = 1 - 0 = 1$, so $X$ completely tells us about $X$.
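
A sketch of these two calculations, computing $I(X; Y)$ as $H(X) + H(Y) - H(X, Y)$, which is equivalent to $H(Y) - H(Y \mid X)$ by the chain rule (the helpers below are hypothetical, not library functions):

```python
import math

def entropy(probabilities):
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X, Y) from a dict {(x, y): probability}."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = [sum(p for (a, _), p in joint.items() if a == x) for x in xs]
    p_y = [sum(p for (_, b), p in joint.items() if b == y) for y in ys]
    return entropy(p_x) + entropy(p_y) - entropy(joint.values())

indep = {(x, y): 0.25 for x in "HT" for y in "HT"}  # two independent fair coins
print(mutual_information(indep))                    # 0.0: they tell us nothing about each other

same = {("H", "H"): 0.5, ("T", "T"): 0.5}           # X paired with itself
print(mutual_information(same))                     # 1.0: X completely determines X
```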

Mutual information is symmetric

Statement

Lemma

Suppose we have two random variables $X$ and $Y$ over different domains $\mathcal{X}$ and $\mathcal{Y}$. We have that Mutual information is symmetric
$$I(X; Y) = I(Y; X).$$

Proof

This follows from Bayes rule, rules of logarithms and marginalisation.

Link to original
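
One way to spell that out, using the chain rule $H(X, Y) = H(X) + H(Y \mid X)$ (an identity that follows from the Bayes rule, logarithm and marginalisation steps mentioned above, though it is not stated explicitly in these notes):

$$I(X; Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) = H(X) - H(X \mid Y) = I(Y; X).$$

The middle expression is symmetric in $X$ and $Y$, which gives the result.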

KL-divergence

We would like to be able to tell when two distributions are similar. For this we use the following definition. Notice if $P = Q$ then $D_{KL}(P \parallel Q) = 0$.

Kullback–Leibler divergence

Kullback–Leibler divergence

Given two probability distributions over $\mathcal{X}$ called $P$ and $Q$, the Kullback–Leibler divergence
$$D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right)$$
is the expected value of the log difference between $P$ and $Q$, with the probabilities for each value being given by $P$.

Link to original
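
A minimal sketch in Python, using base-2 logs so the answer is in bits and assuming $Q(x) > 0$ wherever $P(x) > 0$; the distributions are made up for illustration:

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) log2(P(x) / Q(x)), for distributions given as lists."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

fair = [0.5, 0.5]
biased = [0.9, 0.1]
print(kl_divergence(fair, fair))    # 0.0: identical distributions
print(kl_divergence(fair, biased))  # > 0
print(kl_divergence(biased, fair))  # > 0, but a different value: KL divergence is not symmetric
```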