Week 5 - Learning Theory

Computational learning theory is fundamentally about 3 things:

  • defining learning problems,
  • showing algorithms work, and
  • showing problems are easy or hard.

When considering problems it is useful to think about:

  1. The probability of successful training.
    1. The probability the process works
  2. Number of examples you need to train on.
    1. The size of the training set.
  3. Complexity of the hypothesis class.
    1. How complex is the representation - modelling paradigm
  4. Accuracy to which the target concept is approximated.
    1. Within some error $\varepsilon$.
  5. Manner in which the examples are presented.
    1. Batch or online.
  6. Manner in which the training data is collected.
    1. Learner requests examples.
    2. Teacher gives examples.
    3. Fixed distribution.
    4. Examples are picked poorly.

20 Questions

20 Questions is a game where there is a set of possible answers $A$ and a correct answer $a \in A$. There is a player (the questioner) who wants to find $a$ and a questionee who knows $a$. The player asks boolean questions, which the questionee answers true or false about the item $a$.

20 Questions

In the traditional version of the game you have up to 20 questions to guess the answer, hence the name!

There are different set-ups for this game - but the point is to demonstrate the power of how training data is collected.

For simplicity we will call the questionee the teacher and the questioner the learner.

Simple: Teacher provides questions

If the goal of the game is to arrive at the answer as quickly as possible and the teacher can provide the questions, this should be solvable in 1 question.

The teacher can provide the question

  • “Is the answer $a$?”

If the learner has to come up with the questions, the best strategy in terms of expectation is to ask questions that divide the remaining potential answers into two even-sized chunks. That way, no matter the answer, you have halved the solution space. This takes $\log_2 |A|$ questions to find $a$.
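The halving strategy can be sketched as follows, assuming the learner may ask any question of the form "is the answer in this subset?" (the function name is illustrative):

```python
# A sketch of the halving strategy for 20 Questions.  Each question splits
# the remaining candidates in half, so log2(|A|) questions suffice.

def ask_questions(candidates, answer):
    """Repeatedly ask whether the answer lies in the first half of the
    remaining candidates.  Returns the number of questions asked."""
    questions = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        questions += 1
        # The teacher truthfully answers "is the answer in `half`?"
        candidates = half if answer in half else candidates[len(half):]
    return questions

# With 16 candidates the learner needs log2(16) = 4 questions.
print(ask_questions(list(range(16)), answer=11))  # → 4
```

Note that the count is the same regardless of which answer was chosen, since every question halves the solution space exactly.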

This method assumes that any question is valid.

Restricted questions for the teacher

Let's further formalise this problem. Suppose further you can only ask questions from a set $H$ about the answers in $X$.

For an example let $X$ be the set of $k$-bit strings, which correspond to boolean assignments of variables $x_1, \dots, x_k$. Then assume $H$ consists of boolean functions on these variables that can be expressed as a logical formula involving the $x_i$ or their negations joined with AND (i.e. conjunctions of literals). Then the answer to question $h \in H$ about answer $x \in X$ is $h(x)$.

Example

Suppose $k = 5$ and let $h(x) = \neg x_2 \wedge x_4 \wedge \neg x_5$. Then we might have the following series of questions, where each row is a 5-bit input $x_1 x_2 x_3 x_4 x_5$ followed by the answer $h(x)$:

| $x$ | $h(x)$ |
| --- | --- |
| 10110 | 1 |
| 00010 | 1 |
| 11110 | 0 |
| 10100 | 0 |
| 10111 | 0 |

In this set-up we require 2 questions to establish the variables that don't matter, by having a true case where each such variable is either 0 or 1. Then we need one question per variable in the formula to verify that it is required.

You can see this in the above example. The first two questions show that $x_1$ and $x_3$ are not included in the formula, the 3rd one shows that if $x_2$ is positive then it fails, the 4th one shows that if $x_4$ is negative it fails, and the last one shows that if $x_5$ is positive it fails.

So if the teacher presents the questions here it only takes at most $k + 2$ questions to work out the answer.
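The teacher's strategy above can be sketched as follows, assuming the hidden formula is given as a map from variable index to whether its literal is positive (all helper names are illustrative):

```python
# A sketch of the teacher's question sequence: two positive examples that
# expose the irrelevant variables, then one negative example per literal.

def teaching_sequence(k, formula):
    """Return (example, label) pairs for a conjunction given as
    {variable index: literal is positive}.  At most k + 2 pairs."""
    satisfying = [1 if formula.get(i, False) else 0 for i in range(k)]
    # Positive examples: each irrelevant variable shown as both 0 and 1.
    pos0 = [satisfying[i] if i in formula else 0 for i in range(k)]
    pos1 = [satisfying[i] if i in formula else 1 for i in range(k)]
    examples = [(pos0, 1), (pos1, 1)]
    # Negative examples: flip each relevant literal in turn.
    for i in formula:
        x = list(pos0)
        x[i] = 1 - x[i]
        examples.append((x, 0))
    return examples

# Formula from the example above: not x2 and x4 and not x5 (0-indexed keys).
for x, y in teaching_sequence(5, {1: False, 3: True, 4: False}):
    print("".join(map(str, x)), y)
```

For this formula the sequence has $3 + 2 = 5$ questions, matching the example table above.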

Restricted question for the learner

In the current set-up, if the learner were to come up with the questions it would take them up to $2^k$ guesses, as the formula potentially uses all $k$ literals and there is only one correct answer among exponentially many candidates.

Learning with mistake bounds

Clearly, restricted questions for the learner is a hard problem. Instead we can change the set-up.

Suppose instead of the learner giving an example to the teacher to answer, instead the teacher is going to give the learner an example to predict. The teacher then verifies if they were correct or not. If the learner is wrong they get a -1 to their score.

Instead of trying to find the formula in the least time possible, the learner is trying to get the largest score on the provided examples.

Algorithm

We are going to slowly build a table that reflects our guess of the formula. For each variable we have 3 possible states: not included, negative, and positive. In the example above this table would look like:

| Variable | Positive | Negative |
| --- | --- | --- |
| $x_1$ | | |
| $x_2$ | | Y |
| $x_3$ | | |
| $x_4$ | Y | |
| $x_5$ | | Y |
Where a variable that is ticked neither positive nor negative is not included.

The algorithm starts as follows:

  1. Initialise the table with all positive and negative ticks selected.
  2. For each example $x$:
    1. Using the table make a guess of the answer:
      1. If a variable is ticked positive, it needs to be 1.
      2. If a variable is ticked negative, it needs to be 0.
      3. Guess true if and only if every ticked requirement is met.
    2. If we are wrong (the true answer must have been positive), for each variable $x_i$:
      1. If $x_i$ was 0, remove $x_i$'s positive tick if it exists.
      2. If $x_i$ was 1, remove $x_i$'s negative tick if it exists.

This algorithm will only ever make at most $k + 1$ mistakes, so we have a bound on the number of mistakes we can make: the first mistake removes one tick from every variable, and every later mistake removes at least one of the remaining $k$ ticks.
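The algorithm above can be sketched as follows, assuming examples arrive as bit-tuples labelled by a hidden conjunction of literals (all names are illustrative):

```python
# A sketch of the mistake-bound algorithm: track positive/negative ticks
# per variable, guess with the current table, and prune ticks on mistakes.

def run_learner(k, examples):
    """Returns the number of mistakes made on the example stream."""
    pos = set(range(k))   # variables currently ticked positive
    neg = set(range(k))   # variables currently ticked negative
    mistakes = 0
    for x, label in examples:
        # Guess true only if x meets every remaining ticked requirement.
        guess = all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg)
        if guess != label:
            mistakes += 1
            # We only err on positive examples, so prune contradicted ticks.
            for i in range(k):
                (pos if x[i] == 0 else neg).discard(i)
    return mistakes

# Hidden concept: not x2 and x4 and not x5 (as in the example above).
stream = [((0, 0, 0, 1, 0), 1), ((1, 0, 1, 1, 0), 1), ((1, 1, 1, 1, 0), 0)]
print(run_learner(5, stream))  # → 2, within the k + 1 = 6 bound
```

Initially every variable is ticked both positive and negative, so the learner always guesses false; mistakes can therefore only happen on positive examples, which is what makes the pruning step sound.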

Definitions

In machine learning there are similar definitions to other algorithmic disciplines.

Run time complexity

Run time complexity

There are 3 main notations people use to analyse the run time complexity of algorithms.

Big-O notation

Big-O notation is the main way people determine the run time of algorithms. (Rightly or wrongly.) It is a worst-case analysis and has a formal mathematical definition: $f(n) = O(g(n))$ if and only if there exist $c > 0$ and $n_0$ such that $f(n) \le c \cdot g(n)$ for all $n \ge n_0$. This is just saying that eventually $f$ is at most a scalar multiple of $g$.

This is only an upper bound

If $f(n) = O(n)$ then we also have $f(n) = O(n^2)$, as $n \le n^2$ eventually.

Big-Omega notation

Big-Omega notation is said to be the best-case running time. However formally it is simply a lower bound on the run time - the same as Big-O notation is simply an upper bound. It has a similar formal definition: $f(n) = \Omega(g(n))$ if and only if there exist $c > 0$ and $n_0$ such that $f(n) \ge c \cdot g(n)$ for all $n \ge n_0$. This is just saying that eventually $f$ is at least a scalar multiple of $g$.

Big-Theta notation

Big-Theta notation is the tight bound on the run time. This is only really defined when some function $f$ satisfies both $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$, and then we write $f(n) = \Theta(g(n))$. This is also an equivalence relation, so $g(n) = \Theta(f(n))$ as well.

Comments

In some algorithmic disciplines these notions are not sufficient and people talk about the average run time of an algorithm. Whilst Big-O notation determines a function's complexity, in practical terms that worst-case run time may only be realised in very specific circumstances - the algorithm could have a much better average run time on the cases people care about and come across in real life.

Link to original

However there are some domain specific definitions as well.

Sample complexity

Sample complexity

Similar to run time complexity, it is the number of samples required for a model to converge to the optimal hypothesis, in terms of the problem size.

Link to original

Touching upon the example above.

Mistake bound

Mistake bound

During an infinite run of a learning task, the maximum number of mistakes the algorithm could make, in terms of the input size.

Link to original

Let's fix terminology for the next section. Suppose we are in the modelling framework with some training data $T$.

PAC learning

There are two concepts of error when evaluating a hypothesis.

Training error

Training error

Suppose we are in the modelling framework with some training data $T$ and hypothesis space $H$. For some candidate hypothesis $h \in H$ the training error is the fraction of training examples $h$ classifies incorrectly: $$\text{error}_T(h) = \frac{1}{|T|} \sum_{(x, y) \in T} \mathbb{1}[h(x) \ne y].$$

Link to original

True error

True error

Suppose we are in the modelling framework with feature space $X$ with some probability distribution $D$ on it, a hypothesis space $H$ and the true concept $c$. For some candidate hypothesis $h \in H$ the true error is $$\text{error}_D(h) = \Pr_{x \sim D}[c(x) \ne h(x)].$$ This is the integral of the indicator function on whether $h$ is correct or not, weighted with respect to $D$.

Link to original

Using this we can define what we call a “probably approximately correct” (PAC) learner.

Probably approximately correct learnable (PAC)

Probably approximately correct learnable (PAC)

A concept class $C$ is probably approximately correct learnable by learner $L$ using hypothesis space $H$ with error $\varepsilon$ and probability $1 - \delta$ if and only if, when ran on any concept $c \in C$, learner $L$ with probability at least $1 - \delta$ outputs a hypothesis $h \in H$ such that the true error $\text{error}_D(h) \le \varepsilon$, in time and samples polynomial in $\frac{1}{\varepsilon}$, $\frac{1}{\delta}$ and $n$ (the size of the problem).

Link to original

Though showing something is PAC learnable is hard.

$k$-bit example

Suppose our underlying set is $X = \{0, 1\}^k$ and $H = \{h_i : h_i(x) = x_i\}$. Is there an algorithm that is an $\varepsilon$, $\delta$ PAC learner?

$\varepsilon$-exhausted version space

epsilon-exhausted version space

epsilon-exhausted version space

Suppose we are in the modelling framework with some training data $T$, a hypothesis space $H$ and a probability distribution $D$ on $X$. For some $\varepsilon > 0$ the version space $VS(T)$ is $\varepsilon$-exhausted if and only if for all $h \in VS(T)$ the true error $\text{error}_D(h) \le \varepsilon$.

Link to original

Xor example

Suppose $X = \{0, 1\}^2$ and our real function is xor on two variables, $c(x) = x_1 \oplus x_2$. Let $H = \{x_2,\; x_1 \vee x_2,\; x_1 \oplus x_2\}$ and set probability distribution $P(00) = 0.5$, $P(01) = 0.5 - \varepsilon$, $P(10) = \varepsilon$, and $P(11) = 0$. Then with $T = \{(00, 0), (01, 1)\}$ we have our version space to be all of $H$. This version space is $\varepsilon$-exhausted: $x_2$ doesn't correctly map $10$ or $11$ and $x_1 \vee x_2$ doesn't correctly map $11$, but as $P(11) = 0$ and $P(10) = \varepsilon$, a hypothesis in the version space has error at worst $\varepsilon$.
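A small script can check $\varepsilon$-exhaustion for a set-up like this; the hypothesis space and distribution values below are illustrative assumptions, not fixed by the notes:

```python
# A sketch of checking epsilon-exhaustion: build the version space from
# the training data, then bound the true error over the distribution.

def true_error(h, c, dist):
    """Probability mass of the points where h disagrees with c."""
    return sum(p for x, p in dist.items() if h(x) != c(x))

c = lambda x: x[0] ^ x[1]                     # true concept: xor
hypotheses = {
    "x2":       lambda x: x[1],
    "x1 or x2": lambda x: x[0] | x[1],
    "xor":      lambda x: x[0] ^ x[1],
}
eps = 0.1
dist = {(0, 0): 0.5, (0, 1): 0.4, (1, 0): eps, (1, 1): 0.0}
train = [((0, 0), 0), ((0, 1), 1)]

# Version space: hypotheses consistent with the training data.
version_space = {name: h for name, h in hypotheses.items()
                 if all(h(x) == y for x, y in train)}
worst = max(true_error(h, c, dist) for h in version_space.values())
print(worst <= eps)  # epsilon-exhausted → True
```

All three hypotheses survive the training data, but the mass on the points they get wrong is at most $\varepsilon$, so the version space is $\varepsilon$-exhausted.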

This is a useful concept for finding the number of samples we need to take to give us a PAC learner.

Haussler Theorem

Statement

Haussler Theorem

Suppose we are in the modelling framework with finite hypothesis space $H$ and probability distribution $D$ on $X$. Set $\varepsilon \in (0, 1)$. Let our training data $T$ be $m$ i.i.d. samples from $D$ with associated correct answers. Then for the hypotheses consistent with $T$ we have a bound on the true error: $$\Pr[\exists h \in H \text{ consistent with } T \text{ such that } \text{error}_D(h) > \varepsilon] \le |H| e^{-\varepsilon m}.$$

Proof

Let $c$ be the target function.

Let $h \in H$ be such that $\text{error}_D(h) > \varepsilon$ (i.e. $h$ has true error greater than $\varepsilon$).

Then let $x_1, \dots, x_m$ be i.i.d. samples from $X$ using $D$. The probability $h$ is consistent on all of them is $$\Pr[h \text{ consistent with } T] = \prod_{i = 1}^{m} \Pr[h(x_i) = c(x_i)] \le (1 - \varepsilon)^m,$$ as the samples are i.i.d. Moreover as $1 - \varepsilon \le e^{-\varepsilon}$ for $\varepsilon \in [0, 1]$ this gives us $$\Pr[h \text{ consistent with } T] \le e^{-\varepsilon m}.$$ As at most $|H|$ hypotheses can have true error greater than $\varepsilon$, the union bound gives $$\Pr[\exists h \in VS(T) : \text{error}_D(h) > \varepsilon] \le |H| e^{-\varepsilon m},$$

the required result.

Link to original

Note $\delta$ in the definition of PAC learner is exactly the quantity we need to bound $|H| e^{-\varepsilon m}$ with (as we need the probability of finding a good hypothesis to be at least $1 - \delta$). Therefore if we have the ability to draw $m$ i.i.d. examples we need to obey $$|H| e^{-\varepsilon m} \le \delta.$$

Which rearranging for $m$ gives us $$m \ge \frac{1}{\varepsilon} \left( \ln |H| + \ln \frac{1}{\delta} \right).$$

(Note this is polynomial in $\frac{1}{\varepsilon}$, $\frac{1}{\delta}$ and $\ln |H|$ as required by a PAC learner.) So given that a good hypothesis does exist in our hypothesis space, we can use drawing $m$ i.i.d. samples to give us a PAC learner by checking which hypotheses are consistent with the observations.
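This consistent-hypothesis learner can be sketched as follows, using the $h_i(x) = x_i$ hypothesis class from the $k$-bit example; the target index, seed and sample count are illustrative assumptions:

```python
# A sketch of the consistent-hypothesis PAC learner: draw m i.i.d. samples,
# label them with the hidden target, and keep every hypothesis that agrees
# with all of them.
import random

k, m = 10, 40          # 10-bit inputs, 40 samples
target = 3             # assumed hidden concept: h_3(x) = x_3
random.seed(0)

samples = [tuple(random.randint(0, 1) for _ in range(k)) for _ in range(m)]
labelled = [(x, x[target]) for x in samples]

# Keep every h_i consistent with all labelled samples.
consistent = [i for i in range(k) if all(x[i] == y for x, y in labelled)]
print(target in consistent)  # the target always survives → True
```

The target hypothesis is consistent by construction; Haussler's bound controls the probability that a bad hypothesis also survives.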

10-bit example

Suppose our underlying set is $X = \{0, 1\}^{10}$ and $H = \{h_i : h_i(x) = x_i\}$. How many i.i.d. samples do we need to draw to find an $\varepsilon = 0.1$, $\delta = 0.2$, with $D$ being uniform, PAC learner? Using the above formula we need $$m \ge \frac{1}{0.1} \left( \ln 10 + \ln \frac{1}{0.2} \right) \approx 39.12,$$ so 40 samples suffice.

Note here that we didn't use what $D$ was in this computation.
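The arithmetic above can be checked directly; the helper name below is illustrative, not from the notes:

```python
# A quick check of the sample bound m >= (ln|H| + ln(1/delta)) / epsilon.
from math import ceil, log

def samples_needed(h_size, eps, delta):
    """Smallest integer m satisfying Haussler's bound."""
    return ceil((log(h_size) + log(1 / delta)) / eps)

print(samples_needed(10, 0.1, 0.2))  # → 40
```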

A weakness of the Haussler Theorem is that it doesn't handle infinite hypothesis spaces.