20 Questions is a game where there is a set of possible answers $H$ and a correct answer $h \in H$. There is a player (the questioner) who wants to find $h$ and a questionee who knows $h$. The player asks boolean questions to the questionee, who answers true or false about the item $h$.
20 Questions
In the traditional version of the game you have up to 20 questions to guess the answer, hence the name!
There are different set-ups for this game - but the point is to demonstrate how much power lies in the way training data is collected.
For simplicity we will call the questionee the teacher and the questioner the learner.
Simple: Teacher provides questions
Suppose the goal of the game is to arrive at the answer as quickly as possible, but the teacher is allowed to provide the questions. This is solvable in 1 question.
The teacher can provide the question
“Is the answer $h$?”
Normal: Binary search
If the learner has to come up with the questions, the best strategy in terms of expectation is to ask questions that divide the remaining potential answers into two even-sized chunks. That way, no matter the answer, you have halved the solution space. This takes $\lceil \log_2 |H| \rceil$ questions to find $h$.
This method assumes that any question is valid.
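Below is a minimal sketch of this halving strategy in Python, assuming the learner may ask arbitrary membership questions of the form "is the answer in this subset?" (the function names and the candidate items are illustrative only):

```python
import math

def binary_search_questioner(candidates, teacher_answers):
    """Halve the candidate set with every question asked.

    `candidates` is the set of possible answers H and `teacher_answers(subset)`
    returns True iff the hidden answer lies in `subset`.
    """
    remaining = list(candidates)
    questions_asked = 0
    while len(remaining) > 1:
        half = remaining[: len(remaining) // 2]        # one even-sized chunk
        if teacher_answers(set(half)):                 # "Is the answer in this half?"
            remaining = half
        else:
            remaining = remaining[len(remaining) // 2:]
        questions_asked += 1
    return remaining[0], questions_asked

# Illustrative usage: 16 candidates need ceil(log2(16)) = 4 questions.
H = [f"item_{i}" for i in range(16)]
hidden = "item_11"
answer, n_questions = binary_search_questioner(H, lambda subset: hidden in subset)
assert answer == hidden and n_questions == math.ceil(math.log2(len(H)))
```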
Restricted questions for the teacher
Let's further formalise this problem. Suppose further that you can only ask questions from a set $X$ about the answers in $H$.
For an example, let $X = \{0, 1\}^n$ be the $n$-bit strings that correspond to boolean assignments of the variables $x_1, \ldots, x_n$. Then assume $H$ is the set of boolean functions on these $n$ variables that can be expressed as a logical formula involving the $x_i$ or their negations joined with $\wedge$ (a conjunction of literals). Then the answer to question $x \in X$ is $h(x)$, where $h \in H$ is the hidden answer.
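As a concrete illustration, here is a small sketch of how such a conjunction could be represented and evaluated on a question; the representation as sets of positive and negative indices (0-based) is my own choice for the sketch:

```python
def make_conjunction(positive, negative):
    """Build h(x) for a conjunction of literals.

    `positive` holds the (0-based) indices i where x_i must be 1,
    `negative` the indices where x_i must be 0.
    """
    def h(x):
        return all(x[i] == 1 for i in positive) and all(x[i] == 0 for i in negative)
    return h

# The hidden answer used in the worked example below: not(x2) and x4 and not(x5).
h = make_conjunction(positive={3}, negative={1, 4})   # x4 -> index 3, x2 -> 1, x5 -> 4
print(h((1, 0, 1, 1, 0)))   # True  - the conjunction is satisfied
print(h((1, 1, 1, 1, 0)))   # False - x2 is 1, violating not(x2)
```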
Example
Suppose $n = 5$ and let the hidden answer be $h = \lnot x_2 \wedge x_4 \wedge \lnot x_5$. Then we might have the following series of questions:

| $x_1$ | $x_2$ | $x_3$ | $x_4$ | $x_5$ | $h(x)$ |
| --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 |
| 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 |
In this set-up we require 2 questions to establish the variables that don't matter, by having two true cases in which each irrelevant variable takes both the values 0 and 1. Then we need one further question per variable in the formula to verify that it is required.
You can see this in the above example. The first two questions show that $x_1$ and $x_3$ are not included in the formula, the 3rd one shows that if $x_2$ is positive then it fails, the 4th one shows that if $x_4$ is negative it fails, and the last one shows that if $x_5$ is positive it fails.
So if the teacher presents the questions here, it only takes at most $n + 2$ questions to work out the answer (in fact $k + 2$, where $k$ is the number of literals actually in the formula).
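One way the teacher could construct such a question sequence is sketched below (the function name and the 0-based set representation are mine); for the worked example above it reproduces exactly the five questions in the table:

```python
def teacher_questions(n, positive, negative):
    """Generate the k + 2 questions a teacher who knows the conjunction would ask.

    `positive`/`negative` are the (0-based) indices of the relevant literals.
    """
    relevant = set(positive) | set(negative)
    base = [0 if i in negative else 1 for i in range(n)]               # true case, irrelevant vars = 1
    flipped = [b if i in relevant else 0 for i, b in enumerate(base)]  # true case, irrelevant vars = 0
    questions = [tuple(base), tuple(flipped)]                          # 2 questions expose irrelevant vars
    for i in sorted(relevant):                                         # 1 question per literal in the formula
        q = list(base)
        q[i] = 1 - q[i]
        questions.append(tuple(q))
    return questions

# For h = not(x2) and x4 and not(x5) over n = 5 variables (0-based indices 1, 3, 4):
for q in teacher_questions(5, positive={3}, negative={1, 4}):
    print(q)
```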
Restricted questions for the learner
In the current set-up, if the learner has to choose the questions it could take them $2^n$ guesses, as the formula potentially uses all $n$ literals and is then satisfied by only one assignment.
Learning with mistake bounds
Clearly the restricted-questions setting is a hard problem for the learner. We can, however, change the set-up.
Suppose that instead of the learner giving an example to the teacher to answer, the teacher gives the learner an example to predict. The teacher then tells the learner whether they were correct or not. If the learner is wrong they get a $-1$ to their score.
Instead of trying to identify the formula in the least time possible, the learner is trying to get the largest score on the provided examples.
Algorithm
We are going to slowly build a table that reflects our guess of the formula. For each variable we have 3 possible states: not included, negative, and positive. For the example above this table would look like:
| Variable | Positive | Negative |
| --- | --- | --- |
| $x_1$ | | |
| $x_2$ | | Y |
| $x_3$ | | |
| $x_4$ | Y | |
| $x_5$ | | Y |
Where a variable that is ticked neither positive nor negative is not included.
The algorithm proceeds as follows:
1. Initialise the table with all positive and negative ticks selected.
2. For each example $x$:
	1. Using the table, make a guess of the answer: if a variable is ticked positive it needs to be positive ($x_i = 1$), and if it is ticked negative it needs to be negative ($x_i = 0$); guess true only if every ticked condition holds.
	2. If we are wrong:
		1. For each $x_i$:
			1. If $x_i$ was 0, remove $x_i$'s positive tick if it exists.
			2. If $x_i$ was 1, remove $x_i$'s negative tick if it exists.
This algorithm will only ever make at most $n + 1$ mistakes (the first mistake removes $n$ ticks, every subsequent mistake removes at least one more, and ticks belonging to the true formula are never removed), so we have a bound on the number of mistakes we can make.
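A compact sketch of this elimination procedure, using the same positive/negative-tick representation as the table above (names are illustrative):

```python
def mistake_bound_learner(n, examples):
    """Learn a conjunction of literals online, making at most n + 1 mistakes.

    `examples` yields pairs (x, label) where x is an n-bit tuple and label is
    the teacher's True/False answer. Returns the tick table and mistake count.
    """
    positive = set(range(n))             # variables currently ticked "positive"
    negative = set(range(n))             # variables currently ticked "negative"
    mistakes = 0
    for x, label in examples:
        # Guess using the current table: every ticked literal must be satisfied.
        guess = all(x[i] == 1 for i in positive) and all(x[i] == 0 for i in negative)
        if guess != label:
            mistakes += 1
            for i in range(n):
                if x[i] == 0:
                    positive.discard(i)  # a positive tick is inconsistent with x_i = 0
                else:
                    negative.discard(i)  # a negative tick is inconsistent with x_i = 1
    return positive, negative, mistakes

# Replaying the worked example (h = not(x2) and x4 and not(x5), 0-based 1, 3, 4):
examples = [
    ((1, 0, 1, 1, 0), True),
    ((0, 0, 0, 1, 0), True),
    ((1, 1, 1, 1, 0), False),
    ((1, 0, 1, 0, 0), False),
    ((1, 0, 1, 1, 1), False),
]
print(mistake_bound_learner(5, examples))   # -> ({3}, {1, 4}, 2): formula recovered after 2 mistakes
```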
Definitions
In machine learning there are definitions similar to those in other algorithmic disciplines.
Run time complexity
There are 3 main notations people use to analyse the run time complexity of algorithms.
Big-O notation
Big-O notation is the main way people determine the run time of algorithms (rightly or wrongly). It is a worst-case analysis and has a formal mathematical definition:

$$f(n) = O(g(n)) \iff \exists\, c > 0,\ n_0 \in \mathbb{N}\ \text{such that}\ \forall n \geq n_0,\ f(n) \leq c\, g(n).$$

This is just saying that eventually $f(n)$ is at most a scalar multiple of $g(n)$.
This is only an upper bound
If $f(n) = O(n)$ then we also have that $f(n) = O(n^2)$, as $n \leq n^2$ eventually.
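For instance, a short worked check of the definition (the polynomial and the witnesses $c$ and $n_0$ are my own illustrative choices):

$$3n^2 + 5n + 2 \;\leq\; 3n^2 + n^2 + n^2 \;=\; 5n^2 \quad \text{for all } n \geq 5,$$

since $5n \leq n^2$ and $2 \leq n^2$ once $n \geq 5$; so $3n^2 + 5n + 2 = O(n^2)$ with witnesses $c = 5$ and $n_0 = 5$.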
Big-Omega notation
Big-Omega notation is often said to be the best-case running time. However, formally it is simply a lower bound on the run time - in the same way that Big-O notation is simply an upper bound. It has a similar formal definition:

$$f(n) = \Omega(g(n)) \iff \exists\, c > 0,\ n_0 \in \mathbb{N}\ \text{such that}\ \forall n \geq n_0,\ f(n) \geq c\, g(n).$$

This is just saying that eventually $f(n)$ is at least a scalar multiple of $g(n)$.
Big-Theta notation
Big-Theta notation is the tight bound on the run time. It is only defined when a function bounds the run time from both sides: $f(n) = \Theta(g(n))$ if and only if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$. This is also an equivalence relation, so $g(n) = \Theta(f(n))$ as well.
Comments
In some algorithmic disciplines these notions are not sufficient, and people talk about the average run time of an algorithm. Whilst Big-O notation determines a function's worst-case complexity, in practical terms that worst case may only be realised in very specific circumstances - the algorithm could have a much better average run time on most of the cases people care about and come across in real life.
Using this we can define what we call a "probably approximately correct" (PAC) learner.
Probably approximately correct learnable (PAC)
A concept class $C$ is probably approximately correct learnable by learner $L$ using hypothesis space $H$ with error $\epsilon$ and probability $1 - \delta$ if and only if, when run on any concept $c \in C$, learner $L$ with probability at least $1 - \delta$ outputs a hypothesis $h \in H$ such that the true error

$$error_D(h) \leq \epsilon,$$

in time and number of samples polynomial in $1/\epsilon$ and $1/\delta$.
Suppose $X = \{0, 1\}^2$ and our real function is XOR on the two variables $x_1$ and $x_2$. Let $H$ be a small set of candidate hypotheses and fix a probability distribution $D$ over the four possible inputs.
Then, given a small training sample, our version space is the set of hypotheses in $H$ that are consistent with every training example.
This version space is $\epsilon$-exhausted when every hypothesis left in it has true error at most $\epsilon$: some of the remaining hypotheses may still misclassify certain inputs, but as those inputs have small probability under $D$, a hypothesis in the version space has error at worst $\epsilon$.
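To make the bookkeeping concrete, here is a small sketch that computes the version space from a sample and checks whether it is $\epsilon$-exhausted; the hypothesis set, distribution values, sample and $\epsilon$ are all made-up illustrative choices, not the ones from the original example:

```python
from itertools import product

inputs = list(product([0, 1], repeat=2))             # X = {0,1}^2
target = lambda x: x[0] ^ x[1]                        # the real function: XOR

# Hypothetical hypothesis set and distribution D over X (illustrative values only).
H = {
    "x1":   lambda x: x[0],
    "x2":   lambda x: x[1],
    "or":   lambda x: x[0] | x[1],
    "nand": lambda x: 1 - (x[0] & x[1]),
}
D = {(0, 0): 0.4, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.1}

sample = [((0, 1), 1), ((1, 0), 1)]                   # labelled training examples

# Version space: hypotheses consistent with every training example.
version_space = {name: h for name, h in H.items()
                 if all(h(x) == y for x, y in sample)}

# True error of each surviving hypothesis under D.
errors = {name: sum(D[x] for x in inputs if h(x) != target(x))
          for name, h in version_space.items()}
epsilon = 0.5
print(errors, all(e <= epsilon for e in errors.values()))  # -> {'or': 0.1, 'nand': 0.4} True
```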
This is a useful concept for finding the number of samples we need to take to give us a PAC learner.
Now suppose a hypothesis $h \in H$ has true error greater than $\epsilon$, and let $S$ be $m$ i.i.d. samples drawn from $X$ using $D$. The probability that $h$ is consistent on $S$ is

$$P(h \text{ consistent with } S) \leq (1 - \epsilon)^m,$$

as the samples are i.i.d. Moreover, as $(1 - \epsilon) \leq e^{-\epsilon}$ for $\epsilon \in [0, 1]$, this gives us

$$P(h \text{ consistent with } S) \leq e^{-\epsilon m}.$$

If all $h \in H$ had true error greater than $\epsilon$ then, by the union bound, we would have

$$P(\text{some such } h \text{ is consistent with } S) \leq |H| e^{-\epsilon m}.$$

Note that in the definition of a PAC learner this is exactly the quantity we need to bound with $\delta$ (as we need the probability of finding a good hypothesis to be at least $1 - \delta$). Therefore, if we have the ability to draw $m$ i.i.d. examples, we need $m$ to obey

$$|H| e^{-\epsilon m} \leq \delta,$$

which, rearranging for $m$, gives us

$$m \geq \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right).$$

(Note this is polynomial in $1/\epsilon$, $\ln(1/\delta)$ and $\ln |H|$, as required for a PAC learner.) So, given that a good hypothesis does exist in our hypothesis space, we can draw $m$ i.i.d. samples and obtain a PAC learner by checking which hypotheses are consistent with the observations.
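A short sketch of this bound as a sample-size calculator (the function name and the example parameter values are mine, purely for illustration):

```python
import math

def haussler_sample_bound(h_size, epsilon, delta):
    """Smallest integer m with m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

# Illustrative parameters only: |H| = 1000, epsilon = 0.1, delta = 0.05.
print(haussler_sample_bound(1000, epsilon=0.1, delta=0.05))  # -> 100
```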
10-bit example
Suppose our underlying set is $X = \{0, 1\}^{10}$ with some finite hypothesis space $H$ over those 10 bits. How many i.i.d. samples do we need to draw to get a PAC learner with error $\epsilon$ and probability $1 - \delta$, with $D$ being uniform?
Using the above formula we need

$$m \geq \frac{1}{\epsilon}\left(\ln |H| + \ln \frac{1}{\delta}\right)$$

samples. Note here that we didn't use what $D$ was in this computation.
A weakness of Haussler's theorem is that it doesn't handle infinite hypothesis spaces, since the bound depends on $\ln |H|$.