For example, making a decision on whether you should give someone a loan is a classification problem, as there are only two outcomes, whereas mapping from someone’s picture to their age would be a regression problem.
Classification
First, let's start by defining the terms used in this course.
Instance - an input.
Concept - the function from inputs to outputs;
for example, the concept of a car is simply a mapping from things in the world to True or False based on whether each thing is a car or not.
Target concept - the ideal answer or optimal concept.
Hypothesis class - The set of all concepts we want to consider.
This might be smaller than the set of all concepts.
Sample / Training set - A set of labelled data, i.e. pairs of the form (input, output).
Candidate - the current concept under investigation.
Testing set - A set of labelled data that will be used to evaluate your candidate.
It is important that it is distinct from the training set.
My mathematical interpretation
Suppose we have two sets $A$ and $B$, and we are looking to find some optimal function $f\colon A \to B$.
Hypothesis class - a restricted class of functions from $A$ to $B$ that we will consider.
This restriction may be applied due to the set of algorithms you are considering.
Sample / Training set - A set of observations $\{(a, f(a)) \mid a \in A'\}$ for some subset $A' \subseteq A$.
Ideally $A'$ would be all of $A$; however, this might not be possible.
Candidate - a function $g\colon A \to B$ which is our current guess at $f$.
Testing set - A set similar to the training set, but intended for testing the current candidate on.
Decision trees
Suppose your domain $A$ breaks down into many different aspects. For example, if you were making a decision about whether to go into a restaurant, your aspects could be the type of restaurant, whether it is raining, and how full the restaurant is.
Each parent node is affiliated with an aspect, and each child branch is affiliated with a collection of potential values; we make sure all potential values of that aspect are covered by the branches.
Each leaf node is associated with a value in the codomain $B$.
An example for the problem of restaurant choosing is below.
Even/odd parity (true if an even/odd number of inputs are True) with $n$ variables uses $O(2^n)$ nodes.
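To make the structure concrete, here is a minimal sketch of how such a tree could be represented in Python as nested dictionaries; the restaurant aspects, values, and outcomes below are illustrative assumptions, not the course's actual example.

```python
# A decision tree as nested dicts.
# Internal node: {"aspect": <aspect name>, "branches": {value: subtree, ...}}
# Leaf node: a plain value from the codomain (here True/False for "go in or not").
# The aspects and values are made up for illustration.
restaurant_tree = {
    "aspect": "how_full",
    "branches": {
        "empty": True,
        "full": False,
        "some_space": {
            "aspect": "raining",
            "branches": {
                True: True,
                False: {
                    "aspect": "type",
                    "branches": {"thai": True, "burger": True, "other": False},
                },
            },
        },
    },
}

def classify(tree, instance):
    """Walk from the root, following the branch matching the instance's value
    for each aspect, until a leaf (a codomain value) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][instance[tree["aspect"]]]
    return tree

print(classify(restaurant_tree, {"how_full": "some_space", "raining": False, "type": "thai"}))
# -> True
```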
ID3
Iterative Dichotomiser 3 (ID3)
This algorithm works on classification problems. It is a greedy algorithm for designing a decision tree. We are looking for a tree that represents some function $f\colon A \to B$, where we may only have access to some training data $\{(a, f(a)) \mid a \in A'\}$ for a subset $A' \subseteq A$.
The tree is built greedily with respect to the information entropy of the leaf nodes: each split is chosen to reduce that entropy as much as possible. For any labelled set $S \subseteq A'$ we define its entropy by drawing a random element of $S$ uniformly and considering the probability of it being mapped to each element of $B$:

$$H(S) = -\sum_{b \in B} p_b \log_2(p_b), \qquad p_b = \frac{|\{a \in S : f(a) = b\}|}{|S|}.$$
To make this decision tree, assume we have a set of attributes $T$, where each $t \in T$ maps $A$ into a finite set of states $\{1, \ldots, n_t\}$ (with $n_t$ varying between attributes) and therefore splits the set up into subsets. For a labelled set $S \subseteq A'$, let $S_j = t^{-1}(j) \cap S$ be the subsets of $S$ induced by $t$ and define

$$\mathrm{Gain}(S, t) = H(S) - \sum_{j=1}^{n_t} \frac{|S_j|}{|S|} H(S_j),$$

the expected reduction in entropy from splitting on $t$.
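As a small sketch of these two quantities in Python (the function names and the representation of the sample as a list of (input, label) pairs are my own choices):

```python
import math
from collections import Counter

def entropy(sample):
    """H(S) for a labelled sample S = [(a, f(a)), ...], using base-2 logarithms."""
    counts = Counter(label for _, label in sample)
    total = len(sample)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain(sample, attribute):
    """Gain(S, t) = H(S) - sum_j |S_j|/|S| * H(S_j), where S_j = t^{-1}(j) ∩ S."""
    splits = {}
    for a, label in sample:
        splits.setdefault(attribute(a), []).append((a, label))
    return entropy(sample) - sum(
        (len(s) / len(sample)) * entropy(s) for s in splits.values()
    )

# Tiny worked example: four labelled points and one attribute "full".
S = [({"full": "no"}, True), ({"full": "no"}, True),
     ({"full": "yes"}, False), ({"full": "yes"}, True)]
print(round(entropy(S), 3))                   # 0.811
print(round(gain(S, lambda a: a["full"]), 3)) # 0.311
```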
Lastly we will construct this tree using a depth-first search algorithm, where at each leaf node we will have a subset of our original training set.
Pseudocode
```
ID3(training_data, attributes):
  Input:
    training_data of the form {(a, f(a)) | a in A'}
    attributes, each mapping A into some finite set of states
  Output:
    a decision tree for the function f
  1. Let A' be the set of points in A of the training data.
  2. If Entropy(A') = 0, return a single leaf node with the one element of B all the training data has.
  3. For each attribute t calculate Gain(A', t).
     1. If Gain(A', t) = 0 for all t, return a single leaf node with a maximally likely value in the training data.
     2. Otherwise let t' be a maximal element with respect to Gain(A', -).
  4. Build a rooted tree T with t' being the root node.
  5. For each j in 1, ..., n_t':
     1. Let A'_j = t'^{-1}(j) intersect A'.
     2. Let T_j = ID3({(a, f(a)) | a in A'_j}, attributes without t').
     3. Attach the root of T_j to node t' with outcome j.
  6. Return the decision tree T.
```
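A rough Python sketch of this pseudocode, assuming the gain helper and the tiny sample S from the earlier snippet and representing the tree as nested dicts as before; it only creates branches for outcomes that appear in the training data and breaks ties arbitrarily.

```python
from collections import Counter

def id3(sample, attributes):
    """sample: list of (a, f(a)) pairs; attributes: dict name -> function from an
    input to a finite set of states. Assumes gain() from the earlier sketch."""
    labels = [label for _, label in sample]
    # Step 2: the node is pure, so return a leaf with the single remaining label.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 3.1: no attributes left, or none of them help: majority-label leaf.
    gains = {name: gain(sample, t) for name, t in attributes.items()}
    if not attributes or max(gains.values()) == 0:
        return Counter(labels).most_common(1)[0][0]
    # Step 3.2: pick an attribute with maximal gain.
    best = max(gains, key=gains.get)
    remaining = {name: t for name, t in attributes.items() if name != best}
    # Steps 4-5: split on the chosen attribute and recurse on each observed outcome.
    branches = {}
    for a, label in sample:
        branches.setdefault(attributes[best](a), []).append((a, label))
    return {"aspect": best,
            "branches": {j: id3(subset, remaining) for j, subset in branches.items()}}

print(id3(S, {"full": lambda a: a["full"]}))
# -> {'aspect': 'full', 'branches': {'no': True, 'yes': False}} (the tie on 'yes' is broken arbitrarily)
```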
Preference bias occurs in the learning process of an algorithm: it comes from how the objective function chooses to go from one step to the next, and it generates preferences for some models over others.
For example, if you let the attributes distinguish every element of $A$, then a decision tree can represent any function in $B^A$ trivially. Though this might be computationally completely infeasible in most cases!
Suppose we have some continuous attribute taking values in $\mathbb{R}$. Whilst we could have a branch for every value that attribute could take, it is most probably best to discretise the value, either by picking Boolean decisions like "is it $\geq c$" or "is it $\leq c$" for some constant $c$, or by splitting the value into ranges.
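One way to implement the Boolean-threshold idea is to test midpoints between consecutive sorted values of the attribute and keep the one with the highest gain; a sketch, again assuming the gain helper from the earlier snippet:

```python
def best_threshold(sample, key):
    """Return (threshold c, gain) for the best split of the form key(a) >= c,
    trying midpoints between consecutive distinct values. Assumes gain() as above."""
    values = sorted({key(a) for a, _ in sample})
    best = (None, 0.0)
    for lo, hi in zip(values, values[1:]):
        c = (lo + hi) / 2  # candidate threshold between two observed values
        g = gain(sample, lambda a: key(a) >= c)
        if g > best[1]:
            best = (c, g)
    return best

# Example with a made-up continuous attribute "wait_minutes".
S2 = [({"wait_minutes": 5}, True), ({"wait_minutes": 10}, True),
      ({"wait_minutes": 40}, False), ({"wait_minutes": 60}, False)]
print(best_threshold(S2, lambda a: a["wait_minutes"]))  # -> (25.0, 1.0)
```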
Pruning
Overfitting
When training a model on training data, if your model too closely imitates that data without being able to generalise to the test data, you have overfitting.
The model can overfit if the tree gets too large. Therefore it is good practice to contract (or prune) branches of the tree if they have a marginal effect on the outcome of the decision tree. This splits into two subcategories: pre-pruning and post-pruning.
Pre-pruning decision trees
When building a decision tree, to avoid overfitting you may want to stop the branching process even if continuing could increase the accuracy on the training data. One method to do this is to put requirements on the final form of the decision tree before you start training. The requirements might be, for example, a maximum depth of the tree, a minimum number of training points in each leaf node, or a minimum information gain required to make a split.
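If you use a library rather than implementing the tree yourself, these requirements typically correspond to constructor parameters set before fitting. For example, a sketch with scikit-learn (assuming it is installed; the dataset and the numbers are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Pre-pruning: the tree simply refuses to grow past these limits during training.
clf = DecisionTreeClassifier(
    max_depth=3,                # maximum depth of the tree
    min_samples_leaf=5,         # minimum number of training points in each leaf
    min_impurity_decrease=0.01, # minimum impurity decrease (gain) to make a split
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())
```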
Post-pruning is the process of letting the tree grow organically, then evaluating whether the branches are worth keeping or whether they overfit the data.
Cost complexity pruning for decision trees (CPP)
When making a decision tree with no pruning, it can tend to overfit the training data. To reduce this we can “prune” the decision tree. We do this by looking at how useful each of the branches is, then removing the ones that don’t add enough value. This is controlled by a variable $\alpha \geq 0$ which increases the cost of having more leaf nodes.
We define

$$R_\alpha(T) = \sum_{\ell \in \tilde{T}} R(\ell) + \alpha\,|\tilde{T}|,$$

where $\tilde{T}$ is the set of leaf nodes of the tree $T$ and $R$ is an evaluation function (such as entropy or the Gini index) computed at each node on the training data that reaches it, to be the value we want to minimise on our sub-trees.
Optimally we would iterate over all potential subtrees of $T$ and find the sub-tree that minimises $R_\alpha$. However, computationally this can be expensive. Therefore it is usual to use a heuristic to choose a branch to prune instead.
One method to do this is to calculate the effective cost complexity of a non-terminal node. To give you the intuition behind this, let $T$ be our decision tree and extend $R$ to be defined at every vertex $v$, where $R(v)$ is the evaluation function evaluated on the training data that makes it to $v$ (like when training the decision tree). Then the effect of a leaf node $\ell$ on $R_\alpha(T)$ is $R(\ell) + \alpha$.
To define the effective cost complexity we will find the $\alpha$ such that the contribution to $R_\alpha(T)$ of an internal node $v$, if it were collapsed to a leaf, would be the same as that of the subtree rooted at $v$. This is

$$\alpha_{\text{eff}}(v) = \frac{R(v) - \sum_{\ell \in L(v)} R(\ell)}{|L(v)| - 1},$$

where $L(v)$ is the set of leaf nodes below $v$.
(It can be derived by equating the contribution of $v$ as a leaf against the contribution of the terminal nodes that the subtree rooted at $v$ contains.)
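Explicitly, equating the two contributions and solving for $\alpha$ gives

$$R(v) + \alpha \;=\; \sum_{\ell \in L(v)} R(\ell) + \alpha\,|L(v)| \quad\Longrightarrow\quad \alpha_{\text{eff}}(v) \;=\; \frac{R(v) - \sum_{\ell \in L(v)} R(\ell)}{|L(v)| - 1}.$$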
The process then prunes the branch with the lowest $\alpha_{\text{eff}}$, repeating until that value is greater than $\alpha$.
Pseudocode
```
CPP(decision_tree, alpha, R):
  Input:
    decision_tree with a set of vertices V
    alpha, a positive constant to determine pruning
    R, the evaluation function such as Entropy or Gini; this will have to be defined
      on the vertices of V (for the training data that gets classified to v)
  Output:
    a pruned decision tree
  1. Set best_tree = decision_tree.
  2. For each non-leaf v set alpha_eff(v) = calculate_effective_alpha(decision_tree, v, R).
  3. Set min_eff_alpha = min(alpha_eff(v) for v in V non-leaf).
  4. While min_eff_alpha < alpha:
     4.1. Find v such that alpha_eff(v) = min_eff_alpha.
     4.2. Prune best_tree at v.
     4.3. Let P be the set of vertices with v downstream of them.
     4.4. For each p in P set alpha_eff(p) = calculate_effective_alpha(best_tree, p, R).
     4.5. Set min_eff_alpha = min(alpha_eff(v) for v in best_tree non-leaf).
     4.6. Break if best_tree only has 1 vertex.
  5. Return best_tree.

calculate_effective_alpha(tree, v, R):
  Input:
    tree, a decision tree
    v, a vertex of tree that is not a leaf node
    R, the evaluation function such as Entropy or Gini; this will have to be defined
      on the vertices of V (for the training data that gets classified to v)
  Output:
    alpha_eff, the effective alpha for that vertex
  1. Let L be the set of leaf vertices with v upstream of them in tree.
  2. Set total_leaf_weight = sum(R(x) for x in L).
  3. Return (R(v) - total_leaf_weight) / (|L| - 1).
```
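For comparison, scikit-learn implements the same weakest-link idea: cost_complexity_pruning_path reports the effective alphas at which successive subtrees would be pruned, and ccp_alpha applies the pruning when fitting. A sketch (assuming scikit-learn is installed; the dataset and alpha are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Effective alphas at which successive subtrees get pruned away, together with
# the total leaf impurity of each corresponding pruned tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
print(path.ccp_alphas)
print(path.impurities)

# Fitting with a chosen alpha prunes every branch whose effective alpha is below it.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(X, y)
print(pruned.get_n_leaves())
```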
Run time
Optimally we would iterate over all potential subtrees of $T$ and find the sub-tree that minimises $R_\alpha$. However, computationally this can be expensive. Instead we can use the effective cost complexity heuristic, which cuts down the run time but can still take $O(|V|^2)$ in the worst case, where $V$ is the set of vertices.
Correctness
This can reduce overfitting, however the parameter $\alpha$ needs to be fine-tuned. It is best to use cross-validation for this purpose.
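A sketch of one way to do that cross-validation with scikit-learn, searching over the candidate alphas produced by the pruning path (names as in the previous sketch):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alphas: the breakpoints of the cost-complexity pruning path
# (clipped at zero to guard against tiny negative values from floating point).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = [max(a, 0.0) for a in path.ccp_alphas]

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    cv=5,  # 5-fold cross-validation
)
search.fit(X, y)
print(search.best_params_)
```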
It is possible to tackle this with a decision tree if you only care about an approximate answer. However, you will have to change the definition of information entropy to work in a continuous space. This will have to use something like an average of the values in the leaf nodes.
Gini Index
Above we used information entropy to choose which feature to split on. However, there are other options, like the Gini index.
Gini index
Suppose we have a discrete random variable $X$ that can take values in $\{1, \ldots, n\}$ with probabilities $p_1, \ldots, p_n$. We define the Gini index of $X$ to be

$$G(X) = 1 - \sum_{i=1}^{n} p_i^2.$$
Like entropy, the higher the Gini index the closer to uniform the random variable is; its maximum value of $1 - \tfrac{1}{n}$ is attained by the uniform distribution.
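A quick sketch comparing the two measures on the same distributions (function names are my own):

```python
import math

def gini(probs):
    """Gini index G(X) = 1 - sum_i p_i^2."""
    return 1 - sum(p * p for p in probs)

def shannon_entropy(probs):
    """H(X) = -sum_i p_i log2(p_i), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Both measures are 0 for a deterministic variable and largest for the uniform one.
for probs in [(1.0, 0.0), (0.9, 0.1), (0.5, 0.5)]:
    print(probs, round(gini(probs), 3), round(shannon_entropy(probs), 3))
```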