Week 1 - Decision Trees

Supervised learning

Supervised learning falls into two main problem sets:

  • Classification problems - mapping from a domain to some discrete set, and
  • Regression problems - mapping from a domain to some continuous space like the real numbers ($\mathbb{R}$).

For example, making a decision on whether you should give someone a loan is a classification problem as there are only two outcomes - whereas mapping from someone’s picture to their age would be a regression problem.

Classification

First, let's start by defining the terms used in this course.

  • Instance - an input.
  • Concept - the function from inputs to outputs.
    • For example, the concept of a car is a mapping from things in the world to True or False based on whether that thing is a car or not.
  • Target concept - the ideal answer or optimal concept.
  • Hypothesis class - This is the set of all concepts we want to consider
    • This might be smaller than the set of all concepts.
  • Sample / Training set - A set of labelled data, i.e. pairs (input, output).
  • Candidate - the current concept under investigation.
  • Testing set - A set of labelled data that will be used to evaluate your candidate
    • It is important it is distinct from the training set.

My mathematical interpretation

Suppose we have two sets $A$ and $B$ and we are looking to find some optimal function $f: A \to B$.

  • Instance - some element $a \in A$.
  • Concept - a function $c: A \to B$.
  • Target concept - the optimal function $f: A \to B$.
  • Hypothesis class - a restricted class $H$ of functions $A \to B$ that we will consider.
    • This restriction may be applied due to the set of algorithms you are considering.
  • Sample / Training set - A set of observations $\{(a, f(a)) \mid a \in A'\}$ for some $A' \subseteq A$.
    • Ideally $A' = A$, however this might not be possible.
  • Candidate - a function $h \in H$ which is our current guess at $f$.
  • Testing set - A set similar to the training set but with the intent of testing the current candidate on it.
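To pin these terms down in code, here is a minimal sketch (my own, not from the course) of how they could be written as Python type aliases:

```python
from typing import Callable, TypeVar

A = TypeVar("A")  # the input domain
B = TypeVar("B")  # the output codomain

Concept = Callable[[A], B]        # a concept: a function from A to B
TrainingSet = list[tuple[A, B]]   # observations {(a, f(a)) | a in A'}
HypothesisClass = list[Concept]   # the restricted class H of concepts we search over
```

A candidate is then just a Concept drawn from the HypothesisClass, and the testing set has the same type as the TrainingSet.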

Decision trees

Suppose your domain $A$ actually breaks down into many different aspects $A_1, \ldots, A_k$. For example, if you were making a decision about whether to go into a restaurant, your aspects could be the type of restaurant, whether it is raining, and how full the restaurant is.

Then a Decision tree is a rooted tree where

  • Each parent node is affiliated with an aspect $A_i$, with each child branch being affiliated with a collection of potential values of that aspect. We make sure all potential values of that aspect are covered by the branches.
  • Each leaf node is associated to a value in the codomain $B$.

An example for the problem of restaurant choosing is below.

decision tree example
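To make the structure concrete, here is a minimal sketch of how such a tree could be represented and evaluated in Python. The particular aspects and decisions are invented for illustration; they are not taken from the figure.

```python
# Internal nodes are ("aspect", {value: subtree}); leaves are values in the codomain B.
restaurant_tree = (
    "how full", {
        "empty": False,
        "some space": True,
        "full": ("raining", {
            True: True,    # the wait is worth it rather than walking elsewhere in the rain
            False: False,
        }),
    },
)

def decide(tree, aspects):
    """Follow branches matching the instance's aspect values until a leaf is reached."""
    while isinstance(tree, tuple):
        aspect, branches = tree
        tree = branches[aspects[aspect]]
    return tree

print(decide(restaurant_tree, {"how full": "full", "raining": True}))  # True
```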

Boolean decision trees

Suppose our domain consists of Booleans, either True or False, as does our codomain $B = \{\text{True}, \text{False}\}$. Then the function represented by the decision tree is a Boolean function.

boolean decision trees

These can expand to many variables, and for different functions the tree can grow at different rates.

  • Logical OR and logical AND with $n$ variables use $O(n)$ nodes.
  • Even/odd parity (True if an even/odd number of inputs are True) with $n$ variables uses $O(2^n)$ nodes (see the sketch below).
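A small sketch (my own illustration, with made-up helper names) that builds the obvious decision tree for a Boolean function by testing variables in order, stopping as soon as the function is constant, and counting its nodes shows the difference in growth:

```python
from itertools import product

def tree_size(f, n, fixed=()):
    """Nodes in a decision tree that tests x_1, ..., x_n in order and
    stops as soon as f is constant on all completions of the fixed values."""
    remaining = n - len(fixed)
    outputs = {f(fixed + rest) for rest in product((False, True), repeat=remaining)}
    if len(outputs) == 1:
        return 1  # leaf node
    # Internal node testing the next variable, one subtree per branch.
    return 1 + sum(tree_size(f, n, fixed + (value,)) for value in (False, True))

n = 4
print(tree_size(lambda xs: any(xs), n))           # OR: linear growth (9 nodes for n = 4)
print(tree_size(lambda xs: sum(xs) % 2 == 1, n))  # odd parity: exponential growth (31 nodes for n = 4)
```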

ID3

Iterative Dichotomiser 3 (ID3)

This algorithm works on classification problems. It is a greedy algorithm to design a maximal decision tree. Therefore we are looking for the tree to represent some function $f: A \to B$, where we may only have access to some training data $\{(a, f(a)) \mid a \in A'\}$ for some $A' \subseteq A$.

This decision tree will be maximal with regards to the information entropy of each of the leaf nodes within that decision tree, i.e. it is grown until the entropy of each leaf can no longer be reduced. For any $S \subseteq A'$ we can define the entropy
$$E(S) = -\sum_{b \in B} p_b \log_2(p_b),$$
by considering drawing a random element of $S$ uniformly, with $p_b$ being the probability of it being mapped to $b \in B$ by $f$.

To make this decision tree assume we have a set of attributes $T$, where each $t \in T$ splits the set $A$ up into subsets $t^{-1}(1), \ldots, t^{-1}(n_t)$ (so $t: A \to \{1, \ldots, n_t\}$, where $n_t$ can vary with $t$). Let $A'_j = t^{-1}(j) \cap A'$ be the corresponding subsets of $A'$ and define
$$\text{Gain}(A', t) = E(A') - \sum_{j=1}^{n_t} \frac{|A'_j|}{|A'|} E(A'_j),$$
where $E$ is the information entropy defined above.

Lastly we will construct this tree using a depth-first search algorithm, where at each leaf node we will have a subset of our original training set.
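As a sanity check on these definitions, here is a small Python sketch of the entropy and gain calculations, treating the training data as a list of (input, label) pairs and an attribute as a function from inputs to a finite set of values:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Information entropy E(S) of the multiset of labels attached to S."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(training_data, attribute):
    """Gain(A', t): entropy of A' minus the weighted entropy of the subsets A'_j."""
    labels = [b for _, b in training_data]
    subsets = {}
    for a, b in training_data:
        subsets.setdefault(attribute(a), []).append(b)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted
```

Here `attribute` plays the role of $t$ and the dictionary keys are the values $1, \ldots, n_t$.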

Pseudocode

ID3(training_data, attributes):
	Input:
		training_data of the form {(a,f(a)) | a in A'}
		attributes mapping A into some finite set of states
	Output:
		a decision tree for the function f
1. Let A' be the set of points in A of the training data.
2. If Entropy(A') = 0 return a single leaf node with the one element of B that all the training data maps to.
3. For each attribute t calculate Gain(A',t).
	1. If Gain(A',t) = 0 for all t return a single leaf node with a maximally likely value in the training data.
	2. Otherwise let t' be a maximal element with respect to Gain(A', -)
4. Build rooted tree T with t' being the root node
5. For each j in 1, ..., n_t'
	1. Let A'_j = t'^{-1}(j) intersect A'
	2. Let T_j = ID3({(a,f(a)) | a in A'_j}, attributes without t')
	3. Attach the root of T_j to node t' with outcome j.
6. return decision tree T.
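The pseudocode translates fairly directly into Python. The following is a sketch rather than a definitive implementation: it reuses the entropy and gain helpers from the sketch above, takes attributes as a dict of name -> function, returns trees in the same (attribute, {value: subtree}) shape used earlier, and the helper name `classify` is my own.

```python
from collections import Counter

def id3(training_data, attributes):
    """training_data: list of (a, f(a)) pairs; attributes: dict name -> (a -> value).
    Returns a leaf value of B, or a tuple (attribute_name, {value: subtree})."""
    labels = [b for _, b in training_data]
    # Step 2: zero entropy means every label agrees, so return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 3: score every remaining attribute by its information gain
    # (uses gain() from the earlier sketch).
    gains = {name: gain(training_data, t) for name, t in attributes.items()}
    if not gains or max(gains.values()) <= 0:
        return Counter(labels).most_common(1)[0][0]  # maximally likely label
    best = max(gains, key=gains.get)
    # Steps 4-5: split on the best attribute and recurse without it.
    subsets = {}
    for a, b in training_data:
        subsets.setdefault(attributes[best](a), []).append((a, b))
    remaining = {name: t for name, t in attributes.items() if name != best}
    return (best, {value: id3(subset, remaining) for value, subset in subsets.items()})

def classify(tree, a, attributes):
    """Walk the returned tree to classify a single input a."""
    while isinstance(tree, tuple):
        name, branches = tree
        tree = branches[attributes[name](a)]
    return tree
```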

Run time

Correctness

Link to original

Bias

Inductive bias

Inductive bias

This is any preference in a machine learning algorithm to pick one model over another. Examples of this are restriction bias and preference bias.

Link to original

Preference bias

Preference bias

Preference bias occurs in the learning process of an algorithm. It comes from how the objective function chooses to go from one step to the next, which generates preferences for different models.

Link to original

Restriction bias

Restriction bias is the bias that comes from restricting the hypothesis class a learning algorithm is able to consider.

In ID3 the restriction bias comes from the fact that the hypothesis class consists of decision trees built from a given set of attributes.

For example, if you let the attributes be rich enough to distinguish every element of $A$, then a decision tree can represent any function in $B^A$ trivially. Though this might be computationally completely infeasible in most cases!

Whereas the preference bias comes from the ID3 algorithm itself, which prefers:

  • good splits near the root of the tree,
  • ones that lower the entropy more, and
  • shorter trees.

Other considerations

Continuous variables

Suppose we have some continuous attribute taking values in $\mathbb{R}$. Whilst we could have a branch for every value that attribute could take, it is most probably best to categorise the value, either by picking Boolean decisions like "is it $\geq c$ or $< c$?" for some threshold $c$, or by splitting the value into ranges.
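A minimal sketch of the Boolean-threshold approach, reusing the gain function from earlier (the helper name best_threshold is my own):

```python
def best_threshold(training_data, value):
    """Choose the cut c for the Boolean attribute "value(a) >= c" that maximises
    information gain, trying each observed value of the continuous attribute."""
    candidates = sorted({value(a) for a, _ in training_data})
    def split_at(c):
        return lambda a: value(a) >= c
    return max(candidates, key=lambda c: gain(training_data, split_at(c)))
```

In practice the candidate cut points are usually midpoints between consecutive observed values, but the idea is the same.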

Pruning

Overfitting

Overfitting

When training a model on training data, if your model too closely imitates that data without being able to generalise to the test data, you have overfitting.

Link to original

The model can overfit if the tree gets too large. Therefore it is good practice to contract (or prune) branches of the tree if they have a marginal effect on the outcome of the decision tree. This splits into two subcategories: pre-pruning and post-pruning.

Pre-pruning decision trees

Pre-pruning decision trees

When building a decision tree, to avoid overfitting you may want to stop the branching process even if continuing could increase the accuracy on the training data. One method to do this is to put in requirements on the final form of the decision tree before you start training. The requirements might be (see the sketch below):

  • Max depth of the decision tree in any one branch.
  • Minimum size of a leaf node for training data.
  • Minimum size of a parent node for training data.
Link to original
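For comparison, scikit-learn's CART-based DecisionTreeClassifier (a different algorithm from the ID3 sketch above) exposes each of these requirements directly as a constructor argument; a minimal sketch:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(
    max_depth=4,           # max depth of the decision tree in any one branch
    min_samples_leaf=5,    # minimum size of a leaf node
    min_samples_split=10,  # minimum size of a parent node before it may split
)
# clf.fit(X_train, y_train)  # X_train, y_train stand in for your training data
```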

Post-pruning decision trees

Post-pruning is the process of letting the tree grow organically, then evaluating whether the branches are worth keeping or whether they overfit the data.

Cost complexity pruning for decision trees (CPP)

Cost complexity pruning for decision trees (CPP)

When making a decision tree with no pruning it can tend to overfit the training data. To reduce this we can “prune” the decision tree. We do this by looking at how useful each of the branches are - then removing the ones that don’t add enough value. This is controlled by a variable $\alpha$ which increases the cost of having more leaf nodes.

Suppose we are in a modelling framework where $S$ is our training data and let $T$ be our decision tree. Assume $R(T)$ is our function for evaluating how good a fit our tree is (likely this will use information entropy or the Gini index). Lastly let $|\tilde{T}|$ be the number of leaf nodes in our decision tree $T$. Set
$$R_{\alpha}(T) = R(T) + \alpha |\tilde{T}|$$
to be the value we want to minimise on our sub-trees.

Optimally we would iterate over all potential subtrees of $T$ and find the sub-tree that optimises $R_{\alpha}$. However, computationally this can be expensive. Therefore it is usual to find a heuristic to choose a branch to prune instead.

One method to do this is to calculate the effective cost complexity of a non-terminal node. To give you the intuition behind this, let $T$ be our decision tree and extend $R$ to be defined at each vertex $v$, where $R(v)$ is the evaluation function evaluated on the training data that makes it to $v$ (like when training the decision tree). Then the effect of a leaf node $l$ on $R_{\alpha}(T)$ is
$$R(l) + \alpha.$$

To define the effective cost complexity we will find the $\alpha_{\text{eff}}(v)$ such that the contribution of an internal node $v$ to $R_{\alpha}(T)$ would be the same as for the subtree rooted at $v$. This is
$$\alpha_{\text{eff}}(v) = \frac{R(v) - \sum_{l \in L_v} R(l)}{|L_v| - 1},$$
where $L_v$ is the set of leaf nodes of the subtree rooted at $v$.

(It can be derived by equating the contribution $R(v) + \alpha$ of $v$ against the contribution $\sum_{l \in L_v} R(l) + \alpha |L_v|$ of the terminal nodes the subtree rooted at $v$ contains.)

The process then prunes branches with the lowest $\alpha_{\text{eff}}$ until that value is greater than $\alpha$.

Pseudocode

CPP(decision_tree, alpha, R):
	Input:
		decision_tree with a set of vertices V
		alpha positive constant to determine pruning
		R the evaluation function such as Entropy or Gini, this will have to
			be defined on vertices of V (for the training data that gets
			classified to v)
	Output:
		a pruned decision tree
1. Set best_tree = decision_tree
2. For each non-leaf set alpha_eff(v) = calculate_effective_alpha(decision_tree, v, R)
3. Set min_eff_alpha = min(alpha_eff(v) for v in V non-leaf)
4. While min_eff_alpha < alpha
	4.1. Find v such that alpha_eff(v) = min_eff_alpha
	4.2. Prune best_tree at v
	4.3. Let P be the set of vertices with v downstream of it.
	4.4. For each p in P set alpha_eff(p) = calculate_effective_alpha(best_tree, p, R)
	4.5. Set min_eff_alpha = min(alpha_eff(v) for v in best_tree non-leaf)
	4.6. Break if best_tree only has 1 vertex.
5. return best_tree.
 
calculate_effective_alpha(tree, v, R):
	Input
		tree is a decision tree
		v is a vertex in V that is not a leaf node
		R the evaluation function such as Entropy or Gini, this will have to
			be defined on vertices of V (for the training data that gets
			classified to v)
	Output
		alpha_eff the effective alpha for that vertex
1. Let L be the set of leaf vertices with v upstream of it in tree.
2. Set total_leaf_weight = sum(R(x) for x in L)
3. Return (R(v) - total_leaf_weight)/(|L| - 1)
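Here is a toy sketch of calculate_effective_alpha on a hand-made tree, represented as nested dicts with an R value at every vertex (the numbers are invented):

```python
def leaves(tree):
    """All leaf vertices downstream of this vertex (a leaf has no children)."""
    if not tree["children"]:
        return [tree]
    return [leaf for child in tree["children"] for leaf in leaves(child)]

def effective_alpha(vertex):
    """(R(v) - sum of the leaf R's below v) / (number of those leaves - 1)."""
    L = leaves(vertex)
    return (vertex["R"] - sum(leaf["R"] for leaf in L)) / (len(L) - 1)

# Replacing this internal node with a single leaf only costs 0.5 - 0.375 in fit
# for one fewer leaf, so its effective alpha is 0.125.
node = {"R": 0.5, "children": [
    {"R": 0.25, "children": []},
    {"R": 0.125, "children": []},
]}
print(effective_alpha(node))  # 0.125
```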

Run time

Optimally we would iterate over all potential subtrees of $T$ and find the sub-tree that optimises $R_{\alpha}$. However, computationally this can be expensive. Instead we can use the effective cost complexity heuristic, which cuts down the run time but can still be expensive on large trees.

Correctness

This can reduce overfitting, however the parameter $\alpha$ needs to be fine-tuned. It is best to use cross validation for this purpose (a sketch of this follows below).

Link to original
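Since $\alpha$ is best chosen by cross validation, here is a minimal sketch using scikit-learn's cost complexity pruning (its ccp_alpha parameter and cost_complexity_pruning_path method implement the same idea for CART trees; load_iris is just a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate alphas come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Cross-validate a tree pruned at each candidate alpha and keep the best one.
scores = {
    alpha: cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    ).mean()
    for alpha in path.ccp_alphas
}
best_alpha = max(scores, key=scores.get)
print(best_alpha, scores[best_alpha])
```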

Due to the

Regression problems

It is possible to tackle this with a decision tree if you only care about an approximate answer. However, you will have to change the definition of information entropy to one that works in a continuous space. This will have to use something like the average of the values in each leaf node.
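One standard way to do this (not spelled out in the lecture) is to replace entropy with the variance of the outputs, with each leaf predicting the average of its training values; a sketch mirroring the gain function above:

```python
from statistics import pvariance

def variance_gain(training_data, attribute):
    """Reduction in output variance from splitting on `attribute`:
    the regression analogue of information gain. Each leaf would then
    predict the average of its training outputs."""
    ys = [y for _, y in training_data]
    subsets = {}
    for a, y in training_data:
        subsets.setdefault(attribute(a), []).append(y)
    weighted = sum(len(s) / len(ys) * pvariance(s) for s in subsets.values())
    return pvariance(ys) - weighted
```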

Gini Index

Above we used Information entropy to choose which feature to split along. However there are other options like the Gini index.

Gini index

Gini index

Suppose we have a discrete random variable $X$ that can take values in $\{1, \ldots, n\}$, with $p_i$ the probability of taking value $i$. We define the Gini index of $X$ to be
$$G(X) = 1 - \sum_{i=1}^{n} p_i^2.$$
The higher the Gini index (like the entropy), the closer to uniform the random variable is.

Link to original
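A small sketch computing the Gini index for a list of labels; swapping this in for entropy in the gain function above gives the Gini-based splitting criterion:

```python
from collections import Counter

def gini(labels):
    """Gini index 1 - sum_i p_i^2 of the empirical label distribution."""
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["yes", "no", "no", "no"]))   # 1 - (0.25**2 + 0.75**2) = 0.375
print(gini(["yes", "yes", "no", "no"]))  # maximised by a uniform split: 0.5
```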