# Cost complexity pruning for decision trees (CCP)

A [[Decision tree|decision tree]] grown with no pruning tends to [[Overfitting|overfit]] the [[Training data|training data]]. To reduce this we can "prune" the decision tree: we look at how useful each branch is and remove the ones that don't add enough value. This is controlled by a parameter $\alpha$ which increases the cost of having more leaf nodes.

Suppose we are in a [[Modelling framework|modelling framework]] where $T$ is our [[Training data|training data]] and let $\hat{f}$ be our decision tree. Assume $R(\hat{f}, T)$ is our function for evaluating how good a fit our tree is (likely this will use [[Information entropy|information entropy]] or the [[Gini index]]). Lastly let $L(\hat{f})$ be the set of leaf nodes in our [[Decision tree|decision tree]] $\hat{f}$. Set
$$
R_{\alpha}(\hat{f}, T) = R(\hat{f}, T) + \alpha \vert L(\hat{f}) \vert
$$
to be the value we want to minimise over sub-trees.

Ideally we would iterate over all potential sub-trees of $\hat{f}$ and find the one that minimises $R_{\alpha}$ on $T$. However, computationally this can be expensive, so it is usual to use a heuristic to choose which branch to prune instead. One method to do this is to calculate the *effective cost complexity* of each non-terminal node.

To give the intuition behind this, let $G = (V,E)$ be our decision tree and extend $R$ to be defined at each $v \in V$, where $R(v)$ is the evaluation function applied to the training data that reaches $v$ (as when training the [[Decision tree|decision tree]]). Then the contribution of a leaf node $v \in L(G)$ to $R_{\alpha}(\hat{f}, T)$ is
$$
R(v) + \alpha.
$$
To define the effective cost complexity we find the $\alpha$ at which the contribution of an internal node $v$, if it were made a leaf, would equal the contribution of $G_v$, the subtree rooted at $v$. That is, we solve
$$
R(v) + \alpha = \sum_{x \in L(G_v)} R(x) + \alpha \vert L(G_v) \vert
$$
for $\alpha$, giving
$$
\alpha_{eff}(v) = \frac{R(v) - \sum_{x \in L(G_v)} R(x)}{\vert L(G_v) \vert - 1}.
$$
The process then repeatedly prunes the branch with the lowest $\alpha_{eff}$ until the smallest remaining value is at least $\alpha$.

## Pseudocode

```pseudocode
CCP(decision_tree, alpha, R):
    Input:  decision_tree with a set of vertices V
            alpha a positive constant that determines how much to prune
            R the evaluation function, such as Entropy or Gini, which has to be
              defined on the vertices of V (for the training data that gets classified to v)
    Output: a pruned decision tree
    1. Set best_tree = decision_tree
    2. For each non-leaf v in V set alpha_eff(v) = calculate_effective_alpha(decision_tree, v, R)
    3. Set min_eff_alpha = min(alpha_eff(v) for v in V non-leaf)
    4. While min_eff_alpha < alpha
        4.1. Find v such that alpha_eff(v) = min_eff_alpha
        4.2. Prune best_tree at v
        4.3. Break if best_tree only has 1 vertex
        4.4. Let P be the set of non-leaf vertices of best_tree with v downstream of them
        4.5. For each p in P set alpha_eff(p) = calculate_effective_alpha(best_tree, p, R)
        4.6. Set min_eff_alpha = min(alpha_eff(v) for v in best_tree non-leaf)
    5. Return best_tree

calculate_effective_alpha(tree, v, R):
    Input:  tree a decision tree
            v a vertex of tree that is not a leaf node
            R the evaluation function, such as Entropy or Gini, which has to be
              defined on the vertices of V (for the training data that gets classified to v)
    Output: alpha_eff the effective alpha for that vertex
    1. Let L be the set of leaf vertices of tree with v upstream of them
    2. Set total_leaf_weight = sum(R(x) for x in L)
    3. Return (R(v) - total_leaf_weight) / (|L| - 1)
```
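To make the heuristic concrete, here is a minimal Python sketch of the same weakest-link loop on a toy tree. The dict-based node structure, the assumption that each node's `R` value is already weighted by the fraction of training data reaching it, and the toy numbers are illustrative choices, not part of the definitions above.

```python
# Minimal sketch of weakest-link pruning on a toy tree.
# Assumption: each node is a dict with "R" (evaluation function, already
# weighted by the share of training data reaching the node) and, for
# internal nodes, "left"/"right" children.

def leaves(node):
    """All leaf nodes in the subtree rooted at node."""
    if "left" not in node:  # leaves carry no children
        return [node]
    return leaves(node["left"]) + leaves(node["right"])

def internal_nodes(node):
    """All non-leaf nodes in the subtree rooted at node."""
    if "left" not in node:
        return []
    return [node] + internal_nodes(node["left"]) + internal_nodes(node["right"])

def effective_alpha(node):
    """alpha_eff(v) = (R(v) - sum of leaf R values) / (number of leaves - 1)."""
    subtree_leaves = leaves(node)
    return (node["R"] - sum(leaf["R"] for leaf in subtree_leaves)) / (len(subtree_leaves) - 1)

def ccp_prune(tree, alpha):
    """Collapse the internal node with the smallest alpha_eff until every
    remaining internal node has alpha_eff >= alpha."""
    while True:
        candidates = internal_nodes(tree)
        if not candidates:
            return tree  # pruned all the way down to the root
        weakest = min(candidates, key=effective_alpha)
        if effective_alpha(weakest) >= alpha:
            return tree
        weakest.pop("left")   # pruning at the weakest node
        weakest.pop("right")  # turns it into a leaf

# Toy tree: with alpha = 0.1 the right branch (alpha_eff = 0.06) is pruned,
# then the root (alpha_eff = 0.15 after that prune) is kept.
toy_tree = {"R": 0.40,
            "left": {"R": 0.05},
            "right": {"R": 0.20,
                      "left": {"R": 0.08},
                      "right": {"R": 0.06}}}
pruned = ccp_prune(toy_tree, alpha=0.1)
print(pruned)  # {'R': 0.4, 'left': {'R': 0.05}, 'right': {'R': 0.2}}
```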
## Run time

As noted above, iterating over all potential sub-trees of $\hat{f}$ to find the one that optimises $R_{\alpha}$ on $T$ is computationally expensive. Using the effective cost complexity heuristic instead cuts down the run time, but it can still take $O(\vert V \vert^3)$.

## Correctness

This can reduce overfitting, however the parameter $\alpha$ needs to be fine-tuned. It is best to use [[Cross validation|cross validation]] for this purpose.
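For example, scikit-learn exposes this pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`, and `cost_complexity_pruning_path` returns the effective alphas at which the fully grown tree loses its weakest branch. A rough sketch of choosing $\alpha$ by cross-validation (the dataset and `cv=5` are arbitrary illustrative choices):

```python
# Sketch: tuning alpha by cross-validation with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# The pruning path gives the effective alphas at which the fully grown
# tree would lose its weakest branch.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

best_alpha, best_score = 0.0, -float("inf")
for alpha in path.ccp_alphas:
    scores = cross_val_score(
        DecisionTreeClassifier(ccp_alpha=alpha, random_state=0), X, y, cv=5
    )
    if scores.mean() > best_score:
        best_alpha, best_score = alpha, scores.mean()

print(f"best ccp_alpha = {best_alpha:.4f}, cv accuracy = {best_score:.3f}")
```

Only the alphas on the pruning path need to be tried, since the pruned tree only changes at those values.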