Week 3 - Ensemble Bagging and Boosting

Ensemble learning

Ensemble learning

In machine learning, ensemble learning is the approach of training several different models on the same problem and combining their results to produce a new model.

Link to original

The idea of Ensemble learning is to look at a subset of the data and try to model it well. If we do this for enough small patches, we hope to combine them into a good overall understanding of the data.

This is a powerful idea when the function we are trying to model doesn't generalise globally but instead generalises locally, for example when filtering spam out of an inbox.

Bagging

The simplest example of Ensemble learning is Bagging.

Bagging

Bagging

This is one of the simplest Ensemble learning methods, yet it outperforms classic modelling methods on certain problems.

Suppose we are in the modelling framework with training data $T$. Bagging is then the following process:

  1. We choose a set of modelling algorithms $M_1, \ldots, M_k$ that could fit the problem - these could all be the same.
  2. We take random subsets $T_i$ of the training data $T$, for $1 \leq i \leq k$, with replacement - so two samples could contain the same data point.
  3. We then train a model $f_i$ using algorithm $M_i$ with training data $T_i$, for $1 \leq i \leq k$.
  4. Then we have some method of averaging these models $f_1, \ldots, f_k$ over our problem space to produce our final model $f$.

Example

Suppose we want to use polynomial regression on a simple function $g: \mathbb{R} \rightarrow \mathbb{R}$ with training data $T$.

We could, instead of running it once, randomly select some subsets $T_1, \ldots, T_k$ and then train $f_i$ using polynomial regression on $T_i$.

Then we set our final model $$f(x) = \frac{1}{k} \sum_{i=1}^{k} f_i(x).$$
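
A minimal sketch of this example in Python, assuming NumPy's `polyfit` as the polynomial regression step and a plain mean as the averaging method; the target function, sample sizes, and degree are only illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative training data from a noisy target (assumption: g(x) = sin(x)).
a = np.linspace(0, 2 * np.pi, 40)
b = np.sin(a) + rng.normal(scale=0.2, size=a.shape)

k, degree = 10, 5          # number of bagged models and polynomial degree
models = []
for _ in range(k):
    # 2. random subset of the training data, drawn with replacement
    idx = rng.integers(0, len(a), size=len(a))
    # 3. train polynomial regression on that subset
    models.append(np.polyfit(a[idx], b[idx], degree))

def f(x):
    # 4. average the individual models to get the final model
    return np.mean([np.polyval(coeffs, x) for coeffs in models], axis=0)

print(f(np.array([0.0, np.pi / 2, np.pi])))
```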

Correctness

Bagging tends to lead to less overfitting, so it can help algorithms that are particularly prone to this, like polynomial regression.

Link to original

Bagging treated all data points equally and didn't take into account whether we performed well or poorly on a given data point when picking the next subset. If we fixed this we could potentially tighten up our model.

Error rate and weak learners

Error rate (modelling)

Error rate (modelling)

Suppose we are in the modelling framework, with a model $f$ of the underlying function $g: A \rightarrow B$, and we have a probability distribution $D$ on our domain $A$. We define the error rate to be $$\epsilon = \Pr_{a \sim D}\left[f(a) \neq g(a)\right] = \mathbb{E}_{a \sim D}\left[\mathbb{1}(f(a) \neq g(a))\right],$$ where $\mathbb{1}$ is the indicator function. If $D$ is discrete this is simply adding up the probabilities of the values on which our model gets it wrong.

Link to original

Models that are considered good should always do better than chance.
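
A minimal sketch of the discrete case in Python; the model `f`, target function `g`, and distribution `D` below are illustrative stand-ins, not objects defined in these notes:

```python
def error_rate(f, g, D):
    """Weighted error of model f against target g under a discrete distribution D.

    D maps each point of the domain to its probability; we sum the probability
    of every point on which the model disagrees with the target.
    """
    return sum(p for a, p in D.items() if f(a) != g(a))

# Toy check: a model that is wrong only on the point carrying probability 0.2.
g = lambda a: 1 if a >= 0 else -1
f = lambda a: 1                      # always predicts 1
D = {-1.0: 0.2, 0.5: 0.3, 2.0: 0.5}
print(error_rate(f, g, D))           # 0.2
```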

Weak learner

Weak learner

Suppose we are in the modelling framework. A model is called a weak learner if for all probability distributions $D$ there exists a sufficiently small $\gamma > 0$ such that its error rate $\epsilon$ with respect to $D$ satisfies $$\epsilon \leq \frac{1}{2} - \gamma.$$ That is, its error rate is just below chance.

Link to original

Boosting

Boosting

Boosting

Boosting is an Ensemble learning method that trains multiple instances of another model and combines them to make a classifier that is better than any one individual instance.

Suppose we are in the modelling framework where $B = \{-1, 1\}$ and we have a training set $T$ of pairs $(a_t, b_t)$ where $a_t \in A$ and $b_t \in \{-1, 1\}$. Boosting iteratively trains classifiers $h_1, h_2, \ldots$ based on how the past classifiers have performed. To do this, before training $h_i$ we define a distribution $D_i$ on $T$.

We set $$D_1(t) = \frac{1}{|T|} \quad \text{for all } t \in T.$$
To define the other terms let's introduce some notation.

Let $\epsilon_i$ be the error rate of $h_i$ with respect to $D_i$, $$\epsilon_i = \sum_{t \,:\, \mathrm{sgn}(h_i(a_t)) \neq b_t} D_i(t).$$

Set our learning rate $$\alpha_i = \frac{1}{2} \ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right).$$

Other choices of $\alpha_i$ can be made.

Now define $D_{i+1}$ iteratively by setting $$D_{i+1}(t) = \frac{D_i(t)\, e^{-\alpha_i b_t h_i(a_t)}}{Z_i}, \qquad Z_i = \sum_{t \in T} D_i(t)\, e^{-\alpha_i b_t h_i(a_t)},$$ where $Z_i$ is a normalising constant making $D_{i+1}$ a distribution again.

Note that if $h_i(a_t)$ and $b_t$ share the same sign then the exponent of $e$ is negative and we decrease the weight of $t$, whereas if they don't we increase the weight of $t$ (generally speaking, there are some cases where this isn't true because of the normalisation by $Z_i$!).

Once we have trained $h_1, \ldots, h_N$ for some large number $N$ of iterations, we aggregate this all by defining our output $$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i h_i(x)\right),$$ with $\mathrm{sgn}$ being the Sign function.
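
As a quick worked example with illustrative numbers: if $h_i$ has error rate $\epsilon_i = 0.25$ with respect to $D_i$, then

$$\alpha_i = \frac{1}{2}\ln\left(\frac{1 - 0.25}{0.25}\right) = \frac{1}{2}\ln 3 \approx 0.549,$$

so, assuming $h_i$ takes values in $\{-1, 1\}$, before normalising by $Z_i$ each correctly classified point has its weight multiplied by $e^{-0.549} \approx 0.58$ and each misclassified point has its weight multiplied by $e^{0.549} \approx 1.73$.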

Pseudocode

Boosting(classifier, N, T):
	Input:
		classifier - some way to train a classifier given some training data and a distribution.
		N - the number of iterations we want to do.
		T - pairs of points (a_t, b_t) to train on.
	Output:
		a model f from the space A to {-1, 1}.
1. Set D: T -> [0,1] to be D(t) = 1/|T|
2. For i = 1, ..., N do
	2.1. Let h_i = classifier(T, D)
	2.2. Set alpha_i = calculate_learning_rate(h_i, T, D)
	2.3. Set D = Update_distribution(h_i, T, D, alpha_i)
3. Return the model f(x) = sgn(sum(alpha_i h_i(x) for i = 1, ..., N))
 
calculate_learning_rate(h, T, D):
	Input:
		h - A classifier mapping points in A to a real number 
		T - pairs of points (a_t, b_t) to train on.
		D - A distribution over T
	Output:
		a learning rate for this iteration
1. Set e = 0
2. For each t in T
	2.1. If b_t != sgn(h(a_t)) then
		2.1.1. Set e = e + D(t)
3. Return ln((1-e)/e)/2
 
Update_distribution(h, T, D, alpha):
	Input:
		h - A classifier mapping points in A to a real number 
		T - pairs of points (a_t, b_t) to train on.
		D - A distribution over T
		alpha - a learning rate
	Output:
		a new distribution over T
1. For each t in T
	1.1. Set C_t = D(t) exp(- alpha b_t h(a_t))
2. Set Z = sum(C_t for t in T)
3. Initialise a new distribution D' over T
4. For each t in T
	4.1. Set D'(t) = C_t/Z
5. Return D'
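
The pseudocode above translates fairly directly into Python. A minimal sketch, assuming `T` is a list of `(a_t, b_t)` pairs indexed by position and `classifier(T, D)` returns a hypothesis `h` callable on points of `A`; clamping the error away from 0 and 1 is an extra safeguard not present in the pseudocode:

```python
import math

def boosting(classifier, N, T):
    """Train N weak classifiers on T = [(a_t, b_t)] with b_t in {-1, 1}
    and combine them into a single sign classifier."""
    D = {t: 1 / len(T) for t in range(len(T))}    # 1. uniform initial distribution
    hypotheses, alphas = [], []
    for _ in range(N):                            # 2. iterate
        h = classifier(T, D)                      # 2.1 train against the current weights
        alpha = calculate_learning_rate(h, T, D)  # 2.2 learning rate from the weighted error
        D = update_distribution(h, T, D, alpha)   # 2.3 reweight the training points
        hypotheses.append(h)
        alphas.append(alpha)
    # 3. final model: sign of the alpha-weighted vote (ties broken towards +1)
    return lambda x: 1 if sum(a * h(x) for a, h in zip(alphas, hypotheses)) >= 0 else -1

def calculate_learning_rate(h, T, D):
    """alpha = ln((1 - e) / e) / 2, where e is the D-weighted error rate of h."""
    e = sum(D[t] for t, (a, b) in enumerate(T) if b * h(a) <= 0)
    e = min(max(e, 1e-12), 1 - 1e-12)             # keep the logarithm finite
    return 0.5 * math.log((1 - e) / e)

def update_distribution(h, T, D, alpha):
    """New weights proportional to D(t) * exp(-alpha * b_t * h(a_t))."""
    C = {t: D[t] * math.exp(-alpha * b * h(a)) for t, (a, b) in enumerate(T)}
    Z = sum(C.values())
    return {t: c / Z for t, c in C.items()}
```

Any weak learner that takes the weights `D` into account when fitting can be plugged in as `classifier`; the training pairs are indexed by position so the distribution can be stored as a plain dictionary.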

Correctness

Link to original