In machine learning, ensemble learning is the approach of training several models on the same problem and combining their results to produce a new, stronger model.
The idea of ensemble learning is to look at a subset of the data and try to model that subset well. If we do this for enough small patches, we hope we can combine the results to get a good overall understanding of the data.
This is a powerful idea when the function we are trying to learn doesn't generalise globally but does generalise locally, for example when filtering spam out of an inbox.
Bagging treated all data points equally and didn't use how well or poorly we performed on a given data point when picking the next subset. If we fixed this we could potentially tighten up our model.
Models that are considered good should always do better than chance.
Weak learner
Suppose we are in the modelling framework with $B = \{-1, 1\}$. A model $h$ is called a weak learner if for all probability distributions $D$ over $A \times B$ there exists a sufficiently small $\gamma > 0$ such that
$\Pr_{(a, b) \sim D}\left[\mathrm{sgn}(h(a)) \neq b\right] \leq \frac{1}{2} - \gamma.$
That is, its error rate is just below chance.
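As a concrete (hypothetical) example, a decision stump, a classifier that thresholds a single feature, is a standard choice of weak learner. The sketch below is illustrative only: the names train_stump, points, labels and weights are my own, not from the original. Because it tries both orientations of every threshold, its weighted error is at most 1/2 for any distribution.

def train_stump(points, labels, weights):
    # A decision stump thresholds one feature and outputs -1 or +1.
    # We search every (feature, threshold, orientation) and keep the one
    # with the smallest weighted error under the given distribution.
    best = None
    for j in range(len(points[0])):
        for threshold in sorted({p[j] for p in points}):
            for orient in (+1, -1):
                error = sum(
                    w for p, b, w in zip(points, labels, weights)
                    if orient * (1 if p[j] > threshold else -1) != b
                )
                if best is None or error < best[0]:
                    best = (error, j, threshold, orient)
    _, j, threshold, orient = best
    # The returned stump maps a point in A to {-1, +1}.
    return lambda p, j=j, t=threshold, s=orient: s * (1 if p[j] > t else -1)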
Boosting is an ensemble learning method that trains multiple instances of another model and combines them into a classifier that is better than any individual one.
Suppose we are in the modelling framework where $B = \{-1, 1\}$ and we have a training set $T = \{(a_1, b_1), \ldots, (a_m, b_m)\}$ where each $a_t \in A$ and $b_t \in B$. Boosting iteratively trains classifiers based on how the past classifiers have performed. To do this, before training we define a distribution $D_1$ on $T$.
We set
$D_1(t) = \frac{1}{|T|} \text{ for all } t \in T.$
To define the other terms let's introduce some notation. At iteration $i$ we train a classifier $h_i$ against the current distribution $D_i$ and measure its weighted error
$\epsilon_i = \sum_{t \,:\, \mathrm{sgn}(h_i(a_t)) \neq b_t} D_i(t).$
From this we set the learning rate $\alpha_i = \frac{1}{2}\ln\left(\frac{1 - \epsilon_i}{\epsilon_i}\right)$ and update the distribution by
$D_{i+1}(t) = \frac{D_i(t)\, e^{-\alpha_i b_t h_i(a_t)}}{Z_i},$
where $Z_i$ is the normalising constant that makes $D_{i+1}$ sum to $1$.
Note that if $b_t$ and $h_i(a_t)$ share the same sign then the exponent is negative and we decrease the weight of $(a_t, b_t)$, whereas if they don't we increase its weight (assuming $\alpha_i > 0$; if the classifier does worse than chance then $\alpha_i < 0$ and this is reversed!).
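As a quick illustration, for a classifier that outputs $\pm 1$ and $\alpha_i = 0.5$: a point the classifier got right has its (unnormalised) weight multiplied by $e^{-0.5} \approx 0.61$, while a point it got wrong is multiplied by $e^{0.5} \approx 1.65$, so the next classifier pays more attention to the mistakes.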
Once we have trained for some large number of rounds $N$, we aggregate this all by defining our output
$f(x) = \mathrm{sgn}\left(\sum_{i=1}^{N} \alpha_i h_i(x)\right),$
with $\mathrm{sgn}$ being the sign function.
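For instance, with three trained classifiers whose weights are $\alpha_1 = 0.8$, $\alpha_2 = 0.3$, $\alpha_3 = 0.5$ and whose votes on some $x$ are $+1$, $-1$, $+1$, the weighted sum is $0.8 - 0.3 + 0.5 = 1 > 0$, so $f(x) = +1$.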
Pseudocode
Boosting(classifier, N, T):
    Input:
        classifier - some way to train a classifier given some training data and a distribution.
        N - the number of iterations we want to do.
        T - pairs of points (a_t, b_t) to train on.
    Output: a model f from the space A to {-1, 1}.
    1. Set D: T -> [0,1] to be D(t) = 1/|T|
    2. For i = 1, ..., N do
        2.1. Let h_i = classifier(T, D)
        2.2. Set alpha_i = calculate_learning_rate(h_i, T, D)
        2.3. Set D = update_distribution(h_i, T, D, alpha_i)
    3. Return the model f defined by f(x) = sgn(sum(alpha_i * h_i(x) for i = 1, ..., N))

calculate_learning_rate(h, T, D):
    Input:
        h - a classifier mapping points in A to a real number.
        T - pairs of points (a_t, b_t) to train on.
        D - a distribution over T.
    Output: a learning rate for this iteration.
    1. Set e = 0
    2. For each t in T
        2.1. If b_t != sgn(h(a_t)) then
            2.1.1. Set e = e + D(t)
    3. Return ln((1 - e)/e)/2

update_distribution(h, T, D, alpha):
    Input:
        h - a classifier mapping points in A to a real number.
        T - pairs of points (a_t, b_t) to train on.
        D - a distribution over T.
        alpha - a learning rate.
    Output: a new distribution over T.
    1. For each t in T
        1.1. Set C_t = D(t) * exp(-alpha * b_t * h(a_t))
    2. Set Z = sum(C_t for t in T)
    3. Initialise a new distribution D' over T
    4. For each t in T
        4.1. Set D'(t) = C_t / Z
    5. Return D'
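Below is a minimal Python sketch of this procedure, assuming the labels b_t are in {-1, +1} and that classifier(T, D) returns a real-valued function on A. The function names mirror the pseudocode, but the details (and the helper sign) are my own, not part of the original.

import math

def boosting(classifier, N, T):
    # T is a list of pairs (a_t, b_t) with b_t in {-1, +1}.
    D = [1 / len(T)] * len(T)                      # step 1: uniform distribution over T
    hs, alphas = [], []
    for _ in range(N):                             # step 2
        h = classifier(T, D)                       # 2.1: train on the current distribution
        alpha = calculate_learning_rate(h, T, D)   # 2.2
        D = update_distribution(h, T, D, alpha)    # 2.3
        hs.append(h)
        alphas.append(alpha)
    # step 3: the final model is the sign of the weighted vote of the weak classifiers
    return lambda x: sign(sum(a * h(x) for a, h in zip(alphas, hs)))

def calculate_learning_rate(h, T, D):
    # Weighted error of h under D; assumes 0 < e < 1 (a genuinely weak learner).
    e = sum(d for (a, b), d in zip(T, D) if b != sign(h(a)))
    return math.log((1 - e) / e) / 2

def update_distribution(h, T, D, alpha):
    # Reweight each point, then normalise so the new weights sum to 1.
    C = [d * math.exp(-alpha * b * h(a)) for (a, b), d in zip(T, D)]
    Z = sum(C)
    return [c / Z for c in C]

def sign(x):
    return 1 if x >= 0 else -1

With the hypothetical stump trainer sketched earlier, one could call, say, f = boosting(lambda T, D: train_stump([a for a, _ in T], [b for _, b in T], D), 50, T).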