# Week 4 - Support Vector Machines

## Linearly separable

We are going to revisit the idea of linearly separable points in the plane.

**Linearly separable.** Two sets of points $A, B \subseteq \mathbb{R}^n$ are linearly separable if there is some hyperplane
$$H = \{x \in \mathbb{R}^n : w \cdot x + b = 0\}$$
defined for $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$, such that $w \cdot a + b > 0$ for all $a \in A$ and $w \cdot a' + b < 0$ for all $a' \in B$.

Suppose we have training data $T$ consisting of pairs $(x^t, y^t)$ for $t \in T$, with $x^t \in \mathbb{R}^n$ and $y^t \in \{-1, 1\}$, such that $A = \{x^t : y^t = 1\}$ and $B = \{x^t : y^t = -1\}$ are linearly separable. This means that there exists some $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ such that
$$y^t (w \cdot x^t + b) > 0$$
for every $t \in T$.
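To make the condition concrete, here is a minimal numpy sketch; the data and the candidate $(w, b)$ are made up for illustration:

```python
import numpy as np

# A tiny made-up training set in the plane: points x^t with labels y^t in {-1, +1}.
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.0, 0.0], [0.5, -1.0]])
y = np.array([1, 1, -1, -1])

# A candidate separator (w, b), chosen by eye for this toy data.
w = np.array([1.0, 1.0])
b = -2.0

# Linear separability means y^t (w . x^t + b) > 0 for every t.
print(y * (X @ w + b))               # all entries positive
print(np.all(y * (X @ w + b) > 0))   # True
```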

## Dot product

**Dot product.** For two vectors $u, v \in \mathbb{R}^n$ the dot product is
$$u \cdot v = \sum_{i=1}^n u_i v_i.$$
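As a quick worked example of the definition:
$$(1, 2, 3) \cdot (4, -1, 2) = 1 \cdot 4 + 2 \cdot (-1) + 3 \cdot 2 = 8.$$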

Geometrically, $w$ and $b$ represent a hyperplane $\{x \in \mathbb{R}^n : w \cdot x + b = 0\}$: $w$ is the normal vector, perpendicular to the hyperplane, and $b$ fixes where the hyperplane sits, since any point $p$ on the hyperplane satisfies $w \cdot p + b = 0$.

![[SVM-example]]

## Direction of $w$

When defining a hyperplane you always have a choice of normal vector $w$, both in scale (we will come back to that later) and direction. The direction of $w$ dictates which side of the hyperplane is “positive”,

- if $x = p + \lambda w$ with $\lambda > 0$, for some point $p$ on the hyperplane, then the point lies on the “positive” side as $w \cdot x + b > 0$, whereas
- if $x = p + \lambda w$ with $\lambda < 0$ then the point lies on the “negative” side as $w \cdot x + b < 0$.
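To see why, write $x = p + \lambda w$ for a point $p$ on the hyperplane (so $w \cdot p + b = 0$); then
$$w \cdot x + b = (w \cdot p + b) + \lambda (w \cdot w) = \lambda \lVert w \rVert^2,$$
which is positive exactly when $\lambda > 0$, i.e. when $x$ sits on the side that $w$ points towards.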

## Best separator

When you have some points that are linearly separable there might be many choices of $w$ and $b$. But which one is the best? And what does “best” even mean?

Ideally we would want to choose a line that:

- linearly separates the points, whilst
- maximising the minimum distance between the points and the line.

The intuition as to why we would want this: by making our separating line as far away from the points as it can be, whilst still separating them, we are avoiding overfitting as best we can. The line is as far away from our training data as it possibly can be.

## Margin

Suppose we have a linear separator defined by $w$ and $b$. Once again set $A = \{x^t : y^t = 1\}$ and $B = \{x^t : y^t = -1\}$ for our training data $T$. It is useful to have a concept for the minimum distance from a point in $A \cup B$ to the hyperplane $H = \{x : w \cdot x + b = 0\}$.

**Margin for a linear separator.** Suppose we have a set of linearly separable points $X$ such that $w$ and $b$ separate them. We define the margin of the hyperplane $H = \{x : w \cdot x + b = 0\}$ with respect to $X$ to be
$$\gamma = \min_{x \in X} \frac{\lvert w \cdot x + b \rvert}{\lVert w \rVert}.$$
In other words, this is the smallest distance from a point in $X$ to the hyperplane $H$.

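A minimal sketch of computing this margin with numpy, reusing the made-up points and separator from earlier:

```python
import numpy as np

def margin(X, w, b):
    """Smallest distance from any point in X to the hyperplane w . x + b = 0."""
    return np.min(np.abs(X @ w + b)) / np.linalg.norm(w)

# The same made-up points and candidate separator as before.
X = np.array([[2.0, 2.0], [3.0, 1.5], [0.0, 0.0], [0.5, -1.0]])
w, b = np.array([1.0, 1.0]), -2.0

print(margin(X, w, b))   # distance from the closest point to the hyperplane
```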

## Calculating the best $w$ and $b$

Let $x^+ \in A$ and $x^- \in B$ be the points such that
$$x^+ = \operatorname{arg\,min}_{x \in A} \frac{\lvert w \cdot x + b \rvert}{\lVert w \rVert} \quad \mbox{ and } \quad x^- = \operatorname{arg\,min}_{x \in B} \frac{\lvert w \cdot x + b \rvert}{\lVert w \rVert},$$
i.e. these are the positive and negative points that are closest to the line. If we have chosen $w$ and $b$ optimally we would have that
$$\lvert w \cdot x^+ + b \rvert = \lvert w \cdot x^- + b \rvert,$$
i.e. the two closest points are equidistant from the hyperplane.

The distance between the hyperplanes defined by $w \cdot x + b = c$ and $w \cdot x + b = -c$, where $c = \lvert w \cdot x^+ + b \rvert$, is given by
$$\frac{2c}{\lVert w \rVert}.$$
This gives the margin to be $\frac{c}{\lVert w \rVert}$. So to pick the best $w$ and $b$ we will need to maximise $\frac{c}{\lVert w \rVert}$. However, notice we can always scale $w$ and $b$ so that
$$w \cdot x^+ + b = 1 \quad \mbox{ and } \quad w \cdot x^- + b = -1.$$
Therefore if we make the linearly separable constraint that
$$y^t (w \cdot x^t + b) \geq 1 \mbox{ for all } t \in T,$$
then to maximise the margin is the same as maximising $\frac{1}{\lVert w \rVert}$.
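One way to see where the $\frac{2c}{\lVert w \rVert}$ spacing comes from: take a point $x_1$ with $w \cdot x_1 + b = c$ and walk along the unit normal, $x_2 = x_1 - \delta \frac{w}{\lVert w \rVert}$. Then
$$w \cdot x_2 + b = c - \delta \lVert w \rVert,$$
which equals $-c$ exactly when $\delta = \frac{2c}{\lVert w \rVert}$.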

Therefore we now have an optimisation problem.

For linearly separable $T$, find $w \in \mathbb{R}^n$ and $b \in \mathbb{R}$ which achieve
$$\max_{w, b} \frac{1}{\lVert w \rVert} \ \mbox{ such that } \ y^t (w \cdot x^t + b) \geq 1 \mbox{ for all } t \in T.$$
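This can be handed to a generic constrained solver; a minimal sketch using scipy on made-up toy data (it solves the equivalent minimisation of $\lVert w \rVert^2$, the same switch made in a later note):

```python
import numpy as np
from scipy.optimize import minimize

# Made-up linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

# Pack the variables as v = [w_1, w_2, b]. Maximising 1/||w|| is the same as
# minimising ||w||^2 subject to the constraints y^t (w . x^t + b) >= 1.
objective = lambda v: v[0] ** 2 + v[1] ** 2
constraints = [
    {"type": "ineq", "fun": lambda v, x=x, t=t: t * (x @ v[:2] + v[2]) - 1.0}
    for x, t in zip(X, y)
]

res = minimize(objective, x0=[1.0, 1.0, 0.0], method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b, "margin =", 1.0 / np.linalg.norm(w))
```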

## Support vectors

Training points $x^t$ where $y^t (w \cdot x^t + b) = 1$ are called support vectors, as they constrain the hyperplane $H$.

## Back to reality

Our training data is nearly never linearly separable, so how can we adapt what we have to handle this?

Instead of enforcing
$$y^t (w \cdot x^t + b) \geq 1 \mbox{ for all } t \in T,$$
we allow some error $\xi_t \geq 0$ for each training point and we want to minimise it.

For training data $T$, find $w \in \mathbb{R}^n$, $b \in \mathbb{R}$ and $\xi_t \geq 0$ which achieve
$$\min_{w, b, \xi} \frac{1}{2}\lVert w \rVert^2 + C \sum_{t \in T} \xi_t \ \mbox{ such that } \ y^t (w \cdot x^t + b) \geq 1 - \xi_t \mbox{ for all } t \in T.$$

**Switched max for min.** We transformed the maximisation of $\frac{1}{\lVert w \rVert}$ into a minimisation of $\frac{1}{2}\lVert w \rVert^2$. Squaring removes the square root inside the norm without changing which $w$ is optimal, and the factor of a half is just a convention that makes the gradient $w$ rather than $2w$.

We have added a trade-off parameter $C$: do we care more about a tight fit or a lower error term?
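A minimal sketch of this trade-off using scikit-learn's linear SVM on made-up, overlapping data; a small $C$ tolerates more error (wider margin), a large $C$ pushes for a tighter fit:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping blobs, so a perfect linear separation is impossible.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C}: margin ~ {1 / np.linalg.norm(clf.coef_):.3f}, "
          f"support vectors: {clf.n_support_.sum()}")
```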

## Here comes the magic

This is a quadratic programming problem and can be transformed into the “Dual Optimization Problem”, which is as follows:
$$ \max_{\alpha} \sum_{t \in T} \alpha_t - \frac{1}{2} \sum_{t,s \in T} \alpha_t \alpha_s y^t y^s (x^t \cdot x^s)$$
such that
$$ \alpha_t \geq 0 \mbox{ for all } t \in T, \mbox{ and } \sum_{t \in T} \alpha_t y^t = 0.$$
We turn this into a classifier by setting:
$$\hat{f}(x) = \mbox{sgn}\left ( \sum_{t \in T} \alpha_t y^t (x^t \cdot x) + b \right )$$
where
$$b = y^s - \sum_{t \in T} \alpha_t y^t (x^t \cdot x^s), \mbox{ for any } s \in T \mbox{ such that } \alpha_s \neq 0.$$
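As a sanity check, that formula for $b$ can be recovered from a fitted model; a sketch using scikit-learn, whose `dual_coef_` stores $\alpha_t y^t$ for the support vectors (toy data assumed, large $C$ to mimic the hard margin so every support vector sits on the margin):

```python
import numpy as np
from sklearn.svm import SVC

# Made-up separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [1.0, 0.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # large C ~ hard margin

alpha_y = clf.dual_coef_[0]        # alpha_t * y^t for each support vector
sv = clf.support_vectors_          # the x^t with alpha_t != 0

# b = y^s - sum_t alpha_t y^t (x^t . x^s) for a support vector x^s on the margin.
s = clf.support_[0]
b_recovered = y[s] - alpha_y @ (sv @ X[s])

print(b_recovered, clf.intercept_[0])   # the two values should agree
```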

The form should remind you of [[Boosting]], where we take an average of lots of different views on the data. However, in reality lots of these $\alpha_t = 0$ if they are not the support vectors closest to the line. So it might be more correct to think of this as close to [[k-nearest neighbour|KNN]].

## Handling highly non-separable data

In reality data could be far from linearly separable. For example, for [[Exclusive or|xor]] we have the following embedding in $\mathbb{R}^2$ of the 4 training points
$$\{((1,1), -1), ((1,-1), 1), ((-1,-1), -1), ((-1,1), 1)\}.$$

![[xor_embedding]]

This is far from linearly separable. However, we can define a new embedding mapping
$$\Phi: \mathbb{R}^2 \rightarrow \mathbb{R}^6, \ \mbox{ by } \ (x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2} x_1x_2, \sqrt{2} x_1, \sqrt{2} x_2, 1)$$
which, when applied to the [[Function domain|domain]] of our training data, rearranges our points to be

![[xor_mapped_embedding]]

which are easily [[Linearly separable|linearly separable]]. Though how do we choose these embeddings?

## Kernel methods

![[Kernel trick]]

There are some popular [[Kernel trick|kernels]] to try.

![[Polynomial kernel (SVMs)]]

![[Gaussian kernel (SVM)]]

![[Sigmoid kernel (SVM)]]

## Support Vector Machines

We are now going to take the idea of the [[Kernel trick|kernel trick]], where we embed our feature space $A$ using $\Phi$ into a space in which our dataset is closer to [[Linearly separable|linearly separable]]. This turns our optimisation problem from before into
$$ \max_{\alpha} \sum_{t \in T} \alpha_t - \frac{1}{2} \sum_{t,s \in T} \alpha_t \alpha_s y^t y^s (\Phi(x^t) \cdot \Phi(x^s))$$
such that
$$ \alpha_t \geq 0 \mbox{ for all } t \in T, \mbox{ and } \sum_{t \in T} \alpha_t y^t = 0.$$
We turn this into a classifier by setting:

$$\hat{f}(x) = \mbox{sgn}\left ( \sum_{t \in T} \alpha_t y^t (\Phi(x^t) \cdot \Phi(x)) + b \right )$$
where
$$b = y^s - \sum_{t \in T} \alpha_t y^t (\Phi(x^t) \cdot \Phi(x^s)), \mbox{ for any } s \in T \mbox{ such that } \alpha_s \neq 0.$$

However, notice we only ever apply $\Phi$ to two vectors which immediately have the dot product taken between them - so we can replace $\Phi(x^t) \cdot \Phi(x^s)$ with a kernel $K(x^t, x^s)$ instead.
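A minimal sketch tying the two ideas together: numerically, the 6-dimensional embedding above satisfies $\Phi(x) \cdot \Phi(z) = (x \cdot z + 1)^2$, which is a degree-2 polynomial kernel, and an SVM with that kernel separates the xor points (scikit-learn assumed; its polynomial kernel is $(\gamma \, x \cdot z + c_0)^d$):

```python
import numpy as np
from sklearn.svm import SVC

def phi(x):
    """The explicit 6-dimensional embedding from the xor example above."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

# The dot product in the embedded space equals the degree-2 polynomial kernel.
x, z = np.array([1.0, -1.0]), np.array([0.5, 2.0])
print(phi(x) @ phi(z), (x @ z + 1) ** 2)   # the two numbers agree

# The four xor training points from above, fit with that kernel.
X = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
y = np.array([-1, 1, -1, 1])
clf = SVC(kernel="poly", degree=2, gamma=1, coef0=1, C=1e6).fit(X, y)
print(clf.predict(X))                      # expected: [-1  1 -1  1]
```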

**Support Vector Machines (SVM).** Support Vector Machines operate using a supervised modelling framework to try to linearly separate data points. Suppose we have some training data $T$ consisting of pairs $(x^t, y^t)$ with $x^t \in \mathbb{R}^n$ and $y^t \in \{-1, 1\}$. The method utilises the kernel trick to change the topology of the feature space of our data whilst still keeping the computation relatively simple. Let $K$ represent such a kernel. Then we solve the following optimisation problem
$$ \max_{\alpha} \sum_{t \in T} \alpha_t - \frac{1}{2} \sum_{t,s \in T} \alpha_t \alpha_s y^t y^s K(x^t, x^s)$$
such that
$$ \alpha_t \geq 0 \mbox{ for all } t \in T, \mbox{ and } \sum_{t \in T} \alpha_t y^t = 0.$$
We turn this into a classifier by setting:
$$\hat{f}(x) = \mbox{sgn}\left ( \sum_{t \in T} \alpha_t y^t K(x^t, x) + b \right )$$
where

$$b = y^s - \sum_{t \in T} \alpha_t y^t K(x^t, x^s), \mbox{ for any } s \in T \mbox{ such that } \alpha_s \neq 0.$$

Note that $K$ needs to obey [[Mercer’s condition]] for the underlying mapping of the feature space to exist.

## Run time

The complexity of the kernel function can add a large overhead to the run time for training this model.

## Correctness

The accuracy of this model depends heavily on the choice of the kernel function, as the kernel defines the notion of similarity between two vectors.
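A minimal sketch of the kernel choice mattering, on made-up data (two concentric circles) where a linear kernel cannot do well but a Gaussian (RBF) kernel can:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Made-up data that is far from linearly separable: two concentric circles.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, gamma=1.0).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
```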

## Why doesn’t Boosting have as many problems with Overfitting?

In practice when using Boosting we tend to find the training error and testing error follow each other, rather than separating as happens with overfitting. This is connected to SVMs by the concept of margins: when we talked about SVMs we stated that a larger margin was better as it reduced overfitting. The margin is analogous to the confidence level we have in our model.

In relation to Boosting, we can think of the algorithm as a complicated way of projecting each point onto a real value, the weighted vote of the weak learners, before taking the sign of that value to determine our prediction. As we train for longer we separate the data points more and more, increasing the margin and thus our confidence in the outcomes. This counter-intuitively reduces overfitting in a lot of cases.

This does not hold for all Boosting: for example, if we use it with neural networks it will still be liable to overfit.