When defining a hyperplane $w \cdot x + b = 0$ you always have a choice of normal vector $w$, both in scale (we will come back to that later) and direction. The direction of $w$ dictates which side of the hyperplane is “positive”:
if $x$ is a point with $w \cdot x + b > 0$ then $x$ lies on the “positive” side, as $\mathrm{sign}(w \cdot x + b) = +1$, whereas
if $x$ is a point with $w \cdot x + b < 0$ then $x$ lies on the “negative” side, as $\mathrm{sign}(w \cdot x + b) = -1$.
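As a minimal sketch of this sign convention (assuming the hyperplane is given as NumPy arrays; the helper name `which_side` is made up for illustration):

```python
import numpy as np

def which_side(w, b, x):
    """Return +1 if x lies on the "positive" side of the hyperplane w.x + b = 0,
    -1 if it lies on the "negative" side, and 0 if it sits exactly on the hyperplane."""
    return int(np.sign(np.dot(w, x) + b))

# Example: the hyperplane x_1 + x_2 - 1 = 0 in R^2.
w = np.array([1.0, 1.0])
b = -1.0
print(which_side(w, b, np.array([2.0, 2.0])))  # +1: positive side
print(which_side(w, b, np.array([0.0, 0.0])))  # -1: negative side
```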
## Best separator
When you have some points that are linearly separable there might be many choices of $w$ and $b$. But which one is the best? And what does “best” even mean?
Here, “best” means maximising the minimum distance between the points and the line.
The intuition for why we would want this: by making our separating line as far away from the points as it can be whilst still separating them, we are avoiding overfitting as best we can. The line is as far away from our training data as it possibly can be.
## Margin
Suppose we have a linear separator defined by $w$ and $b$. Once again write $(x^t, y^t)$ with $y^t \in \{-1, +1\}$ for our training data $t \in T$. It is useful to have a concept for the minimum distance from a point in $T$ to the hyperplane $w \cdot x + b = 0$.
Margin for a linear separator
Suppose we have a set of linearly separable points $T$ such that $w$ and $b$ separate them. We define the margin of the hyperplane with respect to $T$ to be
$$\gamma = \min_{t \in T} \frac{y^t (w \cdot x^t + b)}{\lVert w \rVert}.$$
In other words this is the smallest distance from a point in $T$ to the hyperplane $w \cdot x + b = 0$.
Training points $x^t$ where
$$\frac{y^t (w \cdot x^t + b)}{\lVert w \rVert} = \gamma$$
are called support vectors, as they are the points that constrain the hyperplane.
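A small sketch of computing the margin and picking out the support vectors, assuming a separating `w`, `b` and labelled arrays `X`, `y` (the helper name and toy data are made up):

```python
import numpy as np

def margin_and_support_vectors(w, b, X, y, tol=1e-9):
    """Margin of the hyperplane w.x + b = 0 with respect to labelled points (X, y),
    i.e. the smallest distance from a training point to the hyperplane,
    together with the points that attain it (the support vectors)."""
    # Signed distances y^t (w.x^t + b) / ||w||; all positive if (w, b) separates the data.
    distances = y * (X @ w + b) / np.linalg.norm(w)
    margin = distances.min()
    support = X[np.abs(distances - margin) < tol]
    return margin, support

# Made-up separable data in R^2.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
margin, support = margin_and_support_vectors(np.array([1.0, 1.0]), 0.0, X, y)
print(margin, support)
```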
## Back to reality
Our training data is almost never linearly separable, so how can we adapt what we have to handle this?
Instead of enforcing
$$y^t (w \cdot x^t + b) \geq 1 \ \mbox{ for all } \ t \in T,$$
we allow some error $\xi_t \geq 0$ for each training point and we want to minimise the total error.
For training data $T$ find $w$ and $b$ where
$$\min_{w, b, \xi} \ \frac{1}{2} \lVert w \rVert^2 + C \sum_{t \in T} \xi_t$$
such that
$$y^t (w \cdot x^t + b) \geq 1 - \xi_t \ \mbox{ and } \ \xi_t \geq 0 \ \mbox{ for all } t \in T.$$
## Switched max for min
We transformed the maximisation of the margin $\gamma = \frac{1}{\lVert w \rVert}$ (after using our freedom to scale $w$ and $b$ so that $\min_{t \in T} y^t (w \cdot x^t + b) = 1$) into a minimisation of $\frac{1}{2} \lVert w \rVert^2$. Squaring gets rid of the square root inside the norm, and the factor of $\frac{1}{2}$ is just a convention that cancels the $2$ when differentiating; neither changes the optimal $w$ and $b$.
We have added a trade-off parameter $C$: do we care more about a wide margin or about a lower error term $\sum_{t \in T} \xi_t$?
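One way to see this trade-off in practice is scikit-learn's `SVC` with a linear kernel, whose `C` parameter plays exactly this role; the toy blobs below are made up for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two noisy, overlapping blobs (made-up data): not perfectly linearly separable.
X = np.vstack([rng.normal(loc=(-1, -1), scale=1.0, size=(50, 2)),
               rng.normal(loc=(+1, +1), scale=1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Small C: tolerate more slack (wider margin, more training error).
# Large C: punish slack heavily (tighter fit to the training data).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.score(X, y), len(clf.support_))
```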
## Here comes the magic
This is a quadratic programming problem and can be transformed into the “Dual Optimization Problem”, which is as follows:
$$\max_{\alpha} \sum_{t \in T} \alpha_t - \frac{1}{2} \sum_{t,s \in T} \alpha_t \alpha_s y^t y^s (x^t \cdot x^s)$$
such that
$$\alpha_t \geq 0 \ \mbox{ for all } t \in T, \ \mbox{ and } \ \sum_{t \in T} \alpha_t y^t = 0.$$
We turn this into a classifier by setting
$$w = \sum_{t \in T} \alpha_t y^t x^t \ \mbox{ and } \ b = y^s - \sum_{t \in T} \alpha_t y^t (x^t \cdot x^s), \mbox{ for any } s \in T \mbox{ such that } \alpha_s \neq 0,$$
and then classifying a new point $x$ as $\mathrm{sign}(w \cdot x + b)$.
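A sketch of solving this dual directly with `cvxpy` on a made-up separable toy set and recovering $w$ and $b$ as above (in practice dedicated solvers such as SMO are used):

```python
import cvxpy as cp
import numpy as np

# Made-up linearly separable training data T = {(x^t, y^t)}.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# One dual variable alpha_t per training point.
alpha = cp.Variable(n)
# sum_t alpha_t - 1/2 * || sum_t alpha_t y^t x^t ||^2
# (the squared norm expands to the double sum of dot products in the dual objective).
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(X.T @ cp.multiply(y, alpha)))
constraints = [alpha >= 0, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

a = alpha.value
# Recover the primal solution: w = sum_t alpha_t y^t x^t,
# and b from any support vector s (alpha_s != 0).
w = X.T @ (a * y)
s = int(np.argmax(a))
b = y[s] - np.sum(a * y * (X @ X[s]))
print(w, b)
print(np.sign(X @ w + b))  # reproduces the labels y
```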
The form should remind you of [[Boosting]], where we take an average of lots of different views on the data. However, in reality many of these $\alpha_t = 0$ if they are not the support vectors closest to the line. So it might be more accurate to think of this as closer to [[k-nearest neighbour|KNN]].

## Handling very not separable data

In reality data can be far from linearly separable. For example for [[Exclusive or|xor]] we have the following embedding in $\mathbb{R}^2$ of the 4 training points
$$\{((1,1), -1), ((1,-1), 1), ((-1,-1), -1), ((-1,1), 1)\}.$$
![[xor_embedding]]
This is far from linearly separable. However, we can define a new embedding mapping
$$\Phi: \mathbb{R}^2 \rightarrow \mathbb{R}^6, \ \mbox{ by } \ (x_1, x_2) \mapsto (x_1^2, x_2^2, \sqrt{2} x_1x_2, \sqrt{2} x_1, \sqrt{2} x_2, 1)$$
which, when applied to the [[Function domain|domain]] of our training data, rearranges our points to be
![[xor_mapped_embedding]]
which are easily [[Linearly separable|linearly separable]]. But how do we choose these embeddings?

## Kernel methods

![[Kernel trick]]

There are some popular [[Kernel trick|kernels]] to try.

![[Polynomial kernel (SVMs)]]

![[Gaussian kernel (SVM)]]

![[Sigmoid kernel (SVM)]]

## Support Vector Machines

We are now going to take the idea of the [[Kernel trick|kernel trick]], where we embed our feature space $A$ using $\Phi$ into a space in which our dataset is closer to [[Linearly separable|linearly separable]]. This turns our optimisation problem from before into
$$ \max_{\alpha} \sum_{t \in T} \alpha_t - \frac{1}{2} \sum_{t,s \in T} \alpha_t \alpha_s y^t y^s (\Phi(x^t) \cdot \Phi(x^s))$$
such that
$$ \alpha_t \geq 0 \mbox{ for all } t \in T, \mbox{ and } \sum_{t \in T} \alpha_ty^t = 0.$$
We turn this into a classifier by classifying a new point $x$ as
$$\mathrm{sign}\left(\sum_{t \in T} \alpha_t y^t (\Phi(x^t) \cdot \Phi(x)) + b\right), \ \mbox{ where } \ b = y^s - \sum_{t \in T} \alpha_t y^t (\Phi(x^t) \cdot \Phi(x^s)) \mbox{ for any } s \in T \mbox{ such that } \alpha_s \neq 0.$$
However, we only ever use $\Phi$ on pairs of vectors which immediately have the dot product applied to them, so we can replace $\Phi(x^t) \cdot \Phi(x^s)$ with a kernel $K(x^t, x^s)$ instead.
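To see what is gained, note that for the six-dimensional embedding $\Phi$ from the xor example above, the dot product collapses to a formula in the original coordinates, $\Phi(x) \cdot \Phi(z) = (x \cdot z + 1)^2$, which is the degree-2 polynomial kernel. A quick numerical check:

```python
import numpy as np

def phi(x):
    """The explicit R^2 -> R^6 embedding used for the xor example above."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly_kernel(x, z):
    """Degree-2 polynomial kernel computed directly in the original coordinates."""
    return (np.dot(x, z) + 1) ** 2

rng = np.random.default_rng(1)
x, z = rng.normal(size=2), rng.normal(size=2)
# The kernel gives the same number as embedding first and then taking the dot product,
# but without ever constructing the 6-dimensional vectors.
print(np.dot(phi(x), phi(z)), poly_kernel(x, z))  # equal up to floating point error
```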
Support Vector Machines (SVM)
Support Vector Machines operate within this modelling framework and try to linearly separate data points. Suppose we have some training data $T$ with points $x^t \in A$ and labels $y^t \in \{-1, +1\}$. The model utilises the kernel trick to change the topology of the feature space of our data whilst still keeping the computation relatively simple. Let $K: A \times A \rightarrow \mathbb{R}$ represent such a kernel. Then we solve the following optimisation problem
$$\max_{\alpha} \sum_{t \in T} \alpha_t - \frac{1}{2} \sum_{t,s \in T} \alpha_t \alpha_s y^t y^s K(x^t, x^s)$$
such that
$$\alpha_t \geq 0 \ \mbox{ for all } t \in T, \ \mbox{ and } \ \sum_{t \in T} \alpha_t y^t = 0.$$
We turn this into a classifier by setting
$$b = y^s - \sum_{t \in T} \alpha_t y^t K(x^t, x^s), \ \mbox{ for any } s \in T \mbox{ such that } \alpha_s \neq 0,$$
and classifying a new point $x$ as
$$\mathrm{sign}\left(\sum_{t \in T} \alpha_t y^t K(x^t, x) + b\right).$$
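Putting this together on the xor points from earlier, here is a sketch using scikit-learn's `SVC` with the polynomial kernel $(x \cdot z + 1)^2$ (one possible choice of kernel; the hyperparameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# The four xor training points from the example above.
X = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]], dtype=float)
y = np.array([-1, 1, -1, 1])

# kernel="poly", degree=2, gamma=1, coef0=1 gives K(x, z) = (x.z + 1)^2,
# which corresponds to the explicit R^6 embedding above. C=10 is an arbitrary choice.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0, C=10.0).fit(X, y)
print(clf.predict(X))          # recovers [-1, 1, -1, 1]
print(clf.support_vectors_)    # by symmetry, all four points end up as support vectors
```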
Note that $K$ needs to obey [[Mercer’s condition]] for the underlying mapping of the feature space to exist.

## Run time

The complexity of the kernel function can add a large overhead to the run time for training this model.

## Correctness

The accuracy of this model depends heavily on the choice of the kernel function, as the kernel acts as the definition of similarity between two vectors.
In practice when using [[Boosting]] we tend to find the training error and testing error follow each other, rather than diverging as in the case of overfitting. This is connected to SVMs by the concept of margins. When we talked about SVMs we stated that a larger margin was better as it reduced overfitting. The margin is analogous to the confidence we have in our model.
In relation to Boosting we can think of the algorithm as a complicated way of projecting each data point onto a real value before taking the sign of that value to determine our prediction. As we train for longer we separate the data points more and more, increasing the margin and thus our confidence in the outcomes. This counter-intuitively reduces overfitting in a lot of cases.