Support vector machines

Itcha: “This method is not really used any more.”

There are 3 levels:

  • Maximal margin classifier: Requires linearly separable data
  • Support vector classifier: Requires a linear decision boundary
  • Support vector machine: Requires knowledge of the shape of the decision boundary (to choose a kernel)

Hyperplane

Hyperplane

A hyperplane in an $n$-dimensional vector space $V$ over a field $F$ is an $(n-1)$-dimensional sub-plane. This is given by $n-1$ linearly independent vectors $v_1, \ldots, v_{n-1}$ and a point $p$. Then the space is made of
$$H = \{p + \lambda_1 v_1 + \cdots + \lambda_{n-1} v_{n-1} \mid \lambda_1, \ldots, \lambda_{n-1} \in F\}.$$

Note different vectors could define the same hyperplane.

Link to original

Another way to define a hyperplane is to use an equation. This is to take a vector $w$ orthogonal to the subspace spanned by $v_1, \ldots, v_{n-1}$, perform the dot product with that vector, and set it equal to the dot product of $w$ with $p$:
$$H = \{x \mid w \cdot x = w \cdot p\}.$$
Intuitively $w \cdot x - w \cdot p$ is the “distance” away from the hyperplane that the point $x$ is.

This means a hyperplane cuts the space into two halves: $\{x \mid w \cdot x > w \cdot p\}$ and $\{x \mid w \cdot x < w \cdot p\}$.

This provides a nice starting point for classifiers.
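
As a concrete sketch of classifying by halves in code, writing $b$ for $w \cdot p$ (the vector, offset, and test points below are made up for illustration):

```python
import numpy as np

# A hyperplane in R^2: all points x with w . x = b (values chosen arbitrarily).
w = np.array([1.0, 2.0])
b = 3.0

def side(x):
    """Which half of the space x lies in: +1 for one half, -1 for the other, 0 on the plane."""
    return np.sign(w @ x - b)

print(side(np.array([5.0, 5.0])))  # 1.0  -> the "top" half
print(side(np.array([0.0, 0.0])))  # -1.0 -> the "bottom" half
print(side(np.array([1.0, 1.0])))  # 0.0  -> on the hyperplane itself
```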

Starting to classify

Linearly separable

Linearly separable

Two sets of points $A, B \subset \mathbb{R}^n$ are linearly separable if there is some hyperplane defined by $w \cdot x = c$ for $w \in \mathbb{R}^n$, $c \in \mathbb{R}$, such that $w \cdot x > c$ for all $x \in A$ and $w \cdot x < c$ for all $x \in B$.

Link to original
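
A minimal check of this definition in code (the sets $A$, $B$ and the hyperplane $(w, c)$ here are hypothetical):

```python
import numpy as np

# Hypothetical sets A and B and a candidate hyperplane w . x = c.
A = np.array([[2.0, 2.0], [3.0, 1.5]])
B = np.array([[-1.0, -1.0], [-2.0, 0.0]])
w, c = np.array([1.0, 1.0]), 1.0

# The hyperplane witnesses linear separability if every point of A is
# strictly on one side and every point of B strictly on the other.
print(np.all(A @ w > c) and np.all(B @ w < c))  # True
```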

Suppose we have $N$ points of training data with classification labels. This will be represented by $(x_i, y_i)$ for $1 \le i \le N$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$. Let's assume the data is linearly separable, so there exists a hyperplane $(w, p)$ such that
$$y_i (w \cdot x_i - w \cdot p) > 0 \text{ for all } 1 \le i \le N.$$

There are infinite choices of $(w, p)$; some of these represent the same solution. You can get rid of these by forcing
$$\|w\| = 1.$$

Intuitively this forces the orthogonal vector $w$ to be a unit vector. There are still two choices of $w$, pointing out of the “top” or the “bottom” of the hyperplane, though this is forced by the classification labels within our data.

This now means that $w$ and $p$ now define a unique classification. However, in most cases there are still infinitely many choices of separating hyperplane.
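
To see the redundancy being removed, a small sketch (values made up): any positive rescaling of $w$ classifies identically, and dividing by $\|w\|$ picks out the unit representative.

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5, not yet a unit vector
p = np.array([2.0, 0.5])   # a point on the hyperplane
x = np.array([2.0, 2.0])   # a point to classify

# Rescaling w by any positive constant gives the same classification...
print(np.sign(w @ x - w @ p), np.sign((2 * w) @ x - (2 * w) @ p))  # same sign

# ...so we force ||w|| = 1 to pick a single representative.
w_hat = w / np.linalg.norm(w)
print(np.linalg.norm(w_hat))  # 1.0
```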

Best line

The magnitude of
$$M_i = y_i (w \cdot x_i - b)$$
(note I replaced $w \cdot p$ with $b$ to make life easier) relates to how far the point $x_i$ is away from the hyperplane $w \cdot x = b$. We call this the margin for that point, and ideally we would want to maximise the minimum margin. This will make the least assumptions about our data and reduce the chances of overfitting.
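
Computing these margins for a toy example (the unit vector $w$, offset $b$, and labelled points are made up):

```python
import numpy as np

# Margins y_i (w . x_i - b) with ||w|| = 1: the signed distance of each point
# from the hyperplane, positive when the point is correctly classified.
w = np.array([0.6, 0.8])   # unit length: 0.36 + 0.64 = 1
b = 1.0
X = np.array([[3.0, 2.0], [2.0, 1.0], [-1.0, 0.0]])
y = np.array([1, 1, -1])

margins = y * (X @ w - b)
print(margins)        # [2.4, 1.0, 1.6]
print(margins.min())  # 1.0 -- the minimum margin we want to maximise
```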

A hyperplane that achieves this is called a maximal margin classifier. I.e. it is looking for the max $M$ such that
$$y_i (w \cdot x_i - b) \ge M \text{ for all } 1 \le i \le N.$$

This is an interesting requirement, as most of our data does not really matter for it. There will be a few key points in our data that act as the limits on the margin. These will be called support vectors.

(Note these support vectors and the classifier will be unique!)

Constructing the maximal margin classifier

Let's now collect all our requirements for the maximal margin classifier. We want to choose $(w, b)$ maximising $M$ such that
$$\|w\| = 1 \quad \text{and} \quad y_i (w \cdot x_i - b) \ge M \text{ for all } 1 \le i \le N.$$

This kind of problem is called a quadratic programming problem, which is not touched on by the book.
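
The book may not touch quadratic programming, but off-the-shelf solvers do it for us. A sketch using scikit-learn, where a very large $C$ approximates the hard-margin problem (the toy data is made up; note scikit-learn writes the decision function as $w \cdot x + b$ rather than $w \cdot x - b$):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1])

# Huge C ~ hard margin: libsvm's QP solver finds the maximal margin classifier.
clf = SVC(kernel="linear", C=1e10).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("support vectors:\n", clf.support_vectors_)  # the key points from above
print("margin width:", 2 / np.linalg.norm(w))      # distance between the margin planes
```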

Aside (neural networks)

This is essentially the same as a neural network at this point!

Perceptron (neural network)

Perceptron (neural network)

A perceptron is a function $f: \mathbb{R}^n \to \mathbb{R}$ defined by two steps: the first a weighted linear sum, the second an activation function. To define this we need weights $w_i$ for $0 \le i \le n$ and an activation function $\sigma: \mathbb{R} \to \mathbb{R}$, where
$$f(x_1, \ldots, x_n) = \sigma\left(w_0 + \sum_{i=1}^{n} w_i x_i\right).$$

Link to original

Binary step

Binary step

The binary step is an activation function that determines what the value is in reference to a parameter $\theta$:
$$\sigma(x) = \begin{cases} 1 & \text{if } x \ge \theta, \\ 0 & \text{if } x < \theta. \end{cases}$$

Link to original
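
Putting the two definitions together, our classifier is exactly a perceptron with a binary step activation. A sketch (I output $\pm 1$ rather than $1/0$ to match our labels, and reuse the made-up $(w, b)$ from the margin example):

```python
import numpy as np

def binary_step(z, theta=0.0):
    """Binary step activation, mapped to +/-1 here to match our class labels."""
    return 1 if z >= theta else -1

def perceptron(x, w, b):
    """A weighted linear sum followed by an activation function."""
    return binary_step(w @ x - b)

w, b = np.array([0.6, 0.8]), 1.0
print(perceptron(np.array([3.0, 2.0]), w, b))   # 1
print(perceptron(np.array([-1.0, 0.0]), w, b))  # -1
```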

Non-separating case

Your data might not be Linearly separable; moreover, the separator that does cut your data in half might be massively overfitting.
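
This is where the soft margin of the support vector classifier comes in: shrinking scikit-learn's $C$ relaxes the margin constraints, trading a few misclassified training points for a less overfit boundary. A sketch on made-up noisy data:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical noisy data: one point labelled -1 sits inside the +1 cluster,
# so no hyperplane can separate the classes perfectly.
X = np.array([[2.0, 2.0], [3.0, 1.5], [2.5, 3.0], [1.8, 1.9],
              [-1.0, -1.0], [-2.0, 0.0], [0.0, -2.0]])
y = np.array([1, 1, 1, -1, -1, -1, -1])

# Smaller C = softer margin = more tolerance for misclassified training points.
for C in (1e10, 1.0, 0.1):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "training accuracy:", clf.score(X, y))
```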