As you know, I will try to explain Support Vector Machines (SVM) in a very simple and clear way. I know many find the topic a bit tough, but with a little effort you can understand it. The lesson is in two parts.

So let’s dive in!

We will cover the following topics:

- What are Support Vector Machines?
- Maximum Margin Classifiers
- Concept of Hyperplane
- Equation of a Hyperplane
- Above and Below the Hyperplane
- Performing Classification using a Hyperplane
- Separating Hyperplane

#### 1. What are Support Vector Machines

As you already know, classification is one important aspect of Machine Learning. Also recall that Supervised Learning is divided into Classification, Regression and Density Estimation.

Support Vector Machines (SVM) fall under Classification. In fact, SVM is a classification method and is often considered one of the best classifiers.

#### 2. Maximum-Margin Classifiers

SVMs belong to this type of classifier. They perform classification by finding the largest margin between the two classes. The challenge with this type of classifier is that it requires the data to be linearly separable. That means that some kind of line can be drawn to separate the two classes. This is illustrated in Figure 1.0.

One good application of SVM is in binary classification, where we need to classify data into two distinct classes.

#### 3. The Concept of Hyperplane

In learning Support Vector Machines, you need to clearly understand the term hyperplane. So let me give a formal definition: a hyperplane is a flat subspace whose dimension is one less than that of the ambient space. That means that if we have data in two dimensions, then the hyperplane is a single line, as in Figure 1.0. In Figure 1.0, the blue lines are hyperplanes.

In the same way, if we have data in 3-dimensional space, then the hyperplane would be a 2-dimensional plane. And so on.

#### 4. Equation of a Hyperplane

You already know that a line or a plane has an equation that defines it. In the same way, a hyperplane is defined by an equation. In two dimensions it is:

β_{0} + β_{1}X_{1} + β_{2}X_{2} = 0

This equation holds for every point **X** = (X_{1}, X_{2}) on the line.

As you may recall, this is the equation of a line, which is a hyperplane in 2-dimensional space. We can extend the equation to a space of arbitrary dimension, say p:

β_{0} + β_{1}X_{1} + β_{2}X_{2} + . . . + β_{p}X_{p} = 0

This is the equation of a hyperplane in p-dimensional space. In the same way, it holds for every point X = (X_{1}, X_{2}, . . . , X_{p}) on this hyperplane.
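To make the equation concrete, here is a minimal Python sketch that evaluates its left-hand side for a given point. The function name `hyperplane_value` and the coefficient values are assumptions chosen purely for illustration; they are not part of the lesson.

```python
def hyperplane_value(beta0, beta, x):
    """Return beta0 + beta[0]*x[0] + ... + beta[p-1]*x[p-1]."""
    if len(beta) != len(x):
        raise ValueError("beta and x must have the same length")
    return beta0 + sum(b * xi for b, xi in zip(beta, x))


# Illustrative hyperplane in 2-dimensional space: 1 + 2*X1 - 1*X2 = 0
beta0, beta = 1.0, [2.0, -1.0]

on_line = hyperplane_value(beta0, beta, [1.0, 3.0])   # 1 + 2 - 3
off_line = hyperplane_value(beta0, beta, [0.0, 0.0])  # 1 + 0 - 0

print(on_line)   # a value of 0 means the point lies on the hyperplane
print(off_line)  # a non-zero value means the point lies off it
```

The same function works for any dimension p, because it simply pairs each coefficient with the matching coordinate.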

#### 5. Above and Below the Hyperplane

So far we know the equation that defines the hyperplane. Now consider a set of data points. If we draw a hyperplane through these points, each point falls into one of three cases:

- some points may lie exactly on the line (hyperplane)
- some points lie above the hyperplane
- some points lie below the hyperplane.

This is illustrated in Figure 1.3.

We now have the equation for the hyperplane. We are also interested in the points that lie on either side of the hyperplane (above and below). These points can be described with these inequalities:

β_{0} + β_{1}X_{1} + β_{2}X_{2} + . . . + β_{p}X_{p} < 0 for points that lie below the hyperplane

β_{0} + β_{1}X_{1} + β_{2}X_{2} + . . . + β_{p}X_{p} > 0 for points that lie above the hyperplane

This makes sense, since the left-hand side can be equal to 0, greater than 0 or less than 0.

This also tells us something about the signs of the classes.

- Points below the hyperplane are associated with -ve
- Points on the hyperplane are associated with 0
- Points above the hyperplane are associated with +ve

Out comes our classifier! If the hyperplane divides the data points into two classes, then one class is -ve and the other class is +ve.
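The sign rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `side_of_hyperplane` and the example line X1 + X2 = 3 are made up for this sketch, and "above" simply means the side on which the expression is positive.

```python
def side_of_hyperplane(beta0, beta, x):
    """Return +1, -1 or 0 depending on the sign of
    beta0 + beta[0]*x[0] + ... + beta[p-1]*x[p-1]."""
    value = beta0 + sum(b * xi for b, xi in zip(beta, x))
    if value > 0:
        return 1   # point is on the positive side ("above")
    if value < 0:
        return -1  # point is on the negative side ("below")
    return 0       # point lies exactly on the hyperplane


# Illustrative hyperplane: -3 + X1 + X2 = 0, i.e. the line X1 + X2 = 3
beta0, beta = -3.0, [1.0, 1.0]

for point in ([4.0, 4.0], [0.0, 0.0], [1.0, 2.0]):
    print(point, side_of_hyperplane(beta0, beta, point))
```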

#### 6. Performing Classification Using Hyperplane

Let’s assume we have a dataset of n rows (observations) and p features. This will serve as our training dataset. It can be represented as a matrix of dimension n x p, in which row i holds the p feature values of observation i.

This means that we can represent each observation as a column vector containing all of its features (this is also called a feature vector).

Let’s also define class labels y_{1}, y_{2}, . . . ,y_{n}, one per observation, each of which is either 1 or -1. That is

y_{1}, y_{2}, . . . ,y_{n} ∈ {1, -1}
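As a rough sketch, such a training set could be held in plain Python lists. The numbers and the names `X_train` and `y_train` below are invented for illustration only.

```python
# Hypothetical training set: n = 4 observations (rows), p = 2 features (columns).
X_train = [
    [1.0, 2.0],
    [2.0, 3.0],
    [6.0, 5.0],
    [7.0, 8.0],
]

# One class label per observation, each either 1 or -1.
y_train = [-1, -1, 1, 1]

n = len(X_train)     # number of observations
p = len(X_train[0])  # number of features

# The feature vector of the third observation (index 2).
feature_vector = X_train[2]

# Sanity checks: one label per row, and labels restricted to {1, -1}.
assert len(y_train) == n
assert all(label in (1, -1) for label in y_train)

print(n, p, feature_vector)
```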

Let’s also assume another dataset, a test dataset, similar to the training dataset but with fewer observations and without labels or classes, only features. The goal, just as in any classification task, is to build a classifier that correctly classifies the test data using only the features.

In Support Vector Machines, the approach we will use is called the Separating Hyperplane, and that is what we discuss next.

#### 7. Separating Hyperplane

The idea is to find a hyperplane that can separate the data into two classes such that one part of the data belongs to class 1 and the other part of the dataset belongs to class -1.

This hyperplane is the separating hyperplane. For every training observation i, with features X_{1}, . . . , X_{p} and label y_{i}, it satisfies the following properties:

β_{0} + β_{1}X_{1} + β_{2}X_{2} + . . . + β_{p}X_{p} > 0 if y_{i} = 1

β_{0} + β_{1}X_{1} + β_{2}X_{2} + . . . + β_{p}X_{p} < 0 if y_{i} = -1

Equivalently, the separating hyperplane satisfies the single inequality:

y_{i}(β_{0} + β_{1}X_{1} + β_{2}X_{2} + . . . + β_{p}X_{p}) > 0

for all i = 1, . . . , n
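One way to see what this condition means is to check it directly: multiply each label by the value of the hyperplane expression for that observation and confirm that every product is positive. The sketch below does that. The helper name `is_separating`, the toy data and the candidate coefficients are all assumptions made for illustration, and the sketch only tests a given hyperplane; it does not find one.

```python
def is_separating(beta0, beta, X, y):
    """Return True if y_i * (beta0 + beta . x_i) > 0 for every observation."""
    return all(
        label * (beta0 + sum(b * xi for b, xi in zip(beta, row))) > 0
        for row, label in zip(X, y)
    )


# Hypothetical training data: 4 observations, 2 features, labels in {1, -1}.
X_train = [[1.0, 2.0], [2.0, 3.0], [6.0, 5.0], [7.0, 8.0]]
y_train = [-1, -1, 1, 1]

# Candidate 1: the line X1 + X2 = 8 puts both classes on the correct sides.
print(is_separating(-8.0, [1.0, 1.0], X_train, y_train))

# Candidate 2: the line X1 + X2 = 4 leaves the point (2, 3) on the wrong side.
print(is_separating(-4.0, [1.0, 1.0], X_train, y_train))
```

Flipping the sign of every coefficient describes the same line but swaps which side counts as positive, so the check would then fail for this labelling.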

If we succeed in creating this classifier using only our training dataset, then we can use it to classify the test dataset. The classification is based on which side of the hyperplane a data point is located.

What we simply need to do is calculate f(x*) for each observation x* in the test data, where f(x*) is given by

f(x*) = β_{0} + β_{1}X_{1} + β_{2}X_{2} + . . . + β_{p}X_{p}

If f(x*) is positive, then we assign the observation to class 1 but if f(x*) is negative, we assign it to the class -1.

Additionally, we can use the magnitude of f(x*) to judge how far the data point is from the hyperplane. The larger the magnitude, the farther the point is from the hyperplane.
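The classification rule and the distance idea can be sketched together in Python. The function names, the coefficients and the test points below are hypothetical. The sketch also uses the standard geometric fact that the perpendicular distance from a point to the hyperplane is |f(x*)| divided by the length of the coefficient vector (β_{1}, . . . , β_{p}), and it assumes that a point with f(x*) = 0 is left unassigned, since the rule above does not cover that case.

```python
import math


def f(beta0, beta, x):
    """Value of beta0 + beta[0]*x[0] + ... + beta[p-1]*x[p-1]."""
    return beta0 + sum(b * xi for b, xi in zip(beta, x))


def classify(beta0, beta, x):
    """Assign class 1 if f(x) > 0, class -1 if f(x) < 0.
    Assumption: a point exactly on the hyperplane gets 0 (unassigned)."""
    value = f(beta0, beta, x)
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0


def distance_to_hyperplane(beta0, beta, x):
    """Perpendicular distance: |f(x)| divided by the norm of beta."""
    norm = math.sqrt(sum(b * b for b in beta))
    return abs(f(beta0, beta, x)) / norm


# Hypothetical hyperplane -10 + 3*X1 + 4*X2 = 0 and unlabeled test points.
beta0, beta = -10.0, [3.0, 4.0]
X_test = [[4.0, 2.0], [0.0, 0.0], [1.0, 1.0]]

for x_star in X_test:
    print(x_star, classify(beta0, beta, x_star),
          distance_to_hyperplane(beta0, beta, x_star))
```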

Now we clearly understand the concept of a separating hyperplane, as well as how to use one for classification. The next concept we will explain is the maximum-margin hyperplane.

We continue in the next Tutorial.