In this simple tutorial, I will explain the concept of Principal Components Analysis (PCA) in Machine Learning. I will try to be as simple and clear as possible.

Then, in Tutorial 2, we will use Python to get hands-on and actually perform Principal Components Analysis.

**What is Principal Components Analysis?**

Principal Components Analysis is an unsupervised learning class of statistical techniques used to explain high-dimensional data using a smaller number of variables called the principal components.

In PCA, we compute the principal components and use them to explain the data.

**How Does PCA Work?**

Assume we have a set X made up of n measurements, each represented by a set of p features, X_{1}, X_{2}, …, X_{p}. If we want to plot this data in a 2-dimensional plane, we can plot the n measurements using two features at a time. If the number of features is more than three or four, then plotting this in two dimensions becomes a challenge, as the number of plots would be p(p-1)/2, which quickly becomes too many to plot.

We would like to visualize this data in two dimensions without losing the information contained in the data. This is what PCA allows us to do.
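As a quick preview of the hands-on part, this idea can be sketched with scikit-learn (an assumed dependency here, along with NumPy; the data below is randomly generated purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data: 100 measurements, each with p = 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Reduce the 6 features to 2 principal components so the
# data can be plotted in a 2-dimensional plane.
pca = PCA(n_components=2)
Z = pca.fit_transform(X)

print(Z.shape)  # each measurement is now described by just 2 scores
```

Each row of Z holds the two principal component scores of one measurement, which is exactly what we would scatter-plot.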

**How to Compute Principal Components?**

Given a dataset X of dimension n x p, how do we compute the first principal component?

To do this, we look for the linear combination of the feature values of the form:

z_{i1} = φ_{11}x_{i1} + φ_{21}x_{i2} + … + φ_{p1}x_{ip}

that has the largest sample variance, subject to the constraint that:

φ_{11}^{2} + φ_{21}^{2} + … + φ_{p1}^{2} = 1

This means that the first principal component loading vector solves an optimization problem: we maximize an objective function subject to a constraint.

The objective function is given by:

maximize (1/n) Σ_{i=1}^{n} (φ_{11}x_{i1} + φ_{21}x_{i2} + … + φ_{p1}x_{ip})^{2}

And this is subject to the constraint:

φ_{11}^{2} + φ_{21}^{2} + … + φ_{p1}^{2} = 1

The objective function (the function to maximize) can be rewritten as:

(1/n) Σ_{i=1}^{n} z_{i1}^{2}

Since we assume each feature has been centered to have mean zero, this also holds:

(1/n) Σ_{i=1}^{n} x_{ij} = 0 for each feature j

Therefore the average of z_{11},…, z_{n1} will also be zero. Therefore, the objective function being maximized is simply the sample variance of the n values z_{i1}.
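This step can be checked numerically. The sketch below (assuming NumPy; the data and the unit-length loading vector are made up purely for illustration) centers the features, computes the scores z_{i1}, and confirms that their average is zero, so the objective equals their sample variance:

```python
import numpy as np

# Illustrative data: 50 measurements with p = 4 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)        # center each feature: column means become 0

# An arbitrary unit-length loading vector (for illustration only).
phi = np.array([0.5, 0.5, 0.5, 0.5])

z = Xc @ phi                   # scores z_i1 = sum_j phi_j1 * x_ij

# Because the features are centered, the scores average to zero,
# so (1/n) * sum(z**2) is exactly the sample variance of the scores.
print(np.isclose(z.mean(), 0.0))
print(np.isclose((z**2).mean(), z.var()))
```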

z_{11}, z_{21},…,z_{n1} are referred to as the scores of the first principal component.

**How then do we maximize the given objective function?**

We do this by performing an eigen decomposition of the covariance matrix. Details of how to perform eigen decomposition are explained here.
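A minimal sketch of this step in NumPy (the data here is randomly generated purely for illustration): we build the covariance matrix of the centered data and take the eigenvector with the largest eigenvalue as the first loading vector φ_{1}:

```python
import numpy as np

# Illustrative data with correlated features.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)

# Covariance matrix of the centered data (p x p).
C = (Xc.T @ Xc) / len(Xc)

# For a symmetric matrix, eigh returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

# The first principal component loading vector is the eigenvector with
# the largest eigenvalue; that eigenvalue is the variance it explains.
phi1 = eigvecs[:, -1]
print(np.isclose(np.linalg.norm(phi1), 1.0))  # unit length, as the constraint requires
```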

**Explaining the Principal Components**

The loading vector φ_{1} with elements φ_{11}, φ_{21},…,φ_{p1} defines a direction in the feature space along which there is maximum variance in the data.

Thus, if we project the n data points x_{1}, x_{2},…, x_{n} onto this direction, the projected values are exactly the principal component scores z_{11}, z_{21}, …, z_{n1}.
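The sketch below (again assuming NumPy, with randomly generated data for illustration) projects the centered data onto the top eigenvector of the covariance matrix and confirms that the variance of the resulting scores equals the largest eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3)) @ rng.normal(size=(3, 3))
Xc = X - X.mean(axis=0)

# Eigen decomposition of the covariance matrix, as before.
C = (Xc.T @ Xc) / len(Xc)
eigvals, eigvecs = np.linalg.eigh(C)
phi1 = eigvecs[:, -1]        # direction of maximum variance

# Projecting each centered data point onto phi1 gives the scores z_i1.
z1 = Xc @ phi1

# The sample variance of the scores equals the largest eigenvalue.
print(np.isclose(z1.var(), eigvals[-1]))
```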

After the first principal component Z_{1} of the features has been determined, the second principal component is the linear combination of X_{1}, X_{2},…, X_{p} that has the highest variance out of all the linear combinations that are uncorrelated with Z_{1}. The second principal component scores z_{12}, z_{22},…,z_{n2} take the form:

z_{i2} = φ_{12}x_{i1} + φ_{22}x_{i2} + … + φ_{p2}x_{ip}

where φ_{2} is the second principal component loading vector, with elements φ_{12}, φ_{22}, … ,φ_{p2}. It turns out that constraining Z_{2} to be uncorrelated with Z_{1} is the same as constraining the direction of φ_{2} to be orthogonal to the direction of φ_{1}.
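This equivalence can also be checked numerically. In the sketch below (NumPy assumed, illustrative random data), the two top eigenvectors of the covariance matrix are orthogonal, and the score vectors they produce are uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)

C = (Xc.T @ Xc) / len(Xc)
eigvals, eigvecs = np.linalg.eigh(C)

phi1 = eigvecs[:, -1]   # loading vector of the first principal component
phi2 = eigvecs[:, -2]   # loading vector of the second principal component

z1 = Xc @ phi1          # first principal component scores
z2 = Xc @ phi2          # second principal component scores

# Orthogonal loading directions ...
print(np.isclose(phi1 @ phi2, 0.0))
# ... give uncorrelated scores (sample covariance of z1 and z2 is zero).
print(abs(z1 @ z2 / len(z1)) < 1e-6)
```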

We will now take an example to see how PCA works.