In this simple tutorial, we are going to learn how to perform Principal Components Analysis (PCA) in Python. The tutorial uses Jupyter Notebook, and I assume you already have it installed. You can also learn about the concept of PCA from the following two tutorials:
- Introduction to PCA and Dimensionality Reduction
- How to Perform Principal Components Analysis – PCA (Theory)
The following are the eight steps to performing PCA in Python:
- Step 1: Import the Necessary Modules
- Step 2: Obtain Your Dataset
- Step 3: Preview Your Data
- Step 4: Standardize the Data
- Step 5: Perform PCA
- Step 6: Combine Target and Principal Components
- Step 7: Do a Scree Plot of the Principal Components
- Step 8: Visualize your New Data in 2D
Step 1: Import the Necessary Modules
The modules we need are pandas, numpy, sklearn and matplotlib. To import them, write the following import statements in the first cell of the Jupyter Notebook:
Listing 1.0: Import necessary modules
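The import cell could look like the sketch below. The aliases np, pd and plt are the usual community conventions rather than something the text confirms, and StandardScaler is left out here because it is imported later, in Step 4.

```python
# Core numerical and tabular libraries
import numpy as np
import pandas as pd

# Plotting (used for the scree plot and the 2D scatter plots)
import matplotlib.pyplot as plt

# PCA lives in scikit-learn's decomposition module
from sklearn.decomposition import PCA
```

If the cell runs without an ImportError, all four packages are installed.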
Step 2: Obtain the Dataset
The dataset is obtained from the UCI Machine Learning Repository. To do that, you can right-click on the link below and save a copy of the dataset to your local drive.
Add the following lines to the next cell to load the dataset into a variable:
Listing 1.1: Obtain and load your dataset
In Listing 1.1, the first line specifies the URL of the dataset and the second line loads the dataset into a dataframe df (a dataframe is simply a table-like structure used to hold data).
pd.read_csv() is a function in pandas. The first argument is the path to the data and the second argument (names) is a list of the column names. This means that the first column of the data is named ‘sepal_length’, the second column ‘sepal_width’, and so on.
When the code in Listing 1.1 executes, your dataset is available in the variable df.
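To see how the names argument works without downloading anything, here is a minimal sketch that reads a few rows laid out like the Iris file (comma-separated, no header row) from an in-memory string. The string stands in for the downloaded file, and the column names after the first two are my own assumption; in the notebook you would pass the file path or URL as the first argument instead.

```python
import io

import pandas as pd

# A handful of rows in the same header-less, comma-separated layout as the
# Iris file. This in-memory sample is only a stand-in for the real download.
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.4,3.2,4.5,1.5,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
    "5.8,2.7,5.1,1.9,Iris-virginica\n"
)

# Column names are supplied by us because the file has no header row.
# Names beyond the first two are assumed for illustration.
column_names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "target"]

df = pd.read_csv(sample_csv, names=column_names)
print(df.head())
```

Without names, pandas would treat the first data row as the header, so one observation would be lost.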
Step 3: Preview Your Data
You can view your data by typing df into the next cell and running it, as shown in Figure 1.0. You can also type print(df). In the table, there are four features and one target (or class).
Step 4: Perform Scaling on the Data
This means that we need to center and scale the data. In this way the mean of each feature (column) is 0 and the variance of each feature is 1.
To scale our data, we use StandardScaler, which is available in sklearn.
Note that we are only going to scale the features and not the target. So to do this, we
- first import StandardScaler
- separate the features from the target
- scale the features
These three operations are accomplished using the four lines of code below:
Listing 1.2: Separate features from target and standardize features
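A minimal sketch of these three operations follows. To keep it self-contained it uses the copy of the Iris data bundled with scikit-learn (load_iris) instead of the df loaded in Step 2, which is an assumption on my part; the variable names x, y and x_scaled are also illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Stand-in for df: the Iris copy that ships with scikit-learn (assumption).
iris = load_iris()

# Separate the four feature columns from the target labels.
x = iris.data                                # shape (150, 4)
y = np.array(["Iris-" + name for name in iris.target_names])[iris.target]

# Standardize the features only: each column ends up with mean 0, variance 1.
x_scaled = StandardScaler().fit_transform(x)

print(x_scaled.mean(axis=0).round(6))        # per-column means, close to 0
print(x_scaled.std(axis=0).round(6))         # per-column standard deviations
```

The check is done per column (axis=0) because standardization works feature by feature, not row by row.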
Step 5: Perform PCA
To perform PCA we use the PCA class from sklearn, which we already imported in Step 1. In Listing 1.3 below, the first and second lines perform the PCA, and the third line loads the principal components into a dataframe. You can view your data by typing principalComponents or principalDataframe in a cell and running it.
Listing 1.3: PCA for two Principal Components
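The three lines described above might look like the sketch below. The names principalComponents and principalDataframe come from the text; the variable pca, the column labels PC1 and PC2, and the use of scikit-learn's bundled Iris copy as input are assumptions made so the block runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardized features (bundled Iris copy used as a stand-in, see Step 4).
x_scaled = StandardScaler().fit_transform(load_iris().data)

# Lines 1 and 2: configure PCA for two components, then fit and project.
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x_scaled)

# Line 3: wrap the resulting NumPy array in a dataframe with readable labels.
principalDataframe = pd.DataFrame(principalComponents, columns=["PC1", "PC2"])
print(principalDataframe.head())
```

Note that principalComponents is a plain NumPy array, while principalDataframe is the labeled version that is easier to combine with the target later.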
Step 6: Combine the Target and the Principal Components
Remember that the original data has five columns: four features and one target column. After performing PCA, we have just two columns for the features. The target y was not touched. Therefore, we attach the target column back to the new set of principal components. To do that, use the code below.
Listing 1.4: Combine Principal Components with target
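One way to do the combination is pd.concat along the column axis, as in the sketch below. The name newDataframe comes from the text; targetDataframe, the column label target and the bundled Iris copy are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild the inputs so this block runs on its own (bundled Iris stand-in).
iris = load_iris()
x_scaled = StandardScaler().fit_transform(iris.data)
labels = np.array(["Iris-" + name for name in iris.target_names])[iris.target]

principalDataframe = pd.DataFrame(
    PCA(n_components=2).fit_transform(x_scaled), columns=["PC1", "PC2"]
)
targetDataframe = pd.DataFrame({"target": labels})

# axis=1 places the two dataframes side by side, matching rows by index.
newDataframe = pd.concat([principalDataframe, targetDataframe], axis=1)
print(newDataframe.head())
```

Both dataframes share the default 0..149 index, so the rows line up; if the indexes differed, concat would introduce missing values.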
You can view your new dataset by typing newDataframe and running the cell. Your output will be as shown in Figure 1.1.
Step 7: Perform a Scree Plot of the Principal Components
A scree plot is a bar chart showing the size of each of the principal components. It helps us to visualize the percentage of variation captured by each of the principal components. To make a scree plot you need to:
- first, create a list of columns
- then, create a list of PCs
- finally, draw the scree plot using plt
Now, copy and paste the code in Listing 1.5 below into Jupyter Notebook and run it. Your output will be as shown in Figure 1.2.
Listing 1.5: PCA Scree Plot
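A scree plot along these lines can be sketched as follows. Here all four components are fitted so the drop-off is visible, which is my assumption; the article's own listing may plot only the two retained components. The variable names and the bundled Iris copy are illustrative as well.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components kept (assumption: shows the full drop-off).
pca_full = PCA().fit(x_scaled)

# Percentage of variance captured by each component, plus matching labels.
percent_variance = np.round(pca_full.explained_variance_ratio_ * 100, 2)
pc_labels = [f"PC{i}" for i in range(1, len(percent_variance) + 1)]

fig, ax = plt.subplots()
bars = ax.bar(pc_labels, percent_variance)
ax.set_xlabel("Principal component")
ax.set_ylabel("Percentage of variance explained")
ax.set_title("PCA Scree Plot")
# In Jupyter the figure renders when the cell finishes;
# call plt.show() when running this as a script.
```

The bars are always in decreasing order because PCA sorts its components by the variance they explain.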
You can hence see the scree plot below.
Step 8: Plot the Principal Components on 2D
Now that we have performed PCA, we need to visualize the new dataset to see how PCA makes it easier to explain the original data. We will use a scatter plot.
Listing 1.6: 2D Plot of PC1 and PC2
If you execute the code above then you will have the plot given in Figure 1.2
So what have we achieved?
We will repeat this plot, this time with a color for each of the targets (Iris-setosa, Iris-versicolor and Iris-virginica). In this way we can see how PCA helps explain the data. To keep things simple, I will not explain this code in detail. Write and run the code below.
Listing 1.7: Plot of PC1 vs PC2 with color codes
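A color-coded plot of this kind could be sketched as below: one scatter call per class, each with its own color and legend entry. The color choices, figure size and the bundled Iris copy are assumptions, not taken from the original listing.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Rebuild newDataframe so the block runs on its own (bundled Iris stand-in).
iris = load_iris()
x_scaled = StandardScaler().fit_transform(iris.data)
newDataframe = pd.DataFrame(
    PCA(n_components=2).fit_transform(x_scaled), columns=["PC1", "PC2"]
)
newDataframe["target"] = np.array(
    ["Iris-" + name for name in iris.target_names]
)[iris.target]

# One color per class (arbitrary choice for illustration).
colors = {"Iris-setosa": "red", "Iris-versicolor": "green", "Iris-virginica": "blue"}

fig, ax = plt.subplots(figsize=(8, 8))
for target, color in colors.items():
    rows = newDataframe["target"] == target      # boolean mask for this class
    ax.scatter(
        newDataframe.loc[rows, "PC1"],
        newDataframe.loc[rows, "PC2"],
        color=color,
        s=50,
        label=target,
    )
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.set_title("Plot of PC1 vs PC2")
ax.legend()
ax.grid(True)
# In Jupyter the figure renders when the cell finishes.
```

Plotting each class separately is what gives every class its own legend entry.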
Likewise, if you execute the code in Listing 1.7 above, you will get the output given in the figure below:
Explaining the Variance Using the Principal Components
Finally, we need to see how much of our data the two principal components explain. To do that, use the command below:
Then you will get the output:
These values show that the first principal component PC1 explains 72.77% of the variation in the original data, while the second principal component PC2 explains 23.03% of the variation in the original data.
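In scikit-learn, these proportions are exposed by the explained_variance_ratio_ attribute of a fitted PCA object, which is most likely what the command above refers to. The sketch below assumes the bundled Iris copy and a variable named pca; the exact figures it prints can differ slightly from those quoted here, depending on which copy of the dataset is used.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(x_scaled)

# Fraction of the total variance captured by PC1 and PC2, in that order.
ratios = pca.explained_variance_ratio_
print(ratios)

# Combined share of the variance kept after reducing 4 dimensions to 2.
print(f"Total explained by two components: {ratios.sum():.2%}")
```

The ratios are fractions between 0 and 1, so multiply by 100 to read them as percentages.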
In conclusion, the two components together explain about 95.8% of the variation (72.77% + 23.03%), so the original 4-dimensional data can reasonably be reduced to 2 dimensions using PCA with little loss of information.
Finally, I hope that this lesson has helped you see how you can perform Principal Components Analysis using Python.