Principal Components Analysis (PCA) in Python – Step by Step

In this simple tutorial, we are going to learn how to perform Principal Components Analysis in Python. The tutorial is completed in Jupyter Notebook, so I assume you have Jupyter Notebook installed. You can also learn about the concept of PCA from the following two tutorials:

Performing PCA in Python takes the following eight steps:


Step 1: Import the Necessary Modules

The modules we need are pandas, numpy, sklearn and matplotlib. To import them, write the following import statements in the first cell of your Jupyter Notebook:

import pandas as pd
import numpy as np
from sklearn.decomposition import PCA
from sklearn import preprocessing
import matplotlib.pyplot as plt

Listing 1.0: Import necessary modules


Step 2: Obtain the Dataset

The dataset is obtained from the UCI Machine Learning Repository. To get it, you can right-click on the link below and save a copy of the dataset to your local drive.

Add the following lines to the next cell to load the dataset into a variable:

url = ""

df = pd.read_csv(url, names=['sepal length', 'sepal width', 'petal length', 'petal width', 'target'])

Listing 1.1: Obtain and load your dataset

In Listing 1.1, the first line specifies the URL of the dataset, and the second line loads the dataset into a dataframe df (a dataframe is simply a table used to hold data).

pd.read_csv() is a function in pandas. The first argument is the path to the data, and the names argument is a list of column names. This means that the first column of the data is named 'sepal length', the second column 'sepal width', and so on.

When the code in Listing 1.1 executes, your dataset is available in the variable df.
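
To see what the names argument does without downloading anything, here is a minimal sketch that reads three rows in the same comma-separated layout from an in-memory string. The sample rows, the variable names and the use of io.StringIO are assumptions made for illustration only.

```python
import io
import pandas as pd

# Three illustrative rows in the same layout as the Iris file (no header row).
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

column_names = ['sepal length', 'sepal width', 'petal length', 'petal width', 'target']

# Because names is passed explicitly, pandas treats every line as data
# and uses column_names as the column labels.
sample_df = pd.read_csv(sample_csv, names=column_names)

print(sample_df.columns.tolist())
print(sample_df.shape)
```

If the names list were omitted, pandas would use the first data row as the header, which is why the list is needed for a file that has no header line.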


Step 3: Preview Your Data

You can view your data by typing df into the next cell and running it, as shown in Figure 1.0. You can also type print(df). The table has four features and one target (or class) column.

Figure 1.0: Data in DataFrame


Step 4: Perform Scaling on the Data

Before PCA, we need to center and scale the data. After scaling, each feature (column) has a mean of 0 and a variance of 1.

To scale our data, we use StandardScaler, which is available in sklearn.

Note that we only scale the features and not the target. To do this, we

  • first import StandardScaler
  • separate the features from the target
  • scale the features

These three operations are accomplished by the code below:

from sklearn.preprocessing import StandardScaler

features = ['sepal length', 'sepal width', 'petal length', 'petal width']

x = df.loc[:, features].values

y = df.loc[:, ['target']].values

x = StandardScaler().fit_transform(x)

Listing 1.2: Separate features from target and standardize features
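
To make the effect of scaling concrete, the sketch below standardizes a small made-up matrix and repeats the calculation by hand. The data and variable names are hypothetical; the point is only to show that each column ends up with mean 0 and variance 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix: 4 samples (rows) by 2 features (columns).
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0]])

scaled = StandardScaler().fit_transform(data)

# The same result by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0), which is what StandardScaler uses.
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(scaled.mean(axis=0))          # each column mean is numerically 0
print(scaled.var(axis=0))           # each column variance is numerically 1
print(np.allclose(scaled, manual))  # the two approaches agree
```

Note that the statistics are computed per column, so features measured on very different scales contribute equally to the PCA that follows.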


Step 5:  Perform PCA

To perform PCA we use the PCA class from sklearn, which we already imported in Step 1. In Listing 1.3 below, the first two lines perform the PCA and the third line loads the principal components into a dataframe. You can view the result by typing principalComponents or principalDataframe in a cell and running it.

pca = PCA(n_components=2)

principalComponents = pca.fit_transform(x)

principalDataframe = pd.DataFrame(data = principalComponents, columns = ['PC1', 'PC2'])

Listing 1.3: PCA for two Principal Components
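
For readers curious about what happens inside fit_transform, here is a rough sketch of the textbook route to principal components: eigen-decomposition of the covariance matrix, using numpy only. The random data and all variable names are assumptions for illustration. scikit-learn may use a different numerical method internally, and the sign of individual components can differ.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 50 samples, 4 features with different spreads.
raw = rng.normal(size=(50, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
centered = raw - raw.mean(axis=0)

# Covariance matrix of the features (columns).
cov = np.cov(centered, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder them to put
# the directions of largest variance first.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Keep the top two directions and project the data onto them.
top_two = eigenvectors[:, :2]
scores = centered @ top_two

# Share of the total variance carried by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

print(scores.shape)
print(explained_ratio)
```

The columns of scores play the role of PC1 and PC2, and explained_ratio corresponds to the quantity that sklearn exposes as explained_variance_ratio_.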


Step 6: Combine the Target and the Principal Components

Remember that the original data has five columns: four features and one target column. After performing PCA, we have just two columns for the features. The target column was not touched, so we now attach it back to the new set of principal components. To do that, use the code below.

targetDataframe = df[['target']]

newDataframe = pd.concat([principalDataframe, targetDataframe],axis = 1)

Listing 1.4: Combine Principal Components with target
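
One detail worth knowing is that pd.concat with axis=1 lines rows up by index label, not by position. In this tutorial both dataframes use the default integer index, so they match. The sketch below uses made-up frames to show what can go wrong when the indexes differ, and one possible way to fix it.

```python
import pandas as pd

pcs = pd.DataFrame({'PC1': [0.1, 0.2, 0.3], 'PC2': [1.0, 2.0, 3.0]})

# Same number of labels but a different index, as can happen after filtering rows.
labels = pd.DataFrame({'target': ['a', 'b', 'c']}, index=[1, 2, 3])

# Rows are matched by index label, so unmatched labels produce NaN values.
misaligned = pd.concat([pcs, labels], axis=1)

# Resetting the index restores a positional match.
aligned = pd.concat([pcs, labels.reset_index(drop=True)], axis=1)

print(misaligned.shape)
print(aligned.shape)
```

If your own newDataframe ever contains unexpected NaN values, checking the indexes of the two inputs is a reasonable first step.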

You can view your new dataset by typing newDataframe and running the cell. Your output should be as shown in Figure 1.1.

Figure 1.1: New Dataset after performing PCA


Step 7: Perform a Scree Plot of the Principal Components

A scree plot is a bar chart showing the size of each principal component, that is, the percentage of variation it captures. To make a scree plot you need to:

  • first, compute the percentage of variance explained by each PC
  • then, create a list of labels for the PCs
  • finally, draw the scree plot using plt

Now, copy and paste the code in Listing 1.5 below into Jupyter Notebook and run it. Your output should be as shown in Figure 1.2.

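percent_variance = np.round(pca.explained_variance_ratio_ * 100, decimals=2)
# One label per fitted component (two here, because n_components=2 in Step 5;
# fit PCA(n_components=4) instead to see all four bars).
columns = ['PC' + str(i) for i in range(1, len(percent_variance) + 1)]
plt.bar(x=range(1, len(percent_variance) + 1), height=percent_variance, tick_label=columns)
plt.ylabel('Percentage of Variance Explained')
plt.xlabel('Principal Component')
plt.title('PCA Scree Plot')
plt.show()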

Listing 1.5: PCA Scree Plot
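
A scree plot needs one percentage per component. As a minimal sketch, assuming random data in place of the scaled Iris features, the code below fits PCA without setting n_components so that all components are kept, then builds one label per component. The names features_demo, pca_all, percent_variance_demo and labels_demo are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Made-up stand-in for the scaled feature matrix: 30 samples, 4 features.
features_demo = rng.normal(size=(30, 4))

# With n_components left unset, PCA keeps min(n_samples, n_features) components.
pca_all = PCA().fit(features_demo)

percent_variance_demo = np.round(pca_all.explained_variance_ratio_ * 100, decimals=2)
labels_demo = [f'PC{i}' for i in range(1, len(percent_variance_demo) + 1)]

print(labels_demo)
print(percent_variance_demo)
# These two sequences can be passed to plt.bar as tick_label and height.
```

Building the labels from the length of the result keeps the bar chart consistent no matter how many components were fitted.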

You can see the scree plot below.

Figure 1.2: Scree Plot


Step 8: Plot the Principal Components on 2D

Now that we have performed PCA, we need to visualize the new dataset to see how PCA makes it easier to explain the original data. We use a scatter plot:

plt.scatter(principalDataframe.PC1, principalDataframe.PC2)
plt.title('PC1 against PC2')

Listing 1.6:  2D Plot of PC1 and PC2

If you execute the code above, you will get the plot given in Figure 1.3.

Figure 1.3: First PCA plot of PC1 and PC2

So what have we achieved?

We now repeat this plot, this time with a color for each of the targets (Iris-setosa, Iris-versicolor and Iris-virginica). This way we can see how PCA helps explain the data. To keep things simple, I will not explain this code in detail. Write and run the code below.

fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 

ax.set_title('Plot of PC1 vs PC2', fontsize = 20)

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']

colors = ['r', 'g', 'b']

for target, color in zip(targets, colors):
    indicesToKeep = newDataframe['target'] == target
    ax.scatter(newDataframe.loc[indicesToKeep, 'PC1'],
               newDataframe.loc[indicesToKeep, 'PC2'],
               c=color,
               s=50)

Listing 1.7: Plot of PC1 vs PC2 with color codes
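
The key idea in the loop is the boolean mask stored in indicesToKeep. The sketch below shows the same pattern on a tiny made-up dataframe (the values and the name demo are hypothetical), so you can see which rows each pass of the loop selects.

```python
import pandas as pd

demo = pd.DataFrame({
    'PC1': [0.5, -1.2, 2.0, 0.1],
    'PC2': [1.5, 0.3, -0.7, 0.9],
    'target': ['Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-versicolor'],
})

# Comparing a column to a value yields one True/False per row.
mask = demo['target'] == 'Iris-setosa'

# .loc[mask, 'PC1'] keeps only the rows where the mask is True.
setosa_pc1 = demo.loc[mask, 'PC1']

print(mask.tolist())
print(setosa_pc1.tolist())
```

In the listing, each pass of the loop builds such a mask for one species and plots only the matching points in that species' color.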

Likewise, if you execute the code in Listing 1.7 above, you will get the output given in Figure 1.4 below:

Figure 1.4: Final Plot of PC1 and PC2

Explaining the Variance Using Principal Components

Finally, we need to see how the two principal components explain our data. To do that, we use the command below:

pca.explained_variance_ratio_

Then you will get the output:

array([0.72770452, 0.23030523])

These values show that the first principal component, PC1, explains 72.77% of the variation in the original data, while the second principal component explains 23.03%.

In conclusion, the original 4-dimensional data can reasonably be reduced to 2 dimensions using PCA, because the two components together explain about 95.8% of the variation in the dataset.
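
The conclusion can be checked with a little arithmetic on the two ratios reported above. This is a small sketch with hypothetical variable names that adds them up and reports how much variation is kept and how much is given up.

```python
import numpy as np

# The two explained variance ratios reported above.
ratios = np.array([0.72770452, 0.23030523])

# Running total of the variance explained as components are added.
cumulative = np.cumsum(ratios)

retained_percent = round(float(cumulative[-1]) * 100, 2)
lost_percent = round(100 - retained_percent, 2)

print(retained_percent)
print(lost_percent)
```

Whether the share that is given up is acceptable depends on the application, so treat the two-component choice as a judgment call rather than a fixed rule.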

I hope this lesson has helped you see how to perform Principal Components Analysis using Python.
