September 26, 2021
A Class on Data Visualization in Python

Class 2 – A Class on Data Visualization with Python – A Data Science Primer

In this class we will cover Data Visualization with Python. This class follows from Part 1: Your First Data Science Class. If you are a beginner, I recommend you check that out first.

The following is covered in this class:

  1. Our Dataset
  2. Univariate Plots: Understanding Attributes Independently
  3. Histogram
  4. Density Plots
  5. Box Plots (Whisker Plot)
  6. Multivariate Plots: Relationships Between Variables
  7. Correlation Matrix Plot
  8. Scatter Matrix Plot
  9. Using Heatmap – Seaborn
  10. Using Matshow
  11. Pairplot

Class 2 Video of Data Visualization

1. Our Dataset – The Wine Dataset

The wine dataset was obtained from the UCI Machine Learning Repository. We will use the wine.csv file, which is available for free from here. According to the documentation of this dataset, the data consists of 13 physicochemical parameters measured in 178 wine samples from three distinct cultivars (varieties produced by selective breeding) grown in Italy.

Use the code below to import your dataset:

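import pandas as pd  # read_excel also needs an Excel engine such as openpyxl installed

# Adjust the path to wherever you saved the dataset
path = r"/Users/kindsonmunonye/Datasets/wine.xlsx"
wine_data = pd.read_excel(path, header=0)  # the first row holds the column names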
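
The text above mentions a wine.csv file, while the snippet reads an Excel workbook. If your copy of the dataset is a CSV file, a minimal sketch of the equivalent load with pd.read_csv might look like the one below. The column names and the two rows are made-up stand-ins so that the example runs without the real file; they are not taken from the dataset.

```python
import io

import pandas as pd

# Stand-in for the contents of a CSV file (illustrative values, not the real dataset).
csv_text = """Wine,Alcohol,Ash,Hue
1,14.2,2.4,1.04
2,12.4,2.1,1.05
"""

# With a real file you would pass its path instead of the StringIO object,
# for example pd.read_csv("wine.csv", header=0) (hypothetical path).
sample_data = pd.read_csv(io.StringIO(csv_text), header=0)

print(sample_data.shape)          # (number of rows, number of columns)
print(list(sample_data.columns))  # column names taken from the header row
```

Checking the shape and the column names right after loading is a quick way to confirm that the header row was picked up as expected.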

The parameters are given below:

  1. Alcohol
  2. Malic Acid
  3. Ash
  4. Alkalinity of Ash
  5. Magnesium
  6. Total Phenols
  7. Flavanoids
  8. Nonflavanoid Phenols
  9. Proanthocyanins
  10. Color Intensity
  11. Hue
  12. OD280/OD315 of Diluted Wines
  13. Proline


2. Univariate Plots: Visualising Individual Features

Univariate plots help us understand each variable of the dataset independently of the other variables. The univariate plots we will use in this class are histograms, density plots, and box plots.

We begin with the histogram.


3. Histograms

A histogram groups the values of an attribute into bins and draws one vertical bar per bin. The height of each bar is the number of observations whose value falls in that bin. A histogram is useful, for example, for seeing how many observations fall in each range of values of a column.

Use the code below to get a histogram plot of the wine dataset:

import matplotlib.pyplot as plt

fig = plt.figure(figsize=(15, 20))
ax = fig.gca()
# pandas clears the figure and draws one histogram per numeric column
wine_data.hist(ax=ax)
# wine_data.hist(ax=ax, column='Wine')  # for a single data column
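
To see what the bars of a histogram actually count, here is a small sketch that bins a handful of values with NumPy instead of plotting them. The values and the bin edges are made up for illustration and are not taken from the wine dataset.

```python
import numpy as np

# Made-up alcohol readings, used only to illustrate binning.
values = np.array([11.0, 12.1, 12.4, 13.0, 13.2, 13.5, 14.1, 14.8])

# Four bins of width 1. Each bin is half-open [low, high),
# except the last one, which also includes its right edge.
bin_edges = [11, 12, 13, 14, 15]
counts, edges = np.histogram(values, bins=bin_edges)

# Each count is the height of the corresponding bar in the histogram.
for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low} to {high}: {count} observation(s)")
```

The counts always add up to the number of observations that fall inside the outermost bin edges.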


4. Density Plot

A density plot is similar to a histogram, but it uses a smooth curve to represent the distribution of an attribute. It uses a kernel density estimate to show an estimate of the probability density function (PDF) of each variable.

The code below provides the density plot of the wine dataset:

fig = plt.figure(figsize = (15,20))
ax = fig.gca()
# One density curve per column on a 4x4 grid; kind='density' requires SciPy to be installed
wine_data.plot(ax = ax, kind='density', subplots=True, layout=(4,4), sharex=False)
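
As a rough sketch of what a kernel density estimate does, the snippet below places a small Gaussian bump on every observation and averages the bumps. The sample values and the fixed bandwidth of 0.5 are assumptions chosen for illustration; pandas relies on SciPy for this and selects the bandwidth automatically, so its curve will not match this one exactly.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate on the points in grid."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    # Standard normal kernel evaluated at those distances.
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    # Average of the kernels, rescaled by the bandwidth so the area is 1.
    return kernels.sum(axis=1) / (len(samples) * bandwidth)

# Made-up readings, used only for illustration.
samples = [11.0, 12.1, 12.4, 13.0, 13.2, 13.5, 14.1, 14.8]
grid = np.linspace(8, 18, 1001)
density = gaussian_kde_1d(samples, grid, bandwidth=0.5)

# The area under a density curve should be close to 1.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows the individual observations, while a larger one gives a smoother curve.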

The kind argument can be changed to any of the following:

  • ‘line’ : line plot (default)
  • ‘bar’ : vertical bar plot
  • ‘barh’ : horizontal bar plot
  • ‘hist’ : histogram
  • ‘box’ : boxplot
  • ‘kde’ : Kernel Density Estimation plot
  • ‘density’ : same as ‘kde’
  • ‘area’ : area plot
  • ‘pie’ : pie plot
  • ‘scatter’ : scatter plot (DataFrame only)
  • ‘hexbin’ : hexbin plot (DataFrame only)

I recommend you try them out yourself to see what you get.
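
One way to try several kinds quickly is to loop over them. The sketch below uses a tiny made-up DataFrame in place of wine_data and the non-interactive Agg backend so that it runs without a display. It leaves out 'kde' and 'density', which need SciPy, and 'scatter' and 'hexbin', which also need x and y column arguments.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for two columns of the dataset.
sample_data = pd.DataFrame({
    "Alcohol": [14.2, 13.2, 13.1, 12.4, 12.3, 13.7],
    "Hue": [1.04, 1.05, 1.03, 0.86, 0.98, 0.89],
})

panels_per_kind = {}
for kind in ("line", "hist", "box"):
    # One panel per column, arranged in a 1x2 grid.
    sample_data.plot(kind=kind, subplots=True, layout=(1, 2), sharex=False)
    panels_per_kind[kind] = len(plt.gcf().axes)
    plt.close("all")  # free the figure before drawing the next kind

print(panels_per_kind)
```

In a notebook you would drop the Agg line and the plt.close call so that each figure is displayed.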


5. Box Plots

This is also called a box-and-whisker plot. It provides a visualization of the distribution of each attribute in the dataset. It draws a line at the median of the attribute and a box spanning the 25th to 75th percentiles (the 1st and 3rd quartiles). It also draws whiskers to indicate the spread of the rest of the data.

Use the code below to get a box plot of the wine dataset.

fig = plt.figure(figsize = (15,20))
ax = fig.gca()
wine_data.plot(ax = ax, kind='box', subplots=True, layout=(4,4), sharex=False)
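
The numbers behind a single box can be worked out by hand. The sketch below computes them with NumPy for a few made-up values, assuming matplotlib's default whisker rule: each whisker reaches to the most extreme observation within 1.5 times the interquartile range of the box, and anything beyond that is drawn as an individual point.

```python
import numpy as np

# Made-up readings with one unusually large value.
values = np.array([11.0, 12.1, 12.4, 13.0, 13.2, 13.5, 14.1, 14.8, 19.0])

# Box: 1st quartile, median and 3rd quartile.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

# Whiskers reach to the most extreme values within 1.5 * IQR of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Points beyond the fences are plotted individually as outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```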


6. Multivariate Plots

These plots are used to visualize several variables at once. Multivariate plots provide insight into the relationships and interactions between the variables in a dataset.

Some multivariate plots are the correlation matrix plot, the scatter matrix plot and the pairwise plot (pairplot).


7. Correlation Matrix

Correlation provides insight into the relationship between two variables: how does a change in one variable relate to a change in the other? A correlation matrix plot uses Pearson’s correlation coefficient, which ranges from -1 to 1. Values close to -1 or 1 indicate a strong negative or positive linear relationship, and values close to 0 indicate a weak one.
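
As a small sketch of what the coefficient measures, the function below computes Pearson's correlation directly from its definition and compares it with NumPy's np.corrcoef. The helper name pearson and all of the values are made up for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2)))

# Made-up values, used only for illustration.
alcohol = [12.0, 12.5, 13.0, 13.5, 14.0]
rising = [2.0, 2.5, 3.0, 3.5, 4.0]    # increases in step with alcohol
falling = [4.0, 3.5, 3.0, 2.5, 2.0]   # decreases in step with alcohol
noisy = [3.0, 2.0, 3.5, 2.5, 3.0]     # no clear pattern

print(pearson(alcohol, rising))
print(pearson(alcohol, falling))
print(pearson(alcohol, noisy), np.corrcoef(alcohol, noisy)[0, 1])
```

A correlation matrix is simply this coefficient computed for every pair of columns.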

A correlation matrix plot can be created using matshow from matplotlib or using the seaborn module.

Using seaborn:

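# Plot using Seaborn
import seaborn as sb
correlations = wine_data.corr()  # pairwise Pearson correlation of the columns
fig = plt.figure(figsize = (15,15))
ax = fig.gca()
sb.heatmap(correlations, annot=True, ax=ax)  # annot=True writes each coefficient in its cell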


Using matshow:

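# Using matshow
import numpy as np
correlations = wine_data.corr()  # pairwise Pearson correlation of the columns
fig = plt.figure(figsize=(15, 15))
ax = fig.gca()  # get the figure's current axes
cax = ax.matshow(correlations, vmin=-1, vmax=1)  # draw the matrix as a colour-coded image
fig.colorbar(cax)  # legend mapping colours to correlation values

ticks = np.arange(0, len(correlations.columns), 1)  # one tick per column (0 to 13 for this dataset)
ax.set_xticks(ticks)
ax.set_yticks(ticks)
ax.set_xticklabels(correlations.columns, rotation=90)
ax.set_yticklabels(correlations.columns)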

The output of either snippet above is also referred to as a ‘heatmap’.


8. Scatter Matrix Plot

A scatter plot shows the relationship between two variables by drawing each observation as a dot in two dimensions, with one variable on the horizontal (x) axis and the other on the vertical (y) axis. A scatter matrix arranges the scatter plots for every pair of variables in a grid.

The code below produces a scatter matrix.

# Using Scatter matrix from pandas
import pandas.plotting as pp
fig = plt.figure(figsize = (15,15))
ax = fig.gca()
pp.scatter_matrix(wine_data, ax=ax)
# pp.scatter_matrix(wine_data[['Wine','Alcohol', 'Ash', 'Malic.acid']], ax=ax) # Taking a subset
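
To make the grid layout concrete, the sketch below runs scatter_matrix on a small synthetic DataFrame and inspects the array of axes it returns. The column names a, b and c and the random data are assumptions for illustration; the Agg backend is used only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic data: b depends on a, while c is unrelated to both.
rng = np.random.default_rng(0)
a = rng.normal(size=50)
sample_data = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.5, size=50),
    "c": rng.normal(size=50),
})

# One panel per pair of columns; the diagonal shows each column's histogram.
axes = scatter_matrix(sample_data, figsize=(6, 6), diagonal="hist")
print(axes.shape)  # rows and columns of the panel grid

plt.close("all")
```

With three columns the grid has three rows and three columns of panels, so the full wine dataset produces a much denser figure. That is why taking a subset of columns, as in the commented line above, is often easier to read.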


Scatter matrix plot using seaborn

# Using Seaborn
import seaborn as sb
sb.pairplot(wine_data[['Ash', 'Wine', 'Hue', 'Acl']]) # Taking Subset


Complete Video Tutorial on Plotting
