Class 1 – Practical Data Science Class For Data Science Beginners

This is your first class as a beginner data scientist. It is a practical class, with the concepts explained along the way.

Here’s what we’ll cover in this class:

Obtain Free Dataset from sklearn
Slicing Your Data
Create a Dictionary Using the Dataset
Convert to Pandas Dataframe
Replace Numerical Values with Target Names
Write data to csv
Check dimension of the dataset
View data types
View summary of dataset
Check class distribution of data using groupby
Check Correlation between features
Check skewness of dataset

Note: This class is clearer when you also watch the video lesson.

Class 1 Video

1. Obtain Free Dataset from sklearn

There are a number of ways to get free datasets. You can also generate your own dataset. One way to get free datasets is to get them from packages in R. This is explained here.

But in this tutorial, we will use the iris dataset from sklearn. The measurements come as a NumPy array. The code is given below:

import sklearn.datasets as ds
iris = ds.load_iris()

The above code gets the dataset and loads it into the variable iris.
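Before going further, it can help to look at what was loaded. The sketch below assumes the standard attributes of the object returned by load_iris() (data, feature_names and target_names) and simply prints them.

```python
import sklearn.datasets as ds

iris = ds.load_iris()

# load_iris() returns a dictionary-like object; the measurements are in .data
print(type(iris.data))      # the array type holding the feature values
print(iris.data.shape)      # (number of rows, number of columns)
print(iris.feature_names)   # names of the four measurement columns
print(iris.target_names)    # class names that the codes 0, 1, 2 stand for
```

The shape tells you there are 150 flowers with 4 measurements each.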

2. Slicing Your Data

Slicing simply means taking a subset of the dataset. The code below separates the dataset into columns:

col0 = iris.data[:,0] # column 0
col1 = iris.data[:,1] # column 1
col2 = iris.data[:,2] # column 2
col3 = iris.data[:,3] # column 3
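The same bracket notation can also take a subset of rows, or several columns at once. The variable names in this short sketch are only illustrative.

```python
import sklearn.datasets as ds

iris = ds.load_iris()

# The part before the comma selects rows, the part after it selects columns
first_five_rows = iris.data[:5, :]   # rows 0 to 4, all columns
sepal_columns = iris.data[:, :2]     # all rows, columns 0 and 1
single_value = iris.data[0, 2]       # row 0, column 2

print(first_five_rows.shape)
print(sepal_columns.shape)
print(single_value)
```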

3. Create a Dictionary Using the Dataset

Now we create a Python dictionary called iris_dict. We need the dictionary to convert the array dataset to a pandas DataFrame.

# Map each column name to the matching array sliced in step 2
iris_dict = {'Sepal Length': col0,
             'Sepal Width': col1,
             'Petal Length': col2,
             'Petal Width': col3,
             'Target': iris.target
            }

4. Convert to Pandas Dataframe

We do this using the code below.

import pandas as pd
iris_df = pd.DataFrame(data=iris_dict)

The new dataframe is called iris_df.
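As an aside, the dictionary is not the only route. A possible shortcut, sketched below, passes the whole array to pd.DataFrame together with a list of column names. The name iris_df_alt is made up for this illustration so that it does not overwrite iris_df.

```python
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()

# Same column names as in the dictionary version
column_names = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']

# Build the DataFrame from the 2-D array, then add the class codes
iris_df_alt = pd.DataFrame(iris.data, columns=column_names)
iris_df_alt['Target'] = iris.target

print(iris_df_alt.head())  # first five rows
```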

5. Replace Numerical Values With Text Target Names

We now replace the numerical values (0, 1, 2) with the actual names of the classes, which are available in iris.target_names. The code below does that:

# Replace Numerical classes with Target names
target = iris.target_names

iris_df.loc[iris_df['Target']==0, 'Target'] = target[0]
iris_df.loc[iris_df['Target']==1, 'Target'] = target[1]
iris_df.loc[iris_df['Target']==2, 'Target'] = target[2]
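Writing text into a column that holds integers can make recent pandas versions complain about mixed data types. One alternative, shown here as a sketch and not as the method used in the video, is to build a lookup dictionary and let map() translate every code in one go. The names code_to_name and names are assumptions for this example.

```python
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()

# The numeric class codes as a pandas Series
codes = pd.Series(iris.target, name='Target')

# Lookup table: 0 -> 'setosa', 1 -> 'versicolor', 2 -> 'virginica'
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}

# map() replaces each code with its name and returns a new Series
names = codes.map(code_to_name)
print(names.value_counts())
```

With the DataFrame from step 4, the equivalent would be iris_df['Target'] = iris_df['Target'].map(code_to_name).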

6. Write Pandas DataFrame to csv

Now you can export this data as a CSV file on your local computer:

iris_df.to_csv('irisCSV.csv')

It is saved in the current working directory, which is normally the directory of the notebook you are working with.
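By default to_csv also writes the row index as an extra first column. If you do not want that, you can pass index=False. The sketch below uses a tiny made-up DataFrame and a temporary folder (both assumptions for this example) and reads the file back to confirm what was written.

```python
import os
import tempfile

import pandas as pd

# A tiny stand-in for iris_df
sample_df = pd.DataFrame({'Sepal Length': [5.1, 4.9],
                          'Target': ['setosa', 'setosa']})

# Write to a temporary folder so nothing is left in your project
csv_path = os.path.join(tempfile.mkdtemp(), 'iris_sample.csv')
sample_df.to_csv(csv_path, index=False)  # index=False leaves out the row numbers

# Read the file back to check its columns
restored_df = pd.read_csv(csv_path)
print(restored_df.columns.tolist())
```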

7. Check dimension of the dataset

The dimension of the dataset is simply the number of rows and columns in the dataset. You get it using the shape attribute, as shown below. For iris_df it returns (150, 5).

iris_df.shape

8. View data types

This tells you the data type of each column in your dataset. You get it using the dtypes attribute:

iris_df.dtypes

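With Target still holding the numeric codes (that is, before the replacement in step 5), the output is as follows. After step 5 the Target column holds text and is no longer reported as int64.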

Sepal Length    float64
Sepal Width     float64
Petal Length    float64
Petal Width     float64
Target            int64
dtype: object
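Knowing the data types matters because some calculations only make sense for numeric columns. As a small sketch, select_dtypes can pick out just the numeric columns. The DataFrame here is a made-up three-row example, not the full iris data.

```python
import pandas as pd

# One numeric column and one text column
sample_df = pd.DataFrame({'Petal Width': [0.2, 1.3, 2.1],
                          'Target': ['setosa', 'versicolor', 'virginica']})

# Keep only the columns whose data type is numeric
numeric_df = sample_df.select_dtypes(include='number')
print(numeric_df.columns.tolist())
```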

9. View summary of dataset

We can get the summary statistics of our dataset. These statistics include the count, mean, standard deviation, minimum, maximum and quartiles of each numeric column.

from pandas import set_option # allows us to set display options
set_option('display.precision', 2) # show 2 decimal places
iris_df.describe()

The output of the above code is:

Output of dataset summary using describe()

10. Check class distribution of data using groupby

The class distribution helps you to see the balance of the class values. See the video for more explanation.

iris_df.groupby('Target').size()

With the numeric class codes, the output is as follows. After step 5 the class names appear in place of 0, 1 and 2.

Target
0    50
1    50
2    50
dtype: int64
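Each class has 50 rows, so the dataset is perfectly balanced. If you prefer proportions to raw counts, one option is value_counts with normalize=True, sketched below on the class codes taken straight from sklearn.

```python
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()
target = pd.Series(iris.target, name='Target')

counts = target.value_counts()                     # rows per class
proportions = target.value_counts(normalize=True)  # share of rows per class

print(counts)
print(proportions)
```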

11. Check Correlation between features

Correlation measures how strongly two variables in your dataset move together. The Pearson correlation coefficient ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation).
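To make the idea concrete, here is a rough sketch that works out the Pearson coefficient by hand for two columns (petal length and petal width) and compares it with NumPy's corrcoef. The variable names are chosen only for this illustration.

```python
import numpy as np
import sklearn.datasets as ds

iris = ds.load_iris()
petal_length = iris.data[:, 2]
petal_width = iris.data[:, 3]

# Subtract the mean from each column
x_centered = petal_length - petal_length.mean()
y_centered = petal_width - petal_width.mean()

# Pearson r = sum(x*y) / sqrt(sum(x^2) * sum(y^2)) on the centered values
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum())

# NumPy's built-in version, for comparison
r_numpy = np.corrcoef(petal_length, petal_width)[0, 1]

print(r_manual, r_numpy)
```

Both numbers agree and are close to 1, which says that flowers with longer petals also tend to have wider petals. The pandas code that follows computes this for every pair of columns at once.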

# numeric_only=True skips the text Target column (needs pandas 1.5 or later)
correlations = iris_df.corr(method='pearson', numeric_only=True)
correlations # you can also use print(correlations)

The output of the above code is given below:

12. Check skewness of dataset

Skewness measures how far the distribution of the data departs from a symmetric shape, such as the normal (Gaussian) distribution, by being stretched towards either the left or the right.

iris_df.skew(numeric_only=True) # skip the text Target column
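A value near 0 means the column is roughly symmetric, a positive value means a longer tail on the right, and a negative value means a longer tail on the left. For the curious, the sketch below reproduces the number pandas gives for one column (sepal length), assuming pandas uses the sample-size-adjusted skewness formula. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()
sepal_length = pd.Series(iris.data[:, 0])

n = len(sepal_length)
deviations = sepal_length - sepal_length.mean()

m2 = (deviations ** 2).mean()   # average squared deviation
m3 = (deviations ** 3).mean()   # average cubed deviation
g1 = m3 / m2 ** 1.5             # basic skewness

# Adjust for sample size
skew_manual = g1 * np.sqrt(n * (n - 1)) / (n - 2)
skew_pandas = sepal_length.skew()

print(skew_manual, skew_pandas)
```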