This is your first class as a beginner data scientist. It is a practical class, with the concepts explained along the way.
Here’s what we’ll cover in this class:
- Obtain Free Dataset from sklearn
- Slicing Your Data
- Create a Dictionary Using the Dataset
- Convert to Pandas Dataframe
- Replace Numerical Values with Target Names
- Write data to csv
- Check dimension of the dataset
- View data types
- View summary of dataset
- Check class distribution of data using groupby
- Check Correlation between features
- Check skewness of dataset
Note: This class is clearer when you also watch the video lesson.
1. Obtain Free Dataset from sklearn
There are a number of ways to get free datasets. You can also generate your own dataset. One way to get free datasets is to get them from packages in R. This is explained here.
But in this tutorial, we will use the iris dataset from sklearn. The data comes as NumPy arrays bundled in a single object. The code is given below:
import sklearn.datasets as ds
iris = ds.load_iris()
The above code gets the dataset and loads it into the variable iris.
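Before going further, it can help to look at what was loaded. The sketch below prints a few attributes of the object returned by load_iris(); the choice of which attributes to print is only an illustration.

```python
import sklearn.datasets as ds

iris = ds.load_iris()

print(type(iris.data))      # the measurements are stored in a NumPy array
print(iris.data.shape)      # (number of rows, number of columns)
print(iris.feature_names)   # names of the four measurement columns
print(iris.target_names)    # names of the three classes
print(iris.target[:5])      # class codes of the first five rows
```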
2. Slicing Your Data
Slicing simply means taking a subset of the dataset. The code below separates the dataset into columns:
col0 = iris.data[:, 0]  # column 0
col1 = iris.data[:, 1]  # column 1
col2 = iris.data[:, 2]  # column 2
col3 = iris.data[:, 3]  # column 3
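Slicing is not limited to whole columns. As a small sketch, the same row, column notation can also pick out rows or a rectangular block; the variable names below are made up for illustration.

```python
import sklearn.datasets as ds

iris = ds.load_iris()

first_five_rows = iris.data[:5, :]   # rows 0 to 4, all columns
first_two_cols = iris.data[:, :2]    # all rows, columns 0 and 1
block = iris.data[10:20, 2:4]        # rows 10 to 19, columns 2 and 3

print(first_five_rows.shape, first_two_cols.shape, block.shape)
```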
3. Create a Dictionary Using the Dataset
Now we create a Python dictionary called iris_dict. We use the dictionary to convert the array dataset to a pandas DataFrame:
# col0 to col3 are the column slices created in the previous step
iris_dict = {
    'Sepal Length': col0,
    'Sepal Width': col1,
    'Petal Length': col2,
    'Petal Width': col3,
    'Target': iris.target,
}
4. Convert to Pandas Dataframe
We do this using the code below.
import pandas as pd
iris_df = pd.DataFrame(data=iris_dict)
The new dataframe is called iris_df.
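As an alternative sketch, the DataFrame can also be built straight from the array by passing the column names, without slicing each column first. The name iris_df_direct is made up here so that it does not clash with iris_df.

```python
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()

column_names = ['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width']

# Build the frame from the whole 2-D array, then add the class codes
iris_df_direct = pd.DataFrame(data=iris.data, columns=column_names)
iris_df_direct['Target'] = iris.target

print(iris_df_direct.head())  # first five rows
```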
5. Replace Numerical Values With Text Target Names
We now replace the numerical values (0, 1, 2) in the Target column with the actual class names available in iris.target_names. The code below does that:
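# Replace numerical classes with target names
target = iris.target_names
# let the column hold text before assigning names (newer pandas versions need this)
iris_df['Target'] = iris_df['Target'].astype(object)
iris_df.loc[iris_df['Target'] == 0, 'Target'] = target[0]
iris_df.loc[iris_df['Target'] == 1, 'Target'] = target[1]
iris_df.loc[iris_df['Target'] == 2, 'Target'] = target[2]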
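The same replacement can be sketched in one step with a lookup dictionary and the Series map method. The names codes, code_to_name and names are made up for this example, and it works on a standalone Series rather than on iris_df.

```python
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()

# Class codes (0, 1, 2) as a pandas Series
codes = pd.Series(iris.target, name='Target')

# Lookup table from class code to class name
code_to_name = {code: str(name) for code, name in enumerate(iris.target_names)}

# Replace every code with its name in a single call
names = codes.map(code_to_name)

print(names.head())
```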
6. Write Pandas DataFrame to csv
Now you can export this data as a CSV file on your local computer:
iris_df.to_csv('irisCSV.csv')
The file is saved in the current working directory, which is normally the directory of the notebook you are working in.
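By default, to_csv also writes the row index as an extra first column. The sketch below uses a tiny made-up DataFrame and a temporary directory (both assumptions for illustration) to show how index=False leaves the index out, and how read_csv loads the file back.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in DataFrame, used only for this example
df = pd.DataFrame({'Sepal Length': [5.1, 4.9], 'Target': ['setosa', 'setosa']})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'irisCSV.csv')
    df.to_csv(path, index=False)   # do not write the row index
    restored = pd.read_csv(path)   # read the file back into a DataFrame

print(restored)
```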
7. Check dimension of the dataset
The dimension of the dataset is simply the number of rows and columns in the dataset. You get it using the shape attribute as shown below:
iris_df.shape
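Since shape is a tuple of (rows, columns), it can be unpacked into two variables. A minimal sketch, with the made-up names n_rows and n_cols and a DataFrame built directly from the iris array:

```python
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()
df = pd.DataFrame(iris.data)

# shape is an attribute, so there are no parentheses after it
n_rows, n_cols = df.shape
print(n_rows, 'rows and', n_cols, 'columns')
```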
8. View data types
This means that you want to know the data types of the columns in your dataset. You get them using the dtypes attribute:
iris_df.dtypes
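If Target still holds the numeric codes, the output will be as shown below. After the replacement in step 5, Target is typically reported as object instead.
Sepal Length    float64
Sepal Width     float64
Petal Length    float64
Petal Width     float64
Target            int64
dtype: object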
9. View summary of dataset
We can get the summary statistics of our dataset. These statistics include the count, mean, standard deviation, minimum, maximum and quartiles of each numeric column.
from pandas import set_option  # allows us to set the display precision
set_option('display.precision', 2)  # show two decimal places
iris_df.describe()
The output of the above code is:

10. Check class distribution of data using groupby
The class distribution helps you to see the balance of the class values. See the video for more explanation.
iris_df.groupby('Target').size()
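The output would be as follows (with the class names in place of 0, 1 and 2 if you have already done the replacement in step 5):
Target
0    50
1    50
2    50
dtype: int64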
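The same counts can be sketched with value_counts, which can also report each class as a proportion of the whole. The example works on a standalone Series of class codes; the names counts and proportions are made up.

```python
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()
target = pd.Series(iris.target, name='Target')

counts = target.value_counts()                     # rows per class
proportions = target.value_counts(normalize=True)  # share of each class

print(counts)
print(proportions)
```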
11. Check Correlation between features
Correlation measures the strength of the linear relationship between pairs of variables in your dataset. Correlation values range from -1 (perfect negative correlation) through 0 (no linear correlation) to 1 (perfect positive correlation).
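# keep only the numeric columns, since Target may hold text after step 5
correlations = iris_df.select_dtypes(include='number').corr(method='pearson')
correlations  # you can also use print(correlations)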
The output of the above code is given below:

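A correlation matrix repeats every pair twice and has 1.0 on the diagonal. As a rough sketch, the code below keeps only the upper triangle and picks the pair of features with the strongest correlation. The names features, pairs and strongest are made up for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.datasets as ds

iris = ds.load_iris()
features = pd.DataFrame(
    iris.data,
    columns=['Sepal Length', 'Sepal Width', 'Petal Length', 'Petal Width'],
)

corr = features.corr(method='pearson')

# True above the diagonal, False elsewhere, so each pair appears once
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Pair of features with the largest absolute correlation
strongest = pairs.abs().idxmax()
print(strongest, round(pairs[strongest], 2))
```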
12. Check skewness of dataset
Skewness measures how asymmetric the distribution of the data is. Data that is expected to follow a normal (Gaussian) distribution may instead appear distorted or shifted to the left or the right.
iris_df.skew(numeric_only=True)  # skips text columns, such as Target after step 5
The output of the above code is given below (the Target row appears only if Target is still numeric):
Sepal Length    0.31
Sepal Width     0.32
Petal Length   -0.27
Petal Width    -0.10
Target          0.00
dtype: float64
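To see where such numbers come from, here is a minimal sketch that computes skewness by hand with NumPy, using the bias-adjusted sample formula, and compares it with the pandas result. The function name sample_skewness and the two small lists are made up for illustration.

```python
import numpy as np
import pandas as pd

def sample_skewness(values):
    """Bias-adjusted sample skewness computed with NumPy."""
    x = np.asarray(values, dtype=float)
    n = x.size
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)   # second central moment
    m3 = np.mean(deviations ** 3)   # third central moment
    g1 = m3 / m2 ** 1.5             # unadjusted skewness
    return g1 * np.sqrt(n * (n - 1)) / (n - 2)

right_tailed = [1, 2, 2, 3, 3, 3, 10]   # one large value stretches the right tail
symmetric = [1, 2, 3, 4, 5]             # evenly spread around the mean

print(sample_skewness(right_tailed), pd.Series(right_tailed).skew())
print(sample_skewness(symmetric))
```

A positive value means a longer tail on the right, a negative value a longer tail on the left, and a value near zero a roughly symmetric distribution.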