September 26, 2021 Class 3 – Introduction to Data Preprocessing and Data Cleaning – Part 1

This is Class 3 of our practical Data Science Course for Beginners. In this class we will perform data preprocessing and data cleaning (also called data cleansing). We will also discuss some of the theoretical concepts.

We will be using the Titanic Dataset. Get the Titanic Dataset here for free.

The following are covered:

Class 3 Video on Preprocessing

1. What is Data Preprocessing?

After obtaining your dataset and doing basic visualization, the next step is to perform preprocessing on your dataset. Data preprocessing refers to the operations you perform on your data to ensure it works well with Machine Learning algorithms. Data preprocessing also helps the analytics process perform better. It includes data cleaning, outlier detection, data wrangling, normalization, data editing, removal of unreliable data, data conversion, etc.

In this class we will perform most of them on the Titanic Dataset.

2. Data Scaling or Rescaling

Data scaling is a technique that puts the attributes of the dataset on the same scale. Most often we rescale to the range 0 to 1, which helps Machine Learning algorithms that are sensitive to the magnitude of feature values, such as k-Nearest Neighbors and models trained with Gradient Descent.

The scikit-learn library (sklearn) provides a class called MinMaxScaler for performing scaling. This class is available in the sklearn.preprocessing module, which is referred to as pp in the code below.

Take the four steps below to scale the data in the fare column of the Titanic dataset.

Step 1 – Create the MinMaxScaler object

# Import the preprocessing module and create a MinMaxScaler object
from sklearn import preprocessing as pp
data_scaler = pp.MinMaxScaler(feature_range=(0, 1))

Step 2 – Extract the fare column

# Extract the fare column
fare_array = titanic_df[['fare']]

Step 3 – Perform the scaling

# Perform the scaling of the extracted column
fare_array_scaled = data_scaler.fit_transform(fare_array)

Step 4 – Replace the original column

# Now replace the original column with the scaled column
titanic_df['fare'] = fare_array_scaled
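
To see what the scaler is doing, here is a minimal sketch on a small made-up table (toy_df and its fare values are invented for illustration, not taken from the Titanic dataset). It compares MinMaxScaler with the formula (x - min) / (max - min) computed by hand.

```python
import pandas as pd
from sklearn import preprocessing as pp

# Made-up fares standing in for the Titanic 'fare' column (assumption)
toy_df = pd.DataFrame({"fare": [0.0, 10.0, 25.0, 50.0]})

# Scale with MinMaxScaler, following the four steps above
data_scaler = pp.MinMaxScaler(feature_range=(0, 1))
scaled_by_sklearn = data_scaler.fit_transform(toy_df[["fare"]]).ravel()

# The same result by hand: (x - min) / (max - min)
fare = toy_df["fare"]
scaled_by_hand = ((fare - fare.min()) / (fare.max() - fare.min())).to_numpy()

print(scaled_by_sklearn)  # smallest fare maps to 0, largest to 1
print(scaled_by_hand)
```

Both lines should print the same numbers, with the smallest fare at 0 and the largest at 1.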

3. Dropping and Interpolating Missing Data

Dropping and interpolating are data cleansing techniques used to handle missing values in a dataset. We can also decide to drop a column if it does not contribute anything to the data analysis process, for example the name and ticket columns.

Drop Columns with Missing Values

Another reason we may drop a column is when it has many missing values. Examples are the body, boat and cabin columns of the Titanic dataset.

To drop these columns, use the code below:

# Drop Columns
cols_to_drop = ['body', 'boat', 'name', 'ticket', 'cabin']
titanic_df = titanic_df.drop(cols_to_drop, axis=1)

The axis=1 argument indicates that we are dropping columns rather than rows.
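
Before choosing which columns to drop, it can help to count how much data is missing in each one. The sketch below uses a small made-up table (toy_df is invented for illustration), and the "more than half missing" threshold is only an assumed rule of thumb, not something prescribed in this class.

```python
import numpy as np
import pandas as pd

# Made-up data with some missing values (assumption, not the real Titanic data)
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 41.0],
    "cabin": [np.nan, np.nan, "C85", np.nan],
    "fare": [7.25, 71.28, 8.05, 53.10],
})

# Fraction of missing values in each column
missing_share = toy_df.isna().mean()
print(missing_share)

# Hypothetical rule: drop columns where more than half of the values are missing
cols_to_drop = missing_share[missing_share > 0.5].index.tolist()
toy_df = toy_df.drop(cols_to_drop, axis=1)
print(toy_df.columns.tolist())  # ['age', 'fare']
```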

Interpolating Missing Values

If you have a column with very few missing values, you can choose to interpolate them using existing values. Interpolation is simply a way to create new data based on existing data. For example, suppose you have the sequence 2, 4, ?, 8, 10. By linear interpolation, the missing value is 6, that is, (4+8)/2.

Let’s interpolate the age column of the Titanic dataset using the code below.

# Replace missing values in the age column with interpolated values
titanic_df['age'] = titanic_df['age'].interpolate()
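
The 2, 4, ?, 8, 10 example can be checked directly with pandas. This is a minimal sketch on a standalone Series (the variable names are just for illustration). With no arguments, interpolate() uses linear interpolation.

```python
import numpy as np
import pandas as pd

# The sequence from the example, with the unknown value as NaN
values = pd.Series([2, 4, np.nan, 8, 10])

# Linear interpolation (the default) fills the gap from its neighbours
filled = values.interpolate()
print(filled.tolist())  # [2.0, 4.0, 6.0, 8.0, 10.0]
```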

Drop Rows with Missing Values

To drop all rows with missing values, we can use the code below. Here, we don’t specify the axis, because the default (axis=0) drops rows.

# Drop all rows with missing data
titanic_df = titanic_df.dropna()
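
Dropping rows can remove more data than you expect, so it is worth comparing the row count before and after. A minimal sketch on a made-up table (toy_df is invented for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: two of the three rows have a missing value (assumption)
toy_df = pd.DataFrame({
    "age": [22.0, np.nan, 35.0],
    "fare": [7.25, 71.28, np.nan],
})

rows_before = len(toy_df)
cleaned_df = toy_df.dropna()  # default axis=0 drops rows with any missing value
rows_after = len(cleaned_df)

print(rows_before, rows_after)  # 3 1
```

Here only one row survives, which is why interpolating or dropping whole columns first is often the better choice.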

4. Data Normalization

Normalization is used when certain features have a broad range of values. For example, some features have values of 0 or close to 0, while others have very high values, in the 100s or 1000s. In this case normalization rescales each record (row) so that its norm, a measure of its length, equals 1.

There are two types of normalization covered here: L1 Normalization and L2 Normalization.

L1 Normalization – Also known as Manhattan normalization. Here, for each row of the dataset, the sum of the absolute values will always equal 1.

L2 Normalization – Also known as Euclidean normalization. Here, for each row of data, the square root of the sum of the squares of the values will always equal 1.
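
The two definitions can be checked on a single made-up row, [3, 4] (chosen only because the arithmetic is easy). This is a minimal sketch using the Normalizer class from sklearn.preprocessing.

```python
import numpy as np
from sklearn import preprocessing as pp

# One made-up record with two features
row = np.array([[3.0, 4.0]])

# L1: divide by |3| + |4| = 7, so the absolute values sum to 1
l1_row = pp.Normalizer(norm="l1").fit_transform(row)
l1_check = np.abs(l1_row).sum()

# L2: divide by sqrt(3^2 + 4^2) = 5, so the row has Euclidean length 1
l2_row = pp.Normalizer(norm="l2").fit_transform(row)
l2_check = np.sqrt((l2_row ** 2).sum())

print(l1_row, l1_check)  # roughly [[0.4286 0.5714]] and 1.0
print(l2_row, l2_check)  # [[0.6 0.8]] and 1.0
```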

To perform normalization, we simply create a normalizer object and proceed in the same way as we did for scaling. A code snippet is given below. See the video for a full explanation.

# Perform normalization on the parch column
# Note: Normalizer rescales each row, so with a single column every nonzero value becomes 1
normalizer = pp.Normalizer(norm='l1') # use 'l2' for L2 normalization
parch_array = titanic_df[['parch']]
parch_array_normalized = normalizer.fit_transform(parch_array)
titanic_df['parch'] = parch_array_normalized
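
Because Normalizer works row by row, its effect depends on how many columns you pass in. The sketch below compares one column with two columns on a made-up table. Here toy_df and its values are invented, and the sibsp column is an assumed second feature used only for illustration.

```python
import pandas as pd
from sklearn import preprocessing as pp

# Made-up data (assumption, not the real Titanic values)
toy_df = pd.DataFrame({
    "parch": [0.0, 1.0, 2.0, 5.0],
    "sibsp": [1.0, 0.0, 2.0, 0.0],
})

normalizer = pp.Normalizer(norm="l1")

# One column: each row holds a single value, so nonzero values all become 1
single = normalizer.fit_transform(toy_df[["parch"]]).ravel()
print(single.tolist())  # [0.0, 1.0, 1.0, 1.0]

# Two columns: each row is rescaled so its absolute values sum to 1
both = normalizer.fit_transform(toy_df[["parch", "sibsp"]])
print(both.tolist())  # [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
```

In practice this means normalization is usually applied to several numeric columns together rather than to one column at a time.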

Exercise: Perform L2 normalization on the Ash column of the wine dataset. (Try it, then see the video for the procedure and explanation.)

The remaining five points are covered in the next class, Part 2:

• 5. Numerical and Categorical Values Conversion
• 6. Data Binarization
• 7. Data Standardization
• 8. Data Labelling and Encoding
• 9. Data Splitting – Feature and Class; Train & Test

Go to Part 2 