This is Class Three of our practical Science Course for Data Science Beginners. In this class we will perform data preprocessing and data cleaning (also called data cleansing). We will also discuss some of the theoretical concepts.
We will be using the Titanic Dataset. Get the Titanic Dataset here for free.
The following are covered:
- What is Data Preprocessing?
- Data Scaling
- Dropping and Interpolating Missing Data
- Data Normalisation
- Numerical and Categorical Values Conversion
- Data Binarization
- Data Standardization
- Data Labelling and Encoding
- Data Splitting – Feature and Class; Train & Test
1. What is Data Preprocessing?
After obtaining your dataset and doing basic visualization, the next step is to perform preprocessing on your dataset. Data preprocessing refers to the operations you perform on your data to ensure it works well with Machine Learning algorithms. Data preprocessing also ensures better performance during the analytics process. It includes data cleaning, outlier detection, data wrangling, normalization, data editing, unreliable data removal, data conversion and so on.
In this class we will perform most of these operations on the Titanic Dataset.
2. Data Scaling or Rescaling
Data scaling is a technique that ensures that the attributes of the dataset are on the same scale. Most of the time we rescale to a range of 0 to 1, as required by Machine Learning algorithms such as k-Nearest Neighbors and Gradient Descent.
Python provides a class called MinMaxScaler for performing scaling. It is available in the preprocessing module of sklearn.
Take the four steps below to scale the data in the fare column of the Titanic dataset:
Step 1 – Create the MinMaxScaler object
# Import the preprocessing module and create a MinMaxScaler object
from sklearn import preprocessing as pp
data_scaler = pp.MinMaxScaler(feature_range=(0, 1))
Step 2 – Extract the fare column
# Extract the fare column
fare_array = titanic_df[['fare']]
Step 3 – Perform the scaling
# Perform the scaling of the extracted column
fare_array_scaled = data_scaler.fit_transform(fare_array)
Step 4 – Replace the original column
# Now replace the original column with the scaled column
titanic_df['fare'] = fare_array_scaled
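The four steps above can be combined into one runnable sketch. A small made-up DataFrame stands in for the Titanic data here so the snippet is self-contained.

```python
import pandas as pd
from sklearn import preprocessing as pp

# A tiny stand-in for the Titanic dataset (fare values are illustrative)
titanic_df = pd.DataFrame({'fare': [7.25, 71.28, 8.05, 512.33]})

# Step 1 - create the MinMaxScaler object
data_scaler = pp.MinMaxScaler(feature_range=(0, 1))

# Step 2 - extract the fare column (double brackets keep it 2-D)
fare_array = titanic_df[['fare']]

# Step 3 - perform the scaling
fare_array_scaled = data_scaler.fit_transform(fare_array)

# Step 4 - replace the original column with the scaled column
titanic_df['fare'] = fare_array_scaled

print(titanic_df['fare'].min(), titanic_df['fare'].max())  # 0.0 1.0
```

After scaling, the smallest fare maps to 0 and the largest to 1, with everything else in between.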
3. Dropping and Interpolating Missing Data
Dropping and interpolating are data cleansing techniques used to handle missing values in a dataset. We can decide to drop a column if it contributes nothing to the data analysis process, for example the name and ticket columns.
Drop Columns with Missing Values
Another reason we may drop a column is when it has many missing values. Examples are the body, boat and cabin columns of the Titanic dataset.
To drop these columns, use the code below:
# Drop columns
cols_to_drop = ['body', 'boat', 'name', 'ticket', 'cabin']
titanic_df = titanic_df.drop(cols_to_drop, axis=1)
The axis=1 argument indicates that we are dropping columns rather than rows.
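Here is a minimal sketch of the same drop operation on a small made-up DataFrame, showing that only the named columns are removed.

```python
import pandas as pd

# A toy frame with one useful column and two droppable ones
df = pd.DataFrame({
    'name': ['Allen', 'Braund'],
    'fare': [7.25, 8.05],
    'cabin': [None, 'C85'],
})

# axis=1 means we drop columns, not rows
df = df.drop(['name', 'cabin'], axis=1)

print(list(df.columns))  # ['fare']
```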
Interpolating Missing Values
If a column has only a few missing values, you can choose to interpolate them from the existing values. Interpolation is simply a way to create new data based on existing data. For example, given the sequence 2, 4, ?, 8, 10, the missing value is 6 by interpolation, that is (4 + 8) / 2.
Let’s interpolate the age column of the Titanic dataset using the code below:
# Replace missing values with interpolated values, for example in the age column
titanic_df['age'] = titanic_df['age'].interpolate()
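A quick sketch using the 2, 4, ?, 8, 10 example from above shows how pandas fills the gap. By default, interpolate() performs linear interpolation, which is exactly the (4 + 8) / 2 calculation.

```python
import numpy as np
import pandas as pd

# The example sequence with a missing value in the middle
s = pd.Series([2, 4, np.nan, 8, 10], dtype=float)

# Linear interpolation fills the gap with the average of its neighbours
filled = s.interpolate()

print(filled.tolist())  # [2.0, 4.0, 6.0, 8.0, 10.0]
```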
Drop rows with missing Values
To drop all rows with missing values, we can use the code below. Here we don’t specify the axis, because rows (axis=0) are the default.
# Drop all rows with missing data
titanic_df = titanic_df.dropna()
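A minimal sketch of dropna() on a small made-up frame: any row containing at least one missing value is removed.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age':  [22.0, np.nan, 38.0],
    'fare': [7.25, 8.05, np.nan],
})

# Keep only rows with no missing values (the first row here)
df = df.dropna()

print(len(df))  # 1
```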
4. Data Normalisation
Normalization is used when features have widely differing ranges of values. For example, some features may have values at or close to zero while other features have very high values, say in the hundreds or thousands. In this case, normalization rescales each record (row) to have a length of, say, 1.
There are two types of normalization: L1 Normalization and L2 Normalization
L1 Normalization – Also known as Manhattan normalization. Here, for each row of the dataset, the sum of the absolute values always equals 1.
L2 Normalization – Also known as Euclidean normalization. Here, for each row of data, the square root of the sum of the squares of the values always equals 1.
To perform normalization, we simply create a Normalizer object and proceed as we did for scaling. The code snippet is given below; see the video for the full explanation.
# Perform normalization on the parch column
normalizer = pp.Normalizer(norm='l1')  # use norm='l2' for L2 Normalization
parch_array = titanic_df[['parch']]
parch_array_normalized = normalizer.transform(parch_array)
titanic_df['parch'] = parch_array_normalized
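The L1 and L2 properties described above are easiest to verify on a small made-up matrix with two features, since the Normalizer works row by row across all the features in a row. This sketch checks both definitions directly.

```python
import numpy as np
from sklearn import preprocessing as pp

# A toy 2-feature matrix (values are illustrative)
X = np.array([[1.0, 3.0],
              [2.0, 2.0]])

l1 = pp.Normalizer(norm='l1').transform(X)
l2 = pp.Normalizer(norm='l2').transform(X)

# L1: the absolute values in each row sum to 1
print(np.abs(l1).sum(axis=1))  # [1. 1.]

# L2: the squares in each row sum to 1 (so the Euclidean length is 1)
print((l2 ** 2).sum(axis=1))   # [1. 1.]
```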
Exercise: Perform L2 normalization on the Ash column of the wine dataset. (try it, then see video for procedure and explanation)
5. Numerical and Categorical Values Conversion
6. Data Binarization
7. Data Standardization
8. Data Labelling and Encoding
9. Data Splitting – Feature and Class; Train & Test