{"id":248,"date":"2021-09-19T11:56:11","date_gmt":"2021-09-19T11:56:11","guid":{"rendered":"https:\/\/www.kindsonthegenius.com\/data-science\/?p=248"},"modified":"2021-09-25T13:51:11","modified_gmt":"2021-09-25T13:51:11","slug":"class-3-introduction-to-data-preprocessing-and-data-cleaning-part-1","status":"publish","type":"post","link":"https:\/\/www.kindsonthegenius.com\/data-science\/class-3-introduction-to-data-preprocessing-and-data-cleaning-part-1\/","title":{"rendered":"Class 3 &#8211; Introduction to Data Preprocessing and Data Cleaning &#8211; Part 1"},"content":{"rendered":"<p>This is Class three of our practical Science Course for Data Science Beginners. In this class would be perform data preprocessing and data cleaning (or data cleansing). We would also discuss some of the theoretical concepts.<\/p>\n<p>We would be using the Titanic Dataset. <a href=\"https:\/\/drive.google.com\/file\/d\/10Lsr1MvgORJl_XOv8LfrKXW9Hy-FcbOr\/view?usp=sharing\" target=\"_blank\" rel=\"noopener\">Get the Titanic Dataset here for free<\/a>.<\/p>\n<p>The following are covered:<\/p>\n<ol>\n<li><a href=\"#t1\">What is Data Preprocessing?<\/a><\/li>\n<li><a href=\"#t2\">Data Scaling<\/a><\/li>\n<li><a href=\"#t3\">Dropping and Interpolating Missing Data<\/a><\/li>\n<li><a href=\"#t4\">Data Normalisation<\/a><\/li>\n<li><a href=\"#t5\">Numerical and Categorical Values Conversion<\/a><\/li>\n<li><a href=\"#t5\">Data Binarization<\/a><\/li>\n<li><a href=\"#t5\">Data Standardization<\/a><\/li>\n<li><a href=\"#t5\">Data Labelling and Encoding<\/a><\/li>\n<li><a href=\"#t5\">Data Splitting &#8211; Feature and Class; Train &amp; Test<\/a><\/li>\n<\/ol>\n<p><a href=\"https:\/\/youtu.be\/ylhwP6wFEag\" target=\"_blank\" rel=\"noopener\">Class 3 Video on Preprocessing<\/a><\/p>\n<h4><strong id=\"t1\">1. What is Data Preprocessing?<\/strong><\/h4>\n<p>After obtaining your dataset and doing basic visualization, the next step is to perform preprocessing on your dataset. 
Data preprocessing refers to the operations you perform on your data to ensure it works well with Machine Learning algorithms. Data preprocessing also ensures better performance in the analytics process. It includes data cleaning, outlier detection, data wrangling, normalization, data editing, unreliable data removal, data conversion, etc.<\/p>\n<p>In this class, we will perform most of these operations on the Titanic Dataset.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t2\">2. Data Scaling or Rescaling<\/strong><\/h4>\n<p>Data scaling is a technique that ensures that the attributes of a dataset are on the same scale. Often, we need to rescale values to a range of 0 to 1, as required by Machine Learning algorithms like k-Nearest Neighbors and Gradient Descent.<\/p>\n<p>Python provides a class called MinMaxScaler for performing scaling. This class is available in the sklearn.preprocessing module.<\/p>\n<p>Take the four steps below to scale the data in the fare column of the Titanic dataset.<\/p>\n<p><strong>Step 1<\/strong> &#8211; Create the MinMaxScaler object<\/p>\n<p><!-- HTML generated using hilite.me --><\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Import the sklearn preprocessing module<\/span>\r\nfrom sklearn import preprocessing as pp\r\n\r\n<span style=\"color: #888888;\"># Create a MinMaxScaler object<\/span>\r\ndata_scaler <span style=\"color: #333333;\">=<\/span> pp<span style=\"color: #333333;\">.<\/span>MinMaxScaler(feature_range<span style=\"color: #333333;\">=<\/span>(<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>,<span style=\"color: #0000dd; font-weight: bold;\">1<\/span>))\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 2<\/strong> &#8211; Extract the fare column<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Extract the fare column<\/span>\r\nfare_array <span style=\"color: #333333;\">=<\/span> titanic_df[[<span style=\"background-color: #fff0f0;\">'fare'<\/span>]]\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 3<\/strong> &#8211; Perform the scaling<br \/>\n<!-- HTML generated 
using hilite.me --><\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Perform the scaling of the extracted column<\/span>\r\nfare_array_scaled <span style=\"color: #333333;\">=<\/span> data_scaler<span style=\"color: #333333;\">.<\/span>fit_transform(fare_array)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 4<\/strong> &#8211; Replace the original column<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Now replace the original column with the scaled column<\/span>\r\ntitanic_df[<span style=\"background-color: #fff0f0;\">'fare'<\/span>] <span style=\"color: #333333;\">=<\/span> fare_array_scaled\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t3\">3. Dropping and Interpolating Missing Data<\/strong><\/h4>\n<p>Dropping and interpolating are data cleansing techniques used to handle missing values in a dataset. We can decide to drop a column if it contributes nothing to the data analysis process, for example, the name and ticket columns.<\/p>\n<p><strong>Drop Columns with Missing Values<\/strong><\/p>\n<p>Another reason we may drop a column is when it has many missing values. 
An example is the body, boat and cabin columns of the Titanic dataset.<\/p>\n<p>To drop these columns, use the code below:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Drop Columns<\/span>\r\ncols_to_drop <span style=\"color: #333333;\">=<\/span> [<span style=\"background-color: #fff0f0;\">'body'<\/span>, <span style=\"background-color: #fff0f0;\">'boat'<\/span>, <span style=\"background-color: #fff0f0;\">'name'<\/span>, <span style=\"background-color: #fff0f0;\">'ticket'<\/span>, <span style=\"background-color: #fff0f0;\">'cabin'<\/span>]\r\ntitanic_df <span style=\"color: #333333;\">=<\/span> titanic_df<span style=\"color: #333333;\">.<\/span>drop(cols_to_drop, axis<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>)\r\n<\/pre>\n<p>The axis=1 argument indicates that we are dropping columns rather than rows.<\/p>\n<p><strong>Interpolating Missing Values<\/strong><\/p>\n<p>If you have a column with very few missing values, you can choose to interpolate them using existing values. Interpolation is simply a way to create new data based on existing data.\u00a0 For example, if you have the sequence 2, 4, ?, 8, 10, then the interpolated missing value is 6, that is (4+8)\/2.<\/p>\n<p>Let&#8217;s interpolate the age column of the Titanic dataset using the code below.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Replace missing values in the age column with interpolated values<\/span>\r\ntitanic_df[<span style=\"background-color: #fff0f0;\">'age'<\/span>] <span style=\"color: #333333;\">=<\/span> titanic_df[<span style=\"background-color: #fff0f0;\">'age'<\/span>]<span style=\"color: #333333;\">.<\/span>interpolate()\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Drop rows with missing Values<\/strong><\/p>\n<p>To drop all rows with missing values, we can use the code below. 
Here, we don&#8217;t specify the axis, since dropna() removes rows (axis=0) by default.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Drop all rows with missing data<\/span>\r\ntitanic_df <span style=\"color: #333333;\">=<\/span> titanic_df<span style=\"color: #333333;\">.<\/span>dropna()\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t4\">4. Data Normalisation<\/strong><\/h4>\n<p>Normalization is used when certain features have a broad range of values. For example, some features have values at or close to zero, while other features have very high values, say in the hundreds or thousands.\u00a0 In this case, normalization rescales each record (row) so that it has a length of 1.<\/p>\n<p>There are two types of normalization: L1 Normalization and L2 Normalization.<\/p>\n<p><strong>L1 Normalization<\/strong> &#8211; Also known as Manhattan normalization. Here, for each row of the dataset, the sum of the absolute values will always equal 1. For example, L1-normalizing the row (1, 2, 5) gives (0.125, 0.25, 0.625), since 1 + 2 + 5 = 8 and we divide each value by 8.<\/p>\n<p><strong>L2 Normalization<\/strong> &#8211; Also known as Euclidean normalization. Here, for each row of data, the square root of the sum of the squares of the values will always equal 1.<\/p>\n<p>To perform normalization, we simply create a Normalizer object and proceed the same way we performed scaling. A code snippet is given below. 
See the video for a full explanation.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Perform Normalization on the parch column<\/span>\r\nnormalizer <span style=\"color: #333333;\">=<\/span> pp<span style=\"color: #333333;\">.<\/span>Normalizer(norm<span style=\"color: #333333;\">=<\/span><span style=\"background-color: #fff0f0;\">'l1'<\/span>) <span style=\"color: #888888;\"># use 'l2' for L2 Normalization<\/span>\r\nparch_array <span style=\"color: #333333;\">=<\/span> titanic_df[[<span style=\"background-color: #fff0f0;\">'parch'<\/span>]]\r\nparch_array_normalized <span style=\"color: #333333;\">=<\/span> normalizer<span style=\"color: #333333;\">.<\/span>transform(parch_array)\r\ntitanic_df[<span style=\"background-color: #fff0f0;\">'parch'<\/span>] <span style=\"color: #333333;\">=<\/span> parch_array_normalized\r\n<\/pre>\n<p>Note that the Normalizer works row by row, so normalizing a single column on its own maps every nonzero value to 1 (or -1). In practice, you would normalize several numeric columns together.<\/p>\n<p>&nbsp;<\/p>\n<p><strong>Exercise<\/strong>: Perform L2 normalization on the Ash column of the wine dataset. (Try it, then see the video for the procedure and explanation.)<\/p>\n<p><strong id=\"t5\"><a href=\"#\">The remaining 5 points are covered in the next Class Part 2<\/a><\/strong><\/p>\n<ul>\n<li>5. Numerical and Categorical Values Conversion<\/li>\n<li>6. Data Binarization<\/li>\n<li>7. Data Standardization<\/li>\n<li>8. Data Labelling and Encoding<\/li>\n<li>9. Data Splitting &#8211; Feature and Class; Train &amp; Test<\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/class-4-introduction-to-data-preprocessing-and-data-cleaning-part-2\/\" target=\"_blank\" rel=\"noopener\">Go to Part 2<\/a><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is Class 3 of our practical Data Science Course for Beginners. 
In this class, we will perform data preprocessing and data cleaning (or &hellip;<\/p>\n","protected":false},"author":1,"featured_media":251,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44,43],"tags":[54,55,53,56,57],"class_list":["post-248","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-python","tag-data-cleaning","tag-data-cleansing","tag-normalization","tag-preprocessing","tag-scaling"],"jetpack_featured_media_url":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Data-Science-Class-on-Preprocessing.jpg","_links":{"self":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/248","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/comments?post=248"}],"version-history":[{"count":5,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/248\/revisions"}],"predecessor-version":[{"id":281,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/248\/revisions\/281"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media\/251"}],"wp:attachment":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media?parent=248"}],"wp:term":[{"taxonomy":"category","embeddable":true,"hre
f":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/categories?post=248"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/tags?post=248"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}