{"id":253,"date":"2021-09-19T18:06:37","date_gmt":"2021-09-19T18:06:37","guid":{"rendered":"https:\/\/www.kindsonthegenius.com\/data-science\/?p=253"},"modified":"2021-09-25T13:49:56","modified_gmt":"2021-09-25T13:49:56","slug":"class-4-introduction-to-data-preprocessing-and-data-cleaning-part-2","status":"publish","type":"post","link":"https:\/\/www.kindsonthegenius.com\/data-science\/class-4-introduction-to-data-preprocessing-and-data-cleaning-part-2\/","title":{"rendered":"Class 4 &#8211; Introduction to Data Preprocessing and Data Cleaning &#8211; Part 2"},"content":{"rendered":"<p>This is Class 4 of our Data Science Series, and in this class we complete the remaining topics on Data Preprocessing and Data Cleaning. <a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/class-3-introduction-to-data-preprocessing-and-data-cleaning-part-1\/\" target=\"_blank\" rel=\"noopener\">You can get Part 1 here<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<p>The following topics are covered here:<\/p>\n<ol>\n<li><a href=\"#t1\">Numerical and Categorical Values Conversion<\/a><\/li>\n<li><a href=\"#t2\">Data Binarization<\/a><\/li>\n<li><a href=\"#t3\">Data Standardization<\/a><\/li>\n<li><a href=\"#t4\">Data Labelling and Encoding<\/a><\/li>\n<li><a href=\"#t5\">Data Splitting \u2013 Features and Class; Train &amp; Test<\/a><\/li>\n<\/ol>\n<p><a href=\"https:\/\/youtu.be\/pQXHb0YrvaY\" target=\"_blank\" rel=\"noopener\">Preprocessing Video Part 2<\/a><\/p>\n<h4><strong id=\"t1\">1. Numerical and Categorical Values Conversion<\/strong><\/h4>\n<p>When some values are given as strings or characters, we may have to convert them to numeric values. For example, if you load the dataset and take a look at the Sex column, you see that it has the values &#8216;Male&#8217; and &#8216;Female&#8217;. 
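<\/p>
<p>Before doing the conversion, it helps to confirm exactly which labels the column actually contains. Below is a minimal sketch; the small stand-in DataFrame and the lowercase &#8216;sex&#8217; column name are assumptions for illustration, since in practice the Titanic data would be loaded from a file.<\/p>

```python
import pandas as pd

# Stand-in rows for illustration; in practice titanic_df comes from pd.read_csv()
titanic_df = pd.DataFrame({'sex': ['male', 'female', 'male']})

# List the distinct labels in the column before converting them
labels = titanic_df['sex'].unique()
print(labels)  # the labels, in order of first appearance
```

<p>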
We need to change this to the values 0 and 1.<\/p>\n<p>In this case, we should have Male = 1 and Female = 0.<\/p>\n<p>The general format for this conversion is:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">df<span style=\"color: #333333;\">.<\/span>loc[df[<span style=\"background-color: #fff0f0;\">'colname'<\/span>] <span style=\"color: #333333;\">==<\/span> <span style=\"color: #333333;\">&lt;<\/span>value<span style=\"color: #333333;\">&gt;<\/span>, <span style=\"background-color: #fff0f0;\">'colname'<\/span>] <span style=\"color: #333333;\">=<\/span> <span style=\"color: #333333;\">&lt;<\/span>new_value<span style=\"color: #333333;\">&gt;<\/span>\r\n<\/pre>\n<p>The code above replaces &lt;value&gt; in the column &#8216;colname&#8217; with &lt;new_value&gt;.<\/p>\n<p>Therefore, the code below converts the &#8216;male&#8217; and &#8216;female&#8217; values in our Titanic data:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">titanic_df<span style=\"color: #333333;\">.<\/span>loc[titanic_df[<span style=\"background-color: #fff0f0;\">'sex'<\/span>]<span style=\"color: #333333;\">==<\/span><span style=\"background-color: #fff0f0;\">'male'<\/span>, <span style=\"background-color: #fff0f0;\">'sex'<\/span>] <span style=\"color: #333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">1<\/span>\r\ntitanic_df<span style=\"color: #333333;\">.<\/span>loc[titanic_df[<span style=\"background-color: #fff0f0;\">'sex'<\/span>]<span style=\"color: #333333;\">==<\/span><span style=\"background-color: #fff0f0;\">'female'<\/span>, <span style=\"background-color: #fff0f0;\">'sex'<\/span>] <span style=\"color: #333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">0<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Add or Remove Column<\/strong><\/p>\n<p>Use the code below to add a column or remove a column:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">df[<span style=\"background-color: #fff0f0;\">'new_col'<\/span>] <span style=\"color: 
#333333;\">=<\/span> <span style=\"color: #008800; font-weight: bold;\">None<\/span> <span style=\"color: #888888;\"># Add a column<\/span>\r\ndf <span style=\"color: #333333;\">=<\/span> df<span style=\"color: #333333;\">.<\/span>drop(columns <span style=\"color: #333333;\">=<\/span> <span style=\"background-color: #fff0f0;\">'new_col'<\/span>, axis <span style=\"color: #333333;\">=<\/span> <span style=\"color: #0000dd; font-weight: bold;\">1<\/span>) <span style=\"color: #888888;\"># Drop a column<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Exercise<\/strong> &#8211; Replace the values in the Embarked column with the values 1, 2 and 3.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t2\">2. Data Binarization or Thresholding<\/strong><\/h4>\n<p>Just as the name indicates, binarization is a preprocessing technique used to change values in a dataset to binary (0 and 1). This may be achieved by setting a threshold. The values below the threshold are set to 0 while the values above the threshold are set to 1.<\/p>\n<p>Binarization may be needed, for example, when logistic regression (which expects a binary target) has to be performed.<\/p>\n<p>For example, in our Titanic dataset, we can binarize the fare field following the steps below:<\/p>\n<p>I have added it as a screenshot for clarity:<\/p>\n<figure id=\"attachment_256\" aria-describedby=\"caption-attachment-256\" style=\"width: 500px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-19-at-18.57.47.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-256\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-19-at-18.57.47-300x222.png\" alt=\"Binarization on the fare column of the Titanic Dataset\" width=\"500\" height=\"369\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-19-at-18.57.47-300x222.png 300w, 
https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-19-at-18.57.47.png 643w\" sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><\/a><figcaption id=\"caption-attachment-256\" class=\"wp-caption-text\">Binarization on the fare column of the Titanic Dataset<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t3\">3. Data Standardization<\/strong><\/h4>\n<p>This technique, also called &#8216;mean removal and variance scaling&#8217;, is used to transform the dataset so that the features look more like standard normally distributed data, that is, data with a mean of 0 and a standard deviation of 1.<\/p>\n<p>Standardization is used with machine learning estimators such as linear regression and logistic regression, where better results are achieved with normally distributed data. In Python we use the StandardScaler from sklearn.<\/p>\n<p>It works the same way as the MinMaxScaler from Part 1.<\/p>\n<p><strong>Exercise<\/strong> &#8211; Import the Wine dataset. Perform Standardization on the Proline column of the wine dataset.<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t4\">4. Data Labelling and Encoding<\/strong><\/h4>\n<p>This is an automated way of performing <a href=\"#t1\">Numerical and Categorical values conversion<\/a>, which we covered in<a href=\"#t1\"> section 1<\/a> above. This means that instead of using string labels for the data (like &#8216;Male&#8217; and &#8216;Female&#8217;), we encode the values into numeric values or number labels. So if there are n labels in the dataset, they would be encoded as 0 to n-1. The drawback of this method is that you don&#8217;t have control over the labels that are assigned.<\/p>\n<p>Sklearn provides a class, LabelEncoder, which we can use to perform data labelling.<\/p>\n<p>The code below extracts the home.dest column, encodes it and adds it as a new column in the dataset. 
See the video for more explanation.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">encoder <span style=\"color: #333333;\">=<\/span> pp<span style=\"color: #333333;\">.<\/span>LabelEncoder()\r\nhome_array <span style=\"color: #333333;\">=<\/span> titanic_df[[<span style=\"background-color: #fff0f0;\">'home.dest'<\/span>]]<span style=\"color: #333333;\">.<\/span>values\r\nencoder<span style=\"color: #333333;\">.<\/span>fit(home_array<span style=\"color: #333333;\">.<\/span>ravel()) <span style=\"color: #888888;\"># ravel() flattens the 2d array to 1d<\/span>\r\nhome_array_encoded <span style=\"color: #333333;\">=<\/span> encoder<span style=\"color: #333333;\">.<\/span>transform(home_array<span style=\"color: #333333;\">.<\/span>ravel())\r\ntitanic_df[<span style=\"background-color: #fff0f0;\">'home.dest.encoded'<\/span>] <span style=\"color: #333333;\">=<\/span> home_array_encoded\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t5\">5. Data Splitting \u2013 Features and Class; Train &amp; Test<\/strong><\/h4>\n<p>This is used mostly when we apply supervised learning algorithms to our dataset. Supervised learning is just a &#8216;fancy word&#8217; for classification and regression!<\/p>\n<p><strong>Splitting into Features and Class<\/strong><\/p>\n<p>A dataset used for classification normally has two parts: the features (X) and the class (Y). The idea is that the class can be deduced from the features. In the case of the Titanic Dataset, we would like to determine who survived. Therefore, the survived column is the class while every other column is a feature. 
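<\/p>
<p>Before splitting, it can also help to look at how the class values are distributed. The sketch below assumes a small stand-in DataFrame named titanic_df so it can run on its own; in practice you would use the loaded Titanic data.<\/p>

```python
import pandas as pd

# Stand-in for the real Titanic data, for illustration only
titanic_df = pd.DataFrame({
    'survived': [1, 0, 0, 1, 0],
    'age':      [22, 38, 26, 35, 28],
    'fare':     [7.25, 71.28, 7.92, 53.10, 8.05],
})

# The class (Y) is the 'survived' column; check how balanced it is
print(titanic_df['survived'].value_counts())

# Every other column is a feature (X)
feature_cols = [c for c in titanic_df.columns if c != 'survived']
print(feature_cols)
```

<p>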
The code below splits the data into X (features) and Y (class).<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">Y <span style=\"color: #333333;\">=<\/span> titanic_df[[<span style=\"background-color: #fff0f0;\">'survived'<\/span>]]\r\nX <span style=\"color: #333333;\">=<\/span> titanic_df<span style=\"color: #333333;\">.<\/span>drop(columns <span style=\"color: #333333;\">=<\/span> <span style=\"background-color: #fff0f0;\">'survived'<\/span>, axis<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Splitting into Train and Test<\/strong><\/p>\n<p>In supervised learning, you will have to split your data into two parts: a training dataset and a test dataset. The training dataset is used to train the model while the test dataset is used to test the performance of the model on new data.<\/p>\n<p>In the code below, I generate a dataset of 1000 records, then perform a train-test split to divide the data into a 70% training set and a 30% test set.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># split a dataset into train and test sets<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.datasets<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> make_blobs\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.model_selection<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> train_test_split\r\n<span style=\"color: #888888;\"># Generate dataset<\/span>\r\nX, y <span style=\"color: #333333;\">=<\/span> make_blobs(n_samples<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1000<\/span>)\r\n<span style=\"color: #888888;\"># split into train and test sets<\/span>\r\nX_train, X_test, y_train, y_test <span 
style=\"color: #333333;\">=<\/span> train_test_split(X, y, test_size<span style=\"color: #333333;\">=<\/span><span style=\"color: #6600ee; font-weight: bold;\">0.30<\/span>)\r\n<span style=\"color: #007020;\">print<\/span>(X_train<span style=\"color: #333333;\">.<\/span>shape, X_test<span style=\"color: #333333;\">.<\/span>shape, y_train<span style=\"color: #333333;\">.<\/span>shape, y_test<span style=\"color: #333333;\">.<\/span>shape)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>This is Class 4 of our Data Science Series, and in this class we complete the remaining topics on Data Preprocessing and &hellip; <!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":1,"featured_media":258,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44,43],"tags":[58,60,59,56],"class_list":["post-253","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-python","tag-binarization","tag-data-labelling","tag-label-encoding","tag-preprocessing"],"jetpack_featured_media_url":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/A-second-class-in-preprocessing.jpg","_links":{"self":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/253","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/users\/
1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/comments?post=253"}],"version-history":[{"count":5,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/253\/revisions"}],"predecessor-version":[{"id":280,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/253\/revisions\/280"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media\/258"}],"wp:attachment":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media?parent=253"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/categories?post=253"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/tags?post=253"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}