{"id":267,"date":"2021-09-24T21:52:55","date_gmt":"2021-09-24T21:52:55","guid":{"rendered":"https:\/\/www.kindsonthegenius.com\/data-science\/?p=267"},"modified":"2021-09-30T10:17:38","modified_gmt":"2021-09-30T10:17:38","slug":"class-6-introduction-to-classification","status":"publish","type":"post","link":"https:\/\/www.kindsonthegenius.com\/data-science\/class-6-introduction-to-classification\/","title":{"rendered":"Class 6 &#8211; Introduction to Classification"},"content":{"rendered":"<p>This would be class 6 of our complete data science for beginners series. In this class, we would apply all we&#8217;ve learnt to actually perform an analysis: building a classifier.<\/p>\n<p>We would cover the following sub-topics:<\/p>\n<ol>\n<li><a href=\"#t1\">What is Classification?<\/a><\/li>\n<li><a href=\"#t2\">Building an NB Classifier<\/a><\/li>\n<li><a href=\"#t3\">Evaluation Metrics &#8211; TP, FP, TN, FN<\/a><\/li>\n<li><a href=\"#t4\">Evaluation Metrics &#8211; Accuracy, Precision, Sensitivity and Specificity<\/a><\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t1\">1. What is Classification?<\/strong><\/h4>\n<p>As the name suggests, classification is the process of predicting a category from measured values of the attributes. You may remember the iris dataset from Class 1. In the iris dataset, we have 4 attributes (Sepal Length, Sepal Width, Petal Length, and Petal Width). Given this set of attributes, a record can be classified as one of 3 classes of iris (Setosa, Virginica and Versicolor).<\/p>\n<p>Classification falls under a set of machine learning approaches called Supervised Learning. Generally, once we have a dataset, the task is to determine the function that maps the inputs (attributes) to the outputs (classes). This mapping between input and output is what we call the model.<\/p>\n<p>More topics on classification can be found here:<\/p>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t2\">2. 
Building a Classifier<\/strong><\/h4>\n<p>As usual, we would follow 5 steps to build and test our classifier:<\/p>\n<p><strong>Step 1<\/strong> &#8211; Import the necessary modules<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#1. Import the necessary modules<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.datasets<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">ds<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pandas<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pd<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">numpy<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">np<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 2<\/strong> &#8211; Obtain your dataset<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#2. 
Obtain and prepare your dataset<\/span>\r\nbc_array <span style=\"color: #333333;\">=<\/span> ds<span style=\"color: #333333;\">.<\/span>load_breast_cancer()\r\n\r\nfeatures <span style=\"color: #333333;\">=<\/span> bc_array[<span style=\"background-color: #fff0f0;\">'data'<\/span>]\r\nclasses <span style=\"color: #333333;\">=<\/span> bc_array[<span style=\"background-color: #fff0f0;\">'target'<\/span>]\r\nfeature_names <span style=\"color: #333333;\">=<\/span> bc_array[<span style=\"background-color: #fff0f0;\">'feature_names'<\/span>]\r\ncolumn_names <span style=\"color: #333333;\">=<\/span> np<span style=\"color: #333333;\">.<\/span>append(feature_names, <span style=\"background-color: #fff0f0;\">'Class'<\/span>)\r\nbc_df <span style=\"color: #333333;\">=<\/span> pd<span style=\"color: #333333;\">.<\/span>DataFrame(data <span style=\"color: #333333;\">=<\/span> np<span style=\"color: #333333;\">.<\/span>c_[features, classes], columns <span style=\"color: #333333;\">=<\/span> column_names)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 3<\/strong> &#8211; Split the Dataset into Train and Test Data<\/p>\n<p>You already know about <a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/class-4-introduction-to-data-preprocessing-and-data-cleaning-part-2\/\" target=\"_blank\" rel=\"noopener\">data splitting from Class 5<\/a>. You can review it.<\/p>\n<p>So now, we need to split our dataset into train and test data sets.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#3. 
Split your dataset into test and train<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.model_selection<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> train_test_split\r\nXtrain, Xtest, Ytrain, Ytest <span style=\"color: #333333;\">=<\/span> train_test_split(bc_df[feature_names], bc_df[<span style=\"background-color: #fff0f0;\">'Class'<\/span>], test_size <span style=\"color: #333333;\">=<\/span> <span style=\"color: #6600ee; font-weight: bold;\">0.3<\/span>, random_state<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">50<\/span>)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 4<\/strong> &#8211; Build the Model<\/p>\n<p>We would build the model using the Gaussian Naive Bayes algorithm.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#4. Build the Model<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.naive_bayes<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> GaussianNB\r\ngnb <span style=\"color: #333333;\">=<\/span> GaussianNB()\r\nmodel <span style=\"color: #333333;\">=<\/span> gnb<span style=\"color: #333333;\">.<\/span>fit(Xtrain, Ytrain)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 5<\/strong> &#8211; Check Model Accuracy<\/p>\n<p>To see the model performance, we would use the model to make predictions based on the test dataset.<br \/>\n<!-- HTML generated using hilite.me --><\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\">#5. 
Check Model Accuracy<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.metrics<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> accuracy_score\r\ny_pred <span style=\"color: #333333;\">=<\/span> gnb<span style=\"color: #333333;\">.<\/span>predict(Xtest)\r\n<span style=\"color: #007020;\">print<\/span>(accuracy_score(Ytest, y_pred))\r\n<\/pre>\n<p>The above would display the accuracy score. For me it gave:<\/p>\n<div class=\"cell code_cell rendered selected\" tabindex=\"2\">\n<div class=\"output_wrapper\">\n<div class=\"output\">\n<div class=\"output_area\">\n<div class=\"output_subarea output_text output_stream output_stdout\" dir=\"auto\">\n<pre>0.935672514619883<\/pre>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t3\">3. Evaluation Metrics &#8211; TP, FP, TN, FN<\/strong><\/h4>\n<p>Although we have gotten the accuracy of our classifier, we still need to calculate other metrics that explain the classifier performance. The following are the metrics of interest.<\/p>\n<div class=\"page\" title=\"Page 69\">\n<div class=\"section\">\n<div class=\"layoutArea\">\n<div class=\"column\">\n<ul>\n<li><strong>True Positives (TP):<\/strong> This is a situation where the actual class is 1 and the classifier correctly predicted it as 1.<\/li>\n<li><strong>False Positives (FP):<\/strong> This is a situation where the actual class of the data point is 0 and the classifier wrongly predicted it as 1. This is called a <strong>Type I Error.<\/strong><\/li>\n<li><strong>True Negatives (TN):<\/strong> This is a situation where the actual class is 0 and the classifier correctly predicted it as 0.<\/li>\n<li><strong>False Negatives (FN):<\/strong> This is a situation where the actual class of the data point is 1 and the classifier wrongly predicted it as 0. 
This is called a <strong>Type II Error.<\/strong><\/li>\n<\/ul>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>These values can be represented in a confusion matrix.<\/p>\n<p>The code below displays the confusion matrix for our classifier.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Display the confusion matrix<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.metrics<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> confusion_matrix\r\nconfusion_matrix(Ytest, y_pred)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t4\">4. Evaluation Metrics &#8211; Accuracy, Precision, Sensitivity and Specificity<\/strong><\/h4>\n<p>Let&#8217;s now look at these further metrics:<\/p>\n<p><strong>Accuracy<\/strong> &#8211; This is the number of correct classifications divided by the total number of classifications. It is given by the formula:<\/p>\n<p><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.44.36.png\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-270 size-medium\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.44.36-300x61.png\" alt=\"Accuracy\" width=\"300\" height=\"61\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.44.36-300x61.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.44.36-1024x208.png 1024w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.44.36-768x156.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.44.36.png 1072w\" sizes=\"auto, (max-width: 300px) 
100vw, 300px\" \/><\/a><\/p>\n<p><strong>Precision<\/strong> &#8211; This is the total number of True Positives divided by the sum of True Positives and False Positives. It is given by:<\/p>\n<p><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.45.42.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-271 aligncenter\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.45.42-300x61.png\" alt=\"Precision\" width=\"300\" height=\"61\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.45.42-300x61.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.45.42-1024x208.png 1024w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.45.42-768x156.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.45.42.png 1072w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p><strong>Sensitivity (Recall)<\/strong> &#8211;\u00a0 This is the total number of True Positives divided by the sum of True Positives and False Negatives. 
It is given by:<\/p>\n<p><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.46.40.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-272 aligncenter\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.46.40-300x61.png\" alt=\"Sensitivity\" width=\"300\" height=\"61\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.46.40-300x61.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.46.40-1024x208.png 1024w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.46.40-768x156.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.46.40.png 1072w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p><strong>Specificity<\/strong> &#8211; Number of True Negatives divided by the sum of True Negatives and False Positives. 
It is given by:<\/p>\n<p><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.47.36.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-273 aligncenter\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.47.36-300x61.png\" alt=\"Specificity\" width=\"300\" height=\"61\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.47.36-300x61.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.47.36-1024x208.png 1024w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.47.36-768x156.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-24-at-23.47.36.png 1072w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<!-- AddThis Advanced Settings generic via filter on the_content --><!-- AddThis Share Buttons generic via filter on the_content -->","protected":false},"excerpt":{"rendered":"<p>This would be class 6 of our complete data science for beginner series. 
In this class, we would apply all we&#8217;ve learnt to actually perform &hellip; <!-- AddThis Advanced Settings generic via filter on get_the_excerpt --><!-- AddThis Share Buttons generic via filter on get_the_excerpt --><\/p>\n","protected":false},"author":1,"featured_media":274,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[66,67,45],"class_list":["post-267","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized","tag-classification","tag-confusion-matrix","tag-sklearn"],"jetpack_featured_media_url":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Class-6-Classification.jpg","_links":{"self":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/267","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/comments?post=267"}],"version-history":[{"count":2,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/267\/revisions"}],"predecessor-version":[{"id":275,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/267\/revisions\/275"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media\/274"}],"wp:attachment":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media?parent=267"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/categ
ories?post=267"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/tags?post=267"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}