{"id":276,"date":"2021-09-30T20:10:33","date_gmt":"2021-09-30T20:10:33","guid":{"rendered":"https:\/\/www.kindsonthegenius.com\/data-science\/?p=276"},"modified":"2021-09-30T20:10:33","modified_gmt":"2021-09-30T20:10:33","slug":"data-science-class-7-logistic-regression","status":"publish","type":"post","link":"https:\/\/www.kindsonthegenius.com\/data-science\/data-science-class-7-logistic-regression\/","title":{"rendered":"Data Science Class 7 &#8211; Logistic Regression"},"content":{"rendered":"<p>In this class, we would build a logistic regression model.<\/p>\n<p>We would cover the following topics:<\/p>\n<ol>\n<li><a href=\"#t1\">The Logistic Regression Model<\/a><\/li>\n<li><a href=\"#t2\">Import the Necessary Modules<\/a><\/li>\n<li><a href=\"#t3\">Generate the Dataset<\/a><\/li>\n<li><a href=\"#t4\">Visualize the Data Using ScatterPlot<\/a><\/li>\n<li><a href=\"#t5\">Perform Logistic Regression<\/a><\/li>\n<li><a href=\"#t6\">View the Metrics<\/a><\/li>\n<li><a href=\"#t7\">Create the Sigmoid Plot<\/a><\/li>\n<\/ol>\n<p><strong>Note<\/strong>: Logistic Regression is a classification algorithm and is therefore discussed under Classification.<\/p>\n<h5><strong id=\"t1\">1. The Logistic Regression Model<\/strong><\/h5>\n<p>You already learnt from Class 6 (Introduction to Classification) that given values of x, we need to determine the class, y. In other words, we aim to find the function that relates x and y. In the case of linear regression, this function is of the form:<\/p>\n<p>y = f(x) = b0 + b1x<\/p>\n<p>However, in the case of logistic regression, the output must be 0 or 1. To achieve this, we need two things:<\/p>\n<ul>\n<li>a function that would always return a value between 0 and 1<\/li>\n<li>a threshold to round off this output to either a 0 or a 1<\/li>\n<\/ul>\n<p>To solve the first issue, we would model the outputs as probabilities. 
For example, suppose we have a dataset of bank customers with credit card debt and we want to predict which customers will default. The input variable would be the balance on the customer&#8217;s account. The function will be something like this:<\/p>\n<p>Pr(default = Yes | balance)<\/p>\n<p>This is shortened as p(balance) and will always give a value between 0 and 1 (since probability values are in the range 0 to 1).<\/p>\n<p>To solve the second issue, we have to choose a threshold. For example, we may predict default = Yes for customers whose p(balance) &gt; 0.5.<\/p>\n<p>Before we go into the practical, I would just want you to know about the logistic function, which is given below.<\/p>\n<p>p(x) = 1 \/ (1 + e<sup>-(b0 + b1x)<\/sup>)<\/p>\n<p>Ok, let&#8217;s now go into the fun part!<\/p>\n<p>&nbsp;<\/p>\n<h5><strong id=\"t2\">2. Import the Necessary Modules<\/strong><\/h5>\n<p>All the necessary modules we need are given below. I&#8217;m sure by now, you know what each of them is used for:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.datasets<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> make_classification\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">matplotlib<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> pyplot <span style=\"color: #008800; font-weight: bold;\">as<\/span> plt\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.linear_model<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> LogisticRegression\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.model_selection<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> train_test_split\r\n<span style=\"color: 
#008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.metrics<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> confusion_matrix\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pandas<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">pd<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h5><strong id=\"t3\">3. Generate the Dataset<\/strong><\/h5>\n<p>We generate a dataset here using the make_classification function from sklearn.datasets.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Generate a dataset for Logistic Regression<\/span>\r\nx, y <span style=\"color: #333333;\">=<\/span> make_classification(\r\n    n_samples<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">100<\/span>,\r\n    n_features<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>,\r\n    n_classes<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">2<\/span>,\r\n    n_clusters_per_class<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>,\r\n    flip_y<span style=\"color: #333333;\">=<\/span><span style=\"color: #6600ee; font-weight: bold;\">0.03<\/span>,\r\n    n_informative<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>,\r\n    n_redundant<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">0<\/span>,\r\n    n_repeated<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">0<\/span>\r\n)\r\n<\/pre>\n<p>You can increase n_samples if you want a larger dataset. Since no random_state is set here, the generated data will differ on each run.<\/p>\n<p>&nbsp;<\/p>\n<h5><strong id=\"t4\">4. 
Visualize the Data Using ScatterPlot<\/strong><\/h5>\n<p>Let&#8217;s now do a scatter plot of this data, so we can see why linear regression would not work well: the y values are only ever 0 or 1, so a straight line fits them poorly.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Create a scatter plot<\/span>\r\nplt<span style=\"color: #333333;\">.<\/span>scatter(x, y, c<span style=\"color: #333333;\">=<\/span>y, cmap<span style=\"color: #333333;\">=<\/span><span style=\"background-color: #fff0f0;\">'rainbow'<\/span>)\r\nplt<span style=\"color: #333333;\">.<\/span>title(<span style=\"background-color: #fff0f0;\">'Scatter Plot of Logistic Regression'<\/span>)\r\nplt<span style=\"color: #333333;\">.<\/span>grid(linestyle<span style=\"color: #333333;\">=<\/span><span style=\"background-color: #fff0f0;\">'--'<\/span>)\r\nplt<span style=\"color: #333333;\">.<\/span>show()\r\n<\/pre>\n<p>The output is given below:<\/p>\n<figure id=\"attachment_284\" aria-describedby=\"caption-attachment-284\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-21.41.30.png\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-284 size-medium\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-21.41.30-300x212.png\" alt=\"Logistic Regression Data Scatter Plot\" width=\"300\" height=\"212\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-21.41.30-300x212.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-21.41.30-768x543.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-21.41.30-120x85.png 120w, 
https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-21.41.30.png 956w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-284\" class=\"wp-caption-text\">Logistic Regression Data Scatter Plot<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<h5><strong id=\"t5\">5. Perform Logistic Regression<\/strong><\/h5>\n<p>To perform the logistic regression, we would take three steps:<\/p>\n<ul>\n<li>split the dataset into training and test datasets<\/li>\n<li>create a logistic regression object<\/li>\n<li>fit the logistic regression object to the training dataset<\/li>\n<\/ul>\n<p>These three steps are given below in Python code:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Split the dataset into training and test datasets<\/span>\r\nx_train, x_test, y_train, y_test <span style=\"color: #333333;\">=<\/span> train_test_split(x, y, random_state<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">1<\/span>)\r\n\r\n<span style=\"color: #888888;\"># Create a Logistic Regression object and fit it to the training data<\/span>\r\nlog_reg <span style=\"color: #333333;\">=<\/span> LogisticRegression()\r\nlog_reg<span style=\"color: #333333;\">.<\/span>fit(x_train, y_train)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h5><strong id=\"t6\">6. 
View the Metrics<\/strong><\/h5>\n<p>We are interested in viewing the following:<\/p>\n<ul>\n<li>the logistic regression coefficients<\/li>\n<li>the predicted values<\/li>\n<li>the confusion matrix<\/li>\n<\/ul>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Show the Coefficient and Intercept<\/span>\r\n<span style=\"color: #007020;\">print<\/span>(log_reg<span style=\"color: #333333;\">.<\/span>coef_)\r\n<span style=\"color: #007020;\">print<\/span>(log_reg<span style=\"color: #333333;\">.<\/span>intercept_)\r\n\r\n<span style=\"color: #888888;\"># Perform prediction using the test dataset<\/span>\r\ny_pred <span style=\"color: #333333;\">=<\/span> log_reg<span style=\"color: #333333;\">.<\/span>predict(x_test)\r\n\r\n<span style=\"color: #888888;\"># Show the Confusion Matrix<\/span>\r\n<span style=\"color: #007020;\">print<\/span>(confusion_matrix(y_test, y_pred))\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>In scikit-learn, the rows of the confusion matrix are the actual classes and the columns are the predicted classes, in label order (0, then 1). Taking class 1 as positive, the matrix is read as given below:<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># True negative: (top-left) (We predicted a negative result and it was negative)<\/span>\r\n<span style=\"color: #888888;\"># False positive: (top-right) (We predicted a positive result and it was negative)<\/span>\r\n<span style=\"color: #888888;\"># False negative: (lower-left) (We predicted a negative result and it was positive)<\/span>\r\n<span style=\"color: #888888;\"># True positive: (lower-right) (We predicted a positive result and it was positive)<\/span>\r\n<\/pre>\n<p>You can also use the following code to check the probability that each data point belongs to class 0 or class 1 (No or Yes).<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Check the probability that each data point belongs to each class (columns: class 0, class 1)<\/span>\r\nlog_reg<span style=\"color: #333333;\">.<\/span>predict_proba(x_test)\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h5><strong id=\"t7\">7. 
Create the Sigmoid Plot<\/strong><\/h5>\n<p>The code below creates a sigmoid plot of our dataset<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Create and sort a dataframe containing our data<\/span>\r\ndf <span style=\"color: #333333;\">=<\/span> pd<span style=\"color: #333333;\">.<\/span>DataFrame({<span style=\"background-color: #fff0f0;\">'x'<\/span>: x_test[:,<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>], <span style=\"background-color: #fff0f0;\">'y'<\/span>: y_test})\r\ndf <span style=\"color: #333333;\">=<\/span> df<span style=\"color: #333333;\">.<\/span>sort_values(by<span style=\"color: #333333;\">=<\/span><span style=\"background-color: #fff0f0;\">'x'<\/span>)\r\n\r\n\r\n<span style=\"color: #888888;\"># The expit function, also known as the logistic function, is defined as expit(x) = 1\/(1+exp(-x)).<\/span>\r\n<span style=\"color: #888888;\"># It is the inverse of the logit function.<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">from<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">scipy.special<\/span> <span style=\"color: #008800; font-weight: bold;\">import<\/span> expit\r\nsigmoid_function <span style=\"color: #333333;\">=<\/span> expit(df[<span style=\"background-color: #fff0f0;\">'x'<\/span>] <span style=\"color: #333333;\">*<\/span> log_reg<span style=\"color: #333333;\">.<\/span>coef_[<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>][<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>] <span style=\"color: #333333;\">+<\/span> log_reg<span style=\"color: #333333;\">.<\/span>intercept_[<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>])<span style=\"color: #333333;\">.<\/span>ravel()\r\nplt<span style=\"color: #333333;\">.<\/span>plot(df[<span style=\"background-color: #fff0f0;\">'x'<\/span>], sigmoid_function)\r\nplt<span style=\"color: #333333;\">.<\/span>scatter(df[<span style=\"background-color: #fff0f0;\">'x'<\/span>], df[<span 
style=\"background-color: #fff0f0;\">'y'<\/span>], c<span style=\"color: #333333;\">=<\/span>df[<span style=\"background-color: #fff0f0;\">'y'<\/span>], cmap<span style=\"color: #333333;\">=<\/span><span style=\"background-color: #fff0f0;\">'prism'<\/span>)\r\nplt<span style=\"color: #333333;\">.<\/span>show()\r\n<\/pre>\n<p>The output of this code is given below:<\/p>\n<figure id=\"attachment_285\" aria-describedby=\"caption-attachment-285\" style=\"width: 300px\" class=\"wp-caption aligncenter\"><a href=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-22.02.41.png\"><img loading=\"lazy\" decoding=\"async\" class=\"size-medium wp-image-285\" src=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-22.02.41-300x200.png\" alt=\"Sigmoid plot of Logistic Regression\" width=\"300\" height=\"200\" srcset=\"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-22.02.41-300x200.png 300w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-22.02.41-768x513.png 768w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-22.02.41-700x465.png 700w, https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Screenshot-2021-09-30-at-22.02.41.png 968w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><figcaption id=\"caption-attachment-285\" class=\"wp-caption-text\">Sigmoid plot of Logistic Regression<\/figcaption><\/figure>\n<p>This wraps up our class on Logistic Regression. 
I strongly recommend you watch the video for a clearer explanation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In this class, we would build a logistic regression model. We would cover the following topics: The Logistic Regression Model Import the Necessary Modules Generate &hellip;<\/p>\n","protected":false},"author":1,"featured_media":286,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[44,43],"tags":[66,68],"class_list":["post-276","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-data-science","category-python","tag-classification","tag-logistic-regression"],"jetpack_featured_media_url":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-content\/uploads\/sites\/12\/2021\/09\/Class-7-Introduction-to-Logistic-Regression.jpg","_links":{"self":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/276","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/comments?post=276"}],"version-history":[{"count":2,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/276\/revisions"}],"predecessor-version":[{"id":287,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/posts\/276\/revisions\/287"}],"
wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media\/286"}],"wp:attachment":[{"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/media?parent=276"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/categories?post=276"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kindsonthegenius.com\/data-science\/wp-json\/wp\/v2\/tags?post=276"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}