{"id":261,"date":"2021-09-21T15:51:37","date_gmt":"2021-09-21T15:51:37","guid":{"rendered":"https:\/\/www.kindsonthegenius.com\/data-science\/?p=261"},"modified":"2021-09-21T15:51:58","modified_gmt":"2021-09-21T15:51:58","slug":"class-5-introduction-to-feature-selection","status":"publish","type":"post","link":"https:\/\/www.kindsonthegenius.com\/data-science\/class-5-introduction-to-feature-selection\/","title":{"rendered":"Class 5 &#8211; Introduction to Practical Feature Selection with Python"},"content":{"rendered":"<p>In this class, we would cover Feature Selection. This class follows from Class 3 and Class 4 which discussed Data Preprocessing<\/p>\n<p>The following are covered here:<\/p>\n<ol>\n<li><a href=\"#t1\">Introduction to Feature Selection<\/a><\/li>\n<li><a href=\"#t2\">Univariate Feature Selection<\/a><\/li>\n<li><a href=\"#t3\">Recursive Feature Elimination<\/a><\/li>\n<li><a href=\"#t4\">Dimensionality Reduction<\/a><\/li>\n<li><a href=\"#t5\">Feature Importance<\/a><\/li>\n<\/ol>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t1\">1. Introduction to Feature Selection<\/strong><\/h4>\n<p>Feature Selection also known as Variable selection is the method of selecting a subset of variables (or features) to be used for building a model. In order words, we want to reduce the number of features by selecting only features that are expected to produce the best performance.<\/p>\n<p>Here are some reason why we do feature selection<\/p>\n<ul>\n<li>simplifies the model making it easier to interpret<\/li>\n<li>makes the data more compatible with the training algorithms<\/li>\n<li>results in shorter training times<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t2\">2. Univariate Feature Selection<\/strong><\/h4>\n<p>Univariate feature selection is a technique that helps to select the variables that are strongly related with the output variable(predictor or dependent variable). 
In this demo, we use the SelectKBest class from the scikit-learn library. Follow the steps below.

**Step 1** – Import the relevant modules and your dataset

```python
# Import the necessary modules, then load your dataset
import pandas as pd
import numpy as np
import sklearn.feature_selection as fs
from sklearn.feature_selection import chi2

path = '/Users/kindsonmunonye/Datasets/wine.csv'
wine_df = pd.read_csv(path)
```

**Step 2** – Extract the features and the target

```python
# Extract the features and the target as arrays
# (the first column holds the class label, the rest are features)
wineY = wine_df.iloc[0:, 0:1].values
wineX = wine_df.iloc[0:, 1:].values
```

**Step 3** – Select the best 5 features

```python
# Select the 5 best features using the chi-squared score
selector = fs.SelectKBest(score_func=chi2, k=5)
result = selector.fit(wineX, wineY.ravel())  # ravel() flattens the column vector
best_features = result.transform(wineX)
```

**Step 4** – View the results

```python
# Display the chi-squared score of each feature,
# then the shape of the reduced feature array
np.set_printoptions(precision=2)
print(result.scores_)
print(best_features.shape)
```
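If you also want to see *which* columns were kept, SelectKBest exposes a `get_support` method. Here is a minimal, self-contained sketch; note it uses scikit-learn's bundled wine data (`load_wine`) in place of the local `wine.csv`, which is a substitution made only so the snippet runs anywhere:

```python
# A self-contained sketch: map the selected features back to their names.
# Uses sklearn's bundled wine dataset instead of the local wine.csv.
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, chi2

wine = load_wine()
selector = SelectKBest(score_func=chi2, k=5).fit(wine.data, wine.target)

# get_support(indices=True) returns the column indices of the kept features
for i in selector.get_support(indices=True):
    print(f'{wine.feature_names[i]}: {selector.scores_[i]:.2f}')
```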
style=\"color: #333333;\">.<\/span>set_printoptions(precision<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">2<\/span>)\r\n<span style=\"color: #007020;\">print<\/span>(result<span style=\"color: #333333;\">.<\/span>scores_)\r\nbest_features<span style=\"color: #333333;\">.<\/span>shape\r\n<\/pre>\n<p>&nbsp;<\/p>\n<h4><strong id=\"t3\">3. Recursive Feature Elimination(RFE)<\/strong><\/h4>\n<p>This is another feature selection technique that works by removing attributes recursively and then building the model with the remaining attributes or features. The RFE module of the sklearn library can be used to achieve RFE.<\/p>\n<p><strong>Step 1<\/strong> &#8211; Import your dataset and extract the features and predictor. Just modify the code for univariate model selection above.<\/p>\n<p><strong>Step 2<\/strong> &#8211; Import the\u00a0 linear_model library<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># Import the neccessary modules<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">import<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">sklearn.linear_model<\/span> <span style=\"color: #008800; font-weight: bold;\">as<\/span> <span style=\"color: #0e84b5; font-weight: bold;\">lm<\/span>\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 3<\/strong> &#8211; Create and fit a regression object. In this example, we want to select 3 features. But feel free to increase to a different number.<\/p>\n<pre style=\"margin: 0; line-height: 125%;\">reg_model <span style=\"color: #333333;\">=<\/span> lm<span style=\"color: #333333;\">.<\/span>LogisticRegression(max_iter<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">10000<\/span>)\r\nrfe <span style=\"color: #333333;\">=<\/span> RFE(reg_model, n_features_to_select<span style=\"color: #333333;\">=<\/span><span style=\"color: #0000dd; font-weight: bold;\">3<\/span>)\r\nfit <span style=\"color: #333333;\">=<\/span> rfe<span style=\"color: #333333;\">.<\/span>fit(wineX, wineY<span style=\"color: #333333;\">.<\/span>ravel())\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><strong>Step 4<\/strong> &#8211; Display the results<\/p>\n<p>The three selected features are assigned rank 1 in the rankings array<\/p>\n<pre style=\"margin: 0; line-height: 125%;\"><span style=\"color: #888888;\"># View the Results<\/span>\r\n<span style=\"color: #888888;\"># Selected features are assigned rank 1<\/span>\r\nranks <span style=\"color: #333333;\">=<\/span> rfe<span style=\"color: #333333;\">.<\/span>ranking_\r\nfeatures <span style=\"color: #333333;\">=<\/span> wine_df<span style=\"color: #333333;\">.<\/span>iloc[<span style=\"color: #0000dd; font-weight: bold;\">0<\/span>:,<span style=\"color: #0000dd; font-weight: bold;\">1<\/span>:]<span style=\"color: #333333;\">.<\/span>columns\r\n\r\n<span style=\"color: #888888;\"># Display feature with rank 1<\/span>\r\n<span style=\"color: #008800; font-weight: bold;\">for<\/span> a, b <span style=\"color: #000000; font-weight: bold;\">in<\/span> <span style=\"color: #007020;\">zip<\/span>(ranks, features):\r\n    <span style=\"color: #008800; font-weight: bold;\">if<\/span> a <span style=\"color: #333333;\">==<\/span> <span style=\"color: #0000dd; font-weight: bold;\">1<\/span>:\r\n        <span style=\"color: #007020;\">print<\/span>(f<span style=\"background-color: #fff0f0;\">'{a}: {b}'<\/span>)\r\n<\/pre>\n<p>See the video to learn about the the zip function used for iterating two lists at the same 
#### 4. Dimensionality Reduction

Since this topic is quite involved, I am making a separate class for it; in the next class, we will review PCA using a simple demo. In the meantime, you can find some of my lessons below.

- [Introduction to Dimensionality Reduction](https://www.kindsonthegenius.com/pca-tutorial-1-introduction-to-pca-and-dimensionality-reduction/)
- [How to Perform Principal Components Analysis (PCA)](https://www.kindsonthegenius.com/pca-tutorial-1-how-to-perform-principal-components-analysis-pca/)
- [How to Perform PCA in Python – Step by Step](https://www.kindsonthegenius.com/principal-components-analysispca-in-python-step-by-step/)
- [Introduction to Singular Value Decomposition (SVD)](https://www.kindsonthegenius.com/singular-value-decompositionsvd-a-dimensionality-reduction-technique/)
- [How to Perform Factor Analysis (FA) – Step by Step – Video](https://youtu.be/ttBs_wfw_6U)

#### 5. Feature Importance

Feature Importance is a technique for assigning scores to input features based on how useful they are for predicting the target variable. Simply put, feature importance helps us select the most important features.

Some types of feature importance include:

- coefficients calculated as part of a linear model
- correlation scores
- permutation importance scores (see the sketch after this list)
- decision tree importance scores
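Permutation importance, mentioned in the list above, is available in scikit-learn's `inspection` module. Here is a minimal, self-contained sketch; as before, it substitutes the bundled wine data (`load_wine`) for the local CSV so that it runs on its own:

```python
# Permutation importance: shuffle one feature at a time and
# measure how much the model's score drops as a result
from sklearn.datasets import load_wine
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.inspection import permutation_importance

wine = load_wine()
model = ExtraTreesClassifier(random_state=0).fit(wine.data, wine.target)

result = permutation_importance(model, wine.data, wine.target,
                                n_repeats=10, random_state=0)
for name, score in zip(wine.feature_names, result.importances_mean):
    print(f'{name}: {score:.3f}')
```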
In this tutorial, we use the ExtraTreesClassifier from the sklearn library. As before, we follow the steps:

**Step 1** – Import the wine dataset and split it into wineX and wineY. You already know how to do this!

**Step 2** – Import the sklearn.ensemble module, where the ExtraTreesClassifier is available.

```python
# Import the module
import sklearn.ensemble as se
```

**Step 3** – Create and fit the model

```python
# Create and fit the model
model = se.ExtraTreesClassifier()
model.fit(wineX, wineY.ravel())
```

**Step 4** – Display the feature importances

```python
importances = model.feature_importances_
features = wine_df.iloc[0:, 1:].columns
for a, b in zip(features, importances):
    print(f'{a} : {b}')
```

You can now see that the importance value for each feature is displayed.
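To rank the features at a glance, you can pair each importance with its column name in a pandas Series and sort. A small sketch, assuming the `model` and `wine_df` from the steps above:

```python
import pandas as pd

# Pair importances with column names and sort, largest first
ranked = pd.Series(model.feature_importances_,
                   index=wine_df.columns[1:]).sort_values(ascending=False)
print(ranked)
```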
In the next class, we will cover Principal Component Analysis (PCA). I also strongly recommend you [watch the video](https://youtu.be/mdC08NQa5uc) for a clearer explanation.