In this class, we will cover Feature Selection. This class follows on from Class 3 and Class 4, which discussed Data Preprocessing.
The following are covered here:
- Introduction to Feature Selection
- Univariate Feature Selection
- Recursive Feature Elimination
- Dimensionality Reduction
- Feature Importance
1. Introduction to Feature Selection
Feature Selection, also known as Variable Selection, is the process of selecting a subset of variables (or features) to be used for building a model. In other words, we want to reduce the number of features by keeping only those expected to produce the best performance.
Here are some reasons why we do feature selection:
- simplifies the model making it easier to interpret
- makes the data more compatible with the training algorithms
- results in shorter training times
2. Univariate Feature Selection
Univariate feature selection is a technique that helps to select the variables that are most strongly related to the output variable (the dependent or target variable). In this demo, we will use the SelectKBest class from the scikit-learn library.
Follow the steps below.
Step 1 – Import your dataset as well as the relevant modules
# Import the necessary modules as well as your dataset
import pandas as pd
import numpy as np
import sklearn.feature_selection as fs
from sklearn.feature_selection import chi2

path = '/Users/kindsonmunonye/Datasets/wine.csv'
wine_df = pd.read_csv(path)
Step 2 – Extract the features and predictor
# Extract the features and the predictor as arrays
wineY = wine_df.iloc[0:, 0:1].values
wineX = wine_df.iloc[0:, 1:].values
Step 3 – Select Best 5 Features
# Select the best 5 features
selector = fs.SelectKBest(score_func=chi2, k=5)
result = selector.fit(wineX, wineY)
best_features = result.transform(wineX)
Step 4 – View the Results
# Display the output
np.set_printoptions(precision=2)
print(result.scores_)
print(best_features.shape)
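The scores are printed in the same order as the feature columns. If you would like to see which score belongs to which feature, a minimal sketch (assuming wine_df and result from the steps above) is to zip the column names with the scores:

# Pair each feature name with its chi-squared score
feature_names = wine_df.iloc[0:, 1:].columns
for name, score in zip(feature_names, result.scores_):
    print(f'{name}: {score:.2f}')

# Indices of the columns that SelectKBest kept
print(result.get_support(indices=True))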
3. Recursive Feature Elimination (RFE)
This is another feature selection technique that works by recursively removing attributes and building the model with the remaining features. The RFE class from the sklearn.feature_selection module can be used to achieve this.
Step 1 – Import your dataset and extract the features and predictor. Just modify the code for univariate feature selection above.
Step 2 – Import the linear_model library
# Import the necessary modules
import sklearn.linear_model as lm
from sklearn.feature_selection import RFE
Step 3 – Create and fit a regression object. In this example, we want to select 3 features. But feel free to increase to a different number.
reg_model = lm.LogisticRegression(max_iter=10000)
rfe = RFE(reg_model, n_features_to_select=3)
fit = rfe.fit(wineX, wineY.ravel())
Step 4 – Display the results
The three selected features are assigned rank 1 in the rankings array
# View the results
# Selected features are assigned rank 1
ranks = rfe.ranking_
features = wine_df.iloc[0:, 1:].columns

# Display the features with rank 1
for a, b in zip(ranks, features):
    if a == 1:
        print(f'{a}: {b}')
See the video to learn about the zip function used for iterating over two lists at the same time.
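As an alternative to scanning the ranking array, the fitted RFE object also exposes a support_ attribute, a boolean mask that is True for the selected features. A minimal sketch, assuming rfe and wine_df from the steps above:

# support_ marks the selected feature columns with True
selected_features = wine_df.iloc[0:, 1:].columns[rfe.support_]
print(selected_features.tolist())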
4. Dimensionality Reduction
Since this topic is quite involved, I am making a separate class for it, but you can find some of my lessons on it below. In the next class, we will review PCA using a simple demo; a small preview sketch follows the list of lessons below.
- Introduction to Dimensionality Reduction
- How to Perform Principal Components Analysis(PCA)
- How to Perform PCA in Python – Step by Step
- Introduction to Singular Value Decomposition (SVD)
- How to Perform Factor Analysis (FA) – Step by Step – Video
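As a quick preview of that demo, here is a minimal sketch of PCA with scikit-learn, assuming wineX from the earlier steps; the full walk-through comes in the next class:

# Project the wine features onto 2 principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
wineX_reduced = pca.fit_transform(wineX)

print(wineX_reduced.shape)            # now only 2 columns
print(pca.explained_variance_ratio_)  # variance captured by each component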
5. Feature Importance
Feature Importance is a technique for assigning scores to input features based on how useful they are for predicting the target variable. Simply put, feature importance helps us select the most important features.
Some types of feature importance scores include:
- coefficients calculated as part of a linear model
- correlation scores
- permutation importance scores (see the sketch after this list)
- decision tree importance scores
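As an example of the permutation importance scores listed above, scikit-learn provides a permutation_importance helper that measures how much a fitted model's score drops when one feature's values are shuffled. A minimal sketch, assuming wineX and wineY from the earlier steps:

# Compute permutation importance scores for a simple fitted model
from sklearn.inspection import permutation_importance
import sklearn.linear_model as lm

perm_model = lm.LogisticRegression(max_iter=10000)
perm_model.fit(wineX, wineY.ravel())

perm = permutation_importance(perm_model, wineX, wineY.ravel(), n_repeats=10, random_state=0)
print(perm.importances_mean)  # one score per feature column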
In this tutorial, we will use the ExtraTreesClassifier class from the sklearn.ensemble module. As before, we follow these steps:
Step 1 – Import the wine dataset and split into wineX and wineY. You already know how to do this!
Step 2 – Import the sklearn.ensemble module as the ExtraTreesClassifier is available there.
# Import the module
import sklearn.ensemble as se
Step 3 – Create and fit the model
# Create and fit the model
model = se.ExtraTreesClassifier()
model.fit(wineX, wineY.ravel())
Step 4 – Display the Feature importances
# Display each feature with its importance score
importances = model.feature_importances_
features = wine_df.iloc[0:, 1:].columns
for a, b in zip(features, importances):
    print(f'{a} : {b}')
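To make the ranking easier to read, you can sort the features by their importance scores, as in this small sketch (assuming features and importances from the step above):

# Sort the features from most to least important
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name} : {score:.3f}')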
You can now see the importance value displayed for each feature. In the next class, we will cover Principal Component Analysis (PCA).
I also strongly recommend you watch the video for a clearer explanation.
