In this class, we will cover Feature Selection. This class follows on from Class 3 and Class 4, which discussed Data Preprocessing.
The following are covered here:
- Introduction to Feature Selection
- Univariate Feature Selection
- Recursive Feature Elimination
- Dimensionality Reduction
- Feature Importance
1. Introduction to Feature Selection
Feature Selection, also known as Variable Selection, is the process of selecting a subset of variables (or features) to be used for building a model. In other words, we want to reduce the number of features by keeping only those expected to produce the best performance.
Here are some reasons why we do feature selection:
- simplifies the model making it easier to interpret
- makes the data more compatible with the training algorithms
- results in shorter training times
2. Univariate Feature Selection
Univariate feature selection is a technique that helps to select the variables that are most strongly related to the output variable (the dependent or target variable). In this demo, we will use the SelectKBest class from the scikit-learn library.
Follow the steps below.
Step 1 – Import your dataset as well as the relevant modules
# Import the necessary modules as well as your dataset
import pandas as pd
import numpy as np
import sklearn.feature_selection as fs
from sklearn.feature_selection import chi2

path = '/Users/kindsonmunonye/Datasets/wine.csv'
wine_df = pd.read_csv(path)
Step 2 – Extract the features and predictor
# Extract the features and the predictor as arrays
wineY = wine_df.iloc[0:, 0:1].values
wineX = wine_df.iloc[0:, 1:].values
Step 3 – Select Best 5 Features
# Select the best 5 features
selector = fs.SelectKBest(score_func=chi2, k=5)
result = selector.fit(wineX, wineY)
best_features = result.transform(wineX)
Step 4 – View the Results
# Display the output
np.set_printoptions(precision=2)
print(result.scores_)
print(best_features.shape)
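The scores are printed in the same order as the feature columns. If you would like to see which score belongs to which feature, a minimal sketch (assuming wine_df and result from the steps above) is to zip the column names with the scores:

# Pair each feature name with its chi-squared score
feature_names = wine_df.iloc[0:, 1:].columns
for name, score in zip(feature_names, result.scores_):
    print(f'{name}: {score:.2f}')

# Indices of the columns that SelectKBest kept
print(result.get_support(indices=True))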
3. Recursive Feature Elimination (RFE)
This is another feature selection technique that works by recursively removing attributes and building the model with the remaining features. The RFE class from the sklearn.feature_selection module can be used to achieve this.
Step 1 – Import your dataset and extract the features and predictor. Just modify the code for univariate feature selection above.
Step 2 – Import the linear_model library
# Import the necessary modules
import sklearn.linear_model as lm
from sklearn.feature_selection import RFE
Step 3 – Create and fit a regression object. In this example, we want to select 3 features. But feel free to increase to a different number.
reg_model = lm.LogisticRegression(max_iter=10000)
rfe = RFE(reg_model, n_features_to_select=3)
fit = rfe.fit(wineX, wineY.ravel())
Step 4 – Display the results
The three selected features are assigned rank 1 in the rankings array
# View the results
# Selected features are assigned rank 1
ranks = rfe.ranking_
features = wine_df.iloc[0:, 1:].columns

# Display the features with rank 1
for a, b in zip(ranks, features):
    if a == 1:
        print(f'{a}: {b}')
See the video to learn about the zip function used for iterating over two lists at the same time.
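As an alternative to scanning the ranking array, the fitted RFE object also exposes a support_ attribute, a boolean mask that is True for the selected features. A minimal sketch, assuming rfe and wine_df from the steps above:

# support_ marks the selected feature columns with True
selected_features = wine_df.iloc[0:, 1:].columns[rfe.support_]
print(selected_features.tolist())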
4. Dimensionality Reduction
Since this topic is quite involved, I am making a separate class for it, but you can find some of my lessons on it below. In the next class, we will review PCA using a simple demo; a small preview sketch follows the list of lessons below.
- Introduction to Dimensionality Reduction
- How to Perform Principal Components Analysis(PCA)
- How to Perform PCA in Python – Step by Step
- Introduction to Singular Value Decomposition (SVD)
- How to Perform Factor Analysis (FA) – Step by Step – Video
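As a quick preview of that demo, here is a minimal sketch of PCA with scikit-learn, assuming wineX from the earlier steps; the full walk-through comes in the next class:

# Project the wine features onto 2 principal components
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
wineX_reduced = pca.fit_transform(wineX)

print(wineX_reduced.shape)            # now only 2 columns
print(pca.explained_variance_ratio_)  # variance captured by each component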
5. Feature Importance
Feature Importance is a technique for assigning scores to input features based on how useful they are for predicting the target variable. Simply put, feature importance helps us select the most important features.
Some types of feature importance scores include:
- coefficients calculated as part of a linear model
- correlation scores
- permutation importance scores (see the sketch after this list)
- decision tree importance scores
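As an example of the permutation importance scores listed above, scikit-learn provides a permutation_importance helper that measures how much a fitted model's score drops when one feature's values are shuffled. A minimal sketch, assuming wineX and wineY from the earlier steps:

# Compute permutation importance scores for a simple fitted model
from sklearn.inspection import permutation_importance
import sklearn.linear_model as lm

perm_model = lm.LogisticRegression(max_iter=10000)
perm_model.fit(wineX, wineY.ravel())

perm = permutation_importance(perm_model, wineX, wineY.ravel(), n_repeats=10, random_state=0)
print(perm.importances_mean)  # one score per feature column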
In this tutorial, we will use the ExtraTreesClassifier class from the sklearn.ensemble module. As before, we follow these steps:
Step 1 – Import the wine dataset and split into wineX and wineY. You already know how to do this!
Step 2 – Import the sklearn.ensemble module as the ExtraTreesClassifier is available there.
# Import the module
import sklearn.ensemble as se
Step 3 – Create and fit the model
# Create and fit the model
model = se.ExtraTreesClassifier()
model.fit(wineX, wineY.ravel())
Step 4 – Display the Feature importances
# Display each feature with its importance score
importances = model.feature_importances_
features = wine_df.iloc[0:, 1:].columns
for a, b in zip(features, importances):
    print(f'{a} : {b}')
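To make the ranking easier to read, you can sort the features by their importance scores, as in this small sketch (assuming features and importances from the step above):

# Sort the features from most to least important
ranked = sorted(zip(features, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f'{name} : {score:.3f}')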
You can now see the importance value displayed for each feature. In the next class, we will cover Principal Component Analysis (PCA).
I also strongly recommend you watch the video for a clearer explanation.
