September 29, 2021

Class 6 – Introduction to Classification

This is Class 6 of our complete data science for beginners series. In this class, we would apply all we've learnt so far to perform an actual analysis: building a classifier.

We would cover the following sub-topics:

  1. What is Classification?
  2. Building an NB Classifier
  3. Evaluation Metrics – TP, FP, TN, FN
  4. Evaluation Metrics – Accuracy, Precision, Sensitivity and Specificity

 

1. What is Classification?

As the name suggests, classification is the process of predicting a category from measured values of the attributes. You may remember the iris dataset from Class 1. In the iris dataset, we have 4 attributes (Sepal Length, Sepal Width, Petal Length, and Petal Width). Given this set of attributes, a record can be classified as 1 of 3 classes of iris (Setosa, Virginica and Versicolor).
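As a quick sketch (assuming sklearn is installed), you can load the iris dataset and inspect its attributes and classes for yourself:

```python
# Quick look at the iris dataset's 4 attributes and 3 classes
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)       # the 4 measured attributes
print(list(iris.target_names))  # ['setosa', 'versicolor', 'virginica']
```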

Classification falls under the set of machine learning approaches called Supervised Learning. Generally, once we have a dataset, the task would be to determine the function that maps the inputs (attributes) to the outputs (classes). This mapping between input and output is what we call the model.


 

2. Building an NB Classifier

As usual, we would follow 5 steps to build and test our classifier:

Step 1 – Import the necessary modules

#1. Import the necessary modules
import sklearn.datasets as ds
import pandas as pd
import numpy as np

 

Step 2 – Obtain your dataset

#2. Obtain and prepare your dataset
bc_array = ds.load_breast_cancer()

features = bc_array['data']
classes = bc_array['target']
feature_names = bc_array['feature_names']
column_names = np.append(feature_names, 'Class')
bc_df = pd.DataFrame(data = np.c_[features, classes], columns = column_names)

 

Step 3 – Split the Dataset into Train and Test Data

You already know about data splitting from Class 5. You can review it.

So now, we need to split our dataset into train and test data sets.

#3. Split your dataset into test and train
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    bc_df[feature_names], bc_df['Class'], test_size=0.3, random_state=50)

 

Step 4 – Build the Model

We would build the model using the Naive Bayes algorithm.

#4. Build the Model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
model = gnb.fit(Xtrain, Ytrain)

 

Step 5 – Check Model Accuracy

To see the model performance, we would use the model to make predictions based on the test dataset.

#5. Check Model Accuracy
from sklearn.metrics import accuracy_score
y_pred = gnb.predict(Xtest)
print(accuracy_score(Ytest, y_pred))

The above would display the accuracy score. For me, it gave:

0.935672514619883

 

3. Evaluation Metrics – TP, FP, TN, FN

Although we have gotten the accuracy of our classifier, we still need to calculate other metrics that explain the classifier performance. The following are the metrics of interest.

  • True Positives (TP): The actual class is 1 and the classifier correctly predicted it as 1.
  • False Positives (FP): The actual class is 0 but the classifier wrongly predicted it as 1. This is a Type I Error.
  • True Negatives (TN): The actual class is 0 and the classifier correctly predicted it as 0.
  • False Negatives (FN): The actual class is 1 but the classifier wrongly predicted it as 0. This is called a Type II Error.

These values can be represented in a confusion matrix.

The code below displays the confusion matrix for our classifier.

# Display the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(Ytest, y_pred)
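As a hedged sketch with made-up labels (not from our breast cancer classifier), the four counts can be read off the matrix directly, since for binary labels sklearn lays it out as [[TN, FP], [FN, TP]]:

```python
# Illustrative: recover TN, FP, FN, TP from a binary confusion matrix
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # made-up actual classes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # made-up predictions

# ravel() flattens [[TN, FP], [FN, TP]] into (TN, FP, FN, TP)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```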

 

4. Evaluation Metrics – Accuracy, Precision, Sensitivity and Specificity

Let's now look at these metrics in more detail.

Accuracy – This is the number of correct classifications divided by the total number of classifications. It is given by the formula:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision – This is the total number of True Positives divided by the sum of True Positives and False Positives. It is given by:

Precision = TP / (TP + FP)

Sensitivity (Recall) – This is the total number of True Positives divided by the sum of True Positives and False Negatives. It is given by:

Sensitivity = TP / (TP + FN)

Specificity – This is the number of True Negatives divided by the sum of True Negatives and False Positives. It is given by:

Specificity = TN / (TN + FP)
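The four formulas can be applied directly to the counts from a confusion matrix. Below is a small sketch using assumed example counts (not the counts from our classifier):

```python
# Illustrative metric calculations from hypothetical counts
tp, fp, tn, fn = 50, 5, 35, 10  # assumed example values

accuracy = (tp + tn) / (tp + tn + fp + fn)  # all correct / all predictions
precision = tp / (tp + fp)                  # how many predicted 1s were right
sensitivity = tp / (tp + fn)                # recall: how many actual 1s were found
specificity = tn / (tn + fp)                # how many actual 0s were found

print(accuracy, precision, sensitivity, specificity)
```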

 
