This is Class 6 of our complete Data Science for Beginners series. In this class, we will apply all we’ve learnt to perform an actual analysis: building a classifier.
We will cover the following sub-topics:
- What is Classification?
- Building an NB Classifier
- Evaluation Metrics – TP, FP, TN, FN
- Evaluation Metrics – Accuracy, Precision, Sensitivity and Specificity
1. What is Classification
As the name suggests, classification is the process of predicting a category from measured values of the attributes. You may remember the iris dataset from Class 1. In the iris dataset, we have 4 attributes (Sepal Length, Sepal Width, Petal Length, and Petal Width). Given this set of attributes, a record can be classified as 1 of 3 classes of iris (Setosa, Virginica and Versicolor).
Classification falls under a set of machine learning approaches called Supervised Learning. Generally, once we have a labelled dataset, the task is to determine the function that maps the inputs (attributes) to the outputs (classes). This mapping between input and output is what we call the model.
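To make the idea of "a function that maps inputs to outputs" concrete, here is a minimal sketch in plain Python. It is not the Naive Bayes method used later in this class; it is a simple nearest-centroid rule chosen only for illustration. The measurements are made-up values, not real iris records, and the names `fit` and `predict` are hypothetical helpers.

```python
# Toy supervised learning: learn a mapping from attributes to classes.
# NOTE: the (petal length, petal width) values below are made up for
# illustration; they are not taken from the real iris dataset.
from math import dist

training_data = [
    ((1.4, 0.2), "Setosa"),
    ((1.5, 0.3), "Setosa"),
    ((4.5, 1.5), "Versicolor"),
    ((4.7, 1.4), "Versicolor"),
    ((5.8, 2.2), "Virginica"),
    ((6.0, 2.5), "Virginica"),
]

def fit(records):
    """Return the mean attribute vector (centroid) of each class."""
    grouped = {}
    for attributes, label in records:
        grouped.setdefault(label, []).append(attributes)
    return {
        label: tuple(sum(column) / len(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, attributes):
    """Map a set of attributes to the class with the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], attributes))

# "Training" produces the model; the model then maps new inputs to a class.
centroids = fit(training_data)
prediction = predict(centroids, (1.6, 0.25))
print(prediction)
```

Here the model is simply the dictionary of centroids together with the `predict` rule. A real classifier such as Naive Bayes learns a different kind of mapping, but the workflow of fitting on labelled records and then predicting is the same.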
More topics on classification can be found here:
2. Building a Classifier
As usual, we would follow 5 steps to build and test our classifier:
Step 1 – Import the necessary modules
#1. Import the necessary modules
import sklearn.datasets as ds
import pandas as pd
import numpy as np
Step 2 – Obtain your dataset
#2. Obtain and prepare your dataset
bc_array = ds.load_breast_cancer()
features = bc_array['data']
classes = bc_array['target']
feature_names = bc_array['feature_names']
# Add a 'Class' column name after the feature names
column_names = np.append(feature_names, 'Class')
bc_df = pd.DataFrame(data=np.c_[features, classes], columns=column_names)
Step 3 – Split the Dataset into Train and Test Data
You already know about data splitting from Class 5. You can review it.
So now, we need to split our dataset into train and test datasets.
#3. Split your dataset into test and train
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(bc_df[feature_names], bc_df['Class'], test_size=0.3, random_state=50)
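It can help to confirm that the split did what we expect before building the model. The sketch below is self-contained, so it loads the data with `return_X_y=True` instead of reusing `bc_df`; that shortcut is an assumption made for brevity, and the split settings are the same as in step 3.

```python
# Sanity-check the train/test split (same settings as step 3).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# return_X_y=True gives the features and the target as two NumPy arrays
X, y = load_breast_cancer(return_X_y=True)

Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, y, test_size=0.3, random_state=50
)

# Every record should end up in exactly one of the two sets
print("Train records:", Xtrain.shape[0])
print("Test records: ", Xtest.shape[0])
print("Test share:   ", round(Xtest.shape[0] / X.shape[0], 3))
```

The test share will be close to, but not exactly, 0.3 because the number of records is not a multiple of 10.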
Step 4 – Build the Model
We will build the model using the Naive Bayes algorithm (specifically, the Gaussian Naive Bayes implementation in scikit-learn).
#4. Build the Model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
model = gnb.fit(Xtrain, Ytrain)
Step 5 – Check Model Accuracy
To see the model performance, we would use the model to make predictions based on the test dataset.
#5. Check Model Accuracy
from sklearn.metrics import accuracy_score
y_pred = gnb.predict(Xtest)
print(accuracy_score(Ytest, y_pred))
The above would display the accuracy score. For me it gave
3. Evaluation Metrics – TP, FP, TN, FN
Although we have gotten the accuracy of our classifier, we still need to calculate other metrics that explain the classifier performance. The following are the metrics of interest.
- True Positives (TP): The actual class is 1 and the classifier correctly predicted 1.
- False Positives (FP): The actual class is 0 but the classifier wrongly predicted 1. This is called a Type I Error.
- True Negatives (TN): The actual class is 0 and the classifier correctly predicted 0.
- False Negatives (FN): The actual class is 1 but the classifier wrongly predicted 0. This is called a Type II Error.
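The four definitions above can be checked by counting them by hand. This is a minimal sketch in plain Python; the `actual` and `predicted` lists are hypothetical labels made up for illustration, not output from our breast cancer classifier.

```python
# Count TP, FP, TN and FN by comparing actual and predicted labels.
# NOTE: these labels are hypothetical (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 1, 0]

pairs = list(zip(actual, predicted))

tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # actual 1, predicted 1
fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # actual 0, predicted 1
tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # actual 0, predicted 0
fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # actual 1, predicted 0

print("TP:", tp, "FP:", fp, "TN:", tn, "FN:", fn)
```

Every record falls into exactly one of the four groups, so the four counts always add up to the number of records.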
These values can be represented in a confusion matrix.
The code below displays the confusion matrix for our classifier.
# Display the confusion matrix
from sklearn.metrics import confusion_matrix
confusion_matrix(Ytest, y_pred)
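To read the four counts out of the matrix, it helps to know its layout. For binary labels 0 and 1, scikit-learn puts the actual classes on the rows and the predicted classes on the columns, so the matrix is `[[TN, FP], [FN, TP]]`. The sketch below uses small hypothetical label lists (not our test data) so the result is easy to verify by eye.

```python
# Unpack TN, FP, FN and TP from a scikit-learn confusion matrix.
# NOTE: y_true and y_hat are hypothetical labels used only for illustration.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_hat  = [1, 1, 0, 0, 0, 0, 1, 1]

cm = confusion_matrix(y_true, y_hat)
print(cm)

# ravel() flattens the 2x2 matrix row by row: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The same unpacking works on `confusion_matrix(Ytest, y_pred)` from our classifier. Keep in mind that in scikit-learn's breast cancer dataset class 0 is malignant and class 1 is benign, so "positive" here means benign.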
4. Evaluation Metrics – Accuracy, Precision, Sensitivity and Specificity
Let's now look at these further metrics.
Accuracy – This is the number of correct classifications divided by the total number of classifications. It is given by the formula:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision – This is the total number of True Positives divided by the sum of True Positives and False Positives. It is given by:
Precision = TP / (TP + FP)
Sensitivity (Recall) – This is the total number of True Positives divided by the sum of True Positives and False Negatives. It is given by:
Sensitivity = TP / (TP + FN)
Specificity – This is the total number of True Negatives divided by the sum of True Negatives and False Positives. It is given by:
Specificity = TN / (TN + FP)
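The four definitions above translate directly into code. This is a minimal sketch in plain Python; the function name `classification_metrics` and the example counts are hypothetical and are not results from our classifier.

```python
# Compute accuracy, precision, sensitivity and specificity from the four counts.
def classification_metrics(tp, fp, tn, fn):
    """Return the four metrics as a dictionary.

    Assumes every denominator is non-zero; a real implementation
    would need to decide how to handle empty classes.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,      # correct / all classifications
        "precision": tp / (tp + fp),        # of predicted 1s, how many were 1
        "sensitivity": tp / (tp + fn),      # of actual 1s, how many were found
        "specificity": tn / (tn + fp),      # of actual 0s, how many were found
    }

# Hypothetical counts, chosen only to illustrate the arithmetic
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(name, round(value, 3))
```

To apply this to our classifier, pass in the counts read from the confusion matrix. scikit-learn also provides `accuracy_score`, `precision_score` and `recall_score` in `sklearn.metrics` for the first three.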