We will cover the following:
- What is Outlier Detection?
- Application Areas of Outlier Detection
- Types of Outliers
- Causes of Outliers
- Outlier Detection Techniques
1. What is Outlier Detection?
An outlier, also called an anomaly, is a data point that has low probability under the model and for which predictions may therefore be of low accuracy. The techniques applied to detect such data points are termed outlier detection or anomaly detection.
Detecting anomalies and removing them from a dataset often improves accuracy, although a genuine but rare observation should not be discarded without checking it first. If we examine the two plots shown in Figure 1, it is easy to see data points that do not appear to correspond with the rest of the observations. The question is, 'how do we handle such data points?' That is what we are going to examine under Application and Techniques of Outlier Detection.
Figure 1: Data Containing Outliers
Outliers may also be classified as contextual outliers, such as typographical errors made during data entry, or point outliers, which are single data points separated from the others.
4. Causes of Outliers
Some of the common causes of outliers include:
- Sampling errors that may result from the source of the data
- Outliers added deliberately to achieve certain objectives
- Human error during data entry
- Measurement errors incurred from the data collection and measurement tools
5. Outlier Detection Techniques
Density-based techniques, such as k-nearest neighbors, flag points that lie in sparsely populated regions of the data.
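To make the nearest-neighbor idea concrete, here is a minimal sketch that scores each point by the distance to its k-th nearest neighbor, treating a large distance as a sign of low local density. The function name `knn_distance_scores`, the choice `k=3`, and the sample points are assumptions made for illustration only; this is one simple variant, not the only way to build a density-based detector.

```python
import math


def knn_distance_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor.

    A larger score means the point sits in a sparser region.
    """
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical sample: six clustered points and one isolated point.
points = [
    (1.0, 1.1), (1.2, 0.9), (0.9, 1.0),
    (1.1, 1.2), (1.0, 0.8), (0.8, 1.1),
    (6.0, 6.5),
]

scores = knn_distance_scores(points, k=3)
most_isolated = max(range(len(points)), key=scores.__getitem__)
print("Most isolated point:", points[most_isolated])
```

In practice the scores would be compared against a threshold, or only the top few would be inspected. This brute-force version compares every pair of points, so it suits small datasets only.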
The z-score, or standard score, is a parametric measure that indicates how many standard deviations a data point lies from the mean.
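The z-score of a value x is (x - mean) / standard deviation. The sketch below flags values whose absolute z-score exceeds a cutoff. The cutoff of 3 is a common convention rather than a rule, and the function name `zscore_outliers` and the sample readings are assumptions made for illustration.

```python
from statistics import mean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for every value with |z| above threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing can be flagged
    flagged = []
    for i, x in enumerate(values):
        z = (x - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, x, z))
    return flagged


# Hypothetical sensor readings with one suspicious value at the end.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 9.7, 10.0, 10.1, 9.9, 10.2, 25.0]

outliers = zscore_outliers(readings, threshold=3.0)
for index, value, z in outliers:
    print(f"index={index} value={value} z={z:.2f}")
```

Because the mean and standard deviation are themselves pulled by extreme values, this approach works best when the data are roughly normally distributed and the sample is not very small.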
Linear models, such as Principal Component Analysis (PCA).
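One way PCA can be used for outlier detection is to measure how far each point lies from the main direction of variation. The sketch below does this for two-dimensional data only, using the closed-form orientation of the leading eigenvector of a 2x2 covariance matrix. The function name `pca_residuals_2d` and the sample data are assumptions for illustration; for higher dimensions a numerical library would normally compute the components.

```python
import math
from statistics import mean


def pca_residuals_2d(points):
    """Distance of each 2-D point from the first principal axis."""
    n = len(points)
    mx = mean(x for x, _ in points)
    my = mean(y for _, y in points)
    # Entries of the 2x2 covariance matrix.
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    # Angle of the leading eigenvector (direction of greatest variance).
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    ux, uy = math.cos(theta), math.sin(theta)
    # Residual: component of the centred point perpendicular to that axis.
    return [abs(-(x - mx) * uy + (y - my) * ux) for x, y in points]


# Hypothetical data lying close to the line y = 2x, plus one point far off it.
points = [
    (1, 2.1), (2, 3.9), (3, 6.2), (4, 7.9),
    (5, 10.1), (6, 11.8), (7, 14.1), (8, 16.0),
    (4.5, 15.0),
]

residuals = pca_residuals_2d(points)
worst = max(range(len(points)), key=residuals.__getitem__)
print("Largest residual at:", points[worst])
```

Note that the outlier itself influences the fitted axis, so residuals are best read as a ranking rather than as exact distances from the true trend.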
Other techniques exist that are not covered in this article.
Outlier detection is an important concept that every researcher needs to appreciate, because the accuracy of research results depends on the consistency of the samples used.