The Most Generic Algorithm for Removing Outliers

A generic algorithm for cleaning outliers from a one-dimensional set of numbers

Gal Abramovitz
4 min read · Nov 27, 2019

Working on a large dataset is always tricky. It’s often impossible to manually examine the data and determine whether it even makes sense.
Personally, I’m a big fan of data visualisation, which allows an intuitive and swift view of otherwise incomprehensible data. The challenges and dilemmas of the visualisation process are well documented, so I encourage you to explore and experiment with this topic, add it to your data science toolbox and use it as a base for any data examination in the future.

Getting back to our point: whether you use scatter plots, histograms or heat maps, outliers will probably stretch your axes and throw your chart out of proportion. To make matters worse, they might sabotage your understanding of the dataset (by distorting the mean, variance, etc.).
But what even are these little demons?

From Wikipedia:
“An outlier is a data point that differs significantly from other observations”

For example, consider the following ascending set:
(1, 2, 2, 5, 115)
The outlier in this set is clearly 115, as it’s much bigger than the other numbers. While it doesn’t change the median (2), it does change the mean from 2.5 to 25, which is conveniently 10 times bigger.
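You can verify this effect with Python’s standard statistics module (the variable names here are my own, just for illustration):

```python
from statistics import mean, median

full = [1, 2, 2, 5, 115]
trimmed = [1, 2, 2, 5]  # the same set with the outlier removed

print(median(full), median(trimmed))  # the median barely moves: 2 vs 2.0
print(mean(full), mean(trimmed))      # the mean jumps: 25 vs 2.5
```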

Dealing With Outliers

It’s worth mentioning that outliers aren’t always mistakes. They can be an excellent indicator of measurement errors, or they can simply point to an extreme result, one that is much better or worse than the others.

However, the inspiration for this article was work I did on a very diverse dataset (the columns/dimensions differed significantly from each other in scale) that involved creating normal-distribution charts for each dimension’s histogram. In this case, the outliers often distorted the variance values and made the line charts too wide.

Hence, I was interested in detecting outliers and removing them from the dataset.
The good news is that in most cases outliers are easy to spot visually.
The bad news is that this isn’t scalable.

Common Methods For Detecting Outliers

As mentioned above, outliers are easy to point out but pretty elusive to define.
Most of the algorithms that handle outliers require a threshold that defines the minimal “distance” of outliers from the rest of the set (e.g. the Z-score).
This makes sense, but doesn’t allow a generic solution for a wide range of data at different scales. For example, in order to find the outliers in the two following sets, one needs to define two different thresholds:

  1. (1, 2, 3, 4, 60)
  2. (100, 150, 160, 25000)
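A minimal sketch of the Z-score approach, using only the standard statistics module (the function name and the threshold value of 1.5 are my own choices, not a standard):

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=1.5):
    """Flag points more than `threshold` standard deviations from the mean.
    Assumes the values aren't all identical (sigma > 0)."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(zscore_outliers([1, 2, 3, 4, 60]))        # [60]
print(zscore_outliers([100, 150, 160, 25000]))  # [25000]
```

Note that with very small sets like these, the maximal possible Z-score is itself bounded, which is why the threshold here is lower than the textbook value of 2 or 3, another illustration of how sensitive this method is to the choice of threshold.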

Another approach is to derive the threshold from the distribution of the data itself (the IQR score). This method is very efficient, although a bit rough: it doesn’t reflect how far data points lie outside the range it handles (between the 25th and 75th percentiles), so it could identify points that are close to the median as outliers, even while other points lie much farther away.
For example, in the set (1, 2, 3, 4, 5, 6, 100) the interquartile range covers (2, 3, 4, 5, 6), hence both 1 and 100 fall outside it, even though 1 isn’t an outlier and 100 indeed is.
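For comparison, the common 1.5·IQR “fence” refinement of this method can be sketched with the standard statistics module (quartile conventions vary between libraries; this uses the module’s default):

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 100]
q1, _, q3 = quantiles(data, n=4)  # q1=2.0, q3=6.0 with the default method
iqr = q3 - q1
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = [v for v in data if v < lo or v > hi]
print(outliers)  # [100] — 1 stays inside the fences, 100 does not
```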

The Generic Algorithm For Detecting Outliers

Trying to avoid the thresholds mentioned above, I came up with a greedy algorithm that, on each iteration, identifies and removes the point farthest from its nearest neighbour. In other words, on every step it tries to eliminate as much blank space as possible.

But… I cheated. This algorithm actually does require one threshold: the percentage of original data-points that must remain in the output set.
The lower this value is, the more outliers will be removed, at the increasing risk of removing non-outlier points as well.
I just figured that this threshold is easier to use, as it’s usually provided externally as part of the problem definition, rather than requiring intimate acquaintance with the data’s specific properties.

The algorithm is as follows:
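A sketch of how this greedy procedure could look in Python, assuming that in one dimension the point farthest from its neighbour is always one of the two endpoints of the sorted set (the function name and the `keep_fraction` parameter are my own naming):

```python
def remove_outliers(values, keep_fraction=0.9):
    """Greedily drop the endpoint that is farthest from its neighbour,
    until only `keep_fraction` of the original points remain."""
    data = sorted(values)
    min_size = max(2, int(len(data) * keep_fraction))
    while len(data) > min_size:
        left_gap = data[1] - data[0]     # blank space before the smallest point
        right_gap = data[-1] - data[-2]  # blank space after the largest point
        if left_gap >= right_gap:
            data.pop(0)  # the smallest value is the farthest from its neighbour
        else:
            data.pop()   # the largest value is the farthest from its neighbour
    return data

print(remove_outliers([1, 2, 3, 4, 60], keep_fraction=0.8))        # [1, 2, 3, 4]
print(remove_outliers([100, 150, 160, 25000], keep_fraction=0.75))  # [100, 150, 160]
```

Note that the same `keep_fraction` works for both example sets from earlier, despite their very different scales, which is exactly the point of replacing a distance threshold with a percentage one.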

As explained, it removes outliers generically, even when they’re grouped together (“far” from the main data group) and form a batch of relevant data in their own right.

I’d love to hear comments, critiques and use cases that could benefit from this algorithm.
