Data Science - Dynamic analysis on outliers with R

Introduction

Outliers are the extreme values of a variable. Depending on the model or requirement, it may be necessary to treat them, either by transforming them or by deleting them.

Variable “Income” distribution

01_income_distribution

This is our main variable in this example; it represents customers' income in dollars. There are a few cases with very high values, while there are many cases with low or mid values.

If we choose to delete them…

A common question is: “How many cases do we have to leave out?” We can choose to leave out the highest 1%, which gives:

02_income_p99.JPG

Now the distribution looks very similar to the previous one, except that it now reaches $300,000 instead of $500,000.
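As a minimal sketch of this step, the snippet below drops the highest 1% of a variable using its 99th percentile as the cut-off. The `income` vector here is simulated (log-normal) purely for illustration; it is not the data set used in this post, and the names `cutoff` and `income_trimmed` are assumptions.

```r
# Simulated, hypothetical income data (log-normal); not the post's data set.
set.seed(42)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

# 99th percentile: the cut-off above which the highest 1% of cases lie.
cutoff <- quantile(income, probs = 0.99, names = FALSE)

# Keep only the cases at or below the cut-off.
income_trimmed <- income[income <= cutoff]

# Compare the range of the variable before and after trimming.
print(range(income))
print(range(income_trimmed))
```

Calling `hist(income_trimmed)` afterwards draws the trimmed distribution, which can be compared with `hist(income)`.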

If we repeat this process iteratively (deleting the highest 1%, then deleting the highest 1% of that result, and so on, 10 times), we are analyzing different cut-off values for leaving out extreme values. We obtain a curious result: the silhouette always remains similar to:

03_density_simple.JPG
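The iterative process can be sketched roughly as follows. This is an illustration on simulated log-normal data, not the code behind the plots in this post; `trim_top()` is a hypothetical helper, and the mean-to-median ratio is just one simple way (an assumption of this sketch) to check that the distribution stays right-skewed after every pass.

```r
# Simulated, hypothetical income data (log-normal); not the post's data set.
set.seed(42)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

# trim_top() is a hypothetical helper: drop the cases above the p-th quantile.
trim_top <- function(x, p = 0.99) {
  x[x <= quantile(x, probs = p, names = FALSE)]
}

n_iter <- 10
current <- income
summary_by_iter <- data.frame()

for (i in seq_len(n_iter)) {
  # Each pass removes the highest 1% of whatever survived the previous pass.
  current <- trim_top(current)
  summary_by_iter <- rbind(summary_by_iter, data.frame(
    iteration = i,
    n = length(current),
    max_value = max(current),
    # A ratio above 1 means the mean is pulled up by a long right tail.
    mean_over_median = mean(current) / median(current)
  ))
}

print(summary_by_iter)
```

The maximum shrinks with each iteration, which is why the axis values change, while the ratio stays above 1, consistent with a silhouette that keeps its long right tail.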

 

Animating the example

The following animation shows this iterative deleting process in action. As we leave out the highest 1%, the silhouette keeps a similar aspect:

04_fractal_outliers.gif

In other words, there are always lots of people with low or mid income and just a few cases with high income, because of the nature of the distribution. The axis values change with each iteration.

If we replace the histogram with a density plot, the result looks more like zooming in on the left side of the data:

05_fractal_outliers_density.gif

When we delete the lowest or highest values of any variable, what we are doing is “zooming in” on the area where most cases are.
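One way to see the “zoom” numerically is to compare the x-range covered by a kernel density estimate before and after trimming. The sketch below uses base R's `density()` on simulated log-normal data, which is an assumption for illustration and not the data behind the animations above.

```r
# Simulated, hypothetical income data (log-normal); not the post's data set.
set.seed(42)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

# Drop the highest 1% of cases.
income_trimmed <- income[income <= quantile(income, probs = 0.99, names = FALSE)]

# Kernel density estimates for the full and the trimmed variable.
d_full <- density(income)
d_trim <- density(income_trimmed)

# The trimmed estimate spans a much narrower x-range: the "zoom".
print(range(d_full$x))
print(range(d_trim$x))
```

`plot(d_full)` and `plot(d_trim)` draw the two curves, so the change in the x-axis can be inspected visually.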

 

Final thoughts

In this particular case, we could choose to leave out the highest 0.5% or 1% of the data. However, it is not always advisable to delete all outliers: sometimes they represent valuable information, such as fraud, a machine failure, or any other event that deserves further inspection.
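When the extreme cases may carry information, one option is to flag them instead of deleting them, so they can be inspected later. The following is a minimal sketch under assumptions: the `customers` data frame, its columns, and the 99th-percentile threshold are all hypothetical and simulated for illustration.

```r
# Simulated, hypothetical customer data; not the post's data set.
set.seed(42)
customers <- data.frame(
  id = seq_len(10000),
  income = rlnorm(10000, meanlog = 10, sdlog = 1)
)

# Cut-off at the 99th percentile (an assumed threshold for this sketch).
cutoff <- quantile(customers$income, probs = 0.99, names = FALSE)

# Mark the extreme cases instead of removing them.
customers$is_outlier <- customers$income > cutoff

# Extract the flagged cases for further inspection, highest income first.
flagged <- customers[customers$is_outlier, ]
flagged <- flagged[order(-flagged$income), ]
print(head(flagged))
```

A model can then be fitted on `customers[!customers$is_outlier, ]` while the flagged rows remain available for review.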

 

Contact

Made by Pablo C. from Data Science Heroes.

R code and data available on GitHub.

Outlier issues and data treatment are topics of the e-learning course Data Science with R (request a free demo at info@datascienceheroes.com).

 
