Data Science - Dynamic analysis on outliers with R

Introduction

Outliers are the extreme values of a variable. Depending on the model or the requirement, it may be necessary to treat them, either by transforming them or by deleting them.

Variable “Income” distribution

01_income_distribution

This is the main variable in this example; it represents each customer's income in $. A few cases have very high values, while most cases have low or mid values.

If we choose to delete them…

A common question is: “How many cases do we have to leave out?” One option is to leave out the highest 1%, which gives:

02_income_p99.JPG

The distribution now looks very similar to the previous one, except that it reaches $300,000 instead of $500,000.
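
As a rough illustration of the 1% cut described above, here is a minimal base R sketch. It assumes a simulated, right-skewed `income` vector (drawn with `rlnorm`) as a stand-in for the article's data, so the values will not match the figures; `cutoff` and `income_p99` are names chosen only for this example.

```r
# Simulated stand-in for the customer income variable (assumption: the
# article's real data is not reproduced here).
set.seed(42)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

# Cut-off value: the 99th percentile of income.
cutoff <- quantile(income, probs = 0.99, names = FALSE)

# Keep only the cases at or below the cut-off, dropping roughly the top 1%.
income_p99 <- income[income <= cutoff]

length(income) - length(income_p99)  # number of cases removed
max(income)                          # maximum before trimming
max(income_p99)                      # maximum after trimming
```

Comparing the two maxima shows how far the right tail shrinks after a single cut.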

If we repeat this process iteratively (delete the highest 1%, then delete the highest 1% of that result, and so on, 10 times), we are in effect testing different cut-off values for leaving out extreme values. The result is curious: the silhouette always remains similar to:

03_density_simple.JPG
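
The iterative process can be sketched in a few lines of base R. This is only an approximation under assumptions: `income` is simulated with `rlnorm` rather than taken from the article's data, and `trim_top` is a hypothetical helper written for this example, not a function from any package.

```r
# Simulated right-skewed income (assumption, not the article's data).
set.seed(42)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

# Hypothetical helper: drop the cases above the given upper quantile.
trim_top <- function(x, p = 0.99) {
  x[x <= quantile(x, probs = p, names = FALSE)]
}

current <- income
steps <- data.frame(iteration = 0, n = length(current), max_value = max(current))

# Delete the highest 1% ten times, each time starting from the previous result.
for (i in 1:10) {
  current <- trim_top(current)
  steps <- rbind(
    steps,
    data.frame(iteration = i, n = length(current), max_value = max(current))
  )
}

# One row per iteration: remaining cases and the current maximum.
print(steps)
```

Calling `hist(current)` inside the loop would draw one histogram per iteration, which is one way to check whether the silhouette stays similar.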

 

Animating the example

The following animation shows this iterative deletion process in action. As we leave out the highest 1% at each step, the silhouette keeps a similar shape:

04_fractal_outliers.gif

In other words, there are always many people with low or mid income and only a few cases with high income, because of the nature of the distribution. Note that the axis values change with each iteration.

If we replace the histogram with a density plot, the result looks more like zooming in on the left side of the data:

05_fractal_outliers_density.gif

When we delete the lowest or highest values of a variable, what we are really doing is “zooming in” on the area where most cases are.
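
One way to see this “zoom” effect numerically is to compare the x range covered by a density estimate before and after trimming. The sketch below again assumes simulated income from `rlnorm`; `d_full` and `d_trim` are illustrative names, and the plots are only drawn in an interactive session.

```r
# Simulated right-skewed income (assumption, not the article's data).
set.seed(42)
income <- rlnorm(10000, meanlog = 10, sdlog = 1)

# Same variable with the highest 1% removed.
income_p99 <- income[income <= quantile(income, probs = 0.99, names = FALSE)]

# Kernel density estimates for the full and the trimmed data.
d_full <- density(income)
d_trim <- density(income_p99)

# The trimmed estimate covers a narrower x range: the "zoom".
range(d_full$x)
range(d_trim$x)

# Side-by-side plots, drawn only when running interactively.
if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(d_full, main = "All cases")
  plot(d_trim, main = "Highest 1% removed")
  par(op)
}
```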

 

Final thoughts

In this particular case, we could choose to leave out the highest 0.5% or 1% of the data. However, deleting all outliers is not always recommended: sometimes they carry valuable information, such as fraud, a machine failure, or another event that deserves further inspection.
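
If the extreme cases might be worth inspecting, one option is to flag them instead of deleting them outright. The following is a minimal sketch under assumptions: the `customers` data frame, its `id` and `income` columns, and the `is_outlier` flag are all hypothetical and built from simulated data.

```r
# Simulated customer table (assumption: hypothetical columns and values).
set.seed(42)
customers <- data.frame(
  id = 1:10000,
  income = rlnorm(10000, meanlog = 10, sdlog = 1)
)

# Flag the cases above the 99th percentile instead of dropping them.
cutoff <- quantile(customers$income, probs = 0.99, names = FALSE)
customers$is_outlier <- customers$income > cutoff

# Rows set aside for manual review (possible fraud, failures, data errors).
to_inspect <- customers[customers$is_outlier, ]

# Subset without the flagged rows; the originals remain in `customers`.
trimmed <- customers[!customers$is_outlier, ]

table(customers$is_outlier)
head(to_inspect[order(-to_inspect$income), ])
```

Keeping the flag in the original table means the decision to exclude a case can be revisited later without reloading the data.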

 

Contact

Made by Pablo C. from Data Science Heroes.

R code and data available on github

Outlier issues and data treatment are topics of the e-learning course Data Science with R (request a free demo at info@datascienceheroes.com)

 
