Introduction
Outliers. They can mess up a data analyst’s results really quickly. The National Institute of Standards and Technology defines one this way: “An outlier is an observation that lies an abnormal distance from other values in a random sample from a population”. Basically, an outlier is a point in a dataset that is nowhere near the other points. So, how do you deal with them? What causes them? Which statistics do they affect the most? In an article titled “Understanding Outliers: What Are They and Why Do They Matter?”, Kowsik R explains all of this and much more about outliers (the article can be found here).
Summary of Article
The article starts by defining what an outlier is, and then gets into why outliers matter so much and why we are even discussing them. In short, it’s because they can have a massive impact on the results of a data analysis, and in some cases can distort that analysis to the point where it hurts the stakeholders who rely on it. Then the article gets into why outliers occur. It lists a variety of causes, among them measurement error, sampling bias, and data transformation, and it explains each one in detail so that the reader understands them well. Finally, the article addresses the fundamental question: “What do we do about this?” The answer comes down to three primary things:
- “Trimming/removing the outlier”
- “Quantile based flooring and capping”
- “Mean/Median imputation”
Along the way, the article includes code snippets and statistical graphs that illustrate these concepts.
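As a rough illustration of the three treatments listed above, here is a minimal Python sketch. It is not the code from Kowsik R’s article. The function names, the sample data, the 1.5 × IQR rule used to flag outliers, and the 10th/90th percentile limits are all assumptions made for this example, and it uses only Python’s standard `statistics` module.

```python
import statistics


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences from the common k * IQR rule (an assumed detection rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def trim_outliers(values):
    """Trimming/removing: drop every point that falls outside the fences."""
    low, high = iqr_fences(values)
    return [v for v in values if low <= v <= high]


def cap_outliers(values, lower_pct=10, upper_pct=90):
    """Quantile-based flooring and capping: clamp values to the chosen percentiles."""
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    floor, ceiling = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, floor), ceiling) for v in values]


def impute_median(values):
    """Median imputation: replace each flagged point with the median of the data."""
    low, high = iqr_fences(values)
    median = statistics.median(values)
    return [v if low <= v <= high else median for v in values]


# Hypothetical sample: seven ordinary readings and one obvious outlier (95).
data = [10, 12, 11, 13, 12, 11, 10, 95]

print("trimmed:", trim_outliers(data))
print("capped: ", cap_outliers(data))
print("imputed:", impute_median(data))
```

Trimming shrinks the dataset, while capping and imputation keep every row and only change the extreme values, which matters when the number of observations has to stay the same.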
My Take
Overall, I really liked this article. I enjoyed how it showed me all of the aspects of outliers and not just one of them: it covered what they are, what causes them, and what to do about them. This is unlike some articles, where these three topics would be split into three separate pieces, which I am not a fan of. I also enjoyed the article’s choice of visuals, which helped my comprehension of the subject. After reading this article I felt that outliers are like a thorn on a rose. They cause pain because they can severely mess up your analysis if they are not dealt with properly. To me, the most direct way to deal with them is to simply remove them, although the article also presents capping and imputation as options.
Conclusion
All in all, I liked this article. It filled in the gaps in my understanding of outliers, and it did so in a way that was easy to follow. It left me thinking of outliers as the thorns on a rose. I recommend you read the article (it can be found here).