Introduction
If you try hard enough, you can make Celine Dion and Napoleon Bonaparte look like the same person. I know this sounds crazy, but the article literally does this: if you were given only the data (the yes-or-no answers in the article), you would think they are the same person. This is the problem the article addresses: some critical data science mistakes (link to article found here).
Summary of Article
The article opens with that uncanny comparison and then lists the common data science mistakes the visualization illustrates. The first is the impact of cherry-picking. As the article puts it, “I can make data sing however I want”. In other words, if you want to tell a specific story or advance a specific agenda, you can make the data support it by selecting only the parts that agree with you. When teams cherry-pick unconsciously, it not only costs time but also undermines the accuracy of the overall presentation. For example, people might spend a long time “tweaking every single parameter only to never critically evaluate their data sources and collection methods”. To avoid this, it helps to remember that “often, ML projects fail not because of weak models, but because of a fundamental flaw in the underlying thought process/assumptions that no one caught”.
The second lesson from the visualization is the importance of domain knowledge. Too many people start drawing insights from their data before they know enough about the domain. Domain knowledge is of paramount importance because when the data tells you something, you need to be able to check that the insight is true and not just an artifact of the specific data you happened to get. Statistical tests are useful for this, and domain knowledge can help as well.
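To make the cherry-picking point concrete, here is a minimal sketch of how it can work on yes/no data. The two people, the questions, and the answers below are all hypothetical and invented for illustration; they are not the data from the article.

```python
# Hypothetical yes/no answers for two clearly different people.
# These questions and values are invented, not taken from the article.
person_a = {
    "is_human": True,
    "has_a_name": True,
    "speaks_french": True,
    "is_a_singer": True,
    "led_an_army": False,
    "born_after_1900": True,
}
person_b = {
    "is_human": True,
    "has_a_name": True,
    "speaks_french": True,
    "is_a_singer": False,
    "led_an_army": True,
    "born_after_1900": False,
}


def agreement(a, b, questions):
    """Fraction of the given questions on which a and b give the same answer."""
    matches = sum(a[q] == b[q] for q in questions)
    return matches / len(questions)


all_questions = list(person_a)

# Cherry-picking: keep only the questions where the two people agree.
cherry_picked = [q for q in all_questions if person_a[q] == person_b[q]]

full_score = agreement(person_a, person_b, all_questions)
picked_score = agreement(person_a, person_b, cherry_picked)

print(f"All questions:      {full_score:.0%} agreement")
print(f"Cherry-picked only: {picked_score:.0%} agreement")
```

On the full set of questions the two people agree only half the time, but on the cherry-picked subset they look identical. Nothing in the subset is false; the distortion comes entirely from what was left out.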
My Take
Overall, I really liked this article. It was interesting and concise, which I appreciated. I also liked the way the information was presented, which made for a fun read. Instead of just stating the problems plainly, the author used a fun meme to show them in action. One question I still have is how much of an impact these problems could have on an organization. Do cherry-picking and a lack of domain knowledge cause problems often, or only once in a blue moon? In theory, they could cause problems with every analysis. But sometimes the data is very robust and the story it tells already matches the story the organization wants to tell, so cherry-picking is unnecessary. The data could also be naturally easy to understand, so someone without much domain knowledge could analyze it without causing too many problems.
Conclusion
All in all, this was a good article. It was concise, informative, and fun to read. While I still have some questions, it definitely cleared things up, and I highly recommend you read it (link to article found here).