Introduction
There are quite a few data buzzwords in the industry these days, everything from big data to predictive analytics. But one phrase that I don’t hear as often is data provenance. I feel this is more important than a lot of the buzzwords out there because it shows you where your data came from. And if you have bad data, then everything you do with it, including any insights you pull from it, will be inherently flawed. So we should investigate this phrase further. That’s exactly what Cassie Kozyrkov does in her article on Towards Data Science, on the platform Medium (the article can be found here).
Summary of Article
The article starts by explaining what data provenance is: the answer to the question “Who collected the data and why?” It shows you where the data is from and for what purpose it was collected. This is extremely important because it can alert you to underlying bias in the data. The rest of the article covers other terms that fall within the scope of data provenance. The list below describes some of them (the format is [term]: [meaning of term]; [advantages or drawbacks of that kind of data]):
- Primary: data you collected yourself; you have control over the entire data.
- Secondary: data that you obtained from another source; cheaper to obtain.
- Captured: data created specifically for an analytical purpose.
- Exhaust: data that is passively collected; can be used to drive different insights than what has already been found out.
- Structured: data that is formatted well and ready to be analyzed; information could have been lost due to reformatting
- Unstructured: data that is not in the format that you desire; extra work is required to analyze; all original information is still there so you can format it to your desire
- Raw: data that hasn’t been altered after collection; contains all of the information that was collected and you can use it to drive the insights that you want.
- Processed: data that has already been processed to the point where it cannot be reverted to its original form (usually for analytical purposes); easier to analyze.
- Polite data: data that has been cleaned without losing any information; clean data with all of the information of raw data.
Overall, that is what the article covers: it clarifies buzzwords that are being used in the industry today.
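To make the terms above a bit more concrete, here is a minimal Python sketch of how they could be attached to a dataset as a small provenance record. This is my own illustration, not something from the article: the class names, the field names, the grouping of the terms into four pairs, and the `caveats` helper are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


# The grouping of terms into these four enums is an assumption for illustration.
class Source(Enum):
    PRIMARY = "primary"      # data you collected yourself
    SECONDARY = "secondary"  # data obtained from another source


class Collection(Enum):
    CAPTURED = "captured"  # created specifically for an analytical purpose
    EXHAUST = "exhaust"    # passively collected


class Shape(Enum):
    STRUCTURED = "structured"      # formatted and ready to be analyzed
    UNSTRUCTURED = "unstructured"  # not yet in the format you want


class Processing(Enum):
    RAW = "raw"              # not altered after collection
    PROCESSED = "processed"  # altered and cannot be reverted
    POLITE = "polite"        # cleaned, with no information lost


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record answering "who collected the data and why"."""
    collected_by: str
    purpose: str
    source: Source
    collection: Collection
    shape: Shape
    processing: Processing

    def caveats(self) -> List[str]:
        """Return the drawbacks from the list above that apply to this data."""
        notes = []
        if self.source is Source.SECONDARY:
            notes.append("collected by someone else: check who and why")
        if self.shape is Shape.STRUCTURED:
            notes.append("information may have been lost due to reformatting")
        if self.shape is Shape.UNSTRUCTURED:
            notes.append("extra work is required before analysis")
        if self.processing is Processing.PROCESSED:
            notes.append("cannot be reverted to the original form")
        return notes


# Made-up example: a processed table bought from a third party.
purchased = ProvenanceRecord(
    collected_by="a hypothetical vendor",
    purpose="billing",
    source=Source.SECONDARY,
    collection=Collection.EXHAUST,
    shape=Shape.STRUCTURED,
    processing=Processing.PROCESSED,
)

for note in purchased.caveats():
    print("-", note)
```

Writing the record down like this forces you to answer the provenance question before any analysis starts, which is the habit the article is arguing for.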
My Take
Overall, I really liked this article. I am really starting to like this author in general. She explains concepts well and organizes her articles and videos in a very easy-to-understand format. But back to the article at hand. This article in particular clarified some of the more niche terms that are used and gave each of them a specific definition. A lot of them overlap, so that was also good to know. The article also gave examples for the terms, which were really helpful for contextualizing the information and connecting theory to application. Another thing that aided my understanding was the way the article is broken up and organized, which made it easier to read through. The article also incorporates pictures throughout, which, even if they are just for fun, give the reader a small pause from the information, and it provides a summary at the end that really helped with putting all of the information together.
Conclusion
All in all, this article was a good one. It is extremely good at clarifying those buzzwords in the industry, and I am really glad it showed me what data provenance is and why it matters when dealing with data. I recommend you read this article when you get the chance (the article can be found here).