Introduction
Imagine this: you have a huge database that you have been working on for the past year and you have just sent it to production. Then someone calls you back and says “the database broke down”. How would you feel? Terrible. You worked on this for so long and it still messes up. So how do you make sure that this doesn’t happen to you? You engage in data testing (sometimes called database testing). Then you ask yourself “How do I do data testing?”. How do you do it easily and efficiently in this day and age and what are the different methods to do it depending on the amount of accuracy vs difficulty that I need and how to do it? That’s where an article written by Pedram Navid called “The State of Data Testing” comes in(the article can be found here). This article details the 4 main methods that people use today for data testing.
Data Testing Methods
Navid ran a study via Twitter to see how people in the industry data test and here was the outcome:
Method of Data Testing | What the Method is | Benefits | Tradeoffs |
Table Scan | Where you look at your database and run your eyes over it to see if anything looks off | 1. Easy 2. Not time-consuming 3. Efficient | 1. Usually misses things due to the nature of humans |
Quality Assurance | Where you give the data to a stakeholder and they look it over to see if anything looks off | 1. Checks with the actual stakeholders about the data and give them a chance to review it | 1. False positives tend to be really high because stakeholders tend to think that anything that doesn’t make their business look good is wrong |
Counting Things | Running a set of statistical values on the data to see if that matches your intuition | 1. Gives you a quantitative sense of the data | 1. Can be misleading because of the nature of statistical values (mean might hide null values and median could hide outliers) 2. It is also tough to have a good intuition of these values to compare accurately so that you can see if your data is correct or no |
Data Diff. | Is a comparison of data that tells us how the data changes and transforms between every iteration | 1. Is very accurate and the results have very few mistakes | 1. Requires a lot of complex SQL to set up and preform correctly 2. The outputs are often just as hard to interpret as the raw data itself |
The Four Facets
In the article Navid also addresses the four facets of data quality which are:
- Data Accuracy: How close the data matches the real world
- Data Completeness: The amount of data available in the database
- Data Consistency: The degree to which the data is agreeing with itself (column names should have the same naming conventions)
- Data Integrity: Refers to the connections between different data sets and data tables in the database
The Data testing methods all aim to satisfy these four facets. They all do this in their own unique way and are preferred at different times by different people. Some are more time beneficial while others are more accurate. It all just depends on your needs at the time.
My Take
So which one is best? Well, the answer to that question is a bit cliché… Each of the methods of data testing is best used in its own situation and to be effective you should know all of these that are available to you, how to use them, and their respective strengths and weaknesses. This is much like how a handyman has to know every tool in his toolbox and their strengths and weaknesses to get the job done as effectively as possible. Now you have an arsenal of tools that you can use it’s just a matter of picking which one is the best for you at that specific time. Don’t have a lot of time and just need a basic understanding? A Table scan might best suit your needs. Need to get a lot of statistics on the data? Counting things is the way to go! Need to ensure that you don’t compromise on accuracy no matter what? Take a look at the Data diff. This just goes to show how different scenarios can require different methods and tools.
Conclusion
All in all, this was a really interesting article, it reinforced my understanding of how in data science there are a lot of tools for the job it’s just a matter of picking the right one. It also gave me a glimpse into the industry of data science and what people are doing as far as data testing goes right now. Overall it was a really good article and I recommend you check it out (the article can be found here).