Data you don’t use is like a car you don’t drive, collecting rust, parked out on the front lawn. Maybe the last time you took it out for a spin, it turned over and got you from point A to point B, but it sounded a bit rough and there was a weird smell. It probably hasn’t gotten any better. Will it even start this time?
Unused data suffers from a similar degradation over time – a dataset requires constant care and attention to stay in usable condition, and really the only way to know if it’s in good working order is to use it occasionally. Incidentally, using the data is also a good way to discover what deficiencies exist, and the desire to use the data also provides the needed motivation to clean it up.
My favorite story of a promising dataset proving to be a total letdown came when I was working at a company that provided electronic medical records. We were working on a feature to check with insurance companies whether a patient was eligible for a particular procedure or medication – everyone hates finding out after the fact that something wasn’t covered by insurance, including the doctors who bear the brunt of the patients’ ire. In order to request this information from the insurance company, we needed to provide an identifier for the doctor making the request. This identifier is called a National Provider Identifier (NPI), and every doctor has one. We thought we were set, because our database had a field called NPI_Number – every doctor who had signed up since the founding of the product was prompted to give their NPI number when they joined the platform. Unfortunately, this number was never used, and because there was no need to do so, it was never verified. Worse, the column type was the default 255-character string.
We started digging into the data to see if it was any good. First we looked at how many doctors had typed anything at all. Fewer than half – not good. A valid NPI is always a 10-digit number, so we looked at how many values were numeric (a third of the records that had a value) and how many of those were 10 digits long (half of them). I was relatively junior at the time, and was shocked at how rarely that field was filled in correctly. Out of curiosity, I looked at the longest values in the column for a clue. Right at the top was the string “I will find the card and fill this in Monday AM”. That record had been created three years earlier.
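The checks we ran were simple enough to sketch in a few lines of Python. This is only an illustration, not the queries we actually used: the function name, the sample values, and the `expected_digits` parameter are all made up for the example. The default length of 10 reflects the length of a real NPI.

```python
def profile_npi_column(values, expected_digits=10):
    """Summarize how usable a free-text identifier column is.

    `values` is a list of raw column values (strings or None).
    `expected_digits` is an assumption for this sketch; real NPIs
    are 10 digits long.
    """
    # Anything at all typed in? Ignore None and whitespace-only values.
    filled = [v.strip() for v in values if v is not None and v.strip()]

    # Of those, which are purely ASCII digits?
    numeric = [v for v in filled if v.isascii() and v.isdigit()]

    # Of the numeric ones, which have the expected length?
    right_length = [v for v in numeric if len(v) == expected_digits]

    # The longest values are often the most revealing ones.
    longest = sorted(filled, key=len, reverse=True)[:3]

    return {
        "total": len(values),
        "filled": len(filled),
        "numeric": len(numeric),
        "right_length": len(right_length),
        "longest": longest,
    }


# Hypothetical sample rows, invented for illustration.
sample = [
    "1234567890",
    "",
    None,
    "12345",
    "I will find the card and fill this in Monday AM",
    " 1234567893 ",
    "n/a",
]

report = profile_npi_column(sample)
for key, value in report.items():
    print(f"{key}: {value}")
```

Note that this only checks the shape of the value. A well-formed number can still be the wrong number, which is why the data only became trustworthy once something real depended on it.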
We never used the data, and we gave no incentive for customers to fill it in correctly, so they didn’t. Even though the data ostensibly existed, it was worse than worthless. The good thing is that the data was easy to get – once we started requiring a valid NPI number to use this new and desired feature, it was worth the effort to get right, and very quickly the doctors on the platform updated the records.
I see this all the time in the data that I’ve used as a manager. My bug tracking system has a column for how we found each bug (was it a customer, a test, or an engineer?). It was accurate once upon a time, because someone was tracking the data and following up when it appeared to be incorrect. That person got bored with the project, or moved on, and now a third of the records are still set to the default, and the values that aren’t must be regarded with utmost suspicion.
When I say the data must be used, I mean that a real person must really care. A classic failure is for that data to drive an alert that is consistently ignored. Everywhere I’ve worked, I’ve found myself with an email rule to filter out needlessly noisy alerts that I lack the power or motivation to delete. I know I’m not alone. If nobody except for email rules looks at the alert, will anyone shed a tear if the data that drives it is corrupted or stale?
For the data that truly matters to the business – typically financial – we have accountants and auditors to ensure it is correct. If you want to trust your data in an engineering organization, you should know who cares when it’s wrong, and if that person is you, you need a plan for how you will keep track of it, how you will recognize that it has gone squirrelly, and how you will fix it when it inevitably does. In essence, you must become the accountant or the auditor for the data you care about. If you need a dataset that isn’t in good shape, start trying to use it and follow up with the sources of the data as you find issues. Over time the long tail of problems will shrink to a level where the data is acceptable.
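One small way to play auditor is to measure how many records still hold a field’s default value and to complain when that share creeps up. The sketch below is a minimal version of that idea. The field name `found_by`, the default value `"unknown"`, and the 10% threshold are all assumptions chosen for the example, not settings from any particular bug tracker.

```python
def default_rate(records, field, default_value):
    """Return the fraction of records whose `field` is missing or still default."""
    if not records:
        return 0.0
    untouched = sum(
        1 for record in records
        if record.get(field, default_value) == default_value
    )
    return untouched / len(records)


def check_field_health(records, field, default_value, max_default_rate=0.10):
    """Return (is_healthy, rate); the threshold is an arbitrary example value."""
    rate = default_rate(records, field, default_value)
    return rate <= max_default_rate, rate


# Hypothetical bug records, invented for illustration.
bugs = [
    {"id": 1, "found_by": "unknown"},
    {"id": 2, "found_by": "customer"},
    {"id": 3, "found_by": "unknown"},
]

healthy, rate = check_field_health(bugs, "found_by", "unknown")
if not healthy:
    print(f"found_by is still default on {rate:.0%} of bugs; time to follow up")
```

A check like this only helps if its output lands in front of someone who will act on it. Otherwise it becomes one more alert for an email rule to swallow.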

