Garbage in, garbage out

Every software developer worth his salt knows this saying. Basically it means that if the information coming into your system is garbage, you can't expect to get anything but garbage out.

In terms of my recent work (two projects in particular), this has never been more evident. In both cases, I'm being given data as the input for a new system. In both cases, the new system is far more restrictive than the old system in terms of what it allows in certain fields. And in both cases, the data I'm being given is difficult to work with.

This problem rears its ugly head often, whether it be upgrading from an old application to a new one, or trying to get meaningful statistics out of information that wasn't collected properly in the first place. In my recent work, I've been faced with both of these scenarios.

So why is this such a prevalent problem? It's very easy to place blame on the developers of the original code and argue that they didn't think about the consequences of free-text fields or that they didn't validate the data properly. To an extent, you'd probably be right. I'd guess that the developers of the old system didn't think it was worth validating a date because it was only going to be used for information, not for any statistics or filtering. Lazy? Probably, but every developer has done this. Maybe not with dates necessarily, but let me give you a scenario:

You have a client who wants an appointments system. He wants to track the phone calls received at the office, and he cares about when each call came in, who it was from, why they called, and whether it led to an appointment. Fairly straightforward, no? So you ask him about the "why they called" bit and he tells you that it could be any reason. You push, saying that the information won't be useful unless it's standardised, and you're told that it doesn't matter - people should be able to write anything they want. The problem should be immediately apparent by now.

But let's go further - let's say that you insist so much that he lets you put an additional field in the form of an enumerable list of reasons someone could call. You're happier because at least you have something concrete for this field that you can return to later, and the users of the system can still type whatever they like in the comments field. You even train the staff to make sure they select one of these reasons each time. Great! Everyone wins... until the client wants to do some analysis on the reasons people have been calling. When you look at the data, you notice that nobody has been using this field. When asked, a staff member might say that it's quicker just to put the reason in the comments section, or that there was more than one reason, or even that they didn't know what the options really meant.

I'd argue that at this point, the data that's in this extra field is less useful than the free-text data. Sure, it's quantifiable and is great for statistics, but those statistics are wrong. Dead wrong. It's even worse if management has been relying on them.

My point is that it's all very well to make sure each piece of information you're collecting is discrete or enumerable, but unless the users are using it properly, it won't work and can have damaging effects. Garbage in - garbage out.
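One way to catch this before anyone relies on the numbers is to measure how often the enumerated field is actually being filled in. Here's a minimal sketch in Python, assuming each call is stored with a `reason` picked from the list (or nothing at all) and a free-text `comments` field. The `CallRecord` shape, the field names and the sample data are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CallRecord:
    # Hypothetical shape of one logged phone call.
    reason: Optional[str]  # value picked from the fixed list, or None if skipped
    comments: str          # free-text notes typed by staff


def reason_field_usage(records: list[CallRecord]) -> dict[str, float]:
    """Return the share of records (0.0 to 1.0) in each usage category."""
    total = len(records)
    if total == 0:
        return {"reason_set": 0.0, "comments_only": 0.0, "neither": 0.0}
    # Records where staff actually picked a reason from the list.
    reason_set = sum(1 for r in records if r.reason)
    # Records where the reason was skipped but something was typed in comments.
    comments_only = sum(
        1 for r in records if not r.reason and r.comments.strip()
    )
    neither = total - reason_set - comments_only
    return {
        "reason_set": reason_set / total,
        "comments_only": comments_only / total,
        "neither": neither / total,
    }


# Made-up sample data.
sample = [
    CallRecord("Booking", ""),
    CallRecord(None, "wanted to book and also asked about prices"),
    CallRecord(None, "billing question"),
    CallRecord(None, ""),
]
usage = reason_field_usage(sample)
print(usage)
```

If the `comments_only` share is high, any report built on the `reason` field is describing a small and probably unrepresentative slice of the calls.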

So where does this leave the future developers who have to import this data into their system later? Well, the answer's fairly obvious and there's no paddle. They (and I include myself in this) will be on their soapbox complaining about how the last guy didn't do it properly, but really, it may not have been his fault.

That's great, you say, but it doesn't help me. No, it doesn't, but it's something to be aware of. Before (yes, before) you go headfirst into building a brand new system that will leave the other system for dead, look very carefully at the data that's contained in that old system. Think about how it will fit into the new one - chances are it won't without a lot of shoving, and you need to be prepared to shove. If it's a new system, make sure that the stakeholders understand its limitations. Stress that even though you're building it with enumerable options and discrete values (because that's still a good idea), these will be worthless if they're not used correctly. Make sure they understand this - if they don't, you'll be the first person they'll yell at if their statistics are wrong.
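Looking carefully at the old data can be as simple as running each free-text value through the rule the new system will enforce and counting what falls out. The sketch below does that for a date field. It assumes the new system only accepts `YYYY-MM-DD`; the format, the function names and the sample values are placeholders rather than anything from a real project.

```python
from collections import Counter
from datetime import datetime

# Assumed format the new, stricter system accepts.
STRICT_FORMAT = "%Y-%m-%d"


def classify_date(value: str) -> str:
    """Label a legacy value as 'empty', 'valid' or 'invalid'."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        datetime.strptime(text, STRICT_FORMAT)
    except ValueError:
        # Wrong layout, free text, or an impossible date such as 30 February.
        return "invalid"
    return "valid"


def profile_dates(values: list[str]) -> Counter:
    """Count how many legacy values fall into each category."""
    return Counter(classify_date(v) for v in values)


# Made-up values of the kind a free-text date field tends to collect.
legacy_dates = [
    "2012-03-14",
    "14/03/2012",
    "",
    "sometime in March",
    "2012-02-30",
]
report = profile_dates(legacy_dates)
print(report)
```

The counts give you a rough idea of how much shoving is ahead before you commit to the new schema, and something concrete to show the stakeholders.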

-Damo

Damian Brady

I'm an Australian developer, speaker, and author specialising in DevOps, MLOps, developer process, and software architecture. I love Azure DevOps, GitHub Actions, and reducing process waste.
