Guiding Principles for Data Capturing

On several occasions recently I've had to work with data in legacy applications.  It's never fun, and it got me thinking about the origins of the problem.  It's fairly obvious that difficulties encountered when working with legacy data are a result of decisions made when the system was originally written.

New systems are not immune, however.  The same difficulties are bound to crop up for someone trying to work with your data at a later date unless you think about it now.  There's obviously no definite way of knowing how your data will be used in the future, and the tendency is often to focus only on current requirements.  With a little care though, you can make it easier for a future person to work with.

Thus, I present my guiding principles for data capturing.

By way of disclaimer, I should mention that these principles are based on business applications where users, be they staff or clients, enter information manually.  That's not to say that they can't apply in other scenarios, but keep in mind that I was looking at it from a business application angle.

So here they are:

  1. Am I capturing everything I need?
  2. Am I capturing information in a way that allows it to actually be used later?
  3. Am I restricting myself - can I capture more (or different) information later if it's needed?

1. Am I capturing everything I need?

If you're not sure, capture it anyway.  For example, if you're not sure whether the sex of a client is required, put it in anyway.  Obviously there are legal restrictions on collecting some information, but remember that these restrictions can change.  Allow for that.

More to the point, capture information with as much detail as you can.  Don't hide behind a usability excuse - if you think hard enough, you can get around that.  Just because you store information in a particular way doesn't mean the user has to enter it that way or even enter it at all - some information can be assumed.  Of course there's always some trade-off with usability, but it's worth thinking about how you can get around the usability problem rather than leaving a potentially important piece of information out.

For example, timestamps on file comments probably don't need to be accurate right down to the tick, but what's the harm if they are?  Let's face it, storage space is cheap.  In terms of hiding behind usability, your user shouldn't have to enter a time to that detail (or probably at all), but it doesn't mean an automatic process you write at a later date isn't able to.

2. Am I capturing information in a way that allows it to actually be used later?

This seems obvious but is often overlooked.  Don't store dates (or any other numeric information) as strings, don't store information in binary fields, and use a database not a file to store information.  If your system takes a piece of information from somewhere, be it a person or another system, it should store that information in a way that can be used easily later.  There's no point retaining information if it can't be used.

Let's say the business wants you to store a whole lot of information about a client's car.  The designer might give the user fields to fill in the licence plate, the colour, make, model, VIN, age, etc.  The easiest way to store most of these pieces of information is in string fields.  Sure, this will allow users to look at the data, probably interpret it, and use it to glean information, but what if you need to analyse it in bulk later?  You'll have a lot of work to do to group all the "White", "white", "wite", and even "wh9ite" cars.  It's hard to work out that cars that are "12mths", "12 months", "1Y", and "one yr" old are all the same age.  Don't do it to yourself.

Typically, the designer might ask of the business, "Are you ever going to search on this data?" That's a mistake.  Don't even ask.  They will say no.  However, some years down the track, they might actually want to know whether there's any correlation between having a new car, and being a good client.  It may seem far-fetched now, but by allowing for it, you don't take it off the tableThe business should never hear, "I can't give you that information" when they know for a fact their staff have been filling it out every time.

It all comes down to working out what the information actually is and what it means, then storing it the same way.  If it's a length of time, work out how to store it as a timespan.  If it's a colour, try working out how to actually store it as a colour, not someone's misspelt description of a colour. Do the research and find out how a VIN or a licence plate is structured so you can validate it.  If it's really not possible to validate, at least process it as much as you can.  Remove spaces or capitalize all letters if it doesn't matter to make it as easy as possible to work with at a later date.

3.  Am I restricting myself - can I capture more (or different) information later if it's needed?

Don't save space in fields when it could come back to bite you. Don't use a 4-bit integer to save space because you know you'll never have more than 16 different payment methods or vehicle makes.  Even though you're only operating in Australia, use string encoding that allows Chinese characters.  What are you saving by skimping, really?  Again, storage is cheap.  Don't restrict yourself from the start.

If you have to use your judgement to determine the size of a field, then overestimate significantly.  Like, really a lot.  Can't imagine someone having a surname longer than 30 characters?  Me neither, but I'll allow 100 or more.  I'm not having some client with a four-way-hyphenated-surname break my application.

Summary

Don't take the easy way out.  I know if you're stressed for time and under the pump, adding a new Bank Name field as an non-validated string is going to be a lot easier than trying to match it to a list of known banks (with the option to add a new one).  Fight the urge to make it easy now and potentially very hard later.

Remember that your design decision is usually a one-way street.  It's not often that you can do it the easy way now and fix it up later.  Even if you can, will you?  Really?  You're too busy to deal with it now, what makes you think you'll have plenty of spare time to do it later?  It's worth the extra time to do it properly.  You'll thank yourself later.

Damian Brady

I'm an Australian developer, speaker, and author specialising in DevOps, MLOps, developer process, and software architecture. I love Azure DevOps, GitHub Actions, and reducing process waste.

--