Collecting and Managing Data, Part 2

Spreadsheets are a wonderful way to collect data. There are many advantages and not a few disadvantages. For starting out the disadvantages are usually minimal.

Advantages:

Almost everybody has a spreadsheet, and anyone can download a free one from Open Office
Rows and columns are an intuitive way to view and think about data and closely mimic rows and columns in a database table
You can see quite a bit of data at a time
It’s easy to copy/paste if there is a lot of similar data

Disadvantages:

Rows and columns only allow two-dimensional data, and most data is more complex
Only one user can update it at a time, no concurrent access
Even if you can get concurrent access, the spreadsheet model doesn’t lend itself to working concurrently
Most spreadsheets have a row limit which effectively limits the size of your data

For many or most pieces of data, though, a spreadsheet is the best way to organize your data. As your business grows, those disadvantages will take over and you will grow beyond the spreadsheet. Even if you don’t, this information is useful. But if you do, that data that you’ve collected will have to be moved somewhere else, and the act of moving it is going to be difficult unless you’ve created your dataset properly from the beginning. Properly formatted data can be imported into a true database by a competent programmer. Improperly formatted data will have to be retyped by hand, introducing errors and taking far more time.

I’m going to start with two rules that you need to keep in mind regardless of where you store your data, but it’s even more important with a spreadsheet.

Rule #1: Always use a serial number for your records

Rule #2: Always format data consistently

If you get nothing else out of this, those two rules will help you immensely.

So let’s talk about these rules. Rule #1: You always need to use a serial number for your main records. This means that if you have a list of customers, each customer should have a number. It should be printed on any documentation created for that customer (e.g. an invoice) so that the customer may be referenced easily. Of course, you also need to keep track of the address, phone number, etc., and perhaps those pieces of data could be used to find an customer. But the customer number is the unique key, or “primary key” as we like to call it. If you have an customer number you can find that unique customer among all others.

Most serial numbers don’t start at “1”. Why? Mostly for psychological reasons – businesses look small if you’re customer #10. But there’s another good reason and that is consistency. Let’s say your business will likely never have more than 87,000 customers. You can then start your customer number at 12100. As long as you don’t get more than 87,900 customers your customer numbers will always be exactly 5 digits.

That gives you an advantage in many ways. Formatting your documents is a little easier as you know the size of the customer number. If someone calls and offers a customer number you have a simple validation since it must be 5 digits. And the starting number makes it difficult for someone to determine the size of your customer base. Note that you can start with anything that’s 5 digits as long as you won’t exhaust your set of possible 5 digits numbers.

Of course, if your business is bigger simply scale that up. Start with 121200 if you need a 6-digit customer number. Or start with 115 if you anticipate having only a few hundred customers. Whatever your business size you can easily scale this up or down to fit your needs.

This scheme obviously works with other data besides customer lists. Order numbers, invoice numbers – anything that will get a serial number – can be assigned in this way.

Rule #2: Always format your data consistently. This is also simple when you understand the basis of it. Let’s consider our customer data again. Let’s say we have a simple database of customer number, name, and phone number. We need to determine up front how we’re going to format each piece of data, and then do it consistently on each record.

Imagine this scenario:

Customer #	Name	Phone
121	Jim Beam	(812)555-1212
122	Jack Daniel	615-555-1212
123	Dickel, George	+1 323 555-1212

This is normal data, unfortunately. There are two problems here:

1. There are three different formats for phone numbers. Not a huge deal because it’s rare that we need to get the component pieces of a phone number (i.e. the area code) and people/phones can generally deal with about anything nowadays. Still, consistency is good because we can better anticipate how it will print on a page. If your customer base is in US/Canada, I highly recommend simply using the middle format.

2. Look at those names. Specifically, “George Dickel” is stored as “Dickel, George”. Names are notoriously difficult to parse because of all of the possible suffixes that might be added to them after a comma. I’m going to talk more about names specifically in a later part.

The point here is to pick some format and be consistent. We can argue about what format is best – and will at some point – but you’re better off with a consistent inferior format than a dozen inconsistent better formats.

I have a piece of mail that very clearly demonstrates the problems with manipulating data that’s not well-formatted. Years and years ago I worked at a university and my title way “Database Programmer”. I had filled out a card or something somewhere that had my name as “Darrin Chaney”. And so I got mail to “Darrin Chaney” and the second line would be my title “Database Programmer”.

At some point this data was sold and massaged and manipulated and screwed up. I received a letter addressed to “Darrin C Dbase Prog”. Inside the envelope there was a personalized letter that started out “Dear Mr. Prog”.

As a small business owner – or large business owner for that matter – you do not want to be the schmuck who sends a piece of mail to dear old Mr. Prog. That’s amateur hour stuff. The piece of mail came from a Fortune 500 company with whom I had done business under my actual name – all the worse. Keeping your data format consistent will go a long way toward making sure that your customers are addressed as you would want to be addressed. You don’t want your mail to be posted to my wall of shame.

Beyond the Two Rules

When storing data in a spreadsheet, you have to consider that at some point you’ll most likely have a computer program manipulating that data in various ways. Spreadsheets are generally not something that programs directly read (Microsoft’s popular Excel format has been mostly reverse-engineered but Microsoft has no interest in truly making the format public). Instead, we usually export the spreadsheet to a text-based format known as “CSV”, or “comma-separated values”. The above table ends up looking like this:

Customer #,Name,Phone Number
121,Jim Beam,(812)555-1212
122,Jack Daniel,615-555-1212
123,”Dickel, George”,+1 323 555-1212

You shouldn’t worry about the mechanics of CSV, i.e. the double quotes in the last line. What you *should* worry about is the fact that any “extra” formatting in your spreadsheet is lost when you export to CSV. That means that if you have bold, underlined, or italicized print that formatting is lost. If you have formatted numbers, dates, times, etc. in a certain way that formatting will likely be lost. And if you’ve changed colors of cells, rows, or columns, or the text colors that will be lost. The only formatting that CSV will maintain is your spaces and any multi-line text that you might have in the sheet.

So, this brings us to a minor rule which we’ll call Rule #3:

Rule #3: All data must be in text in the spreadsheet.

In other words, consider that you want to know whether your customers are “local” or “non-local”. You could make the background of non-local customers yellow, for instance, so that they’re easily viewed. But when you export to CSV that information is lost. The only way to handle it properly is to add another column that keeps track of that data.

In our next installment we’ll consider how to format specific data types.

Michael Chaney Consulting Corporation

Creators of sophisticated web applications

Beyond the Two Rules

Leave a Reply Cancel reply