Monthly Archives: October 2013

Collecting and Managing Data, Part 2

Spreadsheets are a wonderful way to collect data.  There are many advantages and not a few disadvantages.  For starting out the disadvantages are usually minimal.


  1. Almost everybody has a spreadsheet, and anyone can download a free one from Open Office
  2. Rows and columns are an intuitive way to view and think about data and closely mimic rows and columns in a database table
  3. You can see quite a bit of data at a time
  4. It’s easy to copy/paste if there is a lot of similar data


  1. Rows and columns only allow two-dimensional data, and most data is more complex
  2. Only one user can update it at a time, no concurrent access
  3. Even if you can get concurrent access, the spreadsheet model doesn’t lend itself to working concurrently
  4. Most spreadsheets have a row limit which effectively limits the size of your data

For many or most pieces of data, though, a spreadsheet is the best way to organize your data.  As your business grows, those disadvantages will take over and you will grow beyond the spreadsheet.  Even if you don’t, this information is useful.  But if you do, that data that you’ve collected will have to be moved somewhere else, and the act of moving it is going to be difficult unless you’ve created your dataset properly from the beginning.  Properly formatted data can be imported into a true database by a competent programmer.  Improperly formatted data will have to be retyped by hand, introducing errors and taking far more time.

I’m going to start with two rules that you need to keep in mind regardless of where you store your data, but it’s even more important with a spreadsheet.

Rule #1: Always use a serial number for your records

Rule #2: Always format data consistently

If you get nothing else out of this, those two rules will help you immensely.

So let’s talk about these rules.  Rule #1: You always need to use a serial number for your main records.  This means that if you have a list of customers, each customer should have a number.  It should be printed on any documentation created for that customer (e.g. an invoice) so that the customer may be referenced easily.  Of course, you also need to keep track of the address, phone number, etc., and perhaps those pieces of data could be used to find an customer.  But the customer number is the unique key, or “primary key” as we like to call it.  If you have an customer number you can find that unique customer among all others.

Most serial numbers don’t start at “1”.  Why?  Mostly for psychological reasons – businesses look small if you’re customer #10.  But there’s another good reason and that is consistency.  Let’s say your business will likely never have more than 87,000 customers.  You can then start your customer number at 12100.  As long as you don’t get more than 87,900 customers your customer numbers will always be exactly 5 digits.

That gives you an advantage in many ways.  Formatting your documents is a little easier as you know the size of the customer number.  If someone calls and offers a customer number you have a simple validation since it must be 5 digits.  And the starting number makes it difficult for someone to determine the size of your customer base.  Note that you can start with anything that’s 5 digits as long as you won’t exhaust your set of possible 5 digits numbers.

Of course, if your business is bigger simply scale that up.  Start with 121200 if you need a 6-digit customer number.  Or start with 115 if you anticipate having only a few hundred customers.  Whatever your business size you can easily scale this up or down to fit your needs.

This scheme obviously works with other data besides customer lists.  Order numbers, invoice numbers – anything that will get a serial number – can be assigned in this way.

Rule #2: Always format your data consistently.  This is also simple when you understand the basis of it.  Let’s consider our customer data again.  Let’s say we have a simple database of customer number, name, and phone number.  We need to determine up front how we’re going to format each piece of data, and then do it consistently on each record.

Imagine this scenario:

Customer # Name Phone
121 Jim Beam (812)555-1212
122 Jack Daniel 615-555-1212
123 Dickel, George +1 323 555-1212

This is normal data, unfortunately.  There are two problems here:

1. There are three different formats for phone numbers.  Not a huge deal because it’s rare that we need to get the component pieces of a phone number (i.e. the area code) and people/phones can generally deal with about anything nowadays.  Still, consistency is good because we can better anticipate how it will print on a page.  If your customer base is in US/Canada, I highly recommend simply using the middle format.

2. Look at those names.  Specifically, “George Dickel” is stored as “Dickel, George”.  Names are notoriously difficult to parse because of all of the possible suffixes that might be added to them after a comma.  I’m going to talk more about names specifically in a later part.

The point here is to pick some format and be consistent.  We can argue about what format is best – and will at some point – but you’re better off with a consistent inferior format than a dozen inconsistent better formats.

I have a piece of mail that very clearly demonstrates the problems with manipulating data that’s not well-formatted.  Years and years ago I worked at a university and my title way “Database Programmer”.  I had filled out a card or something somewhere that had my name as “Darrin Chaney”.  And so I got mail to “Darrin Chaney” and the second line would be my title “Database Programmer”.

At some point this data was sold and massaged and manipulated and screwed up.  I received a letter addressed to “Darrin C Dbase Prog”.  Inside the envelope there was a personalized letter that started out “Dear Mr. Prog”.

As a small business owner – or large business owner for that matter – you do not want to be the schmuck who sends a piece of mail to dear old Mr. Prog.  That’s amateur hour stuff.  The piece of mail came from a Fortune 500 company with whom I had done business under my actual name – all the worse.  Keeping your data format consistent will go a long way toward making sure that your customers are addressed as you would want to be addressed.  You don’t want your mail to be posted to my wall of shame.

 Beyond the Two Rules

When storing data in a spreadsheet, you have to consider that at some point you’ll most likely have a computer program manipulating that data in various ways.  Spreadsheets are generally not something that programs directly read (Microsoft’s popular Excel format has been mostly reverse-engineered but Microsoft has no interest in truly making the format public).  Instead, we usually export the spreadsheet to a text-based format known as “CSV”, or “comma-separated values”.  The above table ends up looking like this:

Customer #,Name,Phone Number
121,Jim Beam,(812)555-1212
122,Jack Daniel,615-555-1212
123,”Dickel, George”,+1 323 555-1212

You shouldn’t worry about the mechanics of CSV, i.e. the double quotes in the last line.  What you *should* worry about is the fact that any “extra” formatting in your spreadsheet is lost when you export to CSV.  That means that if you have bold, underlined, or italicized print that formatting is lost.  If you have formatted numbers, dates, times, etc. in a certain way that formatting will likely be lost.  And if you’ve changed colors of cells, rows, or columns, or the text colors that will be lost.  The only formatting that CSV will maintain is your spaces and any multi-line text that you might have in the sheet.

So, this brings us to a minor rule which we’ll call Rule #3:

Rule #3: All data must be in text in the spreadsheet.

In other words, consider that you want to know whether your customers are “local” or “non-local”.  You could make the background of non-local customers yellow, for instance, so that they’re easily viewed.  But when you export to CSV that information is lost.  The only way to handle it properly is to add another column that keeps track of that data.

In our next installment we’ll consider how to format specific data types.

Collecting and Managing Data, Part 1

The reason I’m writing this is quite simple: I want to save you a lot of time, money, headaches, and frustration.  If you are in business – any kind of business – you are collecting data.  At a minimum you are selling a product to someone.  That means that you have at least two datasets: products that you sell and people to whom you sell.  Additionally, you have monetary transactions that you have to track for tax purposes if nothing else.  Now we have three datasets.  Every single business on the planet has these three datasets.

Unless your product is created from dirt (e.g. you run a coal mine) you also have a supplier or list of suppliers from whom you buy.  At one end, you have a store where you buy from a warehouse and sell directly to consumers.  At the other end, you buy, say, cloth from cloth makers, create clothing, and sell the clothing.  In other words, you add value to the product.

Many small business owners keep these datasets in their heads.  Literally, they know their suppliers, they don’t need to know their customers (people walk in the door) and a cash register takes care of the minimal data required to track monetary exchanges.  Problems arise as the business grows too big for this model or when the key people who have the information get sick, die, attempt to delegate responsibilities, etc.

Small businesses often grow, and as they grow data collection and management becomes more and more important.  Often, the methods used to collect small amounts of data don’t scale well.  Spreadsheets are a good way to manage some data sets, but they’re limited to being accessed by one user at a time.  Some online spreadsheets, such as Google Docs, allow simultaneous access by multiple users, but the spreadsheet format doesn’t lend itself to such usage.

In the next part, I’m going to discuss general data collection techniques for small businesses and focus on spreadsheets as a good way to get started.

Collecting and Managing Data, Overview

I’m going to start a new series on collecting and managing data.  Ultimately this is going to be geared specifically toward those who have copyright data (music publishers, catalogs, libraries, etc.) but I’m going to start out with general information that will be applicable to anybody who is starting out managing data or has a data set that needs to be brought under control.

I’m not going to give a complete outline here as I haven’t started writing it and I’m not even sure of the path that I’ll ultimately take as I pull this out of my head and write it down.  I will cover general data collection and maintenance, using spreadsheets, using a desktop database like Filemaker or Access, and then specifically handling copyright data and other data associated with a collection of music or songs.

When I’m finished (if I ever finish – this may end up being a lifetime of work) there should be enough information here to write a short book.  I will number each part sequentially so you’ll want to read them in order.