Category Archives: Tech

Technical blogging

Unicursal Mazes

I’ve been working on my JavaScript maze generator again.  You can grab it from GitHub:

https://github.com/mdchaney/jsmaze

I added a couple of new algorithms to it for generating mazes: Prim’s algorithm and the “bacterial” algorithm.  Neither one is good for generating mazes, honestly, and my drunk-walk algorithm remains the best way.  I’m still looking at the possibility of adding another algorithm, similar to the recursive backtracker, that lets one choose where the “backtracking” will continue from.

I became interested a few weeks ago in the concept of the unicursal maze, also known as a labyrinth.  This isn’t really a maze; it’s a space-filling curve that visits all areas of the space exactly one time with no branches.


Interestingly, I can find only one source online about generating them.  That gentleman’s approach is to first generate a standard maze using any algorithm, then close the exit, and finally solve the maze, leaving the “solution” in place as a set of walls.  That creates a unicursal maze, but a very specific kind where the entrance and exit are side-by-side.  Plus, it works only because the solution cuts the rectangular cells into more rectangles as long as it goes through the middle of each cell.  Using another base such as hexagons or snub-square tiling will cause the new maze to have differently shaped cells.  It’s not a general algorithm.

So what is a general algorithm?  I don’t have a good answer yet.  I was able to make a simple one from a recursive backtracker.  It’s actually remarkably simple.  First, I keep track of the number of cells that remain unfilled.  Then, I do a standard recursive backtracking algorithm with one modification: every call reports success or failure.  If it runs into a dead end, it returns “false”.  If it gets to the end cell and there are still unfilled cells, it returns “false”.  If a move into a neighboring cell returns “false”, it closes the wall back up and tries the next neighbor.  If all moves from a cell return “false”, then that cell simply returns “false” as well.

But there is one way to make it return “true”.  That is if it gets to the last cell and there are no remaining unfilled cells.  In that case it returns true.  And when that one returns true it simply unwinds the stack all the way back to the beginning and is done.

Ultimately it will try a lot of paths.  The number of tries that it requires grows exponentially with the size of the maze.  So a 7×5 works in a couple of seconds, and a 10×7 just doesn’t, at least not in the 4 or 5 minutes that I’ve waited.

So that particular algorithm, for lack of a better term, sucks.  But it works and I think it’s a start.  One thing that I know is that we can avoid some dead ends by looking ahead a bit and trying not to move in such a way as to create a dead-end tunnel.  In a standard rectangular maze, that would mean that if we took a right turn we would have to move at least two squares before turning right again.  Coming within one space of an outside wall would be a problem unless it’s the last run to the end cell.  Generalizing the algorithm to work on any graph is the key, and it’s a tough one.  I’ve added some comments to the maze generator and will add more as I get more ideas.

Maze.maze_styles.unicursal = function() {

   var end_cell = this.end_cell();
   var pieces_left = this.cells.length-1;

   function recursive_maze(cell,entry_wall,depth) {
      cell.depth=depth;
      cell.entry_wall=entry_wall;

      // Check if this is the end (yes, I'm aware of the
      // optimization).
      // Alternately, we could just say "if pieces_left is 
      // 0 then this cell is the end", maybe with an edge
      // check.
      if (cell == end_cell) {
         if (pieces_left == 0) {
            return true;
         } else {
            return false;
         }  
      }  

      // Now, go through the surrounding cells and recurse
      for (var k=0 ; k<cell.perm.length ; k++) {
         var wall_num = cell.perm[k];
         var neighbor = cell.walls[wall_num].neighbor(cell);
         if (neighbor && !neighbor.visited()) {
            pieces_left--;
            cell.walls[wall_num].open();
            // Recurse by name; arguments.callee is not allowed in strict mode.
            var winner = recursive_maze(neighbor,cell.walls[wall_num],depth+1);
            if (winner) return true;
            pieces_left++;
            cell.walls[wall_num].close();
         }
      }

      return false;
   }
   var success = recursive_maze(this.start_cell(),null,0);
   if (!success) {
      throw new Error("Cannot create unicursal maze with these parameters.");
   }
}
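
The look-ahead idea mentioned before the listing can be tried out in isolation.  What follows is only a sketch, not part of jsmaze: it works on a plain rectangular grid with cells numbered row by row, and the name unicursalPath and its parameters are made up for the example.  It applies one simple pruning rule (my assumption about what “looking ahead” could mean, not the author’s implementation): after stepping into a cell, give up on that branch if any unvisited neighbor other than the end cell has no unvisited neighbors of its own, because such a cell could be entered but never left.

```javascript
// Sketch only: a standalone grid search, not the jsmaze code.
// Returns an array of cell indexes from start to end that visits every
// cell exactly once, or null if no such path exists.
function unicursalPath(width, height, start, end) {
  const visited = new Array(width * height).fill(false);
  const path = [];

  // Orthogonal neighbors of cell c (cells are numbered row by row).
  function neighbors(c) {
    const x = c % width;
    const y = Math.floor(c / width);
    const out = [];
    if (x > 0) out.push(c - 1);
    if (x < width - 1) out.push(c + 1);
    if (y > 0) out.push(c - width);
    if (y < height - 1) out.push(c + width);
    return out;
  }

  // Look-ahead: true if some unvisited neighbor of `current` (other than
  // the end cell) has no unvisited neighbor left to exit through.
  function strandsACell(current) {
    return neighbors(current).some(function (n) {
      return !visited[n] && n !== end &&
        neighbors(n).every(function (m) { return visited[m]; });
    });
  }

  function search(cell) {
    visited[cell] = true;
    path.push(cell);
    if (cell === end) {
      // Success only if the end cell is the very last cell filled.
      if (path.length === width * height) return true;
    } else if (!strandsACell(cell)) {
      for (const n of neighbors(cell)) {
        if (!visited[n] && search(n)) return true;
      }
    }
    // Backtrack: undo this move and report failure.
    visited[cell] = false;
    path.pop();
    return false;
  }

  return search(start) ? path : null;
}
```

Calling unicursalPath(4, 3, 0, 11) asks for a path across a 4×3 grid from the top-left cell to the bottom-right cell.  The neighbors are tried in a fixed order here, so the result is always the same; shuffling the neighbor list (as the cell.perm array does in the listing above) would give different mazes.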

Pro Tools update causes headaches

The latest Pro Tools update has caused two separate problems:

  1. I’ve had some customers with AIF files created with Pro Tools that have empty ANNO chunks.  Unfortunately, the Finder on the Mac ignores the ID3 chunk if there’s an ANNO chunk, so the metadata doesn’t show up unless the ANNO chunk is removed.
  2. I’m not sure why, but some AIF files that previously worked fine with Pro Tools are now causing it to complain.  I had a customer’s customer complain that the AIF surely had a virus in it.  Needless to say it didn’t, but I did have to go through the entire catalog and strip out all extra chunks from the AIF files, leaving only COMM, SSND, and ID3.  I’ll post the Ruby code for that later if there’s interest.  I don’t know which chunk was causing problems.

Collecting and Managing Data, Part 2

Spreadsheets are a wonderful way to collect data.  There are many advantages and not a few disadvantages.  For starting out the disadvantages are usually minimal.

Advantages:

  1. Almost everybody has a spreadsheet program, and anyone can download a free one such as OpenOffice
  2. Rows and columns are an intuitive way to view and think about data and closely mimic rows and columns in a database table
  3. You can see quite a bit of data at a time
  4. It’s easy to copy/paste if there is a lot of similar data

Disadvantages:

  1. Rows and columns only allow two-dimensional data, and most data is more complex
  2. Only one user can update it at a time, no concurrent access
  3. Even if you can get concurrent access, the spreadsheet model doesn’t lend itself to working concurrently
  4. Most spreadsheets have a row limit which effectively limits the size of your data

For many or most pieces of data, though, a spreadsheet is the best way to organize your data.  As your business grows, those disadvantages will take over and you will grow beyond the spreadsheet.  Even if you don’t, this information is useful.  But if you do, that data that you’ve collected will have to be moved somewhere else, and the act of moving it is going to be difficult unless you’ve created your dataset properly from the beginning.  Properly formatted data can be imported into a true database by a competent programmer.  Improperly formatted data will have to be retyped by hand, introducing errors and taking far more time.

I’m going to start with two rules that you need to keep in mind regardless of where you store your data, but they’re even more important with a spreadsheet.

Rule #1: Always use a serial number for your records

Rule #2: Always format data consistently

If you get nothing else out of this, those two rules will help you immensely.

So let’s talk about these rules.  Rule #1: You always need to use a serial number for your main records.  This means that if you have a list of customers, each customer should have a number.  It should be printed on any documentation created for that customer (e.g. an invoice) so that the customer may be referenced easily.  Of course, you also need to keep track of the address, phone number, etc., and perhaps those pieces of data could be used to find a customer.  But the customer number is the unique key, or “primary key” as we like to call it.  If you have a customer number you can find that unique customer among all others.

Most serial numbers don’t start at “1”.  Why?  Mostly for psychological reasons: a business looks small if you’re customer #10.  But there’s another good reason, and that is consistency.  Let’s say your business will likely never have more than 87,000 customers.  You can then start your customer number at 12100.  As long as you don’t get more than 87,900 customers (the count of numbers from 12100 through 99999), your customer numbers will always be exactly 5 digits.

That gives you an advantage in many ways.  Formatting your documents is a little easier as you know the size of the customer number.  If someone calls and offers a customer number you have a simple validation since it must be 5 digits.  And the starting number makes it difficult for someone to determine the size of your customer base.  Note that you can start with anything that’s 5 digits as long as you won’t exhaust your set of possible 5-digit numbers.

Of course, if your business is bigger simply scale that up.  Start with 121200 if you need a 6-digit customer number.  Or start with 115 if you anticipate having only a few hundred customers.  Whatever your business size you can easily scale this up or down to fit your needs.

This scheme obviously works with other data besides customer lists.  Order numbers, invoice numbers – anything that will get a serial number – can be assigned in this way.
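
If a program ever hands out these numbers for you, the scheme is easy to express in code.  Here is a small sketch in JavaScript; the name makeSerialCounter and its methods are made up for the example, and the starting value 12100 is the one used above.

```javascript
// Sketch only: fixed-width serial numbers that start at a chosen value.
function makeSerialCounter(start) {
  const digits = String(start).length;      // e.g. 5 for 12100
  const max = Math.pow(10, digits) - 1;     // e.g. 99999
  const pattern = new RegExp('^\\d{' + digits + '}$');
  let nextValue = start;

  return {
    // How many numbers are available before the width would change.
    capacity: max - start + 1,

    // Hand out the next number, refusing to grow past the fixed width.
    next: function () {
      if (nextValue > max) {
        throw new Error('Out of ' + digits + '-digit numbers');
      }
      return nextValue++;
    },

    // Simple validation: right number of digits and not below the start.
    isValid: function (value) {
      return pattern.test(String(value)) && Number(value) >= start;
    }
  };
}

const customerNumbers = makeSerialCounter(12100);
```

Here customerNumbers.capacity is 87900, the first call to customerNumbers.next() returns 12100, and customerNumbers.isValid('1210') is false because the number is only four digits long.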

Rule #2: Always format your data consistently.  This is also simple when you understand the basis of it.  Let’s consider our customer data again.  Let’s say we have a simple database of customer number, name, and phone number.  We need to determine up front how we’re going to format each piece of data, and then do it consistently on each record.

Imagine this scenario:

Customer #   Name             Phone
121          Jim Beam         (812)555-1212
122          Jack Daniel      615-555-1212
123          Dickel, George   +1 323 555-1212

This is normal data, unfortunately.  There are two problems here:

1. There are three different formats for phone numbers.  Not a huge deal because it’s rare that we need to get the component pieces of a phone number (i.e. the area code) and people/phones can generally deal with about anything nowadays.  Still, consistency is good because we can better anticipate how it will print on a page.  If your customer base is in US/Canada, I highly recommend simply using the middle format.

2. Look at those names.  Specifically, “George Dickel” is stored as “Dickel, George”.  Names are notoriously difficult to parse because of all of the possible suffixes that might be added to them after a comma.  I’m going to talk more about names specifically in a later part.

The point here is to pick some format and be consistent.  We can argue about what format is best – and will at some point – but you’re better off with a consistent inferior format than a dozen inconsistent better formats.
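
As an example of enforcing one format, here is a small JavaScript sketch that rewrites US/Canada phone numbers into the middle format from the table (615-555-1212).  The function name normalizePhone is made up, and the rule of rejecting anything that isn’t 10 digits (or 11 with a leading 1) is my assumption about what should count as valid.

```javascript
// Sketch only: convert a US/Canada phone number to NNN-NNN-NNNN.
// Returns null when the input doesn't contain exactly 10 digits
// (optionally preceded by the country code 1).
function normalizePhone(raw) {
  let digits = String(raw).replace(/\D/g, '');   // drop everything but digits
  if (digits.length === 11 && digits[0] === '1') {
    digits = digits.slice(1);                    // strip the leading country code
  }
  if (digits.length !== 10) {
    return null;
  }
  return digits.slice(0, 3) + '-' + digits.slice(3, 6) + '-' + digits.slice(6);
}
```

All three numbers in the table come out in the same shape: normalizePhone('(812)555-1212') gives '812-555-1212' and normalizePhone('+1 323 555-1212') gives '323-555-1212'.  Returning null instead of guessing lets you fix the odd entries by hand.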

I have a piece of mail that very clearly demonstrates the problems with manipulating data that’s not well-formatted.  Years and years ago I worked at a university and my title was “Database Programmer”.  I had filled out a card or something somewhere that had my name as “Darrin Chaney”.  And so I got mail to “Darrin Chaney” and the second line would be my title “Database Programmer”.

At some point this data was sold and massaged and manipulated and screwed up.  I received a letter addressed to “Darrin C Dbase Prog”.  Inside the envelope there was a personalized letter that started out “Dear Mr. Prog”.

As a small business owner – or large business owner for that matter – you do not want to be the schmuck who sends a piece of mail to dear old Mr. Prog.  That’s amateur hour stuff.  The piece of mail came from a Fortune 500 company with whom I had done business under my actual name – all the worse.  Keeping your data format consistent will go a long way toward making sure that your customers are addressed as you would want to be addressed.  You don’t want your mail to be posted to my wall of shame.

Beyond the Two Rules

When storing data in a spreadsheet, you have to consider that at some point you’ll most likely have a computer program manipulating that data in various ways.  Spreadsheets are generally not something that programs directly read (Microsoft’s popular Excel format has been mostly reverse-engineered but Microsoft has no interest in truly making the format public).  Instead, we usually export the spreadsheet to a text-based format known as “CSV”, or “comma-separated values”.  The above table ends up looking like this:

Customer #,Name,Phone
121,Jim Beam,(812)555-1212
122,Jack Daniel,615-555-1212
123,"Dickel, George",+1 323 555-1212

You shouldn’t worry about the mechanics of CSV, i.e. the double quotes in the last line.  What you *should* worry about is the fact that any “extra” formatting in your spreadsheet is lost when you export to CSV.  That means that if you have bold, underlined, or italicized print that formatting is lost.  If you have formatted numbers, dates, times, etc. in a certain way that formatting will likely be lost.  And if you’ve changed colors of cells, rows, or columns, or the text colors that will be lost.  The only formatting that CSV will maintain is your spaces and any multi-line text that you might have in the sheet.
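
For the curious, here is roughly what a program does with one of those lines.  This is a minimal JavaScript sketch (the name parseCsvLine is made up); it handles commas inside double-quoted fields and doubled quotes, but not fields that span several lines, so a real CSV library is the better choice for actual imports.

```javascript
// Sketch only: split a single CSV line into its fields.
// Commas inside double quotes are kept; "" inside quotes means one quote.
function parseCsvLine(line) {
  const fields = [];
  let field = '';
  let inQuotes = false;

  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') {
        field += '"';        // escaped quote
        i++;
      } else if (ch === '"') {
        inQuotes = false;    // closing quote
      } else {
        field += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;       // opening quote
    } else if (ch === ',') {
      fields.push(field);    // end of field
      field = '';
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}
```

Given the last line of the export, parseCsvLine('123,"Dickel, George",+1 323 555-1212') returns three fields, with “Dickel, George” kept together as one name.  Notice that nothing in the line says whether a cell was bold or yellow.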

So, this brings us to a minor rule which we’ll call Rule #3:

Rule #3: All data must be in text in the spreadsheet.

In other words, consider that you want to know whether your customers are “local” or “non-local”.  You could make the background of non-local customers yellow, for instance, so that they’re easily viewed.  But when you export to CSV that information is lost.  The only way to handle it properly is to add another column that keeps track of that data.
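
To make that concrete, here is a JavaScript sketch of the same customer list with the local/non-local information stored as its own column and written out as CSV.  The function name toCsv, the “Local” column name, and the yes/no values assigned to each customer are all invented for the example.

```javascript
// Sketch only: write rows out as CSV, quoting fields that need it.
function toCsv(rows, columns) {
  function quote(value) {
    const s = value == null ? '' : String(value);
    // Quote fields containing a comma, a quote, or a line break.
    return /[",\r\n]/.test(s) ? '"' + s.replace(/"/g, '""') + '"' : s;
  }
  const lines = [columns.map(quote).join(',')];
  rows.forEach(function (row) {
    lines.push(columns.map(function (c) { return quote(row[c]); }).join(','));
  });
  return lines.join('\n');
}

// Hypothetical data: "Local" replaces the yellow background.
const customers = [
  { 'Customer #': 121, Name: 'Jim Beam', Phone: '812-555-1212', Local: 'yes' },
  { 'Customer #': 122, Name: 'Jack Daniel', Phone: '615-555-1212', Local: 'yes' },
  { 'Customer #': 123, Name: 'Dickel, George', Phone: '323-555-1212', Local: 'no' }
];

const csv = toCsv(customers, ['Customer #', 'Name', 'Phone', 'Local']);
```

Because “Local” is ordinary text in its own column, it survives the export, and any program that reads the file can filter on it.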

In our next installment we’ll consider how to format specific data types.

Collecting and Managing Data, Part 1

The reason I’m writing this is quite simple: I want to save you a lot of time, money, headaches, and frustration.  If you are in business – any kind of business – you are collecting data.  At a minimum you are selling a product to someone.  That means that you have at least two datasets: products that you sell and people to whom you sell.  Additionally, you have monetary transactions that you have to track for tax purposes if nothing else.  Now we have three datasets.  Every single business on the planet has these three datasets.

Unless your product is created from dirt (e.g. you run a coal mine) you also have a supplier or list of suppliers from whom you buy.  At one end, you have a store where you buy from a warehouse and sell directly to consumers.  At the other end, you buy, say, cloth from cloth makers, create clothing, and sell the clothing.  In other words, you add value to the product.

Many small business owners keep these datasets in their heads.  Literally, they know their suppliers, they don’t need to know their customers (people walk in the door) and a cash register takes care of the minimal data required to track monetary exchanges.  Problems arise as the business grows too big for this model or when the key people who have the information get sick, die, attempt to delegate responsibilities, etc.

Small businesses often grow, and as they grow data collection and management becomes more and more important.  Often, the methods used to collect small amounts of data don’t scale well.  Spreadsheets are a good way to manage some data sets, but they’re limited to being accessed by one user at a time.  Some online spreadsheets, such as Google Docs, allow simultaneous access by multiple users, but the spreadsheet format doesn’t lend itself to such usage.

In the next part, I’m going to discuss general data collection techniques for small businesses and focus on spreadsheets as a good way to get started.

Collecting and Managing Data, Overview

I’m going to start a new series on collecting and managing data.  Ultimately this is going to be geared specifically toward those who have copyright data (music publishers, catalogs, libraries, etc.) but I’m going to start out with general information that will be applicable to anybody who is starting out managing data or has a data set that needs to be brought under control.

I’m not going to give a complete outline here as I haven’t started writing it and I’m not even sure of the path that I’ll ultimately take as I pull this out of my head and write it down.  I will cover general data collection and maintenance, using spreadsheets, using a desktop database like FileMaker or Access, and then specifically handling copyright data and other data associated with a collection of music or songs.

When I’m finished (if I ever finish – this may end up being a lifetime of work) there should be enough information here to write a short book.  I will number each part sequentially so you’ll want to read them in order.

Code Available

I have made a lot of code available at GitHub:

http://github.com/mdchaney

Right now you can find 1D barcode generation code written in Perl, 3 maze generators written (and highly commented) in JavaScript, a JavaScript sprintf function, and a plugin for CarrierWave to store image information. More coming as I have time.