Category Archives: Tech

Technical blogging

Thoughts on Net Neutrality

As I’ve said here, this is neither the “long version” nor the “short version”.  I could write a book on the subject, but it would bore you to tears.  Anyway, read on if you’re interested in this subject.

I’m about as much of an expert on this subject as anybody. Let me say this: net neutrality is a good idea. I don’t know anybody else with as much knowledge and experience on the subject as I have who feels differently.

The implementation (giving the FCC more power) is subpar, but probably better than the status quo. This is a far more complicated issue than either side is making it, and it cannot be reduced to a simple soundbite. I get left-wing emails telling me about “rich corporations having an internet fast lane with the rest of us being in the slow lane” – they literally don’t even know what they’re talking about. At the same time, the right doesn’t either – I see “government takeover of the internet” all the time. Newsflash: the government already has more control than you can imagine. This doesn’t change too much.

Let me explain this as simply as I can. I use Comcast for internet service and pay $50/month. I also have Netflix for $9/month. Comcast went to Netflix and said “nice little internet service you have there, hows about paying us to ‘protect’ it for you, make sure your customers who use our service don’t have any slowdowns or anything?” They then lowered the bandwidth available to Netflix – something they can trivially do in software – causing Netflix customers to have hiccups when watching movies. Netflix quickly caved and paid Comcast extra money, and their service suddenly got better.

Here’s the problem: Netflix doesn’t print money. Every dollar they get is from a customer like me. That means that when they pay Comcast it’s really me who’s paying Comcast. But – I’m *already* paying the bastards. So now I have to pay them twice. Or three or four or five times – after all Netflix is just one company.

Why does Comcast do this? Because they’re not a disinterested third party here. They have their own video streaming service that they would prefer for me to use. They’re in an interesting position in that their competition relies on them to deliver their product.

Comcast isn’t the only one who did this, by the way. Verizon did, too. That we know of.

The basic concept of net neutrality is to say to Comcast “you cannot treat Netflix traffic any differently than you treat any other traffic – including traffic coming from your own competing services”. This applies to phone, video, etc.

Where net neutrality falls short is in the Verizon case. Verizon famously had a dispute with Level 3 over this, the details of which you can find online. The short version is that Netflix traffic was coming through Level 3 to get to Verizon’s network and customers. Verizon’s peering point – the point at which the two networks are connected – was badly congested, to the point of dropping traffic, due to Netflix’s large volume of traffic. The cost to upgrade the equipment was on the order of a few thousand dollars – pocket change in this world. The network interface had unused capacity that could have been trivially exploited had Verizon cared. But they instead left it congested and then blamed Level 3. This particular problem is as big as the other issue, but it isn’t easily solved by either government or free markets. There’s no good solution.

Anyway, that’s not even the long answer to all this. The takeaway is that we want to be in a world where Comcast, Verizon, AT&T, Time Warner, and other ISPs don’t get into a shakedown racket with Netflix and other internet companies in order to make sure their traffic isn’t limited. Unfortunately the market wasn’t able to properly handle this situation (Comcast is the only high-speed internet available for me, for instance) so the government stepped in. That’s not good, but it likely beats the alternative.

As for Forbes being “pro-business” – that’s good. Business is ultimately where we all get our jobs. Net neutrality is extremely pro-business.

The Forbes article is also confused. They say “the company that links your computer/tablet/smartphone to the internet should not be able to discriminate among users and providers in the level of connectivity service provided. That is, we should all be able to send and receive the same number of bits of data per second.” Wrong. ISPs like Comcast offer different service levels at different price points. This is unrelated to net neutrality. Literally – totally different ideas.

To go on: “He may think it is not, but it completely blocks certain business models and stops any possible innovation that might emerge if given the option of seeking differential access to bandwidth.” Wrong. The only business model that it blocks is the classic protection racket. It’s telling that although the internet has been around for over 20 years in its present form (accessible to everybody and having the “world wide web” as a foundation for most people’s usage) Forbes can’t name even a single business or business model that would benefit from being able to buy higher bandwidth from an ISP on the back end (again, this is unrelated to your bandwidth at your connection point).

“If an ISP blocks Netflix because of the bandwidth it requires, consumers who want Netflix will take their business elsewhere.” Yeah, if only. I can either get Comcast, Comcast, or Comcast for my internet service. Oh, AT&T offers some laughably slow internet connection here, too, but the top end of the bandwidth that they offer isn’t enough for the work that my wife and I do. We cannot take our business elsewhere without moving, and it’s unlikely that anybody is going to run fiber here to get me a faster connection. If it were cost-effective (on the order of a couple thousand dollars) I’d do it myself.

“The fact that most people cannot afford some of those models does not mean they should be removed from sale. Similarly, the fact that some businesses or consumers may choose to pay for better access to the Internet is not a bad thing. Some people pay more to fly first class, but they do not interfere with my travel in coach.” Again, the author literally doesn’t understand the issue. It’s not about the service Comcast offers to me – the customer. It’s about them shaking down Netflix *who isn’t their customer*.

Let me draw an analogy that actually works. Remember when AT&T was broken up? Suddenly there were other long-distance providers, such as MCI and WorldCom. Now, in the early 90s I used Ameritech with MCI for long distance. I was a customer of both Ameritech – who provided my local phone service – and MCI – who provided long distance. That means that my calls went through Ameritech to get to MCI. But Ameritech also sold long distance service. They competed with MCI.

Now, imagine that you worked at Ameritech and you wanted to get more long distance customers. After all, it’s a very lucrative market. What to do? How about this – go find the physical wires that connect to MCI and alter them so that the phone calls going over to MCI sound like crap. Then call MCI and say “Hey, wow, your phone calls sound like crap. You know, for a little payment I bet we could find out where the problem is and make those calls sound good again.”

Do you think Ameritech should have been able to do that to MCI? If not, you’re actually supporting net neutrality.

Russian doll caching for collections in Rails 4

A lot has been written about Russian Doll Caching in Rails 4 but surprisingly little about caching of collections.   One of the basic tenets is that you should use “touch: true” on your “belongs_to” relationships in order to update the “updated_at” time on the parent record and thus invalidate its cached copies.

An issue then arises when you have a relationship that isn’t a strict parent/child relationship.  As an example, let’s imagine this particular relationship structure:


class Item < ActiveRecord::Base
  has_many :project_items, inverse_of: :item
end

class Project < ActiveRecord::Base
  has_many :project_items, -> { order("project_items.position") }, inverse_of: :project
  has_many :items, through: :project_items
end

class ProjectItem < ActiveRecord::Base
  belongs_to :project, inverse_of: :project_items, touch: true
  belongs_to :item, inverse_of: :project_items
end

Items and Projects are then in a many-to-many relationship with a join table providing support for ordering of items within a project. Now, let’s imagine a simple view:

<% cache @project do -%>
  <h1><%= @project.title %></h1>
  <% @project.project_items.each do |project_item| -%>
    <% cache project_item do -%>
      <p><%= project_item.item.description %></p>
    <% end -%>
  <% end -%>
<% end -%>

The problem here is that if an item is updated it won’t change the updated_at date on the project_items that are associated with it, nor do we want it to.  But we need to invalidate both the outer Project cache and the inner item cache.  Simply changing the key on the inner cache to “project_item.item” doesn’t fix this, as the Project still won’t be updated.

This is then a two-fold problem:

  1. What is the proper cache key for the outer @project cache?
  2. What is the proper cache key for the inner item cache?

It’s tempting on the @project to do something like this:

<% cache [@project,@project.items] %>

And that mostly works.  The problem is that the cache key will simply grow as items are added.  What might not be a problem with two or three items gets out of whack with 50 or 100.  I’ve seen cache keys that are a block of 10+ lines at 80 characters wide.  That’s inefficient.
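To make this concrete, here is roughly the kind of key that expands into (the model names match the example above; the ids and timestamps are made up):

views/projects/42-20140301120000/items/1-20140225093000/items/2-20140226101500/...

Each item contributes its own id-timestamp segment, so the key grows linearly with the collection.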

We can come up with something simpler:

<% cache [@project,@project.items.maximum(:updated_at)] %>

That works.  If someone removes a ProjectItem, the project will be touched, and if an item is updated then the max updated_at will change.  I like this even better:

<% cache [@project,@project.items.count,@project.items.maximum(:updated_at)] %>

That’s a little ugly, though, so the question is how best to clean it up.

I can actually change this in the relationship:

class Project < ActiveRecord::Base
  has_many :project_items, -> { order("project_items.position") }, inverse_of: :project
  has_many :items, through: :project_items do
    # Key the collection by its size and its latest update time.
    def cache_key
      [count(:updated_at), maximum(:updated_at)].map(&:to_i).join('-')
    end
  end
end

Now, our cache line is a little simpler:

<% cache [@project,@project.items] do %>

With that, it’ll get @project.items.cache_key and the cache will be invalidated if any item is updated.  The bonus is that the cache key is made up of only a few items and is much more manageable.  It’s also much more readable to humans, both in code and in the cache itself.

The inner cache is then simply:

<% cache [project_item,project_item.item] %>

That way any update to either the project_item or the item will invalidate the cache.  I found a gem that should add the cache_key for associations, but it seems not to work with Rails 4.  It would be useful for someone to update it, as this functionality is even better when the code doesn’t have to be specified each time.

The argument against DRYing this up into a shared helper is that different cache key schemes can be lighter-weight for particular cases.  For example, I have another table that contains viewing logs for projects.  That table is basically write-only – records are never updated – so the record count or the maximum id on the joined table is enough to determine the cache key.
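As a sketch of that idea (the view_logs association name here is hypothetical):

class Project < ActiveRecord::Base
  has_many :view_logs do
    def cache_key
      # Records are inserted but never updated, so the count plus the
      # maximum id uniquely identifies the state of the collection.
      [count, maximum(:id)].map(&:to_i).join('-')
    end
  end
end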

Unicursal Mazes

I’ve been working on my javascript maze generator again; you can grab it off github:

https://github.com/mdchaney/jsmaze

I added a couple of new algorithms to it for generating mazes: Prim’s algorithm and the “bacterial” algorithm.  Neither one is good for generating mazes, honestly, and my drunk-walk algorithm remains the best way.  I’m still looking at the possibility of adding another algorithm, similar to the recursive backtracker, that allows one to determine from where the “backtracking” will continue.

I became interested a few weeks ago in the concept of the unicursal maze, also known as a labyrinth.  This isn’t really a maze; it’s a space-filling curve that visits all areas of the space exactly one time with no branches.

[Screenshot: an example unicursal maze]

Interestingly, I cannot find anything online about generating them except for one page.  That gentleman’s approach is to first generate a standard maze using any algorithm, then close the exit, and finally solve the maze, leaving the “solution” in place as a set of walls.  That creates a unicursal maze, but a very specific kind where the entrance and exit are side-by-side.  Plus, it works because the solution will cut the rectangular cells into more rectangles as long as it goes through the middle of each cell.  Using another base such as hexagons or snub-square tiling will cause the new maze to have differently shaped cells.  It’s not a general algorithm.

So what is a general algorithm?  I don’t have a good answer yet.  I was able to make a simple one from a recursive backtracker, and it’s actually remarkably simple.  First, I keep track of the number of cells that remain unfilled.  Then, I run a standard recursive backtracking algorithm with one modification: if it runs into a dead end, it returns “false”, and if it gets to the end cell while there are still unfilled cells, it also returns “false”.  After moving to a cell, if the recursive call returns “false”, the algorithm closes the wall back up and tries the next direction.  If all moves from a cell return false, then we simply return false.

But there is one way to make it return “true”: if it gets to the last cell and there are no remaining unfilled cells.  When that happens, the “true” simply unwinds the stack all the way back to the beginning and we’re done.

Ultimately it will try a lot of paths.  The number of tries that it requires grows exponentially with the size of the maze.  So a 7×5 works in a couple of seconds, and a 10×7 just doesn’t – at least not in the 4 or 5 minutes that I’ve waited.

So that particular algorithm, for lack of a better term, sucks.  But it works, and I think it’s a start.  One thing that I know is that we can avoid some dead ends by looking ahead a bit and trying not to move in such a way as to create a dead-end tunnel.  In a standard rectangular maze that would mean that if we took a right turn we would have to move at least two squares before turning right again.  Coming within one space of an outside wall would be a problem unless it’s the last run to the end cell.  Generalizing the algorithm to work on any graph is the key, and it’s a tough one.  I’ve added some comments to the maze generator and will be adding more as I get more ideas.

Maze.maze_styles.unicursal = function() {

   var end_cell = this.end_cell();
   var pieces_left = this.cells.length-1;

   function recursive_maze(cell,entry_wall,depth) {
      cell.depth=depth;
      cell.entry_wall=entry_wall;

      // Check if this is the end (yes, I'm aware of the
      // optimization).
      // Alternately, we could just say "if pieces_left is 
      // 0 then this cell is the end", maybe with an edge
      // check.
      if (cell == end_cell) {
         if (pieces_left == 0) {
            return true;
         } else {
            return false;
         }  
      }  

      // Now, go through the surrounding cells and recurse
      for (var k=0 ; k<cell.perm.length ; k++) {
         var wall_num = cell.perm[k];
         var neighbor = cell.walls[wall_num].neighbor(cell);
         if (neighbor && !neighbor.visited()) {
            pieces_left--;
            cell.walls[wall_num].open();
            var winner = recursive_maze(neighbor, cell.walls[wall_num], depth+1);
            if (winner) return true;
            pieces_left++;
            cell.walls[wall_num].close();
         }
      }

      return false;
   }
   var success = recursive_maze(this.start_cell(),null,0);
   if (!success) {
      throw "Cannot create unicursal maze with these parameters."
   }
}

Pro Tools update causes headaches

The latest Pro Tools update has caused two separate problems:

  1. I’ve had some customers with AIF files created with Pro Tools that have empty ANNO chunks.  Unfortunately, the Finder on the Mac ignores the ID3 chunk if there’s an ANNO chunk, so the metadata doesn’t show up unless the ANNO chunk is removed.
  2. I’m not sure why, but some AIF files that previously worked fine with Pro Tools now cause it to complain.  I had a customer’s customer complain that the AIF surely had a virus in it.  Needless to say, it didn’t, but I did have to go through the entire catalog and strip out all extra chunks from the AIF files, leaving only COMM, SSND, and ID3 (a rough sketch is below).  I’ll post the full ruby code for that later if there’s interest.  I don’t know which chunk was causing problems.
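
Here’s that chunk-stripping sketch in ruby.  The file names are hypothetical, and real code would want much better error handling:

KEEP_CHUNKS = ['COMM', 'SSND', 'ID3 ']   # IDs are 4 bytes, so "ID3 " is space-padded

File.open('input.aif', 'rb') do |infile|
  form, _form_size, type = infile.read(12).unpack('a4Na4')
  raise 'not an AIFF file' unless form == 'FORM' && type == 'AIFF'

  kept = []
  until infile.eof?
    id, size = infile.read(8).unpack('a4N')
    data = infile.read(size)
    data << infile.read(1) if size.odd?   # chunk data is padded to an even length
    kept << [id, size].pack('a4N') + data if KEEP_CHUNKS.include?(id)
  end

  File.open('output.aif', 'wb') do |outfile|
    body = kept.join
    outfile.write(['FORM', body.bytesize + 4, 'AIFF'].pack('a4Na4'))
    outfile.write(body)
  end
end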

Collecting and Managing Data, Part 2

Spreadsheets are a wonderful way to collect data.  There are many advantages and quite a few disadvantages.  When you’re starting out, though, the disadvantages are usually minimal.

Advantages:

  1. Almost everybody has a spreadsheet, and anyone can download a free one from OpenOffice
  2. Rows and columns are an intuitive way to view and think about data and closely mimic rows and columns in a database table
  3. You can see quite a bit of data at a time
  4. It’s easy to copy/paste if there is a lot of similar data

Disadvantages:

  1. Rows and columns only allow two-dimensional data, and most data is more complex
  2. Only one user can update it at a time, no concurrent access
  3. Even if you can get concurrent access, the spreadsheet model doesn’t lend itself to working concurrently
  4. Most spreadsheets have a row limit which effectively limits the size of your data

For many or most datasets, though, a spreadsheet is the best way to organize your data.  As your business grows, those disadvantages will take over and you will outgrow the spreadsheet.  Even if you don’t, this information is useful.  But if you do, the data that you’ve collected will have to be moved somewhere else, and the act of moving it is going to be difficult unless you’ve created your dataset properly from the beginning.  Properly formatted data can be imported into a true database by a competent programmer.  Improperly formatted data will have to be retyped by hand, introducing errors and taking far more time.

I’m going to start with two rules that you need to keep in mind regardless of where you store your data; they’re even more important with a spreadsheet.

Rule #1: Always use a serial number for your records

Rule #2: Always format data consistently

If you get nothing else out of this, those two rules will help you immensely.

So let’s talk about these rules.  Rule #1: You always need to use a serial number for your main records.  This means that if you have a list of customers, each customer should have a number.  It should be printed on any documentation created for that customer (e.g. an invoice) so that the customer may be referenced easily.  Of course, you also need to keep track of the address, phone number, etc., and perhaps those pieces of data could be used to find a customer.  But the customer number is the unique key, or “primary key” as we like to call it.  If you have a customer number you can find that unique customer among all others.

Most serial numbers don’t start at “1”.  Why?  Mostly for psychological reasons – businesses look small if you’re customer #10.  But there’s another good reason, and that is consistency.  Let’s say your business will likely never have more than 87,000 customers.  You can then start your customer number at 12100.  As long as you don’t get more than 87,900 customers (12100 through 99999 gives you exactly 87,900 possible numbers), your customer numbers will always be exactly 5 digits.

That gives you an advantage in many ways.  Formatting your documents is a little easier as you know the size of the customer number.  If someone calls and offers a customer number you have a simple validation since it must be 5 digits.  And the starting number makes it difficult for someone to determine the size of your customer base.  Note that you can start with anything that’s 5 digits as long as you won’t exhaust your set of possible 5-digit numbers.

Of course, if your business is bigger simply scale that up.  Start with 121200 if you need a 6-digit customer number.  Or start with 115 if you anticipate having only a few hundred customers.  Whatever your business size you can easily scale this up or down to fit your needs.

This scheme obviously works with other data besides customer lists.  Order numbers, invoice numbers – anything that will get a serial number – can be assigned in this way.
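
In code, the whole scheme boils down to a range check and an increment.  A minimal sketch in ruby, assuming the 5-digit example above:

FIRST_CUSTOMER_NUMBER = 12100
LAST_CUSTOMER_NUMBER  = 99999   # 99999 - 12100 + 1 = 87,900 possible numbers

def valid_customer_number?(number)
  (FIRST_CUSTOMER_NUMBER..LAST_CUSTOMER_NUMBER).cover?(number)
end

def next_customer_number(current_max)
  return FIRST_CUSTOMER_NUMBER if current_max.nil?   # very first customer
  raise 'customer numbers exhausted' if current_max >= LAST_CUSTOMER_NUMBER
  current_max + 1
end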

Rule #2: Always format your data consistently.  This is also simple when you understand the basis of it.  Let’s consider our customer data again.  Let’s say we have a simple database of customer number, name, and phone number.  We need to determine up front how we’re going to format each piece of data, and then do it consistently on each record.

Imagine this scenario:

Customer #   Name             Phone
121          Jim Beam         (812)555-1212
122          Jack Daniel      615-555-1212
123          Dickel, George   +1 323 555-1212

This is normal data, unfortunately.  There are two problems here:

1. There are three different formats for phone numbers.  Not a huge deal, because it’s rare that we need to get the component pieces of a phone number (i.e. the area code) and people/phones can generally deal with about anything nowadays.  Still, consistency is good because we can better anticipate how it will print on a page.  If your customer base is in the US/Canada, I highly recommend simply using the middle format (see the sketch after this list).

2. Look at those names.  Specifically, “George Dickel” is stored as “Dickel, George”.  Names are notoriously difficult to parse because of all of the possible suffixes that might be added to them after a comma.  I’m going to talk more about names specifically in a later part.
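
Here’s a rough sketch of normalizing phone numbers to the middle format in ruby, assuming a US/Canada customer base:

def normalize_phone(raw)
  digits = raw.gsub(/\D/, '')                                       # keep only the digits
  digits = digits[1..-1] if digits.length == 11 && digits[0] == '1' # drop a leading country code
  return nil unless digits.length == 10                             # can't normalize; flag for review
  [digits[0, 3], digits[3, 3], digits[6, 4]].join('-')
end

normalize_phone('(812)555-1212')    # => "812-555-1212"
normalize_phone('+1 323 555-1212')  # => "323-555-1212"

Anything that comes back nil gets looked at by a human rather than guessed at.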

The point here is to pick some format and be consistent.  We can argue about what format is best – and will at some point – but you’re better off with a consistent inferior format than a dozen inconsistent better formats.

I have a piece of mail that very clearly demonstrates the problems with manipulating data that’s not well-formatted.  Years and years ago I worked at a university and my title was “Database Programmer”.  I had filled out a card or something somewhere that had my name as “Darrin Chaney”.  And so I got mail to “Darrin Chaney” and the second line would be my title, “Database Programmer”.

At some point this data was sold and massaged and manipulated and screwed up.  I received a letter addressed to “Darrin C Dbase Prog”.  Inside the envelope there was a personalized letter that started out “Dear Mr. Prog”.

As a small business owner – or large business owner for that matter – you do not want to be the schmuck who sends a piece of mail to dear old Mr. Prog.  That’s amateur hour stuff.  The piece of mail came from a Fortune 500 company with whom I had done business under my actual name – all the worse.  Keeping your data format consistent will go a long way toward making sure that your customers are addressed as you would want to be addressed.  You don’t want your mail to be posted to my wall of shame.

Beyond the Two Rules

When storing data in a spreadsheet, you have to consider that at some point you’ll most likely have a computer program manipulating that data in various ways.  Spreadsheets are generally not something that programs directly read (Microsoft’s popular Excel format has been mostly reverse-engineered but Microsoft has no interest in truly making the format public).  Instead, we usually export the spreadsheet to a text-based format known as “CSV”, or “comma-separated values”.  The above table ends up looking like this:

Customer #,Name,Phone Number
121,Jim Beam,(812)555-1212
122,Jack Daniel,615-555-1212
123,"Dickel, George",+1 323 555-1212

You shouldn’t worry about the mechanics of CSV, i.e. the double quotes in the last line.  What you *should* worry about is the fact that any “extra” formatting in your spreadsheet is lost when you export to CSV.  That means that if you have bold, underlined, or italicized print that formatting is lost.  If you have formatted numbers, dates, times, etc. in a certain way that formatting will likely be lost.  And if you’ve changed colors of cells, rows, or columns, or the text colors that will be lost.  The only formatting that CSV will maintain is your spaces and any multi-line text that you might have in the sheet.
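
On the programming side, that export is trivial to consume.  A minimal sketch using ruby’s standard CSV library (the file name is hypothetical):

require 'csv'

CSV.foreach('customers.csv', headers: true) do |row|
  puts "#{row['Customer #']}: #{row['Name']} / #{row['Phone Number']}"
end

The quoting mechanics (“Dickel, George”) are handled automatically – which is exactly why you shouldn’t worry about them.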

So, this brings us to a minor rule which we’ll call Rule #3:

Rule #3: All data must be represented as text in the spreadsheet.

In other words, consider that you want to know whether your customers are “local” or “non-local”.  You could make the background of non-local customers yellow, for instance, so that they’re easily viewed.  But when you export to CSV that information is lost.  The only way to handle it properly is to add another column that keeps track of that data.
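
For instance, the earlier export with a hypothetical “Local” column added:

Customer #,Name,Phone Number,Local
121,Jim Beam,(812)555-1212,No
122,Jack Daniel,615-555-1212,Yes
123,"Dickel, George",+1 323 555-1212,No

Now the local/non-local information survives the export.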

In our next installment we’ll consider how to format specific data types.

Collecting and Managing Data, Part 1

The reason I’m writing this is quite simple: I want to save you a lot of time, money, headaches, and frustration.  If you are in business – any kind of business – you are collecting data.  At a minimum you are selling a product to someone.  That means that you have at least two datasets: products that you sell and people to whom you sell.  Additionally, you have monetary transactions that you have to track for tax purposes if nothing else.  Now we have three datasets.  Every single business on the planet has these three datasets.

Unless your product is created from dirt (e.g. you run a coal mine) you also have a supplier or list of suppliers from whom you buy.  At one end, you have a store where you buy from a warehouse and sell directly to consumers.  At the other end, you buy, say, cloth from cloth makers, create clothing, and sell the clothing.  In other words, you add value to the product.

Many small business owners keep these datasets in their heads.  Literally: they know their suppliers, they don’t need to know their customers (people walk in the door), and a cash register takes care of the minimal data required to track monetary exchanges.  Problems arise as the business grows too big for this model or when the key people who have the information get sick, die, attempt to delegate responsibilities, etc.

Small businesses often grow, and as they grow data collection and management becomes more and more important.  Often, the methods used to collect small amounts of data don’t scale well.  Spreadsheets are a good way to manage some data sets, but they’re limited to being accessed by one user at a time.  Some online spreadsheets, such as Google Docs, allow simultaneous access by multiple users, but the spreadsheet format doesn’t lend itself to such usage.

In the next part, I’m going to discuss general data collection techniques for small businesses and focus on spreadsheets as a good way to get started.

Collecting and Managing Data, Overview

I’m going to start a new series on collecting and managing data.  Ultimately this is going to be geared specifically toward those who have copyright data (music publishers, catalogs, libraries, etc.) but I’m going to start out with general information that will be applicable to anybody who is starting out managing data or has a data set that needs to be brought under control.

I’m not going to give a complete outline here as I haven’t started writing it and I’m not even sure of the path that I’ll ultimately take as I pull this out of my head and write it down.  I will cover general data collection and maintenance, using spreadsheets, using a desktop database like Filemaker or Access, and then specifically handling copyright data and other data associated with a collection of music or songs.

When I’m finished (if I ever finish – this may end up being a lifetime of work) there should be enough information here to write a short book.  I will number each part sequentially so you’ll want to read them in order.

Code Available

I have made a lot of code available at GitHub:

http://github.com/mdchaney

Right now you can find 1D barcode generation code written in Perl, 3 maze generators written (and highly commented) in JavaScript, a JavaScript sprintf function, and a plugin for CarrierWave to store image information. More coming as I have time.