Sunday, January 12, 2014

Rube Goldberg Network Design

The holy grail of network engineering is building a completely redundant network with no single point of failure, where outages are never seen by the end users and the network team is a happy, upbeat group of individuals who never get blamed for anything.  The problem with adding redundancy is the complexity that comes with it. Sadly, in the lust for more 9s of uptime, you can build what I call the Rube Goldberg Network.

For those of you not familiar with what a Rube Goldberg machine is, here's the definition from Wikipedia:

"A Rube Goldberg machine, contraption, invention, device, or apparatus is a deliberately over-engineered or overdone machine that performs a very simple task in a very complex fashion..."

And a little Mythbusters example:

After watching this video, if you feel like Adam Savage when trying to talk about your network resiliency and redundancy, you may have a Rube Goldberg Network.

One of the great challenges with building a Rube Goldberg machine is that everything must go perfectly in order for it to work. If there is a small failure in the machine, the whole function of the machine may be impaired.  Not a scenario you want in a network.  The goal is to build robustness, not fragility, into the network.

The Rube Goldberg network requires everything to work perfectly or you run into complex and intermittent issues when failover occurs.  And the complexity of troubleshooting these issues is much greater due to the number of moving pieces.
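To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes components fail independently, and every figure in it is made up for illustration; the function names are mine, not from any library. The idea is that a redundant pair only helps if all the failover machinery around it also works, and each of those pieces multiplies in like one more link in a chain.

```python
from math import prod


def series_availability(availabilities):
    """Availability of a chain in which every component must work."""
    return prod(availabilities)


def parallel_availability(availabilities):
    """Availability of a redundant group in which one working member is enough."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical figures, for illustration only.
single_link = 0.999

# Two independent links in parallel look great on paper.
redundant_pair = parallel_availability([single_link, single_link])

# Assume five extra moving pieces (state sync, health checks, and so on)
# that must all behave for the failover to actually help.
failover_parts = [0.999] * 5
effective = series_availability([redundant_pair] + failover_parts)

print(f"single link:     {single_link:.6f}")
print(f"redundant pair:  {redundant_pair:.6f}")
print(f"with dependencies: {effective:.6f}")
```

With these made-up numbers the "redundant" design comes out around 0.995, which is worse than the single link it was meant to improve on. Real failures are rarely independent, so treat this as a way to reason about the trade-off, not as a prediction.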

Some advice to help you avoid building a Rube Goldberg network:

  • When looking at solutions to provide redundancy, look for the simplest solution that provides what you need.  SSO sounds really awesome, but it's also more complex to troubleshoot when things don't work.  Are you hitting a bug, or is something else wrong?
  • Design within your team's skill level; don't rely on technology your team doesn't have a good understanding of.  If you need that particular technology, you may need to get the team some training on how to make it work.
  • Ask your partner/vendor/Twitter buddy what they think.  "You 'could' do that, but maybe this would be a better idea."  My coworkers joke that if I don't like something I say, "Hmmm, that's interesting." Don't give me that "I don't trust my partner/vendor."  If you don't trust them, why are they your partner/vendor?
  • Use technology as it is intended.  CCIE stupid router tricks are just that. Tricks.  Don't use tricks.
  • Test. Test. Test. Mock it up in a lab and test changes, repeatedly.  Schedule quarterly/annual DR testing to make sure it all still works after it's had months of change requests.
  • Periodically review the decisions you've made, the outages you've had, and assess whether it's working for you.  Are the stakeholders happy with the uptime/stability/resiliency of the network?