Infrastructures.Org: Best Practices in Automated Systems Administration and Infrastructure Architecture: Disaster Recovery

Disaster Recovery

The fewer unique bytes you have on any host's hard drive, the better -- always think about how you could recreate that hard drive quickly, using the least skilled person in your group, if it were to fail.

The test we used when designing infrastructures was "Can I grab a random machine that's never been backed up and throw it out the tenth-floor window without losing sysadmin work?" If the answer was "yes", then we knew we were doing things right.

Likewise, if the entire infrastructure, our enterprise cluster, were to fail because of a power outage or a terrorist bomb (this was New York, 1997), then we should expect replacing the whole infrastructure to take no more time than replacing a conventionally managed UNIX host.

We originally started with two independent infrastructures -- one for developers, whom we used as beta testers for infrastructure code, and one for traders, who were on a separate production-floor infrastructure in another building, on a different power grid and PBX switch. This gave us the unexpected side benefit of having two nearly duplicate infrastructures -- we were able to use the development infrastructure successfully as the disaster-recovery site for the trading floor.

In tests we were able to recover the entire production floor -- including servers -- in under two hours. We did this by co-opting our development infrastructure. This gave us full recovery of applications, business data, and even the contents of traders' home directories and their desktop color settings. This was done with no hardware shared between the two infrastructures, and with no "standby" hardware collecting dust, other than the disk space needed to periodically replicate the production data and applications into a protected space on the development servers. We don't have space here to detail how the failover was done, but you can deduce much of it by thinking of the two infrastructures as two single machines -- how would you allow one to take on the duties and identity of the other in a crisis? With an entire infrastructure managed as one enterprise cluster you can have this kind of flexibility. Change the name and reboot...

If you recall, in our model the DNS domain name was the name of the enterprise cluster. You may also recall that we normally used meaningful CNAMEs for server hosts -- gold.mydom.com, sup.mydom.com, and so on. Both of these facts were integral to the failover scenario mentioned in the previous paragraph, and should give you more clues as to how we did it.
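
To make those clues a little more concrete, here is a minimal Python sketch of one way service CNAMEs could be repointed at standby hosts during a failover. It is not the procedure we actually used, which is not detailed here; it only illustrates why naming services by CNAME makes the switch cheap. The host names (prodsrv1, devsrv1 and so on), the Cname class, and both functions are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cname:
    alias: str   # service name that clients use, e.g. "gold"
    target: str  # host currently providing that service


def failover_records(records, standby_for):
    """Return new CNAME records with every alias moved to its standby host.

    standby_for maps a failed host to the host that takes over its duties.
    A missing entry raises KeyError, so a service with no standby is
    noticed immediately rather than left pointing at a dead machine.
    """
    return [Cname(r.alias, standby_for[r.target]) for r in records]


def render_zone(records, domain):
    """Render the records as zone-file style CNAME lines for one domain."""
    return "\n".join(
        f"{r.alias}.{domain}. IN CNAME {r.target}.{domain}." for r in records
    )


if __name__ == "__main__":
    # Assumed layout: two production servers, each with a development twin.
    production = [Cname("gold", "prodsrv1"), Cname("sup", "prodsrv2")]
    standby = {"prodsrv1": "devsrv1", "prodsrv2": "devsrv2"}
    print(render_zone(failover_records(production, standby), "mydom.com"))
```

Because clients only ever refer to gold.mydom.com or sup.mydom.com, nothing on the clients has to change in this sketch; only the records behind the aliases move.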