Thursday, September 06, 2007

MTBF: Fear and Loathing in the Datacenter

In "Service Level Automation in the Datacenter: Fear and the Right Thing", James Urquhart discusses the FUD surrounding the topic of shutting down servers.

I am conflicted.

I agree that most IT shops have an unhealthy aversion to shutting down servers. Naive uptime metrics don't help, and neither does the tendency for mistakes to get baked into our cultural DNA.

Having said this,

MTBF - ouch - I made the mistake of believing these numbers once upon a time. I know of one sysadmin who gave a whole new meaning to the term "reboot" - literally, give it a good swift kick, then scramble like mad to write it to tape.

I doubt many server managers will be swayed by quantitative evidence that devices can survive power cycling.

Here's my modest proposal:

Don't hesitate to reboot any machine under warranty if the repair time won't put an SLA in jeopardy. Rejoice if it results in a warranty repair. Rejoice because it creates the necessary feedback to vendors. Rejoice because it increases leverage in future vendor negotiations. Rejoice because failing during a scheduled reboot is better than failing at 3am on Sunday during that big production cutover.


LanceW said...

My two cents' worth would be that it's not the "mean" that worries IT guys.

It's Murphy's law that scares us (Well me at least).

That 1 in a thousand failure is bound to happen to the key server, the one with the backup that didn't work and we didn't notice. :(

Aloof Schipperke said...

You bring up a good point. It's a much easier sell if IT management is willing to bet their jobs that everything is working as designed, including the assurance mechanisms. Unfortunately, the day-to-day realities can quickly turn this into a negative proof.