Cloud Vanguard

Conquering the Clouds

OMG #CLOUDFAIL! All your data are belong to us!!

I have an admission. I’ve never sold traditional servers, so forgive my industry ignorance when I ask: what is the big deal with the #cloudfail meme anyway? How is every remote server failure a doomsday for cloud computing? What does a server at Microsoft have to do with a server at Rackspace or AWS or GoGrid or Salesforce or Facebook? (I’m playing it loose with the definition for the sake of discussion since Microsoft Danger is SaaS anyway.) Server failure and data loss in general have always happened and will continue to happen.

The story goes like this: a contractor that Microsoft hired didn’t perform a backup before updating the production Sidekick SAN. Microsoft didn’t follow extremely well-established operational best practices in their datacenter, and now their customer’s customers are out of contact (pun intended) and the reputation of these three companies has been sullied. This isn’t cause to cloud-hate. This should be a cause célèbre for proper application and infrastructure architecture! The cloud just forces you to acknowledge these established best practices. A friend at Google tells me that if any part of the code they write fails to compensate for infrastructure failure or issues, it gets kicked right back to them. Period.

Cabling Fail

Datacenter Operations Fail

I did a stint in corporate IT, and Microsoft’s failure is similar to the time I arrogantly neglected to shut off the computer I was repairing before taking a screwdriver to it. I blew out the entire accounting department’s unsaved afternoon productivity (in my defense, afternoons were never that productive). Was that also a #cloudfail? After all, a large percentage of data in a corporate network is stored on individual desktops or laptops. Isn’t that like remote server storage? Furthermore, a study from Carnegie Mellon reports that hard disk replacement rates are typically 2-4% per year, with up to 13% per year in some systems, largely dependent upon operating conditions and surprisingly independent of age. The operating conditions that we worked in there were terrible! Dusty and hot, and an accountant’s idea of a good computer fix was to kick the machine under their desk. Shouldn’t CIOs be more concerned about that data loss?
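To put those replacement rates in perspective, here is a rough back-of-the-envelope sketch. The fleet size of 100 machines with one disk each is my own made-up number, and the math assumes that disks fail independently and that the yearly replacement rate can be read as a per-disk yearly failure probability. Neither assumption comes from the study, so treat the results as a ballpark only.

```python
def expected_replacements(disks: int, annual_rate: float) -> float:
    """Expected number of disks replaced in one year."""
    return disks * annual_rate


def prob_at_least_one(disks: int, annual_rate: float) -> float:
    """Chance that at least one disk is replaced in a year.

    Assumes independent failures, which is a simplification.
    """
    return 1.0 - (1.0 - annual_rate) ** disks


FLEET_SIZE = 100  # hypothetical office fleet, one disk per machine

# The three rates quoted above: typical low, typical high, worst case.
for rate in (0.02, 0.04, 0.13):
    expected = expected_replacements(FLEET_SIZE, rate)
    chance = prob_at_least_one(FLEET_SIZE, rate)
    print(f"rate {rate:.0%}: expect {expected:.1f} replacements, "
          f"{chance:.1%} chance of at least one")
```

Even at the low end of 2%, that hypothetical 100-desktop office has roughly an 87% chance of losing at least one disk in a year, and nobody tweets about it.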

Then what does this epic fail mean for the Sidekick? It means Danger designed a crappy application which had no checks and balances for server synchronization to local devices. It means T-Mobile and Danger forced their Sidekick users to subscribe to this model without giving them the ability to back up their data locally. And it means Microsoft ran it on a poorly architected infrastructure and never set up any backups of critical customer data. Danger was running on dedicated hardware, the same type that any IT department would rack out. Consequently they had to do more work to solve the gotchas inherent in their own application design (such as giving users backup options). There was no standardization of data redundancy (like in S3 for instance), and apparently there was no predictable operational process in place either.

If you control your own hardware, it doesn’t matter whether it’s in a colo, a closet, or an IaaS provider; you and only you are responsible for the uptime of your application. Natural and freak disasters will happen (trucks running into your datacenter, earthquakes, fires, extended power outages, exploding power rooms). Cloud computing is an operations model, not a technology. Data loss and server failure are a reality and are never entirely avoidable. So think about the implications of all of this and don’t put all your eggs in one basket (or one disk, or one server, or one datacenter, or one availability zone, or one geographic region, or one anything). Have a contingency plan in place, invest in best practices in cloud/datacenter portability (management tools/operations procedures/standards) and avoid lock-in to any one infrastructure location. After all, if your application is your livelihood, shouldn’t you take more responsibility for it?
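The "more than one basket" idea can be sketched in a few lines. This is a minimal illustration, not anyone's production backup system: the function names `sha256_of` and `replicate`, the file name, and the directory names are all hypothetical, and the temporary directories merely stand in for separate disks, datacenters, or regions. The point it tries to show is that a copy only counts once it has been verified against the original.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Iterable, List


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def replicate(source: Path, destinations: Iterable[Path]) -> List[Path]:
    """Copy source into every destination directory and verify each copy.

    Raises IOError if a copy does not match the source checksum.
    """
    expected = sha256_of(source)
    copies = []
    for directory in destinations:
        directory.mkdir(parents=True, exist_ok=True)
        copy = Path(shutil.copy2(source, directory / source.name))
        if sha256_of(copy) != expected:
            raise IOError(f"checksum mismatch for {copy}")
        copies.append(copy)
    return copies


# Demo: temporary directories play the role of independent locations.
with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "contacts.db"
    source.write_bytes(b"example customer data")
    baskets = [Path(root) / name for name in ("disk_a", "datacenter_b", "region_c")]
    copies = replicate(source, baskets)
    print(f"{len(copies)} verified copies")
```

In real life the destinations would be genuinely independent failure domains, and the same check would run before any risky change such as a storage upgrade.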

Plan for failure, follow your runbook and know where the exit is!

October 14, 2009 Posted by | cloudfail | | 2 Comments

   
