Perhaps you heard about Amazon’s recent technical problems with their EC2 Cloud service; several weeks ago Amazon experienced a series of issues that resulted in extended downtime of one of their Availability Zones. Long story short – many websites and services that rely on this zone of Amazon’s cloud service went partially or entirely down – including well-known sites such as HootSuite, Foursquare, Reddit, and Quora. Perhaps you read about it – just about every technology blog around was all over it, as were CNN and much of the mainstream media. The outage continued for the better part of two days – and the ensuing questions about the reliability of cloud services continue.
But Amazon isn’t the only technology company working through major technical issues. Sony is still in the midst of resolving a major hacker intrusion that resulted in the compromise of personal information potentially including logins, passwords, names, addresses, email addresses, and certain credit card information (though as many have noted – not the CCV security codes that would allow card numbers to be used freely). The issue began with the PlayStation Network – used by PlayStation 3 owners to play multiplayer games online and use a variety of online services such as Netflix. However, the damage was discovered to have reached even further, to Sony Online Entertainment – developer and publisher of massively multiplayer online (MMO) games such as DC Universe Online and the EverQuest series. These services continue to be down as Sony rebuilds their networks.
But enough about WHAT has happened. The point is these companies have experienced major errors that resulted in significant downtime in their services – and have raised major questions concerning their stability and security.
This is something that all of us in the IT industry should be able to sympathize with and learn from. If you’ve never encountered a major technical issue resulting in significant damage and/or downtime to your system, count yourself lucky and be forewarned. Let’s take a look at how these companies have responded to their issues.
Amazon’s primary responsibility was to the websites and services that rely on the affected network. The Amazon Web Services (AWS) team uses a “Health Dashboard” to log the status of their services – and during the outage they issued many status updates concerning their findings, progress, and expectations. Just as importantly, they continued to stress their apologies, concerns, and dedication to resolving the issue as they went. After the outage, they issued a very detailed summary explaining the causes, complications, and solutions to their issues. Additionally, they are offering customers who were in the affected zone a 10-day credit.
We aren’t likely to see a similarly detailed report of the PSN/SOE downtime, as it was an external criminal action as opposed to an internal error. This was an exploitation of a weakness in their system – and while they have said they are rebuilding everything, don’t expect many details about what that previous vulnerability was. Additionally, AWS was explaining their very technical issue to their customers – themselves technology companies (insofar as they operate web-based services). Sony’s customer base of individual PlayStation 3 users wouldn’t benefit in the same way from an equally in-depth explanation.
However, Sony can’t afford to say nothing. They have been issuing status updates through their PSN and SOE sites – and more recently have put out a press release concerning expected downtimes and potential culprits of the attack. In a similar move to Amazon, they are offering subscription credit and potential “goodwill” gestures to thank users for their patience.
So what can we learn? You have to expect the worst sometimes. Know what your contingency plans are for major issues in whatever service your business provides – this could be anything from website hosting to development data being lost to major personnel issues (read: management scandal). How would your organization react? How should you react? Would you know what steps to take to control damage, resolve issues, and communicate with customers? Part of this process will inevitably come down to making the right decisions and responding when these issues hit. Cool heads will always prevail.