An Analysis of a Global Cloud Computing System Failure
This paper digs into the true root causes of Skype’s global outage last Christmas, focusing on which mechanisms, tools, and operational and engineering practices could have prevented the failure from escalating into a global outage, or from happening at all. It derives 11 practical lessons on how to build reliable clouds and other types of large-scale systems (including those built on cloud platforms). Since the first version of this paper was written, another spectacular outage has taken place: a regional outage of the Amazon EC2 cloud. Amazon’s postmortem suggests that following these guidelines would have prevented or contained that failure as well.
Alex brings over two decades of technology leadership experience. His interests include cloud, Internet-scale computing, dependable systems, and SOA. Alex built his first global cloud last century, delivered Amazon’s Auto Scaling, and launched several Cloud 2.0 initiatives. He is now working on a system designed for 1.2 billion users.