Following the well-publicized Amazon EC2 failure, VMware's vCloud architect Massimo Re Ferre wrote an interesting post arguing that cloud infrastructure providers should offer reliable TCP-like services rather than UDP-like services. In his opinion, cloud infrastructure should have high-availability and disaster recovery mechanisms similar to the current VMware offerings. While I agree that some application and platform cloud services must be highly available, trying to meet very high availability requirements (four or five 9s) with cloud infrastructure services will just make them more complex and more expensive.
To avoid the pitfalls of trying to define what reliable means for cloud infrastructure services, we'll focus on high-availability requirements because when people talk about reliable infrastructure, they usually mean a highly-available infrastructure.
Looking at the bigger picture, some cloud services clearly have to be highly available. Let's look at various categories of cloud services, going from application services down the stack to cloud infrastructure services:
• Application services, such as Software as a Service or SaaS, must be at least as reliable as their more traditional counterparts. SaaS users can't add a layer of their own to increase reliability. For example, if you use Gmail and it happens to be down, there's nothing you can do to get access to your email. Database as a Service is similar because if you work around the network layer, databases look just like specialized applications.
• Platform services that use Platform as a Service or PaaS, such as Google App Engine, also have to offer inherent resilience and high availability. If you deploy your Web-based application as a bunch of modules on a cloud platform, you can't control where those modules will be executed, how the platform will handle the increased load—apart from tweaking a few parameters from the control panel—or how it will handle global load distribution.
• Infrastructure services, such as Infrastructure as a Service and Storage as a Service, are quite different. Making them highly available would solve some problems for people who don't know how to design and deploy a properly scaled-out application architecture. My suggestion is that those people should really use PaaS or SaaS services, because at least some of those services address high-availability requirements. Even a highly available infrastructure wouldn't greatly increase application availability: software crashes and configuration errors are far more common than infrastructure failures like the one Amazon experienced, and the servers running as virtual machines in IaaS cloud services still need to be patched or upgraded. Investing in high availability at the virtual machine level (ensuring the virtual machine is always up and running) with mechanisms like VMware's Fault Tolerance would lead only to marginal improvement in application uptime.
Clearly, this is my personal opinion, as we lack any statistically significant, long-term historical data. But I am positive that a cloud infrastructure provider experiencing frequent or prolonged outages won't stay in business long enough to matter.
Furthermore, when using scaled-out application architecture where every tier—Web servers, application servers and database servers—runs on multiple parallel server instances, it's easier to achieve geographically distributed processing and adaptive load balancing. If you want to deploy a robust application in an IaaS environment, the use of comprehensive load balancing mechanisms is a must. Incidentally, those same mechanisms make your application highly available even when the underlying servers (or infrastructure) fail, and even more so if you balance the load between different data centers.
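The effect of running every tier on multiple parallel instances can be illustrated with a little availability arithmetic. The sketch below is not based on measured data: the 99% per-instance figure, the three-tier layout and the function names are assumptions chosen for illustration, and the parallel formula assumes that instances fail independently of each other.

```python
from math import prod


def parallel_availability(instance_availability: float, instances: int) -> float:
    """Availability of one tier that stays up while at least one of its
    instances is up. Assumes instance failures are independent."""
    return 1.0 - (1.0 - instance_availability) ** instances


def serial_availability(tier_availabilities: list[float]) -> float:
    """Availability of a chain of tiers that must all be up at once."""
    return prod(tier_availabilities)


# Hypothetical figure: each server instance is up 99% of the time.
single = 0.99

# Web, application and database tiers, one instance each.
one_per_tier = serial_availability([single] * 3)

# The same three tiers, each load-balanced across three instances.
three_per_tier = serial_availability([parallel_availability(single, 3)] * 3)

print(f"one instance per tier:    {one_per_tier:.6f}")
print(f"three instances per tier: {three_per_tier:.6f}")
```

Under these assumptions, three fairly unreliable instances per tier beat a single instance per tier by several orders of magnitude in expected downtime. The independence assumption is the weak point: a failure that takes out a whole data center hits all instances in it at once, which is why balancing the load between different data centers matters.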
Cloud infrastructure design: A cost and complexity reality check
Every engineer designing a cloud solution would love to build a highly available cloud infrastructure for IaaS, where every single component would be redundant and the failover mechanisms would ensure five 9s (99.999%) or better availability, but the cost and complexity realities usually stop us from doing that. A highly available infrastructure would definitely not be cost-competitive compared to "just-good-enough" cloud platforms, and we all know that IaaS buyers look primarily at the cost.
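To put the "9s" in perspective, an availability target can be converted into a yearly downtime budget. The following is a minimal sketch; the function name is made up for this example, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum yearly downtime (in minutes) allowed by an availability target."""
    return MINUTES_PER_YEAR * (1.0 - availability_percent / 100.0)


# Two, three, four and five 9s.
for pct in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% availability -> {minutes:.1f} minutes of downtime per year")
```

Five 9s leaves roughly five minutes of total downtime per year, planned or unplanned, which helps explain why every additional 9 adds so much redundancy, complexity and cost.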
From the service provider perspective, a high-availability IaaS service is a lose-lose proposition. If customers are experienced enough to understand the realities of unreliable infrastructure, and know the application designs they can use to deploy highly available applications on top of it (like the one used by Evernote), they won't pay extra for reliability they don't need. But if they're new to the cloud world and assume that cloud services never fail, or if their applications aren't mission-critical, they won't be willing to pay the premium price because they won't understand the need for extra infrastructure complexity and the resulting higher cost.
The optimum cloud infrastructure route is probably the one Amazon took: Build a cost-optimized but still highly reliable infrastructure, offer the services necessary to build reliable scaled-out applications, such as elastic load balancing, and try to educate customers about how to use those services to increase the reliability of their applications.
About the author: Ivan Pepelnjak, CCIE No. 1354, is a 25-year veteran of the networking industry. He has more than 10 years of experience in designing, installing, troubleshooting and operating large service provider and enterprise WAN and LAN networks and is currently chief technology advisor at NIL Data Communications, focusing on advanced IP-based networks and Web technologies. His books include MPLS and VPN Architectures and EIGRP Network Design. Check out his IOS Hints blog, and ask him your networking questions at SearchTelecom.com's Ask the Expert.