Modern cloud infrastructures are built on thousands of highly distributed servers that provide services directly to customers over the Internet. The service provider has two extremely important objectives, which unfortunately conflict to some degree:
- Ensure continuous availability of the cloud service.
- Contain the cost of the infrastructure and administration (CAPEX and OPEX).
Several factors affect the availability of services, most of them related to infrastructure failures. These include not only unrecoverable hardware outages but also recoverable OS or middleware failures.
Not so long ago, the most common approach to high availability was to deploy infrastructures with the highest possible mean time to failure (MTTF). This required expensive systems and assumed that error-free software applications could be written. It was also assumed that some degree of downtime was acceptable, with vendors boasting of the number of 9s that they could support (for example, 99.999% availability). In today’s always-on Internet, any downtime of a major service becomes headline news. The traditional approach is no longer applicable, and a new approach has to be considered.
To reduce infrastructure costs, service providers use commodity hardware. To reduce operational costs, hardware failures are commonly dealt with by directly replacing the failed component rather than by manual debugging and recovery performed by skilled (and expensive) administrators. Thus, to maintain continuous availability of the service, the cloud system must be built to expect failure of the underlying infrastructure. It must assume not only temporary outages but also that components will disappear forever. This cannot be limited to hardware components, because no matter how well a software element is tested, unexpected edge conditions will appear at some time. So, to guarantee continuous availability, a cloud solution must expect its own components to fail too.
Given that we are forced to expect failure, the high-MTTF approach is no longer valid. Instead, we have to increase availability by minimizing mean time to recovery (MTTR): the quicker the system can recover from failure, the higher the availability of the service will be. However, given that even a tiny percentage of downtime is no longer acceptable, we also need a means to maintain service availability during the recovery process. One way of doing this is to provide redundancy for all critical services within the cloud solution.
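The trade-off between MTTF and MTTR can be made concrete with the standard steady-state formula, availability = MTTF / (MTTF + MTTR). The following Python sketch applies it to two hypothetical systems. The MTTF and MTTR figures are illustrative assumptions chosen for this example, not measurements of any product.

```python
HOURS_PER_YEAR = 24 * 365  # non-leap year


def availability(mttf_hours, mttr_hours):
    """Steady-state availability: the fraction of time the service is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(avail):
    """Expected downtime per year for a given availability."""
    return (1.0 - avail) * HOURS_PER_YEAR * 60


# Assumed numbers: an expensive system that rarely fails but needs
# hours of manual repair...
expensive = availability(mttf_hours=10_000, mttr_hours=4)

# ...versus commodity hardware that fails ten times as often but
# recovers automatically in 36 seconds (0.01 hours).
commodity = availability(mttf_hours=1_000, mttr_hours=0.01)

print(f"expensive, slow recovery: {expensive:.5%}")
print(f"commodity, fast recovery: {commodity:.5%}")
print(f"five nines allows {downtime_minutes_per_year(0.99999):.2f} min/year of downtime")
```

Under these assumptions, the commodity system with fast automated recovery achieves higher availability than the expensive one, even though it fails far more often.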
IBM SmartCloud Provisioning is designed according to the principles of recovery-oriented computing (ROC). It is based on a highly distributed, redundant, and robust infrastructure that offers near-zero downtime and automated recovery across heterogeneous platforms. It does not require expensive systems, but can run on a relatively low-cost commodity infrastructure.
The key factors that allow IBM SmartCloud Provisioning to be a low-touch and robust cloud infrastructure are as follows:
- The infrastructure is as stateless as possible, which avoids issues related to single points of failure.
- Management agents are deployed on the physical nodes of the infrastructure (compute nodes and storage nodes) and are connected in a peer-to-peer network to form a self-monitoring and self-managing infrastructure.
- Core services are redundant, being deployed in clusters to tolerate individual faults.
- Master images are replicated in multiple copies across the storage nodes in the storage cluster; this tolerates hardware failures of the storage nodes in the cluster and also network failures when accessing one copy of the image.
- Hypervisor (compute) nodes are deployed through a stateless boot, so a failing hypervisor can be redeployed simply by rebooting it to get a fresh copy of the hypervisor image. This approach also makes it easy to deploy new nodes when needed to augment the capacity of the infrastructure.
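The benefit of keeping multiple copies of a master image can be sketched as a read that falls back to the next copy when one storage node is unreachable. This is only an illustration of the idea: `fetch_image`, `ReplicaUnavailable`, and the node and image names are hypothetical and do not reflect the product's actual interfaces.

```python
class ReplicaUnavailable(Exception):
    """Raised when a storage node cannot serve the requested image."""


def fetch_image(image_id, replicas, read_from):
    """Try each storage node in turn and return the first successful read."""
    errors = []
    for node in replicas:
        try:
            return read_from(node, image_id)
        except (ReplicaUnavailable, OSError) as exc:
            # Hardware or network failure on this copy: try the next one.
            errors.append(f"{node}: {exc}")
    raise ReplicaUnavailable(
        f"no replica could serve {image_id}: " + "; ".join(errors)
    )


# Simulated storage cluster in which the first node is down.
def simulated_read(node, image_id):
    if node == "storage-1":
        raise ReplicaUnavailable("node is down")
    return f"{image_id} served by {node}"


result = fetch_image(
    "master-image", ["storage-1", "storage-2", "storage-3"], simulated_read
)
print(result)
```

The caller sees a failure only if every copy is unavailable, which is why losing a single storage node does not interrupt provisioning.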
Let’s consider several typical failure scenarios that can happen in a real environment, and let’s see how IBM SmartCloud Provisioning is designed to tolerate them and react appropriately.
The first example concerns the management agents that IBM SmartCloud Provisioning uses to perform standard provisioning operations.
Management agents are deployed on both the compute nodes and the storage nodes and are organized in dynamic hierarchies, where a leader (manager) is dynamically elected. The leader is just the entry point for distributing the requests across the infrastructure and a coordinator of any operation, but this role does not imply any special information being associated with the agent itself (stateless infrastructure): any agent can be a leader.
All agents have a watch-dog mechanism that is used to prevent, detect, and correct failures. Agents also monitor their neighbors and can take simple actions to fix other agents’ issues.
So, if an agent fails, the watch-dog mechanism tries to restart it. If the watch-dog is not able to restart the agent, neighbors try simple actions to restart the failing agent. If the agent cannot be restarted, the system keeps on working without that node, thanks to the redundant infrastructure.
If the failing agent was a leader and it cannot be restarted, the remaining agents dynamically elect a new leader. No information is lost, because the leader role carries no special state.
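The escalation described above (local watch-dog, then neighbors, then exclusion) and the re-election of a leader can be sketched as follows. This is a simplified simulation under stated assumptions: the `Agent` fields are simulation knobs, and the election rule (lowest identifier among live agents) is a hypothetical stand-in, since the election algorithm is not described here.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    agent_id: int
    alive: bool = True
    # Simulation knobs: whether each recovery step would succeed.
    watchdog_can_restart: bool = True
    neighbors_can_restart: bool = True


def recover(agent):
    """Escalate: local watch-dog first, then neighbors, then give up."""
    if agent.alive:
        return "healthy"
    if agent.watchdog_can_restart:
        agent.alive = True
        return "restarted by watch-dog"
    if agent.neighbors_can_restart:
        agent.alive = True
        return "restarted by neighbors"
    return "excluded"  # the system keeps working without this node


def elect_leader(agents):
    """Hypothetical rule: the live agent with the lowest id becomes leader."""
    live = [a for a in agents if a.alive]
    if not live:
        raise RuntimeError("no live agents left to elect a leader")
    return min(live, key=lambda a: a.agent_id)


agents = [Agent(1), Agent(2), Agent(3)]
first_leader = elect_leader(agents)

# The leader fails and neither the watch-dog nor the neighbors can restart it.
first_leader.alive = False
first_leader.watchdog_can_restart = False
first_leader.neighbors_can_restart = False
outcome = recover(first_leader)

# Any agent can be leader, so the survivors simply elect a new one.
new_leader = elect_leader(agents)
print(outcome, "-> new leader is agent", new_leader.agent_id)
```

Because the leader holds no state of its own in this model, electing a replacement requires no data transfer or recovery step.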
Another example is related to failures either in a storage node or in a compute node.
If a storage node fails, the deployment of VMs can continue without issues, thanks to the redundant deployment and the multiple copies of each image available in the storage cluster. Meanwhile, the leader agent tries to restart the failing node.
If a compute node fails, the leader detects the failure and stops sending requests to that node. It also tries to restart the node, forcing a fresh copy of the compute node to be redeployed through PXE boot.
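The leader's handling of a failed compute node can be sketched as a dispatcher that removes the node from rotation while it is being redeployed. The `Dispatcher` class, its round-robin policy, and the node names are assumptions made for illustration. The actual PXE reboot is represented only by moving the node to a `rebooting` list.

```python
class Dispatcher:
    """Toy model of a leader distributing requests across compute nodes."""

    def __init__(self, nodes):
        self.healthy = list(nodes)
        self.rebooting = []
        self._next = 0

    def mark_failed(self, node):
        """Stop routing to the node and schedule its redeployment."""
        if node in self.healthy:
            self.healthy.remove(node)
            # Stand-in for triggering a stateless PXE reboot of the node.
            self.rebooting.append(node)

    def mark_recovered(self, node):
        """Return a freshly redeployed node to the rotation."""
        if node in self.rebooting:
            self.rebooting.remove(node)
            self.healthy.append(node)

    def dispatch(self, request):
        """Pick the next healthy node (round-robin, an assumed policy)."""
        if not self.healthy:
            raise RuntimeError("no healthy compute nodes available")
        node = self.healthy[self._next % len(self.healthy)]
        self._next += 1
        return node


leader = Dispatcher(["compute-1", "compute-2", "compute-3"])
leader.mark_failed("compute-2")

# Requests keep flowing, but only to the surviving nodes.
targets = [leader.dispatch(f"vm-{i}") for i in range(4)]
print(targets)

leader.mark_recovered("compute-2")
```

In this sketch, a failed node costs only capacity for the duration of the reboot. Requests are never routed to it, so its failure is not visible to users.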
If you’re interested in trying the IBM SmartCloud Provisioning product, you can download a trial version from the following link: