There are many names for it. There are many descriptions of it. There are many factors that contribute to it. But no matter what you call it, how you paint it, or what you think causes it, the issue remains: it is difficult to predict when your job will complete in a public cloud environment.
This problem is the topic of many academic and corporate research projects, but from what I have seen, no one has solved it. For this discussion, I am referring strictly to the uncapped CPU (or CPUs) of virtual machines. In most cases, public cloud offerings provide uncapped virtual machines (VMs). Uncapped VMs take advantage of unused CPU cycles on a physical node by allowing a busy virtual machine to use additional available cycles. This approach allows customers to get more than they are paying for.
This is a great thing… in most cases…
The negative aspect of this approach is that a job's performance is nearly impossible to predict in this type of environment. A job run today might complete in two hours, but the exact same job might take four hours tomorrow. The typical cause of the difference is other VMs doing real work on the same physical node as the job in question. These could be the same customer’s jobs on the same VM, the same customer’s jobs on another VM collocated on the same physical node, or another customer’s jobs on another VM collocated on the same physical node. The latter two of these are often referred to as the “your own worst enemy” and the “noisy neighbor” problems of cloud computing.
Does this mean I’m being cheated by my cloud provider? The answer is no. Cloud providers usually promise a minimum rating for CPU performance; however, they often deliver a lot more. In essence, customers are typically getting a lot more than they are paying for. From a provider’s standpoint, however, the hidden danger of operating in this manner is that customers become accustomed to a level of performance that in some cases far exceeds the minimum being promised. Dissatisfaction arises when customers notice that their jobs are running much slower than they did the month before.
So, what is a customer to do…
If you have a job that needs to complete by a certain time (for example, a time-sensitive quarterly report), there are several things you can do to get the most out of your cloud, with the goal of having your job complete in a predictable time frame. By no means is the list that follows an exhaustive one. As I mentioned earlier, there is a lot of research in this area, and I am sure much brighter minds are out trying to solve the problem as I type this and as you read it. Your mileage will vary, but depending on your job, the tools you are using, and the underlying architectural and business design of your cloud provider, I hope you can find one or two of these ideas helpful.
- Regularly benchmark your job. By running your job at different hours, you might be able to determine the typical slow times on your provider’s cloud.
- Run your jobs “after hours.” This doesn’t mean you have to run your jobs between midnight and 7 a.m., but rather choose a data center in a time zone that is 8-12 hours ahead of or behind your own. What is a busy time in a US-based data center might not be in your provider’s Asia-based data center.
- Take advantage of new data centers. New data centers might initially be very underutilized, allowing you to take advantage of unused cycles.
- Understand how your job scales both vertically and horizontally. By understanding how your job scales best, you can determine an optimum number of VMs to use and what size or configuration of VM to use. Test your workload on all the configurations (virtual CPUs, memory, disk) that your provider makes available.
- Understand your tools. If you are using a tool like Hadoop (http://hadoop.apache.org/) or Condor (http://research.cs.wisc.edu/condor/description.html), make sure you know how these tools work, and try different mixes of configurations and numbers of VMs. If you have control of your tools and can build intelligence into them, you might be able to drive toward a more predictable model. If your tools can determine that a VM is performing poorly, have them provision additional or replacement VMs that can perform better. You might be able to take advantage of your provider’s ability to anti-collocate VMs.
- Test VMs at provisioning time. If you can define a small benchmark that is representative of your workload, execute it when a VM is provisioned. If it performs well, use that VM. If not, de-provision it and provision a new one. Note, there may be significant time and cost drawbacks to this approach. In addition, because a cloud tends to be very dynamic and your VM neighbors on the same node could change, this approach might not provide a huge benefit.
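To make the last idea in the list concrete, here is a minimal sketch in Python of the provision, test, and replace loop. Everything named here is hypothetical: `provision`, `deprovision`, and `run_benchmark` stand in for whatever calls your provider's API or your own tooling offers, and the threshold and attempt limit are values you would have to choose for your own workload. Treat it as an illustration of the idea, not a recipe.

```python
from typing import Callable, Optional, TypeVar

VM = TypeVar("VM")


def acquire_acceptable_vm(
    provision: Callable[[], VM],           # hypothetical: asks the provider for a new VM
    deprovision: Callable[[VM], None],     # hypothetical: releases a VM
    run_benchmark: Callable[[VM], float],  # hypothetical: runs your small benchmark, returns seconds
    max_seconds: float,                    # assumed threshold: slower than this means "reject"
    max_attempts: int = 3,                 # cap retries, since each attempt costs time and money
) -> Optional[VM]:
    """Return the first VM whose benchmark finishes within max_seconds, or None."""
    for _ in range(max_attempts):
        vm = provision()
        elapsed = run_benchmark(vm)
        if elapsed <= max_seconds:
            return vm
        # Too slow: give it back and try again.
        deprovision(vm)
    return None
```

The attempt cap is there because of the drawback noted above: every rejected VM costs provisioning time and possibly money, and a VM that passes today can still pick up a noisy neighbor tomorrow.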
And if all else fails, determine the worst-case scenario for your job. This may be hard to do, depending on what your provider publishes about the physical architecture of its cloud environment. But if you can determine the worst-case performance and your job is predictable from run to run, you should be able to work backward from your job deadline to determine when you must start it in order to complete on time.
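Working backward from the deadline is simple arithmetic, sketched below in Python. The four-hour worst case echoes the example earlier in this post; the deadline itself and the optional safety margin are my own assumptions for illustration.

```python
from datetime import datetime, timedelta


def latest_start(
    deadline: datetime,
    worst_case_runtime: timedelta,
    safety_margin: timedelta = timedelta(0),  # assumed extra cushion, not something your provider gives you
) -> datetime:
    """Latest time the job can start and still finish by the deadline in the worst case."""
    return deadline - worst_case_runtime - safety_margin


# Hypothetical example: report due at 17:00, worst case of four hours, 30-minute cushion.
due = datetime(2024, 3, 29, 17, 0)
start_by = latest_start(due, timedelta(hours=4), timedelta(minutes=30))
print(start_by)  # 2024-03-29 12:30:00
```

If the job usually finishes in two hours, starting at the worst-case time means it will often finish early. That is the price of predictability.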
As mentioned earlier, this list is definitely not exhaustive. What do you do? What would you suggest?