The perfect marriage: Hadoop and Cloud

Hadoop, meet Cloud. Cloud, meet Hadoop. It was love at first sight. Hadoop found in Cloud the wind beneath her wings: a loyal companion, available any time she needed him, flexible and elastic. Cloud found in Hadoop his partner in life, someone to share and discover new things with; together, anything was possible. Structured, semi-structured, or unstructured data could not stop Hadoop, and Cloud was always there for her.

If you are new to Hadoop and cloud computing, you probably have no idea what the first paragraph was about. Let me step back, and tell you quickly the story of cloud computing, and then of Hadoop. Then, you will understand why they have a perfect marriage.

Cloud computing is a new model for delivering computing resources. It gives the illusion of having an infinite amount of resources available on demand. Users do not have to commit to specific software or hardware. They can simply rent the resources they need and, even better, pay only for what they use, for the amount of time they use it. Cloud computing enables users, through automation, to self-provision the resources they need. This automation, the availability of pre-defined resources, and economies of scale are what bring costs down, allowing any individual to start working on a server for as little as 10 cents per hour. Cloud computing usage has grown exponentially. According to this survey, most companies anticipate using the cloud for most or all of their needs by 2015.
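To make the pay-per-use idea concrete, here is a minimal sketch of the arithmetic. It assumes a flat rate of 10 cents per server-hour (the figure mentioned above); the function name and the two workloads are hypothetical, and real providers have more complicated pricing.

```python
def rental_cost_dollars(nodes, hours, rate_cents_per_hour=10):
    """Pay-per-use cost: servers x hours x hourly rate.

    Assumes a flat hourly rate in cents and no upfront commitment.
    Working in whole cents avoids floating-point rounding surprises.
    """
    total_cents = nodes * hours * rate_cents_per_hour
    return total_cents / 100


# Hypothetical workloads, for illustration only.
one_server_for_a_day = rental_cost_dollars(nodes=1, hours=24)
hundred_servers_for_three_hours = rental_cost_dollars(nodes=100, hours=3)

print(one_server_for_a_day)
print(hundred_servers_for_three_hours)
```

Under these assumptions, a short burst on many servers is billed only for the hours actually used, which is the property that makes renting a large cluster for one job affordable.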

Even with all this power and availability (from a hardware point of view), there are major hurdles (from a software perspective) to manipulating the vast amounts of data, or big data, being collected daily. Volume, velocity, and variety (V3) are terms often used today to describe the characteristics of big data. The world, as IBM describes it with its Smarter Planet initiative, is now instrumented, interconnected, and intelligent (I3). This means that more and more data is being collected (volume), and it is being collected at incredible speeds (velocity) from different sources such as sensors, Twitter, Facebook, and so on. It is said that about 80% of the data collected is unstructured (variety). Such data is hard to manipulate with existing relational database software, which mainly manages structured data. This is why Hadoop was born.

Hadoop began with papers published by Google that described how Google was manipulating the data it collected through its services. This information was used by an engineer from Yahoo to develop Hadoop, an open source Java framework. Hadoop consists mainly of two components: a new file system (HDFS) and a new way to code programs (MapReduce). Using commodity hardware, HDFS replicates blocks of data across many nodes in the cluster, which provides data reliability. With MapReduce, the code is sent to the nodes in the cluster closest to where the data to be manipulated resides, and those nodes then process the blocks of data in parallel. The result is fast and reliable output. Hadoop works well for large files accessed sequentially. It is very fast, but it works like a batch job, so the results are not immediate.
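If the MapReduce idea is new to you, the small sketch below may help. It is a plain Python simulation of the programming model, not Hadoop's actual Java API: each string stands in for one block of a file, threads stand in for cluster nodes, and all the function names are made up for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_block(block):
    """Map phase: emit a (word, 1) pair for every word in one block."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_blocks):
    """Shuffle phase: group the emitted values by key (the word)."""
    groups = defaultdict(list)
    for pairs in mapped_blocks:
        for word, count in pairs:
            groups[word].append(count)
    return groups


def reduce_counts(groups):
    """Reduce phase: sum the values collected for each word."""
    return {word: sum(values) for word, values in groups.items()}


def word_count(blocks):
    # Each block is mapped independently, so the map calls can run in
    # parallel. Threads here are only a stand-in for cluster nodes.
    with ThreadPoolExecutor() as pool:
        mapped_blocks = list(pool.map(map_block, blocks))
    return reduce_counts(shuffle(mapped_blocks))


# Three hypothetical "blocks" of one larger file.
blocks = ["Hadoop meets Cloud", "Cloud meets Hadoop", "Hadoop scales out"]
counts = word_count(blocks)
print(counts)
```

The point of the sketch is that the map step needs only its own block, which is why the work can be pushed out to wherever each block is stored. In a real cluster the shuffle moves data between nodes over the network, which this single-process example does not model.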

In summary, Hadoop and Cloud working together deliver value. Without Hadoop, users cannot see all the benefits they can get from the cloud. Without the cloud, users cannot see all the benefits they can get from Hadoop.

Hadoop and Cloud are a perfect marriage; we hope to see kids coming soon!

Raul Chong

About Raul Chong

Raul F. Chong is a senior DB2, Big Data and Cloud Program Manager at the IBM Information Management Cloud Computing Center of Competence, based at the IBM Canada Laboratory in Toronto. He works as a technical evangelist delivering presentations at educational institutions and conferences around the world showing the latest features of DB2, BigInsights, Data Studio, and related products, and how they work on the Cloud.
This entry was posted in Workloads.

3 Responses to The perfect marriage: Hadoop and Cloud

  1. Raul Chong says:

    I got a comment on Twitter about this blog post, and I just couldn't reply in 140 characters, so let me reply here. The comment was "actually #hadoop is not designed for virtualisation."

    My reply to this comment is:
    The folks from Google started working on Hadoop-related technology 10+ years ago, and at that time, no one was talking about the Cloud (maybe a bit about virtualization), so yes, you are right that Hadoop was probably not designed for virtualization, but that doesn't mean it cannot work well in a virtualized environment like the Cloud. It would be interesting to compare (in terms of performance) how Hadoop would run with, let's say, 1000 physical nodes versus 1000 nodes in the Cloud (virtualized). Unfortunately, I cannot get the 1000 physical nodes to test. I have BigInsights (IBM's Hadoop distribution) running on a 3-node cluster on the IBM SmartCloud Enterprise…maybe I could test that against a 3-node physical cluster, but I'd need some time to set it up (and find the 3 physical nodes). The 3-node test may not be representative though. Do you have any performance results comparing how Hadoop would run in a virtualized vs. a non-virtualized environment? Regardless of the results, I still believe the Cloud is the enabler for running Hadoop MapReduce jobs (fully distributed mode) that would simply not be possible (for regular guys like me) in a non-Cloud environment. Thanks for the comment!

  2. @raulchong says:

    I got this reply, again on Twitter:
    I'll add a page on the #hadoop wiki. virtualised IO performance is what hurts; CPU jobs are ok, network unpredictable.

    I replied:
    Thx! Launching instances on the same placement group in AWS (cluster compute instance) may help?

  3. Annie Rogers says:

    I believe that to manage the 3 Vs – velocity, variety and volume – the most important aspect a technology should have is seamless integration with the existing (legacy) systems. Also, the most important pain point of the three is variety. Volume and velocity can be automated, but when it comes to a variety of data sources, setting up a different system for every platform has to be done manually. This is a major task for any web data extraction company like us.
    On a side note, I loved the article and the fact that you have personified Hadoop and Cloud!
