Hadoop meet Cloud, Cloud meet Hadoop. It was love at first sight. Hadoop found in Cloud the wind beneath its wings, a loyal companion; available any time she needed him, flexible, and elastic. Cloud found in Hadoop its partner in life to share and discover new things; together anything was possible. Structured, semi-structured or unstructured data could not stop Hadoop, and Cloud was always there for her.
If you are new to Hadoop and cloud computing, you probably have no idea what the first paragraph was about. Let me step back and quickly tell you the story of cloud computing, and then of Hadoop. Then you will understand why they make a perfect marriage.
Cloud computing is a new model for delivering computing resources. It gives the illusion of having an infinite amount of resources available on demand. Users do not have to commit to a given software or hardware platform. They can simply rent the resources they need and, even better, pay only for what they use, for the amount of time they use it. Cloud computing enables users, through automation, to self-provision the resources they need. This automation, the availability of predefined resources, and economies of scale are what bring costs down, allowing any individual to start working on a server for as little as 10 cents per hour. Cloud computing usage has grown exponentially. According to this survey, most companies anticipate using the cloud for most or all of their needs by 2015.
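To make the pay-per-use idea concrete, here is a minimal sketch of the arithmetic. The only number taken from this post is the 10 cents per hour; the function name and the cluster size are my own illustrative assumptions, not any provider's actual pricing.

```python
# Illustrative only: the rate comes from the "10 cents per hour" figure above;
# the function name and the example cluster size are hypothetical.
RATE_CENTS_PER_SERVER_HOUR = 10

def rental_cost_cents(servers, hours):
    """Cost of renting `servers` machines for `hours` hours, in cents."""
    return RATE_CENTS_PER_SERVER_HOUR * servers * hours

# Renting 20 servers for a 3-hour job, then releasing them.
cost = rental_cost_cents(servers=20, hours=3)
print(f"${cost / 100:.2f}")
```

The point is simply that the bill stops when the servers are released, so a short burst on many machines can cost very little.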
Even with all this power and availability on the hardware side, there are major hurdles on the software side to manipulating the vast amounts of data, known as big data, that are being collected daily. Volume, velocity, and variety (V3) are terms often used today to describe the characteristics of big data. The world, as IBM describes it with its Smarter Planet initiative, is now instrumented, interconnected, and intelligent (I3). This means that more and more data is being collected (volume), and is being collected at incredible speeds (velocity) from different sources such as sensors, Twitter, Facebook, and so on. It is said that about 80% of the data collected is unstructured (variety). This makes it hard to manipulate with existing relational database software, which mainly manages structured data. This is why Hadoop was born.
Hadoop began with papers published by Google that described how Google was manipulating the data it was collecting with its services. This information was used by an engineer from Yahoo to develop Hadoop, an open source Java framework. Hadoop consists mainly of two components: a new file system (HDFS) and a new way to write programs (MapReduce). Using commodity hardware, HDFS replicates blocks of data across many nodes in the cluster, which provides data reliability. With MapReduce, the code is sent to the nodes in the cluster closest to where the data to be manipulated resides, and those nodes then process the blocks of data in parallel. The result is fast and reliable output. Hadoop works well for large files accessed sequentially. It is very fast, but it works like a batch job, so the results are not immediate.
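If the MapReduce idea feels abstract, a tiny single-machine sketch may help. This is plain Python, not the Hadoop API: the function names and the two sample "blocks" are assumptions made up for illustration, and a real cluster would run the map step in parallel on the nodes that hold each block.

```python
from collections import defaultdict

def map_phase(block):
    """Map step: emit a (word, 1) pair for every word in one block of text."""
    return [(word.lower(), 1) for word in block.split()]

def shuffle(mapped_pairs):
    """Group the emitted counts by word, as happens between map and reduce."""
    groups = defaultdict(list)
    for word, count in mapped_pairs:
        groups[word].append(count)
    return groups

def reduce_phase(groups):
    """Reduce step: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(blocks):
    """Run map, shuffle, and reduce over a list of text blocks."""
    mapped = []
    # Here the blocks are mapped one after another; on a real cluster each
    # block would be mapped in parallel on a node that stores it.
    for block in blocks:
        mapped.extend(map_phase(block))
    return reduce_phase(shuffle(mapped))

# Two hypothetical blocks standing in for pieces of a large file.
blocks = ["big data is big", "data is everywhere"]
print(word_count(blocks))
```

Each block is mapped independently of the others, which is what lets Hadoop spread the work across many nodes and combine the partial results at the end.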
In summary, Hadoop and Cloud working together deliver value. Without Hadoop, users cannot see all the benefits they can get from the cloud. Without the cloud, users cannot see all the benefits they can get from Hadoop.
Hadoop and Cloud are a perfect marriage; we hope to see kids coming soon!