Business Guide
simplicable technology guide   »  cloud computing   »  what is hadoop

Hadoop: What It Is and How It Works

        posted by , November 29, 2012

Hadoop is an open source framework from Apache that can be used to process big data sets using distributed systems (such as cloud infrastructure).

What can Hadoop do?

Hadoop can break big processing tasks into many small tasks and distribute those tasks to commodity computers (e.g. on a cloud). In other words, it allows large problems to be solved using large numbers of computers.

Hadoop also includes a distributed, fault tolerate filesystem that can handle big data.

One of the advantages of Hadoop is that it can process structured (e.g. xml) and unstructured data (e.g. images) together.

Hadoop is used to solve a variety of business and scientific problems. For example, the New York Times used Hadoop to create 11 million PDFs from 4 terabytes of images in 24 hours.

Google uses Hadoop to build search indexes and calculate metrics from big sets of unstructured data.

What is MapReduce?

Hadoop implements MapReduce — a model for parallel processing developed by Google. MapReduce solves problems by breaking them into two steps:

1. Map
A master node takes a problem and divides it into sub-problems. It distributes the sub-problems to worker nodes. Worker nodes may also break problems into sub-problems and distribute them.

2. Reduce
The master node collects the solutions to sub-problems from worker nodes and combines them to form the answer to the problem.

Worker nodes may also process the reduce step.

How does Hadoop work?

Hadoop implements MapReduce using two logical layers.

hadoop architecture details

Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a scalable distributed filesystem that can store very large files and/or large numbers of files. Files are replicated to make the file system fault tolerant.

Hadoop can also be deployed with an alternative distributed file system such as the Amazon S3 filesystem.

MapReduce Engine
Hadoop's MapReduce Engine uses a tool known as a JobTracker to break problems into sub-problems which it gives to worker nodes known as TaskTrackers.

Hadoop attempts to keep each sub-problem close to the data it requires (ideally on the same physical machine).

How do I use Hadoop?

A typical enterprise deployment of Hadoop passes big data from social, enterprise, legacy and industry data sources to a Hadoop instance for processing.

hadoop architecture

When Hadoop processing completes the result is imported into a database (RDBMS) or an enterprise application such as a business intelligence (BI) tool, analytics engine or ERP.

3 Shares Google Twitter Facebook

Related Articles

Enterprise Architecture
How to architect an organization.

Want to automate, monitor, measure and continually optimize your business? You might need BPM.

The 90 second version of TOGAF — a popular enterprise architecture framework.

What is the value of your EA project in 9 words or less?

ITIL 2011 (v3) identifies 25 core processes. Here they are.

Recently on Simplicable

Where the Cloud Was Born

posted by Anna Mar
It sometimes feels as if we've entered a world of cloud and now there's nothing but cloud.

9 Reasons To Go Private Cloud

posted by Anna Mar
In the future, large sophisticated organizations will own clouds, small businesses won't.

7 Reasons To Go Public Cloud

posted by Anna Mar
As fun as investing in private cloud may be, it's an expensive proposition. Public cloud has several advantages.

Why Hybrid Cloud is Inevitable

posted by Anna Mar
The decision to integrate clouds is usually a no-brainer.


about     contact     sitemap     privacy     terms of service     copyright