Hadoop: What It Is and How It Works
posted by John Spacey, November 29, 2012
Hadoop is an open source framework from Apache that can be used to process big data sets using distributed systems (such as cloud infrastructure).
What can Hadoop do?
Hadoop can break big processing tasks into many small tasks and distribute those tasks to commodity computers (e.g. on a cloud). In other words, it allows large problems to be solved using large numbers of computers.
Hadoop also includes a distributed, fault-tolerant filesystem that can handle big data.
One of the advantages of Hadoop is that it can process structured data (e.g. XML) and unstructured data (e.g. images) together.
Hadoop is used to solve a variety of business and scientific problems. For example, the New York Times used Hadoop to create 11 million PDFs from 4 terabytes of images in 24 hours.
Yahoo, an early adopter, has used Hadoop to build its web search index and to calculate metrics from big sets of unstructured data.
What is MapReduce?
Hadoop implements MapReduce — a model for parallel processing developed by Google. MapReduce solves problems by breaking them into two steps:
Map: A master node takes a problem, divides it into sub-problems and distributes those sub-problems to worker nodes. Worker nodes may, in turn, break their sub-problems down further and distribute them.
Reduce: The master node collects the solutions to the sub-problems from the worker nodes and combines them to form the answer to the original problem. Worker nodes may also perform reduce steps of their own before reporting back.
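The two steps above can be sketched in a few lines. This is a minimal, single-process illustration of the MapReduce model (plain Python, not the Hadoop API), counting words across a set of documents:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (key, value) pair for each word in a document.
    return [(word, 1) for word in document.split()]

def reduce_phase(key, values):
    # Reduce: combine all the values emitted for one key.
    return key, sum(values)

def map_reduce(documents):
    # Shuffle: group intermediate pairs by key, as the framework would
    # before handing each group to a reducer.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, v) for k, v in groups.items())

counts = map_reduce(["big data", "big problems big computers"])
# counts == {"big": 3, "data": 1, "problems": 1, "computers": 1}
```

In Hadoop the map and reduce functions run in parallel on many machines, and the shuffle happens over the network, but the division of labour is the same.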
How does Hadoop work?
Hadoop implements MapReduce using two logical layers: the Hadoop Distributed File System and a MapReduce engine.
Hadoop Distributed File System
Hadoop Distributed File System (HDFS) is a scalable distributed filesystem that can store very large files and/or large numbers of files. Files are replicated to make the file system fault tolerant.
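The replication idea can be sketched as follows (hypothetical names and a simplified round-robin policy; real HDFS splits files into large blocks and the NameNode decides replica placement using rack awareness):

```python
def place_replicas(blocks, nodes, replication=3):
    # Assign each block to `replication` distinct nodes, round-robin,
    # so that losing any single node still leaves copies elsewhere.
    placement = {}
    for i, block in enumerate(blocks):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement

plan = place_replicas(["blk_1", "blk_2"], ["node-a", "node-b", "node-c", "node-d"])
# blk_1 is stored on node-a, node-b and node-c; blk_2 on node-b, node-c and node-d
```

With three replicas per block, any one machine can fail and every block is still readable from two other nodes.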
Hadoop can also be deployed with an alternative distributed file system such as the Amazon S3 filesystem.
MapReduce Engine
Hadoop's MapReduce engine uses a master process known as the JobTracker to break jobs into tasks, which it assigns to worker processes known as TaskTrackers.
Hadoop attempts to schedule each task close to the data it requires (ideally on the same physical machine), which keeps network traffic low.
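Data locality can be sketched as a scheduling preference (hypothetical structures, not the JobTracker's actual algorithm): given a task's input block and a map of which nodes hold which blocks, prefer an idle node that already stores that block.

```python
def pick_node(task_block, block_locations, idle_nodes):
    # Prefer an idle node that already holds the task's input block;
    # otherwise fall back to any idle node (the data must then travel
    # over the network).
    local = [n for n in block_locations.get(task_block, []) if n in idle_nodes]
    return local[0] if local else idle_nodes[0]

locations = {"blk_7": ["node-b", "node-c"]}
pick_node("blk_7", locations, ["node-a", "node-b"])  # "node-b" (data-local)
pick_node("blk_9", locations, ["node-a"])            # "node-a" (no local copy)
```

Moving computation to the data is usually cheaper than moving big data to the computation, which is why this preference matters at scale.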
How do I use Hadoop?
A typical enterprise deployment of Hadoop passes big data from social, enterprise, legacy and industry data sources to a Hadoop instance for processing.
When Hadoop processing completes, the results are imported into a database (RDBMS) or an enterprise application such as a business intelligence (BI) tool, analytics engine or ERP system.