Hadoop Introduction
Rob Hughes
1
Why the need for Hadoop?
• LOTS!!! of data causes some problems:
• In 1990 a typical hard drive could store 1,370 MB of data with a transfer speed of 4.4 MB/s, so you could read all the data from a full drive in around five minutes.
• Over 20 years later, one-terabyte drives are the norm, but the transfer speed is around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk.
• Hard drive access speeds have not kept up with storage capacity. Stupid seek time!
2
How to mitigate access times
• Partition data into more-or-less equal-size chunks spread across separate drives. Process data by working in parallel.
• With a hundred drives each containing a hundredth of the data, working in parallel, you could read an entire 1 terabyte drive in under 2 minutes.
• But this creates other problems…
3
More Problems
• Increasing the amount of hardware increases the likelihood of hardware failure.
• If a computer fails, you lose part of the computation.
• If a drive fails, you lose part of your data.
• Intermediate results of processing are now stored on multiple drives. Data may need to be combined to produce a final result.
4
Again, why the need for Hadoop?
• Distributed processing (parallelism).
• Distributed data (replication).
• Fault-tolerance.
• Mechanism to combine data at key points during processing.
5
Hadoop Features
• Data Compression.
• Separation of concerns:
• Hadoop manages the complexity of data storage and replication, coordinates hundreds to thousands of machines, and provides a fault-tolerant platform for data access and job execution.
• Developers develop instead of becoming distributed-system experts. Hadoop defines an API for packaging and submitting jobs, an API hook to update job progress, and a file system to capture job results.
• Data locality.
• Scalable + commodity hardware.
6
HDFS
Hadoop Distributed File System
• Core component of Hadoop.
• Exhibits all the characteristics of a distributed file system:
• Files managed across a network of servers.
• Data file size can grow beyond the limits of a physical server.
• Scalable storage of data.
• Tolerates failure of nodes without losing access to data.
• HDFS achieves this through a high replication count: each block is stored on multiple nodes.
7
HDFS
• Designed for storing very large files and large amounts of data.
• Capable of storing petabytes of data.
• Designed to run on commodity hardware.
• Doesn't require expensive or highly available hardware.
• Runs on large clusters of inexpensive, commonly available hardware.
• Chance of node failure is high for large clusters, but HDFS is designed to survive in the face of failure.
• Optimized for write-once, read-many-times access.
• Optimized for high throughput when reading an entire dataset.
8
HDFS – Not Ideal For:
• Low-latency data access (tens of milliseconds).
• Large numbers of files.
• Due to Namenode memory constraints, the number of files is currently limited. Scales to millions but not billions of files.
• Multiple writers, or writes in the middle of files.
• Files may only be written by a single writer, and file modifications are always made at the end of a file.
9
HDFS – Blocks
• Disks are organized into blocks, the basic unit of storage.
• A block is the minimum amount of data that can be read or written.
• Filesystems perform I/O to individual disks in terms of multiple disk blocks.
• HDFS has the notion of a block as well, but the block size is much larger than in typical filesystems—64 MB by default.
• The goal is to keep the number of relatively slow disk seeks low.
• Hadoop operations are designed to operate on data the size of an HDFS block, allowing operations to access data with a single disk seek.
• Favors throughput over low latency.
• Files are stored in HDFS as one or more HDFS blocks.
• File replication in HDFS occurs at the block level.
10
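As a rough illustration (plain Java, not a Hadoop API), a file stored in HDFS occupies ceil(fileSize / blockSize) blocks; the file size below is a hypothetical example, and 64 MB is the default block size the slides mention:

```java
public class BlockCount {
    // Number of HDFS blocks needed to hold a file of the given size.
    static long blocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;           // 64 MB default block size
        long oneGb = 1024L * 1024 * 1024;             // a hypothetical 1 GB file
        System.out.println(blocks(oneGb, blockSize)); // 16 blocks
    }
}
```

Note that, unlike most local filesystems, a file smaller than one HDFS block does not occupy a full block of underlying storage.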
HDFS Architecture
• A collection of HDFS nodes/servers is known as an HDFS cluster.
• An HDFS cluster is comprised of two types of nodes (Namenode and Datanode) operating in a master-worker relationship:
• Namenode (the master) – Server running a special piece of software called the NameNode.
• Datanodes (the workers) – Servers running a special piece of software called the DataNode.
11
HDFS Architecture—Nodes
• Namenode
• Maintains the filesystem tree and the metadata for all the files and directories in the tree.
• This information is stored persistently on the local disk in two files: the namespace image and the edit log.
• It is recommended to configure HDFS to write copies of the namespace image and edit log to a remote NFS-mounted filesystem.
• Determines the mapping of blocks to Datanodes.
• Knows the Datanodes on which all the blocks for a given file are located. This information is stored in memory and not persisted to disk.
• Datanodes
• Store and retrieve blocks when requested.
• Perform block creation, deletion, and replication upon instruction from the Namenode.
• Periodically (and at system startup) report their list of blocks back to the Namenode.
12
HDFS Architecture (misc.)
• The Namenode is a single point of failure.
• The filesystem cannot be used without the Namenode.
• In the Apache Hadoop distribution, manual procedures are needed to bring another node online as the new Namenode. This involves recovering the namespace image and edit log from the failed Namenode server or from an external copy of those files.
• Hadoop 2.x offers federated Namenodes and high-availability (HA) features.
• Secondary Namenode—an optional node type.
• Not a standby for the Namenode, as the name may imply.
• Its main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.
13
HDFS Access
• Hadoop and third-party clients are used to access the HDFS filesystem:
• A Command Line Interface (CLI) program named "hadoop".
• Various Hadoop Java libraries provide programmatic filesystem access.
• A C library called libhdfs, also bundled with Hadoop.
• Clients access the filesystem on behalf of a user or program, hiding the interaction with Namenodes and Datanodes.
14
Sample HDFS Cluster
[Diagram: several racks of servers; each worker server runs a DataNode on top of its HDFS-managed local storage, and one server runs the NameNode.]
Hadoop Network Topology
• Hadoop takes a simple approach in which the network is represented as a tree.
• The "distance" between nodes is important: for high-volume data processing, the limiting factor is how rapidly data can be transferred between nodes.
• Levels in the tree are not predefined, but it is common to have levels that correspond to the data center, the rack, and the node.
• Network distance and data-locality optimizations are key features that distinguish HDFS from other distributed file systems.
16
Sample Hierarchical/Tree Network Topology
[Diagram: a logical root above racks above nodes. The HDFS cluster root sits over Rack 1 (Datanode1 … DatanodeN, plus the Namenode) through RackN (DN1, DN2, … DNN).]
17
Sample Topology Adding Data Center Layer
[Diagram: the root "/" sits above Data Center 1 and DC2; each data center contains racks (Rack 1 … RackN, R1 … RN), and each rack contains datanodes. The Namenode sits in Data Center 1.]
18
Distance between two nodes is the sum of their distances to their closest common ancestor:
• Same node: D=0
• Same rack: D=2
• Same data center, different rack: D=4
• Different data centers: D=6
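The distance rule can be sketched in a few lines of plain Java (a toy model, not Hadoop's actual NetworkTopology class): represent each node's location as a path of levels such as /dc1/rack1/node1, and count the hops from each node up to their closest common ancestor:

```java
public class TopologyDistance {
    // A location is a path of tree levels, e.g. "/d1/r1/n1" for
    // data center d1, rack r1, node n1 (names are illustrative).
    // Distance = hops from each node up to the closest common ancestor.
    static int distance(String a, String b) {
        String[] pa = a.substring(1).split("/");
        String[] pb = b.substring(1).split("/");
        int common = 0; // depth of the closest common ancestor
        while (common < pa.length && common < pb.length
                && pa[common].equals(pb[common])) {
            common++;
        }
        return (pa.length - common) + (pb.length - common);
    }

    public static void main(String[] args) {
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n1")); // same node: 0
        System.out.println(distance("/d1/r1/n1", "/d1/r1/n2")); // same rack: 2
        System.out.println(distance("/d1/r1/n1", "/d1/r2/n1")); // same DC, other rack: 4
        System.out.println(distance("/d1/r1/n1", "/d2/r1/n1")); // different DCs: 6
    }
}
```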
HDFS Cluster—Flat Topology
[Diagram: the root "/" sits directly above Datanode1, Datanode2, the Namenode, and DatanodeN.]
19
With a flat topology, the distance from a node to itself is D=0 and the distance between any two distinct nodes is D=2.
HDFS Write: Replication Factor=3
[Diagram: a client JVM on a client node writes a file through the HDFS client classes (DistributedFileSystem and FSDataOutputStream); block b1 is written to a pipeline of three DataNodes spanning two racks, with block locations supplied by the NameNode.]
1. create – the client calls create on DistributedFileSystem.
2. create – DistributedFileSystem asks the NameNode to create the file; the NameNode supplies block locations.
3. write – the client writes data through FSDataOutputStream.
4. write packet – packets are pipelined from DataNode to DataNode, one replica per node (pipelined write).
5. ack packet – acknowledgements flow back up the pipeline (pipelined ack).
6. close – the client closes the stream.
7. complete – the NameNode is notified that the write is complete.
Hadoop Modes
• Standalone
• All Hadoop/MapReduce/HDFS daemons run as threads within a single Java Virtual Machine (JVM).
• Debugging distributed programs across multiple JVMs and servers is notoriously difficult. Standalone mode simplifies the debugging experience.
• Pseudo-distributed
• All Hadoop/MapReduce/HDFS daemons run in separate JVMs but on the same node. Closer to full-up cluster mode, but all processing occurs on the local node.
• Cluster
• Hadoop/MapReduce/HDFS daemons run in JVMs spread across the nodes of a Hadoop/HDFS cluster.
21
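For reference, pseudo-distributed mode is typically enabled through a handful of Hadoop 1.x configuration properties; a minimal sketch (the hostnames and ports are illustrative, not mandated):

```xml
<!-- conf/core-site.xml: point the filesystem at a local HDFS daemon -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

<!-- conf/hdfs-site.xml: a single node can hold only one replica -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

<!-- conf/mapred-site.xml: run the jobtracker locally -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>
```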
Exercise 2
22
MapReduce
• We've skirted the issue so far: what good is all that data without analysis?
• MapReduce is one of the core components of Hadoop and provides a data model for processing data.
• Analyzing data with Hadoop is broken up into two primary phases: the Map phase and the Reduce phase (hence the name).
• Each phase has key-value pairs as input and output. The types of those key-value pairs are selectable by the programmer.
• The MapReduce API provides a number of available input and output types.
• Types are extensible.
• The programmer must specify two functions: the map function and the reduce function.
• The output key-value types of the Map phase must be the same as the input key-value types of the Reduce phase.
• The output from the map function is processed by the MapReduce framework before being sent to the reduce function. This processing sorts and groups the key-value pairs by key, and is known as "the shuffle".
• MapReduce provides an API for creating a 'job' and submitting it to Hadoop for execution.
23
Map Phase
• Input: key/value pairs.
• The values represent the data set to be processed.
• Map function: user-defined, and applied to every value in the data set.
• Output: a new list of key/value pairs.
• Output key/value types may be different than the input types.
24
Reduce Phase
• Input: intermediate key/value pairs output from the Map phase.
– Data is sorted and grouped by key before being passed to the reduce function.

Map Function Output:       Input to Reduce Function:
(K3, V1)                   (K1, [V1, V2])
(K1, V1)         =>        (K2, [V1])
(K1, V2)                   (K3, [V1])
(K2, V1)

• Reduce function: user-defined function applied to each grouping (by key) of values.
– Typically a function that takes a large number of key/value pairs and produces a smaller number of key/value pairs; hence the name "reduce".
• Output: finalized set of key/value pairs.
• All values with the same key will eventually be processed by the same reduce task.
25
MapReduce (cont.)
• In addition to the Map and Reduce phases there are a few other data processing steps:
• Input – Turn raw data into key-value pairs for input into the Map phase.
• "Shuffle" – Turn the output key-value pairs from the Map phase into input key-value pairs for the Reduce phase. The framework sorts and groups the key-value pairs by key before they are sent to the reduce function.
• Output – Write the results of the Reduce phase to the file system.
26
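To make the input → map → shuffle → reduce pipeline concrete, here is a toy in-memory sketch in plain Java (no Hadoop involved), using word counting as the map/reduce pair:

```java
import java.util.*;

public class MiniMapReduce {
    // Input -> map -> shuffle -> reduce, entirely in memory.
    static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit (word, 1) for every word in every input record.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split(" "))
                mapped.add(new AbstractMap.SimpleEntry<>(word, 1));

        // "Shuffle": sort and group the map output by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> kv : mapped)
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                   .add(kv.getValue());

        // Reduce phase: apply a user-defined function (here, a sum)
        // to each key's grouped values.
        Map<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) sum += v;
            reduced.put(e.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Output step: in real Hadoop this would be written back to HDFS.
        System.out.println(wordCount(Arrays.asList("a b a", "b c"))); // {a=2, b=2, c=1}
    }
}
```

In Hadoop the same four steps are distributed: the map work is spread across map tasks, the shuffle moves data between nodes, and the reduce work is spread across reduce tasks.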
Scaling MapReduce
• Leverage data stored in HDFS.
• Use Hadoop to move computations to the nodes hosting part of the data.
• MapReduce job – the unit of work to be completed for a client. Consists of:
• Input data.
• MapReduce program.
• Configuration information.
• Hadoop divides a job into two types of tasks:
• Map tasks.
• Reduce tasks.
27
Scaling MapReduce
Division Of Labor
• Hadoop divides the input data into fixed-size pieces called splits.
• Hadoop creates one map task for each split. The map task executes the user-defined map function for each record in the split.
28
Map and Reduce Tasks
[Diagram: input → map/combine → shuffle → reduce → output. Each input split feeds a map task, which runs map and sorts its output into partitions (part 1 … part n). Each reduce task merge/sorts the matching partition from every map task, runs reduce, and writes its output part to HDFS, where it is replicated.]
29
Hadoop Architecture
• A Hadoop cluster uses two node types to facilitate job execution:
• Jobtracker – Server running a special piece of software known as the JobTracker.
• Tasktracker – Server(s) running a special piece of software known as the TaskTracker.
30
Hadoop Architecture—Nodes
• Jobtracker – Coordinates all the jobs run on the system by scheduling tasks to run on tasktrackers.
• Keeps track of progress for each job.
• Reschedules failed tasks on another tasktracker.
• Tasktracker – Runs tasks and sends progress reports to the jobtracker.
• Note: This architecture is for Hadoop 1.x. The architecture changed in Hadoop 2.x (YARN) to overcome 1.x scaling issues.
31
Sample HDFS/Hadoop Cluster
[Diagram: the same racks of servers as the HDFS cluster, but each worker server now runs a TaskTracker alongside its DataNode; one server runs the JobTracker and another runs the NameNode.]
Hadoop Data Locality Optimization
[Diagram: map tasks placed relative to their HDFS input blocks: data-local (task on the same node as block a), rack-local (task in the same rack as block b), and off-rack (block c read across racks).]
• Hadoop does its best to run map tasks on the same node containing the input split.
• Optimal split size is one HDFS block.
MapReduce Example

Input dataset: SINGLE.TXT*
aardwolves
draftable
flatbread
tutor
trout
tourt

34

• Find all input rows containing a word that is an anagram of another word in the input.
• Input to the Map phase is a single file, SINGLE.TXT, containing all words.
• Output of the Map phase is a set of key/value pairs, one for each line of the input dataset. The key is the sorted letters of the word; the value is the word itself:

aadelorsvw  aardwolves
aabdeflrt   draftable
aabdeflrt   flatbread
orttu       tutor
orttu       trout
orttu       tourt

* Data from Project Gutenberg @ http://www.gutenberg.org/dirs/etext02/mword10.zip
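The mapper's key derivation, sorting the characters of each word, can be tried standalone in plain Java (mirroring the logic the slides describe, without any Hadoop classes):

```java
import java.util.Arrays;

public class AnagramKey {
    // Same trick the map function uses: the sorted letters of a word
    // form a key shared by all of its anagrams.
    static String key(String word) {
        char[] chars = word.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    public static void main(String[] args) {
        System.out.println(key("trout"));     // orttu
        System.out.println(key("tutor"));     // orttu  (same key => anagrams)
        System.out.println(key("flatbread")); // aabdeflrt
    }
}
```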
MapReduce Example

• Input to the Reduce phase is the sorted and grouped keys and values.
• The idea being that anagrams will show up as key/value pairs with multiple values in the group (keyed by common sorted letters):

<snip>
aadelorsvw [aardwolves]
aabdeflrt [draftable,flatbread]
<snip>
orttu [tutor,trout,tourt]
<snip>

• Output of the Reduce phase will be records with multiple values for the same key:

Output: part-00000
<snip>
aabdeflrt draftable,flatbread
<snip>
orttu tutor,trout,tourt
<snip>

35
Map Function

package com.hadoop.examples.anagrams;

import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class AnagramMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text sortedText = new Text();
    private Text originalText = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> outputCollector, Reporter reporter)
            throws IOException {
        // Emit (sorted letters of word, word): anagrams share a key.
        String word = value.toString();
        char[] wordChars = word.toCharArray();
        Arrays.sort(wordChars);
        String sortedWord = new String(wordChars);
        sortedText.set(sortedWord);
        originalText.set(word);
        outputCollector.collect(sortedText, originalText);
    }
}

36
Reduce Function

package com.hadoop.examples.anagrams;

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class AnagramReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text anagramKey, Iterator<Text> anagramValues,
            OutputCollector<Text, Text> results, Reporter reporter)
            throws IOException {
        // Concatenate all words that share this sorted-letter key.
        String output = "";
        while (anagramValues.hasNext()) {
            Text anagram = anagramValues.next();
            output = output + anagram.toString() + "~";
        }
        // Only emit keys with two or more words, i.e. actual anagrams.
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        if (outputTokenizer.countTokens() >= 2) {
            // Drop the trailing separator, then join the words with commas.
            output = output.substring(0, output.length() - 1).replace("~", ",");
            outputKey.set(anagramKey.toString());
            outputValue.set(output);
            results.collect(outputKey, outputValue);
        }
    }
}

37
Package The Job

package com.hadoop.examples.anagrams;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class AnagramJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(com.hadoop.examples.anagrams.AnagramJob.class);
        conf.setJobName("anagramcount");
        conf.setKeepTaskFilesPattern(".*");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(AnagramMapper.class);
        conf.setReducerClass(AnagramReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

38
MapReduce Example 2

Dataset 1: NAMES-F.TXT*
Aaren
Aarika
Abagael
Abagail
Abbe
…
Zsazsa
Zulema
Zuzana

Dataset 2: NAMES-M.TXT*
Aaron
Ab
Abba
Abbe
Abbey
…
Zerk
Zollie
Zolly

39

• Compare two datasets for a set of common names.
• Input to the Map phase is both datasets (female and male names).
• Output of the Map phase is a set of key/value pairs, one for each line of the input datasets. Both key and value are set to the name from the input record.
• Input to the Reduce phase is the sorted and grouped keys and values.
• The idea being that common names will show up as key/value pairs with multiple values in the group.

* Data from Project Gutenberg @ http://www.gutenberg.org/dirs/etext02/mword10.zip
Map Function

package com.hadoop.examples.commonnames;

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CommonNamesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Text originalText = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, Text> outputCollector, Reporter reporter)
            throws IOException {
        // Emit each name as both key and value.
        String word = value.toString();
        originalText.set(word);
        outputCollector.collect(originalText, originalText);
    }
}

40
Reduce Function

package com.hadoop.examples.commonnames;

import <snip>

public class CommonNamesReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {

    private Text outputKey = new Text();
    private Text outputValue = new Text();

    public void reduce(Text namesKey, Iterator<Text> namesValues,
            OutputCollector<Text, Text> results, Reporter reporter)
            throws IOException {
        String output = "";
        while (namesValues.hasNext()) {
            Text names = namesValues.next();
            output = output + names.toString() + "~";
        }
        // A name with two or more values appeared in both datasets.
        StringTokenizer outputTokenizer = new StringTokenizer(output, "~");
        if (outputTokenizer.countTokens() >= 2) {
            output = output.replace("~", ",");
            outputKey.set(namesKey.toString());
            outputValue.set(output);
            results.collect(outputKey, outputValue);
        }
    }
}

41
Package The Job

package com.hadoop.examples.commonnames;

import <snip>

public class CommonNamesJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(com.hadoop.examples.commonnames.CommonNamesJob.class);
        conf.setJobName("commonnames");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(CommonNamesMapper.class);
        conf.setReducerClass(CommonNamesReducer.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

42
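The grouping idea behind this job can be checked without a cluster. The plain-Java sketch below emulates what the shuffle plus the CommonNamesReducer logic computes, namely names appearing in both datasets; it assumes (as the slides' data does) that each name occurs at most once per dataset:

```java
import java.util.*;

public class CommonNames {
    // Emulates map (name -> name), shuffle (group by key), and the
    // reducer's "two or more values" test across both input datasets.
    static List<String> common(List<String> female, List<String> male) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted, like the shuffle
        for (String n : female) counts.merge(n, 1, Integer::sum);
        for (String n : male) counts.merge(n, 1, Integer::sum);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= 2) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        List<String> f = Arrays.asList("Abbe", "Abagail", "Zuzana");
        List<String> m = Arrays.asList("Aaron", "Abbe", "Zolly");
        System.out.println(common(f, m)); // [Abbe]
    }
}
```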
Hadoop Modes
• Standalone
• All Hadoop/MapReduce/HDFS daemons run as threads within a single Java Virtual Machine (JVM).
• Debugging distributed programs across multiple JVMs and servers is notoriously difficult. Standalone mode simplifies the debugging experience.
• Pseudo-distributed
• All Hadoop/MapReduce/HDFS daemons run in separate JVMs but on the same node. Closer to full-up cluster mode, but all processing occurs on the local node.
• Cluster
• Hadoop/MapReduce/HDFS daemons run in JVMs spread across the nodes of a Hadoop/HDFS cluster.
43
Exercise 3
44
Hadoop Introduction
• Vendor: Copyright © 2011 The Apache Software Foundation.
Website: http://hadoop.apache.org/

Description:
The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.

• The project includes these subprojects:
• Hadoop Common: The common utilities that support the other Hadoop subprojects.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
• Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.

45
Additional Resources
• "Hadoop: The Definitive Guide", Third Edition, by Tom White. Copyright 2011 Tom White, 978-1-449-31152-0.
• Cloudera:
• Linux packages and virtual machines:
• https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads#CDHDownloads-CDH3PackagesandDownloads
• Apache Hadoop motivation webinars (free registration required):
• http://www.cloudera.com/resource/cloudera-essentials-for-apache-hadoop-the-motivation-for-hadoop/
46
Backup Slides
47
Alternatives
• MPI
• PVM
48
HDFS Access
• Talk about HTTP/Proxy access??
• Talk about pluggable filesystems???
49
HDFS
• File structures
• SequenceFile
• MapFile
50
MapReduce Features
• Counters
• Sorting
• Joins
• Side Data
51