
Hadoop 101

Slide deck from BarCamp Chennai - 5.

Page 1: Hadoop 101

Hadoop 101

Mohit Soni, eBay Inc.

BarCamp Chennai - 5 Mohit Soni

Page 2: Hadoop 101

About Me

• I work as a Software Engineer at eBay

• Worked on large-scale data processing with eBay Research Labs


Page 3: Hadoop 101

First Things First


Page 4: Hadoop 101

MapReduce

• Inspired by functional operations:

– Map

– Reduce

• Functional operations do not modify data; they generate new data

• The original data remains unmodified

Page 5: Hadoop 101

Functional Operations (Python code)

Map:

def sqr(n):
    return n * n

list = [1,2,3,4]
map(sqr, list) -> [1,4,9,16]

Reduce:

def add(i, j):
    return i + j

list = [1,2,3,4]
reduce(add, list) -> 10

MapReduce:

def MapReduce(data, mapper, reducer):
    return reduce(reducer, map(mapper, data))

MapReduce(list, sqr, add) -> 30

Page 6: Hadoop 101

[Image slide]

Page 7: Hadoop 101

What is Hadoop?

• A framework for large-scale data processing

• Based on Google's MapReduce and GFS

• An Apache Software Foundation project

• Open Source!

• Written in Java

• Oh, btw

Page 8: Hadoop 101

Why Hadoop?

• Need to process lots of data (petabyte scale)

• Need to parallelize processing across a multitude of CPUs

• Achieves the above while KeepIng Software Simple

• Gives scalability with low-cost commodity hardware

Page 9: Hadoop 101

Hadoop fans

[Image slide]

Source: Hadoop Wiki

Page 10: Hadoop 101

When to use and not use Hadoop?

Hadoop is a good choice for:

• Indexing data

• Log analysis

• Image manipulation

• Sorting large-scale data

• Data mining

Hadoop is not a good choice:

• For real-time processing

• For processing-intensive tasks with little data

• If you already have a Jaguar or a RoadRunner at your disposal

Page 11: Hadoop 101

HDFS – Overview

• Hadoop Distributed File System

• Based on Google's GFS (Google File System)

• Write-once, read-many access model (see the API sketch below)

• Fault tolerant

• Efficient for batch processing
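To make the write-once, read-many model concrete, here is a minimal client-side sketch (not from the original deck) using Hadoop's Java FileSystem API; the path /tmp/hello.txt is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster named in the Hadoop configuration
        FileSystem fs = FileSystem.get(new Configuration());

        // Write once: create the file and write its full contents
        Path p = new Path("/tmp/hello.txt");  // hypothetical path
        FSDataOutputStream out = fs.create(p);
        out.writeUTF("Hello, HDFS");
        out.close();

        // Read many: any number of clients can now open and read it
        FSDataInputStream in = fs.open(p);
        System.out.println(in.readUTF());
        in.close();
    }
}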

Page 12: Hadoop 101

HDFS – Blocks

• HDFS splits input data into blocks

• Block size in HDFS: 64/128 MB (configurable; see the sketch below)

• Block size on *nix filesystems: typically 4 KB

[Diagram: Input Data split into Block 1, Block 2, Block 3]
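The block size is a per-file property chosen at write time. A minimal sketch (not from the deck) of both the client-side default and the per-file override, using the 0.20-era configuration key dfs.block.size; the file path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for files created by this client: 128 MB
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // The block size can also be set per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/tmp/big-file"),  // hypothetical path
                true, 4096, (short) 3, 64L * 1024 * 1024);
        out.close();
    }
}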

Page 13: Hadoop 101

HDFS – Replication

• Blocks are replicated across nodes to handle hardware failure (see the sketch below)

• Node failure is handled gracefully, without loss of data

[Diagram: Blocks 1, 2, and 3, each stored on two different nodes]
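Replication is likewise configurable. A small sketch (not from the deck) showing the client-side default and a change to the factor of an existing file, after which the NameNode re-replicates in the background; the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing file to 5;
        // the NameNode schedules the extra copies asynchronously
        fs.setReplication(new Path("/tmp/big-file"), (short) 5);  // hypothetical path
    }
}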

Page 14: Hadoop 101

HDFS – Architecture

[Diagram: a Client talks to the NameNode, which manages the DataNodes forming the cluster]

Page 15: Hadoop 101

HDFS – NameNode

• NameNode (Master)

– Manages filesystem metadata

– Manages replication of blocks

– Manages read/write access to files

• Metadata

– List of files

– List of blocks that constitute a file

– List of DataNodes on which blocks reside, etc. (queried in the sketch below)

• Single point of failure (a candidate for spending $$)
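The metadata the NameNode serves can be queried from a client. This sketch (not from the deck; the path is hypothetical) asks which DataNodes hold each block of a file:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Both calls below are answered from the NameNode's metadata
        FileStatus status = fs.getFileStatus(new Path("/tmp/big-file"));  // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
        }
    }
}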

Page 16: Hadoop 101

HDFS – DataNode

• DataNode (Slave)

– Stores the actual data

– Manages data blocks

– Informs the NameNode about the block IDs it stores

– Clients read/write data blocks from/to DataNodes

– Performs block replication as instructed by the NameNode

• Block Replication

– Supports various pluggable replication strategies

– Clients read blocks from the nearest DataNode

• Data Pipelining

– The client writes a block to the first DataNode

– The first DataNode forwards the data to the next DataNode in the pipeline

– Once the block is replicated across all replicas, the next block is chosen

Page 17: Hadoop 101

Hadoop - Architecture

[Diagram: the User submits jobs to the JobTracker, which coordinates TaskTrackers; the NameNode manages the DataNodes that store the data]

Page 18: Hadoop 101

Hadoop - Terminology

• JobTracker (Master)

– One JobTracker per cluster

– Accepts job requests from users

– Schedules Map and Reduce tasks for TaskTrackers

– Monitors the status of tasks and TaskTrackers

– Re-executes tasks on failure

• TaskTracker (Slave)

– Multiple TaskTrackers in a cluster

– Run Map and Reduce tasks

Page 19: Hadoop 101

MapReduce – Flow

[Diagram: Input -> Map -> Shuffle + Sort -> Reduce -> Output; the input data is split across several Map tasks, whose output is shuffled and sorted before being aggregated by Reduce tasks into the output data]

Page 20: Hadoop 101

Word Count – Hadoop's HelloWorld

Page 21: Hadoop 101

Word Count Example

• Input

– Text files

• Output

– A single file containing (Word <TAB> Count) pairs

• Map Phase

– Generates (Word, Count) pairs

– e.g. [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]

• Reduce Phase

– For each word, calculates the aggregate count

– [{a,7}, {b,5}, {c,6}]

Page 22: Hadoop 101

Word Count – Mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits a (word, 1) pair for every token in the input line
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> out,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            out.collect(word, one);
        }
    }
}

Page 23: Hadoop 101

Word Count – Reducer

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sums all counts emitted for a word and writes (word, total)
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}

Page 24: Hadoop 101

Word Count – Config

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountConfig {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCountConfig <input path> <output path>");
            System.exit(1);
        }

        JobConf conf = new JobConf(WordCountConfig.class);
        conf.setJobName("Word Counter");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);

        // Key/value types of the job output, matching the reducer's (Text, IntWritable)
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);
    }
}
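Assuming the three classes above are compiled and packaged into a jar (the jar and directory names here are illustrative), the job is launched with Hadoop's jar runner: hadoop jar wordcount.jar WordCountConfig /user/input /user/output. Note that the combiner reuses the reducer; that is safe here because summing counts is associative and commutative.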

Page 25: Hadoop 101

Diving Deeper

• http://hadoop.apache.org/

• Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"

• Tom White, "Hadoop: The Definitive Guide", O'Reilly

• Setting up a single-node cluster: http://bit.ly/glNzs4

• Setting up a multi-node cluster: http://bit.ly/f5KqCP

Page 26: Hadoop 101

Catching-Up

• Follow me on Twitter: @mohitsoni

• http://mohitsoni.com/