
Hadoop 101

Slide deck from BarCamp Chennai - 5.

Page 1: Hadoop 101

Hadoop 101

Mohit Soni, eBay Inc.

BarCamp Chennai - 5 Mohit Soni

Page 2: Hadoop 101

About Me

• I work as a Software Engineer at eBay

• Worked on large-scale data processing with eBay Research Labs


Page 3: Hadoop 101

First Things First


Page 4: Hadoop 101

MapReduce

• Inspired by functional operations:

– Map

– Reduce

• Functional operations do not modify data; they generate new data

• The original data remains unmodified

Page 5: Hadoop 101

Functional Operations (Python code)

Map:

def sqr(n):
    return n * n

list = [1,2,3,4]
map(sqr, list) -> [1,4,9,16]

Reduce:

def add(i, j):
    return i + j

list = [1,2,3,4]
reduce(add, list) -> 10

MapReduce:

def MapReduce(data, mapper, reducer):
    return reduce(reducer, map(mapper, data))

MapReduce(list, sqr, add) -> 30

Page 6: Hadoop 101

[Image slide]

Page 7: Hadoop 101

What is Hadoop?

• A framework for large-scale data processing

• Based on Google's MapReduce and GFS

• An Apache Software Foundation project

• Open Source!

• Written in Java

• Oh, btw

Page 8: Hadoop 101

Why Hadoop?

• Need to process lots of data (petabyte scale)

• Need to parallelize processing across a multitude of CPUs

• Achieves the above while KeepIng Software Simple

• Gives scalability with low-cost commodity hardware

Page 9: Hadoop 101

Hadoop fans

[Image slide]

Source: Hadoop Wiki

Page 10: Hadoop 101

When to use and not use Hadoop?

Hadoop is a good choice for:

• Indexing data

• Log analysis

• Image manipulation

• Sorting large-scale data

• Data mining

Hadoop is not a good choice:

• For real-time processing

• For processing-intensive tasks with little data

• If you already have a Jaguar or a RoadRunner at your disposal

Page 11: Hadoop 101

HDFS – Overview

• Hadoop Distributed File System

• Based on Google's GFS (Google File System)

• Write-once, read-many access model (see the API sketch below)

• Fault tolerant

• Efficient for batch processing
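To make the write-once, read-many model concrete, here is a minimal client-side sketch (not from the original deck) using Hadoop's Java FileSystem API; the path /tmp/hello.txt is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Connects to the cluster named in the Hadoop configuration
        FileSystem fs = FileSystem.get(new Configuration());

        // Write once: create the file and write its full contents
        Path p = new Path("/tmp/hello.txt");  // hypothetical path
        FSDataOutputStream out = fs.create(p);
        out.writeUTF("Hello, HDFS");
        out.close();

        // Read many: any number of clients can now open and read it
        FSDataInputStream in = fs.open(p);
        System.out.println(in.readUTF());
        in.close();
    }
}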

Page 12: Hadoop 101

HDFS – Blocks

• HDFS splits input data into blocks

• Block size in HDFS: 64/128 MB (configurable; see the sketch below)

• Block size on *nix filesystems: typically 4 KB

[Diagram: Input Data split into Block 1, Block 2, Block 3]
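The block size is a per-file property chosen at write time. A minimal sketch (not from the deck) of both the client-side default and the per-file override, using the 0.20-era configuration key dfs.block.size; the file path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default block size for files created by this client: 128 MB
        conf.setLong("dfs.block.size", 128L * 1024 * 1024);
        FileSystem fs = FileSystem.get(conf);

        // The block size can also be set per file at creation time:
        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(
                new Path("/tmp/big-file"),  // hypothetical path
                true, 4096, (short) 3, 64L * 1024 * 1024);
        out.close();
    }
}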

Page 13: Hadoop 101

HDFS – Replication

• Blocks are replicated across nodes to handle hardware failure (see the sketch below)

• Node failure is handled gracefully, without loss of data

[Diagram: Blocks 1, 2, and 3, each stored on two different nodes]
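Replication is likewise configurable. A small sketch (not from the deck) showing the client-side default and a change to the factor of an existing file, after which the NameNode re-replicates in the background; the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication factor for files created by this client
        conf.setInt("dfs.replication", 3);
        FileSystem fs = FileSystem.get(conf);

        // Raise the replication factor of an existing file to 5;
        // the NameNode schedules the extra copies asynchronously
        fs.setReplication(new Path("/tmp/big-file"), (short) 5);  // hypothetical path
    }
}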

Page 14: Hadoop 101

HDFS – Architecture

[Diagram: a Client talks to the NameNode, which manages the DataNodes forming the cluster]

Page 15: Hadoop 101

HDFS – NameNode

• NameNode (Master)

– Manages filesystem metadata

– Manages replication of blocks

– Manages read/write access to files

• Metadata

– List of files

– List of blocks that constitute a file

– List of DataNodes on which blocks reside, etc. (queried in the sketch below)

• Single point of failure (a candidate for spending $$)
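The metadata the NameNode serves can be queried from a client. This sketch (not from the deck; the path is hypothetical) asks which DataNodes hold each block of a file:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Both calls below are answered from the NameNode's metadata
        FileStatus status = fs.getFileStatus(new Path("/tmp/big-file"));  // hypothetical path
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            System.out.println("offset=" + b.getOffset()
                    + " length=" + b.getLength()
                    + " hosts=" + Arrays.toString(b.getHosts()));
        }
    }
}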

Page 16: Hadoop 101

HDFS – DataNode

• DataNode (Slave)

– Stores the actual data

– Manages data blocks

– Informs the NameNode about the block IDs it stores

– Clients read/write data blocks from/to DataNodes

– Performs block replication as instructed by the NameNode

• Block Replication

– Supports various pluggable replication strategies

– Clients read blocks from the nearest DataNode

• Data Pipelining

– The client writes a block to the first DataNode

– The first DataNode forwards the data to the next DataNode in the pipeline

– Once the block is replicated across all replicas, the next block is chosen

Page 17: Hadoop 101

Hadoop - Architecture

[Diagram: the User submits jobs to the JobTracker, which coordinates TaskTrackers; the NameNode manages the DataNodes that store the data]

Page 18: Hadoop 101

Hadoop - Terminology

• JobTracker (Master)

– One JobTracker per cluster

– Accepts job requests from users

– Schedules Map and Reduce tasks for TaskTrackers

– Monitors the status of tasks and TaskTrackers

– Re-executes tasks on failure

• TaskTracker (Slave)

– Multiple TaskTrackers in a cluster

– Run Map and Reduce tasks

Page 19: Hadoop 101

MapReduce – Flow

[Diagram: Input -> Map -> Shuffle + Sort -> Reduce -> Output; the input data is split across several Map tasks, whose output is shuffled and sorted before being aggregated by Reduce tasks into the output data]

Page 20: Hadoop 101

Word Count – Hadoop's HelloWorld

Page 21: Hadoop 101

Word Count Example

• Input

– Text files

• Output

– A single file containing (Word <TAB> Count) pairs

• Map Phase

– Generates (Word, Count) pairs

– e.g. [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]

• Reduce Phase

– For each word, calculates the aggregate count

– [{a,7}, {b,5}, {c,6}]

Page 22: Hadoop 101

Word Count – Mapper

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits a (word, 1) pair for every token in the input line
    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> out,
            Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            out.collect(word, one);
        }
    }
}

Page 23: Hadoop 101

Word Count – Reducer

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Sums all counts emitted for a word and writes (word, total)
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out,
            Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}

Page 24: Hadoop 101

Word Count – Config

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountConfig {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: WordCountConfig <input path> <output path>");
            System.exit(1);
        }

        JobConf conf = new JobConf(WordCountConfig.class);
        conf.setJobName("Word Counter");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);

        // Key/value types of the job output, matching the reducer's (Text, IntWritable)
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        JobClient.runJob(conf);
    }
}
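Assuming the three classes above are compiled and packaged into a jar (the jar and directory names here are illustrative), the job is launched with Hadoop's jar runner: hadoop jar wordcount.jar WordCountConfig /user/input /user/output. Note that the combiner reuses the reducer; that is safe here because summing counts is associative and commutative.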

Page 25: Hadoop 101

Diving Deeper

• http://hadoop.apache.org/

• Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"

• Tom White, "Hadoop: The Definitive Guide", O'Reilly

• Setting up a single-node cluster: http://bit.ly/glNzs4

• Setting up a multi-node cluster: http://bit.ly/f5KqCP

Page 26: Hadoop 101

Catching-Up

• Follow me on Twitter: @mohitsoni

• http://mohitsoni.com/