Slide deck from Barcamp Chennai
Hadoop 101
Mohit Soni
eBay Inc.
BarCamp Chennai - 5 Mohit Soni
About Me
• I work as a Software Engineer at eBay
• Worked on large-scale data processing with eBay Research Labs
First Things First
• Inspired by functional operations
– Map
– Reduce
• Functional operations do not modify data, they generate new data
• Original data remains unmodified
MapReduce
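A minimal Python sketch of the two bullets above: `map` and `reduce` build new values and leave their input untouched. The variable names `numbers`, `squares`, and `total` are illustrative only.

```python
from functools import reduce

numbers = [1, 2, 3, 4]

# map() produces a new sequence; the input list is left as it was
squares = list(map(lambda n: n * n, numbers))

# reduce() folds the new sequence into a single value
total = reduce(lambda i, j: i + j, squares)

print(numbers)  # [1, 2, 3, 4]
print(squares)  # [1, 4, 9, 16]
print(total)    # 30
```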
MapReduce
from functools import reduce  # reduce is not a builtin in Python 3
def MapReduce(data, mapper, reducer):
    return reduce(reducer, map(mapper, data))
MapReduce([1, 2, 3, 4], sqr, add)  # -> 30
Functional Operations
Map
def sqr(n):
    return n * n
nums = [1, 2, 3, 4]
list(map(sqr, nums))  # -> [1, 4, 9, 16]

Reduce
from functools import reduce  # reduce is not a builtin in Python 3
def add(i, j):
    return i + j
nums = [1, 2, 3, 4]
reduce(add, nums)  # -> 10
Python code
• Framework for large-scale data processing
• Based on Google’s MapReduce and GFS
• An Apache Software Foundation project
• Open Source!
• Written in Java
• Oh, btw
What is Hadoop?
• Need to process lots of data (petabyte scale)
• Need to parallelize processing across a multitude of CPUs
• Achieves the above while KeepIng Software Simple
• Gives scalability with low-cost commodity hardware
Why Hadoop?
Source: Hadoop Wiki
Hadoop fans
Hadoop is a good choice for:
• Indexing data
• Log Analysis
• Image manipulation
• Sorting large-scale data
• Data Mining
Hadoop is not a good choice:
• For real-time processing
• For processing-intensive tasks with little data
• If you already have a supercomputer such as Jaguar or Roadrunner at hand
When to use and when not to use Hadoop?
• Hadoop Distributed File System
• Based on Google’s GFS (Google File System)
• Write-once, read-many access model
• Fault tolerant
• Efficient for batch processing
HDFS – Overview
• HDFS splits input data into blocks
• Block size in HDFS: 64 MB or 128 MB (configurable)
• Typical block size on a *nix filesystem: 4 KB
HDFS – Blocks
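To make the block sizes above concrete, here is a small arithmetic sketch of how many blocks a file occupies at each size. The helper `block_count` and the 1 GB example file are assumptions for illustration, not part of HDFS.

```python
MB = 1024 * 1024

def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks needed to hold a file (integer ceiling division)."""
    return (file_size_bytes + block_size_bytes - 1) // block_size_bytes

one_gb = 1024 * MB
print(block_count(one_gb, 64 * MB))   # 16
print(block_count(one_gb, 128 * MB))  # 8
print(block_count(one_gb, 4 * 1024))  # 262144
```

With large blocks, a 1 GB file is described by a handful of entries rather than hundreds of thousands.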
[Figure: Input Data split into Block 1, Block 2, and Block 3]
• Blocks are replicated across nodes to handle hardware failure
• Node failure is handled gracefully, without loss of data
HDFS – Replication
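The effect of replication on node failure can be sketched with a toy placement. The round-robin policy, the replication factor of 2, and the node names below are assumptions chosen for illustration; they are not HDFS's actual placement strategy.

```python
def place_blocks(blocks, nodes, replication=2):
    """Assign each block to `replication` distinct nodes, round-robin.

    Assumes replication <= len(nodes).
    """
    placement = {node: set() for node in nodes}
    for i, block in enumerate(blocks):
        for r in range(replication):
            placement[nodes[(i + r) % len(nodes)]].add(block)
    return placement

def surviving_blocks(placement, failed_node):
    """Blocks still readable after one node is lost."""
    alive = set()
    for node, held in placement.items():
        if node != failed_node:
            alive |= held
    return alive

blocks = ["Block 1", "Block 2", "Block 3"]
nodes = ["node-a", "node-b", "node-c"]
placement = place_blocks(blocks, nodes)

# Losing any single node still leaves a copy of every block
for node in nodes:
    assert surviving_blocks(placement, node) == set(blocks)
```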
[Figure: replicated block placement: (Block 1, Block 2), (Block 1, Block 3), (Block 2, Block 3)]
HDFS – Architecture
[Figure: Client, NameNode, and a cluster of DataNodes]
• NameNode (Master)
– Manages filesystem metadata
– Manages replication of blocks
– Manages read/write access to files
• Metadata
– List of files
– List of blocks that constitute a file
– List of DataNodes on which blocks reside, etc.
• Single Point of Failure (candidate for spending $$)
HDFS – NameNode
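The metadata listed above can be pictured as two lookup tables. The class `ToyNameNode`, its method names, and the sample path and block IDs are hypothetical; this is a sketch of the idea, not the NameNode's real data structures.

```python
class ToyNameNode:
    """Hypothetical in-memory model of the metadata listed above."""

    def __init__(self):
        self.file_blocks = {}      # file path -> ordered list of block IDs
        self.block_locations = {}  # block ID -> set of DataNode names

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set())

    def report_block(self, datanode, block_id):
        """Record that a DataNode holds a replica of a block."""
        self.block_locations.setdefault(block_id, set()).add(datanode)

    def locate(self, path):
        """Return (block ID, DataNodes) pairs needed to read a file."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.file_blocks[path]]

namenode = ToyNameNode()
namenode.add_file("/logs/day1.txt", ["blk_1", "blk_2"])
namenode.report_block("datanode-1", "blk_1")
namenode.report_block("datanode-2", "blk_1")
namenode.report_block("datanode-2", "blk_2")

print(namenode.locate("/logs/day1.txt"))
# [('blk_1', ['datanode-1', 'datanode-2']), ('blk_2', ['datanode-2'])]
```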
• DataNode (Slave)
– Contains actual data
– Manages data blocks
– Informs NameNode about the block IDs it stores
– Clients read/write data blocks from/to DataNodes
– Performs block replication as instructed by NameNode
• Block Replication
– Supports various pluggable replication strategies
– Clients read blocks from the nearest DataNode
• Data Pipelining
– Client writes a block to the first DataNode
– First DataNode forwards data to the next DataNode in the pipeline
– When the block is replicated across all replicas, the next block is chosen
HDFS – DataNode
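The data pipelining steps above can be sketched as a chain of forwarding calls. `ToyDataNode` and `client_write` are hypothetical names, and the whole block is passed in one call here for simplicity.

```python
class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes

    def write(self, block_id, data, pipeline):
        """Store the block, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if pipeline:
            next_node, rest = pipeline[0], pipeline[1:]
            next_node.write(block_id, data, rest)

def client_write(blocks, pipeline):
    """The client talks only to the first DataNode, one block at a time."""
    first, rest = pipeline[0], pipeline[1:]
    for block_id, data in blocks:
        # Returns once every node in the pipeline has stored the block,
        # after which the next block is written
        first.write(block_id, data, rest)

nodes = [ToyDataNode("dn1"), ToyDataNode("dn2"), ToyDataNode("dn3")]
client_write([("blk_1", b"hello"), ("blk_2", b"world")], nodes)

for node in nodes:
    print(node.name, sorted(node.blocks))
```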
Hadoop - Architecture
[Figure: User, JobTracker, NameNode, two TaskTrackers, and six DataNodes]
• JobTracker (Master)
– One JobTracker per cluster
– Accepts job requests from users
– Schedules Map and Reduce tasks for TaskTrackers
– Monitors task and TaskTracker status
– Re-executes tasks on failure
• TaskTracker (Slave)
– Multiple TaskTrackers in a cluster
– Runs Map and Reduce tasks
Hadoop - Terminology
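The master/slave roles above, including re-execution on failure, can be sketched as follows. `ToyJobTracker`, `ToyTaskTracker`, and the `fail_on` switch are hypothetical and exist only for this demo; the sketch runs tasks one after another, whereas a real cluster runs them in parallel.

```python
class ToyTaskTracker:
    def __init__(self, name, fail_on=()):
        self.name = name
        self.fail_on = set(fail_on)  # task names this tracker fails (demo only)

    def run(self, task_name, func, arg):
        if task_name in self.fail_on:
            raise RuntimeError(f"{self.name} failed {task_name}")
        return func(arg)

class ToyJobTracker:
    def __init__(self, trackers):
        self.trackers = trackers

    def run_job(self, tasks):
        """tasks: list of (name, func, arg). Returns (results, log)."""
        results, log = {}, []
        for i, (name, func, arg) in enumerate(tasks):
            # Try trackers in turn, starting at a different one per task
            for attempt in range(len(self.trackers)):
                tracker = self.trackers[(i + attempt) % len(self.trackers)]
                try:
                    results[name] = tracker.run(name, func, arg)
                    log.append((name, tracker.name, "ok"))
                    break
                except RuntimeError:
                    log.append((name, tracker.name, "failed"))
            else:
                raise RuntimeError(f"{name} failed on every TaskTracker")
        return results, log

def count_words(line):
    return len(line.split())

trackers = [ToyTaskTracker("tt1"), ToyTaskTracker("tt2", fail_on={"map-1"})]
tasks = [("map-0", count_words, "a b a"),
         ("map-1", count_words, "b c"),
         ("map-2", count_words, "c")]
results, log = ToyJobTracker(trackers).run_job(tasks)
print(results)
print(log)
```

In this run, `map-1` fails on `tt2` and is re-executed on `tt1`, so the job still completes.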
[Figure: stages Input, Map, Shuffle + Sort, Reduce, Output; Input Data feeds three Map tasks, whose results pass to two Reduce tasks that produce the Output Data]
MapReduce – Flow
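The flow above (Map, then Shuffle + Sort, then Reduce) can be sketched in a single process. The function `run_mapreduce` and the word-length example are assumptions for illustration; Hadoop distributes each stage across machines.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    # Map: each input record yields zero or more (key, value) pairs
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle + Sort: bring pairs with the same key together
    mapped.sort(key=itemgetter(0))
    grouped = groupby(mapped, key=itemgetter(0))
    # Reduce: one call per key with all of that key's values
    return [reducer(key, [value for _, value in pairs])
            for key, pairs in grouped]

def length_mapper(word):
    yield (len(word), word)

def collect_reducer(length, words):
    return (length, sorted(words))

output = run_mapreduce(["map", "sort", "reduce", "job", "node"],
                       length_mapper, collect_reducer)
print(output)
# [(3, ['job', 'map']), (4, ['node', 'sort']), (6, ['reduce'])]
```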
Word Count
Hadoop’s HelloWorld
• Input
– Text files
• Output
– Single file containing (Word <TAB> Count)
• Map Phase
– Generates (Word, Count) pairs
– [{a,1}, {b,1}, {a,1}] [{a,2}, {b,3}, {c,5}] [{a,3}, {b,1}, {c,1}]
• Reduce Phase
– For each word, calculates aggregate
– [{a,7}, {b,5}, {c,6}]
Word Count Example
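As a quick check of the numbers in this example, the sketch below sums the map-phase pairs shown on the slide and prints them in the Word <TAB> Count format. The function name `reduce_counts` is an assumption for illustration.

```python
from collections import defaultdict

# Map-phase output from the slide: one list of (word, count) pairs per mapper
map_outputs = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 2), ("b", 3), ("c", 5)],
    [("a", 3), ("b", 1), ("c", 1)],
]

def reduce_counts(map_outputs):
    """For each word, add up the counts from every mapper."""
    totals = defaultdict(int)
    for pairs in map_outputs:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

totals = reduce_counts(map_outputs)
for word in sorted(totals):
    print(f"{word}\t{totals[word]}")
```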
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountMapper extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // Emits (word, 1) for every whitespace-separated token in the input line
    public void map(LongWritable key, Text value, OutputCollector<Text,
            IntWritable> out, Reporter reporter) throws IOException {
        String l = value.toString();
        StringTokenizer t = new StringTokenizer(l);
        while (t.hasMoreTokens()) {
            word.set(t.nextToken());
            out.collect(word, one);
        }
    }
}
Word Count – Mapper
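import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountReducer extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums all counts received for one word and emits (word, total)
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        out.collect(key, new IntWritable(sum));
    }
}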
Word Count – Reducer
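import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCountConfig {
    public static void main(String[] args) throws Exception {
        // Expects exactly two arguments: input path and output path
        if (args.length != 2) {
            System.err.println("Usage: WordCountConfig <input path> <output path>");
            System.exit(1);
        }
        JobConf conf = new JobConf(WordCountConfig.class);
        conf.setJobName("Word Counter");
        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        conf.setMapperClass(WordCountMapper.class);
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);
        // Output types must match what the mapper and reducer emit
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        JobClient.runJob(conf);
    }
}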
Word Count – Config
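The config registers WordCountReducer as the combiner as well as the reducer. A combiner applies the reduce logic to a single mapper's output before the shuffle, which is safe for word count because addition gives the same total however the counts are grouped. A minimal Python sketch of that local step, with the hypothetical helper name `combine`:

```python
from collections import Counter

def combine(pairs):
    """Locally sum counts per word for one mapper's output."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return sorted(combined.items())

mapper_output = [("a", 1), ("b", 1), ("a", 1)]
print(combine(mapper_output))  # [('a', 2), ('b', 1)]
```

Three pairs become two, so less data has to travel from the mapper to the reducers.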
• http://hadoop.apache.org/
• Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
• Tom White, Hadoop: The Definitive Guide, O’Reilly
• Setting up a Single-Node Cluster: http://bit.ly/glNzs4
• Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP
Diving Deeper
• Follow me on Twitter: @mohitsoni
• http://mohitsoni.com/
Catching-Up