This is a very high-level introduction to Hadoop delivered to the Information Retrieval class at University of Michigan during the Hack U week '09.
Hadoop: A (very) high-level overview
University of Michigan Hack U ’08
Erik Eldridge
Yahoo! Developer Network
Photo credit: Swami Stream (http://ow.ly/17tC)
Overview
• What is it?
• Example 1: word count
• Example 2: search suggestions
• Why would I use it?
• How do I use it?
• Some Code
Before I continue…
• Slides are available here: slideshare.net/erikeldridge
Hadoop is
• Software for breaking a big job into smaller tasks, performing each task in parallel, and collecting the results
Example 1: Counting Words
1. Split into 3 sentences
2. Count words in each sentence
   – 1 “Mary”, 1 “had”, 1 “a”, …
   – 1 “Its”, 1 “fleece”, 1 “was”, …
   – 1 “Everywhere”, 1 “that”, 1 “Mary”, …
3. Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …
“Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
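The three steps above can be sketched in plain Java. This is only an in-memory illustration of the idea (split, count independently, merge), not Hadoop's actual execution model:

```java
import java.util.*;

public class WordCountSketch {
    public static void main(String[] args) {
        String text = "Mary had a little lamb. Its fleece was white as snow. "
                    + "Everywhere that Mary went the lamb was sure to go.";

        // 1. Split into sentences
        String[] sentences = text.split("\\.\\s*");

        // 2. Count words in each sentence independently (one task per sentence)
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (String sentence : sentences) {
            Map<String, Integer> counts = new HashMap<>();
            for (String word : sentence.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
            partials.add(counts);
        }

        // 3. Collect the partial results into one total
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> total.merge(word, n, Integer::sum));
        }

        System.out.println(total.get("Mary") + " \u201cMary\u201d, "
                         + total.get("lamb") + " \u201clamb\u201d");
    }
}
```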
Example 2: Search Suggestions
Creating search suggestions
• Gazillions of search queries in server log files
• How many times was each word used?
• Using Hadoop, we would:
  – Split up files
  – Count words in each
  – Sum word counts
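Once the counts exist, turning them into suggestions is a separate, much smaller job. The slides don't show this step; one hypothetical post-processing sketch is to rank the terms matching a typed prefix by frequency:

```java
import java.util.*;

public class Suggest {
    // Return up to k terms starting with the given prefix, most frequent first.
    static List<String> suggest(Map<String, Integer> counts, String prefix, int k) {
        List<Map.Entry<String, Integer>> matches = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getKey().startsWith(prefix)) matches.add(e);
        }
        matches.sort((a, b) -> b.getValue() - a.getValue());  // highest count first
        List<String> out = new ArrayList<>();
        for (int i = 0; i < Math.min(k, matches.size()); i++) {
            out.add(matches.get(i).getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical query-term counts, as produced by the word-count job
        Map<String, Integer> counts = new HashMap<>();
        counts.put("hadoop", 120);
        counts.put("hack", 45);
        counts.put("haskell", 80);
        counts.put("python", 200);
        System.out.println(suggest(counts, "ha", 2));
    }
}
```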
So, Hadoop is
• A distributed batch processing infrastructure
• Built to process "web-scale" data: terabytes, petabytes
• Two components:
  – HDFS
  – MapReduce infrastructure
HDFS
• A distributed, fault-tolerant file system
• It’s easier to move calculations than data
• Hadoop will split the data for you
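That splitting can be pictured with a toy sketch: chop a file's bytes into fixed-size blocks, the way HDFS does (its default block size in this era was 64 MB; the 4-byte block size below is just to show the mechanics):

```java
import java.util.*;

public class BlockSplit {
    // Toy illustration of fixed-size block splitting, HDFS-style.
    static List<byte[]> split(byte[] data, int blockSize) {
        List<byte[]> blocks = new ArrayList<>();
        for (int off = 0; off < data.length; off += blockSize) {
            int len = Math.min(blockSize, data.length - off);
            blocks.add(Arrays.copyOfRange(data, off, off + len));
        }
        return blocks;
    }

    public static void main(String[] args) {
        byte[] data = "hello hadoop".getBytes();       // 12 bytes
        List<byte[]> blocks = split(data, 4);          // 3 blocks of 4 bytes
        System.out.println(blocks.size() + " blocks");
    }
}
```

In real HDFS each block is also replicated across machines for fault tolerance, which is what makes "move the calculation to the data" possible.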
MapReduce Infrastructure
• Two steps:
  – Map
  – Reduce
• Java, C, C++ APIs
• Pig, Streaming
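The two steps fit together through a grouping phase the framework performs for you (the "shuffle", which the slides don't show explicitly). A tiny in-memory teaching sketch, not Hadoop's real API:

```java
import java.util.*;

public class TwoSteps {
    public static void main(String[] args) {
        String[] input = {"a b", "b c"};

        // Map: each record emits (word, 1) pairs
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String record : input)
            for (String word : record.split(" "))
                emitted.add(new AbstractMap.SimpleEntry<>(word, 1));

        // Shuffle: group values by key (the framework does this between the steps)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> e : emitted)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce: combine each key's list of values into one result
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int n : e.getValue()) sum += n;
            result.put(e.getKey(), sum);
        }

        System.out.println(result);
    }
}
```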
Java Word Count: Mapper
//credit: http://ow.ly/1bER
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}
Java Word Count: Reducer
//credit: http://ow.ly/1bER public static class Reduce extends MapReduceBase
implements Reducer<Text, IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Java Word Count: Running it

//credit: http://ow.ly/1bER
public class WordCount {
  ...
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));

    JobClient.runJob(conf);
  }
  ...
}
Streaming Word Count
//credit: http://ow.ly/1bER
• bin/hadoop jar hadoop-streaming.jar
    -input in-dir -output out-dir
    -mapper streamingMapper.sh -reducer streamingReducer.sh
• streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .
• streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'
Pig Word Count
//credit: http://ow.ly/1bER
input = LOAD 'in-dir' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'out-dir';
Beyond Word Count
• Yahoo! Search – generating their Web Map
• Zattoo – computing viewership stats
• New York Times – converting their archives to PDF
• Last.fm – improving their streams by learning from track-skipping patterns
• Facebook – indexing mail accounts
Why use Hadoop?
• Do you have a very large data set?
• Hadoop runs on cheap, commodity hardware
• Simplified programming model: you write map and reduce; the framework handles distribution
How do I use it?
1. Download Hadoop
2. Define cluster in Hadoop settings
3. Import data into HDFS using Hadoop
4. Define job using the Java API, Pig, or streaming
5. Run job
6. Output is saved to file(s)
7. Sign up for Hadoop mailing list
Resources
• Hadoop project site
• Yahoo! Hadoop tutorial
• Hadoop Word Count (pdf)
• Owen O’Malley’s intro to Hadoop
• Ruby Word Count example
• Tutorial on Hadoop + EC2 + S3
• Tutorial on single-node Hadoop
Thank you!
• Twitter: erikeldridge
• Presentation is available here: slideshare.net/erikeldridge