Hadoop high-level intro - U. of Mich. Hack U '09


DESCRIPTION

This is a very high-level introduction to Hadoop delivered to the Information Retrieval class at the University of Michigan during Hack U week '09.


Hadoop: A (very) high-level overview

University of Michigan Hack U ’08

Erik Eldridge, Yahoo! Developer Network

Photo credit: Swami Stream (http://ow.ly/17tC)


Overview

• What is it?

• Example 1: word count

• Example 2: search suggestions

• Why would I use it?

• How do I use it?

• Some Code


Before I continue…

• Slides are available here: slideshare.net/erikeldridge


Hadoop is

• Software for breaking a big job into smaller tasks, performing each task, and collecting the results


Example 1: Counting Words

1. Split into 3 sentences

2. Count words in each sentence
   – 1 “Mary”, 1 “had”, 1 “a”, …
   – 1 “Its”, 1 “fleece”, 1 “was”, …
   – 1 “Everywhere”, 1 “that”, 1 “Mary”, …

3. Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …

“Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
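The three steps above can be sketched in plain Java on a single machine (a local illustration of the logic only; Hadoop runs the same split/count/collect pattern in parallel across many machines):

```java
import java.util.HashMap;
import java.util.Map;

// Local, single-machine sketch of the split / count / collect steps.
public class LocalWordCount {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> totals = new HashMap<>();
        // Step 1: split into sentences
        for (String sentence : text.split("\\.")) {
            // Step 2: count words in each sentence
            for (String word : sentence.trim().split("\\s+")) {
                if (word.isEmpty()) continue;
                // Step 3: collect the per-sentence counts into totals
                totals.merge(word, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String rhyme = "Mary had a little lamb. Its fleece was white as snow. "
                + "Everywhere that Mary went the lamb was sure to go.";
        // "Mary" and "lamb" each appear twice
        System.out.println(count(rhyme));
    }
}
```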


Example 2: Search Suggestions


Creating search suggestions

• Gazillions of search queries in server log files
• How many times was each word used?
• Using Hadoop, we would:

– Split up files
– Count words in each
– Sum word counts
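The final "sum word counts" step can be sketched with plain Java collections. The query words and counts below are made up for illustration; each map stands in for the partial counts produced from one log-file split:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the summing step: merge per-file partial counts into totals.
public class SuggestionCounts {
    public static Map<String, Integer> sum(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical partial counts from two log-file splits
        Map<String, Integer> log1 = Map.of("hadoop", 3, "pig", 1);
        Map<String, Integer> log2 = Map.of("hadoop", 2, "hive", 4);
        // "hadoop" totals 5; frequent words become suggestion candidates
        System.out.println(sum(List.of(log1, log2)));
    }
}
```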


So, Hadoop is

• A distributed batch processing infrastructure

• Built to process "web-scale" data: terabytes, petabytes

• Two components:
  – HDFS
  – MapReduce infrastructure


HDFS

• A distributed, fault-tolerant file system

• It’s easier to move calculations than data

• Hadoop will split the data for you
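As a rough sketch of the splitting arithmetic (assuming the 64 MB default block size of Hadoop at the time; the value is configurable):

```java
// Back-of-the-envelope sketch of HDFS splitting a file into blocks.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default (assumed)

    public static long numBlocks(long fileSizeBytes) {
        // Ceiling division: partial final blocks still occupy a block slot
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneTerabyte = 1024L * 1024 * 1024 * 1024;
        System.out.println(numBlocks(oneTerabyte)); // 16384
    }
}
```

Each block is replicated across several machines, and map tasks are scheduled on machines that already hold a copy of the block: the calculation moves to the data rather than the data to the calculation.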


MapReduce Infrastructure

• Two steps:
  – Map
  – Reduce

• Java, C, C++ APIs

• Pig, Streaming


Java Word Count: Mapper

//credit: http://ow.ly/1bER
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}


Java Word Count: Reducer

//credit: http://ow.ly/1bER
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}


Java Word Count: Running it

//credit: http://ow.ly/1bER
public class WordCount {
  ...
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));
    JobClient.runJob(conf);
  }
  ...
}


Streaming Word Count

//credit: http://ow.ly/1bER
• bin/hadoop jar hadoop-streaming.jar \
    -input in-dir -output out-dir \
    -mapper streamingMapper.sh -reducer streamingReducer.sh

• streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .

• streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'


Pig Word Count

//credit: http://ow.ly/1bER
input = LOAD 'in-dir' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'out-dir';


Beyond Word Count

• Yahoo! Search – Generating their Web Map

• Zattoo – Computing viewership stats

• New York Times – Converting their archives to PDF

• Last.fm – Improving their streams by learning from track-skipping patterns

• Facebook – Indexing mail accounts


Why use Hadoop?

• Do you have a very large data set?

• Hadoop runs on cheap, commodity hardware

• Simplified programming model


How do I use it?

1. Download Hadoop

2. Define cluster in Hadoop settings

3. Load data into HDFS using Hadoop

4. Define job using API, Pig, or streaming

5. Run job

6. Output is saved to file(s)

7. Sign up for Hadoop mailing list


Resources

• Hadoop project site

• Yahoo! Hadoop tutorial

• Hadoop Word Count (pdf)

• Owen O’Malley’s intro to Hadoop

• Ruby Word Count example

• Tutorial on Hadoop + EC2 + S3

• Tutorial on single-node Hadoop


Thank you!

• eldridge@yahoo-inc.com

• Twitter: erikeldridge

• Presentation is available here: slideshare.net/erikeldridge
