Hadoop high-level intro - U. of Mich. Hack U '09


DESCRIPTION

This is a very high-level introduction to Hadoop delivered to the Information Retrieval class at the University of Michigan during Hack U week '09.


Hadoop: A (very) high-level overview

University of Michigan Hack U ’08

Erik Eldridge, Yahoo! Developer Network

Photo credit: Swami Stream (http://ow.ly/17tC)


Overview

• What is it?

• Example 1: word count

• Example 2: search suggestions

• Why would I use it?

• How do I use it?

• Some Code


Before I continue…

• Slides are available here: slideshare.net/erikeldridge


Hadoop is

• Software for breaking a big job into smaller tasks, performing each task, and collecting the results


Example 1: Counting Words

1. Split into 3 sentences

2. Count words in each sentence
   – 1 “Mary”, 1 “had”, 1 “a”, …
   – 1 “Its”, 1 “fleece”, 1 “was”, …
   – 1 “Everywhere”, 1 “that”, 1 “Mary”, …

3. Collect results: 2 “Mary”, 1 “had”, 1 “a”, 1 “little”, 2 “lamb”, …

“Mary had a little lamb. Its fleece was white as snow. Everywhere that Mary went the lamb was sure to go.”
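The three steps above can be sketched in plain Java on a single machine (a local illustration of the logic only; Hadoop runs the same split/count/collect pattern in parallel across many machines):

```java
import java.util.HashMap;
import java.util.Map;

// Local, single-machine sketch of the split / count / collect steps.
public class LocalWordCount {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> totals = new HashMap<>();
        // Step 1: split into sentences
        for (String sentence : text.split("\\.")) {
            // Step 2: count words in each sentence
            for (String word : sentence.trim().split("\\s+")) {
                if (word.isEmpty()) continue;
                // Step 3: collect the per-sentence counts into totals
                totals.merge(word, 1, Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        String rhyme = "Mary had a little lamb. Its fleece was white as snow. "
                + "Everywhere that Mary went the lamb was sure to go.";
        // "Mary" and "lamb" each appear twice
        System.out.println(count(rhyme));
    }
}
```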


Example 2: Search Suggestions


Creating search suggestions

• Gazillions of search queries in server log files
• How many times was each word used?
• Using Hadoop, we would:

– Split up files
– Count words in each
– Sum word counts
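The final "sum word counts" step can be sketched with plain Java collections. The query words and counts below are made up for illustration; each map stands in for the partial counts produced from one log-file split:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the summing step: merge per-file partial counts into totals.
public class SuggestionCounts {
    public static Map<String, Integer> sum(List<Map<String, Integer>> partials) {
        Map<String, Integer> totals = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                totals.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical partial counts from two log-file splits
        Map<String, Integer> log1 = Map.of("hadoop", 3, "pig", 1);
        Map<String, Integer> log2 = Map.of("hadoop", 2, "hive", 4);
        // "hadoop" totals 5; frequent words become suggestion candidates
        System.out.println(sum(List.of(log1, log2)));
    }
}
```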


So, Hadoop is

• A distributed batch processing infrastructure

• Built to process "web-scale" data: terabytes, petabytes

• Two components:
  – HDFS
  – MapReduce infrastructure


HDFS

• A distributed, fault-tolerant file system

• It’s easier to move calculations than data

• Hadoop will split the data for you
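As a rough sketch of the splitting arithmetic (assuming the 64 MB default block size of Hadoop at the time; the value is configurable):

```java
// Back-of-the-envelope sketch of HDFS splitting a file into blocks.
public class BlockMath {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // 64 MB default (assumed)

    public static long numBlocks(long fileSizeBytes) {
        // Ceiling division: partial final blocks still occupy a block slot
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long oneTerabyte = 1024L * 1024 * 1024 * 1024;
        System.out.println(numBlocks(oneTerabyte)); // 16384
    }
}
```

Each block is replicated across several machines, and map tasks are scheduled on machines that already hold a copy of the block: the calculation moves to the data rather than the data to the calculation.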


MapReduce Infrastructure

• Two steps:
  – Map
  – Reduce

• Java, C, C++ APIs

• Pig, Streaming


Java Word Count: Mapper

//credit: http://ow.ly/1bER
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}


Java Word Count: Reducer

//credit: http://ow.ly/1bER
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}


Java Word Count: Running it

//credit: http://ow.ly/1bER
public class WordCount {
  ...
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    // the keys are words (strings)
    conf.setOutputKeyClass(Text.class);
    // the values are counts (ints)
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MapClass.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputPath(new Path(args[0]));
    conf.setOutputPath(new Path(args[1]));
    JobClient.runJob(conf);
  }
  ...
}


Streaming Word Count

//credit: http://ow.ly/1bER
• bin/hadoop jar hadoop-streaming.jar \
    -input in-dir -output out-dir \
    -mapper streamingMapper.sh -reducer streamingReducer.sh

• streamingMapper.sh: /bin/sed -e 's| |\n|g' | /bin/grep .

• streamingReducer.sh: /usr/bin/uniq -c | /bin/awk '{print $2 "\t" $1}'


Pig Word Count

//credit: http://ow.ly/1bER
input = LOAD 'in-dir' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE(*));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
STORE counts INTO 'out-dir';


Beyond Word Count

• Yahoo! Search – Generating their Web Map

• Zattoo – Computing viewership stats

• New York Times – Converting their archives to PDF

• Last.fm – Improving their streams by learning from track-skipping patterns

• Facebook – Indexing mail accounts


Why use Hadoop?

• Do you have a very large data set?

• Hadoop runs on cheap, commodity hardware

• Simplified programming model


How do I use it?

1. Download Hadoop

2. Define cluster in Hadoop settings

3. Load data into HDFS using Hadoop

4. Define job using API, Pig, or streaming

5. Run job

6. Output is saved to file(s)

7. Sign up for Hadoop mailing list


Resources

• Hadoop project site

• Yahoo! Hadoop tutorial

• Hadoop Word Count (pdf)

• Owen O’Malley’s intro to Hadoop

• Ruby Word Count example

• Tutorial on Hadoop + EC2 + S3

• Tutorial on single-node Hadoop


Thank you!

• eldridge@yahoo-inc.com

• Twitter: erikeldridge

• Presentation is available here: slideshare.net/erikeldridge
