Some tips for effective map reducing

CHRISTOPHER SEVERS, eBay

eBay Netanya, December 2nd, 2013

Presentation on functional data mining at the IGT Cloud meet up at eBay Netanya

THE AGENDA

1. Quick survey of the current landscape for Hadoop tools

2. A light comparison of the best functional tools

3. General advice

4. Some code samples

THE ALTERNATIVES

I promise this part will be quick

VANILLA MAPREDUCE

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every whitespace-separated token in a line.
    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the 1s emitted for each word.
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // Driver: args[0] is the input path, args[1] is the output path.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On Hadoop 2+, Job.getInstance(conf, "wordcount") is the non-deprecated form.
        Job job = new Job(conf, "wordcount");
        // Tells Hadoop which jar to ship to the cluster.
        job.setJarByClass(WordCount.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        job.setMapperClass(Map.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Exit non-zero if the job fails.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

PIG

•Apache Pig is a really great tool for quick, ad-hoc data analysis

•While we can do amazing things with it, I’m not sure we should

•Anything complicated requires User Defined Functions (UDFs)

•UDFs require a separate code base

•Now you have to maintain two separate languages for no good reason

APACHE HIVE

•On previous slide: s/Pig/Hive/g

GENERAL ADVICE

Do this, not that

DO

•Use a higher level abstraction like distributed lists

•Use objects instead of tuples

•Use a good serialization format

•Always check for data quality

•Use flatMap for uncertain computations

•Develop reusable reductions (monoids!)

•Prefer map side operations when possible

•Always check for data skew

DON’T

•Never use nulls

•Don’t use too many levels of nesting

•Don’t use shared state

•Don’t use iteration (too much)

•Try not to start with a complicated approach

SCALDING AND SCOOBI

This is what we use at eBay

SOME SCALA CODE

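// getStuff and write are placeholders for reading input and writing output
val myLines = getStuff
// one line in, many words out
val myWords = myLines.flatMap(w => w.split("\\s+"))
// group the words (not the lines) by themselves: word -> all its occurrences
val myWordsGrouped = myWords.groupBy(identity)
val countedWords = myWordsGrouped.mapValues(x => x.size)
write(countedWords)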

SOME SCALDING CODE

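// Typed API: wrap the TextLine source in a TypedPipe of lines
val myLines = TypedPipe.from(TextLine(path))
val myWords = myLines.flatMap(w => w.split(" "))
  .groupBy(identity)
  .size
// (word, count) pairs written as tab-separated values
myWords.write(TypedTsv[(String, Long)](output))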

WHAT HAPPENED ON THE PREVIOUS SLIDE?

•flatMap()

–Similar to map, but a one-to-many rather than one-to-one mapping: each input can produce zero, one, or many outputs

–Use it when a computation may not produce a result for every input

–Can handle errors with the Option (Maybe) monad: a None is discarded and the contents of a Some are kept
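
A minimal sketch of the Option point, using plain Scala collections rather than Scalding so it runs in the REPL. The Click type, the tab-separated input layout, and the parseClick and parseAll names are all made up for illustration.

```scala
object FlatMapSketch {
  // Hypothetical record type; the name and fields are illustrative only.
  final case class Click(user: String, itemId: Long)

  // Returns Some(click) for a well-formed line and None for anything else,
  // so bad input neither throws nor produces a null.
  def parseClick(line: String): Option[Click] =
    line.split("\t") match {
      case Array(user, id) if user.nonEmpty =>
        scala.util.Try(id.trim.toLong).toOption.map(n => Click(user, n))
      case _ => None
    }

  // flatMap keeps the contents of every Some and drops every None.
  def parseAll(lines: List[String]): List[Click] =
    lines.flatMap(line => parseClick(line))

  def main(args: Array[String]): Unit = {
    val raw = List("alice\t42", "garbage", "bob\tnot-a-number", "carol\t7")
    println(parseAll(raw))
  }
}
```

Only the two well-formed lines survive; the malformed ones disappear without any special error-handling branch in the pipeline itself.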

MORE EXPLANATION

•groupBy()

–Takes a function that generates a key from the given value

–Logically the result can be thought of as an associative array: key -> List of values

–In Scalding this doesn’t necessarily force a Hadoop reduce phase; it depends on what comes after
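
The "key -> List of values" picture can be tried on an ordinary Scala List. This is only a local-collections sketch, not Scalding code, and the byFirstLetter and countByFirstLetter names are invented for the example.

```scala
object GroupBySketch {
  // groupBy takes a key function and yields key -> all values with that key.
  // Empty strings are dropped first because they have no first letter.
  def byFirstLetter(words: List[String]): Map[Char, List[String]] =
    words.filter(_.nonEmpty).groupBy(word => word.head)

  // A follow-up aggregation over each group. In a distributed setting it is
  // this following step that decides how the grouping has to be executed.
  def countByFirstLetter(words: List[String]): Map[Char, Int] =
    byFirstLetter(words).map { case (letter, group) => letter -> group.size }

  def main(args: Array[String]): Unit = {
    val words = List("apple", "avocado", "banana", "", "blueberry", "cherry")
    println(byFirstLetter(words))
    println(countByFirstLetter(words))
  }
}
```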

THE BEST PART

•size

–This part is pure magic

–size is actually sugar for .map(t => 1L).sum

–sum has an implicit argument, mon: Monoid[T]
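
To take some of the magic out, here is a rough sketch of how a sum with an implicit monoid argument can work. The Monoid trait below is a tiny stand-in written for this example, not Algebird's actual trait, and sum and size are local helpers rather than Scalding's methods.

```scala
object SizeSketch {
  // Stand-in type class: an identity element plus an associative operation.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // The instance the compiler picks whenever the element type is Long.
  implicit val longAddition: Monoid[Long] = new Monoid[Long] {
    val zero: Long = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // sum never mentions addition; the implicit argument supplies the operation.
  def sum[T](values: Iterable[T])(implicit mon: Monoid[T]): T =
    values.foldLeft(mon.zero)(mon.plus)

  // size written as "turn every element into 1L, then sum".
  def size[A](values: Iterable[A]): Long =
    sum(values.map(_ => 1L))

  def main(args: Array[String]): Unit =
    println(size(List("to", "be", "or", "not", "to", "be")))
}
```

Swapping in a different Monoid instance changes what sum computes without touching sum itself, which is what makes the reduction reusable.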

MONOIDS: WHY YOU SHOULD CARE ABOUT MATH

•From Wikipedia:

–a monoid is an algebraic structure with a single associative binary operation and an identity element.

•Almost everything you want to do is a monoid

–Standard addition of numeric types is the most common

–List/map/set/string concatenation

–Top k elements

–Bloom filter, count-min sketch, HyperLogLog

–Stochastic gradient descent

–Histograms
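
Two of the items above, histograms and top k, written out as a hedged sketch. The Monoid trait is again a local stand-in (not Algebird's), and histogram, topK, combineAll and isAssociativeOn are names chosen for this example. The point is that associativity lets partial results from separate partitions be merged in any grouping.

```scala
object MonoidSketch {
  // Stand-in monoid: identity element plus an associative binary operation.
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Histogram: merge two count maps by adding the counts key by key.
  val histogram: Monoid[Map[String, Long]] = new Monoid[Map[String, Long]] {
    val zero: Map[String, Long] = Map.empty
    def plus(l: Map[String, Long], r: Map[String, Long]): Map[String, Long] =
      r.foldLeft(l) { case (acc, (k, n)) => acc.updated(k, acc.getOrElse(k, 0L) + n) }
  }

  // Top k: keep only the k largest values seen so far, largest first.
  def topK(k: Int): Monoid[List[Int]] = new Monoid[List[Int]] {
    val zero: List[Int] = Nil
    def plus(l: List[Int], r: List[Int]): List[Int] =
      (l ++ r).sorted(Ordering[Int].reverse).take(k)
  }

  // Merge per-partition partial results into one final result.
  def combineAll[T](parts: Seq[T])(m: Monoid[T]): T =
    parts.foldLeft(m.zero)(m.plus)

  // Spot-check associativity on three sample values.
  def isAssociativeOn[T](a: T, b: T, c: T)(m: Monoid[T]): Boolean =
    m.plus(m.plus(a, b), c) == m.plus(a, m.plus(b, c))

  def main(args: Array[String]): Unit = {
    val partitions = Seq(List(5, 1), List(9), List(3, 7))
    println(combineAll(partitions)(topK(2)))
    println(histogram.plus(Map("a" -> 1L, "b" -> 2L), Map("b" -> 3L)))
  }
}
```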

MORE MONOID STUFF

•If you are aggregating, you are probably using a monoid

•Scalding has Algebird and monoid support baked in

•Scoobi can use Algebird (or any other monoid library) with almost no work

–combine { case (l,r) => monoid.plus(l,r) }

•Algebird handles tuples with ease

•Very easy to define monoids for your own types
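
A sketch of the last two bullets under stated assumptions: the Monoid trait and its instances here are hand-rolled stand-ins that only imitate what a library such as Algebird provides, PriceStats is a hypothetical user type, and the local combine helper merely mirrors the shape of the Scoobi snippet above.

```scala
object OwnMonoidSketch {
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  object Monoid {
    implicit val longAddition: Monoid[Long] = new Monoid[Long] {
      val zero: Long = 0L
      def plus(l: Long, r: Long): Long = l + r
    }

    // A pair of monoids is itself a monoid: combine component-wise.
    implicit def pair[A, B](implicit ma: Monoid[A], mb: Monoid[B]): Monoid[(A, B)] =
      new Monoid[(A, B)] {
        val zero: (A, B) = (ma.zero, mb.zero)
        def plus(l: (A, B), r: (A, B)): (A, B) =
          (ma.plus(l._1, r._1), mb.plus(l._2, r._2))
      }
  }

  // Hypothetical user type: running count, total and maximum of prices in cents.
  final case class PriceStats(count: Long, totalCents: Long, maxCents: Long)

  object PriceStats {
    def of(cents: Long): PriceStats = PriceStats(1L, cents, cents)

    implicit val monoid: Monoid[PriceStats] = new Monoid[PriceStats] {
      // Long.MinValue is the identity for max, so zero never wins a comparison.
      val zero: PriceStats = PriceStats(0L, 0L, Long.MinValue)
      def plus(l: PriceStats, r: PriceStats): PriceStats =
        PriceStats(l.count + r.count,
                   l.totalCents + r.totalCents,
                   math.max(l.maxCents, r.maxCents))
    }
  }

  // Same shape as the combine call on the slide, here over a local collection.
  def combine[T](values: Iterable[T])(implicit monoid: Monoid[T]): T =
    values.foldLeft(monoid.zero) { case (l, r) => monoid.plus(l, r) }

  def main(args: Array[String]): Unit = {
    println(combine(List((1L, 10L), (1L, 20L))))
    println(combine(List(250L, 999L, 100L).map(c => PriceStats.of(c))))
  }
}
```

Keeping count and total together means an average can be derived at the very end, while every intermediate merge stays a plain monoid plus.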

ADVANTAGES

•Type checking

–Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)

•Single language

–Scala is a full programming language

•Productivity

–Since the code you write looks like collections code, you can use the Scala REPL to prototype

•Clarity

–Write code as a series of operations and let the job planner smash it all together

CONCLUSION

We’re almost done!

THINGS TO TAKE AWAY

•MapReduce is a functional problem, so we should use functional tools

•You can increase productivity, safety, and maintainability all at once with no downside

•Thinking of data flows in a functional way opens up many new possibilities

•The community is awesome

THANKS!

•Questions/comments?
