MapReduce Intro

The MapReduce Programming Model

Introduction and Examples

Dr. Jose María Alvarez-Rodríguez

“Quality Management in Service-based Systems and Cloud Applications”

FP7 RELATE-ITN

South East European Research Center

Thessaloniki, 10th of April, 2013

Some slides about the Map/Reduce programming model (for academic purposes), adapting some examples from the book MapReduce Design Patterns. Special thanks to the following sources:

- http://shop.oreilly.com/product/0636920025122.do
- http://mapreducepatterns.com/index.php?title=Main_Page
- http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/

1 MapReduce in a nutshell

2 Thinking in MapReduce

3 Applying MapReduce

4 Success Stories with MapReduce

5 Summary and Conclusions

MapReduce in a nutshell

Features

A programming model...

1 Large-scale distributed data processing

2 Simple but restricted

3 Parallel programming

4 Extensible

Antecedents

Functional programming

1 Inspired by it

2 ...but not equivalent to it

Example in Python

“Given a list of numbers between 1 and 50, print only the even numbers”

    # Python 3: filter() is lazy, so wrap it in list(); range(1, 51) covers 1..50.
    print(list(filter(lambda x: x % 2 == 0, range(1, 51))))

A list of numbers (data)

A condition (even numbers)

A function, filter, that is applied to each element of the list (map)

...Other examples...

Example in Python

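“Return the sum of the squares of a list of numbers between 1 and 50”

    import operator
    from functools import reduce  # reduce lives in functools in Python 3
    # Square every number in 1..50, then fold the squares with addition.
    reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)

“reduce” is equivalent to “foldl” in other functional languages such as Haskell

Other mathematical considerations should be taken into account (the kind of operator, e.g. whether it is associative and commutative)...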
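To illustrate why the kind of operator matters, here is a minimal sketch in plain Python (not Hadoop). The helper name `chunked_reduce` and the chunk size are assumptions made for this illustration. Each chunk is reduced independently, as parallel workers would do, and the partial results are then reduced again. With an associative operator such as addition the result matches the sequential fold; with subtraction it does not.

```python
import operator
from functools import reduce


def chunked_reduce(op, values, chunk_size):
    """Reduce each chunk independently, then reduce the partial results.

    This mimics several workers, each folding its own slice of the data.
    """
    values = list(values)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]
    return reduce(op, partials)


squares = [x ** 2 for x in range(1, 51)]

# Addition is associative: chunking does not change the result.
sequential_sum = reduce(operator.add, squares)
parallel_sum = chunked_reduce(operator.add, squares, chunk_size=10)

# Subtraction is not associative: chunking changes the result.
sequential_diff = reduce(operator.sub, squares)
parallel_diff = chunked_reduce(operator.sub, squares, chunk_size=10)

print(sequential_sum == parallel_sum)
print(sequential_diff == parallel_diff)
```

This is one reason the model is "simple but restricted": the reduce step can only be split across workers safely when the operator tolerates regrouping.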

Some interesting points...

The MapReduce framework...

1 Is inspired by functional programming concepts (but not equivalent to them)

2 Targets problems that can be parallelized

3 Sometimes involves recursive solutions

4 ...

Basic Model

“MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.

Map Function

Figure: Mapping creates a new output list by applying a function to individual elements of an input list.

“Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.

Reduce Function

Figure: Reducing a list iterates over the input values to produce an aggregate value as output.

“Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.

MapReduce Flow

Figure: High-level MapReduce pipeline.

“Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.

MapReduce Flow

Figure: Detailed Hadoop MapReduce data flow.

“Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.

Tip

What is MapReduce?

It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach.

Thinking in MapReduce

When should I use MapReduce?

Query

  Index and Search: inverted index

  Filtering

  Classification

  Recommendations: clustering or collaborative filtering

Analytics

  Summarization and statistics

  Sorting and merging

  Frequency distribution

  SQL-based queries: group-by, having, etc.

  Generation of graphics: histograms, scatter plots.

Others

  Message passing such as Breadth-First Search or PageRank algorithms.

How Google uses MapReduce (80% of data processing)

Large-scale web search indexing

Clustering problems for Google News

Produce reports for popular queries, e.g. Google Trends

Processing of satellite imagery data

Language model processing for statistical machine translation

Large-scale machine learning problems

. . .

Comparison of MapReduce and other approaches

“MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.

Evaluation of MapReduce and other approaches

“MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.

Apache Hadoop

MapReduce definition

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Figure: Apache Hadoop Logo.

Tip

What can I do in MapReduce?

Three main functions:

1 Querying

2 Summarizing

3 Analyzing

. . . large datasets in off-line mode for boosting other on-line processes.

Applying MapReduce

MapReduce in Action

MapReduce Patterns

1 Summarization

2 Filtering

3 Data Organization (sort, merging, etc.)

4 Relational-based (join, selection, projection, etc.)

5 Iterative Message Passing (graph processing)

6 Others (depending on the implementation):

Simulation of distributed systems

Cross-correlation

Metapatterns

Input-output

. . .

Overview (stages)-Counting Letters

Summarization

Types

1 Numerical summarizations

2 Inverted index

3 Counting and counters

Numerical Summarization-I

Description

A general pattern for calculating aggregate statistical values over your data.

Intent

Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.

Numerical Summarization-II

Applicability

To deal with numerical data or counting.

To group data by specific fields

Examples

1 Word count

2 Record count

3 Min/Max/Count

4 Average/Median/Standard deviation

5 . . .

Numerical Summarization-Pseudocode

class Mapper
    method Map(recordid id, record r)
        for all term t in record r do
            Emit(term t, count 1)

class Reducer
    method Reduce(term t, counts [c1, c2,...])
        sum = 0
        for all count c in [c1, c2,...] do
            sum = sum + c
        Emit(term t, count sum)
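The pseudocode above can be tried out without a cluster. The following is a minimal single-process sketch in Python, not Hadoop code: `run_job` is a hypothetical stand-in for the framework that performs the map, shuffle (group by key) and reduce stages in memory, and the sample lines are made up for illustration.

```python
from collections import defaultdict


def map_fn(record_id, record):
    """Mapper: emit (term, 1) for every term in the record."""
    for term in record.split():
        yield term, 1


def reduce_fn(term, counts):
    """Reducer: sum the counts collected for one term."""
    total = 0
    for c in counts:
        total += c
    yield term, total


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for the framework: map, shuffle, reduce."""
    groups = defaultdict(list)
    for record_id, record in enumerate(records):
        for key, value in mapper(record_id, record):
            groups[key].append(value)  # shuffle: group values by key
    output = {}
    for key in sorted(groups):  # each key is handed to the reducer once
        for out_key, out_value in reducer(key, groups[key]):
            output[out_key] = out_value
    return output


lines = ["to be or not to be", "to map and to reduce"]
word_counts = run_job(lines, map_fn, reduce_fn)
print(word_counts)
```

In a real framework the grouping step is distributed across machines; here a dictionary plays that role so that the mapper and reducer can be read in isolation.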

Overview-Word Counter

Numerical Summarization-Word Counter

    // "word" (a Text) and "one" (an IntWritable holding 1) are assumed to be
    // fields of the Mapper class, reused across calls.
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);  // emit (term, 1)
        }
    }

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();  // add up the 1s emitted for this term
        }
        context.write(key, new IntWritable(sum));
    }

Example-II

Min/Max

Given a list of tweets (username, date, text), determine the first and last time a user commented and the number of times.

Implementation

See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro

Overview - Min/Max

∗ Min and max creation date are the same in the map phase.

Example II-Min/Max, function Map

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        String strDate = parsed.get(MRDPUtils.CREATION_DATE);
        String userId = parsed.get(MRDPUtils.USER_ID);
        if (strDate == null || userId == null) {
            return;  // skip records without a date or a user
        }
        Date creationDate;
        try {
            creationDate = MRDPUtils.frmt.parse(strDate);
        } catch (ParseException e) {
            // An overriding map() may only declare IOException and
            // InterruptedException, so malformed dates are skipped here.
            return;
        }
        // In the map phase, min and max are both the record's own date.
        outTuple.setMin(creationDate);
        outTuple.setMax(creationDate);
        outTuple.setCount(1);
        outUserId.set(userId);
        context.write(outUserId, outTuple);
    }

Example II-Min/Max, function Reduce

    public void reduce(Text key, Iterable<MinMaxCountTuple> values, Context context)
            throws IOException, InterruptedException {
        result.setMin(null);
        result.setMax(null);
        int sum = 0;
        for (MinMaxCountTuple val : values) {
            if (result.getMin() == null
                    || val.getMin().compareTo(result.getMin()) < 0) {
                result.setMin(val.getMin());
            }
            if (result.getMax() == null
                    || val.getMax().compareTo(result.getMax()) > 0) {
                result.setMax(val.getMax());
            }
            sum += val.getCount();
        }
        result.setCount(sum);
        context.write(key, result);
    }
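The same Min/Max/Count logic can be sketched in a few lines of Python for experimentation. This is not the Hadoop implementation: the sample tweets, the ISO date format and the function names `map_fn` and `reduce_fn` are assumptions made for this illustration, and a dictionary stands in for the shuffle stage.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample records: (username, creation date, text).
tweets = [
    ("ana", "2013-04-01T09:00:00", "first"),
    ("bob", "2013-04-02T10:30:00", "hello"),
    ("ana", "2013-04-05T18:15:00", "third"),
    ("ana", "2013-04-03T12:00:00", "second"),
]


def map_fn(tweet):
    """Emit (user, (min, max, count)); min and max both start as the tweet date."""
    user, date_str, _text = tweet
    created = datetime.fromisoformat(date_str)
    return user, (created, created, 1)


def reduce_fn(tuples):
    """Merge the (min, max, count) tuples of one user."""
    lo, hi, total = None, None, 0
    for t_min, t_max, t_count in tuples:
        if lo is None or t_min < lo:
            lo = t_min
        if hi is None or t_max > hi:
            hi = t_max
        total += t_count
    return lo, hi, total


# Shuffle stand-in: group the mapper output by user.
grouped = defaultdict(list)
for tweet in tweets:
    user, value = map_fn(tweet)
    grouped[user].append(value)

summary = {user: reduce_fn(values) for user, values in grouped.items()}
for user, (first, last, count) in sorted(summary.items()):
    print(user, first.isoformat(), last.isoformat(), count)
```

Because the reducer's input and output have the same (min, max, count) shape, the same merge could also be applied to partial results before the final reduce.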

Example-III

Average

Given a list of tweets (username, date, text), determine the average comment length per hour of day.

Implementation

See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro

Overview - Average

Example III-Average, function Map

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        String strDate = parsed.get(MRDPUtils.CREATION_DATE);
        String text = parsed.get(MRDPUtils.TEXT);
        if (strDate == null || text == null) {
            return;  // skip records without a date or a text
        }
        Date creationDate;
        try {
            creationDate = MRDPUtils.frmt.parse(strDate);
        } catch (ParseException e) {
            return;  // skip malformed dates; map() cannot declare ParseException
        }
        // Key: hour of day (Date.getHours() is deprecated but still works).
        outHour.set(creationDate.getHours());
        // Each record starts as a group of one whose average is its own length.
        outCountAverage.setCount(1);
        outCountAverage.setAverage(text.length());
        context.write(outHour, outCountAverage);
    }

Example III-Average, function Reduce

    public void reduce(IntWritable key, Iterable<CountAverageTuple> values,
            Context context) throws IOException, InterruptedException {
        float sum = 0;
        float count = 0;
        for (CountAverageTuple val : values) {
            // count * average recovers the partial sum of lengths
            sum += val.getCount() * val.getAverage();
            count += val.getCount();
        }
        result.setCount(count);
        result.setAverage(sum / count);
        context.write(key, result);
    }
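The reducer multiplies each average by its count before adding. The small Python sketch below shows why the count has to travel with the average. The function name `merge` and the sample lengths are assumptions made for this illustration.

```python
def merge(partials):
    """Combine (count, average) pairs into one (count, average) pair."""
    total = 0.0
    count = 0
    for c, avg in partials:
        total += c * avg  # recover each partial sum from count * average
        count += c
    return count, total / count


# Hypothetical comment lengths for one hour of the day, split across two mappers.
split_a = [10, 20, 30]
split_b = [100]

partial_a = (len(split_a), sum(split_a) / len(split_a))
partial_b = (len(split_b), sum(split_b) / len(split_b))

count, average = merge([partial_a, partial_b])

# Unweighted mean of the two partial averages, shown for comparison.
naive = (partial_a[1] + partial_b[1]) / 2

print(count, average)
print(naive)
```

The weighted merge gives the true mean of all four lengths, whereas the unweighted mean of the averages gives a different value. Carrying the count is what lets partial results be combined in any grouping.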

Numerical Summarization-Other approaches

Relation to SQL

    SELECT MIN(numcol1), MAX(numcol1), COUNT(*)
    FROM table GROUP BY groupcol2;

Implementation in Pig

    b = GROUP a BY groupcol2;
    c = FOREACH b GENERATE group, MIN(a.numcol1),
        MAX(a.numcol1), COUNT_STAR(a);

Filtering

Types

1 Filtering

2 Top N records

3 Bloom filtering

4 Distinct

Filtering-I

Description

It evaluates each record separately and decides, based on some condition, whether it should stay or go.

Intent

Filter out records that are not of interest and keep ones that are.

Filtering-II

Applicability

To collate data

Examples

1 Closer view of dataset

2 Data cleansing

3 Tracking a thread of events

4 Simple random sampling

5 Distributed Grep

6 Removing low scoring dataset

7 Log Analysis

8 Data Querying

9 Data Validation

10 . . .

Filtering-Pseudocode

class Mapper
    method Map(recordid id, record r)
        field f = extract(r)
        if predicate(f)
            Emit(recordid id, value(r))

class Reducer
    method Reduce(recordid id, values [r1, r2,...])
        // Whatever
        Emit(recordid id, aggregate(values))

Example-IV

Distributed Grep

Given a list of tweets (username, date, text), determine the tweets that contain a given word.

Implementation

See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro

Overview - Distributed Grep

Example IV-Distributed Grep, function Map

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        String txt = parsed.get(MRDPUtils.TEXT);
        if (txt == null) {
            return;  // nothing to match against
        }
        // The word to search for is read from the job configuration ("mapregex").
        String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex")
                + "(.)*\\b.*";
        if (txt.matches(mapRegex)) {
            context.write(NullWritable.get(), value);
        }
    }

...and the Reduce function?

In this case it is not necessary: the matching records are written directly to the output.
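A map-only filter of this kind is easy to try in plain Python. The sketch below is an illustration, not the Hadoop job: the sample tweets, the function name `grep_map` and the case-insensitive flag are assumptions. The pattern uses `\b` so that the word has to start at a word boundary, roughly as in the Java regular expression.

```python
import re


def grep_map(record, pattern):
    """Map-only filter: yield the record unchanged if its text matches."""
    _user, _date, text = record
    if text is not None and pattern.search(text):
        yield record


# Hypothetical sample records: (username, date, text).
tweets = [
    ("ana", "2013-04-01", "Learning Hadoop today"),
    ("bob", "2013-04-02", "Nothing to see here"),
    ("eve", "2013-04-03", "hadoop streaming with Python"),
]

# IGNORECASE is an extra assumption of this sketch.
pattern = re.compile(r"\bhadoop", re.IGNORECASE)

# No reduce step: the output is simply the concatenation of the mapper outputs.
matches = [out for tweet in tweets for out in grep_map(tweet, pattern)]
for user, _date, text in matches:
    print(user, text)
```

Since every record is judged on its own, no grouping by key is needed, which is why the job can run without reducers.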

Example-V

Top 5

Given a list of tweets (username, date, text), determine the 5 users that wrote the longest tweets.

Implementation

See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro

Overview - Top 5

Example V-Top 5, function Map

    // Keeps the longest tweets seen by this mapper, ordered by text length.
    // Tweets of equal length share a key, so a later one replaces an earlier one.
    private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        Map<String, String> parsed = MRDPUtils.parse(value.toString());
        if (parsed == null) { return; }
        String userId = parsed.get(MRDPUtils.USER_ID);
        String text = parsed.get(MRDPUtils.TEXT);
        // "Reputation" here is the tweet length: longer tweets rank higher.
        if (userId == null || text == null) { return; }
        // Copy the value: Hadoop reuses the same Text object between calls.
        repToRecordMap.put(text.length(), new Text(value));
        if (repToRecordMap.size() > MAX_TOP) {
            repToRecordMap.remove(repToRecordMap.firstKey());  // drop the shortest
        }
        // Nothing is written here; the kept records are presumably emitted
        // once the mapper has seen all its input (not shown on the slide).
    }

Example V-Top 5, function Reduce

    public void reduce(NullWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            Map<String, String> parsed = MRDPUtils.parse(value.toString());
            repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(), new Text(value));
            if (repToRecordMap.size() > MAX_TOP) {
                repToRecordMap.remove(repToRecordMap.firstKey());
            }
        }
        // Write the survivors from longest to shortest.
        for (Text t : repToRecordMap.descendingMap().values()) {
            context.write(NullWritable.get(), t);
        }
    }
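The two-level idea behind Top N (each mapper keeps its local winners, and one reducer merges them) can be sketched in Python with the standard `heapq` module. The split contents and the function names `top_n_local` and `top_n_global` are assumptions made for this illustration, and N is set to 2 to keep the data small.

```python
import heapq


def top_n_local(records, n):
    """Mapper side: keep only the n longest texts seen in one input split."""
    return heapq.nlargest(n, records, key=lambda r: len(r[1]))


def top_n_global(partials, n):
    """Reducer side: merge the local winners and keep the n longest overall."""
    merged = [r for partial in partials for r in partial]
    return heapq.nlargest(n, merged, key=lambda r: len(r[1]))


# Hypothetical input splits of (username, text) records.
split_1 = [("ana", "a" * 5), ("bob", "b" * 40), ("eve", "e" * 12)]
split_2 = [("dan", "d" * 33), ("fay", "f" * 7), ("gus", "g" * 21)]

N = 2
local_winners = [top_n_local(split, N) for split in (split_1, split_2)]
top = top_n_global(local_winners, N)
print([(user, len(text)) for user, text in top])
```

Each split forwards at most N records, so the reducer only ever sees N times the number of splits. Unlike a map keyed by length, this variant keeps records of equal length instead of overwriting them.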

Filtering-Other approaches

Relation to SQL

    SELECT * FROM table WHERE colvalue < VALUE;

Implementation in Pig

    b = FILTER a BY colvalue < VALUE;

Tip

How can I use and run a MapReduce framework?

You should identify what kind of problem you are addressing and apply a design pattern, to be implemented in a framework such as Apache Hadoop.

Success Stories with MapReduce

Tip

Who is using MapReduce?

All companies that are dealing with Big Data problems for analytics, such as:

Cloudera

Datasalt

Elasticsearch

. . .

Apache Hadoop-Related Projects

More tips

FAQ

MapReduce is a framework based on a simple programming model

...to deal with large datasets in a distributed fashion

...providing scalability, replication, fault tolerance, etc.

Apache Hadoop is not a database

New frameworks on top of Hadoop for specific tasks: querying, analysis, etc.

Other similar frameworks: Storm, Signal/Collect, etc.

. . .

Summary and Conclusions

Summary

Conclusions

What is MapReduce?

It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach.

What can I do in MapReduce?

Three main functions:

1 Querying

2 Summarizing

3 Analyzing

. . . large datasets in off-line mode for boosting other on-line processes.

How can I use and run a MapReduce framework?

You should identify what kind of problem you are addressing and apply a design pattern, to be implemented in a framework such as Apache Hadoop.

What’s next?

. . .

Concatenate MapReduce jobs

Optimization using combiners and setting the parameters (size of partition, etc.)

Pipelining with other languages such as Python

Hadoop in Action: more examples, etc.

New trending problems (image/video processing)

Real-time processing

. . .

References

J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, Jan. 2008.

J. R. Owens, J. Lentz, and B. Femiano. Hadoop Real-World Solutions Cookbook. Packt Publishing Ltd, 2013.

C. Lam. Hadoop in Action. Manning Publications Co., Greenwich, CT, USA, 1st edition, 2010.

J. Lin and C. Dyer. Data-intensive text processing with MapReduce. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Tutorial Abstracts, NAACL-Tutorials ’09, pages 1–2, Stroudsburg, PA, USA, 2009. Association for Computational Linguistics.

D. Miner and A. Shook. MapReduce Design Patterns. O’Reilly and Associates Inc, 2012.

S. Perera and T. Gunarathne. Hadoop MapReduce Cookbook. Packt Publishing Ltd, 2013.

T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 1st edition, 2009.

I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.