Map/Reduce intro

MapReduce Intro

The MapReduce Programming Model

Introduction and Examples

Dr. Jose María Alvarez-Rodríguez

“Quality Management in Service-based Systems and Cloud Applications”

FP7 RELATE-ITN

South East European Research Center

Thessaloniki, 10th of April, 2013


1 MapReduce in a nutshell

2 Thinking in MapReduce

3 Applying MapReduce

4 Success Stories with MapReduce

5 Summary and Conclusions


MapReduce in a nutshell

Features

A programming model...

1 Large-scale distributed data processing

2 Simple but restricted

3 Parallel programming

4 Extensible



Antecedents

Functional programming

1 Inspired

2 ...but not equivalent

Example in Python

“Given a list of numbers between 1 and 50, print only the even numbers”

print(list(filter(lambda x: x % 2 == 0, range(1, 51))))

A list of numbers (data)

A condition (even numbers)

A function filter that is applied to the list (map)



...Other examples...

Example in Python

“Return the sum of the squares of a list of numbers between 1 and 50”

from functools import reduce
import operator

reduce(operator.add, map(lambda x: x ** 2, range(1, 51)), 0)

“reduce” is equivalent to “foldl” in other functional languages such as Haskell

Other mathematical considerations should be taken into account, such as the kind of operator (for example, whether it is associative and commutative)...



Some interesting points...

The MapReduce framework...

1 Inspired by functional programming concepts (but not equivalent)

2 Suited to problems that can be parallelized

3 Sometimes recursive solutions

4 ...



Basic Model

“MapReduce: The Programming Model and Practice”, SIGMETRICS, Tutorials 2009, Google.



Map Function

Figure: Mapping creates a new output list by applying a function to individual elements of an input list.

“Module 4: MapReduce”, Hadoop Tutorial, Yahoo!.



Reduce Function

Figure: Reducing a list iterates over the input values to produce an aggregate value as output.




MapReduce Flow

Figure: High-level MapReduce pipeline.




MapReduce Flow

Figure: Detailed Hadoop MapReduce data flow.




Tip

What is MapReduce?

It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach.
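The definition above can be made concrete with a small in-memory sketch. It is only an illustration of the map, shuffle and reduce steps on a single machine, not how Hadoop is implemented; the names `map_reduce`, `letter_mapper` and `sum_reducer` are hypothetical and chosen for this example.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, reduce each group."""
    # Map phase: each record yields zero or more (key, value) pairs
    pairs = chain.from_iterable(mapper(record) for record in records)
    # Shuffle phase: collect all values that share the same key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: collapse each group of values into a single result
    return {key: reducer(key, values) for key, values in groups.items()}


def letter_mapper(word):
    # Emit (letter, 1) for every letter of the word
    for letter in word:
        yield letter, 1


def sum_reducer(key, values):
    # Add up all the counts emitted for one key
    return sum(values)


letter_counts = map_reduce(["map", "reduce"], letter_mapper, sum_reducer)
```

Each call to `mapper` depends only on its own record, and each call to `reducer` only on its own group, which is what allows a real framework to run them on different machines.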


Thinking in MapReduce

When should I use MapReduce?

Query

Index and Search: inverted index

Filtering

Classification

Recommendations: clustering or collaborative filtering

Analytics

Summarization and statistics

Sorting and merging

Frequency distribution

SQL-based queries: group-by, having, etc.

Generation of graphics: histograms, scatter plots.

Others

Message passing such as Breadth-First Search or PageRank algorithms.



How Google uses MapReduce (80% of data processing)

Large-scale web search indexing

Clustering problems for Google News

Produce reports for popular queries, e.g. Google Trends

Processing of satellite imagery data

Language model processing for statistical machine translation

Large-scale machine learning problems

. . .



Comparison of MapReduce and other approaches




Evaluation of MapReduce and other approaches




Apache Hadoop

MapReduce definition

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Figure: Apache Hadoop Logo.



Tip

What can I do in MapReduce?

Three main functions:

1 Querying

2 Summarizing

3 Analyzing

. . . large datasets in off-line mode for boosting other on-line processes.


Applying MapReduce

MapReduce in Action

MapReduce Patterns

1 Summarization

2 Filtering

3 Data Organization (sort, merging, etc.)

4 Relational-based (join, selection, projection, etc.)

5 Iterative Message Passing (graph processing)

6 Others (depending on the implementation):

Simulation of distributed systems

Cross-correlation

Metapatterns

Input-output

. . .


Overview (stages)-Counting Letters


Summarization

Types

1 Numerical summarizations

2 Inverted index

3 Counting and counters
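Of the three types listed, the inverted index is perhaps the least obvious, so a minimal single-machine sketch may help. It assumes whitespace tokenization and a hypothetical `documents` dictionary from document id to text; neither comes from the slides.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Map step: emit (term, doc_id) for every term of the document
        for term in text.lower().split():
            # Shuffle/reduce step: collect the ids seen for each term
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = {"d1": "map reduce model", "d2": "reduce phase"}
inverted = build_inverted_index(docs)
```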


Numerical Summarization-I

Description

A general pattern for calculating aggregate statistical values over your data.

Intent

Group records together by a key field and calculate a numerical aggregate per group to get a top-level view of the larger data set.


Numerical Summarization-II

Applicability

To deal with numerical data or counting.

To group data by specific fields

Examples

1 Word count

2 Record count

3 Min/Max/Count

4 Average/Median/Standard deviation

5 . . .


Numerical Summarization-Pseudocode

class Mapper
  method Map(recordid id, record r)
    for all term t in record r do
      Emit(term t, count 1)

class Reducer
  method Reduce(term t, counts [c1, c2,...])
    sum = 0
    for all count c in [c1, c2,...] do
      sum = sum + c
    Emit(term t, count sum)
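The pseudocode above can be traced in plain Python. This is a sketch that runs on a single machine and groups the emitted pairs by hand; the function names and the sample `records` are assumptions made for illustration.

```python
from collections import defaultdict


def word_count_map(record_id, record):
    # Emit (term, 1) for every term in the record
    for term in record.split():
        yield term, 1


def word_count_reduce(term, counts):
    # Sum all the counts received for one term
    total = 0
    for c in counts:
        total += c
    return term, total


records = {1: "to be or not to be", 2: "to map"}

# Shuffle: group the emitted counts by term
grouped = defaultdict(list)
for record_id, record in records.items():
    for term, one in word_count_map(record_id, record):
        grouped[term].append(one)

word_counts = dict(word_count_reduce(t, cs) for t, cs in grouped.items())
```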


Overview-Word Counter


Numerical Summarization-Word Counter

// Assumed fields of the enclosing Mapper class (not shown on the slide)
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();

public void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String line = value.toString();
  StringTokenizer tokenizer = new StringTokenizer(line);
  // Emit (word, 1) for every token in the input line
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    context.write(word, one);
  }
}

public void reduce(Text key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  // Sum all the counts emitted for this word
  int sum = 0;
  for (IntWritable val : values) {
    sum += val.get();
  }
  context.write(key, new IntWritable(sum));
}


Example-II

Min/Max

Given a list of tweets (username, date, text), determine the first and last time a user commented and the number of comments.

Implementation

See https://github.com/chemaar/seqos/tree/master/prototypes/mapreduce-intro


Overview - Min/Max

∗ Min and max creation date are the same in the map phase.


Example II-Min/Max, function Map

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  String strDate = parsed.get(MRDPUtils.CREATION_DATE);
  String userId = parsed.get(MRDPUtils.USER_ID);
  // Skip records that lack a date or a user
  if (strDate == null || userId == null) {
    return;
  }
  Date creationDate;
  try {
    creationDate = MRDPUtils.frmt.parse(strDate);
  } catch (ParseException e) {
    // An overriding map() cannot declare ParseException: skip malformed dates
    return;
  }
  // In the map phase min and max are both the creation date of this record
  outTuple.setMin(creationDate);
  outTuple.setMax(creationDate);
  outTuple.setCount(1);
  outUserId.set(userId);
  context.write(outUserId, outTuple);
}


Example II-Min/Max, function Reduce

public void reduce(Text key, Iterable<MinMaxCountTuple> values,
    Context context) throws IOException, InterruptedException {
  result.setMin(null);
  result.setMax(null);
  int sum = 0;
  for (MinMaxCountTuple val : values) {
    // Keep the earliest date seen so far
    if (result.getMin() == null
        || val.getMin().compareTo(result.getMin()) < 0) {
      result.setMin(val.getMin());
    }
    // Keep the latest date seen so far
    if (result.getMax() == null
        || val.getMax().compareTo(result.getMax()) > 0) {
      result.setMax(val.getMax());
    }
    sum += val.getCount();
  }
  result.setCount(sum);
  context.write(key, result);
}
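The same min/max/count idea can be sketched in a few lines of Python. The tuple layout `(min, max, count)`, the date format and the sample tweets are assumptions for illustration only and are not taken from the linked repository.

```python
from datetime import datetime


def min_max_count_map(tweet):
    # In the map phase min and max are both the tweet's own date, count is 1
    user, date_str, _text = tweet
    created = datetime.strptime(date_str, "%Y-%m-%d %H:%M")
    return user, (created, created, 1)


def min_max_count_reduce(tuples):
    # Combine partial tuples: earliest min, latest max, summed counts
    mins, maxs, counts = zip(*tuples)
    return min(mins), max(maxs), sum(counts)


tweets = [
    ("ana", "2013-04-10 09:00", "hello"),
    ("ana", "2013-04-08 18:30", "first"),
    ("bob", "2013-04-09 12:00", "hi"),
]

# Shuffle: group the mapped tuples by user
by_user = {}
for tweet in tweets:
    user, tup = min_max_count_map(tweet)
    by_user.setdefault(user, []).append(tup)

summary = {user: min_max_count_reduce(ts) for user, ts in by_user.items()}
```

Because the reducer's output has the same shape as its input, the same function could also serve as a combiner.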


Example-III

Average

Given a list of tweets (username, date, text), determine the average comment length per hour of day.

Implementation



Overview - Average


Example III-Average, function Map


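public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  String strDate = parsed.get(MRDPUtils.CREATION_DATE);
  String text = parsed.get(MRDPUtils.TEXT);
  // Skip records that lack a date or a text
  if (strDate == null || text == null) {
    return;
  }
  Date creationDate;
  try {
    creationDate = MRDPUtils.frmt.parse(strDate);
  } catch (ParseException e) {
    // An overriding map() cannot declare ParseException: skip malformed dates
    return;
  }
  // Key: hour of day; value: (count 1, "average" = length of this tweet)
  outHour.set(creationDate.getHours());
  outCountAverage.setCount(1);
  outCountAverage.setAverage(text.length());
  context.write(outHour, outCountAverage);
}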


Example III-Average, function Reduce

public void reduce(IntWritable key, Iterable<CountAverageTuple> values,
    Context context) throws IOException, InterruptedException {
  float sum = 0;
  float count = 0;
  // Rebuild each partial sum as count * average, then combine
  for (CountAverageTuple val : values) {
    sum += val.getCount() * val.getAverage();
    count += val.getCount();
  }
  result.setCount(count);
  result.setAverage(sum / count);
  context.write(key, result);
}
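The reducer carries a count next to each average because averages cannot simply be averaged again. A small Python sketch, with a hypothetical `merge_count_average` helper and made-up tweet lengths, shows the difference:

```python
def merge_count_average(partials):
    """Merge (count, average) pairs into a single (count, average) pair."""
    total = 0.0
    count = 0
    for c, avg in partials:
        total += c * avg   # recover the partial sum
        count += c
    return count, total / count


# Merging in two steps, as a combiner followed by a reducer might do
left = merge_count_average([(1, 10), (1, 20), (1, 30)])
right = merge_count_average([(1, 100)])
merged = merge_count_average([left, right])

# Merging everything at once gives the same answer
whole = merge_count_average([(1, 10), (1, 20), (1, 30), (1, 100)])

# Averaging the two partial averages directly ignores the group sizes
naive = (left[1] + right[1]) / 2
```

Here `merged` and `whole` agree on an average of 40, while `naive` yields 60 because it weights the single long tweet as heavily as the three short ones.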


Numerical Summarization-Other approaches

Relation to SQL

SELECT MIN(numcol1), MAX(numcol1), COUNT(*)
FROM table GROUP BY groupcol2;

Implementation in PIG

b = GROUP a BY groupcol2;
c = FOREACH b GENERATE group, MIN(a.numcol1),
    MAX(a.numcol1), COUNT_STAR(a);


Filtering

Types

1 Filtering

2 Top N records

3 Bloom filtering

4 Distinct


Filtering-I

Description

It evaluates each record separately and decides, based on some condition, whether it should stay or go.

Intent

Filter out records that are not of interest and keep ones that are.


Filtering-II

Applicability

To collate data

Examples

1 Closer view of dataset

2 Data cleansing

3 Tracking a thread of events

4 Simple random sampling

5 Distributed Grep

6 Removing low-scoring data

7 Log Analysis

8 Data Querying

9 Data Validation

10 . . .


Filtering-Pseudocode

class Mapper
  method Map(recordid id, record r)
    field f = extract(r)
    if predicate(f)
      Emit(recordid id, value(r))

class Reducer
  method Reduce(recordid id, values [r1, r2,...])
    // Optional: any aggregation over the records that were kept
    Emit(recordid id, aggregate(values))
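A minimal Python sketch of the mapper side of this pattern follows. The record layout and the score threshold are invented for the example, and no reducer is used since the filter alone decides what is kept.

```python
def filter_map(record_id, record, predicate):
    # Emit the record only when the predicate holds; otherwise emit nothing
    if predicate(record):
        yield record_id, record


records = {
    1: {"user": "ana", "score": 7},
    2: {"user": "bob", "score": 2},
    3: {"user": "cy", "score": 9},
}


def is_high_score(record):
    # Hypothetical condition: keep records with a score of at least 5
    return record["score"] >= 5


kept = dict(
    pair
    for record_id, record in records.items()
    for pair in filter_map(record_id, record, is_high_score)
)
```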


Example-IV

Distributed Grep

Given a list of tweets (username, date, text), determine the tweets that contain a word.

Implementation



Overview - Distributed Grep


Example IV-Distributed Grep, function Map





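public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  String txt = parsed.get(MRDPUtils.TEXT);
  // The word to search for is read from the job configuration
  String mapRegex = ".*\\b" + context.getConfiguration().get("mapregex")
      + "(.)*\\b.*";
  if (txt != null && txt.matches(mapRegex)) {
    context.write(NullWritable.get(), value);
  }
}

...and the Reduce function?

In this case it is not necessary: the values emitted by the map phase are written directly to the output.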
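The matching step can be tried out in Python. This sketch mirrors the word-boundary prefix match of the regular expression above; using `re.escape` on the search word is an addition not present in the Java code, and the sample lines are made up.

```python
import re


def grep_map(lines, word):
    # Match the word at a word boundary, as the "\b" + word pattern does
    pattern = re.compile(r"\b" + re.escape(word))
    for line in lines:
        if pattern.search(line):
            # Map-only job: matching lines go straight to the output
            yield line


sample = ["hadoop rocks", "I like mapreduce", "unhadoop", "hadooping now"]
matches = list(grep_map(sample, "hadoop"))
```

The line "unhadoop" is not matched because there is no word boundary before "hadoop", whereas "hadooping now" is matched because the pattern only anchors the start of the word.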


Example-V

Top 5

Given a list of tweets (username, date, text), determine the 5 users who wrote the longest tweets.

Implementation



Overview - Top 5


Example V-Top 5, function Map

private TreeMap<Integer, Text> repToRecordMap = new TreeMap<Integer, Text>();

public void map(Object key, Text value, Context context)
    throws IOException, InterruptedException {
  Map<String, String> parsed = MRDPUtils.parse(value.toString());
  if (parsed == null) { return; }
  String userId = parsed.get(MRDPUtils.USER_ID);
  String text = parsed.get(MRDPUtils.TEXT);
  if (userId == null || text == null) { return; }
  // "Reputation" is the tweet length: longer tweets rank higher
  repToRecordMap.put(text.length(), new Text(value));
  // Keep only the MAX_TOP longest tweets seen by this mapper
  if (repToRecordMap.size() > MAX_TOP) {
    repToRecordMap.remove(repToRecordMap.firstKey());
  }
}


Example V-Top 5, function Reduce

public void reduce(NullWritable key, Iterable<Text> values,
    Context context) throws IOException, InterruptedException {
  for (Text value : values) {
    Map<String, String> parsed = MRDPUtils.parse(value.toString());
    repToRecordMap.put(parsed.get(MRDPUtils.TEXT).length(), new Text(value));
    // Keep only the MAX_TOP longest tweets
    if (repToRecordMap.size() > MAX_TOP) {
      repToRecordMap.remove(repToRecordMap.firstKey());
    }
  }
  // Emit from longest to shortest
  for (Text t : repToRecordMap.descendingMap().values()) {
    context.write(NullWritable.get(), t);
  }
}
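The two-stage structure (a local top N per mapper, then a global top N in a single reducer) can be sketched in Python. This version uses `heapq.nlargest` rather than a `TreeMap`, so tweets of equal length do not overwrite each other; the function names and sample data are assumptions.

```python
import heapq


def tweet_length(tweet):
    # A tweet is a (user, text) pair; rank by the length of the text
    return len(tweet[1])


def local_top_n(tweets, n):
    # Map side: each split keeps only its own n longest tweets
    return heapq.nlargest(n, tweets, key=tweet_length)


def global_top_n(partial_lists, n):
    # Reduce side: merge the local winners and keep the overall n longest
    merged = [tweet for part in partial_lists for tweet in part]
    return heapq.nlargest(n, merged, key=tweet_length)


split_a = [("ana", "a" * 5), ("bob", "b" * 50), ("cy", "c" * 20)]
split_b = [("dee", "d" * 40), ("eve", "e" * 10), ("fay", "f" * 30)]

top2 = global_top_n([local_top_n(split_a, 2), local_top_n(split_b, 2)], 2)
top_users = [user for user, _text in top2]
```

Each split sends at most n records to the reducer, which keeps the single reducer cheap even when the input is large.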


Filtering-Other approaches

Relation to SQL

SELECT * FROM table WHERE colvalue < VALUE;

Implementation in PIG

b = FILTER a BY colvalue < VALUE;


Tip

How can I use and run a MapReduce framework?

You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.


Success Stories with MapReduce

Tip

Who is using MapReduce?

All companies that are dealing with Big Data problems for analytics such as:

Cloudera

Datasalt

Elasticsearch

. . .



Apache Hadoop-Related Projects



More tips

FAQ

MapReduce is a framework based on a simple programming model

...to deal with large datasets in a distributed fashion

...scalability, replication, fault tolerance, etc.

Apache Hadoop is not a database

New frameworks on top of Hadoop for specific tasks: querying, analysis, etc.

Other similar frameworks: Storm, Signal/Collect, etc.

. . .


Summary and Conclusions

Summary



Conclusions

What is MapReduce?

It is a framework inspired by functional programming to tackle problems in which steps can be parallelized by applying a divide-and-conquer approach.

What can I do in MapReduce?

1 Querying

2 Summarizing

3 Analyzing

. . . large datasets in off-line mode for boosting other on-line processes.

How can I use and run a MapReduce framework?

You should identify what kind of problem you are addressing and apply a design pattern to be implemented in a framework such as Apache Hadoop.



What’s next?

. . .

Concatenate MapReduce jobs

Optimization using combiners and setting the parameters (size of partition, etc.)

Pipelining with other languages such as Python

Hadoop in Action: more examples, etc.

New trending problems (image/video processing)

Real-time processing

. . .


