Tuple MapReduce: Beyond classic MapReduce

Pedro Ferrera, Ivan de Prado, Eric Palacios
DataSalt, Barcelona, SPAIN
pere,ivan,[email protected]

Jose Luis Fernandez-Marquez, Giovanna Di Marzo Serugendo
University of Geneva, CUI, Geneva, SWITZERLAND
[email protected]


Tuple MapReduce is a new foundational model that extends MapReduce with the notion of tuples. It bridges the gap between the low-level constructs provided by MapReduce and the higher-level needs of programmers, such as compound records, sorting, and joins. This paper also presents Pangool, an open-source framework implementing Tuple MapReduce. Pangool eases the design and implementation of MapReduce-based applications and increases their flexibility while retaining Hadoop’s performance.


Outline

● Introduction
● Related Work
● Classic MapReduce
  – The problems of MapReduce
● Tuple MapReduce
  – The basic Tuple MapReduce
  – Joins
  – Generalization of MapReduce
● Pangool
● Conclusions and Future work


Introduction

● A huge amount of information → need for new processing technologies.
● MapReduce → a major contribution ...
  – … but it involves a steep learning curve.
● Most design patterns found in real-world problems are not well covered.
● We propose Tuple MapReduce as a better foundational model.
● Tuple MapReduce on Hadoop → Pangool
  – No key architectural changes needed.


Related work

● MapReduce: Google paper from 2004
● Hadoop
● Higher-level tools
  – Sawzall, FlumeJava, Pig, Hive, Jaql, Cascading
● Higher-level abstractions are very popular
  – This supports the idea that MapReduce is too low-level a paradigm
● Map-Reduce-Merge
  – Targets the problem of relational operations (joins)
  – Implies changes in the architecture and a new merge step


Classic MapReduce

● Jobs
  – Input file, output file
  – The developer provides two functions: map and reduce
● Distributed execution of work
  – First, the map function runs in the map phase
  – Then the reduce function runs in the reduce phase
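
The two phases above can be imitated in a single process. The following is a minimal sketch, not Hadoop code: the helper `map_reduce` and the word-count functions are hypothetical names chosen for illustration, and the "distributed" shuffle is replaced by an in-memory dictionary.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Single-process imitation of a MapReduce job (illustrative only)."""
    # Map phase: every input record yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: reduce_fn is called once per key with all of its values.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def word_count_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1


def word_count_reduce(word, counts):
    # Sum the partial counts collected for one word.
    yield word, sum(counts)


result = map_reduce(["a b a", "b c"], word_count_map, word_count_reduce)
```

Note that the developer only writes the two small functions; everything else is the framework's job.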


The problems of MapReduce

● Compound records
  – Real-world problems involve multi-field records, which do not fit well into the key/value schema
● Sorting
  – No inherent sorting of the records within a reduce group.
  – Implementations such as Hadoop rely on the “secondary sorting trick”
● Joins
  – A quite common operation
  – Not directly possible in MapReduce without using “tricks”:
    ● secondary sorting
    ● compound records
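
To make the "secondary sorting trick" concrete, here is a rough single-process sketch of the idea, assuming string keys: the field to sort on is packed into a composite key, records are partitioned by the natural key only, sorted by the full composite key, and then grouped by the natural key again. The function name, the CRC-based partitioner, and the sample data are assumptions made for illustration; they are not Hadoop APIs.

```python
import zlib
from itertools import groupby


def secondary_sort_shuffle(pairs, num_partitions=2):
    """pairs: iterable of ((natural_key, sort_field), value)."""
    # "Partitioner": only the natural key decides the partition, so all
    # records of one natural key meet in the same reducer.
    partitions = [[] for _ in range(num_partitions)]
    for (natural, order), value in pairs:
        index = zlib.crc32(natural.encode()) % num_partitions
        partitions[index].append(((natural, order), value))
    for part in partitions:
        # "Sort comparator": the full composite key.
        part.sort(key=lambda kv: kv[0])
        # "Grouping comparator": the natural key only.
        for natural, group in groupby(part, key=lambda kv: kv[0][0]):
            yield natural, [(key[1], value) for key, value in group]


# Hypothetical sample data: ((url, date), visits).
pairs = [
    (("a.com", "2013-01-02"), 5),
    (("b.com", "2013-01-01"), 1),
    (("a.com", "2013-01-01"), 3),
]
grouped = dict(secondary_sort_shuffle(pairs))
```

Three separate pieces (partitioner, sort order, grouping) have to agree with each other, which is the kind of boilerplate the slides describe as a "trick".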


Tuple MapReduce

● Idea: replace key/value pairs with tuples
● group-by and sort-by clauses


Tuple MapReduce (II)

● group-by and sort-by constraint
  – group-by must be a prefix of sort-by
  – Needed in order to implement Tuple MapReduce on top of a MapReduce architecture
● Unlike MapReduce, Tuple MapReduce:
  – provides compound records → tuples
  – provides intra-reduce sorting
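
A possible way to picture the model is the sketch below. It is an in-memory illustration under stated assumptions, not Pangool's API: tuples are modelled as Python dictionaries, and `tuple_map_reduce`, `actions_in_order`, and the sample events are hypothetical. It also shows why the prefix constraint helps: a single sort on the sort-by fields leaves every group contiguous.

```python
from itertools import groupby


def tuple_map_reduce(records, map_fn, reduce_fn, group_by, sort_by):
    """In-memory sketch of a Tuple MapReduce job over dict-based tuples."""
    # Constraint from the slides: group-by must be a prefix of sort-by.
    if list(sort_by[:len(group_by)]) != list(group_by):
        raise ValueError("group_by must be a prefix of sort_by")
    tuples = [t for record in records for t in map_fn(record)]
    # One sort on the sort-by fields; because group-by is a prefix,
    # all tuples of a group end up adjacent and already ordered.
    tuples.sort(key=lambda t: tuple(t[f] for f in sort_by))
    output = []
    group_key = lambda t: tuple(t[f] for f in group_by)
    for key, group in groupby(tuples, key=group_key):
        output.extend(reduce_fn(key, group))
    return output


def actions_in_order(key, group):
    # The reducer sees the tuples of one user already sorted by "ts".
    yield key[0], [t["action"] for t in group]


# Hypothetical sample tuples with three fields each.
events = [
    {"user": "u2", "ts": 3, "action": "x"},
    {"user": "u1", "ts": 2, "action": "b"},
    {"user": "u1", "ts": 1, "action": "a"},
]
result = tuple_map_reduce(
    events, lambda r: [r], actions_in_order,
    group_by=["user"], sort_by=["user", "ts"],
)
```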


Example: cumulative visits

● Cumulative number of visits per URL up to each date

Input → URL, date, visits

Expected output → URL, date, cumulative visits
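
As a sketch of how this example might be computed, assume tuples are grouped by URL and sorted by (URL, date); the reducer then only has to keep a running total. The function name and the sample rows are made up for illustration, and dates are assumed to be ISO-formatted strings so that lexicographic order matches chronological order.

```python
from itertools import groupby
from operator import itemgetter


def cumulative_visits(rows):
    """rows: iterable of (url, date, visits) tuples."""
    # sort-by (url, date); group-by url, which is a prefix of the sort-by.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for url, group in groupby(ordered, key=itemgetter(0)):
        running = 0
        # Dates arrive in ascending order, so a running sum is enough.
        for _, date, visits in group:
            running += visits
            yield url, date, running


# Hypothetical input rows.
rows = [
    ("a.com", "2013-01-02", 5),
    ("b.com", "2013-01-01", 2),
    ("a.com", "2013-01-01", 3),
]
totals = list(cumulative_visits(rows))
```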


Join-Tuple MapReduce

● Joins among heterogeneous datasets
  – Tuples are associated with a source-id.
● Tuples reach the reducer sorted by source-id
  – enabling memoryless reduce joins
  – and grouped by some common fields


Example: join between clients and payments

[Figure: inner join of the clients and payments datasets; fields shown: client_id, name, payment_id, amount]
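
The join could be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the source-id constants, the function name, and the sample data are hypothetical, payments are assumed to carry a client_id field to join on, and each client_id is assumed to appear at most once in clients. Because client tuples sort before payment tuples within a group, the reducer holds only one client record while streaming the payments.

```python
from itertools import groupby

CLIENTS, PAYMENTS = 0, 1  # hypothetical source-ids; clients sort first


def reduce_side_join(clients, payments):
    """clients: (client_id, name); payments: (payment_id, client_id, amount)."""
    # Tag every tuple with its source-id.
    tagged = [(cid, CLIENTS, (name,)) for cid, name in clients]
    tagged += [(cid, PAYMENTS, (pid, amount)) for pid, cid, amount in payments]
    # group-by client_id, sort-by (client_id, source-id).
    tagged.sort(key=lambda t: (t[0], t[1]))
    for cid, group in groupby(tagged, key=lambda t: t[0]):
        name = None
        for _, source, fields in group:
            if source == CLIENTS:
                name = fields[0]  # the only state kept in memory
            elif name is not None:
                pid, amount = fields
                yield cid, name, pid, amount  # inner join: both sides exist


# Hypothetical sample data.
clients = [(1, "Ann"), (2, "Bob")]
payments = [(10, 1, 5.0), (11, 3, 7.0), (12, 1, 2.5)]
joined = list(reduce_side_join(clients, payments))
```

Clients without payments and payments without a matching client produce no output, which is the inner-join behaviour.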


Generalization of MapReduce

● MapReduce is a Tuple MapReduce with...
  – tuples of two values and
  – group-by and sort-by set to the first value
● The opposite is also possible → Tuple MapReduce can be implemented on top of existing MapReduce implementations.
  – Architectural changes are not needed.
  – Pangool is a proof of that.
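
The first direction of this generalization can be illustrated with a small sketch, assuming positional tuples (fields addressed by index). Both `tuple_map_reduce` and `classic_map_reduce` are hypothetical helpers written for this example: classic MapReduce falls out as the special case of two-field tuples with group-by and sort-by set to field 0.

```python
from itertools import groupby


def tuple_map_reduce(records, map_fn, reduce_fn, group_by, sort_by):
    """In-memory Tuple MapReduce sketch over positional tuples."""
    tuples = sorted(
        (t for record in records for t in map_fn(record)),
        key=lambda t: tuple(t[i] for i in sort_by),
    )
    output = []
    group_key = lambda t: tuple(t[i] for i in group_by)
    for key, group in groupby(tuples, key=group_key):
        output.extend(reduce_fn(key, list(group)))
    return output


def classic_map_reduce(records, map_fn, reduce_fn):
    """Classic MapReduce: 2-field tuples, group-by and sort-by on field 0."""
    return tuple_map_reduce(
        records,
        map_fn,
        # Unpack the one-field group key and pass only the values on.
        lambda key, group: reduce_fn(key[0], [value for _, value in group]),
        group_by=(0,),
        sort_by=(0,),
    )


def wc_map(line):
    for word in line.split():
        yield word, 1


def wc_reduce(word, counts):
    yield word, sum(counts)


counts = classic_map_reduce(["a b a", "b c"], wc_map, wc_reduce)
```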


Pangool

● Tuple MapReduce implementation on top of Hadoop.
  – Built on top of an existing MapReduce implementation.
● It is just a library. No architecture change was needed.
● Used in real-world applications
  – Banking
  – Searching
  – Social networks

pangool.net


Pangool benchmark – secondary sort


Pangool benchmark – join


Pangool performance

● Only between 5% and 8% worse than Hadoop
  – Pretty good considering that Pangool is built on top of the Hadoop API
● The difference would probably disappear with a native implementation
● Much better than higher-level APIs
  – Probably because Pangool is a low-level API


Conclusions and Future work

● The MapReduce key/value model has been shown to be too strict.
● Tuple MapReduce keeps MapReduce's features
  – Enhancing them with
    ● compound records,
    ● joins and
    ● intra-reduce sorting.
● Pangool is a proof of its viability,
  – including on existing implementations like Hadoop, without changing the architecture
● Future work would involve abstractions for flow creation
  – Simplifying job chaining and data flow.


Thanks!

● Any questions or doubts?

[email protected]

– @ivanprado
