
Incremental Recomputations in MapReduce

Thomas Jörg, University of Kaiserslautern


DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern 2

Motivation

Base data Result data

Bigtable / HBase

MapReduce Program


Motivation

View Definition

Base data Materialized view


Motivation

Base data Result data

Bigtable / HBase

Incremental MapReduce Program

MapReduce Program


Agenda

• Related Work

• Case study

• Incremental view maintenance

• Summary Delta Algorithm

• Conclusion and future work


Related Work

• Caching intermediate results
  • DryadInc
  • Incoop
• Incremental programming models
  • Google Percolator
  • Continuous bulk processing (CBP)

L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010


Challenges

• Programming model
  • SQL / relational algebra vs. MapReduce
• Efficient access paths
  • No secondary indexes in HBase
• Support for transactions
  • Only single-row transactions in HBase


Case Study

• Word histograms

• Reverse web-link graphs

• Term-vectors per host

• Count of URL access frequency

• Inverted Indexes

J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004


Computing Reverse Web-Link Graphs

[Figure: a collection of HTML documents, each shown as <html>...</html>]


Sample Web-Link Graph

a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>

b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>


Computing Reverse Web-Link Graphs

Map input:
a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>

Map output:
from a.htm:
b.htm, a.htm
b.htm, a.htm
from b.htm:
a.htm, b.htm
b.htm, b.htm

Shuffle and Reduce output:
a.htm, {b.htm}
b.htm, {a.htm, b.htm}
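The Map, Shuffle and Reduce steps on this slide can be sketched in plain Python. This is only an illustration under stated assumptions, not the Hadoop program behind the talk: the names `map_links`, `shuffle` and `reduce_sources` are hypothetical, and a regular expression stands in for a real HTML parser.

```python
import re
from collections import defaultdict

# Assumption: a regex is enough to find links in the toy pages used here.
HREF = re.compile(r'href="([^"]+)"')

def map_links(source_url, html):
    """Map: emit one (link target, link source) pair per anchor."""
    for target in HREF.findall(html):
        yield target, source_url

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sources(target, sources):
    """Reduce: collapse all sources of one target into a set."""
    return target, set(sources)

pages = {
    "a.htm": '<html><a href="b.htm">...</a><a href="b.htm">...</a></html>',
    "b.htm": '<html><a href="a.htm">...</a><a href="b.htm">...</a></html>',
}

pairs = [kv for url, html in pages.items() for kv in map_links(url, html)]
reverse_graph = dict(reduce_sources(t, s) for t, s in shuffle(pairs).items())
print(reverse_graph)
```

Because the reducer builds a set, the two links from a.htm to b.htm collapse into a single entry. That loss of multiplicity is what makes deletions hard to apply later.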


Summary Delta Algorithm

CREATE VIEW Parts AS
SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
FROM Orders
GROUP BY partID

SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
  (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
   FROM Orders_Insertions
   GROUP BY partID)
  UNION ALL
  (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
   FROM Orders_Deletions
   GROUP BY partID)
)
GROUP BY partID

I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
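The delta query above can be mirrored in a few lines of Python. The sketch below is an illustration only: rows are assumed to be (partID, qty, price) tuples, the function names are hypothetical, and the `install` step (dropping a group once its tuple count reaches zero) is an assumption about how `tplcnt` is used, since the slide shows only the delta computation.

```python
from collections import defaultdict

def summarize(rows, sign):
    """Aggregate (partID, qty, price) rows into {partID: [revenue, tplcnt]}.
    sign is +1 for insertions and -1 for deletions."""
    out = defaultdict(lambda: [0, 0])
    for part_id, qty, price in rows:
        out[part_id][0] += sign * qty * price
        out[part_id][1] += sign
    return out

def summary_delta(insertions, deletions):
    """Combine signed aggregates of both delta sets (the UNION ALL + GROUP BY)."""
    delta = summarize(insertions, +1)
    for part_id, (revenue, tplcnt) in summarize(deletions, -1).items():
        delta[part_id][0] += revenue
        delta[part_id][1] += tplcnt
    return dict(delta)

def install(view, delta):
    """Assumed refresh step: add the delta to the view and drop empty groups."""
    for part_id, (revenue, tplcnt) in delta.items():
        old_revenue, old_tplcnt = view.get(part_id, (0, 0))
        new_revenue, new_tplcnt = old_revenue + revenue, old_tplcnt + tplcnt
        if new_tplcnt == 0:
            view.pop(part_id, None)
        else:
            view[part_id] = (new_revenue, new_tplcnt)
    return view

view = {"p1": (100, 2), "p2": (30, 1)}
insertions = [("p1", 1, 20), ("p3", 2, 5)]
deletions = [("p2", 3, 10)]

delta = summary_delta(insertions, deletions)
view = install(view, delta)
print(delta)
print(view)
```

Only the insertions and deletions are read; the Orders table itself is never scanned again.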



Achieving Self-Maintainability

Map input:
a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>

Map output:
from a.htm:
b.htm, [a.htm, 1]
b.htm, [a.htm, 1]
from b.htm:
a.htm, [b.htm, 1]
b.htm, [b.htm, 1]

Shuffle and Reduce output:
a.htm, {[b.htm, 1]}
b.htm, {[a.htm, 2], [b.htm, 1]}
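A possible Python rendering of the counted variant is shown below. It is a sketch under the same assumptions as before (regex link extraction, hypothetical function names); each link source carries a count so that the reducer can keep multiplicities instead of collapsing them.

```python
import re
from collections import Counter, defaultdict

# Assumption: a regex is enough to find links in the toy pages used here.
HREF = re.compile(r'href="([^"]+)"')

def map_links(source_url, html):
    """Map: emit (link target, [link source, 1]) per anchor."""
    for target in HREF.findall(html):
        yield target, (source_url, 1)

def reduce_counts(values):
    """Reduce: sum the counts per link source."""
    counts = Counter()
    for source, n in values:
        counts[source] += n
    return dict(counts)

pages = {
    "a.htm": '<html><a href="b.htm">...</a><a href="b.htm">...</a></html>',
    "b.htm": '<html><a href="a.htm">...</a><a href="b.htm">...</a></html>',
}

groups = defaultdict(list)
for url, html in pages.items():
    for target, value in map_links(url, html):
        groups[target].append(value)

view = {target: reduce_counts(values) for target, values in groups.items()}
print(view)
```

With the counts stored in the view, removing one of the two links from a.htm to b.htm can be expressed as a decrement, without rereading the other pages.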


Sample Web-Link Graph

a.htm (old version): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>

a.htm (new version): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>

b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a> </html>


Summary Delta Algorithm in MapReduce

Map input:
a.htm (deleted): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a> </html>
a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a> </html>

Map output:
from a.htm (deleted):
b.htm, [a.htm, -1]
b.htm, [a.htm, -1]
from a.htm (inserted):
b.htm, [a.htm, +1]
a.htm, [a.htm, +1]

Shuffle and Reduce output:
a.htm, {[a.htm, +1]}
b.htm, {[a.htm, -1]}
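The delta computation for the changed page can be sketched as follows. The code is illustrative only: `map_delta` and `reduce_delta` are hypothetical names, the old version of a.htm is treated as a deletion (sign -1) and the new version as an insertion (sign +1), and dropping zero-sum entries in the reducer is an assumption.

```python
import re
from collections import Counter, defaultdict

# Assumption: a regex is enough to find links in the toy pages used here.
HREF = re.compile(r'href="([^"]+)"')

def map_delta(source_url, html, sign):
    """Map: emit (link target, [link source, sign]) per anchor."""
    for target in HREF.findall(html):
        yield target, (source_url, sign)

def reduce_delta(values):
    """Reduce: sum signed counts per source and drop entries that cancel out."""
    total = Counter()
    for source, n in values:
        total[source] += n
    return {source: n for source, n in total.items() if n != 0}

old_a = '<html><a href="b.htm">...</a><a href="b.htm">...</a></html>'
new_a = '<html><a href="b.htm">...</a><a href="a.htm">...</a></html>'

emitted = list(map_delta("a.htm", old_a, -1)) + list(map_delta("a.htm", new_a, +1))

groups = defaultdict(list)
for target, value in emitted:
    groups[target].append(value)

delta = {target: reduce_delta(values) for target, values in groups.items()}
print(delta)
```

The unchanged page b.htm is never read: only the old and new versions of a.htm enter the job.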


Delta Installation Approaches

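Increment Installation: the MapReduce job reads the base deltas and applies the resulting changes to the materialized view in place.

Overwrite Installation: the MapReduce job reads the base deltas together with the materialized view and writes a new materialized view.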
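The two installation approaches can be contrasted on in-memory dictionaries. This is one reading of the slide's diagram, not the implementation from the talk: the view maps each link target to {link source: count}, the function names are hypothetical, and removing entries whose count reaches zero is an assumption.

```python
import copy

def increment_install(view, delta):
    """Increment installation: add each delta count to the stored view in place."""
    for target, sources in delta.items():
        row = view.setdefault(target, {})
        for source, n in sources.items():
            row[source] = row.get(source, 0) + n
            if row[source] == 0:
                del row[source]
        if not row:
            del view[target]
    return view

def overwrite_install(old_view, delta):
    """Overwrite installation: read old view and delta, write a fresh view."""
    new_view = {}
    for target in old_view.keys() | delta.keys():
        merged = dict(old_view.get(target, {}))
        for source, n in delta.get(target, {}).items():
            merged[source] = merged.get(source, 0) + n
        merged = {source: n for source, n in merged.items() if n != 0}
        if merged:
            new_view[target] = merged
    return new_view

old_view = {"a.htm": {"b.htm": 1}, "b.htm": {"a.htm": 2, "b.htm": 1}}
delta = {"b.htm": {"a.htm": -1}, "a.htm": {"a.htm": 1}}

incremented = increment_install(copy.deepcopy(old_view), delta)
overwritten = overwrite_install(old_view, delta)
print(incremented)
print(overwritten)
```

Both variants end in the same view. They differ in I/O pattern: the first touches only the rows named in the delta, while the second reads the whole old view once and writes a complete replacement.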


Case Study – Lessons Learned

• Numerical aggregation
  • Word histogram
  • URL access frequency
• Set aggregation
  • Reverse web-link graph
  • Inverted index
• Multiset aggregation
  • Term-vector per host


General Solution

• Self-maintainable aggregates
• Computed in three steps
  • Translation
  • Grouping
  • Aggregation
    • commutative and associative binary function
    • inverse elements
    • Abelian group
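The three steps and the group requirement can be captured in a small generic sketch. Everything named here (`AbelianGroup`, `aggregate`, `maintain`, `words`) is hypothetical and serves only to illustrate the idea; the word-count example uses the group (integers, +).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class AbelianGroup:
    identity: Any
    combine: Callable[[Any, Any], Any]  # commutative and associative
    inverse: Callable[[Any], Any]       # inverse element of a value

INT_ADD = AbelianGroup(0, lambda a, b: a + b, lambda a: -a)

def aggregate(records, translate, group):
    """Translation, grouping and aggregation in one pass."""
    result = defaultdict(lambda: group.identity)
    for record in records:
        for key, value in translate(record):                   # translation
            result[key] = group.combine(result[key], value)    # grouping + aggregation
    return dict(result)

def maintain(view, inserted, deleted, translate, group):
    """Update the view from inserted and deleted records only."""
    new_view = dict(view)
    for key, value in aggregate(inserted, translate, group).items():
        new_view[key] = group.combine(new_view.get(key, group.identity), value)
    for key, value in aggregate(deleted, translate, group).items():
        new_view[key] = group.combine(
            new_view.get(key, group.identity), group.inverse(value))
    # Keys whose aggregate equals the identity carry no information.
    return {k: v for k, v in new_view.items() if v != group.identity}

def words(doc):
    """Translation function for a word histogram: document -> (word, 1) pairs."""
    return [(w, 1) for w in doc.split()]

view = aggregate(["a b a", "b c"], words, INT_ADD)
updated = maintain(view, inserted=["a c c"], deleted=["b c"],
                   translate=words, group=INT_ADD)
recomputed = aggregate(["a b a", "a c c"], words, INT_ADD)
print(updated, recomputed)
```

The inverse is what allows deletions to be applied without looking at the remaining base data, and commutativity plus associativity mean the order in which deltas arrive does not matter.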


Case Study – Lessons Learned

• Numerical aggregation
  • Word histogram
    Translation function: translate web pages into (word, 1)
    Aggregation function: Abelian group (integers, +)
  • URL access frequency
• Set aggregation
  • Reverse web-link graph
    Translation function: translate web pages into (link target, link source)
    Aggregation function: Abelian group (power-multiset of URLs, multiset union)
  • Inverted index
• Multiset aggregation
  • Term-vector per host


Evaluation

[Figure: four plots, one per use case: Word histogram, Reverse web-link graph, URL access frequency, Term-vector per host. y-axis: Elapsed time [min], logarithmic scale. x-axis: Updates in base documents [%], 0 to 100.]


Conclusion & Future Work

• View Maintenance in MapReduce
  • Case study
  • Summary delta algorithm
  • Self-maintainable aggregations
• Future Work
  • Broader class of MapReduce programs
  • High-level MapReduce languages, e.g. Jaql or Pig Latin