Incremental Recomputations in MapReduce
Thomas Jörg, University of Kaiserslautern
DBAG Treffen 2011 – Thomas Jörg – TU Kaiserslautern 2
Motivation
Base data Result data
Bigtable / HBase
MapReduce Program
Motivation
View Definition
Base data Materialized view
Motivation
Base data Result data
Bigtable / HBase
Incremental MapReduce Program
MapReduce Program
Agenda
• Related Work
• Case study
• Incremental view maintenance
• Summary Delta Algorithm
• Conclusion and future work
Related Work
• Caching intermediate results
  • DryadInc
  • Incoop
• Incremental programming models
  • Google Percolator
  • Continuous bulk processing (CBP)
L. Popa et al.: DryadInc: Reusing work in large-scale computations. HotCloud 2009
P. Bhatotia et al.: Incoop: MapReduce for Incremental Computations. SoCC 2011
D. Peng and F. Dabek: Large-scale Incremental Processing Using Distributed Transactions and Notifications. OSDI 2010
D. Logothetis et al.: Stateful Bulk Processing for Incremental Analytics. SoCC 2010
Challenges
• Programming model
  • SQL / relational algebra vs. MapReduce
• Efficient access paths
  • No secondary indexes in HBase
• Support for transactions
  • Only single-row transactions in HBase
Case Study
• Word histograms
• Reverse web-link graphs
• Term-vectors per host
• Count of URL access frequency
• Inverted Indexes
J. Dean and S. Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
Computing Reverse Web-Link Graphs
Sample Web-Link Graph
a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a></html>
Computing Reverse Web-Link Graphs
Input:
a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a></html>
Map output (link target, link source):
b.htm, a.htm
b.htm, a.htm
a.htm, b.htm
b.htm, b.htm
Shuffle and Reduce output:
a.htm, {b.htm}
b.htm, {a.htm, b.htm}
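The map, shuffle, and reduce steps above can be sketched in plain Python. This is a minimal single-process illustration rather than the talk's implementation: the function names (`map_links`, `shuffle`, `reduce_sources`) are hypothetical, and a regular expression stands in for a real HTML parser.

```python
import re
from collections import defaultdict

# Assumption: links always appear as <a href="...">; a real job would use an HTML parser.
HREF = re.compile(r'<a\s+href="([^"]+)"')

def map_links(source, html):
    """Map: emit one (link target, link source) pair per anchor tag."""
    for target in HREF.findall(html):
        yield target, source

def shuffle(pairs):
    """Shuffle: group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_sources(target, sources):
    """Reduce: collapse all sources of one target into a set."""
    return target, set(sources)

pages = {
    "a.htm": '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>',
    "b.htm": '<html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a></html>',
}

pairs = [kv for src, html in pages.items() for kv in map_links(src, html)]
reverse_graph = dict(reduce_sources(t, s) for t, s in shuffle(pairs).items())
print(reverse_graph)
```

Because the reducer keeps only a set, the two links from a.htm to b.htm collapse into a single entry.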
Summary Delta Algorithm
I. S. Mumick et al.: Maintenance of Data Cubes and Summary Tables in a Warehouse. SIGMOD Conference 1997
W. Labio et al.: Performance Issues in Incremental Warehouse Maintenance. VLDB 2000
CREATE VIEW Parts AS
SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
FROM Orders
GROUP BY partID

SELECT partID, SUM(revenue) AS revenue, SUM(tplcnt) AS tplcnt
FROM (
  (SELECT partID, SUM(qty*price) AS revenue, COUNT(*) AS tplcnt
   FROM Orders_Insertions
   GROUP BY partID)
  UNION ALL
  (SELECT partID, -SUM(qty*price) AS revenue, -COUNT(*) AS tplcnt
   FROM Orders_Deletions
   GROUP BY partID)
)
GROUP BY partID
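The summary delta query can be mirrored in a few lines of Python to show how the delta is computed and then applied to the view. This is a sketch under assumptions: rows are `(partID, qty, price)` tuples, the view maps `partID` to `(revenue, tplcnt)`, and the helper names are hypothetical. The `apply_delta` step, which drops a group once its tuple count reaches zero, is the usual reason for carrying `tplcnt` and is not shown on the slide.

```python
from collections import defaultdict

def summarize(rows, sign):
    """Aggregate rows per partID; sign is +1 for insertions, -1 for deletions."""
    out = defaultdict(lambda: [0, 0])  # partID -> [revenue, tplcnt]
    for part_id, qty, price in rows:
        out[part_id][0] += sign * qty * price
        out[part_id][1] += sign
    return out

def summary_delta(insertions, deletions):
    """UNION ALL of both signed summaries, grouped again by partID."""
    delta = summarize(insertions, +1)
    for part_id, (rev, cnt) in summarize(deletions, -1).items():
        delta[part_id][0] += rev
        delta[part_id][1] += cnt
    return {k: tuple(v) for k, v in delta.items()}

def apply_delta(view, delta):
    """Add the delta to the view; remove groups whose tuple count drops to zero."""
    new_view = dict(view)
    for part_id, (rev, cnt) in delta.items():
        old_rev, old_cnt = new_view.get(part_id, (0, 0))
        if old_cnt + cnt == 0:
            new_view.pop(part_id, None)
        else:
            new_view[part_id] = (old_rev + rev, old_cnt + cnt)
    return new_view

view = {"p1": (100, 2), "p2": (30, 1)}
insertions = [("p1", 1, 20), ("p3", 2, 5)]
deletions = [("p2", 1, 30), ("p1", 1, 50)]

delta = summary_delta(insertions, deletions)
new_view = apply_delta(view, delta)
```

Here `p2` disappears from the view because its only order was deleted, while `p1` keeps two tuples with a lower revenue.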
Achieving Self-Maintainability
Input:
a.htm: <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a></html>
Map output:
b.htm, [a.htm, 1]
b.htm, [a.htm, 1]
a.htm, [b.htm, 1]
b.htm, [b.htm, 1]
Shuffle and Reduce output:
a.htm, {[b.htm, 1]}
b.htm, {[a.htm, 2], [b.htm, 1]}
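A possible Python rendering of this counted variant follows. It is a sketch, not the talk's code: `map_links` and `reduce_counts` are hypothetical names, and the regular expression is a stand-in for real link extraction. Keeping a count per link source is what makes the view self-maintainable: when one link from a.htm to b.htm is removed, the remaining count tells us whether a.htm still belongs in the result, without rereading the base data.

```python
import re
from collections import Counter, defaultdict

# Assumption: links always appear as <a href="...">.
HREF = re.compile(r'<a\s+href="([^"]+)"')

def map_links(source, html):
    """Map: emit (link target, (link source, 1)) for every anchor tag."""
    for target in HREF.findall(html):
        yield target, (source, 1)

def reduce_counts(pairs):
    """Shuffle and reduce: sum the counts per (target, source)."""
    view = defaultdict(Counter)
    for target, (source, n) in pairs:
        view[target][source] += n
    return {target: dict(counts) for target, counts in view.items()}

pages = {
    "a.htm": '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>',
    "b.htm": '<html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a></html>',
}

pairs = [kv for src, html in pages.items() for kv in map_links(src, html)]
view = reduce_counts(pairs)
```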
Sample Web-Link Graph
a.htm (old version): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>
a.htm (new version): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a></html>
b.htm: <html> <a href="a.htm"> ...</a> <a href="b.htm"> ...</a></html>
Summary Delta Algorithm in MapReduce
Input:
a.htm (deleted): <html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>
a.htm (inserted): <html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a></html>
Map output:
b.htm, [a.htm, -1]
b.htm, [a.htm, -1]
b.htm, [a.htm, +1]
a.htm, [a.htm, +1]
Shuffle and Reduce output:
a.htm, {[a.htm, +1]}
b.htm, {[a.htm, -1]}
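The delta computation above can be sketched as follows. The assumptions are that the old version of a changed page is available as a deletion and the new version as an insertion, that links appear as `<a href="...">`, and that the names `map_delta` and `reduce_delta` are hypothetical. Entries whose counts cancel to zero are dropped from the delta.

```python
import re
from collections import Counter, defaultdict

HREF = re.compile(r'<a\s+href="([^"]+)"')

def map_delta(source, html, sign):
    """Map: emit (target, (source, sign)); sign is -1 for a deleted
    page version and +1 for an inserted one."""
    for target in HREF.findall(html):
        yield target, (source, sign)

def reduce_delta(pairs):
    """Shuffle and reduce: sum the signed counts and keep nonzero entries."""
    sums = defaultdict(Counter)
    for target, (source, n) in pairs:
        sums[target][source] += n
    delta = {}
    for target, counts in sums.items():
        nonzero = {s: n for s, n in counts.items() if n != 0}
        if nonzero:
            delta[target] = nonzero
    return delta

old_a = '<html> <a href="b.htm"> ...</a> <a href="b.htm"> ...</a></html>'
new_a = '<html> <a href="b.htm"> ...</a> <a href="a.htm"> ...</a></html>'

pairs = list(map_delta("a.htm", old_a, -1)) + list(map_delta("a.htm", new_a, +1))
delta = reduce_delta(pairs)
```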
Delta Installation Approaches
Increment Installation: Base deltas -> MapReduce -> Materialized view
Overwrite Installation: Base deltas + Materialized view -> MapReduce -> Materialized view
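One reading of the two installation approaches, sketched with in-memory dictionaries standing in for an HBase table: increment installation applies each delta entry to the stored view in place, while overwrite installation feeds the old view and the delta through the same aggregation and writes out a complete new view. The function names and the dictionary layout (`target -> {source: count}`) are assumptions for illustration only.

```python
import copy
from collections import Counter, defaultdict

def increment_install(view, delta):
    """Apply every delta entry to the stored view in place."""
    for target, sources in delta.items():
        row = view.setdefault(target, {})
        for source, n in sources.items():
            count = row.get(source, 0) + n
            if count:
                row[source] = count
            else:
                row.pop(source, None)  # count reached zero: remove the entry
        if not row:
            del view[target]
    return view

def overwrite_install(view, delta):
    """Aggregate the old view together with the delta into a new view."""
    merged = defaultdict(Counter)
    for table in (view, delta):
        for target, sources in table.items():
            for source, n in sources.items():
                merged[target][source] += n
    return {
        target: {s: n for s, n in counts.items() if n}
        for target, counts in merged.items()
        if any(counts.values())
    }

view = {"a.htm": {"b.htm": 1}, "b.htm": {"a.htm": 2, "b.htm": 1}}
delta = {"b.htm": {"a.htm": -1}, "a.htm": {"a.htm": 1}}

incremented = increment_install(copy.deepcopy(view), delta)
overwritten = overwrite_install(view, delta)
```

Both approaches produce the same view. The trade-off is that increment installation touches only the rows named in the delta, whereas overwrite installation reads and rewrites the whole view.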
Case Study – Lessons Learned
• Numerical aggregation
  • Word histogram
  • URL access frequency
• Set aggregation
  • Reverse web-link graph
  • Inverted index
• Multiset aggregation
  • Term-vector per host
General Solution
• Self-maintainable aggregates
• Computed in three steps
  • Translation
  • Grouping
  • Aggregation
    • Commutative and associative binary function
    • Inverse elements
    • Abelian group
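The three steps can be illustrated with a small generic sketch. It assumes the aggregation is described by an identity element, a commutative and associative `combine` function, and an `inverse` function; the `AbelianGroup` class and the `aggregate`, `maintain`, and `words` helpers are hypothetical names introduced here. The word histogram over the integers under addition serves as the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class AbelianGroup:
    identity: Any
    combine: Callable[[Any, Any], Any]  # must be commutative and associative
    inverse: Callable[[Any], Any]

INT_ADD = AbelianGroup(0, lambda a, b: a + b, lambda a: -a)

def aggregate(records, translate, group, sign=+1):
    """Translate each record into (key, value) pairs, group by key,
    and combine the values; sign=-1 aggregates the inverse elements."""
    result = defaultdict(lambda: group.identity)
    for record in records:
        for key, value in translate(record):
            if sign < 0:
                value = group.inverse(value)
            result[key] = group.combine(result[key], value)
    return dict(result)

def maintain(view, inserted, deleted, translate, group):
    """Update the view from the inserted and deleted records only."""
    new_view = dict(view)
    deltas = (aggregate(inserted, translate, group, +1),
              aggregate(deleted, translate, group, -1))
    for delta in deltas:
        for key, value in delta.items():
            combined = group.combine(new_view.get(key, group.identity), value)
            if combined == group.identity:
                new_view.pop(key, None)
            else:
                new_view[key] = combined
    return new_view

def words(page):
    """Translation function for the word histogram: page -> (word, 1) pairs."""
    return [(w, 1) for w in page.split()]

view = aggregate(["a b a", "b c"], words, INT_ADD)
updated = maintain(view, inserted=["a c"], deleted=["a b a"],
                   translate=words, group=INT_ADD)
recomputed = aggregate(["b c", "a c"], words, INT_ADD)
```

The incrementally maintained view matches a full recomputation over the updated documents, which is the property the inverse elements are needed for.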
Case Study – Lessons Learned
• Numerical aggregation
  • Word histogram
    Translation function: translate web pages into (word, 1)
    Aggregation function: Abelian group (integers, +)
  • URL access frequency
• Set aggregation
  • Reverse web-link graph
    Translation function: translate web pages into (link target, link source)
    Aggregation function: Abelian group (power-multiset of URLs, multiset union)
  • Inverted index
• Multiset aggregation
  • Term-vector per host
Evaluation
[Figure: four plots, one each for word histogram, reverse web-link graph, URL access frequency, and term-vector per host. x-axis: updates in base documents [%], from 0 to 100; y-axis: elapsed time [min], logarithmic scale.]
Conclusion & Future Work
• View Maintenance in MapReduce
  • Case study
  • Summary delta algorithm
  • Self-maintainable aggregations
• Future Work
  • Broader class of MapReduce programs
  • High-level MapReduce languages, e.g. Jaql or Pig Latin