Adaptive Dataflow: A Database/Networking Cosmic Convergence Joe Hellerstein UC Berkeley



Page 1: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Adaptive Dataflow: A Database/Networking Cosmic Convergence

Joe Hellerstein, UC Berkeley

Page 2: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Road Map

How I got started on this
  CONTROL project
  Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
  Sensor networks
  P2P networks

Page 3: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Background: CONTROL project

Online/Interactive query processing
  Online aggregation
  Scalable spreadsheets & refining visualizations
  Online data cleaning (Potter’s Wheel)
Pipelining operators (ripple joins, online reordering) over streaming samples

Page 4: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Example: Online Aggregation
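
The idea behind online aggregation can be sketched in a few lines: keep a running estimate of the aggregate and report a confidence interval that tightens as more sampled rows arrive. The sketch below is only an illustration, not the CONTROL implementation. It assumes the rows arrive in random order, uses Welford's update for the variance, and uses a normal-approximation interval; the name `online_avg` and the `z` parameter are hypothetical.

```python
import math

def online_avg(stream, z=1.96):
    """Yield (n, running_mean, half_width) after each sampled value.

    Assumes the stream delivers rows in random order, so the prefix
    seen so far is a random sample of the whole table.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean (Welford)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            sample_var = m2 / (n - 1)
            half_width = z * math.sqrt(sample_var / n)
        else:
            half_width = float("inf")  # no variance estimate from one row
        yield n, mean, half_width
```

A user interface would redraw the estimate and its interval after every batch, and the user can stop the query as soon as the interval is tight enough.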

Page 5: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Online Data Visualization

CLOUDS

Page 6: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Potter’s Wheel

Page 7: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Goals for Online Processing

Performance metric:
  Statistical (e.g. confidence intervals)
  User-driven (e.g. weighted by widgets)
New “greedy” performance regime
  Maximize 1st derivative of the “mirth index”
  Mirth defined on-the-fly
  Therefore need FEEDBACK and CONTROL

[Figure: Online vs. Traditional processing; x-axis Time, y-axis up to 100%]

Page 8: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

CONTROL Volatility

Goals and data may change over time
  User feedback, sample variance
Goals and data may be different in different “regions”
  Group-by, scrollbar position
[An aside: dependencies in selectivity estimation]
Q: Query optimization in this world?
  Or in any pipelining, volatile environment??
  Where else do we see volatility?

Page 9: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Continuous Adaptivity: Eddies

A little more state per tuple
  Ready/done bits (extensible a la Volcano/Starburst)
Query processing = dataflow routing!!
  We'll come back to this!

Eddy

Page 10: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Eddies: Two Key Observations

Break the set-oriented boundary
  Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
  Usual DB implementation: pipelining operators!
    Subexpressions never materialized
  Typical implementation is more flexible than algebra
    We can reorder in-flight operators
  Other gains possible by breaking the set-oriented boundary…
Don’t rewrite graph. Impose a router
  Graph edge = absence of routing constraint
  Observe operator consumption/production rates
    Consumption: cost
    Production: cost*selectivity
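
A minimal sketch of the routing idea, assuming selection-only operators: each tuple carries done bits, and the eddy picks the next eligible operator by a weighted lottery. The ticket rule used here (credit an operator when it consumes a tuple, debit it when it returns one) is one way to act on observed consumption and production rates; the slide does not give a policy, so treat the rule as an assumption. `Filter` and `run_eddy` are hypothetical names.

```python
import random

class Filter:
    """A selection operator plus the statistics the eddy observes."""
    def __init__(self, name, predicate):
        self.name = name
        self.predicate = predicate
        self.tickets = 1    # lottery weight; kept >= 1 so every op can win
        self.consumed = 0   # tuples the eddy handed to this operator
        self.produced = 0   # tuples the operator handed back

def run_eddy(tuples, ops, rng):
    """Route each tuple through all ops, choosing the order per tuple."""
    output = []
    all_done = (1 << len(ops)) - 1
    for t in tuples:
        done = 0            # per-tuple done bits, one bit per operator
        alive = True
        while alive and done != all_done:
            eligible = [i for i in range(len(ops)) if not done & (1 << i)]
            weights = [ops[i].tickets for i in eligible]
            i = rng.choices(eligible, weights=weights)[0]
            op = ops[i]
            op.consumed += 1
            op.tickets += 1                          # credit on consumption
            if op.predicate(t):
                op.produced += 1
                op.tickets = max(1, op.tickets - 1)  # debit on production
                done |= 1 << i
            else:
                alive = False                        # tuple dropped
        if alive:
            output.append(t)
    return output
```

Operators that drop many tuples keep their credits and so win the lottery more often, which tends to push selective operators earlier in the per-tuple order without ever rewriting the plan graph.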

Page 11: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Road Map

How I got started on this
  CONTROL project
  Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
  Sensor networks
  P2P networks

Page 12: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Coincidence: Eddie Comes to Berkeley

CLICK: a NW router is a query plan!
  “The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ’99

Page 13: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Figure 3: Example Router Graph

Also Scout
  Paths the key to comm-centric OS
  “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson, OSDI ’96

Page 14: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

More Interaction: CS262 Experiment w/ Eric Brewer

Merge OS & DBMS grad class, over a year
Eric/Joe, point/counterpoint
Some tie-ins were obvious:
  memory mgmt, storage, scheduling, concurrency
Surprising: QP and networks go well side by side
  E.g. eddies and TCP Congestion Control
    Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment
    Eddies close to the n-armed bandit problem
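
To make the TCP side of the comparison concrete, here is a small sketch of additive-increase/multiplicative-decrease (AIMD), the standard congestion-avoidance rule in TCP. The slide names congestion control but not a specific rule, so the choice of AIMD, the fixed `capacity` threshold standing in for a loss signal, and the function name `aimd` are all assumptions made for illustration.

```python
def aimd(capacity, rounds, increase=1.0, decrease=0.5, start=1.0):
    """Return the sending window at the start of each round.

    The sender probes upward by `increase` per round and backs off
    by the factor `decrease` when the window exceeds `capacity`,
    which plays the role of a loss (back-pressure) signal here.
    """
    window = start
    trace = []
    for _ in range(rounds):
        trace.append(window)
        if window > capacity:
            window = max(1.0, window * decrease)  # multiplicative decrease
        else:
            window += increase                    # additive increase
    return trace
```

The sender never learns the capacity directly. It only reacts to feedback, which is the same position an eddy is in when it observes how quickly operators consume and return tuples.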

Page 15: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Networking Overview for DB People Like Me

Core function of protocols: data xfer
  Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)
  Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)
  -- Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ’90
Basic Internet assumption:
  “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)

Page 16: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

C & T’s Wacky Ideas

Thesis: nets are good at xfer control, not so good at data manipulation
Some C&T wacky ideas for better data manipulation
  Xfer semantic units, not packets (ALF)
  Auto-rewrite layers to flatten them (ILP)
  Minimize cross-layer ordering constraints
  Control delivery in parallel via packet content

[Slide callouts: Exchange! Data Modeling! Query Opt!]

Page 17: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Wacky New Ideas in QP

What if…
  We had unbounded data producers and consumers (“streams” … “continuous queries”)
  We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”)
  We couldn’t predict user behavior? (“control”)
  We couldn’t predict behavior of components in the dataflow? (“networked services”)
  We had partial failure as a given? (oops, have we ignored this?)
Yes … networking people have been here!
  Remember Van Jacobson’s quote?

Page 18: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

The Cosmic Convergence

NETWORKING RESEARCH
  Content-Based Routing
  Router Toolkits
  Content Addressable Networks
  Directed Diffusion
  Adaptivity, Federated Control, GeoScalability

DATABASE RESEARCH
  Adaptive Query Processing
  Continuous Queries
  Approximate/Interactive QP
  Sensor Databases
  Data Models, Query Opt, Data Scalability

Page 19: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

The Cosmic Convergence

NETWORKING RESEARCH
  Content-Based Routing
  Router Toolkits
  Content Addressable Networks
  Directed Diffusion
  Adaptivity, Federated Control, GeoScalability

DATABASE RESEARCH
  Adaptive Query Processing
  Continuous Queries
  Approximate/Interactive QP
  Sensor Databases
  Data Models, Query Opt, Data Scalability

Telegraph

Page 20: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Road Map

How I got started on this
  CONTROL project
  Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
  Sensor networks
  P2P networks

Page 21: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

What’s in the Sweet Spot?

Scenarios with:
  Structured Content
  Volatility
  Rich Queries
Clearly:
  Long-running data analysis a la CONTROL
  Continuous queries
  Queries over Internet sources and services
Two emerging scenarios:
  Sensor networks
  P2P query processing

Page 22: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Telegraph: Engineering the Sweet Spot

An adaptive dataflow system
Dataflow programming model
  A la Volcano, CLICK: push and pull. “Fjords”, ICDE02
  Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)
  SQL parser for convenience (looking at XQuery)
Adaptivity operators
  Eddies
    + Extensible rules for routing constraints, Competition
  SteMs (state modules)
  FLuX (Fault-tolerant Load-balancing eXchange)
Bounded and continuous:
  Data sources
  Queries

Page 23: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

State Modules (SteMs)

Goal: Further adaptivity through competition
  Multiple mirrored sources
    Handle rate changes, failures, parallelism
  Multiple alternate operators
Join = Routing + State
  SteM operator manages tradeoffs
  State Module, unifies caches, rendezvous buffers, join state
  Competitive sources/operators share building/probing SteMs
  Join algorithm hybridization!
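
“Join = Routing + State” can be illustrated with a symmetric hash join split into two state modules and a routing rule. This is a sketch under simplifying assumptions (one equality key, two sources named "R" and "S", a fixed build-then-probe route), not the Telegraph code; the class `SteM` and the function `symmetric_join` are hypothetical names.

```python
from collections import defaultdict

class SteM:
    """State module: a keyed store of the tuples seen from one source."""
    def __init__(self, key):
        self.key = key
        self.index = defaultdict(list)

    def build(self, row):
        self.index[row[self.key]].append(row)

    def probe(self, value):
        return list(self.index.get(value, []))

def symmetric_join(arrivals, key):
    """Join R and S on `key`.

    arrivals: iterable of (source, row) pairs, source in {"R", "S"},
    in whatever order the two sources happen to deliver rows.
    """
    stems = {"R": SteM(key), "S": SteM(key)}
    for source, row in arrivals:
        other = "S" if source == "R" else "R"
        stems[source].build(row)                    # route 1: build own SteM
        for match in stems[other].probe(row[key]):  # route 2: probe the other
            yield (row, match) if source == "R" else (match, row)
```

Because the state lives in the SteMs and the join logic lives in the routing, a router could in principle choose a different probe order or share one SteM among several competing operators.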

Vijayshankar Raman

[Figure: static dataflow vs. eddy vs. eddy + stems]

Page 24: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

FLuX: Routing Across Cluster

Fault Tolerance, Load Balancing
  Continuous/long-running flows need high availability
  Big flows need parallelism
    Adaptive Load-Balancing req’d
FLuX operator: Exchange plus…
  Adaptive flow partitioning (River)
  Transient state replication & migration
    RAID for SteMs
    Needs to be extensible to different ops:
      Content-sensitivity
      History-sensitivity
  Dataflow semantics
    Optimize based on edge semantics
    Networking tie-in again:
      • At-least-once delivery?
      • Exactly-once delivery?
      • In/Out of order?
Migration policy: the ski rental analogy
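
The ski rental analogy can be sketched as a break-even rule: keep paying the recurring cost of load imbalance ("renting") until the total paid equals the one-time cost of migrating state ("buying"), then migrate. The mapping of rent to a per-step imbalance penalty and of the purchase to migration cost is an assumption made here for illustration; the slide only names the analogy. `should_migrate` and `run_policy` are hypothetical names.

```python
def should_migrate(accumulated_penalty, migration_cost):
    """Break-even rule: migrate once rent paid so far reaches the buy price."""
    return accumulated_penalty >= migration_cost

def run_policy(penalties, migration_cost):
    """Apply the break-even rule to a sequence of per-step penalties.

    Returns (step at which migration happens, or None; total cost paid).
    """
    paid = 0.0
    for step, penalty in enumerate(penalties):
        if should_migrate(paid, migration_cost):
            return step, paid + migration_cost
        paid += penalty
    return None, paid
```

With a constant penalty, this rule never pays more than about twice what an oracle that knew the length of the imbalance in advance would pay, which is the classic ski rental guarantee.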

Mehul Shah

Page 25: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Continuously Adaptive Continuous Queries (CACQ)

Continuous Queries clearly need all this stuff! Address adaptivity 1st.
4 Ideas in CACQ:
  Use eddies to allow reordering of ops.
    But one eddy will serve for all queries
  Explicit tuple lineage
    Mark each tuple with per-op ready/done bits
    Mark each tuple with per-query completed bits
  Queries are data: join with Grouped Filter
    Much like XFilter, but for relational queries
  Joins via SteMs, shared across all queries
    Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!
    Delete a tuple from flow only if it matches no query
Next: F.T. CACQ via FLuXen
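
“Queries are data” can be illustrated with a toy grouped filter: the predicates of many queries on one attribute are stored in a sorted index, and a single lookup per tuple returns every query the tuple still satisfies. This sketch assumes all predicates have the form `attr > c` and uses a Python set in place of per-query bits; `GroupedFilter` and `route` are hypothetical names, not the CACQ data structures.

```python
import bisect

class GroupedFilter:
    """Index of 'attr > c' predicates registered by many queries."""
    def __init__(self):
        self.consts = []  # predicate constants, kept sorted
        self.qids = []    # qids[i] is the query that registered consts[i]

    def add(self, qid, const):
        i = bisect.bisect_right(self.consts, const)
        self.consts.insert(i, const)
        self.qids.insert(i, qid)

    def matching(self, value):
        # value > c holds exactly for the constants strictly below value
        i = bisect.bisect_left(self.consts, value)
        return set(self.qids[:i])

def route(value, grouped_filter, live_queries):
    """Return the queries still interested in this tuple.

    An empty result means the tuple matches no query and can be
    deleted from the flow.
    """
    return live_queries & grouped_filter.matching(value)
```

One probe of the index replaces one predicate evaluation per query, which is what lets a single eddy serve all registered queries.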

Sam Madden, Mehul Shah, Vijayshankar Raman

Page 26: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Road Map

How I got started on this
  CONTROL project
  Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
  Sensor networks
  P2P networks

Page 27: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Sensor Nets

“Smart Dust” + TinyOS
  Thousands of “motes”
  Expensive communication
  Power constraints
Query workload:
  Aggregation & approximation
  Queries and Continuous Queries
Challenges:
  Push the processing into the network
  Deal with volatility & failure
  CONTROL issues: data variance, user desires

Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab)

Simple example: Aggregation query
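
Pushing an aggregation query into the network can be sketched as follows: each mote combines its own reading with the partial states from its children and sends a single small record to its parent, instead of forwarding every raw reading. The example computes AVG with a (sum, count) partial state. The routing tree, the node names, and the functions `merge` and `aggregate` are hypothetical and only illustrate the idea.

```python
def merge(a, b):
    """Combine two (sum, count) partial states for AVG."""
    return (a[0] + b[0], a[1] + b[1])

def aggregate(node, children, readings):
    """Return the (sum, count) partial state for the subtree rooted at node.

    children: dict mapping a node to the list of its child nodes
    readings: dict mapping every node to its local sensor reading
    """
    state = (readings[node], 1)
    for child in children.get(node, []):
        # In a real network this is one message from child to parent.
        state = merge(state, aggregate(child, children, readings))
    return state
```

For a tree with root `"root"`, the query answer is `total / count` from `aggregate("root", children, readings)`, and each mote sends one record regardless of how large its subtree is.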

Page 28: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

P2P QP

Starting point: P2P as grassroots phenomenon
  Outrageous filesharing volume (1.8G files in October 2001)
  No business case to date
Challenge: scale DDBMS QP ideas to P2P
  Motivate why
  Pick the right parts of DBMS research to focus on
    Storage: no! QP: yes.
  Make it work:
    Scalability well beyond our usual target
    Admin constraints
    Unknown data distributions, load
    Heterogeneous comm/processing
    Partial failure

Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo

Page 29: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

A Grassroots Example: TeleNap

Page 30: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Themes Throughout

Adaptivity
  Requires clever system design
    The Exchange model: encapsulate in ops?
  Interesting adaptive policy problems
    E.g. eddy routing, flux migration
    Control Theory, Machine Learning
  Encompasses another CS goal?
    “No-knobs”, “Autonomic”, etc.
New performance regimes
  Decent performance in the common case
    Mean/Variance more important than MAX
  Interactive Metrics
    Time to completion often unimportant/irrelevant

Page 31: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

More Themes

Set-valued thinking as albatross?
  E.g. eddies vs. Kabra/DeWitt or Tukwila
  E.g. SteMs vs. Materialized Views
  E.g. CACQ vs. NiagaraCQ
  Some clean theory here would be nice
    Current routing correctness proofs are inelegant
Extensibility
  Model/language of choice is not clear
    SEQ? Relational? XQuery?
  Extensible operators, edge semantics
[A whine about VLDB’s absurd “Specificity Factor”]

Page 32: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Conclusions?

Too early for technical conclusions
Of this I’m sure:
  The CS262 experiment is a success
    Our students are getting a bigger picture than before
    I’m learning, finding new connections
    May morph to OS/Nets, Nets/DB
    Eventually rethink the systems software curriculum at the undergraduate level too
  Nets folks are coming our way
    Doing relevant work, eager to collaborate
  DB community needs to branch out
    Outbound: Better proselytizing in CS
    Inbound: Need new ideas

Page 33: Adaptive Dataflow:  A Database/Networking Cosmic Convergence

Conclusions, cont.

Sabbatical is a good invention
  Hasn’t even started, I’m already grateful!