Adaptive Dataflow: A Database/Networking Cosmic Convergence
Joe Hellerstein, UC Berkeley
Road Map
How I got started on this
  CONTROL project
  Eddies
Tie-ins to Networking Research
Telegraph & ongoing adaptive dataflow research
New arenas:
  Sensor networks
  P2P networks
Background: CONTROL project
Online/Interactive query processing
  Online aggregation
  Scalable spreadsheets & refining visualizations
  Online data cleaning (Potter’s Wheel)
Pipelining operators (ripple joins, online reordering) over streaming samples
Example: Online Aggregation
Online Data Visualization
CLOUDS
Potter’s Wheel
Goals for Online Processing
Performance metric:
  Statistical (e.g. conf. intervals)
  User-driven (e.g. weighted by widgets)
New “greedy” performance regime
  Maximize 1st derivative of the “mirth index”
  Mirth defined on-the-fly
  Therefore need FEEDBACK and CONTROL
[Figure: plot against Time, rising to 100%, with curves labeled Online and Traditional]
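A minimal sketch of the "statistical" metric above: a running average over a stream of random samples, reported with a confidence interval that tightens as more data arrives. The estimator here is a plain CLT-based interval (an assumption for illustration, not necessarily the estimator used in CONTROL), and the function name, `z` value and synthetic data are hypothetical.

```python
import math
import random

def online_average(samples, z=1.96):
    """Yield (n, running_mean, half_width) after each sample (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in samples:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        # CLT-based half-width of the interval; undefined for n < 2, so report infinity
        half = z * math.sqrt(m2 / (n - 1) / n) if n > 1 else math.inf
        yield n, mean, half

# Synthetic stand-in for tuples arriving in random order (an assumption).
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(10_000)]
estimates = list(online_average(data))
```

Each element of `estimates` is what a user could be shown at that moment; a user who is satisfied with the interval can stop the query early.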
CONTROL Volatility
Goals and data may change over time
  User feedback, sample variance
Goals and data may be different in different “regions”
  Group-by, scrollbar position
  [An aside: dependencies in selectivity estimation]
Q: Query optimization in this world?
  Or in any pipelining, volatile environment??
  Where else do we see volatility?
Continuous Adaptivity: Eddies
A little more state per tuple
  Ready/done bits (extensible a la Volcano/Starburst)
Query processing = dataflow routing!!
  We'll come back to this!
[Figure: Eddy]
Eddies: Two Key Observations
Break the set-oriented boundary
  Usual DB model: algebra expressions: (R ⋈ S) ⋈ T
  Usual DB implementation: pipelining operators!
    Subexpressions never materialized
  Typical implementation is more flexible than algebra
    We can reorder in-flight operators
  Other gains possible by breaking the set-oriented boundary…
Don’t rewrite graph. Impose a router
  Graph edge = absence of routing constraint
  Observe operator consumption/production rates
    Consumption: cost
    Production: cost*selectivity
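A minimal sketch of the router idea, under stated assumptions: each tuple carries per-operator done bits, and the eddy picks the next operator per tuple using a ticket lottery (a ticket is credited when an operator consumes a tuple and debited when it produces one, so selective operators attract more traffic). The `Filter` class, the predicates and the exact ticket rule are illustrative, and operator cost is ignored (all filters are treated as equally cheap).

```python
import random

class Filter:
    """A selection operator; `pred` is a hypothetical predicate for illustration."""
    def __init__(self, name, pred):
        self.name, self.pred = name, pred
        self.tickets = 1  # routing weight learned from observed behaviour

def run_eddy(tuples, ops, rng):
    """Route each tuple through all ops, in an order chosen per tuple."""
    out = []
    for t in tuples:
        done = set()      # per-tuple "done bits", one per operator
        alive = True
        while alive and len(done) < len(ops):
            ready = [op for op in ops if op.name not in done]
            op = rng.choices(ready, weights=[o.tickets for o in ready])[0]
            op.tickets += 1                           # credited for consuming a tuple
            if op.pred(t):
                op.tickets = max(1, op.tickets - 1)   # debited for producing one
                done.add(op.name)
            else:
                alive = False                         # tuple dropped by this filter
        if alive:
            out.append(t)
    return out

rng = random.Random(0)
ops = [Filter("even", lambda x: x % 2 == 0),   # passes about half of the input
       Filter("small", lambda x: x < 100)]     # passes a tenth of range(1000)
result = run_eddy(range(1000), ops, rng)
```

The output is the same whatever order the router chooses; only the amount of work changes, which is the point of routing rather than rewriting the plan.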
Coincidence: Eddie Comes to Berkeley
CLICK: a NW router is a query plan!
  “The Click Modular Router”, Robert Morris, Eddie Kohler, John Jannotti, and M. Frans Kaashoek, SOSP ‘99
  [Figure 3: Example Router Graph]
Also Scout
  Paths the key to comm-centric OS
  “Making Paths Explicit in the Scout Operating System”, David Mosberger and Larry L. Peterson. OSDI ‘96.
More Interaction: CS262 Experiment w/ Eric Brewer
Merge OS & DBMS grad class, over a year
Eric/Joe, point/counterpoint
Some tie-ins were obvious: memory mgmt, storage, scheduling, concurrency
Surprising: QP and networks go well side by side
  E.g. eddies and TCP Congestion Control
    Both use back-pressure and simple Control Theory to “learn” in an unpredictable dataflow environment
    Eddies close to the n-armed bandit problem
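For DB readers, a toy sketch of the feedback loop behind TCP congestion control: additive increase while the path keeps up, multiplicative decrease on a loss signal (AIMD). This is a deliberate simplification (no slow start, timeouts or real packets); the `capacity` threshold standing in for "loss" and all parameter values are assumptions.

```python
def aimd(capacity, rounds, increase=1.0, decrease=0.5):
    """Return the window size after each round of additive-increase/multiplicative-decrease."""
    window, trace = 1.0, []
    for _ in range(rounds):
        if window > capacity:      # loss signal (back-pressure): back off multiplicatively
            window *= decrease
        else:                      # no loss: probe for more bandwidth
            window += increase
        trace.append(window)
    return trace

trace = aimd(capacity=20.0, rounds=200)
```

The sender never learns `capacity` directly; it only observes the effect of its own sending rate, much as an eddy only observes operator consumption and production.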
Networking Overview for DB People Like Me
Core function of protocols: data xfer
  Data Manipulation (buffer, checksum, encryption, xfer to/fr app space, presentation)
  Transfer Control (flow/congestion ctl, detecting xmission probs, acks, muxing, timestamps, framing)
  -- Clark & Tennenhouse, “Architectural Considerations for a New Generation of Protocols”, SIGCOMM ‘90
Basic Internet assumption:
  “a network of unknown topology and with an unknown, unknowable and constantly changing population of competing conversations” (Van Jacobson)
C & T’s Wacky Ideas
Thesis: nets are good at xfer control, not so good at data manipulation
Some C&T wacky ideas for better data manipulation
  Xfer semantic units, not packets (ALF)
  Auto-rewrite layers to flatten them (ILP)
  Minimize cross-layer ordering constraints
  Control delivery in parallel via packet content
[Slide callouts: Exchange! Data Modeling! Query Opt!]
Wacky New Ideas in QP
What if…
  We had unbounded data producers and consumers (“streams” … “continuous queries”)
  We couldn’t know our producers’ behavior or contents?? (“federation” … “mediators”)
  We couldn’t predict user behavior? (“control”)
  We couldn’t predict behavior of components in the dataflow? (“networked services”)
  We had partial failure as a given? (oops, have we ignored this?)
Yes … networking people have been here!
  Remember Van Jacobson’s quote?
The Cosmic Convergence
NETWORKING RESEARCH
  Content-Based Routing
  Router Toolkits
  Content Addressable Networks
  Directed Diffusion
  Adaptivity, Federated Control, GeoScalability
DATABASE RESEARCH
  Adaptive Query Processing
  Continuous Queries
  Approximate/Interactive QP
  Sensor Databases
  Data Models, Query Opt, DataScalability
Telegraph
What’s in the Sweet Spot?
Scenarios with:
  Structured Content
  Volatility
  Rich Queries
Clearly:
  Long-running data analysis a la CONTROL
  Continuous queries
  Queries over Internet sources and services
Two emerging scenarios:
  Sensor networks
  P2P query processing
Telegraph: Engineering the Sweet Spot
An adaptive dataflow system
Dataflow programming model
  A la Volcano, CLICK: push and pull. “Fjords”, ICDE02
  Extensible set of pipelining operators, including relational ops, grouped filters (e.g. XFilter)
  SQL parser for convenience (looking at XQuery)
Adaptivity operators
  Eddies
    + Extensible rules for routing constraints, Competition
  SteMs (state modules)
  FLuX (Fault-tolerant Load-balancing eXchange)
Bounded and continuous:
  Data sources
  Queries
State Modules (SteMs)
Goal: Further adaptivity through competition
  Multiple mirrored sources
    Handle rate changes, failures, parallelism
  Multiple alternate operators
Join = Routing + State
  SteM operator manages tradeoffs
  State Module, unifies caches, rendezvous buffers, join state
  Competitive sources/operators share building/probing SteMs
  Join algorithm hybridization!
Vijayshankar Raman
[Figure labels: static dataflow, eddy, eddy + stems]
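A minimal sketch of "Join = Routing + State", assuming a two-way equijoin: each source gets its own state module, and an arriving tuple is routed to build into its own SteM and then probe the other one. With that routing the result is a symmetric hash join; other routing choices over the same state would give other join algorithms. The class and function names are hypothetical, not Telegraph's interfaces.

```python
from collections import defaultdict

class SteM:
    """A state module: an index over one source's tuples, supporting build and probe."""
    def __init__(self):
        self.index = defaultdict(list)

    def build(self, key, tup):
        self.index[key].append(tup)

    def probe(self, key):
        return list(self.index.get(key, []))

def stem_join(stream):
    """Join R and S on a key: build each arriving tuple into its own SteM, then probe the other."""
    stems = {"R": SteM(), "S": SteM()}
    results = []
    for source, key, payload in stream:
        other = "S" if source == "R" else "R"
        stems[source].build(key, payload)
        for match in stems[other].probe(key):
            # Always emit in (key, R-side, S-side) order, whichever side arrived last.
            pair = (payload, match) if source == "R" else (match, payload)
            results.append((key, *pair))
    return results

# Interleaved arrivals from two sources (made-up data).
arrivals = [("R", 1, "r1"), ("S", 1, "s1"), ("S", 2, "s2"),
            ("R", 2, "r2"), ("R", 1, "r1b")]
joined = stem_join(arrivals)
```

Because the state lives in the SteMs rather than inside a join operator, a second, mirrored source for R could build into the same SteM and compete with the first.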
FLuX: Routing Across Cluster
Fault Tolerance, Load Balancing
  Continuous/long-running flows need high availability
  Big flows need parallelism
    Adaptive Load-Balancing req’d
FLuX operator: Exchange plus…
  Adaptive flow partitioning (River)
  Transient state replication & migration
    RAID for SteMs
    Needs to be extensible to different ops:
      Content-sensitivity
      History-sensitivity
  Dataflow semantics
    Optimize based on edge semantics
    Networking tie-in again:
      • At-least-once delivery?
      • Exactly-once delivery?
      • In/Out of order?
  Migration policy: the ski rental analogy
Mehul Shah
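The ski rental analogy can be sketched with the classic break-even rule: keep "renting" until the rent paid would reach the purchase price, then "buy". Read for migration, one plausible mapping (an assumption, not stated on the slide) is that renting is tolerating an imbalanced partition for another interval and buying is paying the one-time cost of migrating state. The function names and integer costs are illustrative.

```python
def break_even_cost(rent, buy, days):
    """Cost of renting until the rent paid would reach `buy`, then buying."""
    threshold = -(-buy // rent)        # ceil(buy / rent): the day on which we buy
    if days < threshold:
        return rent * days             # the season ended before we ever bought
    return rent * (threshold - 1) + buy

def optimal_cost(rent, buy, days):
    """Cost with hindsight: either rent the whole time or buy on day one."""
    return min(rent * days, buy)
```

Without knowing how long the imbalance will last, this rule never pays more than twice the hindsight optimum, which is what makes it attractive as a default policy.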
Continuously Adaptive Continuous Queries (CACQ)
Continuous Queries clearly need all this stuff! Address adaptivity 1st.
4 Ideas in CACQ:
  Use eddies to allow reordering of ops.
    But one eddy will serve for all queries
  Explicit tuple lineage
    Mark each tuple with per-op ready/done bits
    Mark each tuple with per-query completed bits
  Queries are data: join with Grouped Filter
    Much like XFilter, but for relational queries
  Joins via SteMs, shared across all queries
    Note: mixed-lineage tuples in a SteM. I.e. shared state is not shared algebraic expressions!
    Delete a tuple from flow only if it matches no query
Next: F.T. CACQ via FLuXen
Sam Madden, Mehul Shah, Vijayshankar Raman
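A minimal sketch of explicit tuple lineage, assuming filter-only queries that share operators: each tuple carries one bitmap of per-op done bits and one of per-query completed bits, and it leaves the flow as soon as no live query still needs it. For brevity the operators are visited in a fixed order rather than routed by an eddy; the function name, bitmask encoding and example predicates are all hypothetical.

```python
def route(tup, ops, queries):
    """Run one tuple through shared filters, tracking lineage as two bitmaps.

    ops:     list of predicates; bit i of `done` means op i has processed the tuple
    queries: list of bitmasks; bit i set means the query requires op i
    Returns the indices of the queries the tuple satisfies.
    """
    done = 0          # per-op done bits
    completed = 0     # per-query completed bits (rejected or already delivered)
    matched = []
    for i, pred in enumerate(ops):
        live_needs = 0
        for q, needs in enumerate(queries):
            if not completed & (1 << q):
                live_needs |= needs
        if not live_needs & (1 << i):
            continue                      # no live query needs this op: skip the work
        done |= 1 << i
        passed = pred(tup)
        for q, needs in enumerate(queries):
            if completed & (1 << q):
                continue
            if not passed and needs & (1 << i):
                completed |= 1 << q       # this query can no longer match
            elif (needs & done) == needs:
                matched.append(q)         # every required op has run and passed
                completed |= 1 << q
    return matched

ops = [lambda t: t["price"] > 10, lambda t: t["qty"] < 5, lambda t: t["sym"] == "IBM"]
queries = [0b011, 0b101, 0b100]   # q0: ops 0 and 1; q1: ops 0 and 2; q2: op 2 only
```

Each shared predicate is evaluated at most once per tuple, however many queries reference it.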
Sensor Nets
“Smart Dust” + TinyOS
Thousands of “motes”
Expensive communication
Power constraints
Query workload:
  Aggregation & approximation
  Queries and Continuous Queries
Challenges:
  Push the processing into the network
  Deal with volatility & failure
  CONTROL issues: data variance, user desires
Joint work with Ramesh Govindan, Sam Madden, Wei Hong and David Culler (Intel Berkeley Lab)
Simple example: Aggregation query
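A minimal sketch of pushing an aggregation query into the network, assuming the motes form a routing tree: each mote combines its own reading with the partial (sum, count) records from its children and forwards a single record to its parent. The topology, mote names and readings are made up for illustration, and nothing here is TinyOS code.

```python
def aggregate(tree, readings, node):
    """Return the partial (sum, count) for the subtree rooted at `node`."""
    total, count = readings[node], 1
    for child in tree.get(node, []):
        s, c = aggregate(tree, readings, child)   # one partial record per child link
        total += s
        count += c
    return total, count

# Hypothetical routing tree (parent -> children) and one reading per mote.
tree = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
readings = {"root": 20.0, "a": 22.0, "b": 18.0, "c": 21.0, "d": 23.0, "e": 16.0}

total, count = aggregate(tree, readings, "root")
average = total / count
```

Each non-root mote sends exactly one small record, instead of relaying every raw reading from its subtree hop by hop, which matters when communication dominates the power budget.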
P2P QP
Starting point: P2P as grassroots phenomenon
  Outrageous filesharing volume (1.8G files in October 2001)
  No business case to date
Challenge: scale DDBMS QP ideas to P2P
  Motivate why
  Pick the right parts of DBMS research to focus on
    Storage: no! QP: yes.
  Make it work:
    Scalability well beyond our usual target
    Admin constraints
    Unknown data distributions, load
    Heterogeneous comm/processing
    Partial failure
Joint work with Scott Shenker, Ion Stoica, Matt Harren, Ryan Huebsch, Nick Lanham, Boon Thau Loo
A Grassroots Example: TeleNap
Themes Throughout
Adaptivity
  Requires clever system design
    The Exchange model: encapsulate in ops?
  Interesting adaptive policy problems
    E.g. eddy routing, flux migration
    Control Theory, Machine Learning
  Encompasses another CS goal?
    “No-knobs”, “Autonomic”, etc.
New performance regimes
  Decent performance in the common case
    Mean/Variance more important than MAX
  Interactive Metrics
    Time to completion often unimportant/irrelevant
More Themes
Set-valued thinking as albatross?
  E.g. eddies vs. Kabra/DeWitt or Tukwila
  E.g. SteMs vs. Materialized Views
  E.g. CACQ vs. NiagaraCQ
  Some clean theory here would be nice
    Current routing correctness proofs are inelegant
Extensibility
  Model/language of choice is not clear
    SEQ? Relational? XQuery?
  Extensible operators, edge semantics
[A whine about VLDB’s absurd “Specificity Factor”]
Conclusions?
Too early for technical conclusions
Of this I’m sure:
  The CS262 experiment is a success
    Our students are getting a bigger picture than before
    I’m learning, finding new connections
    May morph to OS/Nets, Nets/DB
    Eventually rethink the systems software curriculum at the undergraduate level too
  Nets folks are coming our way
    Doing relevant work, eager to collaborate
  DB community needs to branch out
    Outbound: Better proselytizing in CS
    Inbound: Need new ideas
Conclusions, cont.
Sabbatical is a good invention
  Hasn’t even started, I’m already grateful!