Stream and Sensor Data Management Zachary G. Ives University of Pennsylvania CIS 650 – Implementing Data Management Systems November 17, 2008


Stream and Sensor Data Management

Zachary G. Ives
University of Pennsylvania

CIS 650 – Implementing Data Management Systems

November 17, 2008


Converting between Streams & Relations

• Stream-to-relation operators:
  • Sliding window: tuple-based (last N rows) or time-based (within time range)
  • Partitioned sliding window: does grouping by keys, then does sliding window over that
  • Is this necessary or minimal?

• Relation-to-stream operators:
  • Istream: stream-ifies any insertions over a relation
  • Dstream: stream-ifies the deletes
  • Rstream: stream contains the set of tuples in the relation
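The two operator families above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: a relation is modeled as a sequence of set-valued snapshots (one per time step), and the function names are made up for this sketch rather than taken from STREAM. Partitioned windows are omitted.

```python
from collections import deque

def rows_window(stream, n):
    """Tuple-based window: after each arrival, yield the last n tuples."""
    window = deque(maxlen=n)
    for item in stream:
        window.append(item)
        yield list(window)

def range_window(stream, width):
    """Time-based window over (timestamp, value) pairs in timestamp order."""
    window = deque()
    for ts, value in stream:
        window.append((ts, value))
        # Evict everything outside the range (ts - width, ts].
        while window[0][0] <= ts - width:
            window.popleft()
        yield list(window)

def istream(snapshots):
    """Emit the tuples inserted since the previous snapshot."""
    previous = set()
    for current in snapshots:
        yield current - previous
        previous = current

def dstream(snapshots):
    """Emit the tuples deleted since the previous snapshot."""
    previous = set()
    for current in snapshots:
        yield previous - current
        previous = current

def rstream(snapshots):
    """Emit the full contents of the relation at every time step."""
    for current in snapshots:
        yield set(current)
```

For the snapshots {1, 2}, {2, 3}, {3}, Istream reports {1, 2}, then {3}, then nothing, while Dstream reports nothing, then {1}, then {2}.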


Some Examples

Select * From S1 [Rows 1000], S2 [Range 2 minutes]
Where S1.A = S2.A And S1.A > 10

Select Rstream(S.A, R.B) From S [Now], R
Where S.A = R.A


Building a Stream System

• Basic data item is the element: <op, time, tuple> where op ∈ {+, -}

• Query plans need a few new (?) items:
  • Queues
    • Used for hooking together operators, esp. over windows
    • (Assumption is that pipelining is generally not possible, and we may need to drop some tuples from the queue)
  • Synopses
    • The intermediate state an operator needs to carry around
    • Note that this is usually bounded by windows
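To make the element and synopsis ideas concrete, here is a minimal sketch of a [Rows n] window operator that keeps its window contents as a synopsis and emits "+" and "-" elements. The `Element` type and the function name are assumptions made for illustration, and the output list stands in for the queue that would feed the next operator.

```python
from collections import deque, namedtuple

# <op, time, tuple>, with op being "+" (insert) or "-" (delete).
Element = namedtuple("Element", ["op", "time", "tuple"])

def rows_window_elements(arrivals, n):
    """Turn (time, tuple) arrivals into +/- elements for a [Rows n] window."""
    synopsis = deque()  # operator state: the current window contents
    queue = []          # stands in for the queue to the downstream operator
    for time, tup in arrivals:
        if len(synopsis) == n:
            # The oldest tuple leaves the window before the new one enters.
            queue.append(Element("-", time, synopsis.popleft()))
        synopsis.append(tup)
        queue.append(Element("+", time, tup))
    return queue
```

Note that the synopsis never holds more than n tuples, which is the sense in which state is bounded by the window.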


Example Query Plan

What’s different here?


Some Tricks for Performance

• Sharing synopses across multiple operators
  • In a few cases, more than one operator may join with the same synopsis

• Can exploit punctuations or “k-constraints”
  • Analogous to interesting orders
  • Referential integrity k-constraint: bound of k between arrival of “many” element and its corresponding “one” element
  • Ordered-arrival k-constraint: need window of at most k to sort
  • Clustered-arrival k-constraint: bound on distance between items with same grouping attributes
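As an illustration of the ordered-arrival case, the sketch below sorts a nearly ordered stream with a buffer of at most k + 1 items. It assumes the constraint means that any item arriving at least k + 1 positions after an item s is no smaller than s; the function name is made up for this example.

```python
import heapq

def sort_with_k_constraint(stream, k):
    """Emit a k-constrained stream in sorted order using O(k) state."""
    buffer = []  # min-heap holding at most k + 1 items
    for item in stream:
        heapq.heappush(buffer, item)
        if len(buffer) > k:
            # Under the assumed constraint, nothing that arrives later
            # can be smaller than the current minimum.
            yield heapq.heappop(buffer)
    while buffer:
        yield heapq.heappop(buffer)
```

With k = 1, the input 2, 1, 3, 5, 4 comes out as 1, 2, 3, 4, 5 while never buffering more than two items.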


Query Processing – “Chain Scheduling”

• Similar in many ways to eddies

• May decide to apply operators as follows:
  • Assume we know how many tuples can be processed in a time unit
  • Cluster groups of operators into “chains” that maximize reduction in queue size per unit time
  • Greedily forward tuples into the most selective chain
  • Within a chain, process in FIFO order

• They also do a form of join reordering
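The greedy step can be sketched roughly as follows. This is a simplification, not the full algorithm: it assumes the chains have already been formed and that each chain is summarized by a per-tuple processing cost and an overall selectivity, and it simply picks the non-empty chain that shrinks queue size fastest. All names and numbers here are hypothetical.

```python
def pick_chain(chains):
    """Pick the non-empty chain with the steepest queue-size reduction.

    chains maps a chain name to (queued_tuples, cost_per_tuple, selectivity).
    Returns None when every chain's input queue is empty.
    """
    best, best_slope = None, 0.0
    for name, (queued, cost, selectivity) in chains.items():
        if queued == 0:
            continue
        # Fraction of a tuple removed from memory per unit of processing time.
        slope = (1.0 - selectivity) / cost
        if best is None or slope > best_slope:
            best, best_slope = name, slope
    return best
```

Tuples inside the chosen chain would then be processed in FIFO order, as the slide notes.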


Scratching the Surface: Approximation

They point out two areas where we might need to approximate output:
  • CPU is limited, and we need to drop some stream elements according to some probabilistic metric
    • Collect statistics via a profiler
    • Use Hoeffding inequality to derive a sampling rate in order to maintain a confidence interval
  • May need to do similar things if memory usage is a constraint
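One way to read the Hoeffding step: for n independent samples with values in a range of width R, the sample mean is within ε of the true mean with probability at least 1 − δ once n ≥ R² ln(2/δ) / (2ε²). The sketch below turns that bound into a sampling rate. The function names and the idea of a fixed number of tuples per window are assumptions for illustration; how the system itself applies the bound is not spelled out on the slide.

```python
import math

def hoeffding_sample_size(epsilon, delta, value_range=1.0):
    """Samples needed so |sample mean - true mean| < epsilon w.p. >= 1 - delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def sampling_rate(tuples_per_window, epsilon, delta, value_range=1.0):
    """Fraction of the window's tuples to keep; 1.0 means no shedding is safe."""
    needed = hoeffding_sample_size(epsilon, delta, value_range)
    return min(1.0, needed / tuples_per_window)
```

For example, ε = 0.1 and δ = 0.05 on values in [0, 1] needs 185 samples, so a window of 1850 tuples could be sampled at roughly 10%.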

Are there other options? When might they be useful?


STREAM in General

“Logical semantics first”

Starts with a basic data model: streams as timestamped sets

Develops a language and semantics
  • Heavily based on SQL

Proposes a relatively straightforward implementation
  • Interesting ideas like k-constraints
  • Interesting approaches like chain scheduling
  • No real consideration of distributed processing


Aurora

“Implementation first; mix and match operations from past literature”

Basic philosophy: most of the ideas in streams existed in previous research
  • Sliding windows, load shedding, approximation, …
  • So let’s borrow those ideas and focus on how to build a real system with them!

Emphasis is on building a scalable, robust system
  • Distributed implementation: Medusa


Queries in Aurora

Oddly: no declarative query language in the initial version! (Added for commercial product)

Queries are workflows of physical query operators (SQuAl)
  • Many operators resemble relational algebra ops


Example Query


Some Interesting Aspects

• A relatively simple adaptive query optimizer
  • Can push filtering and mapping into many operators
  • Can reorder some operators (e.g., joins, unions)

• Need built-in error handling
  • If a data source fails to respond in a certain amount of time, create a special alarm tuple
  • This propagates through the query plan

• Incorporate built-in load-shedding, RT sched. to support QoS

• Have a notion of combining a query over historical data with data from a stream
  • Switches from a pull-based mode (reading from disk) to a push-based mode (reading from network)


The Medusa Processor

Distributed coordinator between many Aurora nodes
  • Scalability through federation and distribution
  • Fail-over
  • Load balancing


Main Components

Lookup
  • Distributed catalog – schemas, where to find streams, where to find queries

Brain
  • Query setup, load monitoring via I/O queues and stats
  • Load distribution and balancing scheme is used
  • Very reminiscent of Mariposa!


Load Balancing

• Migration – an operator can be moved from one node to another
  • Initial implementation didn’t support moving of state
    • The state is simply dropped, and operator processing resumes
    • Implications on semantics?
  • Plans to support state migration

• “Agoric system model to create incentives”
  • Clients pay nodes for processing queries
  • Nodes pay each other to handle load – pairwise contracts negotiated offline
  • Bounded-price mechanism – price for migration of load, spec for what a node will take on
  • Does this address the weaknesses of the Mariposa model?


Some Applications They Tried

• Financial services (stock ticker)
  • Main issue is not volume, but problems with feeds
  • Two-level alarm system, where higher-level alarm helps diagnose problems
  • Shared computation among queries
  • User-defined aggregation and mapping
  • This is the main application for the commercial version (StreamBase)

• Linear road (sensor monitoring)
  • Traffic sensors in a toll road – change toll depending on how many cars are on the road
  • Combination of historical and continuous queries

• Environmental monitoring
  • Sliding-window calculations


Lessons Learned

• Historical data is important – not just stream data
  • (Summaries?)

• Sometimes need synchronization for consistency
  • “ACID for streams”?

• Streams can be out of order, bursty
  • “Stream cleaning”?

• Adaptors (and also XML) are important
  • … But we already knew that!

• Performance is critical
  • They spent a great deal of time using microbenchmarks and optimizing


Sensors and Sensor Networks

• Trends:
  • Cameras and other sensors are very cheap
  • Microprocessors and microcontrollers can be very small
  • Wireless networks are easy to build

• Why not instrument the physical world with tiny wireless sensors and networks?
  • Vision: “Smart dust”
  • Berkeley motes, RF tags, cameras, camera phones, temperature sensors, etc.

• Today we already see pieces of this:
  • Penn buildings and SCADA system
  • 250+ surveillance cameras through campus


What Can We Do with Sensor Networks?

Many “passive” monitoring applications:
  • Environmental monitoring:
    • temperature in different parts of a building
    • air quality
    • etc.
  • Law enforcement:
    • Video feeds and anomalous behavior
  • Research studies:
    • Study ocean temperature, currents
    • Monitor status of eggs in endangered birds’ nests
    • ZebraNet
  • Fun:
    • Record sporting events or performances from every angle (video & audio)

Ultimately, build reactive systems as well: robotics, Mars landers, …


Some Challenges

• Highly distributed!
  • May have thousands of nodes
  • Know about a few nodes within proximity; may not know location
  • Nodes’ transmissions may interfere with one another

• Power and resource constraints
  • Most of these devices are wireless, tiny, battery-powered
  • Can only transmit data every so often
  • Limited CPU, memory – can’t run sophisticated code

• High rate of failure
  • Collisions, battery failures, sensor calibration, …


The Target Platform

Most sensor network research argues for the Berkeley mote as a target platform:
  • Mote: 4MHz, 8-bit CPU
    • 128KB RAM
    • 512KB Flash memory
    • 40kbps radio, 100 ft range
  • Sensors:
    • Light, temperature, microphone
    • Accelerometer
    • Magnetometer

http://robotics.eecs.berkeley.edu/~pister/SmartDust/


Sensor Net Data Acquisition

• First: build routing tree

• Second: begin sensing and aggregation


Sensor Net Data Acquisition (Sum)

[Figure: routing tree of sensor nodes, each labeled with its own reading (mostly 5)]

• First: build routing tree

• Second: begin sensing and aggregation (e.g., sum)


Sensor Net Data Acquisition (Sum)

[Figure: the same routing tree, with each node also labeled with the partial sum it forwards to its parent]

• First: build routing tree

• Second: begin sensing and aggregation (e.g., sum)
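The two steps can be sketched as below. This is a toy model under stated assumptions: the radio neighborhood is given as an adjacency dictionary, the tree is built by a breadth-first flood from the base station, and every node adds its own reading to the sums received from its children. Node names and readings are made up.

```python
from collections import deque

def build_routing_tree(neighbors, root):
    """Flood from the base station; each node adopts the first sender as parent."""
    parent = {root: None}
    frontier = deque([root])
    while frontier:
        node = frontier.popleft()
        for other in neighbors[node]:
            if other not in parent:
                parent[other] = node
                frontier.append(other)
    return parent

def aggregate_sum(parent, readings):
    """Each node forwards its reading plus its children's partial sums."""
    children = {node: [] for node in parent}
    for node, p in parent.items():
        if p is not None:
            children[p].append(node)

    def subtree_sum(node):
        return readings[node] + sum(subtree_sum(c) for c in children[node])

    root = next(node for node, p in parent.items() if p is None)
    return subtree_sum(root)
```

Each node sends a single partial sum to its parent, so the number of messages is one per tree edge rather than one per reading per hop.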


Sensor Network Research

• Routing: need to aggregate and consolidate data in a power-efficient way
  • Ad hoc routing – generate routing tree to base station
  • Generally need to merge computation with routing

• Robustness: need to combine info from many sensors to account for individual errors
  • What aggregation functions make sense?

• Languages: how do we express what we want to do with sensor networks?
  • Many proposals here


A First Try: TinyOS and nesC

• TinyOS: a custom OS for sensor nets, written in nesC
  • Assumes low-power CPU
  • Very limited concurrency support: events (signaled asynchronously) and tasks (cooperatively scheduled)

• Applications built from “components”
  • Basically, small objects without any local state
  • Various features in libraries that may or may not be included

interface Timer {
  command result_t start(char type, uint32_t interval);
  command result_t stop();
  event result_t fired();
}


Drawbacks of this Approach

Need to write very low-level code for sensor net behavior

Only simple routing policies are built into TinyOS – some of the routing algorithms may have to be implemented by hand

Has required many follow-up papers to fill in some of the missing pieces, e.g., Hood (object tracking and state sharing), …


An Alternative

“Much” of the computation being done in sensor nets looks like what we were discussing with STREAM

Today’s sensor networks look a lot like databases, pre-Codd
  • Custom “access paths” to get to data
  • One-off custom code

So why not look at mapping sensor network computation to SQL?
  • Not very many joins here, but significant aggregation
  • Now the challenge is in picking a distribution and routing strategy that provides appropriate guarantees and minimizes power usage


TinyDB and TinySQL

Treat the entire sensor network as a universal relation
  • Each type of sensor data is a column in a global table
  • Tuples are created according to a sample interval (separated by epochs)
  • (Implications of this model?)

SELECT nodeid, light, temp
FROM sensors
SAMPLE INTERVAL 1s FOR 10s


Storage Points and Windows

Like Aurora, STREAM, can materialize portions of the data:

CREATE STORAGE POINT recentlight SIZE 8
AS (SELECT nodeid, light FROM sensors
    SAMPLE INTERVAL 10s)

and we can use windowed aggregates:

SELECT WINAVG(volume, 30s, 5s)
FROM sensors
SAMPLE INTERVAL 1s
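A rough Python reading of the windowed aggregate: with one sample per second, WINAVG(volume, 30s, 5s) is taken here to mean "every 5 seconds, report the average of the last 30 seconds of samples". The function name and the choice to wait for a full first window are assumptions of this sketch.

```python
def winavg(samples, window, slide, sample_interval=1):
    """Average of the last `window` seconds, reported every `slide` seconds."""
    per_window = window // sample_interval  # samples covered by one window
    per_slide = slide // sample_interval    # samples between two reports
    results = []
    # Assumption: the first report appears once a full window is available.
    for end in range(per_window, len(samples) + 1, per_slide):
        chunk = samples[end - per_window:end]
        results.append(sum(chunk) / per_window)
    return results
```

Over 40 seconds of samples this yields three reports, at seconds 30, 35, and 40.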


Events

ON EVENT bird-detect(loc):
  SELECT AVG(light), AVG(temp), event.loc
  FROM sensors AS s
  WHERE dist(s.loc, event.loc) < 10m
  SAMPLE INTERVAL 2s FOR 30s

How do we know about events?

Contrast to UDFs? triggers?


Power and TinyDB

Cost-based optimizer tries to find a query plan to yield lowest overall power consumption
  • Different sensors have different power usage
    • Try to order sampling according to selectivity (sounds familiar?)
    • Assumption of uniform distribution of values over range
  • Batching of queries (multi-query optimization)
    • Convert a series of events into a stream join – does this resemble anything we’ve seen recently?

Also need to consider where the query is processed…
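The sampling-order idea can be sketched with the classic rank heuristic for independent predicates: sample the sensor whose predicate has the lowest cost / (1 − selectivity) first, so that cheap, selective tests prevent expensive samples. The sensor names, energy costs, and selectivities below are invented for illustration, and this is not claimed to be TinyDB's exact optimizer.

```python
def order_predicates(predicates):
    """Order (name, energy_cost, selectivity) triples by rank, cheapest first."""
    def rank(pred):
        _, cost, selectivity = pred
        # A predicate that never filters anything goes last.
        return cost / (1.0 - selectivity) if selectivity < 1.0 else float("inf")
    return sorted(predicates, key=rank)

def expected_cost(ordered):
    """Expected energy per tuple when sampling stops at the first failed predicate."""
    total, reach = 0.0, 1.0
    for _, cost, selectivity in ordered:
        total += reach * cost   # pay for this sample only if we got this far
        reach *= selectivity    # fraction of tuples that pass on
    return total
```

With a cheap, selective light predicate (cost 1, selectivity 0.1) and an expensive magnetometer predicate (cost 90, selectivity 0.5), sampling light first costs about 10 units per tuple instead of 90.5.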


Dissemination of Queries

• Based on semantic routing tree idea
  • SRT build request is flooded first
  • Node n gets to choose its parent p, based on radio range from root

• Parent knows its children
  • Maintains an interval on values for each child
  • Forwards requests to children as appropriate

• Maintenance:
  • If interval changes, child notifies its parent
  • If a node disappears, parent learns of this when it fails to get a response to a query
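A small sketch of the forwarding rule: a parent passes a range query to a child only if the query range overlaps the interval of values in that child's subtree. For brevity the interval is recomputed on demand here, whereas the slide describes parents storing it and children sending updates. The class and method names are hypothetical.

```python
class SRTNode:
    def __init__(self, node_id, value):
        self.node_id = node_id
        self.value = value    # the attribute the tree is built on
        self.children = []

    def interval(self):
        """Range of values found in this node's subtree."""
        low = high = self.value
        for child in self.children:
            c_low, c_high = child.interval()
            low, high = min(low, c_low), max(high, c_high)
        return low, high

    def route(self, q_low, q_high):
        """Return the ids of all nodes that receive the query."""
        reached = [self.node_id]
        for child in self.children:
            c_low, c_high = child.interval()
            # Forward only when the subtree could hold matching values.
            if c_low <= q_high and q_low <= c_high:
                reached.extend(child.route(q_low, q_high))
        return reached
```

A node can receive a query that its own value does not satisfy, because it still has to forward the query toward matching descendants.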


Query Processing

• Mostly consists of sleeping!
  • Wake briefly, sample, and compute operators, then route onwards
  • Nodes are time synchronized
  • Awake time is proportional to the neighborhood size (why?)

• Computation is based on partial state records
  • Basically, each operation is a partial aggregate value, plus the reading from the sensor
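For an aggregate such as AVG, a partial state record can be sketched as a (sum, count) pair: each node folds its own reading into the records received from its children and forwards one merged record. The three function names are assumptions for this illustration.

```python
def init_psr(reading):
    """Partial state record for a single sensor reading: (sum, count)."""
    return (reading, 1)

def merge_psr(a, b):
    """Combine two partial state records."""
    return (a[0] + b[0], a[1] + b[1])

def evaluate_psr(psr):
    """Turn the final record into the aggregate value (here, the average)."""
    total, count = psr
    return total / count
```

A node reading 5 that receives (10, 2) and (8, 1) from its children forwards (23, 4), which the base station evaluates to 5.75. Forwarding plain averages instead would lose the counts needed to combine them correctly.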


Load Shedding & Approximation

• What if the router queue is overflowing?
  • Need to prioritize tuples, drop the ones we don’t want
  • FIFO vs. averaging the head of the queue vs. delta-proportional weighting

• Later work considers the question of using approximation for more power efficiency
  • If sensors in one region change less frequently, can sample less frequently (or fewer times) in that region
  • If sensors change less frequently, can sample readings that take less power but are correlated (e.g., battery voltage vs. temperature)

Thursday, 4:30PM, DB Group Meeting, I’ll discuss some of this work


The Future of Sensor Nets?

• TinySQL is a nice way of formulating the problem of query processing with motes
  • View the sensor net as a universal relation
  • Can define views to abstract some concepts, e.g., an object being monitored

• But:
  • What about when we have multiple instances of an object to be tracked? Correlations between objects?
  • What if we have more complex data? More CPU power?
  • What if we want to reason about accuracy?