1
Implementation and Research Issues in Query Processing for Wireless
Sensor Networks
Wei Hong, Intel Research Berkeley
Sam Madden, MIT
Adapted by L.B.
2
Declarative Queries
• Programming apps is hard
  – Limited power budget
  – Lossy, low-bandwidth communication
  – Require long-lived, zero-admin deployments
  – Distributed algorithms
  – Limited tools, debugging interfaces
• Queries abstract away much of the complexity
  – Burden on the database developers
  – Users get:
    • Safe, optimizable programs
    • Freedom to think about apps instead of details
3
TinyDB: Prototype Declarative Query Processor
• Platform: Berkeley Motes + TinyOS
• Continuous variant of SQL: TinySQL
• Power- and data-acquisition-based in-network optimization framework
• Extensible interface for aggregates, new types of sensors
4
TinyDB Revisited

SELECT MAX(mag)
FROM sensors
WHERE mag > thresh
SAMPLE PERIOD 64ms
• High-level abstraction:
  – Data-centric programming
  – Interact with the sensor network as a whole
  – Extensible framework
• Under the hood:
  – Intelligent query processing: query optimization, power-efficient execution
  – Fault mitigation: automatically introduce redundancy, avoid problem areas

[Figure: the app sends queries and triggers to TinyDB, which returns data from the sensor network.]
5
Feature Overview
• Declarative SQL-like query interface
• Metadata catalog management
• Multiple concurrent queries
• Network monitoring (via queries)
• In-network, distributed query processing
• Extensible framework for attributes, commands and aggregates
• In-network, persistent storage
6
Architecture

[Figure: on the PC side, a TinyDB GUI and a DBMS connect via JDBC to the TinyDB client API; on the mote side, the TinyDB query processor runs in a sensor network of nodes numbered 0-8.]
7
Data Model
• Entire sensor network as one single, infinitely-long logical table: sensors
• Columns consist of all the attributes defined in the network
• Typical attributes:
  – Sensor readings
  – Meta-data: node id, location, etc.
  – Internal states: routing tree parent, timestamp, queue length, etc.
• Nodes return NULL for unknown attributes
• On the server, all attributes are defined in catalog.xml
• Discussion: other alternative data models?
8
Query Language (TinySQL)
SELECT <aggregates>, <attributes>
[FROM {sensors | <buffer>}]
[WHERE <predicates>]
[GROUP BY <exprs>]
[SAMPLE PERIOD <const> | ONCE]
[INTO <buffer>]
[TRIGGER ACTION <command>]
9
Comparison with SQL
• Single table in FROM clause
• Only conjunctive comparison predicates in WHERE and HAVING
• No subqueries
• No column aliases in SELECT clause
• Arithmetic expressions limited to column op constant
• Only fundamental difference: the SAMPLE PERIOD clause
10
TinySQL Examples
SELECT nodeid, nestNo, light
FROM sensors
WHERE light > 400
EPOCH DURATION 1s

“Find the sensors in bright nests.”

The sensors table:

Epoch | nodeid | nestNo | light
------|--------|--------|------
0     | 1      | 17     | 455
0     | 2      | 25     | 389
1     | 1      | 17     | 422
1     | 2      | 25     | 405
11
TinySQL Examples (cont.)
SELECT region, CNT(occupied), AVG(sound)
FROM sensors
GROUP BY region
HAVING AVG(sound) > 200
EPOCH DURATION 10s

“Count the number of occupied nests in each loud region of the island.”

Result (regions w/ AVG(sound) > 200):

Epoch | region | CNT(…) | AVG(…)
------|--------|--------|-------
0     | North  | 3      | 360
0     | South  | 3      | 520
1     | North  | 3      | 370
1     | South  | 3      | 520
12
Event-based Queries
• ON event SELECT …
• Run a query only when an interesting event happens
• Event examples:
  – Button pushed
  – Message arrival
  – Bird enters nest
• Analogous to triggers, but events are user-defined
13
Query over Stored Data
• Named buffers in Flash memory
• Store query results in buffers
• Query over named buffers
• Analogous to materialized views
• Example:
  – CREATE BUFFER name SIZE x (field1 type1, field2 type2, …)
  – SELECT a1, a2 FROM sensors SAMPLE PERIOD d INTO name
  – SELECT field1, field2, … FROM name SAMPLE PERIOD d
14
Inside TinyDB
[Figure: the TinyDB query processor runs on TinyOS over a multihop network. A query such as SELECT AVG(temp) WHERE light > 400 compiles into a plan in which a filter (light > 400) feeds an aggregate (avg(temp)); results stream out per epoch (T:1, AVG: 225; T:2, AVG: 250). A schema catalog describes each attribute, e.g. temp: time to sample 50 uS, cost to sample 90 uJ, calibration table, units Deg. F, error ± 5 Deg F, get function getTempFunc().]

TinyDB code size:
• ~10,000 lines embedded C code
• ~5,000 lines (PC-side) Java
• ~3,200 bytes RAM (w/ 768-byte heap)
• ~58 kB compiled code (3x larger than the 2nd largest TinyOS program)
15
Tree-based Routing
• Tree-based routing
  – Used in:
    • Query delivery
    • Data collection
    • In-network aggregation
  – Relationship to indexing?

[Figure: a query Q (SELECT …) floods down a routing tree rooted at A (children B, C; leaves D, E, F); results R:{…} flow back up along the tree edges.]
16
Sensor Network Research
• Very active research area
  – Can’t summarize it all
• Focus: database-relevant research topics
  – Some outside of Berkeley
  – Other topics that are itching to be scratched
  – But, some bias towards work that we find compelling
17
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
18
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
19
Tiny Aggregation (TAG)
• In-network processing of aggregates
  – Common data analysis operation
    • Aka gather operation or reduction in || programming
  – Communication-reducing
    • Operator-dependent benefit
  – Across nodes during the same epoch
• Exploit query semantics to improve efficiency!
Madden, Franklin, Hellerstein, Hong. Tiny AGgregation (TAG), OSDI 2002.
20
Basic Aggregation
• In each epoch:
  – Each node samples local sensors once
  – Generates a partial state record (PSR)
    • local readings
    • readings from children
  – Outputs its PSR during an assigned comm. interval
• At end of epoch, the PSR for the whole network is output at the root
• New result on each successive epoch
• Extras:
  – Predicate-based partitioning via GROUP BY

[Figure: five-node routing tree, node 1 at the root.]
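The per-epoch merging of partial state records can be sketched for COUNT. This is a toy sketch, not TinyDB code; the tree shape and function names are our own:

```python
# Sketch (not TinyDB source): one epoch of in-network COUNT over a
# routing tree. Each node produces a partial state record (PSR) from
# its own sample plus its children's PSRs; the root's PSR is the answer.

def count_psr(node, children):
    """Merge this node's own contribution (1) with its children's PSRs."""
    return 1 + sum(count_psr(c, children) for c in children.get(node, []))

# Routing tree shaped like the slide's: 1 at the root, 2 and 3 its
# children, 4 and 5 one level below (assumed shape).
tree = {1: [2, 3], 3: [4, 5]}

print(count_psr(1, tree))  # root outputs the network-wide COUNT: 5
```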
21
Illustration: Aggregation
[Figure: SELECT COUNT(*) FROM sensors. Interval 4 of the epoch: the deepest node (4) transmits its partial count of 1 during its assigned communication interval.]
22
Illustration: Aggregation
[Figure: Interval 3: partial counts merge one level up the tree (a node reports a partial count of 2).]
23
Illustration: Aggregation
[Figure: Interval 2: merging continues toward the root (partial counts of 3 and 1 are transmitted).]
24
Illustration: Aggregation
[Figure: Interval 1: the root (node 1) outputs the network-wide count of 5.]
25
Illustration: Aggregation
[Figure: Interval 4 of the next epoch: the pipeline repeats, with node 4 again transmitting its partial count of 1.]
26
Aggregation Framework
• As in extensible databases, TinyDB supports any aggregation function conforming to:

  Agg_n = {f_init, f_merge, f_evaluate}

  f_init(a0) -> <a0>
  f_merge(<a1>, <a2>) -> <a12>
  f_evaluate(<a1>) -> aggregate value

  (<a> is a partial state record, PSR)

• Example: AVERAGE
  AVG_init(v) -> <v, 1>
  AVG_merge(<S1, C1>, <S2, C2>) -> <S1 + S2, C1 + C2>
  AVG_evaluate(<S, C>) -> S/C

• Restriction: merge must be associative and commutative
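The AVERAGE example above can be written out as a runnable sketch. The function names are ours; TinyDB's real aggregates are C modules behind a different interface:

```python
# Sketch of the TAG-style aggregate interface for AVERAGE, keeping a
# <sum, count> pair as the partial state record (PSR).

def avg_init(v):
    # A single reading becomes the PSR <v, 1>.
    return (v, 1)

def avg_merge(a, b):
    # Merge two PSRs: add sums and counts. Associative and commutative,
    # as the framework requires.
    return (a[0] + b[0], a[1] + b[1])

def avg_evaluate(a):
    # Turn the root's PSR into the final aggregate value.
    s, c = a
    return s / c

# A parent merging its own reading with two children's PSRs:
psr = avg_merge(avg_merge(avg_init(10), avg_init(20)), avg_init(30))
print(avg_evaluate(psr))  # 20.0
```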
27
Taxonomy of Aggregates

• TAG insight: classify aggregates according to various functional properties
  – Yields a general set of optimizations that can automatically be applied
  – Drives an API!

Property              | Examples                                   | Affects
----------------------|--------------------------------------------|------------------------------------------
Partial State         | MEDIAN: unbounded, MAX: 1 record           | Effectiveness of TAG
Monotonicity          | COUNT: monotonic, AVG: non-monotonic       | Hypothesis testing, snooping
Exemplary vs. Summary | MAX: exemplary, COUNT: summary             | Applicability of sampling, effect of loss
Duplicate Sensitivity | MIN: dup. insensitive, AVG: dup. sensitive | Routing redundancy
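One way the taxonomy can drive an optimizer is as a simple property table. The dict layout and helper below are our own illustration, not TinyDB's API:

```python
# Illustrative property table for aggregates, drawn from the taxonomy
# above, that an optimizer could consult when picking rewrites.

AGG_PROPERTIES = {
    "MAX":    dict(bounded_state=True,  monotonic=True,  exemplary=True,  dup_sensitive=False),
    "MIN":    dict(bounded_state=True,  monotonic=True,  exemplary=True,  dup_sensitive=False),
    "COUNT":  dict(bounded_state=True,  monotonic=True,  exemplary=False, dup_sensitive=True),
    "AVG":    dict(bounded_state=True,  monotonic=False, exemplary=False, dup_sensitive=True),
    "MEDIAN": dict(bounded_state=False, monotonic=False, exemplary=False, dup_sensitive=True),
}

def can_use_multiple_parents(agg):
    # Sending the full PSR to several parents is only safe when duplicates
    # cannot skew the answer (duplicate-insensitive aggregates).
    return not AGG_PROPERTIES[agg]["dup_sensitive"]

print(can_use_multiple_parents("MIN"))    # True
print(can_use_multiple_parents("COUNT"))  # False (needs result splitting)
```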
28
Use Multiple Parents
• Use graph structure
  – Increase delivery probability with no communication overhead
    • For duplicate-insensitive aggregates, or
    • Aggs expressible as a sum of parts
  – Send (part of) the aggregate to all parents
    • In just one message, via multicast
  – Assuming independence, decreases variance

SELECT COUNT(*)

Single parent (count c forwarded from A through one parent to root R):
  P(link xmit successful) = p
  P(success from A -> R) = p^2
  E(cnt) = c * p^2
  Var(cnt) = c^2 * p^2 * (1 - p^2) = V

Result splitting over n parents (A sends c/n to each; here n = 2, so c/2 to B and c/2 to C):
  E(cnt) = n * (c/n * p^2)
  Var(cnt) = n * (c/n)^2 * p^2 * (1 - p^2) = V/n
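The variance formulas above can be checked numerically. This is our own check, not from the slides; sending c/n of a count over each of n independent two-hop paths with per-link success probability p:

```python
# Numeric check of the result-splitting formulas: expectation is
# unchanged, variance drops by a factor of n.

def split_stats(c, p, n):
    # Each of the n shares (c/n) arrives with probability p**2.
    p2 = p ** 2
    e = n * (c / n) * p2
    var = n * (c / n) ** 2 * p2 * (1 - p2)
    return e, var

c, p = 4.0, 0.9
e1, v1 = split_stats(c, p, n=1)   # single parent
e2, v2 = split_stats(c, p, n=2)   # two parents

print(abs(e1 - e2) < 1e-12)       # True: expectation unchanged
print(abs(v2 - v1 / 2) < 1e-12)   # True: variance halved
```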
29
Multiple Parents Results
• Better than the previous analysis expected!
• Losses aren’t independent!
• Insight: spreads data over many links

[Figure: “Benefit of Result Splitting (COUNT query)”: average COUNT with and without splitting, in a simulation of 2500 nodes with a lossy radio model and 6 parents per node. Without splitting, a critical link depresses the count; with splitting, the average COUNT is substantially higher (y-axis 0 to 1400).]
30
Acquisitional Query Processing (ACQP)
• TinyDB acquires AND processes data
  – Could generate an infinite number of samples
• An acquisitional query processor controls
  – when,
  – where,
  – and with what frequency data is collected!
• Versus traditional systems, where data is provided a priori

Madden, Franklin, Hellerstein, and Hong. The Design of an Acquisitional Query Processor. SIGMOD, 2003.
31
ACQP: What’s Different?

• How should the query be processed?
  – Sampling as a first-class operation
• How does the user control acquisition?
  – Rates or lifetimes
  – Event-based triggers
• Which nodes have relevant data?
  – Index-like data structures
• Which samples should be transmitted?
  – Prioritization, summary, and rate control
32
Operator Ordering: Interleave Sampling + Selection

SELECT light, mag
FROM sensors
WHERE pred1(mag)
AND pred2(light)
EPOCH DURATION 1s

• E(sampling mag) >> E(sampling light): 1500 uJ vs. 90 uJ
• Traditional DBMS: data is assumed to exist, so both mag and light are sampled before pred1 and pred2 are applied
• ACQP: sample cheap light first, apply pred2, and sample costly mag only if pred2 passes
  – The correct ordering, unless pred1 is very selective and pred2 is not
• At 1 sample/sec, total power savings could be as much as 3.5 mW, comparable to the processor!
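A back-of-the-envelope comparison of per-tuple acquisition cost for the two orderings above (the selectivity value is our assumption; the sampling costs are from the slide):

```python
# Compare acquisition cost per tuple: acquire-everything vs. ACQP's
# interleaved sampling and selection.

MAG_COST, LIGHT_COST = 1500.0, 90.0   # uJ per sample, from the slide

def traditional_cost():
    # Traditional DBMS: acquire both attributes, then filter.
    return MAG_COST + LIGHT_COST

def acqp_cost(sel_pred2):
    # ACQP: sample light first; sample mag only when pred2(light) passes.
    return LIGHT_COST + sel_pred2 * MAG_COST

print(traditional_cost())        # 1590.0 uJ per tuple
print(acqp_cost(sel_pred2=0.5))  # 840.0 uJ when pred2 passes half the time
```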
33
Exemplary Aggregate Pushdown
SELECT WINMAX(light,8s,8s)
FROM sensors
WHERE mag > x
EPOCH DURATION 1s

• Novel, general pushdown technique
• Mag sampling is the most expensive operation!
• Traditional DBMS: sample both attributes, apply (mag > x), then compute WINMAX over light
• ACQP: sample light first and apply (light > MAX); only a reading that exceeds the current window max can change the result, so only then sample mag and apply (mag > x)
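The pushdown above can be sketched for one epoch. This is our illustration; sample_light and sample_mag stand in for real ADC calls:

```python
# Exemplary-aggregate pushdown for WINMAX: check the cheap attribute
# (light) against the running window max before sampling the expensive
# attribute (mag) at all.

def winmax_epoch(sample_light, sample_mag, window_max, x):
    light = sample_light()        # cheap: ~90 uJ
    if light <= window_max:
        return window_max         # can't change the max: skip mag entirely
    mag = sample_mag()            # costly: ~1500 uJ, only when needed
    if mag > x:                   # WHERE mag > x
        return light
    return window_max

# Toy epoch: light is below the current max, so mag is never sampled.
calls = []
result = winmax_epoch(lambda: (calls.append("light"), 300)[1],
                      lambda: (calls.append("mag"), 50)[1],
                      window_max=400, x=10)
print(result, calls)  # 400 ['light']
```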
34
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
35
Heterogeneous Sensor Networks
• Leverage small numbers of high-end nodes to benefit large numbers of inexpensive nodes
• Still must be transparent and ad-hoc
• Key to scalability of sensor networks
• Interesting heterogeneities:
  – Energy: battery vs. outlet power
  – Link bandwidth: Chipcon vs. 802.11x
  – Computing and storage: ATMega128 vs. XScale
  – Pre-computed results
  – Sensing nodes vs. QP nodes
36
Computing Heterogeneity with TinyDB
• Separate query processing from sensing
  – Provide query processing on a small number of nodes
  – Attract packets to query processors based on “service value”
• Compare the total energy consumption of the network under:
  – No aggregation
  – All aggregation
  – Opportunistic aggregation
  – HSN proactive aggregation

Mark Yarvis and York Liu, Intel’s Heterogeneous Sensor Network Project, ftp://download.intel.com/research/people/HSN_IR_Day_Poster_03.pdf.
37
5x7 TinyDB/HSN Mica2 Testbed
38
• How many aggregators are desired?
• Does placement matter?

[Figure: “Data Packet Saving”: % change in data packet count vs. number of aggregators (1 to 6, and all 35). “Data Packet Saving - Aggregator Placement”: % change in data packet count vs. aggregator location (node 25, 27, 29, 31, or all 35). Savings reach roughly -45%.]

• 11% aggregators achieve 72% of the max data reduction
• Optimal placement is 2/3 of the distance from the sink
39
Occasionally Connected Sensornets
[Figure: TinyDB query processors in disconnected sensor patches are bridged to a TinyDB server over the internet by fixed gateways (GTWY) and a mobile gateway that physically carries data between patches.]
40
Occasionally Connected Sensornets Challenges
• Networking support
  – Tradeoff between reliability, power consumption and delay
  – Data custody transfer: duplicates?
  – Load shedding
  – Routing of mobile gateways
• Query processing
  – Operator placement: in-network vs. on mobile gateways
  – Proactive pre-computation and data movement
• Tight interaction between networking and QP

Fall, Hong and Madden, Custody Transfer for Reliable Delivery in Delay Tolerant Networks, http://www.intel-research.net/Publications/Berkeley/081220030852_157.pdf.
41
Distributed In-network Storage
• Collectively, sensornets have large amounts of in-network storage
• Good for in-network consumption or caching
• Challenges:
  – Distributed indexing for fast query dissemination
  – Resilience to node or link failures
  – Graceful adaptation to data skews
  – Minimizing index insertion/maintenance cost
42
Example: DIM

• Functionality
  – Efficient range queries for multidimensional data
• Approaches
  – Divide the sensor field into bins
  – Locality-preserving mapping from m-d space to geographic locations
  – Use geographic routing such as GPSR
• Assumptions
  – Nodes know their locations and the network boundary
  – No node mobility

[Figure: events E1 = <0.7, 0.8> and E2 = <0.6, 0.7> map to bins; range query Q1 = <.5-.7, .5-1> is routed to the bins covering that range.]

Xin Li, Young Jin Kim, Ramesh Govindan and Wei Hong, Distributed Index for Multi-dimensional Data (DIM) in Sensor Networks, SenSys 2003.
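A locality-preserving mapping in the spirit of DIM can be sketched by interleaving the bits of the normalized attribute values (a z-order code). The bit depth and bin layout here are our assumptions, not DIM's exact scheme:

```python
# Map a 2-d event to a bin code whose bit-interleaved structure keeps
# nearby points in nearby bins, so a range query touches few bins.

def zorder_bin(x, y, bits=3):
    """Map (x, y) in [0,1)^2 to an interleaved-bit bin code."""
    xi = int(x * (1 << bits))   # quantize each axis to 2^bits cells
    yi = int(y * (1 << bits))
    code = 0
    for b in range(bits - 1, -1, -1):
        code = (code << 1) | ((xi >> b) & 1)   # x bit
        code = (code << 1) | ((yi >> b) & 1)   # y bit
    return code

# The slide's example events land in nearby bins:
print(zorder_bin(0.7, 0.8))   # E1
print(zorder_bin(0.6, 0.7))   # E2
```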
43
Statistical Techniques
• Approximations, summaries, and sampling based on statistics and statistical models
• Applications:
  – Limited bandwidth and large number of nodes -> data reduction
  – Lossiness -> predictive modeling
  – Uncertainty -> tracking correlations and changes over time
  – Physical models -> improved query answering
44
Correlated Attributes
• Data in sensor networks is correlated; e.g.,
  – Temperature and voltage
  – Temperature and light
  – Temperature and humidity
  – Temperature and time of day
  – etc.
45
IDSQ
• Idea: task sensors in order of best improvement to the estimate of some value:
  – Choose leader(s)
    • Suppress subordinates
    • Task subordinates, one at a time
  – Until some measure of goodness (error bound) is met
    • E.g., “Mahalanobis Distance”: accounts for correlations in axes, tends to favor minimizing the principal axis

See “Scalable Information-Driven Sensor Querying and Routing for ad hoc Heterogeneous Sensor Networks.” Chu, Haussecker and Zhao. Xerox TR P2001-10113. May, 2001.
46
Graphical Representation

Model the location estimate as a point with 2-dimensional Gaussian uncertainty.

[Figure: an uncertainty ellipse with its principal axis. Candidate sensors S1 and S2 leave residuals of equal area, but S1 is preferred because it reduces error along the principal axis.]
47
MQSN: Model-based Probabilistic Querying over Sensor Networks

[Figure: a query processor, backed by a probabilistic model, fronts a sensor network of nodes 1-9.]

Joint work with Amol Deshpande, Carlos Guestrin, and Joe Hellerstein
48
[Figure, step 1: a probabilistic query (select NodeID, Temp ± 0.1C where NodeID in [1..9] with conf(0.95)) arrives; the query processor consults the model and emits an observation plan: [Temp, 3], [Temp, 9].]
49
[Figure, step 2: the observation plan [Temp, 3], [Temp, 9] is pushed into the network, and only nodes 3 and 9 are observed.]
50
[Figure, step 3: the observed data ([Temp, 3] = …, [Temp, 9] = …) flows back and updates the model, and query results are returned: a bar chart of temperature (roughly 10-30 degrees) for nodes 1-4.]
51
Challenges
• What kind of models to use?
• Optimization problem:
  – Given a model and a query, find the best set of attributes to observe
  – Cost not easy to measure
    • Non-uniform network communication costs
    • Changing network topologies
  – Large plan space
    • Might be cheaper to observe attributes not in the query
      – e.g., Voltage instead of Temperature
• Conditional plans:
  – Change the observation plan based on observed values
52
MQSN: Current Prototype
• Multi-variate Gaussian models
  – Kalman filters to capture correlations across time
• Handles:
  – Range predicate queries
    • sensor value within [x,y], w/ confidence
  – Value queries
    • sensor value = x, w/in epsilon, w/ confidence
  – Simple aggregate queries
    • AVG(sensor value) = x, w/in epsilon, w/ confidence
• Uses a greedy algorithm to choose the observation plan
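The greedy idea can be sketched in miniature. This is our own toy construction, not MQSN's code: with a Gaussian model, repeatedly observe the attribute that most reduces the worst remaining predictive variance until every attribute is under a target. The correlations below are invented for illustration:

```python
# Toy greedy observation-plan chooser over a tiny Gaussian model.

# Prior variances and pairwise correlations among three node temperatures.
prior_var = {"t1": 4.0, "t2": 4.0, "t3": 1.0}
corr = {("t1", "t2"): 0.9, ("t1", "t3"): 0.2, ("t2", "t3"): 0.2}

def rho(a, b):
    return 1.0 if a == b else corr.get((a, b), corr.get((b, a), 0.0))

def posterior_var(attr, observed):
    # Conditioning a Gaussian on one observation o shrinks attr's variance
    # by (1 - rho^2); we apply the single strongest observation.
    shrink = min((1 - rho(attr, o) ** 2) for o in observed) if observed else 1.0
    return prior_var[attr] * shrink

def greedy_plan(target):
    plan = []
    while any(posterior_var(a, plan) > target for a in prior_var):
        # Observe the attribute that minimizes the worst remaining variance.
        best = min(prior_var, key=lambda o: max(
            posterior_var(a, plan + [o]) for a in prior_var))
        plan.append(best)
    return plan

# Observing t1 alone suffices: it pins t2 (correlation 0.9) cheaply.
print(greedy_plan(target=1.5))
```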
53
In-Net Regression
• Linear regression: a simple way to predict future values, identify outliers
• Regression can be across local or remote values, multiple dimensions, or with high-degree polynomials
  – E.g., node A’s readings vs. node B’s
  – Or, location (X,Y) versus temperature, e.g., over many nodes

[Figure: “X vs Y w/ Curve Fit”: scatter plot with linear fit y = 0.9703x - 0.0067, R² = 0.947.]

Guestrin, Thibaux, Bodik, Paskin, Madden. “Distributed Regression: an Efficient Framework for Modeling Sensor Network Data.” Under submission.
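A minimal least-squares sketch of the kind of fit shown in the plot (the standard closed form, not the paper's distributed kernel method; the readings are made-up numbers):

```python
# Closed-form simple linear regression:
#   slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x).

def linear_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# One node's readings predicting another's (illustrative data):
xs = [1, 2, 3, 4, 5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1]
slope, intercept = linear_fit(xs, ys)
print(round(slope, 2), round(intercept, 2))
```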
54
In-Net Regression (Continued)
• Problem: may require data from all sensors to build the model
• Solution: partition sensors into overlapping “kernels” that influence each other
  – Run regression in each kernel
    • Requires just local communication
  – Blend data between kernels
    • Requires some clever matrix manipulation
• End result: a regressed model at every node
  – Useful in failure detection, missing-value estimation
55
Exploiting Correlations in Query Processing
• Simple idea:
  – Given predicate P(A) over an expensive attribute A
  – Replace it with P’ over a cheap attribute A’ such that P’ evaluates to P
  – Problem: unless A and A’ are perfectly correlated, P’ ≠ P for all time
    • So we could incorrectly accept or reject some readings
• Alternative: use correlations to improve selectivity estimates in query optimization
  – Construct conditional plans that vary predicate order based on prior observations
56
Exploiting Correlations (Cont.)
• Insight: by observing a (cheap and correlated) variable not involved in the query, it may be possible to improve query performance
  – Improves estimates of selectivities
• Use conditional plans
• Example: predicates Light > 100 Lux and Temp < 20° C, each with cost 100
  – Unconditioned, both have selectivity .5: either ordering has expected cost 100 + .5 * 100 = 150
  – After observing Time in [6pm, 6am], the selectivities shift to .1 and .9; putting the more selective predicate first gives expected cost 100 + .1 * 100 = 110
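The arithmetic behind the conditional plan above, as a small sketch (numbers from the slide's example):

```python
# Expected cost of a two-predicate plan: the first predicate always runs;
# the second, equally costly one runs only on tuples that pass the first.

COST = 100.0  # per-predicate evaluation cost

def expected_cost(sel_first):
    return COST + sel_first * COST

# Unconditioned: both predicates have selectivity .5.
print(expected_cost(0.5))  # 150.0

# Conditioned on Time in [6pm, 6am], Light > 100 Lux has selectivity .1,
# so the conditional plan evaluates it first.
print(expected_cost(0.1))  # 110.0
```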
57
In-Network Join Strategies
• Types of joins:
  – non-sensor -> sensor
  – sensor -> sensor
• Optimization questions:
  – Should the join be pushed down?
  – If so, where should it be placed?
  – What if a join table exceeds the memory available on one node?
58
Choosing Where to Place Operators
• Idea: choose a “join node” to run the operator
• Over time, explore other candidate placements
  – Nodes advertise data rates to their neighbors
  – Neighbors compute the expected cost of running the join based on these rates
  – Neighbors advertise costs
  – The current join node selects a new, lower-cost node

Bonfils + Bonnet, Adaptive and Decentralized Operator Placement for In-Network Query Processing, IPSN 2003.
59
Topics
• In-network aggregation
• Acquisitional Query Processing
• Heterogeneity
• Intermittent Connectivity
• In-network Storage
• Statistics-based summarization and sampling
• In-network Joins
• Adaptivity and Sensor Networks
• Multiple Queries
60
Adaptivity In Sensor Networks
• Queries are long-running
• Selectivities change
  – E.g., night vs. day
• Network load and available energy vary
• All suggest that some adaptivity is needed
  – Of data rates or granularity of aggregation when optimizing for lifetimes
  – Of operator orderings or placements when selectivities change (c.f., conditional plans for correlations)
• As far as we know, this is an open problem!
61
Multiple Queries and Work Sharing
• As sensornets evolve, users will run many queries simultaneously
  – E.g., traffic monitoring
• Likely that queries will be similar
  – But have different end points, parameters, etc.
• Would like to share processing and routing as much as possible
• But how? Again, an open problem.
62
Concluding Remarks
• Sensor networks are an exciting emerging technology, with a wide variety of applications
• Many research challenges in all areas of computer science
  – Database community included
  – Some agreement that a declarative interface is right
• TinyDB and other early work are an important first step
• But there’s lots more to be done!