Fishing for Patterns in (Shallow) Geometric Streams

Subhash Suri

UC Santa Barbara and

ETH Zurich

IIT Kanpur Workshop on Data Streams Dec 18-20

Geometric Streams

• A stream of points (dim 1, 2, 3, …)

• Abstract view of multi-attribute data:

– IP packets, database transactions, geographic sensor data, processor instruction stream etc.

[Figure: example streams. IP traffic in NetViewer (DDoS, worm), code profiling, sensor eScan]

Shape of a Point Stream

• Form informs about the function

• Identifying visually interesting patterns (“shape”) of point stream

– Areas of high density (hot spots).

– Large empty areas (cold spots).

– Population estimates of geometric ranges

– A geometric summary of the distribution of the stream.

• Deliberately vague and ill-posed; some specifics later.

Outline

• No attempt to survey

• Adaptive Spatial Partitioning

– Generic summary structure (Algorithmica ‘06)

– Q-Digest: sensornet data aggregation (SenSys’04)

– Range Adaptive Profiling for Programs (CGO ‘06)

• Specialized geometric patterns and queries:

– Range queries (SoCG ‘04)

– Hierarchical Heavy-hitters (PODS ‘05)

• Shape of the stream: ClusterHulls (Alenex‘06)

• Conclusions

Adaptive Spatial Partitioning

• A subdivision of space into square cells.

• Each cell maintains O(1) size info, essentially count of points in it.

• Tension between coverage and precision:

Large cells cover a lot, but with poor precision

Small cells have good precision, but poor coverage

• Dynamically adapt the subdivision to the distribution of points in the stream.

• Adaptive zoom: more precision (cells) where the action is, and fewer elsewhere.

• [HSST], ISAAC ‘04, Algorithmica ‘06

ASP Structure

• Data structure size is function of accuracy parameter ε

• Initially, a single box (LxL), and its counter.

• When the count of a box b > εn

– Freeze b’s counter

– Split b into 4 sub-boxes

– Introduce a new counter for each sub-box

• This hierarchically defined structure of boxes (a streaming quad tree) is our ASP.

Adaptivity: Refine and Unrefine

• The structure must adapt to the changing distribution of points:

– New regions become heavy

– Previously heavy regions may become light/cold.

• Refine operation puts new counters where the action is increasing:

• Stream Processing: for each item x

– Locate the smallest box v containing x, increment its count

– Refine: If count of v > εn

– Split v into 4 children sub-boxes, each with a new counter, initialized to 0.

– Old counter of v frozen.
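The refine step above can be sketched as a small streaming quad tree. This is a minimal illustration, not the authors' implementation: the names `Box`, `ASP`, `boxes`, and the `min_size` cutoff (which stops splitting at unit boxes) are assumptions introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Box:
    x: float            # lower-left corner
    y: float
    size: float         # side length
    count: int = 0      # points counted at this box (frozen once split)
    children: list = field(default_factory=list)

class ASP:
    def __init__(self, L, eps, min_size=1.0):
        self.root = Box(0.0, 0.0, float(L))
        self.eps = eps
        self.min_size = min_size   # assumption: never split below this side length
        self.n = 0                 # stream size so far

    def insert(self, px, py):
        self.n += 1
        v = self.root
        while v.children:          # locate the smallest box containing the point
            half = v.size / 2
            ix = 1 if px >= v.x + half else 0
            iy = 1 if py >= v.y + half else 0
            v = v.children[2 * iy + ix]
        v.count += 1
        # Refine: v's counter is frozen, four fresh counters start at 0.
        if v.count > self.eps * self.n and v.size > self.min_size:
            half = v.size / 2
            v.children = [Box(v.x + dx * half, v.y + dy * half, half)
                          for dy in (0, 1) for dx in (0, 1)]

    def boxes(self):
        """Iterate over every box in the tree (frozen and live)."""
        stack = [self.root]
        while stack:
            v = stack.pop()
            yield v
            stack.extend(v.children)
```

Feeding the same point repeatedly makes the tree zoom in on it: the boxes along the path keep their small frozen counts, and almost all of the mass ends up in the smallest box.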

Refine operation

Unrefine Operation

• To conserve memory, boxes with low counts must be deleted.

• A previously heavy box may become light because n, the size of the stream, has increased, and so its count is below εn.

• Unrefine: if count of box v and its children < εn/2

– Delete the children boxes and

– Add their counts to count of v

– (v’s old counter revived)

• Refinement occurs only at the node of a new insert; unrefinement can occur anywhere (non-locality).

• A heap for fast unrefine ops.
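The unrefine rule can be sketched as a bottom-up pass over the tree. This is an illustrative sketch only: the `Node` class and `unrefine` function are hypothetical names, it assumes only nodes whose children are all leaves are collapsed, and it uses a full scan where the slides use a heap to find candidates quickly.

```python
class Node:
    def __init__(self, count=0, children=None):
        self.count = count
        self.children = children or []

def unrefine(v, eps, n):
    """Post-order pass: collapse v when its children are all leaves and
    count(v) plus the children's counts is below eps * n / 2."""
    for c in v.children:
        unrefine(c, eps, n)
    if v.children and all(not c.children for c in v.children):
        total = v.count + sum(c.count for c in v.children)
        if total < eps * n / 2:
            v.count = total    # add the children's counts to v
            v.children = []    # delete the children; v's counter is live again
    return v
```

Because the threshold grows with n, a pass like this can collapse boxes far from the most recent insert, which is the non-locality noted above.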

Unrefine operation

The Data Structure

• ASP represented as a 4-ary tree

[Figure: the ASP-tree]

Analysis of ASP

• (Space Bound):

– For each node v, the count of v, its siblings, and parent > εn/2

– Total number of boxes at most O(1/ε)

• (Per-point Processing Time):

– Naïve will be O(lg L): tree height

– With heap, centroid tree (amortized) time O(lg 1/ε)

• (Count Bound):

– Each point counted in exactly one box

– Points contained in a box b are counted at b or one of its ancestors

– Depth of the tree by the binary partitioning rule is O(log L)

– Error in a leaf’s count is O(εn*log L).

– Using memory = O(1/ε * log L), the count error bounded by εn.
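The count bound can be written out as a short derivation. The symbols below are assumptions introduced for illustration: c(b) is the true number of stream points in box b, \hat{c}(b) is the sum of the counters stored in the subtree of b, h is the tree depth, and ε' is a rescaled accuracy parameter.

```latex
% Points of b that are missing from its subtree were counted at ancestors of b,
% and each frozen ancestor counter holds roughly at most \varepsilon n points.
\begin{align*}
\hat{c}(b) \;\le\; c(b) \;&\le\; \hat{c}(b) + \sum_{a \,\in\, \mathrm{anc}(b)} \mathrm{count}(a)
   \;\le\; \hat{c}(b) + O(h \,\varepsilon n), \qquad h = O(\log L). \\
\intertext{Running the structure with $\varepsilon' = \varepsilon / h$ instead of $\varepsilon$ gives}
\text{error} \;&\le\; O(h \,\varepsilon' n) \;=\; O(\varepsilon n),
\qquad
\text{boxes} \;=\; O(1/\varepsilon') \;=\; O\!\left(\tfrac{1}{\varepsilon}\log L\right).
\end{align*}
```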

Spatial Summary

• A partition into O(1/ε * log L) boxes, with auto-adaptive zoom.

• No undivided box has more than εn points: only leaf nodes can.

• Gives a qualitative summary of the stream’s spatial distribution: a visual sense of hot and cold regions.

Two applications and two theorems

• Data aggregation in sensor networks

– Distributed version of ASP structure

• Code profiling in processor streams

– Hardware implementation of ASP

• Theoretical bounds for range searching

– Worst-case guarantees for rectangle range searching

• Lower bounds on hierarchical heavy hitters

– Space complexity

Geometric Summaries in Sensornets

• Self-organizing networks of tiny, cheap sensors,

– Integrated sensing, computing, radio communication,

– Continuous, real-time monitoring of remote, hard to reach areas.

– Limited power (battery), bandwidth, memory.

• Communication typically the biggest drain on energy

• Perform as much local processing as possible, and transmit smart summaries.

• Similar to synopses: distributed data, rather than one-pass.

• Active area: in-network aggregation, compressed sensing.

Distributed ASP

• Q-Digest: an approximate histogram

– Shrivastava, Buragohain, Agrawal, S. [SenSys ‘04]

• ASP for 1-dim data signal (measurements of sensors): vibration data, acoustics, toxin levels, etc.

• Going beyond min, max, or average, and approximating quantiles.

• Sensors form an aggregation tree, rooted at base station.

• Data flows from leaves to the base station, always reduced to a summary of size K (a user parameter).

• The key point is that ASP is efficiently mergeable:

– Given q-digests of children, a node can compute the merged q-digest.

• Space/quality bounds of ASP carry over.
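Mergeability can be illustrated with a toy q-digest over a small integer universe. This is a sketch under stated assumptions, not the SenSys ’04 code: the universe size `SIGMA`, the heap-style node ids (root 1, children 2p and 2p+1, leaves SIGMA to 2·SIGMA−1), the compression parameter `k`, and all function names are chosen here for illustration.

```python
from collections import Counter

SIGMA = 16  # hypothetical value universe [0, SIGMA), a power of two

def compress(digest, k):
    """Bottom-up: fold a sibling pair into its parent while the three
    counters together hold at most floor(n / k) items."""
    n = sum(digest.values())
    limit = n // k
    out = Counter(digest)
    for p in range(SIGMA - 1, 0, -1):      # deeper nodes have larger ids
        l, r = 2 * p, 2 * p + 1
        kids = out[l] + out[r]
        if kids > 0 and out[p] + kids <= limit:
            out[p] += kids
            out.pop(l, None)
            out.pop(r, None)
    return out

def build(values, k):
    return compress(Counter(SIGMA + x for x in values), k)

def merge(a, b, k):
    """Merge two digests: add counters node by node, then re-compress."""
    return compress(a + b, k)

def node_range(v):
    depth = v.bit_length() - 1
    width = SIGMA >> depth
    lo = (v - (1 << depth)) * width
    return lo, lo + width - 1

def quantile(digest, q):
    """Approximate q-quantile: scan nodes by right endpoint, small ranges first."""
    n = sum(digest.values())
    order = sorted(digest, key=lambda v: (node_range(v)[1], -node_range(v)[0]))
    seen = 0
    for v in order:
        seen += digest[v]
        if seen >= q * n:
            return node_range(v)[1]
    return SIGMA - 1
```

A parent in the aggregation tree would call `merge` on its children's digests and forward only the result, so the message size stays bounded in terms of `k` regardless of how many sensors lie below it.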

[Figure: aggregation tree rooted at the base station]

A simulation

8000 sensors, each generating a 2-byte integer (Death Valley elevation data)

Error in rank (true - est): < 5% with a 160-byte Q-Digest, < 2% with a 400-byte Q-Digest

Code Profiling

• Stream of program instructions

• Profiling: Understand code behavior

– Access patterns, cache behavior, load value distributions

– Example: which program segments are hot, and how hot?

• Challenges

– Large item space: programs with 1M basic blocks

– Profiling should take little space and add little overhead

– ASP adaptation to profile high frequency code segments

[Figure: instruction stream divided into basic blocks, shaded from frequent to rare. The sample block repeats the sequence: push %ebp; mov %esp,%ebp; sub $0x38,%esp; and $0xfffffff0,%esp; mov $0x0,%eax; sub %eax,%esp; sub $0x8,%esp; push $0x28; push $0x8048468; call 80482b0; add $0x10,%esp]

Range Adaptive Profiling [CGO ‘06]

• Small fixed memory (counters)

• Dynamically zoom onto high frequency code segments.

• 1d adaptation of ASP with various “optimizations” to reduce memory and processing time.

– Lot of constants squeezing

– Batching of unrefinements

– Branching factor choices

• Design specs for specialized hardware for profiling (www.cs.ucsb.edu/~arch/rap)

[Figure: a cold range and a hot range]

Range Adaptive Profiling

• Use RAP to estimate frequency of arbitrary ranges.

– Count errors due to not splitting early enough

– Regions undergoing hot/cold spells

• Typical performance: 8K memory sufficient for 97% accuracy.
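Estimating the frequency of an arbitrary range from a set of range counters can be sketched as below. The `counters` layout (half-open ranges with a count each) and the function name `estimate_range` are assumptions for illustration, not RAP's hardware design. The gap between the two returned bounds comes from counters that only partially overlap the query, such as frozen ancestors that were not split early enough.

```python
def estimate_range(counters, lo, hi):
    """counters: iterable of (start, end, count) for half-open ranges [start, end).
    Returns (lower, upper) bounds on the number of items in [lo, hi)."""
    lower = 0
    partial = 0
    for start, end, count in counters:
        if lo <= start and end <= hi:
            lower += count            # counter lies fully inside the query
        elif start < hi and lo < end:
            partial += count          # counter straddles the query boundary
    return lower, lower + partial
```

For example, with counters `[(0, 16, 3), (0, 8, 2), (0, 4, 10), (4, 8, 1), (8, 16, 5)]`, the query `[0, 8)` is bounded between 13 and 16: only the frozen root counter is ambiguous.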

[Figure: percent error (axis 0 to 3) on SPEC benchmarks: gcc, gzip, mcf, parser, vortex, vpr, average]

Yes, but….

That’s well and nice in practice,

but how does it work in theory!

Range Searching in Streams

• A stream of k-dimensional points.

– Summary to approximate counts of geometric ranges.

• VC dimension, ε-nets and ε-approximation.

• “Nice” geometric ranges have small (bounded) VC dimension: e.g. rectangles, balls, half planes etc.

ε-approximation Theorem: For every range space (X, R) of fixed VC dim, there exists a subset A of X of size O((1/ε²) lg (1/ε)) s.t. for every range r in R, | |r ∩ A|/|A| − |r ∩ X|/|X| | ≤ ε.

• Iceberg error (εn) unavoidable
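A uniform random sample of sufficient size is an ε-approximation with high probability, so range counts can be estimated from a sample kept in one pass. The sketch below uses standard reservoir sampling, which is a swapped-in textbook technique and not the deterministic constructions discussed in these slides; the function names are illustrative.

```python
import random

def reservoir_sample(stream, m, rng):
    """Keep a uniform random sample of m items from a stream of unknown length."""
    sample = []
    for t, p in enumerate(stream):
        if t < m:
            sample.append(p)
        else:
            j = rng.randrange(t + 1)   # item t survives with probability m / (t + 1)
            if j < m:
                sample[j] = p
    return sample

def estimate_count(sample, n, rect):
    """Scale the sample's hit fraction for rect = (x0, y0, x1, y1) up to n points."""
    x0, y0, x1, y1 = rect
    inside = sum(1 for (x, y) in sample if x0 <= x <= x1 and y0 <= y <= y1)
    return n * inside / len(sample)
```

The additive error of such an estimate scales with n times the sampling error, which is why small ranges below the εn "iceberg" level cannot be counted reliably.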

ε-Approximations: challenges

• Large summary size: Ω(1/ε²)

– Would prefer O(1/ε)

• ε-nets are small but can't estimate ranges

• Deterministic construction a space hog.

• The best streaming algorithm for ε-approximation requires working space

O( (1/ε)^(d+1) lg^O(d+1) n ) [BCEG ‘04]

Some Theorems [STZ, SoCG ‘04]

• Deterministic Multipass:

With d passes over data, can build a deterministic data structure for rectangular queries of size

O((1/ε) lg^(2d-2) (1/ε)).

• Randomized Single pass:

A data structure for rectangular range queries in 2d with error at most εn, with prob > 1 - o(1), of size

O(lgn

The data structure size is only slightly sub-quadratic for d > 2:

Another Theorem

• An implicit desire in ASP is to spot “pockets” of high population.

• Think of such a spatially correlated set as a “spatial heavy hitter”: many different formal definitions possible.

• An important concept is hierarchical heavy hitter (HHH).

– Popularized by Estan-Varghese, Graham-Muthukrishnan

– Non-redundant heavy hitters

• Ranges often form a natural hierarchy (IP addresses, time, space, etc)

• Stream of points and a (hierarchical) set of boxes.

• Report boxes whose “discounted” frequency is above threshold.
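Discounted frequency can be made concrete on a one-dimensional prefix hierarchy. In this sketch (names and data layout are assumptions), leaves are equal-length bit strings, a node's ancestors are its prefixes, and a node's discounted frequency is the mass under it that has not already been claimed by a reported descendant.

```python
from collections import defaultdict

def hierarchical_heavy_hitters(leaf_counts, threshold):
    """leaf_counts maps equal-length bit strings to counts.
    Returns {prefix: discounted frequency} for prefixes above the threshold."""
    depth = len(next(iter(leaf_counts)))
    level = dict(leaf_counts)
    result = {}
    for _ in range(depth + 1):             # leaves first, root ("") last
        parents = defaultdict(int)
        for prefix, c in level.items():
            if c > threshold:
                result[prefix] = c         # reported; its mass is discounted above
            elif prefix:
                parents[prefix[:-1]] += c  # unclaimed mass moves up one level
        level = parents
    return result
```

With leaf counts `{"000": 6, "001": 1, "010": 2, "011": 2, "100": 1, "111": 3}` and threshold 4, the result is `{"000": 6, "0": 5}`: prefix "0" has raw frequency 11, but only 5 after discounting its heavy descendant "000".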

[Figure: nested boxes A, B, C illustrating discounted frequency]

Space Complexity of HHH [HSST, PODS ‘05]

• Elegant applications to IP network monitoring, and clever algorithms by EV and CKMS

• Unlike flat heavy hitters, however, 2-sided approximation guarantees seem difficult to achieve:

– Every HHH (with discounted freq > εn) should be caught

– Every box reported must have discounted frequency > cεn

• HHH Space Theorem: Any ε-HHH algorithm in d dim with fixed accuracy factor c requires Ω(1/ε^(d+1)) memory.

[Figure: boxes A, B, C; information loss in aggregation]

Caution:

entering highly speculative zone!!!

Shape of a Point Stream [HSS, Alenex ‘06]

• What is a natural summary to describe the geometric shape of a streaming point set?

• A simple first approximation is the convex hull, which preserves basic extremal properties:

– Diameter, width, separation, containment, dist etc.

– Efficient streaming Hulls [AHV, CM, Chan, FKZ, HS].

– Max error O(Diam/r²) for summary size r


• Convex hull is a crude summary when the point stream has a richer structure, especially in the interior.

• Consider the simple example of L-shaped set.

• A powerful technique for shape extraction is α-hulls

– area left after subtracting all 1/α radius empty disks

• Unfortunately, α-hulls can have linear size and we don’t know how to build a streaming approximation.

Cluster Hulls (ALENEX ‘06)

• Generalizes the streaming convex hull algorithm to represent the shape as a collection of hulls.

• Mimics α-hull by using minimum area coverage as metric.

• It is not clustering:

– Objective is to approximate well the boundary shape of components

– 2 dimensions only

– Problems with noise

• But could be coupled with clustering.

Algorithm: ClusterHulls

• k convex hulls, H = {h1, h2, …, hk}

• A cost function w(h) = area(h) + μ·(perimeter(h))²

• Minimize w(H) = Σ w(hi)

• For each point p in sequence

– If p is inside some hi, assign p to hi without modifying hi; else create a new hull containing only p and add it to H

– If |H| > k, choose a pair hi, hj to merge into a single hull, s.t. the increase to w(H) is minimized.

– Revise the assignment of adaptive sampling directions to hulls in H to minimize the overall error.
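The main loop can be sketched with exact convex hulls. This is a simplified illustration, not the ALENEX ’06 implementation: it stores every hull vertex instead of an adaptive sample of directions (so the last step above is omitted), tries all pairs at each merge, and the function names and the default μ = 0.1 are assumptions.

```python
import math
from itertools import combinations

def cross(o, a, b):
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def area(h):
    m = len(h)
    return abs(sum(h[i][0] * h[(i + 1) % m][1] - h[(i + 1) % m][0] * h[i][1]
                   for i in range(m))) / 2

def perimeter(h):
    m = len(h)
    return sum(math.dist(h[i], h[(i + 1) % m]) for i in range(m))

def cost(h, mu):
    return area(h) + mu * perimeter(h) ** 2

def inside(h, p):
    if len(h) < 3:
        return p in h
    return all(cross(h[i], h[(i + 1) % len(h)], p) >= 0 for i in range(len(h)))

def cluster_hulls(stream, k, mu=0.1):
    hulls = []
    for p in stream:
        if any(inside(h, p) for h in hulls):
            continue                       # p is absorbed; no hull changes
        hulls.append([p])                  # new singleton hull
        if len(hulls) > k:
            def increase(pair):
                i, j = pair
                merged = convex_hull(hulls[i] + hulls[j])
                return cost(merged, mu) - cost(hulls[i], mu) - cost(hulls[j], mu)
            i, j = min(combinations(range(len(hulls)), 2), key=increase)
            merged = convex_hull(hulls[i] + hulls[j])
            hulls = [h for t, h in enumerate(hulls) if t not in (i, j)] + [merged]
    return hulls
```

On a stream made of the corners of two distant unit squares with k = 2, the loop ends with one hull per square, since any merge across the two groups adds a very large squared-perimeter term to the cost.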

Choosing the cost function

• Area only: merges pairs of points from different clusters and intersecting hulls.

• Perimeter only: favors merging of large hulls to reduce cost.

• The combined area+perimeter works well at both extremes.

Some Pictures

[Figure: input is West Nile virus data; ClusterHulls output for m = 256 and m = 512]

Why not Plain Clustering

[Figure: ClusterHulls with m = 45 compared against k-median and CURE, each with k = 5 and k = 45]

Extreme Examples

Early choices can be fatal. Recover by discarding sparse CHs.

Process points in rounds whose length doubles each time.

Discard hulls h whose count(h) or density(h) = count(h)/area(h) is small.

On these extreme examples, most clustering algs fail

[Figure panels: Input, ClusterHulls, Period-doubling Cleanup]

Conclusions, Open Problems

• Is ClusterHull a good idea?

– Too early to tell. The problem seems interesting.

• Open theoretical questions:

– Complexity of covering a set of points with convex polygons: at most k vertices, minimize the area.

– Covering by rectangles (arbitrarily oriented).

– Streaming versions?

• Other notions of stream shape.

• Space-efficient streaming range searching.

Danke schön! (Thank you!)

The Lower Bound in 1-D

• r intervals of length 2 each (call them literals)

• Union of the r intervals is B.

• Each interval split into two unit length sub-intervals.

• If stream points fall in the left (resp. right) subinterval, we say the literal has orientation 0 (resp. 1).

[Figure: the interval B of length 2r, made of r literals, each with a left half (0) and a right half (1)]

The Construction

• Stream arrives in 2 phases.

• In 1st phase: Put 3εN/r points in each interval, either in left or right half.

• In 2nd phase: Adversary chooses either the left or right half of each interval and puts εN points in it. Call these sub-intervals sticks.

• Heavy hitters:

– Each stick is an ε-HHH

– Discounted frequency of B (the union interval) depends on literals whose orientations in 1st and 2nd phase differ

• Algorithms must keep track of Ω(r) orientations after 1st phase

The Lower Bound

• Suppose an algorithm A uses < 0.01r bits of space.

– After phase 1, orientations of the r literals encoded in 0.01r bits.

– There are 2^r distinct orientations

– Two orientations that differ in at least r/3 literals map to the same (0.01r)-bit code ==> indistinguishable to A.

• If orientations in 1st and 2nd phase are same, frequency of B = 0, not a HHH.

• If r/3 literals differ, frequency of B = r/3 * 3εN/r = εN, so B is an ε-HHH

• A misclassifies B in one sequence.
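The step claiming that two far-apart orientations share a memory state follows from a counting argument, sketched here. The notation is introduced for illustration: d_H is Hamming distance between orientation vectors and H is the binary entropy function.

```latex
% r literals, each with orientation 0 or 1; the algorithm keeps fewer than 0.01 r bits.
\begin{align*}
\#\text{orientations} = 2^{r}, \qquad
\#\text{memory states} \le 2^{0.01 r}
\;&\Longrightarrow\;
\text{some state is shared by at least } 2^{0.99 r} \text{ orientations.} \\
\bigl|\{\, y : d_H(x, y) < r/3 \,\}\bigr|
\;\le\; \sum_{i \le r/3} \binom{r}{i}
\;&\le\; 2^{H(1/3)\, r} \;\approx\; 2^{0.92 r} \;<\; 2^{0.99 r}, \\
H(p) \;&=\; -p \log_2 p - (1-p)\log_2(1-p).
\end{align*}
% The shared set is too large to fit inside a Hamming ball of radius r/3 around
% any of its members, so it contains two orientations differing in >= r/3 literals.
```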

Completing the Lower Bound

• Make r independent copies of the construction

• Use only one of them to complete the construction in the 2nd phase

• Need Ω(r²) bits to track all orientations

• For r = 1/(4ε), this gives an Ω(1/ε²) lower bound

Multi-dimensional lower bound

• The 1-D lower bound is information-theoretic; applies to all algorithms.

• For higher dimensions, need a more restrictive model of algorithms.

• Box Counter Model.

– Algorithm with memory m has m counters

– These counters maintain frequency of boxes

– All deterministic heavy hitter algorithms fit this model

• In the box counter model, finding ε-HHH in d-dim with any fixed approximation requires Ω(1/ε^(d+1)) memory

2D (Multi-Dim) Construction

• A box B and a set of descendants.

• B has side length 2r.

• 1st phase

– 2x2 (literal) boxes in upper left quadrant (orientation 0 or 1)

• 2nd phase

– Diagonal: boxes in upper left quadrant; all orientation 0

– Sticks: 1xr (or rx1) boxes

– Uniform: lower right quadrant

[Figure: box of side 2r showing a stick, literals (orientation 0 or 1), the Uniform quadrant, and the Diagonal boxes]

Multi-dimensional lower bound

• Intuition:

– Each stick combines with a diagonal box to form a skinny ε-HHH box

– Diagonal boxes pair-up to form ε-HHH

• Skinny boxes form a checker-board pattern in upper left quadrant

– Each literal is either fully covered or half covered

• As in 1-D, adversary picks sticks

• Discounted frequency of B has

– Half covered literals and

– Points in the Uniform quadrant

[Figure: box of side 2r with sticks, Diagonal boxes, and the Uniform quadrant; literals are either fully covered or half covered]

The Lower Bound

• The algorithm must remember the Ω(r²) literal orientations.

• Otherwise, it cannot distinguish between two sequences, where discounted frequency of B is m or 3m/2, resp. (for m = 20/29 N).

• Like before, by making r copies of the construction, we get the lower bound of Ω(r³).

• The basic construction generalized to d dimensions.

• Adjusting the hierarchy to get lower bound for any arbitrary approximation