
Fishing for Patterns in (Shallow) Geometric Streams

Subhash Suri

UC Santa Barbara and ETH Zurich

IIT Kanpur Workshop on Data Streams Dec 18-20

Geometric Streams

• A stream of points (dim 1, 2, 3, …)

• Abstract view of multi-attribute data:

– IP packets, database transactions, geographic sensor data, processor instruction stream etc.

[Figure: example streams, with panels labeled DDoS, Worm, IP Traffic (NetViewer), Code profiling, and Sensor eScan]

Shape of a Point Stream

• Form informs about the function

• Identifying visually interesting patterns (“shape”) of a point stream

– Areas of high density (hot spots).

– Large empty areas (cold spots).

– Population estimates of geometric ranges

– A geometric summary of the distribution of the stream.

• Deliberately vague and ill-posed; some specifics later.

Outline

• No attempt to survey

• Adaptive Spatial Partitioning

– Generic summary structure (Algorithmica ‘06)

– Q-Digest: sensornet data aggregation (SenSys’04)

– Range Adaptive Profiling for Programs (CGO ‘06)

• Specialized geometric patterns and queries:

– Range queries (SoCG ‘04)

– Hierarchical Heavy-hitters (PODS ‘05)

• Shape of the stream: ClusterHulls (ALENEX ‘06)

• Conclusions

Adaptive Spatial Partitioning

• A subdivision of space into square cells.

• Each cell maintains O(1) size info, essentially count of points in it.

• Tension between coverage and precision:

Large cells cover a lot, but with poor precision

Small cells have good precision, but poor coverage

• Dynamically adapt the subdivision to the distribution of points in the stream.

• Adaptive zoom: more precision (cells) where the action is, and fewer elsewhere.

• [HSST], ISAAC ‘04, Algorithmica ‘06

ASP Structure

• Data structure size is function of accuracy parameter ε

• Initially, a single box (L × L), and its counter.

• When the count of a box b > εn

– Freeze b’s counter

– Split b into 4 sub-boxes

– Introduce a new counter for each sub-box

• This hierarchically defined structure of boxes (a streaming quad tree) is our ASP.

Adaptivity: Refine and Unrefine

• The structure must adapt to the changing distribution of points:

– New regions become heavy

– Previously heavy regions may become light/cold.

• Refine operation puts new counters where the action is increasing:

• Stream Processing: for each item x

– Locate the smallest box v containing x, increment its count

– Refine: if count of v > εn

– Split v into 4 child sub-boxes, each with a new counter, initialized to 0.

– Old counter of v frozen.
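
The stream-processing loop above can be sketched as a small quadtree. This is a minimal illustration, not the authors' implementation: the class names `Box` and `ASP`, the `min_size` resolution limit, and the half-open cell convention are assumptions made for the example, and only the refine step is shown.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Box:
    """Square cell [x, x+size) x [y, y+size) with a single counter."""
    x: float
    y: float
    size: float
    count: int = 0
    children: Optional[List["Box"]] = None  # None while the box is undivided


class ASP:
    def __init__(self, side: float, eps: float, min_size: float = 1.0):
        self.root = Box(0.0, 0.0, side)
        self.eps = eps
        self.min_size = min_size  # assumed resolution limit (not from the slides)
        self.n = 0                # number of stream items seen so far

    def insert(self, px: float, py: float) -> None:
        self.n += 1
        v = self.root
        while v.children is not None:  # locate the smallest box containing the point
            half = v.size / 2
            col = 1 if px >= v.x + half else 0
            row = 1 if py >= v.y + half else 0
            v = v.children[2 * row + col]
        v.count += 1
        # Refine: split v once its count exceeds eps * n; v's counter is then frozen
        # because later points descend into the children.
        if v.count > self.eps * self.n and v.size > self.min_size:
            half = v.size / 2
            v.children = [Box(v.x + dx * half, v.y + dy * half, half)
                          for dy in (0, 1) for dx in (0, 1)]

    def boxes(self) -> Iterator[Box]:
        """Yield every box, frozen or live."""
        stack = [self.root]
        while stack:
            v = stack.pop()
            yield v
            stack.extend(v.children or [])
```

Each point increments exactly one counter, so the counts over all boxes always sum to the stream length.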

[Figure: Refine operation]

Unrefine Operation

• To conserve memory, boxes with low counts must be deleted.

• A previously heavy box may become light because n, the size of the stream, has increased, and so its count is below εn.

• Unrefine: if count of box v and its children < εn/2

– Delete the children boxes and

– Add their counts to count of v

– (v’s old counter revived)

• Refinement occurs only at the node of a new insert; unrefinement can occur anywhere (non-locality).

• A heap for fast unrefine ops.
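
A minimal sketch of the unrefine rule, assuming a plain tree of counters. The `Node` class and the full bottom-up pass are illustrative choices: the slides use a heap to find unrefine candidates quickly, which this sketch does not attempt, and it only collapses a node whose children are all undivided.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    count: int = 0
    children: List["Node"] = field(default_factory=list)


def unrefine(v: Node, eps: float, n: int) -> None:
    """Collapse light subtrees bottom-up, given the current stream size n."""
    for c in v.children:
        unrefine(c, eps, n)
    # Only a node whose children are all undivided can absorb them.
    if v.children and all(not c.children for c in v.children):
        total = v.count + sum(c.count for c in v.children)
        if total < eps * n / 2:
            v.count = total   # v's old counter is revived with the merged count
            v.children = []   # the children boxes are deleted
```

A full pass costs time proportional to the tree size, which is why a priority structure over candidate nodes is attractive in practice.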

[Figure: Unrefine operation]

The Data Structure

• ASP represented as a 4-ary tree

[Figure: ASP-tree]

Analysis of ASP

• (Space Bound):

– For each node v, the count of v, its siblings, and parent > εn/2

– Total number of boxes at most O(1/ε)

• (Per-point Processing Time):

– Naïve will be O(lg L): tree height

– With a heap and a centroid tree, (amortized) time O(lg (1/ε))

• (Count Bound):

– Each point counted in exactly one box

– Points contained in a box b are counted at b or one of its ancestors

– Depth of the tree by the binary partitioning rule is O(log L)

– Error in a leaf’s count is O(εn*log L).

– Using memory = O(1/ε * log L), the count error bounded by εn.
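
The last two bullets follow from a rescaling of ε. As a rough worked version, assuming each frozen ancestor counter holds O(εn) points and writing anc(b) for the ancestors of box b (notation introduced here for illustration):

```latex
\begin{align*}
\mathrm{err}(b) &\le \sum_{a \,\in\, \mathrm{anc}(b)} \mathrm{count}(a)
  \;\le\; \mathrm{depth}(b)\cdot O(\varepsilon n) \;=\; O(\varepsilon n \log L), \\
\varepsilon' &= \frac{\varepsilon}{\log L}
  \;\Longrightarrow\;
  \mathrm{err}(b) = O(\varepsilon' n \log L) = O(\varepsilon n),
  \qquad
  \text{memory} = O\!\left(\frac{1}{\varepsilon'}\right)
               = O\!\left(\frac{1}{\varepsilon}\log L\right).
\end{align*}
```

In words: running the structure with the smaller parameter ε' costs a log L factor in space and brings the count error back down to εn.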


Spatial Summary

• A partition into O(1/ε * log L) boxes, with auto-adaptive zoom.

• No undivided box has more than εn points: only leaf nodes can.

• Gives a qualitative summary of the stream’s spatial distribution: a visual sense of hot and cold regions.

Two applications and two theorems

• Data aggregation in sensor networks

– Distributed version of ASP structure

• Code profiling in processor streams

– Hardware implementation of ASP

• Theoretical bounds for range searching

– Worst-case guarantees for rectangle range searching

• Lower bounds on hierarchical heavy hitters

– Space complexity

Geometric Summaries in Sensornets

• Self-organizing networks of tiny, cheap sensors,

– Integrated sensing, computing, radio communication,

– Continuous, real-time monitoring of remote, hard to reach areas.

– Limited power (battery), bandwidth, memory.

• Communication typically the biggest drain on energy

• Perform as much local processing as possible, and transmit smart summaries.

• Similar to synopses: distributed data, rather than one-pass.

• Active area: in-network aggregation, compressed sensing.

Distributed ASP

• Q-Digest: an approximate histogram

– Shrivastava, Buragohain, Agrawal, S. [SenSys ‘04]

• ASP for 1-dim data signal (measurements of sensors): vibration data, acoustics, toxin levels, etc.

• Going beyond min, max, or average, and approximating quantiles.

• Sensors form an aggregation tree, rooted at base station.

• Data flows from leaves to the base station, always reduced to a summary of size K (a user parameter).

• The key point is that ASP is efficiently mergeable:

– Given q-digests of children, a node can compute the merged q-digest.

• Space/quality bounds of ASP carry over.
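
To illustrate mergeability, here is a hedged sketch of merging two 1-dim digests over a universe [0, σ), with σ a power of two. The representation is an assumption made for the example: a dict from heap-style node ids (root 1, children 2i and 2i+1, leaf for value x at id σ + x) to counts, with `k` as the compression parameter. The merge rule (fold two siblings into their parent when the three counts together are at most ⌊n/k⌋) follows the usual q-digest description; the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def from_values(values: Iterable[int], sigma: int) -> Dict[int, int]:
    """Exact digest of raw readings: one counter per distinct leaf."""
    return dict(Counter(sigma + x for x in values))


def compress(digest: Dict[int, int], k: int) -> Dict[int, int]:
    n = sum(digest.values())
    limit = n // k
    out = {v: c for v, c in digest.items() if c > 0}
    if not out:
        return out
    # Visit sibling pairs bottom-up; odd ids are right children.
    for right in range(max(out) | 1, 1, -2):
        left, parent = right - 1, right // 2
        total = out.get(left, 0) + out.get(right, 0) + out.get(parent, 0)
        if (left in out or right in out) and total <= limit:
            out[parent] = total          # fold both children into the parent
            out.pop(left, None)
            out.pop(right, None)
    return out


def merge_digests(a: Dict[int, int], b: Dict[int, int], k: int) -> Dict[int, int]:
    merged = Counter(a)
    merged.update(b)                     # add counts of identical ranges
    return compress(dict(merged), k)
```

The scan over all sibling ids is linear in σ, which is fine for a sketch; a node in a real network would only touch the ranges present in the two digests.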

[Figure: aggregation tree rooted at the Base Station]

A simulation

8000 sensors, each generating a 2-byte integer (Death Valley elevation data)

Rank error (true - estimated):

– < 5% with a 160-byte Q-Digest

– < 2% with a 400-byte Q-Digest

Code Profiling

• Stream of program instructions

• Profiling: Understand code behavior

– Access patterns, cache behavior, load value distributions

– Example: which program segments are hot, and how hot?

• Challenges

– Large item space: programs with 1M basic blocks

– Profiling should take little space and add little overhead

– ASP adaptation to profile high frequency code segments

[Figure: instruction stream, a repeated sequence of the form]

    push %ebp
    mov %esp,%ebp
    sub $0x38,%esp
    and $0xfffffff0,%esp
    mov $0x0,%eax
    sub %eax,%esp
    sub $0x8,%esp
    push $0x28
    push $0x8048468
    call 80482b0
    add $0x10,%esp

[Figure: code basic blocks, labeled Frequent and Rare]

Range Adaptive Profiling [CGO ‘06]

• Small fixed memory (counters)

• Dynamically zoom onto high frequency code segments.

• 1d adaptation of ASP with various “optimizations” to reduce memory and processing time.

– Lot of constants squeezing

– Batching of unrefinements

– Branching factor choices

• Design specs for specialized hardware for profiling (www.cs.ucsb.edu/~arch/rap)

[Figure labels: Cold range, Hot range]

Range Adaptive Profiling

• Use RAP to estimate frequency of arbitrary ranges.

– Count errors due to not splitting early enough

– Regions undergoing hot/cold spells

• Typical performance: 8K memory sufficient for 97% accuracy.
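
One way to read the range-estimation bullet: counters that lie entirely inside the query contribute with certainty, while counters on ranges that straddle the query boundary (for example ancestors that were not split early enough) are the source of error. The sketch below assumes counters are given as half-open `(lo, hi, count)` triples; that layout and the function name are illustrative, not taken from the RAP design.

```python
from typing import Iterable, Tuple


def estimate_range(counters: Iterable[Tuple[int, int, int]],
                   a: int, b: int) -> Tuple[int, int]:
    """Return (lower, upper) bounds on the number of items in [a, b)."""
    lower = 0
    straddling = 0
    for lo, hi, count in counters:
        if a <= lo and hi <= b:
            lower += count          # counter range fully inside the query
        elif lo < b and a < hi:
            straddling += count     # partial overlap: some of these may be inside
    return lower, lower + straddling


# Counters for ranges [0,16), [0,8), [8,16), [8,12); query [8,16).
print(estimate_range([(0, 16, 5), (0, 8, 3), (8, 16, 10), (8, 12, 7)], 8, 16))
```

The gap between the two bounds is exactly the mass held by straddling counters, which adaptive splitting tries to keep small.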

[Figure: percent error (y-axis, 0 to 3) on SPEC benchmarks gcc, gzip, mcf, parser, vortex, vpr, and their average]

Yes, but….

That’s well and nice in practice,

but how does it work in theory!

Range Searching in Streams

• A stream of k-dimensional points.

– Summary to approximate counts of geometric ranges.

• VC dimension, ε-nets and ε-approximations.

• “Nice” geometric ranges have small (bounded) VC dimension: e.g. rectangles, balls, half planes etc.

ε-approximation Theorem: For every range space (X, R) of fixed VC dim, there exists a subset A of X of size O((1/ε²) lg (1/ε)) s.t. for every range r in R, | |r ∩ X|/|X| - |r ∩ A|/|A| | ≤ ε.

• Iceberg error (εn) unavoidable
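
For intuition, a uniform random sample is the standard randomized route to an ε-approximation with high probability. The sketch below maintains such a sample over a stream with reservoir sampling and scales sample hits to estimate a rectangle count. It is a generic illustration, not the data structure from the talk; the class name, the sample size `m`, and the seed are assumptions of the example.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


class ReservoirSample:
    def __init__(self, m: int, seed: int = 0):
        self.m = m                      # sample size (assumed parameter)
        self.n = 0                      # stream length so far
        self.sample: List[Point] = []
        self.rng = random.Random(seed)

    def add(self, p: Point) -> None:
        self.n += 1
        if len(self.sample) < self.m:
            self.sample.append(p)
        else:
            j = self.rng.randrange(self.n)   # uniform in [0, n)
            if j < self.m:                   # keep p with probability m / n
                self.sample[j] = p

    def estimate(self, x0: float, y0: float, x1: float, y1: float) -> float:
        """Estimated number of stream points in the rectangle [x0,x1] x [y0,y1]."""
        hits = sum(1 for (x, y) in self.sample
                   if x0 <= x <= x1 and y0 <= y <= y1)
        return hits * self.n / len(self.sample)
```

The error is additive in n, matching the εn term above: ranges holding far fewer than εn points cannot be estimated reliably from the sample.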

ε-Approximations: challenges

• Large summary size: on the order of ε⁻²

– Would prefer O(1/ε)

• ε-nets are small but can't estimate ranges

• Deterministic construction a space hog.

• The best streaming algorithm for ε-approximation requires working space O((1/ε)^{d+1} lg^{O(d+1)} n) [BCEG ‘04]

Some Theorems [STZ, SoCG ‘04]

• Deterministic Multipass:

With d passes over data, can build a deterministic data structure for rectangular queries of size

O((1/ε) lg^{2d-2} (1/ε)).

• Randomized Single pass:

A data structure for rectangular range queries in 2d with error at most εn, with prob > 1 - o(1), of size

O(lgn

The data structure size is only slightly sub-quadratic for d > 2:

Another Theorem

• An implicit desire in ASP is to spot “pockets” of high population.

• Think of such a spatially correlated set as a “spatial heavy hitter”: many different formal definitions possible.

• An important concept is hierarchical heavy hitter (HHH).

– Popularized by Estan-Varghese, Graham-Muthukrishnan

– Non-redundant heavy hitters

• Ranges often form a natural hierarchy (IP addresses, time, space, etc)

• Stream of points and a (hierarchical) set of boxes.

• Report boxes whose “discounted” frequency is above threshold.
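
A small sketch of discounted frequency on a 1-dim binary hierarchy, under assumptions chosen for illustration: values lie in [0, 2^depth), nodes use heap-style ids (root 1, children 2i and 2i+1), and a node's discounted frequency is the count in its range that is not already claimed by a reported descendant. The function name and the strict "above threshold" test are choices of this example.

```python
from typing import Dict


def hierarchical_heavy_hitters(leaf_counts: Dict[int, int],
                               depth: int,
                               threshold: float) -> Dict[int, int]:
    """Map each reported node id to its discounted frequency.

    leaf_counts is keyed by leaf id, i.e. 2**depth + value.
    """
    residual = dict(leaf_counts)   # mass not yet claimed by a reported node
    hhh: Dict[int, int] = {}
    for level in range(depth, -1, -1):               # bottom-up, level by level
        for node in range(2 ** level, 2 ** (level + 1)):
            f = residual.pop(node, 0)
            if f == 0:
                continue
            if f > threshold:
                hhh[node] = f                         # report; passes nothing up
            elif node > 1:
                parent = node // 2
                residual[parent] = residual.get(parent, 0) + f
    return hhh


# Values 0..7 with counts 6, 2, 3, 0, 0, 0, 2, 2 and threshold 4.
counts = {8: 6, 9: 2, 10: 3, 14: 2, 15: 2}
print(hierarchical_heavy_hitters(counts, depth=3, threshold=4))
```

In this example node 2 (values 0 to 3) holds 11 points in total, but its discounted frequency is 5 because the 6 points at value 0 are already reported at leaf 8.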

[Figure: discounted frequency, with boxes A, B, and C]

Space Complexity of HHH [HSST, PODS ‘05]

• Elegant applications to IP network monitoring, and clever algorithms by EV and CKMS

• Unlike flat heavy hitters, however, 2-sided approximation guarantees seem difficult to achieve:

– Every HHH (with discounted freq > εn) should be caught

– Every box reported must have discounted frequency > cεn

• HHH Space Theorem: Any ε-HHH algorithm in d dim with fixed accuracy factor c requires Ω(1/ε^{d+1}) memory.

[Figure: information loss in aggregation, with boxes A, B, and C]

Shape of a Point Stream

Caution:

entering highly speculative zone!!!

Shape of a Point Stream [HSS, Alenex ‘06]

• What is a natural summary to describe the geometric shape of a streaming point set?

• A simple first approximation is the convex hull, which preserves basic extremal properties:

– Diameter, width, separation, containment, dist etc.

– Efficient streaming Hulls [AHV, CM, Chan, FKZ, HS].

– Max error O(Diam/r²) for summary size r
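
As a simple point of comparison, a streaming hull summary can keep the extreme point in each of r fixed, evenly spaced directions. The sketch below shows only this uniform-direction variant; it is not the adaptive-sampling scheme behind the bound quoted above, and the class name is hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


class DirectionalHull:
    def __init__(self, r: int):
        self.dirs = [(math.cos(2 * math.pi * i / r), math.sin(2 * math.pi * i / r))
                     for i in range(r)]
        self.extreme: List[Optional[Point]] = [None] * r

    def add(self, p: Point) -> None:
        """Update the extreme point in every sampled direction."""
        for i, (dx, dy) in enumerate(self.dirs):
            q = self.extreme[i]
            if q is None or p[0] * dx + p[1] * dy > q[0] * dx + q[1] * dy:
                self.extreme[i] = p

    def summary(self) -> List[Point]:
        """Distinct extreme points, in direction order."""
        out: List[Point] = []
        for q in self.extreme:
            if q is not None and q not in out:
                out.append(q)
        return out
```

The summary never exceeds r points, and every stored point is a true hull vertex candidate seen in the stream.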

Shape of a Point Stream

• Convex hull is a crude summary when the point stream has a richer structure, especially in the interior.

• Consider the simple example of L-shaped set.

• A powerful technique for shape extraction is α-hulls

– area left after subtracting all empty disks of radius 1/α

• Unfortunately, α-hulls can have linear size and we don’t know how to build a streaming approximation.

Cluster Hulls (ALENEX ‘06)

• Generalizes the streaming convex hull algorithm to represent the shape as a collection of hulls.

• Mimics α-hull by using minimum area coverage as metric.

• It is not clustering:

– Objective is to approximate well the boundary shape of components

– 2 dimensions only

– Problems with noise

• But could be coupled with clustering.

Algorithm: ClusterHulls

• k convex hulls, H = {h1, h2,… hk}

• A cost function w(h) = area(h) + μ · (perimeter(h))²

• Minimize w(H) = Σw(hi)

• For each point p in sequence

• If p inside an hi, assign p to hi without modifying hi

else create a new hull containing only p; add it to H

• If |H| > k, choose a pair hi, hj to merge into a single hull, s.t. the increase to w(H) is minimized.

• Revise the assignment of adaptive sampling directions to hulls in H to minimize the overall error.
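
The loop above can be sketched in a few lines. This is a simplified illustration under stated assumptions: hulls are stored exactly (the talk uses approximate streaming hulls with adaptive sampling directions, and that last step is omitted here), the merge is chosen by brute force over all pairs, and the default `mu` is an arbitrary choice.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain; returns vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    upper: List[Point] = []
    for chain, seq in ((lower, pts), (upper, reversed(pts))):
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
    return lower[:-1] + upper[:-1]


def cost(h: List[Point], mu: float) -> float:
    """w(h) = area(h) + mu * perimeter(h)^2."""
    k = len(h)
    area = abs(sum(h[i][0] * h[(i + 1) % k][1] - h[(i + 1) % k][0] * h[i][1]
                   for i in range(k))) / 2
    perimeter = sum(math.dist(h[i], h[(i + 1) % k]) for i in range(k))
    return area + mu * perimeter ** 2


def contains(h: List[Point], p: Point) -> bool:
    if len(h) < 3:
        return p in h
    return all(cross(h[i], h[(i + 1) % len(h)], p) >= 0 for i in range(len(h)))


def cluster_hulls(points: List[Point], k: int, mu: float = 0.1) -> List[List[Point]]:
    hulls: List[List[Point]] = []
    for p in points:
        if any(contains(h, p) for h in hulls):
            continue                      # p is assigned to a hull that already covers it
        hulls.append([p])
        if len(hulls) > k:
            def increase(ij):
                i, j = ij
                return (cost(convex_hull(hulls[i] + hulls[j]), mu)
                        - cost(hulls[i], mu) - cost(hulls[j], mu))
            i, j = min(combinations(range(len(hulls)), 2), key=increase)
            merged = convex_hull(hulls[i] + hulls[j])
            hulls = [h for t, h in enumerate(hulls) if t not in (i, j)] + [merged]
    return hulls
```

On two well-separated unit squares fed in interleaved order with k = 2, the combined cost keeps the two squares apart, since any cross-cluster merge pays a large perimeter penalty.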

Choosing the cost function

• Area only: merges pairs of points from different clusters and intersecting hulls.

• Perimeter only: favors merging of large hulls to reduce cost.

• The combined area+perimeter works well at both extremes.

Some Pictures

[Figure: input (West Nile Virus data) and ClusterHulls output for m = 256 and m = 512]

Why not Plain Clustering

[Figure: ClusterHulls with m = 45, compared with k-median (k = 5 and k = 45) and CURE (k = 5 and k = 45)]

Extreme Examples

Early choices can be fatal. Recover by discarding sparse CHs.

Process points in rounds whose length doubles each time.

Discard hulls h whose count(h) or density(h) = count(h)/area(h) is small.

On these extreme examples, most clustering algs fail

[Figure panels: Input; ClusterHulls; Period-doubling Cleanup]

Conclusions, Open Problems

• Is ClusterHull a good idea?

– Too early to tell. The problem seems interesting.

• Open theoretical questions:

– Complexity of covering a set of points with convex polygons: at most k vertices, minimize the area.

– Covering by rectangles (arbitrarily oriented).

– Streaming versions?

• Other notions of stream shape.

• Space-efficient streaming range searching.

Thank you!

The Lower Bound in 1-D

• r intervals of length 2 each (call them literals)

• Union of the r intervals is B.

• Each interval split into two unit length sub-intervals.

• If stream points fall in the left (resp. right) subinterval, we say the literal has orientation 0 (resp. 1).

[Figure: interval B of length 2r, made up of r literals, each with halves labeled 0 and 1]

The Construction

• Stream arrives in 2 phases.

• In 1st phase: Put 3εN/r points in each interval, either in the left or right half.

• In 2nd phase: Adversary chooses either the left or right half of each interval and puts εN points there. Call these sub-intervals sticks.

• Heavy hitters:

– Each stick is an ε-HHH

– Discounted frequency of B (the union interval) depends on literals whose orientations in 1st and 2nd phase differ

• Algorithms must keep track of Ω(r) orientations after 1st phase

The Lower Bound

• Suppose an algorithm A uses < 0.01r bits of space.

– After phase 1, orientations of the r literals encoded in 0.01r bits.

– There are 2^r distinct orientations

– Two orientations that differ in at least r/3 literals map to the same (0.01r)-bit code ==> indistinguishable to A.

• If orientations in 1st and 2nd phase are the same, discounted frequency of B = 0, so B is not an HHH.

• If r/3 literals differ, discounted frequency of B = r/3 * 3εN/r = εN, so B is an ε-HHH

• A misclassifies B in at least one of the two sequences.

Completing the Lower Bound

• Make r independent copies of the construction

• Use only one of them to complete the construction in the 2nd phase

• Need Ω(r²) bits to track all orientations

• For r = 1/(4ε), this gives an Ω(ε⁻²) lower bound
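
A short arithmetic check of the 1-D construction, reading the per-interval load in the first phase as 3εN/r, each stick as εN points, and r = 1/(4ε) (the ε factors are assumed here, since the extracted text drops them):

```latex
\begin{align*}
\text{discounted frequency of } B
  &= \frac{r}{3}\cdot\frac{3\varepsilon N}{r} = \varepsilon N, \\
\text{phase 1 points (all } r \text{ copies)}
  &= r \cdot r \cdot \frac{3\varepsilon N}{r} = 3\varepsilon N r = \frac{3N}{4}, \\
\text{phase 2 points (one copy)}
  &= r \cdot \varepsilon N = \frac{N}{4}, \\
\Omega(r^{2})
  &= \Omega\!\left(\frac{1}{16\,\varepsilon^{2}}\right) = \Omega(\varepsilon^{-2}).
\end{align*}
```

Under this reading the two phases together use exactly N points, which is consistent with the choice of r.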

[Figure: r copies of the interval B, each of length 2r]

Multi-dimensional lower bound

• The 1-D lower bound is information-theoretic; applies to all algorithms.

• For higher dimensions, need a more restrictive model of algorithms.

• Box Counter Model.

– Algorithm with memory m has m counters

– These counters maintain frequency of boxes

– All deterministic heavy hitter algorithms fit this model

• In the box counter model, finding ε-HHH in d dimensions with any fixed approximation requires Ω(1/ε^{d+1}) memory

2D (Multi-Dim) Construction

• A box B and a set of descendants.

• B has side length 2r.

• 1st phase

– 2 × 2 (literal) boxes in upper left quadrant (orientation 0 or 1)

• 2nd phase

– Diagonal: boxes in upper left quadrant; all orientation 0

– Sticks: 1 × r (or r × 1) boxes

– Uniform: lower right quadrant

[Figure: box B of side 2r, showing a stick, literals (orientation 0 or 1), the Diagonal boxes, and the Uniform quadrant]

Multi-dimensional lower bound

• Intuition:

– Each stick combines with a diagonal box to form a skinny ε-HHH box

– Diagonal boxes pair up to form ε-HHH

• Skinny boxes form a checker-board pattern in upper left quadrant

– Each literal is either fully covered or half covered

• As in 1-D, adversary picks sticks

• Discounted frequency of B consists of

– Half covered literals and

– Points in the Uniform quadrant

[Figure: box B of side 2r with a stick, the Diagonal, the Uniform quadrant, and fully covered and half covered literals]

The Lower Bound

• The algorithm must remember the Ω(r²) literal orientations.

• Otherwise, it cannot distinguish between two sequences, where discounted frequency of B is m or 3m/2, resp. (for m = 20/29 N).

• Like before, by making r copies of the construction, we get the lower bound of Ω(r³).

• The basic construction generalized to d dimensions.

• Adjusting the hierarchy to get lower bound for any arbitrary approximation
