Sketch-based Querying of Distributed Sliding-window Data Streams

Odysseas Papapetrou, Minos Garofalakis, Antonios Deligiannakis
SoftNet laboratory, Technical University of Crete, Greece

#1

Streams and sliding windows

Querying of distributed sliding-window data streams

  • Distributed: many nodes/peers, many streams, aggregate statistics
      • Cannot afford to centralize all data
  • Sliding windows: only interested in recent data
      • Arrival-based model: account for the last X items
      • Time-based model: account for the items arriving in the last X minutes
  • Data streams: high-dimensional
      • Maintain occurrences of IP addresses
      • Maintain term frequencies in textual streams (e.g., emails)
  • Small space/time

#2

Motivation example: Monitoring network packet traffic

Monitor the distribution of packet traffic over IP addresses

Challenge 1: Local statistics
  • Compactly and efficiently maintain the IP address frequencies
  • Sliding window: use only recent packets, e.g., those of the last hour
  • Queries with multiple sliding-window lengths!

Challenge 2: How to aggregate local statistics to get the global statistics

Local statistics / Global statistics

[Figure: each node n1, n2, ..., n8, ..., nj maintains a local table of (ip, freq.) pairs; the local tables are aggregated into a global (ip, freq.) table.]

#3

Solution desiderata

Need a method/data structure to maintain the (local) stream statistics:
  • Ability to handle sliding windows of arbitrary length
  • Fast
      • Up to 10 million network packets per second
  • Small memory footprint
      • Routers: MB of memory
  • Network-efficient
      • Local statistics exchanged over the network
  • Composable
      • Aggregation of local statistics to derive global statistics

Our direction
  • Trade off statistics accuracy for efficiency (memory, network)
  • Sketches: lossy summarizations of data streams

#4

Count-min sketches [Cormode, Muthukrishnan 05]

  • Generic sketch for maintaining frequencies, frequency moments, etc.
  • An array of w x d counters
  • Each row i is associated with a hash function h_i with range [1, w]

[Figure: an array of counters (d hash functions, w counters per row), all initially 0. Adding x applies +1 to one counter per row, e.g., h1(x) = 7, h2(x) = 1, h3(x) = 4, h4(x) = 6. Stream shown: x, 10z, y, x, 20y, 3k]

Example: the stream items x, y, z can correspond to IP addresses
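The update step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the salted SHA-256 hash family, the 0-based hash range [0, w) and the example stream are all assumptions made here.

```python
import hashlib


class CountMinSketch:
    """d rows of w counters; row i has its own hash function h_i."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.counters = [[0] * w for _ in range(d)]

    def _h(self, i, key):
        # Assumed hash family: SHA-256 salted with the row index.
        # Returns a column in [0, w) rather than the slide's [1, w].
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, key, count=1):
        # Exactly one counter per row is incremented.
        for i in range(self.d):
            self.counters[i][self._h(i, key)] += count


# Hypothetical stream of (item, multiplicity) pairs.
cms = CountMinSketch(w=10, d=4)
for key, count in [("x", 10), ("z", 1), ("y", 1), ("x", 20), ("y", 3)]:
    cms.add(key, count)
```

Because every update touches one counter in each row, each row always sums to the total number of items added so far.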

#5

Estimating the frequency (point queries)

[Figure: a count-min sketch (d hash functions, w counters) with populated counters. Example: query x.]

  • The estimate can overestimate the true frequency due to hash collisions
  • Error is relative to the stream size
  • Also enables inner join and self join queries!
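For reference, the standard count-min point-query estimator and its guarantee from [Cormode, Muthukrishnan 05] can be written as below. The notation is introduced here and is not taken from the slide: C is the counter array, f(x) the true frequency of x, ‖a‖₁ the stream size, and ε, δ the accuracy and failure-probability parameters.

```latex
\hat{f}(x) = \min_{1 \le i \le d} C[i, h_i(x)]
\qquad\text{with}\qquad
w = \lceil e/\varepsilon \rceil, \quad d = \lceil \ln(1/\delta) \rceil

f(x) \le \hat{f}(x) \quad\text{(always, for non-negative updates)}

\hat{f}(x) \le f(x) + \varepsilon \lVert a \rVert_1
\quad\text{with probability at least } 1 - \delta
```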

#6

Sliding windows

But: sketches do not support sliding windows

Several sliding window structures have been proposed
  • Exponential histograms, deterministic waves, randomized waves, ...
  • Only simple statistics, e.g., counting the number of one-bits over sliding windows

This work: combine count-min sketches with sliding window structures

[Figure: a bit stream over time (100101101110101010111...0101101010101010) with the window to monitor marked.]

#7

Exponential histograms [Datar et al. 02]

Exponential histograms (and deterministic waves)
  • Key idea
      • Break the sliding window range into non-overlapping buckets of exponentially increasing sizes
      • Use these buckets for maintaining and estimating the aggregates
      • E.g., time 1-27: 8 one-bits arrived; time 27-35: 4 one-bits; ...
  • Query execution: sum only the buckets in the query range, counting the last bucket at half of its weight

[Figure: buckets b1 to b5 holding 8, 4, 2, 1 and 1 one-bits, with boundaries at times 1, 27, 35, 42, 47 and 51.]

  • Bucket information
      • Ending time
      • Number of one-bits
  • Required memory: [formula not recoverable from the extracted text]
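The bucket mechanics above can be illustrated with a simplified exponential histogram in Python. This is a sketch under stated assumptions rather than the exact structure of [Datar et al. 02]: the `max_per_size` cap on same-sized buckets is a stand-in for the paper's invariant that ties the cap to the error bound, and all names are hypothetical.

```python
from collections import deque


class ExponentialHistogram:
    """Approximate count of one-bits over a time-based sliding window."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size  # assumed cap on buckets per size
        self.buckets = deque()            # newest first; each is [end_time, size]

    def add(self, t):
        """Record a one-bit arriving at time t (times must be non-decreasing)."""
        # Expire buckets that ended at or before the start of the window.
        while self.buckets and self.buckets[-1][0] <= t - self.window:
            self.buckets.pop()
        self.buckets.appendleft([t, 1])
        size = 1
        while True:
            same = [i for i, b in enumerate(self.buckets) if b[1] == size]
            if len(same) <= self.max_per_size:
                break
            # Merge the two oldest buckets of this size into one of double
            # size; the merged bucket keeps the newer ending time.
            newer, older = same[-2], same[-1]
            self.buckets[newer][1] = 2 * size
            del self.buckets[older]
            size *= 2

    def query(self, now, query_range=None):
        """Estimate the number of one-bits in (now - query_range, now]."""
        if query_range is None:
            query_range = self.window
        sizes = [s for end, s in self.buckets if end > now - query_range]
        if not sizes:
            return 0.0
        # The oldest bucket in range may straddle the boundary: half weight.
        return sum(sizes) - sizes[-1] / 2


eh = ExponentialHistogram(window=1000)
for t in range(1, 101):   # one one-bit per time unit
    eh.add(t)
estimate = eh.query(now=100)
```

Passing a smaller `query_range` to `query` answers shorter windows from the same structure, which is what allows queries with multiple sliding-window lengths.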

#8

ECM-sketches

Two distinct functionalities
  • Sketches: summarize distributions, no sliding window functionality
  • Sliding window data structures: only simple statistics

Our contributions
  • ECM-sketches: combine count-min sketches with sliding windows
  • Compact data stream summaries over sliding windows
  • Probabilistic guarantees for frequency and self-join/inner-product queries

#9

Counters are sliding windows
  • Exponential histograms
  • Deterministic waves
  • Randomized waves
  • ...

Updated and queried as with standard count-min sketches

[Figure: an ECM-sketch, i.e., an array with d hash functions and w counters in which each counter is a sliding window structure, e.g., an exponential histogram with buckets b1 to b5 of sizes 8, 4, 2, 1 and 1 and boundaries at times 1, 27, 35, 42, 47 and 51.]

#10

Combine count-min sketches with sliding windows

Challenge 1: Maintaining data stream statistics over sliding windows

Example: STREAM: (t1, z), (t3, 6x), (t5, y), ...

[Figure: an ECM-sketch with d hash functions and w counters. Add (t1, z) with h1(z) = 5, h2(z) = 2, h3(z) = 8, h4(z) = 6 records (t1, +1) in one counter per row. The figure also shows the query (t2, z).]

  • The error comes from both hash collisions and the estimation error of the sliding window counters
  • Desired: the algorithm chooses the optimal configuration (d, w, sliding window)
  • Total size depends on the sliding window structure (detailed analysis in the paper)
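A minimal Python sketch of this combination follows. To keep it short, each cell uses an exact timestamp list (`WindowCounter`, a hypothetical stand-in) instead of an exponential histogram or wave, so the only error left is the one from hash collisions. The salted SHA-256 hashing and all class and parameter names are assumptions made for illustration.

```python
import hashlib
from bisect import bisect_right


class WindowCounter:
    """Exact stand-in for a sliding window structure (illustration only)."""

    def __init__(self):
        self.times = []  # arrival times, assumed non-decreasing

    def add(self, t, count=1):
        self.times.extend([t] * count)

    def query(self, now, query_range):
        # Number of arrivals with now - query_range < time <= now.
        return bisect_right(self.times, now) - bisect_right(self.times, now - query_range)


class ECMSketch:
    """Count-min array whose counters are sliding window structures."""

    def __init__(self, w, d):
        self.w, self.d = w, d
        self.cells = [[WindowCounter() for _ in range(w)] for _ in range(d)]

    def _h(self, i, key):
        # Assumed hash family: SHA-256 salted with the row index.
        digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.w

    def add(self, t, key, count=1):
        # Same update rule as a count-min sketch, but timestamped.
        for i in range(self.d):
            self.cells[i][self._h(i, key)].add(t, count)

    def query(self, key, now, query_range):
        # Same query rule as a count-min sketch: minimum over the d rows.
        return min(
            self.cells[i][self._h(i, key)].query(now, query_range)
            for i in range(self.d)
        )


ecm = ECMSketch(w=16, d=4)
ecm.add(1, "z")
ecm.add(3, "x", 6)
ecm.add(5, "y")
z_at_t2 = ecm.query("z", now=2, query_range=2)
x_at_t5 = ecm.query("x", now=5, query_range=5)
```

Swapping `WindowCounter` for an approximate structure with the same `add`/`query` interface would give the compact variant discussed on the slide, at the cost of adding the window estimation error to the collision error.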

#11

Aggregating ECM-sketches

Order-preserving aggregation
  • Stream 1: (1, A), (2, B), (10, C), (11, A), (17, D), (18, B), ...
  • Stream 2: (3, B), (6, A), (13, A), (14, A), (22, D), (27, B), ...
  • Aggregate: (1, A), (2, B), (3, B), (6, A), (10, C), (11, A), (13, A), (14, A), ...
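On raw streams, order-preserving aggregation is simply a merge by timestamp. The short Python example below reproduces the aggregate for the visible prefixes of the two streams; the variable names are chosen here for illustration.

```python
import heapq

# Visible prefixes of the two timestamped streams: (time, item).
stream1 = [(1, "A"), (2, "B"), (10, "C"), (11, "A"), (17, "D"), (18, "B")]
stream2 = [(3, "B"), (6, "A"), (13, "A"), (14, "A"), (22, "D"), (27, "B")]

# Both inputs are already sorted by time, so a linear merge is enough.
aggregate = list(heapq.merge(stream1, stream2, key=lambda event: event[0]))
```

An aggregated sketch should answer queries as if it had been built over this merged stream, without the nodes shipping the raw streams.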

Composition of ECM-sketches: compose the corresponding counters
  • Requires composition of sliding windows!
  • Randomized sliding window structures
      • Trivial lossless aggregation, but very expensive (computation, memory, network)
  • Deterministic sliding window structures
      • More compact and efficient, but do not trivially support aggregation

[Figure: nodes n1, n2, ..., n8, ..., nj whose sketches are combined with + operators.]

#12

Aggregation for deterministic sliding window structures

Key idea: use the sliding window buckets as logs to replay the streams

E.g., generate an aggregate exponential histogram as follows:
  • For each bucket of size b, generate two events:
      • b/2 one-bits arrive at the starting time of the bucket
      • b/2 one-bits arrive at the ending time of the bucket
  • Sort the events by time
  • Construct a new exponential histogram from these events

If each of the EHs has error ε, then the aggregated EH has error 2ε (worst-case analytic prediction, tight)
  • Proof in the paper
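The event-generation and sorting steps of the replay procedure can be sketched in Python as follows. The bucket boundaries are read off the two example histograms in the slide's figure. Representing a bucket as a (start, end, size) tuple and placing a size-1 bucket's single one-bit at its ending time are assumptions made here, since the slide does not say how b = 1 is handled. Feeding the events into a fresh exponential histogram is left out.

```python
def replay_events(buckets):
    """Turn (start, end, size) buckets into time-sorted (time, count) events."""
    events = []
    for start, end, size in buckets:
        if size == 1:
            # Assumption: a single one-bit is replayed at the bucket's end.
            events.append((end, 1))
        else:
            events.append((start, size // 2))       # b/2 at the starting time
            events.append((end, size - size // 2))  # b/2 at the ending time
    return sorted(events)


# Buckets b1..b5 of the two example histograms (sizes 8, 4, 2, 1, 1).
eh_a = [(1, 27, 8), (27, 35, 4), (35, 42, 2), (42, 47, 1), (47, 51, 1)]
eh_b = [(1, 12, 8), (12, 22, 4), (22, 28, 2), (28, 31, 1), (31, 33, 1)]

events = replay_events(eh_a + eh_b)
```

The replay preserves the total number of one-bits (32 here); only their positions inside each bucket are approximated, which is where the additional error comes from.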

Result holds for any number of exponential histograms composed

[Figure: two exponential histograms to be aggregated, each with buckets b1 to b5 of sizes 8, 4, 2, 1 and 1. The first has boundaries at times 1, 27, 35, 42, 47 and 51; the second at times 1, 12, 22, 28, 31 and 33.]

#13

Aggregating ECM-sketches

Challenge 2: Aggregation of local statistics to get global statistics

  • Given A, B, ...: the aggregated sketch represents the order-preserving aggregation of all streams

[Figure: sketches labeled A, B, C, ... combined with + operators into an aggregated sketch.]

#14

Experimental evaluation

  • ECM-sketches based on
      • Exponential histograms, deterministic waves, randomized waves
  • ε in [0.05, 0.25]
  • Centralized setting: evaluate individual ECM-sketches
  • Distributed setting: nodes organized in a binary tree, aggregated ECM-sketches
  • Dataset:
      • World-cup 98: approx. 1.1 billion HTTP requests (key: URL)

  • Queries: point queries (URL frequency) and self-join queries
  • Observed error is relative to the stream size, as in conventional count-min sketches
  • Sliding window of 1 million seconds (~11.5 days)
  • More results in the paper

#15

Estimation accuracy of ECM-sketches

ECM-sketches with exponential histograms
  • More efficient and more compact than with deterministic waves
  • At least two orders of magnitude smaller than with randomized waves

#16

Accuracy of aggregated ECM-sketches

  • ECM-sketches with randomized waves: error-free aggregation, high space complexity
  • ECM-sketches based on deterministic sliding windows: error smaller than the worst-case analytic prediction

#17

Conclusions

ECM-sketches
  • The first data structure to enable sliding window statistics over high-dimensional streams
  • Enables composition with controllable error bounds

Future work
  • ECM-sketches to continuously monitor functions over distributed data
  • Geometric method [Sharfman 06]

#18

Thank you for your attention

http://www.softnet.tuc.gr
http://www.lift-eu.org

#19