Data Stream Algorithmics S. Muthu Muthukrishnan Rutgers Univ

S. Muthu Muthukrishnan Rutgers Univccm/cs34/papers/streaminglecture.pdf · satellites generate billions of readings each per day. • IP Network Traffic: up to 1 Billion packets per

Data Stream Algorithmics

S. Muthu Muthukrishnan

Rutgers Univ

Finding out more about me…

Type in

in www.a9.com

Adorisms

I work in algorithms/databases/networking.

Lecture 1 Overview

• Sublinear Methods:
  – Sublinear time
  – Sublinear space
  – Streaming
• Data Stream Algorithms
  – Model
  – Applications
• Rest of the course

Sublinear Time

• Problem:
  – Given distinct integers A[1..n], determine a number in their top half in sorted rank.
• Algorithm:
  – Pick k numbers uniformly at random. Determine their MAX and return it as the solution.
• Analysis:
  – The solution is incorrect only if all k sampled numbers are in the lower half, which happens with probability at most (1/2)^k.
  – To get probability of error δ, use log(1/δ) samples (logarithm base 2).
  – MIN or MAX is hard using this method.
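The sampling argument above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the names `samples_needed` and `top_half_element` are hypothetical, and sampling is done with replacement, which is an assumption that keeps the (1/2)^k bound valid.

```python
import math
import random


def samples_needed(delta):
    """Smallest k with (1/2)**k <= delta, i.e. k = ceil(log2(1/delta))."""
    return max(1, math.ceil(math.log2(1 / delta)))


def top_half_element(a, delta, rng=random):
    """Return an element that lies in the top half of `a` (by sorted rank)
    with probability at least 1 - delta.

    Only k = ceil(log2(1/delta)) positions of `a` are read, independent of n.
    Assumption: samples are drawn with replacement.
    """
    k = samples_needed(delta)
    return max(rng.choice(a) for _ in range(k))


if __name__ == "__main__":
    rng = random.Random(1)
    data = list(range(1000))
    rng.shuffle(data)
    print(top_half_element(data, delta=1e-9, rng=rng))
```

The number of reads depends only on δ, which is what makes the method sublinear in n.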

Pseudo-Sublinear Space: Lights out puzzle

• One switch, initially in the OFF position, in a room. People 1, …, n are each in different rooms.
• Paul picks people to go into the switch room:
  – Randomly
  – Each is unaware of everyone else.
• Goal: Design a protocol so someone can declare when everyone has been into the switch room at least once.
• Solution:
  – 1 common bit,
  – log n bits with the leader.
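The slide only states the resources used (one shared bit, log n bits at the leader). One well-known protocol that fits those resources is the counting protocol sketched below; it is an assumption that this is the protocol intended. In this sketch, person 0 is the leader, every other person turns the switch ON exactly once (which needs one private bit each), and the leader turns it OFF and counts. The function name `simulate_switch_protocol` is hypothetical.

```python
import random


def simulate_switch_protocol(n, rng):
    """Simulate the counting protocol for the switch-room puzzle.

    Returns (rounds, everyone_visited) at the moment the leader declares.
    """
    switch_on = False             # the single common bit
    leader_count = 0              # leader's counter: O(log n) bits
    has_signalled = [False] * n   # one private bit per non-leader
    visited = set()               # used only to check the declaration
    rounds = 0
    while True:
        rounds += 1
        person = rng.randrange(n)  # Paul picks someone at random
        visited.add(person)
        if person == 0:
            # Leader: collect a pending signal, if any.
            if switch_on:
                switch_on = False
                leader_count += 1
            if leader_count == n - 1:
                return rounds, len(visited) == n
        elif not switch_on and not has_signalled[person]:
            # Non-leader: signal exactly once, and only if the bit is free.
            switch_on = True
            has_signalled[person] = True


if __name__ == "__main__":
    rounds, ok = simulate_switch_protocol(10, random.Random(0))
    print(rounds, ok)
```

The declaration is always correct because each of the n - 1 counted signals comes from a distinct non-leader.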

Sublinear Space: Metric Embeddings

• Space of objects A with distance C(.,.) and space B with distance D(.,.).
• Consider mappings $f : P_A \to P_B$.
• For any p, q in $P_A$: $(1/l)\, C(p,q) \le D(f(p), f(q)) \le u\, C(p,q)$, where l is the contraction and u the expansion.
• Often, f is linear, f is oblivious.
• Say A and B are L_p norms in dimension d and d' resp. d' << d is dimensionality reduction. And d << d' to switch from easy to hard norms.
• Ordinal embeddings.

Embedding Vector Objects

• Johnson+Lindenstrauss. There exists a (1+ε) distortion embedding of n points in $l_2^d$ into $l_2^{d'}$ where $d' \approx \frac{4 \ln n}{\varepsilon^2}$.
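One common way to realize such an embedding is a random Gaussian projection. The sketch below is an illustration under stated assumptions, not the lecture's construction: the function names are hypothetical, the target dimension is simply taken as ceil(4 ln n / ε²), and the distortion bound holds with high probability rather than with certainty.

```python
import math
import random


def jl_project(points, eps, rng):
    """Project n >= 2 points from R^d to R^d' with a random Gaussian matrix.

    d' = ceil(4 ln n / eps^2); entries are N(0, 1) scaled by 1/sqrt(d') so
    that squared lengths are preserved in expectation.
    """
    n, d = len(points), len(points[0])
    d_prime = math.ceil(4 * math.log(n) / eps ** 2)
    scale = 1 / math.sqrt(d_prime)
    matrix = [[rng.gauss(0, 1) * scale for _ in range(d)]
              for _ in range(d_prime)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix]
            for p in points]


def dist(p, q):
    """Euclidean (l_2) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


if __name__ == "__main__":
    rng = random.Random(0)
    pts = [[rng.uniform(-1, 1) for _ in range(200)] for _ in range(20)]
    proj = jl_project(pts, eps=0.5, rng=rng)
    ratios = [dist(proj[i], proj[j]) / dist(pts[i], pts[j])
              for i in range(20) for j in range(i + 1, 20)]
    print(len(proj[0]), min(ratios), max(ratios))
```

The projection is linear and oblivious: the matrix is chosen without looking at the points.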

Sublinear Space: Chasing tails

• Problem: Given a linked list in the standard nextptr representation and a pointer p to the head of the list, determine if the list has a loop with at most O(1) extra memory.
• Solution:
  – Even Rounds: Paul moves one step.
  – Odd Rounds: Carole moves two steps.
• Analysis:
  – If they meet when Paul is in round p+x, then x ≤ L, where L is the loop length and p the prefix length. So, total time is O(N), where N is the length of the list.
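The two-pointer chase can be sketched as follows. The `Node` class and the function name `has_loop` are hypothetical names chosen for illustration; the two pointers are named after Paul (one step per iteration) and Carole (two steps per iteration).

```python
class Node:
    """Singly linked list node in the usual next-pointer representation."""

    def __init__(self, value):
        self.value = value
        self.next = None


def has_loop(head):
    """Return True if the list starting at `head` contains a loop.

    Uses O(1) extra memory: just the two pointers.
    """
    paul = carole = head
    while carole is not None and carole.next is not None:
        paul = paul.next            # Paul moves one step
        carole = carole.next.next   # Carole moves two steps
        if paul is carole:          # they can only meet inside a loop
            return True
    return False                    # Carole fell off the end: no loop


if __name__ == "__main__":
    nodes = [Node(i) for i in range(6)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    print(has_loop(nodes[0]))   # straight list
    nodes[-1].next = nodes[2]   # close a loop back to the third node
    print(has_loop(nodes[0]))
```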

Streaming: Find Missing Numbers

• Paul permutes numbers 1…n, and shows all but one to Carole, in the permuted order, one after the other.

• Carole must find the missing number.

Carole cannot remember all the numbers she has been shown.

Carole finds the missing number…

• Carole cumulates the sum of all the numbers she is being shown. At the end, she can subtract this sum from $\frac{n(n+1)}{2}$.
  – Uses O(log n) bits to store the partial sum.
  – Performs one addition each time Paul shows a new number. Takes O(log n) time per number.
  – At the end, computes the missing number with one subtraction. Takes O(log n) time for the final computation.
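Carole's running-sum strategy can be written as a one-pass function. This is a minimal sketch; the name `find_missing` is hypothetical, and the stream is assumed to contain each of 1..n exactly once except for the single missing number.

```python
def find_missing(stream, n):
    """Return the one number from 1..n that does not appear in `stream`.

    Only the running sum is stored, never the items themselves.
    """
    total = 0
    for x in stream:      # one addition per item shown
        total += x
    return n * (n + 1) // 2 - total   # one final subtraction


if __name__ == "__main__":
    shown = [4, 1, 5, 2]   # Paul hides 3 out of 1..5
    print(find_missing(iter(shown), 5))
```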

Finding 2 missing numbers.

• What if Paul shows all but two numbers?
• Carole keeps the sum AND product of the numbers Paul shows her.
  – O(n log n) bits and time.
• Alternatively, Carole keeps the sum AND sum of squares of the numbers Paul shows her.
  – As before: O(log n) storage, O(log n) process time and O(log n) compute time.
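The sum and sum-of-squares variant can be sketched as below. If the missing numbers are a and b, the two counters give s = a + b and q = a² + b², and then (a − b)² = 2q − s², so both values follow from one integer square root. The function name `find_two_missing` is hypothetical, and the stream is assumed to contain 1..n with exactly two distinct numbers removed.

```python
import math


def find_two_missing(stream, n):
    """Return the two numbers from 1..n missing from `stream`, smaller first."""
    s = n * (n + 1) // 2                 # sum of 1..n
    q = n * (n + 1) * (2 * n + 1) // 6   # sum of squares of 1..n
    for x in stream:
        s -= x        # s ends as a + b
        q -= x * x    # q ends as a^2 + b^2
    diff = math.isqrt(2 * q - s * s)     # |a - b|
    return (s - diff) // 2, (s + diff) // 2


if __name__ == "__main__":
    shown = [x for x in range(1, 11) if x not in (3, 8)]
    print(find_two_missing(iter(shown), 10))
```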

Streaming: Informal

• Streaming involves
  – Small number of passes over data. (Typically 1?)
  – Sublinear space (sublinear in the universe or number of stream items?)
  – Sublinear time for computing (?)
• Similar to dynamic, online, approximation or randomized algorithms, but with more constraints.

Paul & Carole Games

• Playing twenty questions (Spencer and Winkler)
  – Paul (for Paul Erdős) asks the questions.
  – Carole (anagram for Oracle) gives answers.
• Pusher, Chooser Games. I used them for coin-weighing problems in my thesis work.
• Alliteration:
  – Paul permutes, Carole cumulates.
  – Earlier, Paul pulled people into the switch room and Carole called the end of the game, or
  – Carole chased the tail and Paul panted behind.

www.despair.com

Lecture 1 Overview

• Sublinear:
  – Sublinear time
  – Sublinear space
  – Streaming
• Data Stream Algorithms
  – Model
  – Applications
• Rest of the course and Details

Motivation for Sublinear and Streaming

Explosion of Data In Recent Years

• 3 Billion Telephone Calls in US each day.
• 30 Billion emails daily, 1 Billion SMS, IMs.
• Scientific data: NASA's observation satellites generate billions of readings each per day.
• IP Network Traffic: up to 1 Billion packets per hour per router. Each ISP has many (hundreds) of routers!
• Compare to human scale data: "only" 1 billion worldwide credit card transactions per month.

[Figure labels: CC trans, US Phone, Satellite, Email, IP Router]

New data scales demand new approaches from databases, algorithms, networks, systems and engineering.

IP Network Measurements

• SNMP
• Packet logs
• Flow logs
• BGP logs
• Fault alarms
• …

[Figure label: Router]

SNMP log: (Router ID, Interface ID, Timestamp, Bytes sent since last obs)

Flow log: (Source IP, Dest IP, Start Time, Duration, No. Packets, No. Bytes)

Packet log: (Source IP, Dest IP, Src/Dest Port Numbers, Time, No. Bytes)

Challenge of IP N/W Monitoring

• 1 link with 2 Gb/s. Say avg packet size is 50 bytes.
• Number of pkts/sec = 5 Million.
• Time per pkt = 0.2 µsec.
• If we capture pkt headers per packet: src/dest IP, time, no of bytes, etc., at least 10 bytes. Space per second is 50 MB. Space per day is about 4.3 TB per link. ISPs have hundreds of links.

Very important that we don't do pseudo-applications.
How to analyze IP packet content streams?
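The back-of-the-envelope numbers can be reproduced with a few lines of arithmetic. The constants below are the ones stated on the slide; the storage figures are computed in bytes, which is an assumption based on the header size being given in bytes.

```python
LINK_BITS_PER_SEC = 2 * 10**9   # 2 Gb/s link
AVG_PACKET_BYTES = 50           # assumed average packet size
HEADER_BYTES = 10               # captured header fields per packet
SECONDS_PER_DAY = 24 * 60 * 60

packets_per_sec = LINK_BITS_PER_SEC // (AVG_PACKET_BYTES * 8)
seconds_per_packet = 1 / packets_per_sec
header_bytes_per_sec = packets_per_sec * HEADER_BYTES
header_bytes_per_day = header_bytes_per_sec * SECONDS_PER_DAY

if __name__ == "__main__":
    print(f"packets/sec:      {packets_per_sec:,}")
    print(f"time per packet:  {seconds_per_packet * 1e6:.1f} microseconds")
    print(f"header bytes/sec: {header_bytes_per_sec / 1e6:.0f} MB")
    print(f"header bytes/day: {header_bytes_per_day / 1e12:.2f} TB")
```

At 0.2 microseconds per packet there is time for only a handful of memory accesses per update, which is why per-item processing cost matters as much as space.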

Earliest Data Analysis.

• Natural and political observations made upon the Bills of Mortality, by John Graunt, 1662.
  – Use figures of births and deaths in London collected by parishes.
  – Few beggars starve to death, polygamy is irrational, Head is too big for the body, …
  – How many M/F? How many married/single? What years fruitful/mortal and in what intervals? …
  – Knowledge of these is necessary to ease government, balance Parties and factions both in church and state. But is it necessary for others besides the king and his ministers?
• First life insurance tables, by Edmund Halley, 1693.

Some queries on IP network traffic

• How many distinct IP addresses use a given link currently or anytime during the day?

• What are the top k voluminous flows currently in progress in a link?

• How many flows consisted of only one packet?

• Are traffic patterns in two routers correlated? What are (un)usual trends?

Paul’s missing number.

Rarity

Signal analysis: Wavelets, Fourier, etc.

Online statistics

The Streaming Model

• Underlying Signal: one-dimensional (or multi-dimensional) array A[1…n] with values A[i], initially all zero.
• Signal representation is implicit via updates. The jth update is (i, C[j]), implying
  – Update: $A[i] \leftarrow A[i] + C[j]$, where $C[j] \ge 0$ or $C[j] \le 0$.
• Compute functions on A subject to:
  – small space (log n, log^2 n, etc.)
  – fast processing of updates
  – fast computation of functions

Note (AT&T User): number of transactions may be very very large.
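To make the update semantics concrete, the sketch below applies updates (i, C[j]) to an explicitly stored signal. It is an illustration only: the class name `TurnstileSignal` is hypothetical, and storing A explicitly uses space proportional to the number of nonzero entries, which is exactly what a streaming algorithm must avoid. It serves as a reference against which a small-space summary could be checked.

```python
from collections import defaultdict


class TurnstileSignal:
    """Explicit (non-streaming) representation of the signal A.

    Entries are implicitly zero until updated; updates may be positive
    or negative.
    """

    def __init__(self):
        self.a = defaultdict(int)

    def update(self, i, c):
        """Apply the update A[i] <- A[i] + c."""
        self.a[i] += c
        if self.a[i] == 0:
            del self.a[i]   # keep only nonzero entries

    def value(self, i):
        """Current value of A[i]."""
        return self.a.get(i, 0)

    def support(self):
        """Indices with nonzero value, in sorted order."""
        return sorted(self.a)


if __name__ == "__main__":
    sig = TurnstileSignal()
    for i, c in [(7, 3), (2, 1), (7, -3), (2, 4)]:
        sig.update(i, c)
    print(sig.value(2), sig.value(7), sig.support())
```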

IP Network Signals

• Number of bytes (packets) sent by a source IP address during the day.
  – 2^(32) sized one-dimensional array; increment only.
• Number of flows between a source and a destination IP address during the day.
  – 2^(64) sized two-dimensional array; aggregate packets.
• Number of active flows per source IP.
  – 2^(32) sized one-dimensional array; increment and decrement.

Special cases of the model

• Transaction j updates A[j]. Time series model.
• C[j] ≥ 0 always. Cash Register Model. Same item may appear many times, typically C[j]=1, so we see a multiset of items in one pass.
• Most general model is the Turnstile Model.

How do the streaming puzzles and IP traffic monitoring examples map to the models here?

Cash Register model can do more than sampling, e.g., MIN or MAX, but is weaker than the Turnstile model, where MIN or MAX is hard to approximate in range, but not in domain.

Turnstile Model Challenge

• 1000000 items inserted
• 999996 items deleted
• 4 items left

Recovering items to ±0.1 ||A|| accuracy => retrieve each item precisely.

[Figure label: Summary maintained]

Lecture 1 Overview

• Sublinear:
  – Sublinear time
  – Sublinear space
  – Streaming
• Data Stream Algorithms
  – Model
  – Applications
• Rest of the course and Details

Course Details

• Meeting times:
  – 05/31, 06/07, 06/21, 06/28 at 10.30-12 Noon.
  – 06/02, 06/09, 06/23, 06/30 at 11.30-1PM.
• Homeworks due on Mondays. LaTeX and PS please.
• Contact:
  – Office room: tel x:
  – Email: [email protected]
• Course webpage:
  – www.dis.uniroma1.it/~muthu

Rest of the course

• Sketches: AMS, CCC, CM. Relationship to embeddings, k-wise ind. rv's.
• Heavy Hitters: Manku+Motwani, CM, Deltoids. (Non) adaptive Code construction.
• Quantiles: Rangesum rv, GK, CM, Other statistical quantities eg histograms, wavelets.
• Distinct Elements: stable distributions. FM.
• Inverse distribution: Summarizing. Rarity. Min-wise hashing. Maximum likelihood methods MLE/EM.

Rest of the course

• Advanced topics
  – Semi-streaming and graph algorithms.
  – Streaming geometry problems.
  – Streaming text processing problems.
  – Clustering and facility location.
  – Property testing, privacy-preserving data mining, etc.