Data Stream Algorithmics
S. Muthu Muthukrishnan
Rutgers Univ
Finding out more about me…
Type in
in www.a9.com
Adorisms
I work in algorithms/databases/networking.
Lecture 1 Overview
• Sublinear Methods:
  – Sublinear time
  – Sublinear space
  – Streaming
• Data Stream Algorithms
  – Model
  – Applications
• Rest of the course
Sublinear Time
• Problem:
  – Given distinct integers A[1,..,n], determine a number in their top half in sorted rank.
• Algorithm:
  – Pick k numbers uniformly at random. Determine their MAX and return it as the solution.
• Analysis:
  – The solution is incorrect only if all k sampled numbers are in the lower half, which happens with probability at most (1/2)^k. To get error probability δ, use log_2(1/δ) samples.
• Finding the exact MIN or MAX is hard using this method.
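The sampling algorithm above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `top_half_element` and the `rng` parameter are assumptions, and sampling is done with replacement so that the (1/2)^k bound applies directly.

```python
import math
import random

def top_half_element(a, delta, rng=random):
    """Return an element of `a` that lies in the top half by rank
    with probability at least 1 - delta (hypothetical helper)."""
    # k = ceil(log2(1/delta)) samples make (1/2)^k <= delta.
    k = max(1, math.ceil(math.log2(1.0 / delta)))
    # Sample with replacement: each draw falls in the lower half
    # with probability at most 1/2, independently of the others.
    samples = [rng.choice(a) for _ in range(k)]
    return max(samples)
```

The running time depends only on δ, not on n, which is what makes the method sublinear.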
Pseudo-Sublinear Space: Lights out puzzle
• One switch, initially in the OFF position, in a room. People 1, …, n are each in different rooms.
• Paul picks people to go into the switch room:
  – Randomly
  – Each is unaware of everyone else.
• Goal: Design a protocol so someone can declare when everyone has been into the switch room at least once.
• Solution:
  – 1 common bit,
  – log n bits with the leader.
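The slide only names the resources (one shared bit, log n bits with the leader). One protocol that fits them, assumed here since the slide does not spell it out, is leader counting: each non-leader turns the switch ON exactly once, and the leader turns it OFF and counts. The simulation below is a sketch; `lights_out`, the choice of person 0 as leader, and the one private bit per non-leader are illustrative assumptions.

```python
import random

def lights_out(n, seed=0):
    """Simulate leader counting; return (rounds, everyone_visited)."""
    rng = random.Random(seed)
    switch_on = False            # the single shared bit
    has_signalled = [False] * n  # one private bit per non-leader (assumption)
    count = 0                    # leader's counter, about log2(n) bits
    visited = [False] * n        # bookkeeping for checking only
    leader = 0
    rounds = 0
    while True:
        rounds += 1
        person = rng.randrange(n)  # Paul picks someone at random
        visited[person] = True
        if person == leader:
            if switch_on:
                switch_on = False
                count += 1
            if count == n - 1:
                # Every non-leader has signalled once, so all have visited.
                return rounds, all(visited)
        elif not switch_on and not has_signalled[person]:
            switch_on = True
            has_signalled[person] = True
```

The leader declares only after counting n - 1 signals, so the declaration is never wrong; the price is a long wait under random picks.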
Sublinear Space: Metric Embeddings
• Space of objects A with distance C(.,.) and space B with distance D(.,.).
• Consider mappings f: P_A → P_B.
• For any p, q in P_A:
  (1/l) C(p,q) ≤ D(f(p), f(q)) ≤ u C(p,q)
  where l is the contraction and u the expansion.
• Often, f is linear, f is oblivious.
• Say A and B are L_p norms in dimension d and d' resp. d' << d is dimensionality reduction. And d << d' to switch from easy to hard norms.
• Ordinal embeddings
Embedding Vector Objects
• Johnson+Lindenstrauss. There exists a (1+ε) distortion embedding of n points in l_2^d into l_2^{d'} where d' ≈ 4 ln n / ε^2.
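As a rough illustration of the Johnson+Lindenstrauss statement, the sketch below computes the target dimension d' ≈ 4 ln n / ε^2 and applies a random Gaussian projection, which is one common linear, oblivious construction. The names `jl_target_dim` and `random_projection` are hypothetical, and the code makes no claim to be the construction used in the lecture.

```python
import math
import random

def jl_target_dim(n, eps):
    """Target dimension d' ~ 4 ln(n) / eps^2, rounded up."""
    return math.ceil(4.0 * math.log(n) / eps ** 2)

def random_projection(points, d_out, seed=0):
    """Map each point to d_out dimensions with a random Gaussian matrix.

    The matrix is drawn without looking at the points (oblivious) and
    is applied as a linear map. Entries are N(0, 1) scaled by
    1/sqrt(d_out) so squared lengths are preserved in expectation.
    """
    rng = random.Random(seed)
    d_in = len(points[0])
    scale = 1.0 / math.sqrt(d_out)
    matrix = [[rng.gauss(0.0, 1.0) * scale for _ in range(d_in)]
              for _ in range(d_out)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix]
            for p in points]
```

For example, `jl_target_dim(1000, 0.5)` gives 111, independent of the original dimension d.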
Sublinear Space: Chasing tails
• Problem: Given a linked list in the standard nextptr representation and a pointer p to the head of the list, determine if the list has a loop, using at most O(1) extra memory.
• Solution:
  – Even Rounds: Paul moves one step.
  – Odd Rounds: Carole moves two steps.
• Analysis:
  – If they meet when Paul is in round p+x, then x ≤ L, where L is the loop length and p the prefix length. So, total time is O(N), where N is the length of the list.
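The two-pointer chase can be sketched as follows. The `Node` class and the `build_list` helper are illustrative assumptions; only the two pointers `paul` and `carole` count as extra memory.

```python
class Node:
    """Singly linked list node with a next pointer."""
    def __init__(self, value):
        self.value = value
        self.next = None

def has_loop(head):
    """Return True if the list starting at `head` contains a loop."""
    paul = carole = head
    while carole is not None and carole.next is not None:
        paul = paul.next            # Paul moves one step
        carole = carole.next.next   # Carole moves two steps
        if paul is carole:          # they can only meet inside a loop
            return True
    return False                    # Carole fell off the end: no loop

def build_list(n, loop_to=None):
    """Hypothetical helper: n nodes, optionally linking the tail to index loop_to."""
    nodes = [Node(i) for i in range(n)]
    for a, b in zip(nodes, nodes[1:]):
        a.next = b
    if loop_to is not None and nodes:
        nodes[-1].next = nodes[loop_to]
    return nodes[0] if nodes else None
```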
Streaming: Find Missing Numbers
• Paul permutes numbers 1…n, and shows all but one to Carole, in the permuted order, one after the other.
• Carole must find the missing number.
Carole can not remember all the numbers she has been shown.
Carole finds the missing number…
• Carole cumulates the sum of all the numbers she is being shown. At the end, she can subtract this sum from n(n+1)/2.
  – Uses O(log n) bits to store the partial sum.
  – Performs one addition each time Paul shows a new number. Takes O(log n) time per number.
  – At the end, computes the missing number with one subtraction. Takes O(log n) time for the final computation.
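Carole's strategy, as a minimal one-pass sketch (the function name `find_missing` is an assumption):

```python
def find_missing(stream, n):
    """Return the one number from 1..n that is absent from `stream`."""
    total = 0
    for x in stream:      # one pass, one addition per item
        total += x
    # The full sum 1 + 2 + ... + n is n(n+1)/2.
    return n * (n + 1) // 2 - total
```

`stream` can be any iterable, including a generator, since each number is read once and never stored.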
Finding 2 missing numbers.
• What if Paul shows all but two numbers?
• Carole keeps the sum AND product of the numbers Paul shows her. O(n log n) bits and time.
• Alternatively, Carole keeps the sum AND sum of squares of the numbers Paul shows her. As before: O(log n) storage, O(log n) process time and O(log n) compute time.
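The sum and sum-of-squares variant can be sketched as below. If a and b are the missing numbers, the two running totals give a+b and a^2+b^2, and then (a-b)^2 = 2(a^2+b^2) - (a+b)^2. The function name `find_two_missing` is an assumption.

```python
import math

def find_two_missing(stream, n):
    """Return the two numbers from 1..n absent from `stream`, smaller first."""
    s = 0  # running sum
    q = 0  # running sum of squares
    for x in stream:
        s += x
        q += x * x
    a_plus_b = n * (n + 1) // 2 - s
    a2_plus_b2 = n * (n + 1) * (2 * n + 1) // 6 - q
    # (a - b)^2 = 2(a^2 + b^2) - (a + b)^2
    diff = math.isqrt(2 * a2_plus_b2 - a_plus_b ** 2)
    return (a_plus_b - diff) // 2, (a_plus_b + diff) // 2
```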
Streaming: Informal
• Streaming involves
  – Small number of passes over data. (Typically 1?)
  – Sublinear space (sublinear in the universe or number of stream items?)
  – Sublinear time for computing (?)
• Similar to dynamic, online, approximation or randomized algorithms, but with more constraints.
Paul & Carole Games
• Playing twenty questions (Spencer and Winkler)
  – Paul (for Paul Erdos) asks the questions.
  – Carole (anagram for Oracle) gives answers.
• Pusher, Chooser Games. I used them for coin-weighing problems in my thesis work.
• Alliteration:
  – Paul permutes, Carole cumulates.
  – Earlier, Paul pulled people into the switch room and Carole called the end of the game, or
  – Carole chased the tail and Paul panted behind.
www.despair.com
Lecture 1 Overview
• Sublinear:
  – Sublinear time
  – Sublinear space
  – Streaming
• Data Stream Algorithms
  – Model
  – Applications
• Rest of the course and Details
Motivation for Sublinear and Streaming
Explosion of Data In Recent Years
• 3 Billion Telephone Calls in US each day
• 30 Billion emails daily, 1 Billion SMS, IMs.
[Figure labels: CC trans, US Phone, Satellite, IP Router]
• Scientific data: NASA's observation satellites each generate billions of readings per day.
• IP Network Traffic: up to 1 Billion packets per hour per router. Each ISP has many (hundreds of) routers!
• Compare to human scale data: "only" 1 billion worldwide credit card transactions per month.
New data scales demand new approaches from databases, algorithms, networks, systems and engineering.
IP Network Measurements
• SNMP
• Packet logs
• Flow logs
• BGP logs
• Fault alarms
• ….
SNMP log:(Router ID, Interface ID, Timestamp, Bytes sent since last obs)
Flow log:(Source IP, Dest IP, Start Time, Duration, No. Packets, No. Bytes)
Packet log:(Source IP, Dest IP, Src/Dest Port Numbers, Time, No. Bytes)
Challenge of IP N/W Monitoring
• 1 link with 2 Gb/s. Say avg packet size is 50 bytes.
• Number of pkts/sec = 5 Million.
• Time per pkt = 0.2 µsec.
• If we capture pkt headers per packet: src/dest IP, time, no of bytes, etc. at least 10 bytes. Space per second is 50 MB. Space per day is about 4.3 TB per link. ISPs have hundreds of links.
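The back-of-the-envelope numbers can be checked with a short calculation. This is a sketch under the slide's stated assumptions (2 Gb/s link, 50-byte average packet, 10 bytes captured per packet); the constant and function names are illustrative.

```python
LINK_BITS_PER_SEC = 2 * 10**9    # 2 Gb/s link
AVG_PACKET_BYTES = 50            # assumed average packet size
HEADER_BYTES_PER_PACKET = 10     # assumed captured bytes per packet
SECONDS_PER_DAY = 24 * 60 * 60

def monitoring_load():
    """Return (pkts/sec, usec per pkt, bytes/sec, bytes/day) for one link."""
    pkts_per_sec = LINK_BITS_PER_SEC // (AVG_PACKET_BYTES * 8)
    usec_per_pkt = 1e6 / pkts_per_sec
    bytes_per_sec = pkts_per_sec * HEADER_BYTES_PER_PACKET
    bytes_per_day = bytes_per_sec * SECONDS_PER_DAY
    return pkts_per_sec, usec_per_pkt, bytes_per_sec, bytes_per_day
```

Under these assumptions the result is 5 million packets per second, 0.2 µs per packet, 50 MB of headers per second, and 4.32 TB of headers per day per link.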
Very important that we don’t do pseudo-applications.
How to analyze IP packet content streams?
Earliest Data Analysis.
• Natural and political observations made upon the Bills of Mortality, by John Graunt, 1662.
  – Use figures of births and deaths in London collected by parishes.
  – Few beggars starve to death, polygamy is irrational, Head is too big for the body,…
  – How many M/F? How many married/single? What years fruitful/mortal and in what intervals?…
  – Knowledge of these is necessary to ease government, balance Parties and factions both in church and state. But is it necessary for others besides the king and his ministers?
• First life insurance tables, by Edmund Halley, 1693.
Some queries on IP network traffic
• How many distinct IP addresses use a given link currently or anytime during the day?
• What are the top k voluminous flows currently in progress in a link?
• How many flows consisted of only one packet?
• Are traffic patterns in two routers correlated? What are (un)usual trends?
Paul’s missing number.
Rarity
Signal analysis: Wavelets, Fourier, etc.
Online statistics
The Streaming Model
• Underlying Signal: one-dimensional (or multi-dimensional) array A[1…n] with values A[i], initially all zero.
• Signal representation is implicit via updates. The jth update is (i, C[j]), implying
  – Update: A[i] ← A[i] + C[j], where C[j] may be ≥ 0 or ≤ 0.
  – The number of transactions may be very, very large.
• Compute functions on A subject to:
  – small space (log n, log^2 n, etc.)
  – fast processing of updates
  – fast computation of functions
IP Network Signals
• Number of bytes (packets) sent by a source IP address during the day.
  – 2^(32) sized one-dimensional array; increment only.
• Number of flows between a source and a destination IP address during the day.
  – 2^(64) sized two-dimensional array; aggregate packets.
• Number of active flows per source IP.
  – 2^(32) sized one-dimensional array; increment and decrement.
Special cases of the model
• Transaction j updates A[j]. Time series model.
• C[j] ≥ 0 always. Cash Register Model. Same item may appear many times, typically C[j]=1, so we see a multiset of items in one pass.
• Most general model is the Turnstile Model.
How do the streaming puzzles and IP traffic monitoring examples map to the models here?
Cash Register model can do more than sampling, e.g., MIN or MAX, but is weaker than the Turnstile model, where MIN or MAX is hard to approximate in range, but not in domain.
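The difference between the Cash Register and Turnstile models comes down to which updates (i, c) are allowed. The sketch below stores the signal A explicitly in a dictionary, which a real streaming algorithm would avoid; it is only meant to pin down the update semantics. The class name `Signal` and its `cash_register` flag are assumptions.

```python
from collections import defaultdict

class Signal:
    """Explicit copy of A, kept only to illustrate update semantics."""

    def __init__(self, cash_register=False):
        self.a = defaultdict(int)           # A[i], initially all zero
        self.cash_register = cash_register  # True: only c >= 0 allowed

    def update(self, i, c):
        """Apply one update (i, c): A[i] <- A[i] + c."""
        if self.cash_register and c < 0:
            raise ValueError("cash register model allows only c >= 0")
        self.a[i] += c
```

In the Turnstile model an insert followed by a matching delete leaves A[i] at zero, which is what makes recovering the few surviving items hard for sampling-based summaries.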
Turnstile Model Challenge
1000000 items inserted
999996 items deleted
4 items left
Recovering items to ±0.1 ||A|| accuracy => retrieve each item precisely.
Summary Maintained
Lecture 1 Overview
• Sublinear:
  – Sublinear time
  – Sublinear space
  – Streaming
• Data Stream Algorithms
  – Model
  – Applications
• Rest of the course and Details
Course Details
• Meeting times:
  – 05/31, 06/07, 06/21, 06/28 at 10.30-12 Noon.
  – 06/02, 06/09, 06/23, 06/30 at 11.30-1PM.
• Homeworks due on Mondays. Latex and PS please.
• Contact:
  – Office room: tel x:
  – Email: [email protected]
• Course webpage:
  – www.dis.uniroma1.it/~muthu
Rest of the course
• Sketches: AMS, CCC, CM. Relationship to embeddings, k-wise ind. rv’s.
• Heavy Hitters: Manku+Motwani, CM, Deltoids. (Non) adaptive Code construction.
• Quantiles: Rangesum rv, GK, CM, Other statistical quantities eg histograms, wavelets.
• Distinct Elements: stable distributions. FM.
• Inverse distribution: Summarizing. Rarity. Min-wise hashing. Maximum likelihood methods MLE/EM.
Rest of the course
• Advanced topics
  – Semi-streaming and graph algorithms.
  – Streaming geometry problems.
  – Streaming text processing problems.
  – Clustering and facility location
  – Property testing, privacy-preserving data mining, etc.