Graph Problems in the Streaming Model
Sampath Kannan, University of Pennsylvania
Work done with Joan Feigenbaum, Andrew McGregor, Siddharth Suri and Jian Zhang
Graph Streaming
G = (V, E), V known; |V| = n. E revealed in arbitrary order (e1, e2, …)
Space allowed O(n polylog n): semi-streaming
Motivation?
Fundamental problems … help ‘calibrate’ model
Massive graphs such as the webgraph can appear as stream
Recommendation systems… and more generally data mining
Why so much space?
Even simple problems need it:
Given u,v, and a streamed graph G, is there path of length 2 between u & v?
Requires Ω(n) space. More generally … for balanced graph properties …
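The Ω(n) bound is matched by a simple one-pass upper bound: remember the neighbors of u and of v, then intersect. A minimal sketch, assuming the stream is an iterable of vertex pairs; the function name is illustrative and not from the talk.

```python
def has_length2_path(edge_stream, u, v):
    """One pass over the stream; stores only the neighborhoods of u and v."""
    nbrs_u, nbrs_v = set(), set()
    for a, b in edge_stream:
        # Record the other endpoint whenever an edge touches u or v.
        if a == u:
            nbrs_u.add(b)
        if b == u:
            nbrs_u.add(a)
        if a == v:
            nbrs_v.add(b)
        if b == v:
            nbrs_v.add(a)
    # A path u - x - v exists iff u and v share a neighbor x other than themselves.
    return bool((nbrs_u & nbrs_v) - {u, v})
```

The two sets hold at most 2n vertex names, so the space used is O(n log n) bits.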
Balanced Properties
A property is balanced if there exists a stream of edges such that, before seeing the last edge:
There exists v such that, if the last edge is (v,x), the property holds for Ω(n) choices of x and does not hold for Ω(n) choices of x.
Lower Bound for Balanced Props
Consider all isomorphic versions of the graph that demonstrates the balance property.
Before seeing the last edge, the streaming algorithm has to remember the subset of vertices x such that the addition of edge (v,x) causes the property to hold.
As we range over isomorphisms... this is an arbitrary subset of the given cardinality... and there are exponentially many possibilities.
“Exceptions”
Counting Local Structures
• Counting triangles (Bar-Yossef et al, Buriol et al)
• Counting |E(G2)| (Ganguly et al)
• Duplicate elimination and aggregation (Cormode, Muthukrishnan)
One algorithm design technique
Sparsification (Eppstein, Galil, Italiano, Nissenzweig ‘97)
For graph property P: G’ strong certificate for G if ∀ H: (G ⋃ H) ∈ P ⇔ (G’ ⋃ H) ∈ P.
Existence of quickly computable, sparse, strong certificates leads to good semi-streaming algorithms
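For connectivity, a spanning forest is a standard strong certificate: adding any H to G or to its forest yields the same connected components. A minimal sketch of maintaining one over the stream with union-find; the class and function names are illustrative, and union by rank is omitted for brevity.

```python
class DisjointSets:
    """Union-find with path halving."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def spanning_forest_certificate(edge_stream):
    """Keep an edge only if it joins two components seen so far."""
    dsu = DisjointSets()
    forest = []
    for u, v in edge_stream:
        if dsu.union(u, v):
            forest.append((u, v))
    return forest
```

The forest never holds more than n - 1 edges, which fits the semi-streaming space budget.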
Sparsification-based algorithms
Bipartiteness, 1-, 2-, 3-vertex connected components, 2-, 3-edge connected components: O(α(n)) per edge
MST, 4-vertex connected comps., 3-edge connected comps.: O(log n)
Higher connectivities: Õ(n). (Zelke)
Bipartite Matching
Approximable with local greed
Matching (maximal)
Augmenting path
Constant-pass 2/3-approx for bip. matching
• Maximal matching is a 0.5-approximation:
If M’ is maximum and M is maximal, then M matches at least one endpoint of each edge in M’, so M has at least |M’|/2 edges.
• If M has only α|M| vertex-disjoint 3-aug-paths => |M| (1 + α) ≥ 2 OPT/3
M’ maximum: M’∆ M – bunch of augmenting paths. Count!
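The maximal matching used as the starting point can be built greedily in one pass. A minimal sketch, assuming the stream is an iterable of vertex pairs; the function name is illustrative.

```python
def greedy_maximal_matching(edge_stream):
    """Take an edge whenever both of its endpoints are still free."""
    matched = set()
    matching = []
    for u, v in edge_stream:
        if u != v and u not in matched and v not in matched:
            matching.append((u, v))
            matched.update((u, v))
    return matching
```

On the path a-b-c-d, the arrival order decides whether the result has one edge (b-c first) or two, which is the factor-2 gap described above.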
• Can find maximal matching
• To go beyond: Need to get most aug. paths of length 3.
Randomly project all free vertices into Layer 0 or Layer 3
• Matched edges go from layer 1 to layer 2.
• Expect half the augmenting paths of length 3 to respect layering
• Use maximal matchings between successive layers to get constant fraction of these.
• Gives constant-pass (2/3 − ε)-approximation
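One round of the layering idea can be sketched as follows. This is a simplified illustration, not the construction from the talk: orienting each matched edge by a coin flip, the greedy matchings between layers, and all names are assumptions, and the edge list is re-read once per pass.

```python
import random


def augment_round(edges, matching, rng):
    """Try to augment `matching` along vertex-disjoint paths of length 3."""
    mate = {}
    for a, b in matching:
        mate[a], mate[b] = b, a
    layer = {}
    # Free vertices are projected into layer 0 or layer 3 at random.
    for x in sorted({x for e in edges for x in e}):
        if x not in mate:
            layer[x] = rng.choice((0, 3))
    # Assumption: each matched edge gets a random orientation (layer 1 -> layer 2).
    for a, b in matching:
        if rng.random() < 0.5:
            a, b = b, a
        layer[a], layer[b] = 1, 2

    # Pass 1: greedy maximal matching between layer 0 and layer 1.
    left, used0 = {}, set()
    for u, v in edges:
        for x, y in ((u, v), (v, u)):
            if layer[x] == 0 and layer[y] == 1 and x not in used0 and y not in left:
                left[y] = x
                used0.add(x)
    # Pass 2: greedy maximal matching between layer 2 and layer 3,
    # restricted to layer-2 vertices whose mate was reached in pass 1.
    right, used3 = {}, set()
    for u, v in edges:
        for x, y in ((u, v), (v, u)):
            if (layer[x] == 2 and layer[y] == 3 and mate[x] in left
                    and x not in right and y not in used3):
                right[x] = y
                used3.add(y)

    # Replace each matched edge on a completed path by its two outer edges.
    result = []
    for a, b in matching:
        one, two = (a, b) if layer[a] == 1 else (b, a)
        if two in right:
            result.append((left[one], one))
            result.append((two, right[two]))
        else:
            result.append((a, b))
    return result
```

A round only helps when the random layering happens to agree with an augmenting path, so the round is repeated with fresh randomness.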
To get approximation scheme: Need to find most augmenting paths of length
• Again project vertices into k+1 layers to find augmenting paths of length k
• Use carefully chosen maximal matching algorithms between successive layers
• Repeat constant number of times
Gives streaming linear time approx scheme for unweighted matching in general graphs (McGregor)
Weighted Matching
A 1/6 Approximation in 1 Pass
At all times we store some matching M.
On seeing edge e = (u,v), we compare w(e) with the total weight W of the edges e1 and e2 in M incident on u and v.
If w(e) > 2W then
M ← (M ∪ {e}) \ {e1, e2}
• To show 1/6 approx: Account for the weight of edges lost in terms of weight of edges that survive
• Can improve approx to 1/2 − ε (McGregor) in constant number of passes:
• Choose an edge if it is (1 + ε) times the weight of edges that it kills.
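The one-pass replacement rule can be sketched as below. The parameter name `gamma` and the (u, v, w) triple format are assumptions; `gamma=1.0` reproduces the "w(e) > 2W" test, and smaller values give the more aggressive threshold of the multi-pass variant (only a single pass is shown here).

```python
def one_pass_weighted_matching(edge_stream, gamma=1.0):
    """edge_stream yields (u, v, w); returns the stored matching as a set of triples."""
    at = {}  # vertex -> the matching edge (u, v, w) that covers it
    for u, v, w in edge_stream:
        if u == v:
            continue
        # Edges of M that the new edge would kill (e1 and e2; possibly fewer).
        conflicts = {at[x] for x in (u, v) if x in at}
        killed_weight = sum(e[2] for e in conflicts)
        if w > (1 + gamma) * killed_weight:
            for a, b, _ in conflicts:
                del at[a]
                del at[b]
            at[u] = at[v] = (u, v, w)
    return set(at.values())
```

For example, with edges (1,2) and (3,4) of weight 5 already stored, an edge (2,3) of weight 19 is rejected because 19 ≤ 2·10, while weight 21 would replace both.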
Approximating Distances
The “Sketch” Approach
A two-stage approach
First stage: While going through the stream, construct a small sketch of the input graph.
Second stage: Compute the distance using the sketch, without further access to the stream.
Perform BFS-like computations in the second stage.
Graph Spanners as Sketches
Multiplicative t-spanner: Edge subgraph H of a graph G, s.t., for any pair of vertices u and v, dist_H(u,v) ≤ t·dist_G(u,v).
There is a t-spanner with O(n^(1+1/t)) edges.
Reduce streaming graph distance to streaming spanner construction.
BFS-like subroutines are used in most existing spanner constructions.
Streaming Spanner Construction
For each incoming edge, decide whether it should be in the spanner.
If the edge causes a cycle of length ≤ t, do not put the edge in the spanner.
This gives a t-spanner, because there is a path P of length < t connecting the two endpoints of any discarded edge.
This spanner is sparse. Thm [Bollobás78]: A graph whose girth is larger than k can only have O(n^(1+2/(k-1))) edges.
Need to know: For an incoming edge, does a short path exist?
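The naive version of this rule answers the "short path" question with a depth-limited BFS in the spanner built so far. A minimal sketch with illustrative function names; it shows the rule itself, not the cluster-based algorithm.

```python
from collections import deque


def within_distance(adj, s, target, limit):
    """True if the current spanner has an s-target path of at most `limit` edges."""
    if s == target:
        return True
    dist = {s: 0}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        if dist[x] == limit:
            continue  # do not explore beyond the depth limit
        for y in adj.get(x, ()):
            if y not in dist:
                if y == target:
                    return True
                dist[y] = dist[x] + 1
                queue.append(y)
    return False


def streaming_spanner(edge_stream, t):
    """Drop an edge iff it would close a cycle of length <= t in the spanner."""
    adj = {}
    for u, v in edge_stream:
        # A cycle of length <= t through (u, v) means a u-v path of length <= t - 1.
        if within_distance(adj, u, v, t - 1):
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj
```

Each arriving edge triggers a BFS here, so the per-edge cost can be large.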
Baswana & Sen show almost linear time non-streaming algorithm for spanners… growing BFS-trees from appropriate nodes.
Difficult to do in streaming fashion…
Instead we grow a BFS-like tree not just from its root!
Clusters: Rooted BFS trees
Preclusters: Free floating pieces of BFS trees … will attach to clusters
Summary of the One-Pass Algorithm
Use a vertex-labeling scheme to construct clusters.
Structure of the algorithm:
– In the pre-processing phase, generate a multi-level set of labels for the vertices.
– Go through the stream; for each edge:
• According to the current assignment of labels to vertices, decide whether to put this edge in the spanner.
• Depending on the type of edge, possibly assign more labels to one of its endpoints.
Next, an example with t = log n
Labels
– log n / 2 levels
– w.h.p., there are top-level labels.
– Semantics of labels:
• The set of vertices assigned the same top-level label forms a cluster.
• The set of vertices assigned the same lower-level label forms a “pre-cluster.”
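The pre-processing step might look like the sketch below. The slides do not state how labels are promoted between levels, so the independent survival probability `p`, the rounding of the level count, and all names here are assumptions made only for illustration.

```python
import math
import random


def generate_labels(vertices, rng, p=0.5):
    """Return {level: [(level, vertex), ...]} with nested label sets."""
    n = len(vertices)
    # Assumption: about log2(n) / 2 levels above level 0, and at least one.
    top = max(1, round(math.log2(n) / 2)) if n > 1 else 1
    labels = {0: [(0, v) for v in vertices]}
    for i in range(1, top + 1):
        # Assumption: a label (i-1, v) is promoted to (i, v) with probability p.
        labels[i] = [(i, v) for (_, v) in labels[i - 1] if rng.random() < p]
    return labels
```

With 12 vertices this produces levels 0, 1 and 2, the same shape as the label picture that follows.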
Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
Level 1: (1,2) (1,4) (1,7) (1,11)
Level 2: (2,2) (2,7)
Initial Label Assignment
Vertices: v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12
Level 0: (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8) (0,9) (0,10) (0,11) (0,12)
Level 1: (1,2) (1,4) (1,7) (1,11)
Level 2: (2,2) (2,7)
On arrival of an edge
Already know what to do with:
– Intra-cluster/pre-cluster edges
– Inter-cluster edges
Edges connecting pre-clusters: the sticky edges
– They are added to the spanner.
– They may lead to new label assignment and cluster growth.
“Good” Neighbor (1)
[Figure: edge between v and u, with labels (0,2), (1,2), (2,2), (3,2), (0,6), (1,6). Caption: has marked labels.]
Good Neighbor (2)
[Figure: v and u, with C(1,2), C(2,2), C(3,2) and C(1,6).]
“Bad” Neighbor
[Figure: edge between v and u, with labels (3,2) and (1,6). Caption: no marked labels.]
Properties of the Clusters
Small diameter
Number of clusters bounded by .
Do not need to cover the whole graph with clusters, but the uncovered subgraph is sparse.
The uncovered subgraph consists of sticky edges, and there are not too many of them.
Sticky Edges are Rare
[Figure: vertex v and its neighbors, seen in the sequence u1, u2, u3, u4, …]
A neighbor is good with probability at least ½.
After seeing at most log n / 2 good neighbors, v will be assigned a top-level label and be included in a cluster. No more sticky edges for v.
The number of sticky edges can be bounded by the length of the shortest prefix in the above sequence that contains log n / 2 good neighbors.
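A quick simulation makes the bound concrete: if each neighbor is good independently with probability ½, the prefix containing log n / 2 good neighbors has expected length about log n. The independence assumption, the rounding, and the function name are illustrative choices, not taken from the slides.

```python
import math
import random


def sticky_edges_until_clustered(n, rng, p_good=0.5):
    """Length of the shortest neighbor prefix containing log2(n)/2 good neighbors."""
    needed = max(1, math.ceil(math.log2(n) / 2))
    seen = good = 0
    while good < needed:
        seen += 1
        if rng.random() < p_good:  # assumption: neighbors are good independently
            good += 1
    return seen
```

For n = 1024 this needs 5 good neighbors, so the average prefix length over many runs should be close to 10 per vertex.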
4. Lower Bounds
One-pass diameter lower bound
Theorem: For any γ, any one-pass algorithm that returns a k (slightly better than 1/γ) approx to diameter in a weighted graph requires Ω(n^(1+γ)) space.
Proof (Sketch):
Some properties of a random graph G in G_{n,p} with p = 1/n^(1−γ):
• w.h.p. contains a set E’ of edges with |E’| = n^(1+γ)/64 such that:
– no edge in E’ is in a cycle of length k or less.
– When all edges in E’ are removed, the graph still has diameter < 2/γ.
Fix one such G = (V, E ∪ E’)
Sketch (cont’d): Reduce from INDEX (hard for comm. cmplxty)
INDEX: Alice has m-bit string x and Bob has index i. One-way comm. complexity for Bob to learn x_i is m.
Reduction: m edges in E’ enumerated 1 .. m.
Alice constructs prefix of stream corresponding to multiple copies of H = (V, E ∪ E’’), where E’’ ⊆ E’ consists of the edges whose indices i have x_i = 1. All Alice’s edges have weight 1.
Bob constructs rest of stream: If his index corresponds to edge (a,b) in E’:
• He connects vertex b in one copy with vertex a in next copy at 0 weight
• Also creates source s and sink t and connects s to a in 1st copy and b in last copy to t at high weight.
Properties: If x_i = 1 where i is Bob’s index then small diameter; else large diameter.
Small space streaming violates comm. lower bound.
Open Problems
Are there interesting subclasses of graphs for which distances and diameters are “easier” in streaming model?
Is there a more generous but reasonable model?
Network Intrusion Detection Systems
Current techniques fairly primitive:
– Misuse: Pattern match packets with misuse signatures in database
– Anomaly: Look for statistical anomalies in individual packet headers and payload
Needed:
– Look across multiple packets for intrusions
– Deal with interleaved traffic
An Example: Browsing habits
You read sports and cartoons. You’re equally likely to read both. You do not remember what you read last.
You’d expect a “random” sequence
SCSSCSSCSSCCSCCCSSSSCSC…
Two readers
I like health, entertainment, and politics.
I always read entertainment first, health next and politics last.
The sequence would be
EHPEHPEHPEHPEHPEHPEHP…
Two readers, one log file
If there is one log file… Assume there is no correlation between us
SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE…
Is there enough information to tell that there are two people browsing? What are they browsing? How are they browsing?
Clues in stream? Yes, under model assumptions.
H, E, P have special relationship. They cannot belong to different (uncorrelated) people.
Not clear about S and C ... These could be two people or one person.
SECHSSPECSHPESCSSHCPCESCHCCPSESHPESSHPE…
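The clue can be checked on synthetic data: interleave a memoryless S/C reader with a cyclic E-H-P reader, then project the log onto each alphabet. A minimal sketch; the fair coin that picks which reader produces the next entry, and the function names, are assumptions made for illustration.

```python
import random


def interleaved_log(length, rng):
    """One log file written by two uncorrelated readers."""
    cycle = "EHP"
    pos = 0
    out = []
    for _ in range(length):
        if rng.random() < 0.5:
            out.append(rng.choice("SC"))  # memoryless reader: S or C, equally likely
        else:
            out.append(cycle[pos % 3])    # cyclic reader: E, then H, then P
            pos += 1
    return "".join(out)


def project(log, alphabet):
    """Keep only the log entries that belong to `alphabet`."""
    return "".join(ch for ch in log if ch in alphabet)
```

Projecting onto {E, H, P} always gives a prefix of EHPEHP…, however the two readers are interleaved, while the S/C projection shows no such structure.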
Markov Chains as Stochastic Sources
[Figure: a Markov chain on states 1–7; edges carry transition probabilities (.1, .2, .3, .4, .5, .7, .8, .9).]
Output sequence: 1 4 7 7 1 2 5 7 ...
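Generating an output sequence from such a source takes a few lines. The transition table below is a made-up three-state chain, not the seven-state chain from the figure, whose edges cannot be recovered here; the function name is likewise illustrative.

```python
import random


def sample_chain(transitions, start, steps, rng):
    """transitions: {state: {next_state: probability}}; returns the visited states."""
    state = start
    out = [state]
    for _ in range(steps):
        nxt, probs = zip(*transitions[state].items())
        state = rng.choices(nxt, weights=probs, k=1)[0]
        out.append(state)
    return out


# Hypothetical chain, used only to exercise the sampler.
EXAMPLE_CHAIN = {
    1: {2: 0.2, 3: 0.8},
    2: {1: 1.0},
    3: {3: 0.5, 1: 0.5},
}
```

Calling `sample_chain(EXAMPLE_CHAIN, 1, 10, random.Random(0))` returns a list of 11 states in which every consecutive pair is an edge of the chain.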
Markov chains on S,E,C,H,F
[Figure: chain on S and C with transition probabilities 1/2.]
Modeled by …
[Figure: chain on H, E, F with transition probabilities 1.]
Need more realistic generalizations of such analysis to deal with:
• Worm detection
• Anomaly detection at high traffic links in a network
• TCP compliance
• BGP policy behavior
Partial Solution: Clusters (1)
A cluster is a subset of vertices and a small diameter spanning tree built on these vertices.
Intra-cluster edge
Partial Solution: Clusters (2)
Inter-cluster edges
Bollobás’s result no longer applies. Need to control the number of clusters (i.e., make it ).
Open Shortest Path First (OSPF)
Packet routing protocol:
• Each link broadcasts its weight (initially could be 1/bw...)
• To route from A to B, each router sends along shortest path to B, dividing traffic evenly if many shortest paths.
Adjustments:
• Human operator observing congestion on link could raise wt
• Local decisions could lead to oscillation & suboptimality
• Link latency: Convex function of its utilization
• Goal: Minimize max link latency, total link latency, expected path latency, etc.
• Exact optimizations typically NP-hard
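The routing rule (shortest paths to the destination, even split across ties) determines link loads once weights and demands are fixed. A minimal sketch for a single destination, assuming positive integer weights on directed links; the function name and input formats are illustrative.

```python
import heapq


def ecmp_link_loads(weights, demands, dest):
    """weights: {(u, v): w}; demands: {source: traffic to dest}. Returns {(u, v): load}."""
    # Dijkstra over reversed links gives each router's distance to dest.
    rev = {}
    for (u, v), w in weights.items():
        rev.setdefault(v, []).append((u, w))
    dist = {dest: 0}
    heap = [(0, dest)]
    while heap:
        d, x = heapq.heappop(heap)
        if d > dist.get(x, float("inf")):
            continue
        for y, w in rev.get(x, ()):
            if d + w < dist.get(y, float("inf")):
                dist[y] = d + w
                heapq.heappush(heap, (d + w, y))
    # Visit routers farthest-first so all inflow is known before it is split.
    inflow = dict(demands)
    load = {}
    for u in sorted(dist, key=dist.get, reverse=True):
        flow = inflow.get(u, 0.0)
        if u == dest or flow == 0:
            continue
        # Next hops are the neighbors lying on some shortest path to dest.
        hops = [v for (a, v), w in weights.items()
                if a == u and v in dist and w + dist[v] == dist[u]]
        for v in hops:
            load[(u, v)] = load.get((u, v), 0.0) + flow / len(hops)
            inflow[v] = inflow.get(v, 0.0) + flow / len(hops)
    return load
```

On a diamond A→{B, C}→D with unit weights, traffic from A splits evenly over both branches; raising the weight of A→B to 2 moves all of it onto A→C→D, which is the operator adjustment described above.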
Streaming problem
• Can we automate the weight adjustments?
Simple scenario:
• Assume weights have been optimized for current traffic matrix
• Assume we now have a new (unknown) traffic matrix observed at routers
• Assume some simple goal ... minimize time to converge to new solution ... or something ...
Streaming algorithm should itself be allowed to generate traffic for communication between monitors and for diagnostics, but this overhead should be low.
Early Worm Detection
EarlyBird System [Singh et al] identifies the following characteristics:
1. Substantial volume of identical traffic
2. Rising infection levels (# sources & destinations increasing)
3. Random probing (infected source tries many IP addresses)

1. Top-k type streaming algorithm can identify high volume of identical traffic at one location. Can we do better in distributed fashion?
2. How do we communicate to detect rising inf. levels?
3. Sophisticated worms may not use random probing. What other discriminating tests are possible?
4. Sophisticated worms are polymorphic… not “identical” traffic.
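One well-known "top-k type" algorithm is the Misra-Gries frequent-items summary; it is named here as a stand-in, not as the method EarlyBird uses. A minimal sketch, assuming packets are reduced to hashable content identifiers and k ≥ 2.

```python
def misra_gries(stream, k):
    """Keep at most k - 1 counters; any item occurring more than len(stream)/k
    times is guaranteed to still hold a counter at the end."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all of them and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The stored counts are underestimates, so a second pass (or exact counting of the surviving candidates) is needed to confirm true volumes.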