Algorithms for Large Data Sets
Giuseppe F. (Pino) Italiano
Università di Roma “Tor Vergata”
Examples of Large Data Sets: Astronomy
• Astronomical sky surveys
• 120 Gigabytes/week
• 6.5 Terabytes/year
The Hubble Telescope
Examples of Large Data Sets: Phone call billing records
• 250M calls/day
• 60G calls/year
• 40 bytes/call
• 2.5 Terabytes/year
Examples of Large Data Sets: Credit card transactions
• 47.5 billion transactions in 2005 worldwide
• 115 Terabytes of data transmitted to VisaNet data processing center in 2004
Examples of Large Data Sets: Internet traffic
Traffic in a typical router:
• 42 kB/second
• 3.5 Gigabytes/day
• 1.3 Terabytes/year
Examples of Large Data Sets: The World-Wide Web
• 25 billion pages indexed
• 10kB/Page
• 250 Terabytes of indexed text data
• “Deep web” is supposedly 100 times as large
Reasons for Large Data Sets: Better technology
• Storage & disks:
– Cheaper
– More volume
– Physically smaller
– More efficient
Large data sets are affordable
Reasons for Large Data Sets: Better networking
• High speed Internet
• Cellular phones
• Wireless LAN
More data consumers
More data producers
Reasons for Large Data Sets: Better IT
• More processes are automatic:
– E-commerce
– Online and telephone banking
– Online and telephone customer service
– E-learning
– Chats, news, blogs
– Online journals
– Digital libraries
• More enterprises are digital:
– Companies
– Banks
– Governmental institutions
– Universities
More data is available in digital form
World’s yearly production of data:
billions of Gigabytes
More and More Digital Data
• The amount of data to be processed is increasing at a faster rate than computing power
• The digital data created in a few years is larger than the amount of data created in all previous history (57 billion GB)
The Digital Universe is growing fast
• Digital Universe = amount of digital information created and replicated in a year
• YouTube hosts 100 million video streams a day
• More than a billion songs a day are shared over the Internet
• London’s 200 traffic surveillance cameras send 64 trillion bits a day to the command center
• Chevron accumulates data at the rate of TB/day
• TV broadcasting is going all-digital in most countries
• …
We Ain’t Seen Nothing Yet…
• In 2009, despite the global recession, the Digital Universe grew by 62% to nearly 800,000 Petabytes (1 PB = 1 million GB), i.e., a stack of DVDs reaching from the earth to the moon and back.
• In 2010, the Digital Universe was expected to grow almost as fast, to 1.2 million PB, or 1.2 Zettabytes (ZB).
• At this rate, in 2020 the Digital Universe will be 35 ZB, i.e., 44 TIMES AS BIG as it was in 2009. The stack of DVDs would now reach halfway to Mars!
The RAM Model of Computation
The simple uniform memory model (i.e., unit time per memory access) is no longer adequate for large data sets.
Internal memory (RAM) has a typical size of only a few GB.
Let’s see this with two very simple experiments
Experiment 1: Sequential vs. Random Access
2GB RAM
Write (sequentially) a file with 2 billion 32-bit integers (7.45 GB)
Read (randomly) the same file
Which is faster? Why?
Platform
• MacOS X 10.5.5 (2.16 GHz Intel Core Duo)
• 2GB SDRAM, 2MB L2 cache
• HD Hitachi HTS542525K9SA00 232.89 GB serial ATA (speed 1.5 Gigabit)
• File system Journaled HFS+
• Compiler gcc 4.0.1
Sequential Access (Write)

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ItemType; /* type of file items */

int main(int argc, char** argv) {

    FILE* f;
    long long N, i;

    /* check command line parameters */
    if (argc < 3) exit(printf("Usage: ./MakeRandFile fileName numItems\n"));

    /* convert number of items from string to integer format */
    N = atoll(argv[2]);

    printf("file offset: %d bit\n", (int)(sizeof(off_t)*8));
    printf("creating random file of %lld 32 bit integers...\n", N);

    /* open file for writing */
    f = fopen(argv[1], "w+");
    if (f == NULL) exit(printf("can't open file\n"));

    /* make N sequential file accesses */
    for (i = 0; i < N; ++i) {
        ItemType val = rand();
        fwrite(&val, sizeof(ItemType), 1, f);
    }

    fclose(f);
}
Sequential Write
Random Access (Read)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned long ItemType; /* type of file items */

int main(int argc, char** argv) {

    FILE* f;
    long long N, i, R;

    /* check command line parameters */
    if (argc < 3) exit(printf("Usage: ./RandFileScan fileName numReads\n"));

    /* convert number of accesses from string to integer format */
    R = atoll(argv[2]);

    /* open file for reading */
    f = fopen(argv[1], "r");
    if (f == NULL) exit(printf("can't open file %s\n", argv[1]));

    /* compute number N of elements in the file */
    fseeko(f, 0LL, SEEK_END);
    N = ftello(f)/sizeof(ItemType);

    printf("file offset: %d bit\n", (int)(sizeof(off_t)*8));
    printf("make %lld random accesses to file of %lld 32 bit integers...\n", R, N);

    /* init pseudo-random generator seed */
    srand(clock());

    /* make R random file accesses */
    for (i = 0; i < R; ++i) {
        ItemType val;
        long long j = (long long)(N*((double)rand()/RAND_MAX));
        fseeko(f, j*sizeof(ItemType), SEEK_SET);
        fread(&val, sizeof(ItemType), 1, f);
    }

    fclose(f);
}
Random Read
Outcome of the Experiment

Random Read:
Time to read 10,000 random integers from a file of 2 billion 32-bit integers (7.45 GB) is ≈118.419 sec (i.e., 1 min 58.419 sec). That’s ≈11.8 msec per integer.
Throughput: ≈337.8 bytes/sec ≈ 0.0003 MB/sec.
CPU usage: ≈1.6%

Sequential Write:
Time to write sequentially a file of 2 billion 32-bit integers (7.45 GB) is ≈250.685 sec (i.e., 4 min 10.685 sec). That’s ≈120 nanosec per integer.
Throughput: ≈31.8 MB/sec.
CPU usage: ≈77%

Sequential access is roughly 100,000 times faster than random access!
What is More Realistic
Doug Comer: “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk”
Magnetic Disk Drives as Secondary Memory
Actually, disk access is about a million times slower… more like going Around the World in 80 Days!
Time for rotation ≈ Time for seek
Amortize seek time by transferring large blocks so that:
Time for rotation ≈ Time for seek ≈ Time to transfer data
Solution 1: Exploit locality – take advantage of data locality
Solution 2: Use disks in parallel
Experiment 2:
Copy a 2048 × 2048 array of 32-bit integers
copyij: copy by rows
copyji: copy by columns
Another Experiment
Access by rows:
void copyij (int src[2048][2048],
int dst[2048][2048])
{
int i,j;
for (i = 0; i < 2048; i++)
for (j = 0; j < 2048; j++)
dst[i][j] = src[i][j];
}
Array Copy
Access by columns:
void copyji (int src[2048][2048],
int dst[2048][2048])
{
int i,j;
for (j = 0; j < 2048; j++)
for (i = 0; i < 2048; i++)
dst[i][j] = src[i][j];
}
Array Copy
copyij and copyji differ only in access patterns:
copyij accesses by rows,
copyji accesses by columns.
On an Intel Core i7 at 2.7 GHz:
copyij takes 5.2 msec,
copyji takes 162 msec (≈30× slower!)
Array Copy
Arrays are stored in row-major order (this depends on the language/compiler).
Thus copyij makes better use of locality.
Is this due to External Mem?
A Refined Memory Model
Outline
Algorithms for Large Data Sets
1. External Memory (Disk): I/O-Efficient Algorithms
2. Cache: Cache-Oblivious Algorithms
3. Large and Inexpensive Memories: Resilient Algorithms
Outline
Important issues we are NOT touching
1. Algs for data streaming
2. Algs for multicore architectures: Threads, parallelism, etc…
3. Programming models (MapReduce)
4. How to write fast code
5. …
I/O-Efficient Algorithms
Model
N: Elements in structure (input size)
B: Elements per block
M: Elements in main memory
Problem starts out on disk
Solution is to be written to disk
Cost of an algorithm is the number of input and output (I/O) operations.
I/O-Efficient Algorithms
Will start with “Simple” Problems
• Scanning
• Sorting
• List ranking
Scanning
Scanning N elements stored in blocks costs Θ(N/B) I/Os.
We will refer to this bound as scan(N).
Sorting
Sorting is one of the most-studied problems in computer science.
In external memory, sorting plays a particularly important role, because it is often a lower bound, and even an upper bound, for other problems.
The original paper of Aggarwal and Vitter [AV88] proved that the number of memory transfers to sort in the comparison model is Θ((N/B) log_{M/B}(N/B)).
We will denote this bound by sort(N).
Clearly, sort(N) = Ω(scan(N)).
External Memory Sorting
The simplest external-memory algorithm that achieves this bound [AV88] is an (M/B)-way mergesort.
During a merge, main memory holds the first block of B elements of each list, and when a block empties, the next block from that list is loaded.
So a merge effectively corresponds to scanning through the entire data, for an overall cost of Θ(N/B) I/Os.
External Memory Sorting
Mainly of theoretical interest…
Luckily, there is a more practical I/O-efficient algorithm for sorting.
The total number of I/Os for this sorting algorithm is given by the recurrence
T(N) = (M/B) · T(N/(M/B)) + Θ(N/B),
with a base case of T(O(B)) = O(1).
External Memory Sorting
T(N) = (M/B) · T(N/(M/B)) + Θ(N/B), T(O(B)) = O(1).

At level i of the recursion tree there are (M/B)^i nodes, with problem sizes N_i = N/(M/B)^i.

The number of levels in the recursion tree is O(log_{M/B}(N/B)).

The divide-and-merge cost at any level i is Θ(N/B): the recursion tree has Θ(N/B) leaves, for a leaf cost of Θ(N/B); the root node has divide-and-merge cost Θ(N/B) as well, as do all levels in between: (M/B)^i · Θ(N_i/B) = (M/B)^i · Θ(N/((M/B)^i · B)) = Θ(N/B).

So the total cost is Θ((N/B) log_{M/B}(N/B)).
List Ranking
Given a linked list L, compute for each item in the list its distance from the head
1 2 3 4 5 6
Weighted List Ranking
Can be generalized to weighted ranks:
Given a linked list L, compute for each item in the list its weighted distance from the head
3 1 5 2 3 1
3 4 9 11 14 15
Why Is List Ranking Non-Trivial?
[Figure: list items 1–16 laid out on disk in blocks of four (1 5 9 13 | 2 6 10 14 | 3 7 11 15 | 4 8 12 16); traversing the list in order 1, 2, 3, … touches a different block at almost every step.]
The internal-memory algorithm spends Ω(N) I/Os in the worst case (LRU assumed).
I/O-Efficient List Ranking Alg
Proposed by Chiang et al. [1995]
If list L fits into internal memory (|L| ≤ M):
1. Read L into internal memory in O(scan(|L|)) I/Os
2. Use trivial list ranking in internal memory
3. Write the element ranks to disk in O(scan(|L|)) I/Os
The difficult part is when |L| > M (not a surprise…)
List Ranking for |L| > M
• Assume an independent set of size at least N/3 can be found in O(sort(N)) I/Os (we’ll see later how).
[Figure: the example list with weights 3 1 5 2 3 1; the elements of the independent set are marked and bridged out in O(scan(|L|)) I/Os.]
List Ranking for |L| > M
Bridge out the independent set, adding the weight of each removed element to the weight of its successor:
Original list (weights): 3 1 5 2 3 1
Compressed list (weights): 3 1 7 4
Step Analysis
• Assume each vertex has a unique numerical ID
– Sort elements in L \ I by their numbers
– Sort elements in I by the numbers of their successors
– Scan the two lists to update the label of succ(v), for every element v ∈ I
Step Analysis
• Each vertex has a unique numerical ID
– Sort elements in I by their numbers
– Sort elements in L \ I by the numbers of their successors
– Scan the two lists to update the label of succ(v), for every element v ∈ L \ I
List Ranking for |L| > M
Recursive step: rank the compressed list.
Compressed list (weights): 3 1 7 4
Ranks in the compressed list: 3 4 11 15
Original list (weights): 3 1 5 2 3 1
List Ranking for |L| > M
Reintegrate the independent set:
Ranks in the compressed list: 3 4 11 15
Final weighted ranks: 3 4 9 11 14 15
Original list (weights): 3 1 5 2 3 1
O(sort(|L|) + scan(|L|)) (as before)
Recap of the Algorithm
3 1 5 2 3 1 (original weights)
3 1 7 4 (bridge out the independent set)
3 4 11 15 (rank the compressed list recursively)
3 4 9 11 14 15 (reintegrate: final weighted ranks)
I/O-Efficient List Ranking
• Except for the recursion, everything is scan and sort
• The I/O-complexity is

I(N) ≤ I(2N/3) + O(sort(N))   if N > M
I(N) = O(scan(N))             if N ≤ M

• The solution of the recurrence is I(N) = O(sort(N))

Theorem: A list of size N can be ranked in O(sort(N)) I/Os.
I/O-Efficient List Ranking
Observation: We do not use the property that there is only one head and one tail in the list
In other words, the algorithm can be applied simultaneously to a collection of linked lists
This is exploited for solving more complicated graph problems (will see an example later)
Now Switch to Graph Algorithms
For theoreticians:
• Graph problems are neat, often difficult, hence interesting
For practitioners:
• Massive graphs arise in GIS, Web modelling, ...
• Problems in computational geometry can be expressed as graph problems
• Many abstract problems are best viewed as graph problems
• Extreme view: pointer-based data structures = graphs with extra information at their nodes
For us:
• We still don’t understand how to solve some graph problems I/O-efficiently (e.g., DFS)
Outline
Fundamental Graph Problems
• (List ranking)
• Algorithms for trees
– Euler tour
– Tree labelling
• Evaluating DAGs
• Connectivity
– Connected components
– Minimum spanning trees
The Euler Tour Technique
Given a tree T, represent it by a list L that captures the tree structure
Why? Certain computations on T can be performed by a (weighted) list ranking of L.
Seven Bridges of Königsberg
(Five Bridges of Kaliningrad, Калининград)
Euler Tour
Given a tree T, and a distinguished vertex r of T, an Euler tour of T is a traversal of T that starts and ends at r and traverses every edge exactly twice, once in each direction.
The Euler Tour Technique
Theorem: Given the adjacency lists of the vertices in T, an Euler tour can be constructed in O(scan(N)) I/Os.
• Let {v,w_1},…,{v,w_r} be the (undirected) edges incident to v
• Then succ((w_i,v)) = (v,w_{i+1}) (indices taken modulo r)
Problem 1
If T is represented as an unordered collection of edges (i.e., no adjacency lists), what’s the I/O complexity of computing an Euler tour of T?
Rooting a Tree
• Choosing a vertex r as the root of a tree T defines parent–child relationships between adjacent nodes
• Rooting tree T = computing for every edge {v,w} who is the parent and who is the child
• v = p(w) if and only if rank((v,w)) < rank((w,v)) in the Euler tour
Theorem: A tree can be rooted in O(sort(N)) I/Os.
Computing a Preorder Numbering
Theorem: A preorder numbering of a rooted tree T can be computed in O(sort(N)) I/Os.
[Figure: a rooted tree and its Euler tour; downward edges get weight 1 and upward edges weight 0, so a weighted list ranking of the tour yields the preorder numbers 0–10.]

preorder#(v) = rank((p(v),v))
Computing Subtree Sizes
[Figure: the same Euler tour with 0/1 edge weights; the subtree size of each node is derived from the ranks of the two traversals of its parent edge.]
|T(v)| = rank((v,p(v))) – rank((p(v),v)) + 1
Theorem: The nodes of T can be labelled with their subtree sizes in O(sort(N)) I/Os.
Problem 2
Given a tree T, rooted at r, the depth of a node v is defined as the number of edges on the path from r to v in T.
Design an I/O-efficient algorithm to compute the depth of each node of T.
Evaluating a Directed Acyclic Graph
• More general: Given a labelling φ, compute a labelling ψ such that ψ(v) is computed from φ(v) and ψ(u_1),…,ψ(u_r), where u_1,…,u_r are v’s in-neighbors
[Figure: a DAG with nodes numbered 0–12 in topological order, each carrying a 0/1 label to be evaluated.]
Assumptions
Assumption 1: Nodes are given in topological order: for every edge (v,w), v precedes w in the ordering.
(Note: there is no I/O-efficient algorithm to topologically sort an arbitrary DAG.)
Assumption 2: If there is no bound on the in-degrees, the computation in a node with in-degree K can be done in O(sort(K)) I/Os.
Use a priority queue Q to send data along the edges.
Time-Forward Processing
Chiang et al. [1995], Arge [1995]:
Use priority queue Q to send data along the edges.

[Figure: step-by-step trace of the contents of Q during the evaluation; entries have the form (destination, source, value).]
Time-Forward Processing
Correctness:
• Every in-neighbor of v is evaluated before v, since all vertices preceding v in topological order are evaluated before v
Time-Forward Processing
Analysis:
• Vertex set + adjacency lists scanned: O(scan(|V| + |E|)) I/Os
• Priority queue: every edge is inserted into and deleted from Q exactly once: O(|E|) priority queue operations, i.e., O(sort(|E|)) I/Os
Theorem: A directed acyclic graph G = (V,E) can be evaluated in O(sort(|V| + |E|)) I/Os.
Independent Set
Given a graph G = (V,E), an independent set is a set I ⊆ V such that no two vertices in I are adjacent
[Figure: an example graph with vertices 1–11; an independent set is highlighted.]
Maximal Independent Set (MIS)
An independent set I is maximal if every vertex in V \ I has at least one neighbor in I
[Figure: the same graph; a maximal independent set is highlighted.]
Maximal Independent Set (MIS)
Algorithm GREEDYMIS:
1. I ← ∅
2. for every vertex v ∈ G do
3.   if no neighbor of v is in I then
4.     Add v to I
5.   end if
6. end for
Observation: It suffices to consider only all neighbors of v which have been visited in a previous iteration.
Implementation Details
• Assume each vertex has a unique numerical ID
• Direct edges from lower to higher numbers
• Sort vertices by their number
• Consider vertices in sorted order
Maximal Independent Set (MIS)
Algorithm GREEDYMIS:
1. I ← ∅
2. for every vertex v ∈ G in sorted order do
3.   if no in-neighbor of v is in I then
4.     Add v to I
5.   end if
6. end for
Maximal Independent Set (MIS)
[Figure: GREEDYMIS run step by step on the example graph with vertices 1–11; the vertices added to I are highlighted.]

Maximal Independent Set (MIS)
Implementation Details
How to check I/O-efficiently whether any neighbor of v was already included in I? Time-forward processing:
• (Assume each vertex has a unique numerical ID)
• (Direct edges from lower to higher numbers)
• (Sort vertices by their number)
• Sort edges by the numbers of their sources
• After deciding whether v is included in I or not, v sends a flag to each of its out-neighbors to inform them whether or not v is in I (same as evaluating DAGs)
• Each vertex decides whether it should be added to I based solely on the flags received from its in-neighbors
Maximal Independent Set (MIS)
Correctness follows from the following observations:
• The set I computed by the algorithm is independent since
a) Vertex v is added to I only if none of its in-neighbors is in I
b) At this point none of its out-neighbors can be in I yet, and the insertion of v into I prevents all of these out-neighbors from being added to I later
• The set I is maximal, since otherwise there would be a vertex v ∉ I such that none of the in-neighbors of v are in I: but then v would have been added to I
Maximal Independent Set (MIS)
Theorem: A maximal independent set of a graph G = (V,E) can be computed in O(sort(|V|+|E|)) I/Os.
Large Independent Set of a List
Fill in the missing details of the list ranking algorithm.
Corollary: An independent set of size at least N/3 for a list L of size N can be found in O(sort(N)) I/Os.
In a list, every vertex in an MIS I prevents at most two other vertices from being in I:
Every MIS of a list has size at least N/3.
Graph Connectivity
• Connected Components
• Minimum Spanning Trees
Connectivity
A graph G=(V,E) is connected if for any two vertices u,v in V there is a path between u and v in G.
The connected components of G are its maximal connected subgraphs.
First, semi-external algorithm (vertices fit in main memory, edges don’t)
Next, a fully external algorithm: use graph contraction to reduce the number of vertices, and call the semi-external algorithm as soon as the vertices fit in main memory
Connectivity: A Semi-External Algorithm
Connectivity: A Semi-External Algorithm
Analysis:
• Scan vertex set to load vertices into main memory
• Scan edge set to carry out algorithm
• O(scan(|V| + |E|)) I/Os
Theorem: If |V| ≤ M, the connected components of a graph can be computed in O(scan(|V| + |E|)) I/Os.
Connectivity: The General Case
Idea [Chiang et al 1995]:
• If |V| ≤ M
– Use the semi-external algorithm
• If |V| > M
– Identify simple connected subgraphs of G
– Contract these subgraphs to obtain a graph G’ = (V’,E’) with |V’| ≤ c|V|, c < 1
– Recursively compute the connected components of G’
– Obtain the labelling of the connected components of G from the labelling of the components of G’
Connectivity: The General Case

[Figure: an example graph with vertices a–n; selected subgraphs are contracted into the vertices A–E of G’, and the component labels 1 and 2 are copied back to the original vertices.]
Connectivity: The General Case
Main steps:
• Find smallest neighbors
• Compute the connected components of the graph H induced by the selected edges
• Contract each component into a single vertex
• Call the procedure recursively
• Copy the label of every vertex v ∈ G’ to all vertices in G represented by v
Finding smallest neighbors
To find the smallest neighbor w(v) of every vertex v:
Scan the edges and replace each undirected edge {u,v} with the directed edges (u,v) and (v,u)
Sort the directed edges lexicographically: this produces the adjacency lists
Scan the adjacency list of v and return as w(v) the first vertex in the list
This takes O(sort(|E|)) I/Os overall
To produce the edge set of the (undirected) graph H, sort and scan the edges {v, w(v)} to remove duplicates
This takes another O(sort(|V|)) I/Os
Computing Connected Components of H
Cannot use the same algorithm recursively (we did not reduce the vertex set)
Exploit the following property:
Lemma: Graph H is a forest.
Proof: Assume not. Then H must contain a cycle x_0, x_1, …, x_k = x_0. Since there are no duplicate edges, k ≥ 3. Since each vertex v has at most one incident edge {v,w(v)} in H, w.l.o.g. x_{i+1} = w(x_i) for 0 ≤ i < k. Then the existence of {x_{i-1},x_i} implies that x_{i-1} > x_{i+1}. Similarly, x_{k-1} > x_1.
If k is even: x_0 > x_2 > … > x_k = x_0 yields a contradiction.
If k is odd: x_0 > x_2 > … > x_{k-1} > x_1 > x_3 > … > x_k = x_0 yields a contradiction.
Exploit the Property that H is a Forest
Apply the Euler tour technique to H in order to transform each tree into a list
Now compute connected components using ideas from list ranking:
Find a large independent set I of H and remove the vertices in I from H
Recursively find the connected components of the smaller graphs
Reintegrate the vertices in I (assign the component label of a neighbor)
This takes sort(|H|) = sort(|V|) I/Os
Recursive Calls
Every connected component of H has size at least 2 ⇒ |V’| ≤ |V|/2 ⇒ O(log(|V|/M)) recursive calls
Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.
Improved Connectivity via BFS
• BFS in O(|V| + sort(|E|)) I/Os [Munagala & Ranade 99]; BFS can be used to identify connected components
• When |V| = |E|/B, this algorithm takes O(sort(|E|)) I/Os
• Same algorithm as before, but stop the recursion earlier, when the number of vertices has been reduced to |E|/B (after log(|V|B/|E|) recursive calls)
• At this point, apply BFS rather than semi-external connectivity

Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.
Minimum Spanning Tree (MST)
A spanning tree of an undirected graph G=(V,E) is a tree T=(V,E’) such that E’ ⊆ E.
Given an undirected graphs with costs assigned to edges, the cost of a spanning tree is the sum of the costs of its edges.
A spanning tree T of G is minimum if there is no other spanning tree T’ of G such that cost(T’) < cost(T).
Spanning Trees
Observation: Connectivity algorithm can be augmented to produce a spanning tree (forest) of G.
SemiExternalST: add edge {v,w} whenever v and w are in different connected components
ExternalST: build a spanning tree (forest) of G from H and the spanning tree (forest) T’ produced by the recursive invocation of the algorithm on the compressed graph G’
ExternalST
Spanning tree produced is not necessarily minimum
[Figure: the example graph with vertices a–n and contracted components A–E; the spanning tree built by ExternalST is shown.]
Minimum Spanning Tree (MST)
Simple modifications
SemiExternalMST: Rather than inspecting the edges of G in arbitrary order, inspect edges by increasing weights. This increases the I/O-complexity from O(scan(|V|+|E|)) to O(scan(|V|) + sort(|E|))
Essentially, a semi-external version of Kruskal
Minimum Spanning Tree (MST)
ExternalMST differs from ExternalST in a number of places:
a) Choose the edge of minimum cost rather than the edge to the smallest neighbor (this maintains the invariant that H is a forest)
b) The weight of an edge e in the compressed graph G’ = the minimum weight of all edges represented by e
c) When “e is added” to T, add in fact this minimum-weight edge

[Figure: a contracted vertex v with edges of weights 4, 1, 5, 3 to vertices a–d; the compressed edge inherits the minimum weight.]
Minimum Spanning Tree (MST)

[Figure: the example graph with vertices a–n and contracted components A–E.]
Theorem: An MST of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.
A Fast MST Algorithm
Idea:
– If we can compute an MST in O(|V| + sort(|E|)) I/Os
– Apply the same trick as before (BFS) and stop the recursion after log(|V|B/|E|) iterations

Arge et al. [2000] obtained the desired bound with an I/O-efficient implementation of Prim’s algorithm:
A Fast MST Algorithm
• Maintain the light blue (candidate) and intra-tree edges in a priority queue Q
• When an edge {v,w} of minimum cost is retrieved, test whether v and w are both in T
– Yes ⇒ discard the (intra-tree) edge
– No ⇒ add the edge to the MST and add to Q all edges incident to w, except {v,w} (assuming that w ∉ T)
Problem: How to test whether v,w ∈ T.
A Fast MST Algorithm
• If v,w ∈ T but {v,w} ∉ T, then both v and w have inserted the edge {v,w} into Q
⇒ There are two copies of {v,w} in Q
⇒ They are consecutive (assumption: different costs)
⇒ Perform two DELETEMIN operations
– If {v,w} = {y,z}, discard both
– Otherwise, add {v,w} to T and re-insert {y,z}
A Fast MST Algorithm
Analysis:
• O(|V| + scan(|E|)) I/Os for retrieving adjacency lists
• O(sort(|E|)) I/Os for priority queue operations

Theorem: An MST of a graph G = (V,E) can be found in O(|V| + sort(|E|)) I/Os.

Corollary: An MST of a graph G = (V,E) can be found in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.
Graph Contraction and Sparse Graphs
• A graph G = (V,E) is sparse if, for any graph H obtainable from G through a series of edge contractions, |E(H)| = O(|V(H)|) (this includes planar graphs and graphs of bounded treewidth)
• For a sparse graph, the number of vertices and edges in G reduces by a constant factor in each iteration of the connectivity and MST algorithms.
Theorem: The connected components or a MST of a sparse graph with N vertices can be computed in O(sort(N)) I/Os.
Three Techniques for Graph Algs
• Time-forward processing:
– Express graph problems as evaluation problems of DAGs
• Graph contraction:
– Reduce the size of G while maintaining the properties of interest
– Solve the problem recursively on the compressed graph
– Construct the solution for G from the solution for the compressed graph
• Bootstrapping:
– Switch to a generally less efficient algorithm as soon as (part of) the input is small enough