
Page 1: Algorithms for  Large Data Sets

Algorithms for Large Data Sets

Giuseppe F. (Pino) Italiano

Università di Roma “Tor Vergata”

[email protected]

Page 2: Algorithms for  Large Data Sets

Examples of Large Data Sets: Astronomy

• Astronomical sky surveys

• 120 Gigabytes/week

• 6.5 Terabytes/year

The Hubble Telescope

Page 3: Algorithms for  Large Data Sets

Examples of Large Data Sets: Phone call billing records

• 250M calls/day

• 60G calls/year

• 40 bytes/call

• 2.5 Terabytes/year

Page 4: Algorithms for  Large Data Sets

Examples of Large Data Sets: Credit card transactions

• 47.5 billion transactions in 2005 worldwide

• 115 Terabytes of data transmitted to VisaNet data processing center in 2004

Page 5: Algorithms for  Large Data Sets

Examples of Large Data Sets: Internet traffic

Traffic in a typical router:

• 42 kB/second

• 3.5 Gigabytes/day

• 1.3 Terabytes/year

Page 6: Algorithms for  Large Data Sets

Examples of Large Data Sets: The World-Wide Web

• 25 billion pages indexed

• 10 kB/page

• 250 Terabytes of indexed text data

• “Deep web” is supposedly 100 times as large

Page 7: Algorithms for  Large Data Sets

Reasons for Large Data Sets: Better technology

• Storage & disks
  – Cheaper
  – More volume
  – Physically smaller
  – More efficient

Large data sets are affordable

Page 8: Algorithms for  Large Data Sets

Reasons for Large Data Sets: Better networking

• High speed Internet

• Cellular phones

• Wireless LAN

More data consumers

More data producers

Page 9: Algorithms for  Large Data Sets

Reasons for Large Data Sets: Better IT

• More processes are automatic
  – E-commerce
  – Online and telephone banking
  – Online and telephone customer service
  – E-learning
  – Chats, news, blogs
  – Online journals
  – Digital libraries

• More enterprises are digital
  – Companies
  – Banks
  – Governmental institutions
  – Universities

More data is available in digital form

World’s yearly production of data: billions of gigabytes

Page 10: Algorithms for  Large Data Sets

More and More Digital Data

• The amount of data to be processed is increasing at a faster rate than computing power

• The digital data created in the last few years is larger than the amount of data created in all of previous history (57 billion GB)

Page 11: Algorithms for  Large Data Sets

The Digital Universe is growing fast

• Digital Universe = amount of digital information created and replicated in a year
• YouTube hosts 100 million video streams a day
• More than a billion songs a day are shared over the Internet
• London’s 200 traffic surveillance cameras send 64 trillion bits a day to the command center
• Chevron accumulates data at the rate of TB/day
• TV broadcasting is going all-digital in most countries
• …

Page 12: Algorithms for  Large Data Sets

We Ain’t Seen Nothing Yet…

• In 2009, despite the global recession, the Digital Universe grew by 62% to nearly 800,000 Petabytes (1 PB = 1 million GB), i.e., a stack of DVDs reaching from the Earth to the Moon and back.

• In 2010, the Digital Universe was expected to grow almost as fast, to 1.2 million PB, or 1.2 Zettabytes (ZB).

• With this trend, in 2020 the Digital Universe will be 35 ZB, i.e., 44 TIMES AS BIG as it was in 2009. The stack of DVDs would now reach halfway to Mars!

Page 13: Algorithms for  Large Data Sets

The RAM Model of Computation

The simple uniform memory model (i.e., unit time per memory access) is no longer adequate for large data sets.

Internal memory (RAM) has a typical size of only a few GB.

Let’s see this with two very simple experiments.

Page 14: Algorithms for  Large Data Sets

Experiment 1: Sequential vs. Random Access

• 2 GB RAM
• Write (sequentially) a file with 2 billion 32-bit integers (7.45 GB)
• Read (randomly) from the same file
• Which is faster? Why?

Page 15: Algorithms for  Large Data Sets


Platform

• MacOS X 10.5.5 (2.16 GHz Intel Core Duo)

• 2GB SDRAM, 2MB L2 cache

• HD Hitachi HTS542525K9SA00, 232.89 GB, serial ATA (1.5 Gigabit/s)

• File system Journaled HFS+

• Compiler gcc 4.0.1

Page 16: Algorithms for  Large Data Sets

Sequential access (write)

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ItemType; /* type of file items */

int main(int argc, char** argv) {

    FILE* f;
    long long N, i;

    /* check command line parameters */
    if (argc < 3) exit(printf("Usage: ./MakeRandFile fileName numItems\n"));

    /* convert number of items from string to integer format */
    N = atoll(argv[2]);

    printf("file offset: %d bit\n", (int)(sizeof(off_t) * 8));
    printf("creating random file of %lld 32 bit integers...\n", N);

    /* open file for writing */
    f = fopen(argv[1], "w+");
    if (f == NULL) exit(printf("can't open file\n"));

    /* make N sequential file accesses */
    for (i = 0; i < N; ++i) {
        ItemType val = rand();
        fwrite(&val, sizeof(ItemType), 1, f);
    }

    fclose(f);
}

Sequential Write

Page 17: Algorithms for  Large Data Sets

Sequential access (write)

/* make N sequential file accesses */
for (i = 0; i < N; ++i) {
    ItemType val = rand();
    fwrite(&val, sizeof(ItemType), 1, f);
}

Sequential Write

Page 18: Algorithms for  Large Data Sets

Random access (read)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned long ItemType; /* type of file items */

int main(int argc, char** argv) {

    FILE* f;
    long long N, i, R;

    /* check command line parameters */
    if (argc < 3) exit(printf("Usage: ./RandFileScan fileName numReads\n"));

    /* convert number of accesses from string to integer format */
    R = atoll(argv[2]);

    /* open file for reading */
    f = fopen(argv[1], "r");
    if (f == NULL) exit(printf("can't open file %s\n", argv[1]));

    /* compute number N of elements in the file */
    fseeko(f, 0LL, SEEK_END);
    N = ftello(f) / sizeof(ItemType);

    printf("file offset: %d bit\n", (int)(sizeof(off_t) * 8));
    printf("make %lld random accesses to file of %lld 32 bit integers...\n", R, N);

    srand(clock()); /* init pseudo-random generator seed */

    /* make R random file accesses */
    for (i = 0; i < R; ++i) {
        ItemType val;
        long long j = (long long)(N * ((double)rand() / RAND_MAX));
        fseeko(f, j * sizeof(ItemType), SEEK_SET);
        fread(&val, sizeof(ItemType), 1, f);
    }

    fclose(f);
}

Random Read

Page 19: Algorithms for  Large Data Sets

Random access (read)

/* make R random file accesses */
for (i = 0; i < R; ++i) {
    ItemType val;
    long long j = (long long)(N * ((double)rand() / RAND_MAX));
    fseeko(f, j * sizeof(ItemType), SEEK_SET);
    fread(&val, sizeof(ItemType), 1, f);
}

Random Read

Page 20: Algorithms for  Large Data Sets

Outcome of the Experiment

Random Read:

Time to read 10,000 integers at random positions in a file of 2 billion 32-bit integers (7.45 GB) is ≈ 118.419 sec. (i.e., 1 min. 58.419 sec.). That’s ≈ 11.8 msec. per integer.

Throughput: ≈ 337.8 byte/sec ≈ 0.0003 MB/sec.

CPU usage: ≈ 1.6%

Sequential Write:

Time to write sequentially a file of 2 billion 32-bit integers (7.45 GB) is ≈ 250.685 sec. (i.e., 4 min. 10.685 sec.). That’s ≈ 120 nanosec. per integer.

Throughput: ≈ 31.8 MB/sec.

CPU usage: ≈ 77%

Sequential access is roughly 100,000 times faster than random!

Page 21: Algorithms for  Large Data Sets

What is More Realistic


Doug Comer: “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk”

Page 22: Algorithms for  Large Data Sets

Magnetic Disk Drives as Secondary Memory

Actually, disk access is about a million times slower… More like going Around the World in 80 Days!

Time for rotation ≈ Time for seek

Amortize search time by transferring large blocks so that:

Time for rotation ≈ Time for seek ≈ Time to transfer data

Solution 1: Exploit locality – take advantage of data locality

Solution 2: Use disks in parallel

Page 23: Algorithms for  Large Data Sets

Another Experiment

Experiment 2:

Copy a 2048 x 2048 array of 32-bit integers

copyij: copy by rows

copyji: copy by columns

Page 24: Algorithms for  Large Data Sets

26

Access by rows:

void copyij (int src[2048][2048],

int dst[2048][2048])

{

int i,j;

for (i = 0; i < 2048; i++)

for (j = 0; j < 2048; j++)

dst[i][j] = src[i][j];

}

Array Copy

Page 25: Algorithms for  Large Data Sets

27

Access by columns:

void copyji (int src[2048][2048],

int dst[2048][2048])

{

int i,j;

for (j = 0; j < 2048; j++)

for (i = 0; i < 2048; i++)

dst[i][j] = src[i][j];

}

Array Copy

Page 26: Algorithms for  Large Data Sets

Array Copy

copyij and copyji differ only in access patterns:

copyij accesses by rows,

copyji accesses by columns.

On an Intel Core i7 with 2.7 GHz:

copyij takes 5.2 msec,

copyji takes 162 msec (≈ 30x slower!)

Arrays are stored in row-major order (depends on language / compiler)

Thus copyij makes better use of locality
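Concretely, row-major order means a[i][j] sits at linear offset i*2048 + j (in units of int) from the start of the array, so copyij walks memory with stride 1 while copyji jumps 2048 ints (8 KB) between consecutive accesses. A minimal check of this layout assumption (my illustration, not from the original slides):

#include <assert.h>

int a[2048][2048];

/* row-major layout: a[i][j] is the (i*2048 + j)-th int from the base,
   so iterating over j (as copyij does) touches consecutive addresses */
void layout_check(void) {
    assert(&a[3][7] == &a[0][0] + 3 * 2048 + 7);
}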

Page 27: Algorithms for  Large Data Sets

Is this due to External Memory?

Page 28: Algorithms for  Large Data Sets

A Refined Memory Model


Page 29: Algorithms for  Large Data Sets

Outline


Algorithms for Large Data Sets

1. External Memory (Disk): I/O-Efficient Algorithms

2. Cache: Cache-Oblivious Algorithms

3. Large and Inexpensive Memories: Resilient Algorithms

Page 30: Algorithms for  Large Data Sets

Outline


Important issues we are NOT touching

1. Algs for data streaming

2. Algs for multicore architectures: Threads, parallelism, etc…

3. Programming models (MapReduce)

4. How to write fast code

5. …

Page 31: Algorithms for  Large Data Sets

I/O-Efficient Algorithms

Page 32: Algorithms for  Large Data Sets


Model

N: Elements in structure (input size)

B: Elements per block

M: Elements in main memory

Problem starts out on disk

Solution is to be written to disk

Cost of an algorithm is the number of input and output (I/O) operations.

[Figure: the I/O model — processor P with internal memory M, connected to disk D; data moves between memory and disk in blocks via Block I/O]

Page 33: Algorithms for  Large Data Sets

I/O-Efficient Algorithms

Will start with “Simple” Problems

• Scanning

• Sorting

• List ranking


Page 34: Algorithms for  Large Data Sets

Scanning

Scanning N elements stored in blocks costs Θ(N/B) I/Os.

Will refer to this bound as scan(N).
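As an illustration only (not part of the original slides): a scan that sums N integers from a file, reading one block of B items at a time with fread; the I/O cost is the ⌈N/B⌉ block reads, independent of the per-element CPU work.

#include <stdio.h>
#include <stdint.h>

#define B_ITEMS 4096 /* items per block: a stand-in for the model's B */

/* scan(N): read the file block by block and sum the items; the number
   of I/Os is the number of block reads, i.e. Θ(N/B) */
long long sum_file(const char* name) {
    uint32_t buf[B_ITEMS];
    long long sum = 0;
    size_t got;
    FILE* f = fopen(name, "rb");
    if (f == NULL) return -1;
    while ((got = fread(buf, sizeof(uint32_t), B_ITEMS, f)) > 0)
        for (size_t i = 0; i < got; i++)
            sum += buf[i];
    fclose(f);
    return sum;
}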

Page 35: Algorithms for  Large Data Sets

Sorting

Sorting is one of the most-studied problems in computer science.

In external memory, sorting plays a particularly important role, because it is often a lower bound, and even an upper bound, for other problems.

The original paper of Aggarwal and Vitter [AV88] proved that the number of memory transfers to sort in the comparison model is Θ((N/B) log_{M/B} (N/B)).

Will denote this bound by sort(N).

Clearly, sort(N) = Ω(scan(N))

Page 36: Algorithms for  Large Data Sets

External Memory Sorting

The simplest external-memory algorithm that achieves this bound [AV88] is an (M/B)-way mergesort.

During a merge, main memory maintains the first B elements of each list in a block, and when a block empties, the next block from that list is loaded.

So a merge effectively corresponds to scanning through the entire data, for an overall cost of Θ(N/B) I/Os.
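A minimal sketch of the merge step (my illustration, not the course's code): merge k sorted runs of 32-bit integers into one output file. Buffered streams (setvbuf) stand in for explicit block transfers, and for simplicity the smallest head element is found by a linear scan; an (M/B)-way merge would devote all of memory to the k = M/B input blocks.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BUFSIZE (1 << 20) /* bytes buffered per stream: stand-in for block size B */

int main(int argc, char** argv) {
    if (argc < 3) exit(printf("Usage: ./merge outFile run1 run2 ...\n"));
    int k = argc - 2; /* number of sorted input runs */
    FILE* out = fopen(argv[1], "wb");
    FILE** in = malloc(k * sizeof(FILE*));
    uint32_t* head = malloc(k * sizeof(uint32_t)); /* current front of each run */
    int* alive = malloc(k * sizeof(int));
    setvbuf(out, NULL, _IOFBF, BUFSIZE);
    for (int i = 0; i < k; i++) {
        in[i] = fopen(argv[i + 2], "rb");
        if (in[i] == NULL) exit(printf("can't open file %s\n", argv[i + 2]));
        setvbuf(in[i], NULL, _IOFBF, BUFSIZE);
        alive[i] = (fread(&head[i], sizeof(uint32_t), 1, in[i]) == 1);
    }
    for (;;) {
        int min = -1; /* index of the run with the smallest head element */
        for (int i = 0; i < k; i++)
            if (alive[i] && (min < 0 || head[i] < head[min])) min = i;
        if (min < 0) break; /* all runs exhausted */
        fwrite(&head[min], sizeof(uint32_t), 1, out);
        alive[min] = (fread(&head[min], sizeof(uint32_t), 1, in[min]) == 1);
    }
    fclose(out);
    return 0;
}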


Page 37: Algorithms for  Large Data Sets

External Memory Sorting

Mainly of theoretical interest…

Luckily there is a more practical I/O-efficient algorithm for sorting.

The total number of I/Os for this sorting algorithm is given by the recurrence

T(N) = (M/B) T(N / (M/B)) + Θ(N/B),

with a base case of T(O(B)) = O(1).

Page 38: Algorithms for  Large Data Sets

External Memory Sorting

T(N) = (M/B) T(N / (M/B)) + Θ(N/B),

T(O(B)) = O(1).

At level i of the recursion tree:

(M/B)^i nodes, problem sizes N_i = N / (M/B)^i

The number of levels in the recursion tree is O(log_{M/B} N/B)

The divide-and-merge cost at any level i is Θ(N/B): the recursion tree has Θ(N/B) leaves, for a leaf cost of Θ(N/B). The root node has divide-and-merge cost Θ(N/B) as well, as do all levels in between: (M/B)^i · Θ(N_i/B) = (M/B)^i · Θ((N/(M/B)^i)/B) = Θ(N/B).

So the total cost is Θ((N/B) log_{M/B} N/B).

Page 39: Algorithms for  Large Data Sets

List Ranking

Given a linked list L, compute for each item in the list its distance from the head

[Figure: a six-element list; the ranks from the head are 1 2 3 4 5 6]

Page 40: Algorithms for  Large Data Sets

Weighted List Ranking

Can be generalized to weighted ranks:

Given a linked list L, compute for each item in the list its weighted distance from the head

[Figure: weights 3 1 5 2 3 1 → weighted ranks 3 4 9 11 14 15]

Page 41: Algorithms for  Large Data Sets

Why Is List Ranking Non-Trivial?

[Figure: a 16-node list whose consecutive elements are scattered across different disk blocks, so that following the successor pointers incurs an I/O at almost every hop]

The internal memory algorithm spends Ω(N) I/Os in the worst case (LRU assumed).

Page 42: Algorithms for  Large Data Sets

I/O-Efficient List Ranking Alg

Proposed by Chiang et al. [1995]

If list L fits into internal memory (|L| ≤ M):

1. Read L into internal memory in O(scan(|L|)) I/Os

2. Use the trivial list ranking algorithm in internal memory (see the sketch below)

3. Write the element ranks to disk in O(scan(|L|)) I/Os
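A minimal sketch of step 2, under the assumption that the list is already in RAM as arrays (succ[i] = index of i's successor, -1 at the tail; w[i] = weight of node i, with w ≡ 1 for plain ranks):

/* trivial (weighted) list ranking in internal memory: one walk from the
   head, accumulating a prefix sum of the weights along the pointers */
void rank_list(int head, const int* succ, const int* w, long long* rank) {
    long long d = 0;
    for (int v = head; v != -1; v = succ[v]) {
        d += w[v];      /* the rank of v includes v's own weight */
        rank[v] = d;
    }
}

On the weighted example above (weights 3 1 5 2 3 1) this produces the ranks 3 4 9 11 14 15.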

Difficult part is when |L| > M (not a surprise…)


Page 43: Algorithms for  Large Data Sets

List Ranking for |L| > M

• Assume an independent set I of size at least N/3 can be found in O(sort(N)) I/Os (we’ll see later how).

[Figure: the list with weights 3 1 5 2 3 1; the elements of the independent set are marked and spliced out in O(scan(|L|)) I/Os]

Page 44: Algorithms for  Large Data Sets

List Ranking for |L| > M

[Figure: splicing out the independent set turns the list with weights 3 1 5 2 3 1 into a contracted list with weights 3 1 7 4 — the weight of each removed element is added to its successor]

Page 45: Algorithms for  Large Data Sets

Step Analysis

• Assume each vertex has a unique numerical ID
  – Sort the elements in L \ I by their numbers
  – Sort the elements in I by the numbers of their successors
  – Scan the two lists to update the label of succ(v), for every element v ∈ I

Page 46: Algorithms for  Large Data Sets

Step Analysis

• Each vertex has a unique numerical ID
  – Sort the elements in I by their numbers
  – Sort the elements in L \ I by the numbers of their successors
  – Scan the two lists to update the label of succ(v), for every element v ∈ L \ I

Page 47: Algorithms for  Large Data Sets

List Ranking for |L| > M

Recursive step:

[Figure: the contracted list with weights 3 1 7 4 is ranked recursively, giving ranks 3 4 11 15; the original list had weights 3 1 5 2 3 1]

Page 48: Algorithms for  Large Data Sets

List Ranking for |L| > M

[Figure: reintegrating the independent set — the ranks 3 4 11 15 of the contracted list extend to the ranks 3 4 9 11 14 15 of the original list with weights 3 1 5 2 3 1]

O(sort(|L|) + scan(|L|)) I/Os (as before)

Page 49: Algorithms for  Large Data Sets

Recap of the Algorithm

[Figure: the full run — weights 3 1 5 2 3 1 → contracted list 3 1 7 4 → recursive ranks 3 4 11 15 → reintegration gives the final ranks 3 4 9 11 14 15]

Page 50: Algorithms for  Large Data Sets


Page 51: Algorithms for  Large Data Sets

I/O-Efficient List Ranking

• Except for the recursion, everything is scan and sort
• The I/O-complexity is

  I(N) ≤ I(2N/3) + O(sort(N))   if N > M
  I(N) = O(scan(N))             if N ≤ M

• The solution of the recurrence is I(N) = O(sort(N))

Theorem: A list of size N can be ranked in O(sort(N)) I/Os.

Page 52: Algorithms for  Large Data Sets

I/O-Efficient List Ranking

Observation: We do not use the property that there is only one head and one tail in the list

In other words, the algorithm can be applied simultaneously to a collection of linked lists

This is exploited for solving more complicated graph problems (will see an example later)


Page 53: Algorithms for  Large Data Sets

Now Switch to Graph Algorithms

For theoreticians:
• Graph problems are neat, often difficult, hence interesting

For practitioners:
• Massive graphs arise in GIS, Web modelling, ...
• Problems in computational geometry can be expressed as graph problems
• Many abstract problems are best viewed as graph problems
• Extreme view: pointer-based data structures = graphs with extra information at their nodes

For us:
• We still don’t understand how to solve some graph problems I/O-efficiently (e.g., DFS)

Page 54: Algorithms for  Large Data Sets

Outline

Fundamental Graph Problems
• (List ranking)
• Algorithms for trees
  – Euler tour
  – Tree labelling
• Evaluating DAGs
• Connectivity
  – Connected components
  – Minimum spanning trees

Page 55: Algorithms for  Large Data Sets

The Euler Tour Technique

Given a tree T, represent it by a list L that captures the tree structure

Why? Certain computations on T can be performed by a (weighted) list ranking of L.

Seven Bridges of Königsberg (Five Bridges of Kaliningrad, Калининград)

Page 56: Algorithms for  Large Data Sets

Euler Tour

Given a tree T, and a distinguished vertex r of T, an Euler tour of T is a traversal of T that starts and ends at r and traverses every edge exactly twice, once in each direction.

[Figure: a tree with distinguished root r and the tour around it]

Page 57: Algorithms for  Large Data Sets

The Euler Tour Technique

Theorem: Given the adjacency lists of the vertices in T, an Euler tour can be constructed in O(scan(N)) I/Os.

• Let {v,w_1},…,{v,w_r} be the (undirected) edges incident to v

• Then succ((w_i,v)) = (v,w_{i+1}), with indices taken cyclically (w_{r+1} = w_1)

[Figure: vertex v with neighbors w_1, w_2, w_3, w_4 and the tour threading around it]
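A minimal sketch of this rule (my illustration, with in-memory adjacency arrays standing in for adjacency lists on disk): emitting the successor of every incoming edge is a single pass over the lists, which is what gives the O(scan(N)) bound.

#include <stdio.h>

/* nbr[v][0..deg[v]-1] is the adjacency list of v; for each incoming edge
   (w_i, v) the tour continues with (v, w_{i+1}), wrapping around at the end */
void euler_successors(int n, int** nbr, const int* deg) {
    for (int v = 0; v < n; v++)
        for (int i = 0; i < deg[v]; i++) {
            int wi = nbr[v][i];
            int wnext = nbr[v][(i + 1) % deg[v]];
            printf("succ((%d,%d)) = (%d,%d)\n", wi, v, v, wnext);
        }
}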

Page 58: Algorithms for  Large Data Sets

Problem 1

If T is represented as an unordered collection of edges (i.e., no adjacency lists), what’s the I/O complexity of computing an Euler tour of T?


Page 59: Algorithms for  Large Data Sets

Rooting a Tree

• Choosing a vertex r as the root of a tree T defines parent-child relationships between adjacent nodes

• Rooting tree T = computing for every edge {v,w} who is the parent and who is the child

• v = p(w) if and only if rank((v,w)) < rank((w,v)) in the Euler tour

Theorem: A tree can be rooted in O(sort(N)) I/Os.

Page 60: Algorithms for  Large Data Sets

Computing a Preorder Numbering

Theorem: A preorder numbering of a rooted tree T can be computed in O(sort(N)) I/Os.

[Figure: the Euler tour of an example tree with downward edges (p(v),v) weighted 1 and upward edges (v,p(v)) weighted 0; weighted list ranking of the tour yields the preorder numbers]

preorder#(v) = rank((p(v),v))

Page 61: Algorithms for  Large Data Sets

Computing Subtree Sizes

[Figure: the same weighted Euler tour; the ranks of the two traversals of the edge {v,p(v)} bracket exactly the subtree of v]

|T(v)| = rank((v,p(v))) – rank((p(v),v)) + 1

Theorem: The nodes of T can be labelled with their subtree sizes in O(sort(N)) I/Os.

[Figure: the example tree labelled with its subtree sizes — 10 at the root, 8 and 3 at internal nodes, 1 at the leaves]

Page 62: Algorithms for  Large Data Sets

Problem 2

Given a tree T, rooted at r, the depth of a node v is defined as the number of edges on the path from r to v in T.

Design an I/O-efficient algorithm to compute the depth of each node of T.


Page 63: Algorithms for  Large Data Sets

Evaluating a Directed Acyclic Graph

• More general: Given a labelling φ, compute a labelling ψ such that ψ(v) is computed from φ(v) and ψ(u_1),…,ψ(u_r), where u_1,…,u_r are v’s in-neighbors

[Figure: an example DAG whose vertices carry 0/1 input labels φ, evaluated to output labels ψ]

Page 64: Algorithms for  Large Data Sets

Assumptions

[Figure: the example DAG with vertices numbered 1..12 in topological order and 0/1 input labels]

Assumption 1: Nodes are given in topological order: for every edge (v,w), v precedes w in the ordering.
(Note: there is no I/O-efficient algorithm to topologically sort an arbitrary DAG.)

Use priority queue Q to send data along the edges.

Assumption 2: If there is no bound on in-degrees, the computation in a node with in-degree K can be done in O(sort(K)) I/Os.

Page 65: Algorithms for  Large Data Sets

Time-Forward Processing

Chiang et al. [1995], Arge [1995]:

Use priority queue Q to send data along the edges.

[Figure: step-by-step run on the example DAG; at each step, Q holds one triple (receiver, sender, value) for every edge whose source has been evaluated but whose target has not, e.g. (4,2,1), (5,2,1), (6,1,0), …]
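A compact sketch of the technique (my illustration, not the course's code): an in-memory binary min-heap stands in for an I/O-efficient priority queue, vertices 0..n-1 are assumed to be numbered in topological order with all edges pointing from lower to higher numbers, and each vertex computes, say, the OR of its own label and the values received from its in-neighbors.

#include <stdlib.h>

typedef struct { int dst; int msg; } Item; /* (receiver, value sent) */

static Item* heap;
static int hsize;

static void push(Item x) { /* insert, sift up; min-keyed on dst */
    int i = hsize++;
    heap[i] = x;
    while (i > 0 && heap[(i - 1) / 2].dst > heap[i].dst) {
        Item t = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static Item pop(void) { /* remove the minimum, sift down */
    Item top = heap[0];
    heap[0] = heap[--hsize];
    for (int i = 0; ; ) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < hsize && heap[l].dst < heap[m].dst) m = l;
        if (r < hsize && heap[r].dst < heap[m].dst) m = r;
        if (m == i) break;
        Item t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
    return top;
}

/* adj[v]/deg[v]: out-neighbors of v (all numbered higher than v);
   phi[v]: input label; psi[v]: output label */
void evaluate_dag(int n, int** adj, const int* deg, const int* phi, int* psi) {
    long long m = 0;
    for (int v = 0; v < n; v++) m += deg[v]; /* heap never exceeds |E| items */
    heap = malloc((m > 0 ? m : 1) * sizeof(Item));
    hsize = 0;
    for (int v = 0; v < n; v++) {
        int acc = phi[v];
        while (hsize > 0 && heap[0].dst == v) /* receive from in-neighbors */
            acc |= pop().msg;
        psi[v] = acc;
        for (int k = 0; k < deg[v]; k++)      /* send psi[v] forward in time */
            push((Item){ adj[v][k], psi[v] });
    }
    free(heap);
}

With the heap replaced by an external-memory priority queue, the vertex scan plus the O(|E|) queue operations give exactly the O(sort(|V| + |E|)) bound stated below.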

Page 66: Algorithms for  Large Data Sets

Time-Forward Processing

Correctness:

• Every in-neighbor of v is evaluated before v

• All vertices preceding v in topological order are evaluated before v

Page 67: Algorithms for  Large Data Sets

Time-Forward Processing

Analysis:

• Vertex set + adjacency lists scanned: O(scan(|V| + |E|)) I/Os

• Priority queue: every edge is inserted into and deleted from Q exactly once, i.e., O(|E|) priority queue operations: O(sort(|E|)) I/Os

Page 68: Algorithms for  Large Data Sets

Time-Forward Processing

Theorem: A directed acyclic graph G = (V,E) can be evaluated in O(sort(|V| + |E|)) I/Os.

Page 69: Algorithms for  Large Data Sets

Independent Set

Given a graph G = (V,E), an independent set is a set I ⊆ V such that no two vertices in I are adjacent

[Figure: two example graphs with an independent set highlighted]

Page 70: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

An independent set I is maximal if every vertex in V \ I has at least one neighbor in I

[Figure: the same two example graphs with a maximal independent set highlighted]

Page 71: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

[Figure: the same example, continued]

Page 72: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Algorithm GREEDYMIS:

1. I ← ∅
2. for every vertex v ∈ G do
3.     if no neighbor of v is in I then
4.         Add v to I
5.     end if
6. end for

Observation: It suffices to consider only the neighbors of v which have been visited in a previous iteration.

Page 73: Algorithms for  Large Data Sets

Implementation Details

• Assume each vertex has a unique numerical ID

• Direct edges from lower to higher numbers

• Sort vertices by their number

• Consider vertices in sorted order


Page 74: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Algorithm GREEDYMIS:

1. I ← ∅
2. for every vertex v ∈ G in sorted order do
3.     if no in-neighbor of v is in I then
4.         Add v to I
5.     end if
6. end for

Page 75: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

[Figure: GREEDYMIS run on the example graph with vertices 1..11, edges directed from lower to higher numbers]

Page 76: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

[Figure: the finished run — the vertices of the maximal independent set highlighted among 1..11]

Page 77: Algorithms for  Large Data Sets

Implementation Details

How to check I/O-efficiently if any neighbor of v was already included in I? Time-forward processing:

• (Assume each vertex has a unique numerical ID)

• (Direct edges from lower to higher numbers)

• (Sort vertices by their number)

• Sort edges by the number of their sources

• After deciding whether v is included in I, v sends a flag to each of its out-neighbors to inform them whether or not v is in I (same as evaluating DAGs)

• Each vertex decides whether it should be added to I based solely on the flags received from its in-neighbors (see the sketch below)
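A minimal sketch of this decision loop (my illustration), reusing the heap-based message passing from the DAG-evaluation sketch above — Item, heap, hsize, push, and pop are the ones defined there:

/* greedy MIS by time-forward processing: in_I[v] is set to 1 iff v joins I;
   edges are assumed directed from lower to higher vertex numbers */
void greedy_mis(int n, int** adj, const int* deg, int* in_I) {
    for (int v = 0; v < n; v++) {
        int blocked = 0;
        while (hsize > 0 && heap[0].dst == v) /* flags from in-neighbors */
            blocked |= pop().msg;
        in_I[v] = !blocked;                   /* join I iff no in-neighbor is in I */
        for (int k = 0; k < deg[v]; k++)      /* inform out-neighbors */
            push((Item){ adj[v][k], in_I[v] });
    }
}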


Page 78: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Correctness follows from the following observations:

• The set I computed by the algorithm is independent since
  a) Vertex v is added to I only if none of its in-neighbors is in I
  b) At this point none of its out-neighbors can be in I yet, and the insertion of v into I prevents all of these out-neighbors from being added to I later

• The set I is maximal since otherwise there would be a vertex v ∉ I such that none of the in-neighbors of v are in I: but then v would have been added to I

Page 79: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Theorem: A maximal independent set of a graph G = (V,E) can be computed in O(sort(|V|+|E|)) I/Os.


Page 80: Algorithms for  Large Data Sets

Large Independent Set of a List

This fills in the missing detail of the list ranking algorithm:

Corollary: An independent set of size at least N/3 for a list L of size N can be found in O(sort(N)) I/Os.

In a list, every vertex in an MIS I prevents at most two other vertices (its neighbors) from being in I, so every MIS of a list has size at least N/3.

Page 81: Algorithms for  Large Data Sets

Graph Connectivity

• Connected Components

• Minimum Spanning Trees


Page 82: Algorithms for  Large Data Sets

Connectivity

A graph G = (V,E) is connected if for any two vertices u,v in V there is a path between u and v in G.

The connected components of G are its maximal connected subgraphs.

First, a semi-external algorithm (vertices fit in main memory, edges don’t).

Next, a fully external algorithm: use graph contraction to reduce the number of vertices, and call the semi-external algorithm as soon as the vertices fit in main memory.

Page 83: Algorithms for  Large Data Sets

Connectivity: A Semi-External Algorithm
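The algorithm itself is a figure in the original deck; as a stand-in, here is a minimal sketch of the standard approach under the assumption |V| ≤ M: keep a union-find structure over the vertices in RAM and scan the edge list once from disk.

#include <stdio.h>
#include <stdlib.h>

static int* parent; /* in-memory union-find over the vertices (|V| <= M) */

static int find(int v) { /* root of v's set, with path halving */
    while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
    return v;
}

/* one scan of the edge file (pairs of int endpoints): O(scan(|E|)) I/Os */
void connected_components(int nV, FILE* edges) {
    int e[2];
    parent = malloc(nV * sizeof(int));
    for (int v = 0; v < nV; v++) parent[v] = v;
    while (fread(e, sizeof(int), 2, edges) == 2) {
        int ru = find(e[0]), rv = find(e[1]);
        if (ru != rv) parent[ru] = rv; /* merge the two components */
    }
    /* afterwards, find(v) is the component label of v; one more scan of
       the vertex set writes the labels back to disk */
}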

Page 84: Algorithms for  Large Data Sets

Connectivity: A Semi-External Algorithm

Analysis:

• Scan the vertex set to load the vertices into main memory

• Scan the edge set to carry out the algorithm

• O(scan(|V| + |E|)) I/Os

Theorem: If |V| ≤ M, the connected components of a graph can be computed in O(scan(|V| + |E|)) I/Os.

Page 85: Algorithms for  Large Data Sets

Connectivity: The General Case

Idea [Chiang et al. 1995]:

• If |V| ≤ M
  – Use the semi-external algorithm

• If |V| > M
  – Identify simple connected subgraphs of G
  – Contract these subgraphs to obtain a graph G’ = (V’,E’) with |V’| ≤ c|V|, c < 1
  – Recursively compute the connected components of G’
  – Obtain the labelling of the connected components of G from the labelling of the components of G’

Page 86: Algorithms for  Large Data Sets

Connectivity: The General Case

[Figure: example graph G with vertices a..n; each selected subgraph is contracted into a single vertex of G’ (A..E), and the component labels 1 and 2 computed for G’ are copied back to the vertices of G]

Page 87: Algorithms for  Large Data Sets

Connectivity: The General Case

Main steps:

• Find smallest neighbors

• Compute the connected components of the graph H induced by the selected edges

• Contract each component into a single vertex

• Call the procedure recursively

• Copy the label of every vertex v ∈ G’ to all vertices in G represented by v

Page 88: Algorithms for  Large Data Sets

Finding smallest neighbors

To find the smallest neighbor w(v) of every vertex v:

• Scan the edges and replace each undirected edge {u,v} with the directed edges (u,v) and (v,u)

• Sort the directed edges lexicographically; this produces the adjacency lists

• Scan the adjacency list of v and return as w(v) the first vertex in the list

This takes overall O(sort(|E|)) I/Os (see the sketch below).

To produce the edge set of the (undirected) graph H, sort and scan the edges {v, w(v)} to remove duplicates. This takes another O(sort(|V|)) I/Os.
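An in-memory sketch of this sort-and-scan pattern (my illustration; on disk, qsort becomes external mergesort and the loops become scans):

#include <stdlib.h>

typedef struct { int src, dst; } Edge;

static int cmp_edge(const void* a, const void* b) { /* lexicographic order */
    const Edge *x = a, *y = b;
    return x->src != y->src ? x->src - y->src : x->dst - y->dst;
}

/* after the call, w[v] is v's smallest neighbor, or -1 if v is isolated */
void smallest_neighbors(const Edge* e, long long m, int nV, int* w) {
    Edge* d = malloc(2 * m * sizeof(Edge));
    for (long long i = 0; i < m; i++) { /* direct each edge both ways */
        d[2 * i]     = (Edge){ e[i].src, e[i].dst };
        d[2 * i + 1] = (Edge){ e[i].dst, e[i].src };
    }
    qsort(d, 2 * m, sizeof(Edge), cmp_edge); /* groups edges by source */
    for (int v = 0; v < nV; v++) w[v] = -1;
    for (long long i = 0; i < 2 * m; i++) /* first entry per source is w(v) */
        if (w[d[i].src] == -1) w[d[i].src] = d[i].dst;
    free(d);
}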


Page 89: Algorithms for  Large Data Sets

Computing Conn Comps of H

Cannot use the same algorithm recursively (it didn’t reduce the vertex set).

Exploit the following property:

Lemma: Graph H is a forest.

Proof: Assume not. Then H must contain a cycle x0, x1, …, xk = x0. Since there are no duplicate edges, k ≥ 3. Since each vertex v has at most one incident edge {v,w(v)} in H, w.l.o.g. xi+1 = w(xi) for 0 ≤ i < k. Then the existence of {xi-1,xi} implies that xi-1 > xi+1. Similarly, xk-1 > x1.

If k is even: x0 > x2 > … > xk = x0 yields a contradiction.

If k is odd: x0 > x2 > … > xk-1 > x1 > x3 > … > xk = x0 yields a contradiction.

Page 90: Algorithms for  Large Data Sets

Exploit the Property that H is a Forest

• Apply the Euler tour technique to H in order to transform each tree into a list

• Now compute connected components using ideas from list ranking:
  – Find a large independent set I of H and remove the vertices in I from H
  – Recursively find the connected components of the smaller graph
  – Reintegrate the vertices in I (assign the component label of a neighbor)

This takes O(sort(|H|)) = O(sort(|V|)) I/Os

Page 91: Algorithms for  Large Data Sets

Recursive Calls

Every connected component of H has size at least 2
⇒ |V’| ≤ |V|/2
⇒ O(log(|V|/M)) recursive calls

Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.

Page 92: Algorithms for  Large Data Sets

Improved Connectivity via BFS

• BFS in O(|V| + sort(|E|)) I/Os [Munagala & Ranade 99]; BFS can be used to identify connected components
• When |V| = |E|/B, this algorithm takes O(sort(|E|)) I/Os
• Same algorithm as before, but stop the recursion earlier, when the number of vertices has been reduced to |E|/B (after log(|V|B/|E|) recursive calls)
• At this point, apply BFS rather than semi-external connectivity

Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.

Page 93: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

A spanning tree of an undirected graph G = (V,E) is a tree T = (V,E’) such that E’ ⊆ E.

Given an undirected graph with costs assigned to its edges, the cost of a spanning tree is the sum of the costs of its edges.

A spanning tree T of G is minimum if there is no other spanning tree T’ of G such that cost(T’) < cost(T).

Page 94: Algorithms for  Large Data Sets

Spanning Trees

Observation: The connectivity algorithm can be augmented to produce a spanning tree (forest) of G.

SemiExternalST: add edge {v,w} whenever v and w are in different connected components

ExternalST: build a spanning tree (forest) of G from H and the spanning tree (forest) T’ produced by the recursive invocation of the algorithm on the compressed graph G’

Page 95: Algorithms for  Large Data Sets

ExternalST

The spanning tree produced is not necessarily minimum

[Figure: the example graph a..n with contracted components A..E — the edges chosen by ExternalST form a spanning tree of G, but not a minimum one]

Page 96: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

Simple modifications:

SemiExternalMST: Rather than inspecting the edges of G in arbitrary order, inspect edges by increasing weight. This increases the I/O-complexity from O(scan(|V| + |E|)) to O(scan(|V|) + sort(|E|)).

Essentially, a semi-external version of Kruskal’s algorithm.

Page 97: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

ExternalMST differs from ExternalST in a number of places:

a) Choose the edge of minimum cost, rather than the edge to the smallest neighbor (this maintains the invariant that H is a forest)

b) The weight of an edge e in the compressed graph G’ = the minimum weight of all edges represented by e

c) When “e is added” to T, add in fact this minimum-weight edge

[Figure: vertex v with incident edges of costs 4, 1, 5, 3 toward a, b, c, d; the minimum-cost edge is selected]

Page 98: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

[Figure: the example graph a..n with contracted components A..E, now with minimum-weight edges chosen]

Theorem: An MST of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.

Page 99: Algorithms for  Large Data Sets

A Fast MST Algorithm

Idea:
– If we can compute an MST in O(|V| + sort(|E|)) I/Os
– Apply the same trick as before (BFS) and stop the recursion after log(|V|B/|E|) iterations

Arge et al. [2000] obtained the desired bound with an I/O-efficient implementation of Prim’s algorithm:

Page 100: Algorithms for  Large Data Sets

A Fast MST Algorithm

• Maintain the light blue (candidate) edges and the intra-tree edges in a priority queue Q

• When the edge {v,w} of minimum cost is retrieved, test whether v and w are both in T
  – Yes ⇒ discard the (intra-tree) edge
  – No ⇒ add the edge to the MST, and add to Q all edges incident to w, except {v,w} (assuming that w ∉ T)

Problem: How to test whether v,w ∈ T.

Page 101: Algorithms for  Large Data Sets

A Fast MST Algorithm

• If v,w ∈ T, but {v,w} ∉ T, then both v and w have inserted edge {v,w} into Q

⇒ There are two copies of {v,w} in Q
⇒ They are consecutive (assumption: all edge costs are distinct)
⇒ Perform two DELETEMIN operations, retrieving {v,w} and {y,z}:

– If {v,w} = {y,z}, discard both
– Otherwise, add {v,w} to T and re-insert {y,z}
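A sketch of that retrieval step (my illustration; pq_deletemin, pq_insert, add_to_mst, and push_incident_edges are hypothetical names standing in for the I/O-efficient priority-queue operations):

typedef struct { int u, v; double cost; } Edge;

/* hypothetical I/O-efficient priority-queue API — placeholder names */
extern Edge pq_deletemin(void);
extern void pq_insert(Edge e);
extern void add_to_mst(Edge e);
extern void push_incident_edges(int w, Edge except);

/* one step of the Prim-style loop: intra-tree edges are detected by their
   duplicate copies (distinct costs make the two copies consecutive in Q) */
void mst_step(void) {
    Edge e1 = pq_deletemin();
    Edge e2 = pq_deletemin();
    if (e1.u == e2.u && e1.v == e2.v) {
        /* {v,w} appeared twice: both endpoints already in T, discard both */
    } else {
        add_to_mst(e1);                /* e1 = {v,w} brings w into T */
        pq_insert(e2);                 /* e2 was not a duplicate: put it back */
        push_incident_edges(e1.v, e1); /* send w's other edges into Q */
    }
}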

Page 102: Algorithms for  Large Data Sets

A Fast MST Algorithm

Analysis:
• O(|V| + scan(|E|)) I/Os for retrieving the adjacency lists
• O(sort(|E|)) I/Os for the priority queue operations

Theorem: An MST of a graph G = (V,E) can be found in O(|V| + sort(|E|)) I/Os.

Corollary: An MST of a graph G = (V,E) can be found in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.

Page 103: Algorithms for  Large Data Sets

Graph Contraction and Sparse Graphs

• A graph G = (V,E) is sparse if, for any graph H obtainable from G through a series of edge contractions, |E(H)| = O(|V(H)|) (this includes planar graphs and graphs of bounded treewidth)

• For a sparse graph, the number of vertices and edges of G reduces by a constant factor in each iteration of the connectivity and MST algorithms.

Theorem: The connected components or an MST of a sparse graph with N vertices can be computed in O(sort(N)) I/Os.

Page 104: Algorithms for  Large Data Sets

Three Techniques for Graph Algs

• Time-forward processing:
  – Express graph problems as evaluation problems of DAGs

• Graph contraction:
  – Reduce the size of G while maintaining the properties of interest
  – Solve the problem recursively on the compressed graph
  – Construct the solution for G from the solution for the compressed graph

• Bootstrapping:
  – Switch to a generally less efficient algorithm as soon as (part of) the input is small enough