
Page 1: Algorithms for  Large Data Sets

Algorithms for Large Data Sets

Giuseppe F. (Pino) Italiano

Università di Roma “Tor Vergata”

[email protected]

Page 2: Algorithms for  Large Data Sets

Examples of Large Data Sets: Astronomy

• Astronomical sky surveys

• 120 Gigabytes/week

• 6.5 Terabytes/year

The Hubble Telescope

Page 3: Algorithms for  Large Data Sets

Examples of Large Data Sets: Phone call billing records

• 250M calls/day

• 60G calls/year

• 40 bytes/call

• 2.5 Terabytes/year

Page 4: Algorithms for  Large Data Sets

Examples of Large Data Sets: Credit card transactions

• 47.5 billion transactions in 2005 worldwide

• 115 Terabytes of data transmitted to VisaNet data processing center in 2004

Page 5: Algorithms for  Large Data Sets

Examples of Large Data Sets: Internet traffic

Traffic in a typical router:

• 42 kB/second

• 3.5 Gigabytes/day

• 1.3 Terabytes/year

Page 6: Algorithms for  Large Data Sets

Examples of Large Data Sets: The World-Wide Web

• 25 billion pages indexed

• 10 kB/page

• 250 Terabytes of indexed text data

• “Deep web” is supposedly 100 times as large

Page 7: Algorithms for  Large Data Sets

Reasons for Large Data Sets: Better technology

• Storage & disks
  – Cheaper
  – More volume
  – Physically smaller
  – More efficient

Large data sets are affordable

Page 8: Algorithms for  Large Data Sets

Reasons for Large Data Sets: Better networking

• High speed Internet

• Cellular phones

• Wireless LAN

More data consumers

More data producers

Page 9: Algorithms for  Large Data Sets

Reasons for Large Data Sets: Better IT

• More processes are automatic
  – E-commerce
  – Online and telephone banking
  – Online and telephone customer service
  – E-learning
  – Chats, news, blogs
  – Online journals
  – Digital libraries

• More enterprises are digital
  – Companies
  – Banks
  – Governmental institutions
  – Universities

More data is available in digital form

World’s yearly production of data: billions of gigabytes

Page 10: Algorithms for  Large Data Sets

More and More Digital Data

• The amount of data to be processed is increasing at a faster rate than computing power

• The digital data created in the last few years is larger than the amount of data created in all of previous history (57 billion GB)

Page 11: Algorithms for  Large Data Sets

The Digital Universe is growing fast

• Digital Universe = amount of digital information created and replicated in a year
• YouTube hosts 100 million video streams a day
• More than a billion songs a day are shared over the Internet
• London’s 200 traffic surveillance cameras send 64 trillion bits a day to the command center
• Chevron accumulates data at the rate of TB/day
• TV broadcasting is going all-digital in most countries
• …

Page 12: Algorithms for  Large Data Sets

We Ain’t Seen Nothing Yet…

• In 2009, despite the global recession, the Digital Universe grew by 62% to nearly 800,000 Petabytes (1 PB = 1 million GB), i.e., a stack of DVDs reaching from the Earth to the Moon and back.

• In 2010, the Digital Universe was expected to grow almost as fast, to 1.2 million PB, or 1.2 Zettabytes (ZB).

• With this trend, in 2020 the Digital Universe will be 35 ZB, i.e., 44 TIMES AS BIG as it was in 2009. The stack of DVDs would now reach halfway to Mars!

Page 13: Algorithms for  Large Data Sets

The RAM Model of Computation

The simple uniform memory model (i.e., unit time per memory access) is no longer adequate for large data sets.

Internal memory (RAM) has a typical size of only a few GB.

Let’s see this with two very simple experiments.

Page 14: Algorithms for  Large Data Sets

Experiment 1: Sequential vs. Random Access

• 2 GB RAM
• Write (sequentially) a file with 2 billion 32-bit integers (7.45 GB)
• Read (randomly) from the same file
• Which is faster? Why?

Page 15: Algorithms for  Large Data Sets


Platform

• MacOS X 10.5.5 (2.16 GHz Intel Core Duo)

• 2GB SDRAM, 2MB L2 cache

• HD Hitachi HTS542525K9SA00, 232.89 GB, serial ATA (1.5 Gigabit/s)

• File system Journaled HFS+

• Compiler gcc 4.0.1

Page 16: Algorithms for  Large Data Sets

Sequential access (write)

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long ItemType; /* type of file items */

int main(int argc, char** argv) {

    FILE* f;
    long long N, i;

    /* check command line parameters */
    if (argc < 3) exit(printf("Usage: ./MakeRandFile fileName numItems\n"));

    /* convert number of items from string to integer format */
    N = atoll(argv[2]);

    printf("file offset: %d bit\n", (int)(sizeof(off_t) * 8));
    printf("creating random file of %lld 32 bit integers...\n", N);

    /* open file for writing */
    f = fopen(argv[1], "w+");
    if (f == NULL) exit(printf("can't open file\n"));

    /* make N sequential file accesses */
    for (i = 0; i < N; ++i) {
        ItemType val = rand();
        fwrite(&val, sizeof(ItemType), 1, f);
    }

    fclose(f);
}

Sequential Write

Page 17: Algorithms for  Large Data Sets

Sequential access (write)

/* make N sequential file accesses */
for (i = 0; i < N; ++i) {
    ItemType val = rand();
    fwrite(&val, sizeof(ItemType), 1, f);
}

Sequential Write

Page 18: Algorithms for  Large Data Sets

Random access (read)

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned long ItemType; /* type of file items */

int main(int argc, char** argv) {

    FILE* f;
    long long N, i, R;

    /* check command line parameters */
    if (argc < 3) exit(printf("Usage: ./RandFileScan fileName numReads\n"));

    /* convert number of accesses from string to integer format */
    R = atoll(argv[2]);

    /* open file for reading */
    f = fopen(argv[1], "r");
    if (f == NULL) exit(printf("can't open file %s\n", argv[1]));

    /* compute number N of elements in the file */
    fseeko(f, 0LL, SEEK_END);
    N = ftello(f) / sizeof(ItemType);

    printf("file offset: %d bit\n", (int)(sizeof(off_t) * 8));
    printf("make %lld random accesses to file of %lld 32 bit integers...\n", R, N);

    srand(clock()); /* init pseudo-random generator seed */

    /* make R random file accesses */
    for (i = 0; i < R; ++i) {
        ItemType val;
        long long j = (long long)(N * ((double)rand() / RAND_MAX));
        fseeko(f, j * sizeof(ItemType), SEEK_SET);
        fread(&val, sizeof(ItemType), 1, f);
    }

    fclose(f);
}

Random Read

Page 19: Algorithms for  Large Data Sets

Random access (read)

/* make R random file accesses */
for (i = 0; i < R; ++i) {
    ItemType val;
    long long j = (long long)(N * ((double)rand() / RAND_MAX));
    fseeko(f, j * sizeof(ItemType), SEEK_SET);
    fread(&val, sizeof(ItemType), 1, f);
}

Random Read

Page 20: Algorithms for  Large Data Sets

Outcome of the Experiment

Random Read:

Time to read 10,000 integers at random positions in a file of 2 billion 32-bit integers (7.45 GB) is ≈ 118.419 sec. (i.e., 1 min. 58.419 sec.). That’s ≈ 11.8 msec. per integer.

Throughput: ≈ 337.8 byte/sec ≈ 0.0003 MB/sec.

CPU usage: ≈ 1.6%

Sequential Write:

Time to write sequentially a file of 2 billion 32-bit integers (7.45 GB) is ≈ 250.685 sec. (i.e., 4 min. 10.685 sec.). That’s ≈ 120 nanosec. per integer.

Throughput: ≈ 31.8 MB/sec.

CPU usage: ≈ 77%

Sequential access is roughly 100,000 times faster than random!

Page 21: Algorithms for  Large Data Sets

What is More Realistic


Doug Comer: “The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk”

Page 22: Algorithms for  Large Data Sets

Magnetic Disk Drives as Secondary Memory

Actually, disk access is about a million times slower… More like going Around the World in 80 Days!

Time for rotation ≈ Time for seek

Amortize search time by transferring large blocks so that:

Time for rotation ≈ Time for seek ≈ Time to transfer data

Solution 1: Exploit locality – take advantage of data locality

Solution 2: Use disks in parallel

Page 23: Algorithms for  Large Data Sets

Another Experiment

Experiment 2:

Copy a 2048 x 2048 array of 32-bit integers

copyij: copy by rows

copyji: copy by columns

Page 24: Algorithms for  Large Data Sets

26

Access by rows:

void copyij (int src[2048][2048],

int dst[2048][2048])

{

int i,j;

for (i = 0; i < 2048; i++)

for (j = 0; j < 2048; j++)

dst[i][j] = src[i][j];

}

Array Copy

Page 25: Algorithms for  Large Data Sets

27

Access by columns:

void copyji (int src[2048][2048],

int dst[2048][2048])

{

int i,j;

for (j = 0; j < 2048; j++)

for (i = 0; i < 2048; i++)

dst[i][j] = src[i][j];

}

Array Copy

Page 26: Algorithms for  Large Data Sets

Array Copy

copyij and copyji differ only in access patterns:

copyij accesses by rows,

copyji accesses by columns.

On an Intel Core i7 with 2.7 GHz:

copyij takes 5.2 msec,

copyji takes 162 msec (≈ 30x slower!)

Arrays are stored in row-major order (depends on language / compiler)

Thus copyij makes better use of locality
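Concretely, row-major order means a[i][j] sits at linear offset i*2048 + j (in units of int) from the start of the array, so copyij walks memory with stride 1 while copyji jumps 2048 ints (8 KB) between consecutive accesses. A minimal check of this layout assumption (my illustration, not from the original slides):

#include <assert.h>

int a[2048][2048];

/* row-major layout: a[i][j] is the (i*2048 + j)-th int from the base,
   so iterating over j (as copyij does) touches consecutive addresses */
void layout_check(void) {
    assert(&a[3][7] == &a[0][0] + 3 * 2048 + 7);
}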

Page 27: Algorithms for  Large Data Sets

Is this due to External Memory?

Page 28: Algorithms for  Large Data Sets

A Refined Memory Model


Page 29: Algorithms for  Large Data Sets

Outline


Algorithms for Large Data Sets

1. External Memory (Disk): I/O-Efficient Algorithms

2. Cache: Cache-Oblivious Algorithms

3. Large and Inexpensive Memories: Resilient Algorithms

Page 30: Algorithms for  Large Data Sets

Outline


Important issues we are NOT touching

1. Algs for data streaming

2. Algs for multicore architectures: Threads, parallelism, etc…

3. Programming models (MapReduce)

4. How to write fast code

5. …

Page 31: Algorithms for  Large Data Sets

I/O-Efficient Algorithms

Page 32: Algorithms for  Large Data Sets


Model

N: Elements in structure (input size)

B: Elements per block

M: Elements in main memory

Problem starts out on disk

Solution is to be written to disk

Cost of an algorithm is the number of input and output (I/O) operations.

[Figure: the I/O model — processor P with internal memory M, connected to disk D; data moves between memory and disk in blocks via Block I/O]

Page 33: Algorithms for  Large Data Sets

I/O-Efficient Algorithms

Will start with “Simple” Problems

• Scanning

• Sorting

• List ranking


Page 34: Algorithms for  Large Data Sets

Scanning

Scanning N elements stored in blocks costs Θ(N/B) I/Os.

Will refer to this bound as scan(N).
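As an illustration only (not part of the original slides): a scan that sums N integers from a file, reading one block of B items at a time with fread; the I/O cost is the ⌈N/B⌉ block reads, independent of the per-element CPU work.

#include <stdio.h>
#include <stdint.h>

#define B_ITEMS 4096 /* items per block: a stand-in for the model's B */

/* scan(N): read the file block by block and sum the items; the number
   of I/Os is the number of block reads, i.e. Θ(N/B) */
long long sum_file(const char* name) {
    uint32_t buf[B_ITEMS];
    long long sum = 0;
    size_t got;
    FILE* f = fopen(name, "rb");
    if (f == NULL) return -1;
    while ((got = fread(buf, sizeof(uint32_t), B_ITEMS, f)) > 0)
        for (size_t i = 0; i < got; i++)
            sum += buf[i];
    fclose(f);
    return sum;
}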

Page 35: Algorithms for  Large Data Sets

Sorting

Sorting is one of the most-studied problems in computer science.

In external memory, sorting plays a particularly important role, because it is often a lower bound, and even an upper bound, for other problems.

The original paper of Aggarwal and Vitter [AV88] proved that the number of memory transfers to sort in the comparison model is Θ((N/B) log_{M/B} (N/B)).

Will denote this bound by sort(N).

Clearly, sort(N) = Ω(scan(N))

Page 36: Algorithms for  Large Data Sets

External Memory Sorting

The simplest external-memory algorithm that achieves this bound [AV88] is an (M/B)-way mergesort.

During a merge, main memory maintains the first B elements of each list in a block, and when a block empties, the next block from that list is loaded.

So a merge effectively corresponds to scanning through the entire data, for an overall cost of Θ(N/B) I/Os.
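A minimal sketch of the merge step (my illustration, not the course's code): merge k sorted runs of 32-bit integers into one output file. Buffered streams (setvbuf) stand in for explicit block transfers, and for simplicity the smallest head element is found by a linear scan; an (M/B)-way merge would devote all of memory to the k = M/B input blocks.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define BUFSIZE (1 << 20) /* bytes buffered per stream: stand-in for block size B */

int main(int argc, char** argv) {
    if (argc < 3) exit(printf("Usage: ./merge outFile run1 run2 ...\n"));
    int k = argc - 2; /* number of sorted input runs */
    FILE* out = fopen(argv[1], "wb");
    FILE** in = malloc(k * sizeof(FILE*));
    uint32_t* head = malloc(k * sizeof(uint32_t)); /* current front of each run */
    int* alive = malloc(k * sizeof(int));
    setvbuf(out, NULL, _IOFBF, BUFSIZE);
    for (int i = 0; i < k; i++) {
        in[i] = fopen(argv[i + 2], "rb");
        if (in[i] == NULL) exit(printf("can't open file %s\n", argv[i + 2]));
        setvbuf(in[i], NULL, _IOFBF, BUFSIZE);
        alive[i] = (fread(&head[i], sizeof(uint32_t), 1, in[i]) == 1);
    }
    for (;;) {
        int min = -1; /* index of the run with the smallest head element */
        for (int i = 0; i < k; i++)
            if (alive[i] && (min < 0 || head[i] < head[min])) min = i;
        if (min < 0) break; /* all runs exhausted */
        fwrite(&head[min], sizeof(uint32_t), 1, out);
        alive[min] = (fread(&head[min], sizeof(uint32_t), 1, in[min]) == 1);
    }
    fclose(out);
    return 0;
}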


Page 37: Algorithms for  Large Data Sets

External Memory Sorting

Mainly of theoretical interest…

Luckily there is a more practical I/O-efficient algorithm for sorting.

The total number of I/Os for this sorting algorithm is given by the recurrence

T(N) = (M/B) T(N / (M/B)) + Θ(N/B),

with a base case of T(O(B)) = O(1).

Page 38: Algorithms for  Large Data Sets

External Memory Sorting

T(N) = (M/B) T(N / (M/B)) + Θ(N/B),

T(O(B)) = O(1).

At level i of the recursion tree:

(M/B)^i nodes, problem sizes N_i = N / (M/B)^i

The number of levels in the recursion tree is O(log_{M/B} N/B)

The divide-and-merge cost at any level i is Θ(N/B): the recursion tree has Θ(N/B) leaves, for a leaf cost of Θ(N/B). The root node has divide-and-merge cost Θ(N/B) as well, as do all levels in between: (M/B)^i · Θ(N_i/B) = (M/B)^i · Θ((N/(M/B)^i)/B) = Θ(N/B).

So the total cost is Θ((N/B) log_{M/B} N/B).

Page 39: Algorithms for  Large Data Sets

List Ranking

Given a linked list L, compute for each item in the list its distance from the head

[Figure: a six-element list; the ranks from the head are 1 2 3 4 5 6]

Page 40: Algorithms for  Large Data Sets

Weighted List Ranking

Can be generalized to weighted ranks:

Given a linked list L, compute for each item in the list its weighted distance from the head

[Figure: weights 3 1 5 2 3 1 → weighted ranks 3 4 9 11 14 15]

Page 41: Algorithms for  Large Data Sets

Why Is List Ranking Non-Trivial?

[Figure: a 16-node list whose consecutive elements are scattered across different disk blocks, so that following the successor pointers incurs an I/O at almost every hop]

The internal memory algorithm spends Ω(N) I/Os in the worst case (LRU assumed).

Page 42: Algorithms for  Large Data Sets

I/O-Efficient List Ranking Alg

Proposed by Chiang et al. [1995]

If list L fits into internal memory (|L| ≤ M):

1. Read L into internal memory in O(scan(|L|)) I/Os

2. Use the trivial list ranking algorithm in internal memory (see the sketch below)

3. Write the element ranks to disk in O(scan(|L|)) I/Os
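A minimal sketch of step 2, under the assumption that the list is already in RAM as arrays (succ[i] = index of i's successor, -1 at the tail; w[i] = weight of node i, with w ≡ 1 for plain ranks):

/* trivial (weighted) list ranking in internal memory: one walk from the
   head, accumulating a prefix sum of the weights along the pointers */
void rank_list(int head, const int* succ, const int* w, long long* rank) {
    long long d = 0;
    for (int v = head; v != -1; v = succ[v]) {
        d += w[v];      /* the rank of v includes v's own weight */
        rank[v] = d;
    }
}

On the weighted example above (weights 3 1 5 2 3 1) this produces the ranks 3 4 9 11 14 15.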

Difficult part is when |L| > M (not a surprise…)


Page 43: Algorithms for  Large Data Sets

List Ranking for |L| > M

• Assume an independent set I of size at least N/3 can be found in O(sort(N)) I/Os (we’ll see later how).

[Figure: the list with weights 3 1 5 2 3 1; the elements of the independent set are marked and spliced out in O(scan(|L|)) I/Os]

Page 44: Algorithms for  Large Data Sets

List Ranking for |L| > M

[Figure: splicing out the independent set turns the list with weights 3 1 5 2 3 1 into a contracted list with weights 3 1 7 4 — the weight of each removed element is added to its successor]

Page 45: Algorithms for  Large Data Sets

Step Analysis

• Assume each vertex has a unique numerical ID
  – Sort the elements in L \ I by their numbers
  – Sort the elements in I by the numbers of their successors
  – Scan the two lists to update the label of succ(v), for every element v ∈ I

Page 46: Algorithms for  Large Data Sets

Step Analysis

• Each vertex has a unique numerical ID
  – Sort the elements in I by their numbers
  – Sort the elements in L \ I by the numbers of their successors
  – Scan the two lists to update the label of succ(v), for every element v ∈ L \ I

Page 47: Algorithms for  Large Data Sets

List Ranking for |L| > M

Recursive step:

[Figure: the contracted list with weights 3 1 7 4 is ranked recursively, giving ranks 3 4 11 15; the original list had weights 3 1 5 2 3 1]

Page 48: Algorithms for  Large Data Sets

List Ranking for |L| > M

[Figure: reintegrating the independent set — the ranks 3 4 11 15 of the contracted list extend to the ranks 3 4 9 11 14 15 of the original list with weights 3 1 5 2 3 1]

O(sort(|L|) + scan(|L|)) I/Os (as before)

Page 49: Algorithms for  Large Data Sets

Recap of the Algorithm

[Figure: the full run — weights 3 1 5 2 3 1 → contracted list 3 1 7 4 → recursive ranks 3 4 11 15 → reintegration gives the final ranks 3 4 9 11 14 15]

Page 50: Algorithms for  Large Data Sets


Page 51: Algorithms for  Large Data Sets

I/O-Efficient List Ranking

• Except for the recursion, everything is scan and sort
• The I/O-complexity is

  I(N) ≤ I(2N/3) + O(sort(N))   if N > M
  I(N) = O(scan(N))             if N ≤ M

• The solution of the recurrence is I(N) = O(sort(N))

Theorem: A list of size N can be ranked in O(sort(N)) I/Os.

Page 52: Algorithms for  Large Data Sets

I/O-Efficient List Ranking

Observation: We do not use the property that there is only one head and one tail in the list

In other words, the algorithm can be applied simultaneously to a collection of linked lists

This is exploited for solving more complicated graph problems (will see an example later)


Page 53: Algorithms for  Large Data Sets

Now Switch to Graph Algorithms

For theoreticians:
• Graph problems are neat, often difficult, hence interesting

For practitioners:
• Massive graphs arise in GIS, Web modelling, ...
• Problems in computational geometry can be expressed as graph problems
• Many abstract problems are best viewed as graph problems
• Extreme view: pointer-based data structures = graphs with extra information at their nodes

For us:
• We still don’t understand how to solve some graph problems I/O-efficiently (e.g., DFS)

Page 54: Algorithms for  Large Data Sets

Outline

Fundamental Graph Problems
• (List ranking)
• Algorithms for trees
  – Euler tour
  – Tree labelling
• Evaluating DAGs
• Connectivity
  – Connected components
  – Minimum spanning trees

Page 55: Algorithms for  Large Data Sets

The Euler Tour Technique

Given a tree T, represent it by a list L that captures the tree structure

Why? Certain computations on T can be performed by a (weighted) list ranking of L.

Seven Bridges of Königsberg (Five Bridges of Kaliningrad, Калининград)

Page 56: Algorithms for  Large Data Sets

Euler Tour

Given a tree T, and a distinguished vertex r of T, an Euler tour of T is a traversal of T that starts and ends at r and traverses every edge exactly twice, once in each direction.

[Figure: a tree with distinguished root r and the tour around it]

Page 57: Algorithms for  Large Data Sets

The Euler Tour Technique

Theorem: Given the adjacency lists of the vertices in T, an Euler tour can be constructed in O(scan(N)) I/Os.

• Let {v,w_1},…,{v,w_r} be the (undirected) edges incident to v

• Then succ((w_i,v)) = (v,w_{i+1}), with indices taken cyclically (w_{r+1} = w_1)

[Figure: vertex v with neighbors w_1, w_2, w_3, w_4 and the tour threading around it]
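A minimal sketch of this rule (my illustration, with in-memory adjacency arrays standing in for adjacency lists on disk): emitting the successor of every incoming edge is a single pass over the lists, which is what gives the O(scan(N)) bound.

#include <stdio.h>

/* nbr[v][0..deg[v]-1] is the adjacency list of v; for each incoming edge
   (w_i, v) the tour continues with (v, w_{i+1}), wrapping around at the end */
void euler_successors(int n, int** nbr, const int* deg) {
    for (int v = 0; v < n; v++)
        for (int i = 0; i < deg[v]; i++) {
            int wi = nbr[v][i];
            int wnext = nbr[v][(i + 1) % deg[v]];
            printf("succ((%d,%d)) = (%d,%d)\n", wi, v, v, wnext);
        }
}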

Page 58: Algorithms for  Large Data Sets

Problem 1

If T is represented as an unordered collection of edges (i.e., no adjacency lists), what’s the I/O complexity of computing an Euler tour of T?


Page 59: Algorithms for  Large Data Sets

Rooting a Tree

• Choosing a vertex r as the root of a tree T defines parent-child relationships between adjacent nodes

• Rooting tree T = computing for every edge {v,w} who is the parent and who is the child

• v = p(w) if and only if rank((v,w)) < rank((w,v)) in the Euler tour

Theorem: A tree can be rooted in O(sort(N)) I/Os.

Page 60: Algorithms for  Large Data Sets

Computing a Preorder Numbering

Theorem: A preorder numbering of a rooted tree T can be computed in O(sort(N)) I/Os.

[Figure: the Euler tour of an example tree with downward edges (p(v),v) weighted 1 and upward edges (v,p(v)) weighted 0; weighted list ranking of the tour yields the preorder numbers]

preorder#(v) = rank((p(v),v))

Page 61: Algorithms for  Large Data Sets

Computing Subtree Sizes

[Figure: the same weighted Euler tour; the ranks of the two traversals of the edge {v,p(v)} bracket exactly the subtree of v]

|T(v)| = rank((v,p(v))) – rank((p(v),v)) + 1

Theorem: The nodes of T can be labelled with their subtree sizes in O(sort(N)) I/Os.

[Figure: the example tree labelled with its subtree sizes — 10 at the root, 8 and 3 at internal nodes, 1 at the leaves]

Page 62: Algorithms for  Large Data Sets

Problem 2

Given a tree T, rooted at r, the depth of a node v is defined as the number of edges on the path from r to v in T.

Design an I/O-efficient algorithm to compute the depth of each node of T.


Page 63: Algorithms for  Large Data Sets

Evaluating a Directed Acyclic Graph

• More general: Given a labelling φ, compute a labelling ψ such that ψ(v) is computed from φ(v) and ψ(u_1),…,ψ(u_r), where u_1,…,u_r are v’s in-neighbors

[Figure: an example DAG whose vertices carry 0/1 input labels φ, evaluated to output labels ψ]

Page 64: Algorithms for  Large Data Sets

Assumptions

[Figure: the example DAG with vertices numbered 1..12 in topological order and 0/1 input labels]

Assumption 1: Nodes are given in topological order: for every edge (v,w), v precedes w in the ordering.
(Note: there is no I/O-efficient algorithm to topologically sort an arbitrary DAG.)

Use priority queue Q to send data along the edges.

Assumption 2: If there is no bound on in-degrees, the computation in a node with in-degree K can be done in O(sort(K)) I/Os.

Page 65: Algorithms for  Large Data Sets

Time-Forward Processing

Chiang et al. [1995], Arge [1995]:

Use priority queue Q to send data along the edges.

[Figure: step-by-step run on the example DAG; at each step, Q holds one triple (receiver, sender, value) for every edge whose source has been evaluated but whose target has not, e.g. (4,2,1), (5,2,1), (6,1,0), …]
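A compact sketch of the technique (my illustration, not the course's code): an in-memory binary min-heap stands in for an I/O-efficient priority queue, vertices 0..n-1 are assumed to be numbered in topological order with all edges pointing from lower to higher numbers, and each vertex computes, say, the OR of its own label and the values received from its in-neighbors.

#include <stdlib.h>

typedef struct { int dst; int msg; } Item; /* (receiver, value sent) */

static Item* heap;
static int hsize;

static void push(Item x) { /* insert, sift up; min-keyed on dst */
    int i = hsize++;
    heap[i] = x;
    while (i > 0 && heap[(i - 1) / 2].dst > heap[i].dst) {
        Item t = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}

static Item pop(void) { /* remove the minimum, sift down */
    Item top = heap[0];
    heap[0] = heap[--hsize];
    for (int i = 0; ; ) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < hsize && heap[l].dst < heap[m].dst) m = l;
        if (r < hsize && heap[r].dst < heap[m].dst) m = r;
        if (m == i) break;
        Item t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
    return top;
}

/* adj[v]/deg[v]: out-neighbors of v (all numbered higher than v);
   phi[v]: input label; psi[v]: output label */
void evaluate_dag(int n, int** adj, const int* deg, const int* phi, int* psi) {
    long long m = 0;
    for (int v = 0; v < n; v++) m += deg[v]; /* heap never exceeds |E| items */
    heap = malloc((m > 0 ? m : 1) * sizeof(Item));
    hsize = 0;
    for (int v = 0; v < n; v++) {
        int acc = phi[v];
        while (hsize > 0 && heap[0].dst == v) /* receive from in-neighbors */
            acc |= pop().msg;
        psi[v] = acc;
        for (int k = 0; k < deg[v]; k++)      /* send psi[v] forward in time */
            push((Item){ adj[v][k], psi[v] });
    }
    free(heap);
}

With the heap replaced by an external-memory priority queue, the vertex scan plus the O(|E|) queue operations give exactly the O(sort(|V| + |E|)) bound stated below.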

Page 66: Algorithms for  Large Data Sets

Time-Forward Processing

Correctness:

• Every in-neighbor of v is evaluated before v

• All vertices preceding v in topological order are evaluated before v

Page 67: Algorithms for  Large Data Sets

Time-Forward Processing

Analysis:

• Vertex set + adjacency lists scanned: O(scan(|V| + |E|)) I/Os

• Priority queue: every edge is inserted into and deleted from Q exactly once, i.e., O(|E|) priority queue operations: O(sort(|E|)) I/Os

Page 68: Algorithms for  Large Data Sets

Time-Forward Processing

Theorem: A directed acyclic graph G = (V,E) can be evaluated in O(sort(|V| + |E|)) I/Os.

Page 69: Algorithms for  Large Data Sets

Independent Set

Given a graph G = (V,E), an independent set is a set I ⊆ V such that no two vertices in I are adjacent

[Figure: two example graphs with an independent set highlighted]

Page 70: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

An independent set I is maximal if every vertex in V \ I has at least one neighbor in I

[Figure: the same two example graphs with a maximal independent set highlighted]

Page 71: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

[Figure: the same example, continued]

Page 72: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Algorithm GREEDYMIS:

1. I ← ∅
2. for every vertex v ∈ G do
3.     if no neighbor of v is in I then
4.         Add v to I
5.     end if
6. end for

Observation: It suffices to consider only the neighbors of v which have been visited in a previous iteration.

Page 73: Algorithms for  Large Data Sets

Implementation Details

• Assume each vertex has a unique numerical ID

• Direct edges from lower to higher numbers

• Sort vertices by their number

• Consider vertices in sorted order


Page 74: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Algorithm GREEDYMIS:

1. I ← ∅
2. for every vertex v ∈ G in sorted order do
3.     if no in-neighbor of v is in I then
4.         Add v to I
5.     end if
6. end for

Page 75: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

[Figure: GREEDYMIS run on the example graph with vertices 1..11, edges directed from lower to higher numbers]

Page 76: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

[Figure: the finished run — the vertices of the maximal independent set highlighted among 1..11]

Page 77: Algorithms for  Large Data Sets

Implementation Details

How to check I/O-efficiently if any neighbor of v was already included in I? Time-forward processing:

• (Assume each vertex has a unique numerical ID)

• (Direct edges from lower to higher numbers)

• (Sort vertices by their number)

• Sort edges by the number of their sources

• After deciding whether v is included in I, v sends a flag to each of its out-neighbors to inform them whether or not v is in I (same as evaluating DAGs)

• Each vertex decides whether it should be added to I based solely on the flags received from its in-neighbors (see the sketch below)
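A minimal sketch of this decision loop (my illustration), reusing the heap-based message passing from the DAG-evaluation sketch above — Item, heap, hsize, push, and pop are the ones defined there:

/* greedy MIS by time-forward processing: in_I[v] is set to 1 iff v joins I;
   edges are assumed directed from lower to higher vertex numbers */
void greedy_mis(int n, int** adj, const int* deg, int* in_I) {
    for (int v = 0; v < n; v++) {
        int blocked = 0;
        while (hsize > 0 && heap[0].dst == v) /* flags from in-neighbors */
            blocked |= pop().msg;
        in_I[v] = !blocked;                   /* join I iff no in-neighbor is in I */
        for (int k = 0; k < deg[v]; k++)      /* inform out-neighbors */
            push((Item){ adj[v][k], in_I[v] });
    }
}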


Page 78: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Correctness follows from the following observations:

• The set I computed by the algorithm is independent since
  a) Vertex v is added to I only if none of its in-neighbors is in I
  b) At this point none of its out-neighbors can be in I yet, and the insertion of v into I prevents all of these out-neighbors from being added to I later

• The set I is maximal since otherwise there would be a vertex v ∉ I such that none of the in-neighbors of v are in I: but then v would have been added to I

Page 79: Algorithms for  Large Data Sets

Maximal Independent Set (MIS)

Theorem: A maximal independent set of a graph G = (V,E) can be computed in O(sort(|V|+|E|)) I/Os.


Page 80: Algorithms for  Large Data Sets

Large Independent Set of a List

This fills in the missing detail of the list ranking algorithm:

Corollary: An independent set of size at least N/3 for a list L of size N can be found in O(sort(N)) I/Os.

In a list, every vertex in an MIS I prevents at most two other vertices (its neighbors) from being in I, so every MIS of a list has size at least N/3.

Page 81: Algorithms for  Large Data Sets

Graph Connectivity

• Connected Components

• Minimum Spanning Trees


Page 82: Algorithms for  Large Data Sets

Connectivity

A graph G = (V,E) is connected if for any two vertices u,v in V there is a path between u and v in G.

The connected components of G are its maximal connected subgraphs.

First, a semi-external algorithm (vertices fit in main memory, edges don’t).

Next, a fully external algorithm: use graph contraction to reduce the number of vertices, and call the semi-external algorithm as soon as the vertices fit in main memory.

Page 83: Algorithms for  Large Data Sets

Connectivity: A Semi-External Algorithm
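The algorithm itself is a figure in the original deck; as a stand-in, here is a minimal sketch of the standard approach under the assumption |V| ≤ M: keep a union-find structure over the vertices in RAM and scan the edge list once from disk.

#include <stdio.h>
#include <stdlib.h>

static int* parent; /* in-memory union-find over the vertices (|V| <= M) */

static int find(int v) { /* root of v's set, with path halving */
    while (parent[v] != v) { parent[v] = parent[parent[v]]; v = parent[v]; }
    return v;
}

/* one scan of the edge file (pairs of int endpoints): O(scan(|E|)) I/Os */
void connected_components(int nV, FILE* edges) {
    int e[2];
    parent = malloc(nV * sizeof(int));
    for (int v = 0; v < nV; v++) parent[v] = v;
    while (fread(e, sizeof(int), 2, edges) == 2) {
        int ru = find(e[0]), rv = find(e[1]);
        if (ru != rv) parent[ru] = rv; /* merge the two components */
    }
    /* afterwards, find(v) is the component label of v; one more scan of
       the vertex set writes the labels back to disk */
}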

Page 84: Algorithms for  Large Data Sets

Connectivity: A Semi-External Algorithm

Analysis:

• Scan the vertex set to load the vertices into main memory

• Scan the edge set to carry out the algorithm

• O(scan(|V| + |E|)) I/Os

Theorem: If |V| ≤ M, the connected components of a graph can be computed in O(scan(|V| + |E|)) I/Os.

Page 85: Algorithms for  Large Data Sets

Connectivity: The General Case

Idea [Chiang et al. 1995]:

• If |V| ≤ M
  – Use the semi-external algorithm

• If |V| > M
  – Identify simple connected subgraphs of G
  – Contract these subgraphs to obtain a graph G’ = (V’,E’) with |V’| ≤ c|V|, c < 1
  – Recursively compute the connected components of G’
  – Obtain the labelling of the connected components of G from the labelling of the components of G’

Page 86: Algorithms for  Large Data Sets

Connectivity: The General Case

[Figure: example graph G with vertices a..n; each selected subgraph is contracted into a single vertex of G’ (A..E), and the component labels 1 and 2 computed for G’ are copied back to the vertices of G]

Page 87: Algorithms for  Large Data Sets

Connectivity: The General Case

Main steps:

• Find smallest neighbors

• Compute the connected components of the graph H induced by the selected edges

• Contract each component into a single vertex

• Call the procedure recursively

• Copy the label of every vertex v ∈ G’ to all vertices in G represented by v

Page 88: Algorithms for  Large Data Sets

Finding smallest neighbors

To find the smallest neighbor w(v) of every vertex v:

• Scan the edges and replace each undirected edge {u,v} with the directed edges (u,v) and (v,u)

• Sort the directed edges lexicographically; this produces the adjacency lists

• Scan the adjacency list of v and return as w(v) the first vertex in the list

This takes overall O(sort(|E|)) I/Os (see the sketch below).

To produce the edge set of the (undirected) graph H, sort and scan the edges {v, w(v)} to remove duplicates. This takes another O(sort(|V|)) I/Os.
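An in-memory sketch of this sort-and-scan pattern (my illustration; on disk, qsort becomes external mergesort and the loops become scans):

#include <stdlib.h>

typedef struct { int src, dst; } Edge;

static int cmp_edge(const void* a, const void* b) { /* lexicographic order */
    const Edge *x = a, *y = b;
    return x->src != y->src ? x->src - y->src : x->dst - y->dst;
}

/* after the call, w[v] is v's smallest neighbor, or -1 if v is isolated */
void smallest_neighbors(const Edge* e, long long m, int nV, int* w) {
    Edge* d = malloc(2 * m * sizeof(Edge));
    for (long long i = 0; i < m; i++) { /* direct each edge both ways */
        d[2 * i]     = (Edge){ e[i].src, e[i].dst };
        d[2 * i + 1] = (Edge){ e[i].dst, e[i].src };
    }
    qsort(d, 2 * m, sizeof(Edge), cmp_edge); /* groups edges by source */
    for (int v = 0; v < nV; v++) w[v] = -1;
    for (long long i = 0; i < 2 * m; i++) /* first entry per source is w(v) */
        if (w[d[i].src] == -1) w[d[i].src] = d[i].dst;
    free(d);
}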


Page 89: Algorithms for  Large Data Sets

Computing Conn Comps of H

Cannot use the same algorithm recursively (it didn’t reduce the vertex set).

Exploit the following property:

Lemma: Graph H is a forest.

Proof: Assume not. Then H must contain a cycle x0, x1, …, xk = x0. Since there are no duplicate edges, k ≥ 3. Since each vertex v has at most one incident edge {v,w(v)} in H, w.l.o.g. xi+1 = w(xi) for 0 ≤ i < k. Then the existence of {xi-1,xi} implies that xi-1 > xi+1. Similarly, xk-1 > x1.

If k is even: x0 > x2 > … > xk = x0 yields a contradiction.

If k is odd: x0 > x2 > … > xk-1 > x1 > x3 > … > xk = x0 yields a contradiction.

Page 90: Algorithms for  Large Data Sets

Exploit the Property that H is a Forest

• Apply the Euler tour technique to H in order to transform each tree into a list

• Now compute connected components using ideas from list ranking:
  – Find a large independent set I of H and remove the vertices in I from H
  – Recursively find the connected components of the smaller graph
  – Reintegrate the vertices in I (assign the component label of a neighbor)

This takes O(sort(|H|)) = O(sort(|V|)) I/Os

Page 91: Algorithms for  Large Data Sets

Recursive Calls

Every connected component of H has size at least 2
⇒ |V’| ≤ |V|/2
⇒ O(log(|V|/M)) recursive calls

Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.

Page 92: Algorithms for  Large Data Sets

Improved Connectivity via BFS

• BFS in O(|V| + sort(|E|)) I/Os [Munagala & Ranade 99]; BFS can be used to identify connected components
• When |V| = |E|/B, this algorithm takes O(sort(|E|)) I/Os
• Same algorithm as before, but stop the recursion earlier, when the number of vertices has been reduced to |E|/B (after log(|V|B/|E|) recursive calls)
• At this point, apply BFS rather than semi-external connectivity

Theorem: The connected components of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.

Page 93: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

A spanning tree of an undirected graph G = (V,E) is a tree T = (V,E’) such that E’ ⊆ E.

Given an undirected graph with costs assigned to its edges, the cost of a spanning tree is the sum of the costs of its edges.

A spanning tree T of G is minimum if there is no other spanning tree T’ of G such that cost(T’) < cost(T).

Page 94: Algorithms for  Large Data Sets

Spanning Trees

Observation: The connectivity algorithm can be augmented to produce a spanning tree (forest) of G.

SemiExternalST: add edge {v,w} whenever v and w are in different connected components

ExternalST: build a spanning tree (forest) of G from H and the spanning tree (forest) T’ produced by the recursive invocation of the algorithm on the compressed graph G’

Page 95: Algorithms for  Large Data Sets

ExternalST

The spanning tree produced is not necessarily minimum

[Figure: the example graph a..n with contracted components A..E — the edges chosen by ExternalST form a spanning tree of G, but not a minimum one]

Page 96: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

Simple modifications:

SemiExternalMST: Rather than inspecting the edges of G in arbitrary order, inspect edges by increasing weight. This increases the I/O-complexity from O(scan(|V| + |E|)) to O(scan(|V|) + sort(|E|)).

Essentially, a semi-external version of Kruskal’s algorithm.

Page 97: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

ExternalMST differs from ExternalST in a number of places:

a) Choose the edge of minimum cost, rather than the edge to the smallest neighbor (this maintains the invariant that H is a forest)

b) The weight of an edge e in the compressed graph G’ = the minimum weight of all edges represented by e

c) When “e is added” to T, add in fact this minimum-weight edge

[Figure: vertex v with incident edges of costs 4, 1, 5, 3 toward a, b, c, d; the minimum-cost edge is selected]

Page 98: Algorithms for  Large Data Sets

Minimum Spanning Tree (MST)

[Figure: the example graph a..n with contracted components A..E, now with minimum-weight edges chosen]

Theorem: An MST of a graph G = (V,E) can be computed in O(sort(|V|) + sort(|E|) log(|V|/M)) I/Os.

Page 99: Algorithms for  Large Data Sets

A Fast MST Algorithm

Idea:
– If we can compute an MST in O(|V| + sort(|E|)) I/Os
– Apply the same trick as before (BFS) and stop the recursion after log(|V|B/|E|) iterations

Arge et al. [2000] obtained the desired bound with an I/O-efficient implementation of Prim’s algorithm:

Page 100: Algorithms for  Large Data Sets

A Fast MST Algorithm

• Maintain the light blue (candidate) edges and the intra-tree edges in a priority queue Q

• When the edge {v,w} of minimum cost is retrieved, test whether v and w are both in T
  – Yes ⇒ discard the (intra-tree) edge
  – No ⇒ add the edge to the MST, and add to Q all edges incident to w, except {v,w} (assuming that w ∉ T)

Problem: How to test whether v,w ∈ T.

Page 101: Algorithms for  Large Data Sets

A Fast MST Algorithm

• If v,w ∈ T, but {v,w} ∉ T, then both v and w have inserted edge {v,w} into Q

⇒ There are two copies of {v,w} in Q
⇒ They are consecutive (assumption: all edge costs are distinct)
⇒ Perform two DELETEMIN operations, retrieving {v,w} and {y,z}:

– If {v,w} = {y,z}, discard both
– Otherwise, add {v,w} to T and re-insert {y,z}
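A sketch of that retrieval step (my illustration; pq_deletemin, pq_insert, add_to_mst, and push_incident_edges are hypothetical names standing in for the I/O-efficient priority-queue operations):

typedef struct { int u, v; double cost; } Edge;

/* hypothetical I/O-efficient priority-queue API — placeholder names */
extern Edge pq_deletemin(void);
extern void pq_insert(Edge e);
extern void add_to_mst(Edge e);
extern void push_incident_edges(int w, Edge except);

/* one step of the Prim-style loop: intra-tree edges are detected by their
   duplicate copies (distinct costs make the two copies consecutive in Q) */
void mst_step(void) {
    Edge e1 = pq_deletemin();
    Edge e2 = pq_deletemin();
    if (e1.u == e2.u && e1.v == e2.v) {
        /* {v,w} appeared twice: both endpoints already in T, discard both */
    } else {
        add_to_mst(e1);                /* e1 = {v,w} brings w into T */
        pq_insert(e2);                 /* e2 was not a duplicate: put it back */
        push_incident_edges(e1.v, e1); /* send w's other edges into Q */
    }
}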

Page 102: Algorithms for  Large Data Sets

A Fast MST Algorithm

Analysis:
• O(|V| + scan(|E|)) I/Os for retrieving the adjacency lists
• O(sort(|E|)) I/Os for the priority queue operations

Theorem: An MST of a graph G = (V,E) can be found in O(|V| + sort(|E|)) I/Os.

Corollary: An MST of a graph G = (V,E) can be found in O(sort(|V|) + sort(|E|) log(|V|B/|E|)) I/Os.

Page 103: Algorithms for  Large Data Sets

Graph Contraction and Sparse Graphs

• A graph G = (V,E) is sparse if, for any graph H obtainable from G through a series of edge contractions, |E(H)| = O(|V(H)|) (this includes planar graphs and graphs of bounded treewidth)

• For a sparse graph, the number of vertices and edges of G reduces by a constant factor in each iteration of the connectivity and MST algorithms.

Theorem: The connected components or an MST of a sparse graph with N vertices can be computed in O(sort(N)) I/Os.

Page 104: Algorithms for  Large Data Sets

Three Techniques for Graph Algs

• Time-forward processing:
  – Express graph problems as evaluation problems of DAGs

• Graph contraction:
  – Reduce the size of G while maintaining the properties of interest
  – Solve the problem recursively on the compressed graph
  – Construct the solution for G from the solution for the compressed graph

• Bootstrapping:
  – Switch to a generally less efficient algorithm as soon as (part of) the input is small enough