
Page 1: On Compressing Web Graphs

Michael Mitzenmacher, Harvard

Micah Adler, Univ. of Massachusetts

Page 2: The Web as a Graph

[Figure: Page A with hyperlinks to Pages B, C, and D, drawn as a graph with edges from node A to nodes B, C, and D.]

Page 3: Motivation

• The Web graph itself is interesting and useful.
  – PageRank / Kleinberg’s algorithm.
  – Finding cyber-communities.
  – Archival history of Web growth and development.
  – Connectivity server.

• Storing Web linkage information is expensive.
  – Web growth rate vs. storage growth rate?

• Can we compress it?

Page 4: Varieties of Compression

1. Compress an isomorphism of the Web graph. Good for storage/transmission of graph features.

2. Compress the Web graph with nodes in a given order (e.g. sorted by URL).

3. Compress so that the compressed graph can be used directly in a product (e.g. connectivity server).

Page 5: Baseline: Huffman coding

• Significant work has shown that the in/outdegrees of vertices of the Web graph follow a power-law distribution.

• Basic scheme: for each vertex, list all outedges.

• Assign Huffman codeword based on indegree.

Pr(indegree = j) ∝ 1/j^α
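To make the baseline concrete, here is a minimal sketch (my illustration, not the authors' code): build a Huffman code over vertices weighted by indegree, then write each vertex's outedge list as a string of codewords. Only the standard library is used; `huffman_code` and `encode_graph` are hypothetical names.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(freq):
    """Return a prefix-free bitstring per symbol; frequent symbols get shorter codes."""
    tiebreak = count()  # keeps the heap from ever comparing dicts
    heap = [(f, next(tiebreak), {sym: ""}) for sym, f in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + b for s, b in c1.items()}
        merged.update({s: "1" + b for s, b in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]

def encode_graph(outedges):
    """Baseline scheme: for each vertex, list its outedges as Huffman codewords
    whose lengths reflect the targets' indegrees."""
    indeg = Counter(v for targets in outedges.values() for v in targets)
    code = huffman_code(indeg)
    return {u: "".join(code[v] for v in targets) for u, targets in outedges.items()}

# Toy graph: C has the highest indegree and gets the shortest codeword.
g = {"A": ["B", "C", "D"], "B": ["C"], "C": ["B"], "D": ["C"]}
print(encode_graph(g))
```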

Page 6: Huffman Example

Indegree   Codeword
   1       100
   3       01
   2       001
   3       11
   1       0000
   1       0001
   1       101

Page 7: Web Graph Structure

• Intuition: Huffman uses degree distribution, but not Web graph structure.

• More structure to take advantage of: Web communities.

• Many pages share links.

[Figure: Web community example — pages A and B share links to C, D, and E; F also appears in the community.]

Page 8: Reference Algorithm

• Each vertex is allowed to choose a reference vertex.
• Compress by representing edges copied from the reference vertex as a bit vector.
• No cycles allowed among references.

[Figure: X and Y both link into {a, b, c, d, e, f}. X uses Y as its reference: X's outedges are encoded as the explicit edge a plus "ref Y [11100]", i.e. copy the first three of Y's outedges.]
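A minimal sketch of this encoding (illustrative names, not the authors' exact format): the bit vector marks which of the reference's outedges are copied, and any remaining edges are listed explicitly.

```python
def encode_with_reference(out_u, out_ref):
    """Encode u's outedges against a reference: one bit per reference outedge
    (copied or not), plus the leftover edges listed explicitly."""
    targets = set(out_u)
    bits = [1 if v in targets else 0 for v in out_ref]
    extra = [v for v in out_u if v not in set(out_ref)]
    return bits, extra

def decode_with_reference(bits, extra, out_ref):
    """Invert the encoding: copied reference edges plus the explicit edges."""
    copied = [v for v, b in zip(out_ref, bits) if b]
    return copied + extra

# X links to a, b, c; Y links to b, c, d, e, f (hypothetical edge lists).
out_x = ["a", "b", "c"]
out_y = ["b", "c", "d", "e", "f"]
bits, extra = encode_with_reference(out_x, out_y)
print(bits, extra)  # [1, 1, 0, 0, 0] ['a']
assert sorted(decode_with_reference(bits, extra, out_y)) == sorted(out_x)
```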

Page 9: Simple Reference Algorithm

• Maximize the number of edges compressed.

• Build a related affinity graph, recording number of shared pointers.

• Find a maximum spanning tree (or forest) to find best references.

[Figure: X and Y share three out-neighbors among a–f, so the affinity graph has an edge X–Y with weight 3.]
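The references can then be read off a maximum spanning forest of the affinity graph. A sketch using Kruskal's algorithm with a union-find structure (my own illustration; vertices are integers and edge weights are shared-pointer counts):

```python
def max_spanning_forest(n, weighted_edges):
    """Kruskal on descending weights: greedily keep the heaviest affinity
    edges that do not form a cycle; each kept edge pairs a vertex with a reference."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    chosen = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            chosen.append((u, v, w))
    return chosen

# Hypothetical affinity edges as (weight, u, v).
edges = [(3, 0, 1), (2, 1, 2), (1, 0, 2)]
print(max_spanning_forest(3, edges))  # [(0, 1, 3), (1, 2, 2)]
```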

Page 10: Improved Reference Algorithm

• Let cost(A,B) be the cost of compressing A using B as a reference.

• Form an improved affinity graph: directed graph with costs.

• Also add a root node R, with cost(A,R) being the cost of A with no reference.

• Compute the rooted directed maximum spanning tree on the directed affinity graph.

cost(A,B) = outdeg(B) + log n · |N(A) − N(B)| + 1, where N(v) is the set of out-neighbors of v and n is the number of vertices.
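A sketch of the improved algorithm under the cost reconstructed above (an assumption, including the "+ 1" term; the pairwise loop ignores the affinity-graph bottleneck discussed later). It minimizes total cost with networkx's Edmonds-based `minimum_spanning_arborescence`, the negated form of the slide's maximum spanning tree:

```python
import math
import networkx as nx  # assumes networkx is available

def build_directed_affinity(outedges, n):
    """Directed affinity graph: edge B -> A carries cost(A, B); the root edge
    R -> A carries cost(A, R) = outdeg(A) * log2(n), i.e. A stored with no reference."""
    logn = math.log2(n)
    G = nx.DiGraph()
    for a, out_a in outedges.items():
        G.add_edge("R", a, weight=len(out_a) * logn)
        for b, out_b in outedges.items():
            if a == b:
                continue
            shared = len(set(out_a) & set(out_b))
            if shared:  # only pairs sharing out-neighbors are worth an edge
                cost = len(out_b) + logn * (len(out_a) - shared) + 1
                G.add_edge(b, a, weight=cost)
    return G

# Minimizing total cost is equivalent to the slide's maximum tree under negation.
g = {"A": ["a", "b", "c", "d"], "B": ["c", "d", "e", "f", "g"]}
T = nx.minimum_spanning_arborescence(build_directed_affinity(g, 1024))
print(sorted(T.edges(data="weight")))  # picks R -> A, then A -> B
```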

Page 11: Example

[Figure: pages A and B with outlinks among a–f; n = 1024 vertices. Part of the directed affinity graph: edges between A and B with costs 25 and 34, and edges from the root R with costs 40 and 50.]

Page 12: Complexity

• Finding the directed maximum spanning tree is fast: for x vertices and y edges, the running time is O(x log x + y) or O(y log x).

• Compressing is fast given the references.

• The slow part is building the affinity graph.
  – Equivalent to sparse matrix multiplication.
  – If M is the adjacency matrix, the number of shared neighbors is found by computing MMᵀ (see the sketch below).
  – Sparseness helps, but still potentially very slow.
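A sketch of the MMᵀ computation with scipy.sparse (illustrative, not the prototype's code): entry (u, v) of MMᵀ is the number of out-neighbors that u and v share.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Adjacency matrix: row u has a 1 in column v for each edge u -> v.
rows = [0, 0, 0, 1, 1, 1, 2]  # edges 0->{3,4,5}, 1->{3,4,6}, 2->{3}
cols = [3, 4, 5, 3, 4, 6, 3]
M = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(7, 7))

# (M @ M.T)[u, v] = number of common out-neighbors of u and v.
shared = (M @ M.T).toarray()
print(shared[0, 1])  # 2.0: vertices 0 and 1 share out-neighbors {3, 4}
```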

Page 13: Building the Affinity Graph

• Approach 1: For each pair of vertices a, b, check the edge lists to find common neighbors.
  – Slow, but good with memory.

• Approach 2: For each vertex a, increase the count for each pair b, c of vertices with edges to a (sketched below).
  – Quicker, but a potential memory hog.
  – Parallelizable.
  – Complexity: O(Σ_{a∈V} indeg(a)²).
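A sketch of Approach 2 (my own illustration): invert the graph to get in-neighbor lists, then bump a counter for every pair of vertices pointing at the same target; the pair loop over indeg(a) in-neighbors gives the indeg(a)² term.

```python
from collections import Counter, defaultdict
from itertools import combinations

def affinity_counts(outedges):
    """Approach 2: for each vertex a, count every pair (b, c) of vertices that
    both point to a; total work is the sum over a of indeg(a)^2."""
    in_neighbors = defaultdict(list)
    for u, targets in outedges.items():
        for v in targets:
            in_neighbors[v].append(u)
    counts = Counter()
    for a, pointers in in_neighbors.items():
        for b, c in combinations(sorted(pointers), 2):
            counts[(b, c)] += 1
    return counts

g = {"X": ["a", "b", "c"], "Y": ["b", "c", "d"], "Z": ["c"]}
print(affinity_counts(g))  # (X,Y): 2 shared targets, (X,Z): 1, (Y,Z): 1
```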

Page 14: Variations

• Huffman code the non-referenced edges.
  – Using non-Huffman weights to find references is no longer optimal.
  – But we do not know the Huffman weights until the references are found.

• Huffman/run length/otherwise encode bit vectors.

• Bound the depth of tree.

• Find multiple references.

Page 15: Bounded Tree Depth

• For computing on the compressed form of the graph, we do not want a long path of references.

• Potential solution: bound the tree depth from the root.
• Problem: finding the optimal tree of bounded depth is NP-hard.
  – Depth 2 = facility location problem.

• In practice: use heuristic/approximation algorithms; split the full optimal tree to keep the depth bounded (a sketch follows).
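One plausible reading of the splitting heuristic (an assumption, not the authors' stated method): compute the unconstrained reference tree, then re-attach to the root any vertex whose reference chain would exceed the bound.

```python
def bound_depth(parent, root, max_depth):
    """Cut reference chains longer than max_depth by re-attaching deep
    vertices directly to the root (stored uncompressed, restarting the chain)."""
    depth = {root: 0}
    def d(v):
        if v not in depth:
            depth[v] = d(parent[v]) + 1
            if depth[v] > max_depth:
                parent[v] = root  # drop the reference chain here
                depth[v] = 1
        return depth[v]
    for v in list(parent):
        d(v)
    return parent

# Chain R <- A <- B <- C <- D with depth bound 2: C re-attaches to R.
refs = {"A": "R", "B": "A", "C": "B", "D": "C"}
print(bound_depth(refs, "R", 2))  # {'A': 'R', 'B': 'A', 'C': 'R', 'D': 'C'}
```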

Page 16: Multiple References

• If one reference is good, finding two could be better.

• We show that finding an optimal pair of references, even just to maximize the number of compressed edges, is NP-hard.

• In practice: run single Reference algorithm multiple times.

Page 17: Prototype

• Finds references by constructing the directed affinity graph and computing the directed maximum spanning tree.

• Does not output the compressed form, only its size.

• Also computes the Huffman and Reference + Huffman sizes.
  – The size of the Huffman table is not counted.

• Future work: dealing with the bottleneck of computing the affinity graph.

Page 18: Web Graph Models

• Copy models (a generator sketch follows):
  – New pages are generated dynamically.
  – Some links are “random”: uniform over all vertices.
  – Some links are copies: choose a page you like at random, and copy some of its links.
  – Richer models include deletions, changing links, inedges at creation.
  – Results in a power-law degree distribution.
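A minimal generator for this kind of copy model (my own sketch; parameter names are illustrative and loosely mirror the test graphs described later): each new page picks an existing page x, copies each of x's links with probability copy_prob, and adds random_links uniform links.

```python
import random

def copy_model(n_new, seed_nodes, seed_degree, copy_prob, random_links):
    """Grow a graph: each new page picks an existing page x, copies each of
    x's links with probability copy_prob, then adds uniform random links."""
    out = {v: [random.randrange(seed_nodes) for _ in range(seed_degree)]
           for v in range(seed_nodes)}
    for v in range(seed_nodes, seed_nodes + n_new):
        x = random.randrange(v)  # page whose links we may copy
        links = [w for w in out[x] if random.random() < copy_prob]
        links += [random.randrange(v) for _ in range(random_links)]
        out[v] = links
    return out

# Roughly the G1 setup from the slides: copy prob 0.5, one random link.
g = copy_model(10_000, seed_nodes=1024, seed_degree=3, copy_prob=0.5, random_links=1)
print(sum(len(e) for e in g.values()) / len(g))  # average outdegree near 2
```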

Page 19: Copy Model

[Figure: a new page makes a random link and also copies some of page X's links.]

Page 20: Data for Testing

• Random graphs generated using the copy model.

• TREC8 WT2g data set.

Graph   Nodes     Pages Copied   Copy Prob   Random Links
G1      131,072   1              0.5         1
G2      131,072   1              0.7         1
G3      131,072   [1,2]          0.5         [1,2]
G4      131,072   [0,4]          0.5         [0,4]
TREC    247,428   NA             NA          NA

Page 21: Testing Details

• Single pass: at most one reference.
• 10 trials for each random graph type.
  – Little variance found.
• Random graphs seeded with 1024 vertices of degree 3.
• Small graphs (G1, G2): edge between vertices in the affinity graph if they share at least 2 edges in the original. Large graphs (G3, G4, TREC): at least 3 shared edges.

Page 22: Results

Graph   Avg. Deg.   No comp. (Mbits)   Huffman   Reference   Ref+Huff
G1      2.09        4.66               87.75     88.68       81.58
G2      3.25        7.25               83.93     67.49       63.63
G3      5.10        11.36              85.15     69.96       65.35
G4      10.22       22.78              79.47     61.65       54.13
TREC    4.72        21.00              83.31     49.15       46.36

(Huffman, Reference, and Ref+Huff give compressed size as a percentage of the uncompressed size.)

Page 23: Analysis of Results

• Huffman fails to capture significant structure.

• More copying leads to more compression.

• Good compression possible even with only one reference.

• Performs well on “real” Web data.
  – The TREC database may not be representative.
  – Significant locality.

Page 24: Contributions

• We introduce the Reference algorithm, an algorithm designed to compress Web graphs based on their structural properties.

• Initial results: the Reference algorithm appears very promising and performs better than Huffman coding.

• Bounded depth variations may be suitable for on-line computing (connectivity server).

• Hardness results for natural extensions of Reference algorithm.

Page 25: Future Work

• Beating the bottleneck: determining the affinity graph.
  – Can we approximate the affinity graph and still compress well?

• More extensive testing.
  – Variations: multiple passes, bounded depth.
  – Graphs: larger artificial and real Web graphs.

• Determining the value of locality and combining locality with a reference-based scheme.