13.1.2014 DIMA - TU Berlin
Compute “Closeness” in Graphs using Apache Giraph
… using probabilistic data structures.
Today: Validation
IMPRO-3, TU Berlin, Winter 13/14
Robert Metzger, Robert Waury
Quick Recap on our Task
● Measure reachable nodes within s steps from a node n in a graph → N(n,s).
  N(“Robert”,1)=80, N(“Robert”,2)=10413 …
● The largest s at which N(n,s) still grows, over all nodes n, is the graph diameter.
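As a point of reference for the neighbourhood count N from the recap, here is a minimal exact computation using breadth-first search. It is only a sketch: the adjacency-dict graph format and the function name `reachable_within` are hypothetical, and it assumes the start node itself is not counted.

```python
from collections import deque


def reachable_within(graph, start, max_steps):
    """Return [N(start, 0), ..., N(start, max_steps)].

    graph: dict mapping a node to an iterable of its neighbours
           (hypothetical format, not taken from the slides).
    Assumption: the start node itself is not counted.
    """
    dist = {start: 0}
    queue = deque([start])
    counts = [0] * (max_steps + 1)  # counts[s] = nodes at distance exactly s
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                counts[dist[neighbour]] += 1
                queue.append(neighbour)
    # Turn "exactly s steps" into "at most s steps" with a running sum.
    for s in range(1, max_steps + 1):
        counts[s] += counts[s - 1]
    return counts


example = {
    "a": ["b", "e"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": ["a"],
}
print(reachable_within(example, "a", 3))
```

Running one such search per node is what becomes too expensive on large graphs, which is the motivation for the sketch-based variants.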
Robert’s Xing Network
What happened so far ...
● Giraph implementation:
  ○ a) Bitfield
  ○ b) Flajolet-Martin sketch
    ■ 32 bit with Thomas Wang’s integer hash
    ■ 64 bit with MurmurHash 2.0
  ○ c) HyperLogLog sketch with MurmurHash 2.0
● Drafted Stratosphere “Spargel” implementation
● Benchmarked a) and b) for AIM-3
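To illustrate variant b), the following is a minimal sketch of a single 32-bit Flajolet-Martin bitmask. It is not the Giraph code from the project: the function names are hypothetical, and `hashlib.blake2b` is used as a stand-in hash (an assumption) instead of Thomas Wang’s integer hash or MurmurHash 2.0. The useful property for graph processing is that two sketches are merged with a bitwise OR, so a vertex can combine the sketches received from its neighbours.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def hash32(item):
    # Stand-in 32-bit hash (assumption); the slides use other hash functions.
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def fm_add(sketch, item):
    """Set the bit at the position of the lowest set bit of the item's hash."""
    h = hash32(item)
    position = (h & -h).bit_length() - 1 if h else 31
    return sketch | (1 << position)


def fm_merge(a, b):
    """The sketch of a union of sets is the bitwise OR of their sketches."""
    return a | b


def fm_estimate(sketch):
    """Estimate the cardinality from the index of the lowest zero bit."""
    r = 0
    while sketch & (1 << r):
        r += 1
    return (2 ** r) / PHI


left = 0
for node_id in range(0, 60):
    left = fm_add(left, node_id)
right = 0
for node_id in range(40, 100):
    right = fm_add(right, node_id)
print(fm_estimate(fm_merge(left, right)))
```

A single bitmask has high variance, so practical implementations usually average over several bitmasks.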
Validating the correctness of the implementation ...
● Approach: Assume the “bitfield” implementation as the reference and measure the correlation with the results from the other implementations.
● On two (small) datasets:
  ○ General Relativity and Quantum Cosmology collaboration network (coauthor relationships). Largest CC: 4,158 nodes.
  ○ Enron email network. Largest CC: 33,696 nodes.
Statistical Methods to determine correlation
● Kendall's τ (tau)
  ○ -1 ≤ τ ≤ 1
  ○ expects an order (ranking), e.g. Comparable interface ;-)
● Spearman's ρ (rho)
  ○ same properties as Kendall's τ, but checks whether the relation is monotonic (not just linear)
● Pearson’s r
  ○ checks for linear correlation
  ○ uses the actual values (not just ranks)
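The three coefficients can be sketched in plain Python as follows. This is an illustration under assumptions, not the tooling used for the results: the function names are hypothetical, Kendall's τ is computed in its tie-aware τ-b form with a simple quadratic loop, and Spearman's ρ is computed as Pearson's r on average ranks.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Linear correlation of the actual values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Pearson's r applied to ranks, so any monotonic relation scores high."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: compares the ordering of every pair of observations."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: counted nowhere
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))


exact = [1, 2, 3, 4, 5]
approx = [1, 4, 9, 16, 25]  # monotonic but not linear in `exact`
print(kendall_tau_b(exact, approx), spearman(exact, approx), pearson(exact, approx))
```

In the example the two rank-based coefficients reach 1 because the ordering is preserved, while Pearson's r stays below 1 because the relation is not linear.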
Coauthorship Results (I)
       Kendall’s τ          Spearman’s ρ         Pearson’s r
FM32   0.906881050538273    0.98765689317449     0.991695076216846
FM64   0.905736944670186    0.987400738579957    0.991700042774567
HLL    0.931782793461063    0.993272573234886    0.9956213651786
→ High (linear) correlation with all metrics ✔
→ HyperLogLog has the highest correlation and the best memory properties
Coauthorship Results (II)
→ HLL is the best approximation
→ outliers can be identified with higher confidence than central nodes
→ nodes with highest closeness tend to have similar values
       Top10   Top100    Top1000     Last1   Last100
FM32   6/10    76/100    891/1000    1/1     94/100
FM64   5/10    69/100    881/1000    1/1     94/100
HLL    8/10    80/100    932/1000    1/1     95/100
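One plausible reading of the Top-k and Last-k columns is the number of nodes that appear in both the reference ranking's k best (or k worst) nodes and the approximation's. The slides do not spell out the exact procedure, so the sketch below is an assumption; the function name and the tie-breaking by node id are hypothetical.

```python
def top_k_overlap(reference, approx, k, largest=True):
    """Count the nodes shared by the k highest (or lowest) ranked nodes.

    reference, approx: dicts mapping node id -> score.
    Ties are broken by node id so the result is deterministic (assumption).
    """
    def top(scores):
        ordered = sorted(scores, key=lambda n: (scores[n], n), reverse=largest)
        return set(ordered[:k])

    return len(top(reference) & top(approx))


bitfield = {"a": 10, "b": 9, "c": 8, "d": 1}
sketch = {"a": 11, "b": 7, "c": 9, "d": 2}
print(top_k_overlap(bitfield, sketch, 2))                 # Top2 agreement
print(top_k_overlap(bitfield, sketch, 1, largest=False))  # Last1 agreement
```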
Enron Results (I)
→ High (linear) correlation with all metrics ✔
→ HyperLogLog has the highest correlation and the best memory properties
       Kendall’s τ           Spearman’s ρ          Pearson’s r
FM32   0.9138299158409239    0.9880939188638478    0.9935462917118506
FM64   0.8894530452951206    0.9803803899254973    0.9902062846287614
HLL    0.9335364446051608    0.9927569721570411    0.9966840593148085
Enron Results (II)
       Top10   Top100    Top1000     Last1   Last100
FM32   5/10    80/100    877/1000    1/1     96/100
FM64   7/10    66/100    839/1000    1/1     97/100
HLL    8/10    86/100    889/1000    1/1     97/100
→ HLL is again the best approximation
→ outliers can be identified with higher confidence than central nodes
Validation Summary
● HyperLogLog exhibits the highest correlation in all experiments. It also has the lowest memory footprint.
● We assume that these results hold for larger data sets.
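For context on the memory footprint, here is a minimal HyperLogLog sketch with register-wise maximum as the merge operation. It is an illustration under assumptions, not the project's implementation: the precision `P = 10`, the function names, and the use of `hashlib.blake2b` as a stand-in for MurmurHash 2.0 are all hypothetical choices, and only the basic small-range correction (linear counting) is included.

```python
import hashlib
import math

P = 10       # number of index bits (assumption)
M = 1 << P   # number of registers


def hash64(item):
    # Stand-in 64-bit hash (assumption); the slides use MurmurHash 2.0.
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def hll_new():
    return [0] * M


def hll_add(registers, item):
    h = hash64(item)
    index = h >> (64 - P)               # top P bits select the register
    rest = h & ((1 << (64 - P)) - 1)    # remaining 64 - P bits
    # 1-based position of the leftmost 1-bit in `rest`
    rank = (64 - P) - rest.bit_length() + 1
    registers[index] = max(registers[index], rank)


def hll_merge(a, b):
    """The sketch of a union is the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def hll_estimate(registers):
    alpha = 0.7213 / (1 + 1.079 / M)    # bias constant, valid for M >= 128
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)  # linear counting for small cardinalities
    return raw


sketch = hll_new()
for node_id in range(1000):
    hll_add(sketch, node_id)
print(round(hll_estimate(sketch)))
```

With these settings each register holds a value of at most 55, so 6 bits per register suffice and the whole sketch fits in 768 bytes regardless of how many nodes are added.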
Next step
● Benchmark implementations with larger datasets (that require Giraph out-of-core execution)
● Datasets:
Description                                            Name           Vertices      Edges           Text File Size in GB
The data of Stanford's WebBase 2001 crawl as a graph   webbase-2001   118,142,155   1,019,903,190   9.46
Follower relationships                                 twitter-2010   41,652,230    1,468,365,182   12.49