Giornata di Studio: “Prospettive di sviluppo della matematica applicata in Italia 2011”

Rome, 8 April 2011

Sampling and Measuring Large Graphs
A Complex Network Analysis Study

Emilio Ferrara

Advisor: Prof. Giacomo Fiumara

Dottorato di Ricerca in Matematica
Dipartimento di Matematica
Università degli Studi di Messina



Outline

1 Introduction
    Social Network Analysis and Mining
    Sampling Architecture

2 Online Social Network Analysis
    Modeling
    Sampling Techniques
    Results

3 Related Work


Introduction and Objectives

Social Network Analysis and Mining (SNAM) includes different techniques from sociology, social sciences, mathematics, statistics and computer science.

Objectives
  • Analysis of the structure of a social network
  • Analysis of large sub-networks and connected components
  • Discovering nodes of particular interest
  • Identifying communities within the network

Advantages
  • Large-scale studies, impossible before, are feasible
  • Data can be acquired automatically
  • A huge amount of information is accessible online
  • Data can be acquired at different levels of granularity

Limits
  • Problems related to large-scale data mining
  • Computational and algorithmic challenges
  • Bias in the data should be investigated


Web Data Extraction

WDE Systems: Software platforms for the automatic and intelligent extraction of data from Web pages, in the form of static and/or dynamic content, in order to store them in a database (or other structured data source) and make them available to other applications.

Wrapper: An algorithmic procedure that extracts unstructured information from a data source (such as a Web page) and transforms it into a structured format.

Automatic Wrapper Adaptation: A novel approach has been proposed to make wrappers adaptive to structural changes.


Clustered Tree Matching

HTML Web pages are represented as trees, whose nodes contain the elements displayed in the page.

XPath: A standard language defined to identify elements within a Web page. Wrappers implement the XPath logic.

Key aspects (Ferrara, 2011)
  • Inspired by Simple Tree Matching (STM) a
  • Assigns weights to evaluate the importance of matches
  • Behaves differently for leaves and middle-level nodes
  • Introduces a degree of accuracy
  • Identifies clusters of similar sub-trees

a Tree to tree editing problem, Selkow, 1977
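As a point of reference, plain Simple Tree Matching can be sketched as below. This is only the STM baseline that CTM is inspired by, not the CTM weighting scheme, whose details are not given on the slide. The `(label, children)` tuple encoding and the function name are assumptions made for illustration.

```python
def simple_tree_matching(a, b):
    """Return the size of the maximum matching between two ordered trees.

    Each tree is a (label, children) tuple, where children is a list of trees.
    This encoding is an illustrative assumption, not the author's data model.
    """
    label_a, children_a = a
    label_b, children_b = b
    if label_a != label_b:
        return 0  # roots differ: no matching is possible
    m, n = len(children_a), len(children_b)
    # M[i][j]: best matching between the first i sub-trees of a
    # and the first j sub-trees of b (dynamic programming table).
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(children_a[i - 1], children_b[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 counts the matched roots


tree_a = ("a", [("b", []), ("c", [])])
tree_b = ("a", [("b", []), ("d", []), ("c", [])])
print(simple_tree_matching(tree_a, tree_b))  # 3: nodes a, b and c match
```

STM returns a raw count of matched nodes, which is why a normalized measure such as the one CTM provides is easier to compare against a threshold.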


Tree Matching Algorithm: Example (I)

Figure: A and B are two similar trees. CTM assigns weights to matching nodes. Node f in A has weight 1/2 because in B it appears in a sub-tree with two children. Node h in B has weight 1/3 for the same reason.


Tree Matching Algorithm: Example (II)

Figure: Pairs of matrices W and M, step by step. CTM solves the similarity problem in 6 steps. Final similarity result: 3/8. Grey cells identify similar clusters between the two trees.


Tree Matching Algorithm: Advantages and Limits

Advantages
  • CTM produces an intrinsic measure of similarity (STM calculates only the number of different elements)
  • A degree of accuracy can be introduced by establishing a minimum similarity threshold required to match two clusters
  • The more complex and similar the structures of the compared trees are, the better the similarity calculated by CTM

Limits: CTM/STM do not produce perfect results when nodes are added or removed at middle levels.

Further considerations: Pure text can be matched using classic string matching techniques, e.g. Jaro-Winkler, bigrams, etc.
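One of the string techniques mentioned, bigram matching, can be sketched as a Dice coefficient over character bigrams. This is a minimal illustration with assumed function names; it is not necessarily the variant used in the wrapper platform, and Jaro-Winkler is not covered here.

```python
from collections import Counter


def bigrams(text):
    """Return the list of character bigrams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


def bigram_similarity(a, b):
    """Dice coefficient over character bigrams, in [0, 1]."""
    count_a, count_b = Counter(bigrams(a)), Counter(bigrams(b))
    total = sum(count_a.values()) + sum(count_b.values())
    if total == 0:
        # Strings too short to have bigrams: fall back to exact comparison.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((count_a & count_b).values())  # multiset intersection
    return 2 * shared / total


print(bigram_similarity("night", "nacht"))  # 0.25 (only "ht" is shared)
```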


Auto-adaptive Web Wrappers: Example

Figure: An example of automatic adaptation to modifications. The upper part of the screenshot shows the original page structure; the lower part shows the new version. Both the page structure and its contents have been modified. Elements matched by the original wrapper are still identified in the modified page by applying the automatic adaptation policy.

Agent of Web data extraction

Intelligent Agent: A platform (software + architecture) that can autonomously make smart decisions to achieve a goal.

  • Each Web wrapper is implemented as an Agent
  • Several Agents populate the same environment
  • If a Wrapper fails, it adapts itself to changes
  • Results are collected transparently w.r.t. users

Figure: Web Data Mining platform architecture


Social Networks: Taxonomy

Social Networks (SN): A social network is a social structure made up of individuals (or organizations) connected to each other by (possibly) different social ties, such as friendship, kinship, shared interest, knowledge, etc.

Social Network Analysis: The analysis of social networks (i.e., studying, modeling and measuring them) can be conducted using the formalism of graph theory. The theory and models adopted for the study of SNs are part of the so-called Social Network Analysis.

Several types of network exist: collaboration, communication, friendship, etc. Our study focuses on Online Social Networks:
  • Social communities: Facebook, MySpace, etc.
  • Content sharing: YouTube, Flickr, etc.


Social Networks: Examples of OSN

Figure: An example of an Online Social Network

Analysis of Online Social Networks: Motivations

Q: Is it possible to model social networks?
A: Analysis of the characteristics and properties of OSN graphs

Open problems
  • Improving algorithms:
      • For visiting large graphs (e.g., BFS, Uniform, etc.)
      • To efficiently store and represent data (matrix decomposition, etc.)
      • Efficient and meaningful visualization of large graphs
      • Optimization of metrics calculation (e.g., All-Pairs-Shortest-Path, Betweenness Centrality, etc.)
  • Investigation of scalability
  • Considering similarities between OSNs and real social networks


Background and Related Work

  • Milgram: The Small World problem (1970)
  • Zachary: The first model of a real SN (1980)
  • Kleinberg: Algorithmic perspective of SNs (2000)
  • Barabasi, Newman, et al.: 2000+, focus on OSNs
      • Large-scale data mining from OSNs
      • Visualization of large graphs
      • Dynamics and evolution of OSNs
      • SNA metrics calculation
      • Clustering, community structure, etc.

Remarks: SNA is a “young” branch, born in the context of the social sciences and moving towards mathematics and computer science in recent years.


Mining the Facebook graph: Breadth-first search

BFS (breadth-first search): starting from a seed, the graph is visited by exploring all the neighbors in order of discovery.

Pros
  • Optimal solution for unweighted and/or undirected graphs (such as Facebook and other OSNs)
  • Intuitive implementation

Cons: Resulting samples are biased towards high-degree nodes in incomplete visits.

Challenge: Obtaining a sub-graph of the Facebook network which preserves the properties of the complete graph.

Figure: BFS (3rd sub-level)
  1: seed
  2-4: friends
  5-8: friends of friends
  9-12: friends of friends of friends
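A budgeted BFS visit of this kind can be sketched as follows. The `neighbors` callback stands in for whatever returns a profile's friend list (in the talk, a Web wrapper); it and the other names are hypothetical. The sketch separates visited nodes from merely discovered ones, the same distinction the dataset statistics make.

```python
from collections import deque


def bfs_sample(neighbors, seed, max_visits):
    """Visit at most `max_visits` nodes in BFS order starting from `seed`.

    `neighbors(node)` is a hypothetical callback returning the friend list.
    Returns (visited, discovered, edges).
    """
    visited = []            # nodes whose friend list was actually fetched
    discovered = {seed}     # every node seen so far, visited or not
    edges = set()           # undirected edges stored as frozensets
    queue = deque([seed])   # FIFO queue gives breadth-first order
    while queue and len(visited) < max_visits:
        node = queue.popleft()
        visited.append(node)
        for friend in neighbors(node):
            edges.add(frozenset((node, friend)))
            if friend not in discovered:
                discovered.add(friend)
                queue.append(friend)
    return visited, discovered, edges


graph = {1: [2, 3, 4], 2: [1, 5, 6], 3: [1, 7], 4: [1, 8],
         5: [2], 6: [2], 7: [3], 8: [4]}
visited, discovered, edges = bfs_sample(graph.__getitem__, seed=1, max_visits=4)
print(visited, len(discovered), len(edges))  # [1, 2, 3, 4] 8 7
```

Stopping on a visit budget is what makes the visit incomplete, and incomplete visits are where the bias towards high-degree nodes arises.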


Mining the Facebook graph: Uniform sampling

Uniform (rejection sampling): a list of random nodes to be visited is generated.

Pros
  • Independent of the structural distribution of friendship ties
  • Produces unbiased results
  • Simple and efficient implementation

Cons: The resulting graph has disconnected components.

Challenge: Acquiring a uniform sub-graph with a huge connected component.

Figure: Uniform sampling
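Rejection sampling over a user-ID space can be sketched as below. The slide does not say how candidates are generated, so the integer ID range, the `exists` check and the `max_tries` guard are all assumptions for illustration.

```python
import random


def uniform_sample(id_space, exists, n_samples, rng=None, max_tries=1_000_000):
    """Draw up to `n_samples` distinct valid IDs uniformly at random.

    `id_space` is the size of the (assumed) integer ID range and
    `exists(candidate)` a hypothetical check that the ID maps to a real profile.
    """
    rng = rng or random.Random()
    sample = set()
    tries = 0
    while len(sample) < n_samples and tries < max_tries:
        tries += 1
        candidate = rng.randrange(id_space)  # uniform over the whole ID range
        if exists(candidate):                # reject IDs with no profile
            sample.add(candidate)
    return sample


valid_ids = {3, 17, 42, 99, 256, 512}
picked = uniform_sample(1000, valid_ids.__contains__, 4, random.Random(0))
print(len(picked), picked <= valid_ids)  # 4 True
```

Every valid ID is equally likely to be accepted, so the sample does not depend on the friendship structure. For the same reason, the sampled nodes need not be connected to each other.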


Mining the Facebook graph: How the Agent works

Initialization:
  • Authentication on FB
  • Selection of an example friendlist
  • Generation of the wrapper for the automatic extraction

Execution:
  • Generation of a FIFO queue of profiles to be visited
  • For each profile in the queue:
      • Visit the friendlist page:
          • Extract friends (nodes) and relationships (edges)
          • (Possibly) put new friends in the FIFO queue
      • Repeat the process

Figure: Diagram of the process of data extraction from Facebook


Mining the Facebook graph: Data cleaning

Data cleaning: O(n) (optimal time)
  1 Remove duplicates using hash tables
  2 Delete parallel edges
  3 Anonymize

Structuring data: Final data are stored as GraphML, a standard XML format for representing graphs. It contains a description of the nodes within the graph and of the edges connecting them.
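The three cleaning steps and the GraphML export can be sketched in one linear pass. The anonymization shown (replacing each original ID with a sequential integer) and the function names are assumptions; the slide does not specify how anonymization was done.

```python
import xml.etree.ElementTree as ET


def clean_edges(raw_edges):
    """Deduplicate, drop parallel edges and anonymize in a single O(n) pass."""
    ids = {}       # hash table: original ID -> anonymous integer (assumed scheme)
    seen = set()   # hash set of undirected edges already emitted
    cleaned = []
    for u, v in raw_edges:
        a = ids.setdefault(u, len(ids))
        b = ids.setdefault(v, len(ids))
        key = (a, b) if a <= b else (b, a)  # (u, v) and (v, u) are the same edge
        if key not in seen:
            seen.add(key)
            cleaned.append(key)
    return len(ids), cleaned


def to_graphml(n_nodes, edges):
    """Serialize an undirected graph as a minimal GraphML document."""
    root = ET.Element("graphml", {"xmlns": "http://graphml.graphdrawing.org/xmlns"})
    graph = ET.SubElement(root, "graph", {"id": "G", "edgedefault": "undirected"})
    for i in range(n_nodes):
        ET.SubElement(graph, "node", {"id": f"n{i}"})
    for a, b in edges:
        ET.SubElement(graph, "edge", {"source": f"n{a}", "target": f"n{b}"})
    return ET.tostring(root, encoding="unicode")


raw = [("alice", "bob"), ("bob", "alice"), ("alice", "bob"), ("bob", "carol")]
n_nodes, edges = clean_edges(raw)
print(n_nodes, edges)  # 3 [(0, 1), (1, 2)]
print(to_graphml(n_nodes, edges))
```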


Mining the Facebook graph: Agent execution

Figure: The Agent i) visits the page containing the friendlist, ii) generates a Wrapper to extract the Name and ID of each friend, iii) inserts these data into the graph, and iv) proceeds with the next profile in the list.


Network Analysis Metrics: Characteristics of the Facebook graph

Ego-centric network: the term ego denotes a user connected to others (alters)

Unweighted, undirected network:
  • Degree 1.0
  • Degree 1.5
  • Degree 2.0

Remarks: The graph shows a natural clustering effect over the principal areas of a user's life: friends, colleagues, family, etc.
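The slide names the three ego-network levels without defining them. The sketch below assumes the commonly used reading: 1.0 is the ego and its alters, 1.5 adds the ties among alters, and 2.0 adds the alters' own friends. The adjacency-dict representation and the function name are illustrative assumptions.

```python
def ego_network(adj, ego, level):
    """Extract the ego network of `ego` at level 1.0, 1.5 or 2.0.

    `adj` maps each node to the set of its neighbors (undirected graph).
    The level definitions follow a common convention and are an assumption.
    """
    alters = set(adj[ego])
    nodes = {ego} | alters
    edges = {frozenset((ego, a)) for a in alters}           # level 1.0
    if level >= 1.5:
        # Add ties among the alters themselves.
        edges |= {frozenset((a, b)) for a in alters for b in adj[a] if b in alters}
    if level >= 2.0:
        # Add every friend of an alter, with the connecting edge.
        for a in alters:
            for b in adj[a]:
                nodes.add(b)
                edges.add(frozenset((a, b)))
    return nodes, edges


adj = {"e": {"a", "b"}, "a": {"e", "b", "c"}, "b": {"e", "a"},
       "c": {"a", "d"}, "d": {"c"}}
for level in (1.0, 1.5, 2.0):
    nodes, edges = ego_network(adj, "e", level)
    print(level, len(nodes), len(edges))  # 1.0 3 2, then 1.5 3 3, then 2.0 4 4
```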


Network Analysis Metrics: Measures

Perer and Shneiderman 1 provided a summary of useful metrics:
  • Overall metrics: no. of nodes, edges, density, diameter, etc.
  • Centrality measures: degree, betweenness and closeness centrality
  • Nodes in pairs: plotting degree vs. betweenness
  • Cohesive sub-groups: discovering communities, community structure

       N. Visited users   N. Discovered users   N. edges
BFS    63.4K              8.21M                 12.58M
Uni    48.1K              7.69M                 7.84M

       Avg. deg.   Eigenvectors   Diameter   Clustering   Coverage   Density
BFS    396.8       68.93          8.75       0.0789       98.98%     0.626%
Uni    326.0       23.63          16.32      0.0471       94.96%     0.678%

Table: Dataset: BFS and Uniform (acquired during August 2010)

1 Balancing systematic and flexible exploration of social networks, 2006
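As a rough consistency check, the average degree and density reported for the two samples appear to follow from the usual formulas 2E/N and 2E/(N(N-1)), with N taken to be the number of visited users. That choice of N is inferred from the rounded figures and is not stated on the slide.

```python
def average_degree(n_nodes, n_edges):
    """Average degree of an undirected graph: 2E / N."""
    return 2 * n_edges / n_nodes


def density(n_nodes, n_edges):
    """Density of an undirected graph: 2E / (N (N - 1))."""
    return 2 * n_edges / (n_nodes * (n_nodes - 1))


# (visited users, edges), taken from the rounded values in the table.
samples = {"BFS": (63_400, 12_580_000), "Uni": (48_100, 7_840_000)}
for name, (n, e) in samples.items():
    print(name, round(average_degree(n, e), 1), f"{100 * density(n, e):.3f}%")
# BFS 396.8 0.626%
# Uni 326.0 0.678%
```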


Social Network Analysis Aspects: Visualization of data

Remarks: Our data contain the same information we would obtain by acquiring all the friendship relations among all the inhabitants of a medium-sized town (e.g., 100k people).

Visualization of Social Networks: Providing a meaningful graphical representation of a large network, in order to gain greater insight into its structure, is a big challenge, both algorithmic and computational.

Problems
  • The more the complexity of the network increases, the less legible it becomes.
  • Operations such as interaction with nodes and edges, filtering and manual positioning are required.

A: Our group 2 developed LogAnalysis, a powerful visual tool to analyze social network structures.

2 A visual tool for forensic analysis of mobile phone traffic, Catanese & Fiumara, 2010

Facebook Network Graph: Visual results

Figure: NodeXL: Unfiltered graph (Dataset: 25K nodes sub-graph)

Figure: NodeXL: Filtered graph (Dataset: 25K nodes sub-graph)

Figure: NodeXL: Filtered results (Dataset: 25K nodes sub-graph)

Figure: LogAnalysis: Force Directed graph (Dataset: 25K nodes sub-graph)

Figure: LogAnalysis: Clustering (Dataset: 25K nodes sub-graph)

Figure: LogAnalysis: Radial view (2.0 degree)

Betweenness Centrality results

Figure: Top 25 Nodes ordered w.r.t. BC (Dataset: 25K nodes sub-graph)

Facebook Network Graph: Distributions in Facebook

Degree distribution of the nodes in the network

  • Social Networks usually follow power-law distributions, such as $P(k) \sim k^{-\gamma}$, with $k$ the node degree and $\gamma \leq 3$.
  • This implies the existence of a relatively small number of users who are highly connected to each other.
  • This distribution can be represented by a Complementary Cumulative Distribution Function (CCDF).

Figure: Tail of the power-law

Figure: CCDF
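An empirical CCDF of a degree sequence can be computed as in this minimal sketch, taking the CCDF at k to be the fraction of nodes with degree at least k. The function name is an assumption. On log-log axes, a power-law tail shows up as an approximately straight line.

```python
from collections import Counter


def degree_ccdf(degrees):
    """Map each observed degree k to the fraction of nodes with degree >= k."""
    n = len(degrees)
    counts = Counter(degrees)
    ccdf = {}
    remaining = n  # number of nodes with degree >= current k
    for k in sorted(counts):
        ccdf[k] = remaining / n
        remaining -= counts[k]
    return ccdf


print(degree_ccdf([1, 1, 2, 3]))  # {1: 1.0, 2: 0.5, 3: 0.25}
```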


Facebook Network Graph: Graph clustering

Clustering coefficient: The measure of how much the nodes of a graph tend to “group” together.

Results: The mean value detected in FB lies in the interval [0.05, 0.2], in line with other well-known real Social Networks.

Diameter of the network: The mean diameter is smaller than 10, as in Milgram's Small World theory.

Figure: Diameter

Figure: Clustering coefficient
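The slide does not state which definition of the clustering coefficient was used. The sketch below assumes the standard local one: for a node with k neighbors, the fraction of the k(k-1)/2 possible neighbor pairs that are themselves connected, averaged over all nodes.

```python
from itertools import combinations


def local_clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are directly connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # no pair of neighbors exists
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)


# Triangle a-b-c with a pendant node d attached to a.
adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(local_clustering(adj, "a"))  # 1 of 3 neighbor pairs is linked
print(average_clustering(adj))     # (1/3 + 1 + 1 + 0) / 4
```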


Facebook Network Graph: Betweenness centrality

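Betweenness centrality $b_i$ of a node $i$ is defined as

$$b_i = \sum_{j \neq k} \frac{n_{jk}(i)}{n_{jk}},$$

where $n_{jk}$ is the number of shortest paths connecting $j$ and $k$ and $n_{jk}(i)$ is the number of those passing through $i$, i.e., the number of times the node lies on the shortest paths connecting two other nodes.

Remarks: It is well known that BC follows a power law $p(g) \sim g^{-\eta}$ in scale-free networks.

Results: We showed that this also holds for the Facebook network.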
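For unweighted graphs, betweenness centrality is commonly computed with Brandes' algorithm, sketched below. The slides do not say which algorithm was used for the reported results. The sum here runs over ordered pairs (j, k), so each undirected pair is counted twice; halve the values to count unordered pairs.

```python
from collections import deque


def betweenness(adj):
    """Betweenness centrality of every node of an unweighted graph (Brandes)."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Phase 1: BFS from s, counting shortest paths (sigma) to each node.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:  # v precedes w on a shortest path
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Phase 2: accumulate dependencies in reverse BFS order.
        delta = {v: 0.0 for v in adj}
        while order:
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc


path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(betweenness(path))  # {'a': 0.0, 'b': 2.0, 'c': 0.0}
```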


Facebook Community Structure

Community Structure: A sub-structure of the overall graph in which the density of relationships within the community is much greater than the density of connections among communities.

Model: A common formulation of this problem is to find, in a meaningful manner, a partitioning $V = V_1 \cup V_2 \cup \dots \cup V_n$ into disjoint subsets of the vertices of the graph $G = (V, E)$ representing the network.

Algorithms: The most popular quantitative technique is the Q-modularity (or network modularity), proposed by Newman 3.

Q-modularity

$$Q = \sum_{s=1}^{m} \left[ \frac{l_s}{E} - \left( \frac{d_s}{2E} \right)^2 \right] \qquad (1)$$

$l_s$: number of edges between vertices belonging to the $s$-th community; $d_s$: sum of the degrees of these vertices.

High values of Q, in [0, 1], imply an evident community structure.

3 Finding and evaluating community structure in networks, Newman, 2004
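The modularity of a given partition can be evaluated directly from the definitions of l_s and d_s, using Q = sum over communities of [l_s/E - (d_s/2E)^2], with E the total number of edges. The sketch below assumes an adjacency-dict representation; the names are illustrative.

```python
def modularity(adj, communities):
    """Q-modularity of a partition of an undirected graph.

    `adj` maps each node to the set of its neighbors;
    `communities` is an iterable of disjoint node collections.
    """
    total_edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for members in communities:
        members = set(members)
        d_s = sum(len(adj[v]) for v in members)  # sum of degrees in community s
        # Each internal edge is seen from both endpoints, hence the division by 2.
        l_s = sum(1 for v in members for w in adj[v] if w in members) / 2
        q += l_s / total_edges - (d_s / (2 * total_edges)) ** 2
    return q


# Two triangles joined by the single edge 3-4.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}}
print(modularity(adj, [{1, 2, 3}, {4, 5, 6}]))   # 5/14, about 0.357
print(modularity(adj, [{1, 2, 3, 4, 5, 6}]))     # 0.0 for a single community
```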


Facebook Network Graph: Algorithms and Results

LPA (Label Propagation Algorithm) 4

FNCA (Fast Network Community Algorithm) 5

Algorithm   N. of Communities   Q        Time (s)
BFS (8.21 M vertices, 12.58 M edges)
FNCA        50,156              0.6867   5.97e+004
LPA         48,750              0.6963   2.27e+004
Uniform (7.69 M vertices, 7.84 M edges)
FNCA        40,700              0.9650   3.77e+004
LPA         48,022              0.9749   2.32e+004

Table: Results on Facebook Network Samples

4 Near linear algorithm to detect community structures, Raghavan et al., 2007
5 Fast Complex Network Clustering Algorithm Using Agents, Jin et al., 2009
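The idea behind label propagation can be sketched as follows. Every node starts with its own label and repeatedly adopts the label most frequent among its neighbors, until no node changes. This is a simplified asynchronous variant that keeps a node's current label whenever it is among the most frequent ones (an assumption made here to aid convergence). It is not the implementation behind the reported results, and FNCA is not covered.

```python
import random
from collections import Counter


def label_propagation(adj, rng=None, max_sweeps=100):
    """Return a dict node -> community label for an undirected graph."""
    rng = rng or random.Random()
    labels = {v: v for v in adj}  # each node starts in its own community
    nodes = list(adj)
    for _ in range(max_sweeps):
        rng.shuffle(nodes)        # random visiting order in every sweep
        changed = False
        for v in nodes:
            if not adj[v]:
                continue          # isolated nodes keep their own label
            counts = Counter(labels[w] for w in adj[v])
            best = max(counts.values())
            candidates = [lab for lab, c in counts.items() if c == best]
            if labels[v] not in candidates:
                labels[v] = rng.choice(candidates)  # ties broken at random
                changed = True
        if not changed:
            break                 # every node already holds a majority label
    return labels


# Two disjoint triangles end up as two communities.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5, 6}, 5: {4, 6}, 6: {4, 5}}
labels = label_propagation(adj, random.Random(0))
print(len(set(labels.values())))  # 2
```

Each sweep only inspects the neighbors of every node, so its cost is linear in the number of edges.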


Facebook Community Structure: Uniform Sample

Uniform Sample: The power-law distribution is evident.

FNCA Algorithm: $P(k)_{FNCA} \sim k^{-\gamma}$, with $k$ the node degree and $\gamma = 0.53$.

LPA Algorithm: $P(k)_{LPA} \sim k^{-\gamma}$, with $\gamma = 0.49$.


Facebook Community Structure: BFS Sample

The differences in behavior between the distributions of the BFS and “Uniform” samples reflect the sampling techniques adopted.

Gjoka et al. 6 and Kurant et al. 7 highlighted the possible bias towards high-degree nodes introduced by using the BFS algorithm.

6 Walking in facebook: A case study of unbiased sampling, Gjoka et al., 2010
7 On the bias of BFS (Breadth First Search), Kurant et al., 2010


Facebook Community Structure: Overlapping Distributions

Observation: These two distributions, regardless of the sampling and community detection algorithms adopted, appear to overlap strongly.

Q: We would like to qualitatively investigate the similarity among the results produced by the LPA and FNCA techniques.

A: We can compare the obtained sets using similarity metrics, e.g., Jaccard and/or Cosine Similarity.


Facebook Community Structure: Similarity Measures

Binary Jaccard Coefficient:

$$J(v, w) = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$$

where $M_{11}$ represents the total number of elements shared between vectors $v$ and $w$, $M_{01}$ the total number of elements belonging to $w$ and not to $v$, and $M_{10}$ the converse. The result lies in [0, 1].

Cosine Similarity:

$$\cos(\Theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A_i$ and $B_i$ represent the binary frequency vectors computed on the list members over $i$.

Degree of Similarity FNCA vs. LPA

Metric   Dataset   In Common   Mean     Median   Std. D.
J        BFS       2.45%       73.28%   74.24%   18.76%
         Uniform   35.57%      91.53%   98.63%   15.98%

Table: Similarity degree of community structures
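For binary membership vectors, both measures reduce to set operations on the member lists of two communities. This is a minimal sketch with assumed function names. For binary vectors the dot product is the size of the intersection and each norm is the square root of the set size.

```python
import math


def jaccard(v, w):
    """Binary Jaccard coefficient M11 / (M01 + M10 + M11) of two member sets."""
    v, w = set(v), set(w)
    union = len(v | w)  # M01 + M10 + M11
    return len(v & w) / union if union else 1.0


def cosine(v, w):
    """Cosine similarity of the binary membership vectors of two member sets."""
    v, w = set(v), set(w)
    if not v or not w:
        return 0.0
    return len(v & w) / math.sqrt(len(v) * len(w))


community_fnca = {1, 2, 3}
community_lpa = {2, 3, 4}
print(jaccard(community_fnca, community_lpa))  # 0.5
print(cosine(community_fnca, community_lpa))   # 2/3, about 0.667
```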


Facebook Community Structure: Similarity Results


Related Work I

Ferrara E., Baumgartner R. Automatic Wrapper Adaptation by Tree Edit Distance Matching. In: Combinations of Intelligent Methods and Applications – Springer 2011.

De Meo P., Ferrara E., Fiumara G. Finding similar users in Facebook. In: Social Networking and Community Behavior Modeling: Qualitative and Quantitative Measures – IGI Global 2011.

Catanese S., De Meo P., Ferrara E., Fiumara G., Provetti A. Crawling Facebook for social network analysis purposes. In: International Conference on Web Intelligence, Mining and Semantics – 2011.

Ferrara E., Baumgartner R. Design of automatically adaptable web wrappers. In: 3rd International Conference on Agents and Artificial Intelligence – 2011.

Catanese S., De Meo P., Ferrara E., Fiumara G. Analyzing the Facebook friendship graph. In: 1st International Workshop on Mining the Future Internet – 2010.


Related Work II

Catanese S., De Meo P., Ferrara E., Fiumara G., Provetti A. Extraction and Analysis of Facebook Friendship Relations. In: Social Network Book 2011 (under review).

Catanese S., Ferrara E., Fiumara G. Social network analysis of Facebook. In: Journal of Computational Science Special Issue on Social Computational Systems (under review).

Ferrara E., Fiumara G., Baumgartner R. Web data extraction, applications and techniques: A survey. In: ACM Computing Surveys (under review).

Ferrara E. Community Structure Discovery in Facebook. In: Int. J. Social Network Mining (under review).

Ferrara E., Jin D. A Large-Scale Community Structure Analysis in Facebook. In: 11th International Conference on Knowledge Management and Knowledge Technologies (under review).
