Mining Social and Information Networkscseweb.ucsd.edu/classes/fa16/cse259-a/slides/fragkiskos.pdf · Mining Social and Information Networks Fragkiskos D. Malliaros Ecole Polytechnique,

Introduction and Research Overview Robustness Estimation Identification of Influential Spreaders Graph-Based Text Analytics Concluding Remarks

Mining Social and Information Networks

Fragkiskos D. Malliaros

Ecole Polytechnique, France→ UC San Diego

E-mail: [email protected]

UCSD Artificial Intelligence Seminar

October 3, 2016

1/69 Fragkiskos D. Malliaros UCSD AI Seminar Mining Social and Information Networks

[email protected]


Networks

Networks allow to model relationshipsbetween entities

General-purpose languagefor describing real-world systems


Online Social Networks Source: https://www.facebook.com/zuck

Collaboration networks (Co-authorship)

Weblogs network (Political blogs)

Term co-occurrence network (David Copperfield novel by

Charles Dickens)


Motivation: Analytics in the Network Science Era

Overview of my work

Models, tools and observations for problems in the area of mining real-world networks

We build upon computationally efficient graph mining methods to:

1. Design models for analyzing the structure and dynamics of real-worldnetworks

� Unravel properties that can further be used in practical applications

2. Develop algorithmic tools for large-scale analytics on data with:

� Inherent graph structure (e.g., social networks)� Without inherent graph structure (e.g., text)




Overview of my work










Overview of my work










Overview of my work









Overview of my Work (1/2)

� Engagement dynamics and vulnerability assessment in socialnetworks

� [w/ Vazirgiannis, CIKM ’13; EPL ’15]

� Graph clustering and community detection� [w/ Vazirgiannis, Phys. Rep. ’13]� [w/ Giatsidis, Thilikos, Vazirgiannis, AAAI ’14]� [w/ Mitri, Vazirgiannis, Manuscript ’16]

� Anonymity models for real networks� [w/ Casas-Roma, Vazirgiannis, Manuscript ’16]



Overview of my Work (2/2)

� Spectral robustness estimation in real networks [This talk]� [w/ Megalooikonomou, Faloutsos, SDM ’12; KAIS ’15]

� Identification of influential spreaders (nodes) in complexnetworks [This talk]

� [w/ Rossi, Vazirgiannis, WWW ’15; Scientific Reports ’16]

� Graph-based text analytics (text categorization and keywordextraction) [This talk]

� [w/ Skianis, Vazirgiannis, ASONAM ’15; Manuscript ’16]� [w/ Tixier, Vazirgiannis, EMNLP ’16]

Main tools: spectral graph theory, core decomposition



Outline

1 Introduction and Research Overview

2 Robustness Estimation in Social Graphs

3 Locating Influential Spreaders in Social Networks

4 Graph-Based Text Analytics

5 Concluding Remarks


Robustness Estimation in Social Graphs


Graph Robustness Estimation

Main question:

What can we say about the robustness of large real-world social graphs?

Modular network Non modular network




Main question:






Main question:






Main question:






Main question:





Expansion Properties

� Quick estimation of the robustness of agraph by examining the expansionproperties

� Graph with good expansion properties:simultaneously sparse and highlyconnected

� Given a graph G = (V ,E), theexpansion of a subset of nodes S, with

|S| ≤ |V |2

is|N(S)||S|







|S| ≤ |V |2

is|N(S)||S|

S ⊂ V







|S| ≤ |V |2

is|N(S)||S|

S ⊂ V







|S| ≤ |V |2

is|N(S)||S|

S ⊂ V







|S| ≤ |V |2

is|N(S)||S|

S ⊂ V







|S| ≤ |V |2

is|N(S)||S|

Expansion factor

h(G) = min{S:|S|≤ |N|2

} |N(S)||S|



Why the Expansion Properties are Important?

� They can act as a natural measure of the graph’s robustness� Bottlenecks edges inside the network

� Good expansion properties imply high robustness� Any subset S ⊂ V will have a relatively large neighborhood� Poor modular structure

� Poor expansibility reflects absence of robustness� Modular structure→ Better community structure



Measuring Expansion Properties

� Computing the expansion factor is a computational difficultproblem

� Approximation techniques?

� Use the spectrum of the adjacency matrix A

� The expansion properties of a graph can be approximated by thespectral gap ∆λ = λ1 − λ2

Alon - Milman inequality

∆λ

2≤ h(G) ≤

√2λ1∆λ



First Approach in Estimating the Robustness

� Large spectral gap ∆λ implies good expansion properties→high robustness

� If λ2 is close enough to λ1, the spectral gap will be small→ lowrobustness

Spectral gap based approach

1. Compute the spectral gap2. If this is large, the graph will show high robustness properties

Problem

How large the spectral gap should be for a graph to have goodexpansibility?



First Approach in Estimating the Robustness

� Large spectral gap ∆λ implies good expansion properties→high robustness

� If λ2 is close enough to λ1, the spectral gap will be small→ lowrobustness

Spectral gap based approach

1. Compute the spectral gap2. If this is large, the graph will show high robustness properties

Problem

How large the spectral gap should be for a graph to have goodexpansibility?



Spectral Gap + Subgraph Centrality

� Combine the spectral gap with the subgraph centrality

� Subgraph centrality:

� Node centrality measure

� It is based on the number of closed walks that a node participates:

SC(i) =∞∑`=0

A`ii`!, ∀i ∈ V

� Using techniques from spectral graph theory

SC(i) =

|V |∑j=1

u2ij sinh(λj), ∀i ∈ V



Characterizing the Robustness : the Idea

SC(i) =∑|V |

j=1 u2ij sinh(λj), ∀i ∈ V

� Good expansion properties→ High robustness→ λ1 � λ2→u2

i1 sinh(λ1)�∑|V |

j=2 u2ij sinh(λj)

� For a graph with high robustness

SC(i) ≈ u2i1 sinh(λ1)

� This implies a power-law relationship

ui1 ∝ sinh−1/2(λ1) SC(i)1/2

� Deviations from this behavior imply existence (or lack thereof) ofhigh robustness



Observations and Shortcoming

1. Scalability issues� It requires the computation of all the pairs (λi , ui ), ∀i ∈ V of the

adjacency matrix A

� Computational bottleneck for large graphs with millions of nodesand edges

2. It cannot be applied directly to bipartite graphs



Proposed Metric: Generalized Robustness Index rkBasic Idea

Q: Can we efficiently approximate the SC(i), ∀i ∈ V?

1. The eigenvalues of A follow a power-lawdistribution [Faloutsos et al., ’99]

2. Almost symmetric eigenvalues around zero

3. sinh(·): odd function

� sinh(−x) = − sinh(x)0 1000 2000 3000 4000 5000 6000 7000

−100

−50

0

50

100

150

RankE

ige

nva

lue

Figure : Skewed spectrum (WIKI-VOTE)



Normalized Subgraph Centrality NSCk

SC(i) =∑|V|

j=1 u2ij sinh(λj ), ∀i ∈ V

� Use only the first top k eigenvalues and their correspondingeigenvectors

� Due to the sinh(·) function, the contribution of each eigenvaluewill be roughly canceled out by its (almost) symmetric eigenvalue

� The contribution of most of the eigenvalues is negligiblecompared with that of the first few eigenvalues

Normalized subgraph centrality

NSCk (i) =∑k


� Generally, k � |V | for real-world graphs



Normalized Subgraph Centrality NSCk

SC(i) =∑|V|

j=1 u2ij sinh(λj ), ∀i ∈ V

� Use only the first top k eigenvalues and their correspondingeigenvectors

� Due to the sinh(·) function, the contribution of each eigenvaluewill be roughly canceled out by its (almost) symmetric eigenvalue

� The contribution of most of the eigenvalues is negligiblecompared with that of the first few eigenvalues

Normalized subgraph centrality

NSCk (i) =∑k


� Generally, k � |V | for real-world graphs



Generalized Robustness Index rk

ui1 ∝ sinh−1/2(λ1) SC(i)1/2

Generalized robustness index rk

rk =

(1|V |

∑|V |i=1

{log(ui1)−

(log(sinh−1/2(λ1)) +

12

log(NSCk (i)))}2

)1/2

� Small rk value→ High robustness� For a robust enough graph, mostly (λ1, u1) will contribute to NSCk

� Parameter k� Trade-off between accuracy and required time

� It can be computed very easily in any programming environment



An Illustrative Example: Random vs. Real Graphs

103

104

105

106

10−2

10−1

100

Normalized Subgraph Centrality

Prin

cip

al E

ige

nve

cto

r

r = 3.5027e−05

(a) ER random graph (b) Discrepancy plot

10−2

100

102

104

10−8

10−6

10−4

10−2

100

Normalized Subgraph Centrality

Princip

al E

igenvecto

r

r = 2.4921

A..−L.B.

G.G.

(c) Network science graph (d) A.-L. Barabasi’s egonet (e) Discrepancy plot



DatasetsBasic characteristics of real-world networks

Graph # Nodes # Edges

EPINIONS 75, 877 405, 739EMAIL-EUALL 224, 832 340, 795SLASHDOT 77, 360 546, 487WIKI-VOTE 7, 066 100, 736FACEBOOK 63, 392 816, 886YOUTUBE 1, 134, 890 2, 987, 624CA-ASTRO-PH 17, 903 197, 031CA-GR-QC 4, 158 13, 428CA-HEP-TH 8, 638 24, 827DBLP 404, 892 1, 422, 263CIT-HEP-TH 26, 084 334, 091



Efficiency of rk Index

Scalability

0 5 10 15

x 105

0

10

20

30

40

50

60

Number of Edges

Tim

e (

se

co

nd

s)

Observed Time

Linear Fit

� Several snapshots of theDBLP graph

� Set k = 30 eigenpairs

� The rk index scales linearlywith respect to the number ofedges



Robustness of Large Static Graphs (1/2)Discrepancy Plots (Principal eigenvector vs. NSC in log-log scales)

EPINIONS

EMAIL-EUALL

SLASHDOT

WIKI-VOTE

FACEBOOK

YOUTUBE



Robustness of Large Static Graphs (2/2)Discrepancy Plots (Principal eigenvector vs. NSC in log-log scales)

CA-ASTRO-PH

DBLP-1980

CA-HEP-TH

DBLP-2006

CA-GR-QC

CIT-HEP-TH



Observations: Robustness of Large Social Graphs

� Almost all graphs show good expansion properties→ Highrobustness

� rk values tend to zero

� Not a clear modular structure

� Lack of well-defined clusters that can be easily separated fromthe whole graph

Observation: High Robustness Pattern

Large real-world social graphs exhibit good expansion propertiesand thus high robustness



Are the Observations Expected?

Community structure property

Social networks

� Social Networks ≡ Community Structure� Bad expansion properties→ Low robustness� Previously observed on small scale social graphs



Possible Explanations

DBLP-1980: |V | = 5K , |E| = 9K DBLP-2006: |V | = 405K , |E| = 1, 5M

� Structural differences between different scale networks



Time Evolving Graphs: Fragility Evolution (1/2)

� The graphs evolve over time

� Q: How the robustness of a graph changes over time?

� Study the fragility evolution of a graph

1. For every time point t in the dataset2. Form the graph up to the specific time point t3. For every time snapshot Gt we examine the rk index

� Datasets:� DBLP: 1960-2006 (cumulative graph snapshots per year)� CIT-HEP-TH: February 1993 - April 2003 (cumulative graph

snapshots per month)





















Time Evolving Graphs: Fragility Evolution (2/2)DBLP: rk over time

0 10 20 30 40 500

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Time [Years]

Ro

bu

stn

ess I

nd

ex r

t=16 (1975)

CIT-HEP-TH: rk over time

0 20 40 60 80 100 120 1400

0.5

1

1.5

2

2.5

3

Time [Months]

Ro

bu

stn

ess I

nd

ex r

t = 17 (June ’94)



Time Evolving Graphs: Fragility Evolution (2/2)DBLP: rk over time

0 10 20 30 40 500

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

Time [Years]

Ro

bu

stn

ess I

nd

ex r

t=16 (1975)

DBLP: Diameter over time

0 10 20 30 40 502

4

6

8

10

12

14

Time [Years]

Dia

me

ter

t=16 (1975)

CIT-HEP-TH: rk over time

0 20 40 60 80 100 120 1400

0.5

1

1.5

2

2.5

3

Time [Months]

Ro

bu

stn

ess I

nd

ex r

t = 17 (June ’94)

CIT-HEP-TH: Diameter over time

0 20 40 60 80 100 120 1400

2

4

6

8

10

12

14

Time [Months]

Dia

me

ter

t = 17 (June ’94)



Time Evolving Graphs: Fragility Evolution Pattern

Observation: Fragility evolution pattern

The spike of the robustness is aligned with the gelling point

General Observation:� At the first time points rk ↗ gradually→ Low robustness→ Good

community structure

� After a specific time point, rk starts↘ gradually→ The graphs tend tobecome more robust

� The time point that rk changes, corresponds to the gelling point[McGlohon et al., KDD ’08]

Explanation for the structural differences (regarding robust-ness and community structure) between different scale graphs



Time Evolving Graphs: Fragility Evolution Pattern

Observation: Fragility evolution pattern

The spike of the robustness is aligned with the gelling point

General Observation:� At the first time points rk ↗ gradually→ Low robustness→ Good

community structure

� After a specific time point, rk starts↘ gradually→ The graphs tend tobecome more robust

� The time point that rk changes, corresponds to the gelling point[McGlohon et al., KDD ’08]

Explanation for the structural differences (regarding robust-ness and community structure) between different scale graphs



Anomaly Detection (1/2)

Question

Can we spot anomalies over time using the rk index along with thefragility evolution pattern?

1. Examine the rk index over time, trying to identify and track abruptchanges and deviations

2. Sudden deviations from the fragility evolution pattern can possiblycorrespond to anomalies

3. The specific time snapshots (graphs) can be tagged as outliers



Anomaly Detection (2/2)Fragility Evolution of DBLP (lin-log)

0 10 20 30 40 5010

−15

10−10

10−5

100

105

Time [Years]

Robustn

ess Index r

2002 and 2003

� Strange behavior of the rk indexfor 2002 and 2003

Q: Are these two time graphs reallyoutliers?

A: Yes!

� Explanation1:� After 2001 a large number of new publications were introduced→

Robustness↗→ rk ↘� After 2002-2003 new research fields are covered from DBLP→

New fields formed new communities→ rk ↗

1Personal communication with Michael Ley and Florian Reitz of the DBLP server37/69 Fragkiskos D. Malliaros UCSD AI Seminar Mining Social and Information Networks


Conclusions

� Fast robustness estimation for large scale social networks

� Efficient and effective robustness index� Take advantage of the special spectral properties

� Patterns of large scale static and time evolving social graphs

� Anomaly detection combining the rk index with the observed patterns

� Robustness properties of network generation models

� Ongoing work: Given a budget of k edges, how to improve therobustness of the graph?


Locating Influential Spreaders

in Social Networks


Identification of Influential SpreadersMotivation

� Spreading processes in complex networks� Spread of news and ideas� Diffusion of influence� Disease propagation� Viral marketing (word-of-mouth effect)� ...

� Identification of influential spreaders (goal of this work)� Able to diffuse information to a large part of the network

� Understand and control spreading dynamics� E.g, vaccinate individuals with good spreading properties in epidemic

control



Identification of Influential SpreadersMotivation

� Spreading processes in complex networks� Spread of news and ideas� Diffusion of influence� Disease propagation� Viral marketing (word-of-mouth effect)� ...

� Identification of influential spreaders (goal of this work)� Able to diffuse information to a large part of the network

� Understand and control spreading dynamics� E.g, vaccinate individuals with good spreading properties in epidemic

control



Identification of Influential Spreaders: the Process

Typically, a two-step approach:

1. Consider a topological or centrality criterion of the nodes

� Rank the nodes accordingly

� The top-ranked nodes are candidates for the most influential ones

2. Simulate the spreading process over the network to examine theperformance of the chosen nodes

[Pei and Makse, ’13]



Identification of Single Influential SpreadersRelated work

� Straightforward approach: consider degree centrality� High degree nodes are expected to be good spreaders� Hub nodes can trigger big cascades

� However, degree is a local criterion� Bad case: star subgraph

� Core-periphery structure of real-world networks

Equal degree (8)







Equal degree (8)







Equal degree (8)



Degeneracy and k -core Decomposition

Example

3-core

2-core

1-core

Core number ci = 1

Core number ci = 2

Core number ci = 3

Graph Degeneracy δ∗(G) = 3

G0 = G

G1 = 1-core of G

G2 = 2-core of G

G3 = 3-core of G

G0 ⊇ G1 ⊇ G2 ⊇ G3

Detection of dense and cohesive subgraphs



The k -core Decomposition Finds Good Spreaders

Degree 96 Core number 63


Node A Node B

� The core number is better spreadingpredictor compared to the degree

� [Kitsak et al., Nature Phys. ’10]



The k -core Decomposition Finds Good Spreaders



Node A Node B

� The core number is better spreadingpredictor compared to the degree

� [Kitsak et al., Nature Phys. ’10]



Motivation and Contributions

� The k -core decomposition often returns a relatively large number ofcandidate influential spreaders

� Only a small fraction corresponds to highly influential nodes

� Can we further refine the set of the most influential spreaders?

Main contributions

� Propose the K -truss decomposition for locating influential nodes

� Triangle-based extension of the k -core decomposition

� Experimental evaluation

� Better spreading behavior� Faster and wider epidemic spreading



Motivation and Contributions

� The k -core decomposition often returns a relatively large number ofcandidate influential spreaders

� Only a small fraction corresponds to highly influential nodes

� Can we further refine the set of the most influential spreaders?

Main contributions

� Propose the K -truss decomposition for locating influential nodes

� Triangle-based extension of the k -core decomposition

� Experimental evaluation

� Better spreading behavior� Faster and wider epidemic spreading



K -truss Decomposition (1/2)Definitions

K -truss subgraph TK [Cohen, ’08], [Wang and Cheng, ’12]

K -truss TK = (VTK , ETK ),K ≥ 2: the largest subgraph of G where every edge iscontained in at least K − 2 triangles within the subgraph

Maximal K -truss subgraph

� The K -truss subgraph defined for the maximum value Kmax of K

� The nodes of this subgraph define set T



K -truss Decomposition (2/2)Maximal k -core and K -truss subgraphs

Set

Set

6

CT

� Set C: nodes of maximal k -core

� 3-core subgraph

� Set T : nodes of maximal K -truss

� 4-truss subgraph

� The maximal k -core and K -truss subgraphs overlap� K -truss is a subgraph of k -core (core of the k -core)� Heuristic to improve execution time

� We argue that set T contains highly influential nodes



K -truss Decomposition (2/2)Maximal k -core and K -truss subgraphs

Set

Set

6

CT

� Set C: nodes of maximal k -core

� 3-core subgraph

� Set T : nodes of maximal K -truss

� 4-truss subgraph

� The maximal k -core and K -truss subgraphs overlap� K -truss is a subgraph of k -core (core of the k -core)� Heuristic to improve execution time

� We argue that set T contains highly influential nodes



Experimental Set-up (1/2)Baseline methods

� truss method: nodes belonging to set T (Proposed method)

� core method: nodes belonging to set C − T

� top degree method: nodes with highest degree T� Choose |C| − |T | nodes for fair comparison



Experimental Set-up (2/2)How to simulate the spreading process?

Susceptible-Infected-Recovered (SIR) model

1. Set candidate node as infected (I state)

2. An infected node can infect its susceptible neighbors with probability β

� Set β close to the epidemic threshold τ = 1λ1

[Chakrabarti et al., ’08]

3. An infected node can recover (stop being active) with probability γ

� Set γ = 0.8

4. Count the total number of infected individuals (avg. over multiple runs)

S I Rβ γ

1− β 1− γ

State diagram of the SIR model

[Barrat et al., ’08]



Datasets and PropertiesCharacteristics of the K -truss subgraphs

Network Name Nodes Edges kmax Kmax |C| |T | τ

EMAIL-ENRON 33, 696 180, 811 43 22 275 45 0.00840EPINIONS 75, 877 405, 739 67 33 486 61 0.00540WIKI-VOTE 7, 066 100, 736 53 23 336 50 0.00720EMAIL-EUALL 224, 832 340, 795 37 20 292 62 0.00970SLASHDOT 82, 168 582, 533 55 36 134 96 0.00074WIKI-TALK 2, 388, 953 4, 656, 682 131 53 700 237 0.00870

� τ = 1/λ1: epidemic threshold of the graph (λ1: largest eigenvalue of A)

� Set T has significantly smaller size compared to set C



Datasets and PropertiesCharacteristics of the K -truss subgraphs

Network Name Nodes Edges kmax Kmax |C| |T | τ

EMAIL-ENRON 33, 696 180, 811 43 22 275 45 0.00840EPINIONS 75, 877 405, 739 67 33 486 61 0.00540WIKI-VOTE 7, 066 100, 736 53 23 336 50 0.00720EMAIL-EUALL 224, 832 340, 795 37 20 292 62 0.00970SLASHDOT 82, 168 582, 533 55 36 134 96 0.00074WIKI-TALK 2, 388, 953 4, 656, 682 131 53 700 237 0.00870

� τ = 1/λ1: epidemic threshold of the graph (λ1: largest eigenvalue of A)

� Set T has significantly smaller size compared to set C



Evaluation the Spreading Process

Apply SIR simulation starting from a single node v each time

� Number of infected nodes at each time step of the process

� Total number of infected nodes Mv

� The time step where the epidemic fades out



Average Number of Infected Nodes

Time StepMethod 2 4 6 8 10 Final step σ Max step

EMAIL- truss 8.44 46.66 204.08 418.77 355.84 2, 596.52 136.7 33ENRON core 4.78 31.97 152.55 367.28 364.13 2, 465.60 199.6 37

top degree 6.89 34.13 155.48 360.89 357.08 2, 471.67 354.8 36

EPINIONS truss 4.17 19.70 75.04 204.14 329.08 2, 567.69 227.8 37core 3.45 14.72 55.27 158.56 280.03 2, 325.37 327.2 43

top degree 4.22 16.03 58.84 166.23 289.49 2, 414.99 331.7 47

WIKI- truss 2.92 6.92 15.27 28.73 42.46 560.66 114.9 52VOTE core 1.92 4.78 10.65 20.66 32.40 466.01 104.5 57

top degree 2.43 5.46 12.05 23.05 35.55 502.88 104.5 62

EMAIL- truss 11.62 62.25 240.97 584.87 725.42 5, 018.52 487.94 36EUALL core 9.85 40.82 158.72 433.81 644.76 4, 579.84 498.71 38

top degree 17.96 39.93 144.69 503.18 548.25 4, 137.56 1, 174.84 39

� Performance of truss method

� Higher infection rate during the first steps of the process� The total number of infected nodes is larger� The epidemic dies out earlier






top degree 6.89 34.13 155.48 360.89 357.08 2, 471.67 354.8 36


top degree 4.22 16.03 58.84 166.23 289.49 2, 414.99 331.7 47


top degree 2.43 5.46 12.05 23.05 35.55 502.88 104.5 62


top degree 17.96 39.93 144.69 503.18 548.25 4, 137.56 1, 174.84 39








top degree 6.89 34.13 155.48 360.89 357.08 2, 471.67 354.8 36


top degree 4.22 16.03 58.84 166.23 289.49 2, 414.99 331.7 47


top degree 2.43 5.46 12.05 23.05 35.55 502.88 104.5 62


top degree 17.96 39.93 144.69 503.18 548.25 4, 137.56 1, 174.84 39








top degree 6.89 34.13 155.48 360.89 357.08 2, 471.67 354.8 36


top degree 4.22 16.03 58.84 166.23 289.49 2, 414.99 331.7 47


top degree 2.43 5.46 12.05 23.05 35.55 502.88 104.5 62


top degree 17.96 39.93 144.69 503.18 548.25 4, 137.56 1, 174.84 39





Comparison to the Optimal Spreading (1/2)Methodology

OPT1 OPT2 OPT3 · · · OPT|V |

Window W

1. Rank nodes according to the total infection size Mv

� OPT1 ≥ OPT2 ≥ . . . ≥ OPT|V |, where OPT1 = arg maxv∈V Mv

2. Consider window W over the ranked nodes

PTW =|TW |/|T ||W |/|V |

� TW is the set of nodes v ∈ T located in W



Comparison to the Optimal Spreading (1/2)Methodology

OPT1 OPT2 OPT3 · · · OPT|V |

Window W

1. Rank nodes according to the total infection size Mv

� OPT1 ≥ OPT2 ≥ . . . ≥ OPT|V |, where OPT1 = arg maxv∈V Mv

2. Consider window W over the ranked nodes

PTW =|TW |/|T ||W |/|V |

� TW is the set of nodes v ∈ T located in W



Comparison to the Optimal Spreading (2/2)Results

EMAIL-ENRON

0 0.5 1 1.50

20

40

60

80

100

Window W (%)

PW

(%

)

C

T

WIKI-VOTE

0 5 100

20

40

60

80

100

Window W (%)

PW

(%

)

C

T

� PTW reaches the maximum value (i.e., 100%) relatively early and forsmall window sizes, compared to PCW

� The nodes detected by the K -truss decomposition are betterdistributed among the most efficient spreaders



Comparison to the Optimal Spreading (2/2)Results

EMAIL-ENRON

0 0.5 1 1.50

20

40

60

80

100

Window W (%)

PW

(%

)

C

T

WIKI-VOTE

0 5 100

20

40

60

80

100

Window W (%)

PW

(%

)

C

T

� PTW reaches the maximum value (i.e., 100%) relatively early and forsmall window sizes, compared to PCW

� The nodes detected by the K -truss decomposition are betterdistributed among the most efficient spreaders



Discussion

� Core decomposition methods can help towards identifying singleinfluential spreaders

� K -truss decomposition

� Faster and wider epidemic spreading

� Well distributed nodes among those that are achieving the optimalspreading

� Ongoing work: Identification of multiple influential spreaders



From Influential Spreaders to Influential TermsIn a network of words

Graph mining and core decomposition

Detection of influential users in social networks

↓

Influential (important) entities in an interconnected space

↓

Influential terms (i.e., words) for text analytics

Text as information network

with application to keyword selection and text categorization






↓


↓









↓


↓









↓


↓





Graph-Based Text Analytics


Text is Everywhere

� Rapid growth of social media and networking platforms� Vast amount of textual data� Analyze and extract useful information

� Text analytics (mining) tasks� Summarization (e.g., keyword extraction)� Categorization (e.g., sentiment analysis)� Information Retrieval� ...



Text is Everywhere

� Rapid growth of social media and networking platforms� Vast amount of textual data� Analyze and extract useful information

� Text analytics (mining) tasks� Summarization (e.g., keyword extraction)� Categorization (e.g., sentiment analysis)� Information Retrieval� ...



Term Weighting in the Bag-of-Words Model

� Vector Space Model� Collection of m documents: D = {d1, d2, . . . , dm}� Set of n distinct terms (dictionary): T = {t1, t2, . . . , tn}

� Feature extraction and term weighting in the Bag-of-words model

� Find: feature vector di = {wi,1,wi,2, . . . ,wi,n}, ∀di ∈ D� Each document is represented as a bag (multiset) of its words

� Limitations� Term independence assumption of frequency-based weighting

schemes (e.g., TF, TF-IDF)� Order of terms is disregarded



















ExampleText to graph representation; The Graph-of-words model

Graph representation of a document (w = 3; undirected graph)

Data Science is the extraction of knowledge from large volumes of datathat are structured or unstructured which is a continuation of the field ofdata mining and predictive analytics, also known as knowledge discoveryand data mining.

data

scienc

extract

knowledg

larg

volum

structur

unstructur

continu

field

mine

predict

analyt

discoveri

known



ExampleText to graph representation; The Graph-of-words model

Graph representation of a document (w = 3; undirected graph)

Data Science is the extraction of knowledge from large volumes of datathat are structured or unstructured which is a continuation of the field ofdata mining and predictive analytics, also known as knowledge discoveryand data mining.

data

scienc

extract

knowledg

larg

volum

structur

unstructur

continu

field

mine

predict

analyt

discoveri

known



Degeneracy-based Keyword Extraction

●

●

●

●

●

●

●

●

●

●

●

●

●

●

mathemat

aspect

computer−aid

share

trade

problem

statist

analysi

price

probabilist

characterist

seri

method

model

Edge weights

12345

Core numbers

891011

●

●

●

inf

shells

diffe

renc

e in

she

ll si

zes

−6

−

4

−2

0

2

(10 − 11) (9 − 10) (8 − 9)

●

●

●

●

0.38

0.42

0.46

0.50

dens

k−cores

dens

ity

11 10 9 8

●

● ● ●

● ●●

● ●●

●

●

● ●

2 4 6 8 10 12 14

7090

110

130TR

nodes

TR s

core

s

elbowtop 33%

Mathematical aspects of computer-aided share trading. We consider problems of statistical analysis of share prices and propose probabilistic characteristics to describe the price series. We discuss three methods of mathematical modelling of price series with given probabilistic characteristics.

Core numbers TR scoresP R F1 P R F1

MAIN 0.86 0.55 0.67 ELB 1 0.18 0.31INF 0.83 0.91 0.87

PER 1 0.45 0.63DENS 0.88 0.64 0.74

mathemat 11 price .1359

price 11 share .0948

probabilist 11 .0906

characterist 11 .0870

seri 11 .0860

method 11 mathemat .0812

model 11 analysi .0633

share 10 statist .0595

trade 9 method .0569

problem 9 problem .0560

statist 9 trade .0525

analysi 9 model .0493

aspect 8 computer-aid .0453

computer-aid 8 aspect .0417

MAIN

INF

DENS

ELB

probabilist

characterist

seri

CR scoresP R F1

ELB 0.90 0.82 0.86

PER 1 0.45 0.63

mathemat 128

price 120

analysi 119

share 118

probabilist 112

characterist 112

statist 108

trade 97

problem 97

seri 94

method 85

computer-aid 76

model 66

aspect 65

PER

ELB

PER

(a)

(c)

(b)

●

●●

● ●●

●●

● ●●

●●

●

2 4 6 8 10 12 14

0.04

0.08

0.12

CRelbowtop 33%

nodes

CR

sco

res

(a) Graph-of-words (W = 8) for document #1512 of Hulth2003 decomposed with k -core. (b) Keywords extracted by the different methods

(human assigned keywords in bold). (c) Selection criterion of each method.


Concluding Remarks


Mining Large Scale Networks

Network analytics with

spectral techniques and core decomposition

1. Spectral robustness estimation of social graphs

2. Core decomposition based identification of influential spreaders

3. Degeneracy-based keyword extraction



Ongoing and Future Work

� Community detection and dense subgraph detectionalgorithms

� Influence maximization in social networks

� Algorithms for rich network structures� Time, location, strength, trust, text� Probabilistic, signed and multilayer graphs

� Machine Learning tasks on graphs� Link prediction, node classification and graph classification� Node and graph embeddings based on deep learning

� Text analytics using graph-based techniques



Publications I

F. D. Malliaros, V. Megalooikonomou, and C.Faloutsos

Fast Robustness Estimation in Large Social Graphs: Communities and Anomaly DetectiionIn SIAM International Conference on Data Mining (SDM), 2012.

F. D. Malliaros and M. Vazirgiannis

Clustering and Community Detection in Directed Networks: A Survey.Physics Reports, 533(4), Elsevier, 2013.


To Stay or Not to Stay: Modeling Engagement Dynamics in Social Graphs.In ACM International Conference on Information and Knowledge Management (CIKM), 2013.

C. Giatsidis, F. D. Malliaros, D. M. Thilikos, and M. Vazirgiannis

CoreCluster: A Degeneracy Based Graph Clustering Framework.In AAAI Conference on Artificial Intelligence, (AAAI), 2014.

M.-E. G. Rossi, F. D. Malliaros, and M. Vazirgiannis

Spread It Good, Spread It Fast: Identification of Influential Nodes in Social Networks.In International Conference on World Wide Web Companion (WWW), 2015.


Vulnerability Assessment in Social Networks under Cascade-based Node Departures.EPL (Europhysics Letters), 11(6), 2015.

F. D. Malliaros and Michalis Vazirgiannis

Disengagement Social Contagion: Assessing Network Vulnerability under Node Departures.In International Conference on Computational Social Science (ICCSS), 2015.



Publications II

F. D. Malliaros, V. Megalooikonomou, and C.Faloutsos

Estimating Robustness in Large Social Graphs.Knowledge and Information Systems (KAIS), Springer, 2015.

F. D. Malliaros and K. Skianis

Graph-Based Term Weighting for Text Categorization.In IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Social Media and RiskWorkshop (SoMeRis), 2015.

F. D. Malliaros, M. Rossi, and M. Vazirgiannis

Locating Influential Nodes in Complex Networks.Scientific Reports, 6(19307), Nature Publishing Group, 2016.

A. J.-P. Tixier, F. D. Malliaros, and M. Vazirgiannis.

A Graph Degeneracy-based Approach to Keyword Extraction.In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.

F. D. Malliaros, K. Skianis, and M. Vazirgiannis

A Graph-Based Framework for Text Categorization: Term Weighting and Selection as Ranking of Interconnected Features.Submitted Manuscript, 2016.

J. Casas-Roma, F. D. Malliaros, and M. Vazirgiannis

k -Degree Anonymity on Directed Networks.Manuscript, 2016.

M. Mitri, F. D. Malliaros, and M. Vazirgiannis

Sensitivity of Community Structure to Network Uncertainty.Manuscript, 2016.



References I

V. Batagelj and M. Zaversnik.

AnO(m) Algorithm for Cores Decomposition of Networks.CoRR, 2003.

A. Barrat, M. Barthlemy, and A. Vespignani.Dynamical Processes on Complex Networks.Cambridge University Press, 2008.

D. Chakrabarti, Y. Wang, C. Wang, J. Leskovec, and Christos Faloutsos.

Epidemic Thresholds in Real Networks.ACM Trans. Inf. Syst. Secur., 10(4), 2008.

J. Cohen

Trusses: Cohesive Subgraphs for Social Network Analysis.National Security Agency Technical Report, 2008.

D. Easley and J. KleinbergNetworks, Crowds, and Markets: Reasoning About a Highly Connected World.Cambridge University Press, 2010.

M. Faloutsos, P. Faloutsos, and C. Faloutsos

On Power-law Relationships of the Internet Topology.In SIGCOMM, 1999.

D. Kempe, J. Kleinberg, and E. Tardos

Maximizing the Spread of Influence Through a Social Network.In KDD, 2003.

M. Kitsak, L. Gallos, S. Havlin, F. Liljeros, L. Muchnik, H. Stanley, and H. Makse

Identification of Influential Spreaders in Complex Networks.Nature Physics, 6(11), 2010.



References II

S. Pei and H. A. Makse

Spreading Dynamics in Complex Networks.Journal of Statistical Mechanics: Theory and Experiment, 12, 2013.

S. B. Seidman

Network Structure and Minimum Degree.Social Networks 5, 1983.

J. Wang and J. Cheng

Truss Decomposition in Massive Networks.Proc. VLDB Endow., 5(9), 2012.



Thank You!

NETWORK AND DATA SCIENCE

Graph Mining

Social Networks Analysis

Spectral Graph Theory

Core Decomposition

Applied Machine Learning

Natural Language Processing

Engagementdynamics

Social vulnerabilityassessment

Graph-basedtext analytics

Privacy in realnetworks

Spectral robust-ness estimation

Communitydetection

Influentialspreaders


Documents

Mining Social and Information Networkscseweb.ucsd.edu/classes/fa16/cse259-a/slides/fragkiskos.pdf · Mining Social and Information Networks Fragkiskos D. Malliaros Ecole Polytechnique,