
Understanding and Managing Cascades on Large Graphs

B. Aditya Prakash

Computer Science, Virginia Tech.

CS Seminar 11/30/2012

Prakash 2012

Networks are everywhere!

Human Disease Network [Barabasi 2007]

Gene Regulatory Network [Decourty 2008]

Facebook Network [2010]

The Internet [2005]


Dynamical Processes over networks are also everywhere!

Why do we care?
• Social collaboration
• Information Diffusion
• Viral Marketing
• Epidemiology and Public Health
• Cyber Security
• Human mobility
• Games and Virtual Worlds
• Ecology
• Localized effects: riots…

Why do we care? (1: Epidemiology)

• Dynamical Processes over networks [AJPH 2007]

CDC data: Visualization of the first 35 tuberculosis (TB) patients and their 1039 contacts

Diseases over contact networks


Why do we care? (1: Epidemiology)

• Dynamical Processes over networks

• Each circle is a hospital
• ~3000 hospitals
• More than 30,000 patients transferred

[US-MEDICARE NETWORK 2005]

Problem: Given k units of disinfectant, whom to immunize?


Why do we care? (1: Epidemiology)

CURRENT PRACTICE OUR METHOD

~6x fewer!

[US-MEDICARE NETWORK 2005]

Hospital-acquired inf. took 99K+ lives, cost $5B+ (all per year)


Why do we care? (2: Online Diffusion)

> 800m users, ~$1B revenue [WSJ 2010]

~100m active users

> 50m users


Why do we care? (2: Online Diffusion)

• Dynamical Processes over networks

Celebrity

Buy Versace™!

Followers

Social Media Marketing


Why do we care? (4: To change the world?)

• Dynamical Processes over networks

Social networks and Collaborative Action


High Impact – Multiple Settings

Q. How to squash rumors faster?

Q. How do opinions spread?

Q. How to market better?

epidemic out-breaks

products/viruses

transmit s/w patches


Research Theme

DATA: Large real-world networks & processes

ANALYSIS: Understanding

POLICY/ACTION: Managing

Research Theme – Public Health

DATA: Modeling # patient transfers

ANALYSIS: Will an epidemic happen?

POLICY/ACTION: How to control out-breaks?

Research Theme – Social Media

DATA: Modeling Tweets spreading

ANALYSIS: # cascades in future?

POLICY/ACTION: How to market better?

In this talk

Q1: How to immunize and control out-breaks better?
Q2: How to find culprits of epidemics?

POLICY/ACTION: Managing

In this talk

DATA: Large real-world networks & processes

Q3: What do cascades look like?
Q4: How does activity evolve over time?

Outline

• Motivation
• Part 1: Policy and Action (Algorithms)
• Part 2: Learning Models (Empirical Studies)
• Conclusion

Part 1: Algorithms

• Q1: Whom to immunize?
• Q2: How to detect culprits?

• Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad, Michalis Faloutsos, Christos Faloutsos, “Gelling, and Melting, Large Graphs by Edge Manipulation”, in ACM CIKM 2012 (Best Paper Award)

[Thanks to Hanghang Tong for some slides!]

An Example: Flu/Virus Propagation

Healthy / Sick

Contact

1: Sneeze to neighbors

2: Some neighbors Sick

3: Try to recover

Q: How to guide propagation by optimizing the link structure?
- Q1: Understand the tipping point (existing work)
- Q2: Minimize the propagation (this paper)
- Q3: Maximize the propagation (this paper)

Vulnerability measure λ [ICDM 2011, PKDD 2010]

Increasing λ ⇒ increasing vulnerability

λ (the largest eigenvalue of the adjacency matrix A) determines the epidemic threshold

“Safe” → “Vulnerable” → “Deadly”
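To make the vulnerability measure concrete, here is a minimal sketch that estimates λ by power iteration, taking λ to be the largest eigenvalue of the adjacency matrix as in the cited ICDM 2011 work. The helper names (`adjacency`, `leading_eigenvalue`) and the two toy graphs are assumptions made for this illustration; they are not from the talk.

```python
import math

def adjacency(n, edges):
    """Adjacency lists for an undirected graph on nodes 0..n-1."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    return adj

def leading_eigenvalue(adj, iters=500):
    """Estimate the largest adjacency eigenvalue by power iteration.

    Iterates on A + I (a shift by the identity) so the method also
    converges on bipartite graphs, then returns the Rayleigh quotient for A.
    """
    n = len(adj)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]  # (A + I) x
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    ax = [sum(x[j] for j in adj[i]) for i in range(n)]
    return sum(a * b for a, b in zip(ax, x))  # x^T A x with |x| = 1

# Toy graphs (illustrative only): a 4-node chain and a 4-node clique.
chain = adjacency(4, [(0, 1), (1, 2), (2, 3)])
clique = adjacency(4, [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
lam_chain = leading_eigenvalue(chain)
lam_clique = leading_eigenvalue(clique)
print(round(lam_chain, 3), round(lam_clique, 3))
```

The sparse chain gets a smaller λ than the dense clique on the same number of nodes, which matches the "safe" to "deadly" ordering on the slide.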

Minimizing Propagation: Edge Deletion
• Given: a graph A, virus prop model and budget k;
• Find: delete k ‘best’ edges from A to minimize λ

Bad

Good

Q: How to find k best edges to delete efficiently?

Left eigen-score of source

Right eigen-score of target
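The scoring idea above (left eigen-score of the source times right eigen-score of the target) can be sketched as a one-shot ranking of existing edges. This is an illustration under assumptions, not the authors' implementation: the graph is undirected, so the left and right leading eigenvectors coincide and one vector u serves for both, and the function names and the toy graph are hypothetical.

```python
import math

def leading_eigenpair(n, edges, iters=1000):
    """Leading eigenvalue and eigenvector of an undirected graph (shifted power iteration)."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]  # (A + I) x
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    lam = sum(x[i] * sum(x[j] for j in adj[i]) for i in range(n))  # x^T A x
    return lam, x

def best_edges_to_delete(n, edges, k):
    """Rank existing edges by u[i] * u[j] and return the k highest-scoring ones."""
    _, u = leading_eigenpair(n, edges)
    return sorted(edges, key=lambda e: u[e[0]] * u[e[1]], reverse=True)[:k]

# Toy graph (illustrative only): a triangle 0-1-2 with a tail 2-3-4.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
lam_before, _ = leading_eigenpair(5, edges)
to_delete = best_edges_to_delete(5, edges, k=2)
remaining = [e for e in edges if e not in to_delete]
lam_after, _ = leading_eigenpair(5, remaining)
print(to_delete, round(lam_before, 3), round(lam_after, 3))
```

On this toy graph the two edges touching the hub node 2 inside the triangle score highest, and removing them lowers λ.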

Minimizing Propagation: Evaluations

[Plot: Log (Infected Ratio) vs. Time Ticks; "Our Method" is labeled, with an arrow marking the better direction]

Data set: Oregon Autonomous System Graph (14K nodes, 61K edges)

Discussions: Node Deletion vs. Edge Deletion
• Observations:
  • Node or Edge Deletion ⇒ λ decreases
  • Edges of A = Nodes of its line graph L(A)
• Questions:
  • Edge Deletion on A = Node Deletion on L(A)?
  • Which strategy is better (when both feasible)?

[Figure: Original Graph A and its Line Graph L(A)]

Discussions: Node Deletion vs. Edge Deletion
• Q: Is Edge Deletion on A = Node Deletion on L(A)?
• A: Yes!
• But, Node Deletion itself is not easy:

Theorem: Hardness of Node Deletion. Finding the optimal k-node immunization is NP-Hard

Theorem: Line Graph Spectrum. Eigenvalue of A ⇒ Eigenvalue of L(A)
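The line graph spectrum statement can be checked numerically on a small example. The sketch below assumes the directed line graph construction (edge e points to edge f whenever e ends where f starts); the function names and the 3-node graph are made up for the illustration.

```python
import math

def leading_eigenvalue(out_neighbors, iters=500):
    """Largest eigenvalue of a directed graph given as out-neighbor lists."""
    n = len(out_neighbors)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(x[j] for j in out_neighbors[i]) for i in range(n)]  # (A + I) x
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    ax = [sum(x[j] for j in out_neighbors[i]) for i in range(n)]
    return sum(a * b for a, b in zip(ax, x))

def graph_neighbors(n, edges):
    """Out-neighbor lists of the original graph A."""
    out = [[] for _ in range(n)]
    for i, j in edges:
        out[i].append(j)
    return out

def line_graph_neighbors(edges):
    """Node e of L(A) is edge e of A; e -> f whenever e ends where f starts."""
    return [[f for f, (src, _) in enumerate(edges) if src == dst]
            for (_, dst) in edges]

# Toy directed graph (illustrative only): a 3-cycle 0->1->2->0 plus the chord 0->2.
edges = [(0, 1), (1, 2), (2, 0), (0, 2)]
lam_graph = leading_eigenvalue(graph_neighbors(3, edges))
lam_line = leading_eigenvalue(line_graph_neighbors(edges))
print(round(lam_graph, 4), round(lam_line, 4))
```

Both calls return the same leading eigenvalue, which is why deleting an edge of A can be studied as deleting a node of L(A).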

Discussions: Node Deletion vs. Edge Deletion
• Q: Which strategy is better (when both feasible)?
• A: Edge Deletion > Node Deletion

[Plot legend] Green: Node Deletion (e.g., shut down a Twitter account); Red: Edge Deletion (e.g., un-friend two users)

Maximizing Propagation: Edge Addition
• Given: a graph A, virus prop model and budget k;
• Find: add k ‘best’ new edges into A.

• By 1st order perturbation, we have λ_S − λ ≈ Gv(S) = c ∑_{e∈S} u(i_e) v(j_e), where u(i_e) is the left eigen-score of the source of edge e and v(j_e) is the right eigen-score of its target

• So, are we done? No: scoring every candidate edge needs O(n² − m) complexity

[Figure: Low Gv vs. High Gv]

Maximizing Propagation: Edge Addition

λ_S − λ ≈ Gv(S) = c ∑_{e∈S} u(i_e) v(j_e)
• Q: How to find k new edges with the highest Gv(S)?
• A: Modified Fagin’s algorithm
  #1: Sort sources by u
  #2: Sort targets by v
  #3: Search space: the top k+d sources × the top k+d targets (some cells are existing edges)

Time complexity: O(m + nt + kt²), t = max(k, d)
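A rough sketch of the restricted search for edge addition follows. Several things here are assumptions rather than the paper's algorithm: the graph is undirected (so u = v), d is taken to be the maximum node degree (the slide does not define it), and the search uses the top k+d+1 nodes instead of k+d so that, with self-loops excluded, the top-ranked node is guaranteed at least k non-adjacent partners. All names are hypothetical.

```python
import math
from itertools import combinations

def leading_eigenvector(n, edges, iters=1000):
    """Leading eigenvector and adjacency lists of an undirected graph."""
    adj = [[] for _ in range(n)]
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    x = [1.0] * n
    for _ in range(iters):
        y = [x[i] + sum(x[j] for j in adj[i]) for i in range(n)]  # (A + I) x
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x, adj

def best_edges_to_add(n, edges, k):
    """Return k missing edges with the highest u[i] * u[j], searching only top-ranked nodes."""
    u, adj = leading_eigenvector(n, edges)
    d = max(len(nb) for nb in adj)          # assumption: d = maximum degree
    t = min(n, k + d + 1)                   # assumption: one extra node, see lead-in
    top = sorted(range(n), key=lambda i: u[i], reverse=True)[:t]
    existing = {frozenset(e) for e in edges}
    candidates = [(i, j) for i, j in combinations(sorted(top), 2)
                  if frozenset((i, j)) not in existing]
    candidates.sort(key=lambda e: u[e[0]] * u[e[1]], reverse=True)
    return candidates[:k]

# Toy graph (illustrative only): a path on 10 nodes.
path = [(i, i + 1) for i in range(9)]
new_edges = best_edges_to_add(10, path, k=2)
print(new_edges)
```

On the path, the restricted search looks at 5 of the 10 nodes and still returns the same two shortcuts near the middle that an exhaustive scan over all missing edges would pick.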

Maximizing Propagation: Evaluation

[Plot: Log (Infected Ratio) vs. Time Ticks; "Our Method" is labeled, with an arrow marking the better direction]

Fractional Immunization of Networks

B. Aditya Prakash, Lada Adamic, Theodore Iwashyna (M.D.), Hanghang Tong, Christos Faloutsos

Under Submission

Previously: Full Static Immunization

Given: a graph A, virus prop. model and budget k; Find: k ‘best’ nodes for immunization (removal).

[Figure: example graph with k = 2; candidate nodes are marked “?”]

Now: Fractional Asymmetric Immunization

• Fractional Effect [ f(x) = 0.5^x ]
• Asymmetric Effect

# antidotes = 3

Fractional Asymmetric Immunization

Hospital Another Hospital

Drug-resistant Bacteria (like XDR-TB)


= f


Fractional Asymmetric Immunization

Hospital Another Hospital

Problem: Given k units of disinfectant, how to distribute them to maximize hospitals saved?
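To illustrate the allocation problem, here is a naive greedy baseline. It is explicitly not SMART-ALLOC, whose details the slides do not give. The sketch assumes that x units given to a hospital scale every edge entering that hospital by f(x) = 0.5^x (one reading of the "fractional" and "asymmetric" effects), and that the goal is approximated by lowering the leading eigenvalue λ of the weighted transfer graph. All names and the toy network are hypothetical.

```python
import math

def leading_eigenvalue(weights, n, iters=500):
    """Largest eigenvalue of a nonnegative weighted digraph {(src, dst): w}."""
    x = [1.0] * n
    for _ in range(iters):
        y = x[:]                                  # identity shift: y = (W + I) x
        for (src, dst), w in weights.items():
            y[src] += w * x[dst]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    wx = [0.0] * n
    for (src, dst), w in weights.items():
        wx[src] += w * x[dst]
    return sum(a * b for a, b in zip(wx, x))

def apply_antidotes(weights, alloc):
    """Scale every edge entering node dst by f(x) = 0.5 ** alloc[dst] (assumed semantics)."""
    return {(s, d): w * 0.5 ** alloc[d] for (s, d), w in weights.items()}

def greedy_allocation(weights, n, k):
    """Hand out k units one at a time, each to the node that lowers lambda the most."""
    alloc = [0] * n
    for _ in range(k):
        def lam_if(i):
            trial = alloc[:]
            trial[i] += 1
            return leading_eigenvalue(apply_antidotes(weights, trial), n)
        best = min(range(n), key=lam_if)
        alloc[best] += 1
    return alloc

# Toy network (illustrative only): three hospitals exchanging patients at equal rates.
transfers = {(i, j): 1.0 for i in range(3) for j in range(3) if i != j}
alloc = greedy_allocation(transfers, 3, k=3)
lam_before = leading_eigenvalue(transfers, 3)
lam_after = leading_eigenvalue(apply_antidotes(transfers, alloc), 3)
print(alloc, round(lam_before, 3), round(lam_after, 3))
```

Each greedy step recomputes λ for every candidate node, so this baseline is far slower than the running times reported for SMART-ALLOC; it is only meant to show what "distributing k units" means under the stated assumptions.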


Our Algorithm “SMART-ALLOC”

CURRENT PRACTICE SMART-ALLOC

[US-MEDICARE NETWORK 2005]
• Each circle is a hospital, ~3000 hospitals
• More than 30,000 patients transferred

~6x fewer!


Running Time

Wall-Clock Time (lower is better):
• Simulations: > 1 week
• SMART-ALLOC: 14 secs

> 30,000x speed-up!

Experiments (lower is better)
• PENN-NETWORK, K = 200: ~5x
• SECOND-LIFE, K = 2000: ~2.5x

Part 1: Algorithms

• Q1: Whom to immunize?
• Q2: How to detect culprits?

Prakash and Faloutsos 2012

• B. Aditya Prakash, Jilles Vreeken, Christos Faloutsos, “Detecting Culprits in Epidemics: Who and How many?”, in ICDM 2012, Brussels

Culprits: Problem definition

• 2-d grid
• ‘+’ -> infected
• Who started it?

Prior work: [Lappas et al. 2010, Shah et al. 2011]

Culprits: Exoneration


Who are the culprits

• Two-part solution
– use MDL for number of seeds
– for a given number: exoneration = centrality + penalty

• Running time = linear! (in edges and nodes)

Modeling using MDL

• Minimum Description Length Principle == Induction by compression
• Related to Bayesian approaches
• MDL = Model + Data
• Model
– Scoring the seed-set: one term for encoding the integer |S|, plus one term for the number of possible |S|-sized sets
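As a hedged illustration of the model cost, the two labelled terms can be turned into bits as follows. The sketch assumes Rissanen's universal code for the integer |S| (the slide does not say which integer code is used) and log2 of the binomial coefficient for choosing |S| seeds out of N nodes; the function names are hypothetical.

```python
import math

def universal_int_bits(n):
    """Rissanen's universal code length L_N(n) in bits, for an integer n >= 1."""
    bits = math.log2(2.865064)     # normalizing constant of the universal code
    term = math.log2(n)
    while term > 0:                # add log2(n) + log2(log2(n)) + ... while positive
        bits += term
        term = math.log2(term)
    return bits

def seed_set_bits(num_nodes, num_seeds):
    """Bits to state |S| plus bits to pick which |S|-sized set it is."""
    return universal_int_bits(num_seeds) + math.log2(math.comb(num_nodes, num_seeds))

bits_one_seed = seed_set_bits(num_nodes=100, num_seeds=1)
bits_two_seeds = seed_set_bits(num_nodes=100, num_seeds=2)
print(round(bits_one_seed, 2), round(bits_two_seeds, 2))
```

A larger seed set costs more bits to describe, so it is only preferred when it makes the data (the infected snapshot) sufficiently cheaper to encode.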

Modeling using MDL

• Data: Propagation Ripples

[Figure: Original Graph, Infected Snapshot, Ripple R1, Ripple R2]

Modeling using MDL

• Ripple cost (for a ripple R): how long the ripple is + how the ‘frontier’ advances

• Total MDL cost

How to optimize the score?

• Two-step process
– Given k, quickly identify high-quality set
– Given these nodes, optimize the ripple R

Optimizing the score

• High-quality k-seed-set
– Exoneration
• Best single seed:
– Eigenvector for the smallest eigenvalue of the Laplacian sub-matrix
• Exonerate neighbors
• Repeat

Optimizing the score

• Optimizing R
– Just get the MLE ripple!
• Finally use MDL score to tell us the best set
• NetSleuth: Linear running time in nodes and edges

Experiments

• Evaluation functions:
– MDL based
– Overlap based (JD == Jaccard distance)

Closer to 1 the better
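For reference, a small sketch of the standard Jaccard distance between a true seed set and a recovered one. The slide's overlap score is built on this distance, but how it is normalized is not shown, so only the distance itself appears here; the function name and example sets are made up.

```python
def jaccard_distance(true_seeds, found_seeds):
    """1 - |intersection| / |union|; 0.0 for identical sets, 1.0 for disjoint sets."""
    a, b = set(true_seeds), set(found_seeds)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

# Illustrative node ids only.
jd = jaccard_distance({1, 2, 3}, {2, 3, 4})
print(jd)
```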


Part 2: Empirical Studies

• Q3: What do cascades look like?
• Q4: How does activity evolve over time?

Cascading Behavior in Large Blog Graphs

How does information propagate over the blogosphere?

[Figure labels: Blogs, Posts, Links, Information cascade]

J. Leskovec, M. McGlohon, C. Faloutsos, N. Glance, M. Hurst. Cascading Behavior in Large Blog Graphs. SDM 2007.

Cascades on the Blogosphere

A cascade is the graph induced by a time-ordered propagation of information (edges)

[Figure panels: Blogosphere (blogs + posts); Blog network (links among blogs); Post network (links among posts); Cascades]

Blog data
• 45,000 blogs participating in cascades
• All their posts for 3 months (Aug-Sept ‘05)
• 2.4 million posts
• ~5 million links (245,404 inside the dataset)

[Plot: Number of posts vs. Time (1-day bins)]

Popularity over time

Post popularity drops-off – exponentially?

[Plot: # in links vs. lag (days after post); in-links to a post made at time t are counted at time t + lag]

Popularity over time

Post popularity drops-off – exponentially? POWER LAW!
Exponent? -1.6
• close to -1.5: Barabasi’s stack model
• and like the zero-crossings of a random walk

[Plot: # in links (log) vs. days after post (log); fitted slope -1.6, with a -1.5 slope shown for reference]
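One common way to read an exponent off a log-log plot is a least-squares line through (log lag, log count). The sketch below does this on synthetic counts generated with exponent -1.6; it is not the blog data, and a plain least-squares fit is an assumption, since the talk does not say how the exponent was estimated.

```python
import math

def fit_power_law_exponent(lags, counts):
    """Least-squares slope of log(count) against log(lag)."""
    xs = [math.log(lag) for lag in lags]
    ys = [math.log(c) for c in counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

lags = list(range(1, 31))                         # days after post
counts = [1000.0 * lag ** -1.6 for lag in lags]   # synthetic in-link counts
slope = fit_power_law_exponent(lags, counts)
print(round(slope, 3))
```

An exponential drop-off would curve downward on these axes instead of following a straight line, which is how the two hypotheses on the slide can be told apart.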


J. G. Oliveira & A.-L. Barabási. Human Dynamics: The Correspondence Patterns of Darwin and Einstein. Nature 437, 1251 (2005).

Part 2: Empirical Studies

• Q3: What do cascades look like?
• Q4: How does activity evolve over time?

Rise and fall patterns in social media

• Meme (# of mentions in blogs)
– short phrases sourced from U.S. politics in 2008

“you can put lipstick on a pig”

“yes we can”

Rise and fall patterns in social media

• Can we find a unifying model, which includes these patterns?

• four classes on YouTube [Crane et al. ’08]
• six classes on Meme [Yang et al. ’11]

Rise and fall patterns in social media

• Answer: YES!

• We can represent all patterns by a single model

In Matsubara+ SIGKDD 2012


Main idea - SpikeM
- 1. Un-informed bloggers (uninformed about rumor)
- 2. External shock at time nb (e.g., breaking news)
- 3. Infection (word-of-mouth)

Infectiveness of a blog-post at age n:
- β: Strength of infection (quality of news)
- Decay function (how infective a blog posting is): Power Law, -1.5 slope

[Timeline: Time n=0, Time n=nb, Time n=nb+1]
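The three ingredients above can be combined into a small simulation. The recurrence used here is a reading of the base model (no periodicity) and should be checked against Matsubara et al. (SIGKDD 2012): newly informed bloggers at tick n+1 equal the un-informed count times the accumulated infectiveness β·age^(-1.5) of earlier posts plus the shock at n_b. The parameter values, the function name, and the `min` guard are assumptions for the illustration.

```python
def spikem_base(N, beta, n_b, S_b, T, eps=0.0):
    """Simulate a base SpikeM-style recurrence (no periodicity) for T time ticks."""
    U = [float(N)] + [0.0] * (T - 1)     # un-informed bloggers at each tick
    dB = [0.0] * T                       # newly informed bloggers at each tick
    for n in range(T - 1):
        pressure = 0.0
        for t in range(n_b, n + 1):
            shock = S_b if t == n_b else 0.0          # external shock only at n_b
            age = n + 1 - t
            pressure += (dB[t] + shock) * beta * age ** -1.5   # power-law decay
        # Guard (not part of the model): never inform more bloggers than remain.
        dB[n + 1] = min(U[n], U[n] * pressure + eps)
        U[n + 1] = U[n] - dB[n + 1]
    return dB, U

# Illustrative parameters only.
delta_b, uninformed = spikem_base(N=1000, beta=0.001, n_b=5, S_b=10.0, T=60)
peak_tick = delta_b.index(max(delta_b))
print(peak_tick, round(max(delta_b), 1), round(delta_b[-1], 4))
```

Nothing happens before the shock; afterwards activity rises through word-of-mouth, peaks as the pool of un-informed bloggers runs out, and then falls off.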


SpikeM - with periodicity

• Full equation of SpikeM

Periodicity: Bloggers change their activity over time (e.g., daily, weekly, yearly)

[Plot: activity vs. Time n; peak activity at 12pm, low activity at 3am]

Tail-part forecasts

• SpikeM can capture tail part


“What-if” forecasting

e.g., given (1) first spike, (2) release date of two sequel movies, (3) access volume before the release date

[Plot annotations: (1) First spike, (2) Release date, (3) Two weeks before release; the two upcoming spikes are marked “?”]

“What-if” forecasting

• SpikeM can forecast upcoming spikes
– SpikeM can forecast not only tail-part, but also rise-part!

[Plot annotations: (1) First spike, (2) Release date, (3) Two weeks before release]

Outline

• Motivation
• Part 1: Understanding Epidemics (Theory)
• Part 2: Policy and Action (Algorithms)
• Part 3: Learning Models (Empirical Studies)
• Conclusion

Conclusions

• Fast Immunization
– Min/Max. drop in eigenvalue, NetGel, NetMelt
• Finding Culprits Automatically
– MDL + Exoneration, Linear Time Algo
• Bursts: SpikeM model
– Exponential growth, Power-law decay

[Diagram: Propagation on Networks at the center, connected to ML & Stats., Comp. Systems, Theory & Algo., Biology, Econ., Social Science, Engg.]

References

1. Winner-takes-all: Competing Viruses or Ideas on fair-play networks (B. Aditya Prakash, Alex Beutel, Roni Rosenfeld, Christos Faloutsos) – In WWW 2012, Lyon

2. Threshold Conditions for Arbitrary Cascade Models on Arbitrary Networks (B. Aditya Prakash, Deepayan Chakrabarti, Michalis Faloutsos, Nicholas Valler, Christos Faloutsos) – In IEEE ICDM 2011, Vancouver (Invited to KAIS Journal Best Papers of ICDM.)

3. Time Series Clustering: Complex is Simpler! (Lei Li, B. Aditya Prakash) – In ICML 2011, Bellevue

4. Epidemic Spreading on Mobile Ad Hoc Networks: Determining the Tipping Point (Nicholas Valler, B. Aditya Prakash, Hanghang Tong, Michalis Faloutsos and Christos Faloutsos) – In IEEE NETWORKING 2011, Valencia, Spain

5. Formalizing the BGP stability problem: patterns and a chaotic model (B. Aditya Prakash, Michalis Faloutsos and Christos Faloutsos) – In IEEE INFOCOM NetSciCom Workshop, 2011.

6. On the Vulnerability of Large Graphs (Hanghang Tong, B. Aditya Prakash, Tina Eliassi-Rad and Christos Faloutsos) – In IEEE ICDM 2010, Sydney, Australia

7. Virus Propagation on Time-Varying Networks: Theory and Immunization Algorithms (B. Aditya Prakash, Hanghang Tong, Nicholas Valler, Michalis Faloutsos and Christos Faloutsos) – In ECML-PKDD 2010, Barcelona, Spain

8. MetricForensics: A Multi-Level Approach for Mining Volatile Graphs (Keith Henderson, Tina Eliassi-Rad, Christos Faloutsos, Leman Akoglu, Lei Li, Koji Maruhashi, B. Aditya Prakash and Hanghang Tong) - In SIGKDD 2010, Washington D.C.

9. Parsimonious Linear Fingerprinting for Time Series (Lei Li, B. Aditya Prakash and Christos Faloutsos) - In VLDB 2010, Singapore

10. EigenSpokes: Surprising Patterns and Scalable Community Chipping in Large Graphs (B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju and Christos Faloutsos) – In PAKDD 2010, Hyderabad, India

11. BGP-lens: Patterns and Anomalies in Internet-Routing Updates (B. Aditya Prakash, Nicholas Valler, David Andersen, Michalis Faloutsos and Christos Faloutsos) – In ACM SIGKDD 2009, Paris, France.

12. Surprising Patterns and Scalable Community Detection in Large Graphs (B. Aditya Prakash, Ashwin Sridharan, Mukund Seshadri, Sridhar Machiraju and Christos Faloutsos) – In IEEE ICDM Large Data Workshop 2009, Miami

13. FRAPP: A Framework for high-Accuracy Privacy-Preserving Mining (Shipra Agarwal, Jayant R. Haritsa and B. Aditya Prakash) – In Intl. Journal on Data Mining and Knowledge Discovery (DMKD), Springer, vol. 18, no. 1, February 2009, Ed: Johannes Gehrke.

14. Complex Group-By Queries For XML (C. Gokhale, N. Gupta, P. Kumar, L. V. S. Lakshmanan, R. Ng and B. Aditya Prakash) – In IEEE ICDE 2007, Istanbul, Turkey.


Understanding and Managing Cascades on Large Networks

B. Aditya Prakash http://www.cs.vt.edu/~badityap

Sounds Interesting? I am looking for Ph.D. students---drop me an email with your CV!
