OddBall: Spotting Anomalies in Weighted Graphs. Leman Akoglu, Mary McGlohon, Christos Faloutsos. Carnegie Mellon University, School of Computer Science, Pittsburgh, Pennsylvania, USA


Page 1: OddBall: Spotting Anomalies in Weighted Graphs

OddBall: Spotting Anomalies in Weighted Graphs

Leman Akoglu, Mary McGlohon, Christos Faloutsos

Carnegie Mellon University
School of Computer Science
Pittsburgh, Pennsylvania, USA

Page 2: OddBall: Spotting Anomalies in Weighted Graphs

Motivation

Anomaly detection in networks (graph data) has important applications:

- Computer networks: spammers, port scanners
- Phone-call networks: telemarketers, misbehaving customers, faulty equipment
- Social networks: ‘popularity contests’
- Account networks: scammers, transfer fraud
- Terrorist networks: tight groups of people

PAKDD 2010 Akoglu, McGlohon, Faloutsos

Page 3: OddBall: Spotting Anomalies in Weighted Graphs

Problem

Q1. Given a weighted and unlabeled graph, how can we spot strange, abnormal, extreme nodes?

Q2. Can we explain why the spotted nodes are anomalous?

Page 4: OddBall: Spotting Anomalies in Weighted Graphs

Preliminaries I – What is an anomaly?

No clear and unique definition!

“An observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” [Hawkins, 80]

Page 5: OddBall: Spotting Anomalies in Weighted Graphs

Preliminaries II – Weights

[Figure: bipartite and unipartite example graphs with weighted edges; edge labels shown on the slide: 1, 3, $5K, $10K, $15K]

Page 6: OddBall: Spotting Anomalies in Weighted Graphs

Preliminaries III – Power Laws

Pr[X ≥ x] ~ c x^{-α}

ln(Pr[X ≥ x]) ~ ln c − α ln x

c > 0, α ≥ 0

[Figure: the same distribution on a lin-lin plot and on a log-log plot; on the log-log plot the points fall on a line with slope = -α]

Page 7: OddBall: Spotting Anomalies in Weighted Graphs

‘Power Law’ Example

DBLP Keyword-to-Conference Network

[Figure: log-log plots relating # edges, total weight, # source nodes, and # destination nodes]

- Densification Power Law [Leskovec ‘05]
- Weight Power Law [McGlohon ‘08]

Page 8: OddBall: Spotting Anomalies in Weighted Graphs

‘Power Law’ Example

2004 US FEC Committees to Candidates network

Snapshot Power Law [McGlohon et al. ‘08]

[Figure: log-log plot of in-weights ($) versus in-degree (# donors); e.g. John Kerry, $10M received, from 1K donors]

Page 9: OddBall: Spotting Anomalies in Weighted Graphs

Preliminaries IV – how to fit

Least Squares

fit to medians!
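
The "least squares, fit to medians" idea can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fit_power_law_to_medians` and the choice of bucketing points by floor(log2(x)) are assumptions made here. Each bucket is reduced to the median of its log-x and log-y values, and an ordinary least-squares line is fitted through those medians, so a few extreme points have little influence on the fitted exponent.

```python
import math
from collections import defaultdict
from statistics import median


def fit_power_law_to_medians(xs, ys):
    """Fit y ~ C * x**theta by least squares on per-bucket medians (log-log space).

    Bucketing by floor(log2(x)) is an assumption of this sketch.
    All xs and ys must be positive.
    """
    buckets = defaultdict(list)
    for x, y in zip(xs, ys):
        buckets[math.floor(math.log2(x))].append((math.log(x), math.log(y)))
    # One representative point per bucket: median of log x and median of log y.
    pts = [(median(p[0] for p in b), median(p[1] for p in b))
           for b in buckets.values()]
    if len(pts) < 2:
        raise ValueError("need at least two buckets to fit a line")
    mx = sum(p[0] for p in pts) / len(pts)
    my = sum(p[1] for p in pts) / len(pts)
    sxx = sum((p[0] - mx) ** 2 for p in pts)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in pts)
    theta = sxy / sxx                 # slope of the line in log-log space
    log_c = my - theta * mx           # intercept of the line in log-log space
    return math.exp(log_c), theta


# Synthetic data on an exact power law, plus one extreme point.
xs = list(range(1, 201)) + [100]
ys = [3.0 * x ** 1.5 for x in range(1, 201)] + [1e9]
C, theta = fit_power_law_to_medians(xs, ys)
print(C, theta)
```

Fitting to medians rather than to every point is what keeps the single extreme point from pulling the line toward itself.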

Page 10: OddBall: Spotting Anomalies in Weighted Graphs

Problem revisited

Q1. Given a weighted and unlabeled graph, how can we spot strange, abnormal, extreme nodes?

Q2. Can we explain why the spotted nodes are anomalous?

Page 11: OddBall: Spotting Anomalies in Weighted Graphs

Problem sketch

Page 12: OddBall: Spotting Anomalies in Weighted Graphs

Main idea

For each node,

P.1) extract ‘ego-net’ (= 1-step-away neighbors)
P.2) extract features (# edges, total weight, etc.)
P.3) extract patterns (norms)
P.4) anomaly detection: compare with the rest of the population

LLNL'10 C. Faloutsos (CMU)

Page 13: OddBall: Spotting Anomalies in Weighted Graphs

Outline

1. Motivation
2. Preliminaries and Problem Definition
3. Proposed Method
   a. Study of ego-nets
   b. Laws and Observations
   c. Anomaly detection
4. Datasets
5. Experiments
6. Discussion & Conclusion

Page 14: OddBall: Spotting Anomalies in Weighted Graphs

P.1 What is an egonet?

[Figure: a graph with one node marked "ego" and its surrounding neighborhood marked "ego-net"]

Page 15: OddBall: Spotting Anomalies in Weighted Graphs

What is odd?

Page 16: OddBall: Spotting Anomalies in Weighted Graphs

PwC 2009 Leman Akoglu

What is “anomalous”?

- Near-star: telemarketer, port scanner, people adding friends indiscriminately, etc.
- Near-clique: tightly connected people, terrorist groups?, discussion group, etc.

Page 17: OddBall: Spotting Anomalies in Weighted Graphs

What is “anomalous”?

- Heavy vicinity: too much money wrt number of accounts, high donation wrt number of donors, etc.
- Dominant heavy link: single-minded, tight company

Page 18: OddBall: Spotting Anomalies in Weighted Graphs

P.2 What features…

… should we extract so as to project nodes into a low-dimensional space?

- features that could yield “laws”
- features that are easy to compute and interpret

Page 19: OddBall: Spotting Anomalies in Weighted Graphs

Selected Features

- N_i: number of neighbors (degree) of ego i
- E_i: number of edges in egonet i
- W_i: total weight of egonet i
- λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i
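
The four selected features can be computed for a single ego with a short sketch like the one below. It is an illustration under stated assumptions, not the released OddBall software: the graph is assumed to be an undirected dict-of-dicts (`adj[u][v]` is the weight of edge u-v), the helper names `egonet_features`, `star` and `clique` are hypothetical, and the principal eigenvalue is approximated with a shifted power iteration chosen here only to avoid external libraries.

```python
import math


def egonet_features(adj, ego, iters=500):
    """Return (N, E, W, lam) for the egonet of `ego`.

    adj: dict mapping node -> {neighbor: weight}, undirected and symmetric.
    """
    nodes = [ego] + list(adj[ego])            # the ego plus its 1-step neighbors
    pos = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    # Weighted adjacency matrix restricted to the egonet.
    A = [[0.0] * n for _ in range(n)]
    for u in nodes:
        for v, w in adj[u].items():
            if v in pos:
                A[pos[u]][pos[v]] = float(w)
    N = n - 1
    upper = [A[r][c] for r in range(n) for c in range(r + 1, n)]
    E = sum(1 for w in upper if w != 0.0)     # each undirected edge counted once
    W = sum(upper)
    shift = max(sum(abs(w) for w in row) for row in A)
    if N == 0 or shift == 0.0:
        return N, E, W, 0.0
    # Power iteration on A + shift*I: the shift removes the oscillation that
    # plain power iteration shows on bipartite egonets such as stars.
    x = [1.0] * n
    for _ in range(iters):
        y = [sum(A[r][c] * x[c] for c in range(n)) + shift * x[r] for r in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    Ax = [sum(A[r][c] * x[c] for c in range(n)) for r in range(n)]
    lam = sum(x[r] * Ax[r] for r in range(n))  # Rayleigh quotient on A itself
    return N, E, W, lam


def star(k):
    """Hypothetical helper: node 0 connected to k leaves, unit weights."""
    adj = {0: {j: 1.0 for j in range(1, k + 1)}}
    for j in range(1, k + 1):
        adj[j] = {0: 1.0}
    return adj


def clique(k):
    """Hypothetical helper: complete graph on k nodes, unit weights."""
    return {u: {v: 1.0 for v in range(k) if v != u} for u in range(k)}


print(egonet_features(star(4), 0))    # star: E equals N, lam is sqrt(N)
print(egonet_features(clique(5), 0))  # clique: E grows like N squared, lam equals N
```

The two toy graphs are the extremes that the near-star and near-clique anomaly types refer to.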

Page 20: OddBall: Spotting Anomalies in Weighted Graphs

[Figure: example egonets, each annotated with how λ_{w,i} relates to N, E and W. Annotations recovered from the slide: λ_{w,i} = √N = √E = √W; λ_{w,i} > √N ~ √E, √W; λ_{w,i} = N ≈ √W; λ_{w,i} = W; λ_{w,i} ≈ W; λ_{w,i} (relation symbol lost) √W]

N: # neighbors, W: total weight

Page 21: OddBall: Spotting Anomalies in Weighted Graphs

Other Features

- S_i: number of singleton (degree-1) neighbors of ego i
- max(W_i): maximum edge weight in egonet i
- max(W_i, d=1): maximum edge weight to/from a degree-1 neighbor of ego i
- max(d_i): maximum degree of the neighbors of ego i
- 2-step neighborhood features

Page 22: OddBall: Spotting Anomalies in Weighted Graphs

Outline

1. Motivation
2. Preliminaries
3. Proposed Method
   a. Study of egonets
   b. Laws and Observations
   c. Anomaly detection
4. Datasets
5. Experiments
6. Discussion & Conclusion

Page 23: OddBall: Spotting Anomalies in Weighted Graphs

P.3 What patterns?

Observation 1: Egonet Density Power Law (EDPL)

Q1: How does the number of neighbors N of the egonet relate to the number of edges E?

Page 24: OddBall: Spotting Anomalies in Weighted Graphs

Observation 1: Egonet Density Power Law (EDPL)

E_i ∝ N_i^α

1 ≤ α ≤ 2

Page 25: OddBall: Spotting Anomalies in Weighted Graphs

P.3 What patterns?

Observation 2: Egonet Weight Power Law (EWPL)

Q2: How does the total weight W of the egonet relate to the number of edges E?

Page 26: OddBall: Spotting Anomalies in Weighted Graphs

Observation 2: Egonet Weight Power Law (EWPL)

W_i ∝ E_i^β

β ≥ 1

Page 27: OddBall: Spotting Anomalies in Weighted Graphs

P.3 What patterns?

Observation 3: Egonet λw Power Law (ELWPL)

Q3: How does the largest eigenvalue λw of the weighted adjacency matrix of the egonet relate to the total weight W?

Page 28: OddBall: Spotting Anomalies in Weighted Graphs

Observation 3: Egonet λw Power Law (ELWPL)

λ_{w,i} ∝ W_i^γ

0.5 ≤ γ ≤ 1

Page 29: OddBall: Spotting Anomalies in Weighted Graphs

Outline

1. Motivation
2. Preliminaries
3. Proposed Method
   a. Study of egonets
   b. Laws and Observations
   c. Anomaly detection
4. Datasets
5. Experiments
6. Discussion & Conclusion

Page 30: OddBall: Spotting Anomalies in Weighted Graphs

P.4 Anomaly detection

Anomaly ≈

- violates our “laws”
- too far away from the rest of the points

Page 31: OddBall: Spotting Anomalies in Weighted Graphs

score_dist = distance to fitting line

score_outl = outlierness score

score = func(score_dist, score_outl)

- can tell what kind of anomaly a node belongs to
- can sort nodes wrt their outlierness scores
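
A distance-to-fitting-line score can be sketched as below. The slide only says "distance to fitting line", so the specific form used here (ratio of observed to expected value, multiplied by the log of their absolute difference plus one) is an assumption of this sketch. The function name `score_dist`, the fitted parameters `C` and `theta`, and the sample points are all hypothetical. The outlierness score and the combining `func` are not shown.

```python
import math


def score_dist(x, y, C, theta):
    """Score how far (x, y) lies from the fitted power law y ~ C * x**theta.

    Assumed form (not given on the slide):
        max(y, expected) / min(y, expected) * log(|y - expected| + 1)
    Requires y > 0 and expected > 0. A point exactly on the line scores 0.
    """
    expected = C * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1.0)


# Hypothetical (N_i, E_i) pairs for three nodes, scored against E = 1.0 * N**1.0.
points = {"a": (9, 9), "b": (9, 18), "c": (9, 40)}
ranked = sorted(points, key=lambda v: score_dist(*points[v], 1.0, 1.0), reverse=True)
print(ranked)  # nodes sorted from most to least deviating
```

Multiplying the ratio by a log of the absolute gap penalises both the relative and the absolute deviation, so large and small egonets can share one ranking.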

Page 32: OddBall: Spotting Anomalies in Weighted Graphs

Outline

1. Motivation
2. Preliminaries
3. Proposed Method
   a. Study of egonets
   b. Laws and Observations
   c. Anomaly detection
4. Datasets
5. Experiments
6. Discussion & Conclusion

Page 33: OddBall: Spotting Anomalies in Weighted Graphs

Datasets

Bipartite networks      |N|      |E|
1. Don2Com              1.6M     2M
2. Com2Cand             6K       125K
3. Auth2Conf            421K     1M

Unipartite networks     |N|      |E|
5. BlogNet              27K      126K
6. PostNet              223K     217K
7. Enron                36K      183K
8. Oregon               11K      38K

Page 34: OddBall: Spotting Anomalies in Weighted Graphs

Outline

1. Motivation
2. Preliminaries
3. Proposed Method
   a. Study of egonets
   b. Laws and Observations
   c. Anomaly detection
4. Datasets
5. Experiments
6. Discussion & Conclusion

Page 35: OddBall: Spotting Anomalies in Weighted Graphs

Experimental Results

Anomaly / Dataset | Near-clique, Near-star | Heavy vicinity | Dominant pair, Uniform weights
Don2Com | N/A | ? | ?
Com2Cand | N/A | ? | ?
Auth2Conf | N/A | ? | ?
PostNet | ? | ? | ?
BlogNet | ? | ? | ?
Enron | ? | N/A | N/A
Oregon | ? | N/A | N/A

Page 36: OddBall: Spotting Anomalies in Weighted Graphs

Near-Clique/Star

Page 37: OddBall: Spotting Anomalies in Weighted Graphs

Near-Clique/Star

Page 38: OddBall: Spotting Anomalies in Weighted Graphs

Experimental Results

Anomaly / Dataset | Near-clique, Near-star | Heavy vicinity | Dominant pair, Uniform weights
Don2Com | N/A | ? | ?
Com2Cand | N/A | ? | ?
Auth2Conf | N/A | ? | ?
PostNet | self-linking post, post w/ numerous links to diverse posts | ? | ?
BlogNet | “link blogs” devoted to a wide array of content | ? | ?
Enron | Kenneth Lay (>1K contacts) | N/A | N/A
Oregon | 3 large ASPs, Verizon, Sprint, AT&T | N/A | N/A

Page 39: OddBall: Spotting Anomalies in Weighted Graphs

Heavy Vicinity

Page 40: OddBall: Spotting Anomalies in Weighted Graphs

Heavy Vicinity

Page 41: OddBall: Spotting Anomalies in Weighted Graphs

Experimental Results

Anomaly / Dataset | Near-clique, Near-star | Heavy vicinity | Dominant pair, Uniform weights
Don2Com | N/A | Bush-Cheney ’04 Inc, Kerry Committee | ?
Com2Cand | N/A | Liberty Congressional PAC, Aaron Russo | ?
Auth2Conf | N/A | Averill M. Law - Winter Simulation Conference | ?
PostNet | self-linking post, post w/ numerous links to diverse posts | post listed as blog homepage, post w/ single repeated link | ?
BlogNet | “link blogs” devoted to a wide array of content | Automotive News Today – GM blog | ?
Enron | Kenneth Lay (>1K contacts) | N/A | N/A
Oregon | 3 large ASPs, Verizon, Sprint, AT&T | N/A | N/A

Page 42: OddBall: Spotting Anomalies in Weighted Graphs

Dominant Heavy Link

$87M - DNC; $25M - RNC

Page 43: OddBall: Spotting Anomalies in Weighted Graphs

Dominant Heavy Link

Page 44: OddBall: Spotting Anomalies in Weighted Graphs

Experimental Results

Anomaly / Dataset | Near-clique, Near-star | Heavy vicinity | Dominant pair, Uniform weights
Don2Com | N/A | Bush-Cheney ’04 Inc, Kerry Committee | Negative edge weights due to returns
Com2Cand | N/A | Liberty Congressional PAC, Aaron Russo | DNC against George Bush
Auth2Conf | N/A | Averill M. Law - Winter Simulation Conference | Toshio Fukuda - ICRA; PLaTD - Hans Bekic
PostNet | self-linking post, post w/ numerous links to diverse posts | post listed as blog homepage, post w/ single repeated link | “ThinkProgress” and “A Freethinker’s Paradise” on a leak scandal
BlogNet | “link blogs” devoted to a wide array of content | Automotive News Today – GM blog | “Drudge” (298 links to 4); “Nocapital” (300 links to 2)
Enron | Kenneth Lay (>1K contacts) | N/A | N/A
Oregon | 3 large ASPs, Verizon, Sprint, AT&T | N/A | N/A

Page 45: OddBall: Spotting Anomalies in Weighted Graphs

Outline

1. Motivation
2. Preliminaries
3. Proposed Method
   a. Study of egonets
   b. Laws and Observations
   c. Anomaly detection
4. Datasets
5. Experiments
6. Discussion & Conclusion

Page 46: OddBall: Spotting Anomalies in Weighted Graphs

Scalability

Counting the number of edges in egonets for ALL nodes is expensive: we need to scan connections for all pairs of neighbors!

- Can be reworded as counting local triangles
- A fast method [Tsourakakis, 08] exists!

IDEA:

o # triangles through node i = (# closed paths of length 3 through i) / 2
o # closed paths of length 3 for node i = (A^3)_ii
o Computing A^3 is still expensive!
o Low-rank approximation!
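
The triangle identity can be checked on a tiny graph with the sketch below. It computes the diagonal of A^3 by brute force, so it illustrates the identity only and is not the fast low-rank method cited on the slide. The function names and the toy adjacency matrix are hypothetical, and an unweighted, symmetric 0/1 matrix with no self-loops is assumed. It also shows why triangles matter here: the egonet of node i contains its N_i ego-to-neighbor edges plus one neighbor-to-neighbor edge per triangle through i.

```python
def local_triangles(A):
    """Triangles through each node i, computed as (A^3)_ii / 2.

    A: symmetric 0/1 adjacency matrix (list of lists) with a zero diagonal.
    Each triangle through i is walked in two directions, hence the division by 2.
    """
    n = len(A)
    tri = []
    for i in range(n):
        closed = sum(A[i][j] * A[j][k] * A[k][i]
                     for j in range(n) for k in range(n))
        tri.append(closed // 2)
    return tri


def egonet_edge_counts(A):
    """E_i = N_i + (triangles through i) for every node i."""
    return [sum(row) + t for row, t in zip(A, local_triangles(A))]


# Toy graph: triangle 0-1-2 plus a pendant node 3 attached to node 0.
A = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(local_triangles(A))     # one triangle through nodes 0, 1, 2; none through 3
print(egonet_edge_counts(A))  # node 0 has 3 neighbors plus edge 1-2 in its egonet
```

The brute-force loop costs O(n^2) per node, which is the expense the low-rank approximation is meant to avoid.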

Page 47: OddBall: Spotting Anomalies in Weighted Graphs

A ≈ U S U^T, where A is n×n, U is n×k, S is k×k, and U^T is k×n

A^3 ≈ U S^3 U^T, which reduces the cost from O(n^3) to about O(n k^2)

[Figure: numerical example of a 12×12 matrix A approximated by rank-4 factors U (12×4), S = diag(0.44, 0.44, 0.18, 0.18), and U^T (4×12)]

- Prune d=1 nodes
- Prune d=2 as well as d=1 nodes
- Result: smaller & sparser A matrix

Page 48: OddBall: Spotting Anomalies in Weighted Graphs

Scalability – time vs. size

[Figure: time vs. number of edges, showing the effect of pruning on computation time. Solid (–): no pruning; dashed (−−): pruning nodes w/ d ≤ 1; dotted (…): pruning nodes w/ d ≤ 2]

Computation time increases linearly with increasing number of edges, while decreasing with pruning.

Page 49: OddBall: Spotting Anomalies in Weighted Graphs

Scalability – accuracy vs time

[Figure: time vs. accuracy, showing the effect of pruning on the accuracy of finding the top anomalies as in the original ranking before pruning]

New rankings are scored using Normalized Discounted Cumulative Gain (NDCG).

Pruning reduces time for both Node-Iterator and Eigen-Triangle while keeping accuracy as high as ~1 and ~.9, respectively.

Page 50: OddBall: Spotting Anomalies in Weighted Graphs

Conclusion

- OddBall, a fast, unsupervised method to detect abnormal nodes in weighted graphs.
- Study of egonets; list of numerical features
- Discovery of new patterns in density (Obs. 1: EDPL), weights (Obs. 2: EWPL), and principal eigenvalues (Obs. 3: ELWPL).
- Speed-up in feature extraction, with accuracy ~.9
- Experiments on real graphs of over 1M nodes, that reveal strange/extreme nodes from many different domains

Software available online! http://www.cs.cmu.edu/~lakoglu/#tools

Page 51: OddBall: Spotting Anomalies in Weighted Graphs

Related Work

- Noble and Cook [KDD,03] detect anomalous sub-graphs using variants of the MDL principle.
- Eberle and Holder [ICDM,07] detect unexpected/missing nodes/edges in labeled graphs.
- Liu et al. [SDM,05] detect non-crashing bugs in software using frequent execution flow graphs combined with supervised classification.
- Sun et al. [ICDM,05] use proximity and random walks to assess normality of nodes in bipartite graphs.
- Chakrabarti [PKDD,04] spots anomalous edges as a by-product of cross-associations.

Page 52: OddBall: Spotting Anomalies in Weighted Graphs

http://www.cs.cmu.edu/~lakoglu/#tools

QUESTIONS?

Page 53: OddBall: Spotting Anomalies in Weighted Graphs

OddBall over time?

- Rank nodes at each time tick t
- Node i will have rank vector R_i = (r_{i,1}, r_{i,2}, …, r_{i,t})
- Sort nodes w.r.t. |R_i ≤ threshold|: threshold = 3 will sort nodes w.r.t. the number of time-ticks they appear in top 3 outliers
- Note: not all nodes appear at all time-ticks

Page 54: OddBall: Spotting Anomalies in Weighted Graphs

OddBall over time?

threshold=3

- Node was active at 46 time-ticks.

- At 26 of them, it was in top-3 outliers.

- Score becomes:

(26/46) * 26

- Rank: 1
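
The score on this slide, (26/46) * 26, reads as (fraction of active time-ticks spent in the top-3) times (number of time-ticks in the top-3). A minimal sketch under that reading follows. The function name `temporal_score` and the synthetic rank list are assumptions; only the counts 46 and 26 come from the slide.

```python
def temporal_score(ranks, threshold=3):
    """Score a node from its per-tick outlier ranks.

    ranks: ranks of the node at the time-ticks where it was active
           (inactive ticks are simply absent from the list).
    Returns (hits / active) * hits, where hits counts ranks <= threshold.
    """
    active = len(ranks)
    hits = sum(1 for r in ranks if r <= threshold)
    return (hits / active) * hits if active else 0.0


# Synthetic ranks reproducing the slide's counts: active at 46 ticks, top-3 at 26.
ranks = [1] * 26 + [50] * 20
print(temporal_score(ranks))  # equals (26/46) * 26
```

Weighting the hit count by the hit fraction favors nodes that are anomalous for most of the time they are active over nodes that are merely active for a long time.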

84321653332|MR|10-MAY-65|400010|15-MAR-03|ACTIVE|RCVALUE

Page 55: OddBall: Spotting Anomalies in Weighted Graphs

OddBall over time?

threshold=3

- Node was active at 46 time-ticks.
- At 26 of them, it was in top-3 outliers.
- Score becomes: (26/46) * 26
- E.g. from t=154 (MAY-2) to t=170 (MAY-18), it appears in top-3.

Page 56: OddBall: Spotting Anomalies in Weighted Graphs

OddBall over time?

Page 57: OddBall: Spotting Anomalies in Weighted Graphs

OddBall over time?

threshold=3

- Node was active at 177 time-ticks.

- At 35 of them, it was in top-3 outliers.

- Rank: 5

84354420350|Mr|21-OCT-76|400033|23-JAN-04|ACTIVE|RCVALUE
