OddBall: Spotting Anomalies in Weighted Graphs
Leman Akoglu, Mary McGlohon, Christos Faloutsos
Carnegie Mellon University
School of Computer Science
Pittsburgh, Pennsylvania, USA
Motivation
Anomaly detection in networks (graph data) has important applications:
- Computer networks: spammers, port scanners
- Phone-call networks: telemarketers, misbehaving customers, faulty equipment
- Social networks: ‘popularity contests’
- Account networks: scammers, transfer fraud
- Terrorist networks: tight groups of people
PAKDD 2010 Akoglu, McGlohon, Faloutsos
Problem
Q1. Given a weighted and unlabeled graph, how can we spot strange, abnormal, extreme nodes?
Q2. Can we explain why the spotted nodes are anomalous?
Preliminaries I – What is an anomaly?
No clear and unique definition!
“An observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism.” [Hawkins, 80]
Preliminaries II – Weights
[Figure: example weighted graphs, one bipartite and one unipartite; edge labels shown include $10K, $5K, $15K, 1, and 3.]
Preliminaries III – Power Laws
Pr[X ≥ x] ~ c x^{-α}
ln(Pr[X ≥ x]) ~ ln c - α ln x
c ≥ 0, α ≥ 0
[Figure: the same distribution on a lin-lin plot and on a log-log plot; on the log-log plot, slope = -α.]
‘Power Law’ Example
[Figure: plots for the DBLP Keyword-to-Conference Network; axis labels: # Edges, Total weight, # Source nodes, # Destination nodes.]
Densification Power Law [Leskovec ‘05]
Weight Power Law [McGlohon ‘08]
‘Power Law’ Example
[Figure: In-weights ($) vs. In-degree (# donors) for the 2004 US FEC Committees to Candidates network; e.g. John Kerry, $10M received from 1K donors.]
Snapshot Power Law [McGlohon et al. ‘08]
Preliminaries IV – how to fit
Least Squares
fit to medians!
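A minimal sketch of what "least squares fit to medians" could look like in practice. The slide does not say how the medians are formed, so the logarithmic bucketing along x, the function name `fit_power_law_to_medians`, and the parameter `bins_per_decade` are assumptions made for illustration only.

```python
import math
from statistics import median

def fit_power_law_to_medians(xs, ys, bins_per_decade=5):
    """Fit y ≈ C * x**theta by least squares on per-bucket medians (log-log)."""
    # Assumed bucketing: logarithmically spaced buckets along x.
    buckets = {}
    for x, y in zip(xs, ys):
        if x <= 0 or y <= 0:
            continue  # logs are undefined for non-positive values
        key = math.floor(math.log10(x) * bins_per_decade)
        buckets.setdefault(key, []).append((x, y))
    # One point per bucket: (log of median x, log of median y).
    pts = [(math.log(median(x for x, _ in b)), math.log(median(y for _, y in b)))
           for b in buckets.values()]
    if len(pts) < 2:
        raise ValueError("need at least two non-empty buckets")
    n = len(pts)
    mean_x = sum(px for px, _ in pts) / n
    mean_y = sum(py for _, py in pts) / n
    sxx = sum((px - mean_x) ** 2 for px, _ in pts)
    sxy = sum((px - mean_x) * (py - mean_y) for px, py in pts)
    theta = sxy / sxx                      # slope of the fitted line in log-log space
    c = math.exp(mean_y - theta * mean_x)  # intercept mapped back to linear scale
    return c, theta

# Usage: an exact power law y = 2 * x**1.5 plus one extreme value per bucket.
xs, ys = [], []
for x in (1, 10, 100, 1000):
    xs += [x, x, x]
    ys += [2 * x ** 1.5, 2 * x ** 1.5, 1e9]
c, theta = fit_power_law_to_medians(xs, ys)
```

Because each bucket contributes only its median, the single extreme value per bucket does not pull the fitted line, which is the practical motivation for fitting to medians rather than to all points.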
Problem revisited
Q1. Given a weighted and unlabeled graph, how can we spot strange, abnormal, extreme nodes?
Q2. Can we explain why the spotted nodes are anomalous?
Problem sketch
Main idea
For each node,
P.1) extract ‘ego-net’ (=1-step-away neighbors)
P.2) extract features (#edges, total weight, etc.)
P.3) extract patterns (norms)
P.4) anomaly detection: compare with the rest of the population
LLNL'10 C. Faloutsos (CMU) 12
Outline
1. Motivation
2. Preliminaries and Problem Definition
3. Proposed Method
a. Study of ego-nets
b. Laws and Observations
c. Anomaly detection
4. Datasets
5. Experiments
6. Discussion & Conclusion
P.1 What is an egonet?
[Figure: a graph with one node labeled “ego” and its 1-step neighborhood labeled “ego-net”.]
What is odd?
PwC 2009 Leman Akoglu 16
What is “anomalous”?
- Near-star: telemarketer, port scanner, people adding friends indiscriminately, etc.
- Near-clique: tightly connected people, terrorist groups?, discussion group, etc.
What is “anomalous”?
- Heavy vicinity: too much money wrt number of accounts, high donation wrt number of donors, etc.
- Dominant heavy link: single-minded, tight company
P.2 What features…
… should we extract so as to project nodes into a low-dimensional space?
- features that could yield “laws”
- features that are easy to compute and interpret
Selected Features
N_i: number of neighbors (degree) of ego i
E_i: number of edges in egonet i
W_i: total weight of egonet i
λ_{w,i}: principal eigenvalue of the weighted adjacency matrix of egonet i
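The four selected features can be computed per node roughly as follows. This is a sketch under stated assumptions: the graph is undirected with non-negative weights and no self-loops, it is stored as a dict of dicts, and the principal eigenvalue is estimated by shifted power iteration. The names `add_edge` and `egonet_features` are hypothetical and do not come from the released OddBall software.

```python
import math

def add_edge(adj, u, v, w):
    """Insert an undirected weighted edge into a dict-of-dicts adjacency."""
    adj.setdefault(u, {})[v] = w
    adj.setdefault(v, {})[u] = w

def egonet_features(adj, ego, iters=500):
    """Return (N_i, E_i, W_i, lambda_w_i) for the 1-step egonet of `ego`."""
    nodes = [ego] + list(adj[ego])
    members = set(nodes)
    # Adjacency restricted to the egonet (ego plus its direct neighbors).
    sub = {u: {v: w for v, w in adj[u].items() if v in members} for u in nodes}
    n_i = len(adj[ego])
    e_i = sum(len(nb) for nb in sub.values()) // 2               # each edge is stored twice
    w_i = sum(w for nb in sub.values() for w in nb.values()) / 2
    # Power iteration on (A + shift*I); the shift avoids oscillation on
    # bipartite egonets such as stars, whose spectrum is symmetric around 0.
    shift = max(sum(nb.values()) for nb in sub.values())
    x = {u: 1.0 for u in nodes}
    for _ in range(iters):
        y = {u: shift * x[u] + sum(w * x[v] for v, w in sub[u].items()) for u in nodes}
        norm = math.sqrt(sum(val * val for val in y.values()))
        if norm == 0:
            break  # isolated ego: the restricted matrix is all zeros
        x = {u: val / norm for u, val in y.items()}
    # Rayleigh quotient x^T A x recovers the eigenvalue of A itself.
    lam = sum(x[u] * sum(w * x[v] for v, w in sub[u].items()) for u in nodes)
    return n_i, e_i, w_i, lam

# Usage: a star with 9 unit-weight leaves around node 0.
star = {}
for leaf in range(1, 10):
    add_edge(star, 0, leaf, 1.0)
n_i, e_i, w_i, lam = egonet_features(star, 0)
```

For this star the eigenvalue comes out as the square root of the number of neighbors, and a fixed iteration count is enough at this size; larger egonets would likely call for a convergence check or a sparse eigensolver.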
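Details: principal eigenvalue λ_{w,i} for example egonets (N: # neighbors, W: total weight)
[Figure: example egonets annotated with λ_{w,i}. Annotations as extracted: λ_{w,i} = √N = √E = √W; λ_{w,i} > √N, ~ √E, √W; λ_{w,i} = N ≈ √W; λ_{w,i} = W; λ_{w,i} ≈ W; λ_{w,i} vs. √W (relation symbol lost in extraction).]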
Other Features
S_i: number of singleton (degree-1) neighbors of ego i
max(W_i): maximum edge weight in egonet i
max(W_i, d=1): maximum edge weight to/from a degree-1 neighbor of ego i
max(d_i): maximum degree among the neighbors of ego i
2-step neighborhood features
P.3 What patterns?
Observation 1: Egonet Density Power Law (EDPL)
Q1: How does the number of neighbors N of the egonet relate to the number of edges E?
E_i ∝ N_i^α
1 ≤ α ≤ 2
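The two ends of the range for α correspond to the two extreme egonet shapes named earlier, the star and the clique. The short derivation below is a worked illustration of that reading and is not taken from the slides.

```latex
% Star egonet: the N_i neighbors are connected only to the ego.
E_i = N_i \quad\Longrightarrow\quad \alpha = 1

% Clique egonet: the ego and its N_i neighbors are all pairwise connected.
E_i = \binom{N_i + 1}{2} = \frac{N_i\,(N_i + 1)}{2} \;\propto\; N_i^{2}
\ \text{ for large } N_i \quad\Longrightarrow\quad \alpha = 2
```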
P.3 What patterns?
Observation 2: Egonet Weight Power Law (EWPL)
Q2: How does the total weight W of the egonet relate to the number of edges E?
W_i ∝ E_i^β
β ≥ 1
P.3 What patterns?
Observation 3: Egonet λw Power Law (ELWPL)
Q3: How does the largest eigenvalue λ_w of the weighted adjacency matrix of the egonet relate to the total weight W?
λ_{w,i} ∝ W_i^γ
0.5 ≤ γ ≤ 1
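Two limiting cases are consistent with the stated range for γ. They are offered here as a worked illustration under simple assumptions (unit weights for the star, a single edge for the dominant link), not as the authors' proof.

```latex
% Star with N_i unit-weight edges: the adjacency matrix has principal eigenvalue \sqrt{N_i},
% and the total weight is W_i = N_i.
\lambda_{w,i} = \sqrt{N_i} = W_i^{0.5} \quad\Longrightarrow\quad \gamma = 0.5

% Egonet consisting of a single edge of weight w: the 2 x 2 matrix
% \begin{pmatrix} 0 & w \\ w & 0 \end{pmatrix} has eigenvalues \pm w, and W_i = w.
\lambda_{w,i} = w = W_i \quad\Longrightarrow\quad \gamma = 1
```

An egonet dominated by one very heavy edge behaves like the second case, which is why points near γ = 1 flag "dominant heavy link" anomalies.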
P.4 Anomaly detection
Anomaly ≈
- violates our “laws”
- too far away from the rest of the points
score_dist = distance to fitting line
score_outl = outlierness score
score = func(score_dist, score_outl)
- score_dist can tell what kind of anomaly a node belongs to
- score_outl can sort nodes wrt their outlierness scores
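The slide leaves func and both component scores unspecified, so the sketch below fills them in with assumed choices: a ratio-times-log-difference distance to the fitted power law y = C·x^θ, a mean k-nearest-neighbor distance in log-log space as a simple stand-in for a density-based outlierness score, and a sum of max-normalized scores as the combination. All function names are hypothetical.

```python
import math

def distance_score(x, y, c, theta):
    """Deviation of (x, y) from the fitted power law y = c * x**theta (assumed form)."""
    expected = c * x ** theta
    ratio = max(y, expected) / min(y, expected)  # how many times off the line
    return ratio * math.log(abs(y - expected) + 1)

def knn_outlierness(points, k=2):
    """Mean distance to the k nearest neighbors in log-log space (stand-in score)."""
    logs = [(math.log(x), math.log(y)) for x, y in points]
    scores = []
    for i, p in enumerate(logs):
        dists = sorted(math.dist(p, q) for j, q in enumerate(logs) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores

def combined_scores(points, c, theta, k=2):
    """Sum of the two scores after scaling each by its maximum (assumed func)."""
    dist = [distance_score(x, y, c, theta) for x, y in points]
    outl = knn_outlierness(points, k)

    def normalize(vals):
        top = max(vals)
        return [v / top if top > 0 else 0.0 for v in vals]

    return [a + b for a, b in zip(normalize(dist), normalize(outl))]

# Usage: four points on the line y = x and one point far above it.
points = [(1, 1), (2, 2), (4, 4), (8, 8), (4, 400)]
scores = combined_scores(points, c=1.0, theta=1.0)
most_anomalous = scores.index(max(scores))
```

Points must have positive coordinates, since both scores work in log space. Whether a point lies above or below the fitted line is what separates anomaly types (for example near-clique vs. near-star on the E vs. N plot), so a fuller version would also keep the sign of the deviation.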
Datasets
Bipartite networks: |N| |E|
1. Don2Com 1.6M 2M
2. Com2Cand 6K 125K
3. Auth2Conf 421K 1M
Unipartite networks: |N| |E|
5. BlogNet 27K 126K
6. PostNet 223K 217K
7. Enron 36K 183K
8. Oregon 11K 38K
Experimental Results
Dataset | Near-clique, Near-star | Heavy vicinity | Dominant pair, Uniform weights
Don2Com | N/A | ? | ?
Com2Cand | N/A | ? | ?
Auth2Conf | N/A | ? | ?
PostNet | ? | ? | ?
BlogNet | ? | ? | ?
Enron | ? | N/A | N/A
Oregon | ? | N/A | N/A
Near-Clique/Star
Heavy Vicinity
Dominant Heavy Link
[Figure annotations: $87M - DNC; $25M - RNC]
Experimental Results
Dataset | Near-clique, Near-star | Heavy vicinity | Dominant pair, Uniform weights
Don2Com | N/A | Bush-Cheney ’04 Inc, Kerry Committee | Negative edge weights due to returns
Com2Cand | N/A | Liberty Congressional PAC, Aaron Russo | DNC against George Bush
Auth2Conf | N/A | Averill M. Law - Winter Simulation Conference | Toshio Fukuda - ICRA, PLaTD - Hans Bekic
PostNet | self-linking post, post w/ numerous links to diverse posts | post listed as blog homepage, post w/ single repeated link | “ThinkProgress” and “A Freethinker’s Paradise” on a leak scandal
BlogNet | “link blogs” devoted to a wide array of content | Automotive News Today – GM blog | “Drudge” (298 links to 4), “Nocapital” (300 links to 2)
Enron | Kenneth Lay (>1K contacts) | N/A | N/A
Oregon | 3 large ISPs: Verizon, Sprint, AT&T | N/A | N/A
Scalability
Counting the number of edges in egonets for ALL nodes is expensive!
- need to scan connections for all pairs of neighbors!
Can be reworded as counting local triangles. A fast method [Tsourakakis, 08] exists!
IDEA:
- # triangles at node i = (# closed paths of length 3 starting at i) / 2
- # closed paths of length 3 for node i = (A^3)_ii
- Computing A^3 is still expensive!
- Low-rank approximation!
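The identity behind this idea can be checked on a small graph. The sketch below computes the diagonal of A^3 exactly with plain lists; it does not implement the low-rank approximation of [Tsourakakis, 08], and it assumes an unweighted, undirected graph without self-loops. The function names are hypothetical.

```python
def local_triangles(a):
    """Triangles through each node of a simple undirected graph, via (A^3)_ii / 2."""
    n = len(a)
    # a2[i][j] = number of walks of length 2 from i to j.
    a2 = [[sum(a[i][k] * a[k][j] for k in range(n)) for j in range(n)]
          for i in range(n)]
    # (A^3)_ii = sum_j (A^2)_ij * A_ji counts closed walks of length 3 at i;
    # each triangle is traversed in two directions, hence the division by 2.
    return [sum(a2[i][j] * a[j][i] for j in range(n)) // 2 for i in range(n)]

def egonet_edge_counts(a):
    """E_i = N_i (ego-to-neighbor edges) + triangles_i (neighbor-to-neighbor edges)."""
    degrees = [sum(row) for row in a]
    return [d + t for d, t in zip(degrees, local_triangles(a))]

# Usage: edges 0-1, 0-2, 1-2, 2-3 (one triangle plus a pendant node).
a = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
triangles = local_triangles(a)
edges_per_egonet = egonet_edge_counts(a)
```

The second function shows why triangle counts are enough: every edge between two neighbors of the ego closes exactly one triangle through the ego, so the egonet edge count is the degree plus the local triangle count.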
Details: low-rank approximation
[Figure: a 12 x 12 example matrix A and its rank-4 factorization, with diagonal factor diag(0.44, 0.44, 0.18, 0.18); remaining matrix entries omitted.]
A ≈ U S U^T, with A: n x n, U: n x k, S: k x k, U^T: k x n
A^3 ≈ U S^3 U^T: O(n^3) ~ O(nk^2)
Pruning: prune d=1 nodes; prune d=2 as well as d=1 nodes
→ smaller & sparser A matrix
Scalability – time vs. size
[Figure: time vs. number of edges, showing the effect of pruning on computation time. Solid (–): no pruning; dashed (−−): pruning nodes w/ d ≤ 1; dotted (…): pruning nodes w/ d ≤ 2.]
Computation time increases linearly with increasing number of edges, while decreasing with pruning.
Scalability – accuracy vs time
[Figure: time vs. accuracy, showing the effect of pruning on the accuracy of finding the top anomalies as in the original ranking before pruning.]
New rankings are scored using Normalized Discounted Cumulative Gain (NDCG).
Pruning reduces time for both Node-Iterator and Eigen-Triangle while keeping accuracy as high as ~1 and ~.9, respectively.
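For readers unfamiliar with the metric, here is a sketch of how a new ranking could be scored against the original one with NDCG. The slides do not state the gain function, so the graded relevance used here (k minus the position in the original top-k, 0 outside it) and the names `dcg` and `ndcg` are assumptions.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain with the usual log2(position + 1) discount."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))

def ndcg(original_ranking, new_ranking, k):
    """NDCG of the new top-k against the original top-k (assumed gain definition)."""
    # Assumed relevance: the original top node is worth k, the next k-1, and so on.
    gain = {node: k - pos for pos, node in enumerate(original_ranking[:k])}
    ideal = dcg([gain[node] for node in original_ranking[:k]])
    achieved = dcg([gain.get(node, 0) for node in new_ranking[:k]])
    return achieved / ideal if ideal > 0 else 0.0

# Usage: the pruned run swaps the top two anomalies of the original ranking.
original = ["a", "b", "c", "d"]
after_pruning = ["b", "a", "c", "d"]
score = ndcg(original, after_pruning, k=3)
```

A score of 1 means the top-k order is unchanged; swaps near the top cost more than swaps further down because of the position discount.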
Conclusion
- OddBall, a fast, unsupervised method to detect abnormal nodes in weighted graphs.
- Study of egonets; list of numerical features
- Discovery of new patterns in density (Obs.1: EDPL), weights (Obs.2: EWPL), and principal eigenvalues (Obs.3: ELWPL).
- Speed-up in feature extraction, with accuracy ~.9
- Experiments on real graphs of over 1M nodes, that reveal strange/extreme nodes from many different domains
- Software available online! http://www.cs.cmu.edu/~lakoglu/#tools
Related Work
Noble and Cook [KDD, 03] detect anomalous sub-graphs using variants of the MDL principle.
Eberle and Holder [ICDM, 07] detect unexpected/missing nodes/edges in labeled graphs.
Liu et al. [SDM, 05] detect non-crashing bugs in software using frequent execution flow graphs combined with supervised classification.
Sun et al. [ICDM, 05] use proximity and random walks to assess normality of nodes in bipartite graphs.
Chakrabarti [PKDD, 04] spots anomalous edges as a by-product of cross-associations.
QUESTIONS?
OddBall over time?
- Rank nodes at each time tick t
- Node i will have rank vector R_i = (r_{i,1}, r_{i,2}, …, r_{i,t})
- Sort nodes w.r.t. |R_i ≤ threshold|, the number of entries of R_i that are ≤ threshold; e.g. threshold = 3 will sort nodes w.r.t. the number of time-ticks at which they appear in the top 3 outliers
- Note: not all nodes appear at all time-ticks
OddBall over time?
threshold=3
- Node was active at 46 time-ticks.
- At 26 of them, it was in top-3 outliers.
- Score becomes:
(26/46) * 26
- Rank: 1
84321653332|MR|10-MAY-65|400010|15-MAR-03|ACTIVE|RCVALUE
E.g. from t=154 (MAY-2)
to t=170 (MAY-18),
it appears in top-3
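The score in the example above, (26/46) * 26, can be read as the fraction of active time-ticks spent in the top-3 multiplied by the number of such ticks. The sketch below applies that reading to every node; generalizing from the single worked example is an assumption, and the names `temporal_score` and `rank_nodes_over_time` are hypothetical.

```python
def temporal_score(ranks, threshold=3):
    """Score one node from its ranks at the time-ticks where it was active."""
    active = len(ranks)
    hits = sum(1 for r in ranks if r <= threshold)  # ticks spent in the top `threshold`
    return (hits / active) * hits if active else 0.0

def rank_nodes_over_time(rank_history, threshold=3):
    """Order nodes by temporal score, highest first."""
    scores = {node: temporal_score(r, threshold) for node, r in rank_history.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Usage with the counts from the slides: 26 of 46 ticks vs. 35 of 177 ticks in the top 3.
rank_history = {
    "node_a": [1] * 26 + [10] * 20,
    "node_b": [2] * 35 + [50] * 142,
}
ordering = rank_nodes_over_time(rank_history)
```

Under this reading a node that is in the top-3 for most of its short lifetime can outrank a node with more top-3 appearances spread over a much longer active period.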
OddBall over time?
threshold=3
- Node was active at 177 time-ticks.
- At 35 of them, it was in top-3 outliers.
- Rank: 5
84354420350|Mr|21-OCT-76|400033|23-JAN-04|ACTIVE|RCVALUE