Modeling massive dynamic graphs Chris Volinsky AT&T Research Statistics Research Department Along with: Deepak Agarwal, Bob Bell, Corinna Cortes, Shawndra

Modeling massive Modeling massive dynamic graphsdynamic graphs

Chris Volinsky

AT&T Research

Statistics Research Department

Along with:

Deepak Agarwal, Bob Bell, Corinna Cortes, Shawndra Hill, Daryl Pregibon,

OutlineOutline• Defining a Dynamic Graph, and our objectives• A motivating example: Repetitive fraud in telecommunications• Our approach: representation and approximation of dynamic graphs• Revisiting Repetitive Fraud • Parameter setting and applications to other domains• Fraud revisited – applying our methods• Other applications, conclusions, further work

Summary

• Analyzing large dynamic graphs of transactional data is hard!• By storing and analyzing a massive graph as an indexed set of small

local graphs, we have developed a generic framework for dynamic graph representation, which facilitates fast and accurate analysis.

Defining a Dynamic Graph, and our objectives

Defining Dynamic Graphs Defining Dynamic Graphs

• Dynamic Graphs represent transactional data – – Credit card data– Web connectivity data– Web logs– Telecommunications network traffic– Online auction data

Transactional data can be represented as a directed graph…

Kathleen

Chris Daryl

JenFred

Corinna

John ZachDebbyAnne

Defining Dynamic GraphsDefining Dynamic Graphs

• Dynamic Graphs– Nodes represent transactors– Edges are directed transactions– All edges have a time stamp– All edges have a weight (?)– May contain

• Other attributes on nodes

• Other attributes on edges

Analysis of dynamic graphsAnalysis of dynamic graphsWhy is it hard?

• Scale– Often tens or hundreds of millions of nodes and edges – Doesn’t fit in main memory

• Dynamic – Large numbers of nodes coming and going

continuously– Accounting for temporal component of changing

graphs is a challenge

• Heterogeneous nodes– Superactive nodes dominate visualization and analysis– Zipf’s law holds: most users have few connections

Overview: Our ObjectivesOverview: Our Objectives• Define a representation for a dynamic graph

– What is the graph at time t: Gt

– How does one account for addition and attrition of nodes

• Graphs can be used for– Global properties - learning about the graph properties

• Clustering coefficient

• Power law coefficient

• Graph depth

– Local represenation – learning about individual nodes in the graph

• Entities – the local subgraph of a node of interest

• Fraud signatures, anomaly (intrusion) detection, matching, etc.

• Our Goal: Methods for entity based analysis in large dynamic graphs

A motivating example: Repetitive fraud in telecommunications

Motivating Example: Repetitive Motivating Example: Repetitive FraudFraud• Lots of people cant pay their bill, but they want phone service

anyway:

Name Ted Hanley

Address 14 Pearl Dr

St Peters, MN

Balance $208.00

Disconnected 2/19/04 (nonpayment)

Name Debra Handley

Address 14 Pearl Dr

St Peters, MN

Balance $142.00

Connected 2/22/04

Name Elizabeth Harmon

Address APT 1045

4301 ST JOHN RD

SCOTTSDALE, AZ

Balance $149.00

Disconnected 2/19/04 (nonpayment)

Name Elizabeth Harmon

Address 180 N 40TH PL

APT 40

PHOENIX, AZ

Balance $72.00

Connected 1/31/04

Motivating Example: Repetitive FraudMotivating Example: Repetitive FraudHow can we identify that it is the same person behind

both accounts?

Old Account:

67855232344 New Account:

4215554597

Old Date: 2003-02-25 New Date: 2003-02-13

Old Name: DAVID ATKINS New Name: DAVID WATKINS

Old Address:

10 NIGHT WAY APT 114

New Address:

10 HATSWORTH DR

Old City: FAYVILLE New City: BONDALE

Old State: AL New State: AL

Old Zip: 302141798 New Zip: 300021530

Old II Code: 5512127609901 New II Code: 5312074639501

Old Balance: 284.62 New Balance: 5.83

Motivating Example: ChallengesMotivating Example: Challenges• This is a problem of record linkage and graph matching,

but because of obfuscation, we can only count on the graph matching!

• But the problem is huge:

• Q: How can we do efficient matching to put a dent in this problem?

• A: Efficient representation of entities

Connect pool

TRestrict pool

10 K/day10 K/day300K/300K/monthmonth

5 K/day5 K/day150 150

K/monthK/month45 billion comparisons

Motivating Example: Our Motivating Example: Our datadata– Our graph is large

• 350M Telephone numbers (TNs) currently active on our Long Distance network, 300M calls/day

– Our graph is dynamic 4 Million TNs appear per

week

4 Million TNs disappear per

week

Motivating Example: Our dataMotivating Example: Our data

Our graph is sparse:

For one year of long distance data:

Median = 34Median = 34

95% = 17195% = 171

• Our Approach to Dynamic Graphs– Definition– Representation– Approximation

Our Approach: Defining dynamic Our Approach: Defining dynamic graphs graphs • Q: for transactional data, what does the graph at time t (Gt)mean?

- let gt be the collection of nodes and edges during the time period t

• We could use: tt gG Too narrow!

• We could use the union of all time periods:

i

t

itt ggggG

1

21

Too broad!

• We could use a moving average of the most recent time periods:

i

t

ntitntntt ggggG

1

Too many!

Our Approach: Defining dynamic Our Approach: Defining dynamic graphsgraphs

We adopt an Exponentially Weighted Moving Average (EWMA):

t1tt gθ)(1θGG

Alternatively, this is:

igωgωgωgωG i

t

1itt2211t

θ)(1θω where iti

• Advantages:- recent data has most influence- only one most recent graph need be stored

i.e. today’s graph is defined recursively as a convex combination of yesterday’s graph and today’s data

Through time, edge weights decay with decay rate

Our Approach: Defining dynamic Our Approach: Defining dynamic graphsgraphs

closer to 1• calls decay slower• more historical data included• smoother

closer to 0• faster decay• recent calls count more• more power to detect changes• less smooth

Selecting Selecting

= 1/(1-n) means weight reduces to 1/e times its original weight in n days

Our Approach: RepresentationOur Approach: Representation• Since we are interested in entities, we represent the graph as a union of

entity graphs

• These entity graphs are the atomic units of analysis, a signature of the behavior of the node

• As it turns out, storing hundreds of millions of small graphs is much more efficient than storing one massive graph. We use Hancock, a language developed at AT&T for signatures, which allows for easy indexed storage of

• In short, we are approximating a hugh graph with many little graphs, which are easily accessible (via indexed storage) for analysis.

2222222222 100.32222222222 100.31111111111 90.11111111111 90.13213232423 27.03213232423 27.09098765453 11.39098765453 11.388764573268876457326 5.4 5.42122121212 3.02122121212 3.09908989898 0.99908989898 0.98887878787 0.18887878787 0.1

Our Approach: RepresentationOur Approach: Representation

Update the graph by updating all of the atomic units daily – so any time we access the data we have the most recent representation.

1111111111 20.01111111111 20.02122121212 10.02122121212 10.09991119999 5.09991119999 5.0

2222222222 100.32222222222 100.31111111111 90.11111111111 90.13213232423 27.03213232423 27.09098765453 11.39098765453 11.388764573268876457326 5.4 5.42122121212 3.02122121212 3.09908989898 0.99908989898 0.98887878787 0.18887878787 0.1

++ ==1111111111 92.11111111111 92.12222222222 90.32222222222 90.33213232423 24.33213232423 24.39098765453 10.19098765453 10.188764573268876457326 4.9 4.92122121212 3.72122121212 3.79991119999 0.59991119999 0.5 3990898989 0.83990898989 0.88887878787 0.098887878787 0.09

Yesterday’s graphYesterday’s graph Today’s dataToday’s data Today’s Today’s graphgraph

Our Approach: ApproximationOur Approach: Approximation• We also implore two types of approximation of the graph,

by pruning.

• Local pruning of edges – designate a maximal in and out degree (k) for each node, and assign an overflow bin

• Global pruning of edges – overall threshold () below which edges are removed from the graph

1111111111 92.11111111111 92.12222222222 90.32222222222 90.33213232423 24.33213232423 24.39098765453 10.19098765453 10.188764573268876457326 4.9 4.92122121212 3.72122121212 3.7OtherOther 1.4 1.4

1111111111 92.11111111111 92.12222222222 90.32222222222 90.33213232423 24.33213232423 24.39098765453 10.19098765453 10.188764573268876457326 4.9 4.92122121212 3.72122121212 3.79991119999 0.5 9991119999 0.5 3990898989 0.83990898989 0.88887878787 0.098887878787 0.09

==

Removes stale edges

Reduces effect of supernodes

Our Approach: ApproximationOur Approach: Approximation• Defending k

– Most entities have the vast majority of their weight in a fraction of their nodes

Our Approach: Parameter SelectionOur Approach: Parameter Selection• We have now defined a representation of a dynamic graph

by three parameters: controls the decay of edges and edge weights global pruning parameter k – local pruning parameter

• For a given application, we choose the parameters by optimizing predictive performance, selecting the parameters which optimize a loss function

– Two loss functions we have used:• Weighted Dice• Hellinger Distance

Our Approach: Parameter SelectionOur Approach: Parameter Selection• Let A and B be two entities.

• Weighted Dice:

• Hellinger Distance:

• For each value– Set to be a low tolerance value

– For a range of k, optimize – Look at the plot to select parameters

jA

BABAj

jp

jpjpIBAWD

)(1

))()((),(

)(

)()(),(BAj

BA jpjpBAHD

Our Approach: Parameter SelectionOur Approach: Parameter Selection

Our Approach: SummaryOur Approach: Summary• This method allows us, for all 350 million phone numbers

we see on the network to have an up-to-date representation of the entity associated with each number. These entities are stored in an indexed data base for easy storage and retrieval

• Parameters are set to maximize the self-similarity of a node with itself through time, to allow us to match entities.

Fraud Revisited: Applying our methods

Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods

• Fraudsters don’t get caught, so they go elsewhere

• Can we find them, based on their calling patterns?

• Intuition: fraudster may change network identity, but calling pattern will be similar

• Find overlap in calling pattern between recent connected lines and lines shut down for fraud.


• Repetitive Fraud Process

Connect pool

TRestrict pool

10 K/day10 K/day300K/300K/monthmonth

5 K/day5 K/day150 150

K/monthK/month

• Anchor on a few restrict dates in the pool

• Compare to all connects in the connect pool

• For all of those that have at least one overlap, collect more information

Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods• For each potential matched pair

– Calculate a matching score, based on the characteristics of the overlap:

– Incorporate other information, including • Name, address

• Payment history

• Tenure

Restrict

TN-1

TN-2

TN-3

TN-4

TN-5

Connect

Connect

Fraud Revisited: Applying our Fraud Revisited: Applying our methodsmethods• Results:

– We identify 50-100 of these cases a day– 95% match rate– 85% block rate

– Credited with saving AT&T $5M in reduced costs and uncollectables in 2002-2003

– By far the most reliable matching criteria is the entity based matching

Other applications, conclusions, further Other applications, conclusions, further workwork• We have a representation of a dynamic graph which can be

applied to any problem where entity modeling over time is of interest

• Other fraud: GBA

• Language models

• Email

• Web pages

• Terrorism

• Viral Marketing– Targeting– Understanding

Other slidesOther slides

Matching AlgorithmMatching Algorithm

• What cases will we present to the reps? • A combination of:

– COI Overlap measures• At least two, and strength determined by uniqueness of overlap

TNs

– Name/address overlap• Edit distance no more than 50% of the longest name or address

– $$ owed• Most interested in the ones that will generate the most $$

• 500-1000 cases a day become 100-150 that we present to the reps

Motivating Example: Repetitive FraudMotivating Example: Repetitive Fraud

• When we catch a fraudster, we rarely catch the person, we simply shut down the line

• They will likely move on to another attempt at defrauding us, from a different network location

• Idea: record linkage - network identity has changed, but network behavior is the same

• We can use network behavior to indicate that the new line has the same “owner” as an old line

COI Signatures to COICOI Signatures to COI• To construct a COI from a COI signature:

– Often the signature contains things we don’t want:• Businesses• High weight nodes

– Often the signature doesn’t contain things we do want:• Local calls• Other carrier calls

• To combat this, create a COI by:– Recursively expanding the COI signature– Adding edges– Pruning edges

here’s an example…

COI COI signaturesignature

me

other

other

Extended Extended COICOI

me

other

other

Enhanced Enhanced COICOI

me

other

other

Pruned COIPruned COI

me

other

other

A likely case of the same A likely case of the same fraudster showing up as a new fraudster showing up as a new numbernumber

Pink nodes exist in both COI


Where:

wao = weight of edge from a to o

wob = weight of edge from o to b

wo = sum weight of edges to o

dao, dob are the graph distances from a and b to o

obaoo o

obao

ddw

ww 1b)overlap(a,

overlap}in {

• Calculate the “informative overlap” score:

ZA O

Bwobwao

wo

Documents

Modeling massive dynamic graphs Chris Volinsky AT&T Research Statistics Research Department Along with: Deepak Agarwal, Bob Bell, Corinna Cortes, Shawndra