Opportunities and Challenges in Uncertainty Quantification in Complex Interacting Systems

Pip Pattison, University of Melbourne

University of Southern California April 13-14, 2009

Exponential random graph models for social networks and their role in uncertainty quantification

Joint work with

University of Melbourne: Garry Robins, Peng Wang, Galina Daraganova, Johan Koskinen, Dean Lusher

Oxford University/Groningen University: Tom Snijders

Building on work with: Mark Handcock (University of Washington), Dave Hunter (Penn State University), Martina Morris (University of Washington), Steve Goodreau (University of Washington)

and earlier work by: Frank, Strauss, Wasserman …

Main argument

• Many social processes (e.g. diffusion of information, diseases) depend on social interactions and social relationships, and different types of networks enable and constrain social processes in distinctive ways

• We need therefore to understand/model both the structure and dynamics of different types of networks and social interaction and the nature of network-mediated social processes

• Appropriate quantification of uncertainty requires good models and hence advances in:

    • Measurement and design
    • Understanding of relevant social processes
    • Careful model development and empirical testing

• Exponential random graph modelling approach provides a coherent framework for model development and uncertainty quantification

Outline

1. Exponential random graph models for social networks

2. Models for large networks from partial network data

3. Where next?

1. Exponential random graph models for social networks

Models for social networks: the problem

To develop a statistical framework for modelling networks that appropriately represents what we know/propose about network tie formation processes:

Exogenous effects: shared characteristics, interests and affiliations, and spatial propinquity all matter

Endogenous network effects: e.g., clustering, comparison and attachment processes

To do so in a way that affords a consistent approach to the analysis of various forms of longitudinal and cross-sectional social relational data

Network variables

We assume a fixed set of actors.

We consider a system of tie variables $Y = [Y_{ij}]$, where $Y_{ij} = 1$ if i has a tie to j and $Y_{ij} = 0$ otherwise.

A realisation of Y is denoted by $y = [y_{ij}]$.

Note that:

• We consider the system of all tie variables at once
• The variables are associated with relational ties between actors
• We do not assume that the variables are independent

[Figure: a tie from actor i to actor j]

Local interactivity and dependence

Local interactions: define two network tie variables to be neighbours if they are conditionally dependent, given the values of all other tie variables

A neighbourhood is a set of mutually neighbouring variables and corresponds to a potential network configuration, e.g. $\{Y_{12}, Y_{13}, Y_{23}\}$ corresponds to the triangle on nodes 1, 2 and 3.

Dependence structure: a hypothesis about which ties are neighbours

A useful dependence structure: the social circuit model (Snijders, Pattison, Robins & Handcock, 2006)

Assume that two tie variables are neighbours if:

• they share an actor

• their presence would create a 4-cycle

[Figure: two candidate ties whose presence would complete a 4-cycle; red ties are already present]

A 4-cycle is a closed structure that can sustain mutual social monitoring and influence, as well as levels of trustworthiness within which obligations and expectations might proliferate (e.g., Coleman, 1988; Bearman et al, 2005)

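Hammersley-Clifford theorem (Besag, 1974) applied to networks (Frank & Strauss, 1986)

$$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\Big[\sum_Q \theta_Q z_Q(y)\Big]$$

where:

• $\kappa(\theta) = \sum_y \exp\big[\sum_Q \theta_Q z_Q(y)\big]$ is the normalizing quantity
• $\theta_Q$ is the parameter for neighbourhood Q
• $z_Q(y) = \prod_{Y_{ij} \in Q} y_{ij}$ is the network statistic; it signifies whether all ties in Q are observed in y
• the summation is over all neighbourhoods Q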
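As a hedged illustration of this model form, the following Python sketch enumerates every graph on four nodes and computes P(Y = y) exactly, with the normalizing quantity obtained by brute-force summation. It assumes only two statistics (an edge count and a triangle count); the function names and parameter values are hypothetical and are not taken from PNet or statnet. Enumeration is feasible only for very small node sets.

```python
from itertools import combinations, product
from math import exp


def graph_statistics(edges, n):
    """Return (edge count, triangle count) for an undirected graph on n nodes."""
    adj = [[False] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = True
    triangles = sum(
        1
        for i, j, k in combinations(range(n), 3)
        if adj[i][j] and adj[j][k] and adj[i][k]
    )
    return len(edges), triangles


def ergm_distribution(n, theta_edge, theta_triangle):
    """Enumerate all graphs on n nodes and return {edge tuple: probability}."""
    dyads = list(combinations(range(n), 2))
    weights = {}
    for present in product([0, 1], repeat=len(dyads)):
        edges = tuple(d for d, y in zip(dyads, present) if y)
        n_edges, n_tri = graph_statistics(edges, n)
        weights[edges] = exp(theta_edge * n_edges + theta_triangle * n_tri)
    kappa = sum(weights.values())  # normalizing quantity: sum over all graphs
    return {g: w / kappa for g, w in weights.items()}


# Illustrative parameter values only.
dist = ergm_distribution(4, theta_edge=-1.0, theta_triangle=0.5)
bernoulli = ergm_distribution(4, theta_edge=-1.0, theta_triangle=0.0)
```

With the triangle parameter set to zero the model reduces to independent ties, so `bernoulli` can be checked against the closed-form Bernoulli probabilities.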

This means…

We model the probability of a network in terms of propensities for configurations of certain types to occur, each with its own parameters:

• Edges
• Stars (k-stars)
• Multiple paths (k-2-paths)
• Multiple triangles (k-triangles)

Equality and other constraints on model parameters

1. Assume (in the first instance) that isomorphic configurations have equal parameters, so there is one parameter for each class of network configurations

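2. It may sometimes also be convenient to assume a relationship between related parameters, e.g. for the star configurations (2-star, 3-star, 4-star, …) with parameters $\sigma_2, \sigma_3, \sigma_4, \ldots$:

$$\sigma_k = -\sigma_{k-1}/\lambda, \quad \text{for } k > 2 \text{ and } \lambda \geq 1 \text{ a (fixed) constant}$$

We then obtain a single parameter ($\sigma_2$) for the family of star configurations, with the alternating star statistic:

$$S^{[\lambda]}(y) = \sum_k (-1)^k \frac{S_k(y)}{\lambda^{k-2}}$$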

3. Likewise for k-2-paths and k-triangles …
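A minimal Python sketch of the alternating star statistic may help here. It assumes the statistic is the sum over k = 2, …, n-1 of (-1)^k S_k(y) / λ^(k-2), and that S_k(y) counts k-stars, so that a node of degree d contributes C(d, k) of them. The function names are hypothetical and the graph is supplied as a degree sequence.

```python
from math import comb


def k_star_count(degrees, k):
    """Number of k-stars: each node of degree d is the centre of C(d, k) of them."""
    return sum(comb(d, k) for d in degrees)


def alternating_star(degrees, lam=2.0):
    """Alternating star statistic with weights (-1)^k / lam^(k-2), for k = 2..n-1."""
    n = len(degrees)
    return sum(
        (-1) ** k * k_star_count(degrees, k) / lam ** (k - 2)
        for k in range(2, n)
    )


# A star with one centre and three leaves: three 2-stars and one 3-star.
example = alternating_star([3, 1, 1, 1], lam=2.0)  # 3 - 1/2
```

Because the signs alternate and the weights shrink geometrically, high-order stars contribute progressively less, which is what allows one parameter to stand in for the whole family.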

Exponential random graph models for networks

We can then model the probability of a network in terms of propensities for certain families of network configurations to occur:

• edge
• alt-star
• alt-2-path
• alt-triangle

The zone of order k of a set A of nodes

Define $Z_k(A)$, the zone of order k of A in the network y, to be the set of nodes within k steps of some node in A.

[Figure: zones of order 0 to 2 for the two-node set comprising the marked vertices]
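The zone of order k can be computed with a breadth-first search that stops after k steps. The sketch below is illustrative only: it assumes the network is stored as a dictionary mapping each node to the set of its neighbours, and the function name `zone` is hypothetical.

```python
from collections import deque


def zone(adjacency, nodes, k):
    """Return the set of nodes within k steps of some node in `nodes`."""
    distance = {v: 0 for v in nodes}
    queue = deque(nodes)
    while queue:
        u = queue.popleft()
        if distance[u] == k:
            continue  # do not expand beyond k steps
        for v in adjacency[u]:
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    return set(distance)


# A path 0-1-2-3-4-5, with the two-node set A = {2, 3}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
zones = [zone(path, {2, 3}, k) for k in range(3)]
```

For the path example, the zones of order 0, 1 and 2 are {2, 3}, {1, 2, 3, 4} and all six nodes respectively.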

Three generations of models (so far) for nondirected graphs

Tie variables $Y_{ij}$ and $Y_{kl}$ are conditionally independent unless:

• Bernoulli (Erdős-Rényi): $(i,j) = (k,l)$, i.e. $Z_0(\{i,j\}) = Z_0(\{k,l\})$ (same zone of order 0 for {i,j} and {k,l})

• Markov: $Z_0(\{i,j\}) \cap Z_0(\{k,l\}) \neq \emptyset$ (overlapping zones of order 0)

• Realisation-dependent (social circuit): $Z_1(\{i,j\}) \supseteq Z_0(\{k,l\})$ and $Z_1(\{k,l\}) \supseteq Z_0(\{i,j\})$ (zero-order zones for {i,j} and {k,l} jointly embedded in first-order zones for {k,l} and {i,j})

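Bernoulli: $Z_0\{i,j\} = Z_0\{k,l\}$

Markov: $Z_0\{i,j\} \cap Z_0\{k,l\} \neq \emptyset$

Social circuit: $Z_1\{i,j\} \supseteq Z_0\{k,l\}$ and $Z_1\{k,l\} \supseteq Z_0\{i,j\}$

Degree/closure interaction: $Z_1\{i,j\} \supseteq Z_0\{k,l\}$ or $Z_1\{k,l\} \supseteq Z_0\{i,j\}$

Three-path: $Z_1\{i,j\} \cap Z_0\{k,l\} \neq \emptyset$; $Z_1\{k,l\} \cap Z_0\{i,j\} \neq \emptyset$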

[Figure: configurations labelled by their numbers of nodes (m nodes, h nodes, m nodes in all), including the (m,h)-coat-hanger]

Hierarchy of models

Simulation

Simulation is a valuable tool for understanding a model (and hence for understanding how the various configuration propensities combine to give networks a particular structural signature)

Increasing triangulation

60 nodes: Fix the density at 0.05 (i.e. 88 or 89 edges)

Varying the propensity for alternating-triangles

The movie shows one representative graph from each simulated distribution

From centralization to segmentation

60 nodes: Fix the density at 0.05 (i.e. 88 or 89 edges)

Varying the propensity for alternating-stars

The movie shows one representative graph from each simulated distribution
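A simulation of this kind can be sketched with a Metropolis sampler that keeps the number of edges fixed by moving one tie at a time. This is a simplified stand-in, not the procedure used for the movies: it assumes a plain triangle count in place of the alternating-triangle or alternating-star statistics, and the function names, seed and parameter values are hypothetical.

```python
import math
import random
from itertools import combinations


def count_triangles(adj):
    """Count triangles; each one is seen once from each of its three edges."""
    total = sum(len(adj[i] & adj[j]) for i in adj for j in adj[i] if i < j)
    return total // 3


def simulate_fixed_density(n, n_edges, theta_triangle, steps, seed=0):
    """Metropolis sampler on graphs with exactly n_edges ties.

    Each proposal deletes one present tie and adds one absent tie, so density
    never changes. The target is proportional to exp(theta_triangle * triangles).
    """
    rng = random.Random(seed)
    dyads = list(combinations(range(n), 2))
    present = rng.sample(dyads, n_edges)
    chosen = set(present)
    absent = [d for d in dyads if d not in chosen]
    adj = {v: set() for v in range(n)}
    for i, j in present:
        adj[i].add(j)
        adj[j].add(i)
    for _ in range(steps):
        a = rng.randrange(len(present))
        b = rng.randrange(len(absent))
        (i, j), (k, l) = present[a], absent[b]
        lost = len(adj[i] & adj[j])      # triangles destroyed by removing (i, j)
        adj[i].discard(j)
        adj[j].discard(i)
        gained = len(adj[k] & adj[l])    # triangles created by adding (k, l)
        log_ratio = theta_triangle * (gained - lost)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            adj[k].add(l)
            adj[l].add(k)
            present[a], absent[b] = (k, l), (i, j)
        else:
            adj[i].add(j)                # reject: restore the removed tie
            adj[j].add(i)
    return adj


# 60 nodes and 88 ties (density about 0.05), with and without a triangle propensity.
flat = simulate_fixed_density(60, 88, theta_triangle=0.0, steps=20000)
clustered = simulate_fixed_density(60, 88, theta_triangle=2.0, steps=20000)
```

Comparing `count_triangles(flat)` with `count_triangles(clustered)` shows how raising a single propensity changes the structural signature while density stays constant.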

The inference problem

Given an observed network, can we infer the propensities for various configurations in the model that may have generated it, and quantify their uncertainty?

We use Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE; Snijders, 2002), implemented in PNet (Wang, Robins & Pattison): http://www.sna.unimelb.edu.au/pnet/pnet.html

This uses the Polyak-Ruppert variant of the Robbins-Monro procedure to solve for $\theta$ in the moment equation $E_\theta\{z(Y)\} = z(y)$

See also statnet: http://csde.washington.edu/statnet (Handcock et al, 2003)
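The idea of solving the moment equation by stochastic approximation can be sketched on a toy case. The example below is not the PNet algorithm: it assumes an edge-only (Bernoulli) model, for which graphs can be simulated exactly without MCMC and the solution is known to be the log-odds of the observed density. It also scales each step by the model variance of the edge count at the current parameter value, which is a simplification available in closed form only for this toy model. All names are hypothetical.

```python
import math
import random


def robbins_monro_edge(observed_edges, n_dyads, iterations=1000, seed=1):
    """Solve E_theta{edge count} = observed edge count by Robbins-Monro updates,
    then average the later iterates (Polyak-Ruppert averaging)."""
    rng = random.Random(seed)
    theta = 0.0
    iterates = []
    for t in range(1, iterations + 1):
        p = 1.0 / (1.0 + math.exp(-theta))
        # Simulate the statistic z(Y) under the current parameter value.
        simulated = sum(rng.random() < p for _ in range(n_dyads))
        # Scaling: variance of the edge count under the current model (assumption).
        scale = max(n_dyads * p * (1.0 - p), 1.0)
        gain = 1.0 / t ** 0.75  # slowly decreasing gain sequence
        theta -= gain * (simulated - observed_edges) / scale
        iterates.append(theta)
    tail = iterates[len(iterates) // 2:]
    return sum(tail) / len(tail)


# 60 nodes give 1770 dyads; 88 observed ties as in the simulation slides.
estimate = robbins_monro_edge(observed_edges=88, n_dyads=1770)
target = math.log(88 / (1770 - 88))  # closed-form solution for this toy model
```

In a model with dependence between ties, the exact simulation step would be replaced by MCMC draws of the network, and the scalar scale by a matrix estimated from simulated statistics.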


Example 1: Co-authorship in the journal Social Networks (Sunbelt XXVIII)

The official version (Sunbelt XXVIII)

Parameter estimates

Parameter                  Estimate     Standard error
Edge                       -8.736936    0.378
Isolates                   -1.954926    0.253
Alt-star (λ = 2.00)         0.083514    0.148
Alt-triangle (λ = 2.00)     3.186715    0.100
Alt-2-path (λ = 2.00)      -0.043458    0.027

Statistic       Observed    Mean for model
edge            568         567.3
isolates        120.0       112.0
alt-star        1018.6      1017.4
alt-triangle    631.2       631.9
alt-2-path      1333.2      1327.1

A graph sampled from fitted probability distribution

Heuristic goodness of fit: degree statistics

The t statistic locates the observed value of each statistic in the distribution of statistics associated with the ERGM simulated using the model parameters: if |t| ≤ 2, the observed statistic is within the envelope expected by the model.

Statistic              Observed    Simulated mean (sd)    t
# 2-stars              1921        1789.0 (128.66)        1.03
# 3-stars              3401        2808.3 (555.8)         1.07
Std Dev degree dist    2.187       2.108 (0.102)          0.77
Skew degree dist       2.071       1.775 (0.281)          1.06
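The heuristic t statistic is a simple standardised difference. The short Python sketch below recomputes two values reported on these slides; the function names are hypothetical, and the threshold of 2 is passed as a parameter rather than fixed.

```python
def t_ratio(observed, simulated_mean, simulated_sd):
    """Standardised distance of the observed statistic from the simulated mean."""
    return (observed - simulated_mean) / simulated_sd


def within_envelope(t, threshold=2.0):
    """True when the observed statistic lies within the heuristic envelope."""
    return abs(t) <= threshold


# Values taken from the co-authorship example.
t_two_stars = t_ratio(1921, 1789.0, 128.66)    # about 1.03
t_four_cliques = t_ratio(294, 90.6, 13.5)      # about 15.07
```

Here `within_envelope(t_two_stars)` is true while `within_envelope(t_four_cliques)` is false, matching the pattern that degree statistics are well reproduced but higher-order closure is not.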

Heuristic goodness of fit: Path-based measures

Statistic     Observed    Simulated mean (sd)    t
# 2-paths     1921        1789.0 (128.66)        1.03
# 3-paths     8728        7671.9 (1295.0)        0.82

Geodesic distribution

Quartile    Median for sampled graphs    Observed
First       553                          553
Second      553                          553
Third       553                          553

Heuristic goodness of fit: Closure measures

Statistic                    Observed    Simulated mean (sd)    t
# 1-triangles                389         303.5 (18.8)           4.56
# 4-cycles                   1033        520.3 (86.2)           5.95
# (1,1)-coathangers          5189        3372.1 (565.0)         3.22
# (1,2)-coathangers          15275       8191.6 (2731.9)        2.59
# cliques of size 4          294         90.6 (13.5)            15.09
Global Clustering            0.607       0.510 (0.02)           4.65
Mean Local Clustering        0.396       0.309 (0.02)           3.92
Variance Local Clustering    0.217       0.168 (0.01)           6.62

Example 2: Kapferer “sociational” ties (time 2)

Model 1

Effect               Parameter    Std Err
Edge                  0.2238      2.07641
K-Star (λ = 2.00)    -0.8892      0.55314
AKT-T (λ = 2.00)      1.2592      0.26601
A2P-T (λ = 2.00)     -0.1545      0.02705

Model 1: goodness of fit

Effect                       Observed    Mean       Std dev    t-ratio
# 2-stars                    2904        2668.6     767.3      0.31
# 3-stars                    13752       10890.5    4149.0     0.69
# 1-triangles                451         348.1      107.8      0.95
# 2-triangles                4617        2423.8     1069.5     2.05
# bow-ties                   28853       14366.1    7675.1     1.89
# 3-paths                    56721       51831.7    12677.6    0.39
# 4-cycles                   3880        2812.5     1262.0     0.85
# (1,1)-coathangers          18463       12554.4    5150.0     1.15
# cliques of size 4          448         164.1      75.6       3.75
# cliques of size 5          234         23.1       17.1       12.3
Std Dev degree dist          5.44        3.92       0.43       3.50
Skew degree dist             0.39        -0.26      0.386      1.70
Global Clustering            0.47        0.39       0.017      4.22
Mean Local Clustering        0.45        0.42       0.026      3.11
Variance Local Clustering    0.03        0.02       0.014      1.01

Model 2

Effect                  Parameter    Std Err
Edge                    -0.3017      2.83532
1-triangle               0.7466      0.20502
2-triangle              -0.1548      0.05167
3-path                  -0.0167      0.00572
4-cycle                  0.0713      0.03052
(1,1)-coathanger         0.0333      0.01131
Clique of size 4         0.4025      0.17083
Alt-Star (λ = 2.00)     -0.1969      0.81080

Model 2: goodness-of-fit

Effect                        Observed    Mean       Std dev    t-ratio
# 2-stars                     2904        2845.2     294.6      0.20
# 3-stars                     13752       13171.8    2057.3     0.28
# bow-ties                    28853       27164.9    7767.5     0.22
# cliques of size 5           234         208.4      116.5      0.22
# alt-triangles (2.00)        406.4       400.6      29.2       0.20
# alt-indpt. 2-path (2.00)    1138.2      1115.9     61.7       0.36
Std Dev degree dist           5.44        5.264      0.420      0.41
Skew degree dist              0.395       0.306      0.386      0.23
Global Clustering             0.466       0.458      0.026      0.30
Mean Local Clustering         0.498       0.465      0.032      1.01
Variance Local Clustering     0.031       0.023      0.01       0.87

What have we learnt about network topology?

“Social circuit” models appear to reflect social processes underlying network formation better than simple Markovian neighbourhoods, having a “capacity for actors to transform as well as reproduce long-standing structures, frameworks and networks of interaction” (Emirbayer & Goodwin, 1994)

Hypotheses about relationships among the values of related parameters can provide a practical and effective means of incorporating important higher-order configurations

We may often need to add terms for cliques of size greater than 3

It may sometimes be necessary to go beyond the social circuit model

[And network effects do depend on actor and relational attributes, and are often mutually dependent across multiple and multi-mode networks]

2. Building network models from snowball samples

Moreno’s network dream

“If we ever get to the point of charting a whole city or a whole nation, we would have … a picture of a vast solar system of intangible structures, powerfully influencing conduct, as gravitation does in space. Such an invisible structure underlies society, and has its influence in determining the conduct of society as a whole.”

J. L. Moreno, New York Times, April 13, 1933

(via James Moody)

The problem: Estimating models for large networks from sampled data

Many networks of interest, including community-level networks and biological networks, are very large, and observing a complete network can be costly and difficult

We consider the problem of estimating the model

$$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\Big\{\sum_p \theta_p z_p(y)\Big\}$$

using data from snowball sampling designs, assuming, for the moment, a model with social circuit dependence assumptions

Handcock & Gile (2007) and Koskinen et al (2008) consider the same problem as a missing data problem

Daraganova et al (2008): A partial network among Brimbank respondents from a snowball sampling design

(yellow = wave 1, green = wave 2, red = wave 3)

Snowball sampling designs

Multi-wave snowball sampling:

We observe ties of:

Wave 0: Nodes in $Z_0$

Wave 1: Nodes in $Z_1$ but not $Z_0$

Wave 2: Nodes in $Z_2$ but not $Z_1$

$y_k$: network on $Z_k \setminus Z_{k-1}$

$y_{kl}$: ties from $Z_k$ to $Z_l \setminus Z_{l-1}$
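The wave structure of a snowball design can be illustrated with a short Python sketch. It assumes the full network is available as a dictionary mapping each node to its set of neighbours, which is of course only true in a simulation; the function names and the small example graph are hypothetical.

```python
def snowball_waves(adjacency, seeds, n_waves):
    """Return [wave_0, wave_1, ...], where wave_k is Z_k minus Z_{k-1}."""
    reached = set(seeds)
    waves = [set(seeds)]
    for _ in range(n_waves):
        frontier = set()
        for u in waves[-1]:
            frontier |= adjacency[u] - reached  # newly nominated nodes only
        waves.append(frontier)
        reached |= frontier
    return waves


def ties_between(adjacency, group_a, group_b):
    """Undirected ties with one end in group_a and the other in group_b.

    With group_a equal to group_b this gives the network within a wave.
    """
    return {
        (min(u, v), max(u, v))
        for u in group_a
        for v in adjacency[u]
        if v in group_b
    }


# Small illustrative network with a single seed node.
network = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
waves = snowball_waves(network, seeds={0}, n_waves=2)
ties_0_1 = ties_between(network, waves[0], waves[1])
ties_1_1 = ties_between(network, waves[1], waves[1])
ties_1_2 = ties_between(network, waves[1], waves[2])
```

In this example the waves are {0}, {1, 2} and {3}; nodes 4 and 5 lie outside the sampled zones and their ties are never observed.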

Conditional estimation strategy

We make the social circuit assumption and, following Besag (1974), a positivity assumption: $\Pr(Y_0 = 0 \mid \text{rest}) > 0$.

We show that:

$$\log \frac{\Pr(Y_0 = y_0 \mid \text{rest})}{\Pr(Y_0 = 0 \mid \text{rest})} = \sum_p \theta_p \left[ z_p(y_{0+1}) - z_p(y_{0+1}^{0}) \right]$$

where $y^{0}$ is equal to y but with all entries in $y_0$ set to 0.

Defining $1/c = \Pr(Y_0 = 0 \mid \text{rest})$ yields:

$$\Pr(Y_0 = y_0 \mid \text{rest}) = \frac{1}{c} \exp\Big( \sum_p \theta_p \left[ z_p(y_{0+1}) - z_p(y_{0+1}^{0}) \right] \Big)$$

and hence the capacity to use observed data on $y_{0+1}$ to obtain conditional MLEs of $\theta = [\theta_p]$.
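One step behind this result can be written out explicitly. The sketch below writes $\kappa(\theta)$ for the normalizing quantity and $\theta_p$ for the model parameters, and shows only why the normalizing quantity drops out of the conditional log-odds; the further reduction from the full network y to the observed part $y_{0+1}$ relies on the social circuit assumption and is the result stated on the slide, not derived here.

```latex
\begin{align*}
\frac{\Pr(Y_0 = y_0 \mid \text{rest})}{\Pr(Y_0 = 0 \mid \text{rest})}
  &= \frac{P(Y = y)}{P(Y = y^{0})}
     && \text{(both conditionals share the same marginal for the rest)} \\
  &= \frac{\kappa(\theta)^{-1} \exp\{\sum_p \theta_p z_p(y)\}}
          {\kappa(\theta)^{-1} \exp\{\sum_p \theta_p z_p(y^{0})\}} \\
  &= \exp\Big\{ \sum_p \theta_p \big[ z_p(y) - z_p(y^{0}) \big] \Big\}
\end{align*}
```

Only configurations that involve at least one tie variable in $Y_0$ contribute to the difference $z_p(y) - z_p(y^{0})$, since all other configurations are counted identically in y and $y^{0}$.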

3-wave snowball sample

For a 3-wave sample and positivity assumption $\Pr(Y_0 = 0, Y_1 = 0 \mid \text{rest}) > 0$, we obtain:

$$\Pr(Y_0 = y_0, Y_1 = y_1 \mid \text{rest}) = \frac{1}{c} \exp\Big( \sum_p \theta_p \left[ z_p(y_{0+1+2}) - z_p(y_{0+1+2}^{0,1}) \right] \Big)$$

where $y^{0,1}$ is equal to y but with all entries in $y_0$ and $y_1$ set to 0.

And hence we can use observed data on $y_{0+1+2}$ to obtain conditional MLEs of $\theta$.

MCMCMLEs from single networks sampled from the random graph distribution with known parameters (-4, .2, -.2, 1), n = 150

[Figure panels: edge, alt-star, alt-2-path, alt-triangle; the true value is marked in each panel]

MCMCMLEs from $y_0$, conditional on $y_{01}$, $y_1$ (size of $Z_0$ = 10)

[Figure panels: edge, alt-star]


Conditional MCMCMLEs from $y_1$, conditional on $y_0$, $y_{01}$, $y_{12}$, $y_2$ and assigning isolated nodes and dyads in $Z_0$ to $Z_1$ (size of $Z_0$ = 10)

[Figure panels: edge, alt-star]


What if we ignore the sampling design and use “available cases”?

MCMCMLEs from network on $Z_0 + Z_1 + Z_2$

[Figure panels: edge, alt-star]


Simulation study

Using data on $y_{0+1}$

For a fixed model:

Effect          Parameter
Edge            -4.0
Alt-star         0.2
Alt-triangle     1.0
Alt-2-path      -0.2

Size of node set (sizes of random seed sets):

150     (15, 30, 50, 100, 150)
500     (30, 50, 70, 100, 200)
1000    (30, 50, 70, 100, 200)

Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 150)

Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 500)

Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 1000)

Simulation study: summary findings

• RMSE and bias decline as seed set size increases for fixed n

• For a sufficiently large seed set size, bias is small for each n

• For given n and seed set size, bias is greatest for edge and alt-star effects

• For given seed set size, bias is greater as n increases, although the effect is less pronounced for alt-triangle and alt-2-path effects
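For reference, bias and RMSE of a set of replicate estimates against a known true value can be computed as in the sketch below. The replicate estimates shown are made-up numbers for illustration only and are not results from the simulation study; the function name is hypothetical.

```python
import math


def bias_and_rmse(estimates, true_value):
    """Return (bias, RMSE) of replicate estimates relative to the true value."""
    errors = [e - true_value for e in estimates]
    bias = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err ** 2 for err in errors) / len(errors))
    return bias, rmse


# Hypothetical replicate estimates of the edge parameter (true value -4.0).
bias, rmse = bias_and_rmse([-4.3, -3.9, -4.4, -4.2], true_value=-4.0)
```

RMSE combines bias and sampling variability, so it can decline with seed set size even when bias alone changes little.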


Example 1: a close tie network in Brimbank in suburban Melbourne, a community of 35000 people (Daraganova, 2008).

Location of respondents: wave 1 = yellow, wave 2 = green, wave 3 = blue, other = grey

Conditional MLEs for 8 models based on respondent ties (Daraganova, 2008)

Entries are estimate (standard error); a blank cell means the effect is not in the model.

Model   Edge              Alt-stars       Alt-triangles    Alt-2-paths      Spatial covariate
1       -4.86* (0.20)
2       3.80* (1.37)                                                        -1.05* (0.17)
3       -3.95* (0.74)     -0.107 (0.9)
4       4.71* (-0.11)     -0.11 (0.08)                                      -1.04* (0.17)
5       -6.52* (0.918)                    2.55* (0.297)    -0.2* (0.093)
6       -1.18 (2.49)                      2.46* (0.34)     -0.19* (0.09)    -0.65* (0.298)
7       -13.12* (2.84)    1.92* (0.82)    -0.24* (0.33)    -0.24* (0.09)
8       -7.21* (2.79)     1.58 (0.85)     2.41* (0.34)     -0.24* (0.09)    -0.58 (0.31)

Heuristic evaluation of fit

t-ratio = (observation - sample mean) / standard deviation

Statistic                      Model 1   2        3        4        5       6       7       8
2-star                         1.59      1.36     2.15     1.93     0.51    0.24    0.06    0.01
3-star                         3.28      2.45     5.25     4.43     0.66    0.30    0.09    0.03
triangle                       34.73     19.37    29.03    16.59    1.16    0.86    0.60    0.53
3-path                         2.64      2.04     3.88     3.32     0.76    0.35    0.03    -0.01
4-cycle                        20.92     15.02    22.58    12.5     1.74    1.17    0.63    0.73
Coathanger                     16.63     12.28    18.76    12.12    1.50    0.98    0.51    0.60
4-clique                       15.42     12.36    18.18    9.77     2.66    2.01    1.62    1.97
alt-star                       0.29      0.25     0.27     0.27     0.13    -0.03   0.06    -0.11
alt-triangle                   32.22     23.34    29.98    21.70    0.02    -0.05   -0.02   -0.11
alt-2-path                     -0.36     -0.43    -0.08    -0.05    0.02    -0.11   -0.12   -0.12
Spatial covariate              -0.50     0.04     -0.51    -0.01    -0.13   -0.09   -0.22   -0.12
Std Dev degree distribution    3.49      2.69     5.31     4.65     0.89    0.51    0.16    0.12
Skew degree distribution       3.17      2.57     5.23     4.35     0.33    0.11    0.11    0.01
Global Clustering              5.39      5.09     5.77     5.23     1.22    1.25    1.21    1.26
Mean Local Clustering          13.46     13.72    14.11    13.76    -0.01   0.05    0.51    0.49
Variance Local Clustering      -2.64     -2.51    -2.39    -2.31    1.82    1.54    1.01    0.86

Example 2: Protein-protein interaction network (yellow = initial wave, green = wave 1, red = wave 2, blue = wave 3)

4763 nodes, 17274 ties in modelled zones

Illustrative conditional MLEs from random seed sets of different sizes

Parameter estimates (s.e)

|Z0| Edge Alt-star Alt-triangle Alt-2-path

100 -8.00 (2.30) 0.34 (0.67) 1.07 (0.198) 0.0027 (0.017)

200 -5.04 (1.17) -0.78 (0.38) 0.87 (0.156) 0.043 (0.013)

300 -12.7 (1.56) 1.57 (0.45) 0.86 (0.082) -0.0017 (0.013)

400 -10.63 (0.81) 1.05 (0.23) 0.60 (.058) 0.011 (.0008)

500 -9.79 (0.56) 0.85 (0.16) 0.71 (.042) 0.012 (.0010)

600 -11.8 (0.65) 1.30 (0.18) 0.81 (.040) 0.016 (.0019)

…

1000 -11.0 (0.31) 1.11 (0.09) 0.93 (.025) 0.007 (.0004)

Heuristic goodness of fit for conditional MLEs from 1000-node seed set

Graph statistic              t
Std Dev degree dist          1.24
Skew degree dist             -1.94
Global Clustering            3.03
Mean Local Clustering        1.40
Variance Local Clustering    1.97

Count statistics (t): 0.9003, -0.7267, 2.5194, -1.5980, 3.0425, 1.5223

Preliminary conclusions

Conditional MLEs can be useful for very large networks, both where missing data approaches are computationally infeasible, and to speed convergence

The approach can be used to test homogeneity assumptions

Extension to other network sampling designs

A well-specified model is essential, and hence so are principled approaches to assessment of fit

3. Where next?

Some next steps

• Enhanced model specifications
    – multi-relational
    – multi-mode (including multiple levels)
    – incorporating heterogeneity

• Better design-modelling links, and careful investment in data collection

• Models for both short- and long-term dynamics: relationship and interaction structures are not the same!

• Computational enhancements

Thank you
