Opportunities and Challenges in Uncertainty Quantification in Complex Interacting Systems. Exponential random graph models for social networks and their role in uncertainty quantification. Pip Pattison, University of Melbourne. University of Southern California, April 13-14, 2009.
Opportunities and Challenges in Uncertainty Quantification in Complex Interacting Systems
Pip Pattison, University of Melbourne
University of Southern California, April 13-14, 2009
Exponential random graph models for social networks and their role in uncertainty quantification
Joint work with
University of Melbourne: Garry Robins, Peng Wang, Galina Daraganova, Johan Koskinen, Dean Lusher
Oxford University/Groningen University: Tom Snijders
Building on work with: Mark Handcock (University of Washington), Dave Hunter (Penn State University), Martina Morris (University of Washington), Steve Goodreau (University of Washington)
and earlier work by: Frank, Strauss, Wasserman …
Main argument
• Many social processes (e.g., diffusion of information, diseases) depend on social interactions and social relationships, and different types of networks enable and constrain social processes in distinctive ways
• We need therefore to understand/model both the structure and dynamics of different types of networks and social interaction and the nature of network-mediated social processes
• Appropriate quantification of uncertainty requires good models and hence advances in:
  – Measurement and design
  – Understanding of relevant social processes
  – Careful model development and empirical testing
• The exponential random graph modelling approach provides a coherent framework for model development and uncertainty quantification
Outline
1. Exponential random graph models for social networks
2. Models for large networks from partial network data
3. Where next?
1. Exponential random graph models for social networks
Models for social networks: the problem
To develop a statistical framework for modelling networks that appropriately represents what we know/propose about network tie formation processes:
Exogenous effects: shared characteristics, interests and affiliations, and spatial propinquity all matter
Endogenous network effects: e.g., clustering, comparison and attachment processes
To do so in a way that affords a consistent approach to the analysis of various forms of longitudinal and cross-sectional social relational data
Network variables
We assume a fixed set of actors.
We consider a system of tie variables $Y = [Y_{ij}]$, one for each pair of actors $i$ and $j$:
$$Y_{ij} = \begin{cases} 1 & \text{if } i \text{ has a tie to } j \\ 0 & \text{otherwise} \end{cases}$$
A realisation of $Y$ is denoted by $y = [y_{ij}]$.
Note that:
• We consider the system of all tie variables at once
• The variables are associated with relational ties between actors
• We do not assume that the variables are independent
Local interactivity and dependence
Local interactions: define two network tie variables to be neighbours if they are conditionally dependent, given the values of all other tie variables
A neighbourhood is a set of mutually neighbouring variables and corresponds to a potential network configuration:
e.g., $\{Y_{12}, Y_{13}, Y_{23}\}$ corresponds to a triangle on nodes 1, 2 and 3
Dependence structure: a hypothesis about which ties are neighbours
A useful dependence structure: the social circuit model (Snijders, Pattison, Robins & Handcock, 2006)
Assume that two tie variables are neighbours if:
• they share an actor
• their presence would create a 4-cycle, given that the other two ties (red in the figure) are already present
A 4-cycle is a closed structure that can sustain mutual social monitoring and influence, as well as levels of trustworthiness within which obligations and expectations might proliferate (e.g., Coleman, 1988; Bearman et al, 2005)
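Hammersley-Clifford theorem (Besag, 1974) applied to networks (Frank & Strauss, 1986)
$$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\Big[\sum_Q \theta_Q z_Q(y)\Big]$$
where:
• $\kappa(\theta) = \sum_y \exp\big[\sum_Q \theta_Q z_Q(y)\big]$ is the normalizing quantity
• $\theta_Q$ is the parameter for neighbourhood $Q$
• $z_Q(y) = \prod_{Y_{ij} \in Q} y_{ij}$ is the network statistic, which signifies whether all ties in $Q$ are observed in $y$
• the summation is over all neighbourhoods $Q$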
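To make the role of the normalizing quantity concrete, the following is a minimal sketch that evaluates the model probability by brute force on a very small nondirected graph. It assumes a homogeneous model with just two statistics (number of edges and number of triangles), so the sum over neighbourhoods collapses to one parameter per statistic; the function names `stats` and `ergm_probability` are illustrative and not part of any package mentioned in the slides.

```python
from itertools import combinations, product
from math import exp

def stats(edges):
    """Return (number of edges, number of triangles) for a nondirected graph."""
    edge_set = {frozenset(e) for e in edges}
    nodes = sorted({v for e in edge_set for v in e})
    triangles = sum(
        1
        for a, b, c in combinations(nodes, 3)
        if {frozenset((a, b)), frozenset((a, c)), frozenset((b, c))} <= edge_set
    )
    return (len(edge_set), triangles)

def ergm_probability(edges, n, theta):
    """P(Y = y) for an edge-triangle model on n nodes.

    The normalizing quantity is computed by enumerating all 2^(n choose 2)
    graphs, which is only feasible for very small n.
    """
    dyads = list(combinations(range(n), 2))

    def weight(es):
        return exp(sum(t * z for t, z in zip(theta, stats(es))))

    kappa = sum(
        weight([d for d, bit in zip(dyads, bits) if bit])
        for bits in product((0, 1), repeat=len(dyads))
    )
    return weight(edges) / kappa

# With all parameters zero, every graph on 4 nodes is equally likely (1/64).
print(ergm_probability([(0, 1)], 4, (0.0, 0.0)))
```

The enumeration grows as 2 to the power of the number of dyads, which is why estimation for realistic networks relies on simulation rather than on evaluating the normalizing quantity directly.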
This means…
We model the probability of a network in terms of propensities for configurations of certain types to occur:
Edges
Stars (k-stars)
Multiple paths (k-2-paths)
Multiple triangles (k-triangles)
parameters
Equality and other constraints on model parameters
1. Assume (in the first instance) that isomorphic configurations have equal parameters, so there is one parameter for each class of network configurations
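2. It may sometimes also be convenient to assume a relationship between related parameters, e.g.:
$$\theta_k = -\theta_{k-1}/\lambda, \quad \text{for } k > 2 \text{ and } \lambda \geq 1 \text{ a (fixed) constant}$$
Star configurations …
Parameters: $\theta_2, \theta_3, \theta_4, \ldots$
We then obtain a single parameter ($\theta_2$) for the family of star configurations with statistic:
$$S^{[\lambda]}(y) = \sum_k (-1)^k \frac{S_k(y)}{\lambda^{k-2}} \quad \text{(alternating star statistic)}$$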
3. Likewise for k-2-paths and k-triangles …
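As a small illustration of the alternating star statistic, the sketch below computes k-star counts from a degree sequence (a node of degree d sits at the centre of "d choose k" k-stars) and combines them with alternating signs and geometrically decreasing weights. The function names and the default value of the constant (2.0, the value used in the fitted models later in the talk) are choices made for this example.

```python
from math import comb

def k_star_count(degrees, k):
    """S_k(y): number of k-stars, summing C(degree, k) over nodes."""
    return sum(comb(d, k) for d in degrees)

def alternating_star(degrees, lam=2.0):
    """Alternating star statistic: sum over k >= 2 of (-1)^k S_k(y) / lam^(k-2)."""
    max_degree = max(degrees, default=0)
    return sum(
        (-1) ** k * k_star_count(degrees, k) / lam ** (k - 2)
        for k in range(2, max_degree + 1)
    )

# A star with three leaves has degrees (3, 1, 1, 1): three 2-stars, one 3-star.
print(alternating_star([3, 1, 1, 1]))  # 3 - 1/2 = 2.5
```

Because higher-order stars enter with smaller and smaller weights, a single parameter can summarise the whole degree distribution without letting very high-degree nodes dominate.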
Exponential random graph models for networks
We can then model the probability of a network in terms of propensities for certain families of network configurations to occur: edge, alt-star, alt-2-path, alt-triangle
The zone of order k of a set A of nodes
Define $Z_k(A)$, the zone of order $k$ of $A$ in the network $y$, to be the set of nodes within $k$ steps of some node in $A$.
For the two node set comprising marked vertices:
Zones of order 0 to 2
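The zone of order k can be computed with a breadth-first search that stops expanding once it is k steps from the starting set. The sketch below assumes the network is stored as a dictionary mapping each node to the set of its neighbours; the function name `zone` and that representation are choices made for this example.

```python
from collections import deque

def zone(adj, nodes, k):
    """Return Z_k(nodes): all nodes within k steps of some node in `nodes`."""
    dist = {a: 0 for a in nodes}
    queue = deque(nodes)
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue  # do not expand beyond k steps
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)

# Path graph 0 - 1 - 2 - 3 - 4, starting from the two-node set {0, 1}.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
for order in range(3):
    print(order, sorted(zone(path, [0, 1], order)))
```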
Three generations of models (so far) for nondirected graphs
Tie variables $Y_{ij}$ and $Y_{kl}$ are conditionally independent unless:
• Bernoulli (Erdös-Rényi): $(i,j) = (k,l)$, i.e. $Z_0(\{i,j\}) = Z_0(\{k,l\})$ (same zone of order 0)
• Markov: $Z_0(\{i,j\}) \cap Z_0(\{k,l\}) \neq \emptyset$ (overlapping zones of order 0)
• Realisation-dependent (social circuit): $Z_1(\{i,j\}) \supseteq Z_0(\{k,l\})$ and $Z_1(\{k,l\}) \supseteq Z_0(\{i,j\})$ (zero-order zones for $\{i,j\}$ and $\{k,l\}$ jointly embedded in first-order zones for $\{k,l\}$ and $\{i,j\}$)
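The three generations of dependence can be checked mechanically for any pair of dyads. The sketch below assumes the conditions are read as equality of zero-order zones (Bernoulli), overlap of zero-order zones (Markov), and mutual containment of each dyad in the other's first-order zone (social circuit); the function names and the adjacency-dictionary representation are illustrative.

```python
def first_order_zone(adj, dyad):
    """Z_1 of a dyad: its two nodes plus all of their network neighbours."""
    members = set(dyad)
    for node in dyad:
        members |= adj.get(node, set())
    return members

def dependence_conditions(adj, dyad_a, dyad_b):
    """Report which dependence conditions hold for two dyads in network adj."""
    z0_a, z0_b = set(dyad_a), set(dyad_b)
    z1_a = first_order_zone(adj, dyad_a)
    z1_b = first_order_zone(adj, dyad_b)
    return {
        "bernoulli": z0_a == z0_b,
        "markov": bool(z0_a & z0_b),
        "social_circuit": z0_b <= z1_a and z0_a <= z1_b,
    }

# Ties 0-2 and 1-3 are present, so adding 0-1 and 2-3 would close a 4-cycle.
adj = {0: {2}, 1: {3}, 2: {0}, 3: {1}}
print(dependence_conditions(adj, (0, 1), (2, 3)))
```

Note that the social circuit condition depends on the realised network: the same two dyads are not flagged in an empty graph.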
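Bernoulli: $Z_0\{i,j\} = Z_0\{k,l\}$
Markov: $Z_0\{i,j\} \cap Z_0\{k,l\} \neq \emptyset$
Social circuit: $Z_1\{i,j\} \supseteq Z_0\{k,l\}$ and $Z_1\{k,l\} \supseteq Z_0\{i,j\}$
Degree/closure interaction: $Z_1\{i,j\} \supseteq Z_0\{k,l\}$ or $Z_1\{k,l\} \supseteq Z_0\{i,j\}$
Three-path: $Z_1\{i,j\} \cap Z_0\{k,l\} \neq \emptyset$, $Z_1\{k,l\} \cap Z_0\{i,j\} \neq \emptyset$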
Figure: the (m,h)-coat-hanger configuration, with node groups labelled "m nodes" and "h nodes"
Hierarchy of models
A valuable tool for understanding a model (and hence for understanding how the various configuration propensities combine to give networks a particular structural signature)
Increasing triangulation
60 nodes: Fix the density at 0.05 (i.e. 88 or 89 edges)
Varying the propensity for alternating-triangles
The movie shows one representative graph from each simulated distribution
Simulation
From centralization to segmentation
60 nodes: Fix the density at 0.05 (i.e. 88 or 89 edges)
Varying the propensity for alternating-stars
The movie shows one representative graph from each simulated distribution
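A simulation of this kind can be sketched with a Metropolis sampler that keeps the number of edges fixed by swapping one existing tie for one absent tie at each step. The version below is a simplification: it uses the plain triangle count in place of the alternating-triangle statistic, and the function names, step count and seed are assumptions made for the example rather than the settings behind the movies.

```python
import math
import random
from itertools import combinations

def count_triangles(adj):
    """Count triangles: each one is seen once from each of its three edges."""
    per_edge = sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v)
    return per_edge // 3

def simulate_fixed_density(n, n_edges, theta_triangle, steps, seed=0):
    """Metropolis sampler over graphs with exactly n_edges ties (n_edges >= 1)."""
    rng = random.Random(seed)
    dyads = list(combinations(range(n), 2))
    edge_list = rng.sample(dyads, n_edges)
    adj = {v: set() for v in range(n)}
    for u, v in edge_list:
        adj[u].add(v)
        adj[v].add(u)
    for _ in range(steps):
        idx = rng.randrange(n_edges)
        a, b = edge_list[idx]          # tie proposed for removal
        c, d = rng.choice(dyads)       # dyad proposed for addition
        if d in adj[c]:
            continue                   # already tied: keep the current graph
        adj[a].discard(b)
        adj[b].discard(a)
        lost = len(adj[a] & adj[b])    # triangles destroyed by removing a-b
        gained = len(adj[c] & adj[d])  # triangles created by adding c-d
        delta = theta_triangle * (gained - lost)
        if rng.random() < math.exp(min(0.0, delta)):
            adj[c].add(d)
            adj[d].add(c)
            edge_list[idx] = (c, d)
        else:
            adj[a].add(b)              # reject: restore the removed tie
            adj[b].add(a)
    return adj

# 60 nodes at density 0.05, as in the slides (88 edges), for two propensities.
for theta in (0.0, 1.0):
    graph = simulate_fixed_density(60, 88, theta, steps=20000, seed=1)
    print(theta, count_triangles(graph))
```

The swap proposal is symmetric (the reverse swap is proposed with the same probability), so the acceptance ratio only involves the change in the statistic.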
The inference problem
Given an observed network, can we infer the propensities for various configurations in the model that may have generated it, and quantify their uncertainty?
We use Markov chain Monte Carlo maximum likelihood estimation (MCMCMLE; Snijders, 2002), implemented in PNet (Wang, Robins & Pattison): http://www.sna.unimelb.edu.au/pnet/pnet.html
This uses the Polyak-Ruppert variant of the Robbins-Monro procedure to solve for $\theta$ in the moment equation $E_\theta\{z(Y)\} = z(y)$
See also statnet: http://csde.washington.edu/statnet (Handcock et al, 2003)
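The idea of solving the moment equation by stochastic approximation can be illustrated on a toy case. The sketch below is not the PNet algorithm: it assumes an edge-only model, for which graphs can be simulated directly with independent ties rather than by MCMC, and it uses a hypothetical gain sequence and scaling constant. Each step simulates a graph at the current parameter, moves the parameter against the difference between simulated and observed edge counts, and the final estimate averages the later iterates (the Polyak-Ruppert idea).

```python
import math
import random

def robbins_monro_edge_only(z_obs, n_dyads, steps=1000, burn_in=500, seed=0):
    """Solve E_theta{z(Y)} = z_obs for an edge-only model by Robbins-Monro
    updates with averaging of the iterates after burn_in."""
    rng = random.Random(seed)
    theta = 0.0
    running_sum = 0.0
    scale = n_dyads / 4.0  # upper bound on Var z(Y) under the edge-only model
    for t in range(1, steps + 1):
        p = 1.0 / (1.0 + math.exp(-theta))
        # Simulate the edge count of one graph drawn at the current theta.
        z_sim = sum(rng.random() < p for _ in range(n_dyads))
        theta -= t ** -0.6 * (z_sim - z_obs) / scale
        if t > burn_in:
            running_sum += theta
    return running_sum / (steps - burn_in)

# 20 nodes give 190 dyads; 38 observed edges correspond to density 0.2.
estimate = robbins_monro_edge_only(z_obs=38, n_dyads=190)
print(estimate, math.log(0.2 / 0.8))  # the exact MLE is the logit of the density
```

For the edge-only model the answer is known in closed form, which makes it a convenient check; for models with dependence, the simulation step would be replaced by an MCMC draw from the current model.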
Example 1: Co-authorship in the journal Social Networks (Sunbelt XXVIII)
The official version (Sunbelt XXVIII)
Parameter estimates
Parameter estimate standard error
Edge -8.736936 0.378
Isolates -1.954926 0.253
Alt-star ($\lambda$ = 2.00) 0.083514 0.148
Alt-triangle ($\lambda$ = 2.00) 3.186715 0.100
Alt-2-path ($\lambda$ = 2.00) -0.043458 0.027
Statistic: edge isolates alt-star alt-triangle alt-2-path
Observed statistics: 568 120.0 1018.6 631.2 1333.2
Mean statistics for model: 567.3 112.0 1017.4 631.9 1327.1
A graph sampled from fitted probability distribution
Heuristic goodness of fit: degree statistics
The t statistic locates the observed value of each statistic in the distribution of statistics associated with the ERGM simulated using the model parameters: if $|t| \leq 2$, the observed statistic is within the envelope expected by the model
statistic observed simulated mean (sd) t
# 2-stars: 1921 1789.0 (128.66) 1.03
# 3-stars: 3401 2808.3 (555.8) 1.07
# Std Dev degree dist: 2.187 2.108 (0.102) 0.77
# Skew degree dist: 2.071 1.775 (0.281) 1.06
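The t statistic in these tables is the observed value minus the simulated mean, divided by the simulated standard deviation. A minimal sketch follows; the function names are illustrative, and the second helper assumes the simulated statistics are available as a plain list of numbers.

```python
from statistics import mean, stdev

def t_ratio(observed, simulated_mean, simulated_sd):
    """(observed - simulated mean) / simulated standard deviation."""
    return (observed - simulated_mean) / simulated_sd

def t_ratio_from_draws(observed, simulated_values):
    """Same ratio, computed from a list of simulated statistics."""
    return t_ratio(observed, mean(simulated_values), stdev(simulated_values))

# 2-star row of the degree-statistics table: 1921 observed, 1789.0 (128.66).
print(round(t_ratio(1921, 1789.0, 128.66), 2))  # 1.03
```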
Heuristic goodness of fit: Path-based measures
statistic observed simulated mean (sd) t
# 2-paths: 1921 1789.0 (128.66) 1.03
# 3-paths: 8728 7671.9 (1295.0) 0.82
Geodesic distribution
Quartile Median for sampled graphs Observed
First 553 553
Second 553 553
Third 553 553
Heuristic goodness of fit: Closure measures
statistic observed simulated mean (sd) t
# 1-triangles: 389 303.5 (18.8) 4.56
# 4-cycles: 1033 520.3 (86.2) 5.95
# (1,1)-coathangers 5189 3372.1 (565.0) 3.22
# (1,2)-coathangers 15275 8191.6 (2731.9) 2.59
# cliques of size 4 294 90.6 (13.5) 15.09
Global Clustering: 0.607 0.510 (0.02) 4.65
Mean Local Clustering: 0.396 0.309 (0.02) 3.92
Variance Local Clustering: 0.217 0.168 (0.01) 6.62
Example 2: Kapferer “sociational” ties (time 2)
Model 1
Effect Parameter Std Err
Edge 0.2238 2.07641
K-Star ($\lambda$ = 2.00) -0.8892 0.55314
AKT-T ($\lambda$ = 2.00) 1.2592 0.26601
A2P-T ($\lambda$ = 2.00) -0.1545 0.02705
Model 1: goodness of fit
Effect observed mean stddev t-ratio
# 2-stars 2904 2668.6 767.3 0.31
# 3-stars 13752 10890.5 4149.0 0.69
# 1-triangles 451 348.1 107.8 0.95
# 2-triangles 4617 2423.8 1069.5 2.05
# bow-ties 28853 14366.1 7675.1 1.89
# 3-paths 56721 51831.7 12677.6 0.39
# 4-cycles 3880 2812.5 1262.0 0.85
# (1,1)-coathangers 18463 12554.4 5150.0 1.15
# cliques of size 4 448 164.1 75.6 3.75
# cliques of size 5 234 23.1 17.1 12.3
Std Dev degree dist 5.44 3.92 0.43 3.50
Skew degree dist 0.39 -0.26 0.386 1.70
Global Clustering 0.47 0.39 0.017 4.22
Mean Local Clustering 0.45 0.42 0.026 3.11
Variance Local Clustering 0.03 0.02 0.014 1.01
Model 2
Effect Parameter Std Err
edge -0.3017 2.83532
1-triangle 0.7466 0.20502
2-triangle -0.1548 0.05167
3-path -0.0167 0.00572
4-cycle 0.0713 0.03052
(1,1)-coathanger 0.0333 0.01131
clique of size 4 0.4025 0.17083
Alt-Star ($\lambda$ = 2.00) -0.1969 0.81080
Model 2: goodness-of-fit
Effect observed mean stddev t-ratio
# 2-stars 2904 2845.2 294.6 0.20
# 3-stars 13752 13171.8 2057.3 0.28
# bow-ties 28853 27164.9 7767.5 0.22
# cliques of size 5 234 208.4 116.5 0.22
# alt-triangles (2.00) 406.4 400.6 29.2 0.20
# alt-indpt. 2-path (2.00) 1138.2 1115.9 61.7 0.36
Std Dev degree dist 5.44 5.264 0.420 0.41
Skew degree dist 0.395 0.306 0.386 0.23
Global Clustering 0.466 0.458 0.026 0.30
Mean Local Clustering 0.498 0.465 0.032 1.01
Variance Local Clustering 0.031 0.023 0.01 0.87
What have we learnt about network topology?
“Social circuit” models appear to reflect social processes underlying network formation better than simple Markovian neighbourhoods, having a “capacity for actors to transform as well as reproduce long-standing structures, frameworks and networks of interaction” (Emirbayer & Goodwin, 1994)
Hypotheses about relationships among the values of related parameters can provide a practical and effective means of incorporating important higher-order configurations
We may often need to add terms for cliques of size greater than 3
It may sometimes be necessary to go beyond the social circuit model
[And network effects do depend on actor and relational attributes, and are often mutually dependent across multiple and multi-mode networks]
2. Building network models from snowball samples
Moreno’s network dream
“If we ever get to the point of charting a whole city or a whole nation, we would have … a picture of a vast solar system of intangible structures, powerfully influencing conduct, as gravitation does in space. Such an invisible structure underlies society, and has its influence in determining the conduct of society as a whole.”
J. L. Moreno, New York Times, April 13, 1933
(via James Moody)
The problem: Estimating models for large networks from sampled data
Many networks of interest, including community-level networks and biological networks, are very large, and observing a complete network can be costly and difficult
We consider the problem of estimating the model
$$P(Y = y) = \frac{1}{\kappa(\theta)} \exp\Big\{\sum_p \theta_p z_p(y)\Big\}$$
using data from snowball sampling designs, assuming, for the moment, a model with social circuit dependence assumptions
Handcock & Gile (2007) and Koskinen et al (2008) consider the same problem as a missing data problem
Daraganova et al (2008): A partial network among Brimbank respondents from a snowball sampling design
(yellow = wave 1, green = wave 2, red = wave 3)
Snowball sampling designs
Multi-wave snowball sampling:
We observe ties of:
Wave 0: Nodes in $Z_0$
Wave 1: Nodes in $Z_1$ but not $Z_0$
Wave 2: Nodes in $Z_2$ but not $Z_1$
$y_k$: network on $Z_k \setminus Z_{k-1}$
$y_{kl}$: ties from $Z_k$ to $Z_l \setminus Z_{l-1}$
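The wave structure of a snowball sample can be sketched as follows: starting from a seed set, each wave adds the nodes nominated by the previous zone, and the observed sub-networks are the ties within and between waves. The helper names and the adjacency-dictionary representation are assumptions made for this example, and the full network is taken as known here purely so that the sampling process can be simulated.

```python
def snowball_zones(adj, seeds, n_waves):
    """Return [Z_0, Z_1, ..., Z_n_waves] for a snowball sample from `seeds`."""
    zones = [set(seeds)]
    for _ in range(n_waves):
        previous = zones[-1]
        reached = {v for u in previous for v in adj.get(u, ())}
        zones.append(previous | reached)
    return zones

def waves_from_zones(zones):
    """Wave 0 is Z_0; wave k is the set of nodes in Z_k but not in Z_(k-1)."""
    return [zones[0]] + [zones[k] - zones[k - 1] for k in range(1, len(zones))]

def ties_between(adj, group_a, group_b):
    """Nondirected ties with one end in group_a and the other in group_b."""
    return {
        frozenset((u, v))
        for u in group_a
        for v in adj.get(u, ())
        if v in group_b
    }

# Path graph 0 - 1 - 2 - 3 - 4 with a single seed node.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
zones = snowball_zones(path, {0}, n_waves=2)
waves = waves_from_zones(zones)
print(waves)
print(ties_between(path, waves[0], waves[1]))
```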
Conditional estimation strategy
We make the social circuit assumption and, following Besag (1974), a positivity assumption: $\Pr(Y_0 = 0 \mid \text{rest}) > 0$.
We show that:
$$\log \frac{\Pr(Y_0 = y_0 \mid \text{rest})}{\Pr(Y_0 = 0 \mid \text{rest})} = \sum_p \theta_p \left[ z_p(y_{0+1}) - z_p(y_{0+1}^{0}) \right]$$
where $y^{0}$ is equal to $y$ but with all entries in $y_0$ set to 0
Defining $1/c = \Pr(Y_0 = 0 \mid \text{rest})$ yields:
$$\Pr(Y_0 = y_0 \mid \text{rest}) = \frac{1}{c} \exp\left( \sum_p \theta_p \left[ z_p(y_{0+1}) - z_p(y_{0+1}^{0}) \right] \right)$$
and hence the capacity to use observed data on $y_{0+1}$ to obtain conditional MLEs of $\theta = [\theta_p]$
3-wave snowball sample
For a 3-wave sample and positivity assumption $\Pr(Y_0 = 0, Y_1 = 0 \mid \text{rest}) > 0$, we obtain:
$$\Pr(Y_0 = y_0, Y_1 = y_1 \mid \text{rest}) = \frac{1}{c} \exp\left( \sum_p \theta_p \left[ z_p(y_{0+1+2}) - z_p(y_{0+1+2}^{0,1}) \right] \right)$$
where $y^{0,1}$ is equal to $y$ but with all entries in $y_0$ and $y_1$ set to 0
And hence we can use observed data on $y_{0+1+2}$ to obtain conditional MLEs of $\theta$
MCMCMLEs from single networks sampled from the random graph distribution with known parameters (-4, 0.2, -0.2, 1), n = 150
Panels: edge, alt-star, alt-2-path, alt-triangle (true value marked)
MCMCMLEs from $y_0$, conditional on $y_{01}$, $y_1$ (size of $Z_0$ = 10)
Panels: edge, alt-star, alt-2-path, alt-triangle
Conditional MCMCMLEs from $y_1$, conditional on $y_0$, $y_{01}$, $y_{12}$, $y_2$ and assigning isolated nodes and dyads in $Z_0$ to $Z_1$ (size of $Z_0$ = 10)
Panels: edge, alt-star, alt-2-path, alt-triangle
What if we ignore the sampling design and use “available cases”? MCMCMLEs from network on $Z_0 + Z_1 + Z_2$
Panels: edge, alt-star, alt-2-path, alt-triangle
Simulation study
Using data on $y_{0+1}$
For a fixed model:
Edge -4.0
Alt-star 0.2
Alt-triangle 1.0
Alt-2-path -0.2
Size of node set (sizes of random seed sets):
150 (15, 30, 50, 100, 150)
500 (30, 50, 70, 100, 200)
1000 (30, 50, 70, 100, 200)
Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 150)
Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 500)
Estimating network on Z0 given observed network on Z1 and ties between Z0 and Z1 (n = 1000)
Simulation study: summary findings
• RMSE and bias decline as seed set size increases for fixed n
• For a sufficiently large seed set size, bias is small for each n
• For given n and seed set size, bias is greatest for edge and alt-star effects
• For given seed set size, bias is greater as n increases, although the effect is less pronounced for alt-triangle and alt-2-path effects
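The two summary measures in these findings can be computed from a set of replicate estimates as in the sketch below. The function name and the replicate values are hypothetical and serve only to show the arithmetic; the true value of -4.0 is the edge parameter of the fixed model used in the simulation study.

```python
import math

def bias_and_rmse(estimates, true_value):
    """Bias is the mean estimate minus the true value; RMSE is the root of
    the mean squared deviation of the estimates from the true value."""
    n = len(estimates)
    bias = sum(estimates) / n - true_value
    rmse = math.sqrt(sum((e - true_value) ** 2 for e in estimates) / n)
    return bias, rmse

# Hypothetical replicate estimates of the edge parameter (true value -4.0).
replicates = [-4.5, -3.5, -4.25, -3.75]
print(bias_and_rmse(replicates, -4.0))
```

Bias captures systematic over- or under-estimation, while RMSE also reflects the spread of the estimates, so an estimator can have zero bias and still a large RMSE.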
Example 1: a close tie network in Brimbank in suburban Melbourne, a community of 35000 people (Daraganova, 2008).
Location of respondents: wave 1 = yellow, wave 2 = green, wave 3 = blue, other = grey
Conditional MLEs for 6 models based on respondent ties (Daraganova, 2008)
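Effects: edge, alt-stars, alt-triangles, alt-2-paths, spatial covariate. Entries are estimate (standard error), listed in that order for the effects included in each model.
Model 1: -4.86* (0.20)
Model 2: 3.80* (1.37), -1.05* (0.17)
Model 3: -3.95* (0.74), -0.107 (0.9)
Model 4: 4.71* (-0.11), -0.11 (0.08), -1.04* (0.17)
Model 5: -6.52* (0.918), 2.55* (0.297), -0.2* (0.093)
Model 6: -1.18 (2.49), 2.46* (0.34), -0.19* (0.09), -0.65* (0.298)
Model 7: -13.12* (2.84), 1.92* (0.82), -0.24* (0.33), -0.24* (0.09)
Model 8: -7.21* (2.79), 1.58 (0.85), 2.41* (0.34), -0.24* (0.09), -0.58 (0.31)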
Heuristic evaluation of fit
t-ratio = (observation - sample mean) / standard deviation
Model 1 2 3 4 5 6 7 8
2-star 1.59 1.36 2.15 1.93 0.51 0.24 0.06 0.01
3-star 3.28 2.45 5.25 4.43 0.66 0.30 0.09 0.03
triangle 34.73 19.37 29.03 16.59 1.16 0.86 0.60 0.53
3-path 2.64 2.04 3.88 3.32 0.76 0.35 0.03 -0.01
4-cycle 20.92 15.02 22.58 12.5 1.74 1.17 0.63 0.73
Coathanger 16.63 12.28 18.76 12.12 1.50 0.98 0.51 0.60
4-clique 15.42 12.36 18.18 9.77 2.66 2.01 1.62 1.97
alt-star 0.29 0.25 0.27 0.27 0.13 -0.03 0.06 -0.11
alt-triangle 32.22 23.34 29.98 21.70 0.02 -0.05 -0.02 -0.11
Alt-2-path -0.36 -0.43 -0.08 -0.05 0.02 -0.11 -0.12 -0.12
Spatial covariate -0.50 0.04 -0.51 -0.01 -0.13 -0.09 -0.22 -0.12
Std Dev degree distribution 3.49 2.69 5.31 4.65 0.89 0.51 0.16 0.12
Skew degree distribution 3.17 2.57 5.23 4.35 0.33 0.11 0.11 0.01
Global Clustering 5.39 5.09 5.77 5.23 1.22 1.25 1.21 1.26
Mean Local Clustering 13.46 13.72 14.11 13.76 -0.01 0.05 0.51 0.49
Variance Local Clustering -2.64 -2.51 -2.39 -2.31 1.82 1.54 1.01 0.86
Example 2: Protein-protein interaction network (yellow = initial wave, green = wave 1, red = wave 2, blue = wave 3)
4763 nodes, 17274 ties in modelled zones
Illustrative conditional MLEs from random seed sets of different sizes
Parameter estimates (s.e.)
|Z0| Edge Alt-star Alt-triangle Alt-2-path
100 -8.00 (2.30) 0.34 (0.67) 1.07 (0.198) 0.0027 (0.017)
200 -5.04 (1.17) -0.78 (0.38) 0.87 (0.156) 0.043 (0.013)
300 -12.7 (1.56) 1.57 (0.45) 0.86 (0.082) -0.0017 (0.013)
400 -10.63 (0.81) 1.05 (0.23) 0.60 (.058) 0.011 (.0008)
500 -9.79 (0.56) 0.85 (0.16) 0.71 (.042) 0.012 (.0010)
600 -11.8 (0.65) 1.30 (0.18) 0.81 (.040) 0.016 (.0019)
…
1000 -11.0 (0.31) 1.11 (0.09) 0.93 (.025) 0.007 (.0004)
Heuristic goodness of fit for conditional MLEs from 1000 node seed set
Graph statistic t
Std Dev degree dist: 1.24
Skew degree dist: -1.94
Global Clustering: 3.03
Mean Local Clustering: 1.40
Variance Local Clustering: 1.97
Count statistics t: 0.9003, -0.7267, 2.5194, -1.5980, 3.0425, 1.5223
Preliminary conclusions
Conditional MLEs can be useful for very large networks, both where missing data approaches are computationally infeasible, and to speed convergence
The approach can be used to test homogeneity assumptions
Extension to other network sampling designs
A well-specified model is essential, and hence so are principled approaches to assessment of fit
3. Where next?
Some next steps
• Enhanced model specifications
  – multi-relational
  – multi-mode (including multiple levels)
  – incorporating heterogeneity
• Better design-modelling links, and careful investment in data collection
• Models for both short- and long-term dynamics: relationship and interaction structures are not the same!
• Computational enhancements
Thank you