Statistical Analysis of Network Data in the Context of ‘Big Data’: Large Networks and Many Networks
Eric D. Kolaczyk
Dept of Mathematics and Statistics, Boston University
kolaczyk@bu.edu
Northeastern U, May 2015
Introduction
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
Introduction Complex Networks @ ∼ 15 Years
Complex Networks
[Figure: an example gene network, with vertex labels Pdp, dCLK, Cyc, Tim, Vri, and Per.]
Network-based analysis traditionally a relatively small ‘field’ of study
Epidemic-like spread of interest in networks since mid/late-1990s
Arguably due to various factors, such as
Increasingly systems-level perspective in science, away from reductionism;
Flood of high-throughput data;
Globalization, the Internet, etc.
Much Has Happened in 15 Years . . .
Google Scholar reports ∼ 361,000 articles with ‘network’ in the title published since 1999.
Contributions have been from across the sciences:
Computer Science, Mathematics, Statistics
Signal Processing, Statistical Physics, Information Sciences
Bioinformatics, Economics, Neuroscience, Sociology
Digital Humanities
[Book cover: Eric Kolaczyk and Gábor Csárdi, Statistical Analysis of Network Data with R (Springer, Use R! series; ISBN 978-1-4939-0982-7).]
Introduction Statistical Analysis of Network Data: Foundations?
Where is Statistics Amid All of this Work?
Everywhere!
Statistical aspects of network analysis include problems involving
sampling and design
description and visualization
modeling and inference
prediction
for data both of and on networks.
Much of this work occurs in domain-specific areas; a nontrivial amount of it is general.
However . . . comparatively less of it is foundational in nature.
Implications of Network Data on Statistical Foundations
Broadly speaking, the primary statistical challenges in most network problems come from the nontrivial interplay between
relational/dependent nature of the data;
network structure;
lack of (traditional) geometry; and
‘big data’.
Question: Despite a fairly well-developed ability to deal successfully with many classes of problems in many contexts and domain areas, how well do we truly understand the implications of network data (e.g., as opposed to IID, temporal, or spatial data) on statistics at a foundational level?
My Answer: Somewhat . . . but there’s still a long way to go!
Focus of this Talk: Behavior of ‘Network Averages’
Fundamental to much of statistics are averages and their asymptotic behavior.

Goal: To consider (arguably quite basic!) notions of ‘network averages’ and their asymptotic behavior in the settings of
1 large networks, and
2 many networks
and sketch some of our recent results, with an eye towards illustrating
the novelty lent by the aspect of complex networks; and
some of the resulting challenges and open problems.
Propagation of Uncertainty to Network Summary Statistics
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
Setting the Stage
Common modus operandi in network analysis:
System of elements and their interactions is of interest.
Collect elements and relations among elements.
Represent the collected data via a network.
Summarize properties of the network.
Sounds good . . . right?
But What About the Noise?!
(Stating the Obvious . . . ) If there is uncertainty in the determination of edge status (i.e., presence/absence) between vertex pairs, then that uncertainty propagates through any calculations using the resulting network graph as input!
Surprisingly, there is very little work acknowledging, much less addressing, this issue.
Work needed on problems of
characterizing propagation of error, from networks to summaries, and
adjusting for error (e.g., improved estimators).
=⇒ This represents a major foundation missing in network science. ⇐=
Propagation of Noise: Illustration

[Figure: two large inferred biological networks (vertex labels omitted).]

=⇒ Density = 0.14 ± ????
Examples particularly prevalent in biology (e.g., gene regulatory networks, protein-protein interaction networks, and neural functional connectivity networks), but some noise likely present in most network applications.
A General Formulation of the Problem
Suppose we have
G , a true underlying (undirected) network-graph
η (G ), a network summary / characteristic of interest.
We will restrict our attention to the problem of counting subgraphs H, i.e., where η is of the form

  ηH(G) = (1/|Iso(H)|) ∑_{H′ ⊆ K_nv, H′ ≅ H} 1{H′ ⊆ G} .
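For intuition, a minimal sketch of ηH for the simplest nontrivial case, H a triangle: enumerating unordered 3-vertex subsets of K_nv absorbs the 1/|Iso(H)| factor. (The edge-list representation and function name here are illustrative, not from the talk.)

```python
from itertools import combinations

def triangle_count(n, edges):
    """eta_H(G) for H = triangle: sum the indicator 1{H' subset of G}
    over all 3-vertex subsets of the complete graph K_n."""
    E = {frozenset(e) for e in edges}
    return sum(
        1
        for u, v, w in combinations(range(n), 3)
        if {frozenset((u, v)), frozenset((u, w)), frozenset((v, w))} <= E
    )
```

For example, a 4-vertex graph with edges (0,1), (0,2), (1,2), (2,3) contains exactly one triangle.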
A General Formulation (cont)
Rather than having G, however, suppose we only have an approximation/estimate of G, say Ĝ = (V, Ê).

We model observed edges to reflect potential errors in determining the (non)edge status between vertex pair {i, j}:

  Yij ∼ Bernoulli(αij), if {i, j} ∈ E^c ,
  Yij ∼ Bernoulli(1 − βij), if {i, j} ∈ E .

Goal: Understand the propagation of error from the Yij to the standard plug-in estimate ηH(Ĝ), by characterizing the distribution of the discrepancy

  D = ηH(Ĝ) − ηH(G) .
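This error model can be sketched as a generative procedure (function and variable names are my own, not from the talk): each non-edge of G turns on with probability α (Type I error), and each edge survives with probability 1 − β (Type II error).

```python
import random

def observe(n, edges, alpha, beta, seed=0):
    """Draw an observed edge set from the true one: Y_ij ~ Bern(alpha)
    for {i,j} in E^c, and Y_ij ~ Bern(1 - beta) for {i,j} in E."""
    rng = random.Random(seed)
    E = {frozenset(e) for e in edges}
    E_hat = set()
    for i in range(n):
        for j in range(i + 1, n):
            pair = frozenset((i, j))
            p = (1 - beta) if pair in E else alpha
            if rng.random() < p:
                E_hat.add(pair)
    return E_hat
```

With α = β = 0 the observed graph equals G; any η computed from the output then inherits the randomness of the Y_ij.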
Some Assumptions
Motivated by a combination of (i) canonical characteristics of complex networks in practice, and (ii) a need to further simplify, we assume the following:

(A1) Large Graphs: nv → ∞.

(A2) Sparse Graphs: |E| = Θ(nv log nv).

(A3) Edge Unbiasedness: ∑_{{i,j} ∈ E^c} αij = ∑_{{i,j} ∈ E} βij (≡ λ).

(A4) Low Error Rate: λ = Θ(nv^γ log^κ nv), for γ ∈ [0, 1) and κ ≥ 0.

(A5) Homogeneity: αij ≡ α and βij ≡ β, for α, β ∈ (0, 1).
Key Observation
When subgraph counting is the goal, the statistic
  D = (1/|Iso(H)|) ∑_{H′ ⊆ K_nv, H′ ≅ H} [ 1{H′ ⊆ Ĝ} − 1{H′ ⊆ G} ]

is a difference of sums of indicator random variables.
Key Point: This can be rewritten as # Type I Errors - # Type II Errors.
Intuitively, the combination of

Sparse Networks + Low-Rate Errors

suggests that each sum behaves like a Poisson random variable, and hence their difference, like a Skellam random variable.
The Skellam Distribution
A random variable W defined on the integers is said to have a Skellam distribution with parameters λ1, λ2 > 0, i.e., W ∼ Skellam(λ1, λ2), if

  P(W = k) = e^{−(λ1+λ2)} (λ1/λ2)^{k/2} I_k(2√(λ1λ2))   for k ∈ ℤ,   (1)

where I_k(2√(λ1λ2)) is the modified Bessel function of the first kind with index k and argument 2√(λ1λ2).
Note:
1 E[W ] = λ1 − λ2 and Var(W ) = λ1 + λ2;
2 Distribution symmetric iff λ1 = λ2.
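The pmf in (1) can be checked numerically by writing W = N1 − N2 for independent Poissons and summing the convolution series term by term (equivalent to the Bessel-function form); a sketch, with names of my own choosing:

```python
from math import exp

def skellam_pmf(k, lam1, lam2, terms=200):
    """P(W = k) for W = N1 - N2 with independent N1 ~ Poisson(lam1),
    N2 ~ Poisson(lam2): sum_j P(N1 = j + k) P(N2 = j) over admissible j."""
    j = max(0, -k)
    # Running term = lam1**(j+k) * lam2**j / ((j+k)! * j!), built
    # incrementally to avoid overflowing the factorials.
    term = 1.0
    for i in range(1, j + k + 1):
        term *= lam1 / i
    for i in range(1, j + 1):
        term *= lam2 / i
    total = 0.0
    for _ in range(terms):
        total += term
        j += 1
        term *= lam1 * lam2 / ((j + k) * j)
    return exp(-(lam1 + lam2)) * total
```

Summing over k recovers total mass 1 and mean λ1 − λ2, and for λ1 = λ2 the pmf is symmetric in k, matching Note 2 above.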
An Initial Treatment: Edge-Counting + Independence
Theorem
Let DE = |Ê| − |E|. Under assumptions (A1)–(A5) and independence among errors in declaring (non)edge status (i.e., among the Yij),

  dKS( DE , Skellam(λ, λ) ) ≤ O( (nv^{1−γ} log^{1−κ} nv)^{−1} ).   (2)

At the same time,

  dKS( DE/σ , N(0, 1) ) ≤ O( (nv^{γ/2} log^{κ/2} nv)^{−1} )   (3)

and, for sufficiently large nv,

  dKS( DE/σ , N(0, 1) ) ≥ Ω( (nv log nv)^{−1} ),   (4)

where N(0, 1) refers to a standard normal random variable.
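A quick Monte Carlo sanity check of the Skellam(λ, λ) limit (parameter choices here are illustrative, not from the talk): simulate independent errors calibrated so the expected Type I and Type II counts both equal λ, and compare the first two moments of DE with the Skellam values 0 and 2λ.

```python
import random

def simulate_DE(n, edge_prob, lam, trials, seed=0):
    """Draw one random graph, then repeatedly perturb it with independent
    errors satisfying E[#Type I] = E[#Type II] = lam (assumption (A3)),
    recording D_E = |E_hat| - |E| each time."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    E = {p for p in pairs if rng.random() < edge_prob}
    alpha = lam / (len(pairs) - len(E))   # Type I rate on non-edges
    beta = lam / len(E)                   # Type II rate on edges
    samples = []
    for _ in range(trials):
        d = 0
        for p in pairs:
            if p in E:
                d -= rng.random() < beta   # edge missed
            else:
                d += rng.random() < alpha  # spurious edge
        samples.append(d)
    return samples
```

The empirical mean should be near 0 and the empirical variance near 2λ, consistent with (2).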
Stronger Results Are Likely . . .
[Figure: log10(KS distance) versus number of vertices (100 to 10000), comparing Normal and Skellam limits across four error-rate regimes.]

(Log) Kolmogorov-Smirnov distance between Skellam and standard normal approximations to the distribution of the discrepancy DE in edge counts under independent errors, with rates constant, log, square-root, and linear in nv.
A More General Treatment
For the special case of counting edges under independent errors, the binary random variables defining D = ηH(Ĝ) − ηH(G) are independent.

In fact, dependence can be expected in practice, due to

construction/estimation of Ĝ (e.g., gene regulatory network inference); and/or

overlap of candidate subgraphs H′, for subgraphs of order larger than 2.

Nevertheless, a Skellam approximation is still likely to be appropriate in many contexts, due to the fundamental role of sparse graphs and low-rate errors.
A More General Treatment
Motivated by these observations, we provide¹ a general analysis of differences of sums of binary random variables,

  U = ∑_{k=1}^{n} Lk − ∑_{k=1}^{m} Mk ,

extending the above treatment.
Our approach uses Stein’s method, through which we
exhibit a Stein operator for the Skellam distribution;
produce several bounds on dKS(U,W ), for W ∼ Skellam(λ1, λ2);
control the Stein constant, in the case λ1 = λ2, and
illustrate in a handful of simple network error models.
¹Balachandran, Kolaczyk, Viles (2014). On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large, Sparse Networks. (arXiv:1409.5640)
Illustration: Edge-Counting + Dependence
Suppose that Type I and Type II errors are determined according to a simple two-step process:

1 A fixed integer number ν of vertex pairs is selected for potential errors;

2 These potential errors are realized with probability

  α = λ/[ν(1 − δ)], for vertex pairs in E^c , and
  β = λ/[νδ], for vertex pairs in E ,

where δ is the network density (i.e., fraction of edges).
Theorem
The resulting noise process is negatively associated and, as a result,
dKS (DE , Skellam(λ, λ) ) ≤ O (λ/ν) .
Challenges / Open Problems in Network Error Propagation
Tighter bounds are needed on dKS(U, W).

Control is needed on the Stein constant when λ1 ≠ λ2.

The ‘signal plus noise’ formulation posed here leads to a need for better understanding of the large-graph limit of subgraph counts in general (particularly sparse!) network models.
Development/study of domain-specific measurement error models.
Note: Also of interest are related statistical questions, e.g., improved estimation of functions η(G), confidence intervals, etc. We have made some initial contributions² using methods of matrix approximation.
²Balachandran, P., Airoldi, E., and Kolaczyk, E.D. (2013). Inference of network summary statistics through network denoising. (arXiv:1310.0423)
‘Stat 101’ for Collections of Network Data Objects
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
From Large Networks to Many Networks
Traditionally in ‘network science’ what is ‘large’ is the number of vertices nv.

But a second paradigm arguably is emerging, where what is large is the number of networks n.

A naturally emerging trend in the ‘big data’ approach to science.
Motivation: Functional Connectivity Networks
Kramer, M.A., Eden, U.T., Kolaczyk, E.D., Zepeda, R., Eskandar, E.N., Cash, S.S. (2010). Coalescence and fragmentation of cortical networks during focal seizures. Journal of Neuroscience, 30(30), 10076–10085.
‘Statistics 101’ for Network Data Objects?
While there are lots of methods and models for single (large) complex networks, there is comparatively little for analysis of collections of networks.
Tools are needed for answering basic questions like
“What is the ‘average’ of a collection of networks?”
“Do these networks differ, on average, from a given nominal network?”

“Do two collections of networks differ on average?”

“What factors (e.g., age, gender, etc.) appear to contribute to differences in networks?”
In order to answer these and similar questions, we require network-based analogues of classical tools for statistical estimation and hypothesis testing.
Laying an Initial Foundation . . .
Extension of classical tools to network-based datasets is non-trivial: networks are not Euclidean objects – rather, they are combinatorial objects, defined simply through their sets of vertices and edges.

Nevertheless, in recent work³ we have

shown certain classes of networks can be associated with nice subsets of Euclidean space,

which permits definition of a natural notion of distance and averaging,

allowing results in statistical shape analysis to be used to establish an asymptotic theory,

resulting in, for example, a principled approach to one- and two-sample hypothesis testing.
³Ginestet, C.E., Balachandran, P., Rosenberg, S., and Kolaczyk, E.D. (2014). Hypothesis testing for network data in functional neuroimaging. (arXiv:1407.5525)
Some Notation and Such
Let G = (V ,E ,W ) be a weighted undirected graph, that is
simple (i.e., no self-loops or multi-edges)
connected (i.e., only one component)
and define the (combinatorial) graph Laplacian
L = D(W )−W ,
where D is a diagonal matrix of weighted degrees, i.e., Djj = dj(W) = ∑_{i ≠ j} wij.

Our interest will be in IID collections of graphs G1, . . . , Gn, with which we will interchangeably associate an IID collection of graph Laplacians L1, . . . , Ln.
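The Laplacian construction above can be sketched in a few lines (pure-Python lists; the function name is mine and purely illustrative):

```python
def graph_laplacian(W):
    """Combinatorial Laplacian L = D(W) - W for a symmetric weighted
    adjacency matrix W, with D_jj = d_j(W) = sum over i != j of w_ij."""
    n = len(W)
    return [
        [(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]
```

For a simple (loop-free) graph the diagonal of W is zero, so each row of L sums to zero, consistent with condition (4) of the theorem on the next slide.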
Geometry of a Certain Space of Networks
Theorem
Let the set L′d consist of d × d matrices A, satisfying:
(1) Rank(A) = d − 1,
(2) Symmetry, AT = A,
(3) Positive semi-definiteness, A ≥ 0,
(4) The entries in each row sum to 0,
(5) The off-diagonal entries are non-positive, aij ≤ 0 .
Then L′d is a manifold with corners, of dimension d(d − 1)/2.

Furthermore, L′d is a convex subset of an affine space in R^{d²}, of dimension d(d − 1)/2.
A Central Limit Theorem
For L1, . . . , Ln IID wrt some distribution Q, and ρF the Frobenius norm, define the (Fréchet) means

  EQ[L] := argmin_{L ∈ L′d} ∫_{L′d} ρF²(L, L′) Q(dL′)   and   L̄n := argmin_{L ∈ L′d} (1/n) ∑_{i=1}^{n} ρF²(L, Li).
Theorem
If the expectation, Λ := EQ[L], does not lie on the boundary of L′d, and PQ[U] > 0, where U is an open subset of L′d with Λ ∈ U, then (under some further regularity conditions) we obtain the following convergence in distribution:

  n^{1/2} (φ(L̄n) − φ(Λ)) −→ N(0, Σ),

where Σ := Cov[φ(L)] and φ(·) denotes the half-vectorization of its matrix argument.
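Because ρF is the Euclidean (Frobenius) distance and L′d is convex, the sample Fréchet mean L̄n reduces to the entrywise average of the Laplacians whenever that average remains in L′d; a minimal sketch (pure Python, illustrative names):

```python
def frechet_mean(Ls):
    """Sample Frechet mean under the Frobenius metric: the minimizer of
    sum_i ||L - L_i||_F^2 over a convex set containing the entrywise
    average is that average itself."""
    n, d = len(Ls), len(Ls[0])
    return [
        [sum(L[i][j] for L in Ls) / n for j in range(d)]
        for i in range(d)
    ]
```

This is the quantity whose fluctuations around Λ the central limit theorem above describes.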
Large-Sample Testing Theory
Corollary
Under the null hypothesis H0 : E[L] = Λ0, we have

  T1 := n (φ(L̄) − φ(Λ0))^T Σ̂^{−1} (φ(L̄) − φ(Λ0)) −→ χ²_m ,

with m := (d choose 2) degrees of freedom, and where Σ̂ is the sample covariance.
Analogous results may be stated for two- and multiple-sample testing.
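A sketch of the one-sample statistic T1 using numpy (function names are mine; a pseudo-inverse stands in for Σ̂⁻¹ so the sketch also tolerates a singular sample covariance):

```python
import numpy as np

def t1_statistic(Ls, Lam0):
    """T1 = n (phi(Lbar) - phi(Lam0))^T pinv(Sigma_hat) (phi(Lbar) - phi(Lam0)),
    where phi is the half-vectorization (lower triangle) of a matrix."""
    def phi(A):
        return A[np.tril_indices(A.shape[0])]
    X = np.array([phi(np.asarray(L, dtype=float)) for L in Ls])  # n x m
    diff = X.mean(axis=0) - phi(np.asarray(Lam0, dtype=float))
    Sigma_hat = np.cov(X, rowvar=False)
    return float(len(Ls) * diff @ np.linalg.pinv(Sigma_hat) @ diff)
```

When the sample mean of the Laplacians equals Λ0, T1 is (numerically) zero; a systematic shift away from Λ0 makes it positive.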
Practical Considerations: Covariance Estimation
In order to use these results in practice, we require knowledge of Σ or, more realistically, for Σ̂ to be stable.

For n ≫ O(d²), it may be that Σ̂ is stable, but for n ≪ O(d²), we face a “large p, small n” problem.

It is natural to seek to exploit the recent literature on estimation of large, structured covariance matrices from limited data.

In neuroimaging, it has been argued that the networks of interest are sparse. Our empirical experience suggests that Σ is similarly sparse.

We use a thresholding-based method of Cai & Liu ’11 in our applications.
Illustration: 1000 FCP data
The 1000 Functional Connectomes Project (FCP) is a major MRI data-sharing initiative, launched in 2010.

Data collected on both genders, at five international locations, across a range of ages.
[Figure: sample counts by (A) sex (Female/Male) and (B) age group (x ≤ 22, 22 < x ≤ 32, 32 < x).]
Testing for Differences in 1000 FCP Networks
Current state of the art (aka ‘mass univariate’) uses edge-wise testing (presence/absence) and multiple-testing correction.

[Figure: (A) mass-univariate analysis (Sex) and (B) mass-univariate analysis (Age), each shown uncorrected and corrected.]
Comparison: Mass-Univariate vs Multivariate
Both methods detect differences in mean networks across gender and age, when using the full 1000 connectomes; but . . .

Only our multivariate method detects those differences at small sample sizes (i.e., relevant to single labs).
Challenges / Open Problems with Network Data Objects
This work lays only an initial stone in the foundation. There are a vast number of directions to go from here!
Examples include
Critical understanding of the impact of network structure (e.g., sparseness, degree distribution, etc.) on geometric, probabilistic, and statistical aspects of the problem.

A more refined understanding of the covariance structure of Laplacians, as well as a careful study of their estimation.
Finite-sample properties, corrections, etc.
Adaptation of other tools in the ‘Statistics 101’ toolbox?
Closing Thoughts: Much Still to Do . . .
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
Some Closing Thoughts
Use of a network-based perspective in modeling and analysis is now pervasive across the sciences (and has even penetrated the humanities).

While much of the work done in this area is necessarily problem-specific, to varying extents, there is sufficient evidence after 15 years to suggest it is both necessary and interesting to (re)visit the statistical foundations⁴.

The hook (hopefully!) is that the area is a source of problems that are broadly relevant, with an intriguing blend of traditional and new elements, and generally needing to incorporate many of the directions of research already being explored in the broader community.

⁴Note there is parallel movement in the making within the signal and information processing literatures (e.g., witness the rise of ‘graph signal processing’).
Collaborators and Support
Supported in part by AFOSR, NIH, NSF, and ONR.