Statistical Analysis of Network Data in the Context of ‘Big Data’: Large Networks and Many Networks
Eric D. Kolaczyk
Dept of Mathematics and Statistics, Boston University
kolaczyk@bu.edu
Northeastern U, May 2015
Introduction
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
Introduction Complex Networks @ ∼ 15 Years
Complex Networks
[Figure: an example gene network, with vertex labels Pdp, dCLK, Cyc, Tim, Vri, and Per.]
Network-based analysis traditionally a relatively small ‘field’ of study
Epidemic-like spread of interest in networks since mid/late-1990s
Arguably due to various factors, such as
Increasingly systems-level perspective in science, away from reductionism;
Flood of high-throughput data;
Globalization, the Internet, etc.
Much Has Happened in 15 Years . . .
Google Scholar reports ∼ 361,000 articles with ‘network’ in the title published since 1999.
Contributions have been from across the sciences:
Computer Science, Mathematics, Statistics
Signal Processing, Statistical Physics, Information Sciences
Bioinformatics, Economics, Neuroscience, Sociology
Digital Humanities
[Book cover: Eric Kolaczyk and Gábor Csárdi, Statistical Analysis of Network Data with R (Springer, Use R! series; ISBN 978-1-4939-0982-7).]
Introduction Statistical Analysis of Network Data: Foundations?
Where is Statistics Amid All of this Work?
Everywhere!
Statistical aspects of network analysis include problems involving
sampling and design
description and visualization
modeling and inference
prediction
for data both of and on networks.
Much of this work occurs in domain-specific areas; a nontrivial amount of it is general.
However . . . comparatively less of it is foundational in nature.
Implications of Network Data on Statistical Foundations
Broadly speaking, the primary statistical challenges in most network problems come from the nontrivial interplay between
relational/dependent nature of the data;
network structure;
lack of (traditional) geometry; and
‘big data’.
Question: Despite a fairly well-developed ability to deal successfully with many classes of problems in many contexts and domain areas, how well do we truly understand the implications of network data (e.g., as opposed to IID, temporal, or spatial data) on statistics at a foundational level?
My Answer: Somewhat . . . but there’s still a long way to go!
Focus of this Talk: Behavior of ‘Network Averages’
Fundamental to much of statistics are averages and their asymptotic behavior.

Goal: To consider (arguably quite basic!) notions of ‘network averages’ and their asymptotic behavior in the settings of
1 large networks, and
2 many networks
and sketch some of our recent results, with an eye towards illustrating
the novelty lent by the aspect of complex networks; and
some of the resulting challenges and open problems.
Propagation of Uncertainty to Network Summary Statistics
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
Setting the Stage
Common modus operandi in network analysis:
System of elements and their interactions is of interest.
Collect elements and relations among elements.
Represent the collected data via a network.
Summarize properties of the network.
Sounds good . . . right?
But What About the Noise?!
(Stating the Obvious . . . ) If there is uncertainty in the determination of edge status (i.e., presence/absence) between vertex pairs, then that uncertainty propagates through any calculations using the resulting network graph as input!
Surprisingly, there is very little work acknowledging, much less addressing, this issue.
Work needed on problems of
characterizing propagation of error, from networks to summaries, and
adjusting for error (e.g., improved estimators).
=⇒ This represents a major foundation missing in network science. ⇐=
Propagation of Noise: Illustration

[Figure: two large inferred biological networks (vertex labels omitted).]

=⇒ Density = 0.14 ± ????
Examples particularly prevalent in biology (e.g., gene regulatory networks, protein-protein interaction networks, and neural functional connectivity networks), but some noise likely present in most network applications.
A General Formulation of the Problem
Suppose we have
G , a true underlying (undirected) network-graph
η (G ), a network summary / characteristic of interest.
We will restrict our attention to the problem of counting subgraphs H, i.e., where η is of the form

  ηH(G) = (1/|Iso(H)|) ∑_{H′ ⊆ K_nv, H′ ≅ H} 1{H′ ⊆ G} .
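For intuition, a minimal sketch of ηH for the simplest nontrivial case, H a triangle: enumerating unordered 3-vertex subsets of K_nv absorbs the 1/|Iso(H)| factor. (The edge-list representation and function name here are illustrative, not from the talk.)

```python
from itertools import combinations

def triangle_count(n, edges):
    """eta_H(G) for H = triangle: sum the indicator 1{H' subset of G}
    over all 3-vertex subsets of the complete graph K_n."""
    E = {frozenset(e) for e in edges}
    return sum(
        1
        for u, v, w in combinations(range(n), 3)
        if {frozenset((u, v)), frozenset((u, w)), frozenset((v, w))} <= E
    )
```

For example, a 4-vertex graph with edges (0,1), (0,2), (1,2), (2,3) contains exactly one triangle.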
A General Formulation (cont)
Rather than having G, however, suppose we only have an approximation/estimate of G, say Ĝ = (V, Ê).

We model observed edges to reflect potential errors in determining the (non)edge status between vertex pair {i, j}:

  Yij ∼ Bernoulli(αij), if {i, j} ∈ E^c ,
  Yij ∼ Bernoulli(1 − βij), if {i, j} ∈ E .

Goal: Understand the propagation of error from the Yij to the standard plug-in estimate ηH(Ĝ), by characterizing the distribution of the discrepancy

  D = ηH(Ĝ) − ηH(G) .
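This error model can be sketched as a generative procedure (function and variable names are my own, not from the talk): each non-edge of G turns on with probability α (Type I error), and each edge survives with probability 1 − β (Type II error).

```python
import random

def observe(n, edges, alpha, beta, seed=0):
    """Draw an observed edge set from the true one: Y_ij ~ Bern(alpha)
    for {i,j} in E^c, and Y_ij ~ Bern(1 - beta) for {i,j} in E."""
    rng = random.Random(seed)
    E = {frozenset(e) for e in edges}
    E_hat = set()
    for i in range(n):
        for j in range(i + 1, n):
            pair = frozenset((i, j))
            p = (1 - beta) if pair in E else alpha
            if rng.random() < p:
                E_hat.add(pair)
    return E_hat
```

With α = β = 0 the observed graph equals G; any η computed from the output then inherits the randomness of the Y_ij.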
Some Assumptions
Motivated by a combination of (i) canonical characteristics of complex networks in practice, and (ii) a need to further simplify, we assume the following:

(A1) Large Graphs: nv → ∞.

(A2) Sparse Graphs: |E| = Θ(nv log nv).

(A3) Edge Unbiasedness: ∑_{{i,j} ∈ E^c} αij = ∑_{{i,j} ∈ E} βij (≡ λ).

(A4) Low Error Rate: λ = Θ(nv^γ log^κ nv), for γ ∈ [0, 1) and κ ≥ 0.

(A5) Homogeneity: αij ≡ α and βij ≡ β, for α, β ∈ (0, 1).
Key Observation
When subgraph counting is the goal, the statistic
  D = (1/|Iso(H)|) ∑_{H′ ⊆ K_nv, H′ ≅ H} [ 1{H′ ⊆ Ĝ} − 1{H′ ⊆ G} ]

is a difference of sums of indicator random variables.
Key Point: This can be rewritten as # Type I Errors - # Type II Errors.
Intuitively, the combination of

Sparse Networks + Low-Rate Errors

suggests that each sum behaves like a Poisson random variable, and hence their difference, like a Skellam random variable.
The Skellam Distribution
A random variable W defined on the integers is said to have a Skellam distribution with parameters λ1, λ2 > 0, i.e., W ∼ Skellam(λ1, λ2), if

  P(W = k) = e^{−(λ1+λ2)} (λ1/λ2)^{k/2} I_k(2√(λ1λ2))   for k ∈ ℤ,   (1)

where I_k(2√(λ1λ2)) is the modified Bessel function of the first kind with index k and argument 2√(λ1λ2).
Note:
1 E[W ] = λ1 − λ2 and Var(W ) = λ1 + λ2;
2 Distribution symmetric iff λ1 = λ2.
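The pmf in (1) can be checked numerically by writing W = N1 − N2 for independent Poissons and summing the convolution series term by term (equivalent to the Bessel-function form); a sketch, with names of my own choosing:

```python
from math import exp

def skellam_pmf(k, lam1, lam2, terms=200):
    """P(W = k) for W = N1 - N2 with independent N1 ~ Poisson(lam1),
    N2 ~ Poisson(lam2): sum_j P(N1 = j + k) P(N2 = j) over admissible j."""
    j = max(0, -k)
    # Running term = lam1**(j+k) * lam2**j / ((j+k)! * j!), built
    # incrementally to avoid overflowing the factorials.
    term = 1.0
    for i in range(1, j + k + 1):
        term *= lam1 / i
    for i in range(1, j + 1):
        term *= lam2 / i
    total = 0.0
    for _ in range(terms):
        total += term
        j += 1
        term *= lam1 * lam2 / ((j + k) * j)
    return exp(-(lam1 + lam2)) * total
```

Summing over k recovers total mass 1 and mean λ1 − λ2, and for λ1 = λ2 the pmf is symmetric in k, matching Note 2 above.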
An Initial Treatment: Edge-Counting + Independence
Theorem
Let DE = |Ê| − |E|. Under assumptions (A1)–(A5) and independence among errors in declaring (non)edge status (i.e., among the Yij),

  dKS( DE , Skellam(λ, λ) ) ≤ O( (nv^{1−γ} log^{1−κ} nv)^{−1} ).   (2)

At the same time,

  dKS( DE/σ , N(0, 1) ) ≤ O( (nv^{γ/2} log^{κ/2} nv)^{−1} )   (3)

and, for sufficiently large nv,

  dKS( DE/σ , N(0, 1) ) ≥ Ω( (nv log nv)^{−1} ),   (4)

where N(0, 1) refers to a standard normal random variable.
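A quick Monte Carlo sanity check of the Skellam(λ, λ) limit (parameter choices here are illustrative, not from the talk): simulate independent errors calibrated so the expected Type I and Type II counts both equal λ, and compare the first two moments of DE with the Skellam values 0 and 2λ.

```python
import random

def simulate_DE(n, edge_prob, lam, trials, seed=0):
    """Draw one random graph, then repeatedly perturb it with independent
    errors satisfying E[#Type I] = E[#Type II] = lam (assumption (A3)),
    recording D_E = |E_hat| - |E| each time."""
    rng = random.Random(seed)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    E = {p for p in pairs if rng.random() < edge_prob}
    alpha = lam / (len(pairs) - len(E))   # Type I rate on non-edges
    beta = lam / len(E)                   # Type II rate on edges
    samples = []
    for _ in range(trials):
        d = 0
        for p in pairs:
            if p in E:
                d -= rng.random() < beta   # edge missed
            else:
                d += rng.random() < alpha  # spurious edge
        samples.append(d)
    return samples
```

The empirical mean should be near 0 and the empirical variance near 2λ, consistent with (2).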
Stronger Results Are Likely . . .
[Figure: log10(KS distance) versus number of vertices (100 to 10000), comparing Normal and Skellam limits across four error-rate regimes.]

(Log) Kolmogorov-Smirnov distance between Skellam and standard normal approximations to the distribution of the discrepancy DE in edge counts under independent errors, with rates constant, log, square-root, and linear in nv.
A More General Treatment
For the special case of counting edges under independent errors, the binary random variables defining D = ηH(Ĝ) − ηH(G) are independent.

In fact, dependence can be expected in practice, due to

construction/estimation of Ĝ (e.g., gene regulatory network inference); and/or

overlap of candidate subgraphs H′, for subgraphs of order larger than 2.

Nevertheless, a Skellam approximation is still likely to be appropriate in many contexts, due to the fundamental role of sparse graphs and low-rate errors.
A More General Treatment
Motivated by these observations, we provide¹ a general analysis of differences of sums of binary random variables,

  U = ∑_{k=1}^{n} Lk − ∑_{k=1}^{m} Mk ,

extending the above treatment.
Our approach uses Stein’s method, through which we
exhibit a Stein operator for the Skellam distribution;
produce several bounds on dKS(U,W ), for W ∼ Skellam(λ1, λ2);
control the Stein constant, in the case λ1 = λ2, and
illustrate in a handful of simple network error models.
¹Balachandran, Kolaczyk, Viles (2014). On the Propagation of Low-Rate Measurement Error to Subgraph Counts in Large, Sparse Networks. (arXiv:1409.5640)
Illustration: Edge-Counting + Dependence
Suppose that Type I and Type II errors are determined according to a simple two-step process:

1 A fixed integer number ν of vertex pairs is selected for potential errors;

2 These potential errors are realized with probability

  α = λ/[ν(1 − δ)], for vertex pairs in E^c , and
  β = λ/[νδ], for vertex pairs in E ,

where δ is the network density (i.e., fraction of edges).
Theorem
The resulting noise process is negatively associated and, as a result,
dKS (DE , Skellam(λ, λ) ) ≤ O (λ/ν) .
Challenges / Open Problems in Network Error Propagation
Tighter bounds are needed on dKS(U, W).

Control is needed on the Stein constant when λ1 ≠ λ2.

The ‘signal plus noise’ formulation posed here leads to a need for better understanding of the large-graph limit of subgraph counts in general (particularly sparse!) network models.
Development/study of domain-specific measurement error models.
Note: Also of interest are related statistical questions, e.g., improved estimation of functions η(G), confidence intervals, etc. We have made some initial contributions² using methods of matrix approximation.
²Balachandran, P., Airoldi, E., and Kolaczyk, E.D. (2013). Inference of network summary statistics through network denoising. (arXiv:1310.0423)
‘Stat 101’ for Collections of Network Data Objects
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
From Large Networks to Many Networks
Traditionally in ‘network science’ what is ‘large’ is the number of vertices nv.

But a second paradigm arguably is emerging, where what is large is the number of networks n.

A naturally emerging trend in the ‘big data’ approach to science.
Motivation: Functional Connectivity Networks
Kramer, M.A., Eden, U.T., Kolaczyk, E.D., Zepeda, R., Eskandar, E.N., Cash, S.S. (2010). Coalescence and fragmentation of cortical networks during focal seizures. Journal of Neuroscience, 30(30), 10076–10085.
‘Statistics 101’ for Network Data Objects?
While there are lots of methods and models for single (large) complex networks, there is comparatively little for analysis of collections of networks.
Tools are needed for answering basic questions like
“What is the ‘average’ of a collection of networks?”
“Do these networks differ, on average, from a given nominal network?”

“Do two collections of networks differ on average?”

“What factors (e.g., age, gender, etc.) appear to contribute to differences in networks?”
In order to answer these and similar questions, we require network-based analogues of classical tools for statistical estimation and hypothesis testing.
Laying an Initial Foundation . . .
Extension of classical tools to network-based datasets is non-trivial: networks are not Euclidean objects – rather, they are combinatorial objects, defined simply through their sets of vertices and edges.

Nevertheless, in recent work³ we have

shown certain classes of networks can be associated with nice subsets of Euclidean space,

which permits definition of a natural notion of distance and averaging,

allowing results in statistical shape analysis to be used to establish an asymptotic theory,

resulting in, for example, a principled approach to one- and two-sample hypothesis testing.
³Ginestet, C.E., Balachandran, P., Rosenberg, S., and Kolaczyk, E.D. (2014). Hypothesis testing for network data in functional neuroimaging. (arXiv:1407.5525)
Some Notation and Such
Let G = (V ,E ,W ) be a weighted undirected graph, that is
simple (i.e., no self-loops or multi-edges)
connected (i.e., only one component)
and define the (combinatorial) graph Laplacian
L = D(W )−W ,
where D is a diagonal matrix of weighted degrees, i.e., Djj = dj(W) = ∑_{i ≠ j} wij.

Our interest will be in IID collections of graphs G1, . . . , Gn, with which we will interchangeably associate an IID collection of graph Laplacians L1, . . . , Ln.
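The Laplacian construction above can be sketched in a few lines (pure-Python lists; the function name is mine and purely illustrative):

```python
def graph_laplacian(W):
    """Combinatorial Laplacian L = D(W) - W for a symmetric weighted
    adjacency matrix W, with D_jj = d_j(W) = sum over i != j of w_ij."""
    n = len(W)
    return [
        [(sum(W[i]) if i == j else 0) - W[i][j] for j in range(n)]
        for i in range(n)
    ]
```

For a simple (loop-free) graph the diagonal of W is zero, so each row of L sums to zero, consistent with condition (4) of the theorem on the next slide.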
Geometry of a Certain Space of Networks
Theorem
Let the set L′d consist of d × d matrices A, satisfying:
(1) Rank(A) = d − 1,
(2) Symmetry, AT = A,
(3) Positive semi-definiteness, A ≥ 0,
(4) The entries in each row sum to 0,
(5) The off-diagonal entries are non-positive, aij ≤ 0 .
Then L′d is a manifold with corners, of dimension d(d − 1)/2.

Furthermore, L′d is a convex subset of an affine space in R^{d²}, of dimension d(d − 1)/2.
A Central Limit Theorem
For L1, . . . , Ln IID wrt some distribution Q, and ρF the Frobenius norm, define the (Fréchet) means

  EQ[L] := argmin_{L ∈ L′d} ∫_{L′d} ρF²(L, L′) Q(dL′)   and   L̄n := argmin_{L ∈ L′d} (1/n) ∑_{i=1}^{n} ρF²(L, Li).
Theorem
If the expectation, Λ := EQ[L], does not lie on the boundary of L′d, and PQ[U] > 0, where U is an open subset of L′d with Λ ∈ U, then (under some further regularity conditions) we obtain the following convergence in distribution:

  n^{1/2} (φ(L̄n) − φ(Λ)) −→ N(0, Σ),

where Σ := Cov[φ(L)] and φ(·) denotes the half-vectorization of its matrix argument.
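Because ρF is the Euclidean (Frobenius) distance and L′d is convex, the sample Fréchet mean L̄n reduces to the entrywise average of the Laplacians whenever that average remains in L′d; a minimal sketch (pure Python, illustrative names):

```python
def frechet_mean(Ls):
    """Sample Frechet mean under the Frobenius metric: the minimizer of
    sum_i ||L - L_i||_F^2 over a convex set containing the entrywise
    average is that average itself."""
    n, d = len(Ls), len(Ls[0])
    return [
        [sum(L[i][j] for L in Ls) / n for j in range(d)]
        for i in range(d)
    ]
```

This is the quantity whose fluctuations around Λ the central limit theorem above describes.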
Large-Sample Testing Theory
Corollary
Under the null hypothesis H0 : E[L] = Λ0, we have

  T1 := n (φ(L̄) − φ(Λ0))^T Σ̂^{−1} (φ(L̄) − φ(Λ0)) −→ χ²_m ,

with m := (d choose 2) degrees of freedom, and where Σ̂ is the sample covariance.
Analogous results may be stated for two- and multiple-sample testing.
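A sketch of the one-sample statistic T1 using numpy (function names are mine; a pseudo-inverse stands in for Σ̂⁻¹ so the sketch also tolerates a singular sample covariance):

```python
import numpy as np

def t1_statistic(Ls, Lam0):
    """T1 = n (phi(Lbar) - phi(Lam0))^T pinv(Sigma_hat) (phi(Lbar) - phi(Lam0)),
    where phi is the half-vectorization (lower triangle) of a matrix."""
    def phi(A):
        return A[np.tril_indices(A.shape[0])]
    X = np.array([phi(np.asarray(L, dtype=float)) for L in Ls])  # n x m
    diff = X.mean(axis=0) - phi(np.asarray(Lam0, dtype=float))
    Sigma_hat = np.cov(X, rowvar=False)
    return float(len(Ls) * diff @ np.linalg.pinv(Sigma_hat) @ diff)
```

When the sample mean of the Laplacians equals Λ0, T1 is (numerically) zero; a systematic shift away from Λ0 makes it positive.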
Practical Considerations: Covariance Estimation
In order to use these results in practice, we require knowledge of Σ or, more realistically, for Σ̂ to be stable.

For n ≫ O(d²), it may be that Σ̂ is stable, but for n ≪ O(d²), we face a “large p, small n” problem.

It is natural to seek to exploit the recent literature on estimation of large, structured covariance matrices from limited data.

In neuroimaging, it has been argued that the networks of interest are sparse. Our empirical experience suggests that Σ is similarly sparse.

We use a thresholding-based method of Cai & Liu ’11 in our applications.
Illustration: 1000 FCP data
The 1000 Functional Connectomes Project (FCP) is a major MRI data-sharing initiative, launched in 2010.

Data collected on both genders, at five international locations, across a range of ages.
[Figure: sample counts by (A) sex (Female/Male) and (B) age group (x ≤ 22, 22 < x ≤ 32, 32 < x).]
Testing for Differences in 1000 FCP Networks
Current state of the art (aka ‘mass univariate’) uses edge-wise testing (presence/absence) and multiple-testing correction.

[Figure: (A) mass-univariate analysis (Sex) and (B) mass-univariate analysis (Age), each shown uncorrected and corrected.]
Comparison: Mass-Univariate vs Multivariate
Both methods detect differences in mean networks across gender and age, when using the full 1000 connectomes; but . . .

Only our multivariate method detects those differences at small sample sizes (i.e., relevant to single labs).
Challenges / Open Problems with Network Data Objects
This work lays only an initial stone in the foundation. There are a vast number of directions to go from here!
Examples include
Critical understanding of the impact of network structure (e.g., sparseness, degree distribution, etc.) on geometric, probabilistic, and statistical aspects of the problem.

A more refined understanding of the covariance structure of Laplacians, as well as a careful study of their estimation.
Finite-sample properties, corrections, etc.
Adaptation of other tools in the ‘Statistics 101’ toolbox?
Closing Thoughts: Much Still to Do . . .
Outline
1 Introduction
  Complex Networks @ ∼ 15 Years
  Statistical Analysis of Network Data: Foundations?
2 Propagation of Uncertainty to Network Summary Statistics
3 ‘Stat 101’ for Collections of Network Data Objects
4 Closing Thoughts: Much Still to Do . . .
Some Closing Thoughts
Use of a network-based perspective in modeling and analysis is now pervasive across the sciences (and has even penetrated the humanities).

While much of the work done in this area is necessarily problem-specific, to varying extents, there is sufficient evidence after 15 years to suggest it is both necessary and interesting to (re)visit the statistical foundations⁴.

The hook (hopefully!) is that the area is a source of problems that are broadly relevant, with an intriguing blend of traditional and new elements, and generally needing to incorporate many of the directions of research already being explored in the broader community.

⁴Note there is parallel movement in the making within the signal and information processing literatures (e.g., witness the rise of ‘graph signal processing’).
Collaborators and Support
Supported in part by AFOSR, NIH, NSF, and ONR.