A Statistical Framework for Streaming Graph Analysis

James Fairbanks, David Ediger, Rob McColl, David A. Bader, Eric Gilbert

Problem

James Fairbanks, ASONAM 2013

In order to understand social media, we must understand the evolution of relationships in streaming data.

● How can we detect change?
● What is a significant change?
● Are two sets of vertices significantly different?
● How can we visualize 10,000+ vertices?
● Which vertices look anomalous?

We look to statistical analysis for guidance on these questions.

Challenge

● Graph data is big, sparse, irregular, messy, high dimensional

● Statistics works best on dense, regular, clean, low dimensional data

Solution

● Use graph-theoretic computations to embed the graph in a low-dimensional space
● Embedding is not topology preserving
● Do machine learning and statistics in Euclidean space

[Figure: Wikipedia articles. Scatter plot of vertices in feature space]

Related Work

● Tracking Earthquakes [Sakaki, et al., 2010]

● Rumors about Earthquakes [Mendoza, et al., 2010]

● London Riots and Hashtags [Glasgow and Fink, 2013]

● Streaming Clustering Coefficient [Ediger, Riedy, et al., 2011]

● Atlanta Floods, H1N1 [Ediger, et al., 2010]

● Dynamic Visual Analysis [Federico, et al., 2012]

Definitions

Vertex [Features, Metrics, Statistics]

A vertex statistic associates a number to each vertex at each time step.

Graph Kernels

Computational subroutines that compute vertex features or maintain a data structure on top of the graph.
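As a minimal sketch of these two definitions, the snippet below treats "counting neighbors" as the kernel and degree as the vertex statistic, recorded once per batch of edge insertions. The function name `degree_over_batches` and the toy batches are illustrative assumptions, not part of the talk or of STINGER.

```python
from collections import defaultdict

def degree_over_batches(batches):
    """Return one {vertex: degree} snapshot per batch of undirected edge insertions."""
    neighbors = defaultdict(set)   # the evolving graph, stored as adjacency sets
    snapshots = []
    for batch in batches:
        for u, v in batch:
            if u != v:             # assumption: self-loops are ignored in this sketch
                neighbors[u].add(v)
                neighbors[v].add(u)
        # Kernel: count neighbors. Statistic: one number per vertex at this time step.
        snapshots.append({vertex: len(adj) for vertex, adj in neighbors.items()})
    return snapshots

# Hypothetical toy stream: two batches of mention edges.
batches = [[("a", "b"), ("b", "c")], [("a", "c"), ("c", "d")]]
history = degree_over_batches(batches)
```

Here `history[t][v]` is the degree of vertex `v` after batch `t`, which matches the "number per vertex per time step" shape of a vertex statistic.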

Examples

Vertex Features
● Degree
● Size of connected component
● Geodesic distance
● Local clustering coefficient
● PageRank
● Betweenness Centrality

Graph Kernels
● Counting neighbors
● Shiloach-Vishkin connected components
● Breadth First Search (BFS)
● Counting neighborhood intersection
● Power iteration
● Brandes 2001
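To make one feature/kernel pairing concrete, the sketch below uses breadth-first search as the kernel that produces geodesic (hop) distance as the vertex feature. The adjacency-dict representation and the name `bfs_distances` are assumptions made for illustration only.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Geodesic (hop) distance from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adjacency.get(u, ()):
            if w not in dist:          # first visit gives the shortest hop count
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

# Hypothetical toy graph: a path a-b-c-d plus an isolated vertex e.
adjacency = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}, "e": set()}
distances = bfs_distances(adjacency, "a")
```

Unreachable vertices (here `e`) are simply absent from the result, so a caller has to decide how to encode "infinite" distance before using it as a numeric feature.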

STINGER

A high performance, dynamic graph data structure with semantic and temporal properties

● Supports concurrent streaming data sources and analysis
● Scalable on shared-memory Intel x86 platforms and Cray XMT
● Open source and free (BSD License)
● http://www.stingergraph.com

Data Set

● Hurricane Sandy public Tweets [28 Oct, 12 Nov 2012]

● 1,238,109 mentions

● 662,575 unique users

● Batches of 10,000 updates

● Update interval: 1 batch represents ~3 hours of Tweets

photo credit: NASA Earth Observatory

Clustering Coefficient

● Where tri(v) is the number of 3-cycles containing v
● Measures how tightly knit the graph is at the local level [Watts, Strogatz, 98]

● Compute in time
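The formula from this slide did not survive extraction, so the sketch below falls back on the standard Watts-Strogatz local clustering coefficient, 2·T(v) / (deg(v)·(deg(v) − 1)), where T(v) is the number of triangles through v. Whether the slide's tri(v) counts each triangle once or twice is not recoverable, and the function name and the convention of returning 0 for degree below 2 are assumptions.

```python
def local_clustering(adjacency):
    """Standard local clustering coefficient for an undirected simple graph."""
    cc = {}
    for v, nbrs in adjacency.items():
        d = len(nbrs)
        if d < 2:
            cc[v] = 0.0                # assumption: undefined cases are reported as 0
            continue
        # Neighborhood intersection: each triangle through v is found twice,
        # once from each of its two other corners.
        closed = sum(len(nbrs & adjacency[u]) for u in nbrs)
        triangles = closed // 2
        cc[v] = 2.0 * triangles / (d * (d - 1))
    return cc

# Hypothetical toy graph: triangle a-b-c with a pendant vertex d attached to c.
adjacency = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cc = local_clustering(adjacency)
```

In this toy graph `a` and `b` sit in a fully closed neighborhood, while `c` has three neighbors of which only one pair is connected.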

Clustering Coefficient

[Plot annotation: NJ Landfall]

● Counting vertices that have increasing or decreasing clustering coefficient

● Model as stochastic process for forecasting
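The counting step can be sketched as a comparison of two successive clustering coefficient snapshots. The name `count_changes`, the `tolerance` parameter, and the choice to treat previously unseen vertices as starting from 0 are all assumptions for illustration; the talk does not specify them.

```python
def count_changes(previous, current, tolerance=0.0):
    """Count vertices whose statistic rose or fell between two snapshots."""
    increasing = decreasing = 0
    for v, now in current.items():
        before = previous.get(v, 0.0)  # assumption: unseen vertices start at 0
        if now - before > tolerance:
            increasing += 1
        elif before - now > tolerance:
            decreasing += 1
    return increasing, decreasing

# Hypothetical snapshots of the clustering coefficient after two batches.
previous = {"a": 0.5, "b": 0.2, "c": 0.0}
current = {"a": 0.7, "b": 0.1, "c": 0.0, "d": 0.3}
up, down = count_changes(previous, current)
```

Recording the pair `(up, down)` once per batch gives the two time series that could then be modeled as a stochastic process.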

Temporal Correlation

● Defined for any quantity measuring strength of association

● For Pearson’s correlation

● formula

● quantifies strength of association between successive measurements
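The slide's formula is not preserved in this text. One plausible reading, taken here as an assumption, is Pearson's correlation between the values of a vertex statistic at step t and at step t+1, computed across the vertices present in both snapshots. The function names are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")            # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)

def temporal_correlation(snapshots):
    """Correlation between each pair of successive snapshots, over shared vertices."""
    result = []
    for before, after in zip(snapshots, snapshots[1:]):
        shared = sorted(set(before) & set(after))
        result.append(pearson([before[v] for v in shared],
                              [after[v] for v in shared]))
    return result

# Hypothetical snapshots: the statistic doubles, then reverses its ordering.
snapshots = [{"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 4, "c": 6}, {"a": 6, "b": 4, "c": 2}]
rho = temporal_correlation(snapshots)
```

A value near 1 means the statistic at one step largely determines the next; a drop toward 0 suggests that new edges reshuffled the vertices.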

Correlation Decay

● New edges change vertex statistics

● Correlation measures the forgetfulness of a vertex statistic

● Bigger graph implies less impact

Derivatives

The centered discrete derivative of a vertex feature:
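The defining formula is missing from this text. As a sketch, the snippet below uses the usual centered difference, (x[t+1] − x[t−1]) / 2, with the time step taken to be one batch; this scaling and the name `centered_derivative` are assumptions rather than the authors' stated definition.

```python
def centered_derivative(series):
    """Centered difference (x[t+1] - x[t-1]) / 2 at each interior time step."""
    return [(series[t + 1] - series[t - 1]) / 2
            for t in range(1, len(series) - 1)]

# Hypothetical feature values x[t] = t**2 for t = 0..4.
series = [0, 1, 4, 9, 16]
deriv = centered_derivative(series)
```

The result has two fewer entries than the input because the first and last time steps have no neighbor on one side.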

Anomaly Detection

In graphs:
● What is an anomalous vertex in a graph?
● In social media, who uses the service in a novel way?
● A vertex with edges that look different.

From statistics:
● Outlier: a point in a region of space with very low probability density.
● Points close to the outliers in space are rare.
● If we can estimate the true density from a finite sample, then we can find outliers.
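The density argument above can be illustrated with a plain Gaussian kernel density estimate that flags the lowest-density fraction of points. This is a swapped-in technique for illustration, not the one-class SVM the talk reports using. The bandwidth of 0.3 and the 5% fraction echo numbers quoted elsewhere in the deck, but applying them to a kernel density estimate is an assumption, as are the function names.

```python
from math import exp

def kde_scores(points, bandwidth=0.3):
    """Unnormalized Gaussian kernel density estimate at each sample point."""
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += exp(-sq_dist / (2 * bandwidth ** 2))
        scores.append(total / len(points))
    return scores

def flag_outliers(points, fraction=0.05, bandwidth=0.3):
    """Mark the given fraction of points with the lowest estimated density."""
    scores = kde_scores(points, bandwidth)
    k = max(1, round(fraction * len(points)))
    order = sorted(range(len(points)), key=scores.__getitem__)
    flagged = set(order[:k])           # indices of the k lowest-density points
    return [i in flagged for i in range(len(points))]

# Hypothetical data: 19 tightly packed points and one far-away point.
points = [(0.05 * i, 0.0) for i in range(19)] + [(5.0, 5.0)]
flags = flag_outliers(points)
```

The quadratic cost of this estimate is acceptable for a toy example but would need a faster method at the scale of hundreds of thousands of vertices.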

Outlier Detection Features

● Mean(CC)
● Var(CC)
● Mean(Deriv(CC))
● Var(Deriv(CC))

● Gaussian Radial Basis Function
● Radius 0.3
● By choice 5% of the data is labeled outlier
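The four features listed above can be assembled per vertex from its clustering coefficient time series. The sketch assumes a centered difference for Deriv and population variance for Var; neither choice is confirmed by the slides, and the function names are hypothetical.

```python
from statistics import fmean, pvariance

def centered_derivative(series):
    """Centered difference (x[t+1] - x[t-1]) / 2 at each interior time step."""
    return [(series[t + 1] - series[t - 1]) / 2
            for t in range(1, len(series) - 1)]

def outlier_features(cc_series):
    """(Mean(CC), Var(CC), Mean(Deriv(CC)), Var(Deriv(CC))) for one vertex.

    Requires at least three time steps so the derivative is non-empty.
    """
    deriv = centered_derivative(cc_series)
    return (fmean(cc_series), pvariance(cc_series),
            fmean(deriv), pvariance(deriv))

# Hypothetical clustering coefficient series for a single vertex.
features = outlier_features([0.0, 0.25, 0.5, 0.75, 1.0])
```

Stacking one such 4-tuple per vertex yields the low-dimensional point cloud on which an outlier detector can operate.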

Outlier Detection

Used a one-class SVM because the features are multimodal

Validation

● Inlier and Outlier distributions differ

● Outliers more uniformly distributed

● Mixing in each scatter plot means all dimensions are necessary

Conclusions

● Separating computation into a graph algorithms phase followed by a machine learning and statistics phase allows leveraging the best techniques from both fields.

● Applying multivariate outlier detection methods to streaming graphs reveals two distinct distributions of vertices.

● These feature-based methods enable dynamic visualization of much larger graphs than traditional two-dimensional embeddings.

Acknowledgment of Support

Future Work

Explore predictive ability in feature space

Signal Processing

Estimate periodicity. Filter out small deviations and trends.
