A Statistical Framework for Streaming Graph Analysis
James Fairbanks, David Ediger, Rob McColl, David A. Bader, Eric Gilbert
Problem
James Fairbanks, ASONAM 2013
In order to understand social media, we must understand the evolution of relationships in streaming data.
● How can we detect change?
● What is a significant change?
● Are two sets of vertices significantly different?
● How can we visualize 10,000+ vertices?
● Which vertices look anomalous?
We look to statistical analysis for guidance on these questions.
Challenge
● Graph data is big, sparse, irregular, messy, high dimensional
● Statistics works best on dense, regular, clean, low dimensional data
Solution
● Use graph-theoretic computations to embed the graph in a low-dimensional space
● Embedding is not topology preserving
● Do machine learning and statistics in Euclidean space
[Figures: Wikipedia articles; scatter plot of vertices in feature space]
Related Work
● Tracking Earthquakes [Sakaki, et al., 2010]
● Rumors about Earthquakes [Mendoza, et al., 2010]
● London Riots and Hashtags [Glasgow and Fink, 2013]
● Streaming Clustering Coefficient [Ediger, Riedy, et al., 2011]
● Atlanta Floods, H1N1 [Ediger, et al., 2010]
● Dynamic Visual Analysis [Federico, et al., 2012]
Definitions
Vertex [Features, Metrics, Statistics]
A vertex statistic associates a number to each vertex at each time step.
Graph Kernels
Computational subroutines that compute vertex features or maintain a data structure on top of the graph.
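As an illustration of a vertex statistic, the sketch below recomputes degree for every vertex after each batch of edge insertions. It is a minimal example under stated assumptions: the graph is undirected, edges are only inserted, and the function name `degree_per_batch` is hypothetical and not part of STINGER.

```python
from collections import defaultdict

def degree_per_batch(edge_batches):
    """Yield (time_step, {vertex: degree}) after each batch of edge insertions.

    Assumptions: undirected graph, insert-only stream, self-loops ignored.
    """
    neighbors = defaultdict(set)  # adjacency sets that grow as edges stream in
    for t, batch in enumerate(edge_batches):
        for u, v in batch:
            if u != v:
                neighbors[u].add(v)
                neighbors[v].add(u)
        # The vertex statistic: one number per vertex at this time step.
        yield t, {x: len(nbrs) for x, nbrs in neighbors.items()}

batches = [[("a", "b"), ("b", "c")], [("a", "c"), ("c", "d")]]
history = list(degree_per_batch(batches))
```

Here the "graph kernel" is the neighbor-counting loop, and the "vertex statistic" is the dictionary it produces at each time step.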
Examples
Vertex Features
● Degree
● Size of connected component
● Geodesic distance
● Local clustering coefficient
● PageRank
● Betweenness Centrality
Graph Kernels
● Counting neighbors
● Shiloach-Vishkin connected components
● Breadth First Search (BFS)
● Counting neighborhood intersection
● Power iteration
● Brandes 2001
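To make the pairing of feature and kernel concrete, here is a minimal sketch of one pair: breadth-first search as the kernel and geodesic distance as the feature. The adjacency-dictionary representation and the name `bfs_distances` are assumptions for illustration only.

```python
from collections import deque

def bfs_distances(adj, source):
    """Return {vertex: hop count from source} for every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj.get(u, ()):
            if w not in dist:  # first visit gives the shortest hop count
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

# Hypothetical toy graph: a path a-b-c plus an isolated vertex d.
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}, "d": set()}
dist_from_a = bfs_distances(adj, "a")
component_size = len(dist_from_a)  # vertices reachable from "a", including itself
```

The same traversal also yields the size of the connected component containing the source, although the slides list Shiloach-Vishkin as the kernel for that feature.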
STINGER
A high performance, dynamic graph data structure with semantic and temporal properties
● Supports concurrent streaming data sources and analysis
● Scalable on shared-memory Intel x86 platforms and Cray XMT
● Open source and free (BSD License)
● http://www.stingergraph.com
Data Set
● Hurricane Sandy public Tweets [28 Oct, 12 Nov 2012]
● 1,238,109 mentions
● 662,575 unique users
● Batches of 10,000 updates
● Update interval: 1 batch represents ~3 hours of Tweets
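The batching arithmetic can be checked directly: 1,238,109 mentions in batches of 10,000 give 124 batches, and at roughly 3 hours per batch that is about 372 hours, or 15.5 days, in line with the 28 Oct to 12 Nov window. The sketch below shows one way to cut a stream into such batches; the helper name `make_batches` is hypothetical.

```python
import math
from itertools import islice

def make_batches(updates, batch_size=10_000):
    """Yield consecutive lists of at most batch_size updates from an iterable."""
    it = iter(updates)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

n_batches = math.ceil(1_238_109 / 10_000)  # number of batches in the data set
approx_days = n_batches * 3 / 24           # assuming ~3 hours of Tweets per batch
small_demo = [len(b) for b in make_batches(range(25), batch_size=10)]
```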
photo credit: NASA Earth Observatory
Clustering Coefficient
● Where tri(v) is the number of 3-cycles containing v
● Measures how tightly knit the graph is at the local level [Watts, Strogatz, 98]
● Compute in time
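A minimal sketch of the local clustering coefficient follows. It assumes the common normalization 2·tri(v) / (deg(v)·(deg(v) − 1)), with tri(v) counting each triangle through v once; the normalization used in the talk may differ. Triangles are found by checking which pairs of neighbors are themselves adjacent, which is the neighborhood-intersection idea in its simplest form.

```python
from itertools import combinations

def local_clustering(adj):
    """Return {vertex: clustering coefficient} for an undirected adjacency dict.

    Assumption: coefficient = 2 * tri(v) / (deg(v) * (deg(v) - 1)),
    and vertices with degree < 2 get 0.0.
    """
    cc = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        if d < 2:
            cc[v] = 0.0
            continue
        # tri counts neighbor pairs of v that are connected to each other.
        tri = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc[v] = 2.0 * tri / (d * (d - 1))
    return cc

# Hypothetical toy graph: triangle a-b-c with a pendant vertex d attached to c.
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
cc = local_clustering(adj)
```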
Clustering Coefficient
NJ Landfall
● Counting vertices that have increasing or decreasing clustering coefficient
● Model as stochastic process for forecasting
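Counting vertices whose clustering coefficient rose or fell between two successive batches could look like the sketch below. Treating a vertex that is absent from the earlier batch as having coefficient 0, and the optional tolerance `tol`, are assumptions made for illustration.

```python
def count_changes(prev, curr, tol=0.0):
    """Return (n_increasing, n_decreasing) between two {vertex: value} snapshots."""
    up = down = 0
    for v, new in curr.items():
        old = prev.get(v, 0.0)  # assumption: unseen vertices start at 0
        if new - old > tol:
            up += 1
        elif old - new > tol:
            down += 1
    return up, down

prev = {"a": 0.5, "b": 0.2, "c": 0.0}
curr = {"a": 0.7, "b": 0.1, "c": 0.0, "d": 0.0}
up, down = count_changes(prev, curr)
```

The resulting pair of counts per batch forms the time series that would then be modeled as a stochastic process.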
Temporal Correlation
● Defined for any quantity measuring strength of association
● For Pearson’s correlation
● formula
● quantifies strength of association between successive measurements
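One plausible reading of temporal correlation with Pearson's coefficient is the correlation, taken over vertices, between a statistic at one time step and the same statistic at the next. The sketch below implements that reading; restricting to vertices present in both snapshots is an assumption, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def temporal_correlation(stat_t, stat_next):
    """Correlate a vertex statistic across two successive time steps."""
    common = sorted(set(stat_t) & set(stat_next))  # assumption: shared vertices only
    return pearson([stat_t[v] for v in common], [stat_next[v] for v in common])

rho_same = temporal_correlation({"a": 1, "b": 2, "c": 3}, {"a": 2, "b": 4, "c": 6})
rho_flip = temporal_correlation({"a": 1, "b": 2, "c": 3}, {"a": 3, "b": 2, "c": 1})
```

A value near 1 means the statistic at the next step is well predicted by the current one; values closer to 0 indicate that the new batch has reshuffled the vertices.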
Correlation Decay
● New edges change vertex statistics
● Correlation measures the forgetfulness of a vertex statistic
● Bigger graph implies less impact from new edges
Derivatives
The centered discrete derivative of a vertex feature:
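A common form of the centered difference is (f(v, t+1) − f(v, t−1)) / 2. The sketch below applies it to one vertex's time series, assuming unit spacing between batches; the exact definition used in the talk may differ, and `centered_derivative` is a hypothetical name.

```python
def centered_derivative(series):
    """Centered difference (f[t+1] - f[t-1]) / 2 at each interior time step.

    Assumption: consecutive samples are one time unit (one batch) apart.
    The result has two fewer entries than the input.
    """
    return [(series[t + 1] - series[t - 1]) / 2.0
            for t in range(1, len(series) - 1)]

deriv = centered_derivative([0, 1, 4, 9, 16])
```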
Anomaly Detection
In graphs:
● What is an anomalous vertex in a graph?
● In social media, who uses the service in a novel way?
● A vertex with edges that look different.
From statistics:
● Outlier: a point in a region of space with very low probability density.
● Points close to the outliers in space are rare.
● If we can estimate the true density from a finite sample, then we can find outliers.
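The density view of outliers can be sketched with a plain Gaussian kernel density estimate: score each point by the average kernel value to all sample points and flag the lowest-scoring fraction. This is a stand-in to illustrate the idea, not the one-class SVM used in the talk. The bandwidth of 0.3 and the 5% fraction are borrowed from the "Radius 0.3" and "5% of the data" settings on the outlier detection slide, and using them here is an assumption.

```python
import math

def gaussian_kde_scores(points, radius=0.3):
    """Unnormalized Gaussian kernel density estimate at each sample point."""
    scores = []
    for p in points:
        total = 0.0
        for q in points:
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            total += math.exp(-sq_dist / (2 * radius ** 2))
        scores.append(total / len(points))
    return scores

def flag_outliers(points, fraction=0.05, radius=0.3):
    """Return indices of the lowest-density fraction of points (at least one)."""
    scores = gaussian_kde_scores(points, radius)
    k = max(1, int(round(fraction * len(points))))
    order = sorted(range(len(points)), key=scores.__getitem__)
    return set(order[:k])

# Hypothetical data: 19 points in a tight cluster and one far away.
points = [(0.01 * i, 0.0) for i in range(19)] + [(5.0, 5.0)]
outliers = flag_outliers(points)
```

The quadratic loop is fine for a toy sample; a graph with hundreds of thousands of vertices would need a faster density estimator.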
Outlier Detection Features
● Mean(CC)
● Var(CC)
● Mean(Deriv(CC))
● Var(Deriv(CC))
● Gaussian Radial Basis Function
● Radius 0.3
● By choice 5% of the data is labeled outlier
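The four features could be computed per vertex from its clustering coefficient time series roughly as follows. This sketch assumes population variance and a centered difference with unit time steps for Deriv; neither choice is confirmed by the slides, and `outlier_features` is a hypothetical name.

```python
from statistics import fmean, pvariance

def outlier_features(cc_series):
    """Return (Mean(CC), Var(CC), Mean(Deriv(CC)), Var(Deriv(CC))) for one vertex.

    Assumptions: at least 3 samples, unit spacing between batches,
    population variance, centered-difference derivative.
    """
    deriv = [(cc_series[t + 1] - cc_series[t - 1]) / 2.0
             for t in range(1, len(cc_series) - 1)]
    return (fmean(cc_series), pvariance(cc_series),
            fmean(deriv), pvariance(deriv))

features = outlier_features([0.0, 0.2, 0.4, 0.6])
```

Stacking one such 4-tuple per vertex gives the point cloud in feature space on which the outlier detector operates.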
Outlier Detection
Used a one-class SVM because the features are multimodal
Validation
● Inlier and Outlier distributions differ
● Outliers more uniformly distributed
● Mixing in each scatter plot means all dimensions are necessary
Conclusions
● Separating computation into a graph algorithms phase followed by a machine learning and statistics phase allows leveraging the best techniques from both fields.
● Applying multivariate outlier detection methods to streaming graphs reveals two distinct distributions of vertices.
● These feature-based methods enable dynamic visualization of much larger graphs than traditional two-dimensional embeddings.
Acknowledgment of Support
Future Work
Explore predictive ability in feature space
Signal Processing
Estimate periodicity. Filter out small deviations and trends.