Reconstruction from Randomized Graph via Low Rank Approximation. Leting Wu, Xiaowei Ying, Xintao Wu. Dept. Software and Information Systems, Univ. of N.C. – Charlotte

Outline: Background & Motivation; Low Rank Approximation on Graph Data; Reconstruction from Randomized Graph; Evaluation; Privacy Issue

Background & Motivation

Background. When network data is published or outsourced for mining and analysis, pure anonymization is not enough to protect privacy because of topology-based attacks (active/passive attacks, subgraph attacks).
Graph randomization/perturbation:
Rand Add/Del: randomly add and delete edges (number of edges unchanged).
Rand Switch: randomly switch edges (node degrees unchanged).
Feature-preserving randomization: spectrum-preserving randomization; feature preservation via Markov-chain-based graph generation.
Clustering: grouping subgraphs into supernodes.
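A minimal sketch of Rand Add/Del, to make the "number of edges unchanged" property concrete. The function name `rand_add_del`, its parameters, and the choice to draw the added edges only from pairs that were non-edges in the original graph are assumptions for illustration; the slides do not specify these details.

```python
import random

def rand_add_del(edges, n, k, seed=None):
    """Delete k existing edges and add k new ones, keeping the edge count fixed.

    edges: iterable of (u, v) pairs of an undirected graph on nodes 0..n-1.
    Assumption: added edges are drawn from pairs that were non-edges originally.
    """
    rng = random.Random(seed)
    current = {tuple(sorted(e)) for e in edges}
    non_edges = n * (n - 1) // 2 - len(current)
    if k > len(current) or k > non_edges:
        raise ValueError("k is too large for this graph")
    # Delete k edges chosen uniformly at random from the existing edges.
    removed = set(rng.sample(sorted(current), k))
    kept = current - removed
    # Add k distinct edges by rejection sampling over original non-edges.
    added = set()
    while len(added) < k:
        u, v = rng.sample(range(n), 2)
        e = (min(u, v), max(u, v))
        if e not in current:
            added.add(e)
    return kept | added

# Usage: perturb a 6-node cycle by deleting and adding 2 edges.
cycle = [(i, (i + 1) % 6) for i in range(6)]
perturbed = rand_add_del(cycle, n=6, k=2, seed=0)
```

Rejection sampling is simple but slow on nearly complete graphs; for sparse social networks like the ones in the slides it is usually adequate.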

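Motivation. Our focus: can we reconstruct a graph $\hat{A}$ from the randomized graph $\tilde{A}$ such that the features of $\hat{A}$ are closer to those of the original graph $A$ than the features of $\tilde{A}$ are?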

Low Rank Approximation on Graph Data

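Adjacency Matrix & Its Eigen-Decomposition. Matrix representation of a network: adjacency matrix $A$ (symmetric). Eigen-decomposition: $A = \sum_{i=1}^{n} \lambda_i x_i x_i^T$, where $\lambda_i$ is the $i$-th eigenvalue and $x_i$ the corresponding eigenvector. Question: how do the eigenpairs relate to graph topology?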
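A small sketch of the eigen-decomposition of a symmetric adjacency matrix, assuming the standard form $A = \sum_i \lambda_i x_i x_i^T$. The toy graph (two triangles joined by one edge) is a hypothetical example, not one of the data sets from the talk.

```python
import numpy as np

# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n = 6
A = np.zeros((n, n))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0  # undirected graph, so A is symmetric

# eigh handles symmetric matrices and returns eigenvalues in ascending order;
# reorder so that the largest eigenvalue comes first.
eigvals, eigvecs = np.linalg.eigh(A)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rebuild A from all eigenpairs: sum_i lambda_i * x_i x_i^T.
A_rebuilt = (eigvecs * eigvals) @ eigvecs.T
```

Because the diagonal of an adjacency matrix is zero, the eigenvalues sum to zero, so any graph with at least one edge has both positive and negative eigenvalues.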

Leading Eigenpairs vs. Graph Topology. What is the role of positive and negative eigenpairs in graph topology? Without loss of generality, we partition the node set into two groups, so the adjacency matrix can be partitioned as $A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}$, where $A_{11}$ and $A_{22}$ represent the edges within the two groups and $A_{12}$ represents the edges between the groups.

Leading Eigenpairs vs. Graph Topology. [Figure panels: Original, r = 1, r = 2]

Leading Eigenpairs vs. Graph Topology. [Figure panels: Original, r = 1, r = 2]

Leading Eigenpairs vs. Graph Topology. [Figure panels: Original, r = 1, r = 2, r = 4]

Low Rank Approximation on Graph Data. Low rank approximation: $A_r = \sum_{i=1}^{r} \lambda_i x_i x_i^T$. This provides a best rank-$r$ approximation to $A$. To keep the structure of an adjacency matrix, the entries of $A_r$ are then discretized to 0/1.
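A hedged sketch of low rank approximation followed by discretization. The function name `low_rank_reconstruct`, the use of the r largest eigenvalues, and the 0.5 threshold are assumptions; the discretization rule is not recoverable from the slide text.

```python
import numpy as np

def low_rank_reconstruct(A, r, threshold=0.5):
    """Return (A_r, A_hat): the rank-r approximation and its 0/1 discretization.

    Assumptions: the r largest eigenvalues are kept, and an entry becomes an
    edge when it is >= threshold (0.5 here is an illustrative choice).
    """
    eigvals, eigvecs = np.linalg.eigh(A)          # A must be symmetric
    top = np.argsort(eigvals)[::-1][:r]           # indices of r largest eigenvalues
    X_r = eigvecs[:, top]
    A_r = (X_r * eigvals[top]) @ X_r.T            # sum of lambda_i * x_i x_i^T
    A_hat = (A_r >= threshold).astype(int)
    np.fill_diagonal(A_hat, 0)                    # no self-loops
    return A_r, A_hat

# Usage: two disjoint 4-cliques are recovered exactly from r = 2 eigenpairs.
block = np.ones((4, 4)) - np.eye(4)
A = np.block([[block, np.zeros((4, 4))], [np.zeros((4, 4)), block]])
A_r, A_hat = low_rank_reconstruct(A, r=2)
```

In this toy case the two leading eigenpairs carry the whole community structure, which is why thresholding the rank-2 matrix returns the original graph.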

Reconstruction from Randomized Graph

Reconstructed Features (Political Blogs, Rand Add/Del 40% of Edges)

Determine Number of Eigen-pairs. Question: how do we choose an optimal rank r for reconstruction? Solution: choose [feature symbol missing] as the indicator, since it is closely related to the other features and has an explicit moment estimator [formula missing], where m is the number of edges and k is the number of edges added/deleted.

Algorithm

Evaluation

Effect of Noise (Political Blogs). The method works well up to a certain level of noise. Even with a high level of noise, the reconstructed features are still closer to the original ones than the randomized ones are.

Reconstructed Features on 3 Real Network Data Sets. Reconstruction quality: when it is positive, the reconstructed features are closer to the original ones than the randomized ones are. It is positive for all three data sets.

Privacy Issue

Privacy Issue. Question 1: Can this reconstruction be used by attackers? Measure: the normalized Frobenius distance between $A$ and the reconstructed graph $\hat{A}$. [Figure: normalized F norm for Political Books, Enron, and Political Blogs]
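A rough sketch of a normalized Frobenius distance between two adjacency matrices. The normalization by $\|A\|_F$ is an assumption, since the defining formula is missing from the slide text; the function name is hypothetical.

```python
import numpy as np

def normalized_frobenius_distance(A, A_hat):
    """||A - A_hat||_F / ||A||_F (normalization by ||A||_F is an assumption)."""
    A = np.asarray(A, dtype=float)
    A_hat = np.asarray(A_hat, dtype=float)
    return np.linalg.norm(A - A_hat, "fro") / np.linalg.norm(A, "fro")

# Usage: a single-edge graph compared with itself and with the empty graph.
A = np.array([[0, 1], [1, 0]])
d_same = normalized_frobenius_distance(A, A)
d_empty = normalized_frobenius_distance(A, np.zeros((2, 2)))
```

Under this normalization, a distance near 0 means almost every edge and non-edge was recovered, which is the case that matters for privacy.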

Privacy Issue. Question 2: Which types of graphs would have their privacy breached? For low rank graphs, the distance between the reconstructed graph and the original graph can be very small.

Synthetic Low Rank Graphs. For a set of synthetic low rank graphs generated from Political Blogs, the reconstruction works well in terms of both distance and features.

Conclusion. We show the relationship between graph topological structure and the eigenpairs of the adjacency matrix. We propose a reconstruction algorithm based on low rank approximation, with a novel solution for determining the optimal rank. For most social networks, our algorithm does not incur further disclosure risks to individual privacy, except for networks with low rank or a small number of dominant eigenvalues.

Questions? Acknowledgments: This work was supported in part by U.S. National Science Foundation grants IIS-0546027 and CNS-0831204. Thank you!

Background. Publish/outsource data for mining/analysis. [Diagram: the data owner releases/publishes the original graph data to the public, third parties, or research institutions, where it is under attack.] Privacy: protect sensitive data (identity, relationship, sensitive attributes). Utility: preserve features/patterns/distributions of data.

Background: Spectral Filter for Numerical Data. Goal: derive an estimate of U from the perturbed data. Calculate the covariance matrix of the perturbed data. Apply spectral decomposition to this covariance matrix. Derive the eigenvalue information from the covariance matrix of the noise V and choose a proper number of dimensions, r. Obtain the estimated data set from the r selected eigen-components.
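The steps above can be sketched as follows. This is an illustrative version only: the projection $\hat{U} = \tilde{U} Q_r Q_r^T$ onto the r leading eigenvectors, the mean-centering, and the name `spectral_filter` are assumptions, and r is passed in by hand rather than derived from the noise covariance as the slide describes.

```python
import numpy as np

def spectral_filter(U_tilde, r):
    """Estimate the original data from perturbed data U_tilde = U + V.

    Assumption: project the centered data onto the r leading eigenvectors
    of its sample covariance matrix.
    """
    mean = U_tilde.mean(axis=0)
    centered = U_tilde - mean
    C = np.cov(centered, rowvar=False)            # covariance of perturbed data
    eigvals, eigvecs = np.linalg.eigh(C)          # spectral decomposition
    Q_r = eigvecs[:, np.argsort(eigvals)[::-1][:r]]
    return centered @ Q_r @ Q_r.T + mean          # projection onto r components

# Usage: rank-1 signal in 5 dimensions plus additive Gaussian noise.
rng = np.random.default_rng(0)
U = rng.normal(size=(500, 1)) @ np.array([[1.0, 2.0, 3.0, 4.0, 5.0]])
V = rng.normal(scale=0.5, size=(500, 5))
U_hat = spectral_filter(U + V, r=1)
```

The filter helps here because the signal lives in one direction while the noise is spread over all five, so the projection discards most of the noise energy.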

New Challenges. A is a 0-1 adjacency matrix, whereas U is a numerical matrix. A covariance matrix is positive semi-definite and has only non-negative eigenvalues, whereas A has both positive and negative eigenvalues. A covariance matrix cannot be defined for graph data. The strategy used for numerical data to determine the number of eigen-components does not work for graph data, since the first eigenvalue of the noise matrix can be very large.

Data Sets.
Political Blogs: based on incoming and outgoing links and posts around the time of the 2004 presidential election; 16714 links among 1222 US political blogs.
Political Books: based on political books sold by Amazon.com, where nodes represent books and edges represent co-purchases; 105 nodes and 441 edges.
Enron: based on the email corpus of a real organization covering a 3-year period, where an edge means at least 5 emails were sent between two people; 151 nodes and 869 edges.

Future Work. Study whether a similar LRA reconstruction can be derived for other edge-based perturbation strategies such as Rand Switch and K-Anonymity. Reconstruction of distributions from networked data: distribution of networked data? Randomization mechanism. Privacy vs. utility (in general social networks and with background-knowledge attacks). Spectral analysis of graph topology (signed/weighted/directed graphs).
