
A Tutorial of Privacy-Preservation of Graphs and Social Networks


Xintao Wu, Xiaowei Ying

University of North Carolina at Charlotte

National Freedom of Information

Data Protection Laws

National Laws
USA
- HIPAA for health care: passed August 21, 1996; sets the lowest bar, and the states are welcome to enact more stringent rules
- California State Bill 1386
- Gramm-Leach-Bliley Act of 1999 for financial institutions
- COPPA for children's online privacy
- etc.
Canada
- PIPEDA 2000 (Personal Information Protection and Electronic Documents Act), effective from Jan 2004

European Union (Directive 95/46/EC)
- Passed by the European Parliament in Oct 1995 and effective from Oct 1998
- Provides guidelines for member state legislation
- Forbids sharing data with states that do not protect privacy

Privacy Breach
AOL's publication of the search histories of more than 650,000 of its users has yielded more than just one of the year's bigger privacy scandals. (Aug 6, 2006)

That database does not include names or user identities; it lists only a unique ID number for each user. AOL user 710794 is:

- an overweight golfer, owner of a 1986 Porsche 944 and 1998 Cadillac SLS, and a fan of the University of Tennessee Volunteers Men's Basketball team;
- interested in the Cherokee County School District in Canton, Ga., and has looked up the Suwanee Sports Academy in Suwanee, Ga., which caters to local youth, and the Youth Basketball of America's Georgia affiliate;
- regularly searches for "lolitas," a term commonly used to describe photographs and videos of minors who are nude or engaged in sexual acts.

Source: "AOL's disturbing glimpse into users' lives" by Declan McCullagh, CNET News.com, August 7, 2006.

Privacy Preserving Data Mining
Data mining: the goal of data mining is summary results (e.g., classification, clustering, association rules) derived from the data distribution.

Individual privacy
- Individual values in the database must not be disclosed, or at least attackers should not be able to estimate them closely.
- Contractual limitations: privacy policies, corporate agreements.

Privacy Preserving Data Mining
How to transform data such that:
- we can build a good data mining model (data utility),
- while preserving privacy at the record level (privacy)?

PPDM on Tabular Data

ssn | name | zip   | race  | age | sex | Bal | income | IntP
    |      | 28223 | Asian | 20  | M   | 10k | 85k    | 2k
    |      | 28223 | Asian | 30  | F   | 15k | 70k    | 18k
    |      | 28262 | Black | 20  | M   | 50k | 120k   | 35k
    |      | 28261 | White | 26  | M   | 45k | 23k    | 134k
    |      | ...   | ...   | ... | ... | ... | ...    | ...
    |      | 28223 | Asian | 20  | M   | 80k | 110k   | 15k

- 69% unique on zip and birth date; 87% with zip, birth date, and gender
- Generalization (k-anonymity, l-diversity, t-closeness, etc.) and randomization
- Refer to the survey book [Aggarwal, 08]

PPDM Tutorials on Tabular Data
- Privacy in data systems, Rakesh Agrawal, PODS03
- Privacy preserving data mining, Chris Clifton, PKDD02, KDD03
- Models and methods for privacy preserving data publishing and analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06
- Cryptographic techniques in privacy preserving data mining, Helger Lipmaa, PKDD06
- Randomization based privacy preserving data mining, Xintao Wu, PKDD06
- Privacy in data publishing, Johannes Gehrke & Ashwin Machanavajjhala, S&P09
- Anonymized data: generation, models, usage, Graham Cormode & Divesh Srivastava, SIGMOD09

Social Network

Network of US political books (105 nodes, 441 edges): books about US politics sold by Amazon.com. Edges represent frequent co-purchasing of books by the same buyers. Nodes are colored blue, white, or red to indicate whether they are "liberal", "neutral", or "conservative".

Social Network
Network of political blogs on the 2004 U.S. election (polblogs, 1,222 nodes and 16,714 edges)

Social Network

Collaboration network of scientists [Newman, PRE06]

More Social Network Data
- Newman's collection: http://www-personal.umich.edu/~mejn/netdata/

- Enron data: http://www.cs.cmu.edu/~enron/

- Stanford large network dataset collection: http://snap.stanford.edu/data/index.html

Graph Mining
A very hot research area:
- Graph properties such as degree distribution
- Motif analysis
- Community partition and outlier detection
- Information spreading
- Resiliency/robustness, e.g., against virus propagation
- Spectral analysis
Research development:
- Managing and Mining Graph Data by Aggarwal and Wang, Springer 2010
- Large graph mining: power tools and a practitioner's guide by Faloutsos et al., KDD09

Network Science and Privacy

Source: Jeannette Wing, Computing research: a view from DC, SNOWBIRD, 2008.

Outline
- Attacks on Naively Anonymized Graphs
- Privacy Preserving Social Network Publishing
  - K-anonymity
  - Generalization
  - Randomization
  - Other Works
- Output Perturbation
  - Background on differential privacy
  - Accurate analysis of private network data

Social Network Data Publishing

Data owner

Data miner

The data owner releases an anonymized version of the table to the data miner:

name   | sex | age | disease | salary
Ada    | F   | 18  | cancer  | 25k
Bob    | M   | 25  | heart   | 110k
Cathy  | F   | 20  | cancer  | 70k
Dell   | M   | 65  | flu     | 65k
Ed     | M   | 60  | cancer  | 300k
Fred   | M   | 24  | flu     | 20k
George | M   | 22  | cancer  | 45k
Harry  | M   | 40  | flu     | 95k
Irene  | F   | 45  | heart   | 70k

Released (naively anonymized) table:

id | sex | age | disease | salary
5  | F   | Y   | cancer  | 25k
3  | M   | Y   | heart   | 110k
6  | F   | Y   | cancer  | 70k
1  | M   | O   | flu     | 65k
7  | M   | O   | cancer  | 300k
2  | M   | Y   | flu     | 20k
9  | M   | Y   | cancer  | 45k
4  | M   | M   | flu     | 95k
8  | F   | M   | heart   | 70k

Threat of Re-identification


The attacker links a released record back to Ada: Ada's sensitive information is disclosed.
Privacy breaches:
- Identity disclosure
- Link disclosure
- Attribute disclosure

Deriving Personal Identifying Information [Gross WPES05]
User profiles (e.g., photo, birth date, residence, interests, friend links) can be used to estimate personal identifying information such as an SSN (###-##-####).

Users should pay attention to the (default) privacy preference settings of online social networks.
(SSN structure: area number determined by zip code, then group number and sequential number; see https://secure.ssa.gov/apps10/poms.nsf/lnx/0100201030)

Active and Passive Attacks [Backstrom WWW07]

Active attack outline:
1. Join the network by creating some new user accounts.
2. Establish a highly distinguishable subgraph H among the attacking nodes.
3. Send links to targeted individuals from the attacking nodes.
4. In the released graph, identify the subgraph H among the attacking nodes.
5. The targeted individuals and their links are then identified.
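To make step 4 concrete, here is a small illustrative sketch (not the construction from [Backstrom WWW07]): it plants a hand-made subgraph H in a synthetic graph, relabels all nodes, and re-identifies H with networkx's VF2 subgraph-isomorphism matcher. The graph size, the shape of H, and the target ids are all made up for the example.

```python
# Toy sketch of the re-identification step of an active attack (illustrative only).
import random
import networkx as nx
from networkx.algorithms import isomorphism

# Background network the attacker has infiltrated.
G = nx.erdos_renyi_graph(200, 0.02, seed=1)

# Attacker nodes with a deliberately unusual internal wiring (subgraph H).
attackers = [200, 201, 202, 203, 204]
H_edges = [(200, 201), (200, 202), (200, 203), (201, 202), (202, 204)]
G.add_edges_from(H_edges)
# Links from attacker nodes to targeted individuals.
targets = [5, 17, 42]
G.add_edges_from([(200, 5), (201, 17), (203, 42)])
H = nx.Graph(H_edges)

# "Naive anonymization": the released graph keeps the structure but shuffles labels.
perm = list(G.nodes())
random.Random(7).shuffle(perm)
G_anon = nx.relabel_nodes(G, dict(zip(G.nodes(), perm)))

# Step 4: locate H inside the released graph via subgraph isomorphism (VF2).
gm = isomorphism.GraphMatcher(G_anon, H)
match = next(gm.subgraph_isomorphisms_iter(), None)  # maps anonymized node -> attacker node
if match:
    found = {attacker: anon for anon, attacker in match.items()}
    print("attacker nodes re-identified as:", found)
```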

Active and Passive Attacks [Backstrom WWW07]

Active attacks & the subgraph H
The active attack is based on the subgraph H among the attackers:
- no other subgraph of G is isomorphic to H;
- H has no non-trivial automorphism;
- H can be identified efficiently regardless of G.

Active and Passive Attacks [Backstrom WWW07]
Passive attack outline:
- Observation: most nodes in the network already belong to a uniquely identifiable subgraph.
- One adversary recruits k-1 of his neighbors to form the subgraph H of size k.
- Works similarly to active attacks.

Drawback: uniqueness of H is not guaranteed.

Attacks by Structural Queries [Hay VLDB08]
- Structural queries: a structural query Q represents complete or partial structural information about a targeted individual that may be available to adversaries.
- Structural queries and identity privacy.

Attacks by Structural Queries [Hay VLDB08]
Degree sequence refinement queries.

Attacks by Structural Queries [Hay VLDB08]
Subgraph queries: the adversary is capable of gathering a fixed number of edges around the targeted individual.

Hub fingerprint queries: a hub is a central node in a network. A hub fingerprint of node v is the node's connections to a set of designated hubs within a certain distance.

Attacks by Combining Multiple Graphs [Narayanan ISSP09]
Attack outline:
- The attacker has two types of auxiliary information:
  - Aggregate: an auxiliary graph whose members overlap with the anonymized target graph.
  - Individual: detailed information on a very small number of individuals (called seeds) present in both the auxiliary graph and the target graph.
- Identify the seeds in the target graph.
- Identify more nodes by comparing the neighborhoods of the de-anonymized nodes in the auxiliary graph and the target graph (propagation).

Deriving the Link Structure of the Entire Network [Korolova ICDE08]
A different threat in which:
- an adversary subverts user accounts to obtain local neighborhoods and pieces them together to build the entire network; no underlying network is released;
- a registered user can often see all the links and nodes incident to him within distance d (d=0 if a user can see who he links to, d=1 if he can also see who links to all his friends);
- analysis showed that the number of local neighborhoods needed to cover a given fraction of the entire network drops exponentially as the lookahead parameter d increases.
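The lookahead claim can be checked with a quick toy experiment; the sketch below (my own setup, not the paper's) measures the fraction of edges visible from a set of subverted accounts for different lookahead values d on a synthetic scale-free graph. The graph model, number of accounts, and seeds are arbitrary assumptions.

```python
# Toy check of lookahead coverage (illustrative; parameters are arbitrary).
import random
import networkx as nx

G = nx.barabasi_albert_graph(2000, 3, seed=0)
all_edges = {frozenset(e) for e in G.edges()}

def coverage(num_accounts, d, seed=1):
    """Fraction of edges incident to some node within distance d of a subverted account."""
    rng = random.Random(seed)
    accounts = rng.sample(list(G.nodes()), num_accounts)
    seen = set()
    for u in accounts:
        near = nx.single_source_shortest_path_length(G, u, cutoff=d)  # nodes within distance d
        for a in near:
            for b in G.neighbors(a):
                seen.add(frozenset((a, b)))
    return len(seen) / len(all_edges)

for d in (0, 1, 2):
    print("lookahead", d, "covers", round(coverage(200, d), 3), "of the edges")
```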

Privacy Preserving Social Network Publishing
Naive anonymization is not sufficient to prevent privacy breaches, mainly due to link-structure based attacks. The graph topology has to be modified via:
- adding/deleting edges/nodes,
- grouping nodes/edges into super-nodes and super-edges.
How do we quantify utility loss and privacy preservation in the perturbed (and anonymized) graph?

Graph Utility
- Utility heavily depends on the mining task.
- It is challenging to quantify the information loss in perturbed graph data:
  - unlike tabular data, we cannot use the sum of the information loss of each individual record;
  - we cannot use histograms to approximate the distribution of graph topology.
- It is even more challenging when considering both structure change and node attribute change.

Graph Utility
- Topological features: structural characteristics of the graph; various measures from different perspectives; commonly used.
- Spectral features: defined as eigenvalues of the graph's adjacency matrix or other derived matrices; closely related to many topological features; can provide global graph measures.
- Aggregate queries: compute an aggregate over paths or subgraphs satisfying the query condition, e.g., the average distance from a "medical doctor" vertex to a "teacher" vertex in a network.

Topological Features
Topological features of networks:

- Harmonic mean of shortest distances
- Transitivity (clustering coefficient)
- Subgraph centrality
- Modularity (community structure)
- And many others (refer to F. Costa et al., Characterization of Complex Networks: A Survey of Measurements, 2006)
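A minimal sketch of computing a few of these measures with networkx; the karate-club graph used here is just networkx's built-in toy example, not one of the tutorial datasets.

```python
# Computing a few common topological features (illustrative sketch).
import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()
n = G.number_of_nodes()

# Harmonic mean of shortest distances (unreachable pairs contribute 0 to the reciprocal sum).
inv_sum = 0.0
for u, dists in nx.all_pairs_shortest_path_length(G):
    for v, d in dists.items():
        if u != v:
            inv_sum += 1.0 / d
harmonic_mean_distance = (n * (n - 1)) / inv_sum

# Transitivity (global clustering coefficient).
transitivity = nx.transitivity(G)

# Modularity of a greedy community partition.
parts = community.greedy_modularity_communities(G)
modularity = community.modularity(G, parts)

print(round(harmonic_mean_distance, 3), round(transitivity, 3), round(modularity, 3))
```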

Researchers have proposed various measures to describe a social network from different perspectives; four of them are frequently used in our evaluations. The average shortest distance is usually small for many real-world graphs (the small-world phenomenon). Transitivity reflects that in real social networks there are usually many triangles around each node. Subgraph centrality takes all closed walks into account (a triangle is a closed walk of length 3). Modularity measures the quality of a given community partition.

Graph and Matrix
Adjacency matrix A:

- For an undirected graph, A is symmetric.
- No self-links: the diagonal entries of A are all 0.
- For an unweighted graph, A is a 0-1 matrix.

Spectral Features
Spectral features of networks:
- Adjacency spectrum: the eigenvalues λ1 ≥ λ2 ≥ ... ≥ λn of the adjacency matrix A

- Laplacian spectrum: the eigenvalues 0 = μ1 ≤ μ2 ≤ ... ≤ μn of the Laplacian matrix L = D - A

Another type of graph measure is the spectral measures, defined as eigenvalues of graph matrices. The adjacency matrix is symmetric and has n real eigenvalues, usually sorted in descending order. Besides the adjacency matrix, we also have the Laplacian matrix and the normal matrix. The n eigenvalues of the Laplacian matrix are non-negative, the smallest is always 0, and its associated eigenvector has all entries equal. The eigenvalues of the normal matrix always lie between -1 and 1, and the largest is always 1.

Topological vs. Spectral Features
Adjacency and Laplacian spectrum:
- The maximum degree, chromatic number, clique number, etc. are related to λ1.
- The epidemic threshold for virus propagation in the network is related to λ1 (roughly 1/λ1).
- The Laplacian spectrum indicates the community structure:
  - k disconnected communities: μ1 = ... = μk = 0;
  - k loosely connected communities: μ2, ..., μk are close to 0.

The eigenvalues of graph matrices reflect important information about the graph. Two examples: the largest eigenvalue of the adjacency matrix is related to the maximum degree, the maximum clique, and the epidemic threshold; the second smallest eigenvalue of the Laplacian matrix is a good indicator of community structure, and it is close to 0 when the graph has a clear community structure.

Topological vs. Spectral Features
Laplacian spectrum & communities:
- Disconnected communities: 0.00 0.00 0.00 1.27 2.59 3.00 3.00 3.00 4.00 4.00 4.00 4.73 5.00 5.41 6.00 6.00 6.00 6.00 6.00
- Loosely connected communities: 0.00 0.11 0.34 1.31 2.60 3.00 3.10 3.36 4.00 4.13 4.59 4.79 5.31 5.58 6.00 6.00 6.00 6.66 7.12
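A small numerical illustration of these Laplacian facts on a synthetic two-community graph (my own toy setup, using networkx and numpy):

```python
# Sketch: Laplacian spectrum for disconnected vs. loosely connected communities.
import numpy as np
import networkx as nx

def laplacian_spectrum(G):
    L = nx.laplacian_matrix(G).toarray().astype(float)
    return np.sort(np.linalg.eigvalsh(L))

# Two communities with no edges between them: two zero eigenvalues.
G_disc = nx.disjoint_union(nx.erdos_renyi_graph(20, 0.4, seed=1),
                           nx.erdos_renyi_graph(20, 0.4, seed=2))
# Add a few cross edges to make the communities loosely connected.
G_loose = G_disc.copy()
G_loose.add_edges_from([(0, 20), (5, 25)])

print(np.round(laplacian_spectrum(G_disc)[:4], 3))   # starts with two exact zeros
print(np.round(laplacian_spectrum(G_loose)[:4], 3))  # second eigenvalue small but > 0
```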

Eigenspace [Ying SDM09]

Suppose the graph has k communities. Take the eigenvectors of the largest k adjacency eigenvalues and define the i-th row of the resulting matrix as the spectral coordinates of node i. The polbooks network has two communities; plotting the spectral coordinates in the 2-dimensional spectral space shows an interesting pattern: the nodes form two lines, nodes from the same community lie on the same line, and the two lines are approximately orthogonal to each other.

Topological vs. Spectral Features
Topological and spectral features are related:

- Number of triangles: (1/6) Σ_i λ_i^3 (sum over the adjacency eigenvalues)
- Subgraph centrality: SC = Σ_i e^{λ_i}
- Graph diameter: upper bounded by a function of the Laplacian eigenvalues
(A small numerical check of the first two relations is sketched below.)

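The check below is a sketch using networkx and numpy; the karate-club graph is just a convenient built-in example.

```python
# Check: #triangles = (1/6) * sum(lambda_i^3); subgraph centrality = sum(exp(lambda_i)).
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)
lam = np.linalg.eigvalsh(A)

triangles_spectral = np.sum(lam ** 3) / 6.0
triangles_direct = sum(nx.triangles(G).values()) / 3   # each triangle is counted at 3 nodes

sc_spectral = np.sum(np.exp(lam))
sc_direct = sum(nx.subgraph_centrality(G).values())

print(round(triangles_spectral, 3), triangles_direct)
print(round(sc_spectral, 3), round(sc_direct, 3))
```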

There are close relations between the spectral and the topological measures. For example, the total number of triangles can be computed from the adjacency eigenvalues, and so can the subgraph centrality. The graph diameter is upper bounded by a function of the Laplacian eigenvalues, and if the graph has k disconnected components, the Laplacian spectrum has k zero eigenvalues.

K-anonymity Privacy Preservation
- K-anonymity (Sweeney): each individual is identical with at least K-1 other individuals.
- A general definition for network data [Hay, VLDB08].


K-anonymity
- Each node is identical with at least K-1 other nodes under topology-based attacks.
- The adversary is assumed to have some knowledge of the target user:
  - node degree (K-degree),
  - (immediate) neighborhood (K-neighborhood),
  - arbitrary subgraph (K-automorphism, etc.).
- The K-anonymity approach guarantees that no node in the released graph can be linked to a target individual with success probability greater than 1/K.

K-degree Anonymity [Liu SIGMOD08]
Attacking model: the attacker knows the degree of the targeted individual.

K-degree Anonymity [Liu SIGMOD08]
K-degree anonymous: a graph is K-degree anonymous if, for every node v, there are at least K-1 other nodes with the same degree as v (i.e., the degree sequence is K-anonymous).

Optimize utility: minimize the number of added edges.

K-degree Anonymity [Liu SIGMOD08]
Algorithm outline:
1. Starting from the original degree sequence, construct a K-anonymous degree sequence that is as close to it as possible.
2. Construct a graph realizing the new degree sequence while preserving as many original edges as possible.
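For illustration only, here is a simplified greedy version of step 1; it is not the dynamic-programming algorithm of [Liu SIGMOD08], and it ignores the graph-construction (realizability) step.

```python
# Simplified sketch of degree-sequence anonymization (greedy grouping, not the paper's DP).
def anonymize_degree_sequence(degrees, k):
    """Return a k-anonymous degree sequence d' with d'[i] >= d[i] (degrees sorted descending)."""
    d = sorted(degrees, reverse=True)
    out = []
    i = 0
    while i < len(d):
        group = d[i:i + k]
        # If fewer than k values would remain afterwards, absorb them into this group.
        if len(d) - (i + k) < k:
            group = d[i:]
        out.extend([group[0]] * len(group))   # raise every degree in the group to the maximum
        i += len(group)
    return out

print(anonymize_degree_sequence([5, 4, 4, 3, 2, 2, 1], k=2))
# -> [5, 5, 4, 4, 2, 2, 2]  (every degree value now appears at least twice)
```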

K-neighborhood Anonymity [Zhou ICDE08]
Attacking model: the attacker knows the immediate (1-hop) neighborhood of a targeted individual.

K-neighborhood Anonymity [Zhou ICDE08]
Problem: anonymize the graph so that every node's 1-neighborhood is isomorphic to the 1-neighborhoods of at least K-1 other nodes, with as little distortion as possible.

K-neighborhood Anonymity [Zhou ICDE08]
Algorithm outline:
1. Extract the neighborhoods of all vertices in the network.
2. Compare and test all neighborhoods by neighborhood component coding.
3. Organize vertices into groups and anonymize the neighborhoods of vertices in the same group until the graph satisfies K-neighborhood anonymity.

(Figure: the 1-neighborhood of Ada in the naively anonymized graph vs. the K-neighborhood anonymized graph.)

K-neighborhood Anonymity [Zhou ICDE08]
Graph utility:
- The nodes carry hierarchical label information.
- Two ways to anonymize the neighborhoods: generalizing labels and adding edges.
- Goal: answer aggregate network queries as accurately as possible.

K-automorphism Anonymity [Zou VLDB09]
Attacking model: the attacker may know any subgraph that contains the targeted individual.
Graph automorphism: a permutation of the vertex set that preserves adjacency (an isomorphism from the graph onto itself).

K-automorphism Anonymity [Zou VLDB09]
K-automorphic graph: a graph with enough non-trivial automorphisms that every vertex is structurally indistinguishable from at least K-1 other vertices.

A graph with n nodes and m edges can have various structures. An edge can be directed or undirected. For example, the friendship is usually mutual, while the links between web pages usually directions. In some scenarios, there is weight associated to each link, higher weight means more communication. The edges can even have signs, friends have positive edges, and enemies have negative links. In my research, I mainly focus on the undirected, un-weighted, and unsigned graph. This type of graph can be represented by the adjacency matrix. A_ij is equal to 1 if there is an edge between node I and j. The degree of node I is the number of edges connected to node i. Laplacian and normal matrix also commonly used data structure for social network.50K-automorphism Anonymity [Zou VLDB09]Algorithm outlinePartition graph G into several groups of subgraphs, each group contains at least K subgraphs, and no subgraphs share a node.Block Alignment: make subgraphs within each group isomorphic to each other.Edge Copy: copy the edges across the subgraphs properly.51

K-symmetry Model [Wu EDBT11]
Attacking model: the attacker may know any subgraph that contains the targeted individual.
K-symmetry approach:
- A concept similar to K-automorphism (equivalent?).
- Make the graph K-symmetric by adding fake nodes.

K-isomorphism Model [Cheng SIGMOD10]
Attacking model: the attacker may know any subgraph that contains the targeted individual.
The K-automorphism approach provides insufficient protection of link privacy.

Example: the adversary cannot identify Alice or Bob, but can still infer that there must be a link between them.

K-isomorphism Model [Cheng SIGMOD10]
K-security graph: the published graph is partitioned into K pairwise disjoint, isomorphic subgraphs, which protects both node identity and link privacy.

K-anonymity
(Comparison of the approaches in terms of privacy protection vs. utility preservation: K-degree, K-neighborhood, K-automorphism, K-symmetry, K-security.)

K-obfuscation [Bonchi ICDE11]

- Both cases respect 2-candidate anonymity.
- K-candidate anonymity aims at guaranteeing a lower bound on the amount of uncertainty.
- K-obfuscation measures the uncertainty.
- The obfuscation level quantified by means of entropy is always no less than the one based on a-posteriori belief probabilities.

Generalization Approach [Hay VLDB08]
Generalize nodes into super-nodes and edges into super-edges.

Generalization Approach [Hay VLDB08]
The size of the possible graph world (the number of graphs consistent with the released generalized graph):

Maximize the graph likelihood function with a simulated annealing algorithm:
- start with a single partition containing all nodes;
- update the state by splitting/merging partitions or moving a node to another partition.

Anonymizing Rich Graphs [Bhagat VLDB09]
- Hyper-graph: G(V, I, E) represents multiple types of interactions between entities.
- Attacking model: the attacker knows part of the links and nodes in the graph.

Anonymizing Rich Graphs [Bhagat VLDB09]
Algorithm outline:
1. Sort the nodes according to the attributes.
2. Group nodes into super-nodes satisfying the class safety property, with each super-node of size greater than K. (Class safety: no node may have interactions with two or more nodes from the same group.)
3. Replace the node identifiers by the label list.

Basic Graph Randomization Operations
- Rand Add/Del: randomly add k false edges and delete k true edges (the number of edges is unchanged).

- Rand Switch: randomly switch a pair of edges, and repeat this k times (node degrees are unchanged). A sketch of both operations follows below.
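A minimal sketch of the two operations with networkx; the function names and the karate-club test graph are my own choices for the example.

```python
# Sketch of the two basic randomization operations (illustrative).
import random
import networkx as nx

def rand_add_del(G, k, seed=0):
    """Delete k true edges and add k false edges; the number of edges is unchanged."""
    rng = random.Random(seed)
    H = G.copy()
    H.remove_edges_from(rng.sample(list(H.edges()), k))
    while H.number_of_edges() < G.number_of_edges():
        u, v = rng.sample(list(H.nodes()), 2)
        if not H.has_edge(u, v) and not G.has_edge(u, v):   # add a truly false edge
            H.add_edge(u, v)
    return H

def rand_switch(G, k, seed=0):
    """Repeat k times: pick edges (u,v),(s,t) and rewire to (u,t),(s,v); degrees unchanged."""
    rng = random.Random(seed)
    H = G.copy()
    done = 0
    while done < k:
        (u, v), (s, t) = rng.sample(list(H.edges()), 2)
        if len({u, v, s, t}) == 4 and not H.has_edge(u, t) and not H.has_edge(s, v):
            H.remove_edges_from([(u, v), (s, t)])
            H.add_edges_from([(u, t), (s, v)])
            done += 1
    return H

G = nx.karate_club_graph()
H = rand_switch(G, 100)
print(sorted(d for _, d in G.degree()) == sorted(d for _, d in H.degree()))   # True
```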

Our randomization procedures are based on two basic operations: adding/deleting edges and switching edge pairs. For the random add/delete procedure, we randomly add k false links and delete k true links. For random switch, we randomly pick a pair of edges, switch them, and repeat this process k times; this procedure preserves the degree of each node.

Randomization
Randomized response model [Warner 1965]
A: cheated in the exam; Ā: did not cheat in the exam.

Purpose: estimate the proportion π of population members that cheated in the exam.
Procedure (randomization device): with probability p the respondent is asked "Do you belong to A?", and with probability 1-p "Do you belong to Ā?"; the respondent answers Yes/No truthfully.
Unbiased estimate: if λ is the observed proportion of Yes answers, then π̂ = (λ - (1 - p)) / (2p - 1), provided p ≠ 1/2.
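A small simulation of the Warner model; the true proportion, the device probability p, and the sample size below are arbitrary choices for the example.

```python
# Simulation sketch of Warner's randomized response model.
import random

def randomized_response(truth, p, rng):
    """With prob. p ask 'are you in A?', with prob. 1-p ask 'are you in not-A?'; answer truthfully."""
    ask_A = rng.random() < p
    return truth if ask_A else (not truth)

rng = random.Random(0)
pi_true, p, n = 0.3, 0.7, 100_000          # true proportion of cheaters, device prob., sample size
population = [rng.random() < pi_true for _ in range(n)]
answers = [randomized_response(t, p, rng) for t in population]

lam = sum(answers) / n                      # observed proportion of "yes"
pi_hat = (lam - (1 - p)) / (2 * p - 1)      # unbiased estimate of pi
print(round(lam, 4), round(pi_hat, 4))      # pi_hat should be close to 0.3
```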

Randomization [Agrawal SIGMOD00]
- Each record (e.g., 30 | 70K | ...) is passed through a randomizer that adds random noise to the attributes (e.g., Alice's age 30 becomes 65 = 30 + 35).
- From the randomized data, reconstruct the distributions of Age and Salary and feed them to a classification algorithm to build a model.

Reconstruction
- Given x1+y1, x2+y2, ..., xn+yn, where the xi are the original values, and the probability distribution of the noise Y,
- estimate the probability distribution of X.
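A toy sketch of the additive-noise scheme together with a simple discretized version of the Bayesian reconstruction idea; this is my own illustration, not the authors' algorithm or code, and the age distribution and noise range are made up.

```python
# Sketch: randomize ages with additive noise, then reconstruct the age distribution.
import numpy as np

rng = np.random.default_rng(0)

# Original (private) ages: a mixture of two groups, just for illustration.
x = np.concatenate([rng.normal(25, 3, 5000), rng.normal(55, 5, 5000)])
noise_half_width = 20.0
w = x + rng.uniform(-noise_half_width, noise_half_width, size=x.shape)   # published values

def noise_pdf(z):
    return np.where(np.abs(z) <= noise_half_width, 1.0 / (2 * noise_half_width), 0.0)

# Discretized Bayesian reconstruction of the distribution of X from the noisy values w.
grid = np.linspace(0, 90, 91)                 # candidate age values
p = np.full(grid.size, 1.0 / grid.size)       # start from a uniform prior
for _ in range(50):
    lik = noise_pdf(w[:, None] - grid[None, :])          # n x bins: f_Y(w_i - a_j)
    post = lik * p[None, :]
    post /= post.sum(axis=1, keepdims=True) + 1e-12      # posterior over bins for each record
    p = post.mean(axis=0)                                # averaged posterior = new estimate

mass_20s_est = p[(grid >= 20) & (grid < 30)].sum()
mass_20s_true = np.mean((x >= 20) & (x < 30))
print(round(mass_20s_true, 3), round(mass_20s_est, 3))   # estimate recovers the hidden mode
```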

Randomization on Graphs
- Link privacy: the probability of existence of a link (i, j) given the perturbed graph.
- Feature preserving randomization:
  - spectrum preserving randomization,
  - Markov chain based feature preserving randomization.
- Reconstruction from the randomized graph.

Link Privacy: Posterior Beliefs [Ying PAKDD09]
- Prior probability of a link.
- Posterior probability of a link given the randomized graph.
(A back-of-the-envelope version of these quantities is sketched below.)
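The exact expressions are given in the paper; to make the quantities concrete, here is a rough version for the plain uniform Rand Add/Del model, which is an assumption of this sketch rather than the paper's setting. The numbers are polbooks-sized but otherwise arbitrary.

```python
# Back-of-the-envelope posterior under uniform Rand Add/Del (illustrative, not the paper's formulas).
n, m, k = 105, 441, 100                      # nodes, edges, perturbation magnitude (arbitrary)

num_pairs = n * (n - 1) // 2
prior = m / num_pairs                        # P(edge exists), before seeing the released graph
post_observed = (m - k) / m                  # P(true edge | edge appears in released graph)
post_missing = k / (num_pairs - m)           # P(true edge | edge absent from released graph)

print(round(prior, 4), round(post_observed, 4), round(post_missing, 6))
# Sanity check: the probabilities redistribute exactly the m true edges.
print(round(m * post_observed + (num_pairs - m) * post_missing, 1))   # = m
```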

The above calculation of the posterior probability is simple: it uses only the total number of edges and the randomization magnitude parameter k. The posterior probability is the same for every observed link, so it cannot distinguish one observed link from another. A second method therefore enhances the attacker's posterior confidence by using similarity measures between nodes: if the similarity of nodes i and j is high, an observed link (i, j) is more likely to be a true link.

Link Privacy: Posterior Beliefs [Ying PAKDD09]
Posterior probability and similarity measures
(Figures: similarity vs. proportion of existing edges, before and after randomization.)

These figures show a phenomenon common to many real-world social networks: nodes with high similarity are more likely to be connected to each other.

Link Privacy: Posterior Beliefs [Ying PAKDD09]
Posterior probability and similarity measures

How to calculate the posterior probability for general cases?
(Comparison of prior prob., posterior prob. I, and posterior prob. II.)

One important property is that the sums of all three types of probabilities are each equal to the number of edges m. This is not surprising: the randomization redistributes the m true edges in a probabilistic manner, and the more information the attacker uses, the higher accuracy he can achieve, but the edges do not come out of nowhere.

Link Privacy: Graph Space [Ying SDM09]
Exploit the graph space to breach link privacy.

Example: suppose G2 is the original graph. The data owner applies a switch-based randomization with the constraint that the randomized graph must have the same number of triangles as the original, and releases G4. Is the link (1,5) in the released graph a true link? From the released graph the attacker knows the original degree sequence must be (3,2,2,2,3); there are 7 graphs with this degree sequence, one of which is the original. G1 would make (1,5) a fake link, but G1 has no triangles and is therefore impossible. All 6 remaining candidate graphs contain the link (1,5), so whichever is the original, the attacker knows that (1,5) must be a true link.

Sample the graph space when the space is large:
- Start with the randomized graph, construct a Markov chain, and uniformly sample the graph space.
- Generate N uniform graph samples.

Empirical evaluations show that the node pairs with the highest sampled probabilities face serious link disclosure risk (as high as 90%).

For large and complex networks it is infeasible for the attacker to examine all possible graphs. In this attacking model the attacker instead samples the graph space: starting from the released graph, he constructs a Markov chain that converges to the uniform stationary distribution, draws N sample graphs, and if many samples contain the link (i, j), that link is more likely to be a true link.

Graph Features Under Pure Randomization
Topological and spectral features change significantly along the randomization.

(Network of US political books, 105 nodes and 441 edges.) The figures record how the graph features change along the randomization; most of them change significantly, which raises the question: can we better preserve the network structure?

Spectrum Preserving Randomization [Ying SDM08]
- Graph spectrum is related to many real graph features.
- Preserve graph features by preserving some eigenvalues.

The basic idea of spectrum preserving randomization is to exploit the correlation between topological measures and graph eigenvalues: if some important eigenvalues are preserved, other topological features may be preserved as well.

Spectrum Preserving Randomization [Ying SDM08]
Spectral switch (applied to the adjacency matrix):
- up-switch to increase the eigenvalue;
- down-switch to decrease the eigenvalue.
(A sketch of such an eigenvalue-aware switch follows below.)
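A sketch in the spirit of the spectral switch; the exact switching conditions are in [Ying SDM08], whereas this code simply uses the standard first-order perturbation estimate: switching (t,w),(u,v) to (t,v),(u,w) changes λ1 by roughly 2(x_t x_v + x_u x_w - x_t x_w - x_u x_v), where x is the principal eigenvector.

```python
# Sketch of an eigenvalue-aware (spectral) edge switch via first-order perturbation.
import random
import numpy as np
import networkx as nx

def principal_eig(G):
    A = nx.to_numpy_array(G)
    vals, vecs = np.linalg.eigh(A)
    return vals[-1], vecs[:, -1]

def spectral_switch(G, increase=True, tries=500, seed=0):
    """Apply one degree-preserving switch predicted to raise (or lower) lambda_1."""
    rng = random.Random(seed)
    _, x = principal_eig(G)
    idx = {v: i for i, v in enumerate(G.nodes())}
    edges = list(G.edges())
    for _ in range(tries):
        (t, w), (u, v) = rng.sample(edges, 2)
        if len({t, w, u, v}) < 4 or G.has_edge(t, v) or G.has_edge(u, w):
            continue
        delta = 2 * (x[idx[t]] * x[idx[v]] + x[idx[u]] * x[idx[w]]
                     - x[idx[t]] * x[idx[w]] - x[idx[u]] * x[idx[v]])
        if (delta > 0) == increase:
            H = G.copy()
            H.remove_edges_from([(t, w), (u, v)])
            H.add_edges_from([(t, v), (u, w)])
            return H
    return G.copy()

G = nx.karate_club_graph()
lam0, _ = principal_eig(G)
lam_up, _ = principal_eig(spectral_switch(G, increase=True))
lam_dn, _ = principal_eig(spectral_switch(G, increase=False))
print(round(lam0, 4), round(lam_up, 4), round(lam_dn, 4))   # typically lam_dn <= lam0 <= lam_up
```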

In the randomization procedure we track the largest eigenvalue of the adjacency matrix and its eigenvector; every entry of the eigenvector corresponds to a node. When a pair of edges is picked, we check the four corresponding eigenvector values: if they satisfy one condition, the eigenvalue λ1 must increase (up-switch); if they satisfy the other condition, it must decrease (down-switch).

Spectral switch (applied to the Laplacian matrix):
- down-switch to decrease the eigenvalue;
- up-switch to increase the eigenvalue.

Spectrum Preserving Randomization [Ying SDM08]
Algorithm outline:

Combine up- and down-switches so that the eigenvalue stays within a small range around the true value, specified by the user: if the eigenvalue exceeds the upper bound, perform a down-switch; if it falls below the lower bound, perform an up-switch.

Markov Chain Based Feature Preserving Randomization [Ying SDM09]
- Preserve any graph feature S(G) within a small range.
- The feature range constraint is specified by the user.
- Markov chain with the feature range constraint (uniformity on the accessible graphs).

This is the basic flow of the Markov chain.

Markov Chain Based Feature Preserving Randomization [Ying SDM09]
The feature constraint itself can be used to breach link privacy (original vs. released graph example).


Reconstruction from Randomized Graph [Wu SDM10]
Motivation: from the randomized graph, can we reconstruct a graph whose features are closer to the true features?

Low rank approximation approach:
- Best rank-r approximation by eigen-decomposition, keeping the r eigenvalues of largest magnitude and their eigenvectors: Ã_r = Σ_{i=1..r} λ_i x_i x_i^T.
- Discretize the low-rank matrix back to a 0-1 adjacency matrix.

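A simplified sketch of these two steps (my own version, not the paper's algorithm): eigen-decompose the adjacency matrix, keep the r eigenvalues of largest magnitude, and discretize by turning the m largest scores back into edges. The karate-club graph is just a stand-in for a randomized input graph.

```python
# Sketch of low-rank reconstruction: rank-r eigen-approximation + discretization.
import numpy as np
import networkx as nx

def low_rank_reconstruct(G_rand, r):
    A = nx.to_numpy_array(G_rand)
    n = A.shape[0]
    m = G_rand.number_of_edges()
    vals, vecs = np.linalg.eigh(A)
    keep = np.argsort(np.abs(vals))[-r:]                 # r eigenvalues largest in magnitude
    A_r = (vecs[:, keep] * vals[keep]) @ vecs[:, keep].T  # sum_i lambda_i x_i x_i^T
    # Discretize: turn the m largest off-diagonal scores into edges.
    iu = np.triu_indices(n, k=1)
    top = np.argsort(A_r[iu])[-m:]
    H = nx.Graph()
    H.add_nodes_from(range(n))
    H.add_edges_from((iu[0][i], iu[1][i]) for i in top)
    return H

G = nx.karate_club_graph()
H = low_rank_reconstruct(G, r=4)
print(G.number_of_edges(), H.number_of_edges())   # same number of edges by construction
```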

Reconstruction from Randomized Graph [Wu SDM10]
Effect of including significant negative eigenvalues (reconstructions with r = 1, 2, 4 vs. the original).


Reconstruction from Randomized Graph [Wu SDM10]
Algorithm outline.

Feature values of the original graph, the randomized graph, and the reconstructed graph.

Reconstruction from Randomized Graph [Wu SDM10]
Reconstructed graphs do not jeopardize link privacy for real-world networks (privacy measured by the proportion of differing edges).

Reconstruction from Randomized Graph [Wu SDM10]
Graphs with low rank may have their privacy breached by reconstruction:

- Reconstruction on synthetic low-rank graphs.

Reconstructing Randomized Social Networks & Features [Vuokko SDM10]
- Graph and feature data: an original graph plus a binary feature matrix; two individuals with higher feature similarity are more likely to be connected in the graph.
- Reconstruction problem:

Maximum likelihood estimation is adopted to reconstruct the original graph and the features.

Random Sparsification [Bonchi ICDE11]
- Only removes edges from the graph, without adding new edges.
- Outperforms random add/del in terms of utility preservation, partly due to the small-world phenomenon: adding random long-haul edges brings nodes close together, while removing an edge does not push nodes far apart since alternative paths exist.
- Studies the utility vs. privacy trade-off for various randomization strategies.

Edge-weighted Graphs
Edge weights can be sensitive, e.g., the trustworthiness of user A according to user B, or the transaction amount between two accounts.


Anonymizing Edge-weighted Graphs [Das TR09]
- Preserve certain properties of the edge weights, e.g., relative distances between nodes for shortest-path or kNN queries.
- A framework for edge-weight anonymization of graph data that preserves linear properties.
- A linear property can be expressed by a specific set of linear inequalities over the edge weights.
- Finding new weights for each edge is a linear programming problem.

Gaussian Randomization [Liu SDM09]
- Perturb edge weights while preserving global and local utilities; the graph structure is unchanged.
- Gaussian randomization multiplication: the original weight of each edge is multiplied by random Gaussian noise with mean 1 and some variance. If the shortest distance d(A,B) is much smaller than d(C,D) in the original graph, the order is highly likely to be preserved.
- Greedy perturbation: preserve a designated set of shortest distances.

Anonymizing Multi-graphs [Li SDM11]
How to generate an anonymized collection of graphs, where each graph corresponds to an individual's behavior:
- XML representation of attributes about an individual,
- click-graph in a user session,
- route for a given individual in a time period.
Condensation based approach:
- create constrained clusters of size at least K;
- construct a super-template to represent the properties of the group;
- generate anonymized graphs from the super-template.

Computing Privacy Scores [Liu ICDM09]
The privacy score measures a user's potential privacy risk due to her online information-sharing behavior. It increases with:
- the sensitivity of the information being shared;
- the visibility of the revealed information in the network.

Computing Privacy Scores [Liu ICDM09]
Item Response Theory based model:
- Originally used to measure the abilities of examinees, the difficulty of questions, and the probability that an examinee answers a question correctly.
- Each examinee is mapped to a user, and each question is mapped to a profile item.
- The difficulty parameter quantifies the sensitivity of a profile item.
- The true visibility is estimated from the observed profiles.

Output Perturbation

Data owner

Data miner

name   | sex | age | disease | salary
Ada    | F   | 18  | cancer  | 25k
Bob    | M   | 25  | heart   | 110k
Cathy  | F   | 20  | cancer  | 70k
Dell   | M   | 65  | flu     | 65k
Ed     | M   | 60  | cancer  | 300k
Fred   | M   | 24  | flu     | 20k
George | M   | 22  | cancer  | 45k
Harry  | M   | 40  | flu     | 95k
Irene  | F   | 45  | heart   | 70k

The data miner sends a query f; the data owner returns the query result plus noise.

Differential Guarantee [Dwork, TCC06]
Two databases (x, x') differ in only one row. For the query f = count(#cancer), the owner returns f(x) + noise; e.g., one database yields 3 + noise and its neighbor yields 2 + noise.

Differential Guarantee
- Requires that the output probability distribution is essentially the same whether any individual opts in to, or opts out of, the database.
- Anything that can be learned about a respondent from a statistical database should be learnable without access to the database.
- Independent of the adversary's knowledge.
- Different from prior work that compares an adversary's prior and posterior views of an individual.

ε-differential privacy
ε is a privacy parameter: a smaller ε means stronger privacy. A randomized algorithm K satisfies ε-differential privacy if for all neighboring datasets x, x' and all output sets S, Pr[K(x) ∈ S] ≤ e^ε · Pr[K(x') ∈ S].

Differential Privacy [Dwork, TCC06 & Dwork, CACM11]
Two neighboring datasets x, x' can be defined in terms of:
- Hamming distance (x and x' differ in one row), or
- symmetric difference: |(x − x') ∪ (x' − x)| = 1.

Calibrating Noise
- Laplace distribution
- Sensitivity of a function: global sensitivity vs. local sensitivity
- Multiple queries

Laplace Distribution
Lap(b) has density f(x) = (1/2b) exp(−|x|/b); adding Lap(Δf/ε) noise to f(x) yields ε-differential privacy.

Gaussian Distribution
Adding Gaussian noise, with the standard deviation calibrated to the L2 sensitivity, yields the relaxed (ε, δ)-differential privacy.

Sensitivity

Function f      | Sensitivity Δf
Count(#cancer)  | 1
Sum(salary)     | u (domain upper bound)
Avg(salary)     | u/n

Complex functions or data mining tasks can be decomposed into a sequence of simple functions.

For vector-valued outputs, the sensitivity is measured by the L1 distance.

Differential Guarantee
For f = count(#cancer), release f(x) + Lap(1/ε), e.g., with ε = 1 or ε = 2 (a larger ε means less noise and weaker privacy).
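A minimal sketch of the Laplace mechanism for this count query, using the toy table from the earlier slides; the epsilon values and the random seed are arbitrary.

```python
# Sketch: epsilon-differentially private count via the Laplace mechanism.
import numpy as np

rng = np.random.default_rng(0)
# Disease column of the toy table (Ada..Irene).
diseases = ["cancer", "heart", "cancer", "flu", "cancer", "flu", "cancer", "flu", "heart"]

def private_count(value, epsilon):
    true_count = sum(d == value for d in diseases)
    sensitivity = 1.0                     # adding/removing one record changes a count by at most 1
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

for eps in (1.0, 2.0):
    print(eps, round(private_count("cancer", eps), 2))
```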

(Difference of probabilities between neighboring databases.)

Neighboring Datasets

- Two neighboring datasets x, x' can be defined by Hamming distance (differing in one row) or by symmetric difference: |(x − x') ∪ (x' − x)| = 1.
- What about two datasets differing by k rows?

Multiple Queries

Histogram Query
SELECT count(*) FROM table GROUP BY disease

For the rows (Ada, cancer), (Bob, heart), (Ed, cancer), (George, cancer), (Harry, flu), (Irene, heart), the true histogram over (cancer, heart, flu) is [3, 2, 1]; the released answer is [3 + Lap(1/ε), 2 + Lap(1/ε), 1 + Lap(1/ε)].

Recent Development
- [Xiao ICDE10] Dependencies among the queries can be exploited to improve the accuracy of responses.

- [Li PODS10] Matrix mechanism for answering a workload of predicate counting queries.

- [Kifer SIGMOD11] Misconceptions of differential privacy.

Private Query Answering on Networks [Hay ICDM09]
Two neighboring graphs can be defined to differ by a single edge, by K edges, or by a single node:
- Edge ε-differential privacy: the query output is indistinguishable whether any single edge is present or absent; add Lap(Δf/ε) noise.
- K-edge ε-differential privacy: indistinguishable whether any set of K edges is present or absent; add Lap(Δf·K/ε) noise.
- Node ε-differential privacy: indistinguishable whether any single node (and all its edges) is present or absent.

Degree Sequence
The list of the degrees of all nodes in a graph.

The degree sequence of a network may itself be sensitive, as it can be used to determine the graph structure when combined with other graph statistics.

Two Equivalent Queries
- Degree sequence: D(G) = [1,1,3,3,3,3,2], D(G') = [1,1,3,3,2,2,2]; ΔD = 2, so add Lap(2/ε) to each component.
- Degree histogram (number of nodes with degree i, i = 0..n-1): F(G) = [0,2,1,4,0,0,0], F(G') = [0,2,3,2,0,0,0]; ΔF = 4, so add Lap(4/ε) to each component.

Boosting Accuracy [Hay ICDM09]
- Rewrite query D to get S with constraint set C_S.
- Submit S.

- Receive the perturbed answer A(S).
- Perform inference on A(S) with the constraints C_S to derive a better estimate.

Formulating Query D

- Degree sequence D(G) = [1,1,3,3,3,3,2], D(G') = [1,1,3,3,2,2,2]; ΔD = 2, add Lap(2/ε) to each component.
- Instead, return the i-th smallest degree: S(G) = [1,1,2,3,3,3,3] + Lap(2/ε) per component; the perturbed answer can be out of order (e.g., [3, 2, ...]).
- A new (and more accurate) sequence can be derived by computing the closest non-decreasing sequence, as sketched below.
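A sketch of the noisy degree-sequence query followed by the constrained-inference step, using a simple pool-adjacent-violators projection onto non-decreasing sequences; this is my own implementation of the post-processing idea, not the authors' code.

```python
# Sketch: noisy sorted degree sequence + constrained inference (closest non-decreasing sequence).
import numpy as np

rng = np.random.default_rng(0)

def isotonic_non_decreasing(y):
    """L2 projection onto non-decreasing sequences (pool adjacent violators)."""
    out = []                                # stack of [block mean, block size]
    for v in y:
        out.append([float(v), 1.0])
        while len(out) > 1 and out[-2][0] > out[-1][0]:
            m2, s2 = out.pop()
            m1, s1 = out.pop()
            out.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    result = []
    for mean, size in out:
        result.extend([mean] * int(size))
    return np.array(result)

degrees = np.array(sorted([1, 1, 3, 3, 3, 3, 2]))    # S(G) = sorted degree sequence
epsilon, sensitivity = 1.0, 2.0                      # one edge changes two degrees by 1 each
noisy = degrees + rng.laplace(scale=sensitivity / epsilon, size=degrees.size)
smoothed = isotonic_non_decreasing(noisy)            # enforce the known ordering constraint

print(degrees)
print(np.round(noisy, 2))
print(np.round(smoothed, 2))   # typically much closer to the true sequence than `noisy`
```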

Accurate Motif Analysis
- Measures the frequency of occurrence of small subgraphs in a network, e.g., the number of triangles.
- Removing a single edge can change the count from n-2 triangles to 0, so the sensitivity is high!

Weakening Privacy
- Statistics such as transitivity, clustering coefficient, centrality, and path lengths have high sensitivity values.
- Possible techniques:
  - [Nissim STOC07] Smooth sensitivity, a smoothed form of local sensitivity, i.e., the maximum change between Q(I) and Q(I') for any I' in neighbor(I).
  - [Rastogi PODS09] Adversarial privacy, limiting the assumptions about the adversary's prior knowledge.
- More exploration is needed on robust statistics and differential privacy.

Model based Data Publishing

Data owner

Data miner

Instead of answering queries directly, the data owner:
- builds models (e.g., a contingency table, a power-law graph),
- releases differentially private model parameters,
- generates synthetic data using the models with perturbed parameters.

References
[Agrawal, SIGMOD00] R. Agrawal and R. Srikant. Privacy preserving data mining. SIGMOD, 2000.

[Aggarwal, 08] C. C. Aggarwal and P. S. Yu. Privacy-preserving data mining: models and algorithms. Springer, 2008.

[Backstrom, WWW07] L. Backstrom, C. Dwork, and J. Kleinberg. Wherefore art thou R3579X? Anonymized social networks, hidden patterns and structural steganography. WWW, 2007.

[Bhagat, VLDB09] S. Bhagat, G. Cormode, B. Krishnamurthy, and D. Srivastava. Class-based graph anonymization for social network data. VLDB, 2009.

[Bonchi, ICDE11] F. Bonchi, A. Gionis, and T. Tassa. Identity Obfuscation in Graphs Through the Information Theoretic Lens. ICDE, 2011.

[Campan, PinKDD08] A. Campan and T. M. Truta. A clustering approach for data and structural anonymity in social network data. PinKDD, 2008.

[Cheng, SIGMOD10] J. Cheng, A. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. SIGMOD, 2010.

[Cormode, VLDB08] G. Cormode, D. Srivastava, T. Yu, and Q. Zhang. Anonymizing bipartite graph data using safe groupings. VLDB, 2008.

[Das, TR09] S. Das, O. Egecioglu, and A. E. Abbadi. Anonymizing edge weighted social network graphs. 2009.

[Dwork, CACM11] C. Dwork. A firm foundation for private data analysis. CACM, 2011.

[Dwork, TCC06] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. TCC, 2006.

[Gross, WPES05] R. Gross and A. Acquisti. Information revelation and privacy in online social networks (the Facebook case). WPES, 2005.

[Hanhijarvi, SDM09] S. Hanhijarvi, G. C. Garriga, and K. Puolamaki. Randomization techniques for graphs. SDM, 2009.

[Hay, VLDB08] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. VLDB, 2008.

[Hay, 07] M. Hay, G. Miklau, D. Jensen, P. Weis, and S. Srivastava. Anonymizing social networks. 2007.

[Hay, 09] M. Hay, G. Miklau, and D. Jensen. Enabling accurate analysis of private network data. 2009.

[Kifer, SIGMOD11] D. Kifer and A. Machanavajjhala. No Free Lunch in Data Privacy. SIGMOD, 2011.

[Korolova, ICDE08] A. Korolova, R. Motwani, S. Nabar, and Y. Xu. Link privacy in social networks. ICDE, 2008.

[Li, PODS10] C. Li, M. Hay, V. Rastogi, G. Miklau, A. McGregor. Optimizing linear counting queries under differential privacy. PODS, 2010.

[Liu, SIGMOD08] K. Liu and E. Terzi. Towards identity anonymization on graphs. SIGMOD, 2008.

[Liu, ICDM09] K. Liu and E. Terzi. A framework for computing the privacy scores of users in online social networks. ICDM, 2009.

[Liu, SDM09] L. Liu, J. Wang, J. Liu, and J. Zhang. Privacy preserving in social networks against sensitive edge disclosure. SDM, 2008.

[McSherry, FOCS07] F. McSherry and K. Talwar. Mechanism design via differential privacy. FOCS, 2007.

[Narayanan, ISSP09] A. Narayanan and V. Shmatikov. De-anonymizing social networks. IEEE S&P, 2009.

[Newman, PRE06] M. Newman. Physical Review E, 2006.

[Vuokko, SDM10] N. Vuokko and E. Terzi. Reconstructing randomized social networks. SDM, 2010.

[Wu, SDM10] L. Wu, X. Ying, and X. Wu. Reconstruction of randomized graph via low rank approximation. SDM, 2010.

[Wu, EDBT11] W. Wu, Y. Xiao, W. Wang, Z. He, and Z. Wang. k-symmetry model for identity anonymization in social networks. EDBT, 2011.

[Wu, 09] X. Wu, X. Ying, K. Liu, and L. Chen. A Survey of Algorithms for Privacy-Preservation of Graphs and Social Networks. 2009.

[Ying, SDM08] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. SDM, 2008.

[Ying, SDM09] X. Ying and X. Wu. Graph generation with prescribed feature constraints. SDM, 2009.

[Ying, SDM09-2] X. Ying and X. Wu. On randomness measures for social networks. SDM, 2009.

[Ying, PAKDD09] X. Ying and X. Wu. On link privacy in randomizing social networks. PAKDD, 2009.

[Xiao, ICDE10] X. Xiao, G. Wang, and J. Gehrke. Differential privacy via wavelet transforms. ICDE, 2010.

[Zheleva, PinKDD07] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. PinKDD, 2007.

[Zhou, ICDE08] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. ICDE, 2008.

[Zou, VLDB09] L. Zou, L. Chen, and M. T. Ozsu. K-automorphism: A general framework for privacy preserving network publication. VLDB, 2009.

Questions?

Acknowledgments
This work was supported in part by U.S. National Science Foundation grants IIS-0546027, CNS-0831204, and CCF-1047621.

Updated version: http://dpl.sis.uncc.edu/ppsn-tut.PDF

Thank You!
