De-anonymizing Social Networks

Presenter: Lijie Zhang
Advisor: Weining Zhang

Outline

- Motivation
- Attack Model
- De-anonymization Algorithm
- Experiments
- Conclusions

Motivation

- Social network (SN) owner publishes graph data for sharing:
  - Academic and government data mining: phone call networks
  - Advertising
  - Third-party applications: 550,000 Facebook applications
- Private information on SNs:
  - Node attributes: node degree in a sexual network
  - Edge presence: a single call, a romantic relationship

Motivation

- The SN owner publishes an anonymized graph: nodes have no identifying attributes.
- Propose a model to identify nodes from the anonymized graph:
  - Re-identification: learn the entity to which a node belongs.
  - Entity: an account, a real person, a group, or an organization.

Model – Social Network

Social network S:
- A directed graph G = (V, E)
- A set of node attributes X: name, telephone number
- A set of edge attributes Y: type of relationship
- Attribute values are treated as coming from a discrete domain

Model – Data Release

A sanitized subset of nodes and edges in S. Computation:
- Vsan: subset of V
- Xsan: subset of X, including sensitive attributes
- Ysan: subset of Y, including sensitive attributes
- Published attributes by themselves are insufficient for re-identification
- Compute the induced subgraph on Vsan
- Remove some edges and add fake edges

$$S_{san} = \left(V_{san},\ E_{san},\ \{X(v) \mid v \in V_{san},\ X \in X_{san}\},\ \{Y(e) \mid e \in E_{san},\ Y \in Y_{san}\}\right)$$
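The release steps above (induced subgraph, edge removal, fake edges) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `release_sanitized`, the per-edge drop probability `drop_prob`, and the count of fake edges `n_fake` are hypothetical, since the slide does not say how many edges are removed or added. Attribute filtering (Xsan, Ysan) is left out.

```python
import random

def release_sanitized(nodes, edges, keep_nodes, drop_prob, n_fake, rng):
    """Return (v_san, e_san) for a directed graph given as (u, v) edge pairs.

    drop_prob and n_fake are illustrative knobs, not values from the slides.
    """
    v_san = set(keep_nodes) & set(nodes)
    # Induced subgraph: keep only edges with both endpoints in v_san.
    induced = {(u, v) for (u, v) in edges if u in v_san and v in v_san}
    # Remove each induced edge independently with probability drop_prob.
    e_san = {e for e in sorted(induced) if rng.random() >= drop_prob}
    # Add fake edges chosen among node pairs that had no edge originally.
    ordered = sorted(v_san)
    candidates = [(u, v) for u in ordered for v in ordered
                  if u != v and (u, v) not in induced]
    e_san |= set(rng.sample(candidates, min(n_fake, len(candidates))))
    return v_san, e_san

nodes = {1, 2, 3, 4, 5}
edges = {(1, 2), (2, 3), (3, 4), (4, 5), (5, 1)}
v_san, e_san = release_sanitized(nodes, edges, {1, 2, 3, 4},
                                 drop_prob=0.25, n_fake=1,
                                 rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the perturbation reproducible, which helps when comparing attacks against the same released graph.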

Model – Attacker

Purpose: extract sensitive information about specific individuals from anonymized SN graphs

Attacker’s knowledge:
- Aggregate auxiliary information
- Individual auxiliary information

Aggregate auxiliary information

Large-scale information from other data sources and social networks whose membership overlaps with the target network Ssan:
- Gaux = (Vaux, Eaux)
- AuxX and AuxY: probability distributions of each node attribute in Vaux and each edge attribute in Eaux, respectively (prior knowledge)

Individual auxiliary information

Identifiable details about a small number of individuals from the target network Ssan and possibly relationships between them

Model – Breaching Privacy

Extract sensitive information about specific individuals from Ssan

Re-identify nodes from the target SN Ssan:
- Re-identification: find a mapping $\mu$ between a node in Vaux and a node in Vsan
- $\mu_G$: ground-truth mapping
- Re-identification of a node $v$ succeeds if $\mu(v) = \mu_G(v)$


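Re-identification algorithm:
- Input: Ssan and Saux
- Output: a probabilistic mapping $\tilde{\mu}: V_{san} \times (V_{aux} \cup \{\perp\}) \to [0,1]$, where $\tilde{\mu}(v_{aux}, v_{san})$ is the probability that vaux maps to vsan

Mapping adversary:

$$Adv[X, v_{aux}, x] = \frac{\sum_{v \in V_{san},\ X[v] = x} \tilde{\mu}(v_{aux}, v)}{\sum_{v \in V_{san},\ X[v] \neq \perp} \tilde{\mu}(v_{aux}, v)}$$

$$Adv[Y, u_{aux}, v_{aux}, y] = \frac{\sum_{u,v \in V_{san},\ Y[u,v] = y} \tilde{\mu}(u_{aux}, u)\,\tilde{\mu}(v_{aux}, v)}{\sum_{u,v \in V_{san},\ Y[u,v] \neq \perp} \tilde{\mu}(u_{aux}, u)\,\tilde{\mu}(v_{aux}, v)}$$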


Privacy breach: the privacy of vsan is breached w.r.t. adversary Adv and privacy parameter $\delta$ if

$$Adv[X, v_{aux}, x] - Aux[X, v_{aux}, x] > \delta$$

or

$$Adv[Y, u_{aux}, v_{aux}, y] - Aux[Y, u_{aux}, v_{aux}, y] > \delta$$
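To make the adversary's belief and the breach test concrete, here is a minimal Python sketch for the node-attribute case only. The data layout is an assumption made for illustration: the probabilistic mapping is a dict keyed by `(v_aux, v_san)`, the published attribute is a dict from sanitized node to value, and `None` stands in for an unpublished value. The names `adv_node_attr` and `is_breached` are hypothetical.

```python
def adv_node_attr(mu, attr, v_aux, x):
    """Adversary's belief that attribute X of v_aux equals x.

    mu:   dict {(v_aux, v_san): probability}   (assumed layout)
    attr: dict {v_san: value or None}, None meaning "not published"
    """
    num = sum(p for (a, v), p in mu.items()
              if a == v_aux and attr.get(v) == x)
    den = sum(p for (a, v), p in mu.items()
              if a == v_aux and attr.get(v) is not None)
    return num / den if den > 0 else 0.0

def is_breached(adv, aux_prior, delta):
    """Breach if the belief rises above the prior by more than delta."""
    return adv - aux_prior > delta

# Toy data: "alice" maps to s1 with probability 0.8 and to s2 with 0.2.
mu = {("alice", "s1"): 0.8, ("alice", "s2"): 0.2}
attr = {"s1": "yes", "s2": "no"}
belief = adv_node_attr(mu, attr, "alice", "yes")
breached = is_breached(belief, aux_prior=0.3, delta=0.2)
```

The edge-attribute case follows the same pattern, with a double sum over node pairs and a product of two mapping probabilities.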

Model – Measuring Success of an Attack

Let $V_{mapped} = \{v \in V_{aux} : \mu_G(v) \neq \perp\}$. The success rate of a de-anonymization algorithm outputting a probabilistic mapping $\tilde{\mu}$, w.r.t. a centrality measure $\nu$, is the probability that $\mu$ sampled from $\tilde{\mu}$ maps a node $v$ to $\mu_G(v)$ if $v$ is selected according to $\nu$:

$$\frac{\sum_{v \in V_{mapped}} PR[\mu(v) = \mu_G(v)]\ \nu(v)}{\sum_{v \in V_{mapped}} \nu(v)}$$
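The success rate is a centrality-weighted average, which a short Python sketch makes explicit. It assumes that the probability of a sampled mapping sending v to its true counterpart can be read directly from the probabilistic mapping, stored as a dict keyed by `(v_aux, v_san)`; the function name and data layout are hypothetical.

```python
def success_rate(mu, ground_truth, centrality):
    """Centrality-weighted probability of mapping each node correctly.

    mu:           dict {(v_aux, v_san): probability}   (assumed layout)
    ground_truth: dict {v_aux: v_san or None}, None meaning "no counterpart"
    centrality:   dict {v_aux: weight}, for example node degree
    """
    # Only nodes that have a counterpart in the target graph are counted.
    mapped = [v for v, target in ground_truth.items() if target is not None]
    total = sum(centrality[v] for v in mapped)
    if total == 0:
        return 0.0
    hit = sum(mu.get((v, ground_truth[v]), 0.0) * centrality[v]
              for v in mapped)
    return hit / total

ground_truth = {"a": "s1", "b": "s2", "c": None}
degree = {"a": 3, "b": 1, "c": 5}
mu = {("a", "s1"): 1.0, ("b", "s3"): 1.0}   # "a" is right, "b" is wrong
rate = success_rate(mu, ground_truth, degree)  # 3 / (3 + 1)
```

Weighting by degree means that correctly re-identifying a well-connected node counts for more than re-identifying a peripheral one.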

De-anonymization Algorithm

Seed identification: apply individual auxiliary information

Propagation: apply aggregate auxiliary information

Algorithm - Seed Identification

Input:
- The target graph
- A clique of k nodes that are present in both the auxiliary and the target graphs
- The degree values of the k nodes
- $\binom{k}{2}$ pairs of common-neighbor counts
- Error parameter ε

Output: a k-clique in the target graph with matching (within a factor of $1 \pm \varepsilon$) node degrees and common-neighbor counts.
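A brute-force version of this search is easy to write down. The sketch below assumes an undirected target graph stored as a dict from node to neighbor set, and it only accepts the result when exactly one candidate clique matches; both choices are assumptions made for illustration, as are the names `within` and `find_seed_clique`. The running time grows exponentially with k, so it only suits small k.

```python
from itertools import combinations, permutations

def within(actual, expected, eps):
    """True if actual lies within a factor of 1 +/- eps of expected."""
    return (1 - eps) * expected <= actual <= (1 + eps) * expected

def find_seed_clique(adj, degrees, common, eps):
    """Search adj (node -> set of neighbors) for a matching k-clique.

    degrees[i]     claimed degree of seed i
    common[(i, j)] claimed common-neighbor count of seeds i < j
    Returns the matching tuple of target nodes, or None if there is no
    match or more than one.
    """
    k = len(degrees)
    matches = []
    for cand in permutations(adj, k):
        if not all(within(len(adj[cand[i]]), degrees[i], eps)
                   for i in range(k)):
            continue
        ok = True
        for i, j in combinations(range(k), 2):
            u, v = cand[i], cand[j]
            shared = len(adj[u] & adj[v])
            if v not in adj[u] or not within(shared, common[(i, j)], eps):
                ok = False
                break
        if ok:
            matches.append(cand)
    return matches[0] if len(matches) == 1 else None

adj = {
    "a": {"b", "c", "d", "e"}, "b": {"a", "c", "d"}, "c": {"a", "b", "f"},
    "d": {"a", "b"}, "e": {"a"}, "f": {"c"},
}
degrees = [4, 3, 3]
common = {(0, 1): 2, (0, 2): 1, (1, 2): 1}
exact = find_seed_clique(adj, degrees, common, eps=0.0)
loose = find_seed_clique(adj, degrees, common, eps=0.5)
```

In this toy graph the exact search pins down a single triangle, while the looser tolerance admits a second ordering of the same nodes and the search reports no unique match. This mirrors the trade-off the error parameter controls.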

Algorithm - Propagation

- Inputs: G1, G2, and a seed mapping $\mu_S$
- Output: a mapping $\mu$
- Iteratively find new mappings using the topological structure of the network and the feedback from previously constructed mappings.

Algorithm - Propagation

function propagationStep(lgraph, rgraph, mapping)
    for lnode in lgraph.nodes:
        scores[lnode] = matchScores(lgraph, rgraph, mapping, lnode)
        if eccentricity(scores[lnode]) < theta: continue
        rnode = (pick node from rgraph.nodes where
                 scores[lnode][node] = max(scores[lnode]))

        scores[rnode] = matchScores(rgraph, lgraph, invert(mapping), rnode)
        if eccentricity(scores[rnode]) < theta: continue
        reverse_match = (pick node from lgraph.nodes where
                         scores[rnode][node] = max(scores[rnode]))

        if reverse_match != lnode: continue

        mapping[lnode] = rnode


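Eccentricity: measures how much the best-scoring candidate “stands out” from the rest of the candidates in a set of mapping scores X:

$$\frac{\max(X) - \max\nolimits_2(X)}{\sigma(X)}$$

where $\max$ and $\max_2$ are the highest and second-highest values in X and $\sigma$ is the standard deviation.

The match is rejected if the eccentricity of the set of mapping scores is below a threshold θ (theta in the pseudocode).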


Complexity: $O((|E_1| + |E_2|)\, d_1 d_2)$, where d1 is a bound on the degree of the nodes in V1 and d2 is a bound on the degree of the nodes in V2.
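The propagation pseudocode and the eccentricity test can be combined into a small runnable Python sketch. Several parts are assumptions, because the slides do not define them: graphs are undirected dicts from node to neighbor set, `match_scores` counts already-mapped neighbors and normalizes by the square root of the candidate's degree, the standard deviation is the population one, and nodes that are already mapped are skipped instead of being revisited.

```python
import math
import statistics

def eccentricity(scores):
    """(max - second max) / standard deviation of the score values."""
    values = sorted(scores.values(), reverse=True)
    if len(values) < 2:
        return 0.0
    sigma = statistics.pstdev(values)
    return (values[0] - values[1]) / sigma if sigma > 0 else 0.0

def match_scores(lgraph, rgraph, mapping, lnode):
    """Score unmapped rgraph nodes by shared, already-mapped neighbors.

    The 1/sqrt(degree) weighting is an assumed heuristic that keeps
    high-degree candidates from winning by sheer size.
    """
    taken = set(mapping.values())
    scores = {r: 0.0 for r in rgraph if r not in taken}
    for lnbr in lgraph[lnode]:
        if lnbr not in mapping:
            continue
        for rnode in rgraph[mapping[lnbr]]:
            if rnode in scores:
                scores[rnode] += 1.0 / math.sqrt(len(rgraph[rnode]))
    return scores

def propagation_step(lgraph, rgraph, mapping, theta):
    """One pass over lgraph; extends mapping in place and returns it."""
    for lnode in lgraph:
        if lnode in mapping:          # simplification: no re-mapping
            continue
        scores = match_scores(lgraph, rgraph, mapping, lnode)
        if not scores or eccentricity(scores) < theta:
            continue
        rnode = max(scores, key=scores.get)
        # Reverse check: rnode must pick lnode when the roles are swapped.
        inverse = {r: l for l, r in mapping.items()}
        back = match_scores(rgraph, lgraph, inverse, rnode)
        if not back or eccentricity(back) < theta:
            continue
        if max(back, key=back.get) != lnode:
            continue
        mapping[lnode] = rnode
    return mapping

lgraph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"},
          "d": {"b", "e"}, "e": {"d"}}
rgraph = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2, 5}, 5: {4}}
result = propagation_step(lgraph, rgraph, {"a": 1, "b": 2}, theta=0.5)
```

In this toy run the pass extends the two seeds to "c" and "d". Node "e" stays unmapped because only one candidate remains, so no eccentricity can be computed and the conservative choice here is to reject the match.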

Experiments – Data Sets

Twitter, Flickr, LiveJournal:

Experiments – Seed Identification

Evaluate the feasibility of seed identification by measuring how much auxiliary information is needed to identify a unique node in the target graph.

- LiveJournal graph: used as both the auxiliary and the target graph
- Construct 4-cliques, and treat a 4-clique in the target graph as a match as long as each degree and common-neighbor count matches within a factor of $1 \pm \varepsilon$

Experiments – Propagation

Evaluate the robustness against perturbation and seed selection

- Pairs of subgraphs (V1, V2) of a real-world SN, with over 100,000 nodes each
- One serves as the auxiliary SN, the other as the target SN
- Perturbation strategy: the two subgraphs overlap in 25% of their nodes and 50% of their edges
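One way to produce such a pair of subgraphs is sketched below in Python. The slides do not say how overlap is measured or generated, so this is only an assumed construction: overlap is taken to be the Jaccard coefficient, nodes are split into a shared group and two private groups, and each copy keeps an edge independently with probability p = 2α/(1+α), which gives an expected edge overlap of α on the shared nodes. All names are hypothetical.

```python
import random

def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def split_with_overlap(nodes, edges, node_overlap, edge_overlap, rng):
    """Return two perturbed subgraphs (v1, e1) and (v2, e2)."""
    nodes = list(nodes)
    rng.shuffle(nodes)
    n_shared = round(node_overlap * len(nodes))
    n_only = (len(nodes) - n_shared) // 2
    shared = set(nodes[:n_shared])
    v1 = shared | set(nodes[n_shared:n_shared + n_only])
    v2 = shared | set(nodes[n_shared + n_only:])
    # Keeping each edge with probability p in each copy gives an
    # expected Jaccard overlap of p / (2 - p) = edge_overlap.
    p = 2 * edge_overlap / (1 + edge_overlap)
    e1 = {(u, v) for (u, v) in edges
          if u in v1 and v in v1 and rng.random() < p}
    e2 = {(u, v) for (u, v) in edges
          if u in v2 and v in v2 and rng.random() < p}
    return (v1, e1), (v2, e2)

def restrict(edge_set, node_set):
    """Edges with both endpoints inside node_set."""
    return {(u, v) for (u, v) in edge_set if u in node_set and v in node_set}

rng = random.Random(42)
nodes = range(200)
edges = [(u, v) for u in nodes for v in nodes if u < v]
(v1, e1), (v2, e2) = split_with_overlap(nodes, edges, 0.25, 0.5, rng)
shared = v1 & v2
node_ov = jaccard(v1, v2)
edge_ov = jaccard(restrict(e1, shared), restrict(e2, shared))
```

The node overlap is exact by construction, while the edge overlap is only met in expectation and fluctuates from run to run.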

Experiments – Propagation

Mapping between two real-world social networks: Flickr and Twitter

Finding the ground truth $\mu_G$:
- Exact matches in either the username or the name field
- 27,000 mappings
- Human inspection found a ground-truth error rate under 5%

Mapping between two real-world social networks

Seeds: 150 pairs of nodes selected from $\mu_G$

Results:
- 30.8% of the mappings were re-identified correctly, 12.1% were identified incorrectly, and 57% were not identified.
- 41% of the incorrectly identified mappings (5% overall) were mapped to nodes which are at a distance 1 from the true mapping.
- 55% of the incorrectly identified mappings (6.7% overall) were mapped to nodes where the same geographic location was reported.
- The above two categories overlap; of all the incorrect mappings, only 27% (or 3.3% overall) fall into neither category and are completely erroneous.
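The "overall" figures follow from multiplying each share by the 12.1% error rate, and the size of the overlap between the two near-miss categories can be derived by inclusion-exclusion. The short Python check below uses only the numbers reported above; the overlap value is a derived quantity, not one stated in the results.

```python
incorrect = 12.1        # % of all mappings identified incorrectly

share_dist1 = 0.41      # mapped to a node at distance 1 from the truth
share_same_geo = 0.55   # mapped to a node with the same location
share_neither = 0.27    # completely erroneous

overall_dist1 = share_dist1 * incorrect        # about 5% overall
overall_same_geo = share_same_geo * incorrect  # about 6.7% overall
overall_neither = share_neither * incorrect    # about 3.3% overall

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B| = 1 - neither
share_both = share_dist1 + share_same_geo - (1 - share_neither)
overall_both = share_both * incorrect
```

Under these figures, roughly 23% of the incorrect mappings (about 2.8% overall) fall into both categories at once.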

Conclusions

Anonymity is not sufficient for privacy when dealing with social networks.

Demonstrate feasibility of successful re-identification based solely on the network topology and assuming that the target graph is completely anonymized.

Reference

[1] Arvind Narayanan and Vitaly Shmatikov, “De-anonymizing Social Networks”, IEEE Security & Privacy '09.
