
Privacy Risk in Anonymized Heterogeneous Information Networks



Presented at EDBT'14. Full paper: http://web.engr.illinois.edu/~lzhang74/papers/aZhang_privacyRiskHIN_edbt14.pdf


Page 1: Privacy Risk in Anonymized Heterogeneous Information Networks

Privacy Risk in Anonymized Heterogeneous Information Networks
(How to Break Anonymity of the KDD Cup 2012 Dataset)

Aston Zhang1, Xing Xie2, Kevin C.-C. Chang1, Carl A. Gunter1, Jiawei Han1, Xiaofeng Wang3

University of Illinois at Urbana-Champaign1

Microsoft Research2

Indiana University at Bloomington3

Page 2: Privacy Risk in Anonymized Heterogeneous Information Networks

K-Anonymity

•  Any tuple can be re-identified with a probability no higher than 1/K
  –  AAABBB: 3-anonymity
  –  AABBBB: 2-anonymity
  –  ABBBBB: 1-anonymity
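As a quick, hypothetical illustration (a minimal sketch, not from the talk), the K level of a dataset is the size of its smallest group of identical tuples:

from collections import Counter

def k_anonymity(tuples):
    # K-anonymity level: size of the smallest group of identical tuples
    return min(Counter(tuples).values())

print(k_anonymity(list("AAABBB")))  # 3
print(k_anonymity(list("AABBBB")))  # 2
print(k_anonymity(list("ABBBBB")))  # 1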

2

Page 3: Privacy Risk in Anonymized Heterogeneous Information Networks

K-Anonymity?

•  If the adversary (attacker) is only interested in breaking privacy of tuples with value B…
  –  AAABBB: 3-anonymity
  –  AABBBB: 4-anonymity?
  –  ABBBBB: 5-anonymity?

3

Page 4: Privacy Risk in Anonymized Heterogeneous Information Networks

Limitations of K-Anonymity

T1000 (1000 tuples of the same value: AAA…AA): 1000-anonymity

T2 (500 pairs of same values: AABBCCDD…): 2-anonymity

Consider injecting a globally unique tuple t*:

T1000* (1000 same tuples: AAAAAA…AAAA t*)

T2* (500 pairs of same values: AABBCCDD…t*)

•  Are their security levels really the same in terms of k-anonymity?

4

Both: 1-anonymity
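Reusing the hypothetical k_anonymity sketch from slide 2 confirms the injected-tuple case:

t1000_star = ["A"] * 1000 + ["t*"]
t2_star = [i for i in range(500) for _ in (0, 1)] + ["t*"]
print(k_anonymity(t1000_star))  # 1
print(k_anonymity(t2_star))     # 1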

Page 5: Privacy Risk in Anonymized Heterogeneous Information Networks

K-Anonymity: Worst Says All

5

Page 6: Privacy Risk in Anonymized Heterogeneous Information Networks

Different Individuals May Have Different Privacy Needs

6

Page 7: Privacy Risk in Anonymized Heterogeneous Information Networks

Our Proposed Privacy Risk

•  Individual Privacy Risk
  –  Allows us to differentiate!
•  Aggregate Individual Privacy Risk to obtain Dataset Privacy Risk
  –  Mathematical Factor
  –  Social Factor

7

Page 8: Privacy Risk in Anonymized Heterogeneous Information Networks

Our Proposed Privacy Risk

•  Privacy risk of tuple ti in dataset T:

•  Privacy risk of dataset T:

8

Individual Privacy Risk

Dataset Privacy Risk

•  l(ti): loss function (social component)
•  k(ti): the number of tuples in T with the same value as ti (math component)

•  N: the size of dataset T (the number of tuples in T)
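The two formulas above appear only as images in the slides; a plausible reconstruction from the symbol definitions and the worked examples on the following slides (an assumption, not copied from the paper):

\[
r(t_i) = \frac{l(t_i)}{k(t_i)}, \qquad
R(T) = \frac{1}{N} \sum_{i=1}^{N} r(t_i) = \frac{1}{N} \sum_{i=1}^{N} \frac{l(t_i)}{k(t_i)}
\]

With a unit loss function l(ti) = 1, R(T) reduces to C(T)/N, which matches the examples on the next slides.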

Page 9: Privacy Risk in Anonymized Heterogeneous Information Networks

Expected Privacy Risk

9

•  Lemma 1: Given dataset T with cardinality C(T), for each tuple ti in T, assuming the loss function is independent of 1/k(ti) with mean value μ, the expected privacy risk is as sketched below.
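The lemma's formula is an image in the deck; a plausible reconstruction (an assumption) consistent with r(ti) = l(ti)/k(ti) and with Theorem 1 on the next slide, taking ti uniformly at random from T:

\[
\mathbb{E}\left[r(t_i)\right] = \mathbb{E}\left[l(t_i)\right] \cdot \mathbb{E}\!\left[\frac{1}{k(t_i)}\right] = \mu \cdot \frac{C(T)}{N}
\]

since the N terms 1/k(ti) contribute exactly one per distinct value, i.e., they sum to C(T).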

 

Page 10: Privacy Risk in Anonymized Heterogeneous Information Networks

Privacy Risk of Dataset

10

•  Theorem 1: The privacy risk of dataset T is R(T) = C(T)/N

•  N: the number of tuples in T

•  C(T): cardinality of T, i.e., the number of distinct (combined) attribute values describing tuples

•  T = ABCDE: C(T) = 5, R(T) = 1

•  T' = AAAAA: C(T') = 1, R(T') = 1/5

•  T'' = AAA: C(T'') = 1, R(T'') = 1/3
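A minimal sketch (not the paper's code; the function name and unit-loss default are assumptions) that computes the dataset privacy risk and reproduces the three examples above:

from collections import Counter

def dataset_privacy_risk(tuples, loss=lambda t: 1.0):
    # R(T) = (1/N) * sum_i l(t_i) / k(t_i); with unit loss this equals C(T)/N
    counts = Counter(tuples)      # k(t_i): number of tuples sharing t_i's value
    n = len(tuples)
    return sum(loss(t) / counts[t] for t in tuples) / n

print(dataset_privacy_risk(list("ABCDE")))  # 1.0      (= 5/5)
print(dataset_privacy_risk(list("AAAAA")))  # 0.2      (= 1/5)
print(dataset_privacy_risk(list("AAA")))    # 0.333... (= 1/3)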

Page 11: Privacy Risk in Anonymized Heterogeneous Information Networks

Privacy Risk = C(T)/N, Better Interpretation

T1000 (1000 same tuples: AAAAAA…AAAA): 1000-anonymity (Privacy Risk: 1/1000)

T2 (500 pairs of same values: AABBCCDD…): 2-anonymity (Privacy Risk: 500/1000)

Inject a globally unique tuple t*:

T1000* (1000 same tuples: AAAAAA…AAAA t*): 1-anonymity (Privacy Risk: 2/1001)

T2* (500 pairs of same values: AABBCCDD…t*): 1-anonymity (Privacy Risk: 501/1001)

R(T1000*)  <  R(T2*)  
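Reusing the hypothetical dataset_privacy_risk sketch from the previous slide reproduces these numbers:

t1000 = ["A"] * 1000
t2 = [i for i in range(500) for _ in (0, 1)]    # 500 distinct values, each twice
print(dataset_privacy_risk(t1000))              # 0.001   (1/1000)
print(dataset_privacy_risk(t2))                 # 0.5     (500/1000)
print(dataset_privacy_risk(t1000 + ["t*"]))     # ~0.002  (2/1001)
print(dataset_privacy_risk(t2 + ["t*"]))        # ~0.5005 (501/1001)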

11

Page 12: Privacy Risk in Anonymized Heterogeneous Information Networks

How Bad Is Our Privacy in HIN?

•  Heterogeneous Information Networks (HIN)

12

Page 13: Privacy Risk in Anonymized Heterogeneous Information Networks

We Live In a More Connected World

•  Heterogeneous Information Networks (HIN)
  –  Multiple types of entities (nodes) or links (edges)
  –  Present in
    •  social media (Twitter)
    •  medical information systems (EMR)
    •  academic information systems (DBLP)
    •  and more…

13

Page 14: Privacy Risk in Anonymized Heterogeneous Information Networks

Heterogeneous Information Networks in Social Media (KDD Cup 2012 Dataset)

14

Page 15: Privacy Risk in Anonymized Heterogeneous Information Networks

Network Schema

15

Page 16: Privacy Risk in Anonymized Heterogeneous Information Networks

Meta-Paths

16

Page 17: Privacy Risk in Anonymized Heterogeneous Information Networks

Attribute-Metapath-Combined Values of Target Entities

17

The neighbors of the target entity A1X are generated along target meta-paths
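As a rough, hypothetical sketch of the idea (the entity ids, meta-path names, and helper below are illustrative, not from the paper): an entity's attribute-metapath-combined value can be represented as its attribute tuple together with the set of neighbors reached along each target meta-path, and two entities are indistinguishable only if these combined values coincide.

def combined_value(attributes, neighbors_by_metapath):
    # attributes: attribute name -> value for the target entity
    # neighbors_by_metapath: meta-path name -> set of entities reached along it
    attr_part = tuple(sorted(attributes.items()))
    path_part = tuple((p, frozenset(neighbors_by_metapath[p]))
                      for p in sorted(neighbors_by_metapath))
    return (attr_part, path_part)

# Hypothetical target entity A1X with two attributes and two target meta-paths
v = combined_value(
    {"Gender": 1, "Year_of_birth": 1985},
    {"User-follow-User": {"U7", "U42"}, "User-retweet-User": {"U42"}},
)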

Page 18: Privacy Risk in Anonymized Heterogeneous Information Networks

Attribute-Metapath-Combined Values of Target Entities

18

•  Theorem 2: For a power-law distribution of the user out-degree, the lower and upper bounds for the expected heterogeneous information network cardinality grow faster than double exponentially with respect to the maximum distance of utilized neighbors.

Privacy Risk = C(T)/N

Page 19: Privacy Risk in Anonymized Heterogeneous Information Networks

Privacy Risk Increases With More Link Types

19

Page 20: Privacy Risk in Anonymized Heterogeneous Information Networks

Privacy Risk Increases With More Link Types

20

Page 21: Privacy Risk in Anonymized Heterogeneous Information Networks

DeHIN Algorithm to Prey on Privacy Risk in HIN

21

Page 22: Privacy Risk in Anonymized Heterogeneous Information Networks

DeHIN Algorithm to Prey on Privacy Risk in HIN

22

Page 23: Privacy Risk in Anonymized Heterogeneous Information Networks

23

[Slide figure: Profile entities linked by edge types a, f, r, c (and as, rs, cs), with profile attributes Gender, Year_of_birth, No_of_Tags, No_of_tweets]

Break Anonymity of the KDD Cup 2012 Dataset

Advantages: exploits the identified privacy risk without requiring the creation of new accounts or relying on easily detectable graph structures in a large-scale network [BDK'07, NV'09]
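The DeHIN pseudocode appears only as figures on the previous slides; the following is a rough, hypothetical rendering of the matching idea (an assumption, not the authors' code): compute attribute-metapath-combined values in both the anonymized network and the auxiliary (background-knowledge) network, then link entities whose combined value is unique on both sides, i.e., k = 1.

from collections import Counter

def match_entities(anon_values, aux_values):
    # anon_values / aux_values: entity id -> attribute-metapath-combined value
    anon_counts = Counter(anon_values.values())
    aux_counts = Counter(aux_values.values())
    aux_by_value = {v: e for e, v in aux_values.items() if aux_counts[v] == 1}
    # Re-identify anonymized entities whose combined value is unique in both networks
    return {
        e: aux_by_value[v]
        for e, v in anon_values.items()
        if anon_counts[v] == 1 and v in aux_by_value
    }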

Page 24: Privacy Risk in Anonymized Heterogeneous Information Networks

Experiments – The Existing Defense

24

Modified based on [Wu and Ying’11]

Page 25: Privacy Risk in Anonymized Heterogeneous Information Networks

25

Experiments – Key Measures

Page 26: Privacy Risk in Anonymized Heterogeneous Information Networks

DeHIN Performs Better on Denser Networks by Utilizing Longer-Distance Neighbors

26

Page 27: Privacy Risk in Anonymized Heterogeneous Information Networks

DeHIN Performs Better with More Link Types

27

Page 28: Privacy Risk in Anonymized Heterogeneous Information Networks

DeHIN's Performance Slightly Degrades When Anonymity Gets Stronger

28

Page 29: Privacy Risk in Anonymized Heterogeneous Information Networks

DeHIN Performs Better with More Link Types

29

Page 30: Privacy Risk in Anonymized Heterogeneous Information Networks

Privacy Risk Increases With More Link Types

30

Page 31: Privacy Risk in Anonymized Heterogeneous Information Networks

The Research Roadmap

31