Xintao Wu
Aug 25, 2014

Research Overview

Outline

• Introduction
• Privacy Preserving Social Network Analysis
  • Input perturbation
  • Output perturbation
• Fraud Detection in Social Networks
  • Spectral analysis of graph topology
  • Detecting Random Link Attacks
  • Detecting weak anomalies
• Sample Projects
• Conclusions and Future Work

Trustworthy Computing

• Trustworthy = reliability, security, privacy, usability
• Sample research challenges
  • Understand and capture emergent behaviors/interactions among regular users, fraudsters, and victims
  • Design secure, survivable, persistent systems when under attack
  • Enable privacy protection in collecting/analyzing/sharing personal data

Privacy Breach Cases

• Nydia Velázquez (1994): medical records on her suicide attempt were disclosed
• AOL search log (2006): the anonymized release of 650K users' search histories lasted less than 24 hours
• Netflix contest (2009): a $1M contest was cancelled due to a privacy lawsuit
• 23andMe (2013): the FDA ordered its genetic testing to be discontinued due to genetic privacy

Acxiom

• Privacy
  • In 2003, EPIC alleged that Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack."
  • In 2013, Acxiom was among nine companies that the FTC investigated to see how they collect and use consumer data.
• Security
  • In 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.

Privacy Regulation (Forrester)

[Figure legend: most restricted; restricted; some restrictions; minimal restrictions; effectively no restrictions; no legislation or no information]

Privacy Protection Laws

• USA
  • HIPAA for health care
  • Gramm-Leach-Bliley Act of 1999 for financial institutions
  • COPPA for children's online privacy
  • State regulations, e.g., California Senate Bill 1386
• Canada
  • PIPEDA 2000: Personal Information Protection and Electronic Documents Act
• European Union
  • Directive 95/46/EC: provides guidelines for member state legislation and forbids sharing data with states that do not protect privacy
• Contractual obligations
  • Individuals should have notice about how their data is used and have opt-out choices

Privacy Preserving Data Mining

ssn  name  zip    race   …  age  sex  income  …  disease
           28223  Asian  …  20   M    85k     …  Cancer
           28223  Asian  …  30   F    70k     …  Flu
           28262  Black  …  20   M    120k    …  Heart
           28261  White  …  26   M    23k     …  Cancer
           .      .      …  .    .    .       …  .
           28223  Asian  …  20   M    110k    …  Flu

• 69% unique on zip and birth date
• 87% unique on zip, birth date, and gender

• Generalization (k-anonymity, l-diversity, t-closeness)
• Randomization
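A minimal sketch of how the k-anonymity idea can be checked on the rows shown above. The choice of quasi-identifiers (zip and age) and the generalization rule (3-digit zip prefix, 20-year age band) are illustrative assumptions, not taken from the slides; the function names are hypothetical.

```python
from collections import Counter

def smallest_group(records, quasi_identifiers):
    """Size of the smallest group of records that agree on all quasi-identifiers."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

def is_k_anonymous(records, quasi_identifiers, k):
    """A table is k-anonymous if every quasi-identifier combination covers >= k rows."""
    return smallest_group(records, quasi_identifiers) >= k

def generalize(record):
    """Assumed generalization: keep a 3-digit zip prefix and a 20-year age band."""
    low = (record["age"] // 20) * 20
    return {**record, "zip": record["zip"][:3] + "**", "age": f"{low}-{low + 19}"}

# Rows from the slide's table (ssn and name are blank there).
records = [
    {"zip": "28223", "race": "Asian", "age": 20, "sex": "M", "disease": "Cancer"},
    {"zip": "28223", "race": "Asian", "age": 30, "sex": "F", "disease": "Flu"},
    {"zip": "28262", "race": "Black", "age": 20, "sex": "M", "disease": "Heart"},
    {"zip": "28261", "race": "White", "age": 26, "sex": "M", "disease": "Cancer"},
    {"zip": "28223", "race": "Asian", "age": 20, "sex": "M", "disease": "Flu"},
]
quasi_identifiers = ["zip", "age"]  # assumed quasi-identifiers

raw_k = smallest_group(records, quasi_identifiers)
generalized = [generalize(r) for r in records]
generalized_k = smallest_group(generalized, quasi_identifiers)
print(raw_k, generalized_k)
```

On these five rows the raw table has a smallest group of size 1 (some rows are unique on zip and age), while the generalized table puts all five rows in one group. Coarser values buy anonymity at the cost of utility, which is the trade-off generalization methods try to balance.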

Social Network Data

Data owner:

name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k

release →

Data miner:

id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k

Threat of Re-identification

Released table (attacked by the attacker):

id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k

Privacy breaches

• Identity disclosure
• Link disclosure
• Attribute disclosure
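The slides do not spell out the attack itself, so the following is only an illustrative sketch of one simple linking attack on the released table: the attacker is assumed to know a target's sex and age group from outside sources and filters the released rows on those attributes. The helper name `candidates` is hypothetical.

```python
# Released table from the slide: names replaced by ids, ages generalized to Y/M/O.
released = [
    {"id": 5, "sex": "F", "age": "Y", "disease": "cancer", "salary": "25k"},
    {"id": 3, "sex": "M", "age": "Y", "disease": "heart", "salary": "110k"},
    {"id": 6, "sex": "F", "age": "Y", "disease": "cancer", "salary": "70k"},
    {"id": 1, "sex": "M", "age": "O", "disease": "flu", "salary": "65k"},
    {"id": 7, "sex": "M", "age": "O", "disease": "cancer", "salary": "300k"},
    {"id": 2, "sex": "M", "age": "Y", "disease": "flu", "salary": "20k"},
    {"id": 9, "sex": "M", "age": "Y", "disease": "cancer", "salary": "45k"},
    {"id": 4, "sex": "M", "age": "M", "disease": "flu", "salary": "95k"},
    {"id": 8, "sex": "F", "age": "M", "disease": "heart", "salary": "70k"},
]

def candidates(table, **known):
    """Rows consistent with the attacker's background knowledge (assumed exact)."""
    return [row for row in table if all(row[a] == v for a, v in known.items())]

# Assumed background knowledge: Irene is a middle-aged woman.
irene_matches = candidates(released, sex="F", age="M")
# Assumed background knowledge: the target is an older man (e.g., Dell or Ed).
old_men_matches = candidates(released, sex="M", age="O")

print([row["id"] for row in irene_matches])
print([row["id"] for row in old_men_matches])
```

A single matching row (as for Irene) gives identity disclosure and, with it, attribute disclosure of disease and salary. Two matching rows still narrow an older man's record down to two candidates.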

Privacy Preservation in Social Network Analysis

• Input Perturbation
  • K-anonymity
  • Generalization
  • Randomization
• Output Perturbation
  • Background on differential privacy
  • Differential privacy preserving social network mining
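As a rough illustration of the randomization branch of input perturbation, the sketch below deletes k randomly chosen true edges and adds k randomly chosen false edges before release. This is one simple edge-randomization strategy chosen for illustration; the function name and parameters are hypothetical and are not claimed to be the exact procedure used in the work summarized here.

```python
import random
from itertools import combinations

def rand_add_del(edges, nodes, k, seed=None):
    """Delete k existing edges and add k non-existing edges, both chosen uniformly at random."""
    rng = random.Random(seed)
    original = {frozenset(e) for e in edges}
    all_pairs = [frozenset(p) for p in combinations(sorted(nodes), 2)]
    non_edges = [p for p in all_pairs if p not in original]
    if k > len(original) or k > len(non_edges):
        raise ValueError("k is too large for this graph")
    # Sort the edge set first so that results are reproducible for a given seed.
    deleted = set(rng.sample(sorted(original, key=sorted), k))
    added = set(rng.sample(non_edges, k))
    return (original - deleted) | added

nodes = range(6)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]  # a small path graph
perturbed = rand_add_del(edges, nodes, k=2, seed=0)
print(sorted(tuple(sorted(e)) for e in perturbed))
```

The edge count is preserved, but an observer can no longer be sure whether any particular released edge is genuine. Larger k gives more protection and more distortion of graph features.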

Our Work

• Feature preserving randomization
  • Spectrum preserving randomization (SDM08)
  • Markov chain based feature preserving randomization (SDM09)
  • Reconstruction from randomized graph (SDM10)
• Link privacy (from the attacker's perspective)
  • Exploiting node similarity feature (PAKDD09 Best Student Paper Runner-up Award)
  • Exploiting graph space via Markov chain (SDM09)

PSNet (NSF-0831204)


Output Perturbation

Data owner:

name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k

Data miner → Data owner: query f
Data owner → Data miner: query result + noise

The noisy result cannot be used to derive whether any individual is included in the database.

Differential Guarantee [Dwork, TCC06]

Database x:

name   disease
Ada    cancer
Bob    heart
Cathy  cancer
Dell   flu
Ed     cancer
Fred   flu

f = count(#cancer); mechanism K returns f(x) + noise = 3 + noise

Database x': the same table with one cancer record removed (achieving opt-out)

f = count(#cancer); mechanism K returns f(x') + noise = 2 + noise
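The slide does not say which noise distribution is used. A standard choice for a counting query is the Laplace mechanism: adding or removing one record changes a count by at most 1, so noise drawn from Laplace(1/ε) gives ε-differential privacy, i.e. Pr[K(x) ∈ S] ≤ e^ε · Pr[K(x') ∈ S] for neighboring databases x and x'. The sketch below assumes that mechanism; the function name, the value of ε, and the choice of which record opts out are illustrative.

```python
import numpy as np

def noisy_count(records, predicate, epsilon, rng):
    """Laplace mechanism for a count query (sensitivity 1, so scale = 1 / epsilon)."""
    true_count = sum(1 for record in records if predicate(record))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

def has_cancer(record):
    return record[1] == "cancer"

# Database x from the slide.
x = [("Ada", "cancer"), ("Bob", "heart"), ("Cathy", "cancer"),
     ("Dell", "flu"), ("Ed", "cancer"), ("Fred", "flu")]
# Neighboring database x': one cancer record opts out (which one is an assumption).
x_prime = [record for record in x if record[0] != "Ada"]

rng = np.random.default_rng(seed=0)
answer_x = noisy_count(x, has_cancer, epsilon=1.0, rng=rng)              # 3 + noise
answer_x_prime = noisy_count(x_prime, has_cancer, epsilon=1.0, rng=rng)  # 2 + noise
print(answer_x, answer_x_prime)
```

Because the two output distributions overlap heavily, a single noisy answer gives little evidence about whether the opted-out record was present. Smaller ε means more noise and a stronger guarantee.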

Our Work

• DP-preserving clustering coefficient (ASONAM12)
  • Divide and conquer
  • Smooth sensitivity
• DP-preserving spectral graph analysis (PAKDD13)
  • LNPP: based on Laplace noise perturbation
  • SBMF: based on the exponential mechanism and MBF density
• Linear refinement of DP-preserving query answering (PAKDD13 Best Application Paper)
• DP-preserving graph generation based on degree correlation (TDP13)

SMASH (NIH R01GM103309)


Cyber Fraud

• Cyber crime costs the US economy $400 billion annually
• OSN fraud and attacks
  • Sybil attack, spam, viral marketing, fraudulent auction, brand jacking, denial of service, etc.
  • Fake followers on Twitter (used in viral marketing) are worth $360 million annually on the black market.

Topology-based Detection

• Fraud characterization
  • Individual vs. collusive
  • Robot vs. money-motivated regular user
  • Random vs. selective target
  • Static vs. dynamic
• Traditional topology-based detection methods
  • incur high computational cost
  • have difficulty detecting collaborative attacks or subtle anomalies

Random Link Attack [Shrivastava, ICDE08]

• An abstraction of collaborative attacks including spam, viral marketing, etc.
• The attacker creates some fake nodes and uses them to attack a large set of randomly selected regular nodes.
• Fake nodes also mimic the real graph structure among themselves to evade detection.
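To make the attack model concrete, here is a small simulation sketch under stated assumptions: each fake node links to a fixed number of uniformly chosen regular nodes, and links among fake nodes are drawn independently with probability `p_internal` as a crude stand-in for "mimicking the real graph structure". The function name and all parameters are hypothetical.

```python
import random

def inject_random_link_attack(edges, regular_nodes, n_fake, victims_per_fake,
                              p_internal, seed=None):
    """Return (fake_nodes, attacked_edge_set) after adding a simulated random link attack."""
    rng = random.Random(seed)
    regular_nodes = list(regular_nodes)
    fake_nodes = [f"fake{i}" for i in range(n_fake)]
    attacked = {frozenset(e) for e in edges}
    # Each fake node attacks randomly selected regular nodes.
    for fake in fake_nodes:
        for victim in rng.sample(regular_nodes, victims_per_fake):
            attacked.add(frozenset((fake, victim)))
    # Fake nodes also link among themselves (assumed: independent random links).
    for i, a in enumerate(fake_nodes):
        for b in fake_nodes[i + 1:]:
            if rng.random() < p_internal:
                attacked.add(frozenset((a, b)))
    return fake_nodes, attacked

regular = list(range(20))
ring = [(i, (i + 1) % 20) for i in regular]  # toy "regular" network
fake_nodes, attacked = inject_random_link_attack(
    ring, regular, n_fake=3, victims_per_fake=8, p_internal=0.5, seed=1)
print(len(attacked) - len(ring), "edges added by the attack")
```

A graph built this way can serve as labeled test input for a detector, since the set of fake nodes is known.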

Spectral Graph Analysis based Fraud Detection

• Examine the spectral space of graph topology.
• Network: n nodes and m edges; undirected, unweighted, and without considering link/node attribute information
• Adjacency matrix A (symmetric)
• Adjacency eigenspace

[Figure labels: Eigenspace; Principal Minor]
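For reference, the adjacency eigenspace rests on the standard eigen-decomposition of a real symmetric matrix. The ordering of eigenvalues and the symbols λ_i, x_i below are the usual conventions, assumed here rather than quoted from the slide:

```latex
A = \sum_{i=1}^{n} \lambda_i \, x_i x_i^{T},
\qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n,
\qquad x_i^{T} x_j =
\begin{cases}
1 & \text{if } i = j \\
0 & \text{if } i \ne j
\end{cases}
```

Because A is symmetric, all eigenvalues are real and the eigenvectors can be chosen orthonormal, so the leading eigenvectors span a low-dimensional space in which each node can be placed.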

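Projecting Node in Spectral Space [SDM09]

Spectral coordinate of node u, given eigenvectors x_1, ..., x_k of A:

\[ \alpha_u = (x_{1u}, x_{2u}, \ldots, x_{ku}) \]

\[
(x_1 \; x_2 \; \cdots \; x_k) =
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{k1} \\
x_{12} & x_{22} & \cdots & x_{k2} \\
\vdots & \vdots &        & \vdots \\
x_{1n} & x_{2n} & \cdots & x_{kn}
\end{pmatrix}
\]

Row u of this matrix is \( \alpha_u \).

k-orthogonal line pattern:

\[ \cos(\alpha_u, \alpha_v) = \frac{\alpha_u \cdot \alpha_v}{\|\alpha_u\|_2 \, \|\alpha_v\|_2} \approx 1 \quad \text{when nodes } u, v \text{ are from the same community} \]

\[ \alpha_u \cdot \alpha_v \approx 0 \quad \text{when nodes } u, v \text{ are from different communities} \]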
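The line pattern can be observed on a toy graph. The sketch below builds two 5-node cliques joined by a single edge, takes the eigenvectors of the two largest adjacency eigenvalues as the spectral space (using the largest eigenvalues is an assumption here), and compares cosines between node coordinates. The graph and the helper names are illustrative, not from the slides.

```python
import numpy as np

def spectral_coordinates(adj, k):
    """Rows are node coordinates in the space of the k leading adjacency eigenvectors."""
    _, vecs = np.linalg.eigh(adj)   # eigenvalues come back in ascending order
    return vecs[:, ::-1][:, :k]     # keep the eigenvectors of the k largest eigenvalues

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_clique_graph(size):
    """Two cliques of `size` nodes each, joined by one bridging edge."""
    n = 2 * size
    adj = np.zeros((n, n))
    adj[:size, :size] = 1
    adj[size:, size:] = 1
    np.fill_diagonal(adj, 0)
    adj[size - 1, size] = adj[size, size - 1] = 1
    return adj

adj = two_clique_graph(5)
coords = spectral_coordinates(adj, k=2)
same_community = cosine(coords[0], coords[1])       # both nodes in the first clique
different_community = cosine(coords[0], coords[9])  # one node from each clique
print(same_community, different_community)
```

Nodes of the same clique point in nearly the same direction (cosine close to 1), while nodes from different cliques are close to orthogonal (cosine close to 0). Flipping the sign of an eigenvector does not change these cosines.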

Example

Spectral coordinate: \( \alpha_u = (x_{1u}, x_{2u}, \ldots, x_{ku}) \)

[Figure: Polbook Network]

Evaluation on Web spam challenge data [ICDE11]

• Data: a snapshot of websites in the .UK domain (2007) with 114K nodes and 1.8M links, with a mix of 8 RLAs of varied sizes and connection patterns added
• SPCTRA: based on spectral space
• GREEDY: based on outer-triangles [Shrivastava, ICDE08]
• Much faster: 36s vs. 26h


Privacy Preserving Data Mining (NSF CAREER)

Genetic Privacy (NSF SCH pending)

• BIBM13 Best Paper Award

oSafari (NSF SaTC)

Manipulation in E-Commerce (NSF III pending)

[Figure labels: Structured Topic Analysis; Spectral Bipartite Graph Analysis; D-S based Evidence Fusion; Bot-committed, Money-motivated; Reviews, Ratings, Ranks]

Privacy Preserving Database Application Testing (NSF 0310974)

[Figure labels: ER; Data; DDL; Catalog; Production DB; R, NR, S; Conflict resolution; Disclosure Assessment; Rule Analyzer; R', NR', S'; Schema & Domain Filter; Schema', Domain'; Data Generator; Mock DB; User]

Data Generation for Testing DB Applications (NSF 0915059)

How to generate data to cover paths?


Big Data Computing

• Drowning in data
  • Volume, Velocity, Variety, and Veracity
  • 2.5 exabytes every day
  • Web data, healthcare, e-commerce, social networks
• Advancing technology
  • Cheap storage/processing power
  • Growth in huge data centers
  • Data is in the "cloud": Amazon AWS, Hadoop, Azure
  • Computing is in the "cloud"

Social Media Customer Analytics

• Velocity, Variety
  • Network topology (friendship, followship, interaction)
  • Structured profile, e.g., (name, sex, age, disease, salary) = (Ada, F, 18, cancer, 25k), (Bob, M, 25, heart, 110k); (id, sex, age, address, income) = (5, F, Y, NC, 25k), (3, M, Y, SC, 110k)
  • Retweet sequence
  • Product and review
  • Unstructured text (e.g., blog, tweet)
  • Transaction database
• Entity resolution, patterns, temporal/spatial, scalability, visualization, sentiment, privacy
• 10GB tweets per day
• Belk and Lowe's
• Chancellor's special fund

Samsung AVC Denial Log Analysis

• Volume and velocity: 1 million log files per day, each with thousands of entries
• S3, Hive, and EMR

Drivers of Data Computing

• 6 A's: Anytime, Anywhere, Access to Anything by Anyone Authorized
• 4 V's: Volume, Velocity, Variety, Veracity
• Reliability, Security, Privacy, Usability

Thank You! Questions?


Collaborators: Aidong Lu, Xinghua Shi, Jun Li (Oregon), Dejing Dou (Oregon), Tao Xie (UIUC)

Doctoral graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying

Doctoral Students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)
