Xintao Wu Aug 25,2014
Research Overview
Outline
Introduction
Privacy Preserving Social Network Analysis
  Input perturbation
  Output perturbation
Fraud Detection in Social Networks
  Spectral analysis of graph topology
  Detecting Random Link Attacks
  Detecting weak anomalies
Sample Projects
Conclusions and Future work
Trustworthy Computing
Trustworthy = reliability, security, privacy, usability
Sample research challenges
  Understand and capture emergent behaviors/interactions among regular users, fraudsters, and victims
  Design secure, survivable, persistent systems when under attack
  Enable privacy protection in collecting/analyzing/sharing personal data
Privacy Breach Cases
Nydia Velázquez (1994)
  Medical record on her suicide attempt was disclosed
AOL Search Log (2006)
  Anonymized release of 650K users’ search histories lasted for less than 24 hours
Netflix Contest (2009)
  $1M contest was cancelled due to privacy lawsuit
23andMe (2013)
  Genetic testing was ordered to discontinue by FDA due to genetic privacy
Acxiom
Privacy
  In 2003, the EPIC alleged Acxiom provided consumer information to the US Army "to determine how information from public and private records might be analyzed to help defend military bases from attack."
  In 2013, Acxiom was among nine companies that the FTC investigated to see how they collect and use consumer data.
Security
  In 2003, more than 1.6 billion customer records were stolen during the transmission of information to and from Acxiom's clients.
Privacy Regulation -- Forrester
[Figure: map of privacy regulation levels. Legend: Most restricted, Restricted, Some restrictions, Minimal restrictions, Effectively no restrictions, No legislation or no information]
Privacy Protection Laws
USA
  HIPAA for health care
  Gramm-Leach-Bliley Act of 1999 for financial institutions
  COPPA for children's online privacy
  State regulations, e.g., California Senate Bill 1386
Canada
  PIPEDA 2000 - Personal Information Protection and Electronic Documents Act
European Union
  Directive 95/46/EC - Provides guidelines for member state legislation and forbids sharing data with states that do not protect privacy
Contractual obligations
  Individuals should have notice about how their data is used and have opt-out choices
Privacy Preserving Data Mining
ssn  name  zip    race   …  age  sex  income  …  disease
           28223  Asian  …  20   M    85k     …  Cancer
           28223  Asian  …  30   F    70k     …  Flu
           28262  Black  …  20   M    120k    …  Heart
           28261  White  …  26   M    23k     …  Cancer
           .      .      …  .    .    .       …  .
           28223  Asian  …  20   M    110k    …  Flu
69% unique on zip and birth date
87% with zip, birth date and gender
Generalization (k-anonymity, l-diversity, t-closeness)
Randomization
Social Network Data
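The uniqueness figures and the generalization idea above can be illustrated with a minimal Python sketch. It assumes a toy table: the rows and the column names (`zip`, `age`, `sex`) are illustrative only, and `smallest_group` and `generalize` are hypothetical helper names. A table is k-anonymous on a set of quasi-identifiers when every combination of their values is shared by at least k rows.

```python
from collections import Counter

# Toy records (illustrative values only): quasi-identifiers plus a sensitive attribute.
records = [
    {"zip": "28223", "age": 20, "sex": "M", "disease": "Cancer"},
    {"zip": "28223", "age": 30, "sex": "F", "disease": "Flu"},
    {"zip": "28262", "age": 20, "sex": "M", "disease": "Heart"},
    {"zip": "28261", "age": 26, "sex": "M", "disease": "Cancer"},
    {"zip": "28223", "age": 24, "sex": "M", "disease": "Flu"},
    {"zip": "28262", "age": 35, "sex": "F", "disease": "Flu"},
]

def smallest_group(rows, quasi_ids):
    """Return the largest k for which the table is k-anonymous on quasi_ids."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return min(groups.values())

def generalize(row):
    """Coarsen quasi-identifiers: 3-digit zip prefix, decade age band, suppressed sex."""
    out = dict(row)
    out["zip"] = row["zip"][:3] + "**"
    low = (row["age"] // 10) * 10
    out["age"] = f"{low}-{low + 9}"
    out["sex"] = "*"
    return out

QUASI_IDS = ["zip", "age", "sex"]
k_raw = smallest_group(records, QUASI_IDS)
k_generalized = smallest_group([generalize(r) for r in records], QUASI_IDS)
print(k_raw, k_generalized)  # prints "1 2"
```

In this toy table every raw row is unique on its quasi-identifiers (k = 1), while the generalized version places each row in a group of at least two.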
Threat of Re-identification
Data owner: original table
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k
release: table given to the data miner
id  sex  age  disease  salary
5   F    Y    cancer   25k
3   M    Y    heart    110k
6   F    Y    cancer   70k
1   M    O    flu      65k
7   M    O    cancer   300k
2   M    Y    flu      20k
9   M    Y    cancer   45k
4   M    M    flu      95k
8   F    M    heart    70k
Attacker: attacks the released table
Privacy breaches
  Identity disclosure
  Link disclosure
  Attribute disclosure
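A minimal sketch of how an attacker could link background knowledge to the released table. The released rows are the ones on the slide. The age thresholds behind the Y/M/O bands are an assumption (the source does not state them), and `age_band` and `link` are hypothetical helper names.

```python
# Released table from the slide: (id, sex, age band, disease, salary).
released = [
    (5, "F", "Y", "cancer", "25k"),
    (3, "M", "Y", "heart", "110k"),
    (6, "F", "Y", "cancer", "70k"),
    (1, "M", "O", "flu", "65k"),
    (7, "M", "O", "cancer", "300k"),
    (2, "M", "Y", "flu", "20k"),
    (9, "M", "Y", "cancer", "45k"),
    (4, "M", "M", "flu", "95k"),
    (8, "F", "M", "heart", "70k"),
]

def age_band(age):
    """Assumed banding (not stated in the source): Y under 30, M 30 to 59, O 60 and over."""
    if age < 30:
        return "Y"
    return "M" if age < 60 else "O"

def link(sex, age):
    """Released rows consistent with the attacker's knowledge of a target's sex and age."""
    return [row for row in released if row[1] == sex and row[2] == age_band(age)]

# Background knowledge about two targets, taken from the original table on the slide.
ada_matches = link("F", 18)    # two candidate rows
irene_matches = link("F", 45)  # a single candidate row

ada_diseases = {row[3] for row in ada_matches}
print(len(ada_matches), ada_diseases)
print(irene_matches)
```

Under these assumptions Irene maps to exactly one row (identity disclosure), while Ada maps to two rows that share the same disease, so her diagnosis leaks without her row being pinned down (attribute disclosure).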
Privacy Preservation in Social Network Analysis
• Input Perturbation
  • K-anonymity
  • Generalization
  • Randomization
• Output Perturbation
  • Background on differential privacy
  • Differential privacy preserving social network mining
Our Work
Feature preservation randomization
  Spectrum preserving randomization (SDM08)
  Markov chain based feature preserving randomization (SDM09)
  Reconstruction from randomized graph (SDM10)
Link privacy (from the attacker perspective)
  Exploiting node similarity feature (PAKDD09 Best Student Paper Runner-up Award)
  Exploiting graph space via Markov chain (SDM09)
PSNet (NSF-0831204)
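As a point of reference for graph randomization, here is a minimal sketch of plain random edge addition/deletion: remove k existing edges and add k non-existing ones, so the edge count is preserved. This is a generic baseline and not the spectrum preserving or Markov chain based algorithms cited above; `rand_add_del` and `degrees` are hypothetical names, and the toy graph is illustrative.

```python
import random
from itertools import combinations

def rand_add_del(edges, num_nodes, k, seed=0):
    """Delete k existing edges and add k non-existing ones, chosen uniformly at random."""
    rng = random.Random(seed)
    current = {tuple(sorted(e)) for e in edges}
    non_edges = [p for p in combinations(range(num_nodes), 2) if p not in current]
    removed = rng.sample(sorted(current), k)
    added = rng.sample(non_edges, k)
    return (current - set(removed)) | set(added)

def degrees(edge_set, num_nodes):
    """Degree of every node in an undirected edge set."""
    deg = [0] * num_nodes
    for u, v in edge_set:
        deg[u] += 1
        deg[v] += 1
    return deg

# Toy undirected graph: two triangles joined by one edge (illustrative only).
original = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
perturbed = rand_add_del(original, num_nodes=6, k=2)
print(sorted(perturbed))
print(degrees(original, 6), degrees(perturbed, 6))
```

The number of edges is unchanged, but degrees and other structural features can drift, which is the motivation for randomization schemes that try to preserve chosen features.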
Output Perturbation
Data owner
name    sex  age  disease  salary
Ada     F    18   cancer   25k
Bob     M    25   heart    110k
Cathy   F    20   cancer   70k
Dell    M    65   flu      65k
Ed      M    60   cancer   300k
Fred    M    24   flu      20k
George  M    22   cancer   45k
Harry   M    40   flu      95k
Irene   F    45   heart    70k
Data miner
  Query f
  Query result + noise
Cannot be used to derive whether any individual is included in the database
Differential Guarantee [Dwork, TCC06]
name   disease
Ada    cancer
Bob    heart
Cathy  cancer
Dell   flu
Ed     cancer
Fred   flu
Database x (the table above): f = count(#cancer), K returns f(x) + noise = 3 + noise
Database x’ (the same table after one cancer patient opts out): f = count(#cancer), K returns f(x’) + noise = 2 + noise
achieving Opt-Out
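The counting example above can be sketched with the Laplace mechanism, a standard way to add the noise. A count changes by at most 1 when one individual is added or removed (sensitivity 1), so the noise scale is 1/epsilon. The value of `epsilon`, the fixed seed, and the choice of which cancer patient opts out are assumptions for illustration; `laplace_noise` and `noisy_count` are hypothetical names.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the scaled difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))

def noisy_count(rows, predicate, epsilon, rng):
    """Counting query plus Laplace noise; sensitivity is 1, so scale = 1 / epsilon."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

x = [("Ada", "cancer"), ("Bob", "heart"), ("Cathy", "cancer"),
     ("Dell", "flu"), ("Ed", "cancer"), ("Fred", "flu")]
# Neighboring database: one cancer patient opts out (which one is an arbitrary choice here).
x_prime = [row for row in x if row[0] != "Cathy"]

def has_cancer(row):
    return row[1] == "cancer"

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
epsilon = 0.5            # hypothetical privacy budget

print(noisy_count(x, has_cancer, epsilon, rng))        # 3 + noise
print(noisy_count(x_prime, has_cancer, epsilon, rng))  # 2 + noise
```

Because the two noisy answers are drawn from distributions that overlap heavily, a single released value gives little evidence about whether the individual who opted out was in the database.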
Our Work
DP-preserving cluster coefficient (ASONAM12)
  Divide and conquer
  Smooth sensitivity
DP-preserving spectral graph analysis (PAKDD13)
  LNPP: based on the Laplace Noise Perturbation
  SBMF: based on the Exponential Mechanism and MBF density
Linear-refinement of DP-preserving query answering (PAKDD13 Best Application Paper)
DP-preserving graph generation based on degree correlation (TDP13)
SMASH (NIH R01GM103309)
Outline
Introduction
Privacy Preserving Social Network Analysis
  Input perturbation
  Output perturbation
Fraud Detection
  Spectral analysis of graph topology
  Detecting Random Link Attacks
  Detecting weak anomalies
Sample Projects
Conclusions and Future work
Cyber Fraud
Cyber crime
  Costs the US economy $400 billion annually
OSN Fraud and Attack
  Sybil attack, spam, viral marketing, fraudulent auction, brand jacking, denial of service, etc.
  Fake followers on Twitter (used in viral marketing) are worth $360 million annually on the black market.
Topology-based Detection
Fraud Characterization
  Individual vs. collusive
  Robot vs. money-motivated regular user
  Random vs. selective target
  Static vs. dynamic
Traditional topology-based detection methods
  Incur high computational cost
  Have difficulty detecting collaborative attacks or subtle anomalies
Random Link Attack [Shrivastava ICDE08]
An abstraction of collaborative attacks including spam, viral marketing, etc.
The attacker creates some fake nodes and uses them to attack a large set of randomly selected regular nodes.
Fake nodes also mimic the real graph structure among themselves to evade detection.
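A minimal sketch of injecting a Random Link Attack into a graph, following the description above: fake nodes each attack randomly selected regular nodes and also link among themselves. The inner structure here is a simple random graph with probability `inner_prob`, which is a simplification of "mimicking the real graph structure". `inject_rla`, its parameters, and the ring-shaped regular graph are all hypothetical.

```python
import random

def inject_rla(num_regular, regular_edges, num_fake, victims_per_fake, inner_prob, seed=0):
    """Add fake nodes that attack random regular nodes and link among themselves."""
    rng = random.Random(seed)
    edges = {tuple(sorted(e)) for e in regular_edges}
    fakes = list(range(num_regular, num_regular + num_fake))
    for f in fakes:
        # Attack edges: each fake node picks its victims uniformly at random.
        for victim in rng.sample(range(num_regular), victims_per_fake):
            edges.add((victim, f))
    # Inner edges among fake nodes (simplified to a random graph).
    for i, f in enumerate(fakes):
        for g in fakes[i + 1:]:
            if rng.random() < inner_prob:
                edges.add((f, g))
    return edges, fakes

# Regular graph: a ring of 20 nodes (illustrative only).
ring = [(i, (i + 1) % 20) for i in range(20)]
edges, fakes = inject_rla(num_regular=20, regular_edges=ring, num_fake=3,
                          victims_per_fake=5, inner_prob=0.5, seed=1)
print(len(edges), fakes)
```

A detector can then be evaluated by checking how many of the nodes in `fakes` it flags on the combined graph.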
Spectral Graph Analysis based Fraud Detection
Examine the spectral space of graph topology.
Network: n nodes and m edges, undirected and unweighted, with link/node attribute information not considered
Adjacency Matrix A (symmetric)
Adjacency Eigenspace
Eigenspace
Principal Minor
Projecting Node in Spectral Space [SDM09]
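Spectral coordinate: $\alpha_u = (x_{1u}, x_{2u}, \ldots, x_{ku})$, the u-th row of the first k eigenvector columns:

$$
\begin{pmatrix} x_1 & x_2 & \cdots & x_k & \cdots & x_n \end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{21} & \cdots & x_{k1} & \cdots & x_{n1} \\
x_{12} & x_{22} & \cdots & x_{k2} & \cdots & x_{n2} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{1n} & x_{2n} & \cdots & x_{kn} & \cdots & x_{nn}
\end{pmatrix}
$$

k-orthogonal line pattern
$\cos(\alpha_u, \alpha_v) = \dfrac{\alpha_u \cdot \alpha_v}{\|\alpha_u\|_2 \, \|\alpha_v\|_2} \approx 1$ when nodes u, v are from the same community
$\alpha_u \cdot \alpha_v \approx 0$ when nodes u, v are from different communities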
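The line pattern can be checked on a toy graph with a minimal, dependency-free sketch. It builds two 5-node cliques joined by one edge, takes the two leading eigenvectors of the adjacency matrix, and compares the cosine between spectral coordinates. Power iteration with deflation is used here only as a stand-in eigen-solver (any symmetric eigen-solver would do), and it assumes the eigenvalues of largest magnitude are also the largest ones, which holds for this graph. All function names are hypothetical.

```python
import math

def mat_vec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_eigenpairs(A, k, iters=1000):
    """Power iteration with deflation on a symmetric matrix; returns k (eigenvalue, vector) pairs."""
    n = len(A)
    A = [row[:] for row in A]  # work on a copy
    pairs = []
    for _ in range(k):
        v = normalize([float(i + 1) for i in range(n)])  # deterministic start vector
        for _ in range(iters):
            v = normalize(mat_vec(A, v))
        lam = sum(x * y for x, y in zip(v, mat_vec(A, v)))  # Rayleigh quotient
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                A[i][j] -= lam * v[i] * v[j]
    return pairs

def spectral_coordinates(A, k):
    """Row u of the first k eigenvector columns is the spectral coordinate of node u."""
    vecs = [v for _, v in top_eigenpairs(A, k)]
    return [tuple(vec[u] for vec in vecs) for u in range(len(A))]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Toy graph: nodes 0-4 and 5-9 form two cliques, joined by the single edge (4, 5).
n = 10
A = [[0.0] * n for _ in range(n)]
for block in (range(0, 5), range(5, 10)):
    for u in block:
        for v in block:
            if u != v:
                A[u][v] = 1.0
A[4][5] = A[5][4] = 1.0

alpha = spectral_coordinates(A, k=2)
print(cosine(alpha[0], alpha[1]))  # same community
print(cosine(alpha[0], alpha[9]))  # different communities
```

On this graph the same-community pair has cosine close to 1 and the cross-community pair has cosine close to 0; the two nodes on the connecting edge deviate the most from their lines.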
Example
Spectral coordinate: $\alpha_u = (x_{1u}, x_{2u}, \ldots, x_{ku})$
Polbook Network
Evaluation on Web spam challenge data [ICDE11]
A snapshot of websites in domain .UK (2007) (114K nodes and 1.8M links), with a mix of 8 RLAs of varied sizes and connection patterns added.
SPCTRA: based on spectral space
GREEDY: based on outer-triangles [Shrivastava, ICDE08]
Much faster: 36s (SPCTRA) vs. 26h (GREEDY)
Outline
Introduction
Privacy Preserving Social Network Analysis
  Input perturbation
  Output perturbation
Fraud Detection
  Spectral analysis of graph topology
  Detecting random link attacks
  Detecting weak anomalies
Sample Projects
Conclusions and Future work
Privacy Preserving Data Mining (NSF CAREER)
Genetic Privacy (NSF SCH pending)
BIBM13 Best Paper Award
oSafari (NSF SaTC)
Manipulation in E-Commerce (NSF III pending)
Structured Topic Analysis
Spectral Bipartite Graph Analysis
D-S based Evidence Fusion
• Bot-committed
• Money-motivated
Reviews, Ratings, Ranks
Privacy Preserving Database Application testing (NSF 0310974)
[Figure: system diagram. Components: Production db (ER, Data, DDL, Catalog); R, NR, S; Conflict resolution; Disclosure Assessment; Rule Analyzer; R’, NR’, S’; Schema & Domain Filter; Schema’, Domain’; Data Generator; Mock DB; User]
Data Generation for Testing DB Applications (NSF 0915059)
How to generate data to cover paths?
Outline
Introduction
Privacy Preserving Social Network Analysis
  Input perturbation
  Output perturbation
Fraud Detection
  Spectral analysis of graph topology
  Detecting Random Link Attacks
  Detecting weak anomalies
Sample Projects
Conclusions and Future work
Big Data Computing
Drowning in data
  Volume, Velocity, Variety, and Veracity
  2.5 exabytes every day
  Web data, healthcare, e-commerce, social network
Advancing technology
  Cheap storage/processing power
  Growth in huge data centers
  Data is in the “cloud”: Amazon AWS, Hadoop, Azure
  Computing is in the “cloud”
Social Media Customer Analytics
Network topology (friendship, followship, interaction)
Structured profile
name  sex  age  disease  salary
Ada   F    18   cancer   25k
Bob   M    25   heart    110k
…
id  sex  age  address  income
5   F    Y    NC       25k
3   M    Y    SC       110k
Unstructured text (e.g., blog, tweet)
Transaction database
Retweet sequence
Product and review
Entity resolution, Patterns, Temporal/spatial, Scalability, Visualization, Sentiment, Privacy
Velocity, Variety
10GB of tweets per day
Belk and Lowe’s
Chancellor’s special fund
Samsung AVC Denial Log Analysis
Volume and Velocity: 1 million log files per day, each with thousands of entries
S3, Hive and EMR
Drivers of Data Computing
6 A’s: Anytime, Anywhere, Access to Anything by Anyone Authorized
4 V’s: Volume, Velocity, Variety, Veracity
Reliability, Security, Privacy, Usability
Thank You! Questions?
Collaborators: Aidong Lu, Xinghua Shi, Jun Li (Oregon), Dejing Dou (Oregon), Tao Xie (UIUC)
Doctoral graduates: Songtao Guo, Ling Guo, Kai Pan, Leting Wu, Xiaowei Ying
Doctoral Students: Yue Wang, Yuemeng Li, Zhilin Luo (visiting)