Privacy-preserving Data Mining for the Internet of Things: State of the Art
Yee Wei Law (罗裔纬 )
wsnlabs.com
Speaker’s brief bio
• Ph.D. from University of Twente for research on security of wireless sensor networks (WSNs) in EU project EYES
• Research Fellowship on WSNs from The University of Melbourne
  – ARC projects “Trustworthy sensor networks: theory and implementation”, “BigNet”
  – EU FP7 projects “SENSEI”, “SmartSantander”, “IoT-i”, “SocIoTal”
  – IBES seed projects on participatory sensing, smart grids
  – Taught Master’s course “Sensor Systems”
• Professional membership:
  – Associate of (ISC)2 (junior CISSP)
  – Smart Grid Australia Research Working Group
Current research interests:
• Privacy-preserving data mining
• Secure/resilient control
• Applications of above to the IoT and smart grid
Current research orientation:
• Mixed basic/applied research in data science or network science
• Research involving probabilistic/statistical, combinatorial, matrix analysis
Agenda
• The IoT and its research priorities
  – Participatory sensing (PS)
  – Collaborative learning (CL)
• Introduction to privacy-preserving data mining
• Schemes suitable for PS and CL
• Research opportunities and challenges
• If time permits, SOCIOTAL
A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual “things” have identities, physical attributes, and virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network.
H. Sundmaeker et al., “Vision and Challenges for Realising the Internet of Things,” Cluster of European Research Projects on the Internet of Things, Mar. 2010.
Evidence of the Internet of Things
Nissan EPORO robot cars Smart grid
Research priorities
ITU-T: “Through the exploitation of identification, data capture, processing and communication capabilities, the IoT makes full use of things to offer services to all kinds of applications, whilst maintaining the required privacy.”
Among research priorities:
• Mathematical models and algorithms for inventory management, production scheduling, and data mining
• Privacy aware data processing
Smart transport Smart grid Smart water Smart whatever
ARPAnet
Machine-to-machine communications
Some graphics from Sabina Jeschke
• We have enough technology to hook things up; now we should find better ways of capturing and analyzing data.
• Introducing participatory sensing and collaborative learning...
Shifting priorities
Participatory sensing
A process whereby individuals and communities use evermore-capable mobile phones and cloud services to collect and analyze systematic data for use in discovery.
Source: Estrin et al.
Citizen-provided data can improve governance with benefits including:
• Increased public safety
• Increased social inclusion and awareness
• Increased resource efficiency for sustainable communities
• Increased public accountability
Data sharing scenarios
Lindell and Pinkas [2000]: “privacy-preserving data mining” refers to privacy-preserving distributed data mining
Data sharing scenarios (cont’d)
• Collaborative learning: Multiple data owners collaboratively analyze the union of their data with the involvement of a third-party data miner.
• Agrawal and Srikant [2000] coined the term “privacy-preserving data mining” to refer to privacy-preserving collaborative learning.
• Encrypting data on its way to the data miner is inadequate; the data should be masked, at a point that balances accuracy against privacy.
Privacy-preserving collaborative learning
• Requirement imposed by participatory sensing:
  – online data submission, offline data processing
• Design space:
  – Data type:
    • continuous or categorical
    • voice, images, videos, etc.
  – Data structure:
    • relational or time series
    • for relational data: horizontally or vertically partitioned
  – Data mining operation
[Figure: taxonomy covering adversarial models; privacy criteria (semantic, syntactic, differential privacy, proposed criterion); and techniques (SMC; randomization: linear (additive, multiplicative) and nonlinear)]
Adversarial models
Semi-honest (honest but curious)
• Passive attacker tries to learn the private states of other parties, without deviating from protocol
• By definition, semi-honest parties do not collude
Malicious
• Active attacker tries to learn the private states of other parties, and deviates arbitrarily from protocol
• Common approach: Design in the semi-honest model, enhance it for the malicious model
• General method: zero-knowledge proofs, often not practical
• Semi-honest model often realistic enough
Syntactic privacy criteria
• To prevent syntactic attacks, e.g., table linkage:
  – Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
  – Attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table
• Relevant for relational data, not time series data
• Example:
  – k-anonymity
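k-anonymity requires that every combination of quasi-identifier values in the released table be shared by at least k records, so a linkage attack cannot narrow a target down to fewer than k candidates. A minimal sketch of such a check follows; the table, the column names, and the helper `is_k_anonymous` are hypothetical illustrations, not part of the talk.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical toy table: age band and masked postcode are the quasi-identifiers,
# "condition" is the sensitive attribute.
table = [
    {"age": "30-40", "postcode": "30**", "condition": "flu"},
    {"age": "30-40", "postcode": "30**", "condition": "asthma"},
    {"age": "40-50", "postcode": "31**", "condition": "flu"},
    {"age": "40-50", "postcode": "31**", "condition": "diabetes"},
]

print(is_k_anonymous(table, ["age", "postcode"], k=2))
print(is_k_anonymous(table, ["age", "postcode"], k=3))
```

Each group here holds exactly two records, so the table passes for k=2 and fails for k=3.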
Semantic privacy criteria
• To minimize the difference between adversarial prior knowledge and adversarial posterior knowledge about individuals represented in the database
• General enough for most data types, relational or time series
• Examples:
  – Cryptographic privacy
  – Differential privacy
[Figure: privacy criteria (cryptographic privacy, differential privacy) and techniques (secure multiparty computation, randomization)]
Secure multiparty computation
Oblivious transfer
• Introduced by Rabin [1981]
• Kilian [1988] showed oblivious transfer is sufficient for secure two-party computation
• Naor et al. [2001] reduce the amortized overhead of oblivious transfer to one exponentiation per logarithmic number of oblivious transfers
• Homomorphic encryption can be used in the semi-honest model
[Figure: two parties with private inputs x1 and x2 jointly compute the output f(x1, x2)]
Garbled circuits for arbitrary functions [Beaver et al. 1990]
Metaphor: Yao’s millionaire problem [1982]
Building blocks: Oblivious transfer
[Figure: 1-out-of-n oblivious transfer. The sender holds n values; the receiver chooses a value; the sender doesn’t know which]
Differential privacy
• In cryptography, semantic security: whatever is computable about the cleartext given the ciphertext is also efficiently computable without the ciphertext
• Useless for PPDM: A DB satisfying above has no utility
• Dwork [2006] proposed “differential privacy” for statistical disclosure control: add noise to query results
Differential privacy (cont’d)
• Theoretical basis for answering “sum queries”
• Sum queries can be used for histogram, mean, covariance, correlation, SVD, PCA, k-means, decision tree, etc.
[Figure labels: row index, row; differential privacy, sensitivity, Laplace noise, noisy sum queries]
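The chain from sensitivity to Laplace noise to noisy sum queries can be sketched with the standard Laplace mechanism: add noise of scale (sensitivity / ε) to the true sum. This is a minimal sketch under stated assumptions, not the formulation on the slide: values are clipped to a hypothetical range [lower, upper], neighbouring databases differ by adding or removing one row, and the function names are illustrative.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_sum(values, lower, upper, epsilon, rng):
    """Answer a sum query with Laplace noise calibrated to its sensitivity."""
    clipped = [min(max(v, lower), upper) for v in values]
    # Adding or removing one clipped row changes the sum by at most this much.
    sensitivity = max(abs(lower), abs(upper))
    return sum(clipped) + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(0)
readings = [1.0] * 100                      # hypothetical per-participant values
print(noisy_sum(readings, 0.0, 1.0, epsilon=1.0, rng=rng))
```

A smaller ε gives stronger privacy but a wider noise distribution; a mean or histogram can be built from such noisy sums at the cost of splitting the privacy budget across queries.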
Taxonomy of attacks against randomization-based approaches
Known input/sample attack:
• The attacker has some input samples and all output samples but does not know which input sample corresponds to which output sample
• Typically begins with establishing correspondences between the input samples and the output samples
Known input-output attack:
• The attacker has some input samples and all output samples, and knows which input sample corresponds to which output sample
Proposed privacy criterion:
The distance between f(X) and estimated f(X) is kept above a specified threshold under known attacks
Randomization
• Additive perturbation: adds noise to data
• iid noise susceptible to:
  – Spectral filtering attack by Kargupta et al. [2003]
  – PCA attack by Huang et al. [2005]:
    – Estimate covariance matrix of original data
    – Find eigenvalues and eigenvectors of covariance matrix through PCA
–
• Bayesian estimation may not have analytic form
Randomization
[Figure: randomization as randomized distortion or perturbation of data, for time series data and relational data; linear (additive perturbation, multiplicative perturbation) and nonlinear]
Collaborative learning using additive perturbation
• Compared to multiplicative perturbation, easier to recover the source data distribution fX(x) from the perturbed data distribution and noise distribution
• To resist attacks, noise should be correlated with the data and participant-specific
• PoolView [Ganti et al. 2008] builds a model of the data, then generates noise from the model:
• With a common noise model, a participant (i) can reconstruct another participant’s (j) data from the perturbed data:
Estimated with kernel density estimation
Solved through deconvolution
Attack
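As a rough illustration of why additive perturbation lets the data miner recover aggregate statistics, the sketch below recovers only the first two moments rather than the full density fX(x). It is not PoolView's method and does no deconvolution; it assumes independent zero-mean Gaussian noise whose standard deviation is known to the miner, and all names and numbers are hypothetical.

```python
import random

def perturb(values, noise_std, rng):
    """Additive perturbation: y = x + noise, with iid zero-mean Gaussian noise."""
    return [v + rng.gauss(0.0, noise_std) for v in values]

def estimate_moments(perturbed, noise_std):
    """Estimate mean and variance of the original data from perturbed data.

    Independence of data and noise gives Var(y) = Var(x) + noise_std**2.
    """
    n = len(perturbed)
    mean = sum(perturbed) / n
    var_y = sum((y - mean) ** 2 for y in perturbed) / n
    return mean, max(var_y - noise_std ** 2, 0.0)

rng = random.Random(1)
original = [rng.gauss(70.0, 5.0) for _ in range(20000)]   # hypothetical readings
shared = perturb(original, noise_std=10.0, rng=rng)

print(estimate_moments(shared, noise_std=10.0))
```

Individual values are masked by noise twice as wide as the data, yet the community mean and variance remain recoverable, which is also what the attacks on iid noise exploit.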
Collaborative learning using additive perturbation
• Zhang et al. [2012]
Data-dependence
Participant-dependence
• Catches:
  – The data miner has to know the participants’ parameters, so the system is not resilient to collusion
  – Data correlation between participants exposes them to attacks (recall the PCA-based attack?)
PDF reconstructed by data miner based on PDF of y and noise
Multiplicative perturbation
Rotation perturbation [Chen et al. 2005]
• Noise matrix is an orthogonal matrix with orthonormal rows and columns
• Giannella et al.’s [2013] attack can estimate the original data using all perturbed data and a small amount of original data
Attack stage 1
• Find maximally unique map β that satisfies
Then we know which xi is mapped to which yi
Attack stage 2
• Find that maximizes
Enhanced version: geometric perturbation
Multiplies data with noise
[Figure: the perturbation maps input x to output y]
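To make the trade-off concrete, the sketch below shows a 2-D rotation perturbation preserving pairwise distances, followed by a deliberately simplified known input-output attack. This is not Giannella et al.'s attack: it assumes a proper rotation (no reflection), two dimensions, and a single known nonzero input-output pair. All function names are hypothetical.

```python
import math
import random

def rotation_matrix(theta):
    """2-D orthogonal matrix: a proper rotation by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def rotate_points(matrix, points):
    """Multiply each 2-D point by the matrix."""
    return [(matrix[0][0] * x + matrix[0][1] * y,
             matrix[1][0] * x + matrix[1][1] * y) for x, y in points]

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def recover_angle(known_input, known_output):
    """One nonzero input-output pair reveals the rotation angle in 2-D."""
    return (math.atan2(known_output[1], known_output[0])
            - math.atan2(known_input[1], known_input[0]))

rng = random.Random(7)
secret_theta = rng.uniform(0.0, 2.0 * math.pi)
original = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5)]
perturbed = rotate_points(rotation_matrix(secret_theta), original)

# Pairwise distances survive, so distance-based mining still works on perturbed data.
print(distance(original[0], original[1]), distance(perturbed[0], perturbed[1]))

# The attacker knows only the first original record and its perturbed image.
theta_hat = recover_angle(original[0], perturbed[0])
recovered = rotate_points(rotation_matrix(-theta_hat), perturbed)
print(max(distance(p, q) for p, q in zip(original, recovered)))
```

In higher dimensions more known pairs are needed, but the same distance-preserving property that gives utility is what the attacker leans on.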
Multiplicative projection: random projection
• Projection by Gaussian random matrix
  – Statistically orthogonal
  – essentially a Johnson-Lindenstrauss transform
• Other Johnson-Lindenstrauss transforms:
• Attack against orthogonal transform adaptable for this?
[Figure: perturbed vectors, projected from d dimensions to k dimensions]
Inter-point distances change by a factor of $(1 \pm \varepsilon)$ as long as $k \ge O(\varepsilon^{-2} \log n)$
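The distance-preservation claim can be checked numerically with a small sketch. It assumes a k×d matrix with iid N(0, 1/k) entries, which is one common Johnson-Lindenstrauss construction and not necessarily the one used in the cited schemes; the dimensions and names are hypothetical.

```python
import math
import random

def gaussian_projection_matrix(k, d, rng):
    """k x d matrix with iid N(0, 1/k) entries, so squared norms are preserved on average."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, scale) for _ in range(d)] for _ in range(k)]

def project(matrix, vector):
    """Matrix-vector product: maps a d-dimensional vector to k dimensions."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(42)
d, k = 300, 150
x1 = [rng.uniform(-1, 1) for _ in range(d)]
x2 = [rng.uniform(-1, 1) for _ in range(d)]
R = gaussian_projection_matrix(k, d, rng)

# Ratio of projected to original distance; values near 1 mean a small epsilon.
ratio = distance(project(R, x1), project(R, x2)) / distance(x1, x2)
print(ratio)
```

Because k < d, the map is many-to-one and cannot be inverted exactly, which is the source of privacy, while the ratio staying near 1 is the source of utility.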
Collaborative learning using multiplicative perturbation
Goal is to use a different perturbation matrix for each participant
Liu et al. [2012]:
[Figure labels: mean, covariance; synthesized data matrix Z]
Learn an approximate inverse of Ru and Rv
Data miner then gets an estimate of Xu and Xv!
What about the privacy criterion?
Nonlinear perturbation
• Relies on linear perturbation to achieve projection
• Near-many-to-one mapping provides privacy property
• Many-to-one mapping extended to the “normal” part of the curve?
Random matrices
Nonlinear function
Nonlinear + linear perturbation:
Normalized values
Extreme values (potential outliers) are “squashed”
[Plot: tanh(x)]
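The “squashing” of extreme values can be seen directly by applying tanh to a few normalized values. This sketch shows only the nonlinear step; the random matrices of the full scheme are omitted, and the sample values are hypothetical.

```python
import math

def squash(values):
    """Apply tanh elementwise: near-identity around 0, saturating toward ±1."""
    return [math.tanh(v) for v in values]

normalized = [-0.5, 0.1, 0.4, 3.0, 6.0]   # the last two play the role of outliers
squashed = squash(normalized)

for before, after in zip(normalized, squashed):
    print(f"{before:5.1f} -> {after:.4f}")
```

Values near zero pass through almost unchanged, while 3.0 and 6.0 both land just below 1 and become nearly indistinguishable: the near-many-to-one mapping that the privacy property relies on.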
Bayesian estimation attacks against multiplicative perturbation
• Solve underdetermined system Y=RX for X
• Maximum a posteriori estimation (why?)
• If R is known
• Gaussian original data obviously simplifies the attacker’s problem
• If R is not known
• Difficult optimization problem, although Gaussian data simplifies the problem
• Choice of p(R) matters
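For the known-R case with Gaussian data, the MAP estimate has a closed form, which is one way to see why Gaussian original data simplifies the attacker's problem. The expression below is a sketch under assumptions not stated on the slide: each column x of X is drawn from N(μ, Σ) with Σ positive definite, R is a known m×n matrix with full row rank (m < n), and y = Rx is observed without extra noise.

```latex
\hat{x}_{\mathrm{MAP}}
  = \arg\max_{x \,:\, Rx = y} p(x)
  = \mu + \Sigma R^{\top} \left( R \Sigma R^{\top} \right)^{-1} \left( y - R\mu \right)
```

Under these assumptions the posterior of x given y is itself Gaussian, so the MAP estimate coincides with the conditional mean; for non-Gaussian priors no such closed form is available in general.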
Independent component analysis against multiplicative perturbation
• Prerequisites for attacker
  – independence
  – at most one Gaussian component
  – sparseness (Laplace)
  – m≥(n+1)/2
• Steps:
  – estimate R
  – estimate X
  – resolve permutation and scaling ambiguity
Perturbation matrix treated as mixing matrix
Blind source separation
[Figure: cases m<n, m=n, m>n; overcomplete/underdetermined ICA; sparse representation; nonnegative matrix factorization]
Research opportunities and challenges
• Commercial interest?
• Large design space: effectiveness depends as much on the nature of data as the data mining algorithms
• Challenging multidisciplinary problems necessitate broad range of tools:
  – Scenario-dependent privacy criteria
  – Defenses and attacks evolve side-by-side
  – Role of dimensionality reduction?
  – Steganography for “traitor tracing”?
  – Many more from syntactic privacy, SMC, etc.
[Figure: participants’ data pass through multiplicative or nonlinear perturbation; the perturbed data feed data mining algorithms and face Bayesian estimation attacks and ICA attacks]
Tools: Statistical analysis, Bayesian analysis, matrix analysis, time series analysis, optimization, signal processing
• What is Big Data?
• Unsupervised learning of Big Data, e.g., Deep Learning
• Vision: from a business-centric Internet of Things to a citizen-centric Internet of Things
• Main non-technical aim: Create trust and confidence in Internet of Things systems, while providing user-friendly ways to contribute to and use the system thus encouraging creation of services of high socio-economic value.
• Main technical aims:
  – Reliable and secure communications
  – Trustworthy data collection
  – Privacy-preserving data mining
Motivating use cases:
Alice’s sensor network monitoring her house
Alice’s friend Bob granted access to Alice’s network while Alice is on vacation
Sensor network monitoring community microgrid feeding data to stakeholders
Duration: Sep 2013 - Aug 2016
Funding scheme: STREP
Total Cost: €3.69m
EC Contribution: €2.81m
Contract Number: CNECT-ICT-609112
Conclusion
• Looking back: the 1970s gave us statistical disclosure control; the 2000s gave us PPDM
• Technological development expands design space, invites multidisciplinary input
• Socio-economic development plays critical role
Source: Cisco IBSG, April 2011
Syntactic privacy criteria/definitions
To prevent syntactic attacks:
• Table linkage:
  – Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
  – Attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table
• Record linkage:
  – Attacker has access to an anonymous table and a nonanonymous table, and the knowledge that its target is represented in both tables
  – Attacker can uniquely identify the target’s record in the anonymous table from the target’s record in the nonanonymous table
• Attribute linkage:
  – Attacker has access to an anonymous table, and the knowledge that its target is represented in the table
  – Attacker can infer the value(s) of its target’s sensitive attribute(s) from the group (e.g., 30-40 year-old females) the target belongs to
Examples:
• k-anonymity