Privacy-preserving Data Mining for the Internet of Things: State of the Art Yee Wei Law ( ) wsnlabs.com

Privacy-preserving Data Mining for the Internet of Things: State of the Art

Yee Wei Law (罗裔纬 )

wsnlabs.com

Speaker’s brief bio
• Ph.D. from University of Twente for research on security of wireless sensor networks (WSNs) in EU project EYES
• Research Fellowship on WSNs from The University of Melbourne
– ARC projects “Trustworthy sensor networks: theory and implementation”, “BigNet”
– EU FP7 projects “SENSEI”, “SmartSantander”, “IoT-i”, “SocIoTal”
– IBES seed projects on participatory sensing, smart grids
– Taught Master’s course “Sensor Systems”
• Professional membership:
– Associate of (ISC)2 (junior CISSP)
– Smart Grid Australia Research Working Group

Current research interests:
• Privacy-preserving data mining
• Secure/resilient control
• Applications of above to the IoT and smart grid

Current research orientation:
• Mixed basic/applied research in data science or network science
• Research involving probabilistic/statistical, combinatorial, matrix analysis

Agenda

• The IoT and its research priorities
– Participatory sensing (PS)
– Collaborative learning (CL)
• Introduction to privacy-preserving data mining
• Schemes suitable for PS and CL
• Research opportunities and challenges
• If time permits, SOCIOTAL

A dynamic global network infrastructure with self-configuring capabilities based on standard and interoperable communication protocols where physical and virtual “things” have identities, physical attributes, and virtual personalities and use intelligent interfaces, and are seamlessly integrated into the information network.

H. Sundmaeker et al., “Vision and Challenges for Realising the Internet of Things,” Cluster of European Research Projects on the Internet of Things, Mar. 2010.

Evidence of the Internet of Things

Nissan EPORO robot cars Smart grid

Research priorities

ITU-T: “Through the exploitation of identification, data capture, processing and communication capabilities, the IoT makes full use of things to offer services to all kinds of applications, whilst maintaining the required privacy.”

Among research priorities:
• Mathematical models and algorithms for inventory management, production scheduling, and data mining
• Privacy aware data processing

Smart transport Smart grid Smart water Smart whatever

ARPAnet

Machine-to-machine communications

Some graphics from Sabina Jeschke

• We have enough technology to hook things up; now we should find better ways of capturing and analyzing data.

• Introducing participatory sensing and collaborative learning...

Shifting priorities

Participatory sensing

A process whereby individuals and communities use evermore-capable mobile phones and cloud services to collect and analyze systematic data for use in discovery.

Source: Estrin et al.

Citizen-provided data can improve governance with benefits including:
• Increased public safety
• Increased social inclusion and awareness
• Increased resource efficiency for sustainable communities
• Increased public accountability

Data sharing scenarios

Lindell and Pinkas [2000]: “privacy-preserving data mining” refers to privacy-preserving distributed data mining

Data sharing scenarios (cont’d)

• Collaborative learning: Multiple data owners collaboratively analyze the union of their data with the involvement of a third-party data miner.

• Agrawal and Srikant [2000] coined the term “privacy-preserving data mining” to refer to privacy-preserving collaborative learning.

• Encrypting the data sent to the data miner is inadequate; the data should be masked, at a point that balances accuracy and privacy.

Privacy-preserving collaborative learning

• Requirement imposed by participatory sensing:
– online data submission, offline data processing
• Design space:
– Data type: continuous or categorical; voice, images, videos, etc.
– Data structure: relational or time series; for relational data: horizontally or vertically partitioned
– Data mining operation

[Figure: taxonomy diagram with the labels Adversarial models; Privacy criterion (Semantic, Syntactic, Proposed criterion); Differential privacy; SMC; Randomization (Linear, Nonlinear); Additive; Multiplicative]

Adversarial models

Semi-honest (honest but curious)

• Passive attacker tries to learn the private states of other parties, without deviating from protocol

• By definition, semi-honest parties do not collude

Malicious
• Active attacker tries to learn the private states of other parties, and deviates arbitrarily from protocol

• Common approach: Design in the semi-honest model, enhance it for the malicious model
• General method: zero-knowledge proofs, often not practical
• Semi-honest model often realistic enough

Syntactic privacy criteria

• To prevent syntactic attacks, e.g., table linkage:
– Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
– Attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table
• Relevant for relational data, not time series data
• Example:
– k-anonymity
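As a small illustration of the k-anonymity criterion named above, the following sketch reports the largest k for which a released table is k-anonymous with respect to a chosen set of quasi-identifiers. The table contents, the column names and the helper `k_anonymity` are hypothetical and not taken from the talk.

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which `rows` is k-anonymous.

    Each row is a dict; `quasi_identifiers` names the columns an attacker
    could link against an external, nonanonymous table.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0

# Hypothetical released table: age and postcode are already generalized.
table = [
    {"age": "30-39", "postcode": "30**", "condition": "flu"},
    {"age": "30-39", "postcode": "30**", "condition": "asthma"},
    {"age": "40-49", "postcode": "31**", "condition": "flu"},
    {"age": "40-49", "postcode": "31**", "condition": "diabetes"},
    {"age": "40-49", "postcode": "31**", "condition": "flu"},
]
k = k_anonymity(table, ["age", "postcode"])
```

Every combination of quasi-identifier values is shared by at least `k` records, so a linkage on those columns alone cannot narrow a target down to fewer than `k` candidates.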

Semantic privacy criteria

• To minimize the difference between adversarial prior knowledge and adversarial posterior knowledge about individuals represented in the database

• General enough for most data types, relational or time series

• Examples:
– Cryptographic privacy
– Differential privacy

Cryptographic privacy


Secure Multiparty Computation

Randomization

Secure multiparty computation

Oblivious transfer
• Introduced by Rabin [1981]
• Kilian [1988] showed oblivious transfer is sufficient for secure two-party computation
• Naor et al. [2001] reduce the amortized overhead of oblivious transfer to one exponentiation per logarithmic number of oblivious transfers

• Homomorphic encryption can be used in the semi-honest model

[Figure: two parties with private inputs x1 and x2 jointly compute the output f(x1, x2)]

Garbled circuits for arbitrary functions [Beaver et al. 1990]

Metaphor: Yao’s millionaires’ problem [1982]

Building block: 1-out-of-n oblivious transfer
• The sender holds n values
• The receiver chooses a value
• The sender doesn’t know which
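The message flow of 1-out-of-n oblivious transfer can be sketched with a textbook RSA-based construction in the semi-honest model. This is an illustrative toy, not Rabin's original scheme and not the Naor et al. protocol cited above; the key size, the function names and the sample messages are assumptions chosen for readability, and the parameters are far too small to be secure.

```python
import secrets

# Toy RSA key for the sender. Both primes are Mersenne primes; the key is
# far too small for real use and only keeps the arithmetic easy to follow.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (Python 3.8+)

def sender_offer(n):
    """Step 1: the sender publishes one random value per message."""
    return [secrets.randbelow(N) for _ in range(n)]

def receiver_choose(xs, choice):
    """Step 2: the receiver blinds x[choice] with a fresh random key k."""
    k = secrets.randbelow(N)
    v = (xs[choice] + pow(k, E, N)) % N
    return k, v

def sender_respond(messages, xs, v):
    """Step 3: the sender masks every message; only one mask equals k."""
    return [(m + pow((v - x) % N, D, N)) % N for m, x in zip(messages, xs)]

def receiver_open(masked, choice, k):
    """Step 4: the receiver removes k from the message it chose."""
    return (masked[choice] - k) % N

def oblivious_transfer(messages, choice):
    """Run all four steps in one process (messages must be integers < N)."""
    xs = sender_offer(len(messages))
    k, v = receiver_choose(xs, choice)
    masked = sender_respond(messages, xs, v)
    return receiver_open(masked, choice, k)
```

The sender sees only `v`, which is consistent with every possible choice, and the receiver can unmask only the message whose mask equals its own `k`. For example, `oblivious_transfer([11, 22, 33, 44], 2)` hands the receiver the third value.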


Differential privacy
• In cryptography, semantic security: whatever is computable about the cleartext given the ciphertext is also efficiently computable without the ciphertext
• Useless for PPDM: a database satisfying the above has no utility
• Dwork [2006] proposed “differential privacy” for statistical disclosure control: add noise to query results

Differential privacy (cont’d)
• Theoretical basis for answering “sum queries”
• Sum queries can be used for histogram, mean, covariance, correlation, SVD, PCA, k-means, decision tree, etc.

[Equations not recovered. Surviving labels: row index, row; differential privacy; sensitivity; Laplace noise; noisy sum queries]
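A noisy sum query can be sketched with the standard Laplace mechanism: if each row contributes a value in [0, 1], changing one row moves the true sum by at most 1, so Laplace noise of scale 1/ε suffices. The readings, the predicate `is_heavy_user` and the value of ε below are illustrative assumptions, not values from the talk.

```python
import numpy as np

def noisy_sum_query(rows, g, epsilon, rng):
    """Answer sum_i g(i, rows[i]) with Laplace noise of scale 1/epsilon.

    g must map every (row index, row) pair into [0, 1], so changing one row
    changes the true sum by at most 1 (sensitivity 1).
    """
    true_sum = sum(g(i, row) for i, row in enumerate(rows))
    sensitivity = 1.0
    return true_sum + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
# Hypothetical sensor readings (household power draw in kW).
readings = [0.4, 2.7, 1.1, 3.2, 0.9, 2.5]

def is_heavy_user(i, kw):
    """Counting query: 1 if the household draws more than 2 kW."""
    return 1.0 if kw > 2.0 else 0.0

# Repeating the query many times is done here only to show the noise
# statistics; each repetition would consume privacy budget in practice.
answers = np.array([noisy_sum_query(readings, is_heavy_user, 0.5, rng)
                    for _ in range(20000)])
```

The answers scatter around the true count with a spread set by 1/ε, which is the accuracy/privacy trade-off in its simplest form.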

Taxonomy of attacks against randomization-based approaches

Known input/sample attack:
• The attacker has some input samples and all output samples but does not know which input sample corresponds to which output sample
• Typically begins with establishing correspondences between the input samples and the output samples

Known input-output attack:
• The attacker has some input samples and all output samples, and knows which input sample corresponds to which output sample

Proposed privacy criterion: the distance between f(X) and the estimated f(X) is kept above a specified threshold under known attacks

Randomization

• Additive perturbation: adds noise to data
• iid noise is susceptible to:
• Spectral filtering attack by Kargupta et al. [2003]
• PCA attack by Huang et al. [2005]:
– Estimate covariance matrix of original data
– Find eigenvalues and eigenvectors of covariance matrix through PCA
– [step not recovered]
• Bayesian estimation may not have analytic form
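The PCA steps listed above can be sketched as follows. The final step is lost in the source, so this sketch assumes the usual completion: project the perturbed data onto the dominant eigenvectors to filter out most of the iid noise. The synthetic data, the noise level `sigma` and the number of retained components `p` are assumptions, and the attacker is assumed to know `sigma`.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, p, sigma = 2000, 20, 2, 1.0

# Hypothetical correlated original data: d attributes driven by p latent factors.
latent = rng.normal(size=(n, p)) * 5.0
X = latent @ rng.normal(size=(p, d))
Y = X + rng.normal(scale=sigma, size=(n, d))   # additive iid perturbation

def pca_filter_attack(Y, sigma, p):
    """Estimate X from Y = X + iid noise by projecting onto the top-p
    principal directions of the estimated covariance of X."""
    mean = Y.mean(axis=0)
    # iid noise adds sigma^2 to every diagonal entry of the covariance.
    cov_x = np.cov(Y, rowvar=False) - sigma**2 * np.eye(Y.shape[1])
    eigvals, eigvecs = np.linalg.eigh(cov_x)   # eigenvalues in ascending order
    Q = eigvecs[:, -p:]                        # top-p eigenvectors
    return mean + (Y - mean) @ Q @ Q.T

X_hat = pca_filter_attack(Y, sigma, p)
mse_noise = np.mean((Y - X) ** 2)       # error if the attacker does nothing
mse_attack = np.mean((X_hat - X) ** 2)  # error after PCA filtering
```

Because the correlated data lives in a low-dimensional subspace while iid noise spreads evenly over all d dimensions, the projection keeps the signal and discards most of the noise.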

[Figure: randomization is the randomized distortion or perturbation of data, applied to time series data or relational data. It is either linear (additive perturbation, multiplicative perturbation) or nonlinear.]

Collaborative learning using additive perturbation

• Compared to multiplicative perturbation, easier to recover the source data distribution fX(x) from the perturbed data distribution and noise distribution

• Against attacks: noise to be correlated with data and participant-specific

• PoolView [Ganti et al. 2008] builds a model of the data, then generates noise from the model:

• With a common noise model, a participant (i) can reconstruct another participant’s (j) data from the perturbed data:

Estimated with kernel density estimation

Solved through deconvolution

Attack
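To make the distribution-recovery idea concrete, the sketch below reconstructs the source distribution on a grid from additively perturbed samples and a known noise distribution. It uses a Bayesian iterative reconstruction in the style of Agrawal and Srikant rather than the kernel density estimation and deconvolution mentioned on the slide, and the bimodal data, the grid and the noise level are illustrative assumptions.

```python
import numpy as np

def reconstruct_distribution(y, grid, noise_pdf, iterations=200):
    """Estimate the distribution of x on `grid` from samples y = x + r."""
    # likelihood[i, j] is proportional to f_R(y_i - a_j): how plausible it is
    # that noise moved grid point a_j to the observed value y_i.
    likelihood = noise_pdf(y[:, None] - grid[None, :])
    f = np.full(grid.size, 1.0 / grid.size)           # start from uniform
    for _ in range(iterations):
        posterior = likelihood * f                    # unnormalized P(a_j | y_i)
        posterior /= posterior.sum(axis=1, keepdims=True)
        f = posterior.mean(axis=0)                    # average over samples
    return f

rng = np.random.default_rng(2)
noise_sd = 1.5
x = np.concatenate([rng.normal(-3, 0.5, 1000), rng.normal(3, 0.5, 1000)])
y = x + rng.normal(0, noise_sd, x.size)               # perturbed submissions

def gaussian(r):
    """Gaussian noise density up to a constant (the constant cancels)."""
    return np.exp(-0.5 * (r / noise_sd) ** 2)

grid = np.linspace(-8, 8, 81)
f_hat = reconstruct_distribution(y, grid, gaussian)

centre = np.abs(grid) < 1
mass_centre_reconstructed = f_hat[centre].sum()
mass_centre_perturbed = np.mean(np.abs(y) < 1)
```

The original data has almost no mass near zero; the perturbed samples do, and the reconstruction removes most of it again without ever seeing an individual's true value.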

Collaborative learning using additive perturbation

• Zhang et al. [2012]

Data-dependence

Participant-dependence

• Catches:
– The data miner has to know the participants’ parameters, so the system is not resilient to collusion
– Data correlation between participants exposes them to attacks (recall the PCA-based attack?)

PDF reconstructed by data miner based on PDF of y and noise


Rotation perturbation [Chen et al. 2005]
• Noise matrix is an orthogonal matrix with orthonormal rows and columns
• Giannella et al.’s [2013] attack can estimate the original data using all perturbed data and a small amount of original data

Attack stage 1
• Find maximally unique map β that satisfies [condition not recovered]
• Then we know which xi is mapped to which yi

Attack stage 2
• Find [symbol not recovered] that maximizes [expression not recovered]

Enhanced version: geometric perturbation

Multiplies data with noise

Input x

Output y

Perturbation
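The sketch below illustrates rotation perturbation and why a few matched input-output pairs are dangerous. It is a simplified known input-output attack based on the orthogonal Procrustes solution, not Giannella et al.'s two-stage algorithm (which must first establish the correspondences); the dimensions and the number of known pairs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, n_known = 5, 200, 8

X = rng.normal(size=(d, n))                      # columns are private records
R, _ = np.linalg.qr(rng.normal(size=(d, d)))     # random orthogonal matrix
Y = R @ X                                        # released, perturbed records

# Orthogonal matrices preserve Euclidean distances, which is why
# distance-based mining still works on Y.
dist_x = np.linalg.norm(X[:, 0] - X[:, 1])
dist_y = np.linalg.norm(Y[:, 0] - Y[:, 1])

def known_io_attack(X_known, Y_known, Y_all):
    """Recover an orthogonal R from matched pairs (orthogonal Procrustes),
    then invert it on every released record."""
    U, _, Vt = np.linalg.svd(Y_known @ X_known.T)
    R_hat = U @ Vt
    return R_hat.T @ Y_all

X_hat = known_io_attack(X[:, :n_known], Y[:, :n_known], Y)
max_error = np.abs(X_hat - X).max()
```

With at least d linearly independent known records, the rotation is determined exactly, so every other record is recovered as well.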

Multiplicative projection: random projection

• Projection by Gaussian random matrix
– Statistically orthogonal
– Essentially a Johnson-Lindenstrauss transform
• Other Johnson-Lindenstrauss transforms: [list not recovered]
• Attack against orthogonal transform adaptable for this?

[Figure: perturbed vectors are mapped from d dimensions to k dimensions; inter-point distances change by a factor of (1±ε) as long as k ≥ O(ε^(-2) log n)]
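The distance-preservation property of a Gaussian random projection can be checked numerically. In this sketch the number of points, the dimensions d and k, and the synthetic data are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 50, 1000, 400

X = rng.normal(size=(n, d))                       # n points in d dimensions
R = rng.normal(size=(d, k)) / np.sqrt(k)          # Gaussian projection matrix
Y = X @ R                                         # n points in k dimensions

def pairwise_distances(A):
    """Euclidean distance between every pair of rows, upper triangle only."""
    diff = A[:, None, :] - A[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    return dist[np.triu_indices(A.shape[0], k=1)]

# Ratio of projected to original distance for every pair of points.
ratios = pairwise_distances(Y) / pairwise_distances(X)
```

The ratios cluster tightly around 1, and the spread shrinks as k grows, which is the (1±ε) guarantee in practice.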

Collaborative learning using multiplicative perturbation

Goal is to use a different perturbation matrix for each participant.

Liu et al. [2012]:

[Figure labels: mean, covariance; synthesized data matrix Z]

• The data miner learns, approximately, an inverse of Ru and Rv
• The data miner then gets an estimate of Xu and Xv!
• What about the privacy criterion?

Nonlinear perturbation

• Relies on linear perturbation to achieve projection

• Near-many-to-one mapping provides privacy property

• Many-to-one mapping extended to the “normal” part of the curve?

Random matrices

Nonlinear function

Nonlinear + linear perturbation:

Normalized values

Extreme values (potential outliers) are “squashed”

= tanh(x)
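The squashing effect of tanh on normalized values can be seen in a few lines. The exact formula combining the nonlinear and linear steps is not recoverable from the slide, so the composition used here (tanh first, then a random matrix) is an assumption made purely for illustration, as are the matrix sizes.

```python
import numpy as np

rng = np.random.default_rng(5)

def nonlinear_then_linear(X, R):
    """Squash each normalized attribute with tanh, then project with R.
    The order of the two steps is an assumption, not taken from the talk."""
    return R @ np.tanh(X)

# Normalized attribute values (z-scores); 4.0 and 8.0 are potential outliers.
z = np.array([0.1, 0.5, 4.0, 8.0])
squashed = np.tanh(z)
gap_normal = squashed[1] - squashed[0]    # ordinary values stay distinguishable
gap_extreme = squashed[3] - squashed[2]   # extreme values nearly coincide

X = rng.normal(size=(6, 100))             # 6 attributes, 100 records
R = rng.normal(size=(3, 6))               # hypothetical 3 x 6 random matrix
Y = nonlinear_then_linear(X, R)
```

Very different extreme inputs map to almost the same output, which is the near-many-to-one behaviour the slide relies on for privacy.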

Bayesian estimation attacks against multiplicative perturbation

• Solve underdetermined system Y=RX for X

• Maximum a posteriori estimation (why?)

• If R is known

• Gaussian original data obviously simplifies the attacker’s problem

• If R is not known

• Difficult optimization problem, although Gaussian data simplifies the problem

• Choice of p(R) matters
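For the case where R is known and the original data is Gaussian, the maximum a posteriori estimate has a closed form, which is why Gaussian data simplifies the attacker's problem. The sketch below assumes the attacker also knows the prior mean `mu` and covariance `Sigma`; the dimensions and the randomly generated parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m, trials = 6, 3, 2000                 # n attributes projected down to m

A = rng.normal(size=(n, n))
Sigma = A @ A.T + 0.1 * np.eye(n)         # prior covariance of the original data
mu = rng.normal(size=n)                   # prior mean
R = rng.normal(size=(m, n))               # perturbation matrix known to attacker

def map_estimate(y, R, mu, Sigma):
    """MAP (equal to the posterior mean) estimate of x given y = R x,
    where x ~ N(mu, Sigma)."""
    gain = Sigma @ R.T @ np.linalg.inv(R @ Sigma @ R.T)
    return mu + gain @ (y - R @ mu)

X = rng.multivariate_normal(mu, Sigma, size=trials).T     # n x trials
Y = R @ X                                                 # underdetermined: m < n
X_hat = np.column_stack([map_estimate(Y[:, i], R, mu, Sigma)
                         for i in range(trials)])

mse_prior = np.mean((X - mu[:, None]) ** 2)    # attacker guesses the prior mean
mse_map = np.mean((X_hat - X) ** 2)            # attacker uses the MAP estimate
```

The system Y = RX has infinitely many solutions, and the prior is what lets the attacker pick the most probable one; the estimate is consistent with the released data and closer to the truth than the prior mean alone.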

Independent component analysis against multiplicative perturbation

• Prerequisites for attacker
– independence
– at most one Gaussian component
– sparseness (Laplace)
– m≥(n+1)/2
• Steps:
– estimate R
– estimate X
– resolve permutation and scaling ambiguity

Perturbation matrix treated as mixing matrix

Blind source separation

[Figure: blind source separation cases m<n, m=n and m>n, with the labels overcomplete/underdetermined ICA, sparse representation, nonnegative matrix factorization]

Research opportunities and challenges

• Commercial interest?
• Large design space: effectiveness depends as much on the nature of data as the data mining algorithms
• Challenging multidisciplinary problems necessitate broad range of tools:
– Scenario-dependent privacy criteria
– Defenses and attacks evolve side-by-side
– Role of dimensionality reduction?
– Steganography for “traitor tracing”?
– Many more from syntactic privacy, SMC, etc.


Nonlinear perturbation

Participants’ data

Bayesian estimation attacks

ICA attacks

Tools: Statistical analysis, Bayesian analysis, matrix analysis, time series analysis, optimization, signal processing

Data mining algorithms

Perturbed data

• What is Big Data?
• Unsupervised learning of Big Data, e.g., Deep Learning

• Vision: from a business-centric Internet of Things to a citizen-centric Internet of Things
• Main non-technical aim: Create trust and confidence in Internet of Things systems, while providing user-friendly ways to contribute to and use the system, thus encouraging creation of services of high socio-economic value.
• Main technical aims:
– Reliable and secure communications
– Trustworthy data collection
– Privacy-preserving data mining

Motivating use cases:

Alice’s sensor network monitoring her house

Alice’s friend Bob granted access to Alice’s network while Alice’s on vacation

Sensor network monitoring community microgrid feeding data to stakeholders

Duration: Sep 2013 - Aug 2016
Funding scheme: STREP
Total cost: €3.69m
EC contribution: €2.81m
Contract number: CNECT-ICT-609112

Conclusion
• Looking back: the 1970s gave us statistical disclosure control; the 2000s gave us PPDM
• Technological development expands design space, invites multidisciplinary input
• Socio-economic development plays a critical role


Source: Cisco IBSG, April 2011

Syntactic privacy criteria/definitions

To prevent syntactic attacks:
• Table linkage:
– Attacker has access to an anonymous table and a nonanonymous table, with the anonymous table being a subset of the nonanonymous table
– Attacker can infer the presence of its target’s record in the anonymous table from the target’s record in the nonanonymous table
• Record linkage:
– Attacker has access to an anonymous table and a nonanonymous table, and the knowledge that its target is represented in both tables
– Attacker can uniquely identify the target’s record in the anonymous table from the target’s record in the nonanonymous table
• Attribute linkage:
– Attacker has access to an anonymous table, and the knowledge that its target is represented in the table
– Attacker can infer the value(s) of its target’s sensitive attribute(s) from the group (e.g., 30-40 year-old females) the target belongs to

Examples:
• k-anonymity
