
Exploring Privacy Preservation in Outsourced K-Nearest Neighbors with Multiple Data Owners

Frank Li   Richard Shin   Vern Paxson

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2015-177
http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-177.html

July 29, 2015

Copyright © 2015, by the author(s). All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.

Acknowledgement

This work is partially supported by a National Science Foundation grant (CNS-1237265). The first author is supported by the National Science Foundation Graduate Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Exploring Privacy Preservation in Outsourced K-Nearest Neighbors with Multiple Data Owners

Frank Li†   Richard Shin†   Vern Paxson†∗
†University of California, Berkeley   ∗International Computer Science Institute

{frankli, ricshin, vern}@cs.berkeley.edu

ABSTRACT
The k-nearest neighbors (k-NN) algorithm is a popular and effective classification algorithm. Due to its large storage and computational requirements, it is suitable for cloud outsourcing. However, k-NN is often run on sensitive data such as medical records, user images, or personal information. It is important to protect the privacy of data in an outsourced k-NN system.

Prior works have all assumed the data owners (who submit data to the outsourced k-NN system) are a single trusted party. However, we observe that in many practical scenarios, there may be multiple mutually distrusting data owners. In this work, we present the first framing and exploration of privacy preservation in an outsourced k-NN system with multiple data owners. We consider the various threat models introduced by this modification. We discover that under a particularly practical threat model that covers numerous scenarios, there exists a set of adaptive attacks that breach the data privacy of any exact k-NN system. The vulnerability is a result of the mathematical properties of k-NN and its output. Thus, we propose a privacy-preserving alternative system supporting kernel density estimation using a Gaussian kernel, a classification algorithm from the same family as k-NN. In many applications, this similar algorithm serves as a good substitute for k-NN. We additionally investigate solutions for other threat models, often through extensions of prior single data owner systems.

1. INTRODUCTION
The k-nearest neighbors (k-NN) classification algorithm has been widely and effectively used for machine learning applications. Wu et al. categorized it as one of the top 10 most influential data mining algorithms [25]. K-NN identifies the k points nearest to a query point in a given data set, and classifies the query based on the classifications of the neighboring points. The intuition is that nearby points are of similar classes. K-NN classification benefits from running over large data sets and its computation can be expensive.

These characteristics make it suitable to outsource the classification computation to the cloud. However, k-NN data is often sensitive in nature. For example, k-NN can be applied to medical patient records, census information, and facial images. It is critical to protect the privacy of such data when outsourcing computation.

The outsourced k-NN model focuses on providing classification, rather than training. Hence, it is assumed that the algorithm parameters have been adequately chosen through initial investigation. Prior work on preserving privacy in such a model uses computation over encrypted data [24, 32, 9, 26]. These works have all assumed a single trusted data owner who encrypts her data before sending it to a cloud party. Then queriers can submit encrypted queries to the system for k-NN classification. In this setting, we would like to keep data private from the cloud party and queriers, and keep queries private from the cloud party and the data owner. In some existing works, the queriers share the secret (often symmetric) data encryption keys. In others, the querier interacts with the data owner to derive the encryption for a query without revealing it.

From these existing models, we make a simple observation with non-trivial consequences: the data owner may not be completely trusted. In all previous models, trust in the data owner is natural since only the owner's data privacy is at risk. However, this assumption does not hold in some practical scenarios:

• Multiple mutually-distrusting parties wish to aggregate their data for k-NN outsourcing, without revealing data to each other. Since k-NN can perform significantly better on more data, all parties can benefit from improved accuracy through sharing. As an example, hospitals might contribute medical data for a k-NN disease classification study or a service available to doctors. However, they do not want to release their patient data in the clear to each other or a cloud service provider. In some cases, these data owners may act adversarially if that allows them to learn the other owners' data.

• The k-NN outsourced system allows anyone to be a data owner and/or querier. This is a generalization of the previous scenario, and a privacy-preserving solution allows data owner individuals to contribute or participate in a system directly without trusting any other parties with plaintext data. Some potential applications include sensor information, personal images, and location data.

These scenarios involve data aggregated from multiple owners, hence we term this the multi-data owner outsourced model. In this paper, we provide the first framing and exploration of privacy preservation under this practical model. Since these multi-data owner systems have not yet been used in practice (perhaps because no prior works address privacy preservation), we enumerate and investigate variants of a privacy threat model. However, we focus on one variant that we argue is particularly practical and covers a number of interesting cases. Under this model, we discover a set of adaptive privacy-breaching attacks based purely on the nature of how k-NN works, regardless of any system protocol and encryption designs.

To counter these privacy attacks, we propose using kernel density estimation with a Gaussian kernel instead of k-NN. This is an alternative algorithm from the same class as k-NN (which we can consider as kernel density estimation with a uniform kernel). While this algorithm and k-NN are not equivalent, we demonstrate that in many applications the Gaussian kernel should provide similar accuracy. We construct a privacy-preserving scheme to support such an algorithm using partially homomorphic encryption and garbled circuits, although at a computational and network cost linear in the size of the data set. Note that k-NN itself is computationally linear. While this system may not yet be practical for large data sets, it is both proof of the existence of a theoretically secure alternative as well as a first step to guide future improvements.

In summary, our contributions are:

• We provide the first framework for the k-NN multi-data owner outsourced model.

• We explore privacy preservation of k-NN under various threat models. However, we focus on one threat model we argue is both practical and covers several realistic scenarios. We discover a set of privacy-breaching attacks under such a model.

• We propose using kernel density estimation with a Gaussian kernel in place of k-NN. We describe a privacy-preserving construction of such a system, using partially homomorphic encryption and garbled circuits.

2. BACKGROUND
2.1 Nearest Neighbors
K-nearest neighbors (k-NN) is a simple yet powerful non-parametric classification algorithm that operates on the intuition that neighboring data points are similar and likely share the same classification. It runs on a training set of data points with known classifications: D = {(x1, y1), · · · , (xn, yn)}, where xi is the i-th data point and the label yi ∈ {1, · · · , C} indicates which of the C classes xi belongs to. For a query point x, k-NN determines its classification label y as the most common label amongst the k closest points to x in D. Closeness is defined under some distance metric (such as Manhattan or Euclidean distance for real-valued vectors). Since all of k-NN's computation is at classification time, instead of training time, k-NN is an example of a lazy learning method.

In the case of 1-NN (k = 1), as the number of training data points approaches infinity, the classification error of k-NN becomes bounded above by twice the Bayes error (which is the theoretical lower bound on classification error for any ideal classification algorithm). In general, k-NN benefits greatly from execution over large data sets. Given that a single data owner will have a limited amount of data, aggregation of data from multiple parties is desirable for increased accuracy in k-NN applications. Also, since k-NN classification is computationally expensive, it is practical to outsource the computation. These characteristics motivate our interest in multi-data owner outsourced k-NN classification.

2.2 Kernel density estimation and regression
K-NN is an example of an instance-based learning method, as it involves making a prediction directly from existing instances in the training data (as opposed to using a model of the training data, for example). K-NN uses only the k nearest instances to a query and allows each to equally influence the query classification. However, there are other possible rules to decide how much influence each instance should have on the final prediction, known as kernels.¹

Formally, a kernel K(x) is a function which satisfies

(1) ∫_{−∞}^{∞} K(x) dx = 1    (2) K(x) = K(−x).

In other words, it integrates to 1 and is symmetric across 0.

Given n samples {x1, · · · , xn} of a random variable x, one can estimate the probability density function p(x) with a kernel K in the following way, called kernel density estimation:

p(x) = (1/n) Σ_{i=1}^{n} K(‖x − xi‖)

where ‖ · ‖ is a norm. Given this estimate, classification can be determined as the most probable class:

D_C = {xi | (xi, yi) ∈ D, yi = C}

p(y = C | x, D) ∝ Σ_{xi ∈ D_C} K(‖x − xi‖)

arg max_C p(y = C | x, D) = arg max_C Σ_{xi ∈ D_C} K(‖x − xi‖)

See Appendix A for a more detailed exposition.

Therefore, to classify a particular point x, we can sum the kernel values of the points which belong to each class and determine which class has the largest sum. The following uniform kernel equates this derivation with k-NN:

d_{k,x,D} := distance of the kth nearest point from x in D

K(‖x − xi‖; k, D) = 1/(2 d_{k,x,D}) if ‖x − xi‖ ≤ d_{k,x,D}, and 0 otherwise.

d_{k,x,D} is the width of this kernel, as K(t) > 0 for ‖t − x‖ ≤ d_{k,x,D} and K(t) = 0 for ‖t − x‖ > d_{k,x,D}. Thus, k-NN classification is kernel density estimation with a uniform finite-width kernel.

¹To disambiguate between the many distinct meanings of "kernel" in machine learning, statistics, and computer science, the kind of kernels in this paper are also called "smoothing kernels".

One can substitute in a different kernel to obtain a classifier which behaves similarly. One example is the Gaussian kernel:

K(x) = (1/(σ√(2π))) e^(−x²/(2σ²))

where σ is the standard deviation parameter. Note that this kernel has infinite width, meaning all points have some influence on the classification.
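To make the preceding derivation concrete, the sketch below classifies a query point by kernel density estimation under both kernels on small synthetic data. It is an illustrative example only; the data, k = 5, and σ = 0.75 are arbitrary choices, not values used elsewhere in this report.

```python
import numpy as np

# Toy kernel density classification on synthetic 2-D data, contrasting the uniform
# (k-NN) kernel with the Gaussian kernel. Data, k, and sigma are arbitrary choices.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

def classify(q, kernel):
    # Sum kernel values per class and return the class with the largest sum.
    scores = [kernel(np.linalg.norm(X[y == c] - q, axis=1)).sum() for c in (0, 1)]
    return int(np.argmax(scores))

def uniform_kernel(q, k=5):
    # Width is the distance to the k-th nearest neighbor of q, as derived above.
    d_k = np.sort(np.linalg.norm(X - q, axis=1))[k - 1]
    return lambda d: np.where(d <= d_k, 1.0 / (2 * d_k), 0.0)

def gaussian_kernel(d, sigma=0.75):
    return np.exp(-d**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

q = np.array([1.5, 1.5])
print("uniform (k-NN) prediction:", classify(q, uniform_kernel(q)))
print("Gaussian KDE prediction:  ", classify(q, gaussian_kernel))
```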

2.3 Paillier Cryptosystem
The Paillier cryptosystem [16] is a public-key, semantically secure cryptographic scheme with partially homomorphic properties. These homomorphic properties allow certain mathematical operations to be conducted over encrypted data. Let the public key pk be (N, g), where N is a product of two large primes and g is a generator in Z*_{N²}. The secret decryption key is sk. Also let E_pk be the Paillier encryption function, and E⁻¹_sk be the decryption function. Given a, b ∈ Z_N, the Paillier cryptosystem has the following properties:

E⁻¹_sk(E_pk(a) · E_pk(b) mod N²) = a + b mod N    (Add)

E⁻¹_sk(E_pk(a)^b mod N²) = a · b mod N    (Mult)

In other words, multiplying the ciphertexts of two values yields the ciphertext of their sum, and raising a value's ciphertext to the b-th power yields the ciphertext of that value multiplied by b.

The Paillier ciphertext is twice the size of a plaintext; if N is 1024 bits, the ciphertext is 2048 bits. A micro-benchmark of Paillier in [18] shows practical performance runtimes: encryption of a 32-bit integer takes 9.7 ms, decryption takes 0.7 ms, and the homomorphic addition operation takes 0.005 ms.
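The minimal sketch below illustrates the Add and Mult properties with a from-scratch toy Paillier implementation. The tiny primes and parameter choices are for illustration only and are not secure; a real deployment would use primes of roughly 1024 bits and a vetted library.

```python
from math import gcd
import secrets

# Toy Paillier parameters (illustrative only; real keys use ~1024-bit primes).
p, q = 1009, 1013
N = p * q
N2 = N * N
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
g = N + 1                                      # common simplified generator
mu = pow(lam, -1, N)                           # L(g^lam mod N^2) = lam when g = N+1

def L(u):
    return (u - 1) // N

def encrypt(m):
    r = secrets.randbelow(N - 1) + 1
    while gcd(r, N) != 1:
        r = secrets.randbelow(N - 1) + 1
    return (pow(g, m, N2) * pow(r, N, N2)) % N2

def decrypt(c):
    return (L(pow(c, lam, N2)) * mu) % N

a, b = 37, 58
ca, cb = encrypt(a), encrypt(b)
assert decrypt((ca * cb) % N2) == (a + b) % N   # Add: product of ciphertexts
assert decrypt(pow(ca, b, N2)) == (a * b) % N   # Mult: ciphertext raised to a constant
```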

2.4 Yao's Garbled Circuits
Yao's garbled circuit protocol [29, 14] allows a two-party evaluation of a function f(i1, i2) run over inputs from both parties, without revealing the input values, assuming a semi-honest adversary model. If Alice and Bob have private input values iA and iB, respectively, the protocol is run between them (the input owners) and outputs f(iA, iB) without revealing the inputs to any party.

In the protocol, the first party is called the generator and the second party is the evaluator. The generator takes a Boolean circuit for computing f, and generates a "garbled" version GF (intuitively, it cryptographically obfuscates the circuit logic). Any input i for function f has a mapping to a garbled input for GF, which we will denote as GI(i).

Figure 1: The general model of existing outsourced k-NN systems. The data owner is trusted and outsources encrypted data to the cloud party. Queriers request k-NN computation from the cloud party, and are sometimes trusted depending on the prior work. The cloud party executes k-NN classification and is semi-honest.

The generator gives the garbled circuit GF to the evaluator, as well as the generator's garbled inputs GI(i1). Since the generator created the garbled circuit, only the generator knows the valid garbled inputs. The evaluator then engages in a 1-out-of-2 oblivious transfer protocol [20, 10] with the generator to obliviously obtain the garbled input values for her own private inputs i2. The evaluator can now evaluate GF(GI(i1), GI(i2)) to obtain a garbled output, which maps back to the output of f(i1, i2).

For a more concrete understanding of the garbled circuit itself, consider a single binary gate g (e.g., an AND gate) with inputs i and j, and output k. For each input and the output, the generator generates two random cryptographic keys K⁰_x and K¹_x that correspond to the bit value of x as 0 and 1, respectively. The generator then computes the following four ciphertexts using a symmetric encryption algorithm Enc (which must be IND-CPA, or indistinguishable under chosen-plaintext attacks):

Enc_(K_i^bi, K_j^bj)(K_k^g(bi,bj))   for bi, bj ∈ {0, 1}

A random ordering of these four ciphertexts represents the garbled gate. Knowing (K_i^bi, K_j^bj) allows the decryption of only one of these ciphertexts to yield the value of K_k^g(bi,bj) and none of the other outputs. Similarly, valid outputs cannot be obtained without the keys associated with the gate inputs. A garbled circuit is the garbling of all the gates in the Boolean circuit for f, and can be evaluated gate-by-gate without leaking information about intermediate computations. There exist efficient implementations of garbled circuits [15, 6, 2, 23], although naturally the garbled circuit's size and evaluation runtime increase with the complexity of the evaluated function's circuit.
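As a concrete illustration of the single-gate construction above, the toy sketch below garbles one AND gate. It is a conceptual mock-up under simplifying assumptions: a hash-derived one-time pad stands in for the IND-CPA cipher, and it omits the redundancy (or point-and-permute bits) that real schemes use so the evaluator can recognize which row decrypted correctly.

```python
import os, hashlib, random

KEYLEN = 16

def new_wire():
    # Two random labels per wire: one for bit value 0, one for bit value 1.
    return {0: os.urandom(KEYLEN), 1: os.urandom(KEYLEN)}

def enc(k1, k2, msg):
    # Toy stand-in for Enc: XOR with a pad derived from both input labels.
    pad = hashlib.sha256(k1 + k2).digest()[:KEYLEN]
    return bytes(a ^ b for a, b in zip(pad, msg))

def garble_and(wire_i, wire_j, wire_k):
    # Encrypt the output label for g(bi, bj) = bi AND bj under each input-label pair.
    rows = [enc(wire_i[bi], wire_j[bj], wire_k[bi & bj])
            for bi in (0, 1) for bj in (0, 1)]
    random.shuffle(rows)            # random ordering hides the truth table layout
    return rows

wi, wj, wk = new_wire(), new_wire(), new_wire()
table = garble_and(wi, wj, wk)

# Evaluation with the labels for i=1, j=1: XOR decryption of every row yields four
# candidates, and the correct row yields the output label wk[1] (AND(1,1) = 1).
candidates = [enc(wi[1], wj[1], row) for row in table]
assert wk[1] in candidates
```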

3. RELATED WORK
Prior works have focused on two general approaches to achieving privacy in k-NN: distributed k-NN, and outsourced k-NN computed on encrypted data.

In distributed k-NN, multiple parties each maintain their own data set. Distributed computation involves interactions between these parties to jointly compute k-NN without revealing other data values [5, 31, 27, 13, 19, 22, 28, 30]. A general framework for how these systems operate is that they iteratively reveal the next closest neighbor until k neighbors have been determined. While allowing multiple parties to include their data sets, these works do not provide a solution for outsourced k-NN because they require data owners to store and compute on their data. Furthermore, the data owners must remain online for all queries.

Prior Work            DO        Q            Cloud
Wong et al. [24]      Trusted   Trusted      Semi-honest
Zhu et al. [32]       Trusted   Semi-honest  Semi-honest
Elmehdwi et al. [9]   Trusted   Semi-honest  Semi-honest
Xiao et al. [26]      Trusted   Trusted      Semi-honest

Table 1: A summary of the trust models from prior private outsourced k-NN works. DO is the data owner, Q is the querier, and Cloud represents any cloud parties. Semi-honest parties are assumed to follow protocols but may attempt to discover query or data values.

The other line of prior privacy-preserving k-NN work relies on computing k-NN over encrypted data. These systems are designed for outsourcing, as depicted in Figure 1. Table 1 summarizes the trust model each work assumes. An important observation is that all prior works assume a trusted data owner.

Wong et al. [24] provide secure k-NN outsourcing by developing an asymmetric scalar-product-preserving encryption (ASPE) scheme. ASPE transforms data tuples and queries with secret matrices that are inverses of each other. Multiplying encrypted tuples and queries cancels the transformations to output scalar products, used for distance comparisons. Hence, a data owner and queriers can upload encrypted data tuples and queries to a cloud party, who can compute k-NN using scalar products. It is worth noting this encryption scheme is deterministic, so identical data tuples have the same ciphertext, and likewise for queries. The secret matrices are symmetric keys for the encryption scheme, and must be shared with both the data owner and queriers. This approach assumes a trusted data owner and queriers, and a semi-honest cloud party who follows the protocols but attempts to learn query or data values.

In the outsourced k-NN system from [32], data encryption is again conducted by a trusted data owner, using a symmetric scheme with a secret matrix transformation as a key. However, queriers do not share this key. Instead, they interact with the data owner to derive a query encryption without revealing the query. Note this requires a data owner to always remain online for all queries. Also, data tuple encryption (being a matrix transformation) is deterministic, but query encryption is not, due to randomness introduced during the query encryption protocol. The encryption scheme is designed similarly to ASPE and preserves distance, so a cloud party can execute k-NN using distances computed from encrypted data tuples and queries. In this system's trust model, the data owner is trusted while the queriers and the cloud party are semi-honest.

The system in [9] is designed to protect data and query privacy with two cloud parties. One cloud party is the data host, who stores all uploaded (encrypted) data tuples. The other cloud party is called the key party, since it generates the keys for a public-key encryption scheme. Data tuples and queries are encrypted with the key party's public key. In this system, Paillier encryption is used for its partially homomorphic properties. For each k-NN execution, the system computes encrypted distances through interactions between the key party and the data host. The key party orders the distances and provides the data host with the indices of the k nearest neighbor data tuples to return to a querier. This work assumes a trusted data owner, and semi-honest queriers and cloud parties.

Instead of finding exact k-NN, [26] allows a cloud party to approximate it using encrypted data structures uploaded by the data owner that represent Voronoi boundaries. The Voronoi boundary surrounding a data point is the boundary equidistant from that data point to neighboring points. The region enclosed in a boundary is the region within which queries will return the same 1-nearest neighbor. Because the Voronoi boundaries change whenever a new data point is added, this approach prevents continuous data uploading without redoing the entire outsourcing process. The encryption scheme is symmetric, and both the data owner and queriers share the secret key. This work's trust model is identical to [24]'s, where the cloud party is semi-honest and the data owner and queriers are fully trusted.

Another contribution from [26] is a reduction-based impossibility proof for privacy preservation in outsourced exact k-NN under a single cloud party model, where the cloud party has access to an encryption key. Note that [24] and [32] assume the cloud party does not have the encryption key, hence avoiding this impossibility argument. The proof is a reduction to order-preserving encryption (OPE), and leverages prior work that shows OPE is impossible under certain cryptographic assumptions [4]. Fully homomorphic encryption does actually allow OPE, but it is still impractical [11]. Let B be an algorithm, without access to a decryption oracle, that finds the nearest neighbor of an encrypted query E(q) in the encrypted data E(D). The impossibility proof shows that B can be used to construct an OPE function E(), hence B cannot exist. However, their argument does not apply to a system model with multiple cloud parties. Their proof, which relies on B's lack of access to a decryption oracle, can be circumvented by providing access to the decryption function at another cloud party, such as in [9]. Furthermore, OPE has been realized in an interactive two-party setting [17]. As later discussed in Sections 5 and 8, we consider it reasonable that a cloud party has encryption capabilities in a multi-data owner model. Thus further exploration of private k-NN is needed for scenarios not within the scope of this impossibility result.

One important observation about these prior works is that they all exhibit linear complexity in computation (and network bandwidth in the case of [9]). Intuitively, this is because the protocols are distance based. Since the distances from a query to all data tuples vary for each query, all distances must be recomputed per query. While there are techniques [3, 12] for reducing the computational complexity of k-NN, these techniques may not be privacy preserving. For example, these algorithms necessarily compute on only a portion of the data tuples near the query. Observing the similar data tuple subsets used can leak the similarity of subsequent queries. However, future work may yield more efficient privacy-preserving k-NN constructions.

Figure 2: The model of an outsourced k-NN system containing four parties. The cloud parties are the data host and the crypto-service provider (CSP).

4. SYSTEM AND THREAT MODELS FOR MULTIPLE DATA OWNERS

All existing private outsourced k-NN systems assume a single trusted data owner entity. In this paper, we are the first to consider multiple mutually distrusting data owner entities. This is a simple and practical extension of existing models, yet has important implications. We explore the various threat models that can arise in such a scenario, but we focus the most attention on one threat model we find particularly realistic. The remaining threat models can be more easily dealt with, for example by extending existing single data owner systems, and will be discussed in detail in Section 8. In this section, we first discuss our model of a multi-data owner system. We then provide a framework for modeling threats in the system.

4.1 K-NN System Model
The privacy-preserving outsourced k-NN model has three types of parties: the data owners who outsource their data, the cloud party(s) who host the k-NN computation, and the queriers requesting k-NN classification. For emphasis, the key difference between our system model and prior models is the existence of multiple data owner entities. An immediate consequence of this modification is seen in the structure of the cloud parties. As discussed in Section 3, results from [26] indicate that privacy-preserving outsourcing of exact k-NN cannot be achieved by a single cloud party without fully homomorphic encryption, which is impractical. Instead, at least two cloud parties should exist: one which stores encrypted data, and one with access to the associated decryption function that acts as a decryption oracle. Hence our system model includes an additional cloud party we call the Cryptographic Service Provider (CSP), who generates a key pair for a public-key encryption scheme and distributes the public key to data owners and queriers to use for encryption.

The cloud party storing the encrypted data, termed the data host, is able to compute on encrypted data via interaction with the CSP. As depicted in Figure 2, our system model involves multiple data owners and queriers, the data host, and the CSP. Note that encrypted queries and data submissions must be sent to the data host, not the CSP, since the CSP can simply decrypt any received values.

4.2 Threat Model Abstractions
We consider any party as potentially adversarial. Like prior works, we will consider semi-honest adversaries that aim to learn data or query values, possibly through collusion, while following protocols. We do not consider attacks where parties attempt to disrupt or taint the system outputs. Additionally, we must assume that the CSP and the data host do not directly or indirectly collude, since the CSP maintains the secret key to decrypt all data tuples on the data host. This is a reasonable assumption, for example, if the two cloud parties are hosted by different competing companies incentivized to protect their customers' data privacy.

To describe our threat models, we will take a slightly unorthodox approach. Instead of providing a model for each party's malicious actions, we model the malicious behavior of a party based on the roles it possesses. The logic behind this approach is that different parties in our system model can pose the same threats because they possess the same roles. Enumerating and investigating all combinations of roles and parties is redundant. Hence, just using roles provides a cleaner abstraction from which to analyze threats. Below we describe the four possible roles that arise in our system model.

• Data Owner Role: A party with this role may submit encrypted data into the system. We focus on misbehavior to compromise data privacy, and do not deal with spam or random submissions aimed at tainting system results. Note that the data owner role cannot compromise data privacy by itself since it does not allow observation of system outputs.

• Querier Role: This role allows submission of encrypted queries to receive k-NN classification results.

We note that the querying ability can allow the discovery of the Voronoi cell surrounding a data point. The Voronoi cell is the boundary surrounding a point p that is equidistant between p and its neighbors. In the 1-NN case, queries within p's Voronoi cell will return p's classification. A query outside of the cell will return the classification of a neighboring point. Hence, changes in 1-NN outputs can signal a crossing over of a boundary. However, we deem this inherent leakage minimal since discovering the Voronoi cell would require numerous queries to reveal each boundary edge. The Voronoi cell also simply bounds the value of the data point, and does not directly reveal it. Furthermore, neighboring cells of the same class will appear merged in such an attack, since the output signal will not change between cells. K-NN with a large k parameter makes analysis even less accurate.

• Data Host Role: The data host role can only be possessed by a cloud party. It allows storage and observation of incoming encrypted data tuples and queries, and of the data host computation that is conducted. A holder of the role also observes any interactions with other parties.

• CSP Role: Only a cloud party may possess this role. A CSP possesses both an encryption and a decryption key, and can decrypt any data it observes. It interacts with a data host role to serve as a decryption oracle, and can observe any interaction with other parties, as well as the CSP's own computation.

In Section 5, we focus on the primary threat model in this paper, where any single party can possess both the data owner and querier roles. In Section 8, the remaining threat models are explored. These are threats from one of the cloud parties possessing either the data owner role or the querier role, but not both. Also, we consider the case where the cloud parties possess neither role.

5. ATTACKS IN THE DATA OWNER-QUERIER ROLES THREAT MODEL

In this section, we consider the threat model where an adversarial party possesses both the data owner and querier roles (termed the DO-Q threat model). Since we consider all parties as potentially adversarial, any scenario where both roles may belong to a single party falls under this threat model. We argue this is a realistic threat model, and discover a set of conceptually simple attacks on any system regardless of system design or encryption scheme. The attacks work based purely on the mathematical nature of the k-NN output, rather than implementation specifics. Hence, we conclude that multi-data owner outsourced k-NN cannot be secured under such a threat model.

5.1 DO-Q Threat Model
The DO-Q threat model covers any situation where a single party can possess both the data owner and querier roles. We consider this a practical threat model because it can arise in numerous scenarios. Hence, we will focus on it in this section as well as Section 6. The following outlines the possible scenarios where a single party may possess both roles:

• It is reasonable to expect that data owners, contributing their sensitive data, are allowed access to querying. If not, there is less incentive for data owners to share data. For example, hospitals might want to pool their data to create a disease classification system. The doctors at any participating hospital should be able to query the system when diagnosing their patients.

• The data owners may not be explicitly given querying permissions, say if the data owners and queriers are separate institutions. However, miscreants in each party may collude together, providing a joint alliance with both roles.

• If the data host can encrypt data tuples and queries, it can act as its own data owner and querier. It can insert its own encrypted data tuples into the system's data set, and compute k-NN using its own encrypted queries. A data host with access to the encryption keys is not unreasonable. In [9], the data host needs to encrypt values to carry out operations on encrypted data tuples. In addition, the queries are encrypted under the same key as data tuples to allow homomorphic operations.

• If the data host lacks the data owner and/or querier roles (e.g., if an encryption key was kept secret from it), it may still collude with a data owner and/or querier to obtain the roles. This collusion can also occur between the CSP and a data owner and querier.

• If the system is public to data submissions, queries, or both, then any party can supplement their current roles with the public roles. For example, a fully public system allows anyone to obtain both the data owner and querier roles.

5.2 Distance-Learning Attacks in the DO-Q Model

We now present a set of attacks that reveal distances between a query and encrypted data tuples, allowing triangulation of the plaintext data values. We begin by presenting attacks under the simpler 1-NN, and gradually progress towards k-NN. We assume that k-NN does not output the neighboring tuples directly, which would provide arbitrary privacy breaches through querying, but rather just the query's predicted classification.

We note that prior work [7] has developed algorithms for cloning the Voronoi diagram of 1-NN using only queries, if the query responses contain the exact location of the nearest neighbor or the distance and label of the nearest neighbor. If only the nearest neighbor label is returned, then the algorithm provides an approximate cloning. Our attacks differ in that they are structurally very different, we look at k-NN beyond 1-NN, our attacks reveal the exact data value of a target tuple rather than the Voronoi diagram, and our attacks leverage data insertion as well as querying. Also, we focus on a system model where query responses do not contain the distance (we consider a construction that returns distances to be already broken). The algorithms in [7] require at least the distance in the query response to conduct exact cloning.

5.2.1 Attack using Plaintext Distance
We begin by considering a broken construction of outsourced k-NN. The k-NN system must calculate the distances from all data tuples to the query point. If these distances are ever visible in plaintext to a party with just the querier role, then it is simple to discover the true value of any encrypted data tuple.

Knowing a query q, the adversary can observe the distance l from q to the nearest neighbor. This forms a hypersphere of radius l around q. If the data has d dimensions, d + 1 different query points with the same nearest neighbor construct d + 1 hyperspheres, which intersect at one unique point: the nearest neighbor's plaintext value. Hence, we can uncover data even though it is encrypted in the data set. Note that in this case, the adversary needs both a querier role as well as the cloud party role that observes plaintext distances. This is different from the threat model we are considering, but we discuss it as it provides insight for the following attacks.

5.2.2 Attack on 1-NN
Now consider the 1-NN scenario where the system does not foolishly reveal the distance in plaintext. An adversary with the data owner and querier roles can still breach the privacy of the system. The querier role allows the adversary to submit a query q and observe the classification C of its nearest neighbor p. The attacker, using its data owner role, can then insert an encrypted "guess" data tuple E(g) with any classification C′ ≠ C. If the new nearest neighbor classification returns C′, we know p is farther away from q than g. If not, then p is closer. Using this changing 1-NN output signal, the adversary can conduct a binary search using additional guess tuples to discover the distance from q to p. Hence, the distance is revealed and triangulation can be conducted as in the insecure previous case. This takes O((d + 1) log D) guesses in total to compromise a data tuple, where d is the data dimensionality and D is the distance from q to the initial guess insertion.

Note the above attack appears to require tuple deletions or updates, without which guess tuples that are closer than p will disallow continued binary search. Deletions or updates will certainly improve attack performance, and are reasonable to allow in many scenarios (e.g., location data which is constantly changing). However, they are not required. First, the attacker could conduct a linear search, rather than binary, starting with a guess at distance D and linearly decreasing the guess distance until it equals the distance between q and p. Alternatively, a too-near guess still narrows down the range in which p's value can be located, and the adversary can restart a search with a new query in that range.
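The following sketch simulates this binary-search attack against a plaintext 1-NN oracle standing in for the outsourced system; the adversary only observes predicted labels. The data, dimensionality, and search bounds are illustrative assumptions, and guess tuples are deleted after each probe for simplicity (the linear-search variant described above avoids deletions).

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.uniform(0, 10, size=2)              # hidden victim tuple p, class 0
dataset = [(target, 0)]

def classify_1nn(q):
    # Oracle: return only the label of the nearest tuple, never distances or values.
    return min(dataset, key=lambda entry: np.linalg.norm(entry[0] - q))[1]

def learn_distance(q, hi=20.0, tol=1e-6):
    """Binary-search dist(q, p) using guess insertions of a different class."""
    direction = np.array([1.0, 0.0])
    lo = 0.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        dataset.append((q + mid * direction, 1))  # insert guess tuple at distance mid
        if classify_1nn(q) == 1:                  # guess nearer than p -> dist > mid
            lo = mid
        else:                                     # p still nearest     -> dist <= mid
            hi = mid
        dataset.pop()                             # delete the guess (see note above)
    return (lo + hi) / 2

q = np.array([0.0, 0.0])
print(learn_distance(q), np.linalg.norm(target - q))   # recovered vs. true distance
# Repeating with d + 1 = 3 query points and intersecting the spheres recovers p itself.
```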

5.2.3 Attack on k-NN
Consider now attacks on k-NN, instead of 1-NN. In this scenario, the k-NN system returns the classes of all k nearest neighbors. Let p of class C be the nearest neighbor to query q. In fact, k-NN can be reduced to 1-NN by inserting k − 1 data tuples of class C′ ≠ C adjacent to q. Then, p is the only unknown data tuple whose classification is included in the k-NN output set. The attacker with data owner and querier roles can test guess data insertions of class C′ until C is no longer included in the k-NN output, indicating the guess tuple is closer than p. This signal can be used as before to conduct a distance search.

5.2.4 Attack on Majority-Rule k-NN
The final scenario for a k-NN system is one that operates on majority rule. Rather than outputting all classifications from the k nearest neighbors, only the most common classification is returned. Again, data owner and querier privileges allow a distance-learning attack by reducing to 1-NN. Let p of class C be the nearest neighbor. The attacker can insert k − 1 data tuples adjacent to a query q, split evenly over all classes such that the output of k-NN solely depends on the classification of p. For example, if we have binary classification (0 and 1) and 3-NN, the attacker can insert a 0-class tuple and a 1-class tuple adjacent to q. The original nearest neighbor (p) now solely determines the output of 3-NN, and the adversary may again use a change in k-NN output as a signal for a distance search.

5.3 Fundamental Vulnerability
The vulnerability exposed by these attacks is fundamental to exact k-NN. For any given query, k-NN outputs a function evaluated on the subset of tuples within a distance d, where d is the distance of the k-th nearest neighbor. Representing k-NN as kernel density estimation using a uniform kernel, d is the kernel's width. The vulnerability above fundamentally relies on the fact that the kernel's width changes depending on which data points are near the query. An adversary can learn distances through guess insertions and an observation of k-NN output change. Hence, in any scenario where a party possesses both the data owner and querier roles, exact k-NN cannot be secure, regardless of implementation details. An alternate or approximate solution must be used. In Section 6, we propose a privacy-preserving scheme using a similar algorithm from the same family as k-NN.

6. PRIVACY-PRESERVING KERNEL DENSITY ESTIMATION

In Section 5, we demonstrated that data privacy can be breached in any multi-data owner exact k-NN system under the DO-Q threat model. Given the practicality of this scenario, particularly since the data host typically will have those roles, we seek to provide a secure alternative to exact k-NN. In this section, we propose a privacy-preserving system using an algorithm from the same family as k-NN.

In particular, k-NN is a specific case of kernel density estimation (as discussed in Section 2.2). Intuitively, the kernel measures the amount of influence a neighbor should have on the query's final classification. For a given query, kernel density estimation computes the sum of the kernel values for all data points of each class. The classification is the class with the highest sum. K-NN uses a kernel that is uniform for the k nearest neighbors, and zero otherwise. We propose substituting this kernel with another common kernel, the Gaussian kernel:

K(‖q − xi‖) = (1/(σ√(2π))) e^(−‖q − xi‖²/(2σ²))

In this equation, q is the query tuple, xi is the i-th data tuple, σ is a parameter (chosen to maximize accuracy using standard hyperparameter optimization techniques, such as cross-validation with grid search), and ‖ · ‖ is the L2 norm. As the distance between q and xi increases, the kernel value (and influence) of xi decreases. K-NN allows the k nearest neighbors to all provide equal influence on the query point's classification. When using a Gaussian kernel, all points have influence, but points farther away influence less than points nearby. While this approach is not equivalent to exact k-NN due to the non-uniformity of neighbor influences, it is similar and from the same family of algorithms. Later, we show experimentally that it can serve as an appropriate substitute in order to provide privacy preservation.

6.1 Why the Gaussian Kernel?
The Gaussian kernel for kernel density estimation is specifically chosen to provide defense against the adaptive k-NN distance-learning attacks under the DO-Q threat model (Section 5). Here, we briefly provide the intuition for why this is true, with the formal proof in Section 6.3.


In k-NN, the width of the kernel depends on the distance of the kth nearest point from the query, so that only the k nearest neighbors influence the classification. An adversary with the data owner role can change this set of neighbors by inserting data tuples within the kernel width, and manipulate whether a particular point influences the classification. Distances can be learned by using changes in this subset as a signal. When using the Gaussian kernel, which has an infinite kernel width, any data the adversary inserts will not change the influence of other points, and the outcome will be affected by exactly an amount the adversary already can compute. For example, for a given query q, an adversary inserting a point y knows the influence for y's class will increase by exactly K(‖q − y‖). Hence, the adversary learns nothing new from possessing the data owner role and querying.
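A small numeric check of this intuition appears below, on made-up data with illustrative parameter values: inserting a point changes the class-score sums by exactly the kernel value of the inserted point for its own class and by nothing for the other class, so the adversary gains no information it could not compute on its own.

```python
import numpy as np

sigma = 0.5
K = lambda d: np.exp(-d**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(1)
data = rng.uniform(0, 1, size=(20, 2))          # hypothetical existing tuples
labels = rng.integers(0, 2, size=20)
q = np.array([0.3, 0.7])                        # adversary's query point

def class_scores(points, labels):
    d = np.linalg.norm(points - q, axis=1)
    return np.array([K(d[labels == c]).sum() for c in (0, 1)])

before = class_scores(data, labels)
y, y_class = np.array([0.9, 0.1]), 1            # adversary's inserted tuple
after = class_scores(np.vstack([data, y]), np.append(labels, y_class))

# The change equals [0, K(||q - y||)]: fully predictable without any querying.
assert np.allclose(after - before, [0.0, K(np.linalg.norm(q - y))])
```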

It is important to realize that this security arises from the Gaussian kernel's unvarying and infinite width. The choice of an alternative kernel is non-trivial and must be carefully selected. For example, another viable substitute is the logistic kernel K(u) = 1/(e^u + 2 + e^(−u)), since it too has an unvarying and infinite width. Our decision to use the Gaussian kernel was based on the ease of developing a scheme using partially homomorphic cryptographic primitives.

6.2 A Privacy-Preserving Design
In this section, we describe step-by-step the construction of a privacy-preserving classification system using kernel density estimation with a Gaussian kernel. As we will demonstrate, our protocols provide classification without leaking data or queries. Our system follows the same system model as described in Section 4.1.

6.2.1 Setup
Recall that the system model outsources data storage and computation to a data host and a cryptographic service provider. The cryptographic service provider generates a public/private key pair for the Paillier cryptosystem and distributes the public key to data owners and queriers, who use it to encrypt data they submit to the cloud, as well as to the data host. Data owners submit their data in encrypted form to the data host. Note that while we use the Paillier cryptosystem, other similarly additive homomorphic cryptosystems may be appropriate as well.

6.2.2 Computing Squared Distances
Since a kernel is a function of distance, our system must be able to compute distance using encrypted data without revealing it. Algorithm 1 describes SquaredDist(E(a), E(b)), which allows the data host to compute the encrypted squared distance between a and b given only their ciphertexts. The protocol does not leak the distance to either cloud party, to prevent a distance-learning attack. Only squared distances are required, as they are used in the Gaussian kernel calculation.

Assume our tuples have m dimensions, and ai is the i-th feature of tuple a. E and E⁻¹ are the Paillier encryption and decryption functions, respectively, using the previously chosen public/private key pair. Note all Paillier operations are done modulo the Paillier parameter N, but for simplicity we elide the modulus. Also, we abbreviate the data host as DH and the cryptographic service provider as CSP.

Algorithm 1 SquaredDist(E(a), E(b)): Output E(‖a − b‖²)

1. DH: for 1 ≤ i ≤ m do:
   (a) xi ← E(ai) · E(bi)⁻¹. Thus, xi = E(ai − bi).
   (b) Choose random µi ∈ Z_N.
   (c) yi ← xi · E(µi), such that yi = E(ai − bi + µi).
   (d) Send yi to CSP.

2. CSP: for 1 ≤ i ≤ m do:
   (a) wi ← E((E⁻¹(yi))²).
   (b) Send wi to DH.

3. DH: for 1 ≤ i ≤ m do:
   (a) zi ← wi · xi^(−2µi) · E(−µi²).

4. DH: Encrypted squared distance E(‖a − b‖²) = Π_{i=1}^{m} zi.

Step 1 computes the encryption of di = (ai − bi) and additively masks each with a random secret µi, such that E⁻¹(yi) = di + µi mod N. This prevents the CSP from learning di in step 2. The CSP decrypts yi, squares it, and sends back to the DH the encryption of the square. Note wi = E((E⁻¹(yi))²) = E(di² + 2µi·di + µi²). The DH can encrypt µi² and compute E(2µi·di) as xi^(2µi), allowing it to recover E(di²) in step 3. Finally, the DH can compute E(d²) = E(Σ_{i=1}^{m} di²) = Π_{i=1}^{m} E(di²), as in step 4. This provides the DH with encrypted squared distances without revealing any information to either cloud party.
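The sketch below walks through Algorithm 1 in a single process, with comments marking which values would be visible to the DH and to the CSP. It assumes the python-paillier (phe) package and an illustratively small key; the mask range and key size are simplifications, not parameters a real deployment would use.

```python
import random
from phe import paillier   # python-paillier; assumed available

pub, priv = paillier.generate_paillier_keypair(n_length=512)   # small key for illustration

def squared_dist(enc_a, enc_b):
    """DH computes E(||a - b||^2) from ciphertext vectors, using the CSP only as a
    squaring oracle on additively masked per-feature differences."""
    total = pub.encrypt(0)
    for ea, eb in zip(enc_a, enc_b):
        x = ea - eb                        # DH: E(a_i - b_i), homomorphic subtraction
        mu = random.randrange(1, 2**32)    # DH: additive mask (full range in practice)
        y = x + mu                         # DH: E(a_i - b_i + mu), sent to CSP
        w = pub.encrypt(priv.decrypt(y) ** 2)   # CSP: decrypt, square, re-encrypt
        z = w + (x * (-2 * mu)) + (-mu * mu)    # DH: strip 2*mu*d_i and mu^2 terms
        total += z
    return total

a, b = [3, 7, 1], [1, 2, 5]
enc_a = [pub.encrypt(v) for v in a]
enc_b = [pub.encrypt(v) for v in b]
assert priv.decrypt(squared_dist(enc_a, enc_b)) == sum((x - y) ** 2 for x, y in zip(a, b))
```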

6.2.3 Computing Kernel Values
Now that we can compute squared distances, we must compute the Gaussian kernel values in a secure fashion. Algorithm 2 does so such that the CSP obtains a masked kernel value for each data tuple.

Algorithm 2 KernelValue(q): Compute Gaussian kernel values for data tuples t in data set D given query q

1. DH: for 1 ≤ i ≤ |D| do:
   (a) si ← SquaredDist(E(ti), E(q)) = E(‖ti − q‖²)
   (b) Choose random µi ∈ Z_N.
   (c) ei ← si · E(µi). Thus, ei = E(‖ti − q‖² + µi).
   (d) Send ei to CSP.

2. CSP: for 1 ≤ i ≤ |D| do:
   (a) gi ← (1/(σ√(2π))) e^(−E⁻¹(ei)/(2σ²))

In step 1, the DH additively masks each computed encrypted squared distance by adding a different random mask. These values are sent to the CSP, who in step 2 decrypts them and computes the kernel values using the masked squared distances. Note that since each distance is masked with a random value, the CSP does not learn any distances, even if the CSP knows some data values in the data set: the CSP will not be able to determine which masked values correspond to its known values. Again, no cloud party learns any distances, and they observe only encrypted or masked values. The query is used only in this algorithm, without ever being decrypted. Hence query privacy is achieved.
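A quick numeric check (with made-up values) of how this masking interacts with the kernel: the CSP only ever sees the masked squared distance, yet the resulting kernel value can later be corrected by the party that knows the mask, which is exactly what Algorithm 3 does in step 3.

```python
import math, random

sigma = 1.0
d2 = 4.2                                  # true squared distance (hidden from the CSP)
mu = random.uniform(1, 100)               # additive mask chosen by the data host

g_masked = math.exp(-(d2 + mu) / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))
g_true   = math.exp(-d2 / (2 * sigma**2)) / (sigma * math.sqrt(2 * math.pi))

# Multiplying by exp(mu / (2 sigma^2)) removes the mask, as in Algorithm 3, step 3.
assert math.isclose(g_masked * math.exp(mu / (2 * sigma**2)), g_true)
```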


6.2.4 Computing Classification
The final step is to conduct kernel density estimation for classification by determining which class is associated with the largest sum of kernel values. Our scheme represents classes not as a single value, but rather as a vector. If there are c classes, we require a set of c orthonormal unit-length bases, where each basis is associated with a particular class. When uploading a data tuple, a data owner uploads the encrypted basis associated with the tuple's class.

We can compute classification as described in Algorithm 3. This algorithm is run after Algorithm 2, such that the CSP knows the masked Gaussian kernel values for all data tuples. Let ci represent the classification vector for the i-th data tuple.

Algorithm 3 Classify(q): Output a classification prediction for query q where there are c classes

0. DH: Run KernelValue(q).

1. DH: for 1 ≤ i ≤ |D| do:
   (a) Generate a random invertible matrix Bi of size c × c.
   (b) vi ← Bi × E(ci).
   (c) Send vi to CSP.

2. CSP: for 1 ≤ i ≤ |D| do:
   (a) wi ← vi^(gi), where the gi are the kernel density values.
   (b) Send wi to DH.

3. DH: for 1 ≤ i ≤ |D| do:
   (a) w′i ← Bi⁻¹ · wi^(e^(µi/(2σ²))), where µi is the random distance mask for the i-th tuple (see Algorithm 2, step 1(c)).

4. DH: A ← Σ_{i=1}^{|D|} w′i. Note that this is a c × 1 vector.

5. DH: for 1 ≤ i ≤ c:
   (a) Choose a new µi ∈ Z_N randomly.
   (b) Ai ← Ai + µi, where Ai is the i-th component of A.

6. DH: Let f(in1, ..., inc, µ1, ..., µc) = arg max_k (ink − µk mod N).
   (a) Generate the garbled circuit GFf of f and the garbled inputs GI(µi) ∀i ∈ {1, ..., c}.
   (b) Send GFf, GI(µi), and Ai ∀i ∈ {1, ..., c} to CSP.

7. CSP:
   (a) A′i ← E⁻¹(Ai) ∀i ∈ {1, ..., c}.
   (b) Conduct oblivious transfers with DH to obtain GI(A′i) ∀i ∈ {1, ..., c}.
   (c) gout ← GFf(GI(A′1), ..., GI(A′c), GI(µ1), ..., GI(µc)).
   (d) Send gout to DH.

8. DH: Map gout to its non-garbled value out and return it to the querier. out is the number representing the predicted class.

In step 1, the encrypted classification vectors are masked through matrix multiplication with a random invertible matrix Bi, which can be efficiently constructed [21]. Note this computation can be conducted on encrypted data because the DH knows the values of Bi, and the matrix multiplication involves only multiplying ciphertexts by plaintext constants and adding ciphertexts. The masked classification vectors are sent to the CSP in step 2, who scales them by the kernel values (from Algorithm 2). In step 3, these values are returned to the DH. The DH undoes the transformation by multiplying by Bi⁻¹. The distance masks in the Gaussian kernel values from Algorithm 2 are removed by multiplying by e^(µi/(2σ²)), where µi is the mask for the i-th tuple. Now, w′i is the encrypted basis for the i-th tuple's class, scaled by the kernel value. In step 4, the sum of these scaled encrypted bases sums the kernel values for each class in the direction of the classes' bases, forming the vector A. Note that A is still a vector of encrypted values, though.

At this point, we need to determine the class with the highest kernel value summation. Having the CSP decrypt A and return the kernel value sums for the classes may appear straightforward, but in fact it leads to an attack on future inserted data tuples if an adversary can view the kernel value sums and can continuously query q. When the next data tuple t is inserted, the adversary will observe an increase in the kernel value summation for t's class by K(‖t − q‖). Knowing q and K(‖t − q‖), the adversary can compute ‖t − q‖. Assuming the data has m features, the attacker can determine t using a set of (m + 1) queries both before and after insertion to compute (m + 1) distances. Note that our threat model allows for one of the cloud parties to possess the querying role, so whichever party can observe the kernel value sums can conduct this attack. Thus, the protocol must determine the classification without revealing the kernel value sums to any party.

We accomplish this using a garbled circuit for the function f(in1, ..., inc, µ1, ..., µc) = arg max_k (ink − µk mod N). If there are c classes, this function accepts c masked inputs in and c additive masks µ, unmasks each masked input with its associated mask, and returns the index corresponding to the largest unmasked value. If the masked inputs are masked kernel value summations for each class, f returns the class index with the largest kernel value summation. In step 5, the DH additively masks the (encrypted) kernel value summation for each class. Then in step 6, the DH generates the garbled circuit GFf for f and garbles its own inputs, which are the masks for each class's kernel value summation. Note that f is not that complex a function (for example, compared to a decryption function), and its garbled circuit can be practically implemented using existing systems [15]. The DH sends GFf, its garbled inputs, and the masked (still encrypted) A to the CSP. The CSP decrypts the masked A vector in step 7(a). By conducting oblivious transfer in step 7(b) with the DH to obtain the garbled input associated with each Ai, the CSP does not reveal the masked kernel value sums. With all of the inputs now to GFf, the CSP evaluates the garbled circuit and returns the garbled output to the DH. Because the DH created the circuit, it can map the garbled output back to its non-garbled value, which is the class index returned by f with the largest kernel value summation. This classification is exactly the classification output of kernel density estimation using a Gaussian kernel.
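The function f that is garbled is small; the plaintext mock-up below shows why the per-class masks cancel inside the circuit, so neither party observes the true kernel value sums (all values here are made up).

```python
# Plaintext mock-up of the function f garbled in Algorithm 3. The CSP supplies the
# decrypted masked sums, the DH supplies the masks; inside the circuit the masks
# cancel, so neither party sees the underlying per-class kernel value sums.
N = 2**61 - 1                              # stand-in for the Paillier modulus

def f(masked_sums, masks):
    unmasked = [(s - m) % N for s, m in zip(masked_sums, masks)]
    return max(range(len(unmasked)), key=lambda i: unmasked[i])

sums  = [1200, 3400, 900]                  # hidden per-class kernel value sums
masks = [512344, 99181, 7261237]           # DH's additive masks from step 5
masked = [(s + m) % N for s, m in zip(sums, masks)]
assert f(masked, masks) == 1               # index of the class with the largest sum
```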

6.3 Security Analysis
First we must show that our scheme's use of the Gaussian kernel is resistant to the adaptive distance-learning attacks detailed in Section 5. While Section 6.1 provides an intuitive argument for this, here we present a formal proof.

Theorem 6.1. Under the Gaussian kernel in kernel density estimation, the adversary learns nothing from the distance-learning attacks of Section 5. More specifically, the adversary does not learn the distance from a query output by adding to the data set (or removing her own data tuple).

Proof. Let D = {d0, ..., dn} be the current data set, q be the adversary's desired query point, and dA be any data tuple the adversary could insert. Also define GKj(D, q) to be the kernel value sum for the j-th class given the query q and data set D, and C(di) to be the i-th data tuple's classification. Recall that if the system allows for c classes, the query output is arg max_{j ∈ {1,...,c}} GKj(D, q).

Before tuple insertion, for each class j in {1, ..., c}:

GKj(D, q) = Σ_{i=1}^{|D|} (1/(σ√(2π))) e^(−‖di − q‖²/(2σ²)) · 1_{C(di)=j}

Here, 1_{C(di)=j} is an indicator variable for whether the i-th data tuple di is of class j. After inserting tuple dA, for each class j in {1, ..., c}:

GKj(D ∪ {dA}, q) = Σ_{i=1}^{|D|} (1/(σ√(2π))) e^(−‖di − q‖²/(2σ²)) · 1_{C(di)=j} + (1/(σ√(2π))) e^(−‖dA − q‖²/(2σ²)) · 1_{C(dA)=j}
                 = GKj(D, q) + (1/(σ√(2π))) e^(−‖dA − q‖²/(2σ²)) · 1_{C(dA)=j}

The adversary, through using its data owner role, can cause a change of δ = (1/(σ√(2π))) e^(−‖dA − q‖²/(2σ²)) · 1_{C(dA)=j}. However, the Gaussian kernel values have not changed for any other data tuples. The change in a query output is directly and completely a result of δ, which the adversary knows independent of querying since it knows dA, q, and σ (which can be public). Given the query output is based only on hidden Gaussian kernel value sums, the output after insertion does not leak any more information on an individual tuple's kernel value beyond what the query output leaks inherently (without insertion). Because the distance is only used in the Gaussian kernel, and the adversary learns nothing about other tuples' Gaussian kernel values through insertion, it does not learn the distance from insertion.

If the adversary deletes one of her tuples, the outcome is exactly the same except δ is negative. With a semi-honest adversary, we can assume the adversary only deletes her own tuples.

The next analysis is to show our scheme leaks no further data beyond its output. More formally, our scheme leaks no additional information compared to an ideal scheme where the data is submitted to a trusted third party, who computes the same classification output. In our initial discussion of the protocol design in Section 6.2, we justified the security along this front for each step of the construction. See Appendix B for a more formal proof.

A final analysis that could be conducted is to formally characterize what the kernel density estimation output itself reveals about the underlying data. One way is to use the framework of differential privacy [8], which can tell us how much randomness we need to add to the output in order to achieve a specific quantifiable level of privacy as defined by the framework.

We say that a randomized algorithm A provides ε-differential privacy if for all data sets D1 and D2 differing by at most one element, and all subsets S of the range of A,

Pr(A(D1) ∈ S) ≤ exp(ε) · Pr(A(D2) ∈ S).

We also define the sensitivity of a function f : D → R^m as

∆f = max_{D1,D2} ‖f(D1) − f(D2)‖₁.

Then [8] shows that, if we add Laplacian noise with scale λ, as in

Pr(Af(D) = k) ∝ exp(−‖f(D) − k‖/λ),

Af gives (∆f/λ)-differential privacy.

In our case, we can view f as performing a kernel density estimation query with a particular query point on a set of data points, to receive a score for each class. Af would be a randomized version of f, which takes these scores from f and adds Laplacian noise to each one. The sensitivity is ∆f = 1/(σ√(2π)), as follows from the proof of Theorem 6.1 when ‖dA − q‖ = 0, and so using Af would grant us 1/(σ√(2π)λ)-differential privacy. We can see that a larger σ in the Gaussian kernel grants us greater differential privacy (i.e., leads to a lower value of ε) when the magnitude of the noise remains fixed. This follows the intuition that a larger σ in the kernel increases the influence of far-away points in the data set, and so the result is not as affected by any particular point.
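The sketch below applies this Laplace mechanism to hypothetical per-class score sums; σ, ε, and the scores are illustrative values chosen only to show how the noise scale follows from the sensitivity.

```python
import numpy as np

sigma = 0.5
sensitivity = 1.0 / (sigma * np.sqrt(2 * np.pi))   # Delta f, reached when ||dA - q|| = 0
epsilon = 0.5                                       # target privacy level (illustrative)
lam = sensitivity / epsilon                         # Laplace scale giving eps-DP

rng = np.random.default_rng(0)
scores = np.array([12.7, 9.3, 15.1])                # hidden per-class kernel value sums
noisy = scores + rng.laplace(scale=lam, size=scores.shape)
print(int(np.argmax(noisy)))                        # randomized classification output
```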

While we proved that the Gaussian kernel output does not leak data tuple values through the distance-learning attacks of Section 5, we have not yet developed a framework to analyze leakage channels other than the preliminary analysis above. Such a problem is challenging and we leave it to future work. However, even in the face of potential leakage, data perturbation provides some security, although in exchange for output accuracy.

6.4 Performance Considerations
Our kernel density solution requires O(N) computation and communication overhead, where N is the number of data tuples in the data set. On large data sets, we acknowledge this cost might be impractical. Future work remains to find privacy-preserving constructions with optimizations. Our goals with the presented construction are to both demonstrate that there exists a theoretically viable privacy-preserving alternative to k-NN, and lay the groundwork for future improvements. Note that existing private k-NN schemes also have linear performance costs, as discussed in Section 3.

One solution for improving performance is parallelism, since our scheme operates on each data tuple independently of other tuples. Ideally, multiple machines would run in parallel at both cloud parties, providing parallelism not only in computation but also in network communication. Each machine at the data host would compute a subcomponent of the kernel value sums, and one master node can aggregate these subcomponents and execute Yao's garbled circuit protocol to produce the classification. Our protocol was intentionally designed so that the function circuit that is garbled is of limited complexity, without need for complex sub-routines such as decryption. Thus, Yao's protocol can be implemented efficiently [15].
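The sketch below illustrates this sharding structure in plaintext Python: each worker computes a partial per-class kernel-value sum over its shard, and a master aggregates the partial sums. In the actual protocol these sums would be computed over Paillier ciphertexts and fed into the garbled circuit; the data, parameters, and function names here are illustrative assumptions.

```python
# Minimal sketch of the sharded computation: workers produce partial per-class
# Gaussian kernel sums; the master aggregates them and takes the arg max.
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def partial_sums(shard, labels, query, sigma, classes):
    d2 = ((shard - query) ** 2).sum(axis=1)                     # squared distances
    k_vals = np.exp(-d2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return np.array([k_vals[labels == c].sum() for c in classes])

def classify_parallel(X, y, query, sigma, n_workers=4):
    classes = np.unique(y)
    x_shards = np.array_split(X, n_workers)
    y_shards = np.array_split(y, n_workers)
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(partial_sums, xs, ys, query, sigma, classes)
                   for xs, ys in zip(x_shards, y_shards)]
        totals = sum(f.result() for f in futures)               # master aggregates
    return classes[int(np.argmax(totals))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.random((1000, 8)), rng.integers(0, 2, 1000)
    print(classify_parallel(X, y, query=rng.random(8), sigma=0.25))
```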

K-NN suffers this linear-growth complexity as well; however, approximation algorithms significantly improve performance by reducing the number of points considered during neighbor search. For example, k-d trees [3] divide the search space into smaller regions, while locality hashing [12] hashes nearby points to the same bin. Hence, the search for a query's neighbors involves computation over only the data points in the same region or bin, rather than the entire data set. Unfortunately, it is not obvious how to execute approximate k-NN in a privacy-preserving fashion. Similarly, optimizations and approximations for kernel density estimation remain future work.
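For reference, here is a brief, non-private illustration of index-based neighbor search with a k-d tree (using scikit-learn); it shows the kind of pruning these structures provide on plaintext data, which is precisely what remains open to achieve over encrypted data.

```python
# Non-private illustration only: a k-d tree limits distance computations to a
# small region of the space rather than scanning the entire data set.
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((10000, 8))           # 10,000 plaintext points with 8 features
tree = KDTree(X, leaf_size=40)       # build the spatial index once

query = rng.random((1, 8))
dist, idx = tree.query(query, k=5)   # distances and indices of the 5 nearest points
print(idx[0], dist[0])
```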

7. COMPARING K-NN AND GAUSSIAN KERNEL DENSITY CLASSIFICATION

In this section, we examine the differences between classification with k-NN and with Gaussian kernel density estimation. We empirically evaluate both algorithms to show that they can give similar results on real data. Appendix C expands on this evaluation: it constructs hypothetical data sets where we can expect the two algorithms to behave differently, discusses the underlying causes for the discrepancy, and addresses some practical issues specific to using a Gaussian kernel and how to solve them.

We summarize the results in Table 2. The medical data comes from the UCI Machine Learning Repository [1], a collection of real data sets used for empirical evaluations of machine learning algorithms. "Cancer 1" and "Cancer 2" contain measurements from breast cancer patients conducted at the University of Wisconsin. "Cancer 1" contains 9 features for each patient; "Cancer 2" came several years after "Cancer 1" and contains 30 features which are more fine-grained. "Diabetes" contains measurements from Pima Indian diabetes patients (8 features per patient). We also use the MNIST handwritten digit recognition data set, which consists of 28 × 28 pixel grayscale images, each containing a single digit between 0 and 9.

Following our recommendations described in Appendix C, we scaled each feature to [0, 1]. For the UCI data sets, we used cross-validation to select k and σ; for MNIST, we set them manually to 5 and 0.25. Except for MNIST, which already came divided into training and test sets, we randomly took 20% of each data set for testing. Table 2 contains the prediction accuracy and classification agreement for both algorithms on these test sets when using the remaining 80% as training data.

Data Set    k-NN     KDE      Agreement
Cancer 1    97.85%   97.14%   99.28%
Cancer 2    96.49%   96.49%   98.24%
Diabetes    81.82%   74.68%   87.66%
MNIST       97%      96%      99%

Table 2: Empirical results for k-NN versus kernel density estimation (KDE) with a Gaussian kernel. Using several data sets, we evaluate each algorithm's classification accuracy as well as the degree of agreement between the two algorithms' outputs.

Except for the Pima Indian diabetes data set, Gaussian kernel density classification exhibits an accuracy no more than 1% lower than k-NN's, and the two algorithms agree on at least 98% of classifications. In the diabetes case, both algorithms perform poorly because the data is not well separated, with regions scattered with both classes. We argue such a data set is not well suited for k-NN in the first place. Prior investigation is needed to decide whether a particular algorithm is suitable for a given application.
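A comparison along the lines of Table 2 can be reproduced with a short script. The sketch below is not our exact evaluation harness: the data loader is a hypothetical placeholder, features are assumed to already be scaled to [0, 1], and k and σ are fixed rather than cross-validated.

```python
# Minimal sketch comparing k-NN and Gaussian KDE classification on one data set.
# load_scaled_dataset() is a hypothetical placeholder for any of the data sets.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def kde_predict(X_train, y_train, X_test, sigma):
    """Predict by summing Gaussian kernel values per class (see Appendix A)."""
    classes = np.unique(y_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    k_vals = np.exp(-d2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    scores = np.stack([k_vals[:, y_train == c].sum(axis=1) for c in classes], axis=1)
    return classes[np.argmax(scores, axis=1)]

X, y = load_scaled_dataset()                 # hypothetical loader, features in [0, 1]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

knn_pred = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).predict(X_te)
kde_pred = kde_predict(X_tr, y_tr, X_te, sigma=0.25)

print("k-NN accuracy:", np.mean(knn_pred == y_te))
print("KDE accuracy:", np.mean(kde_pred == y_te))
print("agreement:", np.mean(knn_pred == kde_pred))
```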

8. ALTERNATIVE THREAT MODELS
Sections 5 and 6 focused on the DO-Q threat model since it accounts for a number of realistic scenarios, including various collusion relationships. When no party (directly or through collusion) possesses both the data owner and querier roles, several threat situations remain. In this section, we explore these threat models. The one pervasive assumption remains: the data host and the cryptographic service provider (CSP) do not collude. As mentioned in Section 4.2, a data owner or querier role by itself does not allow privacy compromise of the system. Hence, the remaining alternative threat models are outlined below:

• Each party possesses only its original role (e.g., a data host only possesses a data host role).

• The data host additionally obtains a data owner role.

• The data host additionally obtains a querier role.

• The CSP additionally obtains a data owner role.

• The CSP additionally obtains a querier role.

Note that a given system may be under more than one of the above threats. For example, the CSP and data host may both obtain an additional role. However, since we assume no collusion, directly or indirectly, between the two cloud parties, we can analyze each threat independently.

These threat models, while each covering fewer scenarios, are still realistic. There are settings where the cloud parties do not collude with certain other parties. For example, the cloud parties may have reputations to maintain, and may avoid collusion or misbehavior for fear of legal, financial, and reputational backlash from discovery through security audits, internal and external monitoring, whistleblowers, or investigations. Note that in all cases, our kernel density estimation solution using the Gaussian kernel can still be used as an appropriate alternative, since the DO-Q threat model allows for a strictly more powerful (in terms of roles) adversary. The discussion that follows, however, considers other solutions under each threat model.

8.1 Original Roles Threat Model
In this threat model, each party only possesses or utilizes the role it was originally designated. This is the simplest of the threat models because data owners do not collude with any other parties. Since data owners themselves do not receive output from the k-NN system, they cannot compromise privacy. Hence, we can treat them as trusted, allowing us to revert to single data owner designs [9, 24], except with multiple data owners. These existing systems already provide privacy against cloud and querier adversaries. This scenario could realistically arise if the data owners and queriers are separate institutions. For example, hospitals could contribute data to a research study conducted at a university, and only the researchers at that university can query the system.

8.2 Data Host Threat Models
We now consider the threat models where the data host obtains the role of a data owner or a querier, but not both. This can occur through collusion, or through a design giving the data host the ability to insert data (e.g., encrypt data tuples) or initiate queries.

An extension of the system in [9] can protect against a data host with the data owner role. This system uses computation over Paillier ciphertexts, similar in flavor to our kernel density estimation approach. Their system also consists of a CSP, which they call the key party, and a data host. Their scheme is privacy-preserving in the single data owner model. The data host is allowed to encrypt, and hence possesses the role of a data owner, but all data that flows through the data host is encrypted, including that received from the key party. Given that the data host does not know query values, it cannot learn anything that it did not already possess prior knowledge about (e.g., data submitted using the data owner role). Hence, we argue this scheme can be simply extended to multiple data owners in this threat model. Do note that this scheme is not secure under the DO-Q threat model since it still computes exact k-NN.

The data host with the querier role appears more problematic due to information leakage. Intuitively, the ability to observe plaintext query outputs can leak the classification of data tuples. In particular, the data host can observe which tuples are used in the output construction, and the query output associates classifications with those tuples. It is challenging to defend against this breach because in outsourced exact k-NN, some cloud party typically must determine the subset of encrypted data tuples associated with the system output. If that party colludes with a querier (or possesses the querying role), the group will be able to associate those tuples with query-returned classifications. In this situation, our Gaussian kernel density algorithm can again provide privacy, since the query output depends on the classification computed over all encrypted tuples. Hence, the query output cannot be associated with any particular set of data tuples.

8.3 Key Party Threat Models
Finally, we consider a key party either colluding with or possessing the role of a data owner or a querier, but not both. Again we will consider extending the Paillier-based scheme in [9]. A key party with either role should not be able to compromise this system even with multiple data owners. K-NN computation is performed in terms of distances from a query point. Without knowing the query point, a key party who knows values in the data set (through the data owner role) will not be able to associate observed distance-based values with known data. Furthermore, the values it observes should either be masked (as in our scheme) or encrypted (as in [9]). A key party with querying ability will only observe masked or encrypted values, hence it will not be able to learn anything from the computation.

9. CONCLUSION
In this paper, we have presented the first exploration of privacy preservation in an outsourced k-NN system with multiple data owners. Under certain threat conditions, we can extend existing single data owner solutions to secure the multi-data owner scenario. However, under a particularly practical threat model, exact k-NN cannot be secured due to a set of adaptive distance-learning attacks. This highlights the need for investigation into the inherent leakage from the outputs of machine learning algorithms. As an alternative solution, we propose the use of a Gaussian kernel in kernel density estimation, the family of algorithms to which k-NN belongs. We present a privacy-preserving system that supports this similar algorithm, as well as evidence of its similarity to k-NN. Admittedly, the scheme may not be practical for large data sets, given that the computational and communication complexities scale linearly. Improving performance is a remaining challenge, and we hope this first step lays the groundwork for future optimized privacy-preserving solutions.

10. ACKNOWLEDGEMENTS
This work is partially supported by a National Science Foundation grant (CNS-1237265). The first author is supported by the National Science Foundation Graduate Research Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


11. REFERENCES
[1] K. Bache and M. Lichman. UCI machine learning repository, 2013.
[2] M. Bellare, V. T. Hoang, S. Keelveedhi, and P. Rogaway. Efficient garbling from a fixed-key blockcipher. In Proceedings of the IEEE Symposium on Security and Privacy, SP'13.
[3] J. L. Bentley. Multidimensional binary search trees used for associative searching. Commun. ACM, Sept. 1975.
[4] A. Boldyreva, N. Chenette, and A. O'Neill. Order-preserving encryption revisited: Improved security analysis and alternative solutions. In Proceedings of the Annual Conference on Advances in Cryptology, CRYPTO'11.
[5] M. Burkhart and X. Dimitropoulos. Fast privacy-preserving top-k queries using secret sharing. In Proceedings of the International Conference on Computer Communications and Networks, ICCCN'10.
[6] D. Demmler, T. Schneider, and M. Zohner. ABY - a framework for efficient mixed-protocol secure two-party computation. In Proceedings of the Network and Distributed System Security Symposium, NDSS'15.
[7] M. Dickerson, D. Eppstein, and M. Goodrich. Cloning Voronoi diagrams via retroactive data structures. In European Symposium on Algorithms, Lecture Notes in Computer Science. 2010.
[8] C. Dwork. Differential privacy. In Proceedings of the International Colloquium on Automata, Languages and Programming, ICALP'06.
[9] Y. Elmehdwi, B. K. Samanthula, and W. Jiang. Secure k-nearest neighbor query over encrypted data in outsourced environments. In Proceedings of the IEEE International Conference on Data Engineering, ICDE'14.
[10] S. Even, O. Goldreich, and A. Lempel. A randomized protocol for signing contracts. Commun. ACM, June 1985.
[11] C. Gentry. Fully homomorphic encryption using ideal lattices. In Proceedings of the ACM Symposium on Theory of Computing, STOC'09.
[12] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the International Conference on Very Large Data Bases, VLDB'99.
[13] M. Kantarcioglu and C. Clifton. Privately computing a distributed k-nn classifier. In Proceedings of the European Conference on Principles and Practice of Knowledge Discovery in Databases, PKDD'04.
[14] Y. Lindell and B. Pinkas. A proof of security of Yao's protocol for two-party computation. Journal of Cryptology, April 2009.
[15] D. Malkhi, N. Nisan, B. Pinkas, and Y. Sella. Fairplay: a secure two-party computation system. In Proceedings of the USENIX Security Symposium, USENIX'04.
[16] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the International Conference on Theory and Application of Cryptographic Techniques, EUROCRYPT'99.
[17] R. A. Popa, F. H. Li, and N. Zeldovich. An ideal-security protocol for order-preserving encoding. In Proceedings of the IEEE Symposium on Security and Privacy, SP'13.
[18] R. A. Popa, C. M. S. Redfield, N. Zeldovich, and H. Balakrishnan. CryptDB: Protecting confidentiality with encrypted query processing. In Proceedings of the ACM Symposium on Operating Systems Principles, SOSP'11.
[19] Y. Qi and M. J. Atallah. Efficient privacy-preserving k-nearest neighbor search. In Proceedings of the International Conference on Distributed Computing Systems, ICDCS'08.
[20] M. Rabin. How to exchange secrets by oblivious transfer. Technical report, Boston, MA, USA, 1981.
[21] D. Randall. Efficient generation of random nonsingular matrices. Technical report, Berkeley, CA, USA, 1991.
[22] M. Shaneck, Y. Kim, and V. Kumar. Privacy preserving nearest neighbor search. In Proceedings of the IEEE International Conference on Data Mining Workshops, ICDM Workshops '06.
[23] E. Songhori, S. Hussain, A.-R. Sadeghi, T. Schneider, and F. Koushanfar. TinyGarble: Highly compressed and scalable sequential garbled circuits. In Proceedings of the IEEE Symposium on Security and Privacy, SP'15.
[24] W. K. Wong, D. W.-l. Cheung, B. Kao, and N. Mamoulis. Secure kNN computation on encrypted databases. In Proceedings of the International Conference on Management of Data, SIGMOD'09.
[25] X. Wu, V. Kumar, J. Ross Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand, and D. Steinberg. Top 10 algorithms in data mining. Knowl. Inf. Syst., Dec. 2007.
[26] X. Xiao, F. Li, and B. Yao. Secure nearest neighbor revisited. In Proceedings of the IEEE International Conference on Data Engineering, ICDE'13.
[27] L. Xiong, S. Chitti, and L. Liu. K nearest neighbor classification across multiple private databases. In Proceedings of the ACM International Conference on Information and Knowledge Management, CIKM'06.
[28] L. Xiong, S. Chitti, and L. Liu. Preserving data privacy in outsourcing data aggregation services. ACM Trans. Internet Technol., Aug. 2007.
[29] A. Yao. How to generate and exchange secrets. In Proceedings of the IEEE Annual Symposium on Foundations of Computer Science, FOCS'86.
[30] J. Zhan and S. Matwin. A crypto-based approach to privacy-preserving collaborative data mining. In Proceedings of the IEEE International Conference on Data Mining Workshops, ICDM Workshops'13.
[31] F. Zhang, G. Zhao, and T. Xing. Privacy-preserving distributed k-nearest neighbor mining on horizontally partitioned multi-party data. In Advanced Data Mining and Applications, Lecture Notes in Computer Science. 2009.
[32] Y. Zhu, R. Xu, and T. Takagi. Secure k-NN computation on encrypted cloud data without sharing key with query users. In Proceedings of the International Workshop on Security in Cloud Computing, Cloud Computing'13.


APPENDIX
A. KERNEL DENSITY ESTIMATION AND REGRESSION
Given n samples {x_1, ..., x_n} of a random variable x, one can estimate the probability density function p(x) with a kernel K in the following way, called kernel density estimation:

p(x) = (1/n) ∑_{i=1}^{n} K(‖x − x_i‖)

where ‖·‖ is a norm. Given this estimate, classification can be determined as the most probable class:

D_C = {x_i | (x_i, y_i) ∈ D, y_i = C}

p(x | y = C, D) = (1/|D_C|) ∑_{x_i ∈ D_C} K(‖x − x_i‖)

p(y = C | D) = |D_C| / |D|

p(y = C | x, D) = p(x | y = C, D) p(y = C | D) / p(x | D)
                = p(x | y = C, D) p(y = C | D) / ∑_{C'} p(x | y = C', D) p(y = C' | D)
                = [ (1/|D_C|) ∑_{x_i ∈ D_C} K(‖x − x_i‖) · (|D_C| / |D|) ] / [ ∑_{C'} (1/|D_{C'}|) ∑_{x_i ∈ D_{C'}} K(‖x − x_i‖) · (|D_{C'}| / |D|) ]
                = ∑_{x_i ∈ D_C} K(‖x − x_i‖) / ∑_{x_i ∈ D} K(‖x − x_i‖)
                ∝ ∑_{x_i ∈ D_C} K(‖x − x_i‖)

arg max_C p(y = C | x, D) = arg max_C ∑_{x_i ∈ D_C} K(‖x − x_i‖)

Therefore, to classify a particular point x, we can sum the kernel values of the points which belong to each class and determine which class has the largest sum.
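As a concrete illustration of this rule, the following short sketch (with an explicit Gaussian kernel; data and parameter values are illustrative) classifies a query by computing the per-class kernel sums directly:

```python
# Minimal sketch of the derived rule: arg max over per-class Gaussian kernel sums.
import numpy as np

def gaussian_kernel(dist, sigma):
    return np.exp(-dist ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def kde_classify(X, y, query, sigma):
    classes = np.unique(y)
    dist = np.linalg.norm(X - query, axis=1)     # ||x - x_i|| for every training point
    k_vals = gaussian_kernel(dist, sigma)
    sums = np.array([k_vals[y == c].sum() for c in classes])
    return classes[int(np.argmax(sums))]         # class with the largest kernel sum
```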

B. ADDITIONAL SECURITY ANALYSIS
In this section, we additionally analyze our scheme to show that it leaks no further data beyond what the classification output leaks, under our semi-honest adversary model. More formally, our scheme leaks no further data than an ideal scheme where the data is submitted to a trusted third party, who computes the Gaussian kernel value sums and produces the same classification output. In the proofs below, we abbreviate the data host as the DH and the crypto-service provider as the CSP. Also, our initial discussion of the protocol design in Section 6.2 already justified the security provided by each step of the construction. We will avoid repeating these details in the proofs and instead analyze leakage by the algorithm as a whole.

Theorem B.1. Algorithm 1 leaks nothing about data tuples or queries.

Proof. As we see in step 1 of Algorithm 2, the inputs to Algorithm 1 are one encrypted data tuple and one encrypted query. In steps 1 and 3 of the algorithm, only encrypted data is visible to the DH. In step 2, the CSP does see plaintext. However, each plaintext has had a random additive mask applied to the original true value, done by the DH in step 1(c). Thus, the CSP cannot determine the true value from the plaintext. Under the security of the Paillier cryptosystem, the DH and the CSP learn nothing about data tuples or queries.

Theorem B.2. Algorithm 2 leaks nothing about the data tuples or queries.

Proof. In step 1, the DH only operates on encrypted values. Theorem B.1 proves that the SquareDist algorithm (Algorithm 1) leaks nothing about the data tuples or queries, so the DH learns nothing from running SquareDist.

In step 2, the CSP again views plaintext, but a random additive mask has been applied (step 1(b)). Thus, the CSP cannot determine the original unmasked value, and does not learn anything about the data tuples or queries.

Theorem B.3. Algorithm 3 leaks nothing about the data tuples or queries, except what may inherently leak from the classification output of kernel density estimation using a Gaussian kernel.

Proof. Step 0 leaks no information, based on Theorem B.2. Again, the DH only operates on encrypted values in step 1. In step 2, the CSP has access to the Gaussian kernel value for a masked value (from step 2(a) of Algorithm 2), and a randomly masked classification vector. Since all values are masked, the CSP cannot determine the true value or classification of any data tuple or query. Note the CSP does not return a decrypted classification vector to the DH, resulting in the DH still only operating on encrypted values in steps 3 through 6. Steps 6 through 8 are simply Yao's garbled circuit protocol for the defined function f. Through this construction, the garbled circuit's inputs, which are masked values and their associated additive masks, are not revealed to any party that does not possess them. Only the DH possesses the masks, and only the CSP can obtain the masked kernel value sums through decryption. Since these CSP-owned values are masked, the CSP learns nothing from the plaintext. By the security of Yao's protocol, nothing else is leaked by the garbled circuit protocol except the circuit output, which is the classification output of our algorithm.

Thus, no information is revealed about the data tuples or queries from Algorithm 3 except what may inherently leak from the classification output of kernel density estimation with a Gaussian kernel. Our protocol is as secure as an ideal model, where a single trusted cloud party receives all data tuples and computes kernel density estimation in the clear.

C. ADDITIONAL CONSIDERATIONS WITH GAUSSIAN KERNEL DENSITY CLASSIFICATION

Here we discuss additional considerations with using Gaussian kernel density estimation in place of k-NN. We construct hypothetical data sets where we can expect them to behave differently, and discuss the underlying causes for the discrepancy. We also discuss some practical issues specific to using a Gaussian kernel and how to solve them.

Figure 3: A visualization of unbalanced data. The filled points and unfilled points represent separate classes. Depending on the distance metric and the parameter of the Gaussian kernel, it is possible for classification with kernel density estimation to predict all possible points in the space as the unfilled class, whereas 1-NN classification would not.

C.1 Causes of Divergence
Given a set of training data and parameters for the Gaussian kernel, assume query q is classified as class A. If sufficiently many points of class B are added to the training data, the classification of q will change to B no matter how far away these extra points are. In contrast, the classification under k-NN would remain unaffected if the new points are farther away than any of the existing k nearest neighbors. Figure 3 illustrates a similar situation, where a point of one class is surrounded by many points of a different class.

To see why, recall from Appendix A:

p(y = C | D) = |D_C| / |D|,

which means that the prior probability of a class is equal to its proportion in the training data. If the data contains significantly more of one class than the other, then Gaussian kernel density classification will tend to predict the more frequent class, which arguably is preferable if the query points also satisfy this assumption. If it is important to detect some less commonly occurring classes even at the cost of increased false positives, the less common classes should be weighted higher while computing the summed influences for each class. This can be done using a scaled basis vector for that class.

C.2 Importance of Distance Metric
A significant difference between classification with k-NN and with Gaussian kernel density estimation arises from the usage of the distances in computing the decision. With k-NN, only the ordering of the distances to the labeled data points matters, and changes in the actual values of the distances have no effect if the order remains constant. However, the Gaussian kernel is a function of distance. While the Gaussian kernel is monotonic (it decreases in value as distance increases), the sum of the values for each class can dramatically change even if the ordering of the distances between points does not. In particular, as the Gaussian kernel is not linear, the classification of a query might change if distances are all scaled by some factor.

The limited precision of numbers used in calculation creates a more practical concern that the kernel value for a large distance will round down to 0, effectively imposing a finite width on the Gaussian kernel. If all distances between pairs of points exceed this width, then kernel density estimation fails to provide any information about a query point; if a large fraction does, then the accuracy will correspondingly suffer.

To solve these issues, we recommend re-scaling all features to fit in [0, 1]. Since the data owners already need to agree on the number of features and a common meaning for each feature, they can also find a consistent way to scale each feature. In fact, if all features contain roughly equal information, then there is no particular reason some features should span a wider range and have greater influence on the distance metric. Other domain-specific methods for ensuring that the distance between two points is never too large would also suffice.
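A minimal sketch of this per-feature rescaling is shown below. In the multi-owner setting, the per-feature minimum and maximum would have to be agreed upon in advance (e.g., from known physical ranges); here they are simply taken from the local data for illustration.

```python
# Minimal sketch: map each feature (column) into [0, 1].
import numpy as np

def rescale_to_unit_interval(X, feature_min=None, feature_max=None):
    feature_min = X.min(axis=0) if feature_min is None else feature_min
    feature_max = X.max(axis=0) if feature_max is None else feature_max
    span = np.where(feature_max > feature_min, feature_max - feature_min, 1.0)
    return np.clip((X - feature_min) / span, 0.0, 1.0)
```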

C.3 Selecting Parameters
Both k-NN and the Gaussian kernel contain parameters we need to select: k, the number of neighbors to consider, and σ, the standard deviation for the Gaussian distribution, respectively. Choosing them inappropriately can lead to poor classification performance. For example, if points of one class are scattered amongst points of another class, then a large k will prevent us from classifying the scattered class correctly. If σ is too large, then the differences in the summed influence for each class will depend more heavily on the number of data points in each class rather than the distance to the points, and the system will tend to classify queries as the more frequent class. Parameters should be selected using standard techniques such as cross-validation and grid search.
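As a sketch of such a selection procedure, the snippet below grid-searches k with scikit-learn's cross-validation and σ with a simple manual k-fold loop; it reuses the illustrative kde_predict routine from the sketch in Section 7, and the candidate grids are arbitrary.

```python
# Minimal sketch: choose k and sigma by cross-validated grid search.
# kde_predict is the illustrative KDE classifier sketched in Section 7.
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsClassifier

def select_k(X, y, candidates=(1, 3, 5, 7, 9, 15)):
    search = GridSearchCV(KNeighborsClassifier(),
                          {"n_neighbors": list(candidates)}, cv=5)
    search.fit(X, y)
    return search.best_params_["n_neighbors"]

def select_sigma(X, y, candidates=(0.05, 0.1, 0.25, 0.5, 1.0), folds=5):
    best_sigma, best_acc = None, -1.0
    for sigma in candidates:
        accs = []
        for tr, va in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
            pred = kde_predict(X[tr], y[tr], X[va], sigma)
            accs.append(np.mean(pred == y[va]))
        if np.mean(accs) > best_acc:
            best_sigma, best_acc = sigma, float(np.mean(accs))
    return best_sigma
```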
