
152 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 10, NO. 1, JANUARY 2015

A Privacy-Preserving Framework for Large-Scale Content-Based Information Retrieval

Li Weng, Member, IEEE, Laurent Amsaleg, April Morton, and Stéphane Marchand-Maillet

Abstract— We propose a privacy protection framework for large-scale content-based information retrieval. It offers two layers of protection. First, robust hash values are used as queries to prevent revealing original content or features. Second, the client can choose to omit certain bits in a hash value to further increase the ambiguity for the server. Due to the reduced information, it is computationally difficult for the server to know the client’s interest. The server has to return the hash values of all possible candidates to the client. The client performs a search within the candidate list to find the best match. Since only hash values are exchanged between the client and the server, the privacy of both parties is protected. We introduce the concept of tunable privacy, where the privacy protection level can be adjusted according to a policy. It is realized through hash-based piecewise inverted indexing. The idea is to divide a feature vector into pieces and index each piece with a subhash value. Each subhash value is associated with an inverted index list. The framework has been extensively tested using a large image database. We have evaluated both retrieval performance and privacy-preserving performance for a particular content identification application. Two different constructions of robust hash algorithms are used. One is based on random projections; the other is based on the discrete wavelet transform. Both algorithms exhibit satisfactory performance in comparison with state-of-the-art retrieval schemes. The results show that the privacy enhancement slightly improves the retrieval performance. We consider the majority voting attack for estimating the query category and identification. Experiment results show that this attack is a threat when there are near-duplicates, but the success rate decreases with the number of omitted bits and the number of distinct items.

Index Terms— Multimedia database, image hashing, indexing, content-based retrieval, data privacy.

I. INTRODUCTION

IN THE Internet era, multimedia content is massively produced and distributed. In order to efficiently locate content in a large-scale database, content-based search

Manuscript received November 20, 2013; revised June 11, 2014 and October 27, 2014; accepted October 27, 2014. Date of publication October 30, 2014; date of current version December 17, 2014. This work was supported in part by the European COST Action on Multilingual and Multifaceted Interactive Information Access (MUMIA) through the Swiss State Secretariat for Education and Research under Grant C11.0043 and in part by the French Project Secular under Grant ANR-12-CORD-0014. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. C.-C. Jay Kuo.

L. Weng was with the University of Geneva, Geneva 1227, Switzerland. He is now with Inria Rennes-Bretagne Atlantique, Rennes 35042, France (e-mail: [email protected]).

L. Amsaleg is with the Centre National de la Recherche Scientifique, Institut de Recherche en Informatique et Systèmes Aléatoires Laboratory, Rennes 35042, France (e-mail: [email protected]).

A. Morton and S. Marchand-Maillet are with the Department of Computer Science, University of Geneva, Geneva 1227, Switzerland (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIFS.2014.2365998

techniques have been developed. They are used by content-based information retrieval (CBIR) systems [1] to complement conventional keyword-based techniques in applications such as near-duplicate detection, automatic annotation, recommendation, etc. In a typical scenario, a user could provide a retrieval system with a set of criteria or examples as a query; the system returns relevant information from the database as an answer.

Recently, with the emergence of new applications, an issue with content-based search has arisen – sometimes the query or the database contains privacy-sensitive information. In a networked environment, the roles of the database owner, the database user, and the database service provider can be taken by different parties, who do not necessarily trust each other. A privacy issue arises when an untrusted party wants to access the private information of another party. In that case, measures should be taken to protect the corresponding information. The main challenge is that the search has to be performed without revealing the original query or the database. This motivates the need for privacy-preserving CBIR (PCBIR) systems.

Privacy attracted early attention in biometric systems, where the query and the database contain biometric identifiers. Biometric systems rarely keep data in the clear, fearing theft of such highly valuable data. Similarly, a user is reluctant to send his biometric template in the clear. Conventionally, biometric systems rely on cryptographic primitives to protect the database of templates [2]. In the multimedia domain, privacy issues recently emerged in content recommendation [3]. With recommendation systems, users are typically profiled. Profiles are sent to service providers, which send back personalized content. Users are today forced to trust the service providers with the use of their profiles. Although CBIR systems have not been widely deployed yet, similar threats exist. Recently, the one-way privacy model for CBIR was investigated [4]. The one-way privacy setting assumes that only the user wants to keep his information secret, because the database is public. Public databases against which users may wish to run private queries have become commonplace nowadays. Some of them already integrate similarity search mechanisms, such as Google Images or Google Goggles. It is likely that others will soon follow that path, turning Flickr, YouTube, and Facebook into content-based searchable collections (in addition to already being tag-searchable). In the larger picture, PCBIR is one of many aspects of privacy protection in the big data era, where profiling becomes ubiquitous. For example, recent news claims that advertisers and Facebook can generate user profiles of political opinions and behaviours.1 Latest research

1 http://hothardware.com/News/Watch-Dogs-Analyzes-Your-Digital-Shadow-Facebook-Data-Miner-Will-Shock-You, last visited on 27 Oct. 2014

1556-6013 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


discovers that websites are actually fingerprinting users on the Internet by their system (e.g., browser) configurations [5]. There are already some initiatives in web search privacy [6]. The trend shows that privacy protection will become an indispensable part of future content-based search systems.

A. Two Approaches for PCBIR

PCBIR systems face the common problem that the server is not trusted by the database owner or the user. According to where privacy is emphasized, a PCBIR technique is typically used in the following scenarios:

• The database contains private information, e.g., photo-sharing websites;

• The query contains private information, e.g., remote diagnosis;

• The CBIR technique contains private information, e.g., proprietary technology.

Conventionally, PCBIR research concentrates on the most stringent scenario where both the query and the database are private. Some solutions are based on the concept of Signal Processing in EncryptEd Domain (SPEED) [7], [8]. They typically rely on heavy cryptographic computation, such as homomorphic encryption and multiparty computation [8]. Their advantage is the cryptographic-level protection of private data. On the other hand, due to highly complicated implementations, speed is their bottleneck. They are expensive to use in practice, especially for low-profile devices and large-scale systems.

Differing from SPEED approaches, some other solutions are based on the concept of Search with Reduced Reference (SRR). They commonly use a secure index (the reduced reference) as the query. The reduced information helps to protect the original content and accelerate the search. SRR approaches are in general much faster, and thus are more suitable for multimedia data.

So far, both SPEED and SRR approaches are typically intended for encrypted databases. A recent trend in research is to consider PCBIR for public databases [4]. The new challenge is the oblivious retrieval problem. It means that the server should not know exactly which database item has been retrieved by the user. Very few works have addressed this issue. For example, Shashank et al. [9] exploit the quadratic residuosity assumption. Sabbu et al. [10] use homomorphic encryption. Fanti et al. [4] resort to multiple servers. However, these approaches are either inefficient or inflexible, and sometimes even infeasible in practice.

B. Contribution

In this work, we propose a new PCBIR framework. It works efficiently for both public and private databases. Compared with existing solutions, our proposal is more attractive:

• It is designed for large-scale databases;
• It can offer multiple levels of privacy protection;
• It is easy to configure and generalize.

As far as we know, the granularity of privacy protection is a neglected factor in existing PCBIR solutions. In practice, the need for privacy implicitly requires that the cost of privacy protection is adjustable according to the application. Our framework enables flexible trade-offs between privacy and computation complexity in a systematic way. Therefore, it can be used in a heterogeneous network where devices have different computing power and bandwidths.

The proposed framework is essentially an SRR approach. The key components are robust hashing and piecewise inverted indexing. The flexibility is mainly embodied in that any robust hash algorithm can be used as a module, and any feature can be converted to hash values. In addition, the level of privacy protection is controlled by a privacy policy. These elements work together according to a new PCBIR protocol.
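As an illustration of hash-based piecewise inverted indexing, the following toy sketch splits each hash into fixed-position subhashes and keeps one inverted table per position; the class and function names are ours, not the paper's:

```python
from collections import defaultdict

def split_hash(hash_bits, num_pieces):
    """Split a binary hash string into equal-length subhashes."""
    piece_len = len(hash_bits) // num_pieces
    return [hash_bits[i * piece_len:(i + 1) * piece_len]
            for i in range(num_pieces)]

class PiecewiseInvertedIndex:
    """Toy index: each subhash value maps to the list of item ids
    whose hash contains that subhash at the same position."""
    def __init__(self, num_pieces):
        self.num_pieces = num_pieces
        # one inverted table per piece position
        self.tables = [defaultdict(list) for _ in range(num_pieces)]

    def add(self, item_id, hash_bits):
        for pos, sub in enumerate(split_hash(hash_bits, self.num_pieces)):
            self.tables[pos][sub].append(item_id)

    def lookup(self, pos, subhash):
        return self.tables[pos].get(subhash, [])

index = PiecewiseInvertedIndex(num_pieces=4)
index.add("img1", "1010" "0110" "0011" "1100")
index.add("img2", "1010" "1111" "0011" "0000")
print(index.lookup(0, "1010"))  # both items share the first subhash
```

A lookup on any single subhash returns a whole candidate list rather than one item, which is the ambiguity the framework relies on.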

The performance of the framework has been evaluated by extensive experiments. We apply the framework to a concrete content identification scenario. It is tested in several cases where the database size ranges from 50 thousand to five million. Two different robust hash algorithms are used to show the versatility of the framework. Overall, the retrieval performance matches state-of-the-art baseline algorithms. The privacy enhancement is effective and can be well tuned; it turns out to slightly improve the retrieval performance.

The rest of the work is organized as follows: Section I-C is a brief literature review. In Sect. II, we give definitions and application scenarios. Section III describes the proposed framework. In Sect. IV, we estimate the privacy-preserving performance. In Sect. V, we show detailed experiment results on both privacy and retrieval performance. Section VI provides security analysis from both the client’s and the server’s points of view. Section VII concludes the work with a discussion.

C. Related Work

There are a number of potential applications under the umbrella of privacy-preserving data processing. They can be categorized by the data type and the purpose of processing. Some applications deal with biometric data [2], such as face recognition [11]–[13] and ECG signal classification [14]. Some others deal with multimedia data, such as feature extraction [15] and content-based search [4], [10], [16]. There are also some works on data mining [17] and learning [18].

PCBIR for multimedia data essentially consists of two parts – nearest neighbor (similarity) search and oblivious retrieval. Most existing solutions only deal with the first part, because they assume database items are encrypted by (and also belong to) a user. In that case, oblivious retrieval is not necessary because the server cannot access the encrypted information.

In a recent survey by Rane and Boufounos [19], it is mentioned that there are three classes of privacy-preserving nearest neighbor solutions: computational methods, information-theoretic methods, and randomized embedding methods. The first one corresponds to the SPEED approaches. The second one is only suitable for limited applications, because it requires a trusted third party for distance computation. The third one corresponds to SRR approaches.


Among the SRR approaches, the difference usually lies in the generation of the index and the corresponding database structure. An insecure index (feature) can be converted to a secure one through some perturbation. Lu et al. [20] proposed two index perturbation methods for the Jaccard similarity between bags of visual words. One is based on random permutation and order-preserving encryption. The other is based on min-hash sketches. In another work [21], they proposed two general feature perturbation methods based on bit-plane randomization and randomized unary encoding. Recently, Mathon et al. [22] proposed to use quantization indices as queries but hide the quantizers’ reconstruction points from the server. These perturbation schemes typically cause a slight degradation in retrieval performance.

A secure index can also be derived through randomized embedding, which is typically a transformation of signals from a high-dimensional space to a lower-dimensional one. Our proposal belongs to this category. A widely used method is the Johnson-Lindenstrauss (J-L) embedding based on random projections [23]. It was adopted by Lu et al. [20], Voloshynovskiy et al. [24], and Fanti et al. [4] in their solutions. Another method is locality-sensitive hashing (LSH) [25], which is typically a quantized J-L embedding. A newly proposed method is the secure embedding [26], which is a privacy-enhanced variant of LSH.
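A minimal sketch of the sign-of-random-projection construction behind such embeddings follows; the dimensions, seed, and example vectors are illustrative, not taken from the paper:

```python
import random

def random_projection_hash(feature, num_bits, seed=0):
    """Binary hash: the sign of the dot product with random Gaussian
    hyperplanes. Nearby feature vectors agree on most bits with high
    probability, while the projection is hard to invert from bits alone."""
    rng = random.Random(seed)
    dim = len(feature)
    bits = []
    for _ in range(num_bits):
        hyperplane = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dot = sum(f * h for f, h in zip(feature, hyperplane))
        bits.append('1' if dot >= 0 else '0')
    return ''.join(bits)

a = [0.9, 0.1, 0.4]
b = [0.91, 0.12, 0.38]   # near-duplicate of a
c = [-0.5, 0.8, -0.2]    # unrelated vector
ha, hb, hc = (random_projection_hash(v, 16) for v in (a, b, c))
hamming = lambda x, y: sum(p != q for p, q in zip(x, y))
print(hamming(ha, hb), hamming(ha, hc))  # first is expected to be much smaller
```

The Hamming distance between hashes then approximates the angular similarity of the underlying features, which is what makes hash values usable as reduced-reference queries.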

When the database is public, oblivious retrieval is required. It makes PCBIR different from other privacy-preserving applications, because the others typically assume that all parties can know the final output. It is intrinsically difficult for multimedia databases, because if the server does not read the whole database, something about the query is guaranteed to be revealed [9]. In general, the communication complexity is linear in the database size.

An early solution was proposed by Shashank et al. [9]. This is an interactive protocol that retrieves one bit at a time. The user has to maintain the hierarchy information of the database. It sends multiple indices to the server and hides the true query by exploiting the quadratic residuosity assumption. The disadvantage of this scheme is that the burden of search is completely shifted to the user, and it also violates the privacy of the database. Sabbu et al. [10] proposed another solution based on homomorphic encryption, where the database privacy is better protected.

More efficient communication is possible when there are multiple non-colluding servers of the same database [27], or when there are multiple queries to a single server. These generic PIR (private information retrieval) techniques are out of the scope of this work and sometimes infeasible in practice. Readers can refer to the surveys [28], [29] for more information. To summarize, existing works typically have a mixture of the following shortcomings:

• Expensive implementation, e.g., homomorphic encryption;
• Non-scalability to large databases, see [11];
• Simplified settings, e.g., multiple non-colluding servers;
• Degraded retrieval performance, see [22];
• Interactive protocols, see [9];
• Unbalanced load between server and client.

We address these issues with an SRR approach.

II. DEFINITIONS

We target a scalable PCBIR system in a heterogeneous network. Only two parties are involved in the proposed PCBIR protocol: the server and the client. There are clients with different computing power and communication channels with various bandwidths. We assume that a server has much higher computing power than a client.

A. Privacy Definition

Privacy has been broadly defined as “the right to be left alone” [30]. More specifically, it involves some system properties such as undetectability, unlinkability, and communication content confidentiality. In our context, we mainly focus on communication content confidentiality.

The privacy of a client has two levels. In the short run, it is about the particular content of a query and the related answers in the database. In the long run, privacy can be defined as the profile of a client. A profile can be built by collecting the queries of a client, and later be exploited by advertisers for accurate recommendation. When there are many clients, recommendation can be further improved by linking clients with similar interests. This is not always appreciated.

The privacy on the server side includes the database content and the possibly proprietary CBIR techniques. In general, we assume that the server only answers the client on a need-to-know basis, i.e., the client is not supposed to get more information than it needs.

B. Threat Model

From the client’s point of view, a potential adversary is the server. The threat is that the server learns the client’s interest. Specifically, the server can have two goals: 1) to know the query content; 2) to know some query properties (e.g., category). Obviously, the first goal is more ambitious than the second. The server has two opportunities to learn the query. The first one is when it receives the query. We denote the client privacy at this stage by P1. The second chance is when it returns the set of matching results. We denote the client privacy at this stage by P2.

From the server’s point of view, a potential adversary is the client. The threat is that the client knows too much information about the database, including the database content and the possibly proprietary techniques (e.g., indexing hierarchy) for providing the database service. We denote the server privacy by P3.

We assume both parties behave in a curious-but-honest (semi-honest) way. That means they must strictly follow the protocol. However, they can learn from the data in their possession for their own purposes.

C. Application Scenarios

Although our solution is general, we distinguish two major application domains – private database and public database, in order to facilitate analysis.


TABLE I
APPLICATION SCENARIOS AND PRIVACY NOTIONS. SOME APPLICATIONS CAN FIT MULTIPLE SCENARIOS

1) Private Database: In this case, the database content is either encrypted or restricted from public access. It can be proprietary, confidential, or personal, and sometimes outsourced. If the query is public, a possible application is search in a proprietary database. For example, in a song recognition application, the client wants to know the name of a song by sending a recorded clip to a database, where the privacy concern is the database content and the client’s interest profile. If the query is private, a possible application is cloud-based personal databases, where the client is the data owner and the server is expected to provide an outsourced database service. The privacy concern is both the database and the query content. Another potential application is remote face recognition for border control. On the one hand, authorities want to find criminals by comparing passenger face photos with a remote confidential database. On the other hand, passengers do not always want their faces to be recorded.

2) Public Database: In this case, privacy protection is mainly intended for the client. If the query is private, the privacy concern is the actual query content. A possible application is trademark (patent) search – a client has invented a new trademark and would like to know if similar ones already exist without revealing it. Another possible application is remote diagnosis – the client sends his medical images to a syndrome database for automatic matching. The privacy concern is that the server should not see the query (which reveals the patient’s health status) but still perform the search. If the query is public, the privacy concern is that the server knows the profile of the client by collecting his queries. A possible application is daily search of multimedia content.

The applications are listed in Table I together with the privacy notions. One can see that client privacy is always required, which is the main focus of our proposal.

III. THE PROPOSED FRAMEWORK

A naive solution to the client privacy problem is that the server sends the whole database to the client. This is not feasible for a limited bandwidth and violates the server privacy P3. Even if the client obtains the whole database, it may not be able to store or process the database.

A compromise is that the client removes some details from the query to create some ambiguity for the server. A potential privacy-enhanced search procedure works as follows:

1) The client creates a partial query, and sends it to the server;
2) The server generates an extended query list based on the partial query;
3) The server performs a search with the extended query list, and sends back all matching items;
4) The client performs a search within the received set of matching results using the original query.
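The four steps above can be sketched end-to-end. The wildcard encoding, function names, and the toy 4-bit hashes below are illustrative choices of ours, not the paper's:

```python
from itertools import product

def partial_query(hash_bits, omit_positions):
    """Step 1 (client): replace chosen bit positions with a wildcard '*'."""
    return ''.join('*' if i in omit_positions else b
                   for i, b in enumerate(hash_bits))

def extended_queries(partial):
    """Step 2 (server): enumerate every hash consistent with the wildcards."""
    slots = [('0', '1') if b == '*' else (b,) for b in partial]
    return [''.join(bits) for bits in product(*slots)]

def server_search(database, partial):
    """Step 3 (server): return all items whose hash matches any extended query."""
    candidates = set(extended_queries(partial))
    return {item: h for item, h in database.items() if h in candidates}

def client_refine(results, original_hash):
    """Step 4 (client): pick the best match locally (here: exact equality)."""
    return [item for item, h in results.items() if h == original_hash]

db = {"songA": "1011", "songB": "1001", "songC": "0110"}
q = "1011"
p = partial_query(q, omit_positions={2})  # server only ever sees "10*1"
matches = server_search(db, p)            # songA and songB both match
print(client_refine(matches, q))          # ['songA']
```

The server cannot tell whether the client was after songA or songB, while the client still recovers the exact answer from the small candidate set.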

On the one hand, the partial query enables the server to narrow down the search scope enough for the sake of performance; on the other hand, it should be difficult for the server to infer the original query. The information in the partial query is a measure of privacy related to P1.

The server sends all items that match the extended query list to the client. We denote the set of matching items by A. A must be large enough to create sufficient ambiguity for the server, but it must be feasible for the client to find the final match (i.e., A should not be too large). The information in A is another measure of privacy – it is proportional to P2 and inversely proportional to P3.

A good system should keep P1, P2, and P3 sufficiently large. Note that P1 is a necessary condition for P2. Increasing P1 also increases P2 and decreases P3, because the size of the matching set (denoted as |A|) increases. In practice, |A| is upper bounded by the available bandwidth, the computing power of the client, and the size of the database; it is lower bounded by the minimum privacy requirement.
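To make the trade-off concrete: omitting k bits from an L-bit hash multiplies the number of extended queries by 2^k, so under the simplifying assumption of uniformly distributed hashes the expected candidate-set size for a database of N items is roughly |A| ≈ N · 2^(k−L). This back-of-the-envelope model and the numbers below are ours, not the paper's:

```python
def expected_candidates(db_size, hash_bits, omitted_bits):
    """Expected |A| under the uniform-hash assumption: each of the
    2**omitted_bits extended queries matches db_size / 2**hash_bits
    items on average."""
    return db_size * 2 ** omitted_bits / 2 ** hash_bits

# Illustrative numbers: 5 million items, 32-bit hashes.
# More omitted bits -> a larger, more ambiguous candidate set
# (better P1/P2), at the cost of bandwidth and client-side work.
for k in (8, 16, 24):
    print(k, expected_candidates(5_000_000, 32, k))
```

Such an estimate lets a privacy policy pick k so that |A| stays between the minimum ambiguity required and the bandwidth/client-computation budget.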

Specifically, there are the following requirements on the partial query:

1) It is difficult to infer the original query;
2) It is feasible to generate and perform search with the extended query list;
3) The properties of A, e.g., the size and the diversity, can be controlled by the partial query;
4) It is easy to estimate P1.

There are the following requirements on the matching set A:

1) A should be compact enough to save bandwidth;
2) A contains the best answers, e.g., the (approximate) nearest neighbors;
3) The diversity of elements in A is sufficiently large;
4) The server cannot tell which are the best answers by analyzing A;
5) A should not reveal too much information about the database.

The above requirements are achieved by the rest of the framework, which consists of query generation, database indexing, and database search. They are described in the following. Figure 1 shows a schematic diagram of the framework.

A. Query Generation

In order to protect privacy, original content cannot be used as queries. Sometimes even features are not safe, because they still reveal information about the original content [31], [32]. Instead of encryption, we generate queries from original content by robust hashing.

Robust hashing is also called perceptual hashing [33], [34] or robust fingerprinting [35], [36] (for multimedia data), or locality-sensitive hashing (LSH) [37] (for generic data). It is a framework that maps multimedia data to compact hash values.


Fig. 1. A schematic diagram. The client sends a partial hash value and a privacy policy to the server; the server returns a candidate list, including hash values and metadata; the client performs the final search within the candidate list.

Ideally, a robust hash value is a short string of equally probable and independent bits. It can be used to persistently identify or authenticate the underlying content, just like a "fingerprint". The basic property of robust hashing is that similar content should result in similar hash values. More importantly, hash algorithms have the one-way property that it is computationally difficult to infer the input from the output, because hashing is essentially a many-to-one mapping.

The advantage of robust hashing in this application is mainly two-fold: 1) the compact size can facilitate fast search (in the Hamming space if binary); 2) due to the one-way property, the privacy requirements P1 and P3 can be achieved by using the hash value instead of the original content (or features) for the search and return of answers. Using hash values for privacy protection is also called generic privacy amplification [24]. A conventional system can be enhanced by converting feature vectors into hash values [4]. Another advantage of robust hashing is the possibility to overcome the semantic gap by supervised learning [38]–[40].

Robust hashing typically involves feature extraction, orthogonal transformation, dimension reduction, and quantization.

B. Database Indexing

The database indexing is based on the concept of piecewise inverted indexing. We assume there is a general feature extraction component. The extracted feature vectors are capable of characterizing the underlying content. They first undergo an orthogonal transform and dimension reduction. Only significant features are preserved. The elements of a feature vector are divided into n groups. A robust hash value hi (i = 0, 1, · · · , n − 1) is computed from the i-th group. We call it a sub-hash value. The above step creates a new coordinate system, with each coordinate represented by a sub-hash value. Finally, a multimedia object in the database is indexed by the overall hash value H = h0||h1|| · · · ||hn−1, i.e., the concatenation of sub-hash values.

Each sub-hash value is associated with an inverted index list (also called a hash bucket). The list contains the IDs (identification information) of multimedia objects corresponding to the sub-hash value. The size l of a sub-hash value depends on the significance of its corresponding feature elements.
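As an illustration, the piecewise inverted indexing step can be sketched as follows. The helper names and the toy two-item database are our own; a real system would hash feature groups rather than reuse precomputed bit strings.

```python
from collections import defaultdict

def split_subhashes(hash_bits, n, l):
    """Split an (n*l)-bit hash string into n sub-hash values of l bits each."""
    return [hash_bits[i*l:(i+1)*l] for i in range(n)]

def build_index(items, n=8, l=16):
    """Piecewise inverted indexing: one hash table per sub-hash position.
    `items` maps object ID -> full 128-bit hash string."""
    tables = [defaultdict(list) for _ in range(n)]
    for obj_id, h in items.items():
        for i, sub in enumerate(split_subhashes(h, n, l)):
            tables[i][sub].append(obj_id)  # hash bucket keyed by sub-hash value
    return tables

# toy example: two objects with 128-bit hashes
items = {"img1": "01" * 64, "img2": "10" * 64}
tables = build_index(items)
print(tables[0]["01" * 8])  # bucket holding the first sub-hash of img1
```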

C. Database Search

When privacy protection is not required, the proposed framework can work as efficiently as a normal CBIR scheme. In general, there are several possibilities to perform database search. They mainly differ in the domain for distance computation, which can be the feature space, the quantized feature space, or the hash space. In order to facilitate the explanation, we assume that an original query is a hash value. It can be generated by the client or the server. In the former case, P1 is still preserved, but there is no guarantee for P2. In the latter case, since the client sends the original content to the server, no privacy is preserved for the client.

1) Approximate Nearest Neighbor Search: When the server receives a hash value, it checks the table for each sub-hash value and optionally performs a nearest neighbor search within a Hamming sphere. For each binary sub-hash value, the multimedia object IDs within a small Hamming radius r are retrieved. When r ≥ 1, we call it multi-probing, because this is similar to the concept of multi-probe LSH [41]–[43]. Additionally, when side information is available, different policies can be applied to prioritize sub-hash values in the neighborhood [44].

The retrieved objects for all sub-hash values are put into a list A. This list of candidates is sorted according to the hash distance from the query. The hash distance can be defined similarly to the L1 distance

D(H1, H2)|L1 = ∑_{i=0}^{n−1} |dH(h1i, h2i)|,   (1)

where dH denotes, e.g., the Hamming distance between two sub-hash values. In general, we assume that similar multimedia objects should have similar hash values. Therefore, the nearest neighbors can be obtained from the sorted list. If not specified otherwise, we use D(H1, H2)|L1 in the later experiments.
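The distance of Eq. (1) is straightforward to compute on binary hashes. A minimal sketch, assuming 128-bit hash strings split into n = 8 sub-hashes of l = 16 bits:

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

def hash_distance(H1: str, H2: str, n: int = 8, l: int = 16) -> int:
    """L1-style hash distance of Eq. (1): sum of sub-hash Hamming distances."""
    return sum(hamming(H1[i*l:(i+1)*l], H2[i*l:(i+1)*l]) for i in range(n))

print(hash_distance("0" * 128, "0" * 127 + "1"))  # differs in one bit -> 1
```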

Distance computation can also be performed with feature vectors or quantization indices. In that case, we just need to replace dH in the above equation with the distance in the feature space or the quantization space. It is also possible to use other similarity metrics, such as the L2 distance or the cosine similarity.

2) Approximate Nearest Neighbor Search With Privacy: When privacy protection is "turned on", the hash value of the query content must be generated by the client. A partial query is then formed by omitting some bits in one or more sub-hash values according to a privacy policy. In general, the more bits are missing, the more client privacy (P1, P2) is preserved.

The partial hash value is sent to the server along with the privacy policy, i.e., the positions of the absent bits. If b bits are omitted from each sub-hash value, the server has to check 2^b · n hash buckets. All the candidate IDs are sent back to the client, together with the corresponding hash values. The client eventually performs a search by comparing the hash values in the list with the original one.
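The server-side bucket enumeration can be sketched as follows; the function name and the bit-string representation are our own illustration of checking 2^b buckets per sub-hash.

```python
from itertools import product

def expand_partial(sub_hash: str, omitted_positions):
    """Enumerate all completions of a sub-hash whose bits at
    `omitted_positions` were withheld by the client (2^b candidate buckets)."""
    bits = list(sub_hash)
    out = []
    for combo in product("01", repeat=len(omitted_positions)):
        for pos, val in zip(omitted_positions, combo):
            bits[pos] = val
        out.append("".join(bits))
    return out

# b = 2 bits omitted from a 16-bit sub-hash -> 4 buckets to check
buckets = expand_partial("0100011010110010", [2, 5])
print(len(buckets))
```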


WENG et al.: PRIVACY-PRESERVING FRAMEWORK FOR LARGE-SCALE CBIR 157

TABLE II

SYSTEM PARAMETERS AND PERFORMANCE ESTIMATION

Ideally, in order to guess the query, a curious server generates an extended query list by enumerating all 2^{n·b} possible combinations of the absent bits, and performs the procedure in Sect. III-C1 to find a list of candidates for each item in the extended query list. When n · b is large, the cost for generating the candidate lists becomes prohibitively high; the server is neither able to form the extended query list, nor capable of telling which candidate lists are of interest to the client. Thus client privacy P2 is preserved. In addition, since the client only receives hash values instead of features, database privacy P3 is also preserved to a certain extent. However, P3 decreases with the number of omitted bits.

IV. PRIVACY PERFORMANCE ESTIMATION

The performance of a system using the proposed framework consists of retrieval performance and privacy performance. The former can be evaluated with conventional metrics in information retrieval and depends on the specific application. In the following, we mainly discuss the privacy performance.

The privacy performance is related to P1, P2, and P3. In our framework, P1 and P3 depend on the one-wayness of the robust hashing mechanism; P2 is related to the matching set (candidate list) A. In particular, |A| indicates the chance of knowing the query content; on the other hand, the types of content in A indicate the chance of knowing query properties. In the following, we roughly estimate |A| according to some system parameters.

Denote the number of distinct (original) items in the database by N and the number of bits in a sub-hash value by l. Assume each distinct item has x near-duplicates including the original. For each query, we need to check n hash buckets. The probability that two randomly chosen items fall into the same hash bucket is p0 = 1/2^l. On average each hash bucket contains N0 = N · x/2^l items. Assume that a fraction p1 of these items appear in one or more of the corresponding buckets in the other n − 1 hash tables. For each query, we approximately get N1 = N0 · (1 − p1) · n + N0 · p1 items in a candidate list, where the parameter

p1 = ∑_{i=1}^{n−1} C(n−1, i) · p0^i · (1 − p0)^{n−1−i}

is the probability that an item appears in a fixed bucket in one or more of the other n − 1 hash tables, with C(·, ·) denoting the binomial coefficient.

Since p1 is typically small, the second term of N1 is negligible when N0 is small.

When privacy enhancement is enabled, bi bits are omitted from each sub-hash value. For each of the n hash tables, the server has to check 2^{bi} buckets, which results in N0 · 2^{bi} candidates. Assuming that a fraction p2,i of these items are present in the results from the other n − 1 hash tables, there are approximately N2 = N0 · ∑_{i=1}^{n} 2^{bi} · (1 − p2,i) + N0 · ∑_{i=1}^{n} 2^{bi} · p2,i/n items in the overall candidate list. The parameter p2,i can be defined in a similar way as p1; it is the probability that an item from hash table i appears in a fixed bucket in one or more of the other n − 1 hash tables. For the baseline privacy policies, i.e., bi = b, we have

p2,i = p2 = ∑_{i=1}^{n−1} C(n−1, i) · (2^b p0)^i · (1 − 2^b p0)^{n−1−i}

and

N2 = N0 · 2^b · (1 − p2) · n + N0 · 2^b · p2 ≈ N0 · 2^b · n.   (2)

Since p2 is typically small, the second term of N2 is negligible when N0 · 2^b is small.
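The estimates N0, N1, and N2 above can be computed numerically. The following sketch is our transcription of the formulas, plugged with the Case 3 parameters of the later experiments (N = 50,000 distinct images, x = 106 versions each, {l = 16, n = 8}) purely as an illustration:

```python
from math import comb

def candidate_estimates(N, x, l, n, b):
    """Transcription of the Sect. IV estimates: N distinct items, x versions
    each, l bits per sub-hash, n sub-hash tables, b omitted bits per sub-hash."""
    p0 = 1 / 2**l                                   # bucket collision probability
    N0 = N * x / 2**l                               # average bucket occupancy
    p1 = sum(comb(n - 1, i) * p0**i * (1 - p0)**(n - 1 - i) for i in range(1, n))
    N1 = N0 * (1 - p1) * n + N0 * p1                # no privacy enhancement
    q = 2**b * p0
    p2 = sum(comb(n - 1, i) * q**i * (1 - q)**(n - 1 - i) for i in range(1, n))
    N2 = N0 * 2**b * (1 - p2) * n + N0 * 2**b * p2  # with b omitted bits, Eq. (2)
    return N0, N1, N2

N0, N1, N2 = candidate_estimates(N=50_000, x=106, l=16, n=8, b=4)
print(round(N0, 1), round(N1, 1), round(N2, 1))
```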

When multi-probing is used, we assume the Hamming radius is equal to r ≥ 1 for each hash table. For each sub-hash value in the extended query list, n_p(r) = ∑_{i=1}^{r} C(l, i) + 1 probes are generated. Since the probes for neighboring sub-hash values can overlap, there are fewer than n_p(r) · ∑_{i=1}^{n} 2^{bi} probes in total. For the baseline privacy policies, there are approximately N3 = N2 · n_p(r) · p3 items in the overall candidate list, where the parameter p3 is a fraction due to the overlap. It is not straightforward to formulate p3, but it can be estimated empirically.
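The probe count n_p(r) follows directly from the binomial coefficients; a one-line sketch:

```python
from math import comb

def n_probes(l: int, r: int) -> int:
    """Number of multi-probes per sub-hash within Hamming radius r:
    n_p(r) = sum_{i=1}^{r} C(l, i) + 1 (the '+1' is the exact bucket)."""
    return sum(comb(l, i) for i in range(1, r + 1)) + 1

print(n_probes(16, 1))  # 17, matching the n_p = 17 used in the experiments
```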

Table II summarizes the notations and the estimations of some performance benchmarks. In practice, given the system parameters, the privacy protection level can be negotiated. For the server, the maximum acceptable level can be represented by the percentage of database content, e.g., 1% of the database is returned to the client. For the client, the acceptable level depends on its capability to search within the candidate list and the available bandwidth. We can estimate the percentage of database items returned per query. In a simple case, if b bits are omitted from each sub-hash, we have

database coverage per query ≈ (n / 2^{l−b}) · 100%.   (3)
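Eq. (3) can be checked directly. For instance, with the paper's {l = 16, n = 8} scheme and b = 8 omitted bits per sub-hash, the coverage is roughly 8/2^8 ≈ 3.1% (without multi-probing):

```python
def coverage(n: int, l: int, b: int) -> float:
    """Approximate fraction of the database returned per query, Eq. (3)."""
    return n / 2 ** (l - b)

# the paper's {l = 16, n = 8} scheme with b = 8 omitted bits per sub-hash
print(f"{coverage(8, 16, 8):.2%}")
```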


Intuitively, the level of privacy protection can be adjusted by choosing appropriate values for bi. The larger l and n are, the finer the granularity. An advantage of the framework is its scalability: for the same privacy protection level (N2, N3), the required number of omitted bits bi generally decreases with the database size N, i.e., for a larger database the server needs to check fewer hash buckets (∑_{i=1}^{n} 2^{bi}).

A. Majority Voting Attack

Is it possible to attack the proposed framework? We consider, from the server's point of view, a baseline attack called majority voting. The idea is to predict the query category or even the query content from the majority of the candidate list. The most frequent item (category) in the candidate list is judged as a correct answer.

The performance of this attack actually depends on the nature of the database. One important factor is whether there are near-duplicates. In the case of no near-duplicates, a non-empty hash bucket only contains distinct items, and so does the candidate list. The majority voting is unlikely to succeed in this scenario. In the other case, a non-empty hash bucket may contain some near-duplicates. The candidate list is likely to have a non-uniform distribution, which might facilitate the majority voting attack. We will measure the chance of success in our experiments.
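The attack itself is simple to express; a minimal sketch using a frequency count (the toy candidate list is our own):

```python
from collections import Counter

def majority_vote(candidate_list):
    """A curious server's baseline attack: guess that the most frequent
    category (or ID) in the returned candidate list is the query's."""
    guess, count = Counter(candidate_list).most_common(1)[0]
    return guess

# with near-duplicates, the list is skewed toward the true category
print(majority_vote(["cat", "cat", "cat", "dog", "car"]))  # -> cat
```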

V. EXPERIMENT RESULTS

Extensive experiments have been performed to validate the proposed framework. In order to have solid ground truths, we give a concrete example by applying the framework to a near-duplicate detection scenario, i.e., finding similar copies of the same content in a database.

In the following, we first describe the databases for the experiments. Then we demonstrate the effectiveness of the indexing and retrieval schemes without considering privacy. Afterwards, we "turn on" privacy protection and consider both the privacy-preserving performance and the impact on retrieval.

A. Data Sets and Ground Truths

We use a database of 50,000 images from the public domain image collection ImageNet (the validation set of ILSVRC'2012). They consist of 1,000 categories, each with 50 images. Each image is represented by a 128-bit robust hash value and indexed by an {l = 16, n = 8} scheme.

We consider three different cases according to the database size and the existence of near-duplicates. Each case contains a data set within which searches are run, and a query set containing the queries used at search time:

• Case 1: small database, DBS
  – Data set: 50,000 images (DS1)
  – Query set: 105,000 images (QS1)

• Case 2: medium database, DBM
  – Data set: 155,000 images (DS2 = DS1 + QS1)
  – Query set: 1,000 images (QS2)

• Case 3: large database, DBL
  – Data set: 5,300,000 images (DS3)
  – Query set: 1,000 images (QS3 = QS2)

TABLE III
A SET OF DISTORTIONS FOR NEAR-DUPLICATE GENERATION

The data and query sets are built as follows. The small database is denoted as DBS. The original 50,000 images directly form the data set DS1. We then randomly pick 1,000 images from DS1 and subsequently create 105 near-duplicates for each of them, according to the distortions listed in Table III. This results in 105,000 images, which form the query set QS1. When DBS is used, DS1 is queried with every item of QS1. The correct answer should be the original version of the query.

The medium database is denoted as DBM. The data set contains the whole DBS, i.e., DS2 = DS1 + QS1. The query set QS2 contains only the 1,000 original images used to create the near-duplicates in QS1. When DBM is used, a perfect search should return 106 images from DS2, i.e., the 105 near-duplicates plus the original version of the query.

The large database is denoted as DBL. The query set QS3 is the same as QS2. The data set DS3 is however much larger. It contains the 50,000 original images and their 5,250,000 near-duplicates created by the same list of distortions (in total 5,300,000 images). In this case, every query is also expected to return 106 images. The newly added near-duplicates in DS3 act as distractions, making searches more difficult.

B. Robust Hash Algorithms

In order to show the flexibility of the framework, we give two examples of robust hash algorithms. They are both tuned to produce 128-bit hash values. The same {l = 16, n = 8} indexing scheme is used.

The first algorithm is based on the GIST features [45]. We first generate a 512-dimensional GIST feature vector, then perform principal component analysis (PCA) to decorrelate the feature components. In order to increase speed and improve robustness, only the first 256 feature elements are kept. The reduced feature vector is divided into n = 8 parts. Each part is projected onto l = 16 random Gaussian vectors. The results are quantized to 0 or 1 by comparing with their mean value. This is an LSH scheme corresponding to the cosine similarity [46]. The one-way property has been proved in terms of content indistinguishability [47]. This algorithm is an example to show that features can be converted into hash values. It corresponds to the scenario where privacy enhancement is applied to some existing features. However, due to the non-integrated design, the retrieval performance might not be optimal. This algorithm will be denoted as LSH in the rest of the work.
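A minimal sketch of this construction, with our own helper names and a random vector standing in for a PCA-reduced GIST feature; in practice the client and server would have to share the projection seed:

```python
import numpy as np

def lsh_subhash(feature_part, rng, l=16):
    """Random-projection sub-hash (a sketch of the paper's LSH variant):
    project one part of the reduced feature vector onto l random Gaussian
    vectors and binarize by comparing with the projections' mean value."""
    proj = rng.standard_normal((l, feature_part.size)) @ feature_part
    return (proj > proj.mean()).astype(np.uint8)

rng = np.random.default_rng(0)          # shared seed (client and server)
feature = rng.standard_normal(256)      # stand-in for a reduced GIST vector
subhashes = [lsh_subhash(feature[i*32:(i+1)*32], rng) for i in range(8)]
H = np.concatenate(subhashes)           # 128-bit overall hash
print(H.size)
```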

The second algorithm is based on the discrete wavelet transform (DWT). An image first undergoes a DWT. The base band is then processed by the discrete cosine transform (DCT). The sign bits of low-frequency DCT coefficients (excluding the DC) are extracted to form the hash value. They have been shown to have good performance in content identification [16], [48] and authentication [49]. This algorithm addresses some drawbacks of the first one. It is faster; it does not require synchronizing the random projections and the PCA matrix between the server and the client. It corresponds to the scenario where a privacy-preserving system is started from scratch. This implementation is probably more attractive in practice. The algorithm will be denoted as DWT in the rest of the work.
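A rough sketch of this construction is given below. It is our simplification, not the authors' implementation: the Haar approximation band is taken as 2 × 2 block means, the DCT is applied via an explicit orthonormal DCT-II matrix, and the low-frequency coefficients are read along anti-diagonals, skipping the DC term.

```python
import numpy as np

def dct2(x):
    """Orthonormal 2-D DCT-II of a square array via the DCT matrix."""
    n = x.shape[0]
    k = np.arange(n)
    C = np.sqrt(2 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    C[0] /= np.sqrt(2)
    return C @ x @ C.T

def dwt_dct_hash(img, bits=128):
    """Sketch of the DWT-based hash: single-level Haar approximation band
    (2x2 block means, our simplification), then sign bits of low-frequency
    DCT coefficients, excluding the DC coefficient."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    base = img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    m = min(base.shape)
    coef = dct2(base[:m, :m])
    # scan low-frequency anti-diagonals, skipping the DC coefficient (0, 0)
    lowfreq = [coef[i, s - i] for s in range(1, m) for i in range(s + 1)]
    return (np.array(lowfreq[:bits]) >= 0).astype(np.uint8)

img = np.random.default_rng(1).random((64, 64))
print(dwt_dct_hash(img).size)
```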

We use the classic E2LSH scheme [50] as a baseline for performance comparison. It is implemented and tested under the same conditions. In order to generate hash values of the same size, we use 4 random projections for each sub-hash value and quantize the results with a 4-bit uniform quantizer. Note that E2LSH only uses the hash value for generating the candidate list, and (by default) further ranking is based on the Euclidean distance between original features.

Another baseline algorithm is the recently proposed product quantization (PQ) scheme [51]. We use 16 vector quantizers, each producing 8 bits. The distance metric is the L1 distance, and the comparison is symmetric, i.e., between de-quantized feature vectors. In order to have a fair comparison, both E2LSH and PQ are applied to the 256-dimensional GIST features after PCA and dimension reduction. The same {l = 16, n = 8} indexing scheme is used.

C. Indexing Performance

The usage information of hash tables is shown in Table IV for all three cases. The results are averaged over the eight sub-hash tables. We consider the hash table utilization ratio and the uniformity of item distribution. The former is represented by the percentage of non-empty hash buckets, and the latter is represented by the average number of items per bucket with standard deviation. In general, large utilization rates and small deviation imply efficient indexing. Ideally, 50,000 distinct items should cover most of the 2^16 = 65,536 hash buckets. In reality, since hash bits are not i.i.d. Bernoulli distributed with equal probability, items are not uniformly distributed among the hash buckets.

TABLE IV
AVERAGE HASH TABLE USAGE. UNIFORM HASH BUCKET USAGE INDICATES EFFICIENT INDEXING

The statistics show that E2LSH uses the fewest hash buckets, PQ and DWT use the most, and LSH sits somewhere in between. This might imply their capability of distinguishing different content. On the other hand, the average bucket size and standard deviation show a reverse trend for Case 1: E2LSH has the largest bucket size, PQ and DWT have the smallest, and LSH is in between. This is consistent with the previous observation. However, Case 2 shows that PQ has a larger deviation than LSH. That means PQ is sensitive to the addition of QS1. Moreover, PQ takes the lead for the maximum bucket size. It is not clear at the moment how PQ's behavior is related to the retrieval performance. Regarding the other schemes, since they have consistent statistics, their retrieval performance might be consistent too.

D. Retrieval Performance

We look at the retrieval performance from two perspectives, using the precision-recall (PR) curve and the receiver operating characteristic (ROC) curve. The former emphasizes the quality of retrieved results; the latter emphasizes the overall risk. Privacy is not considered at the moment. In order to achieve different performance trade-offs, a threshold comparison procedure is applied to the candidate list. The hash value of the query is compared with the ones in the candidate list. The resulting Hamming distances are compared with a threshold. Candidates below the threshold are accepted as correct answers. The PR and ROC curves are derived by varying the threshold value.
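The threshold sweep can be sketched as follows; the toy candidate list and helper names are ours (the real evaluation runs over the image database):

```python
def pr_points(query_hash, candidates, relevant_ids, dist, thresholds):
    """Derive precision-recall points by sweeping the acceptance threshold
    on the hash distance, as done for the PR/ROC curves."""
    points = []
    for t in thresholds:
        accepted = {cid for cid, h in candidates.items() if dist(query_hash, h) <= t}
        tp = len(accepted & relevant_ids)
        precision = tp / len(accepted) if accepted else 1.0
        recall = tp / len(relevant_ids)
        points.append((t, precision, recall))
    return points

dist = lambda a, b: sum(x != y for x, y in zip(a, b))  # Hamming distance
cands = {"rel": "0000", "near": "0001", "far": "1111"}
pts = pr_points("0000", cands, {"rel", "near"}, dist, [0, 1, 4])
print(pts)
```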

1) Case 1 (DBS): The PR curves are shown in Fig. 2 for a few parameters. In the figure, "MP x" means multi-probing within Hamming radius x. One can see that DWT generally performs the best, followed by LSH, E2LSH, and PQ. The precision-recall trade-off behaves like a step function for most schemes: the precision is constantly close to 1 for low recall values (except for PQ). That means the retrieved results are mostly relevant. The recall stops at about 0.6 and 0.7 for LSH and DWT respectively. When multi-probing is enabled, the recall can be significantly increased up to 0.9, but the performance gain decreases with the probing radius. Intuitively, the best recall is given by "MP2", because this strategy probes the most hash buckets.


Fig. 2. Retrieval performance (Case 1, DBS). "MP x" means multi-probing within Hamming radius x. DWT generally performs the best, followed by LSH, E2LSH, and PQ. The best recall is given by "MP2".

Fig. 3. ROC curve (Case 1, DBS). "MP x" means multi-probing within Hamming radius x. The same trend as in Fig. 2 can be observed. When multi-probing is off, the curves are almost flat (except for PQ).

The ROC curves are shown in Fig. 3 (with the x-axis on a log scale). The same trend can be observed. When multi-probing is off, the curves are almost flat (except for PQ), i.e., the true positive rate (recall) is not much influenced by the threshold. On the other hand, the false positive rate can be tuned to be as low as 10^−9.

Both figures show that E2LSH and PQ perform worse than LSH and DWT. E2LSH gives the second worst performance. One possible reason is that the ranking of candidates uses the distances between original features, whose identification capability is not as good as that of the hash values.

The PQ scheme gives the worst overall performance. It is probably due to the lack of robustness. Note that PQ can still achieve a large recall rate at the cost of low precision. That means the relevant items are far away from the query but still in the same bucket. Thus a large threshold is required to increase the recall. When the false positive rate is larger than about 10^−5 or 10^−4 (corresponding to precision lower than 0.2 or 0.1), PQ actually performs better than E2LSH and LSH respectively.

Fig. 4. Retrieval performance (Case 2, DBM). "MP x" means multi-probing within Hamming radius x. The overall trend does not change. PQ performs much better in this case.

2) Case 2 (DBM) and Case 3 (DBL): For Case 2, the PR curves are shown in Fig. 4. The ROC curves are omitted here due to the similarity to the previous case. Note that although the overall trend does not change, PQ performs much better in this case. It still gives a linear trade-off between precision and recall, but the recall is much higher. A possible explanation is that the quantizers used by PQ are trained with DS1. Another possible reason is that the ratio between positive and negative samples is smaller in this case.

For Case 3, the ROC curves are shown in Fig. 5. The PR curves are omitted here due to the similarity to previous cases. The baseline E2LSH is ignored in this case due to implementation issues. Since E2LSH uses feature comparison instead of hash comparison, fitting five million feature vectors into memory is not efficient for a normal workstation. By comparing with Fig. 3, one can see that there is no significant difference. Therefore, the performance is not much affected by the increased database scale. Another thing to notice is that LSH performs slightly better than DWT in the region of low false positive rates when multi-probing is on.

Overall, the above results show that the framework is effective for content-based retrieval and the performance is comparable to state-of-the-art schemes. In the following, we consider privacy protection and its influence on retrieval.

E. Privacy-Preserving Performance

We consider the privacy-preserving performance in terms of P2, because P3 is inversely proportional to P2, and P1 is decided by the system parameters. Privacy-enhanced retrieval is carried out according to different privacy policies. First, we randomly omit b = 1, · · · , 8 bits from each sub-hash value. We refer to these policies as the baseline policies. Since the server puts all candidates into a list, two metrics are used:

• The number of candidates in the list;
• The entropy of the candidate categories in the list.

Additionally, we also guess the query ID and the query category by majority voting and compute the success rate. In the following experiments, Case 2 (DBM) is used as an example if not specified otherwise.


Fig. 5. ROC curve (Case 3, DBL). "MP x" means multi-probing within Hamming radius x. There is no significant difference from Fig. 3. The performance is not much affected by the increased database scale. (a) Complete. (b) Close-up.

1) Baseline Privacy Policies: We first consider the case when multi-probing is not used. The average numbers of candidates per query are shown in Fig. 6. In the figure, "0 bit" per sub-hash corresponds to the non-privacy-preserving scenario. A more detailed plot is shown in Fig. 7 for LSH. In general, we can observe that:

• The number of candidates increases exponentially with the number of omitted bits.
• The distribution of candidates among different sub-hash tables is quite even.

The entropy of candidate categories is shown in Fig. 8 in a similar way. This figure shows how difficult it is to guess the category of the query. Since there are 1,000 categories in total, the maximum entropy of candidate categories in a list is approximately 10 bits. We can observe that the entropy increases linearly with the number of omitted bits.
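The category entropy here is the standard empirical (Shannon) entropy of the candidate list:

```python
from collections import Counter
from math import log2

def category_entropy(categories):
    """Empirical entropy (in bits) of the candidate categories in a list;
    with 1,000 equally likely categories the maximum is about 10 bits."""
    counts = Counter(categories)
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())

print(category_entropy(["a", "a", "b", "b"]))  # uniform over 2 categories -> 1.0
```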

In order to measure the accuracy of candidate number estimation in Table II, especially N2, we compute the correlation coefficient between the estimation and the true value for all the queries. Figure 9 shows a histogram close-up of the correlation coefficient between the number of candidates and its estimation N2 defined in (2). We can see that most values are quite close to one, on average larger than 0.98. That implies we can estimate the computation and communication cost when defining a privacy policy, using Table II.

Fig. 6. Average no. of candidates per query (Case 2, baseline policies). The candidate number increases exponentially with the policy.

Fig. 7. Average no. of candidates per query (Case 2, LSH, baseline policies). The distribution of candidates among different sub-hash tables is quite even.

Fig. 8. Entropy of candidate categories per query (Case 2, baseline policies). The entropy increases linearly with the policy.

When multi-probing is enabled, the load of the server is tremendously increased. We choose r = 1, so that each sub-hash of the query is expanded to n_p = 17 items.


Fig. 9. Histogram of the correlation coefficient between the no. of candidates and estimation (Case 2, baseline policies, close-up). Good correlation implies that the system behavior can be predicted in advance.

Fig. 10. Average no. of candidates per query (Case 2, baseline, MP1). Multi-probing makes privacy protection much stronger, but it is not always desired.

The average numbers of candidates are shown in Fig. 10. We can observe trends that are similar to the non-multi-probing case. Note that when 8 bits are omitted from each sub-hash value, the candidate list covers more than 25% of the whole database, which is probably not desirable in practice. The corresponding empirical entropy values of candidate categories are shown in Fig. 11. Due to the large amount of candidates, the entropy stays high from the beginning.

2) Other Privacy Policies: Given a fixed number of bits to omit, which sub-hash value(s) should be chosen? We have tested some more privacy policies where not all sub-hash values are involved in bit omission. The results are shown in Fig. 12. Basically, the figure tells that for 8-bit omission, 8 × 1 schemes (omit 8 bits in one sub-hash value) are more effective than 2 × 4 schemes (which are more effective than 1 × 8 schemes). They are even more effective than 4 × 4 schemes (16-bit omission), which are more effective than 2 × 8 schemes. Therefore, we can conclude that in order to achieve a larger number of candidates, one should:

• Concentrate the omitted bits in fewer sub-hash values rather than spread them over more sub-hash values.

Fig. 11. Entropy of candidate categories per query (Case 2, baseline, MP1). Multi-probing makes privacy protection much stronger.

Fig. 12. No. of candidates for different privacy policies (Case 2). It is more effective to concentrate the omitted bits in fewer sub-hash values.

This is consistent with the definition of N2 in (2), because the number of candidates increases exponentially with b, but linearly with n. Another observation is that there is no significant difference between sub-hash values (thanks to dimension reduction). This is consistent with the results in Fig. 7.

Another question is: within a sub-hash value, which bits should be chosen first? According to the design of robust hashing, the bits in a sub-hash value are (ideally) equally important. This is verified by experiments. A single bit with a fixed position is omitted from each sub-hash value. The results are shown in Fig. 13. One can see that the bit position within a sub-hash value makes no big difference. This is consistent with the assumption.

3) Influence on Retrieval: What is the influence of privacy enhancement on retrieval performance? Since a privacy policy is essentially a particular multi-probing strategy, one can imagine that privacy enhancement actually forces the server to behave like multi-probing. Therefore, privacy enhancement should improve retrieval performance. Figure 14 shows a ROC curve comparison of LSH with and without privacy enhancement. In the figure, "private x" means x bits are randomly omitted from each sub-hash value. Indeed, the retrieval performance increases with the level of privacy protection and approaches the performance of multi-probing. Figure 15 shows a PR curve comparison of DWT. The same trend can be observed. By increasing the search radius, we may get more irrelevant items in the candidate list. The retrieval performance is not degraded because we use threshold-based decision making. Since the threshold stays the same, most irrelevant items are filtered out.

Fig. 13. No. of candidates for different 1-bit omission policies (Case 2). Within a sub-hash value, the position of a bit does not matter.

Fig. 14. ROC curve comparison (Case 2, LSH). "Original" means no privacy enhancement. "Private x" means x bits are randomly omitted from each sub-hash value. "MP x" means multi-probing within Hamming radius x.

4) Majority Voting Attack: The majority voting attack has been applied to estimate the query’s category and ID. Specifically, the most frequent category or ID in the candidate list is considered to be that of the query. The rate of success is shown in Fig. 16 for query category estimation with DWT, and in Fig. 17 for query ID estimation with LSH. Since the attack is unlikely to succeed in Case 1, we only consider Cases 2 and 3. The baseline privacy policies are used.
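The attack itself is simple to express. A minimal sketch over the IDs in a candidate list (the helper name is ours; the paper does not prescribe an implementation):

```python
from collections import Counter

def majority_vote(candidate_ids):
    """Server-side guess of the query ID: the most frequent ID
    among the candidates returned for the reduced hash value."""
    if not candidate_ids:
        return None
    return Counter(candidate_ids).most_common(1)[0][0]

# Near-duplicates of item 'A' dominate the candidate list,
# so the server's guess is 'A'.
print(majority_vote(['A', 'B', 'A', 'C', 'A', 'D']))  # A
```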

Several results are worth noticing. First, majority voting indeed works to some extent when there are near-duplicates. That means the majority of a candidate list is likely to consist of the near-duplicates of the query. On the one hand, this is perfectly reasonable, because an effective CBIR algorithm is supposed

Fig. 15. PR curve comparison (Case 2, DWT). “Original” means no privacy enhancement. “Private x” means x bits are randomly omitted from each sub-hash value. “MP x” means multi-probing within Hamming radius x.

Fig. 16. Majority voting attack on query category (DWT). The effect decreases with the privacy policy.

Fig. 17. Majority voting attack on query ID (LSH). The effect decreases with the privacy policy.

to behave like that; on the other hand, this implies a threat to the privacy-preserving mechanism.

Second, note that the success rate decreases when the number of omitted bits increases. Therefore, in order to prevent majority voting, the number of omitted bits should not be


164 IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 10, NO. 1, JANUARY 2015

too small. Also note that multi-probing actually reduces the success rate. This is because probing more hash buckets makes the candidate list noisier.

In practice, the attack may not be so straightforward, because each item may have a different number of relevant items, and the database composition can be much more complex than in our cases. For a particular query, the success rate of majority voting depends on the number of relevant items in the database. Majority voting is likely to work for those items that have the most relevant items, e.g., the most popular ones.

Additionally, the attack is also affected by the database density. Since the success rate in Case 3 is much lower than in Case 2, we conclude that, for a fixed indexing scheme, the success rate decreases with the number of distinct items. In the next section, we provide a more formal security analysis from both the client’s and the server’s points of view.

VI. SECURITY ANALYSIS

Attacks on a retrieval system can be categorized according to the opponent’s resources, computing capability, and application scenarios. The security notions below are defined in a similar way as in [52, MAC algorithms]. First we define the power of an attack.

• In a brute-force attack, the attacker has no knowledge about the system’s input and output.

• In a known-text attack, the attacker knows some content-hash pairs.

• In a chosen-text attack, the attacker can access the system and request answers for a number of queries of his choice.

• In an adaptive chosen-text attack, the choice of the current query may depend on the outcome of previous queries.

An attack is verifiable if the attacker knows beforehand that the attack will succeed with a high probability.

In a PCBIR service, a client is virtually free to perform all the attacks against the server. As long as the service policy allows, he can send as many (kinds of) queries as he wants. The server, on the other hand, is not able to perform (adaptive) chosen-text attacks, because the protocol is not interactive and must be initiated by the client.

Basically, our protocol works as follows: the server receives a reduced hash value and a privacy policy from the client, and returns many hash values and IDs (metadata). Since no actual content is sent, the confidentiality of the query and of the database are both preserved to a certain extent. The hash value does reveal some information about the input content. However, this information leak is not sufficient to create a verifiable attack. A multimedia object (pictures, audio-visual clips) typically takes hundreds of kilobytes, while a hash value only takes hundreds of bits. The strong lossy compression makes it unlikely that a hash value can be inverted. Recent results show that image reconstruction from descriptors with the help of existing knowledge (known-text attack) is not really satisfactory [31], [32], [53]. Hash values can in fact be considered as extremely compressed versions of descriptors. To the best of our knowledge, so far there is no existing work on content reconstruction from hash values. The one-wayness of random projection based approaches has been proved in

terms of content indistinguishability [47], which is adapted from the indistinguishability definition widely used in cryptanalysis [54]. This property basically says that an adversary cannot distinguish two different inputs with sufficient confidence by observing their hash values.

A. Client Privacy (P1, P2)

The server can guess the query content from its hash value. This is computationally difficult, as explained above. Therefore, P1 is always guaranteed. If there is a match in the database, then the scope of the query is narrowed down to the candidates. The success rate of a brute-force attack is inversely proportional to the number of candidates. In order to quantify the privacy protection P2 at this stage, we can use k-diversity [55, l-diversity], which links to the notion of k-anonymity [56]. A release of data is said to have the k-anonymity property if the information for each person contained in the release cannot be distinguished from at least k − 1 individuals whose information also appears in the release. k-diversity requires that for each query, the server should return at least k distinct answers, which can be controlled by the privacy policy.
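As a sketch, a client-side check of this notion might look as follows (the helper name is our own; the paper only defines the requirement, not an implementation):

```python
def satisfies_k_diversity(candidate_ids, k):
    """A candidate list protects the query under k-diversity only if
    it contains at least k distinct answers."""
    return len(set(candidate_ids)) >= k

# Duplicates do not count toward diversity: 4 candidates but
# only 3 distinct IDs.
print(satisfies_k_diversity(['A', 'B', 'A', 'C'], 3))  # True
print(satisfies_k_diversity(['A', 'A', 'A'], 2))       # False
```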

In a known-text attack, the server already knows some of the client’s past queries and/or query hash values. In general, a new query is more likely to be related to past ones. This principle helps to further narrow down the query scope. The effectiveness of this strategy depends on the behaviour of both parties. If the server always updates the client’s profile by observing new queries, then the client can lead the server to build a wrong profile by sending fake queries. If the client always tends to query similar things, then he is more vulnerable to profiling. To conclude, it is good practice to make the k-diversity sufficiently large and to sometimes send fake queries. In addition, clients sometimes have the option to be anonymous by using, e.g., Tor,2 which makes profiling even less effective.

B. Server Privacy (P3)

We assume that the client’s interest is to know what is in the database. It is computationally difficult for the client to guess the database content from received hash values. Another possible attack is to pre-compute the hash values of many multimedia objects offline, and compare them with the hash values returned by the server during past queries. This is only negligibly more efficient than directly querying the server online, because the computational difficulty mainly lies in figuring out the content. In a known-text attack, the client knows some content in the database. Since related content is likely to exist together, he can give priority to verifying the existence of related content. In an (adaptive) chosen-text attack, the client can adapt his search in a more efficient way. For example, he can send online queries to the server, and meanwhile perform the offline search. However, recall that we assume the server has much higher computing power than the client, which is often true in practice. The attacks mentioned in

2https://www.torproject.org



this section are only feasible for clients that are as powerful as or even more powerful than the server. They are somewhat beyond the setting of a curious-but-honest model. On the other hand, these attacks can be thwarted by making the hash generation dependent on a key. The server should periodically update the hash generation key and re-compute the hash values of the database. The key should be communicated to the client before a query session. Note that the LSH algorithm is actually key dependent. The Gaussian vectors for projection are generated by a pseudo-random number generator (PRNG). The seed of the PRNG acts as the key. The DWT algorithm is not keyed, but there are keyed alternatives based on wavelets, such as [57] and [58].
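A minimal sketch of such a keyed sign-bit hash, with NumPy's seeded Gaussian generator standing in for the PRNG (the function name and parameters are ours, not the paper's):

```python
import numpy as np

def keyed_lsh_hash(feature, key, n_bits):
    """Sign bits of Gaussian random projections; the PRNG seed acts
    as the key. Changing the key changes every projection vector, so
    previously observed hash values become useless and the server
    must re-hash the database."""
    rng = np.random.default_rng(key)  # key = PRNG seed
    projections = rng.standard_normal((n_bits, len(feature)))
    return (projections @ feature >= 0).astype(np.uint8)

feature = np.array([0.3, -1.2, 0.7, 2.1])
h1 = keyed_lsh_hash(feature, key=42, n_bits=8)
h2 = keyed_lsh_hash(feature, key=42, n_bits=8)
assert np.array_equal(h1, h2)  # same key -> same hash
```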

VII. CONCLUSION AND DISCUSSION

In this work, we propose a privacy-enhancing framework for large-scale content-based information retrieval. It can be used for any CBIR system based on features and similarity search. The framework is mainly based on robust hashing and piece-wise inverted indexing. We have first introduced the concept of a privacy policy, according to which the privacy protection level can be adjusted. We also adopted a privacy notion called k-diversity. A device can make a policy based on its capability. A stronger policy not only increases the load of the server, but also costs more bandwidth and local processing. Therefore, an equilibrium can eventually be reached. This is a useful property in a heterogeneous network. A new system can be designed either from privacy requirements or from computation and communication requirements. For example, one can derive b by setting (2) equal to k, or derive k by fixing (3) and using (2).

The framework has been implemented and extensively evaluated in different scenarios. We show that the privacy level, e.g., the number and the diversity of candidates, can be tuned by the privacy policy. Some guidelines are given on how to choose the omitted bits. We have demonstrated both retrieval performance and privacy-preserving performance for a particular content identification application. Two different constructions of robust hash algorithms are plugged into the framework to show its versatility. One is based on the sign bits of random projections; the other is based on the sign bits of DWT coefficients. Both algorithms show better retrieval performance than state-of-the-art reference schemes based on E2LSH and product quantization. The results also show that the proposed framework in general resembles a multi-probing scheme, which improves the retrieval performance.

We have considered the majority voting attack for estimating the query category and ID. Experimental results show that query items with near-duplicates are likely to be vulnerable to majority voting. The chance of success is equivalent to the chance that a query item has more near-duplicates than other irrelevant items in the candidate list. The results also show that the success rate decreases with the number of omitted bits and the number of distinct items. In practice, a client could send the server a number of fake query hash values to further reduce the effect of this attack. Actually, majority voting is a general attack on any PCBIR scheme when the

group of relevant answers has a significant probability among all the candidates. To the best of our knowledge, this work is perhaps the first to discuss this issue. Further study is needed to investigate more sophisticated attacks.

In the proposed framework, we only consider retrieving a small amount of information (metadata) about a multimedia object, e.g., the content ID. According to Table II, the amount of overhead is affordable even for extreme cases. For example, in Case 3, when 8 bits are omitted from each sub-hash value, about 3% of the database (more than 160,000 items) should be returned. Assuming each item contains 256 bits (128-bit hash + 128-bit ID), that only amounts to about 5 MBytes. However, if the client wants to retrieve the entire multimedia object, there is probably no efficient solution in the current setting. A basic observation is that as long as the server only returns one (correct) answer, it approximately knows what the user query is (P2 is compromised). Therefore, sending more than one multimedia object is inevitable if privacy is required. This is an expensive application considering the huge volume of multimedia data. In practice, once the necessary information is retrieved by a PCBIR scheme, content retrieval can be solved using generic PIR (private information retrieval) techniques, see [10], [28], [29].
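The overhead figure above can be checked with a line of arithmetic (a sanity check only, not part of the framework):

```python
# Worst case discussed above: Case 3 with 8 omitted bits per sub-hash
# value returns about 160,000 items, each carrying a 128-bit hash
# plus a 128-bit ID.
items = 160_000
bits_per_item = 128 + 128
total_bytes = items * bits_per_item // 8
print(total_bytes / 1e6)  # 5.12 -> roughly 5 MBytes
```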

The framework particularly addresses the “query by example” scenario. For alternative scenarios, such as query by region or relevance feedback involving many examples, the framework is likely to work if the query can be expressed as a feature vector, because arbitrary feature vectors can be converted into hash values [4, Fig. 7]. The framework can also work with multiple features. In that case, each sub-hash value corresponds to a different feature. It is compatible with the recent submodular video hashing framework [40]. If a feature vector is very long, we can divide it into many segments and hash them separately (with a fixed sub-hash length). The direct impact is that the number of hash tables n increases. This is usually affordable because the increase of memory cost is linear. Note that feature data typically contain a significant amount of redundancy, which can be effectively compressed by hashing techniques, see [59].

A limitation of our proposal is that it requires both the server and the client to implement the same architecture, i.e., the same feature extraction, hash algorithms, indexing, etc. Although random projection can be applied to any feature, it is not always feasible for existing systems. If privacy protection is to be implemented solely on the client side, our proposal still gives some inspiration. In that case, what the client can do is to modify the query before sending it to the server, and ask for an extended list of answers. A question here is to what extent the query content should be modified. The client can assume the server uses a hash-based indexing scheme (with a random robust hash algorithm) just like our proposal. The query should be modified so that the hash value of the modified query virtually corresponds to a privacy policy. Another possibility for the client is to send multiple (fake) queries. A question there is how to select different queries. The idea behind our proposal is that fake queries should be chosen so that their hash values are sufficiently different. The notion of k-diversity still works as a guideline.



Nevertheless, we do not advocate a one-sided approach. A successful protocol needs some collaboration and infrastructure. For example, in information security, a protocol is typically a combination of private-key, public-key, and hash algorithms. The infrastructure contains a pool of standard algorithms so that different devices can communicate. Privacy protection in the multimedia domain is still in its infancy. We may expect to agree on a set of standard descriptors and hash algorithms. Once hash-based indexing and search become widely used, our proposal will take less effort to implement.

ACKNOWLEDGEMENT

The authors would like to thank Ewa Kijak and Teddy Furon for their suggestions on improving the paper.

REFERENCES

[1] M. S. Lew, N. Sebe, C. Djeraba, and R. Jain, “Content-based multimedia information retrieval: State of the art and challenges,” ACM Trans. Multimedia Comput., Commun., Appl., vol. 2, no. 1, pp. 1–19, Feb. 2006.

[2] J. Bringer, H. Chabanne, and A. Patey, “Privacy-preserving biometric identification using secure multiparty computation: An overview and recent trends,” IEEE Signal Process. Mag., vol. 30, no. 2, pp. 42–52, Mar. 2013.

[3] A. Aghasaryan, M. Bouzid, D. Kostadinov, M. Kothari, and A. Nandi, “On the use of LSH for privacy preserving personalization,” in Proc. 12th IEEE Int. Conf. Trust, Secur., Privacy Comput. Commun. (TrustCom), Jul. 2013, pp. 362–371.

[4] G. Fanti, M. Finiasz, and K. Ramchandran, “One-way private media search on public databases: The role of signal processing,” IEEE Signal Process. Mag., vol. 30, no. 2, pp. 53–61, Mar. 2013.

[5] G. Acar et al., “FPDetective: Dusting the web for fingerprinters,” in Proc. ACM SIGSAC Conf. Comput. Commun. Secur. (CCS), 2013, pp. 1129–1140.

[6] E. Balsa, C. Troncoso, and C. Diaz, “OB-PWS: Obfuscation-based private web search,” in Proc. IEEE Symp. Secur. Privacy, May 2012, pp. 491–505.

[7] Z. Erkin et al., “Protection and retrieval of encrypted multimedia content: When cryptography meets signal processing,” EURASIP J. Inf. Secur., vol. 2007, p. 20, Dec. 2007.

[8] R. L. Lagendijk, Z. Erkin, and M. Barni, “Encrypted signal processing for privacy protection: Conveying the utility of homomorphic encryption and multiparty computation,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 82–105, Jan. 2013.

[9] J. Shashank, P. Kowshik, K. Srinathan, and C. V. Jawahar, “Private content based image retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2008, pp. 1–8.

[10] P. R. Sabbu, U. Ganugula, S. Kannan, and B. Bezawada, “An oblivious image retrieval protocol,” in Proc. IEEE Int. Workshop Adv. Inf. Netw. Appl. (WAINA), Mar. 2011, pp. 349–354.

[11] Z. Erkin, M. Franz, J. Guajardo, S. Katzenbeisser, I. Lagendijk, and T. Toft, “Privacy-preserving face recognition,” in Proc. 9th Int. Symp. Privacy Enhancing Technol. (PETS), 2009, pp. 235–253.

[12] A.-R. Sadeghi, T. Schneider, and I. Wehrenberg, “Efficient privacy-preserving face recognition,” in Proc. 12th Int. Conf. Inf. Secur. Cryptol. (ICISC), 2009, pp. 229–244.

[13] M. Osadchy, B. Pinkas, A. Jarrous, and B. Moskovich, “SCiFI—A system for secure face identification,” in Proc. IEEE Symp. Secur. Privacy (SP), May 2010, pp. 239–254.

[14] M. Barni, P. Failla, R. Lazzeretti, A. Sadeghi, and T. Schneider, “Privacy-preserving ECG classification with branching programs and neural networks,” IEEE Trans. Inf. Forensics Security, vol. 6, no. 2, pp. 452–468, Jun. 2011.

[15] C.-Y. Hsu, C.-S. Lu, and S.-C. Pei, “Image feature extraction in encrypted domain with privacy-preserving SIFT,” IEEE Trans. Image Process., vol. 21, no. 11, pp. 4593–4607, Nov. 2012.

[16] M. Diephuis, S. Voloshynovskiy, O. Koval, and F. Beekhof, “DCT sign based robust privacy preserving image copy detection for cloud-based systems,” in Proc. 10th Workshop Content-Based Multimedia Indexing (CBMI), Jun. 2012, pp. 1–6.

[17] R. Agrawal and R. Srikant, “Privacy-preserving data mining,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2000, pp. 439–450.

[18] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Privacy aware learning,” in Advances in Neural Information Processing Systems 25. Red Hook, NY, USA: Curran Associates, 2012, pp. 1430–1438.

[19] S. Rane and P. T. Boufounos, “Privacy-preserving nearest neighbor methods: Comparing signals without revealing them,” IEEE Signal Process. Mag., vol. 30, no. 2, pp. 18–28, Mar. 2013.

[20] W. Lu, A. Swaminathan, A. L. Varna, and M. Wu, “Enabling search over encrypted multimedia databases,” Proc. SPIE, Media Forensics Secur., vol. 7254, pp. 725418-1–725418-11, Feb. 2009.

[21] W. Lu, A. L. Varna, A. Swaminathan, and M. Wu, “Secure image retrieval through feature protection,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2009, pp. 1533–1536.

[22] B. Mathon, T. Furon, L. Amsaleg, and J. Bringer, “Secure and efficient approximate nearest neighbors search,” in Proc. 1st ACM Workshop Inf. Hiding Multimedia Secur. (IH & MMSec), Jun. 2013, pp. 175–180.

[23] W. Johnson and J. Lindenstrauss, “Extensions of Lipschitz mappings into a Hilbert space,” in Proc. Conf. Modern Anal. Probab., vol. 26, 1984, pp. 189–206.

[24] S. Voloshynovskiy, F. Beekhof, O. Koval, and T. Holotyak, “On privacy preserving search in large scale distributed systems: A signal processing view on searchable encryption,” in Proc. Int. Workshop Signal Process. Encrypted Domain, Lausanne, Switzerland, 2009.

[25] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” Commun. ACM, vol. 51, no. 1, pp. 117–122, Jan. 2008.

[26] P. Boufounos and S. Rane, “Secure binary embeddings for privacy preserving nearest neighbors,” in Proc. IEEE Int. Workshop Inf. Forensics Secur. (WIFS), Nov./Dec. 2011, pp. 1–6.

[27] B. Chor, E. Kushilevitz, O. Goldreich, and M. Sudan, “Private information retrieval,” J. ACM, vol. 45, no. 6, pp. 965–981, 1998.

[28] W. Gasarch, “A survey on private information retrieval,” in Bulletin of the EATCS, vol. 82. Rio, Greece: EATCS, 2004, pp. 72–107.

[29] R. Ostrovsky and W. E. Skeith, III, “A survey of single-database private information retrieval: Techniques and applications,” in Proc. 10th Int. Conf. Pract. Theory Public-Key Cryptogr., 2007, pp. 393–411.

[30] G. Danezis and S. Gürses, “A critical review of 10 years of privacy technology,” in Proc. 4th Surveill. Soc. Conf., 2010.

[31] M. Daneshi and J. Guo, “Image reconstruction based on local feature descriptors,” Dept. Elect. Eng., Stanford Univ., Stanford, CA, USA, Tech. Rep., 2011.

[32] P. Weinzaepfel, H. Jégou, and P. Perez, “Reconstructing an image from its local descriptors,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2011, pp. 337–344.

[33] F. Khelifi and J. Jiang, “Perceptual image hashing based on virtual watermark detection,” IEEE Trans. Image Process., vol. 19, no. 4, pp. 981–994, Apr. 2010.

[34] H. Özer, B. Sankur, N. Memon, and E. Anarim, “Perceptual audio hashing functions,” EURASIP J. Appl. Signal Process., vol. 2005, pp. 1780–1793, Jan. 2005.

[35] A. L. Varna and M. Wu, “Modeling and analysis of correlated binary fingerprints for content identification,” IEEE Trans. Inf. Forensics Security, vol. 6, no. 3, pp. 1146–1159, Sep. 2011.

[36] P. Cano, E. Batlle, T. Kalker, and J. Haitsma, “A review of audio fingerprinting,” J. VLSI Signal Process. Syst., vol. 41, no. 3, pp. 271–284, Nov. 2005.

[37] A. Andoni and P. Indyk, “Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions,” in Proc. 47th Annu. IEEE Symp. Found. Comput. Sci. (FOCS), Oct. 2006, pp. 459–468.

[38] S. Yan, D. Xu, B. Zhang, H.-J. Zhang, Q. Yang, and S. Lin, “Graph embedding and extensions: A general framework for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 1, pp. 40–51, Jan. 2007.

[39] Y. Mu, J. Shen, and S. Yan, “Weakly-supervised hashing in kernel space,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2010, pp. 3344–3351.

[40] L. Cao, Z. Li, Y. Mu, and S.-F. Chang, “Submodular video hashing: A unified framework towards video pooling and indexing,” in Proc. 20th ACM Int. Conf. Multimedia, 2012, pp. 299–308.

[41] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li, “Multi-probe LSH: Efficient indexing for high-dimensional similarity search,” in Proc. 33rd Int. Conf. Very Large Data Bases (VLDB), 2007, pp. 950–961.

[42] A. Joly and O. Buisson, “A posteriori multi-probe locality sensitive hashing,” in Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 209–218.

[43] W. Zhang, K. Gao, Y.-D. Zhang, and J.-T. Li, “Data-oriented locality sensitive hashing,” in Proc. ACM Int. Conf. Multimedia, 2010, pp. 1131–1134.



[44] S. Voloshynovskiy, T. Holotyak, O. Koval, F. Beekhof, and F. Farhadzadeh, “Private content identification based on soft fingerprinting,” Proc. SPIE, Media Watermarking, Secur., Forensics III, vol. 7880, pp. 78800M-1–78800M-13, Feb. 2011.

[45] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic representation of the spatial envelope,” Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, May 2001.

[46] M. S. Charikar, “Similarity estimation techniques from rounding algorithms,” in Proc. 34th Annu. ACM Symp. Theory Comput. (STOC), 2002, pp. 380–388.

[47] W. Lu, A. L. Varna, and M. Wu, “Security analysis for privacy preserving search of multimedia,” in Proc. 17th IEEE Int. Conf. Image Process. (ICIP), Sep. 2010, pp. 2093–2096.

[48] L. Yu and S. Sun, “Image robust hashing based on DCT sign,” in Proc. Int. Conf. Intell. Inf. Hiding Multimedia Signal Process. (IIH-MSP), Dec. 2006, pp. 131–134.

[49] L. Weng, G. Braeckman, A. Dooms, B. Preneel, and P. Schelkens, “Robust image content authentication with tamper location,” in Proc. IEEE Int. Conf. Multimedia Expo, Jul. 2012, pp. 380–385.

[50] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni, “Locality-sensitive hashing scheme based on p-stable distributions,” in Proc. 20th Annu. Symp. Comput. Geometry (SCG), 2004, pp. 253–262.

[51] H. Jégou, M. Douze, and C. Schmid, “Product quantization for nearest neighbor search,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.

[52] H. C. A. van Tilborg and S. Jajodia, Eds., Encyclopedia of Cryptography and Security, 2nd ed. New York, NY, USA: Springer-Verlag, 2011.

[53] H. Kato and T. Harada, “Image reconstruction from bag-of-visual-words,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2014, pp. 955–962.

[54] S. Goldwasser and S. Micali, “Probabilistic encryption,” J. Comput. Syst. Sci., vol. 28, no. 2, pp. 270–299, 1984.

[55] A. Machanavajjhala, D. Kifer, J. Gehrke, and M. Venkitasubramaniam, “L-diversity: Privacy beyond k-anonymity,” ACM Trans. Knowl. Discovery Data, vol. 1, no. 1, Mar. 2007, Art. ID 3.

[56] L. Sweeney, “K-anonymity: A model for protecting privacy,” Int. J. Uncertainty, Fuzziness, Knowl.-Based Syst., vol. 10, no. 5, pp. 557–570, Oct. 2002.

[57] R. Venkatesan, S.-M. Koon, M. H. Jakubowski, and P. Moulin, “Robust image hashing,” in Proc. Int. Conf. Image Process., vol. 3, 2000, pp. 664–666.

[58] F. Ahmed and M. Y. Siyal, “A secure and robust wavelet-based hashing scheme for image authentication,” in Proc. 13th Int. Conf. Multimedia Modeling (MMM), vol. 4352, 2007, pp. 51–62.

[59] C. Strecha, A. M. Bronstein, M. M. Bronstein, and P. Fua, “LDAHash: Improved matching with smaller descriptors,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 1, pp. 66–78, Jan. 2012.

Li Weng received the Ph.D. degree in electrical engineering from Katholieke Universiteit Leuven, Leuven, Belgium, in 2012. He worked on encryption, authentication, and hash algorithms for multimedia data. He then spent 10 months at the University of Geneva, Geneva, Switzerland, working on privacy protection schemes for CBIR systems. He is currently a Post-Doctoral Researcher with Inria Rennes-Bretagne Atlantique, Rennes, France. His research interests include signal processing, machine learning, multimedia, and security.

Laurent Amsaleg received the Ph.D. degree from the University of Paris 6, Paris, France, in 1995. He worked on relational and object-oriented databases, garbage collection, microkernels, and single-level stores. He then spent 18 months with the Database Group, University of Maryland at College Park, College Park, MD, USA, designing flexible database query execution strategies (query scrambling). Subsequently, he received a full-time research position at the National Centre for Scientific Research, Paris, and joined the IRISA Laboratory, Rennes, France. His research focuses on very large scale high-dimensional indexing, and the security and privacy dimensions of multimedia content.

April Morton received the master’s degree in applied mathematics from California State Polytechnic University at Pomona, Pomona, CA, USA, in 2013. During her master’s degree, she developed and validated statistical models at the Jet Propulsion Laboratory, Pasadena, CA, USA, and the Oak Ridge National Laboratory, Oak Ridge, TN, USA. She then spent one year at the Viper Group, Computer Vision and Multimedia Laboratory, University of Geneva, Geneva, Switzerland, working on research issues related to machine learning, information retrieval, and pattern recognition. She is currently with the Oak Ridge National Laboratory, where her research interests include statistical modeling and machine learning as they relate to the population dynamics and spatiotemporal modeling fields.

Stéphane Marchand-Maillet received the Ph.D. degree in applied mathematics from Imperial College London, London, U.K., in 1997. He then joined the Institut Eurecom, Sophia-Antipolis, France, where he worked on automated indexing techniques based on human face localization and recognition. Since 1999, he has been an Assistant Professor with the Computer Vision and Multimedia Laboratory, University of Geneva, Geneva, Switzerland, where he is leading the Viper Group. He and his group are addressing machine learning, information retrieval, and pattern recognition research issues in an open large-scale context. He has authored several publications on information retrieval, machine learning, and pattern recognition, including a book on low-level image analysis.