Privacy Cognizant Information Systems
Rakesh Agrawal
IBM Almaden Research Center
Joint work with Srikant, Kiernan, Xu & Evfimievski

zoo.cs.yale.edu/classes/cs538/readings/papers/ccs03-0.pdf


Thesis

• There is increasing need to build information systems that
  – protect the privacy and ownership of information
  – do not impede the flow of information
• Cross-fertilization of ideas from the security and database research communities can lead to the development of innovative solutions.

Outline

• Motivation
• Privacy Preserving Data Mining
• Privacy Aware Data Management
• Information Sharing Across Private Databases
• Conclusions

Drivers

• Policies and Legislations
  – U.S. and international regulations
  – Legal proceedings against businesses
• Consumer Concerns
  – “Consumer privacy apprehensions continue to plague the Web … these fears will hold back roughly $15 billion in e-Commerce revenue.” Forrester Research, 2001
  – Most consumers are “privacy pragmatists.” Westin Surveys
• Moral Imperative
  – The right to privacy: the most cherished of human freedom -- Warren & Brandeis, 1890

Outline

• Motivation
• Privacy Preserving Data Mining
• Privacy Aware Data Management
• Information Sharing Across Private Databases
• Conclusions

Data Mining and Privacy

• The primary task in data mining:
  – development of models about aggregated data.
• Can we develop accurate models, while protecting the privacy of individual records?

Setting

• Application scenario: A central server interested in building a data mining model using data obtained from a large number of clients, while preserving their privacy
  – Web-commerce, e.g. recommendation service
• Desiderata:
  – Must not slow down the speed of client interaction
  – Must scale to very large number of clients
• During the application phase
  – Ship model to the clients
  – Use oblivious computations

World Today

[Figure: Alice, Bob and Chris each send their true record to the Recommendation Service.]

Records received by the service:
  35   95,000   J.S. Bach   painting   nasa
  45   60,000   B. Spears   baseball   cnn
  42   85,000   B. Marley   camping    microsoft   (Chris)

World Today

[Figure: the same true records reach the Recommendation Service, where a Mining Algorithm uses them to build the Data Mining Model.]

New Order: Randomization to Protect Privacy

[Figure: Alice, Bob and Chris randomize their records before sending them to the Recommendation Service.]

True records (kept by the users):
  35   95,000   J.S. Bach   painting   nasa
  45   60,000   B. Spears   baseball   cnn
  42   85,000   B. Marley   camping    microsoft   (Chris)

Randomized records (received by the service):
  50   65,000   Metallica   painting   nasa
  38   90,000   B. Spears   soccer     fox
  32   55,000   B. Marley   camping    linuxware

• 35 becomes 50 (35+15)
• Per-record randomization without considering other records
• Randomization parameters common across users
• Randomization techniques differ for numeric and categorical data
• Each attribute randomized independently
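The numeric case above can be sketched in a few lines. This is only an illustration: the slide shows a single shift (35+15) and does not name the noise distribution, so the uniform noise, the `spread` parameter and the function name `randomize_age` are assumptions made here.

```python
import random


def randomize_age(true_age, spread=30, rng=random):
    """Return the true value plus noise drawn uniformly from [-spread, +spread].

    The same `spread` is used for every user (common randomization parameters),
    and each record is perturbed without looking at any other record.
    """
    return true_age + rng.uniform(-spread, spread)


rng = random.Random(0)          # fixed seed so the sketch is repeatable
ages = [35, 45, 42]             # true values; these stay with the users
randomized = [randomize_age(a, rng=rng) for a in ages]  # what the server sees
```

Only `randomized` would be sent to the server; categorical attributes would need a different technique, which this sketch does not cover.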

New Order: Randomization to Protect Privacy

[Figure: as on the previous slide; only the randomized records reach the Recommendation Service.]

True values never leave the user!

New Order: Randomization Protects Privacy

[Figure: the randomized records received by the Recommendation Service pass through a Recovery step before the Mining Algorithm builds the Data Mining Model.]

Recovery of distributions, not individual records

Reconstruction Problem (Numeric Data)

• Original values x_1, x_2, ..., x_n
  – from probability distribution X (unknown)
• To hide these values, we use y_1, y_2, ..., y_n
  – from probability distribution Y
• Given
  – x_1+y_1, x_2+y_2, ..., x_n+y_n
  – the probability distribution of Y
• Estimate the probability distribution of X.

Reconstruction Algorithm

f_X^0 := Uniform distribution
j := 0
repeat
    f_X^{j+1}(a) := Bayes' Rule (update given below)
    j := j+1
until (stopping criterion met)

Bayes' Rule update:

$$ f_X^{j+1}(a) := \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y\big((x_i + y_i) - a\big)\, f_X^{j}(a)}{\int_{-\infty}^{\infty} f_Y\big((x_i + y_i) - z\big)\, f_X^{j}(z)\, dz} $$

(R. Agrawal & R. Srikant, SIGMOD 2000)

• Converges to maximum likelihood estimate.
  – D. Agrawal & C.C. Aggarwal, PODS 2001.
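A minimal sketch of the iteration, under stated assumptions: X is restricted to a finite grid of candidate values (so the integral becomes a sum), the noise is uniform on [-10, 10], and a fixed iteration count stands in for the stopping criterion. The names `reconstruct`, `uniform_noise_pdf` and `grid` are illustrative and do not come from the SIGMOD 2000 paper.

```python
import random


def uniform_noise_pdf(d):
    """Density of the assumed noise Y, uniform on [-10, 10]."""
    return 1.0 / 20.0 if -10 <= d <= 10 else 0.0


def reconstruct(w, noise_pdf, grid, iterations=50):
    """Estimate P(X = a) for each a in grid from randomized values w_i = x_i + y_i."""
    m = len(grid)
    fx = [1.0 / m] * m                       # f_X^0 := uniform distribution
    for _ in range(iterations):              # fixed count replaces the stopping test
        new = [0.0] * m
        for wi in w:
            # Bayes' rule: posterior over grid points given one randomized value
            post = [noise_pdf(wi - a) * fx[k] for k, a in enumerate(grid)]
            total = sum(post)
            if total == 0.0:
                continue                     # value not explainable by any grid point
            for k in range(m):
                new[k] += post[k] / total
        s = sum(new)
        fx = [v / s for v in new]            # average of the per-record posteriors
    return fx


rng = random.Random(0)
true_x = [20] * 500 + [60] * 500             # hidden originals, two clusters
w = [x + rng.uniform(-10, 10) for x in true_x]
grid = list(range(0, 90, 10))
estimate = reconstruct(w, uniform_noise_pdf, grid)
```

In this toy run the estimate concentrates near the grid points 20 and 60 even though no individual x_i is recovered, which is the point of reconstructing distributions rather than records.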

Works Well

[Figure: number of people (0 to 1200) vs. age (20 to 60) for the Original, Randomized and Reconstructed distributions.]

Decision Tree Example

Age   Salary   Repeat Visitor?
23    50K      Repeat
17    30K      Repeat
43    40K      Repeat
68    50K      Single
32    70K      Single
20    20K      Repeat

Age < 25?
  Yes: Repeat
  No:  Salary < 50K?
         Yes: Repeat
         No:  Single

Algorithms

• Global
  – Reconstruct for each attribute once at the beginning
• By Class
  – For each attribute, first split by class, then reconstruct separately for each class.
• Local
  – Reconstruct at each node
• See SIGMOD 2000 paper for details.

Experimental Methodology

• Compare accuracy against
  – Original: unperturbed data without randomization.
  – Randomized: perturbed data but without making any corrections for randomization.
• Test data not randomized.
• Synthetic benchmark from [AGI+92].
• Training set of 100,000 records, split equally between the two classes.

Decision Tree Experiments

[Figure: accuracy (50 to 100) on functions Fn 1 to Fn 5 for Original, Randomized and Reconstructed data, at 100% randomization level.]

Accuracy vs. Randomization

[Figure: accuracy (40 to 100) on Fn 3 vs. randomization level (10, 20, 40, 60, 80, 100, 150, 200) for Original, Randomized and Reconstructed data.]

More on Randomization

• Privacy-Preserving Association Rule Mining Over Categorical Data
  – Rizvi & Haritsa [VLDB 02]
  – Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]
• Privacy Breach Control: Probabilistic limits on what one can infer with access to the randomized data as well as mining results
  – Evfimievski, Srikant, Agrawal, & Gehrke [KDD-02]
  – Evfimievski, Gehrke & Srikant [PODS-03]

Related Work: Private Distributed ID3

• How to build a decision-tree classifier on the union of two private databases (Lindell & Pinkas [Crypto 2000])
• Basic Idea:
  – Find attribute with highest information gain privately
  – Independently split on this attribute and recurse
• Selecting the Split Attribute
  – Given v1 known to DB1 and v2 known to DB2, compute (v1 + v2) log (v1 + v2) and output random shares of the answer
  – Given random shares, use Yao's protocol [FOCS 84] to compute information gain.
• Trade-off
  + Accuracy
  – Performance & scaling
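Why the building block is (v1 + v2) log (v1 + v2) can be seen from a short derivation. It assumes the two databases hold different records over the same attributes, so every count over the union T is a sum v1 + v2 of two local counts. The notation T(a_j) for the records with attribute value a_j and T(a_j, c_i) for those that also have class c_i is introduced here for illustration.

```latex
% Conditional entropy of the class given attribute A, over the union T.
\begin{align*}
H_C(T \mid A)
  &= \sum_{j} \frac{|T(a_j)|}{|T|}
     \left( - \sum_{i} \frac{|T(a_j, c_i)|}{|T(a_j)|}
              \log \frac{|T(a_j, c_i)|}{|T(a_j)|} \right) \\
  &= \frac{1}{|T|} \left( - \sum_{j} \sum_{i} |T(a_j, c_i)| \log |T(a_j, c_i)|
       + \sum_{j} |T(a_j)| \log |T(a_j)| \right)
\end{align*}
% Each count is v_1 + v_2 (one part per database), so every term has the
% form (v_1 + v_2) \log (v_1 + v_2).
```

Since |T| is the same for every attribute, picking the attribute with the highest information gain amounts to comparing sums of such terms, which is what the random shares are then fed into.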

Related Work: Purdue Toolkit

• Partitioned databases (horizontally + vertically)
• Secure Building Blocks
• Algorithms (using building blocks):
  – Association rules
  – EM Clustering
• C. Clifton et al. Tools for Privacy Preserving Data Mining. SIGKDD Explorations 2003.

Related Work: Statistical Databases

• Provide statistical information without compromising sensitive information about individuals (AW89, Sho82)
• Techniques
  – Query Restriction
  – Data Perturbation
• Negative Results: cannot give high quality statistics and simultaneously prevent partial disclosure of individual information [AW89]

Summary

• Promising technical direction & results
• Much more needs to be done, e.g.
  – Trade off between the amount of privacy breach and performance
  – Examination of other approaches (e.g. randomization based on swapping)

Outline

• Motivation
• Privacy Preserving Data Mining
• Privacy Aware Data Management
• Information Sharing Across Private Databases
• Conclusions

Hippocratic Databases

• Hippocratic Oath, 8 (circa 400 BC)
  – What I may see or hear in the course of treatment … I will keep to myself.
• What if the database systems were to embrace the Hippocratic Oath?
• Architecture derived from privacy legislations.
  – US (FIPA, 1974), Europe (OECD, 1980), Canada (1995), Australia (2000), Japan (2003)

Agrawal, Kiernan, Srikant & Xu: VLDB 2002.

Architectural Principles

• Purpose Specification: Associate with data the purposes for collection
• Consent: Obtain donor’s consent on the purposes
• Limited Collection: Collect minimum necessary data
• Limited Use: Run only queries that are consistent with the purposes
• Limited Disclosure: Do not release data without donor’s consent
• Limited Retention: Do not retain data beyond necessary
• Accuracy: Keep data accurate and up-to-date
• Safety: Protect against theft and other misappropriations
• Openness: Allow donor access to data about the donor
• Compliance: Verifiable compliance with the above principles

Architecture: Policy

[Diagram: Privacy Policy, Privacy Metadata Creator, Store (Privacy Metadata).]

• Privacy Metadata Creator: Converts privacy policy into privacy metadata tables.
• Privacy Metadata: For each purpose & piece of information (attribute):
  – External recipients
  – Retention period
  – Authorized users
• Different designs possible.
• Principles addressed: Limited Disclosure, Limited Retention

Privacy Policies Table

Purpose           Table      Attribute   External-recipients       Retention   Authorized-users
purchase          customer   name        {delivery, credit-card}   1 month     {shipping, charge}
purchase          customer   email       empty                     1 month     {shipping}
register          customer   name        empty                     3 years     {registration}
register          customer   email       empty                     3 years     {registration}
recommendations   order      book        empty                     10 years    {mining}

Architecture: Data Collection

[Diagram: Data Collection, Privacy Constraint Validator, Audit Info, Store (Privacy Metadata, Audit Trail).]

• Privacy Constraint Validator: Privacy policy compatible with user’s privacy preference? (Consent)
• Audit Info: Audit trail for compliance. (Compliance)

Architecture: Data Collection

[Diagram: as on the previous slide, with Data Accuracy Analyzer and Record Access Control added.]

• Data Accuracy Analyzer: Data cleansing, e.g., errors in address. (Accuracy)
• Record Access Control: Associate set of purposes with each record. (Purpose Specification)

Architecture: Queries

[Diagram: Queries, Attribute Access Control, Record Access Control, Store (Privacy Metadata).]

1. Telemarketing cannot issue query tagged “charge”.
2. Query tagged “telemarketing” cannot see credit card info.
3. Telemarketing query only sees records that include “telemarketing” in set of purposes.

Principles addressed: Safety, Limited Use
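The three checks can be sketched as one query gate. Everything here is a hypothetical in-memory stand-in for the privacy metadata tables: the names `USER_PURPOSES`, `PURPOSE_ATTRIBUTES`, `Record` and `run_query`, and the sample users and attributes, are assumptions for illustration rather than the design from the VLDB 2002 paper.

```python
from dataclasses import dataclass

# Hypothetical metadata: purposes each user may tag a query with,
# and attributes each purpose is allowed to read.
USER_PURPOSES = {"telemarketer": {"telemarketing"}, "billing": {"charge"}}
PURPOSE_ATTRIBUTES = {
    "telemarketing": {"name", "phone"},
    "charge": {"name", "credit_card"},
}


@dataclass
class Record:
    values: dict      # attribute name -> value
    purposes: set     # purposes the donor consented to for this record


def run_query(user, purpose, attributes, records):
    # 1. The user must be authorized to issue queries tagged with this purpose.
    if purpose not in USER_PURPOSES.get(user, set()):
        raise PermissionError(f"{user} may not issue queries tagged {purpose!r}")
    # 2. Attribute access control: the purpose must cover every requested attribute.
    denied = set(attributes) - PURPOSE_ATTRIBUTES.get(purpose, set())
    if denied:
        raise PermissionError(f"{purpose!r} queries cannot see {sorted(denied)}")
    # 3. Record access control: only records whose purposes include the tag.
    return [
        {a: r.values[a] for a in attributes}
        for r in records
        if purpose in r.purposes
    ]


records = [
    Record({"name": "Alice", "phone": "555-0100", "credit_card": "4111"},
           {"charge", "telemarketing"}),
    Record({"name": "Bob", "phone": "555-0101", "credit_card": "4222"},
           {"charge"}),
]
visible = run_query("telemarketer", "telemarketing", ["name", "phone"], records)
```

Here `visible` contains only Alice's name and phone: Bob did not consent to telemarketing, and credit card numbers are outside what a telemarketing query may read.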

Architecture: Queries

[Diagram: as on the previous slide, with Query Intrusion Detector, Audit Info and Audit Trail added.]

• Query Intrusion Detector: e.g., telemarketing query that asks for all phone numbers. (Safety)
• Audit Info / Audit Trail: (Compliance)
  – Compliance
  – Training data for query intrusion detector

Architecture: Other

[Diagram: Other, Data Retention Manager, Data Collection Analyzer, Encryption Support, Store (Privacy Metadata).]

• Data Retention Manager: Delete items in accordance with privacy policy. (Limited Retention)
• Data Collection Analyzer: Analyze queries to identify unnecessary collection, retention & authorizations. (Limited Collection)
• Encryption Support: Additional security for sensitive data. (Safety)
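A retention manager could be as simple as the sketch below. It assumes one retention period per purpose, taken from the Privacy Policies Table with "1 month" approximated as 30 days and a year as 365 days; the names `RETENTION`, `is_expired` and `purge` are illustrative only.

```python
from datetime import date, timedelta

# Retention periods per purpose (approximate day counts are an assumption).
RETENTION = {
    "purchase": timedelta(days=30),
    "register": timedelta(days=3 * 365),
    "recommendations": timedelta(days=10 * 365),
}


def is_expired(item, today):
    """True when the item has been kept longer than its purpose allows."""
    return today - item["collected"] > RETENTION[item["purpose"]]


def purge(items, today):
    """Return only the items that may still be retained."""
    return [item for item in items if not is_expired(item, today)]


items = [
    {"purpose": "purchase", "attribute": "email", "collected": date(2003, 1, 1)},
    {"purpose": "register", "attribute": "email", "collected": date(2003, 1, 1)},
]
kept = purge(items, today=date(2003, 6, 1))
```

After five months the purchase copy of the email is gone while the registration copy stays, which is the per-purpose behaviour the policy table implies.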

Architecture

[Diagram: the complete architecture, combining the components from the previous slides.]

• Privacy Policy: Privacy Metadata Creator
• Data Collection: Privacy Constraint Validator, Data Accuracy Analyzer, Audit Info
• Queries: Attribute Access Control, Record Access Control, Query Intrusion Detector, Audit Info
• Other: Data Retention Manager, Data Collection Analyzer, Encryption Support
• Store: Privacy Metadata, Audit Trail

Related Work: Statistical & Secure Databases

• Statistical Databases
  – Provide statistical information (sum, count, etc.) without compromising sensitive information about individuals, [AW89]
• Multilevel Secure Databases
  – Multilevel relations, e.g., records tagged “secret”, “confidential”, or “unclassified”, e.g. [JS91]
• Need to protect privacy in transactional databases that support daily operations.
  – Cannot restrict queries to statistical queries.
  – Cannot tag all the records “top secret”.

Some Interesting Problems

• Privacy enforcement requires cell-level decisions (which may be different for different queries)
  – How to minimize the cost of privacy checking?
• Encryption to avoid data theft
  – How to index encrypted data for range queries?
• Intrusive queries from authorized users
  – Query intrusion detection?
• Identifying unnecessary data collection
  – Assets info needed only if salary is below a threshold
  – Queries only ask “Salary > threshold” for rent application
• Forgetting data after the purpose is fulfilled
  – Databases designed not to lose data
  – Interaction with compliance

Solutions must scale to database-size problems!

Outline

• Motivation
• Privacy Preserving Data Mining
• Privacy Aware Data Management
• Information Sharing Across Private Databases
• Conclusions

Today’s Information Sharing Systems

[Figure: Federated (with a Mediator) and Centralized architectures, each answering a query Q with a result R.]

Assumption: Information in each database can be freely shared.

Minimal Necessary Information Sharing

• Compute queries across databases so that no more information than necessary is revealed (without using a trusted third party).
• Need is driven by several trends:
  – End-to-end integration of information systems across companies.
  – Simultaneously compete and cooperate.
  – Security: need-to-know information sharing

Agrawal, Evfimievski & Srikant: SIGMOD 2003.

Selective Document Sharing

[Figure: R holds a Shopping List; S holds a Technology List.]

• R is shopping for technology.
• S has intellectual property it may want to license.
• First find the specific technologies where there is a match, and then reveal further information about those.

Example 2: Govt. agencies sharing information on a need-to-know basis.

Medical Research

[Figure: Mayo Clinic, with separate DNA Sequences and Drug Reactions databases.]

• Validate hypothesis between adverse reaction to a drug and a specific DNA sequence.
• Researchers should not learn anything beyond 4 counts:

                     Adverse Reaction   No Adv. Reaction
Sequence Present     ?                  ?
Sequence Absent      ?                  ?

Minimal Necessary Sharing

Example with R = {a, u, v, x} and S = {b, u, v, y}:

• R ∩ S = {u, v}
  – R must not know that S has b & y
  – S must not know that R has a & x
• Count (R ∩ S) = 2
  – R & S do not learn anything except that the result is 2.

Problem Statement: Minimal Sharing

• Given:
  – Two parties (honest-but-curious): R (receiver) and S (sender)
  – Query Q spanning the tables R and S
  – Additional (pre-specified) categories of information I
• Compute the answer to Q and return it to R without revealing any additional information to either party, except for the information contained in I
  – For intersection, intersection size & equijoin, I = { |R| , |S| }
  – For equijoin size, I also includes the distribution of duplicates & some subset of information in R ∩ S

Page 46

A Possible Approach

Secure Multi-Party Computation
– Given two parties with inputs x and y, compute f(x,y) such that the parties learn only f(x,y) and nothing else.
– Can be solved by building a combinatorial circuit, and simulating that circuit [Yao86].

Prohibitive cost for database-size problems.
– Intersection of two relations of a million records each would require 144 days

Page 47

Intersection Protocol: Intuition

Want to encrypt the value in R and S and compare the encrypted values.
However, want an encryption function such that it can only be jointly computed by R and S, not separately.

Page 48

Commutative Encryption

Commutative encryption F is a computable function f : Key F × Dom F -> Dom F, satisfying:
– For all e, e' ∈ Key F, f_e ∘ f_e' = f_e' ∘ f_e
  (The result of encryption with two different keys is the same, irrespective of the order of encryption)
– Each f_e is a bijection.
  (Two different values will have different encrypted values)
– The distribution of <x, f_e(x), y, f_e(y)> is indistinguishable from the distribution of <x, f_e(x), y, z>; x, y, z ∈_r Dom F and e ∈_r Key F.
  (Given a value x and its encryption f_e(x), for a new value y, we cannot distinguish between f_e(y) and a random value z. Thus we cannot encrypt y nor decrypt f_e(y).)

Page 49

Example Commutative Encryption

f_e(x) = x^e mod p, where
– p: safe prime number, i.e., both p and q = (p-1)/2 are primes
– encryption key e ∈ {1, 2, …, q-1}
– Dom F: all quadratic residues modulo p

Commutativity: powers commute
(x^d mod p)^e mod p = x^(de) mod p = (x^e mod p)^d mod p

Indistinguishability follows from Decisional Diffie-Hellman Hypothesis (DDH)
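The power function above is small enough to check by brute force. The sketch below uses the toy safe prime p = 23 (q = 11), which is an assumption chosen for readability and is far too small for any real use; the helper names `quadratic_residues`, `keygen`, and `f` are hypothetical. It checks the first two properties (commutativity and bijection) exhaustively over Dom F. Indistinguishability is a computational assumption and cannot be verified this way.

```python
import secrets

# Toy safe prime: P = 2*Q + 1 with Q = 11 also prime. Illustration only.
P = 23
Q = (P - 1) // 2

def quadratic_residues(p):
    """Dom F: the nonzero squares modulo p."""
    return sorted({pow(x, 2, p) for x in range(1, p)})

def keygen(q=Q):
    """Pick an encryption key e uniformly from {1, ..., q-1}."""
    return secrets.randbelow(q - 1) + 1

def f(e, x, p=P):
    """f_e(x) = x^e mod p."""
    return pow(x, e, p)

dom = quadratic_residues(P)
d, e = keygen(), keygen()

# Commutativity: encrypting with d then e equals encrypting with e then d.
commutes = all(f(e, f(d, x)) == f(d, f(e, x)) for x in dom)

# Bijection: f_e maps the set of quadratic residues onto itself.
is_bijection = sorted(f(e, x) for x in dom) == dom

print(len(dom), commutes, is_bijection)
```

The bijection holds because the quadratic residues form a group of prime order q, so every exponent in {1, …, q-1} is invertible modulo q.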

Page 50

Intersection Protocol

R holds secret key r; S holds secret key s.

S: f_s(S)
– Shorthand for { f_s(x) | x ∈ S }
– We apply f_s on h(S), where h is a hash function, not directly on S.
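One reason to hash first is that f_s is only defined on Dom F, so arbitrary values must be mapped into the quadratic residues before encryption. A minimal sketch of that step follows, assuming SHA-256 followed by squaring as the hash h; the slides do not specify h, and the toy safe prime 2039, the fixed key, and the helper names are all hypothetical.

```python
import hashlib

P = 2039            # toy safe prime: Q = 1019 is also prime. Illustration only.
Q = (P - 1) // 2

def h(value):
    """Hash a value to a quadratic residue modulo P (an element of Dom F)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    x = 1 + int.from_bytes(digest, "big") % (P - 1)   # x lies in 1..P-1
    return pow(x, 2, P)                                # every nonzero square is a residue

def f(key, x):
    """f_key(x) = x^key mod P."""
    return pow(x, key, P)

def encrypt_set(key, values):
    """f_key(S): shorthand for { f_key(h(x)) | x in S }."""
    return {f(key, h(x)) for x in values}

S = {"b", "u", "v", "y"}
s = 7               # fixed key, for illustration only
fs_S = encrypt_set(s, S)

# Euler's criterion: z is a quadratic residue mod P iff z^Q mod P == 1.
in_domain = all(pow(h(x), Q, P) == 1 for x in S)
stays_in_domain = all(pow(z, Q, P) == 1 for z in fs_S)
print(in_domain, stays_in_domain)
```

With a modulus this small, distinct values can collide under h; a realistic modulus makes that negligible.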

Page 51

Intersection Protocol

S -> R: f_s(S)
R: f_r(f_s(S)) = f_s(f_r(S))   (commutative property)

Page 52

Intersection Protocol

R -> S: f_r(R)
S -> R: <y, f_s(y)> for y ∈ f_r(R)
R: <x, f_s(f_r(x))> for x ∈ R, since R knows <x, y = f_r(x)>
R compares each f_s(f_r(x)) against f_s(f_r(S)); the matches identify the values x ∈ R ∩ S.
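The message flow of the intersection protocol can be simulated in a single process. The sketch below is an illustration under several assumptions that are not in the slides: both parties run inside one function, h is SHA-256 followed by squaring, and the modulus is a toy safe prime found by a trial-division search. All function names are hypothetical.

```python
import hashlib
import secrets
from math import isqrt

def is_prime(n):
    """Trial division; adequate only for the toy-sized modulus used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    return all(n % k for k in range(3, isqrt(n) + 1, 2))

def next_safe_prime(start):
    """Smallest p >= start such that p and (p-1)/2 are both prime."""
    p = max(start, 5)
    while not (is_prime(p) and is_prime((p - 1) // 2)):
        p += 1
    return p

P = next_safe_prime(10**9)   # toy modulus; real use needs a far larger prime
Q = (P - 1) // 2

def h(value):
    """Hash a value to a quadratic residue modulo P."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return pow(1 + int.from_bytes(digest, "big") % (P - 1), 2, P)

def f(key, x):
    """f_key(x) = x^key mod P."""
    return pow(x, key, P)

def keygen():
    """Secret key drawn from {1, ..., Q-1}."""
    return secrets.randbelow(Q - 1) + 1

def intersect(R, S):
    r, s = keygen(), keygen()                 # secret keys of R and S
    fs_S = sorted(f(s, h(v)) for v in S)      # S -> R: f_s(S), reordered
    fs_fr_S = {f(r, z) for z in fs_S}         # R: f_r(f_s(S)) = f_s(f_r(S))
    fr_R = {v: f(r, h(v)) for v in R}         # R keeps <x, y = f_r(x)>
    sent = sorted(fr_R.values())              # R -> S: f_r(R)
    pairs = {y: f(s, y) for y in sent}        # S -> R: <y, f_s(y)>
    # R: x is in the intersection when f_s(f_r(x)) appears in f_s(f_r(S)).
    return {v for v, y in fr_R.items() if pairs[y] in fs_fr_S}

result = intersect({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
print(sorted(result))
```

R ends up with the common values, while everything it receives about the rest of S is encrypted under the key s that R does not hold.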

Page 53

Intersection Size Protocol

R -> S: f_r(R)
S -> R: f_s(S)
R: f_r(f_s(S))
S -> R: f_s(f_r(R)) = f_r(f_s(R)), not <y, f_s(y)> for y ∈ f_r(R)
R cannot map z ∈ f_r(f_s(R)) back to x ∈ R.
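The size variant differs from the intersection protocol in one step: S returns the doubly encrypted set without the pairing, so R can count matches but cannot tell which of its values matched. A single-process sketch under the same kind of assumptions (SHA-256 plus squaring as h, a toy safe prime found by search, hypothetical function names):

```python
import hashlib
import secrets
from math import isqrt

def is_prime(n):
    """Trial division; adequate only for the toy-sized modulus used here."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    return all(n % k for k in range(3, isqrt(n) + 1, 2))

def next_safe_prime(start):
    """Smallest p >= start such that p and (p-1)/2 are both prime."""
    p = max(start, 5)
    while not (is_prime(p) and is_prime((p - 1) // 2)):
        p += 1
    return p

P = next_safe_prime(10**9)   # toy modulus; real use needs a far larger prime
Q = (P - 1) // 2

def h(value):
    """Hash a value to a quadratic residue modulo P."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return pow(1 + int.from_bytes(digest, "big") % (P - 1), 2, P)

def f(key, x):
    """f_key(x) = x^key mod P."""
    return pow(x, key, P)

def keygen():
    """Secret key drawn from {1, ..., Q-1}."""
    return secrets.randbelow(Q - 1) + 1

def intersection_size(R, S):
    r, s = keygen(), keygen()                   # secret keys of R and S
    fr_R = sorted(f(r, h(v)) for v in R)        # R -> S: f_r(R)
    fs_S = sorted(f(s, h(v)) for v in S)        # S -> R: f_s(S)
    # S -> R: f_s(f_r(R)) as a re-sorted set, so positions no longer
    # line up with the values y that R sent.
    fs_fr_R = sorted(f(s, y) for y in fr_R)
    fr_fs_S = {f(r, z) for z in fs_S}           # R: f_r(f_s(S))
    return len(fr_fs_S.intersection(fs_fr_R))   # R learns only the count

size = intersection_size({"a", "u", "v", "x"}, {"b", "u", "v", "y"})
print(size)
```

On the sets from the Minimal Necessary Sharing example, the count is the only thing either party learns beyond |R| and |S|.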

Page 54

Equi Join and Join Size

See Sigmod03 paper
Also gives the cost analysis of protocols

Page 55

Related Work

[NP99]: Protocols for list intersection problem
– Oblivious evaluation of n polynomials of degree n each.
– Oblivious evaluation of n^2 polynomials.

[HFH99]: find people with common preferences, without revealing the preferences.
– Intersection protocols are similar to ours, but do not provide proofs of security.

Page 56

Challenges

Models of minimal disclosure and corresponding protocols for
– other database operations
– combination of operations

Faster protocols
Tradeoff between efficiency and
– the additional information disclosed
– approximation

Page 57

Closing Thoughts

Solutions to complex problems such as privacy require a mix of legislation, societal norms, market forces & technology
By advancing technology, we can change the mix and improve the overall quality of the solution
Gold mine of challenging research problems (besides being useful)!

Page 58

References
http://www.almaden.ibm.com/software/quest/

M. Bawa, R. Bayardo, R. Agrawal. Privacy-preserving indexing of Documents on the Network. 29th Int'l Conf. on Very Large Databases (VLDB), Berlin, Sept. 2003.
R. Agrawal, A. Evfimievski, R. Srikant. Information Sharing Across Private Databases. ACM Int'l Conf. on Management of Data (SIGMOD), San Diego, California, June 2003.
A. Evfimievski, J. Gehrke, R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS, San Diego, California, June 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. An XPath Based Preference Language for P3P. 12th Int'l World Wide Web Conf. (WWW), Budapest, Hungary, May 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Implementing P3P Using Database Technology. 19th Int'l Conf. on Data Engineering (ICDE), Bangalore, India, March 2003.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Server Centric P3P. W3C Workshop on the Future of P3P, Dulles, Virginia, Nov. 2002.
R. Agrawal, J. Kiernan, R. Srikant, Y. Xu. Hippocratic Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002.
R. Agrawal, J. Kiernan. Watermarking Relational Databases. 28th Int'l Conf. on Very Large Databases (VLDB), Hong Kong, August 2002. Expanded version in VLDB Journal 2003.
A. Evfimievski, R. Srikant, R. Agrawal, J. Gehrke. Mining Association Rules Over Privacy Preserving Data. 8th Int'l Conf. on Knowledge Discovery in Databases and Data Mining (KDD), Edmonton, Canada, July 2002.
R. Agrawal, R. Srikant. Privacy Preserving Data Mining. ACM Int'l Conf. on Management of Data (SIGMOD), Dallas, Texas, May 2000.