SplitX: High-Performance Private Analytics
Ruichuan Chen (Bell Labs / Alcatel-Lucent)
Istemi Ekin Akkus (MPI-SWS)
Paul Francis (MPI-SWS)
Data analytics is important
Evaluate system performance
Understand user behavior
Discover statistical patterns
Data exposure has become a major concern
Third-party trackers
Smartphone apps
User-owned and operated
Data exposure has to be brought under control!
User-owned and operated principle
  Personal data should be stored in a local host under the user’s control.
Motivation and problem
How to make aggregate queries over distributed private user data while still preserving user privacy?
[Figure: several users, each holding their own data, and an analyst]
Outline
Related work
SplitX system
  Key insights
  System design
  Performance comparison
  Implementation & deployment
Conclusion
A general approach
Based on differential privacy.
Differential privacy adds noise to the output of a computation (i.e., a query).
  This hides the presence or absence of a user.
[Figure: user data is collected in a database; a query module adds noise to the results returned to the analyst]
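To make the noise-adding step concrete, the sketch below perturbs a count with Laplace noise, the textbook mechanism for differential privacy. It is not the mechanism SplitX itself uses (there the mixes add random bit-vectors); the function names and the sensitivity of 1 are assumptions made for illustration.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Return a count perturbed for epsilon-differential privacy.

    Assumes that adding or removing one user changes the count by at
    most 1 (sensitivity 1), so the noise scale is 1 / epsilon.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(0)  # fixed seed only so that the example is repeatable
samples = [noisy_count(100, epsilon=1.0, rng=rng) for _ in range(10_000)]
mean = sum(samples) / len(samples)
print(mean)  # close to the true count of 100
```

Any single noisy answer can be off by a few units, which is what hides one user's presence or absence, while the average over many draws stays near the true count.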
Previous systems
Servers aggregate answers without seeing individual user data.
Differentially private noise is added to the aggregate result.
[Figure: users send data to servers, which return the aggregate result to the analyst]
Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
Primary technical problems
Scale poorly
  Require public-key operations or something even more expensive.
  Akkus et al., CCS’12; Chen et al., NSDI’12; Dwork et al., EUROCRYPT’06; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
Suffer from answer pollution
  Even a single malicious user can substantially distort the aggregate result through a single answer.
  Hardt et al., CCS’12; Rastogi et al., SIGMOD’10; Shi et al., NDSS’11
SplitX
A high-performance private analytics system
  2 to 3 orders of magnitude more efficient in bandwidth
  3 to 5 orders of magnitude more efficient in computation
  Resistant to answer pollution
Components & assumptions
Analysts are potentially malicious (violating user privacy).
Clients are user devices. Clients are potentially malicious (distorting the final results).
Servers (1 aggregator and 2 mixes) are honest but curious:
  1) They follow the specified protocol.
  2) They try to exploit additional info that can be learned in so doing.
[Figure: clients holding data, the servers (1 aggregator and 2 mixes), and analysts]
Key insights: XOR encryption
How to achieve high performance?
  Client wants to send M to the aggregator.
  Client splits M, and sends the split messages to the aggregator via the mixes.
  Aggregator joins the split messages to recreate M.
[Figure: the client generates a random string R and sends M ⊕ R through one mix and R through the other; the aggregator XORs the two split messages to recreate M]
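The split and join steps can be sketched in a few lines. This is a minimal illustration, assuming messages are byte strings and that R comes from a cryptographically secure source; the function names are hypothetical and the real client and aggregator code may differ.

```python
import secrets

def split(message: bytes) -> tuple[bytes, bytes]:
    """Split M into two shares (M XOR R, R) using a fresh random R."""
    r = secrets.token_bytes(len(message))
    masked = bytes(m ^ k for m, k in zip(message, r))
    return masked, r

def join(share_a: bytes, share_b: bytes) -> bytes:
    """Recreate M by XORing the two shares byte by byte."""
    return bytes(a ^ b for a, b in zip(share_a, share_b))

message = b"age=30"
to_one_mix, to_other_mix = split(message)  # each share travels via a different mix
recreated = join(to_one_mix, to_other_mix)
print(recreated == message)  # True
```

Each share on its own is uniformly random, so a single mix learns nothing about M, and the only operations involved are random-byte generation and XOR rather than public-key cryptography.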
Key insights: XOR encryption
How to achieve high performance?
For clarity, the figures that follow write just M to denote that the client sends two split messages of M to the aggregator via Mix1 and Mix2.
[Figure: the two-path diagram (client generates R; M ⊕ R and R travel through the two mixes; aggregator recreates M) shown next to its abbreviated form labeled M]
Key insights: query buckets
How to limit answer pollution?
Solution:
  Ensure that a client cannot arbitrarily manipulate answers.
  Divide the answer’s value range into buckets.
  Enforce a binary answer in each bucket.
Key insights: query buckets
Query: “SELECT age FROM splitx”
  4 buckets: 0~19, 20~39, 40~59, and ≥60.
  Answers: a ‘1’ or ‘0’ per bucket.
  30 years old → 0, 1, 0, 0
  Answers are encoded in a bit-vector.
An answer from a malicious client cannot substantially distort the query result!
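The bucket encoding for the age example can be sketched as follows. The function name and the choice to represent buckets by their lower bounds are assumptions made for illustration.

```python
from bisect import bisect_right

# Lower bounds of the four buckets from the slide: 0~19, 20~39, 40~59, >=60.
BUCKET_LOWER_BOUNDS = [0, 20, 40, 60]

def encode_answer(age: int) -> list[int]:
    """Return a bit-vector with a 1 in the bucket containing age, 0 elsewhere."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts the lower bounds that are <= age.
    index = bisect_right(BUCKET_LOWER_BOUNDS, age) - 1
    return [1 if i == index else 0 for i in range(len(BUCKET_LOWER_BOUNDS))]

print(encode_answer(30))  # [0, 1, 0, 0]
```

Because every position holds a single bit, even a client that ignores its real data can shift the count of any bucket by at most one.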
System design
1) Query publish/subscribe
  Analyst publishes its queries
  Client subscribes to an analyst’s queries
2) Query answering
  Client answers queries
  Mixes add differentially private noise
  Mixes shuffle answers
  Aggregator generates query results
1) Query publish/subscribe
[Figure: the analyst publishes Query1, Query2, …; the client subscribes with an analyst ID and receives Query1, Query2, …; the aggregator, Mix1, and Mix2 sit between them]
1) Query publish/subscribe
Query example: age distribution among male users?
  QID: 123
  SQL: SELECT age FROM splitx WHERE gender='male'
  Buckets: 0~19, 20~39, 40~59, and ≥60
  DP parameter (ε): 1.0
  Tend: 11:59:59PM on Aug 16, 2013
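The fields of the example query map naturally onto a small record type. The sketch below is only an illustration: the class and field names are hypothetical, buckets are represented by their lower bounds, `epsilon` stands for the DP parameter, and `t_end` is assumed to be the query's end time.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Query:
    """One published query, mirroring the fields shown on the slide."""
    qid: int
    sql: str
    bucket_lower_bounds: tuple[int, ...]
    epsilon: float
    t_end: datetime

example = Query(
    qid=123,
    sql="SELECT age FROM splitx WHERE gender='male'",
    bucket_lower_bounds=(0, 20, 40, 60),  # 0~19, 20~39, 40~59, >=60
    epsilon=1.0,
    t_end=datetime(2013, 8, 16, 23, 59, 59),
)
print(example.qid, len(example.bucket_lower_bounds))  # 123 4
```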
2) Query answering
  Client answers queries
  Mixes add differentially private noise
  Mixes shuffle answers
  Aggregator generates query results
Step 1: client answers queries
Client executes query over its local data and generates an answer
‘1’ or ‘0’ per bucket
Encoded as a bit-vector
Step 1: client answers queries
Client splits its answer, and sends the split answers with the query ID to the two mixes, respectively.
[Figure: the client sends its split (QID, answer) messages to Mix1 and Mix2]
Mix knows which query a client answered. Privacy violation!
Step 2: mixes add DP noise
Each mix individually adds some random bit-vectors as the differentially private noise.
How many bit-vectors (n) are needed? This depends on c, the number of clients queried, and ε, the DP parameter.
[Figure: Mix1 holds split answers 0100, 1110, … and appends random bit-vectors (0111, …) as noise; Mix2 holds split answers 1101, 1001, … and appends random bit-vectors (0101, …) as noise]
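A minimal sketch of the noise step at one mix is shown below. The function names are hypothetical, and the number of noise vectors `n` is passed in as a parameter because the formula relating it to the number of clients and the DP parameter is not reproduced here.

```python
import secrets

def random_bit_vector(width: int) -> list[int]:
    """Return width independent, uniformly random bits."""
    return [secrets.randbits(1) for _ in range(width)]

def add_noise(split_answers: list[list[int]], n: int) -> list[list[int]]:
    """Return the mix's split answers followed by n random bit-vectors."""
    width = len(split_answers[0])
    return split_answers + [random_bit_vector(width) for _ in range(n)]

# Each mix runs this on its own, without coordinating the random values.
mix1 = add_noise([[0, 1, 0, 0], [1, 1, 1, 0]], n=3)
mix2 = add_noise([[1, 1, 0, 1], [1, 0, 0, 1]], n=3)
print(len(mix1), len(mix2))  # 5 5
```

The XOR of two independent random bits is itself a random bit, so when the aggregator later joins the two lists, each noise pair turns into one random bit-vector that is indistinguishable from a client answer.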
Step 3: mixes shuffle split answers
Each mix maintains c+n split answers.
Mixes shuffle the split answers for each column (i.e., bucket) in a synchronized way.
[Figure: before the shuffle, Mix1 holds 0100, 1110, …, 0111, … and Mix2 holds 1101, 1001, …, 0101, …; after the shuffle, Mix1 holds 1110, 0111, …, 0100, … and Mix2 holds 1101, 1101, …, 0001, …]
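The per-column synchronized shuffle can be sketched as below. How the two mixes actually agree on the permutations is not stated on the slide; this illustration assumes a shared seed, and uses Python's non-cryptographic `random` module only to keep the example short. All names are hypothetical.

```python
import random

def shuffle_columns(rows: list[list[int]], seed: int) -> list[list[int]]:
    """Permute every column independently, with permutations derived from seed."""
    rng = random.Random(seed)
    shuffled_columns = []
    for column in zip(*rows):
        order = list(range(len(column)))
        rng.shuffle(order)  # a different permutation for each column
        shuffled_columns.append([column[i] for i in order])
    return [list(row) for row in zip(*shuffled_columns)]

def joined_column_sums(a: list[list[int]], b: list[list[int]]) -> list[int]:
    """XOR two split-answer arrays position by position and sum each column."""
    return [sum(x ^ y for x, y in zip(col_a, col_b))
            for col_a, col_b in zip(zip(*a), zip(*b))]

mix1 = [[0, 1, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1]]
mix2 = [[1, 1, 0, 1], [1, 0, 0, 1], [0, 1, 0, 1]]

shared_seed = 2013  # hypothetical: both mixes must use the same value
shuffled1 = shuffle_columns(mix1, shared_seed)
shuffled2 = shuffle_columns(mix2, shared_seed)

# Matching bits stay aligned, so the per-bucket counts are unchanged.
print(joined_column_sums(shuffled1, shuffled2) == joined_column_sums(mix1, mix2))  # True
```

Shuffling each column separately breaks the link between the bits of one client's answer, while applying the same permutations at both mixes keeps every pair of split bits aligned for the join.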
Mixes transmit shuffled answers
Each mix transmits the shuffled split answers to the aggregator.
[Figure: Mix1 and Mix2 each send their c+n shuffled split answers to the aggregator]
Step 4: aggregator generates query result
Join each bit position in the two split answer arrays.
Sum up the values for each bucket.
Obtain the noisy count for each bucket.
[Figure: Mix1 (1110, 0111, …, 0100, …) ⊕ Mix2 (1101, 1101, …, 0001, …) = Agg (0011, 1010, …, 0101, …)]
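The join-and-sum step can be sketched with the three bit-vectors that are visible on the slide for each mix. The function name is hypothetical, and only the shown rows are used, so the counts below cover those three rows rather than a full result.

```python
def join_and_count(from_mix1: list[list[int]], from_mix2: list[list[int]]) -> list[int]:
    """XOR the two split-answer arrays position by position, then sum each column."""
    joined = [[a ^ b for a, b in zip(row1, row2)]
              for row1, row2 in zip(from_mix1, from_mix2)]
    return [sum(column) for column in zip(*joined)]

from_mix1 = [[1, 1, 1, 0], [0, 1, 1, 1], [0, 1, 0, 0]]  # 1110, 0111, 0100
from_mix2 = [[1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 0, 1]]  # 1101, 1101, 0001

# Joined rows are 0011, 1010, 0101, matching the "Agg" column on the slide.
print(join_and_count(from_mix1, from_mix2))  # [1, 1, 2, 2]
```

The counts are noisy because the joined rows include the random bit-vectors added by the mixes alongside the real client answers.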
Privacy issue at the mixes
Client splits the answer, and sends the split answers with the query ID to the two mixes.
Mix knows which query a specific client answered!
[Figure: the client sends its split (QID, answer) messages to Mix1 and Mix2]
Solution: double-splitting
[Figure: double-splitting of the (QID, answer) message among the client, Mix1, Mix2, and the aggregator]
Duplicate answer detection
A client can answer a query many times!
How to detect and remove duplicate answers?
Triple-splitting is needed
Section 5 in the paper.
Computational overhead
Three to five orders of magnitude more efficient in computation than previous systems
[Figure: computational overhead of SplitX compared with PDDP [NSDI’12] and Akkus et al. [CCS’12]; “A” is the number of buckets that a client reports]
Implementation
Client side
  Google Chrome extension
  Captures webpages browsed, searches made, extensions installed
Server side (mix + aggregator)
  Web services on Jetty
  RPCs defined in Thrift language
Deployment
Query results from a 416-client deployment
  Most visited websites: google, facebook, youtube
  Most used apps: gmail, youtube, google drive
  91% of clients made ≤50 searches / day
  70% of clients visited >50 webpages / day
  97% of clients visited ≤100 websites / day
Conclusion
SplitX: a high-performance private analytics system
  Orders of magnitude more efficient than previous systems
  Resistant to answer pollution
Key insights
  XOR-based encryption
  Query buckets