Stacey Truex ([email protected])
Ling Liu ([email protected])
Privacy-Preserving Decision Tree Learning:
Evaluation to Collaboration
• Risks of Reverse Engineering
• Trade-Offs in State-of-the-Art Frameworks
• Future Work to Address Framework Pitfalls
School of Computer Science
Collaborative Learning
Decision trees are among the most fundamental inductive learning models, with applications in healthcare cost prediction [1], disease diagnosis [2] [3], computer network analysis [4], and credit risk assessment [5] [6].
Additionally, common ensemble models built from decision trees (e.g., random forests) make them an intuitive choice for collaborative learning.
[1] Sushmita, Shanu, et al. "Population cost prediction on public healthcare datasets." Proceedings of the 5th International Conference on Digital Health. ACM, 2015.
[2] Azar, Ahmad Taher, and Shereen M. El-Metwally. "Decision tree classifiers for automated medical diagnosis." Neural Computing and Applications 23.7-8 (2013): 2387-2403.
[3] Singh, Anima, and John V. Guttag. "A comparison of non-symmetric entropy-based classification trees and support vector machine for cardiovascular risk stratification." 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). IEEE, 2011.
[4] Antonakakis, Manos, et al. "From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware." USENIX Security Symposium. Vol. 12. 2012.
[5] Kim, Soo Y., and Arun Upneja. "Predicting restaurant financial distress using decision tree and AdaBoosted decision tree models." Economic Modelling 36 (2014): 354-362.
[6] Koh, Hian Chye, Wei Chin Tan, and Chwee Peng Goh. "A two-step method to construct credit scoring models with data mining techniques." International Journal of Business and Information 1.1 (2015).
[7] Lindell, Yehuda, and Benny Pinkas. "An efficient protocol for secure two-party computation in the presence of malicious adversaries." Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer Berlin Heidelberg, 2007.
[8] Beaver, Donald. "Commodity-based cryptography." Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. ACM, 1997.
[9] Paillier, Pascal. "Public-key cryptosystems based on composite degree residuosity classes." EUROCRYPT. Vol. 99. 1999.
[10] Cramer, Ronald, Rosario Gennaro, and Berry Schoenmakers. "A secure and optimally efficient multi-authority election scheme." Transactions on Emerging Telecommunications Technologies 8.5 (1997): 481-490.
[11] Rabin, Michael O. "How to Exchange Secrets with Oblivious Transfer." IACR Cryptology ePrint Archive 2005 (2005): 187.
[12] Yao, Andrew Chi-Chih. "How to generate and exchange secrets." 27th Annual Symposium on Foundations of Computer Science. IEEE, 1986.
[13] Shamir, Adi. "How to share a secret." Communications of the ACM 22.11 (1979): 612-613.
[14] Lindell, Yehuda, and Benny Pinkas. "Privacy preserving data mining." Advances in Cryptology, CRYPTO 2000. Springer Berlin/Heidelberg, 2000.
[15] de Hoogh, Sebastiaan, et al. "Practical secure decision tree learning in a teletreatment application." International Conference on Financial Cryptography and Data Security. Springer, Berlin, Heidelberg, 2014.
[16] Wu, David J., et al. "Privately evaluating decision trees and random forests." Proceedings on Privacy Enhancing Technologies 2016.4 (2016): 335-355.
[17] De Cock, Martine, et al. "Efficient and Private Scoring of Decision Trees, Support Vector Machines and Logistic Regression Models based on Pre-Computation." IEEE Transactions on Dependable and Secure Computing (2017).
[18] Ohrimenko, Olga, et al. "Oblivious Multi-Party Machine Learning on Trusted Processors." USENIX Security Symposium. 2016.
[19] Dwork, Cynthia. "Differential privacy." Encyclopedia of Cryptography and Security. Springer US, 2011. 338-340.
[20] Blum, Avrim, et al. "Practical privacy: the SuLQ framework." Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2005.
[21] Friedman, Arik, and Assaf Schuster. "Data mining with differential privacy." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
[22] Rana, Santu, Sunil Kumar Gupta, and Svetha Venkatesh. "Differentially private random forest with high utility." 2015 IEEE International Conference on Data Mining (ICDM). IEEE, 2015.
[23] Jagannathan, Geetha, Krishnan Pillaipakkamnatt, and Rebecca N. Wright. "A practical differentially private random decision tree classifier." 2009 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE, 2009.
[24] Vaidya, Jaideep, et al. "A random decision tree framework for privacy-preserving data mining." IEEE Transactions on Dependable and Secure Computing 11.5 (2014): 399-411.
[25] Truex, Stacey, et al. "Privacy-Preserving Inductive Learning with Decision Trees." 2017 IEEE International Congress on Big Data (BigData Congress). IEEE, 2017.
• Primary Care Physician's Dataset
• Hospital's Dataset
• Insurance Provider's Dataset
HIPAA?
Patient’s Private Medical Data
Patient’s Sensitive Classification Result
Client-Server Model for Evaluation
Decision Tree Model
Black Box
What does this mean? Tree T1 is trained on dataset D1; tree T2 is trained on dataset D2, where D2 is any dataset differing from D1 by at most one training example. If no adversary can tell the difference between T1 and T2, then T1 is a differentially private decision tree.
How? Add noise to D1 before building the tree!
• Train using differentially private queries [12]
• Make each step of the training process differentially private [13]
• Add randomization:
  - Random forests [14]
  - Random decision trees [15]
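To make the noise-addition idea concrete, here is a minimal Python sketch (an illustration, not code from any cited framework) that perturbs the per-class counts a tree learner would use to score candidate splits. Each count query has sensitivity 1, so adding Laplace noise with scale 1/ε makes each released count ε-differentially private; the function names and the epsilon value are illustrative assumptions.

```python
import random

def laplace(scale):
    # A Laplace(0, scale) sample as the difference of two exponential draws.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def noisy_class_counts(labels, epsilon):
    """Per-class counts with Laplace(1/epsilon) noise added.

    A count query has sensitivity 1 (changing one training example
    changes each count by at most 1), so each noisy count is
    epsilon-differentially private.
    """
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {y: c + laplace(1.0 / epsilon) for y, c in counts.items()}
```

A tree learner would then compute split scores (e.g., information gain) from these noisy counts rather than the exact ones, trading some accuracy for privacy.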
“The computation should be such that the outputs received by the parties are correctly distributed, and furthermore, that the privacy of each party's input is preserved as much as possible, even in the presence of adversarial behavior.” [16]
General Concept: exchange random-looking messages that can nonetheless be used to compute the decision tree. The messages themselves reveal nothing, yet the parties still obtain the trained model.
How? Building Blocks:
• Commodity-Based Cryptography [17]
• Homomorphic Encryption [18] [19]
• Oblivious Transfer [20]
• Yao’s Garbled Circuits [21]
• Shamir’s Secret Sharing Scheme [22]
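To give one of these building blocks a concrete shape, below is a minimal sketch of Shamir's (t, n) threshold secret sharing [22] over a prime field. The field size, function names, and parameters are illustrative choices, not details from any cited framework.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime defining the finite field

def split(secret, n, t):
    """Split `secret` into n shares such that any t of them reconstruct it."""
    # Random polynomial of degree t-1 with constant term = secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

Any subset of fewer than t shares reveals nothing about the secret, which is what lets parties jointly compute on values no single party ever sees.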
Primary Concern: privacy of the datasets
Leakage Points: (1) the training process, (2) the tree structure
Idea: Evaluation as a Service. The service provider holds a predictive ensemble model and charges per query made.
Privacy Concerns:
Server: the models
- a source of revenue
- encode business knowledge
- encode the underlying, potentially sensitive, training data
Client: the data and the classification result
Differential Privacy [11]
• Training based on Garbled Circuits [23]
• Training based on Shamir’s Secret Sharing [24]
• Evaluation using Homomorphic Encryption [25]
• Evaluation using Commodity-Based Cryptography [26]
• Evaluation using SGX [27]
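To illustrate the homomorphic-encryption building block behind private evaluation, here is a toy Paillier [9] implementation. The tiny hard-coded primes and helper names are illustrative assumptions; real deployments use primes of over a thousand bits.

```python
import math
import random

def paillier_keygen(p, q):
    """Toy Paillier key generation from two primes (illustration only)."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    g = n + 1                 # the standard simple choice of generator
    mu = pow(lam, -1, n)      # valid because g = n + 1
    return (n, g), (lam, mu, n)

def encrypt(pub, m):
    n, g = pub
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:          # r must be a unit mod n
        r = random.randrange(1, n)
    return (pow(g, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n      # L(x) = (x - 1) / n, then scale by mu
```

The key property: multiplying two ciphertexts modulo n² yields an encryption of the sum of the plaintexts, so a server can aggregate encrypted comparison results without ever decrypting the client's data.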
Framework Trade-Offs
Addressing Accuracy Loss
Trade-Offs:
• Black-Box Access vs. Accuracy Loss: a-priori information plus the results of many protocol executions can reverse-engineer private data; OR introduce randomness and lose accuracy in the resulting model.
• Efficiency Loss vs. Data Access: multiple data holders need to exchange messages privately, and the required cryptographic operations cause efficiency loss; OR one must assume a single data holder.
Acknowledgement: This research has been partially supported by the National Science Foundation under Grants CNS-1115375, NSF 1547102, SaTC 1564097, and an RCN BD Fellowship, provided by the Research Coordination Network (RCN) on Big Data and Smart Cities.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the RCN or National Science Foundation.
[Figure: ensemble learning. Classifiers 1 through m are trained on labeled data (training/test split); their combination is learned iteratively from the labeled data; the ensemble model then produces final predictions on unlabeled data.]
Southern Data Science Conference
Secure Multiparty Computation
[7] [12] [13] [14] [24] [25] [26] [28]
Heavy Cryptographic Operations! Computationally Expensive Solutions
Loss in Model Accuracy!
[Example figure: a new data instance is classified, and its predictive result becomes the output shown.]
Use differential privacy techniques to accomplish some of the data exchanges within the secure multiparty computation protocol.
• Use where the addition of noise may not impact accuracy too strongly
• Can improve the computational complexity problem
• Noisy messages instead of cryptographically secure messages
If multiple parties were to engage in a secure multiparty computation protocol to produce a differentially private decision tree model…
• Addresses the single-data-holder assumption in DP solutions
• Addresses the SMC privacy pitfall around black-box leakage
• Increases accuracy through the introduction of more data: more data means less noise (generally)
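One minimal sketch of this hybrid idea (an illustrative assumption, not a protocol from the cited work): each data holder releases only Laplace-noised local class counts, and the parties sum these noisy messages to obtain the aggregate statistics needed for split selection, with no heavy cryptographic operations on this exchange.

```python
import random

def laplace(scale):
    # A Laplace(0, scale) sample as the difference of two exponential draws.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def local_noisy_count(local_labels, target, epsilon):
    """Each party releases only a noisy count of `target` in its own data."""
    true_count = sum(1 for y in local_labels if y == target)
    return true_count + laplace(1.0 / epsilon)

def aggregate(parties, target, epsilon):
    """Sum of the parties' noisy messages: the aggregate count a split-selection
    step would use, computed without any cryptographic operations."""
    return sum(local_noisy_count(p, target, epsilon) for p in parties)
```

Because the noise terms of the individual parties add up, more participating data holders mean more total noise per query, but also far more data behind each count, which is why accuracy generally improves with collaboration.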
Black Box
Addressing Computational Cost