

Southern Data Science Conference poster: Privacy-Preserving Decision Tree Learning (STruex_Poster_PPDT.pdf, 2018-04-10)

Stacey Truex, [email protected]

Ling Liu, [email protected]

Privacy-Preserving Decision Tree Learning:

Evaluation to Collaboration

Risks of Reverse Engineering

Trade-Offs in State-of-the-Art Frameworks

Future Work to Address Framework Pitfalls [25]

School of Computer Science

Collaborative Learning

The decision tree is one of the most fundamental inductive learning models. Applications include healthcare cost prediction [1], disease diagnosis [2] [3], computer network analysis [4], and credit risk assessment [5] [6].

Additionally, common ensemble models leverage decision trees (e.g., random forests), making them an intuitive choice for collaborative learning.

1 Sushmita, Shanu, et al. "Population cost prediction on public healthcare datasets." Proceedings of the 5th International Conference on Digital Health 2015. ACM, 2015.
2 Azar, Ahmad Taher, and Shereen M. El-Metwally. "Decision tree classifiers for automated medical diagnosis." Neural Computing and Applications 23.7-8 (2013): 2387-2403.
3 Singh, Anima, and John V. Guttag. "A comparison of non-symmetric entropy-based classification trees and support vector machine for cardiovascular risk stratification." Engineering in Medicine and Biology Society, EMBC, 2011 Annual International Conference of the IEEE. IEEE, 2011.
4 Antonakakis, Manos, et al. "From Throw-Away Traffic to Bots: Detecting the Rise of DGA-Based Malware." USENIX Security Symposium. Vol. 12. 2012.
5 Kim, Soo Y., and Arun Upneja. "Predicting restaurant financial distress using decision tree and AdaBoosted decision tree models." Economic Modelling 36 (2014): 354-362.
6 Koh, Hian Chye, Wei Chin Tan, and Chwee Peng Goh. "A two-step method to construct credit scoring models with data mining techniques." International Journal of Business and Information 1.1 (2015).
7 Lindell, Yehuda, and Benny Pinkas. "An efficient protocol for secure two-party computation in the presence of malicious adversaries." Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer Berlin Heidelberg, 2007.
8 Beaver, Donald. "Commodity-based cryptography." Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. ACM, 1997.
9 Paillier, Pascal. "Public-key cryptosystems based on composite degree residuosity classes." Eurocrypt. Vol. 99. 1999.
10 Cramer, Ronald, Rosario Gennaro, and Berry Schoenmakers. "A secure and optimally efficient multi-authority election scheme." Transactions on Emerging Telecommunications Technologies 8.5 (1997): 481-490.
11 Rabin, Michael O. "How to exchange secrets with oblivious transfer." IACR Cryptology ePrint Archive 2005 (2005): 187.
12 Yao, Andrew Chi-Chih. "How to generate and exchange secrets." 27th Annual Symposium on Foundations of Computer Science. IEEE, 1986.
13 Shamir, Adi. "How to share a secret." Communications of the ACM 22.11 (1979): 612-613.
14 Lindell, Yehuda, and Benny Pinkas. "Privacy preserving data mining." Advances in Cryptology—CRYPTO 2000. Springer Berlin/Heidelberg, 2000.
15 de Hoogh, Sebastiaan, et al. "Practical secure decision tree learning in a teletreatment application." International Conference on Financial Cryptography and Data Security. Springer, Berlin, Heidelberg, 2014.
16 Wu, David J., et al. "Privately evaluating decision trees and random forests." Proceedings on Privacy Enhancing Technologies 2016.4 (2016): 335-355.
17 De Cock, Martine, et al. "Efficient and private scoring of decision trees, support vector machines and logistic regression models based on pre-computation." IEEE Transactions on Dependable and Secure Computing (2017).
18 Ohrimenko, Olga, et al. "Oblivious multi-party machine learning on trusted processors." USENIX Security Symposium. 2016.
19 Dwork, Cynthia. "Differential privacy." Encyclopedia of Cryptography and Security. Springer US, 2011. 338-340.
20 Blum, Avrim, et al. "Practical privacy: the SuLQ framework." Proceedings of the Twenty-Fourth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, 2005.
21 Friedman, Arik, and Assaf Schuster. "Data mining with differential privacy." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
22 Rana, Santu, Sunil Kumar Gupta, and Svetha Venkatesh. "Differentially private random forest with high utility." 2015 IEEE International Conference on Data Mining (ICDM). IEEE, 2015.
23 Jagannathan, Geetha, Krishnan Pillaipakkamnatt, and Rebecca N. Wright. "A practical differentially private random decision tree classifier." 2009 IEEE International Conference on Data Mining Workshops (ICDMW'09). IEEE, 2009.
24 Vaidya, Jaideep, et al. "A random decision tree framework for privacy-preserving data mining." IEEE Transactions on Dependable and Secure Computing 11.5 (2014): 399-411.
25 Truex, Stacey, et al. "Privacy-preserving inductive learning with decision trees." 2017 IEEE International Congress on Big Data (BigData Congress). IEEE, 2017.

[Figure: collaborative learning scenario — a primary care physician's dataset, a hospital's dataset, and an insurance provider's dataset, with the question "HIPAA?" marking the regulatory barrier to pooling them]

Client-Server Model for Evaluation

[Figure: the client submits a patient's private medical data to a black-box decision tree model on the server and receives the patient's sensitive classification result]

What does this mean? Tree T1 is trained on dataset D1; tree T2 is trained on dataset D2, where D2 is any dataset differing from D1 by at most one training example. If no adversary can tell the difference between T1 and T2, then T1 is a differentially private decision tree.
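The informal statement above corresponds to the standard ε-differential-privacy guarantee; a sketch of the formal condition, writing M for the randomized training algorithm:

```latex
\Pr[\,M(D_1) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D_2) \in S\,]
\qquad \text{for every set } S \text{ of possible output trees,}
```

where a smaller ε means the two output distributions are harder to distinguish.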

How? Add noise to D1 before building the tree!

• Train using differentially private queries [12]

• Make each step of the training process differentially private [13]

• Add randomization
- Random forests [14]
- Random decision trees [15]
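A minimal sketch of the noisy-query idea behind these approaches: a count query answered with Laplace noise calibrated to sensitivity 1. The function and data names here are illustrative, not taken from any of the cited frameworks.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Draw Laplace(0, scale) noise via inverse-CDF sampling of a uniform."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon: float) -> float:
    """Answer 'how many records satisfy predicate?' with epsilon-DP.

    A count query has sensitivity 1 (adding or removing one record
    changes the answer by at most 1), so Laplace noise with scale
    1/epsilon suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# A tree learner would use such noisy counts to score candidate splits.
data = [{"age": a} for a in range(200)]
noisy = private_count(data, lambda r: r["age"] >= 100, epsilon=0.5)
```

Each such query spends ε of the privacy budget, which is why training frameworks must carefully apportion the budget across the splits of the tree.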

“The computation should be such that the outputs received by the parties are correctly distributed, and furthermore, that the privacy of each party's input is preserved as much as possible, even in the presence of adversarial behavior.” [16]

General Concept: Parties exchange random-looking messages that can nevertheless be used to compute the decision tree. The messages reveal nothing on their own, yet the parties still obtain the trained model.

How? Building Blocks:
• Commodity-Based Cryptography [17]
• Homomorphic Encryption [18] [19]
• Oblivious Transfer [20]
• Yao's Garbled Circuits [21]
• Shamir's Secret Sharing Scheme [22]
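As one concrete building block, a toy version of Shamir's (t, n) secret sharing [22]: the secret is the constant term of a random degree-(t−1) polynomial over a prime field, and any t shares reconstruct it by Lagrange interpolation at zero. This is an illustrative sketch, not a hardened implementation.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; all arithmetic is mod PRIME

def make_shares(secret: int, t: int, n: int):
    """Split secret into n shares; any t of them reconstruct it."""
    # Random polynomial f with f(0) = secret and degree t - 1.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    def f(x):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        return y
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = make_shares(secret=123456789, t=3, n=5)
recovered = reconstruct(shares[:3])  # any 3 of the 5 shares suffice
```

Fewer than t shares are statistically independent of the secret, which is what lets parties circulate shares as the "random-looking messages" described above.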

Primary Concern: Privacy of the datasets

Leakage Points: (1) Training Process, (2) Tree Structure

Idea: Evaluation as a Service
The service provider holds a predictive ensemble model and charges per query made.

Privacy Concerns:
Server: the models
- are a source of revenue
- encode business knowledge
- encode the underlying, potentially sensitive, training data
Client: the data and the classification result

Differential Privacy [11]

• Training based on Garbled Circuits [23]

• Training based on Shamir’s Secret Sharing [24]

• Evaluation using Homomorphic Encryption [25]

• Evaluation using Commodity-Based Cryptography [26]

• Evaluation using SGX [27]
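To make the homomorphic-encryption approach concrete, here is a textbook Paillier sketch with toy primes — illustrative only and not secure; real evaluation protocols such as [25] use full-size keys and far more machinery. The additive property shown is what lets a server combine a client's encrypted values without ever decrypting them.

```python
import math
import random

# Textbook Paillier with toy primes -- illustrative only, not secure.
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x: int) -> int:
    return (x - 1) // n

mu = pow(L(pow(g, lam, n_sq)), -1, n)  # decryption constant

def encrypt(m: int) -> int:
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c: int) -> int:
    return (L(pow(c, lam, n_sq)) * mu) % n

# Additive homomorphism: multiplying ciphertexts adds plaintexts.
c_sum = (encrypt(41) * encrypt(1)) % n_sq  # decrypts to 42
```

Because the server only ever sees ciphertexts, the client's inputs stay private; the trade-off is the heavy modular arithmetic noted below.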

Framework Trade-Offs

Addressing Accuracy Loss

Trade-Offs:
• Black-Box Access vs. Accuracy Loss
A-priori information + the results of many protocol executions → reverse-engineer private data
OR
Introduce randomness → lose accuracy in the resulting model
• Efficiency Loss vs. Data Access
Multiple data holders need to exchange messages privately → cryptographic operations → efficiency loss
OR
Must assume a single data holder

Acknowledgement
This research has been partially supported by the National Science Foundation under Grants CNS-1115375, NSF 1547102, SaTC 1564097, and an RCN BD Fellowship, provided by the Research Coordination Network (RCN) on Big Data and Smart Cities.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the RCN or National Science Foundation.

[Figure: ensemble learning — classifiers 1 through m are trained on labeled data (training/test splits); their combination is iteratively learned from labeled data; the resulting ensemble model produces final predictions on unlabeled data]
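The ensemble idea in the figure can be sketched as a simple majority vote over independently trained classifiers; the decision stumps below are hypothetical stand-ins for trees trained on different parties' datasets.

```python
from collections import Counter

def majority_vote(classifiers, instance):
    """Combine member predictions by majority vote, as a random forest does."""
    votes = [clf(instance) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical decision stumps standing in for trees trained on
# different parties' datasets.
classifiers = [
    lambda x: "high-risk" if x["age"] > 60 else "low-risk",
    lambda x: "high-risk" if x["bmi"] > 35 else "low-risk",
    lambda x: "high-risk" if x["smoker"] else "low-risk",
]

prediction = majority_vote(classifiers, {"age": 70, "bmi": 24, "smoker": True})
```

Because each member tree can be trained on a single party's data, ensembles like this are a natural fit for the collaborative setting the poster describes.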

Southern Data Science Conference

Secure Multiparty Computation [7] [28] [12] [13] [14] [25] [26] [24]

Heavy Cryptographic Operations! Computationally Expensive Solutions

Loss in Model Accuracy!

[Figure: worked example — a new data instance is evaluated and its predictive result changes under the privacy-preserving protocol]

Use differential privacy techniques to accomplish some of the data exchanges within the secure multiparty computation protocol.

• Use where the addition of noise may not impact accuracy too strongly
• Can improve the computational complexity problem
• Noisy messages instead of cryptographically secure messages

If multiple parties were to engage in a secure multiparty computation protocol to produce a differentially private decision tree model…
• Addresses the single-data-holder assumption in DP solutions
• Addresses the SMC privacy pitfall around black-box leakage
• Increases accuracy through the introduction of more data
- More data → less noise (generally)
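A minimal sketch of what such a hybrid exchange could look like; the party structure and class counts below are hypothetical. Each data holder releases only Laplace-noised per-class counts, and the aggregated noisy counts drive a decision (here, choosing a leaf label) in place of a cryptographically protected exchange.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    """Inverse-CDF sampling of Laplace(0, scale) noise."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_counts(local_counts: dict, epsilon: float) -> dict:
    """Each party perturbs its per-class counts before sharing them."""
    return {label: c + laplace_noise(1.0 / epsilon)
            for label, c in local_counts.items()}

# Hypothetical per-party class counts at one candidate tree node.
parties = [
    {"sick": 400, "healthy": 120},
    {"sick": 350, "healthy": 180},
    {"sick": 500, "healthy": 90},
]

# Aggregate only the noisy messages; no raw counts ever leave a party.
total = {"sick": 0.0, "healthy": 0.0}
for local in parties:
    for label, c in noisy_counts(local, epsilon=1.0).items():
        total[label] += c

leaf_label = max(total, key=total.get)
```

With large pooled counts the noise rarely changes the decision, which is the "more data → less noise" intuition: cheap noisy messages replace expensive cryptographic operations where accuracy can absorb the perturbation.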

Addressing Computational Cost