Privacy Preserving Data Mining: An approach to safely share and use sensitive medical data
Gerhard Kranner, Viscovery
Biomax Symposium, June 24th, 2016, Munich
© 2016 Viscovery Software GmbH www.viscovery.net
Privacy protection vs knowledge gain
What is Privacy Preserving Data Mining?
Terms and standards
Risks, limits, and issues
Data mining without need of data disclosure
Data abstraction with perceptual maps
Connectome example
Privacy Preserving Data Mining
→ "PPDM is the responsible use of data mining to extract useful knowledge from data without compromising data privacy."
This implies the need to
– Access, explore, and model sensitive data
– Share results and deploy analytical models
But, in doing so, to
– Observe legal and ethical standards
– In particular, preserve data confidentiality
Basic terms
Pseudonymization
– Replace identifying fields within each data record by pseudonyms (artificial codes)
De-identification
– Remove, mask, or generalize identifying information to prevent a person’s identity from being connected with information
Anonymization
– Irreversibly remove the association between an identifying dataset and the data subject
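As an illustration of pseudonymization, the following minimal sketch replaces identifying fields by artificial codes. A keyed hash (HMAC) is only one possible way to generate such codes and is not taken from the slides; the field names, the key, and the "P-" prefix are assumptions made for this example. Note that whoever holds the key can regenerate the codes, which is why pseudonymization is weaker than anonymization.

```python
import hashlib
import hmac

# Hypothetical secret key; it would have to be stored apart from the data.
SECRET_KEY = b"replace-with-a-secret-key"

# Hypothetical set of identifying field names for this illustration.
IDENTIFYING_FIELDS = {"name", "ssn"}


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable artificial code from an identifying value."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P-" + digest[:12]


def pseudonymize(record: dict) -> dict:
    """Replace identifying fields by pseudonyms; keep all other fields."""
    return {
        field: pseudonym(str(value)) if field in IDENTIFYING_FIELDS else value
        for field, value in record.items()
    }


patient = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "COPD"}
coded = pseudonymize(patient)
print(coded)
```

The same input always yields the same code, so records of one person can still be grouped without exposing the identifying value itself.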
Common de-identification methods
Removal of identifiers
– Direct identifiers: name, address, social security number
– Quasi-identifiers: date of birth, ZIP code, sex
– Any links to identifying information
Data and/or output perturbation
– Add non-deterministic noise to attribute values
– Mask, modify, or aggregate values systematically
Generalization (data binning, bucketing)
– Original data values that fall in a given small interval, a bin, are replaced by a value representative of that interval
– Generalize all dates to year: 17th March 1983 → 1983
– Reduce ZIP codes to three digits: D-82152 → 821
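The generalization and perturbation methods listed above can be sketched in a few lines. This is a minimal illustration, not a recommended parameterization: the function names, the bin width, and the Gaussian noise model are assumptions chosen for the example.

```python
import random
from datetime import date


def generalize_date(d: date) -> int:
    """Generalize a full date to its year."""
    return d.year


def generalize_zip(zip_code: str, digits: int = 3) -> str:
    """Keep only the leading digits of a ZIP code."""
    return zip_code[:digits]


def bin_value(x: float, width: float) -> float:
    """Replace a value by the midpoint of the bin it falls into."""
    lower = (x // width) * width
    return lower + width / 2


def perturb(x: float, scale: float, rng: random.Random) -> float:
    """Add zero-mean Gaussian noise to an attribute value."""
    return x + rng.gauss(0.0, scale)


rng = random.Random(0)  # seeded only to make the demo repeatable
print(generalize_date(date(1983, 3, 17)))  # 1983
print(generalize_zip("82152"))             # 821
print(bin_value(37.2, 5.0))                # 37.5
print(perturb(120.0, 2.0, rng))            # 120.0 plus a small random offset
```

Wider bins and larger noise scales give more protection but lose more information, which is the trade-off the later slides return to.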
Example: Two-dimensional binning
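Since the figure for this slide is not reproduced here, a small sketch may help to show what two-dimensional binning does: each record is replaced by the rectangular cell it falls into, and only cell counts are kept. The attribute pair, the sample values, and the bin widths below are made up for illustration.

```python
from collections import Counter

# Hypothetical (age, systolic blood pressure) pairs.
records = [(34, 118), (37, 122), (52, 141), (58, 139), (36, 125)]


def bin_2d(points, width_x, width_y):
    """Count how many points fall into each rectangular bin.

    Bins are keyed by their lower-left corner.
    """
    counts = Counter()
    for x, y in points:
        corner = ((x // width_x) * width_x, (y // width_y) * width_y)
        counts[corner] += 1
    return dict(counts)


grid = bin_2d(records, width_x=10, width_y=20)
print(grid)
```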
The HIPAA “Safe Harbor” Method
HIPAA Privacy Rule, USA, 2003: Provides mechanisms for using and disclosing health data responsibly without the need for patient consent
EITHER apply the Expert Determination Method
OR remove or generalize 18 specific types of data:
(A) Names
(B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code; the first three ZIP digits may be kept only if the geographic unit they form contains more than 20,000 people…
(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89…
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(J) Account numbers
(K) Certificate/license numbers
(L) Vehicle identifiers and serial numbers, including license plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voiceprints
(Q) Full-face photographs and any comparable images
(R) Any other unique identifying number, characteristic, or code
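A rough sketch of how a few of these categories might be applied to a single record is shown below. It covers only a handful of the 18 categories and is not a compliance tool; all field names and the ISO date format are assumptions made for this example.

```python
# Hypothetical field names standing in for some Safe Harbor categories.
DROP_FIELDS = {"name", "phone", "email", "ssn", "medical_record_number"}
DATE_FIELDS = {"birth_date", "admission_date"}  # ISO strings "YYYY-MM-DD"


def safe_harbor_like(record: dict) -> dict:
    """Drop direct identifiers, reduce dates to years, aggregate high ages."""
    very_old = record.get("age", 0) > 89
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # remove direct identifiers entirely
        if field == "birth_date" and very_old:
            continue  # a birth year would reveal an age over 89
        if field in DATE_FIELDS:
            value = value[:4]  # keep the year only
        elif field == "age" and very_old:
            value = "90+"  # aggregate all ages over 89
        out[field] = value
    return out


record = {
    "name": "Jane Doe",
    "age": 93,
    "birth_date": "1923-01-02",
    "admission_date": "2016-06-24",
    "diagnosis": "COPD",
}
print(safe_harbor_like(record))
```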
Usual de-identification process
Source: NISTIR 8053, De-Identification of Personal Information, 2015
Limits and issues
Re-identification risk
– Anonymous data can be cross-referenced with other data sources to re-identify its origin (linkage attack)
– May result in harm to individuals or groups
De-identification is of limited use
– Not robust against advanced re-identification methods
– Impossible in certain cases
– E.g., genetic data cannot be safely anonymized because the huge amount of pattern information in bio-specimens makes it possible to re-identify the donors
→ One cannot be sure whether information is re-identifiable!
Implicit disclosure risk
Attribute disclosure
– An adversary derives sensitive information about a patient from released data in conjunction with other disclosed information
– E.g., all patients in a list have a specific diagnosis
Inferential disclosure
– Information can be inferred with high confidence from statistical properties of released data
– E.g., the income of a data subject can be inferred from the (publicly available) purchase price of a home
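The attribute disclosure case (all patients in a list share one diagnosis) can be checked mechanically before release. The sketch below groups a table by its quasi-identifiers and reports groups in which the sensitive attribute takes a single value; the table, the field names, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical released table: quasi-identifiers plus a sensitive attribute.
released = [
    {"birth_year": 1983, "zip3": "821", "diagnosis": "asthma"},
    {"birth_year": 1983, "zip3": "821", "diagnosis": "asthma"},
    {"birth_year": 1975, "zip3": "803", "diagnosis": "COPD"},
    {"birth_year": 1975, "zip3": "803", "diagnosis": "diabetes"},
]


def homogeneous_groups(rows, quasi_ids, sensitive):
    """Return groups whose members all share one sensitive value."""
    values = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        values[key].add(row[sensitive])
    return {key: next(iter(v)) for key, v in values.items() if len(v) == 1}


leaks = homogeneous_groups(released, ["birth_year", "zip3"], "diagnosis")
print(leaks)  # {(1983, '821'): 'asthma'}
```

Anyone known to belong to the group (1983, 821) is revealed to have asthma, even though no individual record was re-identified.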
Linkage attacks
Link records in datasets based on similarity between subsets of attributes
A combination of attributes makes it possible to distinguish records in each dataset (fingerprint information)
Use machine learning for pattern matching
→ Can link the identity of data subjects in a (released or public) dataset with confidential information contained in another dataset
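A linkage attack in its simplest form is a join on shared quasi-identifiers. The sketch below uses exact matching on made-up data; real attacks, as noted above, rely on similarity measures and machine learning rather than exact equality, so this is a deliberate simplification. All names and values are hypothetical.

```python
# Hypothetical de-identified medical release (no names).
medical = [
    {"birth_year": 1983, "zip3": "821", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1975, "zip3": "803", "sex": "M", "diagnosis": "COPD"},
    {"birth_year": 1975, "zip3": "803", "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public register that does contain names.
public = [
    {"name": "Jane Doe", "birth_year": 1983, "zip3": "821", "sex": "F"},
    {"name": "John Roe", "birth_year": 1975, "zip3": "803", "sex": "M"},
]

KEYS = ("birth_year", "zip3", "sex")


def link(released, identified, keys=KEYS):
    """Match named records to released records sharing the same key values.

    Only unambiguous matches (exactly one candidate) are returned.
    """
    index = {}
    for row in released:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    matches = {}
    for person in identified:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:
            matches[person["name"]] = candidates[0]["diagnosis"]
    return matches


print(link(medical, public))  # {'Jane Doe': 'asthma'}
```

John Roe is not re-identified here only because two released records share his fingerprint.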
Linkage examples for re-identification
Movie ratings
– Dataset 1: 500,000 training records containing customer ratings of movies (1 to 5 stars) published by Netflix
– Dataset 2: Ratings of (personally) registered users at IMDb
– With only eight movie ratings and dates, 96% of released Netflix subscribers can be uniquely identified
Medical tests
– Only four consecutive laboratory test results of CHEM-7 (creatinine) uniquely distinguished 89.9% of test subjects in a sample of 61,280 patients
Credit card transactions
– Four distinct points in space and time were sufficient to uniquely identify 90% of the individuals in a sample of 1.1 million people
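Figures like these measure how often a handful of known data points singles out one individual. The toy sketch below computes such a uniqueness fraction by averaging over every possible choice of k points per person. The traces are invented and the procedure is a simplified stand-in, not the method used in the cited studies.

```python
from itertools import combinations

# Hypothetical traces: person id -> set of (place, hour) points.
traces = {
    "u1": {("shop_a", 9), ("cafe_b", 12), ("gym_c", 18)},
    "u2": {("shop_a", 9), ("cafe_b", 12), ("bar_d", 21)},
    "u3": {("shop_a", 9), ("mall_e", 15), ("bar_d", 21)},
}


def is_unique(person, known_points, all_traces):
    """True if only `person`'s trace contains all known points."""
    holders = [p for p, trace in all_traces.items() if known_points <= trace]
    return holders == [person]


def unicity(all_traces, k):
    """Average chance that k points drawn from a trace identify its owner."""
    total = 0.0
    for person, trace in all_traces.items():
        subsets = list(combinations(sorted(trace), k))
        hits = sum(is_unique(person, set(s), all_traces) for s in subsets)
        total += hits / len(subsets)
    return total / len(all_traces)


for k in (1, 2, 3):
    print(k, round(unicity(traces, k), 3))
```

Even in this tiny example the fraction rises quickly with k, which mirrors the pattern reported in the examples above.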
Conclusion
De-identification should be applied
– Removal of direct identifiers is essential
– Must conform to legal regulations
However, even complete anonymization
– Only reduces matching accuracy
– Does not prevent re-identification
→ Traditional de-identification is not sufficient to ensure privacy, yet it is detrimental to data mining!
Consequences
Need comprehensive strategies (Release Models) for the use of confidential data and results
– Observe data privacy
– Limit risk of re-identification
– Minimize information loss
Need technologies that support these strategies
– Level of disclosed information under control of application
– Ideal application: Provides complete conceptual information without disclosing original data
Release Models
Data Use Agreement (DUA) model
– Make de-identified data available under a legally binding data use agreement
Conceptual model
– Provide access only to aggregate data while prohibiting access to records containing data on an individual
Enclave model
– Keep data in a kind of segregated enclave that restricts export of original data; instead, accept queries from qualified users, run the queries on the data, and respond with results
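The enclave model can be pictured as an object that never hands out records and answers only aggregate queries from qualified users. The following is a minimal sketch under that reading; the class name, the single `mean` query, and the minimum group size rule are assumptions added for illustration and do not come from the slides.

```python
class Enclave:
    """Holds records privately and answers only aggregate queries."""

    def __init__(self, records, authorized_users, min_group_size=5):
        self._records = list(records)
        self._authorized = set(authorized_users)
        self._min_group_size = min_group_size  # assumed safeguard

    def mean(self, user, field, predicate=lambda r: True):
        """Run a mean query inside the enclave and return only the result."""
        if user not in self._authorized:
            raise PermissionError(f"{user} is not a qualified user")
        values = [r[field] for r in self._records if predicate(r)]
        if len(values) < self._min_group_size:
            raise ValueError("group too small to release an aggregate")
        return sum(values) / len(values)


# Hypothetical data: six patients with an age and a test score.
patients = [{"age": 40 + i, "score": 10 * i} for i in range(6)]
enclave = Enclave(patients, authorized_users={"alice"})
print(enclave.mean("alice", "score"))  # 25.0
```

Refusing aggregates over very small groups is one common way to keep a query result from pointing at an individual.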
Role and purpose based access control
Source: Indumathi, InTech, 2012, http://dx.doi.org/10.5772/49982
PPDM by decoupling models from data
Represent original data in perceptual map
– Generates abstraction that directly shows data distribution
– Data statistics contained in microcluster ensemble
Perform data mining on the map
– Explore, visualize, and cluster data distribution
– Enhance model with predictive capabilities
Segregate map from original data
– Disclose map as conceptual repository for further exploration
– Deploy predictive model for use/integration in applications
Enable access to original data via map
– Achievable through Micro-Cluster Queries (MCQ)
– User authorization for MCQ under control of application
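The decoupling idea can be sketched as follows: records are summarized as per-microcluster statistics that can be disclosed, while the membership lists needed to get back to original records stay with the data owner behind an authorization check. The slides do not describe how Micro-Cluster Queries are implemented, so everything here (function names, fixed centroids, the authorization set) is a hypothetical illustration.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def build_map(points, centroids):
    """Summarize points as per-microcluster counts and means."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    members = [[] for _ in centroids]  # stays with the data owner
    for idx, p in enumerate(points):
        c = nearest(p, centroids)
        members[c].append(idx)
        for d, v in enumerate(p):
            sums[c][d] += v
    summary = [
        {"count": len(m), "mean": [s / len(m) for s in row] if m else None}
        for m, row in zip(members, sums)
    ]
    return summary, members


def micro_cluster_query(cluster, user, authorized, points, members):
    """Return original records of one microcluster to authorized users only."""
    if user not in authorized:
        raise PermissionError("not authorized for micro-cluster queries")
    return [points[i] for i in members[cluster]]


points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (4.8, 5.2)]
centroids = [(1.0, 1.0), (5.0, 5.0)]
summary, members = build_map(points, centroids)
print(summary)  # disclosable: counts and means only
print(micro_cluster_query(0, "alice", {"alice"}, points, members))
```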
Example: CIROCO data representation
Source: Vanfleteren et al., AJRCCM, 2013
CIROCO study: Model publication
CIROCO study: Diagnostic factors
CIROCO study: Aggregate statistics
Self-Organizing Maps (SOM)
SOMs represent data distributions in perceptual maps
– Able to create maps from big / complex data
– Original data can be "forgotten"
– Maintains essential distribution information
– Contains local data statistics in microclusters (cluster binning)
Released map is a conceptual repository to
– Visually explore data distributions
– Make complex distributions tangible
– Explore patterns and data dependencies
– Draw benefit from sensitive data without disclosing data
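To make the idea concrete, here is a minimal textbook-style SOM with a one-dimensional node grid. It is a generic sketch and not the Viscovery implementation; the learning-rate and neighborhood schedules, the function names, and the sample data are assumptions. After training, only the node weights and the per-node counts would need to be kept.

```python
import math
import random


def best_node(x, nodes):
    """Index of the node whose weight vector is closest to x."""
    return min(range(len(nodes)), key=lambda i: math.dist(x, nodes[i]))


def train_som(data, n_nodes, epochs=50, seed=0):
    """Train a one-dimensional SOM and return the node weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = [list(rng.choice(data)) for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                      # decaying learning rate
        radius = max(n_nodes / 2 * (1.0 - frac), 0.5)  # shrinking neighborhood
        for x in rng.sample(data, len(data)):          # shuffled pass over data
            bmu = best_node(x, nodes)
            for i, w in enumerate(nodes):
                # Nodes near the best-matching unit on the grid move more.
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    w[d] += rate * h * (x[d] - w[d])
    return nodes


# Hypothetical two-dimensional data with two groups.
data = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.15), (0.9, 0.8), (0.8, 0.9), (0.85, 0.85)]
nodes = train_som(data, n_nodes=4)

# Microcluster statistics: how many records each node represents.
counts = [0] * len(nodes)
for x in data:
    counts[best_node(x, nodes)] += 1
print(nodes)
print(counts)
```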
PPDM with Viscovery®
Workflow-oriented system for predictive modeling
– Explorative data mining, visual clustering
– Profiling, statistical analyses
– Classification, non-linear regression
Based on innovative, patented combination of
– Self-Organizing Maps (SOM)
– Multivariate statistics
Map can be segregated from original data
– Disclosure of map does not compromise privacy
– Can be integrated in operational systems (BioXM)
– Level of data disclosure under control of application
Viscovery® data flow (project mode)
[Diagram; labels: Application, Application data, Preprocessing, De-identified data, Analytical Datamarts, Viscovery® SOMine, Modeling, Model data, Predictive Models, Results]
Viscovery® data flow (operational mode)
[Diagram; labels: Viscovery® One(2)One Engine, PPDM application with user access control, User interaction, De-identified data, Data record, Predictive Model, Model name, Model loading, Model application, Model recall, Parameter name, Parameter value, Result]
Example: Mining the connectome
Connectome matrices of individual brains
– Source: http://umcd.humanconnectomeproject.org/
– De-identified, pseudonymized data (highly confidential)
– Connectivity Matrix + Diagnosis (Autism) + Personal data
– Draw conclusions about personality, mental disorders, …
Derive network measures
– Build network graph from each matrix
– Calculate network measures (on global or local level)
– E.g., Clustering Coefficient, Characteristic Path Length, Transitivity, Assortativity, Betweenness
Visualize, explore, cluster network data in Viscovery®
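Two of the named measures can be computed from a connectivity matrix in a few lines. The sketch below is a simplified, binary and undirected version written from the standard definitions; it does not use the Brain Connectivity Toolbox, and the matrix values and the threshold are made up.

```python
from collections import deque

# Hypothetical 4-region connectivity matrix (symmetric weights).
matrix = [
    [0.0, 0.8, 0.6, 0.0],
    [0.8, 0.0, 0.7, 0.0],
    [0.6, 0.7, 0.0, 0.5],
    [0.0, 0.0, 0.5, 0.0],
]


def to_graph(m, threshold):
    """Binarize: connect regions whose weight exceeds the threshold."""
    n = len(m)
    return {i: {j for j in range(n) if j != i and m[i][j] > threshold} for i in range(n)}


def clustering_coefficient(graph):
    """Mean local clustering coefficient over all nodes."""
    total = 0.0
    for nbrs in graph.values():
        k = len(nbrs)
        if k < 2:
            continue  # such nodes contribute 0 by convention
        links = sum(1 for a in nbrs for b in nbrs if a < b and b in graph[a])
        total += 2 * links / (k * (k - 1))
    return total / len(graph)


def characteristic_path_length(graph):
    """Mean shortest-path length over all connected node pairs."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from source
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


graph = to_graph(matrix, threshold=0.4)
print(clustering_coefficient(graph))
print(characteristic_path_length(graph))
```

Each brain then contributes one row of such measures, which is the kind of table that can be explored and clustered without exposing the underlying matrices.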
Diffusion Tensor Imaging data from the Human Connectome Project
Source: www.nimh.nih.gov/news/science-news/2012/brain-wiring-a-no-brainer.shtml
Diffusion Tensor Imaging (DTI)
Diffusion Gradients: Directed flow of water molecules detected by MR indicating fiber tracts
Reconstructed Fiber Tracts: Reconstructed fiber tracts indicate a potential anatomical connection between two brain areas
Connectivity Matrix: Thickness of detected fibers between brain areas (color coded)
Topological graph of functional network
Source: Bullmore, Sporns 2009, Nature Reviews Neuroscience, Vol. 10
Calculation of network measures
Source: http://umcd.humanconnectomeproject.org
Values are computed by Brain Connectivity Toolbox, Rubinov & Sporns, 2009
Can network measures serve as biomarkers for brain diseases?
Stratification of autism patients
leveraging comprehensive clinical knowledge without compromising patient data privacy
Learn more and visit us at ...
www.viscovery.net
Viscovery Software GmbH
Kupelwiesergasse 27
A-1130 Wien
Tel. +43-1-532 [email protected]