
www.viscovery.net

Privacy Preserving Data Mining: An approach to safely share and use sensitive medical data

Gerhard Kranner, Viscovery
Biomax Symposium, June 24th, 2016, Munich

© 2016 Viscovery Software GmbH www.viscovery.net

Privacy protection vs knowledge gain

– What is Privacy Preserving Data Mining?
– Terms and standards
– Risks, limits, and issues
– Data mining without need of data disclosure
– Data abstraction with perceptual maps
– Connectome example


Privacy Preserving Data Mining

➢ “PPDM is the responsible use of data mining to extract useful knowledge from data without compromising data privacy.”

This implies the need to
– Access, explore and model sensitive data
– Share results, deploy analytical models

But, in doing so, to
– Observe legal and ethical standards
– In particular, preserve data confidentiality


Basic terms

Pseudonymization
– Replace identifying fields within each data record by pseudonyms (artificial codes)

De-identification
– Remove, mask or generalize identifying information to prevent a person’s identity from being connected with information

Anonymization
– Irreversibly remove association between an identifying dataset and the data subject
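As an illustration of pseudonymization, the following minimal sketch replaces identifying fields with artificial codes derived by keyed hashing (HMAC). The key, the field names and the `P-` code format are assumptions made for this example; the slides do not prescribe a particular scheme.

```python
import hashlib
import hmac

# Hypothetical secret key; it would have to be stored separately from the data.
SECRET_KEY = b"replace-with-a-secret-key"

# Hypothetical names of the identifying fields in a record.
IDENTIFYING_FIELDS = ("name", "ssn")


def pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """Map an identifying value to a stable artificial code via keyed hashing."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return "P-" + digest[:12]


def pseudonymize(record: dict) -> dict:
    """Return a copy of the record with identifying fields replaced by pseudonyms."""
    out = dict(record)
    for field in IDENTIFYING_FIELDS:
        if field in out:
            out[field] = pseudonym(str(out[field]))
    return out


record = {"name": "Jane Doe", "ssn": "123-45-6789", "diagnosis": "COPD"}
pseudo_record = pseudonymize(record)
```

The same input always yields the same code, so records of one person stay linkable. Whoever holds the key can recompute the mapping, which is why this is pseudonymization and not anonymization.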


Common de-identification methods

Removal of identifiers
– Direct identifiers: name, address, social security number
– Quasi-identifiers: birthday, ZIP, sex
– Any links to identifying information

Data and/or output perturbation
– Add non-deterministic noise to attribute values
– Mask, modify, aggregate values systematically

Generalization (data binning, bucketing)
– Original data values which fall in a given small interval, a bin, are replaced by a value representative of that interval
– Generalize all dates to year: 17th March 1983 → 1983
– Reduce zip codes to three digits: D-82152 → 821
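The two generalization examples above can be sketched in a few lines. The function names are chosen for this illustration only.

```python
from datetime import date


def generalize_date_to_year(d: date) -> int:
    """Keep only the year of a date, e.g. 17th March 1983 -> 1983."""
    return d.year


def generalize_zip(zip_code: str, digits: int = 3) -> str:
    """Drop any country prefix and keep only the leading digits of a zip code."""
    only_digits = "".join(ch for ch in zip_code if ch.isdigit())
    return only_digits[:digits]


year = generalize_date_to_year(date(1983, 3, 17))
zip3 = generalize_zip("D-82152")
```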


Example: Two-dimensional binning
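A minimal sketch of two-dimensional binning, assuming two numeric attributes (here, hypothetically, age in years and weight in kg) and fixed bin widths. Each record is replaced by the lower-left corner of its grid cell, and only the count per cell is kept.

```python
from collections import Counter


def bin_2d(points, x_width, y_width):
    """Count how many (x, y) records fall into each cell of a regular grid."""
    counts = Counter()
    for x, y in points:
        cell = (int(x // x_width) * x_width, int(y // y_width) * y_width)
        counts[cell] += 1
    return counts


# Hypothetical (age, weight) records.
points = [(34, 71.5), (37, 78.0), (52, 90.2), (58, 93.1), (59, 64.0)]
cells = bin_2d(points, x_width=10, y_width=10)
```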


The HIPAA “Safe Harbor” Method

HIPAA Privacy Rule, USA, 2003: Provides mechanisms for using and disclosing health data responsibly without the need for patient consent
EITHER apply Expert Determination Method
OR remove or generalize 18 specific types of data:

(A) Names
(B) All geographic subdivisions, including street address, city, county, precinct, ZIP code, if the geographic unit contains less than 20,000 people…
(C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89…
(D) Telephone numbers
(E) Fax numbers
(F) Email addresses
(G) Social security numbers
(H) Medical record numbers
(I) Health plan beneficiary numbers
(J) Account numbers
(K) Certificate/license numbers
(L) Vehicle identifiers and serial numbers, including license plate numbers
(M) Device identifiers and serial numbers
(N) Web Universal Resource Locators (URLs)
(O) Internet Protocol (IP) addresses
(P) Biometric identifiers, including finger and voiceprints
(Q) Full-face photographs and any comparable images
(R) Any other unique identifying number, characteristic, or code
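The following sketch shows how a few of these categories might be handled on a single record. It is an illustration only, not a compliant Safe Harbor implementation: the field names are hypothetical, only some of the 18 categories are covered, and the conditions elided with “…” above are not implemented.

```python
from datetime import date

# Hypothetical field names; a real dataset would need its own mapping.
DROP_FIELDS = {"name", "street_address", "phone", "fax", "email", "ssn",
               "medical_record_number", "ip_address", "url"}
DATE_FIELDS = {"birth_date", "admission_date", "discharge_date", "death_date"}


def safe_harbor_sketch(record: dict) -> dict:
    """Drop listed identifiers, reduce dates to the year, and cap ages over 89."""
    out = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # identifier categories are removed entirely
        if field in DATE_FIELDS:
            out[field.replace("_date", "_year")] = value.year
        elif field == "age":
            out[field] = "90+" if value > 89 else value
        else:
            out[field] = value
    return out


patient = {"name": "Jane Doe", "email": "jane@example.org",
           "birth_date": date(1950, 3, 17), "age": 66, "diagnosis": "COPD"}
released = safe_harbor_sketch(patient)
```

Note that this sketch still releases the year of birth for patients older than 89, which would reveal their age; a real implementation would have to handle that case as well.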


Usual de-identification process

Source: NISTIR 8053, De-Identification of Personal Information, 2015


Limits and issues

Re-identification risk
– Cross-reference anonymous data with other data sources to re-identify the origin (linkage attack)
– May result in harm to individuals or groups

De-identification is of limited use
– Not robust against advanced re-identification methods
– Impossible in certain cases
– E.g., genetic data cannot be safely anonymized because the huge amount of pattern information in bio-specimens allows the donors to be re-identified

→ One cannot be sure whether information is re-identifiable!


Implicit disclosure risk

Attribute disclosure
– Adversary derives sensitive information about a patient from released data in conjunction with disclosed information
– E.g. all patients in a list have a specific diagnosis

Inferential disclosure
– When information can be inferred with high confidence from statistical properties of released data
– E.g. infer the income of a data subject from the (publicly available) purchase price of a home
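The attribute disclosure case can be checked mechanically. The sketch below groups released records by their quasi-identifiers and flags every group in which all records share the same sensitive value, so that knowing a person belongs to the group reveals that value. Field names and data are hypothetical.

```python
from collections import defaultdict


def homogeneous_groups(records, quasi_identifiers, sensitive):
    """Return the quasi-identifier combinations whose records all share one sensitive value."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return [key for key, values in groups.items() if len(values) == 1]


released = [
    {"birth_year": 1983, "zip3": "821", "diagnosis": "asthma"},
    {"birth_year": 1983, "zip3": "821", "diagnosis": "COPD"},
    {"birth_year": 1950, "zip3": "804", "diagnosis": "diabetes"},
    {"birth_year": 1950, "zip3": "804", "diagnosis": "diabetes"},
]
at_risk = homogeneous_groups(released, ["birth_year", "zip3"], "diagnosis")
```

Here the group (1950, "804") is flagged: anyone known to be in it has the diagnosis "diabetes", even though no single record can be attributed to a person.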


Linkage attacks

– Link records in datasets based on similarity between subsets of attributes
– A combination of attributes makes it possible to distinguish records in each dataset (fingerprint information)
– Use machine learning for pattern matching

→ Can link the identity of data subjects in a (released or public) dataset with confidential information contained in another dataset
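A minimal linkage sketch on toy data. It uses exact matching on shared quasi-identifiers instead of the similarity-based or machine-learning matching mentioned above; all names and values are invented for the example.

```python
def link(released, public, keys):
    """Attach names from a public dataset to released records with a unique key match."""
    index = {}
    for p in public:
        index.setdefault(tuple(p[k] for k in keys), []).append(p)
    matches = []
    for r in released:
        candidates = index.get(tuple(r[k] for k in keys), [])
        if len(candidates) == 1:  # the attribute combination acts as a fingerprint
            matches.append((candidates[0]["name"], r["diagnosis"]))
    return matches


released = [
    {"birth_year": 1983, "zip3": "821", "sex": "F", "diagnosis": "asthma"},
    {"birth_year": 1950, "zip3": "804", "sex": "M", "diagnosis": "COPD"},
]
public = [
    {"name": "A. Example", "birth_year": 1983, "zip3": "821", "sex": "F"},
    {"name": "B. Example", "birth_year": 1950, "zip3": "804", "sex": "M"},
    {"name": "C. Example", "birth_year": 1950, "zip3": "804", "sex": "M"},
]
linked = link(released, public, keys=["birth_year", "zip3", "sex"])
```

The first released record is re-identified because its attribute combination is unique in the public dataset; the second is protected only because two public entries share the same combination.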


Linkage examples for re-identification

Movie ratings
– Dataset 1: 500,000 training records containing customer ratings of movies (1 to 5 stars) published by Netflix
– Dataset 2: Ratings of (personally) registered users at IMDb
– With only eight movie ratings and dates, 96% of the Netflix subscribers in the released data can be uniquely identified

Medical tests
– Only four consecutive laboratory test results of CHEM-7 (creatinine) uniquely distinguished 89.9% of test subjects in a sample of 61,280 patients

Credit card transactions
– Four distinct points in space and time were sufficient to uniquely identify 90% of the individuals in a sample of 1.1 million people
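Figures like these can be understood as the fraction of individuals whose record is the only one consistent with a few known data points. The sketch below estimates that fraction on toy data. It is a simplification that assumes the adversary knows the first k points of each trace in sorted order, whereas the cited studies used their own sampling procedures, which are not reproduced here.

```python
def unicity(traces: dict, k: int) -> float:
    """Fraction of individuals uniquely determined by k known points of their trace."""
    unique = 0
    for person, points in traces.items():
        known = set(sorted(points)[:k])  # points the adversary is assumed to know
        matches = [p for p, pts in traces.items() if known <= pts]
        if matches == [person]:
            unique += 1
    return unique / len(traces)


# Hypothetical traces: sets of (place, day) points per individual.
traces = {
    "a": {("shop", 1), ("cafe", 2), ("gym", 3)},
    "b": {("shop", 1), ("cafe", 2), ("bar", 3)},
    "c": {("shop", 1), ("park", 2), ("gym", 4)},
}
u1 = unicity(traces, k=1)
u2 = unicity(traces, k=2)
```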


Conclusion

De-identification should be applied
– Removal of direct identifiers is essential
– Must conform with legal regulations

However, even complete anonymization
– Only reduces matching accuracy
– Does not prevent re-identification

➢ Traditional de-identification is not sufficient to ensure privacy, yet it is detrimental to data mining!


Consequences

Need comprehensive strategies (Release Models) for the use of confidential data and results
– Observe data privacy
– Limit risk of re-identification
– Minimize information loss

Need technologies that support these strategies
– Level of disclosed information under control of application
– Ideal application: Provides complete conceptual information without disclosing original data


Release Models

Data Use Agreement (DUA) model
– Make de-identified data available under a legally binding data use agreement

Conceptual model
– Provide access only to aggregate data while prohibiting access to records containing data on an individual

Enclave model
– Keep data in a kind of segregated enclave that restricts export of original data; instead, accept queries from qualified users, run the queries on the data, and respond with results
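The enclave model can be sketched as an object that keeps the records private and only answers aggregate queries. The class name, the minimum group size and the example data are assumptions for this illustration. A size threshold alone does not rule out all inference (for example by comparing the answers to overlapping queries), so this shows the access pattern, not a complete safeguard.

```python
from statistics import mean


class Enclave:
    """Holds original records and returns only aggregate results."""

    def __init__(self, records, min_group_size=5):
        self._records = list(records)  # never exported
        self._min_group_size = min_group_size

    def mean_of(self, field, where=lambda r: True):
        """Mean of a field over the selected records, refused for small groups."""
        selected = [r[field] for r in self._records if where(r)]
        if len(selected) < self._min_group_size:
            raise PermissionError("query would cover too few individuals")
        return mean(selected)


# Hypothetical patient records (age in years, FEV1 in litres).
enclave = Enclave([
    {"age": 61, "fev1": 2.0}, {"age": 64, "fev1": 2.5}, {"age": 58, "fev1": 3.0},
    {"age": 70, "fev1": 3.5}, {"age": 55, "fev1": 4.0}, {"age": 83, "fev1": 3.0},
])
overall = enclave.mean_of("fev1")
```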


Role and purpose based access control

Source: Indumathi, InTech, 2012, http://dx.doi.org/10.5772/49982


PPDM by decoupling models from data

Represent original data in perceptual map
– Generates abstraction that directly shows data distribution
– Data statistics contained in microcluster ensemble

Perform data mining on the map
– Explore, visualize, and cluster data distribution
– Enhance model with predictive capabilities

Segregate map from original data
– Disclose map as conceptual repository for further exploration
– Deploy predictive model for use/integration in applications

Enable access to original data via map
– Achievable through Micro-Cluster Queries (MCQ)
– User authorization for MCQ under control of application
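The separation between a disclosed map and authorized access to the underlying records might look as follows. This is a purely hypothetical sketch: the class and method names are invented, and it does not represent the Viscovery implementation or API of Micro-Cluster Queries.

```python
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Aggregate statistics of one map node, plus private links to its records."""
    count: int
    means: dict
    record_ids: list = field(default_factory=list, repr=False)


class MapAccess:
    """Hypothetical application layer controlling what a user may see."""

    def __init__(self, clusters, records, authorized_users):
        self._clusters = clusters        # cluster id -> MicroCluster
        self._records = records          # record id -> original record
        self._authorized = set(authorized_users)

    def statistics(self, cluster_id):
        """Aggregate view, available to every user of the map."""
        c = self._clusters[cluster_id]
        return {"count": c.count, "means": dict(c.means)}

    def micro_cluster_query(self, user, cluster_id):
        """Original records behind a microcluster, for authorized users only."""
        if user not in self._authorized:
            raise PermissionError(f"{user} is not authorized for micro-cluster queries")
        return [self._records[i] for i in self._clusters[cluster_id].record_ids]


records = {1: {"age": 61, "fev1": 2.0}, 2: {"age": 63, "fev1": 2.4}}
clusters = {"n0": MicroCluster(count=2, means={"age": 62.0, "fev1": 2.2}, record_ids=[1, 2])}
access = MapAccess(clusters, records, authorized_users={"study_physician"})
public_view = access.statistics("n0")
```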


Example: CIROCO data representation

Source: Vanfleteren et al., AJRCCM, 2013


CIROCO study: Model publication


CIROCO study: Diagnostic factors


CIROCO study: Aggregate statistics


Self-Organizing Maps (SOM)

SOMs represent data distributions in perceptual maps
– Able to create maps from big / complex data
– Original data can be “forgotten”
– Maintains essential distribution information
– Contains local data statistics in microclusters (cluster binning)

Released map is a conceptual repository to
– Visually explore data distributions
– Make complex distributions tangible
– Explore patterns and data dependencies
– Draw benefit from sensitive data without disclosing data
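To make the idea concrete, here is a minimal textbook-style SOM in plain Python, followed by the per-node (microcluster) statistics that could be kept once the original records are discarded. It is a sketch under simple assumptions (Gaussian neighborhood, linearly decaying learning rate and radius, synthetic data) and is not the patented Viscovery algorithm.

```python
import math
import random


def best_matching_unit(weights, x):
    """Grid position of the node whose weight vector is closest to x."""
    return min(weights, key=lambda n: sum((w - v) ** 2 for w, v in zip(weights[n], x)))


def train_som(data, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM; returns grid position -> weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    total_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):
            frac = step / total_steps
            lr = 0.5 * (1 - frac)                            # decaying learning rate
            radius = max(rows, cols) / 2 * (1 - frac) + 0.5  # decaying neighborhood
            bmu = best_matching_unit(weights, x)
            for node, w in weights.items():
                d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-d2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += lr * influence * (x[i] - w[i])
            step += 1
    return weights


def microcluster_statistics(weights, data):
    """Record count and mean per node: the local statistics stored with the map."""
    groups = {}
    for x in data:
        groups.setdefault(best_matching_unit(weights, x), []).append(x)
    return {node: {"count": len(xs), "mean": [sum(col) / len(xs) for col in zip(*xs)]}
            for node, xs in groups.items()}


# Synthetic two-dimensional data with two groups.
rng = random.Random(1)
data = [(rng.gauss(0, 0.3), rng.gauss(0, 0.3)) for _ in range(30)] + \
       [(rng.gauss(3, 0.3), rng.gauss(3, 0.3)) for _ in range(30)]
weights = train_som(data, rows=3, cols=3)
stats = microcluster_statistics(weights, data)
```

Nodes that capture only one or two records would expose near-original values through their means, so the map resolution has to be chosen with the amount of data in mind.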


PPDM with Viscovery®

Workflow-oriented system for predictive modeling
– Explorative data mining, visual clustering
– Profiling, statistical analyses
– Classification, non-linear regression

Based on innovative, patented combination of
– Self-Organizing Maps (SOM)
– Multivariate statistics

Map can be segregated from original data
– Disclosure of map does not compromise privacy
– Can be integrated in operational systems (BioXM)
– Level of data disclosure under control of application


Viscovery® data flow (project mode)

[Diagram. Labels: Application, Application data, Preprocessing, De-identified data, Analytical Datamarts, Modeling, Viscovery® SOMine, Model data, Predictive Models, Results]


Viscovery® data flow (operational mode)

[Diagram. Labels: PPDM application with user access control, User interaction, De-identified data, Data record, Parameter name, Parameter value, Model name, Viscovery® One(2)One Engine, Model loading, Model recall, Model application, Predictive Model, Result]


Example: Mining the connectome

Connectome matrices of individual brains
– Source: http://umcd.humanconnectomeproject.org/
– De-identified, pseudonymized data (highly confidential)
– Connectivity Matrix + Diagnosis (Autism) + Personal data
– Draw conclusions about personality, mental disorders,…

Derive network measures
– Build network graph from each matrix
– Calculate network measures (on global or local level)
– E.g. Clustering Coefficient, Characteristic Path Length, Transitivity, Assortativity, Betweenness

Visualize, explore, cluster network data in Viscovery®
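Two of the named measures can be sketched in plain Python for a small binary, undirected graph. Real connectivity matrices are weighted and would first have to be thresholded; that step, the toy matrix, and the function names are assumptions of this example.

```python
from collections import deque


def neighbors(matrix, i):
    """Indices of nodes connected to node i."""
    return [j for j, w in enumerate(matrix[i]) if w > 0 and j != i]


def clustering_coefficient(matrix):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbors count as 0."""
    n, total = len(matrix), 0.0
    for i in range(n):
        nb = neighbors(matrix, i)
        k = len(nb)
        if k < 2:
            continue
        links = sum(1 for a in range(k) for b in range(a + 1, k)
                    if matrix[nb[a]][nb[b]] > 0)
        total += 2 * links / (k * (k - 1))
    return total / n


def characteristic_path_length(matrix):
    """Mean shortest path length over all connected node pairs (breadth-first search)."""
    total, pairs = 0, 0
    for s in range(len(matrix)):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in neighbors(matrix, u):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs  # assumes the graph has at least one edge


# Toy graph with edges 0-1, 0-2, 1-2, 2-3.
matrix = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
cc = clustering_coefficient(matrix)
cpl = characteristic_path_length(matrix)
```

Each brain then contributes one row of such measures, which is the kind of table that can be explored and clustered on a map.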


Diffusion Tensor Imaging data from the Human Connectome Project

Source: www.nimh.nih.gov/news/science-news/2012/brain-wiring-a-no-brainer.shtml


Diffusion Tensor Imaging (DTI)

Diffusion Gradients
– Directed flow of water molecules detected by MR indicating fiber tracts

Reconstructed Fiber Tracts
– Reconstructed fiber tracts indicate a potential anatomical connection between two brain areas

Connectivity Matrix
– Thickness of detected fibers between brain areas (color coded)


Topological graph of functional network

Source: Bullmore, Sporns 2009, Nature Reviews Neuroscience, Vol. 10


Calculation of network measures

Source: http://umcd.humanconnectomeproject.org

Values are computed by Brain Connectivity Toolbox, Rubinov & Sporns, 2009


Can network measures hold as biomarkers for brain diseases?


Stratification of autism patients

leveraging comprehensive clinical knowledge without compromising patient data privacy

www.viscovery.net

Learn more and visit us at ...

Viscovery Software GmbH

Kupelwiesergasse 27, A-1130 Wien

Tel. +43-1-532 [email protected]
