Introduction
Transparency, accountability and fairness
Privacy
Privacy and ethical issues in Big Data: Current trends and future challenges
Sébastien Gambs
Canada Research Chair in Privacy-preserving and Ethical Analysis of Big Data
Université du Québec à Montréal (UQAM)
gambs.sebastien@uqam.ca
15 November 2018
Sebastien Gambs Big Data, privacy and ethics 1
Introduction
Big Data
- Broadly refers to the massive increase in the amount and diversity of data collected and available.
- Technical characterization: often defined in terms of the five "V"s (Volume, Variety, Velocity, Variability and Veracity).
- Main promise of Big Data: offer the possibility to realize inferences with an unprecedented level of accuracy and detail.
A glimpse at Big personal Data
Personalisation
- In our Information Society, our daily life is influenced by recommendation algorithms and predictions that are personalized according to our profile.
- Necessity to understand the biases that these algorithms can have and their impact on society.
- Main challenge: quantify, measure and prevent discrimination.
The deep learning revolution
- Revolution: a "quantum leap" in prediction accuracy in many domains.
- Recent successes: automatic generation of textual descriptions from a picture, victory of AlphaGo against a professional Go player in 2016.
- Possible due to algorithmic advances in machine learning, in particular in deep learning, combined with the increase in computational power and the amount of data available.
Personalized medicine - IBM Watson advisor for cancer
Example of predictive recruiting
Promise: the recruitment process will become ultra-objective and will not be influenced by the human bias of a recruiter.
Large-scale mobility analytics
Objective: publication of the mobility traces of users derived from phone usage (Call Detail Records).
Fundamental question: how to anonymize the data before publishing it, to limit the privacy risks for the users whose mobility is recorded in the data?
Other types of data with strong inference potential
1. Genomic/medical data.
   - Possible risks: inference on genetic diseases or the tendency to develop particular health problems, leakage of information about ethnic origin and the genomics of relatives, genetic discrimination, . . .
2. Social data.
   - Possible risks: reconstruction of the social graph, inferences on political opinions, religion, sexual orientation, hobbies, . . .
Privacy
- Privacy is one of the fundamental rights of individuals:
  - Universal Declaration of Human Rights, adopted by the United Nations General Assembly (Article 12), 1948.
  - Act respecting the protection of personal information in the private sector, Quebec.
  - European General Data Protection Regulation (GDPR), voted in 2016 and effective since May 2018.
  - California Consumer Privacy Act, voted in 2018, effective in 2020.
- Risk: collection and use of digital traces and personal data for fraudulent purposes.
- Examples: targeted spam, identity theft, profiling, (unfair) discrimination (to be discussed later).
Impact of Big Data on privacy
1. Magnification of the privacy risks due to the increase in volume and diversity of the personal data collected and the computational power to process them.
2. Data collected about individuals are often "re-used" for a different purpose without asking their consent.
3. The inferences that are possible with Big Data are much more fine-grained and precise than before.
4. Massive release of data without taking into account the privacy aspect ⇒ major privacy breach.
   - Once data is disclosed, it is there forever (no right to be forgotten).
5. Ethics of inference: which inferences are acceptable for society and which ones are not?
Example of sensitive inference: predictive policing
Transparency, accountability and fairness
The fuzzy border between personalization and discrimination
- Example: price personalization on the Staples website depending on the location (Wall Street Journal, 2012).
- Possible discriminations: poor credit, high insurance rates, refusal of employment or access to schools, denial of some functions.
Example of sensitive context with risk of discrimination
Possible origin of the bias
1. Problem in the data collection, due to some error or to the fact that the data is inherently biased.
   - Examples: mistake in the profile of the user, dataset reflecting historical discriminatory decisions against a particular population.
2. Inaccuracy due to the learning algorithm.
   - Example: the algorithm is very accurate, except for 1% of the individuals.
Right to fairness and transparency (GDPR, Recital 71)
In order to ensure fair and transparent processing in respect of the data subject, taking into account the specific circumstances and context in which the personal data are processed, the controller should use appropriate mathematical or statistical procedures for the profiling, implement technical and organisational measures appropriate to ensure, in particular, that factors which result in inaccuracies in personal data are corrected and the risk of errors is minimised, [...] and that prevents, inter alia, discriminatory effects on natural persons on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or that result in measures having such an effect.
Different notions of fairness
1. Individual fairness: two individuals whose profiles are similar, except for the protected attributes, should receive the same decision.
   - Example: during recruitment for a job, candidates should be selected according to their competences, independently of their gender or ethnic origin.
2. Group fairness: the statistics of decisions targeting a particular group should be approximately the same as for the overall population.
   - Example: if the recruitment algorithm produces a binary output, "accepted" or "refused", the proportions of males and of females accepted should be the same.
- Challenge: some studies have demonstrated that there are scenarios in which these metrics are incompatible.
Example of incompatibility between two fairness notions
- Consider a school in which the admission process is based on the average of ratings at school, and imagine the situation in which women generally have a higher average than men.
- Scenario 1: In this case, women are more often accepted than men and we have individual fairness (the selection is made independently of gender) but . . .
  - No group fairness: the proportion of women accepted is higher than the proportion of men accepted.
- Scenario 2: To reach group fairness, we could artificially raise the average of men to ensure that a higher number of them are accepted.
  - No individual fairness: if a man and a woman have the same average, the man's rating will be artificially increased and he will have a higher chance of being admitted than the woman.
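The two admission scenarios above can be replayed numerically; a toy sketch with invented applicants and an invented threshold of 70:

```python
# Toy illustration of the two scenarios above (all numbers invented).
# Admission rule: accept if the average rating is at least 70.
applicants = [
    ("F", 80), ("F", 75), ("F", 72), ("F", 60),
    ("M", 78), ("M", 65), ("M", 62), ("M", 55),
]

def acceptance_rate(decisions, group):
    """Fraction of accepted candidates within one group."""
    in_group = [d for g, d in decisions if g == group]
    return sum(in_group) / len(in_group)

# Scenario 1: gender-blind rule -> individually fair, but group rates differ.
blind = [(g, 1 if avg >= 70 else 0) for g, avg in applicants]
print(acceptance_rate(blind, "F"), acceptance_rate(blind, "M"))  # 0.75 0.25

# Scenario 2: boost men's averages to equalize group rates -> individual
# fairness is lost (same average, different treatment depending on gender).
boost = 10
boosted = [(g, 1 if (avg + boost if g == "M" else avg) >= 70 else 0)
           for g, avg in applicants]
print(acceptance_rate(boosted, "F"), acceptance_rate(boosted, "M"))  # 0.75 0.75
```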
No consensus on the “right” notion of fairness
Figure: Example from the tutorial given by Arvind Narayanan at FAT* 2018
Opacity of machine learning algorithms
- Machine learning has a central role in most personalized systems.
- Opacity: difficulty of understanding and explaining their decisions due to their complex design.
- Example: the classifier output by a deep learning algorithm is typically composed of many layers of neural networks.
- Risk of "algorithmic dictatorship" (Rouvroy): loss of control of individuals over their digital lives due to automated decisions if there is no remediation procedure.
Transparency as a first step
- Asymmetry of information: strong difference between what the system knows about a person and what the person knows about the system.
- Lack of transparency leads to lack of trust.
- Strong need to improve the transparency of information systems.
Possible approaches to transparency
1. Regulatory approaches, to force companies to let users examine and correct the information collected about them.
2. Methods to increase transparency by "opening the black box".
- Tools to reach transparency by design.
- Examples: publication of the source code, use of an interpretable model in machine learning.
Community effort to increase transparency
Measuring discrimination
- Example: measurement of quantitative input influence.
- Challenge: possibility of indirect discrimination, in which the discriminatory attribute is inferred through other attributes.
- Example: even if ethnicity is not asked from the user, in some countries it strongly correlates with the ZIP code.
Understanding the bias of the algorithm through an interpretable model (Zeng, Ustun and Rudin 2016)
- Main idea: approximating a classifier through an interpretable model that can be inspected by a human.
- Example: use of a SLIM (Supersparse Linear Integer Model).
- Instead of using 42 binary attributes such as "Was the inmate between 18 and 24 years old during his first arrest?" or "Was the inmate released on probation?", the tool can select only 5 of them while obtaining a non-trivial fidelity to the original model.
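As an illustration of the idea only (not the actual model from the paper), a hypothetical SLIM-style scorecard with invented questions and small integer weights, readable at a glance:

```python
# Hypothetical SLIM-style scorecard (invented features and weights,
# NOT the model of Zeng, Ustun and Rudin): a handful of binary
# questions with small integer weights, interpretable by a human.
SCORECARD = [
    ("first arrest before age 25", 2),
    ("prior probation violation",  3),
    ("any prior arrests",          1),
]

def score(answers):
    """answers: dict feature -> bool; flag 'high risk' above a threshold."""
    total = sum(w for feat, w in SCORECARD if answers.get(feat))
    return total, total >= 3

print(score({"first arrest before age 25": True,
             "any prior arrests": True}))  # (3, True)
```

The interpretability comes from the form itself: anyone can recompute the score by hand and see exactly which question contributed what.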
Towards algorithmic accountability
- Caveat: transparency does not necessarily mean interpretability or accountability.
- Example: the code of an application could be public but too complex to be comprehended by a human.
- Strong need for the development of tools that can analyze and certify the code of a program.
- Objective: verify that the execution of the program matches the intended behaviour or the ethical values that are expected from it.
- Strong link with the notion of loyalty (does the system behave as promised?).
Example of NYC algorithmic task force
Example of FairTest report (based on normalized mutual information)
Looking from all possible angles to observe discrimination
- Challenge: depending on the way we split the population into subgroups, we may or may not observe discrimination.
- Simpson's paradox: statistical paradox stating that a particular phenomenon observed on some groups can seem inverted when the groups are combined.
- To obtain relevant results, one needs to look for discrimination by varying the groups observed.
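Simpson's paradox can be demonstrated with invented admission numbers: women have the higher acceptance rate in each department, yet the lower rate overall:

```python
# Toy illustration of Simpson's paradox (all numbers invented).
# (accepted, applied) per gender and department.
data = {
    "dept_A": {"women": (9, 10),    "men": (800, 1000)},
    "dept_B": {"women": (60, 1000), "men": (5, 100)},
}

def rate(accepted, applied):
    return accepted / applied

# Within each department, women have the HIGHER acceptance rate...
for dept, groups in data.items():
    assert rate(*groups["women"]) > rate(*groups["men"])

def overall(gender):
    """Acceptance rate when the departments are combined."""
    accepted = sum(groups[gender][0] for groups in data.values())
    applied = sum(groups[gender][1] for groups in data.values())
    return accepted / applied

# ...yet when departments are combined, women have the LOWER rate,
# because most women applied to the more selective department.
print(overall("women"), overall("men"))  # ~0.068 vs ~0.732
```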
Enhancing fairness
- Ultimate objective: being able to increase fairness without impacting accuracy too much.
- Examples of possible approaches:
  - Sample the input data to remove its original bias,
  - Change the design of the algorithm so that it becomes discrimination-aware by design,
  - Adapt the output produced by the algorithm (e.g., the classifier) to reduce discrimination.
- Active subject of research, but still in its infancy; much remains to be done.
- Difficulty: there are several ways to reach fairness.
- Example: for each candidate, we randomly decide if they are accepted or not (perfect fairness).
Diversity as a key aspect for improving fairness
Privacy
Privacy enhancing technologies
Privacy Enhancing Technologies (PETs): set of techniques for protecting the privacy of an individual and offering them better control over their personal data.
Example of PET: homomorphic encryption.
Two fundamental principles behind the PETs:
- Data minimization: only the information necessary for completing a particular purpose should be collected/revealed.
- Data sovereignty: enable users to keep control over their personal data and over how they are collected and disseminated.
Personally identifiable information
- Personally identifiable information: set of information that can be used to uniquely identify an individual.
- Examples: first and last name, social security number, place and date of birth, physical and email addresses, phone number, credit card number, biometric data (such as fingerprints and DNA), . . .
- Sensitive because they uniquely identify an individual and can easily be used to cross-reference databases.
- Main limits of the definition:
  - does not take into account attributes or patterns in the data that can seem innocuous individually but can identify an individual when combined together (quasi-identifiers);
  - does not take into account the inference potential of the data considered.
Pseudonymization is not an alternative to anonymization
Replacing the name of a person by a pseudonym does not imply preservation of the privacy of this individual.
(Extract from an article from the New York Times, 6 August 2006)
SUICA’s privacy leak (July 2013)
Legal requirements to evaluate anonymization methods
General Data Protection Regulation (Recital 26):
"To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."
- Consequence: the evaluation of the risk of de-anonymization should take into account the resources needed to conduct the re-identification, and should be done on a regular basis (risk-based approach).
- The French law "for a Digital Republic" (October 2016) also recognized the right of the French data protection authority (the CNIL) to certify anonymization processes.
Inference attack
- Inference attack: the adversary takes as input a published dataset (and possibly some background knowledge) and tries to infer some personal information regarding individuals contained in the dataset.
- Main challenge: to be able to give some privacy guarantees even against an adversary having some auxiliary knowledge.
- We may not even be able to model this a priori knowledge.
- Remark: maybe my data is private today, but it may not be so in the future due to the public release of some other data.
De-anonymization attack
- De-anonymization attack: the adversary takes as input a sanitized dataset and some background knowledge and tries to infer the identities of the individuals contained in the dataset.
- Specific form of inference attack.
- Example: Sweeney's original de-anonymization attack via linking.
- The re-identification risk measures the success probability of this attack.
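A minimal sketch of a Sweeney-style linking attack on invented data: a release stripped of names is joined to a public list on the quasi-identifiers (ZIP, birth date, sex):

```python
# Sketch of de-anonymization via linking (all records invented).
# The medical release drops names but keeps quasi-identifiers.
medical = [
    {"zip": "02138", "birth": "1945-07-29", "sex": "F", "diag": "hypertension"},
    {"zip": "02139", "birth": "1971-03-12", "sex": "M", "diag": "asthma"},
]
# Public voter list: names alongside the same quasi-identifiers.
voters = [
    {"name": "Alice", "zip": "02138", "birth": "1945-07-29", "sex": "F"},
    {"name": "Bob",   "zip": "02139", "birth": "1971-03-12", "sex": "M"},
    {"name": "Carol", "zip": "02139", "birth": "1980-01-01", "sex": "F"},
]

QI = ("zip", "birth", "sex")

def link(record, population):
    """Return the names of all people matching the record on every QI."""
    key = tuple(record[a] for a in QI)
    return [p["name"] for p in population
            if tuple(p[a] for a in QI) == key]

for rec in medical:
    matches = link(rec, voters)
    if len(matches) == 1:  # unique match => re-identification
        print(matches[0], "->", rec["diag"])
```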
Example : inference attacks on location data
Joint work with Marc-Olivier Killijian (LAAS-CNRS) and Miguel Nunez del Prado (Universidad del Pacifico).
Main objective: quantify the privacy risks of location data.
Types of attacks:
1. Identification of important places, called Points of Interest (POIs), characterizing the interests of an individual.
   - Examples: home, place of work, gymnasium, political headquarters, medical center, . . .
2. Prediction of the movement patterns of an individual, such as his past, present and future locations.
3. Linking the records of the same individual contained in the same dataset or in different datasets (either anonymized or under different pseudonyms).
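Attack 1 can be sketched naively (this is not the actual algorithm of the joint work): grid-cluster the trace and keep the cells visited often enough, here on an invented trace:

```python
# Naive POI-extraction sketch (illustrative, not the authors' algorithm):
# bucket GPS fixes into grid cells and keep the frequently visited ones.
from collections import Counter

def extract_pois(trace, cell=0.001, min_visits=3):
    """trace: list of (lat, lon); a POI is a grid cell with many fixes."""
    cells = Counter((round(lat / cell), round(lon / cell))
                    for lat, lon in trace)
    return {c for c, n in cells.items() if n >= min_visits}

# Invented trace: many fixes near "home", a few scattered elsewhere.
home = [(45.5081 + i * 1e-5, -73.554) for i in range(5)]
noise = [(45.52, -73.60), (45.53, -73.61)]
pois = extract_pois(home + noise)
print(len(pois))  # 1: only the home cell stands out
```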
Sanitization
Sanitization: process increasing the uncertainty in the data in order to preserve privacy.
⇒ Inherent trade-off between the desired level of privacy and the utility of the sanitized data.
Typical application: public release of data (offline or online context).
Examples drawn from the “sanitization” entry on Wikipedia
Classical sanitization mechanisms
- Perturbation: addition of noise to the true value.
- Aggregation: merging of several data items into a single one.
- Generalization: loss of granularity of the information.
- Deletion: erasure of the information related to a particular attribute.
  - Remark: the absence of information can sometimes lead to a privacy breach (e.g., removing the disease field of a patient record only when the patient has a sexually transmitted disease).
- Introduction of fake data: addition of artificial records to a database to "hide" the true data.
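Each mechanism above can be sketched in a few lines (illustrative helpers, not a real sanitization library):

```python
# Minimal sketches of the classical sanitization mechanisms above.
import random
import statistics

def perturb(value, scale=1.0):
    """Perturbation: add random noise to the true value."""
    return value + random.gauss(0, scale)

def aggregate(values):
    """Aggregation: merge several values into a single statistic."""
    return statistics.mean(values)

def generalize(age, width=10):
    """Generalization: replace an exact age by a coarse range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def delete(record, attribute):
    """Deletion: erase one attribute from a record."""
    return {k: v for k, v in record.items() if k != attribute}

print(generalize(34))                                     # 30-39
print(delete({"name": "Alice", "disease": "flu"}, "disease"))
```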
Example of privacy model : k-anonymity (Sweeney 02)
- Privacy guarantee: in each group of the sanitized dataset, each individual will be identical to at least k − 1 others.
- Reached by a combination of generalization and suppression.
- Example of use: sanitization of medical data.
- Main challenge: extracting useful knowledge while preserving the confidentiality of individuals' sensitive data.
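Checking the k of a table is straightforward; a sketch on an invented, already-generalized table:

```python
# Sketch: compute the k for which a table is k-anonymous over its
# quasi-identifiers (invented records; "age" and "zip" already generalized).
from collections import Counter

def k_of(table, quasi_identifiers):
    """Size of the smallest group of records sharing the same QI values."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in table)
    return min(groups.values())

table = [
    {"age": "20-29", "zip": "131**", "disease": "flu"},
    {"age": "20-29", "zip": "131**", "disease": "asthma"},
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "diabetes"},
]
print(k_of(table, ("age", "zip")))  # 2 -> the table is 2-anonymous
```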
Intersection attack
- Question: suppose that Alice's employer knows that she is 28 years old, that she lives in ZIP code 13012 and that she visits both hospitals. What does he learn?
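The attack can be sketched on invented hospital releases: each release taken alone leaves Alice two candidate diagnoses, but intersecting them leaves only one:

```python
# Sketch of the intersection attack (invented releases): each hospital's
# release is 2-anonymous on (age, zip) taken alone, but intersecting the
# candidate sets of the two releases pins Alice's diagnosis down.
hospital_1 = [  # (age, zip, diagnosis)
    (28, "13012", "flu"),
    (28, "13012", "cancer"),
]
hospital_2 = [
    (28, "13012", "cancer"),
    (28, "13012", "heart disease"),
]

def candidates(release, age, zip_code):
    """Diagnoses compatible with the given quasi-identifiers."""
    return {d for a, z, d in release if (a, z) == (age, zip_code)}

# Alice (28, ZIP 13012) appears in both releases, so her diagnosis is in:
print(candidates(hospital_1, 28, "13012")
      & candidates(hospital_2, 28, "13012"))  # {'cancer'}
```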
The key property: composition
- A "good" privacy model should provide some guarantees about the total leak of information revealed by two (or more) sanitized datasets.
- More precisely, if the first release reveals b1 bits of information and the second release b2 bits of information, the total amount of information leaked should not be more than O(b1 + b2) bits.
- Remark: most of the existing privacy models do not have any composition property, with the exception of differential privacy (Dwork 06).
Differential privacy: principle (Dwork 06)
- Privacy notion developed within the community of private data analysis that has gained widespread adoption.
- Basically ensures that whether or not an item is in the profile of an individual does not influence the output too much.
- Gives strong privacy guarantees that hold independently of the auxiliary knowledge of the adversary and that compose well.
Implementing differential privacy
Possible techniques to implement differential privacy :
- Addition of noise to the output of an algorithm (e.g., the Laplacian mechanism).
- Perturbation of the input given to the algorithm.
- Randomization of the behaviour of the algorithm.
- Creation of a synthetic database or a data structure "summarizing and aggregating the data".
- Sampling mechanisms.
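The first technique (output perturbation) can be sketched for a counting query: a count has sensitivity 1 (one individual changes it by at most 1), so adding Laplace noise of scale 1/ε yields ε-differential privacy. Data and ε below are illustrative:

```python
# Sketch of the Laplacian mechanism for a counting query (sensitivity 1).
import random

def laplace(scale):
    """Centered Laplace noise: difference of two i.i.d. exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon):
    """epsilon-differentially private count (query sensitivity is 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace(1 / epsilon)

ages = [23, 35, 41, 29, 52, 61, 30]
print(private_count(ages, lambda a: a >= 30, epsilon=0.5))  # 5 + noise
```

Sequential composition then follows directly: answering the query twice with ε each consumes 2ε of privacy budget in total.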
Towards an adoption of differential privacy by web actors
Why one should always question the choice of the privacy parameter
Fire and Ice: Japanese competition on anonymization and re-identification attacks (2015, 2016, 2017 and 2018)
- Objective: evaluate empirically the effectiveness of anonymization methods and re-identification attacks.
- Similar in spirit to other competitions in machine learning or security.
Next steps
- To broaden the impact and the outreach to the privacy community, we are organizing an international competition on sanitization mechanisms and inference attacks.
- Schedule:
  - Last year: workshop at PETS for preparing the competition (definition of privacy and utility metrics, choice of the dataset, setting of the competition, . . . ).
  - Next year: international competition on sanitization and inference, starting at the beginning of 2019 on CrowdAI.
Conclusion
Conclusion (1/2)
- Observation 1: the capacity to record and store personal data has increased rapidly in recent years.
- Observation 2: "Big Data" will result in more and more data being available ⇒ increase in inference possibilities.
- Observation 3: the "Open Data" movement will lead to the release of a huge number of datasets ⇒ worsens the privacy impact of Big Data (observation 2).
- The advent of Big Data magnifies privacy risks that already existed, but also raises new ethical issues.
- Main challenge: balance the social and economic benefits of Big Data with the protection of privacy and the fundamental rights of individuals.
Conclusion (2/2)
- Strong need for research and scientific cooperation in Big Data:
  - for determining how to address privacy in this context,
  - for the design of new protection and sanitization mechanisms,
  - as well as for inference attacks, for assessing the privacy level they provide,
  - for finding solutions addressing the transparency, fairness and accountability issues.
- Overall objective: being able to reap the benefits of Big Data by not only protecting the privacy of individuals but also making sure that they remain in control of their digital lives.
This is the end
Thanks for your attention
Questions?
Fairness metrics (1/3)
- Consider an algorithm that produces a binary output o ∈ {0, 1} and whose input contains a unique protected attribute s ∈ {s−, s+}.
- Assume that the group s = s+ is the favored group for obtaining o = 1 and that the group s− is disadvantaged.
- P(o|s+) is the empirical probability of obtaining o knowing that the protected attribute has value s+ (P(o|s−) can be defined in the same manner).
Fairness metrics (2/3)
Examples of fairness metrics:
1. selection-lift: P(o|s+) / P(o|s−).
   - The higher the ratio, the more unfair the treatment.
   - Example: disparate impact, a criterion existing in US law used to measure unfair treatment.
   - According to the law, a treatment is unfair if P(o|s−) / P(o|s+) < 0.8 (the "80% rule").
2. difference-based selection-lift: P(o|s+) − P(o|s−).
   - The higher the difference, the more unfair the treatment.
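Both metrics can be computed from empirical selection rates; a sketch on invented decisions:

```python
# Sketch of the selection-lift metrics on invented decisions.
def rates(decisions):
    """decisions: list of (s, o) with s in {'+', '-'} and o in {0, 1}."""
    def p(group):
        outs = [o for s, o in decisions if s == group]
        return sum(outs) / len(outs)
    return p("+"), p("-")

# Invented outcomes: favored group accepted 8/10, disadvantaged 5/10.
decisions = ([("+", 1)] * 8 + [("+", 0)] * 2
             + [("-", 1)] * 5 + [("-", 0)] * 5)

p_plus, p_minus = rates(decisions)
print(p_plus / p_minus)        # selection-lift: 0.8 / 0.5 = 1.6
print(p_plus - p_minus)        # difference-based selection-lift: ~0.3
# 80% rule (disparate impact): flagged when P(o|s-)/P(o|s+) < 0.8.
print(p_minus / p_plus < 0.8)  # True -> this treatment is flagged
```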
Fairness metrics (3/3)
3. Mutual information: measures the correlation between the protected attribute and the output.
   - If this value is low, the two variables are close to independent ⇒ weak correlation.
   - Ideally, we would like the output to be independent of the protected attribute in order to have a fair decision.
- Examples of tools to measure fairness: FairTest, FairML, . . .
- Strong effort from the community to develop tools to quantify discrimination but . . .
  - currently there is no consensus on the definition of discrimination.
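The mutual information between a binary protected attribute and a binary output can be estimated from empirical counts; a sketch on invented decisions:

```python
# Sketch: empirical mutual information (in bits) between a protected
# attribute s and an output o, from a list of (s, o) pairs.
import math
from collections import Counter

def mutual_information(pairs):
    n = len(pairs)
    joint = Counter(pairs)
    ps = Counter(s for s, _ in pairs)
    po = Counter(o for _, o in pairs)
    mi = 0.0
    for (s, o), c in joint.items():
        p_so = c / n
        mi += p_so * math.log2(p_so / ((ps[s] / n) * (po[o] / n)))
    return mi

# Attribute and output independent -> MI is 0 (fair by this metric).
fair = [("+", 1), ("+", 0), ("-", 1), ("-", 0)] * 25
print(mutual_information(fair))    # 0.0
# Output fully determined by the attribute -> MI is maximal (1 bit here).
unfair = [("+", 1), ("-", 0)] * 50
print(mutual_information(unfair))  # 1.0
```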