Introduction
Transparency, accountability and fairness
Privacy
Privacy and ethical issues in Big Data: Current trends and future challenges
Sébastien Gambs
Canada Research Chair in Privacy-preserving and Ethical Analysis of Big Data
Université du Québec à Montréal (UQAM)
gambs.sebastien@uqam.ca
15 November 2018
Sebastien Gambs Big Data, privacy and ethics 1
Introduction
Big Data
- Broadly refers to the massive increase in the amount and diversity of data collected and available.
- Technical characterization: often defined in terms of the five "V"s (Volume, Variety, Velocity, Variability and Veracity).
- Main promise of Big Data: offer the possibility to realize inferences with an unprecedented level of accuracy and detail.
A glimpse at Big personal Data
Personalisation
- In our Information Society, our daily life is influenced by recommendation algorithms and predictions that are personalized according to our profile.
- Necessity to understand the biases that these algorithms can have and their impact on society.
- Main challenge: quantify, measure and prevent discrimination.
The deep learning revolution
- Revolution: a "quantum leap" in prediction accuracy in many domains.
- Recent successes: automatic generation of textual descriptions from a picture, victory of AlphaGo against a professional Go player in 2016.
- Possible due to algorithmic advances in machine learning, in particular in deep learning, combined with the increase in computational power and the amount of data available.
Personalized medicine - IBM Watson advisor for cancer
Example of predictive recruiting
Promise: the recruitment process will become ultra-objective and will not be influenced by the human bias of a recruiter.
Large-scale mobility analytics
Objective: publication of the mobility traces of users derived from phone usage (Call Detail Records).
Fundamental question: how to anonymize the data before publishing it, to limit the privacy risks for the users whose mobility is recorded in the data?
Other types of data with strong inference potential
1. Genomic/medical data.
   - Possible risks: inference on genetic diseases or the tendency to develop particular health problems, leakage of information about ethnic origin and the genomics of relatives, genetic discrimination, . . .
2. Social data.
   - Possible risks: reconstruction of the social graph, inferences on political opinions, religion, sexual orientation, hobbies, . . .
Privacy
- Privacy is one of the fundamental rights of individuals:
  - Universal Declaration of Human Rights, adopted by the United Nations General Assembly (Article 12), 1948.
  - Act respecting the protection of personal information in the private sector, Quebec.
  - European General Data Protection Regulation (GDPR), voted in 2016 and effective since May 2018.
  - California Consumer Privacy Act, voted in 2018, effective in 2020.
- Risk: collection and use of digital traces and personal data for fraudulent purposes.
- Examples: targeted spam, identity theft, profiling, (unfair) discrimination (to be discussed later).
Impact of Big Data on privacy
1. Magnification of the privacy risks due to the increase in volume and diversity of the personal data collected and the computational power to process them.
2. Data collected about individuals are often "re-used" for a different purpose without asking their consent.
3. The inferences that are possible with Big Data are much more fine-grained and precise than before.
4. Massive release of data without taking into account the privacy aspect ⇒ major privacy breach.
   - Once data is disclosed, it is there forever (no right to be forgotten).
5. Ethics of inference: which inferences are acceptable for society and which ones are not?
Example of sensitive inference: predictive policing
Transparency, accountability and fairness
The fuzzy border between personalization and discrimination
- Example: price personalization on the Staples website depending on the location (Wall Street Journal, 2012).
- Possible discriminations: poor credit, high insurance rates, refusal of employment or access to schools, denial of some functions.
Example of sensitive context with risk of discrimination
Possible origin of the bias
1. Problem in the data collection, due to some error or to the fact that the data is inherently biased.
   - Examples: mistake in the profile of the user, dataset reflecting historical discriminatory decisions against a particular population.
2. Inaccuracy due to the learning algorithm.
   - Example: the algorithm is very accurate, except for 1% of the individuals.
Right to fairness and transparency (GDPR, Recital 71)
In order to ensure fair and transparent processing in respect of the data subject, taking into account the specific circumstances and context in which the personal data are processed, the controller should use appropriate mathematical or statistical procedures for the profiling, implement technical and organisational measures appropriate to ensure, in particular, that factors which result in inaccuracies in personal data are corrected and the risk of errors is minimised, [...] and that prevents, inter alia, discriminatory effects on natural persons on the basis of racial or ethnic origin, political opinion, religion or beliefs, trade union membership, genetic or health status or sexual orientation, or that result in measures having such an effect.
Different notions of fairness
1. Individual fairness: two individuals whose profiles are similar, except for the protected attributes, should receive the same decision.
   - Example: during recruitment for a job, candidates should be selected according to their competences, independently of their gender or ethnic origin.
2. Group fairness: the statistics of decisions targeting a particular group should be approximately the same as for the overall population.
   - Example: if the recruitment algorithm produces a binary output, "accepted" or "refused", the proportions of males and of females accepted should be the same.
- Challenge: some studies have demonstrated that there are scenarios in which these metrics are incompatible.
Example of incompatibility between two fairness notions
- Consider a school in which the admission process is based on the average of ratings at school, and imagine the situation in which women generally have a higher average than men.
- Scenario 1: In this case, women are more often accepted than men and we have individual fairness (the selection is made independently of gender) but . . .
  - No group fairness: the proportion of women accepted is higher than the proportion of men accepted.
- Scenario 2: To reach group fairness, we could artificially raise the average of men to ensure that a higher number of them are accepted.
  - No individual fairness: if a man and a woman have the same average, the man's rating will be artificially increased and he will have a higher chance of being admitted than the woman.
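The two admission scenarios above can be replayed numerically; a toy sketch with invented applicants and an invented threshold of 70:

```python
# Toy illustration of the two scenarios above (all numbers invented).
# Admission rule: accept if the average rating is at least 70.
applicants = [
    ("F", 80), ("F", 75), ("F", 72), ("F", 60),
    ("M", 78), ("M", 65), ("M", 62), ("M", 55),
]

def acceptance_rate(decisions, group):
    """Fraction of accepted candidates within one group."""
    in_group = [d for g, d in decisions if g == group]
    return sum(in_group) / len(in_group)

# Scenario 1: gender-blind rule -> individually fair, but group rates differ.
blind = [(g, 1 if avg >= 70 else 0) for g, avg in applicants]
print(acceptance_rate(blind, "F"), acceptance_rate(blind, "M"))  # 0.75 0.25

# Scenario 2: boost men's averages to equalize group rates -> individual
# fairness is lost (same average, different treatment depending on gender).
boost = 10
boosted = [(g, 1 if (avg + boost if g == "M" else avg) >= 70 else 0)
           for g, avg in applicants]
print(acceptance_rate(boosted, "F"), acceptance_rate(boosted, "M"))  # 0.75 0.75
```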
No consensus on the “right” notion of fairness
Figure: Example from the tutorial given by Arvind Narayanan at FAT* 2018
Opacity of machine learning algorithms
- Machine learning has a central role in most personalized systems.
- Opacity: difficulty of understanding and explaining their decisions due to their complex design.
- Example: the classifier output by a deep learning algorithm is typically composed of many layers of neural networks.
- Risk of "algorithmic dictatorship" (Rouvroy): loss of control of individuals over their digital lives due to automated decisions if there is no remediation procedure.
Transparency as a first step
- Asymmetry of information: strong difference between what the system knows about a person and what the person knows about the system.
- Lack of transparency leads to lack of trust.
- Strong need to improve the transparency of information systems.
Possible approaches to transparency
1. Regulatory approaches, to force companies to let users examine and correct the information collected about them.
2. Methods to increase transparency by "opening the black box".
- Tools to reach transparency by design.
- Examples: publication of the source code, use of an interpretable model in machine learning.
Community effort to increase transparency
Measuring discrimination
- Example: measurement of quantitative input influence.
- Challenge: possibility of indirect discrimination, in which the discriminatory attribute is inferred through other attributes.
- Example: even if ethnicity is not asked from the user, in some countries it strongly correlates with the ZIP code.
Understanding the bias of the algorithm through an interpretable model (Zeng, Ustun and Rudin 2016)
- Main idea: approximating a classifier through an interpretable model that can be inspected by a human.
- Example: use of a SLIM (Supersparse Linear Integer Model).
- Instead of using 42 binary attributes such as "Was the inmate between 18 and 24 years old during his first arrest?" or "Was the inmate released on probation?", the tool can select only 5 of them while obtaining a non-trivial fidelity to the original model.
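As an illustration of the idea only (not the actual model from the paper), a hypothetical SLIM-style scorecard with invented questions and small integer weights, readable at a glance:

```python
# Hypothetical SLIM-style scorecard (invented features and weights,
# NOT the model of Zeng, Ustun and Rudin): a handful of binary
# questions with small integer weights, interpretable by a human.
SCORECARD = [
    ("first arrest before age 25", 2),
    ("prior probation violation",  3),
    ("any prior arrests",          1),
]

def score(answers):
    """answers: dict feature -> bool; flag 'high risk' above a threshold."""
    total = sum(w for feat, w in SCORECARD if answers.get(feat))
    return total, total >= 3

print(score({"first arrest before age 25": True,
             "any prior arrests": True}))  # (3, True)
```

The interpretability comes from the form itself: anyone can recompute the score by hand and see exactly which question contributed what.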
Towards algorithmic accountability
- Caveat: transparency does not necessarily mean interpretability or accountability.
- Example: the code of an application could be public but too complex to be comprehended by a human.
- Strong need for the development of tools that can analyze and certify the code of a program.
- Objective: verify that the execution of the program matches the intended behaviour or the ethical values that are expected from it.
- Strong link with the notion of loyalty (does the system behave as promised?).
Example of NYC algorithmic task force
Example of FairTest report (based on normalized mutual information)
Looking from all possible angles to observe discrimination
- Challenge: depending on the way we split the population into subgroups, we may or may not observe discrimination.
- Simpson's paradox: statistical paradox stating that a particular phenomenon observed on some groups can seem inverted when the groups are combined.
- To obtain relevant results, one needs to look for discrimination by varying the groups observed.
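Simpson's paradox can be demonstrated with invented admission numbers: women have the higher acceptance rate in each department, yet the lower rate overall:

```python
# Toy illustration of Simpson's paradox (all numbers invented).
# (accepted, applied) per gender and department.
data = {
    "dept_A": {"women": (9, 10),    "men": (800, 1000)},
    "dept_B": {"women": (60, 1000), "men": (5, 100)},
}

def rate(accepted, applied):
    return accepted / applied

# Within each department, women have the HIGHER acceptance rate...
for dept, groups in data.items():
    assert rate(*groups["women"]) > rate(*groups["men"])

def overall(gender):
    """Acceptance rate when the departments are combined."""
    accepted = sum(groups[gender][0] for groups in data.values())
    applied = sum(groups[gender][1] for groups in data.values())
    return accepted / applied

# ...yet when departments are combined, women have the LOWER rate,
# because most women applied to the more selective department.
print(overall("women"), overall("men"))  # ~0.068 vs ~0.732
```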
Enhancing fairness
- Ultimate objective: being able to increase fairness without impacting accuracy too much.
- Examples of possible approaches:
  - Sample the input data to remove its original bias,
  - Change the design of the algorithm so that it becomes discrimination-aware by design,
  - Adapt the output produced by the algorithm (e.g., the classifier) to reduce discrimination.
- Active subject of research, but still in its infancy; much remains to be done.
- Difficulty: there are several ways to reach fairness.
- Example: for each candidate, we randomly decide if they are accepted or not (perfect fairness).
Diversity as a key aspect for improving fairness
Privacy
Privacy enhancing technologies
Privacy Enhancing Technologies (PETs): set of techniques for protecting the privacy of an individual and offering them better control over their personal data.
Example of PET: homomorphic encryption.
Two fundamental principles behind the PETs:
- Data minimization: only the information necessary for completing a particular purpose should be collected/revealed.
- Data sovereignty: enable users to keep control over their personal data and over how they are collected and disseminated.
Personally identifiable information
- Personally identifiable information: set of information that can be used to uniquely identify an individual.
- Examples: first and last name, social security number, place and date of birth, physical and email addresses, phone number, credit card number, biometric data (such as fingerprints and DNA), . . .
- Sensitive because they uniquely identify an individual and can easily be used to cross-reference databases.
- Main limits of the definition:
  - does not take into account attributes or patterns in the data that can seem innocuous individually but can identify an individual when combined together (quasi-identifiers);
  - does not take into account the inference potential of the data considered.
Pseudonymization is not an alternative to anonymization
Replacing the name of a person by a pseudonym does not imply preservation of the privacy of this individual.
(Extract from an article from the New York Times, 6 August 2006)
SUICA’s privacy leak (July 2013)
Legal requirements to evaluate anonymization methods
General Data Protection Regulation (Recital 26):
"To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."
- Consequence: the evaluation of the risk of de-anonymization should take into account the resources needed to conduct the re-identification, and should be done on a regular basis (risk-based approach).
- The French law "for a Digital Republic" (October 2016) also recognized the right of the French data protection authority (the CNIL) to certify anonymization processes.
Inference attack
- Inference attack: the adversary takes as input a published dataset (and possibly some background knowledge) and tries to infer some personal information regarding individuals contained in the dataset.
- Main challenge: to be able to give some privacy guarantees even against an adversary having some auxiliary knowledge.
- We may not even be able to model this a priori knowledge.
- Remark: maybe my data is private today, but it may not be so in the future due to the public release of some other data.
De-anonymization attack
- De-anonymization attack: the adversary takes as input a sanitized dataset and some background knowledge and tries to infer the identities of the individuals contained in the dataset.
- Specific form of inference attack.
- Example: Sweeney's original de-anonymization attack via linking.
- The re-identification risk measures the success probability of this attack.
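A minimal sketch of a Sweeney-style linking attack on invented data: a release stripped of names is joined to a public list on the quasi-identifiers (ZIP, birth date, sex):

```python
# Sketch of de-anonymization via linking (all records invented).
# The medical release drops names but keeps quasi-identifiers.
medical = [
    {"zip": "02138", "birth": "1945-07-29", "sex": "F", "diag": "hypertension"},
    {"zip": "02139", "birth": "1971-03-12", "sex": "M", "diag": "asthma"},
]
# Public voter list: names alongside the same quasi-identifiers.
voters = [
    {"name": "Alice", "zip": "02138", "birth": "1945-07-29", "sex": "F"},
    {"name": "Bob",   "zip": "02139", "birth": "1971-03-12", "sex": "M"},
    {"name": "Carol", "zip": "02139", "birth": "1980-01-01", "sex": "F"},
]

QI = ("zip", "birth", "sex")

def link(record, population):
    """Return the names of all people matching the record on every QI."""
    key = tuple(record[a] for a in QI)
    return [p["name"] for p in population
            if tuple(p[a] for a in QI) == key]

for rec in medical:
    matches = link(rec, voters)
    if len(matches) == 1:  # unique match => re-identification
        print(matches[0], "->", rec["diag"])
```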
Example : inference attacks on location data
Joint work with Marc-Olivier Killijian (LAAS-CNRS) and Miguel Nunez del Prado (Universidad del Pacifico).
Main objective: quantify the privacy risks of location data.
Types of attacks:
1. Identification of important places, called Points of Interest (POIs), characterizing the interests of an individual.
   - Examples: home, place of work, gymnasium, political headquarters, medical center, . . .
2. Prediction of the movement patterns of an individual, such as his past, present and future locations.
3. Linking the records of the same individual contained in the same dataset or in different datasets (either anonymized or under different pseudonyms).
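Attack 1 can be sketched naively (this is not the actual algorithm of the joint work): grid-cluster the trace and keep the cells visited often enough, here on an invented trace:

```python
# Naive POI-extraction sketch (illustrative, not the authors' algorithm):
# bucket GPS fixes into grid cells and keep the frequently visited ones.
from collections import Counter

def extract_pois(trace, cell=0.001, min_visits=3):
    """trace: list of (lat, lon); a POI is a grid cell with many fixes."""
    cells = Counter((round(lat / cell), round(lon / cell))
                    for lat, lon in trace)
    return {c for c, n in cells.items() if n >= min_visits}

# Invented trace: many fixes near "home", a few scattered elsewhere.
home = [(45.5081 + i * 1e-5, -73.554) for i in range(5)]
noise = [(45.52, -73.60), (45.53, -73.61)]
pois = extract_pois(home + noise)
print(len(pois))  # 1: only the home cell stands out
```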
Sanitization
Sanitization: process increasing the uncertainty in the data in order to preserve privacy.
⇒ Inherent trade-off between the desired level of privacy and the utility of the sanitized data.
Typical application: public release of data (offline or online context).
Examples drawn from the “sanitization” entry on Wikipedia
Classical sanitization mechanisms
- Perturbation: addition of noise to the true value.
- Aggregation: merging of several data items into a single one.
- Generalization: loss of granularity of the information.
- Deletion: erasure of the information related to a particular attribute.
  - Remark: the absence of information can sometimes lead to a privacy breach (e.g., removing the disease field of a patient record only when the patient has a sexually transmitted disease).
- Introduction of fake data: addition of artificial records to a database to "hide" the true data.
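Each mechanism above can be sketched in a few lines (illustrative helpers, not a real sanitization library):

```python
# Minimal sketches of the classical sanitization mechanisms above.
import random
import statistics

def perturb(value, scale=1.0):
    """Perturbation: add random noise to the true value."""
    return value + random.gauss(0, scale)

def aggregate(values):
    """Aggregation: merge several values into a single statistic."""
    return statistics.mean(values)

def generalize(age, width=10):
    """Generalization: replace an exact age by a coarse range."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def delete(record, attribute):
    """Deletion: erase one attribute from a record."""
    return {k: v for k, v in record.items() if k != attribute}

print(generalize(34))                                     # 30-39
print(delete({"name": "Alice", "disease": "flu"}, "disease"))
```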
Example of privacy model : k-anonymity (Sweeney 02)
- Privacy guarantee: in each group of the sanitized dataset, each individual will be identical to at least k − 1 others.
- Reached by a combination of generalization and suppression.
- Example of use: sanitization of medical data.
- Main challenge: extracting useful knowledge while preserving the confidentiality of individuals' sensitive data.
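Checking the k of a table is straightforward; a sketch on an invented, already-generalized table:

```python
# Sketch: compute the k for which a table is k-anonymous over its
# quasi-identifiers (invented records; "age" and "zip" already generalized).
from collections import Counter

def k_of(table, quasi_identifiers):
    """Size of the smallest group of records sharing the same QI values."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in table)
    return min(groups.values())

table = [
    {"age": "20-29", "zip": "131**", "disease": "flu"},
    {"age": "20-29", "zip": "131**", "disease": "asthma"},
    {"age": "30-39", "zip": "130**", "disease": "flu"},
    {"age": "30-39", "zip": "130**", "disease": "diabetes"},
]
print(k_of(table, ("age", "zip")))  # 2 -> the table is 2-anonymous
```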
Intersection attack
- Question: suppose that Alice's employer knows that she is 28 years old, that she lives in ZIP code 13012 and that she visits both hospitals. What does he learn?
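The attack can be sketched on invented hospital releases: each release taken alone leaves Alice two candidate diagnoses, but intersecting them leaves only one:

```python
# Sketch of the intersection attack (invented releases): each hospital's
# release is 2-anonymous on (age, zip) taken alone, but intersecting the
# candidate sets of the two releases pins Alice's diagnosis down.
hospital_1 = [  # (age, zip, diagnosis)
    (28, "13012", "flu"),
    (28, "13012", "cancer"),
]
hospital_2 = [
    (28, "13012", "cancer"),
    (28, "13012", "heart disease"),
]

def candidates(release, age, zip_code):
    """Diagnoses compatible with the given quasi-identifiers."""
    return {d for a, z, d in release if (a, z) == (age, zip_code)}

# Alice (28, ZIP 13012) appears in both releases, so her diagnosis is in:
print(candidates(hospital_1, 28, "13012")
      & candidates(hospital_2, 28, "13012"))  # {'cancer'}
```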
The key property: composition
- A "good" privacy model should provide some guarantees about the total leak of information revealed by two (or more) sanitized datasets.
- More precisely, if the first release reveals b1 bits of information and the second release b2 bits of information, the total amount of information leaked should not be more than O(b1 + b2) bits.
- Remark: most of the existing privacy models do not have any composition property, with the exception of differential privacy (Dwork 06).
Differential privacy: principle (Dwork 06)
- Privacy notion developed within the community of private data analysis that has gained widespread adoption.
- Basically ensures that whether or not an item is in the profile of an individual does not influence the output too much.
- Gives strong privacy guarantees that hold independently of the auxiliary knowledge of the adversary and that compose well.
Implementing differential privacy
Possible techniques to implement differential privacy :
- Addition of noise to the output of an algorithm (e.g., the Laplacian mechanism).
- Perturbation of the input given to the algorithm.
- Randomization of the behaviour of the algorithm.
- Creation of a synthetic database or a data structure "summarizing and aggregating the data".
- Sampling mechanisms.
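The first technique (output perturbation) can be sketched for a counting query: a count has sensitivity 1 (one individual changes it by at most 1), so adding Laplace noise of scale 1/ε yields ε-differential privacy. Data and ε below are illustrative:

```python
# Sketch of the Laplacian mechanism for a counting query (sensitivity 1).
import random

def laplace(scale):
    """Centered Laplace noise: difference of two i.i.d. exponentials."""
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def private_count(records, predicate, epsilon):
    """epsilon-differentially private count (query sensitivity is 1)."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace(1 / epsilon)

ages = [23, 35, 41, 29, 52, 61, 30]
print(private_count(ages, lambda a: a >= 30, epsilon=0.5))  # 5 + noise
```

Sequential composition then follows directly: answering the query twice with ε each consumes 2ε of privacy budget in total.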
Towards an adoption of differential privacy by web actors
Why one should always question the choice of the privacy parameter
Fire and Ice: Japanese competition on anonymization and re-identification attacks (2015, 2016, 2017 and 2018)
- Objective: evaluate empirically the effectiveness of anonymization methods and re-identification attacks.
- Similar in spirit to other competitions in machine learning or security.
Next steps
- To broaden the impact and the outreach to the privacy community, we are organizing an international competition on sanitization mechanisms and inference attacks.
- Schedule:
  - Last year: workshop at PETS for preparing the competition (definition of privacy and utility metrics, choice of the dataset, setting of the competition, . . . ).
  - Next year: international competition on sanitization and inference, starting at the beginning of 2019 on CrowdAI.
Conclusion
Conclusion (1/2)
- Observation 1: the capacity to record and store personal data has increased rapidly in recent years.
- Observation 2: "Big Data" will result in more and more data being available ⇒ increase in inference possibilities.
- Observation 3: the "Open Data" movement will lead to the release of a huge number of datasets ⇒ worsens the privacy impact of Big Data (observation 2).
- The advent of Big Data magnifies privacy risks that already existed, but also raises new ethical issues.
- Main challenge: balance the social and economic benefits of Big Data with the protection of privacy and the fundamental rights of individuals.
Conclusion (2/2)
- Strong need for research and scientific cooperation in Big Data:
  - for determining how to address privacy in this context,
  - for the design of new protection and sanitization mechanisms,
  - as well as for inference attacks, for assessing the privacy level they provide,
  - for finding solutions addressing the transparency, fairness and accountability issues.
- Overall objective: being able to reap the benefits of Big Data by not only protecting the privacy of individuals but also making sure that they remain in control of their digital lives.
This is the end
Thanks for your attention
Questions?
Fairness metrics (1/3)
- Consider an algorithm that produces a binary output o ∈ {0, 1} and whose input contains a unique protected attribute s ∈ {s−, s+}.
- Assume that the group s = s+ is the favored group for obtaining o = 1 and that the group s− is disadvantaged.
- P(o|s+) is the empirical probability of obtaining o knowing that the protected attribute has value s+ (P(o|s−) can be defined in the same manner).
Fairness metrics (2/3)
Examples of fairness metrics:
1. selection-lift: P(o|s+) / P(o|s−).
   - The higher the ratio, the more unfair the treatment.
   - Example: disparate impact, a criterion existing in US law used to measure unfair treatment.
   - According to the law, a treatment is unfair if P(o|s−) / P(o|s+) < 0.8 (the "80% rule").
2. difference-based selection-lift: P(o|s+) − P(o|s−).
   - The higher the difference, the more unfair the treatment.
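Both metrics can be computed from empirical selection rates; a sketch on invented decisions:

```python
# Sketch of the selection-lift metrics on invented decisions.
def rates(decisions):
    """decisions: list of (s, o) with s in {'+', '-'} and o in {0, 1}."""
    def p(group):
        outs = [o for s, o in decisions if s == group]
        return sum(outs) / len(outs)
    return p("+"), p("-")

# Invented outcomes: favored group accepted 8/10, disadvantaged 5/10.
decisions = ([("+", 1)] * 8 + [("+", 0)] * 2
             + [("-", 1)] * 5 + [("-", 0)] * 5)

p_plus, p_minus = rates(decisions)
print(p_plus / p_minus)        # selection-lift: 0.8 / 0.5 = 1.6
print(p_plus - p_minus)        # difference-based selection-lift: ~0.3
# 80% rule (disparate impact): flagged when P(o|s-)/P(o|s+) < 0.8.
print(p_minus / p_plus < 0.8)  # True -> this treatment is flagged
```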
Fairness metrics (3/3)
3. Mutual information: measures the correlation between the protected attribute and the output.
   - If this value is low, the two variables are close to independent ⇒ weak correlation.
   - Ideally, we would like the output to be independent of the protected attribute in order to have a fair decision.
- Examples of tools to measure fairness: FairTest, FairML, . . .
- Strong effort from the community to develop tools to quantify discrimination but . . .
  - currently there is no consensus on the definition of discrimination.
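The mutual information between a binary protected attribute and a binary output can be estimated from empirical counts; a sketch on invented decisions:

```python
# Sketch: empirical mutual information (in bits) between a protected
# attribute s and an output o, from a list of (s, o) pairs.
import math
from collections import Counter

def mutual_information(pairs):
    n = len(pairs)
    joint = Counter(pairs)
    ps = Counter(s for s, _ in pairs)
    po = Counter(o for _, o in pairs)
    mi = 0.0
    for (s, o), c in joint.items():
        p_so = c / n
        mi += p_so * math.log2(p_so / ((ps[s] / n) * (po[o] / n)))
    return mi

# Attribute and output independent -> MI is 0 (fair by this metric).
fair = [("+", 1), ("+", 0), ("-", 1), ("-", 0)] * 25
print(mutual_information(fair))    # 0.0
# Output fully determined by the attribute -> MI is maximal (1 bit here).
unfair = [("+", 1), ("-", 0)] * 50
print(mutual_information(unfair))  # 1.0
```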