Say "Hi!" to Your New Boss
How algorithms might soon control our lives (and why we should be careful with them)
Motivation
no alternatives, Google?
Outline
Theory
1. Algorithms
2. Machine Learning
3. Big Data & Consequences for Machine Learning
4. Use of Algorithms Today and in the Future
Experiments
5. Discriminating people with machine learning & algorithms
6. Creating persistent user identities by (accidental) de-anonymization
Summary & Outlook
7. Strategies for Handling Data Responsibly
Algorithms, Machine Learning & Big Data
Algorithms
An algorithm is a "recipe" that gives a computer (or a human) step-by-step instructions in order to achieve a certain goal.
[Flowchart: Start → door bell ringing → "Andreas stands on trapdoor?" → yes: open trapdoor / no: wait, our time will come.]
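The same recipe as a short Python sketch (tongue-in-cheek; the sensor function is obviously made up):

```python
import random

def andreas_stands_on_trapdoor() -> bool:
    # stand-in for a real sensor reading; assumed for this toy example
    return random.random() < 0.5

def on_doorbell() -> str:
    # the same step-by-step instructions as in the flowchart above
    if andreas_stands_on_trapdoor():
        return "Open trapdoor"
    return "Wait. Our time will come."

print(on_doorbell())
```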
Machine Learning
A machine learning algorithm automatically generates models and checks them against the training data we provide, trying to find a model that explains the data well and can predict unknown data.
Data vs. Model

y = m(x, p) + ε

see e.g. "Machine Learning" by Tom Mitchell (McGraw-Hill, 1997).

[Plot: data points y vs. x1, with a model curve fitted through them]
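A minimal sketch of the idea in Python, using synthetic data and an assumed linear model (not from the talk):

```python
import numpy as np
from scipy.optimize import curve_fit

# synthetic example data: y = m(x, p) + epsilon, with an assumed linear "truth"
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)  # epsilon: Gaussian noise

def m(x, slope, intercept):
    # candidate model with parameters p = (slope, intercept)
    return slope * x + intercept

p_opt, p_cov = curve_fit(m, x, y)
print("fitted p:", p_opt)  # close to the true values (2.0, 1.0)
```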
Sources of Error
ε = ε_sys + ε_noise + ε_hidden

systematic errors arise due to imperfect measurements of known variables
noise is present due to the nature of the process or our measurement apparatus
many variables are usually unknown to us
Big Data & Machine Learning
2000 → 2015:
• more data sources
• high data volume
• higher density
• higher frequency
• longer retention
Data Volume: More is (usually) better
Exploiting New Sources of Data
y = m(x, p) + ε_hidden + ...

incorporate variables that were hidden into the model, reducing error
Understanding Results
• Models can be easy or very difficult to interpret
• Parameter space is often huge and can't be explored entirely
[Diagram: a decision tree classifier with splits such as "age > 37?", "height < 1.78?", "projects > 19?" and yes/no branches (easy to interpret), next to a neural network classifier (hard to interpret)]
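A small illustration of the difference, using scikit-learn on made-up data (the feature names are invented):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical candidate data: three numeric features, binary outcome
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# the learned rules can be printed and read directly, unlike a neural network
print(export_text(tree, feature_names=["age", "height", "projects"]))
```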
Example: Deep Learning for Image Recognition
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Classifying Use of Algorithms
low risk: mildly annoying in case of failure / misbehaviour
medium risk: large impact on our life in case of failure / misbehaviour
high risk: critical impact on our life in case of failure / misbehaviour

low risk:
• personalization of services (e.g. recommendation engines for websites, video-on-demand, content, ...)
• individualized ad targeting
• customer rating / profiling
• consumer demand prediction

medium risk:
• personalized health
• person classification (e.g. crime, terrorism)
• autonomous cars / planes / machines ...
• automated trading

high risk:
• military intelligence / intervention
• political oppression
• critical infrastructure services (e.g. electricity)
• life-changing decisions (e.g. about health)
Big Data & Advances in Machine Learning
Data "Mishaps"
Two Experiments
Discriminating People With Algorithms
Humans can be prejudiced. Are algorithms better?
Discrimination
Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong to rather than on individual merit.
Wikipedia
Protected attributes (examples): Ethnicity, Gender, Sexual Orientation, ...
When is a process discriminating?
Disparate Impact: Adverse impact of a process C on a given group X
Outcome    X = 0            X = 1
C = NO     P(C=NO, X=0)     P(C=NO, X=1)
C = YES    P(C=YES, X=0)    P(C=YES, X=1)

Disparate impact: P(C=YES | X=0) / P(C=YES | X=1) < τ
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al. (arXiv)
When is a process discriminating?
Estimating with real-world data
Outcome    X = 0    X = 1
C = NO     a        b
C = YES    c        d

Estimate: (c / (a + c)) / (d / (b + d)) < τ
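A tiny sketch of this estimate in Python; the numbers and the threshold τ = 0.8 (a common choice, the "four-fifths rule") are illustrative assumptions:

```python
def disparate_impact(a: int, b: int, c: int, d: int) -> float:
    """Estimate P(C=YES|X=0) / P(C=YES|X=1) from the table above."""
    return (c / (a + c)) / (d / (b + d))

# made-up numbers: candidates from group X=0 are invited far less often
ratio = disparate_impact(a=80, b=50, c=20, d=50)
print(ratio)        # 0.4
print(ratio < 0.8)  # True: disparate impact under a threshold of tau = 0.8
```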
Discrimination through Data Analysis
Replacing a manual hiring process with an automated one.
Benefits:
• Save time screening CVs by hand
• Improve candidate choice
The Setup
[Diagram: today a human reads the submitted information (CV, cover letter, projects, ...) and decides C = YES (invite) or C = NO (don't invite); these past decisions C become the training data for an algorithm that receives the same information]
The Setup
• Use submitted information (CV, work samples) along with publicly available / external information to predict candidate success
• Use data from the manual process (invite / no invite) to train the classifier
• Provide it with as much data as possible to decrease the error rate ("Big Data" approach)
Our decision model
S = m(Y) + d(X) + ε

m(Y): score of the candidate (merit function)
d(X): discrimination malus/bonus
ε: hidden variables & luck (if you believe in it)

C = YES if S > t, NO otherwise

[Chart: composition of a candidate's score (merit, luck) without and with discrimination]
Training a predictor for C
information about Y (unprotected attributes)
additional information Z we give to the algorithm

Z ∝ X + ε: we can predict the value of X from Z with fidelity γ
A Simulation
• Generate 10,000 samples of C with disparate impact
• Train a classifier (e.g. a support vector machine) on the training data
• Provide it with (noisy) information about X
• Measure the algorithm-based discrimination on the test data (see the sketch below)
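A minimal reconstruction of such a simulation (my own sketch with assumed parameters, not the talk's actual code):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X_attr = rng.integers(0, 2, n)                   # protected attribute X
Y = rng.normal(0.0, 1.0, n)                      # merit
S = Y + 0.5 * X_attr + rng.normal(0.0, 0.3, n)   # d(X): bonus for X=1, i.e. malus for X=0
C = (S > 0.5).astype(int)                        # biased historical decisions

Z = X_attr + rng.normal(0.0, 0.5, n)             # noisy proxy that leaks X
features = np.column_stack([Y, Z])               # note: X itself is never included

f_tr, f_te, c_tr, c_te, x_tr, x_te = train_test_split(
    features, C, X_attr, test_size=0.25, random_state=0)

clf = SVC().fit(f_tr, c_tr)
pred = clf.predict(f_te)

rate_0 = pred[x_te == 0].mean()  # P(C=YES | X=0) according to the classifier
rate_1 = pred[x_te == 1].mean()  # P(C=YES | X=1)
print("disparate impact of trained classifier:", rate_0 / rate_1)
```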
Discrimination by Algorithm

[Charts: disparate impact on the protected class as a function of γ (how much information about X leaks into the data)]

[Chart: how well our algorithm can explain the data: ~87 % merit, 6-8 % discrimination, 8 % luck / noise]

[Chart: how much our algorithm discriminates against people in group X]
Why give that information (Z) to the algorithm? We don't! But information about X leaks through anyway...
But can it be done?
Discrimination through information leakage is possible, but how likely is it in practice?
Let's try! We use publicly available data to predict the gender of Github users (protected attribute X).
Basic Information: Manually classify users as men/women (by looking at profile pictures, names) → 5,000 training samples with small error
Use the Github API to retrieve information about users (followers, repositories, stargazers, contributions, ...)
We only use data that is easy to get and likely to be used for classification in a real-world setting
We only use a limited dataset (proof of concept, not optimized)
Stargazers, Followers, Projects, ...
No predictive power for X
Github Event Data
https://www.githubarchive.org/
PushEvent, 2015-03-17 21:21h, 3 commits, Log: "..."
PullRequestEvent, 2015-03-17 22:43h
CommentEvent, 2015-03-17 23:14h, "Hi, I think we should add more cats to the landing page"
Hourly event patterns & event types
Commit Message Analysis
Use the commit messages (as obtained from the event data) to predict gender by training a Support Vector Machine (SVM) classifier on the word frequency data; a minimal sketch follows below.
[Word cloud of predictive terms: lol, emoji, wtf, seriously, rtfm, dude, fuck, git, ...]
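A minimal sketch of this pipeline with scikit-learn; the messages and labels below are invented stand-ins for the hand-labelled training set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy stand-in for the manually labelled training data (messages and labels invented)
messages = ["fix typo lol", "rtfm dude, read the docs",
            "add more cats to the landing page", "refactor build scripts",
            "wtf seriously, tests are broken", "update readme"]
labels = [0, 0, 1, 1, 0, 1]  # 0 / 1: the manually assigned gender classes

# word-frequency features feeding a linear SVM, as described above
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(messages, labels)
print(clf.predict(["lol wtf, fix this"]))
```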
Predictive Power of Model
35 % classification error vs. 50 % baseline (random guessing), i.e. 15 % above chance
≈ 30 % information leakage (with a very simple data set)
Takeaways
Algorithms will readily "learn" discrimination from us if we provide them with contaminated training data.
Information leakage of protected attributes can happen easily.
How we can fix this
Harder than you might think! We need to know X to measure disparate impact and remove it
Incorporate a penalty for discrimination into the target function
Remove information about X from dataset by performing a suitable transformation (reduces fidelity of model)
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al. (arXiv)
Oh, it's you again! De-anonymizing data
What is de-anonymization?
Use data recorded about individuals / entities to identify those same individuals / entities in another set of data (exactly or with high likelihood).
De-anonymization becomes an increasing risk as datasets about individual entities become larger and more detailed.
"Buckets of Truth"
N boolean attributes per entity - on average M < N of them are set
fun with de-anonymization: http://en.akinator.com
Examples
uniform distribution: P_col = (1 - 2p(1 - p))^N
long-tailed distribution: P_col = ?
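A quick numeric check of the uniform-distribution formula, assuming independent attributes (p and N are arbitrary example values):

```python
# chance that two random entities agree on one boolean attribute (both set or
# both unset) is p**2 + (1 - p)**2 = 1 - 2*p*(1 - p); independence assumed
def p_collision(p: float, n: int) -> float:
    return (1 - 2 * p * (1 - p)) ** n

print(p_collision(p=0.5, n=20))  # ~9.5e-07: fingerprints are almost unique
print(p_collision(p=0.1, n=20))  # ~0.019: skewed attributes collide more often
```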
Geolife Trajectories
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/
Question: How easy is it to re-identify single users through their data?
Could an algorithm build a representation of a given user?
Individual trajectories (color-coded)
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/
How good are our buckets?
fit of the form e^(-(x/a)^γ); γ here is the interesting information
Identifying / comparing fingerprints
s(u_i, u_j) = (f(u_i) · f(u_j)) / (‖f(u_i)‖ · ‖f(u_j)‖)

[Illustration: element-wise comparison of two bucket fingerprints]
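This is just cosine similarity between fingerprint vectors; a minimal sketch with invented bucket counts:

```python
import numpy as np

def similarity(f_i: np.ndarray, f_j: np.ndarray) -> float:
    # cosine similarity between two fingerprint vectors, as in the formula above
    return np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j))

# invented bucket-count fingerprints of two observations
f_a = np.array([3, 0, 1, 7, 0, 2], dtype=float)
f_b = np.array([2, 0, 0, 9, 1, 1], dtype=float)
print(similarity(f_a, f_b))  # ~0.96: likely the same person
```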
Testing De-Anonymization
Use 75 % of the trajectories as prior data set
Predict the user ID belonging to the remaining 25 %
Measure the average success probability and the identification rank (i.e. at which position the correct user appears)
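A sketch of this evaluation loop; the fingerprints below are randomly generated stand-ins for the real trajectory buckets:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
n_users, n_buckets = 50, 100

# invented per-user bucket fingerprints, split into "prior" and "test" parts
base = rng.poisson(2.0, size=(n_users, n_buckets)).astype(float)
prior = 0.75 * base + rng.normal(0.0, 0.5, size=base.shape)
test = 0.25 * base + rng.normal(0.0, 0.5, size=base.shape)

sims = cosine_similarity(test, prior)  # each test user vs. every known user
ranks = []
for i in range(n_users):
    order = np.argsort(-sims[i])                       # most similar user first
    ranks.append(int(np.where(order == i)[0][0]) + 1)  # position of correct user

ranks = np.array(ranks)
print("identification rate (rank 1):", (ranks == 1).mean())
print("mean identification rank:", ranks.mean())
```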
Identification Rate
Finding Similar Users
Possible Improvements
• Use temporal / sequence information
• Use speed of movement / mode of transportation
Improve choice of buckets for fingerprinting
Interesting review article: "Life in the network: the coming age of computational social science", D. Lazer et al.
Summary
The more data we have, the more difficult it is to keep algorithms from directly learning and using object identities instead of attributes.
Our data follows us around!
What can we do?
As Data Scientists / Analysts / Programmers
Consume data responsibly: Don't include everything under the sun just because it increases fidelity by a slim margin
Check for disparate impact and remove it from the input data
Test anonymization safety by using machine learning
Train data scientists in safety and risks of data analysis
As Citizens / Hackers / Users
Do not blindly trust decisions made by algorithms
Test them if possible (using different input values)
Reverse-engineer them (using e.g. active learning)
Fight back with data: Collect and analyze algorithm-based decisions using collaborative approaches
As a Society
Create better regulations for algorithms and their use
Force companies / organizations to open up black boxes
Make access to data easier, also for small organizations
Impede the creation of data monopolies
Algorithms are like children: smart & eager to learn.
So let's make sure we raise them to be responsible adults.
Thanks!
Slides slideshare.net/japh44Website andreas-dewes.de/en
Code (coming soon) github.com/adewes/32c3E-Mail [email protected]
Twitter @japh44License Creative Commons Attribution 4.0
International(except Google Deep Learning image)
Intro
Whenever we measure user actions, we (automatically) gain information about them that we can use to classify them.
Classifying and Controlling People
Case Study: Click Rate Optimization
Simple but common use case for big data: Collaborative filtering
• Users have an opinion on a given topic A (between 0 and 1)
• They are more likely to like articles that confirm their opinion
• Our algorithm knows nothing about A, it just tries to optimize the click rate
• User opinion may change over time according to the content he/she is exposed to (2 % change per exposure)
Mathematical Model
P(Like) ∝ 1 - |A_article - A_user| + ε_mood
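A toy reconstruction of the whole setup, including the 2 % opinion feedback used later (my own sketch; the parameters and the like-probability are assumptions, not the talk's code):

```python
import random

def simulate(n_views=60, gamma=0.02, optimize=True):
    """Toy reconstruction of the model above."""
    a_user = random.random()  # the user's opinion on topic A, in [0, 1]
    best = random.random()    # the algorithm's current best-guess article
    likes = 0
    for _ in range(n_views):
        a_article = best if optimize else random.random()
        mood = random.gauss(0.0, 0.05)
        if random.random() < 1 - abs(a_article - a_user) + mood:  # P(Like)
            likes += 1
        elif optimize:
            best = random.random()  # article flopped, try something different
        # feedback: the opinion drifts 2 % towards the content just seen
        a_user += gamma * (1 if a_article > a_user else -1)
    return likes / n_views, a_user

print(simulate(optimize=True))   # higher like rate, opinion pulled along
print(simulate(optimize=False))  # baseline: just observe
```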
[Chart: Like Rate vs. Articles Viewed (only observe, don't optimize)]
What have we learned?
60 observations per user
Clustering users into groups
Similarity measure: # of articles that both users like or dislike
Clustering: K-Means (minimize distance within clusters, maximize distance between clusters)
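A minimal clustering sketch with scikit-learn on an invented like/dislike matrix; note that sklearn's K-Means uses Euclidean distance, which on 0/1 vectors amounts to counting disagreements, close to the similarity measure above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# invented like/dislike matrix: rows are users, columns are articles (1 = liked)
likes = np.vstack([
    rng.binomial(1, 0.8, size=(20, 30)),  # users who mostly like articles on A
    rng.binomial(1, 0.2, size=(20, 30)),  # users who mostly dislike them
])

# users who like/dislike the same articles end up in the same cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(likes)
print(kmeans.labels_)  # the two opinion groups separate cleanly
```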
[Chart: Like Rate vs. Articles Viewed (with click-rate optimization)]
Consequence of optimization: "Filter Bubbles"
Switching On User Feedback
A_user^(t+1) = A_user^t + γ · sgn(A_article - A_user^t)
User opinions with and without feedback

[Charts: distribution of user opinions, no feedback vs. 2 % feedback]

The algorithm has an incentive to steer opinions towards the center.
Summary