
Page 1: Say "Hi!" to Your New Boss

Say "Hi!" to Your New Boss

How algorithms might soon control our lives (and why we should be careful with them)

Page 2: Say "Hi!" to Your New Boss

Motivation

no alternatives, Google?

Page 3: Say "Hi!" to Your New Boss

Outline

Theory
1. Algorithms
2. Machine Learning
3. Big Data & Consequences for Machine Learning
4. Use of Algorithms Today and in the Future

Experiments
5. Discriminating People with Machine Learning & Algorithms
6. Creating Persistent User Identities by (Accidental) De-anonymization

Summary & Outlook
7. Strategies for Handling Data Responsibly

Page 4: Say "Hi!" to Your New Boss

Algorithms, Machine Learning & Big Data

Page 5: Say "Hi!" to Your New Boss

Algorithms

An algorithm is a "recipe" that gives a computer (or a human) step-by-step instructions in order to achieve a certain goal.

[Flowchart: Start, door bell ringing, "Is Andreas standing on the trapdoor?"; yes: open trapdoor; no: wait, our time will come.]
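As a toy illustration, the "recipe" above can be written down directly as code (the function name and return strings are ours):

```python
def doorbell_algorithm(andreas_on_trapdoor: bool) -> str:
    """Step-by-step instructions: what to do when the door bell rings."""
    if andreas_on_trapdoor:
        return "open trapdoor"
    return "wait, our time will come"
```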

Page 6: Say "Hi!" to Your New Boss

Machine Learning

A machine learning algorithm automatically generates models and checks them against the training data we provide, trying to find a model that explains the data well and can predict unknown data.

Page 7: Say "Hi!" to Your New Boss

Data vs. Model

y = m(x, p) + ε

see e.g. "Machine Learning" by Tom Mitchell (McGraw-Hill, 1997).

[Plot: data points y vs. x1 with a model curve fitted to them.]
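The data-vs.-model picture can be reproduced in a few lines: we generate noisy data y = m(x, p) + ε from a known "true" model and fit a straight line to it. This is a minimal sketch; the least-squares fit and all constants are our choices:

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of the model y = p0 + p1 * x, i.e. m(x, p)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    p1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    p0 = mean_y - p1 * mean_x
    return p0, p1

random.seed(0)
xs = [x / 10 for x in range(100)]
# data = model + noise: y = 1 + 2*x + eps
ys = [1 + 2 * x + random.gauss(0, 0.1) for x in xs]
p0, p1 = fit_line(xs, ys)   # recovers p0 ~ 1, p1 ~ 2
```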

Page 8: Say "Hi!" to Your New Boss


Page 9: Say "Hi!" to Your New Boss

Sources of Error

ε = ε_sys + ε_noise + ε_hidden

Systematic errors arise due to imperfect measurements of known variables.

Noise is present due to the nature of the process or our measurement apparatus.

Many variables are usually unknown to us.

Page 10: Say "Hi!" to Your New Boss

Big Data & Machine Learning

2000 → 2015:

- more data sources
- high data volume
- higher density
- higher frequency
- longer retention

Page 11: Say "Hi!" to Your New Boss

Data Volume: More is (usually) better

Page 12: Say "Hi!" to Your New Boss

Data Volume: More is (usually) better

Page 13: Say "Hi!" to Your New Boss

Exploiting New Sources of Data

y = m(x, p) + ε_hidden + …

incorporate variables that were hidden into the model, reducing the error

Page 14: Say "Hi!" to Your New Boss

Understanding Results

Models can be easy or very difficult to interpret. The parameter space is often huge and can't be explored entirely.

[Figure: a decision tree classifier (easy to interpret: "age > 37?", "height < 1.78?", "projects > 19?", with yes/no branches) next to a neural network classifier (hard to interpret).]

Page 15: Say "Hi!" to Your New Boss

Example: Deep Learning for Image Recognition

http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html

Page 16: Say "Hi!" to Your New Boss

Classifying Use of Algorithms

low risk: mildly annoying in case of failure / misbehaviour

medium risk: large impact on our life in case of failure / misbehaviour

high risk: critical impact on our life in case of failure / misbehaviour

Page 17: Say "Hi!" to Your New Boss

low risk:

personalization of services (e.g. recommendation engines for websites, video-on-demand, content, ...)

individualized ad targeting

customer rating / profiling

consumer demand prediction

Page 18: Say "Hi!" to Your New Boss

medium risk:

personalized health

person classification (e.g. crime, terrorism)

autonomous cars/ planes/ machines ...

automated trading

Page 19: Say "Hi!" to Your New Boss

high risk:

military intelligence / intervention

political oppression

critical infrastructure services (e.g. electricity)

life-changing decisions (e.g. about health)

Page 20: Say "Hi!" to Your New Boss

Big Data & Advances in Machine Learning

Page 21: Say "Hi!" to Your New Boss

Data "Mishaps"

Two Experiments

Page 22: Say "Hi!" to Your New Boss

Discriminating People with Algorithms

Humans can be prejudiced. Are algorithms better?

Page 23: Say "Hi!" to Your New Boss

Discrimination

Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong, rather than on individual merit.

Wikipedia

Protected attributes (examples): Ethnicity, Gender, Sexual Orientation, ...

Page 24: Say "Hi!" to Your New Boss

When is a process discriminating?

Disparate Impact: Adverse impact of a process C on a given group X

Outcome | X = 0             | X = 1
C = NO  | P(C = NO, X = 0)  | P(C = NO, X = 1)
C = YES | P(C = YES, X = 0) | P(C = YES, X = 1)

Disparate impact ratio: P(C = YES | X = 0) / P(C = YES | X = 1)

see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al. (arxiv.org)

Page 25: Say "Hi!" to Your New Boss

When is a process discriminating?

Estimating with real-world data

Outcome | X = 0 | X = 1
C = NO  | a     | b
C = YES | c     | d

Estimated ratio: (c / (a + c)) / (d / (b + d))
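The estimated ratio above can be computed directly from the counts (a small helper; the counts in the example are hypothetical):

```python
def disparate_impact(a, b, c, d):
    """P(C = YES | X = 0) / P(C = YES | X = 1), estimated from counts.

    a, b: C = NO counts for groups X = 0 and X = 1
    c, d: C = YES counts for groups X = 0 and X = 1
    """
    return (c / (a + c)) / (d / (b + d))

# hypothetical counts: 40 of 100 X=0 candidates invited vs. 50 of 100 X=1
ratio = disparate_impact(a=60, b=50, c=40, d=50)   # 0.4 / 0.5 = 0.8
```

Following Feldman et al., a ratio below 0.8 is often taken as a red flag (the legal "80 % rule").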

Page 26: Say "Hi!" to Your New Boss

Discrimination through Data Analysis

Replacing a manual hiring process with an automated one.

Benefits:
- Save time screening CVs by hand
- Improve candidate choice

Page 27: Say "Hi!" to Your New Boss

The Setup

[Diagram: candidate information (CV, cover letter, projects, ...) flows to a human or to an algorithm; either outputs C = YES (invite) or C = NO (don't invite). The human's past decisions serve as training data for the algorithm.]

Page 28: Say "Hi!" to Your New Boss

The SetupUse submitted information (CV, work samples) along with publicly available / external information to predict candidate success.

Use data from the manual process (invite/ no invite) to train the classifier

Provide it with as much data as possible to decrease error rate ("Big Data" approach)

Page 29: Say "Hi!" to Your New Boss

Our decision model

S = m(Y) + d(X) + ε

m(Y): score of the candidate (merit function)
d(X): discrimination malus/bonus
ε: hidden variables & luck (if you believe in it)

C = YES if S > t, NO otherwise

[Plot: candidate scores decomposed into merit, luck, and discrimination, without and with discrimination.]

Page 30: Say "Hi!" to Your New Boss

Training a predictor for C

information about Y (unprotected attributes)

additional information we give to the algorithm:

Z ∝ X + ε_γ: we can predict the value of X from Z with fidelity γ

Page 31: Say "Hi!" to Your New Boss

A Simulation

• Generate 10,000 samples of C with disparate impact
• Train a classifier (e.g. a support vector machine) on the training data
• Provide it with (noisy) information about X
• Measure the algorithm-based discrimination on the test data
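The simulation can be sketched as follows. To keep the sketch dependency-free we replace the trained SVM with a fixed linear score that mimics what a classifier learns from biased labels, and all numbers (the 0.5 bias, the noise levels) are invented for illustration:

```python
import random

random.seed(42)

def make_sample():
    """One candidate: protected attribute x, merit, a biased historical
    decision c, and a noisy proxy z that leaks information about x."""
    x = random.random() < 0.5                    # protected attribute
    merit = random.gauss(0, 1)                   # Y: candidate merit
    bias = -0.5 if x else 0.0                    # d(X): discrimination malus
    c = merit + bias > 0                         # historical (biased) decision
    z = (1.0 if x else 0.0) + random.gauss(0, 0.5)   # leaky proxy for x
    return merit, z, x, c

test_set = [make_sample() for _ in range(10_000)]

# Stand-in for the trained SVM: a linear score that has learned to use
# the proxy z, as a classifier trained on the biased labels would.
def predict(merit, z):
    return merit - 0.5 * z > 0

def selection_rate(samples, group):
    members = [(m, z) for m, z, x, c in samples if x == group]
    return sum(predict(m, z) for m, z in members) / len(members)

# disparate impact of the algorithm's decisions on the protected group
impact = selection_rate(test_set, True) / selection_rate(test_set, False)
```

Even though X itself is never given to the classifier, the leaked proxy z is enough to reproduce the discrimination (the ratio comes out well below 1).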

Page 32: Say "Hi!" to Your New Boss

Discrimination by Algorithm

Page 33: Say "Hi!" to Your New Boss

Discrimination by Algorithm

(how much information about X leaks into the data)

Page 34: Say "Hi!" to Your New Boss

Discrimination by Algorithm

(disparate impact on protected class)

Page 35: Say "Hi!" to Your New Boss

Discrimination by Algorithm

how well our algorithm can explain the data:

8 % luck / noise
6-8 % discrimination
87 % merit

Page 36: Say "Hi!" to Your New Boss

Discrimination by Algorithm

how much our algorithm discriminates against people in group X

Page 37: Say "Hi!" to Your New Boss

Discrimination by Algorithm

Page 38: Say "Hi!" to Your New Boss

Why give that information (Z) to the algorithm?

We don't! But information about X leaks through anyway...

Page 39: Say "Hi!" to Your New Boss

But can it be done?

Discrimination through information leakage is possible, but how likely is it in practice?

Let's try! We use publicly available data to predict the gender of Github users (protected attribute X).

Page 40: Say "Hi!" to Your New Boss

Basic Information

Manually classify users as men/women (by looking at profile pictures and names) → 5,000 training samples with small error

Use the GitHub API to retrieve information about users (followers, repositories, stargazers, contributions, ...)

We only use data that is easy to get and likely to be used in a real-world setting for classification

We only use a limited dataset (proof of concept, not optimized)

Page 41: Say "Hi!" to Your New Boss

Stargazers, Followers, Projects, ...

No predictive power for X

Page 42: Say "Hi!" to Your New Boss

GitHub Event Data

https://www.githubarchive.org/

PushEvent, 2015-03-17 21:21h, 3 commits, Log: "..."

PullRequestEvent, 2015-03-17 22:43h

CommentEvent, 2015-03-17 23:14h, "Hi, I think we should add more cats to the landing page"

Page 43: Say "Hi!" to Your New Boss

Hourly event patterns & event types

Page 44: Say "Hi!" to Your New Boss

Commit Message Analysis

Use the commit messages (as obtained from the event data) to predict gender by training a Support Vector Machine (SVM) classifier on the word frequency data.
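A minimal sketch of this pipeline, assuming scikit-learn is available; the commit messages and class labels below are invented purely to show the mechanics (word-frequency features feeding a linear SVM):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# invented toy corpus and labels, for illustration only
messages = [
    "fix typo in readme", "lol forgot the semicolon again",
    "refactor database layer", "wtf why does this even work",
    "add unit tests for parser", "seriously rtfm before committing",
]
labels = [0, 1, 0, 1, 0, 1]

# word-frequency features -> linear SVM classifier
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(messages, labels)
prediction = clf.predict(["lol wtf seriously"])
```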

[Word cloud of gender-predictive commit-message terms: lol, emoji, wtf, seriously, rtfm, dude, fuck, git, ...]

Page 45: Say "Hi!" to Your New Boss

Predictive Power of Model

error: 35 % vs. 50 % baseline (15 % better than chance)

→ 30 % information leakage (with a very simple data set)

Page 46: Say "Hi!" to Your New Boss

Takeaways

Algorithms will readily "learn" discrimination from us if we provide them with contaminated training data.

Information leakage of protected attributes can happen easily.

Page 47: Say "Hi!" to Your New Boss

How we can fix this

Harder than you might think! We need to know X to measure disparate impact and remove it.

Incorporate a penalty for discrimination into the target function

Remove information about X from the dataset by performing a suitable transformation (reduces the fidelity of the model)

see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al. (arxiv.org)
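The second strategy (removing information about X via a transformation) can be sketched in the spirit of Feldman et al.'s "repair": map each group's feature values onto the pooled distribution, so the groups become indistinguishable while rank order within each group is preserved. This is our simplification, not the paper's exact procedure:

```python
def repair_feature(values, groups):
    """Map each group's feature values onto common (pooled) quantiles,
    removing group-specific information but keeping within-group ranks."""
    pooled = sorted(values)
    repaired = [0.0] * len(values)
    for g in set(groups):
        idx = [i for i, gr in enumerate(groups) if gr == g]
        order = sorted(idx, key=lambda i: values[i])
        for rank, i in enumerate(order):
            # pick the pooled value at the same within-group quantile
            q = rank / max(len(idx) - 1, 1)
            repaired[i] = pooled[round(q * (len(pooled) - 1))]
    return repaired
```

After the repair, both groups share the same value distribution, so a classifier can no longer infer the group from this feature, at the price of some model fidelity.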

Page 48: Say "Hi!" to Your New Boss

Oh, it's you again! De-anonymizing data

Page 49: Say "Hi!" to Your New Boss

What is de-anonymization?

Use data recorded about individuals / entities to identify those same individuals / entities in another set of data (exactly or with high likelihood).

De-anonymization becomes an increasing risk as datasets about individual entities become larger and more detailed.

Page 50: Say "Hi!" to Your New Boss

"Buckets of Truth"

N boolean attributes per entity; on average M < N of them are set.

fun with de-anonymization: http://en.akinator.com

Page 51: Say "Hi!" to Your New Boss

Examples

Uniform distribution: P_col = (1 - 2p(1 - p))^N

Long-tailed distribution: P_col = ?
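For the uniform case, the formula follows from the per-attribute agreement probability p^2 + (1 - p)^2 = 1 - 2p(1 - p); a one-line check:

```python
def collision_probability(p: float, n: int) -> float:
    """Probability that two independent entities agree on all n boolean
    attributes when each attribute is set with probability p.
    Per-attribute agreement: p^2 + (1 - p)^2 = 1 - 2p(1 - p)."""
    return (1 - 2 * p * (1 - p)) ** n

# e.g. 50 attributes, each set half the time: collisions become negligible
p_col = collision_probability(0.5, 50)   # = 0.5 ** 50
```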

Page 52: Say "Hi!" to Your New Boss

Geolife Trajectories

http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/

Question:How easy is it to re-identify single users through their data?

Could an algorithm build a representation of a given user?

Page 53: Say "Hi!" to Your New Boss
Page 54: Say "Hi!" to Your New Boss

Individual trajectories (color-coded)

http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/

Page 55: Say "Hi!" to Your New Boss
Page 56: Say "Hi!" to Your New Boss
Page 57: Say "Hi!" to Your New Boss
Page 58: Say "Hi!" to Your New Boss
Page 59: Say "Hi!" to Your New Boss

How good are our buckets?

[Plot: distribution of bucket values with a fit of the form e^(-x^a·γ); here's the interesting information.]

Page 60: Say "Hi!" to Your New Boss

Identifying / comparing fingerprints

s(u_i, u_j) = (f(u_i) · f(u_j)) / (‖f(u_i)‖ · ‖f(u_j)‖)

(cosine similarity between the fingerprint vectors of users u_i and u_j)
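The similarity measure above is the standard cosine similarity; written out (function name ours):

```python
import math

def cosine_similarity(f_i, f_j):
    """s(u_i, u_j): cosine of the angle between two fingerprint vectors."""
    dot = sum(a * b for a, b in zip(f_i, f_j))
    norm_i = math.sqrt(sum(a * a for a in f_i))
    norm_j = math.sqrt(sum(b * b for b in f_j))
    return dot / (norm_i * norm_j)
```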

Page 61: Say "Hi!" to Your New Boss

Testing De-Anonymization

Use 75 % of the trajectories as prior data set

Predict the user ID belonging to the remaining 25 %

Measure the average success probability and the identification rank (i.e. at which position the correct user appears)
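The identification-rank measurement can be sketched like this (pure Python; the example fingerprints are invented):

```python
import math

def rank_of_true_user(query_fp, fingerprints, true_id):
    """Rank (1 = best match) of the true user when all stored fingerprints
    are sorted by cosine similarity to the query fingerprint."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    ranking = sorted(fingerprints,
                     key=lambda uid: cos(query_fp, fingerprints[uid]),
                     reverse=True)
    return ranking.index(true_id) + 1

# hypothetical bucket-vector fingerprints for three users
fps = {"alice": [1.0, 0.0, 0.0],
       "bob":   [0.0, 1.0, 0.0],
       "carol": [0.9, 0.1, 0.0]}
```

A rank of 1 means the user was re-identified directly from the held-out 25 % of their data.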

Page 62: Say "Hi!" to Your New Boss

Identification Rate

Page 63: Say "Hi!" to Your New Boss

Finding Similar Users

Page 64: Say "Hi!" to Your New Boss

Possible Improvements

Use temporal / sequence information

Use speed of movement / mode of transportation

Improve choice of buckets for fingerprinting

Interesting review article: "Life in the network: the coming age of computational social science", D. Lazer et al.

Page 65: Say "Hi!" to Your New Boss

Summary

The more data we have, the more difficult it is to keep algorithms from directly learning and using object identities instead of attributes.

Our data follows us around!

Page 66: Say "Hi!" to Your New Boss

What can we do?

Page 67: Say "Hi!" to Your New Boss

As Data Scientists / Analysts / Programmers

Consume data responsibly: Don't include everything under the sun just because it increases fidelity by a slim margin

Check for disparate impact and remove it from the input data

Test anonymization safety by using machine learning

Train data scientists in safety and risks of data analysis

Page 68: Say "Hi!" to Your New Boss

As Citizens / Hackers / Users

Do not blindly trust decisions made by algorithms

Test them if possible (using different input values)

Reverse-engineer them (using e.g. active learning)

Fight back with data: Collect and analyze algorithm-based decisions using collaborative approaches

Page 69: Say "Hi!" to Your New Boss

As a Society

Create better regulations for algorithms and their use

Force companies / organizations to open up black boxes

Make access to data easier, also for small organizations

Impede the creation of data monopolies

Page 70: Say "Hi!" to Your New Boss

Algorithms are like children: smart & eager to learn.

So let's make sure we raise them to be responsible adults.

Page 71: Say "Hi!" to Your New Boss

Thanks!

Slides: slideshare.net/japh44
Website: andreas-dewes.de/en
Code (coming soon): github.com/adewes/32c3
E-Mail: [email protected]
Twitter: @japh44
License: Creative Commons Attribution 4.0 International (except the Google Deep Learning image)

Page 72: Say "Hi!" to Your New Boss

Result

Page 73: Say "Hi!" to Your New Boss

Intro

Whenever we measure user actions, we (automatically) gain information about them that we can use to classify them.

Page 74: Say "Hi!" to Your New Boss
Page 75: Say "Hi!" to Your New Boss

Classifying and Controlling People

Page 76: Say "Hi!" to Your New Boss

Case Study: Click Rate Optimization

Simple but common use case for big data: Collaborative filtering

• Users have an opinion on a given topic A (between 0 and 1)
• They are more likely to like articles that confirm their opinion
• Our algorithm knows nothing about A; it just tries to optimize the click rate
• User opinion may change over time according to the content he/she is exposed to (2 % change per exposure)

Page 77: Say "Hi!" to Your New Boss

Mathematical Model

P(Like) ∝ |A_article - A_user| + ε_mood
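One concrete reading of this model, normalized so that matching opinions give the highest like probability (consistent with "users like articles that confirm their opinion"); the 1 - distance form and the clipping are our choices:

```python
def p_like(a_article: float, a_user: float, eps_mood: float = 0.0) -> float:
    """Like probability: highest when the article matches the user's
    opinion, plus a mood term; clipped to a valid probability."""
    p = 1.0 - abs(a_article - a_user) + eps_mood
    return min(1.0, max(0.0, p))
```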

Page 78: Say "Hi!" to Your New Boss

Like Rate vs. Articles Viewed

Page 79: Say "Hi!" to Your New Boss

Like Rate vs. Articles Viewed (only observe, don't optimize)

Page 80: Say "Hi!" to Your New Boss

What have we learned?

60 observations / user

Page 81: Say "Hi!" to Your New Boss

Clustering users into groups

Similarity measure: # of articles that both users like or dislike

Clustering: k-means (minimize distance within clusters, maximize distance between clusters)
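The similarity measure from the slide, written out (the toy like/dislike vectors are invented):

```python
def user_similarity(likes_a, likes_b):
    """Number of articles on which both users agree
    (both like, or both dislike)."""
    return sum(1 for la, lb in zip(likes_a, likes_b) if la == lb)

# 1 = liked, 0 = disliked, one entry per article
sim = user_similarity([1, 0, 1, 1], [1, 1, 1, 0])   # agree on articles 0 and 2
```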

Page 82: Say "Hi!" to Your New Boss

Like Rate vs. Articles Viewed (with click-rate optimization)

Page 83: Say "Hi!" to Your New Boss

Consequence of optimization: "Filter Bubbles"

Page 84: Say "Hi!" to Your New Boss

Switching On User Feedback

A_user(t+1) = A_user(t) + γ · sgn(A_user(t) - A_article)
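The update rule, implemented literally as written on the slide (the clipping to [0, 1] is our addition, to keep opinions in the valid range):

```python
def update_opinion(a_user: float, a_article: float, gamma: float = 0.02) -> float:
    """One exposure: A_user(t+1) = A_user(t) + gamma * sgn(A_user(t) - A_article)."""
    sign = (a_user > a_article) - (a_user < a_article)   # sgn(...)
    return min(1.0, max(0.0, a_user + gamma * sign))
```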

Page 85: Say "Hi!" to Your New Boss

User opinions with and without feedback

the algorithm has an incentive to steer opinions towards the center

no feedback vs. 2 % feedback

Page 86: Say "Hi!" to Your New Boss

Summary