Say "Hi!" to Your New Boss
How algorithms might soon control our lives (and why we should be careful with them)
Motivation
no alternatives, Google?
Outline
Theory
1. Algorithms
2. Machine Learning
3. Big Data & Consequences for Machine Learning
4. Use of Algorithms Today and in the Future
Experiments
5. Discriminating people with machine learning & algorithms
6. Creating persistent user identities by (accidental) de-anonymization
Summary & Outlook
7. Strategies for Handling Data Responsibly
Algorithms, Machine Learning & Big Data
Algorithms
An algorithm is a "recipe" that gives a computer (or a human) step-by-step instructions in order to achieve a certain goal.
[Flowchart: Start → door bell ringing → "Andreas stands on trapdoor?" → yes: open trapdoor / no: wait, our time will come.]
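The same recipe as a short Python sketch (tongue-in-cheek; the sensor function is obviously made up):

```python
import random

def andreas_stands_on_trapdoor() -> bool:
    # stand-in for a real sensor reading; assumed for this toy example
    return random.random() < 0.5

def on_doorbell() -> str:
    # the same step-by-step instructions as in the flowchart above
    if andreas_stands_on_trapdoor():
        return "Open trapdoor"
    return "Wait. Our time will come."

print(on_doorbell())
```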
Machine Learning
A machine learning algorithm automatically generates models and checks them against the training data we provide, trying to find a model that explains the data well and can predict unknown data.
Data vs. Model

y = m(x, p) + ε

see e.g. "Machine Learning" by Tom Mitchell (McGraw-Hill, 1997).

[Plot: data points y vs. x1, with a model curve fitted through them]
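A minimal sketch of the idea in Python, using synthetic data and an assumed linear model (not from the talk):

```python
import numpy as np
from scipy.optimize import curve_fit

# synthetic example data: y = m(x, p) + epsilon, with an assumed linear "truth"
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=x.shape)  # epsilon: Gaussian noise

def m(x, slope, intercept):
    # candidate model with parameters p = (slope, intercept)
    return slope * x + intercept

p_opt, p_cov = curve_fit(m, x, y)
print("fitted p:", p_opt)  # close to the true values (2.0, 1.0)
```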
Sources of Error
ε = ε_sys + ε_noise + ε_hidden

systematic errors arise due to imperfect measurements of known variables
noise is present due to the nature of the process or our measurement apparatus
many variables are usually unknown to us
Big Data & Machine Learning
2000 → 2015:
• more data sources
• high data volume
• higher density
• higher frequency
• longer retention
Data Volume: More is (usually) better
Exploiting New Sources of Data
y = m(x, p) + ε_hidden + ...

incorporate variables that were hidden into the model, reducing error
Understanding Results
• Models can be easy or very difficult to interpret
• Parameter space is often huge and can't be explored entirely
[Diagram: a decision tree classifier with splits such as "age > 37?", "height < 1.78?", "projects > 19?" and yes/no branches (easy to interpret), next to a neural network classifier (hard to interpret)]
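A small illustration of the difference, using scikit-learn on made-up data (the feature names are invented):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical candidate data: three numeric features, binary outcome
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# the learned rules can be printed and read directly, unlike a neural network
print(export_text(tree, feature_names=["age", "height", "projects"]))
```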
Example: Deep Learning for Image Recognition
http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Classifying Use of Algorithms
low risk: mildly annoying in case of failure / misbehaviour
medium risk: large impact on our life in case of failure / misbehaviour
high risk: critical impact on our life in case of failure / misbehaviour

low risk:
• personalization of services (e.g. recommendation engines for websites, video-on-demand, content, ...)
• individualized ad targeting
• customer rating / profiling
• consumer demand prediction

medium risk:
• personalized health
• person classification (e.g. crime, terrorism)
• autonomous cars / planes / machines ...
• automated trading

high risk:
• military intelligence / intervention
• political oppression
• critical infrastructure services (e.g. electricity)
• life-changing decisions (e.g. about health)
Big Data & Advances in Machine Learning
Data "Mishaps"
Two Experiments
Discriminating People With Algorithms
Humans can be prejudiced. Are algorithms better?
Discrimination
Discrimination is treatment or consideration of, or making a distinction in favor of or against, a person or thing based on the group, class, or category to which that person or thing is perceived to belong to rather than on individual merit.
Wikipedia
Protected attributes (examples): Ethnicity, Gender, Sexual Orientation, ...
When is a process discriminating?
Disparate Impact: Adverse impact of a process C on a given group X
Outcome    X = 0            X = 1
C = NO     P(C=NO, X=0)     P(C=NO, X=1)
C = YES    P(C=YES, X=0)    P(C=YES, X=1)

Disparate impact: P(C=YES | X=0) / P(C=YES | X=1) < τ
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al. (arXiv)
When is a process discriminating?
Estimating with real-world data
Outcome    X = 0    X = 1
C = NO     a        b
C = YES    c        d

Estimate: (c / (a + c)) / (d / (b + d)) < τ
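A tiny sketch of this estimate in Python; the numbers and the threshold τ = 0.8 (a common choice, the "four-fifths rule") are illustrative assumptions:

```python
def disparate_impact(a: int, b: int, c: int, d: int) -> float:
    """Estimate P(C=YES|X=0) / P(C=YES|X=1) from the table above."""
    return (c / (a + c)) / (d / (b + d))

# made-up numbers: candidates from group X=0 are invited far less often
ratio = disparate_impact(a=80, b=50, c=20, d=50)
print(ratio)        # 0.4
print(ratio < 0.8)  # True: disparate impact under a threshold of tau = 0.8
```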
Discrimination through Data Analysis
Replacing a manual hiring process with an automated one.
Benefits:
• Save time screening CVs by hand
• Improve candidate choice
The Setup
[Diagram: today a human reads the submitted information (CV, cover letter, projects, ...) and decides C = YES (invite) or C = NO (don't invite); these past decisions C become the training data for an algorithm that receives the same information]
The Setup
• Use submitted information (CV, work samples) along with publicly available / external information to predict candidate success
• Use data from the manual process (invite / no invite) to train the classifier
• Provide it with as much data as possible to decrease the error rate ("Big Data" approach)
Our decision model
S = m(Y) + d(X) + ε

m(Y): score of the candidate (merit function)
d(X): discrimination malus/bonus
ε: hidden variables & luck (if you believe in it)

C = YES if S > t, NO otherwise

[Chart: composition of a candidate's score (merit, luck) without and with discrimination]
Training a predictor for C
information about Y (unprotected attributes)
additional information Z we give to the algorithm

Z ∝ X + ε: we can predict the value of X from Z with fidelity γ
A Simulation
• Generate 10,000 samples of C with disparate impact
• Train a classifier (e.g. a support vector machine) on the training data
• Provide it with (noisy) information about X
• Measure the algorithm-based discrimination on the test data (see the sketch below)
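A minimal reconstruction of such a simulation (my own sketch with assumed parameters, not the talk's actual code):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X_attr = rng.integers(0, 2, n)                   # protected attribute X
Y = rng.normal(0.0, 1.0, n)                      # merit
S = Y + 0.5 * X_attr + rng.normal(0.0, 0.3, n)   # d(X): bonus for X=1, i.e. malus for X=0
C = (S > 0.5).astype(int)                        # biased historical decisions

Z = X_attr + rng.normal(0.0, 0.5, n)             # noisy proxy that leaks X
features = np.column_stack([Y, Z])               # note: X itself is never included

f_tr, f_te, c_tr, c_te, x_tr, x_te = train_test_split(
    features, C, X_attr, test_size=0.25, random_state=0)

clf = SVC().fit(f_tr, c_tr)
pred = clf.predict(f_te)

rate_0 = pred[x_te == 0].mean()  # P(C=YES | X=0) according to the classifier
rate_1 = pred[x_te == 1].mean()  # P(C=YES | X=1)
print("disparate impact of trained classifier:", rate_0 / rate_1)
```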
Discrimination by Algorithm

[Charts: disparate impact on the protected class as a function of γ (how much information about X leaks into the data)]

[Chart: how well our algorithm can explain the data: ~87 % merit, 6-8 % discrimination, 8 % luck / noise]

[Chart: how much our algorithm discriminates against people in group X]
Why give that information (Z) to the algorithm? We don't! But information about X leaks through anyway...
But can it be done?
Discrimination through information leakage is possible, but how likely is it in practice?
Let's try! We use publicly available data to predict the gender of Github users (protected attribute X).
Basic Information: Manually classify users as men/women (by looking at profile pictures, names) → 5,000 training samples with small error
Use the Github API to retrieve information about users (followers, repositories, stargazers, contributions, ...)
We only use data that is easy to get and likely to be used for classification in a real-world setting
We only use a limited dataset (proof of concept, not optimized)
Stargazers, Followers, Projects, ...
No predictive power for X
Github Event Data
https://www.githubarchive.org/
PushEvent, 2015-03-17 21:21h, 3 commits, Log: "..."
PullRequestEvent, 2015-03-17 22:43h
CommentEvent, 2015-03-17 23:14h, "Hi, I think we should add more cats to the landing page"
Hourly event patterns & event types
Commit Message Analysis
Use the commit messages (as obtained from the event data) to predict gender by training a Support Vector Machine (SVM) classifier on the word frequency data; a minimal sketch follows below.
[Word cloud of predictive terms: lol, emoji, wtf, seriously, rtfm, dude, fuck, git, ...]
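A minimal sketch of this pipeline with scikit-learn; the messages and labels below are invented stand-ins for the hand-labelled training set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# toy stand-in for the manually labelled training data (messages and labels invented)
messages = ["fix typo lol", "rtfm dude, read the docs",
            "add more cats to the landing page", "refactor build scripts",
            "wtf seriously, tests are broken", "update readme"]
labels = [0, 0, 1, 1, 0, 1]  # 0 / 1: the manually assigned gender classes

# word-frequency features feeding a linear SVM, as described above
clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(messages, labels)
print(clf.predict(["lol wtf, fix this"]))
```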
Predictive Power of Model
35 % classification error vs. 50 % baseline (random guessing), i.e. 15 % above chance
≈ 30 % information leakage (with a very simple data set)
Takeaways
Algorithms will readily "learn" discrimination from us if we provide them with contaminated training data.
Information leakage of protected attributes can happen easily.
How we can fix this
Harder than you might think! We need to know X to measure disparate impact and remove it
Incorporate a penalty for discrimination into the target function
Remove information about X from dataset by performing a suitable transformation (reduces fidelity of model)
see e.g. "Certifying and Removing Disparate Impact", M. Feldman et al. (arXiv)
Oh, it's you again! De-anonymizing data
What is de-anonymization?
Use data recorded about individuals / entities to identify those same individuals / entities in another set of data (exactly or with high likelihood).
De-anonymization becomes an increasing risk as datasets about individual entities become larger and more detailed.
"Buckets of Truth"
N boolean attributes per entity - on average M < N of them are set
fun with de-anonymization: http://en.akinator.com
Examples
uniform distribution: P_col = (1 - 2p(1 - p))^N
long-tailed distribution: P_col = ?
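A quick numeric check of the uniform-distribution formula, assuming independent attributes (p and N are arbitrary example values):

```python
# chance that two random entities agree on one boolean attribute (both set or
# both unset) is p**2 + (1 - p)**2 = 1 - 2*p*(1 - p); independence assumed
def p_collision(p: float, n: int) -> float:
    return (1 - 2 * p * (1 - p)) ** n

print(p_collision(p=0.5, n=20))  # ~9.5e-07: fingerprints are almost unique
print(p_collision(p=0.1, n=20))  # ~0.019: skewed attributes collide more often
```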
Geolife Trajectories
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/
Question: How easy is it to re-identify single users through their data?
Could an algorithm build a representation of a given user?
Individual trajectories (color-coded)
http://research.microsoft.com/en-us/downloads/b16d359d-d164-469e-9fd4-daa38f2b2e13/
How good are our buckets?
fit of the form e^(-(x/a)^γ); γ here is the interesting information
Identifying / comparing fingerprints
s(u_i, u_j) = (f(u_i) · f(u_j)) / (‖f(u_i)‖ · ‖f(u_j)‖)

[Illustration: element-wise comparison of two bucket fingerprints]
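This is just cosine similarity between fingerprint vectors; a minimal sketch with invented bucket counts:

```python
import numpy as np

def similarity(f_i: np.ndarray, f_j: np.ndarray) -> float:
    # cosine similarity between two fingerprint vectors, as in the formula above
    return np.dot(f_i, f_j) / (np.linalg.norm(f_i) * np.linalg.norm(f_j))

# invented bucket-count fingerprints of two observations
f_a = np.array([3, 0, 1, 7, 0, 2], dtype=float)
f_b = np.array([2, 0, 0, 9, 1, 1], dtype=float)
print(similarity(f_a, f_b))  # ~0.96: likely the same person
```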
Testing De-Anonymization
Use 75 % of the trajectories as prior data set
Predict the user ID belonging to the remaining 25 %
Measure the average success probability and the identification rank (i.e. at which position the correct user appears)
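A sketch of this evaluation loop; the fingerprints below are randomly generated stand-ins for the real trajectory buckets:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(1)
n_users, n_buckets = 50, 100

# invented per-user bucket fingerprints, split into "prior" and "test" parts
base = rng.poisson(2.0, size=(n_users, n_buckets)).astype(float)
prior = 0.75 * base + rng.normal(0.0, 0.5, size=base.shape)
test = 0.25 * base + rng.normal(0.0, 0.5, size=base.shape)

sims = cosine_similarity(test, prior)  # each test user vs. every known user
ranks = []
for i in range(n_users):
    order = np.argsort(-sims[i])                       # most similar user first
    ranks.append(int(np.where(order == i)[0][0]) + 1)  # position of correct user

ranks = np.array(ranks)
print("identification rate (rank 1):", (ranks == 1).mean())
print("mean identification rank:", ranks.mean())
```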
Identification Rate
Finding Similar Users
Possible Improvements
• Use temporal / sequence information
• Use speed of movement / mode of transportation
Improve choice of buckets for fingerprinting
Interesting review article: "Life in the network: the coming age of computational social science", D. Lazer et al.
Summary
The more data we have, the more difficult it is to keep algorithms from directly learning and using object identities instead of attributes.
Our data follows us around!
What can we do?
As Data Scientists / Analysts / Programmers
Consume data responsibly: Don't include everything under the sun just because it increases fidelity by a slim margin
Check for disparate impact and remove it from the input data
Test anonymization safety by using machine learning
Train data scientists in safety and risks of data analysis
As Citizens / Hackers / Users
Do not blindly trust decisions made by algorithms
Test them if possible (using different input values)
Reverse-engineer them (using e.g. active learning)
Fight back with data: Collect and analyze algorithm-based decisions using collaborative approaches
As a Society
Create better regulations for algorithms and their use
Force companies / organizations to open up black boxes
Make access to data easier, also for small organizations
Impede the creation of data monopolies
Algorithms are like children: smart & eager to learn.
So let's make sure we raise them to be responsible adults.
Thanks!
Slides slideshare.net/japh44Website andreas-dewes.de/en
Code (coming soon) github.com/adewes/32c3E-Mail [email protected]
Twitter @japh44License Creative Commons Attribution 4.0
International(except Google Deep Learning image)
Intro
Whenever we measure user actions, we (automatically) gain information about them that we can use to classify them.
Classifying and Controlling People
Case Study: Click Rate Optimization
Simple but common use case for big data: Collaborative filtering
• Users have an opinion on a given topic A (between 0 and 1)
• They are more likely to like articles that confirm their opinion
• Our algorithm knows nothing about A, it just tries to optimize the click rate
• User opinion may change over time according to the content he/she is exposed to (2 % change per exposure)
Mathematical Model
P(Like) ∝ 1 - |A_article - A_user| + ε_mood
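A toy reconstruction of the whole setup, including the 2 % opinion feedback used later (my own sketch; the parameters and the like-probability are assumptions, not the talk's code):

```python
import random

def simulate(n_views=60, gamma=0.02, optimize=True):
    """Toy reconstruction of the model above."""
    a_user = random.random()  # the user's opinion on topic A, in [0, 1]
    best = random.random()    # the algorithm's current best-guess article
    likes = 0
    for _ in range(n_views):
        a_article = best if optimize else random.random()
        mood = random.gauss(0.0, 0.05)
        if random.random() < 1 - abs(a_article - a_user) + mood:  # P(Like)
            likes += 1
        elif optimize:
            best = random.random()  # article flopped, try something different
        # feedback: the opinion drifts 2 % towards the content just seen
        a_user += gamma * (1 if a_article > a_user else -1)
    return likes / n_views, a_user

print(simulate(optimize=True))   # higher like rate, opinion pulled along
print(simulate(optimize=False))  # baseline: just observe
```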
[Chart: Like Rate vs. Articles Viewed (only observe, don't optimize)]
What have we learned?
60 observations per user
Clustering users into groups
Similarity measure: # of articles that both users like or dislike
Clustering: K-Means (minimize distance within clusters, maximize distance between clusters)
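A minimal clustering sketch with scikit-learn on an invented like/dislike matrix; note that sklearn's K-Means uses Euclidean distance, which on 0/1 vectors amounts to counting disagreements, close to the similarity measure above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# invented like/dislike matrix: rows are users, columns are articles (1 = liked)
likes = np.vstack([
    rng.binomial(1, 0.8, size=(20, 30)),  # users who mostly like articles on A
    rng.binomial(1, 0.2, size=(20, 30)),  # users who mostly dislike them
])

# users who like/dislike the same articles end up in the same cluster
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(likes)
print(kmeans.labels_)  # the two opinion groups separate cleanly
```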
[Chart: Like Rate vs. Articles Viewed (with click-rate optimization)]
Consequence of optimization: "Filter Bubbles"
Switching On User Feedback
A_user^(t+1) = A_user^t + γ · sgn(A_article - A_user^t)
User opinions with and without feedback

[Charts: distribution of user opinions, no feedback vs. 2 % feedback]

The algorithm has an incentive to steer opinions towards the center.
Summary