Data ethics and machine learning
Discrimination, algorithmic bias, and how to discover them.
DINO PEDRESCHI
KDDLAB, DIPARTIMENTO DI INFORMATICA, UNIVERSITÀ DI PISA


Opportunities of big data


• Spot business trends
• Prevent diseases
• Fight crime
• Improve transportation
• Personalised services
• Improve wellbeing


Event Detection

Detecting events in a geographic area and classifying the different kinds of users.

City of Rome

Metropolitan area

Covered geographical region: city of Rome
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day

8 months between 2015 and 2016

[Figure panels: San Pietro, San Giovanni, Circo Massimo, Stadio Olimpico]

[Diagram labels: End users; Traveler; Mobility Manager; Data; Trip; advises; Monitor; Rules & Incentives; City]

Personal mobility assistant


Carpooling Network


Estimating wellbeing with mobility data


Predicting GDP with Retail Market data

[Figure labels: generic utility function (rationality); personal utility function (diversity); Product, Price, Quantity Needed, Sophistication; R² = 17.25%, R² = 32.38%, R² = 85.72%]

Risks of big data


Big Data, Big Risks

Big data is algorithmic, therefore it cannot be biased! And yet…

• All traditional evils of social discrimination, and many new ones, exhibit themselves in the big data ecosystem

• Because of its tremendous power, massive data analysis must be used responsibly

• Technology alone won’t do: we also need policy, user involvement, and education efforts


By 2018, 50% of business ethics violations will occur through improper use of big data analytics

[source: Gartner, 2016]


The danger of black boxes - 1

The COMPAS score (Correctional Offender Management Profiling for Alternative Sanctions)

A 137-question questionnaire and a predictive model for “risk of crime recidivism.” The model is a proprietary secret of Northpointe, Inc.

The data journalists at propublica.org have shown that

• the prediction accuracy of recidivism is rather low (around 60%)

• the model has a strong ethnic bias
  ◦ blacks who did not reoffend were classified as high risk twice as often as whites who did not reoffend
  ◦ whites who did reoffend were classified as low risk twice as often as blacks who did reoffend.
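The two disparities above are differences in false positive and false negative rates between groups. A minimal sketch of how such rates could be computed from decision records follows. The record layout, the group names, and the toy counts are hypothetical and chosen only to illustrate a 2:1 ratio; they are not the ProPublica data.

```python
from collections import Counter

def error_rates_by_group(records):
    """Compute per-group false positive and false negative rates.

    records: iterable of (group, predicted_high_risk, reoffended) tuples.
    """
    counts = Counter(records)
    groups = {g for g, _, _ in counts}
    rates = {}
    for g in groups:
        fp = counts[(g, True, False)]   # flagged high risk, did not reoffend
        tn = counts[(g, False, False)]  # flagged low risk, did not reoffend
        fn = counts[(g, False, True)]   # flagged low risk, did reoffend
        tp = counts[(g, True, True)]    # flagged high risk, did reoffend
        rates[g] = {
            "false_positive_rate": fp / (fp + tn) if fp + tn else float("nan"),
            "false_negative_rate": fn / (fn + tp) if fn + tp else float("nan"),
        }
    return rates

# Hypothetical toy records, for illustration only.
toy = (
    [("group_a", True, False)] * 4 + [("group_a", False, False)] * 6
    + [("group_a", False, True)] * 2 + [("group_a", True, True)] * 8
    + [("group_b", True, False)] * 2 + [("group_b", False, False)] * 8
    + [("group_b", False, True)] * 4 + [("group_b", True, True)] * 6
)

rates = error_rates_by_group(toy)
for group, r in sorted(rates.items()):
    print(group, r)
```

In this toy data, group_a has twice the false positive rate of group_b and half its false negative rate, even though overall accuracy is the same for both groups.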


The danger of black boxes - 2

The three major US credit bureaus, Experian, TransUnion, and Equifax, providing credit scoring for millions of individuals, are often discordant.

In a study of 500,000 records, 29% of consumers received credit scores that differ by at least fifty points between credit bureaus, a difference that may mean tens of thousands of dollars over the life of a mortgage [CRS+16].


The danger of black boxes - 3

In 2010, some homeowners with a regular payment history of their mortgage reported a sudden drop of forty points in their credit score, soon after their own enquiry.


The danger of black boxes - 4

During the 1970s and 1980s, St. George’s Hospital Medical School in London used a computer program for initial screening of job applicants.

The program used information from applicants’ forms, which contained no reference to ethnicity.

The program was found to unfairly discriminate against female applicants and ethnic minorities (inferred from surnames and place of birth), who were less likely to be selected for interview [LM88].


The danger of black boxes - 5

In a recent paper at SIGKDD 2016 [RSG16] the authors show how an accurate but untrustworthy classifier may result from an accidental bias in the training data.

In a task of discriminating wolves from huskies in a dataset of images, the resulting deep learning model is shown to classify a wolf in a picture based solely on … the presence of snow in the background!

[RSG16] “Why Should I Trust You?” Explaining the Predictions of Any Classifier

SIGKDD 2016 Conference Paper


Deep learning is creating computer systems we don't fully understand

www.theverge.com/2016/7/12/12158238/first-click-deep-learning-algorithmic-black-boxes


Is AI Permanently Inscrutable?

nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable


The danger of black boxes - 6

In a recent study at Princeton University, the authors show how the semantics derived automatically from large text/web corpora contain human biases
◦ E.g., names associated with white people were found to be significantly easier to associate with pleasant than unpleasant terms, compared to names associated with black people.

Therefore, any machine learning model trained on text data for, e.g., sentiment or opinion mining has a strong chance of inheriting the prejudices reflected in the human-produced training data.


Human Bias


Human Bias can be Learned - 7


As we stated in our 2008 SIGKDD paper that started the field of discrimination-aware data mining [PRT08]:

“learning from historical data recording human decision making may mean to discover traditional prejudices that are endemic in reality, and to assign to such practices the status of general rules, maybe unconsciously, as these rules can be deeply hidden within the learned classifier.”


Policies
BIG DATA ETHICS


Satya Nadella's rules for AI

www.theverge.com/2016/6/29/12057516/satya-nadella-ai-robot-laws


U.S. – F.T.C.


www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf (Sept. 2014)


U.S. – White House


www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf (May 2014)


U.S. – White House


www.whitehouse.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf (May 2016)


U.S. – White House

www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf (October 2016)


Netherlands

www.knaw.nl/en/news/publications/ethical-and-legal-aspects-of-informatics-research (September 2016)


Big Data Ethics

informationaccountability.org/big-data-ethics-initiative/


Value-Sensitive Design

• Design for privacy
• Design for security
• Design for inclusion
• Design for sustainability
• Design for democracy
• Design for safety
• Design for transparency
• Design for accountability
• Design for human capabilities


EU Projects: SoBigData.eu

Social Mining & Big Data Ecosystem project (SoBigData, H2020-INFRAIA-2014-2015, duration: 2015-2019), www.sobigdata.eu


Data Ethics Literacy

MIUR report on Big Data, 28 July 2016
◦ www.istruzione.it/allegati/2016/bigdata.pdf

Master UNIPI in Big Data Analytics & Social Mining
◦ masterbigdata.it


Data ethics technologies
DISCRIMINATION DISCOVERY FROM DATA


Discrimination discovery

Given:
◦ a historical database of decision records, each describing the features of an applicant for a benefit
  ◦ e.g., a credit request to a bank and the corresponding credit approval/denial decision
◦ some designated categories of applicants, such as groups protected by anti-discrimination laws,

find whether, and in which circumstances, evidence of discrimination against the designated categories emerges from the data.


German Credit dataset


How? Fight with the same weapons

Idea: use data mining to discover discrimination

◦ the decision policies hidden in a database can be represented by decision rules and discovered by frequent pattern mining

◦ Once all such decision rules have been found, highlight all potential niches of discrimination by filtering the rules using a measure that quantifies the discrimination risk.
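The two steps above (mine decision rules, then filter them by a discrimination measure) can be sketched as follows. This is a minimal illustration, not the DCUBE implementation. It uses brute-force enumeration of contexts as a stand-in for a frequent pattern miner such as Apriori, and it assumes the filtering measure is the extended lift, elift = conf(A, B → C) / conf(B → C), where A is the protected group, B the context, and C the negative decision. The function names, thresholds, attribute names, and toy rows are all hypothetical.

```python
from itertools import combinations

def conf(rows, antecedent, consequent):
    """Return (confidence, support count) of the rule antecedent -> consequent."""
    covered = [r for r in rows if antecedent.items() <= r.items()]
    hits = [r for r in covered if consequent.items() <= r.items()]
    return (len(hits) / len(covered) if covered else 0.0), len(hits)

def discriminatory_rules(rows, protected, outcome, context_attrs,
                         min_support=2, min_elift=1.5, max_len=2):
    """Enumerate contexts B and keep rules protected & B -> outcome with high elift."""
    results = []
    for k in range(1, max_len + 1):
        for attrs in combinations(context_attrs, k):
            # Only value combinations that actually occur in the data.
            contexts = {tuple((a, r[a]) for a in attrs) for r in rows}
            for ctx in contexts:
                context = dict(ctx)
                base_conf, _ = conf(rows, context, outcome)
                rule_conf, support = conf(rows, {**protected, **context}, outcome)
                if support < min_support or base_conf == 0:
                    continue  # infrequent rule, or no baseline to compare against
                elift = rule_conf / base_conf
                if elift >= min_elift:
                    results.append((context, support, rule_conf, elift))
    return sorted(results, key=lambda t: -t[3])

def _row(foreign, purpose, housing, credit):
    return {"foreign_worker": foreign, "purpose": purpose,
            "housing": housing, "credit": credit}

# Hypothetical toy decision records, for illustration only.
rows = (
    [_row("yes", "new_car", "own", "bad")] * 3
    + [_row("yes", "new_car", "own", "good")] * 1
    + [_row("no", "new_car", "own", "bad")] * 1
    + [_row("no", "new_car", "own", "good")] * 7
    + [_row("no", "furniture", "rent", "good")] * 4
    + [_row("yes", "furniture", "rent", "good")] * 2
    + [_row("no", "furniture", "rent", "bad")] * 2
)

found = discriminatory_rules(rows, {"foreign_worker": "yes"}, {"credit": "bad"},
                             ["purpose", "housing"])
for context, support, rule_conf, elift in found:
    print(context, "supp =", support, "conf =", round(rule_conf, 2),
          "elift =", round(elift, 2))
```

A real miner would prune the search with a minimum-support threshold on the itemsets themselves rather than enumerating every context, which matters once there are more than a handful of attributes.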


Discrimination discovery from data

FOREIGN_WORKER=yes & PURPOSE=new_car & HOUSING=own → CREDIT=bad
◦ elift = 5.19, supp = 56, conf = 0.37

elift = 5.19 means that foreign workers are more than 5 times as likely to be refused credit as the average population (even if they own their house).
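As a worked reading of these numbers, assume the extended lift is defined as the ratio between the confidence of the rule and the confidence of the same rule without the protected item, with A = (FOREIGN_WORKER=yes), B = (PURPOSE=new_car & HOUSING=own) and C = (CREDIT=bad). Assume also that the reported conf is the confidence of the full rule. The implied baseline confidence then follows directly:

```latex
\[
\mathrm{elift}(A, B \to C) = \frac{\mathrm{conf}(A, B \to C)}{\mathrm{conf}(B \to C)}
\quad\Longrightarrow\quad
\mathrm{conf}(B \to C) = \frac{\mathrm{conf}(A, B \to C)}{\mathrm{elift}(A, B \to C)}
= \frac{0.37}{5.19} \approx 0.071
\]
```

Under these assumptions, about 7% of all applicants who own their house and ask for a new-car loan have CREDIT=bad, against 37% of the foreign workers in the same context. Both input figures are rounded, so the baseline is approximate.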


Case Study: grant evaluation

Outcome: Funded, Not funded, Conditionally funded

Dataset attributes

• Features of the PI
• Project costs
• Research Area
• Project Evaluation

A potentially discriminatory rule

Antecedent
◦ Project proposals in “Physical and Analytical Chemical Sciences”
◦ Young females
◦ Total cost of 1,358,000 Euros or above

Possible interpretation
◦ “Peer-reviewers of panel PE4 trusted young females requiring high budgets less than males leading similar projects”


Case study: US Harmonized Tariff System

US Harmonized Tariff System (HTS)

https://hts.usitc.gov/

Detailed tariff classification system for merchandise imported to the US

Chapters 61, 62, 64, 65: apparel
◦ Different taxes for the same garments separately produced for males and females
◦ Descriptions are in semi-structured form

                   Coats               Fur felt hats     Cotton pajamas
Women and girls    64.4¢/kg + 18.8%    96¢/doz + 1.4%    8.5%
Men and boys       38.6¢/kg + 10%      0                 8.9%

Different taxes for the same apparel for men and women

Women: 14%   Men: 9%   1.3 billion USD!!!


Totes-Isotoner Corp. v. U.S.

Rack Room Shoes Inc. and Forever 21 Inc. v. U.S.

Court of International Trade

U.S. Court of Appeals for the Federal Circuit (2014)

“[…] the courts may have concluded that Congress had no discriminatory intent when ruling the HTS, but there is little doubt that gender-based tariffs have discriminatory impact”


Sample rule from the HTS dataset


Soccer Player Ratings


How do humans evaluate sports performance?

[Chart labels: Human evaluation line; Technical features; Technical + Contextual features; Machine performance]

Wrapping up


Right of explanation

• Applying AI within many domains requires transparency and responsibility:
  • health care
  • finance
  • surveillance
  • autonomous vehicles
  • government

• EU General Data Protection Regulation (April 2016) establishes (?) a right of explanation for all individuals to obtain “meaningful explanations of the logic involved” when automated (algorithmic) individual decision-making, including profiling, takes place.

• In sharp contrast, (big) data-driven AI/ML models are often black boxes.


Accountability

“Why exactly was my loan application rejected?”

“What could I have done differently so that my application would not have been rejected?”


Social Mining & Big Data Ecosystem

www.sobigdata.eu


Knowledge Discovery & Data Mining Lab
http://kdd.isti.cnr.it


Special thanks

• Salvatore Ruggieri
• Franco Turini
• Fosca Giannotti
• Anna Monreale
• Luca Pappalardo

SMARTCATs