Data ethics and machine learning: discrimination, algorithmic bias, and how to discover them. Dino Pedreschi

Dino Pedreschi
KDDLab, Dipartimento di Informatica, Università di Pisa

Opportunities of big data


Spot business trends
Prevent diseases
Fight crime
Improve transportation
Personalised services
Improve wellbeing

Event Detection

Detecting events in a geographic area classifying the different kinds of users.

City of Rome

Metropolitan area

Covered geographical region: city of Rome
Dataset size per snapshot: ≈ 1.2 GBytes per day
Number of records: ≈ 5.6 million lines per day

8 months between 2015 and 2016

San Pietro

San Giovanni

Circo Massimo

Stadio Olimpico

End users

Traveler

Mobility Manager

[Figure labels: Data; Trip advises; Monitor; Rules & Incentives; City]

Personal mobility assistant


Carpooling Network

Estimating wellbeing with mobility data

AI and Big Data

[Figure panels: A, B, C, HW]

Predicting GDP with Retail Market data


generic utility function (rationality)

personal utility function (diversity)

Product

Price

Quantity Needed

Sophistication

R² = 17.25%    R² = 32.38%

R² = 85.72%

Risks of big data


Big Data, Big Risks

Big data is algorithmic, therefore it cannot be biased! And yet…

• All traditional evils of social discrimination, and many new ones, exhibit themselves in the big data ecosystem

• Because of its tremendous power, massive data analysis must be used responsibly

• Technology alone won’t do: we also need policy, user involvement and education efforts


By 2018, 50% of business ethics violations will occur through improper use of big data analytics

[source: Gartner, 2016]


http://visual.ly/data-brokers

The danger of black boxes - 1

The COMPAS score (Correctional Offender Management Profiling for Alternative Sanctions)

A 137-question questionnaire and a predictive model for “risk of crime recidivism.” The model is a proprietary secret of Northpointe, Inc.

The data journalists at propublica.org have shown that

• the prediction accuracy of recidivism is rather low (around 60%)

• the model has a strong ethnic bias
  ◦ blacks who did not reoffend were classified as high risk twice as often as whites who did not reoffend
  ◦ whites who did reoffend were classified as low risk twice as often as blacks who did reoffend.
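The two disparities above are differences in group-wise error rates: the false positive rate (non-reoffenders labelled high risk) and the false negative rate (reoffenders labelled low risk). A minimal sketch of how such rates could be compared per group follows. The function name `error_rates_by_group`, the record layout and all values are hypothetical illustrations, not the COMPAS data or the ProPublica method.

```python
from collections import defaultdict

def error_rates_by_group(records):
    """Return {group: (false_positive_rate, false_negative_rate)}.

    Each record is a tuple (group, predicted_high_risk, reoffended).
    """
    counts = defaultdict(lambda: {"fp": 0, "tn": 0, "fn": 0, "tp": 0})
    for group, high_risk, reoffended in records:
        if reoffended:
            key = "tp" if high_risk else "fn"
        else:
            key = "fp" if high_risk else "tn"
        counts[group][key] += 1

    rates = {}
    for group, c in counts.items():
        negatives = c["fp"] + c["tn"]   # people who did not reoffend
        positives = c["fn"] + c["tp"]   # people who did reoffend
        fpr = c["fp"] / negatives if negatives else float("nan")
        fnr = c["fn"] / positives if positives else float("nan")
        rates[group] = (fpr, fnr)
    return rates

# Hypothetical toy records for two groups "A" and "B" (made-up values).
toy = (
    [("A", True, False)] * 4 + [("A", False, False)] * 6
    + [("A", True, True)] * 8 + [("A", False, True)] * 2
    + [("B", True, False)] * 2 + [("B", False, False)] * 8
    + [("B", True, True)] * 6 + [("B", False, True)] * 4
)
rates = error_rates_by_group(toy)
for group, (fpr, fnr) in sorted(rates.items()):
    print(f"group {group}: FPR={fpr:.2f} FNR={fnr:.2f}")
```

In this toy data, group A has twice the false positive rate of group B and half its false negative rate, which mirrors the shape of the disparity described above while overall accuracy stays the same for both groups.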


The danger of black boxes - 2

The three major US credit bureaus, Experian, TransUnion, and Equifax, providing credit scoring for millions of individuals, are often discordant.

In a study of 500,000 records, 29% of consumers received credit scores that differ by at least fifty points between credit bureaus, a difference that may mean tens of thousands of dollars over the life of a mortgage [CRS+16].



In 2010, some homeowners with a regular payment history of their mortgage reported a sudden drop of forty points in their credit score, soon after their own enquiry.



During the 1970s and 1980s, St. George’s Hospital Medical School in London used a computer program for initial screening of job applicants.

The program used information from applicants’ forms, which contained no reference to ethnicity.

The program was found to unfairly discriminate against female applicants and ethnic minorities (inferred from surnames and place of birth), who were less likely to be selected for interview [LM88].



In a recent paper at SIGKDD 2016 [RSG16] the authors show how an accurate but untrustworthy classifier may result from an accidental bias in the training data.

In a task of discriminating wolves from huskies in a dataset of images, the resulting deep learning model is shown to classify a wolf in a picture based solely on … the presence of snow in the background!

[RSG16] “Why Should I Trust You?” Explaining the Predictions of Any Classifier

SIGKDD 2016 Conference Paper


Deep learning is creating computer systems we don't fully understand

www.theverge.com/2016/7/12/12158238/first-click-deep-learning-algorithmic-black-boxes

Is AI Permanently Inscrutable?

nautil.us/issue/40/learning/is-artificial-intelligence-permanently-inscrutable



In a recent study at Princeton University, the authors show how the semantics derived automatically from large text/web corpora contain human biases.
◦ E.g., names associated with white people were found to be significantly easier to associate with pleasant than unpleasant terms, compared to names associated with black people.

Therefore, any machine learning model trained on text data for, e.g., sentiment or opinion mining has a strong chance of inheriting the prejudices reflected in the human-produced training data.
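One way such an association could be quantified is sketched below, assuming that words are represented as embedding vectors and compared by cosine similarity. The slide does not specify the study's actual test, so the `association` score and the 2-D vectors here are made-up illustrations only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(word_vec, pleasant, unpleasant):
    """Mean similarity to pleasant terms minus mean similarity to unpleasant terms."""
    mean_p = sum(cosine(word_vec, p) for p in pleasant) / len(pleasant)
    mean_u = sum(cosine(word_vec, u) for u in unpleasant) / len(unpleasant)
    return mean_p - mean_u

# Hypothetical 2-D "embeddings" (real embeddings have hundreds of dimensions).
pleasant = [(1.0, 0.0), (0.9, 0.1)]
unpleasant = [(0.0, 1.0), (0.1, 0.9)]
name_x = (0.8, 0.2)   # made-up vector for a name from one group
name_y = (0.3, 0.7)   # made-up vector for a name from another group

score_x = association(name_x, pleasant, unpleasant)
score_y = association(name_y, pleasant, unpleasant)
print(f"name_x: {score_x:+.3f}  name_y: {score_y:+.3f}")
```

A positive score means the name sits closer to the pleasant terms, a negative score closer to the unpleasant ones. A systematic gap between groups of names would be the kind of inherited bias described above.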


Human Bias


Human Bias can be Learned - 7


As we stated in our 2008 SIGKDD paper that started the field of discrimination-aware data mining [PRT08]:

“learning from historical data recording human decision making may mean to discover traditional prejudices that are endemic in reality, and to assign to such practices the status of general rules, maybe unconsciously, as these rules can be deeply hidden within the learned classifier.”


Policies
BIG DATA ETHICS

Satya Nadella's rules for AI

www.theverge.com/2016/6/29/12057516/satya-nadella-ai-robot-laws


U.S. – F.T.C.

Salvatore Ruggieri

www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf (Sept. 2014)


U.S. – White House


www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf (May 2014)


U.S. – White House


www.whitehouse.gov/sites/default/files/microsites/ostp/2016_0504_data_discrimination.pdf (May 2016)


U.S. – White House

www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf (October 2016)

E.U. - EDPS


secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2015/15-11-19_Big_Data_EN.pdf






E.U. - EDPS


secure.edps.europa.eu/EDPSWEB/webdav/site/mySite/shared/Documents/Consultation/Opinions/2015/15-09-11_Data_Ethics_EN.pdf







Netherlands

www.knaw.nl/en/news/publications/ethical-and-legal-aspects-of-informatics-research (September 2016)

Big Data Ethics

informationaccountability.org/big-data-ethics-initiative/

Value-Sensitive Design

◦ Design for privacy
◦ Design for security
◦ Design for inclusion
◦ Design for sustainability
◦ Design for democracy
◦ Design for safety
◦ Design for transparency
◦ Design for accountability
◦ Design for human capabilities

EU Projects: SoBigData.eu

Social Mining & Big Data Ecosystem project (SoBigData, H2020-INFRAIA-2014-2015, duration: 2015-2019), www.sobigdata.eu

Data Ethics Literacy

MIUR Report on Big Data, 28 July 2016
◦ www.istruzione.it/allegati/2016/bigdata.pdf

UNIPI Master in Big Data Analytics & Social Mining
◦ masterbigdata.it

Data ethics technologies
DISCRIMINATION DISCOVERY FROM DATA

Discrimination discovery

Given:

◦ a historical database of decision records, each describing the features of an applicant for a benefit
  ◦ e.g., a credit request to a bank and the corresponding credit approval/denial

◦ some designated categories of applicants, such as groups protected by anti-discrimination laws,

find whether, and in which circumstances, there is evidence of discrimination against the designated categories that emerges from the data.

DCUBE: Discrimination Discovery in Databases

German Credit dataset


How? Fight with the same weapons

Idea: use data mining to discover discrimination

◦ the decision policies hidden in a database can be represented by decision rules and discovered by frequent pattern mining

◦ Once all such decision rules have been found, highlight all potential niches of discrimination by filtering the rules using a measure that quantifies the discrimination risk.
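The first step above, extracting decision rules with their support and confidence, can be sketched as follows. This is a brute-force enumeration written for illustration, not an Apriori-style frequent pattern miner and not the DCUBE implementation. The function `mine_rules`, the choice of support as an absolute record count, and the toy records are all assumptions.

```python
from itertools import combinations

def mine_rules(records, target, min_support=2, max_len=2):
    """Enumerate rules `premise -> target` with their support and confidence.

    records: list of dicts mapping attribute -> value
    target:  (attribute, value) pair, e.g. ("credit", "bad")
    Support is the number of records matching premise and target;
    confidence is that number divided by the records matching the premise.
    """
    t_attr, t_val = target
    items = sorted({(a, v) for r in records for a, v in r.items() if a != t_attr})
    rules = []
    for size in range(1, max_len + 1):
        for premise in combinations(items, size):
            # Skip premises that constrain the same attribute twice.
            if len({a for a, _ in premise}) < size:
                continue
            covered = [r for r in records
                       if all(r.get(a) == v for a, v in premise)]
            hits = sum(1 for r in covered if r.get(t_attr) == t_val)
            if covered and hits >= min_support:
                rules.append((premise, hits, hits / len(covered)))
    return rules

# Hypothetical toy decision records (not the German Credit dataset).
records = [
    {"foreign_worker": "yes", "housing": "own", "credit": "bad"},
    {"foreign_worker": "yes", "housing": "own", "credit": "bad"},
    {"foreign_worker": "yes", "housing": "rent", "credit": "good"},
    {"foreign_worker": "no", "housing": "own", "credit": "good"},
    {"foreign_worker": "no", "housing": "own", "credit": "good"},
    {"foreign_worker": "no", "housing": "rent", "credit": "bad"},
]
rules = mine_rules(records, ("credit", "bad"))
for premise, supp, conf in rules:
    lhs = " & ".join(f"{a}={v}" for a, v in premise)
    print(f"{lhs} -> credit=bad  supp={supp} conf={conf:.2f}")
```

The second step, filtering by a discrimination measure, would then be applied to the rules whose premise contains a protected item.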


Discrimination discovery from data

FOREIGN_WORKER=yes & PURPOSE=new_car & HOUSING=own → CREDIT=bad
◦ elift = 5.19, supp = 56, conf = 0.37

elift = 5.19 means that foreign workers are more than 5 times as likely to be refused credit as the average population (even if they own their house).
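The elift value can be read as a ratio of two confidences. A minimal sketch follows, assuming the extended-lift definition elift = conf(A, B → C) / conf(B → C), where A is the protected item (here FOREIGN_WORKER=yes) and B is the remaining context. The slide itself does not spell out the formula. Under that assumption, the reported figures would imply a baseline confidence of roughly 0.37 / 5.19 ≈ 0.07. The helper names and records below are hypothetical, not the German Credit data.

```python
def confidence(records, premise, target):
    """Fraction of records matching `premise` that also match `target`."""
    covered = [r for r in records
               if all(r.get(a) == v for a, v in premise.items())]
    if not covered:
        raise ValueError("premise covers no records")
    t_attr, t_val = target
    return sum(1 for r in covered if r.get(t_attr) == t_val) / len(covered)

def elift(records, protected, context, target):
    """Assumed definition: conf(protected & context -> target) / conf(context -> target)."""
    base = confidence(records, context, target)
    if base == 0:
        raise ValueError("baseline confidence is zero; elift is undefined")
    return confidence(records, {**protected, **context}, target) / base

# Hypothetical toy records: 10 home owners, 2 of them foreign workers.
records = (
    [{"foreign_worker": "yes", "housing": "own", "credit": "bad"}] * 2
    + [{"foreign_worker": "no", "housing": "own", "credit": "good"}] * 8
)
value = elift(
    records,
    protected={"foreign_worker": "yes"},
    context={"housing": "own"},
    target=("credit", "bad"),
)
print(f"elift = {value:.2f}")
```

In the toy data the baseline refusal rate among home owners is 0.2 and the rate among foreign-worker home owners is 1.0, so the ratio is 5: the protected group is refused five times as often as the context population as a whole.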


Outcome: Funded, Not funded, Conditionally funded

Case Study: grant evaluation


Dataset attributes


Features of the PI

Project costs

Research Area

Project Evaluation

A potentially discriminatory rule

Antecedent
◦ Project proposals in “Physical and Analytical Chemical Sciences”
◦ Young females
◦ Total cost of 1,358,000 Euros or above

Possible interpretation
◦ “Peer-reviewers of panel PE4 trusted young females requiring high budgets less than males leading similar projects”


Case study: US Harmonized Tariff System

US Harmonized Tariff System (HTS)

https://hts.usitc.gov/

Detailed tariff classification system for merchandise imported to US

Chapters 61, 62, 64, 65: apparel
◦ Different taxes for the same garments produced separately for males and females
◦ Descriptions are in semi-structured form

                  Coats               Fur felt hats      Cotton pajamas
Women and girls   64.4¢/kg + 18.8%    96¢/doz + 1.4%     8.5%
Men and boys      38.6¢/kg + 10%      0                  8.9%

Different taxes for same apparels for men and women


Women: 14%    Men: 9%    1.3 billion USD!!!




Totes-Isotoner Corp. v. U.S.

Rack Room Shoes Inc. and Forever 21 Inc. vs U.S.

Court of International Trade

U.S. Court of Appeals for the Federal Circuit (2014)

“[…] the courts may have concluded that Congress had no discriminatory intent when ruling the HTS, but there is little doubt that gender-based tariffs have discriminatory impact”

Sample rule from the HTS dataset


Soccer Player Ratings


How do humans evaluate sports performance?

[Figure: two plots of machine performance against the human evaluation line; first panel: technical features; second panel: technical + contextual features]

Wrapping up


Right of explanation

• Applying AI within many domains requires transparency and responsibility:
  • health care
  • finance
  • surveillance
  • autonomous vehicles
  • government

• EU General Data Protection Regulation (April 2016) establishes (?) a right of explanation for all individuals to obtain “meaningful explanations of the logic involved” when automated (algorithmic) individual decision-making, including profiling, takes place.

• In sharp contrast, (big) data-driven AI/ML models are often black boxes.


Accountability

“Why exactly was my loan application rejected?”

“What could I have done differently so that my application would not have been rejected?”
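The second question asks for a counterfactual: a change to the applicant's data that would have flipped the decision. A minimal sketch follows, assuming a hypothetical linear scoring model where an application is accepted when the score reaches a threshold. The weights, feature names and functions are invented for illustration; real credit models are more complex and rarely this transparent.

```python
def score(applicant, weights, bias):
    """Linear score: bias plus the weighted sum of the applicant's features."""
    return bias + sum(weights[k] * applicant[k] for k in weights)

def single_feature_counterfactuals(applicant, weights, bias, threshold=0.0):
    """For each feature, the value it would need (others held fixed)
    for the score to reach the acceptance threshold.

    Returns an empty dict if the applicant is already accepted.
    """
    current = score(applicant, weights, bias)
    if current >= threshold:
        return {}
    gap = threshold - current
    changes = {}
    for name, w in weights.items():
        if w != 0:  # a zero-weight feature cannot change the score
            changes[name] = applicant[name] + gap / w
    return changes

# Hypothetical model and applicant (made-up numbers).
weights = {"income": 0.5, "debt": -1.0}
bias = -2.0
applicant = {"income": 3.0, "debt": 1.0}

current = score(applicant, weights, bias)
needed = single_feature_counterfactuals(applicant, weights, bias)
print(f"score = {current}, rejected = {current < 0.0}")
for name, value in needed.items():
    print(f"accepted if {name} were {value} (now {applicant[name]})")
```

The sketch does not check feasibility: here the debt counterfactual is a negative value, which no applicant could reach, so a usable explanation would also need constraints on which changes are possible.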


Social Mining & Big Data Ecosystem

www.sobigdata.eu


Knowledge Discovery & Data Mining Lab
http://kdd.isti.cnr.it

Special thanks

• Salvatore Ruggieri
• Franco Turini
• Fosca Giannotti
• Anna Monreale
• Luca Pappalardo
