Data, Responsibly: The Next Decade of Data Science Bill Howe, PhD Associate Professor, Information School Associate Director, eScience Institute Adjunct Associate Professor, Computer Science & Engineering University of Washington

Page 1: Data, Responsibly: The Next Decade of Data Science

Data, Responsibly: The Next Decade of Data Science

Bill Howe, PhD
Associate Professor, Information School
Associate Director, eScience Institute
Adjunct Associate Professor, Computer Science & Engineering
University of Washington

Page 2: Data, Responsibly: The Next Decade of Data Science

My goals this evening
• Describe emerging topics in data science research and practice around a technical interpretation of ethics
• Describe some specific thrusts we are pursuing
• Encourage you to get involved

Data, Responsibly / SciTech NW

Page 3: Data, Responsibly: The Next Decade of Data Science

How much time do you spend “handling data” as opposed to “doing science”?

Mode answer: “90%”

Bill Howe, UW

Page 4: Data, Responsibly: The Next Decade of Data Science

1) Upload data “as is”
Cloud-hosted; no need to install or design a database; no pre-defined schema

2) Analyze data with SQL
Right in your browser, writing queries on top of queries on top of queries ...

SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit ORDER BY cnt DESC

3) Share the results
Click on the science question, see the SQL that answers it
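A minimal sketch of the "queries on top of queries" idea, using Python's built-in sqlite3 module rather than SQLShare itself. Only the table name tigrfam_surface and the column hit come from the slide; the sample rows and the view names hit_counts and frequent_hits are hypothetical.

```python
import sqlite3

# Hypothetical rows standing in for an uploaded table; only the table and
# column names (tigrfam_surface, hit) come from the slide.
rows = [("TIGR00001",), ("TIGR00002",), ("TIGR00001",), ("TIGR00003",), ("TIGR00001",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tigrfam_surface (hit TEXT)")
conn.executemany("INSERT INTO tigrfam_surface VALUES (?)", rows)

# First query, saved as a view: count occurrences of each hit.
conn.execute(
    "CREATE VIEW hit_counts AS "
    "SELECT hit, COUNT(*) AS cnt FROM tigrfam_surface GROUP BY hit"
)

# Second query, written on top of the first: keep hits seen more than once.
conn.execute(
    "CREATE VIEW frequent_hits AS SELECT hit, cnt FROM hit_counts WHERE cnt > 1"
)

frequent = conn.execute(
    "SELECT hit, cnt FROM frequent_hits ORDER BY cnt DESC"
).fetchall()
print(frequent)
```

Each view is just a named query, so a shared result can always be traced back to the SQL that produced it.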

Page 5: Data, Responsibly: The Next Decade of Data Science

[Figure: a stack of multiple GUIs/apps over multiple languages (SQL, Datalog, MyriaL), each compiled to multiple big data systems (MyriaX, RDBMS, PGAS/HPC, Serial C++, SPARQL (GEMS), Hadoop via layers)]

Page 6: Data, Responsibly: The Next Decade of Data Science

Making it easier to do data science
• SQLShare: Easier to use a database
• Myria: Easier to use a bunch of different systems at once, at scale
• Worked great in the physical sciences
• But some collaborators weren’t that excited…

Page 7: Data, Responsibly: The Next Decade of Data Science

Data Science Kickoff Session: 137 posters from 30+ departments and units

Page 8: Data, Responsibly: The Next Decade of Data Science

Data, data, data

Kevin Merritt, CEO, Socrata

Deep Dhillon, CTO, Socrata

Page 9: Data, Responsibly: The Next Decade of Data Science

• Pursue transformative interdisciplinary urban research
• Facilitate translation from UW to .gov stakeholders
• Position Seattle/UW as a leader in applied urban research
• 80+ faculty from 20+ departments around campus

Page 10: Data, Responsibly: The Next Decade of Data Science

Assessing Community Well-Being
Third-Place Technologies

Optimization of King County Metro Paratransit
Computer Science & Engineering

Predictors of Permanent Housing for Homeless Families
Bill and Melinda Gates Foundation

Open Sidewalk Graph for Accessible Trip Planning
Computer Science & Engineering

Inaugural 2015 program:
• 16 spots
• 140 applicants
• …from 20+ departments

Page 11: Data, Responsibly: The Next Decade of Data Science

Mining Online Data to Detect Unsafe Food Products
Elaine Nsoesie, Institute for Health Metrics and Evaluation

ORCA data for improved transit system planning and operation
Washington State Transportation Center (TRAC)

Global Open Sidewalks: Creating a shared open data layer
Taskar Center for Accessible Technology

CrowdSensing Census: A tool for estimating poverty
Bell Labs, Nokia

2016 program:
• 16 spots
• 190 applicants

New in 2016: An explicit emphasis on data ethics

Page 12: Data, Responsibly: The Next Decade of Data Science

Page 13: Data, Responsibly: The Next Decade of Data Science

July 2016

“Data, Responsibly” Dagstuhl Workshop

Gerhard Weikum

Serge Abiteboul

Julia Stoyanovich

Gerome Miklau

Page 14: Data, Responsibly: The Next Decade of Data Science

Cathy O’Neil

September 2016

Three properties of a WMD (“weapon of math destruction”):
• Opacity
• Scale
• Damage

Page 15: Data, Responsibly: The Next Decade of Data Science

The way I think about this… (1)

First decade of Data Science research and practice:

What can we do with massive, noisy, heterogeneous datasets?

Next decade of Data Science research and practice:

What should we do with massive, noisy, heterogeneous datasets?

Page 16: Data, Responsibly: The Next Decade of Data Science

The way I think about this… (2)

Decisions are based on two sources of information:

1. Past examples
e.g., “prior arrests tend to increase likelihood of future arrests”

2. Societal constraints
e.g., “we must avoid racial discrimination”

We’ve become very good at automating the use of past examples

We’ve only just started to think about incorporating societal constraints

Page 17: Data, Responsibly: The Next Decade of Data Science

The way I think about this… (3)

How do we apply societal constraints to algorithmic decision-making?

Option 1: Keep a human in the loop
Ex: EU General Data Protection Regulation requires that a human be involved in legally binding algorithmic decision-making
Ex: Wisconsin Supreme Court says a human must review algorithmic decisions made by recidivism models

Option 2: Build them into the algorithms themselves
I’ll talk about some approaches for this

Page 18: Data, Responsibly: The Next Decade of Data Science

The way I think about this… (4)

On transparency vs. accountability:
• For human decision-making, sometimes explanations are required, improving transparency
  – Supreme court decisions
  – Employee reprimands/termination
• But when transparency is difficult, accountability takes over
  – medical emergencies, business decisions
• As we shift decisions to algorithms, we lose both transparency AND accountability
• “The buck stops where?”

Page 19: Data, Responsibly: The Next Decade of Data Science

Some Facets of “Data, Responsibly”

• Privacy: I won’t be talking about this
• Fairness: I’ll give a taste of the work here
• Transparency: I won’t be talking about this
• Reproducibility: Towards automatic scientific claim-checking
• Ethics: Vignette on teaching data ethics

Page 20: Data, Responsibly: The Next Decade of Data Science

FAIRNESS

Page 21: Data, Responsibly: The Next Decade of Data Science

Ex: Staples online pricing

Reasoning: Offer deals to people that live near competitors’ stores
Effect: lower prices offered to buyers who live in more affluent neighborhoods

Page 22: Data, Responsibly: The Next Decade of Data Science

[Latanya Sweeney; CACM 2013]

Racially identifying names trigger ads suggestive of an arrest record

slide adapted from Stoyanovich, Miklau

Page 23: Data, Responsibly: The Next Decade of Data Science

Propublica, May 2016

Page 24: Data, Responsibly: The Next Decade of Data Science

The Special Committee on Criminal Justice Reform's hearing on reducing the pre-trial jail population.

Technical.ly, September 2016

Philadelphia is grappling with the prospect of a racist computer algorithm

Any background signal in the data of institutional racism is
• amplified by the algorithm
• operationalized by the algorithm
• legitimized by the algorithm

“Should I be afraid of risk assessment tools?”

“No, you gotta tell me a lot more about yourself. At what age were you first arrested? What is the date of your most recent crime?”

“And what’s the culture of policing in the neighborhood in which I grew up in?”

Page 25: Data, Responsibly: The Next Decade of Data Science

Towards a precise characterization of fairness…

Positive Outcomes    | Negative Outcomes
offered employment   | denied employment
accepted to school   | rejected from school
offered a loan       | denied a loan
offered a discount   | not offered a discount

Label each individual's outcome as positive or negative

Fairness is concerned with how outcomes are assigned to a population

slide adapted from Stoyanovich, Miklau

Page 26: Data, Responsibly: The Next Decade of Data Science

Statistical parity

[Figure: individuals grouped by race (black, white), each marked ⊕ (positive outcome) or ⊖ (negative outcome). Positive outcomes go to 40% of the whole population, 20% of black individuals, and 60% of white individuals.]

Statistical parity: demographics of the individuals receiving any outcome are the same as demographics of the underlying population

slide adapted from Stoyanovich, Miklau

Page 27: Data, Responsibly: The Next Decade of Data Science

First attempt: Ignore sensitive information

[Figure: individuals grouped by race (black, white) and zip code (10025, 10027), each marked ⊕ or ⊖. Positive outcomes go to 20% of black individuals and 60% of white individuals.]

Removing race from the vendor’s assignment process does not prevent discrimination

Assessing disparate impact
Discrimination is assessed by the effect on the protected sub-population, not by the input or by the process that led to the effect.

slide adapted from Stoyanovich, Miklau

Page 28: Data, Responsibly: The Next Decade of Data Science

More directly: Impose statistical parity

[Figure: individuals grouped by race (black, white) and credit score (good, bad), each marked ⊕ or ⊖. Positive outcome: offered a loan. Positive outcomes go to 40% of black individuals and 40% of white individuals.]

Tradeoff between (perceived) accuracy and fairness; may be contrary to the goals of the vendor

slide adapted from Stoyanovich, Miklau
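One simple way to impose parity, sketched here as an illustration rather than as the method behind the slide, is to offer the positive outcome to the same share of each group, ranked by score within the group. The applicants, their scores, and the function names are hypothetical.

```python
def select_by_score(applicants, rate):
    """Baseline: offer loans to the top `rate` share overall, ignoring group."""
    k = round(rate * len(applicants))
    return sorted(applicants, key=lambda a: a["score"], reverse=True)[:k]

def select_with_parity(applicants, rate):
    """Offer loans to the top `rate` share within each group separately."""
    by_group = {}
    for a in applicants:
        by_group.setdefault(a["group"], []).append(a)
    selected = []
    for members in by_group.values():
        k = round(rate * len(members))
        ranked = sorted(members, key=lambda a: a["score"], reverse=True)
        selected.extend(ranked[:k])
    return selected

def mean_score(selected):
    return sum(a["score"] for a in selected) / len(selected)

def count(selected, group):
    return sum(1 for a in selected if a["group"] == group)

# Hypothetical credit scores for five applicants per group.
applicants = (
    [{"group": "black", "score": s} for s in (700, 640, 600, 580, 550)]
    + [{"group": "white", "score": s} for s in (720, 710, 690, 610, 560)]
)

baseline = select_by_score(applicants, 0.4)
fair = select_with_parity(applicants, 0.4)
print(count(baseline, "black"), count(baseline, "white"), mean_score(baseline))
print(count(fair, "black"), count(fair, "white"), mean_score(fair))
```

With these made-up scores the parity-constrained selection has a lower mean score than the baseline, which is the (perceived) accuracy cost the slide refers to.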

Page 29: Data, Responsibly: The Next Decade of Data Science

A systems approach:
FairTest: fairness test suite for data analysis apps

• Tests for unintentional discrimination according to several representative discrimination measures.

• Automates search for context-specific associations between protected variables and application outputs

• Reports findings, ranked by association strength and affected population size

[F. Tramèr et al., arXiv:1510.02377 (2015)]

Page 30: Data, Responsibly: The Next Decade of Data Science

As a corporation, should I care?

Compliance

Jacobson, Scientific American, 2013

Customer Retention

Employee Retention

Eichler, Huffington Post, 2012

CNET, May 2016

Page 31: Data, Responsibly: The Next Decade of Data Science

REPRODUCIBILITY

Page 32: Data, Responsibly: The Next Decade of Data Science

Science is a complete mess
• Reproducibility
  – Begley & Ellis, Nature 2012: 6 out of 53 cancer studies reproducible
  – Only about half of 100 psychology studies had effect sizes that approximated the original result (Science, 2015)
  – Ioannidis 2005: Why most published research findings are false
  – Reinhart & Rogoff: global economic policy based on spreadsheet errors

Page 33: Data, Responsibly: The Next Decade of Data Science

Science, 2015

Page 34: Data, Responsibly: The Next Decade of Data Science

Retractions are increasing…

Page 35: Data, Responsibly: The Next Decade of Data Science
Page 36: Data, Responsibly: The Next Decade of Data Science

Why is this happening? (1)

Page 37: Data, Responsibly: The Next Decade of Data Science

Why is this happening? (2)

Page 38: Data, Responsibly: The Next Decade of Data Science

Why is this happening? (2)

Publication Bias!

Page 39: Data, Responsibly: The Next Decade of Data Science

“DEEP CURATION”
TOWARDS AUTOMATIC SCIENTIFIC CLAIM CHECKING

Page 40: Data, Responsibly: The Next Decade of Data Science

• Vision: Validate scientific claims automatically
  – Check for manipulation (manipulated images, Benford’s Law)
  – Extract claims from papers
  – Check claims against the authors’ data
  – Check claims against related data sets
  – Automatic meta-analysis across the literature + public datasets

• First steps
  – Automatic curation: Validate and attach metadata to public datasets
  – Longitudinal analysis of the visual literature
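As one concrete instance of the manipulation checks listed above, here is a minimal sketch of a Benford's Law test. Benford's Law says that in many naturally occurring datasets the leading digit d appears with probability log10(1 + 1/d). The function names and the two sample datasets are hypothetical, and a real screening tool would need far more care, since not all legitimate data follows Benford's Law.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a nonzero number."""
    return int(f"{abs(x):.15e}"[0])

def benford_chi_square(values):
    """Chi-square distance between observed leading digits and Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # Benford: P(d) = log10(1 + 1/d)
        stat += (counts.get(d, 0) - expected) ** 2 / expected
    return stat

# Powers of two are a classic sequence that follows Benford's Law closely.
powers_of_two = [2 ** k for k in range(200)]
# Every leading digit equally likely: a poor fit to Benford's Law.
uniform_block = list(range(100, 1000))

natural_stat = benford_chi_square(powers_of_two)
suspicious_stat = benford_chi_square(uniform_block)
print(round(natural_stat, 2), round(suspicious_stat, 2))
```

A large statistic is only a flag for closer inspection, not evidence of manipulation on its own.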

Data, Responsibly / SciTech NW

Page 41: Data, Responsibly: The Next Decade of Data Science

Microarray experiments

Page 42: Data, Responsibly: The Next Decade of Data Science

Microarray samples submitted to the Gene Expression Omnibus

Curation is fast becoming the bottleneck to data sharing

Maxim Gretchkin

Hoifung Poon

Page 43: Data, Responsibly: The Next Decade of Data Science

Maxim Gretchkin

Hoifung Poon

No growth in number of datasets used per paper!

Page 44: Data, Responsibly: The Next Decade of Data Science

Maxim Gretchkin

Hoifung Poon

Majority of samples are one-time-use only!

Page 45: Data, Responsibly: The Next Decade of Data Science

color = labels supplied as metadata

clusters = 1st two PCA dimensions on the gene expression data itself

Can we curate algorithmically?

Maxim Gretchkin

Hoifung Poon

The expression data and the text labels appear to disagree

Page 46: Data, Responsibly: The Next Decade of Data Science

Maxim Gretchkin

Hoifung Poon

[Figure: pipeline combining domain knowledge (ontology), expression data, and free-text metadata through 2 deep networks (text, expr) and an SVM to produce better tissue type labels]

Page 47: Data, Responsibly: The Next Decade of Data Science

Deep Curation

Maxim Gretchkin

Hoifung Poon

Distant supervision and co-learning between a text-based classifier and an expression-based classifier: both models improve by training on each other’s results.

[Figure panels: Free-text classifier; Expression classifier]

Page 48: Data, Responsibly: The Next Decade of Data Science

Deep Curation: Our stuff wins, with no training data

Maxim Gretchkin

Hoifung Poon

[Figure: comparison by amount of training data used: state of the art; our reimplementation of the state of the art; our dueling pianos NN]

Page 49: Data, Responsibly: The Next Decade of Data Science

VIGNETTE ON TEACHING DATA ETHICS

Page 50: Data, Responsibly: The Next Decade of Data Science

Alcohol Study, Barrow, Alaska, 1979

Native leaders and city officials, worried about drinking and associated violence in their community, invited a group of sociology researchers to assess the problem and work with them to devise solutions.

Page 51: Data, Responsibly: The Next Decade of Data Science

Methods
• 10% representative sample (N=88) of everyone over the age of 15 using a 1972 demographic survey
• Interviewed on attitudes and values about use of alcohol
• Obtained psychological histories including drinking behavior
• Given the Michigan Alcoholism Screening Test (Seltzer, 1971)
• Asked to draw a picture of a person
  – Used to determine cultural identity

Page 52: Data, Responsibly: The Next Decade of Data Science

Results announced unilaterally and publicly

At the conclusion of the study, researchers formulated a report entitled “The Inupiat, Economics and Alcohol on the Alaskan North Slope”, which was released simultaneously at a press conference and to the Barrow community. The press release was picked up by the New York Times, which ran a front-page story entitled “Alcohol Plagues Eskimos”.

Page 53: Data, Responsibly: The Next Decade of Data Science

The results of the Barrow Alcohol Study in Alaska were revealed in the context of a press conference that was held far from the Native village, and without the presence, much less the knowledge or consent, of any community member who might have been able to present any context concerning the socioeconomic conditions of the village. Study results suggested that nearly all adults in the community were alcoholics. In addition to the shame felt by community members, the town’s Standard and Poor bond rating suffered as a result, which in turn decreased the tribe’s ability to secure funding for much needed projects.

Backlash

Page 54: Data, Responsibly: The Next Decade of Data Science

Methodological Problems

“The authors once again met with the Barrow Technical Advisory Group, who stated their concern that only Natives were studied, and that outsiders in town had not been included.”

“The estimates of the frequency of intoxication based on association with the probability of being detained were termed ‘ludicrous, both logically and statistically.’”

Edward F. Foulks, M.D., Misalliances In The Barrow Alcohol Study

Page 55: Data, Responsibly: The Next Decade of Data Science

Ethical Problems
• Participants were not in control of their data nor the context in which they were presented.
• Easy to demonstrate specific, significant harms:
  – Social: Stigmatization
  – Financial: Bond rating lowered
• Important: Nothing to do with individual privacy
  – No PII revealed at any point, to anyone
  – No violations of best practices in data handling
  – But even those who did not participate in the study incurred harm

Page 56: Data, Responsibly: The Next Decade of Data Science

Two Topics
• Social Component: Codes of Conduct
• Technical Component: Managing Sensitive Data

Page 57: Data, Responsibly: The Next Decade of Data Science

Ethical principles vs. ethical rules
• In the Barrow example, ethical rules were generally followed
• But ethical principles were violated: The researchers appear to have placed their own interests ahead of those of the research subjects, the client, and society

Page 58: Data, Responsibly: The Next Decade of Data Science

Principles: Codes of Conduct

• American Statistical Association
  – http://www.amstat.org/committees/ethics/
• Certified Analytics Professional
  – https://www.certifiedanalytics.org/ethics.php
• Data Science Association
  – http://www.datascienceassn.org/code-of-conduct.html

Page 59: Data, Responsibly: The Next Decade of Data Science

Recap
• There’s a sea change underway in how we will teach and practice data science
• No longer only about what can be done, but about what should be done
• This is not just a policy/behavior/culture issue – there are technical problems to solve
• If you’re not thinking about this stuff, you will be facing retention issues and compliance issues very soon
  – Witness privacy, which is a few years ahead