
Center for Ethics & Policy

Ethical and Scientific Issues in Developing Deep Learning Systems in Medicine: More Like Drugs Than You Might Think

Alex John London, Ph.D. Clara L. West Professor of Ethics and Philosophy Director, Center for Ethics and Policy Carnegie Mellon University

Twitter: @AlexJohnLondon


Disclosure

I. I have no conflicts to disclose regarding the material in this talk.

II. Mention of a product, system or approach is not an endorsement.


Agenda

I. An example of medical AI.

II. Why do we need RCTs in medicine?

III. Division of scientific labor across drug development life cycle.

IV. Commonalities with Deep Learning Systems


Ethics and Uncertainty

I. Medical uncertainty is an ethical issue

a. Stakeholders rely on medical information for momentous decisions¹:

1. Patients and providers

2. Organizing health systems

3. Spending scarce resources

b. Uncertainty creates inefficiencies for stakeholders

c. Resolution and management of uncertainty is a key aspect of a learning health system².

¹ London, A. J. (2012). A non-paternalistic model of research ethics and oversight: assessing the benefits of prospective review. The Journal of Law, Medicine & Ethics, 40(4), 930-944.

² London, A. J. (2018). Learning health systems, clinical equipoise and the ethics of response adaptive randomisation. Journal of Medical Ethics, 44(6), 409-415.


Contrast Case

I. Structural engineers don’t build 50 physical instances of one design and 50 physical instances of another to see which one works best.


Contrast Case

I. Comprehensive models of relevant causal systems:

a. Soil composition

b. Weather and climate

c. Material tensile strength, durability…

d. Structural capacities: load…

e. Stresses and loads from use cases…

II. Simulations based on models are reliable and capture relevant causal relationships.


A Major Difference

I. In medicine our knowledge of causal structures is limited and unreliable.

II. Powerful machine learning techniques leverage associations in large data sets and computing power to make predictions/classifications.

III. Utility of these systems depends on empirical verification of accuracy and reliability.

a. Trust hinges on independence and rigor of verification.


Deep Learning Systems

I. Deep learning neural networks

a. Humans:

1. Specify a use case

2. Label datasets to “train” the system.

3. Design system architecture

i. E.g., number of “layers” used.

b. System:

1. Learns what features to detect (from data)

2. Learns potentially very complex functions from features to labels

3. Given a new case, it outputs a label or a probable label.


Example: AI Screening

I. April 2018: FDA permitted marketing of IDx-DR, an AI software system for detecting “more-than-mild” diabetic retinopathy in adults with diabetes.

a. “IDx-DR is the first device authorized for marketing that provides a screening decision without the need for a clinician to also interpret the image or results, which makes it usable by health care providers who may not normally be involved in eye care.”¹

¹ https://www.fda.gov/newsevents/newsroom/pressannouncements/ucm604357.htm


IDx-DR: Hybrid System

I. CNNs trained to detect hemorrhages, exudates and other lesions.

II. Model connecting pathologies to diagnostic label reflects expert knowledge.

Source: https://www.reviewofophthalmology.com/article/machine-learning-for-diabetic-retinopathy


Types of Approach

I. Gulshan and colleagues¹:

a. CNN to detect DR using all information in retinal image.

b. “Another fundamental limitation arises from the nature of deep networks, in which the neural network was provided with only the image and associated grade, without explicit definitions of features (eg, microaneurysms, exudates). Because the network “learned” the features that were most predictive for the referability implicitly, it is possible that the algorithm is using features previously unknown to or ignored by humans.”

¹ JAMA. 2016;316(22):2402-2410. doi:10.1001/jama.2016.17216


Why Do We Need Trials

I. Models of pathophysiology and intervention mechanism are often unreliable or incomplete.

a. 9 of 10 new drugs fail to win approval for any indication.¹

1. Success rates are far lower in some areas.

i. Neuroprotective agents for Alzheimer's and Parkinson’s

ii. Mega-dose vitamins and the CARET trial.

2. 50% failure rate in phase III.

¹ Thomas, D. W., Burns, J., Audette, J., Carrol, A., Dow-Hygelund, C. and Hay, M. (2016). Clinical Development Success Rates 2006–2015. San Diego: Biomedtracker.


Why Do We Need Trials

II. In the absence of complete and accurate causal models, causal claims must be tested empirically.

a. Trials directly test specific claims.

b. Indirectly test theories of pathophysiology and mechanism.



III. Models of pathophysiology and intervention mechanism guide intervention use.

a. Patients differ from trial populations.

b. Clinicians extrapolate efficacy and toxicity for differences in race, gender, age, co-morbidities…

c. Inferences go beyond validated medical evidence.


(Mis)Understanding Translation

(source: FDA Innovation or Stagnation)

California Institute of Regenerative Medicine. “Progress toward therapies: Path to the Clinic” (http://www.cirm.ca.gov/path-clinic)


Structure of Translation¹

[Diagram: Theoretical Understanding]

(1) Theories include:

a. Mechanism of physiology and disease

b. Mechanism of intervention

c. Pharmacodynamics

d. Pharmacokinetics

¹ Kimmelman J and London AJ. (2015) The Structure of Clinical Translation: Efficiency, Information, and Ethics. The Hastings Center Report 45, no. 2 (2015): 27-39. DOI: 10.1002/hast.433


Structure of Translation

[Diagram: Theoretical Understanding, Intervention Ensemble Assembly and Exploratory Trials, with label (1)]


Intervention Ensemble

I. Dimensions of relevance

a. Indication

b. Dose

c. Schedule

d. Co-intervention

e. Diagnostic modalities

II. Values on dimensions that approximate optimal performance


[Diagram: Theoretical Understanding, Intervention Ensemble Assembly, Exploratory Trials and Confirmatory Trials, with labels (1) and (2)]


[Diagram: Theoretical Understanding, Intervention Ensemble Assembly, Exploratory Trials, Confirmatory Trials and Practice, with labels (1), (2) and (3)]




Deep Learning

Atheoretic:

– Don’t use our domain knowledge for task.

– Learn a model from data

Associationist:

– Association doesn’t entail causation.

– Different assumptions create different models.

Black Box:

– We know how system generates a model.

– We often can’t directly inspect and understand the model.


Verifiability of Performance

I. Precision

a. p(+ | test +)

b. 8/10 vs 11/12

II. Sensitivity

a. p(test + | +)

b. 8/12 vs 11/12

III. Specificity

a. p(test - | -)

b. 16/18 vs 17/18

[Figure: panels (A) and (B), each a grid of 12 positive (+) and 18 negative (-) cases.]
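The three measures above can be recomputed from confusion-matrix counts. The sketch below is illustrative only. It assumes the first fraction in each pair describes system (A) and the second describes system (B), and the true/false positive and negative counts are inferred from those fractions rather than stated on the slide. The function name `screening_metrics` is hypothetical.

```python
def screening_metrics(tp, fp, fn, tn):
    """Return each measure as a (numerator, denominator) pair."""
    return {
        "precision": (tp, tp + fp),    # p(+ | test +)
        "sensitivity": (tp, tp + fn),  # p(test + | +)
        "specificity": (tn, tn + fp),  # p(test - | -)
    }


# Assumed counts, inferred from the fractions on the slide (30 cases each).
system_a = screening_metrics(tp=8, fp=2, fn=4, tn=16)
system_b = screening_metrics(tp=11, fp=1, fn=1, tn=17)

for measure in ("precision", "sensitivity", "specificity"):
    a_num, a_den = system_a[measure]
    b_num, b_den = system_b[measure]
    print(f"{measure}: {a_num}/{a_den} vs {b_num}/{b_den}")
```

Under these assumed counts the loop prints the same three pairs of fractions listed on the slide, which is a quick check that the figures are mutually consistent for a set of 12 positive and 18 negative cases.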


[Diagram: Decision Task, Training Data, Algorithm, Standardizations and Exploratory Testing, with label (1)]



ML Ensemble Development

I. Use Case:

a. Task to be performed?

1. Better way to do what we already do?

2. Way to do something new?

b. Relationship to clinical endpoints?

1. Direct

2. Indirect

3. Validated

4. Novel


ML Ensemble Development

II. Training data

a. Does it support the use case?

b. How representative is it?

1. Racial or other biases?

c. Raw data or curated in some way?

III. Algorithm

a. Robust across datasets?

b. Accuracy?



I. Algorithms learn models from training sets; performance is measured on test sets.

II. Test sets may not reflect real world data.

a. Higher quality images

b. Curated or standardized fields, missing data

c. Common, arbitrary associations

1. Artifacts, clinical practices…

2. Devices correlate with patient demographics

III. Performance characteristics may not transfer to real world data.
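One simple reason performance may not transfer, offered here as an illustration rather than as a claim from the talk: even if sensitivity and specificity stay fixed, precision depends on how common the condition is in the population being screened. The sketch applies Bayes' rule. The function name and the 5% prevalence are hypothetical, and the sensitivity and specificity are the 11/12 and 17/18 figures from the earlier slide.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Precision, p(+ | test +), via Bayes' rule."""
    # Share of the whole population that is positive and tests positive.
    true_pos_share = sensitivity * prevalence
    # Share of the whole population that is negative but tests positive.
    false_pos_share = (1 - specificity) * (1 - prevalence)
    return true_pos_share / (true_pos_share + false_pos_share)


sensitivity, specificity = 11 / 12, 17 / 18

# Prevalence in a 30-case test set with 12 positives.
in_test_set = precision_at_prevalence(sensitivity, specificity, 12 / 30)

# Hypothetical screening population in which 5% have the condition.
in_practice = precision_at_prevalence(sensitivity, specificity, 0.05)

print(f"precision at 40% prevalence: {in_test_set:.2f}")
print(f"precision at 5% prevalence: {in_practice:.2f}")
```

With these inputs precision falls from about 0.92 in the test set to below one half in the hypothetical low-prevalence population, although the system itself has not changed.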


Structure of Translation

[Diagram: Decision Task, Training Data, Algorithm, Standardizations, Exploratory Testing and Confirmatory Trials, with label (1)]



IV. Testing systems prospectively:

a. Independent assessment of system performance.

b. Feasibility of replicating conditions needed to achieve performance benchmarks.

1. Standardizations?

i. Image quality?

ii. Data curation?

2. Protocols and reliability.

3. Personnel proficiency.
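Independent assessment also has to reckon with sampling uncertainty: an accuracy figure estimated from a small sample is compatible with a wide range of true values. As a sketch only, the code below computes a Wilson score interval for a proportion. The choice of the Wilson interval, the function name and the example counts (11 of 12, borrowed from the sensitivity figure earlier in the talk) are assumptions for illustration, not the speaker's method.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical example: 11 of 12 true cases detected.
low, high = wilson_interval(11, 12)
print(f"11/12 detected: approximate 95% interval {low:.2f} to {high:.2f}")
```

With only 12 positive cases the interval is wide, running from roughly two thirds to nearly one, which is one way to see why benchmark figures from small or curated samples call for larger prospective evaluation.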



V. Proof of accuracy on specific decision task may not translate into real world effectiveness on clinically meaningful endpoints.


Real World Effectiveness



Ensemble Efficacy

Utilization Factors



Ensemble Efficacy¹

a. Drug

b. Dosage

c. Schedule

d. Population

e. Co-interventions

f. Diagnostic requirements

Utilization Factors

¹ Kimmelman & London (2015) The Structure of Clinical Translation. Hastings Cent Rep. Mar-Apr;45(2):27-39. doi: 10.1002/hast.433. Epub 2015 Jan 27



Ensemble Efficacy

a. Algorithm

b. Use Case

c. Data requirements

1. Standardizations?

2. Curation?

Utilization Factors



Ensemble Efficacy

Utilization Factors

– Provider or Patient Preferences

– Cost

– Tolerance/Adherence

– Clinical capacity

– Awareness



Ensemble Efficacy

Utilization Factors

– Single vs multiple (dose, test, visits…)

– Invasive?

– Less testing

– Time to results


Best Epistemic State

Ensemble2 Efficacy > Ensemble1 Efficacy

Utilization Factors (Ensemble2) > Utilization Factors (Ensemble1)


Pre-Approval Uncertainty

[Diagram: four elements, each marked “?”]


Post-Approval Uncertainty

[Diagram: four elements, each marked “?”]


Conclusions

I. Intervention development is an iterated process.

II. Ensemble of practices necessary to achieve efficacy.

III. Trials validate the utility of this ensemble.

IV. Real world evidence includes utilization factors and comparative effectiveness relative to a clinically meaningful benchmark.


Thank you


Twitter: @AlexJohnLondon
