Data science as a science

Page 2: Data science as a science

Evidence based data analysis @jtleek

Page 3: Data science as a science

Data science as a Science (DSaaS) @jtleek

Page 9: Data science as a science

“Data science is as much art as it is science.”

Page 11: Data science as a science

“Wouldn’t it be amazing if we got 2,000 people to learn statistics!”

-Jeff Leek, 7/17/12

Page 12: Data science as a science

date: 7/19/12
from: [email protected]

Roger let me know you gave him a ballpark figure for the number of students registered for his course “Computing for Data Analysis”. Could you give me an idea of how many have registered for my course “Data Analysis”?

Page 13: Data science as a science

date: 7/19/12
from: [email protected]

Hi Jeff,

7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)

Page 14: Data science as a science

date: 7/19/12
from: [email protected]

You are f**ed.

-roger

Page 15: Data science as a science

9 classes
1 month long
Always open

Page 16: Data science as a science

Data Science Specialization
Total Enrollments: 3,815,890
Total Completions: 409,712

Genomic Data Science Specialization
Total Enrollments: 173,495
Total Completions: 10,826

Executive Data Science Specialization
Total Enrollments: 62,076
Total Completions: 10,957

Page 18: Data science as a science

2

A theoretical model

Data

Page 19: Data science as a science

1

A theoretical model

Data

Page 20: Data science as a science

Y = some outcome
X = some covariate
D = (X,Y)

lm(Y ~ X)
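
The `lm(Y ~ X)` call on this slide is R's ordinary least-squares fit of an outcome on a covariate. As a rough illustration of what that fit computes, here is a minimal Python sketch using only the standard library. The function name `fit_simple_ols` and the simulated data (true intercept 2.0, true slope 0.5) are assumptions made up for this example; they do not come from the talk.

```python
import random


def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of squares of x and cross-products of x and y around their means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data set D = (X, Y): Y depends linearly on X plus noise.
rng = random.Random(1)
X = [rng.gauss(0, 1) for _ in range(200)]
Y = [2.0 + 0.5 * xi + rng.gauss(0, 1) for xi in X]

intercept, slope = fit_simple_ols(X, Y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

The sketch returns only the two coefficients; R's `lm` also reports standard errors and other summaries that are left out here.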

Page 21: Data science as a science

Y = some outcome
X = some covariate
D = (X,Y)

lm(Y ~ X)

Leek and Peng, Nature 2015

Page 22: Data science as a science

$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$

Fithian, Sun and Taylor arXiv 2015

Page 23: Data science as a science

σ-algebra: “what we know”

$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$

Page 24: Data science as a science

“we’ve done nothing”

$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$

Page 25: Data science as a science

“we did model selection”

$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$

Page 26: Data science as a science

“we looked at all the data”

$\mathcal{F}_0 \subseteq \mathcal{F}_0(S) \subseteq \mathcal{F}_0(Y)$

Page 27: Data science as a science

$E[\beta \mid \mathcal{F}_0] \neq E[\beta \mid \mathcal{F}_0(S)]$
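
One way to get a feel for this inequality is a small simulation: the average of an estimate over all analyses differs from its average over only those analyses that passed a selection step. The Python sketch below is an illustration under made-up assumptions, not the construction used in the talk or in Fithian, Sun and Taylor. The true effect (0.2), the standard error (0.1), and the selection rule (keep an estimate only if its z-score exceeds 1.96) are all hypothetical.

```python
import random
import statistics

TRUE_BETA = 0.2   # assumed true effect (hypothetical)
STD_ERROR = 0.1   # assumed standard error of each estimate (hypothetical)


def simulate_estimates(n_sims=20000, seed=42):
    """Draw noisy estimates of TRUE_BETA, one per simulated analysis."""
    rng = random.Random(seed)
    return [rng.gauss(TRUE_BETA, STD_ERROR) for _ in range(n_sims)]


estimates = simulate_estimates()

# Selection step S: keep only estimates that look "significant".
selected = [b for b in estimates if b / STD_ERROR > 1.96]

mean_all = statistics.fmean(estimates)       # average with no selection
mean_selected = statistics.fmean(selected)   # average given selection

print(f"mean of all estimates:      {mean_all:.3f}")
print(f"mean of selected estimates: {mean_selected:.3f}")
print(f"fraction selected:          {len(selected) / len(estimates):.2f}")
```

Under these assumptions the selected estimates average noticeably above the true effect, which is the kind of gap the two conditional expectations describe.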

Page 28: Data science as a science

Population

Question

Hypothesis

Experimental Design

Experimenter

Data

Analysis Plan

Analyst

Code

Estimate

Claim

Patil, Peng and Leek, bioRxiv 2016

Page 29: Data science as a science

Population

Question

Hypothesis

Experimental Design

Experimenter

Data

Analysis Plan

Analyst

Code

Estimate

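Claim

Patil, Peng and Leek, bioRxiv 2016

$\mathcal{F}_0 \subseteq \mathcal{F}_0(1_{P,Q}(H)) \subseteq \mathcal{F}_0(1_{ED}(E)) \subseteq \mathcal{F}_0(1_{ED;E}(D)) \subseteq \mathcal{F}_0(1_{AP;A}(C)) \subseteq \mathcal{F}_0(1_{C}(A^*))$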

Page 31: Data science as a science

2

A theoretical model

Data

Page 32: Data science as a science

Slide courtesy Hadley Wickham

Page 33: Data science as a science

Who? What? When? Why? Where? How?

Slide courtesy Hadley Wickham

Page 34: Data science as a science

Who? What? When? Why? Where? How?

Where Ingo is working

Page 35: Data science as a science

Who? What? When? Why? Where? How?

Slide courtesy Hadley Wickham

Base R

Lasso

dplyr

googlesheets

ppt

Page 36: Data science as a science

Who? What? When? Why? Where? How?

Slide courtesy Hadley Wickham

Bad life choices?

Sparsity!

David Robinson told me

Spreadsheets

Hegemony

Page 37: Data science as a science

Cleveland and McGill JASA 1984

Page 39: Data science as a science

Leek & Peng 2015 PNAS

Page 40: Data science as a science

Experiment 1

Page 41: Data science as a science

Leek and Peng, Science 2015

Page 42: Data science as a science

Population

Question

Hypothesis

Experimental Design

Experimenter

Data

Analysis Plan

Analyst

Code

Estimate

Claim

$E[S \mid \mathcal{F}_0(1_{c}(W))]$

Page 43: Data science as a science

We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked or whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population.

Page 44: Data science as a science

79% vs 17%

Inferential vs Causal

n=47,141

Page 45: Data science as a science

We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked or whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population. We explain we think that the reason for this relationship is because cigarette smoke contains known carcinogens such as benzene, which make cells in lungs become cancerous.

Page 46: Data science as a science

65% vs 32%

Inferential vs Causal

n=47,141

Page 47: Data science as a science

Experiment 2

Page 48: Data science as a science

Population

Question

Hypothesis

Experimental Design

Experimenter

Data

Analysis Plan

Analyst

Code

Estimate

Claim

$E[\text{Est} \mid \mathcal{F}_0(1_{c}(A))]$

Page 51: Data science as a science

69% vs 40%

n=1,985

Page 52: Data science as a science

Experiment 3

Page 54: Data science as a science

$E[\text{Claim} \mid \mathcal{F}_0(1_{\text{set(base)}}(A))] - E[\text{Claim} \mid \mathcal{F}_0(1_{\text{set(ggplot2)}}(A))]$

Population

Question

Hypothesis

Experimental Design

Experimenter

Data

Analysis Plan

Analyst

Code

Estimate

Claim

Page 55: Data science as a science

1. Make a plot that answers the question: what is the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?

2. Make a plot (possibly multi-panel) that answers the question: how does the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) vary by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?

Use only the [ggplot2/base R] graphics system (not base R or lattice) to make your figure.

Page 56: Data science as a science

“Does the plot clearly show the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?”

G: 5/22 (23%) vs. B: 5/12 (42%)

Page 57: Data science as a science

“Does the plot clearly show how the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) varies by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?”

G: 7/22 (32%) vs. B: 5/12 (42%)

Page 58: Data science as a science

“Is the plot visually pleasing?”

G: 21/22 (95%) vs. B: 10/12 (83%)

G: 20/22 (91%) vs. B: 8/12 (67%)

Page 59: Data science as a science

“Do the plot text and labels use full words instead of abbreviations?”

G: 21/22 (95%) vs. B: 12/12 (100%)

G: 11/22 (50%) vs. B: 5/12 (42%)
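
The G (ggplot2) and B (base R) groups are small, so it can help to recompute the reported percentages and see how much evidence a gap such as 5/22 vs. 5/12 carries. The Python sketch below does both for the two "clearly shows the relationship" questions. The counts are copied from the slides. The choice of a two-sided Fisher exact test is an assumption made for illustration; the slides do not say which test, if any, was applied. The helper names are hypothetical.

```python
from math import comb

# (yes, total) counts copied from the slides: ggplot2 group (G), base R group (B).
results = {
    "plot 1 clearly shows relationship": ((5, 22), (5, 12)),
    "plot 2 clearly shows relationship": ((7, 22), (5, 12)),
}


def percent(yes, total):
    """Percentage rounded to the nearest whole number, as on the slides."""
    return round(100 * yes / total)


def fisher_two_sided(a, n1, c, n2):
    """Two-sided Fisher exact p-value comparing a/n1 with c/n2."""
    k_total = a + c                  # total "yes" answers across both groups
    denom = comb(n1 + n2, k_total)

    def pmf(k):
        # Hypergeometric probability of k "yes" answers in the first group.
        return comb(n1, k) * comb(n2, k_total - k) / denom

    lo = max(0, k_total - n2)
    hi = min(n1, k_total)
    p_obs = pmf(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(pmf(k) for k in range(lo, hi + 1) if pmf(k) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


for question, ((g_yes, g_n), (b_yes, b_n)) in results.items():
    p_value = fisher_two_sided(g_yes, g_n, b_yes, b_n)
    print(f"{question}: G {percent(g_yes, g_n)}% vs. B {percent(b_yes, b_n)}%, "
          f"Fisher p = {p_value:.2f}")
```

With 22 and 12 plots per group, differences of this size are hard to tell apart from chance, so the comparisons are best read as preliminary.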

Page 60: Data science as a science

2

A theoretical model

Data

Page 61: Data science as a science

Data science as a Science (DSaaS) @jtleek
