Evidence based data analysis @jtleek
Data science as a Science (DSaaS) @jtleek
“Data science is as much art as it is science.”
“Wouldn’t it be amazing if we got 2,000 people to learn statistics!”
-Jeff Leek
7/17/12
date: 7/19/12
from: [email protected]
Roger let me know you gave him a ballpark figure for the number of students registered for his course "Computing for Data Analysis". Could you give me an idea of how many have registered for my course "Data Analysis"?
date: 7/19/12
from: [email protected]
Hi Jeff,
7,000 students! It's pretty awesome. (You'll be able to check this out yourself next week, once the class sites are up.)
date: 7/19/12
from: [email protected]
You are f**ed.
-roger
9 classes
1 month long
Always open
Data Science Specialization
Total Enrollments: 3,815,890
Total Completions: 409,712
Genomic Data Science Specialization
Total Enrollments: 173,495
Total Completions: 10,826
Executive Data Science Specialization
Total Enrollments: 62,076
Total Completions: 10,957
2 A theoretical model
Data
1 A theoretical model
Data
Y = some outcome
X = some covariate
D = (X,Y)
lm(Y ~ X)
Y = some outcome
X = some covariate
D = (X,Y)
lm(Y ~ X)
Leek and Peng, Nature 2015
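As a rough illustration of what the `lm(Y ~ X)` step does with a data set D = (X, Y), here is a minimal sketch of simple least squares. It is written in Python rather than R so that it runs with only the standard library. The helper name `fit_line` and the toy numbers are assumptions made up for illustration, not something from the slides.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (hypothetical helper)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of squared deviations of x, and cross-product of deviations of x and y.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Toy data, made up for illustration: Y is exactly 1 + 2 * X.
X = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = fit_line(X, Y)
print(intercept, slope)
```

The fit itself is the easy part. Everything that happened before it (which X, which Y, which rows of D) is not visible in this one line.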
$F_0 \subseteq F_0(S) \subseteq F_0(Y)$
Fithian, Sun and Taylor arXiv 2015
σ-algebra: “what we know”
$F_0$: “we’ve done nothing”
$F_0(S)$: “we did model selection”
$F_0(Y)$: “we looked at all the data”
$E[\beta \mid F_0] \neq E[\beta \mid F_0(S)]$
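One way to see why conditioning on a selection step changes an expectation is a small simulation. The sketch below is an illustration under stated assumptions, not the construction in Fithian, Sun and Taylor. It treats the sample mean as the estimate, and "selection" means keeping only replicates whose estimate clears a one-sided cutoff. The true value 0.2, the sample size, the cutoff, and the function name `simulate` are all made up for the example.

```python
import random


def simulate(n_reps=5000, n_obs=20, beta=0.2, seed=1):
    """Compare the average estimate over all replicates with the average
    over only the replicates that pass a selection rule (assumed setup)."""
    rng = random.Random(seed)
    # Assumed selection rule: keep the estimate if it exceeds 1.96 standard errors.
    threshold = 1.96 / n_obs ** 0.5
    all_estimates, selected = [], []
    for _ in range(n_reps):
        draws = [rng.gauss(beta, 1.0) for _ in range(n_obs)]
        estimate = sum(draws) / n_obs
        all_estimates.append(estimate)
        if estimate > threshold:
            selected.append(estimate)
    unconditional = sum(all_estimates) / len(all_estimates)
    conditional = sum(selected) / len(selected)
    return unconditional, conditional, len(selected)


unconditional, conditional, n_selected = simulate()
print("mean of all estimates:      ", round(unconditional, 3))
print("mean of selected estimates: ", round(conditional, 3))
print("replicates selected:        ", n_selected)
```

In this setup the average over all replicates sits near the assumed true value, while the average over the selected replicates is noticeably larger, which is the kind of gap the inequality points at.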
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
Patil, Peng and Leek, bioRxiv 2016
$F_0 \subseteq F_0(1_{P,Q}(H)) \subseteq F_0(1_{ED}(E)) \subseteq F_0(1_{ED;E}(D)) \subseteq F_0(1_{AP;A}(C)) \subseteq F_0(1_{C}(A^*))$
2 A theoretical model
Data
Who? What? When? Why? Where? How?
Slide courtesy Hadley Wickham
Where Ingo is working
Base R
Lasso
dplyr
googlesheets
ppt
Bad life choices?
Sparsity!
David Robinson told me
Spreadsheets
Hedgemony
Cleveland and McGill JASA 1984
Leek & Peng 2015 PNAS
Experiment 1
Leek and Peng, Science 2015
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
$E[S \mid F_0(1_{c}(W))]$
We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked or whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population.
Inferential 79% vs Causal 17%
n=47,141
We take a random sample of individuals in a population and identify whether they smoke and if they have cancer. We observe that there is a strong relationship between whether a person in the sample smoked or whether they have lung cancer. We claim that smoking is related to lung cancer in the larger population. We explain we think that the reason for this relationship is because cigarette smoke contains known carcinogens such as benzene, which make cells in lungs become cancerous.
Inferential 65% vs Causal 32%
n=47,141
Experiment 2
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
$E[\text{Est} \mid F_0(1_{c}(A))]$
69% vs 40%
n=1,985
Experiment 3
$E[\text{Claim} \mid F_0(1_{\text{set(base)}}(A))] - E[\text{Claim} \mid F_0(1_{\text{set(ggplot2)}}(A))]$
Population
Question
Hypothesis
Experimental Design
Experimenter
Data
Analysis Plan
Analyst
Code
Estimate
Claim
1. Make a plot that answers the question: what is the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?
2. Make a plot (possibly multi-panel) that answers the question: how does the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) vary by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?
Use only the [ggplot2/base R] graphics system (not base R or lattice) to make your figure.
“Does the plot clearly show the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) in New York?”
G: 5/22 (23%) vs. B: 5/12 (42%)
“Does the plot clearly show how the relationship between mean covered charges (Average.Covered.Charges) and mean total payments (Average.Total.Payments) varies by medical condition (DRG.Definition) and the state in which care was received (Provider.State)?”
G: 7/22 (32%) vs. B: 5/12 (42%)
“Is the plot visually pleasing?”
G: 21/22 (95%) vs. B: 10/12 (83%)
G: 20/22 (91%) vs. B: 8/12 (67%)
“Do the plot text and labels use full words instead of abbreviations?”
G: 21/22 (95%) vs. B: 12/12 (100%)
G: 11/22 (50%) vs. B: 5/12 (42%)
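The Experiment 3 contrast, E[Claim | base] - E[Claim | ggplot2], can be estimated from the counts above as a plain difference in proportions. The sketch below does only that arithmetic for the four comparisons that have question text in the slides. The short labels, the dictionary layout, and the helper name `proportion_difference` are assumptions for illustration, and no significance test is implied.

```python
# Counts copied from the slides as (yes, total), ggplot2 (G) first, base R (B) second.
# The short labels are paraphrases of the review questions, not the original wording.
results = {
    "shows relationship in New York": ((5, 22), (5, 12)),
    "shows variation by condition and state": ((7, 22), (5, 12)),
    "visually pleasing": ((21, 22), (10, 12)),
    "full words instead of abbreviations": ((21, 22), (12, 12)),
}


def proportion_difference(ggplot2_counts, base_counts):
    """Return (base proportion, ggplot2 proportion, base minus ggplot2)."""
    g_yes, g_total = ggplot2_counts
    b_yes, b_total = base_counts
    g_prop = g_yes / g_total
    b_prop = b_yes / b_total
    return b_prop, g_prop, b_prop - g_prop


differences = {}
for label, (g_counts, b_counts) in results.items():
    b_prop, g_prop, diff = proportion_difference(g_counts, b_counts)
    differences[label] = diff
    print(f"{label}: B {b_prop:.0%} vs G {g_prop:.0%}, difference {diff:+.3f}")
```

With groups of 22 and 12 these differences are noisy, so they are best read as descriptive summaries of the review scores.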
2 A theoretical model
Data
Data science as a Science (DSaaS) @jtleek