Definitions Statistics Computation Software Conclusions References
Statistics, computation, and software engineering: development and maintenance of mixed modeling software in R
Ben Bolker
McMaster University, Mathematics & Statistics and Biology
15 October 2013
Ben Bolker
Mixed model software
Outline
1 Definitions and context
2 Statistical challenges
3 Computational challenges
4 Software engineering
5 Conclusions
(Generalized) linear mixed models
(G)LMMs: a statistical modeling framework incorporating:
Linear combinations of categorical and continuous predictors, and interactions
Response distributions in the exponential family (binomial, Poisson, and extensions)
Any smooth, monotonic link function (e.g. logistic, exponential models)
Flexible combinations of blocking factors (clustering; random effects)
Applications in ecology, neurobiology, behaviour, epidemiology, real estate, ...
Examples
ecology: survival, predation, etc. (experimental plots)
genomics: presence/absence of polymorphisms, gene expression (individuals)
educational assessment: student scores (students × teachers)
psychology/sensometrics: decisions, responses to stimuli (individuals)
epidemiology: disease prevalence (postal codes, provinces, countries)
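Technical definition
$$
\underbrace{Y_i}_{\text{response}} \sim \overbrace{\mathrm{Distr}}^{\text{conditional distribution}}\bigl(\underbrace{g^{-1}(\eta_i)}_{\text{inverse link function}},\ \underbrace{\phi}_{\text{scale parameter}}\bigr)
$$
$$
\underbrace{\eta}_{\text{linear predictor}} = \underbrace{X\beta}_{\text{fixed effects}} + \underbrace{Zb}_{\text{random effects}}
$$
$$
\underbrace{b}_{\text{conditional modes}} \sim \mathrm{MVN}\bigl(0,\ \underbrace{\Sigma(\theta)}_{\text{variance-covariance matrix}}\bigr)
$$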
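To make the definition concrete, here is a minimal simulation sketch for one special case: a logistic GLMM with a single random intercept per group. The function name `simulate_glmm`, the choice of one uniform predictor, and the parameter values are illustrative assumptions, not part of the talk. In this special case Z is the group-indicator matrix, Σ(θ) = σ²I, and the Bernoulli response has no free scale parameter φ.

```python
import math
import random

def simulate_glmm(n_groups, n_per_group, beta, sigma, seed=0):
    """Simulate binary responses from a random-intercept logistic GLMM.

    eta_ij = beta[0] + beta[1] * x_ij + b_j,   b_j ~ Normal(0, sigma^2)
    Y_ij   ~ Bernoulli(g^{-1}(eta_ij)),        g^{-1} = inverse logit
    Returns a list of (group, x, y) tuples.
    """
    rng = random.Random(seed)
    rows = []
    for j in range(n_groups):
        b_j = rng.gauss(0.0, sigma)              # random effect for group j
        for _ in range(n_per_group):
            x = rng.uniform(-1.0, 1.0)           # continuous predictor (assumed)
            eta = beta[0] + beta[1] * x + b_j    # linear predictor
            p = 1.0 / (1.0 + math.exp(-eta))     # inverse link function
            y = 1 if rng.random() < p else 0     # conditional distribution
            rows.append((j, x, y))
    return rows

# Hypothetical parameter values, chosen only for illustration.
data = simulate_glmm(n_groups=10, n_per_group=20, beta=(-0.5, 1.0), sigma=1.0)
print(len(data), sum(y for _, _, y in data))
```

Observations in the same group share one draw of b_j, which is what induces the within-group correlation that a plain GLM ignores.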
Estimation
Maximum likelihood estimation
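$$
\underbrace{L(Y_i \mid \theta, \beta)}_{\text{likelihood}} = \int \cdots \int \underbrace{L(Y_i \mid \theta, \beta')}_{\text{data} \mid \text{random effects}} \times \underbrace{L(\beta' \mid \Sigma(\theta))}_{\text{random effects}} \, d\beta'
$$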
deterministic: precision vs. computational cost: penalized quasi-likelihood, Laplace approximation, adaptive Gauss-Hermite quadrature (Breslow, 2004) ...
Monte Carlo: frequentist and Bayesian (Booth and Hobert, 1999; Ponciano et al., 2009; Sung, 2007)
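The trade-off between these approaches can be sketched on the smallest possible case: one group with y successes in n binomial trials, a fixed intercept, and a single normal random effect (written b here). The sketch compares a Laplace approximation and a plain Monte Carlo average against a brute-force midpoint-rule integral used as a reference. All function names and the numbers y = 3, n = 10, σ = 1 are assumptions for illustration; this is not how any particular package implements the integral.

```python
import math
import random

def log_joint(b, y, n, beta0, sigma):
    """log[ Binomial(y | n, expit(beta0 + b)) * Normal(b | 0, sigma^2) ]."""
    eta = beta0 + b
    log_binom = (math.log(math.comb(n, y)) + y * eta
                 - n * math.log1p(math.exp(eta)))
    log_norm = -0.5 * math.log(2 * math.pi * sigma ** 2) - b ** 2 / (2 * sigma ** 2)
    return log_binom + log_norm

def curvature(b, n, beta0, sigma):
    """Second derivative of log_joint with respect to b (always negative)."""
    p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
    return -n * p * (1.0 - p) - 1.0 / sigma ** 2

def marginal_laplace(y, n, beta0, sigma, iters=50):
    """Laplace approximation: Gaussian integral matched at the conditional mode."""
    b = 0.0
    for _ in range(iters):                       # Newton steps to find the mode
        p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
        grad = y - n * p - b / sigma ** 2
        b -= grad / curvature(b, n, beta0, sigma)
    h2 = curvature(b, n, beta0, sigma)
    return math.exp(log_joint(b, y, n, beta0, sigma)) * math.sqrt(2 * math.pi / -h2)

def marginal_grid(y, n, beta0, sigma, half_width=8.0, steps=4000):
    """Reference value: midpoint rule on [-half_width*sigma, half_width*sigma]."""
    lo = -half_width * sigma
    h = 2 * half_width * sigma / steps
    return h * sum(math.exp(log_joint(lo + (k + 0.5) * h, y, n, beta0, sigma))
                   for k in range(steps))

def marginal_monte_carlo(y, n, beta0, sigma, draws=20000, seed=1):
    """Average the conditional likelihood over draws of b from its distribution."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        p = 1.0 / (1.0 + math.exp(-(beta0 + rng.gauss(0.0, sigma))))
        total += math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)
    return total / draws

for name, f in [("grid", marginal_grid), ("Laplace", marginal_laplace),
                ("Monte Carlo", marginal_monte_carlo)]:
    print(name, f(3, 10, 0.0, 1.0))
```

With one random effect the grid is cheap; with many correlated random effects the integral becomes high-dimensional, which is why approximations such as Laplace are attractive despite their error.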
Estimation: example (McKeon et al., 2012)
[Figure: estimated log-odds of predation (x-axis, −6 to 2) for the effects Symbiont, Crab vs. Shrimp, and Added symbiont, compared across estimation methods: GLM (fixed), GLM (pooled), PQL, Laplace, AGQ.]
Inference
Standard inferential tools: mostly asymptotic or uncontrolled approximations
Solutions are computational and/or Bayesian: parametric bootstrap, MCMC
Good news: different problems for small vs large data
[Figure: inferred p value vs. true p value (both axes 0.02 to 0.08); panels: Osm, Cu, H2S, Anoxia.]
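The parametric bootstrap mentioned above follows a simple recipe: fit the null model, simulate new responses from it, recompute the test statistic on each simulated data set, and compare the observed statistic with that simulated null distribution. The sketch below applies the recipe to a deliberately simple stand-in, a normal model testing a common mean against two group means, because refitting a mixed model would need a numerical optimizer. The data values and function names are made up for illustration.

```python
import math
import random

def lrt_statistic(y, group):
    """Likelihood-ratio statistic: common mean (null) vs. separate group means,
    normal errors with the variance estimated by maximum likelihood."""
    n = len(y)
    mean_all = sum(y) / n
    rss0 = sum((v - mean_all) ** 2 for v in y)
    rss1 = 0.0
    for g in set(group):
        yg = [v for v, k in zip(y, group) if k == g]
        mg = sum(yg) / len(yg)
        rss1 += sum((v - mg) ** 2 for v in yg)
    return n * math.log(rss0 / rss1)

def parametric_bootstrap_p(y, group, n_boot=999, seed=0):
    """Simulate from the fitted null model and count statistics at least as
    extreme as the observed one."""
    rng = random.Random(seed)
    n = len(y)
    t_obs = lrt_statistic(y, group)
    mu0 = sum(y) / n                                        # null fit: mean
    sigma0 = math.sqrt(sum((v - mu0) ** 2 for v in y) / n)  # null fit: ML sd
    exceed = 0
    for _ in range(n_boot):
        y_sim = [rng.gauss(mu0, sigma0) for _ in range(n)]
        if lrt_statistic(y_sim, group) >= t_obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)

# Made-up toy data: two groups of five observations.
group = [0] * 5 + [1] * 5
y = [1.2, 0.8, 1.9, 1.1, 1.5, 2.4, 2.9, 1.8, 2.6, 3.1]
print(parametric_bootstrap_p(y, group))
```

For a mixed model the structure is the same, but each bootstrap replicate requires refitting both models, which is where the computational cost comes from.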
Problems of big data
How big is big? Airline data: 12G
(G)LMM works on moderately large problems, e.g. student evaluations (≈ 75K total, 3K students, 1K profs)
Fairly clever linear algebra
Possible improvements? Chunking/parallelization; out-of-memory operation
Sparse matrix algorithms
repeated decomposition of large matrices (especially Z)
fill-reducing permutation to improve sparsity pattern
further improvements possible: better matrix representation, parallelization?
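Why the ordering matters can be seen on a small "arrow" matrix, which has one dense row and column and is diagonal elsewhere. The sketch below uses a dense textbook Cholesky routine purely to count nonzeros in the factor under two orderings; it is not a sparse algorithm, and the matrix, its size, and the function names are illustrative assumptions.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix
    (dense textbook algorithm, used here only to inspect the sparsity pattern)."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        l[j][j] = math.sqrt(a[j][j] - sum(l[j][k] ** 2 for k in range(j)))
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = (a[i][j] - s) / l[j][j]
    return l

def count_nonzeros(m, tol=1e-12):
    return sum(1 for row in m for v in row if abs(v) > tol)

def arrow_matrix(n):
    """Diagonally dominant matrix with a dense first row and column."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[0][i] = a[i][0] = 1.0
        a[i][i] = float(n)
    return a

def permute(a, perm):
    """Symmetric permutation: reorder rows and columns by perm."""
    return [[a[i][j] for j in perm] for i in perm]

n = 6
a = arrow_matrix(n)
nnz_original = count_nonzeros(cholesky(a))
# Moving the dense row/column to the end is a fill-reducing ordering here.
nnz_reordered = count_nonzeros(cholesky(permute(a, list(reversed(range(n))))))
print(nnz_original, nnz_reordered)
```

With the dense column first, eliminating it couples every remaining pair of variables and the factor fills in completely; with it last, the factor keeps the sparsity of the original matrix.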
Bounded optimization
Parameterize variance-covariance matrix Σ(θ) (Pinheiro and Bates, 1996)
Positive definite or only semi-definite?
Disadvantages of transforming to unconstrain
(Disadvantages of boundary solutions)
[Figure: deviance (y-axis, 0 to 30) against the parameter in two panels: raw scale (0 to 3) and log scale (−3 to 0).]
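The choice between a bounded and an unconstrained parameterization can be illustrated for a 2×2 matrix. The sketch below builds Σ = LLᵀ from a Cholesky factor L in two ways: with the diagonal of L entered directly (which needs the bounds l11 ≥ 0, l22 ≥ 0 and allows singular Σ on the boundary), and with the diagonal entered on the log scale, in the spirit of the log-Cholesky parameterization of Pinheiro and Bates (1996), which needs no bounds but can only approach a singular Σ as a parameter goes to −∞. Function names and parameter values are illustrative assumptions.

```python
import math

def sigma_from_factor(l11, l21, l22):
    """Sigma = L L^T for the lower-triangular factor L = [[l11, 0], [l21, l22]]."""
    return [[l11 * l11, l11 * l21],
            [l11 * l21, l21 * l21 + l22 * l22]]

def sigma_from_cholesky(theta):
    """Bounded version: theta = (l11, l21, l22) with l11 >= 0 and l22 >= 0."""
    l11, l21, l22 = theta
    return sigma_from_factor(l11, l21, l22)

def sigma_from_log_cholesky(theta):
    """Unconstrained version: diagonal of L is exp(theta), so always positive."""
    t1, t2, t3 = theta
    return sigma_from_factor(math.exp(t1), t2, math.exp(t3))

def is_positive_definite_2x2(s):
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    return s[0][0] > 0 and det > 0

# Any real theta gives a positive-definite matrix on the log scale.
interior = sigma_from_log_cholesky((-2.0, 0.7, 0.3))
# On the raw scale the boundary l22 = 0 is reachable and gives a singular matrix.
boundary = sigma_from_cholesky((1.0, 0.5, 0.0))
print(is_positive_definite_2x2(interior), is_positive_definite_2x2(boundary))
```

If the best-fitting Σ is singular, the bounded version can land exactly on the boundary, whereas on the log scale the optimizer has to chase a parameter toward −∞ along a nearly flat deviance surface.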
Language tradeoffs
high-level/convenience: R
low-level/performance: C++
new wave? Julia
multi-language friction: mostly escaped in R/C++ case, at the price of complexity
Getting it right vs. getting it written
the curse of neophilia: Superiority
many versions: nlme, lme4(a,b,Eigen) . . .
"The moral of the story is that if you want to create a beautiful language, for god's sake don't make it useful" (Patrick Burns)
Sociological issues
Wide user base:
"As usual when software for complicated statistical inference procedures is broadly disseminated, there is potential for abuse and misinterpretation." (Breslow, 2004)
What if there is no good answer? "do no harm" vs. "better me than someone else"
Diagnostics and warning messages
End users vs. downstream developers
Next steps
Alternative platforms/languages
Flexible correlation structures: spatial, temporal, phylogenetic ...
Improved MCMC methods?
Simulation tests of inferential tools
Is it science?
"Science is what we understand well enough to explain to a computer. Art is everything else we do." (Donald Knuth)
[Figure: articles per month (y-axis, 10 to 50) by date, 2006 to 2012, for the search keys glmm and lme4. Public Library of Science data.]
Acknowledgments
lme4: Doug Bates, Martin Mächler, Steve Walker
Data: Adrian Stier (UBC/OSU), Sea McKeon (Smithsonian), David Julian (UF)
NSERC (Discovery)
SHARCnet
References
Booth, J.G. and Hobert, J.P., 1999. Journal of the Royal Statistical Society. Series B, 61(1):265–285. doi:10.1111/1467-9868.00176.
Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1–22. Springer. ISBN 0387208623.
McKeon, C.S., Stier, A., et al., 2012. Oecologia, 169(4):1095–1103. ISSN 0029-8549. doi:10.1007/s00442-012-2275-2.
Pinheiro, J.C. and Bates, D.M., 1996. Statistics and Computing, 6(3):289–296. doi:10.1007/BF00140873.
Ponciano, J.M., Taper, M.L., et al., 2009. Ecology, 90(2):356–362. ISSN 0012-9658.
Sung, Y.J., 2007. The Annals of Statistics, 35(3):990–1011. ISSN 0090-5364. doi:10.1214/009053606000001389.