
computational science & engineering seminar, 16 oct 2013


Definitions Statistics Computation Software Conclusions References

Statistics, computation, and software engineering: development and maintenance of mixed modeling software in R

Ben Bolker

McMaster University, Mathematics & Statistics and Biology

15 October 2013

Ben Bolker

Mixed model software


Outline

1 Definitions and context

2 Statistical challenges

3 Computational challenges

4 Software engineering

5 Conclusions

1 Definitions and context

(Generalized) linear mixed models

(G)LMMs: a statistical modeling framework incorporating:

Linear combinations of categorical and continuous predictors, and interactions

Response distributions in the exponential family (binomial, Poisson, and extensions)

Any smooth, monotonic link function (e.g. logistic, exponential models)

Flexible combinations of blocking factors (clustering; random effects)

Applications in ecology, neurobiology, behaviour, epidemiology, real estate, . . .


Examples

ecology: survival, predation, etc. (experimental plots)

genomics: presence/absence of polymorphisms, gene expression (individuals)

educational assessment: student scores (students × teachers)

psychology/sensometrics: decisions, responses to stimuli (individuals)

epidemiology: disease prevalence (postal codes, provinces, countries)


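Technical definition

\[
\underbrace{Y_i}_{\text{response}} \sim \overbrace{\mathrm{Distr}}^{\text{conditional distribution}}\Bigl(\underbrace{g^{-1}(\eta_i)}_{\text{inverse link function}},\ \underbrace{\phi}_{\text{scale parameter}}\Bigr)
\]

\[
\underbrace{\eta}_{\text{linear predictor}} = \underbrace{X\beta}_{\text{fixed effects}} + \underbrace{Zb}_{\text{random effects}}
\]

\[
\underbrace{b}_{\text{conditional modes}} \sim \mathrm{MVN}\bigl(0,\ \underbrace{\Sigma(\theta)}_{\text{variance-covariance matrix}}\bigr)
\]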
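A minimal simulation sketch of the definition above, under assumptions that are not in the slides: a Bernoulli response with a logit link (so the scale parameter φ is fixed at 1), one continuous predictor, and a single grouping factor with random intercepts, so that Σ(θ) reduces to σ²I. The function name and parameter values are hypothetical.

```python
import math
import random

def simulate_glmm(n_groups, n_per_group, beta, sigma, seed=1):
    """Simulate binary responses from a random-intercept logistic GLMM.

    Assumed structure: eta = X beta + Z b, with b ~ N(0, sigma^2) per group.
    """
    rng = random.Random(seed)
    # Random effects b: one intercept deviation per group.
    b = [rng.gauss(0.0, sigma) for _ in range(n_groups)]
    rows = []
    for g in range(n_groups):
        for _ in range(n_per_group):
            x = rng.uniform(-1.0, 1.0)          # one continuous predictor
            eta = beta[0] + beta[1] * x + b[g]  # linear predictor
            mu = 1.0 / (1.0 + math.exp(-eta))   # inverse link g^{-1}(eta)
            y = 1 if rng.random() < mu else 0   # draw from the conditional distribution
            rows.append((g, x, y))
    return b, rows

b, rows = simulate_glmm(n_groups=5, n_per_group=20, beta=(-0.5, 1.0), sigma=0.8)
print(len(b), len(rows))
```

Each row holds the group index (the nonzero column of Z), the predictor value, and the simulated response.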

2 Statistical challenges

Estimation

Maximum likelihood estimation

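\[
\underbrace{L(Y_i \mid \theta, \beta)}_{\text{likelihood}}
= \int \cdots \int
\underbrace{L(Y_i \mid \theta, \beta')}_{\text{data} \mid \text{random effects}}
\times
\underbrace{L(\beta' \mid \Sigma(\theta))}_{\text{random effects}}
\, d\beta'
\]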

deterministic: precision vs. computational cost: penalized quasi-likelihood, Laplace approximation, adaptive Gauss-Hermite quadrature (Breslow, 2004) . . .

Monte Carlo: frequentist and Bayesian (Booth and Hobert, 1999; Ponciano et al., 2009; Sung, 2007)
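To make the precision vs. cost tradeoff concrete, here is a minimal sketch for a single cluster with one random intercept, assuming a binomial response with logit link. It compares the Laplace approximation (equivalent to one-node adaptive Gauss-Hermite quadrature), a three-node adaptive rule, and a brute-force trapezoid integral. The data values, function names, and the hard-coded one- and three-node Gauss-Hermite rules are illustrative assumptions, not taken from any package.

```python
import math

def log_joint(b, y, n, beta0, sigma):
    """log f(y | b) + log f(b): y successes in n trials, b ~ N(0, sigma^2)."""
    eta = beta0 + b
    log_binom = math.log(math.comb(n, y)) + y * eta - n * math.log1p(math.exp(eta))
    log_norm = -0.5 * math.log(2.0 * math.pi * sigma**2) - b**2 / (2.0 * sigma**2)
    return log_binom + log_norm

def mode_and_scale(y, n, beta0, sigma):
    """Newton iterations for the conditional mode; returns (mode, 1/sqrt(-h''))."""
    def derivs(b):
        p = 1.0 / (1.0 + math.exp(-(beta0 + b)))
        return y - n * p - b / sigma**2, -(n * p * (1.0 - p) + 1.0 / sigma**2)
    b = 0.0
    for _ in range(100):
        grad, hess = derivs(b)
        step = grad / hess
        b -= step
        if abs(step) < 1e-12:
            break
    _, hess = derivs(b)
    return b, 1.0 / math.sqrt(-hess)

# Gauss-Hermite rules for the weight exp(-x^2): (nodes, weights).
SQRT_PI = math.sqrt(math.pi)
GH_RULES = {
    1: ([0.0], [SQRT_PI]),
    3: ([-math.sqrt(1.5), 0.0, math.sqrt(1.5)],
        [SQRT_PI / 6.0, 2.0 * SQRT_PI / 3.0, SQRT_PI / 6.0]),
}

def agq(y, n, beta0, sigma, n_nodes):
    """Adaptive Gauss-Hermite: nodes centred at the mode, scaled by the curvature."""
    mode, scale = mode_and_scale(y, n, beta0, sigma)
    nodes, weights = GH_RULES[n_nodes]
    total = 0.0
    for x, w in zip(nodes, weights):
        b = mode + math.sqrt(2.0) * scale * x
        total += w * math.exp(x * x + log_joint(b, y, n, beta0, sigma))
    return math.sqrt(2.0) * scale * total

def brute_force(y, n, beta0, sigma, half_width=10.0, steps=20000):
    """Trapezoid rule on a fine grid, used only as a reference value."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        wt = 0.5 if k in (0, steps) else 1.0
        total += wt * math.exp(log_joint(-half_width + k * h, y, n, beta0, sigma))
    return total * h

laplace = agq(7, 10, 0.0, 1.0, n_nodes=1)
agq3 = agq(7, 10, 0.0, 1.0, n_nodes=3)
reference = brute_force(7, 10, 0.0, 1.0)
print(laplace, agq3, reference)
```

The one-node rule costs one likelihood evaluation per cluster after the mode is found; each extra node adds evaluations, and the cost grows quickly with the dimension of the random effects.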


Estimation: example (McKeon et al., 2012)

[Figure: estimated log-odds of predation (x-axis, −6 to 2) for the terms Symbiont, Crab vs. Shrimp, and Added symbiont, compared across estimation methods: GLM (fixed), GLM (pooled), PQL, Laplace, AGQ]


Inference

Standard inferential tools: mostly asymptotic or uncontrolled approximations

Solutions are computational and/or Bayesian: parametric bootstrap, MCMC

Good news: different problems for small vs. large data

[Figure: inferred p value vs. true p value (both axes 0.02 to 0.08); panels: Osm, Cu, H2S, Anoxia]
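A minimal sketch of a parametric bootstrap, assuming a deliberately simple setting rather than a full GLMM fit: clustered binomial counts, a null model in which all clusters share one success probability (no between-cluster variance), and a Pearson heterogeneity statistic. The data and function names are hypothetical.

```python
import random

def pearson_stat(ys, n):
    """Pearson heterogeneity statistic under a common success probability."""
    p_hat = sum(ys) / (n * len(ys))
    if p_hat in (0.0, 1.0):
        return 0.0  # degenerate sample: no variation to test
    var = n * p_hat * (1.0 - p_hat)
    return sum((y - n * p_hat) ** 2 for y in ys) / var

def parametric_bootstrap_p(ys, n, n_sim=999, seed=1):
    """Simulate from the fitted null model and compare refitted statistics."""
    rng = random.Random(seed)
    p_hat = sum(ys) / (n * len(ys))      # fit the null model
    observed = pearson_stat(ys, n)
    exceed = 0
    for _ in range(n_sim):
        # One simulated data set: a binomial(n, p_hat) count per cluster.
        sim = [sum(rng.random() < p_hat for _ in range(n)) for _ in ys]
        if pearson_stat(sim, n) >= observed:
            exceed += 1
    # The +1 terms count the observed data set among the simulations.
    return (1 + exceed) / (1 + n_sim)

p_value = parametric_bootstrap_p([1, 9, 2, 8, 0, 10], n=10)
print(p_value)
```

The same pattern (fit the null model, simulate, refit, compare) carries over to mixed models, where each refit is far more expensive; that is the computational cost referred to above.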

3 Computational challenges

Problems of big data

How big is big? Airline data: 12G

(G)LMM works on moderately large problems, e.g. student evaluations (≈ 75K total, 3K students, 1K profs)

Fairly clever linear algebra

Possible improvements?

Chunking/parallelization

Out-of-memory operation


Sparse matrix algorithms

repeated decomposition of large matrices (especially Z)

fill-reducing permutation to improve sparsity pattern

further improvements possible: better matrix representation, parallelization?
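Why the permutation matters can be seen in a toy symbolic factorization. The sketch below counts fill-in (new nonzeros created during elimination) for an "arrow" matrix, one dense row and column plus a diagonal, under two elimination orders. The graph representation and function name are assumptions for illustration; production sparse Cholesky libraries use far more sophisticated ordering heuristics.

```python
def fill_in(adjacency, order):
    """Count fill edges created by eliminating vertices in the given order.

    adjacency maps each vertex to the set of vertices it shares an
    off-diagonal nonzero with (the graph of a symmetric sparse matrix).
    """
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}
    eliminated = set()
    fill = 0
    for v in order:
        live = sorted(u for u in adj[v] if u not in eliminated)
        # Eliminating v connects all of its remaining neighbours pairwise.
        for i, a in enumerate(live):
            for b in live[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill

n = 6
# Arrow matrix: vertex 0 is coupled to every other vertex, nothing else is.
arrow = {0: set(range(1, n))}
arrow.update({k: {0} for k in range(1, n)})

hub_first = fill_in(arrow, list(range(n)))            # dense vertex eliminated first
hub_last = fill_in(arrow, list(range(1, n)) + [0])    # dense vertex eliminated last
print(hub_first, hub_last)  # 10 0
```

Eliminating the dense vertex first fills in the whole remaining block; eliminating it last creates no fill at all, although the matrix is the same up to permutation.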


Bounded optimization

Parameterize variance-covariance matrix Σ(θ) (Pinheiro and Bates, 1996)

Positive definite or only semi-definite?

Disadvantages of transforming to unconstrain

(Disadvantages of boundary solutions)

[Figure: deviance (y-axis, 0 to 30) profiled on the raw scale (x-axis 0 to 3) and on the log scale (x-axis −3 to 0)]
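A minimal sketch of the two options for a 2×2 matrix, assuming a Cholesky-style parameterization Σ = LLᵀ with θ holding the lower-triangular elements of L. The bounded version constrains the diagonal of L to be non-negative and can sit exactly on the boundary (a singular, semi-definite Σ); the log version is unconstrained but can only approach the boundary as θ goes to −∞. Function names and values are hypothetical, and this is not a description of any particular package's internals.

```python
import math

def sigma_bounded(theta):
    """theta = (l11, l21, l22); box constraints l11 >= 0 and l22 >= 0."""
    l11, l21, l22 = theta
    if l11 < 0 or l22 < 0:
        raise ValueError("diagonal Cholesky elements must be non-negative")
    # Sigma = L L^T for L = [[l11, 0], [l21, l22]].
    return [[l11 * l11, l11 * l21],
            [l11 * l21, l21 * l21 + l22 * l22]]

def sigma_log(theta):
    """Unconstrained variant: diagonal elements are exp(theta), hence > 0."""
    t11, l21, t22 = theta
    return sigma_bounded((math.exp(t11), l21, math.exp(t22)))

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

boundary = sigma_bounded((1.0, 0.5, 0.0))  # exactly singular: a boundary solution
interior = sigma_log((0.0, 0.5, -5.0))     # close to singular, but still positive definite
print(det2(boundary), det2(interior))
```

On the log scale the deviance surface flattens out as the estimate approaches the boundary, which is one of the disadvantages of transforming to unconstrain.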

4 Software engineering

Language tradeoffs

high-level/convenience: R

low-level/performance: C++

new wave? Julia

multi-language friction: mostly escaped in R/C++ case, at the price of complexity


Getting it right vs. getting it written

the curse of neophilia: Superiority

many versions: nlme, lme4(a,b,Eigen) . . .

"The moral of the story is that if you want to create a beautiful language, for god's sake don't make it useful" (Patrick Burns)


Sociological issues

Wide user base:

"As usual when software for complicated statistical inference procedures is broadly disseminated, there is potential for abuse and misinterpretation." (Breslow, 2004)

What if there is no good answer? "do no harm" vs. "better me than someone else"

Diagnostics and warning messages

End users vs. downstream developers

5 Conclusions

Next steps

Alternative platforms/languages

Flexible correlation structures: spatial, temporal, phylogenetic . . .

Improved MCMC methods?

Simulation tests of inferential tools


Is it science?

"Science is what we understand well enough to explain to a computer. Art is everything else we do." (Donald Knuth)

[Figure: articles per month (y-axis, 10 to 50) vs. date (2006 to 2012) for the search keys "glmm" and "lme4"; Public Library of Science data]


Acknowledgments

lme4: Doug Bates, Martin Mächler, Steve Walker

Data: Adrian Stier (UBC/OSU), Sea McKeon (Smithsonian), David Julian (UF)

NSERC (Discovery)

SHARCnet

References

Booth, J.G. and Hobert, J.P., 1999. Journal of the Royal Statistical Society. Series B, 61(1):265–285. doi:10.1111/1467-9868.00176.

Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1–22. Springer. ISBN 0387208623.

McKeon, C.S., Stier, A., et al., 2012. Oecologia, 169(4):1095–1103. ISSN 0029-8549. doi:10.1007/s00442-012-2275-2.

Pinheiro, J.C. and Bates, D.M., 1996. Statistics and Computing, 6(3):289–296. doi:10.1007/BF00140873.

Ponciano, J.M., Taper, M.L., et al., 2009. Ecology, 90(2):356–362. ISSN 0012-9658.

Sung, Y.J., 2007. The Annals of Statistics, 35(3):990–1011. ISSN 0090-5364. doi:10.1214/009053606000001389.
