computational science & engineering seminar, 16 oct 2013

Definitions Statistics Computation Software Conclusions References

Statistics, computation, and software engineering: development and maintenance of mixed modeling software in R

Ben Bolker

McMaster University, Mathematics & Statistics and Biology

15 October 2013

Ben Bolker

Mixed model software

Outline

1 Definitions and context

2 Statistical challenges

3 Computational challenges

4 Software engineering

5 Conclusions

1 Definitions and context

(Generalized) linear mixed models

(G)LMMs: a statistical modeling framework incorporating:

Linear combinations of categorical and continuous predictors, and interactions

Response distributions in the exponential family

(binomial, Poisson, and extensions)

Any smooth, monotonic link function

(e.g. logistic, exponential models)

Flexible combinations of blocking factors

(clustering; random effects)

Applications in ecology, neurobiology, behaviour, epidemiology, real estate, ...

Examples

ecology: survival, predation, etc. (experimental plots)

genomics: presence/absence of polymorphisms, gene expression (individuals)

educational assessment: student scores (students × teachers)

psychology/sensometrics: decisions, responses to stimuli (individuals)

epidemiology: disease prevalence (postal codes, provinces, countries)

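Technical definition

$$
\underbrace{Y_i}_{\text{response}} \sim \overbrace{\text{Distr}}^{\text{conditional distribution}}\bigl(\underbrace{g^{-1}(\eta_i)}_{\text{inverse link function}},\ \underbrace{\phi}_{\text{scale parameter}}\bigr)
$$

$$
\underbrace{\eta}_{\text{linear predictor}} = \underbrace{X\beta}_{\text{fixed effects}} + \underbrace{Zb}_{\text{random effects}}
$$

$$
\underbrace{b}_{\text{conditional modes}} \sim \text{MVN}\bigl(0,\ \underbrace{\Sigma(\theta)}_{\text{variance-covariance matrix}}\bigr)
$$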
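
The three lines of the definition can be read as a recipe for simulating data. The following is a minimal sketch, not taken from the talk, assuming a Bernoulli response with a logit link, one continuous predictor, and a single random intercept per group (so Σ(θ) reduces to a scalar variance and there is no separate scale parameter φ). The function names and parameter values are illustrative assumptions.

```python
import math
import random

def inv_logit(eta):
    """Inverse of the logit link: maps the linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def simulate_glmm(n_groups=5, n_per_group=4, beta=(-0.5, 1.0), sigma=0.8, seed=42):
    """Simulate binary responses with one predictor and a random intercept per group.

    All argument values are hypothetical; they are chosen only for illustration.
    """
    rng = random.Random(seed)
    # b ~ N(0, sigma^2): the simplest case of b ~ MVN(0, Sigma(theta))
    b = [rng.gauss(0.0, sigma) for _ in range(n_groups)]
    rows = []
    for g in range(n_groups):
        for _ in range(n_per_group):
            x = rng.uniform(-1.0, 1.0)
            eta = beta[0] + beta[1] * x + b[g]   # eta = X beta + Z b
            mu = inv_logit(eta)                  # g^{-1}(eta)
            y = 1 if rng.random() < mu else 0    # Y ~ Bernoulli(mu)
            rows.append((g, x, y))
    return b, rows

b, rows = simulate_glmm()
print(len(rows), "observations in", len(b), "groups")
```

Observations in the same group share one draw of b, which is what induces the within-group correlation that a plain GLM ignores.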

2 Statistical challenges

Estimation

Maximum likelihood estimation

$$
\underbrace{L(Y_i \mid \theta, \beta)}_{\text{likelihood}} = \int \cdots \int \underbrace{L(Y_i \mid \theta, \beta')}_{\text{data} \mid \text{random effects}} \times \underbrace{L(\beta' \mid \Sigma(\theta))}_{\text{random effects}} \, d\beta'
$$

deterministic: precision vs. computational cost: penalized quasi-likelihood, Laplace approximation, adaptive Gauss-Hermite quadrature (Breslow, 2004) ...

Monte Carlo: frequentist and Bayesian (Booth and Hobert, 1999; Ponciano et al., 2009; Sung, 2007)
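
To make the trade-off concrete, here is a small sketch (an illustration, not the method used in any particular package) that evaluates the integral for one cluster of binary responses with a single random intercept, written b in the code where the slide writes β′. It compares a Laplace approximation and a plain Monte Carlo average against a brute-force trapezoid rule, which stands in here for a high-accuracy quadrature reference; it is not adaptive Gauss-Hermite quadrature. The data, the fixed intercept `beta0`, and `sigma` are made-up values.

```python
import math
import random

def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def cluster_lik(b, y, beta0):
    """Conditional likelihood of binary responses y given random intercept b."""
    p = inv_logit(beta0 + b)
    lik = 1.0
    for yj in y:
        lik *= p if yj == 1 else 1.0 - p
    return lik

def normal_pdf(b, sigma):
    return math.exp(-0.5 * (b / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def marginal_grid(y, beta0, sigma, n=4001):
    """Reference value: trapezoid rule over b in [-8 sigma, 8 sigma]."""
    lo, hi = -8.0 * sigma, 8.0 * sigma
    h = (hi - lo) / (n - 1)
    vals = [cluster_lik(lo + i * h, y, beta0) * normal_pdf(lo + i * h, sigma)
            for i in range(n)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

def marginal_laplace(y, beta0, sigma, iters=25):
    """Laplace approximation: Gaussian integral matched at the mode of the integrand."""
    n, s = len(y), sum(y)
    b = 0.0
    for _ in range(iters):  # Newton steps on the log integrand (concave in b)
        p = inv_logit(beta0 + b)
        grad = s - n * p - b / sigma ** 2
        hess = -n * p * (1.0 - p) - 1.0 / sigma ** 2
        b -= grad / hess
    p = inv_logit(beta0 + b)
    hess = -n * p * (1.0 - p) - 1.0 / sigma ** 2
    return cluster_lik(b, y, beta0) * normal_pdf(b, sigma) * math.sqrt(2.0 * math.pi / -hess)

def marginal_monte_carlo(y, beta0, sigma, draws=20000, seed=1):
    """Average the conditional likelihood over draws of b from its distribution."""
    rng = random.Random(seed)
    return sum(cluster_lik(rng.gauss(0.0, sigma), y, beta0) for _ in range(draws)) / draws

y = [1, 1, 0, 1, 0, 1, 1, 1]  # hypothetical cluster of 8 binary outcomes
grid = marginal_grid(y, 0.0, 1.0)
laplace = marginal_laplace(y, 0.0, 1.0)
mc = marginal_monte_carlo(y, 0.0, 1.0)
print(grid, laplace, mc)
```

With one scalar random effect all three approaches are cheap. The cost differences discussed on the slide appear when the integral is high-dimensional, where grid-based rules become infeasible.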

Estimation: example (McKeon et al., 2012)

[Figure: estimated log-odds of predation (axis from -6 to 2) for Symbiont, Crab vs. Shrimp, and Added symbiont; estimation methods compared: GLM (fixed), GLM (pooled), PQL, Laplace, AGQ]

Inference

Standard inferential tools: mostly asymptotic or uncontrolled approximations

Solutions are computational and/or Bayesian: parametric bootstrap, MCMC

Good news: different problems for small vs. large data

[Figure: inferred p value vs. true p value, both axes from 0.02 to 0.08; panels: Osm, Cu, H2S, Anoxia]
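
As an illustration of the parametric bootstrap idea, the sketch below fits a null model, simulates new data sets from it, and compares the observed likelihood-ratio statistic with the simulated ones instead of relying on an asymptotic chi-squared reference. To stay self-contained it uses a deliberately simple stand-in problem (two binomial groups, null hypothesis of a common proportion) rather than a mixed model; the counts and function names are assumptions made for the example.

```python
import math
import random

def binom_loglik(k, n, p):
    """Binomial log-likelihood up to an additive constant."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll

def lrt_stat(k1, n1, k2, n2):
    """Likelihood-ratio statistic: separate proportions vs. one pooled proportion."""
    p0 = (k1 + k2) / (n1 + n2)
    ll0 = binom_loglik(k1, n1, p0) + binom_loglik(k2, n2, p0)
    ll1 = binom_loglik(k1, n1, k1 / n1) + binom_loglik(k2, n2, k2 / n2)
    return 2.0 * (ll1 - ll0)

def parametric_bootstrap_p(k1, n1, k2, n2, nsim=2000, seed=1):
    """p value from data simulated under the fitted null model."""
    rng = random.Random(seed)
    observed = lrt_stat(k1, n1, k2, n2)
    p0 = (k1 + k2) / (n1 + n2)          # fitted null model
    exceed = 0
    for _ in range(nsim):
        s1 = sum(rng.random() < p0 for _ in range(n1))
        s2 = sum(rng.random() < p0 for _ in range(n2))
        if lrt_stat(s1, n1, s2, n2) >= observed - 1e-12:
            exceed += 1
    return (exceed + 1) / (nsim + 1)

p_diff = parametric_bootstrap_p(2, 20, 18, 20)    # very different groups
p_same = parametric_bootstrap_p(10, 20, 10, 20)   # identical groups
print(p_diff, p_same)
```

For a mixed model the same loop applies in principle, with the null and alternative models refitted to each simulated data set; that refitting is where the computational cost comes from.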

3 Computational challenges

Problems of big data

How big is big? Airline data: 12G

(G)LMM works on moderately large problems, e.g. student evaluations (≈ 75K total, 3K students, 1K profs)

Fairly clever linear algebra

Possible improvements?

Chunking/parallelization

Out-of-memory operation

Sparse matrix algorithms

repeated decomposition of large, sparse matrices (especially Z)

fill-reducing permutation to improve sparsity pattern

further improvements possible: better matrix representation, parallelization?
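
A toy example shows why the ordering matters. The sketch below uses a small "arrowhead" matrix (a standard textbook case, not one of the matrices from the talk) and a dense Cholesky routine written only for illustration; real sparse solvers use specialized data structures and ordering heuristics. With the dense row and column first, the Cholesky factor fills in completely; after a permutation that moves them last, no fill-in occurs.

```python
import math

def cholesky_lower(a):
    """Dense Cholesky factor L (lower triangular) such that A = L L^T."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            s = a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L

def arrow_matrix(n):
    """Positive definite matrix: diagonal n, plus a dense first row and column."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = float(n)
    for i in range(1, n):
        a[i][0] = a[0][i] = 1.0
    return a

def permute(a, perm):
    """Symmetric permutation: P A P^T for the ordering given by perm."""
    return [[a[pi][pj] for pj in perm] for pi in perm]

def count_nonzeros(m, tol=1e-12):
    return sum(1 for row in m for v in row if abs(v) > tol)

n = 6
a = arrow_matrix(n)
nnz_dense_first = count_nonzeros(cholesky_lower(a))
reverse_order = list(range(n - 1, -1, -1))      # moves the dense row/column last
nnz_dense_last = count_nonzeros(cholesky_lower(permute(a, reverse_order)))
print(nnz_dense_first, nnz_dense_last)
```

Both orderings solve the same linear system; only the storage and work needed for the factor change, which is the point of a fill-reducing permutation.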

Bounded optimization

Parameterize variance-covariance matrix Σ(θ) (Pinheiro and Bates, 1996)

Positive definite or only semi-definite?

Disadvantages of transforming to unconstrain

(Disadvantages of boundary solutions)

[Figure: deviance (0 to 30) plotted on the raw scale (0 to 3) and the log scale (-3 to 0)]
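
The two options can be sketched for a 2 × 2 matrix. This is an illustrative example under stated assumptions, not the parameterization code of any package: Σ is built from a lower-triangular Cholesky factor, once with the diagonal entries constrained to be non-negative (so the optimizer needs bounds, and singular matrices are reachable) and once with the diagonal entries on the log scale (unconstrained, but a zero variance is only approached as the parameter goes to minus infinity). The specific θ values are made up.

```python
import math

def sigma_from_cholesky(theta):
    """theta = (l11, l21, l22), the lower-triangular factor L; Sigma = L L^T.

    Needs the bounds l11 >= 0 and l22 >= 0; zero is allowed, so
    positive semi-definite (singular) matrices are on the boundary.
    """
    l11, l21, l22 = theta
    return [[l11 * l11, l11 * l21],
            [l11 * l21, l21 * l21 + l22 * l22]]

def sigma_from_log_cholesky(theta):
    """Unconstrained version: the diagonal of L is exp(theta), so strictly positive."""
    t11, l21, t22 = theta
    return sigma_from_cholesky((math.exp(t11), l21, math.exp(t22)))

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

boundary = sigma_from_cholesky((1.5, 0.5, 0.0))        # singular: on the boundary
interior = sigma_from_log_cholesky((0.0, 0.5, -5.0))   # nearly singular, never exactly
print(det2(boundary), det2(interior))
```

If the best-fitting Σ is singular, the bounded version can report it exactly, while the log-scale version can only drift toward it along a flat direction of the objective.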

4 Software engineering

Language tradeoffs

high-level/convenience: R

low-level/performance: C++

new wave? Julia

multi-language friction: mostly escaped in R/C++ case, at the price of complexity

Getting it right vs. getting it written

the curse of neophilia: Superiority

many versions: nlme, lme4(a,b,Eigen) . . .

"The moral of the story is that if you want to create a beautiful language, for god's sake don't make it useful"

(Patrick Burns)

Sociological issues

Wide user base:

"As usual when software for complicated statistical inference procedures is broadly disseminated, there is potential for abuse and misinterpretation."

(Breslow, 2004)

What if there is no good answer? "do no harm" vs. "better me than someone else"

Diagnostics and warning messages

End users vs. downstream developers

5 Conclusions

Next steps

Alternative platforms/languages

Flexible correlation structures: spatial, temporal, phylogenetic ...

Improved MCMC methods?

Simulation tests of inferential tools

Is it science?

"Science is what we understand well enough to explain to a computer. Art is everything else we do."

(Donald Knuth)

[Figure: articles per month (axis ticks 10 to 50) by date, 2006 to 2012, for search keys glmm and lme4; Public Library of Science data]

Acknowledgments

lme4: Doug Bates, Martin Mächler, Steve Walker

Data: Adrian Stier (UBC/OSU), Sea McKeon (Smithsonian), David Julian (UF)

NSERC (Discovery)

SHARCnet

References

Booth, J.G. and Hobert, J.P., 1999. Journal of the Royal Statistical Society. Series B, 61(1):265-285. doi:10.1111/1467-9868.00176.

Breslow, N.E., 2004. In D.Y. Lin and P.J. Heagerty, editors, Proceedings of the second Seattle symposium in biostatistics: Analysis of correlated data, pages 1-22. Springer. ISBN 0387208623.

McKeon, C.S., Stier, A., et al., 2012. Oecologia, 169(4):1095-1103. ISSN 0029-8549. doi:10.1007/s00442-012-2275-2.

Pinheiro, J.C. and Bates, D.M., 1996. Statistics and Computing, 6(3):289-296. doi:10.1007/BF00140873.

Ponciano, J.M., Taper, M.L., et al., 2009. Ecology, 90(2):356-362. ISSN 0012-9658.

Sung, Y.J., 2007. The Annals of Statistics, 35(3):990-1011. ISSN 0090-5364. doi:10.1214/009053606000001389.
