
Scalable Bayesian Inference

David Dunson

Departments of Statistical Science & Mathematics, Duke University

December 3, 2018


Outline

Motivation & background

Big n

High-dimensional data (big p)


Typical approaches to big data

• There is an increasingly immense literature focused on big data
• Most of the focus has been on optimization methods
• Rapidly obtaining a point estimate even when sample size n & overall ‘size’ of data is immense
• Huge focus on specific settings - e.g., linear regression, labeling images, etc.
• Bandwagons: most people work on very similar problems, while critical open problems remain untouched


My focus - probability models

• General probabilistic inference algorithms for complex data
• We would like to handle arbitrarily complex probability models
• Algorithms scalable to huge data - potentially using many computers
• Accurate uncertainty quantification (UQ) is a critical issue
• Robustness of inferences also crucial


Bayes approaches

• Bayesian methods offer an attractive general approach for modeling complex data
• Choosing a prior π(θ) & likelihood L(Y(n) | θ), the posterior is

\[
\pi_n(\theta \mid Y^{(n)}) = \frac{\pi(\theta)\, L(Y^{(n)} \mid \theta)}{\int \pi(\theta)\, L(Y^{(n)} \mid \theta)\, d\theta} = \frac{\pi(\theta)\, L(Y^{(n)} \mid \theta)}{L(Y^{(n)})}.
\]

• The posterior πn(θ | Y(n)) characterizes uncertainty in the parameters, in any functional f(θ) of interest & in predictive distributions
• Often θ is moderate to high-dimensional & the integral in the denominator is intractable
• Hence, in interesting models the posterior is not available analytically - what to do??
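To make the normalizing integral concrete, here is a minimal numerical sketch (a toy Bernoulli model with a uniform prior, an assumption for illustration and not an example from the slides): for a one-dimensional θ the denominator ∫ π(θ) L(Y(n) | θ) dθ can be approximated on a grid, which is exactly the computation that stops being feasible once θ is moderate to high-dimensional.

```python
# Toy illustration of the posterior formula above: approximate the normalizing
# integral on a grid for a one-dimensional success probability theta.
import numpy as np

y = np.array([1, 0, 1, 1, 1, 0, 1, 1])            # observed Bernoulli data Y^(n)
theta = np.linspace(1e-4, 1 - 1e-4, 2000)          # grid over theta
prior = np.ones_like(theta)                        # Uniform(0, 1) prior pi(theta)
lik = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())   # L(Y^(n) | theta)

unnorm = prior * lik                               # numerator pi(theta) * L(Y^(n) | theta)
post = unnorm / np.trapz(unnorm, theta)            # divide by grid-approximated integral

print("posterior mean     :", np.trapz(theta * post, theta))
print("P(theta > 0.5 | Y) :", np.trapz(post[theta > 0.5], theta[theta > 0.5]))
```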


Classical Posterior approximations

• In conjugate models, can express the posterior in simple form - e.g., as a multivariate Gaussian
• In more complex settings, can approximate the posterior using some tractable class of distributions
• Large sample Gaussian approximations:

\[
\pi_n(\theta \mid Y^{(n)}) \approx N(\hat{\mu}_n, \Sigma_n)
\]

  (Bayesian central limit theorem / Bernstein-von Mises)
• Relies on sample size n large relative to # parameters p, likelihood smooth & differentiable, true value θ0 in the interior of the parameter space
• A related class of approximations uses a Laplace approximation to ∫ π(θ) L(Y(n) | θ) dθ
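As a concrete illustration, the sketch below builds a Laplace approximation for a Bayesian logistic regression with N(0, 1) priors (the synthetic data, the prior, and the optimizer choice are assumptions for illustration, not content from the slides): find the posterior mode by optimization, then approximate the posterior by a Gaussian whose covariance is the inverse Hessian of the negative log posterior at that mode.

```python
# Minimal Laplace-approximation sketch, assuming a Bayesian logistic regression
# with standard normal priors on the coefficients and synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, p = 1000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def neg_log_post(beta):
    # -log likelihood - log N(0, I) prior, up to additive constants
    eta = X @ beta
    return np.sum(np.logaddexp(0.0, eta) - y * eta) + 0.5 * np.sum(beta ** 2)

def neg_log_post_hess(beta):
    # analytic Hessian: X' diag(w(1-w)) X + prior precision I
    w = 1 / (1 + np.exp(-(X @ beta)))
    return X.T @ ((w * (1 - w))[:, None] * X) + np.eye(p)

fit = minimize(neg_log_post, np.zeros(p), method="BFGS")
mu_hat = fit.x                                        # posterior mode (MAP)
Sigma_hat = np.linalg.inv(neg_log_post_hess(mu_hat))  # Laplace covariance

print("Laplace mean:", mu_hat)
print("Laplace sds :", np.sqrt(np.diag(Sigma_hat)))
```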


Alternative analytic approximations

• As an alternative to Laplace/Gaussian approximations, we can define some approximating class q(θ)
• q(θ) may be something like a product of exponential family distributions parameterized by ξ
• We could then define some discrepancy between q(θ) and πn(θ) = πn(θ | Y(n))
• If we can optimize ξ to minimize this discrepancy, the resulting q̂(θ) may give us a decent approximation
• Basis of variational Bayes, expectation propagation & related methods


Variational Bayes - brief comments

• ICML 2018 tutorial by Tamara Broderick <www.tamarabroderick.com>
• Based on maximizing a lower bound, discarding an intractable term in the KL divergence
• In general we have no clue how accurate the approximation is
• Often posterior uncertainty is badly under-estimated, though there are some fix-ups; e.g., Giordano, Broderick & Jordan (2015)
• Fix-ups improve the variance characterization in a local mode
• Recent article: “On statistical optimality of variational Bayes”, Pati, Bhattacharya & Yang, arXiv:1712.08983
• No theory on accuracy of UQ
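For intuition about the lower-bound idea, the following is a minimal sketch of fitting a mean-field Gaussian q(β) to the Bayesian logistic regression posterior by stochastic gradient ascent on the ELBO with the reparameterization trick. The model, data, step size, and the plain-SGD optimizer are all assumptions chosen for illustration; this is not the specific estimator analyzed in the papers cited above.

```python
# Minimal variational Bayes sketch: maximize a Monte Carlo estimate of the ELBO
# for q(beta) = N(m, diag(exp(2*log_s))), assuming a toy logistic regression.
import numpy as np

rng = np.random.default_rng(1)
n, p = 1000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0, 0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def grad_log_joint(beta):
    # gradient of log likelihood + log N(0, I) prior
    return X.T @ (y - 1 / (1 + np.exp(-X @ beta))) - beta

m, log_s = np.zeros(p), np.zeros(p)       # variational parameters
lr, n_mc = 1e-3, 5                        # assumed step size / # MC samples
for it in range(3000):
    gm, gs = np.zeros(p), np.zeros(p)
    for _ in range(n_mc):
        eps = rng.normal(size=p)
        beta = m + np.exp(log_s) * eps    # reparameterization: beta ~ q
        g = grad_log_joint(beta)
        gm += g / n_mc
        gs += g * eps * np.exp(log_s) / n_mc
    gs += 1.0                             # gradient of the Gaussian entropy term
    m, log_s = m + lr * gm, log_s + lr * gs

print("VB means:", m)
print("VB sds  :", np.exp(log_s))
```

Comparing the reported standard deviations to those from the Laplace sketch above gives a simple check of the under-estimation of uncertainty mentioned on this slide.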


Markov chain Monte Carlo

• Hence, accurate analytic approximations to the posterior have proven elusive outside of narrow settings
• Markov chain Monte Carlo (MCMC) & other posterior sampling algorithms provide an alternative
• MCMC: sequential algorithm to obtain correlated draws from the posterior:

\[
\pi_n(\theta \mid Y^{(n)}) = \frac{\pi(\theta)\, L(Y^{(n)} \mid \theta)}{\int \pi(\theta)\, L(Y^{(n)} \mid \theta)\, d\theta} = \frac{\pi(\theta)\, L(Y^{(n)} \mid \theta)}{L(Y^{(n)})}.
\]

• MCMC bypasses the need to approximate the marginal likelihood L(Y(n))
• Often samples are more useful than an analytic form for πn(θ) anyway
• Can use samples to calculate a wide variety of posterior & predictive summaries of interest


MCMC

• MCMC-based summaries of the posterior for any functional f(θ)
• As the number of samples T increases, these summaries become more accurate
• MCMC constructs a Markov chain with stationary distribution πn(θ | Y(n))
• A transition kernel is carefully chosen & iterative sampling proceeds
• Most MCMC algorithms are types of Metropolis-Hastings (MH):

  1. θ* ∼ g(θ(t−1)) = sample a proposal (θ(t) = sample at step t)
  2. Accept the proposal by letting θ(t) = θ* with probability

\[
\min\left\{1,\; \frac{\pi(\theta^*)\, L(Y^{(n)} \mid \theta^*)}{\pi(\theta^{(t-1)})\, L(Y^{(n)} \mid \theta^{(t-1)})} \, \frac{g(\theta^{(t-1)})}{g(\theta^*)}\right\}
\]
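A minimal random-walk implementation of the accept/reject rule above, targeting a Bayesian logistic regression posterior; the synthetic data, the N(0, I) prior, and the step size are assumptions chosen for illustration.

```python
# Random-walk Metropolis-Hastings sketch for a toy Bayesian logistic regression.
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -1.0])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

def log_post(beta):
    # log N(0, I) prior + log likelihood, up to an additive constant
    eta = X @ beta
    return -0.5 * np.sum(beta ** 2) + np.sum(y * eta - np.logaddexp(0.0, eta))

T, step = 5000, 0.1
theta = np.zeros(p)
samples = np.empty((T, p))
for t in range(T):
    prop = theta + step * rng.normal(size=p)   # symmetric proposal g
    # for a symmetric g the proposal ratio g(theta^(t-1)) / g(theta*) cancels
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                           # accept
    samples[t] = theta                         # otherwise keep the old value

print("posterior means:", samples[1000:].mean(axis=0))  # discard burn-in
```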


Comments on MCMC & MH in particular

• Design of “efficient” MH algorithms involves choosing good proposals g(·)
• g(·) can depend on the previous value of θ & on the data, but not on further-back samples - except in adaptive MH
• Gibbs sampler: letting θ = (θ1, ..., θp)′, we draw subsets of θ from their exact conditional posterior distributions, fixing the others (a minimal sketch follows below)
• Random walk: g(θ(t−1)) is a distribution centered on θ(t−1) with a tunable covariance
• HMC/Langevin: exploit gradient information to generate samples far from θ(t−1) having high posterior density
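The Gibbs sketch below targets a bivariate normal with correlation 0.9, where both full conditional distributions are exact univariate normals; the target and the correlation value are assumptions for illustration only.

```python
# Minimal Gibbs sampler sketch: alternate exact draws from the two full
# conditionals of a correlated bivariate normal target.
import numpy as np

rng = np.random.default_rng(3)
rho, T = 0.9, 5000
theta = np.zeros(2)
samples = np.empty((T, 2))
for t in range(T):
    # theta1 | theta2 ~ N(rho * theta2, 1 - rho^2)
    theta[0] = rng.normal(rho * theta[1], np.sqrt(1 - rho ** 2))
    # theta2 | theta1 ~ N(rho * theta1, 1 - rho^2)
    theta[1] = rng.normal(rho * theta[0], np.sqrt(1 - rho ** 2))
    samples[t] = theta

print("sample correlation:", np.corrcoef(samples.T)[0, 1])  # close to 0.9
```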


MCMC & Computational bottlenecks

• Time per iteration increases with # of parameters/unknowns
• Can also increase with the sample size n
• Due to the cost of sampling the proposal & calculating the acceptance probability
• Similar costs occur in most optimization algorithms!
• For example, the computational bottleneck may be attributable to gradient evaluations


MCMC - A potential 2nd bottleneck

• MCMC does not produce independent samples from πn(θ)
• Draws are auto-correlated - as the level of correlation increases, the information provided by each sample decreases
• “Slowly mixing” Markov chains have highly autocorrelated draws
• A well designed MCMC algorithm with a good proposal should ideally exhibit rapid convergence & mixing
• Otherwise the Monte Carlo (MC) error in posterior summaries may be high
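One common way to quantify this is the effective sample size (ESS): T autocorrelated draws carry roughly ESS = T / (1 + 2 Σ_k ρ_k) independent draws' worth of information, where ρ_k is the lag-k autocorrelation. The sketch below computes this for AR(1) chains of increasing stickiness; the simple positive-autocorrelation truncation rule and the AR(1) test chains are assumptions for illustration.

```python
# Minimal effective-sample-size sketch: estimate autocorrelations and sum them
# until they first become non-positive (a common truncation heuristic).
import numpy as np

def ess(x):
    x = np.asarray(x, dtype=float) - np.mean(x)
    T = len(x)
    acf = np.correlate(x, x, mode="full")[T - 1:] / (np.arange(T, 0, -1) * np.var(x))
    rho_sum = 0.0
    for k in range(1, T):
        if acf[k] <= 0:          # truncate at the first non-positive autocorrelation
            break
        rho_sum += acf[k]
    return T / (1 + 2 * rho_sum)

# AR(1) chains: larger phi = slower mixing = smaller ESS for the same T
rng = np.random.default_rng(4)
for phi in (0.1, 0.9, 0.99):
    x = np.zeros(10000)
    for t in range(1, len(x)):
        x[t] = phi * x[t - 1] + rng.normal()
    print(f"phi={phi}: ESS ~ {ess(x):.0f} out of {len(x)} draws")
```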


MCMC: Causes of scalability problems

• Often mixing gets worse as problem size grows (e.g., data dimension)
• Hence, in some cases we have a double bottleneck - worsening mixing & worsening time per iteration
• Also, MCMC is an inherently serial algorithm, so a naive implementation may require storing & processing all data on one machine
• This limits the ease with which divide-and-conquer strategies can be applied
• For the above reasons, it is common to simply state that MCMC is not scalable


MCMC: A bright future

• Each of the above problems can be addressed & there is an emerging rich literature!
• This is despite orders of magnitude more researchers working on developing scalable optimization algorithms
• For an MCMC algorithm to be scalable, the MC error in posterior summaries based on running for time τ should not explode with dimensionality
• Some popular algorithms have been shown to not be scalable, while others can be made scalable
• I’m going to highlight some relevant work, starting by focusing on big n problems & then transitioning to big p


Outline

Motivation & background

Big n

High-dimensional data (big p)


Some Solutions

• Embarrassingly parallel (EP) MCMC: run MCMC in parallel for different subsets of data & combine.
• Approximate MCMC: approximate expensive-to-evaluate transition kernels.
• C-Bayes: condition on observed data being in a small neighborhood of data drawn from the assumed model [ROBUST]
• Hybrid algorithms: run MCMC for a subset of the parameters & use a fast estimate for the others.


Embarrassingly parallel MCMC

• Divide a large sample size n data set into many smaller data sets stored on different machines
• Draw posterior samples for each subset posterior in parallel
• ‘Magically’ combine the results quickly & simply
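A minimal sketch of the divide/sample pattern follows. The toy normal-mean model, the use of multiprocessing workers as stand-ins for separate machines, and the exact-sampling "subset MCMC" are all assumptions for illustration; in practice each worker would run a real sampler such as the MH sketch above, and a combination rule is sketched after the toy example below.

```python
# Embarrassingly parallel pattern: shard the data, sample each subset posterior
# in parallel, return the K sets of draws for a later combination step.
import numpy as np
from multiprocessing import Pool

def subset_mcmc(args):
    # Stand-in for a real MCMC run on one shard: a normal mean with known unit
    # variance and a flat prior, so the subset posterior can be drawn exactly.
    shard, seed, T = args
    rng = np.random.default_rng(seed)
    return rng.normal(shard.mean(), 1.0 / np.sqrt(len(shard)), size=T)

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    y = rng.normal(2.0, 1.0, size=100_000)       # full data set
    K = 10
    shards = np.array_split(y, K)                # one shard per machine/worker
    with Pool(K) as pool:
        subset_draws = pool.map(subset_mcmc, [(s, k, 2000) for k, s in enumerate(shards)])
    print(len(subset_draws), "subset posteriors, each with", len(subset_draws[0]), "draws")
```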


Toy Example: Logistic Regression

[Figure: posterior draws for (β1, β2), comparing full-data MCMC, the subset posteriors & the WASP]

$$\mathrm{pr}(y_i = 1 \mid x_{i1}, \ldots, x_{ip}, \theta) = \frac{\exp\big(\sum_{j=1}^{p} x_{ij}\beta_j\big)}{1 + \exp\big(\sum_{j=1}^{p} x_{ij}\beta_j\big)}.$$

j Subset posteriors: 'noisy' approximations of the full data posterior.

j 'Averaging' of subset posteriors reduces this 'noise' & leads to an accurate posterior approximation.

Big n 17

Stochastic Approximation

j Full data posterior density for inid data Y^(n):

$$\pi_n(\theta \mid Y^{(n)}) = \frac{\prod_{i=1}^{n} p_i(y_i \mid \theta)\,\pi(\theta)}{\int_{\Theta} \prod_{i=1}^{n} p_i(y_i \mid \theta)\,\pi(\theta)\, d\theta}.$$

j Divide the full data Y^(n) into k subsets of size m: Y^(n) = (Y_[1], . . . , Y_[j], . . . , Y_[k]).

j Subset posterior density for the j-th data subset:

$$\pi^{\gamma}_{m}(\theta \mid Y_{[j]}) = \frac{\prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^{\gamma}\,\pi(\theta)}{\int_{\Theta} \prod_{i \in [j]} \{p_i(y_i \mid \theta)\}^{\gamma}\,\pi(\theta)\, d\theta}.$$

j γ = O(k) - chosen to minimize approximation error (see the sketch below)

Big n 18
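For the conjugate Gaussian-mean toy model, the effect of the power γ can be written down exactly, which is a quick way to see why γ = O(k) is natural: with γ = k, each subset posterior has roughly the same spread as the full-data posterior. A hedged sketch with illustrative numbers:

```python
# Why gamma = k is the natural power in the subset posterior above: for y_i ~ N(mu, sigma^2)
# with a N(0, tau0^2) prior on mu, the subset posterior variance under power gamma is
# 1 / (gamma * m / sigma^2 + 1 / tau0^2); with gamma = k (so gamma * m = n) it matches
# the full-data posterior variance up to the prior term. Numbers below are illustrative.
n, k = 100_000, 20
m = n // k
sigma2, tau02 = 1.0, 100.0

def posterior_var(n_obs, power=1.0):
    return 1.0 / (power * n_obs / sigma2 + 1.0 / tau02)

print("full-data posterior variance    :", posterior_var(n))
print("subset posterior var (gamma = 1):", posterior_var(m))           # ~k times too wide
print("subset posterior var (gamma = k):", posterior_var(m, power=k))  # ~matches the full data
```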

Barycenter in Metric Spaces

Big n 19

WAsserstein barycenter of Subset Posteriors (WASP)

Srivastava, Li & Dunson (2015)

j 2-Wasserstein distance between µ, ν ∈ P_2(Θ):

$$W_2(\mu, \nu) = \inf\Big\{ \big(E[d^2(X, Y)]\big)^{1/2} : \operatorname{law}(X) = \mu,\ \operatorname{law}(Y) = \nu \Big\}.$$

j The subset posteriors Π^γ_m(· | Y_[j]), j = 1, . . . , k, are combined through the WASP

$$\overline{\Pi}^{\gamma}_{n}(\cdot \mid Y^{(n)}) = \operatorname*{arg\,min}_{\Pi \in \mathcal{P}_2(\Theta)}\ \frac{1}{k}\sum_{j=1}^{k} W_2^2\big(\Pi,\ \Pi^{\gamma}_{m}(\cdot \mid Y_{[j]})\big).$$ [Agueh & Carlier (2011)]

j Plugging in estimates Π̂^γ_m(· | Y_[j]) for j = 1, . . . , k, a linear program (LP) can be used for fast estimation of an atomic approximation!

Big n 21

LP Estimation of WASP

j Minimizing the Wasserstein objective is the solution to a discrete optimal transport problem

j Let µ = ∑_{j=1}^{J1} a_j δ_{θ1j}, ν = ∑_{l=1}^{J2} b_l δ_{θ2l} & M12 ∈ ℝ^{J1×J2} = matrix of squared differences between the atoms {θ1j}, {θ2l}.

j Optimal transport polytope: T(a, b) = set of doubly stochastic matrices w/ row sums a & column sums b

j Objective is to find T ∈ T(a, b) minimizing tr(T^T M12) (see the sketch below)

j For WASP, generalize to a multimarginal optimal transport problem - entropy smoothing has been used previously

j We can avoid such smoothing & use sparse LP solvers - negligible computation cost compared to sampling

Big n 22
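As a concrete, hedged illustration of the transport objective above, the sketch below computes the squared 2-Wasserstein distance between two atomic measures by solving the LP over T(a, b) with scipy's LP solver. It is a toy pairwise version, not the multimarginal solver used for the WASP itself, and the sample sizes are illustrative.

```python
# Discrete optimal transport LP: find T in T(a, b) (row sums a, column sums b)
# minimizing <T, M12>, where M12 holds squared distances between atoms.
# The optimal value is the squared 2-Wasserstein distance between the two atomic measures.
import numpy as np
from scipy.optimize import linprog

def w2_squared(atoms1, a, atoms2, b):
    J1, J2 = len(a), len(b)
    diff = atoms1[:, None, :] - atoms2[None, :, :]
    M12 = np.sum(diff ** 2, axis=2)              # cost matrix of squared distances
    A_eq = []
    for j in range(J1):                          # row-sum constraints: sum_l T[j, l] = a[j]
        row = np.zeros(J1 * J2)
        row[j * J2:(j + 1) * J2] = 1.0
        A_eq.append(row)
    for l in range(J2):                          # column-sum constraints: sum_j T[j, l] = b[l]
        col = np.zeros(J1 * J2)
        col[l::J2] = 1.0
        A_eq.append(col)
    res = linprog(M12.ravel(), A_eq=np.array(A_eq), b_eq=np.concatenate([a, b]),
                  bounds=(0, None), method="highs")
    return res.fun

# Toy usage: W2^2 between equally weighted atomic approximations of two subset posteriors
rng = np.random.default_rng(0)
draws1 = rng.normal(0.0, 1.0, size=(50, 2))
draws2 = rng.normal(0.2, 1.0, size=(50, 2))
w = np.full(50, 1 / 50)
print(w2_squared(draws1, w, draws2, w))
```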

WASP: Theorems

Theorem (Subset Posteriors). Under "usual" regularity conditions, there exists a constant C1, independent of the subset posteriors, such that for large m,

$$E_{P^{[j]}_{\theta_0}}\, W_2^2\big\{\Pi^{\gamma}_{m}(\cdot \mid Y_{[j]}),\ \delta_{\theta_0}(\cdot)\big\} \le C_1 \left(\frac{\log^2 m}{m}\right)^{1/\alpha}, \qquad j = 1, \ldots, k.$$

Theorem (WASP). Under "usual" regularity conditions and for large m,

$$W_2\big\{\overline{\Pi}^{\gamma}_{n}(\cdot \mid Y^{(n)}),\ \delta_{\theta_0}(\cdot)\big\} = O_{P^{(n)}_{\theta_0}}\!\left(\sqrt{\frac{\log^{2/\alpha} m}{k\, m^{1/\alpha}}}\right).$$

Big n 23

Simple & Fast Posterior Interval Estimation (PIE)

Li, Srivastava & Dunson (2017)

j Usually report point & interval estimates for different 1-d functionals - a multidimensional posterior is difficult to interpret

j WASP has an explicit relationship with the subset posteriors in 1-d

j Quantiles of the WASP are simple averages of the quantiles of the subset posteriors

j Leads to a super trivial algorithm - run MCMC for each subset & average quantiles (see the sketch below) - reminiscent of the bag of little bootstraps

j Strong theory showing accuracy of the resulting approximation

j Can implement in STAN, which allows powered likelihoods

Big n 24
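A minimal sketch of the combining step under the quantile-averaging relationship above; the subset draws here are synthetic placeholders standing in for MCMC output from the γ-powered subset posteriors.

```python
# PIE combining step: for a 1-d functional, average the empirical quantile functions
# of the k subset posteriors to approximate the quantile function of the full-data posterior.
import numpy as np

def pie_combine(subset_draws, probs=np.linspace(0.005, 0.995, 199)):
    """subset_draws: list of 1-d arrays of MCMC draws, one per data subset."""
    q = np.array([np.quantile(draws, probs) for draws in subset_draws])  # k x len(probs)
    return probs, q.mean(axis=0)

# Toy usage with synthetic subset draws; a 95% interval uses the 2.5% & 97.5% quantiles.
rng = np.random.default_rng(1)
fake_subset_draws = [rng.normal(1.0 + 0.03 * rng.standard_normal(), 0.1, size=2000)
                     for _ in range(10)]
probs, q_pie = pie_combine(fake_subset_draws)
lower, upper = np.interp([0.025, 0.975], probs, q_pie)
print(lower, upper)
```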

Theory on PIE/1-d WASP

j We show the 1-d WASP for a functional ξ is a highly accurate approximation to the exact posterior Π_n(ξ | Y^(n))

j As the subset sample size m increases, the W2 distance between them decreases faster than the parametric rate, o_p(n^{-1/2})

j Theorem allows k = O(n^c) and m = O(n^{1-c}) for any c ∈ (0, 1), so m can increase very slowly relative to k (recall n = mk)

j Their biases, variances & quantiles only differ in high orders of the total sample size

j Conditions: standard, mild conditions on the likelihood + prior with finite 2nd moment & uniform integrability of the subset posteriors

Big n 25

Results

j We have implemented this for a rich variety of data & models

j Logistic & linear random effects models, mixture models, matrix & tensor factorizations, Gaussian process regression

j Nonparametric models, dependence, hierarchical models, etc.

j We compare to long runs of MCMC (when feasible) & VB

j WASP/PIE is much faster than MCMC & highly accurate

j Carefully designed VB implementations often do very well

Big n 26

aMCMC Johndrow, Mattingly, Mukherjee & Dunson (2015)

j Different way to speed up MCMC - replace expensive transition kernels with approximations

j For example, approximate a conditional distribution in a Gibbs sampler with a Gaussian or using a subsample of data

j Can potentially vastly speed up MCMC sampling in high-dimensional settings

j The original MCMC sampler converges to a stationary distribution corresponding to the exact posterior

j Not clear what happens when we start substituting in approximations - may diverge etc

Big n 27

aMCMC Overview

j aMCMC is used routinely - there is an increasingly rich literature on algorithms

j Theory: guarantees can be used to target the design of algorithms

j Define an 'exact' MCMC algorithm, which is computationally intractable but has good mixing

j The 'exact' chain converges to a stationary distribution corresponding to the exact posterior

j Approximate the kernel in the exact chain with a more computationally tractable alternative

Big n 28

Sketch of theory

j Define s_ε = τ1(P)/τ1(P_ε) = computational speed-up, where τ1(P) = time for one step with transition kernel P

j Interest: optimizing the computational time-accuracy tradeoff for estimators of Πf = ∫_Θ f(θ) Π(dθ | x)

j We provide tight, finite sample bounds on the L2 error

j aMCMC estimators win for low computational budgets but have asymptotic bias

j Often larger approximation error → larger s_ε, & rougher approximations are better when speed is super important

Big n 29

Ex 1: Approximations using subsets

j Replace the full data likelihood with

$$L_{\epsilon}(x \mid \theta) = \Big(\prod_{i \in V} L(x_i \mid \theta)\Big)^{N/|V|},$$

for a randomly chosen subset V ⊂ {1, . . . , n}.

j Applied to Pólya-Gamma data augmentation for logistic regression

j Different V at each iteration - a trivial modification to Gibbs (see the sketch below)

j Assumptions hold with high probability for subsets larger than a minimal size (wrt the distribution of subsets, data & kernel).

Big n 30
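The sketch below shows the subsampled powered-likelihood idea in its simplest form: a generic random-walk Metropolis step for logistic regression in which a new subset V is drawn each iteration and the full-data log likelihood is replaced by (n/|V|) times the subset log likelihood. It is not the Pólya-Gamma Gibbs sampler referenced above, and the prior and tuning constants are illustrative.

```python
# aMCMC via subsampled powered likelihood: each iteration uses a fresh random subset V
# and the surrogate log posterior (n/|V|) * sum_{i in V} log L(x_i | theta) + log prior.
import numpy as np

def loglik(theta, y, X):
    """Logistic regression log likelihood, summed over the rows supplied."""
    eta = X @ theta
    return np.sum(y * eta - np.logaddexp(0.0, eta))

def amcmc_subsample(y, X, n_iter=2000, subset_size=500, step=0.05, prior_sd=10.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta, draws = np.zeros(p), np.empty((n_iter, p))
    for t in range(n_iter):
        V = rng.choice(n, size=subset_size, replace=False)   # new subset each iteration
        scale = n / subset_size                              # power N/|V| on the subset likelihood
        def approx_logpost(th):
            return scale * loglik(th, y[V], X[V]) - np.sum(th**2) / (2 * prior_sd**2)
        prop = theta + step * rng.standard_normal(p)         # random-walk proposal
        if np.log(rng.uniform()) < approx_logpost(prop) - approx_logpost(theta):
            theta = prop
        draws[t] = theta
    return draws

# Toy usage with simulated data
rng = np.random.default_rng(1)
X = rng.standard_normal((20_000, 3))
beta_true = np.array([1.0, -0.5, 0.25])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))
print(amcmc_subsample(y, X)[1000:].mean(axis=0))
```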

Application to SUSY dataset

j n = 5,000,000 (0.5 million test), binary outcome & 18 continuous covariates

j Considered subset sizes ranging from |V| = 1,000 to 4,500,000

j Considered different losses as a function of |V|

j The rate at which the loss → 0 with ε is heavily dependent on the loss

j For a small computational budget & a focus on posterior mean estimation, small subsets are preferred

j As the budget increases & the loss focuses more on the tails (e.g., for interval estimation), the optimal |V| increases

Big n 31

Application to SUSY dataset

j n = 5,000,000 (0.5 million test), binary outcome & 18 continuouscovariates

j Considered subsets sizes ranging from |V | = 1,000 to 4,500,000

j Considered different losses as function of |V |j Rate at which loss → 0 with ε heavily dependent on loss

j For small computational budget & focus on posterior meanestimation, small subsets preferred

j As budget increases & loss focused more on tails (e.g., forinterval estimation), optimal |V | increases

Big n 31

Page 144: Scalable Bayesian Inference - Statistical Sciencedunson/ScalableBayes_dunsonNIPS2018.pdf · 2018-12-03 · Scalable Bayesian Inference David Dunson Departments of Statistical Science

Application 2: Mixture models & tensor factorizations

j We also considered a nonparametric Bayes model:

pr(yi 1 = c1, . . . , yi p = cp ) =k∑

h=1λh

p∏j=1

ψ( j )hc j

,

a very useful model for multivariate categorical data

j Dunson & Xing (2009) - a data augmentation Gibbs samplerj Sampling latent classes computationally prohibitive for huge n

j Use adaptive Gaussian approximation - avoid samplingindividual latent classes

j We have shown Assumptions 1-2, Assumption 2 result moregeneral than this setting

j Improved computation performance for large n

Big n 32
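As a small illustration of the factorization above (with made-up values for λ and ψ rather than output from the Dunson & Xing sampler), the joint cell probabilities can be computed directly:

```python
# Probability of one multivariate categorical cell under the mixture / Parafac factorization
# pr(y_i1 = c_1, ..., y_ip = c_p) = sum_h lambda_h * prod_j psi[j][h, c_j].
import numpy as np

def cell_probability(cells, lam, psi):
    prob = 0.0
    for h in range(len(lam)):
        term = lam[h]
        for j, c in enumerate(cells):
            term *= psi[j][h, c]     # psi[j][h, c] = pr(y_ij = c | latent class h)
        prob += term
    return prob

rng = np.random.default_rng(2)
k, p, d = 3, 4, 2                                             # latent classes, variables, levels
lam = rng.dirichlet(np.ones(k))                               # mixture weights
psi = [rng.dirichlet(np.ones(d), size=k) for _ in range(p)]   # each row of psi[j] sums to 1
print(cell_probability((0, 1, 1, 0), lam, psi))
```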

Application 3: Low rank approximation to GP

j Gaussian process regression, y_i = f(x_i) + η_i, η_i ∼ N(0, σ²)

j f ∼ GP prior with covariance τ² exp(−φ ||x1 − x2||²)

j Discrete-uniform prior on φ & gamma priors on τ^{−2}, σ^{−2}

j Marginal MCMC sampler updates φ, τ^{−2}, σ^{−2}

j We show Assumption 1 holds under mild regularity conditions on the "truth", & Assumption 2 holds for a partial rank-r eigenapproximation to Σ (see the sketch below)

j Less accurate approximations are clearly superior in practice for a small computational budget

Big n 33
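A hedged sketch of the rank-r eigenapproximation itself (illustrative hyperparameter values, a dense eigendecomposition for clarity rather than speed, and not the marginal sampler from the slide):

```python
# Partial rank-r eigenapproximation to the GP covariance: replace
# Sigma = tau^2 * exp(-phi * ||x - x'||^2) by its top-r eigenpair reconstruction
# and compare the resulting marginal log likelihoods.
import numpy as np
from scipy.linalg import eigh
from scipy.stats import multivariate_normal

def sq_exp_cov(x, tau2=1.0, phi=2.0):
    d2 = (x[:, None] - x[None, :]) ** 2      # 1-d inputs for simplicity
    return tau2 * np.exp(-phi * d2)

rng = np.random.default_rng(3)
n, r, sigma2 = 300, 20, 0.1
x = np.sort(rng.uniform(0, 1, n))
K = sq_exp_cov(x)

lam, U = eigh(K)                             # ascending eigenvalues
K_r = (U[:, -r:] * lam[-r:]) @ U[:, -r:].T   # rank-r reconstruction U_r diag(lam_r) U_r'

y = rng.multivariate_normal(np.zeros(n), K + sigma2 * np.eye(n))
exact = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=K + sigma2 * np.eye(n))
approx = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=K_r + sigma2 * np.eye(n))
print(exact, approx)                         # the two marginal log likelihoods are typically close
```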

Some interim comments

j EP-MCMC & aMCMC can be used in many-many settings tovastly speed up computation for big n

j Here, I just illustrated some of the possible algorithms - there isan increasingly huge literature on many other approaches

j aMCMC can just as easily be used in high-dimensional (large p)problems

j It is also certainly possible to combine EP-MCMC + aMCMC

j Robustness: one topic we haven’t discussed yet is robustness

Big n 34

Robustness in big data

j In standard Bayesian inference, it is assumed that the model is correct.

j Small violations of this assumption sometimes have a large impact, particularly in large datasets

j "All models are wrong," & the ability to carefully check modeling assumptions decreases for big/complex data

j Appealing to tweak the Bayesian paradigm to be inherently more robust

Big n 35

Example: Perturbed mixture of Gaussians

j Mixtures are often used for clustering.

j But if the data distribution is not exactly a mixture from the assumed family, the posterior will tend to introduce more & more clusters as n grows, in order to fit the data.

j As a result, interpretability of clusters may break down.

Big n 36

Example: Flow cytometry clustering

j Each sample has 3 to 20-dim measurements on 10K’s of cells.

j Manual clustering is time-consuming and subjective.
j Multivariate Gaussian mixture yields too many clusters.
j Example: GvHD data from FLOWCAP-I.

Big n 37

Wait, if the model is wrong, why not just fix it?

j This is often impractical for a number of reasons.

Z insufficient insight into the data generating process
Z time and effort to design model + algorithms, and develop theory
Z slower and more complicated to do inference
Z complex models are less likely to be used in practice

j Further, a simple model may be more appropriate, even if wrong.

Z If there is a lack of fit, it may be due to contamination.
Z Many models are idealizations that are known to be inexact, but have interpretable parameters that provide insight into the questions of interest.

There are many reasons to prefer simple, interpretable, efficient models. But we need a way to do inference that is robust to misspecification.

Big n 38

Coarsened posterior - Miller & Dunson (2018)

j Assume a model {P_θ : θ ∈ Θ} and a prior π(θ).

j Suppose θ_I ∈ Θ represents the idealized distribution of the data. The interpretation here is that θ_I is the “true” state of nature about which one is interested in making inferences.

j Suppose X_1, . . . , X_n i.i.d. ∼ P_{θ_I} are unobserved idealized data.

j However, the observed data x_1, . . . , x_n are actually a slightly corrupted version of X_1, . . . , X_n, in the sense that d(P̂_{X_{1:n}}, P̂_{x_{1:n}}) < R for some statistical distance d(·, ·).

Big n 39

Coarsened posterior

j If there were no corruption, then we should use the standard posterior
\[ \pi(\theta \mid X_{1:n} = x_{1:n}). \]

j However, due to the corruption this would clearly be incorrect.

j Instead, a natural Bayesian approach would be to condition on what is known, giving us the coarsened posterior or c-posterior,
\[ \pi\big(\theta \mid d(\hat{P}_{X_{1:n}}, \hat{P}_{x_{1:n}}) < R\big). \]

j Since R may be difficult to choose a priori, put a prior on it: R ∼ H.

j More generally, consider
\[ \pi\big(\theta \mid d_n(X_{1:n}, x_{1:n}) < R\big), \]
where d_n(X_{1:n}, x_{1:n}) ≥ 0 is some measure of the discrepancy between X_{1:n} and x_{1:n}.

Big n 40
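As a minimal illustration of this definition (not from the slides), the c-posterior can be approximated by an ABC-style rejection scheme: draw θ from the prior, simulate idealized data from P_θ, and keep θ whenever the discrepancy to the observed data falls below R. The Gaussian toy model, the discrepancy d_n, and the threshold R below are all hypothetical choices made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def c_posterior_rejection(x_obs, n_draws=100_000, R=0.05):
    """ABC-style rejection sketch of the c-posterior for a hypothetical toy model:
    x_i ~ N(theta, 1) with prior theta ~ N(0, 10), and discrepancy d_n taken to be
    the absolute difference between the idealized and observed sample means."""
    n = len(x_obs)
    accepted = []
    for _ in range(n_draws):
        theta = rng.normal(0.0, np.sqrt(10.0))        # draw theta from the prior
        X = rng.normal(theta, 1.0, size=n)            # idealized data X_{1:n} ~ P_theta
        if abs(X.mean() - x_obs.mean()) < R:          # condition on d_n(X_{1:n}, x_{1:n}) < R
            accepted.append(theta)
    return np.array(accepted)

# Usage: observed data are a slightly contaminated version of the assumed model
x = rng.normal(1.0, 1.0, size=200) + 0.1 * rng.standard_t(3, size=200)
draws = c_posterior_rejection(x)
print(len(draws), draws.mean(), draws.std())
```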

Relative entropy c-posterior ≈ Power posterior

j There are many possible choices of discrepancy, but relative entropy works out exceptionally nicely.

j Suppose d_n(X_{1:n}, x_{1:n}) is a consistent estimator of D(p_o ‖ p_θ) when X_i i.i.d. ∼ p_θ and x_i i.i.d. ∼ p_o.

j When R ∼ Exp(α), we have the power posterior approximation
\[ \pi\big(\theta \mid d_n(X_{1:n}, x_{1:n}) < R\big) \;\propto\; \pi(\theta) \prod_{i=1}^{n} p_\theta(x_i)^{\zeta_n}, \qquad \zeta_n = \frac{\alpha}{\alpha + n}. \]

j The power posterior enables inference using standard techniques:

Z Analytical solutions in the case of conjugate priors
Z MCMC is also straightforward (see the sketch after this slide)

Big n 41
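As a concrete sketch of the MCMC bullet above (assuming a Gaussian location model with known variance and an arbitrary α; none of this is taken from the slides), a random-walk Metropolis sampler targeting the power posterior only needs the log-likelihood multiplied by ζ_n = α/(α + n):

```python
import numpy as np

def power_posterior_mh(x, alpha=100.0, n_iter=5000, step=0.05, seed=1):
    """Random-walk Metropolis targeting pi(theta) * prod_i p_theta(x_i)^zeta_n,
    zeta_n = alpha / (alpha + n), for the hypothetical model x_i ~ N(theta, 1)
    with prior theta ~ N(0, 10)."""
    rng = np.random.default_rng(seed)
    n = len(x)
    zeta = alpha / (alpha + n)

    def log_target(theta):
        log_prior = -0.5 * theta**2 / 10.0         # N(0, 10) prior, up to a constant
        log_lik = -0.5 * np.sum((x - theta) ** 2)  # N(theta, 1) log-likelihood, up to a constant
        return log_prior + zeta * log_lik          # tempered (power) likelihood

    theta, samples = 0.0, np.empty(n_iter)
    for t in range(n_iter):
        prop = theta + step * rng.standard_normal()           # random-walk proposal
        if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
            theta = prop                                       # accept
        samples[t] = theta
    return samples

# Usage
x = np.random.default_rng(0).normal(0.5, 1.0, size=10_000)
draws = power_posterior_mh(x)
print(draws[1000:].mean(), draws[1000:].std())
```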

Toy example: Bernoulli trials

j Suppose H_0 : θ = 0.5 is true; e.g., heads & tails are equally likely in repeated coin flips

j But x_1, . . . , x_n are corrupted and behave like Bernoulli(0.51) samples.

j The c-posterior is robust to this, but the standard posterior is not.

Big n 42
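To make the contrast concrete, here is a minimal sketch (not from the slides) using a conjugate Beta(1, 1) prior, for which both the standard posterior and the power-posterior form of the c-posterior are available in closed form; α = 100 is an arbitrary illustrative value. With n this large, the standard 95% interval concentrates around 0.51 and excludes 0.50, while the c-posterior interval remains consistent with θ = 0.5.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.51, size=n)      # corrupted data: Bernoulli(0.51), not 0.50
k = x.sum()

# Standard posterior under a Beta(1, 1) prior: Beta(1 + k, 1 + n - k)
standard = stats.beta(1 + k, 1 + n - k)

# Power-posterior form of the c-posterior: the counts are tempered by zeta_n
alpha = 100.0
zeta = alpha / (alpha + n)
c_post = stats.beta(1 + zeta * k, 1 + zeta * (n - k))

print("standard 95% interval:   ", standard.interval(0.95))   # excludes 0.50
print("c-posterior 95% interval:", c_post.interval(0.95))     # still covers 0.50
```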

Mixture models

j Model: X_1, . . . , X_n | w, ϕ i.i.d. ∼ ∑_{i=1}^{K} w_i f_{ϕ_i}(x)

j Prior: w ∼ Dirichlet(γ_1, . . . , γ_K) and ϕ_1, . . . , ϕ_K i.i.d. ∼ H.

j The c-posterior is approximated as
\[ \pi\big(w, \varphi \mid d_n(X_{1:n}, x_{1:n}) < R\big) \;\propto\; \pi(w, \varphi) \prod_{j=1}^{n} \Big( \sum_{i=1}^{K} w_i f_{\varphi_i}(x_j) \Big)^{\zeta_n}, \qquad \zeta_n = \frac{\alpha}{\alpha + n}. \]

j A straightforward MCMC algorithm can be used for computation

j Scales well to large datasets

j EP-MCMC, a-MCMC etc can be used to enhance scalability

Big n 43
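As a sketch of the target such an MCMC algorithm would use (a univariate Gaussian mixture with known component variance and hypothetical priors, chosen only for illustration and not the authors' implementation): the only change relative to the ordinary posterior is that each observation's mixture log-density is multiplied by ζ_n before the log-prior is added.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def tempered_log_posterior(w, mu, x, alpha=500.0, sigma=1.0, gamma=1.0, mu0=0.0, tau=10.0):
    """Log of the power-posterior target for a K-component Gaussian mixture with
    weights w and means mu, under hypothetical priors w ~ Dirichlet(gamma, ..., gamma)
    and mu_k ~ N(mu0, tau^2); the component sd sigma is treated as known."""
    n = len(x)
    zeta = alpha / (alpha + n)

    # log mixture density of each observation: log sum_k w_k N(x_j; mu_k, sigma^2)
    log_comp = stats.norm.logpdf(x[:, None], mu[None, :], sigma) + np.log(w)[None, :]
    log_mix = logsumexp(log_comp, axis=1)

    log_prior = (stats.dirichlet.logpdf(w, np.full(len(w), gamma))
                 + stats.norm.logpdf(mu, mu0, tau).sum())
    return log_prior + zeta * log_mix.sum()     # prior + tempered likelihood

# This log-density can be handed to any generic MCMC routine over (w, mu),
# e.g. random-walk Metropolis with w kept on the probability simplex.
w = np.array([0.3, 0.7]); mu = np.array([-1.0, 2.0])
x = np.random.default_rng(0).normal(size=500)
print(tempered_log_posterior(w, mu, x))
```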

Example: Perturbed mixture of Gaussians

Big n 44

Example: Perturbed mixture of Gaussians

Big n 45

Results: Flow cytometry clustering

Clustering on test datasets closely matches manual ground truth.

Big n 46

Results: Flow cytometry clustering

j Clustering on test datasets closely matches manual ground truth.

j Use the F-measure to quantify similarity of partitions A and B:
\[ F(\mathcal{A}, \mathcal{B}) = \sum_{A \in \mathcal{A}} \frac{|A|}{N} \max_{B \in \mathcal{B}} \frac{2\,|A \cap B|}{|A| + |B|}. \]

Big n 47
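For reference, an illustrative implementation (not from the slides) of the F-measure above, taking two partitions encoded as cluster-label vectors:

```python
import numpy as np

def f_measure(labels_a, labels_b):
    """F-measure between two partitions of the same N items, given as label vectors:
    F(A, B) = sum_{A in A} (|A| / N) * max_{B in B} 2|A ∩ B| / (|A| + |B|)."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    N = len(labels_a)
    total = 0.0
    for a in np.unique(labels_a):
        A = set(np.flatnonzero(labels_a == a))
        best = max(
            2 * len(A & set(np.flatnonzero(labels_b == b))) / (len(A) + np.sum(labels_b == b))
            for b in np.unique(labels_b)
        )
        total += (len(A) / N) * best
    return total

# Identical partitions (up to relabeling) give F = 1
print(f_measure([0, 0, 1, 1, 2], [5, 5, 7, 7, 9]))   # -> 1.0
```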

c-Bayes discussion

j c-Bayes provides a framework for improving robustness to model misspecification

j Particularly useful when interest is in model-based inferences & sample size n is large

j If we just want a black box for prediction, we may as well let the model grow (unnecessarily) in complexity with n

j c-Bayes can be implemented with a particular power posterior

j All the scalable MCMC tricks developed for regular posteriors can be used directly

j Also provides a motivation for doing Bayesian inferences based on subsamples

Big n 48

Hybrid high-dimensional density estimation

Ye, Canale & Dunson (2016, AISTATS)

j y_i = (y_{i1}, . . . , y_{ip})^T ∼ f with p large & f an unknown density

j Potentially use Dirichlet process mixtures of factor models

j Approach doesn’t scale well at all with p

j Instead use hybrid of Gibbs sampling & fast multiscale SVD

j Scalable, excellent mixing & empirical/predictive performance

Big n 49

Outline

Motivation & background

Big n

High-dimensional data (big p)

High-dimensional data (big p) 50

Scaling Bayes to high-dimensional data

j Thus far we have focused on solving computational & robustness problems arising in large n

j In many ways these problems are easier to deal with than issues with high-dimensional/complex data

j For example, in biomedical studies we routinely measure HUGE numbers of features per study subject

j Genomics, precision medicine, neuroimaging, etc

j We have very few labeled data relative to data dimensionality p

j We also don’t want a black box for prediction but want to do scientific inferences

j Bayes for big p is a huge topic - I’ll just provide some vignettes to give a flavor

High-dimensional data (big p) 50

Variable/feature selection in large p regression

j Huge focus in sciences on variable selection

j For example, select the genetic variants x_j associated with a response (phenotype) y

j Sample size n is modest & # genetic variants p is huge

j Large p, small n problem

j Huge literature for dealing with this problem

j Two main approaches:

1. Independent Screening
2. Penalized estimation/shrinkage

High-dimensional data (big p) 51

Independent Screening

j Test for an association between two variables at a time (e.g., a phenotype & a SNP)

j Repeat this for all possible pairs, getting a large number of p-values

j Choose a p-value threshold controlling the False Discovery Rate (FDR) - e.g., Benjamini-Hochberg (BH)

j Get a list of discoveries & hopefully run follow-up studies to verify

High-dimensional data (big p) 52
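A standard sketch of the BH thresholding step referenced above (illustrative; the simulated p-values are hypothetical):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of discoveries under the BH procedure at FDR level q:
    sort the p-values, find the largest k with p_(k) <= k*q/m, and reject every
    hypothesis whose p-value is at or below that cutoff."""
    p = np.asarray(pvals)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= np.arange(1, m + 1) * q / m
    if not below.any():
        return np.zeros(m, dtype=bool)
    cutoff = p[order][np.max(np.nonzero(below)[0])]
    return p <= cutoff

# Example: 95 null p-values plus 5 strong signals
rng = np.random.default_rng(0)
pvals = np.concatenate([rng.uniform(size=95), np.full(5, 1e-6)])
print(benjamini_hochberg(pvals).sum(), "discoveries")
```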

Independent Screening - Continued

j Very appealing in its simplicity

j Very widely used

j Many false positives & negatives; for sparse data, false negatives are a huge problem

j Just considering a pair of variables at a time leads to limited insights

High-dimensional data (big p) 53

Problems with classical approaches

j Consider the canonical linear regression problem:

\[ y_i = x_i'\beta + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2), \]
where x_i = (x_{i1}, . . . , x_{ip})' & β = (β_1, . . . , β_p)'

j The classical approach is to estimate β by maximum likelihood, which reduces to the least squares estimator β̂ = (X'X)^{-1} X'y

j Unfortunately, as p increases OR the x_{ij}'s become more correlated OR more sparse, the variance of β̂ blows up

j For p > n a unique MLE doesn’t exist

High-dimensional data (big p) 54
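A small numerical illustration of the last two bullets (the dimensions and correlation level are arbitrary choices): as predictors become more correlated or p approaches n, the condition number of X'X explodes, so the OLS variance (X'X)^{-1}σ² blows up; for p > n, X'X is singular and no unique MLE exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

def xtx_condition(p, rho=0.0):
    """Condition number of X'X for n = 100 rows with pairwise predictor correlation rho."""
    cov = np.full((p, p), rho) + (1 - rho) * np.eye(p)
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    return np.linalg.cond(X.T @ X)

print(xtx_condition(p=10, rho=0.0))    # well conditioned
print(xtx_condition(p=10, rho=0.99))   # highly correlated predictors: huge
print(xtx_condition(p=95, rho=0.0))    # p close to n: huge
# For p > n, X'X is rank deficient, so (X'X)^{-1} and a unique MLE do not exist.
```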


Including prior information

• We need to include some sort of outside or prior information
• In a Bayesian approach, we choose a prior probability distribution π(β) characterizing our uncertainty in β prior to observing the current data
• Then, we use Bayes rule to update the prior with the information in the likelihood (a small numerical sketch follows below):

  π(β | Y, X) = π(β) L(Y | X, β) / ∫ π(β) L(Y | X, β) dβ = π(β) L(Y | X, β) / L(Y | X),

  where L(Y | X, β) is the likelihood & L(Y | X) is the marginal likelihood
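A minimal numerical sketch of this updating rule for a single regression coefficient: the posterior is the prior times the likelihood, normalized by the marginal likelihood (here approximated on a grid). The simulated data & the N(0, 1) prior are illustrative assumptions.

```python
# Posterior on a grid: prior x likelihood, normalized by the marginal likelihood.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, sigma, beta_true = 50, 1.0, 0.7
x = rng.normal(size=n)
y = beta_true * x + sigma * rng.normal(size=n)

grid = np.linspace(-3, 3, 2001)                                   # grid of beta values
log_prior = stats.norm.logpdf(grid, loc=0.0, scale=1.0)           # pi(beta), assumed N(0, 1)
log_lik = np.array([stats.norm.logpdf(y, loc=b * x, scale=sigma).sum() for b in grid])
unnorm = np.exp(log_prior + log_lik - (log_prior + log_lik).max())

marginal = np.trapz(unnorm, grid)          # plays the role of L(Y | X) on the grid
posterior = unnorm / marginal
print("posterior mean of beta:", round(np.trapz(grid * posterior, grid), 3))
```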


Bayes in normal linear regression

• Suppose π(β) = N_p(0, Σ_0) & we have a normal linear regression model
• Then the posterior distribution of β has a simple form:

  π(β | Y, X) = N_p(β̃, V_β)

• The posterior covariance V_β = (Σ_0⁻¹ + σ⁻²X′X)⁻¹ combines the two sources of information
• The posterior mean is β̃ = (σ²Σ_0⁻¹ + X′X)⁻¹X′Y, which is a weighted average of 0 and β̂ = (X′X)⁻¹X′Y (sketch below)
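A minimal sketch computing β̃ and V_β from these closed forms; the simulated design, σ & the choice Σ_0 = τ²I are illustrative assumptions.

```python
# Closed-form posterior: V_beta = (Sigma0^{-1} + X'X / sigma^2)^{-1},
# beta_tilde = (sigma^2 Sigma0^{-1} + X'X)^{-1} X'Y.
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma, tau = 100, 5, 1.0, 2.0
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
Y = X @ beta_true + sigma * rng.normal(size=n)

Sigma0_inv = np.eye(p) / tau**2                                         # prior precision, Sigma0 = tau^2 I
V_beta = np.linalg.inv(Sigma0_inv + X.T @ X / sigma**2)                 # posterior covariance
beta_tilde = np.linalg.solve(sigma**2 * Sigma0_inv + X.T @ X, X.T @ Y)  # posterior mean

print("posterior mean:", beta_tilde.round(2))
print("posterior sds :", np.sqrt(np.diag(V_beta)).round(2))
```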


Penalized estimation

• We can get the same estimator for β by solving:

  β̃ = argmin_β Σ_{i=1}^n (y_i − x_i′β)² + λ Σ_{j=1}^p β_j² = argmin_β ||Y − Xβ||²₂ + λ||β||²₂

• Known as ridge or L2 penalized regression
• Dual interpretation as a Bayesian estimator under a Gaussian prior centered at zero & as a least squares estimator with a penalty on large coefficients (checked numerically below)
• Such estimators introduce some bias while greatly reducing the variance, improving mean squared error
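A minimal sketch checking the dual interpretation: the ridge solution with penalty λ coincides with the posterior mean above when Σ_0 = (σ²/λ)I. The simulated data are an illustrative assumption.

```python
# Ridge solution vs Bayes posterior mean with Sigma0 = (sigma^2 / lam) I.
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma, lam = 80, 10, 1.0, 3.0
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + sigma * rng.normal(size=n)

ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)         # argmin ||Y - Xb||^2 + lam ||b||^2

tau2 = sigma**2 / lam                                               # prior variance implied by the penalty
bayes = np.linalg.solve(sigma**2 * np.eye(p) / tau2 + X.T @ X, X.T @ Y)  # posterior mean formula

print(np.allclose(ridge, bayes))                                    # True: the same estimator
```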


L1 - Lasso sparse estimation

• The above penalized loss function can be generalized as

  β̃ = argmin_β ||Y − Xβ||²₂ + p_λ(β),

  where p_λ(β) is a penalty term - L2 in the case discussed above
• Another very common penalty is L1 - penalizing the sum of absolute values |β_j|
• This is the Lasso & the resulting estimator has a Bayesian interpretation under a double exponential (Laplace) prior
• β̃ is sparse & contains exact zero values (sketch below)
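A minimal sketch of an L1 fit producing exact zeros, using scikit-learn's Lasso (coordinate descent); the simulated sparse design & the penalty alpha = 0.1 are illustrative assumptions.

```python
# L1 penalization gives exact zeros in the fitted coefficient vector.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]                     # only 3 truly nonzero coefficients
y = X @ beta_true + rng.normal(size=n)

fit = Lasso(alpha=0.1).fit(X, y)
print("indices of nonzero coefficients:", np.flatnonzero(fit.coef_))
```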


A proliferation of penalties & priors

• There is a HUGE literature proposing many different penalties
• Adaptive Lasso, fused Lasso, elastic net, etc, etc
• In general, these methods only produce a sparse point estimate & are dangerous scientifically
• Many errors arise in interpreting the zero vs non-zero elements
• Parallel Bayesian literature on shrinkage priors - horseshoe, generalized double Pareto, Dirichlet-Laplace, etc


Bayesian shrinkage priors

• Appropriate prior π(β) for the high-dimensional vector of coefficients?
• Most commonly a local-global scale mixture of Gaussians (sketch below),

  β_j ∼ N(0, ψ_j λ) independently,   ψ_j ∼ f,   λ ∼ g,

  with ψ_j the local scale & λ the global scale
• Choose λ ≈ 0 & ψ_j to have many small values with some large
• Different choices of f, g lead to different priors in the literature - the Bayesian Lasso is a poor choice, as horseshoe, gDP, DL, etc have much better theoretical & practical performance
• Literature on scalable computation using MCMC - e.g., Johndrow et al., arXiv:1705.00841
• Datta & Dunson (2016, Biometrika) develop such approaches for huge-dimensional sparse count data arising in genomics
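A minimal sketch of drawing coefficients from such a local-global scale mixture, taking f and g to be squared half-Cauchy distributions (the horseshoe-type choice); the factor shrinking the global scale toward zero is an illustrative assumption.

```python
# Draw beta from a local-global scale mixture of Gaussians:
# beta_j ~ N(0, psi_j * lam), with psi_j & lam squared half-Cauchy draws.
import numpy as np

rng = np.random.default_rng(5)
p = 1000

lam = (0.1 * np.abs(rng.standard_cauchy()))**2       # global scale, pushed toward zero
psi = np.abs(rng.standard_cauchy(p))**2              # local scales psi_j, many small & some huge
beta = rng.normal(0.0, np.sqrt(psi * lam))           # beta_j ~ N(0, psi_j * lam)

print("fraction with |beta_j| < 0.01:", np.mean(np.abs(beta) < 0.01).round(3))
print("largest |beta_j|             :", np.abs(beta).max().round(2))
```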


Features of a Bayesian approach

• A Bayesian approach provides a full posterior π(β | Y, X) characterizing uncertainty, instead of just a sparse point estimate β̂
• By using MCMC, we can easily get credible bands (Bayesian confidence intervals) not only for the β_j's but also for any functional of interest
• Relatively straightforward to incorporate extensions to allow hierarchical dependence structures, multivariate responses, missing data, etc
• However, there is a need for approaches that are
  1. more robust to parametric assumptions,
  2. easily computationally scalable to huge datasets,
  3. able to deal with otherwise intractable p ≫ n problems


Application 1: DNA methylation arrays

[Figure: methylation density estimates at three CpG sites - cg17568996 (P(A)<0.001), cg11059483 (P(A)=0.13) & cg11059483 (P(A)=0.76); x-axis: Methylation, y-axis: Density]

• Focus: screening for differentially methylated CpG sites
• High-throughput arrays are routinely used - e.g., the Illumina HumanMethylation450 BeadChip
• Measurements lie in the [0,1] interval, ranging from no methylation to fully methylated
• Representative data from the Cancer Genome Atlas
• Clearly the distributions exhibit multimodality & skewness


Comments

[Figure: the same three methylation density panels - cg17568996 (P(A)<0.001), cg11059483 (P(A)=0.13) & cg11059483 (P(A)=0.76)]

• We observe data like this at a HUGE number of CpG sites
• Many distributions share common attributes - modes, etc
• Can accurately characterize the methylation densities using a kernel mixture model
• Key idea: use the same kernels across the sites & groups but allow the weights to vary
• SHARed Kernel (SHARK) method (Lock & Dunson, 2015)


Shark - some details

[Figure: the nine estimated shared kernels, with overall weights 13.037%, 12.728%, 12.916%, 12.183%, 15.835%, 13.966%, 11.195%, 5.823% & 2.315%]

• The methylation density at site j in group g is f_jg (evaluated in the sketch below):

  f_jg(y) = Σ_{h=1}^k π_jgh K(y; θ_h)

• π_jg = (π_jg1, . . . , π_jgk)′ are weights specific to j, g
• K(y; θ_h) is a shared kernel (truncated normal in this case)
• We estimate the kernels in a first stage relying on a subsample of 500 sites - only 9 kernels are needed
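A minimal sketch evaluating a shared-kernel mixture density of this form with truncated-normal kernels on [0, 1]; the kernel locations/scales, the site-group weights & the use of k = 3 rather than 9 kernels are illustrative assumptions.

```python
# Shared-kernel mixture density: f_jg(y) = sum_h pi_jgh * K(y; theta_h),
# with K a truncated normal on [0, 1].
import numpy as np
from scipy.stats import truncnorm

locs   = np.array([0.1, 0.5, 0.9])     # kernel means theta_h (shared across sites & groups)
scales = np.array([0.05, 0.15, 0.05])  # kernel sds
pi_jg  = np.array([0.6, 0.1, 0.3])     # weights specific to site j, group g

def f_jg(y):
    """Mixture density at points y."""
    a, b = (0.0 - locs) / scales, (1.0 - locs) / scales   # standardized truncation points
    comps = np.array([truncnorm.pdf(y, a[h], b[h], loc=locs[h], scale=scales[h])
                      for h in range(len(locs))])
    return pi_jg @ comps

grid = np.linspace(0.0, 1.0, 5)
print(f_jg(grid))
```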


Shark - implementation (continued)

[Figure: estimated group-specific densities at four CpG sites - cg17095936 (pr(H0|X)<0.001), cg10203483 (pr(H0|X)=0.21), cg27324619 (pr(H0|X)=0.66) & cg27655905 (pr(H0|X)>0.999)]

• We put a simple hierarchical prior on π_jg - Dirichlet in each group
• The prior probability that a random CpG site is differentially methylated is given a beta hyperprior
• Automatically adjusts for multiple testing error, controlling FDR (sketch below)
• Computation - a very fast & parallelizable Gibbs sampler
• Theory support, including under misspecification
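A minimal sketch of one standard way to turn per-site posterior null probabilities pr(H0m | X) into a rejection list whose Bayesian FDR is at most α: reject the sites with the smallest null probabilities while their running average stays below α. The simulated probabilities & the α = 0.05 level are illustrative assumptions, not necessarily the exact rule used in the SHARK paper.

```python
# Bayesian FDR thresholding from posterior null probabilities.
import numpy as np

rng = np.random.default_rng(6)
pr_H0 = np.concatenate([rng.beta(1, 20, 300), rng.beta(20, 1, 1700)])  # fake posterior null probs

def bayes_fdr_reject(pr_h0, alpha=0.05):
    order = np.argsort(pr_h0)                                   # most significant sites first
    running_fdr = np.cumsum(pr_h0[order]) / np.arange(1, len(pr_h0) + 1)
    n_reject = np.searchsorted(running_fdr, alpha, side="right")  # running_fdr is non-decreasing
    return order[:n_reject]                                      # indices of rejected sites

rejected = bayes_fdr_reject(pr_H0, alpha=0.05)
print("sites called differentially methylated:", len(rejected))
```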


Results for Cancer Genome Atlas Data

[Figure: histogram of the posterior null probabilities pr(H0m|X) across CpG sites]

• Illustrate using n = 597 breast cancer samples & 21,986 CpG sites from TCGA
• Focus on testing the difference between basal-like (112) and non-basal-like (485) samples at each site
• Global proportion of no difference was 0.821
• Distribution of posterior probabilities of H0m shown above


Discussion & Comparisons

[Figure: proportion of CpG sites detected vs false positive rate for the shared kernel test (learned P0 & P0 = 0.5), Anderson-Darling test (with & without FDR correction), co-OPT, Wilcoxon rank test, t-test, RDDP dTV test & PT test]

• Of 2117 CpG sites with pr(H0m) < 0.01, 1256 have a significant negative association with gene expression (p < 0.01, Spearman's rank correlation)
• Methylation gives a potential mechanistic explanation for differences in gene transcription levels
• We compared the power of our approach with alternatives


Shared kernel testing for complex phenotypes

• The shared kernel approach can be applied to very complex phenotypes
• All that is needed is a mixture model for the phenotype distribution under one condition
• I'll illustrate briefly using brain connectome phenotypes
• For each individual i, we extract a structural connectome X_i from MRI data
• Then, X_i[u,v] = 1 if there is any connection between regions u & v for individual i, and X_i[u,v] = 0 otherwise


A nonparametric model of variation in brain networks

[Figure: structural brain networks over 68 cortical regions (0L-34L, 0R-34R) for a low creativity subject & a high creativity subject]

• Kernel for characterizing variation in brain network data across individuals: X_i ∼ P, P = ?
• For each brain region (r) & component (h), assign an individual-specific score η_ih[r]
• Characterize variation among individuals with (sketch below):

  logit{pr(X_i[u,v] = 1)} = μ[u,v] + Σ_{h=1}^K λ_ih η_ih[u] η_ih[v],   θ_i = {λ_ih, η_ir} ∼ Q

• Using Bayesian nonparametrics, allow Q (& P) to be unknown
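A minimal sketch of this latent factor construction: edge probabilities for one individual from the baseline log-odds μ[u,v] & the individual's component weights λ_ih and region scores η_ih[r]. The dimensions and the particular distributions used to draw μ, λ_ih & η_ih are illustrative assumptions, not the priors used in the paper.

```python
# Edge probabilities for one subject under the latent factor logit model.
import numpy as np

rng = np.random.default_rng(7)
V, K = 68, 5                                   # brain regions & latent components

mu = rng.normal(-2.0, 0.5, size=(V, V))        # baseline log-odds, shared across individuals
mu = (mu + mu.T) / 2                           # symmetrize
lam_i = rng.gamma(2.0, 1.0, size=K)            # individual-specific component weights lambda_ih
eta_i = rng.normal(size=(K, V))                # individual-specific region scores eta_ih[r]

# logit pr(X_i[u,v] = 1) = mu[u,v] + sum_h lambda_ih * eta_ih[u] * eta_ih[v]
logits = mu + np.einsum("h,hu,hv->uv", lam_i, eta_i, eta_i)
probs = 1.0 / (1.0 + np.exp(-logits))

iu = np.triu_indices(V, 1)
X_i_edges = rng.binomial(1, probs[iu])         # simulated upper-triangle edges for one subject
print("mean model edge probability:", probs[iu].mean().round(3))
print("simulated edge density     :", X_i_edges.mean().round(3))
```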


Page 319: Scalable Bayesian Inference - Statistical Sciencedunson/ScalableBayes_dunsonNIPS2018.pdf · 2018-12-03 · Scalable Bayesian Inference David Dunson Departments of Statistical Science

Bayesian inferences

j Based on this framework, we can cluster individuals in terms of their brain structure

j We can also test for relationships between brain structure & traits/genotype

j Just allow the weights in our mixture model to vary with traits/genotypes, keeping the kernels fixed (see the sketch below)

j Allows scientific inference on global & local group differences in network structure across traits

j Adjusts for multiple testing, reducing false positives

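The Durante & Dunson construction ties the mixture weights to traits in a specific way; purely to illustrate how group differences can be read off group-specific weights, here is a small Python sketch (a toy example under our own simplifying assumptions, not the authors' procedure). Given component labels for subjects in two trait groups (in practice these come from MCMC, and the quantity below would be averaged over posterior samples), it computes a Dirichlet-multinomial Bayes factor for group-specific versus shared weights and converts it to a posterior probability of a group difference. The counts, the choice of 3 components, the Dirichlet(1) prior and the even prior odds are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_multinomial(counts, alpha):
    """log marginal likelihood of a labelled sequence with these component counts,
    integrating the mixture weights against a Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def log_bf_group_difference(counts_g1, counts_g2, alpha):
    """log Bayes factor: group-specific weights (H1) vs shared weights (H0)."""
    log_h1 = (log_dirichlet_multinomial(counts_g1, alpha)
              + log_dirichlet_multinomial(counts_g2, alpha))
    log_h0 = log_dirichlet_multinomial(np.asarray(counts_g1) + np.asarray(counts_g2), alpha)
    return log_h1 - log_h0

# Illustrative counts: subjects per mixture component in each trait group (3 components)
alpha = np.ones(3)                               # symmetric Dirichlet prior on the weights
counts_low, counts_high = [12, 4, 1], [3, 9, 7]
log_bf = log_bf_group_difference(counts_low, counts_high, alpha)
prior_h1 = 0.5
post_h1 = 1.0 / (1.0 + np.exp(-log_bf) * (1 - prior_h1) / prior_h1)
print(f"log BF = {log_bf:.2f}, pr(group difference | labels) ≈ {post_h1:.3f}")
```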


Application to creativity: results from local testing

j Apply model to brain networks of 36 subjects (19 with high creativity, 17 with low creativity, measured via CCI).

j p̂r(H1 | data) = 0.995.

j Highly creative individuals have significantly more inter-hemispheric connections.

j Differences in the frontal lobe are consistent with recent fMRI studies analyzing regional activity in isolation.



Modularization

j Idea: don’t allow for all the dependencies implied by the jointBayesian model

j Cutting dependence useful for computational scalability &robustness to model misspecification

j As an example, suppose we have a phenotype yi and SNPsxi = (xi 1, . . . , xi p )′

j We want to screen for SNPs xi j that are significantly relatedwith yi in a nonparametric manner

j We also want to account for dependence in the many differenttests

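As a schematic, in generic two-module notation (our notation, added here for clarity, under the assumption that module 1 relates parameters θ_1 to data y, and module 2 relates θ_2, together with θ_1, to data x):

Full posterior:   p(θ_1, θ_2 | y, x) ∝ p(θ_1) p(y | θ_1) p(θ_2) p(x | θ_1, θ_2)

Cut posterior:    p_cut(θ_1, θ_2 | y, x) = p(θ_1 | y) p(θ_2 | θ_1, x)

In the cut version θ_1 is learned from its own module's data y only; the possibly misspecified model for x cannot feed back into it, which buys both robustness and the ability to compute the two modules separately.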


Modular Bayes screening (Chen & Dunson, 2018)

j Start with a kernel mixture model for the marginal distribution of y_i:

f(y) = ∑_{h=1}^{k} π_h K(y; θ_h).

j This implies y_i ∼ K(θ_{c_i}), pr(c_i = h) = π_h, with c_i ∈ {1, . . . , k} a cluster index

j Run an MCMC algorithm to get samples of the unknown parameters & cluster indices

j For each SNP, define a simple Bayesian test for association between x_{ij} & c_i

j Include common parameters across these tests - e.g., the probability of an association for a randomly chosen SNP

j Marginalize over MCMC samples of {c_i} to take uncertainty into account (a minimal end-to-end sketch is given below)

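To fix ideas, here is a compact end-to-end sketch in Python (our illustration under simplifying assumptions, not the authors' implementation). Module 1 fits a k-component Gaussian mixture to y using scikit-learn's EM as a stand-in for the paper's MCMC, and label vectors are drawn from the fitted conditional probabilities to mimic posterior samples of {c_i}. Module 2 then tests each SNP for association between its genotypes and the cluster labels with a Dirichlet-multinomial Bayes factor, averaging the resulting posterior probabilities over the label draws. The simulated data, k = 3, the fixed prior probability of association and the Dirichlet(1) priors are illustrative; in the actual method the association probability is a parameter shared across the SNP-level tests and the labels come from a full MCMC.

```python
import numpy as np
from scipy.special import gammaln, expit
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# --- Toy data: phenotype y and p SNPs coded 0/1/2; SNP 0 truly shifts the mixture ---
n, p, k = 500, 20, 3
snps = rng.integers(0, 3, size=(n, p))
means = np.array([-3.0, 0.0, 3.0])
comp = np.where(snps[:, 0] > 0,
                rng.choice(3, size=n, p=[0.1, 0.2, 0.7]),
                rng.choice(3, size=n, p=[0.6, 0.3, 0.1]))
y = rng.normal(means[comp], 1.0)

# --- Module 1: mixture model for the marginal of y (EM stand-in for the paper's MCMC) ---
gm = GaussianMixture(n_components=k, random_state=0).fit(y[:, None])
resp = gm.predict_proba(y[:, None])              # pr(c_i = h | y_i, fitted mixture)
S = 50                                           # label draws mimicking posterior samples of {c_i}
label_draws = np.stack([(resp.cumsum(axis=1) > rng.random((n, 1))).argmax(axis=1)
                        for _ in range(S)])

# --- Module 2: per-SNP association tests given the labels (no feedback into module 1) ---
def log_dm(counts, alpha):
    """log marginal likelihood of counts under a Dirichlet(alpha)-multinomial."""
    counts = np.asarray(counts, dtype=float)
    return (gammaln(alpha.sum()) - gammaln(alpha.sum() + counts.sum())
            + np.sum(gammaln(alpha + counts) - gammaln(alpha)))

def log_bf_snp(genotypes, labels, k, alpha=np.ones(3)):
    """log BF: genotype distribution differs across clusters (H1) vs is common (H0)."""
    table = np.stack([np.bincount(genotypes[labels == h], minlength=3) for h in range(k)])
    log_h1 = sum(log_dm(table[h], alpha) for h in range(k))
    return log_h1 - log_dm(table.sum(axis=0), alpha)

prior_h1 = 0.1            # fixed here; a shared, learned parameter in the actual method
post_h1 = np.array([
    np.mean([expit(log_bf_snp(snps[:, j], labels, k) + np.log(prior_h1 / (1 - prior_h1)))
             for labels in label_draws])
    for j in range(p)])

print("SNPs flagged (pr(H0_j | data) < 0.05):", np.where(1 - post_h1 < 0.05)[0])
```

Because module 2 only conditions on the label draws and never feeds back into module 1, the loop over SNPs is embarrassingly parallel, which is what makes the screening scalable to very large p.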


MOBS - comments

j Algorithm is very fast, scalable to huge p & trivially parallelizable

j Has strong frequentist theoretical guarantees - comparable to the state of the art

j Competitive with the state of the art in performance

j Particularly good at detecting complex distributional changes



Application to cis-eQTL data

j Applied approach to GEUVADIS cis-eQTL data set

j Messenger RNA & microRNA on lymphoblastoid cell line samples from 462 individuals in the 1000 Genomes Project

j 38 million single nucleotide polymorphisms (SNPs)

j Response y_i: expression of gene E2F2, which plays a key role in control of the cell cycle & has a multimodal distribution

j 0.4% of SNPs have pr(H0_j) < 0.05 - picking up differences in distribution that other methods miss



Discussion

j Brief intro to Bayesian methods for large p problems

j Highlighted some recent work using shared kernels &/or modularization

j There is a very rich literature & increasing focus on scalability

j One important direction is to obtain methods for assessing when we are attempting inference at too fine a scale for our data

j Ideally we can then automatically coarsen the scale to answer solvable questions - e.g., Peruzzi & Dunson (2018)



Some references - large n Bayes

j Miller J, Dunson D (2018) Robust Bayesian inference via coarsening. Journal of the American Statistical Association, Online.

j Srivastava S, Li C, Dunson DB (2018) Scalable Bayes via barycenter in Wasserstein space. Journal of Machine Learning Research 19(1):312-346.

j Li C, Srivastava S, Dunson DB (2017) Simple, scalable and accurate posterior interval estimation. Biometrika 104(3):665-680.

j Duan LL, Johndrow JE, Dunson DB (2018) Calibrated data augmentation for scalable Markov chain Monte Carlo. Journal of Machine Learning Research, to appear.

j Minsker S, Srivastava S, Lin L, Dunson DB (2017) Robust and scalable Bayes via a median of subset posterior measures. Journal of Machine Learning Research 18(1):4488-4527.

j Johndrow JE, Smith A, Pillai N, Dunson DB (2018) MCMC for imbalanced categorical data. Journal of the American Statistical Association, Online.


Some references - large p Bayes

j Armagan A, Dunson DB, Lee J (2013) Generalized double Pareto shrinkage. Statistica Sinica 23:119.

j Bhattacharya A, Pati D, Pillai NS, Dunson D (2015) Dirichlet-Laplace priors for optimal shrinkage. JASA 110:1479-90.

j Chen Y, Dunson DB (2017) Modular Bayes screening for high-dimensional predictors. Biometrika, under revision.

j Datta J, Dunson DB (2016) Bayesian inferences on quasi-sparse count data. Biometrika 103:971-83.

j Durante D, Dunson DB (2018) Bayesian inference and testing of group differences in brain networks. Bayesian Analysis 13:29-58.

j Lee K, Lin L, Dunson D (2018) Maximum pairwise Bayes factors for covariance structure testing. arXiv:1809.03105.

j Lock EF, Dunson DB (2015) Shared kernel Bayesian screening. Biometrika 102:829-842.

j Peruzzi M, Dunson DB (2018) Bayesian modular and multiscale regression. arXiv:1809.05935.


Discussion

j Take-home message: Bayes is scalable & MCMC is scalable

j But in big-data & high-dimensional problems we can't necessarily be naive & use off-the-shelf algorithms

j We need to think carefully about how to exploit parallel processing & accurate approximations to reduce bottlenecks

j Also useful to take a step away from the fully Bayes framework by using modularization, composite likelihoods, c-Bayes, etc.

j Such generalized Bayes methods can have improved computational performance & robustness

