CS&E and Statistics James Berger Duke University and Statistical and Applied Mathematical Sciences Institute (SAMSI)


Page 1: CS&E and Statistics

CS&E and Statistics

James Berger

Duke University and Statistical and Applied Mathematical Sciences Institute (SAMSI)

Page 2: CS&E and Statistics

Outline

• A Glimpse of the World of Statistical Modeling in Science, Engineering and Society from the Viewpoint of a Statistician

• Bringing the CS&E and Statistics Communities Together

• Research Themes

Page 3: CS&E and Statistics

I. An Idiosyncratic Glimpse of the World of Statistical Modeling in Science, Engineering, and Society

• Example 1: Predicting Fuel Economy Improvements

• Example 2: Understanding the Orbital Composition of Galaxies

• Example 3: Protecting Confidentiality in Government Databases, while Allowing for their Use in Research

Page 4: CS&E and Statistics

Example 1: An early 1990s study of the potential available gain in fuel economy, to gauge the possibility of changing CAFE

• Statistical modeling of EPA data involved
– physics/engineering-based data transformations;
– ‘multilevel random effects’ models, accounting for vehicle model effects, manufacturer effects, technology type, … (about 3000 parameters);
– physics/engineering knowledge of the effect of technology changes on vehicle performance, necessary to implement a ‘constant performance’ condition, some of it from simulation.

Page 5: CS&E and Statistics

• Prediction of the effect of technology change (highly non-linear)
– was done in a Bayesian fashion;
– involved thousands of 3000-dimensional integrals;
– utilized Markov Chain Monte Carlo methods.

• The total estimated fuel economy gains available by 1995 and 2001 were (within 2%)
– 11% and 20% (Automobile)
– 8% and 16% (Truck)

(Note that legislation had proposed CAFE increases of 20% by 1995 and 40% by 2001.)

See http://www.stat.duke.edu/~berger/papers/fuel.html
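
The Bayesian predictions above rely on Markov Chain Monte Carlo to approximate high-dimensional integrals. As a minimal sketch only, and not the model used in the study, the following random-walk Metropolis sampler estimates a posterior expectation for a toy two-dimensional target. The target density, step size, and all function names are illustrative assumptions.

```python
import math
import random


def log_target(theta):
    """Toy unnormalized log posterior: independent standard normals (an assumption)."""
    return -0.5 * sum(t * t for t in theta)


def metropolis(log_density, start, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis: Gaussian proposal, accept with prob min(1, ratio)."""
    rng = random.Random(seed)
    current = list(start)
    current_lp = log_density(current)
    samples = []
    for _ in range(n_samples):
        proposal = [c + rng.gauss(0.0, step) for c in current]
        proposal_lp = log_density(proposal)
        # Acceptance probability is min(1, target(proposal) / target(current)).
        accept_prob = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_prob:
            current, current_lp = proposal, proposal_lp
        samples.append(list(current))
    return samples


def posterior_mean(samples, g):
    """Monte Carlo estimate of the integral of g against the posterior."""
    return sum(g(s) for s in samples) / len(samples)


samples = metropolis(log_target, start=[0.0, 0.0], n_samples=20000)
# Estimate E[theta_0^2 + theta_1^2]; the exact value for this toy target is 2.
estimate = posterior_mean(samples, lambda s: s[0] ** 2 + s[1] ** 2)
```

The same averaging step is what turns each of the "thousands of integrals" into a sum over sampled parameter values; in a 3000-parameter model only the dimension of `theta` and the form of `log_target` would change, although a plain random-walk proposal would likely mix far too slowly there.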

Page 6: CS&E and Statistics

Example 2: Understanding the orbital composition of galaxies

• Consider a galaxy as made of a collection of ‘rings’ of orbiting stars; each ring specified by
– its location
– a given velocity for the stars in the ring.

• Available data is the luminosity in each (location, velocity) slit of the galaxy;
– it is measured with noise.

• Goal: find the luminosity ‘weight’ of each ‘ring’.

Page 7: CS&E and Statistics

• Finding the weights appears to be a linearly constrained quadratic minimization problem, but
– there are many local minima, with nearly the same minimum value, so the actual minimum is unimportant;
– characterization of the uncertainty in the weights is crucial, leading to identification of the computationally ‘stable’ and ‘transient’ orbits.

• A solution is to employ Bayesian analysis, leading to the posterior distribution of weights:
– here, dimensions of integration are roughly equal to the number of orbits considered;
– new Markov Chain Monte Carlo methods for highly constrained spaces are required.
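
One way to write the problem down, as a sketch under assumed notation (none of these symbols appear in the slides): let y be the vector of observed luminosities over the (location, velocity) slits, w the vector of ring weights, and A a known matrix giving each ring's contribution to each slit. Assuming Gaussian noise with variance σ², nonnegative weights as the linear constraint, and a prior π(w), the minimization problem and its Bayesian counterpart could look like:

```latex
\hat{w} \;=\; \arg\min_{w \ge 0} \; \lVert y - A w \rVert^{2},
\qquad
p(w \mid y) \;\propto\;
\exp\!\left( -\frac{\lVert y - A w \rVert^{2}}{2\sigma^{2}} \right)
\pi(w)\, \mathbf{1}\{ w \ge 0 \}.
```

The posterior on the right describes the whole set of weight vectors that fit the data about equally well, which is why it is more informative here than any single minimizer.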

Page 8: CS&E and Statistics

Example 3: Protecting Privacy in an Electronic, Post-9/11/01 World

• Underlying Tension: Federal statistical agencies must
– protect confidentiality of data (and privacy of individuals and organizations),
– disclose information to the public, researchers, …

• Current Milieu: Sophisticated ways to break confidentiality.
– Example: Linkage to external databases (many) using powerful software tools.

• The need: equally powerful models and tools to protect confidentiality.

Page 9: CS&E and Statistics

• Full Data: Large (e.g., 40 dimensions x 10 categories) contingency table corresponding to a categorical database; note that there are 10^40 cells in the full table (but most are 0).

• To Disseminate to Researchers: Set of marginal sub-tables that maximize utility of released information subject to a risk disclosure constraint.

• The difficult computational challenges include
– computation via MCMC or integer programming or ?? with huge contingency tables;
– optimization of the utility, subject to the constraint;
– determination of the statistical utility of sub-tables.

See NISS Digital Government Project: http://www.niss.org/dg
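
To make the distinction between the full table and a marginal sub-table concrete, here is a small sketch, assuming records are stored as tuples of category codes (the function name and the toy data are hypothetical). It never materializes the full table; it counts only the cells that actually occur, then projects onto a chosen subset of dimensions.

```python
from collections import Counter


def marginal_table(records, dims):
    """Count records by their values on the chosen dimensions only."""
    return Counter(tuple(rec[d] for d in dims) for rec in records)


# Toy database: 4 dimensions with categories 0-9 (the slides mention 40 dimensions).
records = [
    (0, 3, 3, 9),
    (0, 3, 5, 9),
    (1, 3, 3, 2),
    (0, 3, 3, 9),
]

full_cells = 10 ** 4                    # cells in the full 4-way table
nonzero_cells = len(Counter(records))   # cells that are actually occupied
two_way = marginal_table(records, dims=(0, 1))
```

Even in this toy case only 3 of the 10,000 cells are occupied. A cell with a very small count in a released sub-table, such as the single record in cell (1, 3) of `two_way`, is the kind of feature a disclosure-risk constraint would need to examine; how risk and utility are actually measured is outside this sketch.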

Page 10: CS&E and Statistics

II. Bringing the CS&E and Statistics Communities Together

• Example: Inverse problems and validation for complex computer models

• Barriers to closer association

• Mechanisms for closer association

Page 11: CS&E and Statistics

Example: Development, Analysis and Validation of Computer Models

– Consider computer models of processes, created via applied mathematical modeling, statistical modeling, microsimulation, or other strategy.

– Collect data from the real process, to
  • Find unknown parameters of the computer model (the inverse problem), and characterize uncertainty
  • Find inadequacies of the computer model and suggest improvements
  • Predict accuracy of the computer model

Page 12: CS&E and Statistics

[Figure: plot of b(x) against x]

Page 13: CS&E and Statistics

Illustration: Math modeling of vehicle crashes

• A finite element applied math model
– 100,000 elements
– developed using LS-DYNA
– 12+ hours to run

• Accelerometer data is available at differing vehicle velocities
– 36 computer runs
– 36 field tests

Page 14: CS&E and Statistics

Statistical modeling of velocities as a function of time:

v_field(t) = v_true(t) + e(t),   v_model(t) = v_true(t) + b(t),

where e(t) is noise and b(t) is computer model bias.
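
Subtracting the first equation from the second gives v_model(t) − v_field(t) = b(t) − e(t), so if the noise has mean zero, averaging over replicate field tests leaves an estimate of the bias. The sketch below illustrates only this naive moment estimate on simulated data. It is not the Bayesian analysis described in these slides, and every number and name in it is a hypothetical stand-in.

```python
import random


def estimate_bias(model_curve, field_curves):
    """Pointwise bias estimate: model output minus the mean of replicate field curves."""
    n = len(field_curves)
    return [
        m - sum(curve[i] for curve in field_curves) / n
        for i, m in enumerate(model_curve)
    ]


rng = random.Random(1)
times = [0.0, 0.5, 1.0, 1.5, 2.0]
v_true = [30.0 - 4.0 * t for t in times]            # hypothetical true velocity
b = [0.5 * t for t in times]                        # hypothetical model bias
v_model = [vt + bt for vt, bt in zip(v_true, b)]    # v_model = v_true + b
v_field = [
    [vt + rng.gauss(0.0, 0.2) for vt in v_true]     # v_field = v_true + e
    for _ in range(200)
]

b_hat = estimate_bias(v_model, v_field)
```

With only a few dozen expensive runs and field tests at differing velocities, replicates like these are not available, which is one reason a model-based treatment of b(t) with explicit uncertainty is attractive.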

Analysis: Use Bayesian analysis and Markov Chain Monte Carlo implementation to
– provide estimates (with uncertainties) of unknown coefficients in the math model, e.g., damping;
– assess accuracy of predictions of the computer model (e.g., at initial velocity v=30 mph, there is a 90% chance that the computer model prediction is within 1.5 of the true process value);
– allow prediction of key engineering quantities, such as CRITV, the airbag deployment time.

See: http://www.niss.org/technicalreports/tr128.pdf

Page 15: CS&E and Statistics
Page 16: CS&E and Statistics

Barriers to Bringing the CS&E and Statistics Communities Together

• To many disciplinary scientists
– we are each ‘providers of tools they can use’
– we are indistinguishable quantitative experts

• Program and project funding rarely encourage inclusion of both CS&E and statistical scientists.

• Our traditional application areas generally differ
– CS&E tradition: physical sciences and engineering
– Statistics tradition: strongest (as the statistics discipline) in social sciences, medical sciences, …
(This could be an organizational strength for the CS&E initiative, but is a barrier at the personal level.)

Page 17: CS&E and Statistics

Mechanisms for Bringing the CS&E and Statistics Communities Together

• Most important is simply to bring them together on interdisciplinary teams.

• Institute programs (e.g., at SAMSI), for extended cooperation
– joint workshops
– joint working groups

• Emphasize need for joint funding on interdisciplinary projects.

• At Universities?

Page 18: CS&E and Statistics

Organizing and Delivering Joint CS&E and Statistics Educational Programs

At SAMSI, we
– provide integrated courses, jointly taught;
– provide graduate students and postdocs with year-long exposure to joint programs;
– provide 1-week outreach programs to undergraduates and high-school teachers, and 2-week outreach programs to beginning graduate students, to introduce them to the CS&E and Statistics worlds;
– begin opening program workshops with extensive tutorials.

Page 19: CS&E and Statistics

Research Challenges

• Statistical computational research challenges:
– MCMC development and implementation
– data confidentiality and large contingency tables
– dealing with large data sets
  • in real time
  • off-line
– bioinformatics, gene regulation, protein folding, …
– data mining
– utilizing multiscale data
– data fusion, data assimilation
– graphical models/causal networks
– open source software environments
– visualization
– many, many more.

Page 20: CS&E and Statistics

• Challenges in the synthesis of statistics and development of computer modeling:
– Statistical analysis in non-linear situations can require thousands of model evaluations (e.g., using MCMC), so the ‘real’ computational problem is the product of two very intensive computational problems; this is needed for
  • designing effective evaluation experiments;
  • estimating unknown model parameters (inverse problem), with uncertainty evaluation;
  • assessing model bias and predictive capability of the model;
  • detecting inadequate model components.
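
A back-of-the-envelope calculation illustrates the ‘product of two problems’ point. The 12-hour run time echoes the crash model described earlier; the number of MCMC evaluations is an assumed, illustrative figure, and the function name is hypothetical.

```python
def total_compute_hours(n_evaluations, hours_per_run):
    """Cost of naive MCMC when every iteration needs one full model run."""
    return n_evaluations * hours_per_run


# Assumed: 10,000 MCMC iterations, each requiring one 12-hour simulation.
hours = total_compute_hours(n_evaluations=10_000, hours_per_run=12)
years = hours / (24 * 365)
```

Under these assumptions the serial cost is 120,000 hours, well over a decade of computing, so running the simulator directly inside every MCMC iteration is not practical without faster approximations or substantial parallelism.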

Page 21: CS&E and Statistics

– Simultaneous use of statistical and applied mathematical modeling is needed for
  • effective utilization of many types of data, such as
    – data that occurs at multiple scales;
    – data/models that are individual-specific.
  • replacing unresolvable determinism by stochastic or statistically modeled components (parameterization).

This general area of validation of computer models should be a Grand Challenge.