Applied Statistics for the Office of Science
Understanding Variability and Bringing Rigor to Scientific Investigation
George Ostrouchov
Statistics and Data Sciences Group
Computer Science and Mathematics Division
Oak Ridge National Laboratory
Statistics and Data Sciences
OAK RIDGE NATIONAL LABORATORY
U. S. DEPARTMENT OF ENERGY
George Ostrouchov
Filling a Gap in Statistics to Address Office of Science Needs
ASCR Strategic Plan:
“[AMR] weaknesses include an underinvestment or lack of investment in several critical areas: . . . Underinvestment in statistics”
“The following gaps in the [AMR] program have been identified:
Multiscale mathematics
Ultrascale algorithms
Discrete mathematics
Statistics – investments in this area are required to deal with extracting knowledge from the oceans of data that large-scale simulations will produce.
Multiphysics”
Through Applied Statistics, ASCR has the opportunity to engage the dominant segment of Applied Mathematics for its goals.
Office of Science Response to the Data Challenge:
The Office of Science will initiate a long-term research program to address the “Curse of Dimensionality.”
Raymond L. Orbach, AAAS, Feb. 19, 2006
U.S. Department of Energy
Office of Science
ORNL Applied Statistics program can address the curse of dimensionality and other Office of Science goals.
Statistics Brings Rigor and Efficiency to Scientific Investigation and Technology
Conrad Habicht, Maurice Solovine, and Albert Einstein, the self-styled Olympia Academy, in about 1903. At Einstein’s suggestion, the first book read was Pearson’s “The Grammar of Science.”
CREDIT: IMAGE ARCHIVE ETH-BIBLIOTHEK, ZÜRICH
Karl Pearson (1857-1936)
“The Grammar of Science” (1892) – Relativity
First Department of Statistics (1911), UCL
Founding editor of Biometrika
Common Evolutionary Steps: Experimental Science and Computational Science
Early computational science relies largely on intuitive design and visual validation
Computational experiments are expensive
Petascale data sets are nearly as opaque as real systems – statistical analysis must select what to visualize
Uncertainty analysis is in its infancy
Statistics is a major partner in bringing computational science to the rigor and efficiency standards of experimental science:
Methods to see through, examine, and classify variability
Uncertainty quantification
Statistical design of experiments
Fusion of data and computational experiment
Statistics: the Study of Variability
The discipline concerned with the study of variability, with the study of uncertainty, and with the study of decision-making in the face of uncertainty.
Large-scale user of mathematical and computational tools with a focused scientific agenda
Inherently interdisciplinary
Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century,
Cuts through the fog of variability and brings efficiency to science.
Mathematics is Biology’s Next Microscope, Only Better
Here are five mathematical challenges that would contribute to the progress of biology.
(1) Understand computation. Find more effective ways to gain insight and prove theorems from numerical or symbolic computations and agent-based models. We recall Hamming: “The purpose of computing is insight, not numbers” (Hamming 1971, p. 31).
(2) Find better ways to model multi-level systems, for example, cells within organs within people in human communities in physical, chemical, and biotic ecologies.
(3) Understand probability, risk, and uncertainty. Despite three centuries of great progress, we are still at the very beginning of a true understanding. Can we understand uncertainty and risk better by integrating frequentist, Bayesian, subjective, fuzzy, and other theories of probability, or is an entirely new approach required?
(4) Understand data mining, simultaneous inference, and statistical de-identification (Miller 1981). Are practical users of simultaneous statistical inference doomed to numerical simulations in each case, or can general theory be improved? What are the complementary limits of data mining and statistical de-identification in large linked databases with personal information?
(5) Set standards for clarity, performance, publication and permanence of software and computational results.
Mathematics, Computer Science, and Statistics are Biology’s Next Microscope, Only Better
[Slide annotations placed beside the five challenges: Statistics; Multiscale Math; Statistics; Computer Science; Computer Science and Mathematics]
Cohen JE (2004). PLoS Biol 2(12): e439
Fellow AAAS, Fellow AmPhilSoc, Member NAS
Also: Chemistry’s, Materials’, Astrophysics’ Telescope, Particle Physics’ Device
Particle Physics Embraces Statistics
“… since 1900 … statistics … takes over field after field … [as] … the methodology of choice …
… people in astronomy and physics … are starting to use statistics a lot more for the simple reason that they have to be efficient now.
… I don't see any area where it's being resisted much.”
Bradley Efron
Chair, Department of Statistics, Stanford University, and Max H. Stein Professor of Humanities and Sciences
2005 National Medal of Science Recipient
Citations to Statistics Comprise the Dominant Group within Mathematics
Highly Cited Journals in Mathematics
Rank  Journal                         Citations (1991-2001)
1.    J. American Statistical Assn.   16,457
2.    Biometrics                      10,854
3.    J. Math. Analysis                9,845
4.    Annals of Statistics             9,702
5.    Proc. Amer. Math. Soc.           9,237
6.    C.R. Acad. Sci. Ser. I Math.     9,153
7.    Trans. Amer. Math. Soc.          8,586
8.    Journal of Algebra               8,531
9.    J. Functional Analysis           7,999
10.   Biometrika                       7,911
11.   SIAM J. Numer. Anal.             7,383
12.   Inventiones Mathematicae         7,382
13.   J. Royal Stat. Soc. B            6,575
14.   Mathemat. Programming            6,444
15.   Linear Algebra Appl.             6,112
SOURCE: ISI Essential Science Indicators, Sci. Citation Index (300 Journals in pure mathematics, applied mathematics, statistics and probability)
Highly Cited Authors in Mathematics for period 1991-2001
Rank  Name                   Affiliation               Department / Field   Papers  Citations
1.    Pierre-Louis Lions     University of Paris 9     Mathematics              75  1207
2.    David L. Donoho        Stanford University       Statistics               27  1182
3.    Adrian F.M. Smith      Univ. London              Statistics               40  1026
4.    Elizabeth A. Thompson  U. Washington             Biostatistics            11   973
5.    Iain M Johnstone       Stanford University       Statistics               17   968
6.    Jianqing Fan           Chinese U. Hong Kong      Statistics               53   901
7.    Donald B. Rubin        Harvard University        Statistics               38   854
8.    Ingrid Daubechies      Princeton University      Mathematics              20   807
9.    Adrian E. Raftery      U. Washington             Statistics/Sociol.       31   804
10.   Alan E. Gelfand        U. Connecticut            Statistics               35   747
11.   Sun-Wei Guo            Med. Coll. Wisconsin      Biostatistics             6   737
12.   Scott L. Zeger         Johns Hopkins Univ.       Biostatistics            23   723
13.   Peter J. Green         University of Bristol     Statistics               14   667
14.   Bradley P. Carlin      University of Minnesota   Biostatistics            28   663
15.   J. Stephen Marron      U. North Carolina         Statistics               43   618
16.   David G. Clayton       MRC, Cambridge            Biostatistics             4   598
17.   Gareth O. Roberts      Lancaster Univ.           Statistics               41   598
18.   Albert Cohen           University of Paris       Mathematics              61   572
19.   Michael Rockner        Univ. Bielefeld, Germany  Mathematics              69   572
20.   Yangbo Ye              University of Iowa        Mathematics              42   567
21.   Jinchao Xu             Pennsylvania St. U.       Mathematics              22   566
22.   Xiao-Li Meng           University of Chicago     Statistics               27   561
23.   Matthew P. Wand        Harvard University        Biostatistics            31   558
24.   Wally R. Gilks         MRC                       Biostatistics            16   551
25.   M. Chris Jones         Open University           Statistics               52   542
19 of the top 25 most cited mathematics authors are from Statistics or Biostatistics!
Statistics is Highly Interdisciplinary!
Citations per paper:
Statistics and Biostatistics: 27
Rest of Mathematics: 15
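The citations-per-paper figures can be checked against the top-25 author table. The sketch below assumes the slide's numbers were pooled from that table (total citations divided by total papers in each group); the slide does not state its method, and the names `TOP_25`, `is_statistics`, and `citations_per_paper` are illustrative only.

```python
# (field, papers, citations) for the 25 authors in the table, in rank order.
TOP_25 = [
    ("Mathematics", 75, 1207),
    ("Statistics", 27, 1182),
    ("Statistics", 40, 1026),
    ("Biostatistics", 11, 973),
    ("Statistics", 17, 968),
    ("Statistics", 53, 901),
    ("Statistics", 38, 854),
    ("Mathematics", 20, 807),
    ("Statistics/Sociol.", 31, 804),
    ("Statistics", 35, 747),
    ("Biostatistics", 6, 737),
    ("Biostatistics", 23, 723),
    ("Statistics", 14, 667),
    ("Biostatistics", 28, 663),
    ("Statistics", 43, 618),
    ("Biostatistics", 4, 598),
    ("Statistics", 41, 598),
    ("Mathematics", 61, 572),
    ("Mathematics", 69, 572),
    ("Mathematics", 42, 567),
    ("Mathematics", 22, 566),
    ("Statistics", 27, 561),
    ("Biostatistics", 31, 558),
    ("Biostatistics", 16, 551),
    ("Statistics", 52, 542),
]


def is_statistics(field):
    """Assumption: any field containing 'stat' counts as Statistics/Biostatistics."""
    return "stat" in field.lower()


def citations_per_paper(rows):
    """Pooled ratio: total citations divided by total papers."""
    papers = sum(p for _, p, _ in rows)
    citations = sum(c for _, _, c in rows)
    return citations / papers


stat_rows = [r for r in TOP_25 if is_statistics(r[0])]
math_rows = [r for r in TOP_25 if not is_statistics(r[0])]

print(len(stat_rows), "of", len(TOP_25), "authors are in Statistics or Biostatistics")
print("Statistics and Biostatistics:", round(citations_per_paper(stat_rows)))
print("Rest of Mathematics:", round(citations_per_paper(math_rows)))
```

Under this pooling assumption the table gives 19 authors in Statistics or Biostatistics, with roughly 27 citations per paper against roughly 15 for the remaining six authors.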
Statistics Disseminates Data Analysis Ideas Across Science Domains
Of 500 recent citations of Efron’s “Bootstrap” paper, 348 were outside statistics. [NSF2004]
Mitchell’s “Detmax Algorithm” paper: 200+ citations (funded by AMR at ORNL); citations shown in red on the slide are outside statistics.
Statistics Core Research Disseminates and Unifies Data Analysis Ideas
Tames the explosion of data analytic methods by:
Providing portability between science domains
Deriving properties of new data analytic methods
Building bridges between data analytic methods
Examples:
Latent Semantic Indexing (Dumais+ 1991) and Correspondence Analysis (Benzecri 1969, 1980, 1992; Greenacre 1984)
Empirical Orthogonal Functions (Lorenz 1956) and a climate time series application of Principal Components Analysis (Pearson 1902, Hotelling 1935)
Support Vector Machines (Vapnik 1995) and Logistic Regression (Cox 1970) via hinge loss function (Hastie+ 2001)
FastMap approximation to Principal Components (Faloutsos+ 1995): bridge to Convex Hull and new methods, RobustMap (Ostrouchov+ 2005), and to right Householder transformations (Ostrouchov+ 2006)
Addressing the Curse of Dimensionality
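The bridge between Support Vector Machines and Logistic Regression can be illustrated by writing both as losses in the margin m = y f(x), with labels y in {-1, +1}. This is a minimal sketch using the standard textbook forms of the two losses, not code from the talk; the function names are illustrative.

```python
import math


def hinge_loss(margin):
    """SVM loss: zero once the margin reaches 1, linear penalty below it."""
    return max(0.0, 1.0 - margin)


def logistic_loss(margin):
    """Logistic regression negative log-likelihood as a function of the margin."""
    return math.log1p(math.exp(-margin))


# Both losses decrease as the margin grows and penalize misclassified
# points (negative margin) roughly linearly.
for m in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"margin={m:+.1f}  hinge={hinge_loss(m):.3f}  logistic={logistic_loss(m):.3f}")
```

Seen this way, the two methods differ mainly in the loss applied to the margin: the hinge loss is exactly zero beyond margin 1, while the logistic loss stays positive and smooth.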
Quantitative Rigor for Science: Transfer From Medicine via Core Statistics to Big Bang
Statistics Core
Science Applications
“I … emphasize the symbiotic relationship … between the Statisticians and Astrophysicists …. It is now … clear that there are core common problems …” Bob Nichol (CMU Physics)
Miller, CJ; Genovese, C; Nichol, RC; et al. Controlling the false-discovery rate in astrophysical data analysis. ASTRONOMICAL JOURNAL, 122 (6): 3492-3505, DEC 2001
Miller, CJ; Nichol, RC; Batuski, DJ. Acoustic oscillations in the early universe and today. SCIENCE, 292 (5525): 2302-2303, JUN 22 2001
Science publication on the Big Bang while others still plow through a plethora of data
False Discovery Rate: “Interdisciplinary” “Decision-making in the face of uncertainty”
Family-wise error rate of statistical tests (each at the 0.05 level, assumed independent):
One test: 0.05 probability of a false positive
Fifty tests: about 0.92 probability of at least one false positive; need simultaneous inference (SI)
Thousand tests: SI too conservative; need FDR
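The family-wise figures follow from 1 - (1 - alpha)^m when the m tests are assumed independent, which gives roughly 0.92 for fifty tests at alpha = 0.05. The sketch below computes that, then contrasts a Bonferroni correction (one form of simultaneous inference) with the Benjamini-Hochberg step-up procedure for controlling the false discovery rate. The p-values are made up for illustration, and the function names are hypothetical.

```python
def family_wise_error_rate(alpha, m):
    """P(at least one false positive) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def bonferroni_rejections(p_values, alpha):
    """Simultaneous inference: reject only p-values below alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, q):
    """Benjamini-Hochberg step-up: indices rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank whose p-value falls under the line q * rank / m.
        if p_values[i] <= q * rank / m:
            k_max = rank
    return sorted(order[:k_max])


for m in (1, 50, 1000):
    print(m, "tests:", round(family_wise_error_rate(0.05, m), 3))

# Hypothetical p-values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print("Bonferroni rejects:", bonferroni_rejections(p, 0.05))
print("Benjamini-Hochberg rejects:", benjamini_hochberg(p, 0.05))
```

On these illustrative p-values Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, which is the sense in which FDR control is less conservative when many tests are run.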
Statistics core is the hub that disseminates and unifies data analysis ideas.
Critical mass engagement is needed to reap short term and long term returns.
Source: [NSF2004] Jon Kettenring, Bruce Lindsay, and David Siegmund, editors, 2004. Statistics: Challenges and Opportunities for the Twenty-First Century,
Engage Core Statistics for OASCR Goals
A gap exists between statistics research and simulation science
Engage statistics with leadership computing
Engage statistics with simulation science data
Engage statistics with Office of Science experimental data (neutron science)
Statistics Core
Science Applications
Computational Chemistry
Climate Simulation
Fusion Simulation
Combustion Simulation
Superscalable Algorithms
Neutron Science
Astrophysics Simulation
Genome Science
Tuning Leadership Facilities
Ontologies for Energy