DESCRIPTION
A high-level vision for a new project on using MapReduce clusters to work with data from simulations of physical phenomena.
SIMULATION INFORMATICS
A data-driven methodology for
synthesis from simulation results
David F. Gleich
John von Neumann Postdoctoral Fellow
Sandia National Laboratories
Livermore, CA
A bit more background on me

Before
Undergrad: Harvey Mudd College (joint CS/Math)
Internships: Yahoo! and Microsoft
Graduate: Stanford (Computational and Mathematical Engineering)
Thesis: Models and Algorithms for PageRank Sensitivity
Internships: Intel, joint Library of Congress/Stanford, Microsoft
Post-doc 1: Univ. of British Columbia

Now
Post-doc 2: 2010 John von Neumann fellow @ Sandia

In August
Tenure-track position at Purdue’s CS department
It’s time to ask:

What can science learn from Google?
--Wired Magazine (2008)
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
A tale of two computers …

Cielo (#10 in the Top 500)
Separate storage system
This platform is awesome for simulating physical systems.
Cost: ~$50 million

SNL/CA Nebula Cluster
2 TB/core storage
This platform is awesome for working with data.
Cost: $150k
Simulation: The Third Pillar of Science

21st Century Science in a nutshell
Experiments are not practical / feasible.
Simulate things instead.

But do we trust the simulations? We’re trying!
Model fidelity
Verification & Validation (V&V)
Uncertainty Quantification (UQ)

Insight and confidence require multiple runs.
The problem: simulation ain’t cheap!

21.1st Century Science in a nutshell?
Simulations are too expensive.
Let data provide a surrogate.

Each run maps input parameters s to the time history of the simulation f.

The Database
s1 -> f1
s2 -> f2
…
sk -> fk

We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
--Wired (again)
Our CSRF project: Simulation Informatics

To enable analysts and engineers to hypothesize from inexpensive data computations instead of expensive HPC computations.

The People
Myself
Paul G. Constantine (2009 von Neumann)
Stefan Domino
Jeremy Templeton
Joe Ruthruff
The idea and vision: store the runs!

Supercomputer -> MapReduce cluster -> Engineer

Each multi-day HPC simulation generates gigabytes of data. A MapReduce cluster can hold hundreds or thousands of old simulations … enabling engineers to query and analyze months of simulation data for statistical studies and uncertainty quantification.
MapReduce

Originated at Google for indexing web pages and computing PageRank.

The idea
Bring the computations to the data.
Express algorithms in data-local operations.
Implement one type of communication: shuffle.
Shuffle moves all data with the same key to the same reducer.

Data scalable. Fault tolerance by design.

[Diagram: five map tasks feed two reduce tasks through the shuffle. Input is stored in triplicate; map output is persisted to disk before the shuffle; reduce input and output are on disk.]
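A minimal sketch of that contract in plain Python (an in-memory stand-in, not Hadoop itself; the function and variable names here are mine):

```python
from collections import defaultdict

def mapreduce(records, mapper, reducer):
    """Simulate one MapReduce round in memory.

    mapper:  record -> iterable of (key, value) pairs
    reducer: (key, values) -> iterable of results
    """
    groups = defaultdict(list)
    # Map phase: apply the mapper to every input record.
    for record in records:
        for key, value in mapper(record):
            # Shuffle: all values with the same key end up together,
            # the way Hadoop routes them to a single reducer.
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    results = []
    for key, values in groups.items():
        results.extend(reducer(key, values))
    return results
```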
Mesh point variance in MapReduce

Three simulation runs with three time steps each (runs 1-3, T=1, 2, 3), giving nine mesh snapshots as input.

1. Each mapper outputs the mesh points with the same key.
2. Shuffle moves all values from the same mesh point to the same reducer.
3. Reducers just compute a numerical variance.

Bring the computations to the data!
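Those three steps, sketched as mapper and reducer functions that plug into the in-memory `mapreduce` above (the record layout, one tuple per snapshot, is an illustrative assumption, not the project’s actual file format):

```python
def variance_mapper(record):
    # One record per (run, time step) snapshot:
    # (run_id, t, [(mesh_point_id, value), ...]).
    run_id, t, mesh_values = record
    for point_id, value in mesh_values:
        # Key by mesh point, so the shuffle gathers all nine values
        # (3 runs x 3 time steps) for each point at one reducer.
        yield point_id, value

def variance_reducer(point_id, values):
    # Population variance over every value seen at this mesh point.
    n = len(values)
    mean = sum(values) / n
    yield point_id, sum((v - mean) ** 2 for v in values) / n
```

Calling `mapreduce(snapshots, variance_mapper, variance_reducer)` then yields one (mesh point, variance) pair per mesh point.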
Hadoop???

Hadoop: an open-source MapReduce system supported by Yahoo!

Hadoop was the name of Doug Cutting’s son’s stuffed elephant.

Photo credit: New York Times (2009)
http://www.nytimes.com/2009/03/17/technology/business-computing/17cloud.html
MapReduce and Surrogate Models

A surrogate model is a function that reproduces the output of a simulation and predicts its output at new parameter values.

[Pipeline: on the MapReduce cluster, Extraction pulls the outputs f1, f2, …, fk out of The Database (s1 -> f1, s2 -> f2, …, sk -> fk). On just one machine, The Surrogate interpolates them. Sampling the surrogate at new parameter values produces new entries (sa -> fa, sb -> fb, sc -> fc), stored back on the MapReduce cluster.]
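The talk leaves the interpolation scheme open; as one plausible sketch of the single-machine step, a radial-basis-function surrogate over the extracted (s_i, f_i) pairs can be built with SciPy (the synthetic data and all variable names are mine, for illustration only):

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Stand-ins for the extracted database: k parameter samples s_i
# (d-dimensional) and their time histories f_i (n time steps each).
k, d, n = 200, 3, 100
rng = np.random.default_rng(0)
s = rng.uniform(size=(k, d))
f = np.sin(s.sum(axis=1))[:, None] * np.linspace(0.0, 1.0, n)

# On one machine: fit the surrogate. RBF interpolation is just one
# reasonable choice, not necessarily what the project used.
surrogate = RBFInterpolator(s, f)

# Predict full time histories at new parameter values sa, sb, ...
s_new = rng.uniform(size=(5, d))
f_new = surrogate(s_new)  # shape (5, n): one predicted history per sample
```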
MapReduce + Simulations

Thermal race problem (Joe Ruthruff and Jeremy Templeton)
Heating an unclassified NW model to simulate a fire.
Goal: make sure the weak link fails before the strong link.
Objective: determine where parameters are interesting.
Parameters: material properties.

1000-sample parameter study:
30 min per run
700 GB of raw data
~64 GB of data for MapReduce

Our results
Extraction: 30 min
Interpolation: 5 min
Data-driven Green’s functions

The simulation varies most in the corners of the parameter space. These analyses will help us understand which parameters matter, and where to continue running the simulations.
What can science learn from Google?

The importance of data.

For more about this, see a talk and paper by Peter Norvig:
Norvig. “Internet-Scale Data Analysis.” http://www.stanford.edu/group/mmds/slides2010/Norvig.pdf
Halevy, Norvig, and Pereira. “The unreasonable effectiveness of data.” IEEE Intelligent Systems, 2009.
The future?

MapReduce computers embedded within supercomputers as the parallel file system and data analysis platform. The best of both worlds.

Our work: laying the foundation for useful surrogate models that will make data-based simulation a compelling addition to the simulation revolution.

Any questions?