Modeling and Simulation: The Good, the Bad, and the Hopeful
David B. Nelson, Ph.D., Director
National Coordination Office for
Information Technology Research and Development
DOE Computational Science Graduate Fellowship Conference
July 15, 2003
2
National Coordination Office (NCO) for Information Technology Research and Development (IT R&D)
Mission: To formulate and promote Federal information technology research and development to meet national goals.
The NCO Director reports to the Director of the White House Office of Science and Technology Policy (OSTP) and co-chairs the Interagency Working Group for IT R&D
Coordinates planning, budget, and assessment activities for the Federal multi-agency Networking and Information Technology R&D (NITRD) Program
Supports six technical Coordinating Groups (CGs) that report to the Interagency Working Group
3
NITRD Program Coordination

[Organization chart] In the Executive Office of the President (White House), the Office of Science and Technology Policy and the National Science and Technology Council oversee the National Coordination Office (NCO) for Information Technology Research and Development and the President's Information Technology Advisory Committee (PITAC). The NCO supports the Interagency Working Group (IWG) on Information Technology R&D and its six Coordinating Groups:
– High End Computing (HEC)
– Large Scale Networking (LSN)
– High Confidence Software and Systems (HCSS)
– Human Computer Interaction & Information Management (HCI & IM)
– Software Design and Productivity (SDP)
– Social, Economic and Workforce Implications of IT and IT Workforce Development (SEW)
Participating agencies: AHRQ, DARPA, DOE/NNSA, DOE/SC, EPA, NASA, NIH, NIST, NOAA, NSA, NSF, ODDR&E
The U.S. Congress provides NITRD authorization and appropriations legislation.
4
Simulation of Aquaporin Protein Inside a Cell (NSF, NIH, PSC Alpha Cluster)

Visualization shows transport of water molecules into the cell.
5
Environmental Modeling of the Chesapeake Bay (NOAA, EPA, DoD)

Image shows visualization of computed salinity in the Bay (red is high salinity). South is up.
Visualization is an important part of the model, because users may not be skilled computational scientists.
6
Environmental Modeling of the Chesapeake Bay (NOAA, EPA, DoD)

Model is checked against measured data.
Model has shown that approximately 1/4 of the nitrogen added to the Bay starts as air pollution, some from sources hundreds of miles from the Bay's watershed.
Model also shows that substantial nitrogen comes from ground water on the Eastern Shore.
7
Explosion of a Supernova (not to scale) (DOE)

[Image sequence: start, middle, and end of the explosion]
8
Simulation of Turbulent Flame with Comprehensive Chemistry (DOE)
9
Power of Japanese Earth Simulator Allows Better Resolution of Local Features

Simulation of Tropical Cyclone Near Madagascar

[Three panels: 125.1 km grid, 62.5 km grid, 10.4 km grid]

(The U.S. 1200-year control run used an approximately 280 km grid.)
10
Grid Communications & Applications: High End Physics Problem

[Diagram: tiered computing model for the Compact Muon Solenoid at CERN; image courtesy Harvey Newman, Caltech. The online system feeds an offline processor farm (~20 TIPS) at the CERN Computer Centre (Tier 0) at ~100 MBytes/sec, backed by a physics data cache at ~PBytes/sec. Links of ~622 Mbits/sec (or air freight, deprecated) connect Tier 0 to Tier 1 regional centres (FermiLab ~4 TIPS; France, Germany, and Italy regional centres), which connect at ~622 Mbits/sec to Tier 2 centres (~1 TIPS each, e.g., Caltech). Institutes (~0.25 TIPS) serve physicist workstations (Tier 4) at ~1 MBytes/sec. 1 TIPS is approximately 25,000 SpecInt95 equivalents.]

– There is a "bunch crossing" every 25 nsecs.
– There are 100 "triggers" per second.
– Each triggered event is ~1 MByte in size.
– Physicists work on analysis "channels."
– Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.
11
Processor-Memory Performance Gap

[Chart: performance on a logarithmic axis (1 to 1000) vs. year, 1980-2000. CPU performance ("Moore's Law") improves ~60%/yr. while DRAM performance improves ~7%/yr., so the processor-memory performance gap grows ~50% per year.]
• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks × 4-wide issue, or 432 instructions
• Caches in Pentium Pro: 64% of area, 88% of transistors
(*Taken from the Patterson-Keeton talk to SIGMOD)
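To make the first bullet concrete, here is a small sketch (not from the talk) of that arithmetic; the 1.7 ns cycle and 4-wide issue are the slide's figures, and the slide's 108 clks corresponds to a ~1.67 ns cycle:

```c
/* Back-of-the-envelope cost of one full cache miss, using the slide's
 * Alpha 21264 figures. The slide quotes 108 clks (a ~1.67 ns cycle);
 * 180 / 1.7 as printed gives ~106. Either way, roughly 4 issue slots
 * per clock are lost, i.e. ~430 instructions per miss. */
#include <stdio.h>

int main(void) {
    double miss_latency_ns = 180.0;  /* full miss to main memory */
    double cycle_time_ns   = 1.7;    /* processor clock period */
    int    issue_width     = 4;      /* instructions per clock */

    double stall_clocks = miss_latency_ns / cycle_time_ns;
    printf("~%.0f clocks stalled, ~%.0f instruction slots lost per miss\n",
           stall_clocks, stall_clocks * issue_width);
    return 0;
}
```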
12
Processing vs. Memory Access

Doesn't cache solve this problem? It depends. With small amounts of contiguous data, usually. With large amounts of non-contiguous data, usually not. In most computers the programmer has no control over cache. Often "a few" Bytes/FLOP is considered OK.

However, consider operations on the transpose of a matrix (e.g., for adjoint problems):

    X a = b    and    X^T a = b

If X is big enough, 100% cache misses are guaranteed, and we need at least 8 Bytes/FLOP (assuming a and b can be held in cache).
Latency and limited bandwidth of processor-memory and node-node communications are major limiters of performance for scientific computation
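A minimal C sketch (not from the talk) of why the transposed problem defeats cache: with a row-major layout, X a reads X with unit stride, while X^T a reads it with stride N, touching a new cache line on essentially every access once X outgrows cache. N and the layout are illustrative assumptions.

```c
/* Sketch: X a = b walks X contiguously; X^T a = b walks it with stride N. */
#include <stddef.h>

enum { N = 4096 };   /* rows of X far exceed cache at this size */

/* b = X a: inner loop reads a row of X with unit stride -- cache friendly */
void xa(const double X[N][N], const double a[N], double b[N]) {
    for (size_t i = 0; i < N; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            sum += X[i][j] * a[j];      /* consecutive addresses */
        b[i] = sum;
    }
}

/* b = X^T a: inner loop reads a column of X, a stride of N doubles,
 * so each access lands on a different cache line -- ~100% misses */
void xta(const double X[N][N], const double a[N], double b[N]) {
    for (size_t i = 0; i < N; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            sum += X[j][i] * a[j];      /* addresses N*8 bytes apart */
        b[i] = sum;
    }
}
```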
13
Testing Processing vs. Memory Access with Benchmarks

Simple benchmark: Stream Triad

    a_i + s × b_i = c_i

a_i, b_i, and c_i are vectors; s is a scalar. Vector length is chosen to be much longer than cache size.
Each execution includes 2 memory loads + 1 memory store and 2 FLOPs, or 12 Bytes/FLOP (assuming 8-Byte precision).
No computer has enough memory bandwidth to reference 12 Bytes for each FLOP!
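A minimal triad kernel, sketched in C (illustrative; the official STREAM benchmark adds repetitions, threading, and more careful timing, and the array size here is an arbitrary cache-busting choice):

```c
/* Stream-Triad-style kernel: c[i] = a[i] + s * b[i] over vectors much
 * larger than cache. Each iteration moves 24 bytes (2 loads + 1 store
 * of 8-byte doubles) for 2 FLOPs -- the 12 Bytes/FLOP on the slide. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)               /* 16M doubles per array, ~128 MB total */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    const double s = 3.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        c[i] = a[i] + s * b[i];    /* triad: 2 FLOPs, 24 bytes of traffic */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("bandwidth: %.2f GB/s\n", 24.0 * N / sec / 1e9);
    printf("checksum: %f\n", c[N / 2]);  /* keep the loop from being optimized away */
    free(a); free(b); free(c);
    return 0;
}
```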
14
Testing Processing vs. Memory Access with Benchmarks

Another benchmark: Linpack

    A_ij x_j = b_i

Solve this linear equation for the vector x, where A is a known matrix and b is a known vector. Linpack uses the BLAS routines, which divide A into blocks.
On average, Linpack requires 1 memory reference for every 2 FLOPs, or 4 Bytes/FLOP.
Many of these can be cache references.
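Blocking is what makes those references cacheable. The sketch below shows the idea on a plain matrix multiply rather than Linpack's blocked LU factorization; N, the block size B, and a zero-initialized C are assumptions for illustration.

```c
/* Sketch of the blocking idea behind the BLAS: operate on B x B tiles
 * that fit in cache, so each element loaded from memory is reused ~B
 * times instead of once. Caller must zero-initialize C. */
#define N 1024
#define B 64            /* tile size chosen so three B x B tiles fit in cache */

void matmul_blocked(const double A[N][N], const double X[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                /* each A and X tile stays cache-resident for ~2*B^3 FLOPs */
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += aik * X[k][j];
                    }
}
```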
15
Selected System Characteristics

                           Earth Simulator   ASCI Q            ASCI White         MCR              Cray X1
                           (NEC)             (HP ES45)         (IBM SP3)          (Dual Xeon)      (Cray)
Year of introduction       2002              2003              2000               2002             2003
Node architecture          Vector SMP        Alpha micro SMP   Power 3 micro SMP  Xeon micro SMP   Vector SMP
System topology            NEC single-stage  Quadrics QsNet    IBM Omega          Quadrics QsNet   2D torus
                           crossbar          fat-tree          network            fat-tree         interconnect
Number of nodes            640               3072 (total)      512                1152             --
Processors per node        8                 4                 16                 2                4
Processors, system total   5120              12288             8192               2304             --
Processor speed            500 MHz           1.25 GHz          375 MHz            2.4 GHz          800 MHz
Peak speed per processor   8 Gflops          2.5 Gflops        1.5 Gflops         4.8 Gflops       12.8 Gflops
Peak speed per node        64 Gflops         10 Gflops         24 Gflops          9.6 Gflops       51.2 Gflops
Peak speed, system total   40 Tflops         30 Tflops         12 Tflops          10.8 Tflops      --
Memory per node            16 GB             16 GB             16 GB              4 GB             8-64 GB
Memory per processor       2 GB              4 GB              1 GB               2 GB             2-16 GB
Memory, system total       10.24 TB          48 TB             8 TB               4.6 TB           --
Memory bandwidth (peak):
  L1 cache                 N/A               20 GB/s           5 GB/s             20 GB/s          76.8 GB/s
  L2 cache                 N/A               13 GB/s           2 GB/s             1.5 GB/s         --
  Main (per processor)     32 GB/s           2 GB/s            1 GB/s             2 GB/s           34.1 GB/s
Inter-node MPI:
  Latency                  8.6 µsec          5 µsec            18 µsec            4.75 µsec        --
  Bandwidth                11.8 GB/s         300 MB/s          500 MB/s           315 MB/s         12.8 GB/s
Bytes/flop to main memory  4                 0.8               0.67               0.4              2.66
Bytes/flop, interconnect   1.5               0.12              0.33               0.07             1

Most of this data is from Kerbyson, Hoisie, and Wasserman (LANL), unpublished.
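The Bytes/flop-to-main-memory row is just peak main-memory bandwidth divided by peak speed, per processor. A small sketch that reproduces it from the per-processor figures in the table above:

```c
/* Bytes/FLOP = main-memory bandwidth (GB/s) / peak rate (GFLOP/s),
 * per processor, using the table's values. */
#include <stdio.h>

int main(void) {
    const char  *name[]    = { "Earth Simulator", "ASCI Q", "ASCI White",
                               "MCR", "Cray X1" };
    const double mem_gbs[] = { 32.0, 2.0, 1.0, 2.0, 34.1 };  /* main memory GB/s */
    const double gflops[]  = { 8.0, 2.5, 1.5, 4.8, 12.8 };   /* peak GFLOP/s */

    for (int i = 0; i < 5; i++)
        printf("%-16s %5.2f Bytes/FLOP to main memory\n",
               name[i], mem_gbs[i] / gflops[i]);
    return 0;
}
```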
16
Performance Measures of Selected Top Computers

[Bar chart, logarithmic Y axis (0.1 to 100.0): Peak Performance, Linpack Rmax, and Stream Triad for the Earth Simulator, LANL AlphaServer, LANL ASCI White, LLNL MCR Cluster, and PSC AlphaServer. Percentages represent the ratios Rmax/Rpeak and Stream/Rpeak: Earth Simulator 87.5% and 27.0%; LANL AlphaServer 75.5% and 3.0%; LANL ASCI White 58.8% and 2.0%; LLNL MCR Cluster 51.5% and 1.8%; PSC AlphaServer 74.0% and 3.7%.]
17
Major Problem: Poor Links Between Workload and Architecture Design

"Build it and they will come"
Weaknesses of the Government's High Performance Computing and Communications program in the 1990s:
– No link between grants for computer architecture research and grants for computer acquisition
– Poor feedback from users to developers
– Poor connections between computational scientists and computer scientists (one workshop in Pittsburgh in 1993)
Result: selection of computer architectures is not well grounded in application needs
18
What About Synthetic Benchmarks?

Peak performance – 'nuff said
Linpack – only measures performance of cache-friendly code
Stream – only measures contiguous communication with memory, but a good measure of bandwidth
GUPS – a really tough benchmark because it makes random memory accesses; may exceed the requirements of most codes (see the sketch below)
IDC balanced benchmarks – a good compilation, but somewhat artificially combined
Effective System Performance benchmark – promising, but not widely used
NAS Parallel Benchmarks – fell into disuse, but may be coming back
Livermore Loops – obsolete
Your own workload – ??
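A sketch of the access pattern that makes GUPS tough (illustrative, not the official HPCC RandomAccess code; the table size and update count are arbitrary assumptions):

```c
/* Random-update pattern in the spirit of GUPS: every update hits an
 * unpredictable address in a table far larger than any cache, so
 * caches and prefetchers get almost no traction. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 26)      /* 64M entries (512 MB of uint64_t) */

int main(void) {
    uint64_t *table = calloc(TABLE_SIZE, sizeof *table);
    if (!table) return 1;

    uint64_t x = 1;
    for (long i = 0; i < 4L * TABLE_SIZE; i++) {
        /* linear congruential generator supplies pseudo-random indices */
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        table[x % TABLE_SIZE] ^= x;  /* read-modify-write at a random slot */
    }

    printf("%llu\n", (unsigned long long)table[0]);  /* defeat dead-code elimination */
    free(table);
    return 0;
}
```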
19
Resurgence of Performance Analysis Is Promising
LANL Performance and Architecture Lab: http://www.c3.lanl.gov/par_arch/
Performance Evaluation Research Center: http://perc.nersc.gov/
IDC User Forum: http://64.122.81.35/benchmark/
Performance Modeling and Characterization: http://www.sdsc.edu/PMaC/Benchmark/
NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/
Recent High End Computing Workshop offered recommendations for performance evaluation: http://www.cra.org/Activities/workshops/nitrd/
Great opportunity for agencies to cooperate on performance evaluation.
20
Summary

Computational science is now a third pillar of research, along with experiment and theory.
High-end computers are getting harder to use and more inefficient.
Federal agencies are recognizing this and working to improve things.
21
For Further Information
Please contact us at:
Or visit us on the Web:
www.itrd.gov
22
Backup
23
The Federal Government Plays a Critical Role in Supporting Fundamental Research in Networking and IT
Federally-sponsored research is critical to building the technology base on which the IT industry has grown
The Federal government funds basic research not traditionally funded by the commercial sector
– High-risk, innovative ideas whose practical benefits may take years to demonstrate
The Networking and IT R&D program (NITRD) provides a mechanism for focused long-term interagency R&D in information technologies in a vast breadth of research and application areas
$2 billion multi-agency NITRD Program
– 12 agencies and departments coordinated via a "virtual agency" coordination/management structure
– Additional agencies participate as observers or associates
– Coordinated by the National Coordination Office for Information Technology Research and Development
Assessed by the President’s Information Technology Advisory Committee
24
                   Peak Perf.   Linpack Benchmark        Stream Triad Benchmark
                   TFLOP/s      TFLOP/s    Efficiency    GBytes/s   TFLOP/s   Efficiency
Earth Simulator    40.960       35.860     87.5%         133,120    11.093    27.1%
LANL Alpha Server  10.240        7.727     75.5%           3,670     0.306     3.0%
LANL ASCI White    12.288        7.226     58.8%           3,021     0.252     2.0%
LLNL MCR           11.060        5.694     51.5%           2,419     0.202     1.8%
PSC Alpha Server    6.032        4.463     74.0%           2,702     0.225     3.7%
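The Stream Triad TFLOP/s and efficiency columns follow from the GB/s column at the triad's 12 bytes of memory traffic per FLOP. A small check using the Earth Simulator row:

```c
/* Stream Triad: TFLOP/s = GB/s / 12 / 1000; efficiency = that / Rpeak.
 * Earth Simulator: 133,120 / 12 = 11,093 GFLOP/s = 11.093 TFLOP/s,
 * and 11.093 / 40.960 = 27.1% of peak. */
#include <stdio.h>

int main(void) {
    double stream_gbs = 133120.0;  /* Earth Simulator Stream Triad, GB/s */
    double rpeak_tf   = 40.960;    /* peak performance, TFLOP/s */

    double stream_tf = stream_gbs / 12.0 / 1000.0;
    printf("Stream Triad: %.3f TFLOP/s, %.1f%% of peak\n",
           stream_tf, 100.0 * stream_tf / rpeak_tf);
    return 0;
}
```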
25
Performance Measures of Selected Top Computers

[Bar chart with two linear Y axes: Rmax and Rpeak in GFlops (0 to 45,000, left axis) and Stream Triad in GB/s (0 to 180,000, right axis)* for the Earth Simulator, ASCI Q (LANL), ASCI White (LLNL), MCR Linux Cluster (LLNL), and Alpha Server (PSC). Percentages represent the ratios Rmax/Rpeak and Stream/Rpeak: Earth Simulator 87.5% and 27.0%; ASCI Q 75.5% and 3.0%; ASCI White 58.8% and 2.0%; MCR Linux Cluster 51.4% and 1.8%; Alpha Server 73.9% and 3.7%.]

Rpeak and Rmax data from Top500.org; Stream Triad data from IDC
* Stream Triad performance in GFLOP/s equals performance in GB/s divided by 12
26
Trade-Offs Between Commodity Clusters and Custom Supercomputers

Clusters
– Absolutely cheap
– Cheap for peak FLOPS/$
– Low direct maintenance $
– Large volume per node
– High power requirements per node
– Easy to develop code on workstations
– Efficient for code with limited communication that fits in cache
– Benchmark to be sure!

Supercomputers
– Absolutely expensive
– May be cheap for sustained FLOPS/$
– Smaller volume per node
– Lower power requirements per node
– Harder to develop code on workstations
– Efficient for large codes with high communications requirements
– Benchmark to be sure!
27
Collision of Deuterium Ion with Gold in Relativistic Heavy Ion Collider (DOE)