Modeling and Simulation: The Good, the Bad, and the Hopeful
David B. Nelson, Ph.D., Director
National Coordination Office for
Information Technology Research and Development
DOE Computational Science Graduate Fellowship Conference
July 15, 2003
2
National Coordination Office (NCO) for Information Technology Research and Development (IT R&D)
Mission: To formulate and promote Federal information technology research and development to meet national goals.
The NCO Director reports to the Director of the White House Office of Science and Technology Policy (OSTP) and co-chairs the Interagency Working Group for IT R&D
Coordinates planning, budget, and assessment activities for the Federal multi-agency Networking and Information Technology R&D (NITRD) Program
Supports six technical Coordinating Groups (CGs) that report to the Interagency Working Group
3
NITRD Program Coordination

[Organization chart] In the Executive Office of the President (White House), the Office of Science and Technology Policy and the National Science and Technology Council oversee the National Coordination Office (NCO) for Information Technology Research and Development and the President's Information Technology Advisory Committee (PITAC). The NCO supports the Interagency Working Group (IWG) on Information Technology R&D and its six Coordinating Groups:
– High End Computing (HEC)
– Large Scale Networking (LSN)
– High Confidence Software and Systems (HCSS)
– Human Computer Interaction & Information Management (HCI & IM)
– Software Design and Productivity (SDP)
– Social, Economic and Workforce Implications of IT and IT Workforce Development (SEW)
Participating agencies: AHRQ, DARPA, DOE/NNSA, DOE/SC, EPA, NASA, NIH, NIST, NOAA, NSA, NSF, ODDR&E
The U.S. Congress provides NITRD authorization and appropriations legislation.
4
Simulation of Aquaporin Protein Inside a Cell (NSF, NIH, PSC Alpha Cluster)

Visualization shows transport of water molecules into the cell.
5
Environmental Modeling of the Chesapeake Bay (NOAA, EPA, DoD)

Image shows visualization of computed salinity in the Bay (red is high salinity). South is up.
Visualization is an important part of the model, because users may not be skilled computational scientists.
6
Environmental Modeling of the Chesapeake Bay (NOAA, EPA, DoD)

Model is checked against measured data.
Model has shown that approximately 1/4 of the nitrogen added to the Bay starts as air pollution, some from sources hundreds of miles from the Bay's watershed.
Model also shows that substantial nitrogen comes from ground water on the Eastern Shore.
7
Explosion of a Supernova (not to scale) (DOE)

[Image sequence: start, middle, and end of the explosion]
8
Simulation of Turbulent Flame with Comprehensive Chemistry (DOE)
9
Power of Japanese Earth Simulator Allows Better Resolution of Local Features

Simulation of Tropical Cyclone Near Madagascar

[Three panels: 125.1 km grid, 62.5 km grid, 10.4 km grid]

(The U.S. 1200-year control run used an approximately 280 km grid.)
10
Grid Communications & Applications: High End Physics Problem

[Diagram: tiered computing model for the Compact Muon Solenoid at CERN; image courtesy Harvey Newman, Caltech. The online system feeds an offline processor farm (~20 TIPS) at the CERN Computer Centre (Tier 0) at ~100 MBytes/sec, backed by a physics data cache at ~PBytes/sec. Links of ~622 Mbits/sec (or air freight, deprecated) connect Tier 0 to Tier 1 regional centres (FermiLab ~4 TIPS; France, Germany, and Italy regional centres), which connect at ~622 Mbits/sec to Tier 2 centres (~1 TIPS each, e.g., Caltech). Institutes (~0.25 TIPS) serve physicist workstations (Tier 4) at ~1 MBytes/sec. 1 TIPS is approximately 25,000 SpecInt95 equivalents.]

– There is a "bunch crossing" every 25 nsecs.
– There are 100 "triggers" per second.
– Each triggered event is ~1 MByte in size.
– Physicists work on analysis "channels."
– Each institute will have ~10 physicists working on one or more channels; data for these channels should be cached by the institute server.
11
Processor-Memory Performance Gap

[Chart: performance on a logarithmic axis (1 to 1000) vs. year, 1980-2000. CPU performance ("Moore's Law") improves ~60%/yr. while DRAM performance improves ~7%/yr., so the processor-memory performance gap grows ~50% per year.]
• Alpha 21264 full cache miss / instructions executed: 180 ns / 1.7 ns = 108 clks × 4-wide issue, or 432 instructions
• Caches in Pentium Pro: 64% of area, 88% of transistors
(*Taken from the Patterson-Keeton talk to SIGMOD)
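To make the first bullet concrete, here is a small sketch (not from the talk) of that arithmetic; the 1.7 ns cycle and 4-wide issue are the slide's figures, and the slide's 108 clks corresponds to a ~1.67 ns cycle:

```c
/* Back-of-the-envelope cost of one full cache miss, using the slide's
 * Alpha 21264 figures. The slide quotes 108 clks (a ~1.67 ns cycle);
 * 180 / 1.7 as printed gives ~106. Either way, roughly 4 issue slots
 * per clock are lost, i.e. ~430 instructions per miss. */
#include <stdio.h>

int main(void) {
    double miss_latency_ns = 180.0;  /* full miss to main memory */
    double cycle_time_ns   = 1.7;    /* processor clock period */
    int    issue_width     = 4;      /* instructions per clock */

    double stall_clocks = miss_latency_ns / cycle_time_ns;
    printf("~%.0f clocks stalled, ~%.0f instruction slots lost per miss\n",
           stall_clocks, stall_clocks * issue_width);
    return 0;
}
```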
12
Processing vs. Memory Access

Doesn't cache solve this problem? It depends. With small amounts of contiguous data, usually. With large amounts of non-contiguous data, usually not. In most computers the programmer has no control over cache. Often "a few" Bytes/FLOP is considered OK.

However, consider operations on the transpose of a matrix (e.g., for adjoint problems):

    X a = b    and    X^T a = b

If X is big enough, 100% cache misses are guaranteed, and we need at least 8 Bytes/FLOP (assuming a and b can be held in cache).
Latency and limited bandwidth of processor-memory and node-node communications are major limiters of performance for scientific computation
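A minimal C sketch (not from the talk) of why the transposed problem defeats cache: with a row-major layout, X a reads X with unit stride, while X^T a reads it with stride N, touching a new cache line on essentially every access once X outgrows cache. N and the layout are illustrative assumptions.

```c
/* Sketch: X a = b walks X contiguously; X^T a = b walks it with stride N. */
#include <stddef.h>

enum { N = 4096 };   /* rows of X far exceed cache at this size */

/* b = X a: inner loop reads a row of X with unit stride -- cache friendly */
void xa(const double X[N][N], const double a[N], double b[N]) {
    for (size_t i = 0; i < N; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            sum += X[i][j] * a[j];      /* consecutive addresses */
        b[i] = sum;
    }
}

/* b = X^T a: inner loop reads a column of X, a stride of N doubles,
 * so each access lands on a different cache line -- ~100% misses */
void xta(const double X[N][N], const double a[N], double b[N]) {
    for (size_t i = 0; i < N; i++) {
        double sum = 0.0;
        for (size_t j = 0; j < N; j++)
            sum += X[j][i] * a[j];      /* addresses N*8 bytes apart */
        b[i] = sum;
    }
}
```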
13
Testing Processing vs. Memory Access with Benchmarks

Simple benchmark: Stream Triad

    a_i + s × b_i = c_i

a_i, b_i, and c_i are vectors; s is a scalar. Vector length is chosen to be much longer than cache size.
Each execution includes 2 memory loads + 1 memory store and 2 FLOPs, or 12 Bytes/FLOP (assuming 8-Byte precision).
No computer has enough memory bandwidth to reference 12 Bytes for each FLOP!
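A minimal triad kernel, sketched in C (illustrative; the official STREAM benchmark adds repetitions, threading, and more careful timing, and the array size here is an arbitrary cache-busting choice):

```c
/* Stream-Triad-style kernel: c[i] = a[i] + s * b[i] over vectors much
 * larger than cache. Each iteration moves 24 bytes (2 loads + 1 store
 * of 8-byte doubles) for 2 FLOPs -- the 12 Bytes/FLOP on the slide. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)               /* 16M doubles per array, ~128 MB total */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    if (!a || !b || !c) return 1;
    const double s = 3.0;
    for (long i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    clock_t t0 = clock();
    for (long i = 0; i < N; i++)
        c[i] = a[i] + s * b[i];    /* triad: 2 FLOPs, 24 bytes of traffic */
    double sec = (double)(clock() - t0) / CLOCKS_PER_SEC;

    printf("bandwidth: %.2f GB/s\n", 24.0 * N / sec / 1e9);
    printf("checksum: %f\n", c[N / 2]);  /* keep the loop from being optimized away */
    free(a); free(b); free(c);
    return 0;
}
```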
14
Testing Processing vs. Memory Access with Benchmarks

Another benchmark: Linpack

    A_ij x_j = b_i

Solve this linear equation for the vector x, where A is a known matrix and b is a known vector. Linpack uses the BLAS routines, which divide A into blocks.
On average, Linpack requires 1 memory reference for every 2 FLOPs, or 4 Bytes/FLOP.
Many of these can be cache references.
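Blocking is what makes those references cacheable. The sketch below shows the idea on a plain matrix multiply rather than Linpack's blocked LU factorization; N, the block size B, and a zero-initialized C are assumptions for illustration.

```c
/* Sketch of the blocking idea behind the BLAS: operate on B x B tiles
 * that fit in cache, so each element loaded from memory is reused ~B
 * times instead of once. Caller must zero-initialize C. */
#define N 1024
#define B 64            /* tile size chosen so three B x B tiles fit in cache */

void matmul_blocked(const double A[N][N], const double X[N][N], double C[N][N]) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                /* each A and X tile stays cache-resident for ~2*B^3 FLOPs */
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double aik = A[i][k];
                        for (int j = jj; j < jj + B; j++)
                            C[i][j] += aik * X[k][j];
                    }
}
```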
15
Selected System Characteristics

                           Earth Simulator   ASCI Q            ASCI White         MCR              Cray X1
                           (NEC)             (HP ES45)         (IBM SP3)          (Dual Xeon)      (Cray)
Year of introduction       2002              2003              2000               2002             2003
Node architecture          Vector SMP        Alpha micro SMP   Power 3 micro SMP  Xeon micro SMP   Vector SMP
System topology            NEC single-stage  Quadrics QsNet    IBM Omega          Quadrics QsNet   2D torus
                           crossbar          fat-tree          network            fat-tree         interconnect
Number of nodes            640               3072 (total)      512                1152             --
Processors per node        8                 4                 16                 2                4
Processors, system total   5120              12288             8192               2304             --
Processor speed            500 MHz           1.25 GHz          375 MHz            2.4 GHz          800 MHz
Peak speed per processor   8 Gflops          2.5 Gflops        1.5 Gflops         4.8 Gflops       12.8 Gflops
Peak speed per node        64 Gflops         10 Gflops         24 Gflops          9.6 Gflops       51.2 Gflops
Peak speed, system total   40 Tflops         30 Tflops         12 Tflops          10.8 Tflops      --
Memory per node            16 GB             16 GB             16 GB              4 GB             8-64 GB
Memory per processor       2 GB              4 GB              1 GB               2 GB             2-16 GB
Memory, system total       10.24 TB          48 TB             8 TB               4.6 TB           --
Memory bandwidth (peak):
  L1 cache                 N/A               20 GB/s           5 GB/s             20 GB/s          76.8 GB/s
  L2 cache                 N/A               13 GB/s           2 GB/s             1.5 GB/s         --
  Main (per processor)     32 GB/s           2 GB/s            1 GB/s             2 GB/s           34.1 GB/s
Inter-node MPI:
  Latency                  8.6 µsec          5 µsec            18 µsec            4.75 µsec        --
  Bandwidth                11.8 GB/s         300 MB/s          500 MB/s           315 MB/s         12.8 GB/s
Bytes/flop to main memory  4                 0.8               0.67               0.4              2.66
Bytes/flop, interconnect   1.5               0.12              0.33               0.07             1

Most of this data is from Kerbyson, Hoisie, and Wasserman (LANL), unpublished.
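The Bytes/flop-to-main-memory row is just peak main-memory bandwidth divided by peak speed, per processor. A small sketch that reproduces it from the per-processor figures in the table above:

```c
/* Bytes/FLOP = main-memory bandwidth (GB/s) / peak rate (GFLOP/s),
 * per processor, using the table's values. */
#include <stdio.h>

int main(void) {
    const char  *name[]    = { "Earth Simulator", "ASCI Q", "ASCI White",
                               "MCR", "Cray X1" };
    const double mem_gbs[] = { 32.0, 2.0, 1.0, 2.0, 34.1 };  /* main memory GB/s */
    const double gflops[]  = { 8.0, 2.5, 1.5, 4.8, 12.8 };   /* peak GFLOP/s */

    for (int i = 0; i < 5; i++)
        printf("%-16s %5.2f Bytes/FLOP to main memory\n",
               name[i], mem_gbs[i] / gflops[i]);
    return 0;
}
```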
16
Performance Measures of Selected Top Computers

[Bar chart, logarithmic Y axis (0.1 to 100.0): Peak Performance, Linpack Rmax, and Stream Triad for the Earth Simulator, LANL AlphaServer, LANL ASCI White, LLNL MCR Cluster, and PSC AlphaServer. Percentages represent the ratios Rmax/Rpeak and Stream/Rpeak: Earth Simulator 87.5% and 27.0%; LANL AlphaServer 75.5% and 3.0%; LANL ASCI White 58.8% and 2.0%; LLNL MCR Cluster 51.5% and 1.8%; PSC AlphaServer 74.0% and 3.7%.]
17
Major Problem: Poor Links Between Workload and Architecture Design

"Build it and they will come"
Weaknesses of the Government's High Performance Computing and Communications program in the 1990s:
– No link between grants for computer architecture research and grants for computer acquisition
– Poor feedback from users to developers
– Poor connections between computational scientists and computer scientists (one workshop in Pittsburgh in 1993)
Result: selection of computer architectures is not well grounded in application needs
18
What About Synthetic Benchmarks?

Peak performance – 'nuff said
Linpack – only measures performance of cache-friendly code
Stream – only measures contiguous communication with memory, but a good measure of bandwidth
GUPS – a really tough benchmark because it makes random memory accesses; may exceed the requirements of most codes (see the sketch below)
IDC balanced benchmarks – a good compilation, but somewhat artificially combined
Effective System Performance benchmark – promising, but not widely used
NAS Parallel Benchmarks – fell into disuse, but may be coming back
Livermore Loops – obsolete
Your own workload – ??
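A sketch of the access pattern that makes GUPS tough (illustrative, not the official HPCC RandomAccess code; the table size and update count are arbitrary assumptions):

```c
/* Random-update pattern in the spirit of GUPS: every update hits an
 * unpredictable address in a table far larger than any cache, so
 * caches and prefetchers get almost no traction. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE (1u << 26)      /* 64M entries (512 MB of uint64_t) */

int main(void) {
    uint64_t *table = calloc(TABLE_SIZE, sizeof *table);
    if (!table) return 1;

    uint64_t x = 1;
    for (long i = 0; i < 4L * TABLE_SIZE; i++) {
        /* linear congruential generator supplies pseudo-random indices */
        x = x * 6364136223846793005ULL + 1442695040888963407ULL;
        table[x % TABLE_SIZE] ^= x;  /* read-modify-write at a random slot */
    }

    printf("%llu\n", (unsigned long long)table[0]);  /* defeat dead-code elimination */
    free(table);
    return 0;
}
```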
19
Resurgence of Performance Analysis Is Promising
LANL Performance and Architecture Lab: http://www.c3.lanl.gov/par_arch/
Performance Evaluation Research Center: http://perc.nersc.gov/
IDC User Forum: http://64.122.81.35/benchmark/
Performance Modeling and Characterization: http://www.sdsc.edu/PMaC/Benchmark/
NAS Parallel Benchmarks: http://www.nas.nasa.gov/Software/NPB/
Recent High End Computing Workshop offered recommendations for performance evaluation: http://www.cra.org/Activities/workshops/nitrd/
Great opportunity for agencies to cooperate on performance evaluation.
20
Summary

Computational science is now a third pillar of research, along with experiment and theory.
High-end computers are getting harder to use and more inefficient.
Federal agencies are recognizing this and working to improve things.
21
For Further Information
Please contact us at:
Or visit us on the Web:
www.itrd.gov
22
Backup
23
The Federal Government Plays a Critical Role in Supporting Fundamental Research in Networking and IT
Federally-sponsored research is critical to building the technology base on which the IT industry has grown
The Federal government funds basic research not traditionally funded by the commercial sector
– High-risk, innovative ideas whose practical benefits may take years to demonstrate
The Networking and IT R&D program (NITRD) provides a mechanism for focused long-term interagency R&D in information technologies in a vast breadth of research and application areas
$2 billion multi-agency NITRD Program
– 12 agencies and departments coordinated via a "virtual agency" coordination/management structure
– Additional agencies participate as observers or associates
– Coordinated by the National Coordination Office for Information Technology Research and Development
Assessed by the President’s Information Technology Advisory Committee
24
                   Peak Perf.   Linpack Benchmark        Stream Triad Benchmark
                   TFLOP/s      TFLOP/s    Efficiency    GBytes/s   TFLOP/s   Efficiency
Earth Simulator    40.960       35.860     87.5%         133,120    11.093    27.1%
LANL Alpha Server  10.240        7.727     75.5%           3,670     0.306     3.0%
LANL ASCI White    12.288        7.226     58.8%           3,021     0.252     2.0%
LLNL MCR           11.060        5.694     51.5%           2,419     0.202     1.8%
PSC Alpha Server    6.032        4.463     74.0%           2,702     0.225     3.7%
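The Stream Triad TFLOP/s and efficiency columns follow from the GB/s column at the triad's 12 bytes of memory traffic per FLOP. A small check using the Earth Simulator row:

```c
/* Stream Triad: TFLOP/s = GB/s / 12 / 1000; efficiency = that / Rpeak.
 * Earth Simulator: 133,120 / 12 = 11,093 GFLOP/s = 11.093 TFLOP/s,
 * and 11.093 / 40.960 = 27.1% of peak. */
#include <stdio.h>

int main(void) {
    double stream_gbs = 133120.0;  /* Earth Simulator Stream Triad, GB/s */
    double rpeak_tf   = 40.960;    /* peak performance, TFLOP/s */

    double stream_tf = stream_gbs / 12.0 / 1000.0;
    printf("Stream Triad: %.3f TFLOP/s, %.1f%% of peak\n",
           stream_tf, 100.0 * stream_tf / rpeak_tf);
    return 0;
}
```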
25
Performance Measures of Selected Top Computers

[Bar chart with two linear Y axes: Rmax and Rpeak in GFlops (0 to 45,000, left axis) and Stream Triad in GB/s (0 to 180,000, right axis)* for the Earth Simulator, ASCI Q (LANL), ASCI White (LLNL), MCR Linux Cluster (LLNL), and Alpha Server (PSC). Percentages represent the ratios Rmax/Rpeak and Stream/Rpeak: Earth Simulator 87.5% and 27.0%; ASCI Q 75.5% and 3.0%; ASCI White 58.8% and 2.0%; MCR Linux Cluster 51.4% and 1.8%; Alpha Server 73.9% and 3.7%.]

Rpeak and Rmax data from Top500.org; Stream Triad data from IDC
* Stream Triad performance in GFLOP/s equals performance in GB/s divided by 12
26
Trade-Offs Between Commodity Clusters and Custom Supercomputers

Clusters
– Absolutely cheap
– Cheap for peak FLOPS/$
– Low direct maintenance $
– Large volume per node
– High power requirements per node
– Easy to develop code on workstations
– Efficient for code with limited communication that fits in cache
– Benchmark to be sure!

Supercomputers
– Absolutely expensive
– May be cheap for sustained FLOPS/$
– Smaller volume per node
– Lower power requirements per node
– Harder to develop code on workstations
– Efficient for large codes with high communications requirements
– Benchmark to be sure!
27
Collision of Deuterium Ion with Gold in Relativistic Heavy Ion Collider (DOE)