Scientific Computations on Modern Parallel Vector Systems
Leonid Oliker
Julian Borrill, Jonathan Carter, Andrew Canning, John Shalf, David Skinner
Lawrence Berkeley National Laboratory
Stephane Ethier
Princeton Plasma Physics Laboratory
http://crd.lbl.gov/~oliker
Overview
Superscalar cache-based architectures dominate the HPC market
• Leading architectures are commodity-based SMPs due to generality and perception of cost effectiveness
• Growing gap between peak & sustained performance is well known in scientific computing
Modern parallel vectors may bridge this gap for many important applications
In April 2002, the Earth Simulator (ES) became operational:
• Peak ES performance > all DOE and DOD systems combined
• Demonstrated high sustained performance on demanding scientific apps
Conducting evaluation study of scientific applications on modern vector systems
• 09/2003: MOU between ES and NERSC was completed
• First visit to ES center: December 8th-17th, 2003 (ES remote access not available)
• First international team to conduct performance evaluation study at ES
Examining best mapping between demanding applications and leading HPC systems - one size does not fit all
Vector Paradigm
High memory bandwidth
• Allows systems to effectively feed ALUs (high byte to flop ratio)
Flexible memory addressing modes
• Supports fine grained strided and irregular data access
Vector Registers
• Hide memory latency via deep pipelining of memory load/stores
Vector ISA
• Single instruction specifies large number of identical operations
Vector architectures allow for:
• Reduced control complexity
• Efficient utilization of large number of computational resources
• Potential for automatic discovery of parallelism
However: most effective if sufficient regularity discoverable in program structure
• Suffers even if small % of code non-vectorizable (Amdahl's Law)
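The Amdahl's Law point can be made concrete with a small calculation. The sketch below is illustrative only: the function name and the sample vector fractions are assumptions, and the 8x and 32x figures are used simply as example vector-to-scalar speed ratios.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only `vector_fraction` of the scalar runtime
    is accelerated by a factor of `vector_speedup` (Amdahl's Law)."""
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    scalar_fraction = 1.0 - vector_fraction
    return 1.0 / (scalar_fraction + vector_fraction / vector_speedup)


if __name__ == "__main__":
    # Hypothetical vectorized fractions; the ratios are example values.
    for fraction in (0.50, 0.90, 0.95, 0.99):
        print(fraction,
              round(amdahl_speedup(fraction, 8.0), 2),
              round(amdahl_speedup(fraction, 32.0), 2))
```

Even with 95% of the runtime vectorized, the remaining 5% caps the overall gain well below the vector unit's own speed ratio.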
Architectural Comparison
Node Type  Where  CPU/Node  Clock MHz  Peak GFlop  Mem BW GB/s  Peak byte/flop  Netwk BW GB/s/P  Bisect BW byte/flop  MPI Latency usec  Network Topology
Power3     NERSC  16        375        1.5         1.0          0.47            0.13             0.087                16.3              Fat-tree
Power4     ORNL   32        1300       5.2         2.3          0.44            0.13             0.025                7.0               Fat-tree
Altix      ORNL   2         1500       6.0         6.4          1.1             0.40             0.067                2.8               Fat-tree
ES         ESC    8         500        8.0         32.0         4.0             1.5              0.19                 5.6               Crossbar
X1         ORNL   4         800        12.8        34.1         2.7             6.3              0.088                7.3               2D-torus
Custom vector architectures have
• High memory bandwidth relative to peak
• Superior interconnect: latency, point to point, and bisection bandwidth
Overall ES appears as the most balanced architecture, while Altix shows best architectural balance among superscalar architectures
A key ‘balance point’ for vector systems is the scalar:vector ratio
Memory Performance
Triad Mem Test: A(i) = B(i) + s*C(i)
NO machine-specific optimizations
• For strided access, SX6 achieves 10X, 100X, 1000X improvement over X1, Pwr4, Pwr3
• For gather/scatter, SX6/X1 show similar performance, exceed scalar at higher data sizes
• All machines' performance can be improved via architecture-specific optimizations
• Example: On X1 using non-cachable & unroll(4) pragma improves strided BW by 20X
[Figure: Triad bandwidth (MB/s, log scale 1 to 100000) vs. stride (0 to 500) for Power3, Power4, SX-6, X1]
[Figure: Triad gather/scatter bandwidth (MB/s, log scale 1 to 10000) vs. data size (100 bytes to 13M bytes) for Power3, Power4, SX-6, X1]
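For readers unfamiliar with the triad test, the two access patterns measured above can be sketched as follows. This is a minimal pure-Python illustration of the memory access patterns only; the function names are assumptions, and it is not the benchmark code used for the measurements (which would be compiled code timed over large arrays).

```python
def triad_strided(b, c, s, stride):
    """A(i) = B(i) + s*C(i), touching every `stride`-th element."""
    a = [0.0] * len(b)
    for i in range(0, len(b), stride):
        a[i] = b[i] + s * c[i]
    return a


def triad_gather_scatter(b, c, s, index):
    """Same triad, but elements are reached through an index array:
    loads are gathers (b[index[k]]) and stores are scatters (a[index[k]])."""
    a = [0.0] * len(b)
    for i in index:
        a[i] = b[i] + s * c[i]
    return a


if __name__ == "__main__":
    b = [1.0, 2.0, 3.0, 4.0]
    c = [10.0, 20.0, 30.0, 40.0]
    print(triad_strided(b, c, 2.0, 2))
    print(triad_gather_scatter(b, c, 2.0, [3, 0]))
```

With stride 1 the loop streams through memory; larger strides and indirect indices defeat cache lines and prefetching on cache-based systems, which is the effect the plots quantify.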
Analysis using 'Architectural Probe'
Tunable parameters to mimic behavior of important scientific kernels
What % of memory access can be random before performance decreases by half?
How much computational intensity (CI) is required to hide the penalty of all random access?
Gather/Scatter expensive on commodity cache-based systems
• Power4 tolerates only 1.6% random access (1 in 64)
• Itanium2: much less sensitive at 25% (1 in 4)
Huge amount of computation may be required to hide overhead of irregular data access
• Itanium2 requires CI of about 9 flops/word
• Power4 requires CI of almost 75!
[Figure: % indirection reducing performance by 50% (log scale, 0% to 100%) for Itanium 2, Opteron, Power3, Power4; bar labels: 1.6%, 25%, 6.3%, 0.8%]
[Figure: computational intensity (CI) required to hide indirection (scale 0 to 200) for Itanium 2, Opteron, Power3, Power4; bar labels: 9.3, 149.3, 18.7, 74.7]
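One way to read the "1 in 64" and "1 in 4" figures is through a simple linear cost model. The model below is an assumption made for illustration, not the probe's actual methodology: each streamed access costs 1 unit, each random access costs `random_cost_ratio` units, and nothing overlaps.

```python
def relative_runtime(random_fraction, random_cost_ratio):
    """Runtime relative to fully streamed access under the assumed linear model."""
    return (1.0 - random_fraction) + random_fraction * random_cost_ratio


def implied_cost_ratio(half_speed_fraction):
    """Cost ratio r at which `half_speed_fraction` random accesses double
    the runtime: (1 - f) + f*r = 2  =>  r = 1 + 1/f."""
    return 1.0 + 1.0 / half_speed_fraction


if __name__ == "__main__":
    for label, fraction in (("1 in 64", 1.0 / 64), ("1 in 4", 1.0 / 4)):
        ratio = implied_cost_ratio(fraction)
        print(label, ratio, relative_runtime(fraction, ratio))
```

Under this toy model, halving performance at 1 random access in 64 corresponds to a random access costing about 65 streamed accesses, versus about 5 for the 1-in-4 case.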
Sample Kernel Performance
NPB FT Class B
• FFT computationally intensive with data parallel operations
• Well suited for vectorization: 17X, 4X faster than Power3/4
• Fixed cost interprocessor communication hurts scalability
Nbody (Barnes-Hut)
• Nbody requires irregular, unstructured data access, control flow and communication
• Poorly suited for vectorization: 2X and 5X slower than Power3/4
Vector architectures may not be suitable for all classes of applications
[Figure: two panels of GFlops/s vs. processors (1 to 64) for Power3, Power4, SX-6, X1; y-axis ranges 0.0 to 2.0 and 0.00 to 0.30]
Applications studied
Name      Discipline        Lines    Structure             Description
LBMHD     Plasma Physics    1,500    grid based            Lattice Boltzmann approach for magneto-hydrodynamics
CACTUS    Astrophysics      100,000  grid based            Solves Einstein's equations of general relativity
PARATEC   Material Science  50,000   Fourier space/grid    Density Functional Theory electronic structures codes
GTC       Magnetic Fusion   5,000    particle based        Particle in cell method for gyrokinetic Vlasov-Poisson equation
MADbench  Cosmology         2,000    dense linear algebra  Maximum likelihood two-point angular correlation, I/O intensive
Applications chosen with potential to run at ultrascale
Computations contain abundant data parallelism
• ES runs require minimum parallelization and vectorization hurdles
Codes originally designed for superscalar systems
Ported onto single node of SX6, first multi-node experiments performed at ESC
Plasma Physics: LBMHD
LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
Performs 2D simulation of high temperature plasma
Evolves from initial conditions, decaying to form current sheets
2D spatial grid is coupled to octagonal streaming lattice
Block distributed over 2D processor grid
Main computational components:
• Collision: requires coefficients for local gridpoint only, no communication
• Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
• Interpolation: step required between spatial and stream lattices
Developed by George Vahala's group at the College of William and Mary, ported by Jonathan Carter
[Figure: Current density decay of two cross-shaped structures]
LBMHD: Porting Details
Collision routine rewritten:
• For ES, loop ordering switched so gridpoint loop (~1000 iterations) is inner rather than velocity or magnetic field loops (~10 iterations)
• X1 compiler made this transformation automatically: multistreaming outer loop and vectorizing (via strip mining) inner loop
• Temporary arrays padded to reduce bank conflicts
Stream routine performs well:
• Array shift operations, block copies, 3rd-degree polynomial evaluation
Boundary value exchange:
• MPI_Isend, MPI_Irecv pairs
• Further work: plan to use ES "global memory" to remove message copies
[Figure: (left) octagonal streaming lattice coupled with square spatial grid; (right) example of diagonal streaming vector updating three spatial cells]
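The loop reordering described for the collision routine can be illustrated with a toy kernel. The sketch below is an assumption-laden stand-in: the relaxation toward a per-gridpoint mean is a hypothetical placeholder for the real LBMHD collision operator, and Python itself does not vectorize either version. It only shows that interchanging the loops leaves the result unchanged while making the long gridpoint loop the innermost one.

```python
def collide_gridpoint_outer(f, relax):
    """Original ordering: long gridpoint loop outside, short velocity loop inside."""
    npts, nvel = len(f), len(f[0])
    out = [[0.0] * nvel for _ in range(npts)]
    for p in range(npts):                 # ~1000 iterations
        mean = sum(f[p]) / nvel           # toy local "equilibrium"
        for v in range(nvel):             # ~10 iterations: poor vector length
            out[p][v] = f[p][v] - relax * (f[p][v] - mean)
    return out


def collide_gridpoint_inner(f, relax):
    """Interchanged ordering: the long gridpoint loop is innermost, so it is
    the loop a vectorizing compiler would strip-mine."""
    npts, nvel = len(f), len(f[0])
    mean = [sum(f[p]) / nvel for p in range(npts)]
    out = [[0.0] * nvel for _ in range(npts)]
    for v in range(nvel):                 # short loop outside
        for p in range(npts):             # long, independent iterations inside
            out[p][v] = f[p][v] - relax * (f[p][v] - mean[p])
    return out


if __name__ == "__main__":
    f = [[float(p + v) for v in range(9)] for p in range(1000)]
    print(collide_gridpoint_outer(f, 0.5) == collide_gridpoint_inner(f, 0.5))
```

In a Fortran code the array would typically be laid out so that the gridpoint index is the fastest-varying one, making the inner loop unit-stride as well as long.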
LBMHD: Performance
ES achieves highest performance to date: over 3.3 Tflops for P=1024
• X1 comparable absolute speed up to P=64 (lower % peak)
• But performs 1.5X slower at P=256 (decreased scalability)
• CAF improved X1 to slightly exceed ES at P=64 (up to 4.70 Gflop/P)
ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
• Low CI (1.5) and high memory requirement (30GB) hurt scalar performance
• Altix best scalar due to: high memory bandwidth, fast interconnect
Data Size    P     Power3           Power4           Altix            ES               X1
                   Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak
4096 x 4096  16    0.11      7%     0.28      5%     0.60      10%    4.6       58%    4.3       34%
4096 x 4096  64    0.14      9%     0.30      6%     0.62      10%    4.3       54%    4.4       34%
4096 x 4096  256   0.14      9%     0.28      5%     ---       ---    3.2       40%    ---       ---
8192 x 8192  64    0.11      7%     0.27      5%     0.65      11%    4.6       58%    4.5       35%
8192 x 8192  256   0.12      8%     0.28      5%     ---       ---    4.3       53%    2.7       21%
8192 x 8192  1024  0.11      7%     ---       ---    ---       ---    3.3       41%    ---       ---
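The %peak and aggregate figures follow directly from the per-processor rates. As a rough sketch (function and dictionary names are assumptions; the peak values are the per-processor Peak GFlop figures from the Architectural Comparison slide):

```python
# Per-processor peak GFlop/s, taken from the Architectural Comparison slide.
PEAK_GFLOPS = {"Power3": 1.5, "Power4": 5.2, "Altix": 6.0, "ES": 8.0, "X1": 12.8}


def percent_of_peak(gflops_per_proc, machine):
    """Sustained rate as a percentage of the machine's per-processor peak."""
    return 100.0 * gflops_per_proc / PEAK_GFLOPS[machine]


def aggregate_tflops(gflops_per_proc, procs):
    """Total sustained rate across all processors, in Tflop/s."""
    return gflops_per_proc * procs / 1000.0


if __name__ == "__main__":
    # ES at P=1024 on the 8192 x 8192 grid: 3.3 Gflops/P
    print(percent_of_peak(3.3, "ES"), aggregate_tflops(3.3, 1024))
    # X1 at P=256 on the 8192 x 8192 grid: 2.7 Gflops/P
    print(percent_of_peak(2.7, "X1"))
```

Because the table entries are themselves rounded, recomputed percentages can differ from the printed ones by a point.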
LBMHD on X1 MPI vs CAF
X1 well-suited for one-sided parallel languages (globally addressable mem)
• MPI hinders this feature and requires scalar tag matching
CAF allows much simpler coding of boundary exchange (array subscripting):
• feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
MPI requires non-contiguous data copies into buffer, unpacked at destination
Since communication is about 10% of LBMHD, only slight improvements
However, for P=64 on 4096^2 performance degrades. Tradeoffs:
• CAF reduced total message volume 3X (eliminates user and system buffer copy)
• But CAF used more numerous and smaller sized messages

Data Size  P    X1-MPI           X1-CAF
                Gflops/P  %peak  Gflops/P  %peak
4096^2     16   4.32      34%    4.55      36%
4096^2     64   4.35      34%    4.26      33%
8192^2     64   4.48      35%    4.70      37%
8192^2     256  2.70      21%    2.91      23%
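The difference between the two boundary-exchange styles can be mimicked in a single process. The sketch below is only an analogy under stated assumptions: both "images" live in one Python program, grids are flat lists in column-major order (index = i + j*ni, as in Fortran), and the copy counts describe this toy rather than measured message volume.

```python
def halo_via_buffers(sender, receiver, ni, src_i, dst_i, jsta, jend):
    """Two-sided style: pack the strided section, transfer, then unpack."""
    send_buffer = [sender[src_i + j * ni] for j in range(jsta, jend + 1)]  # pack
    recv_buffer = list(send_buffer)           # stands in for the message transfer
    for k, j in enumerate(range(jsta, jend + 1)):                          # unpack
        receiver[dst_i + j * ni] = recv_buffer[k]
    return 3 * len(send_buffer)               # element copies performed


def halo_direct(sender, receiver, ni, src_i, dst_i, jsta, jend):
    """One-sided style: write the remote section straight into place, as the
    co-array assignment does with array subscripting."""
    copies = 0
    for j in range(jsta, jend + 1):
        receiver[dst_i + j * ni] = sender[src_i + j * ni]
        copies += 1
    return copies


if __name__ == "__main__":
    ni, nj = 4, 3
    sender = [float(k) for k in range(ni * nj)]
    recv_a, recv_b = [0.0] * (ni * nj), [0.0] * (ni * nj)
    print(halo_via_buffers(sender, recv_a, ni, 2, 0, 0, nj - 1))
    print(halo_direct(sender, recv_b, ni, 2, 0, 0, nj - 1))
    print(recv_a == recv_b)
```

Both paths produce the same halo values; the buffered path simply touches each element more often, which is the overhead the one-sided version avoids.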
Astrophysics: CACTUS
Numerical solution of Einstein's equations from theory of general relativity
Among most complex in physics: set of coupled nonlinear hyperbolic & elliptic systems with thousands of terms
CACTUS evolves these equations to simulate high gravitational fluxes, such as collision of two black holes
Evolves PDEs on regular grid using finite differences
Uses ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along time dimension
Exciting new field about to be born: Gravitational Wave Astronomy - fundamentally new information about Universe
Gravitational waves: ripples in spacetime curvature, caused by matter motion, causing distances to change
Developed at Max Planck Institute, vectorized by John Shalf
Communication at boundaries
Expect high parallel efficiency
[Figure: Visualization of grazing collision of two black holes]
CACTUS: Performance
ES achieves fastest performance to date: 45X faster than Power3!
• Vector performance related to x-dim (vector length)
• Excellent scaling on ES using fixed data size per proc (weak scaling)
• Scalar performance better on smaller problem size (cache effects)
X1 surprisingly poor (4X slower than ES) - low ratio scalar:vector
• Unvectorized boundary required 15% of runtime on ES and 30+% on X1
• < 5% for the scalar version: unvectorized code can quickly dominate cost
Poor superscalar performance despite high computational intensity
• Register spilling due to large number of loop variables
• Prefetch engines inhibited due to multi-layer ghost zone calculations
Problem Size             P    Power3           Power4           Altix            ES               X1
                              Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak
80x80x80 per processor   16   0.31      21%    0.58      11%    0.89      15%    1.5       18%    0.54      4%
80x80x80 per processor   64   0.22      14%    0.50      10%    0.70      12%    1.4       17%    0.43      3%
80x80x80 per processor   256  0.22      14%    0.48      9%     ---       ---    1.4       17%    0.41      3%
250x80x80 per processor  16   0.10      6%     0.56      11%    0.51      9%     2.8       35%    0.81      6%
250x80x80 per processor  64   0.08      6%     ---       ---    0.42      7%     2.7       34%    0.72      6%
250x80x80 per processor  256  0.07      5%     ---       ---    ---       ---    2.7       34%    0.68      5%
Material Science: PARATEC
PARATEC performs first-principles quantum mechanical total energy calculation using pseudopotentials & plane wave basis set
Density Functional Theory to calculate structure & electronic properties of new materials
DFT calculations are one of the largest consumers of supercomputer cycles in the world
Uses all-band CG approach to obtain wavefunction of electrons
33% 3D FFT, 33% BLAS3, 33% hand-coded F90
Part of calculation in real space, other in Fourier space
• Uses specialized 3D FFT to transform wavefunction
Computationally intensive - generally obtains high percentage of peak
Developed by Andrew Canning with Louie and Cohen's groups (UCB, LBNL)
[Figure: Induced current and charge density in crystallized glycine]
PARATEC: Wavefunction Transpose
Transpose from Fourier to real space
3D FFT done via 3 sets of 1D FFTs and 2 transposes
Most communication in global transpose (b) to (c), little communication (d) to (e)
Many FFTs done at the same time to avoid latency issues
Only non-zero elements communicated/calculated
Much faster than vendor 3D-FFT
[Figure: wavefunction transpose stages, panels (a) through (f)]
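The "3 sets of 1D FFTs and 2 transposes" structure can be sketched serially. The code below is an illustration under loud assumptions: it uses a naive O(n^2) DFT in place of a real FFT, runs in one process with no communication, and transforms every element rather than only the non-zero ones. All function names are hypothetical.

```python
import cmath


def dft_1d(x):
    """Naive 1D discrete Fourier transform (stand-in for a 1D FFT)."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def transform_last_axis(cube):
    """Apply the 1D transform along the last (contiguous) axis of every line."""
    return [[dft_1d(line) for line in plane] for plane in cube]


def rotate_axes(cube):
    """Transpose cube[a][b][c] into out[c][a][b], making axis b the last one."""
    na, nb, nc = len(cube), len(cube[0]), len(cube[0][0])
    return [[[cube[a][b][c] for b in range(nb)] for a in range(na)]
            for c in range(nc)]


def fft3d_by_transposes(cube):
    """3D transform of cube[x][y][z]; the result is indexed [ky][kz][kx]."""
    step = transform_last_axis(cube)      # 1D transforms along z
    step = rotate_axes(step)              # transpose 1: layout [z][x][y]
    step = transform_last_axis(step)      # 1D transforms along y
    step = rotate_axes(step)              # transpose 2: layout [y][z][x]
    return transform_last_axis(step)      # 1D transforms along x


if __name__ == "__main__":
    cube = [[[0.0] * 4 for _ in range(3)] for _ in range(2)]
    cube[1][0][0] = 1.0                   # unit impulse at x=1
    out = fft3d_by_transposes(cube)
    print(out[0][0])                      # values along kx for ky=kz=0
```

In a distributed code the two `rotate_axes` steps are where the global data movement happens, which is why the transposes dominate communication cost.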
PARATEC Scaling: ES vs. Power3
ES can run the same system about 10 times faster than the IBM SP (on any number of processors)
Main advantage of ES for these types of codes is the fast communication network
Fast processors require less fine-grain parallelism in code to get same performance as RISC machines
Vector architectures allow opportunity to simulate systems not possible on scalar platforms
[Figure: GFlops (log scale, 10 to 10000) vs. processors (32 to 1024); series: 309 QD - Ideal, 309 QD - Pwr3, 432 Si - Pwr3, 432 Si - ES, 686 Si - ES]
PARATEC: Performance
Data Size  P    Power3           Power4           Altix            ES               X1
                Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak
432 Atom   32   0.95      63%    2.0       39%    3.7       62%    4.7       60%    3.0       24%
432 Atom   64   0.85      57%    1.7       33%    3.2       54%    4.7       59%    2.6       20%
432 Atom   128  0.74      49%    1.5       29%    ---       ---    4.7       59%    1.9       15%
432 Atom   256  0.57      38%    1.1       21%    ---       ---    4.2       52%    ---       ---
432 Atom   512  0.41      28%    ---       ---    ---       ---    3.4       42%    ---       ---
686 Atom   128                                                     4.9       62%    3.0       24%
686 Atom   256                                                     4.6       57%    1.3       10%
ES achieves fastest performance to date! Over 2 Tflop/s on 1024 procs
• Main advantage for this type of code is fast interconnect system
X1 3.5X slower than ES (although peak is 50% higher)
• Non-vectorizable code can be much more expensive on X1 (32:1 vs 8:1)
• Lower bisection bandwidth to computation ratio
Limited scalability due to increasing cost of global transpose and reduced vector length
• Plan to run larger problem size next ES visit
Scalar architectures generally perform well due to high computational intensity
• Power3, Power4, Altix are 8X, 4X, 1.5X slower than ES
Vector architectures allow opportunity to simulate systems not possible on scalar platforms
Magnetic Fusion: GTC
Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
Goal of magnetic fusion is a burning plasma power plant producing cleaner energy
GTC solves 3D gyroaveraged gyrokinetic system w/ particle-in-cell approach (PIC)
PIC scales as N instead of N^2 - particles interact w/ electromagnetic field on grid
Allows solving equation of particle motion with ODEs (instead of nonlinear PDEs)
Main computational tasks:
• Scatter: deposit particle charge to nearest point
• Solve: Poisson eqn to get potential for each point
• Gather: calc force based on neighbors' potential
• Move: particles by solving eqn of motion
• Shift: particles moved outside local domain
[Figure: 3D visualization of electrostatic potential in magnetic fusion device]
Developed at Princeton Plasma Physics Laboratory, vectorized by Stephane Ethier
GTC: Scatter operation
Particle charge deposited amongst nearest grid points. Calculate force based on neighbors potential, then move particle accordingly
Several particles can contribute to same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
Solution: VLEN copies of charge deposition array with reduction after main loop
• However, greatly increases memory footprint (8X)
Since particles are randomly localized - scatter also hinders cache reuse
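The VLEN-copies workaround can be sketched as follows. This is a simplified illustration, not GTC's deposition routine: each particle deposits its whole charge on a single grid point (the real code spreads it over several neighbors), and the names `deposit_serial`, `deposit_with_copies`, and `vlen` are assumptions.

```python
def deposit_serial(cells, charges, ngrid):
    """Straightforward scatter: particles hitting the same grid point create
    a dependency between loop iterations, which blocks vectorization."""
    grid = [0.0] * ngrid
    for cell, charge in zip(cells, charges):
        grid[cell] += charge
    return grid


def deposit_with_copies(cells, charges, ngrid, vlen):
    """Give each vector lane a private copy of the grid so that the `vlen`
    particles processed together can never write to the same location."""
    copies = [[0.0] * ngrid for _ in range(vlen)]    # vlen times the memory
    for start in range(0, len(cells), vlen):         # strip-mined particle loop
        for lane in range(min(vlen, len(cells) - start)):
            p = start + lane
            copies[lane][cells[p]] += charges[p]     # lanes are independent
    # Reduction after the main loop: sum the private copies.
    return [sum(copy[g] for copy in copies) for g in range(ngrid)]


if __name__ == "__main__":
    cells = [0, 0, 1, 0, 2, 1]
    charges = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(deposit_serial(cells, charges, 3))
    print(deposit_with_copies(cells, charges, 3, 4))
```

The private copies trade memory for independence: the footprint of the deposition array grows by a factor of `vlen`, which is the cost noted on the slide.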
GTC: Performance
Number Particles  P     Power3           Power4           Altix            ES               X1
                        Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak
10/cell (20M)     32    0.13      9%     0.29      5%     0.29      5%     1.15      14%    1.00      8%
10/cell (20M)     64    0.13      9%     0.32      5%     0.26      4%     1.00      13%    0.80      6%
100/cell (200M)   32    0.13      9%     0.29      5%     0.33      6%     1.62      20%    1.50      12%
100/cell (200M)   64    0.13      9%     0.29      5%     0.31      5%     1.56      20%    1.36      11%
100/cell (200M)   1024  0.06      4%
ES achieves fastest performance of any tested architecture!
• First time code achieved 20% of peak - compared with less than 10% on superscalar systems
• Vector hybrid (OpenMP) parallelism not possible due to increased memory requirements
• P=64 on ES is 1.6X faster than P=1024 on Power3!
• Reduced scalability due to decreasing vector length, not MPI performance
Non-vectorizable code portions expensive on X1
• Before vectorization, shift routine accounted for 11% of ES and 54% of X1 overhead
Larger tests could not be performed at ES due to parallelization/vectorization hurdles
• Currently developing new version with increased particle decomposition
Advantage of ES for PIC codes may reside in higher statistical resolution simulations
• Greater speed allows more particles per cell
GTC: Performance
With increasing processors, and fixed problem size, the vector length decreases
Limited scaling due to decreased vector efficiency rather than communications overhead.
MPI communication by itself has near perfect scaling.
Cosmology: MADbench
Microwave Anisotropy Dataset Computational Analysis Package
Optimal general algorithm for extracting key cosmological data from Cosmic Microwave Background Radiation (CMB)
CMB encodes fundamental parameters of cosmology: Universe geometry, expansion rate, number of neutrino species
Preserves full complexity of underlying scientific problem
Calculates maximum likelihood two-point angular correlation function
Recasts problem in dense linear algebra: ScaLAPACK
• Steps include: mat-mat, mat-vec, chol decomp, redistribution
High I/O requirement - due to out-of-core nature of calculation
Developed at NERSC/CRD by Julian Borrill
CMB analysis moves from the time domain - observations - O(10^12) to the pixel domain - maps - O(10^8) to the multipole domain - power spectra - O(10^4), calculating the compressed data and their reduced error bars at each step.
CMB Data Analysis
MADbench: Performance
ES achieves fastest performance to date: 4.7 Tflop/s on 1024 processors!
• Original ES visit: only partially ported due to code's requirements of global file system
• New version of MADbench successfully reduced I/O overhead and removed global file system requirements
Computational ScaLAPACK kernel achieves high performance on all systems
ES shows highest % peak due to balanced I/O system
X1 performance starts high, but falls quickly due to I/O overheads
Columbia performance relatively poor
Just looking at overall performance is not sufficient to understand bottlenecks
Pixels  P     Power3           Columbia         ES               X1
              Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak  Gflops/P  %peak
5K      16    0.68      45%    1.2       19%    2.0       25%    4.5       36%
10K     64    0.74      49%    1.2       20%    2.9       37%    2.1       17%
20K     256   0.76      51%    1.1       19%    4.0       50%    0.6       5%
40K     1024  0.74      50%    ---       ---    4.6       58%    ---       ---
IPM Overview
Integrated Performance Monitoring
• portable, lightweight, scalable profiling
• fast hash method
• profiles MPI topology
• profiles code regions
• open source

MPI_Pcontrol(1,"W");
…code…
MPI_Pcontrol(-1,"W");

############################################
# IPMv0.7 :: csnode041 256 tasks ES/ESOS
# madbench.x (completed) 10/27/04/14:45:56
#
# <mpi> <user> <wall> (sec)
# 171.67 352.16 393.80
# …
################################################
# W
# <mpi> <user> <wall> (sec)
# 36.40 198.00 198.36
#
# call [time] %mpi %wall
# MPI_Reduce 2.395e+01 65.8 6.1
# MPI_Recv 9.625e+00 26.4 2.4
# MPI_Send 2.708e+00 7.4 0.7
# MPI_Testall 7.310e-02 0.2 0.0
# MPI_Isend 2.597e-02 0.1 0.0
###############################################
…
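The percentage columns in the sample report can be reproduced from the timings it prints. The sketch below is a reading of the numbers, not IPM documentation: it assumes %mpi is relative to the MPI time of region W (36.40 s) and %wall is relative to the total wall time (393.80 s), since those are the denominators that match the printed values. Variable and function names are hypothetical.

```python
REGION_MPI_SECONDS = 36.40     # <mpi> for region W in the sample report
TOTAL_WALL_SECONDS = 393.80    # <wall> for the whole run in the sample report

CALL_SECONDS = {
    "MPI_Reduce": 2.395e+01,
    "MPI_Recv": 9.625e+00,
    "MPI_Send": 2.708e+00,
    "MPI_Testall": 7.310e-02,
    "MPI_Isend": 2.597e-02,
}


def call_percentages(call_seconds, mpi_seconds, wall_seconds):
    """Return {call: (%mpi, %wall)} under the assumed denominators."""
    return {name: (100.0 * t / mpi_seconds, 100.0 * t / wall_seconds)
            for name, t in call_seconds.items()}


if __name__ == "__main__":
    table = call_percentages(CALL_SECONDS, REGION_MPI_SECONDS, TOTAL_WALL_SECONDS)
    for name, (pct_mpi, pct_wall) in table.items():
        print(f"{name:12s} {pct_mpi:5.1f} {pct_wall:5.1f}")
```

Read this way, MPI_Reduce dominates the MPI time inside region W while remaining a small share of the overall run.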
MADbench: Performance Characterization
In depth analysis shows performance contribution of each component for evaluated architectures
Identifies system specific balance and opportunities for optimization
Results show that I/O has more effect on ES than Seaborg - due to ratio between I/O performance and peak ALU speed
Demonstrated IPM capabilities to measure MPI overhead on variety of architectures without the need to recompile, at a trivial runtime overhead (<1%)
[Figure: % of theoretical peak (0% to 100%) at P=16, P=64, P=256 for Sbg, ES, Phx, Cmb and at P=1024 for Sbg, ES; stacked components: Calc, Calc+MPI, Calc+MPI+I/O, Calc+MPI+I/O+Rmp]
Overview
Tremendous potential of vector architectures: 5 codes running faster than ever before
Vector systems allow resolution not possible with scalar arch (regardless of # procs)
Opportunity to perform scientific runs at unprecedented scale
ES shows high raw and much higher sustained performance compared with X1
• Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
• Evaluation codes contain sufficient regularity in computation for high vector performance
• GTC example code at odds with data-parallelism
• Important to characterize full application including I/O effects
• Much more difficult to evaluate codes poorly suited for vectorization
Vectors potentially at odds w/ emerging techniques (irregular, multi-physics, multi-scale)
Plan to expand scope of application domains/methods, and examine latest HPC architectures
          % peak (P=64)                  Speedup ES vs. (P=Max avail)
Code      Pwr3  Pwr4  Altix  ES   X1     Pwr3  Pwr4  Altix  X1
LBMHD     7%    5%    11%    58%  37%    30.6  15.3  7.2    1.5
CACTUS    6%    11%   7%     34%  6%     45.0  5.1   6.4    4.0
GTC       9%    6%    5%     20%  11%    9.4   4.3   4.1    1.1
PARATEC   57%   33%   54%    58%  20%    8.2   3.9   1.4    3.9
MADbench  49%   ---   19%    37%  17%    6.3   ---   3.5    1.4
Average                                  19.9  7.2   4.5    2.4
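The Average row is an arithmetic mean of each speedup column, skipping the missing MADbench entry for Power4. A short sketch that recomputes it from the table (dictionary and function names are assumptions):

```python
# Speedup of ES over each platform, copied from the summary table.
# None marks the "---" entry (no MADbench result for Power4).
SPEEDUP_ES_VS = {
    "LBMHD":    {"Pwr3": 30.6, "Pwr4": 15.3, "Altix": 7.2, "X1": 1.5},
    "CACTUS":   {"Pwr3": 45.0, "Pwr4": 5.1,  "Altix": 6.4, "X1": 4.0},
    "GTC":      {"Pwr3": 9.4,  "Pwr4": 4.3,  "Altix": 4.1, "X1": 1.1},
    "PARATEC":  {"Pwr3": 8.2,  "Pwr4": 3.9,  "Altix": 1.4, "X1": 3.9},
    "MADbench": {"Pwr3": 6.3,  "Pwr4": None, "Altix": 3.5, "X1": 1.4},
}


def average_speedup(table, machine):
    """Arithmetic mean over the codes that have a result for `machine`."""
    values = [row[machine] for row in table.values() if row[machine] is not None]
    return sum(values) / len(values)


if __name__ == "__main__":
    for machine in ("Pwr3", "Pwr4", "Altix", "X1"):
        print(machine, average_speedup(SPEEDUP_ES_VS, machine))
```

An arithmetic mean lets a single large entry (such as CACTUS on Power3) pull the column average up, so the per-code rows remain the more informative comparison.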
EXTRA SLIDES
ES Processor Overview
• 8 Gflops per CPU
• 8 CPU per SMP
• 8 way replicated vector pipe
• 72 vec registers, 256 64-bit words
• Divide Unit
• 32 GB/s pipe to FPLRAM
• 4-way superscalar o-o-o @ 1 Gflop
• 64KB I$ & D$
• ES: 640 nodes
ES: newly developed FPLRAM (Full Pipelined RAM)
SX6: DDR-SDRAM 128/256Mb
ES: uses IN 12.3 GB/s bi-dir btw any 2 nodes, 640 nodes
SX6: uses IXS 8GB/s bi-dir btw any 2 nodes, max 128 nodes
Earth Simulator Overview
Machine type: 640 nodes, each node is 8-way SMP vector processors (5120 total procs)
Machine peak: 40 TF/s (proc peak 8 GF/s)
OS: Extended version of Super-UX: 64 bit Unix OS based on System V-R3
Connection structure: a single stage crossbar network (1400 miles of cable), 83,000 copper cables
• 7.9 TB/s aggregate switching capacity
• 12.3 GB/s bi-di between any two nodes
Global Barrier Counter within interconnect allows global barrier synch <3.5 usec
Storage: 480 TB disk, 1.5 PB tape
Compilers: Fortran 90, HPF, ANSI C, C++
Batch: similar to NQS, PBS
Parallelization: vectorization processor level, OpenMP, Pthreads, MPI, HPF
Cray X1 Overview
SSP: 3.2 GF computational core
• VL = 64, dual pipes (800 MHz)
• 2-way scalar 0.4 GF (400 MHz)
MSP: 12.8 GF, combines 4 SSPs
• shares 2MB data cache (unique)
Node: 4 MSP w/ flat shared mem
Interconnect: modified 2D torus
• fewer links than full crossbar but smaller bisection bandwidth
Globally addressable: procs can directly read/write to global mem
Parallelization:
• Vectorization (SSP)
• Multistreaming (MSP)
• Shared mem (OMP, Pthreads)
• Inter-node (MPI2, CAF, UPC)
[Figure: Node, SSP, MSP]
Altix3000 Overview
Itanium2 @ 1.5 GHz (peak 6 GF/s)
128 FP registers, 32K L1, 256K L2, 6MB L3
• Cannot store FP values in L1
EPIC bundles instructions
• Bundles processed in-order, instructions within bundle processed in parallel
Consists of "Cbricks": 4 Itanium2, memory, 2 controller ASICs called SHUB
Uses high bandwidth, low latency Numalink3 interconnect (fat-tree)
Implements CCNUMA protocol in hardware
• A cache miss causes data to be communicated/replicated via SHUB
Uses 64-bit Linux with single system image (256 processors / few for OS services)
Scalability to large numbers of processors?
LBMHD: Performance
Preliminary time breakdown shown relative to each architecture
Cray X1 has highest % spent in communication (P=64), CAF version reduced this
ES shows best memory bandwidth performance (stream)
Communication increases at higher scalability (as expected)
[Figure: % time (0 to 80) in collision, stream, comm for P3, P4, ES, X1; 8192 x 8192 grid, 64 processors]
[Figure: % time (0 to 80) in collision, stream, comm for P3, P4, ES; 8192 x 8192 grid, 256 processors]
GTC: Porting Details
Large vector memory footprint required to eliminate dependencies
• P=64 uses 42 GB on ES compared w/ 5 GB on Power3
Relatively small mem per processor (ES=2GB, X1=4GB) limits problem size of runs
GTC has second level of parallelism via OpenMP (hybrid programming). However, on ES/X1 memory footprint increased: additional 8X, about 320GB
Non-vectorized "Shift" routine accounted for: 54% on X1, 11% on ES
• Due to high penalty of serialized sections on X1 when multistreaming
The shift routine vectorized on X1, but NOT on ES - X1 has advantage
• Limited time at ES prevented vectorization of shift routine
• Now shift accounts for only 4% of X1 runtime
Second ES visit
Evaluate high-concurrency PARATEC performance using large-scale Quantum Dot simulation
Evaluate CACTUS performance using updated vectorization of radiation boundary condition
Evaluate MADCAP performance using a newly optimized version, without global file systems requirements and improved I/O behavior
Examine 3D version of LBMHD, and explore optimization strategies
Evaluate GTC performance using updated vectorization of shift routine as well as new particle decomposition approach designed to increase concurrency
Evaluate performance of FVCAM3 (Finite Volume atmospheric model), at high concurrencies and resolution (1 x 1.25, 0.5 x 0.625, 0.25 x 0.375)
Papers available at http://crd.lbl.gov/~oliker
CMB Science
The CMB is a unique probe of the very early Universe.
Tiny fluctuations in its temperature & polarization encode
- the fundamental parameters of cosmology
• Universe geometry, expansion rate, number of neutrino species, ionization history, dark matter, cosmological constant
- ultra-high energy physics beyond the Standard Model