Can Scalable Development Lead to Scalable Execution?
Damian Rouson, Scalable Computing R&D
Sandia National Laboratories
Sponsors: Office of Naval Research, City University of New York
Outline
• Motivation, Objectives & Guideposts
• Conventional development
• Scalable development
• Applications
• Toward scalable execution
• Conclusions & Acknowledgments
Motivation
Code writing, efficiency & translation.
(Chart: limits of HPC software tools, by category: personnel, hardware, MPI issues, performance, I/O issues, debugging, other.)
Objectives
1. To develop a design methodology that scales up to large numbers of programming units, e.g. procedures, classes, data structures, etc.
2. To demonstrate that this methodology can produce new science.
3. To demonstrate that this approach also scales up to large numbers of execution units, e.g. threads, processes, cores, etc.
Guideposts
“What are the metrics?” -- Oyekunle Olukotun, Stanford EE/CS, c. 1996
“Procedural programming is like an N-body problem.” -- Lester Dye, Stanford Petroleum Eng., c. 1995
“Separate the data from the physics.” -- Jaideep Ray, Sandia, c. 2004
“First they ignore you. Then they laugh at you. Then they fight you. Then you win.” -- Mahatma Gandhi, c. ????
Outline
• Motivation, Objectives & Guideposts
• Conventional development
 – Amdahl’s Law
 – Pareto Principle
 – Complexity
• Scalable development
• Applications
• Toward scalable execution
• Conclusions & Acknowledgments
Conventional Development
(Diagram: total solution time spans mathematical modeling, code writing, debugging, and the production run, with a barrier between development and the production run.)
Amdahl’s Law
Representative case study for a published run*: code writing, debugging, and run time each occupy 1/3 of the total solution time.

Run-time speedup: S_run = t_initial / t_final (original run time / optimized run time)

Total speedup: S_tot = 1 / (2/3 + (1/3)/S_run), so lim_{S_run→∞} S_tot = 1.5

*Rouson et al. (2007) Proc. 2006 Summer Program, Center for Turbulence Research, Stanford University.

The speedup achievable by focusing solely on decreasing run time is very limited.
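The case study's arithmetic can be checked with a short script (a sketch; the function name is illustrative, and the 1/3 run-time fraction comes from the case study above):

```python
def total_speedup(s_run, run_fraction=1.0 / 3.0):
    """Amdahl's law applied to total solution time: only the
    run-time fraction of the work benefits from the speedup s_run."""
    return 1.0 / ((1.0 - run_fraction) + run_fraction / s_run)

# With code writing, debugging, and run time each taking 1/3 of the
# total, even an unbounded run-time speedup caps total speedup at 1.5x.
print(total_speedup(2.0))   # doubling run speed gains only 1.2x overall
print(total_speedup(1e12))  # approaches the 1.5 limit
```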
Pareto Principle
When participants (lines) share resources (run time), there always exists a number k ∈ [50, 100) such that (1−k)% of the participants occupy k% of the resources.

Limiting cases:
• k = 50%: equal distribution
• k → 100%: monopoly

Rule of thumb: 20% of the lines occupy 80% of the run time.

Scalability requirements determine the percentage of the code that can be focused strictly on programmability:

S_max = lim_{S_k%→∞} 1 / (0.2 + 0.8/S_k%) = 5
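The same limit can be written as a function of the Pareto split k, showing how k sets the ceiling when only the hot fraction of the code is optimized (a sketch; the function name is illustrative):

```python
def max_speedup(k):
    """Speedup bound when the k-fraction of run time (concentrated in
    (1-k) of the lines) is optimized without limit: 1/(1-k+k/S) -> 1/(1-k)."""
    return 1.0 / (1.0 - k)

# Rule of thumb: 80% of the run time lives in 20% of the lines, so
# optimizing only that hot spot bounds the overall speedup at 5x,
# leaving the remaining 80% of lines free to focus on programmability.
print(max_speedup(0.8))
```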
Runtime Profile
• 5% of procedures occupy nearly 80% of run time.
• Structure the other 95% of procedures to reduce development time.

Procedure               Inclusive Run-Time Share (%)
main                    100.0
operator(.x.)            79.5
RK3_Integrate            47.8
Nonlinear_Fluid          44.0
Statistics_              43.8
transform_to_fourier     38.7
transform_to_physical    23.6
Total Solution Time Speedup
(Plot: total solution time speedup, 1 to 1.75, vs. number of processors/threads, 1 to 8; curves for the SGI Math Library and the Intel Math Kernel Library (MKL) rise toward the theoretical limit.)
Conventional Development
Model Problem: Unsteady 1D Diffusion

∂φ/∂t = D ∂²φ/∂x²

Semi-discrete equations:

dφ_i/dt = D (φ_{i+1} − 2φ_i + φ_{i−1}) / Δx²

Fully discrete equations:

φ_i^{n+1} = φ_i^n + (D Δt / Δx²)(φ^n_{i+1} − 2φ^n_i + φ^n_{i−1})

(Grid: x_1 … x_100 with spacing Δx over length L_x, carrying values φ_1 … φ_100.)

Solution algorithm:

φ⃗ ← φ⃗ + (D Δt / Δx²) [A] φ⃗
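The fully discrete update can be sketched in a few lines (Python rather than the deck's Fortran; the grid size and coefficients follow the slide, and the zero-flux-free boundary handling is an assumption the slide does not state):

```python
# Explicit update for unsteady 1D diffusion:
#   phi_i^{n+1} = phi_i^n + (D*dt/dx**2)*(phi_{i+1} - 2*phi_i + phi_{i-1})
# Note: these slide values give D*dt/dx**2 = 1000, far above the
# explicit scheme's stability limit of 1/2 -- fitting for a slide
# whose point is how such bugs get found.
D, dt, dx, n = 1.0, 0.1, 0.01, 100

def laplacian(phi):
    """Second difference; boundary entries are left at zero (an assumption)."""
    lap = [0.0] * len(phi)
    for i in range(1, len(phi) - 1):
        lap[i] = phi[i + 1] - 2.0 * phi[i] + phi[i - 1]
    return lap

def step(phi):
    """One time step of the fully discrete equation."""
    lap = laplacian(phi)
    return [p + (D * dt / dx**2) * l for p, l in zip(phi, lap)]

phi = [1.0] * n
phi = step(phi)  # a uniform field has zero Laplacian, so phi is unchanged
```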
Conventional Program Debugging
“Not much time is spent fixing bugs. Most of the time is spent finding bugs.”
-- Shalloway & Trott, Design Patterns Explained
-- Oliveira & Stewart, Writing Scientific Software

PROGRAM main
  REAL :: phi(100), D=1., dt=0.1, dx=0.01
  phi = phi + (D*dt/dx**2)*laplacian(phi)

FUNCTION laplacian(phi)
  REAL :: phi(:), A(SIZE(phi),3), laplacian(SIZE(phi))
  laplacian(:) = A(:,1)*phi(:) + A(:,2)*phi(:) + A(:,3)*phi(:)

(Diagram: both program units read and write the shared data set phi(1), phi(2), …, phi(100).)
Bug Search Complexity
Consider a list of the unique program lines, with all lines that execute before the symptom preceding the symptomatic line. With l such lines, the symptomatic line sits near position l/2 on average, and locating the bug requires reviewing the lines that precede it:

(l/2)(l/2 − 1)/2 line reviews per bug

Example from the diffusion program: φ(2) < 0 (symptom); D < 0 (bug).
Code Fault Rates
Fenton & Ohlsson, “Quantitative analysis of faults and failures in a complex software system,” IEEE Trans. Soft. Eng. 2000:

(Plot: fault rate vs. module size, roughly 500 to >3500 lines, for releases n and n+1; fault rates span roughly 0 to 14, with r ≈ 6/1000 faults per line.)
Scientific Code Fault Rates
Hatton, L. “The ‘T’ Experiments – Errors in Scientific Software,” Comp. Sci. Eng. 1997:
• 8 statically detectable faults/1000 lines of commercially released C code
• 12 statically detectable faults/1000 lines of commercially released Fortran 77 code
• more recent data finds 2-3 times as many faults in C++

0.006 ≤ r ≤ 0.036 faults per line, so the total bug-hunting cost grows as

#search(t) = #bugs(t) × (l(t)/2)(l(t)/2 − 1)/2 line reviews,  #bugs(t) = r · l(t)
Outline
• Motivation, Objectives & Guideposts
• Conventional development
• Scalable development
 – Complexity
 – Information theory
• Applications
• Toward scalable execution
• Conclusions & Acknowledgments
Object-Oriented Programming
(Diagram: modules, each encapsulating its own private data.)

Encapsulation confines each bug search to a single module of l_m << l lines:

#search(t) = #bugs(t) × (l_m/2)(l_m/2 − 1)/2 line reviews
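Under the slide's review counts, the monolithic and modular search costs can be compared directly (a sketch; the 100,000-line code and 100-line modules are hypothetical sizes chosen for illustration):

```python
def reviews_per_bug(span):
    """Expected line reviews to localize one bug when the symptomatic
    line sits near the middle of a span of `span` candidate lines:
    (span/2)(span/2 - 1)/2."""
    return (span / 2.0) * (span / 2.0 - 1.0) / 2.0

total_lines = 100_000  # hypothetical code size, l
module_size = 100      # hypothetical module size, l_m << l

monolithic = reviews_per_bug(total_lines)  # shared data: search all lines
modular = reviews_per_bug(module_size)     # private data: search one module
print(monolithic / modular)  # encapsulation saves orders of magnitude
```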
Scientific OOP
(Diagram: modules, each hiding private data behind p public procedures of λ lines each.)

With λ lines per procedure and p procedures per module, l_m = λp and

#search(t) = #bugs(t) × (λp/2)(λp/2 − 1)/2 line reviews
Scientific OOP
Each mathematical operation maps onto a class interface:

∂φ/∂t = D ∇²φ  →  class(Field) :: phi ; REAL :: D ; d_dt(phi) = D*laplacian(phi)
φ(t+Δt) = φ(t) + Δt ∂φ/∂t  →  class(Scalar) :: Smoke ; Smoke = Smoke + dt*d_dt(Smoke)
∇² = ∂²/∂x²  →  REAL, DIMENSION(n) :: p ; laplacian(p) = d2_dx2(p)
∂/∂x ≈ δ/δx  →  class(Grid) :: x ; d_dx(p) = delta(p)/delta(x)
δx = x_{i+1} − x_i  →  REAL :: dx ; delta(x) = dx

(Class diagram: GoverningPDE, TimeIntegrationAlgorithm, and SpatialDiscretization layers built from SpatiallyDifferentiable Fields, Field, Grid, Integrator, and Scalar; legend: public, private.)

Decomposing the problem into a set of classes that admit an abstract data type calculus yields

λ ≈ const.,  p ≈ const.
Information Theory
• Interface information content sets the minimum amount of communication between developers.
• Let p_i = frequency of occurrence of the ith keyword in a set of statements. Shannon entropy is

  S = −Σ_i p_i log p_i ≥ 0

• Repeated implementation of the same procedural interfaces generates high p_i values and low S.
• Kirk & Jensen (2004) related Shannon entropy of codes to thermodynamic entropy, enabling the study of phase transitions in code structure.
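The entropy measure is easy to compute over keyword counts (a sketch; the two keyword distributions are invented for illustration):

```python
from math import log2

def shannon_entropy(keyword_counts):
    """S = -sum_i p_i log2(p_i) over keyword frequencies p_i, in bits."""
    total = sum(keyword_counts.values())
    return -sum((c / total) * log2(c / total)
                for c in keyword_counts.values() if c)

# Repeating one procedural interface concentrates probability mass in a
# few keywords, giving high p_i and low S (little to communicate);
# varied interfaces spread the mass and raise S.
repetitive = {"call": 9, "diffuse": 1}
varied = {"call": 2, "diffuse": 2, "advect": 2, "integrate": 2, "project": 2}
print(shannon_entropy(repetitive))  # low entropy
print(shannon_entropy(varied))      # high entropy (log2(5) bits)
```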
Kolmogorov Complexity
• For a program p, the Kolmogorov complexity K(p) is the length of the shortest description of p in some description language.
• Properties:
 – Provably not computable.
 – Bounded from above by the length of any actual description of p.
 – Lowest upper bound at any given time: compressed program length + decompression program length.
• Using this measure, we have detected slightly greater complexity in C++ than Fortran 2003.
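The "lowest upper bound" property suggests a practical estimator: any compressor yields an upper bound on K(p). A sketch using zlib (the source snippet is invented; the fixed decompressor length is ignored here):

```python
import zlib

def complexity_upper_bound(source: bytes) -> int:
    """K(p) is uncomputable, but the length of any compressed form of p
    (plus a constant for the decompressor) bounds it from above."""
    return len(zlib.compress(source, 9))

# Highly repetitive code compresses well, so its complexity bound is
# far below its raw length.
repetitive = b"call diffuse(phi)\n" * 50
print(complexity_upper_bound(repetitive), "<", len(repetitive))
```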
Applications
(Vertically adjacent layers communicate through interfaces.)

Currently running:
• Solid particle dispersion in electrically conducting fluids*: Time Integrator, Magnetofluid, Fluid, Cloud, Mixture, Field, Grid
• Quantum vortex interactions with classical fluids**: Time Integrator, ClassicalFluid, QuantumFluid, Mixture, Field, Grid

Under development:
• Aerosol dispersion in the atmospheric boundary layer***: Time Integrator, Atmosphere, Fluid, Scalar, Cloud, Field, Grid

*Rouson et al. (2008) Physics of Fluids, February.
**Morris, Koplik & Rouson (2008) Physical Review Letters, in review.
***Rouson & Handler (2007) in Environmental Sciences & Environmental Computing, Vol. III.
Large Eddy Simulation of the ABL
(Schematic: domain with x, y, z axes, 2 km deep, under Earth's rotation Ω_E and an imposed pressure gradient dP/dx.)

Physical Processes
• Shear
• Buoyancy
• Coriolis effects
• Geostrophic wind forcing
• Thermal fluctuations
• Passive scalar

Code Details
• Fully spectral LES: Fourier in horizontal, Chebyshev in vertical.
• Uniform grid horizontally, cosine-stretched grid vertically.
• Compressibility is neglected (different from COAMPS).
Governing Equations

Mass:      ∇·u = 0

Momentum:  ∂u/∂t = −u·∇u − ∇π − 2Ω×u + g θ′ e₃ − (dP/dx) e₁ + ∇·τ_sgs
           (advection, pressure, Coriolis, buoyancy, geostrophic pressure gradient, subgrid physics)

Heat:      ∂θ′/∂t + u·∇θ′ = ∇·( (ν_T/Pr_T) ∇θ′ )

Smagorinsky Sub-Grid Scale Turbulence Model:

τ_sgs = 2 ν_T S̄,   ν_T = C_S l² √(2 S̄:S̄),   l ~ Δ (grid scale)

Exner Function: π = (p/p₀)^{(γ−1)/γ}    Virtual Temperature: θ = T/π
Simulation Parameters
• Very simple physics: pressure gradient in the x-direction, constant turbulent viscosity, rotating Earth.

Lx = 12.57 km, Ly = 2.0 km, Lz = 4.71 km
G = 2.88 m/s (geostrophic wind)
Ω_E = 7.2722 x 10⁻⁵ rad/s
ν_T = 0.72 m²/s (agrees reasonably well with Sullivan et al., BL Met. 1994)

These parameters give:
Δ_Ekman = (ν_T/Ω_E)^{1/2} = 0.1 km
Re = (G·Ly)/ν_T = 8000
U*/G = 0.067, in good agreement with Coleman et al. (JFM, 1990)
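The derived quantities follow directly from the stated parameters (a sketch in SI units; values are taken from the slide):

```python
# Derived quantities from the stated ABL simulation parameters.
G = 2.88             # m/s, geostrophic wind
Ly = 2000.0          # m
nu_T = 0.72          # m^2/s, constant turbulent viscosity
Omega_E = 7.2722e-5  # rad/s, Earth's rotation rate

delta_ekman = (nu_T / Omega_E) ** 0.5  # Ekman depth, ~0.1 km
Re = G * Ly / nu_T                     # Reynolds number, 8000
print(delta_ekman, Re)
```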
Mean Wind Profiles

Wind Velocity (W) in an x-y plane: thin Ekman layer with turbulent “eruptions”.

W in an x-z plane at 43 meters: note highly elongated low-speed regions and “gusts”.

Vertical vorticity at 640 meters in an x-z plane: note “coherent 2D vortices” --- Air-Spikes!?
Outline
• Motivation, Objectives & Guideposts
• Conventional development
• Scalable development
• Applications
• Toward scalable execution
 – A strategy
 – Turbulence at the petascale
• Conclusions & Acknowledgments
Toward Scalable Execution
class(Scalar) :: Smoke
Smoke = Smoke + dt*d_dt(Smoke)
Strategy:
• Decompose problem into elementary operations.
• Instantiate distributed objects, e.g. via Trilinos.
• Parallelize operators across distributed objects.
Potential pitfalls:
• Cache utilization.
• Combined instructions.
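The strategy can be sketched in miniature (pure Python standing in for Trilinos-backed distributed objects; the per-process "chunks" here only simulate process-local storage, and all names are illustrative):

```python
class DistributedScalar:
    """Field values split into per-process chunks; operators apply
    chunk-locally, so no process ever touches the whole array."""
    def __init__(self, chunks):
        self.chunks = [list(c) for c in chunks]
    def axpy(self, a, other):
        """self = self + a*other, the shape of Smoke + dt*d_dt(Smoke);
        purely elementwise, so it parallelizes across chunks trivially."""
        for mine, theirs in zip(self.chunks, other.chunks):
            for i in range(len(mine)):
                mine[i] += a * theirs[i]

smoke = DistributedScalar([[1.0, 1.0], [1.0, 1.0]])  # two "processes"
rate = DistributedScalar([[0.5, 0.5], [0.5, 0.5]])   # stands in for d_dt(Smoke)
smoke.axpy(0.1, rate)  # Smoke = Smoke + dt*d_dt(Smoke), dt = 0.1
```

Operators with stencils (e.g. laplacian) would additionally need halo exchange between neighboring chunks, which is where the cache-utilization pitfall above bites.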
Turbulence at the Petascale
• R. D. Moser* estimates 1500 Petaflop-hours required for DNS at Re_τ = 5000, which will achieve asymptotic behavior in the log layer.
• The bottom plane of many ABL simulations lies in the log layer & employs a boundary condition valid at asymptotically high Reynolds number:

*NSF Workshop on Cyber-Fluid Dynamics, Arlington, VA (2006).
u⁺ = (1/κ) ln y⁺ + B + O(y⁺/Re_τ)

lim_{Re_τ→∞} u⁺ = (1/κ) ln y⁺ + B
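The asymptotic log-law boundary condition is a one-liner (a sketch; κ = 0.41 and B = 5.0 are typical literature values, not given on the slide, which leaves them symbolic):

```python
from math import log

# Asymptotic log law of the wall: u+ = (1/kappa) * ln(y+) + B.
kappa, B = 0.41, 5.0  # assumed typical values

def u_plus(y_plus):
    """Mean streamwise velocity in wall units at height y+ in the log layer."""
    return log(y_plus) / kappa + B

print(u_plus(100.0))  # roughly 16 wall units at y+ = 100
```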
Conclusions
• Applying Amdahl’s law to the total solution time suggests that optimizing run time alone severely limits speedup.
• The Pareto Principle determines the percentage of code that can be focused on programmability rather than efficiency.
• The global data sharing in conventional development leads to quadratic bug search times.
• Enabling an abstract data type calculus
 – Renders bug search times roughly scale-invariant and
 – Limits interface content (developer communications).
• We have demonstrated scalable development on several applications and proposed a path toward scalable execution.
Acknowledgements
• NRL
 – Dr. Robert Handler
 – Dr. Robert Rosenberg
• CUNY
 – Prof. Joel Koplik
 – Ms. Karla Morris & Dr. Xiaofeng Xu
• University of Cyprus
 – Prof. Stavros Kassinos
 – Dr. Irene Moulitsas, Dr. Evaggelos Akylas, & Dr. Hari Radhakrishnan
• Université Libre de Bruxelles
 – Prof. Bernard Knaepen
 – Dr. Ioannis Sarris