High Performance Computing | Systems and Technology Group
Scalability issues: HPC Applications & Performance Tools
Chiranjib Sur, HPC @ India Systems and Technology Lab
chiranjib.sur@in.ibm.com
Top 500: Some statistics
Top 500 – Domains

[Charts: Top500 systems and Top500 performance, broken down by application domain. Source: www.top500.org]
Laboratory astrophysics – a computational snapshot

Laboratory astrophysics: multi-phased, multi-level, massive computation. A computational challenge!
Massive parallelism required.
Scalability challenges – different aspects

[Diagram: the aspects of scalable HPC, with performance analysis tools at the centre]
- Parallel algorithm
- Parallel language
- Hardware architecture, threading, I/O
- Interconnects
- OS & parallel environment
- Compilers, optimization & debuggers
- Scalable parallel file system
- Performance analysis tools – a single place to go!
Scalable High Performance Computing
- High throughput
- Sustained performance
High PERFORMANCE or High THROUGHPUT

Amdahl's law
- If the serial component remains proportionately equal, there is no inherent speedup!
- Example: with a 30% serial / 70% parallel split, accelerating the parallel component 50x gives a maximum speedup of 3.25x.
- http://en.wikipedia.org/wiki/Amdahl's_law

Gustafson's law
- If the serial component shrinks in size as the problem scales, there is opportunity for speedup!
- Example: with a 5% serial / 95% parallel split, accelerating the parallel component 50x gives a maximum speedup of 18.26x.
- http://en.wikipedia.org/wiki/Gustafson's_Law
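For reference, compact forms of the two laws in the usual notation (the notation is an assumption here, not taken from the slides): s is the serial fraction, k the factor by which the parallel part is accelerated, and N the scaled number of processors.

S_{\text{Amdahl}}(k) = \frac{1}{s + (1 - s)/k}, \qquad S_{\text{Gustafson}}(N) = s + (1 - s)\,N

Amdahl's denominator is bounded below by s, so speedup saturates at 1/s; Gustafson's form keeps growing with N because the serial share of the (growing) total work shrinks.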
Parametrization of Scalability

T_p = T_s / p + T_{Oh}(p)

where T_p = parallel execution time, T_s = serial execution time, T_{Oh}(p) = overhead on p processors.
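Speedup and efficiency follow directly from this parametrization (standard definitions, assumed here):

S(p) = \frac{T_s}{T_p} = \frac{p}{1 + p\,T_{Oh}(p)/T_s}, \qquad E(p) = \frac{S(p)}{p}

so scalability is governed by how quickly the overhead T_{Oh}(p) grows relative to the shrinking compute time T_s/p.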
Scalability – algorithm / programming languages

Parallel algorithm
- Most legacy codes are not designed to run in parallel
- Most are not designed to exploit modern-day HPC architectures

Parallel languages
- Legacy codes contain language-version-specific syntax (e.g. no dynamic memory in FORTRAN 77)
- Old codes need major revision to use modern features, e.g. handling of large arrays (see the sketch below)
- It is not so easy to re-write old codes in new languages like X10, UPC etc.
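A minimal C sketch of the large-array point, with hypothetical names and sizes: legacy codes typically bake maximum problem sizes in at compile time (as FORTRAN 77 forces one to), whereas a revised code sizes its arrays at run time.

#include <stdio.h>
#include <stdlib.h>

#define NMAX 1024                  /* legacy style: limit fixed at compile time */
static double legacy_grid[NMAX];   /* either wastes memory or is too small */

int main(void) {
    int n = 1 << 20;               /* real problem size, known only at run time */
    double *grid = malloc((size_t)n * sizeof *grid);  /* modern: sized on demand */
    if (grid == NULL) { perror("malloc"); return 1; }
    for (int i = 0; i < n; ++i)
        grid[i] = 0.0;
    legacy_grid[0] = grid[0];      /* the legacy array tops out at NMAX cells */
    printf("allocated %d cells; legacy limit was %d\n", n, NMAX);
    free(grid);
    return 0;
}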
Legacy code – Algorithm – a case study

[Three slides of case-study material, shown as figures in the original deck]
Scalability – computing platform

Hardware – scaling OUT or scaling UP?
Courtesy: Thomas Dunning, http://www.ncsa.illinois.edu/BlueWaters
Scalability – computing platform

Hardware – what to look for? How to look for it?
Hardware Thread Management
- Use of multiple lightweight concurrent threads (see the OpenMP sketch below)
- Less switching overhead
- Addresses the issue of instruction and memory latency

Threading – Random Access to Global Memory
- Any thread can read/write any location(s)
- Synchronization with the system software
- Monolithic threads vs. blocks (smaller in size) of threads

On-Chip Shared Memory
- Efficient management of data in cache
- Efficient thread communication / cooperation within blocks
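A minimal OpenMP sketch in C of the lightweight-thread idea (the array size and chunk size are arbitrary choices here): many threads share one loop, and a dynamic schedule in small chunks lets runnable threads proceed while others stall on memory.

#include <omp.h>
#include <stdio.h>

#define N (1 << 24)
static double a[N];

int main(void) {
    /* Many lightweight threads cooperate on one loop; small dynamic
       chunks help hide per-thread memory latency behind other threads. */
    #pragma omp parallel for schedule(dynamic, 4096)
    for (int i = 0; i < N; ++i)
        a[i] = 2.0 * (double)i;

    printf("max threads: %d, a[N-1] = %.1f\n",
           omp_get_max_threads(), a[N - 1]);
    return 0;
}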
Scalability – system software

[Diagram: the IBM Parallel Environment software stack, spanning user space and kernel space]
- APPLICATION: MPI; C, C++; Fortran (77, 95); OpenMP; ESSL / Parallel ESSL; MASS; UPC; CAF; SHMEM; GSM
- Tools: Eclipse PTP framework, POE runtime, parallel debugger, HPCS Toolkit, Eclipse tools
- LAPI – reliable FIFO, RDMA, striping, failover/recovery, checkpoint/restart, pre-emption, user-space statistics, multi-protocol, scalability
- PNSD / NRT debug/comm infrastructure; GSM infrastructure
- Sockets: TCP, UDP, IP; multi-link, super-packet support
- HAL – AIX & Linux; AIX & Linux Verbs
- GPFS – NSD, VDISK (scalable parallel file system)
- LL / Resource Manager – pre-emption, C/R, xCAT
- Operating systems: AIX / Linux
- Network(s), network adapter(s) – HFI, IB
- Hardware platforms: pSeries / xSeries
Scalability – System Software stack

OS and Parallel Environment

Compilers (www.ibm.com/software/awdtools/fortran/xlfortran/library)
- Five distinct optimization levels plus many additional options
- Code generation and tuning for specific hardware chipsets
- Interprocedural optimization and inlining using IPA
- Profile-directed feedback (PDF) optimization
- User-directed optimization with directives and source-level intrinsic functions (see the sketch below)
- Optimization of OpenMP programs and auto-parallelization capabilities to exploit SMP systems
- Automatic parallelization of calculations using vector machine instructions and high-performance mathematical libraries
- ... and more
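A small C sketch (hypothetical kernel, not from the deck) of the user-directed-optimization point: restrict-qualified pointers tell the compiler the arrays do not alias, which enables vector instructions, and the OpenMP directive exposes the loop for SMP parallelization.

#include <stdio.h>
#include <stdlib.h>

/* y <- a*x + y: restrict promises no aliasing, so the compiler may vectorize. */
void daxpy(size_t n, double a, const double *restrict x, double *restrict y) {
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 1000000 };
    double *x = malloc(N * sizeof *x), *y = malloc(N * sizeof *y);
    if (x == NULL || y == NULL) return 1;
    for (size_t i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }
    daxpy(N, 3.0, x, y);
    printf("y[0] = %.1f\n", y[0]);   /* expect 5.0 */
    free(x); free(y);
    return 0;
}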
[Chart: performance (Mflops/sec) vs. compiler optimization level O1–O4]
Scalability – System Software stack

Parallel Environment – what next?
- Memory – using Remote Direct Memory Access (RDMA)
- Interconnects – RDMA with a proper interconnect (see the MPI sketch below)
- Parallel tuned library – customized
  http://publib.boulder.ibm.com/infocenter/clresctr/vxrx/index.jsp?topic=%2Fcom.ibm.cluster.pe432.opuse1.doc%2Fam102_scalaperf.html
- Data-intensive / task-intensive computing – combining massive data parallelism and instruction-level parallelism – a heterogeneous model?
- Next generation – MPI 3 ..?
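A minimal MPI one-sided sketch in C of what RDMA-style communication looks like at the programming level (ranks and values here are arbitrary): each rank writes its value directly into rank 0's exposed window with MPI_Put, which an RDMA-capable interconnect can satisfy without involving the target CPU. Whether a given run actually uses RDMA depends on the interconnect and MPI library, not on this code.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Every rank exposes a small table; rank 0's copy collects the data. */
    double *table = calloc((size_t)nprocs, sizeof *table);
    MPI_Win win;
    MPI_Win_create(table, (MPI_Aint)(nprocs * sizeof *table),
                   (int)sizeof *table, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    double val = 100.0 + rank;
    /* One-sided write into slot 'rank' of rank 0's window. */
    MPI_Put(&val, 1, MPI_DOUBLE, 0, (MPI_Aint)rank, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);

    if (rank == 0)
        for (int i = 0; i < nprocs; ++i)
            printf("slot %d = %.1f\n", i, table[i]);

    MPI_Win_free(&win);
    free(table);
    MPI_Finalize();
    return 0;
}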
The Computing cycle
The Performance Pie

Performance Dimensions:
- CPU performance
- MPI performance
- Threading performance
- I/O performance
Scalability – Performance Tools
- What is this tool all about? – more in the next few sessions
- What can we do with a tool like this?
- Which programming languages? – FORTRAN, C, C++ ...
- Which platforms can we use? – the entire range of the IBM HPC hardware portfolio
- Which operating systems? – AIX & Linux
- What do we mean by Scalable Tools?
Performance analysis in a nutshell – IBM HPC Toolkit
[Diagram: IBM HPC Toolkit components]
- HPM – hardware performance monitoring
- MPI – profiling MPI calls
- OpenMP – profiling OpenMP directives
- MIO – I/O analysis and optimization
- Visualization – Eclipse plug-in, PeekPerf, Xprof
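A minimal sketch of the mechanism MPI profilers generally rely on, not the toolkit's actual implementation: the standard PMPI name-shift interface lets a wrapper library intercept MPI_Send, record a statistic, and forward to the real entry point.

#include <mpi.h>
#include <stdio.h>

static long send_count = 0;            /* statistic gathered by the wrapper */

/* Our definition shadows the library's; the real entry point is PMPI_Send.
   (MPI-3 prototype shown; older MPIs declare buf as plain void *.) */
int MPI_Send(const void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Send(buf, count, type, dest, tag, comm);
    double t1 = MPI_Wtime();
    fprintf(stderr, "MPI_Send #%ld: %d item(s) to rank %d in %.6f s\n",
            ++send_count, count, dest, t1 - t0);
    return rc;
}

Linked ahead of the MPI library, every MPI_Send in the application then passes through this wrapper with no source changes.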
Scalability – Performance tools

[Chart: NPB 3.3 – Fourier Transform – Class A: execution time vs. number of procs (2–32), non-instrumented vs. instrumented]
Scalability – case studies: Timing and overhead

[Chart: Timing – ft.A: execution time, initialization time and overhead vs. number of procs (2–512)]
Scalability – case studies: MPI communication

[Chart: MPI All-to-All communication – ft.A: data transferred (bytes) vs. number of procs (2–128)]
Scalability – case studies: Hardware & I/O

[Chart: Average communication time (MPI) – ft.A: time (s) vs. number of procs (2–32)]
[Chart: Number of page faults without I/O – ft.A: page faults vs. number of procs (2–64)]
[Chart: Context switches – ft.A: context switches vs. number of procs (2–64)]
Summary: Performance analysis and next ...
- What can we do now?
- What do we need?
- What are we planning to do?
Next few talks ...
- Today
- Tomorrow
The team working on performance tools @ IBM
Pidad, Aditya, Praful, Servesh, Dave, John, Chiranjib