Performance Analysis and Tuning in Windows HPC Server 2008
Xavier Pillons, Program Manager
Microsoft Corp. [email protected]
Introduction
• How to monitor performance on Windows?
• What to look for?
• How to tune the system?
• How to trace MS-MPI?
Built-in Network Diagnostics
• MPI Ping-Pong (mpipingpong.exe) – Launchable via HPC Admin Console Diagnostics
• Pros: easy; data is auto-stored for historical comparison
• Cons: no choice of network, no intermediate results
– Launchable via command line
• Command-line features
– Tournament mode, ring mode, serial mode
– Output progress to XML, stderr, stdout
– Histogram, per-node, and per-cluster data
– Test throughput, latency, or both
• Remember: usually you want only 1 rank per node (see the example below)
• Additional diagnostics and extensibility in v3
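A hedged example of a command-line run with one rank per node, assuming an 8-node allocation (the node count is a placeholder, and the mpipingpong throughput/latency switches are omitted; use the tool's built-in help for the full option list):
job submit /numnodes:8 /exclusive mpiexec -cores 1 mpipingpong.exe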
Basic Network Troubleshooting
• Know Expected Bandwidths and Latencies
• Make sure drivers and firmware are up to date
• Use the product diagnostics to confirm – Or Pallas Pingpong, etc.
Network | Bandwidth | Latency
IB QDR (ConnectX, PCI-E 2.0) | 2400 MB/s | 2 µs
IB DDR (ConnectX, PCI-E 2.0) | 1500 MB/s | 2 µs
IB DDR (ConnectX, PCI-E 1.0) | 1400 MB/s | 2.8 µs
IB DDR / ND | 1400 MB/s | 5 µs
IB SDR / ND | 950 MB/s | 6 µs
IB / IPoIB | 200-400 MB/s | 30 µs
GigE | 105 MB/s | 40-70 µs
Basic Tools - Perfmon
Counter | Tolerance | Used for
Processor / % CPU Time | 95% | User-mode bottleneck
Processor / % Kernel Time | 10% | Kernel issues
Processor / % DPC Time | 5% | RSS, affinity
Processor / % Interrupt Time | 5% | Misbehaving drivers
Network / Output Queue Length | 1 | Network bottleneck
Disk / Average Queue Length | 1 per platter | Disk bottleneck
Memory / Pages per Sec | 1 | Hard faults
System / Context Switches per Sec | 20,000 | Locks, wasted processing
System / System Calls per Sec | 100,000 | Excessive transitions
Windows Performance Toolkit
• Official performance analysis tools from the Windows team – used to optimize Windows itself
• Wide support range – Cross platform: Vista, Server 2008/R2, Win7
– Cross architecture: x86, x64, ia64
• Very low overhead – live capture on production systems – less than 2% processor overhead for a sustained rate of 10,000 events/second on a 2 GHz processor
• The only tool that lets you correlate most of the fundamental system activity – All processes and threads, both user and kernel mode
– DPCs and ISRs, thread scheduling, disk and file I/O, memory usage, graphics subsystem, etc.
• Available externally: part of Windows 7 SDK – http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
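A minimal capture sketch with xperf, assuming the toolkit is installed on the node (DiagEasy is a built-in kernel event group that includes hard faults, DPCs, interrupts, and context switches; the trace file name is arbitrary):
xperf -on DiagEasy
REM ... run the workload under test ...
xperf -d hpc_node_trace.etl
REM view the merged trace with: xperfview hpc_node_trace.etl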
NetworkDirect
• Verbs-based design for close fit with native, high-perf networking interfaces
• Performance equal to hardware-optimized stacks on MPI micro-benchmarks
• NetworkDirect drivers for key high-performance fabrics:
– Infiniband [available now!]
– 10 Gigabit Ethernet (iWARP-enabled) [available now!]
– Myrinet [available soon]
• MS-MPIv2 has 4 networking paths:
– Shared memory between processors on a motherboard
– TCP/IP Stack (“normal” Ethernet)
– Winsock Direct for sockets-based RDMA
– New NetworkDirect interface
[Diagram: MS-MPI networking stack. User-mode MPI and socket-based applications run over MS-MPI and Windows Sockets (Winsock + WSD). Traffic reaches the networking hardware through one of three routes: the kernel TCP/Ethernet path (TCP, IP, NDIS, mini-port driver), kernel-bypass RDMA via the WinSock Direct provider, or kernel-bypass RDMA via the NetworkDirect provider and its user-mode access layer. Components are supplied by the OS, HPCS2008, the IHV, or the ISV application.]
A new RDMA networking interface built for speed and stability
MS-MPI Fine tuning
• Lots of MPI parameters (mpiexec -help3):
– MPICH_PROGRESS_SPIN_LIMIT
• 0 is adaptive, otherwise 1-64K
– SHM / SOCK / ND eager limit
• Switchover point for eager / rendezvous behaviour
– ND ZCOPY threshold
• Sets the switchover point between bcopy and zcopy
– Buffer reuse and registration cost affect this (registration ~= a 32 KB bcopy)
– Affinity
• Definitely use it on NUMA systems (example below)
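A hedged example of setting one of these knobs per run instead of machine-wide (MPICH_PROGRESS_SPIN_LIMIT and -affinity come from the list above; myapp.exe is a placeholder, and the eager/zcopy variable names should be taken from mpiexec -help3 on your own install):
mpiexec -env MPICH_PROGRESS_SPIN_LIMIT 0 -affinity myapp.exe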
Reducing OS Jitter
• Track hard faults with xperf
– Disable unused services (there can be 42 or more)
– Delete unneeded Windows scheduled tasks
– Change the Group Policy refresh interval (90 minutes by default)
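A minimal sketch of the first two items (the service and task names are placeholders; audit what actually runs on your compute nodes before disabling anything):
sc config "SomeUnusedService" start= disabled
schtasks /Change /TN "\Microsoft\Windows\Defrag\ScheduledDefrag" /Disable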
Tuning Memory Access
• Effective memory use is rule #1
• Processor Affinity is key here
• Need to know the Processor architecture
• Use STREAM to measure memory bandwidth
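A hedged sketch combining the points above: pin a STREAM run to one socket so you measure that socket's local memory bandwidth (the mask F and the stream.exe binary name are placeholders for your topology and your local STREAM build):
start /wait /b /affinity F stream.exe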
Process Placement
[Diagram: matching work to a heterogeneous cluster of blade chassis with 8-core servers plus 16-core and 32-core servers, connected by GigE, 10 GigE, and InfiniBand. Placement must be NUMA aware, capacity aware, and application aware, using node groups, job templates, filters, and affinity: a big model requires large-memory machines, an ISV application requires nodes where the application is installed, a multi-threaded application requires a machine with many cores, and a 4-way structural analysis MPI job spreads its ranks across sockets. Insets show quad-core and 32-core NUMA topologies with cores, local memory, and IO links.]
MPI Process Placement
• Request resources with JOB: /numnodes:N
/numsockets:N
/numcores:N
/exclusive
• Control placement with MPIEXEC:
– -cores X
– -n X
– -affinity
Examples
• job submit /numcores:4 mpiexec foo.exe
• job submit /numnodes:2 mpiexec -c 2 -affinity foo.exe
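One more hedged example along the same lines: for a memory-bandwidth-bound code, take a few exclusive nodes and run a single pinned rank on each (the node count is arbitrary):
• job submit /numnodes:4 /exclusive mpiexec -cores 1 -affinity foo.exe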
http://blogs.technet.com/windowshpc/archive/2008/09/16/mpi-process-placement-with-windows-hpc-server-2008.aspx
Force Affinity
• mpiexec -affinity
• start /wait /b /affinity <mask> app.exe
• Windows API
– SetProcessAffinityMask
– SetThreadAffinityMask
• With Task Manager or procexp.exe
Core and affinity masks for Woodcrest
[Diagram: two quad-core processors, each built from two dual-core dies; each die's pair of cores shares an L2 cache and a bus interface onto the system bus. Core affinity masks: 0x01, 0x02, 0x04, 0x08 (processor 1) and 0x10, 0x20, 0x40, 0x80 (processor 2). L2 cache affinity masks: 0x03, 0x0C, 0x30, 0xC0. Processor affinity masks: 0x0F and 0xF0.]
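A hedged usage sketch for the masks above (app2t.exe is a placeholder for a two-thread process; the mask is interpreted as hex):
start /wait /b /affinity 3 app2t.exe
REM keeps both threads on the two cores that share one L2 cache
start /wait /b /affinity F app2t.exe
REM confines the process to all four cores of processor 1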
Finer control of affinity
• Useful to overcome hyper-threading on Nehalem (NH)
• mpiexec setaff.cmd mpiapp.exe
@REM setaff.cmd - set affinity based on MPI rank (masks are hex; one bit = one logical processor)
@IF "%PMI_SMPD_KEY%" == "7" set AFFINITY=1
@IF "%PMI_SMPD_KEY%" == "1" set AFFINITY=2
@IF "%PMI_SMPD_KEY%" == "5" set AFFINITY=4
@IF "%PMI_SMPD_KEY%" == "3" set AFFINITY=8
@IF "%PMI_SMPD_KEY%" == "4" set AFFINITY=10
@IF "%PMI_SMPD_KEY%" == "2" set AFFINITY=20
@IF "%PMI_SMPD_KEY%" == "6" set AFFINITY=40
@IF "%PMI_SMPD_KEY%" == "0" set AFFINITY=80
start /wait /b /affinity %AFFINITY% %*
Devs can't tune what they can't see
• MS-MPI Tracing: Single, time-correlated log of MPI Events on All Nodes
• Dual purpose:
– Performance Analysis
– Application Trouble-Shooting
• Trace Data Display
– VAMPIR (TU Dresden)
– Intel Trace Analyzer
– MPICH Jumpshot (Argonne NL)
– Windows ETW tools
– Text
MS-MPI Tracing Overview
• MS-MPI includes “built-in” tracing
– Low overhead
– Based on Event Tracing for Windows (ETW)
– No need to recompile your application
• Three-step process
– Trace: mpiexec -trace [event category] MyApp.exe
– Sync: clocks across nodes (mpicsync.exe)
– Convert: to viewing format
• Explained in excruciating detail in: “Tracing MPI Apps with Windows HPC Server 2008”
• Traces can also be triggered via any ETW mechanism (Xperf, etc.)
Step 1 – Tracing and filtering
• mpiexec -trace MyApp.exe
• mpiexec -trace (PT2PT,ICND) MyApp.exe
– PT2PT : Point to point communication
– ICND : Network Direct Interconnect Communication
– These event groups are defined in the file mpitrace.mof which resides in the %CCP_HOME%\bin\ folder
• Log files are written on each node in %USERPROFILE%
– mpi_trace_{JobID}.{TaskID}.{TaskInstanceID}.etl
• The trace filename can be overridden with the -tracefile argument (see below)
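A hedged example of redirecting the trace (C:\Traces is a placeholder path that must already exist on every compute node):
mpiexec -trace (PT2PT,ICND) -tracefile C:\Traces\MyApp.etl MyApp.exe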
Step 2 – Clock synchronisation
• Use mpiexec and mpicsync to correct trace file timestamps for each node used in a job
– mpiexec –cores 1 mpicsync mpi_trace_42.1.0.etl
• mpicsync uses only the trace (.etl) file data to calculate the CPU clock corrections
• mpicsync must be run as an MPI program
– mpiexec -cores 1 -wdir %%USERPROFILE%% mpicsync mpi_trace_%CCP_JOBID%.%CCP_TASKID%.%CCP_TASKINSTANCEID%.etl
Step 3 - Format the Binary .etl File For Viewing
• Format to TEXT, OTF, CLOG2
– tracefmt, etl2otf and etl2clog
– Format the event log and apply clock corrections
• Leverage the power of your cluster by using mpiexec to translate all your .etl files simultaneously on the compute nodes used for your trace job – mpiexec -cores 1 -wdir %%USERPROFILE%% etl2otf mpi_trace_42.1.0.etl
• Finally, collect the trace files from all nodes in a single location (see the end-to-end sketch below)
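Putting the three steps together for the example trace from step 2, a hedged end-to-end sketch (the 42.1.0 job/task IDs and the \\headnode\traces share are placeholders):
mpiexec -trace (PT2PT,ICND) MyApp.exe
mpiexec -cores 1 -wdir %%USERPROFILE%% mpicsync mpi_trace_42.1.0.etl
mpiexec -cores 1 -wdir %%USERPROFILE%% etl2otf mpi_trace_42.1.0.etl
mpiexec -cores 1 -wdir %%USERPROFILE%% cmd /c copy mpi_trace_42.1.0* \\headnode\traces\
REM the last command copies the trace and its converted output from every node to one share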
Helper script TraceMyMPI.cmd
• Provided as part of the tracing whitepaper
• Executes all the required steps
• Starts mpiexec for you
Resources
• The Windows Performance Toolkit is here
– http://www.microsoft.com/whdc/system/sysperf/perftools.mspx
• The Windows Internals book series is very good
• Basic Windows Server tuning is here
– http://www.microsoft.com/whdc/system/sysperf/Perf_tun_srv.mspx
• Process Affinity in HPC Server 2008 SP1
– http://blogs.technet.com/windowshpc/archive/2009/10/01/process-affinity-and-windows-hpc-server-2008-sp1.aspx