Experiences with HPC on Windows
Christian Terboven
[email protected]‐aachen.de
Center for Computing and Communication
RWTH Aachen University
Server Computing Summit 2008, April 7-11, HPI/Potsdam
Experiences with HPC on Windows 07.04.2008 – C. Terboven
Agenda
o High Performance Computing (HPC)
  – OpenMP & MPI & Hybrid Programming
o Windows-Cluster @ Aachen
  – Hardware & Software
  – Deployment & Configuration
o Top500 submission
o Case Studies
  – Dynamic Optimization: AVT
  – Bevel Gears: WZL
  – Application & Benchmark Codes
o Summary
Sections: HPC Overview · Windows Cluster · Top500 · Case Studies · Summary
High Performance Computing (HPC)
o HPC begins … when performance (runtime) matters!
  – HPC is about reducing latency - the "Grid" increases latency
  – Dominating programming languages: Fortran (77 + 95), C++, C
o The focus in Aachen is on Computational Engineering Science!
o We have been doing HPC on Unix for many years - why try Windows?
  – Attract new HPC users:
    • Third-party cooperations sometimes depend on Windows
    • Some users look for a desktop-like HPC experience
    • Windows is a great development platform
  – We rely on tools: combine the best of both worlds!
  – Top500 on Windows: why not ☺?
Parallel Computer Architectures: Shared-Memory
o Dual-core processors are common now
  → A laptop/desktop is a Shared-Memory parallel computer!
[Diagram: Intel Woodcrest-based system - two cores, each with an L1 cache, sharing an L2 cache and main memory]
(I am ignoring instruction caches, address caches (TLB), write buffers, prefetch buffers, … as data caches are most important for HPC applications.)
o Multiple processors have access to the same main memory.
o Different types:
  o Uniform memory access (UMA)
  o Non-uniform memory access (ccNUMA), still cache-coherent
o Trend towards ccNUMA
Programming for Shared-Memory: OpenMP

const int N = 100000;
double a[N], b[N], c[N];
[…]
#pragma omp parallel for
for (int i = 0; i < N; i++) {
    c[i] = a[i] + b[i];
}
[…]

[Diagram: fork-join model - the Master Thread executes the Serial Part; Slave Threads join it for the Parallel Region]
www.openmp.org
o … and other threading-based paradigms …
Parallel Computer Architectures: Distributed-Memory
o Distributed-Memory: each processor has access only to its own main memory
o Programs have to use an external network for communication
o Cooperation is done via message exchange → Cluster
[Diagram: two shared-memory nodes (cores with L1/L2 caches and local memory) connected by an external network]
Programming for Distributed-Memory: MPI

int rank, size, value = 23;
MPI_Init(…);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) {
    MPI_Send(&value, 1, MPI_INT, rank+1, …);
} else {
    MPI_Recv(&value, 1, MPI_INT, rank-1, …);
}
MPI_Finalize();

[Diagram: MPI Process 1 (rank 0) sends a message of 1 int via MPI_Send; MPI Process 2 (rank 1) receives it via MPI_Recv]
www.mpi-forum.org
Programming Clusters of SMP Nodes: Hybrid
[Diagram: two SMP nodes connected by an external network - OpenMP inside each node, MPI between the nodes]
o Multiple levels of parallelism can improve scalability!
  – One or more MPI tasks per node
  – One or more OpenMP threads per MPI task
o Many of our applications are hybrid: the best fit for clusters of SMP nodes.
o The SMP node will grow in the future!
Speedup and Efficiency
o The Speedup denotes how much faster a parallel program is than the serial program:
  S_p = T_1 / T_p
  (p = number of threads / processes; T_1 = execution time of the serial program; T_p = execution time of the parallel program)
o The Efficiency of a parallel program is defined as:
  E_p = S_p / p
o The Speedup is limited to p; according to Amdahl's law, a serial fraction s bounds it further: S_p ≤ 1 / (s + (1 - s)/p).
Note: Other (similar) definitions for other needs are available in the literature.
Harpertown-based InfiniBand Cluster
o Recently installed cluster:
  – Fujitsu-Siemens Primergy RS 200 S4 servers
    • 2x Intel Xeon 5450 (quad-core, 3.0 GHz)
    • 16 / 32 GB memory per node
    • 4x DDR InfiniBand (latency: 2.7 us, bandwidth: 1250 MB/s)
  – In total: about 25 TFLOP/s peak performance (260 nodes)
o Cluster frontend
  – Windows Server 2008
  – Better performance when the machine gets loaded
  – Faster file access: lower access time, higher bandwidth
  – Automatic load balancing: three machines are clustered to form "one" interactive machine: cluster-win.rz.rwth-aachen.de
Harpertown-based InfiniBand Cluster
o Quite a lot of pizza boxes…
[Pictures taken during cluster installation]
4 cables per box:
o InfiniBand: MPI
o Gigabit Ethernet: I/O
o Management
o Power
Software Environment
o Complete development environment
  – Visual Studio 2005 Pro + Microsoft Compute Cluster Pack
  – Subversion client, X-Win32, Microsoft SDKs, …
  – Intel Tool Suite:
    • C/C++ Compiler, Fortran Compiler
    • VTune Performance Analyzer, Threading Tools
    • MKL library, Threading Building Blocks
    • Intel MPI + Intel Tracing Tools
  – Java, Eclipse, …
o Growing list of ISV software
  – ANSYS, HyperWorks, Fluent, Maple, Mathematica, Matlab, MS Office 2003, MSC.Marc, MSC.Adams, …
  – User-licensed software hosting, e.g. GTM-X
ISV codes in the batch system
o Several requirements
  – No usage of graphical elements allowed
  – No user input allowed (except redirected standard input)
  – Execution of a compiled set of commands, then termination
o Exemplary usage instructions for Matlab
  – Command line:
    \\cifs\cluster\Software\MATLAB\bin\win64\matlab.exe /minimize /nosplash /logfile log.txt /r "cd('\\cifs\cluster\home\YOUR_USERID\YOUR_PATH'), YOUR_M_FILE"
    (matlab.exe lives on the software share; /minimize and /nosplash disable GUI elements; /logfile saves the output; /r changes the directory and executes the script)
  – The .M file should contain "quit;" as its last statement
Deploying a 256 node cluster
o Setting up the head node: approx. 2 hours
  – Installing and configuring Microsoft SQL Server 2005
  – Installing and configuring Microsoft HPC Pack
o Installing the compute nodes:
  – Installing 1 compute node: 50 minutes
  – Installing n compute nodes: still 50 minutes? (Multicasting!)
    • We installed 42 nodes at once in 50 minutes
Deploying a 256 node cluster
o The deployment performance is limited only by the head node:
[Screenshot: 42 multicasting nodes served by WDS on the head node - network utilization at 100%, CPU utilization at 100%]
Configuring a 256 node cluster
o Updates? Deploy an updated image to the compute nodes!
The Top500 List (www.top500.org)
[Chart, Jun 93 to Jun 06, logarithmic performance scale from 1 to 1,000,000: RWTH Peak Performance (R_peak), RWTH Linpack Performance, TOP500 Rank 1, Rank 50, Rank 200, Rank 500, PC technology, Moore's Law]
The Top500 List: Windows clusters
o Why add just another Linux cluster? → Top500 on Windows!
o Watch out for the Windows/Unix ratio in the upcoming Top500 list in June…
The LINPACK benchmark
o Ranking is determined by running a single benchmark: HPL
  – Solve a dense system of linear equations
  – Gaussian elimination-type algorithm: 2/3 n³ + O(n²) operations
  – Very regular problem → high performance / efficiency
  – www.netlib.org/hpl
o The result is not of high significance for our applications …
o … but it is a pretty good sanity check for the system!
  – Low performance → something is wrong!
o Steps to success:
  – Cluster setup & sanity check
  – Linpack parameter tuning
Results of initial Cluster Sanity Check
o The cluster has been in normal operation since the end of January
o Bandwidth test: variations from 180 MB/s to 1180 MB/s
[Chart: point-to-point bandwidth results, nodes 1 to 21 vs. nodes 1 to 9]
Parameter Tuning
o HPC approach: Performance Analysis → Tuning → Analysis → Tuning → Analysis → Tuning → …
[Chart: performance got better over time through system and parameter tuning]
Parameter Tuning – Going the Windows way
o An Excel application on a laptop controlled the whole cluster!
o We easily reached 76.69 GFlop/s on one node.
  Peak performance is: 8 cores * 3 GHz * 4 results per cycle = 96 GFlop/s
  → That is 80% efficiency!
Windows HPC Server 2008 in Action
o Job startup comparison – 2048 MPI processes:
  – Our Linux configuration: order of minutes
  – Our Windows configuration: order of seconds
[Video: job startup on Windows 2008]
Case Study: Dynamic Optimization (AVT)
o Dynamic optimization in the chemical industry
Task: changing the product specification of a plastics plant (from plastic A to plastic B) during ongoing business; the intermediate composition of A and B is unusable material.
• Minimize the junk (composition of A and B)
• Search for the economically and ecologically optimal operational mode!
This task is solved by the DyOS (Dynamic Optimization Software) tool, developed at the chair for process systems engineering (AVT: Aachener Verfahrenstechnik) at RWTH Aachen University.
Case Study: Dynamic Optimization (AVT)
o What is dynamic optimization? An example:
  • Drive 250 m from start to finish in minimal time!
  • Start with v = 0, stop exactly at x = 250 m!
  • Maximal speed is v = 10 m/s!
  • Maximal acceleration is a = 1 m/s²!
[Plots over time 0-30 sec: acceleration a(t) switching between MAX (+1) and MIN (-1), speed v(t) up to 10 m/s, distance x(t) up to 250 m]
Constraints: x(0) = 0, v(0) = 0; v(t_f) = 0, x(t_f) ≥ x_f; v(t) ≤ v_max
Case Study: Dynamic Optimization (AVT)
o One simulation typically takes up to two weeks and requires a significant amount of memory (>4 GB) → MPI parallelization
o Challenge: a commercial software component depends on VS6, and only one instance per machine is allowed → outsourced into a DLL
[Diagram: DyOS coupled to the physical realisation via the gPROMS interface]
Tasks of the MPI communicator (ROOT):
• storing data
• sending and receiving data
• control over the entire software
Case Study: Dynamic Optimization (AVT)
o A scenario consists of several simulations
o Projected compute time on the desktop:
  – 5 scenarios à 2 months = 10 months
o Compute time on our cluster (faster machines, multiple jobs):
  – 5 scenarios à 1.5 months = 3 months
o MPI parallelization led to a factor of 4.5 on 8 cores:
  – 5 scenarios à 0.3 months = 0.7 months
o Arndt Hartwich: "Stability of the RZ's compute nodes is significantly higher than the stability of our desktops."
Case Study: KegelToleranzen (WZL)
[Pictures: Bevel Gear Pair, Differential Gear - Laboratory for Machine Tools and Production Engineering (WZL), RWTH Aachen]
o Simulation of bevel gears
  – Written in Fortran, using the Intel Fortran 10.1 compiler
  – Very cache-friendly → runs at high MFlop/s rates
Case Study: KegelToleranzen (WZL)
o Target
  – Pentium/Windows/Intel → Xeon/Windows/Intel
  – Serial tuning + parallelization with OpenMP
o Procedure
  – Get the tools: porting to UltraSparc IV/Solaris/Sun Studio
  – Simulog Foresys: convert to Fortran 90
    • 77,000 Fortran 77 lines → 91,000 Fortran 90 lines
  – Sun Analyzer: runtime analysis with different datasets
    • Deduce targets for serial tuning and OpenMP parallelization
  – OpenMP parallelization: 5 parallel regions, 70 directives
  – Get the tools: porting the new code to Xeon/Windows/Intel
  – Intel Thread Checker: verification of the OpenMP parallelization
o Put the new code in production on Xeon/Windows/Intel
Case Study: KegelToleranzen (WZL)
o Comparing Linux and Windows Server 2003:
[Bar chart, performance of KegelToleranzen in MFlop/s over 1, 2, 4, 8 threads: Linux 2.6 vs. Windows 2003, both on 4x Opteron dual-core, 2.2 GHz; annotation: "24% better"]
Case Study: KegelToleranzen (WZL)
o Comparing Linux and Windows Server 2008:
[Bar chart, MFlop/s over 1, 2, 4, 8 threads: Linux 2.6 and Windows 2003 on 4x Opteron dual-core, 2.2 GHz; Linux 2.6 and Windows 2008 on 2x Harpertown quad-core, 3.0 GHz; annotation: "7% better"]
Case Study: KegelToleranzen (WZL)
o Comparing Linux and Windows Server 2008:
[Bar chart, MFlop/s over 1, 2, 4, 8 threads: Linux 2.6 and Windows 2003 on 4x Opteron dual-core, 2.2 GHz; Linux 2.6 and Windows 2008 on 4x Clovertown quad-core, 3.0 GHz; annotation: "7% better"]
o What about the 96 GFlop/s per node?!?
  In fact, this application is rather fast. HPC applications are typically waiting for memory!
o Performance gain for the user: speedup of 5.6 on one node.
  Even better relative to the original starting point (desktop: 220 MFlop/s).
  MPI parallelization is work in progress.
Invitation: Windows-HPC User Group Meeting
Invitation to the 1st meeting of the Windows High Performance Computing user group in the German-speaking region
April 21-22, 2008
Rechen- und Kommunikationszentrum, RWTH Aachen
http://www.rz.rwth-aachen.de/winhpcug
Invitation: Windows-HPC User Group Meeting
o Goals
  – Exchange of information among the users themselves
  – Exchange of information between users and Microsoft
o Monday, April 21
  – 17:00: Guided tour of the cathedral (Domführung)
  – 18:30: Joint dinner
o Tuesday, April 22
  – Presentations by the companies Allinea, Cisco, Intel, The Mathworks and Microsoft
  – Presentations by research and teaching institutions
  – Question-and-answer session with HPC experts from Microsoft
Summary & Outlook
o The Windows-HPC environment has been well accepted
  – Growing interest in and need for compute power on Windows.
o Windows HPC Server 2008 (beta) is pretty impressive!!!
  – Deploying and configuring 256 nodes.
  – Job startup and job management.
  – Performance improvements & Linpack efficiency.
o Interoperability in heterogeneous environments got easier.
  – E.g. WDS can interoperate with Linux DHCP and PXE boot.
o Performance: need for Windows-specific tuning.
The End
Thank you for your attention!