A brief overview with emphasis on cluster performance
Eric Lantz ([email protected]), Lead Program Manager, HPC Team, Microsoft Corp.
Fab Tillier ([email protected]), Developer, HPC Team, Microsoft Corp.
A Brief Overview of this second release from Microsoft’s HPC team.
Some Applicable Market Data
IDC Cluster Study (113 sites, 303 clusters, 29/35/36 GIA split)
- Industry self-reports an average of 85 nodes per cluster
- When needing more computing power: ~50% buy a new cluster, ~50% add nodes to an existing cluster
- When purchasing: 61% buy direct from the vendor, 67% have integration from the vendor, 51% use a standard benchmark in the purchase decision
- A premium is paid for lower network latency as well as for power and cooling solutions
Applications Study (IDC Cluster Study; IDC App Study: 250 codes, 112 vendors, 11 countries; site visits)
Application usage
- Apps use 4-128 CPUs and are majority in-house developed; majority multi-threaded
- Only 15% use the whole cluster; in practice 82% are run at 32 processors or below
- Excel running in parallel is an application of broad interest
Top challenges for implementing clusters:
- Facility issues with power and cooling
- System management capability
- Complexity implementing parallel algorithms
- Interconnect latency
- Complexity of system purchase and deployment
Sources: 2006 IDC Cluster Study, HECMS, 2006 Microsoft HEWS Study
Markets Addressed by HPCS2008
Key HPC Server 2008 Features
Systems Management
- New admin console based on the System Center UI framework integrates every aspect of cluster management
- Monitoring heat map allows viewing cluster status at a glance
- High availability for multiple head nodes
- Improved compute node provisioning using Windows Deployment Services
- Built-in system diagnostics and cluster reporting
Job Scheduling
- Integration with the Windows Communication Foundation, allowing SOA application developers to harness the power of parallel computing offered by HPC solutions
- Job scheduling granularity at processor core, processor socket, and compute node levels
- Support for the Open Grid Forum's HPC Basic Profile interface
Networking and MPI
- Network Direct, providing dramatic RDMA network performance improvements for MPI applications
- Improved Network Configuration Wizard
- New shared memory MS-MPI implementation for multicore servers
- MS-MPI integrated with Event Tracing for Windows and Open Trace Format translation
Storage
- Improved iSCSI SAN and Server Message Block (SMB) v2 support in Windows Server 2008
- New parallel file system support and vendor partnerships for clusters with high-performance storage needs
- New memory cache vendor partnerships
End-To-End Approach To Performance
Multi-Core is Key
- Big improvements in MS-MPI shared memory communications
NetworkDirect
- A new RDMA networking interface built for speed and stability
Devs can't tune what they can't see
- MS-MPI integrated with Event Tracing for Windows
Perf takes a village
- Partnering for perf
Regular Top500 runs
- Performed by the HPCS2008 product team on a permanent, scale-testing cluster
Multi-Core is Key: Big improvements in MS-MPI shared memory communications
MS-MPI automatically routes between:
- Shared memory: between processes on a single [multi-proc] node
- Network: TCP, RDMA (Winsock Direct, NetworkDirect)
MS-MPIv1 monitored incoming shmem traffic by aggressively polling [for low latency], which caused:
- Erratic latency measurements
- High CPU utilization
MS-MPIv2 uses an entirely new shmem approach:
- Direct process-to-process copy to increase shm throughput
- Advanced algorithms to get the best shm latency while keeping CPU utilization low
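Because MS-MPI picks the transport automatically, application code does not change between the shared-memory and network paths. The minimal sketch below uses only standard MPI calls (nothing MS-MPI-specific): when ranks 0 and 1 land on the same node the exchange travels the new shmem path, and when they land on different nodes it goes over TCP or RDMA.

// Minimal MPI point-to-point exchange; the transport (shared memory, TCP,
// Winsock Direct, or NetworkDirect) is chosen by MS-MPI at run time and is
// invisible to this code.
#include <mpi.h>
#include <cstdio>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload = 42;
    if (rank == 0)
    {
        // Send to rank 1; this lands in shmem if rank 1 shares this node.
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if (rank == 1)
    {
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}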
Preliminary shmem results
NetworkDirect: A new RDMA networking interface built for speed and stability
Priorities
- Equal to hardware-optimized stacks for MPI micro-benchmarks
- Focus on an MPI-only solution for CCSv2
- Verbs-based design for a close fit with native, high-performance networking interfaces
- Coordinated with the Windows Networking team's long-term plans
Implementation
- MS-MPIv2 is capable of 4 networking paths:
  - Shared memory, between processors on a motherboard
  - TCP/IP stack ("normal" Ethernet)
  - Winsock Direct, for sockets-based RDMA
  - New NetworkDirect interface
- HPC team partnering with networking IHVs to develop/distribute drivers for this new interface
[Diagram: Windows networking stack, TCP/Ethernet vs. RDMA paths. An MPI app runs on MS-MPI; a socket-based app runs on Windows Sockets (Winsock + WSD). Conventional TCP/Ethernet traffic passes through the kernel-mode stack (TCP, IP, NDIS, mini-port driver) down to the networking hardware. RDMA traffic bypasses the kernel, going through the user-mode access layer and the Winsock Direct provider, or through the new NetworkDirect provider, directly to the networking hardware. Legend: OS component, CCP component, IHV component, (ISV) app.]
Devs can't tune what they can't see: MS-MPI integrated with Event Tracing for Windows
- Single, time-correlated log of OS, driver, MPI, and app events
CCS-specific additions
- High-precision CPU clock correction
- Log consolidation from multiple compute nodes into a single record of parallel app execution
Dual purpose
- Performance analysis
- Application troubleshooting
Trace data display
- Visual Studio & Windows ETW tools
- Coming soon: Vampir Viewer for Windows
[Diagram: MS-MPI trace collection flow. mpiexec.exe -trace arguments (or logman.exe with the mpitrace.mof trace settings) enable tracing through the Windows ETW infrastructure on each compute node. MS-MPI events are written to per-node trace log files, which can be converted to text or consumed as a live feed; the trace log files from all nodes are then consolidated into a single record of the parallel run.]
Perf takes a village (Partnering for perf)
Networking hardware vendors
- NetworkDirect design review
- NetworkDirect & Winsock Direct provider development
Windows Core Networking team
Commercial software vendors
- Win64 best practices
- MPI usage patterns
- Collaborative performance tuning
- 3 ISVs and counting
4 benchmarking centers online: IBM, HP, Dell, SGI
Regular Top500 runs
- MS HPC team just completed a 3rd entry to the Top500 list
- Using our dev/test scale cluster (Rainier)
- Currently #116 on Top500
- Best efficiency of any Clovertown with SDR IB (77.1%)
- Learnings incorporated into white papers & CCS product
Configuration:
- 260 Dell blade servers: 1 head node, 256 compute nodes, 1 IIS server, 1 file server
- Networks: App/MPI: InfiniBand; Private: Gb-E; Public: Gb-E
- Each compute node has two quad-core Intel 5320 Clovertown processors, 1.86 GHz, 8 GB RAM
- Total: 2,080 cores, 2+ TB RAM
Location:
- Microsoft Tukwila data center (22 miles from the Redmond campus)
What is Network Direct?
What Verbs should look like for Windows: a Service Provider Interface (SPI)
- Verbs specifications are not APIs!
- Aligned with industry-standard Verbs
  - Some changes for simplicity
  - Some changes for convergence of IB and iWARP
- Windows-centric design
  - Leverages Windows asynchronous I/O capabilities
ND Resources
Resources explained:
- Provider: represents the IHV driver
- Adapter: represents an RDMA NIC; container for all other resources
- Completion Queue (CQ): used to get I/O results
- Endpoint (EP): used to initiate I/O; used to establish and manage connections
- Memory Registration (MR): makes buffers accessible to hardware for local access
- Memory Window (MW): makes buffers accessible for remote access
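To show how these resources relate, here is a rough C++ sketch of the model and the typical setup order. All interface and method names below are illustrative placeholders invented for this overview; they are not the actual ndspi.h declarations shipped in the HPC Server 2008 SDK.

// Illustrative-only declarations mirroring the ND resource model above; NOT
// the real ndspi.h interfaces. Names, methods, and signatures are placeholders.
#include <windows.h>
#include <unknwn.h>   // IUnknown

struct INdCompletionQueue : public IUnknown {};  // used to get I/O results
struct INdEndpoint        : public IUnknown {};  // initiates I/O, manages connections
struct INdMemoryWindow    : public IUnknown {};  // exposes buffers for remote access

// Adapter: one RDMA NIC, container for every other resource.
struct INdAdapter : public IUnknown
{
    virtual HRESULT CreateCompletionQueue(ULONG depth, INdCompletionQueue** cq) = 0;
    virtual HRESULT CreateEndpoint(INdCompletionQueue* cq, INdEndpoint** ep) = 0;
    virtual HRESULT RegisterMemory(const void* buffer, SIZE_T cb,
                                   OVERLAPPED* ov, UINT64* mrToken) = 0;  // local access
    virtual HRESULT CreateMemoryWindow(INdMemoryWindow** mw) = 0;         // remote access
};

// Provider: represents the IHV driver; hands out adapters by local address.
struct INdProvider : public IUnknown
{
    virtual HRESULT OpenAdapter(const SOCKADDR* localAddress, INdAdapter** adapter) = 0;
};

// Typical setup order: provider -> adapter -> CQ -> endpoint -> registrations.
HRESULT SetUpNdResources(INdProvider* provider, const SOCKADDR* localAddress)
{
    INdAdapter* adapter = nullptr;
    HRESULT hr = provider->OpenAdapter(localAddress, &adapter);
    if (FAILED(hr)) return hr;

    INdCompletionQueue* cq = nullptr;
    hr = adapter->CreateCompletionQueue(256, &cq);
    if (SUCCEEDED(hr))
    {
        INdEndpoint* ep = nullptr;
        hr = adapter->CreateEndpoint(cq, &ep);
        // ... register buffers, bind memory windows, connect, post transfers ...
        if (ep) ep->Release();
        cq->Release();
    }
    adapter->Release();
    return hr;
}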
ND to Verbs Resource Mapping
Network Direct             Verbs
Provider                   N/A
Adapter                    HCA/RNIC
Completion Queue (CQ)      Completion Queue (CQ)
Endpoint (EP)              Queue Pair (QP)
Memory Registration (MR)   Memory Region (MR)
Memory Window (MW)         Memory Window (MW)
ND SPI Traits
Explicit resource management
- Application manages memory registrations
- Application manages CQ-to-Endpoint bindings
Only asynchronous data transfers
- Initiate requests on an Endpoint
- Get request results from the associated CQ
Application can use an event-driven and/or polling I/O model (sketched below)
- Leverage Win32 asynchronous I/O for event-driven operation
- No kernel transitions for polling mode
"Simple" memory management model
- Memory Registrations are used for local access
- Memory Windows are used for remote access
IP addressing
- No proprietary address management required
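The two completion models can look roughly like the sketch below. As before, the CQ interface and its methods are hypothetical placeholders, not the real SPI; only the Win32 OVERLAPPED and event mechanics are standard.

// Illustrative completion handling against a hypothetical CQ interface.
#include <windows.h>

struct NdResult { HRESULT Status; SIZE_T BytesTransferred; };

struct INdCompletionQueueSketch
{
    // Hypothetical: drain completed requests; pure user mode, no kernel transition.
    virtual SIZE_T GetResults(NdResult* results, SIZE_T maxResults) = 0;
    // Hypothetical: arm a notification that completes the OVERLAPPED when new
    // results arrive (event-driven model).
    virtual HRESULT Notify(OVERLAPPED* ov) = 0;
};

// Polling model: spin on the CQ; lowest latency, but keeps a core busy.
void PollCompletions(INdCompletionQueueSketch* cq)
{
    NdResult results[8];
    for (;;)
    {
        SIZE_T n = cq->GetResults(results, 8);
        if (n > 0)
            break;  // process results[0..n) here
    }
}

// Event-driven model: arm a notification and block on a Win32 event.
HRESULT WaitForCompletions(INdCompletionQueueSketch* cq)
{
    OVERLAPPED ov = {};
    ov.hEvent = CreateEventW(nullptr, TRUE, FALSE, nullptr);
    if (!ov.hEvent) return HRESULT_FROM_WIN32(GetLastError());

    HRESULT hr = cq->Notify(&ov);
    if (SUCCEEDED(hr))
        WaitForSingleObject(ov.hEvent, INFINITE);  // sleep until the CQ signals

    CloseHandle(ov.hEvent);
    return hr;
}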
ND SPI Model
Collection of COM interfaces
- No COM runtime dependency; uses the interface model only
- Follows the model adopted by the UMDF
Thread-less providers
- No callbacks
Aligned with industry-standard Verbs
- Facilitates IHV adoption
Why COM Interfaces?
- Well-understood programming model
- Easily extensible via IUnknown::QueryInterface, which allows retrieving any interface supported by an object (illustrated below)
- Object oriented
- C/C++ language independent: callers and providers can be independently implemented in C or C++ without impact on one another
- Interfaces support native code syntax; no wrappers
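A minimal sketch of this pattern: plain IUnknown-derived interfaces with no COM runtime behind them, extended at run time through QueryInterface. The interface names and IIDs here are invented for the example and are not part of the ND SPI.

// COM-interface-without-COM-runtime pattern: no registry, no CoCreateInstance;
// the provider simply implements IUnknown by hand.
#include <windows.h>
#include <unknwn.h>

// Hypothetical v1 interface.
struct __declspec(uuid("6F1D7A10-0000-4000-8000-000000000001"))
IExampleAdapter : public IUnknown
{
    virtual HRESULT Query(void* info, SIZE_T cb) = 0;
};

// Hypothetical v2 extension, discoverable at run time via QueryInterface.
struct __declspec(uuid("6F1D7A10-0000-4000-8000-000000000002"))
IExampleAdapter2 : public IExampleAdapter
{
    virtual HRESULT QueryEx(void* info, SIZE_T cb, ULONG flags) = 0;
};

// A caller written against v1 can probe for the v2 extension; if the provider
// does not support it, QueryInterface simply fails and the caller falls back.
HRESULT UseExtensionIfAvailable(IExampleAdapter* adapter)
{
    IExampleAdapter2* v2 = nullptr;
    HRESULT hr = adapter->QueryInterface(__uuidof(IExampleAdapter2),
                                         reinterpret_cast<void**>(&v2));
    if (SUCCEEDED(hr))
    {
        hr = v2->QueryEx(nullptr, 0, 0);
        v2->Release();
    }
    return hr;
}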
Asynchronous Operations
Win32 overlapped operations are used for:
- Memory registration
- CQ notification
- Connection management
Client controls the threading and completion mechanism
- I/O completion port or GetOverlappedResult
Simpler for kernel drivers to support
- IoCompleteRequest, and the I/O manager handles the rest
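Both completion mechanisms named above are ordinary Win32 calls. The sketch below assumes only that the provider exposes some file handle (ndFileHandle) on which the overlapped request was issued and, for the second option, that the handle has already been associated with an I/O completion port via CreateIoCompletionPort.

// Harvesting an overlapped ND request (memory registration, CQ notification,
// or connection management) with standard Win32 mechanisms.
#include <windows.h>

// Option 1: block on the specific request with GetOverlappedResult.
HRESULT WaitWithOverlappedResult(HANDLE ndFileHandle, OVERLAPPED* ov)
{
    DWORD bytes = 0;
    if (!GetOverlappedResult(ndFileHandle, ov, &bytes, TRUE /*bWait*/))
        return HRESULT_FROM_WIN32(GetLastError());
    return S_OK;
}

// Option 2: a completion port shared by many requests and worker threads.
HRESULT WaitOnCompletionPort(HANDLE iocp)
{
    DWORD bytes = 0;
    ULONG_PTR key = 0;
    OVERLAPPED* completed = nullptr;
    if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &completed, INFINITE))
        return HRESULT_FROM_WIN32(GetLastError());
    // 'completed' identifies which outstanding request just finished.
    return S_OK;
}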
Microsoft HPC web site - HPC Server 2008 (beta) available now!
http://www.microsoft.com/hpc
Network Direct SPI documentation, headers, and test executables
- In the HPC Server 2008 (beta) SDK: http://www.microsoft.com/hpc
Microsoft HPC Community Site
http://windowshpc.net/default.aspx
Argonne National Lab's MPI website
http://www-unix.mcs.anl.gov/mpi/
CCS 2003 Performance Tuning Whitepaper
http://www.microsoft.com/downloads/details.aspx?FamilyID=40cd8152-f89d-4abf-ab1c-a467e180cce4&DisplayLang=en
- Or go to http://www.microsoft.com/downloads and search for "CCS Performance"
Socrates software boosts performance by 30% on the Microsoft cluster to achieve 77.1% overall cluster efficiency
The performance improvement was demonstrated with exactly the same hardware and is attributed to:
- Improved networking performance of MS-MPI's NetworkDirect interface
- Entirely new MS-MPI implementation for shared memory communications
- Tools and scripts to optimize process placement and tune the Linpack parameters for this 256-node, 2048-processor cluster
- Windows Server 2008 improvements in querying completion port status
- Use of Visual Studio's Profile Guided Optimization (POGO) on the Linpack, MS-MPI, and ND provider binaries