High-Performance Computing on the Windows Server Platform
Marvin Theimer
Software Architect
Windows Server HPC Group
hpcinfo @ microsoft.com
Microsoft Corporation
Session Outline
Brief introduction to HPC
Definitions
Market trends
Overview of the V1 version of Windows Server 2003 CCE
Features
System architecture
Key challenges for future HPC systems
Too many factors affect performance
Grid computing economics
Data management
Defining High Performance Computing (HPC)
Three classes of HPC systems, compared across (MPPs/Vector/NUMA | Dedicated HPC Clusters | Resource Scavenging):
Goal: Absolute performance | Price/performance, SLA | Harness unused cycles
Targets: Large-scale SMPs, special purpose | Clusters | Underutilized desktops & servers
Coupling: Tight | Tight, loose | Loose
Comm: Shared memory | Message passing | Messages, files
Network: Bus, backplane | Myrinet, Infiniband, GigE | TCP/IP, HTTP
HPC Definition: Using compute resources to solve computationally intensive problems
[Diagram: HPC use in technical and scientific computing. Sensors feed computational modeling; results persist to databases or file systems (DB, FS, ...); mining and interpretation follow.]
HPC Role in Science: Different Platforms for Achieving Results
Cluster HPC Scenario
[Diagram: cluster HPC scenario. Users, admins, and job submissions reach a head node via web service, web page, or command line. The head node provides job, resource, cluster, and user management and applies policy and reporting. Cluster nodes each run a node manager, resource manager, and the user's MPI application, connected by a high-speed, low-latency interconnect (1GE, Infiniband, Myricom). Input data comes from a database or file system; sensors, workflow, computation, data mining, visualization, and remote query surround the cluster.]
Top 500 Supercomputer Trends
Industry usage is rising
Clusters over 50%
GigE is gaining
IA (Intel Architecture) is winning
Commoditized HPC Systems are Affecting Every Vertical
Leverage Volume Markets of Industry Standard Hardware and Software.
Rapid Procurement, Installation and Integration of systems
Cluster Ready Applications Accelerating Market Growth
Engineering
Bioinformatics
Oil & Gas
Finance
Entertainment
Government/Research
The convergence of affordable high performance hardware and commercial apps is making supercomputing a mainstream market
Supercomputing: Yesterday vs. Today
             | 1991                    | 1998                                   | 2005
System       | Cray Y-MP C916          | Sun HPC10000                           | Shuttle @ NewEgg.com
Architecture | 16 x Vector, 4GB, Bus   | 24 x 333MHz UltraSPARC II, 24GB, SBus  | 4 x 2.2GHz Athlon64, 4GB, GigE
OS           | UNICOS                  | Solaris 2.5.1                          | Windows Server 2003 SP1
GFlops       | ~10                     | ~10                                    | ~10
Top500 #     | 1                       | 500                                    | N/A
Price        | $40,000,000             | $1,000,000 (40x drop)                  | < $4,000 (250x drop)
Customers    | Government labs         | Large enterprises                      | Every engineer & scientist
Applications | Classified, climate, physics research | Manufacturing, energy, finance, telecom | Bioinformatics, materials sciences, digital media
Cheap, Interactive HPC Systems Are Making Supercomputing Personal
Mainframes → Minicomputers → Personal workstations & departmental servers → Grids of personal & departmental clusters
The Evolving Nature of HPC
Scenario: Departmental Cluster (conventional scenario)
IT owns large clusters due to complexity and allocates resources on a per-job basis
Users submit batch jobs via scripts
In-house and ISV apps, many based on MPI
Manual, batch execution
Focus: scheduling multiple users' applications onto scarce compute cycles; cluster systems administration

Scenario: Personal/Workgroup Cluster (emerging scenario)
Clusters are pre-packaged OEM appliances, purchased and managed by end-users
Desktop HPC applications transparently and interactively make use of cluster resources
Interactive computation and visualization
Focus: desktop development tools integration; interactive applications; compute grids

Scenario: HPC Application Integration (future scenario)
Multiple simulations and data sources integrated into a seamless application workflow
Network topology and latency awareness for optimal distribution of computation
Structured data storage with rich metadata
Applications and data potentially span organizational boundaries
Focus: data-centric, "whole-system" workflows; data grids
Windows-based HPC Today
Ecosystem
Partnerships with ISVs to develop on Windows
Partnership with Cornell Theory Center
Technical Solution
Partner-driven solution stack, with Visual Studio as the development environment:
Applications: parallel applications
Management: LSF, PBSPro, DataSynapse, MSTI
Middleware: MPI/Pro, MPICH-1.2, WMPI, MPI-NT
OS: Windows
Protocol: TCP
Interconnect: Fast Ethernet, Gigabit Ethernet
Platform: Intel (32-bit & 64-bit) & AMD x64
What Windows-based HPC needs to provide
Users require:
An integrated, supported solution stack leveraging the Windows infrastructure
Simplified job submission, status and progress monitoring
Maximum compute performance and scalability
Simplified environment from desktops to HPC clusters
Administrators require:
Ease of setup and deployment
Better cluster monitoring and management for maximum resource utilization
Flexible, extensible, policy-driven job scheduling and resource allocation
High availability
Secure process startup and complete cleanup
Developers Require:
Programming environment that enables high productivity
Availability of optimized compilers (Fortran) and math libraries
Parallel debugger, profiler, and visualization tools
Parallel programming models (MPI)
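To make the MPI programming model concrete, here is a minimal sketch in C. It assumes nothing beyond a standard MPI implementation (such as the MPICH lineage mentioned elsewhere in this deck) and would typically be launched with mpiexec, e.g. `mpiexec -n 4 hello.exe`.

```c
/* Minimal MPI program: each process reports its rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* start the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id          */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes  */

    printf("Hello from rank %d of %d\n", rank, size);

    MPI_Finalize();                         /* shut down cleanly          */
    return 0;
}
```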
V1 Plans
Introduce compute cluster solution
Windows Server 2003 Compute Cluster Edition, based on Windows Server 2003 SP1 x64 Standard Edition
Features for Job Management, IT Admin and Developers
Build partner eco-system around the Windows Server Compute Cluster Edition from day one
Establish Microsoft credibility in the HPC community
Create worldwide Centers of Innovation
Technologies
Platform
Windows Server 2003 SP1 64-bit Edition
x64 processors (Intel EM64T & AMD Opteron)
Ethernet, Ethernet over RDMA and Infiniband support
Administration
Prescriptive, simplified cluster setup and administration
Scripted, image-based compute node management
Active Directory based security, impersonation and delegation
Cluster-wide job scheduling and resource management
Development
MPICH-2 from Argonne National Labs
Cluster scheduler accessible via DCOM, http, and Web Services
Visual Studio 2005 – Compilers, Parallel Debugger
Partner delivered compilers and libraries
Windows HPC Environment
[Diagram: the same cluster scenario realized on Windows. Head node and compute nodes run Windows Server 2003, Compute Cluster Edition; the high-speed, low-latency interconnect is Ethernet over RDMA or Infiniband; Active Directory supplies security and Microsoft Operations Manager supplies monitoring. Users, admins, and job submissions reach the head node via web service, web page, or command line, with input data drawn from a database or file system.]
Architectural Overview
Legend: 3rd-party, Windows OS, Whidbey, MS, and HPC components.
[Diagram: architectural overview. The head node runs Windows Server 2003 CCE and hosts the job scheduler alongside WSE3, MSDE, RIS, AD, IIS6, and the job scheduler UI. Cluster nodes run Windows Server 2003 CCE with a node manager and HPC applications; MPI-2 traffic rides TCP, shared memory, or WSD/SDP over GigE/RDMA and Infiniband, while scheduler traffic uses HTTP and COM. User workstations (Windows XP, x86/x64, GigE) hold applications, job scripts, and data; developer workstations add Whidbey (Visual Studio 2005), WSE, SFU, compilers and libraries, and the HPC SDK (MPI, scheduler Web service, and policy APIs).]
Difficult to Tune Performance
Example: tightly-coupled MPI applications
Very sensitive to network performance characteristics
Communication times measured in microseconds: O(10 usecs) for interconnects such as Infiniband; O(100 usecs) for GigE
OS network stack is a significant factor: Things like RDMA can make a big difference
We are excited about the prospects of industry-standard RDMA hardware
We are working with InfiniBand and GigE vendors to ensure our stack supports them
Driver quality is an important facet
We are supporting the OpenIB initiative
Considering the creation of a WHQL program for InfiniBand
Very sensitive to mismatched node performance
Random OS activities can add millisecond delays to microsecond communication times
Need self-tuning systems
Application configuration has a significant impact
Incorrect assumptions about hardware/communications architecture can dramatically affect performance
Choice of communication strategy
Choice of communication granularity
…
Tuning is an end-to-end issue:
OS support
ISV library support
ISV application support
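The latency sensitivity above is easy to measure directly. Below is a rough ping-pong microbenchmark sketch, assuming a standard MPI-2 implementation and two ranks; per the figures above, the reported one-way latency is O(10 usecs) on interconnects such as Infiniband and O(100 usecs) on GigE.

```c
/* Ping-pong latency sketch: ranks 0 and 1 bounce a small message and
 * time the round trips. Message size and iteration count are arbitrary. */
#include <mpi.h>
#include <stdio.h>

#define ITERS 1000

int main(int argc, char *argv[])
{
    char buf[8] = {0};   /* small message: latency-, not bandwidth-bound */
    int rank, i;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);     /* start everyone together */
    t0 = MPI_Wtime();
    for (i = 0; i < ITERS; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)   /* each round trip is two one-way messages */
        printf("avg one-way latency: %.1f usec\n",
               (t1 - t0) / ITERS / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```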
Computational Grid Economics
What $1 will buy you (roughly):
Computers cost roughly $1000, so 1 cpu-day (~10 Tera-ops) == $1 (assuming a 3-year use cycle)
10 TB of cluster network transfer == $1 (assuming a 1 Gbps interconnect)
Internet bandwidth costs roughly $100/Mbps/month (not including routers and management), so 1 GB of WAN network transfer == $1
Some observations:
HPC cluster communication is roughly 10,000x cheaper than WAN communication
Break-even point for instructions computed per byte transferred:
Cluster: O(1) instrs/byte
WAN: O(10,000) instrs/byte
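Spelled out, the break-even arithmetic follows directly from the $1 figures above:

```latex
% Order-of-magnitude break-even arithmetic, using the $1 figures above.
\[
  \text{Cluster: } \frac{10\ \text{Tera-ops}/\$}{10\ \text{TB}/\$}
    = \frac{10^{13}\ \text{ops}}{10^{13}\ \text{bytes}} \approx 1\ \text{instr/byte}
\]
\[
  \text{WAN: } \frac{10\ \text{Tera-ops}/\$}{1\ \text{GB}/\$}
    = \frac{10^{13}\ \text{ops}}{10^{9}\ \text{bytes}} \approx 10^{4}\ \text{instrs/byte}
\]
```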
Computational Grid Economics Implications
“Small data, high compute” applications work well across the Internet, such as SETI@home and Folding@home
MPI-style parallel, distributed applications work well in clusters and across LANs, but are uneconomic and do not work well in wide-area settings
Data analysis is usually best done by moving the programs to the data, not the data to the programs.
Move questions and answers, not petabyte-scale datasets
The Internet is NOT the cpu backplane (Internet-2 will not change this)
Exploding Data Sizes
Experimental data: TBs → PBs
Modeling data:
Today: 10's to 100's of GB is the common case
Tomorrow: TBs
Near-future example: CFD simulation of a turbine engine
10**9 mesh nodes, each containing 16 double-precision variables
128 GB / time-step
Simulate 1000’s of time steps 100’s TBs / simulation
Archived for future reference
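Checking the arithmetic above, the 128 GB figure follows directly from the mesh size:

```latex
% Sanity check on the 128 GB/time-step figure.
\[
  10^{9}\ \text{mesh nodes} \times 16\ \text{doubles}
    \times 8\ \tfrac{\text{bytes}}{\text{double}}
  = 1.28 \times 10^{11}\ \text{bytes} = 128\ \text{GB per time-step}
\]
\[
  1000\ \text{time-steps} \times 128\ \text{GB} \approx 128\ \text{TB per simulation}
\]
```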
Whole-System Modeling and Workflow
Today: mostly about computation
Stand-alone static simulations of individual parts/phenomena
Mostly batch
Simple workflows: short, deterministic pipelines (though some are massively parallel)
Future: mostly about data that is produced and consumed by computational steps
Dynamic whole-system modeling via multiple, interacting simulations
More complex workflows (don't yet know how complex)
More interactive analysis
More sharing
Whole-System Modeling Example: Turbine Engine
Interacting simulations
CFD simulation of dynamic airflow through turbine
FE stress analysis of engine & wing parts
"Impedance" issues between various simulations (time steps, meshes, ...)
Serial workflow steps
Crack analysis of engine & wing parts
Visualization of results
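A minimal sketch of the time-step "impedance" issue named above, assuming (hypothetically) that the CFD solver needs steps roughly 100x finer than the FE stress analysis. All function names and step sizes are illustrative stand-ins, not an actual solver interface: the CFD side sub-cycles, and boundary loads are exchanged at the coarser interval.

```c
/* Coupled-simulation loop sketch; kernels are hypothetical stubs. */
#include <stdio.h>

#define DT_FE  1.0e-3   /* FE stress step (s) -- assumed value  */
#define DT_CFD 1.0e-5   /* CFD step (s) -- assumed 100x finer   */

static void cfd_step(double dt) { (void)dt; /* advance airflow by dt      */ }
static void fe_step(double dt)  { (void)dt; /* advance stress field by dt */ }
static void exchange_boundary_loads(void)  { /* interpolate between meshes */ }

int main(void)
{
    const double t_end = 0.01;
    int substeps = (int)(DT_FE / DT_CFD + 0.5);   /* sub-cycling ratio: 100 */
    double t;
    int i;

    for (t = 0.0; t < t_end; t += DT_FE) {
        for (i = 0; i < substeps; i++)
            cfd_step(DT_CFD);          /* fine-grained airflow updates      */
        exchange_boundary_loads();     /* couple: pressures <-> deflections */
        fe_step(DT_FE);                /* coarse-grained stress update      */
    }
    printf("coupled run complete: %d CFD steps per FE step\n", substeps);
    return 0;
}
```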
Interactive Workflow Example
Base CFD simulation produces huge output
Points of interest may not be easy to find
Find and then focus on important details
Data analysis/mining of output
Restart simulation at a desired point in time/space.
Visualize simulation from that point forward.
Modify simulation from that point forward (e.g. higher fidelity)
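This scenario presumes a checkpoint/restart mechanism. Below is a bare-bones sketch; the file naming, state layout, and sizes are all hypothetical (a real CFD state would be gigabytes, not a small array).

```c
/* Checkpoint/restart sketch: save state periodically, resume at a step. */
#include <stdio.h>

#define N 1024                  /* toy state size */
static double state[N];

/* Write the full state so the run can later be resumed at this step. */
static void checkpoint(int step)
{
    char name[64];
    FILE *f;
    snprintf(name, sizeof name, "ckpt_%06d.bin", step);
    if ((f = fopen(name, "wb")) != NULL) {
        fwrite(state, sizeof state, 1, f);
        fclose(f);
    }
}

/* Reload a prior checkpoint; returns 0 on success. */
static int restart(int step)
{
    char name[64];
    FILE *f;
    int ok;
    snprintf(name, sizeof name, "ckpt_%06d.bin", step);
    if ((f = fopen(name, "rb")) == NULL)
        return -1;
    ok = (fread(state, sizeof state, 1, f) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

int main(void)
{
    int step;
    for (step = 0; step <= 1000; step++) {
        /* ... advance the simulation one step ... */
        if (step % 100 == 0)
            checkpoint(step);          /* periodic save points */
    }
    /* Later: restart at a point of interest, e.g. at higher fidelity. */
    return restart(500) == 0 ? 0 : 1;
}
```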
Data Analysis and Mining
Traditional approach:
Keep data in flat files
Write C or Perl programs to compute specific analysis queries
Problems with this approach:
Imposes significant development times
Scientists must reinvent DB indexing and query technologies
Results from the astronomy community:
Relational databases can yield speed-ups of one to two orders of magnitude
SQL + application/domain-specific stored procedures greatly simplify creation of analysis queries
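For contrast, here is a hypothetical, astronomy-flavored sketch of the traditional flat-file approach criticized above, in which every query pays a full scan of the file. The record layout and file name are illustrative only.

```c
/* Flat-file query sketch: every question re-reads the whole dataset. */
#include <stdio.h>

struct observation {        /* hypothetical fixed-size record        */
    double ra, dec;         /* sky position                          */
    double magnitude;       /* brightness (smaller = brighter)       */
};

/* Count objects brighter than a threshold: O(N), reads the whole file. */
static long count_brighter_than(const char *path, double mag_limit)
{
    struct observation obs;
    long n = 0;
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    while (fread(&obs, sizeof obs, 1, f) == 1)
        if (obs.magnitude < mag_limit)
            n++;
    fclose(f);
    return n;
}

int main(void)
{
    long n = count_brighter_than("observations.bin", 18.0);
    printf("matching objects: %ld\n", n);
    return 0;
}
```

The equivalent relational query is a one-line SELECT, and an index on the filtered column makes it sub-linear in the table size rather than a full scan, which is where the order-of-magnitude speed-ups reported above come from.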
Combining Simulation with Experimental Data: Drug Discovery
Clinical trial database describes toxicity & side effects observed for tested drugs.
Simulation searches for candidate compounds that have a desired effect on a biological system.
Clinical data is searched for drugs that contain a candidate compound or a "near neighbor"; toxicity results are retrieved and used to decide whether the candidate compound should be rejected.
Sharing
Simulations (or ensembles of simulations) mostly done in isolation
No sharing except for archival output
Some coarse-grained sharing
Check-out/check-in of large components
Example: automotive design
Check out component
CAE-based design & simulation of component
Check in with design-rule checking step
Data warehouses typically only need coarse-grained update granularity
Bulk or coarse-grained updates
Modeling & simulations done in the context of particular versions of the data
Audit trails and reproducible workflows becoming increasingly important
Data Management Needs
Cluster file systems and/or parallel DBs to handle I/O bandwidth needs of large, parallel, distributed applications
Data warehouses for experimental data and archived simulation output
Coarse-grained geographic replication to accommodate distributed workforces and workflows
Indexing and query capabilities to do data mining & analysis
Audit trails, workflow recorders, etc.
Windows HPC Roadmap
V1: Introduce complete Windows-based compute cluster solution
• Complete development platform – SFU, MPI, compilers, parallel debugger
• Integrated cluster setup and management
• Secure and programmable job scheduling and management

V2: Apply Microsoft's core innovation to HPC
• Enhanced development support
• Harnessing unused desktop cycles for HPC
• Meta-scheduling to integrate desktops and forests of clusters
• Efficient management and manipulation of large datasets and workflows

Long-term vision: Revolutionize scientific and technical computation
• Ease of parallel programming
• Exploitation of hardware features (e.g. GPUs and multi-core chips)
• Integration with high-level tools (e.g. Excel)
• Support for domain-specific platforms
Call To Action
IHVs
Develop Winsock Direct drivers for your RDMA cards, so that our MPI stack automatically takes advantage of their low latency
Develop support for diskless scenarios (e.g. iSCSI)
OEMs
Offer turn-key clusters
Pre-wired for management and RDMA networks
Support “boot from net” diskless scenarios
Support WS-Management
Consider noise and power requirements for personal and workgroup configurations
Community Resources
Windows Hardware & Driver Central (WHDC): www.microsoft.com/whdc/default.mspx
Technical Communities: www.microsoft.com/communities/products/default.mspx
Non-Microsoft Community Sites: www.microsoft.com/communities/related/default.mspx
Microsoft Public Newsgroups: www.microsoft.com/communities/newsgroups
Technical Chats and Webcasts: www.microsoft.com/communities/chats/default.mspx and www.microsoft.com/webcasts
Microsoft Blogs: www.microsoft.com/communities/blogs
Related WinHEC Sessions
TWNE05005: Winsock Direct Value Proposition - Partner Concepts
TWNE05006: Implementing Convergent Networking - Partner Concepts
To Learn More
Microsoft:
Microsoft HPC website: http://www.microsoft.com/hpc/
Other sites:
CTC Activities: http://cmssrv.tc.cornell.edu/ctc/winhpc/
3rd Party Windows Cluster Resource Centre: www.windowsclusters.org
HPC related-links web site: http://www.microsoft.com/windows2000/hpc/miscresources.asp
Some useful articles & presentations:
"Supercomputing in the Third Millenium", by George Spix: http://www.microsoft.com/windows2000/hpc/supercom.asp
Introduction to the book "Beowulf Cluster Computing with Windows" by Thomas Sterling, Gordon Bell, and Janusz Kowalik
"Distributed Computing Economics", by Jim Gray, MSR-TR-2003-24: http://research.microsoft.com/research/pubs/view.aspx?tr_id=655
"Web Services, Large Databases, and what Microsoft is doing in the Grid Computing Space", presentation by Jim Gray: http://research.microsoft.com/~Gray/talks/WebServices_Grid.ppt
Send questions to hpcinfo @ microsoft.com