© 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public
Why Cisco is Awesome for Your Next HPC Cluster
Jeff Squyres, Cisco Systems, Inc.
Yes, we sell servers now
Cisco UCS servers
• Cisco UCS servers: record-setting Intel Ivy Bridge, 1U and 2U servers
• Cisco 2 x 10Gb VIC: ultra low latency Ethernet (yes, really!)
• Cisco Nexus switches: 40Gb top-of-rack and core switches
Cisco Unified Computing System (UCS) Servers
Cisco UCS: Many Server Form Factors, One System

Rack:
• UCS C220 M3: ideal for HPC compute-intensive applications (2-socket)
• UCS C240 M3: perfect as HPC cluster head nodes or IO nodes (2-socket)
• UCS C420 M3: 4-socket rack server for large-memory compute workloads (4 sockets + giant memory, HPC performance)

Blade:
• UCS B200 M3: blade form factor, 2-socket
• UCS B420 M3: 4-socket blade for large-memory compute workloads
Industry-leading compute without compromise
In Only Four Years…
Market Appetite for Innovation Fuels UCS Growth: UCS #2 and climbing

• Demand for data center innovation has vaulted Cisco Unified Computing System (UCS) to the #2 leader in the fast-growing x86 server blade segment
• UCS is impacting the growth of established vendors like HP; legacy offerings are flat-lining or in decline
• Cisco growth is out-pacing the market
• Customers have shifted 19.3% of the global x86 blade server market to Cisco, and over 26% in the Americas

[Chart: Worldwide x86 Server Blade Market Share]
Source: IDC Worldwide Quarterly Server Tracker, Q1 2013 Revenue Share, May 2013
Customers Have Spoken: UCS #2 in Only Four Years
• UCS x86 blade server revenue grew 35% Y/Y in Q1CY13¹
• Maintained #2 in N. America (27.9%) and #2 in the US (28.3%)¹
• Advanced to #2 worldwide in x86 blades with 19.3%
• UCS momentum is fueled by game-changing innovation; Cisco is quickly passing established players

[Bar charts: x86 Server Blade Market Share, Q1CY13¹ — one chart (HP, Cisco, IBM, Dell, NEC, Hitachi, Fujitsu, Oracle) shows UCS #2 with 26.9%; the Worldwide chart (HP, Cisco, IBM, Dell, SGI, Oracle) shows UCS #2 with 19.3%]

Source: ¹ IDC Worldwide Quarterly Server Tracker, Q1 2013, May 2013, Revenue Share
Cisco UCS Performance: 80 Records
A History of World Record Performance on Industry Standard Benchmarks
• Best CPU performance: 16 world records
• Best virtualization & cloud performance: 8 world records
• Best database performance: 9 world records
• Best enterprise application performance: 18 world records
• Best enterprise middleware performance: 14 world records
• Best HPC performance: 15 world records
Why have 2 HPC networks?
One wire to rule them all:
• Commodity traffic (e.g., ssh)
• Cluster / hardware management
• File system / IO traffic
• MPI traffic
10G or 40G with real QoS
Cisco Nexus Datacenter Switches
Low latency, high density 10 / 40Gb switches
• Nexus 3548: 190 ns port-to-port latency (L2 and L3); created for HPC / HFT; 48 x 10Gb / 12 x 40Gb ports
• Nexus 6004: 1 µs port-to-port latency; 384 x 10Gb / 96 x 40Gb ports
Cisco Nexus: years of experience rolled into dependable solutions
HPC Topologies: 2-Tier Spine/Leaf Examples

Characteristics:
• 3 hops
• Low oversubscription – non-blocking
• < ~3.5 µs depending on config and workload
• 10G or 40G capable
• Spine: 4 to 16 wide
• Leaf: determined by spine density

Spine - Leaf             | Port Scale   | Oversub | Latency                    | Spines | Leafs
10G Fabric (6004-6001)   | 18,432 x 10G | 3:1     | ~3 µs, cut-through         | 16     | 384
40G Fabric (6004-6004)   | 7,680 x 40G  | 5:1     | ~3 µs, cut-through         | 16     | 96
Mixed Fabric (6004-6001) | 4,680 x 10G  | 3:1     | ~3 µs, store-and-forward   | 4      | 96
10G Fabric (6004-3548)   | 12,288 x 10G | 3:1     | ~1.5 µs, cut-through       | 16     | 384
40G Fabric (6004-3548)   | 1,152 x 40G  | 1:1     | ~1.5 µs, cut-through       | 6      | 96
Mixed Fabric (6004-3548) | 3,072 x 10G  | 3:1     | ~1.5 µs, store-and-forward | 4      | 96
…many other configurations are also possible
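The port-scale and oversubscription figures in these tables follow from simple per-switch arithmetic. As a hedged sketch, the first 10G fabric row can be reproduced from the Nexus 6001's port counts (48 x 10G host-facing ports per leaf; the 4 x 40G uplink split is an assumption, not stated on the slide):

```python
def fabric(leaves, host_ports_per_leaf, host_gbps, uplinks_per_leaf, uplink_gbps):
    """Total host-facing port scale and oversubscription ratio of a
    2-tier spine/leaf fabric (host bandwidth down vs. uplink bandwidth up)."""
    port_scale = leaves * host_ports_per_leaf
    oversub = (host_ports_per_leaf * host_gbps) / (uplinks_per_leaf * uplink_gbps)
    return port_scale, oversub

# "10G Fabric 6004 - 6001" row: 384 leaves, each with 48 x 10G host ports
# and (assumed) 4 x 40G uplinks toward the spines.
ports, ratio = fabric(leaves=384, host_ports_per_leaf=48, host_gbps=10,
                      uplinks_per_leaf=4, uplink_gbps=40)
print(ports, ratio)  # 18432 host ports at 3.0 (i.e., 3:1) oversubscription
```

The same function applied with other leaf models and uplink splits yields the remaining rows.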
HPC Topologies: 3-Tier Spine/Leaf (3 or 5 hops) – Max Scale Examples

Characteristics:
• 3 hops within a pod – 5 hops for DC east-west traffic
• Low oversubscription – non-blocking
• < ~3.5 µs depending on config and workload
• 10G or 40G capable
• Two spine layers

Spine2 - Spine1 - Leaf        | Port Scale   | Oversub | Latency                        | Spine2 | Spine1 | Leafs
10G Fabric (6004-6004-6001)   | 55,296 x 10G | 3:1     | ~3-5 µs, cut-through           | 48     | 16 x 6 | 192
40G Fabric (6004-6004-6004)   | 23,040 x 40G | 5:1     | ~3-5 µs, cut-through           | 48     | 16     | 48
Mixed Fabric (6004-6004-6001) | 18,432 x 10G | 3:1     | ~3-5 µs, store-and-forward     | 32     | 4 x 8  | 48
10G Fabric (6004-6004-3548)   | 24,576 x 10G | 2:1     | ~1.5-3.5 µs, cut-through       | 32     | 16 x 4 | 192
40G Fabric (6004-6004-3548)   | 2,304 x 40G  | 1:1     | ~1.5-3.5 µs, cut-through       | 24     | 6 x 8  | 48
Mixed Fabric (6004-6004-3548) | 9,216 x 10G  | 2:1     | ~1.5-3.5 µs, store-and-forward | 24     | 6 x 8  | 48
Ultra Low Latency Ethernet: Userspace NIC (usNIC)
Cisco Userspace NIC (usNIC) overview
• Direct access to NIC hardware from Linux userspace
  - Operating system bypass
  - Via the Linux Verbs API (UD)
• Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency
  - 2nd generation 80Gbps Cisco ASIC
  - 2 x 10Gbps Ethernet ports
  - 2 x 40Gbps coming in Q4 2013
  - PCI and mezzanine form factors
• Half-round trip (HRT) ping-pong latencies:
  - Back to back: 1.7 µs
  - Through N3548: 1.9 µs
  - Through MPI + N3548: 2.16 µs (*)
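The three HRT numbers decompose neatly: the gap between the back-to-back and through-switch runs is the switch's contribution, and the gap between raw verbs and MPI runs is the MPI software overhead. A quick sanity check using only the figures quoted above:

```python
back_to_back = 1.7   # µs, HRT ping-pong, NIC to NIC
through_n3548 = 1.9  # µs, same test through one Nexus 3548
with_mpi = 2.16      # µs, Open MPI over usNIC through the N3548

switch_cost = round(through_n3548 - back_to_back, 2)  # cost of one switch hop
mpi_overhead = round(with_mpi - through_n3548, 2)     # cost of the MPI layer

print(switch_cost, mpi_overhead)  # 0.2 0.26
```

The ~0.2 µs switch hop is consistent with the 190 ns port-to-port latency quoted for the Nexus 3548 earlier in the deck.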
Software architecture comparison
[Diagram: two stacks over the Cisco VIC hardware. TCP/IP path: application → userspace sockets library → kernel TCP stack → general Ethernet driver → VIC. usNIC path: application → userspace verbs library → VIC directly on the send and receive fast path; the Verbs IB core and Cisco usNIC kernel driver are used only for bootstrapping and setup.]
Ethernet OS bypass applied to MPI
[Diagram: MPI sits directly on the userspace verbs library over the Cisco VIC hardware. MPI directly injects L2 frames to the network and receives L2 frames directly from the VIC.]
OS Bypass Architecture
[Diagram: each MPI process owns queue pairs (QPs) on the SR-IOV NIC (the VIC). Outbound L2 frames go straight from the process to the wire; the VIC's classifier steers inbound L2 frames to the owning QP. Userspace memory access is mediated by the IO MMU (x86 chipset VT-d).]
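The receive-side steering the architecture implies can be sketched as a tiny lookup model. This is purely illustrative: the match fields and rule shape are assumptions for exposition, not the VIC classifier's actual logic.

```python
# Each MPI process owns queue pairs (QPs); the NIC classifier steers an
# inbound L2 frame to the right QP by matching on frame header fields.
class Classifier:
    def __init__(self):
        self.rules = {}  # (dst_mac, dst_port) -> a QP's receive queue

    def add_qp(self, dst_mac, dst_port, rxq):
        self.rules[(dst_mac, dst_port)] = rxq

    def steer(self, frame):
        # Hardware does this match at line rate; an unmatched frame would
        # fall through to the default (kernel) path. Here: return None.
        return self.rules.get((frame["dst_mac"], frame["dst_port"]))

cls = Classifier()
rank0_rxq = []  # stand-in for rank 0's QP receive queue
cls.add_qp("aa:bb:cc:dd:ee:ff", 10000, rank0_rxq)
q = cls.steer({"dst_mac": "aa:bb:cc:dd:ee:ff", "dst_port": 10000})
q.append(b"payload")
print(rank0_rxq)  # [b'payload']
```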
PCIe Single Root IO Virtualization (SR-IOV)
[Diagram: the VIC exposes two physical ports. Each port has a Physical Function (PF) with its own MAC address (aa:bb:cc:dd:ee:ff and aa:bb:cc:dd:ee:fe) and a set of Virtual Functions (VFs); queue pairs (QPs) are carried on the VFs.]
[Diagram, continued: each MPI process is bound to its own VF on a physical port, and its QPs reach the VF through the Intel IO MMU, bypassing the kernel.]
Open Source
• Everything above the firmware is open source
• Open MPI: distributing Cisco Open MPI 1.6.5; upstream in Open MPI 1.7.3
• Libibverbs plugin
• Verbs kernel module
Performance

Hardware:
• Cisco UCS C220 M3 Rack Server
  - Intel E5-2690 processor, 2.9 GHz (3.3 GHz Turbo), 2 sockets, 8 cores/socket
  - 1600 MHz DDR3 memory, 16 x 8 GB (128 GB installed)
  - Cisco VIC 1225 with ultra low latency networking usNIC driver
• Cisco Nexus 3548
  - 48-port 10 Gbps ultra low latency Ethernet switch

Software:
• OS: CentOS 6.4, kernel 2.6.32-358.el6.x86_64 (SMP)
• NetPIPE (ver 3.7.1)
• Intel MPI Benchmarks (ver 3.2.4)
• High Performance Linpack (ver 2.1)
• Other: Intel C Compiler (ver 13.0.1), Open MPI (ver 1.6.5), Cisco usNIC (1.0.0.7x)
NetPIPE (point-to-point through a Nexus 3548 switch)
[Chart: Cisco usNIC latency (µs) and throughput (Mbps) vs. message size, 1 byte to ~6 MB]
• 2.16 µs latency for small messages
• 9.3 Gbps throughput
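The two ends of the NetPIPE curve fit the usual first-order model t(n) ≈ L + n/B: small messages are latency-bound, large ones bandwidth-bound. Using only the two headline numbers above, the crossover message size (where serialization time equals the fixed latency) works out to roughly 2.5 KB:

```python
L = 2.16e-6      # seconds: small-message latency
B = 9.3e9 / 8    # bytes/second, from the 9.3 Gbps peak throughput

def transfer_time(n_bytes):
    """First-order model: fixed latency plus serialization time."""
    return L + n_bytes / B

# Crossover size where n/B == L, i.e. n* = L * B
n_star = L * B
print(round(n_star))  # 2511 bytes: below this, latency dominates
```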
Intel MPI Benchmark: Point-to-Point (PingPong & PingPing)
[Chart: PingPong and PingPing throughput (MB/s) and latency (µs) vs. message size, 4 bytes to 4 MB]
• 2.16 µs PingPong latency
• 2.21 µs PingPing latency
• PingPing and PingPong latency track together!
Intel MPI Benchmark: Point-to-Point (SendRecv & Exchange)
[Chart: SendRecv and Exchange throughput (MB/s) and latency (µs) vs. message size, 4 bytes to 4 MB]
• 2.22 µs SendRecv latency
• 2.69 µs Exchange latency
• Full bi-directional performance for both Exchange and SendRecv
High Performance Linpack
[Chart: HPL GFLOPS vs. number of CPU cores, 16 to 512]

GFLOPS = FLOPS/cycle x num CPU cores x freq (GHz)
E5-2690 max: 8 x 16 x 3.3 = 422.4 GFLOPS per node

Single-node HPL score (16 cores): 340.51 GFLOPS*
32-node HPL score (512 cores): 9,773.45 GFLOPS
Efficiency based on single-machine score: 9,773.45 / (340.51 x 32) x 100 = 89.69%

* Score may improve with additional compiler settings or newer compiler versions
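The efficiency arithmetic can be checked in a couple of lines. The node-level efficiency against theoretical peak is added here for context; only the 89.69% scaling figure is quoted on the slide:

```python
peak_per_node = 8 * 16 * 3.3  # FLOPS/cycle x cores x GHz = 422.4 GFLOPS
single_node = 340.51          # measured HPL, 1 node / 16 cores
cluster = 9773.45             # measured HPL, 32 nodes / 512 cores

node_eff = 100 * single_node / peak_per_node      # vs. theoretical peak
scaling_eff = 100 * cluster / (single_node * 32)  # vs. perfect scaling

print(round(node_eff, 2), round(scaling_eff, 2))  # 80.61 89.69
```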
Conclusions
• Cisco usNIC with the Cisco Nexus 3548 switch offers 2.16 µs latency for small messages with Open MPI
• Cisco usNIC with the Cisco Nexus 3548 switch delivers up to 89.69% HPL scaling efficiency across 512 cores
• Cisco usNIC is integrated with open source Open MPI
• Cisco usNIC offers ultra low latency networking over standard Ethernet, suitable for HPC applications
Thank you.