Why Cisco is Awesome for Your Next HPC Cluster
Jeff Squyres, Cisco Systems, Inc.
© 2013 Cisco and/or its affiliates. All rights reserved. Cisco Public

Cisco EuroMPI'13 vendor session presentation


Page 1: Cisco EuroMPI'13 vendor session presentation


Why Cisco is Awesome for Your Next HPC Cluster

Jeff Squyres, Cisco Systems, Inc.

Page 2: Cisco EuroMPI'13 vendor session presentation


Yes, we sell servers now

Page 3: Cisco EuroMPI'13 vendor session presentation


Cisco UCS servers:
• Record-setting Intel Ivy Bridge 1U and 2U servers
• Cisco 2 x 10Gb VIC
• Ultra low latency Ethernet (yes, really!)
• Cisco 40Gb Nexus top-of-rack and core switches

Page 4: Cisco EuroMPI'13 vendor session presentation


Cisco Unified Compute System (UCS) Servers

Cisco UCS: Many Server Form Factors, One System

Rack:
• UCS C220 M3: ideal for HPC compute-intensive applications (2-socket)
• UCS C240 M3: perfect as HPC cluster head nodes or IO nodes (2-socket)
• UCS C420 M3: 4-socket rack server for large-memory compute workloads

Blade:
• UCS B200 M3: blade form factor, 2-socket
• UCS B420 M3: 4-socket blade for large-memory compute workloads

Industry-leading compute without compromise

Page 5: Cisco EuroMPI'13 vendor session presentation


In Only Four Years…

UCS is impacting the growth of established vendors like HP:
• Legacy offerings are flat-lining or in decline
• Cisco's growth is out-pacing the market
• Customers have shifted 19.3% of the global x86 blade server market to Cisco, and over 26% in the Americas (Source: IDC Worldwide Quarterly Server Tracker, Q1 2013 Revenue Share, May 2013)

Worldwide x86 Server Blade Market Share

Demand for data center innovation has vaulted Cisco Unified Computing System (UCS) to the #2 leader in the fast-growing x86 server market segment.

Market appetite for innovation fuels UCS growth: UCS is #2 and climbing.

Page 6: Cisco EuroMPI'13 vendor session presentation


Customers Have Spoken

• Maintained #2 in North America (27.9%) and #2 in the US (28.3%)¹
• UCS x86 blade server revenue grew 35% year-over-year in Q1CY13¹
• Advanced to #2 worldwide in x86 blades with 19.3% share
• UCS momentum is fueled by game-changing innovation; Cisco is quickly passing established players

UCS #2 in Only Four Years

x86 Server Blade Market Share, Q1CY13¹

UCS #2 with 26.9%
[Chart: revenue share by vendor: HP, Cisco, IBM, Dell, NEC, Hitachi, Fujitsu, Oracle; axis 0-50%]

Worldwide: UCS #2 with 19.3%
[Chart: revenue share by vendor: HP, Cisco, IBM, Dell, SGI, Oracle; axis 0-45%]

Source: ¹ IDC Worldwide Quarterly Server Tracker, Q1 2013, May 2013, Revenue Share

Page 7: Cisco EuroMPI'13 vendor session presentation


Cisco UCS Performance: 80 Records
A History of World Record Performance on Industry Standard Benchmarks

• Best CPU performance: 16 world records
• Best virtualization & cloud performance: 8 world records
• Best database performance: 9 world records
• Best enterprise application performance: 18 world records
• Best enterprise middleware performance: 14 world records
• Best HPC performance: 15 world records

Page 8: Cisco EuroMPI'13 vendor session presentation


Why have 2 HPC networks?

One wire to rule them all:
• Commodity traffic (e.g., ssh)
• Cluster / hardware management
• File system / IO traffic
• MPI traffic

10G or 40G with real QoS
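Consolidating commodity, management, storage, and MPI traffic onto one wire depends on per-class QoS. As an illustrative, non-Cisco-specific sketch, an application can mark its packets with a DSCP class through the standard IP_TOS socket option, and the switch can then prioritize those classes:

```python
import socket

# DSCP Expedited Forwarding (EF, value 46) shifted into the upper 6 bits
# of the IP TOS byte; EF is commonly used for latency-sensitive traffic.
DSCP_EF = 46 << 2  # 0xB8

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF)

# Read the marking back to confirm the kernel accepted it.
tos = sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
print(tos)  # 184
sock.close()
```

All subsequent datagrams sent on this socket carry the EF marking, so MPI-style traffic could be queued ahead of bulk file-system traffic sharing the same link.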

Page 9: Cisco EuroMPI'13 vendor session presentation


Cisco Nexus Datacenter Switches

Low latency, high density 10 / 40Gb switches

• Nexus 3548: 190ns port-to-port latency (L2 and L3); 48 x 10Gb / 12 x 40Gb ports; created for HPC / HFT
• Nexus 6004: 1μs port-to-port latency; 384 x 10Gb / 96 x 40Gb ports

Cisco Nexus: years of experience rolled into dependable solutions

Page 10: Cisco EuroMPI'13 vendor session presentation


HPC Topologies: 2-Tier Spine/Leaf Examples

Spine / Leaf

Characteristics:
• 3 hops
• Low oversubscription, up to non-blocking
• < ~3.5 μs depending on config and workload
• 10G or 40G capable
• Spine: 4 to 16 wide
• Leaf count determined by spine density

Spine - Leaf             | Port Scale    | Oversub | Latency                      | Spines | Leafs
10G Fabric 6004 - 6001   | 18,432 x 10G  | 3:1     | ~3 μs, cut-through           | 16     | 384
40G Fabric 6004 - 6004   | 7,680 x 40G   | 5:1     | ~3 μs, cut-through           | 16     | 96
Mixed Fabric 6004 - 6001 | 4,680 x 10G   | 3:1     | ~3 μs, store-and-forward     | 4      | 96
10G Fabric 6004 - 3548   | 12,288 x 10G  | 3:1     | ~1.5 μs, cut-through         | 16     | 384
40G Fabric 6004 - 3548   | 1,152 x 40G   | 1:1     | ~1.5 μs, cut-through         | 6      | 96
Mixed Fabric 6004 - 3548 | 3,072 x 10G   | 3:1     | ~1.5 μs, store-and-forward   | 4      | 96

…many other configurations are also possible
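The port-scale and oversubscription figures above follow from simple arithmetic. A toy sketch, under the assumption that every leaf dedicates a fixed number of ports to hosts and the rest as uplinks (the 48-host-port / 16-uplink split below is an assumption chosen to be consistent with the first table row):

```python
def fabric_scale(spines, leafs, host_ports_per_leaf, uplinks_per_leaf):
    """Total host ports and oversubscription ratio for a 2-tier
    spine/leaf fabric where every leaf connects to every spine."""
    # Uplinks must spread evenly across the spines.
    assert uplinks_per_leaf % spines == 0
    total_host_ports = leafs * host_ports_per_leaf
    oversub = host_ports_per_leaf / uplinks_per_leaf
    return total_host_ports, oversub

# Hypothetical leaf config matching the first row (6004 spine, 6001 leaf):
ports, ratio = fabric_scale(spines=16, leafs=384,
                            host_ports_per_leaf=48, uplinks_per_leaf=16)
print(ports, ratio)  # 18432 3.0
```

384 leafs x 48 host ports reproduces the 18,432 x 10G scale, and 48 down / 16 up per leaf gives the 3:1 oversubscription in the table.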

Page 11: Cisco EuroMPI'13 vendor session presentation


HPC Topologies: 3-Tier Spine/Leaf (3 or 5 hops), Max Scale Examples

Leaf / Spine1 / Spine2 (two spine layers)

Spine2 - Spine1 - Leaf          | Port Scale    | Oversub | Latency                        | Spine2 | Spine1 | Leafs
10G Fabric 6004 - 6004 - 6001   | 55,296 x 10G  | 3:1     | ~3-5 μs, cut-through           | 48     | 16 x 6 | 192
40G Fabric 6004 - 6004 - 6004   | 23,040 x 40G  | 5:1     | ~3-5 μs, cut-through           | 48     | 16     | 48
Mixed Fabric 6004 - 6004 - 6001 | 18,432 x 10G  | 3:1     | ~3-5 μs, store-and-forward     | 32     | 4 x 8  | 48
10G Fabric 6004 - 6004 - 3548   | 24,576 x 10G  | 2:1     | ~1.5-3.5 μs, cut-through       | 32     | 16 x 4 | 192
40G Fabric 6004 - 6004 - 3548   | 2,304 x 40G   | 1:1     | ~1.5-3.5 μs, cut-through       | 24     | 6 x 8  | 48
Mixed Fabric 6004 - 6004 - 3548 | 9,216 x 10G   | 2:1     | ~1.5-3.5 μs, store-and-forward | 24     | 6 x 8  | 48

Characteristics:
• 3 hops within a pod, 5 hops for DC east-west traffic
• Low oversubscription, up to non-blocking
• < ~3.5 μs depending on config and workload
• 10G or 40G capable
• Two spine layers

Page 12: Cisco EuroMPI'13 vendor session presentation


Ultra Low Latency Ethernet:

Userspace NIC (usNIC)

Page 13: Cisco EuroMPI'13 vendor session presentation


Cisco Userspace NIC (usNIC) overview

• Direct access to NIC hardware from Linux userspace
  - Operating system bypass
  - Via the Linux Verbs API (UD)
• Utilizes the Cisco Virtual Interface Card (VIC) for ultra-low Ethernet latency
  - 2nd-generation 80Gbps Cisco ASIC
  - 2 x 10Gbps Ethernet ports
  - 2 x 40Gbps coming in Q4 2013
  - PCI and mezzanine form factors
• Half-round-trip (HRT) ping-pong latencies:
  - Back to back: 1.7μs
  - Through N3548: 1.9μs
  - Through MPI + N3548: 2.16μs (*)

Page 14: Cisco EuroMPI'13 vendor session presentation


Software architecture comparison

[Diagram: side-by-side software stacks]
• TCP/IP path: application → userspace sockets library → kernel TCP stack → general Ethernet driver → Cisco VIC hardware
• usNIC path: application → userspace verbs library → Cisco VIC hardware on the send/receive fast path; the Verbs IB core and Cisco usNIC kernel driver are used only for bootstrapping and setup

Page 15: Cisco EuroMPI'13 vendor session presentation


Ethernet OS bypass applied to MPI

[Diagram: MPI atop the userspace verbs library atop Cisco VIC hardware]
• MPI directly injects L2 frames to the network
• MPI receives L2 frames directly from the VIC

Page 16: Cisco EuroMPI'13 vendor session presentation


OS Bypass Architecture

[Diagram: MPI processes, each with queue pairs (QPs), talking directly to an SR-IOV NIC (the VIC, with a classifier for inbound L2 frames); the x86 chipset's VT-d IO MMU sits between the processes and the device; outbound L2 frames go from the QPs straight to the wire]

Page 17: Cisco EuroMPI'13 vendor session presentation


PCIe Single Root IO Virtualization (SR-IOV)

[Diagram: the VIC exposes one Physical Function (PF) per physical port, with MAC addresses aa:bb:cc:dd:ee:ff and aa:bb:cc:dd:ee:fe; each PF carries multiple Virtual Functions (VFs), and queue pairs (QPs) attach to the VFs]

Page 18: Cisco EuroMPI'13 vendor session presentation


PCIe Single Root IO Virtualization (SR-IOV)

[Diagram: each MPI process binds to a VF beneath a PF / physical port; the process's queue pairs (QPs) reach the VF through the Intel IO MMU]
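On Linux, SR-IOV physical functions and their virtual functions are visible in sysfs regardless of NIC vendor. A small, hypothetical enumeration sketch (it returns an empty list on hosts without SR-IOV hardware):

```python
import glob
import os

def list_sriov_pfs():
    """Return (PCI address, configured VFs, max VFs) for each
    SR-IOV-capable physical function exposed in sysfs."""
    pfs = []
    # Every SR-IOV-capable PF exposes an sriov_numvfs attribute.
    for numvfs_path in glob.glob("/sys/bus/pci/devices/*/sriov_numvfs"):
        dev = os.path.dirname(numvfs_path)
        with open(numvfs_path) as f:
            current_vfs = int(f.read())
        with open(os.path.join(dev, "sriov_totalvfs")) as f:
            total_vfs = int(f.read())
        pfs.append((os.path.basename(dev), current_vfs, total_vfs))
    return pfs

print(list_sriov_pfs())
```

Writing a VF count to `sriov_numvfs` (as root) is how an administrator would carve a PF into the VFs that processes or VMs then bind to.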

Page 19: Cisco EuroMPI'13 vendor session presentation


Open Source

• Everything above the firmware is open source
• Open MPI: distributed as Cisco Open MPI 1.6.5; upstream in Open MPI 1.7.3
• libibverbs plugin
• Verbs kernel module

Page 20: Cisco EuroMPI'13 vendor session presentation


Performance

Hardware:
• Cisco UCS C220 M3 rack server
  - Intel E5-2690 processor, 2.9 GHz (3.3 GHz Turbo), 2 sockets, 8 cores/socket
  - 1600 MHz DDR3 memory, 8 GB x 16, 128 GB installed
  - Cisco VIC 1225 with ultra-low-latency networking usNIC driver
• Cisco Nexus 3548
  - 48-port 10 Gbps ultra-low-latency Ethernet switch

Software:
• OS: CentOS 6.4, kernel 2.6.32-358.el6.x86_64 (SMP)
• NetPIPE 3.7.1
• Intel MPI Benchmarks 3.2.4
• High Performance Linpack 2.1
• Other: Intel C Compiler 13.0.1, Open MPI 1.6.5, Cisco usNIC 1.0.0.7x

Page 21: Cisco EuroMPI'13 vendor session presentation


NetPIPE (point-to-point through the Nexus 3548 switch)

[Chart: Cisco usNIC latency (μs, log scale 1-10,000) and throughput (Mbps, 0-10,000) vs. message size (1 byte to 6 MB)]

2.16 usecs latency for small messages

9.3 Gbps Throughput
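Half-round-trip (HRT) latency, as NetPIPE reports it, comes from a ping-pong loop: send a message, wait for the echo, and halve the averaged round-trip time. A minimal sketch of the same methodology over a local socket pair (expect tens of microseconds here, far above usNIC's hardware path):

```python
import socket
import threading
import time

def echo_server(sock, iters, size):
    # Echo every message straight back, like the partner rank in a ping-pong.
    for _ in range(iters):
        buf = b""
        while len(buf) < size:
            buf += sock.recv(size - len(buf))
        sock.sendall(buf)

def half_round_trip_us(iters=1000, size=1):
    a, b = socket.socketpair()
    t = threading.Thread(target=echo_server, args=(b, iters, size))
    t.start()
    msg = b"x" * size
    start = time.perf_counter()
    for _ in range(iters):
        a.sendall(msg)
        buf = b""
        while len(buf) < size:
            buf += a.recv(size - len(buf))
    elapsed = time.perf_counter() - start
    t.join()
    a.close()
    b.close()
    # HRT = (total time / iterations) / 2, converted to microseconds.
    return elapsed / iters / 2 * 1e6

print(round(half_round_trip_us(), 1))
```

The same loop over usNIC, with the verbs fast path replacing the kernel socket calls, is what yields the 2.16 μs figure above.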

Page 22: Cisco EuroMPI'13 vendor session presentation


Intel MPI Benchmark: Point-to-Point (PingPong & PingPing)

[Chart: PingPong and PingPing throughput (MB/s, 0-1200) and latency (μs, log scale) vs. message size (4 bytes to 4 MB)]

2.16 μs PingPong latency; 2.21 μs PingPing latency

PingPing and PingPong Latency track together!

Page 23: Cisco EuroMPI'13 vendor session presentation


Intel MPI Benchmark: Point-to-Point (SendRecv & Exchange)

[Chart: SendRecv and Exchange throughput (MB/s, 0-2400) and latency (μs, log scale) vs. message size (4 bytes to 4 MB)]

2.22 μs SendRecv latency; 2.69 μs Exchange latency

Full Bi-directional Performance for both Exchange and SendRecv

Page 24: Cisco EuroMPI'13 vendor session presentation


High Performance Linpack

[Chart: HPL GFLOPS (0-12,500) vs. number of CPU cores (16 to 512)]

GFLOPS = FLOPS/cycle × number of CPU cores × frequency (GHz)
E5-2690 node peak = 8 × 16 × 3.3 = 422.4 GFLOPS

Single-node HPL score (16 cores): 340.51 GFLOPS*
32-node HPL score (512 cores): 9,773.45 GFLOPS

Efficiency relative to the single-node score: 9,773.45 / (340.51 × 32) × 100 = 89.69%

* Score may improve with additional compiler settings or newer compiler versions
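The peak and efficiency arithmetic on this slide can be reproduced directly from the quoted numbers:

```python
# Theoretical peak: FLOPS/cycle x cores x frequency (values from the slide).
flops_per_cycle = 8   # AVX double precision on the E5-2690
cores = 16            # 2 sockets x 8 cores per node
freq_ghz = 3.3        # max turbo frequency
peak_gflops = flops_per_cycle * cores * freq_ghz
print(peak_gflops)    # 422.4

# Cluster efficiency relative to the measured single-node score.
single_node = 340.51  # GFLOPS, 16 cores
cluster = 9773.45     # GFLOPS, 32 nodes / 512 cores
nodes = 32
efficiency = cluster / (single_node * nodes) * 100
print(f"{efficiency:.2f}")  # 89.69
```

Scaling efficiency is measured against the single-node score rather than theoretical peak, so it isolates interconnect and scaling losses from per-node compute losses.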

Page 25: Cisco EuroMPI'13 vendor session presentation


Conclusions

• Cisco usNIC with the Cisco Nexus 3548 switch offers 2.16 μs latency for small messages with Open MPI
• Cisco usNIC with the Cisco Nexus 3548 switch achieves up to 89.69% HPL efficiency across 512 cores
• Cisco usNIC support is integrated with open-source Open MPI
• Cisco usNIC offers ultra-low-latency networking over standard Ethernet, suitable for HPC applications

Page 26: Cisco EuroMPI'13 vendor session presentation

Thank you.