Scaling Applications in a Clustered Environment

BRKAPP-2011

© 2008 Cisco Systems, Inc. All rights reserved. Cisco Public. 14413_04_2008_c1

faculty.ccc.edu/mmoizuddin/CISCO LIVE 2008/APP/BRKAPP-2011.pdf


Agenda

Cluster Basics

Database Clusters
  Oracle RAC Implementation

Financial Clusters
  Market Feed/Algorithmic Trading
  Compute Cluster

High Performance Computing Clusters
  HPC Applications
  Parallel Applications
  Messaging

Data Delivery
  NAS, Clustered NAS
  Block/File Parallel File Systems, Object-Based Parallel File Systems

Three Facets of Latency

Cluster Basics


What Is Clustering?

Cluster: two or more interconnected computers that provide:
  Application High Availability
  Load Distribution
  Distributed and High Performance Computing (HPC)

Clustering can be implemented at different levels of the system:
  Storage Abstraction: shared disk, mirrored disk, and shared nothing
  Operating systems: UNIX/Linux server clusters, Microsoft clustering
  APIs: PVM, MPI, DAPL
  Applications (includes Database)

Three Major Categories
  Compute intensive
  I/O intensive
  Transaction intensive

Defining Clustered Servers

Database

Financial Trading and Compute

High Performance Computing

Figure labels: Application Servers; Multi-Protocol Gateway; IBM DB2 Parallel; Oracle RAC; IP Infrastructure; Database Clusters on Ethernet; Server Switch Fabric; IBM DB2 Parallel; MySQL Cluster; Database Clusters on InfiniBand; Application Servers; Storage Network


Figure labels: IP Multicast; High Bandwidth, Low Latency; Security Services; Bandwidth Control; NFS/PVFS/NAS; SAN; Management Network; I/O Network; Master Node; Processor Farms #1 to #6; InfiniBand Attached Storage; GRID/HPC Computing; Fabric Hosted Applications; Fabric Assisted Applications; Storage Virtualization; Data Replication Services; Application Control Engine; SSL/IPSec VPN; Server Load Balancing; Application Message Services

Database Cluster Oracle Implementation


Oracle RAC in the Data Center

Clustered Implementation (RAC)

Latency sensitivity for inter-process communications

Bandwidth sensitivity for data delivery

Interconnect density and bandwidth
  10 Gbps solutions: 10G Ethernet or InfiniBand

Oracle RAC

Development started in 2002, Oracle 9 RAC

Implementations supporting Ethernet and InfiniBand

Offload implementations for user space UDP
  Ethernet: Solarflare (formerly Level 5) NIC
  InfiniBand: DAPL (uDAPL) chosen due to its network-independent model
    Required changing the IPC communications infrastructure for IB
    Eventually discarded due to massive internal code change requirements

Oracle RAC Optimization

DB IPC communication acceleration

DB to App tier potential acceleration with a 10 Gig class network

Oracle 11i AS actually has the ability to leverage SDP in Asynchronous I/O mode (RDMA) with IB and using iWARP for IB and Ethernet with OFED 1.2

Oracle 10g uses UDP—IB will use IPoIB-CM

Oracle 11g RAC: RDS standard within OFED 1.3

Basic Multi-Tier Oracle Environment

Figure: WEB, APP, and DB tiers

Oracle Bottlenecks

Storage IOPS

DB IPC

APP/DB IPC

Blade Servers

Blade servers are the Gillette of the server world
  Buy one chassis
  Plug in your blades
  When they get dull, just swap them out for sharp ones
  Any “dumb” datacenter tech can swap a blade

Datacenter in a box
  Ethernet and high speed connections in one box
  Most blades are limited to two high speed ports of the same type (usually just a single host adapter)

Basic Multi-Tier Oracle Environment with Blade Servers

Figure: WEB, APP, and DB tiers, with a “?” marking the open interconnect choice

Blade Servers

Ethernet and Fiber Channel, or Ethernet and InfiniBand

Ethernet/Fiber Channel and Blades?

Figure: WEB, APP, and DB tiers; labels 10GbE TOE and 10GbE iWARP

Ethernet/InfiniBand and Blades?

Figure: WEB, APP, and DB tiers

Sample Design—Blade Servers Accelerated IPC

Blade servers using 10GbE for IPC and storage access

Use Multi-Fabric I/O (MFIO) for storage access

MFIO could be used for App tier access as well, but this is uncommon because low-cost Ethernet interfaces are available

Optimized IPC and I/O for Oracle and Blade Servers

Figure: WEB, APP, and DB tiers

Optimized IPC Multi-Tier Oracle Environment

Figure: WEB, APP, and DB tiers

Fully-Optimized Multi-Tier Oracle Environment

Figure: WEB, APP, and DB tiers

Financial Trading and Compute Clusters


Two Key Areas of Cluster Computing in the Financial Banking World

Algorithmic trading
  Up to hundreds of machines
  End-to-end latency is king, but not just low latency: latency deviation is just as critical

Compute machines for pricing and risk analysis
  10,000s to 100,000s of machines

“In any gun fight, it’s not enough just to shoot fast or to shoot straight. Survival depends on being able to do both… The lone gunslinger of the open-outcry trading floors is rapidly being replaced by ultra-fast, computerized trading systems which are more akin to robots with machine guns.”

IBM Report, “Tackling Latency: The Algorithmic Arms Race”

Algorithmic Trading


Deterministic Performance

#1 problem in financial trading environments

Financials don’t care about MIN(latency) or AVG(latency), but STDDEV(latency) at the application level

A single frame dropped in a switch or adapter has a significant impact on performance
  TCP NACK delayed by up to 125 ms with most NICs when interrupt throttling is enabled
  TCP window shortened
  TCP retransmit timeout: 500 ms in the standard, usually 200 ms in implementations
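A small sketch of why STDDEV(latency) is the figure to watch. The sample values are invented for illustration: a baseline of about 50 μs per message, and one message that waits out a 200 ms retransmit timeout. The function and variable names are hypothetical.

```python
import statistics


def latency_summary(samples_us):
    """Return MIN, AVG and population STDDEV of latency samples (microseconds)."""
    return {
        "min": min(samples_us),
        "avg": statistics.fmean(samples_us),
        "stddev": statistics.pstdev(samples_us),
    }


# Assumed, illustrative samples: 1,000 messages at roughly 50 us each.
steady = [50.0] * 999 + [51.0]
# Same feed, but one message waits for a 200 ms (200,000 us) retransmit timeout.
one_retransmit = [50.0] * 999 + [200_000.0]

for name, samples in (("steady", steady), ("one retransmit", one_retransmit)):
    s = latency_summary(samples)
    print(f"{name:15s} min={s['min']:.1f} avg={s['avg']:.1f} stddev={s['stddev']:.1f}")
```

MIN is identical in both runs and AVG moves by a small factor, while STDDEV grows by several orders of magnitude, which is why a single dropped frame matters so much at the application level.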


Why Is Latency/Performance a Problem?

System latency delays the response to changing market conditions, which causes significant loss of opportunity for trade execution and affects trading strategies

Figure labels, in extraction order: Exchange Systems; Trade Price; Market Data Supplier; Distribution Platform; Trading Engine; Risk Software; Exchange Systems; Exec Trade

Latency is introduced by the exchange and the supplier

Most trading houses' systems are no different

The goal of a low-latency solution is to provide the required level of capacity to support current and future market volumes while minimizing latency

Traffic Growth Next 12 Months

Figure: chart of gigabits per day (y-axis, 0 to 4500) against date (x-axis, monthly from 08/11/2006 to 08/11/2008). Series: gb_received, gb_sent, Linear (gb_received), Linear (gb_sent). Annotation: CPU Problems

These are estimates of average data rates if no changes are made to the environment

The Trading Challenge


Market Data—Algorithmic Trading


Financial Compute Cluster

Data intensive

Latency insensitive

Scatter-Gather type work

Post trade analysis

Feed back into the risk engine for algorithmic trading

As successful trades increase, post trade analysis and feedback mechanisms increase

Parametric Computing


High-Performance Computing Clusters


HPC Network Communication

Access Network

Management Network

IPC Network

Storage Network

User Access

Management

Storage

IPC


Access Network (Public)

Communications to/from external resources

Security

QoS

Availability


Management Network (Private)


Communications between master and slave nodes

Heartbeat

Small to medium-sized HPC clusters commonly consolidate Access and Management Networks


IPC Network (MPI)


Communications between nodes during run time

IPC Network and Management Network may be the same physical network


Storage Network


Access to stored data

NAS—file-level access

SAN—block-level access


HPC Network Consolidation


Network Design Considerations


HPC Cluster Components

Applications
Communication Libraries
Device Driver
OS and Kernel
CPU and Bus Technologies
Network Interfaces (NICs, HCAs)
Interconnect Network(s)
Storage
Languages and Compilers
File Systems
Physical
Middleware and Scheduling
Users

Two Key Concepts—Terminology

Capacity
  The ability to provide predictable (and large) computation throughput for experimentation, production runs, and testing

Capability
  The ability to provide peak power for a specific amount of time so as to solve a problem within a guaranteed time window

The two require different architectural solutions, but in practice the same infrastructure must deliver both; this leads naturally to concepts like virtualization, grid computing, and dynamic provisioning

What Do We Need to Know?

Application characteristics

Cluster size

Network/switch characteristics

Node configuration

Node communication considerations

Node interconnect


Application Characteristics

Is the application:
  Latency sensitive?
  Bandwidth sensitive?

Real time or batch

Response time requirements

Storage
  DAS, NAS, SAN

Physical attachment
  FC, Ethernet, IB

File system
  Parallel Virtual File System, Lustre, NFS

Analyze Application(s)

Application requirements
  Network traffic: high or low
  Bandwidth: high or low
  Latency: high or low
  CPU architecture: AMD or Intel
  Operating system: Windows, Linux, or Solaris
  Storage system: file system, NAS, or SAN
  Form factor

Application Mix

Parallel
  Tightly Coupled
  Loosely Coupled

Parametric

Serial


Job Mix

Running multiple applications in parallel

Running multiple copies of the same non-parallel application with different inputs (parametric execution)
  Parametric execution is widely used in HPC and accounts for more than 70% of cluster usage

Running multiple serial applications on one node, or one core per serial application run
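Parametric execution can be sketched in a few lines. This is only an illustration: `toy_job` is a hypothetical stand-in for a serial application, and a thread pool on one machine stands in for the cluster nodes and scheduler that would run the copies in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def toy_job(parameter):
    """Hypothetical stand-in for one serial application run with one input."""
    return parameter * parameter


def run_parametric(job, inputs, workers=4):
    """Run the same serial job once per input; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, inputs))


if __name__ == "__main__":
    # Each input is an independent run; no run communicates with another.
    print(run_parametric(toy_job, [1, 2, 3, 4]))
```

Because the runs never talk to each other, this kind of job mix puts little load on the IPC network and mostly stresses scheduling and data delivery.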


Determine Cluster Size

How many nodes can the application support?

How many concurrent users?

How large are their projects?

How much speedup can the application achieve?

Load Balancing

High Availability
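One common way to reason about the speedup question is Amdahl's law, which is not named in the slides and is used here only as an illustrative assumption. It bounds the speedup of a job whose parallel fraction is `parallel_fraction` when run on `nodes` nodes; both names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, nodes):
    """Upper bound on speedup under Amdahl's law.

    parallel_fraction: share of the run time that can be parallelized (0.0 to 1.0)
    nodes: number of nodes (or cores) the parallel part is spread over
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / nodes)


if __name__ == "__main__":
    for n in (8, 64, 512):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

With 95% of the work parallelizable, the bound approaches 20x no matter how many nodes are added, which is one reason cluster size should follow from the application rather than the other way around.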


Determine Network Architecture

Ethernet Only: Ethernet (IPC, Mgmt and Access)
  Most popular choice for smaller clusters
  1:1, 2:1, 4:1 blocking (depending upon app)

Ethernet (Mgmt/Access) and InfiniBand (IPC)
  Most common in larger clusters
  1:1 or 2:1 typical oversubscription in IB IPC
  8:1 to 16:1 typical oversubscription in Ethernet to Access nodes

Ethernet (Mgmt) and InfiniBand (IPC + Access)
  Used for IB attached storage
  1:1, 2:1, 4:1 typical oversubscription in IB fabric
  16:1 or higher oversubscription for Ethernet management
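The oversubscription ratios quoted for each architecture are the ratio of host-facing bandwidth to uplink bandwidth on an edge switch. A minimal sketch follows; the port counts and speeds in the example are assumptions chosen to land inside the quoted ranges, not figures from the slides.

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of host-facing bandwidth to uplink bandwidth on one edge switch.

    A result of 1.0 means non-blocking (1:1); 2.0 means 2:1, and so on.
    """
    if up_ports < 1 or up_gbps <= 0:
        raise ValueError("at least one uplink with positive bandwidth is required")
    return (down_ports * down_gbps) / (up_ports * up_gbps)


if __name__ == "__main__":
    # Hypothetical IB leaf: 16 host ports and 8 uplinks, all at 10 Gbps.
    print(f"IB IPC: {oversubscription(16, 10, 8, 10):g}:1")
    # Hypothetical Ethernet access switch: 48 x 1 Gbps hosts, 4 x 1 Gbps uplinks.
    print(f"Ethernet access: {oversubscription(48, 1, 4, 1):g}:1")
```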


Determine Network Topology

If beyond a single switch, use a fat tree/Clos style network design

Fat Tree
  Based on a non-blocking architecture and equivalent sized non-blocking switch “building blocks”
  Sometimes combined with a star architecture to provide a hybrid network
  Figure labels: Core/Spine, Leaf/Edge

Star
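As a rough sizing aid, a two-tier non-blocking fat tree built from identical switches can be worked out from the switch port count alone. The sketch below assumes every leaf splits its ports evenly between hosts and uplinks and connects once to every spine; the function name and the 24-port example are illustrative assumptions.

```python
def two_tier_fat_tree(ports_per_switch):
    """Size a two-tier non-blocking fat tree built from identical switches.

    Each leaf uses half its ports for hosts and half for uplinks, one uplink
    per spine. Each spine port connects to a different leaf.
    """
    if ports_per_switch < 2 or ports_per_switch % 2:
        raise ValueError("ports_per_switch must be an even number >= 2")
    half = ports_per_switch // 2
    return {
        "leaf_switches": ports_per_switch,   # one per spine port
        "spine_switches": half,              # one per leaf uplink
        "hosts_per_leaf": half,
        "max_hosts": ports_per_switch * half,
    }


if __name__ == "__main__":
    print(two_tier_fat_tree(24))
```

Under these assumptions the host count grows with the square of the port count, so larger clusters need either bigger building blocks or a third tier.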


Calculate Cluster Design

Figure labels: I/O Nodes; Compute Nodes; n Gbps; m Gbps; Minimum Bisection; Mgmt and I/O Traffic; IPC Traffic

High-Performance Computing Solution

Interprocessor Communication (IPC) Network
  Low latency, high bandwidth on standard open source MPI over InfiniBand network
  Cisco Catalyst® switching for TCP-based applications, which can benefit from policing, QoS and multicast

Management and I/O Network
  Used for job scheduling, network monitoring
  TCP- or UDP-based; benefits from Quality of Service and Multicast
  NetFlow reporting, NSF/SSO for high availability

Storage Network
  NAS or iSCSI over Ethernet fabric
  IB-attached storage for lower storage overhead
  Fiber Channel storage with data replication, integrated applications

Data Delivery


Storage Access Protocols and Technologies

Storage Type | Block or File Access | Server Access | Back-End Storage Access
NAS | File | Ethernet or InfiniBand Gateway | SCSI or Fiber Channel
Cluster NAS | File | Ethernet or InfiniBand Gateway | SCSI or Fiber
Parallel File System | File/Block | Ethernet or InfiniBand | DAS or Fiber Channel
Parallel File System | Object | Ethernet or InfiniBand | DAS, Fiber Channel or iSCSI
SAN | Block | FC or InfiniBand Gateway | Fiber Channel, InfiniBand or iSCSI

Network Attached Storage (NAS)

Attaches via connections to the network using Gigabit and 10Gigabit Ethernet

There are a few NAS vendors using IB as the interconnect of choice

Primarily uses NFS (the only standards-based file system in this space)

Performs well for small clusters but does not scale well

Single point of access and single point of failure


Clustered NAS

Attaches via connections to the network using Gigabit and 10Gigabit Ethernet

Where the traditional NAS or NFS solution uses a single filer or server, a cluster NAS solution utilizes several heads with storage that is connected directly to the heads or via some type of storage network (fiber channel)

Each of the filer heads can only access the storage assigned to it and not the storage assigned to other filers


Access is limited to assigned storage

All filers know the location of data, regardless of which storage and filer the data is located on

Depending on the implementation, data access occurs either via a process that moves data from one filer to another or via an NFS gateway process with a parallel file system on the back end

Parallel File Systems

Attached via connections to the network using Gigabit Ethernet, 10Gigabit Ethernet and InfiniBand

Provides multiple or parallel access to storage nodes also known as I/O nodes

PFS nodes have access to direct attached storage

Implementations are file/block-based and/or object-based


For file/block-based systems, the metadata service is one of the key bottlenecks to scalability

Example: file write requests are made to the metadata server, which allocates the block(s); the compute node then sends the data to the metadata server, which sends the data to the file system and then to disk

Metadata services are either a dedicated or a shared/clustered implementation

Parallel File Systems—Object-Based

Metadata services are used but have limited functionality

I/O nodes use protocols to manage the location of data as the nodes are not just storage bricks

If we follow the process as we did in the file/block-based solution:

The compute node will first contact the metadata service regarding a file operation

The metadata service then contacts the storage devices and, based upon the protocols in the file system, identifies where the object can be stored

The metadata server then passes back a list of the storage devices that can be used for the file operation

Unlike the file/block-based solution, the metadata service is removed from the file operation, as the node will then write the data directly to an I/O node and then on to storage

The metadata service will monitor the file operations so that the location of the data stays current within the metadata records

Performance of these systems can and will vary based upon any number of variables; the choice of network architecture, interconnect and switch fabric can and will have a significant impact on performance
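The difference between the two write paths can be illustrated with a deliberately crude throughput model. Everything here is an assumption for illustration: the function, its parameters, and the idea that aggregate write rate is simply capped by the narrowest stage. In the file/block path described earlier the data passes through the metadata server, so that server's bandwidth is a cap; in the object-based path it is not.

```python
def aggregate_write_gbps(clients, client_gbps, io_nodes, io_node_gbps,
                         metadata_gbps, object_based):
    """Toy upper bound on aggregate write throughput, in Gbps.

    clients / client_gbps: compute nodes and the link speed of each
    io_nodes / io_node_gbps: I/O nodes and the link speed of each
    metadata_gbps: link speed of the metadata server
    object_based: True if data bypasses the metadata server
    """
    offered = clients * client_gbps
    io_capacity = io_nodes * io_node_gbps
    if object_based:
        # Data goes straight from compute nodes to I/O nodes.
        return min(offered, io_capacity)
    # File/block path as described above: data also crosses the metadata server.
    return min(offered, io_capacity, metadata_gbps)


if __name__ == "__main__":
    common = dict(clients=64, client_gbps=1, io_nodes=8, io_node_gbps=10,
                  metadata_gbps=10)
    print("file/block  :", aggregate_write_gbps(object_based=False, **common))
    print("object-based:", aggregate_write_gbps(object_based=True, **common))
```

Real systems have many more limits (disks, protocol overhead, contention), so the numbers only show the shape of the bottleneck, not expected performance.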


Cluster Performance Design and Latency


Latency

Latency is the time taken for a packet of data to get from one node to another. It is made up of:
1. The time for encoding the packet for transmission and transmitting it
2. The time for that serial data to traverse the network equipment between the nodes, and
3. The time to get the data off the circuit

This is also known as “one-way latency”. A minimum bound on latency is determined by the distance between communicating devices and the speed at which the signal propagates in the circuits (typically 70–95% of the speed of light).

Actual latency is much higher, due to packet processing in networking equipment and other traffic.
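The minimum bound described above is just distance divided by propagation speed. A short sketch, with a hypothetical function name and example distances chosen for illustration:

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in vacuum


def min_one_way_latency_us(distance_m, velocity_factor):
    """Lower bound on one-way latency in microseconds from propagation alone.

    velocity_factor: signal speed as a fraction of the speed of light
    (the text gives 0.70 to 0.95 as typical).
    """
    if not 0.0 < velocity_factor <= 1.0:
        raise ValueError("velocity_factor must be in (0, 1]")
    seconds = distance_m / (velocity_factor * SPEED_OF_LIGHT_M_PER_S)
    return seconds * 1e6


if __name__ == "__main__":
    for distance_m in (100, 10_000, 1_000_000):
        slow = min_one_way_latency_us(distance_m, 0.70)
        fast = min_one_way_latency_us(distance_m, 0.95)
        print(f"{distance_m:>9} m: {fast:.2f} to {slow:.2f} us")
```

Inside a data center the propagation term is well under a microsecond, so the packet processing and queuing terms dominate; over long distances the propagation term takes over.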


Latency

Who is the end user of the data?

Is it a core in a parallel application?

Is it another application?

Is it an end user with a 3+ ms one-way latency cost from the data center?

It takes a batsman in cricket 400 ms to decide where and how to hit the ball when the bowler releases it

It takes a normal human 250 ms just to recognize that data has been delivered to their screen, not to mention the in-host latency and deciding what to do with it


Sources of Latency in Network Today

Problem Needs to Be Solved End-to-End: Applications, NIC, Blade, Leaf and Core Switches

[Figure: end-to-end path between two hosts, each with an Application and a Networking Stack (latency: Variable). Labels recovered from the figure: Blade Switch ~3 μs; ToR ~3 μs; Core ~7 μs; four further segments of ~10 μs each; Ping/Pong Latency 25–30 μs]


Latency Effects in an Ethernet World

End-to-End Latency

100 μs E2E at 1GbE reduces throughput by 15–20%

100 μs E2E at 10GbE reduces throughput by 20–25%

Thank our good friend TCP for that

Who cares about throughput?

Storage heavy applications

Load/unload operations
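One reason TCP throughput falls as latency rises is that a single flow can send at most one window of data per round trip. The sketch below models only that limit. The 64 KiB window and the round trip of twice the 100 μs end-to-end latency are assumptions, and this simple model is not expected to reproduce the percentages quoted above, which depend on the stack and workload.

```python
def window_limited_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Upper bound for a single TCP flow: one window of data per round trip."""
    return window_bytes * 8 / rtt_s


def achievable_fraction(link_bps: float, window_bytes: int, rtt_s: float) -> float:
    """Fraction of the link rate a single window-limited flow can fill."""
    return min(1.0, window_limited_throughput_bps(window_bytes, rtt_s) / link_bps)


WINDOW_BYTES = 64 * 1024  # assumed: 64 KiB window, no window scaling
RTT_S = 2 * 100e-6        # assumed: round trip is twice the 100 us E2E latency

for name, link_bps in (("1GbE", 1e9), ("10GbE", 10e9)):
    frac = achievable_fraction(link_bps, WINDOW_BYTES, RTT_S)
    print(f"{name}: at most {frac:.0%} of line rate")
```

With these assumed values the window limit does not bind at 1GbE but caps a single flow at roughly a quarter of a 10GbE link, which is why the faster link is more sensitive to the same latency.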


Traditional Server I/O Architecture

Bus-based architecture with I/O memory pool

Access to I/O resource handled by BIOS

A data packet is typically copied three to four times

CPU interrupts, bus bandwidth constrained, memory bus constrained

[Figure: packet path between Network, NIC, I/O Memory Pool, CPU, and Application Memory, with steps numbered 1 to 5]


Adapter and Protocol Considerations

Fundamental part of any solution

Great advantage of Ethernet

Highly competitive and open market place

On-loading vs. Off-loading camp, …

Linux vs. Windows vs. Solaris

iWARP: RDMA, RDDP, DDP

Single sided offload with zero copy kernel bypass

TCP over lossless Ethernet is just the beginning

Alternative protocols being considered


Kernel Bypass/Zero Copy Architecture

[Figure: packet path between Bypass-Capable Adapter, CPU, Application Memory, and I/O Memory Pool, with steps numbered 1 to 3]
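The difference between a copy-heavy path and a zero-copy path can be illustrated with a user-space analogy. This sketch is not kernel bypass itself: it only uses Python's `bytearray` (which copies) and `memoryview` (which shares the underlying buffer) to show what "zero copy" means for the application. The function names and the three-copy count are assumptions for illustration.

```python
def copy_path(adapter_buffer: bytearray) -> bytearray:
    """Copy-heavy path: every hand-off duplicates the packet."""
    io_pool = bytearray(adapter_buffer)  # copy 1: adapter -> I/O memory pool
    kernel_buf = bytearray(io_pool)      # copy 2: I/O pool -> kernel buffer
    app_buf = bytearray(kernel_buf)      # copy 3: kernel -> application memory
    return app_buf


def zero_copy_path(adapter_buffer: bytearray) -> memoryview:
    """Zero-copy path: the application gets a view of the same memory."""
    return memoryview(adapter_buffer)


packet = bytearray(b"payload")
copied = copy_path(packet)
view = zero_copy_path(packet)

# Change the adapter buffer in place: the copy is stale, the view is not.
packet[0] = ord("P")
print(bytes(copied))
print(bytes(view))
```

The copied buffer still holds the old bytes because it is separate memory, while the view reflects the change because no data was ever duplicated.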


Low Latency Performance Comparison

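[Table: CPU utilization, bandwidth, and latency for ten interconnect and protocol configurations. The column alignment did not survive extraction; labels and values are listed in extraction order.]

Column labels: Sockets API, MPI, TCP, SDP OFED 1.2, MPI OFED 1.2, IP, IPoIB, Gigabit Ethernet, 10GE, 10GE LLE, SDR IB, DDR IB, SDR IB, DDR IB, 10G LLE, MVAPICH, OMPI

CPU: 25%, 25%, 25%, 28%, 27%, 26%, 23%, 25%, 25%, 9%

Bandwidth (MB/s): 1351, 1354, 1220, 1033, 896, 727, 560, 1219, 1214, 118

Latency (μs): 3.29, 3.82, 8.8, 10, 3.32, 14.3, 20.3, 8.5 (L2) / 11 (TCP), 25.8, 35.3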


Switch Architecture Value

[Figure: two application-to-application paths carrying a data packet (Application, Kernel, NIC, Switch, NIC, Kernel, Application)]

With N5000 switch: 3.9 μs at each end, 3.2 μs in the switch, End-to-End Latency ~11 μs

VS.

With a conventional switch: 25 μs at each end, 20 μs in the switch, End-to-End Latency ~70 μs
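The two end-to-end figures in this comparison are consistent with adding up the segments of the path. The sketch below assumes the segment values are microseconds (matching the ~11 μs quoted for the same N5000 path under Latency Fundamentals), that the 3.9 and 25 figures apply to each host, and that the 3.2 and 20 figures apply to the switch. The function name is hypothetical.

```python
def end_to_end_us(host_us: float, switch_us: float, hops: int = 1) -> float:
    """Sending host + switch hop(s) + receiving host, in microseconds."""
    return host_us + hops * switch_us + host_us


low_latency = end_to_end_us(host_us=3.9, switch_us=3.2)
conventional = end_to_end_us(host_us=25, switch_us=20)

print(f"low-latency path: {low_latency:.1f} us")
print(f"conventional path: {conventional:.1f} us")
```

The `hops` parameter is there to show how quickly the total grows when the same traffic crosses more than one switch.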


Jitter: Delay Variation

Jitter Is #1 Problem for HPC Applications

Latency “standard deviation” (jitter) is a bigger issue than average latency

Jitter and latency limit “cluster size”

A single frame dropped in a switch or adapter causes significant impact on application performance

TCP windowing is a major source of jitter

Ideal Solution for Jitter Reduction

Cut-through architecture

Line rate processing and forwarding

PFC and congestion management control jitter at the source

Line rate processing

No-drop and delay-drop classes of service

Delay-drop is best suited for TCP-based workloads
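Why the standard deviation can matter more than the average is easy to see with two made-up sets of message latencies. The sample values below are hypothetical, and the model assumes a barrier-style step in which every node waits for the slowest message.

```python
import statistics

# Hypothetical per-message latencies in microseconds, both averaging 10 us.
steady = [10.0] * 8
jittery = [4.0] * 7 + [52.0]


def summarize(samples):
    """Average latency, jitter (population std. dev.), and barrier step time."""
    return {
        "mean": statistics.mean(samples),
        "jitter": statistics.pstdev(samples),
        # Assumed model: the step finishes only when the slowest message lands.
        "step_time": max(samples),
    }


for name, samples in (("steady", steady), ("jittery", jittery)):
    s = summarize(samples)
    print(f"{name}: mean={s['mean']:.1f} jitter={s['jitter']:.1f} "
          f"step={s['step_time']:.1f}")
```

Both fabrics report the same average, yet the jittery one makes each synchronized step several times longer, and the chance of hitting a slow message grows with the number of nodes.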


Three Facets of Latency


Latency

Three Focuses of Latency

Compute Latency: core-to-core message latency

Application Latency: latency between the tiers of a multi-tiered application

Data Latency: data load and unload times


Compute Latency

[Figure: a message travels from Core to Core through Kernel, NIC, Switch, NIC, and Kernel; the whole path is the End-to-End Latency]


Compute Latency

Core to core messaging for Inter-Processor Communication

Medium to large node count job distribution

Data intensive

Compute and data latencies impact overall wall clock time and scalability


Compute Latency

60–70% of message latency is in host

Offload technologies reduce in-host latency

Kernel bypass and zero copy

Balance between time communicating vs. time computing—node efficiency
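The balance between communicating and computing can be expressed as a simple ratio. This is a minimal sketch: the formula treats node efficiency as the share of wall clock time spent computing, and the timings are hypothetical.

```python
def node_efficiency(compute_s: float, comm_s: float) -> float:
    """Share of wall clock time a node spends computing rather than waiting."""
    return compute_s / (compute_s + comm_s)


# Hypothetical job: 80 s of computation and 20 s of communication per node.
baseline = node_efficiency(compute_s=80, comm_s=20)

# Same job after an assumed halving of communication time.
improved = node_efficiency(compute_s=80, comm_s=10)

print(f"baseline {baseline:.1%}, improved {improved:.1%}")
```

The gain from a faster fabric depends on how communication-bound the job was to begin with, which is why the benefit varies so much between applications.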


Compute Latency

InfiniBand has been the interconnect and fabric of choice

10GbE moving forward with iWARP (RDMA/RDDP)

Port costs, port densities, physical power draw and heating issues to be overcome


Application Latency

[Figure: messages pass between three Application tiers; each hop between tiers runs through Kernel, NIC, Switch, NIC, and Kernel]


Application Latency

Multi-tiered applications

A few application environments can be a mix of all three areas of latency (for example, Oracle RAC)

Low latency is not a significant requirement outside of market data and algorithmic trading applications

Higher latencies at the initial tiers will be exacerbated as secondary and tertiary application tiers act on higher tier output

TCP and/or UDP traffic

Unicast or Multicast


Data Latency

[Figure: a Read/Write Data Request for a File travels between an Application and Parallel Storage through Kernel, NIC, Switch, NIC, and Kernel, then on through Controller/HBA, Fibre Channel Switch, and Array Controller to the Drives]


Data Latency

Data Delivery to Computational Clusters

Once on the network, data flows are limited by the slowest link

Cluster to Parallel File Systems

Sustained data flows to/from disk are limited by the slowest link

Peta-Scale File Systems pushing 2000 ports

Large compute farms
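The "slowest link" rule amounts to taking the minimum over every segment on the path. In the sketch below the segment names and rates are hypothetical and the function name is an assumption.

```python
def bottleneck(links_mbps: dict) -> tuple:
    """Return the (name, rate) of the slowest link on a data path."""
    name = min(links_mbps, key=links_mbps.get)
    return name, links_mbps[name]


# Hypothetical path from a compute node to a parallel file system target.
path = {
    "compute node NIC (GbE)": 1_000,
    "switch uplink (10GbE)": 10_000,
    "storage target (10GbE)": 10_000,
}

name, rate = bottleneck(path)
print(f"flow limited to {rate} Mb/s by {name}")
```

Upgrading any segment other than the one returned here would leave the sustained flow unchanged.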


Low Latency and Data Delivery

Data Delivery through NAS and Parallel File System interconnects will drive more 10GbE interconnects than node connects until prices are at or near IB pricing

Speed matching is critical for maximum data delivery

10GbE targets with 1GbE initiators limit throughput by more than 50% versus 10GbE–10GbE

IB shows higher throughput due to init/target speed matching and RDMA effects (TCP offload)


Data Latency—Seismic Processing

Large data sets

Medium to large node count distribution

Loosely coupled to parametric processing

TCP/UDP-based transport

Data load/unload times impact overall wall clock time

Connect speed of storage targets and compute nodes

GbE, 10GbE, InfiniBand, FC

Storage structure

NFS, Parallel NFS, NAS, Clustered, Parallel File System



Low Latency and Data Delivery

Application wall clock time changes by moving to 10GbE or IB-connected Parallel File System:

Dark Matter application (Gig to IB)

35-hour run time with data delivered over GbE

Four-hour run time with data delivered over SDR IB

Oil and Gas Seismic Processing: 120,000+ cores, Oil and Gas exploration (Gig to 10GbE)

Small jobs: 2x reduction in wall clock

Large jobs: 16x reduction in wall clock

Parallel File Systems scale to large numbers of systems

The issue is how to deliver tens of Gigabits/s of I/O to a large number of clusters

DOD/DOE labs share peta-scale storage systems across clusters with 10GbE

Use gateways (PCs with 10GbE/IB) to get to their IB clusters
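The Dark Matter run times can be put on the same footing as the 2x and 16x figures quoted for seismic processing. This is a small arithmetic sketch; the function name is hypothetical and the 35-hour and four-hour values are the ones given above.

```python
def wall_clock_reduction(before_hours: float, after_hours: float) -> float:
    """Factor by which wall clock time shrinks (before / after)."""
    return before_hours / after_hours


# Dark Matter application: 35 hours over GbE, four hours over SDR IB.
dark_matter = wall_clock_reduction(35, 4)
print(f"Dark Matter: {dark_matter:.2f}x reduction in wall clock")
```

That places the Dark Matter result between the small-job and large-job seismic figures.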


Data Delivery in HPC

Large data repositories in the peta-scale range

Use of NFS Filers and Parallel File Systems

Direct Attached Storage and FC SAN with Parallel File System

GbE is not scaling to meet the higher throughput and lower wall clock times required in research and business

PFS and large data sets are driving 10GbE and IB for I/O node interconnect

Data latency impacts more applications than compute latency

Compute latency benefit from a low latency high bandwidth fabric will affect <30% of many applications


Data Throughput Performance

Initiator Speed | Target Speed | Data Throughput per I/O Node or I/O Connect
Gigabit Ethernet | Gigabit Ethernet | 112–118 MBps
Gigabit Ethernet | 10 Gigabit Ethernet | 325–450 MBps
10 Gigabit Ethernet | 10 Gigabit Ethernet | 550–600 MBps
IB SDR IPoIB | IB SDR IPoIB | 350–375 MBps
IB SDR IPoIB CM (OFED 1.2) | IB SDR IPoIB (OFED 1.2) | 525–575 MBps
IB SDR SDP | IB SDR SDP | 590–625 MBps
IB DDR IPoIB | IB DDR IPoIB | 350–375 MBps
IB DDR IPoIB CM (OFED 1.2) | IB DDR IPoIB (OFED 1.2) | 525–600 MBps
IB DDR SDP | IB DDR SDP | 920–1000 MBps
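To compare the Ethernet rows, it helps to express each measured throughput as a share of the nominal line rate. The sketch below assumes MBps means 10^6 bytes per second and ignores framing and protocol overhead. The dictionary and function names are hypothetical.

```python
# Nominal line rates in MB/s (assumed: 1 Gb/s = 125 MB/s, no framing overhead).
LINE_RATE_MB_PER_S = {
    "Gigabit Ethernet": 125.0,
    "10 Gigabit Ethernet": 1250.0,
}


def utilization(throughput_mb_per_s: float, link: str) -> float:
    """Measured throughput as a fraction of the nominal line rate."""
    return throughput_mb_per_s / LINE_RATE_MB_PER_S[link]


# Throughput ranges for the matched-speed Ethernet rows of the table.
for link, (low, high) in {
    "Gigabit Ethernet": (112, 118),
    "10 Gigabit Ethernet": (550, 600),
}.items():
    print(f"{link}: {utilization(low, link):.0%} to {utilization(high, link):.0%}")
```

Under these assumptions GbE runs close to its line rate while 10GbE reaches a little under half of it, so the faster link leaves much more headroom unused.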


Q and A


Latency Fundamentals

What matters is the application-to-application latency and jitter

Driver/Kernel software

Adapter

Network components

Latencies of 1GbE switches can be quite high (>20 μs)

Store and forward

Multiple hops

Line serialization delay

Protocol processing, context switching and copying dominate latency

[Figure: data packet path through Application, Kernel, NIC, N5000 Switch, NIC, Kernel, and Application; 3.2 μs in the switch, End-to-End Latency ~11 μs]
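The store-and-forward, multiple-hop, and line serialization points above combine in a simple way: a store-and-forward switch must receive a whole frame before sending it on, so each hop adds one more serialization delay. The sketch below assumes a 1500-byte frame and three hops, both chosen for illustration, and the function names are hypothetical.

```python
def serialization_delay_us(frame_bytes: int, link_bps: float) -> float:
    """Time to clock one frame onto the wire, in microseconds."""
    return frame_bytes * 8 / link_bps * 1e6


def store_and_forward_added_us(frame_bytes: int, link_bps: float, hops: int) -> float:
    """Extra delay from switches that buffer the full frame before forwarding."""
    return hops * serialization_delay_us(frame_bytes, link_bps)


FRAME_BYTES = 1500  # assumed frame size
HOPS = 3            # assumed number of store-and-forward switches

for name, link_bps in (("1GbE", 1e9), ("10GbE", 10e9)):
    per_frame = serialization_delay_us(FRAME_BYTES, link_bps)
    added = store_and_forward_added_us(FRAME_BYTES, link_bps, HOPS)
    print(f"{name}: {per_frame:.1f} us per frame, {added:.1f} us added over {HOPS} hops")
```

A cut-through switch starts forwarding before the full frame has arrived, which avoids most of this per-hop cost.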