Extensible Message Layers for Resource-Rich Cluster Computers Craig Ulmer Center for Experimental Research in Computer Systems A Doctoral Thesis


Extensible Message Layers for Resource-Rich Cluster Computers

Craig Ulmer

Center for Experimental Research in Computer Systems

A Doctoral Thesis


Outline

Background
- Evolution of cluster computers

Thesis
- Design of extensible message layers

GRIM: General-purpose Reliable In-order Messages
- Core communication functions

Extensions
- Integrating peripheral devices
- Streaming computations
- Multicast
- Sockets API

Host-to-host performance

Concluding remarks

Background

An Evolution of Cluster Computers


Cluster Computers

Cost-effective alternative to supercomputers
- Number of commodity workstations
- Specialized network hardware and software

Result: Large pool of host processors

[Figure: four workstations, each with a CPU, memory, I/O bus, and network interface, connected by a system area network]

Industry Trends

Increasingly independent, intelligent peripheral devices

Migration of computing power and bandwidth requirements to peripherals

Network and multimedia applications' reliance on peripheral devices

High-bandwidth, low-latency system area networks

[Figure: host CPU with attached Ethernet, storage, SAN NI, and media capture devices]

Resource-Rich Cluster Computers

Inclusion of diverse peripheral devices
- Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators

Processing takes place in host CPUs and peripherals

[Figure: cluster of hosts on a system area network; individual hosts carry peripherals such as SAN NI, Ethernet, video capture, FPGA, and storage cards]

Problem: Utilizing distributed cluster resources

How is efficient intra-cluster communication provided?
- Desirable abstractions and their implementations

How can applications make use of resources?

[Figure: pool of CPUs alongside video capture, FPGA, RAID, and Ethernet resources, with question marks on the links between them]

Thesis Definition

Supporting Resource-Rich Cluster Computers


The Challenge

Message layers are an enabling technology for clusters

Current message layers
- Optimized for transmissions between host CPUs
- Peripheral devices only available in context of the local host

What is needed
- Support efficient communication between host CPUs and peripheral devices
- Ability to harness peripheral devices as a pool of resources

Message Layer Design Characteristics

Reliable data transfers
- End-to-end flow control
- Simplify workload for communication endpoints

Virtualization of the network interface
- Multiple endpoints share network access

Flexible programming abstractions
- Efficient data transfer mechanisms
- Allow users to customize interactions with resources

Message Layer Requirements

Extensibility is a key feature
- Hardware: Easily add new peripheral devices
- Software: Support higher-level programming abstractions

[Figure: message layer core with application extensions and peripheral extensions]

Related Work

InfiniBand
- Emerging industry standard for clusters/data centers
- New I/O infrastructure

Existing message layers
- Myricom’s GM
- OPIUM: SCSI interactions

Adaptive Computing
- Clusters with FPGA cards
- Tower of Power

GRIM: An Implementation

A message layer for resource-rich clusters

General-purpose Reliable In-order Message Layer (GRIM)

Message layer for resource-rich clusters
- Myrinet SAN backbone
- Interconnect host CPUs and peripheral devices
- Direct access to network hardware for resources

Communication core
- Migrate functionality into NI
- Easy to provide extensions

Core Software Architecture

Per-hop flow control
- Endpoint-NI transfers
- NI-NI transfers

Logical channels
- Multiple endpoints
- Virtual NI

Programming interfaces
- Active messages
- Remote memory

[Figure: core software architecture. Endpoints provide an active message API (handler management, active message execution) and a remote memory API (registered memory, memory management); endpoint-NI transfers link each endpoint to its NI, and NI-NI transfers cross the network]

Per-hop Flow Control

End-to-end flow control necessary for reliable delivery
- Prevents buffer overflows in communication path

Endpoint-managed schemes
- Impractical for peripheral devices

Per-hop flow control scheme
- Transfer data as soon as next stage can accept
- Optimistic approach

[Figure: endpoint-managed scheme (send and reply between sending and receiving endpoints) compared with the per-hop scheme (DATA/ACK exchanged on each hop: PCI, SAN, PCI)]
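The per-hop idea can be sketched as a chain of bounded buffers in which a message advances only when the next stage has room. This is a minimal model, not GRIM's implementation: the `Stage` class, the buffer capacities, and the one-message-per-step timing are all assumptions made for illustration, and the free-slot check stands in for the DATA/ACK exchange.

```python
from collections import deque


class Stage:
    """One hop in the path (endpoint, NI, remote NI, ...) with a bounded buffer."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.buffer = deque()

    def can_accept(self):
        return len(self.buffer) < self.capacity


def step(stages, delivered):
    """Advance each hop by at most one message, last hop first.

    A message moves only when the next stage has a free slot, which stands in
    for the per-hop DATA/ACK exchange. The final stage drains into `delivered`.
    """
    if stages[-1].buffer:
        delivered.append(stages[-1].buffer.popleft())
    for i in range(len(stages) - 2, -1, -1):
        if stages[i].buffer and stages[i + 1].can_accept():
            stages[i + 1].buffer.append(stages[i].buffer.popleft())


def send_all(messages, capacities):
    """Push messages through the chain; return them in arrival order plus peak buffer use."""
    stages = [Stage(f"hop{i}", c) for i, c in enumerate(capacities)]
    pending = deque(messages)
    delivered = []
    peak = 0
    while pending or any(s.buffer for s in stages):
        step(stages, delivered)
        if pending and stages[0].can_accept():
            stages[0].buffer.append(pending.popleft())
        peak = max(peak, max(len(s.buffer) for s in stages))
    return delivered, peak
```

Because no stage ever accepts more than its capacity, no buffer overflows, and FIFO queues at every hop keep delivery in order without the endpoints tracking credits themselves.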

Logical Channels

Multiple endpoints in a host share the NI

Employ multiple logical channels in the NI
- Each endpoint owns one or more logical channels
- Logical channel provides virtual interface to network

[Figure: endpoints 1 through n, each mapped to a logical channel in the network interface; a scheduler feeds the channels onto the network]
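A small sketch of how an NI might multiplex several logical channels onto one network link. The slides name a scheduler but do not state its policy, so the round-robin order here is an assumption, as are the class and method names.

```python
from collections import deque


class NetworkInterface:
    """Toy NI that gives each endpoint its own queue (logical channel)."""

    def __init__(self):
        self.channels = {}   # channel id -> queue of outgoing messages
        self.owner = {}      # channel id -> owning endpoint
        self._order = []     # channel ids in round-robin order
        self._next = 0       # index of the channel to inspect next

    def open_channel(self, endpoint):
        """Allocate a new logical channel owned by `endpoint`."""
        cid = len(self._order)
        self.channels[cid] = deque()
        self.owner[cid] = endpoint
        self._order.append(cid)
        return cid

    def post(self, cid, message):
        """Endpoint side: queue a message on one of its channels."""
        self.channels[cid].append(message)

    def schedule(self):
        """Return (endpoint, message) from the next non-empty channel, or None."""
        for _ in range(len(self._order)):
            cid = self._order[self._next]
            self._next = (self._next + 1) % len(self._order)
            if self.channels[cid]:
                return self.owner[cid], self.channels[cid].popleft()
        return None
```

With two channels holding `a1, a2` and `b1`, successive calls to `schedule()` interleave them as `a1, b1, a2`, so one busy endpoint cannot starve another.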

Programming Interfaces: Active Messages

Message specifies function to be executed at receiver
- Similar to remote procedure calls, but lightweight
- Invoke operations at remote resources

Useful for constructing device-specific APIs

Example: Interactions with remote storage controller

[Figure: CPU sends am_fetch_file() across the SAN to a storage controller, which answers with am_return_file_data()]
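The storage-controller example can be sketched with a handler table at each endpoint. Only the names `am_fetch_file` and `am_return_file_data` come from the slide; their arguments, the numeric handler IDs, the `Endpoint` class, and the in-memory "file" are assumptions for illustration, and the inbox list stands in for the queues that the NI would fill.

```python
FETCH_FILE = 1          # hypothetical handler IDs
RETURN_FILE_DATA = 2


class Endpoint:
    """Endpoint with a table of active message handlers."""

    def __init__(self, name):
        self.name = name
        self.handlers = {}   # handler id -> function
        self.inbox = []      # stands in for the queue filled by the NI

    def register_handler(self, handler_id, fn):
        self.handlers[handler_id] = fn

    def send(self, dest, handler_id, *args):
        """Queue an active message naming the handler to run at `dest`."""
        dest.inbox.append((self, handler_id, args))

    def poll(self):
        """Run the named handler for each queued message; return how many ran."""
        count = 0
        while self.inbox:
            sender, handler_id, args = self.inbox.pop(0)
            self.handlers[handler_id](self, sender, *args)
            count += 1
        return count


def demo():
    cpu, storage = Endpoint("cpu"), Endpoint("storage")
    files = {"a.txt": b"hello"}   # stand-in for data behind the controller
    received = {}

    def am_fetch_file(ep, sender, path):
        # Runs at the storage controller; replies with another active message.
        ep.send(sender, RETURN_FILE_DATA, path, files[path])

    def am_return_file_data(ep, sender, path, data):
        # Runs back at the requesting CPU.
        received[path] = data

    storage.register_handler(FETCH_FILE, am_fetch_file)
    cpu.register_handler(RETURN_FILE_DATA, am_return_file_data)

    cpu.send(storage, FETCH_FILE, "a.txt")
    storage.poll()
    cpu.poll()
    return received
```

The request and the reply are both ordinary active messages, which is what keeps the mechanism lighter than a full remote procedure call.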

Programming Interfaces: Remote Memory

Transfer blocks of data from one host to another
- Receiving NI executes transfer directly

Read and Write operations
- NI interacts with kernel driver to translate virtual addresses
- Optional notification mechanisms

[Figure: two hosts, each with CPU, memory, and NI, connected by the SAN]
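A rough model of remote memory operations against registered regions. The class, the region identifiers, and the bounds check are assumptions for this sketch; in GRIM the receiving NI and kernel driver perform the address translation, which is reduced here to a dictionary lookup.

```python
class RegisteredMemory:
    """Memory regions an endpoint has exposed to remote reads and writes."""

    def __init__(self):
        self.regions = {}         # region id -> bytearray
        self.notifications = []   # (region id, offset, length) for notified writes

    def register(self, region_id, size):
        self.regions[region_id] = bytearray(size)

    def _check(self, region_id, offset, length):
        """Look up the region and reject transfers that fall outside it."""
        region = self.regions.get(region_id)
        if region is None:
            raise KeyError(f"region {region_id!r} is not registered")
        if offset < 0 or length < 0 or offset + length > len(region):
            raise ValueError("transfer falls outside the registered region")
        return region

    def remote_write(self, region_id, offset, data, notify=False):
        """Copy data into the region; optionally record a notification."""
        region = self._check(region_id, offset, len(data))
        region[offset:offset + len(data)] = data
        if notify:
            self.notifications.append((region_id, offset, len(data)))

    def remote_read(self, region_id, offset, length):
        """Return a copy of part of the region."""
        region = self._check(region_id, offset, length)
        return bytes(region[offset:offset + length])
```

The receiving CPU takes no part in the copy; it only learns of a write if the sender asked for a notification.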

Integrating Peripheral Devices

Hardware Extensibility


Peripheral Device Overview

In GRIM peripherals are endpoints

Intelligent peripherals
- Operate autonomously
- On-card message queues
- Process incoming active messages
- Eject outgoing active messages

Legacy peripherals
- Managed by host application, or
- Remote memory operations

[Figure: host CPUs, NI, an intelligent peripheral device, and a legacy peripheral device]

Cyclone Systems I2O Server Adaptor Card

Networked host on a PCI card

Integration with GRIM
- Interact directly with the NI
- Ported host-level endpoint software

Utilized as a LAN-SAN bridge

[Figure: block diagram of the card, showing the host system, primary and secondary PCI interfaces, i960 Rx processor, DMA engines, DRAM, ROM, local bus, two 10/100 Ethernet ports, two SCSI ports, and a daughter card]

Video Capture and Display Cards

Video capture card
- Specialized DMA engine
- Host-level library based on Video-4-Linux
- Active messages to control

Video display card
- Host locates frame buffer
- Manipulate frame buffer
- Remote memory writes

Incorporate legacy peripheral devices
- Host-level management or remote memory operations

[Figure: video capture path (A/D, frame buffer, DMA, host memory) and video display path (AGP, frame buffer, D/A)]

Celoxica RC-1000 FPGA Card

FPGAs provide acceleration
- Load with application-specific circuits

Celoxica RC-1000 FPGA card
- Xilinx Virtex-1000 FPGA
- 8 MB SRAM

Hardware implementation
- Endpoint as state machines
- AM handlers are circuits

[Figure: RC-1000 card with PCI interface, FPGA, control and switching logic, and SRAM banks 0 through 3]

FPGA Endpoint Organization

[Figure: FPGA endpoint organization, showing the frame, input and output queues, the communication library API, the memory API with application data in FPGA card memory, and the FPGA circuit canvas with the user circuit API and user circuits 1 through n]

Example FPGA Configuration

Cryptography configuration
- DES, RC6, MD5, and ALU

20 MHz Clock
- Newer FPGAs much faster

Expansion: Sharing the FPGA

FPGA has limited space for hardware circuits
- Host reconfigures FPGA on demand
- FPGA Function Fault

[Figure: host CPU holding configurations A, B, and C (circuits X, Y, E, F, and G) and the FPGA loaded with configuration A; a message asking for circuit F raises a function fault, and the FPGA is reloaded with the configuration that contains circuit F, with state storage in SRAM 0. The figure is annotated "(150 ms)"]
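The function-fault mechanism resembles demand loading: a message that names a circuit missing from the current configuration triggers a reconfiguration by the host. The sketch below models only that bookkeeping. The class name and the configuration contents are hypothetical, and neither state saving nor reconfiguration time is modeled.

```python
class FpgaCard:
    """Tracks which configuration is loaded and counts function faults."""

    def __init__(self, configurations):
        self.configurations = configurations   # config name -> set of circuit names
        self.loaded = None                     # name of the configuration on the FPGA
        self.faults = 0

    def _configuration_for(self, circuit):
        """Host side: find a configuration that contains the requested circuit."""
        for name, circuits in self.configurations.items():
            if circuit in circuits:
                return name
        raise KeyError(f"no configuration provides circuit {circuit!r}")

    def handle_message(self, circuit, payload):
        """Dispatch a message to a circuit, reconfiguring first if it is absent."""
        if self.loaded is None or circuit not in self.configurations[self.loaded]:
            self.faults += 1                               # function fault
            self.loaded = self._configuration_for(circuit)  # host reloads the FPGA
        return self.loaded, circuit, payload
```

Messages that stay within one configuration fault at most once, so grouping circuits that are used together into the same configuration keeps the reconfiguration cost down.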

Expansion: Sharing On-Card Memory

Limited card memory for storing application data
- Construct virtual memory system for on-card memory
- Swap space is host memory

[Figure: user-defined circuits on the FPGA access page frames held in SRAM 1 and SRAM 2; a page fault brings user page X in from the host]
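A minimal sketch of virtual memory for on-card SRAM with host memory as swap space. The slides do not say which replacement policy is used, so least-recently-used eviction is an assumption here, along with the class name, the page size, and the page identifiers.

```python
from collections import OrderedDict


class CardMemory:
    """A fixed number of on-card page frames backed by host-memory swap."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.frames = OrderedDict()   # page id -> bytearray, least recent first
        self.host_swap = {}           # page id -> bytes held in host memory
        self.page_faults = 0

    def access(self, page_id, page_size=4096):
        """Return the resident page, faulting it in from host memory if needed."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)   # mark as most recently used
            return self.frames[page_id]
        self.page_faults += 1
        if len(self.frames) >= self.num_frames:
            # Evict the least recently used page to host memory.
            victim, data = self.frames.popitem(last=False)
            self.host_swap[victim] = bytes(data)
        # Load from swap, or start with a zeroed page on first touch.
        page = bytearray(self.host_swap.pop(page_id, bytes(page_size)))
        self.frames[page_id] = page
        return page
```

With two frames, touching pages x, y, and z in turn evicts x to host memory; touching x again faults it back in with its contents intact.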

Extension: Streaming Computations

Software extensibility [1/3]


Streaming Computation Overview

Programming method for distributed resources
- Establish pipeline for streaming operations
- Example: Multimedia processing

[Figure: hosts on a system area network, one with a video capture card and three with media processors (Celoxica RC-1000 FPGA endpoints)]

Streaming Fundamentals

Computation: How is a computation performed?
- Active message approach

Forwarding: Where are results transmitted?
- Programmable forwarding directory

[Figure: an incoming message (destination: FPGA, forward entry: X, AM: perform FFT) is handled by the FPGA's computational circuits (circuit 1: FFT through circuit N: encrypt); the forwarding directory produces the outgoing message (destination: host, forward entry: X, AM: receive FFT)]
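The two questions above map onto two tables per endpoint: one from handler name to computation, one from forward entry to next hop. The sketch below is an illustration under assumptions: the class, the entry names, and the doubling "circuit" are hypothetical stand-ins, and delivery is a direct method call instead of a network transfer.

```python
class StreamingEndpoint:
    """Endpoint that computes on a payload and forwards the result."""

    def __init__(self, name):
        self.name = name
        self.circuits = {}     # AM handler name -> function applied to the payload
        self.forwarding = {}   # forward entry -> (next endpoint, next handler, next entry)
        self.results = []      # outputs that reached the end of the pipeline here

    def deliver(self, handler, forward_entry, payload):
        """Run the named computation, then look up where the result goes."""
        output = self.circuits[handler](payload)
        route = self.forwarding.get(forward_entry)
        if route is None:
            self.results.append(output)   # no entry: this is the final stage
            return
        next_endpoint, next_handler, next_entry = route
        next_endpoint.deliver(next_handler, next_entry, output)


def build_pipeline():
    """Wire a two-stage pipeline: a stand-in circuit on the FPGA, then the host."""
    fpga = StreamingEndpoint("fpga")
    host = StreamingEndpoint("host")
    fpga.circuits["scale"] = lambda samples: [2 * s for s in samples]
    host.circuits["receive"] = lambda samples: samples
    fpga.forwarding["X"] = (host, "receive", None)
    return fpga, host
```

Because the directory is programmable, the same circuit can feed different consumers by rewriting one entry, without touching the sender or the circuit.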

Performance: FPGA Computations

[Figure: timeline of FPGA message-processing stages with per-stage clock counts. Stages: acquire SRAM, detect new message, fetch header, fetch payload (1024 clocks), computation, store results, store header, lookup forwarding, update queues, release SRAM. Other clock counts shown: 8, 4, 7, 1024, 1024, 16, 5, 3, 1]

Clock speed: 20 MHz

Operation latency: 55 μs (4 KB, 73 MB/s)

Extension: Multicast

Software extensibility [2/3]


GRIM Multicast Extensions

Distribute the same message to multiple receivers
- Tree-based distributions
- Replicate message at NI
- Messages are recycled back into network

Extensions to NI’s core communication operations
- Recycled messages in separate logical channel
- Utilize per-hop flow control for reliable delivery

[Figure: multicast tree rooted at A with nodes B, C, D, and E, shown next to the corresponding NI/endpoint pairs]
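Tree-based distribution can be sketched as a breadth-first walk in which each node's NI makes one copy per child. The function name and the tree shape used in the usage note are assumptions; the slide shows nodes A through E, but this sketch does not claim to reproduce its exact topology.

```python
from collections import deque


def multicast(tree, root, message):
    """Distribute `message` down `tree` (node -> list of children).

    Returns (delivered, ni_copies): the message received at each non-root node,
    and how many copies each node's NI re-injects into the network.
    """
    delivered = {}
    ni_copies = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node != root:
            delivered[node] = message
        children = tree.get(node, [])
        ni_copies[node] = len(children)   # replication happens at the NI
        queue.extend(children)
    return delivered, ni_copies
```

For a hypothetical tree `{"A": ["B", "C"], "B": ["D", "E"]}`, the sending host injects the message once and the NIs at A and B each make two copies, whereas four unicasts would cost the sending host four injections.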

Multicast Performance

[Figure: log-scale plot of time (μs) versus multicast message size (bytes) for 8 hosts, LANai 4, P4-1.7 GHz hosts. Series: multicast RTT, unicast RTT, multicast injection overhead, unicast injection overhead]

Multicast Observations

Beneficial: reduces sending overhead

Performance loss for large messages
- Dependent on NI memory copy bandwidth

On-card memory copy benchmark:
- LANai 4: 19 MB/s
- LANai 9: 66 MB/s

Extension: Sockets API

Software Extensibility [3/3]


Extension: Sockets Emulation

Berkeley sockets is a communication standard
- Utilized in numerous distributed applications

GRIM provides sockets API emulation
- Functions for intercepting socket calls
- AM handler functions for buffering connection data

[Figure: on the sender, write() is intercepted and an active message (AM: append socket X, carrying the socket data) is generated; on the receiver, the append-socket AM handler buffers the data for socket X, and an intercepted read() extracts it]
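The write/append/read path can be sketched with a pair of in-process objects. This is an illustration only: the class and function names are hypothetical, the "active message" is a direct method call on the peer, and blocking, connection setup, and error handling are left out.

```python
class EmulatedSocket:
    """One end of an emulated connection, with a receive buffer filled by AMs."""

    def __init__(self):
        self.buffer = bytearray()   # connection data buffered at the receiver
        self.peer = None

    def write(self, data):
        """Intercepted write(): turn the call into an 'append' active message."""
        self.peer._am_append_socket(bytes(data))
        return len(data)

    def _am_append_socket(self, data):
        """AM handler at the receiver: append the payload to the socket buffer."""
        self.buffer.extend(data)

    def read(self, nbytes):
        """Intercepted read(): extract up to nbytes of buffered data."""
        data = bytes(self.buffer[:nbytes])
        del self.buffer[:nbytes]
        return data


def socket_pair():
    """Create two connected emulated sockets."""
    a, b = EmulatedSocket(), EmulatedSocket()
    a.peer, b.peer = b, a
    return a, b
```

Data written in two calls arrives as one byte stream, so the reader can consume it in pieces of any size, which is the behavior applications expect from stream sockets.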

Sockets Emulation Performance

[Figure: bandwidth (MB/s) versus transfer size (bytes), P4-1.7 GHz hosts. Series: GRIM sockets on LANai 4, 100 Mb/s Ethernet]

Host-to-Host Performance

Transferring data between two host-level endpoints

Host-to-Host Communication Performance

Host-to-host transfers: standard benchmark
- Remote memory writes in benchmarks
- Myrinet LANai 4 and LANai 9 NI cards

Injection performance

Overall communication path

[Figure: source and destination hosts (CPU, memory, NI) joined by the SAN, with three numbered stages along the path used by active messages and remote memory operations]

Host-NI: Data Injections

Host-NI transfers challenging
- Host lacks DMA engine

Multiple transfer methods
- Programmed I/O
- DMA

Automatically select method

Result: Tunable PCI Injection Library (TPIL)

[Figure: host side (CPU, cache, memory controller, main memory) and peripheral side (PCI DMA engine, device memory) joined by the PCI bus]
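One way to select a transfer method automatically is to benchmark each method at a few sizes and pick the fastest one for the size at hand. The sketch below assumes that approach; it is not taken from TPIL itself, and the class name, method labels, and any numbers fed to it are placeholders, not measurements from the thesis.

```python
import bisect


class InjectionTuner:
    """Chooses the transfer method that benchmarked fastest near a given size."""

    def __init__(self, measurements):
        """measurements: {method: {size_bytes: bandwidth}}, same sizes for every method."""
        self.sizes = sorted(next(iter(measurements.values())))
        # For each benchmarked size, remember which method had the highest bandwidth.
        self.best = [
            max(measurements, key=lambda m: measurements[m][size])
            for size in self.sizes
        ]

    def select(self, nbytes):
        """Use the largest benchmarked size that does not exceed nbytes."""
        i = bisect.bisect_right(self.sizes, nbytes) - 1
        return self.best[max(i, 0)]
```

A tuner built from a synthetic table in which programmed I/O wins at small sizes and DMA wins at large sizes would return `"pio"` for a 32-byte injection and `"dma"` for a 20,000-byte one, with the crossover decided by the measurements instead of a hard-coded threshold.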

TPIL Performance: LANai 9 NI with Pentium III-550 MHz Host

[Figure: bandwidth (MB/s) versus injection size (bytes)]

Overall Performance: Store-and-Forward

Approach: Single message, no overlap
- Three transmission stages
- Expect roughly 1/3 of bandwidth of individual stage

[Figure: timeline of message 1 passing in turn through the sending host-NI stage (PCI: 132 MB/s), the NI-NI stage (Myrinet: 160 MB/s), and the receiving NI-host stage (PCI: 132 MB/s); the three stage times add up to the overall transmission time]

[Figure: bandwidth (MB/s) versus message size (bytes), P3-550 MHz hosts]
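The "roughly 1/3" expectation follows from adding the stage times. A short calculation with the peak rates quoted on the slide, ignoring any fixed per-message overhead (an assumption of this sketch):

```python
def store_and_forward_bandwidth(stage_bandwidths):
    """Effective bandwidth when each stage finishes before the next one starts.

    Moving B bytes takes B/b1 + B/b2 + ... seconds, so the effective rate is
    the reciprocal of the summed reciprocals. Fixed overheads are ignored.
    """
    return 1.0 / sum(1.0 / b for b in stage_bandwidths)


# Peak stage rates from the slide, in MB/s: PCI, Myrinet, PCI.
effective = store_and_forward_bandwidth([132, 160, 132])
print(f"{effective:.1f} MB/s")   # about 46.7 MB/s
```

The result is close to a third of the 132 MB/s PCI rate, which is the upper bound that pipelining and cut-through transfers set out to beat.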

Enhancement: Message Pipelining

Allow overlap with multiple in-flight messages
- GRIM uses AM and RM fragmentation/reassembly
- Performance depends on fragment size

[Figure: timeline of messages 1, 2, and 3 overlapping across the sending host-NI, NI-NI, and receiving NI-host stages, shortening the overall transmission time]

[Figure: bandwidth (MB/s) versus message size (bytes), LANai 9, P3-550 MHz hosts]
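Why fragment size matters can be seen in a simple pipeline model: the first fragment crosses every stage, and each later fragment is paced by the slowest stage. The model below is a textbook simplification, not GRIM's measured behavior; it treats every fragment as full-size, and the per-fragment overhead is a hypothetical parameter.

```python
import math


def pipelined_time(message_bytes, fragment_bytes, stage_bandwidths,
                   per_fragment_overhead=0.0):
    """Seconds to move a message through the stages with overlapping fragments.

    stage_bandwidths are in bytes per second. Every fragment is treated as
    full-size, and per_fragment_overhead is paid once per fragment per stage.
    """
    n = math.ceil(message_bytes / fragment_bytes)
    stage_times = [fragment_bytes / b + per_fragment_overhead
                   for b in stage_bandwidths]
    # First fragment traverses all stages; the rest follow at the slowest stage's pace.
    return sum(stage_times) + (n - 1) * max(stage_times)
```

With one fragment per message the model reduces to store-and-forward. Small fragments overlap well but pay the overhead many times, and large fragments amortize the overhead but overlap poorly, so the best size lies in between.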

Enhancement: Cut-through Transfers

Forward data as soon as it begins to arrive
- Cut-through at sending and receiving NIs

[Figure: timeline of messages 1 and 2 in which the sending host-NI, NI-NI, and receiving NI-host stages of each message overlap, shortening the overall transmission time]

[Figure: bandwidth (MB/s) versus message size (bytes), LANai 9, P3-550 MHz hosts]

Overall Host-to-Host Performance

Host         NI        Latency (μs)   Bandwidth (MB/s)
P4-1.7 GHz   LANai 9   8              146
P4-1.7 GHz   LANai 4   14.5           108
P3-550 MHz   LANai 9   9.5            116
P3-550 MHz   LANai 4   14             96

[Figure: bandwidth (MB/s) versus message size (bytes)]

Comparison to Existing Message Layers

[Figure: comparison of latency (μs) and bandwidth (MB/s) across message layers]

Host-to-Host Performance Summary

GRIM fitted with performance enhancements
- Take place automatically
- Self-configuring

GRIM provides competitive performance
- Bandwidth: 1.168 Gb/s
- Latency: 8 μs

Provides increased functionality

Concluding Remarks


Key Contributions

Framework for communication in resource-rich clusters
- Reliable delivery mechanisms, virtualized network interface, and flexible programming interfaces
- Comparable performance to state-of-the-art message layers

Extensible for peripheral devices
- Suitable for intelligent and legacy peripherals
- Methods for managing card resources

Extensible for higher-level programming abstractions
- Endpoint-level: Streaming computations and sockets emulation
- NI-level: Multicast

Future Directions

Continued work with GRIM
- Video card vendors opening cards to developers
- Myrinet-connected embedded devices

Adaptation to other network substrates
- Gigabit Ethernet appealing because of cost
- Modification to transmission protocols
- InfiniBand technology promising

Active system area networks
- FPGA chips beginning to feature gigabit transceivers
- Use FPGA chips as networked processing devices

Related Publications

A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.

Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.

A Messaging Layer for Heterogeneous Endpoints in Resource Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.

An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.

Papers and Software Available at http://www.ee.gatech.edu/~grimace/research