Scalable Networking for Next-Generation Computing Platforms

Yoshio Turner*, Tim Brecht*‡, Greg Regnier§, Vikram Saletore§, John Janakiraman*, Brian Lynn*

*Hewlett Packard Laboratories  §Intel Corporation  ‡University of Waterloo


page 2, 14 Feb 2004, SAN-3 workshop – HPCA-10

Outline

• Motivation: enable applications to scale to next-generation network and I/O performance on standard computing platforms
• Proposed technology strategy:
  – Embedded Transport Acceleration (ETA)
  – Asynchronous I/O (AIO) programming model
• Web server application as evaluation vehicle
• Evaluation plan
• Conclusions


Motivation: Next-Generation Platform Requirements

• Low-overhead packet and protocol processing for next-generation commodity interconnects (e.g., 10 GigE)
  – Current systems: performance impeded by interrupts, context switches, and data copies
  – Existing proposals include:
    • TCP Offload Engines (TOE): special hardware; cost and time-to-market issues
    • RDMA: new protocol; requires support at both endpoints
• Increased I/O concurrency for high link utilization
  – I/O bandwidth is increasing
  – I/O latency is fixed or slowly decreasing toward a limit
  – Need a larger number of in-flight operations to fill the pipe


Proposed Technology Strategy

• Embedded Transport Acceleration (ETA) architecture
  – Intel Labs project: prototype architecture dedicates one or more processors to perform all network packet processing -- "Packet Processing Engines" (PPEs)
  – Low-overhead processing: PPE interacts with network interfaces and applications directly via cache-coherent shared memory (bypassing the OS kernel)
  – Application interface: VIA-style user-level communication
• Asynchronous I/O (AIO) programming model
  – Splits file/socket operations into two phases:
    • Post an I/O operation request: non-blocking call
    • Asynchronously receive completion event information
  – High I/O concurrency even for a single-threaded application
  – Initial focus: ETA socket AIO (future extensions to file AIO)


Key Advantages

• Potentially enables Ethernet and TCP to approach the latency and throughput of System Area Networks
• Uses standard system processor/memory resources:
  – Automatically tracks semiconductor cost-performance trends
  – Leverages microarchitecture trends: multiple cores, hardware multi-threading
  – Leverages standard software development environments for rapid development
• Extensibility: fully programmable PPE to support evolving data center functionality
  – Unified IP-based fabric for all I/O
  – RDMA
• AIO increases network-centric application scalability


Overview of the ETA Architecture

• Partitioned server architecture:
  – Host: application execution
  – Packet Processing Engine (PPE)
• Host-PPE Direct Transport Interface (DTI)
  – VIA/InfiniBand-like queuing structures in cache-coherent shared host memory (OS bypass)
  – Optimized for sockets/TCP
• Direct User Socket Interface (DUSI)
  – Thin software layer to support user-level applications


ETA Overview: Partitioned Architecture

[Figure: partitioned server. The host CPU(s) run user applications and kernel applications (file system, iSCSI), reaching the network through either legacy sockets or the direct-access ETA host interface; cache-coherent shared memory connects host and PPE. The PPE runs TCP/IP and the packet driver, attaching to a network fabric carrying LAN, storage, and IPC traffic.]


ETA Overview: Direct Transport Interface (DTI) Queuing Structure

• Asynchronous socket operations: connect, accept, listen, etc.
• TCP buffering semantics – an anonymous buffer pool supports non-pre-posted or out-of-order (OOO) receive packets

[Figure: DTI queuing structure. In shared host memory, each DTI comprises a Tx queue, an Rx queue, an event queue, data buffers, an anonymous buffer pool, and doorbells, linking the host to the Packet Processing Engine.]


API for Asynchronous I/O (AIO)

• Layer a socket AIO API above the ETA architecture
  – Investigate the impact of AIO API features on application structure and performance
• Initial focus: ETA Direct User Socket Interface (DUSI) API
  – Provides asynchronous socket operations: connect, listen, accept, send, receive
• AIO examples:
  – File/socket: Windows AIO with completion ports, POSIX AIO
  – File I/O: Linux AIO (recently introduced)
  – Socket I/O with OS bypass: ETA DUSI, Open Group Sockets API Extensions


ETA Direct User Socket Interface (DUSI) AIO API

• Queuing structure setup for sockets:
  – One Direct Transport Interface (DTI) per socket
  – Event queues: created separately from DTIs
• Memory registration:
  – Pin user-space memory regions and provide address-translation information to ETA for zero-copy transfers
  – Provide access keys (protection tags)
• Application posts socket I/O operation requests to DTI Tx and Rx work queues
• PPE delivers operation completion events to DTI event queues
• Both operation posting and event delivery are lightweight (no OS involvement)


AIO Event Queue Binding

• AIO API design issue: assignment of events to event queues
  – Flexible binding enables applications to separate or group events to facilitate operation scheduling
• DUSI: each DTI work queue can be bound at socket creation to any event queue
  – Allows separating or grouping events from different sockets
  – Allows separating events by type (transmit, receive)
• Alternatives for event queue binding:
  – Windows: per-socket
  – Linux and POSIX AIO: per-operation
  – Open Group Sockets API Extensions: per-operation-type


Retrieving AIO Completion Events

• AIO API design issue: application interface for retrieving events
• DUSI: lightweight mechanism bypassing the OS
  – Event queues in shared memory
  – Callbacks: similar to Windows
  – Event tags
• Application monitoring of multiple event queues:
  – Poll for events (OK for a small number of queues)
  – No events: block in the OS on multiple queues
    • An uncommon case in a busy server, so using the OS signaling mechanism is acceptable here
    • Useful for simultaneous use of different AIO APIs
  – Race conditions: user-level responsibility


AIO for Files and Sockets

• File AIO support:
  – OS (e.g., Linux AIO, POSIX AIO)
  – Future: ETA support for file I/O (e.g., via iSCSI or DAFS)
• Unified application processing of file/socket events:
  – ETA PPE and OS kernel may both supply event queues
    • Blocking on event queues of different types is facilitated by the OS signal mechanism (as in DUSI)
    • Unified event queues may be desirable, but would require efficient coordination of ETA and OS access to the event queues
  – Support for zero-copy sendfile(): integration of ETA with OS management of the shared file buffer in system memory


Initial Demonstration Vehicle: Web Server Application

• Plan: demonstrate the value of ETA/AIO for network-centric applications
• Initial target: web server application
  – A single request may require multiple I/Os
  – Stresses system resources (especially OS resources)
  – Must multiplex thousands to tens of thousands of concurrent connections
• Web server architecture alternatives:
  – SPED (single-process event-driven)
  – MP (multi-process) or MT (multi-threaded)
  – Hybrid approach: AMPED (asymmetric multi-process event-driven)
  – The AIO model favors SPED for raw performance


The userver

• Open-source micro web server
• Extensive tracing and statistics facilities
• SPED model -- run one process per host CPU
• Previous support for Unix non-blocking socket I/O and event notification via Linux epoll()
• Modified to support socket AIO (eventually file AIO)
  – Generic AIO interface: can be mapped to a variety of underlying AIO APIs (DUSI, Linux AIO, etc.)
• Comparison: web server performance with and without the ETA engine
  – With standard Linux: processes share the file buffer cache, using sendfile() for zero-copy file transfer
  – With ETA: mmap() files into a shared address space


Web Server Event Scheduling

• Balance accepting new connections with processing of existing connections
• Scheduling:
  – Separate queues for accept(), read(), and write()/close() completion events
  – Process based on current queue lengths
• Early results with non-blocking I/O: accept-processing frequency matters

[Figure: throughput impact of the frequency of accepting new connections]


Evaluation Plans

• Goal: evaluate the approach and compare it to design alternatives
• Construct a functional prototype of the proposed stack (Linux):
  – Extend the existing ETA prototype's kernel-level interface to user level with OS bypass (DUSI)
  – Extend the userver to use socket AIO, with a mapping layer to DUSI
  – Evaluate on a 10 GigE-based client/server setup using a SPECweb-type workload
• Current ETA prototype: promising kernel-level micro-benchmark performance
• Expectation: ETA + AIO will show significantly higher scalability than the existing Linux network implementation


Proposed Stack/Comparison

[Figure: the two stacks under comparison. Conventional path: uServer (sockets) → Linux sockets library → Linux kernel TCP/UDP/RAW over IP → packet driver → network interfaces. Proposed path: uServer (AIO) → AIO mapping → ETA Direct User Sockets Interface (DUSI) → DTI data path into the ETA Packet Processing Engine, with a control path through the ETA kernel agent.]

Kernel-Level ETA Prototype

[Figure: transmit micro-benchmark. X-axis: transmit size in bytes (256, 512, 1024, 4096, 65536); left Y-axis: idle CPU fraction (0.00–1.20); right Y-axis: throughput in gigabits (0–3). Series: Linux idle CPU, ETA idle CPU, Linux rate, ETA rate.]

[Figure: receive micro-benchmark. X-axis: receive size in bytes (same range); left Y-axis: idle CPU fraction (0.00–1.20); right Y-axis: throughput in gigabits (0–2.5). Same four series.]


Evaluation Plans: Analyses and Comparisons

• Compare the proposed stack to a well-tuned conventional system: checksum offload, TCP segmentation offload, interrupt moderation (NAPI)
• Examine micro-architectural impacts: use VTune/OProfile to measure CPU, memory, and cache usage, interrupts, data copies, and context switches
• Comparison to TOE
• Extend the analysis to application domains beyond the web server: e.g., storage, transaction processing
• Port a highly scalable user-level threading package (UC Berkeley's Capriccio project) to ETA
  – Benefit: familiar threaded programming model with efficient "under the hood" AIO and OS bypass


Summary

• Proposed a technology strategy combining ETA and AIO to enable industry-standard platforms to scale to next-generation network performance
• Cost-performance, time-to-market, and flexibility advantages over alternative approaches
• Ethernet/TCP can approach the performance levels of today's SANs – toward a unified data center I/O fabric based on commodity hardware
• Status:
  – Promising initial experimental results for kernel-level ETA
  – Prototype implementation of the proposed stack nearly complete
  – Testing environment setup based on 10 GigE