Quadrics QsNetIII Adaptively Routed Network for HPC - Presentation

  • 8/14/2019 Quadrics QsNetIII Adaptively Routed Network for HPC - Presentation

    1/37

    Quadrics Ltd 28/8/2008

    QsNetIII: an Adaptively Routed Network for High Performance Computing

    Duncan Roweth, Quadrics Ltd

    Hot Interconnects August 2008

    Quadrics Background

    Develops interconnect products for the HPC market

    HPC Linux systems

    AlphaServer SC systems

    Quadrics is owned by the Finmeccanica group

    Quadrics was 12 years old in July

    QsNet Networks

    Multi-stage switch network

    Components

    Adapter: Elan

    Router: Elite

    Switches, cables

    Firmware, drivers, libraries

    Diagnostics, documentation

    HPC specific features

    Adaptive routing

    Hardware barrier & broadcast

    Communication Model

    [Figure: communication model diagram; labels: Process, Virtual Address]

    Quadrics Networks

    Elan1 / Elite1, 1994, Meiko Computing Surface 2

    Source chooses between pre-defined routes

    Elan3 / Elite3, 2000, first Quadrics product, QsNet

    First use of packet-by-packet adaptive routing

    Crosspoint router, x8

    Elan4 / Elite4, 2004, QsNetII

    Reduced latency, increased bandwidth

    Increased support for offloading collectives

    Elan5 / Elite5, 2008, QsNetIII

    General purpose crosspoint router, increased radix, x32

    Highly programmable adapter

    What is Adaptive Routing ?

    Switch networks typically provide many paths between any two points

    In an adaptively routed network

    routers make packet-by-packet decisions on the route to use, based on:

    Queue occupancy

    Channel usage

    Error rates and state

    Class of traffic

    Why is Adaptive Routing Important ?

    Most HPC networks are statically routed

    They use pre-determined paths between nodes

    Static routing can work well

    If traffic pattern is known in advance

    If traffic pattern is persistent

    If traffic pattern is uniform (i.e. application is load balanced)

    If there are no errors

    These conditions are not met by real codes on production HPC systems (see LLNL and Sandia results)

    Adaptive routing solves these problems

    Delivering significantly better aggregate bandwidths and worst case latencies on real systems running real codes

    Benefits of Adaptive Routing

    Bandwidth achieved when 1024 nodes all communicate at the same time

    Plots show the distribution of measured bandwidths

    System    Interconnect   Min   Max   Average
    Atlas     Infiniband      95   762       263
    Thunder   QsNetII        248   403       369

    Data from Lawrence Livermore National Lab, published at the Sonoma OpenFabrics workshop, April 2007
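
The spread between the best and worst measured bandwidth can be read straight off the figures above. The short sketch below only restates the slide's numbers; the dictionary layout and the `spread` helper are illustrative names, and no unit is assumed beyond what the slide reports.

```python
# Bandwidth figures copied from the slide (units as reported in the source).
results = {
    "Atlas (Infiniband)": {"min": 95, "max": 762, "avg": 263},
    "Thunder (QsNetII)": {"min": 248, "max": 403, "avg": 369},
}


def spread(stats):
    """Return (max/min ratio, min as a fraction of the average)."""
    return stats["max"] / stats["min"], stats["min"] / stats["avg"]


summary = {name: spread(stats) for name, stats in results.items()}

for name, (ratio, fraction) in summary.items():
    print(f"{name}: max/min = {ratio:.2f}, min/avg = {fraction:.2f}")
```

A max/min ratio close to 1 means every node sees roughly the same bandwidth, which is the behaviour the slide attributes to adaptive routing.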

    Benefits of Adaptive Routing

    Classic QsNetII all-to-all bandwidth scaling graph

    Adaptive Routing in QsNetIII

    More flexible than QsNetII

    Operates over arbitrary sets of links

    More opportunities to use the technique

    Higher radix switches

    Select a subset of lightly loaded output ports based on:

    Destination

    Link state, errors etc

    Number of pending acks (programmable threshold)

    Programmable algorithm for selecting from this subset: First free, next free, random
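
The two-step selection described on this slide (filter the output ports, then pick one) can be sketched in software as follows. This is a minimal model, not the Elite5 logic: the `Port` fields, the strict "fewer than the threshold" comparison, and the function names are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Port:
    index: int
    link_up: bool       # link state: False if the link is down or in error
    pending_acks: int   # packets sent on this port and not yet acknowledged


def candidate_ports(ports, reachable, ack_threshold):
    """Ports that reach the destination, are up, and are lightly loaded."""
    return [p for p in ports
            if p.index in reachable
            and p.link_up
            and p.pending_acks < ack_threshold]


def select_port(candidates, policy, last=-1, rng=random):
    """Pick one port from the candidate subset using the named policy."""
    if not candidates:
        return None  # nothing lightly loaded; real hardware would wait or fall back
    if policy == "first_free":
        return candidates[0]
    if policy == "next_free":
        # First candidate after the previously used port, wrapping around.
        later = [p for p in candidates if p.index > last]
        return later[0] if later else candidates[0]
    if policy == "random":
        return rng.choice(candidates)
    raise ValueError(f"unknown policy: {policy}")


ports = [Port(0, True, 5), Port(1, True, 0), Port(2, False, 0), Port(3, True, 1)]
subset = candidate_ports(ports, reachable={0, 1, 2, 3}, ack_threshold=4)
```

With a threshold of 4, port 0 is excluded as too heavily loaded and port 2 as down, so the policies choose between ports 1 and 3.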

    Adaptive Routing: standard case

    All top switches are equivalent, select one

    Adaptive routing selects a lightly loaded path

    Implementation of Fat Tree Networks

    Connect M N-way node switches by N M-way top switches

    In this case M = 16, N = 4
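
The counts implied by this construction can be worked through in a few lines. The sketch reads the slide as "M node switches, each N-way, joined by N top switches, each M-way", and assumes that an N-way node switch has N links down and N links up; the function name and dictionary keys are illustrative.

```python
def fat_tree(m, n):
    """Two-stage fat tree: m node switches (n-way) joined by n top switches (m-way)."""
    return {
        "node_switches": m,
        "top_switches": n,
        "end_points": m * n,          # each node switch has n links down
        "inter_switch_links": m * n,  # each node switch has one link up to every top switch
    }


small = fat_tree(16, 4)    # the case on this slide: M = 16, N = 4
large = fat_tree(8, 128)   # 8 128-way node switches, 128 8-way top switches
```

The second call uses the figures the deck gives for its 1024-way network, and yields 1024 end points under the same reading.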

    Adaptive Routing in the Top Switch

    If top switch radix ≤ router radix / 2

    i.e. 16 for Elite5, 2048-way networks

    Router provides multiple top switches

    Select which to use based on load

    Example: Traffic from A to B via routers 210 and 300 is blocked by traffic between 300 and 200.

    The router providing 300, 301, 302 and 303 can select a different path

    Adaptive Routing on the Final Hop

    Multiple connections to a node

    Switch can select a free path

    Reduces end-point contention

    Simple case is not optimal

    Spreading the connections

    Improves fault tolerance

    Reduces network contention

    Routing decision is made higher in the network

    Adaptive routing in the presence of errors

    In a production system with 1000s of links it is not uncommon for a small number to be broken until the next maintenance slot

    Adaptive routing minimises the impact

    Example:

    Link between routers 10 and 20 is broken

    Router 10 dynamically selects paths via 21, 22, 23, spreading the load.

    Reverse case: avoid sending to 10 via 20. Reset 20's links or update switches 11, 12, 13.
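
The effect of the broken link in this example can be illustrated with a toy model. The sketch below assumes router 10 has uplinks to routers 20 to 23 and shares packets round-robin over the healthy ones; the real hardware chooses per packet based on load, so this is only a simplified stand-in, and `spread_load` is a hypothetical helper.

```python
from collections import Counter
from itertools import cycle


def spread_load(uplinks, broken, packets):
    """Share packets round-robin over the uplinks that are not marked broken."""
    healthy = [u for u in uplinks if u not in broken]
    if not healthy:
        raise RuntimeError("no healthy uplinks")
    chooser = cycle(healthy)
    return Counter(next(chooser) for _ in range(packets))


# Router 10 has lost its link to router 20 and uses 21, 22 and 23 instead.
load = spread_load(uplinks=[20, 21, 22, 23], broken={20}, packets=300)
```

No traffic is sent towards router 20, and the remaining three links each carry a third of the packets instead of a quarter.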

    Small Packet Support

    Aim to get as close to line rate as possible with small packets

    For example:

    Small put

    32 byte packet

    Adapter has multiple packet engines

    Adapters support up to 64 outstanding packets per link

    Doubles if we use both links

    Switches provide 32 virtual channels per output link

    Prioritisation buffering on input to the router

    Barrier & Broadcast Support

    Switches broadcast over a range of output links

    Combine Acks / Nacks

    Contiguous in QsNetII

    Sparse in QsNetIII

    Barrier implementation

    Network conditional

    Broadcast release

    Elan5 Device Overview

    [Figure: Elan5 adapter block diagram. Labels: Fabric; Bridge; x8; PLL; EEPROM; Clocks; PCIe, 16 lanes; Host I/F; TLB; Cmd Launch; PCIe SERDES; Local Functions; Buffer Manager; Object Cache Tags; Free List; Local Memory (16K x 8 x 8 banks = 1MB ECC RAM); Ext i/f; SDRAM i/f; External cache; External DDRII; 2 x CX4/QsNetIII link; 7 x Packet Engine (16K inst cache, 9K data buffers)]

    2 QsNetIII links, 20 Gbit/s/direction after protocol

    PCIe, PCIe2 host interface

    Multiple packet engines

    512KB of high bandwidth on-chip local memory

    SDRAM interface to optional local memory

    Buffer manager, object cache

    Details in ISC Dresden paper

    Elite5 Device Overview

    64 x 32 crosspoint router

    Direct & buffered input from each link

    8K of input buffering per link

    32 virtual channels per link

    Physical layer DDR XAUI (6.25GHz)

    Adaptive routing

    Hardware barrier and broadcast

    Memory mapped stats & error counters accessed out-of-band

    QsNetIII Device Overview

    Semi-custom ASIC

    Manufacturing partners LSI / TSMC, G90 process

    High performance BGA package

                 Elan      Elite
    Frequency    500 MHz   312 MHz
    Package      672 pin   982 pin
    Power        < 17W     < 18W

    QsNetIII Implementation

    Node switch chassis

    128 links down to the nodes

    128 links up to the top switches

    Backplane connects 2 sets of cards

    Top switches

    256 links down to the node switches

    Range of system sizes:

    Ports   Radix   Per Chassis
      512       4            64
     1024       8            32
     2048      16            16
     4096      32             8

    [Figure: QsNetIII switch logical design]

    [Figure: QsNetIII switch implementation]
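
The system-size table on this slide appears to follow from two figures given above it: 128 links down per node switch chassis and 256 links down per top switch chassis. The sketch below reproduces the rows under that reading; the relationship (ports = radix x 128, top switches per chassis = 256 / radix) is inferred from the numbers, not stated in the source, and the names are illustrative.

```python
NODE_SWITCH_PORTS = 128   # links down to the nodes per node switch chassis
TOP_CHASSIS_LINKS = 256   # links down to the node switches per top switch chassis


def configuration(radix):
    """Network size and top switches per chassis for a given top switch radix."""
    return {
        "ports": radix * NODE_SWITCH_PORTS,
        "radix": radix,
        "per_chassis": TOP_CHASSIS_LINKS // radix,
    }


table = [configuration(radix) for radix in (4, 8, 16, 32)]

for row in table:
    print(f'{row["ports"]:>5} {row["radix"]:>5} {row["per_chassis"]:>5}')
```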

    QsNetIII Network: 1024-way

    Fat tree, constructed from 8 128-way node switches connected by 128 8-way top switches

    QsNetIII Implementation: Cables

    QSFP connectors throughout

    Copper cables (e.g. Gore) 1-10m

    Active copper cables (e.g. Gore), 8-20m

    Optical cables (e.g. Luxtera), 5-300m

    PVDF Plenum rated LSZH available as an option

    No longer Quadrics proprietary

    Likely usage:

    Short copper cables from nodes

    Optical cables between switches

    QsNetIII Fault Tolerance

    All of the QsNetII Features

    CRCs on every packet

    Automatic retransmission

    Redundant routes

    Adaptive routing avoids failed links

    Redundant, hot pluggable PSUs and fans

    + Line rate testing of each link as it comes up

    Switches generate CRPAT, CJPAT or PRBS packets

    Links are only added to the route tables when they are (a) up, (b) connect to the right place, and (c) can transfer data at full line rate without error.
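
The three admission conditions can be expressed as a single predicate. The sketch below is an illustration only: the `LinkReport` fields and the way the line-rate test result is represented are assumptions, not the QsNetIII management interface.

```python
from dataclasses import dataclass


@dataclass
class LinkReport:
    up: bool                   # (a) the link has come up
    expected_peer: str         # where the cabling plan says the link should go
    actual_peer: str           # what was discovered at the far end
    test_packets_sent: int     # pattern packets sent during the line rate test
    test_errors: int           # errors observed during the test
    achieved_fraction: float   # measured rate as a fraction of line rate


def admit_to_route_table(report: LinkReport) -> bool:
    """Apply conditions (a), (b) and (c) in order."""
    if not report.up:
        return False
    if report.actual_peer != report.expected_peer:
        return False
    return (report.test_packets_sent > 0
            and report.test_errors == 0
            and report.achieved_fraction >= 1.0)


good = LinkReport(True, "top3:p7", "top3:p7", 100000, 0, 1.0)
miswired = LinkReport(True, "top3:p7", "top4:p7", 100000, 0, 1.0)
noisy = LinkReport(True, "top3:p7", "top3:p7", 100000, 2, 1.0)
```

Only the first report passes; a miswired or error-prone link stays out of the route tables even though it is up.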

    Current Status

    Elite5 silicon in Bristol

    Elan5 at TSMC, first parts expected in 3-4 weeks

    Switch PCBs, chassis, backplane, controllers are working

    First adapter PCBs are ready

    PCI-Express x16, HP Blade, ExpressModule (Sun Blade)

    We are porting the QsNetII software

    Components at SC08 in Austin

    First customer shipment in Q1 of 2009

    Future Work

    QsNetIII hardware

    Low cost 32-way switch

    1024-way single chassis switch

    QsNetIII software

    General framework for optimised collectives

    Support for multiport networks - fat nodes have multiple connections to the same rail

    Ethernet firmware for the network adapter

    Conclusions

    Adaptive routing underwrites the scalability of HPC systems designed to run a single large application

    Adaptive routing has been a feature of QsNet systems since 2000

    QsNetIII offers significant enhancements over both QsNetII and competing products

    Thank you for listening

    Packet Format

    Packet size of up to 4K made up of 256 byte packet segment and continuations, 8 byte ACK
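
As a rough illustration of the segment arithmetic, the sketch below counts how many 256-byte segments a packet of a given size would need. It assumes that continuations are also 256 bytes and ignores any header or routing overhead, neither of which the slide specifies; the function name is hypothetical.

```python
MAX_PACKET = 4096   # "up to 4K"
SEGMENT = 256       # bytes per packet segment; continuation size is assumed equal


def segments_needed(payload_bytes):
    """Number of 256-byte segments (first segment plus continuations) for one packet."""
    if not 0 < payload_bytes <= MAX_PACKET:
        raise ValueError("payload must be between 1 and 4096 bytes")
    return -(-payload_bytes // SEGMENT)  # ceiling division


small_put = segments_needed(32)      # fits in the first segment
largest = segments_needed(4096)      # first segment plus 15 continuations
```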

    Impact of static routing on latency

    Data from Thunderbird cluster, Sandia National Lab

    Big increases in worst case latency with number of nodes

    Impact of static routing on latency

    Data from Thunderbird cluster, Sandia National Lab

    Big variation in worst case latency across a large job

    Software Model: Firmware & Drivers

    Base firmware in the ROMs

    Firmware modules loadable with the device driver

    Elan, OpenFabrics, 10GE Ethernet,

    Kernel modules

    elan5, elan, rms

    Device dependent library (libelan5)

    Device independent library (libelan)

    User libraries

    Software Model: Elan Libraries

    Point-to-point message passing

    One-sided put/get

    Transparent rail striping

    Optimised collectives

    Locks and atomic ops

    Global memory allocation

    QsNetIII Performance Summary

    Similar latencies to QsNetII

    The 1.3 to 2 microsecs of latency is mostly in the host PCI and memory system

    Higher issue rates

    Improved link utilisation on small transfers

    Higher bandwidths

    1.5 to 2.25 GB/sec/link depending on host interface

    Bi-directional host interface

    2 x improvement over QsNetII

    Broadcast and barrier in hardware

    Continued development of adaptive routing underwrites scaling to high node counts